Large language models (LLMs) have transformed the technology landscape, enabling more advanced human-computer interactions. Yet, their rise brings significant ethical and security concerns, particularly regarding data privacy and their use in illicit activities.
Illegally Sourced Data
One of the critical issues with LLMs is the use of illegally sourced data for pre-training and fine-tuning. LLMs require vast datasets to function effectively, and to obtain them, many companies rely on questionable or outright illicit sources.
Data scraping
Data scraping—automatically extracting information from websites without permission—is a common method used to gather data in bulk. Though some of this data may be publicly available, its usage often violates the terms of service of websites or breaches data protection laws such as GDPR (General Data Protection Regulation) in Europe.
More recently, companies developing LLMs have come under fire for scraping content from websites like Reddit, Stack Overflow, and even news sites without proper attribution or licensing, raising ethical and legal concerns over how these models are trained.
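How this scraping is carried out matters legally as much as technically. Below is a minimal sketch, using only Python's standard library, of the kind of robots.txt check a responsible crawler performs before fetching a page; the URL and user-agent string are placeholders for illustration, and passing the check says nothing about a site's terms of service or data protection law.

```python
# Minimal sketch: consult a site's robots.txt before fetching a URL.
# The user agent and URLs below are illustrative placeholders.
from urllib import robotparser
from urllib.parse import urlsplit

def may_fetch(url: str, user_agent: str = "example-research-bot") -> bool:
    """Return True only if the site's robots.txt allows this user agent to fetch the URL."""
    parts = urlsplit(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()  # download and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

if __name__ == "__main__":
    target = "https://example.com/some/page"  # placeholder URL
    if may_fetch(target):
        print("robots.txt permits fetching; terms of service and GDPR still apply.")
    else:
        print("robots.txt disallows fetching this URL.")
```

Even this check is only a floor: robots.txt is advisory, and much of the criticism above concerns crawlers that ignore it or that comply technically while still violating licensing terms.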
Illegal markets
In some cases, companies buy data from third-party vendors, many of whom operate in legal grey areas or outright illegal markets. These datasets frequently include personal information obtained without user consent, such as contact details, financial records, and even medical data, exposing the individuals involved to serious privacy risks.
The scale of this issue has grown with the rise of “data brokers” who collect, aggregate, and sell sensitive data to the highest bidder, often without any transparency regarding how the data was originally acquired.
One high-profile example of this was the Facebook-Cambridge Analytica scandal, where data from millions of users was harvested without explicit consent and used to manipulate public opinion during election campaigns. Although this incident did not directly involve LLMs, it highlighted how personal data can be misused once it falls into the wrong hands.
Cases like these highlight the urgent need for greater transparency in data collection practices to prevent misuse.
Cybercrime and LLM Misuse
The misuse of LLMs in cybercrime has escalated significantly, with these models becoming powerful tools for malicious actors.
LLMs for phishing
One of the primary ways LLMs are being weaponized is in the creation of sophisticated phishing campaigns. Traditional phishing often relies on generic or poorly written messages; LLMs, by contrast, can generate highly realistic and personalized content, such as fake emails, chat interactions, and even entire customer-service bots that convincingly mimic legitimate services. This makes it far more difficult for victims to distinguish fraudulent communications from legitimate ones.
These AI-generated interactions can bypass traditional phishing detection systems that often rely on identifying spelling errors or awkward phrasing common in manual phishing attempts. LLMs can automatically tailor phishing messages by extracting specific information from victims’ social media profiles or public data, further enhancing the personalization and believability of the attacks.
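To see why such filters struggle, consider a toy caricature of the keyword-and-typo heuristics that simple spam and phishing filters lean on; the word lists below are illustrative placeholders, not a real detection ruleset. Fluent, personalized text generated by an LLM simply never trips these rules.

```python
# Toy caricature of the keyword/typo heuristics used by simple phishing filters.
# The word lists are illustrative placeholders, not a real ruleset.
COMMON_MISSPELLINGS = {"recieve", "acount", "verifcation", "securty", "pasword"}
URGENCY_PHRASES = {"act now", "account suspended", "verify immediately", "final warning"}

def crude_phishing_score(email_text: str) -> int:
    """Count naive red flags; fluent, LLM-written text typically scores zero here."""
    text = email_text.lower()
    score = sum(word in text.split() for word in COMMON_MISSPELLINGS)
    score += sum(phrase in text for phrase in URGENCY_PHRASES)
    return score

print(crude_phishing_score("Please recieve this notice: your acount is suspended, act now!"))  # 3
print(crude_phishing_score("Hi Dana, following up on yesterday's invoice discussion."))        # 0
```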
A significant example of this came in 2023 when a surge of AI-powered phishing kits was detected on dark web marketplaces. These kits, equipped with LLM-driven chatbots, enabled even low-skill cybercriminals to deploy sophisticated phishing campaigns at scale. These bots engage users in real time, responding dynamically to their questions or concerns, which makes the fraud more convincing. Attackers have used these AI-generated conversations to steal sensitive information like credit card details, passwords, and social security numbers. The ability to replicate human conversation patterns makes these bots especially dangerous in social engineering attacks.
Deepfake generation
Deepfakes, a combination of the terms “deep learning” and “fake,” are synthetic media where artificial intelligence (AI) is used to alter audio or video to convincingly mimic real people. While deepfakes started as a novel AI tool for entertainment and creativity, they have rapidly evolved into a significant threat in the world of cybercrime.
Deepfakes are created using advanced machine learning techniques, such as Generative Adversarial Networks (GANs), which involve two neural networks—one generating fake media and the other attempting to detect it. Over time, the generator improves, producing increasingly realistic content. LLMs enhance the text or audio component of these deepfakes, scripting dialogues that can be inserted into videos or synthesizing convincing speech patterns in real time. This combination of text generation and deep learning makes the end product incredibly convincing.
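To make the two-network dynamic concrete, here is a deliberately tiny sketch, assuming PyTorch, of the adversarial training loop: a generator learns to mimic a one-dimensional toy distribution while a discriminator learns to tell real samples from generated ones. It illustrates only the loop described above; the network sizes, learning rates, and data are arbitrary toy choices.

```python
# Toy GAN: a generator mimics a 1-D Gaussian, a discriminator separates real from fake.
# An educational sketch of the adversarial loop only; all hyperparameters are arbitrary.
import torch
import torch.nn as nn

LATENT_DIM = 8  # size of the random noise vector fed to the generator

generator = nn.Sequential(          # noise -> fake "sample" (a single scalar)
    nn.Linear(LATENT_DIM, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
discriminator = nn.Sequential(      # sample -> probability that it is real
    nn.Linear(1, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

for step in range(2000):
    real = 4.0 + 1.25 * torch.randn(64, 1)         # "real" data: Gaussian, mean 4, std 1.25
    fake = generator(torch.randn(64, LATENT_DIM))  # generated data

    # 1) Train the discriminator to separate real (label 1) from fake (label 0).
    d_opt.zero_grad()
    d_loss = loss_fn(discriminator(real), torch.ones(64, 1)) \
           + loss_fn(discriminator(fake.detach()), torch.zeros(64, 1))
    d_loss.backward()
    d_opt.step()

    # 2) Train the generator to fool the discriminator (its fakes labelled "real").
    g_opt.zero_grad()
    g_loss = loss_fn(discriminator(fake), torch.ones(64, 1))
    g_loss.backward()
    g_opt.step()
```

The generator's samples drift toward the real distribution precisely because the discriminator keeps punishing the ones it can still tell apart, and that same pressure is what makes production-scale synthetic media progressively harder to spot.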
Criminals can use deepfake technology to impersonate corporate executives or public figures, a scheme often referred to as “CEO fraud” or “business email compromise” (BEC).
In one reported case, a deepfake audio impersonating a CEO was used to convince an employee to transfer $243,000 to a fraudulent bank account. The deepfake audio was so accurate that the employee believed they were receiving direct instructions from their superior.
Another emerging threat is the use of deepfakes in blackmail or extortion schemes. Attackers may create fake videos or audio recordings of individuals in compromising situations, threatening to release them unless a ransom is paid.
The Broader Impact
These developments underscore the double-edged nature of LLMs. While they offer incredible advancements in AI-driven services, their misuse raises pressing questions about data ethics and security. If left unchecked, the proliferation of LLMs in both legitimate and criminal spheres could result in widespread harm, from personal privacy violations to large-scale cyberattacks.





