The State of Web Scraping in the EU

Data Protection Implications of Web Scraping

The General Data Protection Regulation (GDPR) defines personal data as "any information relating to an identified or identifiable natural person." Web scraping poses significant data protection challenges because it often collects personal data, including sensitive data, without individuals' knowledge or consent. In the EU, data protection laws limit the legal use of web scraping. The GDPR defines processing as any operation on personal data, including collection, organization, storage, modification, retrieval, use, and dissemination. Since web scraping involves these activities, operators that scrape personal data are considered data controllers. This means they must comply with controller obligations, including having a lawful basis for data processing and a legitimate purpose (e.g., training a model), and adhering to the principles of transparency, data minimization, storage limitation, accuracy, security, confidentiality, integrity, and accountability.

Under the GDPR, any processing of personal data must be justified by a legitimate legal basis. While the European AI Act aims to establish a comprehensive legal framework for the deployment and operation of AI systems, it currently does not provide a specific legal basis for the initial collection of personal data for training AI tools. Instead, the AI Act focuses on data processing within AI sandboxes and development environments, leaving the justification for initial data collection to be governed by the GDPR. Organizations using web scraping therefore must ensure a lawful basis under GDPR for processing both ordinary and special categories of personal data, considering various legal bases:

Consent: Consent is unlikely to serve as a valid legal basis for web scraping, as it requires the informed and voluntary agreement of the individuals whose data is collected. Obtaining such consent is practically impossible in the context of automated and large-scale data collection. The "black box" nature of AI further complicates the issue of consent for subsequent data processing, because individuals cannot be told precisely how their data will be used.
Contractual necessity: Processing based on contractual necessity requires a direct contractual relationship between the data controller and the data subject. In the context of web scraping, there is typically no such relationship with the individuals whose data is being collected. Consequently, this legal basis is generally inapplicable for justifying web scraping activities.

Further to this, when web scraping captures special categories of personal data, such as health information, additional constraints under Article 9 of the GDPR apply. These constraints include the necessity for explicit consent or meeting specific conditions, such as processing for substantial public interest or scientific research purposes.

Legitimate Interest of Web Scrapers in Data Collection

The EDPB's ChatGPT Task Force Report clearly points out that the collection of training data, the pre-processing of the data, and training are different data processing purposes, each of which requires its own established legal basis. This aligns with the CNIL’s “Using the Legal Basis of Legitimate Interest to Develop an AI System”, which differentiates between the phases of training and using AI systems with data scraped from the Internet and identifies risks in each phase of the training and utilisation process.

The Task Force reminds us that the assessment of legitimate interest as a legal basis must consider three key criteria: (i) the existence of a legitimate interest, (ii) the necessity of processing, ensuring the data is adequate, relevant, and limited to what is necessary, and (iii) the balancing of interests. The third criterion requires a careful evaluation of the fundamental rights and freedoms of data subjects against the controller’s legitimate interests, taking into account the reasonable expectations of data subjects. The Task Force suggests that safeguards could include technical measures like defining precise collection criteria and ensuring that certain data categories or sources (such as public social media profiles) are excluded from data collection.

The Autoriteit Persoonsgegevens [“AP”], in its Guidelines, states that only legally protected interests qualify as legitimate interests, and that purely commercial interests are insufficient [note that the CNIL says that “The commercial aim of developing an AI system is not inherently contradictory to using the legal basis of legitimate interest.”]. The AP also specifies that if an organization or a third party has an additional legally recognized interest, such as improving systems for fraud prevention or IT security, then a legitimate interest may be established. The AP's position indicates that establishing a legitimate interest for web scraping is challenging and often impractical. In contrast, the EDPB's ChatGPT Task Force emphasizes the necessity of a case-by-case evaluation, considering both the collection and processing of "ordinary" personal data and special categories of personal data, for which additional safeguards apply.

The AP, the EDPB’s ChatGPT Task Force Report, and the CNIL also recommend specific safeguards that can weigh in favour of a data controller relying on web scraping techniques. These safeguards, as listed by the CNIL, include: (i) mandatory measures to ensure data minimization, such as setting precise criteria for data collection, applying filters to exclude unnecessary data (e.g., bank transactions, geolocation, sensitive data), and promptly deleting irrelevant data once identified (e.g., pseudonyms collected on forums when only comment content is needed); and (ii) applying supplementary guarantees.

These supplementary guarantees may be: (i) excluding data collection from predefined sites with sensitive information, such as pornographic sites, health forums, and social networks primarily used by minors, as well as genealogy sites or those with extensive personal data; (ii) avoiding data from sites that explicitly prohibit scraping through robots.txt or ai.txt files; (iii) implementing a blacklist for individuals who object to data collection on specific websites, even before collection begins; (iv) ensuring individuals' rights to object to data collection; (v) limiting data collection to freely accessible data and explicitly public user data, thereby preventing loss of control over private information (e.g., excluding private social network posts); (vi) applying anonymization or pseudonymization measures immediately after collection to enhance data security; (vii) informing users about affected websites and data collection practices through web scraping notifications; (viii) preventing cross-referencing of personal data with other identifiers unless necessary for developing AI systems; and (ix) registering contact details with the CNIL to inform individuals and enable them to exercise their GDPR rights with the data controller.
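Several of the technical safeguards above can be illustrated with a short Python sketch. It is an illustration only, not a compliance tool or a method prescribed by the CNIL, the AP, or the EDPB. The domain names, field names, user agent, and helper functions are all hypothetical assumptions. The sketch covers source exclusion, a robots.txt check, field-level minimization, and pseudonymization with a keyed hash. It does not handle ai.txt files, opt-out lists, or notification duties.

```python
import hashlib
import hmac
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# All names and values below are hypothetical placeholders.
EXCLUDED_DOMAINS = {"health-forum.example", "genealogy.example"}
NEEDED_FIELDS = {"comment_text"}


def is_excluded_source(url: str) -> bool:
    """True if the URL's host is an excluded domain or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def allowed_by_robots(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against the text of an already-downloaded robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def minimize(record: dict) -> dict:
    """Keep only the fields needed for the stated purpose; drop the rest."""
    return {k: v for k, v in record.items() if k in NEEDED_FIELDS}


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 hash (HMAC)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Example: decide whether to collect from a page, then minimize the record.
robots_txt = "User-agent: *\nDisallow: /private/\n"
url = "https://forum.example/threads/1"
raw = {"pseudonym": "user42", "comment_text": "Great post.", "geolocation": "47.49,19.04"}

if not is_excluded_source(url) and allowed_by_robots(robots_txt, "ExampleBot", url):
    record = minimize(raw)  # only the comment text is retained
```

A keyed hash of this kind is pseudonymization rather than anonymization, so the output would still be personal data under the GDPR. Whether any of these measures suffices for the balancing test remains a case-by-case legal assessment.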

Conclusions

Web scraping is integral to the development of AI but poses significant legal challenges, particularly regarding data protection. While the legitimate interest of the controller or a third party can serve as a legal basis for data collection under the GDPR, provided that the interest is established and balanced against data subject rights, comprehensive safeguards must be implemented to mitigate legal risks on a case-by-case basis. The evolving regulatory landscape, including the AI Act, will likely provide further clarity on permissible data collection practices, but current uncertainties call for cautious and responsible data handling.

Originally published by IAPP.

By Tamas Bereczki and Adam Liber, Partners, Provaris
