High-quality, diverse, and extensive datasets are fundamental to improving machine learning model performance, and web scraping helps gather the data needed to develop more robust and generalizable models. Web scraping poses several legal challenges, including data protection, copyright, and contract law issues. Intellectual property concerns arise because website content, such as text, images, and data, is often copyrighted, and scraping without the copyright owner’s permission may lead to infringement claims. In addition, many websites prohibit scraping in their terms of service, and violating these terms can also result in legal action against the operators of web scrapers.
Data Protection Implications of Web Scraping
The General Data Protection Regulation (GDPR) defines personal data as "any information relating to an identified or identifiable natural person." Web scraping poses significant data protection challenges because it often collects personal data, including sensitive data, without individuals' knowledge or consent. In the EU, data protection laws limit the lawful use of web scraping. The GDPR defines processing as any operation on personal data, including collection, organization, storage, modification, retrieval, use, and dissemination. Since web scraping involves these activities, scraper operators qualify as data controllers. This means they must comply with controller obligations, including having a lawful basis for data processing and a legitimate purpose (e.g., training a model), and adhering to the principles of transparency, data minimization, storage limitation, accuracy, security, confidentiality, integrity, and accountability.
Under the GDPR, any processing of personal data must be justified by a legitimate legal basis. While the European AI Act aims to establish a comprehensive legal framework for the deployment and operation of AI systems, it currently does not provide a specific legal basis for the initial collection of personal data for training AI tools. Instead, the AI Act focuses on data processing within AI sandboxes and development environments, leaving the justification for initial data collection to be governed by the GDPR. Organizations using web scraping must therefore ensure a lawful basis under the GDPR for processing both ordinary and special categories of personal data, considering the following legal bases:
- Consent: Consent is unlikely to serve as a valid legal basis for web scraping, as it requires the informed and voluntary agreement of the individuals whose data is collected. Obtaining such consent is practically impossible in the context of automated, large-scale data collection, and the "black box" nature of AI further complicates consent for subsequent data processing.
- Contractual necessity: Processing based on contractual necessity requires a direct contractual relationship between the data controller and the data subject. In the context of web scraping, there is typically no such relationship with the individuals whose data is being collected. Consequently, this legal basis is generally inapplicable for justifying web scraping activities.
Further to this, when web scraping captures special categories of personal data, such as health information, additional constraints under Article 9 of the GDPR apply. These constraints include the necessity for explicit consent or meeting specific conditions, such as processing for substantial public interest or scientific research purposes.
Legitimate Interest of Web Scrapers in Data Collection
The EDPB's ChatGPT Task Force Report clearly points out that the collection of training data, the pre-processing of that data, and the training itself are distinct data processing purposes, each requiring its own established legal basis. This aligns with the CNIL’s guidance “Using the Legal Basis of Legitimate Interest to Develop an AI System”, which distinguishes between the phases of training and using AI systems with data scraped from the Internet and identifies the risks associated with each phase of the training and utilisation process.
The Task Force reminds us that the assessment of legitimate interest as a legal basis must consider three key criteria: (i) the existence of a legitimate interest, (ii) the necessity of the processing, ensuring the data is adequate, relevant, and limited to what is necessary, and (iii) the balancing of interests. The latter requires a careful evaluation of the fundamental rights and freedoms of data subjects against the controller’s legitimate interests, taking into account the reasonable expectations of data subjects. The Task Force suggests that safeguards could include technical measures such as defining precise collection criteria and ensuring that certain data categories or sources (such as public social media profiles) are excluded from data collection.
The Autoriteit Persoonsgegevens [“AP”], in its Guidelines, states that only legally protected interests qualify as legitimate interests, and purely commercial interests are insufficient [note that the CNIL says that “The commercial aim of developing an AI system is not inherently contradictory to using the legal basis of legitimate interest.”]. The AP further specifies that if an organization or a third party has an additional legally recognized interest, such as improving systems for fraud prevention or IT security, a legitimate interest may be established. The AP's position indicates that establishing a legitimate interest for web scraping is challenging and often impractical. In contrast, the EDPB's ChatGPT Task Force emphasizes the necessity of a case-by-case evaluation, considering both the collection and processing of "ordinary" personal data and special categories of personal data, for which additional safeguards apply.
The AP, the EDPB’s ChatGPT Task Force Report and the CNIL also recommend specific safeguards to support data controllers relying on web scraping techniques. These safeguards, as listed by the CNIL, include: (i) mandatory measures to ensure data minimization, such as setting precise criteria for data collection and applying filters to exclude unnecessary data (e.g., bank transactions, geolocation, sensitive data), and promptly deleting irrelevant data once identified (e.g., collecting pseudonyms on forums when only comment content is needed); and (ii) applying supplementary guarantees.
These supplementary guarantees may include: (i) excluding data collection from predefined sites containing sensitive information, such as pornographic sites, health forums, and social networks primarily used by minors, as well as genealogy sites or sites with extensive personal data; (ii) avoiding data from sites that explicitly prohibit scraping through robots.txt or ai.txt files; (iii) implementing a blacklist of individuals who object to data collection on specific websites, even before collection begins; (iv) ensuring individuals' right to object to data collection; (v) limiting data collection to freely accessible data and data that users have explicitly made public, thereby preventing loss of control over private information (e.g., excluding private social network posts); (vi) applying anonymization or pseudonymization measures immediately after collection to enhance data security; (vii) informing users about affected websites and data collection practices through web scraping notifications; (viii) preventing the cross-referencing of personal data with other identifiers unless necessary for developing the AI system; and (ix) registering contact details with the CNIL so that individuals can be informed and exercise their GDPR rights with the data controller.
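Several of these safeguards, such as honouring robots.txt exclusions, filtering out unnecessary fields, and pseudonymizing identifiers immediately after collection, can be implemented technically. The following is a minimal sketch of one possible approach in Python; the crawler name, blocked domains, and field names are hypothetical examples chosen for illustration, not requirements drawn from the CNIL or EDPB texts, and they are no substitute for the case-by-case legal assessment described above.

```python
# Minimal sketch, assuming a Python-based scraper. All names below
# (USER_AGENT, BLOCKED_DOMAINS, record fields) are hypothetical.
import hashlib
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-training-data-bot"  # hypothetical crawler identifier

# Predefined exclusion list of domains the controller has decided not to scrape
# (e.g., health forums or social networks primarily used by minors).
BLOCKED_DOMAINS = {"health-forum.example", "teen-network.example"}

def allowed_to_scrape(url: str) -> bool:
    """Check the domain blocklist and the site's robots.txt before fetching a page."""
    if urlparse(url).netloc in BLOCKED_DOMAINS:
        return False
    robots = RobotFileParser()
    robots.set_url(urljoin(url, "/robots.txt"))
    try:
        robots.read()
    except OSError:
        return False  # conservative choice: skip the site if robots.txt cannot be read
    return robots.can_fetch(USER_AGENT, url)

def minimise_and_pseudonymise(record: dict) -> dict:
    """Keep only the fields needed for the stated purpose and hash direct identifiers."""
    kept = {"comment": record.get("comment")}  # data minimisation: drop geolocation etc.
    if record.get("author"):
        # Pseudonymisation: store a hash instead of the user's handle.
        kept["author_pseudonym"] = hashlib.sha256(record["author"].encode("utf-8")).hexdigest()
    return kept
```

In practice, checks like these would sit inside the crawler's fetch loop, and records failing the minimization filter would be deleted rather than stored; whether such measures are sufficient still depends on the balancing test and the other safeguards discussed above.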
Conclusions
Web scraping is integral to the development of AI but poses significant legal challenges, particularly regarding data protection. While the legitimate interest of the controller or a third party can serve as a legal basis under the GDPR for data collection, provided that the interest is established and balanced against data subjects' rights, comprehensive safeguards must be implemented to mitigate legal risks on a case-by-case basis. The evolving regulatory landscape, including the AI Act, will likely provide further clarity on permissible data collection practices, but current uncertainties necessitate cautious and responsible data handling practices.
By Tamas Bereczki and Adam Liber, Partners, Provaris