Privacy Concerns in Web Scraping: a GDPR and Serbian Privacy Law Perspective

In both situations, the data sets almost always include personal data. Thus, AI developers should carefully consider their obligations under the GDPR as well as local privacy law, depending on what applies to them.

The million-dollar question

Privacy compliance, amongst other things, should be considered as soon as training data is collected. Even if publicly available data is used for training purposes (e.g. data published on YouTube), it does not mean that such data can be freely used. This is a common misconception amongst AI developers. Training on data sets that include personal data can take place only if the developers have a lawful basis for processing such data. Under the GDPR, this usually comes down to two options: consent or legitimate interest. While establishing either may appear challenging or even impossible, all AI providers training on personal data should consider privacy concerns very carefully.

The first EU guidance on the lawfulness of web scraping was set out in the EDPB's Report of the work undertaken by the ChatGPT Taskforce, issued in May 2024. This report indirectly supports innovation and indicates that legitimate interest might be the only possible legal basis for processing data obtained through web scraping, provided that certain safeguards are applied. A couple of months prior to this, in March 2024, the UK Information Commissioner's Office (the "ICO") issued a consultation exploring the legality of web scraping. It also concluded that legitimate interest is the only remaining lawful basis for web scraping.

According to both the ICO and the EDPB's report, legitimate interest might be considered a lawful basis for web scraping if the following criteria are met: (i) a legitimate interest exists; (ii) the processing is necessary, with personal data being adequate, relevant and limited to what is required for the purposes for which they are processed; and (iii) the interests are balanced.

Legitimate interest can serve as a lawful basis for data processing only if the interest is clearly defined and justified. Thus, when training AI models on web-scraped data, this interest should not be broadly defined or vague. According to the EDPB, it is necessary not only to recognise the interest, but also to concretely justify it in terms of the purpose for which the data is collected. If the intended use of the model cannot be clearly defined in advance, it becomes challenging to justify the interest.

Web scraping is often considered necessary due to the volume of data required to train these models. However, according to the EDPB, even when large data sets are used, it must be ensured that unnecessary data is not collected, especially data that is not relevant to the specific training purposes. Therefore, the EDPB emphasises the importance of applying measures during data collection and excluding certain types of data from the collection process, such as public social media profiles.

Balancing interests is perhaps the most complex criterion. It is necessary to assess whether the rights and freedoms of individuals outweigh the legitimate interests of the controller. Web scraping is an invisible processing activity, where people are often unaware that their data has been collected and processed in this way. This means that individuals may lose control over their data, which can compromise their privacy rights. This calls for the application of technical and organisational measures, such as data filtering during collection and excluding certain sources from the process.
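To make the two measures mentioned above more tangible, the following is a minimal Python sketch of source exclusion and data filtering at collection time. It is an illustration only, not a compliance tool: the domain blocklist, the regular expressions and the function names (`is_excluded`, `redact`, `filter_page`) are all hypothetical, and real detection of personal data would require considerably more than two patterns.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# Hypothetical blocklist of sources excluded from collection altogether
# (for example, social media platforms). Illustrative values only.
EXCLUDED_DOMAINS = {"facebook.com", "instagram.com", "linkedin.com"}

# Deliberately simple patterns; they will miss many forms of personal data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def is_excluded(url: str) -> bool:
    """Return True if the URL's host is an excluded domain or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in EXCLUDED_DOMAINS)


def redact(text: str) -> str:
    """Replace e-mail addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def filter_page(url: str, text: str) -> Optional[str]:
    """Drop pages from excluded sources; redact the rest before storage."""
    if is_excluded(url):
        return None
    return redact(text)
```

In this sketch, a page from an excluded source is discarded before anything is stored, while text from other sources is redacted first. Whether such measures are sufficient in a given case remains part of the balancing exercise described above.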

Special approach for special categories of personal data

A particular issue arises with the scraping of special categories of personal data, such as data related to health, political views and religious beliefs. Processing this data generally requires the explicit consent of the individual, unless another exception under Article 9(2) of the GDPR applies, which further complicates the legality of web scraping. Without clear and explicit consent, processing such data may directly violate the GDPR, which strictly demands respect for privacy and individual rights.

One example where this issue arises is search engine scraping. This is what Google engages in when it collects data for the sole purpose of indexing and enabling searches. Unique to search engines, this form of scraping may be considered justified in the context of the public's right to information, as recognised by the Charter of Fundamental Rights of the European Union, but each case must still be carefully evaluated to ensure that the fundamental rights of individuals are not violated. This exception can only be justified with strict protective measures and a clear framework that limits processing to what is necessary to achieve legitimate objectives.

But that's not all

One of the key elements in ensuring GDPR compliance, especially in the context of web scraping, is the obligation to inform individuals whose data is being collected, even when consent is not the basis for processing. Article 13 of the GDPR applies where data is collected directly from individuals and requires that they be informed at the time of collection. However, when data is collected through web scraping, which often involves gathering data from publicly available sources rather than from the individuals themselves, Article 14 of the GDPR (or Article 24 of the Serbian privacy law) applies. This article governs the obligation to inform individuals about the processing of their data, even when the processing is not immediately apparent or is indirect, as is the case with web scraping.

Depending on the AI product itself, the provider of AI systems might also have other obligations under the GDPR and/or local privacy laws. These obligations include conducting a legitimate interest assessment (LIA) and a data protection impact assessment (DPIA), possibly with the obligation to obtain prior approval from the competent authority (depending on the AI system itself).

Final remarks

In an era of rapid AI development and widespread digitalisation, the legality of web scraping has become a critical question for AI developers. Despite the potential for innovation that web scraping offers, it is all too often forgotten that every step in this process is deeply rooted in a complex legal framework designed to protect individuals' privacy. Given that the EU AI Act will become applicable to generative AI models within a year in the EU (or three years, depending on whether the models were placed on the market before 2 August 2025), or outside of the EU in specific situations, developers collecting data through web scraping should carefully analyse whether their products will be affected by this law. If so, their products and business operations must be promptly adjusted to reflect these developments.

By Marija Vlajkovic, Partner, and Marija Lukic, Senior Associate, Schoenherr
