Freshfields TQ

Technology quotient - the ability of an individual, team or organization to harness the power of technology

Uncharted territory? Opportunities, boundaries and uncertainties of text and data mining


AI systems, in particular foundation models used for generative AI applications, require training. Conducting such training requires large quantities of data and content which may include copyrighted materials such as texts and images. The process of extracting large quantities of data for analysis and assessment, particularly from the Internet, is known as ‘text and data mining’ (or TDM).

While certain TDM activities could be deemed copyright infringement, the EU, US and UK legal frameworks provide for safe harbour provisions which may shield AI providers from liability. Legislators face the difficult task of balancing the interests of AI developers in obtaining high quality training data for AI systems with the interest of rightsholders that their works are not used without their consent or at least without appropriate compensation. The application of existing laws is unclear and will probably only be clarified in court. Several cases are working their way through the courts in the US and the UK, including a class action against Stability AI, lawsuits brought by Getty Images against Stability AI and a lawsuit challenging GitHub Copilot.

Against this backdrop, AI providers must carefully assess whether their TDM activities could be deemed infringing or otherwise require a licence from rightsholders. Similarly, rightsholders need to ensure that they actively manage their content and make use of their “opt-out” rights (reservations) under EU copyright law if they want to prevent AI providers from relying on the EU’s TDM-specific copyright exemption. However, this is difficult in practice since a common technical standard for valid opt-outs is yet to be developed.

AI training data and the relevance of text and data mining

The rapid development of AI systems has created a variety of new and challenging questions under copyright law.

Some of these questions relate to the machine-learning algorithms and datasets used by these AI systems, the collection and use of data, and the output from such AI systems. While there are different ways to develop and train AI systems, many relevant systems require large quantities of training data. This data, which may include copyright protected material such as still images and text, is generally only necessary for training purposes and can be deleted afterwards but is often stored – and sometimes altered – during the training process. The easiest and most cost-efficient way for AI developers to gather training data is to “scrape” it from the Internet. This is where TDM comes into play.

What is TDM?

According to EU copyright law, TDM means any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations. UK and US law do not have similar definitions.

This broad definition is likely to encompass many, if not most, of the techniques currently applied by AI providers to acquire data to train their generative AI systems. This inevitably raises the question: to what extent is TDM subject to restrictions under copyright law?

When can TDM become a copyright issue?

Generally, ideas, facts, data or, more broadly, information as such are not, strictly speaking, subject to copyright protection, as copyright aims to protect original expressions. Moreover, copyright generally only reserves certain acts (such as copying, broadcasting, making accessible to the public) to the author. Access to a work as such and gathering information from it are not among the rights exclusively reserved to the author.

An argument could be made that the mere assessment of information underlying a copyright work should not be considered an act subject to copyright laws, as there is no copyright which would allow rightsholders to prevent users from viewing and assessing accessible works. At least under the current copyright framework shared by most jurisdictions, there is arguably no apparent reason why the result should be different if an AI system, instead of a human, simply “looks” at a copyright protected work.

However, TDM as used to train AI systems will not only require “looking” at works but also typically extracting and reproducing – at least parts of – works. The reproduction of works is subject to existing copyright laws. Absent specific exceptions, a reproduction of copyright protected works is subject to the rightsholder’s consent.

What copyright exemptions apply to TDM?

The balancing of interests between AI developers and rightsholders is increasingly subject to both legislative discussions and court proceedings. While the EU has adopted a specific framework to regulate TDM under the DSM-Directive, AI developers in the US try to rely on the seemingly more flexible “fair use” exception.

EU-specific TDM exemption

In response to the growing importance of TDM in developing new technologies, the EU introduced two mandatory copyright exceptions in 2019, allowing limited TDM without rightsholders’ consent.

Art. 3 of the DSM-Directive allows TDM for scientific purposes and Art. 4 of the DSM-Directive allows TDM for commercial purposes. AI developers generally use TDM for commercial purposes, so in practice Art. 4 will be most relevant to AI-related TDM. To rely on Art. 4 for TDM of copyrighted content:

  • the content in question must have been lawfully accessed;
  • reproductions and extractions may only be retained for as long as is necessary for the purposes of the TDM in question; and
  • rightsholders must not have made reservations with respect to TDM rights (through machine-readable means in the case of online content).

This third requirement means that in practice Art. 4 provides for an opt-out mechanism. If rightsholders have not opted out by making an appropriate reservation, AI providers can rely on the specific TDM copyright exception to justify their extractions and reproductions of works. This opt-out requirement appears simple at first glance but leaves open several questions on how it will work in practice:

  • Art. 4 of the DSM-Directive requires rightsholders to opt out in a machine-readable format, which, to date, very few have done. This is particularly true for rightsholders based outside of the EU and for material which was uploaded before the DSM-Directive was adopted.
  • It is also unclear what “machine-readable” format means. While some argue that a simple written reservation in standard terms and conditions is sufficient to meet this requirement, others believe that a truly machine-readable opt-out is necessary. This problem is made more challenging by the lack of universally accepted standards for machine-readable reservations, despite standardisation activities by organisations such as the W3C (“TDM Reservation Protocol”), the IPTC (“RightsML Standard 2.0”) and the C2PA, whose specifications are intended to be used as traceable metadata.
  • While the current situation with relatively few machine-readable opt-outs favours AI developers, this may change in the future, and it remains to be seen whether Art. 4 can guarantee an adequate balancing of interests between rightsholders and TDM users in the long run. If more and more rightsholders opt out, the ability of AI developers to rely on Art. 4 DSM-Directive may narrow considerably. As AI developers bear the burden of proof for the non-existence of “opt-outs”, they should document their efforts to scan for such declarations appropriately.
  • It also remains to be seen how the practical consequences of a violation will be handled by courts (e.g., if a valid “opt-out” was not observed). This includes questions such as the calculation of damages and how the infringement may be rectified. For technical reasons, the “extraction” of “learned” data from an AI system might prove just as challenging as removing a stolen egg from a cake, making it difficult to rectify an infringement.
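
To illustrate what scanning for and documenting machine-readable reservations could look like, the following is a minimal sketch in Python. It assumes that a rightsholder signals a reservation through HTTP response headers named `tdm-reservation` (value `1` for reserved, `0` for not reserved) and `tdm-policy`, as we understand the W3C TDM Reservation Protocol to propose. The names `OptOutCheck` and `check_tdm_headers` are hypothetical and chosen for illustration only. The sketch inspects headers that have already been retrieved. It does not cover other signals (such as website terms or embedded metadata), and it says nothing about whether a given reservation is legally valid.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class OptOutCheck:
    """One documented check of a URL for a machine-readable TDM reservation."""
    url: str
    checked_at: str            # ISO 8601 timestamp (UTC) of the check
    reserved: Optional[bool]   # True / False, or None if no signal was found
    policy_url: Optional[str]  # optional link to the rightsholder's TDM policy


def check_tdm_headers(
    url: str,
    headers: Mapping[str, str],
    now: Optional[datetime] = None,
) -> OptOutCheck:
    """Interpret already-fetched HTTP response headers for a TDM reservation.

    Assumption: header names and values follow the W3C TDM Reservation
    Protocol ("tdm-reservation": "1" or "0", plus optional "tdm-policy").
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}

    raw = lowered.get("tdm-reservation")
    if raw == "1":
        reserved: Optional[bool] = True
    elif raw == "0":
        reserved = False
    else:
        reserved = None  # header absent or unrecognised: no signal either way

    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return OptOutCheck(
        url=url,
        checked_at=timestamp,
        reserved=reserved,
        policy_url=lowered.get("tdm-policy"),
    )


# Example: a response that carries a reservation and a policy link.
record = check_tdm_headers(
    "https://example.com/article",
    {"TDM-Reservation": "1", "TDM-Policy": "https://example.com/tdm-policy.json"},
)
```

Storing each `OptOutCheck` record alongside the crawled material would give an AI developer a timestamped trail of what was checked and what was found. How a missing signal (`reserved` is `None`) should be treated is a legal judgement that the sketch deliberately leaves open.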

EU-specific temporary reproduction exemption

Another EU law copyright exception which could potentially exempt TDM from copyright infringement is Art. 5(1) of the InfoSoc Directive which creates an exemption for temporary reproductions of copyrighted material. This exception was intended to enable Internet browsing or caching but is technology-neutral and could therefore in principle also provide an exemption for TDM.

However, the Art. 5(1) exception is subject to strict conditions and the CJEU has held that it can only apply to an otherwise infringing act if:

  • the act is temporary;
  • the act is transient or incidental;
  • the act is an integral and essential part of a technological process;
  • the act’s sole purpose is to enable a transmission in a network between third parties by an intermediary or a lawful use of a work or protected subject-matter; and
  • the act has no independent economic significance.

It remains unclear whether TDM can meet all of these conditions, in particular the requirement of a lack of independent economic significance. Considering that the CJEU generally construes copyright exceptions narrowly in line with international standards, AI providers must carefully assess whether this exception could serve as a robust defence before relying on it for TDM of copyrighted material in the EU.

US fair use exception

In contrast to the EU, US copyright law does not provide for specific TDM exceptions but does allow for “fair use” of copyrighted work under Sec. 107 of the US Copyright Act.

Whether a use of copyrighted work is fair use is a technical and fact-specific question, which the US courts consider using the following non-exhaustive factors:

  • the purpose and character of the use, including whether the use is of a commercial nature or is for non-profit educational purposes (also known as the “transformative use factor”);
  • the nature of the copyrighted work;
  • the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
  • the effect of the use upon the potential market for or value of the copyrighted work.

Whether AI developers’ use of TDM to train their models is indeed fair use is being actively argued before US courts. The outcome of these cases will further shape the boundaries of the fair use exception in the TDM setting. A recent decision by the US Supreme Court in the case Warhol v. Goldsmith – as summarised in this blog – does not address the issue directly but may provide some hints at the “direction of travel” of fair use of copyright work for AI development.

There has also been debate in the US on whether the Copyright Act would preclude claims that TDM breaches terms of service or licence terms if the TDM is otherwise permissible under copyright law. Defendants have previously succeeded in having breach of contract claims dismissed on this basis, but the US courts have not always taken a clear position. They may allow a breach of contract claim where extra elements, such as a contractual promise to pay, make it qualitatively different from a copyright infringement claim. Whether the Copyright Act does in fact pre-empt breach of contract claims therefore remains an open question which we will need to keep under close scrutiny going forward.

Outlook and takeaways

There is considerable uncertainty around the legality of using publicly available copyrighted content as training data for AI systems. Clarity will be provided by future court decisions, but this will take time.

It also remains to be seen whether rightsholders may seek to challenge TDM outside of copyright law. For example, if a rightsholder’s copyrighted work or an output by AI contains personal information, using that work for developing AI systems may amount to unlawful processing of personal data under EU and UK data privacy law.

The implications of the proposed EU AI Act are also uncertain. Even though the proposed AI Act regime does not focus on copyright law, the latest proposal includes a transparency obligation relating to the use of copyright protected material in training data (see our corresponding blog here), which should be closely monitored by AI developers and may give rightsholders better visibility on how their works are used.

Considering the ubiquitous nature of the Internet, it does not help that there are conceptual differences between major jurisdictions, including the EU, US and UK, on how to approach TDM. In this context, it may be useful to look for commonalities: despite local differences, all legislative activity in this area, as well as the “fair use” exception, should comply with standards of international copyright law. Here, the three-step test pursuant to the Revised Berne Convention and TRIPs serves as a limiting factor for the application of copyright exceptions and only allows them if they:

  • are confined to certain, special cases;
  • do not interfere with the normal exploitation of a work; and
  • do not unreasonably prejudice the legitimate interests of the author.

This internationally established principle may be helpful for legislators and courts as guidance to establish a somewhat level playing field, and an at least similar international approach to TDM.