
AI Training & Copyright Part 1: Text-and-data mining under court scrutiny

Recent Case Law by the Regional Court of Munich (“GEMA vs. OpenAI”)

Training of large AI models with copyright-protected material is a hotly contested area: dozens of infringement proceedings are already pending internationally. 

While the second-instance Higher Regional Court of Hamburg has recently applied a balanced interpretation of the text and data mining (TDM) exception (see our corresponding blog here), the discussion is ongoing:

A judgment by the first-instance Regional Court of Munich, published shortly before the Hamburg decision, has taken a controversial position on the question of whether copyright-relevant reproductions occur not only at the training level, but may, under specific circumstances, also be contained in the AI models themselves. 

The German music rights association GEMA had sued OpenAI for alleged copyright infringement in connection with the training of the large language models (LLMs) underlying ChatGPT on protected song lyrics. The Court followed GEMA’s theory that at least certain older versions of ChatGPT contained reproductions inside the model. It seems, however, that the court took an (inadmissible?) shortcut by equating “memorization” with a “reproduction” under copyright law. 

The evolving case law on AI training should be closely monitored by AI developers globally, also due to the upcoming compliance obligations under the AI Act (see our blog here).

The GEMA case: “What’s in the box?”

The GEMA case concerned the lyrics of nine famous German songs including “Atemlos” by Kristina Bach and “Über den Wolken” by Reinhard Mey. OpenAI had acquired these lyrics by crawling publicly accessible websites. In response to user prompts such as “What are the lyrics of ‘Atemlos’?”, ChatGPT provided a few (correct) lines of the lyrics, followed by further (incorrect) text containing hallucinations. 

It was undisputed between the parties that the song lyrics were used during training. There was, however, strong disagreement as to how the AI model operates. GEMA argued that the AI model functioned like a database and would always generate the same output based on the specific prompts. OpenAI disputed this understanding of AI models, arguing that the model does not store or copy specific training data, but rather reflects in its parameters what it has learned from the entire training data set.

The court described the process of AI model training as follows: In a pre-training phase, the training data is transformed into machine-readable format. During this phase the human-readable text is divided into so-called tokens, which are words or parts of words. Each token is then assigned a unique numerical index, converting the text into a numerical format that can be processed by a computer. During training, the semantic meaning of individual tokens and their proximity to each other is analysed. These learnings are expressed in a multidimensional space by a mathematical vector (the tokens are “vectorised”). The model’s algorithm determines which vectors and dimensions the model uses to capture the different meanings of a word; these form the model’s parameters. In an additional training phase, the model learns from selected prompts and ideal outputs to adjust its parameters for optimal responses.
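The pipeline the court describes (tokenise, index, vectorise) can be illustrated with a minimal Python sketch. The vocabulary, the toy tokeniser, and the embedding vectors below are invented for illustration only; a real LLM learns embeddings with thousands of dimensions from billions of tokens.

```python
# Toy illustration of the tokenise -> index -> vectorise steps the
# court describes. Vocabulary and vectors are invented for illustration.
vocab = {"atem": 0, "los": 1, "durch": 2, "die": 3, "nacht": 4}

# Each token index maps to a vector in a multidimensional space;
# proximity of vectors reflects learned semantic similarity.
embeddings = {
    0: [0.12, -0.40, 0.88],
    1: [0.10, -0.35, 0.91],
    2: [-0.77, 0.20, 0.05],
    3: [-0.70, 0.25, 0.10],
    4: [0.30, 0.60, -0.15],
}

def tokenize(text: str) -> list[str]:
    """Split text into sub-word tokens via greedy longest-prefix match."""
    tokens = []
    for word in text.lower().split():
        while word:
            for i in range(len(word), 0, -1):
                if word[:i] in vocab:
                    tokens.append(word[:i])
                    word = word[i:]
                    break
            else:
                word = ""  # skip unknown material in this toy example
    return tokens

tokens = tokenize("Atemlos durch die Nacht")
indices = [vocab[t] for t in tokens]        # numerical format
vectors = [embeddings[i] for i in indices]  # "vectorised" tokens
print(tokens)   # ['atem', 'los', 'durch', 'die', 'nacht']
print(indices)  # [0, 1, 2, 3, 4]
```

Note how the word “Atemlos” is split into the sub-word tokens “atem” and “los”, matching the court’s description of tokens as “words or parts of words”.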

After training, the model is able to respond to user prompts through a chatbot. The court explains that incoming inputs (the “prompts”) follow the same processing steps as during pre-training (meaning input is tokenised, vectorised, and passed through multiple layers of the model for contextual understanding). The court goes on to describe different techniques for generating output and explains that the model generates a token sequence that appears statistically plausible, e.g. because a certain sequence of tokens was included in different publicly accessible websites and therefore appeared in the training data multiple times.
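The court’s point that a sequence repeated across many websites yields a “statistically plausible” continuation can be illustrated with a toy next-token frequency model. The training lines and counts below are invented, and a real LLM uses learned parameters rather than explicit lookup tables; the sketch only shows why frequent training sequences tend to be regurgitated verbatim, while sampling can make outputs vary.

```python
import random
from collections import Counter, defaultdict

# Toy next-token model: count which token follows which in the
# "training data". A sequence appearing on many websites dominates
# the statistics, so greedy decoding tends to reproduce it verbatim.
training_data = [
    "atemlos durch die nacht",   # appears on many lyrics sites
    "atemlos durch die nacht",
    "atemlos durch die nacht",
    "atemlos durch die stadt",   # a rarer variant
]

counts = defaultdict(Counter)
for line in training_data:
    tokens = line.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def generate(prompt: str, length: int = 3, greedy: bool = True) -> str:
    """Extend the prompt token by token from the frequency statistics."""
    out = prompt.split()
    for _ in range(length):
        options = counts.get(out[-1])
        if not options:
            break
        if greedy:
            nxt = options.most_common(1)[0][0]  # most plausible token
        else:
            # sampling introduces variation; outputs need not repeat
            nxt = random.choices(list(options), weights=list(options.values()))[0]
        out.append(nxt)
    return " ".join(out)

print(generate("atemlos"))  # 'atemlos durch die nacht'
```

The greedy branch always returns the most frequent training sequence, whereas the sampling branch can occasionally emit the rarer variant; this distinction is relevant to the reproducibility question discussed below.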

Cutting the Gordian Knot

The court tried to avoid an in-depth technical analysis of the mechanics of ChatGPT by taking a shortcut. It argued that since 

  • it was known that the lyrics were used for training,
  • (some of) the lyrics could be extracted by prompts, and
  • this could not be a coincidence,

there must be a “copy” inside the model. In other words, it equated the AI model’s ability to “memorize” and “regurgitate” (parts of) its training data with a “reproduction” under copyright law. In this context, it should be noted that “memorization” is not a legal or clearly defined technical term, but rather a metaphor for a complex technical phenomenon (an AI system does not “recall” or “memorize” anything comparable to the human brain).
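In technical practice, “memorization” is usually probed rather than observed directly: the model is prompted with a prefix of a known training text, and the verbatim overlap between its continuation and the true continuation is measured. A minimal sketch of such a probe follows; the `model_complete` callable and the placeholder texts are hypothetical stand-ins, not a real API or real lyrics.

```python
# Hypothetical memorization probe. `model_complete` stands in for any
# chat/completion interface; all texts are invented placeholders.

def verbatim_overlap(generated: str, reference: str) -> float:
    """Fraction of reference tokens reproduced verbatim, in order, from the start."""
    gen, ref = generated.split(), reference.split()
    matched = 0
    for g, r in zip(gen, ref):
        if g != r:
            break
        matched += 1
    return matched / len(ref) if ref else 0.0

def memorization_score(model_complete, prefix: str, true_continuation: str) -> float:
    """Prompt with a prefix of the training text and score the continuation."""
    return verbatim_overlap(model_complete(prefix), true_continuation)

# A fake model that reproduces part of the reference and then drifts
# into invented text, mirroring the partially correct, partially
# hallucinated outputs described in the GEMA case:
fake_model = lambda prefix: "alpha beta gamma omega"
score = memorization_score(fake_model, "song title", "alpha beta gamma delta")
print(score)  # 0.75: three of four reference tokens reproduced, then divergence
```

A high score on such a probe shows that the model can re-create training text, but it does not by itself say anything about how or where that text is represented inside the model, which is exactly the gap in the court’s reasoning discussed below.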

Based on this assumption, the court concluded that potential copyright-relevant acts could occur in three phases (the three-phase split as such is based on a first-instance decision by the Hamburg Regional Court in the LAION case):

  • “Phase 1”: Copying training data and making it machine-readable (= creation of the training data) constitutes a reproduction under copyright law. The court concludes - in line with a broad consensus in legal literature - that these acts are permissible under the TDM exception unless rightsholders have declared a valid opt-out. The necessary format of such opt-outs (in particular its machine-readability) is subject to an ongoing debate. The court did not analyse this further, since it found an infringement in “Phase 2”.
  • “Phase 2”: Analyzing the training data and enriching it with meta-information (= the training itself) constitutes a reproduction and is not covered by the TDM (or any other) exception if memorization occurs.
  • “Phase 3”: Generation of output based on prompts. The court concluded that this constitutes a further reproduction and communication to the public if the output contains copyright protected works (or protectable parts of such works).

Smart shortcut or circular reasoning?

The court’s reasoning regarding “Phase 1” (i.e. that the TDM exception applies to AI training) is in line with the majority view; the few isolated voices in legal literature arguing for a different result have not seriously called it into question.

The “shortcut” taken by the court concerning “Phase 2”, instead of summoning technical experts, is much more controversial: the court’s assumption that the lyrics must be “stored” inside the model and that this constitutes a reproduction under copyright law seems to contradict both established copyright principles and the technical realities of machine learning:

  • The right of reproduction is harmonized within the EU (Art. 2 InfoSoc Directive). Case law of the German Federal Court of Justice has developed two requirements for this right: first, the copy has to be “fixated” onto something, e.g. as a file on a storage medium, and secondly, it must be possible to make this fixation perceivable by humans, such as by printing it on paper or displaying it on another human-readable medium. This seems, in principle, in line with CJEU case law (cf. Case C-310/17 - Levola Hengelo BV v Smilde Foods BV; Case C-5/08 - Infopaq).

    However, the court has recognized that the AI model does not work like a database. Instead, it assumes that the output is the result of a statistical analysis which is initiated based on a user prompt. There is no part of the model which “contains” the song lyrics. We could draw an analogy with the human brain: When reading or hearing song lyrics multiple times, we would be able to recall (parts of) the lyrics, but without “retaining” copies of the latter in the brain.

    It is questionable whether the mere ability to re-create (short) sections of lyrics is sufficient to assume that there is a “fixation” in the sense of a reproduction. Isn’t the mere fact that the output contained hallucinations an indication that the lyrics weren’t “fixated” inside the model? And even if there is a “fixation”: Would it really be possible to make it perceivable by humans (based on a prompt), or does the prompt only lead to a re-creation which was not previously fixated?

  • It’s also doubtful whether the court could have arrived at its conclusion without taking (technical) evidence. Even if one accepted the court’s theory in general: wouldn’t the assumption of a “fixation” require that the same output can be reproduced indefinitely? If, for instance, different tests led to different results (with varying degrees of hallucination), this would speak against the idea that the lyrics are embodied in the model in retrievable form.
  • The artificial distinction made by the court between Phase 1 (copying data for TDM is permissible) and Phase 2 (the training itself is a reproduction and TDM exception does not apply) could undermine the rationale of the AI Act, which acknowledges the necessity of TDM for AI training by an express reference. If the TDM exception is accepted for creating the training data but such training data then cannot be used for its sole function (= AI training) the applicability of the TDM exception is rendered meaningless.

The court’s reasoning is therefore based on an extremely broad interpretation of the reproduction right, essentially abandoning the requirement of any fixation. It also appears to rest on an oversimplification of the technical background. Notably, the court did not involve technical experts, whose input would arguably have been required to clarify the technical functioning of the model. If (parts of) the training data are reproduced in the output, this is an issue that should rather be addressed in Phase 3 (generation of output based on prompts). 

The big blank space

The court’s judgment is also nearly silent on a crucial point, namely applicable law and international jurisdiction: Under the principle of territoriality of copyright law - a universally recognized standard of international copyright law - copyright protection is limited to the territory of a specific country. Therefore, German copyright law would only be applicable to copyright-relevant acts within the territory of Germany. The court has, however, not made any determination as to where the (assumed) reproductions took place. 

Key takeaways and outlook

The decision by the Regional Court of Munich is an early, first-instance judgment attempting to cope with the (copyright) challenges raised by (generative) AI, more specifically AI training. It remains to be seen whether the decision will remain an outlier or whether other courts will follow its theory. The argument that verbatim extracts of certain (famous) training data in the outputs prove that copies of the input data are stored in the model does not necessarily reflect technical realities, and future cases will likely require further technical expertise.

Ultimately, however, these questions will likely have to be decided by the CJEU. First referral proceedings which might provide some clarity (Like Company) are already pending.

In the long run, the debate around the scope and prerequisites of the TDM exception may even turn out to be a footnote in the legal discussion around AI training: the number of TDM opt-outs will certainly increase, and it seems more and more likely that the EU legislator will consider changes to the copyright acquis (the evaluation of the Digital Single Market Directive is imminent). Answers provided by courts today may therefore soon prove inapplicable to a changing legal landscape.

Tags

ai, eu dsm directive, intellectual property