If your AI remembers too much, it might break the law: Germany’s GEMA v. OpenAI ruling and what life sciences companies should know about AI copyright risk
- daniel.schuppmann

If your company is building its own large language model (LLM) for applications like drug discovery, biomarker identification, or clinical trial optimization, this case should be on your radar. A recent ruling by the Munich Regional Court (LG München, judgment of 11 November 2025, 42 O 14139/24, GEMA v. OpenAI), while not yet legally binding, sends a clear signal: training LLMs on protected works constitutes copyright infringement under German law unless narrow statutory exceptions apply. This covers both the use of the works during model training and their (re)appearance in chatbot outputs.
The case involved a claim by GEMA, the German music rights society, against OpenAI. At issue: ChatGPT had reproduced well-known German song lyrics in response to basic user prompts. The court concluded that OpenAI had trained its models on copyrighted lyrics without obtaining permission from the rights holders. That, it held, amounts to copyright infringement under German law. Crucially, the court (1) rejected the assumption that such training is covered by the European text and data mining exception, and (2) dismissed the technical argument that an infringing reproduction requires a clearly identifiable dataset within the model. Both are positions many companies had relied on, with broad support in the academic and technical communities. Rather, the court found that even where works are broken down into numerical parameters and distributed across the model, a reproducible presence, i.e., the ability to reconstruct protected content from prompts, still qualifies as a tangible fixation under copyright law. Or, put less formally: if it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck.
Why should life sciences companies care? Because LLMs trained on proprietary scientific literature, legacy clinical data, or health records may memorize and later reproduce protected content. If so, developers could be liable for copyright infringement. In other words: you should think about the legality of your training data, the boundaries of text and data mining (TDM), and the potential liability when your LLMs "memorize" third-party content.
Let’s break it down in a bit more detail: what the court said, why it matters, and what you should do now.
I. Applicable law: why German copyright rules apply (and why “fair use” won’t save you)
The court applied German law based on the Rome II Regulation (Art. 8(1)), which governs non-contractual obligations arising from the infringement of intellectual property rights. It relied on the “country of protection” principle: the applicable law is that of the country where protection is sought. Because GEMA claimed infringement in Germany, German law applied. For alleged violations of personality rights, which are excluded from Rome II, the court turned to German conflict-of-law rules which also led to the application of German law. This also explains why arguments based on the U.S. “fair use” doctrine are irrelevant in the European context. Fair use—broadly interpreted in the U.S. to permit certain transformative uses without permission—has no direct counterpart in the EU. European copyright law is based on closed, codified exceptions. Courts interpret them narrowly. The Munich ruling follows this tradition. For globally operating life sciences companies developing AI systems across multiple jurisdictions, this divergence introduces significant legal complexity. A model trained in the U.S. under a fair use theory may still trigger infringement claims in Europe.
II. Core finding 1: “memorization” constitutes reproduction
At the heart of the case is the legal classification of what happens during model training. The court held: “The copyrighted lyrics were reproduced within the models and are embedded in the model parameters in a way that allows their output. This qualifies as a reproduction under §16 UrhG [= German Act on Copyright and Related Rights].”
Crucially, the court rejected OpenAI’s argument that model outputs are the result of probabilistic synthesis. Even if the information is stored as probabilities or vectors, the fact that the copyright-protected work can be reproduced by the model in response to simple prompts is sufficient to qualify the internal storage as a reproduction. The court also emphasized that it is irrelevant whether the memorization is intentional or incidental. This confirms that storing protected content in a model—even without access to the original format—is enough to trigger copyright liability if that content can be reproduced.
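The kind of "reproducible presence" the court describes can be probed empirically. The sketch below is a toy illustration, not the court's test or any standard tool: `memorization_probe` feeds a model the opening words of a known work and checks whether the continuation comes back verbatim. The `toy_generate` stand-in, the function names, and the 8-word window are all assumptions chosen for illustration; a real audit would call your model's actual inference API.

```python
def memorization_probe(generate, work, prefix_words=10, n=8):
    """Check whether a text generator completes a known work verbatim.
    `generate` is any prompt -> text callable (a hypothetical stand-in
    for a model's inference API)."""
    words = work.split()
    prompt = " ".join(words[:prefix_words])
    # The next n words of the original work after the prompt.
    expected = " ".join(words[prefix_words:prefix_words + n])
    return expected.lower() in generate(prompt).lower()

# Stand-in "model" that has memorized one text verbatim.
WORK = ("the rain in spain stays mainly in the plain while travelers "
        "wander over hills and valleys seeking shelter from the storm")

def toy_generate(prompt):
    # Echoes the memorized continuation if the prompt matches, else deflects.
    return WORK.split(prompt, 1)[1] if prompt in WORK else "no idea"

print(memorization_probe(toy_generate, WORK))  # → True: verbatim recall
```

Run against a model that genuinely paraphrases, the probe returns False; a True result on protected text is exactly the "simple prompt" scenario the court found actionable.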
III. Core finding 2: Text and data mining does not justify AI training
Text and data mining (TDM) refers to the automated analysis of large volumes of digital content—like scientific articles or clinical records—to identify patterns, trends or relationships. It allows computers to “read” and extract insights from texts in big chunks, without human researchers having to go through each document manually. In the EU, TDM is permitted under specific conditions, but rights holders can opt out and prohibit such use if they expressly reserve their rights.
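The analyze-but-not-absorb distinction can be made concrete with a minimal sketch. Assuming a corpus of licensed abstracts (the texts and function name below are invented for illustration), a TDM step derives aggregate statistics such as term frequencies; the documents themselves are not retained in the result:

```python
import re
from collections import Counter

def tdm_term_frequencies(documents):
    """Toy TDM step: derive aggregate term statistics from a corpus.
    The output is a frequency table, not a copy of the documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

# Hypothetical mini-corpus standing in for licensed abstracts.
corpus = [
    "Biomarker levels correlated with treatment response.",
    "Treatment response varied across biomarker subgroups.",
]
stats = tdm_term_frequencies(corpus)
print(stats["treatment"])  # → 2: an aggregate statistic, not the source text
```

The Munich court's point, in these terms: the TDM exception covers producing something like `stats`; it does not cover a model whose parameters can regenerate `corpus` verbatim.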
The court made a key distinction between different phases of AI development:
Phase 1: Collecting, creating and formatting the training dataset.
Phase 2: Training the model, during which the data is analyzed.
Phase 3: Use of the trained model via prompts and outputs.
The German and EU TDM exceptions under §44b UrhG and Article 4 of the Directive on Copyright in the Digital Single Market (DSM Directive) apply only to Phase 1. They do not permit the reproduction of works into the model itself during Phase 2. The court found that:
“The memorization of training data in the model exceeds the scope of analysis permitted under the TDM exceptions. This is not mere evaluation but a lasting incorporation into the model's parameters, which in turn interferes with the exploitation interests of the rights holders.” In other words: TDM allows you to analyze, but not to absorb.
So watch out: Training a model on a set of FDA submissions? If those documents include copyrighted content and your model can reproduce it—even in parts—you may be liable. Using legacy CRO reports or medical literature? If your model “remembers” and outputs content from those sources, TDM exceptions won’t protect you.
IV. Core finding 3: Direct liability for outputs
The court also addressed the liability for infringing outputs. It rejected the argument that the operator of the model is merely a neutral platform, like a hosting provider. Instead, the court held that the operator of a generative model is directly responsible for its outputs if they reproduce copyright-protected content via simple prompts. There is no “intermediary shield.”
V. Practical implications for life sciences companies
This ruling should prompt a reassessment of AI development workflows across the sector. Key recommendations include:
Audit your training data: Ensure that copyrighted content (e.g., clinical trial reports, publications, scientific data sets, submissions, legacy contracts) is cleared for use (e.g. via valid licensing arrangements) beyond TDM purposes.
Design model architecture with reproducibility in mind: Understand what your model can output. If your LLM is capable of regenerating proprietary or third-party text, you might be exposed.
Negotiate license terms carefully: When using data from CROs, academic collaborators, or public databases, ensure the rights include model training and potential reproduction.
Implement output filters: Especially for applications like auto-generated patient materials or scientific literature summaries, consider using retrieval-based or template-constrained approaches.
Monitor case law: The ruling is under appeal, and higher courts may refine the framework. But the current trend is toward stricter scrutiny of AI training methods.
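As one concrete starting point for the output-filter recommendation above, here is a minimal sketch, assuming you hold a reference corpus of protected texts you must not emit. The function name, the example strings, and the 8-word threshold are illustrative choices, not an established standard; production systems typically combine such verbatim checks with retrieval-based or template-constrained generation:

```python
def reproduces_protected_run(output, protected_texts, n=8):
    """Crude output filter: flag a generation that shares any n-word
    verbatim run with a corpus of protected reference texts."""
    def ngrams(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    out = ngrams(output)
    return any(out & ngrams(t) for t in protected_texts)

# Hypothetical protected reference text (e.g., a licensed report excerpt).
protected = ["the quick brown fox jumps over the lazy dog "
             "near the riverbank every morning"]

print(reproduces_protected_run(
    "she said the quick brown fox jumps over the lazy dog then left",
    protected))  # → True: contains an 8-word verbatim run, so block it
print(reproduces_protected_run(
    "a fresh paraphrase of the findings", protected))  # → False: passes
```

Set-based n-gram matching keeps the check fast enough to run on every generation; the threshold `n` trades false positives (common phrases) against missed partial reproductions.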
Developing or deploying LLMs in Europe? We help life sciences companies align AI innovation with copyright compliance without slowing down R&D.
