In the debate surrounding the training of generative AI, an argument is increasingly put forward that claims to settle once and for all the issue of intellectual property in training data: the comparison with clean room reverse engineering. The idea is that, if this practice has been considered legitimate for decades, then the way artificial intelligence models are trained should be considered legitimate as well. It is a line of reasoning with a distinctly technical slant, aimed at disposing of the problem by presenting it as already known and already resolved.
This analogy is used to shift the discussion onto a purely technical plane, as if demonstrating a similarity of process were sufficient to close any question of legitimacy. But the core issue is not technical, or not only technical: it is legal and regulatory. It concerns who may use what, under which conditions, subject to which limits, and with what effects on the distribution of value.
In the first part of this article, we will examine in detail why clean room reverse engineering and the training of generative AI models operate according to profoundly different logics, both from a technical standpoint and, above all, from a legal one. In the second part, we will turn to consequences and prospects, showing how the introduction of a clear regulatory framework is a necessary condition for coherently addressing the issue of AI training and the distribution of the value that derives from it.
The image accompanying this article depicts an AI-generated reinterpretation, in a “Ghibli-style” aesthetic, of a well-known meme, widely circulated on social media without attribution: an emblematic example of the issues surrounding stylistic appropriation and the legitimacy of training processes.
Clean Room Reverse Engineering
Clean room reverse engineering was developed to address a very concrete, and far from theoretical, problem: how to build a system compatible with an existing one without infringing copyright. It is not, therefore, a “clever” technique for copying more effectively, but a method explicitly designed to prevent copying, in contexts where compatibility is necessary but access to the original source code is not permitted.
The principle on which it is based is simple, yet extremely rigorous: what may be replicated is the functional behavior of a system, not the expressive form through which that behavior is implemented. Copyright law does not protect functionality in the abstract, but rather the specific way in which it is expressed, organized, and encoded.
A particularly well-known example, one that lends itself well to explaining how the clean room approach works, is that of Compaq and the IBM BIOS. To produce a BIOS compatible with IBM’s without violating its rights, Compaq adopted a very precise procedure: two distinct teams, kept strictly separate. The first analyzed the original BIOS and produced documentation describing only its observable behavior and required functionalities, deliberately avoiding any reference to source code or implementation choices. The second team, which had never accessed the original BIOS, used only that documentation to develop a compatible implementation.
The decisive point is not the organizational separation itself, but what follows from it: the reference document is neither a transcription nor a reworking of the original code, but an abstract functional specification, deliberately designed to exclude any expressive elements. It is this enforced abstraction that makes the difference.
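To make the idea of enforced abstraction more concrete, here is a minimal, purely illustrative sketch in Python. The names and behaviors are hypothetical and do not reproduce the actual Compaq procedure: the analysis team’s only deliverable is a behavioral specification expressed as observable input/output pairs, and the implementation team writes its code against that specification alone.

```python
# Hypothetical illustration of a clean-room workflow (not the actual Compaq process).

# Artifact produced by the analysis team: observable behavior only,
# with no reference to the original source code or its internal structure.
FUNCTIONAL_SPEC = [
    # (input, expected observable output)
    ((0x10,), "video services handler invoked"),
    ((0x16,), "keyboard services handler invoked"),
]

# Written by the separate implementation team, using only FUNCTIONAL_SPEC.
def dispatch_interrupt(interrupt_number: int) -> str:
    handlers = {
        0x10: "video services handler invoked",
        0x16: "keyboard services handler invoked",
    }
    return handlers.get(interrupt_number, "unsupported interrupt")

# The only point of contact between the two teams is the specification itself.
for args, expected in FUNCTIONAL_SPEC:
    assert dispatch_interrupt(*args) == expected
```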
This is where the true meaning of clean room reverse engineering becomes clear. The constraints are not a technical detail or a marginal precaution: they are the legal foundation of the entire framework. The process is constructed in such a way as to make copying impossible by definition. In the absence of these constraints, one could no longer speak of legitimate reverse engineering, but of copyright infringement, in other words, misappropriation.
Differences with GenAI Training
To understand where the analogy with clean room reverse engineering breaks down, it is first necessary to clarify a preliminary distinction that is often invoked superficially: that between training and inference in generative AI systems.
In very broad terms, within the lifecycle of a generative AI model, training is the phase in which the system is exposed to large volumes of data, such as texts, images, code, and audio, in order to learn patterns, correlations, and statistical structures; inference, by contrast, is the subsequent phase in which the already trained model is queried to generate new outputs, without directly accessing the original training data.
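A minimal sketch may help fix the two phases in mind. This is a toy word-level statistical model, not a real generative system: a hypothetical `train` function reads the corpus and distills it into parameters, while `generate` consults only those parameters.

```python
import random
from collections import defaultdict

# --- Training phase: the corpus is read and distilled into parameters. ---
def train(corpus: list[str]) -> dict[str, list[str]]:
    """Learn word-to-next-word statistics from the training texts."""
    params = defaultdict(list)
    for text in corpus:
        words = text.split()
        for current, following in zip(words, words[1:]):
            params[current].append(following)
    return dict(params)

# --- Inference phase: only the learned parameters are consulted. ---
def generate(params: dict[str, list[str]], start: str, length: int = 8) -> str:
    """Produce new text without ever touching the original corpus."""
    word, output = start, [start]
    for _ in range(length):
        if word not in params:
            break
        word = random.choice(params[word])
        output.append(word)
    return " ".join(output)

corpus = ["the quick brown fox jumps over the lazy dog"]
params = train(corpus)          # the works are visible here
print(generate(params, "the"))  # ...but not here
```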
This distinction underpins the most common defense: since, during inference, the model does not consult protected works but relies solely on parameters learned during training, the output would be legally comparable to an abstract reference document, like the one produced in a clean room reverse engineering process.
According to this view, training “sees” the protected works, but inference does not. The problem would therefore be confined to a technical phase already completed, devoid of legal relevance for the model’s final use.
The problem is that this reconstruction is misleading, for a very simple reason:
- In clean room reverse engineering, the reference document is not a technical by-product of the process, but an artifact intentionally designed to be abstract, functional, and free of expressive elements. It is the result of deliberate constraints introduced precisely to prevent the reproduction of protected content.
- In the training of generative AI models, by contrast, no equivalent mechanism of enforced abstraction exists. The model does not distill a neutral functional specification, but incorporates, in statistical form, structures and patterns derived directly from expressive works. The lack of direct access to data during inference does not negate the fact that those works contributed decisively to the formation of the model itself.
In other words, the distinction between training and inference does not amount to a clean room separation: it is merely functional and temporal, not conceptual or legal. For this reason, the analogy fails. What is prevented in principle in clean room reverse engineering is, in generative AI training, the very precondition for the system’s operation.
Put differently, GenAI training does not produce a neutral functional specification, but builds a statistical model that incorporates patterns, structures, and correlations derived directly from expressive works. There is no clean room, neither conceptually nor operationally. There are no barriers designed to prevent the reproduction of stylistic, narrative, or formal elements. If anything, the process is deliberately the opposite.
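To illustrate how directly learned parameters can carry expressive material, here is a deliberately simplified sketch: a toy next-word model, not a production architecture, trained on a single invented sentence. With no abstraction step in between, greedy decoding from the parameters alone re-emits the training sentence verbatim.

```python
from collections import Counter, defaultdict

training_text = "every model quietly absorbs the distinctive voice of its training texts"

# Learn next-word counts: the "parameters" of a toy statistical model.
counts: dict[str, Counter] = defaultdict(Counter)
words = training_text.split()
for current, following in zip(words, words[1:]):
    counts[current][following] += 1

# Greedy decoding from the learned parameters alone: since no abstraction step
# removed the expressive material, the source sentence re-emerges verbatim.
word, output = "every", ["every"]
while word in counts:
    word = counts[word].most_common(1)[0][0]
    output.append(word)
    if len(output) > 20:  # safety cap for cyclic chains
        break
print(" ".join(output))
```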
As if that were not enough, data collection occurs in a massive and indiscriminate manner, without limits of purpose, without abstraction-based selection, and without role separation. The value of the model increases precisely insofar as it absorbs and reprocesses the expressivity of others.
This is not functional compatibility, but a statistical reworking of existing material, entirely grounded in works of authorship created by others.
The False Dilemma: Pro or Anti-AI
Having clarified this point, it is worth dispelling a recurring misconception: criticizing current training practices does not mean rejecting artificial intelligence. The problem, here as elsewhere, is not the technology itself, but how it is used, or more precisely, how it is built, trained, and monetized.
As it is practiced today, the training of generative AI is anything but neutral. In the absence of rules, it becomes a tool for the systematic extraction of cultural and creative value, favoring those who can afford billion-dollar infrastructures while leaving everyone else with the externalities.
The Need for an Adequate Regulatory Framework
At this point, the core issue becomes difficult to avoid: if we want to seriously discuss the legitimacy of generative AI training, the question of constraints cannot be treated as a secondary detail. As in the case of clean room reverse engineering, it is not the outcome of the process that determines legitimacy, but the architecture of rules that governs it.
Today, rather than a coherent regulatory framework, there is a vacuum, an area in which technology advances rapidly while rules struggle to keep pace. But this is not merely a matter of regulatory delay: the real issue lies upstream, in the evident absence of a framework conceived from the outset to regulate training as an activity that is structurally non-neutral and that should therefore rest on a set of indispensable normative presuppositions.
First and foremost, consent and licensing of the data used for training cannot be implicit or presumed. They must be explicit, verifiable, and consistent with the purpose for which the model is trained. Likewise, source traceability cannot remain a mere aspiration; it should be a necessary condition for accountability and value redistribution.
Within this framework, opt-out mechanisms must be rethought as genuine instruments of control, not merely formal gestures. And compensation for those who contribute cultural or creative value cannot be treated as a side issue: it is an integral part of an economic ecosystem that aims to be sustainable rather than purely extractive.
Alongside these elements, more structural requirements emerge, ones that concern the very way in which AI training should be conceived and regulated. For instance, model training should always be accompanied by an explicit declaration of purpose. This is not a formality: knowing why a model is trained is a prerequisite for determining how and under what conditions data may be used.
This declaration should also be paired with a clear distinction between different usage purposes. Research, public use, and commercial exploitation are not interchangeable categories and cannot be treated as such. Without this distinction, any model becomes potentially reusable without limit, with consequences that extend far beyond the technical domain and directly affect markets, culture, and bargaining power.
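As a purely illustrative sketch of what such a machine-readable declaration might look like, here is a hypothetical data structure in Python; the field names and the admissibility rule are assumptions made for the sake of the example, not an existing standard or a legal proposal.

```python
from dataclasses import dataclass, field
from enum import Enum

class UsagePurpose(Enum):
    # Hypothetical, non-interchangeable usage categories.
    RESEARCH = "research"
    PUBLIC_INTEREST = "public_interest"
    COMMERCIAL = "commercial"

@dataclass
class SourceRecord:
    """Traceability for a single work included in the training corpus."""
    source_uri: str
    rights_holder: str
    license_id: str        # explicit, verifiable license, not presumed consent
    consent_obtained: bool
    opt_out_honored: bool

@dataclass
class TrainingDeclaration:
    """Hypothetical declaration accompanying a single training run."""
    model_name: str
    declared_purpose: UsagePurpose
    sources: list[SourceRecord] = field(default_factory=list)

    def is_admissible(self) -> bool:
        """Admissible only if every source is licensed, consented, and opt-out compliant."""
        return all(
            s.license_id and s.consent_obtained and s.opt_out_honored
            for s in self.sources
        )
```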
Here, the parallel with the clean room becomes useful, but in the opposite sense from how it is often deployed in public debate. As in clean room reverse engineering, legitimacy in AI training cannot be entrusted to the good faith of the actors involved, nor assessed ex post on the basis of final outputs. It must be built by design, through clear, verifiable, and non-circumventable constraints that define from the outset what is permissible and what is not.
In the absence of such a regulatory framework, the outcome is plain to see: a small number of large private actors concentrate infrastructure, data, and computational capacity, reinforcing dominant positions that are difficult to challenge. In this context, artificial intelligence risks becoming not a tool for disseminating knowledge, but a powerful multiplier of existing asymmetries.
An Alternative Worth Discussing
At this point, the debate tends to harden. Any call for rules is quickly dismissed as an attempt to “slow down progress.” But this opposition, beyond being sterile, is also misleading. There is in fact an alternative that deserves far more serious consideration: treating large language models as collective infrastructures rather than as the exclusive property of individual companies.
Such an approach could take the form of models developed or funded with public resources, subject to transparent governance, regulated access, and objectives explicitly oriented toward collective benefit. This would not mean denying space to private innovation, but rebalancing a field that today appears heavily skewed.
It is an ambitious scenario, certainly, but also one of the few in which a technology of this magnitude could truly be considered enabling rather than purely extractive.
Conclusions: Beyond the “All or Nothing” Logic
The debate on artificial intelligence is often trapped in a sterile opposition: on one side, those who call for the total rejection of the technology; on the other, those who argue that any externally imposed limit represents an unacceptable brake on innovation. Beyond being a false alternative, this way of thinking may be the single greatest obstacle to a genuinely constructive discussion.
Artificial intelligence should neither be rejected outright nor allowed to grow unchecked in the name of supposed technological determinism. It must be understood, governed, and ultimately civilized, not because it is inherently dangerous, but because the way it is built, trained, and monetized produces concrete economic, cultural, and social effects.
Without clear rules, the risk is that of systematic appropriation of value on an industrial scale, with benefits and power concentrated in the hands of a few actors. With an adequate regulatory framework, or better still, with an approach that explicitly places the collective interest at its center, AI could instead become a foundational infrastructure of our time: powerful, shared, and more equitably distributed.
The real crossroads, once again, is not between progress and stagnation, but between properly governed progress, capable of producing broadly shared benefits, and growth left entirely to the logic of accumulation.