How much information do LLMs really memorize? Now we know, thanks to Meta, Google, Nvidia and Cornell

Most people interested in generative AI probably already know that large language models (LLMs) – such as those behind ChatGPT, Anthropic’s Claude and Google’s Gemini – are trained on massive datasets: trillions of words from websites, books, codebases and, increasingly, other media such as audio and video. But why?
From this data, LLMs develop a statistical, generalized understanding of language, its patterns and the world – encoded in the form of billions of parameters, or “settings,” in a network of artificial neurons (mathematical functions that transform input data into output signals).
By being exposed to all this training data, LLMs learn to detect and generalize patterns, which are reflected in the parameters of their neurons. The word “apple,” for example, often appears near terms related to food, fruit or trees, and sometimes computers. The model picks up that apples can be red, green or yellow – or occasionally other colors when rotten or rare – that the word is spelled “apple” in English, and that apples are edible. This statistical knowledge influences how the model responds when a user enters a prompt, shaping the output it generates based on the associations it “learned” from the training data.
But a big question – even among AI researchers – remains: how much of an LLM’s training data is used to build generalized representations of concepts, and how much is instead memorized verbatim, stored in a way that is identical or nearly identical to the original data?
This matters not only for better understanding how LLMs work – and when they go wrong – but also for how model providers defend themselves in copyright infringement lawsuits brought by data creators and owners, such as artists and record labels. If LLMs are shown to reproduce significant portions of their training data verbatim, courts could be more likely to side with plaintiffs arguing that the models unlawfully copied protected material. If not – if the models are found to generate outputs based on generalized patterns rather than exact replication – developers may be able to continue scraping and training on copyrighted data under existing legal defenses such as fair use.
Now we finally have an answer to the question of how much LLMs memorize versus generalize: a new study released this week by researchers from Meta, Google DeepMind, Cornell University and Nvidia finds that GPT-style models have a fixed memorization capacity of approximately 3.6 bits per parameter.
To understand what 3.6 bits means in practice (a short back-of-the-envelope calculation follows this list):
- A single bit is the smallest unit of digital data, representing either a 0 or a 1. Eight bits form a byte.
- Storing 3.6 bits allows for approximately 12.13 distinct values, since 2^3.6 ≈ 12.13.
- That is about the amount of information needed to choose one of 12 options, comparable to selecting a month of the year or the outcome of a roll of a 12-sided die.
- It is not enough to store even one English letter (which requires around 4.7 bits), but it is just enough to encode a character from a reduced set of 10 common English letters (which requires approximately 3.32 bits).
- In bytes, 3.6 bits is 0.45 bytes – less than half the size of a typical character stored in ASCII (which uses 8 bits, or 1 byte).
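These figures can be checked with a few lines of Python; the alphabet sizes of 26 and 10 are the same assumptions used in the bullets above, while 3.6 bits per parameter is the paper’s reported estimate:

```python
import math

BITS_PER_PARAM = 3.6  # the paper's reported memorization capacity

# Number of distinct values 3.6 bits can distinguish
print(2 ** BITS_PER_PARAM)   # ~12.13

# Bits needed to encode one of the 26 English letters
print(math.log2(26))         # ~4.70

# Bits needed for a reduced alphabet of 10 common letters
print(math.log2(10))         # ~3.32

# 3.6 bits expressed in bytes (8 bits per byte)
print(BITS_PER_PARAM / 8)    # 0.45
```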
This figure is model-independent within reasonable architectural variations: different depths, widths and precisions produced comparable results. The estimate held across model sizes and even precision levels, with full-precision models reaching slightly higher values (up to 3.83 bits/parameter).
More training data does not lead to more memorization – in fact, a model becomes less likely to memorize any single data point
A key takeaway of the research is that models do not memorize more when they are trained on more data. Instead, a model’s fixed capacity is distributed across the dataset, meaning each individual data point receives less attention.
Jack Morris, the lead author, explained via the social network X that “training on more data will force models to memorize less per sample.”
These findings may help ease concerns about large models memorizing copyrighted or sensitive content.
If memorization is limited and diluted across many examples, the probability of reproducing any one specific training example decreases. In essence, more training data leads to safer generalization behavior, not increased risk.
How the researchers identified these findings
To quantify exactly how much language models memorize, the researchers used an unconventional but powerful approach: they trained transformer models on datasets composed of uniform random bitstrings. Each of these bitstrings was independently sampled, so there were no patterns, structure or redundancy shared across examples.
Because every sample is unique and devoid of shared features, any ability the model shows in reconstructing or identifying these strings during evaluation directly reflects how much information it retained – or memorized – during training.
The main reason for this setup was to eliminate the possibility of generalization entirely. Unlike natural language – which is full of grammatical structure, semantic overlap and repeated concepts – uniform random data contains no such information. Every example is essentially noise, with no statistical relationship to any other. In such a scenario, any performance by the model on test data must come purely from memorization of the training examples, because there is no distributional pattern to generalize from.
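A minimal sketch of what such a dataset might look like is below; the sample count and sequence length are illustrative assumptions, not the paper’s actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_SAMPLES = 10_000   # illustrative; not the paper's setting
SEQ_LEN = 64           # bits per sample, also illustrative

# Uniform random bitstrings: every bit is an independent coin flip,
# so there is no shared structure for a model to generalize from.
dataset = rng.integers(0, 2, size=(NUM_SAMPLES, SEQ_LEN), dtype=np.uint8)

# Each sample carries SEQ_LEN bits of information; anything a model
# can reconstruct at test time must have been memorized.
print(dataset.shape)
print(dataset[:2])
```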
The authors argue that their method may be one of the only principled ways to decouple memorization from learning in practice, because when LLMs are trained on real language, even if they produce an output that matches the training data, it is hard to know whether they memorized the input or merely inferred the underlying structure from the patterns they observed.
With this method, the researchers could map a direct relationship between the number of model parameters and the total information stored. By gradually increasing model size and training each variant to saturation, across hundreds of experiments on models ranging from 500K to 1.5 billion parameters, they observed a consistent result: 3.6 bits memorized per parameter, which they report as a fundamental measure of LLM memory capacity.
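To illustrate how such an estimate could be formed, one can divide the total information a model demonstrably retains by its parameter count and check that the ratio stays roughly constant across scales. The measurement values below are invented placeholders chosen only to show the shape of the calculation, not numbers from the study:

```python
import numpy as np

# Hypothetical (parameter count, total memorized bits) pairs.
# Placeholder values for illustration only, not the paper's data.
params = np.array([5e5, 8e6, 1e8, 1.5e9])
measured_bits = np.array([1.8e6, 2.9e7, 3.6e8, 5.4e9])

# Capacity per parameter for each model size
print(measured_bits / params)          # each roughly 3.6

# A least-squares slope through the origin gives a pooled estimate
slope = (params @ measured_bits) / (params @ params)
print(round(slope, 2))                 # ~3.6 bits per parameter
```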
The team also applied their methodology to models trained on real-world datasets. When trained on text, models exhibited a balance of memorization and generalization.
Smaller datasets encouraged more memorization, but as dataset size increased, models shifted toward learning generalizable patterns. This transition was marked by a phenomenon known as “double descent,” in which performance temporarily dips before generalization improves.
The study also examined how model precision – comparing training in bfloat16 versus float32 – affects memorization capacity. They observed a modest increase from 3.51 to 3.83 bits per parameter when switching to full 32-bit precision. However, this gain is far smaller than the doubling of available bits would suggest, implying diminishing returns from higher precision.
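One way to see why the gain is modest: relative to the number of bits each parameter physically occupies, memorization efficiency actually drops when moving to float32. The snippet below simply restates the paper’s two reported values:

```python
# Reported capacity at each precision (bits memorized per parameter)
bf16_capacity, fp32_capacity = 3.51, 3.83

# Physical storage per parameter, in bits
bf16_bits, fp32_bits = 16, 32

print(bf16_capacity / bf16_bits)  # ~0.22 of each stored bit is "used"
print(fp32_capacity / fp32_bits)  # ~0.12, so doubling precision roughly halves efficiency
```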
Unique data is more likely to be memorized
The paper proposes a scaling law that relates a model’s capacity and dataset size to the effectiveness of membership inference attacks.
These attacks attempt to determine whether a particular data point was part of a model’s training set. The research shows that such attacks become unreliable as dataset size grows, supporting the argument that large-scale training helps reduce privacy risk.
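For context, a common membership inference baseline simply thresholds a model’s loss on a candidate example: unusually low loss suggests the example was seen during training. The sketch below shows that idea in the abstract, with a hypothetical `model_loss` callable standing in for a real model:

```python
def is_likely_member(example, model_loss, threshold=2.0):
    """Crude loss-threshold membership inference check.

    `model_loss` is a hypothetical callable returning the model's loss on
    `example`; the threshold would need calibration in practice.
    """
    return model_loss(example) < threshold

# Toy usage with a stand-in loss function (not a real model):
fake_loss = lambda text: 1.2 if text == "memorized training sentence" else 3.5
print(is_likely_member("memorized training sentence", fake_loss))  # True
print(is_likely_member("unseen sentence", fake_loss))              # False
```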
Although the paper focuses on average-case behavior, some researchers have pointed out that certain types of data – such as highly unique or stylized writing – may still be more susceptible to memorization.
The authors acknowledge this limitation and emphasize that their method is designed to characterize general trends rather than edge cases.
Toward a greater human understanding of LLM understanding
By introducing a principled and quantifiable definition of memorization, the study gives developers and researchers new tools for evaluating the behavior of language models. This helps not only with model transparency but also with compliance, privacy and ethical standards in AI development. The findings suggest that more data, not less, may be the safer path when training large-scale language models.
To put the total model memorization capacity in perspective (the rough arithmetic is sketched after this list):
- A 500K-parameter model can memorize roughly 1.8 million bits, or 225 KB of data.
- A 1.5 billion-parameter model can hold roughly 5.4 billion bits, or 675 megabytes of raw information.
- This is not comparable to typical file storage such as images (an uncompressed 3.6 MB image, for example, is about 30 million bits), but it is significant when distributed across discrete textual patterns.
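The arithmetic behind those figures is straightforward: multiply the parameter count by 3.6 bits and divide by 8 to convert to bytes, as in this small sketch:

```python
def capacity_bytes(num_params, bits_per_param=3.6):
    """Rough total memorization capacity in bytes, per the paper's estimate."""
    return num_params * bits_per_param / 8

print(capacity_bytes(500_000))        # 225,000 bytes -> ~225 KB
print(capacity_bytes(1_500_000_000))  # 675,000,000 bytes -> ~675 MB
```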
I am not a lawyer or legal expert, but I would strongly expect such research to be cited in the many ongoing lawsuits between AI providers and data creators/rights holders.