DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs
Large Language Models (LLMs) with very long context windows have been in the news lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks a lot of possibilities for developers.
But how well do these long-context LLMs really understand and use the vast amounts of information they receive?
Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have made progress in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the structure of that data.
The need for better benchmarks for the long context
The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the attention has been on retrieval tasks, such as the popular needle-in-a-haystack assessment, where the model is tasked with finding a specific piece of information within a large context.
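As a rough illustration of what a needle-in-a-haystack test looks like, here is a minimal Python sketch. The function names (`build_haystack_prompt`, `is_retrieved`), the filler text and the substring-based scoring are assumptions made for this example; real evaluations differ in how they build the context and grade answers.

```python
def build_haystack_prompt(needle, filler_sentence, num_filler, position):
    """Place one 'needle' sentence among repeated filler sentences.

    position is a fraction in [0, 1]: 0 puts the needle first, 1 puts it last.
    Hypothetical helper for illustration only.
    """
    sentences = [filler_sentence] * num_filler
    index = round(position * num_filler)
    sentences.insert(index, needle)
    return " ".join(sentences)


def is_retrieved(model_answer, expected):
    """Score an answer by checking whether the expected fact appears in it."""
    return expected.lower() in model_answer.lower()


# Example: hide one fact in the middle of 1,000 filler sentences.
prompt = build_haystack_prompt(
    needle="The access code is 7421.",
    filler_sentence="The sky was clear that day.",
    num_filler=1000,
    position=0.5,
)
question = "What is the access code?"
```

The model only has to locate one sentence, which is why a high score here says little about whether it can relate different parts of the context to each other.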
“Over time, models have grown considerably more capable in long-context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For example, the popular needle-in-a-haystack evaluation for retrieval is now well saturated up to extremely long context lengths. It has therefore become important to determine whether the harder tasks that models are capable of solving in short-context regimes are also solvable at long ranges.”
Retrieval tasks do not necessarily reflect a model’s ability to reason about the entire context. A model may be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that test a model’s ability to reason about long contexts have limitations.
“It is easy to develop long reasoning evaluations that are solvable with a combination of retrieval alone and information stored in model weights, thus ‘short-circuiting’ the test of the model’s ability to use the long context,” Vodrahalli said.
Michelangelo
To address the limitations of current benchmarks, the researchers introduced Michelangelo, a “minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.”
Michelangelo is based on the analogy of a sculptor chiseling away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on evaluating the model’s ability to understand the relationships and structure of the information within its context window, rather than simply retrieving isolated facts.
The benchmark consists of three core tasks:
Latent List: The model must process a long series of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list. “Latent List measures a model’s ability to track the properties of a latent data structure over the course of a stream of code instructions,” the researchers write.
Multi-round co-reference resolution (MRCR): The model must reproduce parts of a long conversation between a user and an LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even if the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understand ordering in natural text, to distinguish between similar drafts of writing, and to reproduce a specified piece of previous context subject to adversarially difficult queries,” the researchers write.
“I don’t know” (IDK): The model is presented with a long story and is asked to answer multiple choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and respond with “I don’t know.” “IDK measures the model’s ability to understand whether it knows what it does not know, based on the context presented,” the researchers write.
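To make the Latent List task more concrete, the following Python sketch generates a stream of list operations mixed with irrelevant statements and computes the reference answer. It is only an illustration under stated assumptions: the generator `make_latent_list_task`, the mix of operations and the helper `final_state` are hypothetical and are not taken from the researchers' implementation.

```python
import random


def make_latent_list_task(num_ops, seed=0):
    """Build a stream of Python statements acting on a list named `a`.

    Some statements change the list; others only read it (distractors).
    Returns (statements, final_list). Hypothetical generator for illustration.
    """
    rng = random.Random(seed)
    a = []
    statements = ["a = []"]
    for _ in range(num_ops):
        kind = rng.choice(["append", "pop", "reverse", "noop"])
        if kind == "append":
            value = rng.randint(0, 9)
            a.append(value)
            statements.append(f"a.append({value})")
        elif kind == "pop" and a:
            a.pop()
            statements.append("a.pop()")
        elif kind == "reverse":
            a.reverse()
            statements.append("a.reverse()")
        else:
            # Irrelevant statement: reads the list but leaves it unchanged.
            statements.append("len(a)")
    return statements, a


def final_state(statements):
    """Reference answer: execute the statements and return the list."""
    scope = {}
    exec("\n".join(statements), scope)
    return scope["a"]
```

In an evaluation along these lines, the model would be shown the joined statements and asked for the final contents of `a`, and its answer would be compared with the list returned by `final_state`. The model cannot answer by looking up a single line; it has to track the list across the whole stream.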
Latent structure queries
The tasks in Michelangelo are based on a new framework called Latent Structure Queries (LSQ). LSQ provides a general approach for designing long-context reasoning evaluations that can be extended to arbitrary lengths. It can also test the model’s understanding of implicit information, as opposed to simple fact retrieval. LSQ relies on synthesizing test data to avoid the pitfalls of leaking test data into the training corpus.
“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test the context understanding of language models beyond retrieval,” the researchers write.
LSQ differs from other approaches to evaluating long-context LLMs in three key ways. First, it is explicitly designed to avoid short-circuiting flaws in evaluations that go beyond retrieval tasks. Second, it specifies a methodology for increasing task complexity and context length independently. And finally, it is general enough to capture a large range of reasoning tasks. The three tests used in Michelangelo cover code interpretation and reasoning over loosely written text.
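The idea of varying task complexity and context length independently can be sketched with a toy example. In the Python snippet below, `num_relevant` sets how many updates actually affect the answer, while `num_filler` only pads the context with lines that never touch it. The function `build_instance` and the counter-based task are assumptions invented for this illustration, not part of LSQ as published.

```python
import random


def build_instance(num_relevant, num_filler, seed=0):
    """Return (lines, answer) for a toy latent-structure query.

    num_relevant controls task complexity (updates that change the answer);
    num_filler controls context length (lines that never touch the answer).
    """
    rng = random.Random(seed)
    total = 0
    relevant = []
    for _ in range(num_relevant):
        delta = rng.randint(-5, 5)
        total += delta
        relevant.append(f"counter += {delta}")
    filler = [f"note_{i} = {rng.randint(0, 99)}" for i in range(num_filler)]

    # Interleave filler among the relevant lines, keeping each group's order.
    slots = ["r"] * num_relevant + ["f"] * num_filler
    rng.shuffle(slots)
    rel_iter, fill_iter = iter(relevant), iter(filler)
    lines = ["counter = 0"]
    for slot in slots:
        lines.append(next(rel_iter) if slot == "r" else next(fill_iter))
    return lines, total


# Same complexity, very different context lengths.
short_lines, short_answer = build_instance(num_relevant=5, num_filler=10, seed=3)
long_lines, long_answer = build_instance(num_relevant=5, num_filler=1000, seed=3)
```

Because the relevant updates are drawn before any filler, the two instances share the same answer and differ only in how much irrelevant context surrounds it, so a drop in accuracy between them can be attributed to length rather than to a harder task.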
“The goal is that long-context evaluations that go beyond reasoning and are implemented by following LSQ will lead to fewer scenarios where a proposed evaluation amounts to solving a retrieval task,” Vodrahalli said.
Evaluation of frontier models on Michelangelo
The researchers evaluated ten frontier LLMs on Michelangelo, including several variants of Gemini, GPT-4 and 4o, and Claude. They tested the models on contexts of up to 1 million tokens. Gemini models performed best on MRCR, GPT models excelled on Latent List, and Claude 3.5 Sonnet achieved the highest scores on IDK.
However, all models showed a significant drop in performance as the complexity of the reasoning tasks increased, indicating that even with very long context windows, current LLMs still have room to improve their ability to reason over large amounts of information.
“Frontier models have room for improvement on all the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we explore in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well on different context ranges and on different tasks. What does appear to be universal across all models is the initial drop in performance on long reasoning tasks.”
Michelangelo’s assessments cover basic primitives necessary for long-context reasoning, and the findings could have important implications for business applications. For example, in real-world applications where the model cannot rely on its prior training knowledge and must perform multi-hop reasoning over many different locations in very long contexts, Vodrahalli expects performance to decrease as context length increases.
“This is especially true if the documents contain a lot of information that is not relevant to the task at hand, making it difficult for a model to immediately distinguish which information is relevant or not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all the relevant information to answer a question is in one common place in the document.”
The researchers will continue to add more assessments to Michelangelo and hope to make them readily available so that other researchers can test their models on them.