Large language models (LLMs) with very long context windows have been making headlines lately. The ability to cram hundreds of thousands or even millions of tokens into a single prompt unlocks many possibilities for developers. But how well do these long-context LLMs really understand and utilize the vast amounts of information they receive?

Researchers at Google DeepMind have introduced Michelangelo, a new benchmark designed to evaluate the long-context reasoning capabilities of LLMs. Their findings, published in a new research paper, show that while current frontier models have progressed in retrieving information from large in-context data, they still struggle with tasks that require reasoning over the data structure.

The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has prompted researchers to develop new benchmarks to evaluate their capabilities. However, most of the focus has been on retrieval tasks, such as the popular “needle-in-a-haystack” evaluation, where the model is tasked with finding a specific piece of information within a large context.

“Over time, models have grown considerably more capable in long context performance,” Kiran Vodrahalli, research scientist at Google DeepMind, told VentureBeat. “For instance, the popular needle-in-a-haystack evaluation for retrieval has now been well saturated up to extremely long context lengths. Thus, it has become important to determine whether the harder tasks models are capable of solving in short context regimes are also solvable at long ranges.”
Full report: DeepMind’s Michelangelo benchmark reveals limitations of long-context LLMs.