We keep seeing LLMs with larger context windows in the news, along with promises that they can hold entire conversation histories, volumes of books, or multiple codebases in view at once. And yet, these models still repeat the same mistakes. We still have to copy and paste the earlier context back into the chat for LLMs to “get it”. A smart co-worker would pick up on these patterns, adapt, and carry the lessons forward. Why can’t LLMs?
In this blog post, we observe a critical difference between LLM memory and human memory. Then, we introduce test-time training with an end-to-end formulation (TTT-E2E), our latest research, in which the LLM compresses the context it’s reading into its weights through next-token prediction.

Our key results are highlighted in Figure 1, which measures scaling with context length, in terms of loss (left) and latency (right). The Transformer with full attention scales well in terms of loss but not latency. Recurrent Neural Networks (RNNs), such as Mamba 2 and Gated DeltaNet, scale well in latency but not loss. TTT-E2E is the only method that scales well in both.
Left panel: TTT-E2E turns the worst line (gray) into the best (light green) at 128K context length. Loss ∆ (↓), the y-value, is computed as (loss of the reported method) − (loss of transformer with full attention), so loss ∆ of full attention itself (dark green) is the flat line at y=0. While the loss ∆ of the other methods grows worse at longer context lengths, TTT-E2E maintains the same advantage over full attention.
Right panel: Similar to RNNs, TTT-E2E has constant inference latency regardless of context length, making it 2.7x faster than full attention for 128K context on an NVIDIA H100, and 35x faster for 2M context. All models have 3B parameters and are trained with 164B tokens.
Scaling with context length, in terms of both loss and latency, is the most fundamental problem in long-context LLM research. TTT-E2E is the first method that shows a sign of life on this problem, while all the other methods exhibit qualitatively different trends. Moreover, we observed no wall in the scaling trends of TTT-E2E across rigorous and extensive experiments. These results indicate that the research community might finally arrive at a basic solution to long context in 2026.
Our paper and code are publicly available.
How does LLM memory differ from human memory?
Humans are remarkably good at improving with more “context” in the form of life experience, despite their imperfect recall of the exact details. For example, consider your first lecture in machine learning. You might not recall the instructor’s first word during the lecture, but the intuition you learned is probably helping you understand this blog post, even if that happened years ago.
On the other hand, transformers with self-attention are inefficient with long context, in part because they are designed for nearly lossless recall. The basic form of self-attention is called full attention, which maintains full memory of every token by caching and comparing their keys and values. As a consequence, full attention readily attends to every detail, but its cost per token grows linearly with context length. Processing the 10-millionth token takes one million times longer than processing the 10th.
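To make that cost concrete, here is a minimal sketch (in PyTorch, single-head, with an illustrative head dimension; not the paper's code) of decoding with a full KV cache: every new token appends its key and value and then attends over the entire cache, so the work per token grows with the number of tokens already processed.

```python
# Minimal sketch of single-head decoding with a full KV cache: each new token
# attends over every cached key/value, so per-token cost grows linearly.
import torch

d = 64                                   # head dimension (illustrative)
k_cache, v_cache = [], []                # grows by one entry per token

def attend_next_token(q, k, v):
    """Cache this token's key/value, then attend over the whole history."""
    k_cache.append(k)
    v_cache.append(v)
    K, V = torch.stack(k_cache), torch.stack(v_cache)    # (t, d) each
    weights = torch.softmax(K @ q / d**0.5, dim=0)        # one score per past token
    return weights @ V                                    # cost grows with t

for t in range(5):
    q, k, v = torch.randn(3, d)
    out = attend_next_token(q, k, v)
    print(f"token {t + 1}: attended over {len(k_cache)} cached tokens")
```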
To process long context without burning the planet, modern architectures often combine full attention with approximations such as sliding-window attention, Mamba, and Gated DeltaNet layers. These approximations have a constant cost per token, but also become significantly less effective in longer context compared to full attention. Specifically, these approximations lose important information that would have helped them predict the future, as shown in Figure 1.
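For contrast, a sliding-window variant of the same sketch keeps the per-token cost constant by evicting old entries. The window size below is an arbitrary illustrative value, and the trade-off is exactly the information loss described above: anything outside the window is simply gone.

```python
# Minimal sketch of sliding-window attention: the cache is capped at `window`
# entries, so cost per token is constant, but older tokens are forgotten.
import torch

d, window = 64, 4                        # window size is illustrative
k_cache, v_cache = [], []

def attend_with_window(q, k, v):
    """Attend over at most the `window` most recent tokens."""
    k_cache.append(k); v_cache.append(v)
    del k_cache[:-window], v_cache[:-window]              # evict the oldest tokens
    K, V = torch.stack(k_cache), torch.stack(v_cache)
    weights = torch.softmax(K @ q / d**0.5, dim=0)
    return weights @ V                                    # bounded cost, lossy memory

for t in range(8):
    q, k, v = torch.randn(3, d)
    attend_with_window(q, k, v)
    print(f"token {t + 1}: cache holds {len(k_cache)} of {t + 1} tokens seen")
```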
Our method: compressing context into weights
How can we design a method with a constant cost per token that can still remember the important, predictive, and intuitive information in long context?
The key mechanism is compression. For example, humans compress a massive amount of experience into their brains, which preserves the important information while leaving out many details. For language models, we know that training with next-token prediction also compresses a massive amount of data into their weights. So what if we just continue training the language model at test time through next-token prediction on the given context?
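As a concrete picture, here is a minimal sketch of that idea, assuming a Hugging Face-style causal LM interface (model(input_ids=..., labels=...) returning .loss); the chunk size, optimizer, and learning rate are illustrative choices, not the settings from our paper.

```python
# Minimal sketch of test-time training by next-token prediction: keep training
# the model on the given context so the context is compressed into its weights.
import torch

def test_time_train(model, context_ids, chunk_size=512, lr=1e-4, steps_per_chunk=1):
    """Update `model` in place by next-token prediction over `context_ids`
    (shape: batch x sequence length)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for start in range(0, context_ids.size(1), chunk_size):
        chunk = context_ids[:, start:start + chunk_size]
        for _ in range(steps_per_chunk):
            out = model(input_ids=chunk, labels=chunk)   # causal LM loss on the chunk
            opt.zero_grad()
            out.loss.backward()
            opt.step()
    return model  # the weights now carry a compressed summary of the context
```

A real implementation would also handle batching and resetting the weights between documents; the point here is only the mechanism.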
We found this simple form of Test-Time Training (TTT) highly effective once we added another missing piece. At training time, we prepare the model's initialization for TTT through meta-learning instead of standard pre-training. This addition makes our method end-to-end (E2E) in two ways: our inner loop directly optimizes the next-token prediction loss at the end of the network, in contrast to prior work on long-context TTT (e.g., Titans), and our outer loop directly optimizes the final loss after TTT.
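The two loops can be pictured with a toy bilevel-optimization sketch (a tiny linear model with made-up data standing in for the LM; nothing here is the actual TTT-E2E training code). What matters is the structure: the inner step minimizes a next-token-style loss on the "context", and the outer update differentiates through that step so the initialization itself is trained for TTT.

```python
# Toy sketch of end-to-end meta-learning for TTT.
# Inner loop: one gradient step on the "context" loss, keeping the graph.
# Outer loop: backpropagate the post-TTT loss into the initialization theta0.
import torch

torch.manual_seed(0)
theta0 = torch.randn(8, requires_grad=True)        # meta-learned initialization
meta_opt = torch.optim.Adam([theta0], lr=1e-2)
inner_lr = 0.1
w_true = torch.randn(8)                            # hidden "task" the context reveals

def loss_fn(theta, x, y):                          # stand-in for next-token loss
    return ((x @ theta - y) ** 2).mean()

for step in range(200):
    x_ctx = torch.randn(32, 8); y_ctx = x_ctx @ w_true    # "context" tokens
    x_fut = torch.randn(32, 8); y_fut = x_fut @ w_true    # "future" tokens

    # Inner loop (test-time training): one step on the context, with
    # create_graph=True so the outer loop can differentiate through it.
    (g,) = torch.autograd.grad(loss_fn(theta0, x_ctx, y_ctx), theta0, create_graph=True)
    theta_ttt = theta0 - inner_lr * g

    # Outer loop (end-to-end): the loss that matters is the one *after* TTT.
    meta_loss = loss_fn(theta_ttt, x_fut, y_fut)
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```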
What will be the role of RAG?
TTT is like updating the human brain, while retrieval-based methods, such as RAG, are like writing things down and looking things up in a notepad or calendar. The notepad will continue to be a practical supplement to the brain, especially when the details matter, like shopping for a long list of groceries. But a person's productivity is mostly determined by their brain, not by the notepad they use. Similarly, the productivity of an AI agent is mostly determined by how well it compresses a massive amount of context into predictive and intuitive information.
Limitations
At training time, the meta-learning phase of TTT-E2E requires gradients of gradients. Our current implementation of meta-learning is 3.4x slower than standard pre-training for short context (8K), because the standard API of FlashAttention does not support gradients of gradients. We can overcome this limitation by either developing a custom attention kernel that supports gradients of gradients or initializing TTT-E2E from a standard Transformer pre-trained without TTT. We invite the community to join us in these efforts!
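For readers unfamiliar with the term, "gradients of gradients" simply means differentiating through a gradient step, which requires the first backward pass itself to be differentiable. The scalar toy below (PyTorch, purely illustrative) shows the mechanism that a fused attention kernel's backward pass would also need to support.

```python
# Toy illustration of "gradients of gradients": build a graph for the first
# gradient (create_graph=True), then differentiate through the update it defines.
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x) ** 2                                      # inner-loop loss
(g,) = torch.autograd.grad(loss, w, create_graph=True)   # dloss/dw = 2*w*x**2 = 36
w_updated = w - 0.1 * g                                  # one inner gradient step

outer = (w_updated * x - 1.0) ** 2                       # outer loss after the step
outer.backward()                                         # needs the grad of g w.r.t. w
print(g.item(), w.grad.item())
```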
Conclusion
For a deeper dive into the method, results, and implementation details, please check out the full paper End-to-End Test-Time Training for Long Context. All experiments can be reproduced using the code and datasets in our public repo.