NVIDIA Kaggle Grandmasters Win Artificial General Intelligence Competition

The ARC-AGI-2 leaderboard measures how well AI systems perform abstract reasoning.

NVIDIA researchers on Friday won a key Kaggle competition many in the field treat as a real-time pulse check on humanity’s progress toward artificial general intelligence (AGI). 

Ivan Sorokin and Jean-Francois Puget, two members of the Kaggle Grandmasters of NVIDIA (KGMoN), came in first on the Kaggle ARC Prize 2025 public leaderboard with a 27.64% score by building a solution evaluated on the same dataset behind the ARC-AGI-2 benchmark.

After the competition, the team, which called itself NVARC, elevated its benchmark performance to 29.72%, all while delivering strong cost efficiency: just 20 cents per task. Its fine-tuned 4B model variant outperformed far larger, more expensive models on the same benchmark, showcasing not only state-of-the-art results, but a breakthrough in scalable, economical AGI-style reasoning.

The ARC-AGI benchmark measures how well AI systems perform abstract reasoning and then generalize from very few examples using grid-based visual puzzles. ARC-AGI-2 is a harder, updated version that removes overlap with public training data. It is explicitly designed to resist shortcuts and brute-force memorization, making it a sharper test of true systematic abstraction.

The ARC-AGI benchmark has become one of the most closely watched indicators of real progress toward general reasoning in AI. Unlike typical machine learning benchmarks, ARC-AGI tasks can’t be solved through scale, memorization, or pattern scraping. Each puzzle is a tiny grid with only a handful of examples, forcing systems to infer abstract rules—and apply them to a brand-new test case. Scores on the more-difficult ARC-AGI-2 are widely viewed as a proxy for how well an AI system can learn from almost nothing.
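To make the task format concrete, here is a minimal illustrative sketch (not the NVARC code): an ARC-style task is a handful of input/output grid pairs plus a test input, and a solver must infer the transformation from those examples alone. The grids, the `mirror_lr` rule, and the values below are invented for illustration.

```python
# An ARC-style task: a few (input, output) grid pairs and one test input.
# Grids are lists of rows of small integers representing colors.
train_pairs = [
    ([[0, 1], [1, 0]], [[1, 0], [0, 1]]),
    ([[2, 0], [0, 2]], [[0, 2], [2, 0]]),
]
test_input = [[3, 3], [0, 3]]

def mirror_lr(grid):
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A candidate rule is only trusted if it reproduces every training pair.
assert all(mirror_lr(i) == o for i, o in train_pairs)
print(mirror_lr(test_input))  # -> [[3, 3], [3, 0]]
```

With only two examples, the solver has to commit to an abstract rule rather than memorize pixels, which is exactly what makes the benchmark resistant to scale and pattern scraping.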

That’s why the Kaggle ARC Prize 2025 leaderboard matters: It’s the most open, reproducible arena where researchers test AGI-style reasoning under strict compute and time limits.

The winning NVIDIA NVARC solution wasn’t powered by giant models or brute-force search. Instead, it leaned on three ideas any developer can appreciate: synthetic data, test-time training, and disciplined engineering.

Heavyweight LLM reasoning methods—chain-of-thought, tool use, even RL-style agents—couldn’t fit within Kaggle’s tight runtime. So NVARC flipped the strategy: Move all complex reasoning offline into a synthetic data pipeline, and train smaller models capable of running fast during evaluation.
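The offline synthetic-data idea can be sketched in a few lines, under assumptions of my own (the transformation set, grid sizes, and task format here are hypothetical, not the team's pipeline): sample a random grid, apply a known transformation, and record the pair. Because the generator chooses the rule, every example comes labeled for free, so the expensive reasoning happens once, offline.

```python
import random

# A small pool of known grid transformations; the generator picks one per task.
TRANSFORMS = {
    "mirror_lr": lambda g: [list(reversed(r)) for r in g],
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "increment": lambda g: [[(c + 1) % 10 for c in r] for r in g],
}

def random_grid(rng, h, w, colors=10):
    """Sample an h-by-w grid of random color indices."""
    return [[rng.randrange(colors) for _ in range(w)] for _ in range(h)]

def make_task(rng, n_examples=3):
    """Build one synthetic task: pick a rule, then emit labeled pairs."""
    name, fn = rng.choice(sorted(TRANSFORMS.items()))
    pairs = []
    for _ in range(n_examples):
        g = random_grid(rng, rng.randint(2, 5), rng.randint(2, 5))
        pairs.append((g, fn(g)))
    return {"rule": name, "train": pairs}

rng = random.Random(0)
task = make_task(rng)
print(task["rule"], len(task["train"]))
```

Scaled up with richer transformations and staged generation, a corpus like this lets a small model learn the pattern-recognition part of the job before evaluation ever starts.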

Using staged puzzle generation, concept decomposition, and progressively stronger open-weight models, the team built a diverse synthetic corpus of ARC-style tasks. Final models only needed to recognize and adapt patterns, rather than execute full program-search logic. Test-time training learns each puzzle’s specifics from its tiny example set—a technique that has become essential for leading ARC-AGI performance.
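As a toy stand-in for test-time training (an assumption for illustration, not the team's method, which fine-tunes model weights on each puzzle's example set): search a small hypothesis space per puzzle and keep whichever candidate reproduces every training pair, then apply it to the test input. The candidate set and example grids below are invented.

```python
# Per-puzzle adaptation: only the training pairs of the current puzzle
# are used to select a transformation.
CANDIDATES = {
    "identity": lambda g: [row[:] for row in g],
    "mirror_lr": lambda g: [list(reversed(r)) for r in g],
    "mirror_ud": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def adapt(train_pairs):
    """Return the first candidate consistent with all examples, if any."""
    for name, fn in CANDIDATES.items():
        if all(fn(i) == o for i, o in train_pairs):
            return name, fn
    return None, None

pairs = [([[1, 2], [3, 4]], [[1, 3], [2, 4]])]
name, fn = adapt(pairs)
print(name)  # -> transpose
```

The key property mirrored here is that adaptation happens at evaluation time from a tiny example set; real test-time training does the same thing with gradient updates on the model itself.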

The result was a compact, cost-efficient ensemble that outperformed much larger systems and set a new bar on ARC-AGI-2, showing how synthetic data and adaptive learning can push reasoning forward.

To build these winning solutions, the team leveraged the NVIDIA NeMo suite of tools, including NeMo RL for scalable reinforcement learning and NeMo Skills for streamlining synthetic data generation (SDG) pipelines.

Learn more about the technical details from NVARC’s writeup on Kaggle and watch this interview with ARC.
