In mid-2025, Google DeepMind pulled off a move that should make anyone building with AI agents pay attention: instead of buying Windsurf outright, they hired Windsurf’s CEO (Varun Mohan), co-founder (Douglas Chen), and key R&D talent, and paired that with a non-exclusive license to some Windsurf technology (reportedly a $2B+ licensing / compensation package).
That combo matters because it’s not just “drama”. It’s a blueprint: get elite builder leadership + secure the right to ship the tech inside your own platform + keep the startup independent, so you avoid a messy acquisition.
The follow-through: Gemini 3 + Antigravity
A few months later, DeepMind launched Gemini 3, positioning it as their most capable model – especially for multimodal understanding (text + images + audio, etc.).
At the same time, they introduced Antigravity, an agentic development environment / IDE built to let AI agents plan, implement, and test code, and (importantly) it’s designed to work with multiple models, not only Gemini.
One thing to keep an eye on: Google’s messaging emphasizes improved reasoning and reliability, but independent reporting suggests hallucinations remain a central weakness across the industry (including top models). In other words: progress, yes – “solved”, no.
“The Thinking Game” is a good DeepMind lens
If you don’t have much context on DeepMind, The Thinking Game documentary is a surprisingly effective shortcut into how they think – the culture, the ambition, and Demis Hassabis’s obsession with turning long-range research into real capability.
And that brings us to the thing that (to me) connects all these dots.
The real prize: continual learning
Agentic systems don’t just need “a bigger brain”. They need a way to get better from experience without breaking themselves.
That’s the continual learning problem in one sentence:
Learn new things over time without wiping out what you already know.
This is where a lot of “classic” fine-tuning runs into trouble.
Fine-tuning has a narrow range of applications
Fine-tuning can be great for:
- style alignment (tone, format, policy),
- narrow domain specialization,
- teaching repeatable behaviors.
But when the goal is “update the model with new facts / new skills continuously”, naive fine-tuning tends to collide with catastrophic forgetting: the model improves on the new stuff and quietly degrades on the old stuff.
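To make the failure mode concrete, here’s a minimal PyTorch sketch (toy model, made-up data – nothing from any paper) where naively fine-tuning on a new task visibly degrades the old one:

```python
import torch
import torch.nn as nn

# Toy setup: one small network, two "tasks" with made-up regression data.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()

x_old, y_old = torch.randn(256, 10), torch.randn(256, 1)  # "old" knowledge
x_new, y_new = torch.randn(256, 10), torch.randn(256, 1)  # "new" knowledge

def train(xs, ys, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()

train(x_old, y_old)                       # learn the old task
old_before = loss_fn(model(x_old), y_old).item()

train(x_new, y_new)                       # naive fine-tune on the new task
old_after = loss_fn(model(x_old), y_old).item()

# Typically old_after >> old_before: learning the new task overwrote the old.
print(f"old-task loss before: {old_before:.3f}, after: {old_after:.3f}")
```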
Step 1: Sparse Memory Fine-tuning (late 2025)
A concrete example: in Continual Learning via Sparse Memory Finetuning, the authors show that full fine-tuning can cause severe drops on prior capabilities, while a sparse-memory approach reduces that forgetting dramatically. (This used to be readable in full on OpenReview, but you can at least read the abstract on arXiv.)
The basic idea is quite simple:
- Don’t update the whole model.
- Update only the parts that are “responsible” for the new knowledge.
- Reduce interference by keeping most parameters stable.
In the paper’s framing, the “memory” components are sparsely updated by design, which gives you a lever to learn without trashing prior competence.
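Here’s a simplified sketch of that pattern – not the paper’s exact method. The memory table is hypothetical, and I use gradient magnitude as a stand-in for the slot-selection step (the paper ranks slots by how specific they are to the new data):

```python
import torch
import torch.nn as nn

# Illustrative "memory" table: rows are slots; the rest of the model is frozen.
memory = nn.Embedding(10_000, 256)   # hypothetical memory layer
k = 32                               # only k slots get updated per step

def sparse_memory_step(loss, lr=1e-3):
    memory.weight.grad = None
    loss.backward()
    grad = memory.weight.grad
    # Stand-in selection rule: keep only the k rows with the largest gradient.
    top_slots = grad.norm(dim=1).topk(k).indices
    mask = torch.zeros_like(grad)
    mask[top_slots] = 1.0
    with torch.no_grad():
        memory.weight -= lr * grad * mask   # every other row stays untouched

# Made-up batch: some slot lookups and a dummy target to learn against.
ids = torch.randint(0, 10_000, (64,))
target = torch.randn(64, 256)
loss = nn.functional.mse_loss(memory(ids), target)
sparse_memory_step(loss)
```

The mask is the whole trick: interference is confined to the k touched rows, while the rest of the memory (and the frozen backbone) stays exactly as it was.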
Step 2: Nested Learning (late 2025 / early 2026)
Then Google moved the conversation forward again with Nested Learning. They wrote about this on their research blog – Introducing Nested Learning: A new ML paradigm for continual learning – published, as it happens, on my birthday. It’s funny how random coincidences like that grab your attention.
Instead of treating the model as a single frozen blob that occasionally gets retrained, Nested Learning treats learning as multiple optimization loops operating at different speeds – like gears.
And now the very exciting part.
Two gears, one brain
Think of it like a human system:
Inner loop = short-term adaptation (fast gear)
This is your “working memory” mode:
- quick adjustment,
- immediate usefulness,
- low commitment.
Like remembering a name you were just told (or forgetting it instantly, depending on the current version of the human model I’m running).
In model terms: a fast loop that can adapt in real time without doing a full retrain or destabilizing everything else.
Outer loop = long-term consolidation (slow gear)
This is the “sleep on it” mode:
- integrate over time,
- stabilize what matters,
- merge it into durable capability.
In model terms: a slower loop that decides what the system should actually keep, and how it should be integrated into the core.
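To make the two gears concrete, here’s a toy sketch – my illustration, not the actual Nested Learning algorithm. Fast weights adapt on every batch; a slow copy consolidates them at a lower frequency via an exponential moving average:

```python
import copy
import torch
import torch.nn as nn

# Two-speed learner (illustrative only): "fast" weights adapt on every batch,
# while "slow" weights gradually absorb them as the consolidated state.
fast = nn.Linear(16, 16)
slow = copy.deepcopy(fast)
opt = torch.optim.SGD(fast.parameters(), lr=1e-2)

CONSOLIDATE_EVERY = 100   # the slow gear turns once per 100 fast steps
TAU = 0.05                # how much of the fast state the slow gear absorbs

for step in range(1_000):
    x = torch.randn(32, 16)                       # made-up stream of data
    loss = (fast(x) - x).pow(2).mean()            # toy objective
    opt.zero_grad(); loss.backward(); opt.step()  # inner loop: fast adaptation

    if step % CONSOLIDATE_EVERY == 0:
        # Outer loop: slowly fold the fast weights into the stable copy.
        with torch.no_grad():
            for p_slow, p_fast in zip(slow.parameters(), fast.parameters()):
                p_slow.mul_(1 - TAU).add_(TAU * p_fast)
```

In a real system the slow gear would also decide what not to keep; here it just averages, which is the simplest possible consolidation rule.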
Why this matters
Previous generations of updates often forced a painful trade-off:
- You teach the model something new…
- …and risk overwriting something old.
Nested Learning separates these processes:
- the fast gear grabs new info and makes it usable now,
- the slow gear consolidates it safely later.
If this works at scale, it’s a path toward models that don’t just retrieve knowledge – they accumulate it, refine it, and improve through use.
And once you connect that back to:
- DeepMind hiring Windsurf leadership + licensing Windsurf tech,
- shipping Gemini 3 as their flagship multimodal model,
- and releasing Antigravity as an agentic coding platform,
…it becomes pretty obvious:
They’re not just building a better assistant. They’re building the learning loop for agents.
Conclusion: The BOOM Moment
This is a massive moment. Honestly, it feels as impactful as the original discovery of Transformers, the foundation of the entire AI revolution and every modern Large Language Model.
We are moving away from static machines and toward something that actually evolves.
I mean my heart’s beating, my heart’s beating. My hands are shakin’, my hands are shakin’, but I’m still going because it’s actually happening. It’s remembering. BOOM. Continuous context. BOOM. Learning. We are officially in a new era.

