Why Google Runs AI Mode On Flash, Explained By Google’s Chief Scientist

Google Chief Scientist Jeff Dean said Flash’s low latency and cost are why Google can run Search AI at scale. Retrieval is a design choice, not a limitation, he added.

In an interview on the Latent Space podcast, Dean explained why Flash became the production tier for Search. He also laid out why the pipeline that narrows the web to a handful of documents will likely persist.

Google started rolling out Gemini 3 Flash as the default for AI Mode in December. Dean’s interview explains the rationale behind that decision.

Why Flash Is The Production Tier

Dean called latency the critical constraint for running AI in Search. As models handle longer and more complex tasks, speed becomes the bottleneck.

“Having low latency systems that can do that seems really important, and flash is one direction, one way of doing that.”

Podcast hosts noted Flash’s dominance across services like Gmail and YouTube. Dean said Search is part of that expansion, with Flash’s use growing across AI Mode and AI Overviews.

Flash can serve at this scale because of distillation. Each generation’s Flash inherits the previous generation’s Pro-level performance, getting more capable without getting more expensive to run.

“For multiple Gemini generations now, we’ve been able to make the sort of flash version of the next generation as good or even substantially better than the previous generation’s pro.”

That’s the mechanism that makes the architecture sustainable. Google pushes frontier models for capability development, then distills those capabilities into Flash for production deployment. Flash is the tier Google designed to run at search scale.

Retrieval Over Memorization

Beyond Flash’s role in search, Dean described a design philosophy that keeps external content central to how these models work. Models shouldn’t waste capacity storing facts they can retrieve.

“Having the model devote precious parameter space to remember obscure facts that could be looked up is actually not the best use of that parameter space.”

Retrieval from external sources is a core capability, not a workaround. The model looks things up and works through the results rather than carrying everything internally.

Why Staged Retrieval Likely Persists

AI search can’t read the entire web at once. Current attention mechanisms are quadratic, meaning computational cost grows with the square of the context length, so doubling the context roughly quadruples the attention cost. Dean said “a million tokens kind of pushes what you can do.” Scaling to a billion or a trillion isn’t feasible with existing methods.
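
To put the quadratic point in numbers, here is a rough back-of-the-envelope sketch. It assumes plain full self-attention, where every token is scored against every other token, and it counts only those pairwise scores for a single head in a single layer. The function names are illustrative, and nothing here reflects how Gemini is actually implemented.

```python
def attention_pairs(context_tokens: int) -> int:
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return context_tokens * context_tokens


def relative_cost(context_tokens: int, baseline_tokens: int = 1_000_000) -> float:
    """How many times more pairwise scores than a baseline context length."""
    return attention_pairs(context_tokens) / attention_pairs(baseline_tokens)


if __name__ == "__main__":
    # Compare a billion and a trillion tokens against a million-token baseline.
    for tokens in (1_000_000, 1_000_000_000, 1_000_000_000_000):
        print(f"{tokens:>17,} tokens -> {relative_cost(tokens):,.0f}x the baseline")
```

Under these assumptions, going from a million tokens to a billion multiplies the pairwise work by a million, and going to a trillion multiplies it by a trillion.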

Dean’s long-term vision is models that give the “illusion” of attending to trillions of tokens. Reaching that requires new techniques, not just scaling what exists today. Until then, AI search will likely keep narrowing a broad candidate pool to a handful of documents before generating a response.
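
The narrowing step can be pictured as a two-stage funnel. The sketch below is a toy illustration only, not Google's pipeline: the `Doc` type, the word-overlap filter, the Jaccard reranker, and the cutoff sizes are all hypothetical stand-ins for whatever retrieval and ranking signals a real system uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    url: str
    text: str


def words(text: str) -> set[str]:
    """Lowercased whitespace tokens; a stand-in for real text analysis."""
    return set(text.lower().split())


def cheap_filter(query: str, pool: list[Doc], keep: int) -> list[Doc]:
    """Stage 1: an inexpensive score (shared word count) over the broad pool."""
    q = words(query)
    ranked = sorted(pool, key=lambda d: len(q & words(d.text)), reverse=True)
    return ranked[:keep]


def rerank(query: str, candidates: list[Doc], keep: int) -> list[Doc]:
    """Stage 2: a costlier score (Jaccard similarity) on the survivors only."""
    q = words(query)

    def score(doc: Doc) -> float:
        t = words(doc.text)
        union = q | t
        return len(q & t) / len(union) if union else 0.0

    return sorted(candidates, key=score, reverse=True)[:keep]


def staged_retrieve(query: str, pool: list[Doc],
                    stage1_keep: int = 100, final_keep: int = 3) -> list[Doc]:
    """Narrow a broad pool to the few documents a model would actually read."""
    return rerank(query, cheap_filter(query, pool, stage1_keep), final_keep)
```

The point of the design is that the expensive step only ever sees a small candidate set, so the model's limited context is spent on a handful of documents rather than the whole pool.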

Why This Matters

The model reading your content in AI Mode is getting better each generation. But it’s optimized for speed over reasoning depth, and it’s designed to retrieve your content rather than memorize it. Being findable through Google’s existing retrieval and ranking signals is the path into AI search results.

We’ve tracked every model swap in AI Mode and AI Overviews since Google launched AI Mode with Gemini 2.0. Google shipped Gemini 3 to AI Mode on release day, then started rolling out Gemini 3 Flash as the default a month later. Most recently, Gemini 3 became the default for AI Overviews globally.

Every model generation follows the same cycle: frontier models for capability, then distillation into Flash for production. Dean presented this as the architecture Google expects to maintain at search scale, not a temporary fallback.

Looking Ahead

Based on Dean’s comments, staged retrieval is likely to persist until attention mechanisms move past their quadratic limits. Google’s investment in Flash suggests the company expects to use this architecture across multiple model generations.

One change to watch is automatic model selection. Google’s Robby Stein has previously described the concept, which involves routing complex queries to Pro while keeping Flash as the default.

Featured Image: Robert Way/Shutterstock
