Foundation Models
Confidence 0.80 · 2 sources · last confirmed 2026-04-30
Large pretrained models — typically transformer-based — that serve as the substrate for downstream AI applications via prompting, fine-tuning, or retrieval. Foundation models drive most modern generative-ai capability and are the locus of the “frontier” debate (capability, safety, transparency).
Working definition
A foundation model is a model trained on broad data at scale that can be adapted to a wide range of downstream tasks. The term, coined by Stanford’s Center for Research on Foundation Models (CRFM), foregrounds the adaptation role — the model is a foundation other things are built on, not the end product.
Frontier model is a near-synonym foregrounding capability (top of the leaderboard). The two terms diverge in policy/regulatory contexts: “frontier” often implies the regulator-relevant subset of foundation models above a capability threshold.
Key claims
Industry has decisively taken the frontier
- ~90% of notable AI models in 2024 came from industry (vs. 60% in 2023); 91.18% in 2025 per the AI Index 2026 update. Academia remains the leading source of highly cited research, but not of new notable models.
- 2024: U.S. 40 notable models, China 15, Europe 3.
- 2025 (AI Index 2026): U.S. 59, China 35, South Korea 8, Europe 2.
- Top organizations 2025: OpenAI 20, Google 14, Alibaba 11, Anthropic 7, xAI 5, DeepSeek 4, LG AI Research 4, Meta 4, Tsinghua University 4 (only academic institution in the top 9), ByteDance / Moonshot / Nvidia 3 each.
- The U.S.-China performance gap has effectively closed: DeepSeek-R1 briefly matched the top U.S. model in Feb 2025; as of March 2026, Anthropic’s top model leads by just 2.7%.
Compute is scaling fast
- Training compute for notable models doubles every ~5 months (growth rates annualized in the sketch after this list).
- Dataset size for training LLMs doubles every ~8 months.
- Power required for training doubles annually.
- Global AI compute capacity reached 17.1M H100-equivalents by 2025, growing 3.3× per year since 2022 (AI Index 2026 §1.2).
- Nvidia: >60% of total compute; Google + Amazon supply most of the remainder; Huawei holds a small but growing share.
- U.S. hosts 5,427 AI data centers — more than 10× any other country.
- TSMC fabricates almost every leading AI chip; TSMC's U.S. expansion began operating in 2025.
- AI data center power capacity: 29.6 GW — comparable to New York state at peak demand.
- Carbon emissions trajectory:
- AlexNet (2012): 0.01 tons
- GPT-3 (2020): 588 tons
- GPT-4 (2023): 5,184 tons
- Llama 3.1 405B (2024): 8,930 tons
- Grok 4 (2025): 72,816 tons (AI Index 2026)
- (Reference: average American = 18 tons/year.)
- Annual GPT-4o inference water use alone may exceed the drinking water needs of 1.2 million people.
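The growth-rate claims above use different units (doubling times, annual multipliers, absolute totals). A minimal arithmetic sketch putting them on a common footing; all inputs are the figures from the bullets above, and only the conversions are added:

```python
# Convert each doubling time d (months) to an implied annual growth factor 2**(12/d).
for label, d in [("training compute", 5), ("dataset size", 8), ("training power", 12)]:
    print(f"{label}: doubles every ~{d} mo -> x{2 ** (12 / d):.2f}/yr")
# training compute -> x5.28/yr, dataset size -> x2.83/yr, training power -> x2.00/yr

# 3.3x/yr growth since 2022 back-solves to the implied 2022 base:
print(f"implied 2022 capacity: {17.1 / 3.3 ** 3:.2f}M H100-equivalents")  # ~0.48M

# Training-emission markers restated in average-American person-years (18 t/yr):
for model, tons in [("GPT-4", 5_184), ("Llama 3.1 405B", 8_930), ("Grok 4", 72_816)]:
    print(f"{model}: ~{tons / 18:,.0f} person-years")
# GPT-4 ~288, Llama 3.1 405B ~496, Grok 4 ~4,045
```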
Disclosure is dropping (AI Index 2026)
- Training code, dataset sizes, and parameter counts are increasingly withheld for the most resource-intensive systems, including those from OpenAI, Anthropic, and Google.
- Reported parameter counts have stayed near 1 trillion for three years as disclosure dropped, but training compute, which can be estimated independently from hardware (see the sketch after this list), has continued to rise.
- OLMo 3.1 Think 32B, with ~90× fewer parameters than Grok 4, achieves comparable benchmark results via data pruning, deduplication, and curation alone: evidence that data quality and post-training matter as much as scale.
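The hardware-based compute estimate works roughly as sketched below. This is a generic illustration of the standard estimation method (chip count × peak throughput × utilization × wall-clock time), not a formula from the AI Index, and every input in the example is an illustrative assumption rather than a disclosed figure for any model:

```python
def training_flops(num_chips: int, peak_flops_per_chip: float,
                   utilization: float, days: float) -> float:
    """Estimate total training compute from hardware: chips x FLOP/s x MFU x seconds."""
    return num_chips * peak_flops_per_chip * utilization * days * 86_400

# Hypothetical run: 20k H100-class chips (~1e15 dense BF16 FLOP/s each),
# 40% model-FLOPs utilization, 90 days of training -- all assumed values.
print(f"~{training_flops(20_000, 1e15, 0.40, 90):.1e} FLOPs")  # ~6.2e25
```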
Open-weight catching up to closed-weight
- Performance gap between the top closed-weight and top open-weight models on the Chatbot Arena Leaderboard: 8.0% in early Jan 2024 → 1.7% by Feb 2025, with similar narrowing on some other benchmarks.
- The frontier is also tightening overall: top-2 model gap = 0.7%, top-10 gap = 5.4% (down from 4.9% and 11.9%, respectively, the year before).
U.S.-China performance gap narrowing fast
- End-2023 vs. end-2024 gaps on major benchmarks:
- MMLU: 17.5pp → 0.3pp
- MMMU: 13.5pp → 8.1pp
- MATH: 24.3pp → 1.6pp
- HumanEval: 31.6pp → 3.7pp
Transparency improving
- Foundation Model Transparency Index (CRFM): avg score among major developers 37% (Oct 2023) → 58% (May 2024). Substantial progress, with a ~42-point gap to full transparency remaining. See responsible-ai.
Smaller is mighty
- 142× reduction in the parameters needed to clear the same MMLU >60% threshold, in two years: PaLM (540B params, 2022) → Phi-3-mini (3.8B, 2024); arithmetic checked in the sketch below.
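A quick check of the two size-ratio markers on this page; note the second line merely back-solves what the ~90× OLMo claim above implies about Grok 4's undisclosed parameter count, not a reported figure:

```python
print(f"PaLM / Phi-3-mini: {540 / 3.8:.0f}x")                 # 142x (540B vs. 3.8B params)
print(f"implied Grok 4 size: ~{32 * 90 / 1000:.1f}T params")  # from "~90x fewer" at 32B
```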
Notable foundation model series (mentioned in the AI Index 2025)
To be promoted to standalone entity pages when discussed in depth in another source. Currently noted here as a roster:
- OpenAI: GPT-4, GPT-4o, o1, o3 (test-time compute reasoning), Sora (video).
- Google DeepMind: Gemini family (Gemini-1.5-Flash-8B is the 280×-cost-reduction marker), Veo 2 (video).
- Anthropic: Claude 3 family (incl. Sonnet — implicit-bias study).
- Meta: Llama 3.1 405B (the 8,930-ton-CO2 marker), Movie Gen (video).
- Microsoft: Phi-3-mini (the 3.8B-param-MMLU marker).
- Mistral AI: French developer of open-weight models.
- xAI: Grok family.
Debates / contradictions
- “Frontier” vs. “foundation” framing. “Frontier” emphasizes capability gap; “foundation” emphasizes adaptation role. Different policy/regulation implications — frontier-model bills target capability thresholds; foundation-model bills target the broader pretraining-then-adapt pattern.
- Compute-scaling sustainability. Data-commons shrinkage (see responsible-ai) plus rising energy demands (driving nuclear-energy partnerships: Microsoft's Three Mile Island restart, Google's and Amazon's SMR deals) raise structural questions about whether the 5-month compute-doubling trajectory can continue.
- Open-weight closing the gap. As open-weight performance catches up to closed-weight, the policy logic for restricting model release weakens, but so does the commercial moat for closed-weight providers. How 2025–2026 plays out remains an open question.
Related concepts
- generative-ai — the dominant use of foundation models today
- ai-benchmarks — how foundation-model capabilities are evaluated
- responsible-ai — the transparency, safety, and governance overlay
- enterprise-ai-adoption — the deployment context