# Frontier Hugging Face releases with extreme scale (2021–2025)
| Model | Best at | Release (YYYY-MM) | Context tokens | Total params | Dense / active params | Experts | Layers | FFN width | Vocab | Modality |
|---|---|---|---|---|---|---|---|---|---|---|
| nvidia Llama-3.1 Nemotron-8B-UltraLong-4M-Instruct | longest single-sequence context (4 M) | 2025-03 | 4 000 000 | 8.04 B* | 8.04 B | — | 32 | 14 336 | 129 024 | text |
| google Switch-C-2048 | largest released parameter count (1.6 T MoE) | 2022-11 | ≈2 048† | 1.6 T | ≈1.1 B‡ | 2 048 | 15 | 6 144 | 32 128 | text |
| microsoft MT-NLG 530B | largest dense model with public paper | 2022-01 | 2 048† | 530 B | 530 B | — | — | — | ≈50 k | text |
| hpcai-tech Grok-1 | widest FFN (32 768) + 8-expert MoE | 2024-03 | 8 192 | 314 B* | ≈79 B§ | 8 | 64 | 32 768 | 131 072 | text |
| unsloth Gemma-3 1B-IT | largest vocab (262 k) in < 2 B params, 32 k context | 2025-03 | 32 768 | 1 B* | 1 B | — | 26 | 6 912 | 262 144 | text |
\* Parameter counts taken from the model name or original paper; not explicitly enumerated in the HF artefact.
† No official figure; the model family was consistently trained and evaluated with 2 048-token windows.
‡ Top-1 routing: one expert (1.6 T / 2 048 ≈ 0.78 B) active per token; adding shared attention and embedding weights gives ≈1.1 B effective dense compute.
§ Two of eight experts active (25 %) ⇒ 314 B × 0.25 ≈ 78.5 B.
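The footnote arithmetic generalizes to any top-k router. A minimal sketch (the `shared_params` term is an optional refinement for attention/embedding weights; left at zero it reproduces the rough total-over-experts shortcuts used above):

```python
def active_params(total_params: float, n_experts: int, top_k: int,
                  shared_params: float = 0.0) -> float:
    """Estimate per-token active parameters for a top-k MoE.

    Assumes expert weights dominate: each expert holds roughly
    (total - shared) / n_experts parameters, and top_k experts
    plus all shared weights fire on every token.
    """
    expert_params = (total_params - shared_params) / n_experts
    return shared_params + top_k * expert_params

# Switch-C-2048: top-1 routing over 2 048 experts (per-expert slice only)
print(f"{active_params(1.6e12, 2048, 1) / 1e9:.2f} B")  # → 0.78 B
# Grok-1: top-2 routing over 8 experts (the x0.25 shortcut in footnote §)
print(f"{active_params(314e9, 8, 2) / 1e9:.1f} B")      # → 78.5 B
```

Setting `shared_params` to a nonzero estimate is what lifts the Switch-C figure from the 0.78 B per-expert slice toward the ≈1.1 B effective total.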
## Metric definitions

- Context tokens – maximum tokens per forward pass
- Total params – all trainable weights
- Dense/active params – identical for dense models; MoE rows show total, then per-token active
- Experts – independent FFN blocks addressable by the router
- Layers – transformer blocks
- FFN width – inner dimension of each block's feed-forward sub-layer
- Vocab – tokenizer size
- Modality – native I/O types
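Most of these columns map directly onto fields in a checkpoint's Hugging Face `config.json`. A minimal sketch of the mapping (field names follow the Llama-family convention; the values are illustrative, roughly matching the Nemotron row, and not read from a live checkpoint):

```python
import json

# Illustrative config.json fragment, not a real artefact.
config_json = """{
  "max_position_embeddings": 4194304,
  "num_hidden_layers": 32,
  "intermediate_size": 14336,
  "vocab_size": 129024
}"""

cfg = json.loads(config_json)
row = {
    "Context tokens": cfg["max_position_embeddings"],
    "Layers": cfg["num_hidden_layers"],
    "FFN width": cfg["intermediate_size"],
    "Vocab": cfg["vocab_size"],
}
print(row)
```

Total and active parameter counts are the exception: they usually have to be taken from the model name, card, or paper, as footnote \* notes.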
## Optimal usage snapshots
| Model | When it shines | Typical load-out |
|---|---|---|
| Nemotron-8B-4M | full-book scan, code-base refactor without chunking | ≥4 × H100 80 GB; sliding-window > 3 M; stream outputs |
| Switch-C-2048 | trillion-scale scaling-law research | experts on CPU via DeepSpeed ZeRO; top-1 routing; micro-batch ≤ 8 |
| MT-NLG 530B | highest-fluency dense generation at 530 B scale (weights never publicly released) | 8 × A100 80 GB; 8-bit load; LoRA for domain transfer |
| Grok-1 | sparse-routing experiments, expert pruning | 8 × H800 80 GB; vLLM or Colossal-AI for MoE serving |
| Gemma-3 1B-IT | on-device RAG, edge chat assistants | consumer GPU ≥ 6 GB; 4-/8-bit quantization for CPU-only setups |
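The multi-GPU figure in the Nemotron row is driven by KV-cache growth rather than weights. A quick estimate, assuming the usual Llama-3.1-8B attention geometry (32 layers, 8 grouped KV heads, head dim 128; these figures are assumptions, not read from the Nemotron artefact):

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_val: int = 2) -> float:
    """GiB of KV cache for ONE sequence: two tensors (K and V) per layer,
    each holding context_tokens x n_kv_heads x head_dim values."""
    total_bytes = (2 * context_tokens * n_layers * n_kv_heads
                   * head_dim * bytes_per_val)
    return total_bytes / 2**30

# Full 4 M-token window at fp16 (2 bytes per value)
print(f"{kv_cache_gib(4_000_000, 32, 8, 128):.0f} GiB")  # → 488 GiB
```

At fp16 the full cache already exceeds four 80 GB cards, which is why the load-out pairs multi-GPU serving with sliding-window attention and streamed outputs rather than materializing the whole window.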
## References

- Nemotron config + 4 M window + release (huggingface.co)
- Switch-C model-card heading + config + release (huggingface.co)
- MT-NLG paper page + PDF (huggingface.co)
- Grok-1 config + release (huggingface.co)
- Gemma-3 config + README context + release (huggingface.co)