Frontier Hugging Face releases with extreme scale (2021 – 2025)

| Model | Best at | Release (YYYY-MM) | Context tokens | Total params | Dense / active params | Experts | Layers | FFN width | Vocab | Modality |
|---|---|---|---|---|---|---|---|---|---|---|
| nvidia Llama-3.1 Nemotron-8B-UltraLong-4M-Instruct | longest single-sequence context (4 M) | 2025-03 | 4 000 000 | 8.04 B* | 8.04 B | — | 32 | 14 336 | 129 024 | text |
| google Switch-C-2048 | largest released parameter count (1.6 T MoE) | 2022-11 | ≈2 048† | 1.6 T | ≈11 B‡ | 2 048 | 15 | 6 144 | 32 128 | text |
| microsoft MT-NLG 530B | largest dense model with public paper | 2022-01 | 2 048† | 530 B | 530 B | — | — | — | ≈50 k | text |
| hpcai-tech Grok-1 | widest FFN (32 768) + 8-expert MoE | 2024-03 | 8 192 | 314 B* | ≈79 B§ | 8 | 64 | 32 768 | 131 072 | text |
| unsloth Gemma-3 1B-IT | largest vocab (262 k) in < 2 B params, 32 k context, multimodal | 2025-03 | 32 768 | 1 B* | 1 B | — | 26 | 6 912 | 262 144 | text + image |

\* Parameter counts taken from the model name or original paper; not explicitly enumerated in the HF artefact.
† No official figure; the model family was consistently trained/evaluated with 2 048-token windows.
‡ Top-1 routing: only one of 2 048 experts (1.6 T / 2 048 ≈ 0.8 B params) is active per token; the table's ≈11 B effective dense compute additionally counts the shared non-expert weights.
§ Two of eight experts active (25 %) ⇒ 314 B × 0.25 ≈ 78.5 B.
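The footnote arithmetic generalises to any top-k MoE. A minimal sketch (the helper is hypothetical; all figures come from the table above):

```python
# Hypothetical helper reproducing the footnote arithmetic for top-k
# mixture-of-experts active-parameter estimates.

def moe_active_params(total_params: float, n_experts: int, k_active: int,
                      shared_params: float = 0.0) -> float:
    """Rough per-token active parameters: shared (attention/embedding)
    weights plus k of n expert FFNs."""
    expert_params = total_params - shared_params
    return shared_params + expert_params * k_active / n_experts

# Grok-1 (footnote §): 2 of 8 experts active -> 314 B x 0.25 = 78.5 B
grok1 = moe_active_params(314e9, n_experts=8, k_active=2)
print(f"{grok1 / 1e9:.1f} B")  # 78.5 B

# Switch-C-2048 (footnote ‡): top-1 over 2 048 experts. Naive division
# gives only the per-expert size (~0.8 B); the table's ~11 B effective
# figure also counts shared non-expert weights.
switch_expert = moe_active_params(1.6e12, n_experts=2048, k_active=1)
print(f"{switch_expert / 1e9:.2f} B")  # 0.78 B
```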

Metric definitions

- **Context tokens** – maximum tokens per forward pass
- **Total params** – all trainable weights
- **Dense / active params** – identical for dense models; MoE rows show total, then per-token active
- **Experts** – independent FFN blocks addressable by the router
- **Layers** – transformer blocks
- **FFN width** – intermediate (feed-forward) dimension inside each block
- **Vocab** – tokenizer size
- **Modality** – native I/O types
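These columns are enough for a back-of-the-envelope parameter count. The sketch below assumes the Llama-3.1-8B geometry behind the Nemotron row (d_model 4 096, 8 KV heads of dim 128, SwiGLU FFN, untied embeddings) — none of which appears in the table, so treat it as illustrative rather than authoritative:

```python
# Rough parameter-count estimator for a decoder-only transformer.
# Assumptions (not in the table): GQA attention, SwiGLU FFN, untied
# input/output embeddings; norms and biases are ignored.

def estimate_params(layers: int, d_model: int, ffn_width: int, vocab: int,
                    n_kv_heads: int = 8, head_dim: int = 128,
                    tied: bool = False) -> float:
    kv = 2 * d_model * n_kv_heads * head_dim   # K and V projections (GQA)
    attn = 2 * d_model * d_model + kv          # Q and O are full-width
    ffn = 3 * d_model * ffn_width              # SwiGLU: gate, up, down
    embed = (1 if tied else 2) * vocab * d_model
    return layers * (attn + ffn) + embed

# Nemotron row: 32 layers, FFN 14 336, vocab 129 024; d_model 4096 assumed
print(f"{estimate_params(32, 4096, 14336, 129024) / 1e9:.2f} B")  # 8.04 B
```

The estimate lands on the table's 8.04 B figure, which is a useful sanity check on the listed layer count, FFN width, and vocab.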

Optimal usage snapshots

| Model | When it shines | Typical load-out |
|---|---|---|
| Nemotron-8B-4M | full-book scan, code-base refactor without chunking | ≥4 × H100 80 GB; sliding-window > 3 M; stream outputs |
| Switch-C-2048 | trillion-scale scaling-law research | experts on CPU via DeepSpeed ZeRO; top-1 routing; micro-batch ≤ 8 |
| MT-NLG 530B | highest-fluency dense open weights | 8 × A100 80 GB; 8-bit load; LoRA for domain transfer |
| Grok-1 | sparse-routing experiments, expert pruning | 8 × H800 80 GB; vLLM or Colossal-AI for MoE serving |
| Gemma-3 1B-IT | on-device RAG with vision grounding | consumer GPU ≥ 6 GB; mix `<image_soft_token>` in prompts |
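The multi-GPU load-outs above are driven less by the weights (an 8 B model is roughly 16 GB in fp16) than by the KV cache. A quick estimate, again assuming the Llama-3.1 GQA geometry (8 KV heads × head_dim 128) for the Nemotron row:

```python
# KV-cache size for a GQA transformer: K and V tensors, per layer,
# per cached token. The head geometry is an assumption carried over
# from the Llama-3.1-8B base model, not stated in the tables.

def kv_cache_bytes(layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_val: int = 2) -> int:
    return 2 * layers * n_kv_heads * head_dim * seq_len * bytes_per_val

gib = kv_cache_bytes(32, 8, 128, 4_000_000) / 2**30  # fp16, full 4 M window
print(f"{gib:.0f} GiB")  # 488 GiB
```

At ~488 GiB for the full 4 M window, the cache alone dwarfs the weights, which is why the load-out layers sliding-window attention and output streaming on top of multi-GPU sharding.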

References

- Nemotron config, 4 M-window documentation, and release notes (huggingface.co)
- Switch-C model card, config, and release notes (huggingface.co)
- MT-NLG paper page and PDF (huggingface.co)
- Grok-1 config and release notes (huggingface.co)
- Gemma-3 config, README context note, and release notes (huggingface.co)