# Frontier Hugging Face releases with extreme scale (2021–2025)
| Model | Best at | Release (YYYY-MM) | Context tokens | Total params | Dense / active params | Experts | Layers | FFN width | Vocab | Modality |
|---|---|---|---|---|---|---|---|---|---|---|
| nvidia Llama-3.1 Nemotron-8B-UltraLong-4M-Instruct | longest single-sequence context (4 M) | 2025-03 | 4 000 000 | 8.04 B* | 8.04 B | — | 32 | 14 336 | 129 024 | text |
| google Switch-C-2048 | largest released parameter count (1.6 T MoE) | 2022-11 | ≈2 048† | 1.6 T | ≈1.1 B‡ | 2 048 | 15 | 6 144 | 32 128 | text |
| microsoft MT-NLG 530B | largest dense model with public paper | 2022-01 | 2 048† | 530 B | 530 B | — | — | — | ≈50 k | text |
| hpcai-tech Grok-1 | widest FFN (32 768) + 8-expert MoE | 2024-03 | 8 192 | 314 B* | ≈79 B§ | 8 | 64 | 32 768 | 131 072 | text |
| unsloth Gemma-3 1B-IT | largest vocab (262 k) in < 2 B params, 32 k context | 2025-03 | 32 768 | 1 B* | 1 B | — | 26 | 6 912 | 262 144 | text |
\* Parameter counts taken from the model name or original paper; not explicitly enumerated in the HF artefact.
† No official figure; the model family was consistently trained and evaluated with 2 048-token windows.
‡ Top-1 routing: one expert (1.6 T / 2 048 ≈ 0.78 B) active per token; adding shared attention and embedding weights gives ≈1.1 B effective dense compute.
§ Two of eight experts active (25 %) ⇒ 314 B × 0.25 ≈ 78.5 B.
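The footnote arithmetic generalizes to any top-k router. A minimal sketch (the `shared_params` term is an optional refinement for attention/embedding weights; left at zero it reproduces the rough total-over-experts shortcuts used above):

```python
def active_params(total_params: float, n_experts: int, top_k: int,
                  shared_params: float = 0.0) -> float:
    """Estimate per-token active parameters for a top-k MoE.

    Assumes expert weights dominate: each expert holds roughly
    (total - shared) / n_experts parameters, and top_k experts
    plus all shared weights fire on every token.
    """
    expert_params = (total_params - shared_params) / n_experts
    return shared_params + top_k * expert_params

# Switch-C-2048: top-1 routing over 2 048 experts (per-expert slice only)
print(f"{active_params(1.6e12, 2048, 1) / 1e9:.2f} B")  # → 0.78 B
# Grok-1: top-2 routing over 8 experts (the x0.25 shortcut in footnote §)
print(f"{active_params(314e9, 8, 2) / 1e9:.1f} B")      # → 78.5 B
```

Setting `shared_params` to a nonzero estimate is what lifts the Switch-C figure from the 0.78 B per-expert slice toward the ≈1.1 B effective total.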
## Metric definitions

- Context tokens – maximum tokens per forward pass
- Total params – all trainable weights
- Dense/active params – identical for dense models; MoE rows show total, then per-token active
- Experts – independent FFN blocks addressable by the router
- Layers – transformer blocks
- FFN width – inner dimension of each block's feed-forward sub-layer
- Vocab – tokenizer size
- Modality – native I/O types
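Most of these columns map directly onto fields in a checkpoint's Hugging Face `config.json`. A minimal sketch of the mapping (field names follow the Llama-family convention; the values are illustrative, roughly matching the Nemotron row, and not read from a live checkpoint):

```python
import json

# Illustrative config.json fragment, not a real artefact.
config_json = """{
  "max_position_embeddings": 4194304,
  "num_hidden_layers": 32,
  "intermediate_size": 14336,
  "vocab_size": 129024
}"""

cfg = json.loads(config_json)
row = {
    "Context tokens": cfg["max_position_embeddings"],
    "Layers": cfg["num_hidden_layers"],
    "FFN width": cfg["intermediate_size"],
    "Vocab": cfg["vocab_size"],
}
print(row)
```

Total and active parameter counts are the exception: they usually have to be taken from the model name, card, or paper, as footnote \* notes.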
## Optimal usage snapshots
| Model | When it shines | Typical load-out |
|---|---|---|
| Nemotron-8B-4M | full-book scan, code-base refactor without chunking | ≥4 × H100 80 GB; sliding-window > 3 M; stream outputs |
| Switch-C-2048 | trillion-scale scaling-law research | experts on CPU via DeepSpeed ZeRO; top-1 routing; micro-batch ≤ 8 |
| MT-NLG 530B | highest-fluency dense generation at 530 B scale (weights never publicly released) | 8 × A100 80 GB; 8-bit load; LoRA for domain transfer |
| Grok-1 | sparse-routing experiments, expert pruning | 8 × H800 80 GB; vLLM or Colossal-AI for MoE serving |
| Gemma-3 1B-IT | on-device RAG, edge chat assistants | consumer GPU ≥ 6 GB; 4-/8-bit quantization for CPU-only setups |
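The multi-GPU figure in the Nemotron row is driven by KV-cache growth rather than weights. A quick estimate, assuming the usual Llama-3.1-8B attention geometry (32 layers, 8 grouped KV heads, head dim 128; these figures are assumptions, not read from the Nemotron artefact):

```python
def kv_cache_gib(context_tokens: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, bytes_per_val: int = 2) -> float:
    """GiB of KV cache for ONE sequence: two tensors (K and V) per layer,
    each holding context_tokens x n_kv_heads x head_dim values."""
    total_bytes = (2 * context_tokens * n_layers * n_kv_heads
                   * head_dim * bytes_per_val)
    return total_bytes / 2**30

# Full 4 M-token window at fp16 (2 bytes per value)
print(f"{kv_cache_gib(4_000_000, 32, 8, 128):.0f} GiB")  # → 488 GiB
```

At fp16 the full cache already exceeds four 80 GB cards, which is why the load-out pairs multi-GPU serving with sliding-window attention and streamed outputs rather than materializing the whole window.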
## References

- Nemotron config + 4 M window + release (huggingface.co)
- Switch-C model-card heading + config + release (huggingface.co)
- MT-NLG paper page + PDF (huggingface.co)
- Grok-1 config + release (huggingface.co)
- Gemma-3 config + README context + release (huggingface.co)