An open-source family of language models for Kazakh that outperforms much larger multilingual models by using a language-specific tokenizer.
March 24, 2026
Original Paper
SozKZ: Training Efficient Small Language Models for Kazakh from Scratch
arXiv · 2603.20854
The Takeaway
SozKZ provides a blueprint for building high-performance models for a low-resource language with over 22 million speakers, at a fraction of the parameter count of multilingual alternatives. The release includes weights, code, and a custom 50K BPE tokenizer, enabling local deployment on modest hardware.
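Because the models top out at roughly 600M parameters, local inference should be possible with a standard Hugging Face workflow. The sketch below is a minimal example of that workflow, assuming the weights are published on the Hugging Face Hub; the repo id `sozkz/sozkz-600m` is a placeholder, not a confirmed release name.

```python
# Minimal sketch of local inference with a small causal LM.
# Assumption: the weights live on the Hugging Face Hub under a repo id
# like "sozkz/sozkz-600m" (placeholder, not confirmed by the release).
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "sozkz/sozkz-600m"  # hypothetical repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)  # ~600M params fits comfortably in CPU RAM

prompt = "Қазақстанның астанасы"  # "The capital of Kazakhstan"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```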
From the abstract
Kazakh, a Turkic language spoken by over 22 million people, remains underserved by existing multilingual language models, which allocate minimal capacity to low-resource languages and employ tokenizers ill-suited to agglutinative morphology. We present SozKZ, a family of Llama-architecture language models (50M-600M parameters) trained entirely from scratch on 9 billion tokens of Kazakh text with a dedicated 50K BPE tokenizer. We evaluate all models on three Kazakh benchmarks -- multiple-choice c
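A dedicated tokenizer is central to the approach, since a vocabulary fit to Kazakh's agglutinative morphology keeps token counts per word low. The sketch below shows how a 50K-vocabulary BPE tokenizer might be trained with the Hugging Face `tokenizers` library; the corpus file, pre-tokenizer, and special tokens are illustrative assumptions, not the paper's exact pipeline.

```python
# Minimal sketch: train a 50K-vocab BPE tokenizer on a Kazakh corpus.
# Assumptions: corpus path, whitespace pre-tokenization, and special tokens
# are placeholders; the paper's actual training setup may differ.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=50_000,                        # matches the paper's 50K vocabulary
    special_tokens=["<unk>", "<s>", "</s>"],  # assumed special tokens
)
tokenizer.train(files=["kazakh_corpus.txt"], trainer=trainer)  # placeholder corpus file
tokenizer.save("sozkz_bpe_50k.json")
```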