Releases 70B-parameter models that operate entirely on bytes, effectively 'liberating' LLMs from static tokenizers.
arXiv · March 18, 2026 · 2603.15953
The Takeaway
Challenges the standard assumption that large models require a fixed vocabulary. The byte-level architecture improves text compression and robustness to spelling and domain variation, providing a path toward truly language-agnostic and more efficient foundation models.
From the abstract
Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder […]
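To make the tokenizer-free idea concrete, here is a minimal, illustrative Python sketch of a byte-level frontend: raw UTF-8 bytes serve directly as input IDs, and a simple whitespace splitter (an assumption for illustration, not necessarily the paper's exact boundary rule) groups bytes into word-level chunks that a hierarchical encoder could pool into higher-level representations.

```python
import re
from typing import List

def bytes_to_ids(text: str) -> List[int]:
    # UTF-8 bytes map directly to integer IDs in [0, 255]; no learned vocabulary needed.
    return list(text.encode("utf-8"))

def group_into_words(text: str) -> List[List[int]]:
    # Hypothetical whitespace-based splitter: each word plus trailing whitespace becomes
    # one byte chunk that a hierarchical encoder could pool into a single word-level embedding.
    chunks = re.findall(r"\S+\s*", text) or [text]
    return [list(chunk.encode("utf-8")) for chunk in chunks]

if __name__ == "__main__":
    sample = "Byte-level models need no tokenizer."
    print(bytes_to_ids(sample)[:10])                    # first ten raw byte IDs
    print([len(c) for c in group_into_words(sample)])   # bytes per word-level chunk
```

The sketch only shows the input side; the appeal of a hierarchy like HAT's is that a backbone can operate over the shorter word-level sequence while the byte level preserves full fidelity to arbitrary spellings, scripts, and domains.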