Most AI models are reading DNA 'grammar' wrong because they treat it like human language instead of an evolutionary map.
April 14, 2026
Original Paper
EvoLen: Evolution-Guided Tokenization for DNA Language Model
arXiv · 2604.08698
The Takeaway
Using standard LLM techniques on DNA misses how biology actually works. This 'EvoLen' approach uses species-wide evolutionary signals to find the real functional 'words' of life, making genetic predictions far more accurate.
From the abstract
Tokens serve as the basic units of representation in DNA language models (DNALMs), yet their design remains underexplored. Unlike natural language, DNA lacks inherent token boundaries or predefined compositional rules, making tokenization a fundamental modeling decision rather than a naturally specified one. While existing approaches like byte-pair encoding (BPE) excel at capturing token structures that reflect human-generated linguistic regularities, DNA is organized by biological function and …
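To see why frequency-driven tokenization can miss biological structure, consider a toy BPE implementation (not the paper's method; purely an illustration). BPE greedily merges the most frequent adjacent pair of tokens, so on a DNA string it fuses whatever repeats often, with no notion of functional units like motifs or codons:

```python
from collections import Counter

def bpe_merges(seq, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent token pair. Merges are driven purely by frequency,
    with no notion of biological function."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)  # apply the learned merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

# A repetitive stretch gets fused into ever-larger tokens,
# regardless of whether those spans mean anything biologically.
tokens, merges = bpe_merges("TATATATAGC", num_merges=2)
print(tokens)   # ['TATA', 'TATA', 'G', 'C']
print(merges)   # ['TA', 'TATA']
```

EvoLen's premise, per the abstract, is to replace this purely statistical signal with species-wide evolutionary conservation when deciding where token boundaries fall.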