AI & ML Scaling Insight

A factorial study on EHR foundation models reveals that joint encoding of code-attribute pairs (local binding) is the primary driver of performance and efficiency.

arXiv · March 18, 2026 · 2603.15644

Lin Lawrence Guo, Santiago Eduardo Arciniegas, Joseph Jihyung Lee, Adam Paul Yan, George Tomlinson, Jason Fries, Lillian Sung

The Takeaway

This finding challenges the need for complex temporal or workflow-specific tokenization schemes in structured-data transformers, showing that simple joint tokenization reduces FLOPs by up to 39.5% while improving accuracy.
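The contrast between local binding and separate encoding can be sketched in a few lines. The event names below are hypothetical, and this is only an illustration of the tokenization idea, not the paper's implementation: fusing each code-attribute pair into one token (local binding) shortens the input sequence versus emitting the code and attribute as separate tokens.

```python
# Hypothetical (clinical code, attribute) pairs from an EHR timeline.
events = [
    ("LAB_GLUCOSE", "HIGH"),
    ("MED_METFORMIN", "500MG"),
    ("LAB_HBA1C", "ELEVATED"),
]

def tokenize_separate(events):
    """One token for the code, one for the attribute; the model must learn the binding."""
    tokens = []
    for code, attr in events:
        tokens.append(code)
        tokens.append(attr)
    return tokens

def tokenize_joint(events):
    """One fused token per code-attribute pair; the binding is precomputed."""
    return [f"{code}|{attr}" for code, attr in events]

sep = tokenize_separate(events)
joint = tokenize_joint(events)

# Joint encoding halves the sequence length in this toy example; since
# transformer attention cost grows quadratically with sequence length,
# shorter inputs translate directly into fewer FLOPs.
print(len(sep))    # 6
print(len(joint))  # 3
```

The trade-off this sketch illustrates: joint tokens enlarge the vocabulary but shrink the sequence, which is where the paper's efficiency gains come from.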

From the abstract

Foundation models for structured electronic health records (EHRs) are pretrained on longitudinal sequences of timestamped clinical events to learn adaptable patient representations. Tokenization -- how these timelines are converted into discrete model inputs -- determines what information is preserved, how efficiently it is encoded, and which relationships must be learned versus precomputed. Yet the impact of tokenization design choices on downstream performance and computational efficiency rema…