Your model's final 'probability' outputs are leaking nearly as much private internal information as its hidden layers.
April 15, 2026
Original Paper
What do your logits know? (The answer may surprise you!)
arXiv · 2604.09885
The Takeaway
This paper shows that 'logits' (the final output scores) leak significant task-irrelevant information about an input image. Previously, we thought only the internal 'hidden' states were a privacy risk, but it turns out the very final layer is just as chatty. This means that even if you only expose the final API response, an attacker can still reconstruct high-fidelity details about the input data. For practitioners, this is a major security warning: protecting your internal weights isn't enough to prevent data leakage. You may need to inject noise or restrict logit access to keep data truly private.
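To make the mitigation concrete, here is a minimal sketch of what "inject noise or restrict logit access" could look like at the API boundary. This is an illustration, not the paper's method: the function name, parameters, and the choice of Laplace noise plus top-k masking are assumptions for demonstration.

```python
import numpy as np

def harden_logits(logits, top_k=1, noise_scale=0.1, seed=None):
    """Reduce information leaked through an API's logit output.

    Two illustrative mitigations (hypothetical, not from the paper):
    add Laplace noise to the raw scores, then expose only the top-k
    entries, suppressing all other classes entirely.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(logits, dtype=float)
    noisy = noisy + rng.laplace(0.0, noise_scale, size=noisy.shape)
    # Keep only the k largest noisy scores; mask the rest with -inf
    # so they receive exactly zero probability after the softmax.
    keep = np.argsort(noisy)[-top_k:]
    masked = np.full_like(noisy, -np.inf)
    masked[keep] = noisy[keep]
    # Numerically stable softmax over the masked scores.
    shifted = masked - masked[keep].max()
    probs = np.exp(shifted)
    return probs / probs.sum()
```

With `top_k=1` the caller sees only a one-hot distribution (the predicted label and nothing else), which is the most aggressive version of restricting logit access; larger `top_k` or larger `noise_scale` trades utility against leakage.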
From the abstract
Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different "representational levels" as it is compressed from the rich information encoded in […]