Your model's final 'probability' outputs are leaking nearly as much private internal information as its hidden layers.
April 15, 2026
Original Paper
What do your logits know? (The answer may surprise you!)
arXiv · 2604.09885
The Takeaway
This paper shows that 'logits' (the final output scores) leak significant task-irrelevant information about an input image. Previously, we thought only the internal 'hidden' states were a privacy risk, but it turns out the very final layer is just as chatty. This means that even if you only expose the final API response, an attacker can still reconstruct high-fidelity details about the input data. For practitioners, this is a major security warning: protecting your internal weights isn't enough to prevent data leakage. You may need to inject noise or restrict logit access to keep data truly private.
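To make the mitigation concrete, here is a minimal sketch of what "inject noise or restrict logit access" could look like at the API boundary. This is an illustration, not the paper's method: the function name, parameters, and the choice of Laplace noise plus top-k masking are assumptions for demonstration.

```python
import numpy as np

def harden_logits(logits, top_k=1, noise_scale=0.1, seed=None):
    """Reduce information leaked through an API's logit output.

    Two illustrative mitigations (hypothetical, not from the paper):
    add Laplace noise to the raw scores, then expose only the top-k
    entries, suppressing all other classes entirely.
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(logits, dtype=float)
    noisy = noisy + rng.laplace(0.0, noise_scale, size=noisy.shape)
    # Keep only the k largest noisy scores; mask the rest with -inf
    # so they receive exactly zero probability after the softmax.
    keep = np.argsort(noisy)[-top_k:]
    masked = np.full_like(noisy, -np.inf)
    masked[keep] = noisy[keep]
    # Numerically stable softmax over the masked scores.
    shifted = masked - masked[keep].max()
    probs = np.exp(shifted)
    return probs / probs.sum()
```

With `top_k=1` the caller sees only a one-hot distribution (the predicted label and nothing else), which is the most aggressive version of restricting logit access; larger `top_k` or larger `noise_scale` trades utility against leakage.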
From the abstract
Recent work has shown that probing model internals can reveal a wealth of information not apparent from the model generations. This poses the risk of unintentional or malicious information leakage, where model users are able to learn information that the model owner assumed was inaccessible. Using vision-language models as a testbed, we present the first systematic comparison of information retained at different "representational levels" as it is compressed from the rich information encoded in […]