A white-box membership inference attack that uses 'gradient-induced feature drift' to outperform existing confidence-based methods.
April 2, 2026
Original Paper
G-Drift MIA: Membership Inference via Gradient-Induced Feature Drift in LLMs
arXiv · 2604.00419
The Takeaway
By applying a single gradient-ascent step and measuring the resulting 'drift' in internal representations, practitioners can audit whether specific data was used in training with much higher accuracy than loss-based attacks, even when member and non-member distributions are identical.
From the abstract
Large language models (LLMs) are trained on massive web-scale corpora, raising growing concerns about privacy and copyright. Membership inference attacks (MIAs) aim to determine whether a given example was used during training. Existing LLM MIAs largely rely on output probabilities or loss values and often perform only marginally better than random guessing when members and non-members are drawn from the same distribution. We introduce G-Drift MIA, a white-box membership inference method based on gradient-induced feature drift.
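To make the core idea concrete, here is a minimal sketch of a gradient-induced feature drift score. It is not the paper's exact procedure: the toy model, the choice of internal layer, the step size `eta`, and the L2 drift metric are all illustrative assumptions. The sketch takes one gradient-ascent step on a candidate example's loss and measures how far an internal representation moves; the hypothesis is that training members exhibit different drift than non-members.

```python
# Hedged sketch of gradient-induced feature drift (illustrative, not the
# paper's implementation). Assumptions: a tiny stand-in language model,
# a single hidden layer as the "internal representation", L2 drift, and
# an arbitrary ascent step size `eta`.
import copy

import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyLM(nn.Module):
    """Stand-in for an LLM: embedding -> hidden layer -> vocab logits."""

    def __init__(self, vocab=50, dim=16):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.hidden = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, vocab)

    def features(self, ids):
        # The internal representation whose drift we measure.
        return torch.tanh(self.hidden(self.emb(ids)))

    def forward(self, ids):
        return self.head(self.features(ids))


def g_drift_score(model, ids, eta=0.05):
    """One gradient-ASCENT step on the example's loss, then L2 feature drift."""
    probe = copy.deepcopy(model)  # never mutate the audited model
    logits = probe(ids[:-1])      # next-token prediction on the candidate
    loss = nn.functional.cross_entropy(logits, ids[1:])
    grads = torch.autograd.grad(loss, list(probe.parameters()))
    with torch.no_grad():
        feats_before = probe.features(ids)
        for p, g in zip(probe.parameters(), grads):
            p.add_(eta * g)       # ascent: move *up* the loss surface
        feats_after = probe.features(ids)
    return (feats_after - feats_before).norm().item()


model = TinyLM()
sample = torch.randint(0, 50, (12,))
score = g_drift_score(model, sample)
print(f"drift score: {score:.4f}")
```

In an actual audit one would compute this score for the candidate example and threshold it (or calibrate against reference examples) to decide member versus non-member; the thresholding scheme here is left open, as the source excerpt does not specify it.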