A new statistical test that reliably detects whether a dataset was NOT used in an LLM's training corpus.
March 25, 2026
Original Paper
Detecting Non-Membership in LLM Training Data via Rank Correlations
arXiv · 2603.22707
The Takeaway
While membership inference is well studied, proving non-membership is critical for legal compliance and 'right to be forgotten' auditing. Using rank correlations of model logits, PRISM provides a framework to verify dataset exclusion without requiring access to the full training pipeline.
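The paper's exact statistic is not reproduced here, but a minimal sketch of what a rank-correlation non-membership check could look like follows. The helper names, the choice of Spearman correlation, and the comparison against a reference model are all illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

def spearman(x, y):
    """Spearman rank correlation, assuming no tied values:
    Pearson correlation computed on the ranks of x and y."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

def non_membership_score(audited_scores, reference_scores):
    """Hypothetical test statistic: how closely the audited model's
    per-example logit ranking tracks that of a reference model known
    not to have trained on the dataset. The intuition is that training
    on the data would distort the audited model's ranking, lowering
    the correlation."""
    return spearman(audited_scores, reference_scores)

# Toy demo with synthetic per-example scores (illustrative only).
rng = np.random.default_rng(0)
ref = rng.normal(size=200)                    # reference model's scores
unseen = ref + 0.1 * rng.normal(size=200)     # audited model, data unseen
print(round(non_membership_score(ref, unseen), 3))
```

In this sketch a correlation close to 1 is consistent with non-membership; the real test would need a calibrated null distribution to turn the statistic into a decision with error guarantees.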
From the abstract
As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem -- verifying that a dataset was not used -- has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level