AI & ML New Capability

A new statistical test that reliably detects whether a dataset was NOT used in an LLM's training corpus.

March 25, 2026

Original Paper

Detecting Non-Membership in LLM Training Data via Rank Correlations

Pranav Shetty, Mirazul Haque, Zhiqiang Ma, Xiaomo Liu

arXiv · 2603.22707

The Takeaway

While membership inference is common, proving non-membership is critical for legal compliance and 'right to be forgotten' auditing. The authors use rank correlations of model logits to build a framework that verifies dataset exclusion without requiring access to the full training pipeline.
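To make the intuition concrete, here is a hypothetical sketch (not the paper's actual PRISM procedure; the variable names, noise model, and decision signal are all illustrative assumptions): if a dataset was excluded from training, the rank ordering of per-document scores from the candidate model should agree with that of a held-out reference model, since both rankings are driven only by intrinsic text difficulty. Memorization by the candidate model perturbs its ranking, lowering the rank correlation.

```python
# Illustrative sketch only -- NOT the paper's PRISM implementation.
# We simulate per-document scores (e.g. mean log-likelihoods) and compare
# Spearman rank correlations against a held-out reference model.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_docs = 200

# Latent "intrinsic difficulty" drives every clean model's scores.
difficulty = rng.normal(size=n_docs)

# Candidate model A: never trained on the dataset -> tracks difficulty.
scores_a = difficulty + 0.3 * rng.normal(size=n_docs)
# Held-out reference model B: also tracks difficulty.
scores_b = difficulty + 0.3 * rng.normal(size=n_docs)
# Candidate model C: "memorized" half the documents -> their scores
# collapse toward a low constant, breaking rank agreement with B.
scores_c = scores_a.copy()
scores_c[:100] = -3.0 + 0.05 * rng.normal(size=100)

rho_clean, _ = spearmanr(scores_b, scores_a)  # non-member candidate
rho_mem, _ = spearmanr(scores_b, scores_c)    # memorizing candidate

print(f"non-member vs held-out reference: rho = {rho_clean:.2f}")
print(f"memorizer  vs held-out reference: rho = {rho_mem:.2f}")
```

In this toy setup the non-member candidate shows a high rank correlation with the reference while the memorizing candidate's correlation drops noticeably; an actual test would of course need real model logits and a calibrated statistical threshold, as the paper develops.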

From the abstract

As large language models (LLMs) are trained on increasingly vast and opaque text corpora, determining which data contributed to training has become essential for copyright enforcement, compliance auditing, and user trust. While prior work focuses on detecting whether a dataset was used in training (membership inference), the complementary problem -- verifying that a dataset was not used -- has received little attention. We address this gap by introducing PRISM, a test that detects dataset-level non-membership.