Building a Feature Drift Detection System with Kolmogorov–Smirnov Tests
How I built an ML monitoring pipeline that catches silent model degradation by comparing reference and production data distributions, feature by feature.
Machine learning models rarely fail loudly. They fail silently, as the world drifts away from the data they were trained on — a pricing model trained before a market shift, a fraud detector facing a new attack pattern. The model keeps returning predictions; they just stop being good.
This post walks through the design of a feature drift detection system I built to catch exactly that failure mode.
The core idea
Drift detection reduces to a statistical question: is the data my model sees in production drawn from the same distribution as my reference data?
For each feature independently, the two-sample Kolmogorov–Smirnov (KS) test answers this. It compares empirical cumulative distribution functions and returns a statistic (the maximum distance between the two CDFs) and a p-value.
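import numpy as np
from dataclasses import dataclass
from scipy import stats

@dataclass
class DriftResult:  # assumed shape: just the three fields used below
    statistic: float
    p_value: float
    drifted: bool

def detect_drift(reference: np.ndarray, production: np.ndarray,
                 threshold: float = 0.05) -> DriftResult:
    statistic, p_value = stats.ks_2samp(reference, production)
    return DriftResult(
        statistic=statistic,
        p_value=p_value,
        drifted=p_value < threshold,
    )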
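A small p-value means a gap this large would be unlikely if both samples came from the same distribution, so the feature is flagged as drifted. With very large samples even negligible differences produce tiny p-values, so the statistic itself is worth reading alongside the p-value.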
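To make the statistic less of a black box, here is a minimal sketch that computes it by hand from the two empirical CDFs and compares it against SciPy. The helper name `ks_statistic` and the synthetic samples are illustrative assumptions, not part of the pipeline.

```python
import numpy as np
from scipy import stats


def ks_statistic(reference: np.ndarray, production: np.ndarray) -> float:
    """Largest vertical gap between the two empirical CDFs."""
    ref = np.sort(reference)
    prod = np.sort(production)
    # Both ECDFs are step functions that only change at observed values,
    # so the largest gap is reached at one of the pooled data points.
    grid = np.concatenate([ref, prod])
    cdf_ref = np.searchsorted(ref, grid, side="right") / ref.size
    cdf_prod = np.searchsorted(prod, grid, side="right") / prod.size
    return float(np.max(np.abs(cdf_ref - cdf_prod)))


# Synthetic data (assumption): production is shifted by half a standard deviation.
rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, size=2_000)
shifted = rng.normal(0.5, 1.0, size=2_000)

manual = ks_statistic(reference, shifted)
library = stats.ks_2samp(reference, shifted).statistic
```

The two values should agree up to floating-point error, which is a quick sanity check that the statistic really is just the maximum ECDF gap.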
Making it fast enough for real datasets
Running per-feature statistical tests naively over large datasets is slow. Two decisions kept the pipeline fast:
- Vectorized operations everywhere. All preprocessing (imputation, scaling, encoding) runs as NumPy/Pandas column operations, never Python loops over rows.
- Feature-wise independence. Because each feature is tested separately, the work is trivially parallelizable and memory-bounded per column rather than per dataset.
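The feature-wise independence described above can be sketched roughly as follows. This is not the pipeline's actual code: `drift_report`, the `max_workers` parameter, and the choice of a thread pool are assumptions made for illustration, and a process pool may suit CPU-bound workloads better.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd
from scipy import stats


def drift_report(reference: pd.DataFrame, production: pd.DataFrame,
                 max_workers: int = 4) -> pd.DataFrame:
    """Run a two-sample KS test on every numeric column present in both frames."""
    numeric = reference.select_dtypes(include="number").columns
    columns = [c for c in numeric if c in production.columns]

    def test_column(name: str) -> dict:
        # Each task touches only two 1-D arrays, so memory is bounded per column.
        ref = reference[name].dropna().to_numpy()
        prod = production[name].dropna().to_numpy()
        result = stats.ks_2samp(ref, prod)
        return {"feature": name, "statistic": result.statistic,
                "p_value": result.pvalue}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(test_column, columns))
    return pd.DataFrame(rows).set_index("feature")


# Synthetic data (assumption): column "b" drifts, column "a" does not.
rng = np.random.default_rng(1)
reference = pd.DataFrame({"a": rng.normal(size=1_000), "b": rng.normal(size=1_000)})
production = pd.DataFrame({"a": rng.normal(size=1_000),
                           "b": rng.normal(2.0, 1.0, size=1_000)})
report = drift_report(reference, production)
```

Because no task shares state with another, swapping the executor or distributing columns across machines would not change the per-column logic.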
Configurable alerting
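Not every drifted feature matters equally. The system supports per-feature alert thresholds, so a feature with high importance in the model can trigger alerts at smaller drift magnitudes than a marginal one. The thresholds here are limits on the drift magnitude (the KS statistic), not p-value cutoffs, so a lower value makes a feature alert sooner: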
thresholds = {
    "transaction_amount": 0.01,   # model depends heavily on this
    "user_agent_category": 0.10,  # low importance, tolerate more drift
}
When drift exceeds tolerance, the pipeline can trigger automated retraining — closing the loop from detection to remediation.
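A minimal sketch of how per-feature thresholds might feed an alerting decision. It assumes the scores are KS statistics (larger means more drift); `features_to_alert`, `default_threshold`, the example scores, and the `should_retrain` rule are all hypothetical.

```python
from typing import Dict, List


def features_to_alert(scores: Dict[str, float],
                      thresholds: Dict[str, float],
                      default_threshold: float = 0.05) -> List[str]:
    """Return features whose drift score exceeds their tolerance, worst first."""
    # How far each feature sits above (positive) or below (negative) its limit.
    excess = {
        name: score - thresholds.get(name, default_threshold)
        for name, score in scores.items()
    }
    breached = [name for name, gap in excess.items() if gap > 0]
    return sorted(breached, key=lambda name: excess[name], reverse=True)


thresholds = {"transaction_amount": 0.01, "user_agent_category": 0.10}
# Hypothetical KS statistics from one monitoring run.
scores = {"transaction_amount": 0.04, "user_agent_category": 0.06,
          "account_age": 0.02}

alerts = features_to_alert(scores, thresholds)
# Hypothetical retraining rule: retrain when a high-importance feature breaches.
should_retrain = "transaction_amount" in alerts
```

Here only `transaction_amount` breaches its limit: `user_agent_category` has a larger score but also a larger tolerance, and `account_age` falls back to the default.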
Interpreting drift, not just flagging it
A binary "drifted / not drifted" flag doesn't tell you what changed. The dashboard renders, for each feature, overlaid histograms and density plots of the reference and production distributions alongside the drift score. Seeing that a feature's distribution developed a second mode is far more actionable than a p-value alone.
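The data behind an overlaid histogram can be sketched like this. The key detail is binning both samples on shared edges so the bars are directly comparable. `overlay_histograms` and the bimodal synthetic sample are assumptions for illustration; the dashboard's actual plotting code is not shown here.

```python
import numpy as np


def overlay_histograms(reference: np.ndarray, production: np.ndarray,
                       bins: int = 30):
    """Bin both samples on shared edges and return density-normalized heights."""
    combined = np.concatenate([reference, production])
    edges = np.histogram_bin_edges(combined, bins=bins)
    ref_density, _ = np.histogram(reference, bins=edges, density=True)
    prod_density, _ = np.histogram(production, bins=edges, density=True)
    return edges, ref_density, prod_density


rng = np.random.default_rng(2)
reference = rng.normal(0.0, 1.0, size=5_000)
# Synthetic production sample (assumption): a second mode appears around 4.0.
production = np.concatenate([rng.normal(0.0, 1.0, size=3_500),
                             rng.normal(4.0, 0.5, size=1_500)])

edges, ref_density, prod_density = overlay_histograms(reference, production)
# With matplotlib, each curve could be drawn via plt.stairs(density, edges).
```

Normalizing to density rather than raw counts keeps the comparison fair when the reference and production windows differ in size.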
What I'd do next
- Multivariate drift: per-feature KS tests miss correlation changes; methods like MMD or classifier-based drift detection catch them.
- Categorical features: the KS test assumes continuous distributions; chi-squared tests are the natural counterpart.
- Drift severity calibration: mapping drift statistics to expected model performance loss, so alerts rank by impact rather than statistical surprise.
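For the categorical case, a first pass might look like the sketch below: a chi-squared test of homogeneity on category counts from the two samples. This is not implemented in the pipeline; `categorical_drift` and the browser-category data are hypothetical.

```python
import pandas as pd
from scipy import stats


def categorical_drift(reference: pd.Series, production: pd.Series,
                      threshold: float = 0.05) -> dict:
    """Chi-squared test on a categories x {reference, production} count table."""
    counts = pd.concat(
        [reference.value_counts(), production.value_counts()],
        axis=1, keys=["reference", "production"],
    ).fillna(0)  # a category missing from one sample counts as zero there
    chi2, p_value, dof, _ = stats.chi2_contingency(counts.to_numpy())
    return {"statistic": chi2, "p_value": p_value, "drifted": p_value < threshold}


# Hypothetical category mix before and after a shift in traffic.
reference = pd.Series(["chrome"] * 600 + ["safari"] * 300 + ["other"] * 100)
production = pd.Series(["chrome"] * 300 + ["safari"] * 300 + ["other"] * 400)
result = categorical_drift(reference, production)
```

Unlike the KS statistic, the chi-squared statistic grows with sample size, so it is not directly comparable across features or windows without some normalization.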
The full pipeline — preprocessing, testing, visualization, and retraining triggers — is on GitHub.