KS (Kolmogorov-Smirnov) Plot in Data Science

Introduction

The Kolmogorov-Smirnov (KS) statistic is a non-parametric test used to compare a sample distribution with a reference probability distribution or to compare two sample distributions. In data science, specifically in the domain of credit scoring and binary classification problems, the KS plot is utilized to measure the discriminatory power of a model.

Background

The KS test statistic quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution (one-sample K-S test), or the empirical distribution functions of two samples (two-sample K-S test). The statistic is named after Andrey Kolmogorov and Nikolai Smirnov.

One-Sample K-S Test

Suppose we want to test if our sample follows a certain reference distribution. The one-sample K-S test computes:

D = max_x |F_n(x) − F(x)|

Where:

  • F_n(x) is the empirical distribution function (EDF) of the sample.
  • F(x) is the cumulative distribution function (CDF) of the reference distribution.
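
For example, here is a minimal sketch of the one-sample test using SciPy's scipy.stats.kstest; the synthetic sample and the choice of a standard normal reference distribution are purely illustrative:

```python
import numpy as np
from scipy import stats

# Synthetic sample for illustration only.
rng = np.random.default_rng(42)
sample = rng.normal(loc=0.0, scale=1.0, size=500)

# Compare the sample's EDF against the standard normal CDF.
statistic, p_value = stats.kstest(sample, "norm")
print(f"D = {statistic:.4f}, p-value = {p_value:.4f}")
```

A small p-value would lead us to reject the hypothesis that the sample was drawn from the reference distribution.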

Two-Sample K-S Test

To compare the distributions of two samples, the two-sample K-S test computes:

D = max_x |F_{1,n}(x) − F_{2,n}(x)|

Where:

  • F_{1,n}(x) and F_{2,n}(x) are the EDFs of the two samples, respectively.
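
A similar sketch of the two-sample test using SciPy's scipy.stats.ks_2samp; the two synthetic samples below are illustrative:

```python
import numpy as np
from scipy import stats

# Two synthetic samples with slightly different distributions.
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, scale=1.0, size=400)
sample_b = rng.normal(loc=0.5, scale=1.2, size=300)

# D is the maximum absolute distance between the two empirical CDFs.
statistic, p_value = stats.ks_2samp(sample_a, sample_b)
print(f"D = {statistic:.4f}, p-value = {p_value:.4f}")
```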

KS Plot in Credit Scoring and Binary Classification

The KS statistic is widely used in credit scoring to assess the discriminative power of scoring models. Here is how it is typically computed:

  1. Data Preparation: Sort the data in descending order of predicted probability (or score), then divide it into deciles.
  2. Calculate Cumulative Percentages:
    • For each decile, compute the cumulative percentage of “goods” and “bads”. In credit scoring, “goods” refer to customers who paid back their loan, and “bads” refer to those who defaulted.
  3. Plotting:
    • On the x-axis: Cumulative percentage of total customers (from 10% to 100%).
    • On the y-axis: Cumulative percentage of goods and bads.
  4. Compute KS Statistic:
    • The KS statistic is the maximum difference between the cumulative percentages of goods and bads. On the KS plot, this appears as the maximum vertical distance between the two curves; a sketch of this computation follows the list.
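
Below is a minimal sketch of this decile-based computation in Python with pandas and matplotlib. The column names (score, target), the synthetic data, and the convention target = 1 for bads are assumptions made for illustration, not a prescribed implementation:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic scores and outcomes for illustration (target = 1 means "bad").
rng = np.random.default_rng(1)
n = 1_000
target = rng.integers(0, 2, size=n)
score = np.clip(0.3 * target + rng.normal(0.4, 0.2, n), 0.0, 1.0)

# 1. Sort by score (descending) and assign deciles (decile 1 = highest scores).
df = pd.DataFrame({"score": score, "target": target})
df = df.sort_values("score", ascending=False).reset_index(drop=True)
df["decile"] = pd.qcut(df.index, 10, labels=False) + 1

# 2. Cumulative percentage of goods and bads captured up to each decile.
per_decile = df.groupby("decile")["target"].agg(bads="sum", total="count")
per_decile["goods"] = per_decile["total"] - per_decile["bads"]
cum_bads = per_decile["bads"].cumsum() / per_decile["bads"].sum()
cum_goods = per_decile["goods"].cumsum() / per_decile["goods"].sum()

# 3. KS statistic: largest gap between the two cumulative curves.
ks = (cum_bads - cum_goods).abs().max()
print(f"KS statistic: {ks:.4f}")

# 4. KS plot: the KS statistic is the widest vertical gap between the curves.
x = np.arange(1, 11) / 10
plt.plot(x, cum_goods, marker="o", label="Cumulative % goods")
plt.plot(x, cum_bads, marker="o", label="Cumulative % bads")
plt.xlabel("Cumulative % of customers")
plt.ylabel("Cumulative % of goods / bads")
plt.legend()
plt.show()
```

Note that computing the KS statistic on the raw scores of goods versus bads (a two-sample K-S test) gives essentially the same quantity without the decile binning; the decile table is mainly a reporting convention.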

The larger the KS statistic, the better the model is at discriminating between the two groups. A value close to 0 suggests the model has no discriminative power, while a value close to 1 indicates perfect separation.

Benefits of Using KS Plot

  • Simple Visualization: Allows for a quick visual assessment of model performance.
  • Non-parametric: Doesn’t assume any particular distribution for the underlying data.
  • Comprehensive: Considers the entire distribution of scores rather than relying on a single threshold.

Limitations

  • Binary Classification: The KS plot is best suited for binary classification problems and may not be directly applicable for multi-class scenarios.
  • Lack of Calibration: While the KS plot shows discrimination ability, it doesn’t provide insight into the calibration of the model.
  • Overemphasis: In some cases, an extremely high KS value might suggest overfitting.