KS (Kolmogorov-Smirnov) Plot in Data Science

Introduction

The Kolmogorov-Smirnov (KS) test is a non-parametric test used to compare a sample distribution with a reference probability distribution, or to compare two sample distributions. In data science, particularly in credit scoring and other binary classification problems, the KS plot is used to measure the discriminatory power of a model, that is, how well its scores separate the two classes.

Background

The KS test statistic quantifies the distance between the empirical distribution function of a sample and the cumulative distribution function of a reference distribution (one-sample K-S test), or between the empirical distribution functions of two samples (two-sample K-S test). The statistic is named after Andrey Kolmogorov and Nikolai Smirnov.

One-Sample K-S Test

Suppose we want to test if our sample follows a certain reference distribution. The one-sample K-S test computes:

$D = \max_{x} | F_{n} (x) - F (x) |$

Where:

$F_{n} (x)$ is the empirical distribution function (EDF) of the sample.
$F (x)$ is the cumulative distribution function (CDF) of the reference distribution.
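As a minimal sketch of the one-sample statistic, the snippet below evaluates the distance at every sample point, checking both sides of each jump in the EDF. The function names `normal_cdf` and `ks_one_sample`, the choice of a standard normal reference distribution, and the sample values are all illustrative assumptions, not part of the original article.

```python
import math


def normal_cdf(x, mu=0.0, sigma=1.0):
    """CDF of a normal distribution (assumed reference distribution F)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def ks_one_sample(sample, cdf):
    """Return D, the largest gap between the sample's EDF and cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The EDF jumps from i/n to (i+1)/n at x, so the largest gap
        # can occur just before or just after the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# Hypothetical sample, compared against a standard normal distribution.
sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
print(ks_one_sample(sample, normal_cdf))
```

In practice a statistics library would normally be used, since it also returns a p-value; the hand-written version is only meant to make the formula concrete.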

Two-Sample K-S Test

To compare the distributions of two samples, the two-sample K-S test computes:

$D = \max_{x} | F_{1, n} (x) - F_{2, n} (x) |$

Where:

$F_{1, n} (x)$ and $F_{2, n} (x)$ are the EDFs of the two samples respectively.
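The two-sample statistic can be sketched in the same way. Both EDFs are step functions that only change at observed values, so it should be enough to compare them at every point of the pooled sample. The function name `ks_two_sample` and the example data are hypothetical.

```python
from bisect import bisect_right


def ks_two_sample(a, b):
    """Return D, the largest gap between the EDFs of samples a and b."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    n, m = len(a_sorted), len(b_sorted)
    d = 0.0
    for x in a_sorted + b_sorted:
        # bisect_right counts the observations <= x, which gives the EDF at x.
        f1 = bisect_right(a_sorted, x) / n
        f2 = bisect_right(b_sorted, x) / m
        d = max(d, abs(f1 - f2))
    return d


# Hypothetical samples for illustration.
first = [0.1, 0.4, 0.5, 0.9]
second = [0.3, 0.6, 0.8, 1.2, 1.4]
print(ks_two_sample(first, second))
```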

KS Plot in Credit Scoring and Binary Classification

The KS statistic is popular in the realm of credit scoring. It is used to assess the discriminative power of credit scoring models. Here’s how it’s done:

Data Preparation: Start by sorting data in descending order of predicted probabilities (or scores) and then divide the data into deciles.
Calculate Cumulative Percentages:
- For each decile, compute the cumulative percentage of “goods” and “bads”. In credit scoring, “goods” refers to customers who paid back their loan, and “bads” refers to those who defaulted.
Plotting:
- On the x-axis: Cumulative percentage of total customers (from 10% to 100%).
- On the y-axis: Cumulative percentage of goods and bads.
Compute KS Statistic:
- The KS statistic is the maximum difference between the cumulative percentage of goods and the cumulative percentage of bads. On the KS plot, it appears as the maximum vertical distance between the two curves.

The larger the KS statistic, the better the model is at discriminating between the two groups. A value close to 0 suggests the model has no discriminative power, while a value close to 1 indicates perfect separation.
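The decile procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the score is assumed to be the predicted probability of default, `is_bad` is assumed to be 1 for a defaulter and 0 otherwise, and the function name `ks_table` and the example data are hypothetical.

```python
def ks_table(scores, is_bad, n_bins=10):
    """Return (rows, ks) where each row is
    (bin number, cumulative share of goods, cumulative share of bads, gap)."""
    total_bad = sum(is_bad)
    total_good = len(is_bad) - total_bad
    if total_bad == 0 or total_good == 0:
        raise ValueError("need at least one good and one bad")

    # Sort record indices by descending score (assumed: higher = riskier).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n = len(order)

    rows = []
    cum_bad = cum_good = 0
    for k in range(n_bins):
        # Integer arithmetic keeps the bins as equal in size as possible.
        start, end = k * n // n_bins, (k + 1) * n // n_bins
        for i in order[start:end]:
            cum_bad += is_bad[i]
            cum_good += 1 - is_bad[i]
        pct_bad = cum_bad / total_bad
        pct_good = cum_good / total_good
        rows.append((k + 1, pct_good, pct_bad, abs(pct_bad - pct_good)))

    # KS is the largest gap between the two cumulative curves.
    ks = max(row[3] for row in rows)
    return rows, ks


# Hypothetical scores and outcomes for 20 customers.
scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50,
          0.45, 0.40, 0.35, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.01]
is_bad = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

rows, ks = ks_table(scores, is_bad)
for bin_no, pct_good, pct_bad, gap in rows:
    print(f"decile {bin_no:2d}  goods {pct_good:6.1%}  bads {pct_bad:6.1%}  gap {gap:6.1%}")
print(f"KS = {ks:.3f}")
```

Plotting the second and third columns of `rows` against the decile number gives the two curves of the KS plot. Because the gap is only evaluated at decile boundaries, the binned value cannot exceed the KS computed at every distinct score when there are no tied scores.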

Benefits of Using KS Plot

Simple Visualization: Allows for a quick visual assessment of model performance.
Non-parametric: Doesn’t assume any particular distribution for the underlying data.
Comprehensive: Considers the entire distribution of scores rather than relying on a single threshold.

Limitations

Binary Classification: The KS plot is best suited for binary classification problems and may not be directly applicable for multi-class scenarios.
Lack of Calibration: While the KS plot shows discrimination ability, it doesn’t provide insight into the calibration of the model.
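Overemphasis: In some cases, an extremely high KS value might suggest overfitting rather than genuine discriminatory power, so it is worth checking the value on data the model was not trained on.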
