measure difference between two distributions


There exist many ways to characterise the "difference between two distributions". A specific and targeted answer requires more detail about, e.g., the underlying probability distribution(s).

It all depends on how you define a difference between two distributions. To give you two ideas:

  1. A Kolmogorov-Smirnov test is a non-parametric test that measures the "distance" between two cumulative/empirical distribution functions (a short sketch follows this list).
  2. The Kullback-Leibler divergence measures the "distance" between two distributions in the language of information theory, as the relative entropy of one distribution with respect to the other.
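
For the first idea, base R's stats::ks.test implements the two-sample test directly. A minimal sketch on simulated data (the normal samples below are made up purely for illustration):

    # two simulated samples; ks.test compares their empirical CDFs
    set.seed(1)
    x <- rnorm(100)
    y <- rnorm(100, mean = 0.5)
    ks.test(x, y)   # two-sample Kolmogorov-Smirnov test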

Update [a year later]

Upon revisiting this post it might be important to emphasise a few things:

  1. The standard two-sample Kolmogorov-Smirnov (KS) test assumes the underlying distributions to be continuous. For discrete data (which the data from the original post seem to be), an alternative may be to use a bootstrap version of the two-sample KS test, as in Matching::ks.boot; a sketch is given after this list. More details can be found on e.g. Cross Validated: Can I use Kolmogorov-Smirnov to compare two empirical distributions? and on Wikipedia: Two-sample Kolmogorov–Smirnov test.
  2. Provided the sample data in the original post is representative, I don't think there will be a very meaningful answer from either a KS-statistic based test or the KL divergence (or really any other test, for that matter). The reason is that the values in every sample are essentially all zero (to be precise, >80% of the values are zero). That, in combination with the small sample size of 21 values per sample, means that there is really not a lot "left" to characterise any underlying distribution.
  3. In more general terms (and ignoring the limitations pointed out in the previous point), to calculate the KL divergence for all pairwise combinations one could do the following, with the samples stored as a list lst of count vectors:

    library(entropy)
    library(tidyverse)

    # samples as a list of count vectors (values taken from the original post)
    lst <- list(c(25, 0, 17, 3, 5, rep(0, 16)),
                c(25, 1, 16, 2, 6, rep(0, 16)),
                c(25, 3, 16, 2, 4, rep(0, 16)))

    # all ordered pairs (Var1, Var2); KL.empirical estimates KL(Var1 -> Var2) from the counts
    expand.grid(1:length(lst), 1:length(lst)) %>%
        rowwise() %>%
        mutate(KL = KL.empirical(lst[[Var1]], lst[[Var2]]))

    Since the KL divergence is not symmetric, both the upper and the lower triangular parts of the pairwise KL divergence matrix need to be calculated. To reduce compute time one could instead use a symmetrised KL divergence, which only needs to be evaluated for the upper (or lower) triangular part; the symmetrised versions themselves still involve both directed divergences, i.e. KL(1->2) and KL(2->1), but these can be computed in one routine. A sketch of this is given below.
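
A minimal sketch of such a symmetrised version, here the Jeffreys divergence (the sum of the two directed KL divergences), assuming the same entropy/tidyverse setup and the list lst from above:

    # symmetrised KL (Jeffreys divergence): KL(i -> j) + KL(j -> i)
    sym_KL <- function(y1, y2) {
        KL.empirical(y1, y2) + KL.empirical(y2, y1)
    }

    # only the unordered pairs (upper triangle) need to be enumerated
    combn(length(lst), 2) %>%
        t() %>%
        as.data.frame() %>%
        rowwise() %>%
        mutate(KL_sym = sym_KL(lst[[V1]], lst[[V2]]))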

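Returning to point 1 above: a bootstrap two-sample KS test for discrete data could be run along the following lines (a sketch, assuming the Matching package is installed; the vectors are the first two samples from the original post):

    library(Matching)

    # bootstrap two-sample KS test; handles the ties that arise in discrete data
    s1 <- c(25, 0, 17, 3, 5, rep(0, 16))
    s2 <- c(25, 1, 16, 2, 6, rep(0, 16))
    ks.boot(s1, s2, nboots = 1000)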

Comments

  • SummonersRift almost 2 years

    I have a distance vector for each sample of a program, and I am trying to quantify how similar the samples are. I was using the Euclidean distance between sample groups (each value belongs to a bucket, and we compare bucket by bucket), which works fine. But there are too many comparisons that need to be done for a large number of samples.

    I was wondering if there is an efficient way to build an index to compare the samples. The samples look like this:

    Sample:1 = {25 0 17 3 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
    Sample:2 = {25 1 16 2 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
    Sample:3 = {25 3 16 2 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
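
    As a rough illustration of the pairwise Euclidean comparison described above (not an indexing scheme), base R's dist() computes all pairwise distances in one call; the rows below are the three samples from the question:

        # samples from the question as rows of a matrix
        samples <- rbind(c(25, 0, 17, 3, 5, rep(0, 16)),
                         c(25, 1, 16, 2, 6, rep(0, 16)),
                         c(25, 3, 16, 2, 4, rep(0, 16)))

        # all pairwise Euclidean distances, compared bucket by bucket
        dist(samples, method = "euclidean")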