How to generate distributions given mean, SD, skew and kurtosis in R?


Solution 1

There is a Johnson distribution in the SuppDists package. Johnson will give you a distribution that matches either moments or quantiles. Other comments are correct that 4 moments does not a distribution make. But Johnson will certainly try.

Here's an example of fitting a Johnson to some sample data:

require(SuppDists)

## make a weird dist with Kurtosis and Skew
a <- rnorm( 5000, 0, 2 )
b <- rnorm( 1000, -2, 4 )
c <- rnorm( 3000,  4, 4 )
babyGotKurtosis <- c( a, b, c )
hist( babyGotKurtosis , freq=FALSE)

## Fit a Johnson distribution to the data
## TODO: Insert Johnson joke here
parms<-JohnsonFit(babyGotKurtosis, moment="find")

## Summary statistics of the fitted distribution (the parameters are in parms)
sJohnson(parms)

## add the Johnson function to the histogram
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")

The final plot looks like this:

[Plot: histogram of babyGotKurtosis with the moment-fitted Johnson density overlaid in red]

You can see a bit of the issue that others point out about how 4 moments do not fully capture a distribution.

Good luck!

EDIT As Hadley pointed out in the comments, the Johnson fit looks off. I did a quick test and refit using moment="quant", which fits the Johnson distribution to 5 quantiles instead of the 4 moments. The results look much better:

parms<-JohnsonFit(babyGotKurtosis, moment="quant")
plot(function(x)dJohnson(x,parms), -20, 20, add=TRUE, col="red")

Which produces the following:

[Plot: histogram of babyGotKurtosis with the quantile-fitted Johnson density overlaid in red]

Anyone have any ideas why Johnson seems biased when fit using moments?

Solution 2

This is an interesting question, which doesn't really have a good solution. I presume that even though you don't know the other moments, you have an idea of what the distribution should look like. For example, it's unimodal.

There a few different ways of tackling this problem:

  1. Assume an underlying distribution and match moments. There are many standard R packages for doing this (a minimal hand-rolled sketch follows this list). One downside is that the multivariate generalisation may be unclear.

  2. Saddlepoint approximations. In this paper:

    Gillespie, C.S. and Renshaw, E. An improved saddlepoint approximation. Mathematical Biosciences, 2007.

    We look at recovering a pdf/pmf when given only the first few moments. We found that this approach works when the skewness isn't too large.

  3. Laguerre expansions:

    Mustapha, H. and Dimitrakopoulos, R. Generalized Laguerre expansions of multivariate probability densities with moments. Computers & Mathematics with Applications, 2010.

    The results in this paper seem more promising, but I haven't coded them up.
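
As a minimal illustration of option 1 above (my own sketch, not tied to any particular package): a shifted gamma distribution has three free parameters, so it can match the mean, SD and skewness exactly, though not the kurtosis on top of that. To hit all four moments you need a richer family, such as the Johnson or Pearson systems discussed in the other answers.

## Match mean, SD and positive skewness with a shifted gamma (illustrative only)
match_gamma <- function(mean, sd, skew) {
  stopifnot(skew > 0)               # a gamma distribution has positive skewness
  shape <- (2 / skew)^2             # skewness of gamma(shape) is 2 / sqrt(shape)
  rate  <- sqrt(shape) / sd         # SD of gamma(shape, rate) is sqrt(shape) / rate
  shift <- mean - shape / rate      # location shift so the mean matches
  list(shape = shape, rate = rate, shift = shift)
}

p <- match_gamma(mean = 5, sd = 2, skew = 1.2)
x <- p$shift + rgamma(10000, shape = p$shape, rate = p$rate)
c(mean(x), sd(x))                   # should be close to 5 and 2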

Solution 3

One solution for you might be the PearsonDS library. It allows you to use a combination of the first four moments with the restriction that kurtosis > skewness^2 + 1.

To generate 10 random values from that distribution try:

library("PearsonDS")
moments <- c(mean = 0,variance = 1,skewness = 1.5, kurtosis = 4)
rpearson(10, moments = moments)
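
As an optional sanity check (my addition, not part of the original answer), you can draw a larger sample and compare its empirical moments against the requested ones; empMoments() ships with PearsonDS:

## Compare empirical moments of a large sample with the requested moments
x <- rpearson(1e5, moments = moments)
empMoments(x)   # should be close to mean 0, variance 1, skewness 1.5, kurtosis 4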

Solution 4

This question was asked more than 3 years ago, so I hope my answer doesn't come too late.

There is a way to uniquely identify a distribution when knowing some of the moments: the method of Maximum Entropy. The distribution that results from this method is the one that maximizes your ignorance about the structure of the distribution, given what you know. Any other distribution that also has the moments you specified but is not the MaxEnt distribution implicitly assumes more structure than you put in. The functional to maximize is Shannon's information entropy, $S[p(x)] = -\int p(x)\log p(x)\,dx$. Knowing the mean, SD, skewness and kurtosis translates into constraints on the first, second, third, and fourth moments of the distribution, respectively.

The problem is then to maximize $S$ subject to the constraints $\int x\,p(x)\,dx = m_1$ (the first moment), $\int x^2 p(x)\,dx = m_2$ (the second moment), and so on for the third and fourth moments.

I recommend the book "Harte, J., Maximum Entropy and Ecology: A Theory of Abundance, Distribution, and Energetics (Oxford University Press, New York, 2011)."

Here is a link that tries to implement this in R: https://stats.stackexchange.com/questions/21173/max-entropy-solver-in-r
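
For intuition, here is a rough numerical sketch of my own (not a vetted implementation): the MaxEnt density under four moment constraints has the exponential-polynomial form $p(x) \propto \exp(\lambda_1 x + \lambda_2 x^2 + \lambda_3 x^3 + \lambda_4 x^4)$, so one can solve for the multipliers by matching the moments numerically.

## Solve for the MaxEnt multipliers by minimising the moment mismatch (sketch)
maxent_fit <- function(target, lower = -10, upper = 10) {
  ## target = the first four raw moments E[x], E[x^2], E[x^3], E[x^4]
  moment_error <- function(l) {
    out <- tryCatch({
      dens <- function(x) exp(l[1]*x + l[2]*x^2 + l[3]*x^3 + l[4]*x^4)
      z <- integrate(dens, lower, upper)$value          # normalising constant
      m <- sapply(1:4, function(k)
        integrate(function(x) x^k * dens(x) / z, lower, upper)$value)
      sum((m - target)^2)                               # squared moment mismatch
    }, error = function(e) 1e10)                        # penalise non-integrable proposals
    if (!is.finite(out)) 1e10 else out
  }
  optim(c(0, -0.5, 0, -0.01), moment_error)$par         # crude starting values
}

## example: roughly standard-normal-like targets with a little skew
lambda <- maxent_fit(c(0, 1, 0.3, 3.2))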

Solution 5

I agree that you need density estimation to replicate any distribution. However, if you have hundreds of variables, as is typical in a Monte Carlo simulation, you need a compromise.

One suggested approach is as follows:

  1. Use the Fleishman transform to get the cubic-polynomial coefficients for the given skew and kurtosis
  2. Generate N normal variables (mean = 0, std = 1)
  3. Transform the data in (2) with the Fleishman coefficients to transform the normal data to the given skew and kurtosis
  4. Transform the data from step (3) to the desired mean and standard deviation (std) using new_data = desired_mean + (data from step 3) * desired_std

The resulting data from Step 4 will have the desired mean, std, skewness and kurtosis.
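
A rough R sketch of the four steps above (the three equations are the standard Fleishman power-method system; solving them with a plain least-squares optim() call is my own shortcut, and dedicated packages such as SimMultiCorrData handle this more carefully):

## Step 1: solve for the Fleishman coefficients of Y = a + b*Z + c*Z^2 + d*Z^3
fleishman_coef <- function(skew, ekurt) {
  ## ekurt = excess kurtosis (kurtosis - 3); a = -c keeps the mean at zero
  obj <- function(p) {
    b <- p[1]; c <- p[2]; d <- p[3]
    eq1 <- b^2 + 6*b*d + 2*c^2 + 15*d^2 - 1                      # unit variance
    eq2 <- 2*c*(b^2 + 24*b*d + 105*d^2 + 2) - skew               # target skewness
    eq3 <- 24*(b*d + c^2*(1 + b^2 + 28*b*d) +
               d^2*(12 + 48*b*d + 141*c^2 + 225*d^2)) - ekurt    # target excess kurtosis
    eq1^2 + eq2^2 + eq3^2
  }
  p <- optim(c(1, 0, 0), obj, control = list(maxit = 5000, reltol = 1e-12))$par
  c(a = -p[2], b = p[1], c = p[2], d = p[3])
}

cf <- fleishman_coef(skew = 1, ekurt = 1.5)                      # step 1
z  <- rnorm(10000)                                               # step 2: standard normals
y  <- cf["a"] + cf["b"]*z + cf["c"]*z^2 + cf["d"]*z^3            # step 3: apply transform
x  <- 5 + 2 * y                                                  # step 4: mean 5, SD 2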

Caveats:

  1. Fleishman will not work for all combinations of skewness and kurtosis
  2. The steps above assume uncorrelated variables. If you want to generate correlated data, you will need an extra step before the Fleishman transform

Comments

  • Aaron B (almost 2 years ago)

    Is it possible to generate distributions in R for which the Mean, SD, skew and kurtosis are known? So far it appears the best route would be to create random numbers and transform them accordingly. If there is a package tailored to generating specific distributions which could be adapted, I have not yet found it. Thanks

    • Dason (about 13 years ago)
      As noted those don't uniquely describe a distribution. Even if you define all of the moments you're not guaranteed to uniquely define a distribution. I think you need to explain what it is you're exactly trying to do. Why are you trying to do this? Can you place further restrictions that would make it possible to define a distribution?
    • Aaron B (about 13 years ago)
      Ah yes, we want unimodal, continuous distributions in a single dimension. The resultant distributions will eventually be transformed numerically as a way to test a variation of niche theory through simulation.
    • gung - Reinstate Monica (over 9 years ago)
      On Cross Validated (stats.SE) the following is somewhat related & may be of interest to readers here: How to simulate data that satisfy specific constraints such as having specific mean and standard deviation?
  • hadley (about 13 years ago)
    Something looks wrong with that curve - a simple position shift would make the fit substantially better
  • JD Long (about 13 years ago)
    I agree it looks off. When I get a little time I may dig into it a tad.
  • Artem Klevtsov (over 10 years ago)
    Is there an R implementation of this?
  • mccurcio (about 2 years ago)
    NOTE: This code no longer works on R=4.0
  • Yahya (almost 2 years ago)
    Is there any equivalent to this in Python?