Standard deviation of generic list?

Solution 1

This article should help you. It creates a function that computes the standard deviation of a sequence of double values. All you have to do is supply a sequence of appropriate data elements.

The resulting function is:

private double CalculateStandardDeviation(IEnumerable<double> values)
{   
  double standardDeviation = 0;

  if (values.Any()) 
  {      
     // Compute the average.     
     double avg = values.Average();

     // Compute the sum of (value - avg)^2.
     double sum = values.Sum(d => Math.Pow(d - avg, 2));

     // Put it all together.      
     standardDeviation = Math.Sqrt((sum) / (values.Count()-1));   
  }  

  return standardDeviation;
}

This is easy enough to adapt for any generic type, so long as we provide a selector for the value being computed. LINQ is great for that: the Select function lets you project, from your generic list of custom types, a sequence of numeric values for which to compute the standard deviation:

List<ValveData> list = ...
var result = CalculateStandardDeviation(list.Select(v => (double)v.SomeField));
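
As a minimal sketch of the selector idea, the projection and the calculation can be folded into one generic extension method. The class name StdDevExtensions and the method name SampleStandardDeviation are hypothetical, and returning 0 for fewer than two values is an assumed convention, not something the answer specifies:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StdDevExtensions
{
    // Sample standard deviation (divides by n - 1) of the numbers
    // obtained by applying selector to each element of source.
    public static double SampleStandardDeviation<T>(
        this IEnumerable<T> source, Func<T, double> selector)
    {
        // Project and materialize once, so source is enumerated a single time.
        List<double> values = source.Select(selector).ToList();

        // With fewer than two values n - 1 would be zero or negative;
        // returning 0 here is an assumed convention.
        if (values.Count < 2) return 0;

        double avg = values.Average();
        double sum = values.Sum(d => (d - avg) * (d - avg));
        return Math.Sqrt(sum / (values.Count - 1));
    }
}
```

With the list from the question, a call might look like list.SampleStandardDeviation(v => (double)v.LatchTime).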

Solution 2

The example above is slightly incorrect: if the sequence contains exactly one value, it divides by zero (Count() - 1 is 0), which for doubles yields NaN rather than a meaningful result. The following code is somewhat simpler and gives the "population standard deviation" result. (http://en.wikipedia.org/wiki/Standard_deviation)

using System;
using System.Linq;
using System.Collections.Generic;

public static class Extend
{
    public static double StandardDeviation(this IEnumerable<double> values)
    {
        double avg = values.Average();
        return Math.Sqrt(values.Average(v=>Math.Pow(v-avg,2)));
    }
}
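
For reference, the two definitions under discussion differ only in the divisor. Writing N for the number of values and \bar{x} for their mean (symbols chosen here for illustration), the population form \sigma and the sample form s are:

```latex
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
\qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}
```

As a worked check, for the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the sum of squared differences is 32, so \sigma = \sqrt{32/8} = 2 while s = \sqrt{32/7} \approx 2.138.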

Solution 3

Even though the accepted answer seems mathematically correct, it is wrong from a programming perspective: it enumerates the same sequence 4 times. This might be OK if the underlying object is a list or an array, but if the input is a filtered or aggregated LINQ expression, or if the data comes directly from a database or network stream, performance will be much worse.

I would highly recommend not reinventing the wheel and instead using one of the better open source math libraries, Math.NET. We have been using that library in our company and are very happy with the performance.

PM> Install-Package MathNet.Numerics

using MathNet.Numerics.Statistics;

var populationStdDev = new List<double> { 1d, 2d, 3d, 4d, 5d }.PopulationStandardDeviation();

var sampleStdDev = new List<double> { 2d, 3d, 4d }.StandardDeviation();

See http://numerics.mathdotnet.com/docs/DescriptiveStatistics.html for more information.

Lastly, for those who want the fastest possible result and can sacrifice some precision, read about the "one-pass" algorithm: https://en.wikipedia.org/wiki/Standard_deviation#Rapid_calculation_methods
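
To illustrate what a single-pass calculation can look like, here is a sketch using Welford's online algorithm, which is one of several one-pass methods and not necessarily the one the linked section has in mind. The class and method names are hypothetical, and returning 0 for an empty sequence is an assumed convention:

```csharp
using System;
using System.Collections.Generic;

public static class OnePassStatistics
{
    // Population standard deviation computed in one enumeration of values,
    // using Welford's running mean / running sum-of-squares update.
    public static double PopulationStandardDeviationOnePass(
        this IEnumerable<double> values)
    {
        long count = 0;
        double mean = 0; // running mean of the values seen so far
        double m2 = 0;   // running sum of squared differences from the mean

        foreach (double x in values)
        {
            count++;
            double delta = x - mean;   // distance from the old mean
            mean += delta / count;     // update the mean
            m2 += delta * (x - mean);  // uses both old and new mean
        }

        // Assumed convention: an empty sequence yields 0.
        return count == 0 ? 0 : Math.Sqrt(m2 / count);
    }
}
```

Dividing m2 by count - 1 instead of count would give the sample standard deviation.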

Author by

Tom Hangler

Updated on July 05, 2022

Comments

  • Tom Hangler
    Tom Hangler almost 2 years

    I need to calculate the standard deviation of a generic list. I will try to include my code. It's a generic list with data in it. The data is mostly floats and ints. Here is the code that is relevant to it, without getting into too much detail:

    namespace ValveTesterInterface
    {
        public class ValveDataResults
        {
            private List<ValveData> m_ValveResults;
    
            public ValveDataResults()
            {
                if (m_ValveResults == null)
                {
                    m_ValveResults = new List<ValveData>();
                }
            }
    
            public void AddValveData(ValveData valve)
            {
                m_ValveResults.Add(valve);
            }
    

    Here is the function where the standard deviation needs to be calculated:

            public float LatchStdev()
            {
    
                float sumOfSqrs = 0;
                float meanValue = 0;
                foreach (ValveData value in m_ValveResults)
                {
                    meanValue += value.LatchTime;
                }
                meanValue = (meanValue / m_ValveResults.Count) * 0.02f;
    
                for (int i = 0; i <= m_ValveResults.Count; i++) 
                {   
                    sumOfSqrs += Math.Pow((m_ValveResults - meanValue), 2);  
                }
                return Math.Sqrt(sumOfSqrs /(m_ValveResults.Count - 1));
    
            }
        }
    }
    

    Ignore what's inside the LatchStdev() function because I'm sure it's not right. It's just my poor attempt to calculate the standard deviation. I know how to do it for a list of doubles, but not for a generic list of data objects. If someone has experience with this, please help.

  • Tom Hangler
    Tom Hangler almost 14 years
    My C# doesn't have an Average. It doesn't show up. That's one of my problems. Also, I cannot pass a generic list to my function as a parameter. The mean needs to be computed inside the stdev method, like in my code above. My standard deviation is off, though.
  • Tom Hangler
    Tom Hangler almost 14 years
    Also guys, C# doesn't have an average (Math.Average), so I calculate the mean myself like in my code above. It's the standard deviation that I have the most trouble with. Thanks
  • LBushkin
    LBushkin almost 14 years
    @Tom Hangler, make sure you add using System.Linq; at the top of your file to include the library of LINQ functions. These include both Average() and Select().
  • Tom Hangler
    Tom Hangler almost 14 years
    Oh OK, thanks. I'm sorry, I'm a noob. I don't think that Visual Studio recognizes system.ling. Also, what do the v=> and the d=> stand for? And should all the code you gave me be in my one standard deviation function? Thanks
  • LBushkin
    LBushkin almost 14 years
    It's a 'Q' not a 'G' at the end of System.Linq. I assumed you're using .NET 3.5, if not, then you will not have access to LINQ, and a slightly different solution would be appropriate.
  • LBushkin
    LBushkin almost 14 years
    The v=> and d=> syntax (and what follows) creates a lambda expression: essentially an anonymous function that accepts a parameter v or d (respectively) and uses that to compute some result. You can read more about them here: msdn.microsoft.com/en-us/library/bb397687.aspx
  • Jesse C. Slicer
    Jesse C. Slicer almost 14 years
    Take note that this algorithm implements Sample Standard Deviation as opposed to "plain" Standard Deviation.
  • tenpn
    tenpn almost 13 years
    the if(values.Count()>0) line should probably check for > 1, since you're dividing by values.Count() - 1.
  • Wouter
    Wouter almost 12 years
    This one should be the answer, it calculates Standard Deviation as opposed to the answer by LBushkin which really calculates Sample Standard Deviation
  • Levitikon
    Levitikon about 8 years
    +1 This is the actual Standard Deviation (aka population standard deviation) as opposed to Sample Standard Deviation in LBushkin's answer.
  • BlueSky
    BlueSky over 5 years
    For much faster performance (3.37x on my machine), multiply the terms instead of using Math.Pow: (d - avg) * (d - avg) instead of: Math.Pow(d - avg, 2)
  • BlueSky
    BlueSky over 5 years
    double sum = values.Sum(d => (d - avg) * (d - avg));
  • BlueSky
    BlueSky almost 5 years
    return Math.Sqrt(values.Average(v=> (v-avg) * (v-avg))); is 3.37x faster on my machine. Math.Pow() is much slower than normal multiplication.
  • Jonathan DeMarks
    Jonathan DeMarks almost 5 years
    @BlueSky Thanks for doing the benchmark! I love having both options available to see clearly. Math.Pow() might be a bit more readable but your code is more performant, so folks can choose what is right for their scenario.
  • Aric
    Aric almost 5 years
    When all values are equal to the mean, the standard deviation will be zero. In this case shouldn't ret be assigned an invalid value such as -1 at first to indicate when the standard deviation could not be calculated? Otherwise, there is the (admittedly very rare) possibility of returning a false negative since zero is a valid result.
  • Aric
    Aric almost 5 years
    After more thought, returning zero for an empty population could work, but it may be useful to indicate that there was no data in the return value.
  • Steven.Xi
    Steven.Xi over 3 years
    Mathematically, this is the right answer. However, you should definitely avoid using this code in production: the parameter is IEnumerable<double>, and with this code the IEnumerable will be enumerated twice. As an example, what if this function is invoked on an EF query? The best way is to check whether the IEnumerable can be cast to a collection and, if not, call .ToList() first.
  • Steven.Xi
    Steven.Xi over 3 years
    Same as my comment below: avoid iterating an IEnumerable<T> multiple times in a helper/extension function, as you never know where the IEnumerable is coming from. It could come from a DB query, in which case iterating multiple times results in duplicated DB reads. Please cast or convert it to a collection before iterating.
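
A minimal sketch of the "cast or copy first" advice from the last two comments, applied to the population formula. The class and method names are hypothetical, and returning 0 for an empty input is an assumed convention:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BufferedStatistics
{
    public static double PopulationStandardDeviationBuffered(
        this IEnumerable<double> values)
    {
        // Reuse the input if it is already an in-memory collection;
        // otherwise copy it once, so a lazy source (for example a query)
        // is enumerated only a single time.
        ICollection<double> buffer =
            values as ICollection<double> ?? values.ToList();

        // Assumed convention: an empty input yields 0.
        if (buffer.Count == 0) return 0;

        double avg = buffer.Average();
        return Math.Sqrt(buffer.Average(v => (v - avg) * (v - avg)));
    }
}
```

The two passes over buffer then run against memory rather than against the original source.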