Algorithm for scatter plot 'best-fit' line

13,063

Solution 1

using a Linear least squares algorithm

public class XYPoint
{
    public int X;
    public double Y;
}

class Program
{
    public static List<XYPoint> GenerateLinearBestFit(List<XYPoint> points, out double a, out double b)
    {
        int numPoints = points.Count;
        double meanX = points.Average(point => point.X);
        double meanY = points.Average(point => point.Y);

        double sumXSquared = points.Sum(point => point.X * point.X);
        double sumXY = points.Sum(point => point.X * point.Y);

        a = (sumXY / numPoints - meanX * meanY) / (sumXSquared / numPoints - meanX * meanX);
        b = (a * meanX - meanY);

        double a1 = a;
        double b1 = b;

        return points.Select(point => new XYPoint() { X = point.X, Y = a1 * point.X - b1 }).ToList();
    }

    static void Main(string[] args)
    {
        List<XYPoint> points = new List<XYPoint>()
                                   {
                                       new XYPoint() {X = 1, Y = 12},
                                       new XYPoint() {X = 2, Y = 16},
                                       new XYPoint() {X = 3, Y = 34},
                                       new XYPoint() {X = 4, Y = 45},
                                       new XYPoint() {X = 5, Y = 47}
                                   };

        double a, b;

        List<XYPoint> bestFit = GenerateLinearBestFit(points, out a, out b);

        Console.WriteLine("y = {0:#.####}x {1:+#.####;-#.####}", a, -b);

        for(int index = 0; index < points.Count; index++)
        {
            Console.WriteLine("X = {0}, Y = {1}, Fit = {2:#.###}", points[index].X, points[index].Y, bestFit[index].Y);
        }
    }
}

Solution 2

Yes. You will want to use Linear Regression, specifically Simple Linear Regression.

The algorithm is essentially:

  • assume there exists a line of best fit, y = ax + b
  • for each of your points, you want to minimise their distance from this line
  • calculate the distance for each point from the line, and sum the distances (normally we use the square of the distance to more heavily penalise points further from the line)
  • find the values of a and b that minimise the resulting equation using basic calculus (there should be only one minimum)

The wikipedia page will give you everything you need.

Share:
13,063

Related videos on Youtube

veezi
Author by

veezi

Updated on September 15, 2022

Comments

  • veezi
    veezi over 1 year

    I'm writing a small application in C# using MSChart control to do Scatter Plots of sets of X and Y data points. Some of these can be rather large (hundreds of data points).

    Wanted to ask if there's a 'standard' algorith for plotting a best-fit line across the points. I'm thinking to divide the X data points to a predefined number of sets, say 10 or 20, and for each set take the average of the corresponding Y values and the middle X value, and so on to create the line. Is this a correct approach?

    I've searched existing threads but they all seem to be about achieving the same using existing applications like Matlab.

    Thanks,

  • Ryan
    Ryan almost 4 years
    Hey, this is an awesome and elegant solution and one that should be reused! I actually have one correction: 'b = (a * meanX - meanY);' should be flipped to 'b = (meanY - a * meanX);' , as Y=MX+B is B=Y-MX