How to get confidence intervals from curve_fit

14,832

Here's a quick and wrong answer: you can approximate the errors from the covariance matrix for your a and b parameters as the square root of its diagonals: np.sqrt(np.diagonal(pcov)). The parameter uncertainties can then be used to draw the confidence intervals.

The answer is wrong because you before you fit your data to a model, you'll need an estimate of the errors on your averaged disc_p1 points. When averaging, you have lost the information about the scatter of the population, leading curve_fit to believe that the y-points you feed it are absolute and undisputable. This might cause an underestimation of your parameter errors.

For an estimate of the uncertainties of your averaged Y values, you need to estimate their dispersion measure and pass it along to curve_fit while saying that your errors are absolute. Below is an example of how to do this for a random dataset where each of your points consists of a 1000 samples drawn from a normal distribution.

from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np

# model function
func = lambda x, a, b: a * (1 / (x**2)) + b 

# approximating OP points
n_ypoints = 7 
x_data = np.linspace(70, 190, n_ypoints)

# approximating the original scatter in Y-data
n_nested_points = 1000
point_errors = 50
y_data = [func(x, 4e6, -100) + np.random.normal(x, point_errors,
          n_nested_points) for x in x_data]

# averages and dispersion of data
y_means = np.array(y_data).mean(axis = 1)
y_spread = np.array(y_data).std(axis = 1)

best_fit_ab, covar = curve_fit(func, x_data, y_means,
                               sigma = y_spread,
                               absolute_sigma = True)
sigma_ab = np.sqrt(np.diagonal(covar))

from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

# plotting the unaveraged data
flier_kwargs = dict(marker = 'o', markerfacecolor = 'silver',
                    markersize = 3, alpha=0.7)
line_kwargs = dict(color = 'k', linewidth = 1)
bp = plt.boxplot(y_data, positions = x_data,
                 capprops = line_kwargs,
                 boxprops = line_kwargs,
                 whiskerprops = line_kwargs,
                 medianprops = line_kwargs,
                 flierprops = flier_kwargs,
                 widths = 5,
                 manage_ticks = False)
# plotting the averaged data with calculated dispersion
#plt.scatter(x_data, y_means, facecolor = 'silver', alpha = 1)
#plt.errorbar(x_data, y_means, y_spread, fmt = 'none', ecolor = 'black')

# plotting the model
hires_x = np.linspace(50, 190, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 800, text_res)
plt.xlim(40, 200)
plt.ylim(0, 1000)
plt.show()

absolutely weighted least squares

Edit: If you are not considering the intrinsic errors on the data points, you are probably fine with using the "qiuck and wrong" case I mentioned before. The square root of the diagonal entries of covariance matrix can then be used to calculate your confidence intervals. However, note that the confidence intervals have shrunk now that we've dropped the uncertainties:

from scipy.optimize import curve_fit
import matplotlib.pylab as plt
import numpy as np

func = lambda x, a, b: a * (1 / (x**2)) + b

n_ypoints = 7
x_data = np.linspace(70, 190, n_ypoints)

y_data = np.array([786.31, 487.27, 341.78, 265.49,
                    224.76, 208.04, 200.22])
best_fit_ab, covar = curve_fit(func, x_data, y_data)
sigma_ab = np.sqrt(np.diagonal(covar))

# an easy way to properly format parameter errors
from uncertainties import ufloat
a = ufloat(best_fit_ab[0], sigma_ab[0])
b = ufloat(best_fit_ab[1], sigma_ab[1])
text_res = "Best fit parameters:\na = {}\nb = {}".format(a, b)
print(text_res)

plt.scatter(x_data, y_data, facecolor = 'silver',
            edgecolor = 'k', s = 10, alpha = 1)

# plotting the model
hires_x = np.linspace(50, 200, 100)
plt.plot(hires_x, func(hires_x, *best_fit_ab), 'black')
bound_upper = func(hires_x, *(best_fit_ab + sigma_ab))
bound_lower = func(hires_x, *(best_fit_ab - sigma_ab))
# plotting the confidence intervals
plt.fill_between(hires_x, bound_lower, bound_upper,
                 color = 'black', alpha = 0.15)
plt.text(140, 630, text_res)
plt.xlim(60, 200)
plt.ylim(0, 800)
plt.show()

no-sigma-case

If you're unsure whether to include the absolute errors or how to estimate them in your case, you'd be better off asking for advice at Cross Validated, as Stack Overflow is mainly for discussion on implementations of regression methods and not for discussion on the underlying statistics.

Share:
14,832
osmak
Author by

osmak

Updated on June 07, 2022

Comments

  • osmak
    osmak about 2 years

    My question involves statistics and python and I am a beginner in both. I am running a simulation, and for each value for the independent variable (X) I produce 1000 values for the dependent variable (Y). What I have done is that I calculated the average of Y for each value of X and fitted these averages using scipy.optimize.curve_fit. The curve fits nicely, but I want to draw also the confidence intervals. I am not sure if what I am doing is correct or if what I want to do can be done, but my question is how can I get the confidence intervals from the covariance matrix produced by curve_fit. The code reads the averages from files first then it just simply uses curve_fit.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.optimize import curve_fit
    
    
    def readTDvsTx(L, B, P, fileformat):
        # L should be '_Fixed_' or '_'
        TD = []
        infile = open(fileformat.format(L, B, P), 'r')
        infile.readline()  # To remove header
        for line in infile:
            l = line.split()  # each line contains TxR followed by CD followed by TD
            if eval(l[0]) >= 70 and eval(l[0]) <=190:
                td = eval(l[2])
                TD.append(td)
        infile.close()
        tdArray = np.array(TD)
    
        return tdArray
    
    
    def rec(x, a, b):
        return a * (1 / (x**2)) + b
    
    
    
    fileformat = 'Densities_file{}BS{}_PRNTS{}.txt'
    txR = np.array(range(70, 200, 20))
    parents = np.array(range(1,6))
    disc_p1 = readTDvsTx('_Fixed_', 5, 1, fileformat)
    
    
    popt, pcov = curve_fit(rec, txR, disc_p1)
    
    
    plt.plot(txR, rec(txR, popt[0], popt[1]), 'r-')
    plt.plot(txR, disc_p1, '.')
    
    print(popt)
    plt.show()
    

    And here is the resulting fit: enter image description here

    • Ulrich Stern
      Ulrich Stern almost 8 years
      The kmpfit module can calculate the confidence band when fitting a non-linear function, see this answer of mine. You will need to use all points for the fitting, not just the averages.
    • Ulrich Stern
      Ulrich Stern almost 8 years
      PS: if you want to do the confidence band calculation yourself, my comment on the answer has a link (to this page).
    • Vlas Sokolov
      Vlas Sokolov over 7 years
      It isn't that trivial to use all the points for the fitting because osmak's function is multivariate.
    • osmak
      osmak over 7 years
      Thank you all for your comments. The thing is I think I have misinterpreted the way I get my values. In my simulation, I search for a certain density which I call Target Density or TD for short. The way I do it is that I run 1000 simulation instances and check the average for those using some criterion, which if satisfied, indicates that I have reached my TD. Increasing the value of the independent variable will not affect the TD, i.e., it is not normally distributed.
  • osmak
    osmak over 7 years
    Thanks for your answer. The thing is I think I have misinterpreted the way I get my values. In my simulation, I search for a certain density which I call Target Density or TD for short. The way I do it is that I run 1000 simulation instances and check the average for those using some criterion, which if satisfied, indicates that I have reached my TD. Increasing the value of the independent variable will not affect the TD, i.e., it is not normally distributed.
  • Vlas Sokolov
    Vlas Sokolov over 7 years
    So the converged TD values come without any uncertainty then?
  • osmak
    osmak over 7 years
    Its not that they don't come without any uncertainty, they are more like limits. I search for the lowest TD (the value of the independent variable) that satisfies a certain criterion, i.e., increasing it will also satisfy the criterion. If I repeat the search for a certain configuration (which may take a couple of days of execution) I usually get the same limit plus/minus 10, but this is not feasible to do because it is very time consuming which makes it very difficult to get statistically sound data.
  • Vlas Sokolov
    Vlas Sokolov over 7 years
    @osmak I see. I've edited my answer to address your comment. If I understand your case, then you'd probably still need to keep in mind that the confidence bounds derived in this fashion are still an approximation for some "best case scenario" of your experiment.
  • osmak
    osmak over 7 years
    Thanks a lot I really appreciate your efforts, you have been very helpful.
  • Steve
    Steve over 3 years
    Nice example, thanks for posting. Note that I got an error running your first code sample: TypeError: boxplot() got an unexpected keyword argument 'manage_xticks' This error went away when I removed the manage_xticks argument from the call to boxplot.
  • Vlas Sokolov
    Vlas Sokolov over 3 years
    @Steve there was an API change in matplotlib, in newer versions manage_xticks became manage_ticks.