Class SimpleRegression

java.lang.Object
org.yamcs.external.SimpleRegression

public class SimpleRegression extends Object
Estimates an ordinary least squares regression model with one independent variable.

y = intercept + slope * x

Standard errors for intercept and slope are available as well as ANOVA, r-square and Pearson's r statistics.

Observations (x,y pairs) can be added to the model one at a time or they can be provided in a 2-dimensional array. The observations are not stored in memory, so there is no limit to the number of observations that can be added to the model.

Usage Notes:

  • When there are fewer than two observations in the model, or when there is no variation in the x values (i.e. all x values are the same), all statistics return NaN. At least two observations with different x coordinates are required to estimate a bivariate regression model.
  • Getters for the statistics always compute values based on the current set of observations -- i.e., you can get statistics, then add more data and get updated statistics without using a new instance. There is no "compute" method that updates all statistics. Each of the getters performs the necessary computations to return the requested statistic.
  • The intercept term may be suppressed by passing false to the SimpleRegression(boolean) constructor. When the hasIntercept property is false, the model is estimated without a constant term and getIntercept() returns 0.
  • Constructor Details

    • SimpleRegression

      public SimpleRegression()
      Creates an empty SimpleRegression instance.
    • SimpleRegression

      public SimpleRegression(boolean includeIntercept)
      Create a SimpleRegression instance, specifying whether or not to estimate an intercept.

      Use false to estimate a model with no intercept. When the hasIntercept property is false, the model is estimated without a constant term and getIntercept() returns 0.

      Parameters:
      includeIntercept - whether or not to include an intercept term in the regression model
  • Method Details

    • addData

      public void addData(double x, double y)
      Adds the observation (x,y) to the regression data set.

      Uses updating formulas for means and sums of squares defined in "Algorithms for Computing the Sample Variance: Analysis and Recommendations", Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American Statistician, vol. 37, pp. 242-247, referenced in Weisberg, S. "Applied Linear Regression". 2nd Ed. 1985.

      Parameters:
      x - independent variable value
      y - dependent variable value
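
      The single-observation update described above can be sketched as follows. This is a minimal illustration of a one-pass update for means and sums of squared deviations, not the class's actual source; the class name RunningSums and its field names are hypothetical.

```java
/** Hypothetical accumulator illustrating a one-pass update; not the library's source. */
final class RunningSums {
    long n;        // number of observations seen so far
    double xbar;   // running mean of x
    double ybar;   // running mean of y
    double sumXX;  // sum of squared deviations of x about its mean
    double sumYY;  // sum of squared deviations of y about its mean
    double sumXY;  // sum of products of x and y deviations

    void add(double x, double y) {
        if (n == 0) {
            // The first observation defines the means; all deviation sums stay 0.
            xbar = x;
            ybar = y;
        } else {
            double weight = n / (n + 1.0);
            double dx = x - xbar; // deviation from the mean before the update
            double dy = y - ybar;
            sumXX += dx * dx * weight;
            sumYY += dy * dy * weight;
            sumXY += dx * dy * weight;
            xbar += dx / (n + 1.0);
            ybar += dy / (n + 1.0);
        }
        n++;
    }
}
```

      Because only the count, the two means and the three sums are kept, memory use does not grow with the number of observations.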
    • append

      public void append(SimpleRegression reg)
      Appends data from another regression calculation to this one.

      The mean update formulae are based on a paper written by Philippe Pébay: Formulas for Robust, One-Pass Parallel Computation of Covariances and Arbitrary-Order Statistical Moments, 2008, Technical Report SAND2008-6212, Sandia National Laboratories.

      Parameters:
      reg - model to append data from
      Since:
      3.3
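
      As an illustration of how two sets of running sums can be combined without revisiting the raw data, here is a sketch of the standard pairwise merge for means and co-moments. It is written under the assumption that each side tracks a count, two means and three deviation sums; the class MergeableSums is hypothetical and is not the class's actual implementation.

```java
/** Hypothetical accumulator showing a pairwise merge; not the library's source. */
final class MergeableSums {
    long n;
    double xbar, ybar;          // running means
    double sumXX, sumYY, sumXY; // sums of squared deviations and cross deviations

    void add(double x, double y) {
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double weight = n / (n + 1.0);
            double dx = x - xbar;
            double dy = y - ybar;
            sumXX += dx * dx * weight;
            sumYY += dy * dy * weight;
            sumXY += dx * dy * weight;
            xbar += dx / (n + 1.0);
            ybar += dy / (n + 1.0);
        }
        n++;
    }

    /** Folds the observations summarized by other into this accumulator. */
    void merge(MergeableSums other) {
        if (other.n == 0) {
            return; // nothing to add
        }
        if (n == 0) {
            // This side is empty: copy the other side's state.
            n = other.n;
            xbar = other.xbar;
            ybar = other.ybar;
            sumXX = other.sumXX;
            sumYY = other.sumYY;
            sumXY = other.sumXY;
            return;
        }
        double total = n + other.n;
        double dx = other.xbar - xbar; // difference between the two means
        double dy = other.ybar - ybar;
        double weight = (double) n * other.n / total;
        sumXX += other.sumXX + dx * dx * weight;
        sumYY += other.sumYY + dy * dy * weight;
        sumXY += other.sumXY + dx * dy * weight;
        xbar += dx * other.n / total;
        ybar += dy * other.n / total;
        n += other.n;
    }
}
```

      Merging in this way gives the same sums, up to rounding, as adding every observation to a single accumulator.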
    • removeData

      public void removeData(double x, double y)
      Removes the observation (x,y) from the regression data set.

      Mirrors the addData method. This method permits the use of SimpleRegression instances in streaming mode, where the regression is applied to a sliding "window" of observations; however, the caller is responsible for maintaining the set of observations in the window.

      The method has no effect if there are no data points (i.e. n=0).
      Parameters:
      x - independent variable value
      y - dependent variable value
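
      The sliding-window pattern described above can be sketched as follows. The example is self-contained and does not call this class: the hypothetical SlidingWindowFit keeps the window in a queue (the caller's responsibility in the text above) and reverses the one-pass update when the oldest observation leaves the window. The removal formula shown is an assumption about how such an update can be inverted, not a copy of the class's source.

```java
import java.util.ArrayDeque;

/** Hypothetical sliding-window slope estimator; not the library's source. */
final class SlidingWindowFit {
    private final int capacity;
    private final ArrayDeque<double[]> window = new ArrayDeque<>();
    private long n;
    private double xbar, ybar, sumXX, sumXY;

    SlidingWindowFit(int capacity) {
        if (capacity < 2) {
            throw new IllegalArgumentException("capacity must be at least 2");
        }
        this.capacity = capacity;
    }

    void add(double x, double y) {
        if (window.size() == capacity) {
            // The window is full: evict the oldest observation first.
            double[] oldest = window.removeFirst();
            remove(oldest[0], oldest[1]);
        }
        window.addLast(new double[] {x, y});
        if (n == 0) {
            xbar = x;
            ybar = y;
        } else {
            double weight = n / (n + 1.0);
            double dx = x - xbar;
            double dy = y - ybar;
            sumXX += dx * dx * weight;
            sumXY += dx * dy * weight;
            xbar += dx / (n + 1.0);
            ybar += dy / (n + 1.0);
        }
        n++;
    }

    private void remove(double x, double y) {
        // Only called on a full window, so n >= 2 and n - 1 is never zero.
        double weight = n / (n - 1.0);
        double dx = x - xbar;
        double dy = y - ybar;
        sumXX -= dx * dx * weight;
        sumXY -= dx * dy * weight;
        xbar -= dx / (n - 1.0);
        ybar -= dy / (n - 1.0);
        n--;
    }

    /** Slope over the current window, or NaN if no line can be estimated. */
    double slope() {
        return (n < 2 || sumXX == 0.0) ? Double.NaN : sumXY / sumXX;
    }
}
```

      Repeated add/remove cycles accumulate rounding error, so a long-running window may benefit from periodically rebuilding the sums from the stored observations.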
    • removeData

      public void removeData(double[][] data)
      Removes observations represented by the elements in data.

      If the array is larger than the current n, only the first n elements are processed. This method permits the use of SimpleRegression instances in streaming mode, where the regression is applied to a sliding "window" of observations; however, the caller is responsible for maintaining the set of observations in the window.

      To remove all data, use clear().

      Parameters:
      data - array of observations to be removed
    • clear

      public void clear()
      Clears all data from the model.
    • getN

      public long getN()
      Returns the number of observations that have been added to the model.
      Returns:
      the number of observations that have been added
    • predict

      public double predict(double x)
      Returns the "predicted" y value associated with the supplied x value, based on the data that has been added to the model when this method is invoked.

      predict(x) = intercept + slope * x

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Parameters:
      x - input x value
      Returns:
      predicted y value
    • getIntercept

      public double getIntercept()
      Returns the intercept of the estimated regression line, if hasIntercept() is true; otherwise 0.

      The least squares estimate of the intercept is computed using the normal equations. The intercept is sometimes denoted b0.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      the intercept of the regression line if the model includes an intercept; 0 otherwise
    • hasIntercept

      public boolean hasIntercept()
      Returns true if the model includes an intercept term.
      Returns:
      true if the regression includes an intercept; false otherwise
    • getSlope

      public double getSlope()
      Returns the slope of the estimated regression line.

      The least squares estimate of the slope is computed using the normal equations. The slope is sometimes denoted b1.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      the slope of the regression line
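
      For a model with an intercept, solving the normal equations gives slope = SXY / SXX and intercept = ybar - slope * xbar. The sketch below computes both in two passes over arrays; it is a textbook illustration rather than the class's one-pass implementation, and the class LineFitSketch and its fit method are hypothetical names.

```java
/** Hypothetical two-pass least squares fit; not the library's source. */
final class LineFitSketch {
    /** Returns {intercept, slope}, or {NaN, NaN} when no line can be estimated. */
    static double[] fit(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("x and y must have the same length");
        }
        int n = x.length;
        if (n < 2) {
            return new double[] {Double.NaN, Double.NaN};
        }
        // First pass: means.
        double xbar = 0.0;
        double ybar = 0.0;
        for (int i = 0; i < n; i++) {
            xbar += x[i];
            ybar += y[i];
        }
        xbar /= n;
        ybar /= n;
        // Second pass: sums of squared deviations and cross deviations.
        double sxx = 0.0;
        double sxy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - xbar;
            sxx += dx * dx;
            sxy += dx * (y[i] - ybar);
        }
        if (sxx == 0.0) {
            // No variation in x: the slope is undefined.
            return new double[] {Double.NaN, Double.NaN};
        }
        double slope = sxy / sxx;
        double intercept = ybar - slope * xbar;
        return new double[] {intercept, slope};
    }
}
```

      For example, the points (1,3), (2,5), (3,7) lie exactly on y = 1 + 2x, so the fit returns an intercept of 1 and a slope of 2.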
    • getSumSquaredErrors

      public double getSumSquaredErrors()
      Returns the sum of squared errors (SSE) associated with the regression model.

      The sum is computed using the computational formula

      SSE = SYY - (SXY * SXY / SXX)

      where SYY is the sum of the squared deviations of the y values about their mean, SXX is similarly defined and SXY is the sum of the products of x and y mean deviations.

      The sums are accumulated using the updating algorithm referenced in addData(double, double).

      The return value is constrained to be non-negative - i.e., if due to rounding errors the computational formula returns a negative result, 0 is returned.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      sum of squared errors associated with the regression model
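
      The computational formula and the non-negativity constraint described above can be sketched in a few lines. The helper below is hypothetical (SseSketch is not part of this class) and takes the three sums directly as arguments.

```java
/** Hypothetical helper illustrating the SSE computational formula. */
final class SseSketch {
    /**
     * SSE = SYY - (SXY * SXY / SXX), clamped at 0 so that rounding error
     * cannot produce a negative sum of squares. A NaN input propagates.
     */
    static double sse(double syy, double sxy, double sxx) {
        return Math.max(0.0, syy - sxy * sxy / sxx);
    }
}
```

      For the observations (1,1), (2,3), (3,2), the sums are SXX = 2, SYY = 2 and SXY = 1, so SSE = 2 - 1/2 = 1.5.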
    • getTotalSumSquares

      public double getTotalSumSquares()
      Returns the sum of squared deviations of the y values about their mean.

      This is defined as SSTO here.

      If n < 2, this returns Double.NaN.

      Returns:
      sum of squared deviations of y values
    • getXSumSquares

      public double getXSumSquares()
      Returns the sum of squared deviations of the x values about their mean. If n < 2, this returns Double.NaN.
      Returns:
      sum of squared deviations of x values
    • getSumOfCrossProducts

      public double getSumOfCrossProducts()
      Returns the sum of cross products, xi*yi.
      Returns:
      sum of cross products
    • getRegressionSumSquares

      public double getRegressionSumSquares()
      Returns the sum of squared deviations of the predicted y values about their mean (which equals the mean of y).

      This is usually abbreviated SSR or SSM. It is defined as SSM here.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      sum of squared deviations of predicted y values
    • getMeanSquareError

      public double getMeanSquareError()
      Returns the sum of squared errors divided by the degrees of freedom, usually abbreviated MSE.

      If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

      Returns:
      mean square error (sum of squared errors divided by the degrees of freedom)
    • getR

      public double getR()
      Returns Pearson's product moment correlation coefficient, usually denoted r.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      Pearson's r
    • getRSquare

      public double getRSquare()
      Returns the coefficient of determination, usually denoted r-square.

      Preconditions:

      • At least two observations (with at least two different x values) must have been added before invoking this method. If this method is invoked before a model can be estimated, Double.NaN is returned.
      Returns:
      r-square
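
      The relationship between the sums of squares, r and r-square can be illustrated with textbook formulas for a model with an intercept: the regression sum of squares is SSTO - SSE, r-square is that quantity divided by SSTO, and r is SXY / sqrt(SXX * SYY). The helper class CorrelationSketch below is hypothetical, and the class may compute these values by a different but equivalent route.

```java
/** Hypothetical helpers relating sums of squares to r and r-square. */
final class CorrelationSketch {
    /** Regression sum of squares: the part of SSTO explained by the line. */
    static double regressionSumSquares(double ssto, double sse) {
        return ssto - sse;
    }

    /** Coefficient of determination: explained share of the total variation in y. */
    static double rSquare(double ssto, double sse) {
        return (ssto - sse) / ssto;
    }

    /** Pearson's r; its sign matches the sign of SXY, and so of the slope. */
    static double r(double sxx, double syy, double sxy) {
        return sxy / Math.sqrt(sxx * syy);
    }
}
```

      With the observations (1,1), (2,3), (3,2), where SXX = 2, SYY = SSTO = 2, SXY = 1 and SSE = 1.5, this gives a regression sum of squares of 0.5, r-square = 0.25 and r = 0.5.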
    • getInterceptStdErr

      public double getInterceptStdErr()
      Returns the standard error of the intercept estimate, usually denoted s(b0).

      If there are fewer than three observations in the model, or if there is no variation in x, this returns Double.NaN.

      Additionally, Double.NaN is returned when the intercept is constrained to be zero.
      Returns:
      standard error associated with intercept estimate
    • getSlopeStdErr

      public double getSlopeStdErr()
      Returns the standard error of the slope estimate, usually denoted s(b1).

      If there are fewer than three data pairs in the model, or if there is no variation in x, this returns Double.NaN.

      Returns:
      standard error associated with slope estimate
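
      The mean square error and the two standard errors can be illustrated with textbook formulas. The sketch assumes a model with an intercept, so the degrees of freedom are n - 2; the class StdErrSketch and its method signatures are hypothetical and take the required sums as arguments.

```java
/** Hypothetical helpers using textbook formulas for a model with an intercept. */
final class StdErrSketch {
    /** MSE = SSE / (n - 2); NaN when fewer than three observations are available. */
    static double meanSquareError(long n, double sse) {
        return n < 3 ? Double.NaN : sse / (n - 2);
    }

    /** s(b1) = sqrt(MSE / SXX). */
    static double slopeStdErr(long n, double sse, double sxx) {
        return Math.sqrt(meanSquareError(n, sse) / sxx);
    }

    /** s(b0) = sqrt(MSE * (1/n + xbar^2 / SXX)). */
    static double interceptStdErr(long n, double sse, double sxx, double xbar) {
        return Math.sqrt(meanSquareError(n, sse) * (1.0 / n + xbar * xbar / sxx));
    }
}
```

      For the observations (1,1), (2,3), (3,2), with n = 3, xbar = 2, SXX = 2 and SSE = 1.5, the MSE is 1.5, s(b1) is sqrt(0.75) and s(b0) is sqrt(3.5).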