Finding the parameters of a linear relationship using the least squares method. Linear pairwise regression analysis

Let us approximate the function by a polynomial of degree 2, `P_2(x) = a_0 + a_1 x + a_2 x^2`. To do this, we calculate the coefficients of the normal system of equations: the sums `sum_(i=1)^(n) x_i^k` for `k = 1, …, 4` and `sum_(i=1)^(n) x_i^k y_i` for `k = 0, 1, 2`.

The normal least squares system has the form:

`a_0 n + a_1 sum_(i=1)^(n) x_i + a_2 sum_(i=1)^(n) x_i^2 = sum_(i=1)^(n) y_i`
`a_0 sum_(i=1)^(n) x_i + a_1 sum_(i=1)^(n) x_i^2 + a_2 sum_(i=1)^(n) x_i^3 = sum_(i=1)^(n) x_i y_i`
`a_0 sum_(i=1)^(n) x_i^2 + a_1 sum_(i=1)^(n) x_i^3 + a_2 sum_(i=1)^(n) x_i^4 = sum_(i=1)^(n) x_i^2 y_i`

The solution of this system is easy to find; it yields the coefficients `a_0`, `a_1`, `a_2` of the required polynomial of the 2nd degree `P_2(x)`.
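As a hedged illustration of this computation (an added sketch, not the original worked example; the function name is mine), here is a minimal C++ routine that accumulates the sums above and solves the 3x3 normal system by Gaussian elimination:

#include <vector>

// Fit y ≈ a0 + a1*x + a2*x^2 by least squares: build the 3x3 normal system
// and solve it by Gaussian elimination (no pivoting, for brevity).
std::vector<double> fit_quadratic(const std::vector<double> &x,
                                  const std::vector<double> &y) {
    double S[5] = {0}, T[3] = {0};  // S[k] = sum of x^k, T[k] = sum of x^k * y
    for (size_t i = 0; i < x.size(); i++) {
        double p = 1;
        for (int k = 0; k <= 4; k++) {
            S[k] += p;
            if (k <= 2) T[k] += p * y[i];
            p *= x[i];
        }
    }
    double M[3][4] = {{S[0], S[1], S[2], T[0]},
                      {S[1], S[2], S[3], T[1]},
                      {S[2], S[3], S[4], T[2]}};
    for (int c = 0; c < 3; c++)                  // forward elimination
        for (int r = c + 1; r < 3; r++) {
            double f = M[r][c] / M[c][c];
            for (int k = c; k < 4; k++) M[r][k] -= f * M[c][k];
        }
    std::vector<double> a(3);
    for (int r = 2; r >= 0; r--) {               // back substitution
        a[r] = M[r][3];
        for (int k = r + 1; k < 3; k++) a[r] -= M[r][k] * a[k];
        a[r] /= M[r][r];
    }
    return a;  // {a0, a1, a2}
}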


Example 3. Derivation of a normal system of equations for finding the parameters of the empirical dependence.

Let us derive the system of equations for determining the coefficients `a` and `b` of the function `y = a + b x` that provides the root-mean-square approximation of a given function specified by its values at points `(x_i, y_i)`. We form the functional `Φ(a, b) = sum_(i=1)^(n) (y_i - a - b x_i)^2` and write down the necessary condition for its extremum: the partial derivatives with respect to `a` and `b` must vanish.

Then the normal system takes the form:

`a n + b sum_(i=1)^(n) x_i = sum_(i=1)^(n) y_i`
`a sum_(i=1)^(n) x_i + b sum_(i=1)^(n) x_i^2 = sum_(i=1)^(n) x_i y_i`

We have obtained a linear system of equations for the unknown parameters `a` and `b`, which is easily solved.


Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of aligning (smoothing) them, a second approximating function was obtained.

Using the least squares method, approximate these data by a linear dependence y = ax + b (find the parameters a and b). Find out which of the two lines better (in the sense of the least squares method) aligns the experimental data. Make a drawing.

The essence of the least squares method (LSM).

The task is to find the coefficients of the linear dependence at which the function of two variables a and b, `F(a, b) = sum_(i=1)^(n) (y_i - (a x_i + b))^2`, takes the smallest value. That is, for these a and b the sum of squared deviations of the experimental data from the found straight line will be the smallest. This is the whole point of the least squares method.

Thus, solving the example comes down to finding the extremum of a function of two variables.

Deriving formulas for finding coefficients.

A system of two equations with two unknowns is compiled and solved. We find the partial derivatives of the function `F(a, b)` with respect to the variables a and b and equate these derivatives to zero.

We solve the resulting system of equations by any method (for example, by the substitution method or Cramer's method) and obtain formulas for finding the coefficients by the least squares method (LSM).

At these a and b the function `F(a, b)` takes the smallest value. The proof of this fact is given below, at the end of the page.

That's the whole least squares method. The formula for finding the parameter a contains the sums `sum x_i`, `sum y_i`, `sum x_i y_i`, `sum x_i^2`, and the quantity n, the amount of experimental data:

`a = frac(n sum_(i=1)^(n) x_i y_i - sum_(i=1)^(n) x_i sum_(i=1)^(n) y_i)(n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)`

We recommend calculating the values of these sums separately.

The coefficient b is found after calculating a:

`b = frac(1)(n)(sum_(i=1)^(n) y_i - a sum_(i=1)^(n) x_i)`
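A minimal C++ sketch of these two formulas (an added illustration; the function name is mine, and the data are assumed unweighted):

#include <stdexcept>
#include <vector>

// Least squares fit of y ≈ a*x + b (a = slope, b = intercept).
void fit_line(const std::vector<double> &x, const std::vector<double> &y,
              double &a, double &b) {
    const int n = (int)x.size();
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i]; sxy += x[i]*y[i]; sxx += x[i]*x[i];
    }
    const double d = n*sxx - sx*sx;  // zero when all x_i coincide
    if (d == 0) throw std::runtime_error("degenerate data: all x equal");
    a = (n*sxy - sx*sy) / d;
    b = (sy - a*sx) / n;
}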

It's time to remember the original example.

Solution.

In our example n = 5. We fill out the table for the convenience of calculating the sums that enter the formulas for the required coefficients.

The values ​​in the fourth row of the table are obtained by multiplying the values ​​of the 2nd row by the values ​​of the 3rd row for each number i.

The values ​​in the fifth row of the table are obtained by squaring the values ​​in the 2nd row for each number i.

The values ​​in the last column of the table are the sums of the values ​​across the rows.

We use the least squares formulas to find the coefficients a and b, substituting into them the corresponding values from the last column of the table:

Hence, y = 0.165x + 2.184 is the desired approximating straight line.

It remains to find out which of the two lines, y = 0.165x + 2.184 or the function obtained earlier by alignment, better approximates the original data, that is, to make the estimate by the least squares method.

Error estimation of the least squares method.

To do this, you need to calculate the sums of squared deviations of the original data from each of these lines; the smaller value corresponds to the line that better approximates the original data in the sense of the least squares method.

Since the sum of squared deviations is smaller for the straight line y = 0.165x + 2.184, it approximates the original data better.
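To make the comparison concrete, here is a hedged C++ sketch (added; the second model below is a placeholder, since the competing function is not reproduced above) that computes the sum of squared deviations for a candidate model:

#include <functional>
#include <vector>

// Sum of squared deviations of the data from a candidate model f(x).
double sse(const std::vector<double> &x, const std::vector<double> &y,
           const std::function<double(double)> &f) {
    double s = 0;
    for (size_t i = 0; i < x.size(); i++) {
        double e = y[i] - f(x[i]);
        s += e * e;
    }
    return s;
}

// Usage: the model with the smaller sum approximates the data better, e.g.
//   double s1 = sse(x, y, [](double t) { return 0.165*t + 2.184; });
//   double s2 = sse(x, y, otherModel);  // otherModel is hypothetical here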

Graphic illustration of the least squares (LS) method.

Everything is clearly visible on the graphs. The red line is the found straight line y = 0.165x + 2.184, the blue line is the competing function, and the pink dots are the original data.

Why is this needed, why all these approximations?

I personally use it for data smoothing problems and for interpolation and extrapolation problems (in the original example one might be asked to find the value of the observed quantity y at x = 3 or at x = 6 by the least squares method). But we'll talk more about this later in another section of the site.


Proof.

For the function `F(a, b)` to take the smallest value at the found a and b, it is necessary that at this point the matrix of the quadratic form of the second-order differential of the function `F(a, b) = sum_(i=1)^(n) (y_i - (a x_i + b))^2` be positive definite. Let us show this.

The second-order differential has the form:

`d^2 F = frac(partial^2 F)(partial a^2) da^2 + 2 frac(partial^2 F)(partial a partial b) da db + frac(partial^2 F)(partial b^2) db^2`

That is,

`frac(partial^2 F)(partial a^2) = 2 sum_(i=1)^(n) x_i^2`, `frac(partial^2 F)(partial a partial b) = 2 sum_(i=1)^(n) x_i`, `frac(partial^2 F)(partial b^2) = 2n`

Therefore, the matrix of the quadratic form has the form

`M = ((2 sum_(i=1)^(n) x_i^2, 2 sum_(i=1)^(n) x_i), (2 sum_(i=1)^(n) x_i, 2n))`

and the values of its elements do not depend on a and b.

Let us show that the matrix is positive definite. For this, its angular (leading principal) minors must be positive.

The first-order angular minor is `2 sum_(i=1)^(n) x_i^2 > 0`. The inequality is strict, since the points do not coincide. In what follows we will assume this.

The second-order angular minor is

`4n sum_(i=1)^(n) x_i^2 - 4(sum_(i=1)^(n) x_i)^2 > 0`

This inequality can be proved by the method of mathematical induction.

Conclusion: the found values a and b correspond to the smallest value of the function `F(a, b)` and are therefore the required parameters of the least squares method.


Developing a forecast using the least squares method. Example of problem solution

Extrapolation is a method of scientific research based on extending past and present trends, patterns, and connections to the future development of the object being forecast. Extrapolation methods include the moving average method, the exponential smoothing method, and the least squares method.

The essence of the least squares method consists in minimizing the sum of squared deviations between the observed and calculated values. The calculated values are found from the fitted equation, the regression equation. The smaller the distance between the actual values and the calculated ones, the more accurate the forecast based on the regression equation.

A theoretical analysis of the essence of the phenomenon being studied, the change in which is reflected by a time series, serves as the basis for choosing a curve. Sometimes considerations about the nature of the increase in the levels of the series are taken into account. Thus, if output growth is expected in an arithmetic progression, then smoothing is performed in a straight line. If it turns out that growth is in geometric progression, then smoothing must be done using an exponential function.

The working formula of the least squares method: `Y_(t+1) = a*X + b`, where t+1 is the forecast period; `Y_(t+1)` is the forecast indicator; a and b are coefficients; X is the conventional symbol of time.

The coefficients a and b are calculated using the following formulas:

`a = frac(n sum X Y_f - sum X sum Y_f)(n sum X^2 - (sum X)^2)`, `b = frac(sum Y_f - a sum X)(n)`,

where `Y_f` are the actual values of the time series and n is the number of levels of the time series.
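A small C++ sketch of this forecasting scheme (an added illustration; the names are mine): fit the trend over t = 1..n and extrapolate the required number of periods ahead.

#include <vector>

// Fit Y ≈ a*t + b over t = 1..n and forecast the next `ahead` periods.
std::vector<double> lsm_forecast(const std::vector<double> &Yf, int ahead) {
    const int n = (int)Yf.size();
    double st = 0, sy = 0, sty = 0, stt = 0;
    for (int i = 0; i < n; i++) {
        double t = i + 1;  // the time symbol X takes the values 1, 2, ..., n
        st += t; sy += Yf[i]; sty += t*Yf[i]; stt += t*t;
    }
    double a = (n*sty - st*sy) / (n*stt - st*st);
    double b = (sy - a*st) / n;
    std::vector<double> out;
    for (int k = 1; k <= ahead; k++)  // forecast for periods n+1, ..., n+ahead
        out.push_back(a*(n + k) + b);
    return out;
}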

Smoothing a time series by the least squares method serves to reflect the pattern of development of the phenomenon under study. In the analytical expression of the trend, time is treated as the independent variable, and the levels of the series act as a function of this independent variable.

The development of a phenomenon does not depend on how many years have passed since the starting point, but on what factors influenced its development, in what direction and with what intensity. From here it is clear that the development of a phenomenon over time is the result of the action of these factors.

Correctly establishing the type of curve, that is, the type of analytical dependence on time, is one of the most difficult tasks of predictive analysis.

The type of function describing the trend, whose parameters are determined by the least squares method, is selected in most cases empirically: several functions are constructed and compared with one another by the value of the mean square error, calculated by the formula:

`sigma = sqrt(frac(sum (Y_f - Y_r)^2)(n - p))`

where `Y_f` are the actual values of the time series; `Y_r` are the calculated (smoothed) values of the time series; n is the number of levels of the time series; p is the number of parameters in the formula describing the trend (the development trend).
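This error can be computed directly; a hedged C++ sketch (added illustration; `Yf` and `Yr` are the actual and smoothed series):

#include <cmath>
#include <vector>

// Mean square error of a fitted trend: sqrt(sum (Yf - Yr)^2 / (n - p)).
double trend_sigma(const std::vector<double> &Yf,
                   const std::vector<double> &Yr, int p) {
    double s = 0;
    for (size_t i = 0; i < Yf.size(); i++) {
        double e = Yf[i] - Yr[i];
        s += e * e;
    }
    return std::sqrt(s / (double)(Yf.size() - p));  // p = 2 for a linear trend
}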

Disadvantages of the least squares method:

  • when trying to describe the economic phenomenon under study by a mathematical equation, the forecast will be accurate only for a short period of time, and the regression equation should be recalculated as new information becomes available;
  • the complexity of selecting a regression equation that is solvable using standard computer programs.

An example of using the least squares method to develop a forecast

Task. There are data characterizing the unemployment rate in the region, %.

  • Construct a forecast of the unemployment rate in the region for November, December, January using the following methods: moving average, exponential smoothing, least squares.
  • Calculate the errors in the resulting forecasts using each method.
  • Compare the results and draw conclusions.

Least squares solution

To solve the problem, let us create a table in which we will perform the necessary calculations:

ε = 28.63/10 = 2.86%; the forecast accuracy is high.

Conclusion: Comparing the results obtained by the moving average method, the exponential smoothing method, and the least squares method, we can say that the average relative error when calculating by the exponential smoothing method falls within the range of 20-50%. This means that the forecast accuracy in this case is only satisfactory.

In the first and third cases the forecast accuracy is high, since the average relative error is less than 10%. But the moving average method made it possible to obtain more reliable results (forecast for November: 1.52%, for December: 1.53%, for January: 1.49%), since the average relative error when using this method is the smallest, 1.13%.


LSM program


Data and approximation y = a + b x

`i` - the number of the experimental point;
`x_i` - the value of the fixed parameter at point `i`;
`y_i` - the value of the measured parameter at point `i`;
`ω_i` - the weight of the measurement at point `i`;
`y_i, calc.` - the value of `y` at point `i` calculated from the regression;
`Δy_i` - the difference between the measured and the regression-calculated value of `y` at point `i`;
`S_(x_i)(x_i)` - the estimate of the error of `x_i` when measuring `y` at point `i`.

Data and approximation y = k x

i | x_i | y_i | ω_i | y_i, calc. | Δy_i | S_(x_i)(x_i)


User's manual for the LSM online program.

In the data field, enter the values of `x` and `y` at one experimental point on each separate line. The values must be separated by a whitespace character (a space or a tab).

The third value can be the weight of the point `w`. If the weight of a point is not specified, it is taken equal to one. In the vast majority of cases the weights of the experimental points are unknown or not calculated, i.e. all experimental data are treated as equivalent. Sometimes the weights in the studied range of values are definitely not equivalent and can even be calculated theoretically. For example, in spectrophotometry weights can be calculated using simple formulas, although this is mostly neglected to reduce labor costs.
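For reference, a hedged sketch (my illustration, not the program's actual source) of how such weights enter a linear fit: each point's contribution to the minimized sum, and hence to the normal-equation sums, is multiplied by its weight `w_i`.

#include <vector>

// Weighted least squares for y ≈ a + b*x: minimize sum w_i*(y_i - a - b*x_i)^2.
void fit_weighted(const std::vector<double> &x, const std::vector<double> &y,
                  const std::vector<double> &w, double &a, double &b) {
    double sw = 0, sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (size_t i = 0; i < x.size(); i++) {
        sw  += w[i];
        sx  += w[i] * x[i];
        sy  += w[i] * y[i];
        sxy += w[i] * x[i] * y[i];
        sxx += w[i] * x[i] * x[i];
    }
    b = (sw*sxy - sx*sy) / (sw*sxx - sx*sx);  // slope
    a = (sy - b*sx) / sw;                     // intercept, as in y = a + b x
}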

Data can be pasted via the clipboard from a spreadsheet in an office suite such as Excel from Microsoft Office or Calc from Open Office. To do this, in the spreadsheet, select the range of data to copy, copy to the clipboard, and paste the data into the data field on this page.

To calculate by the least squares method, at least two points are needed to determine the two coefficients: `b`, the tangent of the angle of inclination of the line, and `a`, the value intercepted by the line on the `y` axis.

To estimate the error of the calculated regression coefficients, you need to set the number of experimental points to more than two.

Least squares method (LSM).

The greater the number of experimental points, the more accurate the statistical estimate of the coefficients (due to the reduction of the Student coefficient) and the closer the estimate is to the estimate for the general population.

Obtaining values at each experimental point is often associated with significant labor costs, so a compromise number of experiments is often chosen: one that gives a tractable estimate and does not lead to excessive labor costs. As a rule, for a linear least squares dependence with two coefficients the number of experimental points is chosen in the region of 5-7.

A Brief Theory of Least Squares for Linear Relationships

Let's say we have a set of experimental data in the form of pairs of values ​​[`y_i`, `x_i`], where `i` is the number of one experimental measurement from 1 to `n`; `y_i` - the value of the measured quantity at point `i`; `x_i` - the value of the parameter we set at point `i`.

As an example, consider Ohm's law. By changing the voltage (potential difference) across a section of an electrical circuit, we measure the current passing through that section. Physics gives us a dependence found experimentally:

`I = U/R`,
where `I` is the current strength; `R` - resistance; `U` - voltage.

In this case, `y_i` is the current value being measured, and `x_i` is the voltage value.

As another example, consider the absorption of light by a solution of a substance. Chemistry gives us the formula:

`A = ε l C`,
where `A` is the optical density of the solution; `ε` is the extinction coefficient (molar absorptivity) of the dissolved substance; `l` is the path length of the light through the cuvette with the solution; `C` is the concentration of the dissolved substance.

In this case, `y_i` is the measured value of optical density `A`, and `x_i` is the concentration value of the substance that we specify.

We will consider the case when the relative error in setting `x_i` is significantly less than the relative error in measuring `y_i`. We will also assume that all measured values `y_i` are random and normally distributed, i.e. obey the normal distribution law.

In the case of a linear dependence of `y` on `x`, we can write the theoretical dependence:
`y = a + b x`.

From a geometric point of view, the coefficient `b` denotes the tangent of the angle of inclination of the line to the `x` axis, and the coefficient `a` the value of `y` at the point of intersection of the line with the `y` axis (at `x = 0`).

Finding the regression line parameters.

In an experiment, the measured values `y_i` cannot lie exactly on the theoretical straight line because of measurement errors, which are always inherent in real life. Therefore, the linear equation must be represented by a system of equations:
`y_i = a + b x_i + ε_i` (1),
where `ε_i` is the unknown measurement error of `y` in the `i`-th experiment.

Dependency (1) is also called regression, i.e. the dependence of two quantities on each other with statistical significance.

The task of restoring the dependence is to find the coefficients `a` and `b` from the experimental points [`y_i`, `x_i`].

To find the coefficients `a` and `b`, the least squares method (LSM) is usually used. It is a special case of the maximum likelihood principle.

Let's rewrite (1) in the form `ε_i = y_i - a - b x_i`.

Then the sum of squared errors will be
`Φ = sum_(i=1)^(n) ε_i^2 = sum_(i=1)^(n) (y_i - a - b x_i)^2`. (2)

The least squares principle consists in minimizing the sum (2) with respect to the parameters `a` and `b`.

The minimum is achieved when the partial derivatives of the sum (2) with respect to the coefficients `a` and `b` are equal to zero:
`frac(partial Φ)(partial a) = frac(partial sum_(i=1)^(n) (y_i - a - b x_i)^2)(partial a) = 0`
`frac(partial Φ)(partial b) = frac(partial sum_(i=1)^(n) (y_i - a - b x_i)^2)(partial b) = 0`

Expanding the derivatives, we obtain a system of two equations with two unknowns:
`sum_(i=1)^(n) (2a + 2bx_i - 2y_i) = sum_(i=1)^(n) (a + bx_i - y_i) = 0`
`sum_(i=1)^(n) (2bx_i^2 + 2ax_i - 2x_iy_i) = sum_(i=1)^(n) (bx_i^2 + ax_i - x_iy_i) = 0`

We open the brackets and move the sums that do not depend on the required coefficients to the other side, obtaining a system of linear equations:
`sum_(i=1)^(n) y_i = a n + b sum_(i=1)^(n) x_i`
`sum_(i=1)^(n) x_iy_i = a sum_(i=1)^(n) x_i + b sum_(i=1)^(n) x_i^2`

Solving the resulting system, we find formulas for the coefficients `a` and `b`:

`a = frac(sum_(i=1)^(n) y_i sum_(i=1)^(n) x_i^2 - sum_(i=1)^(n) x_i sum_(i=1)^(n) x_iy_i)(n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)` (3.1)

`b = frac(n sum_(i=1)^(n) x_iy_i - sum_(i=1)^(n) x_i sum_(i=1)^(n) y_i)(n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2)` (3.2)

These formulas have solutions when `n > 1` (the line can be drawn through at least 2 points) and when the determinant `D = n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2 != 0`, i.e. when the points `x_i` in the experiment are not all the same (i.e. when the line is not vertical).

Estimation of errors of regression line coefficients

For a more accurate assessment of the error in calculating the coefficients `a` and `b`, a large number of experimental points is desirable. When `n = 2`, it is impossible to estimate the error of the coefficients, because the approximating line will uniquely pass through two points.

The error of a random variable `V` is determined by the law of error accumulation:
`S_V^2 = sum_(i=1)^p (frac(partial f)(partial z_i))^2 S_(z_i)^2`,
where `p` is the number of parameters `z_i` with errors `S_(z_i)` that affect the error `S_V`;
`f` is the function expressing the dependence of `V` on `z_i`.

Let us write down the law of error accumulation for the errors of the coefficients `a` and `b`:
`S_a^2 = sum_(i=1)^(n)(frac(partial a)(partial y_i))^2 S_(y_i)^2 + sum_(i=1)^(n)(frac(partial a)(partial x_i))^2 S_(x_i)^2 = S_y^2 sum_(i=1)^(n)(frac(partial a)(partial y_i))^2`,
`S_b^2 = sum_(i=1)^(n)(frac(partial b)(partial y_i))^2 S_(y_i)^2 + sum_(i=1)^(n)(frac(partial b)(partial x_i))^2 S_(x_i)^2 = S_y^2 sum_(i=1)^(n)(frac(partial b)(partial y_i))^2`,
because `S_(x_i)^2 = 0` (we stipulated earlier that the error of `x` is negligible).

`S_y^2 = S_(y_i)^2` is the error (the variance, the squared standard deviation) in the measurement of `y`, assuming that the error is uniform for all values of `y`.

Substituting formulas for calculating `a` and `b` into the resulting expressions we get

`S_a^2 = S_y^2 frac(sum_(i=1)^(n) (sum_(i=1)^(n) x_i^2 - x_i sum_(i=1)^(n) x_i)^2)(D^2) = S_y^2 frac((n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2) sum_(i=1)^(n) x_i^2)(D^2) = S_y^2 frac(sum_(i=1)^(n) x_i^2)(D)` (4.1)

`S_b^2 = S_y^2 frac(sum_(i=1)^(n) (n x_i - sum_(i=1)^(n) x_i)^2)(D^2) = S_y^2 frac(n (n sum_(i=1)^(n) x_i^2 - (sum_(i=1)^(n) x_i)^2))(D^2) = S_y^2 frac(n)(D)` (4.2)

In most real experiments the value of `S_y` is not measured. To do so, one would need to carry out several parallel measurements (experiments) at one or several points of the plan, which increases the time (and possibly the cost) of the experiment. Therefore, it is usually assumed that the deviation of `y` from the regression line can be considered random. The estimate of the variance of `y` in this case is calculated by the formula:

`S_y^2 = S_(y, rest)^2 = frac(sum_(i=1)^n (y_i - a - b x_i)^2) (n-2)`.

The `n-2` divisor appears because our number of degrees of freedom has decreased due to the calculation of two coefficients using the same sample of experimental data.

This estimate is also called the residual variance relative to the regression line `S_(y, rest)^2`.

The significance of the coefficients is assessed using Student's t-test:

`t_a = frac(|a|)(S_a)`, `t_b = frac(|b|)(S_b)`

If the calculated criteria `t_a`, `t_b` are less than the tabulated criterion `t(P, n-2)`, then the corresponding coefficient is considered not significantly different from zero with the given probability `P`.

To assess the quality of the description of a linear relationship, you can compare `S_(y, rest)^2` with the variance relative to the mean `S_(bar y)` using the Fisher criterion.

`S_(bar y) = frac(sum_(i=1)^n (y_i - bar y)^2)(n-1) = frac(sum_(i=1)^n (y_i - (sum_(i=1)^n y_i)/n)^2)(n-1)` is the sample estimate of the variance of `y` relative to the mean.

To assess the effectiveness of the regression equation in describing the dependence, the Fisher coefficient is calculated:
`F = S_(bar y) / S_(y, rest)^2`,
which is compared with the tabulated Fisher coefficient `F(P, n-1, n-2)`.

If `F > F(P, n-1, n-2)`, the difference between describing the relationship `y = f(x)` by the regression equation and describing it by the mean is considered statistically significant with probability `P`. That is, the regression describes the dependence better than the scatter of `y` around the mean.
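Pulling formulas (4.1), (4.2) and the two criteria together, here is a compact C++ sketch (an added illustration; the tabulated values `t(P, n-2)` and `F(P, n-1, n-2)` must still be looked up in statistical tables):

#include <cmath>
#include <vector>

struct FitStats {
    double a, b;    // intercept and slope of y = a + b x
    double Sa, Sb;  // standard errors of a and b, formulas (4.1) and (4.2)
    double ta, tb;  // Student's t values for a and b
    double F;       // Fisher ratio: variance about the mean / residual variance
};

FitStats fit_with_stats(const std::vector<double> &x, const std::vector<double> &y) {
    const int n = (int)x.size();
    double sx = 0, sy = 0, sxy = 0, sxx = 0;
    for (int i = 0; i < n; i++) { sx += x[i]; sy += y[i]; sxy += x[i]*y[i]; sxx += x[i]*x[i]; }
    const double D = n*sxx - sx*sx;
    FitStats r;
    r.b = (n*sxy - sx*sy) / D;
    r.a = (sy - r.b*sx) / n;
    double ss_rest = 0, ss_mean = 0, ybar = sy / n;
    for (int i = 0; i < n; i++) {
        double e = y[i] - r.a - r.b*x[i];
        ss_rest += e*e;
        ss_mean += (y[i] - ybar)*(y[i] - ybar);
    }
    double Sy2 = ss_rest / (n - 2);   // residual variance S_(y, rest)^2
    r.Sa = std::sqrt(Sy2 * sxx / D);  // (4.1)
    r.Sb = std::sqrt(Sy2 * n / D);    // (4.2)
    r.ta = std::fabs(r.a) / r.Sa;     // compare with the tabulated t(P, n-2)
    r.tb = std::fabs(r.b) / r.Sb;
    r.F  = (ss_mean / (n - 1)) / Sy2; // compare with the tabulated F(P, n-1, n-2)
    return r;
}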


Least squares method. The least squares method refers to the determination of the unknown parameters a, b, c, … of an accepted functional dependence

y = f(x,a,b,c,…),

which would provide a minimum of the mean square (variance) of the error

`sigma^2 = frac(1)(n) sum_(i=1)^(n) (y_i - f(x_i, a, b, c, …))^2` (24)

where `x_i, y_i` is the set of pairs of numbers obtained from the experiment.

Since the condition for an extremum of a function of several variables is that its partial derivatives be equal to zero, the parameters a, b, c, … are determined from the system of equations:

`frac(partial sigma^2)(partial a) = 0`; `frac(partial sigma^2)(partial b) = 0`; `frac(partial sigma^2)(partial c) = 0`; … (25)

It must be remembered that the least squares method is used to select the parameters after the form of the function y = f(x) has been chosen.

If no conclusions can be drawn from theoretical considerations about what the empirical formula should be, then one has to be guided by visual impressions, primarily by graphical representations of the observed data.

In practice, one is most often limited to the following types of functions:

1) linear: `y = a x + b`;

2) quadratic: `y = a x^2 + b x + c`.

Linear regression is widely used in econometrics because of the clear economic interpretation of its parameters.

Linear regression comes down to finding an equation of the form

`y = a + b x` or `hat y_x = a + b x`.

An equation of this form makes it possible, from given values of the factor x, to obtain theoretical values of the resultant characteristic by substituting into it the actual values of the factor x.

The construction of a linear regression comes down to estimating its parameters a and b. Estimates of the linear regression parameters can be found by different methods.

The classical approach to estimating linear regression parameters is based on the least squares method (LSM).

The least squares method makes it possible to obtain estimates of the parameters a and b at which the sum of squared deviations of the actual values of the resultant characteristic y from the calculated (theoretical) values `hat y_x` is at a minimum:

`sum (y - hat y_x)^2 -> min`

To find the minimum of the function, we calculate the partial derivatives with respect to each of the parameters a and b and equate them to zero.

Let us denote `S = sum (y - hat y_x)^2`; then:

`frac(partial S)(partial a) = -2 sum (y - a - b x) = 0`
`frac(partial S)(partial b) = -2 sum x (y - a - b x) = 0`

Transforming these formulas, we obtain the following system of normal equations for estimating the parameters a and b:

`sum y = n a + b sum x`, `sum x y = a sum x + b sum x^2`. (3.5)

Solving the system of normal equations (3.5) either by successive elimination of variables or by the method of determinants, we find the required estimates of the parameters a and b.

The parameter b is called the regression coefficient. Its value shows the average change in the result when the factor changes by one unit.

The regression equation is always supplemented with an indicator of the closeness of the relationship. When linear regression is used, this indicator is the linear correlation coefficient r. There are different modifications of the formula for the linear correlation coefficient; some of them are given below:

As is known, the linear correlation coefficient lies within the limits -1 ≤ r ≤ 1.

To assess the quality of the fit of the linear function, the square of the linear correlation coefficient, called the coefficient of determination, is calculated. The coefficient of determination characterizes the proportion of the variance of the resulting characteristic y explained by the regression in the total variance of the resulting characteristic:

`r^2 = frac(sigma_("explained")^2)(sigma_y^2)`

Accordingly, the value `1 - r^2` characterizes the share of the variance of y caused by the influence of other factors not taken into account in the model.
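A short C++ illustration of these two measures (added; not from the original text):

#include <cmath>
#include <vector>

// Linear correlation coefficient r; the coefficient of determination is r*r.
double correlation(const std::vector<double> &x, const std::vector<double> &y) {
    const int n = (int)x.size();
    double sx = 0, sy = 0, sxy = 0, sxx = 0, syy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxy += x[i]*y[i]; sxx += x[i]*x[i]; syy += y[i]*y[i];
    }
    double num = n*sxy - sx*sy;
    double den = std::sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
    return num / den;
}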

Questions for self-control

1. What is the essence of the least squares method?

2. How many variables does pairwise regression involve?

3. What coefficient determines the closeness of the relationship between the changes?

4. Within what limits does the coefficient of determination lie?

5. How is the parameter b estimated in correlation-regression analysis?


Nonlinear economic models. Nonlinear regression models. Transformation of variables.


If there are nonlinear relationships between economic phenomena, they are expressed by means of the corresponding nonlinear functions: for example, an equilateral hyperbola `y = a + b/x`, a parabola of the second degree `y = a + b x + c x^2`, etc.

There are two classes of nonlinear regressions:

1. Regressions that are nonlinear with respect to the explanatory variables included in the analysis but linear with respect to the estimated parameters, for example:

polynomials of various degrees: `y = a + b x + c x^2`, `y = a + b x + c x^2 + d x^3`;

an equilateral hyperbola: `y = a + b/x`;

a semilogarithmic function: `y = a + b ln x`.

2. Regressions that are nonlinear in the estimated parameters, for example:

power: `y = a x^b`;

exponential: `y = a b^x`;

exponential of a linear form: `y = e^(a + b x)`.
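Regressions of the second class are usually reduced to linear form by a transformation of variables. A hedged C++ sketch (my illustration) for the power function `y = a x^b`, fitted by ordinary least squares on the logarithms:

#include <cmath>
#include <vector>

// Fit y ≈ a * x^b by linearizing: ln y = ln a + b * ln x (requires x, y > 0).
void fit_power(const std::vector<double> &x, const std::vector<double> &y,
               double &a, double &b) {
    const int n = (int)x.size();
    double su = 0, sv = 0, suv = 0, suu = 0;
    for (int i = 0; i < n; i++) {
        double u = std::log(x[i]), v = std::log(y[i]);
        su += u; sv += v; suv += u*v; suu += u*u;
    }
    b = (n*suv - su*sv) / (n*suu - su*su);  // the exponent
    a = std::exp((sv - b*su) / n);          // the scale factor
}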

The total sum of squared deviations of the individual values of the resulting characteristic y from the average value `bar y` is caused by the influence of many reasons. Let us conditionally divide the entire set of reasons into two groups: the factor under study x and other factors.

If the factor does not influence the result, then the regression line on the graph is parallel to the axis Ox and `hat y = bar y`.

Then the entire variance of the resulting characteristic is due to the influence of other factors, and the total sum of squared deviations coincides with the residual sum. If the other factors do not influence the result, then y is related to x functionally, and the residual sum of squares is zero. In this case, the sum of squared deviations explained by the regression equals the total sum of squares.

Since not all points of the correlation field lie on the regression line, their scatter always arises both as a result of the influence of the factor x, i.e. the regression of y on x, and as a result of other causes (unexplained variation). The suitability of the regression line for forecasting depends on what part of the total variation of the characteristic y is accounted for by the explained variation.

Obviously, if the sum of squared deviations due to the regression is greater than the residual sum of squares, then the regression equation is statistically significant and the factor x has a significant impact on the result y.


The significance of the regression equation as a whole is assessed using the F-test (Fisher's criterion). A null hypothesis is put forward that the regression coefficient is equal to zero, i.e. b = 0, and therefore the factor x does not affect the result y.

The immediate calculation of the F-test is preceded by an analysis of variance. The central place in it is occupied by the decomposition of the total sum of squared deviations of the variable y from the average value `bar y` into two parts, the "explained" and the "unexplained":

`sum (y - bar y)^2 = sum (hat y_x - bar y)^2 + sum (y - hat y_x)^2`

where `sum (y - bar y)^2` is the total sum of squared deviations; `sum (hat y_x - bar y)^2` is the sum of squared deviations explained by the regression; `sum (y - hat y_x)^2` is the residual sum of squared deviations.

Any sum of squared deviations is related to the number of degrees of freedom, i.e. to the number of independent variations of the characteristic. The number of degrees of freedom is related to the number of units of the population n and to the number of constants determined from it. In relation to the problem under study, the number of degrees of freedom should show how many independent deviations out of the n possible ones are required to form a given sum of squares.

The dispersion per one degree of freedom is D.

F-ratio (F-test):

`F = frac(D_(fact))(D_(resid))`

If the null hypothesis is true, the factor variance and the residual variance do not differ from each other. For H_0 to be rejected, the factor variance must exceed the residual variance several times. The statistician Snedecor developed tables of critical values of F-ratios for different levels of significance of the null hypothesis and different numbers of degrees of freedom. The tabulated value of the F-criterion is the maximum value of the ratio of the variances that can occur by chance for a given probability level of the null hypothesis. The calculated value of the F-ratio is considered reliable if it is greater than the tabulated one.

In this case, the null hypothesis about the absence of a relationship between the characteristics is rejected and a conclusion is drawn about the significance of this relationship: if F_fact > F_table, then H_0 is rejected.

If the value is less than the tabulated one, F_fact < F_table, then the probability of the null hypothesis is higher than the specified level, and it cannot be rejected without serious risk of drawing a wrong conclusion about the presence of a relationship. In this case, the regression equation is considered statistically insignificant, but H_0 is not rejected.

Standard error of the regression coefficient

To assess the significance of the regression coefficient, its value is compared with its standard error, i.e. the actual value of Student's t-test `t_b = frac(b)(m_b)` is determined and then compared with the tabulated value at a certain significance level and number of degrees of freedom (n - 2).

The standard error of the parameter a is defined similarly.

The significance of the linear correlation coefficient is checked on the basis of the magnitude of the error of the correlation coefficient `m_r`:

`m_r = sqrt(frac(1 - r^2)(n - 2))`, `t_r = frac(r)(m_r)`

The total variance of the characteristic x:

`sigma_x^2 = frac(sum (x - bar x)^2)(n)`

Multiple Linear Regression

Model building

Multiple regression is a regression of the resulting characteristic on two or more factors, i.e. a model of the form

`y = f(x_1, x_2, …, x_p)`

Regression can give good results in modeling if the influence of other factors affecting the object of study can be neglected. The behavior of individual economic variables cannot be controlled, i.e. it is not possible to ensure that all other conditions are equal when assessing the influence of a single factor under study. In this case, one should try to account for the influence of other factors by introducing them into the model, i.e. by constructing a multiple regression equation: `y = a + b_1 x_1 + b_2 x_2 + … + b_p x_p + ε`.

The main goal of multiple regression is to build a model with a large number of factors and to determine the influence of each of them separately, as well as their combined impact on the modeled indicator. The specification of the model includes two ranges of issues: the selection of factors and the choice of the type of the regression equation.

I am a mathematician and programmer. The biggest leap I took in my career was when I learned to say: "I do not understand anything!" Now I am not ashamed to tell a luminary of science who is lecturing me that I do not understand what he, the luminary, is telling me. And it is very difficult. Yes, admitting your ignorance is difficult and embarrassing. Who likes to admit that he doesn't know the basics of something? Due to my profession, I have to attend a large number of presentations and lectures where, I admit, in the vast majority of cases I want to sleep because I do not understand anything. And I don't understand because the huge problem of the current situation in science lies in mathematics. It assumes that all listeners are familiar with absolutely all areas of mathematics (which is absurd). Admitting that you don't know what a derivative is (we'll talk about what it is a little later) is shameful.

But I've learned to say that I don't know what multiplication is. Yes, I don't know what a subalgebra over a Lie algebra is. Yes, I don’t know why quadratic equations are needed in life. By the way, if you are sure that you know, then we have something to talk about! Mathematics is a series of tricks. Mathematicians try to confuse and intimidate the public; where there is no confusion, there is no reputation, no authority. Yes, it is prestigious to speak in as abstract a language as possible, which is complete nonsense.

Do you know what a derivative is? Most likely you will tell me about the limit of the difference quotient. In my first year at the Faculty of Mathematics and Mechanics of St. Petersburg State University, Viktor Petrovich Khavin defined the derivative for me as the coefficient of the first term of the Taylor series of a function at a point (this was separate gymnastics: determining the Taylor series without derivatives). I laughed at this definition for a long time until I finally understood what it was about. The derivative is nothing more than a simple measure of how similar the function we are differentiating is to the functions y=x, y=x^2, y=x^3.

I now have the honor of lecturing to students who are afraid of mathematics. If you are afraid of mathematics, we are on the same path. As soon as you try to read some text and it seems to you that it is overly complicated, know that it is poorly written. I assert that there is not a single area of mathematics that cannot be discussed "on the fingers" without losing accuracy.

Assignment for the near future: I assigned my students to understand what a linear quadratic regulator is. Don’t be shy, spend three minutes of your life and follow the link. If you don’t understand anything, then we are on the same path. I (a professional mathematician-programmer) didn’t understand anything either. And I assure you, you can figure this out “on your fingers.” At the moment I don't know what it is, but I assure you that we will be able to figure it out.

So, the first lecture that I am going to give my students after they come running to me in horror, saying that a linear-quadratic regulator is a terrible thing you will never master in your life, is on least squares methods. Can you solve linear equations? If you are reading this text, then most likely not.

So, given two points (x0, y0), (x1, y1), for example, (1,1) and (3,2), the task is to find the equation of the line passing through these two points:

illustration

This line should have an equation of the form:

`y = alpha*x + beta`

Here alpha and beta are unknown to us, but two points of this line are known:

`alpha*x0 + beta = y0`
`alpha*x1 + beta = y1`

We can write this system in matrix form:

`((x0, 1), (x1, 1)) ((alpha), (beta)) = ((y0), (y1))`

Here we should make a lyrical digression: what is a matrix? A matrix is nothing more than a two-dimensional array. This is a way of storing data; no further meanings should be attached to it. It is up to us exactly how to interpret a certain matrix. Periodically I will interpret it as a linear mapping, periodically as a quadratic form, and sometimes simply as a set of vectors. This will all be clarified in context.

Let's replace the concrete matrices with their symbolic representation:

`A x = b`

Then (alpha, beta) can easily be found:

`x = A^(-1) b`

More specifically for our previous data:

`((alpha), (beta)) = ((1, 1), (3, 1))^(-1) ((1), (2)) = ((1/2), (1/2))`

Which leads to the following equation of the line passing through the points (1,1) and (3,2):

`y = x/2 + 1/2`

Okay, everything is clear here. Let's find the equation of the line passing through three points (x0,y0), (x1,y1) and (x2,y2):

`alpha*x0 + beta = y0`
`alpha*x1 + beta = y1`
`alpha*x2 + beta = y2`

Oh-oh-oh, but we have three equations for two unknowns! A standard mathematician will say that there is no solution. What will the programmer say? He will first rewrite the previous system of equations in the following form:

`alpha ((x0), (x1), (x2)) + beta ((1), (1), (1)) = ((y0), (y1), (y2))`, that is, `alpha*i + beta*j = b`

In our case, the vectors i, j, b are three-dimensional, therefore (in the general case) there is no solution to this system. Any vector (alpha*i + beta*j) lies in the plane spanned by the vectors (i, j). If b does not belong to this plane, then there is no solution (equality cannot be achieved in the equation). What to do? Let's look for a compromise. Let us denote by e(alpha, beta) exactly how far we have fallen short of equality:

`e(alpha, beta) = alpha*i + beta*j - b`

And we will try to minimize this error:

`||e(alpha, beta)||^2 -> min`

Why square?

We are looking not just for the minimum of the norm but for the minimum of the square of the norm. Why? The minimum point itself is the same, but the square gives a smooth function (a quadratic function of the arguments (alpha, beta)), while the plain length gives a cone-shaped function, non-differentiable at the minimum point. Brr. A square is more convenient.

Obviously, the error is minimized when the vector e is orthogonal to the plane spanned by the vectors i and j.

Illustration

In other words: we are looking for a straight line such that the sum of the squared lengths of the distances from all points to this straight line is minimal:

UPDATE: I have a problem here; the distance to the straight line should be measured vertically, not by orthogonal projection. This commenter is right.

Illustration

In completely different words (carefully, this is poorly formalized, but it should be clear on the fingers): we take all possible lines between all pairs of points and look for the average line among them all:

Illustration

Another straightforward explanation: we attach a spring between every data point (here we have three) and the straight line we are looking for; the straight line of the equilibrium state is exactly what we seek.

Minimum quadratic form

So, given the vector b and the plane spanned by the column vectors of the matrix A (in this case (x0,x1,x2) and (1,1,1)), we are looking for the vector e with a minimum square of length. Obviously, the minimum is achievable only for a vector e orthogonal to the plane spanned by the column vectors of the matrix A:

`A^T e = 0`

In other words, we are looking for a vector x=(alpha, beta) such that:

`A^T A x = A^T b`

Let me remind you that this vector x=(alpha, beta) is the minimum of the quadratic function ||e(alpha, beta)||^2:

Here it would be useful to remember that a matrix can be interpreted also as a quadratic form; for example, the identity matrix ((1,0),(0,1)) can be interpreted as the function x^2 + y^2:

quadratic form

All this gymnastics is known under the name of linear regression.
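To make this concrete, a small C++ sketch (added as an illustration) that forms `A^T A` and `A^T b` for n points and solves the resulting 2x2 system for (alpha, beta) by Cramer's rule:

#include <vector>

// Solve the overdetermined system alpha*x_i + beta = y_i via the normal
// equation A^T A x = A^T b, where the columns of A are (x_i) and (1, ..., 1).
void normal_equation_line(const std::vector<double> &x, const std::vector<double> &y,
                          double &alpha, double &beta) {
    double a11 = 0, a12 = 0, a22 = 0, b1 = 0, b2 = 0;
    for (size_t i = 0; i < x.size(); i++) {
        a11 += x[i]*x[i];  // (A^T A)_11 = sum of x_i^2
        a12 += x[i];       // (A^T A)_12 = (A^T A)_21 = sum of x_i
        a22 += 1;          // (A^T A)_22 = n
        b1  += x[i]*y[i];  // (A^T b)_1  = sum of x_i*y_i
        b2  += y[i];       // (A^T b)_2  = sum of y_i
    }
    double det = a11*a22 - a12*a12;
    alpha = (b1*a22 - b2*a12) / det;
    beta  = (a11*b2 - a12*b1) / det;
}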

Laplace's equation with Dirichlet boundary condition

Now the simplest real task: there is a certain triangulated surface, and it needs to be smoothed. For example, let's load a model of my face:

The original commit is available. To minimize external dependencies, I took the code of my software renderer, already described on Habr. To solve the linear system I use OpenNL, an excellent solver that is very easy to install: you need to copy two files (.h+.c) to your project folder. All the smoothing is done by the following code:

for (int d=0; d<3; d++) {
    nlNewContext();
    nlSolverParameteri(NL_NB_VARIABLES, verts.size());
    nlSolverParameteri(NL_LEAST_SQUARES, NL_TRUE);
    nlBegin(NL_SYSTEM);
    nlBegin(NL_MATRIX);
    for (int i=0; i<(int)verts.size(); i++) {  // tie each vertex to its original position
        nlBegin(NL_ROW);
        nlCoefficient(i, 1);
        nlRightHandSide(verts[i][d]);
        nlEnd(NL_ROW);
    }
    for (unsigned int i=0; i<faces.size(); i++) {  // a spring on every triangle edge
        std::vector<int> &face = faces[i];  // assumed type: three vertex indices per face
        for (int j=0; j<3; j++) {
            nlBegin(NL_ROW);
            nlCoefficient(face[ j     ],  1);
            nlCoefficient(face[(j+1)%3], -1);
            nlEnd(NL_ROW);
        }
    }
    nlEnd(NL_MATRIX);
    nlEnd(NL_SYSTEM);
    nlSolve();
    for (int i=0; i<(int)verts.size(); i++) {  // read the smoothed coordinates back
        verts[i][d] = nlGetVariable(i);
    }
}

The X, Y, and Z coordinates are separable, so I smooth them separately. That is, I solve three systems of linear equations, each with the number of variables equal to the number of vertices in my model. The first n rows of the matrix A have only one 1 per row, and the first n entries of the vector b hold the original model coordinates. That is, I tie a spring between the new position of a vertex and its old position: the new positions should not move too far from the old ones.

All subsequent rows of the matrix A (faces.size()*3 = the number of edges of all triangles in the mesh) have one entry equal to 1 and one entry equal to -1, with the corresponding component of the vector b equal to zero. This means I put a spring on each edge of our triangular mesh: all edges try to have the same starting and ending vertex.

Once again: all vertices are variables, and they cannot move far from their original position, but at the same time they try to become similar to each other.

Here's the result:

Everything would be fine; the model is really smoothed, but it has moved away from its original border. Let's change the code a little:

for (int i=0; i<(int)verts.size(); i++) {
    float scale = border[i] ? 1000 : 1;  // a stiffer spring for border vertices
    nlBegin(NL_ROW);
    nlCoefficient(i, scale);
    nlRightHandSide(scale*verts[i][d]);
    nlEnd(NL_ROW);
}

In our matrix A, for the vertices that are on the border, I add not a row of the category v_i = verts[i][d] but 1000*v_i = 1000*verts[i][d]. What does this change? It changes our quadratic form of the error. Now a unit deviation at a border vertex will cost not one unit as before, but 1000*1000 units. That is, we hung a stronger spring on the border vertices, and the solution will prefer to stretch the others more strongly instead. Here's the result:

Let's double the spring strength between the vertices:
nlCoefficient(face[ j ], 2); nlCoefficient(face[(j+1)%3], -2);

It is logical that the surface has become smoother:

And now even a hundred times stronger:

What is this? Imagine that we have dipped a wire ring in soapy water. The resulting soap film will try to have the least possible curvature while touching the border, our wire ring. This is exactly what we got by fixing the border and asking for a smooth surface inside. Congratulations, we have just solved Laplace's equation with Dirichlet boundary conditions. Sounds cool? But in reality, you just need to solve one system of linear equations.

Poisson's equation

Let's remember another cool name.

Let's say I have an image like this:

Looks good to everyone, but I don’t like the chair.

I'll cut the picture in half:



And I will select a chair with my hands:

Then I'll pull everything that's white in the mask to the left half of the picture, and at the same time I'll require throughout the whole picture that the difference between two neighboring pixels be equal to the difference between the two corresponding neighboring pixels of the right picture:

for (int i=0; i

Here's the result:

Example from life

I deliberately did not polish the results, because I just wanted to show exactly how you can apply least squares methods; this is training code. Let me now give an example from real life:

I have a number of photographs of fabric samples like this:

My task is to make seamless textures from photographs of this quality. To start, I (automatically) look for a repeating pattern:

If I cut this quadrilateral out along straight lines, then because of the distortion the edges will not meet; here is an example of the pattern repeated four times:


Here is a fragment where the seam is clearly visible:

Therefore, I will not cut along a straight line, here is the cutting line:


And here is a pattern repeated four times:


And a fragment of it to make it clearer:

It’s already better, the cut did not go in a straight line, avoiding all sorts of curls, but the seam is still visible due to the uneven lighting in the original photo. This is where the least squares method for Poisson's equation comes to the rescue. Here is the final result after leveling the lighting:

The texture turned out perfectly seamless, and all this automatically from a photo of very mediocre quality. Don't be afraid of mathematics, look for simple explanations, and you will be happy in engineering.


In practice, when modeling various processes (in particular, economic, physical, technical, and social ones), one or another method of calculating approximate values of functions from their known values at certain fixed points is widely used.

This kind of function approximation problem often arises:

    when constructing approximate formulas for calculating the values ​​of characteristic quantities of the process under study using tabular data obtained as a result of the experiment;

    in numerical integration, differentiation, solving differential equations, etc.;

    when it is necessary to calculate the values of functions at intermediate points of the interval under consideration;

    when determining the values ​​of characteristic quantities of a process outside the considered interval, in particular when forecasting.

If, to model a certain process specified by a table, we construct a function that approximately describes this process on the basis of the least squares method, it is called an approximating function (a regression), and the problem of constructing approximating functions is itself called an approximation problem.

This article discusses the capabilities of the MS Excel package for solving this type of problem, in addition, it provides methods and techniques for constructing (creating) regressions for tabulated functions (which is the basis of regression analysis).

Excel has two options for building regressions.

    Adding selected regressions (trendlines) to a diagram built on the basis of a data table for the process characteristic under study (available only if a diagram has been constructed);

    Using the built-in statistical functions of the Excel worksheet, allowing you to obtain regressions (trend lines) directly from the source data table.

Adding trend lines to a chart

For a table of data that describes a process and is represented by a diagram, Excel has an effective regression analysis tool that allows you to:

    build on the basis of the least squares method and add five types of regressions to the diagram, which model the process under study with varying degrees of accuracy;

    add the constructed regression equation to the diagram;

    determine the degree of correspondence of the selected regression to the data displayed on the chart.

Based on chart data, Excel allows you to obtain linear, polynomial, logarithmic, power, and exponential types of regressions, which are specified by the equation:

y = y(x)

where x is the independent variable, which often takes the values of a sequence of natural numbers (1; 2; 3; ...) and serves, for example, as a counter of the time of the process under study (a characteristic).

1. Linear regression is good for modeling characteristics whose values increase or decrease at a constant rate. It is the simplest model of the process under study to construct. It is built in accordance with the equation:

y = mx + b

where m is the tangent of the angle of inclination of the linear regression to the abscissa axis and b is the coordinate of the point of intersection of the linear regression with the ordinate axis.

2. A polynomial trend line is useful for describing characteristics that have several distinct extrema (maxima and minima). The choice of the polynomial degree is determined by the number of extrema of the characteristic under study. Thus, a polynomial of the second degree can well describe a process that has only one maximum or minimum, a polynomial of the third degree no more than two extrema, a polynomial of the fourth degree no more than three extrema, and so on.

In this case, the trend line is constructed in accordance with the equation:

y = c0 + c1x + c2x^2 + c3x^3 + c4x^4 + c5x^5 + c6x^6

where coefficients c0, c1, c2,... c6 are constants whose values ​​are determined during construction.

3. A logarithmic trend line is successfully used to model characteristics whose values initially change rapidly and then gradually stabilize. It is built in accordance with the equation:

y = c ln(x) + b

4. A power trend line gives good results if the values of the relationship under study are characterized by a constant change in the growth rate. An example of such a dependence is the graph of uniformly accelerated motion of a car. If there are zero or negative values in the data, a power trend line cannot be used.

Constructed in accordance with the equation:

y = c x^b

where coefficients b, c are constants.

5. An exponential trend line should be used when the rate of change of the data increases continuously. This type of approximation is likewise not applicable to data containing zero or negative values.

Constructed in accordance with the equation:

y = c e^(bx)

where coefficients b, c are constants.

When selecting a trend line, Excel automatically calculates the value of R2, which characterizes the reliability of the approximation: the closer the R2 value is to unity, the more reliably the trend line approximates the process under study. If necessary, the R2 value can always be displayed on the chart.

It is determined by the formula:

R2 = 1 - SSE / SST

where SSE is the sum of squared deviations of the original data from the trend line and SST is the total sum of squared deviations of the data from their mean.

To add a trend line to a data series:

    activate a chart based on a series of data, i.e. click within the chart area. The Diagram item will appear in the main menu;

    after clicking on this item, a menu will appear on the screen in which you should select the Add trend line command.

The same actions can be easily implemented by moving the mouse pointer over the graph corresponding to one of the data series and right-clicking; In the context menu that appears, select the Add trend line command. The Trend Line dialog box will appear on the screen with the Type tab opened (Fig. 1).

After this you need:

Select the required trend line type on the Type tab (the Linear type is selected by default). For the Polynomial type, in the Degree field, specify the degree of the selected polynomial.

The Built on series field lists all the data series of the chart in question. To add a trend line to a specific data series, select its name in the Built on series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

    change the name of the trend line in the Name of the approximating (smoothed) curve field.

    set the number of periods (forward or backward) for the forecast in the Forecast field;

    display the equation of the trend line in the diagram area, for which you should enable the show equation on the diagram checkbox;

    display the approximation reliability value R2 in the diagram area, for which you should enable the Place the approximation reliability value on the diagram (R^2) checkbox;

    set the intersection point of the trend line with the Y axis, for which you should enable the checkbox for the intersection of the curve with the Y axis at a point;

    Click the OK button to close the dialog box.

In order to start editing an already drawn trend line, there are three ways:

    use the Selected trend line command from the Format menu, having previously selected the trend line;

    select the Format trend line command from the context menu, which is called up by right-clicking on the trend line;

    double click on the trend line.

The Trend Line Format dialog box will appear on the screen (Fig. 3), containing three tabs: View, Type, and Parameters; the contents of the last two completely coincide with the similar tabs of the Trend Line dialog box (Fig. 1-2). On the View tab you can set the line type, its color, and its thickness.

To delete a trend line that has already been drawn, select the trend line to be deleted and press the Delete key.

The advantages of the considered regression analysis tool are:

    the relative ease of constructing a trend line on charts without creating a data table for it;

    a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

    the ability to predict the behavior of the process under study by an arbitrary (within the limits of common sense) number of steps forward and also backward;

    the ability to obtain the trend line equation in analytical form;

    the possibility, if necessary, of obtaining an assessment of the reliability of the approximation.

The disadvantages include the following:

    the construction of a trend line is carried out only if there is a diagram built on a series of data;

    the process of generating data series for the characteristic under study on the basis of the trend line equations obtained for it is somewhat cluttered: the regression equations in question are updated with each change in the values of the original data series, but only within the chart area, while a data series formed on the basis of an old trend line equation remains unchanged;

    In PivotChart reports, changing the view of a chart or the associated PivotTable report does not preserve existing trend lines; that is, before you draw trend lines or otherwise format a PivotChart report, you should make sure that the report layout meets the requirements.

Trend lines can be used to supplement data series presented on charts such as line graphs, histograms, flat non-normalized (unstacked) area charts, bar charts, scatter charts, bubble charts, and stock charts.

You cannot add trend lines to data series in 3-D, normalized (stacked), radar, pie, and donut charts.

Using Excel's built-in functions

Excel also provides a regression analysis tool for building trend lines outside the chart area: a number of statistical worksheet functions can be used for this purpose, but all of them allow you to build only linear or exponential regressions.

Excel has several functions for constructing a linear regression, in particular:

    TREND;

    LINEST;

    SLOPE and INTERCEPT;

as well as several functions for constructing an exponential trend line, in particular:

    GROWTH;

    LGRFPRIBL (called LOGEST in English versions of Excel).

It should be noted that the techniques for constructing regressions with the TREND and GROWTH functions are almost identical, and the same can be said about the pair of functions LINEST and LGRFPRIBL. For these four functions, creating a table of values relies on Excel features such as array formulas, which somewhat clutters the process of building regressions. Note also that the construction of a linear regression is, in our opinion, most easily accomplished with the SLOPE and INTERCEPT functions, where the first determines the slope of the linear regression and the second determines the intercept it cuts off on the y-axis.
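To see what SLOPE and INTERCEPT compute, here is a minimal Python sketch of the same least squares formulas; the x and y values are made-up sample data, not taken from any worksheet in this article:

    import numpy as np

    # Hypothetical paired observations
    x = np.array([1, 2, 3, 4, 5], dtype=float)
    y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

    # SLOPE: m = cov(x, y) / var(x); INTERCEPT: b = mean(y) - m * mean(x)
    m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b = y.mean() - m * x.mean()
    print(f"slope m = {m:.4f}, intercept b = {b:.4f}")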

The advantages of the built-in functions tool for regression analysis are:

    a fairly simple, uniform process of generating data series of the characteristic under study for all built-in statistical functions that define trend lines;

    standard methodology for constructing trend lines based on generated data series;

    the ability to predict the behavior of the process under study by the required number of steps forward or backward.

The disadvantage is that Excel has no built-in functions for creating trend lines of types other than linear and exponential. This circumstance often prevents choosing a sufficiently accurate model of the process under study, as well as obtaining forecasts that are close to reality. In addition, when the TREND and GROWTH functions are used, the equations of the trend lines remain unknown.

It should be noted that the authors did not set out to present a course of regression analysis with any degree of completeness. Their main task is to show, using specific examples, the capabilities of the Excel package when solving approximation problems; to demonstrate what effective tools Excel has for building regressions and forecasting; and to illustrate how such problems can be solved relatively easily even by a user who does not have deep knowledge of regression analysis.

Examples of solving specific problems

Let's look at solving specific problems using the listed Excel tools.

Problem 1

Given a table of data on the profit of a motor transport enterprise for 1995-2002, you need to do the following:

    Build a diagram.

    Add linear and polynomial (quadratic and cubic) trend lines to the chart.

    Using the trend line equations, obtain tabular data on enterprise profits for each trend line for 1995-2004.

    Make a forecast for the enterprise's profit for 2003 and 2004.

Solution

    Into the range of cells A4:C11 of the Excel worksheet, enter the data shown in Fig. 4.

    Having selected the range of cells B4:C11, we build a diagram.

    We activate the constructed diagram and, following the method described above, after selecting the type of trend line in the Trend Line dialog box (see Fig. 1), we add linear, quadratic and cubic trend lines to the diagram one by one. In the same dialog box, open the Parameters tab (see Fig. 2); in the Name of the approximating (smoothed) curve field, enter the name of the trend being added, and in the Forecast forward for: periods field, set the value 2, since a profit forecast is planned for two years ahead. To display the regression equation and the approximation reliability value R2 in the diagram area, enable the show equation on the diagram and Place the approximation reliability value on the diagram (R^2) checkboxes. For better visual perception, we change the type, color and thickness of the constructed trend lines using the View tab of the Trend Line Format dialog box (see Fig. 3). The resulting diagram with the added trend lines is shown in Fig. 5.

    To obtain tabular data on enterprise profits for each trend line for 1995-2004, we use the trend line equations presented in Fig. 5. To do this, in the cells of the range D3:F3 we enter text labels for the selected trend line types: Linear trend, Quadratic trend, Cubic trend. Next, we enter the linear regression formula in cell D4 and, using the fill marker, copy it with relative references into the range D5:D13. Note that each cell with the linear regression formula in the range D4:D13 takes as its argument the corresponding cell in the range A4:A13. Similarly, we fill the range E4:E13 for the quadratic regression and the range F4:F13 for the cubic regression. Thus a forecast of the enterprise's profit for 2003 and 2004 has been made using the three trends. The resulting table of values is shown in Fig. 6.
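Outside Excel, the same three trends can be fitted with numpy.polyfit. Since the worksheet of Fig. 4 is not reproduced in the text, the profit figures in this sketch are hypothetical placeholders; only the structure of the calculation is meant to carry over:

    import numpy as np

    # Periods 1..8 correspond to the years 1995-2002; periods 9 and 10 to 2003-2004
    t = np.arange(1, 9, dtype=float)
    profit = np.array([12.0, 13.5, 14.1, 15.8, 15.2, 17.0, 18.4, 19.1])  # hypothetical
    t_all = np.arange(1, 11, dtype=float)  # full table for 1995-2004, incl. the forecast

    for degree, label in ((1, "linear"), (2, "quadratic"), (3, "cubic")):
        coeffs = np.polyfit(t, profit, degree)  # least squares polynomial fit
        table = np.polyval(coeffs, t_all)       # trend values for all 10 periods
        print(f"{label:9s} forecast for 2003-2004:", np.round(table[-2:], 2))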

Problem 2

With the table of data on the profit of a motor transport enterprise for 1995-2002 given in Problem 1, you need to perform the following actions.

    Build a diagram.

    Add logarithmic, power and exponential trend lines to the chart.

    Derive the equations of the obtained trend lines, as well as the reliability values ​​of the approximation R2 for each of them.

    Using the trend line equations, obtain tabular data on the enterprise's profit for each trend line for 1995-2002.

    Make a forecast of the company's profit for 2003 and 2004 using these trend lines.

Solution

Following the methodology given in the solution of Problem 1, we obtain a diagram with logarithmic, power and exponential trend lines added to it (Fig. 7). Then, using the obtained trend line equations, we fill in the table of values of the enterprise's profit, including the predicted values for 2003 and 2004 (Fig. 8).
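The logarithmic, power and exponential trends can be obtained the same way by linearizing each model before fitting. A sketch continuing the previous hypothetical profit series (the years are renumbered 1 through 8 so that ln(x) is defined):

    import numpy as np

    t = np.arange(1, 9, dtype=float)  # periods 1..8 instead of raw years
    profit = np.array([12.0, 13.5, 14.1, 15.8, 15.2, 17.0, 18.4, 19.1])  # hypothetical

    # Logarithmic trend: y = a + b*ln(x), linear in ln(x)
    b_log, a_log = np.polyfit(np.log(t), profit, 1)

    # Power trend: y = a*x^b, so ln(y) = ln(a) + b*ln(x)
    b_pow, ln_a_pow = np.polyfit(np.log(t), np.log(profit), 1)

    # Exponential trend: y = a*e^(b*x), so ln(y) = ln(a) + b*x
    b_exp, ln_a_exp = np.polyfit(t, np.log(profit), 1)

    print("log:   y = %.3f + %.3f*ln(x)" % (a_log, b_log))
    print("power: y = %.3f * x^%.3f" % (np.exp(ln_a_pow), b_pow))
    print("exp:   y = %.3f * e^(%.3f*x)" % (np.exp(ln_a_exp), b_exp))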

Fig. 5 and Fig. 7 show that the lowest approximation reliability value corresponds to the model with the logarithmic trend (R2 = 0.8659), while the highest values of R2 correspond to the models with a polynomial trend: quadratic (R2 = 0.9263) and cubic (R2 = 0.933).

Problem 3

With the table of data on the profit of a motor transport enterprise for 1995-2002 given in Problem 1, you must perform the following steps.

    Obtain data series for the linear and exponential trend lines using the TREND and GROWTH functions.

    Using the TREND and GROWTH functions, make a forecast of the enterprise’s profit for 2003 and 2004.

    Construct a diagram for the original data and the resulting data series.

Solution

Let's use the worksheet for Problem 1 (see Fig. 4). Let's start with the TREND function:

    select the range of cells D4:D11, which should be filled with the values ​​of the TREND function corresponding to the known data on the profit of the enterprise;

    call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category and click the OK button. The same operation can be performed by clicking the Insert Function button on the standard toolbar;

    In the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field; in the Known_values_x field - the range of cells B4:B11;

    to make the entered formula an array formula, use the key combination Ctrl + Shift + Enter.

The formula entered in the formula bar will look like: {=TREND(C4:C11,B4:B11)} (the curly braces indicate an array formula).

As a result, the range of cells D4:D11 is filled with the corresponding values ​​of the TREND function (Fig. 9).

To make a forecast of the enterprise's profit for 2003 and 2004, it is necessary to:

    select the range of cells D12:D13 where the values ​​predicted by the TREND function will be entered.

    call the TREND function and in the Function Arguments dialog box that appears, enter in the Known_values_y field - the range of cells C4:C11; in the Known_values_x field - the range of cells B4:B11; and in the New_values_x field - the range of cells B12:B13.

    turn this formula into an array formula using the key combination Ctrl + Shift + Enter.

    The entered formula will look like: {=TREND(C4:C11,B4:B11,B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).

The data series is similarly filled in using the GROWTH function, which is used in the analysis of nonlinear dependencies and works in exactly the same way as its linear counterpart TREND.
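A rough Python analogue of what TREND and GROWTH do: each fits its model to the known values and evaluates it at the new x values, TREND with a straight line and GROWTH with an exponential curve. The arrays below stand in for the cell ranges and hold hypothetical profits:

    import numpy as np

    x_known = np.arange(1, 9, dtype=float)  # stands in for B4:B11
    y_known = np.array([12.0, 13.5, 14.1, 15.8, 15.2, 17.0, 18.4, 19.1])  # C4:C11, hypothetical
    x_new = np.array([9.0, 10.0])           # stands in for B12:B13

    # TREND analogue: least squares straight line evaluated at x_new
    m, b = np.polyfit(x_known, y_known, 1)
    print("TREND-like forecast: ", m * x_new + b)

    # GROWTH analogue: exponential fit y = b*m^x, done in log space
    slope, intercept = np.polyfit(x_known, np.log(y_known), 1)
    print("GROWTH-like forecast:", np.exp(intercept) * np.exp(slope) ** x_new)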

Figure 10 shows the table in formula display mode.

For the initial data and the obtained data series, the diagram shown in Fig. 11 is constructed.

Problem 4

With the table of data on the receipt of applications for services by the dispatch service of a motor transport enterprise for the period from the 1st to the 11th of the current month, you must perform the following actions.

    Get data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

    Obtain a data series for the exponential regression using the LGRFPRIBL (LOGEST) function.

    Using the above functions, make a forecast about the receipt of applications to the dispatch service for the period from the 12th to the 14th of the current month.

    Create a diagram for the original and received data series.

Solution

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LGRFPRIBL) is a regression function in itself. These functions play only a supporting role, determining the necessary regression parameters.

For linear and exponential regressions built using the SLOPE, INTERCEPT, LINEST and LGRFPRIBL functions, the form of the equation is always known, in contrast to the linear and exponential regressions corresponding to the TREND and GROWTH functions.

1. Let's build a linear regression with the equation:

y = mx+b

using the SLOPE and INTERCEPT functions, with the regression slope m determined by the SLOPE function, and the free term b by the INTERCEPT function.

To do this, we carry out the following actions:

    enter the original table into the cell range A4:B14;

    the value of parameter m will be determined in cell C19. Select the SLOPE function from the Statistical category; enter the range of cells B4:B14 in the known_values_y field and the range of cells A4:A14 in the known_values_x field. The formula entered in cell C19 will be: =SLOPE(B4:B14,A4:A14);

    using a similar technique, the value of parameter b is determined in cell D19, whose contents will look like: =INTERCEPT(B4:B14,A4:A14). Thus, the values of the parameters m and b required for constructing the linear regression are stored in cells C19 and D19, respectively;

    next, enter the linear regression formula in cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell addresses must not change during copying). The absolute reference sign $ can be typed either from the keyboard or with the F4 key after placing the cursor on the cell address. Using the fill handle, we copy this formula into the range of cells C4:C17 and obtain the required data series (Fig. 12). Because the number of requests is an integer, you should set a number format with 0 decimal places on the Number tab of the Format Cells window.

2. Now let's build a linear regression given by the equation:

y = mx+b

using the LINEST function.

For this:

    enter the LINEST function into the cell range C20:D20 as an array formula: {=LINEST(B4:B14,A4:A14)}. As a result, we obtain the value of parameter m in cell C20 and the value of parameter b in cell D20;

    enter the formula in cell D4: =$C$20*A4+$D$20;

    copy this formula using the fill marker into the cell range D4:D17 and get the desired data series.

3. Building an exponential regression with the equation

y = b·m^x

using the LGRFPRIBL (LOGEST) function is performed similarly:

    in the cell range C21:D21 we enter the LGRFPRIBL function as an array formula: {=LGRFPRIBL(B4:B14,A4:A14)}. In this case, the value of parameter m will be determined in cell C21, and the value of parameter b in cell D21;

    the formula =$D$21*$C$21^A4 is entered into cell E4;

    using the fill marker, this formula is copied to the range of cells E4:E17, where the data series for exponential regression will be located (see Fig. 12).
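The parameters m and b that LGRFPRIBL (LOGEST) returns for the model y = b·m^x can be reproduced by a straight-line fit in log space. The request counts below are placeholders, since the actual values of Fig. 12 are not reproduced in the text:

    import numpy as np

    day = np.arange(1, 12, dtype=float)                       # A4:A14, days 1..11
    requests = np.array([3, 4, 4, 5, 6, 6, 8, 9, 9, 11, 12])  # B4:B14, hypothetical

    # ln(y) = ln(b) + x*ln(m), so fit a straight line to (x, ln y)
    ln_m, ln_b = np.polyfit(day, np.log(requests), 1)
    m, b = np.exp(ln_m), np.exp(ln_b)

    # Forecast for days 12-14, the analogue of the formula =$D$21*$C$21^A4
    for x in (12, 13, 14):
        print(f"day {x}: about {b * m ** x:.1f} requests")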

Fig. 13 shows the table in formula display mode, where you can see the functions used, with the required cell ranges, as well as the formulas.

The value R2 is called the coefficient of determination.

The task of constructing a regression dependence is to find the vector of coefficients m of model (1) at which the coefficient R takes its maximum value.

To assess the significance of R, Fisher's F test is used, calculated using the formula

F = [R^2 / (1 - R^2)] * [(n - k) / (k - 1)],

where n is the sample size (the number of experiments) and k is the number of model coefficients.

If F exceeds some critical value for the given n and k and the accepted confidence probability, then the value of R is considered significant. Tables of critical values of F are given in reference books on mathematical statistics.

Thus, the significance of R is determined not only by its value but also by the ratio between the number of experiments and the number of coefficients (parameters) of the model. Indeed, the correlation ratio for n = 2 in a simple linear model is equal to 1: a single straight line can always be drawn through two points on a plane. However, if the experimental data are random variables, such a value of R should be trusted with great caution. Usually, to obtain a significant R and a reliable regression, one tries to ensure that the number of experiments significantly exceeds the number of model coefficients (n > k).
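This check is easy to script. The sketch below assumes the form of the F statistic given above, F = [R^2/(1 - R^2)] * [(n - k)/(k - 1)], and takes the critical value from scipy instead of reference tables; n, k and R^2 are illustrative numbers:

    import scipy.stats as stats

    n, k = 10, 2       # 10 experiments, 2 model coefficients (a and b)
    r_sq = 0.87        # hypothetical coefficient of determination

    # Assumed common form of the F statistic for regression significance
    F = (r_sq / (1 - r_sq)) * ((n - k) / (k - 1))

    # Critical value at the 0.95 confidence level, df1 = k-1, df2 = n-k
    F_crit = stats.f.ppf(0.95, dfn=k - 1, dfd=n - k)
    print(f"F = {F:.2f}, critical value = {F_crit:.2f}, significant: {F > F_crit}")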

To build a linear regression model you need:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either the first or the last in the list). For example, let us take the data of the previous problem, adding a column named "Period No." and numbering the periods from 1 to 12 (these will be the values of X);

2) go to the menu Data/Data Analysis/Regression

If the "Data Analysis" item in the "Tools" menu is missing, then you should go to the "Add-Ins" item in the same menu and check the "Analysis package" checkbox.

3) in the "Regression" dialog box, set:

· input interval Y;

· input interval X;

· output interval - the upper left cell of the interval in which the calculation results will be placed (it is recommended to place them on a new worksheet);

4) click "Ok" and analyze the results.

The essence of the least squares method is to find the parameters of a trend model that best describes the tendency of development of some random phenomenon in time or space (a trend is a line characterizing the tendency of this development). The task of the least squares method (LSM) comes down to finding not just any trend model, but the best, or optimal, model. This model is optimal if the sum of the squared deviations between the observed actual values and the corresponding calculated trend values is minimal (smallest):

S = Σ (Yi − Ŷi)² → min,

where (Yi − Ŷi)² is the squared deviation between the observed actual value and the corresponding calculated trend value; Yi is the actual (observed) value of the phenomenon being studied; Ŷi is the calculated value of the trend model; and n is the number of observations of the phenomenon being studied, the summation running over i = 1, ..., n.
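In code, the LSM criterion is simply the sum of squared residuals. A minimal sketch that scores two hand-picked candidate trend lines on made-up observations; the line with the smaller sum is the better model in the LSM sense:

    import numpy as np

    x = np.arange(1, 11, dtype=float)
    y = np.array([9.1, 9.9, 10.4, 11.2, 10.8, 12.0, 12.3, 13.1, 12.9, 14.0])  # hypothetical

    def sse(y_actual, y_trend):
        # Sum of squared deviations between observed and calculated trend values
        return np.sum((y_actual - y_trend) ** 2)

    trend_a = 8.5 + 0.50 * x  # candidate line 1
    trend_b = 9.0 + 0.45 * x  # candidate line 2
    print(sse(y, trend_a), sse(y, trend_b))  # the smaller value wins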

The LSM is quite rarely used on its own. As a rule, it most often serves only as a necessary technical device in correlation studies. It should be remembered that the information basis of the LSM can only be a reliable statistical series, and the number of observations should not be less than 4; otherwise, the smoothing procedures of the LSM may lose common sense.

The LSM toolkit boils down to the following procedures:

First procedure. It is established whether there is any tendency at all for the resultant attribute to change when the selected factor-argument changes or, in other words, whether there is a relationship between "y" and "x".

Second procedure. It is determined which line (trajectory) can best describe or characterize this trend.

Third procedure. The parameters of the regression equation characterizing this line are calculated.

Example. Suppose we have information about the average sunflower yield for the farm under study (Table 9.1).

Table 9.1. Average sunflower yield (columns: Observation number; Productivity, c/ha).

Since the level of technology in sunflower production in our country has remained virtually unchanged over the past 10 years, fluctuations in yield during the analyzed period apparently depended very strongly on fluctuations in weather and climatic conditions. Is this really true?

First LSM procedure. The hypothesis is tested that there is a trend in sunflower yield changes that depends on changes in weather and climatic conditions over the analyzed 10 years.

In this example, it is advisable to take the sunflower yield as "y" and the number of the observed year in the analyzed period as "x". The hypothesis that there is some relationship between "x" and "y" can be tested in two ways: manually or with computer programs. With computer technology available, this problem practically solves itself. But in order to understand the LSM toolkit better, it is advisable to test the hypothesis about the relationship between "x" and "y" manually, when only a pen and an ordinary calculator are at hand. In such cases, the hypothesis about the existence of a trend is best checked visually, from the location of the graphical image of the analyzed series of dynamics (the correlation field):

The correlation field in our example is located around a slowly increasing line. This in itself indicates the existence of a certain trend in changes in sunflower yield. One cannot speak of the presence of any tendency only when the correlation field looks like a circle, a ring, a strictly vertical or strictly horizontal cloud, or consists of chaotically scattered points. In all other cases, the hypothesis of a relationship between "x" and "y" is accepted, and the research continues.

Second LSM procedure. It is determined which line (trajectory) can best describe or characterize the trend of changes in sunflower yield over the analyzed period.

If computer technology is available, the selection of the optimal trend occurs automatically. In "manual" processing, the optimal function is selected, as a rule, visually, by the location of the correlation field: that is, based on the appearance of the graph, the equation of the line that best fits the empirical trend (the actual trajectory) is chosen.

As is known, there is a huge variety of functional dependencies in nature, so it is extremely difficult to analyze visually even a small portion of them. Fortunately, in real economic practice most relationships can be described quite accurately either by a parabola, a hyperbola, or a straight line. In this regard, with the "manual" option of selecting the best function, one can limit oneself to these three models only.

Straight line: y = a + b·x

Hyperbola: y = a + b/x

Second-order parabola: y = a + b·x + c·x²

It is easy to see that in our example the trend in sunflower yield changes over the analyzed 10 years is best characterized by a straight line, so the regression equation will be the equation of a straight line.
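When the visual choice among the three models is not obvious, it can also be settled numerically by fitting each of them and comparing the residual sums of squares. A sketch with hypothetical yield figures, since the actual values of Table 9.1 are not reproduced in the text:

    import numpy as np

    t = np.arange(1, 11, dtype=float)  # year numbers 1..10
    y = np.array([9.1, 9.9, 10.4, 11.2, 10.8, 12.0, 12.3, 13.1, 12.9, 14.0])  # hypothetical, c/ha

    def sse(actual, fitted):
        return np.sum((actual - fitted) ** 2)

    line = np.polyval(np.polyfit(t, y, 1), t)           # y = a + b*t
    hyper = np.polyval(np.polyfit(1 / t, y, 1), 1 / t)  # y = a + b/t, linear in 1/t
    parab = np.polyval(np.polyfit(t, y, 2), t)          # y = a + b*t + c*t^2

    for name, fit in (("line", line), ("hyperbola", hyper), ("parabola", parab)):
        print(f"{name:9s} SSE = {sse(y, fit):.3f}")  # the smallest SSE is best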

Third LSM procedure. The parameters of the regression equation characterizing this line are calculated or, in other words, an analytical formula describing the best trend model is determined.

Finding the values of the parameters of the regression equation, in our case the parameters a and b, is the core of the LSM. This process comes down to solving a system of normal equations:

n·a + b·Σx = Σy
a·Σx + b·Σx² = Σ(x·y)    (9.2)

This system of equations can be solved quite easily by the Gauss method. Recall that, as a result of the solution, the values of the parameters a and b are found in our example. Thus, the found regression equation will have the following form:
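For the straight line y = a + b·x, the normal system (9.2) can be assembled from the sums and solved directly; numpy.linalg.solve plays the role of the Gauss method here, and the yield series is again a hypothetical placeholder:

    import numpy as np

    x = np.arange(1, 11, dtype=float)  # year numbers 1..10
    y = np.array([9.1, 9.9, 10.4, 11.2, 10.8, 12.0, 12.3, 13.1, 12.9, 14.0])  # hypothetical

    n = len(x)
    # Normal system (9.2):  n*a + (Σx)*b = Σy ;  (Σx)*a + (Σx²)*b = Σxy
    A = np.array([[n,       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    rhs = np.array([y.sum(), (x * y).sum()])

    a, b = np.linalg.solve(A, rhs)  # Gaussian elimination under the hood
    print(f"y = {a:.3f} + {b:.3f}*x")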