A bidimensional example
Let's consider a small dataset built by adding some uniform noise to the points belonging to a segment bounded between -6 and 6. The original equation is y = x + 2 + η, where η is a noise term.
In the following graph, there's a plot with a candidate regression function:
The dataset is defined as follows:
import numpy as np
nb_samples = 200
X = np.arange(-5, 5, 0.05)
Y = X + 2
Y += np.random.normal(0.0, 0.5, size=nb_samples)
As we're working on a plane, the regressor we're looking for is a function of only two parameters (the intercept and the only multiplicative coefficient) with an additive random normal noise term that is associated with every data point xi (formally, all ηi are independent and identically distributed (i.i.d) variables):
To fit our model, we must find the best parameters and we start with an Ordinary Least Squares (OLS) approach based on the known data points (xi, yi). The cost function to minimize is as follows:
With an analytic approach, to find the global minimum we must impose both derivatives to be equal to 0 (in the general case, the same must be done using the gradient ?L = 0):
So, we can define the function in Python, using an input vector containing both α and β:
import numpy as np
def loss(v):
e = 0.0
for i in range(nb_samples):
e += np.square(v[0] + v[1]*X[i] - Y[i])
return 0.5 * e
And the gradient can be defined as follows:
import numpy as np
def gradient(v):
g = np.zeros(shape=2)
for i in range(nb_samples):
g[0] += (v[0] + v[1]*X[i] - Y[i])
g[1] += ((v[0] + v[1]*X[i] - Y[i]) * X[i])
return g
The optimization problem can now be solved using SciPy:
from scipy.optimize import minimize
minimize(fun=loss, x0=[0.0, 0.0], jac=gradient, method='L-BFGS-B')
fun: 25.224432728145842
hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
jac: array([ -8.03369622e-07, 3.60194360e-06])
message: b'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 8
nit: 7
status: 0
success: True
x: array([ 1.96534464, 0.98451589])
As expected, the regression denoised our dataset, rebuilding the original equation: y = x + 2 (with a negligible approximation error). This procedure is absolutely acceptable; however, it's not difficult to understand that the problem can be solved in closed form in a single step. The first thing to do is get rid of the intercept by adding an extra feature equal to 1:
At this point, the generic problem can be expressed in vectorial notation using a coefficient vector θ:
In the bidimensional case θ = (β, α) because the intercept always corresponds to the last value. If we assume that the noise is homoscedastic with a variance equal to σ2 (that is, η ~ N(0, σ2I)), the cost function can be rewritten as follows:
If the matrix XTX has full rank (that is, det(XTX) ≠ 0), it's easy to find the solution ?L = 0 using the Moore-Penrose pseudo-inverse (which is an extension of matrix inversion when the shape is not square):
The previous example becomes this:
import numpy as np
Xs = np.expand_dims(X, axis=1)
Ys = np.expand_dims(Y, axis=1)
Xs = np.concatenate((Xs, np.ones_like(Xs)), axis=1)
result = np.linalg.inv(np.dot(Xs.T, Xs)).dot(Xs.T).dot(Y)
print('y = %.2fx + %2.f' % (result[0], result[1]))
y = 0.98x + 2
Clearly, this approach is much more efficient and exploits the vectorization features provided by NumPy (the computation on large datasets is extremely fast, and there's no more need for multiple gradient evaluations). According to the Gauss-Markov theorem, this is the Best Linear Unbiased Estimator (BLUE), meaning that there are no other solutions with a smaller coefficient variance. We omit the proofs (which are easy to obtain by applying the standard formulas), but we have E[θopt] = 0 and Var[θopt] = σ2(XTX)-1 that depends on σ2 and on the inverse of XTX. For our example, the coefficient covariance matrix is as follows:
import numpy as np
covariance = (0.5 ** 2) * np.linalg.inv(np.dot(Xs.T, Xs))
print(covariance)
[[ 1.50003750e-04 3.75009375e-06]
[ 3.75009375e-06 1.25009375e-03]]
The covariance matrix is decorrelated (all non-diagonal terms are close to 0) and the variances for the coefficient and the intercept are extremely small (thanks also to the simplicity of the problem). It's interesting to notice that the variance of the intercept is ten times larger than the variance of the coefficient. This is due to the fact that we expect half of the points to be above the regression line and the other half below it. Therefore, small vertical shifts reduce the error for 50% of points and increase it for the remaining part. On the other side, a small change in the slope has an impact on all points, increasing always the mean squared error.
Considering m samples and m i.i.d normal noise terms, the probability of the output Y given the sample set X, the parameter vector θopt, and the noise variance σ2 is Gaussian, with a density function equal to the following:
This means that, once the model has been trained and the θopt has been found, we expect all samples to have a Gaussian distribution centered on the regression hyperplane. In the following diagram, there's a bidimensional example of this concept:
In the general case of heteroscedastic noise, we can express the noise covariance matrix as Σ = σ2C, where C is a generic square and invertible matrix whose values are normally bounded between 0 and 1. This is not a limitation, because, if we select σ2 as the maximum variance, a generic element Cij becomes a weight for the covariance between the parameter features θi and θj (obviously, on the diagonal, there are the variances for all the features).
The cost function is slightly different because we must take into account the different impact of the noise on the single features:
It's easy to compute the gradient and derive an expression for the parameters, similar to what we have obtained in the previous case:
For a general linear regression problem based on the minimization of the squared error, it's possible to prove that the optimal prediction (that is, with the minimum variance) of a new sample's xj corresponds to this:
This result confirms our expectations: the optimal regressor will always predict the expected value of the dependent variable y conditioned to the input xj. Hence, considering the previous diagram, the Gaussians are optimized to have their means as close as possible to each training sample (of course, with the constraint of a global linearity).