Many statistical problems require the estimation of a population's covariance matrix, which can be seen as an estimation of the shape of the data set's scatter. In most cases, such an estimation has to be performed on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation's quality. The sklearn.covariance package provides tools for accurately estimating a population's covariance matrix under various settings.
We assume that the observations are independent and identically distributed (i.i.d.).
2.6.1. Empirical covariance
The covariance matrix of a data set is known to be well approximated by the classical maximum likelihood estimator (or "empirical covariance"), provided the number of observations is large enough compared to the number of features (the variables describing the observations). More precisely, the maximum likelihood estimator of a sample is an asymptotically unbiased estimator of the corresponding population's covariance matrix.
The empirical covariance matrix of a sample can be computed using the empirical_covariance function of the package, or by fitting an EmpiricalCovariance object to the data sample with the EmpiricalCovariance.fit method. Be careful that the results depend on whether the data are centered, so one may want to use the assume_centered parameter accurately. More precisely, if assume_centered=False, then the test set is supposed to have the same mean vector as the training set. If not, both should be centered by the user, and assume_centered=True should be used.
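As a minimal sketch of both interfaces (the synthetic data below is assumed purely for illustration):

import numpy as np
from sklearn.covariance import EmpiricalCovariance, empirical_covariance

# Illustrative data: 200 i.i.d. samples with 5 features (assumed for this example)
rng = np.random.RandomState(0)
X = rng.randn(200, 5)

# Function interface: directly returns the empirical covariance matrix
emp_cov = empirical_covariance(X, assume_centered=False)

# Estimator interface: fit an EmpiricalCovariance object to the data
cov = EmpiricalCovariance(assume_centered=False).fit(X)
print(cov.covariance_.shape)  # (5, 5)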
2.6.2. Shrunk covariance
2.6.2.1. Basic shrinkage
Despite being an asymptotically unbiased estimator of the covariance matrix, the maximum likelihood estimator is not a good estimator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate. Sometimes, it even occurs that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage.
In scikit-learn, this transformation (with a user-defined shrinkage coefficient) can be applied directly to a pre-computed covariance with the shrunk_covariance method. Also, a shrunk estimator of the covariance can be fitted to data with a ShrunkCovariance object and its ShrunkCovariance.fit method. Again, results depend on whether the data are centered, so one may want to use the assume_centered parameter accurately.
Mathematically, this shrinkage consists in reducing the ratio between the smallest and the largest eigenvalues of the empirical covariance matrix. It can be done by simply shifting every eigenvalue according to a given offset, which is equivalent to finding the l2-penalized maximum likelihood estimator of the covariance matrix. In practice, shrinkage boils down to a simple convex transformation: \(\Sigma_{\rm shrunk} = (1-\alpha)\hat{\Sigma} + \alpha\frac{{\rm Tr}\hat{\Sigma}}{p}\rm Id\).
Choosing the amount of shrinkage, \(\alpha\), amounts to setting a bias/variance trade-off, and is discussed below.
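For instance, a brief sketch of both ways of applying shrinkage with a fixed, user-chosen coefficient (the value 0.1 and the synthetic data are arbitrary assumptions made for this example):

import numpy as np
from sklearn.covariance import ShrunkCovariance, empirical_covariance, shrunk_covariance

rng = np.random.RandomState(0)
X = rng.randn(60, 10)  # illustrative data only

# Apply shrinkage to a pre-computed empirical covariance matrix
emp_cov = empirical_covariance(X)
shrunk_cov = shrunk_covariance(emp_cov, shrinkage=0.1)

# Or fit a ShrunkCovariance estimator directly to the data
cov = ShrunkCovariance(shrinkage=0.1).fit(X)
print(cov.covariance_.shape)  # (10, 10)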
2.6.2.2. Ledoit-Wolf shrinkage
In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient \(\alpha\) that minimizes the mean squared error between the estimated and the real covariance matrix.
The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the ledoit_wolf function of the sklearn.covariance package, or it can be otherwise obtained by fitting a LedoitWolf object to the same sample.
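A short sketch of both interfaces (synthetic data assumed for the example):

import numpy as np
from sklearn.covariance import LedoitWolf, ledoit_wolf

rng = np.random.RandomState(0)
X = rng.randn(100, 8)  # illustrative data only

# Function interface: returns the shrunk covariance and the shrinkage coefficient
lw_cov, shrinkage = ledoit_wolf(X)

# Estimator interface
lw = LedoitWolf().fit(X)
print(lw.shrinkage_)  # estimated optimal shrinkage coefficient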
Note
Case in which the population covariance matrix is isotropic
It is important to note that when the number of samples is much larger than the number of features, one would expect that no shrinkage is necessary. The intuition behind this is that if the population covariance is full rank, then as the number of samples grows, the sample covariance also becomes positive definite. As a result, no shrinkage would be necessary, and the method should do this automatically.
This, however, is not the case in the Ledoit-Wolf procedure when the population covariance is a multiple of the identity matrix. In this case, the Ledoit-Wolf shrinkage estimate approaches 1 as the number of samples increases. This indicates that the optimal estimate of the covariance matrix in the Ledoit-Wolf sense is a multiple of the identity. Since the population covariance is already a multiple of the identity matrix, the Ledoit-Wolf solution is indeed a reasonable estimate.
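This behaviour can be inspected empirically; the sketch below (dimensions, sample sizes and seed are arbitrary assumptions) fits LedoitWolf on data drawn from an isotropic Gaussian and prints the fitted shrinkage_ attribute, which, per the note above, is expected to move towards 1 as the number of samples grows:

import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.RandomState(42)
n_features = 20  # arbitrary choice for this sketch

# Population covariance is the identity matrix (isotropic case)
for n_samples in (50, 500, 5000):
    X = rng.randn(n_samples, n_features)
    lw = LedoitWolf().fit(X)
    print(n_samples, lw.shrinkage_)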
2.6.2.3. Oracle Approximating Shrinkage
Under the assumption that the data are Gaussian distributed, Chen et al. [2] derived a formula aimed at choosing a shrinkage coefficient that yields a smaller mean squared error than the one given by Ledoit and Wolf's formula. The resulting estimator is known as the Oracle Approximating Shrinkage (OAS) estimator of the covariance.
The OAS estimator of the covariance matrix can be computed on a sample with the oas function of the sklearn.covariance package, or it can be otherwise obtained by fitting an OAS object to the same sample.
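Analogously to the Ledoit-Wolf case, a minimal sketch (illustrative data assumed):

import numpy as np
from sklearn.covariance import OAS, oas

rng = np.random.RandomState(0)
X = rng.randn(100, 8)  # illustrative data only

# Function interface: returns the shrunk covariance and the shrinkage coefficient
oas_cov, shrinkage = oas(X)

# Estimator interface
estimator = OAS().fit(X)
print(estimator.shrinkage_)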
2.6.3. Sparse inverse covariance
The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation matrix. It gives the partial independence relationship. In other words, if two features are independent conditionally on the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a sparse precision matrix: the estimation of the covariance matrix is better conditioned by learning independence relations from the data. This is known as covariance selection.
In the small-samples situation, in which n_samples is on the order of n_features or smaller, sparse inverse covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are able to recover off-diagonal structure.
The GraphicalLasso estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its alpha parameter, the more sparse the precision matrix. The corresponding GraphicalLassoCV object uses cross-validation to automatically set the alpha parameter.
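As a rough sketch of both objects (the synthetic data and the alpha value are assumptions made for the example):

import numpy as np
from sklearn.covariance import GraphicalLasso, GraphicalLassoCV

rng = np.random.RandomState(1)
X = rng.multivariate_normal(mean=np.zeros(5),
                            cov=np.diag([1.0, 2.0, 1.0, 0.5, 1.5]),
                            size=200)  # illustrative data only

# Fixed penalty: the higher alpha, the sparser the estimated precision matrix
model = GraphicalLasso(alpha=0.05).fit(X)

# Cross-validated penalty selection
model_cv = GraphicalLassoCV().fit(X)
print(model_cv.alpha_)      # alpha selected by cross-validation
print(model_cv.precision_)  # estimated sparse precision matrix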
Note
Structure recovery
Recovering a graphical structure from correlations in the data is a challenging task. If you are interested in such recovery, keep in mind that:
Recovery is easier from a correlation matrix than a covariance matrix: standardize your observations before running GraphicalLasso (a sketch of this follows after this list).
If the underlying graph has nodes with many more connections than the average node, the algorithm will miss some of these connections.
If your number of observations is not large compared to the number of edges in the underlying graph, you will not recover it.
Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the GraphicalLassoCV object) will lead to selecting too many edges. However, the relevant edges will have heavier weights than the irrelevant ones.
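Regarding the first point above, a brief sketch of standardizing the observations before fitting (the scaler choice and the alpha value are assumptions made for this example):

import numpy as np
from sklearn.covariance import GraphicalLasso
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.randn(200, 10)  # illustrative data only

# Standardize so that the estimation works on the correlation structure
X_std = StandardScaler().fit_transform(X)
model = GraphicalLasso(alpha=0.1).fit(X_std)
print(model.precision_)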
The mathematical formula is as follows:
\[\hat{K} = \mathrm{argmin}_K \big( \mathrm{tr} S K - \mathrm{log} \mathrm{det} K + \alpha \|K\|_1 \big)\]
where \(K\) is the precision matrix to be estimated, and \(S\) is the sample covariance matrix. \(\|K\|_1\) is the sum of the absolute values of the off-diagonal coefficients of \(K\). The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R glasso package.
2.6.4. Robust covariance estimation
Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also appear for a variety of reasons. Observations that are very uncommon are called outliers. The empirical covariance estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outliers in the data. Therefore, one should use robust covariance estimators to estimate the covariance of real data sets. Alternatively, robust covariance estimators can be used to perform outlier detection and discard/downweight some observations according to further processing of the data.
The sklearn.covariance package implements a robust estimator of covariance, the Minimum Covariance Determinant [3].
2.6.4.1. Minimum Covariance Determinant
The Minimum Covariance Determinant estimator is a robust estimator of a data set's covariance introduced by P.J. Rousseeuw in [3]. The idea is to find a given proportion (h) of "good" observations that are not outliers and compute their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate for the performed selection of observations (the "consistency step"). Having computed the Minimum Covariance Determinant estimator, one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the covariance matrix of the data set (the "reweighting step").
Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order to compute the Minimum Covariance Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also computes a robust estimate of the data set's location at the same time.
Raw estimates can be accessed as the raw_location_ and raw_covariance_ attributes of a MinCovDet robust covariance estimator object.
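A minimal sketch of fitting MinCovDet and reading both the raw and reweighted estimates (the contaminated data here is constructed arbitrarily for illustration):

import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.RandomState(0)
X = rng.randn(100, 3)   # illustrative inliers
X[:10] += 5.0           # a few artificial outliers

mcd = MinCovDet(random_state=0).fit(X)

# Raw (FastMCD) estimates, before the reweighting step
print(mcd.raw_location_)
print(mcd.raw_covariance_)

# Reweighted location and covariance estimates
print(mcd.location_)
print(mcd.covariance_)

# Robust Mahalanobis distances of the training observations
print(mcd.dist_)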
(Figures: influence of outliers on location and covariance estimates; separating inliers from outliers using a Mahalanobis distance.)