2.6. Covariance estimation

Many statistical problems require the estimation of a population's covariance matrix, which can be seen as an estimate of the shape of the data set's scatter. In most cases, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation's quality. The sklearn.covariance package provides tools to accurately estimate a population's covariance matrix under various settings.

We assume that the observations are independent and identically distributed (i.i.d.).

2.6.1. Empirical covariance

It is well known that the covariance matrix of a data set is well approximated by the classical maximum likelihood estimator (or "empirical covariance"), provided the number of observations is large enough compared to the number of features (the variables describing the observations). More precisely, the maximum likelihood estimator of a sample is an asymptotically unbiased estimator of the covariance matrix of the corresponding population.

The empirical covariance matrix of a sample can be computed using the empirical_covariance function of the package, or by fitting an EmpiricalCovariance object to the data sample with the EmpiricalCovariance.fit method. Be careful that the results depend on whether the data are centered, so one may want to use the assume_centered parameter accurately. More precisely, if assume_centered=False, then the test set is supposed to have the same mean vector as the training set. If not, both should be centered by the user, and assume_centered=True should be used.
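As an illustration, here is a minimal sketch of fitting EmpiricalCovariance; the two-feature Gaussian toy data below is an assumed example, not taken from the original text:

    import numpy as np
    from sklearn.covariance import EmpiricalCovariance

    # Toy data: 500 draws from a 2D Gaussian with a known covariance (illustrative only).
    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[0.8, 0.3], [0.3, 0.4]], size=500)

    # Fit the empirical (maximum likelihood) covariance estimator.
    cov = EmpiricalCovariance().fit(X)
    print(cov.covariance_)  # estimated covariance matrix
    print(cov.location_)    # estimated mean (location) vector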

2.6.2. Shrunk covariance

2.6.2.1. Basic shrinkage

Despite being an asymptotically unbiased estimator of the covariance matrix, the maximum likelihood estimator is not a good estimator of the eigenvalues of the covariance matrix, so the precision matrix obtained from its inversion is not accurate. Sometimes it even happens that the empirical covariance matrix cannot be inverted for numerical reasons. To avoid such an inversion problem, a transformation of the empirical covariance matrix has been introduced: the shrinkage.

In scikit-learn, this transformation (with a user-defined shrinkage coefficient) can be directly applied to a pre-computed covariance with the shrunk_covariance method. Also, a shrunk estimator of the covariance can be fitted to data with a ShrunkCovariance object and its ShrunkCovariance.fit method. Again, results depend on whether the data are centered, so one may want to use the assume_centered parameter accurately.
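A minimal sketch of both routes, assuming an illustrative random data matrix X and an arbitrary shrinkage coefficient of 0.1:

    import numpy as np
    from sklearn.covariance import ShrunkCovariance, shrunk_covariance, empirical_covariance

    # Illustrative data: few samples relative to the number of features.
    rng = np.random.RandomState(42)
    X = rng.randn(60, 20)

    # Fit a shrunk covariance estimator with a user-defined shrinkage coefficient.
    shrunk = ShrunkCovariance(shrinkage=0.1).fit(X)

    # Equivalently, apply the shrinkage transformation to a pre-computed covariance.
    emp_cov = empirical_covariance(X)
    shrunk_cov = shrunk_covariance(emp_cov, shrinkage=0.1)
    print(np.allclose(shrunk.covariance_, shrunk_cov))  # expected: True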

Mathematically, this shrinkage consists in reducing the ratio between the smallest and the largest eigenvalues of the empirical covariance matrix. It can be done by simply shifting every eigenvalue according to a given offset, which is equivalent to finding the l2-penalized maximum likelihood estimator of the covariance matrix. In practice, shrinkage boils down to a simple convex transformation: \(\Sigma_{\rm shrunk} = (1-\alpha)\hat{\Sigma} + \alpha\frac{{\rm Tr}\hat{\Sigma}}{p}\rm Id\).

Choosing the amount of shrinkage, \(\alpha\), amounts to setting a bias/variance trade-off, and is discussed below.

2.6.2.2. Ledoit-Wolf shrinkage

In their 2004 paper [1], O. Ledoit and M. Wolf propose a formula to compute the optimal shrinkage coefficient \(\alpha\) that minimizes the mean squared error between the estimated and the real covariance matrix.

The Ledoit-Wolf estimator of the covariance matrix can be computed on a sample with the ledoit_wolf function of the sklearn.covariance package, or it can be otherwise obtained by fitting a LedoitWolf object to the same sample.
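A minimal sketch of both interfaces, with an assumed random data matrix for illustration:

    import numpy as np
    from sklearn.covariance import LedoitWolf, ledoit_wolf

    # Illustrative data matrix (40 samples, 25 features).
    rng = np.random.RandomState(0)
    X = rng.randn(40, 25)

    # Estimator interface: the optimal shrinkage coefficient is estimated from the data.
    lw = LedoitWolf().fit(X)
    print(lw.shrinkage_)   # estimated shrinkage coefficient
    print(lw.covariance_)  # shrunk covariance estimate

    # Function interface: returns the shrunk covariance and the shrinkage coefficient.
    shrunk_cov, shrinkage = ledoit_wolf(X)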

Note

Case when the population covariance matrix is isotropic

Note that when the number of samples is much larger than the number of features, one would expect that no shrinkage is necessary. The intuition behind this is that if the population covariance is full rank, then as the number of samples grows, the sample covariance will also become positive definite. As a result, no shrinkage would be necessary, and the method should do this automatically.

This, however, is not the case in the Ledoit-Wolf procedure when the population covariance happens to be a multiple of the identity matrix. In this case, the Ledoit-Wolf shrinkage estimate approaches 1 as the number of samples increases. This indicates that the optimal estimate of the covariance matrix in the Ledoit-Wolf sense is a multiple of the identity. Since the population covariance is already a multiple of the identity matrix, the Ledoit-Wolf solution is indeed a reasonable estimate.

2.6.2.3. Oracle Approximating Shrinkage

Under the assumption that the data are Gaussian distributed, Chen et al. [2] derived a formula aimed at choosing a shrinkage coefficient that yields a smaller mean squared error than the one given by Ledoit and Wolf's formula. The resulting estimator is known as the Oracle Approximating Shrinkage (OAS) estimator of the covariance.

The OAS estimator of the covariance matrix can be computed on a sample with the oas function of the sklearn.covariance package, or it can be otherwise obtained by fitting an OAS object to the same sample.
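A minimal sketch, again with an assumed random data matrix for illustration:

    import numpy as np
    from sklearn.covariance import OAS, oas

    # Illustrative data matrix (30 samples, 20 features).
    rng = np.random.RandomState(0)
    X = rng.randn(30, 20)

    # Estimator interface.
    oas_est = OAS().fit(X)
    print(oas_est.shrinkage_)  # estimated shrinkage coefficient

    # Function interface: returns the shrunk covariance and the shrinkage coefficient.
    shrunk_cov, shrinkage = oas(X)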

2.6.3. Sparse inverse covariance

The matrix inverse of the covariance matrix, often called the precision matrix, is proportional to the partial correlation matrix. It gives the partial independence relationships: in other words, if two features are independent conditionally on the others, the corresponding coefficient in the precision matrix will be zero. This is why it makes sense to estimate a sparse precision matrix: the estimation of the covariance matrix is better conditioned by learning independence relations from the data. This is known as covariance selection.

In the small-samples situation, in which n_samples is on the order of n_features or smaller, sparse inverse covariance estimators tend to work better than shrunk covariance estimators. However, in the opposite situation, or for very correlated data, they can be numerically unstable. In addition, unlike shrinkage estimators, sparse estimators are able to recover off-diagonal structure.

The GraphicalLasso estimator uses an l1 penalty to enforce sparsity on the precision matrix: the higher its alpha parameter, the sparser the precision matrix. The corresponding GraphicalLassoCV object uses cross-validation to automatically set the alpha parameter.
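A minimal sketch of both objects, using an assumed random data matrix and an arbitrary alpha value for illustration:

    import numpy as np
    from sklearn.covariance import GraphicalLasso, GraphicalLassoCV

    # Illustrative data matrix (100 samples, 10 features).
    rng = np.random.RandomState(0)
    X = rng.randn(100, 10)

    # Fixed regularization: a larger alpha yields a sparser precision matrix.
    model = GraphicalLasso(alpha=0.05).fit(X)
    print(model.precision_)   # estimated sparse precision matrix
    print(model.covariance_)  # corresponding covariance estimate

    # Cross-validated choice of the alpha parameter.
    model_cv = GraphicalLassoCV().fit(X)
    print(model_cv.alpha_)    # selected alpha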

Note

Structure recovery

Recovering a graphical structure from correlations in the data is a challenging task. If you are interested in such recovery, keep in mind that:

  • Recovery is easier from a correlation matrix than from a covariance matrix: standardize your observations before running GraphicalLasso

  • If the underlying graph has nodes with many more connections than the average node, the algorithm will miss some of these connections.

  • If your number of observations is not large compared to the number of edges in the underlying graph, you will not recover it.

  • Even if you are in favorable recovery conditions, the alpha parameter chosen by cross-validation (e.g. using the GraphicalLassoCV object) will lead to selecting too many edges. However, the relevant edges will have heavier weights than the irrelevant ones.

The mathematical formula is as follows:

\[\hat{K} = \mathrm{argmin}_K \big( \mathrm{tr} S K - \mathrm{log} \mathrm{det} K + \alpha \|K\|_1 \big)\]

where \(K\) is the precision matrix to be estimated and \(S\) is the sample covariance matrix. \(\|K\|_1\) is the sum of the absolute values of the off-diagonal coefficients of \(K\). The algorithm employed to solve this problem is the GLasso algorithm, from the Friedman 2008 Biostatistics paper. It is the same algorithm as in the R glasso package.

2.6.4. Robust covariance estimation

Real data sets are often subject to measurement or recording errors. Regular but uncommon observations may also appear for a variety of reasons; observations which are very uncommon are called outliers. The empirical covariance estimator and the shrunk covariance estimators presented above are very sensitive to the presence of outliers in the data. Therefore, one should use robust covariance estimators to estimate the covariance of real data sets. Alternatively, robust covariance estimators can be used to perform outlier detection and discard or downweight some observations in further processing of the data.

The sklearn.covariance package implements a robust estimator of covariance, the Minimum Covariance Determinant [3].

2.6.4.1. Minimum Covariance Determinant

The Minimum Covariance Determinant estimator is a robust estimator of a data set's covariance introduced by P.J. Rousseeuw in [3]. The idea is to find a given proportion (h) of "good" observations that are not outliers and compute their empirical covariance matrix. This empirical covariance matrix is then rescaled to compensate for the performed selection of observations (the "consistency step"). Having computed the Minimum Covariance Determinant estimator, one can give weights to observations according to their Mahalanobis distance, leading to a reweighted estimate of the covariance matrix of the data set (the "reweighting step").

Rousseeuw and Van Driessen [4] developed the FastMCD algorithm in order to compute the Minimum Covariance Determinant. This algorithm is used in scikit-learn when fitting an MCD object to data. The FastMCD algorithm also computes a robust estimate of the data set location at the same time.

Raw estimates can be accessed as the raw_location_ and raw_covariance_ attributes of a MinCovDet robust covariance estimator object.
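A minimal sketch of fitting MinCovDet to a toy data set contaminated with a few outliers; the data below is an assumed example for illustration:

    import numpy as np
    from sklearn.covariance import MinCovDet

    # Toy data: 300 draws from a 2D Gaussian, with the first ten observations shifted
    # far from the bulk so that they act as outliers.
    rng = np.random.RandomState(0)
    X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, 0.6], [0.6, 1.0]], size=300)
    X[:10] += 8.0

    mcd = MinCovDet(random_state=0).fit(X)
    print(mcd.location_)        # reweighted robust location estimate
    print(mcd.covariance_)      # reweighted robust covariance estimate
    print(mcd.raw_location_)    # raw MCD location estimate, before reweighting
    print(mcd.raw_covariance_)  # raw MCD covariance estimate, before reweighting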

Figure: influence of outliers on location and covariance estimates.

Figure: separating inliers from outliers using a Mahalanobis distance.

References
