[MADS] Mahalanobis distance

hlee — Mon, 09 Mar 2009 21:18:11 +0000

It bears the name of its inventor, Prasanta Chandra Mahalanobis. Unlike the Euclidean distance, a household name, this distance is rarely called by its own name; instead it goes by many pseudonyms, with variations adapted to a broad range of scientific disciplines and applications. I therefore believe that, under different names, the Mahalanobis distance is frequently applied in exploring and analyzing astronomical data.

First, a simple definition of the Mahalanobis distance:
$$D^2(X_i)=(X_i-\bar{X})^T\hat{\Sigma}^{-1}(X_i-\bar{X})$$
Here $\bar{X}$ is the sample mean vector and $\hat{\Sigma}$ is the estimated covariance matrix. It can be seen as a Euclidean distance after standardizing multivariate data. In one way or another, scatter plots and regression analyses (including all sorts of fancy correlation studies) reflect the notion of this distance.
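To make the "Euclidean distance after standardizing" reading concrete, here is a minimal NumPy sketch. It assumes the rows of `X` are observations and the columns are variables, and it uses the sample mean and sample covariance as the estimates in the formula. The function names `mahalanobis_sq` and `whitened_euclidean_sq` are hypothetical and chosen only for illustration.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # sample covariance (n - 1 denominator)
    # Solve cov @ z = centered.T instead of forming the inverse explicitly.
    z = np.linalg.solve(cov, centered.T)
    # Row-wise quadratic form (x_i - mean)^T cov^{-1} (x_i - mean).
    return np.einsum("ij,ji->i", centered, z)

def whitened_euclidean_sq(X):
    """Squared Euclidean norm of each row after standardizing with cov = L L^T."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    L = np.linalg.cholesky(np.cov(X, rowvar=False))
    whitened = np.linalg.solve(L, centered.T).T  # rows are L^{-1} (x_i - mean)
    return np.sum(whitened**2, axis=1)
```

Up to floating-point error the two functions should return the same values, which is the sense in which the Mahalanobis distance is a Euclidean distance on decorrelated, rescaled data.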

To my knowledge, the Mahalanobis distance is employed in exploring multivariate data when one sets out to find, diagnose, or justify the removal of outliers, analogous to astronomers’ 3σ or 5σ cuts in univariate cases. Classical textbooks on multivariate data analysis or classification/clustering have detailed information. The Wikipedia article on the Mahalanobis distance is also worth checking.
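The outlier use can be sketched as follows, under stated assumptions. For roughly Gaussian data with p variables, the squared distance approximately follows a chi-square distribution with p degrees of freedom. For p = 2 that quantile has a closed form, so this sketch stays with two variables and needs nothing beyond NumPy. The function name `flag_outliers_2d` and the parameter `k_sigma` are hypothetical, and the mean and covariance are the plain (non-robust) sample estimates.

```python
import math
import numpy as np

def flag_outliers_2d(X, k_sigma=3.0):
    """Flag rows of a two-column X whose squared Mahalanobis distance is too large.

    The cut has the same tail probability as a two-sided k_sigma cut in 1-D.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim != 2 or X.shape[1] != 2:
        raise ValueError("the closed-form threshold is valid only for 2 variables")
    centered = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    d2 = np.einsum("ij,ji->i", centered, np.linalg.solve(cov, centered.T))
    # Two-sided Gaussian tail probability beyond k_sigma (about 0.0027 for 3 sigma).
    alpha = math.erfc(k_sigma / math.sqrt(2.0))
    # Chi-square (2 dof) quantile at 1 - alpha: solve exp(-t / 2) = alpha for t.
    threshold = -2.0 * math.log(alpha)
    return d2 > threshold, d2, threshold
```

A point such as (3, -3) in strongly positively correlated data sits only 3σ away in each coordinate taken separately, yet it lies far from the bulk in the Mahalanobis sense. That is the kind of case a per-variable cut tends to miss.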

Surprisingly, despite its popularity, the term itself is under-represented in ADS. Only a single paper, in ApJS, was found with the name Mahalanobis in its abstract; none turned up in the other major astronomical journals.

ApJS, v.176, pp.276-292 (2008): The Palomar Testbed Interferometer Calibrator Catalog (van Belle et al.)

Unfortunately, their description and usage of the Mahalanobis distance are quite ambiguous. See the quote:

The Mahalanobis distance (MD) is a multivariate generalization of one-dimensional Euclidean distance

They only showed the MD for measuring distance among uncorrelated variables/metrics and did not mention that the generalization is obtained through the covariance matrix.

Since standardization (generalization or normalization) is a pretty common practice, the lack of appearances in abstracts does not mean that it is not used in astronomy. So I did a full text search among A&A, ApJ, AJ, and MNRAS, which led me to four more publications containing the Mahalanobis distance. Far fewer than I expected.

A&A, 330, 215 (1998): Study of an unbiased sample of B stars observed with Hipparcos: the discovery of a large amount of new slowly pulsating B stars (Waelkens et al.)
MNRAS, 334, 20 (2002): UBV(RI)C photometry of Hipparcos red stars (Koen et al.)
AJ, 99, 1108 (1990): Kinematics and composition of H II regions in spiral galaxies. II – M51, M101 and NGC 2403 (Zaritsky, Elston, and Hill)
A&A, 343, 496 (1999): An analysis of the incidence of the Vega phenomenon among main-sequence and post main-sequence stars (Plets and Vynckier)

The last two papers give the definition in the form I know (p. 1114 of Zaritsky, Elston, and Hill and p. 504 of Plets and Vynckier).

Including the usages given in these papers, the Mahalanobis distance is popularly used in exploratory data analysis (EDA) for: 1. measuring distance, 2. outlier removal, 3. checking normality of multivariate data. Because it requires estimating the (inverse) covariance matrix, it shares tools with principal component analysis (PCA), linear discriminant analysis (LDA), and other methods involving Wishart distributions.

By the way, the Wishart distribution is also under-represented in ADS. Only one paper appeared in an abstract text search.
[2006MNRAS.372.1104P] Likelihood techniques for the combined analysis of CMB temperature and polarization power spectra (Percival, W. J.; Brown, M. L.)

Lastly, I want to point out that estimating the covariance matrix and inverting it can be very challenging in various problems, which has led people to develop numerous algorithms, strategies, and applications. These mathematical and computational challenges and their prescriptions are not presented in this post. Please be aware that, with real data, estimating the (inverse) covariance matrix is not as simple as it is presented here.
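To give a flavor of the difficulty without going into those prescriptions, here is a minimal sketch of one common failure. With fewer observations than variables, the sample covariance matrix is singular and cannot be inverted. The remedy shown, blending the sample covariance with a scaled identity matrix, is a generic shrinkage idea and not a method from any of the papers above. The helper `shrink_covariance` and the weight `lam=0.1` are hypothetical and arbitrary, and the data are simulated.

```python
import numpy as np

def shrink_covariance(X, lam=0.1):
    """Blend the sample covariance with a scaled identity (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    p = X.shape[1]
    cov = np.cov(X, rowvar=False)
    # Target keeps the same average variance but carries no correlations.
    target = (np.trace(cov) / p) * np.eye(p)
    return (1.0 - lam) * cov + lam * target

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))            # 5 observations, 10 variables (simulated)
cov = np.cov(X, rowvar=False)
rank = np.linalg.matrix_rank(cov)       # at most n - 1 = 4, so cov is singular
shrunk = shrink_covariance(X, lam=0.1)  # lam is an arbitrary illustrative weight
min_eig = np.linalg.eigvalsh(shrunk).min()  # positive, so shrunk is invertible
```

How much to shrink, and toward what target, is exactly the kind of choice that depends on the problem at hand.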

The AstroStat Slog » Mahalanobis
