[MADS] Mahalanobis distance

It bears the name of its inventor, Prasanta Chandra Mahalanobis. Unlike the Euclidean distance, a household name, this distance is rarely called by its own name; instead it goes by many pseudonyms, with variations adapted to a broad range of scientific disciplines and applications. I therefore believe that, under different names, the Mahalanobis distance is frequently applied in exploring and analyzing astronomical data.

First, a simple definition of the Mahalanobis distance:
$$D^2(X_i)=(X_i-\bar{X})^T\hat{\Sigma}^{-1}(X_i-\bar{X})$$
Here $\bar{X}$ is the sample mean and $\hat{\Sigma}$ is the sample covariance matrix. The distance can be seen as a Euclidean distance after standardizing multivariate data. In one way or another, scatter plots and regression analyses (including all sorts of fancy correlation studies) reflect the notion of this distance.
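The "Euclidean distance after standardizing" reading can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic two-dimensional Gaussian sample; the function names `mahalanobis_sq` and `whitened_euclidean_sq` are made up for this illustration. It computes $D^2$ directly from the definition and again as a plain squared Euclidean norm after whitening the data with a Cholesky factor of $\hat{\Sigma}$.

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of every row of X from the sample mean."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)            # sample covariance (Sigma-hat)
    # Solve Sigma z = diff^T instead of forming the inverse explicitly.
    z = np.linalg.solve(cov, diff.T).T
    return np.einsum("ij,ij->i", diff, z)

def whitened_euclidean_sq(X):
    """Squared Euclidean norm of the data after centering and whitening."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    L = np.linalg.cholesky(cov)              # cov = L L^T
    w = np.linalg.solve(L, diff.T).T         # whitened residuals, L^{-1} (x - mean)
    return np.sum(w ** 2, axis=1)

# Synthetic correlated data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.2], [1.2, 1.0]], size=500)

d2 = mahalanobis_sq(X)
same = np.allclose(d2, whitened_euclidean_sq(X))   # the two routes should agree
```

Solving a linear system rather than inverting $\hat{\Sigma}$ is a common numerical choice, not something the definition requires.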

To my knowledge, the Mahalanobis distance is employed in exploring multivariate data when the goal is to find, diagnose, or justify the removal of outliers, analogous to astronomers’ 3 or 5 σ cuts in univariate cases. Classical textbooks on multivariate data analysis or classification/clustering give detailed information. The wiki site is also worth checking.
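As a rough illustration of the outlier use, here is a sketch restricted to two-dimensional data, assuming NumPy and approximate multivariate normality. Under that assumption $D^2$ is approximately chi-square distributed with 2 degrees of freedom, whose CDF is $1-e^{-t/2}$, so a cut with the same Gaussian coverage as a univariate $n\sigma$ cut has a closed form (about 11.8 for 3σ). The function names and the planted point are hypothetical.

```python
import math
import numpy as np

def sigma_equivalent_threshold_2d(n_sigma):
    """D^2 cut for 2-D data with the same coverage as a univariate n_sigma cut."""
    coverage = math.erf(n_sigma / math.sqrt(2.0))   # P(|z| < n_sigma) for z ~ N(0, 1)
    # For p = 2, D^2 is approximately chi-square(2) with CDF 1 - exp(-t / 2).
    return -2.0 * math.log(1.0 - coverage)

def flag_outliers_2d(X, n_sigma=3.0):
    """Return a boolean outlier mask and the squared distances for 2-D data."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    return d2 > sigma_equivalent_threshold_2d(n_sigma), d2

# Synthetic, positively correlated sample with one planted point that lies
# within 3 sigma in each coordinate but runs against the correlation.
rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=300)
X[0] = [2.0, -2.0]

is_outlier, d2 = flag_outliers_2d(X)
z_scores = np.abs((X[0] - X.mean(axis=0)) / X.std(axis=0, ddof=1))
```

The planted point passes a coordinate-by-coordinate 3σ check yet stands out in $D^2$, which is the situation where the multivariate distance earns its keep. For other dimensions the chi-square quantile no longer has this simple closed form.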

Surprisingly, despite its popularity, the terminology itself is under-represented in ADS. Only a single paper, in ApJS, has the name Mahalanobis in its abstract; none was found in the other major astronomical journals.

ApJS, v.176, pp.276-292 (2008): The Palomar Testbed Interferometer Calibrator Catalog (van Belle et al)

Unfortunately, their description and usage of the Mahalanobis distance are quite ambiguous. See the quote:

The Mahalanobis distance (MD) is a multivariate generalization of one-dimensional Euclidean distance

They showed the MD only for measuring distance among uncorrelated variables/metrics and did not mention that the generalization comes from the covariance matrix.

Since standardization (generalization or normalization) is a pretty common practice, the lack of appearances in abstracts does not mean it is not used in astronomy. So I did a full-text search of A&A, ApJ, AJ, and MNRAS, which led me to four more publications mentioning the Mahalanobis distance. Far fewer than I expected.

  1. A&A, 330, 215 (1998): Study of an unbiased sample of B stars observed with Hipparcos: the discovery of a large amount of new slowly pulsating B stars (Waelkens et al.)
  2. MNRAS, 334, 20 (2002): UBV(RI)C photometry of Hipparcos red stars (Koen et al.)
  3. AJ, 99, 1108 (1990): Kinematics and composition of H II regions in spiral galaxies. II – M51, M101 and NGC 2403 (Zaritsky, Elston, and Hill)
  4. A&A, 343, 496 (1999): An analysis of the incidence of the Vega phenomenon among main-sequence and post main-sequence stars (Plets and Vynckier)

The last two papers give the definition in the form I know (p. 1114 of Zaritsky, Elston, and Hill and p. 504 of Plets and Vynckier).

Including the usage in these papers, the Mahalanobis distance is widely used in exploratory data analysis (EDA) for (1) measuring distance, (2) outlier removal, and (3) checking normality of multivariate data. Because it requires estimating the (inverse) covariance matrix, it shares tools with principal component analysis (PCA), linear discriminant analysis (LDA), and other methods that rely on the Wishart distribution.
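For the normality check, one common recipe is a quantile-quantile comparison of the sorted $D^2$ values against chi-square quantiles. The sketch below assumes NumPy and, again, two-dimensional data, so that the chi-square(2) quantile at probability $q$ is simply $-2\ln(1-q)$; the function name `chi2_qq_2d` and the plotting positions $(i-0.5)/n$ are choices made for this illustration.

```python
import numpy as np

def chi2_qq_2d(X):
    """Theoretical chi-square(2) quantiles and sorted D^2 values for 2-D data."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    diff = X - X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    d2 = np.einsum("ij,ij->i", diff, np.linalg.solve(cov, diff.T).T)
    probs = (np.arange(1, n + 1) - 0.5) / n      # plotting positions
    theoretical = -2.0 * np.log1p(-probs)        # chi-square(2) quantile function
    return theoretical, np.sort(d2)

# Synthetic Gaussian sample, purely for illustration.
rng = np.random.default_rng(3)
gauss = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]], size=500)

q_theory, q_sample = chi2_qq_2d(gauss)
r = np.corrcoef(q_theory, q_sample)[0, 1]        # near 1 when the points hug a line
```

In practice one would plot `q_sample` against `q_theory` and look for departures from the one-to-one line, especially in the upper tail; the correlation coefficient is only a crude summary.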

By the way, the Wishart distribution is also under-represented in ADS. Only one paper turned up in an abstract text search.
[2006MNRAS.372.1104P] Likelihood techniques for the combined analysis of CMB temperature and polarization power spectra (Percival, W. J.; Brown, M. L)

Lastly, I want to point out that estimating the covariance matrix and inverting it can be very challenging in various problems, which has led people to develop numerous algorithms, strategies, and applications. These mathematical and computational challenges and their prescriptions are not covered in this post. Please be aware that, with real data, estimating the (inverse) covariance matrix is not as simple as it is usually presented.
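One such difficulty is easy to reproduce. In the sketch below (NumPy assumed, synthetic data), there are fewer observations than variables, so the sample covariance matrix is rank deficient and has no inverse. The remedy shown, shrinking toward a scaled identity with a hand-picked weight `alpha`, is just one simple option chosen for illustration and is not a recommendation from this post; the function name `shrink_to_identity` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 10, 20                        # fewer observations than variables
X = rng.normal(size=(n, p))
S = np.cov(X, rowvar=False)          # p x p sample covariance

rank = np.linalg.matrix_rank(S)      # at most n - 1, so S is singular

def shrink_to_identity(S, alpha):
    """Convex combination of S and a scaled identity; alpha in (0, 1] is hand-picked."""
    dim = S.shape[0]
    target = (np.trace(S) / dim) * np.eye(dim)
    return (1.0 - alpha) * S + alpha * target

S_shrunk = shrink_to_identity(S, alpha=0.2)   # alpha = 0.2 is arbitrary, for illustration
precision = np.linalg.inv(S_shrunk)           # the inverse now exists
```

How much to shrink, or whether to use a robust or sparse estimator instead, depends on the problem, which is exactly where the difficulty lies.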

6 Comments
  1. Raffaele D'Abrusco:

    Very interesting post, as I have been using the Mahalanobis’ distance recently I was delighted to know how frequently this useful distance can be found in astronomical papers. I hope you won’t mind if I suggest to you my own publication where Mahalanobis’ distance is used: http://lanl.arxiv.org/abs/0805.0156 .

    03-12-2009, 1:39 pm
  2. hlee:

    Because of some difficulties in sieving the flow of arXiv papers (I don’t have an iPhone to use the iFish app, my efforts last year depended heavily on human labor, and ADS searches include all arXiv disciplines), I couldn’t mention papers like yours. New information and pointers to the holes in my imperfect colanders are always welcome.

    03-13-2009, 12:20 am
  3. Jiangang:

    Well, we have been using the M-distance to compare galaxies in terms of their colors. One tricky thing about using the M-distance is the calculation of the covariance. Since the traditional estimator is sensitive to outliers, some robust procedure should be applied. Meanwhile, it is sensible to use a local covariance rather than a global covariance, which needs some iterations.

    03-16-2009, 3:53 pm
  4. TomLoredo:

    Hyunsook, here’s another reference for you:

    Object detection in multi-epoch data by Jogesh Babu et al.

    This appeared in a special issue of Statistical Methodology devoted to astrostatistics, published last July.

    03-24-2009, 3:18 pm
  5. TomLoredo:

    Darn, I hate how GroundTruth doesn’t give you a preview or let you edit posts! Of course, the paper is by Babu et al..

    03-24-2009, 3:19 pm
  6. vlk:

    I corrected it, Tom :)

    We’re still feeling our way around this rather temperamental wordpress system. Comment preview, last we checked, completely messes up the spam filter (not that the filter is all that well behaved now — it has been flagging comments as spam quite randomly, apologies to all those affected!)

    03-24-2009, 3:25 pm