When there are many pairs of variables demanding numerous scatter plots, one possibility is to use parallel coordinates and a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then the error bars of these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, one might want to look into ANCOVA (ANalysis of COVAriance) to see how statisticians structure observations and their uncertainties into a model to extract useful information.
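To make the idea concrete, here is a minimal sketch (my addition, on simulated numbers) of reporting a correlation matrix together with a matrix of error bars under the Gaussian assumption, using the Fisher z-transform, whose standard error is approximately 1/sqrt(n-3):

```python
# Sketch: a matrix of correlation coefficients plus matrices of 1-sigma bounds.
# The data and covariance below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[1.0, 0.6, 0.1],
                             [0.6, 1.0, 0.3],
                             [0.1, 0.3, 1.0]], size=n)

R = np.corrcoef(X, rowvar=False)                          # correlation coefficients
Z = np.arctanh(np.where(np.eye(3, dtype=bool), 0.0, R))   # Fisher z; diagonal is trivially 1, skipped
se = 1.0 / np.sqrt(n - 3)                                 # approximate standard error on z
R_lo, R_hi = np.tanh(Z - se), np.tanh(Z + se)             # 1-sigma bounds back on the r scale
                                                          # (diagonal entries here are meaningless)
np.set_printoptions(precision=2)
print("correlations:\n", R)
print("lower bounds:\n", R_lo)
print("upper bounds:\n", R_hi)
```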
I’m not saying that those simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely they are not: astrophysics generally handles complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors, to name some of the jargon of statistical modeling, and defining their relationships in a comprehensive way as ANCOVA does, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at zero, look as if they form a bar rather than a circle when seen edge-on, producing an artificial, nearly perfect correlation.
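As a toy illustration of such an artificial correlation (my own made-up example, not from any survey), consider two observables that both depend on a shared covariate; regressing the covariate out, as an ANCOVA-style model effectively does, removes the apparent correlation:

```python
# Sketch: a spurious correlation induced by a common covariate z, removed by
# taking residuals on z. All numbers are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.uniform(0.0, 1.0, n)                   # shared covariate (e.g. distance or redshift)
x = 2.0 * z + rng.normal(0.0, 0.1, n)          # observable 1, driven by z
y = -3.0 * z + rng.normal(0.0, 0.1, n)         # observable 2, also driven by z

print("raw correlation:", np.corrcoef(x, y)[0, 1])              # close to -1

# remove the covariate via residuals of simple linear fits of x and y on z
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print("correlation after removing z:", np.corrcoef(rx, ry)[0, 1])  # close to 0
```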
As a matter of fact, astronomers are aware that these unnecessary correlations should be removed via corrections, for example by fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because they change the uncertainty structure. Consider what happens when subtracting a background leads to negative values, an unrealistic consequence. Unless it is probabilistically guaranteed, a linear operation requires a lot of care. Residuals y-E(Y|X=x) are perfectly normal only if the μ's and σ's in the Gaussian density function can be operated on linearly (about the Gaussian distribution, please see the post why Gaussianity? and the references therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw through applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Properly accounting for the sources of uncertainty and their hierarchical structure is not an easy problem, either astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics and putting them into a comprehensive, structured model could help to find the causality[2] and the significance of a correlation better than throwing out numerous scatter plots with lines from simple regression analysis.
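Here is a tiny simulated illustration (my addition) of the background subtraction problem mentioned above: for faint Poisson counts, the naive linear shift produces negative net counts, which no physical source can have:

```python
# Sketch: subtracting an estimated background from low Poisson counts yields
# negative "fluxes"; the source and background levels are invented for illustration.
import numpy as np

rng = np.random.default_rng(8)
source, background = 2.0, 5.0                       # faint source on a brighter background
counts = rng.poisson(source + background, size=10000)
net = counts - background                           # naive background subtraction

print("fraction of negative net counts:", np.mean(net < 0))
```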
In order to understand why statisticians studied ANCOVA, or ANOVA (ANalysis Of VAriance) more generally, in addition to the material in wiki:ANCOVA you might want to check this page[3] and use your search engine with keywords of interest on top of ANCOVA to narrow down the results.
From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to other factors in the model, can be tested later when one fully fits/analyzes the statistical model. Suppose one wants to design a model of, say, rotation speed (an indicator of dark matter occupation) as a function of redshift, the degree of spirality, and the number of companions (a very hypothetical proposal, due to my lack of knowledge in observational cosmology; I only want to point out that the model fitting problem can be seen from statistical modeling like ANCOVA by identifying covariates and relationships). Because the covariate z is continuous, the degrees of spirality are a fixed effect (0 to 7, i.e. 8 levels), and the number of companions is a random effect (cluster size is random), the comprehensive model could be described by ANCOVA, as sketched below. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which can be the main contributors to the correlations, even though Y and X may seem highly correlated in the scatter plot. At some point we must marginalize over unknowns. Nonetheless, we still have nuisance parameters and latent variables that can be factored into the model, which is different from ignoring them, in order to obtain deeper insight and knowledge from observations in many measures/dimensions.
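For what it is worth, here is a hedged sketch of how such a hypothetical ANCOVA-like model might be written down as a mixed linear model; every column name (speed, z, spirality, ncompanion) is invented for illustration and the data are simulated:

```python
# Sketch: an ANCOVA-style mixed model for the hypothetical proposal above,
# with a continuous covariate (z), a categorical fixed effect (spirality, 8 levels),
# and a random intercept per number of companions. Not a real cosmological model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "z": rng.uniform(0.0, 2.0, n),          # continuous covariate (redshift)
    "spirality": rng.integers(0, 8, n),     # fixed effect with levels 0..7
    "ncompanion": rng.integers(0, 12, n),   # random-effect grouping (cluster size)
})
# toy response depending on z and spirality plus Gaussian noise
df["speed"] = 200 - 30 * df["z"] + 5 * df["spirality"] + rng.normal(0, 10, n)

model = smf.mixedlm("speed ~ z + C(spirality)", df, groups=df["ncompanion"])
result = model.fit()
print(result.summary())   # significance of z and of the spirality levels
```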
Something that, I think, could be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is done in astronomy with numerous scatter plots, with colors, symbols, and lines, to account for all possible relationships within pairs whose correlations can be artificial. I also feel that trees, electricity, and effort could be saved from producing nice-looking scatter plots. Fitting/analyzing more comprehensive models put into a statistical fashion helps to identify the independent variables or covariates causing strong correlations, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can be used for comparing models, testing the significance of their variables, and computing error bars appropriately, not in the way that the null hypothesis probability is usually interpreted.
Lastly, ANCOVA is a complete [MADS].
The need for source separation methods in astronomy has led to various adaptations of the available decomposition methods. It is not difficult to locate such applications in journals of various fields, including astronomical journals. However, they most likely advocate the one dimension reduction method of the authors' choice over others to emphasize that their strategy works better. I rarely come across a paper that gathers and summarizes the component separation methods applicable to astronomical data. In that regard, the following paper seems useful to astronomers as an overview of methods for reducing dimensionality.
[arxiv:0805.0269]
Component separation methods for the Planck mission
S.M.Leach et al.
Check its appendix for method description.
Various libraries/modules are available through software and data analysis systems, so one can try various dimension reduction methods conveniently. My only concern is the challenge of interpretation after these computational/mathematical/statistical analyses: how to assign a physical interpretation to the images/spectra produced by the decomposition. I think this is a big open question.
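For instance, a few off-the-shelf decomposition methods can be tried in a handful of lines. The sketch below (my addition, not the Planck pipeline) runs PCA, ICA, and NMF on a hypothetical matrix of spectra; the physical interpretation of the resulting components remains the open question:

```python
# Sketch: three decomposition methods applied to a made-up matrix of spectra
# (rows = objects, columns = wavelength bins), using scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF

rng = np.random.default_rng(9)
# nonnegative toy "spectra" so that NMF is applicable
spectra = np.abs(rng.normal(size=(100, 10))) @ np.abs(rng.normal(size=(10, 400)))

pca_scores = PCA(n_components=5).fit_transform(spectra)
ica_scores = FastICA(n_components=5, random_state=0, max_iter=1000).fit_transform(spectra)
nmf_scores = NMF(n_components=5, init="nndsvda", max_iter=500).fit_transform(spectra)

print(pca_scores.shape, ica_scores.shape, nmf_scores.shape)   # (100, 5) each
```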
First, a simple definition of the Mahalanobis distance:
$$D^2(X_i)=(X_i-\bar{X})^T\hat{\Sigma}^{-1}(X_i-\bar{X})$$
It can be seen as a Euclidean distance after standardizing multivariate data. In one way or another, scatter plots and regression analyses (including all sorts of fancy correlation studies) reflect the notion of this distance.
To my knowledge, the Mahalanobis distance is employed for exploring multivariate data when one wants to find, diagnose, or justify the removal of outliers, analogous to astronomers' 3σ or 5σ cuts in univariate cases. Classical textbooks on multivariate data analysis or classification/clustering have detailed information. One may also want to check the wiki site here.
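To make the analogy explicit, here is a minimal sketch (my addition, on simulated data) that computes the squared Mahalanobis distances defined above and flags outliers against a chi-square threshold, the multivariate counterpart of a univariate sigma cut:

```python
# Sketch: squared Mahalanobis distances D^2 and a chi-square-based outlier cut.
# The mean, covariance, and sample size are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[1.0, 0.5, 0.2],
                                 [0.5, 1.0, 0.3],
                                 [0.2, 0.3, 1.0]],
                            size=500)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # estimated inverse covariance
diff = X - xbar
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)    # squared Mahalanobis distances

# Under multivariate normality, D^2 approximately follows chi^2 with p d.o.f.
p = X.shape[1]
threshold = stats.chi2.ppf(0.997, df=p)             # roughly a "3-sigma" tail cut
outliers = np.where(d2 > threshold)[0]
print(len(outliers), "points flagged as outliers")
```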
Surprisingly, despite its popularity, the terminology itself is under-represented in ADS. A single paper in ApJS, and none in the other major astronomical journals, was found with the name Mahalanobis in its abstract.
ApJS, v.176, pp.276-292 (2008): The Palomar Testbed Interferometer Calibrator Catalog (van Belle et al)
Unfortunately, their description and usage of the Mahalanobis distance are quite ambiguous. See the quote:
The Mahalanobis distance (MD) is a multivariate generalization of one-dimensional Euclidean distance
They only showed the MD for measuring distances among uncorrelated variables/metrics and did not mention that the generalization is obtained from the covariance matrix.
Since standardization (generalization or normalization) is a pretty common practice, the lack of appearance in abstracts does not mean it is not used in astronomy. So I did a full-text search of A&A, ApJ, AJ, and MNRAS, which led me to four more publications containing the Mahalanobis distance. Quite a bit fewer than I expected.
The last two papers give the definition in the form I know it (p. 1114 of Zaritsky, Elston, and Hill's and p. 504 of Plets and Vynckier's).
Including the usages given in these papers, the Mahalanobis distance is popular in exploratory data analysis (EDA) for: 1. measuring distance, 2. outlier removal, and 3. checking the normality of multivariate data. Because it requires estimating the (inverse) covariance matrix, it shares tools with principal component analysis (PCA), linear discriminant analysis (LDA), and other methods involving Wishart distributions.
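The third use, checking multivariate normality, is often done with a chi-square Q-Q plot of the squared distances; a rough sketch on simulated data (my addition) follows:

```python
# Sketch: ordered squared Mahalanobis distances compared to chi^2_p quantiles.
# A roughly straight line along y = x supports multivariate normality.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - xbar
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

n, p = X.shape
probs = (np.arange(1, n + 1) - 0.5) / n
chi2_q = stats.chi2.ppf(probs, df=p)

plt.scatter(chi2_q, np.sort(d2), s=8)
plt.plot(chi2_q, chi2_q, color="red")   # reference line: perfect agreement with chi^2_p
plt.xlabel("chi-square quantiles (p d.o.f.)")
plt.ylabel("ordered squared Mahalanobis distances")
plt.show()
```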
By the way, the Wishart distribution is also under-represented in ADS: only one paper appeared in an abstract text search.
[2006MNRAS.372.1104P] Likelihood techniques for the combined analysis of CMB temperature and polarization power spectra (Percival, W. J.; Brown, M. L)
Lastly, I want to point out that estimating the covariance matrix and its inverse can be very challenging in various problems, which has led people to develop numerous algorithms, strategies, and applications. These mathematical and computer-scientific challenges and prescriptions are not presented in this post. Please be aware that estimating the (inverse) covariance matrix with real data is not as simple as it is presented here.
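As one example of such a prescription (my addition, not discussed in the papers above), shrinkage estimators such as Ledoit-Wolf give a well-conditioned covariance and precision matrix even when the plain sample covariance cannot be inverted:

```python
# Sketch: shrinkage estimation of the covariance on simulated "wide" data
# (fewer samples than variables, where the sample covariance is singular).
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 60))        # n = 40 samples, p = 60 variables: n < p

lw = LedoitWolf().fit(X)
Sigma_hat = lw.covariance_           # shrunk, well-conditioned covariance estimate
Omega_hat = lw.precision_            # its inverse (the precision matrix)
print("shrinkage coefficient:", lw.shrinkage_)
```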
I’m glad to see this week presented a paper that I had dreamed of many years ago, in addition to other interesting papers. Nowadays, I’m realizing more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from data repositories whose characteristics have been well known for many years (for example, the famous iris data; between toy data sets and mock catalogs, there is no shortage of data sets with well-known characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl’s dream. With a bit of my sentiment, I offer the list of this week:
A relevant slog post related to machine learning on galaxy morphology can be found at svm and galaxy morphological classification
< Added: 3rd week May 2008>[astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour
Although a quintessentially statistical notion, my impression is that PCA has always been more popular with non-statisticians. Of course we love to prove its optimality properties in our courses, and at one time the distribution theory of sample covariance matrices was heavily studied.
…but who could not feel suspicious when observing the explosive growth in the use of PCA in the biological and physical sciences and engineering, not to mention economics?…it became the analysis tool of choice of the hordes of former physicists, chemists and mathematicians who unwittingly found themselves having to be statisticians in the computer age.
My initial theory for its popularity was simply that they were in love with the prefix eigen-, and felt that anything involving it acquired the cachet of quantum mechanics, where, you will recall, everything important has that prefix.
He gave the following eigen-’s: eigengenes, eigenarrays, eigenexpression, eigenproteins, eigenprofiles, eigenpathways, eigenSNPs, eigenimages, eigenfaces, eigenpatterns, eigenresult, and even eigenGoogle.
How many miracles must one witness before becoming a convert?…Well, I’ve seen my three miracles of exploratory data analysis, examples where I found I had a problem, and could do something about it using PCA, so now I’m a believer.
Needless to say, astronomers explore data with PCA and utilize eigenvalues and eigenvectors to transform raw data into more interpretable forms.
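For completeness, a minimal sketch (my addition, on toy data) of that eigen-decomposition view of PCA:

```python
# Sketch: PCA via eigendecomposition of the sample covariance of a toy data matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                      # principal component scores
explained = eigvals / eigvals.sum()        # fraction of variance per component
print("explained variance fractions:", np.round(explained, 3))
```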
Update: The poster is at CHASC.
Spectroscopic Surveys: Present by Yip, C. overviews recent spectroscopic sky surveys and spectral analysis techniques toward Virtual Observatories (VO). Besides spectroscopic redshift measurements increasing like Moore’s law, the surveys tend to go deeper and aim for completeness. Mainly elliptical galaxy formation has been studied, owing to their greater abundance compared to spirals, and the galactic bimodality in color-color or color-magnitude diagrams is the result of gas-rich mergers, with blue mergers forming the red sequence. Principal component analysis has incorporated ratios of emission line strengths for classifying Type II AGN and star-forming galaxies. Lyα identifies high-z quasars, and other spectral patterns over z reveal the history of the early universe and the characteristics of quasars. Also, the recent discovery of 10 satellites of the Milky Way is mentioned.
Spectral analyses take two approaches: one is the model-based approach using theoretical templates, known for its flaws but allowing straightforward extraction of physical parameters, and the other is the empirical approach, useful for making discoveries but difficult to interpret. Neither has a substantial advantage over the other. When it comes to fitting, chi-square minimization has been dominant, but new methodologies are under development. For spectral classification problems, principal component analysis (the Karhunen-Loève transformation), artificial neural networks, and other machine learning techniques have been applied.
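As a caricature of the chi-square minimization approach (my addition, with entirely made-up templates and data), one scales each template to the observed spectrum and keeps the one with the smallest chi-square:

```python
# Sketch: fit hypothetical template spectra to a fake observed spectrum with
# per-bin errors by chi-square minimization; all shapes and numbers are invented.
import numpy as np

rng = np.random.default_rng(5)
wave = np.linspace(4000.0, 7000.0, 300)                     # wavelength grid (Angstrom)
templates = {                                               # made-up template shapes
    "blue": np.exp(-(wave - 4500.0) ** 2 / 2e5),
    "red": np.exp(-(wave - 6500.0) ** 2 / 2e5),
}
err = 0.05 * np.ones_like(wave)
observed = 2.0 * templates["red"] + rng.normal(0.0, err)    # fake "observed" spectrum

best = None
for name, tmpl in templates.items():
    # analytic best-fit scale for a single template, then its chi-square
    scale = np.sum(observed * tmpl / err**2) / np.sum(tmpl**2 / err**2)
    chi2 = np.sum(((observed - scale * tmpl) / err) ** 2)
    if best is None or chi2 < best[2]:
        best = (name, scale, chi2)

print("best template: %s, scale %.2f, chi2/dof %.2f"
      % (best[0], best[1], best[2] / (wave.size - 1)))
```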
In the end, the author reports the statistical and astrophysical challenges posed by present-day massive spectroscopic data: 1. modeling galaxies, 2. parameterizing star formation history, 3. modeling quasars, 4. multi-catalog-based calibration (separating systematic and statistical errors), and 5. estimating parameters, all of which would be beneficial to the VO, whose objective is the unification of data access.