When there are many pairs of variables demanding numerous scatter plots, one possibility is to use parallel coordinates and a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then the error bars of these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, one might want to look into ANCOVA (ANalysis of COVAriance) to see how statisticians structure observations and their uncertainties into a model to extract useful information.
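To make the idea concrete, here is a minimal sketch (my addition, on simulated numbers) of reporting a correlation matrix together with a matrix of error bars under the Gaussian assumption, using the Fisher z-transform, whose standard error is approximately 1/sqrt(n-3):

```python
# Sketch: a matrix of correlation coefficients plus matrices of 1-sigma bounds.
# The data and covariance below are made up purely for illustration.
import numpy as np

rng = np.random.default_rng(6)
n = 200
X = rng.multivariate_normal([0.0, 0.0, 0.0],
                            [[1.0, 0.6, 0.1],
                             [0.6, 1.0, 0.3],
                             [0.1, 0.3, 1.0]], size=n)

R = np.corrcoef(X, rowvar=False)                          # correlation coefficients
Z = np.arctanh(np.where(np.eye(3, dtype=bool), 0.0, R))   # Fisher z; diagonal is trivially 1, skipped
se = 1.0 / np.sqrt(n - 3)                                 # approximate standard error on z
R_lo, R_hi = np.tanh(Z - se), np.tanh(Z + se)             # 1-sigma bounds back on the r scale
                                                          # (diagonal entries here are meaningless)
np.set_printoptions(precision=2)
print("correlations:\n", R)
print("lower bounds:\n", R_lo)
print("upper bounds:\n", R_hi)
```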
I’m not saying that those simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely they are not: astrophysics generally handles complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors, to name some of the jargon of statistical modeling, and defining their relationships in a comprehensive way as ANCOVA does, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at zero, look as if they form a bar rather than a circle when seen edge-on, producing an artificial, nearly perfect correlation.
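As a toy illustration of such an artificial correlation (my own made-up example, not from any survey), consider two observables that both depend on a shared covariate; regressing the covariate out, as an ANCOVA-style model effectively does, removes the apparent correlation:

```python
# Sketch: a spurious correlation induced by a common covariate z, removed by
# taking residuals on z. All numbers are simulated for illustration only.
import numpy as np

rng = np.random.default_rng(7)
n = 500
z = rng.uniform(0.0, 1.0, n)                   # shared covariate (e.g. distance or redshift)
x = 2.0 * z + rng.normal(0.0, 0.1, n)          # observable 1, driven by z
y = -3.0 * z + rng.normal(0.0, 0.1, n)         # observable 2, also driven by z

print("raw correlation:", np.corrcoef(x, y)[0, 1])              # close to -1

# remove the covariate via residuals of simple linear fits of x and y on z
rx = x - np.polyval(np.polyfit(z, x, 1), z)
ry = y - np.polyval(np.polyfit(z, y, 1), z)
print("correlation after removing z:", np.corrcoef(rx, ry)[0, 1])  # close to 0
```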
As a matter of fact, astronomers are aware that these unnecessary correlations should be removed via corrections, for example by fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because they change the uncertainty structure. Consider what happens when subtracting a background leads to negative values, an unrealistic consequence. Unless it is probabilistically guaranteed, a linear operation requires a lot of care. Residuals y-E(Y|X=x) are perfectly normal only if the μ's and σ's in the Gaussian density function can be operated on linearly (about the Gaussian distribution, please see the post why Gaussianity? and the references therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw through applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Properly accounting for the sources of uncertainty and their hierarchical structure is not an easy problem, either astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics and putting them into a comprehensive, structured model could help to find the causality[2] and the significance of a correlation better than throwing out numerous scatter plots with lines from simple regression analysis.
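Here is a tiny simulated illustration (my addition) of the background subtraction problem mentioned above: for faint Poisson counts, the naive linear shift produces negative net counts, which no physical source can have:

```python
# Sketch: subtracting an estimated background from low Poisson counts yields
# negative "fluxes"; the source and background levels are invented for illustration.
import numpy as np

rng = np.random.default_rng(8)
source, background = 2.0, 5.0                       # faint source on a brighter background
counts = rng.poisson(source + background, size=10000)
net = counts - background                           # naive background subtraction

print("fraction of negative net counts:", np.mean(net < 0))
```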
In order to understand why statisticians studied ANCOVA, or ANOVA (ANalysis Of VAriance) more generally, in addition to the material in wiki:ANCOVA you might want to check this page[3] and use your search engine with keywords of interest on top of ANCOVA to narrow down the results.
From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to other factors in the model, can be tested later when one fully fits/analyzes the statistical model. Suppose one wants to design a model of, say, rotation speed (an indicator of dark matter occupation) as a function of redshift, the degree of spirality, and the number of companions (a very hypothetical proposal, due to my lack of knowledge in observational cosmology; I only want to point out that the model fitting problem can be seen from statistical modeling like ANCOVA by identifying covariates and relationships). Because the covariate z is continuous, the degrees of spirality are a fixed effect (0 to 7, i.e. 8 levels), and the number of companions is a random effect (cluster size is random), the comprehensive model could be described by ANCOVA, as sketched below. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which can be the main contributors to the correlations, even though Y and X may seem highly correlated in the scatter plot. At some point we must marginalize over unknowns. Nonetheless, we still have nuisance parameters and latent variables that can be factored into the model, which is different from ignoring them, in order to obtain deeper insight and knowledge from observations in many measures/dimensions.
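For what it is worth, here is a hedged sketch of how such a hypothetical ANCOVA-like model might be written down as a mixed linear model; every column name (speed, z, spirality, ncompanion) is invented for illustration and the data are simulated:

```python
# Sketch: an ANCOVA-style mixed model for the hypothetical proposal above,
# with a continuous covariate (z), a categorical fixed effect (spirality, 8 levels),
# and a random intercept per number of companions. Not a real cosmological model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "z": rng.uniform(0.0, 2.0, n),          # continuous covariate (redshift)
    "spirality": rng.integers(0, 8, n),     # fixed effect with levels 0..7
    "ncompanion": rng.integers(0, 12, n),   # random-effect grouping (cluster size)
})
# toy response depending on z and spirality plus Gaussian noise
df["speed"] = 200 - 30 * df["z"] + 5 * df["spirality"] + rng.normal(0, 10, n)

model = smf.mixedlm("speed ~ z + C(spirality)", df, groups=df["ncompanion"])
result = model.fit()
print(result.summary())   # significance of z and of the spirality levels
```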
Something that, I think, could be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is done in astronomy with numerous scatter plots, with colors, symbols, and lines, to account for all possible relationships within pairs whose correlations can be artificial. I also feel that trees, electricity, and effort could be saved from producing nice-looking scatter plots. Fitting/analyzing more comprehensive models put into a statistical fashion helps to identify the independent variables or covariates causing strong correlations, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can be used for comparing models, testing the significance of their variables, and computing error bars appropriately, not in the way that the null hypothesis probability is usually interpreted.
Lastly, ANCOVA is a complete [MADS].
The need for source separation methods in astronomy has led to various adaptations of the available decomposition methods. It is not difficult to locate such applications in journals of various fields, including astronomical journals. However, they most likely advocate the one dimension reduction method of the authors' choice over others to emphasize that their strategy works better. I rarely come across a paper that gathers and summarizes the component separation methods applicable to astronomical data. In that regard, the following paper seems useful to astronomers as an overview of methods for reducing dimensionality.
[arxiv:0805.0269]
Component separation methods for the Planck mission
S.M.Leach et al.
Check its appendix for method description.
Various libraries/modules are available through software and data analysis systems, so one can try various dimension reduction methods conveniently. My only concern is the challenge of interpretation after these computational/mathematical/statistical analyses: how to assign a physical interpretation to the images/spectra produced by the decomposition. I think this is a big open question.
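For instance, a few off-the-shelf decomposition methods can be tried in a handful of lines. The sketch below (my addition, not the Planck pipeline) runs PCA, ICA, and NMF on a hypothetical matrix of spectra; the physical interpretation of the resulting components remains the open question:

```python
# Sketch: three decomposition methods applied to a made-up matrix of spectra
# (rows = objects, columns = wavelength bins), using scikit-learn.
import numpy as np
from sklearn.decomposition import PCA, FastICA, NMF

rng = np.random.default_rng(9)
# nonnegative toy "spectra" so that NMF is applicable
spectra = np.abs(rng.normal(size=(100, 10))) @ np.abs(rng.normal(size=(10, 400)))

pca_scores = PCA(n_components=5).fit_transform(spectra)
ica_scores = FastICA(n_components=5, random_state=0, max_iter=1000).fit_transform(spectra)
nmf_scores = NMF(n_components=5, init="nndsvda", max_iter=500).fit_transform(spectra)

print(pca_scores.shape, ica_scores.shape, nmf_scores.shape)   # (100, 5) each
```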
First, a simple definition of the Mahalanobis distance:
$$D^2(X_i)=(X_i-\bar{X})^T\hat{\Sigma}^{-1}(X_i-\bar{X})$$
It can be seen as a Euclidean distance after standardizing multivariate data. In one way or another, scatter plots and regression analyses (including all sorts of fancy correlation studies) reflect the notion of this distance.
To my knowledge, the Mahalanobis distance is employed for exploring multivariate data when one wants to find, diagnose, or justify the removal of outliers, analogous to astronomers' 3σ or 5σ cuts in univariate cases. Classical textbooks on multivariate data analysis or classification/clustering have detailed information. One may also want to check the wiki site here.
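To make the analogy explicit, here is a minimal sketch (my addition, on simulated data) that computes the squared Mahalanobis distances defined above and flags outliers against a chi-square threshold, the multivariate counterpart of a univariate sigma cut:

```python
# Sketch: squared Mahalanobis distances D^2 and a chi-square-based outlier cut.
# The mean, covariance, and sample size are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0, 0.0],
                            cov=[[1.0, 0.5, 0.2],
                                 [0.5, 1.0, 0.3],
                                 [0.2, 0.3, 1.0]],
                            size=500)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))      # estimated inverse covariance
diff = X - xbar
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)    # squared Mahalanobis distances

# Under multivariate normality, D^2 approximately follows chi^2 with p d.o.f.
p = X.shape[1]
threshold = stats.chi2.ppf(0.997, df=p)             # roughly a "3-sigma" tail cut
outliers = np.where(d2 > threshold)[0]
print(len(outliers), "points flagged as outliers")
```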
Surprisingly, despite its popularity, the terminology itself is under-represented in ADS. A single paper in ApJS, and none in the other major astronomical journals, was found with the name Mahalanobis in its abstract.
ApJS, v.176, pp.276-292 (2008): The Palomar Testbed Interferometer Calibrator Catalog (van Belle et al)
Unfortunately, their description and usage of the Mahalanobis distance are quite ambiguous. See the quote:
The Mahalanobis distance (MD) is a multivariate generalization of one-dimensional Euclidean distance
They only showed the MD for measuring distances among uncorrelated variables/metrics and did not mention that the generalization is obtained from the covariance matrix.
Since standardization (generalization or normalization) is a pretty common practice, the lack of appearance in abstracts does not mean it is not used in astronomy. So I did a full-text search of A&A, ApJ, AJ, and MNRAS, which led me to four more publications containing the Mahalanobis distance. Quite a bit fewer than I expected.
The last two papers give the definition in the form I know it (p. 1114 of Zaritsky, Elston, and Hill's and p. 504 of Plets and Vynckier's).
Including the usages given in these papers, the Mahalanobis distance is popular in exploratory data analysis (EDA) for: 1. measuring distance, 2. outlier removal, and 3. checking the normality of multivariate data. Because it requires estimating the (inverse) covariance matrix, it shares tools with principal component analysis (PCA), linear discriminant analysis (LDA), and other methods involving Wishart distributions.
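The third use, checking multivariate normality, is often done with a chi-square Q-Q plot of the squared distances; a rough sketch on simulated data (my addition) follows:

```python
# Sketch: ordered squared Mahalanobis distances compared to chi^2_p quantiles.
# A roughly straight line along y = x supports multivariate normality.
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)
X = rng.multivariate_normal([0.0, 0.0, 0.0], np.eye(3), size=500)

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - xbar
d2 = np.einsum("ij,jk,ik->i", diff, S_inv, diff)

n, p = X.shape
probs = (np.arange(1, n + 1) - 0.5) / n
chi2_q = stats.chi2.ppf(probs, df=p)

plt.scatter(chi2_q, np.sort(d2), s=8)
plt.plot(chi2_q, chi2_q, color="red")   # reference line: perfect agreement with chi^2_p
plt.xlabel("chi-square quantiles (p d.o.f.)")
plt.ylabel("ordered squared Mahalanobis distances")
plt.show()
```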
By the way, the Wishart distribution is also under-represented in ADS: only one paper appeared in an abstract text search.
[2006MNRAS.372.1104P] Likelihood techniques for the combined analysis of CMB temperature and polarization power spectra (Percival, W. J.; Brown, M. L)
Lastly, I want to point out that estimating the covariance matrix and its inverse can be very challenging in various problems, which has led people to develop numerous algorithms, strategies, and applications. These mathematical and computer-scientific challenges and prescriptions are not presented in this post. Please be aware that estimating the (inverse) covariance matrix with real data is not as simple as it is presented here.
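As one example of such a prescription (my addition, not discussed in the papers above), shrinkage estimators such as Ledoit-Wolf give a well-conditioned covariance and precision matrix even when the plain sample covariance cannot be inverted:

```python
# Sketch: shrinkage estimation of the covariance on simulated "wide" data
# (fewer samples than variables, where the sample covariance is singular).
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 60))        # n = 40 samples, p = 60 variables: n < p

lw = LedoitWolf().fit(X)
Sigma_hat = lw.covariance_           # shrunk, well-conditioned covariance estimate
Omega_hat = lw.precision_            # its inverse (the precision matrix)
print("shrinkage coefficient:", lw.shrinkage_)
```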
I’m glad to see this week presented a paper that I had dreamed of many years ago, in addition to other interesting papers. Nowadays, I’m realizing more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from data repositories whose characteristics have been well known for many years (for example, the famous iris data; between toy data sets and mock catalogs, there is no shortage of data sets with well-known characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl’s dream. With a bit of my sentiment, I offer the list of this week:
A relevant slog post related to machine learning on galaxy morphology can be found at svm and galaxy morphological classification
< Added: 3rd week May 2008>[astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour
Although a quintessentially statistical notion, my impression is that PCA has always been more popular with non-statisticians. Of course we love to prove its optimality properties in our courses, and at one time the distribution theory of sample covariance matrices was heavily studied.
…but who could not feel suspicious when observing the explosive growth in the use of PCA in the biological and physical sciences and engineering, not to mention economics?…it became the analysis tool of choice of the hordes of former physicists, chemists and mathematicians who unwittingly found themselves having to be statisticians in the computer age.
My initial theory for its popularity was simply that they were in love with the prefix eigen-, and felt that anything involving it acquired the cachet of quantum mechanics, where, you will recall, everything important has that prefix.
He gave the following eigen-’s: eigengenes, eigenarrays, eigenexpression, eigenproteins, eigenprofiles, eigenpathways, eigenSNPs, eigenimages, eigenfaces, eigenpatterns, eigenresult, and even eigenGoogle.
How many miracles must one witness before becoming a convert?…Well, I’ve seen my three miracles of exploratory data analysis, examples where I found I had a problem, and could do something about it using PCA, so now I’m a believer.
Needless to say, astronomers explore data with PCA and utilize eigenvalues and eigenvectors to transform raw data into more interpretable forms.
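For completeness, a minimal sketch (my addition, on toy data) of that eigen-decomposition view of PCA:

```python
# Sketch: PCA via eigendecomposition of the sample covariance of a toy data matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 5))   # correlated toy data

Xc = X - X.mean(axis=0)                    # center the data
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # sort descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                      # principal component scores
explained = eigvals / eigvals.sum()        # fraction of variance per component
print("explained variance fractions:", np.round(explained, 3))
```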
Update: The poster is at CHASC.
Spectroscopic Surveys: Present by Yip, C. overviews recent spectroscopic sky surveys and spectral analysis techniques toward Virtual Observatories (VO). Besides spectroscopic redshift measurements increasing like Moore’s law, the surveys tend to go deeper and aim for completeness. Mainly elliptical galaxy formation has been studied, owing to their greater abundance compared to spirals, and the galactic bimodality in color-color or color-magnitude diagrams is the result of gas-rich mergers, with blue mergers forming the red sequence. Principal component analysis has incorporated ratios of emission line strengths for classifying Type II AGN and star-forming galaxies. Lyα identifies high-z quasars, and other spectral patterns over z reveal the history of the early universe and the characteristics of quasars. Also, the recent discovery of 10 satellites of the Milky Way is mentioned.
Spectral analyses take two approaches: one is the model-based approach using theoretical templates, known for its flaws but allowing straightforward extraction of physical parameters, and the other is the empirical approach, useful for making discoveries but difficult to interpret. Neither has a substantial advantage over the other. When it comes to fitting, chi-square minimization has been dominant, but new methodologies are under development. For spectral classification problems, principal component analysis (the Karhunen-Loève transformation), artificial neural networks, and other machine learning techniques have been applied.
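As a caricature of the chi-square minimization approach (my addition, with entirely made-up templates and data), one scales each template to the observed spectrum and keeps the one with the smallest chi-square:

```python
# Sketch: fit hypothetical template spectra to a fake observed spectrum with
# per-bin errors by chi-square minimization; all shapes and numbers are invented.
import numpy as np

rng = np.random.default_rng(5)
wave = np.linspace(4000.0, 7000.0, 300)                     # wavelength grid (Angstrom)
templates = {                                               # made-up template shapes
    "blue": np.exp(-(wave - 4500.0) ** 2 / 2e5),
    "red": np.exp(-(wave - 6500.0) ** 2 / 2e5),
}
err = 0.05 * np.ones_like(wave)
observed = 2.0 * templates["red"] + rng.normal(0.0, err)    # fake "observed" spectrum

best = None
for name, tmpl in templates.items():
    # analytic best-fit scale for a single template, then its chi-square
    scale = np.sum(observed * tmpl / err**2) / np.sum(tmpl**2 / err**2)
    chi2 = np.sum(((observed - scale * tmpl) / err) ** 2)
    if best is None or chi2 < best[2]:
        best = (name, scale, chi2)

print("best template: %s, scale %.2f, chi2/dof %.2f"
      % (best[0], best[1], best[2] / (wave.size - 1)))
```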
In the end, the author reports the statistical and astrophysical challenges posed by present-day massive spectroscopic data: 1. modeling galaxies, 2. parameterizing star formation history, 3. modeling quasars, 4. multi-catalog-based calibration (separating systematic and statistical errors), and 5. estimating parameters, all of which would be beneficial to the VO, whose objective is the unification of data access.