[ArXiv] classifying spectra

hlee — Fri, 23 Oct 2009 00:08:07 +0000

[arXiv:stat.ME:0910.2585]
Variable Selection and Updating In Model-Based Discriminant Analysis for High Dimensional Data with Food Authenticity Applications
by Murphy, Dean, and Raftery

Classifying or clustering spectra (or applying semi-supervised learning to them) is a very challenging problem, from collecting statistical-analysis-ready data to reducing the dimensionality without sacrificing the complex information in each spectrum. The challenge is not only how to estimate spiky (non-differentiable) curves via statistically well-defined procedures such as estimating equations, but also how to transform the data so that they satisfy the regularity conditions assumed in statistics.

Another reason that classifying and clustering astrophysical spectroscopic data is more difficult is that the observed lines, with their intensities and FWHMs on top of the continuum, are tied to atomic databases and to latent variables/hyperparameters (distance, rotation, absorption, column density, temperature, metallicity, types, system properties, etc.). Separating lines from each other and from the continuum frequently becomes a very challenging mixture problem (boundary and identifiability issues). These complexities appear only in astronomical spectroscopic data because we only get indirect or uncontrolled data ruled by physics, as opposed to the meat species spectra in the paper. Spectroscopic data outside astronomy are rather smooth, are observed in a controlled wavelength range, and need no correction for recession/radial velocity/redshift/extinction/lensing/etc.

Although the part most relevant to astronomers, i.e. spectroscopic data processing, is not discussed in this paper, the most important part, the application of statistical learning to complex curves (spectral data), is well described. Astronomers with appropriate data may want to try the variable selection strategy and check out the classification methods in statistics. If it works out, it might save storage space for spectral data and the time needed to collect high-resolution spectra. Please keep in mind that it is not necessary to use the same variable selection strategy. Astronomers can create better-working versions for classification and clustering purposes, like hardness ratios, which are often used to reduce the dimensionality of spectral data because low-total-count spectra are not informative over the full energy (wavelength) range. Curse of dimensionality!
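To make the hardness ratio idea concrete, here is a minimal sketch that collapses a binned spectrum into one number. It assumes one common convention, (H - S) / (H + S), where S and H are the total counts below and above a chosen energy; other conventions exist. The function name `hardness_ratio` and the `split` argument are hypothetical, and none of this comes from the paper.

```python
import numpy as np

def hardness_ratio(energies, counts, split):
    """Return (H - S) / (H + S) for one binned spectrum.

    energies : bin energies (any consistent unit)
    counts   : counts per bin, same length as energies
    split    : assumed boundary between the soft and hard bands
    """
    energies = np.asarray(energies, dtype=float)
    counts = np.asarray(counts, dtype=float)
    soft = counts[energies < split].sum()    # S: counts below the boundary
    hard = counts[energies >= split].sum()   # H: counts at or above it
    total = soft + hard
    if total == 0:
        return float("nan")                  # no counts, ratio undefined
    return (hard - soft) / total
```

A spectrum with many bins is thus reduced to a single summary in [-1, 1]. This plain ratio carries no error bar, which matters most for exactly the low-count spectra where hardness ratios are used.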


Astroinformatics

hlee — Mon, 13 Jul 2009 00:21:53 +0000

For approximately a decade, there have been journals dedicated to bioinformatics. In contrast, there is no comparable journal in astronomy, although astronomers have a long history of compiling a huge volume of catalogs and data archives. Prof. Bickel’s comments during his plenary lecture at the IMS-APRM, particularly on sparse matrices and the philosophical issues in choosing principal components, led me to wonder why astronomers do not discuss astroinformatics.

Nevertheless, I’ve noticed a few astronomers rigorously apply principal component analysis (PCA) in order to reduce the dimensionality of a data set. An evident example of PCA applications in astronomy is photo-z. In contrast to the wide application of PCA, almost no publications study its statistical adequacy by investigating the properties of the covariance matrix and its estimation method, particularly when it is sparse. Even worse, the notion of measurement errors is improperly implemented, since statisticians’ dimension reduction methodology has never confronted astronomers’ measurement errors. How to choose components is seldom discussed, since significance in a physics model rarely agrees with statistical significance. This disagreement often lengthens scientific writing in a way that is hard on readers. As a compromise, the statistical parts are omitted, which makes the publication feel incomplete to me.
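As a rough illustration of the purely statistical side of choosing components, the sketch below runs PCA on a matrix of spectra (rows are objects, columns are wavelength bins) via the singular value decomposition and reports the fraction of variance each retained component explains. The function name `pca_svd` is hypothetical, and the sketch deliberately ignores measurement errors, which is the very shortcoming described above.

```python
import numpy as np

def pca_svd(spectra, n_components):
    """PCA by SVD of the column-centered data matrix.

    Returns (scores, components, explained_variance_ratio), each
    restricted to the first n_components principal components.
    """
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                    # center each wavelength bin
    # Rows of Vt are orthonormal directions, ordered by singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s**2 / (X.shape[0] - 1)              # variance along each direction
    ratio = var / var.sum()                    # fraction of total variance
    components = Vt[:n_components]
    scores = Xc @ components.T                 # reduced representation
    return scores, components, ratio[:n_components]
```

A component that explains a large share of the variance is statistically prominent, but nothing in this computation says it corresponds to a physically meaningful feature.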

In wavelet multiscale imaging, which is easy to visualize via intuitive scales, the coarse-to-fine-scale approach and the assumption of independent noise make it possible to clean a noisy image and accentuate the features in it. Likewise, principal components and other dimension reduction methods in statistics capture certain features via transformed metrics and regularized, or penalized, objective functions. These features do not necessarily match the important features in astrophysics unless the likelihood function and selected priors match the physics models. To my knowledge, the astronomical literature that exploits PCA for dimension reduction in prediction rarely explains why PCA is chosen, how to compensate for sparsity in the covariance matrix, and other such questions, which are often major topics in bioinformatics. In the bioinformatics literature, these questions are explored to explain the particular selection of gene attributes or biomarkers for a given response, such as blood pressure or type of cancer. Instead of binning and chi-square minimization, statisticians explore strategies for compensating for sparsity in the data set to get unbiased best fits and correct error bars, based on assumptions and theory that match the data.

Luckily, there are efforts among some renowned astronomers to form an astroinformatics community. At the dawn of bioinformatics, genetic scientists were responsible for the bio part and statisticians were responsible for the informatics, until young scientists were educated enough to carry out bioinformatics by themselves. Observing this trend, partly from statistics conferences, created an urge in me: it is my responsibility to ponder why there has been a shortage of statisticians’ involvement in astronomy despite the plethora of catalogs and data archives with a long history. A few postings will follow on what I felt while working among astronomers. I hope this small bridging effort narrows the gap between the two communities. My personal wish is to see astroinformatics prosper like bioinformatics.


The AstroStat Slog » variable selection
