The AstroStat Slog » cosmology

my first AAS. IV. clustering

hlee — Fri, 20 Jun 2008 03:42:06 +0000

Two attendees, whom I had met before the AAS, asked me whether I could suggest clustering methods relevant to their projects. In the end, we spent quite some time just clarifying the term clustering.

Statisticians and astronomers understand clustering differently:
- classification vs. clustering, or supervised learning vs. unsupervised learning: the former term in each pair indicates that the scientist already knows the types of the objects at hand. A photometry data set with an additional column saying star, galaxy, quasar, or unknown is a target for classification, or supervised learning. Simply put, classification means finding a rule based on photometric colors that distinguishes these types of objects. If there is no such column, but scatter plots (or plots after dimension reduction) show grouping patterns, the problem is clustering, or unsupervised learning, whose goal is to find hyperplanes that separate these clusters optimally. In other words, the objective of clustering is to answer the questions: are there real clusters, and if so, how many? Roughly speaking, the presence of an extra column of types is what differentiates classification from clustering.
- physical clustering vs. statistical clustering:
  Cosmologists and the like are interested in clusters or clumps of matter, particles, or objects. For astrophysicists, clusters are associated with the spatial evolution of the universe. Astronomers’ inquiries about clustering are therefore more likely to concern finding these spatial clumps statistically, which is a subject of stochastic geometry or spatial statistics. Statisticians and data analysts, on the other hand, like to investigate clusters in a reparameterized multi-dimensional space. The distances computed there do not follow the fundamental forces of physics (gravitation, EM, weak, and strong) but reflect relationships in the multi-dimensional space; for example, in a color-magnitude (CM) diagram, stars of a kind are grouped together. What the two communities agree on about clustering is that the number of clusters is unknown, so the plethora of classification methods cannot be applied, and that the objective of the study is to seek methodologies for quantifying clusters.
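To make the first distinction concrete, here is a minimal sketch in Python using only NumPy. Everything in it is an assumption made for illustration: the two-color "photometry" is synthetic, the two object types and their mean colors are invented, and the nearest-centroid rule and Lloyd's k-means are simply the most basic stand-ins for classification and clustering, not methods recommended by either community.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-color "photometry" for two hypothetical object types.
# The means and scatter are made up purely for illustration.
n = 200
stars = rng.normal(loc=[0.5, 0.2], scale=0.1, size=(n, 2))
quasars = rng.normal(loc=[0.1, 0.8], scale=0.1, size=(n, 2))
colors = np.vstack([stars, quasars])
labels = np.array([0] * n + [1] * n)  # the "extra column" of types


def nearest_centroid_fit(x, y):
    """Supervised: one centroid per known class label."""
    return np.array([x[y == k].mean(axis=0) for k in np.unique(y)])


def assign(x, centroids):
    """Assign each point to its nearest centroid (Euclidean distance)."""
    d = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)


def kmeans(x, k, n_iter=50, seed=0):
    """Unsupervised: Lloyd's k-means, which never sees the labels."""
    local_rng = np.random.default_rng(seed)
    centroids = x[local_rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        groups = assign(x, centroids)
        centroids = np.array([
            x[groups == j].mean(axis=0) if np.any(groups == j) else centroids[j]
            for j in range(k)
        ])
    return centroids, assign(x, centroids)


# Classification: the rule is learned from the label column.
class_centroids = nearest_centroid_fit(colors, labels)
predicted = assign(colors, class_centroids)
accuracy = (predicted == labels).mean()

# Clustering: labels are ignored, and the number of groups k must be chosen.
cluster_centroids, groups = kmeans(colors, k=2)
# Cluster indices are arbitrary, so compare with the labels up to a relabeling.
agreement = max((groups == labels).mean(), (groups != labels).mean())

print(f"classification accuracy: {accuracy:.3f}")
print(f"clustering agreement with hidden labels: {agreement:.3f}")
```

Note that k-means needs the number of clusters as an input, which is exactly the quantity that the clustering questions above treat as unknown.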
Astronomers’ clustering problems are either statistical classification (close to semi-supervised learning) or spatial statistics.
The way noisy clusters manifest in the universe, or the quantification of the current state of the matter distribution, leads to the very fundamentals of the birth of the universe, where spatial statistics can be a great partner. In the era of photometric redshifts, various classification techniques enhance the accuracy of prediction.
Astronomers’ testing of the reality of clusters seems limited: cosmology problems have been tackled as inverse problems. Based on theoretical cosmology models, simulations are performed and the results are transformed into some surrogate parameters. These surrogates are generally represented by smooth curves or straight lines in a plot where observations appear as points with bidirectional error bars (the so-called measurement errors). The cosmological model under test is then judged by a simple regression (correlation) or by eyeballing the observed data points. If the observations and the curve from a cosmological model match well in a 2D plot, the model is declared confirmed in the conclusion section. Personally, I think this procedure of testing cosmological models to account for clusters in the universe could be developed in a more statistically rigorous fashion than matching straight lines.
Challenges to statisticians in astronomy, measurement errors: in (statistical) learning, I believe, there has been no standard procedure for incorporating astronomers’ measurement errors into modeling. I think measurement errors are generally ignored because systematic errors are not recognized in statistics. In astronomy, on the other hand, the measurement errors accompanying the data are a crucial piece of information, particularly for verifying the significance of the observations. Often these measurement errors become the denominator in the χ² function, which is treated as following a χ² distribution to obtain best fits and confidence intervals.
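As an illustration of how measurement errors enter the χ² function as denominators, here is a minimal NumPy sketch of a straight-line fit. The data are synthetic and all numbers are invented; the sketch also assumes errors in y only, with known Gaussian σ_i, so it does not handle the bidirectional error bars mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: a straight line y = a + b x with known, heteroscedastic
# measurement errors sigma_i. All numbers are invented for illustration.
true_a, true_b = 1.0, 2.0
x = np.linspace(0.0, 10.0, 30)
sigma = rng.uniform(0.2, 1.0, size=x.size)
y = true_a + true_b * x + rng.normal(0.0, sigma)


def chi_square(params, x, y, sigma):
    """chi^2 = sum_i ((y_i - (a + b x_i)) / sigma_i)^2."""
    a, b = params
    return np.sum(((y - (a + b * x)) / sigma) ** 2)


def fit_line(x, y, sigma):
    """Minimize chi^2 analytically via weighted least squares."""
    w = 1.0 / sigma**2
    design = np.column_stack([np.ones_like(x), x])
    # Normal equations: (A^T W A) beta = A^T W y
    ata = design.T @ (w[:, None] * design)
    aty = design.T @ (w * y)
    beta = np.linalg.solve(ata, aty)
    cov = np.linalg.inv(ata)  # parameter covariance if the sigma_i are correct
    return beta, cov


beta, cov = fit_line(x, y, sigma)
chi2_min = chi_square(beta, x, y, sigma)
dof = x.size - 2  # data points minus fitted parameters

print(f"a = {beta[0]:.2f} +/- {np.sqrt(cov[0, 0]):.2f}")
print(f"b = {beta[1]:.2f} +/- {np.sqrt(cov[1, 1]):.2f}")
print(f"chi^2 / dof = {chi2_min:.1f} / {dof}")
```

Comparing the minimum χ² with the degrees of freedom is the usual goodness-of-fit check, and it is meaningful only if the σ_i really describe Gaussian scatter in the data.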

My personal lessons from two short discussions at the AAS: we need more collaboration between statisticians and astronomers to include measurement errors in classification or semi-supervised learning, particularly now that we enjoy a plethora of data sets, and we need better aid from statisticians for testing and verifying the existence of clusters beyond fitting a straight line.

[ArXiv] 2nd week, June 2008

hlee — Mon, 16 Jun 2008 14:47:42 +0000

As Prof. Speed said, PCA is prevalent in astronomy, particularly this week. Furthermore, a paper explicitly discusses R, a popular statistics package.

[astro-ph:0806.1140] N.Bonhomme, H.M.Courtois, R.B.Tully
Derivation of Distances with the Tully-Fisher Relation: The Antlia Cluster
(The Tully-Fisher relation is well known and is one of many occasions where statistics could help. On the other hand, astronomical biases as well as measurement errors hinder collaboration.)
[astro-ph:0806.1222] S. Dye
Star formation histories from multi-band photometry: A new approach (Bayesian evidence)
[astro-ph:0806.1232] M. Cara and M. Lister
Avoiding spurious breaks in binned luminosity functions
(I think that binning is not always necessary and is overused, although alternatives exist.)
[astro-ph:0806.1326] J.C. Ramirez Velez, A. Lopez Ariste and M. Semel
Strength distribution of solar magnetic fields in photospheric quiet Sun regions (PCA was utilized)
[astro-ph:0806.1487] M.D.Schneider et al.
Simulations and cosmological inference: A statistical model for power spectra means and covariances
(They used R and its Latin hypercube sampling package, lhs.)
[astro-ph:0806.1558] Ivan L. Andronov et al.
Idling Magnetic White Dwarf in the Synchronizing Polar BY Cam. The Noah-2 Project (PCA is applied)
[astro-ph:0806.1880] R. G. Arendt et al.
Comparison of 3.6 – 8.0 Micron Spitzer/IRAC Galactic Center Survey Point Sources with Chandra X-Ray Point Sources in the Central 40×40 Parsecs (K-S test)

[ArXiv] Ripley’s K-function

hlee — Tue, 22 Apr 2008 03:56:33 +0000

Because of the extensive work by Prof. Peebles and many (observational) cosmologists (I almost always find Prof. Peebles’ book cited in the cosmology literature), the 2 (or 3) point correlation function is far more dominant than any other mathematical or statistical method for understanding the structure of the universe. Unusually, this week brings an astro-ph paper written by a statistics professor that uses the K-function to explore the mystery of the universe.

[astro-ph:0804.3044] J.M. Loh
Estimating Third-Order Moments for an Absorber Catalog

Instead of going into the detailed contents, which I leave to the readers, I’d rather cite a few key points without math symbols. The script K denotes the 3rd order K-function, from which the three-point and reduced three-point correlation functions are derived. The benefits of using the script K function over these correlation functions are described in terms of bin size and edge correction. Yet the author does not encourage using the script K function alone, but rather using all the tools. Also, the paper mentions that computing third or higher order measures of clustering has become feasible thanks to larger datasets and advances in computing. In the appendix, the unbiasedness of the estimator of the script K is proved.

The reason for bringing up the K-function comes from my early experience in learning statistics. My memory of learning the 2 point correlation function in an undergraduate cosmology class is very vague, but the basic idea of modeling this function gave me an epiphany during a spatial statistics class several years ago, when Ripley’s K-function was introduced. I vividly remember setting up my own project to use the K-function to characterize the spatial distribution of GRBs. I chose GRBs instead of galaxies because 1. I was able to find the data set on the internet on my own (the BATSE catalog; astronomers may think accessing data archives is easy, but statistics students were generally not aware that astronomical data sets are available via the internet, and for data sets they depend heavily on data providers or clients), and 2. I recalled a paper by Professors Efron and Petrosian (1995, ApJ, 449:215-223, Testing Isotropy versus Clustering of Gamma-ray Bursts), who utilized the nearest neighbor approach. After a few weeks, I made another discovery: people had found GRB redshifts and begun to understand the cosmological origin of GRBs more deeply. In other words, 2D spatial statistics was not the way to find the origins of GRBs. Because of a few shortcomings, one of which was the latitude-dependent observation of BATSE (as a second year graduate student, I had not yet encountered the ideas of censoring and truncation), I discontinued my personal project, discouraged that I could not make any contribution (data themselves, such as the discovery of distances, speak louder than statistical inferences made without distances).

I was delighted to see the work by Prof. Loh on Ripley’s K function. Those curious about the K function may check the book by Martinez and Saar, Statistics of the Galaxy Distribution. Many statistical publications are also available under spatial statistics and point processes, which include Ripley’s K function.
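For readers who have not met Ripley’s K function, here is a minimal NumPy sketch of the ordinary second-order K for a 2D point pattern. It is not the third-order script K of Loh’s paper, it applies no edge correction (so it is biased low near the boundary), and both point patterns are synthetic and invented for illustration. Under complete spatial randomness in the plane, K(r) = πr², so values well above πr² indicate clustering at scale r.

```python
import numpy as np


def ripley_k(points, radii, area):
    """Naive estimate of Ripley's K for a 2D point pattern (no edge correction).

    K(r) = area / (n (n - 1)) * number of ordered pairs (i != j) with d_ij <= r.
    """
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=2))
    pair_dist = dist[~np.eye(n, dtype=bool)]  # drop self-pairs, keep ordered pairs
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    return area * counts / (n * (n - 1))


rng = np.random.default_rng(2)
radii = np.array([0.02, 0.05, 0.1])

# Complete spatial randomness: uniform points in the unit square.
uniform = rng.uniform(0.0, 1.0, size=(500, 2))

# A clustered pattern: offspring scattered around a few hypothetical parents.
parents = rng.uniform(0.2, 0.8, size=(10, 2))
clustered = np.repeat(parents, 50, axis=0) + rng.normal(0.0, 0.02, size=(500, 2))

k_uniform = ripley_k(uniform, radii, area=1.0)
k_clustered = ripley_k(clustered, radii, area=1.0)
k_csr = np.pi * radii**2  # theoretical value under complete spatial randomness

for r, ku, kc, k0 in zip(radii, k_uniform, k_clustered, k_csr):
    print(f"r = {r:.2f}: uniform {ku:.4f}, clustered {kc:.4f}, pi r^2 {k0:.4f}")
```

The pairwise distance matrix makes this O(n²) in memory, which is fine for a few thousand points but not for a survey-sized catalog.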

[ArXiv] 1st week, Mar. 2008

hlee — Fri, 07 Mar 2008 23:01:56 +0000

Irrelevant to astrostatistics but interesting for baseball lovers.
[stat.AP:0802.4317] Jensen, Shirley, & Wyner
Bayesball: A Bayesian Hierarchical Model for Evaluating Fielding in Major League Baseball

With the 5th year WMAP data release, there were many WMAP related papers; the most statistical among them are listed here. WMAP specific/related:

[astro-ph:0803.0586] J. Dunkley et al.
Five-Year Wilkinson Microwave Anisotropy Probe (WMAP) Observations: Likelihoods and Parameters from the WMAP data (likelihoods)
[astro-ph:0803.0715] B. Gold et al.
Five-Year Wilkinson Microwave Anisotropy Probe (WMAP) Observations: Galactic Foreground Emission (MCMC)
[astro-ph:0803.0889] Ichikawa, Sekiguchi, & Takahashi
Probing the Effective Number of Neutrino Species with Cosmic Microwave Background

And others:

[astro-ph:0802.4464] M. Sahlén et al.
The XMM Cluster Survey: Forecasting cosmological and cluster scaling-relation parameter constraints
[astro-ph:0803.0918] J.M. Colberg et al.
The Aspen–Amsterdam Void Finder Comparison Project (TFE, tessellation field estimator)
[astro-ph:0803.0885] J. Ballot et al.
On deriving p-mode parameters for inclined solar-like stars (MLE, maximum likelihood estimator)

By the way, I noticed that [astro-ph:0802.4464] used Monte Carlo Markov Chain, whereas [astro-ph:0803.0715] used Markov chain Monte Carlo. Do they mean different things, or is the former a typo?

[ArXiv] 2nd week, Dec. 2007

hlee — Fri, 14 Dec 2007 21:16:47 +0000

No shortage in papers~

[astro-ph:0712.1038]
Extended Anomalous Foreground Emission in the WMAP 3-Year Data G. Dobler and D. P. Finkbeiner
[astro-ph:0712.1217]
Generalized statistical models of voids and hierarchical structure in cosmology A. Z. Mekjian
[astro-ph:0712.1155]
The colour-lightcurve shape relation of Type Ia supernovae and the reddening law S. Nobili and A. Goobar
[astro-ph:0712.1297]
The Structure of the Local Supercluster of Galaxies Revealed by the Three-Dimensional Voronoi’s Tessellation Method O. V. Melnyk, A. A. Elyiv, and I. B. Vavilova
[astro-ph:0712.1594]
Photometric Redshifts with Surface Brightness Priors H. F. Stabenau, A. Connolly and B. Jain
[stat.ME:0712.1663]
Efficient Blind Search: Optimal Power of Detection under Computational Cost Constraints N. Meinshausen, P. Bickel and J. Rice
[astro-ph:0712.1917]
Are solar cycles predictable? M. Schuessler

Voronoi tessellation for nonparametric density estimation (mass distribution in the universe) interests me very much. If you are working on the topic, would you kindly share useful information or write your thoughts on the subject here?

Photometric Redshifts

hlee — Wed, 25 Jul 2007 06:28:40 +0000

Since I began subscribing to arxiv/astro-ph abstracts, one of the most frequent topics, from an astrostatistical point of view, has been photometric redshifts. The topic has grown popular as catalogs of remote photometric observations multiply in volume and multi-band sky survey projects lead to virtual observatories (VO; to be discussed in a later posting). Just searching for photometric redshifts in google scholar and arxiv.org returns more than 2000 articles since 2000.

Redshift is one of the key astronomical measures, used both to identify the type of an object and to provide its distance. Typically, measuring redshifts requires spectral data, which are quite expensive in many respects compared to photometric data. Let me briefly explain what spectral data and photometric data are, for the benefit of non-astronomers.

Collecting photometric data starts with taking pictures through different filters. Through blue, yellow, and red optical filters, or infrared, ultraviolet, and X-ray filters, objects look different (that is, they have different light intensities), and various astronomical objects can be identified by investigating pictures taken with many filter combinations. Collecting spectral data, on the other hand, starts with dispersing light through a specially designed prism. Because of this dispersion, it takes longer to collect light from an object, and fewer objects are recorded on a picture plate than with photometric data. A nice feature of these expensive spectral data is that they provide the physical condition of the object directly: first, the distance, from the relative shifts of spectral lines; second, the abundance (the metallic composition of the object), temperature, and type of the object, also from spectral lines. Therefore, utilizing photometric data to infer measures normally available from spectral data is a very attractive topic in astronomy.
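The "relative shift of spectral lines" can be written as z = (λ_observed − λ_rest) / λ_rest. A minimal Python sketch follows; the observed wavelength is a hypothetical value chosen for illustration, and the conversion to distance uses the low-redshift Hubble law d ≈ cz/H0 with an assumed H0 = 70 km/s/Mpc, which is only a rough approximation valid for z much smaller than 1.

```python
# Rest wavelength of the H-alpha line (in air), in angstroms.
H_ALPHA_REST = 6562.8
SPEED_OF_LIGHT_KM_S = 299792.458
H0_KM_S_MPC = 70.0  # assumed value of the Hubble constant, for illustration only


def redshift(observed_wavelength, rest_wavelength):
    """Relative spectral line shift: z = (observed - rest) / rest."""
    return (observed_wavelength - rest_wavelength) / rest_wavelength


def hubble_distance_mpc(z, h0=H0_KM_S_MPC):
    """Approximate distance in Mpc from d = c z / H0 (valid only for z << 1)."""
    return SPEED_OF_LIGHT_KM_S * z / h0


# Hypothetical observation: H-alpha detected at 6890.94 angstroms.
observed = 6890.94
z = redshift(observed, H_ALPHA_REST)
distance = hubble_distance_mpc(z)

print(f"z = {z:.4f}")
print(f"approximate distance = {distance:.0f} Mpc")
```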

However, there are many challenges. The massive volume of data and sampling biases*, such as Malmquist bias and Lutz-Kelker bias, hinder traditional regression techniques, so numerous statistical and machine learning methods have been introduced to make the most of these photometric data and infer distances economically and quickly.

*(For a reference on these biases and astronomical distances, see Distance Estimation in Cosmology by Hendry, M. A. and Simmons, J. F. L., Vistas in Astronomy, vol. 39, Issue 3, pp. 297-314.)
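As one example of the machine learning methods mentioned above, here is a minimal NumPy sketch of a k-nearest-neighbor photometric redshift estimator. It rests entirely on assumptions: the color-redshift relation in `fake_colors` is invented, the "spectroscopic" training set is simulated, and no sampling bias such as Malmquist bias is modeled, so it shows only the mechanics of learning redshifts from colors.

```python
import numpy as np

rng = np.random.default_rng(3)


def fake_colors(z, rng):
    """Hypothetical, made-up relation between redshift and three colors."""
    base = np.column_stack([
        0.5 + 1.2 * z,
        0.3 + 0.8 * z**2,
        0.1 + 0.5 * np.sin(3.0 * z),
    ])
    return base + rng.normal(0.0, 0.05, size=base.shape)  # photometric noise


# Training set: objects with both colors and (simulated) spectroscopic redshifts.
z_train = rng.uniform(0.0, 1.0, size=2000)
x_train = fake_colors(z_train, rng)

# Test set: objects for which only colors would normally be available.
z_test = rng.uniform(0.0, 1.0, size=200)
x_test = fake_colors(z_test, rng)


def knn_predict(x_train, z_train, x_query, k=10):
    """Average the redshifts of the k nearest training objects in color space."""
    d = np.linalg.norm(x_query[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return z_train[nearest].mean(axis=1)


z_phot = knn_predict(x_train, z_train, x_test, k=10)
rms = np.sqrt(np.mean((z_phot - z_test) ** 2))
print(f"RMS error of photometric redshifts: {rms:.3f}")
```

With real surveys the training set is usually brighter and nearer than the objects to be predicted, which is where the sampling biases discussed above start to matter.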

[ArXiv] Random Matrix, July 13, 2007

hlee — Mon, 16 Jul 2007 17:30:23 +0000

From arxiv/astro-ph:0707.1982v1,
Nflation: observable predictions from the random matrix mass spectrum by Kim and Liddle

To my knowledge, random matrices attracted statisticians’ interest fairly recently, and SAMSI (Statistical and Applied Mathematical Sciences Institute) offered a semester-long program on High Dimensional Inference and Random Matrices (tutorials and lecture notes can be found) during Fall 2006. However, my knowledge is too limited to comment on or critique Kim and Liddle’s paper. Clearly, nonetheless, this paper is not about random matrix theory itself but about its straightforward application to the viability of a cosmological model.

A. Liddle has published papers on theoretical cosmology from a statistical model-based approach (the ones I’ve seen are most likely related to statistical model selection). Personally, I like his book, An Introduction to Modern Cosmology (2nd ed. ISBN 0-470-84835-9), which might be useful to statisticians who wish to work with cosmologists.

[ArXiv] A Lecture Note, June 17, 2007

hlee — Mon, 18 Jun 2007 19:06:55 +0000

From arxiv/astro-ph:0706.1988,
Lectures on Astronomy, Astrophysics, and Cosmology looks helpful to statisticians who would like to learn about astronomy, astrophysics, and cosmology. The lecture note starts by introducing the fundamentals of astronomy, UNITS!!!, and its history. It also explains astronomical measures such as distances and their units, luminosity, and temperature; the HR diagram (astronomers’ summary diagram); stellar evolution; and relevant topics in cosmology. At least a third of the article will be useful for grasping a rough idea of astronomy as a scientific subject beyond colorful pictures. Statisticians who are keen on cosmology are recommended to read further.

This is not a high energy lecture note; therefore, statisticians interested in high energy are encouraged to visit Astro Jargon for Statisticians and CHASC.