As part of introducing nonparametric statistics, I wanted to write about applications of computational geometry from the perspective of nonparametric two- and three-dimensional density estimation. Also, the following article came along when I had just begun to collect statistical applications in astronomy (my [ArXiv] series). This [arXiv] paper, in fact, prompted me to investigate Voronoi tessellations in astronomy in general.
[arxiv/astro-ph:0707.2877]
Voronoi Tessellations and the Cosmic Web: Spatial Patterns and Clustering across the Universe
by Rien van de Weygaert
Since then, quite some time has passed. In the meantime, I found more publications in astronomy specifically using tessellation as a main tool for nonparametric density estimation and for data analysis. Nonetheless, topics in spatial statistics generally tend to be unrecognized or almost ignored in analyzing astronomical spatial data (I mean data points with coordinate information); many studies seem to use statistics only partially or not at all. Some might want to know how often Voronoi tessellation is applied in astronomy. Here, I have listed results from my ADS search, restricted to papers with tessellation among the title keywords:
Then the topic was forgotten for a while, until this recent [arXiv] paper reminded me of my old intention to introduce tessellation for density estimation and for understanding large scale structures or clusters (astronomers’ jargon, not the term used in machine or statistical learning).
[arxiv:stat.ME:0910.1473] Moment Analysis of the Delaunay Tessellation Field Estimator
by M.N.M. van Lieshout
Looking at the plots in the papers by van de Weygaert or van Lieshout, one can immediately understand, without mathematical jargon or abstraction, what Voronoi and Delaunay tessellations are (a Delaunay tessellation is also called a Delaunay triangulation (wiki); perhaps you want to check out wiki:Delaunay Tessellation Field Estimator as well). Voronoi tessellations have been adopted in many scientific and engineering fields to describe spatial distributions, and astronomy is no exception; Voronoi tessellations have been used for field interpolation.
van de Weygaert described Voronoi tessellations as follows:
van Lieshout derived explicit expressions for the mean and variance of the Delaunay Tessellation Field Estimator (DTFE) and showed that, for stationary Poisson processes, the DTFE is asymptotically unbiased with a variance proportional to the squared intensity.
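For readers who want to see the estimator in action, here is a minimal 2-D sketch of the DTFE built on scipy.spatial.Delaunay. It follows the standard DTFE definition, where the density at a sample point is (d+1) divided by the volume of its contiguous Voronoi cell (the union of Delaunay simplices sharing that vertex); the toy Poisson-like sample and helper names are mine, not taken from van Lieshout's paper.

```python
# A minimal 2-D sketch of the Delaunay Tessellation Field Estimator (DTFE),
# assuming the standard definition: density at a sample point x_i is
# (d+1) divided by the total area of the Delaunay triangles sharing x_i.
# The point set and helper names are illustrative only.
import numpy as np
from math import factorial
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.uniform(0.0, 1.0, size=(500, 2))   # toy stationary Poisson-like sample
tri = Delaunay(points)
d = points.shape[1]                              # dimension, 2 here

def simplex_volume(vertices):
    """Area (2-D) / volume (3-D) of one simplex from its vertex coordinates."""
    mat = vertices[1:] - vertices[0]
    return abs(np.linalg.det(mat)) / factorial(len(mat))

# accumulate, for each point, the volume of its contiguous Voronoi cell
contiguous_volume = np.zeros(len(points))
for simplex in tri.simplices:
    vol = simplex_volume(points[simplex])
    contiguous_volume[simplex] += vol

# DTFE density estimate at each sample point
dtfe_density = (d + 1) / contiguous_volume
# boundary effects aside, the average should be roughly the true intensity
# (500 points per unit area)
print(dtfe_density.mean())
```

In practice the DTFE value is then interpolated linearly over each Delaunay simplex to obtain a continuous density field; the sketch above stops at the per-point estimates.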
We have observed voids and filaments of cosmic matter, patterns for which a full theory has yet to be discovered. In general, those patterns are manifested via observed galaxies, both directly and indirectly. Individual observed objects, I believe, can be matched to the points that generate Voronoi polygons; each point represents its polygon, and investigating its distributional properties helps us understand the formation rules and theories behind those patterns. For that matter, various topics in stochastic geometry, not just Voronoi tessellation, can probably be adopted.
There is a plethora of information available on Voronoi tessellations, such as the website of the International Symposium on Voronoi Diagrams in Science and Engineering; two recent meeting websites are ISVD09 and ISVD08. Also, the following review paper is interesting:
Centroidal Voronoi Tessellations: Applications and Algorithms (1999), by Du, Faber, and Gunzburger, SIAM Review, vol. 41(4), pp. 637-676
By the way, you may have noticed my preference for Voronoi tessellation over Delaunay, owing to the characteristic of the centroidal Voronoi tessellation that each observation is the center of its own Voronoi cell, as opposed to the property of the Delaunay triangulation that multiple simplices are associated with one observation/point. However, from the perspective of understanding the distribution of observations as a whole, both approaches offer summaries and insights in a nonparametric fashion, which is what I value most.
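For readers curious what a centroidal Voronoi tessellation looks like in practice, here is a minimal sketch of Lloyd's algorithm in 2-D, which iteratively moves each generator to the centroid of its cell. Cell centroids are approximated by a Monte Carlo sample of the domain so that no polygon clipping is needed; all names, sizes, and iteration counts are illustrative choices of mine, not from the Du, Faber, and Gunzburger review.

```python
# A minimal sketch of Lloyd's algorithm for a centroidal Voronoi tessellation
# on the unit square. Cell centroids are approximated by assigning a dense
# Monte Carlo sample to its nearest generator, which avoids clipping Voronoi
# polygons at the domain boundary. All parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
generators = rng.uniform(0.0, 1.0, size=(20, 2))    # initial generator points
sample = rng.uniform(0.0, 1.0, size=(100_000, 2))   # Monte Carlo sample of the domain

for _ in range(50):                                  # Lloyd iterations
    # assign each sample point to its nearest generator (this defines the cells)
    d2 = ((sample[:, None, :] - generators[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    # move each generator to the (approximate) centroid of its cell
    for k in range(len(generators)):
        members = sample[nearest == k]
        if len(members) > 0:
            generators[k] = members.mean(axis=0)

print(generators)   # generators now sit near the centroids of their cells
```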
From SINGS (Spitzer Infrared Nearby Galaxies Survey): Isn’t it a beautiful Hubble tuning fork?
As a first-year graduate student in statistics, because of the rumor that Prof. C. R. Rao would not be teaching much longer and because of his fame as perhaps the most famous statistician alive, I enrolled in his “multivariate analysis” class without thinking much. Everything was smooth and easy for him, and he had an incredible memory for equations and proofs. However, I only grasped intuitive concepts, such as why a method works, not the details of the mathematics, theorems, and their proofs. Instantly, I began to think about how these methods could be applied to astronomical data. After a few lessons, I desperately wanted to try out multivariate analysis methods to classify galactic morphology.
The dream died shortly afterward because there was no data set that could be properly fed into statistical classification methods. I spent quite some time searching astronomical databases, including ADS; this was before SDSS or VizieR became as popular as they are now. Then I thought about applying these methods to classify supernovae, because understanding the patterns of their light curves tells us a lot about the history of our universe (Type Ia SNe are standard candles) and because I knew of some publicly available SN light curves. Immediately, I realized that individual light curves are biased from a sampling perspective, and I did not know how to correct them for multivariate analysis. I also thought about applying multivariate analysis methods to stellar spectral types and to stars in different systems (single, binary, association, etc.). I thought about how to apply the newly learned methods to every astronomical object I had learned about, from sunspots to AGNs.
Regardless of the target objects to be scrutinized under this fascinating subject, “multivariate analysis,” two factors kept discouraging me. One was that I did not have enough training to develop, in a couple of weeks, new statistical models reflecting the unique statistical challenges embedded in the data: missing values, irregularities, non-iid structure, outliers, and other features that are hardly transcribed into a statistical setting. The other, which was more critical, was that there was no accessible astronomical database repository suited to statistical learning. Without deep knowledge of astronomy and trained skills for handling astronomical data, catalogs are generally useless; those catalogs and archived data sets are quite different from the intuitive data sets found in machine learning repositories.
Astronomers might think that analyzing toy/mock data sets is not scientific because it does not lead to any new discovery of the kind they are always making. From a data analyst’s viewpoint, however, scientific advances mean finding tools that summarize data in an optimal manner. As I argued in Astroinformatics, methods for retrieving information can be attempted and validated on well understood, astrophysically devastated data sets. The Pythagorean theorem was not proved only once; there are 39 different ways to prove it.
Seeing this nice poster image (the full-resolution 56 MB image is available from the link) brought back memories of my enthusiasm for applying statistical learning methods to better knowledge discovery. As you can see, there are many different types of galaxies, and often there is no clear boundary between them; consider classifying blurry galaxies by eye, where a spiral can be classified as an irregular, for example. Although I wish for automatic classification of these astrophysical objects, because of the difficulties in composing a training set for classification, or in collecting data from distinctive manifold groups for clustering, machine learning procedures are as complicated to develop as the complexity this tuning fork displays. The complex topology of astronomical objects seems to be the primary reason statistical learning applications are lacking compared to other fields.
Nonetheless, multivariate analysis can be useful for viewing relations from different perspectives, apart from known physics models. It may help to develop more finely tuned physics models by taking into account latent variables found through statistical learning. Such attempts, I believe, can assist astronomers in designing telescopes and in inventing efficient ways to collect and analyze data, by revealing which features are more significant than others for understanding the morphological shapes of galaxies, patterns in light curves, spectral types, and so on. When such experiences accumulate, different physical insights can kick in, just as scientists once scrambled and assembled galaxies into a tuning fork that led to the development of various evolution models.
To make a long story short, you have two choices: one, simply enjoy these beautiful pictures and appreciate the complexity of our universe; or two, let this picture of Hubble’s tuning fork inspire you toward advances in astroinformatics. Whichever path you choose, it is worth your time.
Title: Statistical Models: Theory and Practice (click for the publisher’s website)
My one-line review, or rather a comment, from several months ago was:
Bias in asymptotic standard errors is not a familiar topic for astronomers
and I don’t quite understand why I wrote it, but I think I came up with this comment owing to my pursuit of modeling the measurement errors that occur in astronomical research.
My overall impression of the book was that astronomers might not fancy it because the cited examples and models are quite irrelevant to astronomy. On the contrary, I liked it because it reflects what statistics ought to be in the real world of data analysis. This does not mean the book covers every bit of statistics. When you teach statistics, you do not expect a student’s learning curve to be continuous; you only hope that they jump the discontinuity points successfully, and you make every effort to lower the steps at those discontinuities. The book seemed to offer comfort to ease such efforts, or to hint at the promise of an almost continuous learning curve. The perspective and scope of the book were very impressive to me at the time.
It is sad to learn of brilliant-minded people passing away before their insights reach others who need them. I admire the professors at Berkeley, not only for their research activities and contributions but also for their pedagogical contributions to statistics and its applications to many fields, including astronomy (J. Neyman and E. Scott, for example, are as familiar to astronomers as to statisticians; their papers on the spatial distribution of galaxies are, to my knowledge, well sought after among astronomers).
me: Why Bayesian methods?
astronomers: Because Bayesian is robust. Because frequentist method is not robust.
I intentionally made the conversation short. Obviously, I did not ask every astronomer the same question, so this conversation does not reflect the opinion of all astronomers. Nevertheless, it summarizes what I felt at the CfA.
I was educated in the frequentist school, which I did not realize before I came to the CfA. There were a few Bayesian professors; although I did not take their Bayesian courses, I did take two of their classes, which had nothing to do with this bipartisanship (the contents were just foundations of statistics). However, I found that picking up ideas and learning brilliant algorithms from Bayesians was just as joyful as learning mature statistical theories from frequentists.
How did astronomers come to the idea that Bayesian statistics is robust and frequentist statistics is not? Do they think that the celebrated Gaussian distribution and the almighty chi-square methods compose the whole frequentist world? (Please note that, judging from astronomical publications, the F-test, LRT, K-S test, and PCA take up only a small fraction of astronomers’ statistics compared with chi-square methods, let alone Bayesian methods; no statistics can compete with chi-square methods in astronomy.) Is this why they think frequentist methods are not robust? The longer a method’s history, the more flaws one finds, so no one expects chi-square machinery to be a universal panacea; applying the chi-square everywhere looks like adding ever more epicycles. History shows that finding shortcomings makes us move forward, evolve, invent, and change paradigms, rather than simply declaring that chi-square (frequentist) methods are not robust. I don’t think we even spent much class time learning chi-square methods. There are many robust statistics that frequentists have developed; textbooks with “robust statistics” in their titles are most likely written by frequentists. Did astronomers check textbooks and journals before saying frequentist methods are not robust? I am curious how this bipartisanship has developed, especially one in which one party is favored and the other is despised yet blindly utilized in data analysis (probably I should feel relieved that there is no statistical dictatorship in the astronomical community, and exuberant about the efforts of a small number of scientists to balance the two parties).
Although I think more in a frequentist way, I do not object to Bayesian methods; it is no different from learning one’s mother tongue and culture. Often I feel excited by how Bayesians get over troubles that frequentists could not. If I may exaggerate, finding what frequentists have achieved but Bayesians have not yet, or the other way around, is similar to how changing the paradigm from a geocentric to a heliocentric universe explained the motions of the planets with simplicity, instead of adding ever more epicycles and complicating the description of those motions. I cherish results from both statistical cultures equally. What matters most in pursuing proper applications of statistics is satisfying simplicity and the fundamental laws, including probability theory, not the bipartisanship.
My next post will be about “Robust Statistics,” to rectify the notion of robustness that I acquired at the CfA. I would like to hear your thoughts, astronomer and statistician alike, on robustness as it relates to your statistical culture of choice. I can only write about robustness based on what I have read and been taught, which can also be biased. Perhaps other statisticians advocate the astronomers’ notion that Bayesian is robust and frequentist is not; with little communication with statisticians, it is difficult for me to gauge the general consensus. Equally, I do not know every astronomer’s thoughts on robustness. Nonetheless, I felt the notion of robustness differs between statisticians and astronomers, and this could generate some discussion.
Overall, I may sound like Joe Lieberman. But remember that it was the media who explicitly tossed him back and forth from one party to the other. People can be opinionated, but I am sure he pursued his best interests regardless of party.
I’m glad to see that this week presented a paper I had dreamed of many years ago, in addition to other interesting papers. Nowadays, I am realizing more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from repositories whose characteristics have been well known for many years (for example, the famous iris data; there is no shortage of toy data sets and mock catalogs with publicly known characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl’s dream. With a bit of my own sentiment, I offer this week’s list:
A relevant slog post on machine learning for galaxy morphology can be found at svm and galaxy morphological classification.
<Added: 3rd week of May 2008> [astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour
We are organizing a competition specifically targeting the statistics and computer science communities. The challenge is to measure cosmic shear at a level sufficient for future surveys such as the Large Synoptic Survey Telescope. Right now, we have stripped out most of the complex observational issues, leaving a pure statistical inference problem. The competition kicks off this summer, but we want to give prospective participants a chance to prepare.
The website www.great08challenge.info will provide continual updates on the competition.
Machine learning and statistical learning are becoming more and more popular in astronomy. Artificial Neural Networks (ANN) and Support Vector Machines (SVM) are rarely absent when classification on massive survey data is the objective. The authors provide a gentle tutorial on SVMs for galactic morphological classification; their source code, GALSVM, is linked for interested readers.
One of the biggest challenges in applying SVMs or other classification methods in astronomy is the quantification of measures, that is, how to define parameters and variables that are physically meaningful and machine interpretable at the same time. The authors of arxiv/astro-ph:0709.1359 followed the idea of Abraham et al. (1994), who introduced the concentration index. However, my impression so far is that standardized indices (like economic indicators) are rarely available for classification purposes in astronomy. An astronomical machine learning consortium would accelerate our understanding of many populations in the Universe.
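To make the SVM step concrete, here is a minimal scikit-learn sketch that classifies galaxies from hand-picked morphological indices. The feature names (concentration, asymmetry), the toy data, and the two-class setup are my own illustrative assumptions; this is not the GALSVM pipeline from the paper.

```python
# A minimal, hypothetical sketch of SVM-based morphological classification.
# The feature columns (concentration, asymmetry) and labels are made up for
# illustration; this is not the GALSVM code from the paper.
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# toy features: [concentration, asymmetry] for 200 "galaxies"
X_early = rng.normal([0.5, 0.05], [0.1, 0.02], size=(100, 2))   # early types: concentrated, symmetric
X_late = rng.normal([0.3, 0.15], [0.1, 0.05], size=(100, 2))    # late types: diffuse, asymmetric
X = np.vstack([X_early, X_late])
y = np.array([0] * 100 + [1] * 100)                              # 0 = early, 1 = late

# scale the features, then fit an RBF-kernel SVM
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold CV accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```

The hard part in practice is exactly the point made above: choosing features such as concentration that are both physically meaningful and machine interpretable, not the classifier itself.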
Without an optical afterglow, a galaxy within the 2 arcsecond error region of a GRB X-ray afterglow is identified as the host galaxy; however, confusion can arise because 1. the edge of a galaxy is diffuse, 2. multiple sources can exist within the 2 arcsecond error region, 3. the distance between the galaxy and the X-ray afterglow is measured in projection, and 4. lensing causes brightness increases and position shifts. In this paper, the authors “investigated the fields of 72 GRBs in order to examine the general issue of associations between GRBs and host galaxies.”
The authors raised some statistical issues concerning the matching of GRBs and host galaxies, but current knowledge and techniques seem to fall short of tackling the problem. Still, to prevent false discoveries, the authors proposed strategic studies of the following:
As multi-wavelength studies become more popular, this source matching issue across bands arises continually, and statistics can contribute to validating source matching methods. So far, however, those methods remain largely incomprehensible to statisticians.
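As a concrete illustration of the simplest positional cross-match underlying such associations (a naive baseline, not the authors' method), here is a sketch using astropy's SkyCoord: each X-ray afterglow position is matched to its nearest catalog galaxy and accepted only if the separation is under 2 arcseconds. The coordinates below are invented for illustration.

```python
# A minimal sketch of nearest-neighbour positional cross-matching with a
# 2-arcsecond tolerance, the naive version of GRB/host association.
# Coordinates below are invented; this is not the authors' procedure.
import astropy.units as u
from astropy.coordinates import SkyCoord

# hypothetical X-ray afterglow positions (RA, Dec in degrees)
afterglows = SkyCoord(ra=[150.1234, 201.4567] * u.deg,
                      dec=[2.2345, -10.9876] * u.deg)

# hypothetical galaxy catalog positions
galaxies = SkyCoord(ra=[150.12345, 150.2000, 201.45665] * u.deg,
                    dec=[2.23448, 2.3000, -10.98757] * u.deg)

# nearest catalog galaxy for each afterglow, with angular separation
idx, sep2d, _ = afterglows.match_to_catalog_sky(galaxies)

for i, (j, sep) in enumerate(zip(idx, sep2d)):
    if sep < 2 * u.arcsec:
        print(f"afterglow {i} -> galaxy {j} ({sep.to(u.arcsec):.2f})")
    else:
        print(f"afterglow {i}: no galaxy within 2 arcsec")
```

This hard-threshold matching is exactly where the statistical issues (diffuse galaxy edges, multiple candidates, projection, lensing) enter, since none of them are accounted for by a pure nearest-neighbour rule.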
Quantifying redshift is one of the key astronomical measurements, both for identifying the type of an object and for providing its distance. Typically, measuring redshifts requires spectral data, which are quite expensive in many respects compared to photometric data. Let me explain a little what spectral data and photometric data are, to help non-astronomers.
Collecting photometric data starts with taking pictures through different filters. Through blue, yellow, and red optical filters, or infrared, ultraviolet, and X-ray filters, objects look different (or have different light intensities), and various astronomical objects can be identified by investigating pictures from many filter combinations. On the other hand, collecting spectral data starts with dispersing light through a specially designed prism. Because of this light dispersion, it takes longer to collect light from an object, and fewer objects are recorded on a plate than when collecting photometric data. A nice feature of these expensive spectral data is that they provide the physical condition of the object directly: first, the distance, from the relative shifts of the spectral lines; second, abundance (the metallic composition of the object), temperature, and the type of the object, also from the spectral lines. Therefore, utilizing photometric data to infer measures normally available only from spectral data is a very attractive topic in astronomy.
However, there are many challenges. The massive volume of data and sampling biases*, such as the Malmquist bias (wiki) and the Lutz-Kelker bias, hinder traditional regression techniques, and numerous statistical and machine learning methods have been introduced to make the most of these photometric data in order to infer distances economically and quickly (a toy regression sketch follows the footnote below).
*((For a reference regarding these biases and astronomical distances, please check Distance Estimation in Cosmology by Hendry, M. A. and Simmons, J. F. L., Vistas in Astronomy, vol. 39, issue 3, pp. 297-314.))
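Here is a toy sketch of photometric redshift estimation framed as a regression problem: predict redshift from photometric colours with a k-nearest-neighbour regressor. The synthetic colour-redshift relation is invented purely for illustration and has no physical meaning, and this is not the method of any paper listed this week.

```python
# A toy sketch of photometric redshift estimation as a regression problem:
# predict redshift from photometric colours with a k-nearest-neighbour
# regressor. The synthetic colour-redshift relation below is invented purely
# for illustration and has no physical meaning.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
n = 5000
z = rng.uniform(0.0, 1.5, size=n)                       # "true" redshifts
# fake colours (u-g, g-r, r-i) loosely correlated with redshift, plus noise
colours = np.column_stack([
    1.0 + 0.8 * z + rng.normal(0, 0.1, n),
    0.5 + 0.5 * z + rng.normal(0, 0.1, n),
    0.2 + 0.3 * z + rng.normal(0, 0.1, n),
])

X_train, X_test, z_train, z_test = train_test_split(colours, z, test_size=0.3, random_state=0)
model = KNeighborsRegressor(n_neighbors=20).fit(X_train, z_train)
z_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(z_test, z_pred))
print(f"photo-z RMS error on the test set: {rmse:.3f}")
```

Real photo-z work must also handle the selection biases mentioned above, since a magnitude-limited training set is not representative of the fainter population one wants to predict for.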
Off topic, but worth noting:
1. They used AIC for model comparison. In spite of the many advocates for BIC, choosing AIC may do a better job for analyzing catalog data (399,929 galaxies), since with such a huge sample the penalty term in BIC will tend to select the most parsimonious model (a small numerical comparison of the two penalties is sketched after this list).
2. Although a more detailed discussion has not been posted, I would like to point out that photometric redshift studies are more or less regression problems. Whether they use sophisticated, up-to-date schemes such as support vector machines (SVM) and artificial neural networks (ANN), or classical regression methods, the goal of a photometric redshift study is to find the right predictors and the model built from those predictors. I hope there will be some studies using quantile regression, which has received much attention recently in economics.
3. Adaptive kernels were mentioned, and results from adaptive kernel regression are highly anticipated.
4. Comparing root mean square errors from various classification and regression models based on Sloan Digital Sky Survey (SDSS) data, from the EDR (Early Data Release) to DR5 (Data Release 5), might lead to misleading conclusions about the best regression/classification method because of the different sample sizes from EDR to DR5. Further formulation, especially of the asymptotic properties of these root mean square errors, would be very useful for making a legitimate comparison among different regression/classification strategies.
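Regarding the AIC versus BIC remark in item 1, here is a minimal numerical sketch of the point, using only the two standard formulas and the catalog size quoted in the post (no real data involved): with n = 399,929, the BIC penalty per extra parameter, ln n, is roughly 12.9, versus AIC's fixed 2, so BIC demands a much larger likelihood gain before accepting a richer model.

```python
# A minimal numerical illustration of the AIC/BIC penalty difference for a
# catalog-sized sample (n = 399,929, as quoted in the post). No real data,
# just the two standard formulas AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L.
import math

n = 399_929          # number of galaxies in the catalog
for k_extra in (1, 2, 5):
    aic_penalty = 2 * k_extra
    bic_penalty = k_extra * math.log(n)
    print(f"{k_extra} extra parameter(s): AIC penalty = {aic_penalty:.1f}, "
          f"BIC penalty = {bic_penalty:.1f}")
# ln(399929) ~ 12.9, so BIC requires the log-likelihood to improve by about
# 6.45 per extra parameter (versus 1 for AIC) before it prefers the larger model.
```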