I’ve noticed that there are rapidly growing interests and attentions in data mining and machine learning among astronomers but the level of execution is yet rudimentary or partial because there has been no comprehensive tutorial style literature or book for them. I recently introduced a machine learning book written by an engineer. Although it’s a very good book, it didn’t convey the foundation of machine learning built by statisticians. In the quest of searching another good book so as to satisfy the astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theories, the first great book came along is The Elements of Statistical Learning. It was chosen for this writing not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but because of my personal story. In addition, the 2nd edition, which contains most up-to-date and state-of-the-art information, was released recently.
First, the book website:
The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman
You’ll find examples, R codes, relevant publications, and plots used in the text books.
Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting very famous people. Mine is shaking hands with President Clinton, in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit but the top of the line guards blocked my attempt of speaking/pointing icecream dripping with a finger afterward the hand shaking. No matter what context is, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my cherishing moment because of icecreaming dripping and scary bodyguards. My most cherishing moment of meeting famous people is the half an hour conversation with late Prof. Leo Breinman (click for my two postings about him), author of probability textbook, creator of CART, and the most forefront pioneer in machine learning.
The conclusion of that conversation was a book soon to be published after explaining him my ideas of applying statistics to astronomical data and his advices to each problems. I was not capable to understand every statistics so that his answer about this new coming book at that time was the most relevant and apt one.
This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA). Not long passed since I began my graduate study in statistics but had an opportunity to assist the conference organizer, my advisor Dr. Babu and to do some chores during the conference. By accident, I read the book by Murtagh about multivariate data analysis, so I wanted to speak to him. Except that, I have no desire to speak renown speakers and attendees. Frankly, I didn’t have any idea who’s who at the conference and a few years later, I realized that the conference dragged many famous people and the density of such people was higher than any conference I attended. Who would have imagine that I could have a personal conversation with Prof. Breiman, at that time. I have seen enough that many famous professors train people during conferences. Getting a chance for chatting some seconds are really hard and tall/strong people push someone small like me away always.
The story goes like this: a sunny perfect early summer afternoon, he was taking a break for a cigar and I finished my errands for the session. Not much to do until the end of session, I decided to take some fresh air and I spotted him enjoying his cigar. Only the worst was that I didn’t know he was the person of CART and the founder of statistical machine learning. Only from his talk from the previous session, I learned he was a statistician, who did data mining on galaxies. So, I asked him if I can join him and ask some questions related to some ideas that I have. One topic I wanted to talk about classification of SN light curves, by that time from astronomical text books, there are Type I & II, and Type I has subcategories, Ia, Ib, and Ic. Later, I heard that there is Type III. But the challenge is observations didn’t happen with equal intervals. There were more data mining topics and the conversation went a while. In the end, he recommended me a book which will be published soon.
Having such a story, a privilege of talking to late Prof. Breiman through an very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book, indeed, become popular, around that time, almost only book discussing statistical learning; therefore, it was an excellent textbook for introducing statistics to engineerers and machine learning to statisticians. In the mean time, statistical learning enjoyed popularity in many disciplines that have data sets and urging for learning with the aid of machine. Now books and journals on machine learning, data mining, and knowledge discovery (KDD) became prosperous. I was so delighted to see the 2nd edition in the market to bridge the gap over the years.
I thank him for sharing his cigar time, probably his short free but precious time for contemplation, with me. I thank his patience of spending time with such an ignorant girl with a foreign english accent. And I thank him for introducing a book which will became a bible in the statistical learning community within a couple of years (I felt proud of myself that I access the book before people know about it). Perhaps, astronomers cannot have many joys from this book that I experienced from how I encounter the book, who introduced the book, whether the book was used in a course, how often book is referred, etc. But I assure that it’ll narrow the gap in the notions how astronomers think about data mining (preprocessing, pipelining, and bulding catalogs) and how statisticians treat data mining. The newly released 2nd edition would help narrowing the gap further and assist astronomers to coin brilliant learning algorithms specific for astronomical data. [The END]
—————————– Here, I patch my scribbles about the book.
What distinguish this book from other machine learning books is that not only authors are big figures in statistics but also fundamentals of statistics and probability are discussed in all chapters. Most of machine learning books only introduce elementary statistics and probability in chapter 2, and no basics in statistics is discussed in later chapters. Generally, empirical procedures, computer algorithms, and their results without presenting basic theories in statistics are presented.
You might want to check the book’s website for data sets if you want to try some ideas described there
The Elements of Statistical Learning
In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers want to check out topics in the book. It’ll help to replace some data analysis methods in astronomy celebrating their centennials sooner or later with state of the art methods to cope with modern data.
This new edition reflects some evolutions in statistical learning whereas the first edition has been an excellent harbinger of the field. Pages quoted from the 2nd edition.
[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error e has E(e)=0 and is independent of X. Note that for this model, f(x)=E(Y|X=x) and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).
The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X,Y) will not have deterministic relationship Y=f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error e.
How statisticians envision “model” and “measurement errors” quite different from astronomers’ “model” and “measurement errors” although in terms of “additive error model” they are matching due to the properties of Gaussian/normal distribution. Still, the dilemma of hen or eggs exists prior to any statistical analysis.
[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.
Strongly recommend to read chapter 3, Linear Methods for Regression: In astronomy, there are so many important coefficients from regression models, from Hubble constant to absorption correction (temperature and magnitude conversion is another example. It seems that these relations can be only explained via OLS (ordinary least square) with the homogeneous error assumption. Yet, books on regressions and linear models are not generally thin. As much diversity exists in datasets, more amount of methodology, theory and assumption exists in order to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inference, say population proportion can be drawn to verify some hypotheses in cosmology in an indirect way. Understanding what regression analysis and assumptions and how statistician efforts made these methods more robust and interpretable, and reflecting reality would change forcing E(Y|X)=aX+b models onto data showing correlations (not causality).
]]>
![]() |
If you do not mind the time predictor, it is hard to believe that this is a light curve, time dependent data. At a glance, this data set represents a simple block design for the one-way ANOVA. ANOVA stands for Analysis of Variance, which is not a familiar nomenclature for astronomers.
Consider a case that you have a very long strip of land that experienced FIVE different geological phenomena. What you want to prove is that crop productivity of each piece of land is different. So, you make FIVE beds and plant same kind seeds. You measure the location of each seed from the origin. Each bed has some dozens of seeds, which are very close to each other but their distances are different. On the other hand, the distance between planting beds are quite far unable to say that plants in the test bed A affects plants in B. In other words, A and B are independent suiting for my statistical inference procedure by the F-test. All you need is after a few months, measuring the total weight of crop yield from each plant (with measurement errors).
Now, let’s look at the plot above. If you replace distance to time and weight to flux, the pattern in data collection and its statistical inference procedure matches with the one-way ANOVA. It’s hard to say this data set is designed for time series analysis apart from the complication in statistical inference due to measurement errors. How to design the statistical study with measurement errors, huge gaps in time, and unequal time intervals is complex and unexplored. It depends highly on the choice of inference methods, assumptions on error i.e. likelihood function construction, prior selection, and distribution family properties.
Speaking of ANOVA, using the F-test means that we assume residuals are Gaussian from which one can comfortably modify the model with additive measurement errors. Here I assume there’s no correlation in measurement errors and plant beds. How to parameterize the measurement errors into model depends on such assumptions as well as how to assess sampling distribution and test statistics.
Although I know this Crab nebula data set is not for the one-way ANOVA, the pattern in the scatter plot drove me to test the data set. The output said to reject the null hypothesis of statistically equivalent flux in FIVE time blocks. The following is R output without measurement errors.
Df Sum Sq Mean Sq F value Pr(>F)
factor 4 4041.8 1010.4 143.53 < 2.2e-16 ***
Residuals 329 2316.2 7.0
If the gaps are minor, I would consider time series with missing data next. However, the missing pattern does not agree with my knowledge in missing data analysis. I wonder how astronomers handle such big gaps in time series data, what assumptions they would take to get a best fit and its error bar, how the measurement errors are incorporated into statistical model, what is the objective of statistical inference, how to relate physical meanings to statistical significant parameter estimates, how to assess the model choice is proper, and more questions. When the contest is over, if available, I’d like to check out any statistical treatments to answer these questions. I hope there are scientists who consider similar statistical issues in these data sets by the INTEGRAL team.
]]>In addition to Bayesian methodologies, like this week’s astro-ph, studies on characterizing empirical spatial distributions of voids and galaxies frequently appear, which I believe can be enriched further with the ideas from stochastic geometry and spatial statistics. Click for what was appeared in arXiv this week.
A paper linked with an animation: [astro-ph:0802.3664] Dubinski and Farah
GRAVITAS: Portraits of a Universe in Motion (I wanted to but I couldn’t watch.)
Lecture notes on CMB: [astro-ph:0802.3688] Wayne Hu
Lecture Notes on CMB Theory: From Nucleosynthesis to Recombination
Subject : Special Joint CS and Physics Colloquium
Title : How Astronomers, Computer Scientists and Statisticians are working together to tackle hard problems in astronomy
Speaker: Pavlos Protopapas
Date : Thursday February 7
Time : 3:15 pm
Place : Nelson Auditorium, Anderson Hall (Click for the map, 200 College Ave, Medford, MA, I think)
Abstract:
New astronomical surveys such as Pan-STARRS and LSST are under development and will collect petabytes of data. These surveys will image large areas of sky repeatedly to great depth, and will detect vast numbers of moving, variably bright, and transient objects. The data product of these surveys is series of observations taken over time, or light-curves.
The IIC has established an inter-disciplinary Center for Time Series with an immediate focus on astronomy. I will present three research topics currently being pursued at the IIC that require expertise from astronomy, computer science and statistics. These are: identifying novel astronomical phenomena in large light-curve datasets, searching for rare phenomena such as extra-solar planets, and efficiently searching for significant events such as occultations of stars by small objects in the outer reaches of our solar system.
Pavlos Protopapas is a senior scientist at the IIC and Harvard-Smithsonian Center for Astrophysics. His research interests spans the outer solar system, extra-solar planets and gravitational lensing. He specializes in analyzing large collections of astronomical data, with a toolbox drawn from data-mining, computer science and statistics.
]]>This paper intend to publicize the largest data set of Gamma Ray Burst (GRB) X-ray afterglows (right curves after the event), which is available from http://www.asdc.asi.it. It is claimed to be a complete on-line catalog of GRB observed by two wide-Field Cameras on board BeppoSAX (Click for its Wiki) in the period of 1996-2002. It is comprised with 77 bursts and 56 GRBs with Xray light curves, covering the energy range 40-700keV. A brief introduction to the instrument, data reduction, and catalog description is given.
]]>