The AstroStat Slog » data mining

[Book] The Elements of Statistical Learning, 2nd Ed.

hlee — Thu, 22 Jul 2010 13:25:44 +0000

This was written more than a year ago, and I forgot to post it.

I’ve noticed that there are rapidly growing interests and attentions in data mining and machine learning among astronomers but the level of execution is yet rudimentary or partial because there has been no comprehensive tutorial style literature or book for them. I recently introduced a machine learning book written by an engineer. Although it’s a very good book, it didn’t convey the foundation of machine learning built by statisticians. In the quest of searching another good book so as to satisfy the astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theories, the first great book came along is The Elements of Statistical Learning. It was chosen for this writing not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but because of my personal story. In addition, the 2nd edition, which contains most up-to-date and state-of-the-art information, was released recently.

First, the book website:

The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

You’ll find examples, R codes, relevant publications, and plots used in the text books.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting very famous people. Mine is shaking hands with President Clinton, in 2000. I still remember the moment vividly because I really wanted to tell him that ice cream was dripping on his nice suit but the top of the line guards blocked my attempt of speaking/pointing icecream dripping with a finger afterward the hand shaking. No matter what context is, shaking hands with one of the greatest presidents is a memorable thing. Yet it was not my cherishing moment because of icecreaming dripping and scary bodyguards. My most cherishing moment of meeting famous people is the half an hour conversation with late Prof. Leo Breinman (click for my two postings about him), author of probability textbook, creator of CART, and the most forefront pioneer in machine learning.

The conclusion of that conversation was a book soon to be published after explaining him my ideas of applying statistics to astronomical data and his advices to each problems. I was not capable to understand every statistics so that his answer about this new coming book at that time was the most relevant and apt one.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA). Not long passed since I began my graduate study in statistics but had an opportunity to assist the conference organizer, my advisor Dr. Babu and to do some chores during the conference. By accident, I read the book by Murtagh about multivariate data analysis, so I wanted to speak to him. Except that, I have no desire to speak renown speakers and attendees. Frankly, I didn’t have any idea who’s who at the conference and a few years later, I realized that the conference dragged many famous people and the density of such people was higher than any conference I attended. Who would have imagine that I could have a personal conversation with Prof. Breiman, at that time. I have seen enough that many famous professors train people during conferences. Getting a chance for chatting some seconds are really hard and tall/strong people push someone small like me away always.

The story goes like this: a sunny perfect early summer afternoon, he was taking a break for a cigar and I finished my errands for the session. Not much to do until the end of session, I decided to take some fresh air and I spotted him enjoying his cigar. Only the worst was that I didn’t know he was the person of CART and the founder of statistical machine learning. Only from his talk from the previous session, I learned he was a statistician, who did data mining on galaxies. So, I asked him if I can join him and ask some questions related to some ideas that I have. One topic I wanted to talk about classification of SN light curves, by that time from astronomical text books, there are Type I & II, and Type I has subcategories, Ia, Ib, and Ic. Later, I heard that there is Type III. But the challenge is observations didn’t happen with equal intervals. There were more data mining topics and the conversation went a while. In the end, he recommended me a book which will be published soon.

Having such a story, a privilege of talking to late Prof. Breiman through an very unique meeting, SCMA, before knowing the fame of the book, this book became one of my favorites. The book, indeed, become popular, around that time, almost only book discussing statistical learning; therefore, it was an excellent textbook for introducing statistics to engineerers and machine learning to statisticians. In the mean time, statistical learning enjoyed popularity in many disciplines that have data sets and urging for learning with the aid of machine. Now books and journals on machine learning, data mining, and knowledge discovery (KDD) became prosperous. I was so delighted to see the 2nd edition in the market to bridge the gap over the years.

I thank him for sharing his cigar time, probably his short free but precious time for contemplation, with me. I thank his patience of spending time with such an ignorant girl with a foreign english accent. And I thank him for introducing a book which will became a bible in the statistical learning community within a couple of years (I felt proud of myself that I access the book before people know about it). Perhaps, astronomers cannot have many joys from this book that I experienced from how I encounter the book, who introduced the book, whether the book was used in a course, how often book is referred, etc. But I assure that it’ll narrow the gap in the notions how astronomers think about data mining (preprocessing, pipelining, and bulding catalogs) and how statisticians treat data mining. The newly released 2nd edition would help narrowing the gap further and assist astronomers to coin brilliant learning algorithms specific for astronomical data. [The END]

—————————– Here, I patch my scribbles about the book.

What distinguish this book from other machine learning books is that not only authors are big figures in statistics but also fundamentals of statistics and probability are discussed in all chapters. Most of machine learning books only introduce elementary statistics and probability in chapter 2, and no basics in statistics is discussed in later chapters. Generally, empirical procedures, computer algorithms, and their results without presenting basic theories in statistics are presented.

You might want to check the book’s website for data sets if you want to try some ideas described there
The Elements of Statistical Learning
In addition to its historical footprint in the field of statistical learning, I’m sure that some astronomers want to check out topics in the book. It’ll help to replace some data analysis methods in astronomy celebrating their centennials sooner or later with state of the art methods to cope with modern data.

This new edition reflects some evolutions in statistical learning whereas the first edition has been an excellent harbinger of the field. Pages quoted from the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error e has E(e)=0 and is independent of X. Note that for this model, f(x)=E(Y|X=x) and in fact the conditional distribution Pr(Y|X) depends on X only through the conditional mean f(x).
The additive error model is a useful approximation to the truth. For most systems the input-output pairs (X,Y) will not have deterministic relationship Y=f(X). Generally there will be other unmeasured variables that also contribute to Y, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error e.

How statisticians envision “model” and “measurement errors” quite different from astronomers’ “model” and “measurement errors” although in terms of “additive error model” they are matching due to the properties of Gaussian/normal distribution. Still, the dilemma of hen or eggs exists prior to any statistical analysis.

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

Strongly recommend to read chapter 3, Linear Methods for Regression: In astronomy, there are so many important coefficients from regression models, from Hubble constant to absorption correction (temperature and magnitude conversion is another example. It seems that these relations can be only explained via OLS (ordinary least square) with the homogeneous error assumption. Yet, books on regressions and linear models are not generally thin. As much diversity exists in datasets, more amount of methodology, theory and assumption exists in order to reflect that diversity. One might like to study the statistical properties of these indicators based on mixture and hierarchical modeling. Some inference, say population proportion can be drawn to verify some hypotheses in cosmology in an indirect way. Understanding what regression analysis and assumptions and how statistician efforts made these methods more robust and interpretable, and reflecting reality would change forcing E(Y|X)=aX+b models onto data showing correlations (not causality).

More on Space Weather

hlee — Tue, 22 Sep 2009 17:03:11 +0000

Thanks to a Korean solar physicist^[1] I was able to gather the following websites and some relevant information on Space Weather Forecast in action, not limited to literature nor toy data.

Space Weather Research Lab at NJIT
SEEDS — Solar Eruptive Event Detection System at George Mason University.
CACTUS A software package for ‘Computer Aided CME Tracking
SRON in the Netherlands

These seem quite informative and I believe more statisticians and data scientists (signal and image processing, machine learning, computer vision, and data mining) easily collaborate with solar physicists. All the complexity, as a matter of fact, comes from data processing to be fed in to (machine, statistical) learning algorithms and defining the objectives of learning. Once settled, one can easily apply numerous methods in the field to these time varying solar images.

I’m writing this short posting because I finally found those interesting articles that I collected for my previous post on Space Weather. After finding them and scanning through, I realized that methodology-wise they only made baby steps. You’ll see a limited number key words are repeated although there is a humongous society of scientists and engineers in the knowledge discovery and data mining.

Note that the objectives of these studies are quite similar. They described machine learning for the purpose of automatizing the procedure of detecting features of interest of the Sun and possible forecasting relevant phenomena that affects our own atmosphere due to associated solar activities.

Automated Prediction of CMEs Using Machine Learning of CME – Flare Associations by Qahwaji et al. (2008) in Solar Phy. vol 248, pp.471-483.
Automatic Short-Term Solar Flare Prediction using Machine Learning and Sunspot Associations by Qahwaji and Colak (2007) in Solar Phy. vol. 241, pp. 195-211

Space weather is defined by the U.S. National Space Weather Probram (NSWP) as “conditions on the Sun and in the solar wind, magnetosphere, ionosphere, and thermosphere that can influence the performance and reliability of space-borne and ground-based technological systems and can endanger human life or health”

Personally thinking, the section of “jackknife” needs to be replaced with “cross-validation.”
Automatic Detection and Classification of Coronal Mass Ejections by Qu et al. (2006) in Solar Phy. vol. 237, pp.419-431.
Automatic Solar Filament Detection Using image Processing Techniques by Qu et al. (2005) in Solar Phy., vol. 228, pp. 119-135
Automatic Solar Flare Tracking Using Image-Processing Techniques by Qu, et al. (2004) in Solar Phy. vol. 222, pp. 137-149
Automatic Solar Flare Detection Using MLP, RBF, and SVM by Qu et al. (2003) in Solar Phy. vol. 217, pp.157-172. pp. 157-172

I’d like add a survey paper on another type of learning methods beyond Support Vector Machine (SVM) used in almost all articles above. Luckily, this survey paper happened to address my concern about the “practices of background subtraction” in high energy astrophysics.

A Survey of Manifold-Based Learning methods by Huo, Ni, Smith
[Excerpt] What is Manifold-Based Learning?
It is an emerging and promising approach in nonparametric dimension reduction. The article reviewed principle component analysis, multidimensional scaling (MDS), generative topological mapping (GTM), locally linear embedding (LLE), ISOMAP, Laplacian eigenmaps, Hessian eigenmaps, and local tangent space alignment (LTSA) Apart from these revisits and comparison, this survey paper is useful to understand the danger of background subtraction. Homogeneity does not mean constant background to be subtracted, often cause negative source observation.

More collaborations among multiple disciplines are desired in this relatively new field. For me, it is one of the best data and information scientific fields of the 21st century and any progress will be beneficial to human kind.

I must acknowledge him for his kindness and patience. He was my wikipedia to questions while I was studying the Sun.

[Book] Elements of Information Theory

hlee — Wed, 11 Mar 2009 17:04:26 +0000

by T. Cover and J. Thomas website: http://www.elementsofinformationtheory.com/

Once, perhaps more, I mentioned this book in my post with the most celebrated paper by Shannon (see the posting). Some additional recommendation of the book has been made to answer offline inquiries. And this book always has been in my favorite book list that I like to use for teaching. So, I’m not shy with recommending this book to astronomers with modern objective perspectives and practicality. Before advancing for more praises, I must say that those admiring words do not imply that I understand every line and problem of the book. Like many fields, Information theory has grown fast since the monumental debut paper by Shannon (1948) like the speed of astronomers observation techniques. Without the contents of this book, most of which came after Shannon (1948), internet, wireless communication, compression, etc could not have been conceived. Since the notion of “entropy“, the core of information theory, is familiar to astronomers (physicists), the book would be received better among them than statisticians. This book should be read easier to astronomers than statisticians.

My reason for recommending this book is that, personally thinking, having some knowledge in information theory (in data compression and channel capacity) would help to resolve limited bandwidth in the era of massive unprecedented astronomical survey projects with satellites or ground base telescopes.

The content can be viewed from the aspect of applied probability; therefore, the basics of probability theories including distributions and uncertainties become familiar to astronomers than indulging probability textbooks.

Many of my [MADS] series are motivated by the content of this book, where I learned many practical data processing ideas and objectives (data compression, data transmission, network information theory, ergodic theory, hypothesis testing, statistical mechanic, quantum mechanics, inference, probability theory, lossless coding/decoding, convex optimization, etc) although those [MADS] postings are not visible on the slog yet (I hope I can get most of them through within several months; otherwise, someone should continue my [MADS] and introducing modern statistics to astronomers). The most commonly practiced ideas in engineering could help accelerating the data processing procedures in astronomy and turning astronomical inference processes more efficient and consistent, which have been neglected because of many other demands. Here, I’d rather defer discussing details of particular topics from the book and describing how astronomers applied them (There are quite hidden statistical jewels from ADS but not well speculated). Through [MADS], I will discuss further more, how information theory could help processing astronomical data from data collecting, pipe-lining, storing, extracting, and exploring to summarizing, modeling, estimating, inference, and prediction. Instead of discussing topics of the book, I’d like to quote interesting statements in the introductory chapter of the book to offer delicious flavors and to tempt you for reading it.

… it [information theory] has fundamental contributions to make in statistical physics (thermodynamics), computer science (Kolmogorov complexity or algorithmic complexity), statistical inference (Occam’s Razor: The simplest explanation is best), and to probability and statistics (error exponents for optimal hypothesis testing and estimation).

… information theory intersects physics (statistical mechanics), mathematics (probability theory), electrical engineering (communication theory), and computer science (algorithmic complexity).

There is a pleasing complementary relationship between algorithmic complexity and computational complexity. One can think about computational complexity (time complexity) and Kolmogorov complexity (program length or descriptive complexity) as two axes corresponding to program running time and program length. Kolmogorov complexity focuses on minimizing along the second axis, and computational complexity focuses on minimizing along the first axis. Little work has been done on the simultaneous minimzation of the two.

The concept of entropy in information theory is related to the concept of entropy in statistical mechanics.

In addition to the book’s website, googling the title will show tons of links spanning from gambling/establishing portfolio to computational complexity, in between there are statistics, probability, statistical mechanics, communication theory, data compression, etc where the order does not imply relevance or importance of the subjects. Such broad notion is discussed in the intro chapter. If you have the book in your hand, regardless of their editions, you might first want to check Fig. 1.1 “Relationship of information theory to other fields” a diagram explaining connections and similarities among these subjects.

Data analysis tools, methods, algorithms, and theories including statistics (both exploratory data analysis and inference) should follow the direction of retrieving meaningful information from observations. Sometimes, I feel that priority is lost, ship without captain, treating statistics or information science as black box without any interests of knowing what’s in there.

I don’t know how many astronomy departments offer classes for data analysis, data mining, information theory, machine learning, or statistics for graduate students. I saw none from my alma matter although it offers the famous summer school recently. The closest one I had was computational physics, focusing how to solve differential equations (stochastic differential equations were not included) and optimization (I learned the game theory, unexpected. Overall, I am still fond of what I learned from that class). I haven’t seen any astronomy graduate students in statistics classes nor in EE/CS classes related to signal processing, information theory, and data mining (some departments offer statistics classes for their own students, like the course of experimental designs for students of agriculture science). Not enough educational efforts for the new information era and big survey projects is what I feel in astronomy. Yet, I’m very happy to see some apprenticeships to cope with those new patterns in astronomical science. I only hope it grows, beyond a few small guilds. I wish they have more resources to make their works efficient as time goes.

accessing data, easier than before but…

hlee — Tue, 20 Jan 2009 17:59:56 +0000

Someone emailed me for globular cluster data sets I used in a proceeding paper, which was about how to determine the multi-modality (multiple populations) based on well known and new information criteria without binning the luminosity functions. I spent quite time to understand the data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining globular cluster data sets was easy because of available data archives such as VizieR. Most data sets in charts/tables, I acquire those data from VizieR. In order to understand science behind those data sets, I check ADS. Well, actually it happens the other way around: check scientific background first to assess whether there is room for statistics, then search for available data sets.

However, if you are interested in massive multivariate data or if you want to have a subsample from a gigantic survey project, impossible all to be documented in contrast to those individual small catalogs, one might like to learn a little about Structured Query Language (SQL). With nice examples and explanation, some Tera byte data are available from SDSS. Instead of images in fits format, one can get ascii/table data sets (variables of million objects are magnitudes and their errors; positions and their errors; classes like stars, galaxies, AGNs; types or subclasses like elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, various spectral types, etc; estimated variables like photo-z, which is my keen interest; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under SDSS tag).

If you don’t want to limit yourself with ascii tables, you may like to check the quick guide/tutorial of Gator, which aggregated archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). Probably, you also want to check NED or NASA/IPAC Extragalactic Database. As of today, the website said, 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seem more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Probably, astronomers might wonder why I’m not advertising Chandra Data Archive (CDA) and its project oriented catalog/database. All I can say is that it’s not independent statistician friendly. It is very likely that I am the only statistician who tried to use data from CDA directly and bother to understand the contents. I can assure you that without astronomers’ help, the archive is just a hot potato. You don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it since It’s hard to resist after knowing what’s in there. Fortunately, there are other data scientist friendly archives that are quite less suffering compared to CDA. There are plethora things statisticians can do to improve astronomers’ a few decade old data analysis algorithms based on Gaussian distribution, iid assumption, or L₂ norm; and to reflect the true nature of data and more relaxed assumptions for robust analysis strategies than for traditionally pursued parametric distribution with specific models (a distribution free method is more robust than Gaussian distribution but the latter is more efficient) not just with CDA but with other astronomical data archives. The latter like vizieR or SDSS provides data sets which are less painful to explore with without astronomical software/package familiarity.

Computer scientists are well aware of UCI machine learning archive, with which they can validate their new methods with previous ones and empirically prove how superior their methods are. Statisticians are used to handle well trimmed data; otherwise we suggest strategies how to collect data for statistical inference. Although tons of data collecting and sampling protocols exist, most of them do not match with data formats, types, natures, and the way how data are collected from observing the sky via complexly structured instruments. Some archives might be extensively exclusive to the funded researchers and their beneficiaries. Some archives might be super hot potatoes with which no statistician wants to involve even though they are free of charges. I’d like to warn you overall not to expect the well tabulated simplicity of text book data sets found in exploratory data analysis and machine learning books.

Some one will raise another question why I do not speculate VOs (virtual observatories, click for slog postings) and Google Sky (click for slog postings), which I praised in the slog many times as good resources to explore the sky and to learn astronomy. Unfortunately, for the purpose of direct statistical applications, either VOs or Google sky may not be fancied as much as their names’ sake. It is very likely spending hours exploring these facilities and later you end up with one of archives or web interfaces that I mentioned above. It would be easier talking to your nearest astronomer who hopefully is aware of the importance of statistics and could offer you a statistically challenging data set without worries about how to process and clean raw data sets and how to build statistically suitable catalogs/databases. Every astronomer of survey projects builds his/her catalog and finds common factors/summary statistics of the catalog from the perspective of understanding/summarizing data, the primary goal of executing statistical analyses.

I believe some astronomers want to advertise their archives and show off how public friendly they are. Such advertising comments are very welcome because I intentionally left room for those instead of listing more archives I heard of without hands-on experience. My only wish is that more statisticians can use astronomical data from these archives so that the application section of their papers is filled with data from these archives. As if with sunspots, I wish that more astronomical data sets can be used to validate methodologies, algorithms, and eventually theories. I sincerely wish that this shall happen in a short time before I become adrift from astrostatistics and before I cannot preach about the benefits of astronomical data and their archives anymore to make ends meet.

There is no single well known data repository in astronomy like UCI machine learning archive. Nevertheless, I can assure you that the nature of astronomical data and catalogs bear various statistical problems and many of those problems have never been formulated properly towards various statistical inference problems. There are so many statistical challenges residing in them. Not enough statisticians bother to look these data because of the gigantic demands for statisticians from uncountably many data oriented scientific disciplines and the persistent shortage in supplies.

A Data Miner’s Story

hlee — Wed, 14 May 2008 05:20:59 +0000

Usama Fayyad (click the image to listen the lecture)

A Data Miner’s Story – Getting to Know the Grand Challenges

This talk was given as an award acceptance speech during KDD 2007 and sounded to aim for general pubic, who heard of Yahoo! My catch point was his overview on data mining, which seemed to be initiated from astronomy (his working at JPL was the opportunity for him to see the extraordinary capability of data mining in the real world to be applied in business modeling on top of extracting informations based on data). The whole story touched the fundamental of data mining.

A few times I attended talks at CfA because abstracts contain the phrase, data mining, but I always felt something is short. My impression from astronomers’ data ming talks is that they tend to focus only on collecting data, which is a small part of data mining although I understand collecting by itself is tremendously difficult. Usama Fayyad’s talk provided the answer to my skepticism as well as other great insights of data mining. Collecting data is not a small part from customer’s view points according to his evolutionary charts describing the real world data mining different from the scientific and technical side of data mining.

I thought astronomers particularly working hard on survey projects might be benefited from his pragmatic perspective of data mining. I hope at some point not just algorithms of collecting data but more embedded data mining tools to be exploited in future surveys and virtual observatory projects.

FYI, the description of the astronomy project that inspired him, called sky cat (I’m not sure) comes around 15 minutes and the talk starts around 6 minutes. The first 5 minutes or so were spent for introduction. And here is the link of another talk, which he pointed in this talk. Please, skip the first 8.5 minutes. You won’t regret it. Usama Fayyad’s talk is good but the opening is …

From Mining the Web to Inventing the New Sciences Underlying the Internet

On-line Machine Learning Lectures and Notes

hlee — Thu, 03 Jan 2008 18:44:14 +0000

I found this website a while ago but haven’t checked until now. They are quite useful by its contents (even pages of the lecture notes are properly flipped for you while the lecture is given). Increasing popularity of machine learning among astronomers will find more use of such lectures. If you have time to learn machine learning and other related subjects, please visit http://videolectures.net/. Specifically classified links to interesting subjects are found by your click.

Mathematics:
Mathematics>Operations Research (lectures by Gene Golub, Professor at Stanford and Lieven Vandenberghe, one of the authors of Convex Optimzation – a link to the pdf file)
Mathematics>Statistics (including Peter Bickel, Professor at UC Berkeley).

Physics:
Physics (You’ll see Randall Smith)

[In the near future, some selected lectures with summary note might be suggested; nevertheless, your recommendations are mostly welcome.]