The AstroStat Slog » hlee http://hea-www.harvard.edu/AstroStat/slog Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[Book] The Elements of Statistical Learning, 2nd Ed. http://hea-www.harvard.edu/AstroStat/slog/2010/book-the-elements-of-statistical-learning-2nd-ed/ Thu, 22 Jul 2010, hlee

This was written more than a year ago, and I forgot to post it.

I’ve noticed rapidly growing interest and attention in data mining and machine learning among astronomers, but the level of execution is still rudimentary or partial, because there has been no comprehensive, tutorial-style book for them. I recently introduced a machine learning book written by an engineer; although it is a very good book, it does not convey the foundations of machine learning built by statisticians. In the quest for another good book to satisfy astronomers’ pursuit of (machine) learning methodology with the proper amount of statistical theory, the first great book that came along is The Elements of Statistical Learning. It was chosen for this post not only because of its fame and its famous authors (Hastie, Tibshirani, and Friedman) but also because of a personal story. In addition, the 2nd edition, which contains the most up-to-date, state-of-the-art material, was released recently.

First, the book website:

The Elements of Statistical Learning by Hastie, Tibshirani, and Friedman

You’ll find examples, R code, relevant publications, and the plots used in the textbook.

Second, I want to tell how I learned about this book before its first edition was published. Everyone has a small moment of meeting someone very famous. Mine was shaking hands with President Clinton in 2000. I still remember the moment vividly, because I really wanted to tell him that ice cream was dripping on his nice suit, but the bodyguards blocked my attempt to speak and point at the dripping ice cream after the handshake. Whatever the context, shaking hands with one of the greatest presidents is memorable. Yet it is not my most cherished moment, thanks to the dripping ice cream and the scary bodyguards. My most cherished moment of meeting someone famous is the half-hour conversation with the late Prof. Leo Breiman (click for my two postings about him), author of a probability textbook, creator of CART, and one of the foremost pioneers of machine learning.

The conclusion of that conversation, after I explained my ideas for applying statistics to astronomical data and he gave his advice on each problem, was a book soon to be published. I was not capable of understanding all the statistics involved, so his answer about this forthcoming book was, at that time, the most relevant and apt one.

This conversation happened during the 3rd Statistical Challenges in Modern Astronomy (SCMA) conference. Not long after I began my graduate study in statistics, I had an opportunity to assist the conference organizer, my advisor Dr. Babu, and to do some chores during the conference. By accident I had read Murtagh’s book on multivariate data analysis, so I wanted to speak to him; beyond that, I had no desire to approach the renowned speakers and attendees. Frankly, I had no idea who was who at the conference; only a few years later did I realize that it drew many famous people, at a density higher than any conference I have attended since. Who would have imagined, at that time, that I could have a personal conversation with Prof. Breiman? I have seen often enough that famous professors are trailed by crowds at conferences: getting a chance to chat for even a few seconds is really hard, and tall, strong people always push someone small like me away.

The story goes like this: on a perfect, sunny early-summer afternoon, he was taking a break for a cigar and I had finished my errands for the session. With not much to do until the end of the session, I decided to take some fresh air, and I spotted him enjoying his cigar. The only catch was that I didn’t know he was the person behind CART and a founder of statistical machine learning; from his talk in the previous session I knew only that he was a statistician who did data mining on galaxies. So I asked him if I could join him and ask some questions related to ideas I had. One topic I wanted to discuss was classification of SN light curves: by that time, according to astronomy textbooks, there were Type I & II, and Type I had subcategories Ia, Ib, and Ic (later I heard there is a Type III). The challenge is that the observations are not taken at equal intervals. There were more data mining topics, and the conversation went on for a while. In the end, he recommended a book that would be published soon.

Having such a story, the privilege of talking with the late Prof. Breiman at a very unique meeting, SCMA, before knowing the book’s fame, this book became one of my favorites. The book did indeed become popular; around that time it was almost the only book discussing statistical learning, and therefore it was an excellent textbook for introducing statistics to engineers and machine learning to statisticians. In the meantime, statistical learning has enjoyed popularity across many disciplines that have data sets and an urge for machine-aided learning, and books and journals on machine learning, data mining, and knowledge discovery (KDD) have prospered. I was delighted to see the 2nd edition on the market, bridging the gap over the years.

I thank him for sharing his cigar time, probably his short but precious free time for contemplation, with me. I thank him for his patience in spending time with such an ignorant girl with a foreign English accent. And I thank him for introducing a book that would become a bible of the statistical learning community within a couple of years (I felt proud of myself for knowing the book before most people did). Perhaps astronomers cannot share the many joys I drew from how I encountered the book, who introduced it, whether it was used in a course, how often it is cited, and so on. But I assure you it will narrow the gap between how astronomers think about data mining (preprocessing, pipelining, and building catalogs) and how statisticians treat data mining. The newly released 2nd edition should narrow the gap further and assist astronomers in coining brilliant learning algorithms specific to astronomical data. [The END]

—————————– Here I append my scribbles about the book.

What distinguishes this book from other machine learning books is not only that the authors are big figures in statistics, but also that the fundamentals of statistics and probability are discussed in all chapters. Most machine learning books introduce elementary statistics and probability only in chapter 2 and discuss no statistical basics in later chapters; generally, empirical procedures, computer algorithms, and their results are presented without the underlying statistical theory.

You might want to check the book’s website for data sets if you want to try some of the ideas described there:
The Elements of Statistical Learning
In addition to its historical footprint in the field of statistical learning, I’m sure some astronomers will want to check out topics in the book. It will help to replace those data analysis methods in astronomy that will be celebrating their centennials sooner or later with state-of-the-art methods built to cope with modern data.

This new edition reflects the evolution of statistical learning, for which the first edition was an excellent harbinger. Page numbers below refer to the 2nd edition.

[p.28] Suppose in fact that our data arose from a statistical model $Y=f(X)+e$ where the random error $e$ has $E(e)=0$ and is independent of $X$. Note that for this model, $f(x)=E(Y|X=x)$ and in fact the conditional distribution $Pr(Y|X)$ depends on $X$ only through the conditional mean $f(x)$.
The additive error model is a useful approximation to the truth. For most systems the input-output pairs $(X,Y)$ will not have a deterministic relationship $Y=f(X)$. Generally there will be other unmeasured variables that also contribute to $Y$, including measurement error. The additive model assumes that we can capture all these departures from a deterministic relationship via the error $e$.

How statisticians envision “model” and “measurement errors” is quite different from astronomers’ “model” and “measurement errors,” although under the additive error model the two usages match, thanks to the properties of the Gaussian/normal distribution. Still, a chicken-and-egg dilemma exists prior to any statistical analysis.
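A minimal simulation of the quoted model (in Python; my own illustration, not the book’s) showing that, under the additive error assumption, the sample mean of $Y$ given $X=x$ recovers $f(x)$:

    import numpy as np

    rng = np.random.default_rng(1)
    f = lambda x: 2.0 + 0.5 * x**2        # the "true" regression function
    x = np.full(100000, 1.5)              # condition on X = 1.5
    e = rng.normal(0.0, 1.0, x.size)      # additive error: E(e)=0, independent of X
    y = f(x) + e
    print(y.mean(), f(1.5))               # sample mean of Y|X=1.5 approaches f(1.5)=3.125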

[p.30] Although somewhat less glamorous than the learning paradigm, treating supervised learning as a problem in function approximation encourages the geometrical concepts of Euclidean spaces and mathematical concepts of probabilistic inference to be applied to the problem. This is the approach taken in this book.

I strongly recommend reading chapter 3, Linear Methods for Regression. In astronomy there are many important coefficients obtained from regression models, from the Hubble constant to absorption corrections (temperature and magnitude conversions are another example), and it often seems these relations can only be explained via OLS (ordinary least squares) under the homogeneous-error assumption. Yet books on regression and linear models are generally not thin: as much diversity as exists in data sets, a matching amount of methodology, theory, and assumptions exists to reflect that diversity. One might study the statistical properties of these indicators with mixture and hierarchical modeling; some inference, say a population proportion, can then be drawn to verify hypotheses in cosmology in an indirect way. Understanding regression analysis and its assumptions, and how statisticians’ efforts have made these methods more robust, more interpretable, and more reflective of reality, would change the habit of forcing E(Y|X)=aX+b models onto data that show correlation (not causality).
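As a toy illustration of the point about OLS and the homogeneous-error assumption (my own sketch; all numbers invented), the following fits $E(Y|X)=aX+b$ twice to data with known heteroscedastic measurement errors, once by plain OLS and once by weighted least squares with weights $1/\sigma_i^2$:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200
    x = rng.uniform(0.0, 10.0, n)
    sigma = 0.2 + 0.4 * x                        # heteroscedastic measurement errors
    y = 1.0 + 2.0 * x + rng.normal(0.0, sigma)   # true relation: y = 2x + 1

    A = np.column_stack([x, np.ones(n)])         # design matrix for y = a*x + b

    # ordinary least squares: ignores the known error bars
    a_ols, b_ols = np.linalg.lstsq(A, y, rcond=None)[0]

    # weighted least squares: solve (A^T W A) beta = A^T W y with W = diag(1/sigma^2)
    Aw = A / sigma[:, None]**2
    a_wls, b_wls = np.linalg.solve(A.T @ Aw, Aw.T @ y)
    print("OLS:", a_ols, b_ols, "  WLS:", a_wls, b_wls)

OLS remains unbiased here but wastes the known error bars, so its estimates are far noisier; a few lines like these are often all that separates OLS-by-habit from a fit matched to its data.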

A short note on Probability for astronomers http://hea-www.harvard.edu/AstroStat/slog/2009/a-short-note-on-probability-for-astronomers/ Mon, 28 Dec 2009, hlee

I often feel irked whenever I see a function normalized over a feasible parameter space and then used as a probability density function (pdf) for further statistical inference. To be a proper pdf, the normalization has to be done over a measurable space, not over a feasible space. Such practice often yields biased best fits (biased estimators) and improper error bars. On the other hand, validating a measurable space under physics seems complicated; to be precise, we often get lost in translation.
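In symbols (my notation, not quoted from any particular source): if selection confines the observations to a set $T$, the correctly normalized density and likelihood are

$$f_T(x;\theta) = \frac{f(x;\theta)}{\int_T f(u;\theta)\,du}, \qquad L(\theta)=\prod_{i=1}^{n} f_T(x_i;\theta),$$

and maximizing $\prod_i f(x_i;\theta)$ alone, as if the support were the whole measurable space while the data only ever visit $T$, is precisely the practice that produces the biased estimators above.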

When I was teaching statistics (undergraduate courses, though with both undergraduate and graduate students from various fields), there were no astrophysics majors among them. I wondered why they are not encouraged to take some basic statistics when they are encouraged to take computer science courses. As there are many astronomers good at programming and designing tools, I am sure that recommending statistics courses to students would renovate astronomical data analysis procedures (beyond Bevington’s book) and the theories behind them (statistics and mathematics per se, not physics laws).

Here’s an interesting lecture on developing a curriculum for the new era in computer science, and on why basic probability theory and statistics matter for raising versatile computer scientists. It may be a bit outdated now; I saw it several months ago.

A little more than halfway through the lecture, he emphasizes that a probability course should be part of the computer science curriculum. I wonder whether any astronomy professor makes similar arguments and stresses the need for basic probability theory among young future astrophysicists, in order to prevent the many misuses of statistics appearing in the astronomical literature. Particularly, confusion between fitting (estimation) and inference (both model assessment and uncertainty quantification) is frequently observed in papers whose authors claim superior statistics and statistical data analysis. I personally attribute some of this confusion to the lack of distinction between what is random and what is deterministic, or to a strong belief that one’s observed and processed data are free of errors and of probabilistic nature.
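To make the distinction concrete in the simplest possible setting (a sketch of mine, not a prescription), the following separates the fit, a single point estimate, from the inference step, an uncertainty statement about that estimate:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=2.0, size=500)    # stand-in for measured quantities

    # fitting (estimation): one best-fit number
    scale_hat = x.mean()                        # MLE of the exponential scale

    # inference (uncertainty quantification): how far off could that number be?
    boot = np.array([rng.choice(x, x.size, replace=True).mean()
                     for _ in range(2000)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"best fit: {scale_hat:.3f}, 95% bootstrap CI: ({lo:.3f}, {hi:.3f})")

A paper that stops at scale_hat has fitted; it has not yet inferred.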

Many introductory books introduce probability theory through interesting problems, many with historical origins (and many anecdotes). One can check out the very basics, the probability axioms, and measurable functions on Wikipedia. With examples, probability is math at high school level or below that you already know, but with jargon; you will want to recite the lexicon many times to get used to the foundations, the basics, and their theories.

We often say “measurable” to discuss random variables, uncertainties, and distributions without verbosity. “Assume a measurable space …” saves multiple paragraphs in an article and changes the structure of the writing. This short adjective implies a great many assumptions, depending on the statistical models and equations you are using for best fits and error bars.

Consider a luminosity function (LF) that is truncated due to observational limits. The common practice I have seen is to draw a histogram whose adaptive binning makes the overall shape reflect a partial bell curve. Thanks to its smoothed look, scientists impose a Gaussian curve on the partially observed data and find the parameter estimates that determine the shape of this Gaussian. There is no imputation step to fill in the unobserved points and recover the full probability space. The parameter space of the Gaussian frequently does not coincide with the physically feasible space; yet such discrepancy is rarely discussed in the astronomical literature, and the subsequent biased results seem to be a taboo subject.
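Here is a minimal sketch of the resulting bias (in Python, assuming numpy and scipy are available; the detection limit and parameter values are invented for illustration). A Gaussian is fitted to data truncated at a detection limit, first naively and then with the truncation built into the likelihood:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    rng = np.random.default_rng(4)
    full = rng.normal(10.0, 2.0, 5000)      # the complete, unobservable population
    limit = 9.0                             # detection limit: only x > limit observed
    x = full[full > limit]

    # naive fit: ignores truncation and is biased toward the bright side
    print("naive:", x.mean(), x.std())

    # truncation-aware MLE: renormalize the density over the observable region
    def negloglik(p):
        mu, sig = p
        if sig <= 0:
            return np.inf
        return -(norm.logpdf(x, mu, sig).sum()
                 - x.size * norm.logsf(limit, mu, sig))

    res = minimize(negloglik, x0=[x.mean(), x.std()], method="Nelder-Mead")
    print("truncation-aware MLE:", res.x)   # close to the true (10, 2)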

Although astronomers emphasize the importance of uncertainties, neither the factorization nor the stratification of uncertainties has ever been made clear (model uncertainty, systematic uncertainty or bias, statistical uncertainty or variance). Hierarchical relationships and correlations among these different uncertainties are never addressed in full measure. The basics of probability theory and an understanding of random variables would help to characterize uncertainties in both the mathematical and the astrophysical sense; this knowledge would also assist the appropriate quantification of the uncertainties so characterized.
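One elementary tool for such stratification is the law of total variance, $Var(Y)=E[Var(Y|\theta)]+Var(E[Y|\theta])$, which splits the total uncertainty into an averaged statistical piece and a systematic piece. A minimal Monte Carlo sketch (all numbers invented):

    import numpy as np

    rng = np.random.default_rng(5)
    m, n = 1000, 1000
    theta = rng.normal(0.0, 0.5, m)                     # systematic: uncertain calibration offset
    y = theta[:, None] + rng.normal(0.0, 1.0, (m, n))   # statistical: noise around each offset

    within = y.var(axis=1).mean()    # E[Var(Y|theta)] -> statistical part, about 1.0
    between = y.mean(axis=1).var()   # Var(E[Y|theta]) -> systematic part, about 0.25
    print(within, between, y.var())  # the two pieces add up to Var(Y)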

Statistical models are rather simple compared to astrophysical models. However, statistics is the science of understanding uncertainty and randomness; therefore, strategies are needed for transcribing complicated astrophysical models into statistical models that reflect the probabilistic nature of the observations (or of the parameters, in Bayesian modeling). Both raw and processed data manifest the behavior of random variables. Their underlying processes determine not only the physics models but also the statistical models, written in terms of random variables and the link functions connecting physics with uncertainty. To my best understanding, bridging and inventing statistical models for astrophysics research is tough because of this lack of awareness of the basics of probability theory.

Once I had a chance to observe a Decadal Survey meeting, which covered very diverse areas in astronomy: new projects, advancing current projects, career development, and a little about educating professional astronomers, apart from public outreach (which often receives more emphasis than the university curriculum; I do believe widespread public awareness of astronomy is very important). What I missed while observing the meeting were interdisciplinary efforts to transfer knowledge and broaden the field of astronomy and astrophysics, and curriculum design ideas. Because of its long history, I had thought of astronomy as a science of everything; marching along its own path for so long has made astronomy more or less the most isolated and exclusive of the sciences.

Perhaps asking astronomy majors to take multiple statistics courses is too burdensome; what I anticipate as more realistic is that faculty specialized in (statistical) data analysis organize a data analysis course and incorporate several hours of basic probability. With a few hours spent on the fundamental notions of random variables and probability, claims of “statistically rigorous methods and powerful results” would become more appropriate. Statistics is a science, but in the astronomy literature it looks more or less like an adjective that modifies methods and results, like “powerful,” “superior,” “excellent,” “better,” “useful,” and so on. The basics of probability are easily incorporated into introductions to experimental design and optimization methods, which are currently used in a brute-force fashion[1].

Occasionally I see gems on arXiv written by astronomers. Their expertise in astronomy and their interest in statistics have produced intriguing accounts of statistically rigorous data analysis and inference procedures. Their papers include explanations of the fundamentals of statistics and probability that are more appropriate for astronomers than statistics textbooks aimed at scientists and engineers of other fields. I wish more astronomers would join this venture, learning the basics and the diversity of statistics, to rectify the many unconscious misuses of statistics made while arguing that one’s choice of statistic is the most powerful one on the strength of plausible results.

  1. What I mean by a brute-force fashion: trying all the methods listed in the software manual and then stating that method A gave the most plausible values, the ones matching the data in a scatter plot.
astronomy bibliography http://hea-www.harvard.edu/AstroStat/slog/2009/astronomy-bibliography/ Wed, 23 Dec 2009, hlee

Because of blogging and the projects I have worked on, I happened to collect quite a few publications in astronomy. The collection is biased toward my personal interests, but these authors discuss statistics over a wide range, so I felt my astronomical bibliography could be useful to the slog audience. Some areas may match your interests, or your own name may appear.

  • MNRAS, 362, 826-832 (2005) Cayon, Jin, Treaster, Higher Criticism statistic: detecting and identifying non-Gaussianity in the WMAP first year data. (found recently for a [MADS] post about HC)
  • MNRAS, 347, 1241-1254 (2004) Sochting, Clowes, Campusano, tessellation

Xray, LF, PowerLaw estimation, Poisson, Pareto, truncated, heavy tail dist’n

  • ApJ, 310, 334-342 (1986), Schmitt, Maccacaro, estimating alpha of pareto, poisson noise
  • ApJ, 293, 178-191, (1985), Schmitt, upper bounds
  • ApJ, 374, 344-355, (1991), Kraft, Burrows, Nousek, confidence limits
  • ApJ, 228, 939-947 (1979), Cash, MLE
  • ApJ, 518,380-393 (1999) Mighell, parameter estimation, poisson data, chi^2
  • MNRAS, 225, 155-170 (1987), Fasano, Franceschini, multidim’l, Kolmogorov-Smirnov
  • A&A, 188, 258-264 (1987) Gosset, 3D Kolmogorov Smirnov

Calibration uncertainty related

  • MNRAS, 335, 1193-1200 (2002) Bridle, Crittenden, et al. Analytic marginalization over CMB calibration and beam uncertainty
  • ApJ, 693, 822-829 (2009) Humphrey, Liu, Buote, chi^2 and Poissonian data: …
  • ApJ, 690, 128-143 (2009) Grimm, McDowell, et al. Chandra ACIS
  • ApJ, 471, 673-682 (1996) Churazov, Gilfanov, et al. (low counts)
  • ApJ, 562, 575-582 (2001) Davis, pileup, CCD
  • ApJ, 539, 172-186 (2000) Buote, averaging arfs, chi^2,
  • A&A, 162, 340-348 (1986) Simpson, Mayer-Hasselwander, Bootstrap sampling: applications in gamma-ray astronomy (simple linear regression, parametric bootstrap)
  • PASJ (Publ. Astron. Soc. Japan), 59, S113-132 (2007), Ishisaki et al (mentions ARF uncertainty, sec. 4)
  • ApSS, 231, 157-160 (1995) Halon, Bennet, et al. Spectral characterisation of GRBs (used XSPEC)
  • [0805.2207] (ApJ) Vikhlinin, et al. (uncertainty, bias, stacking, section 8, calibration uncertainties, astronomical uncertainties, modeling uncertainties, be aware that lexicon implications are quite different from those of statistics)
  • ApJ, 151, 393- (1968), Schmidt, space dist’n and LFs of quasi-stellar radio sources
  • More directly related references to Chandra calibration uncertainty will be available via a paper that my group is preparing.

Isochrone related (inference on parameters of stellar evolution models)

  1. A&A, astro-ph/0504483 (2006) Cervino, Luridiana, Confidence limits of evolutionary synthesis models.
  2. astro-ph/0510411, Cervino, Luridiana, revisiting and assessing uncertainties in stellar population synthesis models (rectify or render modeling processes)
  3. MNRAS, 351,487-504 (2004) F. Pont, L. Eyer, Isochrone ages for field dwarfs: method and application to the age-metallicity relation.
  4. MNRAS, 316, 605-612 (2000) Hernandez, Valls-Gabaud, Gilmore, The recent star formation history of the Hipparcos solar neighbourhood
  5. A&A, 366, 538-546 (2001) T. Lejeune, D. Schaerer, Database of Geneva Stellar evolution tracks and isochrones for …
  6. MNRAS, 304, 705-719 (1999) Hernandez, Valls-Gabaud, Gilmore, Deriving star formation histories: inverting Hertzsprung-Russell diagrams through a variational calculus maximum likelihood method
  7. MNRAS, 375, 1220-1240 (2007), Mayne, Naylor, et al, Empirical isochrones and relative ages for young stars, and the radiative-convective gap. (DATA)
  8. MNRAS, 373, 1251-1263 (2006) Naylor, Jeffries, a maximum likelihood method for fitting CM diagrams
  9. astro-ph/0702422, Cervino, Luridiana,
  10. ApJ, 645,1436-1447 (2006) von Hippel, et al.
  11. ApJ, 345, 245-256 (1989), Cardelli, Clayton, Mathis,
  12. 0708.1964, McWilliam, Globular cluster abundance … 47 Tuc
  13. A&A, 331, 81-120 (1998), Perryman, Brown, et al (Hyades: dist. struc. dynamics, and age)
  14. AJ, 103, 460- (1992), Hodder, Nemec, Richer, Fahlman (M71)
  15. MNRAS, 347, 101-118 (2004) Sandquist, CMD of M67
  16. A&ASS, 141, 371-383 (2000) Girardi, Bressan, Bertelli, Chiosi, Evolution tracks and isochrones
  17. A&A, 414, 163-174(2004) Salaris, Weiss, Percival, age of the oldest open cluster (can cross compare)
  18. MNRAS, 332, 91-108 (2002) Dolphin, numerical methods of star formation history measurement and applications to seven dwarf spheroidals
  19. Ap&SS (arxiv:0710.4003), Demarque, Guenther, Li, et al (YREC: the Yale Rotating Stellar Evolution Code)
  20. AJ, 137, 3668-3684 (2009) Cignoni, Sabbi, Nota, et al. (Star Formation History in the SMC, NGC602)
  21. ApJ (arxiv:0706.1202) Gieles, Lamers, Zwart (age distribution of star clusters in the SMC)
  22. AJ, 135:1361-1383 (2008) de Jong, Rix, Martin, et al (Numerical CMD analysis of SDSS data and applications)
  23. [astro-ph:0812.1323] Muench, Getman, et al. (Star Formation in the Orion Nebula I; Stellar Content)
  24. [astro-ph:0606170] Blanton, Roweis (K-corrections and filter transformations in the UV, optical, and NIR) (check out NMF in the appendix)
  25. ApJ (? arxiv: 0806.2945) Martin, de Jong, Rix (ML analysis & MW satellites)
  26. and more…

spatial stat/GRB

  • A&A, 354, 1-6 (2000) Meszaros, Bagoly, Vavrek
  • MNRAS, 241, 109-117 (1989), Scott, Tout, Nearest neighbour
  • A&A, 162, 340-348 (1986) Simpson, Mayer-Hasselwander, Bootstrap sampling, app. in GR
  • MNRAS, 210, 19-23,( 1984) Barrow, Bhavsar, Sonoda, Bootstrap, galaxy cluster
  • A&A, 362, 851-864 (2000) La Barbera, Busarello, Capaccioli (measurement errors, intrinsic scatter, fundamental plane)
  • Bull. Astro. Soc. India (2002) 30,445-448, Shanthi, Bhat, Reclassification of GRBs ..
  • MNRAS, 301,419-434 (1998) Pichon, Thiebaut, Nonparametric reconstruction of dist’n fns from observed galactic discs.
  • A&A, 403, 443-448 (2003) Meszaros, Stocek, anisotropy in the angular distribution of long GRBs?
  • ApJ, 538, 165-180 (2000), Hakkila, Haglin, et al. GRB class properties
  • MNRAS, 343, 255-258 (2003), Magliocchetti, Ghirlanda, Celotti, evidence for anisotropy
  • ApJ, 513, 543-548 (1999) Kerscher, et al. (Martinez), J function.
  • A&A, 366, 376-386 (2001) Valdarnini
  • MNRAS, 328, 283-290 (2001), Balastegui, Ruiz-Lapuente, Canal, reclassification of grbs
  • ApJ, 566, 202-209 (2002) Rajaniemi, Mahonen, SOM, classifying GRBs

giving up sorting

  • A&SS, 271, 213-226 (2000), Takeuchi, Application of the information criterion to the estimation of galaxy luminosity function
  • ApJ, 438, 322-340 (1995) Wheaton, Dunklee, et al (multiparameter linear LS fitting to poisson data) Vague to me, could have more statistical rigor.
  • ApJ, 483, 340-349 (1997) Kolaczyk, Nonparametric estimation of GRB intensities using Haar Wavelets. (It discusses IMSE)
  • MNRAS, 377, 120-132 (2007) Shkedy, Decin, et al (Estimating stellar parameters from spectra using a hierarchical Bayesian approach)
  • ApJSS, 113, 89-104 (1997) Newberg, Yanny (3D parameterization of the stellar locus with application to QSO color selection) *STELLAR LOCUS/LOCI*
  • A&A, 478, 971-980 (2008) Huertas-Company, Rouan, et al (Robust morphological classification, SVM)
  • *ApJ, 600, 681-694 (2004) Baldry, Glazebrook, et al (Quantifying the Bimodal CM distribution of Galaxies)
  • A&A, 501, 813-820 (2009) Knop, Hauschildt, Baron
  • ApJSS, 181,1-52 (2009) Swesty, Myra
  • ApJ, 303, 336-346 (1986) Gehrels, Confidence limits (can be tried with more robust approaches)
  • ApJ, 595, 59-70 (2003) Budavari, Connolly, Szalay et al (Angular Clustering with Photometric Redshifts in the SDSS: Bimodality in the clustering Properties of Galaxies)
  • AJ, 131, 790-805 (2006) Lu, Zhuou, Wang et al (*EL-ICA*)
  • MNRAS, 372, 615-629 (2006) Stivoli, et al (ICA, fastICA)
  • MNRAS, 376, 739-759
  • ApJ, 162, 405-410 (1970) Crawford, Jauncey, Murdoch MLE of Powerlaw alpha
  • ApJ, 681, 679-691 (2005) Helsdon, Ponman, Mulchaey
  • ApJSS, 161, 271-303 (2005) Grimm, McDowell, Zezas, Kim, Fabbiano (XLF)
  • ApJ, 481, 644-659 (1997) Nichol, Holden, Romer, Ulmer, Burke, Collins (XLF)
  • ApJ, 269, 35-41 (1983) Marshall, Avni, Tananbaum, Zamorani (Likelihood, Lynden-Bell)
  • ApJ, 662, 224-235 (2007) Kocevski, Ebeling, Mullis, Tully
  • ApJ, 671, 1471-1496 (2007) Barkhouse, Yee, Lopez-Cruz (Schechter function, LF)
  • MNRAS, 301, 881-914 (1998) Ebeling, Edge, et al (log N - log S dist)
  • MNRAS, 281, 799-824 (1996) Ebeling, Voges, et al. (ROSAT..)
  • MNRAS, 389, 1190-1208 (2008) Bottino, Banday, Maino (FASTICA, WMAP)
  • MNRAS, 376, 739-758 (2007) Aumont, Macias-Perez (PolEMICA, SMICA, ICA, CMB, blind source separation)
  • MNRAS, 374, 1207-1215 (2007) Maino, Donzelli, Banday et al (CMB, FastICA)
  • MNRAS, 354, 55-70 (2004) Baccigalupi et al (CMB)
  • MNRAS, 344, 544-552 (2003) Maino, Banday, et al (COBE, fastICA)
  • MNRAS, 334, 53-68 (2002) Maino, Farusi, et al (fastICA)
  • MNRAS, 318, 769-780 (2000) Baccigalupi, Bedini, et al (Neural Net, CMB)
  • A&A, 422, 113-1121 (2004) Zhang, Zhao (SVM, supervised learning)
  • MNRAS, 356, 872-882 (2005) Ascasibar, Binney (*Density Estimation*, has Tessellation, binary tree)
  • MNRAS, 373, 1293-1307 (2006) Sharma, Steinmetz (multi dim density estimation)
  • MNRAS, 368, 497-510 (2006) Diehl, Statler (adaptive binning, Voronoi tessellations)
  • MNRAS, 372, 1104-1116 (2006) Percival, Brown (Likelihood techniques, CMB)
  • ApJSS, 176, 276-292 (2008) van Belle et al (Palomar)
  • A&A, 368,776-786 (2001) Ramella, Boschin, et al (galaxy clusters, Voronoi Tessellations)
  • MNRAS, 331, 569-577 (2002) Sochting, Clowes, Campusano (galaxy cluster, MST)
  • MNRAS, 347, 1241-1254 (2004) Sochting, Clowes, Campusano (MLE, Voronoi Tessellations)
  • ApJ, 477, 79-92 (1997) Scharf, Jones, Ebeling, et al (ROSAT, Voronoi Tessellations, VTP, source detection; thresholding needs more statistical rigor)
  • Clustering in Massive Data Sets, Fionn Murtagh, Chemical Data Analysis in the Large

    http://www.beilstein-institut.de/bozen20000/proceedings/murtagh/murtagh.pdf

  • Advanced data mining tools for exploring large astronomical data, by Longo, et al. SPIE Vol 4477 (2001)
  • Massive Datasets in Astronomy by Brunner, Djorgovski, Prince, Szalay, astro-ph/010648
  • Mining Massive Data Streams by Hulten, Domingos, Spencer, J. Mach Learning Research 1 (2005)
  • Adaptive Piecewise-constant Modeling of Signals in Multidimensional Spaces, Scargle, Jackson, Norris, Phystat2003
  • Error analysis of the photometric redshift technique, MNRAS, 330, 889-894 (2002)
  • Introduction to Statistical Issues in Particle Physics, R. Barlow, Phystat2003
  • Detection of non-random patterns in cosmological gravitational clustering, Valdarnini, A&A, 366, 376-386
  • Population analysis of faint galaxies with mixture modeling, Titus, Spillar, Johnson, AJ, Vol. 114(3). 1997
  • Automated Classification of ROSAT sources using heterogeneous multiwavelength source catalog by McGlynn et al ApJ 616:1284-1300, 2004
  • AJ, 122, 3492-3505 (2001) Miller, Genovese, Nichol, Wasserman, et al. (FDR in astrophysical data analysis)
  • MNRAS, 374, 867-876 (2007) Priddey et al. (survival analysis and bayesian)
  • A&A, 310, 508-518 (1996) Carbillet, Ricort, Aime, Perrier
  • ApJ, 659, 29-51 (2007) Kim, Wilkes, Kim et al (chandra)
  • ApJSS, 169, 401-429 (2007) Kim, Kim, Wilkes, et al (chandra)
  • MNRAS, 369, 677-696 (2006) Protopapas et al. Finding outlier light curves in catalogues of periodic variable stars.
  • MNRAS, 360, 447-491 (2005) Diego, Protopapas, Sandvik, Tegmark, nonparametric inversion of strong lensing system
  • MNRAS, 375, 958-970 (2007) Diego, Tegmark, Protopapas, Sandvik,
  • MNRAS, 362, 1247-1258 (2005) Diego, Sandvik, Protopapas, Tegmark, Benitez, Broadhurst
  • ApJ, 134, 1963-1993
  • MNRAS, 362, 460-468, Protopapas, Jimenez, Alcock, Fast identification of transits from light-curves
  • MNRAS,378, 716-722 (2007) Lane, Gray, et al.
COSMOS

  • ApJSS, 172, 353-367 (2007) Brusa, et al (Likelihood ratio technique, Fisher will not be happy)
  • ApJSS, 172,406-433 (2007) Scarlata et al.
  • ApJSS, 172,494-510 (2007) Scarlata et al.
  • ApJSS, 172,320-328 (2007) Kartaltepe et al.
  • ApJSS, 172, 284-295 (2007) Capak et al.
  • ApJSS, 172, 182-195 (2007) Finoguenov et al
  • ApJSS, 182, 341-352 (2007) Cappelluti et al.
  • MNRAS, 259, 413-420 (1992) Sutherland, Saunders, LRT for source identification
  • A&A, 398, 901-918 (2003) Ciliegi, et al.
  • AJ, 123, 1807-1825 (2002) Goto et al. (SDSS, cluster detection)
  • AJ, 135, 1810-1824 (2008) Fridman (robustness, influence function)
  • ApJ, 579, 48-75 (2002) Scranton, Johnston, et al (Analysis of systematic effects and stat uncertainties in … SDSS)
  • A&A, 330, 447-452 (1998) Molinari, Smareglia (Neural Net, galaxy classification, LF of E/SO)
  • ApJ, 556, 937-943 (2001) Cortiglioni, Mahonen, Hakala, Frantti (SOM, star-galaxy discrimination)
  • A&A, 482, 483-498 (2008) Torniainen, Tornikoski, Turunen, et al (SOM)
  • ApJ, 566, 202-209 (2002) Rajaniemi, Mahonen (GRB, SOM)
  • ApJSS, 111,357-367 (1997) Naim, Ratnatunga, Griffiths (Galaxy morphology, SOM)
  • MNRAS, 334, 53-68 (2002) Maino, Farusi, et al (FastICA, CMB, Plank)
  • MNRAS, 340, 1269-1278 (2003) Ebeling, improved approx. of Poissonian errors for high CLs.
  • ApJ, 461, 396-407 (1996) Mattox et al (Likelihood, Egret)
  • ApJ, 504, 405-418 (1998) Scargle (Bayesian Blocks)
  • AJ, 124, 147-157 (2002) Whitmore, Schweizer, Kundu, Miller (LF, GC NGC 3610) K-S test
  • MNRAS,155, 95-118 (1971) *Lynden-Bell*
  • ApJ, 116, 144- (1952) *Neyman, Scott*
  • ApJ, 117, 92- (1953) *Neyman, Scott*
  • ApJ, 183, 1-13 (1973) Murdoch, Crawford
  • AJ, 115, 1206-1211 (1998) Saha
  • ApJ, 645, 1436-1447 (2006) von Hippel, Jefferys, Scott, et al (CMD, Bayesian)
  • A&A, 415, 571-576 (2004) Bonatto, Bica, Girardi (isochrones, WEBDA, source of some clusters for isochrone fitting, inference problem)
  • MNRAS, 317, 831-842 (2000) *Hernandez*, Gilmore, Valls-Gabaud, Nonparametric star formation histories for four dwarf spheroidal galaxies of the local group
  • A&A, 436, 127-143 (2005) Jorgensen, Lindegren, Determination of Stellar ages from isochrones: Bayesian est. vs. isochrone fitting
  • A&A, 386, 187-203 (2002) Meibom, Andersen, Nordstrom, Stellar evolution, open clusters
  • ApJ, 462, 672-683 (1996) Tolstoy, Saha, CMD, Bayesian Inference
  • A&A, 472, 293-298 (2007) Ramos, Extreme value theory, solar cycle
  • ApJ, 427, 438-445 (1994) Zepka, Cordes, Wasserman
  • ApJ, 470,706-714 (1996) Akritas, Bershady, Linear regression, measurement errors
  • ApJ, 646, 1445-1451 (2006) Ramos, MDL and model selection
  • ApJ, 438, 269-287 (1995) Baliunas, Donahue, et al Chromospheric variations in MS stars II
  • ApJ, 270, 119-122 (1983) Morrison, McCammon
  • ApJ, 199, 299-306 (1975) Kellogg, Baldwin, Koch
  • ApJ, 508, 314-327 (1998) Mukherjee, Feigelson, Babu, Murtagh, Fraley, Raftery (Three types of GRBs)
  • A&ASS, 116, 395-402 (1996) Faundez-Abans, et al, Classification of Planetary Nebulae, Neural Network, Supervised learning
  • AJ, 79, 745- (1974) Lucy, EM algorithm in astronomy
  • ApJ, 610,1213-1227 (2004) Esch, Connors, et. al.
  • ApJ, 495, 100-114 (1998) Jones, Scharf, Ebeling, et al (logN-logS)
  • MNRAS, 281, 799-829 (1996) Ebeling, Voges, et al. (Voronoi Tessellation)
  • MNRAS, 370, 141-150 (2006) Recio-Blanco, Bijaoui, de Laverny (MATISSE algorithm)
  • ApJ, 661, 135-148 (2007), Zezas, et al *XLF*
  • MNRAS, 338, 891-902 (2003) Smith, Lutz-Kelker bias
  • ApJSS, 129, 1-31 (2000) Takeuchi,Yoshikawa, Ishii (stat. methods of estimating LF)
  • ApJ, 560, 606-616 (2001) *Loh, Quashnock, Stein*
  • MNRAS, 324, 51-56 (2001) Rauzy, assessing the completeness
  • A&ASS,127, 335-352 (1998) Fadda, Slezak, Bijaoui, Density estimation with non-parametric methods
  • ApJ, 412, 64-71 (1993) Landy, Szalay, bias and variance of angular correlation functions
  • A&A, 423-443 (2004) Demianski, Doroshkevich, stat. characteristics of large scale structure (it didn’t look statistical to me)
  • MNRAS, 296, 253-272 (1998) Colombi, Szapudi, Szalay, effects of sampling on statistics of large-scale structure
  • AJ, 104, 1472- (1992) Secker, Stat. Investigation into the shape of the GCLF
  • MNRAS, 351, L49-L53 (2004), *Liddle* How many cosmological parameters?
  • A&A, 431, 511-516 (2005) Pfenniger, Revaz, Tully-Fisher
  • ApJ, 524, L79-L82 (1999) Bromley, Tegmark, Is the CMB really non-gaussian?
  • MNRAS, 321, 44-56 (2001) Koen, Hipparcos, time series
  • MNRAS, 340, 1190-1198 (2003) Bissantz, Munk, Scholz, Parametric versus nonparametric modelling? stat. evidence based on p-value curves
  • MNRAS, 336, 131-138 (2002) Bissantz, Munk, graphical selection method post model selection problem to me
  • AJ, 70(3), 193-, (1965) Sun’s Motion and Sunspots
  • ApJ, 480, 22-35 (1997) Tegmark, Taylor, Heavens, CMB, SVD, large data set
  • Physical Review D, 55(10), 5895- (1997) Tegmark, how to measure CMB power spectra w/o losing information
  • ApJ, 480, L87-L90 (1997) Tegmark, … w/o losing information
  • ApJ, 499, 555-576 (1998) Tegmark, Hamilton, Strauss, Vogeley, Szalay, Galaxy power spectrum
  • Physical Review D, 69, 103501-1 (2004), Tegmark et al.
  • J. Math. Phys. 41(6), 3801- (2000), Schroer, Quantum field theory
  • ApJ, 518, L69-72 (1999), Tegmark, Bromley, Observational Evidence for Stochastic Biasing
  • ApJ, 519, 513-517 (1999), Tegmark, Comparing and combining CMB data sets
  • MNRAS, 312, 285-294 (2000), Hamilton, Tegmark, Decorrelating the power spectrum of galaxies
  • ApJ, 544, 30-42 (2000) Tegmark, Zaldarriaga, 10 parameter CMB
  • ApJ, 499, 526-532 (1998) Tegmark, Rees, CMB fluctuation?
  • MNRAS, 341, 1199-1204 (2003) Yamamoto, SDSS, QSO, spatial power spectrum
  • AJ, 133, 734-754 (2007) Blanton, Roweis, K-corrections and Filter transformations (NMF)
  • ApJ, 483, 350-369 (1997) Damiani, Maggio, et al, wavelet, detection
  • ApJ, 483, 370-389 (1997), Damiani, et al, wavelet, detection, application
  • ApJSS, 138, 185-218 (2002) Freeman, et al, wavdetect
  • A&A, 246, 291-300 (1991) Zaninetti, Dynamical Voronoi Tessellation
  • AJ, 115, 2598-2615 (1998) Barnbaum, Bradley (radio, filter)
  • AJ, 130, 2424-2433 (2005) Mitchell, Robertson, Sault Alternative Adaptive Filter structures (radio astronomy)
  • AJ, 130, 2916-2927 (2006) Poulsen, Jeffs, Warnick (cancellation, filter, LMS)
  • ApJ, 688, L49-52 (2008) Kitiashvili, Kosovichev (data assimilation, solar cycles)
  • MNRAS, 334, 533-541 (2002) Herranz, et al. (adaptive filter)
  • AJ, 120, 2163-2173 (2000) Stoica, Larsson, Li (adaptive filter bank)
  • ApJ, 399, 345-351 (1992) *Efron, Petrosian, *
  • A&ASS, 127, 335-352 (1998) Fadda, Slezak, Bijaoui (density estimation, nonparametric, penalized likelihood)
  • MNRAS, 359, 993-1006 (2005) Lopez-Caniego, et al. (Neyman-Pearson detector, filter design)
  • A&A … Salaris, Cassisi, CC diagram, GC, systematic uncertainties using the (V-K)-(V-I)
  • ApJSS, 172:219-238 (2007) Leauthaud, Massey, et al. Weak grav. lensing with COSMOS
  • ApJSS, 172:254-269 (2007) Guzzo, Casata, et al. COSMOS, large scale structure, morphology
  • ApJSS, 172:150-171 (2007) Scoville, et al. COSMOS, galaxy evolution
  • A&A, 343, 496-506 (1999) Plets, Vynckier, MS, post MS, minimum volume ellipsoid
  • MNRAS, 318, 92-100 (2000) Lucy, L.B. hypothesis testing, chi^2
  • MNRAS, 380, 551-570 (2007) Platen, van de Weygaert, Jones, WVF void detection
  • ApJ, 293, 192-206 (1985) Feigelson, Nelson, stat. method. with upper limits
  • AJ, 123, 2945-2975 (2002) Richards, Fan, Newberg et al, SDSS, quasar,
  • ApJSS, 155, 257-269 (2004) Richards, Nichol, Gray et al, SDSS, quasar,
  • ApJ, 545, 6-25 (2000) Beisbart, Kerscher, luminosity- and morphology-dependent clustering, galaxy (marked point process, random field, see Stoyan, Kendall, and Mecke)
  • AJ, 122, 1238-1250 (2001) Shimasaku, et al. SDSS, statistical properties, photometric system
  • AJ, 122, 1861-1874 (2001) Strateva, et al. SDSS, color separation
  • ApJSS, 155:243-256 (2004), Weinstein, Richards, Schneider, et al. SDSS, photometric redshifts
  • Geophy. Res. Let., 33, L07201 (2006) Colwell, Esposito, Sremcevic (self-gravity wakes in Saturn’s A ring measured by stellar occultations from Cassini)
  • AJ, 133, 2624-2629 (2007) Hedman et al (about wake structure in Saturn’s A ring) I was fascinated when I first heard of the two-shepherd-moon effect that shapes Saturn’s rings, and later of possible meteorites and moons
  • ApJ, 622:759-771(2005), Gorski, Hivon, Banday, et al (HEALPIX)

Copula application (I wanted to put this under [MADS])

  • MNRAS, 393, 1370-1376 (2009) Koen (CIs for the correlation between the GRB peak energy and the associated SN peak brightness, Gaussian copula)

Non-astronomy, not in statistics:

  • Clustering Properties of Hierarchical Self-Organizing Maps, Lampinen and Oja
    J. Math. Imaging and Vision, vol. 3, pp. 261-272, 1992
  • Quite many machine learning publications … too long to type them out!
From Terence’s stuff: You want proof? http://hea-www.harvard.edu/AstroStat/slog/2009/from-terences-stuff-you-want-proof/ Mon, 21 Dec 2009, hlee

Please check p.11 of IMS Bulletin, v.38 (10) (this pdf file) for the whole article.

It is widely believed that under some fairly general conditions, MLEs are consistent, asymptotically normal, and efficient. Stephen Stigler has elegantly documented some of Fisher’s troubles when he wanted a proof. You want proof? Of course you can pile on assumptions so that the proof is easy. If checking your assumptions in any particular case is harder than checking the conclusion in that case, you will have joined a great tradition.
I used to think that efficiency was a thing for the theorists (I can live with inefficiency), that normality was a thing of the past (we can simulate), but that—in spite of Ralph Waldo Emerson—consistency is a thing we should demand of any statistical procedure. Not any more. These days we can simulate in and around the conditions of our data, and learn whether a novel procedure behaves as it should in that context. If it does, we might just believe the results of its application to our data. Other people’s data? That’s their simulation, their part of the parameter space, their problem. Maybe some theorist will take up the challenge, and study the procedure, and produce something useful. But if we’re still waiting for that with MLEs in general (canonical exponential families are in good shape), I wouldn’t hold my breath for this novel procedure. By the time a few people have tried the new procedure, each time checking its suitability by simulation in their context, we will have built up a proof by simulation. Shocking? Of course.
Some time into my career as a statistician, I noticed that I don’t check the conditions of a theorem before I use some model or method with a set of data. I think in statistics we need derivations, not proofs. That is, lines of reasoning from some assumptions to a formula, or a procedure, which may or may not have certain properties in a given context, but which, all going well, might provide some insight. The evidence that this might be the case can be mathematical, not necessarily with epsilon-delta rigour, simulation, or just verbal. Call this “a statistician’s proof ”. This is what I do these days. Should I be kicked out of the IMS?

After reading a good deal of the astronomy literature, I have developed a notion that astronomers like to use maximum likelihood as a robust alternative to chi-square minimization for fitting astrophysical models with parameters. I am not sure it is truly robust, because not many astronomy papers list the assumptions and conditions behind their MLEs.

I often get confused by their target parameters. They are not parameters in statistical models, and they do not necessarily satisfy the properties of probability theory; I often fail to find the statistical properties of these parameters relevant to their estimation. Checking statistical modeling procedures against their assumptions, as Prof. Speed describes, is rare, and even a derivation falls a bit short of what could be called “rigorous statistical analysis.” (At the least, I wish to see a sentence such as “It is trivial to derive the estimator with this and that properties.”)

A common claim I confront in the astronomical literature is that the authors’ strategy is statistically rigorous, superior, or powerful, without any showing of why or how it is rigorous, superior, or powerful. I have tried to convey these pitfalls and the general restrictions of the statistical methods employed. Such a strategy is not “statistically robust,” nor “statistically powerful,” nor “statistically rigorous.” Statisticians have their own measures of “superiority” for discussing improvements in their statistics, analysis strategies, and methodology.

It has not been easy, since I never intend case-specific fault-picking every time I see such statements. A method believed to be robust can turn out not to be robust with your data and models. By simulations, and by derivations with a sufficient description of the conditions, your excellent method can be presented with statistical rigor.
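In that spirit of proof by simulation: before calling, say, the MLE of a power-law (Pareto) index robust or consistent for one’s data, one can check its behavior in and around the conditions at hand. A minimal sketch in Python (the true index, cutoff, and sample sizes are arbitrary choices of mine):

    import numpy as np

    rng = np.random.default_rng(6)
    alpha, xmin = 1.8, 1.0                   # true power-law index and lower cutoff

    def mle(x):
        # closed-form MLE of the Pareto index: n / sum(log(x / xmin))
        return x.size / np.log(x / xmin).sum()

    for n in (50, 500, 5000):
        est = np.array([mle(xmin * (1.0 + rng.pareto(alpha, n)))
                        for _ in range(1000)])
        print(n, est.mean(), est.std())      # bias and spread should shrink with n

If the bias and spread do not shrink the way theory promises, that is the moment to doubt either the procedure or its advertised conditions.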

Within similar circumstances for statistical modeling and data analysis, there is a trade-off between robustness and conditions among statistical methodologies. Before a particular method is stated to be robust or rigid, powerful or insensitive, efficient or inefficient, and so on, a derivation, a proof, or a simulation study is anticipated before the analysis and procedure can be called statistically excellent.

Before this gets too long, I would like to say that statistics has traditions for declaring that methods work, via proofs, simulations, or derivations. Each has its foundations: assumptions and conditions to be stated before a method is called “robust,” “efficient,” “powerful,” or “consistent.” When new statistics are introduced in the astronomical literature, I hope to see some additional effort at matching the statistical conditions to the properties of the target data, plus some statistical rigor (derivations or simulations), prior to calling them “robust,” “powerful,” or “superior.”

arxiv list http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-list/ Thu, 10 Dec 2009, hlee

When I began subscribing to arXiv/astro-ph and arXiv/stat, I listed astro-ph papers featuring relatively advanced statistics for only about a year, but I also kept papers relevant to astrostatistics beyond astro-ph, and papers introducing hot topics in statistics and computer science for astronomical data applications. While building my own arXiv list below, I hoped to write short introductions to statistical topics unlikely to be known to most astronomers (like my [MADS] posts), matched to subjects and targets in astronomy. I thought such an effort could spawn new collaborations or expand the understanding of statistics among astronomers (see Magic Crystal). Well, I couldn’t catch up with the growth rate, and it’s about time to let that hope go. Still, some of these papers may be useful to slog subscribers; I hope they are.

  • [0704.1743] Fukugita, Nakamura, Okamura, et al (catalogue of morphologically classified galaxies from the SDSS database for trying various machine learning algorithms for automated classification)
  • [0911.1015] Gudendorf, Segers (Extreme-Value Copulas)
  • [0710.2024] Franz (Ratios: A short guide to confidence limits and proper use)
  • [0707.4473] Covey, Ivezic, Schlegel, Finkbeiner, et al. (Outliers in SDSS and 2MASS)
  • [0511503] (astro-ph) MNRAS, Nolan, Harva, Kaban, Raychaudhury, data driven bayesian approach
  • [0505017] (cs) Abellanas, Claverol, Hurtado, Delaunay depth
  • [0706.2704] (astro-ph) Wang, Zhang, Liu, Zhao (SDSS, kernel regression) Quantile regression can be applied
  • [0805.0056] Kong, Mizera, Quantile Tomography: using quantiles with multivariate data
  • [0907.5236] Ghosh, Resnick, Mean Excess Plots, Pareto
  • [0907.3454] Rinaldo, Wasserman (Low-Noise Density Clustering)
  • [0906.3979] Friendly (Golden Age of Statistical Graphics)
  • [0905.2819] Benjamini, Gavrilov (FDR control)
  • [0903.2651] Ambler, Silverman (Spatial point processes)
  • [0906.0562] Loubes, Rochet, Regularization with Approx. L^2 maximum entropy method
  • [0904.0430] Diederichs, Juditski, et al (Sparse NonGaussian Component Analysis)
  • [0905.0454] McWhirter,Proudler (eds) *Mathematics in Signal Processing V*
    [Tensor Decompositions, by *Pierre Comon*]
  • [0904.3842] Li, Dong (Dimension Reduction)
  • [0903.1283] Wiesel, Eldar, Hero III (Covariance estimation, graphical models)
  • [0904.1148] Beynaud-Bouret, Rivoirard
  • [0903.5147] Cai, Zhou (Data driven Block Thresholding approach to wavelet estimation)
  • [0905.0483] Harmany, Marcia, Willett (Sparse Poisson intensity reconstruction)
  • [0904.2949] Hjort, McKeague, van Keilegom (Empirical Likelihood)
  • [0809.3373] (astro-ph) Bailer-Jones, Smith, et al. (GAIA, SVM)
  • [0904.0156] Berger, Bernardo, Sun (formal definition of reference priors)
  • [0703360] (math.st) Drton *(LRTs and singularities)*
  • [0807.3719] Shi, Belkin, Bin Yu
  • [0903.5480] Andrieu, Roberts
  • [0903.3620] Casella, Consonni (Reconciling Model Selection and Prediction)
  • [0903.0447] Alqallaf, van Aelst et al (propa. outliers in multivariate data)
  • [0903.2654] Ambler, Silverman (Bayesian wavelet thresholding)
  • [0206366] (astro-ph) van de Weygaert, *Cosmic Foam*
  • [0806.0560] Noble, Nowak, Beyond XSPEC, ISIS
  • [0908.3553] Liang, Stochastic approximation (SAMC), Bayesian model selection
  • [0804.3829] Liu, Li, *Hao,* Jin
  • [0802.2097] Roelofs, Bassa, et al
  • [0805.3983] Carlberg, Sullivan, et al (Clusering of SN IA host galaxies)
  • [0808.0572] *Efron, Microarrays, Empirical Bayes, and Two groups model*
  • [0805.4264] Tempel, Einasto, Einasto, Saar, Anatomy of luminosity functions
  • [0909.0170] Khmaladze, Koul (GoF problem for errors in nonparametric regression: dist’n free approach)
  • [0909.0608] *Liu, Lindsay*
  • [0702052] de Wit, Auchere (astro-ph, multispectral analysis of solar EUV images)
  • [0508651] Pires, Juin, Yvon, et al (astro-ph, Sunyaev-Zel’dovich clusters)
  • [0808.0012] Caticha (on slog, lectures on prob., entropy & stat. physics)
  • [0808.3587] Verbeke, Molenberghs, Beunckens, Model selection with incomplete data
  • [0806.1487] Schneider et al. Sim. and cos. inference: a statistical model for power spectra means and covariances.
  • [0807.4209] Adamakis, Morton-Jones, Walsh (solar physics, Bayes Factor)
  • [0808.3852] Diaconis, Khare, Saloff-Coste
  • [0807.3734] Rocha, Zhao, *Bin Yu* (SPLICE)
  • [0807.1005] Erven, Grunwald, Rooij ( … AIC-BIC dilemma)
  • [0805.2838] *E.L. Lehmann* (historical account)
  • [0805.4136] Genovese, Freeman, Wasserman, Nichol, Miller
  • [0806.3301] Tibshirani (not robert, but ryan)
  • [0706.3622] Wittek, Barko (physics,data-an)
  • [0805.4417] Georgakakis, et al (logN-logS, a bit fishy to me)
  • [0805.4141] Genovese, Perone-Pacifico, et al
  • [0806.3286] Chipman, George, McCulloch (BART)
  • [0710.2245] Efron (size, power, and FDR)
  • [0807.2900] Richards, Freeman, Lee, Schafer (PCA)
  • [0609042] (math.ST) Hoff (SVD)
  • [0707.0701] (cs.AI) Luss, d’Aspremont (Sparse PCA)
  • [0901.4252] Benko, Hardle, Kneip (Common Functional PC)
  • [0505017] (cs.CG) Abellanas, Claverol, Hurtado (Delaunay depth)
  • [0906.1905] (astro-ph.IM) Guio, Achilleos, VOISE, Voronoi Image Segmentation algorithm
  • [0605610] (astro-ph) Sochting, Huber, Clowes, Howell (FSVS Cluster Catalogue, Voronoi Tessellation)
  • [0611473] (math.ST) Rigollet, Vert, Plug-in, Density Level Sets
  • [0707.0481] Lee, Nadler, Wasserman (Treelets)
  • [0805.2325] (astro-ph) Loh (block bootstrap, subsampling)
  • [0901.0751] Chen, Wu, Yi (Copula, Semiparametric Markov Model)
  • [0911.3944] White, Khudanpur, Wolfe (Likelihood based Semi-Supervised Model Selection with applications to Speech Processing)
  • [0911.4650] Varoquaux, Sadaghiani
  • [0803.2344] Vossen
  • [0805.0269] Leach et al (Component Separation methods for the Planck mission: Appendix reviews various component separation/dimension reduction methods)
  • [0907.4728] Arlot, Celisse (survey of CV for model selection)
  • [0908.2503] Biau, Parta (sequential quantile prediction of time series)
  • [0905.4378] Ben-Haim, Eldar, (CRBound for Sparse Estimation)
  • [0906.3082] Cohen, Sackrowitz, Xu (Multiple Testing for dependent case)
  • [0906.3091] Sarkar, Guo (FDR)
  • [0903.5161] Rinner, Dickhaus, Roters (FDR)
  • [0810.4808] Huang, Chen (ANOVA, coefficient, F-test for local poly. regression)
  • [0901.4752] Chretien, (Robust est. of Gaussian mixtures)
  • [0908.2918] James, Wang, Zhu (Functional linear regression)
  • [0908.3961] Clifford, Cosma
  • [0906.3662] Lindquist (stat. anal. fMRI data)
  • [0706.1062] Clauset, Shalizi, Newman (PowerLaw dist’n)
  • [0712.0881] Zou, Hastie, Tibshirani (DoF, Lasso)
  • [0712.0901] Jiang, Luan, Wang
  • [0705.4020] Chattopadhyay, Misra, et al (GRB, classification, model based)
  • [0707.1891] Holmberg, Nordstrom, Andersen (isochrones, calibration, Geneva-Copenhagen)
  • [0708.1510] Cobb, Bailyn, Connecting GRBs and galaxies:
  • [0705.2774] Kelly
  • [0708.0302] Chambers, James, Lambert, Wiel (incremental quantile, monitoring)
  • [0708.0169] Mikhail, Data-driven goodness of fit tests, attempts to generalize the theory of score tests
  • [0706.1495] Huskova, Kirch, Bootstrapping CI for the change point of time series
  • [0708.4030] Richer, Dotter, et al (NGC6397, GC, CMD, LF)
  • [0708.1071] Shepp, Statistical thinking: From Tukey to Vardi and beyond
  • [0708.0499] *Hunter, Wang, Hettmansperger *
  • [0704.0781] Cabrera, Firmani et al (Swift, long GRBs)
  • [0706.2590] Ramos, *Extreme Value Theory and the solar cycle (Pareto dist’n, survival)*
  • [0706.2704] Wang, Zhang, Liu, Zhao (kernel regression, CV, redshift) <- quantile regression?
  • [0707.1611] Budavari, Szalay, (identification, Bayes factor)
  • [0707.1900] Vetere, Soffitta, et al. (GRB, BeppoSAX)
  • [0707.1982] Kim, *Liddle* (random matrix mass spectrum)
  • [0707.2064] Allen, (Star Formation, Bayesian)
  • [0011057] (hep-ex) Cranmer, Kernel Estimation in High Energy Physics
  • [0512484] (astro-ph) Mukherjee, Parkinson, Corasaniti, *Liddle* (model selection, dark energy)
  • [0701113] (astro-ph) Liddle (information criteria for astrophysical model selection)
  • [0810.2821] Cozman, concentration inequalities and LLNs under irrelevance of lower and upper expectations.
  • [0810.5275] Hall, Park, Samworth
  • [0709.1538] Einbeck, Evers, *Bailer-Jones*, localized principal components
  • [0804.4068] *Pires, Starck*, et al, FASTLens (weak lensing)
  • [0804.0713] Delaigle, Hall, Meister
  • [0802.0131] (astro-ph) Bobin, Starck, Ottensamer (*Compressed Sensing* in Astronomy)
  • [0803.1708] Taylor, Worsley, (Random Fields of Multivariate Test Statistics, shape analysis)
  • [0803.1736] Salibian-Barrera, Yohai (high breakdown point robust regression, censored data)
  • [0803.4026] Amini, Wainwright, (Sparse Principal Components)
  • [0803.1752] Ren, (weighted empirical liklihood)
  • [0803.3863] Efron (simultaneous inference)
  • [0801.3552] Clifford, Cosma, probabilistic counting algorithms
  • [0802.1406] Blanchard, Roquain (multiple testing)
  • [0707.2877] van de Weygaert
  • [0806.3932] Vavrek, Balazs, Meszaros, et al (testing the randomness in the sky distribution of GRBs), MNRAS, 391(3), 2008
  • [0911.3769] Chan, Spatial clustering, LRT
  • [0911.3749] Hall, Miller
  • [0909.0184] Chan, Hall robust nearest neighbor methods for classifying high dimensional data
  • [0911.3827] Jung, Marron, PCA High Dim
  • [0911.3531] Owen, Karl Pearson’s meta analysis revisited
  • [0911.3501] Wang, Zhu, Zhou, Quantile regression varying coefficient models
  • [0505200] (physics) *Pilla, Loader, Taylor*
  • [0501289] (math.ST) *Meinshausen, Rice* Estimating the proportion of false null hypotheses among a large number of independently tested hypotheses.
  • [0806.1326] Velez, Ariste, Semel (PCA, Sun, magnetic fields)
  • [0906.4582] *Belabbas, Wolfe*, PCA, high-dimensional data
  • [0903.3002] Huang, Zhang, Metaxas Learning with Structured Sparsity
  • [9209010] (gr-qc) Finn, Detection, Measurement, and Gravitational Radiation
  • [0112467] (astro-ph) Petrosian
  • [0103040] (astro-ph) Peebles, N-point correlation functions
  • [9912329] (astro-ph) Kerscher, Stat. analysis of large scale structure in the universe Minkowski functional and J function
  • [0107417] Connolly, Scranton, et al. Early SDSS
  • [0511503] (math.ST) Pilla, Loader, Volume-of-Tube Formula: Perturbation tests, mixture models, and scan statistics
  • [0503033] (astro-ph) Battye, Moss
  • [0504022] (astro-ph) Trotta, Applications of Bayes Model Selection to cosmological parameters
  • [0304301] (astro-ph) Nakamichi, Morikawa, AIC, is galaxy dist’n non-extensive and non-gaussian?
  • [0110230] (astro-ph) Nichol, Chong, Connolly, et al
  • [0806.1506] (astro-ph) Unzicker, Fischer, 2D galaxy dist’n, SDSS
  • [0304005] (astro-ph) Maller, McIntosh, et al. (Angular correlation function, Power spectrum)
  • [0108439] (astro-ph) Boschan (angular and 3D correlation functions)
  • [9601103] (astro-ph) Maddox, Efstathiou, Sutherland (sys errors, angular correlation function)
  • [0806.0520] Vio, Andreani
  • [0807.4672] Zhang, Johnson, Little, Cao
  • [0911.4546] Hobert, Roy, Robert
  • [0911.4207] Calsaverini, Vicente (information theory and copula)
  • [0911.4021] Fan, Wu, Feng (Local Quasi-Likelihood with a parametric guide) *
  • [0911.4076] Hall, Jin, Miller
  • [0911.4080] Genovese, Jin, Wasserman
  • [0802.2174] Faure, Kneib, et al. (strong lense, COSMOS)
  • [0802.1213] Bridle et al (Great08 Challenge)
  • [0711.0690] Davies, Kovac, Meise (Nonparametric Regression, Confidence regions and regularization)
  • [0901.3245] Nadler
  • [0908.2901] Hong, Meeker, McCalley
  • [0501221] (math) Cadre (Kernel Estimation of Density Level Sets)
  • [0908.2926] Oreshkin, Coates (Error Propagation in Particle Filters)
  • [0811.1663] *Lyons* (Open Statistical Issues in Particle Physics)
  • [0901.4392] Johnstone, Lu (Sparse Principal Component Analysis)
  • [0803.2095] Hall, Jin (HC)
  • [0709.4078] Giffin (… Life after Shannon)
  • [0802.3364] Leeb (model selection and evaluation)
  • [0810.4752] Luxburg, Scholkopf (Stat. Learning Theory…)
  • [0708.1441] van de Weygaert, Schaap, The cosmic web: geometric analysis
  • [0804.2752] Buhlmann, Hothorn (Boosting algorithms…)
  • [0810.0944] Aydin, Pataki, Wang, Bullitt, Marron (PCA for trees)
  • [0711.0989] Chen (SDSS, volume limited sample)
  • [0709.1538] Einbeck, Evers, Bailer-Jones (Localized PC)
  • [0610835] (math.ST) Lehmann (On LRTs)
  • [0604410] (math.ST) Buntine, Discrete Component Analysis
  • [0707.4621] Hallin, Paindaveine (semiparametrically efficient rank-based inference I)
  • [0708.0079] Hallin, H. Oja, Paindaveine (same as above, II)
  • [0708.0976] Singh, Xie, Strawderman (confidence distribution)
  • [0706.3014] Gordon, Trotta (Bayesian calibrated significance levels.. the usage of p-values looks awkward)
  • [0709.0711] Quireza, Rocha-Pinto, Maciel
  • [0709.1208] Kuin, Rosen (measurement errors)
  • [0709.1359] Huertas-Company, et al (SVM, morphological classification)
  • [0708.2340] Miller, Kitching, Heymans, et. al. (Bayesian Galaxy Shape Measurement, weak lensing survey)
  • [0709.4316] Farchione, Kabaila (confidence intervals for the normal mean)
  • [0710.4245] Fearnhead, Papaspiliopoulos, Roberts (Particle Filters)
  • [0705.4199] (astro-ph) Leccardi, Molendi, an unbiased temperature estimator for stat. poor X-ray spectra (can be improved…)
  • [0712.1663] Meinshausen, *Bickel, Rice* (efficient blind search)
  • [0706.4108] *Bickel, Kleijn, Rice* (Detecting Periodicity in Photon Arrival Times)
  • [0704.1584] Leeb, Potscher (estimate the unconditional distribution of post model selection estimator)
  • [0711.2509] Pope, Szapudi (Shrinkage Est. Power Spectrum Covariance matrix)
  • [0703746] (math.ST) Flegal, Haran, Jones (MCMC: can we trust the third significant figure?)
  • [0710.1965] (physics.soc-ph) Volchenkov, Blanchard, Sestieri of Venice
  • [0712.0637] Becker, Silvestri, Owen, Ivezic, Lupton (in pursuit of LSST science requirements)
  • [0703040] Johnston, Teodoro, *Martin Hendry* Completeness I: revisited, reviewed, and revived
  • [0910.5449] Friedenberg, Genovese (multiple testing, remote sensing, LSST)
  • [0903.0474] Nordman, Stationary Bootstrap’s Variance (Check Lahiri99)
  • [0706.1062] (physics.data-an) Clauset, Shalizi, Newman (power law distributions in empirical data)
  • [0805.2946] Kelly, Fan, Vestergaard (LF, Gaussian mixture, MCMC)
  • [0503373] (astro-ph) Starck, Pires, Refregier (weak lensing mass reconstruction using wavelets)
  • [0909.0349] Panaretos
  • [0903.5463] Stadler, Buhlmann
  • [0906.2128] Hall, Lee, Park, Paul
  • [0906.2530] Donoho, Tanner
  • [0905.3217] Hirakawa, Wolfe
  • [0903.0464] Clarke, Hall
  • [0701196] (math) Lee, Meng
  • [0805.4136] Genovese, Freeman, Wasserman, Nichol, Miller
  • [0705.2774] Kelly
  • [0910.1473] Lieshout
  • [0906.1698] Spokoiny
  • [0704.3704] Feroz, Hobson
  • [0711.2349] Muller, Welsh
  • [0711.3236] Kabaila, Giri
  • [0711.1917] Leng
  • [0802.0536] Wang
  • [0801.4627] Potscher, Scheider
  • [0711.0660] Potscher, Leeb
  • [0711.1036] Potscher
  • [0702781] (math.ST) Potscher
  • [0711.0993] Kabaila, Giri
  • [0802.0069] Ghosal, Lember, Vaart
  • [0704.1466] Leeb, Potscher
  • [0701781] (math) Grochenig, Potscher, Rauhut
  • [0702703] (math.ST) Leeb, Potscher
  • [astro-ph:0911.1777] Computing the Bayesian Evidence from a Markov Chain Monte Carlo Simulation of the Posterior Distribution (Martin Weinberg)
  • [0812.4933] Wiaux, Jacques (Compressed sensing, interferometry)
  • [0708.2184] Sung, Geyer
  • [0811.1705] Meyer
  • [0811.1700] Witten, Tibshirani
  • [0706.1703] Land, Slosar
  • [0712.1458] Loh, Zhu
  • [0808.4042] Commenges
  • [0806.3978] Vincent Vu, Bin Yu, Robert Kass
  • [0808.4032] Stigler
  • [0805.1944] astro-ph
  • [0807.1815] Cabella, Marinucci
  • [0808.0777] Buja, Kunsch
  • [0809.1024] Xu, Grunwald
  • [0807.4081] Roquain, Wiel
  • [0806.4105] Hofling, Tibshirani
  • [0808.0657] Hubert, Rousseeuw, Aelst
  • [0808.2902] Robert, Casella, A History of MCMC
  • [0809.2754] Grunwald, Vitanyi, Algorithmic Information Theory
  • [0809.4866] Carter, Raich, Hero, An information geometric framework for dimensionality reduction
  • [0809.5032] Allman, Matias, Rhodes
  • [0811.0528] Owen
  • [0811.0757] Chamandy, Taylor, Gosselin
  • [0810.3985] Stute, Wang
  • [0804.2996] Stigler
  • [0807.4086] Commenges, Sayyareh, Letenneur…
  • [0710.5343] Peng, Paul, MLE, functional PC, sparse longitudinal data
  • [0709.1648] Cator, Jongbloed, et al. *Asymptotics: Particles, Processes, and Inverse problems*
  • [0710.3478] *Hall, Qiu, Nonparametric Est. of a PSF in Multivariate Problems*
  • [0804.3034] Catalan, Isern, Garcia-Berro, Ribas (some stellar clusters, LF, Mass F, weighted least square)
  • [0801.1081] Hernandez, Valls-Gabaud, estimation of basic parameters, stellar populations
  • [0410072] (math.ST) Donoho, Jin, HC, detecting sparse heterogeneous mixtures
  • [0803.3863] Efron
  • [0706.4190] Rondonotti, Marron, Park, SiZer for time series
  • [0709.0709] Lian, Bayes and empirical Bayes changepoint problems
  • [0802.3916] Carvalho, Rocha, Hobson, PowellSnakes
  • [0709.0300] Rogers, Ferreras, Lahav, et al, Decoding the spectra of SDSS early-type galaxies
  • [0810.4807] Pesquet, et al. SURE, Signal/Image Deconvolution
  • [0906.0346] (cs.DM) Semiparametric estimation of a noise model with quantization errors
  • [0207026] (hep-ex) Barlow, Systematic Errors: Facts and Fictions
  • [0705.4199] Leccardi, Molendi, unbiased temperature estimator for statistically poor X-ray spectra
  • [0709.1208] Kuin, Rosen, measurement errors, Swift
  • [0708.4316] Farchione, *Kabaila*, confidence intervals for the normal mean utilizing prior information
  • [0708.0976] Singh, Xie, Strawderman, confidence distribution
  • [0901.0721] Albrecht, et al. (dark energy)
  • [0908.3593] Singh, Scott, Nowak, adaptive hausdorff estimation of density level sets
  • [0702052] (astro-ph) de Wit, Auchere, Multispectral analysis, Sun, EUV, morphology
  • [0706.1580] Lopes, photometric redshifts, SDSS
  • [0106038] (astro-ph) Richards et al photometric redshifts of quasars
]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-list/feed/ 0
Erich Lehmann http://hea-www.harvard.edu/AstroStat/slog/2009/erich-lehmann/ http://hea-www.harvard.edu/AstroStat/slog/2009/erich-lehmann/#comments Tue, 08 Dec 2009 04:46:34 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/2009/erich-lehmann/ He was one of the most frequently cited statisticians on this slog because of his influence on statistics. It is extremely difficult to avoid his textbooks and his foundational role in theoretical statistics when one begins to comprehend and appreciate modern theoretical statistics. To me, Testing Statistical Hypotheses and Theory of Point Estimation are the two pillars of graduate statistical education. In addition, Elements of Large Sample Theory and Nonparametrics: Statistical Methods Based on Ranks are also eye-openers.

It has not been long since I read Reminiscences of a Statistician: The Company I Kept. I quoted this book and an arXiv paper here; see the posts. I became very grateful to him for his contributions to statistical science. I feel so sad to see his obituary, particularly as I was soon going to have time to read his books more carefully.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/erich-lehmann/feed/ 0
From Quantile Probability and Statistical Data Modeling http://hea-www.harvard.edu/AstroStat/slog/2009/from-quantile-probability-and-statistical-data-modeling/ http://hea-www.harvard.edu/AstroStat/slog/2009/from-quantile-probability-and-statistical-data-modeling/#comments Sat, 21 Nov 2009 10:06:24 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=4115 by Emanuel Parzen in Statistical Science 2004, Vol 19(4), pp.652-662 JSTOR

I teach that statistics (done the quantile way) can be simultaneously frequentist and Bayesian, confidence intervals and credible intervals, parametric and nonparametric, continuous and discrete data. My first step in data modeling is identification of parametric models; if they do not fit, we provide nonparametric models for fitting and simulating the data. The practice of statistics, and the modeling (mining) of data, can be elegant and provide intellectual and sensual pleasure. Fitting distributions to data is an important industry in which statisticians are not yet vendors. We believe that unifications of statistical methods can enable us to advertise, “What is your question? Statisticians have answers!”

I couldn’t help liking this paragraph because of its bittersweetness. I hope you appreciate it as much as I did.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/from-quantile-probability-and-statistical-data-modeling/feed/ 0
some python modules http://hea-www.harvard.edu/AstroStat/slog/2009/python-module/ http://hea-www.harvard.edu/AstroStat/slog/2009/python-module/#comments Fri, 13 Nov 2009 21:46:54 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=2507 I was told to stay away from Python and I’ve sincerely obeyed the order. However, I collected the following material several months back, upon hearing about import inference, and I hate to see it become obsolete. At the time, I hoped that collecting these modules and working through them would help me complete the first step toward the quest Learning Python (the first posting of this slog).

There are quite a few websites dedicated to Python, as you already know. Some of them speak only to astronomers. A tiny fraction of those websites are for statisticians, but I haven’t met any statistician who prefers only Python; we take the gist of various languages. So, I’ll leave general website aggregations, such as AstroPy (I think this website is extremely useful for astronomers), to enrich your bookmarks under the “python” tab regardless of your profession. Instead, I’ll discuss some Python libraries and modules that can be useful for those exercising astrostatistics and can make their work easier. I must say that I intentionally omitted a few modules because I was not sure about their availability and copyright status. If you have modules that can be introduced publicly, let me know; I’ll be happy to add them. If my description is improper and you want it taken off, also let me know.

Over the past few years, Python has become the most common and versatile scripting language for both communities, and therefore, I believe, it will accelerate many collaborations. Much of my time is spent finding out how to read, maneuver, and handle raw data/images. Most of the tactics astronomers use are quite unfamiliar, sometimes unintelligible to me (see my read.table() and data analysis system and its documentation). Somehow, one scripting language, Python, thanks to its openness and free availability to all communities, is promising because it narrows the gap for prosperous and efficient collaborations.

The first posting on this slog was about Python. I thought that kicking off with a computer language relatively new and open to many communities could motivate me and others toward more interdisciplinary work with diversity. A few years later, unfortunately, I have not achieved that goal. Yet, I still think that the libraries and modules introduced below can be useful for your transition from other programming languages, or for writing your own pro bono wrapper for better communication with others.

I’ll take numpy, scipy, and RPy for granted. For plotting purposes, matplotlib seems the most common choice.
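
Just to show how little ceremony these carry, here is a minimal plotting sketch; the Gaussian toy data are, of course, a hypothetical stand-in for a real measured quantity:

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.random.randn(1000)              # toy data: 1000 standard normal draws
    plt.hist(x, bins=30, histtype='step')  # quick look at the sample distribution
    plt.xlabel('value')
    plt.ylabel('count')
    plt.show()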

Reading astronomical data (click links to download libraries, modules, and tutorials)

  • First, start with Using Python for Interactive Data Analysis (in pdf), a quite useful manual, particularly for IDL users. It compares the pros and cons of Python and IDL.
  • IDLsave: simply, without IDL, a .save file becomes legible. This is a brilliant little module.
  • PyRAF (I was really frustrated with IRAF and spent many sleepless nights. Apart from data reduction, I don’t remember much statistics from IRAF except simple statistics for Gaussian populations. I guess PyRAF does a better job). And there’s PyFITS for handling data in the FITS format; see the sketch after this list.
  • APLpy (the Astronomical Plotting Library in Python) is a Python module aimed at producing publication-quality plots of astronomical imaging data in FITS format (this introduction is copied from the APLpy site).
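
To give one concrete example from the list above, here is a minimal sketch of reading a FITS image with PyFITS; the filename image.fits and the EXPTIME keyword are hypothetical stand-ins:

    import pyfits  # on newer installations the same interface lives in astropy.io.fits

    hdulist = pyfits.open('image.fits')  # 'image.fits' is a placeholder filename
    hdulist.info()                       # print a summary of the HDUs in the file
    data = hdulist[0].data               # primary HDU image as a numpy array
    header = hdulist[0].header
    print(header.get('EXPTIME'))         # a keyword that may or may not be present
    hdulist.close()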

Statistics, Mathematics, or data science
With RPy around, introducing smaller modules may seem of little worth, but quite a few modules and libraries for statistics are available that do not rely on R.

  • MDP (Modular toolkit for Data Processing)
    Multivariate data analysis methods like PCA, ICA, FA, etc. have become very popular in the astronomical community.
  • pywavelets (not only the FT; various transform methodologies are often used, and the wavelet transform ranks at the top).
  • PyIMSL (see my post, PyIMSL)
  • PyMC I introduced this module ages ago. It may lack versatility or robustness owing to its parametric distribution objects, but I liked the tutorial very much, from which one can expand and devise one’s own working MCMC algorithm; a minimal sketch follows this list.
  • PyBUGS (I introduced this Python wrapper in the BUGS post, but the link to PyBUGS is not working anymore. I hope it revives.)
  • SAGE (Software for Algebra and Geometry Experimentation) is a free open-source mathematics software system licensed under the GPL (Link to the online tutorial).
  • python_statlib: descriptive statistics for the Python programming language.
  • PYSTAT Nice website but the product is not available yet. Be aware! It is not PhyStat!!!
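
As promised above, a minimal PyMC sketch, written against the classic PyMC 2.x interface; the simulated data and the vague prior are invented purely for illustration:

    import numpy as np
    import pymc

    data = np.random.normal(3.0, 1.0, size=100)   # simulated observations
    mu = pymc.Normal('mu', mu=0.0, tau=1e-4)      # vague prior on the unknown mean
    obs = pymc.Normal('obs', mu=mu, tau=1.0,
                      value=data, observed=True)  # likelihood with known variance
    M = pymc.MCMC([mu, obs])
    M.sample(iter=10000, burn=1000)               # run the sampler, discard burn-in
    print(M.trace('mu')[:].mean())                # posterior mean of mu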

Module for AstroStatistics
import inference (Unfortunately, the links to examples and tutorial are not available currently)

Without clear objectives, it is not easy to pick up a new language. If you are used to working with one from the alphabet soup, you will most likely adhere to your choice. Changing alphabets or transferring between languages only happens when your instructor specifically asks you to use their preferred language, or when the analysis {modules, libraries, tools} are only available in that preferred language. Somehow, thanks to its object-oriented style, Python makes transition and communication easier than other languages. Furthermore, script languages are more intuitive and better interpretable.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/python-module/feed/ 2
Quotes from Common Errors in Statistics http://hea-www.harvard.edu/AstroStat/slog/2009/quotes-from-common-errors-in-statistics/ http://hea-www.harvard.edu/AstroStat/slog/2009/quotes-from-common-errors-in-statistics/#comments Fri, 13 Nov 2009 17:13:01 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=4011 by P.I.Good and J.W.Hardin. Publisher’s website

My astronomer neighbor mentioned this book a while ago, and quite a bit later I found these intriguing quotes.

GIGO: Garbage in; garbage out. Fancy statistical methods will not rescue garbage data. Course notes of Raymond J. Carroll (2001)

I often see a statement like “data were grouped/binned to improve statistics.” This can hardly be true unless the astronomer knows the true underlying population distribution from which those realizations (either binned or unbinned) are randomly drawn. Nonetheless, smoothing/binning (modifying the sample) can help hypothesis testing to infer the population distribution. This validation step is often ignored, though. For the proper application of statistical procedures, I hope astronomers adopt the concepts in the design of experiments to collect good quality data without wasting resources. What I mean by wasting resources is that, given the instrumental and atmospheric limitations, indefinite exposure is not necessary to collect a good quality image. Instead of human eye inspection, a machine can do the job. I guess that minimax-type optimal points exist for operating telescopes, feature extraction/detection, or model/data quality assessment. Clarifying the sources of uncertainty and stratifying them for testing, sampling, and modeling purposes, as done in analysis of variance, is quite unexplored in astronomy. Instead, more effort goes into salvaging garbage, and so far many gems have been harvested through a tremendous amount of effort. But I’m afraid that it could become as painful as the gold miners’ experience during the mid-19th-century gold rush.

Interval Estimates (p.51)
A common error is to specify a confidence interval in the form (estimate – k*standard error, estimate + k*standard error). This form is applicable only when an interval estimate is desired for the mean of a normally distributed random variable. Even then k should be determined from tables of the Student’s t-distribution and not from tables of the normal distribution.

Getting the appropriate degrees of freedom seems most relevant to avoiding this error when the estimates are the coefficients of complex curves or the equation/model itself. The t-distribution with a large d.f. (>30) is hardly discernible from the z-distribution, as the sketch below illustrates.
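
A quick numerical check of that last sentence, sketched with scipy.stats (the degrees of freedom are arbitrary choices for illustration): the 97.5% quantile of the t-distribution approaches the normal value 1.96 as the d.f. grow.

    from scipy.stats import norm, t

    for df in (5, 10, 30, 100):
        # k for a two-sided 95% interval: Student's t versus the normal z
        print(df, round(t.ppf(0.975, df), 3), round(norm.ppf(0.975), 3))
    # at df=5, k is about 2.571; by df=30 the t value (about 2.042) is already
    # close to the normal 1.960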

Desirable estimators are impartial, consistent, efficient, robust, and minimum loss. Interval estimates are to be preferred to point estimates; they are less open to challenge for they convey information about the estimate’s precision.

Every Statistical Procedure Relies on Certain Assumptions for correctness.

What I often fail to find in the astronomy literature are these assumptions. Statistics is not an elixir for every problem; it works only under certain conditions.

Know your objectives in testing. Know your data’s origins. Know the assumptions you feel comfortable with. Never assign probabilities to the true state of nature, but only to the validity of your own predictions. Collecting more and better data may be your best alternative.

Unfortunately, the last sentence is not an option for astronomers.

From Guidelines for a Meta-Analysis
Kepler was able to formulate his laws only because (1) Tycho Brahe had made over 30 years of precise (for the time) astronomical observations and (2) Kepler married Brahe’s daughter and thus gained access to his data.

Not exactly the same, but it reflects some of today’s reality. Without gaining access to data, there’s not much one can do, and collecting data is very painstaking and time-consuming.

From Estimating Coefficient
…Finally, if the error terms come from a distribution that is far from Gaussian, a distribution that is truncated, flattened or asymmetric, the p-values and precision estimates produced by the software may be far from correct.

Please double-check the numbers from your software.

To quote Green and Silverman (1994, p. 50), “There are two aims in curve estimation, which to some extent conflict with one another, to maximize goodness-of-fit and to minimize roughness.”

Statistically significant findings should serve as a motivation for further corroborative and collateral research rather than as a basis for conclusions.

To be avoided are a recent spate of proprietary algorithms available solely in software form that guarantee to find a best-fitting solution. In the words of John von Neumann, “With four parameters I can fit an elephant and with five I can make him wiggle his trunk.” Goodness of fit is no guarantee of predictive success, …

If physics implies wiggles, then there’s nothing wrong with an extra parameter. But it is possible that the best-fit parameters including these wiggles are not the ultimate answer to astronomers’ exploration; the improvement can be just a bias due to introducing this additional parameter for wiggles in the model. Various statistical tests are available, and caution is needed before reporting best-fit parameter values (estimates) and their error bars (a small sketch of one such check follows).
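
As one concrete, if simplistic, illustration of penalizing the extra wiggle parameter, here is a hedged sketch comparing polynomial fits with AIC; the toy data, the candidate degrees, and the Gaussian-error AIC form are all assumptions of the example, not the book’s prescription:

    import numpy as np

    def aic(rss, n, k):
        # Gaussian-error AIC up to an additive constant: n*log(RSS/n) + 2k
        return n * np.log(rss / n) + 2 * k

    x = np.linspace(0.0, 1.0, 50)
    y = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(50)  # toy data with noise
    for deg in (3, 5, 9):
        coef = np.polyfit(x, y, deg)                 # least-squares polynomial fit
        rss = np.sum((y - np.polyval(coef, x)) ** 2)
        print(deg, aic(rss, len(y), deg + 1))        # lower AIC is preferred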

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/quotes-from-common-errors-in-statistics/feed/ 0
[ArXiv] Voronoi Tessellations http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-voronoi-tessellations/ http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-voronoi-tessellations/#comments Wed, 28 Oct 2009 14:29:24 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=94 As a part of exploring the spatial distribution of particles/objects, without approximating it via a Poisson process or a Gaussian process (parametric), and without imposing hypotheses such as homogeneous, isotropic, or uniform, various nonparametric methods drew my attention for data exploration and preliminary analysis. Among the various nonparametric methods, the one that I fell in love with is tessellation (state space approaches are excluded here). In terms of computational speed, I believe tessellation is faster than kernel density estimation for estimating level sets of multivariate data. Furthermore, constructing polygons from a tessellation is conceptually and intuitively simple. However, coding and improving the algorithms is beyond statistical research (check books titled or keyworded, at least in part, by computational geometry). The good news is that for computation and getting results, there are some freely available software packages and modules in various forms.

As a part of introducing nonparametric statistics, I wanted to write about applications of computational geometry from the perspective of nonparametric density estimation in two or three dimensions. Also, the following article came along just as I began to collect statistical applications in astronomy (my [ArXiv] series). This [arXiv] paper, in fact, prompted me to investigate Voronoi tessellations in astronomy in general.

[arxiv/astro-ph:0707.2877]
Voronoi Tessellations and the Cosmic Web: Spatial Patterns and Clustering across the Universe
by Rien van de Weygaert

Since then, quite some time has passed. In the meantime, I found more publications in astronomy specifically using tessellation as a main tool for nonparametric density estimation and data analysis. Nonetheless, topics in spatial statistics in general tend to be unrecognized or almost ignored in analyzing astronomical spatial data (I mean data points with coordinate information); many seem to utilize statistics only partially or not at all. Some might want to know how often Voronoi tessellation is applied in astronomy. Here, I listed the results of my ADS search, restricting “tessellation” to title keywords:

Then the topic was forgotten for a while, until this recent [arXiv] paper reminded me of my old intention of introducing tessellation for density estimation and for understanding large scale structures or clusters (astronomers’ jargon, not the term from machine or statistical learning).

[arxiv:stat.ME:0910.1473] Moment Analysis of the Delaunay Tessellation Field Estimator
by M.N.M van Lieshout

Looking at the plots in the papers by van de Weygaert or van Lieshout, without mathematical jargon and abstraction, one can immediately understand what Voronoi and Delaunay tessellations are (the Delaunay tessellation is also called the Delaunay triangulation (wiki); perhaps you want to check out wiki:Delaunay Tessellation Field Estimator as well). Voronoi tessellations have been adopted in many scientific/engineering fields to describe spatial distributions, and astronomy is not an exception. The Voronoi tessellation has also been used for field interpolation, as the sketch below suggests.
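
A minimal sketch of that interpolation idea on the Delaunay side, using modern SciPy (whose LinearNDInterpolator interpolates linearly on the Delaunay triangulation); the field and sample points are invented for illustration:

    import numpy as np
    from scipy.interpolate import LinearNDInterpolator

    pts = np.random.rand(200, 2)                          # scattered sample positions
    vals = np.sin(3 * pts[:, 0]) * np.cos(3 * pts[:, 1])  # toy field values
    interp = LinearNDInterpolator(pts, vals)  # linear interpolation on the
                                              # Delaunay triangulation of pts
    print(interp(0.5, 0.5))                   # field estimate at a new point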

van de Weygaert described Voronoi tessellations as follows:

  1. the asymptotic frame for the ultimate matter distribution,
  2. the skeleton of the cosmic matter distribution,
  3. a versatile and flexible mathematical model for weblike spatial pattern, and
  4. a natural asymptotic result of an evolution in which low-density expanding void regions dictate the spatial organization of the Megaparsec universe, while matter assembles in high-density filamentary and wall-like interstices between the voids.

van Lieshout derived explicit expressions for the mean and variance of the Delaunay Tessellation Field Estimator (DTFE) and showed that for stationary Poisson processes, the DTFE is asymptotically unbiased with a variance that is proportional to the squared intensity.
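
In the same spirit, a crude sketch of the Voronoi-side analogue in 2D, taking the density at each point as one over the area of its Voronoi cell and skipping unbounded boundary cells. This uses scipy.spatial.Voronoi, from SciPy releases more recent than this post, and it is only a toy estimate, not the DTFE itself:

    import numpy as np
    from scipy.spatial import Voronoi

    def voronoi_density(points):
        vor = Voronoi(points)
        dens = np.full(len(points), np.nan)
        for i, ridx in enumerate(vor.point_region):
            region = vor.regions[ridx]
            if -1 in region or len(region) == 0:
                continue                      # unbounded cell: area undefined, skip
            x = vor.vertices[region, 0]
            y = vor.vertices[region, 1]
            # shoelace formula for the polygon area
            area = 0.5 * abs(np.dot(x, np.roll(y, 1)) - np.dot(y, np.roll(x, 1)))
            dens[i] = 1.0 / area
        return dens

    pts = np.random.rand(500, 2)               # 500 points on the unit square
    print(np.nanmedian(voronoi_density(pts)))  # roughly 500, i.e. points per unit area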

We’ve observed voids and filaments of cosmic matter with patterns whose theory hasn’t been discovered. In general, those patterns are manifested via observed galaxies, both directly and indirectly. Individual observed objects, I believe, can be matched to the points that generate Voronoi polygons; each point represents its polygon, and investigating the polygons’ distributional properties helps in understanding the formation rules and theories behind those patterns. For that matter, probably various topics in stochastic geometry, not just Voronoi tessellation, can be adopted.

There is a plethora of information available on Voronoi tessellation, such as the website of the International Symposium on Voronoi Diagrams in Science and Engineering. Two recent meeting websites are ISVD09 and ISVD08. Also, the following review paper is interesting.

Centroidal Voronoi Tessellations: Applications and Algorithms (1999) Du, Faber, and Gunzburger in SIAM Review, vol. 41(4), pp. 637-676

By the way, you may have noticed my preference for the Voronoi tessellation over the Delaunay, owing to the characteristic of the centroidal Voronoi tessellation that each observation is the center of its own Voronoi cell, as opposed to the property of the Delaunay triangulation that multiple simplices are associated with one observation/point. However, from the perspective of understanding the distribution of observations as a whole, both approaches offer summaries and insights in a nonparametric fashion, which is what I put the most value on.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-voronoi-tessellations/feed/ 0