The AstroStat Slog » Feigelson

Kaplan-Meier Estimator (Equation of the Week)

vlk — Wed, 09 Jul 2008 17:00:54 +0000

The Kaplan-Meier (K-M) estimator is the non-parametric maximum likelihood estimator of the survival probability of items in a sample. “Survival” here is a historical holdover: the method was first developed to estimate patient survival chances in medicine, but in general it can be thought of as a form of cumulative probability. It is of great importance in astronomy because so much of our data are limited, and this estimator provides an excellent way to estimate the fraction of objects that may be below (or above) certain flux levels. The application of K-M to astronomy was explored in depth in the mid-1980s by Jurgen Schmitt (1985, ApJ, 293, 178), Feigelson & Nelson (1985, ApJ, 293, 192), and Isobe, Feigelson, & Nelson (1986, ApJ, 306, 490). [See also Hyunsook's primer.] It has been coded up and is available for use as part of the ASURV package.

Consider a simple case where you have observations of the luminosities of N sources. Let us say that all N sources have been detected and their luminosities are estimated to be L_i, i=1..N, and that they are ordered such that L_i < L_i+1. Then it is easy to see that the fraction of sources above each L_i can be written as the sequence

{ N-1, N-2, N-3, … 2, 1, 0}/N

The K-M estimator is a generalized form that describes this sequence, and is written as a product. The probability that an object in the sample has luminosity greater than L_k is

S(L>L₁) = (N-1)/N
S(L>L₂) = (N-1)/N * ((N-1)-1)/(N-1) = (N-1)/N * (N-2)/(N-1) = (N-2)/N
S(L>L₃) = (N-1)/N * ((N-1)-1)/(N-1) * ((N-2)-1)/(N-2) = (N-3)/N
…
S(L>L_k) = Π_i=1..k (n_i-1)/n_i = (N-k)/N

where n_i is the number of objects still remaining at luminosity level L ≥ L_i (here n_i = N-i+1), and at each stage one object is removed to account for the drop in the sample size.
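The telescoping of this product is easy to check numerically. The following is a minimal sketch using exact fractions; the function name `survival_all_detected` is made up for illustration, and the only assumption is the one stated above, that all N objects are detected with distinct luminosities.

```python
from fractions import Fraction

def survival_all_detected(N):
    """Return [S(L > L_1), ..., S(L > L_N)] built up as a running product."""
    S = Fraction(1)
    out = []
    for k in range(1, N + 1):
        n_k = N - k + 1                  # objects with L >= L_k
        S *= Fraction(n_k - 1, n_k)      # one object drops out at L_k
        out.append(S)
    return out

N = 5
S = survival_all_detected(N)
# Each term of the running product should collapse to (N - k) / N.
assert S == [Fraction(N - k, N) for k in range(1, N + 1)]
print([str(s) for s in S])
```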

Now that was for the case when all the objects are detected. But now suppose some are not, and only upper limits to their luminosities are available. A specific value of L cannot be assigned to these objects, and the only thing we can say is that they will “drop out” of the set at some stage. In other words, the sample will be “censored”. The K-M estimator is easily altered to account for this, by changing the decrement in each term of the product to include the censored points. Thus, the general K-M estimator is

S(L>L_k) = Π_i=1..k (n_i-c_i)/n_i

where c_i are the number of objects that drop out between L_i-1 and L_i.
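As a rough illustration, here is a minimal sketch of the product-limit calculation in its usual textbook form, which may differ in bookkeeping from the c_i notation above: each factor is (n_i - d_i)/n_i, where d_i is the number of detections at L_i, and censored objects simply leave the remaining set n_i without contributing a factor of their own. The function name `kaplan_meier` and the example numbers are invented for illustration. The sketch is written for right-censored data (the true value lies above the recorded limit); upper limits are left-censored, and a common way to handle them is to flip the sign of the variable first.

```python
from fractions import Fraction

def kaplan_meier(values, detected):
    """Product-limit estimate of S(L > value) at each distinct detected value.

    values   : measured values or limits
    detected : True for a detection, False for a censored point whose true
               value is only known to exceed the recorded limit (assumption:
               right-censoring)
    Returns a list of (value, S) pairs in ascending order of value.
    """
    data = sorted(zip(values, detected))
    n_at_risk = len(data)                # objects with value >= current level
    S = Fraction(1)
    curve = []
    i = 0
    while i < len(data):
        v = data[i][0]
        d = c = 0
        # Count detections (d) and censored points (c) tied at this value.
        while i < len(data) and data[i][0] == v:
            if data[i][1]:
                d += 1
            else:
                c += 1
            i += 1
        if d > 0:
            S *= Fraction(n_at_risk - d, n_at_risk)
            curve.append((v, S))
        # Both detections and censored points leave the remaining set.
        n_at_risk -= d + c
    return curve

# Hypothetical sample: five objects, two of them censored.
for value, S in kaplan_meier([1, 2, 3, 4, 5], [True, False, True, True, False]):
    print(value, S)
```

With every object detected, this reduces to the (N-k)/N sequence of the uncensored case.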

Note that the K-M estimator is a maximum likelihood estimator of the cumulative probability (actually one minus the cumulative probability as it is usually understood), and uncertainties on it must be estimated via Monte Carlo or bootstrap techniques [or not.. see below].
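One way the bootstrap route could look is sketched below: resample the (value, detected) pairs with replacement, recompute the estimate each time, and read off a percentile interval. This is only a sketch under stated assumptions. The function names, the right-censoring convention, the 16th/84th percentile choice for a roughly 68% interval, and the example numbers are all mine, not taken from ASURV or the papers cited above.

```python
import random

def km_survival_at(values, detected, x):
    """Product-limit estimate of S(L > x), assuming right-censored data."""
    # Sort by value; at ties, detections are placed before censored points.
    data = sorted(zip(values, detected), key=lambda p: (p[0], not p[1]))
    n = len(data)
    S = 1.0
    for rank, (v, is_det) in enumerate(data):
        if v > x:
            break
        if is_det:
            at_risk = n - rank           # objects with value >= v still in the set
            S *= (at_risk - 1) / at_risk
    return S

def bootstrap_km_interval(values, detected, x, n_boot=1000, seed=0):
    """Percentile bootstrap interval (roughly 68%) for S(L > x)."""
    rng = random.Random(seed)            # fixed seed so the result is reproducible
    pairs = list(zip(values, detected))
    estimates = []
    for _ in range(n_boot):
        # Resample whole (value, flag) pairs so censoring stays attached.
        sample = [rng.choice(pairs) for _ in pairs]
        vals, dets = zip(*sample)
        estimates.append(km_survival_at(vals, dets, x))
    estimates.sort()
    lo = estimates[int(0.16 * n_boot)]
    hi = estimates[int(0.84 * n_boot) - 1]
    return lo, hi

# Hypothetical sample: five objects, two of them censored.
vals = [1, 2, 3, 4, 5]
dets = [True, False, True, True, False]
print(km_survival_at(vals, dets, 3), bootstrap_km_interval(vals, dets, 3))
```

With a sample this small the interval is very coarse; the point is only to show the resampling loop.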

Survival Analysis: A Primer

hlee — Tue, 08 Jul 2008 23:27:38 +0000

Astronomers confront various kinds of censored and truncated data. Often the resulting effects are named after the famous scientists who characterized them, as with Eddington bias. When censored or truncated data become the subject of study in statistics, statisticians do not name them but try to model them so that the uncertainty can be quantified. This area is called survival analysis. If your library subscribes to The American Statistician and you are an astronomer who handles censored or truncated data sets, this primer would be useful for getting a quick grasp of the statistical jargon of survival analysis and for characterizing the uncertainties residing in your data.

Survival Analysis: A Primer by David A. Freedman
The American Statistician, May 2008, Vol. 62, No.2, pp. 110-119

This article explains the basics of survival analysis and criticizes some previously conducted studies. Since the examples are drawn from medical studies, astronomers may not be interested in reading the whole article. Nonetheless, Freedman presents the definitions used in survival analysis, such as the survival function, the hazard rate, the Kaplan-Meier estimator, and the proportional hazards model, with clarity and conciseness. For example, if τ (a positive random variable indicating the waiting time for failure) is Weibull, the hazard rate takes exactly the form of the power law so celebrated in astronomy. (I think modifying pdfs to reflect censoring and truncation may lead to more robust results than fitting power laws, unless the parameters of the power laws have astrophysical implications and survival analysis approaches cannot provide the same parametrization.)
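The Weibull remark can be checked directly. Taking the common shape/scale parametrization as an assumption (survival function exp(-(t/scale)^shape)), the hazard rate, defined as the pdf divided by the survival function, works out to (shape/scale)(t/scale)^(shape-1), a pure power law in t with index shape-1. A minimal sketch, with parameter values chosen arbitrarily:

```python
import math

def weibull_pdf(t, shape, scale):
    """Weibull density in the shape/scale parametrization (assumed here)."""
    z = t / scale
    return (shape / scale) * z ** (shape - 1) * math.exp(-(z ** shape))

def weibull_survival(t, shape, scale):
    """S(t) = P(tau > t) for the same parametrization."""
    return math.exp(-((t / scale) ** shape))

def hazard(t, shape, scale):
    """Hazard rate: density of failing at t, given survival up to t."""
    return weibull_pdf(t, shape, scale) / weibull_survival(t, shape, scale)

shape, scale = 2.5, 3.0                  # arbitrary illustrative values
t1, t2 = 0.5, 4.0
# Slope of log(hazard) against log(t); a power law gives a constant slope.
slope = (math.log(hazard(t2, shape, scale)) - math.log(hazard(t1, shape, scale))) / (
    math.log(t2) - math.log(t1)
)
print(slope, shape - 1)
```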

The kinship between power laws and Pareto distributions, together with the frequent appearance of power laws in astronomical journals, might lead one to expect frequent applications of survival analysis to astronomical data; on the contrary, there are not many.

Though there are more, here are a few references relevant to survival analysis that used examples from astronomy or appeared in astronomical journals:

Nonparametric Methods for Doubly Truncated Data by B Efron and V Petrosian. (subscription required)
Journal of the American Statistical Association, Vol. 94, pp. 824-834 (1999)
Survival Analysis of the Gamma-Ray Burst Data by B Efron and V Petrosian. (subscription required)
Journal of the American Statistical Association, Vol. 89, pp. 452-464 (1994)
A simple test of independence for truncated data with applications to redshift surveys by B Efron and V Petrosian
ApJ, Vol. 399, pp.345-352 (1992)
Statistical methods for astronomical data with upper limits. I – Univariate distributions by Feigelson and Nelson
ApJ, Vol. 293, pp.192-206 (1985)
Nonparametric Estimation of the Slope of a Truncated Regression by Bhattacharya, Chernoff, and Yang (subscription required)
The Annals of Statistics, Vol. 11(2), pp. 505-514 (1983)

Note that these papers address only particular statistical questions, each with a general introduction to survival analysis and definitions of estimators, and are based on data sets of relatively small sample size. Facing massive astronomical survey data with truncation and heterogeneous measurement errors could open a new era of survival analysis.

Lastly, there are studies of the Pareto distribution, some of which have been presented in the slog. (Use “search” with Pareto. More statistical papers on survival analysis in astronomy are welcome additions; please let me know.)

Astrostatistics: Goodness-of-Fit and All That!

hlee — Wed, 15 Aug 2007 02:17:00 +0000

For my project presentation at the International X-ray Summer School, I tried to explain how χ^2 statistics are misapplied in astronomy. If your best fit is biased (any misspecification of the model easily causes such bias), do not use χ^2 statistics to get a 1σ error that is supposed to give a 68% chance of capturing the true parameter.

Later, I decided to investigate the subject further and came across this paper: Astrostatistics: Goodness-of-Fit and All That! by Babu and Feigelson.

First, the authors pointed out that the χ^2 method 1) is inappropriate when errors are non-gaussian, 2) does not provide clear decision procedures between models with different numbers of parameters or between acceptable models, and 3) makes it difficult to obtain confidence intervals on parameters when complex correlations between the parameters are present. As a remedy, they introduced distribution-free tests, such as the Kolmogorov-Smirnov (K-S) test, the Cramer-von Mises (C-vM) test, and the Anderson-Darling (A-D) test. Among these, the K-S test is well known to astronomers, but it has often been overlooked that the results from these tests become unreliable when the data come from a multivariate distribution. Furthermore, K-S tests fail when the same data set is used both for parameter estimation and for computing the empirical distribution function.

The authors proposed resampling schemes to overcome the above shortcomings, presenting both parametric and nonparametric bootstrap methods, and went on to model comparison, particularly when the models are not nested. The best-fit model can be chosen from among the candidate models based on their distances (e.g. Kullback-Leibler distance) to the unknown hypothetical true model.
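To make the parametric bootstrap idea concrete, here is a minimal sketch, not the authors' procedure as published: fit a model to the data, compute the K-S statistic against the fitted model, then simulate from the fitted model, refit on each simulated set, and see how often the simulated statistic is at least as large as the observed one. The Gaussian model, the function names, and the sample are assumptions made purely for illustration.

```python
import random
from statistics import NormalDist, fmean, pstdev

def ks_statistic(data, cdf):
    """Largest gap between the empirical distribution function and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        F = cdf(x)
        # The EDF jumps from i/n to (i+1)/n at x; check both sides of the step.
        d = max(d, (i + 1) / n - F, F - i / n)
    return d

def fit_normal(data):
    """Maximum likelihood Gaussian fit (illustrative model choice)."""
    return NormalDist(fmean(data), pstdev(data))

def bootstrap_ks_pvalue(data, n_boot=500, seed=0):
    """Parametric bootstrap p-value for the K-S statistic with fitted parameters."""
    rng = random.Random(seed)
    fitted = fit_normal(data)
    d_obs = ks_statistic(data, fitted.cdf)
    n = len(data)
    exceed = 0
    for _ in range(n_boot):
        sim = [rng.gauss(fitted.mean, fitted.stdev) for _ in range(n)]
        refit = fit_normal(sim)          # re-estimate parameters on every replicate
        if ks_statistic(sim, refit.cdf) >= d_obs:
            exceed += 1
    return d_obs, exceed / n_boot

# Hypothetical sample drawn from a Gaussian, so the fit should not be rejected often.
sample_rng = random.Random(42)
sample = [sample_rng.gauss(10.0, 2.0) for _ in range(50)]
print(bootstrap_ks_pvalue(sample, n_boot=200))
```

Refitting inside the loop is the key step: it lets the reference distribution of the statistic reflect the fact that the parameters were estimated from the same data.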

AstroStatistics Summer School at PSU

hlee — Mon, 29 Jan 2007 04:34:20 +0000

Since Summer 2005, G. Jogesh Babu (Statistics) and Eric Feigelson (Astronomy) have organized lectures and lab sessions on statistics for astronomers and physicists. The lecturers are professors from the Penn State statistics department and invited renowned scientists from different countries. The students come from diverse backgrounds as well. Within a week or so, students hear lectures ranging from Statistics 101 to recently published statistical theory, applied in particular to astronomical data. They also learn how to use R, a statistical software package and scripting language, to carry out the statistics they learn in the lectures. Over the past two years, this summer school has proved its uniqueness and usefulness. More information on the upcoming school can be found at http://astrostatistics.psu.edu/su07/index.html and other topics regarding astrostatistics at Center for AstroStatistics at Penn State.