The AstroStat Slog » Cash statistics
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[ArXiv] 3rd week, Apr. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-apr-2008/
Mon, 21 Apr 2008 01:05:55 +0000, by hlee

The dichotomy of outliers: should outliers be detected in order to be discarded, or in order to be investigated? Should a statistic be robust enough not to be influenced by outliers, or sensitive enough to flag anomalies in the data distribution? Although not directly related, one paper about outliers this week made me dwell on what outliers are. This week's topics are diverse.

  • [astro-ph:0804.1809] H. Khiabanian, I.P. Dell’Antonio
    A Multi-Resolution Weak Lensing Mass Reconstruction Method (Maximum likelihood approach; my naive eyes sensed a certain degree of relationship to the GREAT08 CHALLENGE)

  • [astro-ph:0804.1909] A. Leccardi and S. Molendi
    Radial temperature profiles for a large sample of galaxy clusters observed with XMM-Newton

  • [astro-ph:0804.1964] C. Young & P. Gallagher
    Multiscale Edge Detection in the Corona

  • [astro-ph:0804.2387] C. Destri, H. J. de Vega, N. G. Sanchez
    The CMB Quadrupole depression produced by early fast-roll inflation: MCMC analysis of WMAP and SDSS data

  • [astro-ph:0804.2437] P. Bielewicz, A. Riazuelo
    The study of topology of the universe using multipole vectors

  • [astro-ph:0804.2494] S. Bhattacharya, A. Kosowsky
    Systematic Errors in Sunyaev-Zeldovich Surveys of Galaxy Cluster Velocities

  • [astro-ph:0804.2631] M. J. Mortonson, W. Hu
    Reionization constraints from five-year WMAP data

  • [astro-ph:0804.2645] R. Stompor et al.
    Maximum Likelihood algorithm for parametric component separation in CMB experiments (separate section for calibration errors)

  • [astro-ph:0804.2671] Peeples, Pogge, and Stanek
    Outliers from the Mass–Metallicity Relation I: A Sample of Metal-Rich Dwarf Galaxies from SDSS

  • [astro-ph:0804.2716] H. Moradi, P.S. Cally
    Time-Distance Modelling In A Simulated Sunspot Atmosphere (discusses systematic uncertainty)

  • [astro-ph:0804.2761] S. Iguchi, T. Okuda
    The FFX Correlator

  • [astro-ph:0804.2742] M Bazarghan
    Automated Classification of ELODIE Stellar Spectral Library Using Probabilistic Artificial Neural Networks

  • [astro-ph:0804.2827] S.H. Suyu et al.
    Dissecting the Gravitational Lens B1608+656: Lens Potential Reconstruction (Bayesian)
Cross-validation for model selection
http://hea-www.harvard.edu/AstroStat/slog/2007/cross-validation-for-model-selection/
Mon, 20 Aug 2007 03:35:48 +0000, by hlee

One of the most frequently cited papers in model selection is An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike's Criterion by M. Stone, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1 (1977), pp. 44-47.
(Akaike's 1974 paper, which introduced the Akaike Information Criterion (AIC), is the most often cited paper on the subject of model selection.)

The popularity of AIC comes from its simplicity: by penalizing the maximized log likelihood with the number of model parameters (p), one can choose the model that best describes/generates the data. Nonetheless, we know that AIC has its shortcomings: all candidate models must be nested within each other and come from the same parametric family. For an exponential family, the trace of the product of the score-function covariance and the inverse Fisher information reduces to the number of parameters, which naturally raises the question, "what happens when this trace cannot be obtained analytically?"
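To make the AIC recipe above concrete, here is a minimal sketch (my own toy example, not from the post): nested polynomial regression models are fit to data generated from a line, and the model minimizing AIC = -2 log L + 2p is preferred. The profiled-out Gaussian noise variance counts as one of the p parameters.

```python
import numpy as np

def gaussian_loglike(y, yhat):
    """Maximized Gaussian log-likelihood with the error variance
    profiled out (sigma^2 set to the mean squared residual)."""
    n = len(y)
    sigma2 = np.mean((y - yhat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def aic(loglike, n_params):
    """AIC = -2 log L + 2p; smaller is better."""
    return -2.0 * loglike + 2.0 * n_params

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 200)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, x.size)  # truth: a degree-1 polynomial

for deg in (0, 1, 2, 3):                  # nested candidate models
    coef = np.polyfit(x, y, deg)
    yhat = np.polyval(coef, x)
    p = deg + 2                           # polynomial coefficients + noise variance
    print(deg, round(aic(gaussian_loglike(y, yhat), p), 1))
```

With enough data the degree-1 model wins decisively over the constant model, while higher degrees gain too little fit to offset the +2p penalty.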

The general form of AIC is TIC (Takeuchi's information criterion; Takeuchi, 1976), whose penalty term is written as the trace of the product of the score-function covariance and the inverse Fisher information. Still, this does not answer the question above.
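For reference, the TIC penalty can be written compactly (standard notation, not taken from the post; here ℓ is the maximized log likelihood, s_i the per-observation score, and f the model density):

```latex
\mathrm{TIC} = -2\,\ell(\hat{\theta}) + 2\,\operatorname{tr}\!\left(\hat{J}^{-1}\hat{I}\right),
\qquad
\hat{I} = \frac{1}{n}\sum_{i=1}^{n} s_i(\hat{\theta})\, s_i(\hat{\theta})^{\top},
\qquad
\hat{J} = -\frac{1}{n}\sum_{i=1}^{n} \nabla_{\theta}^{2} \log f(x_i \mid \hat{\theta}).
```

When the model is correctly specified, Î ≈ Ĵ, the trace reduces to the number of parameters p, and TIC falls back to AIC.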

I personally think that the trick to avoiding this dilemma is the key content of Stone (1977): using cross-validation. Stone proved that computing the log likelihood by cross-validation is asymptotically equivalent to AIC, without computing the score function and Fisher information or obtaining an exact count of the number of parameters. Cross-validation makes it possible to obtain penalized maximum log likelihoods across models (penalizing is necessary because the parameters are estimated from the data), so that comparing models for selection becomes feasible, while alleviating worries about getting the proper number of parameters (the penalization).
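Stone's equivalence can be seen numerically in a toy setting (again my own sketch, assuming a simple Gaussian model with mean and variance, p = 2): the leave-one-out cross-validated log likelihood lands close to the in-sample maximum log likelihood minus p, i.e. close to AIC/(-2), with no explicit parameter counting.

```python
import numpy as np

def loo_cv_loglike(x):
    """Leave-one-out cross-validated log likelihood for a Gaussian model:
    refit (mean, variance) on n-1 points, score the held-out point."""
    total = 0.0
    for i in range(len(x)):
        rest = np.delete(x, i)
        mu, sigma2 = rest.mean(), rest.var()   # MLEs on the training fold
        total += -0.5 * (np.log(2 * np.pi * sigma2) + (x[i] - mu) ** 2 / sigma2)
    return total

rng = np.random.default_rng(1)
x = rng.normal(5.0, 2.0, 500)

# In-sample maximized log likelihood and its AIC-style penalized version
mu, sigma2 = x.mean(), x.var()
loglike = -0.5 * len(x) * (np.log(2 * np.pi * sigma2) + 1)
print("in-sample :", loglike)
print("AIC/(-2)  :", loglike - 2)         # p = 2 parameters
print("LOO-CV    :", loo_cv_loglike(x))   # close to the line above
```

No score function, Fisher information, or parameter count enters the LOO-CV column; the penalty emerges automatically from predicting held-out data.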

Numerous tactics are available for model selection. Variable selection (where candidate models are generally nested) is a very hot topic in statistics these days, with tons of publications, but when it comes to applying resampling methods to model selection there are not many works. As Stone proved, cross-validation relieves the difficulty of calculating the score function and Fisher information of a model. Until last year I was working on non-nested model selection (selecting the best model from different parametric families) with the jackknife, together with Prof. Babu and Prof. Rao at Penn State (the paper has not been submitted yet), based on the finding that the jackknife yields an unbiased maximum likelihood. Despite its higher computational cost compared to cross-validation and the jackknife, the bootstrap has also occasionally appeared in the model-selection literature.
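The debiasing idea behind the jackknife can be illustrated generically (this is only the standard delete-one jackknife applied to the Gaussian variance MLE, not a reproduction of the unpublished method mentioned above): the plug-in variance estimator is biased by a factor (n-1)/n, and the jackknife bias correction recovers the unbiased estimator.

```python
import numpy as np

def jackknife_bias_correct(x, stat):
    """Delete-one jackknife bias correction of a plug-in statistic:
    stat_jack = n * stat(x) - (n - 1) * mean_i stat(x without i)."""
    n = len(x)
    full = stat(x)
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    return n * full - (n - 1) * loo.mean()

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 30)

mle_var = np.var(x)                           # biased MLE (divides by n)
jack_var = jackknife_bias_correct(x, np.var)  # matches the unbiased estimator
print(mle_var, jack_var, np.var(x, ddof=1))
```

For the variance statistic the correction is exact; for general maximum likelihood quantities it removes the leading O(1/n) bias term.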

I’m not sure whether cross-validation or the jackknife is a feasible approach to implement in astronomical software when it computes statistics. Certainly they have advantages when it comes to calculating likelihoods, such as the Cash statistic.
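For readers unfamiliar with it, the Cash (1979) statistic is just -2 times the Poisson log likelihood with the data-only factorial term dropped; a minimal sketch (my own toy example, with a flat source model as the assumed form):

```python
import numpy as np

def cash(counts, model):
    """Cash (1979) statistic for Poisson counts n_i and model rates m_i:
    C = 2 * sum(m - n * log(m)), i.e. -2 log L without the log(n!) term."""
    counts = np.asarray(counts, dtype=float)
    model = np.asarray(model, dtype=float)
    return 2.0 * np.sum(model - counts * np.log(model))

# Toy spectrum: C is minimized at the maximum-likelihood model rate
counts = np.array([3, 5, 4, 7, 2])
rates = np.linspace(2.0, 8.0, 601)
c = [cash(counts, np.full(counts.shape, r)) for r in rates]
best = rates[int(np.argmin(c))]
print(best)  # for a flat model, the MLE is the mean of the counts
```

Because the statistic is a (shifted) log likelihood rather than a chi-square built from binned Gaussian approximations, it is exactly the kind of quantity that likelihood-based selection tools such as cross-validation operate on.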
