The AstroStat Slog » confidence interval
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders
Fri, 09 Sep 2011 17:05:33 +0000

[MADS] plug-in estimator
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-plug-in-estimator/
Tue, 21 Apr 2009 02:34:40 +0000, by hlee

I asked a couple of astronomers whether they had heard the term plug-in estimator, and none of them gave me a positive answer.

When we compute the sample mean (x̄) and sample variance (s²), form the interval (x̄ − s, x̄ + s), and claim that this interval covers 68%, the sample mean, the sample variance, and the interval itself are all plug-in estimators: functions of the data substituted for the unknown true quantities. Once the form of the sampling distribution is clarified, or once the estimators of the sample mean and sample variance are verified to match the true mean and true variance, the plug-in qualifier can be dropped, because asymptotically such an interval (estimator) will cover 68%.
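As a toy illustration (my own sketch in Python, with a hypothetical Gaussian sample, nothing from a real data set), one can check by simulation how often the plug-in interval (x̄ − s, x̄ + s) actually captures a new draw from the same population:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_interval(x):
    """Plug-in 68% interval: sample mean +/- sample std (both plug-in estimators)."""
    xbar = x.mean()
    s = x.std(ddof=1)          # plug-in substitute for the true sigma
    return xbar - s, xbar + s

# How often does a future draw from the same population fall in the interval?
n, n_sim = 20, 10_000
hits = 0
for _ in range(n_sim):
    x = rng.normal(5.0, 2.0, size=n)     # hypothetical Gaussian population
    lo, hi = plugin_interval(x)
    hits += lo <= rng.normal(5.0, 2.0) <= hi
print(hits / n_sim)   # close to, but not exactly, 0.68 at finite n
```

For small n the empirical coverage drifts away from the nominal 68%; it only approaches 68% asymptotically, which is the point of the paragraph above.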

When the sample size is small, or the theoretical assumptions are not satisfied, instead of saying 1σ one should say s, a plug-in error estimator. Without knowing the true distribution (approximated asymptotically by the empirical distribution), the label 1σ misleads readers into believing that the best fit and its error bar guarantee 68% coverage, which is not necessarily true. What is actually computed/estimated is s, a plug-in estimator, defined for instance via Δχ² = 1. In statistics, the Greek letter σ generally denotes a parameter, not a function of the data (an estimator) such as the sample standard deviation (s), the root mean square error (rmse), or the solution of Δχ² = 1.
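The Δχ² = 1 recipe can be sketched with a simple grid scan; the constant model, data, and measurement errors below are hypothetical choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_true = 2.0
y = rng.normal(10.0, sigma_true, size=50)      # hypothetical measurements
err = np.full_like(y, sigma_true)              # assumed known measurement errors

def chisq(mu):
    """Chi-square of a constant model mu fitted to the data."""
    return np.sum(((y - mu) / err) ** 2)

mu_grid = np.linspace(y.mean() - 2, y.mean() + 2, 4001)
chi = np.array([chisq(m) for m in mu_grid])
best = mu_grid[chi.argmin()]
# the "1-sigma" plug-in error: where chi-square rises by 1 above its minimum
inside = mu_grid[chi <= chi.min() + 1.0]
lo, hi = inside.min(), inside.max()
print(best, hi - best)
```

For this Gaussian toy case the half-width from Δχ² = 1 reproduces the textbook σ/√n, but what the procedure returns in general is a function of the data, an estimator, not the parameter σ itself.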

I often see loose uses of statistics and related terms in the astronomical literature, which invite unnecessary comments and creative interpretations to account for unpleasant numbers. Because of their plug-in nature, intervals may fail to cover the value expected from physics. This can arise from chi-square minimization (the best fit can be biased) and from data quality (data may contain outliers, or be modified by instruments and processing). Unless robust statistics are employed (outliers can shift best fits, and robust statistics are less sensitive to outliers) and calibration uncertainty or other corrections are suitably implemented, strange intervals need be neither explained away with creative comments nor discarded. Such intervals are by-products of plug-in estimators whose statistical properties are unknown at the astronomer's data-analysis stage. Instead of imaginative interpretation, one should investigate these plug-in estimators and try to devise or rectify them so that they lead close to the truth.

For example, the simple average (x̄ = f(x_1, …, x_n); the average is a function of the data, just as the chi-square minimization method is another function of the data) has an asymptotic breakdown point of zero and can be pulled far from the truth, whereas the median (another function of the data) can serve better, with a breakdown point of 1/2. The chi-square methods are based on the L2 norm (e.g., variants of least squares). One can instead develop methods based on the L1 norm, as in quantile regression or least absolute deviation (LAD) regression. Many statistics are available to work around the shortcomings of popular plug-in estimators when the sampling distribution is not (perfectly) Gaussian or no analytic solution exists.
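The breakdown-point difference takes only a few lines to demonstrate (hypothetical numbers):

```python
import numpy as np

x = np.array([9.8, 10.1, 10.0, 9.9, 10.2])   # hypothetical clean sample
x_bad = np.append(x, 1000.0)                 # one corrupted value (an outlier)

# one wild value drags the mean far from the bulk of the data ...
print(np.mean(x), np.mean(x_bad))      # 10.0 vs 175.0
# ... while the median barely moves (breakdown point 1/2 vs 0 for the mean)
print(np.median(x), np.median(x_bad))  # 10.0 vs 10.05
```

A single outlier among six points suffices to make the mean useless, while the median would keep reporting a sensible center even if nearly half the sample were corrupted.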

[ArXiv] 1st week, June 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-june-2008/
Mon, 09 Jun 2008 01:45:45 +0000, by hlee

Despite no statistics-related discussion, a paper comparing XSPEC and ISIS, two open-source spectral-analysis applications, might draw high-energy astrophysicists' interest this week.

  • [astro-ph:0806.0650] Kimball and Ivezić
    A Unified Catalog of Radio Objects Detected by NVSS, FIRST, WENSS, GB6, and SDSS (The catalog is available HERE. I am always fascinated by the possibilities in catalog data sets that machine learning and statistics can explore, and I do hope that the measurement-error columns get recognition from non-astronomers.)

  • [astro-ph:0806.0820] Landau and Simeone
    A statistical analysis of the data of Δα/α from quasar absorption systems (It discusses Student's t-tests, from which confidence intervals for unknown variances and sample sizes based on Type I and II errors are obtained.)

  • [stat.ML:0806.0729] R. Girard
    High dimensional gaussian classification (Model-based classification via a Gaussian mixture approach, often referred to as clustering in astronomy, is very popular for multidimensional astronomical data)

  • [astro-ph:0806.0520] Vio and Andreani
    A Statistical Analysis of the “Internal Linear Combination” Method in Problems of Signal Separation as in CMB Observations (Independent component analysis, ICA is discussed)

  • [astro-ph:0806.0560] Noble and Nowak
    Beyond XSPEC: Towards Highly Configurable Analysis (The flow of spectral analysis with XSPEC and Sherpa has not gone smoothly for me; it has been a personal struggle. The paper seems to treat XSPEC as a black box, with which I completely agree. Its main objective is a comparison of XSPEC and ISIS)

  • [astro-ph:0806.0113] Casandjian and Grenier
    A revised catalogue of EGRET gamma-ray sources (The maximum likelihood detection method, which I have never encountered in the statistical literature, is utilized)
Signal Processing and Bootstrap
http://hea-www.harvard.edu/AstroStat/slog/2008/signal-processing-and-bootstrap/
Wed, 30 Jan 2008 06:33:25 +0000, by hlee

Astronomers have developed their ways of processing signals mostly independently of, but sometimes in collaboration with, engineers, although the fundamental goal of signal processing is the same: extracting information. Doubtless these two parallel roads have pointed in opposite directions, one toward the sky and the other toward the earth. Nevertheless, without much argument, we can say that statistics has served as the medium of signal processing for both scientists and engineers. This particular issue of the IEEE Signal Processing Magazine may shed light for astronomers interested in signal processing and statistics outside the astronomical community.

IEEE Signal Processing Magazine Jul. 2007 Vol 24 Issue 4: Bootstrap methods in signal processing

This link shows the table of contents and provides links to the articles; however, access to the papers requires an IEEE Xplore subscription via a library or an individual IEEE membership. Here I attempt to introduce some of the articles and tutorials.

Special topic on bootstrap:
The guest editors (A.M. Zoubir & D.R. Iskander)[1] open the issue in their editorial, Bootstrap Methods in Signal Processing, by providing the rationale: the Gaussian noise assumption is occasionally invalid, and the resulting models are complex. A practical approach has been Monte Carlo simulation, but the cost of repeating experiments is problematic. The suggested alternative is the bootstrap, which provides tools for designing detectors for various signals subject to noise or interference from unknown distributions. The bootstrap is a computer-intensive tool for answering inferential questions, and this issue serves as a set of tutorials introducing this computationally intensive statistical method to the signal processing community.

The first tutorial, Bootstrap Methods and Applications, is written by the two guest editors. It begins with a list of bootstrap methods and emphasizes their resilience; it discusses the number of bootstrap samples needed to keep the simulation (Monte Carlo) error small relative to the statistical error, and sampling methods for dependent data, with real examples. The flowchart in Fig. 9 summarizes guidelines for how to use the bootstrap methods.
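For readers who want the flavor without the subscription, here is a minimal percentile-bootstrap sketch in Python (the data and the number of resamples are my own hypothetical choices, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=40)   # hypothetical non-Gaussian amplitudes

# nonparametric bootstrap: resample the data with replacement, recompute the statistic
B = 2000
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])

# percentile bootstrap 95% confidence interval for the mean
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(x.mean(), (lo, hi))
```

No Gaussian assumption enters anywhere: the confidence limits come entirely from the empirical distribution of the resampled statistic, which is exactly the selling point the editorial makes for non-Gaussian noise.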

The second tutorial is Jackknifing Multitaper Spectrum Estimates (D.J. Thomson), which introduces the jackknife and multitaper estimates of spectra, and applies the former to the latter with real data sets. The author adds his reasons for preferring the jackknife to the bootstrap and discusses the underlying assumptions of resampling methods.
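A quick sketch of the leave-one-out jackknife for the standard error of the mean (hypothetical data; for the sample mean the jackknife reproduces the classical s/√n exactly, which makes it a handy sanity check before applying it to harder statistics such as spectrum estimates):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=30)     # hypothetical sample

# leave-one-out jackknife estimate of the standard error of the mean
n = x.size
loo = np.array([np.delete(x, i).mean() for i in range(n)])   # n leave-one-out means
se_jack = np.sqrt((n - 1) / n * np.sum((loo - loo.mean()) ** 2))

print(se_jack, x.std(ddof=1) / np.sqrt(n))   # the two agree for the sample mean
```

The same recipe (delete one observation, recompute, pool the spread of the n pseudo-estimates) carries over to statistics with no closed-form error formula, which is where it earns its keep.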

Instead of listing all articles from the special issue, a few astrostatistically notable articles are chosen:

  • Bootstrap-Inspired Techniques in Computational Intelligence (R. Polikar) explains the bootstrap for estimating errors; the bagging, boosting, and AdaBoost algorithms; and other bootstrap-inspired techniques in ensemble systems, with a discussion of missing data.
  • Bootstrap for Empirical Multifractal Analysis (H. Wendt, P. Abry & S. Jaffard) explains block bootstrap methods for dependent data, bootstrap confidence limits, and bootstrap hypothesis testing, in addition to multifractal analysis. Because of my unfamiliarity with wavelet leaders, instead of paraphrasing, I quote the article's conclusions:

    First, besides being mathematically well-grounded with respect to multifractal analysis, wavelet leaders exhibit significantly enhanced statistical performance compared to wavelet coefficients. … Second, bootstrap procedures provide practitioners with satisfactory confidence limits and hypothesis test p-values for multifractal parameters. Third, the computationally cheap percentile method achieves already excellent performance for both confidence limits and tests.

  • Wild Bootstrap Test (J. Franke & S. Halim) discusses residual-based nonparametric tests and the wild bootstrap for regression models, applicable to signal/image analysis. Their test checks for differences between two irregular signals/images.
  • Nonparametric Estimates of Biological Transducer Functions (D.H. Foster & K. Zychaluk) I like the part where they discuss the generalized linear model (GLM), which is useful for extending model-fitting techniques in astronomy beyond Gaussian errors and least squares. They also mention that the bootstrap is simpler for obtaining confidence intervals.
  • Bootstrap Particle Filtering (J.V. Candy) is a very pleasant read on Bayesian signal processing and particle filters. It overviews MCMC and state-space models, and explains resampling as a remedy for the shortcomings of importance sampling in signal processing.
  • Compressive Sensing (R.G. Baraniuk)

    This lecture note presents a new method to capture and represent compressible signals at a rate significantly below the Nyquist rate, employing nonadaptive linear projections that preserve the structure of the signal.

I hope this brief summary helps you select a few interesting articles.

  1. They wrote a book, the bootstrap and its application in signal processing.
[ArXiv] 2nd week, Jan. 2007
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/
Fri, 11 Jan 2008 19:44:44 +0000, by hlee

It is notable that there is an astronomy paper containing AIC, BIC, and Bayesian evidence in its title. The topic of the paper, unexceptionally, is cosmology, like the other astronomy papers that discuss these (statistical) information criteria (I have found only a couple of papers on model selection applied to astronomical data analysis that do not involve CMB matters; note that I exclude the Bayes factor for model selection purposes).

To find the paper or other interesting ones, click

  • [astro-ph:0801.0638]
    AIC, BIC, Bayesian evidence and a notion on simplicity of cosmological model M Szydlowski & A. Kurek

  • [astro-ph:0801.0642]
    Correlation of CMB with large-scale structure: I. ISW Tomography and Cosmological Implications S. Ho et.al.

  • [astro-ph:0801.0780]
    The Distance of GRB is Independent from the Redshift F. Song

  • [astro-ph:0801.1081]
    A robust statistical estimation of the basic parameters of single stellar populations. I. Method X. Hernandez and D. Valls–Gabaud

  • [astro-ph:0801.1106]
    A Catalog of Local E+A(post-starburst) Galaxies selected from the Sloan Digital Sky Survey Data Release 5 T. Goto (Carefully built catalogs are wonderful sources for classification/supervised learning, or semi-supervised learning)

  • [astro-ph:0801.1358]
    A test of the Poincare dodecahedral space topology hypothesis with the WMAP CMB data B.S. Lew & B.F. Roukema

In cosmology, the few candidate models to be compared are generally nested: a larger model usually carries extra terms relative to the smaller ones. How the penalty for those extra terms is defined leads to different model selection criteria. However, astronomy papers generally never discuss the consistency or statistical optimality of these selection criteria; at most they present Monte Carlo simulations and extensive comparisons across criteria. Nonetheless, my personal view is that the field of model selection should be promoted among astronomers, to prevent the fallacy of blindly fitting models that may be irrelevant to the information the data set contains. Physics suggests a correct model, but the data do the same.
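To make the penalty difference concrete, here is a hypothetical sketch of AIC and BIC applied to nested polynomial models (the data, seed, and parameter counting are my own illustrative choices, not from any of the papers above):

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 60)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3, x.size)   # hypothetical data: truly linear

n = x.size
aic, bic = {}, {}
for k in (1, 2, 3, 4):                   # nested polynomial models of increasing degree
    coef = np.polyfit(x, y, k)
    rss = np.sum((y - np.polyval(coef, x)) ** 2)
    p = k + 2                            # polynomial coefficients plus noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)  # Gaussian max log-likelihood
    aic[k] = 2 * p - 2 * loglik          # fixed penalty of 2 per extra parameter
    bic[k] = np.log(n) * p - 2 * loglik  # heavier penalty log(n) per extra parameter

print(min(aic, key=aic.get), min(bic, key=bic.get))
```

Because log(n) > 2 for n > 7, BIC penalizes the extra terms more heavily than AIC and can never select a larger nested model than AIC does; whether either criterion is consistent for the truth is exactly the property that, as noted above, rarely gets discussed.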

[ArXiv] 4th week, Nov. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-4th-week-nov-2007/
Sat, 24 Nov 2007 13:26:40 +0000, by hlee

A thought from my stay in Korea: just as not many statisticians are interested in modern astronomy while they look for data-driven problems, not many astronomers learn up-to-date statistics while they borrow statistics for their data analysis. Astronomers cite statistical journals as rarely as statisticians introduce astronomical data-driven problems. I wonder how other fields lowered such barriers decades ago.

No matter what, there are preprints from this week that may help to shrink the chasm.

  • [stat.ME:0711.3236]
    Confidence intervals in regression utilizing prior information P. Kabaila and K. Giri
  • [stat.ME:0711.3271]
    Computer model validation with functional output M. J. Bayarri, et. al.
  • [astro-ph:0711.3266]
    Umbral Fine Structures in Sunspots Observed with Hinode Solar Optical Telescope R. Kitai, et.al.
  • [astro-ph:0711.2720]
    Magnification Probability Distribution Functions of Standard Candles in a Clumpy Universe C. Yoo et.al.
  • [astro-ph:0711.3196]
    Upper Limits from HESS AGN Observations in 2005-2007 HESS Collaboration: F. Aharonian, et al
  • [astro-ph:0711.2509]
    Shrinkage Estimation of the Power Spectrum Covariance Matrix A. C. Pope and I. Szapudi
  • [astro-ph:0711.2631]
    Statistical properties of extragalactic sources in the New Extragalactic WMAP Point Source (NEWPS) catalogue J. González-Nuevo, et. al.
[ArXiv] Post Model Selection, Nov. 7, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-post-model-selection-nov-7-2007/
Wed, 07 Nov 2007 15:57:01 +0000, by hlee

Today's arxiv-stat email included papers by Poetscher and Leeb, who have been working on post-model-selection inference. Model selection is sometimes mistaken for a part of statistical inference; simply put, model selection is a step prior to inference. How do you know your data are from a chi-square distribution rather than a gamma distribution? (This is a model selection problem with nested models.) Should I estimate the degrees of freedom k of the chi-square, or α and β of the gamma, to obtain the mean and its error? Will the errors of the mean be the same under both distributions?

Prior to estimating the means and errors of parameters, one wishes to choose a model in which the parameters of interest are properly embedded. The problem is that the same data are used both to choose a model (e.g., the model with the largest likelihood or Bayes factor) and to perform statistical inference (estimating parameters, calculating confidence intervals, and testing hypotheses), which inevitably introduces bias. Such bias has generally been neglected (a priori one declares which model to choose: e.g., the second-order polynomial is the absolute truth and the residuals are realizations of the error term; by the way, how can one be sure the errors follow a normal distribution?). Asymptotics lets this bias be of order O(n^m) with m < 0. Estimating this bias has been popular since Akaike introduced AIC (one of the best-known model selection criteria), and numerous works are found in the field of robust penalized likelihood. Variable selection has been a very hot topic in recent decades. Beyond my knowledge, there have been further approaches to keeping this bias from contaminating the inference results.
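A small simulation makes the bias tangible. In this hypothetical sketch (my own construction, not taken from the papers below), the same data choose between the nested models μ = 0 and μ free via AIC, and the naive 95% confidence interval is then computed whenever the free-mean model wins:

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true, n, n_sim = 0.1, 10, 20_000   # hypothetical: small true effect, sigma = 1 known

selected, covered = 0, 0
for _ in range(n_sim):
    x = rng.normal(mu_true, 1.0, size=n)
    xbar = x.mean()
    # AIC comparison of the nested models mu = 0 vs mu free:
    # the free-mean model wins exactly when n * xbar^2 > 2
    if n * xbar ** 2 > 2:
        selected += 1
        lo, hi = xbar - 1.96 / np.sqrt(n), xbar + 1.96 / np.sqrt(n)
        covered += lo <= mu_true <= hi

print(covered / selected)   # far below the nominal 0.95
```

Conditional on selecting the free-mean model, the naive interval covers the true μ far less often than 95%, because selection truncates and biases the distribution of x̄; this is precisely the kind of post-model-selection failure analyzed in the papers listed here.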

The works by Professors Poetscher and Leeb look unique to me in the line of resolving the intrinsic bias arising from inference after model selection. Instead of being listed in my weekly arxiv roundups, their arxiv papers deserve a separate posting. I have also included some more general references.

The list of paper from today’s arxiv:

  • [stat.TH:0702703] Can one estimate the conditional distribution of post-model-selection estimators? by H. Leeb and B. M. Pötscher
  • [stat.TH:0702781] The distribution of model averaging estimators and an impossibility result regarding its estimation by B. M. Pötscher
  • [stat.TH:0704.1466] Sparse Estimators and the Oracle Property, or the Return of Hodges’ Estimator by H. Leeb and B. M. Poetscher
  • [stat.TH:0711.0660] On the Distribution of Penalized Maximum Likelihood Estimators: The LASSO, SCAD, and Thresholding by B. M. Poetscher, and H. Leeb
  • [stat.TH:0701781] Learning Trigonometric Polynomials from Random Samples and Exponential Inequalities for Eigenvalues of Random Matrices by K. Groechenig, B.M. Poetscher, and H. Rauhut

Other resources:

[Added on Nov.8th] There were a few more relevant papers from arxiv.

  • [stat.AP:0711.0993] Upper bounds on the minimum coverage probability of confidence intervals in regression after variable selection by P. Kabaila and K. Giri
  • [stat.ME:0710.1036] Confidence Sets Based on Sparse Estimators Are Necessarily Large by B. M. Pötscher
[ArXiv] Poisson Mixture, Aug. 16, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-poisson-mixture/
Fri, 17 Aug 2007 22:15:57 +0000, by hlee

From arxiv/math.st:0708.2153v1
Estimating the number of classes by Mao and Lindsay

This study could be linked to identifying the number of lines in Poisson x-ray count data, one of the key interests of astronomers. However, as the authors point out, estimating the number of classes is a difficult statistical problem. I. J. Good[1] said that

I don’t believe it is usually possible to estimate the number of species, but only an appropriate lower bound to that number. This is because there is nearly always a good chance that there are a very large number of extremely rare species.


The authors have been working on Poisson mixture models for genetic data, and I wonder whether anything could be extracted for astronomical applications. Poisson mixture models also describe coverage problems, beyond line identification. Summarizing the body of the paper without mathematical equations seems impossible, so only the abstract is quoted.

Abstract:
Estimating the unknown number of classes in a population has numerous important applications. In a Poisson mixture model, the problem is reduced to estimating the odds that a class is undetected in a sample. The discontinuity of the odds prevents the existence of locally unbiased and informative estimators and restricts confidence intervals to be one-sided. Confidence intervals for the number of classes are also necessarily one-sided. A sequence of lower bounds to the odds is developed and used to define pseudo maximum likelihood estimators for the number of classes.

  1. courtesy of the paper: Estimating the number of species: A review by Bunge and Fitzpatrick (1993), JASA, 88, 364-373.
[ArXiv] Classical confidence intervals, June 25, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-classical-confidence-intervals-june-25-2007/
Wed, 27 Jun 2007 18:23:02 +0000, by hlee

From arXiv:physics.data-an/0706.3622v1:
Comments on the unified approach to the construction of classical confidence intervals

This paper comments on classical confidence intervals and upper limits, and on the so-called flip-flopping problem: the two are related asymptotically (when n is large enough) by definition, but because of the Poisson nature of the data, one cannot be converted into the other while preserving the same coverage.

I have heard a few discussions about classical confidence intervals and upper limits from particle physicists and theoretical statisticians. Nonetheless, not having been in the business from the beginning (either from the point when particle physicists became aware of statistics for obtaining coverage and upper limits, or from Neyman's publication: 1937, Phil. Trans. Royal Soc. London A, 236, p. 333) makes it hard for me to grasp the essence of this flip-flopping problem. On the other hand, I can sense that many statistical challenges (for both classical and Bayesian statisticians) reside in this flip-flopping problem, and I wish for some tutorials or chronological reviews on the subject.
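To see where the flip-flopping hurts, here is a toy simulation in the standard Gaussian-measurement setting (my own illustrative policy and thresholds, not taken from the commented paper): quote a 90% upper limit when the measurement is small, a 90% central interval otherwise, and then measure the actual coverage:

```python
import numpy as np

rng = np.random.default_rng(6)

def flip_flop_covers(mu, n_sim=100_000):
    """Coverage of a flip-flopping policy: quote a 90% upper limit when the
    measurement x ~ N(mu, 1) is small (x < 3), a 90% central interval otherwise."""
    x = rng.normal(mu, 1.0, size=n_sim)
    small = x < 3
    cover = np.where(small,
                     mu <= x + 1.282,           # 90% one-sided upper limit
                     np.abs(x - mu) <= 1.645)   # 90% two-sided central interval
    return cover.mean()

for mu in (0.5, 2.5, 5.0):
    print(mu, round(flip_flop_covers(mu), 3))
# near mu ~ 2.5 the actual coverage dips to about 0.85, below the nominal 0.90
```

Letting the data decide which kind of interval to quote breaks the coverage guarantee of both, even though each interval is individually valid, which is the undercoverage that unified constructions were designed to cure.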
