The AstroStat Slog » mixture
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

[Books] Bayesian Computations (Fri, 11 Sep 2009, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2009/books-bayesian-computations/

A number of practical Bayesian data analysis books are available these days. Here, I'd like to introduce two that were published relatively recently. I like that they are more technical than theoretical, that their practical examples are closely related to astronomical data, and that they come with R code, so one can try the algorithms on the fly instead of cramming probability theory.

Bayesian Computation with R
Author: Jim Albert
Publisher: Springer (2007)

As the title says, an accompanying R package, LearnBayes, is available (clicking the name will take you to the package download). Furthermore, the last chapter is about WinBUGS. (Please check out the resources listed in BUGS for the other BUGS variants; BUGS stands for Bayesian inference Using Gibbs Sampling.) Overall, it is quite practical and instructional. If a young astronomer wants to enter the competition posted below because sophisticated data demand nontraditional statistical modeling, this book can be a good starting point. (Here, traditional methods include brute-force Monte Carlo simulations, χ²/weighted least squares fitting, and test statistics with rigid underlying assumptions.)
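To give a flavor of the package, here is a minimal sketch in the spirit of the book's early chapters. The numbers are my own toy values, not an example from the book, and I am assuming the LearnBayes API as documented on CRAN (beta.select elicits a beta prior from two quantiles; triplot overlays prior, likelihood, and posterior for binomial data):

```r
## Toy illustration (my numbers, not the book's): elicit a beta prior and
## update it with binomial data using LearnBayes.
library(LearnBayes)

## prior belief about a detection fraction: median 0.3, 90th percentile 0.5
prior <- beta.select(list(p = 0.5, x = 0.3), list(p = 0.9, x = 0.5))
prior                       # the matching beta shape parameters (a, b)

## hypothetical data: 11 detections out of 27 trials
triplot(prior, c(11, 16))   # prior, likelihood, and posterior in one plot
```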

One quote stood out, because of a comment I once heard from an astronomer, "Bayesian is robust but frequentist is not," which I could not agree with at the time.

A Bayesian analysis is said to be robust to the choice of prior if the inference is insensitive to different priors that match the user’s beliefs.

Since frequentist methods involve no priors, robustness in this Bayesian sense cannot be matched against and compared with frequentist robustness. As in my discussion in Robust Statistics, I keep the notion that robust statistics means insensitivity to outliers or to the iid Gaussian model assumption. The latter, in particular, is almost always assumed in astronomical data analysis unless other models and probability densities are explicitly stated, such as Poisson counts or a Pareto distribution. New Bayesian algorithms are being invented to achieve robustness, not limited to the choice of prior but also covering the topics of frequentists' robust statistics.

The introduction to Bayesian computation focuses on analytic, simple parametric models and well-known probability densities. These models and their Bayesian analysis produce interpretable results. The Gibbs sampler, Metropolis-Hastings algorithms, and a few of their hybrids can handle scientific problems as long as the scientific models, and the uncertainties in both observations and parameters, can be transcribed into well-known probability density functions. I think astronomers will want to check Chap. 6 (MCMC) and Chap. 9 (Regression Models). Often, in order to establish a strong correlation between two variables, astronomers adopt simple linear regression models and fit the data to them. A priori knowledge enhances the flexibility of the fitting analysis, where Bayesian computation behaves robustly in ways that straightforward chi-square methods do not. The book contains neither sophisticated algorithms nor deep theory; it offers only the bare necessities and foundations for accommodating Bayesian computation to scientific needs.
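To give a flavor of what Chap. 6 covers, here is a minimal random-walk Metropolis sketch in plain R; the photon counts and the proposal scale are my own toy choices, not the book's:

```r
## Toy random-walk Metropolis: posterior of a Poisson log-rate for a set of
## hypothetical photon counts, under a flat prior on log(lambda).
set.seed(1)
counts <- c(3, 7, 4, 6, 2, 5)                # made-up photon counts

logpost <- function(theta)                   # theta = log(lambda)
  sum(dpois(counts, exp(theta), log = TRUE))

n.iter <- 5000
theta <- numeric(n.iter)
theta[1] <- log(mean(counts))
for (t in 2:n.iter) {
  prop <- theta[t - 1] + rnorm(1, 0, 0.3)    # symmetric random-walk proposal
  logr <- logpost(prop) - logpost(theta[t - 1])
  theta[t] <- if (log(runif(1)) < logr) prop else theta[t - 1]
}
quantile(exp(theta[-(1:1000)]), c(0.16, 0.5, 0.84))  # posterior rate summary
```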

The other book is

Bayesian Core: A Practical Approach to Computational Bayesian Statistics.
Authors: J.-M. Marin and C.P. Robert
Publisher: Springer (2007).

Although the book is written by statisticians, the very first real-data example is CMBdata (cosmic microwave background data; instead of "cosmic," the book says "cosmological." I am not sure which is correct, but I am used to CMB standing for cosmic microwave background). Surprisingly, the CMB becomes a very easy topic in statistics when treated in terms of testing normality and extreme values. Seeing real astronomy data first in the book was the primary reason for introducing it here. Also, it is a relatively slim volume (about 250 pages) compared with other Bayesian textbooks, given its broad coverage of topics in Bayesian computation. There are other practical real data sets illustrating Bayesian computation in the book, and these example data sets can be found on the book's website.

The book begins with R, then covers normal models, regression and variable selection, generalized linear models, capture-recapture experiments, mixture models, dynamic models, and image analysis.

I felt exuberant when I found that the book describes the law of large numbers (LLN), which justifies Monte Carlo methods. The LLN appears whenever an integral is approximated by a sum, something astronomers do all the time without referring to the law by name. For more information, here is a link to the Wikipedia entry on the Law of Large Numbers.
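Here is a minimal sketch of the LLN at work in Monte Carlo integration; the integrand is an arbitrary choice of mine, purely for illustration:

```r
## The LLN in action: approximate the integral of exp(-x)*cos(x) on [0, pi/2]
## by an average over uniform draws; the running mean settles on the truth.
set.seed(1)
f <- function(x) exp(-x) * cos(x)
x <- runif(1e5, 0, pi / 2)

mc <- cumsum(f(x)) / seq_along(x) * (pi / 2)   # running Monte Carlo estimate
exact <- integrate(f, 0, pi / 2)$value
c(mc.at.1e3 = mc[1e3], mc.at.1e5 = mc[1e5], exact = exact)
```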

Several MCMC algorithms can be mixed together within a single algorithm using either a circular or a random design. While this construction is often suboptimal (in that the inefficient algorithms in the mixture are still used on a regular basis), it almost always brings an improvement compared with its individual components. A special case where a mixed scenario is used is the Metropolis-within-Gibbs algorithm: When building a Gibbs sample, it may happen that it is difficult or impossible to simulate from some of the conditional distributions. In that case, a single Metropolis step associated with this conditional distribution (as its target) can be used instead.

The description in Sec. 4.2, Metropolis-Hastings Algorithms, should be especially appreciated and well comprehended by astronomers because of the historical origins of its topics: the detailed balance equation and the random walk.
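The Metropolis-within-Gibbs construction quoted above is easy to try. Below is a minimal sketch of my own (a toy, not the book's): the target is a correlated bivariate normal, one coordinate is updated by its exact conditional (the Gibbs step), and the other conditional is treated as if it were intractable and replaced by a single random-walk Metropolis step.

```r
## Toy Metropolis-within-Gibbs on a bivariate normal with correlation rho:
## x | y is sampled exactly; y | x gets one random-walk Metropolis step.
set.seed(1)
rho <- 0.8; n.iter <- 5000
xy <- matrix(0, n.iter, 2)

log.cond.y <- function(y, x) dnorm(y, rho * x, sqrt(1 - rho^2), log = TRUE)

for (t in 2:n.iter) {
  x <- rnorm(1, rho * xy[t - 1, 2], sqrt(1 - rho^2))  # exact Gibbs step for x
  y <- xy[t - 1, 2]
  prop <- y + rnorm(1, 0, 0.5)                        # one Metropolis step for y
  if (log(runif(1)) < log.cond.y(prop, x) - log.cond.y(y, x)) y <- prop
  xy[t, ] <- c(x, y)
}
cor(xy[-(1:1000), ])   # sample correlation should approach rho = 0.8
```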

My personal favorite is Section 6 on mixture models. Astronomers handle data from multiple populations (multiple epochs of star formation, single or multiple broken power laws, linear or quadratic models, metallicities from mergers or formation triggers, backgrounds + sources, environment-dependent point spread functions, and so on), and the chapter discusses the difficulties of the label switching problem (an identifiability issue that arises when coding the data into an MCMC or EM algorithm).

A completely different approach to the interpretation and estimation of mixtures is the semiparametric perspective. To summarize this approach, consider that since very few phenomena obey probability laws corresponding to the most standard distributions, mixtures such as \sum_{i=1}^k p_i f(x|\theta_i) (*) can be seen as a good trade-off between fair representation of the phenomenon and efficient estimation of the underlying distribution. If k is large enough, there is theoretical support for the argument that (*) provides a good approximation (in some functional sense) to most distributions. Hence, a mixture distribution can be perceived as a type of basis approximation of unknown distributions, in a spirit similar to wavelets and splines, but with a more intuitive flavor (for a statistician at least). This chapter mostly focuses on the “parametric” case, when the partition of the sample into subsamples with different distributions f_j does make sense from the dataset point of view (even though the computational processing is the same in both cases).

We must point at this stage that mixture modeling is often used in image smoothing but not in feature recognition, which requires spatial coherence and thus more complicated models…
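The basis-approximation idea quoted above is easy to visualize. In the sketch below, the weights, locations, and scales are eyeballed by me, purely for illustration; fitting them properly is what the chapter's EM and MCMC machinery is for.

```r
## Toy illustration of a mixture as a basis approximation: a three-component
## normal mixture, sum_i p_i * N(mu_i, sigma_i), mimicking a skewed gamma density.
p  <- c(0.50, 0.35, 0.15)     # eyeballed weights
mu <- c(1.2, 2.5, 4.5)        # eyeballed locations
s  <- c(0.6, 1.0, 1.5)        # eyeballed scales

dmix <- function(x)
  rowSums(sapply(1:3, function(i) p[i] * dnorm(x, mu[i], s[i])))

curve(dgamma(x, shape = 2), 0, 10, lwd = 2, ylab = "density")  # target density
curve(dmix(x), add = TRUE, lty = 2)                            # mixture (dashed)
```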

My patience ran out before I could comprehend every detail of the book, but the sections on reversible jump MCMC, hidden Markov models (HMM), and Markov random fields (MRF) look very useful. These topics appear often in image processing, a field where astronomers have their own algorithms. Adapting and comparing methods across image analysis promises new directions for scientific imaging data analysis beyond subjective denoising, smoothing, and segmentation.

For readers considering more advanced Bayesian computation and a rigorous treatment of MCMC methodology, I'd like to point to a textbook frequently cited by Marin and Robert:

Monte Carlo Statistical Methods, Robert, C. and Casella, G. (2004)
Springer-Verlag, New York, 2nd Ed.

There are a few more practical and introductory Bayesian analysis books recently published or soon to be published. Some readers may prefer such books in print. Perhaps there is, or will be, a Bayesian Computation with Python, IDL, Matlab, Java, or C/C++ for those who never intend to use R. By the way, Mathematica users will want to check out Phil Gregory's book, which I introduced in [books] a boring title. My point is that applied statistics has become friendlier to non-statisticians through these good introductory books and free online materials. I hope more astronomers apply statistical models in their data analysis without much trouble in executing Bayesian methods. Some might want to check BUGS, introduced in [BUGS]; that posting contains resources on how to use BUGS and the packages available in various languages.

The LRT is worthless for … (Fri, 25 Apr 2008, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2008/the-lrt-is-worthless-for/

One of the speakers in the Google talk series exemplified model-based clustering and mentioned the likelihood ratio test (LRT) for choosing the number of clusters. Since I have seen examples of badly practiced LRTs in astronomical journals, such as testing two clusters vs. three, or a higher number of components, I could not resist pointing out that the LRT was improperly used in his illustration. In reply, I was told that the citation regarding the LRT referred to something different from his plot: there, the test compared one component vs. two, which closely observes the regularity conditions. I was relieved not to find another example of the ill-used LRT.

There are various tests applicable depending on the needs and conditions of the data and the source models, but no popular astronomical lexicon seems to include these on-demand tests except the LRT (I have seen the score test once since I began posting [ArXiv] entries on the slog, and a few nonparametric rank-based tests over the years). I am sure knowledgeable astronomers will soon point out that I jumped to a conclusion too quickly and will bring up counterexamples. Until then, be advised that your LRTs, χ² tests, and F-tests demand statistical attention before they are applied for any statistical inference. These tests are not magic crystals that produce the answers you are looking for. To encourage such care and attention, here is a paper with a thought-provoking title that I found some years ago.

The LRT is worthless for testing a mixture when the set of parameters is large
J.-M. Azaïs, E. Gassiat, C. Mercadier (click here: I found it on the internet, but the link seems to be intermittently unavailable.)

Here, quotes replace theorems and their proofs[1]:

  • We prove in this paper that the LRT is worthless for testing a distribution against a two components mixture when the set of parameters is large.
  • One knows that the traditional Chi-square theory of Wilks[16[2]] does not apply to derive the asymptotics of the LRT due to a lack of identifiability of the alternative under the null hypothesis.
  • …for unbounded sets of parameters, the LRT statistic tends to infinity in probability, as Hartigan[7[3]] first noted for normal mixtures.
  • …the LRT cannot distinguish the null hypothesis (single gaussian) from any contiguous alternative (gaussian mixtures). In other words, the LRT is worthless[4].

For astronomers, a large set of parameters is usually of no concern, thanks to theoretical constraints from physics. Experience and theory bound the parameter set to be small. Sometimes, however, the distinction between small and large sets can be vague.

The characteristics of the LRT are well established under the compactness assumption on the parameter set (either compact or bounded), but trouble arises when the limit goes to the boundary. As cited a few times before on the slog, readers are recommended to read a more rigorous manifestation of these ideas about the LRT from the astronomy side: Protassov, et al. (2002) Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test, ApJ, 571, p. 545.
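Protassov et al. recommend calibrating the test by simulation, which is straightforward to sketch. The following toy simulation is mine, not theirs: it draws the null distribution of the LRT statistic for one normal component versus a two-component normal mixture over a bounded parameter box, and compares its 95% point with the χ² quantile that a naive reading of Wilks' theorem would suggest. The sample size, box bounds, and restart count are all illustrative assumptions.

```r
## Toy calibration of the LRT for one component vs. a two-component normal
## mixture (unit variances). Even on a bounded parameter set, the null
## distribution of the LRT need not follow the naive chi-square.
set.seed(1)

mix.loglik <- function(par, x) {        # par = (p, mu1, mu2)
  p <- par[1]
  sum(log(p * dnorm(x, par[2]) + (1 - p) * dnorm(x, par[3])))
}

lrt.stat <- function(x) {
  l0 <- sum(dnorm(x, mean(x), log = TRUE))   # H0: single normal, MLE mean
  l1 <- -Inf
  for (s in 1:5) {                           # restarts to dodge local maxima
    fit <- optim(c(runif(1, .1, .9), runif(2, -3, 3)), mix.loglik, x = x,
                 method = "L-BFGS-B", lower = c(.01, -5, -5),
                 upper = c(.99, 5, 5), control = list(fnscale = -1))
    l1 <- max(l1, fit$value)
  }
  2 * (l1 - l0)
}

lrt.null <- replicate(200, lrt.stat(rnorm(100)))  # data truly one component
quantile(lrt.null, 0.95)    # empirical 95% point of the LRT under H0
qchisq(0.95, df = 2)        # naive Wilks reference; the two need not agree
```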

  1. Readers might want to look up the mathematical statements and proofs in the paper.
  2. S.S. Wilks, The large-sample distribution of the likelihood ratio for testing composite hypotheses, Ann. Math. Stat., 9:60-62, 1938.
  3. J.A. Hartigan, A failure of likelihood asymptotics for normal mixtures, in Proc. Berkeley Conf. in Honor of J. Neyman and J. Kiefer (1985), Vol. II, pp. 807-810.
  4. Comment to Theorem 1. They prove the lack of worth of the LRT under more general settings; see Theorem 2.
[ArXiv] An unbiased estimator, May 29, 2007 (Tue, 30 Oct 2007, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-unbiased-estimator/

From arxiv/astro-ph:0705.4199v1
In search of an unbiased temperature estimator for statistically poor X-ray spectra
A. Leccardi and S. Molendi

There was a delay in writing about this paper, which by accident was lying under a pile of papers irrelevant to astrostatistics. (It has been quite overwhelming to track the papers on arxiv:astro-ph with various statistical applications, and those with room left for statistical improvement.) Although there is already a posting about this paper (see Vinay's posting), I'd like to give it a shot. I was very excited, because I had not seen any astronomical paper devoted solely to unbiased estimators.

By the same token that the authors discuss bias in the χ² method and the maximum likelihood estimator, we know that the χ² function is not always symmetric enough for applying Δχ² = 1 to obtain a 68% confidence interval. The nominal-level interval from the Δχ² method does not always provide the nominal coverage when the model being fitted does not satisfy the (regularity) conditions for the χ² approximation. The χ² best fit does not always converge (in probability or almost surely) to the true parameter; that is, it is biased, so the stated coverage level misrepresents the information about the true parameter. This illustration of bias in traditional estimators in high energy astronomy is followed by the authors' proposals of unbiased (temperature) estimators via (variable) transformation.
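The coverage failure of Δχ² = 1 at low counts is easy to demonstrate with a toy simulation of my own (the bin count, true rate, and weighting scheme below are illustrative assumptions, not taken from the paper):

```r
## Toy coverage check: fit a constant Poisson rate by chi-square with
## data-based weights and count how often the Delta-chi^2 = 1 interval
## covers the true rate; at low counts it falls short of the nominal 68%.
set.seed(1)
mu <- 2                                    # true mean counts per bin (low)
cover <- replicate(2000, {
  y <- rpois(20, mu)
  chisq <- function(m) sum((y - m)^2 / pmax(y, 1))   # data-based weights
  fit <- optimize(chisq, c(0.01, 20))
  lo <- uniroot(function(m) chisq(m) - fit$objective - 1,
                c(1e-3, fit$minimum))$root
  hi <- uniroot(function(m) chisq(m) - fit$objective - 1,
                c(fit$minimum, 50))$root
  lo <= mu && mu <= hi
})
mean(cover)    # noticeably below the nominal 0.68 at these low counts
```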

Transformation is one way of reducing bias (e.g., the Box-Cox or power transformation is a common practice in introductory statistics to make residuals more homogeneous). A transformation can lead an asymmetric distribution toward (asymptotic) symmetry. Contrary to the authors' comment (that the parametric bootstrap yielded no improvement in bias reduction), reducing the bias of likelihood-based fits (Cash statistics) can be attempted with statistical subsampling methods such as cross-validation, the jackknife, and the bootstrap, given carefully designed subsampling schemes (instead of the parametric bootstrap, the nonparametric bootstrap could yield a different conclusion). Penalized likelihood can help as well: an L1-norm penalty, instead of the L2 norm (the χ² measure is L2), also helps to reduce bias.
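For instance, here is a minimal jackknife bias-correction sketch of my own (a textbook toy, not the authors' setting): the maximum likelihood estimator of a normal variance is biased by the factor (n-1)/n, and the jackknife removes that bias without knowing the formula.

```r
## Toy jackknife bias correction: the variance MLE, mean((x - mean(x))^2),
## underestimates the true variance; the leave-one-out recipe corrects it.
set.seed(1)
x <- rnorm(50, sd = 2)            # true variance = 4
n <- length(x)

var.mle <- function(x) mean((x - mean(x))^2)

theta.hat <- var.mle(x)
theta.loo <- sapply(1:n, function(i) var.mle(x[-i]))   # leave-one-out values
bias.jack <- (n - 1) * (mean(theta.loo) - theta.hat)   # jackknife bias estimate
c(mle = theta.hat, corrected = theta.hat - bias.jack, unbiased = var(x))
```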

One of the useful discussions about unbiased estimators is the comparison between the χ² best-fit method and Cash statistics (the maximum Poisson likelihood estimator). Overall, Cash statistics excels over the χ² best-fit method; yet neither overcomes the bias arising from low counts, small exposure time, the background level, and the asymmetric pdf (probability density function) of T (temperature), their parameter of interest. Their last step toward an unbiased estimator takes a nonparametric approach, constructing a mixture of three pdfs to estimate the uncertainty. They conclude that the results from the mixing distributions are excellent. This mixing of distributions reduces errors by averaging. Personally, their claim that it is "the only method that returns the expected temperature under very different situations" seems overstated. Either designing more efficient mixing distributions (unequally weighted triplets rather than equal weights) or defining M-estimators based on an understanding of the three EPIC instruments could yield a better degree of unbiasedness.

Note that the maximum likelihood estimator (MLE) is a consistent estimator (asymptotically unbiased) under milder regularity conditions, in contrast to the χ² best-fit estimator. Instead of merely stating that the MLE can be biased, it would have been better to discuss whether the regularity conditions suit source models built on Poisson photon counts for estimating temperatures, and to examine the XSPEC estimation procedures.

Last, I’d like to quote their question as it is:

What are the effects of pure statistical uncertainties in determining interesting parameters of highly non linear models (e.g. the temperature of the ICM), when we analyze spectra accumulated from low surface brightness regions using current X-ray experiments?

Although the authors tried to answer this question, my personal opinion is that they could not fully answer it; but they left spacious room for estimating statistical uncertainty and bias rigorously in high energy astrophysics with more statistical care (e.g., instead of the MLE or Cash statistics, one could develop a more robust but unbiased M-estimator).

[ArXiv] Identifiability and mixtures of distributions, Aug. 3, 2007 (Fri, 07 Sep 2007, by hlee)
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-identifiability-and-mixtures/

From arxiv/math.st: 0708.0499v1
Inference for mixtures of symmetric distributions, by Hunter, Wang, and Hettmansperger, Annals of Statistics, 2007, Vol. 35(1), pp. 224-251.

Consider the case of fitting a spectral line, on top of a continuum, with a delta function or a Gaussian (normal) density. Among the many regularity conditions, the one I personally find most bothersome is identifiability. When the scale parameter σ goes to zero, we cannot tell which model, delta or Gaussian, is the right one. Furthermore, the likelihood ratio test cannot be applied to the delta function due to its discontinuity. For the classical confidence intervals and hypothesis tests that astronomers know from Numerical Recipes, identifiability and the set properties (topology) of the model parameters suffer from a lack of attention from astronomers who perform statistical inference on model parameters. I have found a few astronomical papers that ignored identifiability and yet used likelihood ratio tests for the discovery of an extra component. Clearly, these are statistical malpractices.
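The trouble at σ → 0 can be seen in a few lines. The sketch below is my own toy (not from the paper): the data come from a flat continuum alone, yet the likelihood of a continuum-plus-Gaussian-line model grows without bound as the line narrows onto any single data point, so the likelihood alone cannot settle whether a line is present.

```r
## Toy demonstration of the sigma -> 0 degeneracy: continuum-only data, but
## the continuum + narrow Gaussian "line" model has unbounded likelihood.
set.seed(1)
x <- runif(100, 0, 10)                  # data from a flat continuum only

loglik <- function(sigma, mu, p = 0.05)
  sum(log((1 - p) * dunif(x, 0, 10) + p * dnorm(x, mu, sigma)))

mu <- x[1]                              # center the "line" on one data point
sapply(c(1, 0.1, 0.01, 0.001), loglik, mu = mu)
## the log-likelihood diverges as sigma -> 0, though no line exists
```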

Although math.st:0708.0499 does not discuss spectral line fitting, it offers a nice review of identifiability for inference on mixtures of symmetric distributions.
