The AstroStat Slog » LRT

The chance that A has nukes is p%

hlee — Fri, 23 Oct 2009 17:26:07 +0000

I watched a movie in which one of the characters said, “country A has nukes with 80% chance” (perhaps, not 80% but it was a high percentage). One of the statements in that episode is that people will not eat lettuce only if the 1% chance of e coli is reported, even lower. Therefore, with such a high percentage of having nukes, it is right to send troops to A. This episode immediately brought me a thought about astronomers’ null hypothesis probability and their ways of concluding chi-square goodness of fit tests, likelihood ratio tests, or F-tests.

First of all, I’d like to ask how you would like to estimate the chance of having nukes in a country? What this 80% implies here? But, before getting to the question, I’d like to discuss computing the chance of e coli infection, first.

From the frequentists perspective, computing the chance of e coli infection is investigating sample of lettuce and counts species that are infected: n is the number of infected species and N is the total sample size. 1% means one among 100. Such percentage reports and their uncertainties are very familiar scene during any election periods for everyone. From Bayesian perspective, Pr(p|D) ~ L(D|p) pi(p), properly choosing likelihoods and priors, one can estimate the chance of e coli infection and uncertainty. Understanding of sample species and a prior knowledge helps to determine likelihoods and priors.

How about the chance that country A has nukes? Do we have replicates of country A so that a committee investigate each country and count ones with nukes to compute the chance? We cannot do that. Traditional frequentist approach, based on counting, does not work here to compute the chance. Either using fiducial likelihood approach or Bayesian approach, i.e. carefully choosing a likelihood function adequately (priors are only for Bayesian) allows one to compuate such probability of interest. In other words, those computed chances highly depend on the choice of model and are very subjective.

So, here’s my concern. It seems like that astronomers want to know the chance of their spectral data being described by a model (A*B+C)*D (each letter stands for one of models such as listed in Sherpa Models). This is more like computing the chance of having nukes in country A, not counting frequencies of the event occurrence. On the other hand, p-value from goodness of fit tests, LRTs, or F-tests is a number from the traditional frequentists’ counting approach. In other words, p-value accounts for, under the null hypothesis (the (A*B+C)*D model is the right choice so that residuals are Gaussian), how many times one will observe the event (say, reduced chi^2 >1.2) if the experiments are done N times. The problem is that we only have one time experiment and that one spectrum to verify the (A*B+C)*D is true. Goodness of fit or LRT only tells the goodness or the badness of the model, not the statistically and objectively quantified chance.

In order to know the chance of the model (A*B+C)*D, like A has nuke with p%, one should not rely on p-values. If you have multiple models, one could compute pairwise relative chances i.e. odds ratios, or Bayes factors. However this does not provide the uncertainty of the chance (astronomers have the tendency of reporting uncertainties of any point estimates even if the procedure is statistically meaningless and that quantified uncertainty is not statistical uncertainty, as in using delta chi^2=1 to report 68% confidence intervals). There are various model selection criteria that cater various conditions embedded in data to make a right model choice among other candidate models. In addition, post-inference for astronomical models is yet a very difficult problem.

In order to report the righteous chance of (A*B+C)*D requires more elaborated statistical modeling, always brings some fierce discussions between frequentists and Bayesian because of priors and likelihoods. Although it can be very boring process, I want astronomers to leave the problem to statisticians instead of using inappropriate test statistics and making creative interpretation of statistics.

Please, keep this question in your mind when you report probability: what kind of chance are you computing? The chance of e coli infection? Or the chance that A has nukes? Make sure to understand that p-values from data analysis packages does not tell you that the chance the model (A*B+C)*D is (one minus p-value)%. You don’t want to report one minus p-value from a chi-square test statistic as the chance that A has nukes.

]]>

[ArXiv] Particle Physics

hlee — Fri, 20 Feb 2009 23:48:39 +0000

[stat.AP:0811.1663]
Open Statistical Issues in Particle Physics by Louis Lyons

My recollection of meeting Prof. L. Lyons was that he is very kind and listening. I was delighted to see his introductory article about particle physics and its statistical challenges from an [arxiv:stat] email subscription.

Descriptions of various particles from modern particle physics are briefly given (I like such brevity, conciseness, but delivering necessaries. If you want more on physics, find those famous bestselling books like The first three minutes, A brief history of time, The elegant universe, or Feynman’s and undergraduate textbooks of modern physics and of particle physics). Large Hardron Collider (LHC, hereafter. LHC related slog postings: LHC first beam, The Banff challenge, Quote of the week, Phystat – LHC 2008) is introduced on top of its statistical challenges from the data collecting/processing perspectives since it is expected to collect 10¹⁰ events. Visit LHC website to find more about LHC.

My one line summary of the article is solving particle physics problems from the hypothesis testing or rather broadly classical statistical inference approaches. I enjoyed the most reading section 5 and 6, particularly the subsection titled Why 5σ? Here are some excerpts I like to share with you from the article:

It is hoped that the approaches mentioned in this article will be interesting or outrageous enough to provoke some Statisticians either to collaborate with Particle Physicists, or to provide them with suggestions for improving their analyses. It is to be noted that the techniques described are simply those used by Particle Physicists; no claim is made that they are necessarily optimal (Personally, I like such openness and candidness.).

… because we really do consider that our data are representative as samples drawn according to the model we are using (decay time distributions often are exponential; the counts in repeated time intervals do follow a Poisson distribution, etc.), and hence we want to use a statistical approach that allows the data “to speak for themselves,” rather than our analysis being dominated by our assumptions and beliefs, as embodied in Bayesian priors.

Because experimental detectors are so expensive to construct, the time-scale over which they are built and operated is so long, and they have to operate under harsh radiation conditions, great care is devoted to their design and construction. This differs from the traditional statistical approach for the design of agricultural tests of different fertilisers, but instead starts with a list of physics issues which the experiment hopes to address. The idea is to design a detector which will proved answers to the physics questions, subject to the constraints imposed by the cost of the planned detectors, their physical and mechanical limitations, and perhaps also the limited available space. (Personal belief is that what segregates physical science from other science requiring statistical thinking is that uncontrolled circumstances are quite common in physics and astronomy whereas various statistical methodologies are developed under assumptions of controllable circumstances, traceable subjects, and collectible additional sample.)

…that nothing was found, it is more useful to quote an upper limit on the sought-for effect, as this could be useful in ruling out some theories.

… the nuisance parameters arise from the uncertainties in the background rate b and the acceptance ε. These uncertainties are usually quoted as σ_b and σ_ε, and the question arises of what these errors mean. … they would express the width of the Bayesian posterior or of the frequentist interval obtained for the nuisance parameter. … they may involve Monte Carlo simulations, which have systematic uncertainties as well as statistical errors …

Particle physicists usually convert p into the number of standard deviation σ of a Gaussian distribution, beyond which the one-sided tail area corresponds to p. Thus, 5σ corresponds to a p-value of 3e-7. This is done simple because it provides a number which is easier to remember, and not because Guassians are relevant for every situation.
Unfortunately, p-values are often misinterpreted as the probability of the theory being true, given the data. It sometimes helps colleagues clarify the difference between p(A|B) and p(B|A) by reminding them that the probability of being pregnant, given the fact that you are female, is considerable smaller than the probability of being female, given the fact that you are pregnant.

… the situation is much less clear for nuisance parameters, where error estimates may be less rigorous, and their distribution is often assumed to be Gaussian (or truncated Gaussain) by default. The effect of these uncertainties on very small p-values needs to be investigated case-by-case.
We also have to remember that p-values merely test the null hypothesis. A more sensitive way to look for new physics is via the likelihood ratio or the differences in χ² for the two hypotheses, that is, with and without the new effect. Thus, a very small p-value on its own is usually not enough to make a convincing case for discovery.

If we are in the asymptotic regime, and if the hypotheses are nested, and if the extra parameters of the larger hypothesis are defined under the samller one, and in that case do not lie on the boundary of their allowed region, then the difference in χ² should itself be distributed as a χ², with the number of degrees of freedom equal to the number of extra parameters (I’ve seen many papers in astronomy not minding (ignoring) these warnings for the likelihood ratio tests)

The standard method loved by Particle Physicists (astronomers alike) is χ². This, however, is only applicable to binned data (i.e., in a one or more dimensional histogram). Furthermore, it loses its attractive feature that its distribution is model independent when there are not enough data, which is likely to be so in the multi-dimensional case. (High energy astrophysicists deal low count data on multi-dimensional parameter space; the total number of bins are larger than the number of parameters but to me, binning/grouping seems to be done aggressively to meet the good S/N so that the detail information about the parameters from the data gets lost. ).

…, the σ_i are supposed to be the true accuracies of the measurements. Often, all that we have available are estimates of their values (I also noticed astronomers confuse between true σ and estimated σ). Problems arise in situations where the error estimate depends on the measured value a (parameter of interest). For example, in counting experiments with Poisson statistics, it is typical to set the error as the square root of the observd number. Then a downward fluctuation in the observation results in an overestimated weight, and a_best-fit is biased downward. If instead the error is estimated as the square root of the expected number a, the combined result is biased upward – the increased error reduces S at large a. (I think astronomers are aware of this problem but haven’t taken actions yet to rectify the issue. Unfortunately not all astronomers take the problem seriously and some blindly apply 3*sqrt(N) as a threshold for the 99.7 % (two sided) or 99.9% (one sided) coverage.)

Background estimation, particularly when observed n is less tan the expected background b is discussed in the context of upper limits derived from both statistical streams – Bayesian and frequentist. The statistical focus from particle physicists’ concern is classical statistical inference problems like hypothesis testing or estimating confidence intervals (it is not necessary that these intervals are closed) under extreme physical circumstances. The author discusses various approaches with modern touches of both statistical disciplines to tackle how to obtain upper limits with statistically meaningful and allocatable quantification.

As described, many physicists endeavor on a grand challenge of finding a new particle but this challenge is put concisely from the statistically perspectives like p-values, upper limits, null hypothesis, test statistics, confidence intervals with peculiar nuisance parameters or rather lack of straightforwardness priors, which lead to lengthy discussions among scientists and produce various research papers. In contrast, the challenges that astronomers have are not just finding the existence of new particles but going beyond or juxtaposing. Astronomers like to parameterize them by selecting suitable source models, from which collected photons are the results of modification caused by their journey and obstacles in their path. Such parameterization allows them to explain the driving sources of photon emission/absorption. It enables to predict other important features, temperature to luminosity, magnitudes to metalicity, and many rules of conversions.

Due to different objectives, one is finding a hay look alike needle in a haystack and the other is defining photon generating mechanisms (it may lead to find a new kind celestial object), this article may not interest astronomers. Yet, having the common ground, physics and statistics, it is a dash of enlightenment of knowing various statistical methods applied to physical data analysis for achieving a goal, refining physics. I recall my posts on coverages and references therein might be helpful:interval estimation in exponential families and [arxiv] classical confidence interval.

I felt that from papers some astronomers do not aware of problems with χ² minimization nor the underline assumptions about the method. This paper convey some dangers about the χ² with the real examples from physics, more convincing for astronomers than statisticians’ hypothetical examples via controlled Monte Carlo simulations.

And there are more reasons to check this paper out!

]]>

Likelihood Ratio Test Statistic [Equation of the Week]

vlk — Wed, 18 Jun 2008 17:00:30 +0000

From Protassov et al. (2002, ApJ, 571, 545), here is a formal expression for the Likelihood Ratio Test Statistic,

T_LRT = -2 ln R(D,Θ₀,Θ)

R(D,Θ₀,Θ) = [ sup_θεΘ₀ p(D|Θ₀) ] / [ sup_θεΘ p(D|Θ) ]

where D are an independent data sample, Θ are model parameters {θ_i, i=1,..M,M+1,..N}, and Θ₀ form a subset of the model where θ_i = θ_i⁰, i=1..M are held fixed at their nominal values. That is, Θ represents the full model and Θ₀ represents the simpler model, which is a subset of Θ. R(D,Θ₀,Θ) is the ratio of the maximal (technically, supremal) likelihoods of the simpler model to that of the full model.

When standard regularity conditions hold — the likelihoods p(D|Θ) and p(D|Θ₀) are thrice differentiable; Θ₀ is wholly contained within Θ, i.e., the nominal values {θ_i⁰, i=1..M} are not at the boundary of the allowed values of {θ_i}; and the allowed range of D are not dependent on the specific values of {θ_i} — then the LRT statistic is distributed as a χ²-distribution with the same number of degrees of freedom as the difference in the number of free parameters between Θ and Θ₀. These are important conditions, which are not met in some very common astrophysical problems (e.g, one cannot use it to test the significance of the existence of an emission line in a spectrum). In such cases, the distribution of T_LRT must be calibrated via Monte Carlo simulations for that particular problem before using it as a test for the significance of the extra model parameters.

Of course, an LRT statistic is not obliged to have exactly this form. When it doesn’t, even if the regularity conditions hold, it will not be distributed as a χ²-distribution, and must be calibrated, either via simulations, or analytically if possible. One example of such a statistic is the F-test (popularized among astronomers by Bevington). The F-test uses the ratio of the difference in the best-fit χ² to the reduced χ² of the full model, F=Δχ²/χ²_ν, as the statistic of choice. Note that the numerator by itself constitutes a valid LRT statistic for Gaussian data. This is distributed as the F-distribution, which results when a ratio is taken of two quantities each distributed as the χ². Thus, all the usual regularity conditions must hold for it to be applicable, as well as that the data must be in the Gaussian regime.

]]>

The Flip Test

vlk — Thu, 01 May 2008 18:00:08 +0000

Why is it that detection of emission lines is more reliable than that of absorption lines?

That was one of the questions that came up during the recent AstroStat Special Session at HEAD2008. When you look at the iconic Figure 1 from Protassov et al (2002), which shows how the null distribution of the Likelihood Ratio Test (LRT) and how it holds up for testing the existence of emission and absorption lines. The thin vertical lines are the nominal F-test cutoffs for a 5% false positive rate. The nominal F-test is too conservative in the former case (figures a and b; i.e., actual existing lines will not be recognized as such), and is too anti-conservative in the latter case (figure c; i.e., non-existent lines will be flagged as real).

Why the dichotomy in the two cases? David and Eric basically said during the Q&A that followed their talks that when we know that some statistic is calibrated, we can tell how it is distributed, but when it is not, we usually don’t know how badly off it will be in specific cases.

Here’s an interesting tidbit. A long time ago, in the infancy of my grad studenthood, I learned the following trick. When you are confronted with an emission line spectrum, and you think you see a weak line just barely above noise, how do you avoid getting fooled? You flip the spectrum over and look at it upside down! Does that feature still look like a line? Then you are probably OK with thinking it is real.

But why should that trick work? Our brains seem to be somehow rigged to discount absorption lines, so that when an emission feature is flipped over, it becomes “less significant”. This is the opposite of the usual F-test, and is in fact in line with the method recommended by Protassov et al.

Why this should be so, I have no idea. Will that trick work with trained statisticians? Or even with all astronomers? I have no answers, only questions.

]]>

The LRT is worthless for …

hlee — Fri, 25 Apr 2008 05:48:06 +0000

One of the speakers from the google talk series exemplified model based clustering and mentioned the likelihood ratio test (LRT) for defining the number of clusters. Since I’ve seen the examples of ill-mannerly practiced LRTs from astronomical journals, like testing two clusters vs three, or a higher number of components, I could not resist indicating that the LRT is improperly used from his illustration. As a reply, the citation regarding the LRT was different from his plot and the test was carried out to test one component vs. two, which closely observes the regularity conditions. I was relieved not to find another example of the ill-used LRT.

There are various tests applicable according to needs and conditions from data and source models but it seems no popular astronomical lexicons have these on demand tests except the LRT (Once I saw the score test since posting [ArXiv]s in the slog and a few non-parametric rank based tests over the years). I’m sure well knowledgeable astronomers soon point out that I jumped into a conclusion too quickly and bring up counter-examples. Until then, be advised that your LRTs, χ^2 tests, and F-tests ask for your statistical attention prior to their applications for any statistical inferences. These tests are not magic crystals, producing answers you are looking for. To bring such care and attentions, here’s a thought provoking titled paper that I found some years ago.

The LRT is worthless for testing a mixture when the set of parameters is large
JM Azaıs, E Gassiat, C Mercadier (click here :I found it from internet but it seems the link was on and off and sometimes was not available.)

Here, quotes replace theorems and their proofs^[1] :

We prove in this paper that the LRT is worthless from testing a distribution against a two components mixture when the set of parameters is large.
One knows that the traditional Chi-square theory of Wilks[16^[2]] does not apply to derive the asymptotics of the LRT due to a lack of identifiability of the alternative under the null hypothesis.
…for unbounded sets of parameters, the LRT statistic tends to infinity in probability, as Hartigan[7^[3]] first noted for normal mixtures.
…the LRT cannot distinguish the null hypothesis (single gaussian) from any contiguous alternative (gaussian mixtures). In other words, the LRT is worthless^[4].

For astronomers, the large set of parameters are of no concern due to theoretic constraints from physics. Experiences and theories bound the set of parameters small. Sometimes, however, the distinction between small and large sets can be vague.

The characteristics of the LRT is well established under the compactness set assumption (either compact or bounded) but troubles happen when the limit goes to the boundary. As cited before in the slog a few times, readers are recommend to read for more rigorously manifested ideas from astronomy about the LRT, Protassov, et.al. (2002) Statistics, Handle with Care: Detecting Multiple Model Components with the Likelihood Ratio Test, ApJ, 571, p. 545

Readers might want to look for mathematical statements and proofs from the paper
S.S.Wilks. The large sample distribution of the likelihood ratio for testing composite hypothesis, Ann. Math. Stat., 9:60-62, 1938
J.A.Hartigan, A failure of likelihood asymptotics for normal mixtures. in Proc. Berkeley conf. Vol. II, pp.807-810
Comment to Theorem 1. They proved the lack of worth in the LRT under more general settings, see Theoremm 2

]]>

[ArXiv] 3rd week, Jan. 2008

hlee — Fri, 18 Jan 2008 18:24:23 +0000

Seven preprints were chosen this week and two mentioned model selection.

[astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H.J.Deeg (it discusses model selection in section 4.4)
[astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (it could be useful for those who does sunspot cycle modeling)
[astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
[astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B.C. Kelly et.al.
[astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K.C. Yee
[astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, &D. Veras
[astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang

]]>