Here is a specific example that came up due to a comment by a referee on a paper with David G.-A. We had said that the O abundance is dominated by uncertainties in the DEM at low temperatures, because that is where most of the emission from O is formed. The referee disputed this, saying, yes, but O is also present at higher temperatures, and since the DEM is much higher there, that should be the predominant contribution to the estimate. In effect, the referee said, “show me!” The problem is, how? The measured fluxes are:
fO7obs = 2 ± 0.75
fO8obs = 4 ± 0.88
The predicted fluxes are:
fO7pred = 1.8 ± 0.72
fO8pred = 3.6 ± 0.96
where the error bars are the stddev of the fluxes predicted by the DEM realizations that come out of the MCMC analysis. On the face of it, this looks like a pretty good match to the observations, though a slightly different picture emerges if one looks at the distributions of the predicted fluxes:
mode(fO7pred) = 0.76 (95% HPD interval: 0.025 to 2.44)
mode(fO8pred) = 2.15 (95% HPD interval: 0.95 to 4.59)
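Just to make the comparison concrete, here is a minimal sketch (not the code used in the paper) of how one would compute both summaries from the same set of MCMC-predicted fluxes: the mean ± stddev on one hand, and the mode with a 95% highest posterior density (HPD) interval on the other. The array flux_pred is a hypothetical stand-in for the predicted O VII (or O VIII) fluxes, one value per DEM realization.

```python
import numpy as np

def hpd_interval(samples, prob=0.95):
    """Shortest interval containing a fraction `prob` of the samples."""
    s = np.sort(np.asarray(samples))
    n = len(s)
    k = int(np.ceil(prob * n))           # number of samples inside the interval
    widths = s[k - 1:] - s[:n - k + 1]   # widths of all candidate intervals
    i = np.argmin(widths)                # pick the shortest one
    return s[i], s[i + k - 1]

def sample_mode(samples, bins=50):
    """Crude mode estimate from a histogram of the samples."""
    counts, edges = np.histogram(samples, bins=bins)
    j = np.argmax(counts)
    return 0.5 * (edges[j] + edges[j + 1])

# hypothetical skewed flux samples, standing in for 2000 MCMC DEM realizations
rng = np.random.default_rng(1)
flux_pred = rng.lognormal(mean=0.3, sigma=0.8, size=2000)

print("mean +/- stddev:", flux_pred.mean(), flux_pred.std())
print("mode, 95% HPD  :", sample_mode(flux_pred), hpd_interval(flux_pred))
```

For a skewed distribution such as this, the two summaries can disagree noticeably, which is exactly the difference between the mean ± stddev numbers and the mode/HPD numbers quoted above.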
What if one computed the flux at each temperature and did the same calculation separately? That is shown in the following plot, which displays the product of the DEM and the line emissivity, computed at each temperature bin, for both O VII (red) and O VIII (blue). The histograms are for the best-fit DEM solution, and the vertical bars are the stddevs of the product, which differs from the flux only by a constant. The dotted lines show the 95% highest posterior density intervals.
Figure 1: Fluxes from O VII and O VIII computed at each temperature from DEM solution of RST 137B. The solid histograms are the fluxes for the best-fit DEM, and the vertical bars are the stddev for each temperature bin. The dotted lines denote the 95% highest-posterior density intervals for each temperature.
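For what it is worth, the quantity plotted in Figure 1 can be sketched in a few lines. The arrays below (dem_best, dem_samples, emis_o7) are hypothetical placeholders for the best-fit DEM, the MCMC DEM realizations (one row per draw), and the O VII emissivity on the same temperature grid; they are not the actual RST 137B solution.

```python
import numpy as np

n_bins, n_draws = 40, 2000
logT = np.linspace(5.5, 7.5, n_bins)                   # temperature grid [log K]
rng = np.random.default_rng(2)
emis_o7 = np.exp(-0.5 * ((logT - 6.3) / 0.15) ** 2)    # toy O VII emissivity
dem_best = np.exp(-0.5 * ((logT - 6.8) / 0.4) ** 2)    # toy best-fit DEM
dem_samples = dem_best * rng.lognormal(0.0, 0.5, size=(n_draws, n_bins))

prod_best = dem_best * emis_o7                         # best-fit histogram of Fig. 1
prod_samples = dem_samples * emis_o7                   # one curve per MCMC draw

# per-bin spread: the stddev (the vertical bars) ...
prod_std = prod_samples.std(axis=0)
# ... and a 95% interval per bin (the dotted lines); a simple percentile
# interval is used here as a stand-in for a true shortest-length HPD interval
lo, hi = np.percentile(prod_samples, [2.5, 97.5], axis=0)
```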
But even this tells an incomplete tale. The full extent of the uncertainty is not apparent until all the individual curves are displayed, as in the animated GIF below, which shows the flux calculated for each MCMC draw of the DEM:
Figure 2: Predicted flux in the O VII and O VIII lines as the product of the line emissivity and MCMC samples of the DEM at various temperatures. The dashed histogram is from the best-fit DEM; the solid histograms are for the various samples (the running number at top right indicates the sample sequence; only the last 100 of the 2000 MCMC draws are shown).
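And here is a rough, self-contained sketch of the kind of overlay the animation cycles through: the last 100 per-draw curves plotted faintly on top of the best-fit product. The toy arrays are the same placeholders as in the previous snippet, not the real DEM solution.

```python
import numpy as np
import matplotlib.pyplot as plt

# toy stand-ins, constructed the same way as in the previous snippet
rng = np.random.default_rng(2)
logT = np.linspace(5.5, 7.5, 40)
emis_o7 = np.exp(-0.5 * ((logT - 6.3) / 0.15) ** 2)
dem_best = np.exp(-0.5 * ((logT - 6.8) / 0.4) ** 2)
prod_best = dem_best * emis_o7
prod_samples = dem_best * rng.lognormal(0.0, 0.5, size=(2000, 40)) * emis_o7

fig, ax = plt.subplots()
for curve in prod_samples[-100:]:                      # last 100 of the 2000 draws
    ax.step(logT, curve, where="mid", color="red", alpha=0.05)
ax.step(logT, prod_best, where="mid", color="black", linestyle="--", lw=2)
ax.set_xlabel("log T [K]")
ax.set_ylabel("DEM x emissivity (flux up to a constant)")
plt.show()
```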
So this brings me to my question. How does one represent this kind of uncertainty in a static plot? We know what the uncertainty is; we just don’t know how to publish it.
If the bootstrap is an automatic processor for frequentist inference, then MCMC is its Bayesian counterpart.
Sometime in my second year of studying statistics, I said that the bootstrap and MCMC are equivalent but reflect different streams of statistics. The response to this comment was ‘that’s nonsense.’ Although I have forgotten the details of the circumstance, I was hurt and didn’t try to defend my point. Years later, that occasion immediately came back to mind upon seeing this sentence.
[1]: Personally, I think the point spread function (PSF) plays a major role in extracting astrophysical information from raw observations.
[2]: In addition to this week’s GRB paper, [astro-ph:0710.3099]*, there were some statistical studies about multi-modality in GRBs [see Three classes of GRBs].
I think that (astro)physics provides templates, such as likelihoods, for a statistical analysis. The statistical dispute between bimodality and trimodality of the GRB distribution can be settled by (1) more studies like [astro-ph:0710.3099]* or (2) better statistical inference tools or model selection methods. Unfortunately, (2) requires templates, which are acquired from in-depth cross-matching survey studies.
Feroz & Hobson (2007, arXiv:0704.3704) describe a technique called Nested Sampling (Skilling 2004), one that could give MCMC a run for its money. It takes the one inefficient part of MCMC — the burn-in phase — and turns that into a virtue. The way it seems to work is to keep track of how the parameter space is traversed as the model parameters {theta} reach the mode of the posterior, and to take the sequence of likelihoods thus obtained, L(theta), and turn it around to get theta(L). Neat.
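To fix ideas, here is a hedged, minimal sketch of the nested-sampling recipe itself (Skilling 2004), on a toy 2-D Gaussian likelihood with a unit-box prior; it is not the Feroz & Hobson implementation. The replacement step here is naive rejection sampling, which is precisely the part that a serious code such as MultiNest makes efficient.

```python
import numpy as np

rng = np.random.default_rng(3)

def log_like(theta):
    """Toy likelihood: a narrow Gaussian centred at (0.5, 0.5) on a [0, 1]^2 prior."""
    return -0.5 * np.sum(((theta - 0.5) / 0.1) ** 2)

ndim, nlive, niter = 2, 200, 1000
live = rng.uniform(0.0, 1.0, size=(nlive, ndim))       # live points drawn from the prior
live_logL = np.array([log_like(t) for t in live])

logZ = -np.inf                                         # evidence accumulator
logX_prev = 0.0                                        # log prior volume, starts at 1
for i in range(1, niter + 1):
    worst = np.argmin(live_logL)                       # lowest-likelihood live point
    logX = -i / nlive                                  # expected log prior-volume shrinkage
    log_dX = np.log(np.exp(logX_prev) - np.exp(logX))  # volume of the discarded shell
    logZ = np.logaddexp(logZ, live_logL[worst] + log_dX)
    logX_prev = logX
    # replace the discarded point with a prior draw satisfying L > L_worst
    # (naive rejection sampling; fine for a toy problem, hopeless in general)
    while True:
        trial = rng.uniform(0.0, 1.0, size=ndim)
        if log_like(trial) > live_logL[worst]:
            live[worst], live_logL[worst] = trial, log_like(trial)
            break

# contribution of the remaining live points, which still occupy ~exp(logX_prev)
logZ = np.logaddexp(logZ, np.logaddexp.reduce(live_logL) - np.log(nlive) + logX_prev)
print("log-evidence estimate:", logZ)                  # ~ log(2*pi*0.1^2) for this toy case
```

The ordered sequence of discarded likelihoods is effectively the L-versus-prior-volume relation described above, and the evidence is just its integral over the shrinking prior volume.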
Two big (computational) problems that I see are (1) the calculation of theta(L), and (2) the sampling to discard the tail of L(theta). The former, it seems to me, becomes intractable exactly when the likelihood surface gets complicated. The latter, again, seems to require running through just as many iterations as in MCMC to get a decent sample size. Of course, if you have a good theta(L), it does seem to be an improvement over MCMC in that you won’t need to run the chains multiple times to make sure you catch all the modes.
I think the main advantage of MCMC is that it produces and keeps track of marginalized posteriors for each parameter, whereas in this case, you have to essentially keep a full list of samples from the joint posterior and then marginalize over it yourself. The larger the sample size, the harder this gets, and in fact it is a bit difficult to tell whether the nested sampling method is competing with MCMC or Monte Carlo integration.
Is there any reason why this should not be combined with MCMC? i.e., can we use nested sampling during the burn-in phase to figure out the proposal distributions for Metropolis or Metropolis-Hastings in situ, and get the best of both worlds?
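For completeness, here is a minimal sketch of the kind of hybrid being asked about: estimate a covariance matrix from whatever burn-in or exploratory draws one has, scale it by the standard 2.38^2/d factor, and use it as the proposal for a plain Metropolis run. The log-posterior and the “burn-in” sample below are toy placeholders; nothing here comes from Feroz & Hobson.

```python
import numpy as np

rng = np.random.default_rng(4)

def log_post(theta):
    """Toy correlated 2-D Gaussian log-posterior."""
    cov = np.array([[1.0, 0.8], [0.8, 1.0]])
    return -0.5 * theta @ np.linalg.solve(cov, theta)

# pretend these draws came from a burn-in / exploratory phase (e.g. nested sampling)
burnin = rng.multivariate_normal([0.0, 0.0], [[1.2, 0.7], [0.7, 1.1]], size=500)
prop_cov = 2.38**2 / burnin.shape[1] * np.cov(burnin.T)   # standard optimal scaling

# plain Metropolis with the tuned proposal
theta = burnin[-1].copy()
chain = []
for _ in range(5000):
    prop = rng.multivariate_normal(theta, prop_cov)
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    chain.append(theta.copy())
chain = np.array(chain)                                   # marginalize by taking chain[:, j]
```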
It has long been known that non-linear curve-fitting, for all its apparent objectivity and the statistical theoretical foundation that enables us to evaluate the goodness of fits, is still something of an art. Leaving aside issues such as a correct specification of the asymmetrical errors that are usually prevalent in astrophysical situations (mostly due to the underlying Poisson distribution), fitting a model curve to the data is never a simple and straightforward process. More often than not, the parameters returned by the programs as those that “best fit” the data are not those that truly maximize the statistical likelihood, but are instead the result of the program being trapped in a local minimum of the chi^2 surface. Furthermore, numerical errors also creep into the outputs depending on how stopping rules are implemented: cases abound where repeated applications of the fitting program simply move the “best-fit” parameters closer to the true best-fit values without ever reaching them. These problems are compounded when the number of parameters in the model is large (>10) and there are significant correlations between the parameters.
A typical strategy for ensuring that the fitting process results in the best possible fit is to start it with parameters that are very close to the true best-fit values. This is done by carrying out the fitting in multiple stages, with only a few parameters fit at a time while the others are held fixed, until the parameter values returned by the program show that the solution has converged. Only then is a fit to all the parameters attempted. The errors on each parameter are then computed by taking projections through the chi^2 surface along each parameter axis; if the best-fit values do not change during this process, one can be confident that the best possible fit has been obtained.
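As a concrete, purely illustrative example of that staged strategy, here is a sketch using scipy’s curve_fit on a toy Gaussian-plus-constant model; the actual analyses alluded to here were not necessarily done with these tools. Parameters are held fixed simply by closing over them and fitting only the free subset, and the full fit is attempted last, starting from the staged solution.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, amp, cen, wid, bkg):
    return amp * np.exp(-0.5 * ((x - cen) / wid) ** 2) + bkg

rng = np.random.default_rng(5)
x = np.linspace(-5, 5, 200)
truth = (10.0, 0.3, 0.8, 2.0)
y = rng.poisson(model(x, *truth)).astype(float)
yerr = np.sqrt(np.clip(y, 1, None))                       # crude Poisson errors

# stage 1: fit amplitude and background, holding centre and width fixed
cen0, wid0 = 0.0, 1.0
p1, _ = curve_fit(lambda x, amp, bkg: model(x, amp, cen0, wid0, bkg),
                  x, y, p0=[5.0, 1.0], sigma=yerr)

# stage 2: fit centre and width, holding the stage-1 amplitude and background fixed
p2, _ = curve_fit(lambda x, cen, wid: model(x, p1[0], cen, wid, p1[1]),
                  x, y, p0=[cen0, wid0], sigma=yerr)

# stage 3: release all parameters, starting from the staged solution
p3, cov3 = curve_fit(model, x, y, p0=[p1[0], p2[0], p2[1], p1[1]], sigma=yerr)
print("staged best fit:", p3)
```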
What this means in practice is that studies that carry out automated fits to a large number of data sets, where there is no possibility of human intervention and thus no means to verify that the best-fit parameters have been found, are unreliable. As an example, we show in the Figure above the case where a Gaussian model was fit 100 times to profiles generated from a Poisson distribution with mean mu=1000 (a Poisson profile is indistinguishable from a Gaussian for mu >~ 100; cf. Park et al. 2006) and with a total normalization of 1000. In all cases, the fits were started with initial values at the nominal best-fit values. The Figure shows a scatter plot of the initial value of the chi^2, obtained at the end of the first fit attempt, versus the final chi^2, obtained after going through the process described above. Changes as large as delta chi^2 ≈ 25 are seen, which can change the character of the fit significantly.
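For readers who want to try something similar, here is a rough sketch of this kind of experiment, not the code that produced the Figure: generate Poisson realizations of a Gaussian-shaped profile, record the chi^2 of a deliberately coarse first fit (loose tolerances stand in for a conservative stopping rule), then keep restarting the fit from the previous solution until chi^2 stops improving, and compare the two.

```python
import numpy as np
from scipy.optimize import curve_fit

def gauss(x, amp, cen, wid):
    return amp * np.exp(-0.5 * ((x - cen) / wid) ** 2)

def chi2(p, x, y, yerr):
    return np.sum(((y - gauss(x, *p)) / yerr) ** 2)

rng = np.random.default_rng(6)
x = np.linspace(-5, 5, 100)
first, final = [], []
for _ in range(100):
    y = rng.poisson(gauss(x, 120.0, 0.0, 1.0)).astype(float)
    yerr = np.sqrt(np.clip(y, 1, None))
    # first attempt: loose tolerances mimic a premature stopping rule
    p, _ = curve_fit(gauss, x, y, p0=[100.0, 0.5, 1.5], sigma=yerr,
                     ftol=1e-2, xtol=1e-2)
    first.append(chi2(p, x, y, yerr))
    # restart from the previous solution until chi^2 no longer improves
    prev = np.inf
    while chi2(p, x, y, yerr) < prev - 1e-6:
        prev = chi2(p, x, y, yerr)
        p, _ = curve_fit(gauss, x, y, p0=p, sigma=yerr)
    final.append(chi2(p, x, y, yerr))

print("largest improvement in chi^2:", max(f0 - f1 for f0, f1 in zip(first, final)))
```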