#### how to trace?

I was at the SUSY 09 public lecture given by a Nobel laureate, Frank Wilczek of QCD (quantum chromodynamics). As far as I know SUSY is the abbreviation of **SUperSYmetricity** in particle physics. Finding such antimatter(? I’m afraid I read “Angels and Demons” too quickly) will explain the unification theory among electromagnetic, weak, and strong forces and even the gravitation according to the speaker’s graph. I’ll not go into the details of particle physics and the standard model. The reason is too obvious. Instead, I’d like to show this image from wikipedia and to discuss my related questions.

Whenever LHC (Large Hardron Collider, several posts from the slog) is publicly advertised, the grand scale of accelerator (26km) is the center of attention for these unprecedented controlled experiments for particle physics researches. Controlled in conjunction of factorization in statistical experiment designs to eliminate unknowns and to factor in external components (covariates, for example). By the same token, not the grand scale of the accelerator, but the detector and controlled/isolated system, and its designs for collecting data seem most important to me. Without searching for reports, I want to believe that many countless efforts have been put into detectors and data processors, which seem to be overshadowed because of the grand scale of the accelerating tube.

For fun and honoring the speaker’s showing it to the public, you might like to see this youtube rap again.

As a statistician, curious about the detector and the physics leading the designs of such expensive and extreme studies, I was more interested in knowing further on

- how data are collected and
- how study was designed or what are the hypotheses

not the scale of the accelerator nor the feeling inside the 2 degree vacuum tube. There was no clue to find out partial answers to these questions through the public lecture. So, I hope some slog readers could help me understand better the following issues spawn from this public lecture. Let me talk my questions statistically and try to associate them with the image above.

**Uncertainty Principle**

The uncertainty principle by physicists is written roughly as follows:

Δ E Δ t > h

where h is Plank’s constant. Instead of energy and time, Δ x Δ p > h, location and momentum is used as well. This principle is more or less related to **precision** or **bias.** One cannot measure things with 100% precision. In other word, in measuring quantities from physics, there is no exact unbiased estimator (asymptotically unbiased is a different context). In order to observe subparticle in a short time scale, the energy must be high. Yet, unless the energy is extremely high, the uncertainty of when the event happen is huge so that no one can assign exact numerals when the eveny happen. This uncertainty principle is the primary reason for such large accelerator so that particle can gain tremendous energy and therefore, an observer can determine the location and the time of the event (collision, subsequent annihilation, and scatters of subparticles) with uncertainty from the principle of physics.

**What is Uncertainty?**

I’ve always had a conflicting notion about uncertainty in statistics and astronomy. The uncertainty from the Heigenberg’s uncertainty principle and the uncertainty from measure theory and the stochastic nature of data. Although the word is same but the implications are different. The former describes precision as discussed above and the later accuracy (Bevington’s book describes the difference between precision and accuracy, if I recall correctly). When an astronomer has data and computes a best fit and one σ from the chi-square, that σ is quantifying the uncertainty/scale of the Gaussian distribution, a model for residuals that the astronomer has chosen for fitting the data with the model of physics.

When it comes to **measurement errors** it’s more like discussing precision, not accuracy or the scale parameters of distribution functions (family of distributions). Either measurement errors, or computing uncertainty via chi-square minimization or Bayesian posterior distribution estimation, most of procedures to understand uncertainties in astronomical literature is based on parametrizating uncertainties. Luckily we know that Gaussian and Poisson distribution for parametrization works almost all cases in astronomy. Yet, my understanding is that there’s not much distinction between precision and accuracy in astronomical data analysis, not much awareness about the difference between the uncertainty principle from physics and the uncertainty by the stochastic nature of data. This seems causing biased or underestimated results. With jargon of statistics, instead of overlooking, the issues of model mis-identification and model uncertainty^{[1]} of other disciplines are worth to be looked into to narrow the gap.

As a statistician, I approach the problem of uncertainty hierarchically. Start from the simplest that sigma is known and used the given sigma as the ground truth. If statistics does not advocate such condition, then move to a direction of estimating it, and testing whether it is homogenous or heterogeneous error, etc to understand the sampling distribution better and device statistics accordingly. During the procedure, I’ll add a model for measurement errors. If Gaussian, adding statistical uncertainty terms and measurement error terms works well, an easy convolution of Gaussian distributions (see my why gaussianity?). I might have to ignore some factors in my hierarchical modeling procedure if their contribution is almost none but the hierarchical model becomes too complicated for such mediocre gain. Instead, it would be easy to follow the rule of thumb strategy developed by astronomers with great knowledge and experience. Anyway, if parametric strategy does not work, I’ll employ nonparametric approaches. Focusing on Bayesian methodology, it’s like modeling hierarchically from parametric likelihoods and informative, subjective priors to nonparametric likelihoods, objective, noninformative priors. Overall, these are efforts of modeling both physics and errors assuming that measurements are taken accurately; multiple measurements and collecting many photons quantifies how accurately the best fit is obtained. On the other hand, under the uncertainty principle, intrinsic measurement bias (unknown but bounded) is inevitable. Not statistics but physics could tell how precise measurements can be taken. Still it’s uncertainty but different kind. I sometimes confront astronomers mixing strategies of calibrating the uncertainties of different grounds and also I got confused and lost.

I’d like to say that multiple observations (the amount of degrees of freedom in chi-square minimizations, and bins in histograms) are realizations of coupling of bias and variance (precision and accuracy; measurement errors and statistical uncertainty in sigma/error bar) from which the importance of proper parametrization and regularized optimization is never enough to be emphasized to get that right 68% coverage of the uncertainty in a best fit, instead of simple least square or chi-square. Statisticians often discuss the mean square error (see my post [MADS] Law of Total Variance) than the error bar to account for the overall uncertainty in a best fit.

I’m afraid that my words sound gibberish – I hope that statisticians with good commands of literal and scientific languages discuss the uncertainty of physics and of statistics and how it affects choosing statistical methods and drawing statistical inference from (astro)physical data. I’m also afraid that people continue going for one sigma by feeding the data into the chi-square function and adding speculated systematic errors (say 15% of the computed sigma from the chi-square minimization) without second thoughts on the implications of uncertainty and on assumptions for its quantification methods.

**Identifiability**

I wonder how the shot of above image is taken when protons are colliding. There should be a tremendous number of subparticles generated from the collisions of many protons. Unless there is a single photo frame that takes traces of all those particles (collision happens in 3D camera chamber? Perhaps, they use medical imaging, tomography techniques but processing time wise I doubt its feasibility), I think those traces are the reconstruction of multiple cross sectional shots. My biggest concern was how each line and dot you see from the picture can be associated to a certain particle. Physics and standard model can tell that their trajectories are distinguishable, depending on their charges, types, and mass but there are, say, millions of events happening in the matter of extremely short time scale! How certain one can say this is the trace of a certain particle.

The speaker discussed massive data and uncertainty as another challenge. So many procedures in terms of (statistical) data analysis seem not explored yet although theory of physics is very sophisticated and complicated. If physics is an deductive/deterministic science, then statistics is inductive/stochastic. I personally believe that theories are able to conclude the same from both physical and statistical experiments. I guess now it’s time to prove such thesis with data and statistics and it starts with identifying particles’ traces and their meta-data.

**image reconstruction**

To create an image of many particles as above when we have the identifiability issue and the uncertainty in time and space, I wonder how pictures are constructed from each collision. The lecturer used an analogy of a dodecaheron calendar with missing months to deliver the feeling of image reconstruction in particle physics. Whenever I see such images of many ray traces and hear promises that LHC will deliver, I’ve been wondering how they reconstruct those traces after the particle collisions and measuring times of events. Thanks to the uncertainty principle and its mediocre scale, there must be some tremendous constraints and missings. How much information is contained in that reconstructed image? How much information loss is inevitable due to those constraints. It would be very interesting to know each step from detectors to images and find statistical and information theoretical challenges.

**massive data processing**

Colliding one proton to the other seems ideal to discover the unification theory advocating the standard model by tracing individual relatively small number of particles. If so, the picture above could have been simpler than what it looks. Unfortunately, it’s not the case and huge number of protons are sent for collision. I’ve kept heard the gigantic size of data that particle physics experiments create. I wondered how such massive data are processed while the speaker showed the picture of one of world best computing facilities at CERN. Not just for automated pipelining but for processing, cleaning, summarizing, and evaluating from statistical aspects would require clever algorithms to make most of those multiple processors.

**hypothesis testing**

I still think that quests for searching particles via LHC are classical decision theoretic hypothesis testing problem: the null hypothesis is no new(unobserved particle) vs. the alternative hypothesis contains the model/information of new particle by the theory (SUSY, antimatter etc). Statistically speaking, in order to observe such matter or to reject null hypothesis comfortable, we need statistically powerful tests, where Neyman-Pearson test/construction is often mentioned. One needs to design an experiment that is powerful enough (power here has two connotations: one is physically powerful enough to make proton have high energy so that one can observe particles in the brief time and space frame, and the other is statistically powerful such that if such new particles exist, the test is powerful enough to reject the null hypothesis with decent power and false discovery rate). How to transcribe data and models into a powerful test seems still an open question to physicists. You can check discussions from the links in the PHYSTAT LHC 2007 post.

**source detection**

In the similar context of source detection in astronomy, how do physicists define and segregate source (particle of interest, higgs, for example) from background? It’s also related to identifiability of particles shown in the picture. How can a physicist see an rare event among tons of background events which form a wide sampling distribution or in other words, that have a huge uncertainty as an ensemble. Also the source event has its own uncertainty because of the uncertainty principle. How to form robust thresholding methods? How to develop Bayesian learning strategies for better detection? Perhaps the underlying (statistical) models are different for particle physics and for astronomy, but the basic idea of how to apply statistical inference seem not much different from the fact that 1. background can be more dominant, 2. background is used for the null hypothesis, and 3. the source distribution comprise the alternative distribution. It’ll be very interesting to collect statistics for source detection and formalize those methods so that consistent source detection results can be achieved by devising statistics suits the data types.

**cliche and irony**

I’d like to quote two phrases from the public lecture.

finding an

atomin a haystack

A cliche for all groups of scientists. I’ve heard “finding a needle in a haystack” so many times because of the new challenge that we confronted from the information era. On the other hand, replacing “needle” to “atom” was new to me. Unfortunately, my impression is that physicists are not equipped with tools to do such data mining either a needle or an atom. I wonder what computer scientists can offer them for this more challenging quest to answer the fundamental question about the universe.

it’s an exciting time to be physicists

The speaker used physicists but I’ve heard the same sentence from astronomers and statisticians with their professions replaced with physicists. After hearing it too often from various people, I became doubtful since I cannot feel such excitements imminently. It feels like after hearing **change** too often before it happens, one cannot feel the real progress of changes. Words always travel faster than actions. Sometimes words can be just empty promise. That’s why I thought it’s a cliche and irony. Perhaps it’s due to that fact I’m at the intersection of the combination of these scientist sets, not at the center of any set. Ironically, defining boundaries is also fuzzy nowadays. Perhaps, I’m already excited and afraid of transiting down to a lower energy level. Anyway, being enthusiastic and living in an exciting time seems different matter.

**What will come next?**

I haven’t heard the news about Phystat 2009, whose previous meetings occurred every odd year in the 21st century. Personally, their meeting agenda and subsequent proceedings were very informative and offered clues to my questions. I hope the next meeting soon to be held.

- the notions of model uncertainty among astronomers and statisticians are different. Hopefully, I have time to talk about it[↩]

## Leave a comment