The AstroStat Slog » modeling
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

The Perseid Project [Announcement]
Posted Mon, 02 Aug 2010 by vlk
http://hea-www.harvard.edu/AstroStat/slog/2010/perseid-project/

There is an ambitious project afoot to build a 3D map of a meteor stream during the Perseids on Aug 11-12. I got this missive about it from the organizer, Chris Crawford:

This will be one of the better years for Perseids; the moon, which often interferes with the Perseids, will not be a problem this year. So I’m putting together something that’s never been done before: a spatial analysis of the Perseid meteor stream. We’ve had plenty of temporal analyses, but nobody has ever been able to get data over a wide area — because observations have always been localized to single observers. But what if we had hundreds or thousands of people all over North America and Europe observing Perseids and somebody collected and collated all their observations? This is crowd-sourcing applied to meteor astronomy. I’ve been working for some time on putting together just such a scheme. I’ve got a cute little Java applet that you can use on your laptop to record the times of fall of meteors you see, the spherical trig for analyzing the geometry (oh my aching head!) and a statistical scheme that I *think* will reveal the spatial patterns we’re most likely to see — IF such patterns exist. I’ve also got some web pages describing the whole shebang. They start here:

http://www.erasmatazz.com/page78/page128/PerseidProject/PerseidProject.html

I think I’ve gotten all the technical, scientific, and mathematical problems solved, but there remains the big one: publicizing it. It won’t work unless I get hundreds of observers. That’s where you come in. I’m asking a few things of you:

1. Any advice, criticism, or commentary on the project as presented in the web pages.
2. Publicizing it. If we can get that ol’ Web Magic going, we could get thousands of observers and end up with something truly remarkable. So, would you be willing to blog about this project on your blog?
3. I would be especially interested in your comments on the statistical technique I propose to use in analyzing the data. It is sketched out on the website here:

http://www.erasmatazz.com/page78/page128/PerseidProject/Statistics/Statistics.html

Given my primitive understanding of statistical analysis, I expect that your comments will be devastating, but if you’re willing to take the time to write them up, I’m certainly willing to grit my teeth and try hard to understand and implement them.

Thanks for any help you can find time to offer.

Chris Crawford

From Quantile Probability and Statistical Data Modeling
Posted Sat, 21 Nov 2009 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/from-quantile-probability-and-statistical-data-modeling/

by Emanuel Parzen, Statistical Science, 2004, Vol. 19(4), pp. 652-662 (JSTOR)

I teach that statistics (done the quantile way) can be simultaneously frequentist and Bayesian, confidence intervals and credible intervals, parametric and nonparametric, continuous and discrete data. My first step in data modeling is identification of parametric models; if they do not fit, we provide nonparametric models for fitting and simulating the data. The practice of statistics, and the modeling (mining) of data, can be elegant and provide intellectual and sensual pleasure. Fitting distributions to data is an important industry in which statisticians are not yet vendors. We believe that unifications of statistical methods can enable us to advertise, “What is your question? Statisticians have answers!”

I couldn’t help liking this paragraph because of its bittersweetness. I hope you appreciate it as much as I did.

Scatter plots and ANCOVA
Posted Thu, 15 Oct 2009 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/scatter-plots-and-ancova/

Astronomers rely on scatter plots to illustrate correlations and trends among pairs of variables more than scientists in any other field[1]. Pages of scatter plots with regression lines are common, where the slope of the regression line and its error bars serve as indicators of the degree of correlation. Sometimes, seeing so many scatter plots makes me think that, overall, the resources spent on drawing nice scatter plots, and the pages of the papers where they are printed, are wasted. Why not just report the correlation coefficients and their errors, and publish the processed data used to compute them (not the full data), so that others can verify the computations for the sake of validation? A couple of scatter plots are fine, but when I see dozens of them I lose focus. This is another cultural difference.

When there are many pairs of variables demanding numerous scatter plots, one possibility is to use parallel coordinates together with a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then error bars for these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to look into ANCOVA (ANalysis of COVAriance) to see how statisticians structure observations and their uncertainties into a model to extract useful information.
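The matrix-of-correlations idea can be sketched in a few lines. This is my own illustration with made-up variable names, not code from the post; it computes a correlation matrix and attaches rough 95% bands via the Fisher z-transform, which assumes approximately Gaussian data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
luminosity = rng.normal(size=n)                          # hypothetical variables
mass = 0.8 * luminosity + rng.normal(scale=0.6, size=n)  # correlated with luminosity
redshift = rng.normal(size=n)                            # unrelated on purpose
data = np.column_stack([luminosity, mass, redshift])

corr = np.corrcoef(data, rowvar=False)    # matrix of correlation coefficients

# Fisher z-transform: atanh(r) is approximately normal with standard
# error 1/sqrt(n - 3), giving a rough 95% band for each coefficient.
se = 1.0 / np.sqrt(n - 3)
zmat = np.arctanh(np.clip(corr, -0.999999, 0.999999))
lo, hi = np.tanh(zmat - 1.96 * se), np.tanh(zmat + 1.96 * se)

print(np.round(corr, 2))
```

One table like this, with its band matrix, replaces a page of pairwise scatter plots while keeping the error information the post asks for.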

I’m not saying that the simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling of astronomical data. Most likely they are not; astrophysics generally deals with complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors (to name some of the jargon of statistical modeling), and defining their relationships in a comprehensive way as is done in ANCOVA, instead of simply pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, points on a circle in the XY plane of 3D space, centered at the origin, look like a bar rather than a circle when viewed edge-on, producing a seemingly perfect correlation.

In fact, astronomers are aware of the need to remove these unnecessary correlations through corrections, for example fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because they change the uncertainty structure. Consider what happens when background subtraction leads to negative values, an unrealistic consequence. Unless probabilistically guaranteed, linear operations require a lot of care: we only know that residuals y-E(Y|X=x) are perfectly normal when the μ and σ of the Gaussian density can themselves be operated on linearly (about the Gaussian distribution, please see the post why Gaussianity? and the reference therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw in applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Properly accounting for the sources of uncertainty and their hierarchical structure is not an easy problem, either astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics, and putting them into a comprehensive, structured model, could help to establish causality[2] and the significance of a correlation better than producing numerous scatter plots with lines from simple regression analysis.
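As a minimal sketch of the PCA whitening mentioned above (my own illustration on synthetic data; the post provides no code), correlated data can be rotated onto its principal axes and rescaled so that the result has identity covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])                          # correlated 2D data
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=1000)

Xc = X - X.mean(axis=0)                               # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)     # principal directions in Vt

# Project onto the principal axes and rescale each axis to unit variance.
Z = (Xc @ Vt.T) / (S / np.sqrt(len(X) - 1))

print(np.round(np.cov(Z, rowvar=False), 2))           # → identity matrix
```

Whitening removes exactly the kind of linear (possibly spurious) correlation discussed above, without the ad hoc subtraction that can push values negative.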

In order to understand why statisticians studied ANCOVA, or ANOVA (ANalysis Of VAriance) more generally, you might want to check this page[3] in addition to the material in wiki:ANCOVA, and feed your search engine keywords of interest alongside ANCOVA to narrow down the results.

From the linear-model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to the other factors in the model, can be tested later when one fully fits and analyzes the statistical model. Suppose one wants to model, say, rotation speed (an indicator of dark matter occupation) as a function of redshift, the degree of spirality, and the number of companions; this is a very hypothetical proposal, owing to my lack of knowledge in observational cosmology. I only want to point out that the model-fitting problem can be seen through statistical modeling like ANCOVA by identifying covariates and relationships: the covariate z is continuous, the degree of spirality is a fixed effect (0 to 7, eight levels), and the number of companions is a random effect (cluster size is random), so the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which may be the main drivers of the correlation, even though Y and X appear highly correlated in the scatter plot. At some point we must marginalize over unknowns. Nonetheless, there remain nuisance parameters and latent variables that can be factored into the model, rather than ignored, to obtain deeper insights and knowledge from observations in many measures/dimensions.
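The hypothetical model above can be sketched as an ANCOVA-style linear model: dummy columns for a categorical factor plus a continuous covariate z, fit jointly by least squares. The data, group labels, and coefficient values here are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
z = rng.uniform(0.0, 2.0, size=n)           # continuous covariate (redshift)
group = rng.integers(0, 3, size=n)          # categorical factor, 3 levels
true_intercepts = np.array([1.0, 2.0, 3.0])
y = true_intercepts[group] + 0.5 * z + rng.normal(scale=0.2, size=n)

# Design matrix: one dummy column per group level (no global intercept),
# plus the covariate z shared across groups.
D = np.column_stack([(group == g).astype(float) for g in range(3)] + [z])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)

# First three entries estimate the group intercepts; the last, the slope on z.
print(np.round(beta, 2))
```

A proper treatment of a random effect (such as the number of companions) would require a mixed-effects model; this fixed-effects-only sketch is just the simplest structured alternative to pairwise scatter plots.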

What could, I think, be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is instead done in astronomy with numerous scatter plots, with colors, symbols, and lines meant to account for all possible relationships within pairs whose correlation may be artificial. I also feel that trees, electricity, and effort could be saved by not producing so many nice-looking scatter plots. Fitting and analyzing more comprehensive models, cast in a statistical framework, helps to identify the independent variables or covariates driving strong correlations, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can then be used to compare models, test the significance of their variables, and compute error bars appropriately, rather than through the usual misinterpretation of the null hypothesis probability.
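One of the model-comparison tools just mentioned, a p-value for nested linear models, can be illustrated with a standard F-test. The data are synthetic and x2 is pure noise, so the test should usually find no evidence for keeping it:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                   # pure-noise predictor
y = 1.0 + 2.0 * x1 + rng.normal(size=n)   # x2 plays no role in the truth

def rss(X):
    """Residual sum of squares of an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

ones = np.ones(n)
rss_reduced = rss(np.column_stack([ones, x1]))       # y ~ 1 + x1
rss_full = rss(np.column_stack([ones, x1, x2]))      # y ~ 1 + x1 + x2

# F-statistic for the one extra parameter of the full model
df1, df2 = 1, n - 3
F = ((rss_reduced - rss_full) / df1) / (rss_full / df2)
p = stats.f.sf(F, df1, df2)
print(F, p)
```

The same machinery drops redundant predictors one nested comparison at a time, which is the variable-selection step the paragraph above contrasts with eyeballing scatter plots.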

Lastly, ANCOVA is a complete [MADS]. :)

  1. This is not an absolute statement but a personal impression formed by reading articles in various fields in addition to astronomy. My reading of other fields suggests that many rely on correlation statistics but fewer on scatter plots with straight lines drawn through the data sets to impose relationships within variable pairs.
  2. The way that chi-square fitting is done and the goodness-of-fit test is carried out reflects the notion that X causes Y, through the practice of minimizing the objective function, the sum of (Y-E[Y|X])^2/σ^2.
  3. It is the website of Vassar College, which had the first female faculty member in astronomy, Maria Mitchell. It is said that the first building constructed there was the Vassar College Observatory, now a national historic landmark. This historical note is the only reason for pointing to this website, to draw some astronomers’ attention beyond statistics.
[ArXiv] Statistical Analysis of fMRI Data
Posted Wed, 02 Sep 2009 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-statistical-analysis-of-fmri/

[arxiv:0906.3662] The Statistical Analysis of fMRI Data by Martin A. Lindquist
Statistical Science, Vol. 23(4), pp. 439-464

This review paper offers information and guidance on statistical image analysis for fMRI data that can be extended to astronomical image data. I think fMRI data present challenges similar to those of astronomical images. As Lindquist says, collaboration helps to find shortcuts. I hope that introducing this paper furthers networking and collaboration between statisticians and astronomers.

List of similarities

  • data acquisition: data are read in the frequency domain and images reconstructed via the inverse Fourier transform (to my naive eyes, this looks similar to power spectrum analysis of cosmic microwave background (CMB) data).
  • amplitudes or coefficients are what get kept and analyzed, not phases or wavelets.
  • understanding the data: brain physiology, or physics such as cosmological models, describes the data-generating mechanism.
  • limits in, and trade-offs between, spatial and temporal resolution.
  • understanding/modeling noise and signal.

These similarities seem common to statistical analysis of images, whether from fMRI or from telescopes. Notwithstanding, no astronomer can (or wants to) carry out experimental design; this may be a huge difference between medical and astronomical image analysis. My emphasis is that, because of these commonalities, strategies in preprocessing and data analysis for fMRI data can be shared with astronomical observations and vice versa. Some sloggers may want to check Section 6, which covers various statistical models and methods for spatial and temporal data.
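The first similarity in the list, acquisition in the frequency domain followed by inverse Fourier reconstruction, can be mimicked in a couple of lines (my own toy sketch: noise-free and fully sampled, whereas real k-space data are complex-valued, noisy, and often undersampled):

```python
import numpy as np

rng = np.random.default_rng(4)
image = rng.random((64, 64))           # stand-in for a brain slice or sky patch

kspace = np.fft.fft2(image)            # what the instrument effectively records
recovered = np.fft.ifft2(kspace).real  # reconstruction by inverse transform

print(np.allclose(recovered, image))   # → True
```

Everything interesting in either field (denoising, deconvolution, undersampling artifacts) happens between these two lines, which is precisely where the shared statistical modeling comes in.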

I’ll simply end this posting with the following quotes, which say that statisticians play a critical role in scientific image analysis. :)

There are several common objectives in the analysis of fMRI data. These include localizing regions of the brain activated by a task, determining distributed networks that correspond to brain function and making predictions about psychological or disease states. Each of these objectives can be approached through the application of suitable statistical methods, and statisticians play an important role in the interdisciplinary teams that have been assembled to tackle these problems. This role can range from determining the appropriate statistical method to apply to a data set, to the development of unique statistical methods geared specifically toward the analysis of fMRI data. With the advent of more sophisticated experimental designs and imaging techniques, the role of statisticians promises to only increase in the future.

A full spatiotemporal model of the data is generally not considered feasible and a number of short cuts are taken throughout the course of the analysis. Statisticians play an important role in determining which short cuts are appropriate in the various stages of the analysis, and determining their effects on the validity and power of the statistical analysis.

[ArXiv] Special Issue from Annals of Applied Statistics
Posted Mon, 09 Feb 2009 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/arxiv-special-issue-from-annals-of-applied-statistics/

When I was studying astronomy (during which I once became a subject in a social science survey about life in a department with extreme gender bias: I was the only female), people often asked me how to forecast the weather or how to predict the future (boys often got questions about becoming astronauts, in addition to weathermen and astrologers). Relating astronomy to earth science still happens. Statisticians I met at conferences often tried to associate my work on astronomical data with that of geologists and meteorologists, who often use stochastic models and spatial-temporal models, dimensional extensions of time-series models. Because of this confusion between astronomy and meteorology/geology/oceanography, and because of the longer history of wide statistical application in the latter subjects (a good counterexample is the least squares method of Gauss, but I cannot think of more examples to contradict my claim that statistics has a rich history of wide use among earth scientists), from time to time I have paid attention to applications and models in those subjects, looking for threads that could carry over to astronomy. Although I dislike the misconception that astronomy equals meteorology or geoscience, those scientific fields share at least one commonality: statistical methods are applied to analyzing satellite data.

There is a special issue on atmospheric science from the Annals of Applied Statistics, offering intriguing discussions for finding common ground between atmospheric science and astronomy. If the general public cannot tell the difference between meteorology and astronomy, despite my standing reply to statisticians’ comments on my interests, “Astronomy and meteorology are very different scientific disciplines,” let’s at least find some similarities in how statistics is applied. Astronomers can find more useful applications in the issue on their own. Here are some interesting ones in my judgment, with their arXiv links. The whole issue’s table of contents is given here: AoAS, vol. 2, issue 4 (2008). Most of the articles are now on arXiv.

[arxiv:0901.3665] Parameter estimation for computationally intensive nonlinear regression with an application to climate modeling
by D. Drignei, C. E. Forest, and D. Nychka
: I direct your attention to the sections on constructing a surrogate for the complex nonlinear climate model.
[arxiv:0901.3670] Interpolating fields of carbon monoxide data using a hybrid statistical-physical model
by A. Malmberg, A. Arellano, D. P. Edwards, N. Flyer, D. Nychka, C. Wikle
: many astronomers would find more similarities in these approaches from reading the abstract than I would; the only difference would be that they are using carbon monoxide data connected with the Earth’s greenhouse effect.
[arxiv:0901.3494] Interpreting self-organizing maps through space-time data models
by H. Sang, A. E. Gelfand, C. Lennard, G. Hegerl, B. Hewitson
: a good reference for astronomers interested in SOMs for high-dimensional data and dimension reduction.

Statistics is the study of uncertainty
Posted Mon, 31 Mar 2008 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2008/statistics-is-the-study-of-uncertainty/

I began to study statistics with the notion that statistics is the study of information (retrieval), and that a part of information is uncertainty, which is taken for granted in our random world. Probably it is the other way around: information is a part of uncertainty. Could this be the difference between Bayesian and frequentist?

The statistician’s task is to articulate the scientist’s uncertainties in the language of probability, and then to compute with the numbers found: cited from The Philosophy of Statistics by Dennis V. Lindley (2000), The Statistician, 49(3), pp. 293-337. The article is a very good read (no theorems and their proofs; it does not begin with “Assume that …”).

The author opens the article by positing that statistics is the study of uncertainty, and the rest is just as agreeable as the quotes given above and below.

Because you do not know how to measure the distance to our moon, it does not follow that you do not believe in the existence of a distance to it. Scientists have spent much effort on the accurate determination of length because they were convinced that the concept of distance made sense in terms of krypton light. Similarly, it seems reasonable to attempt the measurement of uncertainty.

significance level – the probability of some aspect of the data, given H is true
probability – your probability of H, given the data

Many people, especially in scientific matters, think that their statements are objective, expressed through the probability, and are alarmed by the intrusion of subjectivity. Their alarm can be alleviated by considering reality and how that reality is reflected in the probability calculus.

I have often seen the stupid question posed ‘what is an appropriate prior for the variance σ2 of a normal (data) density?’ It is stupid because σ is just a Greek letter.

The statistician’s role is to articulate the client’s preferences in the form of a utility function, just as it is to express their uncertainty through probability,

where clients can be replaced with astronomers.

Upon accepting that statistics is the study of uncertainty, we had better think about what this uncertainty is. Depending on the description of uncertainty, that is, the probability, the quantification of uncertainty changes. As the author mentions, statisticians transcribe the clients’ uncertainty, a responsibility I think astronomers should take on for their own data. Nevertheless, I have come to the impression that astronomers do not care about the subtleties of uncertainty. Generally, the probability model for this uncertainty is built on an independence assumption and at some point is approximated by a Gaussian distribution. Yet there are changes to this tradition, and I frequently observe on arXiv:astro-ph that astronomers are using Bayesian modeling for observed phenomena and reflecting non-Gaussian uncertainty.

I hear that efforts at visualizing uncertainty are in progress. Before codifying anything, I wish those astronomers would be careful about the meaning of the uncertainty and the choice of statistics, i.e., the modeling of the uncertainty.

[ArXiv] 3rd week, Jan. 2008
Posted Fri, 18 Jan 2008 by hlee
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-jan-2008/

Seven preprints were chosen this week and two mentioned model selection.

  • [astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H. J. Deeg (it discusses model selection in section 4.4)
  • [astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (it could be useful for those who do sunspot-cycle modeling)
  • [astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
  • [astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B. C. Kelly et al.
  • [astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K. C. Yee
  • [astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, & D. Veras
  • [astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang