The AstroStat Slog » PSF
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

data analysis system and its documentation
http://hea-www.harvard.edu/AstroStat/slog/2009/data-analysis-system-and-its-documentation/
Fri, 02 Oct 2009, by hlee

So far, I haven't complained much about my "statistician learning astronomy" experience. Instead, I have tried to emphasize how fascinating it is, in the hope that more statisticians will join this adventure at a time when statisticians' insights are in demand more than ever. So far, however, this positive tone does not seem to be working. In the two years of this slog's life, there has been no posting by a statistician, except one about BEHR. Statisticians are busy and distracted by other fields with more tangible data sets. Or perhaps, compared to other fields, astronomy puts up too many obstacles and too high a barrier for statisticians to participate. I'd like to talk about these challenges from my end.[1]

The biggest challenge for a statistician who wants to use astronomical data is how unforgiving the field is to nonspecialists, both in accessing the data themselves (format, quantification, and qualification[2]) and in the data analysis systems. IDL is costly, even though it is used in many disciplines, and the other tools in astronomy are hardly usable across different projects.[3] In that regard, I welcome astronomers using Python to break down this exclusiveness in astronomical data analysis systems.

Even when the data and software issues are resolved, there is another barrier to climb: validation. Open a catalog and you will see measured quantities and their errors, which typically reflect the size of the PSF and its convolution into those measurements. When a Gaussian model is assumed in order to tabulate a power law index, a King, Petrosian, or de Vaucouleurs profile index, or numerous other metrics, I often fail to find any validation of the Gaussian assumptions and residuals, of the spectral and profile models, of outliers, or of the binning choices. Even when a data set is publicly available, I also fail to find how to read in the raw data, what factors must be considered, and what can be discarded because of unexpected contamination such as cosmic rays and charge overflows. How would I validate that raw data read into a data analysis system are processed correctly enough to match the values in catalogs? How would I know that all catalog entries are ready for further scientific data analysis? Are those sources real? Are the p-values computed appropriately?

I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers already processed the raw data and published a catalog of X-ray spectra, so I believed the information in the catalog was validated and ready for scientific data analysis. I heard that Capella is observed repeatedly for calibration. Generally speaking, in other fields, calibration targets are almost time invariant and exhibit consistency. If Capella has been the same star over those ten years, the faces in my post should look almost the same, within measurement error; but as you saw, they were not consistent at all. Those faces look as though the observations were made of different objects. So far I have failed to find any validation effort explaining why certain ObsIDs of Capella look different from the rest. Are they really Capella? Can I use this inconsistent facial expression as evidence that the Chandra calibration at that time was inappropriate? Or can I conclude that Capella was a wrong choice for calibration?

Because the procedure leading from the raw data to the catalog is not described quantitatively, I decided to access the raw data and do the processing myself to cross-check the validity of the catalog entries. The benefit of this effort is that I can then easily manipulate the data for further statistical inference. Although reading and processing raw data may sound easy, I ran into another problem: the lack of documentation that would let a nonspecialist perform the task.

A while ago, I talked about read.table() in R. The commands and options differ slightly across formats, but without much of a hurdle one can read ASCII data in various styles with read.table() for exploratory and confirmatory data analysis in R. From my understanding, statisticians do not spend much time on reading in data, nor on collecting them; we are interested in the methodology for extracting information about the population from a sample. While the focus is methodology, all of my frustrations with astronomical data analysis software occurred before I could even investigate the best method. The frustration reached the point of extinguishing my eagerness to investigate inference tools further.

In order to assess those Capella observations, and thanks to the on-site help available to me, I invoked ciao. Before going further, let me be clear that I use ciao only as an example to illustrate the culture difference I experienced as a statistician: why I think astronomical data analysis systems are short on documentation, and why astronomical data processing procedures lack validation. I confronted very similar problems when I tried to learn astronomical packages such as IRAF and AIPS; ciao simply happened to be at hand when writing this post.

In order to understand X-ray data, one needs not only the image or event data files but also the effective area (arf), redistribution matrix (rmf), and point spread function (psf); these are collectively called calibration data files. If the package had been developed for general users, I would expect, by analogy with read.table(), a single centralized reading function, with options, that handles both data and calibration files. Instead, there are various functions one can use to read in data, and the descriptions are not enough to know which one does what. What is the functionality of these commands? Which one only stores the name of a data file? Which one reconfigures the raw data to reflect the up-to-date calibration files? Not knowing the complete data structures and classes within ciao, and not getting the exact functionality of these reading functions from ahelp, I was not sure whether the log likelihood I computed was appropriate or not.

For example, ciao offers five different ways to associate an arf: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf(). Except for unpack_arf(), I could not understand the differences among these functions for accessing an arf.[4] Other software that I use, including XSPEC, generally has a single function with options that covers the different levels of reading in data. Ciao has extensive web documentation but no tutorial (see my post), so I read all of the ahelp "commands" a few times, and I still could not decide which one to use to read in my arfs and rmfs (I happened to have many calibration data files).

        arf           rmf           psf           pha           data
get     get_arf       get_rmf       get_psf       get_pha       get_data
set     set_arf       set_rmf       set_psf       set_pha       set_data
unpack  unpack_arf    unpack_rmf    unpack_psf    unpack_pha    unpack_data
load    load_arf      load_rmf      load_psf      load_pha      load_data
read    read_arf      read_rmf      read_psf      read_pha      read_data

[Note that the links above may not work, since the ciao documentation website evolves quickly; some may be routed to different pages. Please check this page for the data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html]
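For readers who want something concrete to start from, here is a minimal sketch (my own, not taken from the ciao documentation; the call signatures and file names are assumptions to be checked against the ahelp pages) of how the load_*() and unpack_*() families are typically used in Sherpa's Python interface:

```python
# A minimal sketch of a Sherpa (ciao 4.x) Python session; verify signatures with ahelp.
from sherpa.astro.ui import load_pha, load_arf, load_rmf, get_arf, unpack_arf

# load_pha() reads a spectrum and attaches it to a data set (default id=1).
load_pha("capella_obs.pha")        # hypothetical file name

# load_arf()/load_rmf() read calibration files and attach them to that data set,
# whereas unpack_arf() only returns an ARF object without attaching it to anything.
load_arf("capella.arf")            # hypothetical file name
load_rmf("capella.rmf")            # hypothetical file name

standalone_arf = unpack_arf("capella.arf")   # an object to inspect on its own
attached_arf = get_arf()                     # the ARF currently bound to data set 1
print(attached_arf.specresp[:5])             # effective-area values (assumed attribute)
```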

So, several months back, I decided to seek help through the CXC help desk; their answers are reliable and prompt. My question was: "what are the differences among read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(), where xxx can be data, arf, rmf, or psf?" The answer was:

You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:

http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html

This is a good answer but a big cultural shock to a statistician. It is like telling an IDL user who asks about the difference between read.table() and scan() to "check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html". Probably, for astronomers, all of the various data reading commands above are as self-explanatory as read.table(), read.csv(), and scan() are in R. Disappointingly, this answer was not what I was looking for.

Well, thanks to this bewilderment, hesitation, and some skepticism, I could not move on to the next step of implementing fitting methods. At the beginning I was optimistic, once I found out that ciao 4.0 and later is Python compatible: I thought I could work in statistically more rigorous ways, since I could fake spectra to validate my fitting methods. I was thinking about modifying the indispensable chi-square method, which is used twice, for point estimation and for hypothesis testing, in a way that introduces bias (a link was made to an earlier posting). My goal was to make it less biased and more robust, less sensitive to the iid Gaussian residual assumption. Against my high expectations, I became frustrated at the very first step: reading and playing with the data to get a better sense of them and develop some quick intuition. I could not take even a baby step toward my goal. I am not sure whether it is a good thing or not, but I have not been completely discouraged, and time gradually helps me overcome this culture difference, the lack of documentation.
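As a small illustration of the kind of validation I had in mind, here is a generic sketch in plain Python/NumPy (not the experiment described above, and not ciao code): fake Poisson spectra from a known flat model, then compare the usual data-weighted chi-square estimate with the Poisson maximum likelihood estimate. At low counts, the chi-square estimate is biased low.

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 5.0                 # hypothetical expected counts per bin (flat model)
n_bins, n_sims = 100, 2000

chi2_estimates, ml_estimates = [], []
for _ in range(n_sims):
    counts = rng.poisson(true_rate, size=n_bins)
    # Data-weighted chi-square: weights 1/N_i (with the usual fix N_i -> max(N_i, 1)).
    # Minimizing sum_i (N_i - mu)^2 / N_i over a constant mu gives the weighted mean below.
    w = 1.0 / np.maximum(counts, 1)
    chi2_estimates.append(np.sum(w * counts) / np.sum(w))
    # The Poisson maximum likelihood estimate of a constant model is the arithmetic mean.
    ml_estimates.append(counts.mean())

print("true rate           :", true_rate)
print("mean chi-square fit :", np.mean(chi2_estimates))   # noticeably biased low
print("mean Poisson ML fit :", np.mean(ml_estimates))     # essentially unbiased
```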

What happens in general is that if a predecessor says "use set_arf()", the apprentice will use set_arf() without doubt. If you begin learning on your own, relying purely on the documentation, at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes one lands in a miserable situation because one is not sure about the choice and cannot trust the features that appear after the processing. It is natural to be curious about what each of these commands is doing to your data and what information is carried over to the next command in the analysis procedure. It seems only right to know which command is best for the particular data processing and statistical inference at hand. What I found is that such a comparison across commands is missing from the documentation. This is why I thought astronomical data analysis systems show little mercy to nonspecialists.

Another thing I observed is that there seems to be neither documentation nor a standard procedure for producing repeatable data analysis results. My observation of astronomers is that, with the same raw data, the results from scientist A and scientist B differ (even beyond statistical margins). There are experts who know why the results on the same raw data differ, but not everyone has the luxury of consulting those few experts. I cannot understand this exclusiveness when the procedures could instead be standardized through validation. I have even seen that the data A analyzed some years back can differ from this year's version when he or she writes a new proposal. The time spent recreating the data processing and inference procedures to explain, justify, or validate the differing results, or to narrow the gap between them, would not have been wasted if there were standard procedures and documentation for them. This is purely a statistician's thought; as a comment on where is ciao X?[5] pointed out, not every data analysis system has to have similar design and goals.

Getting lost while figuring out the basics (data handling, arf, rmf, psf, and numerous case-by-case corrections) before applying even simple statistics has been my biggest obstacle in learning astronomy, and the lack of documented validation methods often frustrates me. I wonder whether there are astronomers who get equally lost learning statistics via R, Minitab, SAS, MATLAB, Python, etc. As discussed in where is ciao X?, I wish there were a centralized tutorial that offers the basics: how to read in data; how to manipulate data vectors and matrices; how to do arithmetic and error propagation without violating statistical assumptions (I do not like the fact that a point estimate of the background level is subtracted from the observed counts, a random variable, when the distribution does not permit such a parameter shift); how to handle and manipulate FITS files from Chandra for various statistical analyses; how to do basic image analysis; how to do basic spectral analysis; and so on, with references.[6]
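To illustrate the background-subtraction complaint with a toy example (my own sketch in NumPy, not anything from a ciao thread): subtracting a point estimate of the background from Poisson-distributed counts produces "net counts" that can go negative and whose variance no longer matches any Poisson distribution, which is exactly the kind of assumption violation I mean.

```python
import numpy as np

rng = np.random.default_rng(0)
true_source, true_background = 3.0, 2.0   # hypothetical expected counts per bin
n_bins = 10000

# Counts in the source region contain source plus background photons.
observed = rng.poisson(true_source + true_background, size=n_bins)
# Background estimated from an off-source region of equal area (a point estimate).
background_estimate = rng.poisson(true_background, size=n_bins)

net = observed - background_estimate      # the usual "background-subtracted" counts

print("fraction of negative net-count bins:", np.mean(net < 0))
print("variance of net counts             :", net.var())   # ~ source + 2*background
print("variance if net were Poisson(3)    :", true_source)
```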

  1. This is quite an overdue posting. Links and associated content can be outdated.
  2. For classification purposes, data with a clear distinction between response and predictor variables, a so-called training data set, must be given. However, I often fail to get processed data sets ready for statistical analysis. I first spend time reading the data and asking what is an outlier, a bias, or garbage. I am not sure how to clean the data and extract numbers for statistical analysis, and every sub-field of astronomy has its own way of cleaning data before feeding them into statistics and scatter plots. For example, image processing is still carried out case by case via the trained eyes of astronomers. In medical imaging, by contrast, diagnostic specialists provide training sets with which computer vision scientists develop classification algorithms. Such collaboration yields accelerated, automatic, albeit preliminary, diagnostic tools. A small fraction of results from these preliminary methods can still be ambiguous, i.e., false positives or false negatives; when such ambiguous cancerous-cell images near the decision boundary occur, specialists, like trained astronomers, scrutinize them to make the final decision. As medical imaging and its classification algorithms resolve the shortage of experts under an overflow of images, I wish astronomers would adopt their strategies to confront massive streaming images and to assist the sparse ranks of trained astronomers.
  3. Something I would like to see is statistical handling of the background in high energy astrophysics. When simulating a source, the background can be simulated as well, via Markov random fields, kriging, and other spatial statistics methods. In practice, the background is subtracted once in measurement space, and its random nature is not propagated further. Whatever statistical methodology exists to reflect the nature of the background, it is difficult to try out and validate because these tools do not readily accept added statistical modules and packages.
  4. A Sherpa expert told me there is an FAQ on this matter (which I had failed to locate earlier). However, from a data analysis perspective, analogous to the distinction among data frames, vectors, matrices, lists, and other data types in R, the description is not sufficient for someone who wants to learn ciao and to perform scientific (deterministic or stochastic) data analysis via scripting, i.e., handling objects appropriately. You might want to read the entry comparing commands in the Sherpa FAQ.
  5. I know there is ciaox. Apart from the space between ciao and X, there is another distinction that astronomers care less about than statisticians do: the difference between X and x. Typically, the capital letter denotes a random variable and the lower case letter an observation or value.
  6. By the way, there are ciao workshop materials available that could serve as tutorials; please locate them if needed.
Wavelet-regularized image deconvolution
http://hea-www.harvard.edu/AstroStat/slog/2009/wavelet-regularized-image-deconvolution/
Fri, 12 Jun 2009, by hlee

A Fast Thresholded Landweber Algorithm for Wavelet-Regularized Multidimensional Deconvolution
Vonesch and Unser (2008)
IEEE Trans. Image Proc. vol. 17(4), pp. 539-549

Quoting the authors, I would also like to say that recovering the original image from the observed one is an ill-posed problem. They trace the use of wavelet regularization in deconvolution back to a few relatively recent publications by astronomers, so I expect the topic and algorithm of this paper could attract some attention from astronomers.

They explain the wavelet-based reconstruction procedure in simple terms: the matrix-vector product w = Wx yields the coefficients of x in the wavelet basis, and W^T w = W^T W x reconstructs the signal from these coefficients.

Their assumed model is

y = H x_orig + b,

where y and x_orig are vectors containing uniform samples of the measured and original signals, respectively, and b represents the measurement error. H is a square (block) circulant matrix that approximates the convolution with the PSF. The problem of deconvolution is then to find an estimate that minimizes the cost function

J(x) = J_data(x) + λ J_reg(x).

They note that "this functional can also be interpreted as a (negative) log-likelihood in a Bayesian statistical framework, and deconvolution can then be seen as a maximum a posteriori (MAP) estimation problem." The same description of the cost function applies to topics that frequently appear in regression and classification problems, such as ridge regression, quantile regression, LASSO, LAR, model/variable selection, and state space models in time series and spatial statistics.

The observed image is the d-dimensional convolution of an original image (the characteristic function of the object of interest) with the impulse response (or PSF) of the imaging system.

The notion of regularization, or penalizing the likelihood, does not seem well received among astronomers, judging from how often unpenalized chi-square minimization (simple least squares) is suggested and used in astronomical data analysis. Since image analysis with wavelets is popular in astronomy, the fast algorithm for wavelet-regularized variational deconvolution introduced in this paper could bring faster results to astronomers and offer better insight into the underlying physical processes by separating noise and background in a model-based fashion rather than by simple background subtraction.
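For readers curious what minimizing J(x) looks like in practice, here is a minimal NumPy sketch of a thresholded Landweber (ISTA-type) iteration for this cost with an L1 penalty. It is my own illustrative stand-in, not the authors' fast multilevel algorithm, and for simplicity it applies the soft threshold in the pixel basis; thresholding wavelet coefficients w = Wx instead gives the wavelet-regularized variant that the paper accelerates.

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def thresholded_landweber(y, psf, lam=0.01, tau=1.0, n_iter=200):
    """Minimize 0.5*||y - Hx||^2 + lam*||x||_1 by iterative shrinkage (ISTA).

    y and psf are 2-D arrays of the same shape; the PSF is assumed centered and
    normalized to unit sum, so H (circulant, FFT-based) has norm <= 1 and tau = 1 is safe.
    """
    H = np.fft.fft2(np.fft.ifftshift(psf))    # transfer function of the blur
    Hc = np.conj(H)
    x = np.zeros_like(y, dtype=float)
    for _ in range(n_iter):
        residual = y - np.real(np.fft.ifft2(H * np.fft.fft2(x)))             # y - Hx
        grad_step = x + tau * np.real(np.fft.ifft2(Hc * np.fft.fft2(residual)))
        x = soft_threshold(grad_step, lam * tau)   # Landweber step followed by shrinkage
    return x
```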

Beta Profile [Equation of the Week]
http://hea-www.harvard.edu/AstroStat/slog/2008/eotw-beta-profile/
Wed, 04 Jun 2008, by vlk

X-ray telescopes generally work by reflecting photons at grazing incidence. As you can imagine, even small imperfections in the mirror polishing show up as huge roadbumps to the incoming photons, and the higher their energy, the easier it is for them to scatter off their prescribed path. So the PSFs (Point Spread Functions) of X-ray telescopes tend to have sharp peaks and fat tails compared to those of the much better-behaved normal-incidence telescopes, which can be well approximated as Gaussians.

X-ray telescopes usually also have gratings that can be inserted into the light path, so that photons of different energies are dispersed by different angles; their actual energies can then be inferred accurately by measuring how far away on the detector they end up. The accuracy of this inference is usually limited by the width of the PSF, so a major contributor to the LRF (Line Response Function) is the aforementioned scattering.

A correct accounting of the spread of photons of course requires a full-fledged response matrix (RMF), but as it turns out, the line profiles can be fairly well approximated with Beta profiles, which are simply Lorentzians modified by raising them to the power β.

The Beta profile is

f(x) = N / [a B(1/2, β-1/2)] × [1 + ((x-x_0)/a)^2]^(-β),

where x_0 is the line center, a is a width (scale) parameter, B(1/2,β-1/2) is the Beta function, and N is a normalization constant defined such that integrating the Beta profile over the real line gives the area under the curve as N. The parameter β controls the sharpness of the function: the higher the β, the peakier it gets and the less of it gets pushed into the wings. Chandra LRFs are usually well modeled with β~2.5, and XMM/RGS appears to require Lorentzians, β~1.

The form of the Lorentzian may also be familiar to people as the Cauchy distribution, which you get, for example, when you take the ratio of two quantities distributed as zero-centered Gaussians. Note that the mean and variance are undefined for that distribution.
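A minimal sketch (my own, in Python with SciPy; parameter names are illustrative) of the profile as written above, which also checks the normalization and the β=1 Lorentzian/Cauchy limit:

```python
import numpy as np
from scipy.special import beta as beta_fn

def beta_profile(x, x0=0.0, a=1.0, beta=2.5, norm=1.0):
    """Beta (modified Lorentzian) line profile; integrates to `norm` over the real line."""
    return norm / (a * beta_fn(0.5, beta - 0.5)) * (1.0 + ((x - x0) / a) ** 2) ** (-beta)

x = np.linspace(-50, 50, 200001)
dx = x[1] - x[0]

# Numerical check of the normalization for a Chandra-like beta ~ 2.5 ...
print((beta_profile(x, beta=2.5) * dx).sum())   # ~ 1
# ... and for beta = 1, where B(1/2, 1/2) = pi and the profile is the Cauchy/Lorentzian.
print((beta_profile(x, beta=1.0) * dx).sum())   # ~ 0.99; the fat tails converge slowly
```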

Equation of the Week: Confronting data with model
http://hea-www.harvard.edu/AstroStat/slog/2008/eotw/
Fri, 02 May 2008, by vlk

Starting a new feature: highlighting an equation that is widely used in astrophysics or astrostatistics. Today's featured equation: what instruments do to incident photons.

Almost all of astrophysics is about inferring some characteristic of the source from the light gathered by some instrument. The instruments usually give us information on three quantities of interest: when the photon arrived, where it landed, and what its color was. In X-ray astronomy in particular, these are usually recorded as a lengthy table of "events", where each row lists (t, p, E): t is the arrival time of the photon (which can be measured with a precision anywhere from microseconds to seconds), p=(x,y) says on which pixel of the detector the photon landed, and E represents how energetic the photon was. Generally, both p and E are blurred, the former because of the intrinsic limitations of how well the telescope can focus and how well the detector can localize, and the latter because of limitations in evaluating the total energy deposited in the detector by the incoming photon. So, in general, the expected signal at the detector can be written as follows:
M(E',p',t) = ∫dE ∫dp S(E,p,t;theta) A(E,p') P(p,p';E,t) R(E,E';p),

where E is the energy of the incoming photon and E' is what it is registered as, p is the true sky location and p' is the registered location, S(E,p,t;theta) is the astrophysical signal, A(E,p') is the effective area of the telescope/detector combination, P(p,p';E,t) is the so-called Point Spread Function, and R(E,E';p) is the energy response matrix.

The expected signal, M(E',p',t), is then compared with the collected data to infer the values of, and uncertainties in, theta, the parameters that define the source model. The task of calibration scientists is to compute A, R, and P accurately enough that the residual systematic errors are smaller than the typical statistical error in the determination of theta.
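As a toy illustration of this folding (my own sketch in NumPy, with the spatial part dropped and everything discretized into energy bins; none of the arrays correspond to a real instrument), the expected counts in measured-energy channels are the model spectrum weighted by the effective area and pushed through the redistribution matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
n_true, n_chan = 500, 300                    # true-energy bins and detector channels (toy sizes)

energies = np.linspace(0.3, 10.0, n_true)    # keV, hypothetical grid
source = 1e-2 * energies ** -1.7             # S(E; theta): toy power-law photon spectrum
arf = 400.0 * np.exp(-0.5 * ((energies - 1.5) / 2.0) ** 2)   # A(E): toy effective area, cm^2

# R(E, E'): each column is the probability distribution of the measured channel
# given a true energy; here a Gaussian blur stands in for the energy response.
channels = np.linspace(0.3, 10.0, n_chan)
rmf = np.exp(-0.5 * ((channels[:, None] - energies[None, :]) / 0.15) ** 2)
rmf /= rmf.sum(axis=0, keepdims=True)        # columns sum to 1

exposure, dE = 50000.0, energies[1] - energies[0]        # seconds, bin width in keV
expected_counts = rmf @ (arf * source) * exposure * dE   # M(E'): the folded model
observed = rng.poisson(expected_counts)                  # what the detector records

print(expected_counts.sum(), observed.sum())
```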

In general, the calibration products also vary with time, but usually these variations are negligible over the duration of an observation, so you can leave out the t from them. Not always though!

[ArXiv] 1st week, Dec. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-1st-week-dec-2007/
Fri, 07 Dec 2007, by hlee

There was only one day in the first week of December with no preprint appearance. I hope that dubbing the week of Dec. 2nd the first week will be accepted.

  • [stat.ML:0711.4983]
    A Method for Compressing Parameters in Bayesian Models with Application to Logistic Sequence Prediction Models L. Li and R. M. Neal
  • [astro-ph:0711.4886]
    Requirements on PSF Calibration for Dark Energy from Cosmic Shear S. Paulin-Henriksson et al.
  • [astro-ph:0711.4895]
    The impact of going beyond the Maxwell distribution in direct dark matter detection rates J. D. Vergados, S. H. Hansen, and O. Host
  • [astro-ph:0712.0003]
    The Galaxy Cross-Correlation Function as a Probe of the Spatial Distribution of Galactic Satellites J. Chen
  • [stat.ME:0712.0283]
    Wavelet methods in statistics: Some recent developments and their applications A. Antoniadis
  • [stat.ML:0712.0189]
    Summarization and Classification of Non-Poisson Point Processes J. Picka and M. Deng
  • [astro-ph:0712.0588]
    SZ and CMB reconstruction using Generalized Morphological Component Analysis J. Bobin et al.
  • [astro-ph:0712.0610]
    X-Atlas: An Online Archive of Chandra’s Stellar High Energy Transmission Gratings Observations O. W. Westbrook et al.
  • [astro-ph:0712.0618]
    Precision of Hubble constant derived using black hole binary absolute distances and statistical redshift information C. L. MacLeod and C. J. Hogan
[ArXiv] 3rd week, Oct. 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-3rd-week-oct-2007/
Fri, 19 Oct 2007, by hlee

Quite a few interesting papers appeared on arXiv this week, including a theoretical statistics paper that no astronomer should miss. To find that paper and the others, please read on.

  • [astro-ph:0710.2371]
    Probing Non-Gaussianity In The Cosmic Microwave Background Anisotropies: One Point Distribution Function by E. Jeong, and G. F. Smoot
  • [astro-ph:0710.2397]
    Initial-Final Mass Relationship for Stars of Different Metallicities by X. Meng and Z. Chen
  • [astro-ph:0710.2436]
    Optical afterglows of gamma-ray bursts: a bimodal distribution? by N. Marco, G. Gabriele, and G. Giancarlo
  • [astro-ph:0710.2455]
    Superfluid turbulence and pulsar glitch statistics by A. Melatos and C. Peralta
  • [astro-ph:0710.2664]
    Reconstructing Images from Projections Using the Maximum-Entropy Method. Numerical Simulations of Low-Aspect Astrotomography by A. T. Bajkova
  • [astro-ph:0710.2783]
    Void Statistics and Void Galaxies in the 2dFGRS by A. M. von Benda-Beckmann and V. Mueller
  • [astro-ph:0710.2872] summer school material
    Geometrical constraints on dark energy models by R. Lazkoz
  • [astro-ph:0710.2882]
    The IMF of the massive star-forming region NGC 3603 from NIR AO observations by Y. Harayama, F. Eisenhauer, and F. Martins
  • [hep-ph:0710.1952]
    Global neutrino parameter estimation using Markov Chain Monte Carlo by S. Hannestad
  • [astro-ph:0710.3099]*
    Short Gamma Ray Bursts: a bimodal origin? by R. Salvaterra, A. Cerutti, G. Chincarini, M. Colpi, C. Guidorzi, and P. Romano
  • [astro-ph:0710.2397]
    Clues to Globular Cluster Evolution from Multiwavelength Observations of Extragalactic Systems by A. Kundu, T. J. Maccarone, and S. E. Zepf
    (I’ve been keen on globular clusters, the best archaeological objects in the universe for understanding galaxy formation and the history of the universe. They are like the genomes of a galaxy.)
  • [stat.TH:0710.3478]
    Nonparametric estimation of a point-spread function in multivariate problems by P. Hall and P. Qiu

[1]: Personally, I find that the point spread function (PSF) plays a major role in extracting astrophysical information from raw observations.

[2]: In addition to this week’s GRB paper, [astro-ph:0710.3099]*, there were some statistical studies about multi-modality in GRBs [see Three classes of GRBs].

I think that (astro)physics provides the templates, such as likelihoods, for a statistical analysis. The statistical dispute between bimodality and trimodality of the GRB distribution can be settled by (1) more studies like [astro-ph:0710.3099]*, or (2) better statistical inference tools and model selection. Unfortunately, (2) requires templates, which are acquired from in-depth cross-matching survey studies.
