The biggest challenge for a statistician to use astronomical data is the lack of mercy for nonspecialists’ accessing data including format, quantification, and qualification[2] ; and data analysis systems. IDL is costly although it is used in many disciplines and other tools in astronomy are hardly utilized for different projects.[3] In that regards, I welcome astronomers using python to break such exclusiveness in astronomical data analysis systems.
Even if data and software issues are resolved, there’s another barrier to climb. Validation. If you have a catalog, you’ll see variables of measures, and their errors typically reflecting the size of PSF and its convolution to those metrics. If a model of gaussian assumption applied, in order to tabulate power law index, King’s, Petrosian’s, or de Vaucouleurs’ profile index, and numerous metrics, I often fail to find any validation of gaussian assumptions, gaussian residuals, spectral and profile models, outliers, and optimal binning. Even if a data set is publicly available, I also fail to find how to read in raw data, what factors must be considered, and what can be discarded because of unexpected contamination occurred like cosmic rays and charge over flows. How would I validate raw data that are read into a data analysis system is correctly processed to match values in catalogs? How would I know all entries in catalog are ready for further scientific data analysis? Are those sources real? Is p-value appropriately computed?
I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers already processed the raw data and published a catalog of X-ray spectra. Therefore, I believe that the information in the catalog is validated and ready to be used for scientific data analysis. I heard that repeated Capella observation is for the calibration. Generally speaking, in other fields, targets for calibration are almost time invariant and exhibit consistency. If Capella is a same star over the 10 years, the faces in my post should look almost same, within measurement error; but as you saw, it was not consistent at all. Those faces look like observations were made toward different objects. So far I fail to find any validation efforts, explaining why certain ObsIDs of Capella look different than the rest. Are they real Capella? Or can I use this inconsistent facial expression as an evidence that Chandra calibration at that time is inappropriate? Or can I conclude that Capella was a wrong choice for calibration?
Due to the lack of quantification procedure description from the raw data to the catalog, what I decided to do was accessing the raw data and data processing on my own to crosscheck the validity in the catalog entries. The benefit of this effort is that I can easily manipulate data for further statistical inference. Although reading and processing raw data may sound easy, I came across another problem, lack of documentation for nonspecialists to perform the task.
A while ago, I talked about read.table() in R. There are slight different commands and options but without much hurdle, one can read in ascii data in various styles easily with read.table() for exploratory data analysis and confirmatory data analysis with R. From my understanding, statisticians do not spend much time on reading in data nor collecting them. We are interested in methodology to extract information of the population based on sample. While the focus is methodology, all the frustrations with astronomical data analysis softwares occur prior to investigating the best method. The level of frustration reached to the extend of terminating my eagerness for more investigation about inference tools.
In order to assess those Capella observations, thanks to its on-site help, I evoke ciao. Beforehand, I’d like to disclaim that I exemplify ciao to illustrate the culture difference that I experienced as a statistician. It was used to discuss why I think that astronomical data analysis systems are short of documentations and why that astronomical data processing procedures are lack of validation. I must say that I confront very similar problems when I tried to learn astronomical packages such as IRAF and AIPS. Ciao happened to be at handy when writing this post.
In order to understand X-ray data, not only image data files, one also needs effective area (arf), redistribution matrix (rmf), and point spread function (psf). These files are called by calibration data files. If the package was developed for general users, like read.table() I expect there should be a homogenized/centralized data including calibration data reading function with options. Instead, there were various kinds of functions one can use to read in data but the description was not enough to know which one is doing what. What is the functionality of these commands? Which one only stores names of data file? Which one reconfigures the raw data reflecting up to date calibration file? Not knowing complete data structures and classes within ciao, not getting the exact functionality of these data reading functions from ahelp, I was not sure the log likelihood that I computed is appropriate or not.
For example, there are five different ways to associate an arf: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf() from ciao. Except unpack_arf(), I couldn’t understand the difference among these functions for accessing an arf[4] Other softwares including XSPEC that I use, in general, have a single function with options to execute different level of reading in data. Ciao has an extensive web documentation without a tutorial (see my post). So I read all ahelp “commands” a few times. But I still couldn’t decide which one to use for my work to read in arfs and rmfs (I happened to have many calibration data files).
arf | rmf | psf | pha | data | |
---|---|---|---|---|---|
get | get_arf | get_rmf | get_psf | get_pha | get_data |
set | set_arf | set_rmf | set_psf | set_pha | set_data |
unpack | unpack_arf | unpack_rmf | unpack_psf | unpack_pha | unpack_data |
load | load_arf | load_rmf | load_psf | load_pha | load_data |
read | read_arf | read_rmf | read_psf | read_pha | read_data |
[Note that above links may not work since ciao documentation website evolves quickly. Some might be routed to different links so please, check this website for other data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html].
So, I decide to seek for a help through cxc help desk several months back. Their answers are very reliable and prompt. My question was “what are the difference among read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(), where xxx can be data, arf, rmf, and psf?” The answer to this question was that
You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:
http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html
This is a good answer but a big cultural shock to a statistician. It’s like having an answer like “check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html” for IDL users to find out the difference between read.table() and scan(). Probably, for astronomers, all above various data reading commands are self explanatory like R having read.table(), read.csv(), and scan(). Disappointingly, this answer was not I was looking for.
Well, thanks to this embezzlement, hesitation, and some skepticism, I couldn’t move to the next step of implementing fitting methods. At the beginning, I was optimistic when I found out that Ciao 4.0 and up is python compatible. I thought I could do things more in statistically rigorous ways since I can fake spectra to validate my fitting methods. I was thinking about modifying the indispensable chi-square method that is used twice for point estimation and hypothesis testing that introduce bias (a link made to a posting). My goal was make it less biased and robust, less sensitive iid Gaussian residual assumptions. Against my high expectation, I became frustrated at the first step, reading and playing with data to get a better sense and to develop a quick intuition. I couldn’t even make a baby step to my goal. I’m not sure if it a good thing or not, but I haven’t been completely discouraged. Also, time helps gradually to overcome this culture difference, the lack of documentation.
What happens in general is that, if a predecessor says, use “set_arf(),” then the apprentice will use “set_arf()” without doubts. If you begin learning on your own purely relying on documentations, I guess at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes, one can land on miserable situation because one is not sure about his/her choice and one cannot trust the features appeared after these processing. I guess it is natural to feel curiosity about what each of these commands is doing to your data and what information is carried over to the next commands in analysis procedures. It seems righteous to know what command is best for the particular data processing and statistical inference given the data. What I found is that such comparison across commands is missing in documentations. This is why I thought astronomical data analysis systems are short of mercy for nonspecialists.
Another thing I observed is that there seems no documentation nor standard procedure to create the repeatable data analysis results. My observation of astronomers says that with the same raw data, the results by scientist A and B are different (even beyond statistical margins). There are experts and they have knowledge to explain why results are different on the same raw data. However, not every one can have luxury of consulting those few experts. I cannot understand such exclusiveness instead of standardizing the procedures through validations. I even saw that the data that A analyzed some years back can be different from this year’s when he/she writes a new proposal. I think that the time for recreating the data processing and inference procedure to explain/justify/validate the different results or to cover/narrow the gap could have not been wasted if there are standard procedures and its documentation. This is purely a statistician’s thought. As the comment in where is ciao X?[5] not every data analysis system has to have similar design and goals.
Getting lost while figuring out basics (handling, arf, rmf, psf, and numerous case by case corrections) prior to applying any simple statistics has been my biggest obstacle in learning astronomy. The lack of documenting validation methods often brings me frustration. I wonder if there’s any astronomers who lost in learning statistics via R, minitab, SAS, MATLAB, python, etc. As discussed in where is ciao X? I wish there is a centralized tutorial that offers basics, like how to read in data, how to do manipulate datum vector and matrix, how to do arithmetics and error propagation adequately not violating assumptions in statistics (I don’t like the fact that the point estimate of background level is subtracted from observed counts, random variable when the distribution does not permit such scale parameter shifting), how to handle and manipulate fits format files from Chandra for various statistical analysis, how to do basic image analysis, how to do basic spectral analysis, and so on with references[6]
I’ve heard many times about the lack of documentation of this extensive data analysis system, ciao. I saw people still using ciao 3.4 although the new version 4 has been available for many months. Although ciao is not the only tool for Chandra data analysis, it was specifically designed for it. Therefore, I expect it being used frequently with popularity. But the reality is against my expectation. Whatever (fierce) discussion I’ve heard, it has been irrelevant to me because ciao is not intended for statistical analysis. Then, out of sudden, after many months, a realization hit me. ciao is different from other data analysis systems and softwares. This difference has been a hampering factor of introducing ciao outside the Chandra scientist community and of gaining popularity. This difference was the reason I often got lost in finding suitable documentations.
http://cxc.harvard.edu/ciao/ is the website to refer when you start using ciao and manuals are listed here, manuals and memos. The aforementioned difference is that I’m used to see Introduction, Primer, Tutorial, Guide for Beginners at the front page or the manual websites but not from the ciao websites. From these introductory documentations, I can stretch out to other specific topics, modules, tool boxes, packages, libraries, plug-ins, add-ons, applications, etc. Tutorials are the inertia of learning and utilizing data analysis systems. However, the layout of ciao manual websites seems not intended for beginners. It was hard to find basics when some specific tasks with ciao and its tools got stuck. It might be useful only for Chandra scientists who have been using ciao for a long time as references but not beyond. It could be handy for experts instructing novices by working side by side so that they can give better hands-on instruction.
I’ll contrast with other popular data analysis systems and software.
Even if I’ve been navigating the ciao website and its threads high in volume so many times, I only come to realize now that there’s no beginner’s guide to be called as ciao cookbook, ciao tutorial, ciao primer, ciao primer, ciao for dummies, or introduction to ciao at the visible location.
This is a cultural difference. Personal thought is that this tradition prevents none Chandra scientists from using data in the Chandra archive. A good news is that there has been ciao workshops and materials from the workshops are still available. I believe compiling these materials in a fashion that other beginners’ guides introducing the data analysis system can be a good starting point for writing up a front-page worthy tutorial. The existence of this introductory material could embrace more people to use and to explore Chandra X-ray data. I hope these tutorials from other softwares and data analysis systems (primer, cookbook, introduction, tutorial, or ciao for dummies) can be good guide lines to fully compose a ciao primer.
]]>Studying that Smile (subscription needed)
A tutorial on multispectral imaging of paintings using the Mona Lisa as a case study
by Ribes, Pillay, Schmitt, and Lahanier
IEEE Sig. Proc. Mag. Jul. 2008, pp.14-26
Conclusions: In this article, we have presented a tutorial description of the multispectral acquisition of images from a signal processing point of view.
I wonder if there’s literature in astronomy matching this tutorial from which we may expand and improve current astronomical photometry processes by adopting strategies developed by more populated signal/image processing engineers and statisticians. (Considering good textbooks on statistical signal processing, and many fundamental algorithms born thanks to them, I must include statisticians. Although not discussed in this tutorial, Hidden Markov Model (HMM) is often used in signal processing but from ADS, with such keywords, no astronomical publication is aware of HMM – please, confirm my finding that HMM is not used among astronomers because my search scheme is likely imperfect.)
]]>IEEE Signal Processing Magazine Jul. 2007 Vol 24 Issue 4: Bootstrap methods in signal processing
This link will show the table of contents and provide links to articles; however, the access to papers requires IEEE Xplore subscription via libraries or individual IEEE memberships). Here, I’d like to attempt to introduce some articles and tutorials.
Special topic on bootstrap:
The guest editors (A.M. Zoubir & D.R. Iskander)[1] open the issue by providing the rationale, the occasional invalid Gaussian noise assumption, and the consequential complex modeling in their editorial opening, Bootstrap Methods in Signal Processing. A practical approach has been Monte Carlo simulations but the cost of repeating experiments is problematic. The suggested alternative is the bootstrap, which provides tools for designing detectors for various signals subject to noise or interference from unknown distributions. It is said that the bootstrap is a computer-intensive tool for answering inferential questions and this issue serves as tutorials that introduce this computationally intensive statistical method to the signal processing community.
The first tutorial is written by those two guest editors: Bootstrap Methods and Applications, which begins with the list of bootstrap methods and emphasizes its resilience. It discusses the number of bootstrap samples to compensate a simulation (Monte Carlo) error to a statistical error and the sampling methods for dependent data with real examples. The flowchart from Fig. 9 provides the guideline for how to use the bootstrap methods as a summary.
The title of the second tutorial is Jackknifing Multitaper Spectrum Estimates (D.J. Thomson), which introduces the jackknife, multitaper estimates of spectra, and applying the former to the latter with real data sets. The author added the reason for his preference of jackknife to bootstrap and discussed the underline assumptions on resampling methods.
Instead of listing all articles from the special issue, a few astrostatistically notable articles are chosen:
First, besides being mathematically well-grounded with respect to multifractal analysis, wavelet leaders exhibit significantly enhanced statistical performance compared to wavelet coefficients. … Second, bootstrap procedures provide practitioners with satisfactory confidence limits and hypothesis test p-values for multifractal parameters. Third, the computationally cheap percentile method achieves already excellent performance for both confidence limits and tests.
A lecture note presents a new method to capture and represent compressible signals at a rate significantly below the Nyquist rate. This method employs nonadaptive linear projections that preserve the structure of the signal;
I do wish this brief summary assists you selecting a few interesting articles.
In general, Cosmologists are Bayesians and High Energy Physicists are Frequentists.
I thought it was opposite. By the way, if you crave for more papers, click
Most of information about R can be found at R Project including the software itself and many add-on packages. These individually contributed packages serve particular statistical interests of their users. The documentation menu on the website and each packages contain extensive documentations of how-to’s. Some large packages include demos so that following the scripts in a demo makes R learning easy.
For astronomers, the R tutorial from Penn State Summer School for Astronomers will be useful. This tutorial illustrates R with astronomical data sets. Copy-and-pasting command lines will be a good starting point until data structures and programming logics become internalized. R is a fairly simple language to learn if one has a little experience in other programming languages.
A good online tutorial, providing an overview of R, is found from this link. Many user’s interest dependent tutorials available on line. Here are sample images from Taeyoung’s Tutorial (click for pdf).
Among many available textbooks, the followings provide general R usage. More books are available for specific needs.
Also, Introduction to the R Project for Statistical Computing for use at ITC (click for pdf) by D.G. Rossiter provides a short but extensive overview of R.Unfortunately, FITS reader is not available in R. We hope a skillful astronomer to contribute a FITS reader among other packages.
]]>————————————————————————
1/ Python basics:
My favorite Intro to Python website is the Tutorial by Guido van Rossum (python founder):
http://docs.python.org/tut/
I also find other references at this site to be useful: http://docs.python.org
For more complicated questions I often search: http://www.python.org/
2/ Scientific python: numarray, numpy These modules allow one to use APL/IDL- like syntax with matrices (i.e. implicit loops over indices when doing many common operations). They also have some handy scientific functions. Eventually “numarray” will be replaced by “numpy”, but it hasn’t happened yet (because of “pyfits” for fits files). It should happen this year (November??).
Numarray home page: http://www.stsci.edu/resources
Numpy home pages: http://sourceforge.net/project
3/ For fits files, the astronomical programmers at Space Telescope Science Insititute alse wrote “pyfits”: http://www.stsci.edu/resources