The AstroStat Slog » ciao

Everybody needs crampons

vlk — Fri, 30 Apr 2010 16:12:36 +0000

Sherpa is a fitting environment in which Chandra data (and really, X-ray data from any observatory) can be analyzed. It has just undergone a major update and now runs on python. Or allows python to run. Something like that. It is a very powerful tool, but I can never remember how to use it, and I have an amazing knack for not finding what I need in the documentation. So here is a little cheat sheet (which I will keep updating ~~as and when~~ if I learn more):

2010-apr-30: Aneta has setup a blogspot site to deal with simple Sherpa techniques and tactics: http://pysherpa.blogspot.com/

On Help:

In general, to get help, use: ahelp "something" (note the quotes)
Even more useful, type: ? wildcard to get a list of all commands that include the wildcard
You can also do a form of autocomplete: type TAB after writing half a command to get a list of all possible completions.

Data I/O:

To read in your PHA file, use: load_pha()
Often for Chandra spectra, the background is included in that same file. In any case, to read it in separately, use: load_bkg()
- Q: should it be loaded in to the same dataID as the source?
- A: Yes.
- A: When the background counts are present in the same file, they can be read in separately and assigned to the background via set_bkg('src',get_data('bkg')), so counts from a different file can be assigned as background to the current spectrum.
To read in the corresponding ARF, use: load_arf()
- Q: load_bkg_arf() for the background — should it be done before or after load_bkg(), or does it matter?
- A: does not matter
To read in the corresponding RMF, use: load_rmf()
- Q: load_bkg_rmf() for the background, and same question as above
- A: same answer as above; does not matter.
To see the structure of the data, type: print(get_data()) and print(get_bkg())
To select a subset of channels to analyze, use: notice_id()
To subtract background from source data, use: subtract()
To not subtract, to undo the subtraction, etc., use: unsubtract()
To plot useful stuff, use: plot_data(), plot_bkg(), plot_arf(), plot_model(), plot_fit(), etc.
(Q: how in god’s name does one avoid plotting those damned error bars? I know error bars are necessary, but when I have a spectrum with 8192 bins, I don’t want it washed out with silly 1-sigma Poisson bars. And while we are asking questions, how do I change the units on the y-axis to counts/bin? A: rumors say that plot_data(1,yerr=0) should do the trick, but it appears to be still in the development version.)

Fitting:

To fit model to the data, command it to: fit()
To get error bars on the fit parameters, use: projection() (or covar(), but why deliberately use a function that is guaranteed to underestimate your error bars?)
Defining models appears to be much easier now. You can use syntax like: set_source(ModelName.ModelID+AnotherModel.ModelID2) (where you can distinguish between different instances of the same type of model using the ModelID — e.g., set_source(xsphabs.abs1*powlaw1d.psrc+powlaw1d.pbkg))
To see what the model parameter values are, type: print(get_model())
To change statistic, use: set_stat() (options are various chisq types, cstat, and cash)
To change the optimization method, use: set_method() (options are levmar, moncar, neldermead, simann, simplex)

Timestamps:
v1:2007-dec-18
v2:2008-feb-20
v3:2010-apr-30

data analysis system and its documentation

hlee — Fri, 02 Oct 2009 02:11:04 +0000

So far, I didn’t complain much related to my “statistician learning astronomy” experience. Instead, I’ve been trying to emphasize how fascinating it is. I hope that more statisticians can join this adventure when statisticians’ insights are on demand more than ever. However, this positivity seems not working so far. In two years of this slog’s life, there’s no posting by a statistician, except one about BEHR. Statisticians are busy and well distracted by other fields with more tangible data sets. Or compared to other fields, too many obstacles and too high barriers exist in astronomy for statisticians to participate. I’d like to talk about these challenges from my ends.^[1]

The biggest challenge for a statistician to use astronomical data is the lack of mercy for nonspecialists’ accessing data including format, quantification, and qualification^[2] ; and data analysis systems. IDL is costly although it is used in many disciplines and other tools in astronomy are hardly utilized for different projects.^[3] In that regards, I welcome astronomers using python to break such exclusiveness in astronomical data analysis systems.

Even if data and software issues are resolved, there’s another barrier to climb. Validation. If you have a catalog, you’ll see variables of measures, and their errors typically reflecting the size of PSF and its convolution to those metrics. If a model of gaussian assumption applied, in order to tabulate power law index, King’s, Petrosian’s, or de Vaucouleurs’ profile index, and numerous metrics, I often fail to find any validation of gaussian assumptions, gaussian residuals, spectral and profile models, outliers, and optimal binning. Even if a data set is publicly available, I also fail to find how to read in raw data, what factors must be considered, and what can be discarded because of unexpected contamination occurred like cosmic rays and charge over flows. How would I validate raw data that are read into a data analysis system is correctly processed to match values in catalogs? How would I know all entries in catalog are ready for further scientific data analysis? Are those sources real? Is p-value appropriately computed?

I posted an article about Chernoff faces applied to Capella observations from Chandra. Astronomers already processed the raw data and published a catalog of X-ray spectra. Therefore, I believe that the information in the catalog is validated and ready to be used for scientific data analysis. I heard that repeated Capella observation is for the calibration. Generally speaking, in other fields, targets for calibration are almost time invariant and exhibit consistency. If Capella is a same star over the 10 years, the faces in my post should look almost same, within measurement error; but as you saw, it was not consistent at all. Those faces look like observations were made toward different objects. So far I fail to find any validation efforts, explaining why certain ObsIDs of Capella look different than the rest. Are they real Capella? Or can I use this inconsistent facial expression as an evidence that Chandra calibration at that time is inappropriate? Or can I conclude that Capella was a wrong choice for calibration?

Due to the lack of quantification procedure description from the raw data to the catalog, what I decided to do was accessing the raw data and data processing on my own to crosscheck the validity in the catalog entries. The benefit of this effort is that I can easily manipulate data for further statistical inference. Although reading and processing raw data may sound easy, I came across another problem, lack of documentation for nonspecialists to perform the task.

A while ago, I talked about read.table() in R. There are slight different commands and options but without much hurdle, one can read in ascii data in various styles easily with read.table() for exploratory data analysis and confirmatory data analysis with R. From my understanding, statisticians do not spend much time on reading in data nor collecting them. We are interested in methodology to extract information of the population based on sample. While the focus is methodology, all the frustrations with astronomical data analysis softwares occur prior to investigating the best method. The level of frustration reached to the extend of terminating my eagerness for more investigation about inference tools.

In order to assess those Capella observations, thanks to its on-site help, I evoke ciao. Beforehand, I’d like to disclaim that I exemplify ciao to illustrate the culture difference that I experienced as a statistician. It was used to discuss why I think that astronomical data analysis systems are short of documentations and why that astronomical data processing procedures are lack of validation. I must say that I confront very similar problems when I tried to learn astronomical packages such as IRAF and AIPS. Ciao happened to be at handy when writing this post.

In order to understand X-ray data, not only image data files, one also needs effective area (arf), redistribution matrix (rmf), and point spread function (psf). These files are called by calibration data files. If the package was developed for general users, like read.table() I expect there should be a homogenized/centralized data including calibration data reading function with options. Instead, there were various kinds of functions one can use to read in data but the description was not enough to know which one is doing what. What is the functionality of these commands? Which one only stores names of data file? Which one reconfigures the raw data reflecting up to date calibration file? Not knowing complete data structures and classes within ciao, not getting the exact functionality of these data reading functions from ahelp, I was not sure the log likelihood that I computed is appropriate or not.

For example, there are five different ways to associate an arf: read_arf(), load_arf(), set_arf(), get_arf(), and unpack_arf() from ciao. Except unpack_arf(), I couldn’t understand the difference among these functions for accessing an arf^[4] Other softwares including XSPEC that I use, in general, have a single function with options to execute different level of reading in data. Ciao has an extensive web documentation without a tutorial (see my post). So I read all ahelp “commands” a few times. But I still couldn’t decide which one to use for my work to read in arfs and rmfs (I happened to have many calibration data files).

	arf	rmf	psf	pha	data
get	get_arf	get_rmf	get_psf	get_pha	get_data
set	set_arf	set_rmf	set_psf	set_pha	set_data
unpack	unpack_arf	unpack_rmf	unpack_psf	unpack_pha	unpack_data
load	load_arf	load_rmf	load_psf	load_pha	load_data
read	read_arf	read_rmf	read_psf	read_pha	read_data

[Note that above links may not work since ciao documentation website evolves quickly. Some might be routed to different links so please, check this website for other data reading commands: cxc.harvard.edu/sherpa/ahelp/index_alphabet.html].

So, I decide to seek for a help through cxc help desk several months back. Their answers are very reliable and prompt. My question was “what are the difference among read_xxx(), load_xxx(), set_xxx(), get_xxx(), and unpack_xxx(), where xxx can be data, arf, rmf, and psf?” The answer to this question was that

You can find detailed explanations for these Sherpa commands in the “ahelp” pages of the Sherpa website:

http://cxc.harvard.edu/sherpa/ahelp/index_alphabet.html

This is a good answer but a big cultural shock to a statistician. It’s like having an answer like “check http://www.r-project.org/search.html and http://cran.r-project.org/doc/FAQ/R-FAQ.html” for IDL users to find out the difference between read.table() and scan(). Probably, for astronomers, all above various data reading commands are self explanatory like R having read.table(), read.csv(), and scan(). Disappointingly, this answer was not I was looking for.

Well, thanks to this embezzlement, hesitation, and some skepticism, I couldn’t move to the next step of implementing fitting methods. At the beginning, I was optimistic when I found out that Ciao 4.0 and up is python compatible. I thought I could do things more in statistically rigorous ways since I can fake spectra to validate my fitting methods. I was thinking about modifying the indispensable chi-square method that is used twice for point estimation and hypothesis testing that introduce bias (a link made to a posting). My goal was make it less biased and robust, less sensitive iid Gaussian residual assumptions. Against my high expectation, I became frustrated at the first step, reading and playing with data to get a better sense and to develop a quick intuition. I couldn’t even make a baby step to my goal. I’m not sure if it a good thing or not, but I haven’t been completely discouraged. Also, time helps gradually to overcome this culture difference, the lack of documentation.

What happens in general is that, if a predecessor says, use “set_arf(),” then the apprentice will use “set_arf()” without doubts. If you begin learning on your own purely relying on documentations, I guess at some point you have to make a choice. One can make a lucky guess and move forward quickly. Sometimes, one can land on miserable situation because one is not sure about his/her choice and one cannot trust the features appeared after these processing. I guess it is natural to feel curiosity about what each of these commands is doing to your data and what information is carried over to the next commands in analysis procedures. It seems righteous to know what command is best for the particular data processing and statistical inference given the data. What I found is that such comparison across commands is missing in documentations. This is why I thought astronomical data analysis systems are short of mercy for nonspecialists.

Another thing I observed is that there seems no documentation nor standard procedure to create the repeatable data analysis results. My observation of astronomers says that with the same raw data, the results by scientist A and B are different (even beyond statistical margins). There are experts and they have knowledge to explain why results are different on the same raw data. However, not every one can have luxury of consulting those few experts. I cannot understand such exclusiveness instead of standardizing the procedures through validations. I even saw that the data that A analyzed some years back can be different from this year’s when he/she writes a new proposal. I think that the time for recreating the data processing and inference procedure to explain/justify/validate the different results or to cover/narrow the gap could have not been wasted if there are standard procedures and its documentation. This is purely a statistician’s thought. As the comment in where is ciao X?^[5] not every data analysis system has to have similar design and goals.

Getting lost while figuring out basics (handling, arf, rmf, psf, and numerous case by case corrections) prior to applying any simple statistics has been my biggest obstacle in learning astronomy. The lack of documenting validation methods often brings me frustration. I wonder if there’s any astronomers who lost in learning statistics via R, minitab, SAS, MATLAB, python, etc. As discussed in where is ciao X? I wish there is a centralized tutorial that offers basics, like how to read in data, how to do manipulate datum vector and matrix, how to do arithmetics and error propagation adequately not violating assumptions in statistics (I don’t like the fact that the point estimate of background level is subtracted from observed counts, random variable when the distribution does not permit such scale parameter shifting), how to handle and manipulate fits format files from Chandra for various statistical analysis, how to do basic image analysis, how to do basic spectral analysis, and so on with references^[6]

This is quite an overdue posting. Links and associated content can be outdated.
For the classification purpose, data with clear distinction between response and predictor variables so called a training data set must be given. However, I often fail to get processed data sets for statistical analysis. I first spend time to read data and question what is outlier, bias, or garbage. I’m not sure how to clean and extract numbers for statistical analysis and every sub-field in astronomy have their own way to clean to be fed into statistics and scatter plots. For example, image processing is still executed case by case via trained eyes of astronomers. On the other hand, in medical imaging diagnosis specialists offer training sets with which scientists in computer vision develop algorithms for classification. Such collaboration yields accelerated, automatic but preliminary diagnosis tools. A small fraction of results from these preliminary methods still can be ambiguous, i.e. false positive or false negative. Yet, when such ambiguous cancerous cell images at the decision boundaries occur, specialists like trained astronomers scrutinize those images to make a final decision. As medical imaging and its classification algorithms resolve the issue of expert shortage under overflowing images, I wish astronomers adopt their strategies to confront massive streaming images and to assist sparse trained astronomers
Something I like to see is handling background statistically in high energy astrophysics. When simulating a source, background can be simulated as well via Makov Random field, kriging, and other spatial statistics methods. In reality, background is subtracted once in measurement space and the random nature of background is not interactively reflected. Regardless of available statistical methodology to reflect the nature of background, it is difficult to implement it for trial and validation because those tools are not compatible for adding statistical modules and packages.
A Sherpa expert told me there is an FAQ (I failed to locate previously) on this matter. However, from data analysis perspective like a distinction between data.structure, vector, matrix, list and other data types in R, the description is not sufficient for someone who wants to learn ciao and to perform scientific (both deterministic or stochastic) data analysis via scripting i.e. handling objects appropriately. You might want to read comparing commands in Sharpa from Shepa FAQ
I know there is ciaox. Apart from the space between ciao and X, there is another difference that astronomers do not care much compared to statisticians: the difference between X and x. Typically, the capital letter is for random variable and lower case letters for observation or value
By the way, there are ciao workshop materials available that could function as tutorials. Please, locate them if needed.

Where is ciao X ?

hlee — Thu, 30 Jul 2009 06:57:00 +0000

X={ primer, tutorial, cookbook, Introduction, guidebook, 101, for dummies, …}

I’ve heard many times about the lack of documentation of this extensive data analysis system, ciao. I saw people still using ciao 3.4 although the new version 4 has been available for many months. Although ciao is not the only tool for Chandra data analysis, it was specifically designed for it. Therefore, I expect it being used frequently with popularity. But the reality is against my expectation. Whatever (fierce) discussion I’ve heard, it has been irrelevant to me because ciao is not intended for statistical analysis. Then, out of sudden, after many months, a realization hit me. ciao is different from other data analysis systems and softwares. This difference has been a hampering factor of introducing ciao outside the Chandra scientist community and of gaining popularity. This difference was the reason I often got lost in finding suitable documentations.

http://cxc.harvard.edu/ciao/ is the website to refer when you start using ciao and manuals are listed here, manuals and memos. The aforementioned difference is that I’m used to see Introduction, Primer, Tutorial, Guide for Beginners at the front page or the manual websites but not from the ciao websites. From these introductory documentations, I can stretch out to other specific topics, modules, tool boxes, packages, libraries, plug-ins, add-ons, applications, etc. Tutorials are the inertia of learning and utilizing data analysis systems. However, the layout of ciao manual websites seems not intended for beginners. It was hard to find basics when some specific tasks with ciao and its tools got stuck. It might be useful only for Chandra scientists who have been using ciao for a long time as references but not beyond. It could be handy for experts instructing novices by working side by side so that they can give better hands-on instruction.

I’ll contrast with other popular data analysis systems and software.

When I began to use R, I started with R manual page containing this pdf file, Introduction to R. Based on this introductory documentations, I could learn specific task oriented packages easily and could build more my own data analysis tools.
When I began to use Matlab, I was told to get the Matlab primer. Although the current edition is commercial, there are free copies of old editions are available via search engines or course websites. There other tutorials are available as well. After crashing basics of Matlab, it was not difficult to getting right tool boxes for topic specific data analysis and scripting for particular needs.
When I began to use SAS (Statistical Analysis System), people in the business said get the little SAS book which gives the basis of this gigantic system, from which I was able to expend its usage for particular statistical projects.
Recently, I began to learn Python to use many astronomical and statistical data analysis modules developed by various scientists. Python has its tutorials where I can point for basic to fully utilize those task specific modules and my own scripting.
Commericial softwares often come with their own beginners’ guide and demos that a user can follow easily. By acquiring basics from these tutorials, expending applications can be well directed. On the other hands, non-commercial softwares may be lack of extensive but centralized tutorials unlike python and R. Nonetheless, acquiring tutorials for teaching is easy and these unlicensed materials are very handy whenever problems are confronted under various but task specific projects.
I used to have IDL tutorials on which I relied a lot to use some astronomy user libraries and CHIANTI (atomic database). I guess the resources of tutorials have changed dramatically since then.

Even if I’ve been navigating the ciao website and its threads high in volume so many times, I only come to realize now that there’s no beginner’s guide to be called as ciao cookbook, ciao tutorial, ciao primer, ciao primer, ciao for dummies, or introduction to ciao at the visible location.

This is a cultural difference. Personal thought is that this tradition prevents none Chandra scientists from using data in the Chandra archive. A good news is that there has been ciao workshops and materials from the workshops are still available. I believe compiling these materials in a fashion that other beginners’ guides introducing the data analysis system can be a good starting point for writing up a front-page worthy tutorial. The existence of this introductory material could embrace more people to use and to explore Chandra X-ray data. I hope these tutorials from other softwares and data analysis systems (primer, cookbook, introduction, tutorial, or ciao for dummies) can be good guide lines to fully compose a ciao primer.

It bothers me.

hlee — Mon, 17 Nov 2008 17:39:04 +0000

The full description is given http://cxc.harvard.edu/ciao3.4/ahelp/bayes.html about “bayes” under sherpa/ciao^[1]. Some sentences kept bothering me and here’s my account for the reason given outside of quotes.

SUBJECT(bayes) CONTEXT(sherpa)
SYNOPSIS
A Bayesian maximum likelihood function.

Maximum likelihood function is common for both Bayesian and frequentist methods. I don’t know get the point why “Bayesian” is particularly added with “maximum likelihood function.”

DESCRIPTION
(snip)
We can relate this likelihood to the Bayesian posterior density for S(i) and B(i)
using Bayes’ Theorem:

p[S(i),B(i) | N(i,S)] = p[S(i)|B(i)] * p[B(i)] * p[N(i,S) | S(i),B(i)] / p[D] .

The factor p[S(i)|B(i)] is the Bayesian prior probability for the source model
amplitude, which is assumed to be constant, and p[D] is an ignorable normalization
constant. The prior probability p[B(i)] is treated differently; we can specify it
using the posterior probability for B(i) off-source:

p[B(i)] = [ A (A B(i))^N(i,B) / N(i,B)! ] * exp[-A B(i)] ,

where A is an “area” factor that rescales the number of predicted background
counts B(i) to the off-source region.

IMPORTANT: this formula is derived assuming that the background is constant as a
function of spatial area, time, etc. If the background is not constant, the Bayes
function should not be used.

Why not? If I rephrase it, what it said is that B(i) is a constant. Then why one bothers to write p[B(i)], a probability density of a constant? The statement sounds self contradictory to me. I guess B(i) is a constant parameter. It would be suitable to write that Background is homogeneous and the Background is describable with homogeneous Poisson process if the above pdf is a correct model for Background. Also, a slight notation change is required. Assuming the Poisson process, we could estimate the background rate (constant parameter) and its density p[B(i)], and this estimate is a constant as stated for p[S(i)|B(i)], a prior probability for the constant source model amplitude.

I think the reason for “Bayes should not used” is that the current sherpa is not capable of executing hierarchical modeling. Nevertheless, I believe one can script the MCMC methodologies with S-Lang/Python to be aggregated with existing sherpa tools to incorporate a possible space dependent density, p[B(i,x,y)]. I was told that currently a constant background regardless of locations and background subtraction is commonly practiced.

To take into account all possible values of B(i), we integrate, or marginalize,
the posterior density p[S(i),B(i) | N(i,S)] over all allowed values of B(i):

p[S(i) | N(i,S)] = (integral)_0^(infinity) p[S(i),B(i) | N(i,S)] dB(i) .

For the constant background case, this integral may be done analytically. We do
not show the final result here; see Loredo. The function -log p[S(i)|N(i,S)] is
minimized to find the best-fit value of S(i). The magnitude of this function
depends upon the number of bins included in the fit and the values of the data
themselves. Hence one cannot analytically assign a `goodness-of-fit’ measure to a
given value of this function. Such a measure can, in principle, be computed by
performing Monte Carlo simulations. One would repeatedly sample new datasets from
the best-fit model, and fit them, and note where the observed function minimum
lies within the derived distribution of minima. (The ability to perform Monte
Carlo simulations is a feature that will be included in a future version of
Sherpa.)

Note on Background Subtraction

Bayesian computation means one way or the other that one is able to get posterior distributions in the presence of various parameters regardless of their kinds: source or background. I wonder why there’s a discrimination such that source parameter has uncertainty whereas the background is constant and is subtracted (yet marginalization is emulated by subtracting different background counts with corresponding weights). It fell awkward to me. Background counts as well as source counts are Poisson random. I would like to know what justifies constant background while one uses probabilistic approaches via Bayesian methods. I would like to know why the mixture model approach – a mixture of source model and background model with marginalization over background by treating B(i) as a nuisance parameter – has not been tried. By casting eye sights broadly on Bayesian modeling methods and basics of probability, more robustly estimating the source model and their parameters is tractable without subtracting background prior to fitting a source model.

The background should not be subtracted from the data when this function is used
The background only needs to be specified, as in this example:
(snip)

EXAMPLES
EXAMPLE 1
Specify the fitting statistic and then confirm it has been set. The method is then
changed from “Levenberg-Marquardt” (the default), since this statistic does not
work with that algorithm.

sherpa> STATISTIC BAYES
sherpa> SHOW STATISTIC
Statistic: Bayes
sherpa> METHOD POWELL
(snip)

I would like to know why it’s not working with Levenberg-Marquardt (LM) but working with Powell. Any references that explain why LM does not work with Bayes?

I do look forward your comments and references, particularly reasons for Bayesian maximum likelihood function and Bugs with LM. Also, I look forward to see off the norm approaches such as modeling fully in Bayesian ways (like van Dyk et al. 2001, yet I see its application rarely) or marginalizing Background without subtraction but simultaneously fitting the source model. There are plenty of rooms to be improved in source model fitting under contamination and distortion of x-ray photon incidents through space, telescope, and signal transmission.

Note that the current sherpa is beta under ciao 4.0 not under ciao 3.4 and a description about “bayes” from the most recent sherpa is not available yet, which means this post needs updates one new release is available

Mexican Hat [EotW]

vlk — Wed, 28 May 2008 17:00:38 +0000

The most widely used tool for detecting sources in X-ray images, especially Chandra data, is the wavelet-based wavdetect, which uses the Mexican Hat (MH) wavelet. Now, the MH is not a very popular choice among wavelet aficianados because it does not form an orthonormal basis set (i.e., scale information is not well separated), and does not have compact support (i.e., the function extends to inifinity). So why is it used here?

The short answer is, it has a convenient background subtractor built in, is analytically comprehensible, and uses concepts very familiar to astronomers. The last bit can be seen by appealing to Gaussian smoothing. Astronomers are (or were) used to smoothing images with Gaussians, and in a manner of speaking, all astronomical images already come presmoothed by PSFs (point spread functions) that are nominally approximated by Gaussians. Now, if an image were smoothed by another Gaussian of a slightly larger width, the difference between the two smoothed images should highlight those features which are prominent at the spatial scale of the larger Gaussian. This is the basic rationale behind a wavelet.

So, in the following, G(x,y;σ_x,σ_y,x_o,y_o) is a 2D Gaussian written in such that the scaling of the widths and the transposition of the function is made obvious. It is defined over the real plane x,y ε R² and for widths σ_x,σ_y. The Mexican Hat wavelet MH(x,y;σ_x,σ_y,x_o,y_o) is generated as the difference between the two Gaussians of different widths, which essentially boils down to taking partial derivatives of G(σ_x,σ_y) wrt the widths. To be sure, these must really be thought of as operators where the functions are correlated with a data image, so the derivaties must be carried out inside an integral, but I am skipping all that for the sake of clarity. Also note, the MH is sometimes derived as the second derivative of G(x,y), the spatial derivatives that is.

The integral of the MH over R² results in the positive bump and the negative annulus canceling each other out, so there is no unambiguous way to set its normalization. And finally, the Fourier Transform shows which spatial scales (k_x,y are wavenumbers) are enhanced or filtered during a correlation.

Background Subtraction [EotW]

vlk — Wed, 21 May 2008 17:00:32 +0000

There is a lesson that statisticians, especially of the Bayesian persuasion, have been hammering into our skulls for ages: do not subtract background. Nevertheless, old habits die hard, and old codes die harder. Such is the case with X-ray aperture photometry.

When C counts are observed in a region of the image that overlaps a putative source, and B counts in an adjacent, non-overlapping region that is mostly devoid of the source, the question that is asked is, what is the intensity of a source that might exist in the source region, given that there is also background. Let us say that the source has intensity s, and the background has intensity b in the first region. Further let a fraction f of the source overlap that region, and a fraction g overlap the adjacent, “background” region. Then, if the area of the background region is r times larger, we can solve for s and b and even determine the errors:

Note that the regions do not have to be circular, nor does the source have to be centered in it. As long as the PSF fractions f and g can be calculated, these formulae can be applied. In practice, f is large, typically around 0.9, and the background region is chosen as an annulus centered on the source region, with g~0.

It always comes as a shock to statisticians, but this is not ancient history. We still determine maximum likelihood estimates of source intensities by subtracting out an estimated background and propagate error by the method of moments. To be sure, astronomers are well aware that these formulae are valid only in the high counts regime ( s,C,B>>1, b>0 ) and when the source is well defined ( f~1, g~0 ), though of course it doesn’t stop them from pushing the envelope. This, in fact, is the basis of many standard X-ray source detection algorithms (e.g., celldetect).

Furthermore, it might come as a surprise to many astronomers, but this is also the rationale behind the widely-used wavelet-based source detection algorithm, wavdetect. The Mexican Hat wavelet used with it has a central positive bump, surrounded by a negative annular moat, which is a dead ringer for the source and background regions used here. The difference is that the source intensity is not deduced from the wavelet correlations and the signal-to-noise ratio ( s/sigma_s ) is not used to determine source significance, but rather extensive simulations are used to calibrate it.

The power of wavdetect

vlk — Sun, 21 Oct 2007 19:59:31 +0000

wavdetect is a wavelet-based source detection algorithm that is in wide use in X-ray data analysis, in particular to find sources in Chandra images. It came out of the Chicago “Beta Site” of the AXAF Science Center (what CXC used to be called before launch). Despite the fancy name, and the complicated mathematics and the devilish details, it is really not much more than a generalization of earlier local cell detect, where a local background is estimated around a putative source and the question is asked, is whatever signal that is being seen in this pixel significantly higher than expected? However, unlike previous methods that used a flux measurement as the criterion for detection (e.g., using signal-to-noise ratios as proxy for significance threshold), it tests the hypothesis that the observed signal can be obtained as a fluctuation from the background.

Because wavdetect is an extremely complex algorithm, it is not possible to determine its power easily, and large numbers of simulations must be done. In December 2001 (has it been that long?!), the ChaMP folks held a mini workshop, and I reported on my investigations into the efficiency of wavdetect, calculating the probability of detecting a source in various circumstances. I had been meaning to get it published, but this is not really publishable by itself, and furthermore, wavdetect has evolved sufficiently since then that the actual numbers are obsolete. So here, strictly for entertainment, are the slides from that long ago talk. As Xiao-li puts it, “nothing will go waste!”

(click on the image to go to the pdf)