The AstroStat Slog » Languages

Everybody needs crampons

vlk — Fri, 30 Apr 2010 16:12:36 +0000

Sherpa is a fitting environment in which Chandra data (and really, X-ray data from any observatory) can be analyzed. It has just undergone a major update and now runs on python. Or allows python to run. Something like that. It is a very powerful tool, but I can never remember how to use it, and I have an amazing knack for not finding what I need in the documentation. So here is a little cheat sheet (which I will keep updating ~~as and when~~ if I learn more):

2010-apr-30: Aneta has setup a blogspot site to deal with simple Sherpa techniques and tactics: http://pysherpa.blogspot.com/

On Help:

In general, to get help, use: ahelp "something" (note the quotes)
Even more useful, type: ? wildcard to get a list of all commands that include the wildcard
You can also do a form of autocomplete: type TAB after writing half a command to get a list of all possible completions.

Data I/O:

To read in your PHA file, use: load_pha()
Often for Chandra spectra, the background is included in that same file. In any case, to read it in separately, use: load_bkg()
- Q: should it be loaded in to the same dataID as the source?
- A: Yes.
- A: When the background counts are present in the same file, they can be read in separately and assigned to the background via set_bkg('src',get_data('bkg')), so counts from a different file can be assigned as background to the current spectrum.
To read in the corresponding ARF, use: load_arf()
- Q: load_bkg_arf() for the background — should it be done before or after load_bkg(), or does it matter?
- A: does not matter
To read in the corresponding RMF, use: load_rmf()
- Q: load_bkg_rmf() for the background, and same question as above
- A: same answer as above; does not matter.
To see the structure of the data, type: print(get_data()) and print(get_bkg())
To select a subset of channels to analyze, use: notice_id()
To subtract background from source data, use: subtract()
To not subtract, to undo the subtraction, etc., use: unsubtract()
To plot useful stuff, use: plot_data(), plot_bkg(), plot_arf(), plot_model(), plot_fit(), etc.
(Q: how in god’s name does one avoid plotting those damned error bars? I know error bars are necessary, but when I have a spectrum with 8192 bins, I don’t want it washed out with silly 1-sigma Poisson bars. And while we are asking questions, how do I change the units on the y-axis to counts/bin? A: rumors say that plot_data(1,yerr=0) should do the trick, but it appears to be still in the development version.)

Fitting:

To fit model to the data, command it to: fit()
To get error bars on the fit parameters, use: projection() (or covar(), but why deliberately use a function that is guaranteed to underestimate your error bars?)
Defining models appears to be much easier now. You can use syntax like: set_source(ModelName.ModelID+AnotherModel.ModelID2) (where you can distinguish between different instances of the same type of model using the ModelID — e.g., set_source(xsphabs.abs1*powlaw1d.psrc+powlaw1d.pbkg))
To see what the model parameter values are, type: print(get_model())
To change statistic, use: set_stat() (options are various chisq types, cstat, and cash)
To change the optimization method, use: set_method() (options are levmar, moncar, neldermead, simann, simplex)

Timestamps:
v1:2007-dec-18
v2:2008-feb-20
v3:2010-apr-30

]]>

some python modules

hlee — Fri, 13 Nov 2009 21:46:54 +0000

I was told to stay away from python and I’ve obeyed the order sincerely. However, I collected the following stuffs several months back at the instance of hearing about import inference and I hate to see them getting obsolete. At that time, collecting these modules and getting through them could help me complete the first step toward the quest Learning Python (the first posting of this slog).

There are quite many websites dedicated to python as you already know. Some of them talk only to astronomers. A tiny fraction of those websites are for statisticians but I haven’t met any statistician preferring only python. We take the gist of various languages. So, I’ll leave a general website aggregation, such as AstroPy (I think this website is extremely useful for astronomers), to enrich your bookmark under the “python” tab regardless of your profession. Instead, I’ll discuss some python libraries and modules that can be useful for those exercising astrostatistics and make their work easier. I must say that by intention I omitted a few modules because I was not sure their publicity and copyright sensitivity. If you have modules that can be introduced publicly, let me know. I’ll be happy to add them. If my description is improper and want them to be taken off, also let me know.

Over the past few years, python became the most common and versatile script language for both communities, and therefore, I believe, it would accelerate many collaborations. Much of my time is spent to find out how to read, maneuver, and handle raw data/image. Most of tactics for astronomers are quite unfamiliar, sometimes insensible to me (see my read.table() and data analysis system and its documentation). Somehow, one script language, thanks to its open and free intention to all communities, is promising by narrowing the gap for prosperous and efficient collaborations, Python

The first posting on this slog was about Python. I thought that kicking off with a computer language relatively new and open to many communities could motivate me and others for more interdisciplinary works with diversity. After a few years, unfortunately, I didn’t achieve that goal. Yet, I still think that these libraries and modules, introduced below, to be useful for your transition from some programming languages, or for writing your own but pro bono wrapper for better communication with the others.

I’ll take numpy, scipy, and RPy for granted. For the plotting purpose, matplotlib seems most common.

Reading astronomical data (click links to download libraries, modules, and tutorials)

First, start with Using Python for Interactive Data Analysis (in pdf) Quite useful manual, particularly for IDL users. It compares pros and cons of Python and IDL.
IDLsave Simply, without IDL, a .save file becomes legible. This is a brilliant small module.
PyRAF (I was really frustrated with IRAF and spent many sleepless nights. Apart from data reduction, I don’t remember much of statistics from IRAF except simple statistics for Gaussian populations. I guess PyRAF does better job). And there’s PyFITS for handling fits format data.
APLpy (the Astronomical Plotting Library in Python) is a Python module aimed at producing publication-quality plots of astronomical imaging data in FITS format (this introduction is copied from the APLpy site).

Statistics, Mathematics, or data science
Due to RPy, introducing smaller modules seems not much worthy but quite many modules and library for statistics are available, not relying on R.

MDP (Modular toolkit for Data Processing)
Multivariate data analysis methods like PCA, ICA, FA, etc. become very popular in the astronomical society.
pywavelets (Not only FT, various transformation methodologies are often used and wavelet transformation ranks top).
PyIMSL (see my post, PyIMSL)
PyMC I introduced this module in a century ago. It may be lack of versatility or robustness due to parametric distribution objects but I liked the tutorial very much from which one can expand and devise their own working MCMC algorithm.
PyBUGS (I introduced this python wrapper in BUGS but the link to PyBUGS is not working anymore. I hope it revives.)
SAGE (Software for Algebra and Geometry Experimentation) is a free open-source mathematics software system licensed under the GPL (Link to the online tutorial).
python_statlib descriptive statistics for the python programming language.
PYSTAT Nice website but the product is not available yet. Be aware! It is not PhyStat!!!

Module for AstroStatistics
import inference (Unfortunately, the links to examples and tutorial are not available currently)

Without clear objectives, it is not easy to pick up a new language. If you are used to work with one from alphabet soup, you most likely adhere to your choice. Changing alphabets or transferring language names only happens when your instructor specifically ask you to use their preferring languages and when analysis {modules, libraries, tools} are only available within that preferred language. Somehow, thanks to the object oriented style, python makes transition and communication easier than other languages. Furthermore, script languages are more intuitive and better interpretable.

]]>

Do people use Fortran?

hlee — Tue, 27 Oct 2009 03:41:42 +0000

I’m very sure that Fortran is one of the major scientific programming languages. Many functions, modules, and libraries are written in this language. Without being aware of, these routines are ported into many script languages. However, I become curious whether Fortran is still the major force in astronomy or statistics, compared to say 20 years ago (10 seems too small).

I recently placed my Numerical Recipes in Fortran in someone’s hands because I can access the electronic version of NR in C/C++. I have some manuals about Fortran 77 and 90/95, and IMSL in Fortran but I haven’t put my hands on them in recent years. I now feel that these manuals are on the verge of recycling bins or deletion. But the question about the trend in scientific computation languages pulls my sleeve to think over. With a bit of shyness, I want to ask scientists with long experience in both fields for their opinions about Fortran. Do any experienced scientists ask their students or post-docs to acquire knowledge in Fortran? While young people pursuing Python, R, and other scripting languages thanks to GNU GPL (There are a few caveats in this transition, but I’ll discuss that later).

]]>

[Books] Bayesian Computations

hlee — Fri, 11 Sep 2009 20:40:23 +0000

A number of practical Bayesian data analysis books are available these days. Here, I’d like to introduce two that were relatively recently published. I like the fact that they are rather technical than theoretical. They have practical examples close to be related with astronomical data. They have R codes so that one can try algorithms on the fly instead of jamming probability theories.

Bayesian Computation with R
Author:Jim Albert
Publisher: Springer (2007)

As the title said, accompanying R package LearnBayes is available (clicking the name will direct you for downloading the package). Furthermore, the last chapter is about WinBUGS. (Please, check out resources listed in BUGS for other BUGS, Bayesian inference Using Gibbs Sampling) Overall, it is quite practical and instructional. If an young astronomer likes to enter the competition posted below because of sophisticated data requiring non traditional statistical modeling, this book can be a good starting. (Here, traditional methods include brute force Monte Carlo simulations, chi^2/weighted least square fitting, and test statistics with rigid underlying assumptions).

An interesting quote is filtered because of a comment from an astronomer, “Bayesian is robust but frequentist is not” that I couldn’t agree with at the instance.

A Bayesian analysis is said to be robust to the choice of prior if the inference is insensitive to different priors that match the user’s beliefs.

Since there’s no discussion of priors in frequentist methods, Bayesian robustness cannot be matched and compared with frequentist’s robustness. Similar to my discussion in Robust Statistics, I kept the notion that robust statistics is insensitive to outliers or iid Gaussian model assumption. Particularly, the latter is almost ways assumed in astronomical data analysis, unless other models and probability densities are explicitly stated, like Poisson counts and Pareto distribution. New Bayesian algorithms are invented to achieve robustness, not limited to the choice of prior but covering the topics from frequentists’ robust statistics.

The introduction of Bayesian computation focuses on analytical and simple parametric models and well known probability densities. These models and their Bayesian analysis produce interpretable results. Gibbs sampler, Metropolis-Hasting algorithms, and their few hybrids could handle scientific problems as long as scientific models and the uncertainties both in observations and parameters transcribed into well known probability density functions. I think astronomers like to check Chap 6 (MCMC) and Chap 9 (Regression Models). Often times, in order to prove strong correlation between two variables, astronomers adopt simple linear regression models and fit the data to these models. A priori knowledge enhances the flexibility of fitting analysis in which Bayesian computation works robustly different from straightforward chi-square methods. The book does not have sophisticated algorithms nor theories. It only offers very necessities and foundations for Bayesian computations to be accommodated into scientific needs.

The other book is

Bayesian Core: A Practical Approach to Computational Bayesian Statistics.
Author: J. Marin and C.P.Robert
Publisher: Springer (2007).

Although the book is written by statisticians, the very first real data example is CMBdata (cosmic microwave background data; instead of cosmic, the book used cosmological. I’m not sure which one is correct but I’m so used to CMB by cosmic microwave background). Surprisingly, CMB became a very easy topic in statistics in terms of testing normality and extreme values. Seeing the real astronomy data first from the book was the primary reason of introducing this book. Also, it’s a relatively small volume book (about 250 pages) compared other Bayesian textbooks with the broad coverage of topics in Bayesian computation. There are other practical real data sets to illustrate Bayesian computations in the book and these example data sets are found from the book website

The book begins with R, then normal models, regression and variable selection, generalized linear models, capture-recapture experiments, mixture models, dynamic models, and image analysis are covered.

I feel exuberant when I found the book describes the law of large numbers (LLN) that justifies the Monte Carlo methods. The LLN appears often when integration is approximated by summation, which astronomers use a lot without referring the name of this law. For more information, I rather give a wikipedia link to Law of Large Numbers.

Several MCMC algorithms can be mixed together within a single algorithm using either a circular or a random design. While this construction is often suboptimal (in that the inefficient algorithms in the mixture are still used on a regular basis), it almost always brings an improvement compared with its individual components. A special case where a mixed scenario is used is the Metropolis-within-Gibbs algorithm: When building a Gibbs sample, it may happen that it is difficult or impossible to simulate from some of the conditional distributions. In that case, a single Metropolis step associated with this conditional distribution (as its target) can be used instead.

Description in Sec. 4.2 Metropolis-Hasting Algorithms is expected to be more appreciated and comprehended by astronomers because of the historical origins of these topics, detailed balance equation and random walk.

Personal favorite is section 6 on mixture models. Astronomers handle data of multi populations (multiple epochs of star formations, single or multiple break power laws, linear or quadratic models, metalicities from merging or formation triggers, backgrounds+sources, environment dependent point spread functions, and so on) and discusses the difficulties of label switching problems (identifiability issue in codifying data into MCMC or EM algorithm)

A completely different approach to the interpretation and estimation of mixtures is the semiparametric perspective. To summarize this approach, consider that since very few phenomena obey probability laws corresponding to the most standard distributions, mixtures such as (*) can be seen as a good trade-off between fair represntation of the phenomenon and efficient estimation of the underlying distribution. If k is large enough, there is theoretical support for the argument that (*) provides a good approximation (in some functional sense) to most distributions. Hence, a mixture distribution can be perceived as a type of basis approximation of unknown distributions, in a spirit similar to wavelets and splines, but with a more intuitive flavor (for a statistician at least). This chapter mostly focuses on the “parametric” case, when the partition of the sample into subsamples with different distributions f_j does make sense form the dataset point view (even though the computational processing is the same in both cases).

We must point at this stage that mixture modeling is often used in image smoothing but not in feature recognition, which requires spatial coherence and thus more complicated models…

My patience ran out to comprehend every detail of the book but the section of reversible jump MCMC, hidden Markov model (HMM), and Markov random fields (MRF) would be very useful. Particularly, these topics appear often in image processing, which field astronomers have their own algorithms. Adaption and comparison across image analysis methods promises new directions of scientific imaging data analysis beyond subjective denoising, smoothing, and segmentation.

Readers considering more advanced Bayesian computation and rigorous treatment of MCMC methodology, I’d like to point a textbook, frequently mentioned by Marin and Robert.

Monte Carlo Statistical Methods Robert, C. and Casella, G. (2004)
Springer-Verlag, New York, 2nd Ed.

There are a few more practical and introductory Bayesian Analysis books recently published or soon to be published. Some readership would prefer these books of running ink. Perhaps, there is/will be Bayesian Computation with Python, IDL, Matlab, Java, or C/C++ for those who never intend to use R. By the way, for Mathematica users, you would like to check out Phil Gregory’s book which I introduced in [books] a boring title. My point is that applied statistics has become more friendly to non statisticians through these good introductory books and free online materials. I hope more astronomers apply statistical models in their data analysis without much trouble in executing Bayesian methods. Some might want to check BUGS, introduced [BUGS]. This posting contains resources of how to use BUGS and available packages under languages.

]]>

Where is ciao X ?

hlee — Thu, 30 Jul 2009 06:57:00 +0000

X={ primer, tutorial, cookbook, Introduction, guidebook, 101, for dummies, …}

I’ve heard many times about the lack of documentation of this extensive data analysis system, ciao. I saw people still using ciao 3.4 although the new version 4 has been available for many months. Although ciao is not the only tool for Chandra data analysis, it was specifically designed for it. Therefore, I expect it being used frequently with popularity. But the reality is against my expectation. Whatever (fierce) discussion I’ve heard, it has been irrelevant to me because ciao is not intended for statistical analysis. Then, out of sudden, after many months, a realization hit me. ciao is different from other data analysis systems and softwares. This difference has been a hampering factor of introducing ciao outside the Chandra scientist community and of gaining popularity. This difference was the reason I often got lost in finding suitable documentations.

http://cxc.harvard.edu/ciao/ is the website to refer when you start using ciao and manuals are listed here, manuals and memos. The aforementioned difference is that I’m used to see Introduction, Primer, Tutorial, Guide for Beginners at the front page or the manual websites but not from the ciao websites. From these introductory documentations, I can stretch out to other specific topics, modules, tool boxes, packages, libraries, plug-ins, add-ons, applications, etc. Tutorials are the inertia of learning and utilizing data analysis systems. However, the layout of ciao manual websites seems not intended for beginners. It was hard to find basics when some specific tasks with ciao and its tools got stuck. It might be useful only for Chandra scientists who have been using ciao for a long time as references but not beyond. It could be handy for experts instructing novices by working side by side so that they can give better hands-on instruction.

I’ll contrast with other popular data analysis systems and software.

When I began to use R, I started with R manual page containing this pdf file, Introduction to R. Based on this introductory documentations, I could learn specific task oriented packages easily and could build more my own data analysis tools.
When I began to use Matlab, I was told to get the Matlab primer. Although the current edition is commercial, there are free copies of old editions are available via search engines or course websites. There other tutorials are available as well. After crashing basics of Matlab, it was not difficult to getting right tool boxes for topic specific data analysis and scripting for particular needs.
When I began to use SAS (Statistical Analysis System), people in the business said get the little SAS book which gives the basis of this gigantic system, from which I was able to expend its usage for particular statistical projects.
Recently, I began to learn Python to use many astronomical and statistical data analysis modules developed by various scientists. Python has its tutorials where I can point for basic to fully utilize those task specific modules and my own scripting.
Commericial softwares often come with their own beginners’ guide and demos that a user can follow easily. By acquiring basics from these tutorials, expending applications can be well directed. On the other hands, non-commercial softwares may be lack of extensive but centralized tutorials unlike python and R. Nonetheless, acquiring tutorials for teaching is easy and these unlicensed materials are very handy whenever problems are confronted under various but task specific projects.
I used to have IDL tutorials on which I relied a lot to use some astronomy user libraries and CHIANTI (atomic database). I guess the resources of tutorials have changed dramatically since then.

Even if I’ve been navigating the ciao website and its threads high in volume so many times, I only come to realize now that there’s no beginner’s guide to be called as ciao cookbook, ciao tutorial, ciao primer, ciao primer, ciao for dummies, or introduction to ciao at the visible location.

This is a cultural difference. Personal thought is that this tradition prevents none Chandra scientists from using data in the Chandra archive. A good news is that there has been ciao workshops and materials from the workshops are still available. I believe compiling these materials in a fashion that other beginners’ guides introducing the data analysis system can be a good starting point for writing up a front-page worthy tutorial. The existence of this introductory material could embrace more people to use and to explore Chandra X-ray data. I hope these tutorials from other softwares and data analysis systems (primer, cookbook, introduction, tutorial, or ciao for dummies) can be good guide lines to fully compose a ciao primer.

]]>

It bothers me.

hlee — Mon, 17 Nov 2008 17:39:04 +0000

The full description is given http://cxc.harvard.edu/ciao3.4/ahelp/bayes.html about “bayes” under sherpa/ciao^[1]. Some sentences kept bothering me and here’s my account for the reason given outside of quotes.

SUBJECT(bayes) CONTEXT(sherpa)
SYNOPSIS
A Bayesian maximum likelihood function.

Maximum likelihood function is common for both Bayesian and frequentist methods. I don’t know get the point why “Bayesian” is particularly added with “maximum likelihood function.”

DESCRIPTION
(snip)
We can relate this likelihood to the Bayesian posterior density for S(i) and B(i)
using Bayes’ Theorem:

p[S(i),B(i) | N(i,S)] = p[S(i)|B(i)] * p[B(i)] * p[N(i,S) | S(i),B(i)] / p[D] .

The factor p[S(i)|B(i)] is the Bayesian prior probability for the source model
amplitude, which is assumed to be constant, and p[D] is an ignorable normalization
constant. The prior probability p[B(i)] is treated differently; we can specify it
using the posterior probability for B(i) off-source:

p[B(i)] = [ A (A B(i))^N(i,B) / N(i,B)! ] * exp[-A B(i)] ,

where A is an “area” factor that rescales the number of predicted background
counts B(i) to the off-source region.

IMPORTANT: this formula is derived assuming that the background is constant as a
function of spatial area, time, etc. If the background is not constant, the Bayes
function should not be used.

Why not? If I rephrase it, what it said is that B(i) is a constant. Then why one bothers to write p[B(i)], a probability density of a constant? The statement sounds self contradictory to me. I guess B(i) is a constant parameter. It would be suitable to write that Background is homogeneous and the Background is describable with homogeneous Poisson process if the above pdf is a correct model for Background. Also, a slight notation change is required. Assuming the Poisson process, we could estimate the background rate (constant parameter) and its density p[B(i)], and this estimate is a constant as stated for p[S(i)|B(i)], a prior probability for the constant source model amplitude.

I think the reason for “Bayes should not used” is that the current sherpa is not capable of executing hierarchical modeling. Nevertheless, I believe one can script the MCMC methodologies with S-Lang/Python to be aggregated with existing sherpa tools to incorporate a possible space dependent density, p[B(i,x,y)]. I was told that currently a constant background regardless of locations and background subtraction is commonly practiced.

To take into account all possible values of B(i), we integrate, or marginalize,
the posterior density p[S(i),B(i) | N(i,S)] over all allowed values of B(i):

p[S(i) | N(i,S)] = (integral)_0^(infinity) p[S(i),B(i) | N(i,S)] dB(i) .

For the constant background case, this integral may be done analytically. We do
not show the final result here; see Loredo. The function -log p[S(i)|N(i,S)] is
minimized to find the best-fit value of S(i). The magnitude of this function
depends upon the number of bins included in the fit and the values of the data
themselves. Hence one cannot analytically assign a `goodness-of-fit’ measure to a
given value of this function. Such a measure can, in principle, be computed by
performing Monte Carlo simulations. One would repeatedly sample new datasets from
the best-fit model, and fit them, and note where the observed function minimum
lies within the derived distribution of minima. (The ability to perform Monte
Carlo simulations is a feature that will be included in a future version of
Sherpa.)

Note on Background Subtraction

Bayesian computation means one way or the other that one is able to get posterior distributions in the presence of various parameters regardless of their kinds: source or background. I wonder why there’s a discrimination such that source parameter has uncertainty whereas the background is constant and is subtracted (yet marginalization is emulated by subtracting different background counts with corresponding weights). It fell awkward to me. Background counts as well as source counts are Poisson random. I would like to know what justifies constant background while one uses probabilistic approaches via Bayesian methods. I would like to know why the mixture model approach – a mixture of source model and background model with marginalization over background by treating B(i) as a nuisance parameter – has not been tried. By casting eye sights broadly on Bayesian modeling methods and basics of probability, more robustly estimating the source model and their parameters is tractable without subtracting background prior to fitting a source model.

The background should not be subtracted from the data when this function is used
The background only needs to be specified, as in this example:
(snip)

EXAMPLES
EXAMPLE 1
Specify the fitting statistic and then confirm it has been set. The method is then
changed from “Levenberg-Marquardt” (the default), since this statistic does not
work with that algorithm.

sherpa> STATISTIC BAYES
sherpa> SHOW STATISTIC
Statistic: Bayes
sherpa> METHOD POWELL
(snip)

I would like to know why it’s not working with Levenberg-Marquardt (LM) but working with Powell. Any references that explain why LM does not work with Bayes?

I do look forward your comments and references, particularly reasons for Bayesian maximum likelihood function and Bugs with LM. Also, I look forward to see off the norm approaches such as modeling fully in Bayesian ways (like van Dyk et al. 2001, yet I see its application rarely) or marginalizing Background without subtraction but simultaneously fitting the source model. There are plenty of rooms to be improved in source model fitting under contamination and distortion of x-ray photon incidents through space, telescope, and signal transmission.

Note that the current sherpa is beta under ciao 4.0 not under ciao 3.4 and a description about “bayes” from the most recent sherpa is not available yet, which means this post needs updates one new release is available

]]>

read.table()

hlee — Mon, 27 Oct 2008 15:05:27 +0000

The first step of data analysis or applications is reading the data sets into a tool of choice. Recent years, I’ve been using R (see also Learning R) for that regard but I’ve enjoyed freedoms for the same purpose from these languages and tools: BASIC, fortran77/90/95, C/C++, IDL, IRAF, AIPS, mongo/supermongo, MATLAB, Maple, Mathematica, SAS, SPSS, Gauss, ARC, Minitab, and recently Python and ciao which I just began to learn. Many of them I lost the fluency of how to use it. Quick learning tends to be flash memory. Some will need brain defragmentation and recovering time for extensive scientific work. A few I don’t like to use at all. No matter what, I’m not a computer geek. I’m not good at new gadgets, new softwares, nor welcome new and allegedly versatile computing systems. But one must be if he/she want to handle data. Until recently I believed R has such versatility in the aspect of reading in data. Yet, there is nothing without exceptions.

From time to time, I talked about among many factors, FITS format data make it difficult statisticians and astronomers work together. Statisticians cannot read in FITS format unless astronomers convert it into ascii or jpeg format for them whereas astronomers do not want to wasted their busy time for doing a chore like file format conversion wasting computer resources as well. Only a peaceful reunion happens when the data analysis become intractable via traditional methodology described in Numerical Recipes or Bevington and Robinson. They realize statistical (new) theory need to be found and collaboration happens with involvement of graduate students from both fields who patiently do many tedious jobs while learning (I missed this part while I was graduate student, which sometimes I thank my advisor for).

Now, let’s get back to the title. read.table()^[1] is a commonly used command line in R when you read in data in ascii format. It’ reads in data intelligently. As I said, it has been versatile enough. Numerals are in numeric format, letters are character format, missings are stored as NA, etc. read.table() make it easy to jump into data analysis right away. Well, now you know why I write this. I confronted a case read.table() does not read things correctly with astronomical data “even in ascii format.,” which I never had since I began to use S-Plus/R.

Although I know how to fix this simple problem that I’ll describe later, I want to point out the lack of compatibility in data formats between two communities and the common tools used for accessing data sets, which, I believe, is one of the biggest factors that prohibit astronomically uneducated statisticians from participating collaborations. I’ve mixed up tools for consulting courses to assist clients of various disciplines (grad students from agriculture, horticulture, physiology, social science, psychology were my clients) and for executing projects in electrical engineering and computational physics (these heavily rely on MATLAB) but reading data was the most simplest and fundamental step that I don’t have to worry about across various data sets with R (probably, those graduate students and professors of engineering and physics provided well trimmed and proven data sets).

When you have a long way to complete your mission and when you stumbled with your first step, I think it’s easy to loose eagerness for the future unless there’s support from your colleagues. Instead, I mostly likely receive discouraging comments such as “Why using R?” “You won’t have such problems if you use other tools” (Although it takes a bit of extra time to manuever, I eventually get to there). Such frustrating comments also degrade eagerness furthermore. So, from 100% I normally begin with, only 25% eagerness is left after two discouraging moments occurred at the initial step of data analysis whose end is invisibly far away. I only hang on to this 25%, still big by the normal standard and I wish for this last long until the final step without exponential decays that happened at the beginning.

Ah, the example, I promised. Click here for one example (from XAtlas) and check if read.table() can do the job in an one shot when the 3rd column is your x and the 4th column is your y. It’ll produce a beautiful spectrum if the data points are read in properly as numerals. My trick was using awk to extract those two columns because of unequal row entries in columns and read that into R. Such two steps work unfortunately made read.table() of R recognized entries as categorical data. To remove the episode of R recognizing entries as categorical data, between two steps, you must to fix the cause that read.table() reads what looks like numerals into categorical. If you investigate the data set files carefully you’ll find why; however, it’s a bit of tedious job when one have thousand entries in each data file and there are numerous data files. Without information, this effort will be same as writing a line of scanf()/READ in C/Fortran by counting column by column to type correct floating point format. This manifest the differences of formatting tables between astronomers and statisticians including scientists from ecometrics, econometrics, psycometrics, biometrics, bioinformatics, and others that include statistics related suffix.

Except such artifact (or cultural difference), XAtlas is a great catalog for statisticians in functional data analysis, who look for examples to deal with non smooth curves. New strategies and statistical applications will help astronomers see such unprecedented data sets better. Perhaps, actually more certainty, your 25% will grow back to 100% once you see those spectra and other metrics on your own plotting windows.

click here for the explanation of the read.table() function and
click here for the reason why is read.table() so inefficient?

]]>

BUGS

hlee — Tue, 16 Sep 2008 20:34:23 +0000

Astronomers tend to think in Bayesian way, but their Bayesian implementation is very limited. OpenBUGS, WinBUGS, GeoBUGS (BUGS for geostatistics; for example, modeling spatial distribution), R2WinBUGS (R BUGS wrapper) or PyBUGS (Python BUGS wrapper) could boost their Bayesian eagerness. Oh, by the way, BUGS stands for Bayesian inference Using Gibbs Sampling.

Disclaimer: I never did serious Bayesian computations so that information I provide here tends to be very shallow. Both statisticians and astronomers oriented by Bayesian ideals are very welcome to add advanced pieces of information.

Bayesian statistics is very much preferred in astronomy, at least here at Harvard Smithsonian Center for Astrophysics. Yet, I do not understand why astronomy data analysis packages do not include libraries, modules, or toolboxes for MCMC (porting scripts from Numerical Recipes or IMSL, or using Python does not count here since these are also used by engineers and scientists of other disciplines: my view is also restricted thanks to my limited experience in using astronomical data analysis packages like ciao, XSPEC, IDL, IRAF, and AIPS) similar to WinBUGS or OpenBUGS. Most of Bayesian analysis in astronomy has to be done from the scratch, which drives off simple minded people like me (I prefer analytic forms and estimators than posterior chains). I hope easily implementable Bayesian Data Analysis modules come along soon to current astronomical data analysis systems for any astronomers who only had a lecture about Bayes theorem and Gibbs sampling. Perhaps, BUGS can be a role model to develop such modules.

As listed, one does not need R to use BUGS. WinBUGS is both stand alone and R implementable. PyBUGS can be handy since python is popular among astronomers. I heard that MATLAB (its open source counterpart, OCTAVE) has its own tools to maneuver Bayesian Data Analysis relatively easily. There are many small MCMC modules to solve particular problems in astronomy but none of them are reported to be robust enough so as to be applied in other type data sets. Not many have the freedom of choosing models and priors.

Hopefully, well knowledged Bayesians contribute in developing modules for Bayesian data analysis in astronomy. I don’t like to see contour plots, obtained from brute-forceful and blinded χ² fitting, claimed to be bivariate probability density profiles. I’d like to project the module development like the way that BUGS is developed in astronomical data analysis packages with various Bayesian libraries. Here are some web links about BUGS:
The BUGS Project
WinBUGS
OpenBUGS
Calling WinBUGS 1.4 from other programs

]]>

A Conversation with Peter Huber

hlee — Sat, 06 Sep 2008 00:46:59 +0000

The problem with data analysis is of course that it is a performing art. It is not something you easily write a paper on; rather, it is something you do. And so it is difficult to publish.

quoted from this conversation ——————————————————-

Statistical Science has a nice “conversations” series with renown statisticians. This series always benefits me because of 1. learning the history of statistics through a personal life, 2. confronting various aspects in statistics as many statisticians as were interviewed, and 3. acquiring an introductory education in the statistics that those interviewees have perfected over many years in a plain language. One post in the slog from this series was a conversation with Leo Breiman about the two cultures in statistical modeling. Because of Prof. Huber’s diverse experiences and many contributions in various fields, this conversation may entertain astronomers and computer scientists as well as statisticians.

The dialog is available through arxiv.org: [stat.ME:0808.0777] written by Andreas Buja, Hans R. Künsch.

He became famous due to his early year paper in robust statistics titled, Robust Estimation of a Location Parameter but I see him as a pioneer in data mining, laying a corner stone for massive/multivariate data analysis when computers were not as much capable as today’s. His book, Robust Statistics (Amazon link) and the paper Projection Pursuit in Annals of Statistics (Vol. 13, No. 2, pp. 435-475, yr. 1985) are popular among many well known publications.

He has publications in geoscience and Babylonian astronomy. This conversation includes names like Steven Weinberg, the novel laureate (The First Three Minutes is a well known general science book) and late Carl Sagan (famous for books/a movie like Cosmos and Contact) showing his extent scholarly interests and genius beyond statistics. At the beginning, I felt like learning the history of computation and data analysis apart from statistics.

]]>

NR, the 3rd edition

hlee — Fri, 29 Aug 2008 00:44:07 +0000

Talking about limits in Numerical Recipes in my PyIMSL post, I couldn’t resist checking materials, particularly updates in the new edition of Numerical Recipes by Press, et al. (2007).

The NR website: http://www.nr.com/

Some noticeable additions are

chi-square with small numbers of counts,
various statistical applications including Markov chain Monte Carlo, statistical learning (classification and clustering– KD-tree, model based clustering, nonparametric multivariate data analysis), computational geometry, sparse matrices, kriging, and graphs,
improved functions including incomplete gamma and beta functions,
wavelet on the interval, and
information-theoretic properties of distributions.

I’m sure you’ll find more interesting topics to your liking that are added in this new edition. Maximum Entropy Image Restoration, I didn’t know this topic was in Numerical Recipes in other editions. The third edition looks very modern to me compared to the second edition. You’ll enjoy it.

]]>