The AstroStat Slog » catalog http://hea-www.harvard.edu/AstroStat/slog Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders Fri, 09 Sep 2011 17:05:33 +0000 en-US hourly 1 http://wordpress.org/?v=3.4 Astroinformatics http://hea-www.harvard.edu/AstroStat/slog/2009/astroinformatics/ http://hea-www.harvard.edu/AstroStat/slog/2009/astroinformatics/#comments Mon, 13 Jul 2009 00:21:53 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=3131 For approximately a decade, there have been journals dedicated to bioinformatics. On the other hand, there is none in astronomy, although astronomers have a long history of compiling huge volumes of catalogs and data archives. Prof. Bickel’s comment during his plenary lecture at the IMS-APRM, particularly on sparse matrices and the philosophical issues of choosing principal components, led me to wonder why astronomers do not discuss astroinformatics.

Nevertheless, I’ve noticed a few astronomers rigorously apply principal component analysis (PCA) to reduce the dimensionality of a data set. An evident example of PCA applications in astronomy is photo-z. In contrast to PCA’s wide application, almost no publication studies its statistical adequacy by investigating the properties of the covariance matrix and its estimation method, particularly when the matrix is sparse. Even worse, measurement errors are improperly handled, since statisticians’ dimension reduction methodology never confronted astronomers’ measurement errors. How to choose components is seldom discussed, since significance in a physics model rarely agrees with statistical significance. This disagreement often stretches out scientific writing and makes it hard to please readers. As a compromise, the statistical parts are omitted, which leaves the publication feeling incomplete to me.

In wavelet multiscale imaging, thanks to easy visualization on intuitive scales, the coarse-to-fine approach and the assumption of independent noise make it possible to clean a noisy image and accentuate its features. Likewise, principal components and other dimension reduction methods in statistics capture certain features via transformed metrics and regularized or penalized objective functions. These features do not necessarily match the important features in astrophysics unless the likelihood function and selected priors match the physics models. To my knowledge, astronomical literature exploiting PCA for dimension reduction rarely explains why PCA was chosen, how to compensate for sparsity in the covariance matrix, and other questions that are often major topics in bioinformatics. In that literature, these questions are explored to justify the selection of gene attributes or biomarkers for a given response, such as blood pressure or cancer type. Instead of binning and chi-square minimization, statisticians explore strategies to compensate for sparsity in the data set so as to obtain unbiased best fits and honest error bars based on assumptions and theory matched to the data.
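To make the point concrete, here is a minimal numpy sketch of PCA with a purely statistical component-selection rule (keep enough components to explain 95% of the variance). The data are synthetic and the rule knows no physics, which is exactly the issue raised above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "catalog": 500 objects, 6 correlated photometric quantities.
# Entirely synthetic; not drawn from any real survey.
latent = rng.normal(size=(500, 2))                       # two underlying drivers
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(500, 6))  # small additive noise

# PCA by hand: center the data, then take the SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # fraction of variance per component

# A purely statistical rule: keep components covering 95% of the variance.
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
scores = Xc @ Vt[:k].T                   # reduced-dimension representation
```

Nothing in `explained` or `k` distinguishes a physically meaningful direction from a statistically dominant one; that judgment has to come from outside the algorithm.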

Luckily, there are efforts among some renowned astronomers to form an astroinformatics community. At the dawn of bioinformatics, genetic scientists were responsible for the bio part and statisticians for the informatics, until young scientists were educated enough to carry out bioinformatics by themselves. Observing this trend, partly from statistics conferences, gave me the urge to ponder why there has been a shortage of statisticians’ involvement in astronomy despite its plethora of catalogs and data archives with long histories. A few postings will follow on what I have felt while working among astronomers. I hope this small effort helps bridge the gap between the two communities. My personal wish is to see astroinformatics prosper as bioinformatics has.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/astroinformatics/feed/ 1
accessing data, easier than before but… http://hea-www.harvard.edu/AstroStat/slog/2009/accessing-data/ http://hea-www.harvard.edu/AstroStat/slog/2009/accessing-data/#comments Tue, 20 Jan 2009 17:59:56 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=301 Someone emailed me for the globular cluster data sets I used in a proceedings paper, which was about determining multi-modality (multiple populations) based on well-known and new information criteria without binning the luminosity functions. I spent quite some time understanding data sets with suspicious numbers of globular cluster populations. On the other hand, obtaining globular cluster data sets was easy because of available data archives such as VizieR. For most data sets in charts/tables, I acquire the data from VizieR. To understand the science behind those data sets, I check ADS. Well, actually it happens the other way around: I check the scientific background first to assess whether there is room for statistics, then search for available data sets.

However, if you are interested in massive multivariate data, or if you want a subsample from a gigantic survey project that cannot be documented as completely as those small individual catalogs, you might like to learn a little Structured Query Language (SQL). With nice examples and explanations, terabytes of data are available from SDSS. Instead of images in FITS format, one can get ascii/table data sets (the variables for millions of objects include magnitudes and their errors; positions and their errors; classes such as stars, galaxies, and AGNs; types or subclasses such as elliptical galaxies, spiral galaxies, type I AGN, type Ia, Ib, Ic, and II SNe, and various spectral types; and estimated variables like photo-z, my keen interest; and more). Furthermore, thousands of papers related to SDSS are available to satisfy your scientific cravings. (Here are Slog postings under the SDSS tag.)
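For readers who have never written SQL, the queries are short. The sketch below builds the kind of query one would paste into the SDSS SkyServer/CasJobs interface; the table and column names (PhotoObj, psfMag_*, psfMagErr_*, type code 6 for point sources) are from my recollection of the SDSS schema, so verify them against the current schema browser before submitting a real query:

```python
# Sketch of an SDSS-style SQL query; all schema names should be checked
# against the SkyServer schema browser before use.
def photometric_sample(band="r", maglim=19.0, maxrows=10000):
    """Build a query returning magnitudes with their errors for point sources."""
    return (
        f"SELECT TOP {maxrows} objID, ra, dec, "
        f"psfMag_{band}, psfMagErr_{band} "
        f"FROM PhotoObj "
        f"WHERE type = 6 "                     # 6 = point source (star) in SDSS
        f"AND psfMag_{band} BETWEEN 14 AND {maglim}"
    )

query = photometric_sample(band="g", maglim=20.5)
```

Note that the error columns ride along with the magnitudes for free; there is no excuse for dropping them from a statistical analysis.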

If you don’t want to limit yourself to ascii tables, you may like to check the quick guide/tutorial of Gator, which aggregates archives of various missions: 2MASS (Two Micron All-Sky Survey), IRAS (Infrared Astronomical Satellite), Spitzer Space Telescope Legacy Science Programs, MSX (Midcourse Space Experiment), COSMOS (Cosmic Evolution Survey), DENIS (Deep Near Infrared Survey of the Southern Sky), and USNO-B (United States Naval Observatory B1 Catalog). You probably also want to check NED, the NASA/IPAC Extragalactic Database. As of today, the website says that 163 million objects, 170 million multiwavelength object cross-IDs, 188 thousand associations (candidate cross-IDs), 1.4 million redshifts, and 1.7 billion photometric measurements are accessible, which seems more than enough for data mining, exploring/summarizing data, and developing streaming/massive data analysis tools.

Probably, astronomers might wonder why I’m not advertising the Chandra Data Archive (CDA) and its project-oriented catalogs/databases. All I can say is that it is not friendly to an independent statistician. I am very likely the only statistician who has tried to use data from the CDA directly and bothered to understand its contents. I can assure you that without astronomers’ help, the archive is a hot potato: you don’t want to touch it. I’ve been there. Regardless of how painful it is, I’ve kept trying to touch it, since it’s hard to resist after knowing what’s in there. Fortunately, there are other archives that are friendlier to data scientists and far less painful than the CDA. There are plenty of things statisticians can do, not just with the CDA but with other astronomical data archives: improve astronomers’ decades-old data analysis algorithms based on the Gaussian distribution, the iid assumption, or the L2 norm, and reflect the true nature of the data with more relaxed assumptions for robust analysis strategies than the traditionally pursued parametric distributions with specific models (a distribution-free method is more robust than the Gaussian, but the latter is more efficient). Archives like VizieR or SDSS provide data sets that are much less painful to explore without familiarity with astronomical software/packages.

Computer scientists are well aware of the UCI machine learning repository, with which they can validate new methods against previous ones and empirically demonstrate how superior their methods are. Statisticians are used to handling well-trimmed data; otherwise we suggest strategies for collecting data for statistical inference. Although tons of data collecting and sampling protocols exist, most do not match the formats, types, and nature of data collected by observing the sky through complexly structured instruments. Some archives might be exclusively restricted to funded researchers and their beneficiaries. Others might be such hot potatoes that no statistician wants to get involved, even though they are free of charge. Overall, I’d like to warn you not to expect the well-tabulated simplicity of the textbook data sets found in exploratory data analysis and machine learning books.

Someone will raise another question: why do I not mention VOs (virtual observatories; click for slog postings) and Google Sky (click for slog postings), which I have praised many times on the slog as good resources for exploring the sky and learning astronomy? Unfortunately, for direct statistical applications, neither VOs nor Google Sky may be as useful as their names suggest. You will very likely spend hours exploring these facilities and end up at one of the archives or web interfaces mentioned above. It would be easier to talk to your nearest astronomer, who hopefully appreciates the importance of statistics and could offer you a statistically challenging data set without your having to worry about how to process and clean raw data and how to build statistically suitable catalogs/databases. Every astronomer on a survey project builds his or her own catalog and derives common factors/summary statistics from it, with the primary goal of any statistical analysis: understanding and summarizing the data.

I believe some astronomers want to advertise their archives and show off how public-friendly they are. Such advertising comments are very welcome; I intentionally left room for them instead of listing more archives I have heard of but lack hands-on experience with. My only wish is that more statisticians use astronomical data from these archives, so that the application sections of their papers are filled with such data. As with sunspots, I wish more astronomical data sets were used to validate methodologies, algorithms, and eventually theories. I sincerely hope this happens soon, before I drift away from astrostatistics and can no longer preach the benefits of astronomical data and their archives to make ends meet.

There is no single well-known data repository in astronomy like the UCI machine learning repository. Nevertheless, I can assure you that astronomical data and catalogs bear various statistical problems, many of which have never been properly formulated as statistical inference problems. So many statistical challenges reside in them. Not enough statisticians bother to look at these data, given the gigantic demand for statisticians from countless data-oriented scientific disciplines and the persistent shortage of supply.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2009/accessing-data/feed/ 3
missing data http://hea-www.harvard.edu/AstroStat/slog/2008/missing-data/ http://hea-www.harvard.edu/AstroStat/slog/2008/missing-data/#comments Mon, 27 Oct 2008 13:24:22 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=359 The notions of missing data differ between the two communities. I tend to think missing data carry as much information as observed data. As for astronomers, I’m not sure how they think, but my impression so far is that if one attribute/variable is missing for an object/observation/informant, all the other attributes of that object become useless, because the object is excluded from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or statistical justification of missing data with respect to inference strategies. Instead, they talk about incompleteness within different variables. To put this vague argument into a concrete example, consider a catalog of multiple magnitudes. To draw a color-magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color-magnitude diagram, and any inference method based on that diagram will not include it. Nonetheless, one would still like to understand how the proportion of observed stars varies with color and magnitude.

I guess this cultural difference originates from the quality of the data. The data sets statisticians typically handle are so small that a child could count the data points. For astronomical data, only rounded numbers of stars in the catalog are discussed, and dropping some missing data won’t affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs due to observational challenges in a survey. Such data with missing values can be put through statistically rigorous analysis processes instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.

In statistics, utilizing the information in missing data enhances inference in the direction the method is trying to retrieve. Even if they are larger, it is better to have error bars than nothing. My question is: what statistical proposals exist for astronomers to handle missing data? I would like to find such a list; instead, I give a few somewhat nontechnical papers that explain the missing data types in statistics, along with a few books/articles that statisticians often cite.

  • Data mining and the impact of missing data by M.L. Brown and J.F. Kros, Industrial Management and Data Systems (2003) Vol. 103, No. 8, pp. 611-621
  • Missing Data: Our View of the State of the Art by J.L. Schafer and J.W. Graham, Psychological Methods (2002) Vol. 7, No. 2, pp. 147-177
  • Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994) Vol. 89, No. 426, pp. 463-475, with comment by D.B. Rubin
  • The multiple imputation FAQ page (web) by J. Schafer
  • Statistical Analysis with Missing Data by R.J.A. Little and D.B. Rubin (2002) 2nd ed. New York: Wiley
  • The Curse of the Missing Data (web) by Yong Kim
  • A Review of Methods for Missing Data by T.D. Pigott, Educational Research and Evaluation (2001) Vol. 7, No. 4, pp. 353-383 (a survey of missing data analysis strategies, illustrated with an “asthma data” set)

Pigott discusses missing data methods for a general audience in plain terms, under the following categories: complete cases, available cases, single-value imputation, and the more recent model-based methods, maximum likelihood for multivariate normal data and multiple imputation. Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions like missing at random (MAR) or missing completely at random (MCAR), but these seem not to apply to typical astronomical data sets (I don’t know exactly why yet, and I cannot provide counterexamples to prove it, but that is what I have observed and been told). Currently, I would like to find ways to link statistical thinking and modeling about missing data to astronomical data by discovering commonalities in their missingness properties. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions:

  • data missing at random (MAR): missingness is related only to completely observed variables in the data set
  • data missing completely at random (MCAR): the complete cases are a random sample of the originally identified set of cases
  • non-ignorable missing data: the reasons for the missing observations depend on the values of those variables
  • outliers treated as missing data
  • the assumption of an ignorable response mechanism

Statistical research is traditionally conducted under the circumstance that complete data are available; the goal is to characterize the inference results of missing data analysis methods by comparing results from complete data with results after dropping observations on the variables of interest. Simulations make it possible to emulate these different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data. In real applications such a step is unnecessary, but for the sake of statistical/theoretical validation and approval of new missing data analysis methods, the comparison between results from complete data and missing data is unavoidable.
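A tiny simulation illustrates why these distinctions matter. In the synthetic two-band catalog below (all numbers invented for illustration), the complete-case mean color is nearly unbiased when colors are lost completely at random, but clearly biased when fainter objects are more likely to lose their color, i.e., under MAR with missingness driven by the fully observed magnitude:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000

# Synthetic "catalog": magnitude m and a color c correlated with it.
m = rng.normal(20.0, 1.0, n)
c = 0.5 * m - 9.0 + rng.normal(0.0, 0.3, n)

# MCAR: every color is lost with the same probability, regardless of anything.
mcar = rng.random(n) < 0.3

# MAR: fainter objects (larger m) are more likely to lose their color.
# Missingness depends only on the fully observed m, hence "at random".
p_miss = 1.0 / (1.0 + np.exp(-(m - 20.5)))   # logistic in magnitude
mar = rng.random(n) < p_miss

true_mean = c.mean()
mcar_mean = c[~mcar].mean()   # complete-case mean under MCAR: unbiased
mar_mean = c[~mar].mean()     # complete-case mean under MAR: biased low
```

Under MAR the complete cases preferentially keep the bright end, so the naive mean is pulled toward bluer colors; exactly the kind of incompleteness effect astronomers describe, but now stated as a missingness mechanism.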

Against my belief that statistical analysis with missing data applies universally, it seems that, so far, only regression-type strategies can cope with missing data despite its diverse categories. In many multivariate data analyses in astronomy, the relationship between response variables and predictors is not clear. More often, responses do not exist at all, and the joint distribution of the given variables is what matters. Without knowing the data generating distribution/model, analyzing arbitrarily built models with missing data, whether for imputation or for estimation, seems biased. This gap in handling different data types is my motivation for introducing statistical missing data analysis to astronomers, even though the statistical strategies for handling missing data may seem very limited. I believe, however, that some “new” concepts in missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, favored by the many astronomers who apply principal component analysis (PCA) nowadays. Understanding the conditions of multivariate normality and missing data more rigorously would lead astronomers to project their data analysis onto the regression analysis space, since numerous survey projects, along with the emergence of new catalogs, pose questions about relationships among observed variables or estimated parameters. The broad field of regression analysis embraces missing data in various ways; likewise, vast astronomical surveys and catalogs need to adopt proper data analysis tools that include missing data, since the scientific objective of surveys is to find relationships among variables empirically rather than from laws of physics, and missing data are not ignorable. I think tactics from missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.

Statisticians, or other scientists utilizing statistics, might name the strategies of missing data analysis slightly differently; my way of putting the strategies described in the texts above is as follows:

  • complete case analysis (caveat: relatively few cases may be left for the analysis, and MCAR is assumed),
  • available case analysis (pairwise deletion of selected variables; caveat: correlations estimated from different variable pairs may be inconsistent),
  • single-value imputation (typically the mean value is imputed, causing biased results and underestimated variance; not recommended),
  • maximum likelihood, and
  • multiple imputation (the last two are based on two assumptions: multivariate normality and an ignorable missing data mechanism)

and the following are imputation strategies:

  • mean substitution,
  • case substitution (scientific knowledge authorizes the substitution),
  • hot deck imputation (values drawn from the most similar case within the same data set, though defining “similar” is difficult),
  • cold deck imputation (values drawn from an external source, such as a previous survey),
  • regression imputation (prediction from the other variables; mean imputation is a special case), and
  • multiple imputation
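The contrast between mean substitution and regression imputation fits in a few lines of numpy on synthetic data: mean substitution visibly shrinks the variance, while regression imputation roughly preserves it (a single regression imputation still understates uncertainty, which is what multiple imputation addresses):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Two correlated quantities; y is missing for 40% of objects (MCAR here,
# purely to keep the illustration simple).
x = rng.normal(0.0, 1.0, n)
y = 2.0 * x + rng.normal(0.0, 0.5, n)
miss = rng.random(n) < 0.4

# Mean substitution: fills every gap with one number, shrinking the variance.
y_mean = y.copy()
y_mean[miss] = y[~miss].mean()

# Regression imputation: fit y ~ x on the observed pairs, predict the gaps.
slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
y_reg = y.copy()
y_reg[miss] = slope * x[miss] + intercept

# Variance comparison against the fully observed truth.
v_true, v_mean, v_reg = y.var(), y_mean.var(), y_reg.var()
```

On this toy data the mean-substituted variance falls well below the true variance, while the regression-imputed variance sits close to it; that underestimated spread is precisely the “biased results and underestimated variance” caveat above.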

Some might prefer the following listing (adapted from the missing-data chapter of Gelman and Hill’s regression analysis book):

  • simple missing data approaches that retain all the data
    1. mean imputation
    2. last value carried forward
    3. using information from related observations
    4. indicator variables for missingness of categorical predictors
    5. indicator variables for missingness of continuous predictors
    6. imputation based on logical rules
  • random imputation of a single variable
  • imputation of several missing variables
  • model-based imputation
  • combining inferences from multiple imputations

Statistical missing data analysis acknowledges its assumptions explicitly, in contrast to subjective data processing toward a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs, where missing data, including outliers, reside but do not appear in the plots after a subjective data cleaning step. Statistics, on the other hand, states its assumptions and conditions about missing data explicitly. However, I don’t know what is proper or correct from a scientific viewpoint. Such explication does not exist, and judgments about missing data assumptions and their processing are left to astronomers. Moreover, astronomers have advantages, like their knowledge of physics, for imputing data more suitably and subtly.

As Schafer and Graham described, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest, not to estimate, predict, or recover missing observations, nor to obtain the same results that we would have seen with complete data.

The following quote from the web link above (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, one needs statistically improved strategies more than ever, because as the volume of surveys and catalogs increases, proportionally more missing data reside in them. Otherwise, the current practice of using only complete data (getting rid of every observation with at least one missing entry) would be the only way to go. There is more room to discuss strategies case by case, which will come in future posts. This one is already too long.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/missing-data/feed/ 2
survey and design of experiments http://hea-www.harvard.edu/AstroStat/slog/2008/survey-and-design-of-experiments/ http://hea-www.harvard.edu/AstroStat/slog/2008/survey-and-design-of-experiments/#comments Wed, 01 Oct 2008 20:16:24 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=894 People of experience would say things very differently and more wisely than what I’m about to discuss. This post only combines two small cross sections, one from each branch of the two trees, astronomy and statistics.

When it comes to surveys, the first thing that comes to mind is the census packet. I have only seen it once (an easy way to disguise my age, but it is true), yet its questionnaire layout was so carefully and extensively done that it left a strong impression on me. Such a survey is designed prior to collecting data, so that after collection the data can be analyzed with statistical methodology suited to the design of the survey. Strategies for response quantification are also included (yes/no as 0/1, responses on a 0 to 10 scale, bracketed salaries, age groups, handling missing data, and such) so that elaborate statistical analysis can avoid subjective data transformation and arbitrary outlier elimination.

In contrast, a survey in astronomy means designing a mesh, not questionnaires, and it cannot be transcribed into statistical models. This mesh has multiple layers, such as the telescope, the detector, and the source detection algorithm, and it eventually produces a catalog. Designing statistical methodology to draw interpretable conclusions is not part of it. Collecting whatever goes through that mesh is an astronomical survey. Analyzing the catalog does not necessarily involve sophisticated statistics; often it adopts chi-square fitting and casts away unpleasant/uninteresting data points.

As with other conflicts in jargon (the simplest example is H0, which I used to know as the Hubble constant but now recognize first as notation for a null hypothesis), survey has been one of them. Like measurement error, some clarification of the term survey should be given by knowledgeable astrostatisticians to draw more statisticians into the grand survey projects soon to come. Luckily, the first opportunity will come soon at the Special Session: Meaning from Surveys and Population Studies: BYOQ during the 213th AAS meeting, at Long Beach, California on Jan. 5th, 2009.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/survey-and-design-of-experiments/feed/ 3
Classification and Clustering http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/ http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/#comments Thu, 18 Sep 2008 23:48:43 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=747 Another conclusion deduced from reading preprints listed on arxiv/astro-ph is that astronomers tend to confuse classification with clustering and to mix up their methodologies. They tend to think any algorithm from classification or clustering analysis serves their purpose, since both kinds of algorithms, no matter what, look like black boxes. By a black box I mean something like a neural network, which is one of the classification algorithms.

Simply put, classification is a regression problem and clustering is a mixture problem with an unknown number of components. Defining a classifier, a regression model, is the objective of classification, and determining the number of clusters is the objective of clustering. In classification, predefined classes exist, such as galaxy types and star types, and one wishes to know which predictor variables and what function of them can separate quasars from stars, relying only on a handful of variables from photometric data without individual spectroscopic observations. In clustering analysis there is no predefined class, but some plots suggest multiple populations, and one wishes to determine the number of clusters mathematically, so as not to be subjective in concluding that a plot shows two clusters after some subjective data cleaning. A good example: as photons from gamma-ray bursts accumulated, extracting features like T_{90} and T_{50} enabled scatter plots of many GRBs, which eventually led people to believe there are multiple populations of GRBs. Clustering algorithms back this hypothesis in a more objective manner, as opposed to the subjective manner of scatter plots with non-statistical outlier elimination.
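For illustration, here is a bare-bones EM fit of a two-component Gaussian mixture to synthetic log-durations; the numbers only loosely mimic the short/long bimodality and are not taken from any real GRB catalog. The point is that the component means and mixing fractions are estimated, not eyeballed from a scatter plot:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic log-durations with a short and a long population
# (all parameters invented for illustration).
data = np.concatenate([
    rng.normal(-0.5, 0.4, 300),   # "short" population
    rng.normal(1.5, 0.4, 700),    # "long" population
])

# EM for a two-component 1-D Gaussian mixture.
mu = np.array([data.min(), data.max()])   # crude but deterministic start
sigma = np.array([1.0, 1.0])
w = np.array([0.5, 0.5])
for _ in range(200):
    # E-step: responsibility of each component for each point
    dens = (w / (sigma * np.sqrt(2 * np.pi)) *
            np.exp(-0.5 * ((data[:, None] - mu) / sigma) ** 2))
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: reweighted means, spreads, and mixing fractions
    nk = r.sum(axis=0)
    mu = (r * data[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((r * (data[:, None] - mu) ** 2).sum(axis=0) / nk)
    w = nk / len(data)
```

In practice one would also fit one- and three-component mixtures and compare them with an information criterion such as BIC, which is exactly the “determining the number of clusters mathematically” step.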

However, there are challenges in making a clean cut between classification and clustering, both in statistics and in astronomy. In statistics, missing data is the phrase people use to describe this challenge. Fortunately, there is a field called semi-supervised learning to tackle it. (Supervised learning is equivalent to classification, and unsupervised learning to clustering.) Semi-supervised learning algorithms apply to data in which a portion has known class types and the rest are missing; astronomical catalogs with unidentified objects are a good candidate for semi-supervised learning algorithms.
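As a caricature of the idea, the sketch below does nearest-centroid self-training on synthetic data: a handful of labeled objects seed the class centroids, and the unidentified objects (marked -1, as in a catalog) are then assigned and the centroids re-estimated. Real semi-supervised algorithms (label propagation, self-training wrappers around a proper classifier) are more careful, but the flow is the same:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two well-separated 2-D "classes" (say, stars vs. galaxies in some
# invented feature space); only a handful of objects carry labels.
a = rng.normal([0, 0], 0.5, size=(200, 2))
b = rng.normal([3, 3], 0.5, size=(200, 2))
X = np.vstack([a, b])
truth = np.array([0] * 200 + [1] * 200)

labels = np.full(400, -1)          # -1 marks "unidentified", as in a catalog
labels[:5] = 0                     # five labeled objects per class
labels[200:205] = 1

# Minimal self-training: estimate class centroids from the labeled objects,
# assign every object to the nearer centroid, and repeat.
for _ in range(5):
    c0 = X[labels == 0].mean(axis=0)
    c1 = X[labels == 1].mean(axis=0)
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    labels = np.where(d0 < d1, 0, 1)

accuracy = (labels == truth).mean()
```

With ten labels out of four hundred objects, the unlabeled points still sharpen the centroid estimates; that is the leverage semi-supervised methods offer catalogs full of unidentified sources.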

From the astronomy side, the main cause of this confusion between classification and clustering, and the origin of this challenge, is that classes are not well defined or are subjective. For example, would astronomers A and B produce the same results when classifying galaxies according to Hubble’s tuning fork?[1] We are not testing individual cognitive skills. Is there a consensus on where to cut between F9 stars and G0 stars? What makes a star F9.5 instead of G0? In the presence of error bars, how can one be sure the star is F9 and not G0? I don’t see any decision-theoretic explanation in survey papers when those stellar spectral classes are presented. Classification is generally for data with categorical responses, but astronomers tend to turn something that used to be categorical into something continuous and still apply the same old classification algorithms designed for categorical responses.

From a clustering analysis perspective, this challenge is caused by outliers, peculiar objects that do not belong to the majority. The number of such peculiar objects can be large enough to make up a new, unprecedented class. Or the number is so small that a strong belief prevails that these data points, regarded as observational mistakes, should be discarded. How much can we trim data with unavoidable and uncontrollable contamination (remember, we cannot control astronomical data, as opposed to earthly kinds)? What primarily determines the number of clusters: physics, statistics, astronomers’ experience in processing and cleaning data, …

Once the ambiguity between classification and clustering and the complexity of the data sets are resolved, another challenge is still waiting: which black box? For most classification algorithms, Pattern Recognition and Machine Learning by C. Bishop offers a broad spectrum of black boxes. Yet the book does not include the various clustering algorithms that statisticians have developed, nor outlier detection. To become more rigorous in selecting a black box for clustering analysis and outlier detection, one is recommended to check,

To me, astronomers tend to be in haste, owing to the pressure to publish results immediately after a data release, and so overlook methodologies suitable for their survey data. It seems there is no time to consult machine learning specialists to verify the approaches they adopt. My personal prayer is that this haste does not settle into a trend in astronomical surveys and large data analysis.

  1. Check out the project, GALAXY ZOO
]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/classification-and-clusterin/feed/ 0
[Book] pattern recognition and machine learning http://hea-www.harvard.edu/AstroStat/slog/2008/pml/ http://hea-www.harvard.edu/AstroStat/slog/2008/pml/#comments Tue, 16 Sep 2008 19:20:43 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=355 A nice book by Christopher Bishop.
While I was reading abstracts and papers from astro-ph, I saw many applications of algorithms from pattern recognition and machine learning (PRML). The frequency will only increase as large scale survey projects multiply, so recommending a good textbook or reference in the field seems timely.

Survey and population studies generally invite large data sets. Any discussion of individual objects from such a survey indicates that those objects are outliers with respect to the rest of the catalog created by the survey. These outliers deserve strong spotlights, in contrast to the notion that outliers are useless. Beyond studies of outliers, survey and population studies generally involve machine learning and pattern recognition, or supervised and unsupervised learning, or classification and clustering, or statistical learning. Whatever jargon you choose, the book overviews the most popular machine learning methods extensively, with examples, nice illustrations, and concise math. Once you understand the characteristics of your catalog, such as its dimensions, sample size, independent and dependent variables, missing values, sampling (volume limited, magnitude limited, incompleteness), measurement errors, scatter plots, and so on, the book can offer, as a second step, proper approaches for summarizing the large data set as a whole, based on your data analysis objective in a statistical sense.

Click here to access the book website for various resources, including a few book chapters, retailer links, examples, and solutions. A review is also available.

A lesson from reading arxiv/astro-ph over the past year is that astronomers must become interdisciplinary, particularly those working on surveys and creating catalogs. From the information retrieval viewpoint, some rudimentary education in pattern recognition and machine learning is a must, just as I believe basic statistics and probability theory should be offered to young astronomers (as the astrostatistics summer school at Penn State does). In graduate school, I saw students from many non-statistics majors taking statistics classes, but rarely students from astronomy or physics. To test this observation, I took computational physics to learn how astronomers and physicists handle data with uncertainty. Although it was one of my favorite classes, the course was quite far from statistics (game theory was the most statistically relevant subject). Hence, I suspect that not many astronomy departments offer practical courses in statistics or machine learning, and recommending good, modern textbooks on (statistical) data analysis can therefore benefit self-teaching astronomers. I hope my reasoning is on the right track.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/pml/feed/ 0
[ArXiv] 1st week, June 2008 http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-june-2008/ http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-june-2008/#comments Mon, 09 Jun 2008 01:45:45 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=328 Despite the lack of explicitly statistical discussion, a paper comparing XSPEC and ISIS, two open-source spectral analysis applications, may draw high energy astrophysicists’ interest this week.

  • [astro-ph:0806.0650] Kimball and Ivezi\’c
    A Unified Catalog of Radio Objects Detected by NVSS, FIRST, WENSS, GB6, and SDSS (The catalog is available HERE. I am always fascinated by the possibilities that machine learning and statistics can explore in catalog data sets, and I do hope the measurement error columns get recognition from non-astronomers.)

  • [astro-ph:0806.0820] Landau and Simeone
    A statistical analysis of the data of Delta \alpha/\alpha from quasar absorption systems (It discusses Student t-tests, from which confidence intervals when variances are unknown, and sample sizes based on Type I and II errors, are obtained.)

  • [stat.ML:0806.0729] R. Girard
    High dimensional gaussian classification (Model-based classification via a Gaussian mixture approach, though often labeled clustering in astronomy, is very popular for multi-dimensional astronomical data.)

  • [astro-ph:0806.0520] Vio and Andreani
    A Statistical Analysis of the “Internal Linear Combination” Method in Problems of Signal Separation as in CMB Observations (Independent component analysis, ICA is discussed)

  • [astro-ph:0806.0560] Noble and Nowak
    Beyond XSPEC: Towards Highly Configurable Analysis (Learning the flow of spectral analysis with XSPEC and Sherpa has never come smoothly to me; it has been a personal struggle. The paper treats XSPEC as a black box, an assessment I completely agree with. Its main objective is to compare XSPEC and ISIS.)

  • [astro-ph:0806.0113] Casandjian and Grenier
    A revised catalogue of EGRET gamma-ray sources (The maximum likelihood detection method, which I have never encountered in the statistics literature, is utilized.)
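As an aside on the Landau and Simeone entry above: the Student t construction is standard, since with the variance unknown the confidence interval for a mean uses the sample standard deviation and a t critical value. A hedged sketch in Python (the data values are invented for illustration, and the critical value 2.262 for 9 degrees of freedom is taken from a t table to stay within the standard library):

```python
from statistics import mean, stdev
from math import sqrt

def t_confidence_interval(sample, t_crit):
    """Two-sided CI for the mean with unknown variance:
    xbar +/- t_crit * s / sqrt(n), s the sample standard deviation.
    t_crit must match the desired level and n - 1 degrees of freedom
    (looked up in a table here, since the stdlib has no t quantile).
    """
    n = len(sample)
    xbar, s = mean(sample), stdev(sample)
    half = t_crit * s / sqrt(n)
    return xbar - half, xbar + half

# Ten invented Delta alpha / alpha measurements (arbitrary units).
data = [-0.5, -0.7, -0.2, -0.6, -0.4, -0.8, -0.3, -0.5, -0.6, -0.4]
lo, hi = t_confidence_interval(data, t_crit=2.262)  # 95%, df = 9
```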
]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-1st-week-june-2008/feed/ 0
[ArXiv] Ripley’s K-function http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-ripleys-k-function/ http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-ripleys-k-function/#comments Tue, 22 Apr 2008 03:56:33 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/?p=277 Because of the extensive work of Prof. Peebles and many (observational) cosmologists (I find Prof. Peebles’s book cited in almost every piece of the cosmology literature), the two- (or three-) point correlation function dominates all other mathematical and statistical methods for understanding the structure of the universe. Unusually, this week brings an astro-ph paper by a statistics professor addressing the K-function to explore the mystery of the universe.

[astro-ph:0804.3044] J.M. Loh
Estimating Third-Order Moments for an Absorber Catalog

Instead of going into the detailed contents, which I leave to the readers, I would rather cite a few key points without mathematical symbols. The script K is the third-order K-function, from which the three-point and reduced three-point correlation functions are derived. The benefits of the script K function over these correlation functions concern bin size and edge correction. The author does not advocate the script K function alone, but rather using all available tools. The feasibility of computing third- or higher-order measures of clustering, thanks to larger datasets and advances in computing, is also noted. The appendix proves the unbiasedness of the estimator of the script K.
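For readers meeting this machinery for the first time, the ordinary second-order Ripley K-function already conveys the idea: count pairs of points closer than r, scaled by intensity. A toy sketch of the naive estimator (2D, no edge correction, which is precisely one of the issues the paper raises; this is my own illustration, not the paper's script-K implementation):

```python
from math import hypot

def ripley_k(points, r, area):
    """Naive estimate of Ripley's K at radius r for 2D points observed
    in a window of the given area:
        (area / n^2) * #{ordered pairs (i, j), i != j, dist <= r}.
    No edge correction; under complete spatial randomness K(r) should
    be close to pi * r^2.
    """
    n = len(points)
    pairs = sum(
        1
        for i, (xi, yi) in enumerate(points)
        for j, (xj, yj) in enumerate(points)
        if i != j and hypot(xi - xj, yi - yj) <= r
    )
    return area / (n * n) * pairs

# Four points on a corner grid in the unit square: only the
# side-length-0.5 neighbours fall within r = 0.6, not the diagonals.
pts = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]
k = ripley_k(pts, r=0.6, area=1.0)   # 8 ordered pairs: 1/16 * 8 = 0.5
```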

The reason for bringing up this K-function comes from my early experience learning statistics. My memory of the two-point correlation function from an undergraduate cosmology class is vague, but the basic idea of modeling this function gave me an epiphany during a spatial statistics class several years ago, when Ripley’s K-function was introduced. I vividly remember setting up my own project to use the K-function to characterize the spatial distribution of GRBs. I chose GRBs instead of galaxies because (1) I could find the data set on the internet on my own (the BATSE catalog; astronomers may think accessing data archives is easy, but statistics students were generally unaware that astronomical data sets are available via the internet, and for data sets they depend heavily on data providers, or clients), and (2) I recalled the paper by Professors Efron and Petrosian (1995, ApJ, 449, 215-223, Testing Isotropy versus Clustering of Gamma-ray Bursts), which utilized a nearest-neighbor approach. A few weeks later, I discovered that GRB redshifts had been found and the cosmological origin of GRBs was becoming understood more deeply; in other words, 2D spatial statistics was not the way to find the origin of GRBs. Owing to a few shortcomings, one of which was the latitude-dependent observation of BATSE (as a second-year graduate student, I had not yet confronted the ideas of censoring and truncation), I discontinued my personal project, discouraged that I could not make a contribution (the data themselves, such as the discovered distances, speak far louder than statistical inferences without distances).

I was delighted to see Prof. Loh’s work on Ripley’s K-function. Those curious about the K-function may check the book by Martinez and Saar, Statistics of the Galaxy Distribution (Amazon Link). Many statistical publications on spatial statistics and point processes also cover Ripley’s K-function.

]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-ripleys-k-function/feed/ 0
[ArXiv] 2nd week, Jan. 2007 http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/ http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/#comments Fri, 11 Jan 2008 19:44:44 +0000 hlee http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/ It is notable that an astronomy paper contains AIC, BIC, and Bayesian evidence in its title. The topic of the paper, as usual, is cosmology, like the other astronomy papers that discuss these (statistical) information criteria (I have found only a couple of papers on model selection applied to astronomical data analysis that do not involve the CMB; note that I exclude the Bayes factor for model selection purposes).

To find the paper or other interesting ones, click

  • [astro-ph:0801.0638]
    AIC, BIC, Bayesian evidence and a notion on simplicity of cosmological model M Szydlowski & A. Kurek

  • [astro-ph:0801.0642]
    Correlation of CMB with large-scale structure: I. ISW Tomography and Cosmological Implications S. Ho et al.

  • [astro-ph:0801.0780]
    The Distance of GRB is Independent from the Redshift F. Song

  • [astro-ph:0801.1081]
    A robust statistical estimation of the basic parameters of single stellar populations. I. Method X. Hernandez and D. Valls-Gabaud

  • [astro-ph:0801.1106]
    A Catalog of Local E+A(post-starburst) Galaxies selected from the Sloan Digital Sky Survey Data Release 5 T. Goto (Carefully built catalogs are wonderful sources for classification/supervised learning, or semi-supervised learning)

  • [astro-ph:0801.1358]
    A test of the Poincare dodecahedral space topology hypothesis with the WMAP CMB data B.S. Lew & B.F. Roukema

In cosmology, the few candidate models under consideration are generally nested: a larger model carries extra terms relative to the smaller ones. How the penalty for those extra terms is defined leads to different model selection criteria. However, astronomy papers generally never discuss the consistency or statistical optimality of these selection criteria; at most they offer Monte Carlo simulations and extensive comparisons across the criteria. Nonetheless, my personal view is that the field of model selection should be promoted among astronomers, to prevent the fallacy of blindly fitting models that may be irrelevant to the information the data set contains. Physics suggests the correct model, but so do the data.
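For concreteness, the two criteria penalize the maximized log-likelihood differently: AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, so BIC penalizes extra terms more heavily whenever n exceeds e^2, roughly 7.4. A small sketch with invented log-likelihoods for two nested models (the numbers are illustrative, not from any of the papers above):

```python
from math import log

def aic(loglik, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian (Schwarz) information criterion: k ln n - 2 ln L."""
    return k * log(n) - 2 * loglik

# Hypothetical nested fits: the larger model gains 3 units of
# log-likelihood from 2 extra parameters, on n = 300 data points.
n = 300
small = {"loglik": -120.0, "k": 2}
large = {"loglik": -117.0, "k": 4}

d_aic = aic(large["loglik"], large["k"]) - aic(small["loglik"], small["k"])
d_bic = bic(large["loglik"], large["k"], n) - bic(small["loglik"], small["k"], n)
# AIC slightly favors the larger model (delta = -2), while BIC does not
# (delta = 2 ln 300 - 6 > 0): the penalty alone drives the choice.
```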

]]>
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/feed/ 0
Spurious Sources http://hea-www.harvard.edu/AstroStat/slog/2007/spurious-sources/ http://hea-www.harvard.edu/AstroStat/slog/2007/spurious-sources/#comments Wed, 19 Sep 2007 18:21:57 +0000 vlk http://hea-www.harvard.edu/AstroStat/slog/2007/spurious-sources/ [arXiv:0709.2358] Cleaning the USNO-B Catalog through automatic detection of optical artifacts, by Barron et al.

Statistically speaking, “false sources” are generally in the domain of Type I errors, defined by the probability of detecting a signal where there is none. But what if there is a clear signal, but it is not real?

In astronomical analysis, sources are generally defined with reference to the existing background, as point fluctuations that exceed some significance threshold set by the background estimated “in the vicinity”. The threshold is usually chosen so that we can tolerate “a few” false positives at borderline significance. But that ignores systematic deviations caused by various instrumental features. Such things are common in X-ray images: window support structures, chip gaps, bad CCD columns, cosmic-ray hits, etc. Optical data are generally cleaner, but by no means immune to the problem. Here, Barron et al. describe how they went through the USNO-B catalog, modeling and eliminating artifacts arising from diffraction spikes and telescope reflection halos of bright stars.

The bad news? More than 2.3% of the sources are flagged as spurious. Compare that to the typical statistical significance at which detection thresholds are set (usually >3 sigma).
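To put the 2.3% in perspective: a one-sided 3 sigma threshold corresponds to a per-trial false positive probability of about 0.13%, so the flagged artifacts outnumber purely statistical false positives by more than an order of magnitude. A quick check (the single-trial framing is my simplification; real detection involves many, often correlated, trials):

```python
from statistics import NormalDist

# One-sided tail probability beyond a 3 sigma threshold.
p_false = 1.0 - NormalDist().cdf(3.0)   # about 1.35e-3

# Fraction of USNO-B sources flagged as spurious by Barron et al.
f_spurious = 0.023

ratio = f_spurious / p_false            # roughly 17: artifacts dominate
```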

]]>
http://hea-www.harvard.edu/AstroStat/slog/2007/spurious-sources/feed/ 2