Last Updated: 2010aug31

mini-Workshop on Computational AstroStatistics

Challenges and Methods for Massive Astronomical Data

Aug 24-25, 2010

Phillips Auditorium, Harvard-Smithsonian Center for Astrophysics
60 Garden St., Cambridge, MA 02138

hea-www.harvard.edu/AstroStat/CAS2010

Description

The California-Boston-Smithsonian AstroStatistics Collaboration hosted a mini-workshop on Computational AstroStatistics at the CfA. With the advent of new missions and surveys like the Solar Dynamics Observatory (SDO), the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS), and the Large Synoptic Survey Telescope (LSST), astronomical data collection is fast outpacing our capacity to analyze the data. Astrostatistical effort has generally focused on principled analysis of individual observations, on one or a few sources at a time. But the new era of data-intensive observational astronomy forces us to consider combining multiple datasets and inferring parameters that are common to entire populations. Many astronomers want to use every data point, including non-detections, but this becomes problematic for many statistical techniques.

The goal of the Workshop was to explore new problems in astronomical data analysis that arise from data complexity. Our focus was on problems that have generally been considered intractable due to insufficient computational power or inefficient algorithms, but are now becoming tractable. Examples of such problems include: accounting for uncertainties in instrument calibration; classification, regression, and density estimation for massive data sets that may be truncated and contaminated with measurement errors and outliers; and designing statistical emulators to efficiently approximate the output from complex astrophysical computer models and simulations, thus making statistical inference on them tractable. We aimed to present to the statisticians some issues based on existing X-ray data from observatories such as Chandra and XMM-Newton, and to clarify difficulties with currently used methodologies, e.g., MCMC methods. The Workshop consisted of review talks on current statistical methods by statisticians, descriptions of data analysis issues by astronomers, and open discussions between astronomers and statisticians. We hope to define a path for the development of new algorithms that target specific issues, designed to help with applications to SDO, Pan-STARRS, LSST, and other survey data.

Schedule

The schedule was structured to encourage questions and discussion, both during the talks themselves and during the loosely structured Discussion sessions at the end of each day.


Tuesday, Aug 24

Alanna's notes: [.txt] (internal only)


9:30am - Noon : Session 1A : video [.rm 277MB]
Moderator: Andreas Zezas (Crete)
Aneta Siemiginowska (SAO) : Welcome and Introduction
[.pdf]
Kirk Borne (George Mason) : LSST: Informatics and Statistics Research Challenges
Abstract
The proposed Large Synoptic Survey Telescope (LSST) would generate the equivalent of one entire Sloan Digital Sky Survey's data output each night for 10 years. The scientific discovery potential from these data is enormous, as are the research challenges that they impose. I will review briefly the plans for LSST and for the new LSST Informatics and Statistical Sciences research collaboration team. The primary emphasis will be on the research questions that are related to the large, complex data collection to be produced by the survey. These research questions will be framed within the context of a new emerging astronomy subdiscipline, Astroinformatics.
[.ppt]
Keith Arnaud (GSFC) : LISA: A Big Problem on a Small Data Set
Abstract
The Laser Interferometer Space Antenna (LISA) is a planned NASA/ESA mission to measure gravitational waves. Although the basic LISA data set comprises only three time series, their analysis is a significant problem in computational astrostatistics because the signals from tens of thousands of sources are superimposed. I will describe the problem and show some of the approaches adopted.
[.ppt]
Brandon Kelly (SAO) : Constraining astronomical populations with truncated data sets
Abstract
Understanding astronomical populations and their evolution is often one of the primary goals of large surveys. However, this is not always a straightforward task, because the quantities of interest, such as mass, are not directly measurable but are instead derived, with uncertainty, from measurable quantities such as luminosity. Moreover, the situation is complicated by data truncation caused by the brightness limits of telescopes. This makes it difficult to perform statistical inference on the astronomical populations, especially if one wants to accurately account for the uncertainty in the derived parameters. In this talk I will discuss a Bayesian approach to this problem, based on hierarchical modeling, as well as recent applications of this approach to astronomical surveys. I will conclude by discussing some of the computational problems facing this approach, outlining areas where further work is needed.
[.ppt]

Noon - 1:30pm : Lunch break

1:30pm - 4pm : Session 1B : video [.rm 244MB]
Moderator: Paul Baines (UC Davis)
Peter Freeman (CMU) : Nonlinear Data Reparametrization with Diffusion Map
Abstract
Data that inhabit complex structures in high-dimensional spaces, such as galaxy spectra, often possess a simpler underlying geometry. Diffusion map is a nonlinear eigen-technique that captures that geometry by propagating local neighborhood information through a Markov process. It thus allows one to find a natural coordinate system for data whose original parametrization is not amenable to available statistical techniques. In this talk I will review the basics of diffusion map, show how it has been applied to various datasets by members of our group, and outline the challenges we face in scaling up diffusion map-based algorithms in the era of LSST.
[.pdf]
Joey Richards (UC Berkeley) : Real-time Classification for The Palomar Transient Factory
Abstract
I will be talking about the challenges of classifying astronomical time-series data, such as the photometric light curves collected by PTF. Recently, we created a method for supernova light curve classification, and I will show results using data from the DES Supernova Photometric Classification Challenge. The next challenge is to extend these methods to deal with highly multi-class problems and to scale them up for real-time classification in preparation for the LSST.
[.pdf]
Daryl Geller (Stony Brook) : Spherical wavelets for CMB temperature and polarization data analysis
Abstract
Spherical wavelets are a tool for spherical data analysis. Their main advantage over spherical harmonics is their localization, both in space and frequency. (Of course, spherical harmonics are completely localized in frequency, but they are spread all over the sphere.) This property of spherical wavelets has been exploited in CMB analysis, in particular in avoiding foregrounds/masked regions, and also in searching for features/asymmetries, specifically the "cold spot".
We discuss four different kinds of spherical wavelets, all of "needlet" type; they all possess the crucial properties of localization. In addition, under mild conditions, the needlet coefficients of a random field (defined by taking inner products of the random field with the needlets) turn out to be asymptotically uncorrelated, making it possible to exploit the law of large numbers for power spectrum-type estimations.
For the study of CMB temperature, we discuss standard needlets and introduce Mexican needlets; for the study of CMB polarization, we introduce spin needlets; and for the study of cross-spectra between the temperature and polarization fields (one of the main objectives of the Planck mission), we introduce mixed needlets.
My contributions have been in collaboration with Domenico Marinucci, Frode Hansen and Azita Mayeli.
[.pdf]

4pm - 4:20pm : Coffee break

4:20pm - 5:30pm : Open Discussion : video [.rm 114MB]
Moderator: Vinay Kashyap (SAO)
[.pdf]
Nick Wright (SAO) : Statistical Challenges in the Chandra Cygnus OB2 Survey
[.pdf]
Raffaele D'Abrusco (SAO) : IVOA IG-KDD
[.pdf]
Shantanu Desai (Illinois) : Dark Energy Data Management System : Overview and Challenges
[.pdf]

Wednesday, Aug 25


9:30am - Noon : Session 2A : video [.rm 288MB]
Moderator: Jeremy Drake (SAO)
Alisdair Davey & Paola Testa (SAO) : Challenges in Data Distribution and Analysis with the Solar Dynamics Observatory
Abstract
We discuss the challenges of storing, accessing, and analyzing the large volume of data (~2TB/day) from the Solar Dynamics Observatory (SDO). New tools are required to use SDO data effectively, and in particular metadata are created to allow scientists to identify and retrieve data sets that address their particular science questions. We present the results of the efforts in this regard by the SDO Feature Finding team to build a comprehensive computer vision pipeline for SDO. This pipeline will provide complete metadata on many of the features and events detectable on the Sun without human intervention and make them available to the entire solar community. We also talk about the challenges of providing access to the data to solar scientists from around the planet.
[.pptx]
[ movie1, movie2, movie3, movie4 ]
Ashish Mahabal (Caltech) : Where statistical methods can help with Transients classification from surveys
Abstract
Recent advances in observing and computing technology have led to an explosion in astronomical data in terms of sheer volume, so much so that there is no way humans can look at all the data. As a result of synoptic sky surveys like Palomar-Quest and the Catalina Real-time Transient Survey, digital movies instead of individual snapshots are available (although often with large gaps between successive frames). Detecting transients in these streams is the starting point of many interesting projects (new classes, sub-populations of objects, and in general better understanding of the nature of different types of astronomical objects). Characterizing and classifying the transients is not easy though, partly owing to the sparsity of the data as well as the presence of upper limits (varying error-bars, missing data, censored data etc.), and mainly because it has to be done based on just a small number of initial observations. New developments may even be required to make substantial progress. Sometimes some context information helps. This is often from other wavelengths and with rather different characteristics. Combining the heterogeneous data forms another challenge. I will present details on these and a few other issues as well as the current status. Other existing and forthcoming surveys (e.g. LSST, ASKAP-VAST, Gaia) will automatically benefit from advances in this area.
[.pdf]
Pavlos Protopapas (SAO) : Discovery of celestial objects using machine learning techniques
Abstract
In the modern era of astronomy, data are expanding at an exponential rate. Our traditional methods do not work for these massive data rates, and machine learning has been called to the rescue. In this talk I will present a few examples of machine learning methods that have been used at the Time Series Center to discover new celestial objects. We are discovering new variable stars, new quasars, and new objects at the very edge of the solar system. These discoveries are helping us shape our understanding of the universe we live in and are only possible with advanced machine learning methods.
[.pdf]

Noon - 1:30pm : Lunch break

1:30pm - 4pm : Session 2B : video [.rm 224MB]
Moderator: Brandon Kelly (CfA)
Alexander Gray (Georgia Tech) : Beyond RAM: Fast Statistical Analysis in Databases
Abstract
In recent years we have developed the fastest current algorithms for various critical computations in astrostatistics, including n-point correlation functions, all-nearest-neighbors, kernel density estimation, and nonparametric Bayes classification. The codes, however, were developed assuming the data can fit in memory. I'll discuss how we have begun to enable such fast algorithms in the setting where the data fit on disk, but not necessarily RAM, by employing a novel disk-based tree structure and algorithmic approach. I will show experimental runtimes for an implementation within Microsoft SQL Server.
Alex Blocker (Harvard) : Semi-parametric Robust Event Detection for Massive Time-Series Datasets
Abstract
The detection and analysis of events within massive collections of time-series has become an extremely important task for time-domain astronomy. In particular, many scientific investigations (e.g. the analysis of microlensing and other transients) begin with the detection of isolated events in irregularly-sampled series with both non-linear trends and non-Gaussian noise. I will discuss a semi-parametric, robust, parallel method for identifying variability and isolated events at multiple scales in the presence of the above complications. This approach harnesses the power of Bayesian modeling while maintaining much of the speed and scalability of more ad-hoc machine learning approaches. I will also contrast this work with event detection methods from other fields, highlighting the unique challenges posed by astronomical surveys. Finally, I will present initial results from the application of this method to 87.2 million EROS sources, where we have obtained a greater than 100-fold reduction in candidates for certain types of phenomena.
[.pdf]
Lukasz Wyrzykowski (Cambridge) : Transient classification with Gaia
Abstract
Gaia is the successor to the Hipparcos mission, and its main goal is to derive positions, distances and motion information for a billion stars and create a 6D image of the Galaxy. In its 5-year lifetime (from 2012) it will repeatedly scan the entire sky, also allowing for the near-real-time detection of new objects or anomalous behaviour of stars.
In my talk I will present the preparations undertaken for the detection and classification of transient events in Gaia. I will describe proposed detection methods and show the first results of the classification of simulated data using SOMs, ANNs and Bayesian classifiers.
[.pdf]

4pm - 4:20pm : Coffee break

4:20pm - 5:15pm : Open Discussion : video [.rm 119MB]
Moderator: David van Dyk (UC Irvine)
Alanna Connors (Eureka Sci) : Workshop Wrap-up
[.rtf]

Participants


Contacts

Vinay Kashyap (vkashyap @ cfa . harvard . edu)
Aneta Siemiginowska (asiemiginowska @ cfa . harvard . edu)
David van Dyk (dvd @ ics . uci . edu)

This workshop was supported by CHASC/C-BAS, NSF grants DMS 09-07185 (HU) and DMS 09-07522 (UCI), and the Chandra X-Ray Center