Last Updated: 2010aug31
mini-Workshop on Computational AstroStatistics
Challenges and Methods for Massive Astronomical Data
Aug 24-25, 2010
The California-Boston-Smithsonian AstroStatistics Collaboration
hosted a mini-workshop on Computational Astro-statistics
at the CfA.
With the advent of new missions like the Solar Dynamics Observatory
(SDO), the Panoramic Survey Telescope and Rapid Response System
(Pan-STARRS), and the Large Synoptic Survey Telescope (LSST),
astronomical data collection is fast outpacing our capacity to analyze
the data. Astrostatistical effort has generally
focused on principled analysis of individual observations, on one
or a few sources at a time. But the new era of data-intensive
observational astronomy forces us to consider combining multiple
datasets and inferring parameters that are common to entire populations.
Many astronomers want to use every data point, including
non-detections, but doing so is problematic for many statistical
techniques.
The goal of the Workshop was to explore new problems in astronomical
data analysis that arise from data complexity. Our focus was on
problems that have generally been considered intractable due to
insufficient computational power or inefficient algorithms, but are
now becoming tractable. Examples of such problems include: accounting
for uncertainties in instrument calibration; classification,
regression, and density estimation of massive data sets that may
be truncated and contaminated with measurement errors and outliers;
and designing statistical emulators to efficiently approximate the
output from complex astrophysical computer models and simulations,
thus making statistical inference on them tractable. We aimed to
present some issues based on existing X-ray data from observatories
such as Chandra and XMM-Newton to the statisticians and clarify
difficulties with the currently used methodologies, e.g. MCMC
methods. The Workshop consisted of review talks on current
statistical methods by statisticians, descriptions of data analysis
issues by astronomers, and open discussions between astronomers and
statisticians. We hope to define a path for development of new
algorithms that target specific issues, designed to help with
applications to SDO, Pan-STARRS, LSST, and other survey data.
The schedule was structured to encourage questions and discussion, both
during the talks themselves and during the loosely structured
Discussion sessions at the end of the day.
Alanna's notes: [.txt] (internal only)
9:30am - Noon : Session 1A : video [.rm 277MB]
Moderator: Andreas Zezas (Crete)
- Aneta Siemiginowska (SAO) : Welcome and Introduction
- [.pdf]
- Kirk Borne (George Mason) : LSST: Informatics and Statistics Research Challenges
- Abstract
The proposed Large Synoptic Survey Telescope (LSST) would
generate the equivalent of one entire Sloan Digital Sky Survey's
data output each night for 10 years. The scientific discovery
potential from these data is enormous, as are the research challenges
that they impose. I will review briefly the plans for LSST
and for the new LSST Informatics and Statistical Sciences
research collaboration team. The primary emphasis will be
on the research questions that are related to the large, complex
data collection to be produced by the survey. These research
questions will be framed within the context of a new emerging
astronomy subdiscipline, Astroinformatics.
[.ppt]
- Keith Arnaud (GSFC) : LISA: A Big Problem on a Small Data Set
- Abstract
The Laser Interferometer Space Antenna (LISA) is a planned NASA/ESA mission
to measure gravitational waves. Although the basic LISA data set
comprises only three time series, their analysis is a significant
problem in computational astrostatistics because the signals from tens
of thousands of sources are superimposed. I will describe the problem
and show some of the approaches adopted.
[.ppt]
- Brandon Kelly (SAO) : Constraining astronomical populations with truncated data sets
- Abstract
Understanding astronomical populations and their evolution
is often one of the primary goals of large surveys. However,
this is not always a straightforward task, because the
quantities of interest, such as mass, are not directly measurable
but are instead derived, with uncertainty, from measurable
quantities such as luminosity. Moreover, the situation is
complicated by data truncation caused by brightness limits
of telescopes. This makes it difficult to perform statistical
inference on the astronomical populations, especially if
one wants to accurately account for the uncertainty in the
derived parameters. In this talk I will discuss a Bayesian
approach to this problem, based on hierarchical modeling,
as well as recent applications of this approach to astronomical
surveys. I will conclude by discussing some of the computational
problems facing this approach, outlining areas where further
work is needed.
[.ppt]
Noon - 1:30pm : Lunch break
1:30pm - 4pm : Session 1B : video [.rm 244MB]
Moderator: Paul Baines (UC Davis)
- Peter Freeman (CMU) : Nonlinear Data Reparametrization with Diffusion Map
- Abstract
Data that inhabit complex structures in high-dimensional
spaces, such as galaxy spectra, often possess a simpler
underlying geometry. Diffusion map is a nonlinear
eigen-technique that captures that geometry by propagating
local neighborhood information through a Markov process.
It thus allows one to find a natural coordinate system for
data whose original parametrization is not amenable to
available statistical techniques. In this talk I will
review the basics of diffusion map, show how it has been
applied to various datasets by members of our group, and
outline the challenges we face in scaling up diffusion-map-based
algorithms in the era of LSST.
[.pdf]
- Joey Richards (UC Berkeley) : Real-time Classification for The Palomar Transient Factory
- Abstract
I will be talking about the challenges of classifying astronomical
time-series data, such as the photometric light curves collected by
PTF. Recently, we created a method for supernova light curve
classification, and I will show results using data from the DES
Supernova Photometric Classification Challenge. The next challenge
is to extend these methods to deal with highly multi-class problems
and to scale them up for real-time classification in preparation for
the LSST.
[.pdf]
- Daryl Geller (Stony Brook) : Spherical wavelets for CMB temperature and polarization data analysis
- Abstract
Spherical wavelets are a tool for spherical data analysis.
Their main advantage over spherical harmonics is their
localization, both in space and frequency. (Of course,
spherical harmonics are completely localized in frequency,
but they are spread all over the sphere.) This property
of spherical wavelets has been exploited in CMB analysis,
in particular in avoiding foregrounds/masked regions, and
also in searching for features/asymmetries, specifically
the "cold spot".
We discuss four different kinds of spherical wavelets, all
of "needlet" type; they all possess the crucial properties
of localization. In addition, under mild conditions, the
needlet coefficients of a random field (defined by taking
inner products of the random field with the needlets) turn
out to be asymptotically uncorrelated, making it possible
to exploit the law of large numbers for power spectrum-type
estimations.
For the study of CMB temperature, we discuss standard
needlets and introduce Mexican needlets; for the study of
CMB polarization, we introduce spin needlets; and for the
study of cross-spectra between the temperature and polarization
fields (one of the main objectives of the Planck mission),
we introduce mixed needlets.
My contributions have been in collaboration with Domenico
Marinucci, Frode Hansen and Azita Mayeli.
[.pdf]
4pm - 4:20pm : Coffee break
4:20pm - 5:30pm : Open Discussion : video [.rm 114MB]
Moderator: Vinay Kashyap (SAO)
[.pdf]
- Nick Wright (SAO) : Statistical Challenges in the Chandra Cygnus OB2 Survey
- [.pdf]
- Raffaele D'Abrusco (SAO) : IVOA IG-KDD
- [.pdf]
- Shantanu Desai (Illinois) : Dark Energy Data Management System : Overview and Challenges
- [.pdf]
9:30am - Noon : Session 2A : video [.rm 288MB]
Moderator: Jeremy Drake (SAO)
- Alisdair Davey & Paola Testa (SAO) : Challenges in Data Distribution and Analysis with the Solar Dynamics Observatory
- Abstract
We discuss the challenges of storing, accessing, and analyzing
the large volume of data (~2TB/day) from the Solar Dynamics
Observatory (SDO). New tools are required to use SDO data
effectively, and in particular metadata are created to
allow scientists to identify and retrieve data sets that
address their particular science questions. We present the
results of the efforts in this regard by the SDO Feature
Finding team, to build a comprehensive computer vision
pipeline for SDO. This pipeline will provide complete
metadata on many of the features and events detectable on
the Sun without human intervention and make them available
to the entire solar community. We also talk about the
challenges of providing access to the data to solar scientists
around the planet.
[.pptx]
[ movie1, movie2, movie3, movie4 ]
- Ashish Mahabal (Caltech) : Where statistical methods can help with Transients classification from surveys
- Abstract
Recent advances in observing and computing technology have
led to a large explosion in astronomical data in terms of
sheer volume, so much so that there is no way humans can
look at all the data. As a result of synoptic sky surveys
like Palomar-Quest and the Catalina Real-time Transient Survey,
digital movies instead of individual snapshots are available
(although often with large gaps between successive frames).
Detecting transients in these streams is the starting point
of many interesting projects (new classes, sub-populations
of objects, and in general better understanding of the
nature of different types of astronomical objects).
Characterizing and classifying the transients is not easy
though, partly owing to the sparsity of the data as well
as the presence of upper limits (varying error-bars, missing
data, censored data etc.), and mainly because it has to be
done based on just a small number of initial observations.
New developments may even be required to make substantial
progress. Sometimes context information helps. This
is often from other wavelengths and with rather different
characteristics. Combining the heterogeneous data forms
another challenge. I will present details on these and a
few other issues as well as the current status. Other
existing and forthcoming surveys (e.g. LSST, ASKAP-VAST,
Gaia) will automatically benefit from advances in this area.
[.pdf]
- Pavlos Protopapas (SAO) : Discovery of celestial objects using machine learning techniques
- Abstract
In the modern era of astronomy, data are expanding at an
exponential rate. Our traditional methods do not
work for these massive data rates, and machine learning has
been called to the rescue. In this talk I will present a
few examples of machine learning methods that have been used at the Time
Series Center to discover new celestial objects.
We are discovering new variable stars, new quasars, and new
objects at the very edge of the solar system. These discoveries
are helping us shape our understanding of the universe we
live in and are only possible with advanced machine
learning methods.
[.pdf]
Noon - 1:30pm : Lunch break
1:30pm - 4pm : Session 2B : video [.rm 224MB]
Moderator: Brandon Kelly (CfA)
- Alexander Gray (Georgia Tech) : Beyond RAM: Fast Statistical Analysis in Databases
- Abstract
In recent years we have developed the fastest current
algorithms for various critical computations in astrostatistics,
including n-point correlation functions, all-nearest-neighbors,
kernel density estimation, and nonparametric Bayes
classification. The codes, however, were developed assuming
the data can fit in memory. I'll discuss how we have begun
to enable such fast algorithms in the setting where the
data fit on disk, but not necessarily RAM, by employing a
novel disk-based tree structure and algorithmic approach.
I will show experimental runtimes for an implementation
within Microsoft SQL Server.
- Alex Blocker (Harvard) : Semi-parametric Robust Event Detection for Massive Time-Series Datasets
- Abstract
The detection and analysis of events within massive collections of
time-series has become an extremely important task for time-domain
astronomy. In particular, many scientific investigations (e.g. the
analysis of microlensing and other transients) begin with the
detection of isolated events in irregularly-sampled series with both
non-linear trends and non-Gaussian noise. I will discuss a
semi-parametric, robust, parallel method for identifying variability
and isolated events at multiple scales in the presence of the above
complications. This approach harnesses the power of Bayesian modeling
while maintaining much of the speed and scalability of more ad-hoc
machine learning approaches. I will also contrast this work with event
detection methods from other fields, highlighting the unique
challenges posed by astronomical surveys. Finally, I will present
initial results from the application of this method to 87.2 million
EROS sources, where we have obtained a greater than 100-fold reduction
in candidates for certain types of phenomena.
[.pdf]
- Lukasz Wyrzykowski (Cambridge) : Transient classification with Gaia
- Abstract
Gaia is the successor to the Hipparcos mission, and its main goal is to
derive positions, distances and motion information for about a billion
stars and create a 6D image of the Galaxy. In its 5-year lifetime
(from 2012) it will repeatedly scan the entire sky, also allowing for
the near-real-time detection of new objects or anomalous behaviour
of stars.
In my talk I will present the preparations undertaken for the
detection and classification of transient events in Gaia. I will
describe proposed detection methods and show first results of the
classification of simulated data using SOMs, ANNs and Bayesian
classifiers.
[.pdf]
4pm - 4:20pm : Coffee break
4:20pm - 5:15pm : Open Discussion : video [.rm 119MB]
Moderator: David van Dyk (UC Irvine)
- Alanna Connors (Eureka Sci) : Workshop Wrap-up
- [.rtf]
- Alanna Connors (Eureka Sci)
- Alberto Conti (STScI)
- Alex Blocker (Harvard)
- Alexander Gray (Georgia Tech)
- Alisdair Davey (SAO)
- Allison Strom (SAO)
- Andreas Zezas (Crete)
- Aneta Siemiginowska (SAO)
- Angelica de Oliveira Costa (CfA)
- Ashish Mahabal (Caltech)
- Brandon Kelly (CfA)
- Brendan Allen (CfA)
- Cecilia Garraffo (CfA)
- Chris Stubbs (CfA)
- Daryl Geller (Stony Brook)
- David Kipping (CfA)
- David Stenning (UC Irvine)
- David van Dyk (UC Irvine)
- Dharam Vir Lal (CfA)
- Eric Kolaczyk (BU)
- Francesco Massaro (SAO)
- Gautham Narayan (CfA)
- Ignazio Pillitteri (SAO)
- Irwin Shapiro (CfA)
- Jan Forbrich (CfA)
- Jennifer Posson-Brown (SAO)
- Jeremy Drake (SAO)
- Jin Xu (UC Irvine)
- Joey Richards (UC Berkeley)
- Kaisey Mandel (CfA)
- Karim Pichara (PUC de Chile)
- Keith Arnaud (GSFC)
- Kirk Borne (George Mason)
- Li Ji (CfA)
- Lukasz Wyrzykowski (Cambridge)
- Margarita Karovska (SAO)
- Nathan Stein (Harvard)
- Nick Wright (SAO)
- Paola Testa (SAO)
- Paul Baines (UC Davis)
- Paul Green (SAO)
- Pavlos Protopapas (CfA)
- Pete Ratzlaff (SAO)
- Peter Freeman (Carnegie Mellon)
- Raffaele D'Abrusco (CfA)
- Saku Vrtilek (SAO)
- Shandong Min (UC Irvine)
- Shantanu Desai (Illinois)
- Susana Eyheramendy (PUC de Chile)
- Terry Gaetz (SAO)
- Thomas Granger (CfA)
- Tom Aldcroft (SAO)
- Trae Winter (SAO)
- Vinay Kashyap (SAO)
- Xiao-Li Meng (Harvard)
- Yaming Yu (UC Irvine)
Vinay Kashyap (vkashyap @ cfa . harvard . edu)
Aneta Siemiginowska (asiemiginowska @ cfa . harvard . edu)
David van Dyk (dvd @ ics . uci . edu)
This workshop was supported by
CHASC/C-BAS,
NSF grants DMS 09-07185 (HU) and DMS 09-07522 (UCI), and the
Chandra X-ray Center.