The AstroStat Slog » robust
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy+Statistics+Computer Science+Engineering+Instrumentation, far beyond the growing borders

[MADS] data depth
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-data-depth/
Tue, 02 Jun 2009 02:51:29 +0000 hlee

How would you assign orders to multivariate data? If you have a strategy to achieve this ordering task, I'd like to ask, "is your strategy affine invariant?", meaning invariant under shifts and rotations.

Let's make the question a bit concrete. Which would be the largest point in the scatter plot below? Thirty random bivariate data were generated from a bivariate normal distribution (mu=[1,1], sigma=[[1,-.9],[-.9,1]]). If you have a rule for assigning ranks to some largest or smallest points, can you continue to apply it to all data points? Once you have done so, shift your data with respect to the sample mean (not mu; although I told you mu, the data do not reveal the true mu, but we can estimate mu and assign uncertainty to that estimate via various methods, beyond computing the sample standard deviation or traditional chi-square methods for more complex data and models. The mean, a solution of the chi-square or least-squares method for the hypothesis (model) mu=\bar{X}, is only one estimator among many) and rotate them clockwise by 90 degrees. Does each point keep the same rank when you apply your ranking method to the shifted and rotated multivariate data?

First, if you try coordinate-wise ranking, you will find that the two sets of ranks do not match. Additional questions I'd like to add: which axis would you consider first, X or Y? And how would you handle ties in one coordinate (under a bivariate normal, a continuous distribution, the chance of ties is zero, but real data often show ties)?
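As a quick numerical sanity check (this sketch is mine, not from the post), one can generate the thirty points described above with numpy, rank each coordinate independently, and verify that coordinate-wise ranks are not preserved under the shift-and-rotate operation:

```python
import numpy as np

rng = np.random.default_rng(42)
# thirty bivariate normal points as in the post: mu=[1,1], strong negative correlation
mu = [1.0, 1.0]
sigma = [[1.0, -0.9], [-0.9, 1.0]]
x = rng.multivariate_normal(mu, sigma, size=30)

def coordwise_ranks(data):
    # rank each coordinate independently (0 = smallest)
    return np.argsort(np.argsort(data, axis=0), axis=0)

# shift w.r.t. the sample mean, then rotate clockwise by 90 degrees
rot = np.array([[0.0, 1.0], [-1.0, 0.0]])   # clockwise 90-degree rotation matrix
y = (x - x.mean(axis=0)) @ rot.T

# coordinate-wise ranks are not preserved under this affine transformation
print(np.array_equal(coordwise_ranks(x), coordwise_ranks(y)))
```

After the rotation the new X-ranks are the old Y-ranks, so the two rank tables almost surely disagree for correlated data.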

The notion of data depth is not employed in astronomy as it is in some other fields, because variables carry physical meaning: the multidimensional data cloud is projected onto the most important axis, from which luminosity functions/histograms are drawn to discover a trend, say a clump at a certain color or a certain redshift indicating an event, epoch, or episode in formation history. I wonder whether such features still look clustered when they are deprojected into the original multidimensional space. Remember the curse of dimensionality: even though the projected data show bumps, in the multidimensional space a hypothesis test may conclude complete spatial randomness, i.e. that there is no clump. Consider transformations and distance metrics, or how nearest neighbors are defined. Simply put, consider the mathematical distances between pairs of galaxies for statistical analysis: these galaxies have location metrics, longitude, latitude, and redshift (3D spatial data points), and also scales such as magnitudes, group names, types, shapes, and classification labels, which cannot be plugged directly into a Euclidean distance. The choices of metric and transformation for assigning ranks to data differ between mathematical/algorithmic standpoints and practical/astronomical viewpoints. Ordering data is important for 1. identifying/removing outliers, 2. characterizing the data distribution quickly (for example, contour plots can be accelerated), and 3. updating statistics efficiently, by looking only at neighbors instead of processing the whole data set.

The other reason that affine invariant ordering of multivariate data finds no use in astronomy can be attributed to its nonparametric nature. So far, my impression is that nonparametric statistics is relatively unknown territory to astronomers. Unless a nonparametric statistic is designed for hypothesis testing and interval estimation, it does not give the confidence levels/intervals that astronomers always like to report: "how confident is the result of ordering multivariate data?" or "what is the N-sigma uncertainty that this data point is the center, within plus or minus 2?" The nonparametric bootstrap or the jackknife are rarely used for computing uncertainties to compensate for the fact that a nonparametric statistic only offers a point estimate (a best fit). Regardless of the lack of popularity of these robust point estimators from nonparametric statistics, the notion of data depth and nonparametric methods, I believe, would help astronomers retrieve summary statistics quickly from the multidimensional, massive, and streaming data of future surveys. This is the primary reason for introducing and advertising data depth in this post.

Data depth is a statistic that measures the rank, or centrality, of multivariate data. It assigns a numeric value to a vector (data point) based on its centrality relative to the data cloud. A statistical depth function satisfies the following conditions:

  1. affine invariance
  2. maximality at the center: the depth-wise median attains the highest depth (whereas in univariate ranking the highest rank goes to the datum of maximum value)
  3. monotonicity: depth decreases monotonically along any ray away from the deepest point
  4. the depth vanishes as the Euclidean distance of a datum goes to infinity
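These conditions can be illustrated with one concrete depth. Below is a sketch of mine using the Mahalanobis depth, defined as D(x) = 1 / (1 + d_M^2(x)) with the sample mean and covariance plugged in; the depth is exactly 1 at the estimated center and decays to zero with distance:

```python
import numpy as np

def mahalanobis_depth(points, data):
    """Mahalanobis depth D(x) = 1 / (1 + d_M^2(x)), with the center and
    scatter estimated by the sample mean and covariance of `data`."""
    center = data.mean(axis=0)
    s_inv = np.linalg.inv(np.cov(data, rowvar=False))
    diff = np.atleast_2d(points) - center
    md2 = np.einsum('ij,jk,ik->i', diff, s_inv, diff)   # squared Mahalanobis distances
    return 1.0 / (1.0 + md2)

rng = np.random.default_rng(0)
data = rng.multivariate_normal([1.0, 1.0], [[1.0, -0.9], [-0.9, 1.0]], size=30)
depth = mahalanobis_depth(data, data)

# condition 2: the deepest point (the depth-wise median) sits near the center;
# condition 4: the depth vanishes far away from the cloud
print("deepest point:", data[np.argmax(depth)])
print("depth far away:", mahalanobis_depth([1e3, 1e3], data)[0])
```

Affine invariance holds because the Mahalanobis distance itself is affine invariant.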

Nowadays, observing in several bands is more typical than observing in two or, at most, three. Structures such as the bull's eye or the fingers of God are not visible in a 1D projection if location coordinates are considered instead of colors. For understanding such multidimensional data, as indicated above, this [MADS] entry, data depth, might enhance the efficiency of data pipelining and of computing summary statistics. It is relevant to handling multivariate outliers and to storing/discarding unnecessary/unscientific data. Data points on the surface of a data cloud, or tiny parasite-like data clumps accompanying the important cloud, are neither visible nor identifiable when the whole data set is projected onto one or two dimensions. We can define a train of data summaries from data depths matching the traditional mean, standard deviation, skewness, and kurtosis measures. If a datum or a tiny data cloud does not belong to the main set, we can discard those outliers with statistical justification and separate them from the main archive/catalog. Let me remind you that a star of modest color does not necessarily belong to the main sequence; from a 1D histogram projected onto a color axis, you would not know whether the star is a pre-main-sequence star, a giant, or a white dwarf. This awareness of seeing a data set in multidimensional space, and of adopting suitable statistics for multivariate data analysis, is critical when a query is executed on massive and streaming data. Otherwise, one gets biased data due to coordinate-wise trimming. Also, efficient depth-based algorithms could save CPU time, storage, and the human labor of validation.

Data trimming often happens based on univariate projections and N-sigma rules, and such trimming tends to remove precious data points. I want a strategy that removes the parasites, not my long tail. If the vertebral column is the principal axis, the tail and the long nose are farther out than the parasites swimming near the belly; coordinate-wise trimming with N-sigma rules will chop off the tail and the nose first, and only later may (or may not) remove the parasites. Without verification by human eyes, virtual observers (people using virtual observatories for their scientific research) are likely to receive data with parts of the head and tail missing while the parasites remain. This analogy applies to field stars vs. stellar clusters, dwarf galaxies vs. their host, binaries vs. single stars, population II vs. population I stars, etc.
Outliers are not necessarily bad guys; important discoveries have been made by looking into outliers. My point is that outliers should be identified with respect to the multidimensional data distribution, instead of by projecting the data onto the axis of the vertebral column.
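The parasite analogy can be sketched numerically. This is a toy example of mine, using the Mahalanobis distance as a crude stand-in for a proper data depth: a small off-axis clump survives coordinate-wise 3-sigma clipping, while a depth-style ranking flags it.

```python
import numpy as np

rng = np.random.default_rng(1)
# an elongated "vertebrate" cloud plus a small "parasite" clump near its belly
cov = [[4.0, 3.9], [3.9, 4.0]]
body = rng.multivariate_normal([0.0, 0.0], cov, size=500)
parasite = rng.multivariate_normal([3.0, -3.0], [[0.01, 0.0], [0.0, 0.01]], size=10)
data = np.vstack([body, parasite])

# coordinate-wise 3-sigma clipping: chops the genuine tail, keeps the parasite
mean = data.mean(axis=0)
keep_coord = np.all(np.abs(data - mean) < 3.0 * data.std(axis=0), axis=1)

# Mahalanobis-distance ranking: trim the 3% of points with the largest
# distance (i.e. the lowest depth), which includes the off-axis clump
s_inv = np.linalg.inv(np.cov(data, rowvar=False))
diff = data - mean
md2 = np.einsum('ij,jk,ik->i', diff, s_inv, diff)
keep_depth = md2 < np.quantile(md2, 0.97)

print("parasites kept by coordinate-wise clipping:", keep_coord[500:].sum())
print("parasites kept by depth-style trimming:", keep_depth[500:].sum())
```

Each parasite coordinate lies within 1.5 sigma of the marginal distributions, so the N-sigma rule never sees it; in the full 2D geometry it sits many sigma off the principal axis.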

To learn more about data depth, I recommend these papers, long but crispy and delicious to read.

  • Multivariate analysis by data depth: Descriptive statistics, graphics and inference (with discussion)
    R. Liu, J. Parelius, and K. Singh (1999) Ann. Stat. vol. 27, pp. 783-858
    (It’s a comprehensive survey paper on data depth)
  • Mathematics and the Picturing of Data
    Tukey, J.W. (1974) Proceedings of the International Congress of Mathematicians, 2, 523-531
  • The Ordering of Multivariate Data
    Barnett, V. (1976) J. R. Stat. Soc. A, 139(3), 318-355
    (It answered, more than three decades ago, how to order multivariate data.)
  • You might want to check publications by R. Serfling (scroll down a bit).

Just as we have various distance metrics, there are various statistical depths. I already mentioned one in my [MADS] Mahalanobis distance post. Here is an incomplete list of data depths that I have heard of:

  • Mahalanobis depth
  • simplicial depth
  • half-space depth (Tukey depth)
  • Oja depth
  • convex hull peeling depth
  • projection depth
  • majority depth

I do not know the definitions of all these depths nor how to compute each of them (General notions of statistical depth function by Zuo and Serfling (2000, Annals of Statistics, 28, pp. 461-482) is very useful for a theoretical understanding of statistical data depth). What I do know is that, except for the Mahalanobis depth, the computation of these depths is relatively simple. The reason for excluding the Mahalanobis depth is that it requires estimating a large covariance matrix and inverting it. Ordering multivariate data with these depth measures, including assigning medians and quantiles, is statistically more robust than coordinate-wise means or additive errors. Also, removing outliers based on statistical depths makes more statistical sense while being less burdensome for computers and human laborers. Speaking of contouring: having assigned a numeric rank to each point, collecting the multivariate data points at a given rank level leads to a quick contour.
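For concreteness, here is a brute-force sketch of mine (not from the papers above) of the simplicial depth in 2D: the depth of a point is the fraction of triangles, formed by triples of data points, that contain it; thresholding the depths gives the quick contour mentioned above.

```python
import numpy as np
from itertools import combinations

def simplicial_depth(point, data):
    """Empirical simplicial depth in 2D: the fraction of triangles formed
    by triples of data points that contain `point` (brute force, O(n^3))."""
    def side(a, b, p):
        # sign of the cross product (b - a) x (p - a)
        return np.sign((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]))
    triples = list(combinations(range(len(data)), 3))
    count = 0
    for i, j, k in triples:
        s1 = side(data[i], data[j], point)
        s2 = side(data[j], data[k], point)
        s3 = side(data[k], data[i], point)
        # inside (or on an edge) iff no two nonzero signs disagree
        if (s1 >= 0 and s2 >= 0 and s3 >= 0) or (s1 <= 0 and s2 <= 0 and s3 <= 0):
            count += 1
    return count / len(triples)

rng = np.random.default_rng(2)
data = rng.multivariate_normal([1.0, 1.0], [[1.0, -0.9], [-0.9, 1.0]], size=30)
depths = np.array([simplicial_depth(p, data) for p in data])

# the deepest point serves as a multivariate median; collecting all points
# above a depth threshold sketches a quick contour of the central region
print("depth-wise median:", data[np.argmax(depths)])
central = data[depths >= np.median(depths)]
```

The O(n^3) cost per point is why the literature develops faster algorithms for these combinatorial depths.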

The story turned out longer than I expected, but the single take-home message is:

Q:How would you assign orders to multivariate data?
A:Statisticians have discussed the notion of data depth to answer the question.

———————————————————————
NOTE: The usages of "depth" in statistics and astronomy are different. In astronomy, depth is associated with the distance a telescope can reach; in fact, it's a euphemism. To be more precise, the depth of a telescope is roughly defined by how dim a source can be before the telescope can no longer catch its photons. Deep sky surveys imply that the telescopes in the survey projects can see deeper, i.e. dimmer objects, than previous ones could. In statistics, depth is more abstract and conceptual, resting on its definition: data depth is a multivariate version of the quantile for univariate data, by which one can easily order data points from smallest to largest.

Robust Statistics
http://hea-www.harvard.edu/AstroStat/slog/2009/robust-statistics/
Mon, 18 May 2009 17:18:09 +0000 hlee

My understanding of "robustness" from my education in statistics and the understanding I gather from communicating with astronomers are hard to reconcile. Can anyone help me build a robust bridge over this abyss?

First, since it’s short, let’s quote a comment from an astronomer that might reflect the notion of robust statistics in astronomy.

Bayesian is robust.

Is every Bayesian method robust while its counterpart from classical statistics is not? I know that the popular statistics in astronomy are, generally speaking, not robust, and that those popular statistics were born before the notion of robustness was recognized and discussed in statistics.

I do understand why such a partisan comment was produced. Astronomers always report their data analysis results as best fits, error bars, probabilities, or significance levels (they don't explicitly say p-values, power, type I or type II errors, unbiased estimators, and other statistical jargon of inference problems), and the classical methods in frequent use have caused frustration through their lack of robustness. By contrast, MCMC algorithms for estimating posterior distributions produce easily interpretable results to report: a best fit (the mode) and an error bar (the HPD interval).

My understanding of robustness as a statistician does not draw a line between Bayesians and frequentists. The following is quoted from the website of the Katholieke Universiteit Leuven, whose mathematics department has a focus group on robust statistics.

Robust statistical methods and applications.
The goal of robust statistics is to develop data analytical methods which are resistant to outlying observations in the data, and hence which are also able to detect these outliers. Pioneering work in this area has been done by Huber (1981), Hampel et al. (1986) and Rousseeuw and Leroy (1987). In their work, estimators for location, scale, scatter and regression play a central role. They assume that the majority of the data follow a parametric model, whereas a minority (the contamination) can take arbitrary values. This approach leads to the concept of the influence function of an estimator which measures the influence of a small amount of contamination in one point. Other measures of robustness are the finite-sample and the asymptotic breakdown value of an estimator. They tell what the smallest amount of contamination is which can carry the estimates beyond all bounds.

Nowadays, robust estimators are being developed for many statistical models. Our research group is very active in investigating estimators of covariance and regression for high-dimensional data, with applications in chemometrics and bio-informatics. Recently, robust estimators have been developed for PCA (principal component analysis), PCR (principal component regression), PLS (partial least squares), classification, ICA (independent component analysis) and multi-way analysis. Also robust measures of skewness and tail weight have been introduced. We study robustness of kernel methods, and regression quantiles for censored data.

My understanding of "robustness" from my statistics education is all-encompassing: it covers both Bayesian and frequentist approaches. Any method or model that is insensitive or immune to outliers is a robust method or statistic. For example, the median is more robust than the mean, since the (asymptotic) breakdown point of the median is 1/2 and that of the mean is 0. Both the mean and the median are estimable by Bayesian and by frequentist methods. Instead of the standard deviation, tactics such as the lower and upper quartiles for error bars, or Winsorization (or trimming) to obtain a standard deviation for the error bar, are adopted regardless of whether one is Bayesian or frequentist. Instead of chi-square goodness-of-fit tests, which assume Gaussian residuals, nonparametric or distribution-free tests can be utilized.
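The breakdown-point contrast and the Winsorization tactic can be seen in a few lines of numpy (my sketch; the 10th/90th-percentile clipping level is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
clean = rng.normal(0.0, 1.0, size=100)
contaminated = clean.copy()
contaminated[:5] = 1000.0            # 5% gross outliers

# breakdown in action: the mean is dragged to ~50, the median barely moves
print("mean:  ", contaminated.mean())
print("median:", np.median(contaminated))

# a Winsorized standard deviation: clip at the 10th/90th percentiles first
lo, hi = np.percentile(contaminated, [10, 90])
win_std = np.clip(contaminated, lo, hi).std()
print("std:", contaminated.std(), " winsorized std:", win_std)
```

Nothing here is Bayesian or frequentist per se; the robustness lives in the estimator, not in the school.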

The notion that frequentist methods are not robust might have developed from the frustration that the chi-square-related methods in astronomy do not show robust behavior; the reason is that data are prone to breaking the normality (Gaussianity) assumption. Also, the limited library of nonparametric methods in data analysis packages and software reinforces the impression that frequentist methods are not robust. An additional unattractive aspect of frequentist methods is that their descriptions seem too mathematical, too abstract, and too difficult to codify, full of unfriendly jargon, whereas Bayesian methods provide step-by-step modeling procedures with explanations of why a given likelihood and given priors were chosen, based on external knowledge from physics and observations (the MCMC algorithms in astronomical papers are adaptations of already proven algorithms from statistics and algorithms journals).

John Tukey said:

Robustness refers to the property of a procedure remaining effective even in the absence of usual assumptions such as normality and no incorrect data values. In simplest terms the idea is to improve upon the use of the simple arithmetic average in estimating the center of a distribution. As a simple case one can ask: Is it ever better to use the sample median than the sample mean, and if so, when?

I don't think the set of frequentist methods is the complement of the set of Bayesian ones. Personally, I feel quite embarrassed whenever I am told that frequentist methods are not robust compared to Bayesian methods. Bayesian methods become robust when a priori knowledge (subjective priors) allows the results to be unaffected by outliers under a general class of likelihoods. Regardless of being frequentist or Bayesian, statistics has developed toward being less sensitive to outliers while doing optimal inference, i.e. toward the goal of robustness.

Ah… there are other accounts of robust methods/statistics in astronomy beyond "Bayesian is robust." Since I am often confused when I see "statistically rigorous" in the literature while the method is a simple chi-square minimization, I believe astronomers can educate me about the notion of robustness and the definition of robust statistics in various ways. I hope for some sequels from astronomers about robustness and robust statistics to this post of limited view.

Bipartisanship
http://hea-www.harvard.edu/AstroStat/slog/2008/bipartisanship/
Wed, 10 Dec 2008 17:41:58 +0000 hlee

We have seen the word "bipartisan" often during the election and during the ongoing recession. Sometimes I think that bipartisanship is driven not by politicians but by the media, commentators, and interpreters.

me: Why Bayesian methods?
astronomers: Because Bayesian is robust. Because frequentist method is not robust.

I intentionally kept the conversation short. Obviously, I didn't ask all astronomers the same question, so this conversation does not reflect the opinion of all astronomers. Nevertheless, it summarizes what I felt at the CfA.

I was educated in the frequentist school, which I hadn't realized before I came to the CfA. There were a few Bayesian professors; I took two of their courses, but they had nothing to do with this bipartisanship (the contents were just foundations of statistics). Still, I found getting ideas and learning brilliant algorithms from Bayesians just as joyful as learning mature statistical theories from frequentists.

How did astronomers come to the idea that Bayesian statistics is robust and frequentist statistics is not? Do they think that the celebrated Gaussian distribution and the almighty chi-square methods compose the whole frequentist world? (Note that F-tests, LRTs, K-S tests, and PCA take only a small fraction of astronomers' statistics compared to chi-square methods, according to astronomical publications, let alone Bayesian methods; no statistic can compete with the chi-square methods in astronomy.) Is this why they think frequentist methods are not robust? The longer a history is, the more flaws one finds, so no one expects the chi-square machinery to be a universal panacea; applying it everywhere looks like adding ever more epicycles. History shows that finding shortcomings makes us move forward, evolve, invent, and change paradigms, instead of simply saying that chi-square (frequentist) methods are not robust. I don't recall spending much class time on chi-square machinery, and there are plenty of robust statistics that frequentists have developed; textbooks with "robust statistics" in their titles are most likely written by frequentists. Did astronomers check textbooks and journals before saying frequentist methods are not robust? I'm curious how this bipartisanship, in which one party is favored and the other is despised yet blindly utilized in data analysis, has developed. (Probably I should feel relieved that there is no statistics dictatorship in the astronomical society, and exuberant about the efforts of a small number of scientists to balance the two parties.)

Although I think more in a frequentist way, I don't object to Bayesianism; it's no different from learning mother tongues and cultures. Often I feel excited by how Bayesians get over troubles that frequentists couldn't. If I may exaggerate, finding what frequentists achieved but Bayesians haven't yet, or the other way around, is similar to how changing the paradigm from the geocentric universe to the heliocentric one explained the motions of the planets with simplicity, instead of adding ever more epicycles and complicating the description of the motions. I cherish results from both statistical cultures equally. Satisfying simplicity and the fundamental laws, including probability theory, is what matters most in pursuing proper applications of statistics, not bipartisanship.

My next post will be about "Robust Statistics," to rectify the notion of robustness that I acquired at the CfA. I'd like to hear your thoughts, astronomer and statistician alike, on robustness as associated with your statistical culture of choice. I can only write about robustness based on what I have read and been taught, which may also be biased; perhaps other statisticians advocate the astronomers' notion that Bayesian is robust and frequentist is not. My limited communication with statisticians makes it difficult to gauge the general consensus, and equally I don't know every astronomer's thoughts on robustness. Nonetheless, I felt that the notion of robustness differs between statisticians and astronomers, and this could generate some discussion.

Overall, I may sound like Joe Lieberman. But remember that tossing him back and forth from one party to the other was done explicitly by the media. People can be opinionated, but I'm sure he pursued his best interests regardless of party.

[ArXiv] 5th week, Apr. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-5th-week-apr-2008/
Mon, 05 May 2008 07:08:42 +0000 hlee

Ever since I first learned Hubble's tuning fork[1], I have wanted to classify galaxies based on their features (colors and spectra) rather than by labor-intensive human-eye classification (semi-supervised learning seems most suitable). Ironically, at that time I knew neither that there is a field of computer science called machine learning nor a branch of statistics that does such studies. Upon switching to statistics, with the hope of understanding the statistical packages implemented in IRAF and IDL and of learning better the contents of Numerical Recipes and Bevington's book, ignorance was not the enemy; the accessibility of data was.

I'm glad to see that this week presented a paper I had dreamed of many years ago, in addition to other interesting papers. Nowadays I realize more and more that astronomical machine learning is not as simple as what we see in the machine learning and statistical computation literature, which typically adopts data sets from repositories whose characteristics have been well known for many years (for example, the famous iris data; there are toy data sets and mock catalogs, no shortage of data sets with well-understood public characteristics). As the long list of authors indicates, machine learning on massive astronomical data sets was never meant to be a little girl's dream. With a bit of my sentiment, I offer this week's list:

  • [astro-ph:0804.4068] S. Pires et al.
    FASTLens (FAst STatistics for weak Lensing) : Fast method for Weak Lensing Statistics and map making
  • [astro-ph:0804.4142] M. Kowalski et al.
    Improved Cosmological Constraints from New, Old and Combined Supernova Datasets
  • [astro-ph:0804.4219] M. Bazarghan and R. Gupta
    Automated Classification of Sloan Digital Sky Survey (SDSS) Stellar Spectra using Artificial Neural Networks
  • [gr-qc:0804.4144] E. L. Robinson, J. D. Romano, A. Vecchio
    Search for a stochastic gravitational-wave signal in the second round of the Mock LISA Data challenges
  • [astro-ph:0804.4483] C. Lintott et al.
    Galaxy Zoo : Morphologies derived from visual inspection of galaxies from the Sloan Digital Sky Survey
  • [astro-ph:0804.4692] M. J. Martinez Gonzalez et al.
    PCA detection and denoising of Zeeman signatures in stellar polarised spectra
  • [astro-ph:0805.0101] J. Ireland et al.
    Multiresolution analysis of active region magnetic structure and its correlation with the Mt. Wilson classification and flaring activity

A related slog post on machine learning for galaxy morphology can be found at svm and galaxy morphological classification

<Added: 3rd week May 2008> [astro-ph:0805.2612] S. P. Bamford et al.
Galaxy Zoo: the independence of morphology and colour

  1. Wikipedia link: Hubble sequence
[ArXiv] An unbiased estimator, May 29, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-unbiased-estimator/
Tue, 30 Oct 2007 07:37:07 +0000 hlee

From arxiv/astro-ph:0705.4199v1
In search of an unbiased temperature estimator for statistically poor X-ray spectra
A. Leccardi and S. Molendi

There was a delay in writing about this paper, which had accidentally been lying under a pile of papers irrelevant to astrostatistics. (It has been quite overwhelming to track the arxiv:astro-ph papers with various statistical applications, and those with room left for statistical improvement.) Although there is already a post about this paper (see Vinay's posting), I'd like to give it a shot. I was very excited because I hadn't seen any astronomical paper devoted solely to unbiased estimators.

In the same vein in which the authors discussed bias in the χ^2 method and in the maximum likelihood estimator, we know that the χ^2 function is not always symmetric enough for applying Δχ^2 = 1 to obtain a 68% confidence interval. The nominal-level interval from the Δχ^2 method does not provide the nominal coverage when the model being fitted does not satisfy the (regularity) conditions for the χ^2 approximation. The χ^2 best fit does not always converge (in probability or almost surely) to the true parameter, i.e. it is biased, so the stated coverage level misrepresents our information about the true parameter. The authors illustrate the existence of bias in traditional estimators in high energy astronomy, and then propose unbiased (temperature) estimators via (variable) transformation.

Transformation is one way of reducing bias (e.g., the Box-Cox or power transformation is common practice in introductory statistics to make residuals more homogeneous); a transformation can carry an asymmetric distribution toward (asymptotic) symmetry. Contrary to the authors' comment that the parametric bootstrap achieved no improvement in bias reduction, bias in likelihood-based estimates (Cash statistics) can be reduced by statistical subsampling methods such as cross-validation, the jackknife, and the bootstrap, upon careful design of the subsampling scheme (instead of a parametric bootstrap, a nonparametric bootstrap could yield a different conclusion). Penalized likelihood also helps: an L_1-norm penalty, instead of the L_2 norm (the χ^2 measure is L_2), can reduce bias as well.
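As a concrete illustration of subsampling-based bias reduction (my sketch, on a textbook example rather than on X-ray spectra): the jackknife correction applied to the plug-in variance estimator removes its O(1/n) bias exactly, recovering the unbiased sample variance.

```python
import numpy as np

def jackknife_bias_correct(estimator, x):
    """Jackknife bias correction:
    theta_jack = n * theta_hat - (n - 1) * mean(leave-one-out estimates)."""
    n = len(x)
    theta_hat = estimator(x)
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    return n * theta_hat - (n - 1) * loo.mean()

rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=20)

# the MLE (plug-in) variance is biased by -sigma^2/n
biased_var = lambda v: np.mean((v - v.mean()) ** 2)
corrected = jackknife_bias_correct(biased_var, x)

# for this estimator the jackknife exactly reproduces the unbiased sample variance
print(biased_var(x), corrected, x.var(ddof=1))
```

For estimators without such a closed-form bias, the same recipe still removes the leading 1/n bias term.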

One of the useful discussions about unbiased estimators is the comparison between the χ^2 best-fit method and Cash statistics (the maximum Poisson likelihood estimator). Overall, Cash statistics outperforms the χ^2 best-fit method, but neither overcomes the bias arising from low counts, small exposure time, background level, and an asymmetric pdf (probability density function) in T (temperature), their parameter of interest. Their final step toward an unbiased estimator was a nonparametric approach: constructing a mixture of three pdfs to estimate the uncertainty, and they concluded that the results from the mixing distributions were excellent. Such mixing reduces error by averaging. Personally, their claim of "the only method that returns the expected temperature under very different situations" seems overstated; either designing more efficient mixing distributions (unequally weighted triplets rather than equal weights) or defining M-estimators upon understanding the three EPIC instruments could achieve a better degree of unbiasedness.

Note that the maximum likelihood estimator (MLE) is consistent (asymptotically unbiased) under milder regularity conditions than the χ^2 best-fit estimator. Instead of stating that the MLE can be biased, it would have been better to discuss whether the regularity conditions suit source models built on Poisson photon counts for estimating temperatures, and the XSPEC estimation procedures.
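A toy illustration (mine, not the authors' EPIC analysis) of why this matters at low counts: for a constant Poisson rate, the MLE is the sample mean and is unbiased, whereas a Neyman-weighted chi-square fit, with per-bin variances estimated by max(N_i, 1), is biased low.

```python
import numpy as np

rng = np.random.default_rng(5)
mu_true = 2.0                               # a low-count Poisson rate
counts = rng.poisson(mu_true, size=20000)

# maximum likelihood estimator for a constant Poisson rate: the sample mean
mle = counts.mean()

# Neyman chi-square fit: minimize sum (N_i - mu)^2 / max(N_i, 1);
# the closed-form minimizer is a weighted mean with weights 1/max(N_i, 1)
w = 1.0 / np.maximum(counts, 1)
chi2_fit = np.sum(w * counts) / np.sum(w)

print("MLE:", mle, " chi-square fit:", chi2_fit)   # MLE ~ 2.0, chi-square biased low
```

The chi-square fit downweights high-count bins and never recovers, no matter how many bins are added, which is exactly the kind of bias the paper is concerned with.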

Last, I’d like to quote their question as it is:

What are the effects of pure statistical uncertainties in determining interesting parameters of highly non linear models (e.g. the temperature of the ICM), when we analyze spectra accumulated from low surface brightness regions using current X-ray experiments?

Although the authors tried to answer this question, my personal opinion is that they could not fully answer it, leaving ample room for estimating statistical uncertainty and bias more rigorously in high energy astrophysics with more statistical care (e.g., instead of the MLE or Cash statistics, one could develop a more robust yet unbiased M-estimator).
