Scatter plots and ANCOVA
Thu, 15 Oct 2009, by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/scatter-plots-and-ancova/

Astronomers rely on scatter plots to illustrate correlations and trends among pairs of variables more than scientists in any other field[1]. Pages of scatter plots with regression lines are common, where the slope of the regression line and the error bars serve as indicators of the degree of correlation. Sometimes, seeing so many such scatter plots makes me think that, overall, the resources spent on drawing nice scatter plots, and the journal pages where those plots are printed, are wasted. Why not just compute correlation coefficients and their errors, and publish the processed data used for computing the correlations (not the full data) so that others can verify the results for the sake of validation? A couple of scatter plots are fine, but when I see dozens of them, I lose my focus. This is another cultural difference.

When many pairs of variables demand numerous scatter plots, one possibility is using parallel coordinates together with a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then the error bars of these coefficients can also be reported in matrix form. If one considers more complex relationships among multiple tiers of data sets, then one might want to check ANCOVA (ANalysis of COVAriance) to find out how statisticians structure observations and their uncertainties into a model to extract useful information.
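As an illustration only, here is a minimal numpy sketch of a correlation-coefficient matrix with approximate error bars; the data are synthetic stand-ins, and the intervals use the Fisher z transformation, which assumes bivariate normality:

```python
import numpy as np

# synthetic stand-in data: 50 observations of 5 variables
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

R = np.corrcoef(X, rowvar=False)      # matrix of correlation coefficients
n = X.shape[0]
se_z = 1.0 / np.sqrt(n - 3)           # standard error on the Fisher z scale

# approximate 95% intervals; the diagonal (r = 1) is clipped
# to avoid infinities in arctanh
Rc = np.clip(R, -0.999999, 0.999999)
lower = np.tanh(np.arctanh(Rc) - 1.96 * se_z)
upper = np.tanh(np.arctanh(Rc) + 1.96 * se_z)
print(np.round(R, 2))
```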

I’m not saying that those simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling for astronomical data. Most likely not; astrophysics generally handles complicated nonlinear models of physics. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors (to name some jargon of statistical modeling), and defining their relationships in a comprehensive way as done in ANCOVA, instead of pairing variables for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at zero, look as if they form a bar, not a circle, when viewed edge-on (horizontally), producing a perfect correlation.
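A toy sketch of this projection effect, under the assumption of a circle lying in a tilted plane and viewed edge-on along the y-axis:

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
# a circle of radius 1 lying in the tilted plane z = x, centered at zero
x, y, z = np.cos(t), np.sin(t), np.cos(t)

# viewed edge-on along the y-axis, the circle collapses onto the line
# z = x: the points look like a bar and are perfectly correlated
print(np.corrcoef(x, z)[0, 1])  # ≈ 1.0
```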

As a matter of fact, astronomers are aware of these unnecessary correlations and remove them via corrections, for example, by fitting a straight line or a 2nd order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because of the changes in the uncertainty structure. Consider what happens when subtracting background leads to negative values, an unrealistic consequence. Unless probabilistically guaranteed, linear operations require lots of care: residuals y-E[Y|X=x] are guaranteed to be normal only if the μ and σ of the Gaussian density function can be operated on linearly (about the Gaussian distribution, please see the post why Gaussianity? and the reference therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw through applications of principal component analysis (PCA). PCA is used for whitening and for approximating nonlinear functional data (curves and images). Taking the sources of uncertainty and their hierarchical structure into account properly is not an easy problem, both astronomically and statistically. Nevertheless, identifying the properties of the observed data both from physics and from statistics, and putting them into a comprehensive and structured model, could help to find causality[2] and the significance of correlation better than throwing out numerous scatter plots with lines from simple regression analysis.
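Since PCA whitening is mentioned above, here is a minimal numpy sketch of it; the function name and the test data are my own inventions:

```python
import numpy as np

def pca_whiten(X, eps=1e-10):
    """Center X, rotate it onto its principal axes, and rescale each
    axis to unit variance, so the whitened columns are uncorrelated."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return (Xc @ eigvecs) / np.sqrt(eigvals + eps)

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0.0, 0.0], [[2.0, 1.5], [1.5, 2.0]], size=500)
W = pca_whiten(X)
print(np.cov(W, rowvar=False).round(3))  # close to the identity matrix
```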

In order to understand why statisticians studied ANCOVA, or ANOVA (ANalysis Of VAriance) in general, beyond the material in wiki:ANCOVA, you might want to check this page[3] and utilize your search engine with keywords of interest on top of ANCOVA to narrow down the results.

From the linear model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, in addition to other factors in the model, can be tested later when one fully fits and analyzes the statistical model. Suppose one wants to model, say, rotation speed (an indicator of dark matter occupation) as a function of redshift, the degree of spirality, and the number of companions; this is a very hypothetical proposal, due to my lack of knowledge in observational cosmology, and I only want to point out that the model fitting problem can be seen from statistical modeling like ANCOVA by identifying covariates and relationships. Because the covariate z is continuous, the degree of spirality is a fixed effect (0 to 7, eight levels), and the number of companions is a random effect (cluster size is random), the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression marginalize over all additional contributing factors and information, which can be the main contributors to correlations, even though Y and X may seem highly correlated in the scatter plot. At some point, we must marginalize over unknowns. Nonetheless, we still have nuisance parameters and latent variables that can be factored into the model, which is different from ignoring them, in order to obtain deeper insights and knowledge from observations in many measures and dimensions.
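As a hedged sketch only, the hypothetical rotation-speed model above might be written with statsmodels' mixed-model formula interface; all column names, effect sizes, and the synthetic data are my inventions, not anything from the original post:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 240
df = pd.DataFrame({
    "z": rng.uniform(0.0, 2.0, n),        # continuous covariate (redshift)
    "spirality": rng.integers(0, 8, n),   # fixed effect, eight levels
    "companions": rng.integers(1, 6, n),  # random-effect grouping
})
df["speed"] = 200 + 30 * df["z"] + 5 * df["spirality"] + rng.normal(0, 10, n)

# rotation speed ~ redshift + spirality (fixed effects), with a random
# intercept for each companion-count group
model = smf.mixedlm("speed ~ z + C(spirality)", data=df, groups=df["companions"])
print(model.fit().summary())
```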

What could, I think, be done with a small, ergonomic chart or table via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods is instead done in astronomy with numerous scatter plots, with colors, symbols, and lines to account for all possible pairwise relationships, whose correlations can be artificial. I also feel that trees, electricity, and effort could be saved from producing nice-looking scatter plots. Fitting and analyzing more comprehensive models, put into a statistical fashion, helps to identify the independent variables or covariates causing strong correlation, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can then be used for comparing models, testing the significance of their variables, and computing error bars appropriately, rather than in the way the null hypothesis probability is usually interpreted.
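To make the model-comparison point concrete, here is a small sketch with synthetic data (the variables and effect sizes are invented) comparing a full and a reduced linear model via AIC:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic data in which x2 is a redundant predictor
rng = np.random.default_rng(3)
df = pd.DataFrame(rng.normal(size=(200, 2)), columns=["x1", "x2"])
df["y"] = 2.0 * df["x1"] + rng.normal(0.0, 1.0, 200)

full = smf.ols("y ~ x1 + x2", data=df).fit()
reduced = smf.ols("y ~ x1", data=df).fit()

# the model with the lower AIC is preferred; p-values on individual
# coefficients indicate which predictors can be dropped
print(f"full AIC={full.aic:.1f}  reduced AIC={reduced.aic:.1f}")
print(full.pvalues)
```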

Lastly, ANCOVA is a complete [MADS]. :)

  1. This is not an absolute statement but a personal impression after reading articles from various fields in addition to astronomy. My reading of other fields tells me that many rely on correlation statistics and less on scatter plots with straight lines drawn through the data for the purpose of imposing relationships between variable pairs.
  2. The way that chi-square fitting is done and the goodness-of-fit test is carried out is understood through the notion that X causes Y and through the practice of minimizing the objective function, the sum of (Y-E[Y|X])^2/σ^2.
  3. It is the website of Vassar College, which had the first female faculty member in astronomy, Maria Mitchell. It is said that the first building constructed there was the Vassar College Observatory, now a national historic landmark. This historical factor is the only reason for pointing to this website, to draw some astronomers' attention beyond statistics.
[MADS] Parallel Coordinates
Wed, 29 Jul 2009, by hlee
http://hea-www.harvard.edu/AstroStat/slog/2009/mads-parallel-coordinate/

Speaking of XAtlas from my previous post, I tried another visualization tool called parallel coordinates on these Capella observations and on two stars with multiple observations (AR Lac and IM Peg). As discussed in [MADS] Chernoff face, a full description of the catalog is found on the XAtlas website. The reason for choosing these stars is that, among low mass stars, next to Capella (16 observations, shown previously), IM Peg (HD 216489, 8 times) and AR Lac (6 times, although at different phases) are the most frequently observed. I was curious about which variation, within (statistical variation) and between (Capella, IM Peg, AR Lac), is dominant, and how they would look in the parameter space of High Resolution Grating Spectroscopy from Chandra.

With 13 X-ray line and/or continuum ratios in hand, a typical data display would be the 13 choose 2 = 78 combinations of scatter plots, as follows. Note that the upper left panels, with three colors, are drawn for the classification purpose (red: AR Lac, blue: IM Peg, green: Capella), while the lower right ones are uncolored for the clustering analysis purpose. Such scatter plots are essential to exploratory data analysis, but with this many panels they do not convey information efficiently. In astronomical journals, thanks to astronomers' a priori knowledge, fewer pairs of important variables are selected and displayed, reducing the visualization complexity dramatically. Unfortunately, I cannot select only the physically important variables.

[Figure: pairwise scatter-plot matrix of the 13 ratios (pairs)]
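A scatter-plot matrix like the one above can be sketched with pandas; the data here are an invented stand-in for the 13 line/continuum ratios, not the XAtlas catalog itself:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# invented stand-in for the 13 line/continuum ratios (30 observations)
rng = np.random.default_rng(4)
ratios = pd.DataFrame(rng.normal(size=(30, 13)),
                      columns=[f"r{i}" for i in range(1, 14)])

# 13 x 13 panel of pairwise scatter plots, histograms on the diagonal
scatter_matrix(ratios, figsize=(12, 12), diagonal="hist")
plt.show()
```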

I am not a well-versed astronomer, but I believe in reducing dimensionality according to the research objective. The goal is set by asking questions like "what do you want from this multivariate data set?": classification (a classification rule or regression model that separates the three stars, Capella, AR Lac, and IM Peg), clustering (are the three stars naturally clustered into three groups, or are there a different number of clusters not well visible in the scatter plots above?), hypothesis testing (are they the same type of star or different?), point estimation and its confidence interval (means and their error bars), and variable selection (or dimension reduction). So far, no statistical question is well defined (which can be a good thing for new discoveries). Prior to any confirmatory data analysis, we had better find a way to display this multidimensional data set efficiently. I thought parallel coordinates would serve the purpose well, but surprisingly, they have never been discussed in the astronomical literature; at least, they did not appear in ADS.

[Figures: parallel coordinate plots of the 13 ratios, normalized (pc_n, left) and standardized (pc_s, right)]

Each of the 13 variables was either normalized (left) or standardized (right). The parallel coordinate plot looks both simpler and more informative. Capella observations occupy a space relatively separable from the other stars. It is easy to see that one Capella observation is an obvious outlier with respect to the rest, which is hardly visible in the scatter plots. It is also clear that discriminant analysis or classical support vector machine type classification methods cannot separate AR Lac and IM Peg, and that clustering based on distance measures of dissimilarity cannot reveal a natural grouping of these two stars, whereas the Capella observations form their own cluster. In my opinion, parallel coordinates provide more information about multidimensional data (dim > 3), in a simpler way, than scatter plots of multivariate data. They naturally show highly correlated variables, both within the same star's observations and across all target stars. This insight from visualization is a key to devising methods of variable selection and of reducing the dimensionality of the data set.
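For reference, both scalings and the parallel coordinate display itself can be sketched with pandas; the data and star labels below are invented stand-ins, not the actual Chandra measurements:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import parallel_coordinates

# invented stand-in for the 13 ratios, labeled by star
rng = np.random.default_rng(5)
df = pd.DataFrame(rng.normal(size=(30, 13)),
                  columns=[f"r{i}" for i in range(1, 14)])
df["star"] = rng.choice(["Capella", "AR Lac", "IM Peg"], size=30)

vals = df.drop(columns="star")
normalized = (vals - vals.min()) / (vals.max() - vals.min())  # map to [0, 1]
standardized = (vals - vals.mean()) / vals.std()              # zero mean, unit sd

for scaled, title in ((normalized, "normalized"), (standardized, "standardized")):
    parallel_coordinates(scaled.assign(star=df["star"]), class_column="star")
    plt.title(title)
    plt.show()
```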

My personal opinion is that, for lack of an efficient and informative tool for visualizing complex high resolution spectra across many detailed metrics, smoothed bivariate (trivariate at most) summaries such as hardness ratios and quantiles are utilized instead in displaying X-ray spectral data. I'm not saying that parallel coordinates are the ultimate answer to visualizing multivariate data, but I'd like to emphasize that this method is more informative, more intuitive, and simpler for understanding the structure of a relatively high dimensional data cloud.

Parallel coordinates have a long history; the earliest discussion I found was made in the 1880s. They were popularized by Alfred Inselberg and gained recognition among statisticians through Edward Wegman (1990, Hyperdimensional Data Analysis Using Parallel Coordinates). Colorful images of the Sun, stars, galaxies, and their coronae, interstellar gas, and jets are the eye catchers. I hope that data visualization tools gain an equal spotlight, since they summarize data and deliver lots of information. If images are well decorated cakes, then these tools from EDA are sophisticated and well baked cookies.

——————- [Added]
According to

Michael Friendly (2008), "The Golden Age of Statistical Graphics," Statistical Science, Vol. 23, No. 4, pp. 502-535 [arXiv:0906.3979],

it is 1885. Not knowing French (if I did, I'd want to read that paper immediately, before anything else), I don't know what the reference is about.
