The AstroStat Slog » visualization

[MADS] Parallel Coordinates

hlee — Wed, 29 Jul 2009 06:02:18 +0000

Speaking of XAtlas from my previous post, I tried another visualization tool, parallel coordinates, on the Capella observations and on two other stars with multiple observations (AR Lac and IM Peg). As discussed in [MADS] Chernoff face, a full description of the catalog can be found on the XAtlas website. I chose these stars because, among low mass stars, IM Peg (HD 21648, 8 times) and AR Lac (6 times, although at different phases) are the most frequently observed after Capella (I showed 16). I was curious about which variation is dominant: within a star (statistical variation) or between stars (Capella, IM Peg, AR Lac). What would they look like in the parameter space of high resolution grating spectroscopy from Chandra?

With 13 X-ray line and/or continuum ratios, a typical data display would be a matrix of scatter plots for all 13 choose 2 = 78 pairs, as follows. Note that the upper left panels, in three colors, are drawn for classification purposes (red: AR Lac, blue: IM Peg, green: Capella), while the lower right ones are drawn without color for clustering purposes. These scatter plots are essential to exploratory data analysis, but with this many panels they do not convey information efficiently. In astronomical journals, thanks to astronomers’ a priori knowledge, only a few pairs of important variables are selected and displayed, which reduces the visualization complexity dramatically. Unfortunately, I cannot pick out only the physically important variables.
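To make the panel count concrete, here is a small sketch of the arithmetic. The variable names are hypothetical placeholders, not the actual ratio names from the XAtlas catalog.

```python
from itertools import combinations
from math import comb

n_vars = 13  # number of line and/or continuum ratios mentioned in the post

# Hypothetical placeholder names; the real ratio names live in the XAtlas catalog.
names = [f"ratio_{i}" for i in range(1, n_vars + 1)]

# One scatter plot per unordered pair of variables.
pairs = list(combinations(names, 2))
assert len(pairs) == comb(n_vars, 2)

print("scatter plot panels:", len(pairs))
print("parallel coordinate axes:", n_vars)
```

The number of panels grows quadratically with the number of variables, whereas a parallel coordinate plot needs only one axis per variable.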

I am not a knowledgeable astronomer, but I believe in reducing dimensionality according to the research objective. The goal is set by asking a question like “what do you want from this multivariate data set?” Possible answers are classification (a classification rule or regression model that separates the three stars, Capella, AR Lac, and IM Peg), clustering (are the three stars naturally clustered into three groups, or is there a different number of clusters even if they are not clearly visible in the scatter plots above?), hypothesis testing (are they the same type of star or different?), point estimation and its confidence interval (means and their error bars), and variable selection (or dimension reduction). So far no statistical question is well defined (which can be a good thing for new discoveries). Prior to any confirmatory data analysis, we’d better find a way to display this multidimensional data efficiently. I thought parallel coordinates would serve the purpose well but, surprisingly, they have never been discussed in the astronomical literature; at least, they did not appear in ADS.

Each of the 13 variables was either normalized (left) or standardized (right). The parallel coordinate plot looks both simpler and more informative. The Capella observations occupy a more separable region than those of the other stars. It is easy to see that one Capella observation is an obvious outlier from the rest, which can hardly be seen in the scatter plots. It is clear that discriminant analysis or classical support vector machine type classification methods cannot separate AR Lac and IM Peg. Clustering based on distance measures of dissimilarity also cannot reveal a natural grouping of these two stars, whereas the Capella observations form their own cluster. In my opinion, parallel coordinates provide more information about multidimensional data (dim>3) in a simpler way than scatter plots of multivariate data. They naturally show highly correlated variables within observations of the same star or across all target stars. This insight from visualization is a key to devising methods of variable selection or dimension reduction for the data set.
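For readers who want to try this, below is a minimal sketch of the two scalings and of how each observation becomes one polyline across the axes. It assumes that "normalized" means min-max scaling onto [0, 1] and "standardized" means z-scores; the post does not spell out either definition. The function names and the toy numbers are made up for illustration and are not XAtlas values.

```python
from statistics import mean, stdev

def normalize(column):
    """Min-max scaling: map a column linearly onto [0, 1] (assumed meaning)."""
    lo, hi = min(column), max(column)
    return [(x - lo) / (hi - lo) for x in column]  # assumes hi > lo

def standardize(column):
    """Z-score scaling: subtract the mean, divide by the sample std. dev."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def polylines(rows, scale):
    """Scale every variable (column), then return one polyline per observation.

    Each polyline is a list of (axis_index, scaled_value) vertices, i.e. the
    points that are joined by line segments in a parallel coordinate plot.
    """
    columns = [scale(list(col)) for col in zip(*rows)]
    return [list(enumerate(obs)) for obs in zip(*columns)]

# Made-up toy data: 4 observations of 3 variables (illustrative only).
rows = [
    (1.0, 10.0, 0.2),
    (2.0, 30.0, 0.4),
    (3.0, 20.0, 0.1),
    (4.0, 40.0, 0.3),
]

for line in polylines(rows, normalize):
    print(line)
```

The vertex lists can be handed to any plotting library: draw one vertical axis per variable and connect each observation's vertices, colored by star. Highly correlated neighboring variables then show up as nearly parallel segments between their axes.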

My personal opinion is that, for lack of an efficient and informative tool for visualizing complex high resolution spectra in many detailed metrics, smoothed bivariate (trivariate at most) information such as hardness ratios and quantiles is used instead to display X-ray spectral data. I’m not saying that parallel coordinates are the ultimate answer to visualizing multivariate data, but I’d like to emphasize that this method is informative, intuitive, and simple for understanding the structure of a relatively high dimensional data cloud.

Parallel coordinates have a long history. The earliest discussion I found dates to the 1880s. The method was popularized by Alfred Inselberg and gained recognition among statisticians through Edward Wegman (1990, Hyperdimensional Data Analysis Using Parallel Coordinates). Colorful images of the Sun, stars, galaxies, and their coronae, interstellar gas, and jets are the eye catchers. I hope that data visualization tools gain an equal share of the spotlight, since they summarize data and deliver lots of information. If images are well decorated cakes, then these tools from EDA are sophisticated and well baked cookies.

——————- [Added]
According to

[arxiv:0906.3979] The Golden Age of Statistical Graphics
Michael Friendly (2008)
Statistical Science, Vol. 23, No. 4, pp. 502-535

it is 1885. Not knowing French (if I did, I’d like to read Gauss’ paper before anything else), I don’t know what the reference is about.

Astroart Survey

vlk — Sun, 02 Nov 2008 12:42:01 +0000

Astronomy is known for its pretty pictures, but as Joe the Astronomer would say, those pretty pictures don’t make themselves. A lot of thought goes into maximizing scientific content while conveying just the right information, all discernible at a single glance. So the hardworkin folks at Chandra want your help in figuring out what works and how well, and they have set up a survey at http://astroart.cfa.harvard.edu/. Take the survey, it is both interesting and challenging!

[Book] The Grammar of Graphics

hlee — Wed, 08 Oct 2008 23:55:37 +0000

All of a sudden, partially owing to a thought provoking talk about visualization by Felice Frankel at IIC, I recollected a book, The Grammar of Graphics by Leland Wilkinson (2nd ed.; I partially read the 1st ed. several years ago and found little of use in it, because there seemed to be no link to visualizing data from astronomy).

Both good and bad reviews exist, but I don’t believe there is another book that covers the grammar of graphics this extensively. Not many statisticians handle images, compared to computer vision engineers, but at some point all engineers and scientists must present their work in graphs and tables. By the same token, tongues are different although alphabets are common. Oftentimes, plots from scientist A cannot talk to scientist B (A ≠ B). This communication gap seems prevalent between astronomy and statistics.

Almost all chapters begin with the Greek or Latin origins of chapter names to reflect the common origins of lexicons in graphics regardless of subjects. Some chapters, on the contrary, tend to illuminate different practices/perspectives/interests in graphics between astronomers and statisticians:

Chap. 6 [Scale]: In statistics, scaling by log transformation is meant to stabilize errors (Box-Cox transformation); in astronomy, in contrast, it is meant to impose a linear relationship between predictor and response, which shows up better on a log scale.
Chap. 7 [Statistics]: Discussion of error bars, bins, and histograms; although the graphical tools are the same, the objectives seem different (statistics: optimal binning; astronomy: enhancing signals in each bin).
Chap. 15 [Uncertainty]: Concepts of uncertainty; many words are associated with uncertainty, for example, variability, noise, incompleteness, indeterminacy, bias, error, accuracy, precision, reliability, validity, quality, and integrity.
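As a side note on the Chap. 6 item, the Box-Cox family mentioned there can be sketched in a few lines. This is the standard one-parameter form, (x^λ − 1)/λ with the natural log as the λ → 0 limit; the function name and the sample values are my own illustration, not taken from the book.

```python
import math

def box_cox(x, lam):
    """One-parameter Box-Cox power transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox is defined for positive x only")
    if lam == 0:
        return math.log(x)  # the lam -> 0 limit is the natural log
    return (x ** lam - 1.0) / lam

# Made-up positive values spanning several decades (illustrative only).
values = [0.1, 1.0, 10.0, 100.0]
for lam in (1.0, 0.5, 0.0):
    print(lam, [round(box_cox(v, lam), 3) for v in values])
```

With λ = 1 the data are only shifted, and with λ = 0 they are log transformed, so the statistician's choice of λ and the astronomer's habitual log scale are two points in the same family.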

Overall, I implore that these ideas be adapted into astronomical data analysis packages for visualizing the analyzed products. Perhaps they may inspire some astronomers to transform their ways of visualization. For instance, instead of histograms, in my opinion, box plots, qq-plots, and scatter plots would convey more information while maintaining simplicity; but except for scatter plots, these summary plots are not commonly used in astronomy. A benefit of box plots and qq-plots is that they allow checking Gaussianity without sacrificing information to binning. However, there is no golden rule about which type or grammar of graphics is correct and should be used. Only user preference exists.
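To illustrate the point about binning, here is a small sketch that computes the numbers behind a normal qq-plot and a box plot directly from the sorted sample, with no bins involved. The plotting positions (i − 0.5)/n are one common convention among several, and the sample is made up for illustration.

```python
from statistics import NormalDist, mean, stdev, quantiles

def qq_points(sample):
    """Return (theoretical, observed) quantile pairs for a normal qq-plot."""
    xs = sorted(sample)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theo = [fitted.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theo, xs))

def five_number_summary(sample):
    """Min, three quartiles, and max: the numbers a box plot draws."""
    q1, q2, q3 = quantiles(sample, n=4)
    return min(sample), q1, q2, q3, max(sample)

# Made-up sample (illustrative only, not astronomical data).
sample = [2.1, 2.9, 3.4, 3.8, 4.0, 4.3, 4.7, 5.2, 5.9, 9.5]

for t, x in qq_points(sample):
    print(f"{t:6.2f} {x:6.2f}")
print(five_number_summary(sample))
```

If the sample were Gaussian, the printed pairs would fall close to the line y = x; a point far above it in the upper tail, like the largest value here, hints at a heavy tail or an outlier that a coarse histogram could hide.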

Different disciplines maintain their own ways of presenting graphics and expect that they can talk to viewers from other disciplines. Disappointingly, no one has fully reached that point. Extensive discussion and persuasion are required to deliver the stories behind graphics to others.

As Felice Frankel pointed out, the way of visualization can enhance recognition and understanding when information is delivered deliberately. To that end, a few interesting quotes from the book replace the conclusion of this post.

The first ed. of this book, and Part 1 of the current ed., explicitly cautioned that the grammar of graphics is not a visualization system.
We are surprised, nevertheless, to discover how little some visualization researchers in various fields know about the origins of many of the techniques that are routinely applied in visualization.
The grammar of graphics determined how algebra, geometry, aesthetics, statistics, scales, and coordinates interact. In the world of statistical graphics, we cannot confuse aesthetics with geometry by picking a tree graphics to represent a continuous flow of migrating insects across a geographic field simply because we like the impression it conveys.
If we must choose a single word to characterize the focus of modern statistics, it would be uncertainty (Stigler, 1983)
… decision-makers need statistical tools to formalize the scenarios they encounter and they need graphical aids to keep them from making irrational decisions. … the use of graphics for decision-making under uncertainty is a relatively recent field. … We need to go beyond the use of error bars to incorporate other aesthetics in the representation of error. And we need research to assess the effectiveness of decision-making based on these graphics using a Bayesian yardstick.

[ArXiv] Data Visualization, July 17, 2007

hlee — Wed, 18 Jul 2007 05:04:55 +0000

From arxiv/astro-ph:0707.2474,
Visualization, Exploration and Data Analysis of Complex Astrophysical Data by Comparato, Becciani, Costa, Larsson, Garilli, Gheller, and Taylor

This paper introduces a novel advanced visualization tool, VisIVO [1]; the advantages of combining it with a protocol called PLASTIC (Platform for Astronomy Tool Interconnection) for displaying and extracting information from astrophysical data; its enhanced connection to the VO (Virtual Observatory); and its usage in several scientific cases.

Data visualization has never been emphasized more, in all fields, than it is these days. Each field has its own peculiar data patterns and is experiencing fast growth in data size. Tools specifically designed for astrophysical data well deserve a welcome.

[1] Available at http://visivo.cineca.it