The AstroStat Slog » correlation
http://hea-www.harvard.edu/AstroStat/slog
Weaving together Astronomy + Statistics + Computer Science + Engineering + Instrumentation, far beyond the growing borders

Scatter plots and ANCOVA
http://hea-www.harvard.edu/AstroStat/slog/2009/scatter-plots-and-ancova/
Thu, 15 Oct 2009, by hlee

Astronomers rely on scatter plots to illustrate correlations and trends among pairs of variables more than scientists in any other field[1]. Pages of scatter plots with regression lines are common, with the slope of the regression line and its error bars serving as indicators of the degree of correlation. Sometimes, seeing so many such scatter plots makes me think that, overall, the resources spent drawing them and the pages on which they are printed are wasted. Why not just report the correlation coefficients and their errors, and publish the processed data used to compute them (not the full data), so that others can verify the results for the sake of validation? A couple of scatter plots are fine, but when I see dozens of them, I lose my focus. This is another cultural difference.

When there are so many pairs of variables that numerous scatter plots are demanded, one possibility is using parallel coordinates and a matrix of correlation coefficients. If a Gaussian distribution is assumed, which seems to be the case almost always, particularly when parametrizing measurement errors or fitting physical models, then the error bars of these coefficients can also be reported in matrix form. If one considers more complex relationships with multiple tiers of data sets, then one might want to look into ANCOVA (ANalysis of COVAriance) to see how statisticians structure observations and their uncertainties into a model to extract useful information.

I’m not saying that the simple examples from Wikipedia, Wikiversity, or publicly available tutorials on ANCOVA are directly applicable to statistical modeling of astronomical data. Most likely they are not: astrophysics generally involves complicated nonlinear physical models. However, identifying dependent variables, independent variables, latent variables, covariates, response variables, and predictors (to name some statistical-modeling jargon), and defining their relationships in the comprehensive way that ANCOVA does, rather than pairing variables off for scatter plots, would help to quantify relationships appropriately and to remove artificial correlations. Such spurious correlations appear frequently because of data projection. For example, data points on a circle in the XY plane of 3D space, centered at zero, appear to form a bar rather than a circle when viewed edge-on, producing an apparently perfect correlation.
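The projection effect just described is easy to reproduce. Below is a minimal NumPy sketch (the circle's tilted orientation is my own arbitrary choice): points lying on a circle in 3D collapse, under projection, onto a diagonal bar with a Pearson correlation of exactly 1, even though no linear relation exists among the 3D coordinates.

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 360, endpoint=False)

# A circle centered at zero, lying in the plane spanned by these two unit vectors
# (the tilted orientation is a made-up choice for illustration)
u = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
w = np.array([0.0, 0.0, 1.0])
pts = np.cos(theta)[:, None] * u + np.sin(theta)[:, None] * w  # shape (360, 3)

# Projecting out the third axis collapses the circle onto a diagonal bar
x_proj, y_proj = pts[:, 0], pts[:, 1]
r = np.corrcoef(x_proj, y_proj)[0, 1]
print(round(r, 6))  # 1.0 -- a perfect, purely artificial correlation
```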

As a matter of fact, astronomers are aware of the need to remove these unwanted correlations via corrections — for example, fitting a straight line or a second-order polynomial for extinction correction. However, I am rarely satisfied with such linear shifts of data with uncertainty, because of the changes they induce in the uncertainty structure. Consider what happens when subtracting a background leads to negative values, an unrealistic consequence. Unless probabilistically justified, linear operations require a lot of care: residuals y − E(Y|X=x) are guaranteed to be normal only when the μ and σ of the Gaussian density can themselves be operated on linearly (about the Gaussian distribution, please see the post "why Gaussianity?" and the references therein). An alternative to subtraction is linear approximation or nonparametric model fitting, as we saw through applications of principal component analysis (PCA); PCA is used for whitening and for approximating nonlinear functional data (curves and images). Properly accounting for the sources of uncertainty and their hierarchical structure is not an easy problem, either astronomically or statistically. Nevertheless, identifying the properties of the observations from both physics and statistics, and putting them into a comprehensive, structured model, could help to establish causality[2] and the significance of correlation better than presenting numerous scatter plots with lines from simple regression analyses.

In order to understand why statisticians developed ANCOVA — or, more generally, ANOVA (ANalysis Of VAriance) — beyond the material in wiki:ANCOVA, you might want to check this page[3] and to use a search engine with keywords of interest on top of ANCOVA to narrow down the results.

From the linear-model perspective, if a response is considered to be a function of redshift (z), then z becomes a covariate. The significance of this covariate, alongside other factors in the model, can be tested later when one fully fits and analyzes the statistical model. Suppose one wants to model, say, rotation speed (an indicator of dark matter content) as a function of redshift, degree of spirality, and number of companions — a very hypothetical proposal, owing to my lack of knowledge in observational cosmology; I only want to point out that such a model-fitting problem can be viewed through statistical modeling like ANCOVA by identifying covariates and their relationships. Because the covariate z is continuous, the degree of spirality is a fixed effect (0 to 7, eight levels), and the number of companions is a random effect (cluster size is random), the comprehensive model could be described by ANCOVA. To my knowledge, scatter plots and simple linear regression marginalize over all the additional contributing factors and information, which may themselves be the main drivers of the correlation, even though Y and X appear highly correlated in the scatter plot. At some point we must marginalize over unknowns; nonetheless, nuisance parameters and latent variables can often be factored into the model — quite different from ignoring them — to obtain deeper insights and knowledge from observations in many measures and dimensions.
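Mechanically, an ANCOVA-style fit of this sort boils down to least squares on a design matrix containing the continuous covariate plus dummy-coded factor levels. The sketch below uses entirely invented numbers, drops the random effect for simplicity, and is only meant to show the mechanics of testing a covariate alongside a factor:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
z = rng.uniform(0.01, 2.0, n)        # continuous covariate (a stand-in for redshift)
spirality = rng.integers(0, 8, n)    # fixed effect with 8 levels (0..7)

# Invented truth: response = 1.5 * z plus a level-specific offset, plus noise
offsets = np.linspace(0.0, 3.5, 8)
y = 1.5 * z + offsets[spirality] + rng.normal(0.0, 0.3, n)

# ANCOVA design matrix: intercept, covariate z, and 7 dummy columns
# (level 0 serves as the reference category)
dummies = (spirality[:, None] == np.arange(1, 8)[None, :]).astype(float)
X = np.column_stack([np.ones(n), z, dummies])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(round(beta[1], 2))  # estimated slope on z, close to the true 1.5
```

A random effect for the number of companions would require a mixed model (e.g., restricted maximum likelihood), which is beyond this sketch.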

What could, I think, be done with a small, ergonomic chart or table — via hypothesis testing, multivariate regression, model selection, variable selection, dimension reduction, projection pursuit, or other state-of-the-art statistical methods — is instead done in astronomy with numerous scatter plots, decorated with colors, symbols, and lines, to account for all possible pairwise relationships, some of which may be artificial. I also feel that trees, electricity, and effort could be saved by not producing so many nice-looking scatter plots. Fitting and analyzing more comprehensive models in a statistical fashion helps to identify the independent variables or covariates driving a strong correlation, to find collinear variables, and to drop redundant or uncorrelated predictors. Bayes factors or p-values can then be used for comparing models, testing the significance of their variables, and computing error bars appropriately — not in the way the null hypothesis probability is usually (mis)interpreted.
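One concrete way a more comprehensive model separates genuine from artificial correlation is by partialling out a shared covariate. A minimal sketch, with a made-up lurking variable t driving both observables: the marginal correlation is strong, but the partial correlation controlling for t is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
t = rng.uniform(0.0, 10.0, n)           # hidden common driver (made up)
x = 2.0 * t + rng.normal(0.0, 1.0, n)   # both observables scale with t ...
y = -3.0 * t + rng.normal(0.0, 1.0, n)  # ... but share no direct link

# Marginal correlation looks strong (and negative)
r_marginal = np.corrcoef(x, y)[0, 1]

# Partial correlation: regress t out of both, then correlate the residuals
def residuals(v, t):
    A = np.column_stack([np.ones_like(t), t])
    coef, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ coef

r_partial = np.corrcoef(residuals(x, t), residuals(y, t))[0, 1]
print(round(r_marginal, 2), round(r_partial, 2))  # strong marginal, near-zero partial
```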

Lastly, ANCOVA is a complete [MADS]. :)

  1. This is not an absolute statement but a personal impression formed after reading articles from various fields in addition to astronomy. My reading of other fields suggests that many rely on correlation statistics rather than on scatter plots with straight lines drawn through the data to impose relationships between variable pairs.
  2. The way chi-square fitting is done and the goodness-of-fit test is carried out reflects the notion that X causes Y, through the practice of minimizing the objective function, the sum of (Y−E[Y|X])²/σ².
  3. It is the website of Vassar College, which counted the pioneering female astronomer Maria Mitchell among its first faculty. It is said that the first building constructed on campus was the Vassar College Observatory, now a national historic landmark. This historical connection is the only reason for pointing to this website — to draw some astronomers' attention beyond statistics.
Correlation is not causation
http://hea-www.harvard.edu/AstroStat/slog/2009/correlation-is-not-causation/
Fri, 06 Mar 2009, by vlk

What XKCD says:
[xkcd comic on correlation: "I used to think correlation implied causation. Then I took a statistics class. Now I don't." "Sounds like the class helped." "Well, maybe."]

The mouseover text on the original says “Correlation doesn’t imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing ‘look over there’.”

It is a bad habit, hard to break; the temptation is great.

Mexican Hat [EotW]
http://hea-www.harvard.edu/AstroStat/slog/2008/eotw-mexican-hat/
Wed, 28 May 2008, by vlk

The most widely used tool for detecting sources in X-ray images, especially Chandra data, is the wavelet-based wavdetect, which uses the Mexican Hat (MH) wavelet. Now, the MH is not a very popular choice among wavelet aficionados because it does not form an orthonormal basis set (i.e., scale information is not well separated) and does not have compact support (i.e., the function extends to infinity). So why is it used here?

The short answer is that it has a convenient background subtractor built in, is analytically comprehensible, and uses concepts very familiar to astronomers. The last bit can be seen by appealing to Gaussian smoothing. Astronomers are (or were) used to smoothing images with Gaussians, and, in a manner of speaking, all astronomical images already come presmoothed by PSFs (point spread functions) that are nominally approximated by Gaussians. Now, if an image is smoothed by another Gaussian of slightly larger width, the difference between the two smoothed images highlights those features which are prominent at the spatial scale of the larger Gaussian. This is the basic rationale behind a wavelet.

So, in the following, G(x,y;σx,σy,xo,yo) is a 2D Gaussian written such that the scaling of the widths and the translation of the function are made obvious. It is defined over the real plane x,y ∈ R² for widths σx,σy. The Mexican Hat wavelet MH(x,y;σx,σy,xo,yo) is generated as the difference between two Gaussians of different widths, which essentially boils down to taking partial derivatives of G with respect to the widths. To be sure, these must really be thought of as operators whose functions are correlated with a data image, so the derivatives must be carried out inside an integral, but I am skipping all that for the sake of clarity. Also note that the MH is sometimes derived as the second derivative of G(x,y) — the spatial derivatives, that is.

[Equation image: the Mexican Hat wavelet]

The integral of the MH over R² vanishes — the positive bump and the negative annulus cancel each other out — so there is no unambiguous way to set its normalization. Finally, the Fourier transform shows which spatial scales (kx, ky are the wavenumbers) are enhanced or filtered during a correlation.
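Both properties — the vanishing integral and the equivalence between the difference-of-Gaussians and the width derivative — are easy to verify numerically. A 1D sketch (the normalization is left arbitrary, per the point above):

```python
import numpy as np

sigma, ds = 1.0, 1e-4
x = np.linspace(-10.0, 10.0, 4001)
dx = x[1] - x[0]

def gauss(x, s):
    return np.exp(-x**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

# 1D Mexican Hat (Ricker) shape, normalization left arbitrary
mh = (1 - x**2 / sigma**2) * np.exp(-x**2 / (2 * sigma**2))

# Difference of two Gaussians with nearby widths ~ d/dsigma of G,
# which traces the same shape (up to a negative constant)
dog = gauss(x, sigma + ds) - gauss(x, sigma)

integral = mh.sum() * dx                  # vanishes: bump and wings cancel
shape_match = np.corrcoef(dog, mh)[0, 1]  # ~ -1: proportional, opposite sign
print(abs(integral) < 1e-6, shape_match < -0.999)
```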

Books – a boring title
http://hea-www.harvard.edu/AstroStat/slog/2008/books-a-boring-title/
Fri, 25 Jan 2008, by hlee

I have observed various misconceptions about statistics and about the evolution of statistical nomenclature in astronomy, which I believe can be attributed to the lack of references in the astronomical community. There are some textbooks designed for junior/senior science and engineering students that are likely unknown to astronomers, and the usual examples are, to my knowledge, not suitable. Although I never expect astronomers to work through standard graduate (mathematical) statistics textbooks, I do wish astronomers would go beyond Numerical Recipes (W. H. Press, S. A. Teukolsky, W. T. Vetterling, & B. P. Flannery) and Data Reduction and Error Analysis for the Physical Sciences (P. R. Bevington & D. K. Robinson). Here are some good ones written by astronomers, engineers, and statisticians:

The motivation for writing this posting originated with Vinay's recommendation: Practical Statistics for Astronomers (J. V. Wall and C. R. Jenkins), which provides many statistical insights and caveats that astronomers tend to ignore. Without looking at the error distribution and the properties of the data, astronomers jump straight into chi-square and correlation. Anyone who reads the book will be more careful about adopting the common statistical practices of astronomy, developed many decades ago and founded on strong assumptions that are not compatible with modern data sets. The book addresses many concerns that have been growing in my mind on behalf of astronomers and introduces various statistical methods applicable in astronomy.

The viewpoint of astronomers who read this book in full but lack classroom statistics education would differ from mine. The book mentions unbiasedness, consistency, closedness, and robustness of statistics, properties that normally are neither discussed nor proved in astronomy papers. Such readers may therefore miss the insights, caveats, and between-the-lines content of the book that I care about. To reduce this gap, for a quick and easy understanding of classical statistics, I recommend The Cartoon Guide to Statistics (Larry Gonick & Woollcott Smith) as a first step. This cartoon book conveys the fundamentals of statistics in a fun and friendly manner, and provides everything that rudimentary textbooks offer.

If someone wants to go beyond classical statistics (so-called frequentist statistics) and learn the popular Bayesian statistics, astronomy professor Phil Gregory's Bayesian Logical Data Analysis for the Physical Sciences is recommended. For a bit more on the modern statistics of frequentists and Bayesians, All of Statistics (Larry Wasserman) is recommended. I realize that textbooks for non-statistics students are too thick to get through in a short time (the book for senior engineering students I used for teaching at Penn State was Probability and Statistics for Engineering and the Sciences by Jay L. Devore, 4th and 5th editions, and it ran about 600 pages; the current edition is 736 pages). One well-received textbook for graduate students in electrical engineering is Probability, Random Variables and Stochastic Processes (A. Papoulis & S. U. Pillai). As I recall, the book offers a rather less abstract treatment of measure and practical examples (personally, I found its coverage of Hermite polynomials useful).

For casual reading about statistics and its 20th-century history, The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century (D. Salsburg) is quite nice.

Statistics is not just for best-fit analysis and error bars. It is like a wonderful telescope: it extracts correct information when operated carefully, on the right target, according to the manual. It removes atmospheric and other blurring factors when it is properly understood. It is neither a black box nor magic, as many people think.

The era of treating everything as Gaussian ended decades ago. Thanks to the central limit theorem and the delta method (log-transformation is a good example), many statistics asymptotically follow the normal (Gaussian) distribution, but there are many other families of distributions. Because of possible bias in the chi-square method, the error bar cannot guarantee its nominal coverage, such as 95%. There are also nonparametric statistics, known for robustness: they may be less efficient than statistics built on an assumed distribution family, but they require no model assumption. And Bayesian statistics works wonderfully when correct prior information, suitable likelihood models, and the computing power for hierarchical models and numerical integration are available.
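A quick simulation illustrates both points: by the central limit theorem the sample mean of skewed (here, exponential) data is approximately normal, and by the delta method so is its logarithm, with variance ≈ 1/n for exponential data since Var(mean)/μ² = (scale²/n)/scale². The numbers below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# 50,000 replications of the sample mean of n = 100 skewed (exponential) values
n = 100
means = rng.exponential(scale=2.0, size=(50_000, n)).mean(axis=1)

# CLT: the sample mean is approximately normal despite the skewed parent;
# delta method: log(mean) is approximately normal with variance
# Var(mean) / mu^2 = (scale^2 / n) / scale^2 = 1 / n
log_means = np.log(means)
print(round(log_means.std(), 3))  # close to sqrt(1/100) = 0.1
```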

Before jumping into chi-square for fitting and testing at the same time, exploratory data analysis is required — to prevent introducing bias, to understand the data better, and to seek a suitable statistic and its assumptions. Exploratory data analysis starts from simple scatter plots and box plots. A little statistical care with the data and a genuine interest in how statistical methods actually work are all I am asking for. I do wish these books could help make those wishes real.

—————————————————————————-
[1.] Most of the links to books are to amazon.com, but I have no personal affiliation with the company.

[2.] In addition to the previous posting on chi-square, "what is so special about chi square in astronomy," I'd like to mention the possible bias in chi-square fitting and testing. It is well known that using the same data set both for fitting — which yields the parameter estimates astronomers call best-fit values, along with error bars — and for testing based on those estimates introduces bias, so that the best fit is biased away from the true parameter value and the error bar does not achieve its intended coverage. See the problem illustrated in Aneta's example of chi² bias in fitting X-ray spectra.

[3.] More book recommendations are welcome.

[ArXiv] 3rd week, Jan. 2008
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-3rd-week-jan-2008/
Fri, 18 Jan 2008, by hlee

Seven preprints were chosen this week, and two of them mention model selection.

  • [astro-ph:0801.2186] Extrasolar planet detection by binary stellar eclipse timing: evidence for a third body around CM Draconis H. J. Deeg (discusses model selection in section 4.4)
  • [astro-ph:0801.2156] Modeling a Maunder Minimum A. Brandenburg & E. A. Spiegel (could be useful for those who do sunspot cycle modeling)
  • [astro-ph:0801.1914] A closer look at the indications of q-generalized Central Limit Theorem behavior in quasi-stationary states of the HMF model A. Pluchino, A. Rapisarda, & C. Tsallis
  • [astro-ph:0801.2383] Observational Constraints on the Dependence of Radio-Quiet Quasar X-ray Emission on Black Hole Mass and Accretion Rate B. C. Kelly et al.
  • [astro-ph:0801.2410] Finding Galaxy Groups In Photometric Redshift Space: the Probability Friends-of-Friends (pFoF) Algorithm I. Li & H. K. C. Yee
  • [astro-ph:0801.2591] Characterizing the Orbital Eccentricities of Transiting Extrasolar Planets with Photometric Observations E. B. Ford, S. N. Quinn, & D. Veras
  • [astro-ph:0801.2598] Is the anti-correlation between the X-ray variability amplitude and black hole mass of AGNs intrinsic? Y. Liu & S. N. Zhang
[ArXiv] 2nd week, Jan. 2007
http://hea-www.harvard.edu/AstroStat/slog/2008/arxiv-2nd-week-jan-2007/
Fri, 11 Jan 2008, by hlee

It is notable that there is an astronomy paper with AIC, BIC, and Bayesian evidence in its title. The topic of the paper, unsurprisingly, is cosmology, like the other astronomy papers that have discussed these (statistical) information criteria (I have found only a couple of papers on model selection applied to astronomical data analysis that do not involve the CMB; note that I exclude Bayes factors used for model selection).

To find the paper or other interesting ones, click

  • [astro-ph:0801.0638]
    AIC, BIC, Bayesian evidence and a notion on simplicity of cosmological model M Szydlowski & A. Kurek

  • [astro-ph:0801.0642]
    Correlation of CMB with large-scale structure: I. ISW Tomography and Cosmological Implications S. Ho et al.

  • [astro-ph:0801.0780]
    The Distance of GRB is Independent from the Redshift F. Song

  • [astro-ph:0801.1081]
    A robust statistical estimation of the basic parameters of single stellar populations. I. Method X. Hernandez and D. Valls–Gabaud

  • [astro-ph:0801.1106]
    A Catalog of Local E+A(post-starburst) Galaxies selected from the Sloan Digital Sky Survey Data Release 5 T. Goto (Carefully built catalogs are wonderful sources for classification/supervised learning, or semi-supervised learning)

  • [astro-ph:0801.1358]
    A test of the Poincare dodecahedral space topology hypothesis with the WMAP CMB data B.S. Lew & B.F. Roukema

In cosmology, the handful of candidate models under consideration are generally nested: a larger model usually carries extra terms relative to the smaller ones. How the penalty for those extra terms is defined leads to different model selection criteria. However, astronomy papers in general never discuss the consistency or statistical optimality of these selection criteria; at best they offer Monte Carlo simulations and extensive comparisons across the criteria. Nonetheless, my personal view is that the field of model selection should be promoted among astronomers, to prevent the fallacy of blindly fitting models that may be irrelevant to the information the data set contains. Physics points to the correct model, but so do the data.
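To make the penalty difference concrete, here is a sketch comparing AIC and BIC for two nested Gaussian-error models (the data and models are entirely invented): AIC charges 2 per extra parameter, BIC charges log n, so BIC penalizes the larger model more whenever n > e² ≈ 7.4.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300
x = rng.uniform(-1.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)   # invented truth: the smaller model

def gauss_fit(y, X):
    """Least squares; returns (n_params, maximized log-likelihood) for Gaussian errors."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / len(y)
    k = X.shape[1] + 1   # regression coefficients plus the noise variance
    loglik = -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)
    return k, loglik

scores = {}
for name, X in [("linear", np.column_stack([np.ones(n), x])),
                ("linear+x^2", np.column_stack([np.ones(n), x, x**2]))]:
    k, ll = gauss_fit(y, X)
    scores[name] = {"AIC": 2 * k - 2 * ll, "BIC": k * np.log(n) - 2 * ll}

# BIC's extra-term penalty exceeds AIC's by (log n - 2) per added parameter
d_aic = scores["linear+x^2"]["AIC"] - scores["linear"]["AIC"]
d_bic = scores["linear+x^2"]["BIC"] - scores["linear"]["BIC"]
print(round(d_bic - d_aic, 3))  # log(300) - 2 ~ 3.704
```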

[ArXiv] Astronomy Job Market in US
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-astronomy-job-market-in-us/
Fri, 21 Dec 2007, by hlee

It's a report about the job market in the US.

[astro-ph:0712.2820] The Production Rate and Employment of Ph.D. Astronomers T.S. Metcalfe

Related Comments:

  1. Many more jobs than I expected. Still, the market cannot compete with jobs in statistics.
  2. Typically three jobs before landing a stable one in astronomy; I do not know the figure for statistics.
  3. Astronomy Ph.D. students receive more care, in the sense that the job market is managed to guarantee a position for every student. In statistics, even without such care you can find something (though not necessarily a research position).

Unrelated Comment on Correlation:
It's a cultural difference — or maybe not. When I learned correlation years ago from a textbook, the procedure was: 1. compute the correlation, and 2. do a t-test. In astronomical papers it is: 1. do a regression, and 2. plot the simple linear regression line with error bands and the data points. The computation is the same, but the way the results are presented seems different.
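The textbook procedure can be sketched in a few lines (simulated data; the threshold quoted is the two-sided 5% critical value of Student's t):

```python
import math
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.6 * x + rng.normal(scale=0.8, size=n)   # made-up data with true rho = 0.6

# Step 1: compute the sample correlation
r = np.corrcoef(x, y)[0, 1]

# Step 2: t-test of H0: rho = 0, using t = r * sqrt(n - 2) / sqrt(1 - r^2);
# under H0 this follows Student's t with n - 2 degrees of freedom,
# so |t| > ~1.98 rejects at the two-sided 5% level for n = 100
t = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)
print(round(r, 2), t > 1.98)
```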

I wonder what it would be like if we narrowed the job market to astrostatisticians.

[ArXiv] Correlation Studies, June 12, 2007
http://hea-www.harvard.edu/AstroStat/slog/2007/arxiv-correlation-studies-june-12-2007/
Mon, 18 Jun 2007, by hlee

One of the arxiv/astro-ph preprints, arxiv/0706.1703v1, discusses correlation between galactic HI and the cosmic microwave background (CMB) and reports no statistically significant correlation.

Beyond the astrophysical significance of the paper, when "correlation" appears in scientific papers, people expect the papers to be about statistics. Are these correlation studies truly statistical science?

Statistical Challenges in Modern Astronomy III (2001) was the first astronomy conference I encountered after my subject of interest had changed from solar physics to statistics, a field in which I had only a very rudimentary level of knowledge at the time. Although I was a mere helper at the conference, I managed to eavesdrop on some talks and discussions among the participants, and the word "correlation" came up frequently.

Consider a set of paired points uniformly distributed on a circle in 2D Euclidean space. The estimated correlation is close to zero, yet we understand that this data set is highly dependent: y is determined by x up to a sign. Depending on how correlation is defined for a given data space, the degree of correlation can come out very differently. I therefore came to doubt what is so important about correlation in astronomy.
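This can be checked directly:

```python
import numpy as np

theta = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
x, y = np.cos(theta), np.sin(theta)   # paired points uniformly on the unit circle

# Pearson correlation is essentially zero ...
r = np.corrcoef(x, y)[0, 1]
print(abs(r) < 1e-10)
# ... even though y is completely determined by x up to sign: x**2 + y**2 == 1
```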

After some years, I realized that correlation is important in astronomy, astrophysics, and cosmology — as in arxiv/0706.1703v1 and other papers — because the estimated correlation coefficient is taken to reveal physical correlation among the objects of interest. Correlation is treated as a black-box statistical tool that directly reports physical correlation. My impression is that astronomers believe an important physical correlation follows from a statistically significant correlation coefficient, without investigating the foundations of the statistical inference.

On the other hand, the nice part of arxiv/0706.1703v1 is the authors' two caveats on correlation: 1. correlation inevitably appears due to random fluctuations, so a-posteriori statistics should not be used; and 2. visual correlation can be misleading, so quantitative methods — such as Monte Carlo methods for assessing significance — are required.

I hope some astronomers will provide a good description of what makes estimating correlation so important, and of how a statistically significant correlation becomes a physically important correlation.

p.s. In the paper,

If one draws N numbers between 0 and 1, the probability that they will all be smaller than x is p=1-x^N.

I think this should be p=x^N.
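A quick simulation supports the corrected formula — the probability that all N uniform draws fall below x is x^N (the parameter values below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
N, x, trials = 5, 0.7, 200_000       # arbitrary illustration values
draws = rng.uniform(0.0, 1.0, size=(trials, N))

# All N draws are below x exactly when their maximum is below x
frac = np.mean(draws.max(axis=1) < x)
print(round(x**N, 3))  # 0.168; the simulated fraction lands close to this
```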
