The AstroStat Slog » incompleteness

[SPS] Testing Completeness

hlee — Wed, 19 Nov 2008 05:34:59 +0000

There will be a special session at the 213th AAS meeting on meaning from surveys and population studies (SPS). Until then, it might be useful to pull out some interesting and relevant papers, questions, and challenges as a preliminary to the meeting. I will not simply list astronomical catalogs and surveys, which are practically countless these days, but I will highlight some, with a description of the catalog, if they have changed the way science is performed (the best example, to my knowledge, is SDSS, the Sloan Digital Sky Survey).

The main focus of this series of postings is to introduce some statistical challenges, including data management, that are apt to arise from astronomical surveys and population studies. (I’m not sure how many postings there will be; the [SPS] series may be terminated after this season.) My paper selection criterion is based on the group discussions of the SPS working group during the SAMSI astrostatistics program in 2006 (the group leaders were G. Babu, Director of CASt, and T. Loredo).

Completeness – I. Revised, reviewed and revived by Johnston, Teodoro, and Hendry
MNRAS, 376(4), pp. 1757-176
Abstract (abridged to the first paragraph) We have extended and improved the statistical test recently developed by Rauzy for assessing the completeness in apparent magnitude of magnitude-redshift surveys. Our improved test statistic retains the robust properties – specifically independence of the spatial distribution of galaxies within a survey – of the Tc statistic introduced in Rauzy’s seminal paper, but now accounts for the presence of both a faint and bright apparent magnitude limit. We demonstrate that a failure to include a bright magnitude limit can significantly affect the performance of Rauzy’s Tc statistic. Moreover, we have also introduced a new test statistic, Tv, defined in terms of the cumulative distance distribution of galaxies within a redshift survey. These test statistics represent powerful tools for identifying and characterizing systematic errors in magnitude-redshift data.

One of the authors was an active participant in the SPS working group at SAMSI. The following three quotes are genuinely statistical in content, although the paper was published in MNRAS.

It is straightforward to show from this definition that the random variable η has a uniform distribution on the interval [0,1], and furthermore that η and Z are statistically independent.

If the sample is complete in apparent magnitude, for a given pair of trial magnitude limits, then Tc should be normally distributed with mean zero and variance unity. If, on the other hand, the trial faint (bright) magnitude limit is fainter (brighter) than the true limit, Tc will become systematically negative, due to the systematic departure of the $$\hat{\eta}_i$$ distribution from uniform on the interval [0,1].

If the sample is complete in apparent magnitude, for a given pair of trial magnitude limits, then Tv should be normally distributed with mean zero, and variance unity. If, on the other hand, the trial faint (bright) magnitude limit is fainter (brighter) than the true limit, in either case Tv will become systematically negative, due to the systematic departure of the $$\hat{\tau}_i$$ distribution from uniform on the interval [0,1].
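The reasoning in these quotes can be illustrated with a small simulation. The sketch below is not the estimator from the paper: it assumes the η values are independent draws, known exactly, and it uses an arbitrary stand-in (squaring a uniform draw) for the departure from uniformity that incompleteness would cause. The function name `standardized_sum` is hypothetical. It only shows why a centered, rescaled sum of uniform variables behaves like a standard normal variate, and why a systematic shift toward small values drives it strongly negative.

```python
import math
import random


def standardized_sum(etas):
    """Sum of (eta - 1/2), scaled by the standard deviation of that sum
    when the etas are n independent Uniform(0, 1) draws (variance n / 12)."""
    n = len(etas)
    return sum(e - 0.5 for e in etas) / math.sqrt(n / 12.0)


rng = random.Random(42)
n = 2000

# Complete sample (assumption): etas are uniform on [0, 1].
uniform_etas = [rng.random() for _ in range(n)]

# Incomplete sample (illustrative assumption): etas pile up near zero.
skewed_etas = [rng.random() ** 2 for _ in range(n)]

t_complete = standardized_sum(uniform_etas)
t_incomplete = standardized_sum(skewed_etas)

print("statistic, uniform etas:", round(t_complete, 2))
print("statistic, skewed etas: ", round(t_incomplete, 2))
```

With uniform draws the statistic stays within a few units of zero, while the skewed draws push it far into the negative tail, which is the qualitative behavior the quotes describe for Tc and Tv.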

Their statistics serve as a diagnostic tool: the estimated value of a statistic indicates completeness at a given magnitude. Asymptotic studies could have been carried out in more depth, so that people who use these statistics (Tc and Tv) could obtain p-values (for hypothesis testing) and confidence intervals. The authors, however, computed the means and variances and stated that these statistics are standard normal without rigorous proofs. On the other hand, the process of estimating Tc and Tv is nonparametric, so further statistical inference, such as showing that Tc and Tv are asymptotically normal, can be very challenging unless strong assumptions on (probabilistic) models and/or priors are given. Overall, for testing completeness, I find these statistics more statistically appealing than other ratio-based methods.

Testing completeness no longer seems a difficult task, thanks to these statistics, extensive survey catalogs, and a better understanding of populations. However, uncertainties in the k-correction, e-correction, and extinction correction still make the statistics fuzzy and the results difficult to interpret. Changes in the statistics due to these uncertainties are hard to characterize. Furthermore, obtaining good (point) estimators for these correction terms remains an almost unconquered problem.

In addition to the completeness tests described in the above paper, I’ve seen modeling efforts regarding incompleteness that are basically built on the power law, whose slope parameter is an indicator of cosmological models in x-ray astronomy. Unfortunately, incompleteness makes the slope estimation complex, and much effort goes into searching for or estimating a model that reflects this incompleteness in the observations as a function of redshift or magnitude; with a complete data set, the task would amount to fitting a simple ordinary linear regression model.

I believe that someday incompleteness will be modeled stochastically (parameterized to draw information and to offer good prediction), going beyond testing, and that this will offer a better understanding of the visible universe (visible here is a very broad concept, not limited to what can be seen with the naked human eye). For a long time, (in)completeness has been a concept, a meaningful word, to which mathematical compactness and statistical modeling have never been attached for testing and for understanding uncertainties.

p.s. I have been paying a lot of attention to citation style; even so, you may have noticed that my citations are far from consistent. Two noticeable differences between the citation styles of statistics and astronomy are the abbreviation of journal names and the inclusion of titles. Astronomers’ citations are compact, concise, and the same across astronomical journals; statisticians’ citations, on the contrary, are lengthy, informative (because of the title), and varied across statistical and applied statistics journals. MNRAS reminded me of a paper written by a very renowned statistician that cited a paper from MNRAS but called the journal Monograph National Royal Astronomical Society. I hope you will now be gracious about my citation style.

[disclaimer] I have seen various population studies in astronomy across a broad wavelength range, each with different objectives, targets, obstacles, and study designs (even the telescopes, detectors, data pipelines, and sampling schemes differ), and (in)completeness studies are designed to reflect those differences. I’m afraid that I’m reporting only a tiny fraction of all the efforts related to (in)completeness. Your comments are most welcome. I also hope to see your posts and comments on (in)completeness, volume/magnitude limited samples, survey studies, upper limits, missing values in surveys, clustering, spatial distribution, large scale structure, etc. in the near future.


missing data

hlee — Mon, 27 Oct 2008 13:24:22 +0000

The notion of missing data differs overall between the two communities. I tend to think that missing data carry as much information as observed data. Astronomers… I’m not sure how they think, but my impression so far is that when one attribute/variable of an object/observation/informant is missing, all other attributes of that object become useless, because the object is excluded from the scientific data analysis or model evaluation process. For example, it is hard to find any discussion of imputation in astronomical publications, or any statistical justification of how missing data are treated with respect to inference strategies. Instead, astronomers talk about incompleteness within different variables. To make this vague argument concrete, consider a catalog of multiple magnitudes. To draw a color magnitude diagram, one needs both color and magnitude. If one attribute is missing, that star will not appear in the color magnitude diagram, and no inference method based on that diagram will include that star. Nonetheless, one will try to understand how the proportion of stars observed varies with color and magnitude.

I guess this cultural difference originates in the quality of data. As for the typical size of the data sets that statisticians handle, a child could count the number of data points. As for astronomical data, only rounded numbers of stars in a catalog are discussed, and dropping some missing data won’t affect the final results.

Introducing how statisticians handle missing data may benefit astronomers who handle small catalogs because of observational challenges in the survey. Such data with missing values can be put through statistically rigorous data analysis processes instead of ad hoc procedures for obtaining complete cases, which risk throwing away many data points.

In statistics, utilizing the information in missing data adds information in the direction that the inference method tries to retrieve. Even if they are larger, it is better to have error bars than nothing. My question is: what statistical proposals exist for astronomers to handle missing data? I would like to find such a list; instead, I give a few somewhat nontechnical papers that explain the missing data types in statistics listed below, and a few statistics books/articles that statisticians often cite.

Data mining and the impact of missing data by M.L. Brown and J.F.Kros, Industrial Management and Data Systems (2003) Vol. 103, No. 8, pp.611-621
Missing Data: Our View of the State of the Art by J.L.Schafer and J.W.Graham, Psychological Methods (2002) Vol.7, No. 2, pp. 147-177
Missing Data, Imputation, and the Bootstrap by B. Efron, JASA (1994) 89 426 p. 463- and D.B.Rubin’s comment
The multiple imputation FAQ page (web) by J. Schafer
Statistical Analysis with Missing Data by R.J.A. Little and D.B.Rubin (2002) 2nd ed. New York: Wiley.
The Curse of the Missing Data (web) by Yong Kim
A Review of Methods for Missing Data by T.D.Pigott, Edu. Res. Eval. (2001) 7(4),pp.353-383 (survey of missing data analysis strategies and illustration with “asthma data”)

Pigott discusses missing data methods for a general audience in plain terms under the following categories: complete cases, available cases, single-value imputation, and more recent model-based methods (maximum likelihood for multivariate normal data, and multiple imputation). Readers craving more information should see Schafer and Graham, or the books by Schafer (1997) and Little and Rubin (2002).

Most introductory articles begin with common assumptions like missing at random (MAR) or missing completely at random (MCAR), but these do not seem to apply to typical astronomical data sets (I don’t know exactly why yet, and I cannot provide counterexamples to prove it, but that’s what I have observed and been told). Currently, I would like to find ways to link statistical thinking about missing data and modeling to astronomical data with missing values by discovering commonalities in their missingness properties. I hope you can help me and others in such efforts. For your information, the following are short definitions of these assumptions:

data missing at random : missing for reasons related to completely observed variables in the data set
data missing completely at random : the complete cases are a random sample of the originally identified set of cases
non-ignorable missing data : the reasons for the missing observations depend on the values of those variables.
outliers treated as missing data
the assumption of an ignorable response mechanism.

Statistical research on missing data is traditionally conducted in a setting where complete data are available: the goal is to characterize the inference results of missing data analysis methods by comparing results from the data with complete information against results obtained after dropping observations on the variables of interest. Simulations make it possible to emulate these different kinds of missingness. A practical astronomer may question such comparisons and the simulation of missing data. In real applications, such a step is not necessary, but for the sake of statistical/theoretical validation and the approval of new missing data analysis methods, the comparison between results from complete data and from missing data is unavoidable.
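The kind of simulation described here can be sketched in a few lines. Everything below is an illustrative assumption rather than a method from the cited papers: the linear model linking `x` and `y`, the 50% drop rate, and the cut at zero are invented for the example. The sketch starts from complete data, deletes values of `y` under the three mechanisms defined above, and compares the mean of the retained cases with the mean of the complete data.

```python
import random
import statistics


def simulate(n=20000, seed=1):
    """Generate complete data: x is always observed, y may go missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = 0.8 * x + rng.gauss(0.0, 0.6)  # assumed relation between x and y
        rows.append((x, y))
    return rows, rng


def observed_mean(rows, keep):
    """Mean of y over the cases for which keep(x, y) is true."""
    return statistics.fmean(y for x, y in rows if keep(x, y))


rows, rng = simulate()
full = statistics.fmean(y for _, y in rows)

# MCAR: a case is kept with probability 1/2, regardless of x or y.
mcar = observed_mean(rows, lambda x, y: rng.random() < 0.5)

# MAR: whether y is kept depends only on the fully observed x.
mar = observed_mean(rows, lambda x, y: x < 0.0)

# Non-ignorable: whether y is kept depends on the value of y itself.
mnar = observed_mean(rows, lambda x, y: y < 0.0)

print("complete data mean:", round(full, 3))
print("MCAR complete cases:", round(mcar, 3))
print("MAR complete cases:", round(mar, 3))
print("non-ignorable complete cases:", round(mnar, 3))
```

Under MCAR the retained cases reproduce the complete-data mean closely. Under the other two mechanisms the naive mean is biased; in the MAR case the bias could in principle be modeled through the observed `x`, whereas in the non-ignorable case it cannot be corrected without further assumptions about why values are missing.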

Contrary to my belief that statistical analysis with missing data applies universally, it seems that, so far, only regression-type strategies can cope with missing data, despite the diverse categories of missing data. Often in multivariate data analysis in astronomy, the relationship between response variables and predictors is not clear. More frequently, responses do not exist, and the joint distribution of the given variables is what matters. Without knowing the data generating distribution/model, analyzing arbitrarily built models with missing data, for imputation and for estimation, seems biased. This gap in handling different data types is the motivation for introducing statistical missing data analysis to astronomers, although the statistical strategies for handling missing data may look very limited. I believe, however, that some “new” concepts in missing data analysis can be salvaged, such as the assumptions for analyzing data with an underlying multivariate normal distribution, which is favored by astronomers, many of whom apply principal component analysis (PCA) nowadays. Understanding the conditions for the multivariate normal distribution and for missing data more rigorously would lead astronomers to project their data analysis onto the regression analysis space, since numerous survey projects, along with the emergence of new catalogs, pose questions about relationships among observed variables or estimated parameters. The broad area of regression analysis embraces missing data in various ways. Likewise, vast astronomical surveys and catalogs need to move forward in adopting proper data analysis tools that include missing data: the scientific objective of surveys is to find relationships among variables empirically, rather than from laws of physics, and missing data are not ignorable. I think that the tactics of missing data analysis will allow steps forward in astronomical data analysis and its statistical inference.

Statisticians and other scientists who use statistics may name the strategies of missing data analysis slightly differently; my way of listing the strategies described in the texts above is as follows:

complete case analysis (caveat: relatively few cases may be left for the analysis and MCAR is assumed),
available case analysis (pairwise deletion, deleting selected variables; caveat: correlations in variable pairs),
single-value imputation (typically the mean value is imputed, causing biased results and underestimated variance; not recommended),
maximum likelihood, and
multiple imputation (the last two are based on two assumptions: multivariate normal and ignorable missing data mechanism)

and the following are imputation strategies:

mean substitution,
case substitution (scientific knowledge authorizes the substitution),
hot deck imputation (values drawn from the most similar case, with the difficulty of defining what is “similar”),
cold deck imputation (values imputed from an external source),
regression imputation (prediction from independent variables; mean imputation is a special case), and
multiple imputation
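To make the caveat about single-value imputation concrete, here is a minimal sketch comparing mean substitution with regression imputation. The data generating model (a slope of 2 with unit noise) and the 40% MCAR drop rate are assumptions chosen only for illustration, and the variable names are hypothetical.

```python
import random
import statistics

rng = random.Random(7)
n = 5000

# Assumed complete data: y depends linearly on a fully observed x.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0.0, 1.0) for xi in x]

# MCAR: about 40% of the y values are dropped at random.
missing = [rng.random() < 0.4 for _ in range(n)]
obs_x = [xi for xi, m in zip(x, missing) if not m]
obs_y = [yi for yi, m in zip(y, missing) if not m]

# Mean substitution: every missing y gets the observed mean.
y_bar = statistics.fmean(obs_y)
mean_imputed = [y_bar if m else yi for yi, m in zip(y, missing)]

# Regression imputation: least-squares line fitted to the complete cases.
x_bar = statistics.fmean(obs_x)
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(obs_x, obs_y))
sxx = sum((a - x_bar) ** 2 for a in obs_x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
reg_imputed = [
    intercept + slope * xi if m else yi for xi, yi, m in zip(x, y, missing)
]

var_true = statistics.pvariance(y)
var_mean = statistics.pvariance(mean_imputed)
var_reg = statistics.pvariance(reg_imputed)

print("variance of complete y:          ", round(var_true, 2))
print("variance after mean substitution:", round(var_mean, 2))
print("variance after regression fill:  ", round(var_reg, 2))
```

Mean substitution shrinks the variance the most. Deterministic regression imputation recovers the part of the spread explained by `x` but still omits the residual scatter, which is one motivation for the multiple imputation approach at the end of the list.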

Some might prefer the following listing (adapted from Gelman and Hill’s regression analysis book):

simple missing data approaches that retain all the data

-mean imputation
-last value carried forward
-using information from related observations
-indicator variables for missingness of categorical predictors
-indicator variables for missingness of continuous predictors
-imputation based on logical values

random imputation of a single variable
imputation of several missing variables
model based imputation
combining inferences from multiple imputation
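For the last item, combining inferences from multiple imputation, the usual recipe is Rubin’s rules: average the point estimates across the imputed data sets, and add the average within-imputation variance to the between-imputation variance inflated by a factor (1 + 1/m). The sketch below assumes that each of the m imputed data sets has already produced an estimate and its variance; the function name `pool_estimates` and the input numbers are made up for illustration.

```python
import statistics


def pool_estimates(estimates, variances):
    """Combine m completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)       # pooled point estimate
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (divisor m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total


# Illustrative inputs only: three imputed data sets, one estimate each.
pooled_mean, pooled_var = pool_estimates([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
print("pooled estimate:", pooled_mean)
print("total variance: ", round(pooled_var, 4))
```

The between-imputation term is what carries the extra uncertainty due to the missing values, so the pooled variance is larger than the variance from any single imputed data set.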

Statistical missing data analysis acknowledges its assumptions explicitly, in contrast to subjective data processing aimed at a complete data set. I often see discrepancies between plots in astronomical journals and the linked catalogs: missing data, including outliers, reside in the catalogs but do not appear in the plots because of a subjective data cleaning step. Statistics, on the other hand, explicitly states the assumptions and conditions of missing data. However, I don’t know what is proper or correct from a scientific viewpoint. No such explicit guidance exists, and judgments about the assumptions on missing data and how to process them are left to astronomers. Moreover, astronomers have advantages, such as knowledge of physics, for imputing data more suitably and subtly.

Schafer and Graham described, with or without missing data, the goal of a statistical procedure should be to make valid and efficient inferences about a population of interest — not to estimate, predict, or recover missing observations nor to obtain the same results that we would have seen with complete data.

The following quote from the above web link (Y. Kim) says more.

Dealing with missing data is a fact of life, and though the source of many headaches, developments in missing data algorithms for both prediction and parameter estimation purposes are providing some relief. Still, they are no substitute for critical planning. When it comes to missing data, prevention is the best medicine.

Missing entries in astronomical catalogs are unpreventable; therefore, statistically improved strategies are needed more than ever, because as the volume of surveys and catalogs increases, the number of missing entries grows proportionally. Or current methods that use only complete data (discarding every observation with at least one missing entry) could be the only way to go. There is more room to discuss strategies case by case, which will come in a future post. This one is already too long.
