my first AAS. IV. clustering

Two attendees, whom I had known before the AAS, asked me whether I could suggest clustering methods relevant to their projects. As it turned out, we spent quite some time just clarifying the term clustering.

  • Statisticians and astronomers understand clustering differently:
    • classification vs. clustering, or supervised learning vs. unsupervised learning: the former term in each pair indicates that the scientist already knows the types of the objects in hand. A photometric data set with an additional column saying star, galaxy, quasar, or unknown is a target for classification or supervised learning. Simply put, classification is finding a rule, based on photometric colors, that separates these different types of objects. If there is no additional column, but scatter plots or plots after dimension reduction show grouping patterns, it is clustering or unsupervised learning, whose goal is finding hyperplanes that separate these clusters optimally. In other words, the objective of clustering/unsupervised learning is answering the questions: are there real clusters? If so, how many? Overall, at a rudimentary level, the presence of an extra column of types is what differentiates classification from clustering.
    • physical clustering vs. statistical clustering:
      Cosmologists and the like are interested in clusters/clumps of matter/particles/objects. For astrophysicists, clusters are associated with the spatial evolution of the universe. Inquiries about clustering from astronomers are more likely about finding these spatial clumps statistically, which is a subject of stochastic geometry or spatial statistics. On the other hand, statisticians and data analysts like to investigate clusters in a reparameterized multi-dimensional space. The distances computed there do not follow the fundamental laws of physics (gravitation, EM, weak, and strong) but reflect relationships in the multi-dimensional space; for example, in a color-magnitude (CM) diagram, stars of a kind are grouped together. What the two communities agree on is that the number of clusters is unknown, so the plethora of classification methods cannot be applied, and that the study objective is seeking methodologies for quantifying clusters.
  • Astronomers’ clustering problems are either statistical classification (close to semi-supervised learning) or spatial statistics.
    How noisy clusters manifest themselves in the universe, and how the current matter distribution is quantified, lead to the very fundamentals of the birth of the universe, where spatial statistics can be a great partner. In the era of photometric redshifts, various classification techniques enhance the accuracy of prediction.
  • Astronomers’ testing of the reality of clusters seems limited: cosmology problems have been tackled as inverse problems. Based on theoretical cosmological models, simulations are performed and the results are transformed into some surrogate parameters. These surrogates are generally represented by smooth curves or straight lines in a plot, where observations make their debut as points with bidirectional error bars (so-called measurement errors). The judgment about the cosmological model under test is made by a simple regression (correlation) or by eyeballing the observed data points. If the observations and a curve from a cosmological model match well in a 2D plot, the given cosmological model is declared confirmed in the conclusion section. Personally, I think this procedure of testing cosmological models that account for clusters in the universe could be developed in a more statistically rigorous fashion than matching straight lines.
  • Challenges to statisticians in astronomy, measurement errors: in (statistical) learning, I believe, there has been no standard procedure for incorporating astronomers’ measurement errors into modeling. I think measurement errors are, in general, ignored because systematic errors are not recognized in statistics. In astronomy, on the other hand, the measurement errors accompanying data are a crucial piece of information, particularly for verifying the significance of observations. Often these measurement errors become the denominator in the χ² function, which is then treated as following a χ² distribution to get best fits and confidence intervals.
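
To make the χ² practice from the last bullet concrete, here is a minimal sketch of a straight-line fit in which each measurement error enters as the denominator of its residual. It assumes independent Gaussian errors on y only and no errors on x; the function name `chi2_line_fit` and the data values are made up for illustration and do not come from any particular study.

```python
from math import sqrt

def chi2_line_fit(x, y, sigma):
    """Fit y = a + b*x by minimizing
    chi2 = sum(((y_i - a - b*x_i) / sigma_i)**2),
    assuming independent Gaussian errors sigma_i on y only."""
    w = [1.0 / s**2 for s in sigma]                 # weights 1/sigma_i^2
    S = sum(w)
    Sx = sum(wi * xi for wi, xi in zip(w, x))
    Sy = sum(wi * yi for wi, yi in zip(w, y))
    Sxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    delta = S * Sxx - Sx**2
    a = (Sxx * Sy - Sx * Sxy) / delta               # intercept
    b = (S * Sxy - Sx * Sy) / delta                 # slope
    a_err = sqrt(Sxx / delta)                       # 1-sigma error on a
    b_err = sqrt(S / delta)                         # 1-sigma error on b
    chi2 = sum(((yi - a - b * xi) / si) ** 2
               for xi, yi, si in zip(x, y, sigma))
    dof = len(x) - 2                                # two fitted parameters
    return a, b, a_err, b_err, chi2, dof

# Made-up observations with error bars, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.1]
sigma = [0.2, 0.2, 0.3, 0.3, 0.5]

a, b, a_err, b_err, chi2, dof = chi2_line_fit(x, y, sigma)
print(f"a = {a:.3f} +/- {a_err:.3f}, b = {b:.3f} +/- {b_err:.3f}")
print(f"chi2 = {chi2:.3f} with {dof} degrees of freedom")
```

Comparing the minimized χ² with a χ² distribution on n − 2 degrees of freedom is only justified if the error bars really are Gaussian, independent, and correctly sized, and if the straight line is the right model. Those are the assumptions that deserve more scrutiny from both communities.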

My personal lessons from two short discussions at the AAS were that more collaboration between statisticians and astronomers is needed to include measurement errors in classification or semi-supervised learning, particularly nowadays when we are enjoying a plethora of data sets, and that we should move forward, with better aid from statisticians, toward testing/verifying the existence of clusters beyond fitting a straight line.
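
As a small illustration of the "extra column" distinction between classification and clustering discussed above, the sketch below treats the same made-up two-color data twice: once with a type column (a nearest-centroid rule) and once without it (a bare-bones k-means). Both methods are stand-ins chosen for brevity, not recommendations, and all names and numbers are hypothetical.

```python
from math import dist

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_nearest_centroid(points, labels):
    """Supervised: use the type column to compute one centroid per type."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return {lab: centroid(ps) for lab, ps in groups.items()}

def classify(model, p):
    """Assign p to the type whose centroid is closest."""
    return min(model, key=lambda lab: dist(p, model[lab]))

def nearest(centers, p):
    """Index of the center closest to p."""
    return min(range(len(centers)), key=lambda j: dist(p, centers[j]))

def kmeans(points, k, n_iter=20):
    """Unsupervised: no type column; k must be supplied by hand."""
    centers = list(points[:k])          # naive initialization: first k points
    for _ in range(n_iter):
        assign = [nearest(centers, p) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                 # keep the old center if a cluster empties
                centers[j] = centroid(members)
    return [nearest(centers, p) for p in points], centers

# Made-up "colors" for six objects, plus a type column.
colors = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.3)]
types = ["star", "star", "star", "quasar", "quasar", "quasar"]

# Classification: the type column is available.
model = fit_nearest_centroid(colors, types)
print(classify(model, (0.0, 0.3)))

# Clustering: the type column is ignored; groups get arbitrary indices.
assign, centers = kmeans(colors, k=2)
print(assign)
```

Note that k = 2 is set by hand here. The sketch therefore sidesteps the harder questions of whether the clusters are real and how many there are, and it ignores measurement errors entirely.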
