Projects between meetings 1 and 2
Form groups of approximately 5 participants, preferably with diverse backgrounds, and choose one of the following projects. Each group prepares a 20-minute summary of its work on the project, to be presented during the second meeting of the course.
We consider data sets in the form of a collection of distance spaces {(X_1, d_1), ..., (X_n, d_n)}. If you choose a data analysis project, you can use a data set of your choice or one of the following:
- Sampling of shapes with noise. Ground truth labels: each shape belongs to a class.
- Point processes. Ground truth labels: each type of point process belongs to a class.
- Repeated sub-samplings from distance spaces (X_1, d_1), ..., (X_n, d_n). Ground truth labels: each sub-sample of the distance space (X_i, d_i) belongs to class i.
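For the first option, a minimal sketch of how such a data set could be generated (the shapes, sample sizes, and noise level are illustrative choices, not prescribed by the course):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def noisy_circle(n_points, radius, noise, rng):
    """Sample n_points from a circle of the given radius, with Gaussian noise."""
    angles = rng.uniform(0, 2 * np.pi, n_points)
    points = radius * np.column_stack([np.cos(angles), np.sin(angles)])
    return points + rng.normal(scale=noise, size=points.shape)

rng = np.random.default_rng(0)
distance_spaces, labels = [], []
for label, radius in enumerate([1.0, 2.0]):   # two shape classes
    for _ in range(10):                       # ten noisy samples per class
        points = noisy_circle(100, radius, noise=0.05, rng=rng)
        # Each sample becomes a finite distance space via its Euclidean distance matrix.
        distance_spaces.append(squareform(pdist(points)))
        labels.append(label)
```

The same pattern works for the other two options: replace `noisy_circle` by a point-process sampler, or by repeated sub-sampling from fixed distance matrices.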
You could browse https://snap.stanford.edu/data/#socnets for interesting data sets related to graphs.
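A graph can be turned into a finite distance space by taking shortest-path distances between its nodes. A sketch using networkx, with a small built-in graph standing in for a SNAP edge list (for SNAP data you would instead load the downloaded file, e.g. with `nx.read_edgelist`):

```python
import numpy as np
import networkx as nx

# Small built-in graph as a stand-in for a downloaded SNAP data set.
G = nx.karate_club_graph()
nodes = list(G.nodes())

# Shortest-path distances make (nodes, d) a finite distance space.
lengths = dict(nx.all_pairs_shortest_path_length(G))
d = np.array([[lengths[u][v] for v in nodes] for u in nodes])
```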
Project 1: Kleinberg’s theorem
Read and discuss the paper: An impossibility theorem for clustering, by J. Kleinberg, https://www.cs.cornell.edu/home/kleinber/nips15.pdf
Questions you might think about: What is your opinion on the consistency axiom? Present examples where consistency might or might not be appropriate. What would you propose as suitable axioms for clustering?
Project 2: Fast hierarchical clustering
Read and discuss the paper: Fast Hierarchical, Agglomerative Clustering Routines for R and Python, by D. Müllner, Journal of Statistical Software 53 (2013), no. 9, 1–18, http://www.jstatsoft.org/v53/i09/
Implement the definition of hierarchical clustering presented in our course notes, and compare it to the algorithm presented in the paper or to the standard Python implementations.
Think about how permuting the elements of a distance space impacts the associated hierarchical clustering dendrogram.
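As a starting point for the permutation question, a small experiment with scipy's standard single-linkage implementation (the choice of scipy, point cloud, and random seed are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
points = rng.normal(size=(30, 2))
perm = rng.permutation(len(points))

# Single-linkage merge heights: the third column of the linkage matrix.
heights = linkage(pdist(points), method="single")[:, 2]
heights_perm = linkage(pdist(points[perm]), method="single")[:, 2]

# The drawn dendrogram depends on the input order (leaves get reordered),
# but the multiset of merge heights, i.e. the underlying ultrametric data,
# is permutation invariant.
assert np.allclose(np.sort(heights), np.sort(heights_perm))
```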
Project 3: Spaces of Clusterings
Read and discuss the paper: Spaces of Clusterings, by A. Rolle and L. Scoccola, https://arxiv.org/abs/1902.01436
Project 4: New Invariants for Dendrograms
Observe that the barcode and the stable rank encode the same information about a dendrogram, and propose an interpretation of these invariants in terms of the dendrogram.
Identify pairs of non-isomorphic dendrograms that have the same stable rank, and propose invariants to distinguish between them. Stable components, as defined in [1], are an example of an invariant you might consider.
Think about ways of comparing and interpreting your proposed invariants. One possible strategy is to follow these steps:
- Choose a data set
- Implement the stable rank, stable components and the new invariants you have proposed.
- Compute the single-linkage dendrogram and/or the density-based dendrogram (see Example 3 in [1]) of the distance spaces in your data.
- Compute the stable rank, the stable components, and your proposed invariants on the single-linkage dendrogram and/or the density-based dendrogram.
- Compare the results you obtain with the single-linkage dendrogram and the density-based dendrogram (if you considered both). Compare the results obtained using the different types of invariants on dendrograms, or combinations of them. You might also vary the level of noise in your data and assess the robustness of your invariants with respect to noise.
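A possible starting point for the stable rank step, under the common reading of the H0 stable rank of a single-linkage dendrogram as the non-increasing step function sending a scale t to the number of clusters at that scale (the course notes or [1] may use a different normalization; check before relying on this):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

def merge_heights(points):
    """Sorted merge heights of the single-linkage dendrogram.

    These are the jump points of the cluster-count step function:
    the rank at scale t is  n - #(heights <= t).
    """
    return np.sort(linkage(pdist(points), method="single")[:, 2])

def rank_at(t, heights, n):
    """Number of single-linkage clusters at scale t."""
    return n - np.searchsorted(heights, t, side="right")

rng = np.random.default_rng(2)
points = rng.normal(size=(50, 2))
heights = merge_heights(points)

# At scale 0 every point is its own cluster; at a large scale everything has merged.
assert rank_at(0.0, heights, 50) == 50
assert rank_at(1e9, heights, 50) == 1
```

Stable components and your own invariants would then be implemented alongside this, so that all of them can be evaluated on the same dendrograms.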
[1] Stable components and layers https://arxiv.org/abs/1905.05788
Project 5: Dendrograms and Stable Rank for Classification
In this project we build features from clusterings of the data. A possible pipeline is as follows:
- Choose a data set
- Compute a hierarchical clustering (single, complete, or average linkage) of each distance space in your data set.
- Compare the obtained dendrograms, either by using a distance on the associated ultrapseudometrics or a distance between the associated stable ranks, which are piecewise constant functions (pcfs). Visualize and interpret the result. Can you recover the underlying community structure?
- Vectorize the dendrograms and use these features as input for an ML algorithm to perform the classification task of predicting the class of each distance space.
- Possible questions to assess this method: Are these descriptors discriminative? Does the choice of single, complete, or average linkage strongly influence the result? Does the choice of the distances defining your input distance spaces strongly influence the result? If you chose an artificial data set on which you can control the level of noise, are the descriptors robust to noise?
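The pipeline above can be sketched end to end. This is one possible instantiation, not the prescribed one: it vectorizes each dendrogram by sampling the single-linkage cluster-count function (a discretized, stable-rank-style pcf) on a fixed grid, and feeds the vectors to a scikit-learn classifier. The toy data, grid, and classifier are all illustrative choices.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def dendrogram_vector(dist_matrix, grid):
    """Vectorize a dendrogram: number of single-linkage clusters at each scale in grid."""
    heights = np.sort(linkage(squareform(dist_matrix), method="single")[:, 2])
    n = dist_matrix.shape[0]
    return np.array([n - np.searchsorted(heights, t, side="right") for t in grid])

# Toy data set: distance spaces from two classes (tight vs. spread point clouds).
rng = np.random.default_rng(3)
grid = np.linspace(0.0, 2.0, 32)
X, y = [], []
for label, scale in enumerate([0.3, 1.0]):
    for _ in range(15):
        points = rng.normal(scale=scale, size=(40, 2))
        X.append(dendrogram_vector(squareform(pdist(points)), grid))
        y.append(label)

# Predict the class of each distance space from its dendrogram features.
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         np.array(X), np.array(y), cv=3)
```

Swapping `method="single"` for `"complete"` or `"average"` in `dendrogram_vector` is one way to probe the linkage-sensitivity question above.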