1. Introduction

There are at least two challenges in this task:

  1. There are many missing days, and they are distributed differently among units.
  2. All series are non-stationary because of non-business days, long-term trends, and seasonality.

Why not k-Means

There are dedicated implementations of k-means clustering for time series, such as tslearn.clustering.TimeSeriesKMeans. However, missing entries must be filled with meaningful values first, or the algorithm does not work. As shown before, there are only 186 time indices at which entries are complete for all units. Clearly, neither filling with 0 nor interpolation is a good choice for time series.

Monte Carlo simulation could be used to fill the gaps, but its effect on the result is unknown. The argument is that most clustering algorithms rely on the calculation of cluster centres, which involves many units at the same time. The joint set of missing days across those units might be large compared to that of a single pair of units, so simulated entries would appear at more time indices and distort the cluster centres.

Moreover, most methods for validating clustering results are based on the original data, so they cannot be used either. Probably the only option left is to validate based on pair-wise similarity measures, such as Euclidean distance and Pearson correlation.

For these reasons, it is a good idea to use agglomerative hierarchical clustering with single linkage, which is discussed in the following section.

2. Agglomerative Hierarchical (Single Linkage)

Agglomerative hierarchical clustering (AHC) with single linkage relies only on a distance matrix computed from the original data. Each entry of the distance matrix measures the similarity of a pair of units according to a pre-defined criterion. For similarity between time series, Pearson correlation is usually used instead of Euclidean distance.

To compute pair-wise correlations, data points with NaNs are removed for each pair. The resulting distance matrix has two NaN entries, (463, 485) and (485, 463); they are set to 0 for now.
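A minimal sketch of this step is shown below. The DataFrame `profiles` (a DatetimeIndex and one column per unit) is hypothetical; a tiny synthetic frame stands in for it so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real profiles: 200 days, 4 units, one batch of
# missing days in unit 71.
rng = np.random.default_rng(0)
idx = pd.date_range("2013-01-01", periods=200, freq="D")
profiles = pd.DataFrame(rng.normal(size=(200, 4)),
                        index=idx, columns=[70, 71, 72, 73])
profiles.iloc[20:60, 1] = np.nan

# DataFrame.corr drops NaNs pair-wise, so no global imputation is required.
corr = profiles.corr(method="pearson")

# Pairs with too little overlapping data yield NaN; set them to 0 as in the text.
corr = corr.fillna(0.0)
print(corr.round(2))
```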

Why it does not Work

When the number of clusters is set to 4, for example, the differences in cluster sizes -- 497, 1, 1, and 1 respectively -- are significant. The reason is that the units in the last three clusters do not have a strong correlation with any other unit; that is, they are outliers. Although their distribution among clusters does not have a huge impact, they prevent the algorithm from splitting the units in the first, large cluster.
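A minimal sketch of this experiment with SciPy, assuming the correlation DataFrame `corr` from the previous sketch; 1 - corr is used as the distance and 4 clusters are requested:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Convert the correlation matrix into a condensed distance matrix.
dist = 1.0 - corr.to_numpy()
np.fill_diagonal(dist, 0.0)                      # squareform expects a zero diagonal
condensed = squareform(dist, checks=False)

# Single-linkage AHC, cut into 4 clusters.
Z = linkage(condensed, method="single")
labels = fcluster(Z, t=4, criterion="maxclust")
print(np.unique(labels, return_counts=True))     # cluster sizes
```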

A method based on weighted graphs is proposed in the following section to handle this issue.

3. Max Spanning Tree on Distance Matrix

AHC with single linkage is, in essence, an optimisation problem, and running a maximum spanning tree on the distance matrix actually yields a better result. Another advantage of this approach is that the complete graph (drawn with light-grey edges in the following figure), which represents the distance matrix, provides a more intuitive way to choose clusters.
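A sketch of this idea with networkx is shown below, assuming the correlation DataFrame `corr` from earlier. Correlations serve as edge weights of the complete graph, and node degrees in the maximum spanning tree hint at representative units.

```python
import networkx as nx

# Complete graph over the units, weighted by pair-wise correlation.
G = nx.Graph()
units = list(corr.columns)
for i, u in enumerate(units):
    for v in units[i + 1:]:
        G.add_edge(u, v, weight=float(corr.loc[u, v]))

# Maximum spanning tree keeps the strongest correlations.
mst = nx.maximum_spanning_tree(G, weight="weight")

# Nodes with high degree in the tree are candidate representative units.
degrees = sorted(mst.degree, key=lambda kv: kv[1], reverse=True)
print(degrees[:10])
```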

Demo with Units 70 ~ 99

Units 70 ~ 99 are considered in the following example, resulting in a clearer figure.

The maximum spanning tree of the complete graph is highlighted by black edges. The following table highlights two units with high degrees: 77 and 79. They can be seen as representative units for two clusters. In addition, based on the graph, it seems to be a good idea to have another cluster, represented by unit 99.

unit     77   79   98   99   93   94   84   96   80   92
degree   11   10    3    3    2    2    2    2    2    1

The weakest links between these representative units (the two blue edges) are found. If they are removed, the tree becomes a forest with three components, which correspond to three clusters.
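A sketch of the cut step, assuming `mst` from the previous sketch, is given below. One simple variant is shown: removing the k-1 weakest (lowest-correlation) tree edges yields exactly k components; in the text the cuts are chosen as the weakest links on the paths between the representative units, which may or may not be the same edges.

```python
import networkx as nx

k = 3                                                     # desired number of clusters
weakest = sorted(mst.edges(data="weight"), key=lambda e: e[2])[:k - 1]

forest = mst.copy()
forest.remove_edges_from([(u, v) for u, v, _ in weakest])

# Each connected component of the forest is one cluster.
clusters = [sorted(c) for c in nx.connected_components(forest)]
for c in clusters:
    print(len(c), c)
```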

The resulting cluster with household 77 as the representative unit, for example, has another 10 members: 71, 76, 77, 82, 86, 88, 92, 94, 95, and 97.

The distance matrix can be sorted according to the clusters. Here is a heatmap for the previous example. Correlations between members within a cluster are shown by the three diagonal block matrices, whose entries tend to have higher values. In contrast, the off-diagonal block matrices represent inter-cluster correlations, which tend to be low or even negative. So the clustering result is satisfying.
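A minimal sketch of how such a heatmap can be produced, assuming `corr` and the `clusters` list from the previous sketches:

```python
import matplotlib.pyplot as plt

# Concatenate clusters to get the new row/column order.
order = [u for cluster in clusters for u in cluster]
sorted_corr = corr.loc[order, order]

fig, ax = plt.subplots(figsize=(6, 5))
im = ax.imshow(sorted_corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
fig.colorbar(im, ax=ax, label="Pearson correlation")
ax.set_title("Correlation matrix sorted by cluster")
plt.show()
```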

Determining the number of clusters is not a trivial task. With the distance matrix and the maximum spanning tree visualised, it is relatively easy to try different numbers and validate the results using heatmaps. However, this issue has not been addressed yet.

4. Effect of Different Resolutions

All units are to be clustered in this section. Hourly and daily down-sampled profiles are used in the first two parts respectively, and the results are compared.
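A minimal sketch of the down-sampling step, assuming `profiles` holds the raw readings at a higher resolution with a DatetimeIndex; whether to sum or average depends on the quantity measured.

```python
# min_count=1 keeps a NaN where an entire period is missing, instead of
# silently turning it into 0.
hourly = profiles.resample("1H").sum(min_count=1)
daily = profiles.resample("1D").sum(min_count=1)
```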

Using Hourly Profiles

Here are nodes with high degrees, which are seen as potential representative units:

unit      3   79   77   54   197   68   48   222   223   457
degree  100   46   35   24    16   12   12    10     9     9

Households 3, 79, 77, 54, and 197 are chosen as representative units, which results in 5 clusters.

Here is a heatmap of the sorted correlation matrix after clustering. The results are satisfying.

Using Daily Profiles

Here are the units with high degrees; households 80, 74, 155, and 249 are chosen as representative units, so the number of clusters is 4.

unit     80   74   155   249   388   68   456   465   478   222
degree   31   17    10    10     9    9     8     8     7     7

Here is a heatmap of the sorted correlation matrix after clustering.

Comparing Results

Units are usually labelled by integers in a clustering result, and the integer values themselves are irrelevant. Some metrics can be used to quantify the difference between two clustering results. Three symmetric metrics are used here, because the true labels are unknown.

The adjusted Rand index is 1 when two clustering results are exactly the same; it is close to 0 for random labellings and can be negative when the agreement is worse than chance. There are another two indices: normalized mutual information (NMI) is often used in the literature, while adjusted mutual information (AMI) was proposed more recently and is additionally adjusted for chance. For both, 1 means identical results, and values near 0 (or negative, in the case of AMI) indicate little agreement.

index   metric                          value       when results are different   when results are similar
rand    adjusted Rand index            -0.015766    close to 0 or negative       close to 1
AMI     adjusted mutual information     0.016228    close to 0 or negative       close to 1
NMI     normalized mutual information   0.026368    close to 0                   close to 1
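For reference, a minimal sketch of how these three metrics can be computed with scikit-learn, assuming `labels_hourly` and `labels_daily` are hypothetical integer label arrays (one entry per unit, same unit order) from the two runs above:

```python
from sklearn.metrics import (adjusted_rand_score,
                             adjusted_mutual_info_score,
                             normalized_mutual_info_score)

# All three are symmetric, so the order of the two label arrays does not matter.
print("ARI:", adjusted_rand_score(labels_hourly, labels_daily))
print("AMI:", adjusted_mutual_info_score(labels_hourly, labels_daily))
print("NMI:", normalized_mutual_info_score(labels_hourly, labels_daily))
```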

So the two clustering results are quite different. There are multiple possible reasons:

  • A lot of variation is lost when profiles are down-sampled to daily resolution.
  • Outliers (spikes) may distort the two results in different ways.
  • The fact that missing entries come in batches has a significant impact on clustering with hourly profiles.

How to handle and analyse such issues is discussed in the next section.

5. Future Work

  • Multiple results with different numbers of clusters should be investigated.
  • Model-based clustering groups units based on the parameters of fitted statistical models; it is especially handy for time series.
  • Pearson coefficients are based on the original profiles for now. Differences in scaling (amplitude) and translation (offset) most likely dominate the clustering results; preprocessing can be applied to reduce such effects (see the sketch after this list).
  • The sum of inter-cluster correlations should be minimised at the same time. The problem can then be formulated as an MILP, which, however, can be NP-hard.
  • Outliers can be detected and removed, reducing their impact on squared terms in some computations.
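As a small illustration of the preprocessing mentioned in the third bullet, one option is to z-normalise each unit's profile before computing similarities; `profiles` is the same hypothetical DataFrame used earlier.

```python
# Subtract each unit's mean and divide by its standard deviation (column-wise),
# removing offset and amplitude differences between units.
normalised = (profiles - profiles.mean()) / profiles.std()
```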

6. Results

Clustering results are stored in dictionaries, which use representative units as keys.
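A sketch of this storage format, using the demo cluster from Section 3 as an illustration (keys are representative units, values are the member units):

```python
clusters = {
    77: [71, 76, 77, 82, 86, 88, 92, 94, 95, 97],
    # 79: [...], 99: [...]  and so on for the remaining clusters
}
```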

Appendix. Four Clusters using k-Means

To show why k-means clustering is not an ideal choice, it is applied here. For example, assume there are 4 clusters.
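A minimal sketch of this baseline, assuming `profiles` is the (time x units) DataFrame; NaNs are filled with 0 only so that the algorithm runs at all, which is exactly the questionable step discussed in the Introduction.

```python
from tslearn.clustering import TimeSeriesKMeans

X = profiles.fillna(0.0).to_numpy().T          # one row (series) per unit
km = TimeSeriesKMeans(n_clusters=4, metric="euclidean", random_state=0)
labels_km = km.fit_predict(X)
```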

Performance Evaluation using Silhouette Analysis

The results are not satisfying.

For n_clusters = 4, the average silhouette score is 0.278.
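A sketch of how such a score is typically computed, assuming the filled matrix `X` and the labels `labels_km` from the previous sketch:

```python
from sklearn.metrics import silhouette_score

# Mean silhouette coefficient over all units; values near 1 indicate
# well-separated clusters, values near 0 indicate overlapping clusters.
score = silhouette_score(X, labels_km, metric="euclidean")
print(f"For n_clusters = 4, the average silhouette score is {score:.3f}")
```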