Unsupervised clustering processes bring order to unexplored data by creating groups, or clusters, of input data objects based on properties which are otherwise difficult to see with the naked eye. The similarities and differences between clusters and their object members reveal a structure that can be further investigated. Clustering may be the first step in an analytical process to understand a large collection of data with multiple dimensions which are difficult to view at once. A geographic clustering analysis might originate from a dataset of complex spatio-temporal inputs. After clustering, the data is mapped onto geographic space with its clusters, showing new relationships between places and the properties of the data.
This post is an example of unsupervised clustering of time series data. Clustering time series can be especially revealing of social patterns because it can compare the shapes of the series. When time series are plotted, the magnitude of the measurements is often the most salient feature, which can mask the ways that the measurements change over time. Clustering can explicitly account for shape rather than magnitude, so that time series are compared based on how they change.
Take traffic patterns as an example. The raw amount of traffic is important for knowing and planning for congestion. But the shape of daily traffic at sites throughout the city tells us how traffic changes with respect to the normal workday, how late into the night vehicles tend to be active, and how much traffic increases at lunch time. Dublin tracks traffic with sensors embedded in the roadway at specific intersections throughout the city. The Sydney Coordinated Adaptive Traffic System (SCATS) tracks vehicles and reports volumes every 6 minutes, and creates priority signals for public and emergency vehicles. Samples of the traffic volume data are available as open data from Smart Dublin. This particular sample ranges from January 1 through April 30.
To view the code used to create this analysis, go to Sam Stehle’s blog.
To prepare for creating clusters, we first do two things. First, for each sensor, we get the daily time series and split them into 7 groups, one for each day of the week. It doesn’t make sense to compare the commuting patterns on Monday with the ones from Saturday, because the weekday workday creates an inherently different traffic pattern. Second, the measurements for each sensor on each day of the week are averaged, creating an average commuting pattern for each location for each day of the week.
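The two preparation steps can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind this analysis: the function name `average_daily_profiles`, the `(sensor_id, timestamp, volume)` tuple layout, and the synthetic hourly readings are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def average_daily_profiles(readings):
    """Average readings into one daily profile per (sensor, weekday).

    `readings` is an iterable of (sensor_id, timestamp, volume) tuples
    (a hypothetical layout). Returns a dict of the form
    {(sensor_id, weekday): {minute_of_day: mean_volume}},
    where weekday is 0 for Monday through 6 for Sunday.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for sensor_id, timestamp, volume in readings:
        # Step 1: split by day of the week and by time slot within the day.
        minute_of_day = timestamp.hour * 60 + timestamp.minute
        buckets[(sensor_id, timestamp.weekday())][minute_of_day].append(volume)
    # Step 2: average all readings that share a sensor, weekday and time slot.
    return {
        key: {minute: mean(values) for minute, values in sorted(slots.items())}
        for key, slots in buckets.items()
    }


# Synthetic demo: two weeks of hourly readings for one made-up sensor "A".
start = datetime(2019, 1, 7)  # an arbitrary Monday, chosen for the demo
demo = []
for step in range(14 * 24):
    ts = start + timedelta(hours=step)
    week = step // (7 * 24)
    demo.append(("A", ts, ts.hour + 2 * week))

profiles = average_daily_profiles(demo)
monday_profile = profiles[("A", 0)]
```

Each value in `profiles` is one average daily time series, which is the unit that gets clustered in the next step.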
With these average daily time series, we can perform the following analysis:
-4 clusters
-shape-based time series analysis (rather than magnitude of traffic)
-partitional clustering process (creates exclusive clusters, rather than hierarchical ones)
-define each cluster by its centroid: the time series of the sensor that is most representative of the cluster it is assigned to.
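The settings listed above can be illustrated with a small, self-contained sketch. It is not the implementation used for this analysis. As assumptions for the example, it uses Euclidean distance between z-normalized profiles as a simple stand-in for a shape-based distance, a basic k-medoids loop as the partitional method (so each cluster is represented by an actual member series), a deterministic farthest-first initialization, and synthetic profiles. The names `z_normalize`, `k_medoids` and `bump` are hypothetical.

```python
import numpy as np


def z_normalize(series):
    """Rescale to zero mean and unit variance so only the shape remains."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean()
    std = series.std()
    return centered / std if std > 0 else centered


def k_medoids(profiles, k, n_iter=50):
    """Partition equal-length profiles into k exclusive clusters.

    Returns (labels, medoids): the cluster index of every profile and the
    index of the most representative profile of each cluster.
    """
    z = np.array([z_normalize(p) for p in profiles])
    # Pairwise Euclidean distances between the z-normalized profiles.
    dist = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    # Farthest-first initialization: start from the most central profile,
    # then repeatedly add the profile farthest from the chosen medoids.
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        nearest = dist[:, medoids].min(axis=1)
        medoids.append(int(nearest.argmax()))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size:
                # New medoid: the member with the smallest total distance
                # to the other members of its cluster.
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = dist[:, medoids].argmin(axis=1)
    return labels, medoids


# Synthetic hourly profiles: each shape appears at two very different volumes.
hours = np.arange(24)


def bump(center, width=2.0):
    return np.exp(-0.5 * ((hours - center) / width) ** 2)


morning, evening = bump(8), bump(17)
demo_profiles = [
    100 * morning, 900 * morning,
    100 * evening, 900 * evening,
    100 * (morning + evening), 900 * (morning + evening),
]
labels, medoids = k_medoids(demo_profiles, k=3)
```

In this toy data, the low-volume and high-volume copies of each shape end up in the same cluster, which is the point of comparing shape rather than magnitude. Calling `k_medoids` with `k=4` on real average daily profiles would mirror the settings above, under the same assumptions.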
The representative centroids of each of the 4 clusters on each day are shown in the plots below. Note that the clustering process frequently picks up the same three clusters:
-one cluster has the typical commuting pattern, with a peak at the morning rush hour and another peak in the evening rush hour.
-one cluster’s time series has just one large peak at the morning rush hour
-one cluster’s time series has just one large peak at the evening rush hour
Some other interesting patterns that the clustering picked out:
-on Sunday, one of the clusters is defined by a peak which is larger overnight than during the day! Later closing times on Saturday night and the lack of a regular work day for most people lead to high traffic late into Sunday morning
-for locations which have a higher peak in the evening rush hour, that peak is generally much higher than the morning peak at the sensors which have a high peak in the morning rush hour
-Fridays have smaller evening peaks, presumably because workers leave work earlier for the weekend, leading to more consistent traffic throughout the day
-some locations have consistently low traffic. This is not surprising (depending on where those locations are), but their shapes are still picked out by the clustering process as being significantly different from those of other locations.
Is there a spatial pattern in the clusters? Below is a map of the 4 Monday clusters. Click here for an interactive version.
*The GPS points are not always accurate, with some locations offset from the roadway. This is an ongoing problem with GPS accuracy and data representation, particularly with respect to the SCATS data.
Some interesting patterns which are evident:
-cluster 4’s shape is not like the others, in that there is no noticeable peak of traffic in the morning or evening. Traffic is high throughout the day. Not surprisingly, the locations which were put into cluster 4 are in consistently high-traffic areas: Dame Street, O’Connell Street, and South Circular Road near Kilmainham.
-clusters 2 and 3 are essentially reversed – cluster 2 has a morning traffic peak and cluster 3 has an evening traffic peak. You will often find these sensors on the same road, on different lanes of travel. Look at Naas Road in southwest Dublin or Old Cabra Road in northwest Dublin.
Next, we will compare this analysis with data from 2018 and 2020.