-- *A Python Course for the Humanities*

In the previous chapter we have developed a system that on the basis of examples attempts to learn a function to classify new, unseen examples. Not always do we have the luxury of a labeled data set. In fact, most of the time only unlabeled data is available. In unsupervized machine learning, or learning without supervision, we attempt the create systems that detect patterns in our data, such as groupings or clusters. Given a collection of texts, we could for example try to measure the pairwise distances between all texts and given these distances construct a grouping of the texts. Another example of unsupervized learning is the popular method of *Topic Modeling* in which we attempt to find clusters of semantically coherent words that together form a topic.

In this chapter we will introduce you to some of the techniques to cluster you data without supervision. As is the case with supervized learning, there are many different approaches to clustering. We will discuss one of the most popular ones: hierachical agglomerative clustering. We will develop a general hierarchical cluster module and implement a number of different cluster procedures.

All cluster techniques follow a similar procedure. We start with a data set consisting of $n$ different data points. The end state will be a data set with $k$ clusters each consisting of a number of original data points. The cluster procedure iteratively moves through all data points and assigns each data point to a particular cluster. Cluster techniques differ with respect to how the merging of data points happens. In this section we will look at hierarchical clustering, which is a clustering method that builds hierarchies of clusters. Typically hierarchical clustering techniques construct a so-called dendrogram, which has a top or root node covering all other data points. The leaf nodes of the dendrogram, or cluster tree, consists of all original data points. If we think of the original datapoints as singleton clusters, hierarchical clustering is an iterative procedure in which in each iteration two clusters are merged into a new cluster. This process is repeated until we arrive at the root node.

At each iteration, the two clusters with the highest similarity combined. The definition of what counts as being similar is what differentiates between the different hierarchical clustering methods. One popular clustering method is **single-linkage clustering**. Mathematically, the single linkage function – the distance $D(X,Y)$ between clusters $X$ and $Y$ – is described by the expression

where $X$ and $Y$ are any two clusters, and $d(x,y)$ represents the distance between the two data points $x$ and $y$. The two clusters $A$ and $B$ of which $D(A,B)$ is the smallest are merged into a new cluster $C$.

Let's have a look at a small example to make this all a little more concrete. The following table presents a data set of apples. Each apple is described according to a number of different features:

quality | color | firmness | taste | |
---|---|---|---|---|

0 | bad | red | firm | sweet |

1 | bad | red | firm | sweet |

2 | bad | yellow | firm | sour |

3 | good | red | soft | sour |

4 | good | yellow | soft | sweet |

5 | bad | yellow | firm | sour |

To construct clusters from these items, we need have some way to compute the distance or similarity between two items. A very simple distance method would be to take the length of the difference between the feature values of two apples $A$ and $B$:

$$ A \cup B = \{ x: x \in A | x \not\in B\} $$Take apple$_0$ and apple$_2$ as an example:

In [ ]:

```
apple_a = {"bad", "red", "firm", "sweet"}
apple_b = {"bad", "yellow", "firm", "sour"}
len(apple_a.difference(apple_b))
```

We can do the same thing for all items to obtain the pairwise distances between all apples:

In [ ]:

```
apples = [(0, {"bad", "red", "firm", "sweet"}),
(1, {"bad", "red", "firm", "sour"}),
(2, {"bad", "yellow", "firm", "sour"}),
(3, {"good", "red", "soft", "sour"}),
(4, {"good", "yellow", "soft", "sweet"}),
(5, {"bad", "yellow", "firm", "sour"})]
n = len(apples)
distances = [[0 for i in range(n)] for _ in range(n)]
for i in range(n):
for j in range(i):
distance = len(apples[i][1].difference(apples[j][1]))
distances[i][j] = distance
distances[j][i] = distances[i][j]
```

The result is a distance matrix in which for each combination of items the distance is given:

In [ ]:

```
distances
```

If you look closely, you'll notice that the upper and lower triangle of the matrix are the same. Now that we have all pairwise distances, we can start clustering the examples. We have to start with merging the two clusters of which the data points are most similar. Can you figure out which clusters are most similar to each other?

Indeed, the clusters containing apple$_2$ and apple$_5$. Let's merge these two clusters into a new cluster:

In [ ]:

```
apples[2] = (6, (apples[2], apples[5]))
del apples[5]
apples
```

We now continue our clustering procedure and again select the two items that are most similar to each other. We see that $D(\text{apple$_0$}, \text{apple$_1$}) = 1$. The same holds for $D(\text{apple$_1$}, \text{apple$_2$})$ and $D(\text{apple$_1$}, \text{apple$_5$})$. In such cases we choose one of these pairs at random. Let's go for the pair of apple$_0$ and apple$_1$ and merge these two into a new cluster:

In [ ]:

```
apples[0] = (7, (apples[0], apples[1]))
del apples[1]
apples
```

We continue and select the next pair with the shortest distance. This is $D(\text{apple$_1$}, \text{apple$_2$})$. Since apple$_1$ is in cluster 7 and apple$_2$ in cluster 6, we now merge cluster 6 and 7:

In [ ]:

```
apples[0] = (8, (apples[0], apples[1]))
del apples[1]
apples
```

Next on our list is the pair of apple$_3$ and apple$_4$:

In [ ]:

```
apples[1] = (9, (apples[1], apples[2]))
del apples[2]
apples
```

We conclude the clustering procedure by merging the last two clusters:

In [ ]:

```
apples[0] = (10, (apples[0], apples[1]))
del apples[1]
apples
```

What does our clustering tell us? We can distinguish two main groups: cluster 8 (consisting of apple$_0$, apple$_1$, apple$_2$ and apple$_5$), and cluster 9 (consisting of apple$_3$ and apple$_4$). These two clusters seem to point out that the quality of the apples is considered on the basis of their firmness and that, apparently, soft apples are considered to be better than hard apples.

Hopefully, you will now have a better feeling of hierarchical clustering and what the single linkage method is about. It is about time that we start with our own implementation of the hierarchical clustering algorithm.

**a)** We will begin with implementing a simple similarity metric, called the **Jaccard Distance**. This metric computes the dissimilarity between two sets by dividing the difference of the sizes of the union and the intersection of two sets by the size of their union:

In [ ]:

```
def jaccard_distance(a, b):
# insert your code here
# these tests should return True if your code is correct
print(jaccard_distance({'a', 'b', 'c'}, {'b', 'c', 'a'}) == 0.0)
print(round(jaccard_distance({'a', 'b', 'c'}, {'b', 'c'}), 2) == 0.33)
```

**b)** Write a function `pairwise_distances`

that takes as input a list of examples and some distance function and returns a matrix represented by a nested list which contains all pairwise distances.

In [ ]:

```
def pairwise_distances(X, distance_fn=jaccard_distance):
# insert your code here
# these tests should return True if your code is correct
X = [{'a', 'f', 'c'}, {'b', 'd', 'a'}, {'a', 'b', 'c'}, {'f', 'b', 'c'}]
print(pairwise_distances(X) == [[0, 0.8, 0.5, 0.5],
[0.8, 0, 0.5, 0.8],
[0.5, 0.5, 0, 0.5],
[0.5, 0.8, 0.5, 0 ]])
```

**c)** Next, we will write a function that takes as input a distance matrix represented as a nested list and returns two indexes $i$ and $j$ corresponding to the two clusters $A, B$ with the smallest distance $D(A, B)$:

In [ ]:

```
from itertools import combinations
def smallest_distance(dm):
# insert your code here
# these tests should return True if your code is correct
distances = [[0, 1, 2, 3, 3, 2],
[1, 0, 1, 2, 4, 1],
[2, 1, 0, 3, 3, 0],
[3, 2, 3, 0, 2, 3],
[3, 4, 3, 2, 0, 3],
[2, 1, 0, 3, 3, 0]]
print(smallest_distance(distances) == (2, 5))
```

So, we have created a function to compute the distance between two sets, a function to compute the pairwise distances between items given a distance function and a function to extract the indices corresponding to the clusters with the smallest distance. Next, we need to define a data structure to represent the hierarchical tree of clusters. First, we will define a class named `Cluster`

which represents a single node in a cluster tree. A `Cluster`

consists of a cluster ID, and a list of its child nodes. A `Cluster`

object is actually no more than a nested list in Python where each list has a unique ID. We therefore define `Cluster`

as a subclass of a `list`

object:

In [ ]:

```
class Cluster(list):
"""Represents a Cluster node in a Dendrogram. A Cluster can be
initialized using
>>> c = Cluster(1)
to create a cluster leaf node with id=1. You can also initialize
a non-terminal `Cluster` node using
>>> c = Cluster(3, Cluster(1), Cluster(2))
where Cluster(1) and Cluster(2) are the children of Cluster(3)."""
def __init__(self, id, *children):
self.id = id
super(Cluster, self).__init__(children)
def __repr__(self):
childstr = ", ".join(str(c) for c in self)
if self:
return '%s(%s, [%s])' % (type(self).__name__, self.id, childstr)
return '%s(%s)' % (type(self).__name__, self.id)
def __str__(self):
return self.pprint()
def pprint(self, indent=0):
s = '%s(%s' % (type(self).__name__, self.id)
for child in self:
if child:
s += '\n' + ' ' * (indent + 2) + child.pprint(indent=indent+2)
else:
s += '\n' + ' ' * (indent + 2) + '%r' % child
return s + ')'
```

We create a new `Cluster`

using:

In [ ]:

```
c1 = Cluster(1)
c2 = Cluster(2)
```

To create a non-terminal `Cluster`

we initialize a `Cluster`

object as follows:

In [ ]:

```
print(Cluster(3, c1, c2))
```

A `ClusterTree`

is an object that consists of multiple `Cluster`

objects. At initialization, each `Cluster`

node obtains an ID within the range 0 to $n$ where $n$ is the number of data points to be clustered. You can also pass a list of labels to the `labels`

argument of the constructor, which will then be used as ID's. The class `ClusterTree`

is responsible for merging two `Cluster`

nodes into a new `Cluster`

. Complete the method `merge`

. It takes as input two indices $i$ and $j$, corresponding to two `Cluster`

objects $C_i$ and $C_j$. These two clusters are merged into a new cluster which takes the original position of cluster $C_i$. Don't forget to remove cluster $C_j$ and initialize the new cluster with an appropriate ID.

In [ ]:

```
class ClusterTree:
"""A ClusterTree, or Dendrogram consists of one or more
`Cluster` objects. Initialize a `ClusterTree` using
>>> tree = ClusterTree(n=10)
where n is the number of original data points to be clustered."""
def __init__(self, n, labels=None):
self._n = n
if labels is None:
labels = range(n)
self._clusters = [Cluster(i) for i in labels]
def merge(self, i, j):
# insert your code here
def __str__(self):
return '%s' % self._clusters[0]
# these tests should return True if your code is correct
tree = ClusterTree(5)
tree.merge(1, 2)
print(len(tree._clusters[1]) == 2)
print(tree._clusters[1].id == 5)
```

Now it's time for the most tricky part of our clustering algorithm: computing the linkage function. We have already implemented a function that extracts the two indices corresponding to the two clusters that are closest to each other. But how do we find the indices of two clusters when they have been merged? One way would be to recursively go through the `ClusterTree`

and extract the indices from there. There is however another way which makes it possible to use the same `smallest_distance`

function at each iteration of the clustering procedure. As we will see this method is particularly beneficial because it will allow us to implement a number of different linkage functions in a elegant and simple way.

Recall that in the single linkage function – the distance $D(X,Y)$ between clusters $X$ and $Y$ – is described by the expression

$$D(X,Y)=\min_{x\in X, y\in Y} d(x,y),$$where $X$ and $Y$ are any two clusters, and $d(x,y)$ represents the distance between the two data points $x$ and $y$. Since we are only interested in the minimal distance of ${x\in X, y\in Y}$, we can store that information directly in our distance matrix. Let's go through an example to show you more clearly what this means.

Say we have a distance matrix containing the pairwise distance between 6 clusters:

```
distances = [[0, 1, 2, 3, 3, 1],
[ , 0, 1, 2, 4, 1],
[ , , 0, 3, 3, 0],
[ , , , 0, 2, 3],
[ , , , , 0, 3],
[ , , , , , 0]]
```

The smallest distance can be found between cluster 2 and cluster 5, which can be obtained using our function `smallest_distance`

:

```
>>> smallest_distance(distances)
(2, 5)
```

So far we haven't done anything different. Here comes the crucial step. Before we enter the next iteration of our clustering procedure, we update the distance matrix to reflect the minimal distances between all clusters with respect to cluster 2 and cluster 5. For example, we compare cluster 0 to both cluster 2 and cluster 5. The distance between cluster 0 and cluster 2 is 2. The distance between cluster 0 and cluster 5 is 1, which is the smallest. We update the largest distance (between cluster 0 and cluster 2) to 1 to represent the minimal distance between 0 and any of cluster 2 or cluster 5:

```
distances = [[0, 1, 1, 3, 3, 1],
[ , 0, 1, 2, 4, 1],
[ , , 0, 3, 3, 0],
[ , , , 0, 2, 3],
[ , , , , 0, 3],
[ , , , , , 0]]
```

we do this for all remaining clusters to obtain the following distance matrix:

```
distances = [[0, 1, 1, 3, 3, 1],
[ , 0, 1, 2, 4, 1],
[ , , 0, 3, 3, 0],
[ , , , 0, 2, 3],
[ , , , , 0, 3],
[ , , , , , 0]]
```

Since all distances between any cluster and the cluster (2, 5) are now stored in both the rows and columns of cluster 2 and 5, we can remove one of them to obtain the following distance matrix.

```
distances = [[0, 1, 1, 3, 3],
[ , 0, 1, 2, 4],
[ , , 0, 3, 3],
[ , , , 0, 2],
[ , , , , 0]]
```

From here on the same procedure is repeated until there is only one cluster left. So, again we extract the two indices corresponding to the two clusters that are closest to each other.

```
>>> smallest_distance(distances)
(1, 2)
```

We update the distance matrix

```
distances = [[0, 1, 2, 3],
[ , 0, 3, 3],
[ , , 0, 2],
[ , , , 0]]
```

and go on with the next iteration.

**a)** Implement the function called `single_linkage`

. It takes as input a distance matrix, and two indices $i$ and $j$ corresponding to the two clusters in the distance matrix that are closest to each other. Update the matrix according to the procedure described above and return the new matrix without row$_j$ and column$_j$.

In [ ]:

```
def single_linkage(dm, i, j):
# insert your code here
# these tests should return True if your code is correct
distances = [[0, 1, 2, 3, 3, 1],
[1, 0, 1, 2, 4, 1],
[2, 1, 0, 3, 3, 0],
[3, 2, 3, 0, 2, 3],
[3, 4, 3, 2, 0, 3],
[1, 1, 0, 3, 3, 0]]
print(single_linkage(distances, 2, 5) == [[0, 1, 1, 3, 3],
[1, 0, 1, 2, 4],
[1, 1, 0, 3, 3],
[3, 2, 3, 0, 2],
[3, 4, 3, 2, 0]])
```

**b)** Great. Now that we have our linkage function in place, we start working on the final piece of the clustering algorithm: the main iterative loop, of which the skeleton is given below. Fill in the missing elements.

In [ ]:

```
def cluster(data_points, labels=None, linkage=single_linkage, distance_fn=jaccard_distance):
# initialize a `ClusterTree` with n=len(data_points)
tree = # insert your code here
# compute the pairwise distances between all data points
# using the provided distance function
dm = # insert your code here
while len(dm) > 1:
# extract the indices of the clusters corresponding to the
# two closest clusters in the distance matrix
i, j = # insert your code here
# update the distance matrix using the provided linkage function
dm = # insert your code here
# merge the two clusters in the ClusterTree:
# insert your code here
return tree
# these tests should return True if your code is correct
apples = [{"bad", "red", "firm", "sweet"}, {"bad", "red", "firm", "sour"},
{"bad", "yellow", "firm", "sour"}, {"good", "red", "soft", "sour"},
{"good", "yellow", "soft", "sweet"}, {"bad", "yellow", "firm", "sour"}]
tree = cluster(apples)
print(tree._clusters[0].id == 10)
```

Single linkage is just one of many different clustering strategies in hierarchical clustering. Another important linkage function is **complete linkage**. Complete linkage is very similar to single linkage except that we do not take the minimal distance between two clusters but the maximal distance. The distance $D(X,Y)$ between clusters $X$ and $Y$ — is described by the following expression

where $d(x,y)$ is the distance between elements $x \in X$ and $y \in Y$. $X$ and $Y$ are both clusters.

The difference between single linkage and complete linkage lies only in the function used to compute the distance of a cluster to all other clusters (i.e. `min`

or `max`

). We could write a function `complete_linkage`

that is almost equal to our `single_linkage`

function except for the function to compute the distances. However, this would imply that we repeat quite a bit of code, which is never good practice. Instead, we will make an abstraction over the two linkage functions in the function called `general_linkage`

. This function takes the same arguments as our `single_linkage`

function plus an argument that specifies the distance function used.

In [ ]:

```
def general_linkage(dm, i, j, distance_fn):
for k in range(len(dm)):
if k != i and k != j:
dm[i][k] = distance = distance_fn(dm[i][k], dm[j][k])
dm[k][i] = distance
dm = [[val for c, val in enumerate(row) if c != j]
for r, row in enumerate(dm) if r != j]
return dm
```

This general formulation allows us to redefine the `single_linkage`

function as follows:

In [ ]:

```
def single_linkage(dm, i, j):
return general_linkage(dm, i, j, min)
```

Implement the complete_linkage function.

In [ ]:

```
def complete_linkage(dm, i, j):
# insert your code here
# these tests should return True if your code is correct
distances = [[0, 1, 2, 3, 3, 1],
[1, 0, 1, 2, 4, 1],
[2, 1, 0, 3, 3, 0],
[3, 2, 3, 0, 2, 3],
[3, 4, 3, 2, 0, 3],
[1, 1, 0, 3, 3, 0]]
complete_linkage(distances, 2, 5) == [[0, 1, 2, 3, 3],
[1, 0, 1, 2, 4],
[2, 1, 0, 3, 3],
[3, 2, 3, 0, 2],
[3, 4, 3, 2, 0]]
```

Now that we have implemented all functions to perform hierarchical cluster analysis, let's apply the method to a more realistic and more interesting example than clustering apples. Within the family of Indo-European languages we can distinguish multiple sub-families, such as Germanic languages (e.g. German and Dutch) or Romance languages (e.g. French and Italian). In the cell below I listed the first 10 numerals in the variable `numerals`

for each of the 10 languages stored in the variable `languages`

:

In [ ]:

```
numerals = [
["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"],
["een", "twee", "drie", "vier", "vijf", "zes", "zeven", "acht", "negen", "tien"],
["ien", "twa", "trije", "fjouwer", "fiif", "seis", "san", "acht", "njoggen", "tsien"],
["eins", "zwei", "drei", "vier", "funf", "sechs", "sieben", "acht", "neun", "zehn"],
["en", "to", "tre", "fire", "fem", "seks", "sju", "atte", "ni", "ti"],
["én", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni", "ti"],
["en", "tva", "tre", "fyra", "fem", "sex", "sju", "atta", "nio", "tio"],
["uno", "dos", "tres", "cuatro", "cinco", "seis", "siete", "ocho", "nueve", "diez"],
["un", "deux", "trois", "quatre", "cinq", "six", "sept", "huit", "neuf", "dix"],
["uno", "due", "tre", "quattro", "cinque", "sei", "sette", "otto", "nove", "dieci"]]
languages = ['English', 'Dutch', 'Frisian', 'German', 'Norwegian',
'Danish', 'Swedish', 'Spanish', 'French', 'Italian']
```

We apply hierarchical cluster analysis to investigate whether we can detect some interesting and meaningful groupings of these languages on the basis of their first 10 numerals.

We first need to decide how to compute the distance between two languages. One simple way is to take the sum of the distances between each numeral in language $A$ and language $B$. Mathematically, the distance between two languages $A$ and $B$ is described by the following expression:

$$ D(A, B) = \sum^n_{i=1} d(A_i, B_i) $$$A$ and $B$ are two vectors with $n$ items and $d$ is some distance function. How do we compute the distance between two numerals like Dutch *een* and English *one*? There are more appropriate and advanced methods to compute this distance, but for the moment let's make use of the jaccard distance function, we defined above:

In [ ]:

```
dutch_one = set(list("een"))
english_one = set(list("one"))
jaccard_distance(dutch_one, english_one)
```

**a)** Write a function `summed_jaccard_distance`

that takes as input two equally sized lists $A$ and $B$. Return $D(A, B) = \sum^n_{i=1} d(A_i, B_i)$ where $d$ is the jaccard distance.

In [ ]:

```
def summed_jaccard_distance(A, B):
# insert your code here
# these tests should return True if your code is correct
print(round(summed_jaccard_distance(numerals[0], numerals[4]), 2) == 5.57)
print(round(summed_jaccard_distance(numerals[5], numerals[6]), 2) == 4.8)
```

**b)** Perform a cluster analysis using single linkage and the summed jaccard distance on the numerals data. Report on your findings.

In [ ]:

```
solution = # insert your code here
print(solution)
```

**c)** This website contains a data set with the numbers from 1 to 10 in over 5000 languages. Visit the website and add some other languages (preferably from different language families) to the data set. Run the analyis again and report on you findings.

You've reached the end of the chapter. Ignore the code below, it's just here to make the page pretty:

In [387]:

```
from IPython.core.display import HTML
def css_styling():
styles = open("styles/custom.css", "r").read()
return HTML(styles)
css_styling()
```

Out[387]:

Python Programming for the Humanities by http://fbkarsdorp.github.io/python-course is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/fbkarsdorp/python-course.