

Hierarchical agglomerative clustering

Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering or HAC. Top-down clustering requires a method for splitting a cluster. It proceeds by splitting clusters recursively until individual documents are reached (see Section 17.6). HAC is more frequently used in IR than top-down clustering and is the main subject of this chapter.

Figure 17.1: A dendrogram of a single-link clustering of 30 documents from Reuters-RCV1. Two possible cuts of the dendrogram are shown: at 0.4 into 24 clusters and at 0.1 into 12 clusters.

Before looking at specific similarity measures used in HAC in Sections 17.2-17.4, we first introduce a method for depicting hierarchical clusterings graphically, discuss a few key properties of HACs, and present a simple algorithm for computing an HAC.

An HAC clustering is typically visualized as a dendrogram as shown in Figure 17.1. Each merge is represented by a horizontal line. The y-coordinate of the horizontal line is the similarity of the two clusters that were merged, where documents are viewed as singleton clusters. We call this similarity the combination similarity of the merged cluster. For example, the combination similarity of the cluster consisting of Lloyd's CEO questioned and Lloyd's chief / U.S. grilling in Figure 17.1 is $\approx 0.56$. We define the combination similarity of a singleton cluster as its document's self-similarity (which is 1.0 for cosine similarity).

By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering. For example, we see that the two documents entitled War hero Colin Powell were merged first in Figure 17.1 and that the last merge added Ag trade reform to a cluster consisting of the other 29 documents.
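
A dendrogram like the one in Figure 17.1 can be produced with standard tools. The following is a minimal sketch, not taken from the book, that uses SciPy with a few hypothetical three-dimensional document vectors; SciPy's linkage routine expects distances, so cosine similarities are converted via distance = 1 - similarity.

# Minimal sketch: single-link clustering and dendrogram for hypothetical
# document vectors. SciPy works with distances; cosine distance = 1 - cosine
# similarity.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

docs = np.array([        # hypothetical document vectors, one row per document
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.6, 0.4],
])

Z = linkage(pdist(docs, metric="cosine"), method="single")  # single-link HAC
dendrogram(Z, labels=["d1", "d2", "d3", "d4"])
plt.ylabel("cosine distance (1 - similarity)")
plt.show()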

A fundamental assumption in HAC is that the merge operation is monotonic. Monotonic means that if $s_1,s_2,\ldots,s_{K-1}$ are the combination similarities of the successive merges of an HAC, then $s_1 \geq s_2 \geq \ldots \geq s_{K-1}$ holds. A non-monotonic hierarchical clustering contains at least one inversion $s_i < s_{i+1}$ and contradicts the fundamental assumption that we chose the best merge available at each step. We will see an example of an inversion in Figure 17.12.
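
Checking monotonicity is straightforward. The helper below is an illustration, not part of the book; it simply scans a sequence of combination similarities for an inversion.

# Returns True if the combination similarities s_1, ..., s_{K-1} contain an
# inversion s_i < s_{i+1}, i.e. if the merge sequence is non-monotonic.
def has_inversion(similarities):
    return any(a < b for a, b in zip(similarities, similarities[1:]))

print(has_inversion([0.70, 0.56, 0.40, 0.10]))  # False: monotonic
print(has_inversion([0.70, 0.40, 0.56, 0.10]))  # True: inversion after step 2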

Hierarchical clustering does not require a prespecified number of clusters. However, in some applications we want a partition of disjoint clusters just as in flat clustering. In those cases, the hierarchy needs to be cut at some point, for example at a prespecified level of combination similarity (such as the cuts at 0.4 and 0.1 in Figure 17.1) or at a prespecified number of clusters, as in flat clustering.
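
With SciPy, for instance, such a cut can be obtained from a linkage matrix with fcluster. The sketch below is an illustration under the same assumptions as the dendrogram sketch above (hypothetical vectors, single-link merges); it cuts either at a similarity threshold or at a fixed number of clusters.

# Illustrative sketch: flattening a single-link hierarchy into a partition.
# fcluster() works on distances, so a similarity threshold s corresponds to
# the distance threshold 1 - s.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
                 [0.1, 0.7, 0.2], [0.0, 0.6, 0.4]])   # hypothetical vectors
Z = linkage(pdist(docs, metric="cosine"), method="single")

labels = fcluster(Z, t=1 - 0.4, criterion="distance")   # cut at similarity 0.4
labels_k = fcluster(Z, t=2, criterion="maxclust")        # or request 2 clusters
print(labels, labels_k)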

Figure 17.2: A simple, but inefficient HAC algorithm.

Figure 17.3: The different notions of cluster similarity used by the four HAC variants. An inter-similarity is a similarity between two documents from different clusters.

A simple, naive HAC algorithm is shown in Figure 17.2. We first compute the $N \times N$ similarity matrix $C$. The algorithm then executes $N-1$ steps of merging the currently most similar clusters. In each iteration, the two most similar clusters are merged and the rows and columns of the merged cluster $\ell$ in $C$ are updated. The clustering is stored as a list of merges in $A$. $I$ indicates which clusters are still available to be merged. The function SIM$(\ell,m,j)$ computes the similarity of cluster $j$ with the merge of clusters $\ell$ and $m$. For some HAC algorithms, SIM$(\ell,m,j)$ is simply a function of $C[j][\ell]$ and $C[j][m]$, for example, the maximum of these two values for single-link.
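
The following is a minimal Python sketch of this naive procedure, not the book's pseudocode, instantiated with the single-link criterion SIM$(\ell,m,j) = \max(C[j][\ell], C[j][m])$; sim stands for any document-document similarity function, for example cosine similarity on length-normalized vectors.

import numpy as np

def simple_hac(docs, sim):
    """Naive HAC: N-1 merges of the currently most similar active clusters.
    Returns the clustering as a list of merges (l, m)."""
    N = len(docs)
    C = np.array([[sim(di, dj) for dj in docs] for di in docs])  # N x N similarities
    I = [True] * N            # which clusters are still active
    A = []                    # the clustering, stored as a sequence of merges
    for _ in range(N - 1):
        # Find the most similar pair of distinct active clusters.
        l, m = max(((i, j) for i in range(N) for j in range(N)
                    if i != j and I[i] and I[j]),
                   key=lambda p: C[p[0]][p[1]])
        A.append((l, m))
        # Update row and column of the merged cluster l (single-link: maximum).
        for j in range(N):
            C[l][j] = C[j][l] = max(C[l][j], C[m][j])
        I[m] = False          # deactivate cluster m
    return A

For example, simple_hac(vectors, lambda a, b: float(a @ b)) runs single-link HAC with dot-product similarity, which equals cosine similarity for unit-length vectors.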

We will now refine this algorithm for the different similarity measures of single-link and complete-link clustering (Section 17.2) and group-average and centroid clustering (Sections 17.3 and 17.4). The merge criteria of these four variants of HAC are shown in Figure 17.3.
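
In terms of the matrix update in the sketch above, single-link and complete-link differ only in how SIM$(\ell,m,j)$ combines $C[j][\ell]$ and $C[j][m]$, as the following illustrative snippet shows; group-average and centroid clustering need additional per-cluster statistics (document counts and vector sums) and cannot be expressed this way.

# Illustrative update rules for SIM(l, m, j) as a function of C[j][l] and
# C[j][m]; sufficient for single-link and complete-link only.
def sim_single_link(c_jl, c_jm):
    return max(c_jl, c_jm)   # similarity of the two closest members

def sim_complete_link(c_jl, c_jm):
    return min(c_jl, c_jm)   # similarity of the two most dissimilar members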

Figure 17.4: A single-link and complete-link clustering example; the solid line marks the similarity of the two left two-point clusters.

