Clustering analysis of SDA results


After a docking simulation (or an association rate simulation where the complexes are recorded), the complexes file contains the most energetically favorable docked positions. There are typically between 500 and, when using solute flexibility, up to 20,000 entries in this file. Using a clustering method allows these docking positions to be grouped into a few (3 to 10) representative clusters that are meaningful for analysis.
Different methods of clustering are available in SDA:

Average-linkage algorithm

Tool : auxi/clust

This clustering tool allows the clustering of docking positions using a hierarchical agglomerative average-linkage clustering algorithm. For this, the backbone RMSDs of all docked structures against each other are calculated. Please note, with this method, two structures with identical centers but different orientations can have large RMSDs.

The algorithm starts with each docking result representing its own cluster. At each step, the algorithm merges structures in clusters based on the average linkage distance (distance between the centers of clusters). Once the given number of clusters is reached, representatives of these clusters are written. These representative structures are chosen based on the closest distance to the center of the corresponding cluster.

All options are described in the pdf documentation

A typical usage of this clustering in SDA:

# generate a complexes file suitable for the clust program
read_record complexes -format_cluster

# prepare cluster.dat file
bin/clust -cl -n 2000 -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb
# make pdb files for the representatives of the 7 most populated clusters
bin/clust -re -clcl 7 -f55 fort.55 -p1 p1.pdb -p2 p2.pdb
# prepare score file
bin/clust -sc -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb
# make energy analysis and generate pdbs for the 7 most populated clusters
bin/clust -an -p -clcl 7 -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb > cluster-out-info-2000

The final number of clusters is given as input to the clust program. The default value is set to 5, but it might be useful to adapt this value. In general, no correct number of final clusters exists and by decreasing the final number of clusters the loss of information increases. Therefore, an appropriate cluster number should balance between information gain and loss. We provide a tool called '' in the auxi/clustering/programs/clust/aux/ directory. Using this (see below), you can plot the shortest distance between two clusters depending on the number of clusters in a given cycle (red line). Also plotted is the percentage distance increase relative to the shortest distance of the clustering step before (blue circles). This plot helps to find the clustering step where two clusters are merged that are quite distant and should not be merged to avoid too much loss of information. This clustering step can be identified by looking for a kink in the red curve. This analysis method is limited, as it only takes the shortest distance between two clusters into account. The RMSD variance in each cluster, the cluster sizes, and the total number of complexes are not considered.

An example can be found in the examples/bnbs/docking directory. In this Barnase/Barstar example, there is a kink when seven clusters are generated, but there is, for example, also a kink when 13 clusters are generated. This shows that the number of clusters depends on the required accuracy and the details of the project.

Afterwards, it is possible to adapt the number of clusters in the (found in the examples/bnbs/bnbs_docking example) script by changing the N2 value or by adjusting the -clcl option in the clust program.

To run '' it is necessary to run the clust program with the following command first:
USAGE: ./clust -an -p -cy > cluster_formation.out

The outputfile can then be used as input for '':
USAGE: ./ title inputFile outputFile

Where 'title' is the general title for the plot, the 'inputFile' is the outputfile of the cluster analysis (generated when applying the options '-an -p -cy', 'cluster_formation.out' from above). The 'outputFile' is the name for the *png image file.

A complete documentation, containing all options is available as pdf.

The recompilation of the clust program can be done in the src directory by calling make clust

An example output of the clust tool is given here:
Analysis of the clustering output

All programs provide a summary in a similar format.

The table shows the output of the docking simulation from the barnase-barstar example generated with the auxi/clust tool.
It describes the 5 final clusters classed by their cluster size (ClSize, the number of the members in each cluster):

 No  ClSize ClFSize    Repr    ReprE     ClAE    CLAED  RepRMSD  CLFRMSD      ElE   ElDesE   HyDesE      LjE   spread
  1     236   67370      35  -22.585  -22.289    0.555    4.965    5.052  -12.978    7.925  -17.532    0.000    1.380
  2      92    7759     301  -20.819  -20.774    0.907   16.251   16.121  -12.005    6.764  -15.578    0.000    1.749
  3      88   11624     304  -20.771  -21.390    1.045   13.867   13.984  -12.029    6.861  -15.603    0.000    1.129
  4      80    8890     230  -21.181  -21.696    0.729    3.370    3.479  -14.342    8.823  -15.662    0.000    1.558
  5       4     159     459  -19.340  -19.400    0.145   18.567   18.675   -9.798    6.046  -15.588    0.000    0.801

Distance between the chosen representatives of the clusters listed:

        complex_35      complex_301     complex_304     complex_230     complex_459     
complex_35      0       13.3751 10.9781 3.40311 16.8218 
complex_301     13.3751 0       3.86128 14.8969 7.1714  
complex_304     10.9781 3.86128 0       12.5002 8.52078 
complex_230     3.40311 14.8969 12.5002 0       17.83   
complex_459     16.8218 7.1714  8.52078 17.83   0       

Description of columns
No:        Cluster Number
ClSize:    Number of entries in the complexes (f55) file used in the cluster
ClFSize:   Number of representative entries for the cluster
Repr:      Representative chosen (corresponds to linenumber in the complexes, f55 file)
ReprE:     Total interaction energy of the chosen Representative
ClAE:      Average total energy of all cluster members weighted with number of representatives.
CLAED:     Weighted standard deviation of total energy of cluster entries in the complexes (f55) file
RepRMSD:   RMSD of the representative to solute 2
CLFRMSD:   Average RMSD of the cluster to solute 2
Ele:       Electrostatic energy
ElDesE:    Electrostatic desolvation energy of the representative complex
HyDesE:    Hydrophobic desolvation energy of the representative complex
LjE:       Lennard-Jones energy of the representative complex
spread:	   arithmetic average of rmsd's of each cluster member from the representative, weighted by occupancy
The clust tool also creates a file containing a matrix of all distances between complexes and for each cluster a list of the details of the cluster members.
Single-linkage algorithm

Tool : auxi/

This python script implements a single-linkage hierarchical agglomerative clustering algorithm. The complexes within an RMSD_CUTOFF (default 3 Å) are integrated into the closest cluster.

An example of usage in SDA 7:
# generate a complexes file suitable for the clust program
read_record complexes -format_cluster

# suppress the rotation around the z axis for the first 2000 complexes( see below)
auxi/ complexes.comp_cluster 2000

# cluster the first 2000 complexes wtrotz.comp_cluster p2.pdb 2000

# generate pdbs of the 7 clusters with corresponding representative pdb files
trajectory2dcd mycluster.comp_cluster Min.pdb -input -nb_complexes 7 >> out_surface_clustering

k-Means algorithm

Tool : auxi/

As an alternative to the average- and single-linkage hierarchical clustering methods, the k-means algorithm can be used for the clustering of the docked complexes. The k-means algorithm consists of four major steps:
(1) initialization
(2) member assignment
(3) cluster center update
(4) convergence
The initialization step is the most important step in the k-means clustering method, since the results will strongly depend on the initial selection of the cluster centers. In our implementation, the Forgy method, which assigns k random cluster centers, was used as an initialization method. Based on the initial cluster centers, each member in the data set will be assigned to the cluster of the closest center. In the third step, the cluster centers will be updated and the members that are closest to the geometric center of their corresponding cluster will be chosen as new cluster centers. The second and the third steps will be repeated until all the clusters converge.

The clustering program should be run several times with different starting conditions in order to obtain reasonably small spreads for each cluster.

# generate a complexes file suitable for the clust program
read_record complexes -format_cluster

# suppress the rotation around the z axis for the first 2000 complexes( see
auxi/ complexes.comp_cluster 2000

# cluster with the k-means algorithm and generate the pdb files -pdb p2.pdb -comp complexes.comp_cluster -cln 7


-atom : 'ca', 'bb', 'aa' (Atom selection for the calculation of RMSD. Ca, backbone and all-atom selections are available)

-rmat : rmsd_matrix.dat (RMSD matrix file can be used as an input)

-cln : # (Number of clusters requested)

-seed : # (Seed for the Python random number generator)

-h : (Prints out the usage and help menu)

Preprocessing for solute-surface simulation

Tool : auxi/

In the case of a uniform and planar surface (which needs to be in the xy-plane), the translation along the plane and the rotation around the z-axis are irrelevant. To be able to use one of the previous algorithms optimally, this tool will translate all entries in the complexes file to the center of the surface, and rotate the solute about the z-axis so that its x-axis always points in the same direction.

Usage : auxi/ complexes n
where n indicates that the first n (number) entries in the complexes file will be considered.

[Back to Main documentation]