Clustering analysis of SDA results
After a docking simulation (or an association rate simulation where the complexes are recorded), the complexes file
contains the most energetically favorable docked positions. There are typically between 500 and, when using solute flexibility, up to 20,000 entries
in this file. Using a clustering method allows these docking positions to be grouped into a few (3 to 10) representative clusters that are meaningful for analysis.
Different methods of clustering are available in SDA:
- An average-linkage algorithm, applicable to solute-solute interactions with output described here
- A single-linkage algorithm (faster)
- A k-Means algorithm
- Additionally, a python script is provided for the specific case of a solute-surface simulation. This script moves all docking positions at the surface to one spot on the surface. This is required since the docking poses at different positions of the surface are equivalent if the surface has an atomically repeating structure.
Tool : auxi/clust
This clustering tool allows the clustering of docking positions using a hierarchical agglomerative average-linkage clustering algorithm. For this, the backbone RMSDs of all docked structures against each other are calculated. Please note, with this method, two structures with identical centers but different orientations can have large RMSDs.
The algorithm starts with each docking result representing its own cluster. At each step, the algorithm merges structures in clusters based on the average linkage distance (distance between the centers of clusters). Once the given number of clusters is reached, representatives of these clusters are written. These representative structures are chosen based on the closest distance to the center of the corresponding cluster.
All options are described in the pdf documentation
A typical usage of this clustering in SDA:
# generate a complexes file suitable for the clust program
read_record complexes -format_cluster
# prepare cluster.dat file
bin/clust -cl -n 2000 -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb
# make pdb files for the representatives of the 7 most populated clusters
bin/clust -re -clcl 7 -f55 fort.55 -p1 p1.pdb -p2 p2.pdb
# prepare score file
bin/clust -sc -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb
# make energy analysis and generate pdbs for the 7 most populated clusters
bin/clust -an -p -clcl 7 -f55 complexes.comp_cluster -p1 p1.pdb -p2 p2.pdb > cluster-out-info-2000
The final number of clusters is given as input to the clust program. The default value is set to 5, but it might be useful to adapt this value. In general, no correct number of final clusters exists and by decreasing the final number of clusters the loss of information increases. Therefore, an appropriate cluster number should balance between information gain and loss. We provide a tool called 'plot_cycles.py' in the auxi/clustering/programs/clust/aux/ directory. Using this (see below), you can plot the shortest distance between two clusters depending on the number of clusters in a given cycle (red line). Also plotted is the percentage distance increase relative to the shortest distance of the clustering step before (blue circles). This plot helps to find the clustering step where two clusters are merged that are quite distant and should not be merged to avoid too much loss of information. This clustering step can be identified by looking for a kink in the red curve. This analysis method is limited, as it only takes the shortest distance between two clusters into account. The RMSD variance in each cluster, the cluster sizes, and the total number of complexes are not considered.
An example can be found in the examples/bnbs/docking directory. In this Barnase/Barstar example, there is a kink when seven clusters are generated, but there is, for example, also a kink when 13 clusters are generated. This shows that the number of clusters depends on the required accuracy and the details of the project.
Afterwards, it is possible to adapt the number of clusters in the run_clusterDPipe.sh (found in the examples/bnbs/bnbs_docking example) script by changing the N2 value or by adjusting the -clcl option in the clust program.
To run 'plot_cycles.py' it is necessary to run the clust program with the following command first:
USAGE: ./clust -an -p -cy > cluster_formation.out
The outputfile can then be used as input for 'plot_cycles.py':
USAGE: ./plot_cycles.py title inputFile outputFile
Where 'title' is the general title for the plot, the 'inputFile' is the outputfile of the cluster analysis
(generated when applying the options '-an -p -cy', 'cluster_formation.out' from above). The 'outputFile'
is the name for the *png image file.
A complete documentation, containing all options is available as pdf.
The recompilation of the clust program can be done in the src directory by calling make clust
All programs provide a summary in a similar format.
The table shows the output of the docking simulation from the barnase-barstar example
generated with the auxi/clust tool.
It describes the 5 final clusters classed by their cluster size (ClSize, the number of the members in each cluster):
No ClSize ClFSize Repr ReprE ClAE CLAED RepRMSD CLFRMSD ElE ElDesE HyDesE LjE spread 1 236 67370 35 -22.585 -22.289 0.555 4.965 5.052 -12.978 7.925 -17.532 0.000 1.380 2 92 7759 301 -20.819 -20.774 0.907 16.251 16.121 -12.005 6.764 -15.578 0.000 1.749 3 88 11624 304 -20.771 -21.390 1.045 13.867 13.984 -12.029 6.861 -15.603 0.000 1.129 4 80 8890 230 -21.181 -21.696 0.729 3.370 3.479 -14.342 8.823 -15.662 0.000 1.558 5 4 159 459 -19.340 -19.400 0.145 18.567 18.675 -9.798 6.046 -15.588 0.000 0.801 Distance between the chosen representatives of the clusters listed: complex_35 complex_301 complex_304 complex_230 complex_459 complex_35 0 13.3751 10.9781 3.40311 16.8218 complex_301 13.3751 0 3.86128 14.8969 7.1714 complex_304 10.9781 3.86128 0 12.5002 8.52078 complex_230 3.40311 14.8969 12.5002 0 17.83 complex_459 16.8218 7.1714 8.52078 17.83 0 Description of columns No: Cluster Number ClSize: Number of entries in the complexes (f55) file used in the cluster ClFSize: Number of representative entries for the cluster Repr: Representative chosen (corresponds to linenumber in the complexes, f55 file) ReprE: Total interaction energy of the chosen Representative ClAE: Average total energy of all cluster members weighted with number of representatives. CLAED: Weighted standard deviation of total energy of cluster entries in the complexes (f55) file RepRMSD: RMSD of the representative to solute 2 CLFRMSD: Average RMSD of the cluster to solute 2 Ele: Electrostatic energy ElDesE: Electrostatic desolvation energy of the representative complex HyDesE: Hydrophobic desolvation energy of the representative complex LjE: Lennard-Jones energy of the representative complex spread: arithmetic average of rmsd's of each cluster member from the representative, weighted by occupancyThe
clust
tool also creates a file containing a matrix of all distances between complexes and for each cluster a list of the details of the cluster members.
Tool : auxi/myClustering_complexes-SDA7.py
This python script implements a single-linkage hierarchical agglomerative clustering algorithm. The complexes within an RMSD_CUTOFF (default 3 Å) are integrated into the closest cluster.
An example of usage in SDA 7:
# generate a complexes file suitable for the clust program
read_record complexes -format_cluster
# suppress the rotation around the z axis for the first 2000 complexes( see below)
auxi/DelRotZ_complexes-SDA7.py complexes.comp_cluster 2000
# cluster the first 2000 complexes
myClustering_complexes-SDA7.py wtrotz.comp_cluster p2.pdb 2000
# generate pdbs of the 7 clusters with corresponding representative pdb files
trajectory2dcd mycluster.comp_cluster Min.pdb -input sda_prometcs.in -nb_complexes 7 >> out_surface_clustering
Tool : auxi/k-means.py
As an alternative to the average- and single-linkage hierarchical clustering methods,
the k-means algorithm can be used for the clustering of the
docked complexes. The k-means algorithm consists of four major steps:
(1) initialization
(2) member assignment
(3) cluster center update
(4) convergence
The initialization step is the most important step in the k-means clustering
method, since the results will strongly depend on the initial selection
of the cluster centers. In our implementation, the Forgy method, which
assigns k random cluster centers, was used as an initialization method.
Based on the initial cluster centers, each member in the data set will
be assigned to the cluster of the closest center. In the third step,
the cluster centers will be updated and the members that are closest to
the geometric center of their corresponding cluster will be chosen as
new cluster centers. The second and the third steps will be repeated
until all the clusters converge.
The clustering program should be run several times with different starting conditions in order to obtain reasonably small spreads for each cluster.
USAGE:
# generate a complexes file suitable for the clust program
read_record complexes -format_cluster
# suppress the rotation around the z axis for the first 2000 complexes( see below)
auxi/DelRotZ_complexes-SDA7.py complexes.comp_cluster 2000
# cluster with the k-means algorithm and generate the pdb files
k-means.py -pdb p2.pdb -comp complexes.comp_cluster -cln 7
Options:
-atom : 'ca', 'bb', 'aa' (Atom selection for the calculation of RMSD. Ca, backbone and all-atom selections are available)
-rmat : rmsd_matrix.dat (RMSD matrix file can be used as an input)
-cln : # (Number of clusters requested)
-seed : # (Seed for the Python random number generator)
-h : (Prints out the usage and help menu)
Tool : auxi/DelRotZ_complexes-SDA7.py
In the case of a uniform and planar surface (which needs to be in the xy-plane), the translation along the plane and the rotation around the z-axis are irrelevant. To be able to use one of the previous algorithms optimally, this tool will translate all entries in the complexes file to the center of the surface, and rotate the solute about the z-axis so that its x-axis always points in the same direction.
Usage : auxi/DelRotZ_complexes-SDA7.py complexes n
where n indicates that the first n (number) entries in the complexes file will be considered.