A novel water-swap reaction coordinate for the calculation of absolute protein-ligand binding free energies

Christopher J. Woods, Maturos Malaisree, Supot Hannongbua and Adrian J. Mulholland

Abstract

The accurate prediction of absolute protein-ligand binding free energies is one of the grand challenge problems of computational science. Binding free energy measures the strength of binding between a ligand and a protein, and an algorithm that would allow its accurate prediction would be a powerful tool for rational drug design. Here we present the development of a new method that allows for the absolute binding free energy of a protein-ligand complex to be calculated from first principles, using a single simulation. Our method involves the use of a novel reaction coordinate that swaps a ligand bound to a protein, with an equivalent volume of bulk water. This water-swap reaction coordinate is built using an identity constraint, which identifies a cluster of water molecules from bulk water that occupies the same volume as the ligand in the protein active site. A dual topology algorithm is then used to swap the ligand from the active site with the identified water cluster from bulk water. The free energy is then calculated using replica exchange thermodynamic integration (RETI). This returns the free energy change of simultaneously transferring the ligand to bulk water, as an equivalent volume of bulk water is transferred back to the protein active site. This, directly, is the absolute binding free energy. It should be noted that while this reaction coordinate models the binding process directly, an accurate forcefield and sufficient sampling are still required to allow for the binding free energy to be predicted correctly. In this paper we present the details and development of this method, and demonstrate how the potential of mean force (PMF) along the water-swap coordinate can be improved by calibrating the soft-core Coulomb and Lennard-Jones parameters used for the dual topology calculation. The optimal parameters were applied to cal- culations of protein-ligand binding free energies of a neuraminidase inhibitor (oseltamivir), with these results compared to experiment. These results demonstrate that the water-swap coordinate provides a viable and potentially powerful new route for the prediction of protein-ligand binding free energies.

This page provides details about the simulations that were performed in the paper "A novel water-swap reaction coordinate for the calculation of absolute protein-ligand binding free energies". This paper has been accepted for publication in the Journal of Chemical Physics, and this page will be updated with a copy of the paper once the paper is available.

If you would have a question about the water-swap reaction coordinate, then please see this FAQ, or if your question isn't there, or isn't answered clearly, then please email the authors.

If you would like to perform the simulations discussed in the paper then you will need to download the version of Sire used for the work. You can do this by installing subversion and typing (on Unix, OS X or Linux);


mkdir Sire
mkdir Sire/source
cd Sire/source
svn co http://sire.googlecode.com/svn/corelib/tags/waterswap2010 ./corelib
svn co http://sire.googlecode.com/svn/python2/tags/waterswap2010 ./python2

To compile the code you will need to have installed Qt4, cmake, python, boost, GSL and, optionally, MPI and BLAS. You can then type;


mkdir Sire/build
mkdir Sire/build/corelib
mkdir Sire/build/python2
cd Sire/build/corelib
cmake ../../souce/corelib
make install/strip
cd ../python2
cmake ../../source/python2
make install/strip

Sire comes with more detailed installation instructions if you need more information, or if you run into difficulties, you can post a message to the Sire Users mailing list.

In addition to Sire, to run the simulations you will also need a copy of ProtoMS. You can download and install ProtoMS by typing;


mkdir Work
cd Work
svn co http://protoms.googlecode.com/svn/tags/waterswap2010 ./ProtoMS
cd ProtoMS/src
make

The next step is to download the simulation input files;

  • waterswap_ligand.tgz : These are the files for the simulations that transfer oseltamiivr between two water boxes
  • waterswap_protein.tgz : These are the files for the simulations that calculate the absolute binding free energy of oseltamivir to neuraminidase

Unpack these files into the directory in which you want to run the simulations.


tar -zxvf waterswap_ligand.tgz
tar -zxvf waterswap_protein.tgz

There are README files in these directories that describe their contents in detail. In brief, the directories are arranged by figure from the paper, e.g. waterswap_ligand/figure6 contains the input files needed to run the simulation that generated figure 6, while waterswap_ligand/figure7/a contains a series of directories, one for each of the PMFs shows in figure 7(a).

Each simulation is run using a python script called runsim.py. This python script performs one iteration of the simulation, where one iteration is 100 thousand Monte Carlo moves, followed by a replica exchange move between the replicas. To run this script you should type;


sire_python runsim.py >> runsim.log

or, if you have MPI and wish to use more than one processor at a time


mpirun -np X sire_python runsim.py >> runsim.log

where "X" is the number of processors you wish to use. You can use however many processors you wish as Sire will adapt automatically to the number, although you will not gain any benefit of using more processors that you have replicas (there are 24 replicas).

As the above command only runs one iteration, you must run this command several times in sequence to generate a full simulation. For example, I use a simple torque script that runs the MPI command 200 times in sequence to perform the 20 million moves in the simulation. Note that the script is clever, in the sense that it will automatically recover from any errors, and will automatically pick up from the last iteration. This means that if the last iteration was interupted (e.g. because you ran out of time in the cluster queue), then the next time you run sire_python, it will remove the incomplete iteration and will re-run it. Note also that you can run as many iterations as you like, simply by running sire_python as many times as you like.

The result of running sire_python will be a directory called output, within which there will be one directory per iteration (e.g. output/iteration_0001 contains the output from iteration 1). Within each iteration directory will be directories for each lambda value (e.g. output/iteration_0001/0.001000 contains the output from the lambda=0.001 replica from iteration 1). These directories contains the PDB coordinate files for the protein-box and bulk-water-box, together with the energy component trajectory for that replica. As well as containing directories for each lambda value, the output/iteration_???? directories also contain summary files, e.g. average_dg_f.txt contains the Zwanzig equation average free energy difference between the forwards and reference states, divided by delta lambda (0.001). This is the free energy gradient for each lambda value with respect to lambda, as calculated solely from the configurations sampled during this iteration. These gradients can be integrated to obtain the PMF from this iteration. This is performed automatically during the simulation, resulting in the forwards difference PMF in output/iteration_????/pmf_f.txt and the backwards difference PMF in output/iteration_????/pmf_b.txt. These PMFs are just for this specific iteration. To calculate the full average PMF for the complete simulation, you need to either average the PMFs from multiple iterations, or, more accurately, average the free energy gradients from multiple iterations and then integrate those. We have written an automatic analysis script, calc_free_energy.py, which we have included with the simulation input files. You can run it by typing;


python calc_free_energy.py output/iteration_0001/average_dg_f.txt X Y

where X is the iteration that wish to start the analysis, and Y is the iteration you wish to finish, e.g.


python calc_free_energy.py output/iteration_0001/average_dg_f.txt 50 200

This will scan all of the average_dg_f.txt free energy gradient files, and will calculate all of the PMF averages going back from iteration Y to iteration X, e.g. average Y, then Y, Y-1, then Y, Y-1, Y-2, etc. until Y, Y-1, Y-2 ... X+1, X. For each iteration, it will average the free energy gradients and integrate those to obtain the average PMF, and it will also calculate the standard error on each average and then integrate those to obtain the maximum and minimum average PMFs at the 95% confidence level. The script will then split the iterations into equilibration and production by choosing the iteration at which this error in minimised. It then prints out this average PMF, together with corresponding maximum and minimum PMFs (and data about how it has reached its decision).

You can run this script to generate the backwards difference average PMF by typing;


python calc_free_energy.py output/iteration_0001/average_dg_b.txt X Y

Average the forwards difference and backwards difference PMFs to obtain those that are given in the figures of the paper. The reported free energy differences are the average of the forwards PMF and backwards PMF values at lambda=1, the difference errors are the differences between the forwards and backwards PMFs at lambda=1, and the random errors are the average of the random errors of the forwards and backwards PMFs at lambda=1 (both of the random errors for forwards and backwards should be nearly identical).

To check that the analysis script has properly worked out where equilibration has ended, and production started, you can plot the predicted free energy as a function of iteration. You can do this by looking at the free energy at lambda=1 in the PMF files for each iteration, e.g.


tail -n 1 output/iteration*/pmf_f.txt

and then plot the second column using a program like Excel or Xmgrace.

To visualise the trajectory you must first bunzip2 all of the PDB files, via;


bunzip2 output/iteration*/*/*.pdb.bz2

(note that you should bzip2 all of the PDB files once you have finished visualising them, as they take up a lot of space)

Then, you can type;


vmd output/iteration*/0.001000/bound*.pdb

to visualise the protein-box trajectory at lambda=0.001 (change 0.001 to the value of lambda you wish to view), or


vmd output/iteration*/0.001000/free*.pdb

to visualise the bulk-water-box at lambda=0.001.

Alternatively, as replica exchange is performed, you can visualise an individual replica's trajectory by typing;


vmd output/iteration*/*/bound_000_*.pdb

where the "000" is the replica ID number (e.g. "000" to "023" if there are 24 replicas). For the bulk-water-box replica trajectory type;


vmd output/iteration*/*/free_000_*.pdb

Finally, to visualise the replica trajectory (the lambda values sampled by each replica) open the file output/replica_trajectory.txt in a graphing program like Excel or Xmgrace.

If you have any problems running or analysing the simulations, please feel free to get in contact with me (Christopher Woods, at chryswoods@gmail.com), or please post a message on the Sire Users mailing list.

Also, if you think that there are any problems or issues with this page, then please add a comment below. Thanks.