Water-Swap Reaction Coordinate FAQ
The water-swap reaction coordinate (WSRC) provides a novel reaction pathway along which the absolute binding free energy of two molecules may be calculated. The method was developed initially for calculation of absolute protein-ligand binding free energies, but it can be applied to any condensed phase host-guest system.
Here are a collection of questions that we have been asked about the WSRC, together with our answers. If you have some questions, then please add them via the comments link below, and we will do our best to answer them.
What is Water-Swap Reaction Coordinate?
The water-swap reaction coordinate provides a coordinate that may be used to calculate absolute binding affinities of protein-ligand systems. This reaction coordinate exchanges a ligand molecule bound to the protein with an equivalent volume of bulk water (while simultaneously the ligand is transferred to the cavity left behind in bulk water). The method in described in full in [Link to paper once published]. While the method was developed for protein-ligand systems, it is applicable to any host-guest system in any solvent (e.g. the guest molecule is swapped with an equivalent volume of bulk solvent).
Why do we need to "swap" the ligand with the cluster of water?
It is possible to calculate the binding affinity by turning off (annihilating) the ligand molecule, both while bound to the protein and while free in water. However, annihilating the ligand creates a cavity in solvent. Calculating the free energy of filling that cavity is computationally demanding, and indeed redundant, as is reality that cavity is filled by water that is displaced as the ligand creates a cavity in bulk. Swapping the ligand with an equivalent volume of bulk water means that you are not creating any cavities, and, as an additional benefit, it means that the binding affinity can be calculated from a single simulation (rather than two separate simulations, one in protein and one in bulk).
How do you choose "the cluster of water molecules" that occupy the same volume as the ligand in the active site?
The cluster is chosen dynamically throughout the simulation by using the identity constraint. The identity constraint provides a means of identifying clusters of solvent molecules in bulk solvent, and then maintaining the identity of that cluster during a simulation without the need for restraints or cages. The identity constraint is described in [Link to paper once published]. Identity points are added to the atoms of the ligand, and these dynamically identify a cluster of water molecules that has about the same shape and volume as the ligand. This cluster moves and changes shape as the ligand moves and changes shape - not by moving the water molecules, but instead by updating the identities of the waters that belong to the cluster.
How many water molecules do we need?
You need as many as to cover the volume of the ligand. In theory, the number of waters will not affect the final predicted binding affinity - however fast convergence of the simulation is helped by picking a number that is roughly right. We've found that the simulation is robust, e.g. a simulation for one ligand can be performed with 12-18 waters, and still converge well. Typically we position identity points on most of the heavy atoms of the ligand (in rings we will put points of 3 or 4 out of the 5-6 heavy atoms).
How do you position the identity points on the ligand structure?
Put the points on the heavy atoms of the ligand, typically on every heavy atom, or every other heavy atom. Use 1-2 less points per ring, e.g. for 6 membered rings use 4 or 5 points. Don't worry too much though - the results are relatively robust to the number and position of points, and you can always check by running a quick calculation that swaps your ligand between two water boxes (which should give a free energy change of zero).
What is the lambda coordinate? How does it work?
The lambda coordinate is a dual topology reaction coordinate that is used by WSRC to turn off the ligand and turn off the water cluster from the protein- and bulk-water-boxes, while simultaneously switching on the ligand and water cluster in the bulk-water- and protein-boxes. For example, at lambda=0, the single ligand molecule is 100% in the protein-box and 0% in the bulk-water-box, and the water cluster is 0% in the protein-box and 100% in the bulk-water-box. At lambda=1, the ligand molecule is 0% in the protein-box and 100% in the bulk-water-box, while the water cluster is 100% in the protein-box and 0% in the bulk-water-box. At lambda=0.5, the ligand exists in both boxes (50% in the protein-box, 50% in the bulk-water-box) and the water cluster exists in both boxes (50% in the protein-box and 50% in the bulk-water-box). lambda thus acts as the slider that swaps the ligand and water cluster between the two boxes.
How many lambda values do we need?
You need as many as are necessary to map the change in free energy gradient across the lambda coordinate. While any free energy method can be used to calculate the potential of mean force (PMF) along the WSRC, we use replica-exchange thermodynamic integration. This requires placing replicas at lots of lambda values across the WSRC and evaluating the gradient of the free energy with respect to lambda. Integrating these gradients then returns the PMF. Ideally you should run a sufficient number of replicas such that the WSRC is completely covered, and any differences in free energy gradient between neighbouring replicas are small. Typically we use 24-48 replicas. You could probably get away with 8 replicas as a minimum, and should not need more than 64 replicas as a maximum.
How do you exchange the coordinates of the bound and unbound ligand (from lambda=0 to lambda=1)?
This is a common misconception about the WSRC. We are not changing the coordinates of either the ligand or the water cluster. The protein- and bulk-water-boxes actually lie directly on top of each other. They can do this as the molecules in these two boxes do not interact with each other (the molecules in the protein-box do not interact with the molecules in the bulk-water-box). Pictures of the two boxes show them drawn side-by-side only for clarity. Moving the ligand (and water-cluster) between the two boxes just involves scaling the interaction energies between the ligand and protein-box, and water-cluster and bulk-water-box by 1-lambda, and scaling the interaction energies between the ligand and bulk-water-box and water cluster and protein box by lambda. The coordinates of the ligand and cluster are not changed.
Do you sample the real conformation of the ligand in the bound state (ligand complex with protein) and unbound state (ligand in bulk water) in WSRC?
Yes - at lambda=0 we are simulating the ligand bound to the protein (so the protein and ligand both adopt the bound conformation), while at lambda=1 we are simulating the free protein and free ligand (so the protein and ligand adopt their free / bulk conformations). We really do have a single reaction coordinate that moves between the ligand-bound and ligand-free states. However, while in theory the WSRC can be used to calculate the full free energy of binding (including conformational free energy), in practice this will depend on whether there is sufficient sampling during the simulation. There is normally sufficient sampling to explore the change from the free to the bound conformation of the ligand, but current sampling methods and simulation lengths are too short to sample the conformational range between the bound and free forms of the protein (although this is something that we are addressing in another research project).
How about problems due to the entropy term? Does WSRC include the binding entropy?
Yes, in calculating a free energy, we implicitly calculate the entropy (although we have not, as yet, tried separating out the entropy term). However, we are as yet unsure whether or not the WSRC includes the entropy penalty of the ligand losing translational and rotational motion upon binding. There are arguments that it does include it, and there are arguments against. All we will definitely say, is that it can be used to calculate the free energy of transferring the ligand from a point an infinite distance from the protein, to a specific orientation and binding mode within the protein.
Can we use WSRC method in every protein-ligand systems?
Mostly - WSRC is useful in protein-ligand systems where the ligand is replaced by water. It won't work in proteins where the ligand creates the binding pocket (i.e. the pocket is completely collapsed when the ligand isn't there), or when the binding pocket is empty when the ligand isn't there.
Can we use MD simulations instead of MC simulations in WSRC?
Yes, probably! The method is based on the use of the identity constraint, which in theory can be used in dynamics simulations (the inspiration for the identity constraint was derived originally for dynamics simulations). We haven't tried this yet, nor worked on deriving the gradients, so the WSRC as yet can only be used with Monte Carlo.
Why do you say that WSRC is the first single simulation for the calculation of absolute binding free energies? Is it 1 simulation or is it 24 simulations in parallel?
It is one ensemble simulation. Replica exchange is used to create a single super-ensemble of configurational space + lambda, with 24 replicas exploring that super-ensemble collectively in parallel. See the RETI paper for a fuller description of ensemble simulations.
What are the drawbacks of WSRC?
The method is still too new for us to know where it doesn't work. Perhaps the biggest drawback at the moment is that it is only implemented in our software - hopefully it will be adopted widely in more established codes so that it will be easier for everyone to use, so that all of the drawbacks can be found (and then fixed!).
How can you avoid high peaks in PMFs generated along the WSRC?
The shape of the PMF along the WSRC is completely controllable (as, while lambda=0 and lambda=1 are physical states - bound and unbound - every lambda value in-between is a non-physical hybrid state, adn thus can be changed by playing with the potential, and this change will not affect the calculated binding affnity). You can change the PMF by playing with the soft-core potential (which is used to soften the non-physical hybrid states), by changing the way the energy terms are connected to lambda, or by adding additional functions to the potential at intermediate lambda values that counter the peak.
How do you treat the boundary between the protein and the water?
The protein box and the water box are two separate periodic boxes. In theory, they are both part of the same, infinite water box (both views into this infinite box), as both are connected to the same heat and pressure bath. You can thus think of the protein box as being bounded by a heat bath, and that same heat bath then bounding the water box. In practise, there is no boundary between the protein and water boxes - the connection to the boxes is achieved by using the total energy and total volume of both boxes in the Monte Carlo acceptance test - otherwise the protein and water boxes do not see each other.
How do you choose the position or number of replicas along lambda?
You choose the position of the replicas in the same way as you would for any other thermodynamic integration simulation across a lambda coordinate. While in theory, you could optimise the position of the replicas (the positions at which you measure the gradient of the PMF - dG / dlambda), and then integrate the gradients using quadrature, in practise we use the large numbers of processors available to us to just run as many replicas as are necessary to flood the lambda coordinate. We ran on 8-core Xeons, so used three nodes, thus 24 cores, and ran 24 replicas in parallel, spread roughly evenly across lambda, with more replicas centered at the end points (lambda=0 and lambda=1), as this is where we expected the free energy to change most. We're coming to the conclusion that perhaps 24 replicas is not enough, so we are beginning to use 48 replicas (6 nodes). We are fortunate to have access to a large University supercomputer, so 6 nodes is not a huge request...