The MPI code in Sire is now finally working and robust (as opposed to working and falling over every ten seconds!). I've written the code around the concept of Backends (threads that wait to do some work) and WorkPackets, which contain all of the information and code necessary to perform a chunk of work. I then have Frontends, which provide an interface to the connection to a Backend (thereby hiding all of the communication code), and a Node object, which holds a Frontend to perform the work. A Nodes object then contains a collection of Node objects, and provides a queue that allows individual Node objects to be scheduled to process the WorkPackets.
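To make the division of labour a bit more concrete, here is a minimal sketch of the idea in plain C++, using threads in a single process rather than MPI. It is not Sire's code: `SumPacket`, `NodePool`, `submit` and `run_demo` are names invented for this illustration, and the real Backend/Frontend classes do far more (communication, error handling, shutdown).

```cpp
#include <cassert>
#include <condition_variable>
#include <future>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a WorkPacket: everything needed to do one chunk of work.
class WorkPacket {
public:
    virtual ~WorkPacket() = default;
    virtual void run() = 0;
};

// Toy packet (an assumption for illustration): sums the integers 1..n.
class SumPacket : public WorkPacket {
public:
    explicit SumPacket(int n) : n_(n) {}
    void run() override {
        for (int i = 1; i <= n_; ++i) total_ += i;
    }
    long total() const { return total_; }
private:
    int n_;
    long total_ = 0;
};

// Hypothetical pool playing the role of Nodes: it owns a fixed number of
// slots and makes callers wait until one is free.
class NodePool {
public:
    explicit NodePool(int nslots) : free_(nslots) { assert(nslots > 0); }

    // Blocks until a slot is free, then runs the packet on its own thread.
    // The returned future yields the finished packet.
    std::future<std::shared_ptr<WorkPacket>> submit(std::shared_ptr<WorkPacket> packet) {
        {
            std::unique_lock<std::mutex> lock(mutex_);
            slot_freed_.wait(lock, [this] { return free_ > 0; });
            --free_;
        }
        return std::async(std::launch::async, [this, packet]() {
            SlotRelease release(*this);  // frees the slot even if run() throws
            packet->run();
            return packet;
        });
    }

private:
    // RAII helper: gives the slot back to the pool when it goes out of scope.
    struct SlotRelease {
        explicit SlotRelease(NodePool &p) : pool(p) {}
        ~SlotRelease() {
            { std::lock_guard<std::mutex> lock(pool.mutex_); ++pool.free_; }
            pool.slot_freed_.notify_one();
        }
        NodePool &pool;
    };

    std::mutex mutex_;
    std::condition_variable slot_freed_;
    int free_;
};

// Submit three packets to a two-slot pool, then collect the results in order.
long run_demo() {
    NodePool pool(2);
    std::vector<std::future<std::shared_ptr<WorkPacket>>> running;
    for (int n : {10, 100, 1000}) {
        running.push_back(pool.submit(std::make_shared<SumPacket>(n)));
    }
    long grand_total = 0;
    for (auto &job : running) {
        auto done = std::dynamic_pointer_cast<SumPacket>(job.get());
        grand_total += done->total();
    }
    return grand_total;
}
```

The point of the sketch is that the caller only ever writes a packet and calls `submit`; the waiting, threading and slot bookkeeping stay hidden inside the pool, which is the same separation the Frontend and Nodes classes are meant to give.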
The upshot of all of this is that the communication and parallelisation is now pretty much hidden, and all the user has to worry about is how to write a WorkPacket. I've written two WorkPackets; the first is PythonPacket, which contains a Python interpreter and a script. This means that sire_python can schedule and run multiple Python scripts (in serial and in parallel), so sire_python can act as a Python task farm.
The second packet is a SimPacket, which contains a System and a Moves object, together with the number of moves that should be run. This is, in essence, everything necessary to run a bit of a simulation. I now use SimPacket to run all of the simulations in Sire (even single-processor, serial jobs) as it is very clean (both in code, and conceptually). This also makes replica exchange very easy, e.g.
void RepExMove::move(Nodes &nodes, RepExReplicas &replicas, int nmoves_to_run)
{
    for (int i=0; i<nmoves_to_run; ++i)
    {
        QList<Simulation> running_sims;

        //submit one simulation per replica, each on the next available node
        for (int j=0; j<replicas.nReplicas(); ++j)
        {
            RepExReplica replica = replicas[j];
            Node node = nodes.getNode();

            running_sims.append( Simulation::run(node, replica.system(),
                                                 replica.moves(), replica.nMoves(),
                                                 replica.recordStatistics()) );
        }

        //collect the results and copy them back into the replicas
        for (int j=0; j<replicas.nReplicas(); ++j)
        {
            //wait for the simulation to finish
            running_sims[j].wait();

            SimPacket sim = running_sims[j].result();
            replicas.setSystem(j, sim.system());
            replicas.setMoves(j, sim.moves());
        }

        //now that all replicas have finished, test and swap them
        this->testAndSwap(replicas);
    }
}
(I should note - the above code is correct, but the real code in Sire has more error checking, and also handles resubmission of jobs due to jobs being stopped or aborted on remote nodes, or if fixable errors occur - take a look here)
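The snippet doesn't show what testAndSwap does. As a rough illustration only, here is the standard Metropolis acceptance test for the simplest case, where two replicas differ only in temperature: a swap between replicas i and j is accepted with probability min(1, exp[(beta_i - beta_j)(E_i - E_j)]), where beta = 1/kT. This is a sketch under that assumption, not the code in Sire; `swap_probability` and `accept_swap` are hypothetical names, and other kinds of replica exchange need a different energy difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <random>

// Metropolis acceptance probability for swapping the configurations of two
// replicas that differ only in temperature (an assumption of this sketch).
// beta_* are inverse temperatures (1/kT); energy_* are the current potential
// energies, in units consistent with kT.
double swap_probability(double beta_i, double beta_j,
                        double energy_i, double energy_j) {
    const double delta = (beta_i - beta_j) * (energy_i - energy_j);
    return std::min(1.0, std::exp(delta));
}

// Draw a uniform random number in [0, 1) and accept the swap if it falls
// below the acceptance probability.
bool accept_swap(double probability, std::mt19937 &rng) {
    assert(probability >= 0.0 && probability <= 1.0);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);
    return uniform(rng) < probability;
}
```

With this form, a swap that moves the lower-energy configuration to the colder replica is always accepted, while the reverse is accepted only some of the time, which is what keeps each replica sampling the right ensemble.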
I've tested the code on my laptop and on bluecrystal (our large new Linux supercomputer at Bristol) and it works well. It took ages to chase down small bugs and improbable race conditions, but I'm now very pleased with the result. I've been running tests on my laptop where I've run 30 Python scripts in parallel over a 100-process MPI job, and it runs to completion without any problems (even shutdown works properly - which is important, as I don't want my job sat in a queue because it hasn't finished properly).
In other news, there is now a small but growing army of Sire users, and some interesting QM/MM simulations are now in progress. If you are at the MGMS Young Modellers Forum this Friday then you will be able to listen to Katie Shaw talk about her work using Sire to investigate the balance of forces across the QM/MM interface. I'll also be at the YMF, so feel free to say hello :-)
*UPDATE* - Katie won one of the prizes at the YMF for her talk. Congratulations! :-)