It's been taking for-ages to get the MPI code working. I thought I had it a week or so ago, but it was deadlocking like anything when I tried to run some production simulations. I've now pretty much rewritten it so that there is a very clean separation of communicators and threads. It's been a tough week, but at least it all now feels robust (rather than really flaky!)
One thing that would help, though, is for C++ to provide a loop, mutex or wait condition that was just a little bit less than permanent. For example, it is easy to loop forever...
while (true)
{
// do stuff
}
// or, with a macro such as #define forever() for (;;)
forever()
{
// do stuff
}
...but the problem is that this loop will keep going until it is horribly killed by the user, either literally with "kill -9" or with CTRL-C. Cleaning up or handling this is a real pain (e.g. what if this loop was in a background thread, or was in the middle of some IO?).
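The usual workaround today is to replace `while (true)` with a loop that polls a flag set from a signal handler. That way CTRL-C (SIGINT) or a plain `kill` (SIGTERM) ends the loop normally, and the clean-up code after it still runs. Below is a minimal sketch that assumes a POSIX-like environment. The names `g_stop_requested`, `request_stop`, `install_stop_handlers` and `run_until_stopped` are made up for illustration. `kill -9` sends SIGKILL, which cannot be caught, so nothing helps in that case.

```cpp
#include <csignal>

namespace {
// Written by the signal handler, read by the loop. volatile std::sig_atomic_t
// is a type the C and C++ standards allow a signal handler to write to.
volatile std::sig_atomic_t g_stop_requested = 0;
}

// Signal handler: only records that a stop was requested.
extern "C" void request_stop(int /*signum*/)
{
    g_stop_requested = 1;
}

// Catch CTRL-C (SIGINT) and plain `kill` (SIGTERM). SIGKILL (`kill -9`)
// cannot be caught, so no clean-up is possible in that case.
void install_stop_handlers()
{
    std::signal(SIGINT, request_stop);
    std::signal(SIGTERM, request_stop);
}

// Replacement for `while (true)`: runs `step` until a stop is requested,
// then returns the number of completed iterations so the caller can clean up.
template <typename Step>
long run_until_stopped(Step step)
{
    long iterations = 0;
    while (!g_stop_requested)
    {
        step();
        ++iterations;
    }
    return iterations;
}
```

This only helps code that gets round to checking the flag. A thread stuck inside a blocking call never sees it.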
Equally, mutexes and wait conditions can block forever, e.g.
mutex.lock();          // blocks forever if no-one unlocks the mutex
waiter.wait( &mutex ); // waits forever if no-one wakes us up
Again, these will just block forever until we are unceremoniously killed.
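Since C++11 the standard library does have timed variants, `std::timed_mutex::try_lock_for` and `std::condition_variable::wait_for`. However, they simply time out and know nothing about the program exiting. One way to approximate a "nearly forever" wait is to wait in short slices and check a shutdown flag between slices. The sketch below assumes a hypothetical process-wide `g_shutting_down` flag and a 100 ms slice.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical process-wide flag, set when the program starts shutting down.
std::atomic<bool> g_shutting_down{false};

// Lock `m`, retrying in short slices so that a shutdown is noticed.
// Returns true once the lock is held, or false as soon as a shutdown has
// been requested (the flag is checked before every attempt).
bool lock_unless_shutdown(std::timed_mutex& m,
                          std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
    while (!g_shutting_down.load())
    {
        if (m.try_lock_for(slice))
            return true;
    }
    return false;
}

// Wait on `cv` until `ready()` is true, again in slices. `lock` must hold the
// mutex used with `cv`. Returns true when ready, false on shutdown.
template <typename Predicate>
bool wait_unless_shutdown(std::condition_variable& cv,
                          std::unique_lock<std::mutex>& lock,
                          Predicate ready,
                          std::chrono::milliseconds slice = std::chrono::milliseconds(100))
{
    while (!g_shutting_down.load())
    {
        if (cv.wait_for(lock, slice, ready))
            return true;
    }
    return false;
}
```

This has two costs. A blocked thread wakes up every slice just to look at the flag, and every caller has to check the returned `bool`.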
What I want, therefore, is something that is a little less than forever, but still a very long time. What I want is forages (for-a-ges, as in nearly forever, e.g. "it took forages to get home last night", rather than forages as in looking for stuff, e.g. "the pig forages for truffles").
try
{
while (for_ages())
{
//do stuff
}
}
catch(std::system_exit)
{
// do clean-up at system exit
}
// or, even better
forages()
{
//do stuff
}
except(std::system_exit)
{
//do clean-up
}
...or...
try
{
mutex.lock( for_ages() );
waiter.wait( &mutex, for_ages() );
}
catch(std::system_exit)
{
//do clean-up at system exit
}
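Standard C++ has nothing like `std::system_exit`, but the `while (for_ages())` and `mutex.lock( for_ages() )` forms can be roughly approximated with today's facilities. (The `forages()`/`except` form would need new syntax, so it is not attempted here.) In this sketch everything lives in a hypothetical `forages` namespace:

- `forages::system_exit` stands in for the proposed `std::system_exit`.
- `request_exit()` is whatever the program would call when it starts shutting down.
- `forages::mutex::lock` takes a `for_ages` argument to mirror the syntax above.

All for_ages waits share one condition variable, so a single `notify_all()` wakes every waiter without polling. The price is that unrelated waiters also wake whenever any such mutex is unlocked.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <stdexcept>

namespace forages {

// Hypothetical stand-in for the proposed std::system_exit.
struct system_exit : std::runtime_error
{
    system_exit() : std::runtime_error("program is exiting") {}
};

// State shared by every for_ages loop and wait in the process.
struct shared_state
{
    std::mutex m;                  // guards every forages::mutex's locked_ flag
    std::condition_variable cv;    // one condition variable for all waiters
    std::atomic<bool> exiting{false};
};

inline shared_state& state()
{
    static shared_state s;
    return s;
}

// Called (e.g. by the shutdown code) to make every for_ages loop or wait give up.
inline void request_exit()
{
    {
        std::lock_guard<std::mutex> guard(state().m);
        state().exiting.store(true);
    }
    state().cv.notify_all();
}

// Loop condition that is true "for ages": it throws system_exit once an
// exit has been requested, instead of spinning until the process is killed.
class for_ages
{
public:
    explicit operator bool() const
    {
        if (state().exiting.load())
            throw system_exit();
        return true;
    }
};

// A mutex whose lock(for_ages) blocks until the mutex is free, or until
// request_exit() is called, in which case it throws system_exit.
class mutex
{
public:
    void lock(const for_ages& /*until*/)
    {
        std::unique_lock<std::mutex> guard(state().m);
        state().cv.wait(guard, [this] { return !locked_ || state().exiting.load(); });
        if (state().exiting.load())
            throw system_exit();
        locked_ = true;
    }

    void unlock()
    {
        {
            std::lock_guard<std::mutex> guard(state().m);
            locked_ = false;
        }
        state().cv.notify_all();
    }

private:
    bool locked_ = false;  // protected by state().m
};

} // namespace forages
```

Because the exception is caught, the stack unwinds normally and destructors run, which a `kill -9` never allows. A wait-condition version would follow the same pattern: put the exit flag into the predicate it waits on.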
Essentially, we have a "for_ages()" object that is used to wait or loop for-ages, but which will raise a system_exit exception if it is told that the program is exiting. We could also imagine telling the for_ages() object to break for other reasons, e.g. by sending an EXIT_FORAGES signal to a thread that we suspect is deadlocked:
/// Thread 1
void work()
{
try
{
mutex.lock( for_ages() ); // this is stuck, as mutex is already locked
}
catch (std::forages_exit)
{
cout << "Oh dear - were we deadlocked?\n";
}
}
/// Thread 2
void watchdog()
{
ThreadHandle thread1 = getHandle( THREAD_1 );
if (thread1.isStuckForAges())
{
thread1.sendSignal( EXIT_FORAGES );
}
thread1.doSomeMoreWork( work );
}
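For the thread-targeted case, one way to approximate sending EXIT_FORAGES is a per-thread token that another thread can set. Below is a minimal C++11 sketch. `ForAgesToken`, `lock_for_ages` and `forages_exit` are hypothetical names, and `forages_exit` stands in for the proposed `std::forages_exit`. The blocked call polls its token in 50 ms slices rather than being woken directly.

```cpp
#include <atomic>
#include <chrono>
#include <mutex>
#include <stdexcept>

// Hypothetical stand-in for the proposed std::forages_exit.
struct forages_exit : std::runtime_error
{
    forages_exit() : std::runtime_error("for_ages wait interrupted") {}
};

// Hypothetical per-thread token. Another thread calls request_exit() to
// play the role of sending EXIT_FORAGES to the thread that owns it.
class ForAgesToken
{
public:
    void request_exit() { exit_.store(true); }
    bool exit_requested() const { return exit_.load(); }

private:
    std::atomic<bool> exit_{false};
};

// Lock `m` "for ages": retry in short slices, and throw forages_exit as
// soon as the token has been told to give up (checked before each attempt).
void lock_for_ages(std::timed_mutex& m, const ForAgesToken& token,
                   std::chrono::milliseconds slice = std::chrono::milliseconds(50))
{
    for (;;)
    {
        if (token.exit_requested())
            throw forages_exit();
        if (m.try_lock_for(slice))
            return;
    }
}
```

Thread 1 would wrap `lock_for_ages(mutex, token)` in `try`/`catch (const forages_exit&)`. The watchdog thread would keep a reference to the same token and call `request_exit()` on it. Deciding that a thread is "stuck for ages" (the `isStuckForAges()` check) would need extra bookkeeping, such as recording when each wait started. That part is not shown here.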
I'm only thinking about this because I've had to write lots of similar things in SireCluster, but much more messily. For example, I block forages waiting to grab a node, but at Cluster shutdown a change to some shared variables signals to the blocking calls that they will never be successful. They then stop blocking and return a null resource, which has to be checked.
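The null-resource pattern described above might look roughly like the following. This is a hypothetical reconstruction for illustration only, not SireCluster's actual code. `Node`, `NodePool`, `acquire`, `release` and `shutdown` are made-up names.

```cpp
#include <condition_variable>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical compute node; stands in for whatever resource is handed out.
struct Node
{
    int id;
};

// acquire() blocks until a node is free. After shutdown(), blocked and
// future acquire() calls return nullptr, so every caller must check the result.
class NodePool
{
public:
    explicit NodePool(int n_nodes)
    {
        for (int i = 0; i < n_nodes; ++i)
            free_.push_back(std::make_shared<Node>(Node{i}));
    }

    std::shared_ptr<Node> acquire()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        changed_.wait(lock, [this] { return shutting_down_ || !free_.empty(); });
        if (shutting_down_)
            return nullptr;  // the "null resource" the caller has to check for
        std::shared_ptr<Node> node = free_.front();
        free_.pop_front();
        return node;
    }

    void release(std::shared_ptr<Node> node)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            free_.push_back(std::move(node));
        }
        changed_.notify_one();
    }

    // The "change of variables" at shutdown: wakes every blocked acquire().
    void shutdown()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            shutting_down_ = true;
        }
        changed_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    std::deque<std::shared_ptr<Node>> free_;
    bool shutting_down_ = false;
};
```

Every caller of `acquire()` has to remember to test for `nullptr`. Throwing an exception from the blocked call instead would move that check into a single catch block.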