When socket(2), setsockopt(2), bind(2), listen(2), or accept(2) return an
unexpected error status, it is usually not a good idea to let the campaign
continue. This is especially a problem as the perror(3) message gets lost
in normal campaign output and may be missed by the user.
Change-Id: I92747174e0706a613bedd8c6664cc8d888e07533
Due to the previous DatabaseCampaign fix, this may not be necessary
anymore, but it's nevertheless a good idea to handle thread creation
failures properly.
Change-Id: I8317a77dd5338509727e737040944320e7755ae3
Delay insertion of to-be-sent jobs into m_runningJobs until they are
really sent, as getMessage() won't work anymore (as in: segfault) if
this job is concurrently re-sent (due to campaign end), its result is
received, and deleted in the campaign. This becomes non-hypothetical
with larger values for CLIENT_JOB_LIMIT and CLIENT_JOB_REQUEST_SEC.
Additionally, reinsert the remaining jobs into the input queue if
communication fails, instead of inefficiently delaying redistribution
until the campaign end.
Change-Id: If85e3c8261deda86beb8d4d93343429223753f22
To allow the JobServer to shutdown properly, the accept() loop in
JobServer::run() needs to regularly check whether we're done. This
change introduces a timed, non-blocking variant of accept() into
SocketComm to achieve this.
Change-Id: Id411096be816c4ed6c7b0b37674410e22152eb22
To avoid accessing destroyed resources in CommThreads talking to clients,
we need to properly join them on shutdown. The m_CommMutex becomes a
JobServer member to make sure it isn't destroyed before the JobServer
itself.
Change-Id: I35b9fb93ace08a7a9476650f8f5e93597a3a8aa0
This change cleans up in/out queue synchronization in the job server.
End-of-jobs conditions are now properly signaled through the
SynchronizedQueue, allowing to resume and abort blocked readers when
no more input is expected.
Change-Id: I3eaf37115ccf8c5b5afe3d971c7109cd62b68906
Up until now the JobServer was silently losing jobs and only claiming to be
finished - a workaround for this was to restart the campaign until all jobs
were finished according to the database and the campaign's output.
This change fixes the underlying problem, so a single campaign-run suffices
and does no longer lose any jobs.
Debugging this was awful and took us quite some time...
Change-Id: Ie6c982cc3b2ce11128941f1f13be563bae22565c
Synchronize re-sending jobs in sendPendingExperimentData() and modifying
(or indirectly, via getDone() and the campaign, deleting) jobs in the
m_runningJobs queue.
a) sendPendingExperimentData needs an intact job to serialize and send it.
b) After moving the job to m_doneJobs, it may be retrieved and deleted
by the campaign at any time.
Additionally, receiving a result overwrites the job's contents. This
already may cause breakage in sendPendingExperimentData (a).
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1943 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
Fix 1: A result message with a nonexistent or invalid run ID must be
ignored in any case. 0 is only OK for NEED_WORK messages, clients
communicating a result must know the ID.
Fix 2: Tell the client the run ID in the first place ...
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1692 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
A campaign server now tells all clients a unique run ID (the UNIX timestamp
when it was started). This allows us to ignore results from "old" clients
that talked to another server before, and to tell them to die.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1677 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
The FailBochs client is not linked by the Bochs build system anymore, but
by our cmake scripts (make fail-client):
- All Bochs libraries are merged into libfailbochs.a (a new target
within the Bochs Autotools scripts).
- The previous libfail.a is *not* a merge of all Fail* libraries anymore,
but pulls these in via library dependencies.
Additionally I did a lot of build system cleanup, e.g. additional external
libraries may now be pulled in where they're needed.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1390 8c4709b5-6ec9-48aa-a5cd-a96041d1645a