Commit Graph

48 Commits

Author SHA1 Message Date
010d4a892d DatabaseCampaign: fix finished experiments SQL
The database queries to fetch all unfinished experiments were broken.
The server tried to insert all finished pilot_ids into the temporary
result_ids table and then discard all experiments which have the correct
(finished) count of IDs in this table. This cannot work as the pilot_id
is the only column of result_ids and must be a unique primary key.

As a fix, the count of results is stored as a second field in result_ids
and the result table is now joined against result_ids to check this field.

Change-Id: I6a9fb774825f0cc4ce104c6e51d7b2fe16957aec
2014-03-18 11:18:27 +01:00
5ee96032c9 jobserver: gracefully handle thread creation failures
Due to the previous DatabaseCampaign fix, this may not be necessary
anymore, but it's nevertheless a good idea to handle thread creation
failures properly.

Change-Id: I8317a77dd5338509727e737040944320e7755ae3
2014-02-25 13:32:56 +01:00
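
A minimal sketch of what handling a thread creation failure (commit 5ee96032c9) can look like. It uses std::thread as a stand-in, which is an assumption: the jobserver may use a different threading library, and try_spawn is a hypothetical helper, not a FAIL* function.

```cpp
#include <functional>
#include <iostream>
#include <system_error>
#include <thread>
#include <utility>

// Hypothetical helper (not FAIL* API): try to start a worker thread and
// report failure to the caller instead of letting the exception escape.
// 'out' must not already hold a joinable thread.
bool try_spawn(std::thread &out, std::function<void()> work)
{
    try {
        out = std::thread(std::move(work));
        return true;
    } catch (const std::system_error &e) {
        // std::thread's constructor throws std::system_error if the
        // thread cannot be started (e.g. resource exhaustion).
        std::cerr << "thread creation failed: " << e.what() << '\n';
        return false;
    }
}
```

On failure the caller can back off and retry later, or handle the request in the current thread, rather than crashing.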
25a390970a DatabaseCampaign: avoid table locking
It is necessary to copy pilot IDs of existing results to a temporary table
before fetching undone jobs from the DB: Otherwise, due to MyISAM's
table-level locking, collect_result_thread() will block in INSERT (SHOW
PROCESSLIST state "Waiting for table level lock") until the (streamed)
pilot query finishes.  As one pilot query follows after the other,
collect_result_thread() may even starve until the memory for the
JobServer's "done" queue runs out, resulting in a crash and the loss of all
queued results.

Change-Id: Ib0ec5fa84db466844b1e9aa0e94142b4d336b022
2014-02-25 13:32:55 +01:00
85e3911202 Merge branch 'ubuntu-saucy-fixes' 2014-01-24 17:02:44 +01:00
17e76c140b cpn: needs comm and MySQL at link time
The dependency on fail-comm exists not only at compile time (where it
is due to protobuf header generation), but also at link time.

Change-Id: I2bae51e763d9a385bda94e77df3e88619fa28a30
2014-01-23 14:31:24 +01:00
4cb97a7fa5 formatting, typos, comments, details
Change-Id: Iae5f1acb653a694622e9ac2bad93efcfca588f3a
2014-01-22 13:08:13 +01:00
7591c9edc5 Merge branch 'jobclientserver-fixes' 2014-01-22 13:07:59 +01:00
4e21b42374 cpn: use strtoul for conversion of unsigned ints
As the 32-bit libc6 atoi() caps unsigned values bigger than 2^31-1
(instead of just letting them overflow to the corresponding negative
value, as on x86_64), it must not be used, especially not for the
conversion of 32-bit pointers.

Change-Id: Ie0821a6f4cd04aebd37ea3d4028b63a05373810f
2014-01-21 00:10:56 +01:00
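
The strtoul-based conversion from commit 4e21b42374 could look roughly like the sketch below. parse_u32 is a hypothetical helper, not the name used in the cpn code. The explicit range check matters on platforms where unsigned long is wider than 32 bits.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <limits>

// Hypothetical helper: parse a decimal string into a 32-bit unsigned value.
// Returns false on malformed input or if the value does not fit.
bool parse_u32(const char *s, std::uint32_t &out)
{
    if (s == nullptr || *s == '\0')
        return false;
    // strtoul() accepts a leading minus sign and negates the result;
    // reject it explicitly.
    if (std::strchr(s, '-') != nullptr)
        return false;

    errno = 0;
    char *end = nullptr;
    unsigned long v = std::strtoul(s, &end, 10);
    if (errno == ERANGE || end == s || *end != '\0')
        return false;
    // unsigned long may be 64 bits wide, so check the 32-bit range too.
    if (v > std::numeric_limits<std::uint32_t>::max())
        return false;

    out = static_cast<std::uint32_t>(v);
    return true;
}
```

Unlike atoi(), this reports errors to the caller, and values above 2^31-1 such as "3000000000" are converted without clamping.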
1f6e275e5e jobserver: bugfix: potential race
Delay insertion of to-be-sent jobs into m_runningJobs until they are
really sent, as getMessage() won't work anymore (as in: segfault) if
this job is concurrently re-sent (due to campaign end), its result is
received, and deleted in the campaign.  This becomes non-hypothetical
with larger values for CLIENT_JOB_LIMIT and CLIENT_JOB_REQUEST_SEC.

Additionally, reinsert the remaining jobs into the input queue if
communication fails, instead of inefficiently delaying redistribution
until the campaign end.

Change-Id: If85e3c8261deda86beb8d4d93343429223753f22
2014-01-20 22:48:08 +01:00
73adc71437 jobserver: use non-blocking accept
To allow the JobServer to shutdown properly, the accept() loop in
JobServer::run() needs to regularly check whether we're done.  This
change introduces a timed, non-blocking variant of accept() into
SocketComm to achieve this.

Change-Id: Id411096be816c4ed6c7b0b37674410e22152eb22
2014-01-20 22:48:08 +01:00
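
One common way to build the timed accept() from commit 73adc71437 is to wait on the listening socket with poll() before calling accept(). The sketch below assumes POSIX sockets. timed_accept is a hypothetical name, and the actual SocketComm interface may differ.

```cpp
#include <cerrno>
#include <poll.h>
#include <sys/socket.h>

// Hypothetical helper: wait up to timeout_ms for an incoming connection
// on listen_fd. Returns the connected socket, or -1 on timeout or error;
// on timeout errno is set to ETIMEDOUT.
int timed_accept(int listen_fd, int timeout_ms)
{
    struct pollfd pfd;
    pfd.fd = listen_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == 0) {
        errno = ETIMEDOUT;   // nothing arrived within the timeout
        return -1;
    }
    if (ready < 0)
        return -1;           // errno was set by poll()

    return accept(listen_fd, nullptr, nullptr);
}
```

A server loop can call this repeatedly and check its "done" flag after every timeout. Setting the listening socket to O_NONBLOCK as well keeps accept() from blocking if a pending connection disappears between poll() and accept().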
8671669053 jobserver: join remaining threads on shutdown
To avoid accessing destroyed resources in CommThreads talking to clients,
we need to properly join them on shutdown.  The m_CommMutex becomes a
JobServer member to make sure it isn't destroyed before the JobServer
itself.

Change-Id: I35b9fb93ace08a7a9476650f8f5e93597a3a8aa0
2014-01-20 22:48:08 +01:00
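
The ownership idea in commit 8671669053 can be sketched as follows, using std::thread and std::mutex as stand-ins. WorkerPool is a hypothetical class, not the JobServer itself. The mutex is a member declared before the threads, and all threads are joined before the object goes away.

```cpp
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for a server that owns its worker threads and
// the mutex they share.
class WorkerPool {
public:
    ~WorkerPool() { shutdown(); }

    template <typename F>
    void spawn(F work) { m_threads.emplace_back(std::move(work)); }

    // Join all remaining workers; safe to call more than once.
    void shutdown()
    {
        for (std::thread &t : m_threads)
            if (t.joinable())
                t.join();
        m_threads.clear();
    }

    std::mutex &comm_mutex() { return m_commMutex; }

private:
    // Members are destroyed in reverse declaration order, and shutdown()
    // has joined every worker by then, so the mutex outlives its users.
    std::mutex m_commMutex;
    std::vector<std::thread> m_threads;
};
```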
8505ddbb04 jobserver: synchronization cleanup
This change cleans up in/out queue synchronization in the job server.
End-of-jobs conditions are now properly signaled through the
SynchronizedQueue, allowing blocked readers to be resumed and aborted
when no more input is expected.

Change-Id: I3eaf37115ccf8c5b5afe3d971c7109cd62b68906
2014-01-20 22:48:08 +01:00
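
The end-of-jobs signaling from commit 8505ddbb04 can be illustrated with a small queue that supports a "closed" state. This is a sketch built on the C++ standard library. ClosableQueue and its method names are assumptions and do not reflect the real SynchronizedQueue interface.

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical queue with an explicit end-of-input signal.
template <typename T>
class ClosableQueue {
public:
    void enqueue(T item)
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_items.push_back(std::move(item));
        }
        m_notEmpty.notify_one();
    }

    // Signal that no more items will arrive; wakes all blocked readers.
    void close()
    {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_closed = true;
        }
        m_notEmpty.notify_all();
    }

    // Blocks until an item is available, or until the queue is closed
    // and drained. Returns false only in the latter case.
    bool dequeue(T &out)
    {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_notEmpty.wait(lock, [this] { return !m_items.empty() || m_closed; });
        if (m_items.empty())
            return false;
        out = std::move(m_items.front());
        m_items.pop_front();
        return true;
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_notEmpty;
    std::deque<T> m_items;
    bool m_closed = false;
};
```

Readers loop on dequeue() and stop when it returns false, so no separate "are we done?" flag has to be polled.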
5ac108ea4b Merge branch 'mysql-concurrency-fixes' 2014-01-20 18:35:35 +01:00
8f9ee3fddd DatabaseCampaign: run statistics update when finished
Change-Id: Ib68e54ba82e988db0d2d74ffafa6dc9bd54cd272
2014-01-20 18:34:51 +01:00
33b63651ae DatabaseCampaign: MySQL / concurrency fixes
According to
<http://dev.mysql.com/doc/refman/5.5/en/c-api-threaded-clients.html>,
a MySQL connection handle must not be used concurrently with an open
result set and mysql_use_result() in one thread
(DatabaseCampaign::run()), and mysql_query() in another
(DatabaseCampaign::collect_result_thread()).  This indeed leads to
crashes when bounding the outgoing job queue (SERVER_OUT_QUEUE_SIZE),
and maybe even more insidious effects in other cases.  The solution is
to create separate connections for both threads.

Additionally, call mysql_library_init() before spawning any threads.

Change-Id: I2981f2fdc67c9a2cbe8781f1a21654418f621aeb
2014-01-20 18:34:51 +01:00
9c984b9704 fail/cpn: (Database)Campaign no longer loses jobs
Up until now the JobServer was silently losing jobs and only claiming to be
finished - a workaround for this was to restart the campaign until all jobs
were finished according to the database and the campaign's output.
This change fixes the underlying problem, so a single campaign run suffices
and no longer loses any jobs.
Debugging this was awful and took us quite some time...

Change-Id: Ie6c982cc3b2ce11128941f1f13be563bae22565c
2014-01-15 12:59:13 +01:00
ab9c0edf10 DatabaseCampaign: run jobs for known-outcome exps, too
Although we know that a known_outcome=1 pilot does not exhibit
behavior different from the golden run, the database schema does not
yet know what this behavior looks like (in terms of result-table
column values).  In order to be able to JOIN valid results for all
memory writes in the trace table (fspgroup maps them all onto *one*
pilot per variant), we need to run these experiments, too.

Additionally, don't join the fspgroup table; we only need this one for
result calculations afterwards.

Change-Id: Idcd2991274fede84526b1eee68a231774625d11a
2013-12-05 19:27:44 +01:00
d26fc28fa4 cpn/database: include data_width in the fsppilot during prune step
During the prune step the data_width of the injected location was not
propagated before. It is now stored in fsppilot (database layout change!) and
sent in the fsppilot protobuf message.

Change-Id: I0562f6fc8957adea0f8a9fb63469ca5e3f4b7b2d
2013-09-11 10:27:04 +02:00
9843b520c1 dbcampaign: select multiple variants/benchmark pairs
The variant/benchmark selection can now use SQL LIKE syntax; all unfinished
pilots from all selected variants are sent to the clients. E.g.:

./cored-voter-server  -v x86-cored-voter -b simple-% -p basic

Will select the fsppilots in the variants:

- x86-cored-voter/simple-ip/basic
- x86-cored-voter/simple-instr/basic

The variant and benchmark information is now sent within the
fsppilot.

Change-Id: I287bfcddc478d0b79d89e156d6f5bf8188674532
2013-07-05 10:19:58 +02:00
d9c9b43102 dciao-kernelstructs: several experiment fixes.
The previous fault injection experiment was kind of bullshit. This one
is better in several ways:

- sanity check at injection time (correct IP)
- correct counting of kernel_transistions
- copy whole activation scheme

Change-Id: I014eea4d6fe103bc02ffd7bbca95dc56a1a4d9ea
2013-05-29 16:18:22 +02:00
6789a313a9 DCiAOKernelImporter: different injection semantic.
Is now very similar to normal importer, and may be deleted in the future, but
at the moment, this should be merged, since it is the importer used in the
sobres-2013 paper.

This changes the MySQL Schema. instr1_absolute was introduced.

Change-Id: I1bc2919bd14c335beca6d586b7cc0f80767ad7d5
2013-05-29 16:17:03 +02:00
6d8b3331d8 doxygen: doc generation fixed
Doxygen skips undesired directories and files now. In addition, the
documentation of the "fail" namespace has been fixed. Note that there
are still several warnings (due to incomplete documentations) in the
Doxygen output.

Change-Id: Idad4f1ecff453765b307fa40a5c1cebc0c2ce2bb
2013-05-29 13:34:12 +02:00
880e7a81ff comm: ignore SIGPIPE
This prevents client and server from being sent a SIGPIPE (and
terminating) when the other side unexpectedly closes the connection.
It's way easier to handle this condition when checking the write()
return value, than to do anything smart in a SIGPIPE handler.  More
details:
<http://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly>

Change-Id: I1da5bf5ef79c8b7b00ede976e96ed4f1c560049d
2013-04-29 15:32:12 +02:00
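
The approach from commit 880e7a81ff, ignoring SIGPIPE and checking the write() return value instead, can be sketched as below. write_all is a hypothetical helper for illustration, not the function used in the comm code.

```cpp
#include <cerrno>
#include <cstddef>
#include <signal.h>
#include <unistd.h>

// Ignore SIGPIPE process-wide so that a write() to a closed peer fails
// with EPIPE instead of terminating the process.
void ignore_sigpipe()
{
    signal(SIGPIPE, SIG_IGN);
}

// Hypothetical helper: write the whole buffer. Returns false on error,
// leaving errno as set by write() (e.g. EPIPE if the peer is gone).
bool write_all(int fd, const char *buf, std::size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;   // interrupted by a signal, retry
            return false;
        }
        buf += n;
        len -= static_cast<std::size_t>(n);
    }
    return true;
}
```

With the signal ignored, an unexpectedly closed connection becomes an ordinary error return that the caller can handle next to its other I/O errors.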
0f16f18d75 cosmetics
Change-Id: Ifae805ae1e2dac95324e054af09a7b70f5d5b60c
2013-04-22 14:24:02 +02:00
c24ed774b0 experiments/dciao-kernelstructs: new database driven experiment for DCiAO
The dciao-kernelstructs experiment uses a trace imported by the
DCiAOKernelImporter:

   bin/import-trace -t trace.pb  -i DCiAOKernelImporter --elf-file app.elf

Pruned by the basic method:

   bin/prune-trace

and does CiAO fault injection experiments, where the results are
stored in the database.

Change-Id: I485dc2e5097b3ebaf354241f474ee3d317213707
2013-04-03 10:39:51 +02:00
f18cddc63c DatabaseCampaign: abstract campaign for interaction with MySQL Database
The DatabaseCampaign interacts with the MySQL tables that are created
by the import-trace and prune-trace tools. It offers all unfinished
experiment pilots from the database to the fail-clients. Those clients
send back an experiment-defined protobuf message as a result. The
custom protobuf message must have the form:

   import "DatabaseCampaignMessage.proto";

   message ExperimentMsg {
       required DatabaseCampaignMessage fsppilot = 1;

       repeated group Result = 2 {
          // custom fields
          required int32 bitoffset = 1;
          optional int32 result = 2;
       }
   }

The DatabaseCampaignMessage is the pilot identifier from the
database. For each of the repeated result entries a row in a table is
allocated. The structure of this table is constructed (by protobuf
reflection) from the description of the message. Each field in the
Result group becomes a column in the result table. For the given
example it would be:

    CREATE TABLE result_ExperimentMessage(
           pilot_id INT,
           bitoffset INT NOT NULL,
           result INT,
           PRIMARY KEY(pilot_id)
    )

Change-Id: I28fb5488e739d4098b823b42426c5760331027f8
2013-04-02 09:52:42 +02:00
94214063ac Fixed whitespaces.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2067 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-02-07 00:51:14 +00:00
00f809231f Code cleanup for commit 1963-1965
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2014 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-01-23 14:22:05 +00:00
fc1d21fe53 Bugfix for server-client communication
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1965 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-30 18:13:13 +00:00
d7842c2ad7 The Jobclient can get several jobs with one request
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1963 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-30 16:50:02 +00:00
hsc
127161ef5a bounded job queue (configurable, unbounded by default)
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1945 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-20 15:01:58 +00:00
hsc
e409ae2f76 JobServer: synchronization issues
Synchronize re-sending jobs in sendPendingExperimentData() and modifying
(or indirectly, via getDone() and the campaign, deleting) jobs in the
m_runningJobs queue.

a) sendPendingExperimentData needs an intact job to serialize and send it.
b) After moving the job to m_doneJobs, it may be retrieved and deleted
   by the campaign at any time.

Additionally, receiving a result overwrites the job's contents.  This
already may cause breakage in sendPendingExperimentData (a).

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1943 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-20 15:01:52 +00:00
hsc
1d498a516b JobServer: do not try to talk to a dying minion
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1942 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-20 15:01:49 +00:00
hsc
49d1608969 correct sanity checks for client/server communication
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1933 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-14 13:31:53 +00:00
6f98d64613 bugfix: race condition removed
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1921 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-12 11:46:26 +00:00
hsc
35b1d0203e CampaignManager: destructor / cleanup
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1916 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-10 16:18:40 +00:00
hsc
86ba9cb377 CampaignManager: only instantiate JobServer when needed
As we have a global CampaignManager instance in the fail-cpn library, a
JobServer member variable is not such a good idea.  Essentially, we started
all JobServer threads (which is done in its constructor) within a
fail-client before this commit.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1915 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-10 16:14:06 +00:00
hsc
55dd79cc03 cosmetics
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1849 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-10-26 16:13:36 +00:00
15def480d9 warning-fix in release mode (var not initialized).
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1731 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-10-09 11:10:29 +00:00
hsc
d45965753d bugfix: handle old clients properly
Fix 1: A result message with a nonexistent or invalid run ID must be
ignored in any case.  0 is only OK for NEED_WORK messages, clients
communicating a result must know the ID.

Fix 2: Tell the client the run ID in the first place ...

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1692 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-25 16:14:22 +00:00
hsc
7513dacad1 properly deal with clients that talked to another campaign server before
A campaign server now tells all clients a unique run ID (the UNIX timestamp
when it was started).  This allows us to ignore results from "old" clients
that talked to another server before, and to tell them to die.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1677 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-23 17:28:07 +00:00
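
The run ID check from commit 7513dacad1 can be sketched as follows. All type and function names here are hypothetical, since the real protocol messages are not part of this log. Only the idea (start-up UNIX timestamp as run ID, mismatching results are ignored) comes from the commit message.

```cpp
#include <cstdint>
#include <ctime>

// Hypothetical verdict for an incoming result message.
enum class Verdict { Accept, Ignore };

// Hypothetical server-side state holding the run ID.
struct RunIdCheck {
    std::uint32_t run_id;

    // By default, the run ID is the UNIX timestamp at server start.
    RunIdCheck() : run_id(static_cast<std::uint32_t>(std::time(nullptr))) {}
    explicit RunIdCheck(std::uint32_t id) : run_id(id) {}

    // A result is only accepted if it carries this server's run ID;
    // anything else comes from a client that talked to another server.
    Verdict check_result(std::uint32_t client_run_id) const
    {
        return client_run_id == run_id ? Verdict::Accept : Verdict::Ignore;
    }
};
```

Under this scheme a result carrying run ID 0 is rejected as well, which matches "Fix 1" in commit d45965753d.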
hsc
f9c96ddf2d prefix internal libraries to avoid naming conflicts with system libraries
This is a precaution to avoid current and future naming conflicts with
common system libraries.  libutil (part of libc) is the first, but probably
not the last example that already caused trouble twice.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1614 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-12 07:52:30 +00:00
hsc
e56918e40e centralized and cmake-based campaign server+port config
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1590 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-04 13:57:01 +00:00
hsc
f992f53d5d spacing
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1585 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-02 10:17:00 +00:00
d9b24a7c60 Changes I made in the l4-sys experiment recently, plus one minor style fix
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1584 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-01 16:05:22 +00:00
c06565aa4e Basic SAL files and makefile modifications for adding gem5.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1457 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-07-17 15:35:29 +00:00
hsc
4a4b3ea7e2 FailBochs build process reversed
The FailBochs client is not linked by the Bochs build system anymore, but
by our cmake scripts (make fail-client):
 -  All Bochs libraries are merged into libfailbochs.a (a new target
    within the Bochs Autotools scripts).
 -  The previous libfail.a is *not* a merge of all Fail* libraries anymore,
    but pulls these in via library dependencies.

Additionally I did a lot of build system cleanup, e.g. additional external
libraries may now be pulled in where they're needed.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1390 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-06-29 22:22:41 +00:00
2575604b41 Fail* directories reorganized, Code-cleanup (-> coding-style), Typos+comments fixed.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1321 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-06-08 20:09:43 +00:00