Instead of issuing a query for every variant, we assemble a set of
variant ids and query `WHERE variant_id IN (1,...)'. This not only
gives the database more optimization potential, but also ensures the
query is issued before any results can come back, avoiding an overfull
receive queue within the job server.
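A minimal sketch of the batched query construction (the helper,
"fsppilot", and the surrounding code are illustrative, not the actual
implementation):

#include <mysql/mysql.h>
#include <sstream>
#include <vector>

// Build one IN (...) query instead of one query per variant.
void query_pilots(MYSQL *conn, const std::vector<int> &variant_ids)
{
    std::ostringstream query;
    query << "SELECT * FROM fsppilot WHERE variant_id IN (";
    for (size_t i = 0; i < variant_ids.size(); ++i) {
        if (i) query << ",";
        query << variant_ids[i];
    }
    query << ")";
    mysql_query(conn, query.str().c_str());
}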
Change-Id: I5b1c60f92b97741ce26d9e50760b601929cef44f
Since we know for which variant we want the completed pilots, we do
not have to fetch all pilot_ids, but only those of pilots that are
finished and have the correct variant_id. This speeds up the startup
of the campaign server enormously when there are many completed
campaigns in the database.
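A sketch of the narrowed query, assuming a pilot counts as finished
when a row for it exists in the result table (all names are
illustrative):

#include <mysql/mysql.h>
#include <cstdio>

// Fetch only finished pilots of one variant instead of all pilot_ids.
void query_finished_pilots(MYSQL *conn, int variant_id)
{
    char q[256];
    snprintf(q, sizeof(q),
        "SELECT p.id FROM fsppilot p "
        "JOIN result_ExperimentMsg r ON r.pilot_id = p.id "
        "WHERE p.variant_id = %d", variant_id);
    mysql_query(conn, q);
}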
Change-Id: I8be584a2dd6d8d7315f30dcb5bff89647353001e
This change makes the DatabaseCampaign load all pilot_ids from the result
table in memory instead of LEFT JOINing them for each variant. This vastly
improves campaign speed (possibly making commit 5567c59 superfluous) at the
cost of slightly increased startup time for half-completed (large)
campaigns.
By exploiting the generally contiguous nature of pilot IDs with a
boost::icl::interval_map, we keep the additional memory requirements
insignificant, as the sketch below illustrates.
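A standalone sketch of the interval trick (the real code keys on
pilot_ids streamed from the result table):

#include <boost/icl/interval_map.hpp>
#include <iostream>

int main()
{
    // Mostly-contiguous IDs collapse into very few map entries.
    boost::icl::interval_map<unsigned, unsigned> seen;
    const unsigned ids[] = {1, 2, 3, 4, 10, 11, 12};
    for (unsigned id : ids)
        seen += std::make_pair(
            boost::icl::interval<unsigned>::right_open(id, id + 1), 1u);

    std::cout << "intervals stored: " << seen.iterative_size() << "\n"; // 2
    std::cout << "pilot 11 done: " << boost::icl::contains(seen, 11u) << "\n";
}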
Change-Id: I1e744fb9ca33efea77a2a785cea3c94106f360df
When no variants matching the command line parameters were found, the
campaign printed an uninitialized sent_pilots count.
Change-Id: Ib1d70ae86f02059daeb9a62567d6c83802e4986e
If the queue for outbound jobs is not unlimited, experiment rows are
fetched from the DB server continuously as experiments finish. When this
takes too long, the connection to the DB server can be lost. The code
did not check for a mysql_error and assumed the result set had been
fetched completely, thus skipping a potentially large number of
experiments (in our case only ~20000 of 400000+ experiments were run).
This change adds checks to determine whether the result fetch loop
finished due to an error, and compares the sent pilot count against the
unfinished experiment count. Additionally, the mysql result object is
now correctly freed.
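A sketch of the distinction the fetch loop needs to make (simplified;
fetch_all and its error path are illustrative):

#include <mysql/mysql.h>
#include <cstdio>
#include <cstdlib>

void fetch_all(MYSQL *conn, MYSQL_RES *res)
{
    MYSQL_ROW row;
    while ((row = mysql_fetch_row(res)) != 0) {
        /* ... enqueue experiment ... */
    }
    // mysql_fetch_row() returns NULL both at the end of the result set
    // and on a lost connection; only mysql_errno() tells them apart.
    if (mysql_errno(conn)) {
        fprintf(stderr, "mysql_fetch_row: %s\n", mysql_error(conn));
        exit(EXIT_FAILURE);
    }
    mysql_free_result(res);  // previously leaked
}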
The underlying problem of MySQL connection loss can hopefully be prevented by
increasing timeouts in the MySQL config as described in doc/how-to-build.txt.
To prevent the problem from occurring when this is forgotten, this change
reverts the default job queue length to be unlimited (SERVER_OUT_QUEUE_SIZE=0),
at the cost of increased memory usage.
Change-Id: I09d9faddd8190c6dd5fbe733a0679a733d5837ec
The database queries to fetch all unfinished experiments were broken.
The server tried to insert all finished pilot_ids into the temporary
result_ids table and then discard all experiments which have the correct
(finished) count of IDs in this table. This cannot work as the pilot_id
is the only column of result_ids and must be a unique primary key.
As a fix, the count of results is stored as a second field in result_ids
and the result table is now joined against result_ids to check this field.
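A sketch of the fixed scheme (exact table and column names are
assumptions based on this message):

#include <mysql/mysql.h>

// result_ids carries the per-pilot result count as a second field, so
// multiple results per pilot no longer violate the primary key.
void build_result_ids(MYSQL *conn)
{
    mysql_query(conn,
        "CREATE TEMPORARY TABLE result_ids ("
        " pilot_id INT NOT NULL PRIMARY KEY,"
        " result_count INT NOT NULL)");
    mysql_query(conn,
        "INSERT INTO result_ids "
        "SELECT pilot_id, COUNT(*) FROM result_ExperimentMsg "
        "GROUP BY pilot_id");
    // Unfinished experiments: no row in result_ids, or a result_count
    // below the expected number of results.
}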
Change-Id: I6a9fb774825f0cc4ce104c6e51d7b2fe16957aec
When socket(2), setsockopt(2), bind(2), listen(2), or accept(2)
returns an unexpected error status, it is usually not a good idea to
let the campaign continue. This is especially problematic as the
perror(3) message gets lost in normal campaign output and may be
missed by the user.
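A sketch of the fail-fast pattern (the actual setup code differs):

#include <cstdio>
#include <cstdlib>
#include <sys/socket.h>

int create_listen_socket(void)
{
    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0) {
        perror("socket");
        exit(EXIT_FAILURE);  // previously: message lost, campaign went on
    }
    /* setsockopt(2), bind(2), listen(2), accept(2) follow the same
       pattern */
    return s;
}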
Change-Id: I92747174e0706a613bedd8c6664cc8d888e07533
Due to the previous DatabaseCampaign fix, this may not be necessary
anymore, but it's nevertheless a good idea to handle thread creation
failures properly.
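A sketch of proper handling (assuming pthreads; the thread API
actually used may differ):

#include <pthread.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Abort on thread creation failure instead of running without the thread.
pthread_t spawn_or_die(void *(*fn)(void *), void *arg)
{
    pthread_t t;
    int err = pthread_create(&t, 0, fn, arg);
    if (err != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(err));
        exit(EXIT_FAILURE);
    }
    return t;
}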
Change-Id: I8317a77dd5338509727e737040944320e7755ae3
It is necessary to copy the pilot IDs of existing results to a
temporary table before fetching undone jobs from the DB: Otherwise, due
to MyISAM's table-level locking, collect_result_thread() will block in
INSERT (SHOW PROCESSLIST state "Waiting for table level lock") until
the (streamed) pilot query finishes. As one pilot query follows
another, collect_result_thread() may even starve until the memory for
the JobServer's "done" queue runs out, resulting in a crash and the
loss of all queued results.
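A sketch of the snapshot step (table names are assumptions):

#include <mysql/mysql.h>

// Snapshot existing result pilot IDs first: the long-running, streamed
// pilot query then read-locks only fsppilot and the temporary table,
// so collect_result_thread()'s INSERTs into the result table can
// proceed concurrently.
void fetch_undone_jobs(MYSQL *conn)
{
    mysql_query(conn,
        "CREATE TEMPORARY TABLE existing_results AS "
        "SELECT pilot_id FROM result_ExperimentMsg");
    mysql_query(conn,
        "SELECT p.* FROM fsppilot p "
        "LEFT JOIN existing_results e ON e.pilot_id = p.id "
        "WHERE e.pilot_id IS NULL");  // streamed via mysql_use_result()
}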
Change-Id: Ib0ec5fa84db466844b1e9aa0e94142b4d336b022
Previously the code did not handle equivalence classes that consist
of only one instruction (length 1). As such classes come up, for
example, at two consecutive read instructions, we have to handle them.
Change-Id: Ib9e475a782828a380dfc79f5b390ca9192f4b8e3
As we gain some degrees of freedom in the choice of the specific
injection instruction offset, this can be used to minimize navigation
costs. This is a first approach towards pruning-aware injection
points.
To do so, we need to modify the SQL query that fetches the pilots: we
additionally join with the trace table to get begin and end
information for the equivalence classes, which is fed into the
creation of InjectionPoints, as sketched below.
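The join might look roughly like this (the trace-table column names
are assumptions):

// Extend the pilot query with the equivalence-class bounds from the
// trace table (illustrative names):
const char *pilot_query =
    "SELECT p.*, t.begin_instr, t.end_instr "
    "FROM fsppilot p JOIN trace t ON t.pilot_id = p.id";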
Change-Id: I343b712dfcbed1299121f02eee9ce1b136a7ff15
As the InjectionPoint is considered a container for abstract "points
in time" that can be navigated to, not every InjectionPointHops object
needs a smart-hopping calculator.
Change-Id: I150a46cf79a2b9d8ddb2d24a6d89dc3d4246cdb3
As atoi(3) caps the value of an unsigned int bigger than 2^31 - 1
instead of just letting it overflow to the corresponding negative
value on machines with 32-bit integers, it must not be used for
parsing unsigned ints.
TODO: Also apply this fix to all other unsigned values (in the
database) which get parsed by atoi.
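A standalone demonstration of the difference (behavior as observed
with glibc on a 32-bit target):

#include <cstdio>
#include <cstdlib>

int main()
{
    const char *s = "4000000000";  // fits in unsigned int, not in int
    // 32-bit glibc atoi() saturates at 2147483647 here:
    printf("atoi:    %u\n", (unsigned int) atoi(s));
    // strtoul() covers the full unsigned range:
    printf("strtoul: %lu\n", strtoul(s, 0, 10));
    return 0;
}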
Change-Id: I96e29b14d36479ab6e567c527a40feb0b5fb14e5
As these tools work closely together with fail components, it's
easiest to build them in this context. As these tools don't really
matter for fail use, they might never be pushed to the master branch.
Change-Id: I8c8bd80376d0475f08a531a995d829e85032371b
The dependency on fail-comm exists not only at link time, but also at
compile time (the latter is due to protobuf header generation).
Change-Id: I2bae51e763d9a385bda94e77df3e88619fa28a30
As, for the pandaboard, we need to deliver a hop chain to the
fail-client to navigate quickly to the injection instruction, this
commit adds a generic wrapper for an injection point.
For now we only have the two options hop chain and instruction offset,
so it is activated via a CMake ON/OFF switch.
Change-Id: Ic01a07a30ac386d4316e6d6d271baf1549db966a
As 32-bit libc6 atoi() caps the values of unsigned ints bigger than
2^31 - 1 (instead of just letting them overflow to the corresponding
negative value, as on x86_64), it especially must not be used for the
conversion of 32-bit pointers.
Change-Id: Ie0821a6f4cd04aebd37ea3d4028b63a05373810f
Delay the insertion of to-be-sent jobs into m_runningJobs until they
are really sent, as getMessage() won't work anymore (as in: segfault)
if this job is concurrently re-sent (due to campaign end), its result
is received, and the job is deleted in the campaign. This becomes
non-hypothetical with larger values for CLIENT_JOB_LIMIT and
CLIENT_JOB_REQUEST_SEC.
Additionally, reinsert the remaining jobs into the input queue if
communication fails, instead of inefficiently delaying their
redistribution until the campaign end.
Change-Id: If85e3c8261deda86beb8d4d93343429223753f22
To allow the JobServer to shut down properly, the accept() loop in
JobServer::run() needs to regularly check whether we're done. This
change introduces a timed, non-blocking variant of accept() into
SocketComm to achieve this.
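A sketch of such a timed accept() (the actual SocketComm signature
may differ):

#include <sys/select.h>
#include <sys/socket.h>

// Wait up to timeout_ms for a connection; return -1 on timeout or
// error so the caller can re-check its shutdown flag.
int timed_accept(int listen_fd, int timeout_ms)
{
    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(listen_fd, &rfds);
    struct timeval tv;
    tv.tv_sec  = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (select(listen_fd + 1, &rfds, 0, 0, &tv) <= 0)
        return -1;
    return accept(listen_fd, 0, 0);
}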
Change-Id: Id411096be816c4ed6c7b0b37674410e22152eb22
To avoid accessing destroyed resources in CommThreads talking to clients,
we need to properly join them on shutdown. The m_CommMutex becomes a
JobServer member to make sure it isn't destroyed before the JobServer
itself.
Change-Id: I35b9fb93ace08a7a9476650f8f5e93597a3a8aa0
This change cleans up the in/out queue synchronization in the job
server. End-of-jobs conditions are now properly signaled through the
SynchronizedQueue, allowing blocked readers to resume and abort when
no more input is expected.
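A minimal sketch of the signaling scheme (using C++11 primitives for
brevity; the real SynchronizedQueue's interface may differ):

#include <deque>
#include <mutex>
#include <condition_variable>

template <typename T>
class SynchronizedQueue {
    std::deque<T> q;
    std::mutex m;
    std::condition_variable cv;
    bool no_more_input = false;
public:
    void enqueue(T x) {
        std::lock_guard<std::mutex> l(m);
        q.push_back(std::move(x));
        cv.notify_one();
    }
    void signal_end() {      // "no more jobs will ever arrive"
        std::lock_guard<std::mutex> l(m);
        no_more_input = true;
        cv.notify_all();     // wake all blocked readers so they can abort
    }
    bool dequeue(T &out) {   // returns false on end-of-jobs
        std::unique_lock<std::mutex> l(m);
        cv.wait(l, [&]{ return !q.empty() || no_more_input; });
        if (q.empty()) return false;
        out = std::move(q.front());
        q.pop_front();
        return true;
    }
};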
Change-Id: I3eaf37115ccf8c5b5afe3d971c7109cd62b68906
According to
<http://dev.mysql.com/doc/refman/5.5/en/c-api-threaded-clients.html>,
a MySQL connection handle must not be used concurrently with an open
result set and mysql_use_result() in one thread
(DatabaseCampaign::run()), and mysql_query() in another
(DatabaseCampaign::collect_result_thread()). This indeed leads to
crashes when bounding the outgoing job queue (SERVER_OUT_QUEUE_SIZE),
and maybe to even more insidious effects in other cases. The solution
is to create separate connections for both threads.
Additionally, call mysql_library_init() before spawning any threads.
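A sketch of the resulting setup (connection parameters omitted):

#include <mysql/mysql.h>

int main()
{
    mysql_library_init(0, 0, 0);  // before spawning any threads

    // One handle per thread: a handle with an open mysql_use_result()
    // result set must not see mysql_query() calls from another thread.
    MYSQL *conn_run     = mysql_init(0);  // DatabaseCampaign::run()
    MYSQL *conn_collect = mysql_init(0);  // collect_result_thread()
    /* mysql_real_connect() both handles, hand each to its thread; each
       spawned thread also calls mysql_thread_init()/_end(). */
}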
Change-Id: I2981f2fdc67c9a2cbe8781f1a21654418f621aeb
Up until now the JobServer was silently losing jobs and only claiming
to be finished; a workaround was to restart the campaign until all
jobs were finished according to the database and the campaign's
output.
This change fixes the underlying problem, so a single campaign run
suffices and no longer loses any jobs.
Debugging this was awful and took us quite some time...
Change-Id: Ie6c982cc3b2ce11128941f1f13be563bae22565c
Although we know that a known_outcome=1 pilot does not exhibit
behavior different from the golden run, the database schema does not
yet know what this behavior looks like (in terms of result-table
column values). In order to be able to JOIN valid results for all
memory writes in the trace table (fspgroup maps them all onto *one*
pilot per variant), we need to run these experiments, too.
Additionally, don't join the fspgroup table; we only need this one for
result calculations afterwards.
Change-Id: Idcd2991274fede84526b1eee68a231774625d11a
Previously, the data_width of the injected location was not
propagated during the prune step. It is now stored in fsppilot
(database layout change!) and sent in the fsppilot protobuf message.
Change-Id: I0562f6fc8957adea0f8a9fb63469ca5e3f4b7b2d
The variant/benchmark selection can now use SQL LIKE syntax; all
unfinished pilots from all selected variants are sent to the clients.
E.g.,
./cored-voter-server -v x86-cored-voter -b simple-% -p basic
will select the fsppilots in the variants:
- x86-cored-voter/simple-ip/basic
- x86-cored-voter/simple-instr/basic
The variant and benchmark information is now sent within the
fsppilot.
Change-Id: I287bfcddc478d0b79d89e156d6f5bf8188674532
The previous fault injection experiment was deeply flawed. This one
is better in several ways:
- sanity check at injection time (correct IP)
- correct counting of kernel_transistions
- copy whole activation scheme
Change-Id: I014eea4d6fe103bc02ffd7bbca95dc56a1a4d9ea
The importer is now very similar to the normal importer and may be
deleted in the future, but at the moment this should be merged, since
it is the importer used in the sobres-2013 paper.
This changes the MySQL schema: instr1_absolute was introduced.
Change-Id: I1bc2919bd14c335beca6d586b7cc0f80767ad7d5
Doxygen now skips undesired directories and files. In addition, the
documentation of the "fail" namespace has been fixed. Note that there
are still several warnings (due to incomplete documentation) in the
Doxygen output.
Change-Id: Idad4f1ecff453765b307fa40a5c1cebc0c2ce2bb
This prevents client and server from being sent a SIGPIPE (and
terminating) when the other side unexpectedly closes the connection.
It's way easier to handle this condition by checking the write()
return value than to do anything smart in a SIGPIPE handler. More
details:
<http://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly>
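A sketch of the idea (whether the signal is ignored globally, as
here, or suppressed per send() via MSG_NOSIGNAL is an implementation
detail not specified in this message):

#include <csignal>

int main()
{
    // Ignore SIGPIPE process-wide: a write() to a closed peer then
    // fails with errno == EPIPE instead of killing the process.
    signal(SIGPIPE, SIG_IGN);
    /* ... socket I/O; every write() return value is checked ... */
    return 0;
}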
Change-Id: I1da5bf5ef79c8b7b00ede976e96ed4f1c560049d
The dciao-kernelstructs experiment uses a trace imported by the
DCiAOKernelImporter:
bin/import-trace -t trace.pb -i DCiAOKernelImporter --elf-file app.elf
pruned by the basic method:
bin/prune-trace
and runs CiAO fault injection experiments whose results are stored in
the database.
Change-Id: I485dc2e5097b3ebaf354241f474ee3d317213707
The DatabaseCampaign interacts with the MySQL tables that are created
by the import-trace and prune-trace tools. It offers all unfinished
experiment pilots from the database to the fail-clients. Those clients
send back a protobuf message, defined by the experiment, as a result.
The custom protobuf message must have the form:
import "DatabaseCampaignMessage.proto";
message ExperimentMsg {
required DatabaseCampaignMessage fsppilot = 1;
repeated group Result = 2 {
// custom fields
required int32 bitoffset = 1;
optional int32 result = 2;
}
}
The DatabaseCampaignMessage is the pilot identifier from the
database. For each of the repeated Result entries, a row in a table is
allocated. The structure of this table is constructed (by protobuf
reflection) from the description of the message. Each field in the
Result group becomes a column in the result table. For the given
example it would be:
CREATE TABLE result_ExperimentMsg (
    pilot_id INT,
    bitoffset INT NOT NULL,
    result INT,
    PRIMARY KEY (pilot_id)
)
Change-Id: I28fb5488e739d4098b823b42426c5760331027f8