Commit Graph

90 Commits

Author SHA1 Message Date
33d40df4bd DatabaseCampaign/-Experiment: add randomjump faults
Taken from experiments/erika-tester.

Change-Id: Ic1aa72d1bfc839009297892cf9e50b3edb53fef9
2020-05-23 22:52:00 +02:00
c34917ca80 Debian 10: MySQL/MariaDB related fixes
Change-Id: I538caf6dedaa785061194a87c7e4965df3839088
2019-10-21 17:14:51 +02:00
8d060ef375 Debian 10: switch to C++14
Some libraries, e.g. protobuf, depend on C++11 now.  As they are
(indirectly) included in some .ah aspect headers, everything has to be
compiled with C++11 enabled now.

This change switches to C++14 globally.

Change-Id: I56a802bd510704d668a2b2c8957e11725fbe98b7
2019-10-21 17:14:51 +02:00
527763e87f JobServer: remove "come again" diagnostic
The "--[Server] No workload, come again..." appears every time a
larger job set is loaded from the database, once for every client that
knocks.  This isn't helpful and scrolls out relevant information,
hence I'm removing it for now.

Change-Id: Ic7ca5b3a0c096b384ba4803df5b482a96bf803b1
2018-08-27 20:20:53 +02:00
8426084e5a CampaignManager: avoid parameter-name clash
The -p parameter is already being used by several campaign servers for
the prune method to restrict to (which was broken in commit
6c120004e), hence allow only --port to choose a different server TCP
port at runtime.

Change-Id: Ia30e40d564e85a9702118dc28df4988ec628e491
2018-08-27 15:08:17 +02:00
3a47b20df2 JobServer: use steady_clock for interval measurement
std::chrono::system_clock is not monotonic, instead use
std::chrono::steady_clock for interval measurements.

Change-Id: I231affecfe8e89481720e47b59132fc838cdf73c
2018-08-03 22:00:23 +02:00
a547b0d5b4 JobServer: print completion percentage and ETA
If the JobServer is provided a total number of experiments by the
campaign, it now prints a completion percentage and an estimated
remaining runtime along the usual progress reports.

Change-Id: Ibd781ba8bff9af3a85683bbd29728216e316da57
2018-08-03 19:53:45 +02:00
f89794329c JobServer: progress-report overhaul
The JobServer progress-report output now shows the total number of
completed jobs instead of the (almost always zero) inbound queue fill
level.  Additionally, the current number of incoming results per
second is shown, which also prepares for an ETA calculation in the
following commit.

Change-Id: I6b71c45f44b9e6b9b17c059959a90068b51c165c
2018-08-03 19:51:07 +02:00
5d5927a88a DatabaseExperiment: add register FI
Calling the DatabaseCampaign with --inject-registers or
--force-inject-registers now injects into CPU registers.  This is achieved
by reinterpreting data addresses in the DB as addresses within the register
file.  (The mapping between registers and data addresses is implemented in
core/util/llvmdisassembler/LLVMtoFailTranslator.hpp.)  The difference
between --inject-registers and --force-inject-registers is what the
experiment does when a data address is not interpretable as a register: the
former option then injects into memory (DatabaseCampaignMessage,
RegisterInjectionMode AUTO), the latter skips the injection altogether
(FORCE).

Currently only compiles together with the Bochs backend; the
DatabaseExperiment's redecodeCurrentInstruction() function must be
moved into the Bochs EEA to remedy this.

Change-Id: I23f152ac0adf4cb6fbe82377ac871e654263fe57
2018-07-24 09:45:00 +02:00
baaa6c3ce8 JobClient/Server fixes
- Retain original CLIENT_RETRY_COUNT semantics after Boost::Asio
  switch
- JobClient is C++11 now, too
- Message reception copy/paste error fixes

Change-Id: I19c474b2a79cd2ac8657e8d58d6170202d096fb0
2018-05-09 17:43:28 +02:00
9272c5cbed Move JobClient to Boost::asio as well
I did this mainly so server and client use a common networking API
IMO, using Boost::asio results in nicer name-lookup code.
Since no longer needed, I removed the SocketComm stuff.
The client is still synchronous; I see no benefit in having it
asynchronous.

I'm not super happy with the random backoff by the clients, if they
can't connect to the server. It makes the code really messy, 3 retries
is totally arbitrary, as is the backup windows. I believe launching
the server and clients in the correct order should be handled by a
launch script
Change-Id: Ifea64919fc228aa530c90449686f51bf63eb70e7
2018-05-09 17:41:52 +02:00
9ae8123433 JobServer: fix C++14 dependency
The recent Boost.Asio overhaul requires C++14 features, not only C++11.

Change-Id: I6decf0e6532956f7061d8a9021ec2c8406679266
2018-05-03 16:28:26 +02:00
6c120004eb Use boost-asio to improve FAIL* server performance
This patch overhauls the FAIL* server code to leverage Boost asio to be able to
handle a large number of clients (>4000). In this implementation the server is
now single threaded. I've not encountered any problems with this for up to
about 10k clients. Boost ASIO can also be used multithreaded, but I assume the
FAIL* internal data structures (Synchronized*) will become a bottleneck first.

The code now additionally depends on Boost Coro and Boost Context, as well as
a C++ 14 compiler, although the only C++14 feature required is a lambda capture
with initializer, such as [ x = std::move(x) ]. gcc-4.9.2 does this.

The code could (and probably should) be cleaned up more. Comments are wordy,
code is unnecessary now (multiple server threads), code is not self-contained
(headers spread dependencies), many ifdef's (server performance measuring
should be runtime rather than a compile time option), and much more. But for
this patch I was going for a minimal changeset the get the functionality in,
to have an easier review. Alas, FAIL* has no Unit-test suite to run the changes
against.

To handle such a large number of clients more changes were necessary, for
example server status output is now performed every 1s, instead for every
request.

The class Minion was removed completely; the only thing it was doing was
encapsulate an int.

The server has now a runtime-configurable port, or it can select a free port on
its own if none is specified. This requires the CampaignManager to add a port
argument and instantiate the JobServer dynamically.

Change-Id: Iad9238972161f95f5802bd2251116f8aeee14884
2017-09-15 06:26:14 +02:00
ad558abeb6 DatabaseCampaign/-Experiment: add burst faults
This change introduces the ability to inject burst faults to
the DatabaseCampaign/-Experiment and thus to all derived
campaigns/experiments.

Change-Id: I491d021ed3953562bd7c908e9de50d448bc8ef33
2016-03-11 19:01:17 +01:00
e99e4aafa8 JobServer: initialize sockaddr_in
This most probably is not a real problem, but does not take much work
to fix.  Found by Coverity Scan, in several reports.

Change-Id: I8bd12e3f7afeb4b1c4e1b057bdbd95da9aa9211c
2015-02-07 18:20:39 +01:00
8c2b6cf028 JobServer: fix socket leaks
Found by Coverity Scan, CID 25600.

Change-Id: Ic0c549928ce8058c145d178ed06b41b543676460
2015-02-07 18:20:30 +01:00
fe9e25374a CampaignManager: initialize campaign member
Found by Coverity Scan, CID 25798.

Change-Id: Ib310ca3198c78a8e01d044d90ada1cd0c22b26d6
2015-02-07 17:29:29 +01:00
412ecbba63 dbcampaign: skip existing pilots with wrong fspmethod
Loading existing pilots with a different fspmethod_id is a waste of
time.

Change-Id: I3519a14822029999fa2ed854daff9853c0cbeec1
2015-01-21 14:53:33 +01:00
d58694521c dbcampaign: don't include fspmethod/variant ID in job msg
These IDs don't make sense by themselves but only after a lookup in the
database, which clients usually don't have (and don't need) access to.

Conflicts:
	src/core/comm/DatabaseCampaignMessage.proto.in

Change-Id: Ice739463552039b7fb48581722ea2e05984cea47
2015-01-21 14:53:32 +01:00
c422911741 dbcampaign: allow wildcard for prune method
Using mixed pruning methods now does not require to run the campaign
server twice anymore.

Change-Id: I3f62c269166b750892bb0e659ad0c180425d1479
2015-01-21 14:53:32 +01:00
0208e80dbb Merge branch 'sampling'
Conflicts:
	src/core/cpn/DatabaseCampaign.cc

Change-Id: Ic11d9ce26546bccba11768383a8fda6a3458530f
2014-09-08 15:36:21 +02:00
42182591e5 fix compiler warnings
(DatabaseCampaign; llvmdisassembler)

Change-Id: Ic31758018a0a1ff0ceac81f781eecfc5f8060f89
2014-08-28 12:12:38 +02:00
61050b3760 db-campaign: load completed pilots in one query
Instead of issuing a query for every variant, we assemble a set of
variant ids and query `WHERE variant_id in (1,...)'. This has not only
the effect of higher optimization potential for the database, but also
the query is issued before any result can come back. This will avoid an
overfull receive queue within the job server.

Change-Id: I5b1c60f92b97741ce26d9e50760b601929cef44f
2014-08-25 13:50:20 +02:00
268d9d4658 db-campaign: Do only load completed pilots from variant
Since we know for which variant we want to have the completed pilots, we
do not have to catch all pilot_ids but only those who of pilots that are
finished and have the correct variant_id. This speeds the startup of the
campaign server enormously when having many completed campaigns in the
database.

Change-Id: I8be584a2dd6d8d7315f30dcb5bff89647353001e
2014-08-25 12:48:10 +02:00
2100001497 DatabaseCampaign: use more flexible get_variants()
This change allows DatabaseCampaign users to take advantage of the
improved variant selection methods in the Database class (multiple
uses of --variant/--benchmark possible, plus
--exclude-variant/--exclude-benchmark switches).

Change-Id: Idb1ca04538ff7601b3648cd9ba766aa8690fff6b
2014-07-03 15:49:05 +02:00
0e1ed1feab prune-trace+DBCampaign: default to variant/benchmark %
If no --variant / --benchmark is specified, it's more reasonable to
prune or run *all* variants/benchmarks (using the wildcard "%")
instead of defaulting to "none"/"none".  The trivial case with only
one single variant/benchmark (which may still be "none"/"none" if
import-trace's default is used) is still covered by this new default
behavior.

Change-Id: I0e9001137d5e052183dd74211e2edbcfab749528
2014-07-03 15:42:25 +02:00
c827750090 Merge branch 'failpanda'
Conflicts:
	src/core/comm/DatabaseCampaignMessage.proto.in
	src/core/cpn/CMakeLists.txt
	src/core/cpn/DatabaseCampaign.cc
	src/core/sal/ConcreteCPU.hpp
	src/core/sal/SALConfig.hpp
	src/core/util/CMakeLists.txt

Change-Id: Id86b93d0e3ea4d9963fcc88605eec0603575ec83
2014-06-03 12:24:49 +02:00
277958b31b cleanups
Change-Id: I8022d937477668253c613e97c3a579ae65084b1e
2014-06-03 11:47:20 +02:00
cfd99fe3af DatabaseCampaign: load completed pilots in memory
This change makes the DatabaseCampaign load all pilot_ids from the result
table in memory instead of LEFT JOINing them for each variant.  This vastly
improves campaign speed (possibly making commit 5567c59 superfluous) at the
cost of slightly increased startup time for half-completed (large)
campaigns.

By exploiting the generally continuous nature of pilot IDs and using a
boost::icl::interval_map, the additional memory requirements are
insignificant.

Change-Id: I1e744fb9ca33efea77a2a785cea3c94106f360df
2014-04-27 19:04:05 +02:00
a8611d1ec0 DatabaseCampaign: fix log output
When no variants matching the command line parameters were found, the
campaign printed an uninitialized sent_pilots count.

Change-Id: Ib1d70ae86f02059daeb9a62567d6c83802e4986e
2014-04-27 19:04:05 +02:00
77b9b08a89 revert accidental change
Change-Id: I75d6d7a6e429d6603fd82b1ce99761c2c5c7ac90
2014-04-23 15:44:56 +02:00
2dd38dc524 Merge branch 'master' of ssh://i4gerrit.informatik.uni-erlangen.de:29418/fail 2014-04-23 14:55:40 +02:00
442069dd45 cmake: added missing link-time dependency cpn->util
(for synchronized queue implementations)

Change-Id: I7b32273b8e76a7b7921af117fdf3ca5af2f42553
2014-04-02 10:43:48 +02:00
ed46b0730a Merge branch 'master' of ssh://i4gerrit.informatik.uni-erlangen.de:29418/fail 2014-03-27 14:22:29 +01:00
5567c595fb DatabaseCampaign: experiment completion checks
If the queue for outbound jobs is not unlimited, experiment rows are fetched
from the DB server continuously as experiments finish. When this takes too
long the connection to the DB server can be lost. The code did not check for
a mysql_error and assumed the result set was fetched completely, thus skipping
a potentially large amount of experiments (in our case only ~20000 of 400000+
experiments were run).

This change adds checks to determine if the result fetch loop was finished due
to an error and checks the sent pilot count to the unfinished experiment count.
Additionally, the mysql result object is correctly freed.

The underlying problem of MySQL connection loss can hopefully be prevented by
increasing timeouts in the MySQL config as described in doc/how-to-build.txt.
To prevent the problem from occurring when this is forgotten, this change
reverts the default job queue length to be unlimited (SERVER_OUT_QUEUE_SIZE=0),
at the cost of increased memory usage.

Change-Id: I09d9faddd8190c6dd5fbe733a0679a733d5837ec
2014-03-21 11:36:38 +01:00
010d4a892d DatabaseCampaign: fix finished experiments SQL
The database queries to fetch all unfinished experiments were broken.
The server tried to insert all finished pilot_ids into the temporary
result_ids table and then discard all experiments which have the correct
(finished) count of IDs in this table. This cannot work as the pilot_id
is the only column of result_ids and must be a unique primary key.

As a fix, the count of results is stored as a second field in result_ids
and the result table is now joined against result_ids to check this field.

Change-Id: I6a9fb774825f0cc4ce104c6e51d7b2fe16957aec
2014-03-18 11:18:27 +01:00
dbff3ab236 jobserver: exit completely when socket ops fail
When socket(2), setsockopt(2), bind(2), listen(2), or accept(2) return an
unexpected error status, it is usually not a good idea to let the campaign
continue.  This is especially a problem as the perror(3) message gets lost
in normal campaign output and may be missed by the user.

Change-Id: I92747174e0706a613bedd8c6664cc8d888e07533
2014-03-05 16:48:37 +01:00
5ee96032c9 jobserver: gracefully handle thread creation failures
Due to the previous DatabaseCampaign fix, this may not be necessary
anymore, but it's nevertheless a good idea to handle thread creation
failures properly.

Change-Id: I8317a77dd5338509727e737040944320e7755ae3
2014-02-25 13:32:56 +01:00
25a390970a DatabaseCampaign: avoid table locking
It is necessary to copy pilot IDs of existing results to a temporary table
before fetching undone jobs from the DB: Otherwise, due to MyISAMs
table-level locking, collect_result_thread() will block in INSERT (SHOW
PROCESSLIST state "Waiting for table level lock") until the (streamed)
pilot query finishes.  As one pilot query follows after the other,
collect_result_thread() may even starve until the memory for the
JobServer's "done" queue runs out, resulting in a crash and the loss of all
queued results.

Change-Id: Ib0ec5fa84db466844b1e9aa0e94142b4d336b022
2014-02-25 13:32:55 +01:00
85e3911202 Merge branch 'ubuntu-saucy-fixes' 2014-01-24 17:02:44 +01:00
eab469192c cpn: regard equivalence classes of length 1
Previously the code did not handle equivalence classes, which consist
only of one instruction (length 1). As these classes for example
come up at two consecutive read instructions, we have to handle them.

Change-Id: Ib9e475a782828a380dfc79f5b390ca9192f4b8e3
2014-01-23 18:53:19 +01:00
ba765c16c2 cpn: pruning-aware injection points
As we gain some degrees of freedom in choice of the specific
injection instruction offset, this can be used to minimize
navigational costs. This is a first approach towards pruning-aware
injection points.

To do so, we need to modify the sql query, which gets the pilots,
so we additionally join with the trace table to get begin and
end information for equivalence classes, which are feeded into
the creation of InjectionPoints.

Change-Id: I343b712dfcbed1299121f02eee9ce1b136a7ff15
2014-01-23 18:53:19 +01:00
d7a9a2811d cpn: Not every InjectionPointHops calcs smart-hops
As the InjectionPoint is considered to be a container for abstract
"points in time" which can be navigated to, not every object of
a InjectionPointHops needs a smart-hopping calculator.

Change-Id: I150a46cf79a2b9d8ddb2d24a6d89dc3d4246cdb3
2014-01-23 18:53:19 +01:00
e824e7a0fa cpn: Parsing of unsigned int fixed
As atoi caps the value of a unsigned int bigger than (2^31 - 1) other
than just letting it overflow to the corresponding negative value on
32Bit-integer machines, it must not be used for parsing to unsigned int.

TODO: Also apply this fix to all other unsigned values (in database)
which get parsed by atoi.

Change-Id: I96e29b14d36479ab6e567c527a40feb0b5fb14e5
2014-01-23 18:53:18 +01:00
8b5098abdd tools: added compute-hops and dump-hops tools
As these tools work closely together with fail components, its
easiest, to build them in this context. As these tools don't
really matter for fail use, they might never be pushed to the
master branch.

Change-Id: I8c8bd80376d0475f08a531a995d829e85032371b
2014-01-23 18:53:11 +01:00
17e76c140b cpn: needs comm and MySQL at link time
The dependency on fail-comm exists not only at compile time (the
latter is due to protobuf header generation).

Change-Id: I2bae51e763d9a385bda94e77df3e88619fa28a30
2014-01-23 14:31:24 +01:00
c142818325 cpn: Generic wrapper for injection point
As for the pandaboard to navigate fast to the injection
instruction we need to deliver a hop chain to the fail-client,
this commit adds a generic wrapper for a injection point.
For now we have only the two options hop chain and instruction
offset, so it is activated via a cmake ON/OFF switch.

Change-Id: Ic01a07a30ac386d4316e6d6d271baf1549db966a
2014-01-22 18:04:29 +01:00
4cb97a7fa5 formatting, typos, comments, details
Change-Id: Iae5f1acb653a694622e9ac2bad93efcfca588f3a
2014-01-22 13:08:13 +01:00
7591c9edc5 Merge branch 'jobclientserver-fixes' 2014-01-22 13:07:59 +01:00
4e21b42374 cpn: use strtoul for conversion of unsigned ints
As 32-bit libc6 atoi() caps the value of unsigned ints bigger than
2^31-1 (instead of just letting it overflow to the corresponding
negative value, as on x86_64), it must not be used especially for the
conversion of 32-bit pointers.

Change-Id: Ie0821a6f4cd04aebd37ea3d4028b63a05373810f
2014-01-21 00:10:56 +01:00