christoph/fail - fail

Author	SHA1	Message	Date
Hannes Weisbach	6c120004eb	Use boost-asio to improve FAIL* server performance This patch overhauls the FAIL* server code to leverage Boost asio to be able to handle a large number of clients (>4000). In this implementation the server is now single threaded. I've not encountered any problems with this for up to about 10k clients. Boost ASIO can also be used multithreaded, but I assume the FAIL* internal data structures (Synchronized) will become a bottleneck first. The code now additionally depends on Boost Coro and Boost Context, as well as a C++14 compiler, although the only C++14 feature required is a lambda capture with initializer, such as [ x = std::move(x) ]. gcc-4.9.2 does this. The code could (and probably should) be cleaned up more. Comments are wordy, code is unnecessary now (multiple server threads), code is not self-contained (headers spread dependencies), many ifdef's (server performance measuring should be runtime rather than a compile time option), and much more. But for this patch I was going for a minimal changeset to get the functionality in, to have an easier review. Alas, FAIL has no Unit-test suite to run the changes against. To handle such a large number of clients more changes were necessary, for example server status output is now performed every 1s, instead of for every request. The class Minion was removed completely; the only thing it was doing was encapsulate an int. The server now has a runtime-configurable port, or it can select a free port on its own if none is specified. This requires the CampaignManager to add a port argument and instantiate the JobServer dynamically. Change-Id: Iad9238972161f95f5802bd2251116f8aeee14884	2017-09-15 06:26:14 +02:00
Michael Lenz	ad558abeb6	DatabaseCampaign/-Experiment: add burst faults This change introduces the ability to inject burst faults to the DatabaseCampaign/-Experiment and thus to all derived campaigns/experiments. Change-Id: I491d021ed3953562bd7c908e9de50d448bc8ef33	2016-03-11 19:01:17 +01:00
Horst Schirmeier	e99e4aafa8	JobServer: initialize sockaddr_in This most probably is not a real problem, but does not take much work to fix. Found by Coverity Scan, in several reports. Change-Id: I8bd12e3f7afeb4b1c4e1b057bdbd95da9aa9211c	2015-02-07 18:20:39 +01:00
Horst Schirmeier	8c2b6cf028	JobServer: fix socket leaks Found by Coverity Scan, CID 25600. Change-Id: Ic0c549928ce8058c145d178ed06b41b543676460	2015-02-07 18:20:30 +01:00
Horst Schirmeier	fe9e25374a	CampaignManager: initialize campaign member Found by Coverity Scan, CID 25798. Change-Id: Ib310ca3198c78a8e01d044d90ada1cd0c22b26d6	2015-02-07 17:29:29 +01:00
Horst Schirmeier	412ecbba63	dbcampaign: skip existing pilots with wrong fspmethod Loading existing pilots with a different fspmethod_id is a waste of time. Change-Id: I3519a14822029999fa2ed854daff9853c0cbeec1	2015-01-21 14:53:33 +01:00
Horst Schirmeier	d58694521c	dbcampaign: don't include fspmethod/variant ID in job msg These IDs don't make sense by themselves but only after a lookup in the database, which clients usually don't have (and don't need) access to. Conflicts: src/core/comm/DatabaseCampaignMessage.proto.in Change-Id: Ice739463552039b7fb48581722ea2e05984cea47	2015-01-21 14:53:32 +01:00
Horst Schirmeier	c422911741	dbcampaign: allow wildcard for prune method Using mixed pruning methods now does not require to run the campaign server twice anymore. Change-Id: I3f62c269166b750892bb0e659ad0c180425d1479	2015-01-21 14:53:32 +01:00
Horst Schirmeier	0208e80dbb	Merge branch 'sampling' Conflicts: src/core/cpn/DatabaseCampaign.cc Change-Id: Ic11d9ce26546bccba11768383a8fda6a3458530f	2014-09-08 15:36:21 +02:00
Horst Schirmeier	42182591e5	fix compiler warnings (DatabaseCampaign; llvmdisassembler) Change-Id: Ic31758018a0a1ff0ceac81f781eecfc5f8060f89	2014-08-28 12:12:38 +02:00
Christian Dietrich	61050b3760	db-campaign: load completed pilots in one query Instead of issuing a query for every variant, we assemble a set of variant ids and query `WHERE variant_id in (1,...)'. This has not only the effect of higher optimization potential for the database, but also the query is issued before any result can come back. This will avoid an overfull receive queue within the job server. Change-Id: I5b1c60f92b97741ce26d9e50760b601929cef44f	2014-08-25 13:50:20 +02:00
Christian Dietrich	268d9d4658	db-campaign: Do only load completed pilots from variant Since we know for which variant we want to have the completed pilots, we do not have to catch all pilot_ids but only those of pilots that are finished and have the correct variant_id. This speeds up the startup of the campaign server enormously when having many completed campaigns in the database. Change-Id: I8be584a2dd6d8d7315f30dcb5bff89647353001e	2014-08-25 12:48:10 +02:00
Horst Schirmeier	2100001497	DatabaseCampaign: use more flexible get_variants() This change allows DatabaseCampaign users to take advantage of the improved variant selection methods in the Database class (multiple uses of --variant/--benchmark possible, plus --exclude-variant/--exclude-benchmark switches). Change-Id: Idb1ca04538ff7601b3648cd9ba766aa8690fff6b	2014-07-03 15:49:05 +02:00
Horst Schirmeier	0e1ed1feab	prune-trace+DBCampaign: default to variant/benchmark % If no --variant / --benchmark is specified, it's more reasonable to prune or run all variants/benchmarks (using the wildcard "%") instead of defaulting to "none"/"none". The trivial case with only one single variant/benchmark (which may still be "none"/"none" if import-trace's default is used) is still covered by this new default behavior. Change-Id: I0e9001137d5e052183dd74211e2edbcfab749528	2014-07-03 15:42:25 +02:00
Horst Schirmeier	c827750090	Merge branch 'failpanda' Conflicts: src/core/comm/DatabaseCampaignMessage.proto.in src/core/cpn/CMakeLists.txt src/core/cpn/DatabaseCampaign.cc src/core/sal/ConcreteCPU.hpp src/core/sal/SALConfig.hpp src/core/util/CMakeLists.txt Change-Id: Id86b93d0e3ea4d9963fcc88605eec0603575ec83	2014-06-03 12:24:49 +02:00
Horst Schirmeier	277958b31b	cleanups Change-Id: I8022d937477668253c613e97c3a579ae65084b1e	2014-06-03 11:47:20 +02:00
Horst Schirmeier	cfd99fe3af	DatabaseCampaign: load completed pilots in memory This change makes the DatabaseCampaign load all pilot_ids from the result table in memory instead of LEFT JOINing them for each variant. This vastly improves campaign speed (possibly making commit `5567c59` superfluous) at the cost of slightly increased startup time for half-completed (large) campaigns. By exploiting the generally continuous nature of pilot IDs and using a boost::icl::interval_map, the additional memory requirements are insignificant. Change-Id: I1e744fb9ca33efea77a2a785cea3c94106f360df	2014-04-27 19:04:05 +02:00
Horst Schirmeier	a8611d1ec0	DatabaseCampaign: fix log output When no variants matching the command line parameters were found, the campaign printed an uninitialized sent_pilots count. Change-Id: Ib1d70ae86f02059daeb9a62567d6c83802e4986e	2014-04-27 19:04:05 +02:00
Bjoern Doebel	77b9b08a89	revert accidental change Change-Id: I75d6d7a6e429d6603fd82b1ce99761c2c5c7ac90	2014-04-23 15:44:56 +02:00
Bjoern Doebel	2dd38dc524	Merge branch 'master' of ssh://i4gerrit.informatik.uni-erlangen.de:29418/fail	2014-04-23 14:55:40 +02:00
Horst Schirmeier	442069dd45	cmake: added missing link-time dependency cpn->util (for synchronized queue implementations) Change-Id: I7b32273b8e76a7b7921af117fdf3ca5af2f42553	2014-04-02 10:43:48 +02:00
Bjoern Doebel	ed46b0730a	Merge branch 'master' of ssh://i4gerrit.informatik.uni-erlangen.de:29418/fail	2014-03-27 14:22:29 +01:00
Florian Lukas	5567c595fb	DatabaseCampaign: experiment completion checks If the queue for outbound jobs is not unlimited, experiment rows are fetched from the DB server continuously as experiments finish. When this takes too long the connection to the DB server can be lost. The code did not check for a mysql_error and assumed the result set was fetched completely, thus skipping a potentially large amount of experiments (in our case only ~20000 of 400000+ experiments were run). This change adds checks to determine if the result fetch loop was finished due to an error and checks the sent pilot count to the unfinished experiment count. Additionally, the mysql result object is correctly freed. The underlying problem of MySQL connection loss can hopefully be prevented by increasing timeouts in the MySQL config as described in doc/how-to-build.txt. To prevent the problem from occurring when this is forgotten, this change reverts the default job queue length to be unlimited (SERVER_OUT_QUEUE_SIZE=0), at the cost of increased memory usage. Change-Id: I09d9faddd8190c6dd5fbe733a0679a733d5837ec	2014-03-21 11:36:38 +01:00
Florian Lukas	010d4a892d	DatabaseCampaign: fix finished experiments SQL The database queries to fetch all unfinished experiments were broken. The server tried to insert all finished pilot_ids into the temporary result_ids table and then discard all experiments which have the correct (finished) count of IDs in this table. This cannot work as the pilot_id is the only column of result_ids and must be a unique primary key. As a fix, the count of results is stored as a second field in result_ids and the result table is now joined against result_ids to check this field. Change-Id: I6a9fb774825f0cc4ce104c6e51d7b2fe16957aec	2014-03-18 11:18:27 +01:00
Horst Schirmeier	dbff3ab236	jobserver: exit completely when socket ops fail When socket(2), setsockopt(2), bind(2), listen(2), or accept(2) return an unexpected error status, it is usually not a good idea to let the campaign continue. This is especially a problem as the perror(3) message gets lost in normal campaign output and may be missed by the user. Change-Id: I92747174e0706a613bedd8c6664cc8d888e07533	2014-03-05 16:48:37 +01:00
Horst Schirmeier	5ee96032c9	jobserver: gracefully handle thread creation failures Due to the previous DatabaseCampaign fix, this may not be necessary anymore, but it's nevertheless a good idea to handle thread creation failures properly. Change-Id: I8317a77dd5338509727e737040944320e7755ae3	2014-02-25 13:32:56 +01:00
Horst Schirmeier	25a390970a	DatabaseCampaign: avoid table locking It is necessary to copy pilot IDs of existing results to a temporary table before fetching undone jobs from the DB: Otherwise, due to MyISAMs table-level locking, collect_result_thread() will block in INSERT (SHOW PROCESSLIST state "Waiting for table level lock") until the (streamed) pilot query finishes. As one pilot query follows after the other, collect_result_thread() may even starve until the memory for the JobServer's "done" queue runs out, resulting in a crash and the loss of all queued results. Change-Id: Ib0ec5fa84db466844b1e9aa0e94142b4d336b022	2014-02-25 13:32:55 +01:00
Horst Schirmeier	85e3911202	Merge branch 'ubuntu-saucy-fixes'	2014-01-24 17:02:44 +01:00
Lars Rademacher	eab469192c	cpn: regard equivalence classes of length 1 Previously the code did not handle equivalence classes, which consist only of one instruction (length 1). As these classes for example come up at two consecutive read instructions, we have to handle them. Change-Id: Ib9e475a782828a380dfc79f5b390ca9192f4b8e3	2014-01-23 18:53:19 +01:00
Lars Rademacher	ba765c16c2	cpn: pruning-aware injection points As we gain some degrees of freedom in choice of the specific injection instruction offset, this can be used to minimize navigational costs. This is a first approach towards pruning-aware injection points. To do so, we need to modify the sql query, which gets the pilots, so we additionally join with the trace table to get begin and end information for equivalence classes, which are fed into the creation of InjectionPoints. Change-Id: I343b712dfcbed1299121f02eee9ce1b136a7ff15	2014-01-23 18:53:19 +01:00
Lars Rademacher	d7a9a2811d	cpn: Not every InjectionPointHops calcs smart-hops As the InjectionPoint is considered to be a container for abstract "points in time" which can be navigated to, not every object of an InjectionPointHops needs a smart-hopping calculator. Change-Id: I150a46cf79a2b9d8ddb2d24a6d89dc3d4246cdb3	2014-01-23 18:53:19 +01:00
Lars Rademacher	e824e7a0fa	cpn: Parsing of unsigned int fixed As atoi caps the value of an unsigned int bigger than (2^31 - 1) rather than just letting it overflow to the corresponding negative value on 32Bit-integer machines, it must not be used for parsing to unsigned int. TODO: Also apply this fix to all other unsigned values (in database) which get parsed by atoi. Change-Id: I96e29b14d36479ab6e567c527a40feb0b5fb14e5	2014-01-23 18:53:18 +01:00
Lars Rademacher	8b5098abdd	tools: added compute-hops and dump-hops tools As these tools work closely together with fail components, it's easiest to build them in this context. As these tools don't really matter for fail use, they might never be pushed to the master branch. Change-Id: I8c8bd80376d0475f08a531a995d829e85032371b	2014-01-23 18:53:11 +01:00
Horst Schirmeier	17e76c140b	cpn: needs comm and MySQL at link time The dependency on fail-comm exists not only at compile time (the latter is due to protobuf header generation). Change-Id: I2bae51e763d9a385bda94e77df3e88619fa28a30	2014-01-23 14:31:24 +01:00
Lars Rademacher	c142818325	cpn: Generic wrapper for injection point As for the pandaboard to navigate fast to the injection instruction we need to deliver a hop chain to the fail-client, this commit adds a generic wrapper for an injection point. For now we have only the two options hop chain and instruction offset, so it is activated via a cmake ON/OFF switch. Change-Id: Ic01a07a30ac386d4316e6d6d271baf1549db966a	2014-01-22 18:04:29 +01:00
Horst Schirmeier	4cb97a7fa5	formatting, typos, comments, details Change-Id: Iae5f1acb653a694622e9ac2bad93efcfca588f3a	2014-01-22 13:08:13 +01:00
Horst Schirmeier	7591c9edc5	Merge branch 'jobclientserver-fixes'	2014-01-22 13:07:59 +01:00
Lars Rademacher	4e21b42374	cpn: use strtoul for conversion of unsigned ints As 32-bit libc6 atoi() caps the value of unsigned ints bigger than 2^31-1 (instead of just letting it overflow to the corresponding negative value, as on x86_64), it must not be used especially for the conversion of 32-bit pointers. Change-Id: Ie0821a6f4cd04aebd37ea3d4028b63a05373810f	2014-01-21 00:10:56 +01:00
Horst Schirmeier	1f6e275e5e	jobserver: bugfix: potential race Delay insertion of to-be-sent jobs into m_runningJobs until they are really sent, as getMessage() won't work anymore (as in: segfault) if this job is concurrently re-sent (due to campaign end), its result is received, and deleted in the campaign. This becomes non-hypothetical with larger values for CLIENT_JOB_LIMIT and CLIENT_JOB_REQUEST_SEC. Additionally, reinsert the remaining jobs into the input queue if communication fails, instead of inefficiently delaying redistribution until the campaign end. Change-Id: If85e3c8261deda86beb8d4d93343429223753f22	2014-01-20 22:48:08 +01:00
Horst Schirmeier	73adc71437	jobserver: use non-blocking accept To allow the JobServer to shutdown properly, the accept() loop in JobServer::run() needs to regularly check whether we're done. This change introduces a timed, non-blocking variant of accept() into SocketComm to achieve this. Change-Id: Id411096be816c4ed6c7b0b37674410e22152eb22	2014-01-20 22:48:08 +01:00
Horst Schirmeier	8671669053	jobserver: join remaining threads on shutdown To avoid accessing destroyed resources in CommThreads talking to clients, we need to properly join them on shutdown. The m_CommMutex becomes a JobServer member to make sure it isn't destroyed before the JobServer itself. Change-Id: I35b9fb93ace08a7a9476650f8f5e93597a3a8aa0	2014-01-20 22:48:08 +01:00
Horst Schirmeier	8505ddbb04	jobserver: synchronization cleanup This change cleans up in/out queue synchronization in the job server. End-of-jobs conditions are now properly signaled through the SynchronizedQueue, allowing to resume and abort blocked readers when no more input is expected. Change-Id: I3eaf37115ccf8c5b5afe3d971c7109cd62b68906	2014-01-20 22:48:08 +01:00
Horst Schirmeier	5ac108ea4b	Merge branch 'mysql-concurrency-fixes'	2014-01-20 18:35:35 +01:00
Horst Schirmeier	8f9ee3fddd	DatabaseCampaign: run statistics update when finished Change-Id: Ib68e54ba82e988db0d2d74ffafa6dc9bd54cd272	2014-01-20 18:34:51 +01:00
Horst Schirmeier	33b63651ae	DatabaseCampaign: MySQL / concurrency fixes According to <http://dev.mysql.com/doc/refman/5.5/en/c-api-threaded-clients.html>, a MySQL connection handle must not be used concurrently with an open result set and mysql_use_result() in one thread (DatabaseCampaign::run()), and mysql_query() in another (DatabaseCampaign::collect_result_thread()). This indeed leads to crashes when bounding the outgoing job queue (SERVER_OUT_QUEUE_SIZE), and maybe even more insidious effects in other cases. The solution is to create separate connections for both threads. Additionally, call mysql_library_init() before spawning any threads. Change-Id: I2981f2fdc67c9a2cbe8781f1a21654418f621aeb	2014-01-20 18:34:51 +01:00
Michael Lenz	9c984b9704	fail/cpn: (Database)Campaign no longer loses jobs Up until now the JobServer was silently losing jobs and only claiming to be finished - a workaround for this was to restart the campaign until all jobs were finished according to the database and the campaign's output. This change fixes the underlying problem, so a single campaign-run suffices and does no longer lose any jobs. Debugging this was awful and took us quite some time... Change-Id: Ie6c982cc3b2ce11128941f1f13be563bae22565c	2014-01-15 12:59:13 +01:00
Horst Schirmeier	ab9c0edf10	DatabaseCampaign: run jobs for known-outcome exps, too Although we know that a known_outcome=1 pilot does not exhibit behavior different from the golden run, the database schema does not yet know what this behavior looks like (in terms of result-table column values). In order to be able to JOIN valid results for all memory writes in the trace table (fspgroup maps them all onto one pilot per variant), we need to run these experiments, too. Additionally, don't join the fspgroup table; we only need this one for result calculations afterwards. Change-Id: Idcd2991274fede84526b1eee68a231774625d11a	2013-12-05 19:27:44 +01:00
Christian Dietrich	d26fc28fa4	cpn/database: include data_width in the fsppilot during prune step During the prune step the data_width of the injected location was not propagated before. It is now stored in fsppilot (database layout change!) and sent in the fsppilot protobuf message. Change-Id: I0562f6fc8957adea0f8a9fb63469ca5e3f4b7b2d	2013-09-11 10:27:04 +02:00
Christian Dietrich	9843b520c1	dbcampaign: select multiple variants/benchmark pairs The variant/benchmark selection now can use SQL LIKE syntax, all unfinished pilots from all selected variants are sent to the clients. E.g.: ./cored-voter-server -v x86-cored-voter -b simple-% -p basic Will select the fsppilots in the variants: - x86-cored-voter/simple-ip/basic - x86-cored-voter/simple-instr/basic The variant and benchmark information is now sent within the fsppilot. Change-Id: I287bfcddc478d0b79d89e156d6f5bf8188674532	2013-07-05 10:19:58 +02:00
Christian Dietrich	d9c9b43102	dciao-kernelstructs: several experiment fixes. The previous fault injection experiment was kind of bullshit. This one is better in several ways: - sanity check at injection time (correct IP) - correct counting of kernel_transistions - copy whole activation scheme Change-Id: I014eea4d6fe103bc02ffd7bbca95dc56a1a4d9ea	2013-05-29 16:18:22 +02:00
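
Commits 4e21b42374 and e824e7a0fa above replace atoi() with strtoul() when parsing unsigned values, because atoi() cannot represent values above 2^31-1 in a 32-bit int. The following is a minimal sketch of that idea, not the code from those commits: the helper name parse_u32 and its strict input rules (decimal digits only, no sign, no whitespace) are assumptions made for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical helper (not part of FAIL*): parse a decimal string into a
// 32-bit unsigned value and report failure instead of silently clamping.
bool parse_u32(const char *text, std::uint32_t &out)
{
    // Require a leading digit; this rejects empty input, signs and whitespace,
    // which strtoul() would otherwise accept.
    if (text == nullptr || !std::isdigit(static_cast<unsigned char>(*text)))
        return false;

    errno = 0;
    char *end = nullptr;
    unsigned long value = std::strtoul(text, &end, 10);

    if (*end != '\0')        // trailing characters after the number
        return false;
    if (errno == ERANGE)     // does not fit into unsigned long
        return false;
    if (value > std::numeric_limits<std::uint32_t>::max())
        return false;        // fits unsigned long but not 32 bits

    out = static_cast<std::uint32_t>(value);
    return true;
}

// Small self-check covering values above 2^31-1, the range atoi() cannot hold.
void self_check()
{
    std::uint32_t v = 0;
    assert(parse_u32("2147483648", v) && v == 2147483648u);
    assert(parse_u32("4294967295", v) && v == 4294967295u);
    assert(!parse_u32("4294967296", v));
    assert(!parse_u32("12abc", v));
}
```

Calling self_check() from a main() exercises the helper. The explicit range check matters on platforms where unsigned long is 64 bits wide, since strtoul() alone would accept values that do not fit into 32 bits there.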

78 Commits