The new CLIENT_JOB_INITIAL configuration option makes it possible to
configure the client to request more than one job in the first request
round. With a reasonable initial value, this removes the job ramp-up
after each fail-client restart and slightly improves overall
throughput.
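A minimal sketch of the intended request logic (function and helper
names are assumptions, not Fail*'s actual API):

    // Sketch: skip the ramp-up by requesting a configured number of
    // jobs in the very first round; later rounds fall back to the
    // measured timing statistics.
    unsigned jobsFromTimingStats();      // assumed helper, defined elsewhere

    unsigned jobsToRequest(bool first_round, unsigned client_job_initial)
    {
        if (first_round)
            return client_job_initial;   // value of CLIENT_JOB_INITIAL
        return jobsFromTimingStats();    // based on measured job runtimes
    }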
Change-Id: Idac2721264ec264c520d341fac64a8311a974708
The JobClient currently takes a very long time to shut down after
failing to reach the server in sendResultsToServer() (which is,
unfortunately, by far the most probable point in the code to detect
this):
- A different bug (fixed in the previous commit) caused a far too
  large number of jobs to be fetched beforehand.
- sendResult() (called after each experiment iteration) noticed that
  CLIENT_JOB_REQUEST_SEC seconds had elapsed, and tried to call home
  early to send the first results (without intending to fetch new
  jobs yet).
- If the server was gone (done, or aborted), connect in
sendResultsToServer() failed after several retries and timeouts.
- All subsequent calls to sendResult() retried connecting to the
server (again, with retries and timeouts), once for each remaining
job.
- When all jobs were done, getParam() tried to connect one last time,
  finally telling the experiment that nobody's home.
This resulted in client shutdown times of up to four hours (for the
default CLIENT_JOB_LIMIT of 1000) after the campaign server
terminated. This change solves the issue by not handing out new
(cached) jobs after the connect failed once, making the experiment
terminate quickly.
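A minimal sketch of the fix (class members, types, and helper names
are assumptions based on the description above, not Fail*'s actual
API):

    #include <deque>

    struct ExperimentData { /* job parameters */ };

    class JobClient {
        bool m_server_gone = false;        // set once connect() failed
        std::deque<ExperimentData> m_job_cache;
    public:
        bool getParam(ExperimentData& param)
        {
            if (m_server_gone)             // don't hand out cached jobs;
                return false;              // experiment terminates quickly
            if (!m_job_cache.empty()) {
                param = m_job_cache.front();
                m_job_cache.pop_front();
                return true;
            }
            return fetchNewJobs(param);    // may set m_server_gone
        }
    private:
        bool fetchNewJobs(ExperimentData&); // assumed helper
    };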
Change-Id: I0d8cb2e084d783aca74c51a503fa72eb2b2eb0b7
If we don't properly initialize the job timing statistics, the number
of jobs to be requested in the second request to the server is based
on the wrong timings. In our test case, CLIENT_JOB_LIMIT jobs were
requested at once.
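A sketch of the fix (member names are assumptions):

    // Initialize the job timing statistics so the second request round
    // computes its job count from measured runtimes, not from garbage.
    class JobClient {
        double   m_job_runtime_total;
        double   m_job_throughput;
        unsigned m_jobs_done;
    public:
        JobClient()
            : m_job_runtime_total(0.0), m_job_throughput(0.0), m_jobs_done(0)
        {}
    };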
Change-Id: I7e9d8ab6fe14e4488b3a74baf061d9a07f3a77c4
This change became necessary as we observed weird fail-client SIGSEGV
crashes with both the Bochs and Gem5 backends and with various
experiments.
Some Fail* components are instantiated statically: the
SimulatorController instance "simulator", containing the
ListenerManager and the CoroutineManager, and the active
ExperimentFlow subclass(es)
(experiments/instantiate-experiment*.ah.in). Each experiment is
registered as an active flow in the CoroutineManager at startup.
As plugins (which are ExperimentFlows themselves) are often created on
an experiment's stack, ExperimentFlows deregister themselves on
destruction (e.g., when leaving the plugin variable's scope). The
core problem is that the creation and destruction order of statically
instantiated objects depends on the link order; if the experiment is
destroyed after the CoroutineManager, its automatic self-deregistering
feature talks to the smoking ruins of the latter.
This change removes all static instantiations of ExperimentFlow and
replaces them with heap allocations. Additionally, it makes sure that
the CoroutineManager recognizes when a shutdown is in progress and
refrains from touching potentially already destroyed data structures
when a (mistakenly globally instantiated) ExperimentFlow deregisters
itself in this case.
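A sketch of both halves of the fix (class and member names, including
MyExperiment, are assumptions derived from the description above):

    #include <list>

    class ExperimentFlow { /* ... */ };

    // Half 1: the experiment is now heap-allocated (and intentionally
    // never deleted), so its destructor can no longer run after the
    // CoroutineManager's during static teardown:
    //     ExperimentFlow *flow = new MyExperiment();

    // Half 2: the CoroutineManager ignores deregistrations once a
    // shutdown is in progress.
    class CoroutineManager {
        bool m_shutting_down = false;          // set during teardown
        std::list<ExperimentFlow*> m_flows;
    public:
        void removeFlow(ExperimentFlow *flow)
        {
            if (m_shutting_down)               // data structures may
                return;                        // already be destroyed
            m_flows.remove(flow);
        }
        void shutdown() { m_shutting_down = true; }
    };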
Change-Id: I8a7d42fb141222cd2cce6040ab1a01f9de61be24
The client sends results back earlier (i.e., before all jobs are
done) if the client response time (CLIENT_JOB_REQUEST_SEC) is
exceeded. This ensures that the results of extraordinarily
long-running experiments are reported back before, e.g., the LIDO job
timeout kills the Fail* instance.
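A sketch of the early-reporting check (a simplified stand-in for the
actual logic; the function name is an assumption):

    #include <ctime>

    // Report results early once the client response time
    // (CLIENT_JOB_REQUEST_SEC) is exceeded, even though jobs remain.
    bool shouldReportEarly(std::time_t last_contact, long request_sec)
    {
        return std::time(nullptr) - last_contact >= request_sec;
    }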
Change-Id: I3ada0360ec54b63f80a7008570ca514449720220
Quoting connect(3posix): "If connect() fails, the state of the socket is
unspecified. Conforming applications should close the file descriptor and
create a new socket before attempting to reconnect."
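In practice this means recreating the socket on every attempt; a
minimal sketch:

    #include <sys/socket.h>
    #include <unistd.h>

    // POSIX-conforming reconnect: after a failed connect(), close the
    // file descriptor and create a fresh socket before retrying.
    int connect_with_retries(const struct sockaddr *addr, socklen_t len,
                             int retries)
    {
        for (int i = 0; i < retries; ++i) {
            int fd = socket(AF_INET, SOCK_STREAM, 0);
            if (fd < 0)
                return -1;
            if (connect(fd, addr, len) == 0)
                return fd;  // connected
            close(fd);      // socket state unspecified after failure
        }
        return -1;
    }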
Change-Id: Ibcdcc0f546560a41009832894659a37947243f2f
This prevents client and server from being sent a SIGPIPE (and
terminating) when the other side unexpectedly closes the connection.
It is far easier to handle this condition by checking the write()
return value than by doing anything smart in a SIGPIPE handler. More
details:
<http://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly>
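A sketch of the write()-side handling, using one of the standard
approaches (the commit may use a different variant, e.g. the
MSG_NOSIGNAL flag):

    #include <sys/socket.h>
    #include <csignal>

    // Ignore SIGPIPE process-wide; a closed peer then surfaces as a
    // send() error (errno == EPIPE) instead of killing the process.
    void ignore_sigpipe()
    {
        signal(SIGPIPE, SIG_IGN);
    }

    bool send_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = send(fd, buf, len, 0);
            if (n < 0)
                return false;   // e.g. EPIPE: peer closed the connection
            buf += n;
            len -= (size_t)n;
        }
        return true;
    }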
Change-Id: I1da5bf5ef79c8b7b00ede976e96ed4f1c560049d
The throughput is now calculated as:
  new throughput = 0.5 * old throughput + 0.5 * throughput of the last job set
This prevents excessive variation in the throughput estimate.
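In code this is a simple exponentially weighted moving average (a
sketch; the function name is an assumption):

    // Equal weight for the accumulated estimate and the most recent
    // job set damps outliers caused by a single fast or slow set.
    double updateThroughput(double old_tp, double last_set_tp)
    {
        return 0.5 * old_tp + 0.5 * last_set_tp;
    }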
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2079 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
Since several jobs can be fetched from the server at once, it is
useful to know how many undone jobs are still available locally. This
is accomplished by the new method getNumberOfUndoneJobs().
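A sketch of the accessor (the queue type and member name are
assumptions):

    #include <deque>

    class ExperimentData;

    class JobClient {
        std::deque<ExperimentData*> m_parameters; // not-yet-run jobs
    public:
        // Number of locally cached jobs that have not been run yet.
        int getNumberOfUndoneJobs() const
        {
            return (int)m_parameters.size();
        }
    };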
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2041 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
A campaign server now announces a unique run ID to all clients (the
UNIX timestamp of its start). This allows us to ignore results from
"old" clients that previously talked to another server, and to tell
them to die.
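A sketch of the mechanism (class and method names are assumptions):

    #include <cstdint>
    #include <ctime>

    class CampaignServer {
        const std::uint64_t m_run_id;   // server start timestamp
    public:
        CampaignServer() : m_run_id((std::uint64_t)std::time(nullptr)) {}

        // Results with a different run ID come from a client that
        // talked to an earlier server instance; the caller drops them
        // and tells that client to die.
        bool acceptResult(std::uint64_t client_run_id) const
        {
            return client_run_id == m_run_id;
        }
    };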
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1677 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
This is a precaution to avoid current and future naming conflicts with
common system libraries. libutil (part of libc) is the first, but
probably not the last, such example; it has already caused trouble
twice.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1614 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
The FailBochs client is no longer linked by the Bochs build system,
but by our CMake scripts (make fail-client):
- All Bochs libraries are merged into libfailbochs.a (a new target
  within the Bochs Autotools scripts).
- The previous libfail.a is *not* a merge of all Fail* libraries
  anymore, but pulls these in via library dependencies.
Additionally, I cleaned up the build system considerably; e.g.,
additional external libraries may now be pulled in where they are
needed.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1390 8c4709b5-6ec9-48aa-a5cd-a96041d1645a