Commit Graph

28 Commits

Author SHA1 Message Date
1c774ce50d JobClient: fix retry delay
Only wait for the retry delay if really retrying.

Change-Id: If12bd3745c799edc5933874d9a44d049646e0e87
2018-08-01 14:19:05 +02:00
00882f98ad JobClient: resolve endpoint only once
The JobClient now resolves the server IP once (lazily, when needed) instead
on each connect attempt, reducing the amount of DNS requests sent out.

Change-Id: I9804048d3252da333cb3addbe94a01fdf3c707c8
2018-07-31 12:33:52 +02:00
e63f7376f8 JobClient: connect to IPv4 endpoints only
As long as the JobServer only listens on IPv4 endpoints, it makes no
sense to attempt a connect to an IPv6 endpoint on the client side.

(However, it's 2018 and we should also be capable of using IPv6 on
both the client and server side ...)

Change-Id: I9c3916466c350ce74a31cef3b6ae0e7ac56367c7
2018-07-24 09:16:33 +02:00
baaa6c3ce8 JobClient/Server fixes
- Retain original CLIENT_RETRY_COUNT semantics after Boost::Asio
  switch
- JobClient is C++11 now, too
- Message reception copy/paste error fixes

Change-Id: I19c474b2a79cd2ac8657e8d58d6170202d096fb0
2018-05-09 17:43:28 +02:00
9272c5cbed Move JobClient to Boost::asio as well
I did this mainly so server and client use a common networking API
IMO, using Boost::asio results in nicer name-lookup code.
Since no longer needed, I removed the SocketComm stuff.
The client is still synchronous; I see no benefit in having it
asynchronous.

I'm not super happy with the random backoff by the clients, if they
can't connect to the server. It makes the code really messy, 3 retries
is totally arbitrary, as is the backup windows. I believe launching
the server and clients in the correct order should be handled by a
launch script
Change-Id: Ifea64919fc228aa530c90449686f51bf63eb70e7
2018-05-09 17:41:52 +02:00
9d5737a37d jobclient: fix error reporting
gethostbyname() doesn't set errno but h_errno.  Thus, we need to call
herror() instead of perror() to print an appropriate error message.
(Thanks, Björn.)

Change-Id: I8fd4bdd4af41774dd290151c5ad37090d006f423
2014-06-30 12:21:42 +02:00
4cb97a7fa5 formatting, typos, comments, details
Change-Id: Iae5f1acb653a694622e9ac2bad93efcfca588f3a
2014-01-22 13:08:13 +01:00
de39bf6120 jobclient: use initializer list
Change-Id: I7eb42f947bbabd61e1aad9224cedd7ffceec4f10
2014-01-20 22:48:08 +01:00
5ffcb82138 jobclient: initial number of jobs configurable
The new CLIENT_JOB_INITIAL configuration option allows to configure
the client to request more than one job in the first request round.
If a reasonable initial value is chosen, this removes the job ramp-up
after each fail-client restart, and slightly improves overall
throughput.

Change-Id: Idac2721264ec264c520d341fac64a8311a974708
2014-01-20 22:48:08 +01:00
2c31bf79b0 jobclient: expect communication failures
This change makes the JobClient act properly on communication aborts.

Change-Id: I0a76489f117e9721546215e3b627002605e25452
2014-01-20 22:48:08 +01:00
882d4f381b jobclient: bugfix: faster shutdown at campaign end
The JobClient currently waits a LONG time until it really shuts down
after not having reached the server in sendResultsToServer() (which is
unfortunately the by far most probable point in the code to determine
this):

 -  A different bug (fixed in the previous commit) provoked the
    situation that a (way) too large amount of jobs was fetched
    before.
 -  sendResult() (called after each experiment iteration) realized
    that CLIENT_JOB_REQUEST_SEC seconds are over, and tried to
    prematurely call home to send first results (without planning to
    get new jobs yet).
 -  If the server was gone (done, or aborted), connect in
    sendResultsToServer() failed after several retries and timeouts.
 -  All subsequent calls to sendResult() retried connecting to the
    server (again, with retries and timeouts), once for each remaining
    job.
 -  When all jobs were done, getParam() tries to connect a last time,
    finally telling the experiment that nobody's home.

This resulted in client shutdown times of up to four hours (for the
default CLIENT_JOB_LIMIT of 1000) after the campaign server
terminated.  This change solves the issue by not handing out new
(cached) jobs after the connect failed once, making the experiment
terminate quickly.

Change-Id: I0d8cb2e084d783aca74c51a503fa72eb2b2eb0b7
2014-01-20 22:48:08 +01:00
ee7bc23d85 jobclient: bugfix: initialize timing statistics
If we don't properly initialize the job timing statistics, the number
of jobs to be requested in the second request to the server is based
on the wrong timings.  In our test case, CLIENT_JOB_LIMIT jobs were
requested at once.

Change-Id: I7e9d8ab6fe14e4488b3a74baf061d9a07f3a77c4
2014-01-20 22:48:08 +01:00
12f9915d1c core/efw: send back results earlier
The client sends results back earlier (i.e., before all jobs are
done) if the client response time (CLIENT_JOB_REQUEST_SEC) is
exceeded. This makes sure that extraordinarily long-running
experiments get reported back before, e.g., the LIDO job timeout
kills the Fail* instance.

Change-Id: I3ada0360ec54b63f80a7008570ca514449720220
2013-06-17 17:43:42 +02:00
de754c5f27 comm: handle connect() failures properly
Quoting connect(3posix): "If connect() fails, the state of the socket is
unspecified.  Conforming applications should close the file descriptor and
create a new socket before attempting to reconnect."

Change-Id: Ibcdcc0f546560a41009832894659a37947243f2f
2013-05-29 16:29:09 +02:00
20b70df651 efw/JobClient: knock less often when there are no jobs yet
Change-Id: If769b402a7b00ed3aebedd5f4d0954831a0ee905
2013-04-29 15:34:08 +02:00
880e7a81ff comm: ignore SIGPIPE
This prevents client and server from being sent a SIGPIPE (and
terminating) when the other side unexpectedly closes the connection.
It's way easier to handle this condition when checking the write()
return value, than to do anything smart in a SIGPIPE handler.  More
details:
<http://stackoverflow.com/questions/108183/how-to-prevent-sigpipes-or-handle-them-properly>

Change-Id: I1da5bf5ef79c8b7b00ede976e96ed4f1c560049d
2013-04-29 15:32:12 +02:00
0f16f18d75 cosmetics
Change-Id: Ifae805ae1e2dac95324e054af09a7b70f5d5b60c
2013-04-22 14:24:02 +02:00
fdb39c9613 correction of commit 2079
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2080 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-02-13 13:16:18 +00:00
a2830fa140 JobClient: weighting for throughput calculation
The new troughput is now calculated as:
0.5*old throughput + 0.5* the current throughput of the last job-set.

This prevents excessive variations in the calculation of the new
throughput.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2079 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-02-13 13:11:29 +00:00
94214063ac Fixed whitespaces.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2067 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-02-07 00:51:14 +00:00
179272abea Once an experiment terminates, all results will be sent to the server.
One an experiment terminates, sending the results back to the server will
be initiated by the jobclients destructor.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2042 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-01-31 16:33:43 +00:00
ff2dc189ce Correction of commit 2014
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2019 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-01-24 11:26:26 +00:00
00f809231f Code cleanup for commit 1963-1965
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@2014 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2013-01-23 14:22:05 +00:00
fc1d21fe53 Bugfix for server-client communication
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1965 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-30 18:13:13 +00:00
d7842c2ad7 The Jobclient can get several jobs with one request
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1963 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-30 16:50:02 +00:00
hsc
49d1608969 correct sanity checks for client/server communication
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1933 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-11-14 13:31:53 +00:00
hsc
7513dacad1 properly deal with clients that talked to another campaign server before
A campaign server now tells all clients a unique run ID (the UNIX timestamp
when it was started).  This allows us to ignore results from "old" clients
that talked to another server before, and to tell them to die.

git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1677 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-09-23 17:28:07 +00:00
2575604b41 Fail* directories reorganized, Code-cleanup (-> coding-style), Typos+comments fixed.
git-svn-id: https://www4.informatik.uni-erlangen.de/i4svn/danceos/trunk/devel/fail@1321 8c4709b5-6ec9-48aa-a5cd-a96041d1645a
2012-06-08 20:09:43 +00:00