Message boards : News : Jobs incoming!

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 953 - Posted: 1 Sep 2015, 18:45:03 UTC - in response to Message 952.  

When new patches from Microsoft come in (or there is some other reason to restart a machine), I'm looking for a good point to restart the client.

At the moment, I wait until I see a "fresh" cmsRun come up and then I restart the box; is there a better point? (I want to be sure that the result from the last run has already been uploaded.)

Probably a bit earlier than that, but I haven't studied the sequence in full. As you already know, suspend/resume is not our strong point. Providing you're not doing it every day (or more often!) I don't think the project really cares that much -- not at the moment, anyway. Myself, I usually set "No new tasks" and wait until the current task times out.

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 954 - Posted: 1 Sep 2015, 18:57:34 UTC - in response to Message 953.  
Last modified: 1 Sep 2015, 18:57:53 UTC

Myself, I usually set "No new tasks" and wait until the current task times out.

I would love to be able to set "No new tasks" inside the VM so that it doesn't fetch more jobs. Perhaps a point for the to-do list later on, if this project survives ;-)

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 107
Message 955 - Posted: 1 Sep 2015, 19:11:11 UTC - in response to Message 954.  

Should be easy. We can add a button in the graphics web app that sets a shutdown signal. When the job finishes it will shut down the VM rather than run another job.
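
The shutdown-signal idea can be sketched roughly as follows. This is only an illustration, not the project's actual implementation: `run_jobs`, `fetch_job`, `run_job` and the flag file are hypothetical names, and the sketch assumes the web app's button does nothing more than create a flag file that the job loop checks between jobs.

```python
import os
import tempfile


def run_jobs(fetch_job, run_job, flag_path):
    """Run jobs until none are left or a shutdown flag file appears.

    All names here are hypothetical; the real VM wrapper may differ.
    """
    completed = []
    while True:
        # The flag is checked only between jobs, so the job in
        # progress always finishes and can upload its result.
        if os.path.exists(flag_path):
            return completed, "shutdown requested"
        job = fetch_job()
        if job is None:
            return completed, "no more jobs"
        completed.append(run_job(job))


# Small demonstration with an in-memory queue of pretend jobs.
flag = os.path.join(tempfile.mkdtemp(), "shutdown.flag")
queue = ["job-1", "job-2", "job-3"]


def fetch():
    return queue.pop(0) if queue else None


def run(job):
    # Pretend the volunteer presses the button while this job is running.
    open(flag, "w").close()
    return job + " done"


print(run_jobs(fetch, run, flag))
```

Checking the flag only between jobs is the design choice that matters here: the running job is never interrupted, so its result is not lost when the VM goes down.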

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 956 - Posted: 1 Sep 2015, 19:14:40 UTC - in response to Message 955.  

Should be easy. We can add a button in the graphics web app that sets a shutdown signal. When the job finishes it will shut down the VM rather than run another job.

Yeah, I would love this

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 957 - Posted: 1 Sep 2015, 19:18:00 UTC

I know you have other things to do, but if you find the time...?

Is there a way to improve efficiency?
Currently the vboxheadless process has an overall utilization of only 88% (24 h).
This includes a lot of overhead. (ATLAS is in the mid-90s.)

Would it be possible to have the "uploading" and "downloading" run in the background while the CPU is crunching?
There seems to be a lot of fruitless CPU idling going on.

According to my conservative estimates, it should be possible to double the output.
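
As a rough illustration of what running the upload in the background could look like, here is a minimal sketch. It is not how the CMS job wrapper works; `crunch_with_background_upload`, `crunch` and `upload` are hypothetical names, and it assumes results can be uploaded independently of the next job's processing.

```python
from concurrent.futures import ThreadPoolExecutor


def crunch_with_background_upload(jobs, crunch, upload):
    """Process jobs in order while the previous result uploads in a thread."""
    uploaded = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for job in jobs:
            # CPU work runs here while any earlier upload is still in flight.
            result = crunch(job)
            if pending is not None:
                # Collect the earlier upload before queueing the next one.
                uploaded.append(pending.result())
            pending = pool.submit(upload, result)
        if pending is not None:
            uploaded.append(pending.result())
    return uploaded


print(crunch_with_background_upload(
    [1, 2, 3],
    crunch=lambda job: job * 10,
    upload=lambda result: f"sent {result}",
))
```

With a single upload worker, at most one result is in transit at a time, which keeps the bandwidth demand the same as today while the CPU stays busy.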

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 959 - Posted: 1 Sep 2015, 19:27:56 UTC

I've started a thread in Number Crunching to allow people to set out ideas for improvements. Please make your suggestions there, but move discussions to a parallel thread to try to keep the wish-list uncluttered.
Cheers, ivan.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 960 - Posted: 1 Sep 2015, 19:34:02 UTC - in response to Message 957.  

I know you have other things to do, but if you find the time...?

Is there a way to improve efficiency?
Currently the vboxheadless process has an overall utilization of only 88% (24 h).
This includes a lot of overhead. (ATLAS is in the mid-90s.)

Would it be possible to have the "uploading" and "downloading" run in the background while the CPU is crunching?
There seems to be a lot of fruitless CPU idling going on.

According to my conservative estimates, it should be possible to double the output.

We could run longer jobs, so the startup phase takes up proportionately less time, but then the output files would be larger and we run the risk of communication time-outs in staging out the results. This apparently happened when I tried to send larger jobs a couple of weeks ago, so I'm loth to produce output much larger than the current 30-35 MB. This could vary with the type of simulation being run, too. I'll try again tomorrow to get my colleagues to suggest interesting jobs to run; my repertoire is rather limited.
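
The trade-off can be put into rough numbers. The sketch below assumes a fixed startup overhead per job and an output size that grows linearly with processing time; the 10-minute startup and 0.5 MB/min figures are made-up illustrative values, not measurements from this project.

```python
def cpu_efficiency(startup_min, run_min):
    """Fraction of wall-clock time spent on useful processing."""
    return run_min / (startup_min + run_min)


def output_size_mb(run_min, mb_per_min):
    """Output size, assuming it grows linearly with processing time."""
    return run_min * mb_per_min


# Illustrative values only (assumptions, not project measurements).
STARTUP_MIN = 10
MB_PER_MIN = 0.5

for run_min in (60, 120, 240):
    eff = cpu_efficiency(STARTUP_MIN, run_min)
    size = output_size_mb(run_min, MB_PER_MIN)
    print(f"{run_min:4d} min job: efficiency {eff:.0%}, output {size:.0f} MB")
```

Under these assumptions, each doubling of job length buys a smaller efficiency gain while the output to be staged out doubles every time.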

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 965 - Posted: 1 Sep 2015, 19:40:33 UTC
Last modified: 1 Sep 2015, 19:40:55 UTC

Thanks, Ivan.
First you have to get it running reliably at all, then optimize.
I understand.

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 966 - Posted: 1 Sep 2015, 19:42:33 UTC - in response to Message 960.  

We could run longer jobs, so the startup phase takes up proportionately less time, but then the output files would be larger and we run the risk of communication time-outs in staging out the results. This apparently happened when I tried to send larger jobs a couple of weeks ago, so I'm loth to produce output much larger than the current 30-35 MB.

You could make it configurable by the user, depending on whether (s)he has a stable Internet connection (e.g. my VDSL has 50 Mbit/s download and 10 Mbit/s upload).

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 970 - Posted: 1 Sep 2015, 22:01:08 UTC

Because it happened during this run of jobs, I'm posting it here.

I have one cmsRun that did not end normally; it got up to the 21st record and reported nothing more after that.
Meanwhile another cmsRun has ended normally and a new one is running.

Do you want some logs from this unfinished cmsRun?

I've extended the BOINC runtime to 40 hours (at the moment 24 hours), so I will be able to post (or email) the logs tomorrow.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 971 - Posted: 1 Sep 2015, 22:01:45 UTC - in response to Message 966.  

We could run longer jobs, so the startup phase takes up proportionately less time, but then the output files would be larger and we run the risk of communication time-outs in staging out the results. This apparently happened when I tried to send larger jobs a couple of weeks ago, so I'm loth to produce output much larger than the current 30-35 MB.

You could make it configurable by the user, depending on whether (s)he has a stable Internet connection (e.g. my VDSL has 50 Mbit/s download and 10 Mbit/s upload).

Sucks to be you :-0!
Here's my connexion tonight:

Connection mode : ADSL2
Type : Fast
Noise margin (dB) : 12.3
Attenuation (dB) : 46.0
Attainable download rate (kbps) : 2224
ADSL status : Connected [0]

              Downstream   Upstream
Rate (kbps)   1148         1077

But that apart, the size of the jobs is determined at submission time for the batch concerned; it's not configurable on a job-by-job basis. Sorry...
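
To see why output size matters on a slow line, here is a back-of-the-envelope sketch of the ideal upload time for a 35 MB result at the two upstream rates quoted in this thread. `upload_seconds` is a hypothetical helper; it assumes 1 MB = 8,000,000 bits and ignores protocol overhead, retries and competing traffic.

```python
def upload_seconds(size_mb, rate_kbps):
    """Ideal transfer time, ignoring protocol overhead and retries.

    Assumes 1 MB = 8,000,000 bits and 1 kbps = 1,000 bits per second.
    """
    return size_mb * 8_000_000 / (rate_kbps * 1_000)


for label, rate_kbps in (("ADSL2 upstream, 1077 kbps", 1077),
                         ("VDSL upload, 10 Mbit/s", 10_000)):
    secs = upload_seconds(35, rate_kbps)
    print(f"{label}: about {secs:.0f} s for a 35 MB result")
```

That works out to a bit over four minutes on the ADSL2 line against under half a minute on the VDSL one, before any overhead is counted.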

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 972 - Posted: 1 Sep 2015, 23:26:39 UTC - in response to Message 951.  
Last modified: 1 Sep 2015, 23:51:07 UTC

Looks like this batch will run out tonight at this rate (i.e. another 4 or 5 jobs for each machine) so tomorrow I'll have the timing statistics I want for my comparison:
515 jobs; 0 completed, 0 removed, 411 idle, 89 running, 15 held, 0 suspended

Well, it just made midnight, London time:
97 jobs; 0 completed, 0 removed, 0 idle, 82 running, 15 held, 0 suspended
so I submitted another 1,000 jobs the same as the last batch. That lasted a bit over 30 hours; hopefully this lot will get us into next week.
Statistics? Who needs 'em?

[Update] All jobs now submitted to the Condor queue and updated with the required constraints. I'm for bed, after attempting today's Guardian cryptic crossword... [/Update]

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 973 - Posted: 2 Sep 2015, 10:47:50 UTC - in response to Message 970.  

I have one cmsRun that did not end normally; it got up to the 21st record and reported nothing more after that.
Meanwhile another cmsRun has ended normally and a new one is running.

Do you want some logs from this unfinished cmsRun?

I've extended the BOINC runtime to 40 hours (at the moment 24 hours), so I will be able to post (or email) the logs tomorrow.

Could the following (from MasterLog) have been the reason?

09/01/15 21:32:19 (pid:9715) CCBListener: no activity from CCB server in 1240s; assuming connection is dead.
09/01/15 21:32:19 (pid:9715) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
09/01/15 21:32:38 (pid:9715) DefaultReaper unexpectedly called on pid 9718, status 25344.
09/01/15 21:32:38 (pid:9715) The STARTD (pid 9718) exited with status 99 (daemon will not restart automatically)
09/01/15 21:33:19 (pid:9715) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#52528
09/01/15 21:37:44 (pid:9715) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
09/01/15 21:37:44 (pid:9715) Got SIGTERM. Performing graceful shutdown.
09/01/15 21:37:44 (pid:9715) All daemons are gone. Exiting.
09/01/15 21:37:44 (pid:9715) **** condor_master (condor_MASTER) pid 9715 EXITING WITH STATUS 99
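
The two status values in the log are consistent with each other if 25344 is a raw POSIX wait status, in which the exit code sits in the high byte. That reading is an assumption about what DefaultReaper logs, and `decode_wait_status` is a hypothetical helper, but the arithmetic can be checked quickly:

```python
def decode_wait_status(status):
    """Split a 16-bit POSIX-style wait status into (exit_code, signal_number)."""
    exit_code = (status >> 8) & 0xFF   # high byte: the value passed to exit()
    signal_number = status & 0x7F      # low 7 bits: terminating signal, 0 if none
    return exit_code, signal_number


exit_code, signal_number = decode_wait_status(25344)
print(f"exit code {exit_code}, signal {signal_number}")
```

So the STARTD appears to have exited by itself with code 99 rather than being killed by a signal, and the master then shut down with the same status.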

Phil
Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 974 - Posted: 2 Sep 2015, 14:11:36 UTC

Dunno what's happening here - I have two machines with jobs whizzing through for the last 2 days (hope they're giving good results)...

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 976 - Posted: 2 Sep 2015, 19:58:58 UTC - in response to Message 972.  

... so I submitted another 1,000 jobs the same as the last batch. That lasted a bit over 30 hours; hopefully this lot will get us into next week.

Not sure, but guessing less than 290 jobs left. Short weeks in Britain ;)

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 980 - Posted: 3 Sep 2015, 9:48:41 UTC - in response to Message 976.  

... so I submitted another 1,000 jobs the same as the last batch. That lasted a bit over 30 hours; hopefully this lot will get us into next week.

Not sure, but guessing less than 290 jobs left. Short weeks in Britain ;)

Typo; 10,000 jobs...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 981 - Posted: 3 Sep 2015, 9:55:04 UTC

FYI, here are graphs showing how long it took to get jobs back from the Grid and from CMS@Home from that batch of 2,000 jobs on Monday:

and active jobs since Monday on T3_CH_Volunteer (time scale goes until midnight tonight, it wasn't obvious how to set it to "now"):


m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 982 - Posted: 3 Sep 2015, 10:53:49 UTC - in response to Message 981.  
Last modified: 3 Sep 2015, 11:02:40 UTC

Thanks, Ivan.

Very interesting. Certainly shows we aren't the fastest. Surprise, surprise, although it's better than I, for one, expected, given all the inefficiencies in the present system. Production work, when the volunteers aren't as involved, might not be quite so good, although there will be more machines.

Is there a "Grid" version of the last plot, failed/successful jobs?

Where are the "lost" (abandoned) jobs in these statistics? Can they be shown separately, as vLHC do (I think; second plot down)?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 984 - Posted: 3 Sep 2015, 13:36:24 UTC - in response to Message 982.  

Thanks, Ivan.

Very interesting. Certainly shows we aren't the fastest. Surprise, surprise.

Is there a "Grid" version of the last plot, failed/successful jobs?

Where are the "lost" (abandoned) jobs in these statistics? Can they be shown separately, as vLHC do (I think; second plot down)?

There is probably at least one way of dredging up those statistics, but Dashboard is such a huge thing with far too many options. The other problem is that we're not as tightly coupled to the reporting as Grid might be, and lots of things can go astray. OTOH, Dashboard reported 1970 successes for the Grid batch, but I only counted 1858 returned results,

[As usual, I got dragged away by other things and forgot to post this.]

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 986 - Posted: 3 Sep 2015, 19:13:15 UTC

Seems as if Microsoft doesn't like your app.

On Tuesday they had patches that needed restarts, today they had one patch but it needs a restart too :-(

Not good for the performance of your app.
©2024 CERN