Message boards : CMS Application : Ready For Production

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 2480 - Posted: 21 Mar 2016, 14:52:41 UTC - in response to Message 2475.  
Last modified: 21 Mar 2016, 14:52:50 UTC

At the moment the volunteer is given the benefit of the doubt, as failures may be our fault. Over time, as we improve the fault detection, failed tasks will no longer receive credit. This already happens in some scenarios, such as when the machine fails to boot and the absence of the heartbeat file triggers the shutdown.
ID: 2480 · Rating: 0
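
The heartbeat check mentioned in the post above can be sketched roughly as follows. This is only an illustration: the thread does not say how the project's real check works, so the function name, the file path and the age limit are all assumptions. The idea is simply that a missing or stale heartbeat file is treated as "the VM did not boot or has stopped responding".

```python
import os
import time
from typing import Optional


def heartbeat_is_alive(path: str, max_age_s: float, now: Optional[float] = None) -> bool:
    """Return True if the heartbeat file exists and was modified recently.

    `path` and `max_age_s` are hypothetical parameters chosen for this sketch.
    """
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No heartbeat file at all: treat as a failed boot.
        return False
    # A file that has not been touched for too long counts as dead.
    return (now - mtime) <= max_age_s
```

A caller would poll this periodically and shut the task down once it returns False.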
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3490 - Posted: 26 May 2016, 9:27:39 UTC - in response to Message 2480.  

On Monday we plan to update the production project with the version that we have been running in the dev project. Please let us know if there are any objections or if there is something else to fix first.
ID: 3490 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3492 - Posted: 26 May 2016, 10:06:11 UTC - in response to Message 3490.  
Last modified: 26 May 2016, 10:24:51 UTC

I just did a BOINC shutdown (LAIM off).
All VMs were aborted.

EDIT: On resume, the jobs it was working on were abandoned and the tasks continued with a new job.
ID: 3492 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3494 - Posted: 26 May 2016, 12:16:10 UTC - in response to Message 3492.  

I have just noticed that Condor has been upgraded from v8.0.4 to v8.4.6 in the latest release of CernVM. We picked it up when I built the new image yesterday. I will need to review the configuration for this version.
ID: 3494 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3495 - Posted: 26 May 2016, 17:53:38 UTC

I have reached the quota, even though all tasks reported are valid or aborted.
Cannot get new work.

5/26/2016 7:49:53 PM | vLHCathome-dev | Requesting new tasks for CPU
5/26/2016 7:49:54 PM | vLHCathome-dev | Scheduler request completed: got 0 new tasks
5/26/2016 7:49:54 PM | vLHCathome-dev | No tasks sent
5/26/2016 7:49:54 PM | vLHCathome-dev | No tasks are available for CMS Simulation
5/26/2016 7:49:54 PM | vLHCathome-dev | This computer has finished a daily quota of 1 tasks
ID: 3495 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3496 - Posted: 26 May 2016, 20:24:00 UTC - in response to Message 3495.  

I see the host has four tasks in progress. Is this correct?

http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?hostid=617

One thing that I noticed is that your VM stops and starts. Why is this? Why are the VMs not suspending? A consequence is that each time the VM starts, the Condor TTL is reset, so eventually the 18h timeout is hit. This reset should probably not happen.
ID: 3496 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3497 - Posted: 26 May 2016, 21:16:23 UTC - in response to Message 3496.  
Last modified: 26 May 2016, 21:17:04 UTC

I exited BOINC once with 3 tasks running and LAIM off.
As described in a different thread, all jobs that were running were abandoned and new jobs started.
I then did a few more jobs and shut down the tasks by pasting "shutdown" into the
shared folder.
I aborted the remaining task.
Then I tried to get new tasks, but the quota was reached.
After a few attempts I switched and got 4 Theory tasks.
ID: 3497 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3498 - Posted: 26 May 2016, 21:16:40 UTC - in response to Message 3496.  

That's four Theory tasks, Laurence. Don't be embarrassed, I've made the same mistake myself... :-(
ID: 3498 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 3499 - Posted: 27 May 2016, 5:51:16 UTC - in response to Message 3497.  

I exited BOINC once with 3 tasks running and LAIM off.
As described in a different thread, all jobs that were running were abandoned and new jobs started.

I described in a different thread that saving several VMs at once can cause problems, because all VMs have to save their state to disk (snapshots of the VMs).
Especially on slow disks (I suppose you have a 5400 rpm drive) this saving will take too long (longer than 60 seconds).
If you're lucky, the VM will boot from scratch when resumed, but it could also end in an error.

So with slow disks and several VMs running, suspend them one after the other with a time interval in between, and of course suspend tasks that are 'Ready to start' first.
ID: 3499 · Rating: 0
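
The staggered suspend advised in the post above could be scripted along these lines. This is a minimal sketch under stated assumptions: `suspend` is a caller-supplied function (how it talks to BOINC is deliberately left open), the task names are made up, and the 90-second default interval is an arbitrary choice that merely exceeds the 60-second save window mentioned in the thread.

```python
import time
from typing import Callable, Iterable, List, Tuple


def suspend_staggered(
    tasks: Iterable[Tuple[str, bool]],
    suspend: Callable[[str], None],
    interval_s: float = 90.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Suspend tasks one at a time with a pause between them.

    Each task is a (name, is_running) pair. Tasks that are not running
    ('Ready to start') are suspended first, then the running ones.
    Returns the names in the order they were suspended.
    """
    # False sorts before True, so not-running tasks come first;
    # sorted() is stable, so the original order is kept within each group.
    ordered = sorted(tasks, key=lambda task: task[1])
    suspended: List[str] = []
    for index, (name, _is_running) in enumerate(ordered):
        suspend(name)
        suspended.append(name)
        if index < len(ordered) - 1:
            # Give the VM time to write its state before touching the next one.
            sleep(interval_s)
    return suspended
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.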
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3500 - Posted: 27 May 2016, 7:28:07 UTC - in response to Message 3499.  

longer than 60 seconds


Thanks for the info. What determines this 60sec and why can't it be changed?
ID: 3500 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 3501 - Posted: 27 May 2016, 9:12:14 UTC - in response to Message 3500.  

longer than 60 seconds


Thanks for the info. What determines this 60sec and why can't it be changed?

It's hard-coded in the vboxwrapper program, and it has been changed before: in the past it was 15 seconds.
On a busy system, saving even a single VM could already fail to finish within 15 seconds.
So I tested different scenarios with the wrapper developers (suspend with LAIM off, stop BOINC, host reboot).
60 seconds seems to be a good compromise when a system is not too overloaded ;) - Slower disks make it worse.
If someone wants to stop BOINC and/or reboot the system, he/she doesn't want to wait too long.

Hence my advice to stop the VMs in a staggered way when you have (too) many VMs and want to stop BOINC.
Keep in mind that a suspended VM with LAIM on is not saved to disk, so it also has to create a snapshot when BOINC is stopped.
You can test it yourself: open VBox Manager, suspend a VM (LAIM off) and watch how long it takes until
the state of the suspended VM goes from 'running' to 'saved'.
ID: 3501 · Rating: 0
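
The manual test described in the post above (watching a VM go from 'running' to 'saved') can also be timed from a script. The sketch below assumes the `VBoxManage showvminfo <name> --machinereadable` output contains a `VMState="..."` line; the helper names and the one-second polling interval are choices made for this illustration, and the 60-second limit simply mirrors the figure quoted in the thread. It is not how vboxwrapper itself does it.

```python
import re
import subprocess
import time
from typing import Callable, Optional


def parse_vm_state(info: str) -> Optional[str]:
    """Extract the VMState value from machine-readable showvminfo output."""
    match = re.search(r'^VMState="([^"]*)"', info, flags=re.MULTILINE)
    return match.group(1) if match else None


def read_vm_state(vm_name: str) -> Optional[str]:
    """Ask VBoxManage for the current state of one VM."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True, text=True, check=True,
    )
    return parse_vm_state(result.stdout)


def seconds_until_saved(
    get_state: Callable[[], Optional[str]],
    timeout_s: float = 60.0,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Poll until the state is 'saved'; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        elapsed = clock() - start
        if get_state() == "saved":
            return elapsed
        if elapsed >= timeout_s:
            return None
        sleep(poll_s)
```

For a VM with the hypothetical name `my-vm`, one would call `seconds_until_saved(lambda: read_vm_state("my-vm"))` right after suspending it. A result of None means the save did not complete within the limit.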
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3502 - Posted: 27 May 2016, 11:26:59 UTC - in response to Message 3501.  

Thanks, Crystal.
But why is it not set up so that the wrapper waits for a shutdown confirmation from the VM before it signals BOINC that it is OK to shut down?
If you have a fast drive, it shuts down fast; if you have a slow drive, it shuts down slowly.
ID: 3502 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 3503 - Posted: 27 May 2016, 12:22:07 UTC - in response to Message 3502.  

But why is it not set up so that the wrapper waits for a shutdown confirmation from the VM before it signals BOINC that it is OK to shut down?

In fact it's not the wrapper but the BOINC client that wants the processes to shut down fast. It has to do with the Windows timeout values when shutting down or rebooting a machine.
When it takes too long, all processes will be killed by the operating system. 60 seconds is already rather long for today's machines.
For that reason it's always good to stop BOINC yourself before shutting down the machine, especially when you have VMs running ;)
... and you want to avoid BOINC waiting forever when the saving is unsuccessful.
ID: 3503 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 3506 - Posted: 28 May 2016, 10:06:23 UTC - in response to Message 3490.  

On Monday we plan to update the production project with the version that we have been running in the dev project. Please let us know if there are any objections or if there is something else to fix first.

Hi Laurence,

It looks like the suspend/resume solution you've implemented in the Theory application is not (yet) included in the CMS application.
I had 4 CMS tasks running, suspended the tasks one after the other (LAIM off), stopped BOINC, shut down the host, waited 1.5 hours and started everything up again.
3 of the 4 first processed a few more records, but in the end none of the saved cmsRuns on the 4 VMs were finished.
Instead, all processes were terminated and new cmsRuns started.

Example:

05/28/16 11:42:41 (pid:5540) CCBListener: no activity from CCB server in 6322s; assuming connection is dead.
05/28/16 11:42:41 (pid:5540) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
05/28/16 11:42:41 (pid:5540) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <130.246.180.120:9818>.
05/28/16 11:42:41 (pid:5540) IO: Failed to read packet header
05/28/16 11:42:41 (pid:5540) Lost connection to shadow, waiting 7200 secs for reconnect
05/28/16 11:43:41 (pid:5540) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#92447
05/28/16 11:44:27 (pid:5540) Got SIGQUIT. Performing fast shutdown.
05/28/16 11:44:27 (pid:5540) ShutdownFast all jobs.
05/28/16 11:44:27 (pid:5540) Process exited, pid=5544, signal=9
05/28/16 11:44:27 (pid:5540) Failed to send job exit status to shadow
05/28/16 11:44:27 (pid:5540) Last process exited, now Starter is exiting
ID: 3506 · Rating: 0
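
To read timings out of a log excerpt like the one above, a small parser can help. The sketch below assumes every line has the shape `MM/DD/YY HH:MM:SS (pid:N) message`, which matches the lines quoted in the post; the names `LogEntry`, `parse_line` and `idle_seconds` are invented for this illustration and are not part of HTCondor.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    when: datetime
    pid: int
    message: str


# Timestamp, then "(pid:N)", then the free-text message.
_LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")


def parse_line(line: str) -> Optional[LogEntry]:
    """Split one log line into timestamp, pid and message; None if it does not match."""
    match = _LINE.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return LogEntry(when, int(match.group(2)), match.group(3))


def idle_seconds(message: str) -> Optional[int]:
    """Return the idle time reported by a 'no activity from CCB server' message."""
    match = re.search(r"no activity from CCB server in (\d+)s", message)
    return int(match.group(1)) if match else None
```

Applied to the first quoted line, `idle_seconds` gives 6322 seconds, i.e. about 1 hour 45 minutes, which is in line with the host having been shut down for 1.5 hours. The difference between the "Lost connection to shadow" entry (11:42:41) and the "Got SIGQUIT" entry (11:44:27) is 106 seconds, far short of the 7200 seconds the starter said it would wait for a reconnect.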
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3521 - Posted: 30 May 2016, 9:44:03 UTC - in response to Message 3490.  

The production project has been updated with the version that is currently running here.
ID: 3521 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3605 - Posted: 24 Jun 2016, 21:28:49 UTC

There's a new workflow in the offing, just a test of ten jobs to start with. These are the sort of jobs we envisaged running from the start -- on my work machine they take 40 mins and produce 4 MB result files, but a test on the "real" Grid took 3 or 4 times longer (for slightly larger event numbers). So, they should be much more ADSL-friendly for those without fibre.
If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared.
ID: 3605 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3607 - Posted: 25 Jun 2016, 14:05:05 UTC

ID: 3607 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3608 - Posted: 25 Jun 2016, 17:12:29 UTC - in response to Message 3607.  

No, they are WMAgent jobs, but for some reason they don't show up as such. I know Hassen put in a trouble ticket about it, but I guess it's not terribly serious compared to most other problems. I'll e-mail my Dashboard contact; that may be more effective...
ID: 3608 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3609 - Posted: 25 Jun 2016, 18:38:56 UTC - in response to Message 3608.  

Thanks, Ivan.
I would just like to know what is dipping into the T3_CH_Volunteers pot of resources.
ID: 3609 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3610 - Posted: 25 Jun 2016, 21:34:06 UTC - in response to Message 3609.  

Thanks, Ivan.
I would just like to know what is dipping into the T3_CH_Volunteers pot of resources.

Well, at the moment we are running some WMAgent jobs that Hassen submitted in April/May that were held back because they didn't match the logic set up for our HTCondor jobs. We're trying to get WMAgent jobs submitted by Federica to run, and changes to that end in our requirements actually let Hassen's jobs start running. Now we're waiting for those jobs to drain to see if Federica's jobs will run. Someone (I presume Hassen, even though he's still on holiday pending his leaving CERN IT) managed to cancel some of the earlier jobs, hence the spikes you see on our "Jobs" graphs. But we're still waiting for the rest to drain.
And I'm just as impatient as you-all, because I want to see how my little test of real simulation requests pans out, so I want the current CRAB3 batch to finish as soon as possible!
ID: 3610 · Rating: 0