Message boards : CMS Application : New Version v48.00
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
This version has a few improvements that should make the VM more resilient to network issues during startup. Tasks should now fail with explicit "no network" or "port closed" messages rather than the heartbeat errors we have been seeing in production.
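A startup check of that sort can be illustrated with a short sketch. This is not the actual vLHCathome-dev implementation, which is not published in this thread; the host and port below are the CCB endpoint that appears in the Condor logs later on, and the error strings merely echo the two failure modes named above.

```python
import socket

# Hypothetical pre-flight check: fail fast with a clear message instead of
# letting the job die later with a heartbeat error. Host/port are the CCB
# endpoint seen in the logs in this thread; the real v48.00 checks may differ.
CCB_HOST = "vccondor01.cern.ch"
CCB_PORT = 9618

def network_preflight(host: str = CCB_HOST, port: int = CCB_PORT,
                      timeout: float = 10.0) -> None:
    try:
        addr = socket.gethostbyname(host)            # does DNS work at all?
    except socket.gaierror as exc:
        raise SystemExit(f"no network: DNS lookup for {host} failed ({exc})")
    try:
        with socket.create_connection((addr, port), timeout=timeout):
            pass                                     # TCP handshake succeeded
    except OSError as exc:
        raise SystemExit(f"port closed: cannot reach {host}:{port} ({exc})")

if __name__ == "__main__":
    network_preflight()
    print("network preflight OK")
```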
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
I tested suspend/resume for just over 1 hour and it was fine.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I tested suspend/resume for just over 1 hour and it was fine.

OK, then I'll test the overnight suspend/resume.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I tested suspend/resume for just over 1 hour and it was fine.

The running job did not survive the overnight suspend and this morning's resume. The strange thing is that after the resume the job continues OK, but after about half an hour Condor decides to kill the job, or perhaps no longer knows there was a job running. Because the elapsed time was over 12 hours, the BOINC task was ended. https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=300652

From MasterLog:

```
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
01/14/17 08:40:11 condor_write(): Socket closed when trying to write 2194 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
01/14/17 08:40:11 Buf::write(): condor_write() failed
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
```
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine. The before and after id is the same (#2513731).
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
New tasks are failing. The running jobs count is dropping.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine. The before and after id is the same (#2513731).

Last part of StartLog:

```
01/13/17 19:54:31 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 404 (DEACTIVATE_CLAIM_FORCIBLY), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Called deactivate_claim_forcibly()
01/13/17 21:13:14 Starter pid 5186 exited with status 0
01/13/17 21:13:14 State change: starter exited
01/13/17 21:13:14 Changing activity: Busy -> Idle
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Got activate_claim request from shadow (188.184.82.11)
01/13/17 21:13:14 Remote job ID is 19859.2
01/13/17 21:13:14 Got universe "VANILLA" (5) from request classad
01/13/17 21:13:14 State change: claim-activation protocol successful
01/13/17 21:13:14 Changing activity: Idle -> Busy
01/13/17 22:16:22 CCBListener: no activity from CCB server in 3188s; assuming connection is dead.
01/13/17 22:16:22 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:40:29 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
01/14/17 08:40:29 Buf::write(): condor_write() failed
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
```

Last part of StarterLog:

```
01/13/17 18:19:14 (pid:5186) Running job as user nobody
01/13/17 18:19:14 (pid:5186) Create_Process succeeded, pid=5190
01/13/17 19:53:31 (pid:5186) CCBListener: no activity from CCB server in 1456s; assuming connection is dead.
01/13/17 19:53:31 (pid:5186) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 19:54:32 (pid:5186) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2520785
01/13/17 21:13:13 (pid:5186) Process exited, pid=5190, status=0
01/13/17 21:13:14 (pid:5186) Got SIGQUIT. Performing fast shutdown.
01/13/17 21:13:14 (pid:5186) ShutdownFast all jobs.
01/13/17 21:13:14 (pid:5186) **** condor_starter (condor_STARTER) pid 5186 EXITING WITH STATUS 0
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) ** condor_starter (CONDOR_STARTER) STARTING UP
01/13/17 21:13:14 (pid:6033) ** /usr/sbin/condor_starter
01/13/17 21:13:14 (pid:6033) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/13/17 21:13:14 (pid:6033) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/13/17 21:13:14 (pid:6033) ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
01/13/17 21:13:14 (pid:6033) ** $CondorPlatform: x86_64_RedHat6 $
01/13/17 21:13:14 (pid:6033) ** PID = 6033
01/13/17 21:13:14 (pid:6033) ** Log last touched 1/13 21:13:14
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) Using config source: /etc/condor/condor_config
01/13/17 21:13:14 (pid:6033) Using local config sources:
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/10_security.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/14_network.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/20_workernode.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/30_lease.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/35_cms.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/config.d/40_ccb.config
01/13/17 21:13:14 (pid:6033)    /etc/condor/condor_config.local
01/13/17 21:13:14 (pid:6033) config Macros = 154, Sorted = 154, StringBytes = 5592, TablesBytes = 5640
01/13/17 21:13:14 (pid:6033) CLASSAD_CACHING is OFF
01/13/17 21:13:14 (pid:6033) Daemon Log is logging: D_ALWAYS D_ERROR
01/13/17 21:13:14 (pid:6033) Daemoncore: Listening at <10.0.2.15:44921> on TCP (ReliSock).
01/13/17 21:13:14 (pid:6033) DaemonCore: command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921&noUDP>
01/13/17 21:13:14 (pid:6033) DaemonCore: private command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921>
01/13/17 21:13:15 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/13/17 21:13:15 (pid:6033) Communicating with shadow <188.184.82.11:4080?addrs=188.184.82.11-4080&noUDP&sock=1632254_5a80_125092>
01/13/17 21:13:15 (pid:6033) Submitting machine is "vocms0159.cern.ch"
01/13/17 21:13:15 (pid:6033) setting the orig job name in starter
01/13/17 21:13:15 (pid:6033) setting the orig job iwd in starter
01/13/17 21:13:15 (pid:6033) Chirp config summary: IO false, Updates false, Delayed updates true.
01/13/17 21:13:15 (pid:6033) Initialized IO Proxy.
01/13/17 21:13:15 (pid:6033) Done setting resource limits
01/13/17 21:13:17 (pid:6033) File transfer completed successfully.
01/13/17 21:13:17 (pid:6033) Job 19859.2 set to execute immediately
01/13/17 21:13:17 (pid:6033) Starting a VANILLA universe job with ID: 19859.2
01/13/17 21:13:17 (pid:6033) IWD: /var/lib/condor/execute/dir_6033
01/13/17 21:13:17 (pid:6033) Output file: /var/lib/condor/execute/dir_6033/_condor_stdout
01/13/17 21:13:17 (pid:6033) Error file: /var/lib/condor/execute/dir_6033/_condor_stderr
01/13/17 21:13:17 (pid:6033) Renice expr "10" evaluated to 10
01/13/17 21:13:17 (pid:6033) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_6033/condor_exec.exe ireid_MonteCarlo_eff_IDR_CMS_Home_170111_211021_1648-Sandbox.tar.bz2 214799
01/13/17 21:13:17 (pid:6033) Running job as user nobody
01/13/17 21:13:17 (pid:6033) Create_Process succeeded, pid=6037
01/13/17 22:16:58 (pid:6033) CCBListener: no activity from CCB server in 3223s; assuming connection is dead.
01/13/17 22:16:58 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:58 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/14/17 08:40:38 (pid:6033) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>.
01/14/17 08:40:38 (pid:6033) IO: Failed to read packet header
01/14/17 08:40:38 (pid:6033) Lost connection to shadow, waiting 1200 secs for reconnect
01/14/17 08:40:38 (pid:6033) CCBListener: no activity from CCB server in 34058s; assuming connection is dead.
01/14/17 08:40:38 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:39 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
```
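The telling pattern in these excerpts is the "no activity from CCB server in Ns" interval, which jumps from roughly 3200 s during normal running to 34000+ s across the overnight suspend. Gaps like that can be pulled out of the logs mechanically; a minimal sketch (the regex simply matches the lines quoted above, and the script is not any official tooling):

```python
import re
import sys

# Hypothetical helper: scan a Condor log (MasterLog/StartLog/StarterLog) for
# CCBListener inactivity reports and print the gap each one claims, to show
# how a long suspend inflates the "no activity" interval.
GAP_RE = re.compile(r"no activity from CCB server in (\d+)s")

def ccb_gaps(path: str):
    with open(path, encoding="utf-8", errors="replace") as fh:
        for line in fh:
            m = GAP_RE.search(line)
            if m:
                yield int(m.group(1)), line.rstrip()

if __name__ == "__main__":
    for gap, line in ccb_gaps(sys.argv[1]):        # e.g. StarterLog
        flag = "  <-- suspend-sized gap" if gap > 7200 else ""
        print(f"{gap:>7}s  {line}{flag}")
```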
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 444
> New tasks are failing.

Yes, trying to work out what is happening...
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 444
Looks like something amiss with the Condor server. There are 700 jobs queued but only 21 running. I'm getting error messages that Condor jobs aren't starting. I've emailed Laurence.
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
From what I can tell, the matchmaking isn't working, i.e. the job descriptions are not matching the resource descriptions. As the resource descriptions have not changed, I am assuming it is the job descriptions.

EDIT: Looks like the accounting group changed.

EDIT: Actually we have a blockage, more later.
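One way to chase a matchmaking suspicion like this is to put the idle jobs' attributes next to the machine ads, so a changed accounting group or requirement stands out. A minimal sketch using the HTCondor Python bindings; the pool host is the one from the logs above, the rest is illustrative (operators would typically just run `condor_q -better-analyze` on the server instead):

```python
import htcondor

# Hypothetical diagnostic: list idle jobs with their AccountingGroup and
# Requirements, then count the slot ads in the pool. Pool/schedd selection
# here is an assumption for illustration.
coll = htcondor.Collector("vccondor01.cern.ch")
schedd_ad = coll.locateAll(htcondor.DaemonTypes.Schedd)[0]
schedd = htcondor.Schedd(schedd_ad)

idle = schedd.query(
    constraint="JobStatus == 1",          # 1 = Idle in the JobStatus enum
    projection=["ClusterId", "ProcId", "AccountingGroup", "Requirements"],
)
for ad in idle[:10]:
    print(ad.get("ClusterId"), ad.get("ProcId"),
          ad.get("AccountingGroup"), ad.get("Requirements"))

slots = coll.query(htcondor.AdTypes.Startd, projection=["Name", "Start"])
print(f"{len(slots)} slot ads in the pool")
```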
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0
The issue is that the job queue filled up with merge jobs and no resources were running them. When we changed the configuration to get WMAgent jobs instead of CRAB jobs, my cluster also picked this change up and hence stopped running merge jobs. The configuration for my cluster has now been updated to run merge jobs again, which will hopefully unblock everything.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 444
Thanks, Laurence.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell). There is a significant speed improvement (10-15% or more). You might want to look at enabling this by default. (It does not work with LHCb.)
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).

VirtualBox mentions a possible performance increase of 5%. It's hard to measure, because the runtimes of the CMS jobs differ so much. I already did a measurement over 7 jobs with LargePages off: average 6151 seconds, max 8386 seconds, min 4340 seconds. I also want to run a series of 7 CMS jobs with LargePages on: 3 done so far, 2 below the above average and 1 above.
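For anyone repeating the comparison, the summary numbers are trivial to reproduce; a small sketch (the individual runtimes are not listed in the thread, so the list here is a placeholder, with only the quoted min and max being real):

```python
# Minimal runtime summary with Python's statistics module. Only the quoted
# aggregate figures (avg 6151 / max 8386 / min 4340 s) are from the thread;
# the individual values below are placeholders for illustration.
from statistics import mean

runtimes_off = [4340, 5100, 5600, 6100, 6500, 7000, 8386]  # placeholder data

print(f"average {mean(runtimes_off):.0f} s  "
      f"max {max(runtimes_off)} s  min {min(runtimes_off)} s")
```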
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
You are correct. There are a number of factors: overlapping uploads, other system activity....

BTW, how do you enable it? I shut down BOINC and edit the .vbox and .vbox-prev. Obviously there are losses involved with shutting down and starting up again.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> BTW, how do you enable it? I shut down BOINC and edit the .vbox and .vbox-prev.

When you get a fresh CMS task, suspend the task early, before Condor gets a chance to start a job. It will stay in memory even when LAIM ("leave applications in memory") is deselected. Stop the BOINC client and the VM will be saved. Discard the saved state. Now you have a powered-down VM on which to change the Large Pages parameter with the command

"%VBOX_MSI_INSTALL_PATH%vboxmanage.exe" modifyvm virtualmachinename --largepages on|off

When you want to run the VM in BOINC again: keep the task suspended in BOINC and start the VM with the VirtualBox Manager for a few minutes. Close the VM with the save-state option. When saved, it's BOINC's turn again.
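The toggle itself can also be scripted; here is a sketch wrapping VBoxManage with Python's subprocess module. The VM name is a placeholder (BOINC-created VMs have generated names), and this assumes the VM has already been brought to a powered-off state as described above:

```python
import os
import subprocess

# Hypothetical wrapper around the VBoxManage steps described above.
# VM_NAME is a placeholder; look up the real name with "VBoxManage list vms".
VBOXMANAGE = os.path.expandvars(r"%VBOX_MSI_INSTALL_PATH%VBoxManage.exe")
VM_NAME = "boinc_vm_placeholder"

def set_large_pages(enabled: bool) -> None:
    # discardstate is only needed if a saved state is still attached;
    # modifyvm requires the VM to be powered off.
    subprocess.run([VBOXMANAGE, "discardstate", VM_NAME], check=False)
    subprocess.run(
        [VBOXMANAGE, "modifyvm", VM_NAME,
         "--largepages", "on" if enabled else "off"],
        check=True,
    )

if __name__ == "__main__":
    set_large_pages(True)
```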
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).

The figures for 7 tasks with parameter --largepages on: average 6024 seconds, max 8450 seconds, min 3255 seconds.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Hi Crystal,

> The figures for 7 tasks with parameter --largepages on:

I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest. Thanks for posting your method to enable "large pages". I will give it a try.
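That suggestion is essentially a trimmed mean; a minimal sketch of it (the middle runtimes below are placeholders, with only the quoted min and max of the largepages-on series being real):

```python
# Trimmed mean as suggested above: drop the single longest and shortest
# runtime, then average the rest. Middle values are placeholder data.
from statistics import mean

def trimmed_mean(runtimes: list[float]) -> float:
    if len(runtimes) < 3:
        raise ValueError("need at least 3 samples to trim both extremes")
    ordered = sorted(runtimes)
    return mean(ordered[1:-1])     # discard min and max, average the rest

print(trimmed_mean([3255, 4800, 5500, 6000, 6600, 7200, 8450]))
```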
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm extending my test series. I have two CMS VMs running simultaneously, one with largepages on and one with largepages off, so they experience the same conditions. Ivan noted at the production project that the CMS queue will drain over the weekend, so we'll see how many jobs I get before the well runs dry.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,609 · RAC: 15
> I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm ending my test series after 30 jobs on each VM:

CPU summary of 30 jobs with parameter largepages on: 203710 seconds
CPU summary of 30 jobs with parameter largepages off: 204702 seconds

Too close to call it a significant performance increase, I tend to say.
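(For scale: 204702 − 203710 = 992 seconds out of roughly 204,700, i.e. about 0.5%, far below the 10-15% suggested earlier in the thread and well within the job-to-job spread reported above.)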