Message boards : CMS Application : New Version v48.00

Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4609 - Posted: 13 Jan 2017, 9:38:40 UTC

This version has a few improvements that should make the VM more resilient to network issues during startup. The tasks should now fail with either "no network" or "port closed" messages rather than the heartbeat errors we have been seeing in production.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4610 - Posted: 13 Jan 2017, 14:05:47 UTC - in response to Message 4609.  

I tested suspend/resume for just over 1 hour and it was fine.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4611 - Posted: 13 Jan 2017, 14:11:34 UTC - in response to Message 4610.  

I tested suspend/resume for just over 1 hour and it was fine.

OK, then I'll test the overnight suspend/resume.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4612 - Posted: 14 Jan 2017, 8:39:50 UTC - in response to Message 4611.  

I tested suspend/resume for just over 1 hour and it was fine.

OK, then I'll test the overnight suspend/resume.

The running job did not survive the overnight suspend and this morning's resume.
Strangely, after the resume the job carries on OK, but after about half an hour Condor decides to kill the job, or doesn't even know there was a job running.
Because the elapsed time was over 12 hours, the BOINC task was ended.

https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=300652

From MasterLog:
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
01/14/17 08:40:11 condor_write(): Socket closed when trying to write 2194 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
01/14/17 08:40:11 Buf::write(): condor_write() failed
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
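As a rough cross-check (simple arithmetic on the 34061 s figure quoted in the log above, nothing else taken from it): the reported CCB inactivity gap is about nine and a half hours, which lines up with the VM having been suspended overnight.

# Rough check of the CCB gap reported in the MasterLog above.
gap_seconds = 34061                       # "no activity from CCB server in 34061s"
print(f"{gap_seconds / 3600:.1f} hours")  # ~9.5 hours, i.e. the overnight suspend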
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4613 - Posted: 14 Jan 2017, 22:53:12 UTC - in response to Message 4612.  

Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine. The before and after id is the same (#2513731).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4614 - Posted: 15 Jan 2017, 15:13:06 UTC

New tasks are failing.
Running jobs count is dropping.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4615 - Posted: 15 Jan 2017, 15:21:36 UTC - in response to Message 4613.  

Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine. The before and after id is the same (#2513731).

Last part of StartLog:
01/13/17 19:54:31 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 404 (DEACTIVATE_CLAIM_FORCIBLY), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Called deactivate_claim_forcibly()
01/13/17 21:13:14 Starter pid 5186 exited with status 0
01/13/17 21:13:14 State change: starter exited
01/13/17 21:13:14 Changing activity: Busy -> Idle
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Got activate_claim request from shadow (188.184.82.11)
01/13/17 21:13:14 Remote job ID is 19859.2
01/13/17 21:13:14 Got universe "VANILLA" (5) from request classad
01/13/17 21:13:14 State change: claim-activation protocol successful
01/13/17 21:13:14 Changing activity: Idle -> Busy
01/13/17 22:16:22 CCBListener: no activity from CCB server in 3188s; assuming connection is dead.
01/13/17 22:16:22 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:40:29 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
01/14/17 08:40:29 Buf::write(): condor_write() failed
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735


Last part of StarterLog:
01/13/17 18:19:14 (pid:5186) Running job as user nobody
01/13/17 18:19:14 (pid:5186) Create_Process succeeded, pid=5190
01/13/17 19:53:31 (pid:5186) CCBListener: no activity from CCB server in 1456s; assuming connection is dead.
01/13/17 19:53:31 (pid:5186) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 19:54:32 (pid:5186) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2520785
01/13/17 21:13:13 (pid:5186) Process exited, pid=5190, status=0
01/13/17 21:13:14 (pid:5186) Got SIGQUIT. Performing fast shutdown.
01/13/17 21:13:14 (pid:5186) ShutdownFast all jobs.
01/13/17 21:13:14 (pid:5186) **** condor_starter (condor_STARTER) pid 5186 EXITING WITH STATUS 0
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) ** condor_starter (CONDOR_STARTER) STARTING UP
01/13/17 21:13:14 (pid:6033) ** /usr/sbin/condor_starter
01/13/17 21:13:14 (pid:6033) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/13/17 21:13:14 (pid:6033) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/13/17 21:13:14 (pid:6033) ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
01/13/17 21:13:14 (pid:6033) ** $CondorPlatform: x86_64_RedHat6 $
01/13/17 21:13:14 (pid:6033) ** PID = 6033
01/13/17 21:13:14 (pid:6033) ** Log last touched 1/13 21:13:14
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) Using config source: /etc/condor/condor_config
01/13/17 21:13:14 (pid:6033) Using local config sources:
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/10_security.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/14_network.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/20_workernode.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/30_lease.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/35_cms.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/40_ccb.config
01/13/17 21:13:14 (pid:6033) /etc/condor/condor_config.local
01/13/17 21:13:14 (pid:6033) config Macros = 154, Sorted = 154, StringBytes = 5592, TablesBytes = 5640
01/13/17 21:13:14 (pid:6033) CLASSAD_CACHING is OFF
01/13/17 21:13:14 (pid:6033) Daemon Log is logging: D_ALWAYS D_ERROR
01/13/17 21:13:14 (pid:6033) Daemoncore: Listening at <10.0.2.15:44921> on TCP (ReliSock).
01/13/17 21:13:14 (pid:6033) DaemonCore: command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921&noUDP>
01/13/17 21:13:14 (pid:6033) DaemonCore: private command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921>
01/13/17 21:13:15 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/13/17 21:13:15 (pid:6033) Communicating with shadow <188.184.82.11:4080?addrs=188.184.82.11-4080&noUDP&sock=1632254_5a80_125092>
01/13/17 21:13:15 (pid:6033) Submitting machine is "vocms0159.cern.ch"
01/13/17 21:13:15 (pid:6033) setting the orig job name in starter
01/13/17 21:13:15 (pid:6033) setting the orig job iwd in starter
01/13/17 21:13:15 (pid:6033) Chirp config summary: IO false, Updates false, Delayed updates true.
01/13/17 21:13:15 (pid:6033) Initialized IO Proxy.
01/13/17 21:13:15 (pid:6033) Done setting resource limits
01/13/17 21:13:17 (pid:6033) File transfer completed successfully.
01/13/17 21:13:17 (pid:6033) Job 19859.2 set to execute immediately
01/13/17 21:13:17 (pid:6033) Starting a VANILLA universe job with ID: 19859.2
01/13/17 21:13:17 (pid:6033) IWD: /var/lib/condor/execute/dir_6033
01/13/17 21:13:17 (pid:6033) Output file: /var/lib/condor/execute/dir_6033/_condor_stdout
01/13/17 21:13:17 (pid:6033) Error file: /var/lib/condor/execute/dir_6033/_condor_stderr
01/13/17 21:13:17 (pid:6033) Renice expr "10" evaluated to 10
01/13/17 21:13:17 (pid:6033) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_6033/condor_exec.exe ireid_MonteCarlo_eff_IDR_CMS_Home_170111_211021_1648-Sandbox.tar.bz2 214799
01/13/17 21:13:17 (pid:6033) Running job as user nobody
01/13/17 21:13:17 (pid:6033) Create_Process succeeded, pid=6037
01/13/17 22:16:58 (pid:6033) CCBListener: no activity from CCB server in 3223s; assuming connection is dead.
01/13/17 22:16:58 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:58 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/14/17 08:40:38 (pid:6033) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>.
01/14/17 08:40:38 (pid:6033) IO: Failed to read packet header
01/14/17 08:40:38 (pid:6033) Lost connection to shadow, waiting 1200 secs for reconnect
01/14/17 08:40:38 (pid:6033) CCBListener: no activity from CCB server in 34058s; assuming connection is dead.
01/14/17 08:40:38 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:39 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4616 - Posted: 15 Jan 2017, 15:38:28 UTC - in response to Message 4614.  

New tasks are failing.
Running jobs count is dropping.

Yes, trying to work out what is happening...
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4617 - Posted: 15 Jan 2017, 15:47:03 UTC - in response to Message 4616.  

Looks like something amiss with the Condor server. There are 700 jobs queued but only 21 running. I'm getting error messages that Condor jobs aren't starting. I've emailed Laurence.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4618 - Posted: 15 Jan 2017, 21:28:46 UTC - in response to Message 4617.  
Last modified: 15 Jan 2017, 21:49:52 UTC

From what I can tell, the matchmaking isn't working, i.e. the job descriptions are not matching the resource descriptions. As the resource descriptions have not changed, I am assuming the problem is with the job descriptions.

EDIT: Looks like the accounting group changed.

EDIT: Actually we have a blockage, more later.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4619 - Posted: 15 Jan 2017, 22:27:30 UTC - in response to Message 4618.  

The issue is that the job queue became full of merge jobs and no resources were running them. When we changed the configuration to get WMAgent jobs instead of CRAB jobs, my cluster also picked this up and hence stopped running merge jobs. The configuration for my cluster has now been updated to run merge jobs again, which will hopefully unblock everything.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4620 - Posted: 16 Jan 2017, 8:59:53 UTC - in response to Message 4619.  

Thanks, Laurence.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4653 - Posted: 12 Feb 2017, 16:27:59 UTC

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).
There is a significant speed improvement (10-15% or more).
You might want to look at enabling this by default.

(It does not work with LHCb.)
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4663 - Posted: 16 Feb 2017, 14:55:20 UTC - in response to Message 4653.  

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).
There is a significant speed improvement (10-15% or more).

VirtualBox mentions a possible performance increase of 5%.
It's hard to measure, because the runtimes of the CMS jobs differ so much.
I already did a measurement over 7 jobs with LargePages off: average 6151 seconds, max 8386 seconds, min 4340 seconds.

I also want to make a series of 7 CMS jobs with LargePages on:
3 done so far, 2 below the above average and 1 above.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4664 - Posted: 16 Feb 2017, 16:51:37 UTC - in response to Message 4663.  

You are correct. There are a number of factors: overlapping uploads, other system activity...
BTW, how do you enable it? I shut down BOINC and edit the .vbox and .vbox-prev.
Obviously there are losses involved in shutting down and starting up again.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4665 - Posted: 16 Feb 2017, 18:01:11 UTC - in response to Message 4664.  

BTW, how do you enable it? I shut down BOINC and edit the .vbox and .vbox-prev.

When you get a fresh CMS task, suspend the task early, before Condor gets a chance. It will stay in memory even when LAIM (Leave Applications In Memory) is deselected.
Stop the BOINC client and the VM will be saved.
Discard the saved state in the VirtualBox Manager.
Now you have a powered-down VM and can change the large-pages parameter with the command

"%VBOX_MSI_INSTALL_PATH%VBoxManage.exe" modifyvm <virtualmachinename> --largepages on|off

When you want to run the VM under BOINC again:
keep the task suspended in BOINC and start the VM with the VirtualBox Manager for a few minutes. Close the VM with the save-state option.
When it is saved, it's BOINC's turn again.
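For anyone who prefers to script the toggle rather than type the command by hand, here is a minimal sketch of the same idea. Assumptions, not taken from the post above: VBoxManage is on the PATH, the VM is fully powered off, and the BOINC-created VM's name contains "boinc"; the helper names are just for illustration.

# Minimal sketch: toggle VirtualBox large pages for powered-off BOINC VMs.
# Assumes VBoxManage is on the PATH and BOINC VM names contain "boinc".
import subprocess

def list_vms():
    # "VBoxManage list vms" prints lines like: "vm name" {uuid}
    out = subprocess.run(["VBoxManage", "list", "vms"],
                         capture_output=True, text=True, check=True).stdout
    return [line.split('"')[1] for line in out.splitlines() if '"' in line]

def set_large_pages(vm_name, enabled=True):
    # Only works while the VM is powered off (not running or saved).
    state = "on" if enabled else "off"
    subprocess.run(["VBoxManage", "modifyvm", vm_name, "--largepages", state],
                   check=True)

if __name__ == "__main__":
    for name in list_vms():
        if "boinc" in name.lower():
            set_large_pages(name, True)
            print(f"Large pages enabled for {name}")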
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4666 - Posted: 16 Feb 2017, 22:05:56 UTC - in response to Message 4663.  

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).
There is a significant speed improvement (10-15% or more).

VirtualBox mentions a possible performance increase of 5%.
It's hard to measure, because the runtimes of the CMS jobs differ so much.
I already did a measurement over 7 jobs with LargePages off: average 6151 seconds, max 8386 seconds, min 4340 seconds.

I also want to make a series of 7 CMS jobs with LargePages on:
3 done so far, 2 below the above average and 1 above.

The figures for 7 tasks with parameter --largepages on:

average 6024 seconds, max 8450 seconds, min 3255 seconds
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4667 - Posted: 17 Feb 2017, 8:03:04 UTC - in response to Message 4666.  

Hi Crystal

The figures for 7 tasks with parameter --largepages on:

average 6024 seconds, max 8450 seconds, min 3255 seconds


I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest (a quick sketch of the idea follows below).

Thanks for posting your method to enable "large pages".
I will give it a try.
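A minimal sketch of that trimmed-mean idea, for reference; the runtimes in the example are made up for illustration, not the measured CMS figures:

def trimmed_mean(runtimes):
    """Average the runtimes after dropping the single longest and shortest one."""
    if len(runtimes) < 3:
        raise ValueError("need at least 3 values to trim both extremes")
    trimmed = sorted(runtimes)[1:-1]
    return sum(trimmed) / len(trimmed)

# Hypothetical runtimes in seconds (for illustration only):
sample = [4300, 5200, 5800, 6100, 6600, 7100, 8400]
print(round(sum(sample) / len(sample)))  # plain mean, pulled around by the extremes
print(round(trimmed_mean(sample)))       # mean of the middle five values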
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4669 - Posted: 17 Feb 2017, 13:16:31 UTC - in response to Message 4667.  

I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm increasing my test series. I have 2 CMS VMs running simultaneously (1 VM with largepages on and 1 VM with largepages off), so they are subject to the same conditions.

Ivan made a note at the production project that the CMS queue will drain over the weekend, so we'll see how many jobs I get before the well runs dry.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4670 - Posted: 19 Feb 2017, 11:00:55 UTC - in response to Message 4669.  

I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm increasing my test series. I have 2 CMS VMs running simultaneously (1 VM with largepages on and 1 VM with largepages off), so they are subject to the same conditions.

I'm ending my test series after 30 jobs done on both VMs:

Total CPU time for 30 jobs with parameter largepages on:  203710 seconds
Total CPU time for 30 jobs with parameter largepages off: 204702 seconds

Too close to call it a significant performance increase, I tend to say.
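For scale, a quick calculation on the two totals above (both figures from this post): the gap works out to roughly half a percent, well below the 5-15% mentioned earlier in the thread.

# Difference between the two 30-job CPU totals quoted above.
on_total, off_total = 203710, 204702                # seconds
diff_pct = (off_total - on_total) / off_total * 100
print(f"largepages on is {diff_pct:.2f}% faster")   # ~0.48%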