Message boards : CMS Application : New Version v48.00
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 4609 - Posted: 13 Jan 2017, 9:38:40 UTC

This version has a few improvements that should make the VM more resilient to network issues during startup. The tasks should now fail with either "no network" or "port closed" messages rather than the heartbeat errors we have been seeing in production.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 4610 - Posted: 13 Jan 2017, 14:05:47 UTC - in response to Message 4609.  

I tested suspend/resume for just over 1 hour and it was fine.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4611 - Posted: 13 Jan 2017, 14:11:34 UTC - in response to Message 4610.  

I tested suspend/resume for just over 1 hour and it was fine.

OK, then I'll test the overnight suspend/resume.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4612 - Posted: 14 Jan 2017, 8:39:50 UTC - in response to Message 4611.  

I tested suspend/resume for just over 1 hour and it was fine.

OK, then I'll test the overnight suspend/resume.

The running job did not survive the overnight suspend and this morning's resume.
The strange thing is that after the resume the job carried on fine, but after about half an hour Condor decided to kill the job, or perhaps didn't even know a job was running.
Because the elapsed time was over 12 hours, the BOINC task was ended.

https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=300652

From MasterLog:
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
01/14/17 08:40:11 condor_write(): Socket closed when trying to write 2194 bytes to collector vccondor01.cern.ch, fd is 12, errno=104 Connection reset by peer
01/14/17 08:40:11 Buf::write(): condor_write() failed
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513731
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 4613 - Posted: 14 Jan 2017, 22:53:12 UTC - in response to Message 4612.  

Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine; the CCB ID is the same before and after (#2513731).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4614 - Posted: 15 Jan 2017, 15:13:06 UTC

New tasks are failing.
Running jobs count is dropping.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4615 - Posted: 15 Jan 2017, 15:21:36 UTC - in response to Message 4613.  

Thanks for testing. Do you have the StartLog or StarterLog? The MasterLog seems fine. The connection was closed by the server during the suspend but it reconnected fine. The before and after id is the same (#2513731).

Last part of StartLog:
01/13/17 19:54:31 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 404 (DEACTIVATE_CLAIM_FORCIBLY), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Called deactivate_claim_forcibly()
01/13/17 21:13:14 Starter pid 5186 exited with status 0
01/13/17 21:13:14 State change: starter exited
01/13/17 21:13:14 Changing activity: Busy -> Idle
01/13/17 21:13:14 PERMISSION DENIED to submit-side@matchsession from host 188.184.82.11 for command 444 (ACTIVATE_CLAIM), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
01/13/17 21:13:14 Got activate_claim request from shadow (188.184.82.11)
01/13/17 21:13:14 Remote job ID is 19859.2
01/13/17 21:13:14 Got universe "VANILLA" (5) from request classad
01/13/17 21:13:14 State change: claim-activation protocol successful
01/13/17 21:13:14 Changing activity: Idle -> Busy
01/13/17 22:16:22 CCBListener: no activity from CCB server in 3188s; assuming connection is dead.
01/13/17 22:16:22 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:22 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735
01/14/17 08:40:11 CCBListener: no activity from CCB server in 34061s; assuming connection is dead.
01/14/17 08:40:11 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:40:29 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
01/14/17 08:40:29 Buf::write(): condor_write() failed
01/14/17 08:41:13 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2513735


Last part of StarterLog:
01/13/17 18:19:14 (pid:5186) Running job as user nobody
01/13/17 18:19:14 (pid:5186) Create_Process succeeded, pid=5190
01/13/17 19:53:31 (pid:5186) CCBListener: no activity from CCB server in 1456s; assuming connection is dead.
01/13/17 19:53:31 (pid:5186) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 19:54:32 (pid:5186) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2520785
01/13/17 21:13:13 (pid:5186) Process exited, pid=5190, status=0
01/13/17 21:13:14 (pid:5186) Got SIGQUIT. Performing fast shutdown.
01/13/17 21:13:14 (pid:5186) ShutdownFast all jobs.
01/13/17 21:13:14 (pid:5186) **** condor_starter (condor_STARTER) pid 5186 EXITING WITH STATUS 0
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) ** condor_starter (CONDOR_STARTER) STARTING UP
01/13/17 21:13:14 (pid:6033) ** /usr/sbin/condor_starter
01/13/17 21:13:14 (pid:6033) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
01/13/17 21:13:14 (pid:6033) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
01/13/17 21:13:14 (pid:6033) ** $CondorVersion: 8.4.8 Jun 30 2016 BuildID: 373513 $
01/13/17 21:13:14 (pid:6033) ** $CondorPlatform: x86_64_RedHat6 $
01/13/17 21:13:14 (pid:6033) ** PID = 6033
01/13/17 21:13:14 (pid:6033) ** Log last touched 1/13 21:13:14
01/13/17 21:13:14 (pid:6033) ******************************************************
01/13/17 21:13:14 (pid:6033) Using config source: /etc/condor/condor_config
01/13/17 21:13:14 (pid:6033) Using local config sources:
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/10_security.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/14_network.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/20_workernode.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/30_lease.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/35_cms.config
01/13/17 21:13:14 (pid:6033) /etc/condor/config.d/40_ccb.config
01/13/17 21:13:14 (pid:6033) /etc/condor/condor_config.local
01/13/17 21:13:14 (pid:6033) config Macros = 154, Sorted = 154, StringBytes = 5592, TablesBytes = 5640
01/13/17 21:13:14 (pid:6033) CLASSAD_CACHING is OFF
01/13/17 21:13:14 (pid:6033) Daemon Log is logging: D_ALWAYS D_ERROR
01/13/17 21:13:14 (pid:6033) Daemoncore: Listening at <10.0.2.15:44921> on TCP (ReliSock).
01/13/17 21:13:14 (pid:6033) DaemonCore: command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921&noUDP>
01/13/17 21:13:14 (pid:6033) DaemonCore: private command socket at <10.0.2.15:44921?addrs=10.0.2.15-44921>
01/13/17 21:13:15 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/13/17 21:13:15 (pid:6033) Communicating with shadow <188.184.82.11:4080?addrs=188.184.82.11-4080&noUDP&sock=1632254_5a80_125092>
01/13/17 21:13:15 (pid:6033) Submitting machine is "vocms0159.cern.ch"
01/13/17 21:13:15 (pid:6033) setting the orig job name in starter
01/13/17 21:13:15 (pid:6033) setting the orig job iwd in starter
01/13/17 21:13:15 (pid:6033) Chirp config summary: IO false, Updates false, Delayed updates true.
01/13/17 21:13:15 (pid:6033) Initialized IO Proxy.
01/13/17 21:13:15 (pid:6033) Done setting resource limits
01/13/17 21:13:17 (pid:6033) File transfer completed successfully.
01/13/17 21:13:17 (pid:6033) Job 19859.2 set to execute immediately
01/13/17 21:13:17 (pid:6033) Starting a VANILLA universe job with ID: 19859.2
01/13/17 21:13:17 (pid:6033) IWD: /var/lib/condor/execute/dir_6033
01/13/17 21:13:17 (pid:6033) Output file: /var/lib/condor/execute/dir_6033/_condor_stdout
01/13/17 21:13:17 (pid:6033) Error file: /var/lib/condor/execute/dir_6033/_condor_stderr
01/13/17 21:13:17 (pid:6033) Renice expr "10" evaluated to 10
01/13/17 21:13:17 (pid:6033) Using wrapper /usr/local/bin/job-wrapper to exec /var/lib/condor/execute/dir_6033/condor_exec.exe ireid_MonteCarlo_eff_IDR_CMS_Home_170111_211021_1648-Sandbox.tar.bz2 214799
01/13/17 21:13:17 (pid:6033) Running job as user nobody
01/13/17 21:13:17 (pid:6033) Create_Process succeeded, pid=6037
01/13/17 22:16:58 (pid:6033) CCBListener: no activity from CCB server in 3223s; assuming connection is dead.
01/13/17 22:16:58 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/13/17 22:17:58 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
01/14/17 08:40:38 (pid:6033) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.82.11:4080>.
01/14/17 08:40:38 (pid:6033) IO: Failed to read packet header
01/14/17 08:40:38 (pid:6033) Lost connection to shadow, waiting 1200 secs for reconnect
01/14/17 08:40:38 (pid:6033) CCBListener: no activity from CCB server in 34058s; assuming connection is dead.
01/14/17 08:40:38 (pid:6033) CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
01/14/17 08:41:39 (pid:6033) CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#2525601
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,170,951
RAC: 1,761
Message 4616 - Posted: 15 Jan 2017, 15:38:28 UTC - in response to Message 4614.  

New tasks are failing.
Running jobs count is dropping.

Yes, trying to work out what is happening...
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,170,951
RAC: 1,761
Message 4617 - Posted: 15 Jan 2017, 15:47:03 UTC - in response to Message 4616.  

Looks like something is amiss with the Condor server. There are 700 jobs queued but only 21 running. I'm getting error messages that Condor jobs aren't starting. I've emailed Laurence.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 4618 - Posted: 15 Jan 2017, 21:28:46 UTC - in response to Message 4617.  
Last modified: 15 Jan 2017, 21:49:52 UTC

From what I can tell, the matchmaking isn't working, i.e. the job descriptions are not matching the resource descriptions. As the resource descriptions have not changed, I am assuming the problem is with the job descriptions.
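
For anyone who wants to poke at this themselves (a sketch only; it assumes shell access to the Condor submit host, which volunteers don't have): HTCondor's analysis mode, e.g.

condor_q -better-analyze <cluster.proc>

reports which clauses of the job's Requirements fail against the available machine ClassAds, which is usually the quickest way to see whether the job side or the resource side broke the match.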

EDIT: Looks like the accounting group changed.

EDIT: Actually we have a blockage, more later.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 4619 - Posted: 15 Jan 2017, 22:27:30 UTC - in response to Message 4618.  

The issue is that the job queue became full of merge jobs and no resources were running them. When we changed the configuration to get WMAgent jobs instead of CRAB jobs, my cluster also picked this up and hence stopped running merge jobs. The configuration for my cluster has now been updated to run merge jobs again, which will hopefully unblock everything.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,170,951
RAC: 1,761
Message 4620 - Posted: 16 Jan 2017, 8:59:53 UTC - in response to Message 4619.  

Thanks, Laurence.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4653 - Posted: 12 Feb 2017, 16:27:59 UTC

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as I can tell).
There is a significant speed improvement (10-15% or more).
You might want to look at enabling this by default.

(It does not work with LHCb.)
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4663 - Posted: 16 Feb 2017, 14:55:20 UTC - in response to Message 4653.  

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as i can tell)
There is a significant speed improvement (10-15% or more).

VirtualBox mentions a possible performance increase of 5%.
It's hard to measure, because the runtimes of the CMS jobs differ so much.
I already did a measurement over 7 jobs with LargePages off: average 6151 seconds - max 8386 seconds - min 4340 seconds

I also want to run a series of 7 CMS jobs with LargePages on:
3 done so far, 2 below the above average and 1 above.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4664 - Posted: 16 Feb 2017, 16:51:37 UTC - in response to Message 4663.  

You are correct. There are a number of factors: overlapping uploads, other system activity...
BTW, how do you enable it? I shut down BOINC and edit the .vbox and .vbox-prev files.
Obviously there are losses involved in shutting down and starting up again.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4665 - Posted: 16 Feb 2017, 18:01:11 UTC - in response to Message 4664.  

BTW. how do you enable it? I shut down boinc and edit the .vbox and .vbox-prev.

When you get a fresh CMS task, suspend the task early, before Condor gets a chance. It will stay in memory even when LAIM ("Leave applications in memory") is deselected.
Stop the BOINC client and the VM will be saved.
Discard the saved state.
Now you have a powered-down VM and can change the Large Pages parameter with the command

"%VBOX_MSI_INSTALL_PATH%vboxmanage.exe" modifyvm virtualmachinename --largepages on|off

When you want to run the VM in BOINC again:
Keep the task suspended in BOINC and start the VM with the VirtualBox Manager for a few minutes. Close the VM with the save-state option.
When saved, it's BOINC's turn again.
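
To double-check that the setting took (a sketch, assuming VBoxManage is reachable through the same %VBOX_MSI_INSTALL_PATH% variable and that you use the VM name as registered in VirtualBox):

"%VBOX_MSI_INSTALL_PATH%vboxmanage.exe" list vms
"%VBOX_MSI_INSTALL_PATH%vboxmanage.exe" showvminfo virtualmachinename | findstr /i /c:"Large Pages"

The first command lists the registered VM names; the showvminfo output should contain a "Large Pages:" line showing the value set with modifyvm.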
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4666 - Posted: 16 Feb 2017, 22:05:56 UTC - in response to Message 4663.  

I have been enabling "LargePages" in the VM for a while now and it works fine (as far as i can tell)
There is a significant speed improvement (10-15% or more).

VirtualBox mentions a possible performance increase of 5%.
It's hard to measure, cause the runtimes of the CMS-jobs differs so much.
I already did a measurement over 7 jobs with LargePages off: average 6151 seconds - max 8386 seconds - min 4340 seconds

I also want to make a serie of 7 CMS-jobs with LargePages on:
3 done so far, 2 below the above average and 1 above.

The figures for 7 tasks with parameter --largepages on:

average 6024 seconds - max 8450 seconds - min 3255 seconds
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4667 - Posted: 17 Feb 2017, 8:03:04 UTC - in response to Message 4666.  

Hi Crystal

The figures for 7 tasks with parameter --largepages on:

average 6024 seconds - max 8450 seconds - min 3255 seconds


I think a better approach for a small sample size (<15 or so) is to disregard the longest and the shortest runtimes and average the rest.
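
As a worked example with the figures above (assuming the posted averages are plain means over the 7 runs): with LargePages off that would give (7 x 6151 - 8386 - 4340) / 5 ≈ 6066 seconds, and with LargePages on (7 x 6024 - 8450 - 3255) / 5 ≈ 6093 seconds.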

Thanks for posting your method to enable "large pages".
I will give it a try.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4669 - Posted: 17 Feb 2017, 13:16:31 UTC - in response to Message 4667.  

I think a better approach for small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm increasing my test series. I have 2 CMS VMs running simultaneously (1 VM with largepages on and 1 VM with largepages off), so they are subject to the same conditions.

Ivan made a note at the production project that the CMS queue will drain over the weekend, so we'll see how many jobs I get before the well runs dry.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 4670 - Posted: 19 Feb 2017, 11:00:55 UTC - in response to Message 4669.  

I think a better approach for small sample size (<15 or so) is to disregard the longest and the shortest runtime and average the rest.

I'm increasing my test series. I have 2 CMS-VM's running simultaneously - 1 VM with largepages on and 1 VM with largepages off -, so they suffer from the same conditions.

I'm ending my test series now that 30 jobs have been done on both VMs:

CPU summary of 30 jobs with parameter largepages on:  203710 seconds
CPU summary of 30 jobs with parameter largepages off: 204702 seconds

Too close to call it a significant performance increase, I'd say.
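
For reference, that difference is (204702 - 203710) / 204702, i.e. about 0.5%, well within the job-to-job spread reported earlier in this thread.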