Message boards : CMS Application : New Version v47.50

Laurence
Project administrator
Project developer
Project tester
Message 4168 - Posted: 7 Oct 2016, 9:57:56 UTC

This new version should scale the memory with the number of cores. The base memory is 2GB and 1GB extra will be added per core.
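As a rough illustration of the rule described above (not the project's actual code), the expected allocation could be computed as below. The function name and the use of decimal megabytes are assumptions; the latter is suggested only by the 3000MB figure that shows up in a log later in this thread for a 1-core VM.

```python
def expected_vm_memory_mb(cores, base_mb=2000, per_core_mb=1000):
    """Return base memory plus a fixed increment per core, in MB.

    base_mb and per_core_mb are assumed values (2 GB base, 1 GB per core);
    pass 2048 and 1024 if binary units turn out to be the right ones.
    """
    if cores < 1:
        raise ValueError("a VM needs at least one core")
    return base_mb + per_core_mb * cores


for n in (1, 2, 3):
    print(n, "core(s):", expected_vm_memory_mb(n), "MB")
```

Under these assumptions a 3-core VM would be expected to get 5000 MB, so a VM that stays at 2 GB regardless of core count is not following the rule.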
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Message 4169 - Posted: 7 Oct 2016, 14:10:33 UTC - in response to Message 4168.  

I ran 1 WU with the following setup:

  • 3 cores (configured via the project webpage)
  • no app_config.xml



Result (https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=271086):


  • The VM started with 3 cores and 2 GB RAM
  • After 7.5 minutes I got an error 207 (EXIT_NO_SUB_TASKS)

Crystal Pellet
Volunteer tester

Message 4170 - Posted: 7 Oct 2016, 17:57:59 UTC - in response to Message 4168.  

The memory extension is not working. It sticks to 2048MB. Tested it with 2 cores.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Message 4171 - Posted: 8 Oct 2016, 7:03:00 UTC - in response to Message 4169.  

I ran 1 WU with the following setup:

  • 3 cores (configured via the project webpage)
  • no app_config.xml



Result (https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=271086):


  • The VM started with 3 cores and 2 GB RAM
  • After 7.5 minutes I got an error 207 (EXIT_NO_SUB_TASKS)


I just saw that my host got the old v47.30 app instead of the new v47.50.
Both are listed here.
Crystal Pellet
Volunteer tester

Message 4172 - Posted: 8 Oct 2016, 7:35:42 UTC - in response to Message 4171.  

I just saw that my host got the old v47.30 app instead of the new v47.50.
Both are listed here.

... and for Windows I got the 47.40 (vbox64_mt_mcore) instead of the new 47.50 (vbox64_mt_mcore_cms).
Magic Quantum Mechanic
Message 4173 - Posted: 8 Oct 2016, 20:16:08 UTC - in response to Message 4172.  

I just saw that my host got the old v47.30 app instead of the new v47.50.
Both are listed here.

... and for Windows I got the 47.40 (vbox64_mt_mcore) instead of the new 47.50 (vbox64_mt_mcore_cms).


I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 host, but all I got was errors, so I will try it on another host and see if I have better luck today.

http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=192
Mad Scientist For Life
Crystal Pellet
Volunteer tester

Message 4174 - Posted: 9 Oct 2016, 8:14:59 UTC - in response to Message 4173.  

I got CMS Simulation v47.50 (vbox64_mt_mcore_cms) windows_x86_64 on one Win 10 host, but all I got was errors, so I will try it on another host and see if I have better luck today.

It looks like the VMs are not booting properly.

Also, the setup does not seem to be working, or are you using an app_config.xml?

Setting Memory Size for VM. (3000MB)
Setting CPU Count for VM. (1)
Laurence
Project administrator
Project developer
Project tester
Message 4179 - Posted: 11 Oct 2016, 8:10:44 UTC - in response to Message 4171.  

Thanks. I have just deprecated the older versions.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 4180 - Posted: 11 Oct 2016, 8:47:16 UTC - in response to Message 4179.  

Laurence, I had a task complete successfully ~0500 GMT this morning; since then they have all failed. I see several possible errors, e.g.

2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 enter Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Trying to update collector <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 Attempting to send update via TCP to collector lcggwms02.gridpp.rl.ac.uk <130.246.180.120:9623>
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:11 exit Daemons::UpdateCollector
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
...
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 State change: benchmarks completed
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 05:57:33 slot1: Changing activity: Benchmarking -> Idle
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 No resources have been claimed for 300 seconds
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Shutting down Condor on this machine.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 Got SIGTERM. Performing graceful shutdown.
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 shutdown graceful
...
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 CronJobList: Deleting all jobs
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 All resources are free, exiting.
2016-10-11 06:02:38 (25369): Guest Log: 10/11/16 06:02:35 **** condor_startd (condor_STARTD) pid 4300 EXITING WITH STATUS 0
2016-10-11 06:02:38 (25369): Guest Log: [ERROR] No jobs were available to run.
2016-10-11 06:02:38 (25369): Guest Log: [INFO] Shutting Down.
2016-10-11 06:02:38 (25369): VM Completion File Detected.
2016-10-11 06:02:38 (25369): VM Completion Message: No jobs were available to run.
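For anyone wading through a long stderr like this one, a small filter can pull out just the lines that explain why the VM stopped. This is only a sketch based on the line layout visible in the excerpt above (timestamp, a number in parentheses that is presumably a process ID, then the message); the function name and the choice of markers to search for are my own assumptions, not part of any BOINC tool.

```python
import re

# Layout inferred from the excerpt: "YYYY-MM-DD HH:MM:SS (number): message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")


def find_failures(stderr_text):
    """Return (timestamp, message) pairs for error and completion lines."""
    hits = []
    for line in stderr_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip "..." separators and anything in another layout
        timestamp, _number, message = match.groups()
        if "[ERROR]" in message or message.startswith("VM Completion Message:"):
            hits.append((timestamp, message))
    return hits


SAMPLE = """\
2016-10-11 06:02:37 (25369): Guest Log: 10/11/16 06:02:35 No resources have been claimed for 300 seconds
2016-10-11 06:02:38 (25369): Guest Log: [ERROR] No jobs were available to run.
2016-10-11 06:02:38 (25369): Guest Log: [INFO] Shutting Down.
2016-10-11 06:02:38 (25369): VM Completion Message: No jobs were available to run.
"""

failures = find_failures(SAMPLE)
for timestamp, message in failures:
    print(timestamp, message)
```

On the sample lines this keeps only the [ERROR] line and the VM completion message, which is usually enough to tell a "no jobs" shutdown from a real crash.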


http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=272506

I had a couple of failures this morning on vLHC@home on the same machine but it seems to be successfully running two jobs at the moment.

https://lhcathome.cern.ch/vLHCathome/result.php?resultid=6608306

I haven't seen anything untoward on the condor server; the tasks aren't getting as far as requesting -- or at least getting! -- jobs from the server.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 4181 - Posted: 11 Oct 2016, 9:48:14 UTC - in response to Message 4173.  
Last modified: 11 Oct 2016, 9:48:44 UTC

Magic, I see you have a job running on 1482 at present.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 4182 - Posted: 11 Oct 2016, 10:09:55 UTC - in response to Message 4179.  

Thanks. I have just deprecated the older versions

I just noticed I've been getting 47.30 since 0752 UTC.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Message 4183 - Posted: 11 Oct 2016, 13:59:15 UTC

I got 2 tasks v47.30.
1 before and 1 after a project reset.

Both

  • run 3 cores (as configured)
  • use 2 GB RAM
  • run into an error 207 after a few minutes


https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=273547
https://lhcathome.cern.ch/vLHCathome-dev/result.php?resultid=273607

Crystal Pellet
Volunteer tester

Message 4184 - Posted: 11 Oct 2016, 14:46:18 UTC

I got a task CMS Simulation v47.40 (vbox64_mt_mcore), but on the applications site only 47.50 (vbox64_mt_mcore_cms) is available.

In my prefs I configured 1 task with 2 cores. The VM created has only 1 core and 2048 MB base memory.

<app_version>
<app_name>CMS</app_name>
<version_num>4740</version_num>
<platform>windows_x86_64</platform>
<avg_ncpus>1.000000</avg_ncpus>
<max_ncpus>2.000000</max_ncpus>
<flops>26823773813.494980</flops>
<plan_class>vbox64_mt_mcore</plan_class>
<api_version>7.7.0</api_version>
<cmdline>--memory_size_mb 2048</cmdline>


If you want to run a VM with 2 cores, it is avg_ncpus that should be 2, not max_ncpus.
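A quick way to spot this kind of mismatch is to read the two values out of the app_version block and compare them. The sketch below is illustrative only: it embeds a shortened copy of the fragment above, and the closing </app_version> tag is added by me so that the text parses (the excerpt in the post is cut off before it). The function name is made up.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the fragment quoted above. The closing tag is an
# assumption added so the excerpt is well-formed XML.
FRAGMENT = """
<app_version>
    <app_name>CMS</app_name>
    <version_num>4740</version_num>
    <platform>windows_x86_64</platform>
    <avg_ncpus>1.000000</avg_ncpus>
    <max_ncpus>2.000000</max_ncpus>
    <plan_class>vbox64_mt_mcore</plan_class>
    <cmdline>--memory_size_mb 2048</cmdline>
</app_version>
"""


def summarize_app_version(xml_text):
    """Extract the fields discussed in this thread from one app_version block."""
    node = ET.fromstring(xml_text)
    return {
        "version_num": int(node.findtext("version_num")),
        "plan_class": node.findtext("plan_class"),
        "avg_ncpus": float(node.findtext("avg_ncpus")),
        "max_ncpus": float(node.findtext("max_ncpus")),
        "cmdline": node.findtext("cmdline"),
    }


info = summarize_app_version(FRAGMENT)
ncpus_mismatch = info["avg_ncpus"] != info["max_ncpus"]
print(info)
print("avg_ncpus differs from max_ncpus:", ncpus_mismatch)
```

For the fragment above this reports avg_ncpus 1.0 against max_ncpus 2.0, which matches the 1-core VM that was actually created.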
m
Volunteer tester

Message 4197 - Posted: 19 Oct 2016, 14:23:40 UTC
Last modified: 19 Oct 2016, 14:35:59 UTC

The 12/18 hr timeout still causes a fair bit of wasted time (and errors and frustration), but tasks behave oddly sometimes. In the course of shifting stuff to the new servers, I noticed this:-

CMS 47.50 (mt, 1 core, took 3gb)
Host# 553.
CMS Task# 274259.
From a very long stderr:-

Day/ Time
14/03.08.14 Wrapper start. (Boinc task started)
14/03.15.53 Job 6492 start.
14/06.15.33 Job finished with 0.
14/06.18.12 Job 7054 start.
14/06.57.02 VM stopped. (Normal host shutdown time)

15/01.04.44 Wrapper start. (Normal host startup time)
15/01.04.46 Job finished with 0.
15/01.04.46 Job 7054 start.
15/04.45.36 Job finished with 134.
15/04.45.46 Job 755 start.
15/06.57.02 VM stopped. (Normal shutdown time)

16/01.04.44 Wrapper start. (Normal startup time)
16/01.04.46 Job finished with 134.
16/01.04.46 Job 755 start.
16/06.12.17 Job finished with 0.
16/06.14.15 Job 5986 start.
16/06.57.02 VM stopped. (Normal shutdown time)

17/01.04.43 Wrapper start. (Normal startup time)
17/01.04.45 Job finished with 0.
17/01.04.45 Job 5986 start.
17/03.22.00 Boinc finish. (Boinc task finished OK)

Locating the jobs on Dashboard is a bit cumbersome; it's not obvious which CRAB task they each belong to, but it looks like 6492, 7054 and 755 all finished OK at the first attempt. 5986 failed with 61311 and got no second attempt.

Hosts are normally started by their BIOS clock and shut down by a cron job using the boinccmd "quit" command. After 3 mins the host is powered off ('nuther cron job), so the process should be fairly graceful.
Total time was ca. 15h53. The shutdown time is clearly not counting towards the timeout, which is a welcome change, but the stopping and resuming doesn't seem to work well: the job in hand gets restarted each time. What is exit code 134? The last job was cut short.

Is this how it's supposed to work? I seem to remember that the "production" standard was performance equal to the old T4T... I don't think we're quite there yet.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 4198 - Posted: 19 Oct 2016, 14:40:14 UTC - in response to Message 4197.  

From an e-mail I sent yesterday:

OK, the 134 stage-out errors seem to be mainly when the pythia job fails with an error:

...
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Your event has been rejected 10000 times!
== CMSSW: EvtGen:Will now abort.
== CMSSW: Complete
== CMSSW: process id is 6188 status is 134
======== CMSSW OUTPUT FINSHING ========
...
Job wrapper did not finish successfully (exit code 134). Setting that same exit code for the stageout wrapper.
Stageout wrapper finished with exit code 134. Will report failure to Dashboard.
...
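On the earlier question of what exit code 134 means: by the usual POSIX shell convention, an exit status above 128 is 128 plus the number of the signal that ended the process, so 134 corresponds to signal 6 (SIGABRT). That fits the "Will now abort." line in the log. A minimal sketch of the decoding, assuming a POSIX system (signal numbers differ on Windows); the function name is made up for illustration:

```python
import signal


def describe_exit_code(code):
    """Interpret an exit status using the 128 + signal-number convention."""
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Not a signal number known on this platform
            return f"killed by signal {signum}"
    return f"exited with error {code}"


print(134, "->", describe_exit_code(134))
```

This only decodes the number; whether the abort comes from EvtGen giving up on the event, as the log suggests, is a separate question.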