Message boards :
Theory Application :
New version v3.12
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
Yes CP, you are correct... these slots are usually not in sequence, and depending on how many cores you are using you will see it happening in the VirtualBox Manager log files (yes, I have watched thousands of them run over the years). I have run all versions from 8-core down to single-core tasks, and for me the best is 2-core tasks running x4 on all the 8-core PCs. Theory tasks run OK with only 8 GB RAM (I do that on one of them), and the others have 16-24 GB RAM and always work with these Theory tasks and the former CMS and LHCb tasks (and many times with ATLAS alpha testing). It's 2 a.m. and I am waiting for the latest Windows update, with lots of new stuff I will never want, to finish so I can suspend and reboot here.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
Thanks Crystal, now running three tasks with one CPU each. Yes Magic, we will never reach your performance for Theory. Good luck with your updates.
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743072
I just got another of these *EXIT_DISK_LIMIT_EXCEEDED* errors, on a different host this time, and that one is only running single-core tasks. It can't be the settings I use, since I never put any limits in the BOINC Manager, and this host never does anything else other than these tasks and a 2-core LHC Theory task, and they never have any problems. I am not sure what EXIT_DISK_LIMIT_EXCEEDED even means, unless it is that *slot* problem I remember from over 3 years ago.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
I saw this error in GPUGrid as well. It does not seem to be on the user's end; you haven't changed anything.
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
"I saw this error in GPUGrid as well."
Yes, I agree with you Axel, and I have some up and running again, so I will see how they work when I get up later today.
Joined: 28 Jul 16 · Posts: 478 · Credit: 394,720 · RAC: 318
The stderr.txt contains 2 lines of the following kind a while before the EXIT_DISK_LIMIT_EXCEEDED error:
2018-12-15 06:51:10 (129813): Guest Log: [INFO] Condor JobID: 483300.101 in slot1
2018-12-15 06:51:15 (129813): Guest Log: [INFO] MCPlots JobID: 47695141 in slot1
Is it possible to use these IDs to check whether the failed VMs always run the same app, e.g. sherpa, that may cause the error?
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
"I saw this error in GPUGrid as well."
rsc_disk_bound: a bound on the maximum disk space used by the job, including all input, temporary, and output files. The job will only be sent to hosts with at least this much available disk space. If this bound is exceeded, the job will be aborted.
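Given that definition, the client-side check amounts to comparing the total on-disk size of a task's slot directory against the bound. A minimal shell sketch of that comparison, assuming a Linux host with GNU `du`; the `check_slot` helper name and the 8,000,000,000-byte default are mine for illustration, not the actual BOINC client code:

```shell
# check_slot DIR [BOUND] : report whether DIR's total size (including
# sub-directories) exceeds BOUND bytes. BOUND defaults to the
# 8,000,000,000-byte rsc_disk_bound quoted later in this thread.
check_slot() {
    dir=$1
    bound=${2:-8000000000}
    # du -sb prints the total size of the directory tree in bytes (GNU du)
    used=$(du -sb "$dir" 2>/dev/null | awk '{print $1}')
    used=${used:-0}
    if [ "$used" -gt "$bound" ]; then
        echo "EXCEEDED: $used > $bound bytes in $dir"
    else
        echo "OK: $used of $bound bytes in $dir"
    fi
}
```

Running it against a slot directory, e.g. `check_slot /var/lib/boinc-client/slots/0` (the path depends on your BOINC installation), mimics the decision the client makes before aborting a task with EXIT_DISK_LIMIT_EXCEEDED.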
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
(I just got home.) So far today they have all been Valid, but the last host that had that problem turned one in and the other is still running (about 33% so far), and just to test again I started another, so it will be like it was the last time it happened. The first host that did this is back to running 4 x 2-core Theory tasks for LHC and has no problems with those (it has 8 cores with 8 GB RAM, and the drive is an SSD with only 177 GB free). The other 4-core that had the same problem (it is actually an AMD 3-core that I opened the 4th core on) has 385 GB free and 8 GB RAM, and it previously had no problem running a 2-core task for LHC and 2 single-core tasks here. The other 8-core that I run two 2-core tasks on here, alongside a pair of 2-core LHC tasks, could never have any of the typical problems, since it has 24 GB RAM with an Intel CPU running at 3.67 GHz on a 2 TB HD. I only saw one error on that last failed task when looking at the VirtualBox Manager log; all the rest was OK (like the ones you see in the RDC). So I will watch these 2 running here now, along with the 2-core LHC task, for the next 8 hours or so and see if it happens again.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC? In starter.log there is this line:
12/16/18 01:27:02 (pid:4371) '/usr/bin/singularity --version' did not exit successfully (code -759635680); the first line of output was ''.
Maybe they are doing an upgrade for Theory on -dev?
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
"What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC?"
The rsc_disk_bound is what it has been for years: 8,000,000,000 bytes. A normal Theory task uses about 800 MB up to 1100 MB in a slot directory, so there must be something very strange going on. It would be informative to check with every EXIT_DISK_LIMIT_EXCEEDED error whether this is also written in BOINC's event log, because the abort should be initiated by BOINC and not by the VM or the wrapper. I have 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes.
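That kind of logging can be approximated with a periodic `find` over the slot directories. A sketch, assuming a Linux host with GNU find; the `log_big_files` name and the default slots path are hypothetical and would need adjusting to your BOINC data directory:

```shell
# log_big_files [SLOTS_DIR] : list every file larger than 512 KiB
# (524,288 bytes) under the BOINC slot directories, biggest first.
log_big_files() {
    slots=${1:-/var/lib/boinc-client/slots}
    # -size +524288c matches files strictly larger than 524,288 bytes;
    # -printf '%s\t%p\n' prints "size<TAB>path" (GNU find)
    find "$slots" -type f -size +524288c -printf '%s\t%p\n' 2>/dev/null \
        | sort -nr
}
```

Run from cron or a watch loop, this would show which file in a misbehaving slot (for example a runaway log) is growing toward the 8 GB bound.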
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
"I have 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes."
I ran 2 x 4 tasks. Peak disk usage was between 838 MB and 1266 MB. All were successful, without the exceeded-disk-limit error. I can't reproduce this behavior. Maybe someone else can give more information about which files are growing extremely.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
I have 7 Theory tasks, each with one CPU, running on one computer at the moment.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
One of them is a sherpa 2.2.4 tree LCG_87:
Output of the job wrapper may appear here.
08:03:56 +0100 2018-12-18 [INFO] New Job Starting in slot1
08:03:56 +0100 2018-12-18 [INFO] Condor JobID: 483823.52 in slot1
08:04:01 +0100 2018-12-18 [INFO] MCPlots JobID: 47765377 in slot1
At the moment: 15 MByte after 4 hours of runtime!
Edit: 20 min. later... 2.4 GByte. Port 64722, boinc_7585e319e855c9da. Is more help needed?
Edit: another 20 min. later: 3.6 GByte.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
(beam){
  BEAM_1 = 2212; BEAM_ENERGY_1 = 3500.;
  BEAM_2 = 2212; BEAM_ENERGY_2 = 3500.;
}(beam)
(processes){
  Process 93 93 -> 93 93 93{2};
  Order (*,0); Max_N_Quarks 4;
  CKKW sqr(50/E_CMS);
  Integration_Error 0.02 {4};
  End process;
}(processes)
Now more than 6 GByte.
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
Could you show the directory tree of all files and their sizes in the slot?
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
It is the running.log, seen via RDP. Now 8.9 GByte. The size of the vm_image is 5.762 GByte.
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
When all files in the slot and its sub-dirs exceed 7629.39 MB, the task should crash :(
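For reference, the 7629.39 MB figure is simply the 8,000,000,000-byte rsc_disk_bound quoted earlier, converted to binary megabytes:

```shell
# 8,000,000,000 bytes expressed in MiB (1 MiB = 1024 * 1024 bytes)
awk 'BEGIN { printf "%.2f\n", 8000000000 / (1024 * 1024) }'
# prints 7629.39
```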
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
That's all we want ;-) What about an XP task running with only 4 GByte of disk use? Magic is one user with XP! Sorry Crystal, it's now at 11 GByte. Will it be cancelled by... disk-end.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
At 12 GByte the task stopped with a disk error:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743403
P. Skands has some work to do now ;-)
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
It's good to know that it's the application, the VM job, that is causing the errors. Probably it's always a Sherpa job.
©2024 CERN