Message boards :
Theory Application :
New version v3.12
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
Yes CP, you are correct... these slots are usually not in sequence, and depending on how many cores you are using you will see it happening in the VirtualBox Manager log files (yes, I have watched thousands of them run over the years). I have run all versions from 8-core down to single-core tasks, and for me the best is 2-core tasks running x4 on all the 8-core PCs. Theory tasks run OK with only 8 GB RAM (I do that on one of them), and the others have 16-24 GB RAM and always work with these Theory tasks and the former CMS and LHCb tasks (and many times with ATLAS alpha testing). It's 2 a.m. and I am waiting for the latest Windows update, with lots of new stuff I will never want, to finish so I can suspend and reboot here.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
Thanks Crystal, now running three tasks with one CPU each. Yes Magic, we will never reach your performance for Theory. Good luck with your updates.
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743072
I just got another of these *EXIT_DISK_LIMIT_EXCEEDED* errors, on a different host this time, and that one is only running single-core tasks. It can't be the settings I use, since I never put any limits in the BOINC Manager, and this host never does anything else other than these tasks and a 2-core LHC Theory task, and they never have any problems. I am not sure what EXIT_DISK_LIMIT_EXCEEDED even means, unless it is that *slot* problem I remember from over 3 years ago.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
I saw this error in GPUGrid as well. It does not seem to be on the user's end; you haven't changed anything.
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
"I saw this error in GPUGrid as well."
Yes, I agree with you Axel, and I have some up and running again, so I will see how they work when I get up later today.
Joined: 28 Jul 16 · Posts: 478 · Credit: 394,720 · RAC: 318
The stderr.txt contains 2 lines of the following kind a while before the EXIT_DISK_LIMIT_EXCEEDED error:
2018-12-15 06:51:10 (129813): Guest Log: [INFO] Condor JobID: 483300.101 in slot1
2018-12-15 06:51:15 (129813): Guest Log: [INFO] MCPlots JobID: 47695141 in slot1
Is it possible to use these IDs to check whether the failed VMs always run the same app, e.g. sherpa, that may cause the error?
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
"I saw this error in GPUGrid as well."
rsc_disk_bound: a bound on the maximum disk space used by the job, including all input, temporary, and output files. The job will only be sent to hosts with at least this much available disk space. If this bound is exceeded, the job will be aborted.
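Given that definition, the client-side check amounts to comparing the total on-disk size of a task's slot directory against the bound. A minimal shell sketch of that comparison, assuming a Linux host with GNU `du`; the `check_slot` helper name and the 8,000,000,000-byte default are mine for illustration, not the actual BOINC client code:

```shell
# check_slot DIR [BOUND] : report whether DIR's total size (including
# sub-directories) exceeds BOUND bytes. BOUND defaults to the
# 8,000,000,000-byte rsc_disk_bound quoted later in this thread.
check_slot() {
    dir=$1
    bound=${2:-8000000000}
    # du -sb prints the total size of the directory tree in bytes (GNU du)
    used=$(du -sb "$dir" 2>/dev/null | awk '{print $1}')
    used=${used:-0}
    if [ "$used" -gt "$bound" ]; then
        echo "EXCEEDED: $used > $bound bytes in $dir"
    else
        echo "OK: $used of $bound bytes in $dir"
    fi
}
```

Running it against a slot directory, e.g. `check_slot /var/lib/boinc-client/slots/0` (the path depends on your BOINC installation), mimics the decision the client makes before aborting a task with EXIT_DISK_LIMIT_EXCEEDED.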
Joined: 8 Apr 15 · Posts: 753 · Credit: 11,736,916 · RAC: 8,819
(I just got home.) So far today they have all been Valid, but the last host that had that problem turned one in and the other is still running (about 33% so far), and just to test again I started another, so it will be like it was the last time it happened. The first host that did this is back to running 4 x 2-core Theory tasks for LHC and has no problems with those (it has 8 cores with 8 GB RAM, and the drive is an SSD with only 177 GB free). The other 4-core that had the same problem (it is actually an AMD 3-core that I opened the 4th core on) has 385 GB free and 8 GB RAM, and it previously had no problem running a 2-core task for LHC and 2 single-core tasks here. The other 8-core that I run two 2-core tasks on here, alongside a pair of 2-core LHC tasks, could never have any of the typical problems, since it has 24 GB RAM with an Intel CPU running at 3.67 GHz on a 2 TB HD. I only saw one error on that last failed task when looking at the VirtualBox Manager log; all the rest was OK (like the ones you see in the RDC). So I will watch these 2 running here now, along with the 2-core LHC task, for the next 8 hours or so and see if it happens again.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC? In starter.log there is this line:
12/16/18 01:27:02 (pid:4371) '/usr/bin/singularity --version' did not exit successfully (code -759635680); the first line of output was ''.
Maybe they are doing an upgrade for Theory on -dev?
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
"What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC?"
The rsc_disk_bound is what it has been for years: 8,000,000,000 bytes. A normal Theory task uses about 800 MB up to 1100 MB in a slot directory, so there must be something very strange going on. It would be informative to check with every EXIT_DISK_LIMIT_EXCEEDED error whether this is also written in BOINC's event log, because the abort should be initiated by BOINC and not by the VM or the wrapper. I have 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes.
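That kind of logging can be approximated with a periodic `find` over the slot directories. A sketch, assuming a Linux host with GNU find; the `log_big_files` name and the default slots path are hypothetical and would need adjusting to your BOINC data directory:

```shell
# log_big_files [SLOTS_DIR] : list every file larger than 512 KiB
# (524,288 bytes) under the BOINC slot directories, biggest first.
log_big_files() {
    slots=${1:-/var/lib/boinc-client/slots}
    # -size +524288c matches files strictly larger than 524,288 bytes;
    # -printf '%s\t%p\n' prints "size<TAB>path" (GNU find)
    find "$slots" -type f -size +524288c -printf '%s\t%p\n' 2>/dev/null \
        | sort -nr
}
```

Run from cron or a watch loop, this would show which file in a misbehaving slot (for example a runaway log) is growing toward the 8 GB bound.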
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
"I have 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes."
I ran 2 x 4 tasks. Peak disk usage was between 838 MB and 1266 MB. All were successful, without the exceeded-disk-limit error. I can't reproduce this behavior. Maybe someone else can give more information about which files are growing extremely.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
I have 7 Theory tasks, each with one CPU, running on one computer at the moment.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
One of them is a sherpa 2.2.4 tree LCG_87:
Output of the job wrapper may appear here.
08:03:56 +0100 2018-12-18 [INFO] New Job Starting in slot1
08:03:56 +0100 2018-12-18 [INFO] Condor JobID: 483823.52 in slot1
08:04:01 +0100 2018-12-18 [INFO] MCPlots JobID: 47765377 in slot1
At the moment: 15 MByte after 4 hours of runtime!
Edit: 20 min. later... 2.4 GByte. Port 64722, boinc_7585e319e855c9da. Is more help needed?
Edit: another 20 min. later: 3.6 GByte.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
(beam){
  BEAM_1 = 2212; BEAM_ENERGY_1 = 3500.;
  BEAM_2 = 2212; BEAM_ENERGY_2 = 3500.;
}(beam)
(processes){
  Process 93 93 -> 93 93 93{2};
  Order (*,0); Max_N_Quarks 4;
  CKKW sqr(50/E_CMS);
  Integration_Error 0.02 {4};
  End process;
}(processes)
Now more than 6 GByte.
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
Could you show the directory tree of all files and their sizes in the slot?
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
It is the running.log, seen via RDP. Now 8.9 GByte. The size of the vm_image is 5.762 GByte.
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
When all files in the slot and its sub-dirs exceed 7629.39 MB, the task should crash :(
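For reference, the 7629.39 MB figure is simply the 8,000,000,000-byte rsc_disk_bound quoted earlier, converted to binary megabytes:

```shell
# 8,000,000,000 bytes expressed in MiB (1 MiB = 1024 * 1024 bytes)
awk 'BEGIN { printf "%.2f\n", 8000000000 / (1024 * 1024) }'
# prints 7629.39
```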
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
That's all we want ;-) What about an XP task running with only 4 GByte of disk use? Magic is one user with XP! Sorry Crystal, it's now at 11 GByte. Will it be cancelled by... disk-end.
Joined: 22 Apr 16 · Posts: 672 · Credit: 1,898,980 · RAC: 5,725
At 12 GByte the task stopped with a disk error:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743403
P. Skands has some work to do now ;-)
Joined: 13 Feb 15 · Posts: 1185 · Credit: 848,858 · RAC: 1,746
It's good to know that it's the application, the VM job, that is causing the errors. Probably it's always a Sherpa job.
©2024 CERN