Message boards : Theory Application : New version v3.12

Previous · 1 · 2 · 3 · 4 · Next

Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 753
Credit: 11,736,916
RAC: 8,819
Message 5725 - Posted: 12 Dec 2018, 10:01:49 UTC

Yes CP, you are correct....... these slots are usually not in sequence, and depending on how many cores you are using you will see it happening in the VB Manager log files (yeah, I have watched thousands of them run over the years).

I have run all versions from 8-core down to single-core tasks, and for me the best is 2-core tasks running x4 on all the 8-core PCs. With Theory tasks they run OK with only 8GB RAM (I do that on one of them), and the others have 16-24 GB RAM and they always work with these Theory tasks and the former CMS and LHCb's (and many times with ATLAS alpha testing)...... 2am and I am waiting for the latest Windows update (with lots of new stuff I will never want) to finish so I can suspend and reboot here.
ID: 5725
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5726 - Posted: 12 Dec 2018, 10:57:56 UTC - in response to Message 5724.  

Thanks Crystal,
now three tasks, each with one CPU.
Yes Magic,
we will never reach your performance for Theory.
Good luck with your updates.
ID: 5726
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 753
Credit: 11,736,916
RAC: 8,819
Message 5732 - Posted: 15 Dec 2018, 5:39:30 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743072

I just got another of these *EXIT_DISK_LIMIT_EXCEEDED* errors.

On a different host this time, and that one is only running single-core tasks. It can't be the settings I use, since I never even put any limits in the BOINC Manager, and this host never does anything other than these tasks and a 2-core LHC Theory, and they never have any problems.

Not sure what this EXIT_DISK_LIMIT_EXCEEDED even means unless it is that *slot* problem I remember from over 3 years ago.
ID: 5732
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5733 - Posted: 15 Dec 2018, 7:42:13 UTC - in response to Message 5712.  

I saw this error in GPUGrid as well: they had a discussion about this BOINC parameter for Condor:
<rsc_disk_bound>
Maybe Laurence can help us with this.


It seems it is not on the user end. You have changed nothing.
ID: 5733
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 753
Credit: 11,736,916
RAC: 8,819
Message 5734 - Posted: 15 Dec 2018, 9:44:22 UTC - in response to Message 5733.  

I saw this error in GPUGrid as well: they had a discussion about this BOINC parameter for Condor:
<rsc_disk_bound>
Maybe Laurence can help us with this.


It seems it is not on the user end. You have changed nothing.


Yes I agree with you Axel and I have some up and running again so I will see how they work when I get up later today.
ID: 5734
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 478
Credit: 394,720
RAC: 318
Message 5735 - Posted: 15 Dec 2018, 14:30:57 UTC

The stderr.txt contains 2 lines of the following kind a while before the EXIT_DISK_LIMIT_EXCEEDED error:
2018-12-15 06:51:10 (129813): Guest Log: [INFO] Condor JobID:  483300.101 in slot1
2018-12-15 06:51:15 (129813): Guest Log: [INFO] MCPlots JobID: 47695141 in slot1

Is it possible to use these IDs to check whether the failed VMs always run the same app, e.g. sherpa, which may cause the error?
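As a rough sketch of what such a check could look like (the log-line format is taken from the two lines quoted above; the helper name and usage are assumptions, not an existing tool), something like this could pull both IDs out of a stderr.txt:

```python
import re

# Hypothetical sketch: extract the Condor and MCPlots job IDs from a
# stderr.txt so failed tasks can be matched against the job they ran.
# The "Guest Log" line format is copied from the two lines quoted above.
ID_RE = re.compile(
    r"Guest Log: \[INFO\] (Condor JobID:\s*(\S+)|MCPlots JobID:\s*(\S+))"
)

def extract_job_ids(log_text):
    """Return (condor_id, mcplots_id) found in the log text (None if absent)."""
    condor_id = mcplots_id = None
    for line in log_text.splitlines():
        m = ID_RE.search(line)
        if m:
            if m.group(2):
                condor_id = m.group(2)
            else:
                mcplots_id = m.group(3)
    return condor_id, mcplots_id
```

With the IDs in hand, the failed results could be grouped by generator to see whether it is always the same app.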
ID: 5735
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5736 - Posted: 15 Dec 2018, 20:06:31 UTC - in response to Message 5733.  

I saw this error in GPUGrid as well: they had a discussion about this BOINC parameter for Condor:
<rsc_disk_bound>
Maybe Laurence can help us with this.


It seems it is not on the user end. You have changed nothing.


rsc_disk_bound
A bound on the maximum disk space used by the job, including all input, temporary, and output files. The job will only be sent to hosts with at least this much available disk space. If this bound is exceeded, the job will be aborted.
ID: 5736
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 753
Credit: 11,736,916
RAC: 8,819
Message 5737 - Posted: 16 Dec 2018, 1:59:10 UTC

(I just got home)

So far today they have been Valids, but the last host that had that problem turned one in and the other is still running (about 33% so far), and just to test again I started another, so it will be like it was the last time it happened.

The first host that did this is just back to running 4 x 2-core Theory tasks for LHC and it has no problems with those (it does have 8 cores with 8GB RAM, and the drive is an SSD with only 177GB free).

This other 4-core that had the same problem (it is actually an AMD 3-core that I opened up the 4th core on) has 385GB free and 8GB RAM, and before it had no problem running a 2-core for LHC and 2 single-core tasks here.

The other 8-core that I run two 2-core tasks on here, with a pair of 2-core LHC tasks, could never have any of the typical problems, since it has 24GB RAM with an Intel running at 3.67GHz on a 2TB HD.

I only saw one error on that last failed task looking at the VB Manager log, but all the rest was OK (like the ones you see on the RDC).

So I will watch these 2 running here now along with the 2-core LHC task for the next 8 hours or so and see if it happens again.
ID: 5737
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5738 - Posted: 16 Dec 2018, 8:46:21 UTC

What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC?

In starter.log there is this line:
12/16/18 01:27:02 (pid:4371) '/usr/bin/singularity --version' did not exit successfully (code -759635680); the first line of output was ''.

Maybe they are doing an upgrade for Theory on -dev?
ID: 5738
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 5740 - Posted: 16 Dec 2018, 12:41:48 UTC - in response to Message 5738.  

What if the <rsc_disk_bound> is faulty on the server side and not in the user's BOINC?

The rsc_disk_bound is what it has been for years: 8,000,000,000 bytes.

A normal Theory task uses about 800MB to 1100MB in a slot directory,
so there must be something very strange going on.

It would be informative to check, with every EXIT_DISK_LIMIT_EXCEEDED error, whether this is also written
in BOINC's event log, because the abort should be initiated by BOINC and not by the VM or wrapper.

I've 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes.
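The kind of logging described here could be sketched like this (a minimal, hypothetical version; the slot path and threshold are assumptions taken from the post, not an actual tool the project ships):

```python
import os

# Sketch: walk a BOINC slot directory and report every file larger than
# a threshold (524,288 bytes, matching the value mentioned above).
# Point `slot_dir` at a slot in your own BOINC data directory.
THRESHOLD = 524_288  # bytes

def large_files(slot_dir, threshold=THRESHOLD):
    """Yield (path, size) for files under slot_dir bigger than threshold."""
    for root, _dirs, names in os.walk(slot_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished between listing and stat
            if size > threshold:
                yield path, size
```

Run periodically against each slot, this would show which files grow toward the disk bound.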
ID: 5740
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 5741 - Posted: 18 Dec 2018, 8:20:07 UTC - in response to Message 5740.  

I've 4 single-core Theory tasks from dev running now, logging the slots with files > 524,288 bytes.

I ran 2 x 4 tasks. Peak disk usage was between 838MB and 1266MB.
All were successful, without the exceeded-disk-limit error.
I can't reproduce this behavior. Maybe someone else can give more information about which files are growing extremely.
ID: 5741
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5742 - Posted: 18 Dec 2018, 9:10:16 UTC - in response to Message 5741.  

I have 7 Theory tasks, each with one CPU, running at the moment on one computer.
ID: 5742
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5744 - Posted: 18 Dec 2018, 11:07:07 UTC - in response to Message 5742.  
Last modified: 18 Dec 2018, 11:51:59 UTC

One of them is a sherpa 2.2.4 tree LCG_87:
Output of the job wrapper may appear here.
08:03:56 +0100 2018-12-18 [INFO] New Job Starting in slot1
08:03:56 +0100 2018-12-18 [INFO] Condor JobID: 483823.52 in slot1
08:04:01 +0100 2018-12-18 [INFO] MCPlots JobID: 47765377 in slot1
ATM 15 MByte after 4 hours of runtime!

Edit: 20 Min. later.... 2.4 GByte
Port 64722
boinc_7585e319e855c9da

Is more help needed?

Edit: 20 Min. later 3.6 GByte
ID: 5744
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5745 - Posted: 18 Dec 2018, 13:45:33 UTC - in response to Message 5741.  

(beam){
BEAM_1 = 2212; BEAM_ENERGY_1 = 3500.;
BEAM_2 = 2212; BEAM_ENERGY_2 = 3500.;
}(beam)

(processes){
Process 93 93 -> 93 93 93{2};
Order (*,0); Max_N_Quarks 4;
CKKW sqr(50/E_CMS);
Integration_Error 0.02 {4};
End process;
}(processes)

Now more than 6 GByte
ID: 5745
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 5746 - Posted: 18 Dec 2018, 14:42:07 UTC - in response to Message 5745.  
Last modified: 18 Dec 2018, 14:42:57 UTC


Now more than 6 GByte

Could you show the directory tree of all files and their sizes in the slot?
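A directory tree with per-file sizes, as asked for here, could be produced with a few lines of Python (a minimal sketch; the directory argument is an assumption, point it at the slot in your BOINC data directory):

```python
import os

# Sketch: print an indented tree of every file under `top` with its
# size, and return the grand total in bytes.
def print_tree(top):
    """Print directories and files under `top` with sizes; return total bytes."""
    total = 0
    for root, _dirs, files in os.walk(top):
        depth = root[len(top):].count(os.sep)
        print("  " * depth + os.path.basename(root) + "/")
        for name in sorted(files):
            path = os.path.join(root, name)
            size = os.path.getsize(path)
            total += size
            print("  " * (depth + 1) + f"{name}  {size:,} bytes")
    return total
```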
ID: 5746
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5747 - Posted: 18 Dec 2018, 14:45:52 UTC
Last modified: 18 Dec 2018, 14:48:24 UTC

It is the running.log from RDP.
Now 8.9 GByte.
The size of the vm_image is 5.762 GByte.
ID: 5747
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 5748 - Posted: 18 Dec 2018, 14:52:14 UTC - in response to Message 5747.  

When all the files in the slot and its sub-dirs exceed 7629.39MB, the task should crash :(
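The 7629.39MB figure follows directly from the 8,000,000,000-byte rsc_disk_bound quoted earlier in the thread, converted to binary megabytes:

```python
# 8,000,000,000 bytes expressed in binary megabytes (MiB, 1024*1024 bytes)
RSC_DISK_BOUND = 8_000_000_000           # bytes, value quoted above
limit_mib = RSC_DISK_BOUND / (1024 ** 2)
print(f"{limit_mib:.2f} MB")             # prints 7629.39 MB
```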
ID: 5748
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5749 - Posted: 18 Dec 2018, 15:16:54 UTC - in response to Message 5748.  
Last modified: 18 Dec 2018, 15:28:10 UTC

That's all we want ;-)
What about when an XP task with only 4 GByte of disk use is running?
Magic is one user with XP!

Sorry Crystal,
now 11 GByte. Will it be cancelled by... running out of disk?
ID: 5749
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,898,980
RAC: 5,725
Message 5750 - Posted: 18 Dec 2018, 16:51:27 UTC - in response to Message 5749.  

At 12 GByte the task stopped with the disk error:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2743403
P. Skands has some work to do now ;-)
ID: 5750
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 848,858
RAC: 1,746
Message 5751 - Posted: 18 Dec 2018, 17:54:40 UTC - in response to Message 5750.  

It's good to know that it's the application - the VM-job - causing the errors.

Probably it's always a Sherpa job.
ID: 5751


©2024 CERN