Message boards : Theory Application : VERY long running tasks are finishing with errors
Fardringle

Joined: 31 Jul 22
Posts: 11
Credit: 3,600,928
RAC: 9,815
Message 8199 - Posted: 30 Sep 2023, 22:39:23 UTC

Recently, my Windows 10 computers have been getting some Theory Simulation v5.5 tasks with extremely long run times of about 10 days. When I saw the estimated times right after the tasks were downloaded, I figured they were just extra big tasks. I have occasionally seen tasks that run for a couple of days and finish successfully, so I decided to let these run to completion.

Three of these long tasks have been completed and reported now, and all three of them have been 'validated' as "Error while computing". Am I doing something wrong, or are these just bad tasks that should be aborted? The task logs also say that there was only about 1 hour of CPU time recorded during those 10 days of processing time.

One of the tasks (the third link) says it was also completed and validated successfully by a different computer in about 7.5 minutes, so that definitely doesn't seem right. The other two have not been sent out to any other computers yet.


https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3242121

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3242108

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3235556
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 753
Credit: 11,734,345
RAC: 9,133
Message 8200 - Posted: 1 Oct 2023, 1:27:08 UTC - in response to Message 8199.  
Last modified: 1 Oct 2023, 1:28:02 UTC

Have you ever checked the *running log* on these tasks once they have run for 18 hours or more?
There you can see whether they are actually doing anything and, if not, abort them instead of wasting days.
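As a rough illustration of that check, here is a minimal Python sketch that flags a task as probably stalled when it has been running for a long time but has accumulated almost no CPU time (the 10-day tasks above recorded only about 1 hour of CPU). The function name `looks_stalled` and the 5% ratio threshold are assumptions made for this example; only the 18-hour mark comes from this post. It is no substitute for actually reading the running log.

```python
def looks_stalled(cpu_seconds: float,
                  elapsed_seconds: float,
                  min_elapsed: float = 18 * 3600,   # 18 hours, as suggested above
                  min_ratio: float = 0.05) -> bool:  # assumed threshold, tune to taste
    """Return True if a task has run long enough to judge and used very little CPU."""
    if elapsed_seconds < min_elapsed:
        # Too early to tell; young tasks are never flagged.
        return False
    return (cpu_seconds / elapsed_seconds) < min_ratio


# About 1 hour of CPU time over 10 days of wall-clock time.
print(looks_stalled(cpu_seconds=3600, elapsed_seconds=10 * 86400))
```

A flagged task is only a candidate for aborting; the log is still the place to confirm that nothing is happening.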

I have mentioned this many times over the years; here is one post from last month:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=628&postid=8133
Mad Scientist For Life
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 478
Credit: 394,720
RAC: 318
Message 8201 - Posted: 1 Oct 2023, 8:17:35 UTC - in response to Message 8199.  

Am I doing something wrong

No.


All three hit an error, got stuck, and were shut down by the vboxwrapper watchdog at the end of the configured 10-day timeout.

A normal shutdown should have happened within a few minutes after these lines:
2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 2023-09-20: cranky: [INFO] Container 'runc' finished with status code 1.
2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 2023-09-20: cranky: [INFO] Preparing output.

"Container 'runc' finished with status code 1" points to an error deeper in the scripts but doesn't say what happened.
<file_name>Theory_2636-1174706-0_0_r1893224513_result</file_name> at the end of the log shows it was a task from the faulty mcplots revision 2636, which was cancelled a few days ago (prod was also affected).
Those tasks sometimes finish automatically after a BOINC restart.
If not, cancel them manually.


Tasks showing this in stderr.txt should also be cancelled:
2023-08-23 20:16:52 (23452): Guest Log: Probing /cvmfs/xxxxx.cern.ch... Failed!

Theory downloads a couple of log functions from CVMFS, so whether these tasks print additional information depends on which repository failed. Either way, they usually fail sooner or later or get stuck.
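For anyone who wants to check a saved stderr.txt for the two markers quoted above, a minimal Python sketch could look like this. The function name `find_problems` and the regular expressions are assumptions written for this example, derived only from the log lines shown in this post. It is not part of BOINC, vboxwrapper, or the Theory scripts.

```python
import re

# Patterns derived from the log lines quoted in this thread (assumed to be stable).
RUNC_FINISHED = re.compile(r"Container 'runc' finished with status code (\d+)")
CVMFS_PROBE_FAILED = re.compile(r"Probing (/cvmfs/\S+?)\.\.\. Failed!")


def find_problems(log_text: str) -> list[str]:
    """Return a human-readable note for each known trouble marker in the log."""
    problems = []
    for line in log_text.splitlines():
        match = RUNC_FINISHED.search(line)
        if match and match.group(1) != "0":
            problems.append(f"runc exited with status {match.group(1)}")
        match = CVMFS_PROBE_FAILED.search(line)
        if match:
            problems.append(f"CVMFS probe failed for {match.group(1)}")
    return problems


sample = (
    "2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 2023-09-20: "
    "cranky: [INFO] Container 'runc' finished with status code 1.\n"
    "2023-08-23 20:16:52 (23452): Guest Log: Probing /cvmfs/xxxxx.cern.ch... Failed!\n"
)
for note in find_problems(sample):
    print(note)
```

A task whose log produces either note is a candidate for a BOINC restart or a manual cancel, as described above.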


Be aware that at the moment there is no development activity for Theory vbox.
When tests are required, they are usually announced in this forum or in the -prod forum.



©2024 CERN