VERY long running tasks are finishing with errors
Author | Message |
---|---|
Joined: 31 Jul 22 Posts: 11 Credit: 5,126,392 RAC: 1,933 |
Recently, my Windows 10 computers have been receiving some Theory Simulation v5.5 tasks with extremely long run times of about 10 days. When I saw the estimated times as the tasks were first downloaded, I assumed they were just unusually large tasks, since I have occasionally seen tasks run for a couple of days and finish successfully, so I decided to let them run to completion. Three of these long tasks have now been completed and reported, and all three have been 'validated' as "Error while computing". Am I doing something wrong, or are these just bad tasks that should be aborted? The task logs also show that only about 1 hour of CPU time was recorded during those 10 days of run time. One of the tasks (the third link) was also completed and validated successfully by a different computer in about 7.5 minutes, so that definitely doesn't seem right. The other two have not been sent out to any other computers yet. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3242121 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3242108 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3235556 |
Joined: 8 Apr 15 Posts: 781 Credit: 12,324,905 RAC: 1,836 |
Have you ever checked the *running log* on these tasks once they have run for 18 hours or more? There you can see whether they are actually doing anything and, if not, abort them instead of wasting days. I have mentioned this many times over the years, but here is one from last month: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=628&postid=8133 Mad Scientist For Life |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
"Am I doing something wrong?" No. All 3 got stuck in an error and were shut down by vboxwrapper at the end of the configured 10 day timeout (watchdog). A normal shutdown should have happened within a few minutes after these lines: 2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 2023-09-20: cranky: [INFO] Container 'runc' finished with status code 1. 2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 2023-09-20: cranky: [INFO] Preparing output. "Container 'runc' finished with status code 1" points to an error deeper in the scripts but doesn't say what happened. <file_name>Theory_2636-1174706-0_0_r1893224513_result</file_name> at the end of the log shows it was a task from the faulty mcplots revision 2636, which was cancelled a few days ago (prod was also affected). Those tasks sometimes finish automatically after a BOINC restart. If not, cancel them manually. Tasks showing this in stderr.txt should also be cancelled: 2023-08-23 20:16:52 (23452): Guest Log: Probing /cvmfs/xxxxx.cern.ch... Failed! Since Theory downloads a couple of log functions from CVMFS, it depends on the failed repository whether they print additional information or not, but they usually fail sooner or later or get stuck. Be aware that ATM there's no development activity for Theory vbox. When tests are required, it is usually announced in this forum or on the -prod forum. |
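
For anyone who would rather not read the logs by eye, the checks described in this thread could be scripted roughly as follows. This is only a sketch, not a project tool: the two patterns are copied from the log lines quoted above, while the function names `find_stuck_markers` and `cpu_efficiency` are made up for illustration, and how you obtain the stderr.txt text (from the task page or the slot directory) is left to you.

```python
import re

# Patterns taken from the log excerpts quoted in this thread.
RUNC_STATUS = re.compile(r"Container 'runc' finished with status code (\d+)")
CVMFS_FAILED = re.compile(r"Probing (/cvmfs/\S+?)\.\.\. Failed!")


def find_stuck_markers(stderr_text):
    """Return a list of reasons why a task log looks like a bad task.

    Hypothetical helper: it only flags the two markers mentioned in the
    thread and does not know about any other failure mode.
    """
    reasons = []
    for line in stderr_text.splitlines():
        match = RUNC_STATUS.search(line)
        if match and match.group(1) != "0":
            reasons.append(f"runc exited with status {match.group(1)}")
        match = CVMFS_FAILED.search(line)
        if match:
            reasons.append(f"CVMFS probe failed for {match.group(1)}")
    return reasons


def cpu_efficiency(cpu_seconds, elapsed_seconds):
    """Fraction of the elapsed time that was actually spent on the CPU."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return cpu_seconds / elapsed_seconds


if __name__ == "__main__":
    sample = (
        "2023-09-20 03:29:17 (71556): Guest Log: 05:29:16 EDT -04:00 "
        "2023-09-20: cranky: [INFO] Container 'runc' finished with status code 1.\n"
        "2023-08-23 20:16:52 (23452): Guest Log: Probing /cvmfs/xxxxx.cern.ch... Failed!\n"
    )
    for reason in find_stuck_markers(sample):
        print(reason)
    # About 1 hour of CPU time over 10 days of run time, as in the first post.
    print(cpu_efficiency(3600, 10 * 24 * 3600))
```

A task that reports either marker, or whose CPU time is a tiny fraction of its run time, is a candidate for a BOINC restart or a manual abort as described above. Where to draw the line on the CPU fraction is a judgment call and is not something the project specifies.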
©2024 CERN