Stageout failures

Author	Message
Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4510 - Posted: 17 Dec 2016, 13:15:29 UTC - in response to Message 4508.
Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238	Message 4511 - Posted: 17 Dec 2016, 17:34:06 UTC - in response to Message 4510.
> Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?
It is possible, and in this case the three jobs finished within a minute of each other:
2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
2016-12-17 07:36:37 (1840): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)
2016-12-17 07:37:00 (4760): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)
I suppose you want to suggest that congestion could be the cause of the exit code 151. I have a fiber connection with 70 Mbps upload speed, so I agree with Ivan's suggestion that the CMS job's root file was already present on the server. In the dashboard I also could not find my IP connection against the job I had, only 2 others (1 failed and 1 still running), as far as the dashboard is reliable.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266	Message 4512 - Posted: 17 Dec 2016, 17:40:52 UTC - in response to Message 4511.
> 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
If you can tell me the JobID or your time zone, I can look that up in the logs.
> I have a fiber connection
You are obviously not British then. :-)

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4513 - Posted: 17 Dec 2016, 19:53:46 UTC - in response to Message 4511. Last modified: 17 Dec 2016, 19:54:21 UTC
> I suppose you want to suggest that congestion could be the cause of the exit code 151.
No, I am not suggesting that. I just think that either VirtualBox or the guest software cannot handle simultaneous uploads properly.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238	Message 4514 - Posted: 17 Dec 2016, 22:05:02 UTC - in response to Message 4512.
>> 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
> If you can tell me the JobID or your time zone, I can look that up in the logs.
Sorry, the job finished and I can't remember the JobID and don't have the logs saved. The above-mentioned time (2016-12-17 07:36:09) was CET, so 06:36:09 UTC.
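
The CET-to-UTC step in the post above can be checked with a few lines of Python. This is only a sketch: the helper name `cet_to_utc` is made up for illustration, and it assumes the log timestamp is in winter-time CET (UTC+1), which holds for 17 December.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the log timestamp is winter-time CET, i.e. a fixed UTC+1 offset.
CET = timezone(timedelta(hours=1), name="CET")


def cet_to_utc(stamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' log timestamp recorded in CET and return it in UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=CET)
    return local.astimezone(timezone.utc)


# Timestamp of the job that finished with 151, as quoted in the thread.
print(cet_to_utc("2016-12-17 07:36:09").isoformat())  # 2016-12-17T06:36:09+00:00
```

A fixed offset is used instead of a named time zone so the snippet does not depend on a time zone database being installed; it would be wrong for summer-time (CEST) timestamps.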

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266	Message 4515 - Posted: 18 Dec 2016, 9:48:20 UTC - in response to Message 4514.
Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238	Message 4516 - Posted: 18 Dec 2016, 11:21:22 UTC - in response to Message 4515.
> Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
Never mind. But I don't think the same number is used for the CRAB ID and the JobID.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238	Message 4517 - Posted: 18 Dec 2016, 15:53:28 UTC - in response to Message 4516.
>> Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
> Never mind. But I don't think the same number is used for the CRAB ID and the JobID.
I was wrong! jobNumber and CRAB ID are the same.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158	Message 4518 - Posted: 19 Dec 2016, 8:47:36 UTC - in response to Message 4503.
What's interesting is that since I added the machines for the WMAgent jobs, the job failure rate has increased. As these VMs are dedicated and in the same data center as the storage, I would expect minimal failures. Some investigation is needed ...

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 34	Message 4569 - Posted: 21 Dec 2016, 15:13:12 UTC
Curious. Each of my hosts started with a 151 error (non-dev project) after the Condor server outage. Same as last time.
stderr.txt:2016-12-21 14:44:55 (12245): Guest Log: [INFO] Job finished in slot1 with 151.
stderr.txt:2016-12-21 15:04:47 (24002): Guest Log: [INFO] Job finished in slot1 with 151.
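
For anyone who wants to pull these lines out of their own stderr.txt, here is a rough Python sketch. The regular expression is written only against the "Job finished" lines quoted in this thread; the function name `finished_jobs` and the sample list are illustrative, and other wrapper versions may log a different format.

```python
import re

# Pattern derived from the "Job finished" lines quoted in this thread (an assumption
# about the general format; the number in parentheses is matched but not interpreted).
FINISHED = re.compile(
    r"(?P<stamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[INFO\] Job finished in (?P<slot>slot\d+) with (?P<code>\d+)\."
)


def finished_jobs(lines):
    """Yield (timestamp, slot, exit_code) for every 'Job finished' line found."""
    for line in lines:
        match = FINISHED.search(line)
        if match:
            yield match.group("stamp"), match.group("slot"), int(match.group("code"))


# Sample lines copied from posts in this thread.
sample = [
    "stderr.txt:2016-12-21 14:44:55 (12245): Guest Log: [INFO] Job finished in slot1 with 151.",
    "stderr.txt:2016-12-21 15:04:47 (24002): Guest Log: [INFO] Job finished in slot1 with 151.",
    "2016-12-17 07:36:37 (1840): Guest Log: [INFO] Job finished in slot1 with 0.",
]

# Keep only the jobs that ended with exit code 151.
failed = [job for job in finished_jobs(sample) if job[2] == 151]
print(failed)
```

Using `search` rather than `match` lets the same pattern work whether or not the line carries a `stderr.txt:` prefix from grep.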

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266	Message 4570 - Posted: 21 Dec 2016, 16:48:33 UTC - in response to Message 4569.
Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 34	Message 4571 - Posted: 21 Dec 2016, 18:56:30 UTC - in response to Message 4570.
> Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.
Aahh... I think I'm starting to understand. The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the Condor problem. My host got it after the problem was solved but cannot upload the result as the target is already prepared for the first host, right? And this is why we see a lot of 151 errors after a Condor problem. You explained that several times before, but my brain must have been blocked.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238	Message 4572 - Posted: 21 Dec 2016, 20:36:40 UTC - in response to Message 4571.
> The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the Condor problem. My host got it after the problem was solved but cannot upload the result as the target is already prepared for the first host, right?
As I understand it, the target is not only prepared/reserved but already occupied by a returned valid root file. I hope I'm correct.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158	Message 4573 - Posted: 21 Dec 2016, 20:53:06 UTC - in response to Message 4572. Last modified: 21 Dec 2016, 20:54:12 UTC
To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed; HTCondor times out the job and then resubmits it. This second invocation of the job will fail as the output file already exists in the storage system. It is not trivial to find a solution, so we just need to make sure that we minimize the times the HTCondor server goes offline.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4574 - Posted: 21 Dec 2016, 22:17:54 UTC - in response to Message 4573.
Should there not be a feature such that, if there is no connection to HTCondor, the result is held by the volunteer's computer until HTCondor is back (max 12h or so)?

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158	Message 4575 - Posted: 21 Dec 2016, 22:25:24 UTC - in response to Message 4574.
> Should there not be a feature such that, if there is no connection to HTCondor, the result is held by the volunteer's computer until HTCondor is back (max 12h or so)?
It should try to reconnect while the VM is still alive. It's kind of similar to the suspend/resume issue but with the server. We have not tested this though.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4576 - Posted: 21 Dec 2016, 22:30:34 UTC - in response to Message 4575.
Could there not be an extension to the 18h VM time in such a case? If HTCondor is down, it would not get any new tasks anyway.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266	Message 4577 - Posted: 21 Dec 2016, 22:58:21 UTC - in response to Message 4573.
> To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed; HTCondor times out the job and then resubmits it. This second invocation of the job will fail as the output file already exists in the storage system. It is not trivial to find a solution, so we just need to make sure that we minimize the times the HTCondor server goes offline.
To add my bit to this (I know I've said it before, but not everybody can read everything): when the second invocation fails, it also deletes the original file! So the third try succeeds because there is no conflicting result file. You can see the "echo blip" from last night's failures on the activity graphs now. :-(
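
The sequence described in the last few posts (upload during the outage, failed retry, successful third attempt) can be mimicked with a toy model. This is a simplified sketch, not the project's actual stage-out code: `WriteOnceStore`, `run_job` and the file name `output_0001.root` are invented for illustration, and the cleanup step stands in for whatever privileged mechanism removes the file after the failed retry.

```python
STAGEOUT_FAILED = 151  # exit code reported in the logs quoted in this thread


class WriteOnceStore:
    """Toy storage: a job may create a new file but may not overwrite an existing one."""

    def __init__(self):
        self.files = {}

    def upload(self, name, data):
        if name in self.files:
            raise FileExistsError(name)
        self.files[name] = data

    def cleanup(self, name):
        # Assumption: a privileged cleanup removes the file after a failed stage-out.
        self.files.pop(name, None)


def run_job(store, output_name, data="result"):
    """Return 0 if the output was uploaded, 151 if the file already existed."""
    try:
        store.upload(output_name, data)
        return 0
    except FileExistsError:
        store.cleanup(output_name)
        return STAGEOUT_FAILED


store = WriteOnceStore()
# Attempt 1 uploads its file while HTCondor is offline, so the success is never recorded.
first = run_job(store, "output_0001.root")
# Attempt 2 is the resubmitted job; it finds the existing file and fails.
second = run_job(store, "output_0001.root")
# Attempt 3 succeeds because the conflicting file has been removed.
third = run_job(store, "output_0001.root")
print(first, second, third)  # 0 151 0
```

The model makes the "echo" visible: every job that finished unreported during an outage costs one extra failed run before a retry can succeed.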

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158	Message 4578 - Posted: 21 Dec 2016, 23:39:36 UTC - in response to Message 4576.
> Could there not be an extension to the 18h VM time in such a case? If HTCondor is down, it would not get any new tasks anyway.
It should continue trying until the VM gets terminated by the 18h limit of the BOINC client. But we have to think about the trade-off. Is it not better to lose 2 hours of processing and swap to another application rather than to idle for an unknown amount of time? We are now in the realm of optimizing the behaviour for a failure mode, and we should probably focus our efforts on providing a more reliable service in the first place.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4580 - Posted: 22 Dec 2016, 15:01:53 UTC Last modified: 22 Dec 2016, 15:03:27 UTC
The CMS jobs graph shows a rising number of CRAB jobs, but the errors are staying the same (or even falling). That is good news. Has anything changed?

Development for LHC@home