Message boards : CMS Application : Stageout failures
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 853
> Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?
It is possible, and in this case the three jobs finished in the same minute:
2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
2016-12-17 07:36:37 (1840): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)
2016-12-17 07:37:00 (4760): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)
I suppose you want to suggest that congestion could be the cause of exit code 151. I have a fiber connection with 70 Mbps upload speed, so I agree with Ivan's suggestion that the CMS job's root file was already present on the server. In Dashboard I also could not see my IP connected to the job I had, but I did see it with 2 others (1 failed and 1 still running), as far as Dashboard is reliable.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 8
> 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
If you can tell me the JobID or your time zone, I can look that up in the logs.
> I have a fiber connection
You are obviously not British then. :-)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> I suppose you want to suggest that congestion could be the cause of exit code 151.
No, I am not suggesting that. I just think either VirtualBox or the guest software cannot handle simultaneous uploads properly.
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 853
> 2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
Sorry, the job finished, I can't remember the JobID, and I don't have the logs saved. The above-mentioned time (2016-12-17 07:36:09) was CET, so 06:36:09 UTC.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 8
Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 853
> Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
Never mind. But I don't think the same number is used for the CRAB ID and the JobID.
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 853
> Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
I was wrong! jobNumber and CRAB ID are the same.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
What's interesting is that since I added the machines for the WMAgent jobs, the job failure rate has increased. As these VMs are dedicated and in the same data center as the storage, I would expect minimal failures. Some investigation is needed ...
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
Curious. Each of my hosts started with a 151 error (non-dev project) after the Condor server outage. Same as last time.
stderr.txt: 2016-12-21 14:44:55 (12245): Guest Log: [INFO] Job finished in slot1 with 151.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 8
Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
> Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.
Aahh... I think I'm starting to understand. The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the Condor problem. My host got it after the problem was solved but cannot upload the result, as the target is already prepared for the first host, right? And this is why we see a lot of 151 errors after a Condor problem. You explained that several times before, but my brain must have been blocked.
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 853
> The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the Condor problem.
My understanding is that the target is not only prepared/reserved but already occupied by a valid returned root file. I hope I'm correct.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed, so HTCondor times out the job and then resubmits it. This second invocation of the job will fail, as the output file already exists in the storage system. It is not trivial to find a solution, so we just need to make sure that we minimize how often the HTCondor server goes offline.
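For readers who prefer to see the failure mode as code, here is a minimal sketch of how a create-only storage target can turn a retried job into a 151 error. The function name, paths and the mapping of the refusal to exit code 151 are assumptions made for illustration only; this is not the actual CMS@Home stageout code.

```python
import os
import shutil
import sys

STAGEOUT_REFUSED = 151  # assumed mapping to the "Job finished in slot1 with 151" log line

def stage_out(local_file: str, dest_file: str) -> int:
    """Copy the job's output to storage that only allows creating new files."""
    if os.path.exists(dest_file):
        # A previous invocation already uploaded its result (e.g. during a
        # Condor outage) but never reported back, so this write is rejected.
        print(f"stageout refused: {dest_file} already exists", file=sys.stderr)
        return STAGEOUT_REFUSED
    shutil.copy(local_file, dest_file)
    return 0

if __name__ == "__main__":
    # Hypothetical paths, just to make the sketch runnable end to end.
    os.makedirs("/tmp/storage", exist_ok=True)
    open("output.root", "w").close()  # dummy local result file
    sys.exit(stage_out("output.root", "/tmp/storage/job_1234.root"))
```

Running it twice in a row reproduces the pattern seen in the thread: the first run exits 0, the second exits 151 because the destination file is already there.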
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Shouldn't there be a feature so that, if there is no connection to HTCondor, the result is held on the volunteer's computer until HTCondor is back (for a maximum of 12 h or so)?
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
> Shouldn't there be a feature so that, if there is no connection to HTCondor, the result is held on the volunteer's computer until HTCondor is back (for a maximum of 12 h or so)?
It should try to reconnect while the VM is still alive. It's kind of similar to the suspend/resume issue, but with the server. We have not tested this, though.
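As a thought experiment only, the "hold the result and retry" idea from the last two posts could look roughly like the sketch below. Both helper functions are hypothetical placeholders (no CMS@Home or HTCondor API is implied), and the 12-hour cap simply mirrors the suggestion above.

```python
import time

MAX_WAIT = 12 * 3600       # stop trying after ~12 h, per the suggestion above
RETRY_INTERVAL = 15 * 60   # retry every 15 minutes

def condor_reachable() -> bool:
    """Placeholder for a real connectivity check against the Condor server."""
    return False

def report_result(result: str) -> None:
    """Placeholder for the actual job-report / stageout step."""
    print(f"reported {result}")

def hold_and_report(result: str) -> bool:
    """Keep the finished result locally and retry until Condor answers or time runs out."""
    deadline = time.time() + MAX_WAIT
    while time.time() < deadline:
        if condor_reachable():
            report_result(result)
            return True
        time.sleep(RETRY_INTERVAL)
    return False  # the 18 h BOINC limit (or VM teardown) ends the task anyway
```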
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Could there not be an extension to the 18 h VM time in such a case? If HTCondor is down, it would not get any new tasks anyway.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 8
> To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed, so HTCondor times out the job and then resubmits it. This second invocation of the job will fail, as the output file already exists in the storage system.
To add my bit to this: I know I've said it before, but not everybody can read everything. When the second invocation fails, it also deletes the original file! So, the third try succeeds because there is no conflicting result file. You can see the "echo blip" from last night's failures on the activity graphs now. :-(
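Purely to illustrate the point about the third try succeeding, here is a toy model of the three invocations. The dictionary stands in for the storage system, the file name is hypothetical, and 151 is the same assumed failure code as in the earlier sketch.

```python
# Toy model of three invocations of one job against write-once storage.
def attempt(storage: dict, job_id: str, result: str) -> int:
    dest = f"job_{job_id}.root"   # hypothetical output file name
    if dest in storage:
        del storage[dest]         # the failing attempt also deletes the old file
        return 151                # stageout failure ("echo" error)
    storage[dest] = result
    return 0

storage = {}
print(attempt(storage, "1234", "first run"))   # 0   - uploads, but the report is lost in the outage
print(attempt(storage, "1234", "second run"))  # 151 - file already exists; fails and removes it
print(attempt(storage, "1234", "third run"))   # 0   - storage is clear again, so it succeeds
```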
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
> Could there not be an extension to the 18 h VM time in such a case?
It should continue trying until the VM gets terminated by the 18 h limit of the BOINC client. But we have to think about the trade-off: is it not better to lose 2 hours of processing and swap to another application than to idle for an unknown amount of time? We are now in the realm of optimizing the behaviour under a failure mode, and we should probably focus our efforts on providing a more reliable service in the first place.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
The CMS jobs graph shows a rising number of CRAB jobs, but the errors are staying the same (or even falling). That is good news. Has anything changed?