Message boards : CMS Application : Stageout failures

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4510 - Posted: 17 Dec 2016, 13:15:29 UTC - in response to Message 4508.  

Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 4511 - Posted: 17 Dec 2016, 17:34:06 UTC - in response to Message 4510.  

Is it possible that either of the two uploaded at the same time as the CMS task (within a couple of minutes or so)?

It is possible, and in this case the three jobs finished within a minute of each other:

2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)
2016-12-17 07:36:37 (1840): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)
2016-12-17 07:37:00 (4760): Guest Log: [INFO] Job finished in slot1 with 0. (Theory LHC@home project)


I suppose you want to suggest that congestion could be the cause of the exit code 151.
I have a fiber connection with 70 Mbps upload speed, so I agree with Ivan's suggestion that the CMS job's rootfile was already present on the server.
On the dashboard I also could not see my IP connection for the job I had, only for 2 others (1 failed and 1 still running), as far as the dashboard is reliable.
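
For anyone who wants to pull the non-zero exit codes out of a task log, here is a minimal sketch. It assumes the "Job finished" lines always look like the three quoted above; the regular expression and the function names are illustrative, not part of any BOINC or CMS tool.

```python
import re
from datetime import datetime

# Pattern modelled on the log lines quoted in this thread (an assumption about the format).
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"Guest Log: \[INFO\] Job finished in (?P<slot>\w+) with (?P<code>\d+)\."
)

def parse_finished(line):
    """Return (timestamp, pid, slot, exit_code), or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
    return ts, int(m.group("pid")), m.group("slot"), int(m.group("code"))

def failed_jobs(lines):
    """Keep only the finished-job entries whose exit code is non-zero."""
    parsed = (parse_finished(line) for line in lines)
    return [p for p in parsed if p is not None and p[3] != 0]

log = [
    "2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151.",
    "2016-12-17 07:36:37 (1840): Guest Log: [INFO] Job finished in slot1 with 0.",
    "2016-12-17 07:37:00 (4760): Guest Log: [INFO] Job finished in slot1 with 0.",
]
for ts, pid, slot, code in failed_jobs(log):
    print(ts, pid, slot, code)
```

Because the pattern is applied with `search`, it also matches lines that carry a prefix such as `stderr.txt:`.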
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4512 - Posted: 17 Dec 2016, 17:40:52 UTC - in response to Message 4511.  

2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)

If you can tell me the JobID or your time-zone, I can look that up in the logs.

I have a fiber connection
You are obviously not British then. :-)
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4513 - Posted: 17 Dec 2016, 19:53:46 UTC - in response to Message 4511.  
Last modified: 17 Dec 2016, 19:54:21 UTC

I suppose you want to suggest that congestion could be the cause of the exit code 151.


No, I am not suggesting that.

I just think that either vbox or the guest software cannot handle simultaneous uploads properly.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 4514 - Posted: 17 Dec 2016, 22:05:02 UTC - in response to Message 4512.  

2016-12-17 07:36:09 (4156): Guest Log: [INFO] Job finished in slot1 with 151. (CMS vLHCathome-dev project)

If you can tell me the JobID or your time-zone, I can look that up in the logs.

Sorry, the job finished and I can't remember the JobID and don't have the logs saved.

The above-mentioned time (2016-12-17 07:36:09) was CET, so 06:36:09 UTC.
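
The conversion can be checked with a short sketch. It assumes a fixed UTC+1 offset for CET, which holds in December but ignores daylight saving time; the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed winter offset for CET (UTC+1); this deliberately ignores daylight saving time.
CET = timezone(timedelta(hours=1), "CET")

def cet_to_utc(stamp):
    """Interpret a 'YYYY-MM-DD HH:MM:SS' log timestamp as CET and return it in UTC."""
    local = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=CET)
    return local.astimezone(timezone.utc)

print(cet_to_utc("2016-12-17 07:36:09"))  # 2016-12-17 06:36:09+00:00
```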
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4515 - Posted: 18 Dec 2016, 9:48:20 UTC - in response to Message 4514.  

Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 4516 - Posted: 18 Dec 2016, 11:21:22 UTC - in response to Message 4515.  

Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.

Never mind.

But I don't think the same number is used for the CRAB ID and the JobID.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 4517 - Posted: 18 Dec 2016, 15:53:28 UTC - in response to Message 4516.  

Hmm, couldn't find it. Your task log suggests it was JobID 6490, but that ID doesn't match your time.

Never mind.

But I don't think the same number is used for the CRAB ID and the JobID.

I was wrong!

jobNumber and CRAB ID are the same.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 158
Message 4518 - Posted: 19 Dec 2016, 8:47:36 UTC - in response to Message 4503.  

What's interesting is that since I added the machines for the WMAgent jobs, the job failure rate has increased. As these VMs are dedicated and in the same data center as the storage, I would expect minimal failures. Some investigation is needed ...
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 4569 - Posted: 21 Dec 2016, 15:13:12 UTC

Curious.
Each of my hosts started with a 151 error (non-dev project) after the condor server outage.
Same as last time.

stderr.txt:2016-12-21 14:44:55 (12245): Guest Log: [INFO] Job finished in slot1 with 151.

stderr.txt:2016-12-21 15:04:47 (24002): Guest Log: [INFO] Job finished in slot1 with 151.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4570 - Posted: 21 Dec 2016, 16:48:33 UTC - in response to Message 4569.  

Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 34
Message 4571 - Posted: 21 Dec 2016, 18:56:30 UTC - in response to Message 4570.  

Yes, that will be the "echo" effect of a retry job finding the existing result file from a job that couldn't report back during the Condor disturbance.

Aahh...
I think I'm starting to understand.
The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the condor problem.
My host got it after the problem was solved but cannot upload the result as the target is already prepared for the first host, right?

And this is why we see a lot of 151 errors after a condor problem.

You explained that several times before but my brain must have been blocked.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 4572 - Posted: 21 Dec 2016, 20:36:40 UTC - in response to Message 4571.  

The CRAB job (or whatever else, but NOT the WU itself) was processed by another host before or during the condor problem.
My host got it after the problem was solved but cannot upload the result as the target is already prepared for the first host, right?

I would understand that the target is not only prepared/reserved but already occupied by a returned valid root-file. I hope I'm correct.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 158
Message 4573 - Posted: 21 Dec 2016, 20:53:06 UTC - in response to Message 4572.  
Last modified: 21 Dec 2016, 20:54:12 UTC

To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed, so HTCondor times out the job and then resubmits it. This second invocation of the job will fail as the output file already exists in the storage system.

It is not trivial to find a solution, so we just need to minimize the number of times the HTCondor server goes offline.
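
As a rough illustration of the write-once rule, here is a minimal sketch that uses a local directory as a stand-in for the storage system. The function and file names are hypothetical, and the real stage-out tooling is not shown in this thread; the point is only that a second write to the same name is refused.

```python
import os
import tempfile

def stage_out(storage_dir, name, payload):
    """Write payload as a new file; refuse to overwrite an existing one.

    Returns True if the file was created, False if it already existed.
    """
    path = os.path.join(storage_dir, name)
    try:
        # Mode "x" is exclusive creation: open() raises FileExistsError
        # if the target is already there.
        with open(path, "xb") as fh:
            fh.write(payload)
    except FileExistsError:
        return False
    return True

with tempfile.TemporaryDirectory() as storage:
    # First run uploads its output but never reports back to the scheduler.
    first = stage_out(storage, "job_output.root", b"result from first run")
    # The resubmitted run produces the same output name and is rejected.
    retry = stage_out(storage, "job_output.root", b"result from resubmitted run")
    print(first, retry)  # True False
```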
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4574 - Posted: 21 Dec 2016, 22:17:54 UTC - in response to Message 4573.  

Should there not be a feature so that, if there is no connection to HTCondor, the result is held on the volunteer's computer until HTCondor is back (max 12h or so)?
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 158
Message 4575 - Posted: 21 Dec 2016, 22:25:24 UTC - in response to Message 4574.  

Should there not be a feature so that, if there is no connection to HTCondor, the result is held on the volunteer's computer until HTCondor is back (max 12h or so)?


It should try to reconnect while the VM is still alive. It's kind of similar to the suspend/resume issue but with the server. We have not tested this though.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4576 - Posted: 21 Dec 2016, 22:30:34 UTC - in response to Message 4575.  

Could the 18h VM time not be extended in such a case?

If HTCondor is down, it would not get any new tasks anyway.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 4577 - Posted: 21 Dec 2016, 22:58:21 UTC - in response to Message 4573.  

To clarify, the access control for the storage system allows a volunteer to write a new file but not to overwrite or delete it. When the HTCondor server goes offline, jobs run to completion and the output file is uploaded to the storage system, but this is not reported back to the HTCondor server. When the HTCondor server comes back online, the VM has probably been destroyed, so HTCondor times out the job and then resubmits it. This second invocation of the job will fail as the output file already exists in the storage system.

It is not trivial to find a solution, so we just need to minimize the number of times the HTCondor server goes offline.


To add my bit to this -- I know I've said it before but not everybody can read everything; when the second invocation fails it also deletes the original file! So, the third try succeeds because there is no conflicting result file. You can see the "echo blip" from last night's failures on the activity graphs now. :-(
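
The sequence can be sketched as a toy simulation. It models only the observed behaviour (a conflicting attempt fails and the old file ends up removed); how the removal actually happens is not described in this thread, and all names here are hypothetical.

```python
def run_attempt(storage, output_name):
    """Simulate one job invocation against the storage system.

    Returns True on success. On a conflict the attempt fails and, per the
    behaviour described in this thread, the existing file ends up removed.
    """
    if output_name in storage:
        storage.discard(output_name)  # the failed attempt leaves no result file behind
        return False
    storage.add(output_name)
    return True

# The first run uploaded its file while HTCondor was offline, so it was never reported.
storage = {"out.root"}

# HTCondor resubmits: the second invocation fails, the third succeeds.
results = [run_attempt(storage, "out.root") for _ in range(2)]
print(results)  # [False, True]
```

The failed middle attempt is the "echo" that shows up on the activity graphs some time after the original outage.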
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 158
Message 4578 - Posted: 21 Dec 2016, 23:39:36 UTC - in response to Message 4576.  

Could the 18h VM time not be extended in such a case?

If HTCondor is down, it would not get any new tasks anyway.


It should continue trying until the VM gets terminated by the 18h limit of the BOINC client. But we have to think about the trade-off: is it not better to lose 2 hours of processing and swap to another application than to idle for an unknown amount of time? We are now in the realm of optimizing the behaviour for a failure mode, and we should probably focus our efforts on providing a more reliable service in the first place.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4580 - Posted: 22 Dec 2016, 15:01:53 UTC
Last modified: 22 Dec 2016, 15:03:27 UTC

The CMS jobs graph shows a rising number of CRAB jobs, but the errors are staying the same (or even falling).
That is good news.

Has anything changed?
©2024 CERN