Message boards : Number crunching : Expect errors eventually
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

OK, it finished, on a host with a current IP different from the original one. So no obvious change in behaviour yet.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

I had job 2078 fail to start. No mention of it on the Dashboard. The job appears to have been done by another computer.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> I had job 2078 fail to start.

2078 appears to have completed successfully at 0420 this morning, in Italy. Dashboard shows an Italian IP address and compatible timing.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

I had the logs. Unfortunately I started a new CMS task, so the logs are gone. Next time I will make a copy, but I did have that job attempting to start. It puzzles me that there was no record of this on the server. This task attempted to start at 02:06 UTC.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> I had the logs. [...] This task attempted to start at 2.06 UTC.

Strange, that's the start time in the Italian job. However, it didn't start processing the first event until 0310.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Job 3051 failed to start. Start time 16:30 local (15:30 UTC). Nothing about it on the Dashboard. I can send the logs if you want. The next job started 6 minutes later.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> Job failed to start (3051). [...] Next job started 6 minutes later.

Hmm, the logs -- and Dashboard -- have it starting at 1536 on a Norwegian host. The IP is for a Norwegian ISP. Please send the logs, to my Uni address if you have it.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

> Please send the logs, to my Uni address if you have it.

No, I don't. Please send a PM.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Does the project run dummy jobs to test performance? I have a job that is running (cms-run close to 100%) but no cms-run files are being produced. The condor_stderr lists things about performance.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> Does the project run dummy jobs to test performance?

Not that I'm aware of. Laurence? Can you capture the Condor logs in your browser and send them to me?
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Let me know if you need any logs. I finally got out of it by suspending the task for 25 minutes. If I had not done that, it would presumably have run until the end of the 24-hour limit.
Joined: 20 May 15 · Posts: 217 · Credit: 5,968,227 · RAC: 10,233

I had this at the start of my first vLHC CMS task. It ran 2 sets of 'jobs' that took 38 and 35 minutes to complete, which looks like performance testing. They both ran something called:

    wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930

After these 2 it then went on to processing real events. Another 'aborted' with no events at 06:22am (CET). It did 4 more successful jobs, and since then it has been creating Run folders every 7 minutes but no events are being processed, just 4 files being created in the folder:

    MasterLog    29-Jan-2016 08:43  2.1K
    ProcLog      29-Jan-2016 08:43  12K
    StartdLog    29-Jan-2016 08:43  5.1K
    StarterLog   29-Jan-2016 08:38  620

Looks like something has gone AWOL.

Edit: Looks like there are no jobs to run. Ivan, I know you are busy, just putting this here so it is available. If anyone wants the full logs for the test runs I have them.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Nearly all connections to the server are failing. Lots and lots of errors on the Dashboard. Maybe it was not such a good idea to link this batch to vLHC?
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> Nearly all connections to the server are failing.

Don't know yet. Something fell on its ear about 0225 this morning -- Condor isn't putting jobs into the queue, even though it has a backlog of ~4,000 still to schedule. Waiting to hear back from the experts. I don't think there's much else I can do at the moment.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Thanks, Ivan.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

Experts reported: job submission is failing because "ERROR: proxy has expired". WTF? The proxy should have been seven days, not just three, and I'm sure I set my local proxy to 8 days before submitting the batch. ??? Anyway, I generated a new local 8-day proxy and copied it to the Condor server at RAL. Jobs are starting to be queued and run again now; keeping a close eye on the situation...
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

It resumed fine, thanks. However, I had 3 jobs fail (post-processing failed) in a row from about 02:18 UTC until the interruption. Jobs: 6185, 6603 and 6965. I have never had 3 fail in a row before. EDIT: never mind, all due to the outage, I guess.
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

Things aren't looking good for a Friday night. I've just harvested the status reports from the Condor logs for the current batch:

    [eesridr:src] > wc 160126.usersort
       8790  149430 1704191 160126.usersort
    [eesridr:src] > grep 'status 0' 160126.usersort|wc
       8028  136476 1555320
    [eesridr:src] > grep 'status 151' 160126.usersort|wc
        578    9826  112990

(that leaves 184 non-151 errors) That doesn't tie in well with the current Dashboard numbers, but we all know that Dashboard is not a good reporter of the instantaneous position. I can pretty well tell if an entry is from CMS-dev or vLHC from the UserID; the UserIDs > 352 are tending to dominate. As expected from my simplistic explanation, several have 5 or more hosts reporting "151"s (file-transfer failures for the latecomers...).
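The per-status counts above can also be gathered in a single pass rather than one `grep | wc` per code. A hedged sketch: the sample file below is made up for illustration; the real input would be the harvested `160126.usersort`, assumed to carry a `status N` field on each line:

```shell
#!/bin/sh
# Build a tiny stand-in for the harvested report (illustrative data only).
cat > sample.usersort <<'EOF'
job 101 status 0
job 102 status 151
job 103 status 0
job 104 status 84
EOF

# One pass over the file: count lines per exit status.
# match() sets RSTART/RLENGTH; skip the 7 chars of "status " to get the code.
awk 'match($0, /status [0-9]+/) {
         n[substr($0, RSTART + 7, RLENGTH - 7)]++
     }
     END { for (s in n) print "status", s, ":", n[s] }' sample.usersort |
    sort > status_counts.txt
cat status_counts.txt
```

On the sample data this reports two `status 0` lines and one each of `status 151` and `status 84`, matching what the repeated greps would give.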
Joined: 20 May 15 · Posts: 217 · Credit: 5,968,227 · RAC: 10,233

Unless the volunteer digs into the logs to find the failure(s), or receives an email informing them of such, they will not know :-(
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,973,351 · RAC: 2,301

> Things aren't looking good for a Friday night.

No, the problem's probably not in Userland. An hour or so later the counts are:

    [eesridr:src] > wc 160126.usersort
       9050  153850 1755070 160126.usersort
    [eesridr:src] > grep 'status 0' 160126.usersort|wc
       8056  136952 1560748
    [eesridr:src] > grep 'status 151' 160126.usersort|wc
        796   13532  155703

What's even more depressing is my own counts:

    [eesridr:src] > grep '32157 ' 16126.report
    32157 79553 0 66
    32157 79553 151 8

I'm on a 1 Gbps link at work, so I'm not saturating the upload channel! So there's a systemic problem. I'll not submit a new batch once this one ends, until we can work out the cause of failure.
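For scale, the per-user counts just above (66 jobs with status 0, 8 with status 151) put this one well-connected host at roughly an 11% failure rate. A quick sanity check, using only the two numbers quoted in the post:

```shell
#!/bin/sh
# Failure rate from the per-user counts above: 66 "status 0", 8 "status 151".
awk 'BEGIN { ok = 66; fail = 8; printf "%.1f%% failed\n", 100 * fail / (ok + fail) }'
# prints "10.8% failed"
```

That a fast host on a 1 Gbps link still loses ~1 job in 9 to transfer failures is what points at a server-side problem rather than volunteer bandwidth.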