Expect errors eventually

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1590 - Posted: 11 Jan 2016, 1:06:01 UTC - in response to Message 1589. OK, it finished, on a host with a current IP different from the original one. So no obvious change in behaviour yet.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1591 - Posted: 11 Jan 2016, 9:12:58 UTC I had job 2078 fail to start. No mention of it on the dashboard. The job appears to have been done by another computer.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1592 - Posted: 11 Jan 2016, 16:39:10 UTC - in response to Message 1591.
> I had job 2078 fail to start. No mention of it on the dashboard. The job appears to have been done by another computer.
2078 appears to have completed successfully at 0420 this morning, in Italy. Dashboard shows an Italian IP address and compatible timing.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1593 - Posted: 11 Jan 2016, 17:11:31 UTC - in response to Message 1592. I had the logs. Unfortunately I started a new CMS task, so the logs are gone. Next time I will make a copy, but I did have that job attempting to start. It puzzles me that there was no record of this on the server. This task attempted to start at 2.06 UTC.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1594 - Posted: 11 Jan 2016, 18:11:30 UTC - in response to Message 1593.
> I had the logs. Unfortunately I started a new CMS task, so the logs are gone. Next time I will make a copy, but I did have that job attempting to start. It puzzles me that there was no record of this on the server. This task attempted to start at 2.06 UTC.
Strange, that's the start time in the Italian job. However, it didn't start processing the first event until 0310.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1595 - Posted: 12 Jan 2016, 15:58:00 UTC Job failed to start (3051). Start time 16.30 local (15:30 UTC). Nothing about it in dashboard. I can send the logs, if you want. Next job started 6 minutes later.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1596 - Posted: 12 Jan 2016, 20:53:59 UTC - in response to Message 1595.
> Job failed to start (3051). Start time 16.30 local (15:30 UTC). Nothing about it in dashboard. I can send the logs, if you want.
Hmm, the logs -- and Dashboard -- have it starting at 1536 on a Norwegian host. IP is for a Norwegian ISP. Please send logs, to Uni address if you have it.
> Next job started 6 minutes later.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1597 - Posted: 12 Jan 2016, 21:06:29 UTC
> Please send logs, to Uni address if you have it.
No, I don't. Please send PM.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1694 - Posted: 28 Jan 2016, 23:02:10 UTC Does the project run dummy jobs to test performance? I have a job that is running (cms-run close to 100%), but no cms-run files are produced. The condor_stderr lists things about performance.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1695 - Posted: 29 Jan 2016, 0:31:55 UTC - in response to Message 1694.
> Does the project run dummy jobs to test performance? I have a job that is running (cms-run close to 100%), but no cms-run files are produced. The condor_stderr lists things about performance.
Not that I'm aware of. Laurence? Can you capture the condor logs in your browser and send them to me?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1698 - Posted: 29 Jan 2016, 0:50:48 UTC - in response to Message 1695. Let me know if you need any logs. I finally got out of it by suspending the task for 25 min. If I had not done that, it would presumably have run until the end of the 24 h limit.

PDW Joined: 20 May 15 Posts: 217 Credit: 5,968,227 RAC: 10,233	Message 1701 - Posted: 29 Jan 2016, 7:56:28 UTC - in response to Message 1698. Last modified: 29 Jan 2016, 8:06:14 UTC
I had this at the start of my first vLHC CMS task. It ran 2 sets of 'jobs' that took 38 and 35 minutes to complete, looks like performance testing. They both ran something called:
wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930
After these 2 it then went on to processing real events. Another 'aborted' with no events at 06:22am (CET). It did 4 more successful jobs and since then it has been creating Run folders every 7 minutes but no events are being processed, just 4 files:
[ ] MasterLog 29-Jan-2016 08:43 2.1K
[ ] ProcLog 29-Jan-2016 08:43 12K
[ ] StartdLog 29-Jan-2016 08:43 5.1K
[ ] StarterLog 29-Jan-2016 08:38 620
...being created in the folder. Looks like something has gone AWOL.
Edit: Looks like no jobs to run. Ivan, know you are busy, just putting this here so it is available. If anyone wants the full logs for the test runs I have them.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1703 - Posted: 29 Jan 2016, 10:55:50 UTC Nearly all connections to the server are failing. Lots and lots of errors on dashboard. Maybe it was not such a good idea to link this batch to vLHC?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1704 - Posted: 29 Jan 2016, 11:04:27 UTC - in response to Message 1703.
> Nearly all connections to the server are failing. Lots and lots of errors on dashboard. Maybe it was not such a good idea to link this batch to vLHC?
Don't know yet. Something fell on its ear about 0225 this morning -- Condor isn't putting jobs into the queue, even though it has a backlog of ~4,000 still to schedule. Waiting to hear back from the experts. I don't think there's much else I can do at the moment.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1705 - Posted: 29 Jan 2016, 11:07:55 UTC - in response to Message 1704. Thanks, Ivan.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1706 - Posted: 29 Jan 2016, 11:23:22 UTC - in response to Message 1704. Experts reported: Job submission is failing because "ERROR: proxy has expired". WTF? The proxy should have been seven days, not just three. And I'm sure I set my local proxy to 8 days before submitting the batch. ??? Anyway, I generated a new local 8-day proxy and copied it to the Condor server at RAL. Jobs are starting to be queued and run again now; keeping a close eye on the situation...

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 1709 - Posted: 29 Jan 2016, 13:17:36 UTC Last modified: 29 Jan 2016, 13:21:38 UTC It resumed fine, thanks. However, I had 3 jobs fail (postprocessing failed) in a row from about 02.18 UTC until the interruption. Jobs: 6185, 6603 and 6965. I never had 3 fail in a row before. EDIT: never mind, all due to the outage, I guess.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1718 - Posted: 29 Jan 2016, 20:55:20 UTC
Things aren't looking good for a Friday night, I've just harvested the status reports from the Condor logs for the current batch:
[eesridr:src] > wc 160126.usersort
8790 149430 1704191 160126.usersort
[eesridr:src] > grep 'status 0' 160126.usersort | wc
8028 136476 1555320
[eesridr:src] > grep 'status 151' 160126.usersort | wc
578 9826 112990
(that leaves 184 non-151 errors)
That doesn't tie in well with the current Dashboard numbers, but we all know that Dashboard is not a good reporter of the instantaneous position. I can pretty well tell if an entry is from CMS-dev or vLHC from the userID; the UserIDs > 352 are tending to dominate. As expected from my simplistic explanation, several have 5 or more hosts reporting "151"s (file transfer failures for the latecomers...).
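The `wc` and `grep` counts in the post above amount to a tally of exit statuses per record. A minimal Python sketch of the same tally follows. It assumes each harvested record contains the literal text `status <code>`, which is inferred only from the `grep` patterns; the sample lines and the names `STATUS_RE`, `tally_statuses` and `non_151_errors` are hypothetical and are not taken from the real `160126.usersort` file.

```python
import re
from collections import Counter

# Assumption: every record carries "status <code>" somewhere on the line.
STATUS_RE = re.compile(r"status (\d+)\b")


def tally_statuses(lines):
    """Count how many records report each exit status."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def non_151_errors(total, ok, status_151):
    """Records that are neither successes (status 0) nor status 151."""
    return total - ok - status_151


# Hypothetical sample records, only to exercise the tally.
sample = [
    "user 400 host 1 status 0",
    "user 400 host 2 status 151",
    "user 17 host 9 status 0",
    "user 512 host 3 status 151",
]
sample_counts = tally_statuses(sample)

# Figures quoted in the post: 8790 records, 8028 status 0, 578 status 151.
leftover = non_151_errors(8790, 8028, 578)
print(sample_counts, leftover)
```

Unlike `grep 'status 0'`, the regular expression here matches the whole status number, so a code such as `status 01` or `status 1510` would not be counted under 0 or 151.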

PDW Joined: 20 May 15 Posts: 217 Credit: 5,968,227 RAC: 10,233	Message 1719 - Posted: 29 Jan 2016, 21:06:58 UTC - in response to Message 1718. Unless the volunteer digs into the logs to find the fail(s) or receives an email informing them of such they will not know :-(

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,973,351 RAC: 2,301	Message 1720 - Posted: 29 Jan 2016, 22:01:28 UTC - in response to Message 1718.
> Things aren't looking good for a Friday night, I've just harvested the status reports from the Condor logs for the current batch:
> [eesridr:src] > wc 160126.usersort
> 8790 149430 1704191 160126.usersort
> [eesridr:src] > grep 'status 0' 160126.usersort | wc
> 8028 136476 1555320
> [eesridr:src] > grep 'status 151' 160126.usersort | wc
> 578 9826 112990
> (that leaves 184 non-151 errors)
> That doesn't tie in well with the current Dashboard numbers, but we all know that Dashboard is not a good reporter of the instantaneous position. I can pretty well tell if an entry is from CMS-dev or vLHC from the userID; the UserIDs > 352 are tending to dominate. As expected from my simplistic explanation, several have 5 or more hosts reporting "151"s (file transfer failures for the latecomers...).
No, the problem's probably not in Userland. An hour or so later the counts are:
[eesridr:src] > wc 160126.usersort
9050 153850 1755070 160126.usersort
[eesridr:src] > grep 'status 0' 160126.usersort | wc
8056 136952 1560748
[eesridr:src] > grep 'status 151' 160126.usersort | wc
796 13532 155703
What's even more depressing is my own counts:
[eesridr:src] > grep '32157 ' 16126.report
32157 79553 0 66
32157 79553 151 8
I'm on a 1 Gbps link at work, so I'm not saturating the upload channel! So, there's a systemic problem. I'll not submit a new batch once this one ends, until we can work out the cause of failure.
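The argument for a systemic problem rests on how the counts moved between the two harvests. A short Python sketch of that arithmetic is below; it uses only the numbers quoted in the posts. The `Snapshot` class and `delta` helper are hypothetical names, and reading the two report lines as "user ID, host ID, status, count" is an assumption based on their layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Line counts from one harvest of the status reports."""
    total: int
    ok: int          # records with status 0
    status_151: int  # records with status 151

    @property
    def other(self):
        # Everything that is neither status 0 nor status 151.
        return self.total - self.ok - self.status_151


def delta(earlier, later):
    """Records added between two harvests, broken down by status."""
    return Snapshot(
        total=later.total - earlier.total,
        ok=later.ok - earlier.ok,
        status_151=later.status_151 - earlier.status_151,
    )


earlier = Snapshot(total=8790, ok=8028, status_151=578)
later = Snapshot(total=9050, ok=8056, status_151=796)
new = delta(earlier, later)

# Share of the newly added records that ended with status 151.
share_151 = new.status_151 / new.total

# Assumed reading of the 16126.report lines: 66 successes and 8 status-151
# results for the same user and host.
own_151_rate = 8 / (66 + 8)

print(new, new.other, round(share_151, 3), round(own_151_rate, 3))
```

On these figures, 260 records were added between the harvests: 218 with status 151, 28 with status 0 and 14 with other codes. Under the assumed column reading, about 11% of the results from the poster's own host were 151s.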

Development for LHC@home