Message boards : Number crunching : Expect errors eventually

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1590 - Posted: 11 Jan 2016, 1:06:01 UTC - in response to Message 1589.  

OK, it finished, on a host with a current IP different from the original one. So no obvious change in behaviour yet.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1591 - Posted: 11 Jan 2016, 9:12:58 UTC

I had job 2078 fail to start.
No mention of it on the dashboard. The job appears to have been done by another computer.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1592 - Posted: 11 Jan 2016, 16:39:10 UTC - in response to Message 1591.  

> I had job 2078 fail to start.
> No mention of it on the dashboard. The job appears to have been done by another computer.

2078 appears to have completed successfully at 0420 this morning, in Italy. Dashboard shows an Italian IP address and compatible timing.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1593 - Posted: 11 Jan 2016, 17:11:31 UTC - in response to Message 1592.  

I had the logs.
Unfortunately I started a new CMS task, so the logs are gone.
Next time I will make a copy, but I did have that job attempting to start.
It puzzles me that there was no record of this on the server.
This task attempted to start at 02:06 UTC.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1594 - Posted: 11 Jan 2016, 18:11:30 UTC - in response to Message 1593.  

> I had the logs.
> Unfortunately I started a new CMS task, so the logs are gone.
> Next time I will make a copy, but I did have that job attempting to start.
> It puzzles me that there was no record of this on the server.
> This task attempted to start at 02:06 UTC.

Strange, that's the start time of the Italian job. However, it didn't start processing the first event until 0310.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1595 - Posted: 12 Jan 2016, 15:58:00 UTC

Job failed to start (3051).
Start time 16:30 local (15:30 UTC).
Nothing about it on the dashboard.
I can send the logs, if you want.

The next job started 6 minutes later.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1596 - Posted: 12 Jan 2016, 20:53:59 UTC - in response to Message 1595.  

> Job failed to start (3051).
> Start time 16:30 local (15:30 UTC).
> Nothing about it on the dashboard.
> I can send the logs, if you want.

Hmm, the logs -- and Dashboard -- have it starting at 1536 on a Norwegian host.
IP is for a Norwegian ISP. Please send logs, to Uni address if you have it.

> The next job started 6 minutes later.


Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1597 - Posted: 12 Jan 2016, 21:06:29 UTC

> Please send logs, to Uni address if you have it.


No, I don't.
Please send PM.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1694 - Posted: 28 Jan 2016, 23:02:10 UTC

Does the project run dummy jobs to test performance?
I have a job that is running (cms-run close to 100%), but no cms-run files are produced. The condor_stderr lists things about performance.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1695 - Posted: 29 Jan 2016, 0:31:55 UTC - in response to Message 1694.  

> Does the project run dummy jobs to test performance?
> I have a job that is running (cms-run close to 100%), but no cms-run files are produced. The condor_stderr lists things about performance.

Not that I'm aware of. Laurence? Can you capture the condor logs in your browser and send them to me?

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1698 - Posted: 29 Jan 2016, 0:50:48 UTC - in response to Message 1695.  

Let me know if you need any logs.

I finally got out of it by suspending the task for 25 minutes.

If I had not done that, it would presumably have run until the end of the 24 h limit.

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 10,233
Message 1701 - Posted: 29 Jan 2016, 7:56:28 UTC - in response to Message 1698.  
Last modified: 29 Jan 2016, 8:06:14 UTC

I had this at the start of my first vLHC CMS task.

It ran 2 sets of 'jobs' that took 38 and 35 minutes to complete, looks like performance testing. They both ran something called:

wmagent_riahi_TEST_HELIX_0911-T3_CH_VolunteerBackfill_160125_144505_9930

After these 2 it then went on to processing real events.

Another 'aborted' with no events at 06:22am (CET).

It did 4 more successful jobs and since then it has been creating Run folders every 7 minutes but no events are being processed, just 4 files:

[ ] MasterLog 29-Jan-2016 08:43 2.1K
[ ] ProcLog 29-Jan-2016 08:43 12K
[ ] StartdLog 29-Jan-2016 08:43 5.1K
[ ] StarterLog 29-Jan-2016 08:38 620

...being created in the folder.

Looks like something has gone AWOL. Edit: Looks like no jobs to run.

Ivan, know you are busy, just putting this here so it is available. If anyone wants the full logs for the test runs I have them.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1703 - Posted: 29 Jan 2016, 10:55:50 UTC

Nearly all connections to the server are failing.
Lots and lots of errors on the dashboard.

Maybe it was not such a good idea to link this batch to vLHC?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1704 - Posted: 29 Jan 2016, 11:04:27 UTC - in response to Message 1703.  

> Nearly all connections to the server are failing.
> Lots and lots of errors on the dashboard.
>
> Maybe it was not such a good idea to link this batch to vLHC?

Don't know yet. Something fell on its ear about 0225 this morning -- Condor isn't putting jobs into the queue, even though it has a backlog of ~4,000 still to schedule. Waiting to hear back from the experts. I don't think there's much else I can do at the moment.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1705 - Posted: 29 Jan 2016, 11:07:55 UTC - in response to Message 1704.  

Thanks, Ivan.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1706 - Posted: 29 Jan 2016, 11:23:22 UTC - in response to Message 1704.  

Experts reported:
Job submission is failing because "ERROR: proxy has expired".

WTF? The proxy should have been seven days, not just three. And I'm sure I set my local proxy to 8 days before submitting the batch. ???
Anyway, I generated a new local 8-day proxy and copied it to the Condor server at RAL. Jobs are starting to be queued and run again now; keeping a close eye on the situation...

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1709 - Posted: 29 Jan 2016, 13:17:36 UTC
Last modified: 29 Jan 2016, 13:21:38 UTC

It resumed fine, thanks.
However, I had 3 jobs fail (postprocessing failed) in a row from about 02:18 UTC until the interruption.
Jobs: 6185, 6603 and 6965.

I have never had 3 fail in a row before.

EDIT: never mind, all due to the outage, I guess.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1718 - Posted: 29 Jan 2016, 20:55:20 UTC

Things aren't looking good for a Friday night. I've just harvested the status reports from the Condor logs for the current batch:
[eesridr:src] > wc 160126.usersort
8790 149430 1704191 160126.usersort
[eesridr:src] > grep 'status 0' 160126.usersort|wc
8028 136476 1555320
[eesridr:src] > grep 'status 151' 160126.usersort|wc
578 9826 112990
(that leaves 184 non-151 errors)
That doesn't tie in well with the current Dashboard numbers, but we all know that Dashboard is not a good reporter of the instantaneous position. I can pretty well tell if an entry is from CMS-dev or vLHC from the userID; the UserIDs > 352 are tending to dominate. As expected from my simplistic explanation, several have 5 or more hosts reporting "151"s (file transfer failures for the latecomers...).
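
The grep/wc checks above amount to a per-status tally. A minimal Python sketch of the same bookkeeping follows, assuming only that each report line contains the text `status <code>`. The real layout of 160126.usersort is not shown in this thread, so the sample lines and the helper name `tally_statuses` are invented for illustration.

```python
import re
from collections import Counter

# Assumption: every report line carries "status <number>" somewhere in it.
STATUS_RE = re.compile(r"\bstatus (\d+)\b")

def tally_statuses(lines):
    """Count report lines per exit status; lines without a status go under None."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        counts[int(match.group(1)) if match else None] += 1
    return counts

# Invented sample lines, not real log content.
sample = [
    "user 101 host 5001 job 17 exited with status 0",
    "user 352 host 5002 job 18 exited with status 151",
    "user 352 host 5003 job 19 exited with status 151",
    "user 101 host 5001 job 20 exited with status 1",
]
counts = tally_statuses(sample)
other_errors = sum(n for code, n in counts.items() if code not in (0, 151))
print(counts[0], counts[151], other_errors)

# The same subtraction on the figures quoted above.
total, ok, err_151 = 8790, 8028, 578
print(total - ok - err_151)  # 184
```

A single pass like this gives every status code at once, which avoids running a separate grep per code and makes the "non-151 errors" remainder explicit.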

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,968,227
RAC: 10,233
Message 1719 - Posted: 29 Jan 2016, 21:06:58 UTC - in response to Message 1718.  

Unless the volunteer digs into the logs to find the failure(s), or receives an email informing them, they will not know :-(

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,973,351
RAC: 2,301
Message 1720 - Posted: 29 Jan 2016, 22:01:28 UTC - in response to Message 1718.  

> Things aren't looking good for a Friday night. I've just harvested the status reports from the Condor logs for the current batch:
> [eesridr:src] > wc 160126.usersort
> 8790 149430 1704191 160126.usersort
> [eesridr:src] > grep 'status 0' 160126.usersort|wc
> 8028 136476 1555320
> [eesridr:src] > grep 'status 151' 160126.usersort|wc
> 578 9826 112990
> (that leaves 184 non-151 errors)
> That doesn't tie in well with the current Dashboard numbers, but we all know that Dashboard is not a good reporter of the instantaneous position. I can pretty well tell if an entry is from CMS-dev or vLHC from the userID; the UserIDs > 352 are tending to dominate. As expected from my simplistic explanation, several have 5 or more hosts reporting "151"s (file transfer failures for the latecomers...).


No, the problem's probably not in Userland. An hour or so later the counts are:
[eesridr:src] > wc 160126.usersort
9050 153850 1755070 160126.usersort
[eesridr:src] > grep 'status 0' 160126.usersort|wc
8056 136952 1560748
[eesridr:src] > grep 'status 151' 160126.usersort|wc
796 13532 155703


What's even more depressing is my own counts:
[eesridr:src] > grep '32157 ' 16126.report
32157 79553 0 66
32157 79553 151 8

I'm on a 1 Gbps link at work, so I'm not saturating the upload channel! So, there's a systemic problem. I'll not submit a new batch once this one ends, until we can work out the cause of failure.
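
To put rough numbers on "systemic", here is a short Python sketch that uses only the counts quoted in this thread. It works out what changed between the two snapshots and the status-151 share in the `32157` rows. Reading the third column of those rows as the status and the fourth as the count is an assumption, and the variable names are illustrative.

```python
# Counts copied from the two snapshots of 160126.usersort quoted above.
first = {"total": 8790, "status_0": 8028, "status_151": 578}
later = {"total": 9050, "status_0": 8056, "status_151": 796}

# Change in each count between the snapshots.
delta = {key: later[key] - first[key] for key in first}
share_151_new = delta["status_151"] / delta["total"]
print(delta)                   # {'total': 260, 'status_0': 28, 'status_151': 218}
print(f"{share_151_new:.1%}")  # 83.8%

# Assumed reading of the 32157 rows: third column is the status, fourth the count.
own_ok, own_151 = 66, 8
own_share_151 = own_151 / (own_ok + own_151)
print(f"{own_share_151:.1%}")  # 10.8%
```

On these figures, roughly five in six of the reports added between the two snapshots were status 151, against about one in nine for the rows from the 1 Gbps link.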