Message boards : News : Jobs incoming!

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,657,669
RAC: 15,099
Message 1089 - Posted: 16 Sep 2015, 13:19:27 UTC - in response to Message 1088.  

Mine seem to be running normally, haven't noticed anything wrong.

In fact since about the 11th I haven't seen an X509 error and I am no longer jumping time zones. Don't know whether my ISP did something, you did something or it's just the direction the rain is falling that makes everything work now !

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 202
Message 1090 - Posted: 16 Sep 2015, 13:41:08 UTC - in response to Message 1086.  
Last modified: 16 Sep 2015, 14:15:49 UTC

Started a host manually to see if anything seems wrong.
Watched job 5856 all the way through: 250 events, no problems.

"Complete
process id is 8212 status is 0"

I suppose there could be a problem sending the result back;
how would we know?

Edit:-

This is from condor stdout...

INFO Davix: Operation failure: HTTP 404 : File not found . After 1 retry
INFO Davix: Failure: Impossible to execute operation on https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root, error Failure HTTP 404 : File not found after 1 attempts
INFO Davix: Try to Recover with Metalink...
DEBUG Davix: Creat HttpRequest for https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root
DEBUG Davix: Executing head query to https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root for Metalink file

Does this indicate a problem? If this:
output//dpm/brunel.ac.uk/
is part of a pathname, should there be a "//" there? Or do others have as much
trouble typing as I do?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1091 - Posted: 16 Sep 2015, 15:14:06 UTC - in response to Message 1090.  
Last modified: 16 Sep 2015, 15:14:20 UTC

Does this indicate a problem? If this:
output//dpm/brunel.ac.uk/
is part of a pathname, should there be a "//" there? Or do others have as much
trouble typing as I do?

Thanks for the output. I'm not sure yet what the problem is; I'll get my VM running again.
A double slash in a Linux pathname is no problem as far as I'm aware.
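
As a small illustration of that point, here is a minimal Python sketch showing how POSIX-style path normalisation treats the fragment from the log. It only demonstrates local path handling; how a web server such as the data bridge interprets "//" inside a URL is a separate question that this sketch does not settle.

```python
import posixpath

# Path fragment taken from the Davix log lines quoted in this thread.
fragment = "output//dpm/brunel.ac.uk/home"

# normpath collapses repeated separators inside a path into one.
normalized = posixpath.normpath(fragment)
print(normalized)
```

On a Linux filesystem both spellings would resolve to the same location, which is why the doubled slash on its own is unlikely to be the cause of the 404.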

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1092 - Posted: 16 Sep 2015, 15:24:17 UTC

Are all incoming jobs still bad?

Mine may have been bad until 11:00 o'clock this morning (my Squid was hanging), but since then everything seems to be running fine.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1093 - Posted: 16 Sep 2015, 15:39:12 UTC - in response to Message 1090.  


INFO Davix: Operation failure: HTTP 404 : File not found . After 1 retry
INFO Davix: Failure: Impossible to execute operation on https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root, error Failure HTTP 404 : File not found after 1 attempts
INFO Davix: Try to Recover with Metalink...
DEBUG Davix: Creat HttpRequest for https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root
DEBUG Davix: Executing head query to https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0005/step1_5856.root for Metalink file

Does this indicate a problem?

I get the same for a job I know was successful:

INFO Davix: Failure: Impossible to execute operation on https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0000/step1_6.root, error Failure HTTP 404 : File not found after 1 attempts
INFO Davix: Try to Recover with Metalink...
DEBUG Davix: Creat HttpRequest for https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0000/step1_6.root
DEBUG Davix: Executing head query to https://data-bridge-test.cern.ch/myfed/cms-boinc/output//dpm/brunel.ac.uk/home/cms/store/user/ireid/CMS_at_Home/CRAB3_MinBias/150908_152652/0000/step1_6.root for Metalink file

Now, let's look at 5856: job_out.5856.0.txt on the Condor server looks OK, but unfortunately Dashboard has it as "pending".
Ah! But there's a job_out.5856.1.txt as well, so it's been resubmitted:

 cat 150908_152652:ireid_crab_CMS_at_Home_MinBias2/job_out.5856.1.txt
Job output has not been processed by post-job.

So, something's giving the same symptoms we had a few weeks ago. :-(

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1094 - Posted: 16 Sep 2015, 15:42:54 UTC - in response to Message 1092.  

Are all incoming jobs still bad?

Mine may have been bad until 11:00 o'clock this morning (my Squid was hanging), but since then everything seems to be running fine.

See above; jobs aren't reporting as successful to Condor and Dashboard.
Ah, Bingo! Job 5856 has returned a result, so the problem's at the Condor end.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1095 - Posted: 16 Sep 2015, 18:47:51 UTC

Problem solved, we think. It turns out that Condor uses a different proxy from the one I renew periodically at work, and it had a default lifetime of seven days, which expired...
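
The failure mode described here (a credential with a fixed seven-day lifetime quietly running out) can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the creation time below is hypothetical, and the real check would be done with the grid proxy tools rather than with code like this.

```python
from datetime import datetime, timedelta, timezone

# Default proxy lifetime mentioned in the post above.
PROXY_LIFETIME = timedelta(days=7)


def proxy_expired(created: datetime, now: datetime,
                  lifetime: timedelta = PROXY_LIFETIME) -> bool:
    """Return True once `now` is at or past the proxy's expiry time."""
    return now >= created + lifetime


# Hypothetical creation time, for illustration only.
created = datetime(2015, 9, 8, 12, 0, tzinfo=timezone.utc)
print(proxy_expired(created, datetime(2015, 9, 14, 12, 0, tzinfo=timezone.utc)))
print(proxy_expired(created, datetime(2015, 9, 16, 12, 0, tzinfo=timezone.utc)))
```

A periodic check of this kind, run well before the expiry time, is one way to avoid being surprised by a proxy that is not covered by the usual renewal routine.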

I've put in a small batch of jobs half as big as the last batch, which should get us through the night until I can sort out the base problem tomorrow.

By the way, check out a new Job Activities Page. Thanks to Andrew and Laurence for that one.

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,657,669
RAC: 15,099
Message 1097 - Posted: 16 Sep 2015, 19:43:06 UTC - in response to Message 1095.  

Hurray.

Like the new JAP, any chance of a link on the left for it ?

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,657,669
RAC: 15,099
Message 1098 - Posted: 16 Sep 2015, 19:45:37 UTC - in response to Message 1097.  

Guessing the error rate has dropped significantly then ?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1099 - Posted: 16 Sep 2015, 20:27:45 UTC - in response to Message 1098.  

Guessing the error rate has dropped significantly then ?

Early days, but Dashboard reckons 57 successes and no failures so far with the new batch.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1100 - Posted: 16 Sep 2015, 20:28:54 UTC - in response to Message 1097.  

Hurray.

Like the new JAP, any chance of a link on the left for it ?

I understand that Laurence is brushing up on his PHP...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1101 - Posted: 16 Sep 2015, 21:22:17 UTC - in response to Message 1099.  

Guessing the error rate has dropped significantly then ?

Early days, but Dashboard reckons 57 successes and no failures so far with the new batch.

This is looking better now:



PDW
Joined: 20 May 15
Posts: 217
Credit: 5,657,669
RAC: 15,099
Message 1102 - Posted: 16 Sep 2015, 23:51:19 UTC - in response to Message 1100.  

Hurray.

Like the new JAP, any chance of a link on the left for it ?

I understand that Laurence is brushing up on his PHP...

:-)

Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1103 - Posted: 17 Sep 2015, 18:00:19 UTC

Just found:

09/17/15 19:48:54 (pid:12496) attempt to connect to <130.246.180.120:9619> failed: timed out after 20 seconds.
09/17/15 19:48:54 (pid:12496) ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9619 failed.
09/17/15 19:48:54 (pid:12496) Failed to start non-blocking update to <130.246.180.120:9619>.
09/17/15 19:49:30 (pid:12496) attempt to connect to <130.246.180.120:9619> failed: Connection timed out (connect errno = 110).
09/17/15 19:49:30 (pid:12496) ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9619 failed.
09/17/15 19:49:30 (pid:12496) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
09/17/15 19:52:37 (pid:12496) attempt to connect to <130.246.180.120:9619> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (173 to go).

I checked my firewall and found that until now only ports 9620-9623 were open for CMS. I have now added 9619.

It would be really helpful if someone could check which ports are needed for which IP address(es).
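
In the meantime, a volunteer can at least test outbound reachability from their own machine. The following is a minimal Python sketch, not an official project tool: the helper names `can_connect` and `report` are made up for illustration, the host name is the collector from the log above, and the port list is simply the set mentioned in this post, not a confirmed list of what CMS needs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def report(host: str, ports) -> None:
    """Print one line per port saying whether a connection could be opened."""
    for port in ports:
        state = "reachable" if can_connect(host, port) else "blocked or closed"
        print(f"{host}:{port} {state}")
```

Calling `report("lcggwms02.gridpp.rl.ac.uk", range(9619, 9624))` would try each port in turn. A failure only says that the connection could not be opened from this machine; it does not distinguish a local firewall rule from a server that is not listening on that port.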

http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=63

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,888,015
RAC: 1,314
Message 1105 - Posted: 17 Sep 2015, 20:10:54 UTC - in response to Message 1103.  

I've alerted some experts to check it out -- way beyond me, I'm afraid.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 751
Credit: 11,610,444
RAC: 1,210
Message 1106 - Posted: 18 Sep 2015, 1:50:35 UTC

No problems here.
Mad Scientist For Life

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1184
Credit: 821,086
RAC: 730
Message 1108 - Posted: 18 Sep 2015, 10:23:15 UTC - in response to Message 1095.  

I've put in a small batch of jobs half as big as the last batch, which should get us through the night until I can sort out the base problem tomorrow.

By the way, check out a new Job Activities Page.

Positive side effect.
The number of failed jobs is about the same as before Condor struck, but by halving the number of events per job, the number of jobs is doubled. So the percentage of failed jobs is halved.
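
The arithmetic behind this can be sketched in Python. The job counts are hypothetical and chosen only to make the effect visible; they are not taken from the Dashboard.

```python
def failure_percentage(failed_jobs: int, total_jobs: int) -> float:
    """Percentage of jobs that failed."""
    return 100.0 * failed_jobs / total_jobs


# Hypothetical numbers, chosen only to illustrate the arithmetic.
failed = 50
jobs_before = 1000
jobs_after = 2 * jobs_before  # half the events per job, so twice the jobs

before = failure_percentage(failed, jobs_before)
after = failure_percentage(failed, jobs_after)
print(before, after)
```

With the same absolute number of failures spread over twice as many jobs, the failure percentage drops by half even though nothing about the underlying failure causes has changed.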

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 202
Message 1109 - Posted: 18 Sep 2015, 12:04:39 UTC - in response to Message 1102.  
Last modified: 18 Sep 2015, 12:12:00 UTC

Hurray.

Like the new JAP, any chance of a link on the left for it ?

I understand that Laurence is brushing up on his PHP...

:-)


Yes!!! Thanks, Laurence.

All it needs now to be really useful is the ability to see
the numbers of good/failed/abandoned jobs per host...

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 202
Message 1110 - Posted: 18 Sep 2015, 12:22:46 UTC - in response to Message 1108.  

I've put in a small batch of jobs half as big as the last batch, which should get us through the night until I can sort out the base problem tomorrow.

By the way, check out a new Job Activities Page.

Positive side effect.
The number of failed jobs is about the same as before Condor struck, but by halving the number of events per job, the number of jobs is doubled. So the percentage of failed jobs is halved.


Since the overheads for this project seem quite high (don't know about Atlas, we aren't allowed to see what's going on inside there...) there will be a point at which they cause the throughput to decrease again as the size of jobs is reduced. It would be nice to see where this point is.
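
A rough way to picture this trade-off is a simple model in which every job pays a fixed overhead on top of its per-event processing time. The Python sketch below is only that: the 10 seconds per event and 600 seconds of overhead are hypothetical figures, not measurements from this project, and the model ignores failures entirely.

```python
def useful_fraction(events_per_job: int, seconds_per_event: float,
                    overhead_seconds: float) -> float:
    """Fraction of a job's wall time spent processing events."""
    compute = events_per_job * seconds_per_event
    return compute / (compute + overhead_seconds)


# Hypothetical figures: 10 s per event and 600 s fixed overhead per job.
for events in (250, 125, 60, 30):
    print(events, round(useful_fraction(events, 10.0, 600.0), 3))
```

Under these assumptions the useful fraction falls steadily as jobs get smaller, so the gain from losing less work per failed job has to be weighed against the growing share of time spent on overhead. Finding the actual crossover would need measured overheads and failure rates.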

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1111 - Posted: 18 Sep 2015, 14:57:22 UTC

Maybe it would be a good idea not to download data that does not change again for every task/job, but to store it locally instead.

I would also like to know the benefits of a constant network connection versus the download, crunch, upload approach.

©2024 CERN