Message boards : News : Agent Broken
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 695 - Posted: 19 Aug 2015, 15:00:43 UTC

We have an issue with the agent, so the VMs will not get new jobs until it has been resolved.
ID: 695 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 703 - Posted: 19 Aug 2015, 18:50:15 UTC - in response to Message 695.  

We have identified and fixed the issue. The voms-proxy-info command needed by the glidein was not found. A fix should be uploaded to CVMFS within the next hour or so.
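
For anyone who wants to check this kind of failure from inside the VM, here is a minimal Python sketch that tests whether a command can be found on the PATH. It assumes Python is available in the VM and that the glidein locates voms-proxy-info through the PATH; neither is confirmed in this thread. The helper name command_available is made up for illustration.

```python
import shutil


def command_available(name):
    """Return True if an executable called `name` is found on the PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    # voms-proxy-info is the command reported as missing in this thread.
    for cmd in ("voms-proxy-info",):
        status = "found" if command_available(cmd) else "NOT found"
        print(f"{cmd}: {status}")
```

This only says whether the command is visible to the shell that runs the check; the glidein may run with a different environment.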
ID: 703 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 704 - Posted: 19 Aug 2015, 19:23:01 UTC - in response to Message 703.  

Will my currently running task resume work automatically?
Or do I need to abort it and start a new one?
ID: 704 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 705 - Posted: 19 Aug 2015, 19:57:04 UTC - in response to Message 704.  

It should resume automatically.
ID: 705 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 706 - Posted: 19 Aug 2015, 19:58:04 UTC - in response to Message 705.  

The new agent should now be in CVMFS. Please let me know if it works for you or not.
ID: 706 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 707 - Posted: 19 Aug 2015, 20:10:14 UTC - in response to Message 706.  

The new agent should now be in CVMFS. Please let me know if it works for you or not.

No cmsRun started.

Every 2 minutes this cycle:

21:49:01 +0200 2015-08-19 [INFO] Requesting an X509 credential
21:49:02 +0200 2015-08-19 [INFO] Downloading glidein
21:49:02 +0200 2015-08-19 [INFO] Running glidein (check logs)
21:50:01 +0200 2015-08-19 [INFO] CMS glidein ended
21:51:01 +0200 2015-08-19 [INFO] Starting CMS Application
21:51:01 +0200 2015-08-19 [INFO] Reading the BOINC volunteer's information
21:51:01 +0200 2015-08-19 [INFO] Volunteer: (38
) Host: 37
ID: 707 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 708 - Posted: 19 Aug 2015, 20:16:06 UTC

No change, yet. Same pattern as before.
ID: 708 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 709 - Posted: 19 Aug 2015, 20:23:15 UTC - in response to Message 707.  

And mine is:

21:01:01 +0100 2015-08-19 [INFO] Starting CMS Application
21:01:01 +0100 2015-08-19 [INFO] Reading the BOINC volunteer's information
21:01:01 +0100 2015-08-19 [INFO] Volunteer: (229
) Host: 380
21:01:01 +0100 2015-08-19 [INFO] Requesting an X509 credential
21:01:07 +0100 2015-08-19 [INFO] Downloading glidein
21:01:12 +0100 2015-08-19 [INFO] Running glidein (check logs)
21:08:01 +0100 2015-08-19 [INFO] CMS glidein ended

(I think those are the right starting/ending points for the cycle)

IDs are right for this project - CP is ID 38, I am 229.
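
To compare cycle lengths between hosts, the cron-stdout lines can be measured with a short script. The following is a rough Python sketch, assuming the log keeps the "HH:MM:SS +ZZZZ YYYY-MM-DD [INFO] message" layout shown above and that a cycle runs from "Starting CMS Application" to "CMS glidein ended". The function names are invented for illustration; the sample lines are the ones quoted in this post.

```python
from datetime import datetime

START = "Starting CMS Application"
END = "CMS glidein ended"

# Log lines copied from the post above.
SAMPLE_LOG = """\
21:01:01 +0100 2015-08-19 [INFO] Starting CMS Application
21:01:01 +0100 2015-08-19 [INFO] Reading the BOINC volunteer's information
21:01:01 +0100 2015-08-19 [INFO] Volunteer: (229
) Host: 380
21:01:01 +0100 2015-08-19 [INFO] Requesting an X509 credential
21:01:07 +0100 2015-08-19 [INFO] Downloading glidein
21:01:12 +0100 2015-08-19 [INFO] Running glidein (check logs)
21:08:01 +0100 2015-08-19 [INFO] CMS glidein ended
"""


def parse_line(line):
    """Return (timestamp, message) for a log line, or None if it does not match."""
    stamp, sep, message = line.partition(" [INFO] ")
    if not sep:
        return None  # continuation line such as ') Host: 380'
    try:
        when = datetime.strptime(stamp.strip(), "%H:%M:%S %z %Y-%m-%d")
    except ValueError:
        return None
    return when, message.strip()


def cycle_lengths(lines):
    """Return the seconds from each START message to the following END message."""
    lengths = []
    started = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        when, message = parsed
        if message == START:
            started = when
        elif message == END and started is not None:
            lengths.append((when - started).total_seconds())
            started = None
    return lengths


if __name__ == "__main__":
    print(cycle_lengths(SAMPLE_LOG.splitlines()))  # prints [420.0]
```

For the lines above this gives one cycle of 420 seconds (21:01:01 to 21:08:01).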
ID: 709 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 268
Message 710 - Posted: 19 Aug 2015, 20:27:29 UTC
Last modified: 19 Aug 2015, 21:00:20 UTC

OK here, I think (host 553): cmsRun at ~96% CPU, BUT no alt-F5 display. IDs are correct.

Edit: been running 20 mins or so, no "Glidein ended" message on the F1 screen.

ID: 710 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 711 - Posted: 19 Aug 2015, 20:35:20 UTC - in response to Message 709.  

Where do I find that info?
21:01:01 +0100 2015-08-19 [INFO] Starting CMS Application
21:01:01 +0100 2015-08-19 [INFO] Reading the BOINC volunteer's information
21:01:01 +0100 2015-08-19 [INFO] Volunteer: (229
) Host: 380
21:01:01 +0100 2015-08-19 [INFO] Requesting an X509 credential
21:01:07 +0100 2015-08-19 [INFO] Downloading glidein
21:01:12 +0100 2015-08-19 [INFO] Running glidein (check logs)
21:08:01 +0100 2015-08-19 [INFO] CMS glidein ended
ID: 711 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 712 - Posted: 19 Aug 2015, 20:36:17 UTC - in response to Message 709.  

(I think those are the right starting/ending points for the cycle)

Richard is right, I think.

There is a 1-minute pause between 'CMS glidein ended' and 'Starting CMS Application'.
ID: 712 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 713 - Posted: 19 Aug 2015, 20:37:57 UTC - in response to Message 711.  

Where do I find that info?


On the ALT+F1 screen and in the log: cron-stdout
ID: 713 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 714 - Posted: 19 Aug 2015, 20:39:57 UTC
Last modified: 19 Aug 2015, 20:40:16 UTC

Thanks, mine just says:

22:39:02 +0200 2015-08-19 [INFO] CMS glidein ended
ID: 714 · Rating: 0
PDW
Joined: 20 May 15
Posts: 217
Credit: 5,608,816
RAC: 14,999
Message 716 - Posted: 19 Aug 2015, 21:16:46 UTC - in response to Message 714.  

My StartdLog has these errors...

08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) ******************************************************
08/19/15 22:08:00 (pid:755) ** condor_startd (CONDOR_STARTD) STARTING UP
08/19/15 22:08:00 (pid:755) ** /home/boinc/CMSRun/glide_kLrbvH/main/condor/sbin/condor_startd
08/19/15 22:08:00 (pid:755) ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
08/19/15 22:08:00 (pid:755) ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
08/19/15 22:08:00 (pid:755) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
08/19/15 22:08:00 (pid:755) ** $CondorPlatform: x86_64_RedHat5 $
08/19/15 22:08:00 (pid:755) ** PID = 755
08/19/15 22:08:00 (pid:755) ** Log last touched time unavailable (No such file or directory)
08/19/15 22:08:00 (pid:755) ******************************************************
08/19/15 22:08:00 (pid:755) Using config source: /home/boinc/CMSRun/glide_kLrbvH/condor_config
08/19/15 22:08:00 (pid:755) config Macros = 211, Sorted = 211, StringBytes = 10492, TablesBytes = 7636
08/19/15 22:08:00 (pid:755) CLASSAD_CACHING is ENABLED
08/19/15 22:08:00 (pid:755) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:00 (pid:755) DaemonCore: command socket at <10.0.2.15:43734?noUDP>
08/19/15 22:08:00 (pid:755) DaemonCore: private command socket at <10.0.2.15:43734>
08/19/15 22:08:01 (pid:755) authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:647: in library: PEM routines, function PEM_read_bio: no start line

08/19/15 22:08:03 (pid:755) SECMAN: required authentication with collector lcggwms02.gridpp.rl.ac.uk:9619 failed, so aborting command CCB_REGISTER.
08/19/15 22:08:03 (pid:755) ERROR: AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:18). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
08/19/15 22:08:03 (pid:755) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
08/19/15 22:08:03 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:03 (pid:755) my_popenv failed
08/19/15 22:08:03 (pid:755) Failed to run hibernation plugin '/home/boinc/CMSRun/glide_kLrbvH/main/condor/libexec/condor_power_state ad'
08/19/15 22:08:04 (pid:755) VM-gahp server reported an internal error
08/19/15 22:08:04 (pid:755) VM universe will be tested to check if it is available
08/19/15 22:08:04 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:04 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:04 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:04 (pid:755) History file rotation is enabled.
08/19/15 22:08:04 (pid:755) Maximum history file size is: 20971520 bytes
08/19/15 22:08:04 (pid:755) Number of rotated history files is: 2
08/19/15 22:08:04 (pid:755) Allocating auto shares for slot type 1: Cpus: 1.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 1.000000, Memory: 2002, Swap: 100.00%, Disk: 100.00%
08/19/15 22:08:04 (pid:755) New machine resource of type 1 allocated
08/19/15 22:08:04 (pid:755) Setting up slot pairings
08/19/15 22:08:04 (pid:755) my_popenv failed
08/19/15 22:08:04 (pid:755) init_local_hostname: ipv6_getaddrinfo() could not look up 246-471-21479: Name or service not known (-2)
08/19/15 22:08:04 (pid:755) Adding 'mips' to the Supplimental ClassAd list
08/19/15 22:08:04 (pid:755) CronJobList: Adding job 'mips'
08/19/15 22:08:04 (pid:755) Adding 'kflops' to the Supplimental ClassAd list
08/19/15 22:08:04 (pid:755) CronJobList: Adding job 'kflops'
08/19/15 22:08:04 (pid:755) CronJob: Initializing job 'mips' (/home/boinc/CMSRun/glide_kLrbvH/main/condor/libexec/condor_mips)
08/19/15 22:08:04 (pid:755) CronJob: Initializing job 'kflops' (/home/boinc/CMSRun/glide_kLrbvH/main/condor/libexec/condor_kflops)
08/19/15 22:08:04 (pid:755) State change: IS_OWNER is false
08/19/15 22:08:04 (pid:755) Changing state: Owner -> Unclaimed
08/19/15 22:08:04 (pid:755) State change: RunBenchmarks is TRUE
08/19/15 22:08:04 (pid:755) Changing activity: Idle -> Benchmarking
08/19/15 22:08:04 (pid:755) BenchMgr:StartBenchmarks()
08/19/15 22:08:04 (pid:755) authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:647: in library: PEM routines, function PEM_read_bio: no start line

08/19/15 22:08:04 (pid:755) SECMAN: required authentication with daemon at <10.0.2.15:56260> failed, so aborting command DC_CHILDALIVE.
08/19/15 22:08:04 (pid:755) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:56260> (try 1 of 3): AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:36). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
08/19/15 22:08:04 (pid:755) authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:647: in library: PEM routines, function PEM_read_bio: no start line

08/19/15 22:08:04 (pid:755) SECMAN: required authentication with daemon at <10.0.2.15:56260> failed, so aborting command DC_CHILDALIVE.
08/19/15 22:08:04 (pid:755) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:56260> (try 2 of 3): AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:54). There is probably a problem with your credentials. (Did you run grid-proxy-init?)|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:36). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
08/19/15 22:08:04 (pid:755) authenticate_self_gss: acquiring self credentials failed. Please check your Condor configuration file if this is a server process. Or the user environment variable if this is a user process.

GSS Major Status: General failure
GSS Minor Status Error Chain:
globus_gsi_gssapi: Error with GSI credential
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:647: in library: PEM routines, function PEM_read_bio: no start line

08/19/15 22:08:04 (pid:755) SECMAN: required authentication with daemon at <10.0.2.15:56260> failed, so aborting command DC_CHILDALIVE.
08/19/15 22:08:04 (pid:755) ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:56260> (try 3 of 3): AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:72). There is probably a problem with your credentials. (Did you run grid-proxy-init?)|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:54). There is probably a problem with your credentials. (Did you run grid-proxy-init?)|AUTHENTICATE:1003:Failed to authenticate with any method|AUTHENTICATE:1004:Failed to authenticate using GSI|GSI:5003:Failed to authenticate. Globus is reporting error (851968:36). There is probably a problem with your credentials. (Did you run grid-proxy-init?)
08/19/15 22:08:04 (pid:755) ERROR "FAILED TO SEND INITIAL KEEP ALIVE TO OUR PARENT <10.0.2.15:56260>" at line 9470 in file /slots/12/dir_4417/userdir/src/condor_daemon_core.V6/daemon_core.cpp
08/19/15 22:08:04 (pid:755) startd exiting because of fatal exception.


...which I'm guessing means it isn't happy?
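
The OpenSSL message "PEM_read_bio: no start line" generally means the file it was given contains no "-----BEGIN ...-----" line, for example because the proxy file is empty or is not PEM at all. Below is a rough Python sketch of that check. The location of the proxy file is not shown in the log, so the path has to be supplied by whoever runs it, and both function names are hypothetical.

```python
PEM_MARKER = b"-----BEGIN "


def has_pem_start_line(data):
    """Return True if any line of `data` (bytes) starts with a PEM BEGIN marker."""
    return any(line.startswith(PEM_MARKER) for line in data.splitlines())


def file_has_pem_start_line(path):
    """Read the file at `path` (supplied by the caller) and apply the check above."""
    with open(path, "rb") as handle:
        return has_pem_start_line(handle.read())


if __name__ == "__main__":
    # An empty file, as a failed credential request might leave behind.
    print(has_pem_start_line(b""))
    # A file that at least begins like a PEM certificate (body left out).
    print(has_pem_start_line(b"-----BEGIN CERTIFICATE-----\n"))
```

A file that passes this check can still be an invalid or expired proxy; the sketch only covers the specific "no start line" case.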
ID: 716 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 268
Message 718 - Posted: 19 Aug 2015, 21:52:34 UTC
Last modified: 19 Aug 2015, 22:09:51 UTC

Just started a second host, 266. Had to reboot VM to get it out of the previous loop but cmsRun now at >90%. No alt-F5 display on this host either. The cmsRun-stdout log file is OK. IDs are correct.
ID: 718 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 328,405
RAC: 184
Message 719 - Posted: 19 Aug 2015, 21:59:32 UTC - in response to Message 718.  

I just did a fresh install of BOINC, added CMS-dev and it worked. For those who are experiencing issues I would suggest aborting the current task.
ID: 719 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 720 - Posted: 20 Aug 2015, 6:44:02 UTC - in response to Message 719.  

I just did a fresh install of BOINC, added CMS-dev and it worked. For those who are experiencing issues I would suggest aborting the current task.

I started a new BOINC task, and therefore a fresh copy of the master VM, without success.

I get the same sequence starting every 2nd minute:

07:56:01 +0200 2015-08-20 [INFO] Starting CMS Application
07:56:01 +0200 2015-08-20 [INFO] Reading the BOINC volunteer's information
07:56:01 +0200 2015-08-20 [INFO] Volunteer: (38
) Host: 37
07:56:01 +0200 2015-08-20 [INFO] Requesting an X509 credential
07:56:01 +0200 2015-08-20 [INFO] Downloading glidein
07:56:02 +0200 2015-08-20 [INFO] Running glidein (check logs)
07:57:01 +0200 2015-08-20 [INFO] CMS glidein ended


and in cron-stderr, 38 times so far, the line: head: cannot open `file' for reading: No such file or directory
It looks like one line is added with every cycle.

I also have no output on the screens, except 'top' on ALT+F3.
On ALT+F1 there is only: Starting crond: [ OK ]
and the rest is blank.
ID: 720 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 726 - Posted: 20 Aug 2015, 12:26:14 UTC - in response to Message 719.  

I just did a fresh install of BOINC, added CMS-dev and it worked. For those who are experiencing issues I would suggest aborting the current task.

I rebooted my VM which had not worked yesterday and today I successfully ran a 200-record job and it staged out correct results. I didn't suspend anything while it ran.

I will do some investigations shortly on the suspend/resume situation. But so far I can say:

1. When the job was running it had 5 open TCP connections to lcggwms02.gridpp.rl.ac.uk (all on port 9619) and 1 to port 9818.

2. The web logs such as cmsRun-stdout.log and _condor_stdout were written up to the end of the job and the stageout, but were not renewed when the next job started. Maybe I hadn't waited long enough for the next job to begin; so far it's hard to know what's going on due to poor logging (see next point).

3. You badly need a live VM console showing at least the cmsRun-stdout file, plus some files showing the job handling. In general the present VM consoles aren't optimal choices IMHO…
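
On point 1, the connection count can be repeated on other hosts by tallying established TCP connections per remote port. This is a minimal Python sketch that parses text in the usual Linux `netstat -tn` column layout (proto, recv-q, send-q, local address, foreign address, state). It assumes netstat is available in the VM, which has not been checked here. The sample output is made up for illustration (192.0.2.10 is a documentation address, not the real server) and simply mirrors the counts reported above.

```python
from collections import Counter

# Hypothetical netstat -tn output, invented for this example.
SAMPLE = """\
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.0.2.15:40001         192.0.2.10:9619         ESTABLISHED
tcp        0      0 10.0.2.15:40002         192.0.2.10:9619         ESTABLISHED
tcp        0      0 10.0.2.15:40003         192.0.2.10:9619         ESTABLISHED
tcp        0      0 10.0.2.15:40004         192.0.2.10:9619         ESTABLISHED
tcp        0      0 10.0.2.15:40005         192.0.2.10:9619         ESTABLISHED
tcp        0      0 10.0.2.15:40006         192.0.2.10:9818         ESTABLISHED
tcp        0      0 10.0.2.15:40007         192.0.2.10:9619         TIME_WAIT
"""


def established_by_remote_port(netstat_output):
    """Count ESTABLISHED TCP connections per remote port in `netstat -tn` text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a TCP row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "ESTABLISHED":
            continue
        port = fields[4].rsplit(":", 1)[-1]
        if port.isdigit():
            counts[int(port)] += 1
    return counts


if __name__ == "__main__":
    for port, count in sorted(established_by_remote_port(SAMPLE).items()):
        print(f"port {port}: {count} connection(s)")
```

Feeding it real `netstat -tn` output captured while a job is running would show whether other hosts hold the same set of connections.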

Ben
ID: 726 · Rating: 0
tullio
Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 727 - Posted: 20 Aug 2015, 12:46:03 UTC
Last modified: 20 Aug 2015, 12:46:24 UTC

On my Windows 10 Home edition I cannot see the consoles here, at vLHC, or at Atlas, although here I can at least see the logs. I can, however, see the consoles of the CERN Summer Challenge.
Tullio
ID: 727 · Rating: 0

©2024 CERN