Message boards : CMS Application : CMS 46.27 on vLHC looping but doing nothing useful

Yeti

Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 2580 - Posted: 1 Apr 2016, 9:59:36 UTC

Over the last few days I have had the feeling that my CMS task on vLHC is doing nothing useful. Every time I checked the ALT/F5 console inside the VM, it was empty.

ALT/F1 to ALT/F4 show "normal" content, but ALT/F5 is always empty.

I checked the logs, but I'm not sure where to look.

I see log directories run-1 through run-30; the WU has been running for 4 hours and 20 minutes.

run-1 has the most recent timestamp:

http://localhost:55538/logs/run-1/glide_GJSlZn/

StartdLog:

04/01/16 11:46:22 (pid:10314) ******************************************************
04/01/16 11:46:22 (pid:10314) ** condor_startd (CONDOR_STARTD) STARTING UP
04/01/16 11:46:22 (pid:10314) ** /home/boinc/CMSRun/glide_GJSlZn/main/condor/sbin/condor_startd
04/01/16 11:46:22 (pid:10314) ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
04/01/16 11:46:22 (pid:10314) ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
04/01/16 11:46:22 (pid:10314) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
04/01/16 11:46:22 (pid:10314) ** $CondorPlatform: x86_64_RedHat5 $
04/01/16 11:46:22 (pid:10314) ** PID = 10314
04/01/16 11:46:22 (pid:10314) ** Log last touched time unavailable (No such file or directory)
04/01/16 11:46:22 (pid:10314) ******************************************************
04/01/16 11:46:22 (pid:10314) Using config source: /home/boinc/CMSRun/glide_GJSlZn/condor_config
04/01/16 11:46:22 (pid:10314) config Macros = 213, Sorted = 213, StringBytes = 10679, TablesBytes = 7708
04/01/16 11:46:22 (pid:10314) CLASSAD_CACHING is ENABLED
04/01/16 11:46:22 (pid:10314) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
04/01/16 11:46:22 (pid:10314) DaemonCore: command socket at <10.0.2.15:34546?noUDP>
04/01/16 11:46:22 (pid:10314) DaemonCore: private command socket at <10.0.2.15:34546>
04/01/16 11:46:22 (pid:10314) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#152909
04/01/16 11:46:22 (pid:10314) my_popenv failed
04/01/16 11:46:22 (pid:10314) Failed to run hibernation plugin '/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/condor_power_state ad'
04/01/16 11:46:22 (pid:10314) VM-gahp server reported an internal error
04/01/16 11:46:22 (pid:10314) VM universe will be tested to check if it is available
04/01/16 11:46:22 (pid:10314) History file rotation is enabled.
04/01/16 11:46:22 (pid:10314) Maximum history file size is: 20971520 bytes
04/01/16 11:46:22 (pid:10314) Number of rotated history files is: 2
04/01/16 11:46:22 (pid:10314) Allocating auto shares for slot type 1: Cpus: 1.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 1.000000, Memory: 2002, Swap: 100.00%, Disk: 100.00%
04/01/16 11:46:22 (pid:10314) New machine resource of type 1 allocated
04/01/16 11:46:22 (pid:10314) Setting up slot pairings
04/01/16 11:46:22 (pid:10314) my_popenv failed
04/01/16 11:46:22 (pid:10314) Adding 'mips' to the Supplimental ClassAd list
04/01/16 11:46:22 (pid:10314) CronJobList: Adding job 'mips'
04/01/16 11:46:22 (pid:10314) Adding 'kflops' to the Supplimental ClassAd list
04/01/16 11:46:22 (pid:10314) CronJobList: Adding job 'kflops'
04/01/16 11:46:22 (pid:10314) CronJob: Initializing job 'mips' (/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/condor_mips)
04/01/16 11:46:22 (pid:10314) CronJob: Initializing job 'kflops' (/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/condor_kflops)
04/01/16 11:46:22 (pid:10314) State change: IS_OWNER is false
04/01/16 11:46:22 (pid:10314) Changing state: Owner -> Unclaimed
04/01/16 11:46:22 (pid:10314) State change: RunBenchmarks is TRUE
04/01/16 11:46:22 (pid:10314) Changing activity: Idle -> Benchmarking
04/01/16 11:46:22 (pid:10314) BenchMgr:StartBenchmarks()
04/01/16 11:46:40 (pid:10314) State change: benchmarks completed
04/01/16 11:46:40 (pid:10314) Changing activity: Benchmarking -> Idle
04/01/16 11:52:17 (pid:10314) No resources have been claimed for 30 seconds
04/01/16 11:52:17 (pid:10314) Shutting down Condor on this machine.
04/01/16 11:52:17 (pid:10314) Got SIGTERM. Performing graceful shutdown.
04/01/16 11:52:17 (pid:10314) shutdown graceful
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) Killing job mips
04/01/16 11:52:17 (pid:10314) Killing job kflops
04/01/16 11:52:17 (pid:10314) Deleting cron job manager
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting all jobs
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting all jobs
04/01/16 11:52:17 (pid:10314) Deleting benchmark job mgr
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) Killing job mips
04/01/16 11:52:17 (pid:10314) Killing job kflops
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) Killing job mips
04/01/16 11:52:17 (pid:10314) Killing job kflops
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting all jobs
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting job 'mips'
04/01/16 11:52:17 (pid:10314) CronJob: Deleting job 'mips' (/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/condor_mips), timer -1
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting job 'kflops'
04/01/16 11:52:17 (pid:10314) CronJob: Deleting job 'kflops' (/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/condor_kflops), timer -1
04/01/16 11:52:17 (pid:10314) Cron: Killing all jobs
04/01/16 11:52:17 (pid:10314) CronJobList: Deleting all jobs
04/01/16 11:52:17 (pid:10314) All resources are free, exiting.
04/01/16 11:52:17 (pid:10314) **** condor_startd (condor_STARTD) pid 10314 EXITING WITH STATUS 0
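If you are not sure where to look, the StartdLog is probably the quickest place to tell whether a glidein ever received a job. Below is a minimal Python sketch, assuming you have copied the log text out of a run-N directory yourself; it flags the idle-shutdown message seen above. The helper name `summarize_startd_log` is hypothetical, and the `-> Claimed` pattern is an assumption about how HTCondor records a claim, since this excerpt never shows one.

```python
import re

# "No resources have been claimed for" and "EXITING WITH STATUS 0" are copied
# from the StartdLog excerpt above. The "-> Claimed" pattern is an ASSUMPTION
# about how a claim would be logged; this excerpt never contains one.
IDLE_SHUTDOWN = "No resources have been claimed for"
CLEAN_EXIT = "EXITING WITH STATUS 0"
CLAIM_PATTERN = re.compile(r"->\s*Claimed\b")  # does not match "-> Unclaimed"


def summarize_startd_log(text: str) -> dict:
    """Hypothetical helper: was this glidein's startd ever claimed, or did it exit idle?"""
    lines = text.splitlines()
    return {
        "claimed": any(CLAIM_PATTERN.search(line) for line in lines),
        "idle_shutdown": any(IDLE_SHUTDOWN in line for line in lines),
        "clean_exit": any(CLEAN_EXIT in line for line in lines),
    }


if __name__ == "__main__":
    # A few lines taken from the StartdLog above.
    sample = """\
04/01/16 11:46:22 (pid:10314) Changing state: Owner -> Unclaimed
04/01/16 11:46:40 (pid:10314) Changing activity: Benchmarking -> Idle
04/01/16 11:52:17 (pid:10314) No resources have been claimed for 30 seconds
04/01/16 11:52:17 (pid:10314) **** condor_startd (condor_STARTD) pid 10314 EXITING WITH STATUS 0
"""
    print(summarize_startd_log(sample))
```

For this run the result is claimed=False and idle_shutdown=True, i.e. the glidein started, benchmarked, waited for work, got none and exited cleanly.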

StarterLog:

04/01/16 11:46:22 (pid:10318) FILETRANSFER: "/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
04/01/16 11:46:22 (pid:10318) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
04/01/16 11:46:22 (pid:10318) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_GJSlZn/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring

MasterLog:

04/01/16 11:46:21 (pid:10311) ******************************************************
04/01/16 11:46:21 (pid:10311) ** condor_master (CONDOR_MASTER) STARTING UP
04/01/16 11:46:21 (pid:10311) ** /home/boinc/CMSRun/glide_GJSlZn/main/condor/sbin/condor_master
04/01/16 11:46:21 (pid:10311) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
04/01/16 11:46:21 (pid:10311) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
04/01/16 11:46:21 (pid:10311) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
04/01/16 11:46:21 (pid:10311) ** $CondorPlatform: x86_64_RedHat5 $
04/01/16 11:46:21 (pid:10311) ** PID = 10311
04/01/16 11:46:21 (pid:10311) ** Log last touched time unavailable (No such file or directory)
04/01/16 11:46:21 (pid:10311) ******************************************************
04/01/16 11:46:21 (pid:10311) Using config source: /home/boinc/CMSRun/glide_GJSlZn/condor_config
04/01/16 11:46:21 (pid:10311) config Macros = 212, Sorted = 212, StringBytes = 10635, TablesBytes = 7672
04/01/16 11:46:21 (pid:10311) CLASSAD_CACHING is OFF
04/01/16 11:46:21 (pid:10311) Daemon Log is logging: D_ALWAYS D_ERROR
04/01/16 11:46:21 (pid:10311) DaemonCore: command socket at <10.0.2.15:32779?noUDP>
04/01/16 11:46:21 (pid:10311) DaemonCore: private command socket at <10.0.2.15:32779>
04/01/16 11:46:22 (pid:10311) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#152908
04/01/16 11:46:22 (pid:10311) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_GJSlZn/main/condor/sbin/condor_master (mtime:1459503969)
04/01/16 11:46:22 (pid:10311) Started DaemonCore process "/home/boinc/CMSRun/glide_GJSlZn/main/condor/sbin/condor_startd", pid and pgroup = 10314
04/01/16 11:52:17 (pid:10311) Got SIGTERM. Performing graceful shutdown.
04/01/16 11:52:17 (pid:10311) condor_write(): Socket closed when trying to write 409 bytes to collector lcggwms02.gridpp.rl.ac.uk:9621, fd is 10
04/01/16 11:52:17 (pid:10311) Buf::write(): condor_write() failed
04/01/16 11:52:17 (pid:10311) Sent SIGTERM to STARTD (pid 10314)
04/01/16 11:52:17 (pid:10311) AllReaper unexpectedly called on pid 10314, status 0.
04/01/16 11:52:17 (pid:10311) The STARTD (pid 10314) exited with status 0
04/01/16 11:52:17 (pid:10311) All daemons are gone. Exiting.
04/01/16 11:52:17 (pid:10311) **** condor_master (condor_MASTER) pid 10311 EXITING WITH STATUS 0
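The MasterLog timestamps show how long this glidein lived: it started at 11:46:21 and exited at 11:52:17, so just under six minutes. A small Python sketch that computes this from any daemon log excerpt, assuming every timestamped line begins with `MM/DD/YY HH:MM:SS` as in the logs above; the helper name `glidein_lifetime` is hypothetical.

```python
from datetime import datetime, timedelta

# Daemon log lines in the excerpts above start with "MM/DD/YY HH:MM:SS" (17 chars).
TIMESTAMP_FORMAT = "%m/%d/%y %H:%M:%S"
TIMESTAMP_WIDTH = 17


def glidein_lifetime(text: str) -> timedelta:
    """Hypothetical helper: time between the earliest and latest timestamped lines."""
    stamps = []
    for line in text.splitlines():
        try:
            stamps.append(datetime.strptime(line[:TIMESTAMP_WIDTH], TIMESTAMP_FORMAT))
        except ValueError:
            continue  # skip lines without a leading timestamp (e.g. "slot type 1: ...")
    if not stamps:
        raise ValueError("no timestamped lines found")
    return max(stamps) - min(stamps)


if __name__ == "__main__":
    master_excerpt = """\
04/01/16 11:46:21 (pid:10311) ** condor_master (CONDOR_MASTER) STARTING UP
04/01/16 11:52:17 (pid:10311) **** condor_master (condor_MASTER) pid 10311 EXITING WITH STATUS 0
"""
    print(glidein_lifetime(master_excerpt))  # 0:05:56
```

Running it over each run-N directory would show whether every glidein in the task followed the same short start/idle/exit cycle.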
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2581 - Posted: 4 Apr 2016, 16:38:56 UTC - in response to Message 2580.  
Last modified: 4 Apr 2016, 16:39:51 UTC

There has not been any work for CMS tasks in over a week.
Tasks are supposed to shut down when there is no work.
The fact that it kept running for over 4 hours is concerning and needs to be looked at.

You can check here whether any work is available:

http://dashb-cms-job-task.cern.ch/dashboard/templates/task-analysis/#user=ivan+reid&refresh=0&table=Mains&p=1&records=25&activemenu=1&pattern=&task=&from=&till=&timerange=lastMonth
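To illustrate the expected behaviour mentioned above (tasks shutting down when there is no work), here is a purely hypothetical Python check. The real vLHC/CMS shutdown logic is not described in this thread; the function name `should_stop` and the threshold `max_idle_runs=3` are arbitrary assumptions. The input is one "was this run claimed?" flag per run-N directory, newest first (run-1 is the most recent, as noted in the first post).

```python
def should_stop(claimed_per_run: list[bool], max_idle_runs: int = 3) -> bool:
    """Hypothetical policy: stop once the newest `max_idle_runs` runs were all unclaimed.

    `claimed_per_run` is ordered newest first (index 0 corresponds to run-1).
    """
    if max_idle_runs <= 0:
        raise ValueError("max_idle_runs must be positive")
    recent = claimed_per_run[:max_idle_runs]
    # Require enough history, and no claim in any of the most recent runs.
    return len(recent) == max_idle_runs and not any(recent)


if __name__ == "__main__":
    # 30 runs, none of which ever got a job, as the logs above suggest.
    print(should_stop([False] * 30))
```

Under such a rule the task would have ended after the first few idle runs instead of looping through 30 of them.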



©2024 CERN