Thread 'No new jobs'

Author	Message
Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2571 - Posted: 26 Mar 2016, 16:20:58 UTC Last modified: 26 Mar 2016, 16:21:41 UTC The tasks are running for 10 min or so, finish, and the next one starts. I am not sure that this "No Jobs available" behavior is desirable. Maybe it should be reviewed. Limiting the server to provide no more new tasks for, maybe, 30 min would be better.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2572 - Posted: 26 Mar 2016, 16:30:19 UTC - in response to Message 2571. "The tasks are running for 10 min or so, finish, and the next one starts. I am not sure that this "No Jobs available" behavior is desirable. Maybe it should be reviewed. Limiting the server to provide no more new tasks for, maybe, 30 min would be better." Hmm, that might be why we ran out more quickly than I expected. Unfortunately, many people at CERN (and elsewhere) are on holiday until the 4th of April. We may not get this sorted quickly. :-( I'll see what clues I can find on the Condor server.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2573 - Posted: 26 Mar 2016, 16:41:10 UTC - in response to Message 2571. Ah, I think I misinterpreted your post. If you're talking about the current problem user, I did send another e-mail today. My logs don't show any errors from him since 0605Z today. Dashboard shows three failures since then, none an 8002 and none from his IP. My comments about lack of jobs still stand, unfortunately.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2574 - Posted: 26 Mar 2016, 16:46:04 UTC - in response to Message 2573. Last modified: 26 Mar 2016, 16:46:35 UTC Neither. I was referring to the (BOINC) behavior when no jobs are available. But first things first: see if you can submit more work.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2575 - Posted: 26 Mar 2016, 21:48:55 UTC - in response to Message 2574. Yes, the correct behaviour is to pass this information back to the BOINC client and tell it to go work on another project for a while. This is another optimization to be done after the other issues have been addressed.
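
The behaviour discussed in this exchange can be sketched in a few lines. This is only an illustration under stated assumptions: the function and constant names are hypothetical and are not taken from BOINC or the CMS@Home server, the 60-second figure is an arbitrary placeholder, and the 30-minute hold-off is simply the value suggested in Message 2571.

```python
# Hypothetical sketch: none of these names come from BOINC or the CMS@Home server.
BUSY_DELAY_S = 60        # assumed retry delay while jobs are queued (placeholder value)
EMPTY_DELAY_S = 30 * 60  # the "maybe 30 min" hold-off suggested in Message 2571


def request_delay(queued_jobs: int) -> int:
    """Return how many seconds a client should wait before asking for work again."""
    if queued_jobs < 0:
        raise ValueError("queued_jobs cannot be negative")
    # With an empty queue, hold clients off so they can work on other projects
    # instead of starting a task that finishes after a few minutes with nothing to do.
    return BUSY_DELAY_S if queued_jobs > 0 else EMPTY_DELAY_S


if __name__ == "__main__":
    for queued in (10000, 0):
        print(f"{queued} queued -> retry in {request_delay(queued)} s")
```

The point of the design is that the decision is made once on the server side, where the queue depth is known, rather than by each task discovering the empty queue for itself.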

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2578 - Posted: 31 Mar 2016, 10:13:02 UTC Sorry for the lack of communication; it was, well, Easter (our uni closes for a week), and of course there was nothing I could do but wait for the experts to reappear. There are enough around now that the cause of the problem has been identified, but there is no workaround yet (once again, CMS@Home is a corner case that stresses mainstream assumptions). More news soon, I hope.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2582 - Posted: 4 Apr 2016, 16:42:11 UTC There has been no work for over a week now. It is sad to see so much time wasted. Any chance of work anytime soon?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2583 - Posted: 4 Apr 2016, 20:49:49 UTC - in response to Message 2582. Here is an update on the current situation. A software update of CRAB to improve infrastructure validation on job submission exposed a logic bug between one of the client libraries that we use and the storage system. The result is that we can't submit any CMS jobs using CRAB until an update is available. It didn't help that this hit us just before the Easter weekend, when many people, including myself, took vacation. There is a potential workaround under investigation, so there is a chance that we could have some more jobs soon.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2586 - Posted: 5 Apr 2016, 12:32:20 UTC In fact, the workaround finally took effect, after an inexplicable delay. I've managed to run a CRAB submission and jobs are arriving at the Condor server. Currently there are no glide-ins there to service them but I expect that will be rectified soon. :-)

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2587 - Posted: 5 Apr 2016, 13:05:48 UTC Last modified: 5 Apr 2016, 13:19:19 UTC The VMs don't get to the point of fetching a job. Pending 10000, Running 0. Tasks run for a few minutes and are then killed. Glidein Run 1 ends after 50 seconds. Whole boot.log:
Tue Apr 5 15:01:16 2016: Setting hostname localhost: [ OK ]
Tue Apr 5 15:01:16 2016: Setting up Logical Volume Management: No volume groups found
Tue Apr 5 15:01:16 2016: [ OK ]
Tue Apr 5 15:01:16 2016: Checking filesystems
Tue Apr 5 15:01:16 2016: [ OK ]
Tue Apr 5 15:01:17 2016: Mounting local filesystems: [ OK ]
Tue Apr 5 15:01:17 2016: Enabling local filesystem quotas: [ OK ]
Tue Apr 5 15:01:17 2016: Enabling /etc/fstab swaps: [ OK ]
Tue Apr 5 15:01:17 2016: Entering non-interactive startup
Tue Apr 5 15:01:17 2016: Starting vmcontext_prolog...
Tue Apr 5 15:01:17 2016:
Tue Apr 5 15:01:17 2016: Bringing up loopback interface: [ OK ]
Tue Apr 5 15:01:18 2016: Bringing up interface eth0:
Tue Apr 5 15:01:18 2016: Determining IP information for eth0... done.
Tue Apr 5 15:01:21 2016: [ OK ]
Tue Apr 5 15:01:21 2016: Starting auditd: [ OK ]
Tue Apr 5 15:01:21 2016: Starting portreserve: [ OK ]
Tue Apr 5 15:01:21 2016: Starting system logger: [ OK ]
Tue Apr 5 15:01:21 2016: Starting irqbalance: [ OK ]
Tue Apr 5 15:01:21 2016: Starting rpcbind: [ OK ]
Tue Apr 5 15:01:24 2016: Starting CernVM: [ OK ]
Tue Apr 5 15:01:26 2016: Starting system message bus: [ OK ]
Tue Apr 5 15:01:26 2016: Starting Avahi daemon... [ OK ]
Tue Apr 5 15:01:26 2016: Starting NFS statd: [ OK ]
Tue Apr 5 15:01:27 2016: Starting cups: [ OK ]
Tue Apr 5 15:01:27 2016: Mounting filesystems: [ OK ]
Tue Apr 5 15:01:27 2016: Starting acpi daemon: [ OK ]
Tue Apr 5 15:01:27 2016: Starting HAL daemon: [ OK ]
Tue Apr 5 15:01:27 2016: Retrigger failed udev events [ OK ]
Tue Apr 5 15:01:27 2016: Applying noop elevator: sda [ OK ]
Tue Apr 5 15:01:27 2016: Applying ktune sysctl settings:
Tue Apr 5 15:01:27 2016: /etc/tune-profiles/cernvm/sysctl.ktune: [ OK ]
Tue Apr 5 15:01:28 2016: Applying sysctl settings from /etc/sysctl.conf
Tue Apr 5 15:01:29 2016: Starting automount: [ OK ]
Tue Apr 5 15:01:29 2016: Starting the VirtualBox Guest Additions [ OK ]
Tue Apr 5 15:01:29 2016: Starting VirtualBox Guest Addition service [ OK ]
Tue Apr 5 15:01:29 2016: Starting vmcontext_hepix... [WARNING]
Tue Apr 5 15:01:33 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'init-local' at Tue, 05 Apr 2016 13:01:35 +0000. Up 38.89 seconds.
Tue Apr 5 15:01:35 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'init' at Tue, 05 Apr 2016 13:01:36 +0000. Up 40.08 seconds.
Tue Apr 5 15:01:36 2016: ci-info: +++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: | Device | Up   | Address   | Mask          | Hw-Address        |
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: | sit0   | True | .         | .             | .                 |
Tue Apr 5 15:01:36 2016: ci-info: | lo     | True | 127.0.0.1 | 255.0.0.0     | .                 |
Tue Apr 5 15:01:36 2016: ci-info: | eth0   | True | 10.0.2.15 | 255.255.255.0 | 08:00:27:c9:a7:9f |
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: ++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:36 2016: ci-info: | Route | Destination | Gateway  | Genmask       | Interface | Flags |
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:36 2016: ci-info: | 0     | 0.0.0.0     | 10.0.2.2 | 0.0.0.0       | eth0      | UG    |
Tue Apr 5 15:01:36 2016: ci-info: | 1     | 10.0.2.0    | 0.0.0.0  | 255.255.255.0 | eth0      | U     |
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:37 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'modules:config' at Tue, 05 Apr 2016 13:01:38 +0000. Up 41.97 seconds.
Tue Apr 5 15:01:38 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'modules:final' at Tue, 05 Apr 2016 13:01:39 +0000. Up 43.20 seconds.
Tue Apr 5 15:01:39 2016: ci-info: no authorized ssh keys fingerprints found for user cloud-user.
Tue Apr 5 15:01:39 2016: ci-info: no authorized ssh keys fingerprints found for user cloud-user.
Tue Apr 5 15:01:40 2016: ec2:
Tue Apr 5 15:01:40 2016: ec2: #############################################################
Tue Apr 5 15:01:40 2016: ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
Tue Apr 5 15:01:40 2016: ec2: 1024 63:fd:ec:d0:4c:7c:c7:33:a3:76:ba:91:00:33:c5:a9 /etc/ssh/ssh_host_dsa_key.pub (DSA)
Tue Apr 5 15:01:40 2016: ec2: 2048 29:09:a4:29:c8:5c:0d:2c:be:76:59:a2:d8:19:7e:7b /etc/ssh/ssh_host_key.pub (RSA1)
Tue Apr 5 15:01:40 2016: ec2: 2048 b9:56:c4:b4:c4:6a:33:4e:ba:14:6f:02:43:bd:50:7a /etc/ssh/ssh_host_rsa_key.pub (RSA)
Tue Apr 5 15:01:40 2016: ec2: -----END SSH HOST KEY FINGERPRINTS-----
Tue Apr 5 15:01:40 2016: ec2: #############################################################
Tue Apr 5 15:01:40 2016: -----BEGIN SSH HOST KEY KEYS-----
Tue Apr 5 15:01:40 2016: 2048 35 2842589064180712032266612294911096332459607665467558789444438190688324072035463210195132798387809895683854959526209889002429436703268070385056689521517555599526571018618029304994073059118791852612652893866694532608954246064760952629308155011078661
Tue Apr 5 15:01:40 2016: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3fWeGY6pgNgcPM9JjmrJb4u4qqBCrKKFA7BeH9F7x49HKexOmrqeUH1SDM3oIIMre5BIS34hZClNWQjNIZplsDasnmziWe9FugJAaH0lVZ+DJTjlg54hG5gDesndNui7/xGvB6QlDvPo35ug5E9Zd24EfN//7ZGCbqSBYKFGIljFq31Bq/xI1+kYDVFK5jXY9fNsiry6uU7RnHdnhXlevjge72o
Tue Apr 5 15:01:40 2016: -----END SSH HOST KEY KEYS-----
Tue Apr 5 15:01:40 2016: Cloud-init v. 0.7.4 finished at Tue, 05 Apr 2016 13:01:40 +0000. Datasource DataSourceNone. Up 43.55 seconds
Tue Apr 5 15:01:40 2016: Starting ntpd: [ OK ]
Tue Apr 5 15:01:40 2016:
Tue Apr 5 15:01:40 2016: Starting console mouse services: [ OK ]
Tue Apr 5 15:01:40 2016: Starting tuned: [ OK ]
Tue Apr 5 15:01:40 2016: Starting crond: [ OK ]
Tue Apr 5 15:01:40 2016: Starting atd: [ OK ]
Tue Apr 5 15:01:40 2016: Running CernVM context boot hooks:
Tue Apr 5 15:01:40 2016: [ OK ]
Tue Apr 5 15:01:45 2016: rm: cannot remove `/etc/grid-security/certificates': Is a directory

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2588 - Posted: 5 Apr 2016, 13:31:01 UTC - in response to Message 2587. It never just rains, it pours! A server certificate expired on 04/04/16 18:19. :(
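
An expiry like this one is the sort of thing a scheduled check can flag in advance. The following is a minimal sketch, assuming the certificate's notAfter string is available in the format that Python's ssl module reports from getpeercert(); the function names and the 14-day warning window are hypothetical, and the timestamp is the one quoted in this post, treated as GMT with zero seconds purely for illustration.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Expiry quoted in Message 2588, written in the notAfter format used by
# ssl.SSLSocket.getpeercert(). Time zone and seconds are assumptions.
NOT_AFTER = "Apr  4 18:19:00 2016 GMT"


def time_until_expiry(not_after: str, now: datetime) -> timedelta:
    """Return the time left before the certificate expires (negative if expired)."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return expires - now


def needs_renewal(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when the certificate has expired or will expire within warn_days."""
    return time_until_expiry(not_after, now) <= timedelta(days=warn_days)


if __name__ == "__main__":
    # The date the thread started, nine days before the expiry.
    thread_start = datetime(2016, 3, 26, tzinfo=timezone.utc)
    print(needs_renewal(NOT_AFTER, thread_start))
```

In practice the notAfter value would be read from the live server certificate rather than hard-coded; that step is left out here so the sketch runs without network access.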

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,772 RAC: 141	Message 2589 - Posted: 5 Apr 2016, 14:02:59 UTC - in response to Message 2588. Last modified: 5 Apr 2016, 14:16:52 UTC "It never just rains, it pours!" Dry spell or summer at the end: tasks are running longer, cvmfs2 is doing something and, more importantly, on the dashboard I see 3 jobs running from Ivan's batch. However, there are also jobs from Hassen Riahi wmagent-batch 160401_110444_7709 around. I got 2 of those and 1 for Ivan (sorry Ivan ;) )

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2590 - Posted: 5 Apr 2016, 16:51:05 UTC S'OK, we're not in competition. In fact, those jobs have to take over from mine before we can go into production mode.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2591 - Posted: 6 Apr 2016, 2:12:26 UTC Last modified: 6 Apr 2016, 2:15:15 UTC It looks as though the VM is getting jobs OK, condor_stdout ends:-
== DIR: run_and_lumis.tar.gz
== DIR: sandbox.tar.gz
==== Local directory contents dump FINISHING ====
======== CMSRunAnalysis.py STARTING at Wed Apr 6 01:31:31 GMT 2016 ========
Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_y6Ysc9/execute/dir_8347...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_y6Ysc9/execute/dir_8347 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=373 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_373.txt --runAndLumis=job_lumis_373.json --lheInputFiles=False --firstEvent=93001 --firstLumi=1117 --lastEvent=93251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
but top doesn't show cmsRun starting. The job is killed after a few minutes. stderr ends like this:-
2016-04-06 01:25:12 (21023): Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
2016-04-06 01:25:25 (21023): Guest Log: [INFO] Downloading glidein
2016-04-06 01:25:33 (21023): Guest Log: [INFO] Running glidein (check logs)
2016-04-06 01:49:02 (21023): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-06 01:50:01 (21023): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-06 01:50:02 (21023): Guest Log: Log extracts for Run 1 jobs
2016-04-06 01:50:02 (21023): Guest Log: [INFO] No output. Shutting down!
2016-04-06 01:50:02 (21023): VM Completion File Detected.
2016-04-06 01:50:02 (21023): Powering off VM.
2016-04-06 01:50:03 (21023): Successfully stopped VM.
2016-04-06 01:50:03 (21023): Deregistering VM. (boinc_fc98c9c163bd3c37, slot#3)
2016-04-06 01:50:03 (21023): Removing network bandwidth throttle group from VM.
2016-04-06 01:50:04 (21023): Removing storage controller(s) from VM.
2016-04-06 01:50:04 (21023): Removing VM from VirtualBox.
2016-04-06 01:50:04 (21023): Removing virtual disk drive from VirtualBox.
01:50:09 (21023): called boinc_finish(0)
Boot log ends:-
Wed Apr 6 02:05:50 2016: -----END SSH HOST KEY KEYS-----
Wed Apr 6 02:05:50 2016: Cloud-init v. 0.7.4 finished at Wed, 06 Apr 2016 01:05:50 +0000. Datasource DataSourceNone. Up 82.10 seconds
Wed Apr 6 02:05:50 2016: Starting ntpd: [ OK ]
Wed Apr 6 02:05:51 2016:
Wed Apr 6 02:05:51 2016: Starting console mouse services: [ OK ]
Wed Apr 6 02:05:51 2016: Starting tuned: [ OK ]
Wed Apr 6 02:05:52 2016: Starting crond: [ OK ]
Wed Apr 6 02:05:53 2016: Starting atd: [ OK ]
Wed Apr 6 02:05:53 2016: Running CernVM context boot hooks:
Wed Apr 6 02:05:53 2016: [ OK ]
Wed Apr 6 02:06:14 2016: rm: cannot remove `/etc/grid-security/certificates': Is a directory

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2592 - Posted: 6 Apr 2016, 7:45:03 UTC - in response to Message 2591. Is it still doing that? I don't have any logs from you in this batch, though you have had a task outstanding since 0103Z. Maybe a project reset is in order.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2596 - Posted: 7 Apr 2016, 1:27:02 UTC - in response to Message 2592. "Is it still doing that?" No. Three jobs (2068, 2087 and 2129) are running, OK so far.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2597 - Posted: 7 Apr 2016, 7:16:16 UTC - in response to Message 2596. Good! All three machines are returning good jobs.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2623 - Posted: 9 Apr 2016, 15:54:26 UTC I believe we are going to need a new batch sometime tomorrow morning.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2625 - Posted: 10 Apr 2016, 7:01:28 UTC Thanks for putting a new batch on, Ivan. However, we are only running at 1/3 capacity, as T4T has no more CMS-Simulation BOINC tasks.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 2626 - Posted: 10 Apr 2016, 10:00:06 UTC - in response to Message 2625. That appears to have been fixed. I sent Laurence et al. an email early this morning -- I'd had a late night...

Development for LHC@home