Message boards : News : No new jobs
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The tasks are running for 10min or so, finish, next one starts. I am not sure if that "No Jobs available" behavior is desirable. Maybe it should be reviewed. Having the server hand out no new tasks for, say, 30 min would be better. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
"The tasks are running for 10min or so, finish, next one starts." Hmm, that might be why we ran out more quickly than I expected. Unfortunately, many people at CERN (and elsewhere) are on holidays until the 4th April. We may not get this sorted quickly. :-( I'll see what clues I can find on the Condor server. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
Ah, I think I misinterpreted your post. If you're talking about the current problem user, I did send another e-mail today. My logs don't show any errors from him since 0605Z today. Dashboard shows three failures since then, none an 8002 and none from his IP. My comments about lack of jobs still stand, unfortunately. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Neither. I was referring to the (BOINC) behavior when no jobs are available. But first things first: see if you can submit more work. |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
Yes, the correct behaviour is to pass this information back to the BOINC client and tell it to go and work on another project for a while. That is another optimization to be done after the other issues have been addressed. |
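The back-off idea in this exchange can be sketched roughly as follows. This is a minimal illustration in Python, not BOINC's scheduler code: the function name `next_request_delay`, the doubling policy and the four-hour cap are assumptions made for illustration, and the 30-minute base is simply the figure suggested earlier in the thread.

```python
# Hypothetical sketch: how long a client could be told to wait before
# asking this project for work again. None of these names come from BOINC.
BASE_DELAY_S = 30 * 60       # 30 min, the figure suggested in the thread
MAX_DELAY_S = 4 * 60 * 60    # assumed cap, so hosts return once work is back


def next_request_delay(jobs_available: int, empty_replies: int) -> int:
    """Return the delay in seconds to hand back to the client.

    jobs_available: jobs currently queued on the server.
    empty_replies:  consecutive "no jobs" replies this host has received.
    """
    if jobs_available > 0:
        return 0  # work exists, so no back-off is needed
    # Double the delay for each consecutive empty reply, up to the cap.
    delay = BASE_DELAY_S * (2 ** empty_replies)
    return min(delay, MAX_DELAY_S)


if __name__ == "__main__":
    for empty in range(5):
        print(empty, next_request_delay(0, empty))
```

With a policy like this, a host would be free to spend the waiting time on other projects instead of starting tasks that end after ten minutes with nothing to do.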
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
Sorry for the lack of communication; it was, well, Easter (our uni closes for a week), and of course there was nothing I could do but wait for experts to re-appear. There are enough around now that the cause of the problem has been identified, but not a work-around yet (once again, CMS@Home is a corner-case that stresses mainstream assumptions). More news soon, I hope. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There has been no work for over a week now. It is sad to see so much time wasted. Any chance of work anytime soon? |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
Here is an update on the current situation. A software update of CRAB to improve infrastructure validation on job submission exposed a logic bug between one of the client libraries that we use and the storage system. The result is that we can't submit any CMS jobs using CRAB until an update is available. It didn't help that this hit us just before the Easter weekend, when many people, including myself, took vacation. There is a potential workaround and it is being investigated, so there is a chance that we could have some more jobs soon. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
In fact, the workaround finally took effect, after an inexplicable delay. I've managed to run a CRAB submission and jobs are arriving at the Condor server. Currently there are no glide-ins there to service them but I expect that will be rectified soon. :-) |
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
The VMs don't get to the point of fetching a job. Pending 10000, Running 0. Tasks run for a few minutes and are then killed. Glidein Run 1 ends after 50 seconds. Whole boot.log:
Tue Apr 5 15:01:16 2016: Setting hostname localhost: [ OK ]
Tue Apr 5 15:01:16 2016: Setting up Logical Volume Management: No volume groups found
Tue Apr 5 15:01:16 2016: [ OK ]
Tue Apr 5 15:01:16 2016: Checking filesystems
Tue Apr 5 15:01:16 2016: [ OK ]
Tue Apr 5 15:01:17 2016: Mounting local filesystems: [ OK ]
Tue Apr 5 15:01:17 2016: Enabling local filesystem quotas: [ OK ]
Tue Apr 5 15:01:17 2016: Enabling /etc/fstab swaps: [ OK ]
Tue Apr 5 15:01:17 2016: Entering non-interactive startup
Tue Apr 5 15:01:17 2016: Starting vmcontext_prolog...
Tue Apr 5 15:01:17 2016:
Tue Apr 5 15:01:17 2016: Bringing up loopback interface: [ OK ]
Tue Apr 5 15:01:18 2016: Bringing up interface eth0:
Tue Apr 5 15:01:18 2016: Determining IP information for eth0... done.
Tue Apr 5 15:01:21 2016: [ OK ]
Tue Apr 5 15:01:21 2016: Starting auditd: [ OK ]
Tue Apr 5 15:01:21 2016: Starting portreserve: [ OK ]
Tue Apr 5 15:01:21 2016: Starting system logger: [ OK ]
Tue Apr 5 15:01:21 2016: Starting irqbalance: [ OK ]
Tue Apr 5 15:01:21 2016: Starting rpcbind: [ OK ]
Tue Apr 5 15:01:24 2016: Starting CernVM: [ OK ]
Tue Apr 5 15:01:26 2016: Starting system message bus: [ OK ]
Tue Apr 5 15:01:26 2016: Starting Avahi daemon... [ OK ]
Tue Apr 5 15:01:26 2016: Starting NFS statd: [ OK ]
Tue Apr 5 15:01:27 2016: Starting cups: [ OK ]
Tue Apr 5 15:01:27 2016: Mounting filesystems: [ OK ]
Tue Apr 5 15:01:27 2016: Starting acpi daemon: [ OK ]
Tue Apr 5 15:01:27 2016: Starting HAL daemon: [ OK ]
Tue Apr 5 15:01:27 2016: Retrigger failed udev events [ OK ]
Tue Apr 5 15:01:27 2016: Applying noop elevator: sda [ OK ]
Tue Apr 5 15:01:27 2016: Applying ktune sysctl settings:
Tue Apr 5 15:01:27 2016: /etc/tune-profiles/cernvm/sysctl.ktune: [ OK ]
Tue Apr 5 15:01:28 2016: Applying sysctl settings from /etc/sysctl.conf
Tue Apr 5 15:01:29 2016: Starting automount: [ OK ]
Tue Apr 5 15:01:29 2016: Starting the VirtualBox Guest Additions [ OK ]
Tue Apr 5 15:01:29 2016: Starting VirtualBox Guest Addition service [ OK ]
Tue Apr 5 15:01:29 2016: Starting vmcontext_hepix... [WARNING]
Tue Apr 5 15:01:33 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'init-local' at Tue, 05 Apr 2016 13:01:35 +0000. Up 38.89 seconds.
Tue Apr 5 15:01:35 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'init' at Tue, 05 Apr 2016 13:01:36 +0000. Up 40.08 seconds.
Tue Apr 5 15:01:36 2016: ci-info: +++++++++++++++++++++++++Net device info+++++++++++++++++++++++++
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: | Device | Up | Address | Mask | Hw-Address |
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: | sit0 | True | . | . | . |
Tue Apr 5 15:01:36 2016: ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | . |
Tue Apr 5 15:01:36 2016: ci-info: | eth0 | True | 10.0.2.15 | 255.255.255.0 | 08:00:27:c9:a7:9f |
Tue Apr 5 15:01:36 2016: ci-info: +--------+------+-----------+---------------+-------------------+
Tue Apr 5 15:01:36 2016: ci-info: ++++++++++++++++++++++++++++++Route info++++++++++++++++++++++++++++++
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:36 2016: ci-info: | Route | Destination | Gateway | Genmask | Interface | Flags |
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:36 2016: ci-info: | 0 | 0.0.0.0 | 10.0.2.2 | 0.0.0.0 | eth0 | UG |
Tue Apr 5 15:01:36 2016: ci-info: | 1 | 10.0.2.0 | 0.0.0.0 | 255.255.255.0 | eth0 | U |
Tue Apr 5 15:01:36 2016: ci-info: +-------+-------------+----------+---------------+-----------+-------+
Tue Apr 5 15:01:37 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'modules:config' at Tue, 05 Apr 2016 13:01:38 +0000. Up 41.97 seconds.
Tue Apr 5 15:01:38 2016: Starting cloud-init: Cloud-init v. 0.7.4 running 'modules:final' at Tue, 05 Apr 2016 13:01:39 +0000. Up 43.20 seconds.
Tue Apr 5 15:01:39 2016: ci-info: no authorized ssh keys fingerprints found for user cloud-user.
Tue Apr 5 15:01:39 2016: ci-info: no authorized ssh keys fingerprints found for user cloud-user.
Tue Apr 5 15:01:40 2016: ec2:
Tue Apr 5 15:01:40 2016: ec2: #############################################################
Tue Apr 5 15:01:40 2016: ec2: -----BEGIN SSH HOST KEY FINGERPRINTS-----
Tue Apr 5 15:01:40 2016: ec2: 1024 63:fd:ec:d0:4c:7c:c7:33:a3:76:ba:91:00:33:c5:a9 /etc/ssh/ssh_host_dsa_key.pub (DSA)
Tue Apr 5 15:01:40 2016: ec2: 2048 29:09:a4:29:c8:5c:0d:2c:be:76:59:a2:d8:19:7e:7b /etc/ssh/ssh_host_key.pub (RSA1)
Tue Apr 5 15:01:40 2016: ec2: 2048 b9:56:c4:b4:c4:6a:33:4e:ba:14:6f:02:43:bd:50:7a /etc/ssh/ssh_host_rsa_key.pub (RSA)
Tue Apr 5 15:01:40 2016: ec2: -----END SSH HOST KEY FINGERPRINTS-----
Tue Apr 5 15:01:40 2016: ec2: #############################################################
Tue Apr 5 15:01:40 2016: -----BEGIN SSH HOST KEY KEYS-----
Tue Apr 5 15:01:40 2016: 2048 35 2842589064180712032266612294911096332459607665467558789444438190688324072035463210195132798387809895683854959526209889002429436703268070385056689521517555599526571018618029304994073059118791852612652893866694532608954246064760952629308155011078661
Tue Apr 5 15:01:40 2016: ssh-rsa AAAAB3NzaC1yc2EAAAABIwAAAQEA3fWeGY6pgNgcPM9JjmrJb4u4qqBCrKKFA7BeH9F7x49HKexOmrqeUH1SDM3oIIMre5BIS34hZClNWQjNIZplsDasnmziWe9FugJAaH0lVZ+DJTjlg54hG5gDesndNui7/xGvB6QlDvPo35ug5E9Zd24EfN//7ZGCbqSBYKFGIljFq31Bq/xI1+kYDVFK5jXY9fNsiry6uU7RnHdnhXlevjge72o
Tue Apr 5 15:01:40 2016: -----END SSH HOST KEY KEYS-----
Tue Apr 5 15:01:40 2016: Cloud-init v. 0.7.4 finished at Tue, 05 Apr 2016 13:01:40 +0000. Datasource DataSourceNone. Up 43.55 seconds
Tue Apr 5 15:01:40 2016: Starting ntpd: [ OK ]
Tue Apr 5 15:01:40 2016:
Tue Apr 5 15:01:40 2016: Starting console mouse services: [ OK ]
Tue Apr 5 15:01:40 2016: Starting tuned: [ OK ]
Tue Apr 5 15:01:40 2016: Starting crond: [ OK ]
Tue Apr 5 15:01:40 2016: Starting atd: [ OK ]
Tue Apr 5 15:01:40 2016: Running CernVM context boot hooks:
Tue Apr 5 15:01:40 2016: [ OK ]
Tue Apr 5 15:01:45 2016: rm: cannot remove `/etc/grid-security/certificates': Is a directory |
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
It never just rains it pours! A server certificate expired on 04/04/16 18:19. :( |
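An expiry like this one can be caught ahead of time with a simple date check. The sketch below is only an illustration under stated assumptions: the helper names `is_expired` and `needs_renewal` and the 14-day warning window are invented here, and the expiry time from the post is assumed to be UTC, which the thread does not state.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Expiry quoted in the post (04/04/16 18:19); treating it as UTC is an assumption.
EXPIRY = datetime(2016, 4, 4, 18, 19, tzinfo=timezone.utc)


def is_expired(not_after: datetime, now: Optional[datetime] = None) -> bool:
    """True once the certificate's end-of-validity time has passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= not_after


def needs_renewal(not_after: datetime, now: Optional[datetime] = None,
                  warn_days: int = 14) -> bool:
    """True if the certificate expires within warn_days (hypothetical policy)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return not_after - now <= timedelta(days=warn_days)


if __name__ == "__main__":
    # Time of the failing VM boot reported earlier in the thread (13:01 +0000).
    boot = datetime(2016, 4, 5, 13, 1, tzinfo=timezone.utc)
    print(is_expired(EXPIRY, boot))
```

Run on a schedule against each server certificate's end date, a check of this kind would flag the renewal a couple of weeks before jobs start failing.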
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
"It never just rains it pours!" Dry spell or summer at the end: tasks are running longer, cvmfs2 is doing something and, more importantly, on the dashboard I see 3 jobs running from Ivan's batch. However, there are also jobs from Hassen Riahi's wmagent-batch 160401_110444_7709 around. I got 2 of those and 1 for Ivan (sorry Ivan ;) ) |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
S'OK, we're not in competition. In fact, those jobs have to take over from mine before we can go into production mode. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
It looks as though the VM is getting jobs OK, condor_stdout ends:-
== DIR: run_and_lumis.tar.gz
== DIR: sandbox.tar.gz
==== Local directory contents dump FINISHING ====
======== CMSRunAnalysis.py STARTING at Wed Apr 6 01:31:31 GMT 2016 ========
Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_y6Ysc9/execute/dir_8347...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_y6Ysc9/execute/dir_8347 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=373 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_373.txt --runAndLumis=job_lumis_373.json --lheInputFiles=False --firstEvent=93001 --firstLumi=1117 --lastEvent=93251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
but top doesn't show cmsRun starting. The job is killed after a few minutes. stderr ends like this:-
2016-04-06 01:25:12 (21023): Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
2016-04-06 01:25:25 (21023): Guest Log: [INFO] Downloading glidein
2016-04-06 01:25:33 (21023): Guest Log: [INFO] Running glidein (check logs)
2016-04-06 01:49:02 (21023): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-06 01:50:01 (21023): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-06 01:50:02 (21023): Guest Log: Log extracts for Run 1 jobs
2016-04-06 01:50:02 (21023): Guest Log: [INFO] No output. Shutting down!
2016-04-06 01:50:02 (21023): VM Completion File Detected.
2016-04-06 01:50:02 (21023): Powering off VM.
2016-04-06 01:50:03 (21023): Successfully stopped VM.
2016-04-06 01:50:03 (21023): Deregistering VM. (boinc_fc98c9c163bd3c37, slot#3)
2016-04-06 01:50:03 (21023): Removing network bandwidth throttle group from VM.
2016-04-06 01:50:04 (21023): Removing storage controller(s) from VM.
2016-04-06 01:50:04 (21023): Removing VM from VirtualBox.
2016-04-06 01:50:04 (21023): Removing virtual disk drive from VirtualBox.
01:50:09 (21023): called boinc_finish(0)
Boot log ends:-
Wed Apr 6 02:05:50 2016: -----END SSH HOST KEY KEYS-----
Wed Apr 6 02:05:50 2016: Cloud-init v. 0.7.4 finished at Wed, 06 Apr 2016 01:05:50 +0000. Datasource DataSourceNone. Up 82.10 seconds
Wed Apr 6 02:05:50 2016: Starting ntpd: [ OK ]
Wed Apr 6 02:05:51 2016:
Wed Apr 6 02:05:51 2016: Starting console mouse services: [ OK ]
Wed Apr 6 02:05:51 2016: Starting tuned: [ OK ]
Wed Apr 6 02:05:52 2016: Starting crond: [ OK ]
Wed Apr 6 02:05:53 2016: Starting atd: [ OK ]
Wed Apr 6 02:05:53 2016: Running CernVM context boot hooks:
Wed Apr 6 02:05:53 2016: [ OK ]
Wed Apr 6 02:06:14 2016: rm: cannot remove `/etc/grid-security/certificates': Is a directory |
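For anyone comparing runs, the time a glidein stayed alive can be read off the stderr lines quoted in this post. The following is a rough Python sketch, assuming the guest-log line format shown here; the function name `glidein_runtime` and the matching rules are illustrative, not part of BOINC or the CMS@Home wrapper.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional

# Matches lines such as:
# 2016-04-06 01:25:33 (21023): Guest Log: [INFO] Running glidein (check logs)
GUEST_LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: \[INFO\] (.*)$"
)


def glidein_runtime(lines: Iterable[str]) -> Optional[timedelta]:
    """Time between 'Running glidein' and the first 'glidein Run N ended'."""
    start = None
    for line in lines:
        match = GUEST_LINE.match(line.strip())
        if match is None:
            continue  # not an [INFO] guest-log line
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Running glidein"):
            start = stamp
        elif start is not None and re.search(r"glidein Run \d+ ended", message):
            return stamp - start
    return None  # no complete start/end pair found


# Lines copied from the stderr extract in the post.
SAMPLE = [
    "2016-04-06 01:25:25 (21023): Guest Log: [INFO] Downloading glidein",
    "2016-04-06 01:25:33 (21023): Guest Log: [INFO] Running glidein (check logs)",
    "2016-04-06 01:49:02 (21023): Guest Log: [INFO] CMS glidein Run 1 ended",
    "2016-04-06 01:50:01 (21023): Guest Log: [INFO] CMS glidein Run 1 ended",
]

if __name__ == "__main__":
    print(glidein_runtime(SAMPLE))
```

On the sample lines this gives roughly 23 and a half minutes, compared with the 50 seconds reported for the earlier failing run.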
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
Is it still doing that? I don't have any logs from you in this batch tho' you have a task outstanding since 0103Z. Maybe a project reset is in order. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
"Is it still doing that?" No. Three jobs 2068, 2087 and 2129 running, OK so far. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
Good! All three machines are returning good jobs. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I believe we are going to need a new batch sometime tomorrow morning. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks for putting a new batch on, Ivan. However, we are only running at 1/3 capacity, as T4T has no more CMS-Simulation BOINC tasks. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,939,683 RAC: 3,233 |
That appears to have been fixed. I sent Laurence et al. an email early this morning -- I'd had a late night... |
©2024 CERN