Message boards :
Number crunching :
issue of the day
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61 |
However, maybe that is also because of the issue I have. Resetting the project and getting all fresh project files did not solve my problem. I have meanwhile discovered that the project launched a new version of the application (v4622) on 28 Jan 2016, 12:05:09 UTC. As far as I know, this was not announced in the News. The only difference is a new project XML file; the VM .vdi is unchanged. New in the XML file: init_data.xml is now copied to the shared project directory in the task slot, and the runtime is extended from 24 hrs to 36 hrs. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I did a project detach and reattach, but it is not working. The boot log is the only file in the logs. Investigating.... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
My guess is that when vLHC users were disallowed, CMS-dev users were as well. Once a CMS-dev BOINC task ends, you cannot get in again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 270 |
I meanwhile discovered that the project launched a new version of the application (v4622) on 28 Jan 2016, 12:05:09 UTC. Hmm, that's news to me as well. Will follow up tomorrow. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Edit:

vLHCathome-dev 3/9/2016 2:51:20 PM [error] No start tag in scheduler reply

The above is the message from BOINC when attempting to report the task. After the switch-over, the credentials do not seem to work. It goes on and on and on....

00:04:06.324627 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:06.974927 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:27.871165 VMMDev: Guest Log: [INFO] Starting CMS Application - Run 4
00:04:27.908388 VMMDev: Guest Log: [INFO] Reading the BOINC volunteer's information
00:04:28.002609 VMMDev: Guest Log: [INFO] Volunteer: Rasputin42 (xxx) Host: xxx
00:04:28.050134 VMMDev: Guest Log: [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxxx
00:04:28.103016 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:04:28.831879 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:30.671557 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:04:31.513885 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:36.858739 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:04:37.520845 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:04:37.607688 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:38.372696 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:53.174549 NAT: old socket rcv size: 64KB
00:04:53.174581 NAT: old socket snd size: 64KB
00:04:59.372916 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:00.158923 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:02.053716 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:02.651167 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:08.205944 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:08.866343 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:08.919476 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:09.573582 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:27.833479 VMMDev: Guest Log: [INFO] Starting CMS Application - Run 5
00:05:27.861517 VMMDev: Guest Log: [INFO] Reading the BOINC volunteer's information
00:05:27.949528 VMMDev: Guest Log: [INFO] Volunteer: Rasputin42 (xxx) Host: xxx
00:05:27.983997 VMMDev: Guest Log: [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxxx
00:05:28.031234 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:28.664188 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:30.684688 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:31.302647 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:33.209132 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:33.858069 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:39.441975 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:40.137187 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:40.203887 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:40.920251 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:05:59.176726 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:05:59.951849 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:06:01.840979 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:06:02.770356 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:06:04.376781 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:06:05.292544 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:06:10.776294 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:06:11.374606 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:06:11.436696 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:06:12.097996 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:06:27.835997 VMMDev: Guest Log: [INFO] Starting CMS Application - Run 6
00:06:27.867779 VMMDev: Guest Log: [INFO] Reading the BOINC volunteer's information
00:06:27.953261 VMMDev: Guest Log: [INFO] Volunteer: Rasputin42 (xxx) Host: xxx
00:06:27.994671 VMMDev: Guest Log: [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxx |
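[Editor's note] The retry loop in a guest log like the one above is easier to see when the credential requests are counted per run. The following is a minimal reader-side sketch (not project code; the log format is assumed to match the excerpt) that groups "Requesting an X509 credential" lines by the "Starting CMS Application - Run N" markers:

```python
import re
from collections import Counter

# Abbreviated sample in the same format as the guest log above.
SAMPLE_LOG = """\
00:04:27.871165 VMMDev: Guest Log: [INFO] Starting CMS Application - Run 4
00:04:28.103016 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
00:04:28.831879 VMMDev: Guest Log: [INFO] Requesting an X509 credential from LHC@home
00:04:30.671557 VMMDev: Guest Log: [INFO] Requesting an X509 credential from CMS-Dev
"""

def credential_requests_per_run(log_text):
    """Return {run_label: Counter({endpoint: count})} for a guest-log excerpt."""
    run = "before first run"
    counts = {}
    for line in log_text.splitlines():
        m = re.search(r"Starting CMS Application - (Run \d+)", line)
        if m:
            run = m.group(1)
            continue
        m = re.search(r"Requesting an X509 credential from (\S+)", line)
        if m:
            counts.setdefault(run, Counter())[m.group(1)] += 1
    return counts

print(credential_requests_per_run(SAMPLE_LOG))
# e.g. {'Run 4': Counter({'CMS-Dev': 2, 'LHC@home': 1})}
```

A healthy run should show one or two requests per endpoint; dozens per minute, as in the log above, indicate the credential is never being granted.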
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
A 2nd(?) run started. Has this been fixed? I found this in cron_stdout:

16:51:01 +0100 2016-03-10 [INFO] Starting CMS Application - Run 1
16:51:02 +0100 2016-03-10 [INFO] Reading the BOINC volunteer's information
16:51:02 +0100 2016-03-10 [INFO] Volunteer: Rasputin42 (277) Host: 617
16:51:02 +0100 2016-03-10 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
16:51:02 +0100 2016-03-10 [INFO] Requesting an X509 credential from CMS-Dev
subject  : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277/CN=xxxxxxxx
issuer   : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
identity : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
type     : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path     : /tmp/x509up_u500
timeleft : 130:00:00 (5.4 days)
16:51:03 +0100 2016-03-10 [INFO] Downloading glidein
16:51:05 +0100 2016-03-10 [INFO] Running glidein (check logs)
23:47:01 +0100 2016-03-10 [INFO] CMS glidein Run 1 ended
Copying 354466 bytes file:///home/boinc/wu_1457619778_8_0_1.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1457619778_8_0_1.tgz
Short exit status: 0
Short exit status: 0
Short exit status: 0
Short exit status: 0
23:48:01 +0100 2016-03-10 [INFO] Starting CMS Application - Run 2
23:48:01 +0100 2016-03-10 [INFO] Reading the BOINC volunteer's information
23:48:01 +0100 2016-03-10 [INFO] Volunteer: Rasputin42 (277) Host: 617
23:48:01 +0100 2016-03-10 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
23:48:01 +0100 2016-03-10 [INFO] Requesting an X509 credential from CMS-Dev
subject  : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277/CN=xxxxxxx
issuer   : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
identity : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
type     : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path     : /tmp/x509up_u500
timeleft : 129:59:59 (5.4 days)
23:48:03 +0100 2016-03-10 [INFO] Downloading glidein
23:48:03 +0100 2016-03-10 [INFO] Running glidein (check logs) |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Thanks for pointing this out. This is copying the log files for that run: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=139&postid=2261#2261 By collecting them all, we will be in a better position to debug issues after they have occurred. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,191,283 RAC: 3,172 |
I've just had one complete that had done 2 runs also. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have a large number of jobs, which ALL have been started and aborted by ONE host. IP available, if needed. This guy must produce a lot of these. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I would just need the Task number, WU Name or host id. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Jobs: 3895,4252,5070,5539,5455 and more. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Are those job numbers the same as the task numbers? This one seems to be fine: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=5455 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I think you misunderstood. The point is that one host starts a lot of jobs, which are abandoned and then picked up by others. They do not produce errors, but the host should be checked as to why it produces such large numbers of abandoned tasks. The jobs I listed were picked up by me and finished. I can only guess how many abandoned jobs were produced by this single host, if I am picking up that many. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
When you refer to jobs, do you mean CMS jobs or BOINC tasks? How do you know that you are getting abandoned jobs? |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I only see 29 abandoned BOINC tasks in the past 7 days. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am talking about CMS jobs. I know they are abandoned because they have an IP address associated with them that is not mine. http://dashb-cms-job-task.cern.ch/dashboard/templates/task-analysis/#user=ivan+reid&refresh=0&table=Jobs&p=1&records=25&activemenu=2&status=all&site=&tid=160226_150549%3Aireid_crab_CMS_at_Home_MinBias_250evE If you type a job number into the search box, click the + sign at the very left of the job number, and then click on the attempt number, you can see (amongst other things) the IP address the job was originally assigned to. Non-abandoned jobs would have my IP on them. EDIT: (Of course, only for jobs I have been calculating.) |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
OK, here's what's happening. Ivan's proxy expires while the job is in the queue on the server. A VM requests a new job and the job fails; it does not even start, as the site is recorded as unknown. The job is resubmitted, and Ivan's script eventually renews the proxy. You then get the good job. There is nothing wrong with the volunteer side of things; it is just noise created by the proxy expiring. |
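[Editor's note] The proxy-expiry mechanism described above can be illustrated with a small sketch. This is not the project's renewal script; it only parses the "timeleft" field in the format shown in the cron_stdout excerpt earlier in the thread (e.g. "timeleft : 130:00:00"), and the 24-hour renewal threshold is an assumption for illustration:

```python
import re

def timeleft_seconds(proxy_info):
    """Parse 'timeleft : HHH:MM:SS' from proxy-info output into seconds."""
    m = re.search(r"timeleft\s*:\s*(\d+):(\d\d):(\d\d)", proxy_info)
    if not m:
        raise ValueError("no timeleft field found")
    hours, minutes, seconds = (int(g) for g in m.groups())
    return hours * 3600 + minutes * 60 + seconds

def needs_renewal(proxy_info, threshold_hours=24):
    """True if the proxy would expire within the given threshold."""
    return timeleft_seconds(proxy_info) < threshold_hours * 3600

info = "timeleft : 130:00:00 (5.4 days)"
print(timeleft_seconds(info))  # 468000 seconds, i.e. 130 hours
print(needs_renewal(info))     # False: well above the 24 h threshold
```

A renewal script that checks this periodically avoids exactly the window described above, where a job sits in the queue past the proxy's expiry.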
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Laurence. Where does the job I am picking up get its IP from, if not from a volunteer that did not finish it? This, by itself, is not remarkable. It becomes remarkable when the IP is the same for a large number of jobs. A "fresh" job does not have an IP, and the jobs in question did not fail and never left the queue. They were abandoned and reassigned (to me). In any case, if you do not have a problem with that, why should I? |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
I just re-enabled one of my boxes to fetch work from this project. The machine was added 2 months ago, so before your project rename. I only allowed it to get work, and work was fetched; this was good. Now run 1 has finished and the box is sitting there idle. I just checked around and found this in http://localhost:57156/logs/run-1/glide_UES5to/MasterLog:

03/11/16 14:52:31 (pid:7863) ******************************************************
03/11/16 14:52:31 (pid:7863) ** condor_master (CONDOR_MASTER) STARTING UP
03/11/16 14:52:31 (pid:7863) ** /home/boinc/CMSRun/glide_UES5to/main/condor/sbin/condor_master
03/11/16 14:52:31 (pid:7863) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
03/11/16 14:52:31 (pid:7863) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
03/11/16 14:52:31 (pid:7863) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
03/11/16 14:52:31 (pid:7863) ** $CondorPlatform: x86_64_RedHat5 $
03/11/16 14:52:31 (pid:7863) ** PID = 7863
03/11/16 14:52:31 (pid:7863) ** Log last touched time unavailable (No such file or directory)
03/11/16 14:52:31 (pid:7863) ******************************************************
03/11/16 14:52:31 (pid:7863) Using config source: /home/boinc/CMSRun/glide_UES5to/condor_config
03/11/16 14:52:31 (pid:7863) config Macros = 212, Sorted = 212, StringBytes = 10636, TablesBytes = 7672
03/11/16 14:52:31 (pid:7863) CLASSAD_CACHING is OFF
03/11/16 14:52:31 (pid:7863) Daemon Log is logging: D_ALWAYS D_ERROR
03/11/16 14:52:31 (pid:7863) DaemonCore: command socket at <10.0.2.15:58692?noUDP>
03/11/16 14:52:31 (pid:7863) DaemonCore: private command socket at <10.0.2.15:58692>
03/11/16 14:52:32 (pid:7863) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#131760
03/11/16 14:52:32 (pid:7863) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_UES5to/main/condor/sbin/condor_master (mtime:1457704337)
03/11/16 14:52:32 (pid:7863) Started DaemonCore process "/home/boinc/CMSRun/glide_UES5to/main/condor/sbin/condor_startd", pid and pgroup = 7866
03/11/16 15:03:35 (pid:7863) condor_write(): Socket closed when trying to write 2896 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:03:35 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:14:33 (pid:7863) condor_write(): Socket closed when trying to write 2897 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:14:33 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:25:31 (pid:7863) condor_write(): Socket closed when trying to write 2897 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:25:31 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:36:29 (pid:7863) condor_write(): Socket closed when trying to write 2914 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:36:29 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:47:27 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:47:27 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:52:51 (pid:7863) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619
03/11/16 15:52:51 (pid:7863) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
03/11/16 15:53:52 (pid:7863) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#131774
03/11/16 15:58:25 (pid:7863) condor_write(): Socket closed when trying to write 2915 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:58:25 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:09:23 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 16:09:23 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:20:21 (pid:7863) condor_write(): Socket closed when trying to write 2915 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 16:20:21 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:31:20 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 16:31:20 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:42:18 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 16:42:18 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:53:16 (pid:7863) condor_write(): Socket closed when trying to write 2896 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 16:53:16 (pid:7863) Buf::write(): condor_write() failed
03/11/16 16:54:11 (pid:7863) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9619
03/11/16 16:54:11 (pid:7863) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
03/11/16 16:55:12 (pid:7863) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#131787
03/11/16 17:04:14 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 17:04:14 (pid:7863) Buf::write(): condor_write() failed
03/11/16 17:15:12 (pid:7863) condor_write(): Socket closed when trying to write 2915 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 17:15:12 (pid:7863) Buf::write(): condor_write() failed
03/11/16 17:26:10 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 17:26:10 (pid:7863) Buf::write(): condor_write() failed
03/11/16 17:37:08 (pid:7863) condor_write(): Socket closed when trying to write 2898 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 17:37:08 (pid:7863) Buf::write(): condor_write() failed

What can I do, or what has to be done on your side? |
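[Editor's note] A MasterLog like the one above is easier to triage when the repeated failures are summarised. The following is a small reader-side sketch (my own, not HTCondor tooling; the line formats are assumed to match the excerpt) that counts collector write failures and CCB drops/re-registrations:

```python
import re

# Abbreviated sample in the same format as the MasterLog above.
SAMPLE = """\
03/11/16 15:03:35 (pid:7863) condor_write(): Socket closed when trying to write 2896 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10
03/11/16 15:03:35 (pid:7863) Buf::write(): condor_write() failed
03/11/16 15:52:51 (pid:7863) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9619 failed; will try to reconnect in 60 seconds.
03/11/16 15:53:52 (pid:7863) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9619 as ccbid 130.246.180.120:9619#131774
"""

def summarize_masterlog(text):
    """Count the recurring failure patterns in a MasterLog excerpt."""
    return {
        "write_failures": len(re.findall(r"condor_write\(\): Socket closed", text)),
        "ccb_drops": len(re.findall(r"CCBListener: connection to CCB server .* failed", text)),
        "ccb_registered": len(re.findall(r"CCBListener: registered with CCB server", text)),
    }

print(summarize_masterlog(SAMPLE))
# {'write_failures': 1, 'ccb_drops': 1, 'ccb_registered': 1}
```

In the log above, the CCB connection keeps recovering (it re-registers after each drop) while every write to the collector fails, which points at the collector side rather than the volunteer's network.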
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I suggest detaching and reattaching to: http://lhcathomedev.cern.ch/vLHCathome-dev |
©2024 CERN