New Refactored Version (47.01)
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
A new refactored version (47.01) of the CMS application is now available. It uses the same approach as the Theory application, so the glidein is no longer used. There is a high chance that a few things, such as the consoles and Web logs, are broken, but hopefully it will work in general. Please post issues to this thread. Note that this update is only for the development project; the beta application on the production server will remain on the old version.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,944,443 RAC: 3,276
"Giving up on download of CMS_2016_04_19.vdi: permanent HTTP error"
[Edit] From the task report:
{core_client_version}7.2.33{/core_client_version}
{![CDATA[
{message}
app_version download error: couldn't get input files:
{file_xfer_error}
{file_name}CMS_2016_04_19.vdi{/file_name}
{error_code}-224{/error_code}
{error_message}permanent HTTP error{/error_message}
{/file_xfer_error}
{/message}
[/Edit]
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
Try again.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,944,443 RAC: 3,276
> Try again.

Download worked. Thanks. I just suspended the current jobs and it started running, but it stopped after 42 seconds with "Waiting to run (Scheduler wait: VM environment needed to be cleaned up.)"
[Edit] I stopped BOINC and restarted it. The task has now been running for over 4 minutes, but there's no sign of the console or log buttons yet. [/Edit]
Joined: 20 Jan 15 Posts: 1129 Credit: 7,944,443 RAC: 3,276
It failed after 10 minutes. The job report is at http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=149364
Joined: 20 Jan 15 Posts: 1129 Credit: 7,944,443 RAC: 3,276
> It failed after 10 minutes. Job report is http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=149364

OK, I aborted all the old jobs and downloaded more, and one started (as per my app_config file). Now there are console and log buttons. It started up remarkably quickly, and Alt-F3 shows cmsRun running. There are no Alt-F4 or Alt-F5 consoles, and only the Condor logs are on the Web pages. This appears to be my job in the condor_status display on the server:
9-22-24270.9-22-24 LINUX X86_64 Claimed Busy 1.350 4500 0+00:09:51
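For anyone comparing their own slot with rows like the one above, here is a minimal Python sketch that splits a condor_status row into fields and converts the last column, which reads as a days+HH:MM:SS duration, into seconds. The column names (Name, OpSys, Arch, State, Activity, LoadAv, Mem, ActvtyTime) are assumed from the default condor_status layout; `Slot`, `parse_slot` and `parse_activity_time` are illustrative names, not part of HTCondor.

```python
from dataclasses import dataclass

# Column order assumed from the default condor_status listing:
# Name OpSys Arch State Activity LoadAv Mem ActvtyTime
@dataclass
class Slot:
    name: str
    opsys: str
    arch: str
    state: str
    activity: str
    load_avg: float
    memory: int
    activity_seconds: int

def parse_activity_time(text: str) -> int:
    """Convert a 'days+HH:MM:SS' duration such as '0+00:09:51' to seconds."""
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds

def parse_slot(line: str) -> Slot:
    """Parse one whitespace-separated condor_status row into a Slot."""
    name, opsys, arch, state, activity, load, mem, atime = line.split()
    return Slot(name, opsys, arch, state, activity,
                float(load), int(mem), parse_activity_time(atime))

slot = parse_slot("9-22-24270.9-22-24 LINUX X86_64 Claimed Busy 1.350 4500 0+00:09:51")
print(slot.state, slot.activity, slot.activity_seconds)  # Claimed Busy 591
```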
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
Will fix the logging tomorrow. It would be interesting to see how long startup now takes for people. I know some have measured it before. As far as I recall, the start time was around 8 minutes.
Joined: 2 Jul 15 Posts: 15 Credit: 140,962 RAC: 0
I just noticed the following line in the MasterLog and StartLog:
PERMISSION DENIED to condor@261-554-14495 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
After 10 minutes, nothing has happened (no output to the logs). Is this an error?
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
It is an error, but it should not cause any problems. The logs still need to be plugged into the consoles, so for now the best thing to do is to look at the output of the top command. I haven't seen any significant drop in the running capacity, so it looks like this is working for everyone.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,944,443 RAC: 3,276
Looks like we have 42 out of 312 converts so far:
Tue Apr 19 22:25:49 [cms005@lcggwms02:~] > condor_status|grep -v glidein|grep LINUX|wc
42 335 3353
Tue Apr 19 22:26:00 [cms005@lcggwms02:~] > condor_status|grep glidein|grep LINUX|wc
270 2160 21600
Are you one of them?
198-288-8135.198-2 LINUX X86_64 Claimed Busy 1.050 4500 0+00:16:28
217-735-15116.217- LINUX X86_64 Claimed Busy 1.070 4500 0+00:52:47
217-735-20188.217- LINUX X86_64 Claimed Busy 1.030 4500 0+00:52:35
217-735-22179.217- LINUX X86_64 Claimed Busy 1.080 4500 0+00:41:22
217-735-632.217-73 LINUX X86_64 Claimed Busy 0.980 4500 0+00:52:41
244-439-11664.244- LINUX X86_64 Claimed Busy 1.070 4500 0+01:10:40
244-439-13224.244- LINUX X86_64 Claimed Busy 1.060 4500 0+00:37:46
244-439-26870.244- LINUX X86_64 Claimed Busy 1.150 4500 0+00:55:26
244-439-30920.244- LINUX X86_64 Claimed Busy 1.210 4500 0+00:20:44
244-439-6552.244-4 LINUX X86_64 Claimed Busy 0.970 4500 0+01:42:47
244-440-31443.244- LINUX X86_64 Claimed Busy 1.140 4500 0+00:33:35
244-440-8273.244-4 LINUX X86_64 Claimed Busy 1.070 4500 0+00:08:23
244-442-2227.244-4 LINUX X86_64 Claimed Busy 1.310 4500 0+01:04:43
244-442-23758.244- LINUX X86_64 Claimed Busy 1.240 4500 0+01:03:30
244-443-15714.244- LINUX X86_64 Claimed Busy 1.050 4500 0+01:24:04
244-443-22958.244- LINUX X86_64 Claimed Busy 1.190 4500 0+00:29:14
244-443-23209.244- LINUX X86_64 Claimed Busy 1.040 4500 0+00:12:43
244-443-29506.244- LINUX X86_64 Claimed Busy 1.000 4500 0+00:00:14
244-444-30450.244- LINUX X86_64 Claimed Busy 1.030 4500 0+01:51:03
244-444-31073.244- LINUX X86_64 Claimed Busy 1.200 4500 0+01:32:24
244-444-8816.244-4 LINUX X86_64 Claimed Busy 1.170 4500 0+00:27:09
244-446-32076.244- LINUX X86_64 Claimed Busy 1.530 4500 0+00:57:50
244-446-6442.244-4 LINUX X86_64 Claimed Busy 1.260 4500 0+00:40:38
244-455-31729.244- LINUX X86_64 Claimed Busy 0.170 4500 0+00:02:25
244-458-7846.244-4 LINUX X86_64 Claimed Busy 1.220 4500 0+00:58:26
244-497-2229.244-4 LINUX X86_64 Claimed Busy 1.010 4500 0+01:28:24
244-497-25124.244- LINUX X86_64 Claimed Busy 1.460 4500 0+01:25:40
275-614-1073.275-6 LINUX X86_64 Claimed Busy 1.380 4500 0+00:30:33
275-614-10834.275- LINUX X86_64 Claimed Busy 1.330 4500 0+00:30:36
275-614-16843.275- LINUX X86_64 Claimed Busy 1.270 4500 0+00:38:30
275-614-30274.275- LINUX X86_64 Claimed Busy 1.090 4500 0+00:25:36
282-625-23922.282- LINUX X86_64 Claimed Busy 1.080 4500 0+00:40:14
320-780-23078.320- LINUX X86_64 Claimed Retiring 1.140 4500 0+00:01:34
320-780-27662.320- LINUX X86_64 Claimed Retiring 1.130 4500 0+00:01:34
320-780-4714.320-7 LINUX X86_64 Claimed Retiring 0.980 4500 0+00:01:34
35-836-20182.35-83 LINUX X86_64 Claimed Busy 1.640 4500 0+00:36:31
361-1065-27903.361 LINUX X86_64 Claimed Busy 1.180 4500 0+00:11:51
55-767-16738.55-76 LINUX X86_64 Claimed Busy 0.890 4500 0+00:08:11
9-22-24038.9-22-24 LINUX X86_64 Claimed Busy 1.940 4500 0+00:17:20
9-22-24270.9-22-24 LINUX X86_64 Claimed Retiring 1.070 4500 0+00:00:43
vc-cms-dev-03.cern LINUX X86_64 Claimed Busy 1.140 4500 0+00:49:04
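The two grep pipelines above separate glidein slots from the new VM slots and count lines with `wc`. As a minimal Python sketch of the same idea, the filter below keeps non-glidein LINUX rows and tallies their State/Activity columns, so Busy and Retiring slots can be told apart. It is slightly stricter than `grep LINUX` (it checks the OpSys column exactly, so only slot rows are counted); the function names are illustrative, and the glidein row in the sample is hypothetical because no glidein slot names are shown in this thread.

```python
from collections import Counter

def is_vm_slot(line: str) -> bool:
    """Keep 8-column slot rows whose OpSys is LINUX and whose text lacks 'glidein'."""
    fields = line.split()
    return len(fields) == 8 and fields[1] == "LINUX" and "glidein" not in line

def count_states(lines) -> Counter:
    """Tally (State, Activity) pairs, the 4th and 5th columns, of the kept rows."""
    tally = Counter()
    for line in lines:
        if is_vm_slot(line):
            fields = line.split()
            tally[(fields[3], fields[4])] += 1
    return tally

listing = [
    "9-22-24038.9-22-24 LINUX X86_64 Claimed Busy 1.940 4500 0+00:17:20",
    "9-22-24270.9-22-24 LINUX X86_64 Claimed Retiring 1.070 4500 0+00:00:43",
    "320-780-4714.320-7 LINUX X86_64 Claimed Retiring 0.980 4500 0+00:01:34",
    "vc-cms-dev-03.cern LINUX X86_64 Claimed Busy 1.140 4500 0+00:49:04",
    # Hypothetical glidein row, for illustration only.
    "glidein_1@worker.example LINUX X86_64 Claimed Busy 1.000 4500 0+02:00:00",
]

states = count_states(listing)
print(sum(states.values()), dict(states))
# 4 {('Claimed', 'Busy'): 2, ('Claimed', 'Retiring'): 2}
```

Feeding it the full listing instead of this sample would give the per-state breakdown behind the single `wc` count.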
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 90
I let the versions change under their own steam. cmsRun started 18 minutes in, but it took until 29 minutes to start using significant CPU. If anything, that is slower than the previous version. This error appeared here, too:
04/20/16 02:50:40 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 3528
04/20/16 02:53:28 PERMISSION DENIED to condor@178-1024-9302 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
Since cmsRun is using 80-90% CPU, things are presumably running OK.
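Two posts now quote the same PERMISSION DENIED line from MasterLog/StartLog. For anyone who wants to check whether it recurs in their own logs, here is a minimal Python sketch that pulls the timestamp, principal, host, command and access level out of such lines. The regular expression is written against the exact lines quoted in this thread, not against a documented HTCondor log grammar, and `find_denials` is an illustrative name.

```python
import re
from datetime import datetime

# Matches the PERMISSION DENIED lines quoted above; the leading
# MM/DD/YY HH:MM:SS stamp is optional because one post omitted it.
DENIED_RE = re.compile(
    r"(?:(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) )?"
    r"PERMISSION DENIED to (?P<principal>\S+) from host (?P<host>\S+) "
    r"for command (?P<code>\d+) \((?P<command>\w+)\), access level (?P<level>\w+)"
)

def find_denials(lines):
    """Yield one dict per PERMISSION DENIED line found in a Condor log."""
    for line in lines:
        match = DENIED_RE.search(line)
        if match:
            entry = match.groupdict()
            if entry["stamp"]:
                entry["stamp"] = datetime.strptime(entry["stamp"], "%m/%d/%y %H:%M:%S")
            yield entry

log = [
    '04/20/16 02:50:40 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 3528',
    "04/20/16 02:53:28 PERMISSION DENIED to condor@178-1024-9302 from host 10.0.2.15 "
    "for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization "
    "policy contains no matching ALLOW entry for this request",
]

for denial in find_denials(log):
    print(denial["stamp"], denial["command"], denial["host"], denial["level"])
# 2016-04-20 02:53:28 DC_CHILDALIVE 10.0.2.15 DAEMON
```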
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
> It would be interesting to see how long it now takes for people. I know some have done measurements before. As far as I recall the start time was around 8 minutes.

I don't see any difference from the former version, so I suppose I'm still running the old one. Because of the flood, I still have old tasks in the queue and will abort the ones not yet started. That gives me the chance to monitor the startup with the old version once again and compare it with the new version when that is downloaded later today.
Edit: Meanwhile I've got the new version, with 2 tasks "Ready to start". Breakfast first.
Boot.log:
Wed Apr 20 07:14:10 2016: Setting hostname localhost:
Begin processing the 1st record: at 20-Apr-2016 07:20:20.474 CEST
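The two timestamps quoted above give a startup figure comparable to the "around 8 minutes" recalled earlier. Here is a minimal Python sketch of the arithmetic, assuming both stamps are in the same local time zone (the Boot.log stamp carries no zone, so the trailing "CEST" is simply dropped). `startup_delay` is an illustrative name, and the measure runs from the "Setting hostname" line to the first processed record.

```python
from datetime import datetime, timedelta

def startup_delay(boot_stamp: str, record_stamp: str) -> timedelta:
    """Time from the Boot.log 'Setting hostname' stamp to the first processed record.

    Assumes both stamps share one local time zone: the trailing zone name
    on the record stamp is dropped because the Boot.log stamp has none.
    """
    booted = datetime.strptime(boot_stamp, "%a %b %d %H:%M:%S %Y")
    without_zone = record_stamp.rsplit(" ", 1)[0]
    first_record = datetime.strptime(without_zone, "%d-%b-%Y %H:%M:%S.%f")
    return first_record - booted

delay = startup_delay("Wed Apr 20 07:14:10 2016", "20-Apr-2016 07:20:20.474 CEST")
print(delay)  # 0:06:10.474000
```

So, under that time-zone assumption, this run took just over 6 minutes from setting the hostname to the first record.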
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
> That gives me the chance to monitor the startup with the old version once again and compare with the new version when downloaded later today.

With the new version, cmsRun started between 2 and 3 minutes after the VM booted. The only log available is MasterLog. Some red log messages are printed over the 'top' window. The user 'boinc' has disappeared (as in the Theory app) and the user 'nobody' has appeared. The 'top' console no longer accepts commands like 'u', 'h', 'b', etc.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> The 'top' console is no longer accessible for commands like 'u', 'h', 'b' etc.

I would also ask for this feature to be put back, please.
Startup time, from BOINC reporting the task as started to the cmsRun process reaching a steady state (CPU > 90% on cmsRun): 6 min (10 Mbit down / 1 Mbit up).
No job progress is displayed (neither in the console nor through "Show graphics").
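The startup figure above uses "cmsRun above 90% CPU" as the end point. Here is a minimal Python sketch of that measurement, assuming CPU readings are noted at regular intervals (for example from the 'top' console). Requiring a few consecutive readings above the threshold, so that a single spike is not counted, is an added assumption, and the sample readings are made up for illustration.

```python
def time_to_steady_state(samples, threshold=90.0, run=3):
    """Return when cmsRun first reached steady state, or None if it never did.

    `samples` is a time-ordered list of (elapsed_seconds, cpu_percent) pairs.
    Steady state means `run` consecutive samples above `threshold`; the
    returned time is the first sample of that streak.
    """
    streak_start = None
    streak = 0
    for elapsed, cpu in samples:
        if cpu > threshold:
            if streak == 0:
                streak_start = elapsed
            streak += 1
            if streak >= run:
                return streak_start
        else:
            streak = 0
    return None

# Hypothetical readings taken every 30 s, for illustration only.
samples = [(60, 5.0), (120, 40.0), (300, 95.0), (330, 20.0),
           (360, 92.0), (390, 94.0), (420, 96.0)]
print(time_to_steady_state(samples))  # 360
```

With `run=1`, this reduces to the plain "first reading above 90%" definition used in the post.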
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
The red text is output over the top console window. Please redirect it to the correct window.
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
Have just pushed an update. Please check the logs and top command in about 1 hour.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Consoles F4 and F5 have output now. F5 shows:
accessing /var/lib/...../cmsRun-stdout.log: no such file or directory
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
> Have just pushed an update. Please check the logs and top command in about 1 hour.

Some improvements have been made: 'top' commands are accepted again. Thanks! The colours of consoles 4 and 5 are hard to read. Console 5 (event processing etc.) can't access the data because the file is not there, although the directory structure for jobs seems to have been created. However, we can now find out 'somewhere' which job is running.
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 107
Fixes published. Will be available for new tasks in about one hour.
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
The task has now been running for 13 hours and 16 minutes. Since the last job finished, the VM has been idle for more than 50 minutes, with no 'nobody' processes present during that time. I would expect the kill mechanism to step in when the VM is idle for that long. The last 4 lines were added to the console after the last job had finished.
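The post describes an expected watchdog: shut the VM down once no job processes owned by 'nobody' have been seen for some time. Here is a minimal Python sketch of that idea as a class fed with periodic samples of process owners. It is not the CMS application's actual mechanism; the `IdleWatchdog` name, the sampling approach and the 30-minute limit are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical idle limit; the thread does not say what the real one is.
IDLE_LIMIT = timedelta(minutes=30)

class IdleWatchdog:
    """Decide when a VM has been without job processes for too long."""

    def __init__(self, job_user: str, limit: timedelta = IDLE_LIMIT):
        self.job_user = job_user  # e.g. 'nobody', the user cmsRun now runs as
        self.limit = limit
        self.idle_since = None    # time of the first sample with no job process

    def observe(self, now: datetime, process_owners) -> bool:
        """Record one sample of process owners; return True if the VM should shut down."""
        if self.job_user in process_owners:
            self.idle_since = None  # a job is running, so reset the idle clock
            return False
        if self.idle_since is None:
            self.idle_since = now   # first idle sample starts the clock
        return now - self.idle_since > self.limit

start = datetime(2016, 4, 20, 12, 0)
dog = IdleWatchdog("nobody")
print(dog.observe(start, {"root", "nobody"}))                 # False: job running
print(dog.observe(start + timedelta(minutes=5), {"root"}))    # False: idle clock starts
print(dog.observe(start + timedelta(minutes=55), {"root"}))   # True: idle for 50 min
```

Resetting the clock on every sample that shows a job process means a VM that is merely between jobs is never shut down early.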
©2024 CERN