Another new image and access to the log files
Author | Message |
---|---|
Send message Joined: 1 Aug 14 Posts: 14 Credit: 884 RAC: 0 |
Once again we have updated the VM images, so it would be nice if you could get a new VM. This time we have done multiple things:
1) We are hoping to address the credit problem in this release, but we don't know if our modifications will do the job... So please report back how it's going for you.
2) We have implemented a web server on the VM, so you should now be able to press the “show graphics” button on your job in the BOINC Manager. When your web browser opens, you should see the sample page of t4t. Don't think about it too much; those are just some sample images included in the t4t-webapp package. For now we (as the current developers) don't have the knowledge to produce such images out of the CMS framework, so this will be done later by people from CMS. However, you can look at the logs produced by the CMSJobAgent (which fetches the jobs) and cmsRun (the actual CMS program). Just click on the Logs button and you will be there.
I still haven't figured out why that is... So, in conclusion, you might want to look at the logs which have "tail" in their name.
2) stderr and stdout seem to be swapped sometimes. The reason for this is that our server does not have a valid certificate, so wget ends up dumping its log to stderr. You should find this in your logs:
Connecting to data-bridge-test.cern.ch|128.142.154.228|:443... connected.
WARNING: cannot verify data-bridge-test.cern.ch’s certificate, issued by “/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=data-bridge-test/emailAddress=root@data-bridge-test”:
Self-signed certificate encountered.
WARNING: certificate common name “data-bridge-test” doesn't match requested host name “data-bridge-test.cern.ch”. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
Also in this newest version no real calculations are done in the VM. The BOINC task has now been running for over 13 hours. cmsRun has started 32 times and always ends in a Fatal Exception.
30 times:
[25/02/15 20:12:17] cmsRun -j FrameworkJobReport.xml PSet.py
[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----------------------
[25/02/15 20:12:30] An exception of category 'Incomplete configuration' occurred while
[25/02/15 20:12:30] [0] Constructing the EventProcessor
[25/02/15 20:12:30] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 20:12:30] Exception Message:
[25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 20:12:30] ----- End Fatal Exception -------------------------------------------------
[25/02/15 20:12:31] Complete
[25/02/15 20:12:31] process id is 4553 status is 65
and 2 times:
[25/02/15 22:06:27] cmsRun -j FrameworkJobReport.xml PStail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file et.py
[25/02/15 22:06:35] ----- Begin Fatal Exception 25-Feb-2015 22:06:34 CET-----------------------
[25/02/15 22:06:35] An exception of category 'Incomplete configuration' occurred while
[25/02/15 22:06:35] [0] Constructing the EventProcessor
[25/02/15 22:06:35] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 22:06:35] Exception Message:
[25/02/15 22:06:35] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 22:06:35] ----- End Fatal Exception -------------------------------------------------
[25/02/15 22:06:35] Complete
[25/02/15 22:06:35] process id is 8886 status is 65
The BOINC task will run for 157.5 hours this way, doing nothing, and will then be aborted for exceeding the elapsed-time limit :-(
Under normal circumstances, when will a BOINC task end successfully?
CP |
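For anyone who wants to count the failures in their own logs rather than scrolling through them: raw copies of the cmsRun log carry terminal colour codes (the `[37m` and `[0m` markers), which get in the way of searching. Below is a minimal Python sketch that strips those codes and counts the "Begin Fatal Exception" banners. The helper names are made up for illustration, and the sample lines are taken from the post above; it is not part of the CMS tooling.

```python
import re

# Matches colour codes such as ESC[37m and ESC[0m. The ESC byte is optional
# because it is often lost when a log is pasted into a web page.
ANSI_SGR = re.compile(r"\x1b?\[\d+(?:;\d+)*m")


def strip_colour(line: str) -> str:
    """Return the log line without terminal colour codes."""
    return ANSI_SGR.sub("", line)


def count_fatal_exceptions(lines) -> int:
    """Count how many Fatal Exception banners appear in the given lines."""
    return sum("Begin Fatal Exception" in strip_colour(line) for line in lines)


# Sample lines copied from the log quoted in this thread.
sample = [
    "[37m[25/02/15 20:12:17] cmsRun -j FrameworkJobReport.xml PSet.py[0m",
    "[37m[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----------------------[0m",
    "[37m[25/02/15 20:12:30] [0] Constructing the EventProcessor[0m",
]
cleaned = [strip_colour(line) for line in sample]
fatal_count = count_fatal_exceptions(sample)
```

Note that the pattern only removes bracketed numbers followed by `m`, so timestamps like `[25/02/15 20:12:30]` and stack indices like `[0]` are left alone.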
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Also in this newest version no real calculations are done in the VM. Hi, thanks for bringing this up. In fact this is a known problem; we just haven't settled on a particular solution. In short, /cvmfs/cms.cern.ch/SITECONF/local is a link to another directory in /cvmfs/cms.cern.ch/SITECONF which is set at startup; the directory it links to depends on a reverse lookup of the IP address, mapping to one of the WLCG (Grid) nodes used by CMS (e.g., at random, T2_US_Florida). If your IP is not associated with a Grid site, the mapping returns a null string, so the link points nowhere (in practice, it links back to the base directory). One proposed solution is to insert a dummy Tier-3 site (say, T3_INT_BOINC_CMS...) into the database and have the startup script link to that if the lookup returns a null string (since CVMFS is a read-only file system from the user's point of view, this can't be done at just any time after the fact). Since this means I can run jobs at work, but not at home, I'm hoping for a fix Real Soon Now. Cheers! |
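To make the proposed fallback concrete, here is a rough Python sketch of the logic, run against a scratch directory rather than the real (read-only) CVMFS tree. Everything named here is an assumption for illustration: the function `link_local_siteconf`, the placeholder site name `T3_EXAMPLE_FALLBACK`, and the idea that the lookup result arrives as a plain string. The actual startup script may work quite differently.

```python
import os
import tempfile

# Hypothetical placeholder; the real dummy Tier-3 name has not been decided.
FALLBACK_SITE = "T3_EXAMPLE_FALLBACK"


def link_local_siteconf(siteconf_root: str, looked_up_site: str) -> str:
    """Point <root>/local at the looked-up site, or at the fallback site
    when the lookup returned an empty string. Returns the site chosen."""
    site = looked_up_site.strip() or FALLBACK_SITE
    target = os.path.join(siteconf_root, site)
    if not os.path.isdir(target):
        raise FileNotFoundError(f"no SITECONF directory for {site}")
    link = os.path.join(siteconf_root, "local")
    if os.path.islink(link):
        os.remove(link)  # replace a link left over from an earlier call
    os.symlink(site, link)  # relative link inside the SITECONF directory
    return site


# Demonstration in a temporary directory standing in for SITECONF.
with tempfile.TemporaryDirectory() as root:
    for name in ("T2_US_Florida", FALLBACK_SITE):
        os.makedirs(os.path.join(root, name))
    local = os.path.join(root, "local")

    chosen_home = link_local_siteconf(root, "")  # lookup found nothing
    resolved_home = os.path.basename(os.path.realpath(local))

    chosen_work = link_local_siteconf(root, "T2_US_Florida")
    resolved_work = os.path.basename(os.path.realpath(local))
```

With an empty lookup result the link lands on the fallback directory instead of pointing back at the base directory, which is the failure mode described above.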
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
OK, I'll abort the task and wait for NEWS! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Has anyone had a job complete successfully with the latest image? My attempt ran for 27-1/2 hours and then aborted:
26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished
Task: 1152
Work unit: 209
Sent: 25 Feb 2015, 16:36:23 UTC
Time reported: 26 Feb 2015, 20:09:29 UTC
Status: Error while computing
Run time (sec): 98,993.91
CPU time (sec): 64,008.69
Credit: ---
Application: CMS Simulation v37.01
As far as I can tell I had my disk limit set to 100 GB. I've set it to unlimited and will try again.
[Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished
[Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation.
OK, there was a disk limit in the job template. It's now been raised, so please abort any running jobs and try again. Sorry for the inconvenience but it is still the development phase! |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
Has anyone had a job complete successfully with the latest image? Just saw that you doubled the maximum possible disk size to 9536.74MB. I have never seen a task end normally, so a question: how does the program in the VM tell the wrapper that the task is finished? With ATLAS, the VM seems to do only one job. VirtualLHC does several jobs within the VM and the task is ended by a duration limit. Your project seems to do several cmsRun jobs within one VM lifetime too, but in the wrapper xml (CMS_23_01_2015.xml) there is no <job_duration> for the BOINC task. |
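A side note on the odd-looking numbers: both limits quoted in this thread come out as round byte counts when divided by 1024*1024. The short Python check below assumes (this is a guess from the arithmetic, not something confirmed by the project) that the job template held 5e9 bytes before and 1e10 bytes after the change.

```python
MIB = 1024 * 1024  # bytes per "MB" as reported in the BOINC messages


def bytes_to_mib(n_bytes: float) -> float:
    """Convert a byte count to units of 1024*1024 bytes."""
    return n_bytes / MIB


old_limit = round(bytes_to_mib(5e9), 2)   # assumed original template value
new_limit = round(bytes_to_mib(1e10), 2)  # assumed value after doubling

# How far the aborted task was over the old limit, from the logged figures.
overshoot = round(4780.64 - old_limit, 2)
```

Under that assumption `old_limit` and `new_limit` reproduce the 4768.37MB and 9536.74MB figures, and the aborted task was only about 12MB over the old limit.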
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
Has anyone had a job complete successfully with the latest image? Hi CP, As usual you're on the money! Thanks a lot for the help on behalf of Hendrik, Ivan and Laurence (I'm also helping them to get CMS on the road). So a new version has been set up, and <job_duration> has been set to 1 hour to speed up testing. It can be reset to 24 hours when things are a bit more advanced. All the very best - Ben |
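If you want to check which duration limit a downloaded wrapper file carries, a small Python sketch like the following can read it. The sample XML is invented for illustration: it assumes the file has a `<vbox_job>` root and that `<job_duration>` is given in seconds, so verify both against your own copy of the wrapper xml before relying on it.

```python
import xml.etree.ElementTree as ET

# Invented sample; a real wrapper file contains many more settings.
SAMPLE = """<vbox_job>
  <job_duration>3600</job_duration>
</vbox_job>"""


def job_duration_hours(xml_text: str):
    """Return the job duration in hours, or None if the tag is absent or empty.
    Assumes the value is stored in seconds."""
    node = ET.fromstring(xml_text).find("job_duration")
    if node is None or not (node.text or "").strip():
        return None
    return float(node.text) / 3600.0


one_hour = job_duration_hours(SAMPLE)
missing = job_duration_hours("<vbox_job/>")
one_day = job_duration_hours("<vbox_job><job_duration>86400</job_duration></vbox_job>")
```

A result of `None` corresponds to the situation CP described, where the file has no duration limit at all.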
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
Has anyone had a job complete successfully with the latest image? If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINCwise. result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218 I added a duration limit of 1 hour to the job xml for test purposes. During that hour about 3 or 4 cmsRun processes started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work. CPU time was about 10% of elapsed time. CP |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
... I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!) Now the validator needs enabling... Ben |
Send message Joined: 1 Aug 14 Posts: 14 Credit: 884 RAC: 0 |
... Thank you very much, Crystal Pellet and Ben, for your feedback. Your comments are moving us forward in big steps :) The sample_bitwise_validator is now up and running. Hendrik |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Yay, I got credit! |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
Yay, I got credit! Yeah, and Hendrik got credit for the tasks he returned in September last year! :D "What's in the barrel, does not sour." |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users have credit in fact. |
Send message Joined: 26 Feb 15 Posts: 26 Credit: 5,042,431 RAC: 2,011 |
Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users have credit in fact. It is a cached page, and is not updated real-time. Patience. Reno, NV Team: SETI.USA |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down. Since yesterday I hadn't been getting any data from CERN stored by the CernVMFileSystem2 process, and the VM stayed at a size of ~750MB, but since your post (did you or someone else do something?) data is coming in again and the VM has grown to almost 1.5GB. cmsRun started, but as before, because I am not part of CERN's network, the process ended in a "Fatal Exception". CP |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Trust me, we're working on that. There's also a new VM image as of a little while ago which is supposed to cure the problems I was seeing yesterday (a race condition at startup, apparently). |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down. OK, there's another image now that's designed to cure the problem for non-Grid sites/ISPs running our programmes. I won't be in a position to test it for another 10-12 hours, so if any of you who have been affected can try it and report back, that'd be great. Thanks. |
©2024 CERN