Thread 'Another new image and access to the log files'

Author	Message
Hendrik Project developer Project tester Joined: 1 Aug 14 Posts: 14 Credit: 884 RAC: 0	Message 20 - Posted: 25 Feb 2015, 16:50:59 UTC
Once again we have updated the VM images, so it would be nice if you could get a new VM. This time we have done several things:
1) Credit problem: we are hoping to address the credit problem in this release, but we don't know if our modifications will do the job, so please report back on how it's going for you.
2) We have implemented a web server on the VM, so you should now be able to press the “Show graphics” button on your job in the BOINC Manager. When your web browser opens, you should see the sample page of t4t. Don't think about it too much; those are just some sample images that are included in the t4t-webapp package. For now we (as the current developers) don't have the knowledge to produce such images from the CMS framework, so this will be done later by people from CMS. However, you can look at the logs that are produced by the CMSJobAgent (which fetches the jobs) and by cmsRun (the actual CMS program). Just click on the Logs button and you will be there.
Some questions about the logs have already come up internally (thanks to Ben), so here are some short comments on that:
1) As you might notice, we have two versions of each log: one is produced by tail, the other by dumbq-logcat. Personally I liked how dumbq is able to timestamp the output (we use it for the consoles as well), but it seems to have some difficulties when being redirected to a file, so you will notice that the log stops at random places and then continues from there as it gets new input. I still haven't figured out why that is. In conclusion, you might want to look at the logs which have "tail" in their name.
2) stderr and stdout seem to be swapped sometimes. The reason for this is that our server does not have a valid certificate, so wget ends up dumping its log to stderr. You should find this in your logs:
Connecting to data-bridge-test.cern.ch|128.142.154.228|:443... connected.
WARNING: cannot verify data-bridge-test.cern.ch’s certificate, issued by “/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=data-bridge-test/emailAddress=root@data-bridge-test”:
Self-signed certificate encountered.
WARNING: certificate common name “data-bridge-test” doesn't match requested host name “data-bridge-test.cern.ch”.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 22 - Posted: 26 Feb 2015, 8:12:53 UTC - in response to Message 20.
In this newest version, too, no real calculations are done in the VM. The BOINC task has now been running for over 13 hours. cmsRun started 32 times and always gets a Fatal Exception.
30 times:
[25/02/15 20:12:17] cmsRun -j FrameworkJobReport.xml PSet.py
[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----------------------
[25/02/15 20:12:30] An exception of category 'Incomplete configuration' occurred while
[25/02/15 20:12:30] [0] Constructing the EventProcessor
[25/02/15 20:12:30] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 20:12:30] Exception Message:
[25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 20:12:30] ----- End Fatal Exception -------------------------------------------------
[25/02/15 20:12:31] Complete
[25/02/15 20:12:31] process id is 4553 status is 65
and 2 times (with messages from tail interleaved in the first line):
[25/02/15 22:06:27] cmsRun -j FrameworkJobReport.xml PStail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file
et.py
[25/02/15 22:06:35] ----- Begin Fatal Exception 25-Feb-2015 22:06:34 CET-----------------------
[25/02/15 22:06:35] An exception of category 'Incomplete configuration' occurred while
[25/02/15 22:06:35] [0] Constructing the EventProcessor
[25/02/15 22:06:35] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 22:06:35] Exception Message:
[25/02/15 22:06:35] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 22:06:35] ----- End Fatal Exception -------------------------------------------------
[25/02/15 22:06:35] Complete
[25/02/15 22:06:35] process id is 8886 status is 65
The BOINC task will run for 157.5 hours this way doing nothing and will then be aborted for exceeding the elapsed runtime limit :-(
Under normal circumstances, when will a BOINC task end successfully?
CP
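For anyone who wants to count these failures in a saved copy of the log rather than by eye, here is a minimal Python sketch. It assumes the log has been saved as plain text and that the colour codes appear either as real ANSI escapes or as the bare `[37m` / `[0m` residue seen above; the function name `summarize` is made up for this example.

```python
import re

# Matches ANSI colour sequences such as ESC[37m, and also the bare "[37m" / "[0m"
# residue that remains when the ESC byte is lost in copy-and-paste.
ANSI_RE = re.compile(r"\x1b?\[\d+(?:;\d+)*m")
SITE_CONF_RE = re.compile(r"Valid site-local-config not found at (\S+)")


def summarize(log_text):
    """Return (number of fatal exceptions, sorted list of missing config paths)."""
    clean = ANSI_RE.sub("", log_text)
    fatal_count = clean.count("----- Begin Fatal Exception")
    missing_paths = sorted(set(SITE_CONF_RE.findall(clean)))
    return fatal_count, missing_paths


if __name__ == "__main__":
    sample = (
        "[37m[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----[0m\n"
        "[37m[25/02/15 20:12:30] Valid site-local-config not found at "
        "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml[0m\n"
        "[37m[25/02/15 20:12:30] ----- End Fatal Exception -----[0m\n"
    )
    print(summarize(sample))
```

The timestamps in square brackets are left alone because the pattern only matches a bracket followed by digits and a closing `m`.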

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 23 - Posted: 26 Feb 2015, 10:10:55 UTC - in response to Message 22.
> Also in this newest version no real calculations are done in the VM. The BOINC task is running now for over 13 hours. 32 times CMSrun started and always gets a Fatal Exception. 30 times:
> [25/02/15 20:12:30] Exception Message:
> [25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
> [25/02/15 20:12:30] ----- End Fatal Exception --------------------------
Hi, thanks for bringing this up. In fact this is a known problem; we just haven't settled on a particular solution. In short, /cvmfs/cms.cern.ch/SITECONF/local is a link to another directory in /cvmfs/cms.cern.ch/SITECONF, and it is set at startup. The directory it links to depends on a reverse lookup of the IP address, which maps to one of the WLCG (Grid) sites used by CMS (e.g., at random, T2_US_Florida). If your IP is not associated with a Grid site, the mapping returns a null string, so the link points nowhere (in practice, it links back to the base directory).
One proposed solution is to insert a dummy Tier-3 site (say, T3_INT_BOINC_CMS...) into the database and have the startup script link to that if the lookup returns a null string. Since CVMFS is a read-only file system from the user's point of view, this can't be done at just any time after the fact.
Since this means I can run jobs at work but not at home, I'm hoping for a fix Real Soon Now. Cheers!
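The proposed fallback can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's startup script: the site name `T3_EXAMPLE_FALLBACK` is a hypothetical placeholder (the real dummy site name is not given in full above), the function names are invented, and the link is created in a scratch directory because CVMFS itself is read-only.

```python
import os

# Hypothetical placeholder; not a real CMS site name.
FALLBACK_SITE = "T3_EXAMPLE_FALLBACK"


def choose_site(lookup_result, fallback=FALLBACK_SITE):
    """Return the site directory to link to, using the fallback for an empty lookup."""
    site = (lookup_result or "").strip()
    return site if site else fallback


def link_local(siteconf_dir, lookup_result):
    """Point <siteconf_dir>/local at the chosen site directory via a relative symlink."""
    target = choose_site(lookup_result)
    link = os.path.join(siteconf_dir, "local")
    if os.path.islink(link):
        os.remove(link)  # replace a stale link left over from a previous start
    os.symlink(target, link)
    return target
```

With this logic a Grid address still resolves to its own site (for example `choose_site("T2_US_Florida")` returns that name unchanged), while an empty lookup result ends up at the fallback directory instead of pointing nowhere.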

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 24 - Posted: 26 Feb 2015, 11:22:03 UTC - in response to Message 23.
OK, I'll abort the task and wait for NEWS!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 25 - Posted: 27 Feb 2015, 10:33:33 UTC Last modified: 27 Feb 2015, 10:38:09 UTC
Has anyone had a job complete successfully with the latest image? My attempt ran for 27.5 hours and then aborted:
26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished
Task: 1152
Work unit: 209
Sent: 25 Feb 2015, 16:36:23 UTC
Time reported: 26 Feb 2015, 20:09:29 UTC
Status: Error while computing
Run time (sec): 98,993.91
CPU time (sec): 64,008.69
Credit: ---
Application: CMS Simulation v37.01
As far as I can tell I had my disk limit set to 100 GB. I've set it to unlimited and will try again.
[Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 26 - Posted: 27 Feb 2015, 10:58:14 UTC - in response to Message 25.
> 26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
> 26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished
> [Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation.
OK, there was a disk limit in the job template. It has now been raised, so please abort any running jobs and try again. Sorry for the inconvenience, but we are still in the development phase!

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 27 - Posted: 27 Feb 2015, 12:23:43 UTC - in response to Message 25.
> Has anyone had a job complete successfully with the latest image?
Just saw that you doubled the maximum possible disk size to 9536.74MB. I have never seen a task end normally, so a question: how does the program in the VM tell the wrapper that the task is finished? With ATLAS, the VM seems to do only one job per VM. VirtualLHC does several jobs within the VM, and the task is ended by a duration limit. Your project also seems to do several cmsRuns within one VM lifetime, but the wrapper XML (CMS_23_01_2015.xml) contains no <job_duration> for the BOINC task.
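The odd-looking limits of 4768.37MB and 9536.74MB are consistent with round byte counts being reported in units of 1024 × 1024 bytes. That reading is an assumption based only on the arithmetic, not something confirmed in this thread; the short Python check below shows the conversion (the function name is invented for the example).

```python
BYTES_PER_MIB = 1024 * 1024


def bytes_to_reported_mb(n_bytes):
    """Convert a byte count to the two-decimal 'MB' figure, using 1 MB = 1024*1024 bytes."""
    return round(n_bytes / BYTES_PER_MIB, 2)


# Assumed round byte limits: 5e9 bytes before the change, 1e10 bytes after it.
old_limit = bytes_to_reported_mb(5_000_000_000)
new_limit = bytes_to_reported_mb(10_000_000_000)
print(old_limit, new_limit)
```

Under that assumption, the aborted task's 4780.64MB of disk usage was only about 12MB over the old limit.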

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 28 - Posted: 27 Feb 2015, 14:04:25 UTC - in response to Message 27.
> Has anyone had a job complete successfully with the latest image?
> Just saw that you doubled the maximum possible disk size to 9536.74MB. I have never seen a task end normally, so a question: how does the program in the VM tell the wrapper that the task is finished? With ATLAS, the VM seems to do only one job per VM. VirtualLHC does several jobs within the VM, and the task is ended by a duration limit. Your project also seems to do several cmsRuns within one VM lifetime, but the wrapper XML (CMS_23_01_2015.xml) contains no <job_duration> for the BOINC task.
Hi CP,
As usual you're on the money! Thanks a lot for the help on behalf of Hendrik, Ivan and Laurence (I'm also helping them to get CMS on the road). A new version has been set up, and <job_duration> has been set to 1 hour to speed up testing. It can be reset to 24 hours when things are a bit more advanced.
All the very best - Ben
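To check what duration a given wrapper XML actually sets, something like the following Python sketch could be used. It assumes that the file has a `<vbox_job>` root element and that `<job_duration>` holds a value in seconds; both are assumptions about the file layout, since the contents of CMS_23_01_2015.xml are not shown in this thread, and the sample XML here is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real wrapper XML contains more settings than this.
SAMPLE_XML = """<vbox_job>
    <job_duration>3600</job_duration>
</vbox_job>"""


def job_duration_seconds(xml_text):
    """Return the job duration in seconds, or None if no <job_duration> is set."""
    root = ET.fromstring(xml_text)
    node = root.find("job_duration")
    if node is None or not (node.text or "").strip():
        return None
    return float(node.text)


if __name__ == "__main__":
    print(job_duration_seconds(SAMPLE_XML))
    print(job_duration_seconds("<vbox_job></vbox_job>"))
```

A result of `None` corresponds to the situation CP described, where the task has no duration limit and so keeps running until some other limit aborts it.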

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 29 - Posted: 27 Feb 2015, 14:06:47 UTC - in response to Message 27. Last modified: 27 Feb 2015, 14:08:53 UTC
> Has anyone had a job complete successfully with the latest image?
> Just saw that you doubled the maximum possible disk size to 9536.74MB. I have never seen a task end normally, so a question: how does the program in the VM tell the wrapper that the task is finished? With ATLAS, the VM seems to do only one job per VM. VirtualLHC does several jobs within the VM, and the task is ended by a duration limit. Your project also seems to do several cmsRuns within one VM lifetime, but the wrapper XML (CMS_23_01_2015.xml) contains no <job_duration> for the BOINC task.
If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINC-wise.
result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218
I added a duration limit of 1 hour to the job XML for test purposes. During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work. CPU time was about 10% of elapsed time.
CP

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 30 - Posted: 27 Feb 2015, 14:31:20 UTC - in response to Message 29.
> ... If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINC-wise.
> result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218
> I added a duration limit of 1 hour to the job XML for test purposes. During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work. CPU time was about 10% of elapsed time.
> CP
I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!) Now the validator needs enabling...
Ben

Hendrik Project developer Project tester Joined: 1 Aug 14 Posts: 14 Credit: 884 RAC: 0	Message 31 - Posted: 27 Feb 2015, 15:04:08 UTC - in response to Message 30.
> > ... If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINC-wise.
> > result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218
> > I added a duration limit of 1 hour to the job XML for test purposes. During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work. CPU time was about 10% of elapsed time.
> > CP
> I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!) Now the validator needs enabling...
> Ben
Thanks a lot, Crystal Pellet and Ben, for your feedback. Your comments move us forward in big steps :) The sample_bitwise_validator is now up and running.
Hendrik

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 32 - Posted: 27 Feb 2015, 15:24:49 UTC - in response to Message 31.
Yay, I got credit!

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 33 - Posted: 27 Feb 2015, 15:34:06 UTC - in response to Message 32.
> Yay, I got credit!
Yeah, and Hendrik got credit for the tasks he returned in September last year! :D
"What's in the barrel, does not sour."

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 34 - Posted: 27 Feb 2015, 16:34:33 UTC - in response to Message 33.
Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users do in fact have credit.

zombie67 [MM] Joined: 26 Feb 15 Posts: 26 Credit: 5,331,144 RAC: 0	Message 35 - Posted: 27 Feb 2015, 17:07:54 UTC - in response to Message 34.
> Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users do in fact have credit.
It is a cached page, and is not updated in real time. Patience.
Reno, NV
Team: SETI.USA

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 40 - Posted: 3 Mar 2015, 15:03:28 UTC
I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,826 RAC: 136	Message 41 - Posted: 3 Mar 2015, 17:32:17 UTC - in response to Message 40.
> I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.
Since yesterday I had not been getting data from CERN stored by the CernVMFileSystem2 process, and the VM stayed at a size of ~750MB. Since your post (did you or someone else do something?) data is coming in again and the VM has grown to almost 1.5GB. cmsRun started but, as before, because I am not part of CERN's network, the process ended in a "Fatal Exception".
CP

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 42 - Posted: 4 Mar 2015, 15:17:08 UTC - in response to Message 41.
Trust me, we're working on that. There's also a new VM image as of a little while ago which is supposed to cure the problems I was seeing yesterday (a race condition at startup, apparently).

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 298	Message 43 - Posted: 6 Mar 2015, 9:52:14 UTC - in response to Message 41.
> > I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.
> Since yesterday I had not been getting data from CERN stored by the CernVMFileSystem2 process, and the VM stayed at a size of ~750MB. Since your post (did you or someone else do something?) data is coming in again and the VM has grown to almost 1.5GB. cmsRun started but, as before, because I am not part of CERN's network, the process ended in a "Fatal Exception".
OK, there's another image now that's designed to cure the problem for non-Grid sites/ISPs running our programmes. I won't be in a position to test it for another 10-12 hours, so if any of you who have been affected can try it and report back, that'd be great. Thanks.

Development for LHC@home