Message boards : News : Another new image and access to the log files

Hendrik
Project developer
Project tester
Joined: 1 Aug 14
Posts: 14
Credit: 884
RAC: 0
Message 20 - Posted: 25 Feb 2015, 16:50:59 UTC

Once again we have updated the VM images, so it would be nice if you could get a new VM.

This time we have made several changes:
    1) Credit problem
    We hope this release addresses the credit problem, but we don't know yet whether our modifications do the job, so please report back how it goes for you.

    2) Web server on the VM
    We have added a web server to the VM, so you should now be able to press the “Show graphics” button for your job in the BOINC Manager. When your web browser opens, you should see the sample page of t4t. Don't read too much into it: those are just sample images included in the t4t-webapp package. For now we (the current developers) don't know how to produce such images from the CMS framework, so that will be done later by people from CMS.
    However, you can look at the logs produced by the CMSJobAgent (which fetches the jobs) and by cmsRun (the actual CMS program). Just click the Logs button to get there.




Some questions about the logs have already come up internally (thanks to Ben), so here are some short comments:

    1) As you might notice, there are two versions of each log: one produced by tail and one by dumbq-logcat. Personally I like how dumbq timestamps the output (we use it for the consoles as well), but it seems to have difficulties when its output is redirected to a file: the log stops at random places and then continues from there when new input arrives.
    I still haven't figured out why that is...
    So for now you might want to look at the logs that have "tail" in their name.

    2) stderr and stdout sometimes seem to be swapped
    The reason is that our server does not have a valid certificate, so wget ends up dumping its log to stderr.

    You should find this in your logs:
    Connecting to data-bridge-test.cern.ch|128.142.154.228|:443... connected.
    WARNING: cannot verify data-bridge-test.cern.ch’s certificate, issued by “/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=data-bridge-test/emailAddress=root@data-bridge-test”:
    Self-signed certificate encountered.
    WARNING: certificate common name “data-bridge-test” doesn't match requested host name “data-bridge-test.cern.ch”.
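
To see for yourself which stream a tool writes to, you can capture stdout and stderr separately. The following is only a minimal Python sketch: it uses a stand-in child process (not wget and not the actual VM scripts) that prints one line to each stream, and all names in it are made up for illustration.

```python
import subprocess
import sys

# Stand-in child process (an assumption for illustration, not wget):
# it prints one line to stdout and one line to stderr.
CHILD_CODE = (
    "import sys\n"
    "print('payload line')\n"
    "print('WARNING: diagnostic line', file=sys.stderr)\n"
)

# Capture both streams separately so they cannot be mixed up.
result = subprocess.run(
    [sys.executable, "-c", CHILD_CODE],
    capture_output=True,
    text=True,
    check=True,
)

print("stdout:", result.stdout.strip())
print("stderr:", result.stderr.strip())
```

Replacing the stand-in command with the real one shows which log file a given message should be expected in.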

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 22 - Posted: 26 Feb 2015, 8:12:53 UTC - in response to Message 20.  

In this newest version too, no real calculations are done in the VM.

The BOINC task has now been running for over 13 hours.
cmsRun has started 32 times and always ends in a Fatal Exception.

30 times:

[25/02/15 20:12:17] cmsRun -j FrameworkJobReport.xml PSet.py
[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----------------------
[25/02/15 20:12:30] An exception of category 'Incomplete configuration' occurred while
[25/02/15 20:12:30] [0] Constructing the EventProcessor
[25/02/15 20:12:30] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 20:12:30] Exception Message:
[25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 20:12:30] ----- End Fatal Exception -------------------------------------------------
[25/02/15 20:12:31] Complete
[25/02/15 20:12:31] process id is 4553 status is 65


and 2 times:

[25/02/15 22:06:27] cmsRun -j FrameworkJobReport.xml PStail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has become inaccessible: No such file or directory
tail: `/home/boinc/CMSRun/cmsRun-stdout.log' has appeared; following end of new file
et.py
[25/02/15 22:06:35] ----- Begin Fatal Exception 25-Feb-2015 22:06:34 CET-----------------------
[25/02/15 22:06:35] An exception of category 'Incomplete configuration' occurred while
[25/02/15 22:06:35] [0] Constructing the EventProcessor
[25/02/15 22:06:35] [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
[25/02/15 22:06:35] Exception Message:
[25/02/15 22:06:35] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 22:06:35] ----- End Fatal Exception -------------------------------------------------
[25/02/15 22:06:35] Complete
[25/02/15 22:06:35] process id is 8886 status is 65
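
For anyone who wants to tally these failures rather than count them by hand, here is a minimal Python sketch. It assumes the log format shown in the excerpts above; the function name summarize and the shortened sample text are illustrative only and not part of the CMS tooling.

```python
import re
from collections import Counter

# A few lines copied from the log excerpt in this thread.
SAMPLE_LOG = """\
[25/02/15 20:12:17] cmsRun -j FrameworkJobReport.xml PSet.py
[25/02/15 20:12:30] ----- Begin Fatal Exception 25-Feb-2015 20:12:30 CET-----------------------
[25/02/15 20:12:30] An exception of category 'Incomplete configuration' occurred while
[25/02/15 20:12:30] Exception Message:
[25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 20:12:30] ----- End Fatal Exception -------------------------------------------------
[25/02/15 20:12:31] process id is 4553 status is 65
"""

CATEGORY_RE = re.compile(r"An exception of category '([^']+)'")
STATUS_RE = re.compile(r"process id is (\d+) status is (\d+)")


def summarize(log_text):
    """Count fatal exceptions per category and collect the exit statuses."""
    categories = Counter()
    statuses = []
    for line in log_text.splitlines():
        match = CATEGORY_RE.search(line)
        if match:
            categories[match.group(1)] += 1
        match = STATUS_RE.search(line)
        if match:
            statuses.append(int(match.group(2)))
    return categories, statuses


categories, statuses = summarize(SAMPLE_LOG)
print(dict(categories))
print(statuses)
```

Feeding it the whole log file instead of the sample gives the number of failed cmsRun attempts per exception category.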


Running this way, the BOINC task will do nothing for 157.5 hours and will then be aborted for exceeding the elapsed-time limit :-(

Under normal circumstances, when does a BOINC task end successfully?

CP

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 23 - Posted: 26 Feb 2015, 10:10:55 UTC - in response to Message 22.  

In this newest version too, no real calculations are done in the VM.

The BOINC task has now been running for over 13 hours.
cmsRun has started 32 times and always ends in a Fatal Exception.

30 times:

[25/02/15 20:12:30] Exception Message:
[25/02/15 20:12:30] Valid site-local-config not found at /cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml
[25/02/15 20:12:30] ----- End Fatal Exception --------------------------

Hi, thanks for bringing this up. This is in fact a known problem; we just haven't settled on a particular solution.
In short, /cvmfs/cms.cern.ch/SITECONF/local is a link to another directory in /cvmfs/cms.cern.ch/SITECONF. The link is set at startup, and the directory it points to depends on a reverse lookup of the IP address, which maps to one of the WLCG (Grid) nodes used by CMS (e.g., at random, T2_US_Florida). If your IP is not associated with a Grid site, the mapping returns a null string, so the link points nowhere (in practice, it links back to the base directory). One proposed solution is to insert a dummy Tier-3 site (say, T3_INT_BOINC_CMS...) into the database and have the startup script link to that if the lookup returns a null string. Since CVMFS is a read-only file system from the user's point of view, this can't be done at just any time after the fact.
Since this means I can run jobs at work but not at home, I'm hoping for a fix Real Soon Now.
Cheers!
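
The proposed fallback can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual startup script: the function names are made up, the lookup result is passed in as a plain string, and T3_EXAMPLE_FALLBACK is a hypothetical placeholder because the real dummy site name has not been settled.

```python
# Hypothetical placeholder: the real dummy Tier-3 site name is not decided.
FALLBACK_SITE = "T3_EXAMPLE_FALLBACK"
SITECONF_BASE = "/cvmfs/cms.cern.ch/SITECONF"


def choose_site(lookup_result, fallback=FALLBACK_SITE):
    """Return the site directory that 'local' should link to.

    An empty or missing lookup result (IP not associated with a Grid
    site) falls back to the dummy site instead of producing a link
    that points nowhere.
    """
    site = (lookup_result or "").strip()
    return site if site else fallback


def local_link_target(lookup_result):
    """Build the full path the 'local' link would point at."""
    return f"{SITECONF_BASE}/{choose_site(lookup_result)}"


print(local_link_target("T2_US_Florida"))  # host mapped to a Grid site
print(local_link_target(""))               # host with no Grid site
```

The point of the design is that the decision is made once, at startup, before anything tries to read site-local-config.xml.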

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 24 - Posted: 26 Feb 2015, 11:22:03 UTC - in response to Message 23.  

OK, I'll abort the task and wait for NEWS!

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 25 - Posted: 27 Feb 2015, 10:33:33 UTC
Last modified: 27 Feb 2015, 10:38:09 UTC

Has anyone had a job complete successfully with the latest image? My attempt ran for 27-1/2 hours and then aborted:

26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished

Task	Work unit	Sent	Time reported	Status	Run time (sec)	CPU time (sec)	Credit	Application
1152	209	25 Feb 2015, 16:36:23 UTC	26 Feb 2015, 20:09:29 UTC	Error while computing	98,993.91	64,008.69	---	CMS Simulation v37.01

As far as I can tell I had my disk limit set to 100 GB. I've set it to unlimited and will try again.

[Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 26 - Posted: 27 Feb 2015, 10:58:14 UTC - in response to Message 25.  

26-Feb-2015 20:07:46 [CMS-dev] Aborting task CMS_11968_1424870354.314895_0: exceeded disk limit: 4780.64MB > 4768.37MB
26-Feb-2015 20:08:03 [CMS-dev] Computation for task CMS_11968_1424870354.314895_0 finished
[Edit] Just heard there may be a limit in the VM, so I'll hold off while awaiting confirmation.


OK, there was a disk limit in the job template. It has now been raised, so please abort any running jobs and try again. Sorry for the inconvenience, but we are still in the development phase!

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 27 - Posted: 27 Feb 2015, 12:23:43 UTC - in response to Message 25.  

Has anyone had a job complete successfully with the latest image?

Just saw that you doubled the maximum possible disk size to 9536.74MB.
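
The old and new limits look like round byte counts displayed in units of 1024*1024 bytes. That is an inference from the numbers in this thread, not something the developers have confirmed; a quick Python check under that assumption:

```python
MIB = 1024 * 1024  # bytes per "MB" as the BOINC client appears to display it


def bytes_to_mb(n_bytes):
    """Convert a byte count to the two-decimal MB figure seen in the logs."""
    return round(n_bytes / MIB, 2)


# Assumed underlying limits: 5e9 bytes before the change, 1e10 bytes after.
old_limit = bytes_to_mb(5_000_000_000)
new_limit = bytes_to_mb(10_000_000_000)

print(old_limit)  # matches the limit in the "exceeded disk limit" message
print(new_limit)  # matches the doubled limit
```

If the assumption holds, the aborted task was only about 12MB over a limit of 5 billion bytes.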

I have never seen a task end normally, so a question: how does the VM program tell the wrapper that the task is finished?
With ATLAS, the VM seems to do only 1 job per VM.
VirtualLHC does several jobs within the VM, and the task is ended by a duration limit.

Your project seems to do several cmsRuns within a VM lifetime too, but the wrapper xml (CMS_23_01_2015.xml) has no <job_duration> for the BOINC task.

Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 28 - Posted: 27 Feb 2015, 14:04:25 UTC - in response to Message 27.  

Has anyone had a job complete successfully with the latest image?

Just saw that you doubled the maximum possible disk size to 9536.74MB.

I have never seen a task end normally, so a question: how does the VM program tell the wrapper that the task is finished?
With ATLAS, the VM seems to do only 1 job per VM.
VirtualLHC does several jobs within the VM, and the task is ended by a duration limit.

Your project seems to do several cmsRuns within a VM lifetime too, but the wrapper xml (CMS_23_01_2015.xml) has no <job_duration> for the BOINC task.

Hi CP,

As usual you're on the money! Thanks a lot for the help on behalf of Hendrik, Ivan and Laurence (I'm also helping them to get CMS on the road).

So a new version has been set up, and <job_duration> has been set to 1 hour to speed up testing. It can be reset to 24 hours when things are a bit more advanced.
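
To check whether a wrapper xml actually carries a duration limit, something like the following Python sketch would do. The sample document is hypothetical (it is not the real CMS_23_01_2015.xml), and both the root element name vbox_job and the assumption that the value is given in seconds should be verified against the vboxwrapper documentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper configuration for illustration only.
# Assumption: the duration is expressed in seconds (1 hour = 3600).
SAMPLE_XML = """\
<vbox_job>
    <job_duration>3600</job_duration>
</vbox_job>
"""


def job_duration_seconds(xml_text):
    """Return the job_duration value as a float, or None if it is absent."""
    root = ET.fromstring(xml_text)
    node = root.find("job_duration")
    if node is None or not (node.text or "").strip():
        return None
    return float(node.text)


print(job_duration_seconds(SAMPLE_XML))               # limit present
print(job_duration_seconds("<vbox_job></vbox_job>"))  # no limit set
```

Under the same assumption, the later 24-hour setting would be 86400.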

All the very best - Ben

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 29 - Posted: 27 Feb 2015, 14:06:47 UTC - in response to Message 27.  
Last modified: 27 Feb 2015, 14:08:53 UTC

Has anyone had a job complete successfully with the latest image?

Just saw that you doubled the maximum possible disk size to 9536.74MB.

I have never seen a task end normally, so a question: how does the VM program tell the wrapper that the task is finished?
With ATLAS, the VM seems to do only 1 job per VM.
VirtualLHC does several jobs within the VM, and the task is ended by a duration limit.

Your project seems to do several cmsRuns within a VM lifetime too, but the wrapper xml (CMS_23_01_2015.xml) has no <job_duration> for the BOINC task.

If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINCwise.

result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218

I added a duration limit of 1 hour to the job xml for test purposes.
During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work.
CPU time was about 10% of elapsed time.

CP

Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 30 - Posted: 27 Feb 2015, 14:31:20 UTC - in response to Message 29.  

...
If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINCwise.

result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218

I added a duration limit of 1 hour to the job xml for test purposes.
During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work.
CPU time was about 10% of elapsed time.

CP

I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!)

Now the validator needs enabling...

Ben

Hendrik
Project developer
Project tester
Joined: 1 Aug 14
Posts: 14
Credit: 884
RAC: 0
Message 31 - Posted: 27 Feb 2015, 15:04:08 UTC - in response to Message 30.  

...
If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINCwise.

result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218

I added a duration limit of 1 hour to the job xml for test purposes.
During that hour about 3 or 4 cmsRuns started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work.
CPU time was about 10% of elapsed time.

CP

I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!)

Now the validator needs enabling...

Ben


Thanks a lot, Crystal Pellet and Ben, for your feedback.
Your comments are moving us forward in big steps :)

The sample_bitwise_validator is now up and running.

Hendrik

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 32 - Posted: 27 Feb 2015, 15:24:49 UTC - in response to Message 31.  

Yay, I got credit!

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 33 - Posted: 27 Feb 2015, 15:34:06 UTC - in response to Message 32.  

Yay, I got credit!

Yeah, and Hendrik got credit for the tasks he returned in September last year! :D

"What's in the barrel, does not sour."

Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 34 - Posted: 27 Feb 2015, 16:34:33 UTC - in response to Message 33.  

Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users have credit in fact.

zombie67 [MM]
Joined: 26 Feb 15
Posts: 26
Credit: 5,042,431
RAC: 2,011
Message 35 - Posted: 27 Feb 2015, 17:07:54 UTC - in response to Message 34.  

Question: why does the user search, ordered by credit, not give correct results now that credit is being earned? I see only zombie67 (ID 157) with any credit, although several other users have credit in fact.


It is a cached page, and is not updated real-time. Patience.
Reno, NV
Team: SETI.USA

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 40 - Posted: 3 Mar 2015, 15:03:28 UTC

I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 2
Message 41 - Posted: 3 Mar 2015, 17:32:17 UTC - in response to Message 40.  

I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.

Since yesterday I hadn't received any data from CERN stored by the CernVMFileSystem2 process, and the VM stayed at a size of ~750MB,
but since your post (did you/someone do something?) data is coming in again and the VM has grown to almost 1.5GB.
cmsRun started, but as before, because I'm not part of CERN's network, the process ended in a "Fatal Exception".

CP

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 42 - Posted: 4 Mar 2015, 15:17:08 UTC - in response to Message 41.  

Trust me, we're working on that. There's also a new VM image as of a little while ago which is supposed to cure the problems I was seeing yesterday (a race condition at startup, apparently).

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 728
Message 43 - Posted: 6 Mar 2015, 9:52:14 UTC - in response to Message 41.  

I'm having a problem with recent work-units. Please check your logs (as detailed in the original post) and check that you're getting sensible output. If not, it'd be best to suspend while we chase this down.

Since yesterday I hadn't received any data from CERN stored by the CernVMFileSystem2 process, and the VM stayed at a size of ~750MB,
but since your post (did you/someone do something?) data is coming in again and the VM has grown to almost 1.5GB.
cmsRun started, but as before, because I'm not part of CERN's network, the process ended in a "Fatal Exception".

OK, there's another image now that's designed to cure the problem for non-Grid sites/ISPs running our programmes. I won't be in a position to test it for another 10-12 hours, so if any of you who have been affected can try it and report back, that'd be great.
Thanks.

©2024 CERN