1) Message boards : Number crunching : Houston, we have a problem (Message 1019)
Posted 6 Sep 2015 by Hendrik
Post:

Andrew replies:
It appears that the clock jump (caused by suspending the VM I guess)
triggers the condor_master to restart all daemons, including the startd.
This is confirmed by the source code, and it seems that there's no way to
stop this. When a startd is restarted in this way you lose your job (it
doesn't do a peaceful or even graceful restart). It's not obvious to me that
there's any way around this. This is probably a question for the
htcondor-users mailing list...
So I guess that's just another caveat that we have to make our volunteers aware of.
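
To illustrate what a "clock jump" means here, below is a minimal sketch of one way a process could notice that the wall clock was stepped, for example after a suspended VM is resumed and re-synced. It is not HTCondor's actual code; the function names and the 5-second tolerance are assumptions made for this example.

```python
import time

def sample_clocks():
    """Return (wall clock, monotonic clock) readings in seconds."""
    return time.time(), time.monotonic()

def clock_jump(prev_wall, prev_mono, now_wall, now_mono, tolerance=5.0):
    """Return the apparent wall-clock jump in seconds, or 0.0 if none.

    Between two samples both clocks should advance by about the same
    amount. If the wall clock moved much further (or backwards), it was
    stepped, e.g. by a time sync after the VM was resumed.
    """
    drift = (now_wall - prev_wall) - (now_mono - prev_mono)
    return drift if abs(drift) > tolerance else 0.0
```

A watchdog loop would call `sample_clocks()` periodically and pass the previous and current readings to `clock_jump()`; a non-zero result is the size of the jump.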



One could try to prevent the clock jumps by disabling time syncing in VirtualBox (host -> guest BIOS) and in the CernVM image (guest <-> internet).
However, this approach would leave the Condor server (real time) and the job running in the VM (behind real time) significantly out of sync. I am not sure whether Condor is able to deal with this.

This post might be of interest on the VirtualBox side of time syncing: https://forums.virtualbox.org/viewtopic.php?t=8535#p152906
[Edit]This tutorial is more detailed, but a bit Windows-specific: http://stevenormrod.com/2012/10/disabling-time-sync-in-virtualbox/[/Edit]
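
For the VirtualBox side, here is a minimal sketch of how the host-to-guest time sync could be switched off from a script. It assumes `VBoxManage` is on the PATH and uses the `GetHostTimeDisabled` extradata key described in the VirtualBox manual; the VM name is a placeholder, and whether this is enough to stop the restarts has not been tested.

```python
import subprocess

# Extradata key from the VirtualBox manual that stops the Guest
# Additions from syncing the guest clock with the host.
TIMESYNC_KEY = "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled"

def disable_timesync_command(vm_name):
    """Build the VBoxManage call that disables host time sync for a VM."""
    return ["VBoxManage", "setextradata", vm_name, TIMESYNC_KEY, "1"]

def disable_timesync(vm_name):
    """Run the command; raises CalledProcessError if VBoxManage fails."""
    subprocess.run(disable_timesync_command(vm_name), check=True)
```

Calling `disable_timesync("my-boinc-vm")` (a hypothetical VM name) would apply the setting; the VM typically has to be restarted before it takes effect.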
2) Message boards : News : Progress! (Message 894)
Posted 28 Aug 2015 by Hendrik
Post:
That's great!

I can't wait to see those plots :)
Especially those that compare CMS@Home to the "traditional" CMS resources.
3) Message boards : News : Vbox Wrapper Updates (Message 153)
Posted 22 Mar 2015 by Hendrik
Post:
Hmm, startup says it can't find the cvmfs service or an ethernet connection. I tried to get a copy of the boot log, but many commands, including scp and more, fail with an input error. Investigations continue...

Well, it continues the same. Here's a screenshot that shows some of the error messages I get at the tail end of booting. (The httpd message is "normal" AFAICR.)


Yes, the httpd message is normal.
And I think that the RTNETLINK message is normal as well; at least I am seeing it on my Windows machine, which is behaving normally.
4) Message boards : News : New Release (v45) (Message 86)
Posted 17 Mar 2015 by Hendrik
Post:

Can CERN confirm that they get results back for their CMS-database?



Yep, we are getting your results back :)
5) Message boards : Number crunching : 41.01 on OSX (Message 84)
Posted 17 Mar 2015 by Hendrik
Post:
Hi Steven,
thank you for your feedback. 50 of 54 sounds quite good :)
Of course you are right, we want to get up to 54 of 54. For the first three I don't really know what that could have been...

About the fourth: Was it this task?
http://boincai05.cern.ch/CMS-dev/result.php?resultid=11941

If so, it could be similar to the problem that Zombie67[MM] just reported as well.
http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=15&postid=82

Thank you,
Hendrik
6) Message boards : News : New Release (v45) (Message 83)
Posted 17 Mar 2015 by Hendrik
Post:
What should a normal run time be? In other words, if it goes longer than x minutes, should we abort it?

The runtime limit is now 24 hours again, after being 1 hour for test purposes.


I had to abort this task after running over 48 hours. Maybe the limit is not working?:

http://boincai05.cern.ch/CMS-dev/result.php?resultid=24939


Hi,
judging from the task, it looks like a communication problem between the vboxwrapper and BOINC. The stderr log is completely empty, which it normally isn't. The task was running, right? Was the VM itself running as well? Did you have a look into the VM?
7) Message boards : News : New Release (v45) (Message 77)
Posted 16 Mar 2015 by Hendrik
Post:
That doesn't sound so good:

[15/03/15 19:11:12] ERROR:root:No message received! Nothing to do!
[15/03/15 19:12:07] ERROR:root:No message received! Nothing to do!
[15/03/15 19:13:08] ERROR:root:No message received! Nothing to do!
[15/03/15 19:14:07] ERROR:root:No message received! Nothing to do!
[15/03/15 19:15:07] ERROR:root:No message received! Nothing to do!
[15/03/15 19:16:07] ERROR:root:No message received! Nothing to do!
[15/03/15 19:17:07] ERROR:root:No message received! Nothing to do!
[15/03/15 19:18:07] ERROR:root:No message received! Nothing to do!



Hi newman,
this error pops up when our job queue is empty. We have just filled it up again, so it should work now :)
8) Message boards : Number crunching : Credit? (Message 76)
Posted 16 Mar 2015 by Hendrik
Post:
Hi KPX,
I just re-enabled the assimilator and all pending WUs have been assimilated.
Please let me know if this fixed the problem for you.


Cheers,
Hendrik
9) Message boards : Number crunching : Credit? (Message 75)
Posted 16 Mar 2015 by Hendrik
Post:
Hi KPX,
Sorry for the wait.
The fix I was talking about was the idea of granting credit via the app configuration in BOINC. But it turned out that starting the validator and assimilator did the job.
Your job may not be getting credit because the assimilator is down at the moment. I will look into that today.
10) Message boards : News : Another new image and access to the log files (Message 31)
Posted 27 Feb 2015 by Hendrik
Post:
...
If you enable the "sample_bitwise_validator", maybe I will have the first task validated successfully BOINCwise.

result: http://boincai05.cern.ch/CMS-dev/result.php?resultid=1218

I added a duration limit of 1 hour to the job xml for test purposes.
During that hour about 3 or 4 cmsRun processes started, but because of the access failures to CERN's BOINC server, the jobs within the VM did no real work.
CPU time about 10% of elapsed time.

CP

I agree with you: all my jobs fail too, as the CMS job scheduler doesn't yet allow testers from outside CERN. (I'm at home today!)

Now the validator needs enabling...

Ben


Thanks a lot, Crystal Pellet and Ben, for your feedback.
Your comments help us move forward in big steps :)

The sample_bitwise_validator is now up and running.

Hendrik
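
For readers wondering what "bitwise" validation means: the idea is that two results only count as matching when their output files are identical byte for byte. Below is a minimal sketch of that idea, not BOINC's actual validator code; the function names and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib

def file_digest(path, chunk_size=65536):
    """Hash a file's contents in chunks so large outputs fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def results_match(path_a, path_b):
    """Treat two result files as equivalent only if their bytes are equal."""
    return file_digest(path_a) == file_digest(path_b)
```

With such a check, any difference in the output, even a timestamp, makes two results count as different, which is why it only suits applications with fully reproducible output.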
11) Message boards : Number crunching : Credit? (Message 21)
Posted 25 Feb 2015 by Hendrik
Post:
Hello admins, any chance that tasks already crunched will get validated and credit awarded?


Hi KPX,
we are working on this problem, but at the moment we don't have a tested fix for this.

Cheers,
Hendrik
12) Message boards : News : Another new image and access to the log files (Message 20)
Posted 25 Feb 2015 by Hendrik
Post:
Once again we have updated the VM images, so it would be nice if you could get a new VM.

This time we have done multiple things:
    1) Credit problem
    We are hoping to address the credit problem in this release, but we don't know if our modifications will do the job... So please report back how it's going for you.

    2) We have implemented a web server on the VM, so you should now be able to press the “show graphics” button on your job in the BOINC Manager. When your web browser opens, you should see the sample page of t4t. Don't think about it too much; those are just some sample images that are included in the t4t-webapp package. For now we (as the current developers) don't have the knowledge to produce such images out of the CMS framework, so this will be done later by people from CMS.
    However, you can look at the logs that are produced by the CMSJobAgent (which fetches the jobs) and cmsRun (the actual CMS program). Just click on the Logs button and you will be there.




Some questions about the logs already arose internally (thanks to Ben) so some short comments on that:

    1) As you might notice, we have two versions of each log. One is produced by tail, one by dumbq-logcat. Personally I liked how dumbq is able to timestamp the output (we use it for the consoles as well), but it seems to have some difficulties when being directed to a file, so you will notice that the log stops at random places, but then continues from there as it gets new input.
    I still haven't figured out why that is...
    So, in conclusion, you might want to look at the logs which have "tail" in their name.

    2) stderr and stdout seem to be swapped sometimes
    The reason for this is that our server does not have a valid certificate, so wget ends up dumping its log to stderr.

    You should find this in your logs:
    Connecting to data-bridge-test.cern.ch|128.142.154.228|:443... connected.
    WARNING: cannot verify data-bridge-test.cern.ch’s certificate, issued by “/C=--/ST=SomeState/L=SomeCity/O=SomeOrganization/OU=SomeOrganizationalUnit/CN=data-bridge-test/emailAddress=root@data-bridge-test”:
    Self-signed certificate encountered.
    WARNING: certificate common name “data-bridge-test” doesn't match requested host name “data-bridge-test.cern.ch”.
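
As background to the stderr/stdout comment above: wget reports its messages on standard error, so they land in whichever log collects stderr. The sketch below shows how the two streams of a child process can be captured separately or merged into one. It uses a small stand-in child process rather than wget itself, and all names in it are made up for this example.

```python
import subprocess
import sys

# Stand-in child: writes its payload to stdout and a progress message
# to stderr, much like a downloader that logs to standard error.
CHILD = (
    "import sys;"
    "print('payload');"
    "print('progress message', file=sys.stderr)"
)

def run_separate():
    """Capture stdout and stderr as two separate strings."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          capture_output=True, text=True)
    return proc.stdout, proc.stderr

def run_merged():
    """Redirect stderr into stdout so everything lands in one log."""
    proc = subprocess.run([sys.executable, "-c", CHILD],
                          stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, text=True)
    return proc.stdout
```

When the streams are merged, the relative order of the lines depends on buffering in the child, so it should not be relied upon.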

13) Message boards : News : Server failure over the night (Message 13)
Posted 20 Feb 2015 by Hendrik
Post:
Overnight we experienced a failure of our back-end server, which feeds jobs into the VMs.
So your VM may have been sitting around doing nothing during that time...
We have now restarted the server and everything is back to normal.
Your VMs should be receiving jobs again :)

Unfortunately, at the moment we do not completely understand what caused the server to crash, but we hope to figure that out soon!

EDIT: In case your VM is still not running properly, getting a fresh VM by getting a new BOINC job/WU should solve this.
14) Message boards : News : New VM image and new console feature! (Message 12)
Posted 19 Feb 2015 by Hendrik
Post:
We have just updated the VM image, so please abort your running jobs/WUs and get a new one.
The new image should have improved stability and use your CPU cycles properly.

We have also included a new feature that lets you see information about the job, etc. on the consoles (similar to test4theory/vLHC).
You can open the VM console by clicking on the CMS-dev job in your BOINC Manager and then on the "show VM Console" button on the left. An RDP client should open automatically and connect to the VM.
Once there you can look through the different consoles. They are as follows:
1: Job output stdout [white]
2: JobAgent stdout [white]
3: top
4: Job output stderr [red]
5: JobAgent stderr [red]

On Windows you can use Ctrl-Alt-F[n] to jump between the consoles; on Linux you should try Alt-F[n].
The output is still a bit messy, but please bear with us.
A graphic version as in t4t is coming soon -- to a VM near you :D

We also just succeeded in running the VM on a Microsoft Surface, which is probably the first time ever that a CMS job was run on a Surface :D


