Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,545
RAC: 1,472
Message 1533 - Posted: 1 Jan 2016, 10:10:42 UTC

Because the CMS-dev project needs a continuous connection to the server, and a task can't be suspended for a while without causing a job failure,
the CMS-dev project is not suitable for the average BOINC cruncher.

Example of a job with a suspend of about 35 minutes: job 1362 failed.

Begin processing the 71st record. Run 1, Event 408371, LumiSection 4084 at 01-Jan-2016 09:47:56.291 CET
Begin processing the 72nd record. Run 1, Event 408372, LumiSection 4084 at 01-Jan-2016 09:48:18.781 CET
Begin processing the 73rd record. Run 1, Event 408373, LumiSection 4084 at 01-Jan-2016 09:48:21.368 CET
Begin processing the 74th record. Run 1, Event 408374, LumiSection 4084 at 01-Jan-2016 09:49:32.293 CET

BOINC - CMS-dev 01 Jan 09:51:18 task CMS_29334_1427806586.202274_1 suspended by user
BOINC - CMS-dev 01 Jan 10:27:41 task CMS_29334_1427806586.202274_1 resumed by user

Begin processing the 75th record. Run 1, Event 408375, LumiSection 4084 at 01-Jan-2016 10:28:01.673 CET
Begin processing the 76th record. Run 1, Event 408376, LumiSection 4084 at 01-Jan-2016 10:28:02.797 CET
Begin processing the 77th record. Run 1, Event 408377, LumiSection 4084 at 01-Jan-2016 10:28:22.978 CET
Begin processing the 78th record. Run 1, Event 408378, LumiSection 4084 at 01-Jan-2016 10:28:29.289 CET
Begin processing the 79th record. Run 1, Event 408379, LumiSection 4084 at 01-Jan-2016 10:28:41.848 CET
Begin processing the 80th record. Run 1, Event 408380, LumiSection 4084 at 01-Jan-2016 10:28:50.343 CET
Begin processing the 81st record. Run 1, Event 408381, LumiSection 4084 at 01-Jan-2016 10:28:52.019 CET
Begin processing the 82nd record. Run 1, Event 408382, LumiSection 4084 at 01-Jan-2016 10:29:07.634 CET
Begin processing the 83rd record. Run 1, Event 408383, LumiSection 4084 at 01-Jan-2016 10:29:14.725 CET
Begin processing the 84th record. Run 1, Event 408384, LumiSection 4084 at 01-Jan-2016 10:29:21.383 CET


Unnecessary kill of job 1362, followed by a request for a new job:

01/01/16 10:28:28 (pid:11867) CCBListener: no activity from CCB server in 2359s; assuming connection is dead.
01/01/16 10:28:28 (pid:11867) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
01/01/16 10:29:29 (pid:11867) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#85910
01/01/16 10:29:53 (pid:11867) Got SIGQUIT. Performing fast shutdown.
01/01/16 10:29:53 (pid:11867) ShutdownFast all jobs.
01/01/16 10:29:53 (pid:11867) Process exited, pid=11871, signal=9
01/01/16 10:29:53 (pid:11867) Last process exited, now Starter is exiting
01/01/16 10:29:53 (pid:11867) **** condor_starter (condor_STARTER) pid 11867 EXITING WITH STATUS 0
01/01/16 10:29:54 (pid:13681) FILETRANSFER: "/home/boinc/CMSRun/glide_lneklf/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/01/16 10:29:54 (pid:13681) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_lneklf/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_lneklf/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/01/16 10:29:54 (pid:13681) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_lneklf/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
01/01/16 10:30:43 (pid:13722) ******************************************************
01/01/16 10:30:43 (pid:13722) ** condor_starter (CONDOR_STARTER) STARTING UP
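
The timing in the two logs above can be checked with a few lines of Python. This is only a sketch: the regular expressions are assumptions fitted to the exact lines quoted in this post, not a general parser for BOINC or HTCondor logs, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Lines copied from the BOINC event log quoted above.
boinc_log = """\
BOINC - CMS-dev 01 Jan 09:51:18 task CMS_29334_1427806586.202274_1 suspended by user
BOINC - CMS-dev 01 Jan 10:27:41 task CMS_29334_1427806586.202274_1 resumed by user
"""

# Line copied from the condor starter log quoted above.
ccb_line = ("01/01/16 10:28:28 (pid:11867) CCBListener: no activity from "
            "CCB server in 2359s; assuming connection is dead.")

# Assumed pattern: "<day> <month> <hh:mm:ss> task <name> suspended|resumed by user"
EVENT = re.compile(
    r"(\d{2} \w{3} \d{2}:\d{2}:\d{2}) task \S+ (suspended|resumed) by user"
)


def suspend_gap_seconds(log_text, year=2016):
    """Return the seconds between the 'suspended' and 'resumed' events."""
    times = {}
    for match in EVENT.finditer(log_text):
        stamp, kind = match.groups()
        # The log has no year, so it is supplied explicitly.
        times[kind] = datetime.strptime(f"{year} {stamp}", "%Y %d %b %H:%M:%S")
    return int((times["resumed"] - times["suspended"]).total_seconds())


def ccb_idle_seconds(line):
    """Return the idle time reported by the CCBListener message."""
    return int(re.search(r"no activity from CCB server in (\d+)s", line).group(1))


gap = suspend_gap_seconds(boinc_log)
idle = ccb_idle_seconds(ccb_line)
print(f"suspended for {gap} s ({gap // 60} min {gap % 60} s)")
print(f"CCB idle time reported: {idle} s ({idle // 60} min {idle % 60} s)")
```

The idle time reported by CCBListener is somewhat longer than the suspend itself, which would be consistent with the last server activity having happened shortly before the task was suspended.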
ID: 1533 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1536 - Posted: 1 Jan 2016, 17:53:12 UTC - in response to Message 1533.  

Thanks for the report. I'll look into that next week, if one of my colleagues doesn't beat me to it. I didn't think it was that sensitive.
ID: 1536 · Rating: 0
Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1537 - Posted: 2 Jan 2016, 9:43:29 UTC

Rebooting a host causes problems. (Win10 does this randomly for 'updates'.)
BOINC Manager switching out a CMS task causes problems.
Network bandwidth limitation causes problems.
Corrupt VBox image causes problems. (Generally, VBox causes problems, but that's another topic.)
Job running at 24-hour mark causes problems.
Laptop going to sleep or moving away from network causes problems.
Credit issue, while not a problem YET, certainly will be once "live", especially if a single failed job in a task affects credit.

I see a few "implementation" areas that can be fixed to solve some of these, and I think the project has the people with the expertise to do so - but I think the bottom line is that the current DESIGN of the project is not very suitable for "most" BOINC users. I firmly believe that two fixes are needed - 1) each task should have a "set number" of jobs, pre-loaded, with results sent at end of task, and 2) tasks should be much shorter than 24 hours. I get the "overhead" issue, but there are projects that send me tasks that are completed in 13 SECONDS! CPDN and PrimeGrid have tasks that run for days, but they don't have any network requirements. A CMS task running for 24 hours just invites any of the above issues to cause a failure.

Those of us running this now are obviously enough "into" BOINC that our systems are relatively stable and productive, yet you're already seeing a very high level of failures, especially in stage-out. This is my first VBox project, so I can't compare how LHC is to CMS, but IMHO as a 12-year BOINCer and long-time programmer/project lead, the project as it stands is a long way from "production ready". I'm not giving up, I would really like to see this succeed! This is the purpose of doing a "DEV" trial run in the first place, to find just these kinds of issues. The title of this thread is, unfortunately, all too true at present.
ID: 1537 · Rating: 0
Jim1348

Joined: 17 Aug 15
Posts: 17
Credit: 228,358
RAC: 0
Message 1538 - Posted: 2 Jan 2016, 15:42:22 UTC
Last modified: 2 Jan 2016, 15:43:46 UTC

This is a timely subject, as I just turned off my CERN machine that normally runs 24/7 in order to move it to another part of the basement. It was off for about 10 minutes, and doesn't seem to have done it any harm:
http://boincai05.cern.ch/CMS-dev/results.php?hostid=688
http://lhcathome2.cern.ch/vLHCathome/results.php?hostid=85673
http://atlasathome.cern.ch/results.php?hostid=17032

But this machine is a bit unusual in using a large write-cache (14 GB PrimoCache), intended to protect the SSD from the high write rate of ATLAS, among others. However, I have found that it reduces error rates on ATLAS and a variety of other BOINC projects as well, especially CPDN for example. And that leaves me 18 GB working memory, enough for any of the CERN projects on 6 cores of my i7-4790 while reserving 2 cores for the GPUs. Also, I use an uninterruptible power supply to prevent cache corruption in case of power outages, which also helps prevent errors on various BOINC projects.
ID: 1538 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1539 - Posted: 2 Jan 2016, 16:18:26 UTC - in response to Message 1538.  

Hi Jim1348,
Even if CMS has no work, or your computers produce only errors, your BOINC credits will still be the same.
This is different from other projects. The BOINC credits mean absolutely nothing (only that you have been connected to the CMS server).

Regarding the cache: have you tried setting Windows to do more caching?

Command line (as administrator): fsutil behavior set memoryusage 2

Restart required. To set it back: same command, but change the 2 to a 1.

This way, all the memory remains under OS control (for apps, if required), but Windows does more caching.
It is up to you.
ID: 1539 · Rating: 0
Jim1348

Joined: 17 Aug 15
Posts: 17
Credit: 228,358
RAC: 0
Message 1540 - Posted: 2 Jan 2016, 20:33:25 UTC - in response to Message 1539.  
Last modified: 2 Jan 2016, 20:51:40 UTC

Fsutil might help a little, but I don't see that it can cache anywhere near enough for my purposes. I want the writes associated with BOINC to go to main memory, to protect the SSD. I use either a ramdisk for that purpose, placing the BOINC Data folder on the ramdisk, or else a write cache. If it is large enough, a write cache accomplishes basically the same thing as a ramdisk, and is a little easier to install since you don't have to change the default location of the BOINC Data folder.

I normally set the size of the cache or the ramdisk from around 11 GB to 18 GB, which is enough for running my BOINC projects on 6 cores. Smaller BOINC projects could be done with much less memory, but they often don't write as much, and hence are less of a problem for reducing SSD lifetime.

Some of the SSDs now have their own caches included in their utilities; Samsung Rapid Mode cache is sized at 1 GB, and the Crucial Momentum cache can go as high as 4 GB if you have enough memory. They might be enough to protect your SSD for most purposes, though you would still be writing to the disk some. And their cache is split between read and write caches in some undisclosed manner, perhaps based on usage, but for protecting SSDs only the write caching is relevant. Note that you can still read out the data written into a write cache; it just does not automatically place data in the cache due to reads, but reads do not produce the wear on the SSDs.

Thanks for the tip about the credits. I will look into it further now.
ID: 1540 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1541 - Posted: 2 Jan 2016, 20:55:47 UTC - in response to Message 1537.  
Last modified: 2 Jan 2016, 21:11:24 UTC

Regarding the high write activity of ATLAS:
I upgraded the vboxwrapper to ver. 26178 a while ago. This drastically reduces the write activity (for all VBox projects; CMS integrated it shortly after it was available).
ATLAS has been very slow in implementing this (if at all).
(NEVER MIND, THEY FINALLY IMPLEMENTED IT)

I prefer the download, crunch, upload method.
The VBox approach with constant network activity is way too susceptible to errors.

But the project is what it is, so we have to live with it.
ID: 1541 · Rating: 0
Jim1348

Joined: 17 Aug 15
Posts: 17
Credit: 228,358
RAC: 0
Message 1556 - Posted: 4 Jan 2016, 19:22:20 UTC

I took my machine down two or three times this morning to replace fans. The downtime was maybe from 10 minutes to 40 minutes or so each time. It looks like one of them got it.
http://boincai05.cern.ch/CMS-dev/result.php?resultid=74935
ID: 1556 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1557 - Posted: 4 Jan 2016, 20:04:58 UTC - in response to Message 1556.  

I took my machine down two or three times this morning to replace fans. The downtime was maybe from 10 minutes to 40 minutes or so each time. It looks like one of them got it.
http://boincai05.cern.ch/CMS-dev/result.php?resultid=74935

Just as a matter of interest, how did you shut down? I took down my Linux box a few times this morning to add Seti@Home V8 to the app_info.xml and I noticed it took more than a few seconds for VBoxHeadless to wrap up its shutdown. Let me just do some serial logins (home->work->CMS machine->RAL)...

OK, nothing but status 0 all day:
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3156.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 09:22:27 GMT 2016 on 9-22-30202 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3215.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 11:57:00 GMT 2016 on 9-22-17724 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3277.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 14:41:02 GMT 2016 on 9-22-23119 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3337.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 16:28:48 GMT 2016 on 9-22-23119 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3292.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 18:16:12 GMT 2016 on 9-22-23119 with (short) status 0 ========


Nothing definitive of course, but it shows that errors are not inevitable. Looking more closely at your log, it looks like a VM error rather than a CMS problem per se but I'm not knowledgeable enough about VMs to give any definitive diagnosis.
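
For anyone wanting to tally exit statuses from output like the lines above, here is a minimal Python sketch. The regular expression is an assumption fitted to the five lines quoted in this post, and the meaning of the field after "on" (e.g. 9-22-30202) is not stated in the thread, so it is simply carried along as an opaque string.

```python
import re
from collections import Counter

# Lines copied verbatim from the output quoted above.
grep_output = """\
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3156.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 09:22:27 GMT 2016 on 9-22-30202 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3215.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 11:57:00 GMT 2016 on 9-22-17724 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3277.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 14:41:02 GMT 2016 on 9-22-23119 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3337.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 16:28:48 GMT 2016 on 9-22-23119 with (short) status 0 ========
151230_023405:ireid_crab_CMS_at_Home_MinBias_300ev/job_out.3292.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 4 18:16:12 GMT 2016 on 9-22-23119 with (short) status 0 ========
"""

# Assumed layout: job_out.<job>.<retry>.txt: ... FINISHING at <time> on <where>
# with (short) status <code>
FINISH = re.compile(
    r"job_out\.(\d+)\.\d+\.txt:.*FINISHING at (.+?) on (\S+) "
    r"with \(short\) status (\d+)"
)


def summarize(text):
    """Count exit statuses and collect the jobs that did not finish with 0."""
    counts = Counter()
    failed = []
    for line in text.splitlines():
        match = FINISH.search(line)
        if match is None:
            continue  # not a FINISHING line
        job, when, where, status = match.groups()
        counts[int(status)] += 1
        if int(status) != 0:
            failed.append((int(job), when, where))
    return counts, failed


counts, failed = summarize(grep_output)
print(dict(counts))
print(failed)
```

With the five lines above, every job lands under status 0 and the list of failed jobs is empty, matching the "nothing but status 0 all day" observation.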
ID: 1557 · Rating: 0
Jim1348

Joined: 17 Aug 15
Posts: 17
Credit: 228,358
RAC: 0
Message 1558 - Posted: 4 Jan 2016, 20:25:59 UTC - in response to Message 1557.  
Last modified: 4 Jan 2016, 20:40:33 UTC

I just did a normal software shutdown via the Windows Start button each time. I don't think there were any glitches, though I was working on the fans while it ran to some extent. If there had been any real crashes, they would have shown up in the Event Viewer, but that just shows normal shutdowns; certainly no BSODs.

I do get a somewhat curious error, which I have posted on before, about
Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 1/4/2016 4:17:39 AM
Event ID: 19
Task Category: None
Level: Warning
Keywords:
User: LOCAL SERVICE
Computer: i7-4790-PC
Description:
A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 4


But that seems to be a normal bug on Haswell machines running VMs and either CMS-dev or vLHC; I am not sure which. It does not cause any problems that I can see.

EDIT: I do manage that PC over the LAN using TightVNC; there is normally no monitor attached. I don't see that that affects the shutdown though.
ID: 1558 · Rating: 0
