Message boards : News : Urgent Update for Windows Users
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 379 - Posted: 19 May 2015, 12:24:33 UTC

A bug has been found in the Windows version of BOINC: files larger than 4 GiB (2^32 bytes) are left behind in slot directories. This affects us and other BOINC projects. Unfortunately, we do produce VM files that large, so we are interfering with those other projects. If you are active on this project and run Windows, please update BOINC to a patched version (see this message and thread).

"The files that are needed to apply the hotfix are

For 64-bit BOINC
boinc.080515.x64.zip

For 32-bit BOINC
boinc.080515.x86.zip

Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards."
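
For anyone who wants to check whether a host has leftovers of the kind described above, here is a minimal sketch that walks the slot directories and lists files larger than 4 GiB. The data directory path is an assumption (the usual default on recent Windows versions) and the function name is made up for illustration; adjust both to your installation.

```python
import os
from pathlib import Path

FOUR_GIB = 2**32  # the 4 GiB threshold named in the announcement


def find_oversized_files(slots_dir, threshold=FOUR_GIB):
    """Return (path, size) for every file under slots_dir larger than threshold bytes."""
    hits = []
    for root, _dirs, files in os.walk(slots_dir):
        for name in files:
            path = Path(root) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if size > threshold:
                hits.append((str(path), size))
    # Largest files first
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed default Windows data directory; yours may differ.
    slots = r"C:\ProgramData\BOINC\slots"
    for path, size in find_oversized_files(slots):
        print(f"{size / 2**30:6.2f} GiB  {path}")
```

A large file in a slot that still belongs to a running task is expected; only files that remain after the task has finished would point to the bug.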


Thanks to the community for helping us debug this, especially Richard, Crystal Pellet and Ray, and to the BOINC crew for coming up with the fix.

In the meantime, we will move on to debugging why our VMs are currently growing so large.
ID: 379 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 383 - Posted: 20 May 2015, 20:15:07 UTC - in response to Message 379.  

In the meantime, we will move on to debugging why our VMs are currently growing so large.

Just to note that today, when I tried to reproduce the increase in "disk" size when jobs weren't being sent, I saw no increase in VM size. ??! I contacted the development team, and more jobs were created so, for now, we're crunching and the phantom disk-enlarger has been banished. Please let me/us know if it raises its head again. Cheers!
ID: 383 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 384 - Posted: 21 May 2015, 6:53:53 UTC - in response to Message 383.  

... I contacted the development team, and more jobs were created so, for now, we're crunching and the phantom disk-enlarger has been banished. Please let me/us know if it raises its head again. Cheers!

That must have been for a short period, so reproduction should be possible again:

[21/05/15 08:32:17] ERROR:root:No message received! Nothing to do!
[21/05/15 08:33:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:36:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:37:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:39:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:40:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:42:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:43:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:44:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:45:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:46:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:47:03] ERROR:root:No message received! Nothing to do!
[21/05/15 08:49:02] ERROR:root:No message received! Nothing to do!
[21/05/15 08:50:02] ERROR:root:No message received! Nothing to do!
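
A rough way to summarise log excerpts like the one above is to count the "Nothing to do" entries and measure the time they span. This is only a sketch: the function name is invented, and it assumes the timestamp format is dd/mm/yy HH:MM:SS, as the lines above suggest.

```python
from datetime import datetime

IDLE_MARKER = "No message received! Nothing to do!"


def idle_polls(lines):
    """Return the timestamps of log lines that report an empty job queue."""
    stamps = []
    for line in lines:
        if IDLE_MARKER not in line:
            continue
        # The timestamp sits between the first '[' and ']' on the line.
        raw = line[line.index("[") + 1 : line.index("]")]
        stamps.append(datetime.strptime(raw, "%d/%m/%y %H:%M:%S"))
    return stamps


# Three of the lines quoted above, used as sample input.
sample = [
    "[21/05/15 08:32:17] ERROR:root:No message received! Nothing to do!",
    "[21/05/15 08:33:03] ERROR:root:No message received! Nothing to do!",
    "[21/05/15 08:50:02] ERROR:root:No message received! Nothing to do!",
]
stamps = idle_polls(sample)
print(len(stamps), "idle polls over", stamps[-1] - stamps[0])
```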
ID: 384 · Rating: 0
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 1
Message 388 - Posted: 22 May 2015, 16:41:13 UTC - in response to Message 384.  

My current task didn't do much actual work yesterday, with the "nothing to do" and "file truncated" messages but just now it is running a job, using all of one core.
The .vdi in the slot has grown to 8.4GB! but the Actual size shown in VBox Virtual Media Manager is only 2.33GB of the 20GB allocated.
I'm hopeful that the JOB will finish before the TASK finishes in 80mins time so I can see if there is any tidy up within the VM at the end of the job.
Just before the Boinc Task finishes, I'll try to remember to copy all the logs in case they might be of interest in diagnosing the growth of the .vdi.
ID: 388 · Rating: 0
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 1
Message 389 - Posted: 22 May 2015, 18:09:04 UTC - in response to Message 388.  

.vdi didn't grow much more but I'm not sure whether it finished 1 job and started another in that time. Noticed it processing record 76 at time of last post but when it was terminated by Boinc it was on record 46 so I assume that was a different job.
Task 57571 completed and validated, completely emptied the slot and no debris in VBox.
8¬)
ID: 389 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 390 - Posted: 22 May 2015, 19:27:26 UTC - in response to Message 389.  

.vdi didn't grow much more but I'm not sure whether it finished 1 job and started another in that time. Noticed it processing record 76 at time of last post but when it was terminated by Boinc it was on record 46 so I assume that was a different job.
Task 57571 completed and validated, completely emptied the slot and no debris in VBox.
8¬)

Thanks, Ray. I think there's a way to look at the stdout log on the project website, but I don't know how, exactly. I noticed the jobs I was running today were running over 100 events rather than the 5 that were done previously. For my part of the project I need to find out more of the details of how this is all done, especially now that my unfortunate personal interruption is behind me.
ID: 390 · Rating: 0
Profile Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 1
Message 391 - Posted: 22 May 2015, 20:19:12 UTC - in response to Message 390.  
Last modified: 22 May 2015, 20:56:43 UTC

I don't know if these logs are generated if there is no active job (I suspect not) but the "Show Graphics" button in Boinc connects to the localhost and shows the Machine Logs including CMSJobAgent-stdout.log and cmsRun-stdout.log although both appear to show info only from the current "job" rather than the whole "task".
I forgot to look in there for the last Task but I'll take copies if the VM size of the current task gets excessive again.
Maybe that log could be captured and added to the final upload (if it's not already) if it's helpful?
ID: 391 · Rating: 0
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 392 - Posted: 23 May 2015, 2:44:54 UTC
Last modified: 23 May 2015, 3:27:33 UTC

Is there any point saving out disk images for you guys to look at later?
You could easily set up an FTP server or similar to receive them.
ID: 392 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 393 - Posted: 23 May 2015, 8:47:55 UTC - in response to Message 391.  

... and shows the Machine Logs including CMSJobAgent-stdout.log and cmsRun-stdout.log although both appear to show info only from the current "job" rather than the whole "task".

The cmsRun-stdout.log shows all jobs during 1 BOINC-task.
Just search for "Executing CMSSW" to count the number of jobs started.
1 job runs now much longer than before, maybe because there are now 100 records in 1 job.
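
The search suggested above can be scripted. A minimal sketch, assuming the contents of cmsRun-stdout.log have been saved or copied into a string; the function name and the sample text are invented, and only the marker string "Executing CMSSW" comes from the post.

```python
def count_jobs(log_text, marker="Executing CMSSW"):
    """Count the lines of log_text that contain marker."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Invented sample; a real log will look different.
sample_log = "\n".join([
    "placeholder line before the first job",
    "Executing CMSSW (placeholder text)",
    "placeholder event output",
    "Executing CMSSW (placeholder text)",
])
print(count_jobs(sample_log))
```

Counting lines rather than raw occurrences avoids double counting if the marker happens to appear twice on one line.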
ID: 393 · Rating: 0
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 394 - Posted: 23 May 2015, 8:57:04 UTC - in response to Message 393.  

1 job runs now much longer than before, maybe because there are now 100 records in 1 job.

My current job is very stable, well done team!
ID: 394 · Rating: 0
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 395 - Posted: 23 May 2015, 9:22:42 UTC - in response to Message 392.  

Is there any point saving out disk images for you guys to look at later?
You could easily set up an FTP server or similar to receive them.

At this point I'd say not. There were 120 tasks out in the field yesterday; at an average of 5 GB/image that's 0.6 TB/day -- and roughly a day to send them all in serially if we're limited to home broadband speeds of 60 Mbps.
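
The estimate above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 GB = 10^9 bytes), a link that sustains the full 60 Mbps, and no protocol overhead; the function name is made up for illustration.

```python
def upload_hours(tasks, gb_per_image, link_mbps):
    """Hours needed to send `tasks` images of `gb_per_image` GB each, one after another."""
    total_bits = tasks * gb_per_image * 1e9 * 8  # decimal GB -> bits
    return total_bits / (link_mbps * 1e6) / 3600


total_tb = 120 * 5 / 1000          # 120 images at 5 GB each, in TB
hours = upload_hours(120, 5, 60)   # serial upload at 60 Mbps
print(f"{total_tb} TB/day, about {hours:.1f} hours at 60 Mbps")
```

Under these idealised assumptions the transfer takes a little over 22 hours, so just under a day. Any overhead, or an uplink slower than 60 Mbps (common on home connections), pushes it past a full day.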
ID: 395 · Rating: 0
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,298
RAC: 1,931
Message 396 - Posted: 24 May 2015, 6:06:49 UTC

I decided to try this download even though the Win7's were working and after several tries all I get is this box "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem"

Which didn't work.
ID: 396 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 397 - Posted: 24 May 2015, 7:05:21 UTC - in response to Message 396.  

I decided to try this download even though the Win7's were working and after several tries all I get is this box "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem"

Which didn't work.

The correct libcurl.dll belongs to the recommended BOINC version 7.4.42 and that version should be installed first
as written here at CMS-dev in the "exceeded disk limit" thread and in the LHC-sixtrack thread "DISK LIMIT EXCEEDED".
ID: 397 · Rating: 0
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,298
RAC: 1,931
Message 399 - Posted: 24 May 2015, 20:55:57 UTC - in response to Message 397.  

I decided to try this download even though the Win7's were working and after several tries all I get is this box "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem"

Which didn't work.

The correct libcurl.dll belongs to the recommended BOINC version 7.4.42 and that version should be installed first
as written here at CMS-dev in the "exceeded disk limit" thread and in the LHC-sixtrack thread "DISK LIMIT EXCEEDED".



And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that.
ID: 399 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 400 - Posted: 24 May 2015, 23:10:07 UTC - in response to Message 399.  

And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that.

Are you sure you don't need it?

Your Einstein host 10859996 (one of the ones running v7.4.42) spat out a bunch of "Maximum disk usage exceeded" errors on 11 May.

That's the problem with this BOINC bug - it produces minimal errors here, but scatters errors over all sorts of other projects, which aren't equipped to diagnose and solve them. That's why we asked all CMS-dev Windows users to apply the hotfix, whether they perceived a problem or not. And my apologies for missing the problem with updating older versions of BOINC in my initial post.
ID: 400 · Rating: 0
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,298
RAC: 1,931
Message 401 - Posted: 25 May 2015, 1:02:30 UTC - in response to Message 400.  
Last modified: 25 May 2015, 1:10:01 UTC

And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that.

Are you sure you don't need it?

Your Einstein host 10859996 (one of the ones running v7.4.42) spat out a bunch of "Maximum disk usage exceeded" errors on 11 May.

That's the problem with this BOINC bug - it produces minimal errors here, but scatters errors over all sorts of other projects, which aren't equipped to diagnose and solve them. That's why we asked all CMS-dev Windows users to apply the hotfix, whether they perceived a problem or not. And my apologies for missing the problem with updating older versions of BOINC in my initial post.


I realize you may not have seen what I said and found regarding this but if you take a closer look since I figured out what actually happened I have not had ONE single disk error.

Just look at that particular one (Win8.1) and click on the tab for *Valid tasks.

The disk error has not happened here or at Atlas since then.

(I just turned one in and started another which is why I stopped by just now)

As I showed over on that LHC thread, I can run Atlas X2, vLHC X2, LHC X3, CMS-dev, and a GPU all on one 8-core with Win7.

Of course this *fix* should be used by those who can not do this.

You can see I do quite a lot of complete and validated tasks on every project I run.

The only error I get since then is just because VB doesn't always behave when you do a reboot for another reason but that is not the disk error.

Snapshots can be a problem and we got rid of those at vLHC with the Databridge.

(Oh and Richard......no apologies are needed )
Mad Scientist For Life
ID: 401 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 403 - Posted: 25 May 2015, 8:25:41 UTC - in response to Message 401.  

The disk error has not happened here or at Atlas since then.

I've not really been looking for errors here or at Atlas - I reckon CERN can look after its own ;)

But before urging Ivan to post this 'urgent update' news, I did spot-check a number of the top hosts here - from a number of users, not just you - and found a number of 'disk usage' errors at the other projects they run. Collatz seems to be a popular 'also runs' project for the users here, and several hosts there show the characteristic errors in their task lists.
ID: 403 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 406 - Posted: 26 May 2015, 8:26:52 UTC - in response to Message 401.  
Last modified: 26 May 2015, 8:27:57 UTC

...
Snapshots can be a problem and we got rid of those at vLHC with the Databridge.
...

Just FYI, we got rid of snapshots with a vboxwrapper upgrade on vLHC and here for all WU's, nothing to do with DataBridge.

Ben
ID: 406 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 407 - Posted: 26 May 2015, 11:47:50 UTC

For anyone who wants it, there's now a full Windows BOINC installer which incorporates the various hotfixes.

It's still very much a beta test build (v7.6.1), so be ready for unexpected surprises - but the BOINC developers want to fast-track the v7.6 line for official release, and I think they feel there's not much more work to do on it - famous last words.

One thing to watch out for - with the full installer, you get the new version of the Manager, as well as the client we need to use here. There's been a major re-organisation of the menus in Advanced view, which is disconcerting at first.

Find your preferred version in the download directory - it hasn't been added to the standard download page yet.

http://boinc.berkeley.edu/dl/?C=M;O=D
ID: 407 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 413 - Posted: 28 May 2015, 7:14:39 UTC

v7.6.1 has been withdrawn because of Manager problems with Windows 10, and replaced with v7.6.2.

Now available via http://boinc.berkeley.edu/download_all.php
ID: 413 · Rating: 0
©2024 CERN