Urgent Update for Windows Users
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1128 Credit: 7,870,419 RAC: 595 |
A bug has been found in the Windows version of BOINC: files larger than 4 GiB (2^32 bytes) are left behind in slot directories. This affects us and other BOINC projects. Unfortunately, we do produce VM files that large, so we are interfering with those other projects. If you are active on this project and use Windows, please update BOINC to a patched version (see this message and thread). "The files that are needed to apply the hotfix are: for 64-bit BOINC, boinc.080515.x64.zip; for 32-bit BOINC, boinc.080515.x86.zip. Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards." Thanks to the community for helping us debug this, especially Richard, Crystal Pellet and Ray, and to the BOINC crew for coming up with the fix. In the meantime, we will move on to debugging why our VMs are currently growing so large. |
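For anyone who wants to check whether oversized files have been left behind, the following is a minimal sketch that lists files above the 4 GiB (2^32 bytes) limit mentioned in the post. The function name `find_large_files` is hypothetical, and the slots path in the example is an assumption about a default Windows install; adjust it to wherever your BOINC data directory actually lives.

```python
from pathlib import Path

FOUR_GIB = 2**32  # 4 GiB in bytes, the size limit named in the post


def find_large_files(root, threshold=FOUR_GIB):
    """Return (path, size) pairs for files under `root` larger than `threshold` bytes."""
    hits = []
    for path in Path(root).rglob("*"):
        try:
            if path.is_file():
                size = path.stat().st_size
                if size > threshold:
                    hits.append((path, size))
        except OSError:
            # Skip files that vanish or cannot be read while scanning.
            continue
    # Largest files first.
    return sorted(hits, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Assumed location of the BOINC slot directories; change this for your setup.
    slots = Path(r"C:\ProgramData\BOINC\slots")
    for path, size in find_large_files(slots):
        print(f"{size / 2**30:6.2f} GiB  {path}")
```

The script only reports files; it does not delete anything, so it is safe to run while deciding what to clean up by hand.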
Send message Joined: 20 Jan 15 Posts: 1128 Credit: 7,870,419 RAC: 595 |
"In the meantime, we will move on to debugging why our VMs are currently growing so large." Just to note that today, when I tried to reproduce the increase in "disk" size when jobs weren't being sent, I saw no increase in VM size. ??! I contacted the development team, and more jobs were created so, for now, we're crunching and the phantom disk-enlarger has been banished. Please let me/us know if it raises its head again. Cheers! |
Send message Joined: 13 Feb 15 Posts: 1178 Credit: 810,985 RAC: 2,009 |
... I contacted the development team, and more jobs were created so, for now, we're crunching and the phantom disk-enlarger has been banished. Please let me/us know if it raises its head again. Cheers! That must have been for a short period, so reproduction should be possible again: [21/05/15 08:32:17] ERROR:root:No message received! Nothing to do! [21/05/15 08:33:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:36:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:37:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:39:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:40:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:42:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:43:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:44:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:45:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:46:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:47:03] ERROR:root:No message received! Nothing to do! [21/05/15 08:49:02] ERROR:root:No message received! Nothing to do! [21/05/15 08:50:02] ERROR:root:No message received! Nothing to do! |
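A log excerpt like the one above can be summarised with a few lines of Python. This is only a sketch: the function name `summarize_idle` is hypothetical, and it assumes the timestamps are in day/month/two-digit-year order, which matches the May 2015 dates elsewhere in this thread.

```python
import re
from datetime import datetime

# Matches the timestamp of each "No message received!" line, as quoted in the post.
IDLE_LINE = re.compile(
    r"\[(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2})\] ERROR:root:No message received!"
)


def summarize_idle(log_text):
    """Return (count, span) for the idle messages found in `log_text`.

    `span` is the time between the first and last idle message,
    or None if no idle message was found.
    """
    stamps = [
        datetime.strptime(stamp, "%d/%m/%y %H:%M:%S")  # assumed dd/mm/yy order
        for stamp in IDLE_LINE.findall(log_text)
    ]
    if not stamps:
        return 0, None
    return len(stamps), max(stamps) - min(stamps)
```

Passing the whole log text returns how many idle polls were logged and how long the idle period lasted, which may help when reporting how long a VM sat without work.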
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,945,852 RAC: 1 |
My current task didn't do much actual work yesterday, showing only the "nothing to do" and "file truncated" messages, but just now it is running a job and using all of one core. The .vdi in the slot has grown to 8.4 GB, but the actual size shown in the VBox Virtual Media Manager is only 2.33 GB of the 20 GB allocated. I'm hopeful that the JOB will finish before the TASK finishes in 80 minutes' time, so I can see whether there is any tidy-up within the VM at the end of the job. Just before the BOINC task finishes, I'll try to remember to copy all the logs in case they are of interest in diagnosing the growth of the .vdi. |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,945,852 RAC: 1 |
.vdi didn't grow much more but I'm not sure whether it finished 1 job and started another in that time. Noticed it processing record 76 at time of last post but when it was terminated by Boinc it was on record 46 so I assume that was a different job. Task 57571 completed and validated, completely emptied the slot and no debris in VBox. 8¬) |
Send message Joined: 20 Jan 15 Posts: 1128 Credit: 7,870,419 RAC: 595 |
".vdi didn't grow much more but I'm not sure whether it finished 1 job and started another in that time. Noticed it processing record 76 at time of last post but when it was terminated by Boinc it was on record 46 so I assume that was a different job." Thanks, Ray. I think there's a way to look at the stdout log on the project website, but I don't know how, exactly. I noticed the jobs I was running today were running over 100 events rather than the 5 that were done previously. For my part of the project I need to find out more of the details of how this is all done, especially now that my unfortunate personal interruption is behind me. |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,945,852 RAC: 1 |
I don't know if these logs are generated if there is no active job (I suspect not) but the "Show Graphics" button in Boinc connects to the localhost and shows the Machine Logs including CMSJobAgent-stdout.log and cmsRun-stdout.log although both appear to show info only from the current "job" rather than the whole "task". I forgot to look in there for the last Task but I'll take copies if the VM size of the current task gets excessive again. Maybe that log could be captured and added to the final upload (if it's not already) if it's helpful? |
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
Is there any point saving out disk images for you guys to look at later? You could easily set up an FTP server or similar to receive them. |
Send message Joined: 13 Feb 15 Posts: 1178 Credit: 810,985 RAC: 2,009 |
... and shows the Machine Logs including CMSJobAgent-stdout.log and cmsRun-stdout.log although both appear to show info only from the current "job" rather than the whole "task". The cmsRun-stdout.log shows all jobs during 1 BOINC-task. Just search for "Executing CMSSW" to count the number of jobs started. 1 job runs now much longer than before, maybe because there are now 100 records in 1 job. |
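The search suggested above can be scripted. This is a minimal sketch, assuming you have saved a copy of cmsRun-stdout.log locally; the function names `count_jobs` and `count_jobs_in_file` are hypothetical, and the only detail taken from the post is the "Executing CMSSW" marker.

```python
def count_jobs(log_text, marker="Executing CMSSW"):
    """Count how many times the job-start marker appears in the log text."""
    return log_text.count(marker)


def count_jobs_in_file(path, marker="Executing CMSSW"):
    """Count job starts in a saved log file, reading it line by line."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return sum(line.count(marker) for line in fh)
```

For example, `count_jobs_in_file("cmsRun-stdout.log")` would give the number of jobs started during one BOINC task, provided the file is in the current directory.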
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
"1 job runs now much longer than before, maybe because there are now 100 records in 1 job." My current job is very stable, well done team! |
Send message Joined: 20 Jan 15 Posts: 1128 Credit: 7,870,419 RAC: 595 |
"Is there any point saving out disk images for you guys to look at later?" At this point I'd say not. There were 120 tasks out in the field yesterday; at an average of 5 GB/image that's 0.6 TB/day -- and roughly a day to send them all in serially if we're limited to home broadband speeds of 60 Mbps. |
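The estimate above can be checked with a quick calculation. This sketch assumes decimal units (1 GB = 10^9 bytes, 1 Mbps = 10^6 bits per second) and an ideal link with no protocol overhead; the function names are hypothetical.

```python
def total_terabytes(tasks, gb_per_image):
    """Total data volume in TB, using decimal units (1 TB = 1000 GB)."""
    return tasks * gb_per_image / 1000


def upload_hours(tasks, gb_per_image, link_mbps):
    """Hours needed to send every image one after another over a single link."""
    total_bits = tasks * gb_per_image * 1e9 * 8  # GB -> bytes -> bits
    seconds = total_bits / (link_mbps * 1e6)     # Mbps -> bits per second
    return seconds / 3600


if __name__ == "__main__":
    print(f"{total_terabytes(120, 5):.1f} TB")      # 0.6 TB
    print(f"{upload_hours(120, 5, 60):.1f} hours")  # 22.2 hours
```

At the ideal line rate the figure comes out near 22 hours, so once real-world overhead is counted, a day's worth of images would take about a day to upload.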
Send message Joined: 8 Apr 15 Posts: 734 Credit: 11,558,298 RAC: 1,931 |
I decided to try this download even though the Win7's were working, and after several tries all I get is this box: "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem". Reinstalling didn't work. |
Send message Joined: 13 Feb 15 Posts: 1178 Credit: 810,985 RAC: 2,009 |
I decided to try this download even though the Win7's were working and after several tries all I get is this box "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem" The correct libcurl.dll belongs to the recommended BOINC version 7.4.42 and that version should be installed first as written here at CMS-dev in the "exceeded disk limit" thread and in the LHC-sixtrack thread "DISK LIMIT EXCEEDED". |
Send message Joined: 8 Apr 15 Posts: 734 Credit: 11,558,298 RAC: 1,931 |
I decided to try this download even though the Win7's were working and after several tries all I get is this box "The program can't start because libcurl.dll is missing from your computer. Try reinstalling the program to fix this problem" And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that. Are you sure you don't need it? Your Einstein host 10859996 (one of the ones running v7.4.42) spat out a bunch of "Maximum disk usage exceeded" errors on 11 May. That's the problem with this BOINC bug - it produces minimal errors here, but scatters errors over all sorts of other projects, which aren't equipped to diagnose and solve them. That's why we asked all CMS-dev Windows users to apply the hotfix, whether they perceived a problem or not. And my apologies for missing the problem with updating older versions of BOINC in my initial post. |
Send message Joined: 8 Apr 15 Posts: 734 Credit: 11,558,298 RAC: 1,931 |
And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that. I realize you may not have seen what I said and found regarding this, but if you take a closer look, since I figured out what actually happened I have not had ONE single disk error. Just look at that particular one (Win8.1) and click on the tab for *Valid tasks. The disk error has not happened here or at Atlas since then. (I just turned one in and started another, which is why I stopped by just now.) As I showed over on that LHC thread, I can run Atlas X2, vLHC X2, LHC X3, CMS-dev, and a GPU all on one 8-core with Win7. Of course this *fix* should be used by those who cannot do this. You can see I do quite a lot of complete and validated tasks on every project I run. The only error I get since then is just because VB doesn't always behave when you do a reboot for another reason, but that is not the disk error. Snapshots can be a problem and we got rid of those at vLHC with the Databridge. (Oh, and Richard......no apologies are needed.) Mad Scientist For Life |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
The disk error has not happened here or at Atlas since then. I've not really been looking for errors here or at Atlas - I reckon CERN can look after its own ;) But before urging Ivan to post this 'urgent update' news, I did spot-check a number of the top hosts here - from a number of users, not just you - and found a number of 'disk usage' errors at the other projects they run. Collatz seems to be a popular 'also runs' project for the users here, and several hosts there show the characteristic errors in their task lists. |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
... Just FYI, we got rid of snapshots with a vboxwrapper upgrade on vLHC and here for all WU's, nothing to do with DataBridge. Ben |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
For anyone who wants it, there's now a full Windows BOINC installer which incorporates the various hotfixes. It's still very much a beta test build (v7.6.1), so be ready for unexpected surprises - but the BOINC developers want to fast-track the v7.6 line for official release, and I think they feel there's not much more work to do on it - famous last words. One thing to watch out for - with the full installer, you get the new version of the Manager, as well as the client we need to use here. There's been a major re-organisation of the menus in Advanced view, which is disconcerting at first. Find your preferred version in the download directory - it hasn't been added to the standard download page yet. http://boinc.berkeley.edu/dl/?C=M;O=D |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
v7.6.1 has been withdrawn because of Manager problems with Windows 10, and replaced with v7.6.2. It is now available via http://boinc.berkeley.edu/download_all.php |
©2024 CERN