41) Message boards : News : Urgent Update for Windows Users (Message 400)
Posted 24 May 2015 by Richard Haselgrove
Post:
And now on this thread where it belongs......I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that.

Are you sure you don't need it?

Your Einstein host 10859996 (one of the ones running v7.4.42) spat out a bunch of "Maximum disk usage exceeded" errors on 11 May.

That's the problem with this BOINC bug - it produces minimal errors here, but scatters errors over all sorts of other projects, which aren't equipped to diagnose and solve them. That's why we asked all CMS-dev Windows users to apply the hotfix, whether they perceived a problem or not. And my apologies for missing the problem with updating older versions of BOINC in my initial post.
42) Message boards : Number crunching : exceeded disk limit (Message 378)
Posted 19 May 2015 by Richard Haselgrove
Post:
OK, messages sent to Ivan, Laurence and Hendrik.

The files that are needed to apply the hotfix are

For 64-bit BOINC
boinc.080515.x64.zip

For 32-bit BOINC
boinc.080515.x86.zip

Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards.
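
For anyone who would rather script that step, here is a minimal sketch, not an official installer. It assumes the BOINC client has already been stopped. The function name, the backup folder and all paths are hypothetical, and the backup is an extra precaution rather than part of the instructions above.

```python
import shutil
import zipfile
from pathlib import Path


def apply_hotfix(zip_path, boinc_dir, backup_dir):
    """Copy every file in the archive into boinc_dir, backing up anything it replaces.

    Assumption: the BOINC client is stopped before this runs.
    """
    boinc_dir, backup_dir = Path(boinc_dir), Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    replaced = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            name = Path(info.filename).name  # flatten any folder inside the zip
            target = boinc_dir / name
            if target.exists():
                shutil.copy2(target, backup_dir / name)  # keep the old file
            with archive.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            replaced.append(name)
    return sorted(replaced)
```

Call it with the path to the downloaded .zip, your BOINC program folder and a backup folder of your choice, then restart the client.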
43) Message boards : Number crunching : exceeded disk limit (Message 377)
Posted 19 May 2015 by Richard Haselgrove
Post:
I really think we need to get a grip on this problem.

To recap: there's a bug in the BOINC client which means it fails to delete files larger than 4 GB when it should.

This project is (still, today) producing files larger than 4 GB.

When those files are left lying around, we cause errors for every other BOINC project that our computers may be attached to. That's not nice. Eric M has had to put up a front-page warning at LHC classic, so he can get on with his work.

The cure is simple and permanent: apply the 080515 hotfix BOINC client. But I had a look through the top 200 hosts yesterday (pretty much the active user base here), and only 9 of the 154 Windows machines had the hotfix applied - take a bow, rbpeake, Crystal Pellet, Ray Murray, and m. (the other three were mine).

Since the message clearly isn't getting through, even to people who have posted in this thread, I'm going to send a PM to the admins asking them to reinforce the message via a front-page news item and BOINC 'Notice': and if that doesn't work, to ask them to enforce a minimum BOINC version of 7.5.1 for Windows computers attached to this project.
44) Message boards : Number crunching : exceeded disk limit (Message 363)
Posted 10 May 2015 by Richard Haselgrove
Post:
Well, it seems to be running as well as the others at the moment:

[image]

- we won't know for certain until the back end starts supplying jobs again, of course.

CMS only supplies 64-bit apps, sure - but the app it supplies is BOINC's 64-bit VBox wrapper. I simply set up an app_info file and substituted the 32-bit wrapper files, and off it went. The whole point of a VM is that the guest OS doesn't have to match the host: my hardware on this host is fully 64-bit capable, and includes the virtualization hooks to enable VBox to run.

I was aware you used the app_info.xml, vbox32 and BOINC32, but could not see if your machine was 64-bit capable.
I suppose you also placed your 4.79GB vdi-file in the project folder and linked to it in your app_info, so you don't have to wait whether the size of the vdi is increased >4GB.

Well, I did a quick cheat-test by copying the 4.79 GB file I saved a couple of days ago to the 32-bit machine, and putting it in another project's slot directory. 'Exceeded disk limit' came up with the right numbers, the task was aborted and the file deleted, and everything carried on working properly.
45) Message boards : Number crunching : exceeded disk limit (Message 361)
Posted 10 May 2015 by Richard Haselgrove
Post:
Well, it seems to be running as well as the others at the moment:



- we won't know for certain until the back end starts supplying jobs again, of course.

CMS only supplies 64-bit apps, sure - but the app it supplies is BOINC's 64-bit VBox wrapper. I simply set up an app_info file and substituted the 32-bit wrapper files, and off it went. The whole point of a VM is that the guest OS doesn't have to match the host: my hardware on this host is fully 64-bit capable, and includes the virtualization hooks to enable VBox to run.
46) Message boards : Number crunching : exceeded disk limit (Message 359)
Posted 10 May 2015 by Richard Haselgrove
Post:
David thanks us all, but implicitly asks us to keep our eyes open for any other glitches:
That's good news.
I guess that CERN's VM images are the first > 4GB files that BOINC has dealt with.
Not surprising that there were glitches (and we may find others).
-- David

Following a question from Sekerob, I've tested that my 4.79 GB .vdi file was properly detected and deleted under a 32-bit Windows OS and the 32-bit version of the private drop. I've now set host 393 up to run under the 32-bit versions of VBox wrapper, but otherwise mimicking the stock delivery - a 1-hour first test run seems to have started normally, and has reached the 'file truncated' loop. If this run works, I'll let the next one run the full 24 hours, but suspend work fetch after that (I have to be away for a few days next week).
47) Message boards : Number crunching : exceeded disk limit (Message 358)
Posted 10 May 2015 by Richard Haselgrove
Post:
It was rather a long way down the list.

Alphabetical order.

Recent versions of BOINC auto-generate a fully-populated cc_config.xml template, also in alphabetic order, when some options are set via the GUI. That caused problems for some users, who added their own tags manually and ended up with duplicate entries. Best to keep the strict alphabetic order for both the configuration file and its documentation.
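
A quick way to spot that kind of duplicate entry is sketched below. It is only a sketch: the sample file contents are invented for illustration, and since a few options are allowed to appear more than once, anything it reports is a candidate to check by eye rather than a definite error.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_tags(xml_text):
    """Return {section: [tags appearing more than once]} for each top-level section."""
    root = ET.fromstring(xml_text)
    result = {}
    for section in root:
        counts = Counter(child.tag for child in section)
        dupes = sorted(tag for tag, n in counts.items() if n > 1)
        if dupes:
            result[section.tag] = dupes
    return result


# Invented sample: one tag added by hand on top of a generated template.
sample = """<cc_config>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <report_results_immediately>0</report_results_immediately>
    <max_file_xfers>4</max_file_xfers>
  </options>
</cc_config>"""
print(duplicate_tags(sample))  # {'options': ['max_file_xfers']}
```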
48) Message boards : Number crunching : exceeded disk limit (Message 355)
Posted 10 May 2015 by Richard Haselgrove
Post:
Yes - that's what I added yesterday, in response to the question ;-) (check the Wiki history tab!)

It's based on my own observation, rather than formal guidance from the developers - it wasn't documented when the question was asked.
49) Message boards : Number crunching : exceeded disk limit (Message 353)
Posted 10 May 2015 by Richard Haselgrove
Post:
My second machine also transitioned gracefully from CMS to another project and back again, without errors - but since it happened at 02:30 in the morning, I didn't see what the final image file size was.

I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro.

I couldn't find any equivalent number in the current file-set, but I did find

<job_duration>86400</job_duration>

in CMS_26_03_2015a.xml (project folder). Changing that to 43200 didn't affect the running task, but it did start the next one on course for what looks like being a 12-hour run - so the transitions will happen in daylight, for a while at least.
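
The same edit can be sketched in a few lines, working on the file's text so that nothing else in it is disturbed. The function name is hypothetical and the rest of the file's layout is not assumed; only the <job_duration> element quoted above is touched.

```python
import re


def set_job_duration(xml_text, seconds):
    """Replace the value inside the first <job_duration> element."""
    pattern = r"(<job_duration>)\s*\d+\s*(</job_duration>)"
    new_text, count = re.subn(
        pattern, rf"\g<1>{int(seconds)}\g<2>", xml_text, count=1
    )
    if count == 0:
        raise ValueError("no <job_duration> element found")
    return new_text


# 43200 seconds is 12 hours; 86400 is the 24-hour default seen above.
print(set_job_duration("<job_duration>86400</job_duration>", 12 * 3600))
```

Read the file, pass its text through the function, and write the result back while the client is stopped. As noted above, it is the next task that picks up the change, not the one already running.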

Another useful question turned up on the BOINC message boards yesterday:

What is this: vbox_windows set to 0
50) Message boards : Number crunching : exceeded disk limit (Message 351)
Posted 9 May 2015 by Richard Haselgrove
Post:
And the same here, with the image file close to the 5.7 GB that was reported in the error reports that brought me here.

09/05/2015 19:46:19 | CMS-dev | Message from task: 0
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: handle_exited_app()
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_finish_called
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_task_state.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/output
09/05/2015 19:46:19 | | [slot] removed file slots/1/stderr.txt
09/05/2015 19:46:19 | | [slot] removed file slots/1/VBox.log
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.exe
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.pdb
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_checkpoint.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_job.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_remote_desktop.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_webapi.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi
09/05/2015 19:46:19 | CMS-dev | Computation for task CMS_30909_1427806622.286095_0 finished
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: get_free_slot()
09/05/2015 19:46:19 | NumberFields@home | [slot] assigning slot 1 to wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/GetDecics_2.00_windows_intelx86 to slots/1/GetDecics_2.00_windows_intelx86
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp504817of682667.dat to slots/1/in
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp504817of682667_0_0 to slots/1/out
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_temporary_exit
09/05/2015 19:46:19 | NumberFields@home | Starting task wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | NumberFields@home | [cpu_sched] Starting task wu_sf3_DS-10x271_Grp504817of682667_0 using GetDecics version 200 in slot 1

and as you can see, the following task from another project started cleanly and is running normally.
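
To check a longer log for the same thing without reading it line by line, something like the sketch below would do. It assumes the message format shown in the excerpt above ("[slot] removed file slots/N/name"); the function name is made up for the example.

```python
import re

REMOVED = re.compile(r"\[slot\] removed file slots/(\d+)/(\S+)")


def removed_files(log_lines):
    """Map slot number -> set of file names the client logged as removed."""
    removed = {}
    for line in log_lines:
        match = REMOVED.search(line)
        if match:
            removed.setdefault(int(match.group(1)), set()).add(match.group(2))
    return removed


# Two lines copied from the excerpt above.
log = [
    "09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img",
    "09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi",
]
print("vm_image.vdi" in removed_files(log).get(1, set()))  # True
```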
51) Message boards : Number crunching : exceeded disk limit (Message 346)
Posted 9 May 2015 by Richard Haselgrove
Post:
I have one with a 5.2 GB file, already running under the new private drop - but it's not due to finish for another 4 hours, so Ray will beat me to it.

Likewise, I had to stop BOINC to upgrade part-way through: the VM console is still showing nothing but "file truncated" messages, and very low CPU usage. Although I did a little Test4Theory testing in the early days, I've forgotten what little I learned about the fine control of the VM, whether from outside or from within BOINC.

It's a pity that the VM/BOINC combination doesn't yet allow a VM-based project to release un-needed resources back for other BOINC projects to use if, as Crystal Pellet suggests, the looping messages only appear "when CMS's job queue (not BOINC-queue) is empty and the VM don't get jobs". That's a question I remember raising with Ben Segal at the 2010 BOINC workshop in London, and I fear it will deter some of the more competitive volunteers from participating in this type of project.
52) Message boards : Number crunching : exceeded disk limit (Message 342)
Posted 9 May 2015 by Richard Haselgrove
Post:
David says

Thanks.
The problem was that the BOINC client used Windows APIs
for accessing files that didn't work for >= 4GB files.
I fixed this (I think). Rom will have a new private drop soon.
-- David

and Rom says

Here is a new private drop:
x86: http://boinc.berkeley.edu/dl/boinc.080515.x86.zip
x64: http://boinc.berkeley.edu/dl/boinc.080515.x64.zip

----- Rom

Looking at the code change, I think there's a reasonable expectation that this will solve the copying problem too - they've switched from using a 32-bit to a 64-bit version of a low-level Windows library function.
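
As a rough illustration of why the width matters (the posts above don't say which function was changed or exactly how it failed, so this is not a reconstruction of the bug): a 32-bit field simply cannot hold the size of the 5,148,508,160-byte vm_image.vdi reported elsewhere in this thread, while a 64-bit field can.

```python
import struct

size = 5_148_508_160  # vm_image.vdi size reported elsewhere in this thread

fits_32 = size <= 0xFFFFFFFF         # can a 32-bit unsigned field hold it?
wrapped = size & 0xFFFFFFFF          # what a 32-bit field would keep
packed64 = struct.pack("<Q", size)   # a 64-bit field holds it exactly
restored = struct.unpack("<Q", packed64)[0]

print(fits_32, wrapped, restored)  # False 853540864 5148508160
```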
53) Message boards : Number crunching : exceeded disk limit (Message 340)
Posted 8 May 2015 by Richard Haselgrove
Post:
David's found the problem:

fa3f6be5128071fe7e15563e727b5478c45a63b8
54) Message boards : Number crunching : exceeded disk limit (Message 339)
Posted 8 May 2015 by Richard Haselgrove
Post:
Those are presumably the times when the VM console just shows an endless queue of

tail: /home/boinc/stderr: file truncated

- as both machines have been showing every time I've looked today.
55) Message boards : Number crunching : exceeded disk limit (Message 337)
Posted 8 May 2015 by Richard Haselgrove
Post:
After four days of watching (not continuously!), we have a winner - or perhaps a loser.

After a seemingly normal run and exit, I was left with

D:\BOINCdata\slots\1>dir
 Volume in drive D is Data
 Volume Serial Number is 7031-B70C

 Directory of D:\BOINCdata\slots\1

08/05/2015  19:27    <DIR>          .
08/05/2015  19:27    <DIR>          ..
08/05/2015  19:27     5,148,508,160 vm_image.vdi
               1 File(s)  5,148,508,160 bytes
               2 Dir(s)  445,699,624,960 bytes free

D:\BOINCdata\slots\1>

- and that's the first vm_image.vdi file over 4GB that I've seen.

Slot cleaning appeared normal:

08/05/2015 19:27:43 | CMS-dev | Message from task: 0
08/05/2015 19:27:43 | | [slot] cleaning out slots/1: handle_exited_app()
08/05/2015 19:27:43 | | [slot] removed file slots/1/boinc_finish_called
08/05/2015 19:27:43 | | [slot] removed file slots/1/boinc_task_state.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/init_data.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/output
08/05/2015 19:27:43 | | [slot] removed file slots/1/stderr.txt
08/05/2015 19:27:43 | | [slot] removed file slots/1/VBox.log
08/05/2015 19:27:43 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.exe
08/05/2015 19:27:43 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.pdb
08/05/2015 19:27:43 | | [slot] removed file slots/1/vbox_checkpoint.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/vbox_job.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/vbox_remote_desktop.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/vbox_webapi.xml
08/05/2015 19:27:43 | | [slot] removed file slots/1/vm_floppy_1.img
08/05/2015 19:27:43 | CMS-dev | Computation for task CMS_30776_1427806619.199316_0 finished
08/05/2015 19:27:43 | | [slot] cleaning out slots/1: get_free_slot()
08/05/2015 19:27:43 | NumberFields@home | [slot] assigning slot 1 to wu_sf3_DS-10x271_Grp502611of682667_0
08/05/2015 19:27:43 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/GetDecics_2.00_windows_intelx86 to slots/1/GetDecics_2.00_windows_intelx86
08/05/2015 19:27:43 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp502611of682667.dat to slots/1/in
08/05/2015 19:27:43 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp502611of682667_0_0 to slots/1/out
08/05/2015 19:27:43 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp502611of682667_0 using GetDecics version 200 in slot 1
08/05/2015 19:27:44 | NumberFields@home | Aborting task wu_sf3_DS-10x271_Grp502611of682667_0: exceeded disk limit: 4910.01MB > 244.14MB
08/05/2015 19:27:44 | NumberFields@home | [sched_op] Deferring communication for 00:01:23
08/05/2015 19:27:44 | NumberFields@home | [sched_op] Reason: Unrecoverable error for task wu_sf3_DS-10x271_Grp502611of682667_0
08/05/2015 19:27:45 | | [slot] cleaning out slots/1: handle_exited_app()
08/05/2015 19:27:45 | | [slot] removed file slots/1/boinc_lockfile
08/05/2015 19:27:45 | | [slot] removed file slots/1/GetDecics_2.00_windows_intelx86
08/05/2015 19:27:45 | | [slot] removed file slots/1/in
08/05/2015 19:27:45 | | [slot] removed file slots/1/init_data.xml
08/05/2015 19:27:45 | | [slot] removed file slots/1/out
08/05/2015 19:27:45 | | [slot] removed file slots/1/stderr.txt
08/05/2015 19:27:45 | NumberFields@home | Computation for task wu_sf3_DS-10x271_Grp502611of682667_0 finished

- but notice there's no reference to vm_image.vdi in either 'cleaning out slots' loop. 15 minutes later (after drafting the above), the file remained visible to both Command Prompt and Windows Explorer, and could be copied to another folder (taking the length of time you'd expect for a 4GB file).
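
The numbers in the 'exceeded disk limit' line tie up with the directory listing. A quick check, assuming BOINC's "MB" here means 1024 x 1024 bytes (an inference from the figures, not something stated in the log):

```python
MIB = 1024 * 1024

vdi_bytes = 5_148_508_160   # vm_image.vdi, from the dir listing above
assumed_bound = 256_000_000  # assumption: a bound that would print as 244.14MB

print(f"{vdi_bytes / MIB:.2f}MB")      # 4910.00MB
print(f"{assumed_bound / MIB:.2f}MB")  # 244.14MB
```

So the leftover image alone accounts for 4910.00 of the 4910.01MB reported, and the remaining 0.01MB is consistent with the new task's own small files in the slot.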

But it remained invisible to BOINC's clear-out routine:

08/05/2015 19:43:15 | NumberFields@home | task wu_sf3_DS-10x271_Grp502822of682667_0 resumed by user
08/05/2015 19:43:16 | | [slot] cleaning out slots/1: get_free_slot()
08/05/2015 19:43:16 | NumberFields@home | [slot] assigning slot 1 to wu_sf3_DS-10x271_Grp502822of682667_0
08/05/2015 19:43:16 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/GetDecics_2.00_windows_intelx86 to slots/1/GetDecics_2.00_windows_intelx86
08/05/2015 19:43:16 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp502822of682667.dat to slots/1/in
08/05/2015 19:43:16 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp502822of682667_0_0 to slots/1/out
08/05/2015 19:43:16 | NumberFields@home | [cpu_sched] Restarting task wu_sf3_DS-10x271_Grp502822of682667_0 using GetDecics version 200 in slot 1
08/05/2015 19:43:17 | NumberFields@home | Aborting task wu_sf3_DS-10x271_Grp502822of682667_0: exceeded disk limit: 4910.01MB > 244.14MB
08/05/2015 19:43:17 | NumberFields@home | [sched_op] Deferring communication for 00:01:32
08/05/2015 19:43:17 | NumberFields@home | [sched_op] Reason: Unrecoverable error for task wu_sf3_DS-10x271_Grp502822of682667_0
08/05/2015 19:43:18 | | [slot] cleaning out slots/1: handle_exited_app()
08/05/2015 19:43:18 | | [slot] removed file slots/1/boinc_lockfile
08/05/2015 19:43:18 | | [slot] removed file slots/1/GetDecics_2.00_windows_intelx86
08/05/2015 19:43:18 | | [slot] removed file slots/1/in
08/05/2015 19:43:18 | | [slot] removed file slots/1/init_data.xml
08/05/2015 19:43:18 | | [slot] removed file slots/1/out
08/05/2015 19:43:18 | | [slot] removed file slots/1/stderr.txt
08/05/2015 19:43:18 | NumberFields@home | Computation for task wu_sf3_DS-10x271_Grp502822of682667_0 finished

Starting a new CMS-dev task in the same slot replaced the old file with a new one, with a new datestamp and a new file-size. QED.
56) Message boards : Number crunching : exceeded disk limit (Message 335)
Posted 7 May 2015 by Richard Haselgrove
Post:
Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations.

I'm not sure that this would account for cases like ritterm's, where - if I read him right - the file was still visible for manual deletion some hours or days later.

Richard, my original instance (which prompted the graphs above) is detailed in the thread VBox Wrappers Updated to 26157. Note especially Message 172 ff.

Ah - I see you've been round this cycle internally once before. Sorry, I'm a late arrival - I only got involved when '196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED' became a live issue at other projects too, and was eventually tracked back to those same 5.7 GB .vdi leftovers that you saw in the slot directories back in March.

But my question remains - how on earth do those files manage to survive three separate BOINC attempts to avoid them? As I posted over at SETI - read from message 1674370 - responsibility for clearing the slot directories lies with BOINC, and

In theory, according to David's initial response, the current logic specifies:

1) Delete everything in the slot folder on task exit (this failed in the example)
2) Delete everything - i.e. anything remaining - in the slot before reuse
3) Don't reuse the slot if (2) fails

The error we were originally investigating implies that step (3) failed. That may have been because step (2) failed without returning an error ...

The implication of Jason's reply was that Windows' retval from DeleteFile() might be 'success', but in practice signify a queued housekeeping task, to be actioned later. I'm sceptical that the disk latencies involved could be so macroscopic as to cause failures hours or even days later - and if they are, then I suggest that it's important that BOINC hardens its attitude to verification that the assumed OS sub-operation has indeed performed its job as expected.

I became involved in this issue because of errors reported at Milkyway, Einstein and (I discovered later) WCG. As I put in my initial email to boinc_alpha, "Cross-project errors like this, where the behaviour of one project leads to errors for another project, are hard for project staff to analyse and remediate. Would it be wise for BOINC to perform an additional safety check, that a slot directory which it is proposing to re-use is indeed empty as expected?".
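
The kind of safety check I have in mind can be sketched as follows. This is not BOINC's code, and the function names are made up for the example. The point is that the return value of the delete call is never trusted: the slot counts as reusable only if a fresh directory listing, taken afterwards, comes back empty.

```python
import os
import tempfile


def clean_slot(slot_dir):
    """Try to delete every file in slot_dir; return names still present afterwards."""
    for name in os.listdir(slot_dir):
        try:
            os.remove(os.path.join(slot_dir, name))
        except OSError:
            pass  # success or failure is not trusted here; verify below
    return sorted(os.listdir(slot_dir))


def slot_is_reusable(slot_dir):
    """A slot may be reused only if a fresh listing shows it is empty."""
    return clean_slot(slot_dir) == []


with tempfile.TemporaryDirectory() as slot:
    open(os.path.join(slot, "vm_image.vdi"), "wb").close()  # stand-in leftover
    print(slot_is_reusable(slot))  # True
```

If anything survives the clean-up, the caller would pick a different slot instead of starting the next task on top of the leftovers.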
57) Message boards : Number crunching : exceeded disk limit (Message 331)
Posted 6 May 2015 by Richard Haselgrove
Post:
Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations.

I'm not sure that this would account for cases like ritterm's, where - if I read him right - the file was still visible for manual deletion some hours or days later.
58) Message boards : Number crunching : exceeded disk limit (Message 326)
Posted 6 May 2015 by Richard Haselgrove
Post:
Verrrrrrrrry interesting. There's been a lot of speculation about that at SETI - this may be a smoking gun.

I wonder why just that one file. Have you still got it, and does it contain anything significant?
59) Message boards : Number crunching : exceeded disk limit (Message 324)
Posted 6 May 2015 by Richard Haselgrove
Post:
06 May 11:14:13 Running as a daemon (GPU computing disabled)

Yea! That'll cut down on the help-desk workload.
60) Message boards : Number crunching : exceeded disk limit (Message 321)
Posted 6 May 2015 by Richard Haselgrove
Post:
All the error message reports that I've seen refer to problems with a .vdi file of 5 GB or more. Neither of the two I've run so far got that large - the second one was a bit short of 4 GB. I don't see why DeleteFile() should have problems with any particular file size, but if there is a boundary, 4 GB sounds like a significant number. I don't know what feature of CMS controls the growth of the .vdi, but that might be something to watch.

