41)
Message boards :
News :
Urgent Update for Windows Users
(Message 400)
Posted 24 May 2015 by Richard Haselgrove Post: And now on this thread, where it belongs... "I have one with 7.4.42 but since I haven't needed this maybe one of these days when I have time and the rest I will do that." Are you sure you don't need it? Your Einstein host 10859996 (one of the ones running v7.4.42) spat out a bunch of "Maximum disk usage exceeded" errors on 11 May. That's the problem with this BOINC bug - it produces minimal errors here, but scatters errors over all sorts of other projects, which aren't equipped to diagnose and solve them. That's why we asked all CMS-dev Windows users to apply the hotfix, whether they perceived a problem or not. And my apologies for missing the problem with updating older versions of BOINC in my initial post.
42)
Message boards :
Number crunching :
exceeded disk limit
(Message 378)
Posted 19 May 2015 by Richard Haselgrove Post: OK, messages sent to Ivan, Laurence and Hendrik. The files that are needed to apply the hotfix are:
For 64-bit BOINC: boinc.080515.x64.zip
For 32-bit BOINC: boinc.080515.x86.zip
Simply extract the two files for your version from the .zip archive, and copy them to your BOINC program folder - you'll need to stop the BOINC client while you do this, and restart it again afterwards.
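For anyone who prefers to script the copy step, here is a minimal sketch. It assumes the archive holds only the files that belong in the BOINC program folder; the function name `extract_hotfix` and both paths are hypothetical, and stopping and restarting the BOINC client still has to be done separately, as described above.

```python
import zipfile
from pathlib import Path


def extract_hotfix(zip_path, program_dir):
    """Extract every file in the archive into program_dir.

    Assumption: the archive contains only the replacement files.
    Returns the sorted list of file names that were written.
    """
    program_dir = Path(program_dir)
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; keep regular files only.
        names = [info.filename for info in archive.infolist() if not info.is_dir()]
        archive.extractall(program_dir, members=names)
    return sorted(names)
```

A call would look like `extract_hotfix("boinc.080515.x64.zip", program_folder)`, where `program_folder` is wherever BOINC is installed on that machine. Writing into a program folder normally needs administrator rights.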
43)
Message boards :
Number crunching :
exceeded disk limit
(Message 377)
Posted 19 May 2015 by Richard Haselgrove Post: I really think we need to get a grip on this problem. To recap: there's a bug in the BOINC client which means it fails to delete files larger than 4 GB when it should. This project is (still, today) producing files larger than 4 GB. When those files are left lying around, we cause errors for every other BOINC project that our computers may be attached to. That's not nice. Eric M has had to put up a front-page warning at LHC classic, so he can get on with his work.
The cure is simple and permanent: apply the 080515 hotfix BOINC client. But I had a look through the top 200 hosts yesterday (pretty much the active user base here), and only 9 of the 154 Windows machines had the hotfix applied - take a bow, rbpeake, Crystal Pellet, Ray Murray, and m. (the other three were mine).
Since the message clearly isn't getting through, even to people who have posted in this thread, I'm going to send a PM to the admins asking them to reinforce the message via a front-page news item and BOINC 'Notice': and if that doesn't work, to ask them to enforce a minimum BOINC version of 7.5.1 for Windows computers attached to this project.
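A minimum-version rule like the one proposed above comes down to comparing version numbers numerically rather than as text. The sketch below shows that comparison only; `parse_version` and `meets_minimum` are hypothetical helper names, and it says nothing about how a project server actually enforces the limit.

```python
def parse_version(text):
    """Turn '7.4.42' into (7, 4, 42) so the parts compare as numbers."""
    return tuple(int(part) for part in text.split("."))


MINIMUM = parse_version("7.5.1")  # the threshold suggested in the post


def meets_minimum(version, minimum=MINIMUM):
    """True when the dotted version string is at or above the minimum."""
    return parse_version(version) >= minimum
```

Comparing tuples avoids the string-ordering trap: as text, "7.10.0" sorts before "7.5.1", but as numbers it is the newer version.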
44)
Message boards :
Number crunching :
exceeded disk limit
(Message 363)
Posted 10 May 2015 by Richard Haselgrove Post: Well, it seems to be running as well as the others at the moment: Well, I did a quick cheat-test by copying the 4.79 GB file I saved a couple of days ago to the 32-bit machine, and putting it in another project's slot directory. 'Exceeded disk limit' came up with the right numbers, the task was aborted and the file deleted, and everything carried on working properly.
45)
Message boards :
Number crunching :
exceeded disk limit
(Message 361)
Posted 10 May 2015 by Richard Haselgrove Post: Well, it seems to be running as well as the others at the moment: - we won't know for certain until the back end starts supplying jobs again, of course. CMS only supplies 64-bit apps, sure - but the app it supplies is BOINC's 64-bit VBox wrapper. I simply set up an app_info file and substituted the 32-bit wrapper files, and off it went. The whole point of a VM is that the guest OS doesn't have to match the host: my hardware on this host is fully 64-bit capable, and includes the virtualization hooks to enable VBox to run. |
46)
Message boards :
Number crunching :
exceeded disk limit
(Message 359)
Posted 10 May 2015 by Richard Haselgrove Post: David thanks us all, but implicitly asks us to keep our eyes open for any other glitches: That's good news. Following a question from Sekerob, I've tested that my 4.79 GB .vdi file was properly detected and deleted under a 32-bit Windows OS and the 32-bit version of the private drop. I've now set host 393 up to run under the 32-bit versions of VBox wrapper, but otherwise mimicking the stock delivery - a 1-hour first test run seems to have started normally, and has reached the 'file truncated' loop. If this run works, I'll let the next one run the full 24 hours, but suspend work fetch after that (I have to be away for a few days next week).
47)
Message boards :
Number crunching :
exceeded disk limit
(Message 358)
Posted 10 May 2015 by Richard Haselgrove Post: It was rather a long way down the list. Alphabetical order. Recent versions of BOINC auto-generate a fully-populated cc_config.xml template, also in alphabetic order, when some options are set via the GUI. That caused problems for some users, who added their own tags manually and ended up with duplicate entries. Best to keep the strict alphabetic order for both the configuration file and its documentation. |
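The duplicate-entry problem described above is easy to check for mechanically. The sketch below is only an illustration: the sample configuration is made up for the purpose, and the helper names `audit_section` and `audit_config` are hypothetical. It reports tags that occur more than once in a section and whether the section is in alphabetic order.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Illustrative sample only, not a real user's file.
SAMPLE = """<cc_config>
  <log_flags>
    <slot_debug>1</slot_debug>
    <task>1</task>
  </log_flags>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <report_results_immediately>0</report_results_immediately>
    <max_file_xfers>4</max_file_xfers>
  </options>
</cc_config>"""


def audit_section(section):
    """Return (duplicated tag names, whether the tags are in alphabetic order)."""
    tags = [child.tag for child in section]
    counts = Counter(tags)
    duplicates = sorted(tag for tag, n in counts.items() if n > 1)
    return duplicates, tags == sorted(tags)


def audit_config(xml_text):
    """Audit every top-level section of a cc_config-style document."""
    root = ET.fromstring(xml_text)
    return {section.tag: audit_section(section) for section in root}


report = audit_config(SAMPLE)
# report["options"] flags max_file_xfers as a duplicate and out of order.
```

Some options may legitimately appear more than once, so a reported duplicate is a prompt to look at the file, not proof of a mistake.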
48)
Message boards :
Number crunching :
exceeded disk limit
(Message 355)
Posted 10 May 2015 by Richard Haselgrove Post: Yes - that's what I added yesterday, in response to the question ;-) (check the Wiki history tab!) It's based on my own observation, rather than formal guidance from the developers - it wasn't documented when the question was asked.
49)
Message boards :
Number crunching :
exceeded disk limit
(Message 353)
Posted 10 May 2015 by Richard Haselgrove Post: My second machine also transitioned gracefully from CMS to another project and back again, without errors - but since it happened at 02:30 in the morning, I didn't see what the final image file size was. I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro. I couldn't find any equivalent number in the current file-set, but I did find <job_duration>86400</job_duration> in CMS_26_03_2015a.xml (project folder). Changing that to 43200 didn't affect the running task, but it did start the next on course for what looks like it is going to be a 12-hour run - so the transitions will happen in daylight, for a while at least. Another useful question turned up on the BOINC message boards yesterday: What is this: vbox_windows set to 0
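The values are in seconds: 86400 is 24 hours and 43200 is 12 hours. The same edit can be sketched in code as below. Only the `<job_duration>` element comes from the post; the surrounding `<vbox_job>` wrapper in the sample and the function name `set_job_duration` are assumptions, since the rest of CMS_26_03_2015a.xml isn't shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal document; only <job_duration> is taken from the post.
SAMPLE = "<vbox_job><job_duration>86400</job_duration></vbox_job>"


def set_job_duration(xml_text, hours):
    """Set <job_duration> to the given number of hours, expressed in seconds.

    Returns (previous value in seconds, updated XML text).
    """
    root = ET.fromstring(xml_text)
    node = root.find(".//job_duration")
    if node is None:
        raise ValueError("no <job_duration> element found")
    old_seconds = int(node.text)
    node.text = str(int(hours * 3600))
    return old_seconds, ET.tostring(root, encoding="unicode")


old, updated = set_job_duration(SAMPLE, 12)
```

As noted in the post, a change like this only affected the next task to start, not the one already running.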
50)
Message boards :
Number crunching :
exceeded disk limit
(Message 351)
Posted 9 May 2015 by Richard Haselgrove Post: And the same here, with the image file close to the 5.7 GB that was reported in the error reports that brought me here.
09/05/2015 19:46:19 | CMS-dev | Message from task: 0
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: handle_exited_app()
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_finish_called
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_task_state.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/output
09/05/2015 19:46:19 | | [slot] removed file slots/1/stderr.txt
09/05/2015 19:46:19 | | [slot] removed file slots/1/VBox.log
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.exe
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.pdb
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_checkpoint.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_job.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_remote_desktop.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_webapi.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi
09/05/2015 19:46:19 | CMS-dev | Computation for task CMS_30909_1427806622.286095_0 finished
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: get_free_slot()
09/05/2015 19:46:19 | NumberFields@home | [slot] assigning slot 1 to wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/GetDecics_2.00_windows_intelx86 to slots/1/GetDecics_2.00_windows_intelx86
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp504817of682667.dat to slots/1/in
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp504817of682667_0_0 to slots/1/out
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_temporary_exit
09/05/2015 19:46:19 | NumberFields@home | Starting task wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | NumberFields@home | [cpu_sched] Starting task wu_sf3_DS-10x271_Grp504817of682667_0 using GetDecics version 200 in slot 1
and as you can see, the following task from another project started cleanly and is running normally.
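Checking a long event log by eye for one file name is tedious, so here is a small sketch that pulls out every "[slot] removed file" entry and tests for the image file. The three log lines are copied from the post; the helper names `removed_files` and `image_was_removed` are hypothetical, and the pattern assumes the log wording stays exactly as shown.

```python
import re

# A short excerpt of the log quoted above.
LOG = """\
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: handle_exited_app()
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi
"""

REMOVED = re.compile(r"\[slot\] removed file (\S+)")


def removed_files(log_text):
    """Return the path from every '[slot] removed file' line, in log order."""
    return REMOVED.findall(log_text)


def image_was_removed(log_text, name="vm_image.vdi"):
    """True when some removed path ends with the given file name."""
    return any(path.endswith("/" + name) for path in removed_files(log_text))
```

On the excerpt, `image_was_removed(LOG)` is true; run against a log where the clean-up skipped the image, it would be false.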
51)
Message boards :
Number crunching :
exceeded disk limit
(Message 346)
Posted 9 May 2015 by Richard Haselgrove Post: I have one with a 5.2 GB file, already running under the new private drop - but it's not due to finish for another 4 hours, so Ray will beat me to it. Likewise, I had to stop BOINC to upgrade part-way through: the VM console is still showing nothing but "file truncated" messages, and very low CPU usage. Although I did a little Test4Theory testing in the early days, I've forgotten what little I learned about the fine control of the VM, whether externally or from within BOINC. It's a pity that the VM/BOINC combination doesn't yet allow a VM-based project to release un-needed resources back for other BOINC projects to use if, as Crystal Pellet suggests, the looping messages only appear "when CMS's job queue (not BOINC-queue) is empty and the VM don't get jobs". That's a question I remember raising with Ben Segal at the 2010 BOINC workshop in London, and I fear it will deter some of the more competitive volunteers from participating in this type of project.
52)
Message boards :
Number crunching :
exceeded disk limit
(Message 342)
Posted 9 May 2015 by Richard Haselgrove Post: David says "Thanks." and Rom says "Here is a new private drop:" Looking at the code change, I think there's a reasonable expectation that this will solve the copying problem too - they've switched from using a 32-bit to a 64-bit version of a low-level Windows library function.
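To see why 4 GB is the natural boundary for a 32-bit interface, here is a generic illustration of what happens when a file size has to fit in 32 bits. It is not a reconstruction of the BOINC code or of the Windows function involved, neither of which is quoted here; the helpers `as_uint32` and `as_int32` are hypothetical. The size used is the 5,148,508,160-byte vm_image.vdi reported elsewhere in this thread.

```python
FOUR_GIB = 1 << 32          # 4,294,967,296: first size that no longer fits in 32 bits
FILE_SIZE = 5_148_508_160   # vm_image.vdi size reported in the thread


def as_uint32(n):
    """Keep only the low 32 bits, as an unsigned 32-bit field would."""
    return n & 0xFFFFFFFF


def as_int32(n):
    """Interpret the low 32 bits as a signed 32-bit integer."""
    wrapped = n & 0xFFFFFFFF
    return wrapped - (1 << 32) if wrapped >= (1 << 31) else wrapped


truncated = as_uint32(FILE_SIZE)  # 853,540,864: about 0.8 GB instead of 4.8 GB
```

Whether a 32-bit call wraps the value like this or simply fails depends on the function concerned. Either way, anything at or above 4,294,967,296 bytes cannot be represented correctly, which fits the pattern of only the larger images being left behind.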
53)
Message boards :
Number crunching :
exceeded disk limit
(Message 340)
Posted 8 May 2015 by Richard Haselgrove Post: David's found the problem: fa3f6be5128071fe7e15563e727b5478c45a63b8 |
54)
Message boards :
Number crunching :
exceeded disk limit
(Message 339)
Posted 8 May 2015 by Richard Haselgrove Post: Those are presumably the times when the VM console just shows an endless queue of tail: /home/boinc/stderr: file truncated - as both machines have been showing every time I've looked today. |
55)
Message boards :
Number crunching :
exceeded disk limit
(Message 337)
Posted 8 May 2015 by Richard Haselgrove Post: After four days of watching (not continuously!), we have a winner - or perhaps a loser. After a seemingly normal run and exit, I was left with
D:\BOINCdata\slots\1>dir
 Volume in drive D is Data
 Volume Serial Number is 7031-B70C
 Directory of D:\BOINCdata\slots\1
08/05/2015 19:27 <DIR> .
08/05/2015 19:27 <DIR> ..
08/05/2015 19:27 5,148,508,160 vm_image.vdi
 1 File(s) 5,148,508,160 bytes
 2 Dir(s) 445,699,624,960 bytes free
D:\BOINCdata\slots\1>
- and that's the first vm_image.vdi file over 4GB that I've seen. Slot cleaning appeared normal:
08/05/2015 19:27:43 | CMS-dev | Message from task: 0
- but notice there's no reference to vm_image.vdi in either 'cleaning out slots' loop. 15 minutes later (after drafting the above), the file remained visible to both Command Prompt and Windows Explorer, and could be copied to another folder (taking the length of time you'd expect for a 4GB file). But it remained invisible to BOINC's clear-out routine:
08/05/2015 19:43:15 | NumberFields@home | task wu_sf3_DS-10x271_Grp502822of682667_0 resumed by user
Starting a new CMS-dev task in the same slot replaced the old file with a new one, with a new datestamp and a new file-size. QED.
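Rather than running dir in each slot by hand, a volunteer could scan the whole slots folder for oversized leftovers. This is a rough sketch, not a BOINC tool: the function name `find_large_files` is hypothetical, and taking the boundary as exactly 2^32 bytes is an assumption based on the "over 4GB" observation.

```python
import os

FOUR_GIB = 1 << 32  # 4,294,967,296 bytes; assumed boundary


def find_large_files(root, threshold=FOUR_GIB):
    """Walk root and return sorted (path, size) pairs for files at or above threshold."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size >= threshold:
                hits.append((path, size))
    return sorted(hits)
```

Pointed at a data directory such as D:\BOINCdata\slots, it would have listed the 5,148,508,160-byte vm_image.vdi shown above. Files belonging to a task that is still running will show up too, so check before deleting anything.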
56)
Message boards :
Number crunching :
exceeded disk limit
(Message 335)
Posted 7 May 2015 by Richard Haselgrove Post: Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations. Ah - I see you've been round this cycle internally once before. Sorry, I'm a late arrival - I only got involved when '196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED' became a live issue at other projects too, and was eventually tracked back to those same 5.7 GB .vdi leftovers that you saw in the slot directories back in March. But my question remains - how on earth do those files manage to survive three separate BOINC attempts to avoid them? As I posted over at SETI - read from message 1674370 - responsibility for clearing the slot directories lies with BOINC, and In theory, according to David's initial response, the current logic specifies: The implication of Jason's reply was that Windows' retval from DeleteFile() might be 'success', but in practice signify a queued housekeeping task, to be actioned later. I'm sceptical that the disk latencies involved could be so macroscopic as to cause failures hours or even days later - and if they are, then I suggest that it's important that BOINC hardens its attitude to verification that the assumed OS sub-operation has indeed performed its job as expected. I became involved in this issue because of errors reported at Milkyway, Einstein and (I discovered later) WCG. As I put in my initial email to boinc_alpha, "Cross-project errors like this, where the behaviour of one project leads to errors for another project, are hard for project staff to analyse and remediate. Would it be wise for BOINC to perform an additional safety check, that a slot directory which it is proposing to re-use is indeed empty as expected?". |
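The extra safety check suggested to boinc_alpha amounts to "delete, then look again". Here is a minimal sketch of that idea, with hypothetical function names and no claim to match how the BOINC client is actually written: after the delete pass, the directory is listed a second time and anything still present is reported instead of being assumed gone.

```python
import os


def clean_slot(slot_dir):
    """Delete every regular file in slot_dir, then return the names still present."""
    for name in os.listdir(slot_dir):
        path = os.path.join(slot_dir, name)
        if os.path.isfile(path):
            try:
                os.remove(path)
            except OSError:
                pass  # a failed delete is picked up by the listing below
    # Verification pass: trust the directory listing, not the delete call's return.
    return sorted(os.listdir(slot_dir))


def slot_is_reusable(slot_dir):
    """Return (True, []) if the slot is empty after cleaning, else (False, leftovers)."""
    leftovers = clean_slot(slot_dir)
    return len(leftovers) == 0, leftovers
```

The point of the design is that the decision to re-use a slot rests on what is actually in the directory afterwards, so a file the delete routine never saw, or a delete that reported success without taking effect, would still block the slot rather than cause a disk-limit error in the next project's task.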
57)
Message boards :
Number crunching :
exceeded disk limit
(Message 331)
Posted 6 May 2015 by Richard Haselgrove Post: Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations. I'm not sure that this would account for cases like ritterm's, where - if I read him right - the file was still visible for manual deletion some hours or days later. |
58)
Message boards :
Number crunching :
exceeded disk limit
(Message 326)
Posted 6 May 2015 by Richard Haselgrove Post: Verrrrrrrrry interesting. There's been a lot of speculation about that at SETI - this may be a smoking gun. I wonder why just that one file. Have you still got it, and does it contain anything significant? |
59)
Message boards :
Number crunching :
exceeded disk limit
(Message 324)
Posted 6 May 2015 by Richard Haselgrove Post: 06 May 11:14:13 Running as a daemon (GPU computing disabled) Yea! That'll cut down on the help-desk workload. |
60)
Message boards :
Number crunching :
exceeded disk limit
(Message 321)
Posted 6 May 2015 by Richard Haselgrove Post: All the error message reports that I've seen refer to problems with a .vdi file of 5 GB or more. Neither of the two I've run so far got that large - the second one was a bit short of 4 GB. I don't see why DeleteFile() should have problems with any particular file size, but if there is a boundary, 4 GB sounds like a significant number. I don't know what feature of CMS controls the growth of the .vdi, but that might be something to watch. |
©2024 CERN