exceeded disk limit
Author | Message |
---|---|
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
VBox 4.3.26; Win7-64; no service install. Slot properly cleaned. 109 CMS-dev 06 May 09:54:59 Scheduler request completed: got 1 new tasks 110 06 May 09:55:01 [slot] cleaning out slots/0: get_free_slot() 111 CMS-dev 06 May 09:55:01 [slot] assigning slot 0 to CMS_30830_1427806620.502770_0 112 CMS-dev 06 May 09:55:01 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/vboxwrapper_26165_windows_x86_64.exe to slots/0/vboxwrapper_26165_windows_x86_64.exe 113 CMS-dev 06 May 09:55:01 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/CMS_26_03_2015a.xml to slots/0/vbox_job.xml 114 CMS-dev 06 May 09:55:01 Starting task CMS_30830_1427806620.502770_0 115 06 May 09:55:56 [slot] removed file slots/0/init_data.xml 116 CMS-dev 06 May 09:55:56 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/vboxwrapper_26165_windows_x86_64.pdb to slots/0/vboxwrapper_26165_windows_x86_64.pdb 117 CMS-dev 06 May 10:08:20 task CMS_30830_1427806620.502770_0 suspended by user 118 CMS-dev 06 May 10:09:04 task CMS_30830_1427806620.502770_0 resumed by user 119 06 May 10:09:05 [slot] removed file slots/0/init_data.xml 120 06 May 10:09:05 [slot] removed file slots/0/boinc_temporary_exit 121 06 May 10:09:06 [slot] cleaning out slots/0: handle_exited_app() 122 06 May 10:09:06 [slot] removed file slots/0/boinc_c622628eff463924/boinc_c622628eff463924.vbox 123 06 May 10:09:06 [slot] removed file slots/0/boinc_c622628eff463924/boinc_c622628eff463924.vbox-prev 124 06 May 10:09:06 [slot] removed file slots/0/boinc_c622628eff463924/Logs/VBox.log 125 06 May 10:09:06 [slot] removed file slots/0/boinc_c622628eff463924/Logs/VBoxStartup.log 126 06 May 10:09:06 [slot] removed file slots/0/boinc_c622628eff463924/Snapshots/2015-05-06T08-08-22-948266400Z.sav 127 06 May 10:09:06 [slot] removed file slots/0/boinc_lockfile 128 06 May 10:09:06 [slot] removed file slots/0/boinc_task_state.xml 129 06 May 10:09:06 [slot] removed file slots/0/init_data.xml 130 06 May 10:09:06 [slot] removed file slots/0/stderr.txt 131 06 May 10:09:06 [slot] removed file slots/0/VBox.log 132 06 May 10:09:06 [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.exe 133 06 May 10:09:06 [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.pdb 134 06 May 10:09:06 [slot] removed file slots/0/vbox_checkpoint.xml 135 06 May 10:09:06 [slot] removed file slots/0/vbox_job.xml 136 06 May 10:09:06 [slot] removed file slots/0/vbox_remote_desktop.xml 137 06 May 10:09:06 [slot] removed file slots/0/vbox_webapi.xml 138 06 May 10:09:06 [slot] removed file slots/0/vm_floppy_0.img 139 06 May 10:09:06 [slot] removed file slots/0/vm_image.vdi 140 CMS-dev 06 May 10:09:06 Computation for task CMS_30830_1427806620.502770_0 finished |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
All the error message reports that I've seen refer to problems with a .vdi file of 5 GB or more. Neither of the two I've run so far got that large - the second one was a bit short of 4 GB. I don't see why DeleteFile() should have problems with any particular file size, but if there is a boundary, 4 GB sounds like a significant number. I don't know what feature of CMS controls the growth of the .vdi, but that might be something to watch. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Next step: BOINC running as a service and wait: 1 06 May 11:14:13 Starting BOINC client version 7.5.1 for windows_x86_64 2 06 May 11:14:13 This a development version of BOINC and may not function properly 3 06 May 11:14:13 log flags: file_xfer, sched_ops, task, slot_debug 5 06 May 11:14:13 Running as a daemon (GPU computing disabled) 101 CMS-dev 06 May 11:14:52 Scheduler request completed: got 1 new tasks 102 CMS-dev 06 May 11:14:54 [slot] assigning slot 0 to CMS_30682_1427806617.088559_0 103 CMS-dev 06 May 11:14:54 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/vboxwrapper_26165_windows_x86_64.exe to slots/0/vboxwrapper_26165_windows_x86_64.exe 104 CMS-dev 06 May 11:14:54 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/CMS_26_03_2015a.xml to slots/0/vbox_job.xml 105 CMS-dev 06 May 11:14:54 Starting task CMS_30682_1427806617.088559_0 106 06 May 11:14:56 [slot] removed file slots/0/init_data.xml 107 CMS-dev 06 May 11:14:56 [slot] linked ../../projects/boincai05.cern.ch_CMS-dev/vboxwrapper_26165_windows_x86_64.pdb to slots/0/vboxwrapper_26165_windows_x86_64.pdb |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
All the error message reports that I've seen refer to problems with a .vdi file of 5 GB or more. Neither of the two I've run so far got that large - the second one was a bit short of 4 GB. I don't see why DeleteFile() should have problems with any particular file size, but if there is a boundary, 4 GB sounds like a significant number. I don't know what feature of CMS controls the growth of the .vdi, but that might be something to watch. I noticed that too and mentioned it in the WCG forums, which you have probably seen: http://www.worldcommunitygrid.org/forums/wcg/viewpostinthread?post=491651 That's also why I ran my first test for 24 hours to let the .vdi grow, but the file size was only 2.219.835.392 bytes. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
06 May 11:14:13 Running as a daemon (GPU computing disabled) Yea! That'll cut down on the help-desk workload. |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Interesting! Not the vm_image.vdi left in the slot directory, but stderr.txt: 124 CMS-dev 06 May 14:22:48 Message from task: 0 125 06 May 14:22:48 [slot] cleaning out slots/0: handle_exited_app() 126 06 May 14:22:48 [slot] removed file slots/0/boinc_finish_called 127 06 May 14:22:48 [slot] removed file slots/0/boinc_task_state.xml 128 06 May 14:22:48 [slot] removed file slots/0/init_data.xml 129 06 May 14:22:48 [slot] removed file slots/0/output 130 06 May 14:22:48 [slot] failed to remove file slots/0/stderr.txt: unlink() failed 131 06 May 14:22:48 [slot] removed file slots/0/VBox.log 132 06 May 14:22:48 [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.exe 133 06 May 14:22:48 [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.pdb 134 06 May 14:22:48 [slot] removed file slots/0/vbox_checkpoint.xml 135 06 May 14:22:48 [slot] removed file slots/0/vbox_job.xml 136 06 May 14:22:48 [slot] removed file slots/0/vbox_remote_desktop.xml 137 06 May 14:22:48 [slot] removed file slots/0/vbox_webapi.xml 138 06 May 14:22:48 [slot] removed file slots/0/vm_floppy_0.img 139 06 May 14:22:48 [slot] removed file slots/0/vm_image.vdi 140 CMS-dev 06 May 14:22:48 Computation for task CMS_30682_1427806617.088559_0 finished 141 CMS-dev 06 May 14:22:52 Sending scheduler request: To report completed tasks. 142 CMS-dev 06 May 14:22:52 Reporting 1 completed tasks 143 CMS-dev 06 May 14:22:52 Not requesting tasks: "no new tasks" requested via Manager 144 CMS-dev 06 May 14:22:54 Scheduler request completed VBox 4.3.26; Win7-64; BOINC installed as service. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
Verrrrrrrrry interesting. There's been a lot of speculation about that at SETI - this may be a smoking gun. I wonder why just that one file. Have you still got it, and does it contain anything significant? |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Verrrrrrrrry interesting. There's been a lot of speculation about that at SETI - this may be a smoking gun. I saved that file, but its contents are just what's in the result's Stderr output. After restarting the BOINC service: 06 May 14:48:08 [slot] cleaning out slots/0: delete old slot dirs 06 May 14:48:08 [slot] removed file slots/0/stderr.txt I fetched a new CMS task using an extended VM size (1.8GB) and running Rom's newest vboxwrapper 26167. Maybe I need several steps to approach the 4GB. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,942,314 RAC: 3,195 |
All the error message reports that I've seen refer to problems with a .vdi file of 5 GB or more. Neither of the two I've run so far got that large - the second one was a bit short of 4 GB. I don't see why DeleteFile() should have problems with any particular file size, but if there is a boundary, 4 GB sounds like a significant number. I don't know what feature of CMS controls the growth of the .vdi, but that might be something to watch. My impression is that it's the result files being held on "disk" that cause the growth of the VM. I suppose you've noticed Richard that the VM has a 10 GB limit, which caused me some loss of hair first time I encountered it, because my BOINC limits were way above the 10 GB it was claiming as a limit. Notice the blue line, where a >5 GB VM was persisting in the slot directory, causing the task to fail at 10 GB. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations. I'm not sure that this would account for cases like ritterm's, where - if I read him right - the file was still visible for manual deletion some hours or days later. |
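The "delete reports success, but the file is still there" scenario described here can be guarded against with a delete-then-verify loop. The following is only a minimal Python sketch of that idea, not BOINC's actual code (the client is written in C++); the function name `delete_and_verify` and the `timeout` and `poll_interval` parameters are hypothetical.

```python
import os
import time


def delete_and_verify(path, timeout=5.0, poll_interval=0.1):
    """Delete `path`, then poll until it is really gone or `timeout` expires.

    Returns True if the file is confirmed absent, False otherwise.
    Hypothetical helper for illustration; not part of BOINC.
    """
    try:
        os.remove(path)
    except FileNotFoundError:
        return True   # nothing to delete: already gone
    except OSError:
        return False  # the delete request itself was refused (e.g. file in use)

    # The delete call returned without error. Do not trust that alone:
    # keep checking until the directory entry has actually disappeared.
    deadline = time.monotonic() + timeout
    while os.path.exists(path):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A caller would treat a `False` result as "slot not clean" and avoid handing that slot to the next task. As noted in the post above, a short wait like this would not explain files that are still present hours or days later.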
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Reproducible. BOINC running as a service. 06-May-2015 23:06:30 [CMS-dev] Message from task: 0 06-May-2015 23:06:30 [---] [slot] cleaning out slots/0: handle_exited_app() 06-May-2015 23:06:30 [---] [slot] removed file slots/0/boinc_finish_called 06-May-2015 23:06:30 [---] [slot] removed file slots/0/boinc_task_state.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/init_data.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/output 06-May-2015 23:06:30 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed 06-May-2015 23:06:30 [---] [slot] removed file slots/0/VBox.log 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.exe 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.pdb 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vbox_checkpoint.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vbox_job.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vbox_remote_desktop.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vbox_webapi.xml 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vm_floppy_0.img 06-May-2015 23:06:30 [---] [slot] removed file slots/0/vm_image.vdi 06-May-2015 23:06:30 [CMS-dev] Computation for task CMS_30819_1427806620.244734_0 finished |
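For anyone comparing several runs like this, the slot_debug lines can be sorted into removed and failed files automatically. This is a rough Python sketch that assumes the two message shapes seen in the logs in this thread ("removed file ..." and "failed to remove file ...: unlink() failed"); the function name `summarize_cleanup` is made up for the example.

```python
import re

# Matches the two [slot] message shapes quoted in this thread.
# Assumption: paths are relative (slots/0/...) and contain no spaces or colons.
SLOT_FILE_RE = re.compile(r"\[slot\] (removed|failed to remove) file ([^\s:]+)")


def summarize_cleanup(log_lines):
    """Split slot-cleanup log lines into (removed, failed) lists of paths."""
    removed, failed = [], []
    for line in log_lines:
        match = SLOT_FILE_RE.search(line)
        if match is None:
            continue  # not a per-file cleanup message
        target = removed if match.group(1) == "removed" else failed
        target.append(match.group(2))
    return removed, failed


if __name__ == "__main__":
    sample = [
        "06-May-2015 23:06:30 [---] [slot] removed file slots/0/output",
        "06-May-2015 23:06:30 [---] [slot] failed to remove file slots/0/stderr.txt: unlink() failed",
        "06-May-2015 23:06:30 [---] [slot] removed file slots/0/VBox.log",
    ]
    removed, failed = summarize_cleanup(sample)
    print("failed to remove:", failed)
```

A file that appears in neither list but is still in the slot directory afterwards would be the more worrying case, since the log then gives no hint that anything went wrong.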
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
Returned another task, this time with BOINC not running as a service again. Everything in the slot was cleaned. The vdi-file was almost 2GB. 07-May-2015 08:52:24 [CMS-dev] Message from task: 0 07-May-2015 08:52:24 [---] [slot] cleaning out slots/0: handle_exited_app() 07-May-2015 08:52:24 [---] [slot] removed file slots/0/boinc_finish_called 07-May-2015 08:52:24 [---] [slot] removed file slots/0/boinc_task_state.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/init_data.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/output 07-May-2015 08:52:24 [---] [slot] removed file slots/0/stderr.txt 07-May-2015 08:52:24 [---] [slot] removed file slots/0/VBox.log 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.exe 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vboxwrapper_26165_windows_x86_64.pdb 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vbox_checkpoint.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vbox_job.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vbox_remote_desktop.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vbox_webapi.xml 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vm_floppy_0.img 07-May-2015 08:52:24 [---] [slot] removed file slots/0/vm_image.vdi 07-May-2015 08:52:24 [CMS-dev] Computation for task CMS_30674_1427806616.874120_0 finished |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,942,314 RAC: 3,195 |
Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations. Richard, my original instance (which prompted the graphs above) is detailed in the thread VBox Wrappers Updated to 26157. Note especially Message 172 ff. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
Jason suggests that for Windows, the DeleteFile() API call - which underpins all BOINC's cleansing calls - may return 'success' (file queued for deletion), but with a large file, subsequent calls by other threads - e.g. the disk limit check by the next task - may find that it is still present pending the completion of other OS housekeeping operations. Ah - I see you've been round this cycle internally once before. Sorry, I'm a late arrival - I only got involved when '196 (0xc4) EXIT_DISK_LIMIT_EXCEEDED' became a live issue at other projects too, and was eventually tracked back to those same 5.7 GB .vdi leftovers that you saw in the slot directories back in March. But my question remains - how on earth do those files manage to survive three separate BOINC attempts to avoid them? As I posted over at SETI - read from message 1674370 - responsibility for clearing the slot directories lies with BOINC, and In theory, according to David's initial response, the current logic specifies: The implication of Jason's reply was that Windows' retval from DeleteFile() might be 'success', but in practice signify a queued housekeeping task, to be actioned later. I'm sceptical that the disk latencies involved could be so macroscopic as to cause failures hours or even days later - and if they are, then I suggest that it's important that BOINC hardens its attitude to verification that the assumed OS sub-operation has indeed performed its job as expected. I became involved in this issue because of errors reported at Milkyway, Einstein and (I discovered later) WCG. As I put in my initial email to boinc_alpha, "Cross-project errors like this, where the behaviour of one project leads to errors for another project, are hard for project staff to analyse and remediate. Would it be wise for BOINC to perform an additional safety check, that a slot directory which it is proposing to re-use is indeed empty as expected?". |
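The additional safety check proposed in the email quoted above (confirm that a slot directory is empty before re-using it) could look roughly like this. It is a Python sketch under stated assumptions, not the BOINC client's logic: the names `slot_is_empty` and `first_reusable_slot`, the numbered `slots/<n>` layout taken from the logs in this thread, and the `max_slots` cap are all illustrative.

```python
import os


def slot_is_empty(slot_dir):
    """True if `slot_dir` does not exist or contains no entries at all."""
    try:
        with os.scandir(slot_dir) as entries:
            return next(entries, None) is None
    except FileNotFoundError:
        return True


def first_reusable_slot(slots_root, max_slots=64):
    """Return the lowest slot number whose directory is verifiably empty.

    Slots that still hold debris (e.g. a leftover vm_image.vdi) are skipped
    instead of being handed to the next task. Returns None if none is free.
    """
    for number in range(max_slots):
        if slot_is_empty(os.path.join(slots_root, str(number))):
            return number
    return None
```

The point of the design is that the decision is based on what is actually on disk at allocation time, not on whether an earlier cleanup call reported success, so a leftover file from one project cannot push another project's task over its disk limit.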
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,945,852 RAC: 0 |
I don't know how long the undeleted images would have remained without manual deletion but certainly overnight as I normally check my machines before going to work in the morning and returning in the evening. Restarting Boinc wouldn't clear them and Boinc would continually try to reuse the non-empty slot resulting in a whole batch of failed LHC/Sixtrack tasks which is what prompted my search for a cause. Only manual deletion seemed to resolve the problem and free up the slot for use. (As stated in earlier post, if CMS reused the slot, the debris image would be overwritten by the new image, restarting at the initial size.) Now that we're watching, the slots are being properly cleaned out on exit. The image is only getting to a little short of 2GB so it may just have been the 5GB size that was causing the issue. Perhaps the file deletion was started but not completed by the time Boinc called exit so the deletion was then cancelled. However, as Richard says, Boinc should notice that the slot is not empty and therefore not try to reuse it, and/or clean out any debris before reallocating that slot. Maybe not much help, but might point someone in the right direction. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
After four days of watching (not continuously!), we have a winner - or perhaps a loser. After a seemingly normal run and exit, I was left with D:\BOINCdata\slots\1>dir Volume in drive D is Data Volume Serial Number is 7031-B70C Directory of D:\BOINCdata\slots\1 08/05/2015 19:27 <DIR> . 08/05/2015 19:27 <DIR> .. 08/05/2015 19:27 5,148,508,160 vm_image.vdi 1 File(s) 5,148,508,160 bytes 2 Dir(s) 445,699,624,960 bytes free D:\BOINCdata\slots\1> - and that's the first vm_image.vdi file over 4GB that I've seen. Slot cleaning appeared normal: 08/05/2015 19:27:43 | CMS-dev | Message from task: 0 - but notice there's no reference to vm_image.vdi in either 'cleaning out slots' loop. 15 minutes later (after drafting the above), the file remained visible to both Command Prompt and Windows Explorer, and could be copied to another folder (taking the length of time you'd expect for a 4GB file). But it remained invisible to BOINC's clear-out routine: 08/05/2015 19:43:15 | NumberFields@home | task wu_sf3_DS-10x271_Grp502822of682667_0 resumed by user Starting a new CMS-dev task in the same slot replaced the old file with a new one, with a new datestamp and a new file-size. QED. |
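To put the "4 GB sounds like a significant number" hunch into figures, the two file sizes quoted in this thread can be compared against the usual 32-bit limits. The Python sketch below is only arithmetic; it does not show where, or whether, the BOINC client stores a size in a 32-bit field, and the function name `describe_size` is invented for the example.

```python
GIB = 1 << 30                  # bytes in one GiB
INT32_MAX = (1 << 31) - 1      # largest signed 32-bit value
UINT32_MAX = (1 << 32) - 1     # largest unsigned 32-bit value


def describe_size(n_bytes):
    """Report how a byte count relates to the 32-bit integer limits."""
    return {
        "gib": round(n_bytes / GIB, 2),
        "fits_int32": n_bytes <= INT32_MAX,
        "fits_uint32": n_bytes <= UINT32_MAX,
        # Value a 32-bit unsigned field would hold if the size were truncated.
        "wrapped_uint32": n_bytes & UINT32_MAX,
    }


if __name__ == "__main__":
    for size in (2_219_835_392, 5_148_508_160):
        print(size, describe_size(size))
```

The 2,219,835,392-byte image reported earlier is already beyond the signed 32-bit range but still fits in an unsigned 32-bit value, while the 5,148,508,160-byte leftover does not fit in 32 bits at all. That is consistent with a 4 GB boundary and not a 2 GB one, but on its own it proves nothing about the cause.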
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
That's great news, Richard. I got almost to 4GB, but it's hard to get there. I hope Ivan/Laurence are reading here too, because I found the circumstances under which the vm_image.vdi file grows more than normal. You would expect that to be when the VM is doing normal work, but no: it's when CMS's job queue (not the BOINC queue) is empty and the VM doesn't get jobs. That was, and still is, the case today, and that's why your VM could grow above 4GB. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
Those are presumably the times when the VM console just shows an endless queue of tail: /home/boinc/stderr: file truncated - as both machines have been showing every time I've looked today. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
|
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
David's found the problem I've been testing other circumstances where BOINC has to handle >4GB VM files. 1. The project has a 4.5GB base VM in the project directory. BOINC is not able to copy_temp that file into a slot directory. Every minute or few minutes a new copy_temp file is created, mostly with 0 bytes, and init_data.xml is also renewed. The task is running, but a VM can never be created. 2. The task is suspended with LAIM off. The VM, with a >4 GB vdi file in a slot, is saved. Resuming the task takes a bit longer, and the vdi file is then shrunk to 1.35GB, but the VM is restored and running. I've started using the BOINC client of May 8th, which is also called 7.5.1. |
©2024 CERN