Message boards : Number crunching : exceeded disk limit

Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 342 - Posted: 9 May 2015, 9:03:28 UTC - in response to Message 341.  

David says

Thanks.
The problem was that the BOINC client used Windows APIs
for accessing files that didn't work for >= 4GB files.
I fixed this (I think). Rom will have a new private drop soon.
-- David

and Rom says

Here is a new private drop:
x86: http://boinc.berkeley.edu/dl/boinc.080515.x86.zip
x64: http://boinc.berkeley.edu/dl/boinc.080515.x64.zip

----- Rom

Looking at the code change, I think there's a reasonable expectation that this will solve the copying problem too - they've switched from using a 32-bit to a 64-bit version of a low-level Windows library function.
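To illustrate why a 32-bit size field fails at this boundary, here is a minimal sketch in Python. It does not reproduce the actual BOINC code change: the function names are made up for illustration, and the only assumption is that a size is either squeezed into one 32-bit unsigned field or assembled from two 32-bit halves (as Windows structures such as WIN32_FIND_DATA do with nFileSizeHigh and nFileSizeLow).

```python
# Illustrative only: how a file size behaves when squeezed into 32 bits.
# Function names are hypothetical, not taken from the BOINC source.

FOUR_GIB = 1 << 32  # 4,294,967,296 bytes


def size_as_32bit(size_bytes: int) -> int:
    """Keep only the low 32 bits, as a 32-bit size field would."""
    return size_bytes & 0xFFFFFFFF


def size_from_halves(high: int, low: int) -> int:
    """Combine high and low 32-bit halves into the full 64-bit size."""
    return (high << 32) | low


if __name__ == "__main__":
    actual = 6_429_868_032  # a vm_image.vdi size reported in this thread
    print(size_as_32bit(actual))
    print(size_from_halves(actual >> 32, actual & 0xFFFFFFFF) == actual)
```

Anything below 4 GiB passes through the 32-bit view unchanged, which would fit with the problem only showing up once the images grew past that size.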
ID: 342 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 343 - Posted: 9 May 2015, 10:04:54 UTC - in response to Message 342.  

Looking at the code change, I think there's a reasonable expectation that this will solve the copying problem too - they've switched from using a 32-bit to a 64-bit version of a low-level Windows library function.

Thanks Richard.

I'll check first whether the delete problem is solved.
The VM in the slot is now 3.668 MB and 21 hours to go.
Will not touch the task - no pause, suspend, BOINC restart etc.

If that succeeds, I'll test the BOINC-copy to a slot from a >4GB project VM.
ID: 343 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 344 - Posted: 9 May 2015, 13:15:55 UTC - in response to Message 343.  

The VM in the slot is now 3.668 MB and 21 hours to go.

Required file size of vm_image.vdi achieved: 4,22 GB and 18 hours to go.
ID: 344 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 345 - Posted: 9 May 2015, 14:45:09 UTC

Good work David et al.
This is the bit of Boincing I enjoy(?) most: helping to find, identify and isolate a problem so that others can hopefully find a resolution. I've no idea about the coding side, so I can't help there. Much better than having a problem and just going away until it's fixed.

I've currently got one of these with the "file truncated" messages.
At 4 hours the image was 2.25GB, at 14 hours, 4.5GB and currently 21 hours 5.2GB. CPU usage is 2% so it's clearly not doing anything useful, yet the image continues to grow. (That's another issue, for Dr. Ivan to look at.)

I've had a few of these before and I think a manual reset through VBox sparked them back into useful work. Before doing that with this one, is there a particular log file that I could copy to provide any useful diagnostic into what caused the apparent blockage in the first place?

Change of plan:
<3 hours for that one to finish so shortly before it does I'm going to suspend it and upgrade to the test Boinc, hoping that the VM won't reset itself in that process. If that works, we will soon see if the large file gets deleted.
ID: 345 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 346 - Posted: 9 May 2015, 15:19:05 UTC

I have one with a 5.2 GB file, already running under the new private drop - but it's not due to finish for another 4 hours, so Ray will beat me to it.

Likewise, I had to stop BOINC to upgrade part-way through: the VM console is still showing nothing but "file truncated" messages, and very low CPU usage. Although I did a little Test4Theory testing in the early days, I've forgotten what little I learned about fine control of the VM, whether externally or from within BOINC.

It's a pity that the VM/BOINC combination doesn't yet allow a VM-based project to release un-needed resources back for other BOINC projects to use if, as Crystal Pellet suggests, the looping messages only appear "when CMS's job queue (not BOINC-queue) is empty and the VM don't get jobs". That's a question I remember raising with Ben Segal at the 2010 BOINC workshop in London, and I fear it will deter some of the more competitive volunteers from participating in this type of project.
ID: 346 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 347 - Posted: 9 May 2015, 15:56:01 UTC

I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro.
ID: 347 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 348 - Posted: 9 May 2015, 16:11:50 UTC - in response to Message 347.  

I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro.

It will, but as stated before, at least with the BOINC client of the 4th of May, a vdi file >4GB will shrink, so the delete fix can't be tested until the vdi has grown over 4GB again.
ID: 348 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 0
Message 349 - Posted: 9 May 2015, 18:01:23 UTC
Last modified: 9 May 2015, 18:18:49 UTC

Success 8¬) using 080515.x64

Image had grown to 6GB but everything in the slot got cleared out.

Task 56691

09/05/2015 18:32:46 | CMS-dev | Message from task: 0
09/05/2015 18:32:46 | | [slot] cleaning out slots/4: handle_exited_app()
09/05/2015 18:32:46 | | [slot] removed file slots/4/boinc_finish_called
09/05/2015 18:32:46 | | [slot] removed file slots/4/boinc_task_state.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/init_data.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/output
09/05/2015 18:32:46 | | [slot] removed file slots/4/stderr.txt
09/05/2015 18:32:46 | | [slot] removed file slots/4/VBox.log
09/05/2015 18:32:46 | | [slot] removed file slots/4/vboxwrapper_26165_windows_x86_64.exe
09/05/2015 18:32:46 | | [slot] removed file slots/4/vboxwrapper_26165_windows_x86_64.pdb
09/05/2015 18:32:46 | | [slot] removed file slots/4/vbox_checkpoint.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/vbox_job.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/vbox_remote_desktop.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/vbox_webapi.xml
09/05/2015 18:32:46 | | [slot] removed file slots/4/vm_floppy_4.img
09/05/2015 18:32:46 | | [slot] removed file slots/4/vm_image.vdi
09/05/2015 18:32:46 | CMS-dev | Computation for task CMS_30897_1427806622.031532_0 finished

New task again has the "file truncated" messages and in ALT-F5 in red:

ERROR:root:No message received!: Nothing to do!

so I guess there is no actual work available for the VM. I'll suspend this one for now while upgrading the other machine, and will let the new T4T Databridge use that core for a while.
ID: 349 · Rating: 0
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 350 - Posted: 9 May 2015, 18:53:29 UTC - in response to Message 349.  
Last modified: 9 May 2015, 18:54:11 UTC

Success 8¬) using 080515.x64
Image had grown to 6GB but everything in the slot got cleared out.

Well done! I'll try that build tomorrow.

ERROR:root:No message received!: Nothing to do!
so I guess there is no actual work available for the VM so I'll suspend this one for now.

Same here. I tried starting and stopping it a few times, but it's sulking.
ID: 350 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 351 - Posted: 9 May 2015, 18:56:37 UTC

And the same here, with the image file close to the 5.7 GB that was reported in the error reports that brought me here.

09/05/2015 19:46:19 | CMS-dev | Message from task: 0
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: handle_exited_app()
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_finish_called
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_task_state.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/output
09/05/2015 19:46:19 | | [slot] removed file slots/1/stderr.txt
09/05/2015 19:46:19 | | [slot] removed file slots/1/VBox.log
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.exe
09/05/2015 19:46:19 | | [slot] removed file slots/1/vboxwrapper_26165_windows_x86_64.pdb
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_checkpoint.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_job.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_remote_desktop.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vbox_webapi.xml
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi
09/05/2015 19:46:19 | CMS-dev | Computation for task CMS_30909_1427806622.286095_0 finished
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: get_free_slot()
09/05/2015 19:46:19 | NumberFields@home | [slot] assigning slot 1 to wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | | [slot] removed file slots/1/init_data.xml
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/GetDecics_2.00_windows_intelx86 to slots/1/GetDecics_2.00_windows_intelx86
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/sf3_DS-10x271_Grp504817of682667.dat to slots/1/in
09/05/2015 19:46:19 | NumberFields@home | [slot] linked ../../projects/numberfields.asu.edu_NumberFields/wu_sf3_DS-10x271_Grp504817of682667_0_0 to slots/1/out
09/05/2015 19:46:19 | | [slot] removed file slots/1/boinc_temporary_exit
09/05/2015 19:46:19 | NumberFields@home | Starting task wu_sf3_DS-10x271_Grp504817of682667_0
09/05/2015 19:46:19 | NumberFields@home | [cpu_sched] Starting task wu_sf3_DS-10x271_Grp504817of682667_0 using GetDecics version 200 in slot 1

and as you can see, the following task from another project started cleanly and is running normally.
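For anyone who wants to check a long event log for the same thing, the following is a rough sketch in Python. It assumes log lines shaped like the excerpts in this thread (date and time, project, then message, separated by "|"); the function name is made up, and other BOINC versions or locales may format the lines differently.

```python
# Sketch: list the files a BOINC event log reports as removed from a slot.
# Assumes "date time | project | [slot] removed file <path>" lines,
# as in the excerpts quoted in this thread.


def removed_files(log_text: str) -> list[str]:
    marker = "[slot] removed file "
    files = []
    for line in log_text.splitlines():
        # The message is the last "|"-separated field of the line.
        message = line.rsplit("|", 1)[-1].strip()
        if message.startswith(marker):
            files.append(message[len(marker):].strip())
    return files


SAMPLE = """\
09/05/2015 19:46:19 | | [slot] cleaning out slots/1: handle_exited_app()
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_floppy_1.img
09/05/2015 19:46:19 | | [slot] removed file slots/1/vm_image.vdi
09/05/2015 19:46:19 | CMS-dev | Computation for task CMS_30909_1427806622.286095_0 finished
"""

if __name__ == "__main__":
    files = removed_files(SAMPLE)
    # True when the large image was among the removed files.
    print("slots/1/vm_image.vdi" in files)
```

Checking for vm_image.vdi in the result is the quick test for the delete fix; it says nothing about whether the file was actually larger than 4GB at the time.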
ID: 351 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 352 - Posted: 10 May 2015, 7:18:34 UTC
Last modified: 10 May 2015, 7:41:10 UTC

And delete success here too with the newest private BOINC-client boinc.080515.x64 leading to an empty slot:

CMS-dev 10 May 09:04:16 Message from task: 0
10 May 09:04:16 [slot] cleaning out slots/8: handle_exited_app()
10 May 09:04:16 [slot] removed file slots/8/boinc_finish_called
10 May 09:04:16 [slot] removed file slots/8/boinc_task_state.xml
10 May 09:04:16 [slot] removed file slots/8/init_data.xml
10 May 09:04:16 [slot] removed file slots/8/output
10 May 09:04:16 [slot] removed file slots/8/stderr.txt
10 May 09:04:16 [slot] removed file slots/8/VBox.log
10 May 09:04:16 [slot] removed file slots/8/vboxwrapper_26165_windows_x86_64.exe
10 May 09:04:16 [slot] removed file slots/8/vboxwrapper_26165_windows_x86_64.pdb
10 May 09:04:16 [slot] removed file slots/8/vbox_checkpoint.xml
10 May 09:04:16 [slot] removed file slots/8/vbox_job.xml
10 May 09:04:16 [slot] removed file slots/8/vbox_remote_desktop.xml
10 May 09:04:16 [slot] removed file slots/8/vbox_webapi.xml
10 May 09:04:16 [slot] removed file slots/8/vm_floppy_8.img
10 May 09:04:16 [slot] removed file slots/8/vm_image.vdi
CMS-dev 10 May 09:04:16 Computation for task CMS_30966_1427806623.586496_0 finished

Directory contents a few minutes before the task finished:

09-05-2015 09:23 0 boinc_lockfile
10-05-2015 09:02 508 boinc_task_state.xml
09-05-2015 11:26 9.255 init_data.xml
10-05-2015 08:21 8.447 stderr.txt
09-05-2015 15:47 123.660 VBox.log
09-05-2015 08:51 102 vboxwrapper_26165_windows_x86_64.exe
09-05-2015 08:51 102 vboxwrapper_26165_windows_x86_64.pdb
10-05-2015 09:02 217 vbox_checkpoint.xml
09-05-2015 08:51 85 vbox_job.xml
09-05-2015 08:51 69 vbox_remote_desktop.xml
09-05-2015 08:51 53 vbox_webapi.xml
09-05-2015 09:19 28.672 vm_floppy_8.img
10-05-2015 08:53 6.429.868.032 vm_image.vdi

Now I'll test the other 2 circumstances:
1. BOINC-copy >4GB file from project directory to a slot - Outcome: It's working with a 4.28GB project file.
2. Resume task with a vdi file >4GB in the slot - Outcome: This is working too. The now 4.44GB vdi file in the slot didn't shrink like before.

Remark 1: Although no real work was done by the VM, the CPU-usage was 19.1% of the elapsed time.
Remark 2: I don't know why I'm getting so high credits for those tasks.
ID: 352 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 353 - Posted: 10 May 2015, 9:34:22 UTC - in response to Message 347.  

My second machine also transitioned gracefully from CMS to another project and back again, without errors - but since it happened at 02:30 in the morning, I didn't see what the final image file size was.

I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro.

I couldn't find any equivalent number in the current file-set, but I did find

<job_duration>86400</job_duration>

in CMS_26_03_2015a.xml (project folder). Changing that to 43200 didn't affect the running task, but it did start the next one on course for what looks like a 12-hour run - so the transitions will happen in daylight, for a while at least.
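For anyone preferring a script to a text editor, a minimal sketch of that edit in Python follows. Only the job_duration tag comes from the file itself; the vbox_job wrapper element in the sample and the function name are assumptions for illustration, so check the real file's layout (and keep a backup) before trying anything like this.

```python
# Sketch: change the <job_duration> value in a job description XML.
# The <vbox_job> wrapper in SAMPLE is an assumed layout, not a copy of
# the real CMS_26_03_2015a.xml.
import xml.etree.ElementTree as ET


def set_job_duration(xml_text: str, seconds: int) -> str:
    root = ET.fromstring(xml_text)
    # Accept <job_duration> as the root element or anywhere below it.
    node = root if root.tag == "job_duration" else root.find(".//job_duration")
    if node is None:
        raise ValueError("no <job_duration> element found")
    node.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


SAMPLE = "<vbox_job><job_duration>86400</job_duration></vbox_job>"

if __name__ == "__main__":
    # 43200 seconds is 12 hours, half the stock 86400.
    print(set_job_duration(SAMPLE, 43200))
```

The function works on text rather than on the file directly, so the reading and writing of the real file is left to the user.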

Another useful question turned up on the BOINC message boards yesterday:

What is this: vbox_windows set to 0
ID: 353 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 354 - Posted: 10 May 2015, 11:00:29 UTC - in response to Message 353.  


Another useful question turned up on the BOINC message boards yesterday:

What is this: vbox_windows set to 0

It's here

<vbox_window>0|1</vbox_window>

If enabled, launch VBox applications in full interactive mode.
Otherwise, run them silently with VBoxHeadless.

John
ID: 354 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 355 - Posted: 10 May 2015, 11:37:53 UTC - in response to Message 354.  

Yes - that's what I added yesterday, in response to the question ;-) (check the Wiki history tab!)

It's based on my own observation, rather than formal guidance from the developers - it wasn't documented when the question was asked.
ID: 355 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 356 - Posted: 10 May 2015, 11:41:12 UTC - in response to Message 355.  
Last modified: 10 May 2015, 11:45:31 UTC

Ahh, that explains it. It was rather a long way down the list.
Hopefully the developers will realise that updating the documentation
is as important as updating the code.
I'll go back to sleep.
ID: 356 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 357 - Posted: 10 May 2015, 11:50:39 UTC - in response to Message 347.  

I've been wondering whether these will accept a "graceful finish" by editing the checkpoint file like T4T do/did? Experiment for 2moro.

I didn't react earlier, because I wasn't sure what would happen after a task-resume with an oversized VM in a slot with the new BOINC client. Had to test that first; with success.

Like T4T you may end your task gracefully by editing the vbox_checkpoint.xml in the corresponding slot of the task.
Normally the job duration is 86400 seconds, so if you want an early finish
- suspend the task with 'Leave application in memory' off. The VM will be in saved state and not in paused state.
- change the elapsed time to 86400 in vbox_checkpoint.xml in the slot.
- Resume the task.
ID: 357 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 358 - Posted: 10 May 2015, 11:56:22 UTC - in response to Message 356.  
Last modified: 10 May 2015, 11:57:20 UTC

It was rather a long way down the list.

Alphabetical order.

Recent versions of BOINC auto-generate a fully-populated cc_config.xml template, also in alphabetic order, when some options are set via the GUI. That caused problems for some users, who added their own tags manually and ended up with duplicate entries. Best to keep the strict alphabetic order for both the configuration file and its documentation.
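A quick way to spot such duplicates is sketched below in Python. It assumes the options sit inside an options element under cc_config; the function name is made up for illustration. Some options may legitimately appear more than once, so treat the result as a hint for a manual check rather than a verdict.

```python
# Sketch: report option tags that occur more than once in cc_config.xml.
# Assumes the <cc_config><options>...</options></cc_config> layout.
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_options(xml_text: str) -> list[str]:
    root = ET.fromstring(xml_text)
    options = root.find("options")
    if options is None:
        return []
    # Count how often each direct child tag of <options> appears.
    counts = Counter(child.tag for child in options)
    return sorted(tag for tag, n in counts.items() if n > 1)


SAMPLE = """\
<cc_config>
  <options>
    <max_file_xfers>8</max_file_xfers>
    <vbox_window>0</vbox_window>
    <vbox_window>1</vbox_window>
  </options>
</cc_config>
"""

if __name__ == "__main__":
    print(duplicate_options(SAMPLE))
```

In the sample, vbox_window is entered twice with conflicting values, which is the kind of manual addition described above.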
ID: 358 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 359 - Posted: 10 May 2015, 18:18:47 UTC

David thanks us all, but implicitly asks us to keep our eyes open for any other glitches:
That's good news.
I guess that CERN's VM images are the first > 4GB files that BOINC has dealt with.
Not surprising that there were glitches (and we may find others).
-- David

Following a question from Sekerob, I've tested that my 4.79 GB .vdi file was properly detected and deleted under a 32-bit Windows OS and the 32-bit version of the private drop. I've now set host 393 up to run under the 32-bit versions of VBox wrapper, but otherwise mimicking the stock delivery - a 1-hour first test run seems to have started normally, and has reached the 'file truncated' loop. If this run works, I'll let the next one run the full 24 hours, but suspend work fetch after that (I have to be away for a few days next week).
ID: 359 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 360 - Posted: 10 May 2015, 20:29:27 UTC - in response to Message 359.  

Hi Richard,

I don't know whether testing 32-bit is useful at CMS, because CMS has only vbox64 applications and the Linux VM is 64-bit too.
ID: 360 · Rating: 0
Richard Haselgrove

Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 361 - Posted: 10 May 2015, 20:47:59 UTC - in response to Message 360.  

Well, it seems to be running as well as the others at the moment:



- we won't know for certain until the back end starts supplying jobs again, of course.

CMS only supplies 64-bit apps, sure - but the app it supplies is BOINC's 64-bit VBox wrapper. I simply set up an app_info file and substituted the 32-bit wrapper files, and off it went. The whole point of a VM is that the guest OS doesn't have to match the host: my hardware on this host is fully 64-bit capable, and includes the virtualization hooks to enable VBox to run.
ID: 361 · Rating: 0


©2024 CERN