Message boards : Number crunching : vboxwrapper issue
Message board moderation

To post messages, you must log in.

1 · 2 · Next

AuthorMessage
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1055 - Posted: 8 Sep 2015, 21:29:32 UTC

I just noticed, that the vboxwrapper is constantly, without any interruption, writing to the boinc/slots/ x folder.

ATLAS and CMS are doing the same thing.

It is writing to the VBox.log file at a rate of 100 to 150 KByte/sec.

However, the file is not updated and is only 65 KByte in size.

Also, when when looking at the slot folder the time of last change of the VBox.log file jumps between the current time and about half an hour earlier back and forth.

I moved the boinc data directory to a usb flash drive.
Everything else seems to be working.

Anybody have any ideas, what is going on?
ID: 1055 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 1057 - Posted: 8 Sep 2015, 22:54:55 UTC - in response to Message 1055.  

That's disturbing! Anyone else seeing this? Can't try myself as I'm at home at the moment (where my bandwidth has crept up to 2.2 Mbps lately, but still not enough to run CMS@Home).
ID: 1057 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1059 - Posted: 9 Sep 2015, 0:13:05 UTC - in response to Message 1057.  

I have checked it in more detail. I moved the boinc data directory back to the hard disk, just in case.
I used process monitor to view it.
It seems, vbox.log is written to about 3 times a second, without actually updating anything(size remains exactly the same, about 65KB).
This applies to ATLAS as well.

I understand, that logs need to be kept, but this is excessive.

-------vboxwrapper_26169_windows_x86_64.exe----

Is the offender.
ID: 1059 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 1061 - Posted: 9 Sep 2015, 9:06:26 UTC - in response to Message 1059.  

I have checked it in more detail. I moved the boinc data directory back to the hard disk, just in case.
I used process monitor to view it.
It seems, vbox.log is written to about 3 times a second, without actually updating anything(size remains exactly the same, about 65KB).
This applies to ATLAS as well.

I understand, that logs need to be kept, but this is excessive.

-------vboxwrapper_26169_windows_x86_64.exe----

Is the offender.

It's a text file, right? Have you tried tail -f -- oh, you're not using Linux...
It's just that 65536 is a convenient binary number, leading to thoughts of a limited buffer. Since it's not limited to CMS it sounds like something to be researched on Oracle's forums, if they have any.
ID: 1061 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile PDW

Send message
Joined: 20 May 15
Posts: 217
Credit: 5,473,403
RAC: 11,877
Message 1063 - Posted: 9 Sep 2015, 9:18:50 UTC - in response to Message 1061.  

On the single job I still have running my VBox.log is only 58Kb in size and was last updated over an hour ago just after the job (and VM) got started. But my job is never getting a credential to get access to download any work.
ID: 1063 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 1064 - Posted: 9 Sep 2015, 9:28:43 UTC

First: It's not a vboxwrapper issue, but VirtualBox.

I've seen too, that VBox.log is updated frequently - not always, but every now and then.
The filesize not updating could be caused by the initial 64k filesize.
As long as the written lines fits within that 64k, you don't see a difference.
The same log is kept in slots/x/boinc_xxxxx/logs.

Maybe the reason of frequent writing to that file is time outsyncing of the host and the guest.
ID: 1064 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1065 - Posted: 9 Sep 2015, 10:10:04 UTC

I have deleted the Vbox.log and it reappears instantly.
It says it has been created and updated 11 hours ago.
I have renamed it, that works, but the original file reappears and still claims to have been created and last changed 11 hours ago.

It is not updating but rewriting the file every time. They are absolutely identical.
ID: 1065 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile PDW

Send message
Joined: 20 May 15
Posts: 217
Credit: 5,473,403
RAC: 11,877
Message 1066 - Posted: 9 Sep 2015, 10:50:19 UTC - in response to Message 1065.  

If I add a line of text to the file it saves it but then when I open it my line has gone. It does this even if the task is suspended, the VM needs to go to Saved state (by shutting down BOINC) before the constant updating stops and my line of text is preserved.

Haven't got any Atlas tasks running to check that does the same.

Not looking good for SSD's
ID: 1066 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile PDW

Send message
Joined: 20 May 15
Posts: 217
Credit: 5,473,403
RAC: 11,877
Message 1067 - Posted: 9 Sep 2015, 12:21:27 UTC - in response to Message 1064.  

First: It's not a vboxwrapper issue, but VirtualBox.

If the vboxwrapper is setting up how virtualbox runs is that not a vboxwrapper problem ?

Perhaps the flush interval for the vbox logging is working to the same time scale as the 1 hour x509 credential request delay that lasts only 1 minute !
ID: 1067 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 1068 - Posted: 9 Sep 2015, 13:25:48 UTC

There was a discussion about this on the boinc_alpha email last week. Selected extracts:

David Anderson wrote

A glance at the source code shows that vboxwrapper
copies the entire log file, then looks at the last 16KB of the copy.
It does this every second.

Rom Walton wrote

Ah, that makes sense now.

I thought I had converted the log extraction over to using the COM API when running on Windows.

My initial intent was to prevent interfering with VirtualBox if it was attempting to do something with its log file.

----- Rom

Rom again

- Can we improve this?
Yes. Just switching over to the API should help a bit. Or at the minimum switches the disk IO over to the vboxheadless process. A couple of more changes will be required to really reduce IO though.

IMachine::readLog() accepts offset and size parameters, so we could get the majority case (no new log messages have been written by VirtualBox) down to zero disk IO.

----- Rom

So, maybe the next version of the wrapper?
ID: 1068 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 1069 - Posted: 9 Sep 2015, 13:36:09 UTC - in response to Message 1067.  
Last modified: 9 Sep 2015, 13:39:02 UTC

First: It's not a vboxwrapper issue, but VirtualBox.

If the vboxwrapper is setting up how virtualbox runs is that not a vboxwrapper problem ?

Maybe you're right.
The vbox.log from the Logs-directory (controlled by VirtualBox) is copied to the slots directory.
Probably while the wrapper add Guest log information from there to the stderr.txt.

Edit: Meanwhile Richard jumped onto this. Thanks Richard!
ID: 1069 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile PDW

Send message
Joined: 20 May 15
Posts: 217
Credit: 5,473,403
RAC: 11,877
Message 1070 - Posted: 9 Sep 2015, 14:08:41 UTC - in response to Message 1068.  

There was a discussion about this on the boinc_alpha email last week.

So, maybe the next version of the wrapper?

Do we have a rough ETA for that ?
How long does it take to get a new wrapper to be used on a project ?
ID: 1070 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1072 - Posted: 9 Sep 2015, 14:26:58 UTC

Not looking good for SSD's


That is my concern.If it is writing the same file with the same length every second or more often, it is going to wear down that sector on the SSD in no time flat.

A minimum of 86400 writes per day.
I wonder, if the write buffer would catch(minimize) that.
ID: 1072 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile PDW

Send message
Joined: 20 May 15
Posts: 217
Credit: 5,473,403
RAC: 11,877
Message 1073 - Posted: 9 Sep 2015, 14:31:01 UTC - in response to Message 1068.  

I thought I had converted the log extraction over to using the COM API when running on Windows.

Richard, does this mean it only affects Windows ?
ID: 1073 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 1074 - Posted: 9 Sep 2015, 14:40:35 UTC - in response to Message 1070.  

There was a discussion about this on the boinc_alpha email last week.

So, maybe the next version of the wrapper?

Do we have a rough ETA for that ?
How long does it take to get a new wrapper to be used on a project ?

Jacob Klein started that conversation last Wednesday (2nd September), because he saw the same heavy disk usage at RNA World. First of all, somebody - probably Rom Walton - would have to code the change, and there's no sign of that yet.

Rom is no longer a paid employee of BOINC (since the BOINC project failed to get renewal of their NSF grant funding in the summer), and is coding as a volunteer in his spare time. With last weekend being the Labor Day holiday in the USA, I don't blame him taking a break. Compiled wrappers are usually available for downloading soon after coding, but I imagine projects using them would want a period of in-house testing before deploying them for general use.

There is a wrapper v26170 already - one newer than we use here - but that was compiled on 25 August, so it won't have this fix.
ID: 1074 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 1075 - Posted: 9 Sep 2015, 14:41:51 UTC - in response to Message 1073.  

I thought I had converted the log extraction over to using the COM API when running on Windows.

Richard, does this mean it only affects Windows ?

Someone else will have to answer that one - any 'nixers out there able to comment?
ID: 1075 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Jim1348

Send message
Joined: 17 Aug 15
Posts: 17
Credit: 228,358
RAC: 0
Message 1076 - Posted: 9 Sep 2015, 15:34:58 UTC - in response to Message 1072.  

Not looking good for SSD's


That is my concern.If it is writing the same file with the same length every second or more often, it is going to wear down that sector on the SSD in no time flat.

A minimum of 86400 writes per day.
I wonder, if the write buffer would catch(minimize) that.

The SSD wear-leveling algorithm (implemented via the firmware) will spread out the writes to different cells, even if the logical address is the same. I doubt that the small number of writes to a log will bother it at all.

On the other hand, for projects where writes are a real problems (WCG/CEP2 most of all, but also CPDN and ATLAS), I use a write cache (PrimoCache) or a ramdisk. That is mainly to save the SSDs I use, but it also reduces the error rates. Apparently a high write-rate causes contention that some of the programs don't handle very well.
ID: 1076 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1096 - Posted: 16 Sep 2015, 19:26:19 UTC
Last modified: 16 Sep 2015, 19:41:12 UTC

Writes Write Bytes Path
3.215 152.925.964 <Total>
1.707 115.524.888 \slots\7\VBox.log
1.502 37.398.904 \slots\7\vm_image.vdi
3 1.521 \slots\7\boinc_task_state.xml
3 651 \slots\7\vbox_checkpoint.xml


This is a list of all file writes during a 15min period to the active CMS slot folder.
Please notice the excessive write activity to the vbox.log file.

It writes 3 times the amount of data a single log file than to the ENTIRE VM_IMAGE.

I really think, someone should address that sooner than later

    ID: 1096 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
    Crystal Pellet
    Volunteer tester

    Send message
    Joined: 13 Feb 15
    Posts: 1180
    Credit: 815,336
    RAC: 431
    Message 1223 - Posted: 12 Oct 2015, 6:34:13 UTC

    Hello Ivan/Laurence,

    I'm testing vboxwrapper v26175.
    In that version the huge number of disk i/o's of slot's vbox.log should be reduced.
    I had an issue overnight with this version reading the Volunteer Info:

    .
    .
    .
    08:00:01 +0200 2015-10-12 [INFO] Reading the BOINC volunteer's information
    08:00:02 +0200 2015-10-12 [INFO] Volunteer: () Host:
    08:00:02 +0200 2015-10-12 [INFO] BOINC_USERID is not an integer
    08:00:02 +0200 2015-10-12 [INFO] Going to sleep for 1 hour
    08:01:02 +0200 2015-10-12 [INFO] Starting CMS Application - Run 8
    08:01:02 +0200 2015-10-12 [INFO] Reading the BOINC volunteer's information
    08:01:03 +0200 2015-10-12 [INFO] Volunteer: () Host:
    08:01:03 +0200 2015-10-12 [INFO] BOINC_USERID is not an integer
    08:01:03 +0200 2015-10-12 [INFO] Going to sleep for 1 hour


    until I rebooted the VM this morning.
    From where do you read that info? Is it from init_data.xml in the slot directory?

    08:13:01 +0200 2015-10-12 [INFO] Starting CMS Application - Run 9
    08:13:01 +0200 2015-10-12 [INFO] Reading the BOINC volunteer's information
    08:13:02 +0200 2015-10-12 [INFO] Volunteer: Crystal Pellet (38) Host: 37
    08:13:02 +0200 2015-10-12 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
    08:13:02 +0200 2015-10-12 [INFO] Requesting an X509 credential
    subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38/CN=58327480
    issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38
    identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38
    type : RFC 3820 compliant impersonation proxy
    strength : 1024 bits
    path : /tmp/x509up_u500
    timeleft : 129:59:57 (5.4 days)
    08:13:08 +0200 2015-10-12 [INFO] Downloading glidein
    08:13:15 +0200 2015-10-12 [INFO] Running glidein (check logs)


    Btw: sleep 1 hour seems to be 1 minute.
    ID: 1223 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
    Profile Laurence
    Project administrator
    Project developer
    Project tester
    Avatar

    Send message
    Joined: 12 Sep 14
    Posts: 1064
    Credit: 327,073
    RAC: 133
    Message 1227 - Posted: 12 Oct 2015, 11:52:41 UTC - in response to Message 1223.  

    In the slot directory you will find a floppy image file. This is attached to the VM and the user information can then be read from /dev/fd0.

    Although the cron sleeps for 1 hour, the next cron job is run. I guess this is a bug :(
    ID: 1227 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
    1 · 2 · Next

    Message boards : Number crunching : vboxwrapper issue


    ©2024 CERN