1) Message boards : CMS Application : New Version 60.64 (Message 7758)
Posted 20 Aug 2022 by Jim1348
Post:
That is all very interesting, if a little strange.
I would hope that in production, they limit the CMS to a single core in that case, so that you can select more than one core for native ATLAS.
2) Message boards : CMS Application : New Version 60.64 (Message 7753)
Posted 18 Aug 2022 by Jim1348
Post:
I suppose this will be fixed in production, so that you get the benefit of the additional cores?
I can just wait until then. Thanks.
3) Message boards : CMS Application : New Version 60.64 (Message 7750)
Posted 18 Aug 2022 by Jim1348
Post:
I am running one 60.64 task with four cores allocated, but BoincTasks shows that only one core is in use.
(Ryzen 3900X, Ubuntu 20.04.4, BOINC 7.16.6, VBox 6.1.34)

And the estimated time shows 18 hours total, with 1 hour completed.
It appears that multi-threading (mt) is not working properly.
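For context, one way to request the cores for a vbox mt task is an app_config.xml along these lines; this is only a sketch, and the app name and plan class are placeholders that would need to be checked against client_state.xml for this project:

  <app_config>
    <app_version>
      <app_name>CMS</app_name>                      <!-- placeholder -->
      <plan_class>vbox64_mt_mcore_cms</plan_class>  <!-- placeholder -->
      <avg_ncpus>4</avg_ncpus>
    </app_version>
  </app_config>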
4) Message boards : ATLAS Application : ATLAS very long simulation v1.01 download errors (Message 7138)
Posted 22 Mar 2021 by Jim1348
Post:
Well, after several more failures, I finally downloaded two that are running properly.
Maybe the drought is over.

EDIT: Not quite over. I just got nine more failures. It looks like a mix of good and bad ones.
5) Message boards : ATLAS Application : ATLAS very long simulation v1.01 download errors (Message 7133)
Posted 21 Mar 2021 by Jim1348
Post:
I just attached a Ryzen 3600 (Ubuntu 20.04.2), and all I am seeing are download errors.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4365

I have attached four other machines that have not had that problem.
Is it on my end?
6) Message boards : News : New native Linux ATLAS application (Message 4717)
Posted 24 Feb 2017 by Jim1348
Post:
I am willing to give it a try, but what is the payoff?
That is, is there much of a performance advantage? If it works, will you be able to access it on the main LHC site?
7) Message boards : News : New CMS App v46.26 (Message 2238)
Posted 4 Mar 2016 by Jim1348
Post:
In "HKEY_LOCAL_MACHINE/SYSTEM/CurrentControlSet/Control"
there is "WaitToKillServiceTimeout".

All values are in milliseconds (ms).

I've no idea if there are special conditions around updates or notebooks, but they might give starting points for a bit of experimenting. The value on the XP boxes around here for "WaitToKillAppTimeout" is 20 seconds (20000).

Edit: This comes from notes I've kept from way back, when the time taken to merge/delete snapshots was a problem. I didn't record their source, so a bit of creative Googling might be in order.

Here is some uncreative Googling. I think this is the right one for apps:
http://www.eightforums.com/tutorials/37424-waittokillapptimeout-specify-shutdown-windows.html

For services:
http://www.makeuseof.com/tag/3-ways-speed-windows-7-shutdown-process/

I don't know if you need both, but it wouldn't hurt to set them both to a suitably high value.
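If anyone wants to set them from the command line, something like this from an elevated prompt should do it; the 120000 (two minutes) is only an example value, and as I understand it both are REG_SZ strings of milliseconds:

  reg add "HKLM\SYSTEM\CurrentControlSet\Control" /v WaitToKillServiceTimeout /t REG_SZ /d 120000 /f
  reg add "HKCU\Control Panel\Desktop" /v WaitToKillAppTimeout /t REG_SZ /d 120000 /f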
8) Message boards : News : New CMS App v46.26 (Message 2230)
Posted 4 Mar 2016 by Jim1348
Post:
There are situations where Windows will not allow applications to take longer than 10-15 seconds to shut down. In those cases Windows will just terminate the application. One specific case is during the process of installing patches. Another would be if the volunteer has Windows configured to shut down when a notebook screen is closed.

Hummm. I sometimes use JV16 Power Tools to shorten the delay before Windows (Win7 64-bit for me) closes down a hung application. Maybe the long delay in shutting down the VM is considered a "hung application"?

I will avoid doing this on those machines running VBox.
9) Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers (Message 1558)
Posted 4 Jan 2016 by Jim1348
Post:
I just did a normal software shutdown via the Windows Start button each time. I don't think there were any glitches, though I was working on the fans to some extent while it ran. If there had been any real crashes, they would have shown up in the Event Viewer, but it shows only normal shutdowns; certainly no BSODs.

I do get a somewhat curious error, which I have posted about before:
Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 1/4/2016 4:17:39 AM
Event ID: 19
Task Category: None
Level: Warning
Keywords:
User: LOCAL SERVICE
Computer: i7-4790-PC
Description:
A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 4


But that seems to be a normal bug on Haswell machines running VMs and either CMS-dev or vLHC; I am not sure which. It does not cause any problems that I can see.

EDIT: I do manage that PC over the LAN using TightVNC; there is normally no monitor attached. I don't see how that would affect the shutdown, though.
10) Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers (Message 1556)
Posted 4 Jan 2016 by Jim1348
Post:
I took my machine down two or three times this morning to replace fans. The downtime was maybe 10 to 40 minutes each time. It looks like one of those outages caught a task:
http://boincai05.cern.ch/CMS-dev/result.php?resultid=74935
11) Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers (Message 1540)
Posted 2 Jan 2016 by Jim1348
Post:
Fsutil might help a little, but I don't see that it can cache anywhere near enough for my purposes. I want the writes associated with BOINC to go to main memory, to protect the SSD. I use either a ramdisk for that purpose, placing the BOINC Data folder on the ramdisk, or else a write cache. If it is large enough, a write cache accomplishes basically the same thing as a ramdisk, and it is a little easier to set up since you don't have to change the default location of the BOINC Data folder.

I normally set the size of the cache or the ramdisk to around 11 GB to 18 GB, which is enough for running my BOINC projects on 6 cores. Smaller BOINC projects could get by with much less memory; they often don't write as much anyway, and hence are less of a threat to SSD lifetime.
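For anyone who would rather not buy a caching utility, a free ramdisk tool such as ImDisk can do the same job; a sketch of the command is below (the size and drive letter are only examples, and you would then point the BOINC Data directory at the new drive):

  imdisk -a -t vm -s 14G -m R: -p "/fs:ntfs /q /y"

Keep in mind that a plain ramdisk is volatile, so anything on it is lost at power-off unless you save an image of it first.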

Some of the SSDs now have their own caches included in their utilities; the Samsung Rapid Mode cache is sized at 1 GB, and the Crucial Momentum cache can go as high as 4 GB if you have enough memory. They might be enough to protect your SSD for most purposes, though you would still be writing to the disk somewhat. Their cache is split between read and write caching in some undisclosed manner, perhaps based on usage, but for protecting SSDs only the write caching is relevant. Note that you can still read back data that was written into a write cache; it just does not automatically place data in the cache on reads, but reads do not produce wear on the SSD anyway.

Thanks for the tip about the credits. I will look into it further now.
12) Message boards : Number crunching : CMS-dev only suitable for 24/7 BOINC-crunchers (Message 1538)
Posted 2 Jan 2016 by Jim1348
Post:
This is a timely subject, as I just turned off my CERN machine, which normally runs 24/7, in order to move it to another part of the basement. It was off for about 10 minutes, and that doesn't seem to have done it any harm:
http://boincai05.cern.ch/CMS-dev/results.php?hostid=688
http://lhcathome2.cern.ch/vLHCathome/results.php?hostid=85673
http://atlasathome.cern.ch/results.php?hostid=17032

But this machine is a bit unusual in using a large write cache (14 GB PrimoCache), intended to protect the SSD from the high write rate of ATLAS, among others. However, I have found that it also reduces error rates on ATLAS and a variety of other BOINC projects, especially CPDN. That leaves me 18 GB of working memory, enough for any of the CERN projects on 6 cores of my i7-4790 while reserving 2 cores for the GPUs. Also, I use an uninterruptible power supply to prevent cache corruption in case of power outages, which helps prevent errors on various BOINC projects as well.
13) Message boards : News : New jobs available (Message 1280)
Posted 21 Oct 2015 by Jim1348
Post:
In running the new CMS-dev jobs on my i7-4790 (Win7 64-bit, ASRock Z97 motherboard), I am starting to see Event ID 19, which is an "Internal parity error".

This machine is not overclocked in any way, and I have never seen this error before in any circumstance.
However, I am now running two vLHC jobs also, which I hadn't done before with the CMS-dev, so it may be the combination that is triggering the error.

It may not really be an error, but more of a bug, since it has all the same symptoms as this: https://communities.vmware.com/thread/471348
At present, it is occurring about every half hour to hour and a half, and the processor ID varies. It has not caused any operational difficulties or crashes thus far.
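A quick way to count how often it is being logged is a PowerShell query like this (nothing CMS-specific about it, just a filter on the WHEA-Logger events from the last day):

  Get-WinEvent -FilterHashtable @{LogName='System'; ProviderName='Microsoft-Windows-WHEA-Logger'; Id=19; StartTime=(Get-Date).AddDays(-1)} | Measure-Object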

In case anyone is interested, the full event is:

Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 10/21/2015 7:59:49 AM
Event ID: 19
Task Category: None
Level: Warning
Keywords:
User: LOCAL SERVICE
Computer: i7-4790-PC
Description:
A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 1

The details view of this entry contains further information.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{C26C4F3C-3F66-4E99-8F8A-39405CFED220}" />
<EventID>19</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-10-21T11:59:49.190491900Z" />
<EventRecordID>17477</EventRecordID>
<Correlation ActivityID="{9635EFCA-D3B3-4861-B150-A061BC3C0D5E}" />
<Execution ProcessID="1520" ThreadID="2156" />
<Channel>System</Channel>
<Computer>i7-4790-PC</Computer>
<Security UserID="S-1-5-19" />
</System>
<EventData>
<Data Name="ErrorSource">1</Data>
<Data Name="ApicId">1</Data>
<Data Name="MCABank">0</Data>
<Data Name="MciStat">0x90000040000f0005</Data>
<Data Name="MciAddr">0x0</Data>
<Data Name="MciMisc">0x0</Data>
<Data Name="ErrorType">12</Data>
<Data Name="TransactionType">256</Data>
<Data Name="Participation">256</Data>
<Data Name="RequestType">256</Data>
<Data Name="MemorIO">256</Data>
<Data Name="MemHierarchyLvl">256</Data>
<Data Name="Timeout">256</Data>
<Data Name="OperationType">256</Data>
<Data Name="Channel">256</Data>
<Data Name="Length">864</Data>
<Data Name="RawData">435045521002FFFFFFFF0300020000000200000060030000303B0B00150A0F140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F890A688C9ADD60BD10100000000000000000000000000000000000000000000000058010000C00000000102000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000018020000400000000102000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000002000000000000000000000000000000000000000000000058020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E80000000000000000000000000000000002000000000000000000000000000000000000000000000057010000000000000002080000000000C30603000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000003000000000000000100000000000000C306030000081001FFFBFA7FFFFBEBBF000000000000000000000000000000000000000000000000000000000000000001000000010000009E8324FCF70BD10101000000000000000000000000000000000000000000000005000F0040000090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
</EventData>
</Event>
14) Message boards : Number crunching : vboxwrapper issue (Message 1076)
Posted 9 Sep 2015 by Jim1348
Post:
Not looking good for SSD's


That is my concern. If it is writing the same file with the same length every second or more often, it is going to wear down that sector on the SSD in no time flat.

A minimum of 86400 writes per day.
I wonder if the write buffer would catch (minimize) that.

The SSD wear-leveling algorithm (implemented via the firmware) will spread out the writes to different cells, even if the logical address is the same. I doubt that the small number of writes to a log will bother it at all.

On the other hand, for projects where writes are a real problem (WCG/CEP2 most of all, but also CPDN and ATLAS), I use a write cache (PrimoCache) or a ramdisk. That is mainly to save the SSDs I use, but it also reduces the error rates. Apparently a high write rate causes contention that some of the programs don't handle very well.
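As a rough back-of-the-envelope check of that, here is a small Python sketch; the per-write size, drive capacity, and endurance figures are only assumptions for illustration:

  # Rough estimate of SSD wear from one small log write per second,
  # assuming wear leveling spreads the writes across the whole drive.
  writes_per_day = 86400        # one write per second
  write_size = 4096             # assumed: one 4 KiB page per write
  drive_bytes = 250e9           # assumed: 250 GB SSD
  pe_cycles = 3000              # assumed: rated program/erase cycles per cell

  bytes_per_day = writes_per_day * write_size                  # ~0.35 GB/day
  days_of_headroom = drive_bytes * pe_cycles / bytes_per_day
  print(f"{bytes_per_day/1e9:.2f} GB/day, ~{days_of_headroom/365:.0f} years of headroom")

In other words, the log writes on their own are a rounding error; it is the projects with genuinely high write rates, like the ones above, that need the cache or ramdisk.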
15) Message boards : Number crunching : Multiple Jobs In A Single Host (Message 1010)
Posted 5 Sep 2015 by Jim1348
Post:
Note that there is a "bug" in that whilst the app_config file will allow only one task to run at a time, the work fetch process doesn't take this into account. So more tasks can be downloaded only for the "excess" to sit there waiting. This can also result in idle cores (presumably the one that would have run the waiting task).

Quite true. It has not been a problem here so far, since the limit set for the CMS tasks is only "1". But to prevent CPU starvation on other projects, you can adjust the "resource share" for each project accordingly.

For example, I have 6 cores available (2 are reserved for GPUs), and if I want 2 cores for Project A and 4 cores for Project B, I use app_config to limit Project A to 2 work units max. Then I adjust the resource share to 50% for Project A, which sets its downloads to 1/3 of the total. I don't need to do anything for Project B, and it all works out well most of the time. Occasionally there are mis-estimates of the running time, but they get corrected eventually.
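As a sketch, the app_config.xml for Project A would be something like this; the app name below is just a placeholder for whatever Project A's application is actually called (see client_state.xml):

  <app_config>
    <app>
      <name>project_a_app</name>
      <max_concurrent>2</max_concurrent>
    </app>
  </app_config>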

You don't actually need the app_config at all in that case if you are willing to live with long-term averages, but there are projects that take a lot of memory where I do want a limit at all times.
16) Message boards : Number crunching : anti virus exclusion not working (Message 999)
Posted 4 Sep 2015 by Jim1348
Post:
I have found that for many of the AVs, the exclusions apply only to the scans (e.g., a daily scheduled scan), but not to the real-time protection. That has caused me no end of trouble, and I limit the use of AVs. On my main machine, I use only Windows Defender (spyware-only on Win7), which causes no problems, and on my dedicated machines I don't use any at all, since they run only distributed computing work.
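For what it's worth, on Windows 8 and later the Defender exclusions can also be added from an elevated PowerShell prompt; the paths below are only examples of the usual BOINC and VirtualBox locations:

  Add-MpPreference -ExclusionPath "C:\ProgramData\BOINC"
  Add-MpPreference -ExclusionPath "C:\Program Files\Oracle\VirtualBox"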

But if you want to try another AV, the exclusions in Panda (free) worked, and so did those in Avira (free), the last time I tried them.
17) Message boards : News : No new jobs (Message 830)
Posted 23 Aug 2015 by Jim1348
Post:
I quite agree that the credit system does not serve much of a purpose, except to confirm that your PC is working. Isn't the CreditNew system supposed to make it more equal between projects? That would suit me fine. The absolute number of points is meaningless, as long as they are roughly comparable.

The numbers handed out on a lot of projects should be reduced by two (or three) orders of magnitude anyway, to make them less cumbersome to deal with. If some people leave, you might gain others with a better appreciation of what the project is really all about.


