1) Message boards : Sixtrack Application : Xtrack beam simulation (Message 7972)
Posted 15 Mar 2023 by marmot
Post:
Looked at some of the WU's , that are labeled as able to work on Linux or Windows, and some of these hitting Linux are completing and validating with 3 sec CPU runtime.
When they hit a Windows OS, they fail and don't end at deadline.

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2283919
2) Message boards : Sixtrack Application : Xtrack beam simulation (Message 7937)
Posted 8 Mar 2023 by marmot
Post:
Posted my evaluation of the 38 I have running here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=557&postid=7927

The 33 still running are over 2 days into the run and deadline sometime in 24 hours.

These are supposed to need a GPU co-proc?
If true, then their GPU detection/self-aborting, code is not functioning.
3) Message boards : Sixtrack Application : XTrack suite (Message 7927)
Posted 5 Mar 2023 by marmot
Post:
I was trying to solve Hyperthreading issue by playing with P and C states on this new (to me) Broadwell server and 38 of these picked this time to come to that computer.

So they were stressed tested as I finally had to restart the machine when boinc.exe stopped uploading.

Its back in non-hyperthreaded mode (need to update the BIOS) and they are running again.

5 things:

1) they do not use the BOINC checkpointing system properly. Under properties, they show no checkpoint since initial start time.
--- not sure if they saved any work since the initial 18 hours of run time (tell me if there is a way to check and if you need a copy of something).
2) they refused to respond to boinc.exe giving them the shutdown command. They got left as orphan processes after boinc.exe and rosetta WU's left RAM.
-- had to use Process Hacker to suspend then manually terminate the processes.
3) they are hardened against such abuse. Instead of 'computation error' upon restart of BOINC after computer rebooted, they started computing again.
-- maybe this is because they are starting fresh and have no data set to check for computation errors?
4) They start at 0% and 0 time; so if you are using a proprietary checkpointing scheme, it didn't update the BOINC client's progress status.
5) The machine is capped at 2MB d/l rate and it was hitting that cap for 30 minutes without d/ling new WU, was that these WU's in communication with the server?

Maybe these are all things known to the devs, but if I'm going to run test bed WU's, I should give you my report.

Aborted 5 to see the consequences. They did not get resent.

Running the other 33 till the deadline or they actually complete (unless you all tell me to end their run prematurely). I'll put off maintenance on that machine till they resolve.
4) Message boards : CMS Application : New version 49.00 (Message 6359)
Posted 8 May 2019 by marmot
Post:
Why a forwarding rule?

Because I wanted to set rules for the specific machine attempting to run the CMS WU and leave the rest of the network rules alone.

some issue had me on the phone with the ISP and they refused to offer any assistance

Most of them don't know what you (or LHC@home) really need and they will be afraid of being made responsible if your firewall is open for malware.

I've spent thousands of hours on tech support calls; I'm well aware of their motivations. I needed them to assign me a new IP and they kept refusing so, in that instance, I played along.

I'm wondering if I simply misunderstand what you mean or if you don't understand how a firewall works and what the portlists describe.

Next I can try putting it in a demilitarized zone outside the router's firewall

Bad idea.
You'd better try to understand how to configure your firewall.

It's not a great risk to put a machine in the demilitarized zone. Worst case, restore from backups. It's just a BOINC machine; not a machine w/ personal information.
Been a computer support technician since 1993... So please don't assume everyone that asks for help is a beginner.
Not all of us are so prideful that we won't ask for help when we need it. Besides, I made it clear that the issue seemed to not be with my configurations in my first post and that has proven to be the case.

Anyway, it was nice that you tried to help, but this was all a waste of diagnostic time.

This machine, with it's firewall back on, the port forwarding rules removed, is running a single core CMS job without issues. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2774249
Another machine in my network, which always had it's firewall on, and no port forwarding rules set, ran 1 successful and 1 failed single core CMS. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2773064

Indeed a very poor download rate.

Yes, the issue is responsiveness at the server, or bandwidth along the network path beyond my local ISP, causing the client to time-out when attempting 4 core WU's. My local bandwidth is neither overwhelmed nor too slow as it had a full 19mbit/s available when several of the CMS jobs failed that all showed maximum 5kbit/sec transfer rates (the CMS VM never demanded more than 0.0025% of my available bandwidth).
The issue has nothing to do with my local configurations.
5) Message boards : Theory Application : Native Theory Application in Production (Message 6343)
Posted 3 May 2019 by marmot
Post:

There must be many (potential) volunteers who want or need to shut their computers down overnght, over the weekend or whatever without a lot of manual attention..

Starting June 1 to September 30, the electric here will have peak hour rates, 9x greater than off-peak from 10am-7pm so all the BOINCMgr will be set to suspend all work during those hours. The electric company is pushing to have those new meters be default installs in all homes.
6) Message boards : CMS Application : New version 49.00 (Message 6342)
Posted 3 May 2019 by marmot
Post:


I would argue that cutting-edge science research is not necessarily easy but agree with the observation that the easier it is the more volunteers we will get.


The new direction LHC is taking, and the changes they make to how BOINC is used, may spread to many other projects. It's important work that may be widely imitated; so I am just arguing for ease-of-use to be a top priority. Maybe each project handing out a pre-setup VM, containing a BOINC installation attached to their project, will become common. Turn-key operation is, by definition, supposed to be very user-friendly. Users shouldn't have to be required to create port-forwarding rules in their, possibly, ISP rented router...

I just signed up for my 49th project and have reached 100+ hours in 160+ WU's and keep spreadsheet data of performance, plus setup problems (doh, bet the port rules are in the spreadsheet from 2016...), for most of those WU's. Honestly, LHC@home WU's were some of the hardest to get functioning optimally.
If you want other opinions from BOINCers that have way more experience than I (240x 100+ hour WU's, 80+ projects), ask over at WUProps forums http://wuprop.boinc-af.org/forum_index.php about how LHC@home compares to other projects in ease of use.

Have a good weekend.
7) Message boards : CMS Application : New version 49.00 (Message 6341)
Posted 3 May 2019 by marmot
Post:
Did you check your firewall?
CMS needs additional ports.
See:
http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use


The machine's Windows firewall is off.
The router's firewall was blocking incoming traffic on 8080 and 3128 (is CMS using an ATLAS port?).

I setup a port forwarding rule to the computer in question with port list:
UDP, TCP: 3125, 8080, 23128, 3128, 5222, 9094, 9618, 4080, 1094, 8443, 9133, 9135, 9148, 9149, 9166, 9196, 9199 (per http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use)

The router open-sessions log now shows several connections from the machine's IP to:
137.138.156.85:9618 (vocms0840.cern.ch)
128.142.142.167:9618 (vccondor01.cern.ch)
128.142.168.202:3125 (vocms0322.cern.ch)
131.225.205.134:3125 (cmssrv245.fnal.gov)
but then the sessions all close during the benchmark phase.
After HTCONDOR Ping message appears a couple sessions return connecting to 137.138.156.85:9618 TCP, but network traffic from the VM is still glacial

After the new port forwarding rule and router reset, the last 3 CMS 4-core still failed at the 786-788 second mark:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772898
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772841
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772298

Is this IP part of the LHC network that is still being blocked after adding the port forwarding rule (if so then my new rule is corrupted/not being followed)?
Blocked incoming TCP connection request from 5.188.206.251:8080 to xxx.xxx.xxx.xxx:5022

What changed since my computers were running LHC@Home is some issue had me on the phone with the ISP and they refused to offer any assistance till I reset the router to factory defaults.
Saved the router config, but only a backup from 3+ years ago would restore. Lost (and forgot all about) the entries for LHC.

The WU all fail with precise timing at 787-/+1 sec.
The machine is on too-many-error lockout till tomorrow.
Next I can try putting it in a demilitarized zone outside the router's firewall with only it's built-in firewall or just try single cores.
8) Message boards : Theory Application : Native Theory Application in Production (Message 6330)
Posted 3 May 2019 by marmot
Post:
I know that with a bit of babysitting and some clever control scripts this can be overcome to some extent, but we're getting a long way from the original ideas of BOINC.


Glad to see I'm not alone in that thought.

Here's my comments in another thread: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=465&postid=6324#6324
9) Message boards : CMS Application : New version 49.00 (Message 6329)
Posted 3 May 2019 by marmot
Post:
Did you check your firewall?
CMS needs additional ports.
See:
http://lhcathome.web.cern.ch/test4theory/my-firewall-complaining-which-ports-does-project-use


That's possible given that Windows firewall came up and asked for permissions about 10:45UTC, which I allowed on private networks (inside the home to my router). The previous failed 50 WU's were while I was asleep and not around to answer firewall notifications.

But, It's failed 2 more CMS since I accepted the Firewall permissions.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772620
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772865

For diagnostics, shutting the machine's local firewall off for the few hours and see if any of the 4 core CMS in the queue survive.
If this fixes it then I'll comb through the rules looking for the blocking entry.

Beside that:
You mentioned a couple of errors before the successful HTCondor ping.
All of them can be ignored
.
Good to know.

Suggestion:
If you suspect your internet connection could be too slow, you may try to get at first a singlecore setup running.
Your recent 4-core-setup may try too many downloads concurrently.
(I personally don't think this is the reason)


95% of local bandwidth is available, but I'll switch to single cores as a test if the turned off firewall still has the 4 cores failing.
10) Message boards : CMS Application : New version 49.00 (Message 6324)
Posted 3 May 2019 by marmot
Post:
Indeed a very poor download rate.
What can be done?

1. LHC@home could mirror the required files on their own systems and distribute them via fast networks, e.g. own CVMFS or openhtc.io.


Good idea. Great catch on your part, especially if it solves this issue. You deserve appreciation and a name mention pushed down to all the BOINC Mgr clients attributing the solution to your ingenuity.

2. Volunteers could configure their local firewall to reject connections to slow mirror servers.
The local downloader would then immediately pic another mirror from the list.

3. Volunteers using a local proxy could do the following:
3.1 Reject connections to slow mirrors and to mirrors using HTTPS.
3.2 Pic 1 fast and reliable mirror from the list and rewrite all URLs to other mirrors to point to that fast mirror.


Not great ideas. We're unpaid volunteers with jobs and families. It's hard enough to get masses of people, volunteering to work on BOINC science projects, to write their own app_config.xml let alone manually adjust firewall settings or setup a local proxy servers.

The very first sentence at the BOINC homepage sets the guiding philosophy for all BOINC projects:

BOINC lets you help cutting-edge science research using your computer (Windows, Mac, Linux) or Android device. BOINC downloads scientific computing jobs to your computer and runs them invisibly in the background. It's easy and safe. - https://boinc.berkeley.edu/


it's EASY and safe.
Not, you'll need to setup your on squid proxy, manually adjust your firewall settings, build your own custom VM to run native WU's or leave behind the common OS installed on your Best Buy tablet/phone/laptop and install a special OS.

The project developers want large numbers of people to volunteer to work on their project; then make it easy to do so.
Don't push work onto the volunteers when a solution exists that a paid employee can easily solve.

Follow the guiding philosophy of BOINC: It's easy and safe to volunteer to do science.
11) Message boards : CMS Application : New version 49.00 (Message 6323)
Posted 3 May 2019 by marmot
Post:
Every task is ending in "1 (0x00000001) Unknown error code".

The machine is staying just under commits that would put it into the swap file.
Virtual Box v 5.1.28, Windows 8.1.
This machine just ran the Theory WU's successfully.
It's running 1x TACC boinc2dockers (successfully) while running 1x CMS jobs.
Running 7 other custom VM's.

Logs:
2019-05-03 03:15:47 (3848): Guest Log: [ERROR] Condor ended after 784 seconds.
2019-05-03 03:15:47 (3848): Guest Log: [INFO] Shutting Down.

--------------------------------------------------------------------------

Shutdown all VM's, killed VBox service and restarted VBox.
Started a fresh CMS VM and watched the startup.

Things that are out of ordinary as starting:

can't rename eth0 -> eth1. Resource maybe busy.

ip6tables: no config file Warning
ip4tables: no config file Warning

You may need to restart the Windows system or restart the guest system to enable Guest Additions. (machine was restarted earlier in the day during the lightning storm)
Starting vmcontext_hepix Warning
-
-
.
HT Condor Ping


Watching with Process Hacker: disk writes ~ 2-4kb/s, network transfers ~ 0.05 -> 4kb/s.
Certainly taking a very long time to get it's work. Start d/ls at 5:40am, ended by 6:00 am.

ERROR: Condor ended after 787 seconds.

------------------
So it appears my issue is similar to other people. The VM gives up after failing to get the data set before timing out due to slow connection to CONDOR.

This issue is at the server or the VM needs to be more lenient on the time outs before failing the job.
All the other work in the home is d/led and forget; no continuous connections to CONDOR or other outside servers. (Except maybe 1x boinc2docker, not sure).
Not even streaming a movie while CMS was running. Fast.com reports full 20MB/s connection available.

Last year, without a local proxy, this ISP allowed me to run 90 Theory at once all connected to CONDOR.

It's not like a lot of dev CMS are currently even running at this time.

Is there something wrong with the configuration of the guest VM's ethernet connection?
12) Message boards : Number crunching : issue of the day (Message 6317)
Posted 2 May 2019 by marmot
Post:
Did the 2nd one return after the due date but within a grace period?
Also, are my fine because they are using older version of VBox?

This result is fine when returning a completed on the 3rd attempt on being sent out:
--------------------------------------------------------------------------------------------------------------------






While this one is failed with too many errors upon receiving a completed, good result:
-------------------------------------------------------------------------------------------------------------------


13) Message boards : Theory Application : Windows Version (Message 6313)
Posted 30 Apr 2019 by marmot
Post:
Started my first Theory work here.
BOINC 7.14.1
Virtual Box 5.1.26 r117224

Preferences were set to 4 cores and the 1st WU completed in 2000 seconds CPU time ~ 2300 s.

2nd work unit https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772209 ran while I slept and went for 9+ hours.
The console had a request to make a bug report to GNU. I suspended it and then created an app_config.xml to run Theory at 1 core w/ default RAM 1030mb to avoid so much idle CPU time (guess that's not changed on this test version).

The broken WU with the request for bug report to GNU, what next for it?



©2024 CERN