Message boards : Theory Application : New Native App - Linux Only


computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 478
Credit: 394,720
RAC: 261
Message 6164 - Posted: 7 Mar 2019, 14:33:18 UTC

One reason to suspend/stop a task (especially a long-runner) is to prepare for a client shutdown, e.g. to run system updates.
On LHC production David Cameron explained that, regardless of whether ATLAS native respects the STOP/CONT signals, a task will always start from scratch after a client restart or a reboot.

How would Theory native proceed in such a case?
Would it also start from scratch, or would it continue the task from the point where it stopped?

Well, on Windows you would snapshot the VirtualBox VM.
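For background, BOINC's "suspend" for a native Linux app comes down to SIGSTOP/SIGCONT: the kernel freezes the process without losing its in-memory state. A minimal sketch with a plain `sleep` standing in for the job (this only illustrates the signal behaviour; whether a task survives a full client restart is exactly the open question here):

```shell
# Start a long-running child as a stand-in for a Theory job.
sleep 300 &
pid=$!

# "Suspend": SIGSTOP freezes the process; it uses no CPU,
# but all of its in-memory state is kept.
kill -STOP "$pid"
sleep 1
stopped=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after STOP: $stopped"    # 'T' = stopped

# "Resume": SIGCONT lets it continue exactly where it left off.
kill -CONT "$pid"
sleep 1
resumed=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "after CONT: $resumed"    # back to 'S' (sleeping)

kill "$pid"
```

A client restart is a different story: the frozen state dies with the process, so without an application-level checkpoint the task starts over.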
ID: 6164
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6196 - Posted: 12 Mar 2019, 9:35:44 UTC
Last modified: 12 Mar 2019, 9:38:46 UTC

This task crashed a few minutes ago:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759703
Three volunteers had this problem:
max # of error/total/success tasks: 3, 3, 1
Error: Too many total results
ID: 6196
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6213 - Posted: 13 Mar 2019, 6:56:55 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1882669

===> [runRivet] Wed Mar 13 03:14:46 UTC 2019 [boinc pp jets 8000 25 - pythia6 6.428 390 100000 30]

Setting environment...
grep: /etc/redhat-release: No such file or directory
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c
g++ = /cvmfs/sft.cern.ch/lcg/external/gcc/4.8.4/x86_64-slc6/bin/g++
g++ version = 4.8.4
RIVET=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c/rivet/2.6.1/x86_64-slc6-gcc48-opt
Rivet version = rivet v2.6.1
RIVET_REF_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c/rivet/2.6.1/x86_64-slc6-gcc48-opt/share/Rivet
RIVET_ANALYSIS_PATH=/shared/analyses
GSL=/cvmfs/sft.cern.ch/lcg/external/GSL/1.10/x86_64-slc6-gcc48-opt
HEPMC=/cvmfs/sft.cern.ch/lcg/external/HepMC/2.06.08/x86_64-slc6-gcc48-opt
FASTJET=/cvmfs/sft.cern.ch/lcg/external/fastjet/3.0.3/x86_64-slc6-gcc48-opt
PYTHON=/cvmfs/sft.cern.ch/lcg/external/Python/2.7.4/x86_64-slc6-gcc48-opt
ROOTSYS=/cvmfs/sft.cern.ch/lcg/app/releases/ROOT/5.34.26/x86_64-slc6-gcc48-opt/root

Input parameters:
mode=boinc
beam=pp
process=jets
energy=8000
params=25
specific=-
generator=pythia6
version=6.428
tune=390
nevts=100000
seed=30

The job was unsuccessful for all three volunteers:

<core_client_version>7.5.1</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
04:14:25 (22719): wrapper (7.15.26016): starting
04:14:25 (22719): wrapper (7.15.26016): starting
04:14:25 (22719): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.28 ()
03:14:25 2019-03-13: cranky-0.0.28: [INFO] Detected Theory App
03:14:25 2019-03-13: cranky-0.0.28: [INFO] Checking CVMFS.
03:14:42 2019-03-13: cranky-0.0.28: [INFO] Checking runc.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Creating the filesystem.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Updating config.json.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Running Container 'runc'.
05:24:33 2019-03-13: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 1.
06:24:33 (22719): cranky exited; CPU time 6570.264835
06:24:33 (22719): app exit status: 0xce
06:24:33 (22719): called boinc_finish(195)

</stderr_txt>
]]>
ID: 6213
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 6214 - Posted: 13 Mar 2019, 7:04:37 UTC - in response to Message 6213.  

was unsuccessful for all three Volunteers
The same happened with this workunit: https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1882576

3 times the same error: Exit status 195 (0x000000C3) EXIT_CHILD_FAILED

I do not know the job description.
ID: 6214
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6215 - Posted: 13 Mar 2019, 7:18:07 UTC - in response to Message 6214.  

Thanks Crystal,
two are more than one ;-)
It could be this Pythia 6.
ID: 6215
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 6216 - Posted: 13 Mar 2019, 8:52:18 UTC - in response to Message 6215.  

I see that we get a few jobs like this per day. I have found one and will run it myself to see what is happening.
ID: 6216
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 6217 - Posted: 13 Mar 2019, 9:59:16 UTC - in response to Message 6216.  

Yes, I see that the job failed.
100000 events processed
dumping histograms...
Rivet.Analysis.Handler: INFO  Finalising analyses
terminate called after throwing an instance of 'YODA::LowStatsError'
  what():  Requested variance of a distribution with only one effective entry
./runRivet.sh: line 376:   267 Aborted                 (core dumped) $rivetExecString  (wd: /shared/tmp/tmp.8MuWyGp9sd)

Processing histograms...
input  = /shared/tmp/tmp.8MuWyGp9sd/flat
output = /shared
./runRivet.sh: line 850:   268 Killed                  display_service $tmpd_dump "$beam $process $energy $params $generator $version $tune"  (wd: /shared)
ERROR: following histograms should be produced according to run parameters,
       but missing from Rivet output:


It then gets resubmitted, as it may be a host issue. We will see if we can stop these jobs from either being sent or being resubmitted.
ID: 6217
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6224 - Posted: 13 Mar 2019, 13:51:01 UTC

This computer shows MCPlots tasks for today, but the last one finished 36 hours ago, on March 11:
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3723
2019-03-10 33 0 33
2019-03-11 29 0 29
2019-03-12 7 0 7
2019-03-13 6 0 6
ID: 6224
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6227 - Posted: 15 Mar 2019, 10:19:40 UTC

SL76 with BOINC 7.5.1:
After a clean start of Linux, the -dev tasks (max. 6) are downloaded ONE per minute, not all at once!
This is important for the transfer to production, once it runs on other Linux hosts as well. The server rewards it with better performance.

CVMFS config with openhtc.io:
openhtc.io shows an entry 206.167.181.94:3128 - is this OK?
ID: 6227
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 478
Credit: 394,720
RAC: 261
Message 6228 - Posted: 15 Mar 2019, 11:13:02 UTC - in response to Message 6227.  

SL76 with BOINC 7.5.1:
After a clean start of Linux, the -dev tasks (max. 6) are downloaded ONE per minute, not all at once!
This is important for the transfer to production, once it runs on other Linux hosts as well. The server rewards it with better performance.

CVMFS config with openhtc.io:
openhtc.io shows an entry 206.167.181.94:3128 - is this OK?

Seems to be weird.

This is not Cloudflare.

dig -x 206.167.181.94

; <<>> DiG 9.9.9-P1 <<>> -x 206.167.181.94
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10471
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;94.181.167.206.in-addr.arpa.   IN      PTR

;; ANSWER SECTION:
94.181.167.206.in-addr.arpa. 3600 IN    PTR     206-167-181-94.cloud.computecanada.ca.

;; AUTHORITY SECTION:
181.167.206.in-addr.arpa. 86400 IN      NS      ns1.zonerisq.ca.
181.167.206.in-addr.arpa. 86400 IN      NS      ns2.zonerisq.ca.

;; ADDITIONAL SECTION:
ns1.zonerisq.ca.        86400   IN      A       162.219.54.2
ns2.zonerisq.ca.        86400   IN      A       162.219.55.2
ns1.zonerisq.ca.        86400   IN      AAAA    2620:10a:80eb::2
ns2.zonerisq.ca.        86400   IN      AAAA    2620:10a:80ec::2

;; Query time: 590 msec
;; SERVER: 192.168.0.12#53(192.168.0.12)
;; WHEN: Fri Mar 15 12:01:07 CET 2019
;; MSG SIZE  rcvd: 240




Port 3128 is a typical Squid port.
openhtc.io uses port 80 instead.

Did you configure this proxy in your local CVMFS configuration?
If not, did you configure automatic proxy detection?

You may try a "cvmfs_config showconfig -s" to see whether one of the repositories uses a special setup.
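For comparison: a CVMFS setup pointed at openhtc.io should show only port-80 URLs and no proxy on 3128. A minimal example /etc/cvmfs/default.local - the repository and stratum-1 list below is an illustrative assumption, not taken from this thread:

```shell
# /etc/cvmfs/default.local -- example only; adjust repositories and quota.
CVMFS_REPOSITORIES="sft.cern.ch,cernvm-prod.cern.ch"
# openhtc.io stratum-1 aliases (port 80, served via Cloudflare):
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@"
# No local squid on port 3128 -- go direct:
CVMFS_HTTP_PROXY="DIRECT"
```

Running "cvmfs_config showconfig -s <repo>" afterwards shows where each effective value comes from, so a stray 3128 proxy entry would be traceable to its config file.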
ID: 6228
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6241 - Posted: 24 Mar 2019, 9:10:27 UTC - in response to Message 6073.  

Have 7 tasks in parallel, but slot numbers are shown up to 21!
Maybe they are not deleted after finishing?


Yes, there was an issue with a previous image where the slot directories were not being cleaned. Let me know if that is still the case.

For two days I have had only native Theory running on one SL76 host.
Therefore only 6 tasks are possible, yet the slot numbers have grown to 14 so far.
I am watching to find the point at which a new slot is created.
Up to now I have found no event-log entry in BOINC matching the creation of a new slot number.
It can be spotted when a new date/time shows up in the Linux file explorer.
Could it happen when one task finishes while another is downloading and starts running at exactly the same moment?
ID: 6241
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 6242 - Posted: 24 Mar 2019, 16:06:57 UTC - in response to Message 6241.  

Can it be, when one task finished and parallel one is downloading and begin exact at the same time the running?
I run only native Theory, and directly after a job has finished the same slot is reused by the next task in the queue.
Are the unused slots empty, and are empty slots removed after a BOINC restart?
ID: 6242
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6243 - Posted: 24 Mar 2019, 17:25:14 UTC - in response to Message 6242.  
Last modified: 24 Mar 2019, 17:45:51 UTC

At the moment 18 and 19 are active as new slot numbers, plus 0 through 3 (because of 6 tasks).
The gap between one new slot number and the previous one is sometimes one hour, sometimes 2 or 3 hours, judging by the timestamps in the file explorer.
BOINC has been running continuously for two days, so no exit; but I think all slots were cleared after a restart.
I will test that when it becomes important.
It must be a special constellation that makes BOINC create a new slot number.
I have BOINC 7.5.1 - is this a problem?
Laurence knows about this new slot creation; see the quoted answer in my message.
BTW, WCG has no problems with the same BOINC 7.5.1 on another SL76 host, where a Sherpa has now been running for 126 hours. OMG.
This is the computer I am watching:
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3723
The old slot numbers are empty - sorry, I forgot to answer that.
Crystal,
no more time this evening. Ned-D ;-)
ID: 6243
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 6244 - Posted: 24 Mar 2019, 18:51:53 UTC - in response to Message 6243.  
Last modified: 24 Mar 2019, 18:52:29 UTC

Have Boinc 7.5.1. Is this a problem?
That could be the problem.
7.5.1 is not even an official Berkeley version. I have BOINC 7.12.0 running.
Crystal,
this evening have no more time. Ned-D ;-)
I think I'll be watching 'Tatort' from Munich ;) Depends on the plot.
ID: 6244
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6245 - Posted: 25 Mar 2019, 7:36:06 UTC - in response to Message 6244.  

Have Boinc 7.5.1. Is this a problem?
Could be the problem.
7.5.1 is not even an official Berkeley version. I've BOINC 7.12.0 running.

This version is from the native ATLAS installation guide.
SL has no BOINC package of its own.
When you start this version, an event message says it is a development version.
I don't know which BOINC version is best to run on Scientific Linux.
Up to now there have been no problems with BOINC (SL69 or SL76).
ID: 6245
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6246 - Posted: 25 Mar 2019, 10:09:34 UTC - in response to Message 6245.  

WCG shows in its download section a Fedora note for the EPEL release.
Is it an option for SL69 or SL76 to install this BOINC version from WCG?
ID: 6246
maeax

Joined: 22 Apr 16
Posts: 672
Credit: 1,901,223
RAC: 5,079
Message 6275 - Posted: 28 Mar 2019, 8:47:54 UTC

SL76 with BOINC 7.5.1, 6 Theory tasks possible:
The scratch area shows folders, e.g. 0a, 0b, 0c..., with timestamps from days ago.
Is it possible that they were not cleared after the Theory tasks finished?
BOINC needs 13 MByte more disk space; because of the space limit, only one Sherpa of the 6 possible tasks is running.
I can clear this manually, or exit BOINC after the last Sherpa finishes.
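On clearing manually: with the client stopped, leftover slot directories that are completely empty can be removed safely. A sketch on a mock slots directory (the real path depends on the installation, e.g. /var/lib/boinc-client/slots; the file marking the busy slot is just an example):

```shell
# Mock a BOINC slots directory; on a real host, stop the client first
# and work in e.g. /var/lib/boinc-client/slots instead.
slots=$(mktemp -d)
mkdir "$slots/0" "$slots/1" "$slots/14"
touch "$slots/0/boinc_task_state.xml"   # slot 0 is still in use

# On a real host, stale slots stand out by their old timestamps:
ls -ltr "$slots"

# Remove only directories that are completely empty:
find "$slots" -mindepth 1 -maxdepth 1 -type d -empty -print -delete
```

Non-empty slot directories are left alone, so a running task cannot be damaged by this.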
ID: 6275


©2024 CERN