Message boards : Theory Application : New Native App - Linux Only
Joined: 28 Jul 16 · Posts: 519 · Credit: 400,710 · RAC: 6
One reason to suspend/stop a task (especially a long runner) is to prepare a client shutdown, e.g. to run system updates. At LHC production David Cameron explained that, regardless of whether ATLAS native respects the STOP/CONT signals, a task would always start from scratch after a client restart or a reboot. How would Theory native behave in such a case? Would it also start from scratch, or would it continue the task from the point where it stopped? Well, on Windows you would snapshot the VirtualBox VM.
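For context, the STOP/CONT behaviour discussed above can be sketched with plain POSIX signals. This is only an illustration of why suspend/resume keeps in-memory state while a restart does not; it is not the actual BOINC wrapper code:

```shell
#!/bin/sh
# Illustration only: suspending and resuming a process with SIGSTOP/SIGCONT.
# A process resumed with CONT continues exactly where it stopped; that state
# survives only as long as the process itself (or, for vbox apps, a VM
# snapshot) exists. A client restart kills the process, hence "from scratch".
sleep 300 &                  # stand-in for a running Theory job
pid=$!

kill -STOP "$pid"            # what a "suspend" would send
sleep 1
state_stopped=$(awk '/^State:/ {print $2}' "/proc/$pid/status")

kill -CONT "$pid"            # what a "resume" would send
sleep 1
state_resumed=$(awk '/^State:/ {print $2}' "/proc/$pid/status")

kill "$pid" 2>/dev/null      # clean up the demo process
echo "stopped=$state_stopped resumed=$state_resumed"
```

After STOP the kernel reports state `T` (stopped); after CONT the sleeping process is back in state `S`. Unless the application checkpoints to disk, nothing of this survives a reboot.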
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
This task crashed a few minutes ago: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759703

Three volunteers had this problem:
max # of error/total/success tasks: 3, 3, 1
Error: too many total results
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1882669

===> [runRivet] Wed Mar 13 03:14:46 UTC 2019 [boinc pp jets 8000 25 - pythia6 6.428 390 100000 30]
Setting environment...
grep: /etc/redhat-release: No such file or directory
MCGENERATORS=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c
g++ = /cvmfs/sft.cern.ch/lcg/external/gcc/4.8.4/x86_64-slc6/bin/g++
g++ version = 4.8.4
RIVET=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c/rivet/2.6.1/x86_64-slc6-gcc48-opt
Rivet version = rivet v2.6.1
RIVET_REF_PATH=/cvmfs/sft.cern.ch/lcg/external/MCGenerators_lcgcmt67c/rivet/2.6.1/x86_64-slc6-gcc48-opt/share/Rivet
RIVET_ANALYSIS_PATH=/shared/analyses
GSL=/cvmfs/sft.cern.ch/lcg/external/GSL/1.10/x86_64-slc6-gcc48-opt
HEPMC=/cvmfs/sft.cern.ch/lcg/external/HepMC/2.06.08/x86_64-slc6-gcc48-opt
FASTJET=/cvmfs/sft.cern.ch/lcg/external/fastjet/3.0.3/x86_64-slc6-gcc48-opt
PYTHON=/cvmfs/sft.cern.ch/lcg/external/Python/2.7.4/x86_64-slc6-gcc48-opt
ROOTSYS=/cvmfs/sft.cern.ch/lcg/app/releases/ROOT/5.34.26/x86_64-slc6-gcc48-opt/root
Input parameters:
mode=boinc beam=pp process=jets energy=8000 params=25 specific=- generator=pythia6 version=6.428 tune=390 nevts=100000 seed=30

It was unsuccessful for all three volunteers:

<core_client_version>7.5.1</core_client_version>
<![CDATA[
<message>
process exited with code 195 (0xc3, -61)
</message>
<stderr_txt>
04:14:25 (22719): wrapper (7.15.26016): starting
04:14:25 (22719): wrapper (7.15.26016): starting
04:14:25 (22719): wrapper: running ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.0.28 ()
03:14:25 2019-03-13: cranky-0.0.28: [INFO] Detected Theory App
03:14:25 2019-03-13: cranky-0.0.28: [INFO] Checking CVMFS.
03:14:42 2019-03-13: cranky-0.0.28: [INFO] Checking runc.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Creating the filesystem.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Using /cvmfs/cernvm-prod.cern.ch/cvm3
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Updating config.json.
03:14:43 2019-03-13: cranky-0.0.28: [INFO] Running Container 'runc'.
05:24:33 2019-03-13: cranky-0.0.28: [ERROR] Container 'runc' terminated with status code 1.
06:24:33 (22719): cranky exited; CPU time 6570.264835
06:24:33 (22719): app exit status: 0xce
06:24:33 (22719): called boinc_finish(195)
</stderr_txt>
]]>
Joined: 13 Feb 15 · Posts: 1223 · Credit: 937,009 · RAC: 1,066
"was unsuccessful for all three Volunteers"

The same happened with this workunit: https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1882576
Three times the same error: Exit status 195 (0x000000C3) EXIT_CHILD_FAILED
I do not know the job description.
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
Thanks Crystal, two are more than one ;-) It could be this Pythia 6.
Joined: 12 Sep 14 · Posts: 1129 · Credit: 339,230 · RAC: 2
I see that we get a few jobs like this per day. I have found one and will run it myself to see what is happening.
Joined: 12 Sep 14 · Posts: 1129 · Credit: 339,230 · RAC: 2
Yes, I see that the job failed.

100000 events processed
dumping histograms...
Rivet.Analysis.Handler: INFO  Finalising analyses
terminate called after throwing an instance of 'YODA::LowStatsError'
  what():  Requested variance of a distribution with only one effective entry
./runRivet.sh: line 376:   267 Aborted                 (core dumped) $rivetExecString  (wd: /shared/tmp/tmp.8MuWyGp9sd)
Processing histograms...
input  = /shared/tmp/tmp.8MuWyGp9sd/flat
output = /shared
./runRivet.sh: line 850:   268 Killed                  display_service $tmpd_dump "$beam $process $energy $params $generator $version $tune"  (wd: /shared)
ERROR: following histograms should be produced according to run parameters, but missing from Rivet output:

It then gets resubmitted, as it may be a host issue. We will see if we can stop these jobs from either being sent or being resubmitted.
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
This computer shows MCPlot tasks for today, but the last one finished 36 hours ago, on March 11: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3723

2019-03-10: 33 0 33
2019-03-11: 29 0 29
2019-03-12: 7 0 7
2019-03-13: 6 0 6
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
SL76 with BOINC 7.5.1, clean start of Linux: the -dev tasks (max. 6 tasks) download ONE per minute, not all at once! This info is important for the transfer to production, when it runs on other Linux hosts as well. The server rewards it with better performance.

CVMFS configured with openhtc.io: openhtc.io shows an entry 206.167.181.94:3128. Is this OK?
Joined: 28 Jul 16 · Posts: 519 · Credit: 400,710 · RAC: 6
"SL76 with Boinc 7.5.1: [...] openhtc.io shows an entry 206.167.181.94:3128. Is this OK?"

That seems weird. This is not Cloudflare:

dig -x 206.167.181.94

; <<>> DiG 9.9.9-P1 <<>> -x 206.167.181.94
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10471
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 2, ADDITIONAL: 5

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;94.181.167.206.in-addr.arpa.   IN      PTR

;; ANSWER SECTION:
94.181.167.206.in-addr.arpa. 3600 IN    PTR     206-167-181-94.cloud.computecanada.ca.

;; AUTHORITY SECTION:
181.167.206.in-addr.arpa. 86400 IN      NS      ns1.zonerisq.ca.
181.167.206.in-addr.arpa. 86400 IN      NS      ns2.zonerisq.ca.

;; ADDITIONAL SECTION:
ns1.zonerisq.ca.        86400   IN      A       162.219.54.2
ns2.zonerisq.ca.        86400   IN      A       162.219.55.2
ns1.zonerisq.ca.        86400   IN      AAAA    2620:10a:80eb::2
ns2.zonerisq.ca.        86400   IN      AAAA    2620:10a:80ec::2

;; Query time: 590 msec
;; SERVER: 192.168.0.12#53(192.168.0.12)
;; WHEN: Fri Mar 15 12:01:07 CET 2019
;; MSG SIZE  rcvd: 240

Port 3128 is a typical squid port; openhtc.io uses port 80 instead. Did you configure this proxy in your local CVMFS configuration? If not, did you configure automatic proxy detection? You may try "cvmfs_config showconfig -s" to see if one of the repositories uses a special setup.
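For comparison, a local CVMFS setup that really goes through openhtc.io would look roughly like the fragment below. This is only a sketch: the file path, the stratum-1 host names, and the proxy value are examples of the kind of entries such a setup contains, not a verified configuration for this project.

```
# /etc/cvmfs/domain.d/cern.ch.local  (illustrative sketch, values are examples)
CVMFS_SERVER_URL="http://s1cern-cvmfs.openhtc.io/cvmfs/@fqrn@;http://s1fnal-cvmfs.openhtc.io/cvmfs/@fqrn@"

# openhtc.io is reached on port 80; a proxy on port 3128 only appears here
# if you (or an auto-detection mechanism like WPAD) configured a local squid:
CVMFS_HTTP_PROXY="DIRECT"
```

Checking the effective values with "cvmfs_config showconfig -s <repository>" shows whether a repository picked up a proxy entry you did not set yourself.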
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
I have 7 tasks in parallel, but slot numbers are shown up to 21! For the last 2 days only native Theory has been running on one SL76 machine, so only 6 tasks are possible, yet the slot numbers have grown to 14 at the moment. I am watching to find the point at which a new slot is created. So far I have found no info in the BOINC event log that matches the creation of a new slot number; it can be tracked by the new date/time shown in the Linux file manager. Could it happen when one task finishes while, in parallel, another one is downloading and starts running at exactly the same time?
Joined: 13 Feb 15 · Posts: 1223 · Credit: 937,009 · RAC: 1,066
"Could it happen when one task finishes while, in parallel, another one is downloading and starts running at exactly the same time?"

I run only native Theory, and directly after a job has finished the same slot is used by the next task in the queue. Are the unused slots empty, and are empty slots removed after a BOINC restart?
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
At the moment 18 and 19 are active as new slot numbers, besides 0 through 3 (because of 6 tasks). The gap between a new slot number and the previous one is sometimes one hour, sometimes 2-3 hours, judging by the timestamps in the file manager. BOINC has been running continuously for two days, so no exit; but I think all slots were cleared after a restart. I will test this when it becomes important. It must be a special constellation that makes BOINC create a new slot number. I have BOINC 7.5.1; is this a problem? Laurence knows about this new slot creation; see the copied answer in the message. BTW, WCG has no problems with the same BOINC 7.5.1 on another SL76 machine, where a Sherpa task has now been running for 126 hours. OMG. This is the computer I am watching: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3723 The old slot numbers are empty; sorry, I had forgotten to answer that. Crystal, this evening I have no more time. Ned-D ;-)
Joined: 13 Feb 15 · Posts: 1223 · Credit: 937,009 · RAC: 1,066
"I have BOINC 7.5.1; is this a problem?"

Could be the problem. 7.5.1 is not even an official Berkeley version. I have BOINC 7.12.0 running.

"Crystal, this evening I have no more time."

I think I'll be watching 'Tatort' from Munich ;) Depends on the plot.
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
"Could be the problem."

This version is from the native ATLAS installation guide. SL has no software package of its own for BOINC. When you start this version, there is an event-log message that this is a development version. I don't know which BOINC version is best to run on Scientific Linux. Up to now there have been no problems with BOINC (SL69 or SL76).
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
WCG shows in its download section a Fedora note pointing to the EPEL release. Is it an option for SL69 or SL76 to install this BOINC version from EPEL?
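If EPEL is usable on your SL release, installing the client would look roughly like this. An untested sketch: `epel-release` and `boinc-client` are the usual package names in EPEL, but the available BOINC version and the service name may differ on SL6 vs. SL7.

```
# Untested sketch: installing the EPEL BOINC client on Scientific Linux.
sudo yum install epel-release      # enables the EPEL repository
sudo yum install boinc-client      # the BOINC client packaged in EPEL

# SL7 (systemd):
sudo systemctl enable --now boinc-client
# SL6 (SysV init) instead:
#   sudo service boinc-client start
```

Since EPEL tracks an officially released BOINC, this would also replace the development version 7.5.1 discussed above.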
Joined: 22 Apr 16 · Posts: 731 · Credit: 2,205,280 · RAC: 1,580
SL76 with BOINC 7.5.1, 6 Theory tasks possible: the scratch area shows folders, for example 0a, 0b, 0c..., with timestamps from days ago. Is it possible that they were not cleared after the Theory tasks finished? BOINC needs 13 MByte more disk space. I am running only one Sherpa of the 6 possible tasks because of the space limit. I can clear this manually, or exit BOINC after the last Sherpa finishes.
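One way to clear such leftovers by hand is to delete only the directories that are completely empty, with the client stopped. The sketch below runs on a throwaway directory so nothing real is touched; the real slot directories live under the BOINC data directory (a path that varies by installation), which is an assumption you must adapt:

```shell
#!/bin/sh
# Demo on a temporary tree; point the variable at your real
# <boinc_data_dir>/slots (with the client stopped!) to use it for real.
demo=$(mktemp -d)
BOINC_SLOTS="$demo/slots"

mkdir -p "$BOINC_SLOTS/0" "$BOINC_SLOTS/14"
touch "$BOINC_SLOTS/0/init_data.xml"   # slot 0: still in use by a task
# slot 14: empty leftover from a finished task

# Delete only slot directories that are completely empty,
# leaving anything that still holds files untouched:
find "$BOINC_SLOTS" -mindepth 1 -maxdepth 1 -type d -empty -delete

ls "$BOINC_SLOTS"
```

After the `find`, only slot `0` remains, because `-empty -delete` never touches a directory that still contains files. Exiting the client cleanly has a similar effect for finished tasks, as the client normally cleans up its own slots.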
©2025 CERN