issue of the day
Author | Message |
---|---|
Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
The machine which got stuck activating the fuse module is still stuck at the same place, but a second machine which was running at the time has picked up the new directory structure without any problems. I'll restart the stuck one. |
Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
Ivan, are any of your work units for your presentation running yet? |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Are we out of jobs again? I just added a machine and it looks as if it cannot get any jobs. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,168,972 RAC: 1,763 |
[quote]Ivan, are any of your work units for your presentation running yet?[/quote] No, I'm getting an error message in CRAB3 submission that it can't gsissh to the RAL Condor machine -- which is a puzzle as I can log in manually. RAL tells me that nothing has changed at the machine, so I'm at a loss to explain it. I edited out the section that sends to RAL and two jobs ran successfully at Vanderbilt. :-( We may run out of re-submitted jobs over the weekend, to which I can only say, "OK, relax and run one of your other projects until I can get a handle on this." |
Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
The machine which was stuck just now is active again (did a VM reset from Oracle, left BOINC running throughout) and seems to be busy (%CPU 86) with "Beginning CMSSW wrapper script". |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
I've a host stuck at "activating fuse module". Tried rebooting the VM, aborting the task and rebooting the host... still sticks at the same place. Short of resetting the project and starting afresh I can't think of anything else to do. Any ideas? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,168,972 RAC: 1,763 |
[quote]I've a host stuck at "activating fuse module". Tried rebooting the VM, aborting ...[/quote] IIRC, that's when it's trying to restart the cvmfs distributed file system. You don't have any firewall or network issues, do you? |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
[quote]I've a host stuck at "activating fuse module". Tried rebooting the VM, aborting ...[/quote] Not that I know of. The firewall is common to the whole LAN and I've just started another host with no problems. There may have been network problems overnight when this host first started; a different host failed because it could not download a file from RAL, but that was OK this morning. The screen is as shown here, so it's the second time the fuse module is activated. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Looks like each subdirectory: It only generates 3 of these subfolders and then starts a new "run-x+1" folder: /logs/run-4/glide_OmvP3r/dir_4810/, /logs/run-4/glide_OmvP3r/dir_8936/, /logs/run-4/glide_OmvP3r/dir_32714/ |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
During that time the results are also staged out, and this is a 15 MB file. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 36 |
Rasputin42 wrote:
This greatly depends on the speed of your machine's CPU and the % of CPU you have allowed BOINC to use. The kind of events (records) you get in the job has a minor influence. There is a large deviation in the CPU time used per event, varying from almost 0 (zero) seconds up to (on my i7 2600) 69 seconds for a single event. In run-1 my machine did 8 jobs in 5 hours and 37 minutes. |
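For anyone who wants to turn the run-1 figure above into a per-job average, here is a minimal Python sketch. The function name `mean_time_per_job` is made up for illustration. The result is only a mean over 8 jobs and says nothing about the spread between individual jobs or events.

```python
from datetime import timedelta


def mean_time_per_job(total: timedelta, jobs: int) -> timedelta:
    """Average wall-clock time per job (hypothetical helper, not part of BOINC or CMS)."""
    if jobs <= 0:
        raise ValueError("jobs must be positive")
    return total / jobs


# Figures quoted in the post: 8 jobs in 5 hours and 37 minutes.
run1_total = timedelta(hours=5, minutes=37)
print(mean_time_per_job(run1_total, 8))  # 0:42:07.500000
```

So on that host a job took a little over 42 minutes on average, at whatever CPU allowance BOINC had at the time.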
Joined: 10 May 15 Posts: 4 Credit: 39,333 RAC: 0 |
I am running CMS on my desktop OK, but on my Windows 7 laptop it does not run. It sits at 0.00% complete, and when I looked at VirtualBox it had this message: Runtime error opening 'C:\ProgramData\BOINC\slots\3\boinc_7b8c6c74b1591434\boinc_7b8c6c74b1591434.vbox' for reading: -103(Path not found.). D:\tinderbox\win-4.3\src\VBox\Main\src-server\MachineImpl.cpp[731] (long __cdecl Machine::registeredInit(void)). Result Code: E_FAIL (0x80004005) Component: Machine Interface: IMachine {480cf695-2d8d-4256-9c7c-cce4184fa048} I had a look and the file is not there: C:\ProgramData\BOINC\slots\3\boinc_7b8c6c74b1591434\boinc_7b8c6c74b1591434.vbox This is the host: http://boincai05.cern.ch/CMS-dev/results.php?hostid=641 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,168,972 RAC: 1,763 |
Rasputin42 wrote: OK, for your information, the current jobs are "minimum-bias" events -- essentially non-(scientifically)-interesting background events from "ordinary" proton-proton collisions. These are then thrown into simulations where we seed the events with "interesting" scenarios, say Higgs->gamma-gamma. The stuff I'm working on at the moment is the ~2025 scenario where we expect, on average, 140 or maybe even 200 such events each time a counter-rotating pair of proton bunches pass through each other (i.e. "collide") in the centre of CMS, with maybe one interesting event every 100 million, or billion, or whatever, collisions. So this sort of simulation is important, since we throw the background events into our simulations of interesting events to make sure we can still separate the wheat from the chaff. However, to the best of my knowledge, there is no restriction on the min-bias results -- if the random numbers fall in the right range at the right time, we may even throw a random H->gamma-gamma into the supposed background. So these events are of random complexity, which leads to the variability of simulation times per event. [Edit] Just to add, the events I'm hoping to use for the "challenge", once I've sorted out the problem that prevents my actually submitting new jobs, are so-called TTbar events, where the proton-proton collision creates a top- and anti-top quark pair (via "pair production", as in, e.g., electron-positron production from energetic gamma rays). These top quarks can then decay in a number of ways, so the simulation is necessarily more complex than simple min-bias events. This way I hope to increase the run-time per job by about a factor of three for the same amount of data returned to stage-out. [/Edit] |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
[quote]Ivan, are any of your work units for your presentation running yet?[/quote] Ivan, I just noticed these entries in a log: CRAB_Dest = "/store/temp/user/ireid.17433470b3faad006e8120ad843d39e3666b08f0/MinBias/CRAB3_tutorial_MC_generation_test4 CRAB_UserWebDir = "/home/cms005/150818_115322:ireid_crab_CMS_at_Home_test7" Is this one of "your jobs"? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,168,972 RAC: 1,763 |
[quote]Ivan, are any of your work units for your presentation running yet?[/quote] Yeah. In fact everything you're working on at the moment is one of my jobs. The first line is the temporary destination that the results are stored in before being transferred to stage-out. The second line is the actual directory on the RAL Condor server that results are copied to. The mismatch in "test" numbers is because I forgot there are two places in the submission script where you can change an identity -- in fact the results are going into a CRAB3_tutorial_MC_generation_test4 directory, but distinguished by the 150818_115322 bit, which if you squint carefully enough you will see is the date and time of the CRAB task submission. |
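To save the squinting, a small Python sketch can decode that prefix. It assumes the part before the colon is always in YYMMDD_HHMMSS form, as the post describes. The helper name `parse_crab_timestamp` is invented for this example and is not a CRAB3 function, and the time zone of the stamp is not stated in the thread, so the result is left as a naive datetime.

```python
from datetime import datetime


def parse_crab_timestamp(task_name: str) -> datetime:
    """Decode the leading YYMMDD_HHMMSS stamp of a task name (hypothetical helper)."""
    stamp = task_name.split(":", 1)[0]  # keep only the part before the first colon
    return datetime.strptime(stamp, "%y%m%d_%H%M%S")


# Task name taken from the CRAB_UserWebDir entry quoted in the thread.
print(parse_crab_timestamp("150818_115322:ireid_crab_CMS_at_Home_test7"))
# 2015-08-18 11:53:22
```

That is, the task was submitted on 18 August 2015 at 11:53:22.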
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
[quote]I've a host stuck at "activating fuse module". Tried rebooting the VM, aborting ...[/quote] Running OK now. All 6 working well as I write (that's inviting disaster...) If this was caused by a transient network problem, it's not good that the system doesn't (or didn't) seem able to recover from it without manual intervention. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 36 |
After the 2nd Glidein run, which lasted 5 hours and 48 minutes, Glidein started looping again, so no cmsRun jobs are really being processed anymore. Meanwhile 134 run-xxx directories have been created. I'm not sure whether the following from run-3 is relevant.
From Glidein-stdout:
Executing /home/boinc/CMSRun/glide_pu0ihT/main/glidein_sitewms_setup.sh
Unsupported GLIDEIN_SiteWMS encountered
Executing /home/boinc/CMSRun/glide_pu0ihT/main/smart_partitionable.sh
=== Last script starting Sat Aug 22 03:15:27 CEST 2015 (1440206127) after validating for 19 ===
=== Condor starting Sat Aug 22 03:15:29 CEST 2015 (1440206129) ===
=== Condor ended Sat Aug 22 03:15:34 CEST 2015 (1440206134) after 5 ===
=== Stats of main ===
=================
Total jobs 0 utilization 0
Total goodZ jobs 0 (0%) utilization 0 (0%)
Total goodNZ jobs 0 (0%) utilization 0 (0%)
Total badSignal jobs 0 (0%) utilization 0 (0%)
Total badOther jobs 0 (0%) utilization 0 (0%)
=== End Stats of main ===
From Glidein-stderr:
CONDOR_OS default
ERROR_GEN_PATH /home/boinc/CMSRun/glide_pu0ihT/main/error_gen.sh
CONSTS_FILE /home/boinc/CMSRun/glide_pu0ihT/main/constants.cfg
CONDOR_VARS_FILE /home/boinc/CMSRun/glide_pu0ihT/main/condor_vars.lst
UNTAR_CFG_FILE /home/boinc/CMSRun/glide_pu0ihT/main/untar.cfg
GRIDMAP /home/boinc/CMSRun/glide_pu0ihT/main/grid-mapfile
#
At least the directory pu0ihT is not present in the directory run-3. |
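The "Condor starting" and "Condor ended" markers in that stdout carry epoch seconds in parentheses, so the length of a Condor run can be read off mechanically. The sketch below is only an illustration. The regular expression is written against the three marker lines quoted in this post, not against any documented Glidein format, and `condor_runtime_seconds` is a made-up name.

```python
import re

# Matches "=== Condor starting|ended <date text> (<epoch seconds>)" as seen in the quoted stdout.
CONDOR_MARKER = re.compile(r"=== Condor (starting|ended) .*?\((\d+)\)")


def condor_runtime_seconds(stdout_text: str) -> int:
    """Seconds between the Condor 'starting' and 'ended' markers (hypothetical helper)."""
    stamps = {kind: int(epoch) for kind, epoch in CONDOR_MARKER.findall(stdout_text)}
    if "starting" not in stamps or "ended" not in stamps:
        raise ValueError("both Condor markers are required")
    return stamps["ended"] - stamps["starting"]


SAMPLE = """\
=== Last script starting Sat Aug 22 03:15:27 CEST 2015 (1440206127) after validating for 19 ===
=== Condor starting Sat Aug 22 03:15:29 CEST 2015 (1440206129) ===
=== Condor ended Sat Aug 22 03:15:34 CEST 2015 (1440206134) after 5 ===
"""

print(condor_runtime_seconds(SAMPLE))  # 5
```

A run of 5 seconds matches the "after 5" in the log and the "Total jobs 0" in the stats block, which fits a Glidein that started and stopped without picking up any work.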
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 36 |
[quote]After the 2nd Glidein-run, that lasted 5 hours and 48 minutes, Glidein start looping again.[/quote] I requested a new BOINC task for a fresh start. No cmsRun is started; instead, a run-x directory is created every 7½ minutes. In it:
[DIR] glide_l4w0AM/ 22-Aug-2015 08:43 -
[ ] glidein-stderr 22-Aug-2015 08:42 28K
[ ] glidein-stdout 22-Aug-2015 08:42 4.3K
In the glide_xxxxxx directory there are only:
[ ] MasterLog 22-Aug-2015 08:18 2.1K
[ ] ProcLog 22-Aug-2015 08:18 12K
[ ] StartdLog 22-Aug-2015 08:18 5.1K
[ ] StarterLog 22-Aug-2015 08:13 620 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,168,972 RAC: 1,763 |
[quote]No cmsRun is started, but therefore every 7½ minutes a run-x map is created.[/quote] We've run out of jobs on the Condor server. You can all stand down until further notice. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It is not working. It generates run-1, run-2 etc. every 7 minutes. Looks like Condor (whatever it is) is not working. Here is /logs/run-1/glide_8oUehg/StartdLog:
08/26/15 20:59:59 (pid:7614) ******************************************************
08/26/15 20:59:59 (pid:7614) ** condor_startd (CONDOR_STARTD) STARTING UP
08/26/15 20:59:59 (pid:7614) ** /home/boinc/CMSRun/glide_8oUehg/main/condor/sbin/condor_startd
08/26/15 20:59:59 (pid:7614) ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
08/26/15 20:59:59 (pid:7614) ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
08/26/15 20:59:59 (pid:7614) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
08/26/15 20:59:59 (pid:7614) ** $CondorPlatform: x86_64_RedHat5 $
08/26/15 20:59:59 (pid:7614) ** PID = 7614
08/26/15 20:59:59 (pid:7614) ** Log last touched time unavailable (No such file or directory)
08/26/15 20:59:59 (pid:7614) ******************************************************
08/26/15 20:59:59 (pid:7614) Using config source: /home/boinc/CMSRun/glide_8oUehg/condor_config
08/26/15 20:59:59 (pid:7614) config Macros = 211, Sorted = 211, StringBytes = 10603, TablesBytes = 7636
08/26/15 20:59:59 (pid:7614) CLASSAD_CACHING is ENABLED
08/26/15 20:59:59 (pid:7614) Daemon Log is logging: D_ALWAYS D_ERROR D_JOB
08/26/15 20:59:59 (pid:7614) DaemonCore: command socket at <10.0.2.15:50881?noUDP>
08/26/15 20:59:59 (pid:7614) DaemonCore: private command socket at <10.0.2.15:50881>
08/26/15 21:00:00 (pid:7614) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#38173
08/26/15 21:00:00 (pid:7614) my_popenv failed
08/26/15 21:00:00 (pid:7614) Failed to run hibernation plugin '/home/boinc/CMSRun/glide_8oUehg/main/condor/libexec/condor_power_state ad'
08/26/15 21:00:00 (pid:7614) VM-gahp server reported an internal error
08/26/15 21:00:00 (pid:7614) VM universe will be tested to check if it is available
08/26/15 21:00:00 (pid:7614) History file rotation is enabled.
08/26/15 21:00:00 (pid:7614) Maximum history file size is: 20971520 bytes
08/26/15 21:00:00 (pid:7614) Number of rotated history files is: 2
08/26/15 21:00:00 (pid:7614) Allocating auto shares for slot type 1: Cpus: 1.000000, Memory: auto, Swap: auto, Disk: auto
slot type 1: Cpus: 1.000000, Memory: 2002, Swap: 100.00%, Disk: 100.00%
08/26/15 21:00:00 (pid:7614) New machine resource of type 1 allocated
08/26/15 21:00:00 (pid:7614) Setting up slot pairings
08/26/15 21:00:00 (pid:7614) my_popenv failed
08/26/15 21:00:00 (pid:7614) Adding 'mips' to the Supplimental ClassAd list
08/26/15 21:00:00 (pid:7614) CronJobList: Adding job 'mips'
08/26/15 21:00:00 (pid:7614) Adding 'kflops' to the Supplimental ClassAd list
08/26/15 21:00:00 (pid:7614) CronJobList: Adding job 'kflops'
08/26/15 21:00:00 (pid:7614) CronJob: Initializing job 'mips' (/home/boinc/CMSRun/glide_8oUehg/main/condor/libexec/condor_mips)
08/26/15 21:00:00 (pid:7614) CronJob: Initializing job 'kflops' (/home/boinc/CMSRun/glide_8oUehg/main/condor/libexec/condor_kflops)
08/26/15 21:00:00 (pid:7614) State change: IS_OWNER is false
08/26/15 21:00:00 (pid:7614) Changing state: Owner -> Unclaimed
08/26/15 21:00:00 (pid:7614) State change: RunBenchmarks is TRUE
08/26/15 21:00:00 (pid:7614) Changing activity: Idle -> Benchmarking
08/26/15 21:00:00 (pid:7614) BenchMgr:StartBenchmarks()
08/26/15 21:00:19 (pid:7614) State change: benchmarks completed
08/26/15 21:00:19 (pid:7614) Changing activity: Benchmarking -> Idle
08/26/15 21:04:33 (pid:7614) No resources have been claimed for 30 seconds
08/26/15 21:04:33 (pid:7614) Shutting down Condor on this machine.
08/26/15 21:04:33 (pid:7614) Got SIGTERM. Performing graceful shutdown.
08/26/15 21:04:33 (pid:7614) shutdown graceful
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) Killing job mips
08/26/15 21:04:33 (pid:7614) Killing job kflops
08/26/15 21:04:33 (pid:7614) Deleting cron job manager
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting all jobs
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting all jobs
08/26/15 21:04:33 (pid:7614) Deleting benchmark job mgr
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) Killing job mips
08/26/15 21:04:33 (pid:7614) Killing job kflops
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) Killing job mips
08/26/15 21:04:33 (pid:7614) Killing job kflops
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting all jobs
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting job 'mips'
08/26/15 21:04:33 (pid:7614) CronJob: Deleting job 'mips' (/home/boinc/CMSRun/glide_8oUehg/main/condor/libexec/condor_mips), timer -1
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting job 'kflops'
08/26/15 21:04:33 (pid:7614) CronJob: Deleting job 'kflops' (/home/boinc/CMSRun/glide_8oUehg/main/condor/libexec/condor_kflops), timer -1
08/26/15 21:04:33 (pid:7614) Cron: Killing all jobs
08/26/15 21:04:33 (pid:7614) CronJobList: Deleting all jobs
08/26/15 21:04:33 (pid:7614) All resources are free, exiting. |
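A log like this is easier to read once the timestamps are turned into durations. The Python sketch below is a rough illustration and assumes every entry has the "MM/DD/YY HH:MM:SS (pid:N) message" shape seen in this StartdLog. The helper names are invented here and are not part of HTCondor. It measures how long the startd sat idle between finishing its benchmarks and deciding to shut down.

```python
import re
from datetime import datetime

# One entry: "MM/DD/YY HH:MM:SS (pid:N) message", as in the StartdLog quoted in this post.
ENTRY = re.compile(r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)")


def parse_entry(line):
    """Return (timestamp, pid, message), or None for lines without a timestamp."""
    match = ENTRY.match(line.strip())
    if match is None:
        return None
    when = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return when, int(match.group(2)), match.group(3)


def idle_seconds_before_shutdown(lines):
    """Seconds from the last switch to Idle until Condor announces its shutdown."""
    idle_since = None
    for line in lines:
        parsed = parse_entry(line)
        if parsed is None:
            continue
        when, _pid, message = parsed
        if message.startswith("Changing activity:") and message.endswith("-> Idle"):
            idle_since = when
        elif message.startswith("Shutting down Condor") and idle_since is not None:
            return (when - idle_since).total_seconds()
    return None


SAMPLE = [
    "08/26/15 21:00:19 (pid:7614) Changing activity: Benchmarking -> Idle",
    "08/26/15 21:04:33 (pid:7614) No resources have been claimed for 30 seconds",
    "08/26/15 21:04:33 (pid:7614) Shutting down Condor on this machine.",
]

print(idle_seconds_before_shutdown(SAMPLE))  # 254.0
```

For the log in this post, that gives 254 seconds of idling with nothing claimed before Condor shut itself down, so the startd came up normally but was never handed a job.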
©2024 CERN