The Theory Application
Author | Message |
---|---|
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 14 |
Tasks on both machines got past the previous halt error and after c.10mins each has started processing events. I followed the Graphics button to the Machine logs link and from Alt-F2 can see events processing in real time. One looked like it had trouble with its initial Sherpa (like CP's) but has dropped that in favour of a Pythia which looks to be running fine. (Similarly, 1 machine has a Pythia6, the other a Pythia8) I'll see if I can pay attention when each of these jobs finishes as to whether another will take over. Current Boinc estimated time remaining is a little over 4 days but extrapolating from % progress gives c.18hrs. Is there a self-termination at 24hrs like the standard Theory tasks? Both tasks happy at 60 and 40 mins respectively. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Quote: "Current Boinc estimated time remaining is a little over 4 days but extrapolating from % progress gives c.18hrs. Is there a self-termination at 24hrs like the standard Theory tasks?" The wrapper will kill the BOINC task after 18 hours: the job duration is set to 64800 seconds. |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
Quote: "The VM shutting itself down seems to be solved. Very well!" Yes CP (and Ray), the first "real" T4T jobs have been submitted and the web logs are also working. The next step is to feed the results back into MCPlots, which Leonardo is currently doing, so you will start to get MCPlots stats updates for the work you do. A lot of progress today! So Rasputin, you can also restart testing now... |
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 14 |
Update to my last post: the job finished fine and was replaced by another, and then another. The task continues normally. Good work, guys 8¬) Quote: "The wrapper will kill BOINC task after 18 hours -> job duration set to 64800 seconds" I wonder if that's a typo for the 86400 seconds (24hrs) that T4T runs? With the base memory being 2GB rather than the ordinary 256MB, I'll need to squeeze some extra memory into 2 of my boxes to allow them to run these. (I was going to anyway, but this is the spur to actually do something about it.) |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Quote: "I wonder if that's a typo from the 86400 seconds (24hrs) that T4T runs?" No, it's the same 64800 seconds (18-hour limit) as is used by the CMS application when a VM is not stopped normally after 1 run of 6 or 12 hours. Quote: "With the base memory being 2GB rather than the ordinary 256MB, I'll need to squeeze some extra memory into 2 of my boxes to allow them to run these. (I was going to anyway but this is the spur to actually do something about it.)" Memory used at the moment is ~1.2GB and swap used is 0k, so maybe the 2GB could be lowered a bit. For the multi-threaded Challenge-VM we used as a rule of thumb 512kB for 1 thread. |
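As a quick check of the two figures being compared here, 64800 seconds works out to 18 hours and 86400 seconds to 24 hours. A minimal Python sketch of that arithmetic (the helper name `seconds_to_hours` is purely illustrative and not part of BOINC or the wrapper):

```python
def seconds_to_hours(seconds: int) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600


# Values quoted in the thread: the wrapper's job duration and a full day.
print(seconds_to_hours(64800))  # 18.0
print(seconds_to_hours(86400))  # 24.0
```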
Send message Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 14 |
I exited Boinc and made sure VBox had saved the VM before restarting my hosts to apply some Windows updates. On reboot and restart of Boinc, the TASK continued but the previously running JOB was lost; however, a new job started up to replace it. (Or it might have been a restart of that job; I didn't note its id, so I can't tell either way.) If I have time tomorrow I'll do some more "robustness" testing. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Back to the joys of Suspend and Resume with Condor :( I think we need to start a new thread on this topic. The advantage this time is that the infrastructure is all under our control, so we should be able to make better progress. To help, the Condor log files (MasterLog, StartLog and StarterLog) can be found in the Web logs directory. The log file handling for the jobs has also been improved, so each job log is now archived rather than being overwritten. This will be available once CVMFS has updated. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
I noticed that also sherpa is running well now: ===> [runRivet] Wed Apr 13 07:04:24 CEST 2016 [boinc ee zhad 197 - - sherpa 1.4.1 default 100000 188] |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,106,329 RAC: 9,076 |
Back to getting failure to start work due to... 2016-04-14 12:17:09 (16392): Guest Log: [INFO] Theory application starting. Check log files. 2016-04-14 12:23:10 (16392): Guest Log: [ERROR] App is not supported. Shutting down! |
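To see how long the VM ran before giving up, the two guest log lines quoted above can be parsed and their timestamps compared. The following is only a sketch: the regular expression is an assumption based on the two lines shown in this post, not a documented BOINC log format, and the function names are illustrative.

```python
import re
from datetime import datetime

# Pattern inferred from the two log lines quoted in the post (an assumption).
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): Guest Log: \[(\w+)\] (.*)$"
)


def parse_guest_log(line: str):
    """Return (timestamp, pid, level, message), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    stamp, pid, level, message = match.groups()
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), int(pid), level, message


lines = [
    "2016-04-14 12:17:09 (16392): Guest Log: [INFO] Theory application starting. Check log files.",
    "2016-04-14 12:23:10 (16392): Guest Log: [ERROR] App is not supported. Shutting down!",
]
start, error = (parse_guest_log(line) for line in lines)

# Time between the start message and the error message.
elapsed = (error[0] - start[0]).total_seconds()
print(error[2], elapsed)  # ERROR 361.0
```

In this case the error appears about six minutes after the application reports that it is starting.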
Send message Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
Quote: "Back to getting failure to start work due to..." I just had one of those for a BOINC job that started with Sherpa. I have also seen a number of EXT4 inode addressing errors (not recorded in the logs, alas), so I'll get a new VDI. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
What is the task exit strategy? Does the task just time out after 18h or is it properly shut down at the end of a job? One task that exited by itself stated: 2016-04-14 05:59:24 (2448): Guest Log: [ERROR] App is not supported. Shutting down! |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
I set the memory for the Theory VMs to 1024MB. I have 3 Theory tasks running now. Max memory used is 916632k and max swap used is 4528k out of 1GB. Referring to Rasputin's question: what is the finish strategy for a BOINC task? Just let it run for 18 hours, or a proper finish somewhere after 12 hours, when a Theory job has uploaded its result? So far I haven't found the answer. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
I am just about to push a new version. Should I set the machine memory to 1024MB or 512MB? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
It should be set to the minimum at which it does not crash and barely uses swap. If wanted, we can assign more. I really think it needs a second core. Any unused CPU will be used elsewhere. The average CPU load with nearly all jobs is well above 1.5 (using 2 cores). |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Quote: "I am just about to push a new version. Should I set the machine memory to 1024MB or 512MB?" The VMs I've seen are using between 750MB and 895MB, so 512 or 1024 should not be the only choices. With the configured swap space and without too much swapping, 896MB should suffice. I don't agree with Rasputin about configuring the VM with 2 cores. In the past it was good to have 2 cores because only 1 task was sent. Now we can run several VMs in parallel, each using 100% CPU. In the past I've tested that a VM with 2 cores uses between 1.2 and 1.7 cores depending on what kind of job is running (Pythia6, Pythia8, Vincia, Sherpa, Herwig++ etc.) |
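The 895MB and 896MB figures line up with the 916632k peak reported earlier in the thread. A minimal sketch of that conversion follows, assuming the "k" values are KiB and assuming (purely for illustration) that the VM memory is rounded up to a multiple of 128MB; neither assumption comes from the project itself, and the helper names are hypothetical.

```python
import math


def kib_to_mib(kib: int) -> float:
    """Convert KiB (assumed meaning of the 'k' suffix in the thread) to MiB."""
    return kib / 1024


def round_up_mb(mb: float, step: int = 128) -> int:
    """Round a memory figure up to the next multiple of `step` MB (illustrative rule)."""
    return math.ceil(mb / step) * step


peak_mb = kib_to_mib(916632)   # peak memory figure quoted earlier in the thread
print(round(peak_mb, 1))       # 895.1
print(round_up_mb(peak_mb))    # 896
```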
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Quote: "What is the task exit strategy?" It seems that the first job running past 12h is allowed to finish and the task is then shut down properly. The MC-plot IDs of processed jobs are also in the stderr.log. However, they no longer show on console F4 or in the logs. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
Computer: 1164, OpenSuse 13.2 (x64) guest under Windows 8.1 (x64) Pro. Boinc 7.2.42, Virtualbox 5.0.18_SUSEr10667. Theory_2016_04_30.xml: <vbox_job> <os_name>Linux26_64</os_name> <memory_size_mb>630</memory_size_mb> <enable_network/> <enable_remotedesktop/> <copy_to_shared>init_data.xml</copy_to_shared> <completion_trigger_file>shutdown</completion_trigger_file> <enable_shared_directory/> <pf_host_port>7859</pf_host_port> <pf_guest_port>80</pf_guest_port> <job_duration>64800</job_duration> <enable_vm_savestate_usage/> <disable_automatic_checkpoints/> <heartbeat_filename>heartbeat</heartbeat_filename> <minimum_heartbeat_interval>1200</minimum_heartbeat_interval> </vbox_job> The task is waiting because Boinc is being upgraded to the latest version. |
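For anyone wanting to check these settings on their own host, the values can be read out of the XML with Python's standard library. This is a sketch only: the XML below is the subset of elements from the post needed for the example, and `read_settings` is an illustrative helper, not part of BOINC or vboxwrapper.

```python
import xml.etree.ElementTree as ET

# Subset of the vbox_job configuration quoted in the post above.
VBOX_JOB_XML = """<vbox_job>
  <os_name>Linux26_64</os_name>
  <memory_size_mb>630</memory_size_mb>
  <completion_trigger_file>shutdown</completion_trigger_file>
  <job_duration>64800</job_duration>
  <heartbeat_filename>heartbeat</heartbeat_filename>
  <minimum_heartbeat_interval>1200</minimum_heartbeat_interval>
</vbox_job>"""


def read_settings(xml_text: str) -> dict:
    """Extract memory size, job duration (in hours) and trigger file name."""
    root = ET.fromstring(xml_text)
    return {
        "memory_mb": int(root.findtext("memory_size_mb")),
        "job_duration_h": int(root.findtext("job_duration")) / 3600,
        "trigger_file": root.findtext("completion_trigger_file"),
    }


settings = read_settings(VBOX_JOB_XML)
print(settings)
```

With the values from the post this reports 630 MB of memory, an 18-hour job duration, and `shutdown` as the completion trigger file.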
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I had a couple of JOBS which had a very long runtime. Currently, I have one that has been running for close to 4h and shows: Display update finished (127 histograms, 1000 events). I have an average CPU and have already allocated more memory and a second core. How can it be avoided that a job takes longer than the cutoff time of the boinc task? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 30 |
Quote: "How can it be avoided, that job takes longer than the cutoff time of the boinc task?" It can't. But maybe you have an endless loop. That sometimes happens. There are 2 options to get rid of it: (1) abort the task in BOINC Manager (no credits), or (2) put a shutdown file into the shared directory (credits granted). |
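The second option amounts to creating an empty file named `shutdown` in the task's shared directory. A minimal sketch follows; the file name matches the `completion_trigger_file` setting shown elsewhere in this thread, but the directory location is an assumption that depends on the host and BOINC slot, so the example uses a temporary directory as a stand-in. `request_shutdown` is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def request_shutdown(shared_dir: str, trigger_name: str = "shutdown") -> Path:
    """Create an empty trigger file in the given shared directory."""
    trigger = Path(shared_dir) / trigger_name
    trigger.touch(exist_ok=True)
    return trigger


# Stand-in for the real shared directory, whose path varies per host and slot.
with tempfile.TemporaryDirectory() as shared_dir:
    created = request_shutdown(shared_dir)
    print(created.name, created.exists())
```

On a real host, `shared_dir` would be replaced by the shared directory of the running task.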
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 689 |
Computer-ID: 1165, Win 10 (x64) Pro, has been working for two and a half hours. Logs finished_1.log up to finished_4.log are present. Mem: 619.856k total, 565.800k used. Swap: 1.048.572k total, 44.884k used. It seems to be running OK for the moment. |
©2024 CERN