Thread 'Open Issues'

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2951 - Posted: 22 Apr 2016, 18:12:38 UTC Last modified: 22 Apr 2016, 20:00:26 UTC
Here is the list of open issues. If there is something that is not listed, please post to get it added.
- Suspend/Resume
- Job Final State Analysis (and show result in logs/console)
- Backoff on Errors
- Download tasks according to requested time (workbuffer) and not a fixed number

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2952 - Posted: 22 Apr 2016, 18:40:36 UTC Last modified: 22 Apr 2016, 18:52:19 UTC
I have allocated 2 cores to the task. The load average is very high (the 15-minute average is sometimes up to 1.81). I can only speculate how high it might be with just one core. (Maybe I will try that next.) This is not a fault as such, but an efficiency issue. Other tasks (CMS, ATLAS) are nowhere near that bad under the same conditions.
Finished jobs should show in the stderr with pass/fail status.
I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job.
EDIT: Not to forget: download tasks according to requested time (workbuffer) and not a fixed number.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2953 - Posted: 22 Apr 2016, 19:43:16 UTC - in response to Message 2952.
Quote: "I have allocated 2 cores to the task. The load average is very high (the 15-minute average is sometimes up to 1.81). I can only speculate how high it might be with just one core. (Maybe I will try that next.)"
With 2 cores, 1.81 seems fine. The workload may just be adapting. I would suggest trying with 1 core.
Quote: "Finished jobs should show in the stderr with pass/fail status."
Will extend the job analysis description.
Quote: "I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job."
Does this happen in the T4T production version? If so, please let me know where you see this.
Quote: "EDIT: Not to forget: download tasks according to requested time (workbuffer) and not a fixed number."
Will add this.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2955 - Posted: 22 Apr 2016, 19:58:02 UTC
Quote (Rasputin42): "I would also like to see which app (pythia6.xxx, sherpa, herwig...) actually calculated the job."
Quote (Laurence): "Does this happen in the T4T production version? If so, please let me know where you see this."
No, this is just for fault finding. Certain issues may be caused only by certain apps, and we cannot tell which if we do not know which app ran the job. The app names used to be in the logs a few days ago, but things change...

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2956 - Posted: 22 Apr 2016, 20:02:31 UTC - in response to Message 2955.
The logs should be back. A few things broke while refactoring today. If they are not there in new tasks from now on, let me know.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2969 - Posted: 23 Apr 2016, 7:17:32 UTC
Quote: "I have allocated 2 cores to the task. The load average is very high (the 15-minute average is sometimes up to 1.81). I can only speculate how high it might be with just one core. (Maybe I will try that next.)"
I have tested it with only one core. The load average is, as expected, very high (15-minute load average of 1.85 to 1.93). This is with the agile-runmc app on two tasks.
The apps are still nowhere to be found in the logs.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,442 RAC: 55	Message 2976 - Posted: 23 Apr 2016, 20:14:41 UTC
The last 4 tasks ended in a computation error, with no reason apparent to me.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155757
2016-04-23 21:35:59 (5040): Status Report: CPU Time: '35317.362792'
2016-04-23 22:03:57 (5040): Guest Log: [ERROR] Condor exited with 0
2016-04-23 22:03:57 (5040): Guest Log: [INFO] Shutting Down.
2016-04-23 22:03:57 (5040): VM Completion File Detected.
2016-04-23 22:03:57 (5040): VM Completion Message: Condor exited with 0 .
2016-04-23 22:03:57 (5040): Powering off VM.
2016-04-23 22:03:59 (5040): Successfully stopped VM.
2016-04-23 22:04:04 (5040): Deregistering VM. (boinc_394876cc1189c4ec, slot#1)
2016-04-23 22:04:04 (5040): Removing virtual disk drive(s) from VM.
2016-04-23 22:04:04 (5040): Removing network bandwidth throttle group from VM.
2016-04-23 22:04:04 (5040): Removing storage controller(s) from VM.
2016-04-23 22:04:04 (5040): Removing VM from VirtualBox.
22:04:09 (5040): called boinc_finish(1)
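
One way to spot the pattern in the log above (Condor reporting exit 0 while the wrapper calls boinc_finish(1)) is to scan the stderr text for both values. The Python sketch below is illustrative only and is not part of the project's tooling: the names `summarize` and `is_suspicious` are hypothetical, and the regular expressions assume the line layout seen in this one excerpt, not a documented vboxwrapper log format.

```python
import re

# Assumed layout, taken from the excerpt: "[YYYY-MM-DD ]HH:MM:SS (pid): message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} )?(\d{2}:\d{2}:\d{2}) \((\d+)\): (.*)$")
CONDOR_RE = re.compile(r"Condor exited with (-?\d+)")
FINISH_RE = re.compile(r"called boinc_finish\((-?\d+)\)")


def summarize(log_text):
    """Return (condor_exit, boinc_exit); each is None if not found."""
    condor_exit = None
    boinc_exit = None
    for raw in log_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        message = match.group(4)
        condor = CONDOR_RE.search(message)
        if condor:
            condor_exit = int(condor.group(1))
        finish = FINISH_RE.search(message)
        if finish:
            boinc_exit = int(finish.group(1))
    return condor_exit, boinc_exit


def is_suspicious(condor_exit, boinc_exit):
    """True when Condor reported 0 but the wrapper finished with non-zero."""
    return condor_exit == 0 and boinc_exit not in (None, 0)


# Three lines copied from the excerpt above.
SAMPLE = """\
2016-04-23 22:03:57 (5040): Guest Log: [ERROR] Condor exited with 0
2016-04-23 22:03:57 (5040): VM Completion Message: Condor exited with 0 .
22:04:09 (5040): called boinc_finish(1)
"""

if __name__ == "__main__":
    condor_exit, boinc_exit = summarize(SAMPLE)
    print(condor_exit, boinc_exit, is_suspicious(condor_exit, boinc_exit))
```

A check like this only flags the mismatch. It says nothing about why the task was marked as a computation error.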

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2977 - Posted: 23 Apr 2016, 20:22:52 UTC Last modified: 23 Apr 2016, 20:32:27 UTC
A task finished after about 10 hours with a computation error.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155454
Everything seemed fine and the last job was successful. This is similar to what Crystal Pellet reported.
2 more tasks failed with a computation error after 10h to 10h30min of runtime.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2978 - Posted: 23 Apr 2016, 20:36:00 UTC Last modified: 23 Apr 2016, 20:38:48 UTC
The odd thing is the first few lines in the stderr:
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion.
(0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
2016-04-23 12:22:05 (2472): vboxwrapper (7.7.26184): starting
"Function not permitted"?
EDIT: My last valid result did not have that.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=155532

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 2979 - Posted: 23 Apr 2016, 20:53:45 UTC - in response to Message 2976.
Sorry about that. We should give credit in this case. We need to double-check what is going on when Condor exits.

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 2980 - Posted: 23 Apr 2016, 20:56:09 UTC - in response to Message 2969. Last modified: 23 Apr 2016, 20:56:47 UTC
Quote: "I have allocated 2 cores to the task. The load average is very high (the 15-minute average is sometimes up to 1.81). I can only speculate how high it might be with just one core. (Maybe I will try that next.)"
Quote: "I have tested it with only one core. The load average is, as expected, very high (15-minute load average of 1.85 to 1.93). This is with the agile-runmc app on two tasks. The apps are still nowhere to be found in the logs."
Theory apps are dual-threaded, but the second thread is only used for graphics generation and uses less than half a CPU. In the past (back in the days of cernvmwrapper) we tried allocating 2 cores per task, but discontinued it because it wasted half a CPU on average.
In any case, the numerical value of "load average" doesn't map exactly to the number of CPUs loaded, so don't worry too much about it.
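
To compare the readings in this exchange on a common footing, each load average can be divided by the number of cores allocated to the VM. This is a rough sketch, not a project tool: `load_per_core` is a hypothetical helper. It also assumes the readings come from a Linux guest, where the load average counts tasks blocked on I/O as well as runnable ones, so a per-core value near or above 1.0 does not by itself prove the CPU is saturated.

```python
def load_per_core(load_average, cores):
    """Divide a load-average reading by the number of allocated cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


# (allocated cores, 15-minute load average) pairs quoted in the thread
READINGS = [(2, 1.81), (1, 1.85), (1, 1.93)]

if __name__ == "__main__":
    for cores, load in READINGS:
        per_core = load_per_core(load, cores)
        print(f"{cores} core(s): load {load:.2f} -> {per_core:.2f} per core")
```

Seen this way, the 2-core reading is below 1.0 per core, while the 1-core readings are close to 2.0 per core. That matches the description of a main thread plus a lighter second thread sharing one CPU.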

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2981 - Posted: 23 Apr 2016, 21:04:11 UTC - in response to Message 2980.
Quote: "In any case, the numerical value of 'load average' doesn't map exactly to the number of CPUs loaded, so don't worry too much about it."
I am just concerned that 1 core is doing work where nearly 2 would be needed. It is either wasting some CPU (with 2 cores) or slowing the task down quite a bit (with 1 core).

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3042 - Posted: 26 Apr 2016, 19:12:55 UTC
Thanks for adding the finished x.log files. Now we can see whether a job passed or failed.

Ray Murray Joined: 13 Apr 15 Posts: 138 Credit: 3,015,630 RAC: 0	Message 3086 - Posted: 28 Apr 2016, 20:58:16 UTC Last modified: 28 Apr 2016, 20:59:13 UTC
Over at VirtualLHC, the 32-bit app has a base memory requirement of only 256MB. Challenge 64-bit runs happily with 512MB per core. The apps here are requesting 2GB, triggering the need for app_configs to limit the number of tasks running so as not to overburden contributors' hosts.
Is there any prospect of reducing the VM memory requirement before these 64-bit apps get released to production? (The question applies equally to CMS and ATLAS.)

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 3090 - Posted: 28 Apr 2016, 22:29:42 UTC - in response to Message 3086.
Theory yes, CMS and ATLAS no.
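
The effect of the memory figures Ray Murray quotes can be made concrete with a little arithmetic. The sketch below is only an illustration: the helper `max_concurrent_vms`, the 8192MB host, and the 2048MB set aside for the rest of the system are all assumptions, not values from the thread or from the project.

```python
def max_concurrent_vms(host_ram_mb, vm_ram_mb, reserve_mb=2048):
    """Number of VMs of vm_ram_mb that fit after setting reserve_mb aside."""
    if vm_ram_mb <= 0:
        raise ValueError("vm_ram_mb must be positive")
    usable = host_ram_mb - reserve_mb
    if usable <= 0:
        return 0
    return usable // vm_ram_mb


HOST_RAM_MB = 8192  # hypothetical volunteer host, not from the thread

# Per-VM memory figures mentioned in the thread
VM_SIZES_MB = {
    "32-bit app at VirtualLHC": 256,
    "Challenge 64-bit (per core)": 512,
    "apps here": 2048,
}

if __name__ == "__main__":
    for label, size in VM_SIZES_MB.items():
        print(f"{label}: {max_concurrent_vms(HOST_RAM_MB, size)} VM(s)")
```

A limit computed this way is the kind of value a volunteer might put in the `<max_concurrent>` element of a BOINC app_config.xml. The right reserve depends on what else the host runs.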

Development for LHC@home