Message boards : Number crunching : issue of the day

Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 776 - Posted: 21 Aug 2015, 7:35:51 UTC

Please post to this thread any issues you are having today and we will try to fix them.
ID: 776 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 777 - Posted: 21 Aug 2015, 8:07:14 UTC - in response to Message 776.  
Last modified: 21 Aug 2015, 8:08:35 UTC

No output on the ALT+F1, ALT+F4 and ALT+F5 screens.

Most logs are in the run-1 directory. These logs are being updated:

MasterLog 21-Aug-2015 10:00 1.9K
ProcLog 21-Aug-2015 10:00 311K
StartdLog 21-Aug-2015 09:57 419K
StarterLog 21-Aug-2015 09:42 16K

These logs are not being updated and contain only the info from the very first job:

FrameworkJobReport.xml 21-Aug-2015 08:15 21K
_condor_stdout 21-Aug-2015 08:17 283K
cmsRun-stderr.log 21-Aug-2015 07:27 243
cmsRun-stdout.log 21-Aug-2015 08:15 35K
glidein-stderr 21-Aug-2015 07:25 28K
glidein-stdout 21-Aug-2015 07:25 4.3K
scramOutput.log 21-Aug-2015 07:27 659
ID: 777 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 778 - Posted: 21 Aug 2015, 9:26:44 UTC - in response to Message 777.  

I confirm Crystal Pellet's findings. A run-2 folder was generated about 6 hours after run-1.
Otherwise it is now behaving the same way.

/logs/cron-stdout:
00:32:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 1
00:32:01 +0200 2015-08-21 [INFO] Reading the BOINC volunteer's information
00:32:04 +0200 2015-08-21 [INFO] Volunteer: Rasputin42 (277) Host: 617
00:32:04 +0200 2015-08-21 [INFO] Requesting an X509 credential
00:32:06 +0200 2015-08-21 [INFO] Downloading glidein
00:32:06 +0200 2015-08-21 [INFO] Running glidein (check logs)
06:19:02 +0200 2015-08-21 [INFO] CMS glidein Run 1 ended
06:20:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 2
06:20:01 +0200 2015-08-21 [INFO] Reading the BOINC volunteer's information
06:20:06 +0200 2015-08-21 [INFO] Volunteer: Rasputin42 (277) Host: 617
06:20:06 +0200 2015-08-21 [INFO] Requesting an X509 credential
06:20:09 +0200 2015-08-21 [INFO] Downloading glidein
06:20:10 +0200 2015-08-21 [INFO] Running glidein (check logs)
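To check how long each glidein run actually lasted, the cron-stdout lines above can be parsed for matching "Starting" and "ended" messages. This is only a minimal sketch: the function name `run_durations` and the regular expressions are my own, and they assume the line layout shown in this post (time, UTC offset, date, then the message).

```python
import re
from datetime import datetime

# Assumed line layout, taken from the excerpt above:
#   00:32:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 1
#   06:19:02 +0200 2015-08-21 [INFO] CMS glidein Run 1 ended
START_RE = re.compile(r"^(\S+ \S+ \S+) \[INFO\] Starting CMS Application - Run (\d+)$")
END_RE = re.compile(r"^(\S+ \S+ \S+) \[INFO\] CMS glidein Run (\d+) ended$")


def parse_stamp(text):
    """Parse 'HH:MM:SS +ZZZZ YYYY-MM-DD' into a timezone-aware datetime."""
    return datetime.strptime(text, "%H:%M:%S %z %Y-%m-%d")


def run_durations(log_text):
    """Return {run_number: duration} for every run that has both a start and an end line."""
    starts = {}
    durations = {}
    for line in log_text.splitlines():
        line = line.strip()
        m = START_RE.match(line)
        if m:
            starts[int(m.group(2))] = parse_stamp(m.group(1))
            continue
        m = END_RE.match(line)
        if m and int(m.group(2)) in starts:
            run = int(m.group(2))
            durations[run] = parse_stamp(m.group(1)) - starts[run]
    return durations


sample = """\
00:32:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 1
00:32:06 +0200 2015-08-21 [INFO] Running glidein (check logs)
06:19:02 +0200 2015-08-21 [INFO] CMS glidein Run 1 ended
06:20:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 2
"""

for run, duration in run_durations(sample).items():
    print(run, duration)  # 1 5:47:01
```

Run 2 has no "ended" line yet, so it is simply left out of the result.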
ID: 778 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 779 - Posted: 21 Aug 2015, 9:54:18 UTC - in response to Message 778.  

Yes, I'm getting something similar.
Thu 20 Aug 2015 05:32:46 PM BST | CMS-dev | Resetting project
Thu 20 Aug 2015 05:32:54 PM BST | CMS-dev | work fetch resumed by user
Thu 20 Aug 2015 05:32:55 PM BST | CMS-dev | update requested by user
Thu 20 Aug 2015 05:32:58 PM BST | CMS-dev | Master file download succeeded
Thu 20 Aug 2015 05:33:03 PM BST | CMS-dev | Sending scheduler request: Requested by user.
Thu 20 Aug 2015 05:33:03 PM BST | CMS-dev | Requesting new tasks for CPU
Thu 20 Aug 2015 05:33:05 PM BST | CMS-dev | Scheduler request completed: got 1 new tasks


And now in "graphics" logs I see:
boot.log 20-Aug-2015 17:41 11K
[ ] cron-stderr 20-Aug-2015 17:42 64
[TXT] cron-stdout 21-Aug-2015 06:12 1.2K
[DIR] run-1/ 20-Aug-2015 17:47 -
[DIR] run-2/ 21-Aug-2015 00:09 -
[DIR] run-3/ 21-Aug-2015 06:14 -


but delving into the run directories -- run-1:
[TXT] FrameworkJobReport.xml 20-Aug-2015 18:28 22K
[ ] MasterLog 21-Aug-2015 00:05 2.2K
[ ] ProcLog 21-Aug-2015 00:05 775K
[ ] StartdLog 21-Aug-2015 00:05 1.1M
[ ] StarterLog 21-Aug-2015 00:05 43K

[TXT] _condor_stderr 20-Aug-2015 17:43 0
[ ] _condor_stdout 20-Aug-2015 18:30 282K
[ ] cmsRun-stderr.log 20-Aug-2015 17:46 243
[TXT] cmsRun-stdout.log 20-Aug-2015 18:28 35K
[ ] cron-stderr 20-Aug-2015 17:42 64
[TXT] cron-stdout 21-Aug-2015 06:12 1.2K
[ ] glidein-stderr 21-Aug-2015 00:05 126K
[ ] glidein-stdout 21-Aug-2015 00:05 8.9K

[ ] scramOutput.log 20-Aug-2015 17:46 658



run-2:
[TXT] FrameworkJobReport.xml 21-Aug-2015 00:37 22K
[ ] MasterLog 21-Aug-2015 06:10 2.2K
[ ] ProcLog 21-Aug-2015 06:10 737K
[ ] StartdLog 21-Aug-2015 06:10 1.1M
[ ] StarterLog 21-Aug-2015 06:09 42K

[TXT] _condor_stderr 21-Aug-2015 00:08 0
[ ] _condor_stdout 21-Aug-2015 00:38 282K
[ ] cmsRun-stderr.log 21-Aug-2015 00:08 244
[TXT] cmsRun-stdout.log 21-Aug-2015 00:37 35K
[ ] glidein-stderr 21-Aug-2015 06:10 126K
[ ] glidein-stdout 21-Aug-2015 06:10 8.9K

[ ] scramOutput.log 21-Aug-2015 00:08 661


run-3:
[TXT] FrameworkJobReport.xml 21-Aug-2015 06:43 22K
[ ] MasterLog 21-Aug-2015 09:09 2.3K
[ ] ProcLog 21-Aug-2015 10:39 547K
[ ] StartdLog 21-Aug-2015 10:35 828K
[ ] StarterLog 21-Aug-2015 10:25 35K

[TXT] _condor_stderr 21-Aug-2015 06:13 0
[ ] _condor_stdout 21-Aug-2015 06:45 282K
[ ] cmsRun-stderr.log 21-Aug-2015 06:13 244
[TXT] cmsRun-stdout.log 21-Aug-2015 06:43 35K
[ ] glidein-stderr 21-Aug-2015 06:13 28K
[ ] glidein-stdout 21-Aug-2015 06:13 4.2K
[ ] scramOutput.log 21-Aug-2015 06:13 661


So, it's not starting a new log directory for each job, and some, but not all, of the logs are ending up in the currently opened directory.
ALT-F4 and ALT-F5 console displays are also still blank.
ID: 779 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Yeti
Avatar

Send message
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 781 - Posted: 21 Aug 2015, 10:53:57 UTC
Last modified: 21 Aug 2015, 10:54:29 UTC

I tested "Suspend" and "Resume" and they seem to work fine, as long as I don't exit BOINC.

When I exit BOINC, all unfinished work seems to be lost and it starts again from zero. I don't know whether it first fetches a new job or restarts the old one.
ID: 781 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 782 - Posted: 21 Aug 2015, 10:56:25 UTC

Still running in run-1

7 cmsRun jobs returned successfully (triggered by condor_exec.exe, I suppose):

08/21/15 07:26:15 (pid:7627) Create_Process succeeded, pid=7631
08/21/15 08:17:15 (pid:7627) Process exited, pid=7631, status=151

08/21/15 08:17:18 (pid:9763) Create_Process succeeded, pid=9769
08/21/15 09:02:28 (pid:9763) Process exited, pid=9769, status=151

08/21/15 09:02:31 (pid:11886) Create_Process succeeded, pid=11890
08/21/15 09:42:02 (pid:11886) Process exited, pid=11890, status=151

08/21/15 09:42:05 (pid:13600) Create_Process succeeded, pid=13605
08/21/15 10:02:33 (pid:13600) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
08/21/15 10:02:33 (pid:13600) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
08/21/15 10:03:33 (pid:13600) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#23054
08/21/15 10:03:38 (pid:13600) Accepted request to reconnect from <130.246.180.120:9818>
08/21/15 10:03:38 (pid:13600) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_11916>
08/21/15 10:03:38 (pid:13600) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_11916>
08/21/15 10:27:32 (pid:13600) Process exited, pid=13605, status=151

08/21/15 10:27:36 (pid:15521) Create_Process succeeded, pid=15527
08/21/15 11:11:01 (pid:15521) Process exited, pid=15527, status=151

08/21/15 11:11:03 (pid:17375) Create_Process succeeded, pid=17379
08/21/15 11:56:30 (pid:17375) Process exited, pid=17379, status=151

08/21/15 11:56:33 (pid:19290) Create_Process succeeded, pid=19294
08/21/15 12:38:36 (pid:19290) Process exited, pid=19294, status=151
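For anyone who wants the run time of each job rather than reading the timestamps by eye, the Create_Process / Process exited pairs can be matched up on the child pid. A minimal sketch, assuming the StarterLog line format shown above; `job_durations` is a hypothetical helper name, not part of Condor.

```python
import re
from datetime import datetime

# Matches the two line types of interest, e.g.
#   08/21/15 07:26:15 (pid:7627) Create_Process succeeded, pid=7631
#   08/21/15 08:17:15 (pid:7627) Process exited, pid=7631, status=151
LINE_RE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) "
    r"(Create_Process succeeded|Process exited), pid=(\d+)"
)


def job_durations(log_text):
    """Return a list of (child_pid, duration) for each job that started and exited."""
    started = {}
    durations = []
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip CCBListener and any other lines
        # Assumption: the date is month/day/two-digit year, as in the excerpt.
        stamp = datetime.strptime(m.group(1), "%m/%d/%y %H:%M:%S")
        event, pid = m.group(2), int(m.group(3))
        if event.startswith("Create_Process"):
            started[pid] = stamp
        elif pid in started:
            durations.append((pid, stamp - started.pop(pid)))
    return durations


sample = """\
08/21/15 07:26:15 (pid:7627) Create_Process succeeded, pid=7631
08/21/15 08:17:15 (pid:7627) Process exited, pid=7631, status=151
08/21/15 08:17:18 (pid:9763) Create_Process succeeded, pid=9769
08/21/15 09:02:28 (pid:9763) Process exited, pid=9769, status=151
"""

for pid, duration in job_durations(sample):
    print(pid, duration)
# 7631 0:51:00
# 9769 0:45:10
```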
ID: 782 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 783 - Posted: 21 Aug 2015, 11:11:19 UTC - in response to Message 778.  

Rasputin42, Your log suggests the glidein is run every 6 hours so I would expect to see a run-x directory every six hours. If the glidein is getting multiple jobs then the log files may only show the first job. Will investigate further.
ID: 783 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 784 - Posted: 21 Aug 2015, 11:45:12 UTC - in response to Message 783.  
Last modified: 21 Aug 2015, 11:53:39 UTC

Rasputin42, Your log suggests the glidein is run every 6 hours so I would expect to see a run-x directory every six hours. If the glidein is getting multiple jobs then the log files may only show the first job. Will investigate further.

[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -

Looks like that's true.

I hope the glidein waits until the running cmsRun has finished. It seems so, or was it just a coincidence that there were 6 hours and 1 minute in between?

used default retire time, 21600
using default retire spread, 2160


From the StarterLog:
08/21/15 12:38:39 (pid:21092) Create_Process succeeded, pid=21096
08/21/15 13:26:14 (pid:21092) Process exited, pid=21096, status=151
condor_exec.exe ended at least a few minutes before the run-2 directory was created.
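The arithmetic behind the "6 hours" can be checked quickly. This sketch assumes the retire time and retire spread in the glidein output are given in seconds (the log lines do not state the unit), and it relies on the default English month names when parsing the directory timestamps.

```python
from datetime import datetime, timedelta

# Values quoted from the glidein output above; the unit (seconds) is an assumption.
RETIRE_TIME_S = 21600    # "used default retire time, 21600"
RETIRE_SPREAD_S = 2160   # "using default retire spread, 2160"

retire_time = timedelta(seconds=RETIRE_TIME_S)
retire_spread = timedelta(seconds=RETIRE_SPREAD_S)
print(retire_time)    # 6:00:00
print(retire_spread)  # 0:36:00

# Directory timestamps of run-1 and run-2 from the web log listing.
fmt = "%d-%b-%Y %H:%M"
run1 = datetime.strptime("21-Aug-2015 07:28", fmt)
run2 = datetime.strptime("21-Aug-2015 13:29", fmt)
gap = run2 - run1
print(gap)            # 6:01:00
```

So the gap between the two directories is within a minute of the default retire time, if the seconds assumption holds.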
ID: 784 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 785 - Posted: 21 Aug 2015, 11:55:02 UTC
Last modified: 21 Aug 2015, 12:03:04 UTC

Looping again:

A new run-x directory is created every 2 minutes.

[ ] boot.log 21-Aug-2015 07:24 11K
[TXT] cron-stderr 21-Aug-2015 07:25 0
[TXT] cron-stdout 21-Aug-2015 13:56 7.9K
[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -
[DIR] run-3/ 21-Aug-2015 13:31 -
[DIR] run-4/ 21-Aug-2015 13:33 -
[DIR] run-5/ 21-Aug-2015 13:35 -
[DIR] run-6/ 21-Aug-2015 13:37 -
[DIR] run-7/ 21-Aug-2015 13:39 -
[DIR] run-8/ 21-Aug-2015 13:41 -
[DIR] run-9/ 21-Aug-2015 13:43 -
[DIR] run-10/ 21-Aug-2015 13:45 -
[DIR] run-11/ 21-Aug-2015 13:47 -
[DIR] run-12/ 21-Aug-2015 13:49 -
[DIR] run-13/ 21-Aug-2015 13:51 -
[DIR] run-14/ 21-Aug-2015 13:53 -
[DIR] run-15/ 21-Aug-2015 13:55 -
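The looping is easy to spot automatically from a listing like the one above by looking at the gap between consecutive run-x directories. A rough sketch: `run_gaps` and `looping_runs` are hypothetical helpers, and the 10-minute threshold is an arbitrary choice of mine, not a project setting.

```python
import re
from datetime import datetime, timedelta

# Matches listing entries such as: [DIR] run-3/ 21-Aug-2015 13:31 -
ENTRY_RE = re.compile(r"\[DIR\] run-(\d+)/ (\d{2}-\w{3}-\d{4} \d{2}:\d{2})")


def run_gaps(listing):
    """Return {run_number: time since the previous run directory was created}."""
    runs = []
    for m in ENTRY_RE.finditer(listing):
        stamp = datetime.strptime(m.group(2), "%d-%b-%Y %H:%M")
        runs.append((int(m.group(1)), stamp))
    runs.sort()
    return {n: t - prev_t for (_, prev_t), (n, t) in zip(runs, runs[1:])}


def looping_runs(listing, threshold=timedelta(minutes=10)):
    """Return run numbers created suspiciously soon after the previous run."""
    return [n for n, gap in run_gaps(listing).items() if gap < threshold]


listing = """\
[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -
[DIR] run-3/ 21-Aug-2015 13:31 -
[DIR] run-4/ 21-Aug-2015 13:33 -
"""

print(looping_runs(listing))  # [3, 4]
```

run-2 is not flagged because it follows run-1 by the normal 6 hours and 1 minute.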
ID: 785 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 786 - Posted: 21 Aug 2015, 12:14:17 UTC

The machine I was watching yesterday was looping until 22:40, and then switched to the 6-hour cycle at about run-201.

It's due to shut down the current BOINC task instance within an hour, and start a new one. It'll be interesting to see how the log map changes.
ID: 786 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 787 - Posted: 21 Aug 2015, 12:19:55 UTC - in response to Message 785.  

Please could you email me the glidein-stdout and glidein-stderr from one of the runs?

Thanks,
ID: 787 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 788 - Posted: 21 Aug 2015, 12:22:42 UTC

The glidein creates a working directory glide_xxxx and the CMSRun is executed in a subdirectory called dir_xxxx. I have now replicated this structure in the Web logs so hopefully we will see what is happening. The update has been pushed and will appear shortly.
ID: 788 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 789 - Posted: 21 Aug 2015, 13:02:26 UTC
Last modified: 21 Aug 2015, 13:14:08 UTC

Rebooted the VM to get rid of the looping. Now 'top' (ALT+F3) is no longer available
and the logs web page is not available at all. Only ALT+F1 shows "Running glidein (check logs)", and
ALT+F2 shows me that cmsRun has been working for 35 minutes now (even ALT+F6 or ALT+F10 is not available).
I will wait to see what happens once this first cmsRun after the boot has finished.

A 2nd cmsRun started, but otherwise there is no difference and no improved access.
I'll request a new BOINC task in the hope that Laurence's update is already available in a fresh VM.
ID: 789 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Richard Haselgrove

Send message
Joined: 4 May 15
Posts: 64
Credit: 55,584
RAC: 0
Message 790 - Posted: 21 Aug 2015, 13:35:34 UTC

20 minutes into a new task, and the console has been stuck at

[ image ]

for at least the last 10 of them.

Having lit the blue touch paper, I'm now retiring to a safe distance and waiting...
ID: 790 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Yeti
Avatar

Send message
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 791 - Posted: 21 Aug 2015, 13:45:54 UTC - in response to Message 788.  

The glidein creates a working directory glide_xxxx and the CMSRun is executed in a subdirectory called dir_xxxx. I have now replicated this structure in the Web logs so hopefully we will see what is happening. The update has been pushed and will appear shortly.

Yeah, I can see this structure growing now. Do you want to see anything from it?
ID: 791 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 792 - Posted: 21 Aug 2015, 13:57:19 UTC - in response to Message 790.  
Last modified: 21 Aug 2015, 14:11:05 UTC

20 minutes into a new task, and the console has been stuck at

[ image ]

for at least the last 10 of them.

Having lit the blue touch paper, I'm now retiring to a safe distance and waiting...

Had the same for over 30 minutes and decided to reboot the VM.

Now it is working with the new directory structure underneath run-1/:

[DIR] glide_JsrsA8/ 21-Aug-2015 15:52 -
[ ] glidein-stderr 21-Aug-2015 15:50 28K
[ ] glidein-stdout 21-Aug-2015 15:50 4.3K


Edit: The glide_xxxxxx directory contains another sub-directory called dir_xxxxx, where xxxxx is the process ID of condor_starter, with the cmsRun output in it.
cmsRun is busy and all the usual ALT+Fx screens display information.
ID: 792 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 793 - Posted: 21 Aug 2015, 14:02:18 UTC - in response to Message 792.  

Yes, I'm seeing that directory structure under run-4 now. Managed to catch it between jobs so I aborted and started a new task.
ID: 793 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 794 - Posted: 21 Aug 2015, 14:23:18 UTC - in response to Message 793.  

Yep, hopefully the Web logs and console are working now for everyone. It is Friday afternoon so I won't touch anything now. Let's see how it goes over the weekend. I would be interested to know if anyone has any suspend/resume related issues.
ID: 794 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 795 - Posted: 21 Aug 2015, 14:31:56 UTC - in response to Message 794.  

It is Friday afternoon so I won't touch anything now.

Yeah, hands off . . . only to touch a well-deserved drink.
ID: 795 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 796 - Posted: 21 Aug 2015, 14:45:29 UTC

Looks like each subdirectory:
/logs/run-3/glide_sN6j69/dir_14859/cmsRun-stdout.log
/logs/run-3/glide_sN6j69/dir_19030/cmsRun-stdout.log


contains the results of one lot of 200 runs.

Between run 200 and run 1 of the next lot there is a gap of about 7-8 minutes.
I guess that is the time it takes to reinitialize everything and get new data.
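To group the per-job logs, the new paths can be split into run number, glidein directory and condor_starter pid. A small sketch, assuming every per-job log sits at /logs/run-N/glide_XXXXXX/dir_PID/ as in the two paths above; `split_log_path` and the dictionary keys are names I made up.

```python
import re

# Assumed layout: /logs/run-<N>/glide_<id>/dir_<pid>/<file>
PATH_RE = re.compile(r"^/logs/run-(\d+)/glide_(\w+)/dir_(\d+)/([^/]+)$")


def split_log_path(path):
    """Split a web-log path into its parts, or return None if it does not match."""
    m = PATH_RE.match(path)
    if m is None:
        return None
    return {
        "run": int(m.group(1)),
        "glide_id": m.group(2),
        # Per Crystal Pellet's post, the dir_ suffix is the condor_starter process ID.
        "starter_pid": int(m.group(3)),
        "file": m.group(4),
    }


paths = [
    "/logs/run-3/glide_sN6j69/dir_14859/cmsRun-stdout.log",
    "/logs/run-3/glide_sN6j69/dir_19030/cmsRun-stdout.log",
]

for p in paths:
    print(split_log_path(p))
```

Both paths share run 3 and the same glide directory and differ only in the starter pid, which fits one dir_ per job inside a single glidein run.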
ID: 796 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote