Posted: 3 Jan 2016, 13:21:57 UTC

We've been puzzling for a while as to why we were getting a lot of "stage-out" failures -- i.e. problems returning result files to data storage.
I had recently pushed the CMS jobs up to 2-5 hours each, depending on processor speed, with each returning a ~150 MB result file. This means that on average each VM is returning ~50 MB/hr (or to put it another way, at an upload speed of 1 Mbps, returning a result file would take about 1200-1500 seconds, or 20-25 minutes, depending on how much protocol overhead you allow for).
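The transfer-time arithmetic above can be checked with a short Python sketch. The function names and the bits_per_byte parameter are mine, purely for illustration. Counting 8 bits per byte gives the raw lower bound; counting 10 bits per byte is a common rule of thumb for protocol overhead and is my assumption here, not a measured figure.

```python
def upload_seconds(size_mb, link_mbps, bits_per_byte=8):
    """Seconds needed to upload size_mb megabytes over a link_mbps uplink.

    Uses decimal units (1 MB = 1,000,000 bytes, 1 Mbps = 1,000,000 bits/s).
    bits_per_byte=10 is a rough allowance for overhead (an assumption).
    """
    return size_mb * bits_per_byte / link_mbps


def average_mb_per_hour(size_mb, job_hours):
    """Average result output rate of one VM, in MB per hour."""
    return size_mb / job_hours


if __name__ == "__main__":
    raw = upload_seconds(150, 1.0)
    padded = upload_seconds(150, 1.0, bits_per_byte=10)
    print(f"raw:           {raw:.0f} s ({raw / 60:.0f} min)")
    print(f"with overhead: {padded:.0f} s ({padded / 60:.0f} min)")
    # A 3-hour job is taken as a mid-range value of the 2-5 hour spread.
    print(f"average rate:  {average_mb_per_hour(150, 3):.0f} MB/hr")
```

On a slower uplink the times scale in proportion, so a 0.5 Mbps line would need twice as long for the same file.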
It seems technology is roughly consistent across the world, and many consumers are still on ADSL broadband -- where the A means Asymmetric, that is, upload speed is usually much slower than download speed. Upload speeds around, or even less than, 1 Mbps seem to be the norm for ADSL broadband.
So, the problems started when enthusiastic volunteers began running several machines at once on their home networks. The total load on the upstream channel then exceeded its capacity, uploads stalled and we started getting transfer time-outs.
So, the caution to take away from this is to make sure you know your upload speed, and make sure you don't run so many machines that they saturate your line.
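As a rough guide to how many machines is too many, here is a minimal Python sketch. The function name, the default of 50 MB/hr per VM (taken from the average above) and the headroom factor are all my own assumptions for illustration. It only looks at the average load; it ignores the fact that uploads come in bursts, the actual time-out limit on transfers, and anything else using your upstream channel, so treat the answer as an upper bound rather than a safe setting.

```python
def max_vms(uplink_mbps, mb_per_hour_per_vm=50, headroom=0.5):
    """Estimate how many VMs an uplink can sustain on average.

    uplink_mbps        -- measured upload speed in megabits per second
    mb_per_hour_per_vm -- average result output of one VM (assumed 50 MB/hr)
    headroom           -- fraction of the uplink given to uploads (assumed 0.5)
    """
    # Megabits the uplink can devote to uploads in one hour.
    usable_mbit_per_hour = uplink_mbps * headroom * 3600
    # Megabits one VM produces in one hour (8 bits per byte, no overhead).
    vm_mbit_per_hour = mb_per_hour_per_vm * 8
    return int(usable_mbit_per_hour // vm_mbit_per_hour)


if __name__ == "__main__":
    for speed in (0.5, 1.0, 2.0):
        print(f"{speed} Mbps uplink: at most {max_vms(speed)} VMs")
```

If several VMs finish at about the same time, their uploads share the line and each takes correspondingly longer, which is when time-outs become likely; a lower headroom value allows for that.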
I believe there are some workloads we could commission that generate results at a somewhat lower MB/hr rate; I'll let you know if we can start running them.
