I have to download ~10,000 files from a site daily, where each file ranges from 50 KB to 150 KB in size.
I plan to use a simple process manager that forks X processes to download chunks in parallel: files 1 to 1000 in one process, files 1001 to 2000 in the next, files 2001 to 3000 in the next, and so on.
They will all run in parallel, on an Amazon EC2 instance.
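Roughly what I have in mind is sketched below. Python with multiprocessing is just a stand-in for whatever process manager I end up using, and urls.txt, the worker count, and the function names are placeholders for illustration:

```python
import multiprocessing
import urllib.request


def download_chunk(urls):
    """Download every URL in this chunk sequentially, within one process."""
    for url in urls:
        urllib.request.urlretrieve(url, url.rsplit("/", 1)[-1])


def run(urls, num_workers):
    # Split the URL list into num_workers contiguous chunks
    # (files 1-1000, 1001-2000, ... for 10k files and 10 workers).
    chunk_size = -(-len(urls) // num_workers)  # ceiling division
    chunks = [urls[i:i + chunk_size] for i in range(0, len(urls), chunk_size)]

    procs = [multiprocessing.Process(target=download_chunk, args=(c,)) for c in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    # urls.txt (one URL per line) and the worker count are placeholders.
    with open("urls.txt") as f:
        run([line.strip() for line in f if line.strip()], num_workers=10)
```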
Is there a general rule of thumb I can use to determine how many processes to spawn so that the entire job (all 10k files) downloads in the shortest amount of time?
I presume "more processes" is not better, since at some point the bandwidth will get congested.
I would ideally like to keep this on one EC2 instance, but am open to using more if you feel that is the optimal solution.
What's the best way to find the optimal number?
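The only approach I can think of is a brute-force sweep: time a small sample at a few different worker counts and pick the fastest, something like the sketch below (the fetch module name, the sample size, and the worker counts are all guesses, and it assumes the sample behaves like the full set). I'm hoping there's something smarter than trial and error.

```python
# Crude empirical sweep, assuming the sketch above is saved as fetch.py
# and that a ~500-file sample is representative of the full 10k set.
import time
import fetch  # hypothetical module name for the sketch above

if __name__ == "__main__":
    with open("urls.txt") as f:
        sample = [line.strip() for line in f if line.strip()][:500]

    for n in (5, 10, 20, 40, 80):
        start = time.time()
        fetch.run(sample, num_workers=n)
        print(f"{n} workers: {time.time() - start:.1f} s")
```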
Thanks!
Note: The number is not fixed at 10k. That is just one partner site. We have other partner sites where we may need, say, 50,000 files or more, so I'd like the solution to be generic enough.