A faster way to download multiple files
Solution 1
Execute the downloads concurrently instead of sequentially, and set a sensible MaxDegreeOfParallelism; otherwise you will make too many simultaneous requests, which will look like a DoS attack:
public static void Main(string[] args)
{
    var urls = new List<string>(); // ... initialize urls ...
    Parallel.ForEach(
        urls,
        new ParallelOptions { MaxDegreeOfParallelism = 10 },
        DownloadFile);
}

public static void DownloadFile(string url)
{
    // Save the response body to a local file named after the last URL segment.
    using (var sr = new StreamReader(WebRequest.Create(url)
        .GetResponse().GetResponseStream()))
    using (var sw = new StreamWriter(url.Substring(url.LastIndexOf('/') + 1)))
    {
        sw.Write(sr.ReadToEnd());
    }
}
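On newer runtimes the same throttled-download idea is often expressed with HttpClient and a SemaphoreSlim instead of Parallel.ForEach. A minimal async sketch, assuming the caller supplies the URL list (the limit of 10 is a placeholder, not a tuned value):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

class Downloader
{
    // One shared HttpClient; creating one per request can exhaust sockets.
    static readonly HttpClient client = new HttpClient();

    public static async Task DownloadAllAsync(IEnumerable<string> urls, int maxParallel = 10)
    {
        // The semaphore caps how many downloads are in flight at once.
        var gate = new SemaphoreSlim(maxParallel);
        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync();
            try
            {
                byte[] data = await client.GetByteArrayAsync(url);
                // Name the local file after the last URL segment.
                File.WriteAllBytes(url.Substring(url.LastIndexOf('/') + 1), data);
            }
            finally
            {
                gate.Release();
            }
        });
        await Task.WhenAll(tasks);
    }
}
```

This avoids blocking a thread per download, which matters when the downloads are I/O-bound rather than CPU-bound.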
Solution 2
Download the files on several threads; the right number of threads depends on your throughput. Also look at the WebClient and HttpWebRequest classes. Simple sample:
var list = new[]
{
    "http://google.com",
    "http://yahoo.com",
    "http://stackoverflow.com"
};

Parallel.ForEach(list, s =>
{
    using (var client = new WebClient())
    {
        Console.WriteLine($"starting to download {s}");
        string result = client.DownloadString(s);
        Console.WriteLine($"finished downloading {s}");
    }
});
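As noted in the comments, with 2 million files an unspecified degree of parallelism will queue far too much work; it is safer to pass a ParallelOptions with MaxDegreeOfParallelism. A sketch of the same sample with a cap added (the limit of 10 is an arbitrary placeholder):

```csharp
Parallel.ForEach(
    list,
    // Cap concurrent downloads so the server isn't flooded.
    new ParallelOptions { MaxDegreeOfParallelism = 10 },
    s =>
    {
        using (var client = new WebClient())
        {
            Console.WriteLine($"starting to download {s}");
            string result = client.DownloadString(s);
            Console.WriteLine($"finished downloading {s}");
        }
    });
```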
Solution 3
I'd use several threads in parallel, with a WebClient. I recommend setting the max degree of parallelism to the number of threads you want, since an unspecified degree of parallelism doesn't work well for long-running tasks. I've used 50 parallel downloads in one of my projects without a problem, but depending on the speed of an individual download a much lower number might be sufficient.

If you download multiple files in parallel from the same server, you're by default limited to a small number (2 or 4) of parallel connections. While the HTTP 1.1 standard specifies such a low limit, many servers don't enforce it. Use ServicePointManager.DefaultConnectionLimit = 10000; to increase the limit.
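Putting the two settings together, a minimal sketch might look like this (the degree of 50 matches the figure mentioned above; the connection limit of 10000 is illustrative, not tuned):

```csharp
// Raise the per-host connection limit before starting any downloads;
// the default of 2 throttles parallel requests to the same server.
ServicePointManager.DefaultConnectionLimit = 10000;

var urls = new List<string>(); // ... initialize urls ...

Parallel.ForEach(
    urls,
    new ParallelOptions { MaxDegreeOfParallelism = 50 },
    url =>
    {
        using (var client = new WebClient())
        {
            // Save each file under the last segment of its URL.
            client.DownloadFile(url, url.Substring(url.LastIndexOf('/') + 1));
        }
    });
```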
Updated on July 27, 2022

Comments
- o17t H1H' S'k almost 2 years: I need to download about 2 million files from the SEC website. Each file has a unique URL and is on average 10 kB. This is my current implementation:
List<string> urls = new List<string>();
// ... initialize urls ...
WebBrowser browser = new WebBrowser();
foreach (string url in urls)
{
    browser.Navigate(url);
    while (browser.ReadyState != WebBrowserReadyState.Complete)
        Application.DoEvents();
    StreamReader sr = new StreamReader(browser.DocumentStream);
    StreamWriter sw = new StreamWriter(url.Substring(url.LastIndexOf('/')));
    sw.Write(sr.ReadToEnd());
    sr.Close();
    sw.Close();
}
The projected time is about 12 days... Is there a faster way?

Edit: by the way, the local file handling takes only 7% of the time.
Edit: this is my final implementation:
void Main()
{
    ServicePointManager.DefaultConnectionLimit = 10000;
    List<string> urls = new List<string>();
    // ... initialize urls ...
    int retries = urls.AsParallel().WithDegreeOfParallelism(8).Sum(arg => downloadFile(arg));
}

public int downloadFile(string url)
{
    int retries = 0;
retry:
    try
    {
        HttpWebRequest webrequest = (HttpWebRequest)WebRequest.Create(url);
        webrequest.Timeout = 10000;
        webrequest.ReadWriteTimeout = 10000;
        webrequest.Proxy = null;
        webrequest.KeepAlive = false;
        HttpWebResponse webresponse = (HttpWebResponse)webrequest.GetResponse();
        using (Stream sr = webresponse.GetResponseStream())
        using (FileStream sw = File.Create(url.Substring(url.LastIndexOf('/'))))
        {
            sr.CopyTo(sw);
        }
    }
    catch (Exception ee)
    {
        if (ee.Message != "The remote server returned an error: (404) Not Found." &&
            ee.Message != "The remote server returned an error: (403) Forbidden.")
        {
            if (ee.Message.StartsWith("The operation has timed out") ||
                ee.Message == "Unable to connect to the remote server" ||
                ee.Message.StartsWith("The request was aborted: ") ||
                ee.Message.StartsWith("Unable to read data from the transport connection: ") ||
                ee.Message == "The remote server returned an error: (408) Request Timeout.")
                retries++;
            else
                MessageBox.Show(ee.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error);
            goto retry;
        }
    }
    return retries;
}
- CodesInChaos over 12 years: Looks very dubious to me. You're using a shared instance of browser from multiple threads. And calling Application.DoEvents from another thread is probably wrong too.
- Myles McDonnell over 12 years: @CodesInChaos, agreed, I focused on the parallelism without considering the download implementation. Will fix.
- Myles McDonnell over 12 years: Now fixed; replaced the browser control with HttpWebRequest.
- o17t H1H' S'k over 12 years: Thanks, I could get a factor-of-4 speedup with this method (also using ServicePointManager.DefaultConnectionLimit = 10000;). I guess this is due to server restrictions. Any further suggestions?
- o17t H1H' S'k over 12 years: Indeed, ServicePointManager.DefaultConnectionLimit = 10000; turned out to be critical in order to get speedups higher than 2.
- Myles McDonnell over 12 years: The bottleneck, I suspect, is the number of concurrent connections per client (IP address) at the server. If you know what that is, set the MaxDegreeOfParallelism to match; this won't increase throughput but will prevent requests waiting for a connection. To get more throughput, if you have the resources you could scale out, i.e. split the URLs between n clients, each with a distinct IP address, to run concurrently.
- Myles McDonnell over 12 years: The only thing missing here is to set the MaxDegreeOfParallelism. The OP states 2 million files, so without it the above will queue 2 million work items and make way more concurrent requests than the server will allow and/or handle. It's best to throttle it to the max connections per client of the target server.