HttpClient scrape data from website with login c#

20,095

There are many ways to perform a login to a website, and that depends on the authentication method used by the specific site (Forms authentication, Basic authentication, Windows authetication etc.). Usually websites use the FormsAuthentication.

To perform a login in a standard FormsAuthentication website using the HttpClient, you need to set the CookieContainer, because authentication data will be set on cookies.

In your specific example, the login form makes a POST to any of the page in HTTPS, i used https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559 as an example. This is the code to make the request using HttpClient:

var baseAddress = new Uri("https://wttv.click-tt.de/");
var cookieContainer = new CookieContainer();
using (var handler = new HttpClientHandler() { CookieContainer = cookieContainer })
using (var client = new HttpClient(handler) { BaseAddress = baseAddress })
{
    //usually i make a standard request without authentication, eg: to the home page.
    //by doing this request you store some initial cookie values, that might be used in the subsequent login request and checked by the server
    var homePageResult = client.GetAsync("/");
    homePageResult.Result.EnsureSuccessStatusCode();

    var content = new FormUrlEncodedContent(new[]
    {
        //the name of the form values must be the name of <input /> tags of the login form, in this case the tag is <input type="text" name="username">
        new KeyValuePair<string, string>("username", "username"),
        new KeyValuePair<string, string>("password", "password"),
    });
    var loginResult = client.PostAsync("/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559", content).Result;
    loginResult.EnsureSuccessStatusCode();

    //make the subsequent web requests using the same HttpClient object
}

However, many websites uses some javascript loaded form values or even more some captcha controls, and obviously this solution will not work. That might be done as said with a WebBrowser control (by automating the user input on form fields and then the login button click, this link has an example: https://social.msdn.microsoft.com/Forums/vstudio/en-US/0b77ca8c-48ce-4fa8-9367-c7491aa359b0/yahoo-login-via-systemnetsockets-namespace?forum=vbgeneral).

As a general rule to inspect how works the login on your desidered website, use Fiddler: http://www.telerik.com/fiddler: when you click the login button on your website, watch Fiddler and find the login request (usually it's the first request just after you click the "Login" button, and usually is a POST request).

Then inspect the request data (select the request and go to the "Inspectors" - "TextView" tab) and try to replicate the request on your code.

On the left pane there are all the requests intercepted by Fiddler, on the right pane there are the request and response inspectors

On the left pane there are all the requests intercepted by Fiddler, on the right pane there are the request and response inspectors (on top there are request inspectors, on bottom there are the response inspectors)

Edit

Same code with old WebRequest class: http://rextester.com/LLP86817

var cookieContainer = new CookieContainer();

HttpWebRequest request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/");
request.CookieContainer = cookieContainer;
//set the user agent and accept header values, to simulate a real web browser
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";


//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

Console.WriteLine("FIRST RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

request = (HttpWebRequest)HttpWebRequest.Create("https://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559");
//set the cookie container object
request.CookieContainer = cookieContainer;
request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36";
request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";

//set method POST and content type application/x-www-form-urlencoded
request.Method = "POST";
request.ContentType = "application/x-www-form-urlencoded";

//SET AUTOMATIC DECOMPRESSION
request.AutomaticDecompression = DecompressionMethods.Deflate | DecompressionMethods.GZip;

//insert your username and password
string data = string.Format("username={0}&password={1}", "username", "password");
byte[] bytes = System.Text.Encoding.UTF8.GetBytes(data);

request.ContentLength = bytes.Length;

using (Stream dataStream = request.GetRequestStream())
{
    dataStream.Write(bytes, 0, bytes.Length);
    dataStream.Close();
}

Console.WriteLine("LOGIN RESPONSE");
Console.WriteLine();
using (WebResponse response = request.GetResponse())
{
    using (StreamReader sr = new StreamReader(response.GetResponseStream()))
    {
        Console.WriteLine(sr.ReadToEnd());
    }
}

//request = (HttpWebRequest)HttpWebRequest.Create("INTERNAL PROTECTED PAGE ADDRESS");
//After a successful login, you must use the same cookie container for all request
//request.CookieContainer = cookieContainer;

//....
Share:
20,095
derchrome
Author by

derchrome

Updated on July 09, 2022

Comments

  • derchrome
    derchrome almost 2 years

    I would like to scrape some data from the following website:

    http://wttv.click-tt.de/cgi-bin/WebObjects/nuLigaTTDE.woa/wa/teamPortrait?teamtable=1673669&pageState=rueckrunde&championship=SK+Bez.+BB+13%2F14&group=204559#.

    The website contains some data about table tennis. The actual season can be accessed without login the last seasons only with login. For the actual season I have already created some code to get the data out of it and it works fine. I am using the HttpClient from the HtmlAgilityPack. The code look like this:

                HttpClient http = new HttpClient();
                var response = await http.GetByteArrayAsync(website);
                String source = Encoding.GetEncoding("utf-8").GetString(response, 0, response.Length - 1);
                source = WebUtility.HtmlDecode(source);
                HtmlDocument resultat = new HtmlDocument();
                resultat.LoadHtml(source);
    
                Do something to get the relevant data from resultat by scanning the DocumentNodes from resultat...
    

    Now I would like to fetch the data from the website that needs a login. Does anyone has an idea for that how to login to the website and get the data? The login must be done by clicking on "Ergebnishistorie freischalten ..." and then entering the username and passwort.