Scraping data from all pages of an ASP.NET site with AJAX pagination

12,360

Solution 1

In general, to make an ASP.NET web site think that you actually pressed a button (more generally, performed a postback), you need to do the following:

  1. Get the value of every single INPUT and SELECT element on the page. It might not be required in every scenario, but you should always at least get the values of all hidden fields where the name starts with "__" (such as __VIEWSTATE). You don't really need to know what is written in them - just that the value in them has to be sent back to the server unchanged.

  2. Create a POST request to the server. You need to use the classic POST, avoiding any AJAX requests. Using some browser plugins (in Firefox or Chrome) it might be possible to disable XMLHttpRequest so you can then intercept the non-AJAX request with tools like Fiddler.

  3. Add every value from #1 to that POST request. There are only two values you need to overwrite: __EVENTTARGET and __EVENTARGUMENT. Leave those empty unless the link or button you are imitating triggers a postback from its href or onclick handler, like <a href="javascript:__doPostBack('ctl00$login','')">. If it does, parse the values from that call: the first one is the event target (it usually matches the ID of some element on the page), the second is the event argument.

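  4. If you executed the request correctly, you should get back a full HTML page. If you get a partial response instead, check that you did not send the HTTP header that asks for an async result (for ASP.NET AJAX this is typically X-MicrosoftAjax: Delta=true) or the __ASYNCPOST=true form field that the request in the question includes.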
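The steps above can be sketched in PHP roughly as follows. This is a minimal illustration, not a drop-in solution: the sample markup, the control names (ctl00$pager$ctl02 and so on) and the helper names collectFormFields, parseDoPostBack, buildPostbackBody and sendPostback are all made up for the example, and it uses PHP's built-in DOM extension rather than Simple HTML DOM so that it runs on its own.

```php
<?php
// Sketch only: sample markup, control names, and helper names are hypothetical.

// Step 1: collect name => value for every named INPUT and SELECT.
function collectFormFields(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html); // suppress warnings from non-validating markup
    $xpath = new DOMXPath($doc);

    $fields = [];
    foreach ($xpath->query('//input[@name]') as $input) {
        $type = strtolower($input->getAttribute('type'));
        // Buttons are only sent when clicked; unchecked boxes are never sent.
        if (in_array($type, ['submit', 'button', 'image', 'reset', 'file'], true)) {
            continue;
        }
        if (in_array($type, ['checkbox', 'radio'], true) && !$input->hasAttribute('checked')) {
            continue;
        }
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    foreach ($xpath->query('//select[@name]') as $select) {
        // Use the selected option, falling back to the first one as browsers do.
        $option = $xpath->query('.//option[@selected]', $select)->item(0)
            ?? $xpath->query('.//option', $select)->item(0);
        if ($option !== null) {
            $fields[$select->getAttribute('name')] = $option->hasAttribute('value')
                ? $option->getAttribute('value')
                : trim($option->textContent);
        }
    }
    return $fields;
}

// Step 3a: pull the event target and argument out of a __doPostBack('...','...') call.
function parseDoPostBack(string $text): ?array
{
    if (preg_match("/__doPostBack\\('([^']*)',\\s*'([^']*)'\\)/", $text, $m)) {
        return ['target' => $m[1], 'argument' => $m[2]];
    }
    return null;
}

// Step 3b: overwrite the two event fields and encode the body exactly once.
function buildPostbackBody(array $fields, string $target, string $argument): string
{
    $fields['__EVENTTARGET']   = $target;
    $fields['__EVENTARGUMENT'] = $argument;
    return http_build_query($fields); // encodes each name and value for us
}

// Step 2: a classic (non-AJAX) POST. Defined for completeness, not called below.
function sendPostback(string $url, string $body): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body, // already encoded, so no urlencode() here
        CURLOPT_RETURNTRANSFER => true,
    ]);
    $response = curl_exec($ch);
    return $response === false ? '' : $response;
}

// Made-up page standing in for the first response from the server.
$html = <<<'MARKUP'
<form method="post" action="user_list.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDw==" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWAg==" />
  <input type="text" name="ctl00$txtCity" value="Honolulu" />
  <select name="ctl00$cboState"><option value="CA">CA</option><option value="HI" selected="selected">HI</option></select>
  <a href="javascript:__doPostBack('ctl00$pager$ctl02','')">2</a>
</form>
MARKUP;

$fields = collectFormFields($html);
$link   = parseDoPostBack($html); // first __doPostBack call found in the page
$body   = buildPostbackBody($fields, $link['target'], $link['argument']);
echo $body, "\n";
```

Letting http_build_query encode each name and value once avoids calling urlencode() on a string that is already partly encoded. For each further page, the same three steps would be repeated on the HTML returned by the previous request, so that the hidden fields stay current.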

Solution 2

My best advice is to use iMacros https://addons.mozilla.org/en-US/firefox/addon/imacros-for-firefox/

With iMacros:

  1. Record your flow of page downloading. http://wiki.imacros.net/First_Steps
  2. Save web page to local directory. http://wiki.imacros.net/SAVEAS
  3. Scrape emails, addresses, etc. from the saved pages using a PHP script.

This works whether or not the site uses AJAX, and whether it is .aspx, .jsp or .php.
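As a rough idea of what the PHP step could look like, the sketch below pulls email addresses out of the HTML of a saved page. The sample markup and the helper name extractEmails are invented for illustration, and the pattern is a deliberately simple approximation of an email address, not a complete validator.

```php
<?php
// Sketch only: the sample markup and the function name are hypothetical.

function extractEmails(string $html): array
{
    // Simple approximation of an email address; good enough for typical listings.
    preg_match_all('/[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/', $html, $matches);
    // Drop duplicates, e.g. the same address in a mailto: link and in its label.
    return array_values(array_unique($matches[0]));
}

// In practice this string would come from file_get_contents() on a saved page.
$savedPage = '<tr><td>Jane Doe</td><td><a href="mailto:jane@example.com">jane@example.com</a></td></tr>'
           . '<tr><td>John Roe</td><td>john.roe@example.org</td></tr>';

print_r(extractEmails($savedPage));
```

Addresses and other fields that sit in predictable table cells are usually easier to read with a DOM parser than with regular expressions.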

Solution 3

I would recommend branching out into Ruby and trying Capybara, which is a sane way of using Selenium. It lets you visit a page and then examine the actual DOM. You can click on anything, wait for events, etc. It uses a real browser.

visit "http://www.google.com" 
page.find("button[name=btnK]")
Author: Subodh Ghulaxe

Updated on June 03, 2022

Comments

  • Subodh Ghulaxe, almost 2 years ago

    I want to scrape a web page that contains a list of users with addresses, emails, etc. The list is paginated: each page shows 10 users, and clicking the "page 2" link loads the second page of users via AJAX and updates the list, and so on for every pagination link.

    The website is built with ASP.NET (the page has the extension .aspx). I don't know anything about ASP.NET or how it manages pagination and AJAX.

    I am using Simple HTML DOM http://sourceforge.net/projects/simplehtmldom/ to scrape the content.

    For pages with 10 or fewer users, I don't have to simulate the AJAX request that is sent when a user clicks a pagination link.

    But for pages with pagination, I simulate the AJAX POST request to get the data from the other pages:

    require 'simple_html_dom.php';
    
    $html = file_get_html('www.example.com/user_list.aspx');
    
    $viewstate = $html->find("#__VIEWSTATE");
    $viewstate = $viewstate[0]->attr['value'];
    
    $eventvalidation        = $html->find("#__EVENTVALIDATION");
    $eventvalidation        = $eventvalidation[0]->attr['value'];
    $number_of_pageinations = 3;
    
    $pageNumberCodes = array(
        'ctl00$cphMainContent$rdpMembers$ctl01$ctl01',
        'ctl00$cphMainContent$rdpMembers$ctl01$ctl02',
        'ctl00$cphMainContent$rdpMembers$ctl01$ctl03'
    ); // this code is added for each page in POST  as  __EVENTTARGET 
    
    for ($i = 0; $i < $number_of_pageinations; $i++) {
        $options = array(
            CURLOPT_RETURNTRANSFER => true, // return web page
            CURLOPT_HEADER => false, // don't return headers
            CURLOPT_ENCODING => "", // handle all encodings
            CURLOPT_USERAGENT => "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7'", // who am i
            CURLOPT_AUTOREFERER => true, // set referer on redirect
            CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
            CURLOPT_TIMEOUT => 1120, // timeout on response
            CURLOPT_MAXREDIRS => 10, // stop after 10 redirects
            CURLOPT_POST => true,
            CURLOPT_VERBOSE => true,
            CURLOPT_POSTFIELDS => urlencode('ctl00%24scriptManager=ctl00%24cphMainContent%24ctl00%24cphMainContent%24rdpMembersPanel%7C' . $pageNumberCodes[0] . '&__EVENTTARGET=' . $pageNumberCodes[0] . '&__EVENTARGUMENT=' . '&__VIEWSTATE=' . $viewstate . '&__EVENTVALIDATION=' . $eventvalidation . "&google=" . '&ctl00%24cphMainContent%24txtZip=' . '&ctl00%24cphMainContent%24cboRadius=Exact' . '&ctl00%24cphMainContent%24txtMemberName=' . '&ctl00%24cphMainContent%24txtCity=Honolulu' . '&ctl00%24cphMainContent%24cboState=HI' . '&ctl00%24cphMainContent%24txtAddress=' . '&ctl00_cphMainContent_rdpMembers_ClientState=' . '&ctl00%24cphMainContent%24ddList=-Select%20field%20to%20sort-' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_ddList_ClientState=' . '&ctl00_cphMainContent_rdlMembers_ClientState=' . '&ctl00_cphMainContent_rdpMembers1_ClientState=' . '&__ASYNCPOST=true' . 'RadAJAXControlID=ctl00_cphMainContent_RadAjaxManager1')
        );
        $ch      = curl_init($url);
        curl_setopt_array($ch, $options);
        $return = curl_exec($ch);
        curl_close($ch);
        echo $return;
    
        $newHtml = str_get_html($return);
    
        $viewstate = $newHtml->find("#__VIEWSTATE");
        $viewstate = $viewstate[0]->attr['value'];
    
        $eventvalidation = $newHtml->find("#__EVENTVALIDATION");
        $eventvalidation = $eventvalidation[0]->attr['value'];
    }
    

    This should echo the data from the different pages, but it always prints the data of the first page. Can anybody point out where I am wrong and what is missing? I don't know how ASP.NET manages pagination and AJAX requests, or what __EVENTARGUMENT, __VIEWSTATE and __EVENTVALIDATION are.