Scraping hidden elements using BeautifulSoup


The grid items are rendered client-side by the AngularJS JavaScript framework, so they are not present in the static HTML that BeautifulSoup receives. Instead, you can call the underlying web API that supplies the grid-item details.
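To illustrate why a parser comes back empty-handed, here is a minimal sketch using only Python's standard-library `html.parser`. The `STATIC_HTML` string is a made-up stand-in for what the server might send before any JavaScript runs (an assumption, not the real page source), and `TagCounter` is a hypothetical helper. BeautifulSoup parsing the same markup would likewise find no `grid-item` tags.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the server's response before any JavaScript runs:
# the container exists, but AngularJS has not yet filled it with grid items.
STATIC_HTML = '<div class="bigContainer"></div>'


class TagCounter(HTMLParser):
    """Counts how many times each start tag appears in the fed markup."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


counter = TagCounter()
counter.feed(STATIC_HTML)

print("div tags:", counter.counts.get("div", 0))              # container is present
print("grid-item tags:", counter.counts.get("grid-item", 0))  # nothing rendered yet
```

The container is found but no grid items are, which matches the symptom of getting the outer div and an empty list for everything inside it.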

One option would be to use Selenium to render the page and extract the data, but identifying the web API is quite simple with the browser's developer tools.

EDIT: I use the Firebug add-on for Firefox to see the GET requests in its "Net" tab.


The GET request for the page is:

https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2
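It can help to break that URL into its parts before calling it. The sketch below uses only the standard library and the URL exactly as captured. Reading `page_count` and `items_per_page` as paging controls is an assumption based on their names, not something documented here.

```python
from urllib.parse import urlsplit, parse_qs

URL = (
    "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones"
    "?page_count=1&items_per_page=30&resolution=960x720&quality=high"
    "&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"
)

parts = urlsplit(URL)

# parse_qs returns a list per key; each key appears once here, so keep the first value.
params = {key: values[0] for key, values in parse_qs(parts.query).items()}

print("host:", parts.netloc)
print("path:", parts.path)
for key, value in params.items():
    print(f"{key} = {value}")
```

The `callback` parameter is the one to note: its value is the name of the function that wraps the returned data.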

It returned a JavaScript callback script (a JSONP response) whose body was almost entirely JSON data.
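A response of this kind has the shape `callbackName(<json>);`. As a rough sketch, the wrapper can be removed with a regular expression before handing the rest to `json.loads`. The helper name `strip_jsonp` and the shortened `sample` body are made up for illustration.

```python
import json
import re

# callbackName( ...payload... ) with an optional trailing semicolon.
# The greedy (.*) runs to the last ")" so parentheses inside the payload are kept.
JSONP_PATTERN = re.compile(r"\s*[\w.$]+\s*\((.*)\)\s*;?\s*", re.DOTALL)


def strip_jsonp(text):
    """Return the JSON payload of a callback(...) wrapper as Python data."""
    match = JSONP_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("response does not look like a JSONP callback")
    return json.loads(match.group(1))


# Hypothetical, shortened response body for illustration only.
sample = 'angular.callbacks._3({"grid_layout": [{"name": "Samsung Galaxy Z1 (Black)"}]});'
data = strip_jsonp(sample)
print(data["grid_layout"][0]["name"])
```

Matching up to the last closing parenthesis matters because product names such as "Samsung Galaxy Z1 (Black)" contain parentheses themselves.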

The JSON contained the details for the grid items.

Each grid item was described as a JSON object like the one below:

{
        "product_id": 23491960,
        "complex_product_id": 7287171,
        "name": "Samsung Galaxy Z1 (Black)",
        "short_desc": "",
        "bullet_points": {
            "salient_feature": ["Screen: 10.16 cm (4\")", "Camera: 3.1 MP Rear/VGA Front", "RAM: 768 MB", "ROM: 4 GB", "Dual-core 1.2 GHz Cortex-A7", "Battery: 1500 mAh/Li-Ion"]
        },
        "url": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "seourl": "https://catalog.paytm.com/v1/p/samsung-z1-black-MOBSAMSUNG-Z1-BSMAR2320696B3C745",
        "url_type": "product",
        "promo_text": null,
        "image_url": "https://assetscdn.paytm.com/images/catalog/product/M/MO/MOBSAMSUNG-Z1-BSMAR2320696B3C745/2.jpg",
        "vertical_id": 18,
        "vertical_label": "Mobile",
        "offer_price": 5090,
        "actual_price": 5799,
        "merchant_name": "SMARTBUY",
        "authorised_merchant": false,
        "stock": true,
        "brand": "Samsung",
        "tag": "+5% Cashback",
        "product_tag": "+5% Cashback",
        "shippable": true,
        "created_at": "2015-09-17T08:28:25.000Z",
        "updated_at": "2015-12-29T05:55:29.000Z",
        "img_width": 400,
        "img_height": 400,
        "discount": "12"
    }
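To show how such an object maps onto Python types, here is a small sketch that loads a trimmed copy of the item above (values copied from the sample, other fields omitted). Treating `discount` as the percentage saved relative to `actual_price` is an assumption; it happens to agree with this one sample.

```python
import json

# One grid item, trimmed to the fields used below.
item_json = """
{
    "name": "Samsung Galaxy Z1 (Black)",
    "brand": "Samsung",
    "offer_price": 5090,
    "actual_price": 5799,
    "stock": true,
    "promo_text": null,
    "discount": "12"
}
"""

item = json.loads(item_json)

# JSON true/false/null become Python True/False/None; note "discount" is a string.
print(type(item["stock"]).__name__, type(item["discount"]).__name__, item["promo_text"])

# Assumption: "discount" is the rounded percentage saved relative to "actual_price".
saved = item["actual_price"] - item["offer_price"]
percent = round(100 * saved / item["actual_price"])
print("saved:", saved, "percent:", percent)
```

Prices arrive as numbers, so they can be used in arithmetic directly, while `discount` would need `int(...)` before any comparison.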

So you can get the details without using BeautifulSoup at all, in the following way:

import json

import requests

# Web API endpoint found in the browser's network tab; "callback" names the wrapper function.
url = "https://catalog.paytm.com/v1//g/electronics/mobile-accessories/mobiles/smart-phones?page_count=1&items_per_page=30&resolution=960x720&quality=high&sort_popular=1&cat_tree=1&callback=angular.callbacks._3&channel=web&version=2"

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors

# Strip the "angular.callbacks._3(...);" wrapper: keep the text between the
# first "(" and the last ")", so a ");" inside the data cannot truncate it.
json_text = response.text.partition("(")[2].rpartition(")")[0]
data = json.loads(json_text)

# Each entry in "grid_layout" describes one grid item.
for grid_item in data["grid_layout"]:
    print("Brand:", grid_item["brand"])
    print("Product Name:", grid_item["name"])
    print("Current Price: Rs", grid_item["offer_price"])
    print("==================")

You would get output like:

Brand: Samsung
Product Name: Samsung Galaxy Z1 (Black)
Current Price: Rs 4990
==================
Brand: Samsung
Product Name: Samsung Galaxy A7 (Gold)
Current Price: Rs 22947
==================

Hope this helps.

Author by

enterML

Data Scientist,machine learning and deep learning engineer.

Updated on June 04, 2022

Comments

  • enterML (almost 2 years)

    I was trying to scrape data from a website for my project, but the tags I see in the developer toolbar do not appear in my output. The following is a snapshot of the DOM from which I wanted to scrape the data:

    <div class="bigContainer">
          <!-- ngIf: products.grid_layout.length > 0 --><div ng-if="products.grid_layout.length > 0">
            <div class="fl">
              <!-- ngRepeat: product in products.grid_layout --><!-- ngIf: $index%3==0 -->
              <div ng-repeat="product in products.grid_layout" ng-if="$index%3==0" class="GridItems">
              <grid-item product="product" gakey="ga_key" idx="$index" ancestors="products.ancestors" is-search-item="isSearchItem" is-filter="isFilter">
                  <a ng-href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" ng-click="searchProductTrack(product, idx+1)" tabindex="0" href="/shop/p/nokia-lumia-930-black-MOBNOKIA-LUMIA-SRI-673652FB190B4?psearch=organic|undefined|lumia 930|grid" class="" style="">
               </grid-item>   
    

    I am able to get the div tag with class "bigContainer", but I am not able to scrape the tags within it. For example, if I try to get the grid-item tag, I get an empty list, which suggests that no such tag exists. Why is this happening? Please help!