NamedTuple to Dataframe

10,108

pd.DataFrame goes perfectly with namedtuple and actually constructs the columns.

Sample data:

In [21]: Video = namedtuple("Video", "video_id title duration views thumbnail De
    ...: scription")
In [22]: In [20]: pd.DataFrame(data=[Video(1, 'Vid Title', 5, 10, 'Thumb',' Des'
    ...: )])
Out[22]: 
   video_id      title  duration  views thumbnail Description
0         1  Vid Title         5     10     Thumb         Des

Since your function is not actually returning the df and not utilizing it anywhere else in the code, how are you sure that it is empty?

Update

You just need to edit the return of parse_video_div to return a pd.DataFrame and concatenate the list into a single pd.DataFrame in get_videos function.

Here are the edits highlighted.

def parse_video_div(div):


    #####
    return pd.DataFrame(data=[Video(video_id, title, duration, views, thumbnail, Description)])
    # shorter version
    # return pd.DataFrame(data=[l])

def get_videos(username):
    ####
    videos_df = pd.concat(videos, ignore_index=True)
    return videos_df # return the DataFrame

You need a concantenation function in the end. in the parse_page_div, you can return any pd.DataFrame input, let that be dict, pd.Series, namedtuple, or even a list. In this example, I chose a pd.DataFrame to ease things, however, in terms of performance, it can add a few milliseconds to your processing.

Share:
10,108

Related videos on Youtube

Jazz
Author by

Jazz

Updated on October 01, 2022

Comments

  • Jazz
    Jazz about 2 months

    I am working on retrieving metadata from youtube channels and it's videos.

    Everything is going fine, but currently I am struggling to put all the information in a dataframe which I need. Here is the following code which I am using from this github https://gist.github.com/andkamau/0d4e312c97f41a975440a05fd76b1d29

    import urllib.request
    import json
    from bs4 import BeautifulSoup
    from collections import namedtuple
    import pafy
    from pandas import *
    import pandas as pd
    
    df = pd.DataFrame() 
    Video = namedtuple("Video", "video_id title duration views thumbnail Description")
    
    def parse_video_div(div):
        video_id = div.get("data-context-item-id", "")
        title = div.find("a", "yt-uix-tile-link").text
        duration = div.find("span", "video-time").contents[0].text
        views = str(div.find("ul", "yt-lockup-meta-info").contents[0].text.rstrip(" views").replace(",", ""))
        img = div.find("img")
        videoDescription = pafy.new("https://www.youtube.com/watch?v="+video_id)
        thumbnail = "http:" + img.get("src", "") if img else ""
        Description = videoDescription.description
        l = Video(video_id, title, duration, views, thumbnail, Description)
         
         # storing in the dataframe
         
        df = pd.DataFrame(list(Video(video_id, title, duration, views, thumbnail, Description)))
        return Video(video_id, title, duration, views, thumbnail, Description)
    
    def parse_videos_page(page):
        video_divs = page.find_all("div", "yt-lockup-video")
        return [parse_video_div(div) for div in video_divs]
    
    def find_load_more_url(page):
        for button in page.find_all("button"):
            url = button.get("data-uix-load-more-href")
            if url:
                return "http://www.youtube.com" + url
    
    def download_page(url):
        print("Downloading {0}".format(url))
        return urllib.request.urlopen(url).read()
    
    def get_videos(username):
        page_url = "http://www.youtube.com/channel/{0}/videos".format(username)
        page = BeautifulSoup(download_page(page_url))
        videos = parse_videos_page(page)
        page_url = find_load_more_url(page)
        while page_url:
            json_data = json.loads(str(download_page(page_url).decode("utf-8")))
            page = BeautifulSoup(json_data.get("content_html", ""))
            videos.extend(parse_videos_page(page))
            page_url = find_load_more_url(BeautifulSoup(json_data.get("load_more_widget_html", "")))
        return videos
    
    if __name__ == "__main__":
        videos = get_videos("UC-M9eLhclbe16sDaxLzc0ng")
        for video in videos:
            print(video)
        print("{0} videos".format(len(videos)))
    

    The function parse_video_div(div) is having all the information and my dataframe. But unfortunately the dataframe returns nothing. May be I need to loop the namedtuple somehow.

    Any lead on how I can achieve my dataframe to see my data?

  • Jazz
    Jazz over 4 years
    I am struggling to put all the data in one dataframe in the function parse_video_div and return it. So that I can use the dataframe later for further data analysis. @iDrwish
  • iDrwish
    iDrwish over 4 years
    @Jazz Check the updated answer, I believe it addresses your point.
  • Jazz
    Jazz over 4 years
    sorry for the late reply as I was running the changes you suggested. It worked magically!!! Thank you so much! :) I have marked it as a corrected answer :)
  • Energya
    Energya over 2 years
    Thanks for this answer! I've always explicitly used pd.DataFrame.from_records(records, columns=Record._fields), but now I know I can do it shorter. Bit surprised the docs don't specify this as an option, so may submit a PR to pandas to include it more clearly