NamedTuple to Dataframe
pd.DataFrame
goes perfectly with namedtuple
and actually constructs the columns.
Sample data:
In [21]: Video = namedtuple("Video", "video_id title duration views thumbnail De
...: scription")
In [22]: In [20]: pd.DataFrame(data=[Video(1, 'Vid Title', 5, 10, 'Thumb',' Des'
...: )])
Out[22]:
video_id title duration views thumbnail Description
0 1 Vid Title 5 10 Thumb Des
Since your function is not actually returning the df
and not utilizing it anywhere else in the code, how are you sure that it is empty?
Update
You just need to edit the return of parse_video_div
to return a pd.DataFrame
and concatenate the list into a single pd.DataFrame
in get_videos
function.
Here are the edits highlighted.
def parse_video_div(div):
#####
return pd.DataFrame(data=[Video(video_id, title, duration, views, thumbnail, Description)])
# shorter version
# return pd.DataFrame(data=[l])
def get_videos(username):
####
videos_df = pd.concat(videos, ignore_index=True)
return videos_df # return the DataFrame
You need a concantenation function in the end. in the parse_page_div
, you can return any pd.DataFrame
input, let that be dict
, pd.Series
, namedtuple
, or even a list. In this example, I chose a pd.DataFrame
to ease things, however, in terms of performance, it can add a few milliseconds to your processing.
Related videos on Youtube
Jazz
Updated on October 01, 2022Comments
-
Jazz about 1 year
I am working on retrieving metadata from youtube channels and it's videos.
Everything is going fine, but currently I am struggling to put all the information in a
dataframe
which I need. Here is the following code which I am using from thisgithub
https://gist.github.com/andkamau/0d4e312c97f41a975440a05fd76b1d29import urllib.request import json from bs4 import BeautifulSoup from collections import namedtuple import pafy from pandas import * import pandas as pd df = pd.DataFrame() Video = namedtuple("Video", "video_id title duration views thumbnail Description") def parse_video_div(div): video_id = div.get("data-context-item-id", "") title = div.find("a", "yt-uix-tile-link").text duration = div.find("span", "video-time").contents[0].text views = str(div.find("ul", "yt-lockup-meta-info").contents[0].text.rstrip(" views").replace(",", "")) img = div.find("img") videoDescription = pafy.new("https://www.youtube.com/watch?v="+video_id) thumbnail = "http:" + img.get("src", "") if img else "" Description = videoDescription.description l = Video(video_id, title, duration, views, thumbnail, Description) # storing in the dataframe df = pd.DataFrame(list(Video(video_id, title, duration, views, thumbnail, Description))) return Video(video_id, title, duration, views, thumbnail, Description) def parse_videos_page(page): video_divs = page.find_all("div", "yt-lockup-video") return [parse_video_div(div) for div in video_divs] def find_load_more_url(page): for button in page.find_all("button"): url = button.get("data-uix-load-more-href") if url: return "http://www.youtube.com" + url def download_page(url): print("Downloading {0}".format(url)) return urllib.request.urlopen(url).read() def get_videos(username): page_url = "http://www.youtube.com/channel/{0}/videos".format(username) page = BeautifulSoup(download_page(page_url)) videos = parse_videos_page(page) page_url = find_load_more_url(page) while page_url: json_data = json.loads(str(download_page(page_url).decode("utf-8"))) page = BeautifulSoup(json_data.get("content_html", "")) videos.extend(parse_videos_page(page)) page_url = find_load_more_url(BeautifulSoup(json_data.get("load_more_widget_html", ""))) return videos if __name__ == "__main__": videos = get_videos("UC-M9eLhclbe16sDaxLzc0ng") for video in videos: print(video) print("{0} videos".format(len(videos)))
The function
parse_video_div(div)
is having all the information and mydataframe
. But unfortunately thedataframe
returns nothing. May be I need to loop thenamedtuple
somehow.Any lead on how I can achieve my
dataframe
to see my data? -
Jazz over 5 yearsI am struggling to put all the data in one
dataframe
in the functionparse_video_div
and return it. So that I can use thedataframe
later for further data analysis. @iDrwish -
iDrwish over 5 years@Jazz Check the updated answer, I believe it addresses your point.
-
Jazz over 5 yearssorry for the late reply as I was running the changes you suggested. It worked magically!!! Thank you so much! :) I have marked it as a corrected answer :)
-
Energya over 3 yearsThanks for this answer! I've always explicitly used
pd.DataFrame.from_records(records, columns=Record._fields)
, but now I know I can do it shorter. Bit surprised the docs don't specify this as an option, so may submit a PR to pandas to include it more clearly