PySpark: JSON object or file to RDD


You can use Spark SQL's read.json to read the file, like this:

jdf = spark.read.json("path/to/AskReddit.json")

and perform all kinds of SQL-style operations on it, and even RDD-style operations. But the JSON structure is very deeply nested, with no fixed columns that can be derived with something like explode.

You are better off using read.json and working with the DataFrame, like this:

from pyspark.sql.functions import explode

# Pull nested fields up into top-level columns; explode turns each array element into a row
jdf\
.withColumn('after',jdf.data.after)\
.withColumn('before',jdf.data.before)\
.withColumn('modhash',jdf.data.modhash)\
.withColumn('NestedKind',explode(jdf.data.children.kind))\
.withColumn('subreddit',explode(jdf.data.children.data.subreddit))\
.withColumn('clicked',explode(jdf.data.children.data.clicked))\
.show()

+--------------------+-------+---------+------+--------------------+----------+---------+-------+
|                data|   kind|    after|before|             modhash|NestedKind|subreddit|clicked|
+--------------------+-------+---------+------+--------------------+----------+---------+-------+
|[t3_66qv3r,null,W...|Listing|t3_66qv3r|  null|3r7ao0m7qiadae13d...|        t3|AskReddit|  false|
|[t3_66qv3r,null,W...|Listing|t3_66qv3r|  null|3r7ao0m7qiadae13d...|        t3|AskReddit|  false|
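
One thing worth noting about the output above: the two rows shown are identical. A likely reason is that each explode call multiplies the row count by the length of its array, so three separate calls over n children give n × n × n rows rather than n. The sketch below models that in plain Python, with no Spark involved. The `children` list is a made-up stand-in for data.children, and both function names are hypothetical.

```python
from itertools import product

# Hypothetical stand-in for jdf.data.children: two posts that differ in one field.
children = [
    {"kind": "t3", "data": {"subreddit": "AskReddit", "clicked": False}},
    {"kind": "t3", "data": {"subreddit": "AskReddit", "clicked": True}},
]

def explode_each_field_separately(children):
    """Model three independent explodes: every combination of the three arrays."""
    kinds = [c["kind"] for c in children]
    subreddits = [c["data"]["subreddit"] for c in children]
    clicked = [c["data"]["clicked"] for c in children]
    return list(product(kinds, subreddits, clicked))

def explode_children_once(children):
    """Model a single explode over the children array, then pick fields per child."""
    return [(c["kind"], c["data"]["subreddit"], c["data"]["clicked"]) for c in children]

separate_rows = explode_each_field_separately(children)
single_rows = explode_children_once(children)
print(len(separate_rows), len(single_rows))
```

In Spark terms, the second function corresponds to exploding jdf.data.children once into a single column and then selecting kind, data.subreddit and data.clicked from that column, which should keep the fields of each post aligned.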
Author by

dreamin

Updated on June 26, 2022

Comments

  • dreamin, almost 2 years ago

    I am new to PySpark and I have an AskReddit JSON file which I got from this link. I am trying to create an RDD on which I then hope to perform operations such as map and flatMap. I was advised to convert the JSON to the JSON Lines format, but despite installing jsonlines with pip, I am unable to import the package in the PySpark notebook. Below is what I have tried for reading in the JSON.

    In [10]: import json
    
             data = json.load(open("AskReddit.json", "r"))
             jsonrdd = sc.parallelize(data)
             jsonrdd.take(5) 
    
    Out[11]: [u'kind', u'data']
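
The [u'kind', u'data'] result is consistent with how Python iterates a dict: sc.parallelize distributes the elements it gets by iterating its argument, and iterating a dict yields only its keys. A minimal sketch without Spark follows, using a made-up `listing` dict shaped like the structure the answer's code reads (a top-level kind and data, with the posts under data.children).

```python
# Hypothetical stand-in for the parsed AskReddit.json; only the shape matters.
listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_66qv3r",
        "children": [
            {"kind": "t3", "data": {"subreddit": "AskReddit"}},
            {"kind": "t3", "data": {"subreddit": "AskReddit"}},
        ],
    },
}

# Iterating the dict itself yields its keys, which matches the output above.
top_level = list(listing)

# The individual posts live in the nested list, so that is the sequence to distribute.
posts = listing["data"]["children"]

print(top_level)
print(len(posts))
```

Under that assumption about the file's shape, passing data["data"]["children"] to sc.parallelize instead of data should give an RDD with one element per post.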
    

    I also tried the following, which gives me the whole contents of the JSON file when I call jsonrdd.take(1).

    In [6]: jsonrdd = sc.wholeTextFiles("*.json")
            jsonrdd.take(1)
    

    However, I would like to get each JSON object as one line in the RDD. How would I go about this?
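
One possible way to get there without the jsonlines package is to produce JSON Lines with the standard json module, writing one post per line. The sketch below assumes the posts sit under data.children, as the column names in the answer suggest. `to_json_lines`, `write_json_lines` and the `sample` dict are hypothetical names introduced for illustration.

```python
import json

def to_json_lines(listing):
    """Return one compact JSON string per post found under data.children."""
    return [json.dumps(child) for child in listing["data"]["children"]]

def write_json_lines(listing, path):
    """Write the posts to `path`, one JSON object per line."""
    with open(path, "w") as out:
        for line in to_json_lines(listing):
            out.write(line + "\n")

# Small made-up listing standing in for json.load(open("AskReddit.json")).
sample = {
    "kind": "Listing",
    "data": {
        "children": [
            {"kind": "t3", "data": {"subreddit": "AskReddit", "clicked": False}},
            {"kind": "t3", "data": {"subreddit": "AskReddit", "clicked": True}},
        ]
    },
}

lines = to_json_lines(sample)
print(len(lines))
```

A file written this way can be read with sc.textFile followed by .map(json.loads), so that each RDD element is one parsed post ready for map and flatMap.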