PySpark - converting a JSON string to a DataFrame


spark.read.json treats a plain string argument as a file path, so pass the JSON string as an RDD of strings instead:

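# spark (SparkSession) and sc (SparkContext) are assumed to exist already, as in the question's notebook
newJson = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'
# Wrap the string in an RDD so it is parsed as JSON data rather than opened as a path
df = spark.read.json(sc.parallelize([newJson]))
df.show(truncate=False)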

which should give

+------+-----------+------+---------+-------------------------+
|Author|BlogEntries|Caller|Name     |Url                      |
+------+-----------+------+---------+-------------------------+
|jangcy|100        |jangcy|something|https://stackoverflow.com|
+------+-----------+------+---------+-------------------------+
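
If only a handful of JSON strings are involved, another option is to parse them on the driver and hand the resulting records to spark.createDataFrame. The following is a minimal sketch under that assumption, not part of the original answer. The helper names json_strings_to_records and records_to_dataframe are hypothetical, and spark is assumed to be an existing SparkSession.

```python
import json

new_json = '{"Name":"something","Url":"https://stackoverflow.com","Author":"jangcy","BlogEntries":100,"Caller":"jangcy"}'


def json_strings_to_records(json_strings):
    """Parse each JSON string into a plain dict (one dict per future row)."""
    return [json.loads(s) for s in json_strings]


def records_to_dataframe(spark, records):
    """Build a DataFrame from parsed records.

    `spark` is assumed to be an existing SparkSession. pyspark is imported
    here so that the parsing step above works without Spark installed.
    """
    from pyspark.sql import Row

    return spark.createDataFrame([Row(**record) for record in records])


# Parsing happens in plain Python on the driver.
records = json_strings_to_records([new_json])
print(records[0]["Name"], records[0]["BlogEntries"])
```

In a notebook with a live session, records_to_dataframe(spark, records).show(truncate=False) should display the same single row. Column order may differ between Spark versions, and for large inputs the RDD approach above is likely the better fit because the parsing is distributed.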
Author by

Jangcy

Updated on July 22, 2021

Comments

  • Jangcy
    Jangcy almost 3 years

    I have a test2.json file that contains simple JSON:

    {  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}
    

    I have uploaded my file to blob storage and I create a DataFrame from it:

    df = spark.read.json("/example/data/test2.json")
    

    then I can see it without any problems:

    df.show()
    +------+-----------+------+---------+--------------------+
    |Author|BlogEntries|Caller|     Name|                 Url|
    +------+-----------+------+---------+--------------------+
    |jangcy|        100|jangcy|something|https://stackover...|
    +------+-----------+------+---------+--------------------+
    

    Second scenario: I have exactly the same JSON string declared within my notebook:

    newJson = '{  "Name": "something",  "Url": "https://stackoverflow.com",  "Author": "jangcy",  "BlogEntries": 100,  "Caller": "jangcy"}'
    

    I can print it without problems, but when I try to create a DataFrame from it:

    df = spark.read.json(newJson)
    

    I get the 'Relative path in absolute URI' error:

    'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
    Traceback (most recent call last):
      File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 249, in json
        return self._df(self._jreader.json(self._spark._sc._jvm.PythonUtils.toSeq(path)))
      File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 79, in deco
        raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
    pyspark.sql.utils.IllegalArgumentException: 'java.net.URISyntaxException: Relative path in absolute URI: {  "Name":%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22,%20%20%22Author%22:%20%22jangcy%22,%20%20%22BlogEntries%22:%20100,%20%20%22Caller%22:%20%22jangcy%22%7D'
    

    Should I apply additional transformations to the newJson string? If yes, what should they be? Please forgive me if this is too trivial, as I am very new to Python and Spark.

    I am using Jupyter notebook with PySpark3 Kernel.

    Thanks in advance.
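
The percent-encoded text in the traceback above appears to be the JSON string itself, which suggests that spark.read.json interpreted the string argument as a file path rather than as data. As a small illustration using only the Python standard library, decoding a fragment copied from the error message gives back part of the original JSON. The variable names here are illustrative only.

```python
from urllib.parse import unquote

# Fragment copied verbatim from the URISyntaxException message in the traceback.
encoded_fragment = "%20%22something%22,%20%20%22Url%22:%20%22https:/stackoverflow.com%22"

# %20 decodes to a space and %22 to a double quote.
decoded_fragment = unquote(encoded_fragment)
print(decoded_fragment)
```

The decoded text matches the declared string except for the single slash after "https:", which is probably a side effect of path normalization. This is consistent with the string having been handled as a path, and it is why passing an RDD of strings, as in the answer at the top, avoids the error.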