How to print out snippets of an RDD in the spark-shell / pyspark?


I found out how to do this (here) and thought it would be useful for other users, so I'm sharing it here. take(x) selects the first x items, and foreach prints each of them:

scala> val readmeFile = sc.textFile("input/tmp/README.md")
scala> readmeFile.take(5).foreach(println)
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a

and ...

scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
scala> linesContainingSpark.take(5).foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming

The examples below are the equivalent but using pyspark:

>>> readmeFile = sc.textFile("input/tmp/README.md")
>>> for line in readmeFile.take(5): print(line)
... 
# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, and Python, and an optimized engine that
supports general computation graphs for data analysis. It also supports a

and

>>> linesContainingSpark = readmeFile.filter(lambda line: "Spark" in line)
>>> for line in linesContainingSpark.take(5): print(line)
... 
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and structured
and Spark Streaming.
You can find the latest Spark documentation, including a programming
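If you just want to see what take and filter are doing, the same logic can be sketched on a plain Python list, with no Spark cluster needed. This is only an analogy of the semantics (a real RDD is distributed and lazily evaluated, a list is neither), and the sample lines below are made up for illustration:

```python
# Plain-Python sketch of the RDD inspection pattern above (illustration only).
readme_lines = [
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system for Big Data.",
    "It provides high-level APIs in Scala, Java, and Python.",
]

# rdd.take(5) -> the first (up to) 5 elements, brought back as a local list
head = readme_lines[:5]
for line in head:
    print(line)

# rdd.filter(lambda line: "Spark" in line) -> keep only the matching lines
lines_containing_spark = [line for line in readme_lines if "Spark" in line]

# rdd.filter(...).take(5) -> first 5 matches
print(lines_containing_spark[:5])
```

The key difference from Spark is that the list comprehension runs immediately, whereas an RDD filter does nothing until an action such as take forces it.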
Author: Chris Snow (Data Architecture|Engineering|Science)

Updated on July 03, 2022
Comments

  • Chris Snow (almost 2 years ago)

    When working in the spark-shell, I frequently want to inspect RDDs (similar to using head in unix).

    For example:

    scala> val readmeFile = sc.textFile("input/tmp/README.md")
    scala> // how to inspect the readmeFile?
    

    and ...

    scala> val linesContainingSpark = readmeFile.filter(line => line.contains("Spark"))
    scala> // how to inspect linesContainingSpark?
    
  • lrnzcig (almost 9 years ago)
    You've probably realized this already: take(5) is indeed like head in unix, and the filter in your question is more like grep. However, the filter alone did not give you any results because filter is a transformation and nothing is computed until you collect the results with an action; the easiest fix is to add a take after the filter.
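The laziness lrnzcig describes can be sketched with Python generators, which behave similarly: building the pipeline does no work, and only consuming elements (the analogue of a Spark action) drives the computation. This is a pure-Python analogy, not the PySpark API; the sample lines and the log list are made up for illustration:

```python
import itertools

# Record when lines are actually read, to show when work happens.
log = []

def read_lines():
    for line in ["# Apache Spark", "fast engine", "Spark Streaming"]:
        log.append("read " + line)  # runs only when the generator is consumed
        yield line

# "Transformation": builds the pipeline, reads nothing yet (like rdd.filter).
matches = (line for line in read_lines() if "Spark" in line)
print(log)  # [] -- no lines have been read so far

# "Action": taking elements finally drives the computation (like rdd.take(2)).
first_two = list(itertools.islice(matches, 2))
print(first_two)
print(log)  # now all three lines have been read
```

In the same way, `readmeFile.filter(...)` by itself only describes a computation; `take`, `collect`, or `count` is what actually runs it.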