Saving as Text in Spark 1.3.0 using DataFrames in Scala


Solution 1

You can use this:

teenagers.rdd.saveAsTextFile("/user/me/out")
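Note that teenagers.rdd is an RDD[Row], so saveAsTextFile writes each Row's default toString, which produces lines like "[Michael]". A minimal sketch, assuming the single-column teenagers DataFrame from the question, that maps each Row to a plain string first:

```scala
// teenagers.rdd is an RDD[Row]; calling saveAsTextFile on it directly
// writes Row.toString, e.g. "[Michael]". Extract the column value first
// to control the text output:
teenagers.rdd
  .map(row => row.getString(0)) // the single "name" column from the question
  .saveAsTextFile("/user/me/out")
```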

Solution 2

First off, you should consider whether you really need to save the DataFrame as text. Because a DataFrame holds data by columns (not by rows like an RDD), the .rdd operation is costly: the data has to be reorganized into rows. Parquet is a columnar format and is much more efficient to use.
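If Parquet does fit your use case, saving it in Spark 1.3 is a one-liner. A sketch, assuming the teenagers DataFrame from the question (the output path is hypothetical):

```scala
// In Spark 1.3, save() defaults to the parquet format, and there is also
// an explicit helper; either of these writes Parquet files:
teenagers.save("/user/me/out.parquet")
teenagers.saveAsParquetFile("/user/me/out.parquet")
```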

That being said, sometimes you really do need to save as a text file.

As far as I know, DataFrame out of the box won't let you save as a text file. If you look at the source code, you'll see that 4 formats are supported:

jdbc
json
parquet
orc

so your options are either using df.rdd.saveAsTextFile as suggested above, or using spark-csv, which will allow you to do something like:

Spark 1.4+:

val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("cars.csv")
df.select("year", "model").write.format("com.databricks.spark.csv").save("newcars.csv")

Spark 1.3:

val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "cars.csv", "header" -> "true"))
df.select("year", "model").save("newcars.csv", "com.databricks.spark.csv")

with the added value of handling the annoying parts of quoting and escaping the strings.

Solution 3

If you look at the migration guide https://spark.apache.org/docs/latest/sql-programming-guide.html#upgrading-from-spark-sql-10-12-to-13, you can see that

[...] DataFrames no longer inherit from RDD directly [...]

You can still use saveAsTextFile if you call the .rdd method to get an RDD[Row].
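For multiple columns, you can flatten each Row yourself before saving. A sketch, assuming the teenagers DataFrame from the question (note this does no CSV quoting or escaping, which is exactly what spark-csv handles for you):

```scala
// Go through .rdd to get an RDD[Row], then format each Row as one line.
// Row.mkString joins all column values with the given separator.
teenagers.rdd
  .map(row => row.mkString(","))
  .saveAsTextFile("/user/me/out")
```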

Solution 4

In Python, to get a CSV (no header) from a DataFrame df:

df.rdd.map(lambda r: ";".join([str(c) for c in r])).saveAsTextFile(outfilepath)

There is also an extension developed by Databricks: spark-csv

Cf. https://github.com/databricks/spark-csv

Updated on November 14, 2020

Comments

  • Admin
    Admin over 3 years

    I am using Spark version 1.3.0 and using DataFrames with Spark SQL in Scala. In version 1.2.0 there was a method called saveAsText. In version 1.3.0 using DataFrames there is only a save method, and the default output is parquet.
    How can I specify that the output should be TEXT using the save method?

    // sc is an existing SparkContext.
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    // this is used to implicitly convert an RDD to a DataFrame.
    import sqlContext.implicits._
    
    // Define the schema using a case class.
    // Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
    // you can use custom classes that implement the Product interface.
    case class Person(name: String, age: Int)
    
    // Create an RDD of Person objects and register it as a table.
    val people = sc.textFile("examples/src/main/resources/people.txt")
      .map(_.split(","))
      .map(p => Person(p(0), p(1).trim.toInt))
      .toDF()
    people.registerTempTable("people")
    
    // SQL statements can be run by using the sql methods provided by sqlContext.
    val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
    
    teenagers.save("/user/me/out")
    
  • Admin
    Admin about 9 years
    Thanks, but is there any option to save as text using the save method? I have not been able to find much documentation. The default is to save as parquet.
  • Dylan Hogg
    Dylan Hogg over 7 years
    Note that the Spark 1.3 method is deprecated and will be removed in Spark 2.0
  • arun
    arun over 6 years
    This will write one Row per line in the output file. You may need to use map to convert Row objects into csv before saving as text files.