How to save a DataFrame to a ".txt" file using PySpark
Solution 1
If you want to write out a text file for a multi-column DataFrame, you have to concatenate the columns yourself, because the text data source accepts only a single string column. In the example below I separate the column values with a space and replace null values with a *:
import pyspark.sql.functions as F

df = sqlContext.createDataFrame([("foo", "bar"), ("baz", None)],
                                ('a', 'b'))

def myConcat(*cols):
    # Build the argument list: col1, " ", col2, " ", ..., colN
    concat_columns = []
    for c in cols[:-1]:
        # Replace nulls with "*" so concat() does not return null for the row
        concat_columns.append(F.coalesce(c, F.lit("*")))
        concat_columns.append(F.lit(" "))
    # The last column gets no trailing separator
    concat_columns.append(F.coalesce(cols[-1], F.lit("*")))
    return F.concat(*concat_columns)

df_text = df.withColumn("combined", myConcat(*df.columns)).select("combined")
df_text.show()
df_text.coalesce(1).write.format("text").option("header", "false").mode("append").save("output.txt")
This gives as output:
+--------+
|combined|
+--------+
| foo bar|
|   baz *|
+--------+
Spark creates output.txt as a directory, and the part file it writes inside should look like this:
foo bar
baz *
Solution 2
You can concatenate the columns with the following line. This assumes you want a positional file rather than a delimited one; for a delimited file you would need a delimiter column between each pair of data columns. Also note that concat returns null for a row if any of its columns is null:
from pyspark.sql.functions import concat
dataFrameWithOnlyOneColumn = dataFrame.select(concat(*dataFrame.columns).alias('data'))
After concatenating the columns, your previous line should work just fine:
dataFrameWithOnlyOneColumn.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")
Alice
Updated on July 05, 2022
Alice over 1 year
I have a DataFrame with 1000+ columns. I need to save this DataFrame as a .txt file (not as .csv) with no header, and the mode should be "append".
I used the command below, which is not working:
df.coalesce(1).write.format("text").option("header", "false").mode("append").save("<path>")
The error I got:
pyspark.sql.utils.AnalysisException: 'Text data source supports only a single column,
Note: I should not use an RDD to save, because I need to save files multiple times to the same path.