Splitting a dataframe into separate CSV files
Solution 1
Why not just groupby Column1
and save each group?
df = df.sort_values('Column1').assign(NewColumn='NewColumnValue')
print(df)
Column1 Column2 NewColumn
0 1 93644 NewColumnValue
5 1 19593 NewColumnValue
6 1 12707 NewColumnValue
1 2 63246 NewColumnValue
7 2 53480 NewColumnValue
2 3 47790 NewColumnValue
3 3 39644 NewColumnValue
4 3 32585 NewColumnValue
for i, g in df.groupby('Column1'):
g.to_csv('{}.csv'.format(i), header=False, index_label=False)
Thanks to Unatiel for the improvement. header=False
will not write headers and index_label=False
will not write an index column.
This creates 3 files:
1.csv
2.csv
3.csv
Each having data corresponding to each Column1
group.
Solution 2
pandas.DataFrame
supports a method to write it's data as a csv to_csv()
. You have no need for csv
module in this case.
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1').set_index('Column1')
df['NewColumn'] = 'NewColumnValue'
for key in df.index.unique():
df.loc[key].to_csv('%d.csv' % int(key), header=False)
for key df.index.unique():
will loop over every unique value in the index. In your example, it will loop over (1, 2 , 3)
. header=False
willmake sure the header isn't written to the output file.
And to explain why you get the wrong output in your example, try print(list(df))
. This should output all the columns in df. This is why for key, rows in csv.reader((df)):
iterates over the columns in df.
Actually, you should get 1 csv for every column in your dataframe, and their contents are likely something like ,[NAME_OF_COLUMN]
or maybe ,<itertools.... object at 0x.....>
.
Solution 3
You don't need to switch to itertools
for the filtering, pandas
has all of the necessary functionality built-in.
# -*- coding: utf-8 -*-
import pandas as pd
df = pd.read_csv('source.csv')
df = df.sort_values('Column1') # Sorting isn't needed
df['NewColumn'] = 'NewColumnValue'
for key in df['Column1'].unique(): # For each value in Column1
# These two steps can be combined into a single call
# I'll separate for clarity:
# 1) filter the dataframe on the unique value
dw = df[df['Column1']==key]
# 2) write the resulting dataframe without headers
dw.to_csv("%s.csv" % key, header=False)
Steve Dallas
Updated on June 15, 2022Comments
-
Steve Dallas almost 2 years
I have a fairly large csv, looking like this:
+---------+---------+ | Column1 | Column2 | +---------+---------+ | 1 | 93644 | | 2 | 63246 | | 3 | 47790 | | 3 | 39644 | | 3 | 32585 | | 1 | 19593 | | 1 | 12707 | | 2 | 53480 | +---------+---------+
My intent is to
- Add a new column
- Insert a specific value into that column, 'NewColumnValue', on each row of the csv
- Sort the file based on the value in Column1
- Split the original CSV into new files based on the contents of 'Column1', removing the header
For example, I want to end up with multiple files that look like:
+---+-------+----------------+ | 1 | 19593 | NewColumnValue | | 1 | 93644 | NewColumnValue | | 1 | 12707 | NewColumnValue | +---+-------+----------------+ +---+-------+-----------------+ | 2 | 63246 | NewColumnValue | | 2 | 53480 | NewColumnValue | +---+-------+-----------------+ +---+-------+-----------------+ | 3 | 47790 | NewColumnValue | | 3 | 39644 | NewColumnValue | | 3 | 32585 | NewColumnValue | +---+-------+-----------------+
I have managed to do this using separate .py files:
Step1
# -*- coding: utf-8 -*- import pandas as pd df = pd.read_csv('source.csv') df = df.sort_values('Column1') df['NewColumn'] = 'NewColumnValue' df.to_csv('ready.csv', index=False, header=False)
Step2
import csv from itertools import groupby for key, rows in groupby(csv.reader(open("ready.csv")), lambda row: row[0]): with open("%s.csv" % key, "w") as output: for row in rows: output.write(",".join(row) + "\n")
But I'd really like to learn how to accomplish everything in a single .py file. I tried this:
# -*- coding: utf-8 -*- #This processes a large CSV file. #It will dd a new column, populate the new column with a uniform piece of data for each row, sort the CSV, and remove headers #Then it will split the single large CSV into multiple CSVs based on the value in column 0 import pandas as pd import csv from itertools import groupby df = pd.read_csv('source.csv') df = df.sort_values('Column1') df['NewColumn'] = 'NewColumnValue' for key, rows in groupby(csv.reader((df)), lambda row: row[0]): with open("%s.csv" % key, "w") as output: for row in rows: output.write(",".join(row) + "\n")
but instead of working as intended, it's giving me multiple CSVs named after each column header.
Is that happening because I removed the header row when I used separate .py files and I'm not doing it here? I'm not really certain what operation I need to do when splitting the files to remove the header.
-
Steve Dallas over 6 yearsThanks for answering. However, I'm getting a syntax error on: df = df.sort_values('Column1').set_index('Column1')
-
Steve Dallas over 6 yearsThanks for answering. It gave me an error on line 12, but after I changed it to "dw.to_csv("%s.csv" % key, header=False) " it worked great.
-
Steve Dallas over 6 yearsThanks. And to answer your question, it was because when I spent today searching google for a solution to the problem pandas kept cropping up. I didn't realize there was another way :)
-
Unatiel over 6 yearsI don't get any syntax error on this, with python3. That being said, @cᴏʟᴅsᴘᴇᴇᴅ's answer is better, you should use it instead.
-
Steve Dallas over 6 yearsI tried this method and it's producing some odd output files. Taking my file 13.csv as an example, it produces a file that has headers and additionally it seems to be inserting a headerless column with data I don't recognize. e.g. +------+---------+---------+----------------+ | | Column1 | Column2 | NewColumn | | 6446 | 13 | 36457 | NewColumnValue | +------+---------+---------+----------------+
-
Unatiel over 6 yearsYou should try
g.to_csv('{}.csv'.format(i), header=False, index_label=False)
.header=False
won't write the headers andindex_label=False
won't write that headerless column (this is the index of your dataframe). -
Unatiel over 6 years@SteveDallas Now that I think about it,
df =
will indeed output a syntax error. I think you've added a new line by mistake afterdf =
. I don't really see where the syntax error would be in my code. I'm a bit tired right now though. -
Steve Dallas over 6 yearsExcellent! I added a 'index=False' and it worked like a charm. Thanks cᴏʟᴅsᴘᴇᴇᴅ and @Unatiel