Pandas expand rows from list data available in column
Solution 1
DataFrame.explode
Since pandas >= 0.25.0 there is the explode method, which expands a list into one row per element and repeats the remaining columns:
df.explode('column1').reset_index(drop=True)
Output
column1 column2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
6 g 3
7 h 3
8 i 3
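For reference, a self-contained run of the snippet above (the sample frame uses the column names from the question at the bottom of the page):

```python
import pandas as pd

# Sample frame from the question: lists in column1, scalars in column2
df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

# explode gives one row per list element; reset_index rebuilds the 0..8 index
out = df.explode('column1').reset_index(drop=True)
print(out)
```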
Since pandas >= 1.1.0 there is the ignore_index argument, so we don't have to chain with reset_index:
df.explode('column1', ignore_index=True)
Output
column1 column2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
6 g 3
7 h 3
8 i 3
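A caveat worth knowing: if any of the lists are empty, explode keeps the row and fills column1 with NaN rather than dropping it. A small sketch of filtering those rows out (the dropna step is my addition, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b'], [], ['c']],
                   'column2': [1, 2, 3]})

# the empty list yields a row with column1 == NaN
exploded = df.explode('column1', ignore_index=True)

# drop the placeholder rows if they are unwanted
out = exploded.dropna(subset=['column1']).reset_index(drop=True)
print(out)
```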
Solution 2
You can create a DataFrame with the constructor and stack:
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(level=1, drop=True)
         .reset_index(name='column1')[['column1','column2']])
print (df2)
column1 column2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
6 g 3
7 h 3
8 i 3
Since the subset [['column1','column2']] fixes the column order anyway, you can also omit the first reset_index:
df2 = (pd.DataFrame(df.column1.tolist(), index=df.column2)
         .stack()
         .reset_index(name='column1')[['column1','column2']])
print (df2)
column1 column2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
6 g 3
7 h 3
8 i 3
Another solution uses DataFrame.from_records to create a DataFrame from the first column, then creates a Series with stack and joins it to the original DataFrame:
df = pd.DataFrame({'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
'column2':[1,2,3]})
a = (pd.DataFrame.from_records(df.column1.tolist())
       .stack()
       .reset_index(level=1, drop=True)
       .rename('column1'))
print (a)
0 a
0 b
0 c
1 d
1 e
1 f
2 g
2 h
2 i
Name: column1, dtype: object
print (df.drop('column1', axis=1)
         .join(a)
         .reset_index(drop=True)[['column1','column2']])
column1 column2
0 a 1
1 b 1
2 c 1
3 d 2
4 e 2
5 f 2
6 g 3
7 h 3
8 i 3
Solution 3
Another solution is to use the result_type='expand' argument of DataFrame.apply, available since pandas 0.23. Answering @splinter's question, this method can be generalized; see below:
import pandas as pd
from numpy import arange

df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3]}
)

pd.melt(
    df.drop(columns=['column1']).join(
        df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    ),
    id_vars=['column2'], value_vars=arange(df['column1'].str.len().max()),
    value_name='column1'
).drop(columns=['variable']).sort_values('column2', ignore_index=True)[['column1','column2']]
# can be generalized
df = pd.DataFrame(
    {'column1': [['a','b','c'],['d','e','f'],['g','h','i']],
     'column2': [1,2,3],
     'column3': [[1,2],[2,3],[3,4]],
     'column4': [42,23,321],
     'column5': ['a','b','c']}
)

(pd.melt(
    df.drop(columns=['column1']).join(
        df.apply(lambda row: row['column1'], axis=1, result_type='expand')
    ),
    id_vars=df.columns[1:], value_name='column1')
 .drop(columns=['variable'])[list(df.columns)]
 .sort_values(by=['column1']))
UPDATE (for Jwely's comment): if you have lists with varying length, you can pad them with None up to the longest list:

df = pd.DataFrame(
    {'column1' : [['a','b','c'],['d','f'],['g','h','i']],
     'column2': [1,2,3]}
)

longest = df['column1'].str.len().max()
pd.melt(
    df.drop(columns=['column1']).join(
        df.apply(lambda row: row['column1'] + [None] * (longest - len(row['column1'])),
                 axis=1, result_type='expand')
    ),
    id_vars=['column2'], value_name='column1'
).drop(columns=['variable']).dropna(subset=['column1'])[['column1','column2']]
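The varying-length case can also be handled without apply or padding at all. A sketch (my addition, not part of the original answers) that repeats each column2 value once per list element and flattens the lists with numpy:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'column1': [['a', 'b', 'c'], ['d', 'f'], ['g', 'h', 'i']],
                   'column2': [1, 2, 3]})

lens = df['column1'].str.len()          # elements per row: 3, 2, 3
out = pd.DataFrame({
    'column1': np.concatenate(df['column1'].to_numpy()),  # flatten the lists
    'column2': df['column2'].to_numpy().repeat(lens),     # repeat to match
})
print(out)
```

This stays fully vectorized, so it also works on pandas < 0.25.0 where explode is unavailable.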
Updated on July 09, 2022
Comments
-
Sanjay Yadav almost 2 years: I have a data frame like this in pandas:
column1 column2
[a,b,c] 1
[d,e,f] 2
[g,h,i] 3
Expected output:
column1 column2
a 1
b 1
c 1
d 2
e 2
f 2
g 3
h 3
i 3
How can I process this data?
-
Kris over 7 years: I like your second solution
-
jezrael over 7 years: @Kris - thank you, it is better, so it is now first.
-
splinter almost 6 years: Would the same approach work if we had more than one column2? That is, if we have many columns while keeping column1 as in the question?
-
bencekd over 5 years: @splinter: see my answer about a method with pandas 0.23; I've included a generalization: stackoverflow.com/a/53570916/5356704
-
Jwely almost 5 years: I believe this solution requires every list in "column1" to be of the same length, 3 in this case.
-
bencekd almost 5 years: I think the question was about lists with the same length in the first column, but with slight modifications you can handle different list lengths; see my edit.
-
Henry Henrinson almost 5 years: In typical pandas fashion, this fails if the column consists of empty lists. Perfect.
-
Shiva Rama Krishna over 4 years: If you are using pandas < 0.25.0, I made a patch to make it run: gist.github.com/BurakaKrishna/538cdad998247b95f9b2898015360a8e
-
Erfan over 4 years: I see yours is using a lot of for loops; I would not advise people to use that approach. There are better vectorized alternatives for pandas < 0.25.0, @ShivaRamaKrishna.
-
topher217 almost 3 years: Is there a good way to do this without lists as your index? For example, say I have two dataframes, one with Timestamps with second accuracy and another with only minute accuracy. I want to expand the one with minute accuracy by duplicating all values 60 times so I can merge them. I guess I could create a new index with a list of length 60 in each row and use this explode method, but I wondered if there is a more pandas way of doing this.
-
Erfan almost 3 years: That looks like a resample from minute to second problem, not explode per se, @topher217.
-
topher217 almost 3 years: @Erfan perfect! Yes, I knew there had to be something. resample with pad or bfill looks like a great way to get this done. Thanks!
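The resample approach from the last two comments can be sketched as follows (a minimal illustration with made-up data; ffill is the modern spelling of pad):

```python
import pandas as pd

# Hypothetical minute-resolution series
idx = pd.date_range('2022-01-01 00:00', periods=3, freq='min')
s = pd.Series([10, 20, 30], index=idx)

# Upsample to one row per second; each minute's value is carried forward
per_second = s.resample('s').ffill()
print(per_second.head())
```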