Split Pandas Series into DataFrame by delimiter

Solution 1

You can use str.split:

df = SR_test.str.split('; ', expand=True)
print(df)

    0   1   2   3   4
0   a   b   c   d   e
1  aa  bb  cc  dd  ee
2  a1  b2  c3  d4  e5

Another, faster solution, if the Series has no NaN values:

print(pd.DataFrame([x.split('; ') for x in SR_test.tolist()]))
    0   1   2   3   4
0   a   b   c   d   e
1  aa  bb  cc  dd  ee
2  a1  b2  c3  d4  e5
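The "no NaN values" condition matters because a missing value is a float, which has no split method. The sketch below illustrates this on a hypothetical variant of the example data (sr_nan is not from the original post) and shows one possible workaround that drops the missing rows and realigns afterwards.

```python
import numpy as np
import pandas as pd

# Hypothetical variant of the example data with one missing value.
sr_nan = pd.Series(["a; b; c", np.nan, "a1; b2; c3"])

# The vectorised str.split keeps the missing row as missing values.
df_vec = sr_nan.str.split("; ", expand=True)

# The list comprehension calls .split on a float and raises AttributeError.
try:
    pd.DataFrame([x.split("; ") for x in sr_nan.tolist()])
    comprehension_failed = False
except AttributeError:
    comprehension_failed = True

# One possible workaround: split only the non-missing rows,
# then reindex so the missing row reappears as NaN.
valid = sr_nan.dropna()
df_list = pd.DataFrame(
    [x.split("; ") for x in valid.tolist()],
    index=valid.index,
).reindex(sr_nan.index)

print(comprehension_failed)
print(df_list)
```

Whether the workaround is still faster than str.split depends on the data, so it is worth timing on the real Series.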

Timings:

SR_test = pd.concat([SR_test]*1000).reset_index(drop=True)

In [21]: %timeit SR_test.str.split('; ', expand=True)
10 loops, best of 3: 34.5 ms per loop

In [22]: %timeit pd.DataFrame([ x.split('; ') for x in SR_test.tolist() ])
100 loops, best of 3: 9.59 ms per loop
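Outside IPython, a comparable measurement can be sketched with the standard-library timeit module. The helper names and the repetition count (number=10) are arbitrary choices for illustration, and the absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

SR_test = pd.Series(["a; b; c; d; e", "aa; bb; cc; dd; ee", "a1; b2; c3; d4; e5"])
SR_big = pd.concat([SR_test] * 1000).reset_index(drop=True)


def split_vectorised(s):
    # Vectorised string method; returns a DataFrame because expand=True.
    return s.str.split("; ", expand=True)


def split_comprehension(s):
    # Plain Python loop over the values; assumes there are no NaN values.
    return pd.DataFrame([x.split("; ") for x in s.tolist()])


# Check that both approaches produce the same values before timing them.
same = bool(
    (split_vectorised(SR_big).to_numpy() == split_comprehension(SR_big).to_numpy()).all()
)

# Total time in seconds for 10 calls of each approach.
t_vec = timeit.timeit(lambda: split_vectorised(SR_big), number=10)
t_list = timeit.timeit(lambda: split_comprehension(SR_big), number=10)
print("same result: %s" % same)
print("str.split: %.4f s, list comprehension: %.4f s" % (t_vec, t_list))
```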

Solution 2

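Use the vectorised str.split with expand=True. The result is already a DataFrame, so passing it as the data argument to the DataFrame constructor is optional. Split on '; ' (including the trailing space) so the values do not keep a leading space:

In [4]:
df = pd.DataFrame(SR_test.str.split('; ', expand=True))
df

Out[4]:
    0   1   2   3   4
0   a   b   c   d   e
1  aa  bb  cc  dd  ee
2  a1  b2  c3  d4  e5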
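If the spacing after the semicolons is not consistent, a regular-expression pattern can absorb it. The sketch below assumes the documented default behaviour of Series.str.split, where a pattern longer than one character is treated as a regular expression. The input sr_messy and the col_N column names are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical input with inconsistent spacing after the semicolons.
sr_messy = pd.Series(["a;b; c;  d; e", "aa; bb;cc; dd; ee"])

# r";\s*" matches a semicolon followed by any amount of whitespace,
# so no value keeps leading spaces.
df = sr_messy.str.split(r";\s*", expand=True)

# Optional: replace the default integer column labels with names.
df.columns = ["col_%d" % i for i in range(df.shape[1])]
print(df)
```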
Author by

O.rka

I am an academic researcher studying machine-learning and microorganisms

Updated on July 31, 2022

Comments

  • O.rka
    O.rka almost 2 years

    I'm trying to split a pandas Series object by a particular delimiter, "; " in this case, and turn it into a DataFrame. There will always be the same number of "columns", or to be more exact, the same number of "; " delimiters marking the columns. I thought this question would do the trick, but it didn't: "python, how to convert a pandas series into a pandas DataFrame?". I don't want to iterate through the Series; I'm sure pandas has a shortcut that is more efficient.

    Does anyone know the most efficient way to split this Series into a DataFrame by "; "?

    #Example Data
    SR_test = pd.Series(["a; b; c; d; e","aa; bb; cc; dd; ee","a1; b2; c3; d4; e5"])
    # print(SR_test)
    # 0         a; b; c; d; e
    # 1    aa; bb; cc; dd; ee
    # 2    a1; b2; c3; d4; e5
    
    #Convert each row one at a time (not efficient)
    tmp = []
    for element in SR_test:
        tmp.append([e.strip() for e in element.split("; ")])
    DF_split = pd.DataFrame(tmp)
    # print(DF_split)
    #     0   1   2   3   4
    # 0   a   b   c   d   e
    # 1  aa  bb  cc  dd  ee
    # 2  a1  b2  c3  d4  e5
    
  • O.rka
    O.rka almost 8 years
    Iterating through is quicker?
  • jezrael
    jezrael almost 8 years
    Yes, if you use it this way. str.split is a bit slower because it also handles NaN values gracefully.