How to form tuple column from two columns in Pandas

python dataframe pandas tuples

159,087

Solution 1

Get comfortable with zip. It comes in handy when dealing with column data.

df['new_col'] = list(zip(df.lat, df.long))

It's simpler and faster than using apply or map. Something like np.dstack is about twice as fast as zip, but it returns an array rather than tuples.
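As a minimal, self-contained sketch of the zip approach (the three-row frame and the column name lat_long are made up for illustration, not taken from the question's data):

```python
import pandas as pd

# Small illustrative frame; the values are arbitrary.
df = pd.DataFrame({"lat": [0.48, 0.50, 2.12], "long": [-0.63, 1.05, -2.44]})

# zip pairs the two columns element by element; list() materialises the
# lazy zip object so pandas receives one tuple per row.
df["lat_long"] = list(zip(df["lat"], df["long"]))

print(df["lat_long"].tolist())
```

Bracket access (df["long"]) is used here instead of attribute access only as a stylistic choice; either should behave the same for these column names.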

Solution 2

In [10]: df
Out[10]:
          A         B       lat      long
0  1.428987  0.614405  0.484370 -0.628298
1 -0.485747  0.275096  0.497116  1.047605
2  0.822527  0.340689  2.120676 -2.436831
3  0.384719 -0.042070  1.426703 -0.634355
4 -0.937442  2.520756 -1.662615 -1.377490
5 -0.154816  0.617671 -0.090484 -0.191906
6 -0.705177 -1.086138 -0.629708  1.332853
7  0.637496 -0.643773 -0.492668 -0.777344
8  1.109497 -0.610165  0.260325  2.533383
9 -1.224584  0.117668  1.304369 -0.152561

In [11]: df['lat_long'] = df[['lat', 'long']].apply(tuple, axis=1)

In [12]: df
Out[12]:
          A         B       lat      long                             lat_long
0  1.428987  0.614405  0.484370 -0.628298      (0.484370195967, -0.6282975278)
1 -0.485747  0.275096  0.497116  1.047605      (0.497115615839, 1.04760475074)
2  0.822527  0.340689  2.120676 -2.436831      (2.12067574274, -2.43683074367)
3  0.384719 -0.042070  1.426703 -0.634355      (1.42670326172, -0.63435462504)
4 -0.937442  2.520756 -1.662615 -1.377490     (-1.66261469102, -1.37749004179)
5 -0.154816  0.617671 -0.090484 -0.191906  (-0.0904840623396, -0.191905582481)
6 -0.705177 -1.086138 -0.629708  1.332853     (-0.629707821728, 1.33285348929)
7  0.637496 -0.643773 -0.492668 -0.777344   (-0.492667604075, -0.777344111021)
8  1.109497 -0.610165  0.260325  2.533383        (0.26032456699, 2.5333825651)
9 -1.224584  0.117668  1.304369 -0.152561     (1.30436900612, -0.152560909725)

Solution 3

Pandas has the itertuples method to do exactly this:

list(df[['lat', 'long']].itertuples(index=False, name=None))
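A short sketch of how that list might be assigned back as a new column. The sample frame and the lat_long column name are assumptions for illustration:

```python
import pandas as pd

# Illustrative frame with one extra column that is left out of the tuples.
df = pd.DataFrame({"A": [1, 2], "lat": [0.5, 1.5], "long": [-0.5, 2.5]})

# index=False drops the index from each row; name=None yields plain
# tuples instead of namedtuples.
df["lat_long"] = list(df[["lat", "long"]].itertuples(index=False, name=None))

print(df)
```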

Solution 4

You should try using df.to_records(index=False) (to_records is a DataFrame method, not a top-level pd function):

import pandas as pd
df = pd.DataFrame({'language': ['en', 'ar', 'es'], 'greeting': ['Hi', 'اهلا', 'Hola']})
df

  language greeting
0       en       Hi
1       ar     اهلا
2       es     Hola

df['list_of_tuples'] = list(df[['language', 'greeting']].to_records(index=False))
df['list_of_tuples']

0    [en, Hi]
1    [ar, اهلا]
2    [es, Hola]

enjoy!
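One caveat, offered as a hedged note: iterating over the result of to_records appears to yield NumPy record objects rather than plain Python tuples, which may be why the output above prints with square brackets. If real tuples are needed, calling .tolist() on the record array is one way to get them; this is a sketch, not part of the original answer:

```python
import pandas as pd

df = pd.DataFrame({"language": ["en", "ar", "es"], "greeting": ["Hi", "اهلا", "Hola"]})

# to_records returns a NumPy record array; .tolist() converts each
# record into an ordinary Python tuple.
records = df[["language", "greeting"]].to_records(index=False)
df["list_of_tuples"] = records.tolist()

first = df["list_of_tuples"].iloc[0]
print(type(first).__name__, first)
```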

Solution 5

I'd like to add df.values.tolist(), as long as you don't mind getting a column of lists rather than tuples.

import pandas as pd
import numpy as np

size = int(1e+07)
df = pd.DataFrame({'a': np.random.rand(size), 'b': np.random.rand(size)}) 

%timeit df.values.tolist()
1.47 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list(zip(df.a,df.b))
1.92 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


elksie5000

Macmillan Cancer Support

Updated on August 07, 2022

Comments

elksie5000 over 1 year

I've got a Pandas DataFrame and I want to combine the 'lat' and 'long' columns to form a tuple.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 205482 entries, 0 to 209018
Data columns:
Month           205482  non-null values
Reported by     205482  non-null values
Falls within    205482  non-null values
Easting         205482  non-null values
Northing        205482  non-null values
Location        205482  non-null values
Crime type      205482  non-null values
long            205482  non-null values
lat             205482  non-null values
dtypes: float64(4), object(5)

The code I tried to use was:

def merge_two_cols(series): 
    return (series['lat'], series['long'])

sample['lat_long'] = sample.apply(merge_two_cols, axis=1)

However, this returned the following error:

---------------------------------------------------------------------------
 AssertionError                            Traceback (most recent call last)
<ipython-input-261-e752e52a96e6> in <module>()
      2     return (series['lat'], series['long'])
      3 
----> 4 sample['lat_long'] = sample.apply(merge_two_cols, axis=1)
      5

...

AssertionError: Block shape incompatible with manager

How can I solve this problem?

elksie5000 about 11 years

That's brilliant. Thank you. Clearly need to get my head around lambda functions.
Wouter Overmeire about 11 years

Did this work on your data? If so, can you share your pandas version and the data? I wonder why your code did not work, it should.
elksie5000 about 11 years

The version is 0.10.1_20130131. Excuse my ignorance, but what's the best way of uploading a section of the data for you? (Still a relative newbie).
Wouter Overmeire about 11 years

I failed to reproduce on 0.10.1. Best way of uploading? You can either create code that generates a frame holding random data that has the same issue and share that code, or pickle the frame above (sample) and transfer it via a free big file transfer service. How to pickle (in two lines, without ","): import pickle, with open('sample.pickle', 'wb') as file: pickle.dump(sample, file)
Balzer82 over 9 years

I need exactly this, but in the opposite direction. I have a column with the lat_long tuple and need two columns with lat and long. How to unpack the tuple?
Wouter Overmeire over 9 years

There are several ways, see e.g stackoverflow.com/questions/22799300/…
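One of those ways, sketched here under the assumption that the column is named lat_long and holds two-element tuples, is to transpose the tuples with zip(*...):

```python
import pandas as pd

# Illustrative frame with a tuple column; the values are made up.
df = pd.DataFrame({"lat_long": [(0.5, -0.5), (1.5, 2.5)]})

# zip(*...) regroups the first elements together and the second
# elements together, i.e. the inverse of zipping two columns.
lat, long = zip(*df["lat_long"])
df["lat"] = list(lat)
df["long"] = list(long)

print(df)
```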
imrek about 8 years

I have df[["year", "month", "day"]].apply(tuple, axis=1) where "year", "month", "day" are just integers, and this fails to do anything. EDIT: This works for floats only, what a ***y language.
paulwasit over 7 years

In Python 3, zip returns a lazy iterator, so you have to wrap it in list. This should work: df['new_col'] = list(zip(df.lat, df.long))
Dale over 7 years

@paulwasit ah yes, my love hate relationship with python 3's lazy behavior. thanks.
Pengju Zhao almost 7 years

This method list(zip(df.lat, df.long)) in 124ms is much more efficient than df[['lat', 'long']].apply(tuple, axis=1) in 14.2 s for 900k rows. The ratio is more than 100.
seeiespi about 6 years

I am trying to use this with a longer list of columns, df['new_col'] = list(zip(df[cols_to_keep])), but keep getting an error: Length of values does not match length of index. Any advice?
rishi jain over 4 years

I have upvoted this as I need to zip 10 columns and don't want to give dataframe name 10 times. Just want to give Column names.
Peter Hansen over 4 years

zip(df[cols_to_keep]) iterates over the DataFrame itself, which yields the column labels rather than the Series. You need zip( [df[c] for c in cols_to_keep])
ChaimG over 4 years

When you have more than just these two columns: %timeit df[['a', 'b']].values.tolist(). It's still much faster.
jedge about 4 years

@PeterHansen's answer helped me, but I think it may have been missing an * to unpack the list first, i.e. df['new_col'] = list(zip(*[df[c] for c in cols_to_keep]))
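A runnable sketch of that multi-column form. The frame and the contents of cols_to_keep are invented for illustration, and integer columns are used to show that the zip approach does not depend on the dtype:

```python
import pandas as pd

df = pd.DataFrame({
    "year": [2020, 2021],
    "month": [1, 2],
    "day": [3, 4],
    "other": ["x", "y"],  # deliberately left out of the tuples
})
cols_to_keep = ["year", "month", "day"]

# The * unpacks the list so each Series becomes a separate argument to
# zip; without it, zip would receive a single list and pair nothing.
df["new_col"] = list(zip(*[df[c] for c in cols_to_keep]))

print(df["new_col"].tolist())
```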
Zizzipupp almost 4 years

This one fails for me with: TypeError: only integer scalar arrays can be converted to a scalar index.
ThatNewGuy about 3 years

It's faster to create it, but any operations on that column will be faster in tuple form. For example, try calling .value_counts() on a column of lists vs a column of tuples.