How to form tuple column from two columns in Pandas
159,087
Solution 1
Get comfortable with zip
. It comes in handy when dealing with column data.
df['new_col'] = list(zip(df.lat, df.long))
It's less complicated and faster than using apply
or map
. Something like np.dstack
is twice as fast as zip
, but wouldn't give you tuples.
Solution 2
In [10]: df
Out[10]:
A B lat long
0 1.428987 0.614405 0.484370 -0.628298
1 -0.485747 0.275096 0.497116 1.047605
2 0.822527 0.340689 2.120676 -2.436831
3 0.384719 -0.042070 1.426703 -0.634355
4 -0.937442 2.520756 -1.662615 -1.377490
5 -0.154816 0.617671 -0.090484 -0.191906
6 -0.705177 -1.086138 -0.629708 1.332853
7 0.637496 -0.643773 -0.492668 -0.777344
8 1.109497 -0.610165 0.260325 2.533383
9 -1.224584 0.117668 1.304369 -0.152561
In [11]: df['lat_long'] = df[['lat', 'long']].apply(tuple, axis=1)
In [12]: df
Out[12]:
A B lat long lat_long
0 1.428987 0.614405 0.484370 -0.628298 (0.484370195967, -0.6282975278)
1 -0.485747 0.275096 0.497116 1.047605 (0.497115615839, 1.04760475074)
2 0.822527 0.340689 2.120676 -2.436831 (2.12067574274, -2.43683074367)
3 0.384719 -0.042070 1.426703 -0.634355 (1.42670326172, -0.63435462504)
4 -0.937442 2.520756 -1.662615 -1.377490 (-1.66261469102, -1.37749004179)
5 -0.154816 0.617671 -0.090484 -0.191906 (-0.0904840623396, -0.191905582481)
6 -0.705177 -1.086138 -0.629708 1.332853 (-0.629707821728, 1.33285348929)
7 0.637496 -0.643773 -0.492668 -0.777344 (-0.492667604075, -0.777344111021)
8 1.109497 -0.610165 0.260325 2.533383 (0.26032456699, 2.5333825651)
9 -1.224584 0.117668 1.304369 -0.152561 (1.30436900612, -0.152560909725)
Solution 3
Pandas has the itertuples
method to do exactly this:
list(df[['lat', 'long']].itertuples(index=False, name=None))
Solution 4
You should try using pd.to_records(index=False)
:
import pandas as pd
df = pd.DataFrame({'language': ['en', 'ar', 'es'], 'greeting': ['Hi', 'اهلا', 'Hola']})
df
language greeting
0 en Hi
1 ar اهلا
2 es Hola
df['list_of_tuples'] = list(df[['language', 'greeting']].to_records(index=False))
df['list_of_tuples']
0 [en, Hi]
1 [ar, اهلا]
2 [es, Hola]
enjoy!
Solution 5
I'd like to add df.values.tolist()
. (as long as you don't mind to get a column of lists rather than tuples)
import pandas as pd
import numpy as np
size = int(1e+07)
df = pd.DataFrame({'a': np.random.rand(size), 'b': np.random.rand(size)})
%timeit df.values.tolist()
1.47 s ± 38.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list(zip(df.a,df.b))
1.92 s ± 131 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related videos on Youtube
Comments
-
elksie5000 over 1 year
I've got a Pandas DataFrame and I want to combine the 'lat' and 'long' columns to form a tuple.
<class 'pandas.core.frame.DataFrame'> Int64Index: 205482 entries, 0 to 209018 Data columns: Month 205482 non-null values Reported by 205482 non-null values Falls within 205482 non-null values Easting 205482 non-null values Northing 205482 non-null values Location 205482 non-null values Crime type 205482 non-null values long 205482 non-null values lat 205482 non-null values dtypes: float64(4), object(5)
The code I tried to use was:
def merge_two_cols(series): return (series['lat'], series['long']) sample['lat_long'] = sample.apply(merge_two_cols, axis=1)
However, this returned the following error:
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) <ipython-input-261-e752e52a96e6> in <module>() 2 return (series['lat'], series['long']) 3 ----> 4 sample['lat_long'] = sample.apply(merge_two_cols, axis=1) 5
...
AssertionError: Block shape incompatible with manager
How can I solve this problem?
-
elksie5000 about 11 yearsThat's brilliant. Thank you. Clearly need to get my head around lambda functions.
-
Wouter Overmeire about 11 yearsDid this work on your data? If so, can you share your pandas version and the data? I wonder why your code did not work, it should.
-
elksie5000 about 11 yearsThe version is 0.10.1_20130131. Excuse my ignorance, but what's the best way of uploading a section of the data for you? (Still a relative newbie).
-
Wouter Overmeire about 11 yearsI failed to reproduce on 0.10.1. Best way of uploading? You can either create code that generates a frame holding random data, that has the same issue and share that code or pickle the frame above (sample) and transfer it via a free big file transfer service. How to pickle (in two lines, without ","): import pickle, with open('sample.pickle', 'w') as file: pickle.dump(sample, file)
-
Balzer82 over 9 yearsI need exactly this, but in the opposite direction. I have a column with the lat_long tuple and need two columns with lat and long. How to unpack the tuple?
-
Wouter Overmeire over 9 yearsThere are several ways, see e.g stackoverflow.com/questions/22799300/…
-
imrek about 8 yearsI have
df[["year", "month", "day"]].apply(tuple, axis=1)
where "year", "month", "day" are just integers, and this fails to do anything. EDIT: This works for floats only, what a ***y language. -
paulwasit over 7 yearsin python3, you have to use
list
. This should work:df['new_col'] = list(zip(df.lat, df.long))
-
Dale over 7 years@paulwasit ah yes, my love hate relationship with python 3's lazy behavior. thanks.
-
Pengju Zhao almost 7 yearsThis method
list(zip(df.lat, df.long))
in 124ms is much more efficient thandf[['lat', 'long']].apply(tuple, axis=1)
in 14.2 s for 900k rows. The ratio is more than 100. -
seeiespi about 6 yearsI am trying to use this with a longer list of columns
df['new_col'] = list(zip(df[cols_to_keep]))
but keep getting an error:Length of values does not match length of index
any advice? -
rishi jain over 4 yearsI have upvoted this as I need to zip 10 columns and don't want to give dataframe name 10 times. Just want to give Column names.
-
Peter Hansen over 4 years
zip(df[cols_to_keep])
will iterate over the DataFrame, creating a list of columns instead of a list of Series. you needzip( [df[c] for c in cols_to_keep])
-
ChaimG over 4 yearsWhen you have more than just these two columns:
%timeit df[['a', 'b']].values.tolist()
. It's still much faster. -
jedge about 4 years@PeterHansen's answer helped me but think it may have been missing an * to unpack the list first - i.e.
df['new_col'] = list(zip(*[df[c] for c in cols_to_keep])
-
Zizzipupp almost 4 yearsThis one fails for me with:
TypeError: only integer scalar arrays can be converted to a scalar index
. -
ThatNewGuy about 3 yearsIt's faster to create it, but any operations on that column will be faster in tuple form. For example, try calling
.value_counts()
on a column of lists vs a column of tuples.