How to get maximum length of each column in the data frame using pandas python

43,008

Solution 1

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions.

You can use pd.DataFrame.select_dtypes to select object columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

measurer = np.vectorize(len)

Max length for all columns

res1 = measurer(df.values.astype(str)).max(axis=0)

array([4, 5, 3])

Max length for object columns

res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)

array([4, 5])

Or if you need output as a dictionary:

res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5, 'C': 3}

df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5}

Solution 2

You can use min max after using str and len method

df["A"].str.len().max()
df["A"].str.len().min()

df["Column Name"].str.len().max()
df["Column Name"].str.len().min()

Solution 3

Some great answers here and I would like to contribute mine

Solution:

dict([(v, df[v].apply(lambda r: len(str(r)) if r!=None else 0).max())for v in df.columns.values])

Explanation:

#convert tuple to dictionary
dict( 
    [
        #create a tuple such that (column name, max length of values in column)
        (v, df[v].apply(lambda r: len(str(r)) if r!=None else 0).max()) 
            for v in df.columns.values #iterates over all column values
    ])

Sample output

{'name': 4, 'DoB': 10, 'Address': 2, 'comment1': 21, 'comment2': 17}

Solution 4

Finding max number of chars for all columns (any type) in dataframe:

for column in df:
    print(column,"->", df[column].astype(str).str.len().max())

works fairly fast, I'm using it on 80 million rows df.

Solution 5

Select only object type columns

df2 = df1[[x for x in df1 if df1[x].dtype == 'object']]

Get the maximum length in each column

max_length_in_each_col = df2.applymap(lambda x: len(x)).max()
Share:
43,008
singularity2047
Author by

singularity2047

Updated on March 20, 2021

Comments

  • singularity2047
    singularity2047 about 3 years

    I have a data frame where most of the columns are varchar/object type. Length of the column varies a lot and could be anything within the range of 3 - 1000+ . Now, for each column, I want to measure maximum length.

    I know how to calculate maximum length for a col. If its varchar then:

    max(df.char_col.apply(len))
    

    and if its number (float8 or int64) then:

    max(df.num_col.map(str).apply(len))
    

    But my dataframe has hundreds of column and I want to calculate maximum length for all columns at the same time. The problem for that is, there are different data types, and I dont know how to do all at once.

    So Question 1: How to get maximum column length for each columns in the data frame

    Now I am trying to do that only for varchar/object type columns using following code:

    xx = df.select_dtypes(include = ['object'])
    for col in [xx.columns.values]:
       maxlength = [max(xx.col.apply(len))]
    

    I selected only object type columns and tried to write a for loop. But its not working. probably using apply() within for loop is not a good idea.

    Question 2: How to get maximum length of each column for only object type columns

    Sample data frame:

    d1 = {'name': ['john', 'tom', 'bob', 'rock', 'jimy'], 'DoB': ['01/02/2010', '01/02/2012', '11/22/2014', '11/22/2014', '09/25/2016'], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA'], 'comment1': ['Very good performance', 'N/A', 'Need to work hard', 'No Comment', 'Not satisfactory'], 'comment2': ['good', 'Meets Expectation', 'N', 'N/A', 'Incompetence']}
    df1 = pd.DataFrame(data = d1)
    df1['month'] = pd.DatetimeIndex(df1['DoB']).month
    df1['year'] = pd.DatetimeIndex(df1['DoB']).year
    
  • Steve Gon
    Steve Gon about 5 years
    For some reason the df2 statement is pulling in all column types, not just object.
  • Hrvoje
    Hrvoje over 3 years
    I've changed ´O´ to ´object´ - it should select object columns now.
  • Bouncner
    Bouncner over 3 years
    Does not work well for large data sets. Failed with a warning numpy.core._exceptions.MemoryError: Unable to allocate 4.18 TiB for an array with shape (4130207, 5) and data type <U55682
  • Dan Nissenbaum
    Dan Nissenbaum about 3 years
    Odd how the simplest, cleanest answer has only one upvote compared to 24. Anyways, this answer works perfectly for me.
  • MSallal
    MSallal about 3 years
    Thanks Dan, much appreciated :)
  • Corina Roca
    Corina Roca about 3 years
    @Harvey posted below a very valid answer: for column in df: print(column,"->", df[column].astype(str).str.len().max())