Pythonic type hints with pandas?


Solution 1

Why not just use pd.DataFrame?

import pandas as pd
def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Result is the same:

> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
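
Because the annotation is the real class object rather than a string label, it can also be inspected programmatically. The following is a minimal sketch, assuming pandas is installed; the sample file and its contents are made up for the demonstration.

```python
import tempfile
import typing
from pathlib import Path

import pandas as pd


def csv_to_df(path: str) -> pd.DataFrame:
    """Read a tab-separated file, skipping the first line and '#' comments."""
    return pd.read_csv(path, skiprows=1, sep="\t", comment="#")


# The annotations resolve to real classes that tools can introspect.
hints = typing.get_type_hints(csv_to_df)

# Hypothetical sample file, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.tsv"
    sample.write_text("junk header line\na\tb\n1\t2\n# a comment\n3\t4\n")
    df = csv_to_df(str(sample))
```

Here hints["return"] is pd.DataFrame itself, which is what a type checker or IDE works with.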

Solution 2

I'm currently doing the following:

import pandas as pd
from typing import TypeVar
PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')
def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')

Which gives:

> help(csv_to_df)
Help on function csv_to_df in module __main__:

csv_to_df(path:str) -> ~pandas.core.frame.DataFrame

Don't know how pythonic that is, but it's understandable enough as a type hint, I find.

Solution 3

Now there is a pip package that can help with this. https://github.com/CedricFR/dataenforce

You can install it with pip install dataenforce and use very pythonic type hints like:

from dataenforce import Dataset

def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass

Solution 4

Another answer explains how to use the data-science-types package, which provides type stubs for pandas so that mypy can check DataFrame annotations. Install it with:

pip install data-science-types

Demo

# program.py

import pandas as pd

df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment

Run using mypy the same way:

$ mypy program.py
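
Note that the error above comes from static analysis only; Python itself does not enforce variable annotations when the program runs. A minimal sketch illustrating the difference, assuming pandas is installed:

```python
import pandas as pd

# Variable annotations are not enforced at runtime: this assignment succeeds
# even though a Series is not a DataFrame. Only a checker such as mypy flags it.
df1: pd.DataFrame = pd.Series([1, 2, 3])

# An explicit isinstance check is what detects the mismatch at runtime.
is_frame = isinstance(df1, pd.DataFrame)
```

If the mismatch matters at runtime, an isinstance check like this one has to be written explicitly.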

Solution 5

This strays from the original question, but it builds on @dangom's answer using TypeVar and on @Georgy's comment that type hints cannot specify datatypes for DataFrame columns. As a simple workaround, you can record the datatype in the TypeVar name. The string is only a label for human readers; no type checker verifies it:

import pandas as pd
from typing import TypeVar
DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")
def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
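
Since the label above is not checked by anything, column datatypes can only be verified at runtime. The following is a minimal sketch of such a check, assuming pandas is installed; require_dtypes is a hypothetical helper written for this illustration, not part of pandas or of the answer above.

```python
import pandas as pd


def require_dtypes(df: pd.DataFrame, expected: dict) -> pd.DataFrame:
    """Return df unchanged, or raise TypeError on a missing column or wrong dtype."""
    for column, dtype in expected.items():
        if column not in df.columns:
            raise TypeError(f"missing column: {column!r}")
        actual = df[column].dtype
        if actual != dtype:
            raise TypeError(f"column {column!r} is {actual}, expected {dtype}")
    return df


# Explicit dtypes keep the demo independent of platform defaults.
df = pd.DataFrame({
    "id": pd.Series([1, 2, 3], dtype="int64"),
    "score": pd.Series([0.5, 1.5, 2.5], dtype="float64"),
})
checked = require_dtypes(df, {"id": "int64", "score": "float64"})
```

Calling the helper right after loading the data makes a wrong dtype fail early, which a type hint alone cannot do.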
Author: dangom

Updated on July 08, 2022

Comments

  • dangom
    dangom almost 2 years

    Let's take a simple function that takes a str and returns a dataframe:

    import pandas as pd
    def csv_to_df(path):
        return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
    

    What is the recommended pythonic way of adding type hints to this function?

    If I ask Python for the type of a DataFrame, it returns pandas.core.frame.DataFrame. The following won't work, though: it raises a NameError telling me that pandas is not defined.

     def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
         return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
    
    • Moses Koledoye
      Moses Koledoye about 7 years
      But you're using the pd alias, and you can probably define custom types.
    • dangom
      dangom about 7 years
      @MosesKoledoye if I try pd.core.frame.DataFrame I'll get an AttributeError instead of a NameError.
    • Chris
      Chris about 7 years
      I am not an authority on "pythonicity", but I would recommend docstrings (using ''' this function takes an inputType and returns an outputType '''). This is also what will be shown if someone calls help(yourFunction) on your function.
    • 00schneider
      00schneider about 4 years
      the library dataenforce allows to check for data types inside the data frame github.com/CedricFR/dataenforce
  • Tom Roth
    Tom Roth about 6 years
    @Azat Ibrakov would you mind elaborating on your comment? Sometimes I'm not sure what is and isn't 'pythonic'.
  • burkesquires
    burkesquires over 5 years
    Note: This assumes you import pandas as pd at the top of your script. Just importing in main is not enough as pd will not resolve.
  • Philipp_Kats
    Philipp_Kats almost 5 years
    it also won't allow specifying dtypes for specific columns, which could be extremely useful
  • Georgy
    Georgy almost 5 years
    @Philipp_Kats Currently there is no way to specify dtypes for DataFrame columns in type hints, and I haven't seen any work done in this direction (correct me if I'm wrong). Linking a related question on type hints with NumPy and dtypes: Type hint for NumPy ndarray dtype?. You will see that it's also not implemented there yet.
  • user2304916
    user2304916 over 4 years
    This gives an error in mypy: "error: No library stub file for module 'pandas'"
  • dangom
    dangom over 4 years
    I see people downvoting this answer. For context, this was the solution I found for my own question, and for all intents and purposes it works just fine. The more pythonic solution above, which I accepted as correct answer (but does have its own perks, see comments), was only provided 8 months afterwards.
  • Alex
    Alex about 4 years
    It's not pythonic since it is less clear and harder to maintain than the accepted answer for this question. Since the type path here is not verified by the compiler it won't raise errors if it's wrong. This could happen from a typo in your TypeVar arg or change to the module itself.
  • Victor M Herasme Perez
    Victor M Herasme Perez over 3 years
    I receive a warning when I use this: The argument to 'TypeVar()' must be a string equal to the variable name to which it is assigned
  • uetoyo
    uetoyo almost 3 years
    @Azat Ibrakov These "pythonic" and "not pythonic" arguments are like a mantra for many "Pythonists". I think we should stop arguing in this style. I have never heard this type of argument from, e.g., a Java developer. In my opinion, there is nothing wrong with this solution.
  • user3897315
    user3897315 over 2 years
    Unfortunately, this is buried at bottom. In 2021 this is the best answer. Note too the comment by Daniel Malachov following the linked answer (stackoverflow.com/a/63446142/8419574).
  • blthayer
    blthayer over 2 years
    @user3897315 - I disagree that this is the best answer in 2021. If you visit data-science-types on GitHub you'll find the repository has been archived, and the README updated (on Feb 16 2021) with the following note: "⚠️ this project has mostly stopped development ⚠️ The pandas team and the numpy team are both in the process of integrating type stubs into their codebases, and we don't see the point of competing with them."
  • kevin_theinfinityfund
    kevin_theinfinityfund over 2 years
    I agree, but following that I don't see a timeline when pandas or numpy will have these pushed or ETA in their roadmap.
  • decorator-factory
    decorator-factory about 2 years
    This is not the correct use of a type variable. A TypeVar exists to link two types together (mypy docs). You probably meant a type alias: PandasDataFrame = pandas.core.frame.DataFrame
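
The distinction raised in the last comment can be sketched as follows, assuming pandas is installed. The function names head_rows and first are made up for the illustration. A type alias is simply another name for the real class, whereas a TypeVar ties together two types within one signature.

```python
from typing import Sequence, TypeVar

import pandas as pd

# A type alias: another name for the real class, checked like the class itself.
PandasDataFrame = pd.DataFrame


def head_rows(df: PandasDataFrame, n: int = 2) -> PandasDataFrame:
    """Return the first n rows of the frame."""
    return df.head(n)


# A TypeVar links types within one signature: the return type follows the
# element type of the argument. The string matches the variable name.
T = TypeVar("T")


def first(items: Sequence[T]) -> T:
    """Return the first element of a non-empty sequence."""
    return items[0]


frame = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
top = head_rows(frame)
```

Unlike the string passed to TypeVar in the earlier answers, a typo in the alias fails immediately with an AttributeError or NameError, so the type path cannot silently go stale.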