Pythonic type hints with pandas?
Solution 1
Why not just use pd.DataFrame?
import pandas as pd

def csv_to_df(path: str) -> pd.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Result is the same:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> pandas.core.frame.DataFrame
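Solution 1 relies on pandas being imported as pd at module level. If that import is unwanted at runtime (for example, to keep module import cheap), one possible variant is a string annotation combined with typing.TYPE_CHECKING. This is only a sketch under that assumption, not part of the original answer; the deferred import inside the function is an illustrative choice:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by static type checkers such as mypy; skipped at runtime.
    import pandas as pd


def csv_to_df(path: str) -> "pd.DataFrame":
    # Deferred import (an assumption of this sketch): pandas is loaded
    # when the function is first called, not when the module is imported.
    import pandas as pd

    return pd.read_csv(path, skiprows=1, sep="\t", comment="#")
```

Because the annotation is a string, it is stored as written and never evaluated at definition time, so the missing module-level pd does not raise a NameError.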
Solution 2
I'm currently doing the following:
from typing import TypeVar
import pandas as pd

PandasDataFrame = TypeVar('pandas.core.frame.DataFrame')

def csv_to_df(path: str) -> PandasDataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
Which gives:
> help(csv_to_df)
Help on function csv_to_df in module __main__:
csv_to_df(path:str) -> ~pandas.core.frame.DataFrame
Don't know how pythonic that is, but it's understandable enough as a type hint, I find.
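A caveat on this approach: a TypeVar is meant to link types within a signature, whereas giving an existing type a second name is normally done with a plain alias. The sketch below shows the difference using only standard-library types; the names first, total and Row are hypothetical and exist only for illustration:

```python
from typing import Dict, List, TypeVar

# A TypeVar links types within one signature: whatever element type
# goes in is the type that comes out.
T = TypeVar("T")


def first(items: List[T]) -> T:
    return items[0]


# A type alias is just another name for an existing type.
# For pandas the analogous line would be: PandasDataFrame = pd.DataFrame
Row = Dict[str, int]


def total(row: Row) -> int:
    return sum(row.values())
```

With an alias, a type checker sees the real class, so a typo in the name is reported instead of being silently accepted as an arbitrary string.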
Solution 3
Now there is a pip package that can help with this. https://github.com/CedricFR/dataenforce
You can install it with pip install dataenforce
and use very pythonic type hints like:
def preprocess(dataset: Dataset["id", "name", "location"]) -> Dataset["location", "count"]:
    pass
Solution 4
Check out the answer given here which explains the usage of the package data-science-types.
pip install data-science-types
Demo
# program.py
import pandas as pd
df: pd.DataFrame = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]}) # OK
df1: pd.DataFrame = pd.Series([1,2,3]) # error: Incompatible types in assignment
Run using mypy the same way:
$ mypy program.py
Solution 5
This is straying from the original question, but building off of @dangom's answer using TypeVar and @Georgy's comment that there is no way to specify datatypes for DataFrame columns in type hints, you could use a simple work-around like this to specify datatypes in a DataFrame:
from typing import TypeVar
import pandas as pd

DataFrameStr = TypeVar("pandas.core.frame.DataFrame(str)")

def csv_to_df(path: str) -> DataFrameStr:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
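Another way to attach a dtype note to a hint, assuming Python 3.9 or later, is typing.Annotated, which carries free-form metadata alongside a real type. This is a sketch rather than an established pandas convention: the metadata string is documentation only, nothing checks it, and the TYPE_CHECKING guard and deferred import are illustrative choices:

```python
from typing import TYPE_CHECKING, Annotated

if TYPE_CHECKING:
    # Seen only by static type checkers; skipped at runtime.
    import pandas as pd

# Hypothetical alias: the second argument is plain metadata for human
# readers. It does not validate or convert any column dtypes.
DataFrameStr = Annotated["pd.DataFrame", "all columns hold str values"]


def csv_to_df(path: str) -> DataFrameStr:
    # Deferred import (an assumption of this sketch).
    import pandas as pd

    return pd.read_csv(path, skiprows=1, sep="\t", comment="#")
```

The note can be read back at runtime through DataFrameStr.__metadata__, which tools or documentation generators could use if desired.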
dangom
Updated on July 08, 2022
Comments
-
dangom almost 2 years
Let's take a simple function that takes a str and returns a dataframe:
import pandas as pd

def csv_to_df(path):
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
What is the recommended pythonic way of adding type hints to this function?
If I ask Python for the type of a DataFrame it returns pandas.core.frame.DataFrame. The following won't work though, as it'll tell me that pandas is not defined.
def csv_to_df(path: str) -> pandas.core.frame.DataFrame:
    return pd.read_csv(path, skiprows=1, sep='\t', comment='#')
-
Moses Koledoye about 7 years
But you're using the pd alias, and you can probably define custom types.
-
dangom about 7 years
@MosesKoledoye if I try pd.core.frame.DataFrame I'll get an AttributeError instead of a NameError.
-
Chris about 7 years
I am not an authority on "pythonicity", but I would recommend docstrings (using ''' this function takes an inputType and returns an outputType '''). This is also what will be shown if someone calls help(yourFunction) on your function.
-
00schneider about 4 years
The library dataenforce allows checking data types inside the data frame: github.com/CedricFR/dataenforce
-
Tom Roth about 6 years
@Azat Ibrakov would you mind elaborating on your comment? Sometimes I'm not sure what is and isn't 'pythonic'.
-
burkesquires over 5 years
Note: This assumes you import pandas as pd at the top of your script. Just importing in main is not enough, as pd will not resolve.
-
Philipp_Kats almost 5 years
It also won't allow specifying dtypes for specific columns, which could be extremely useful.
-
Georgy almost 5 years
@Philipp_Kats Currently there is no way to specify dtypes for DataFrame columns in type hints, and I haven't seen any work done in this direction (correct me if I'm wrong). Linking a related question on type hints with NumPy and dtypes: "Type hint for NumPy ndarray dtype?". You will see that it's also not implemented there yet.
-
user2304916 over 4 years
This gives an error in mypy: error: No library stub file for module 'pandas'
-
Georgy over 4 years
@user2304916 See "Unable to suppress No library stub file for module... error".
-
dangom over 4 years
I see people downvoting this answer. For context, this was the solution I found for my own question, and for all intents and purposes it works just fine. The more pythonic solution above, which I accepted as the correct answer (but which does have its own perks, see comments), was only provided 8 months afterwards.
-
Alex about 4 years
It's not pythonic since it is less clear and harder to maintain than the accepted answer for this question. Since the type path here is not verified by the compiler, it won't raise errors if it's wrong. This could happen from a typo in your TypeVar arg or a change to the module itself.
-
Victor M Herasme Perez over 3 years
I receive a warning when I use this: The argument to 'TypeVar()' must be a string equal to the variable name to which it is assigned
-
uetoyo almost 3 years
@Azat Ibrakov These "pythonic" and "not pythonic" arguments are like a mantra for many "Pythonists". I think we should stop arguing in this style. I have never heard this type of argumentation from, e.g., a Java developer. In my opinion, there is nothing wrong with this solution.
-
user3897315 over 2 years
Unfortunately, this is buried at the bottom. In 2021 this is the best answer. Note too the comment by Daniel Malachov following the linked answer (stackoverflow.com/a/63446142/8419574).
-
blthayer over 2 years
@user3897315 - I disagree that this is the best answer in 2021. If you visit data-science-types on GitHub you'll find the repository has been archived, and the README updated (on Feb 16 2021) with the following note: "⚠️ this project has mostly stopped development ⚠️ The pandas team and the numpy team are both in the process of integrating type stubs into their codebases, and we don't see the point of competing with them."
-
kevin_theinfinityfund over 2 years
I agree, but following that I don't see a timeline for when pandas or numpy will have these pushed, or an ETA in their roadmaps.
-
decorator-factory about 2 years
This is not the correct use of a type variable. A TypeVar exists to link two types together (mypy docs). You probably meant a type alias:
PandasDataFrame = pandas.core.frame.DataFrame