What is the difference between NaN and None?

Solution 1

NaN is used consistently as the placeholder for missing data in pandas, and that consistency is good. I usually read/translate NaN as "missing". Also see the 'working with missing data' section in the docs.

Wes writes in the docs 'choice of NA-representation':

After years of production use [NaN] has proven, at least in my opinion, to be the best decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used everywhere as the NA value, and there are API functions isnull and notnull which can be used across the dtypes to detect NA values.
...
Thus, I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when NAs must be introduced.

Note the "gotcha": an integer Series that contains missing data is upcast to float.
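The upcast is easy to see in a short session. This is a minimal sketch, assuming a reasonably recent pandas; the variable names `ints` and `with_gap` are made up for illustration, and `reindex` is just one convenient way to introduce a missing value.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype

# A Series of plain integers gets an integer dtype.
ints = pd.Series([1, 2, 3])
print(is_integer_dtype(ints))  # True

# Reindexing to a label that does not exist (3) introduces a missing value,
# so pandas promotes the whole Series to float64 to be able to store NaN.
with_gap = ints.reindex([0, 1, 2, 3])
print(with_gap.dtype)              # float64
print(with_gap.isnull().tolist())  # [False, False, False, True]
```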

In my opinion the main reason to use NaN (over None) is that it can be stored with numpy's float64 dtype rather than the less efficient object dtype; see NA type promotions.

import numpy as np
import pandas as pd

# Without forcing dtype=object, pandas converts None to NaN.
s_bad = pd.Series([1, None], dtype=object)
s_good = pd.Series([1, np.nan])

In [13]: s_bad.dtype
Out[13]: dtype('O')

In [14]: s_good.dtype
Out[14]: dtype('float64')

Jeff comments (below) on this:

np.nan allows for vectorized operations; its a float value, while None, by definition, forces object type, which basically disables all efficiency in numpy.

So repeat 3 times fast: object==bad, float==good

That said, many operations may still work with None as well as they do with NaN, but they are perhaps not supported, i.e. they may sometimes give surprising results:

In [15]: s_bad.sum()
Out[15]: 1

In [16]: s_good.sum()
Out[16]: 1.0

To answer the second question:
You should be using pd.isnull and pd.notnull to test for missing data (NaN).
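For the dictionary check in the question, a minimal sketch might look like the following. The dictionary `cells` and its contents are made up for illustration; the point is that `pd.isnull` accepts strings, NaN and None alike, whereas `np.isnan` raises a `TypeError` when handed a string.

```python
import numpy as np
import pandas as pd

# Hypothetical data: string values, with NaN and None standing in for empty cells.
cells = {"a": "x1", "b": np.nan, "c": None}

# pd.isnull works for any scalar, so the loop never fails on strings.
missing = [key for key, value in cells.items() if pd.isnull(value)]
print(missing)  # ['b', 'c']

# np.isnan is only defined for numeric input and fails on a string.
try:
    np.isnan("x1")
except TypeError as exc:
    isnan_error = exc
    print("np.isnan failed:", type(exc).__name__)
```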

Solution 2

NaN can be used as a numerical value in mathematical operations, while None cannot (or at least shouldn't).

NaN is a numeric value, as defined in the IEEE 754 floating-point standard. None is the sole value of an internal Python type (NoneType) and, in this context, means something more like "nonexistent" or "empty" than "numerically invalid".

The main "symptom" of that is that if you compute, say, an average or a sum over a NumPy array containing even a single NaN, you get NaN as a result. (pandas reductions such as Series.mean skip NaN by default, so they behave differently.)

On the other hand, you cannot perform mathematical operations with None as an operand.
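A short sketch of both behaviours, assuming NumPy and pandas are installed. The pandas lines are included because its reductions skip NaN unless `skipna=False` is passed, which can make the propagation look absent.

```python
import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, np.nan])

# In NumPy, a single NaN propagates through sum and mean.
print(np.isnan(arr.sum()))   # True
print(np.isnan(arr.mean()))  # True

# pandas skips NaN by default (skipna=True), so the mean is over 1.0 and 2.0.
series = pd.Series(arr)
print(series.mean())                          # 1.5
print(np.isnan(series.mean(skipna=False)))    # True

# None is not a number at all: arithmetic on it raises TypeError.
try:
    total = None + 1
except TypeError as exc:
    none_error = exc
    print("None + 1 failed:", type(exc).__name__)
```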

So, depending on the case, you could use None to tell your algorithm not to consider invalid or nonexistent values in computations. That means the algorithm has to test each value to see whether it is None.
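As a rough illustration of that per-value test, here is a hypothetical helper (the name `mean_ignoring_none` is not from any library) that averages a plain Python list while skipping None entries.

```python
def mean_ignoring_none(values):
    """Average the entries of `values` that are not None.

    Returns None when there is nothing left to average.
    """
    # Each value has to be tested explicitly; None cannot take part in arithmetic.
    present = [v for v in values if v is not None]
    if not present:
        return None
    return sum(present) / len(present)


print(mean_ignoring_none([1, None, 3]))  # 2.0
print(mean_ignoring_none([None, None]))  # None
```

This explicit Python-level loop is the price of using None: the check cannot be vectorized the way a NaN mask can.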

NumPy has functions that keep NaN values from contaminating your results, for example nansum and nan_to_num.
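A minimal sketch of those two functions (plus `nanmean`, which works the same way), assuming NumPy is installed; the array `arr` is made up for illustration.

```python
import numpy as np

arr = np.array([1.0, np.nan, 2.0])

# nansum and nanmean treat NaN entries as if they were absent.
print(np.nansum(arr))   # 3.0
print(np.nanmean(arr))  # 1.5

# nan_to_num returns a copy with NaN replaced by 0.0 (the default replacement).
cleaned = np.nan_to_num(arr)
print(cleaned.tolist())  # [1.0, 0.0, 2.0]
```

Note that replacing NaN with 0.0 changes averages, so `nan_to_num` is only appropriate when zero is a sensible stand-in for a missing value.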

Solution 3

The function isnan() checks whether a value is "Not a Number" (NaN) and returns True only if it is. For example, isnan(2) returns False, because 2 is a valid number.

The conditional myVar is not None tells you whether the variable holds a value other than None. It does not test whether the variable has been defined.
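The two checks answer different questions, which a few lines with the standard library's `math.isnan` can show. The name `my_var` is only a placeholder.

```python
import math

# isnan asks "is this numeric value NaN?"
print(math.isnan(2))             # False
print(math.isnan(float("nan")))  # True

# "is not None" asks "does this name refer to something other than None?"
my_var = None
print(my_var is not None)  # False
my_var = float("nan")
print(my_var is not None)  # True: NaN is a float, not None

# NaN is also the only float that is not equal to itself,
# which is why an == comparison cannot be used to detect it.
nan = float("nan")
print(nan == nan)  # False
```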

Your numpy array is checked with isnan() because it is intended to be an array of numbers, and the elements that have no value are stored as NaN. These elements are considered "empty".

Author by

user1083734

Updated on March 27, 2020

Comments

  • user1083734
    user1083734 about 4 years

    I am reading two columns of a csv file using pandas read_csv() and then assigning the values to a dictionary. The columns contain strings of numbers and letters. Occasionally there are cases where a cell is empty. In my opinion, the value read to that dictionary entry should be None but instead nan is assigned. Surely None is more descriptive of an empty cell as it has a null value, whereas nan just says that the value read is not a number.

    Is my understanding correct? What IS the difference between None and nan? Why is nan assigned instead of None?

    Also, my dictionary check for any empty cells has been using numpy.isnan():

    for k, v in my_dict.iteritems():
        if np.isnan(v):
    

    But this gives me an error saying that I cannot use this check for v. I guess it is because an integer or float variable, not a string is meant to be used. If this is true, how can I check v for an "empty cell"/nan case?

  • heltonbiker
    heltonbiker almost 11 years
    I think isnan(2) would return False, since 2 is not a NaN.
  • user1083734
    user1083734 almost 11 years
    I agree with you that None should be used for non-existent entries, so why does df=pd.read_csv('file.csv') give me NaN values for the empty cells and not None? As far as I'm aware, pd.DataFrames are not exclusive for numbers.
  • heltonbiker
    heltonbiker almost 11 years
    Also, numpy.empty doesn't initialize array values to NaN. It simply doesn't initialize the values at all.
  • heltonbiker
    heltonbiker almost 11 years
    Well, it's probably a design choice. I suppose DataFrames and Series have a dtype, so invalid values of dtype=float must be represented by numeric values, which NaN is and None is not (None is of NoneType).
  • heltonbiker
    heltonbiker almost 11 years
    Also, a lot of pandas methods have an na argument, which lets you decide which value to use in place of not-available values.
  • user1083734
    user1083734 almost 11 years
    Ok, thanks. So I am not actually reading numbers into my DataFrame, but strings of numbers and letters. What sort of check should I be using to detect empty cells? A check like; if dtype==float: ??
  • heltonbiker
    heltonbiker almost 11 years
    Perhaps posting a sample of your CSV data would help. I can imagine that, if there are strings, then dtype would be string for the whole column (Series). But perhaps if not every row has the same number of columns, you end up with unavailable data. I think you'll have to check that.
  • Jaime
    Jaime almost 11 years
    The proper check for None-ness is myVar is not None, not myVar != None.
  • Andy Hayden
    Andy Hayden almost 11 years
    @heltonbiker pandas chooses object as the dtype for columns with strings (see note here). Otherwise it has to store the size of the largest element for every element (usually you don't know every string is a specific/the same length).
  • Jeff
    Jeff almost 11 years
    just adding 2c here....np.nan allows for vectorized operations; its a float value, while None by definition forces object type, and basically disables all efficiency in numpy, so repeat 3 times fast: object==bad, float==good
  • Michael
    Michael over 10 years
    Note that np.isnan() is not implemented for string variables, so if you pass it a string it will crash. Better to use pd.isnull which works with strings.
  • A. Kootstra
    A. Kootstra over 4 years
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review
  • eswara amirthan s
    eswara amirthan s about 4 years
    @A.Kootstra I understand
  • Gathide
    Gathide about 4 years
    Is <NA> also an np.nan?
  • peer
    peer almost 4 years
    Could you provide a code example for "if you perform, say, an average or a sum on an array containing NaN, even a single one, you get NaN as a result"? For me pd.__version__ == '0.23.4'; pd.Series([1,2,np.NaN]).mean() == 1.5
  • graj499
    graj499 almost 3 years
    @heltonbiker yeah you are right read_csv() gives NaN but when you read excel and xlsb file It will give you None.
  • Guy s
    Guy s over 2 years
    The question was specifically about pandas. This answer is great, why isn't it presented first?!
  • eltings
    eltings over 2 years
    another gotcha for this case: bool(None) -> False, while bool(float('nan')) -> True