Pandas selecting by label sometimes returns Series, sometimes returns DataFrame

Solution 1

Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.

In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame

In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
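
For reference, here is a minimal sketch that rebuilds the frame from the question and compares a scalar label with a one-element list. The variable names are only for illustration.

```python
import pandas as pd

# Same frame as in the question: label 3 occurs three times, label 1 once.
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

selections = {
    "df.loc[1]": df.loc[1],      # scalar, unique label
    "df.loc[3]": df.loc[3],      # scalar, duplicated label
    "df.loc[[1]]": df.loc[[1]],  # list with a unique label
    "df.loc[[3]]": df.loc[[3]],  # list with a duplicated label
}

# Record the type name and shape of each result.
summary = {expr: (type(obj).__name__, obj.shape) for expr, obj in selections.items()}
for expr, (kind, shape) in summary.items():
    print(f"{expr:12} -> {kind} {shape}")
```

Only the scalar lookup of the unique label comes back as a Series; both list lookups are DataFrames regardless of how many rows match.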

Solution 2

The TLDR

When using loc

df.loc[:] returns a DataFrame

df.loc[label] returns a Series if the label occurs once in the index, and a DataFrame if the label occurs more than once (the number of columns does not matter)

df.loc[:, ["col_name"]] returns a DataFrame, even if the selection has only one row

df.loc[:, "col_name"] returns a Series

Not using loc

df["col_name"] returns a Series

df[["col_name"]] returns a DataFrame
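
These cases can be checked directly. The sketch below uses a hypothetical two-column frame with a duplicated row label (the frame and its labels are made up for illustration). As far as I can tell on current pandas, the result type depends on whether you pass a scalar or a list and, for a scalar row label, on whether that label is duplicated. It does not depend on how many rows or columns the frame has.

```python
import pandas as pd

# Hypothetical frame: two columns, row label "b" appears twice.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "b"])
one_row = df.loc[["a"]]  # a frame that has exactly one row

results = {
    'df.loc[:]': df.loc[:],
    'df.loc["a"]': df.loc["a"],            # scalar, unique label
    'df.loc["b"]': df.loc["b"],            # scalar, duplicated label
    'df.loc[["a"]]': df.loc[["a"]],        # list of labels
    'df.loc[:, "x"]': df.loc[:, "x"],      # scalar column
    'df.loc[:, ["x"]]': df.loc[:, ["x"]],  # list of columns
    'one_row.loc[:, ["x"]]': one_row.loc[:, ["x"]],
    'df["x"]': df["x"],
    'df[["x"]]': df[["x"]],
}

kinds = {expr: type(obj).__name__ for expr, obj in results.items()}
for expr, kind in kinds.items():
    print(f"{expr:24} -> {kind}")
```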

Solution 3

Your index contains the label 3 three times. For this reason df.loc[3] returns a DataFrame.

The reason is that you don't specify the column, so df.loc[3] selects those three rows across all columns (here only column 0), while df.loc[3, 0] returns a Series. Similarly, df.loc[1:2] returns a DataFrame because you slice the rows.

Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.

If you want to be sure to always get a DataFrame, you can slice, as in df.loc[1:1]. Another option is boolean indexing (df.loc[df.index == 1]) or the take method (df.take([0]), but this uses positions, not labels!).
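
A small sketch of the three options, using the frame from the question. The second frame, shuffled, is a made-up example added to illustrate one caveat: as far as I know, a label slice on a duplicated label only works when the index is sorted (monotonic), whereas the boolean mask has no such requirement.

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

by_slice = df.loc[1:1]           # label slice, both ends inclusive
by_mask = df.loc[df.index == 1]  # boolean mask built from the index
by_take = df.take([0])           # position 0, NOT label 0

for name, obj in [("slice", by_slice), ("mask", by_mask), ("take", by_take)]:
    print(name, type(obj).__name__, obj.shape)

# Hypothetical unsorted index with a duplicated label.
shuffled = pd.DataFrame(data=range(3), index=[3, 1, 3])
try:
    shuffled.loc[3:3]
    slice_ok = True
except KeyError:
    # pandas cannot find slice bounds for a non-unique label here
    slice_ok = False

mask_rows = shuffled.loc[shuffled.index == 3]  # still works
print("slice on unsorted duplicates ok:", slice_ok)
print(type(mask_rows).__name__, mask_rows.shape)
```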

Solution 4

Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.

Solution 5

You wrote in a comment to joris' answer:

"I don't understand the design decision for single rows to get converted into a series - why not a data frame with one row?"

A single row isn't converted into a Series.
It IS a Series. (No, in fact I no longer think so; see the edit below.)

The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.

http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure

The data model of pandas objects was chosen that way. The reason certainly lies in some advantages it provides that I don't know of (I don't fully understand the last sentence of the quotation; maybe it is the reason).

Edit: I no longer agree with my earlier self.

A DataFrame can't literally be composed of Series elements, because the following code gives the same type, Series, for a row as well as for a column:

import pandas as pd

df = pd.DataFrame(data=[11, 12, 13], index=[2, 3, 3])

print('-------- df -------------')
print(df)

# A row selected by a unique label
print('\n------- df.loc[2] --------')
print(df.loc[2])
print('type(df.loc[2]) : ', type(df.loc[2]))

# A column selected by name
print('\n--------- df[0] ----------')
print(df[0])
print('type(df[0]) : ', type(df[0]))

result

-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) :  <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) :  <class 'pandas.core.series.Series'>

So it makes no sense to claim that a DataFrame is composed of Series, because what would those Series be: columns or rows? A poorly posed question, and a poor way of seeing things.

Then what is a DataFrame ?

In the previous version of this answer, I asked this question while trying to answer the "Why is that?" part of the OP's question, and the similar query in one of his comments, "single rows to get converted into a series - why not a data frame with one row?". The "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.

Since the pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the why would be found in the nature of the DataFrame structure.

However, I realized that this advice must not be taken as a precise description of the nature of pandas data structures.
It doesn't mean that a DataFrame is a container of Series.
It means that picturing a DataFrame as a container of Series (either rows or columns, depending on what you are reasoning about at the moment) is a good way to think about DataFrames, even if it isn't strictly the case in reality. "Good" here means that this picture lets you use DataFrames efficiently. That's all.

Then what is a DataFrame object ?

The DataFrame class produces instances whose structure originates in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this holds for pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from NDFrame only.

# with pandas 0.12 (Python 2 print syntax)

from pandas import Series
print 'Series  :\n',Series
print 'Series.__bases__  :\n',Series.__bases__

from pandas import DataFrame
print '\nDataFrame  :\n',DataFrame
print 'DataFrame.__bases__  :\n',DataFrame.__bases__

print '\n-------------------'

from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__  :\n',NDFrame.__bases__

from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__  :\n',PandasContainer.__bases__

from pandas.core.base import PandasObject
print '\nPandasObject.__bases__  :\n',PandasObject.__bases__

from pandas.core.base import StringMixin
print '\nStringMixin.__bases__  :\n',StringMixin.__bases__

result

Series  :
<class 'pandas.core.series.Series'>
Series.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)

DataFrame  :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__  :
(<class 'pandas.core.generic.NDFrame'>,)

-------------------

NDFrame.__bases__  :
(<class 'pandas.core.generic.PandasContainer'>,)

PandasContainer.__bases__  :
(<class 'pandas.core.base.PandasObject'>,)

PandasObject.__bases__  :
(<class 'pandas.core.base.StringMixin'>,)

StringMixin.__bases__  :
(<type 'object'>,)
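
On a recent pandas and Python 3, a similar check can be sketched as follows. It assumes that NDFrame is still importable from the private module pandas.core.generic, which is not part of the public API and may move between versions. It does not rely on PandasContainer or StringMixin from the 0.12 listing above.

```python
import numpy as np
import pandas as pd
from pandas.core.generic import NDFrame  # private module; location is an assumption

# Neither class subclasses numpy.ndarray on current versions.
series_is_ndarray = issubclass(pd.Series, np.ndarray)
frame_is_ndarray = issubclass(pd.DataFrame, np.ndarray)

# Both derive from the shared NDFrame base class.
both_are_ndframe = issubclass(pd.Series, NDFrame) and issubclass(pd.DataFrame, NDFrame)

print("Series is an ndarray subclass   :", series_is_ndarray)
print("DataFrame is an ndarray subclass:", frame_is_ndarray)
print("both derive from NDFrame        :", both_are_ndframe)

# Full method resolution order, for comparison with the 0.12 listing.
print([cls.__name__ for cls in pd.Series.__mro__])
print([cls.__name__ for cls in pd.DataFrame.__mro__])
```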

So my understanding is now that a DataFrame instance has certain methods that have been crafted in order to control the way data are extracted from rows and columns.

How these extraction methods work is described on this page: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
It covers the method given by Dan Allan as well as others.

Why were these extraction methods designed the way they were?
Most likely because they were judged to offer the best possibilities and the most convenience for data analysis.
That is precisely what this sentence expresses:

The best way to think about the pandas data structures is as flexible containers for lower dimensional data.

The why of how data is extracted from a DataFrame instance doesn't lie in its structure; it lies in the why of that structure. I guess that the structure and functioning of the pandas data structures have been chiseled to be as intuitive as possible, and that to understand the details one must read Wes McKinney's blog.

Author: jobevers

Updated on May 07, 2022

Comments

  • jobevers
    jobevers about 2 years

    In pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a DataFrame.

    Why is that? Is there a way to ensure I always get back a data frame?

    In [1]: import pandas as pd
    
    In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
    
    In [3]: type(df.loc[3])
    Out[3]: pandas.core.frame.DataFrame
    
    In [4]: type(df.loc[1])
    Out[4]: pandas.core.series.Series
    
  • jobevers
    jobevers over 10 years
    That's the behavior I would expect. I don't understand the design decision for single rows to get converted into a series - why not a data frame with one row?
  • joris
    joris over 10 years
    Ah, why selecting a single row returns a Series, I don't really know.
  • joris
    joris over 10 years
    Indeed, this is cleaner than my options :-)
  • jobevers
    jobevers over 10 years
    Thanks. Worth noting that this returns a DataFrame even if the label isn't in the index.
  • Jeff
    Jeff over 10 years
    FYI, with a non-duplicate index and a single indexer (e.g. a single label), you will ALWAYS get back a Series; it's only because you have duplicates in the index that it is a DataFrame.
  • Jeff
    Jeff over 10 years
    FYI, DataFrame is NOT an ndarray subclass, and neither is a Series (starting with 0.13; prior to that it was, though). These are more dict-like than anything.
  • eyquem
    eyquem over 10 years
    Thank you for informing me. I really appreciate it because I am new to pandas. But I need more information to understand well. Why do the docs say that a Series is a subclass of ndarray?
  • Jeff
    Jeff over 10 years
    it was before 0.13 (releasing shortly), here are dev docs: pandas.pydata.org/pandas-docs/dev/dsintro.html#series
  • eyquem
    eyquem over 10 years
    OK. Thank you very much. However, it doesn't change the basis of my reasoning and understanding, does it? In pandas versions before 0.13, what are DataFrame and the other pandas objects besides Series subclasses of?
  • eyquem
    eyquem over 10 years
    @Jeff Thank you. I modified my answer in light of your information. I would be pleased to know what you think of my edit.
  • Paul Oyster
    Paul Oyster over 9 years
    Note that there is yet another gotcha: if using the suggested workaround, and there are no matching rows, the result will be a DataFrame with a single row, all NaN.
  • Dan Allan
    Dan Allan over 9 years
    Paul, what version of pandas are you using? On the latest version, I get a KeyError when I try .loc[[nonexistent_label]].
  • Wouter
    Wouter over 5 years
    If you are selecting both on the index and the columns then the loc requires 2 lists to get a dataframe rather than a series: df.loc[ [indexlist], [columnlist] ] (even if the list contains just a single item).
  • Shoonya
    Shoonya about 5 years
    very useful observation @Jeff . adding to it, the index also needs to be sorted.
  • smci
    smci almost 5 years
    Beware that this takes a copy of the original df.
  • Jonathan
    Jonathan almost 5 years
    Using a list in .loc is much slower than without it. To stay readable but be much faster, use df.loc[1:1] instead.
  • Willem
    Willem over 3 years
    In my opinion this answer is misleading. Both df['column'] and df.loc[:, 'column'] will return a Series. Both df[['column']] and df.loc[:, ['column']] will return a DataFrame. The difference lies not in loc, but in whether you are using a string ('column') or a list (['column']) to index. So the answer is really: string --> Series, list --> DataFrame. loc has nothing to do with it.
  • MrR
    MrR about 3 years
    This is incorrect. df.loc[:, ["col_name"]] would return a series if only one row is selected.
  • Colin Anthony
    Colin Anthony about 3 years
    right, if the dataframe consisted of just a single row, since the : selects all rows
  • MrR
    MrR about 3 years
    so since one is concerned with the type of the results, perhaps you could add different sections specifying that the type is different depending on the cardinality of the results.
  • ghukill
    ghukill over 2 years
    The syntax result = df[df.index == idx] is a really nice option; fit my purposes perfectly.
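
Several points raised in the comments above can be checked with a short sketch. The frame below is hypothetical, and the KeyError behavior for a missing label inside a list is what I would expect on recent pandas versions; older releases reportedly behaved differently, as the comments suggest.

```python
import pandas as pd

# Hypothetical frame with a unique index and two columns.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=[1, 2, 3])

# A scalar on either axis drops that axis; lists on both axes keep a DataFrame.
rows_list_col_scalar = df.loc[[1], "x"]
row_scalar_cols_list = df.loc[1, ["x"]]
both_lists = df.loc[[1], ["x"]]

# A label that is missing from the index, passed inside a list.
try:
    df.loc[[99]]
    missing_raises = False
except KeyError:
    missing_raises = True

# The boolean mask returns an empty DataFrame instead of raising.
empty = df[df.index == 99]

print(type(rows_list_col_scalar).__name__)
print(type(row_scalar_cols_list).__name__)
print(type(both_lists).__name__)
print("missing label raises KeyError:", missing_raises)
print(type(empty).__name__, empty.shape)
```

If a lookup may legitimately match nothing, the mask form is the one that keeps returning a DataFrame in this sketch.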