Pandas selecting by label sometimes returns Series, sometimes returns DataFrame
Solution 1
Granted, the behavior is inconsistent, but I think it's easy to imagine cases where this is convenient. Anyway, to get a DataFrame every time, just pass a list to loc. There are other ways, but in my opinion this is the cleanest.
In [2]: type(df.loc[[3]])
Out[2]: pandas.core.frame.DataFrame
In [3]: type(df.loc[[1]])
Out[3]: pandas.core.frame.DataFrame
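For a self-contained check, the sketch below rebuilds the frame from the question and compares the scalar and list forms of loc. The variable names `single` and `repeated` are only illustrative.

```python
import pandas as pd

# Frame from the question: label 3 appears three times, labels 1 and 2 once.
df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

# Scalar label: the result type depends on whether the label is duplicated.
print(type(df.loc[1]).__name__)    # Series
print(type(df.loc[3]).__name__)    # DataFrame

# List of labels: a DataFrame in both cases.
single = df.loc[[1]]
repeated = df.loc[[3]]
print(single.shape, repeated.shape)  # (1, 1) (3, 1)
```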
Solution 2
The TLDR
When using loc
df.loc[:] = DataFrame
df.loc[int] = Series if the label appears only once in the index (the row comes back as a Series indexed by the column names), DataFrame if the label is duplicated in the index
df.loc[:, ["col_name"]] = DataFrame, whatever the number of rows
df.loc[:, "col_name"] = Series
Not using loc
df["col_name"] = Series
df[["col_name"]] = DataFrame
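One way to see which type each expression returns on your own pandas version is simply to run it. A minimal sketch, assuming a hypothetical two-column frame with one unique label ("x") and one duplicated label ("y"); neither the frame nor the `results` dictionary comes from the answer:

```python
import pandas as pd

# Hypothetical frame: unique label "x", duplicated label "y", two columns.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]}, index=["x", "y", "y"])

results = {
    'df.loc[:]': df.loc[:],
    'df.loc["x"]': df.loc["x"],            # scalar label, unique in the index
    'df.loc["y"]': df.loc["y"],            # scalar label, duplicated in the index
    'df.loc[:, "a"]': df.loc[:, "a"],      # scalar column label
    'df.loc[:, ["a"]]': df.loc[:, ["a"]],  # list of column labels
    'df["a"]': df["a"],
    'df[["a"]]': df[["a"]],
}

# Print the type name next to each expression.
for expr, value in results.items():
    print(f"{expr:<18} -> {type(value).__name__}")
```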
Solution 3
Your index contains the label 3 three times. For this reason df.loc[3] will return a DataFrame.
The reason is that you don't specify the column. So df.loc[3] selects three items of all columns (here only column 0), while df.loc[3, 0] will return a Series. Likewise, df.loc[1:2] also returns a DataFrame, because you slice the rows.
Selecting a single row (as df.loc[1]) returns a Series with the column names as the index.
If you want to be sure to always get a DataFrame, you can slice like df.loc[1:1]. Another option is boolean indexing (df.loc[df.index == 1]) or the take method (df.take([0]), but this uses positions, not labels!).
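The three alternatives can be compared side by side. This is only a sketch on the frame from the question; the names `by_slice`, `by_mask` and `by_position` are made up for the illustration, and the label slice assumes a sorted index, as is the case here.

```python
import pandas as pd

df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])

by_slice = df.loc[1:1]            # label slice, both ends inclusive
by_mask = df.loc[df.index == 1]   # boolean mask built from the index
by_position = df.take([0])        # positional: the first row, not label 0

# Each alternative yields a one-row DataFrame for the unique label 1.
for result in (by_slice, by_mask, by_position):
    print(type(result).__name__, result.shape)
```

Note that take([0]) only matches here because label 1 happens to sit at position 0.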
Solution 4
Use df['columnName'] to get a Series and df[['columnName']] to get a DataFrame.
Solution 5
You wrote in a comment to joris' answer:
"I don't understand the design decision for single rows to get converted into a series - why not a data frame with one row?"
A single row isn't converted into a Series.
It IS a Series. (No, in fact I no longer think so; see the edit.)
The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For example, DataFrame is a container for Series, and Panel is a container for DataFrame objects. We would like to be able to insert and remove objects from these containers in a dictionary-like fashion.
http://pandas.pydata.org/pandas-docs/stable/overview.html#why-more-than-1-data-structure
The data model of pandas objects was chosen that way. The reason surely is that it brings advantages I am not aware of (I don't fully understand the last sentence of the citation; maybe that is the reason).
Edit: I no longer agree with myself
A DataFrame can't be composed of elements that are Series, because the following code gives the same type, Series, for a row as well as for a column:
import pandas as pd
df = pd.DataFrame(data=[11, 12, 13], index=[2, 3, 3])
print('-------- df -------------')
print(df)
print('\n------- df.loc[2] --------')
print(df.loc[2])  # one row, selected by its unique label
print('type(df.loc[2]) :', type(df.loc[2]))
print('\n--------- df[0] ----------')
print(df[0])  # one column
print('type(df[0]) :', type(df[0]))
result
-------- df -------------
    0
2  11
3  12
3  13

------- df.loc[2] --------
0    11
Name: 2, dtype: int64
type(df.loc[2]) : <class 'pandas.core.series.Series'>

--------- df[0] ----------
2    11
3    12
3    13
Name: 0, dtype: int64
type(df[0]) : <class 'pandas.core.series.Series'>
So it makes no sense to claim that a DataFrame is composed of Series, because what would those Series be: columns or rows? A meaningless question, and a mistaken view.
Then what is a DataFrame?
In the previous version of this answer I asked this question while trying to answer the "Why is that?" part of the OP's question, and the similar query in one of his comments, "single rows to get converted into a series - why not a data frame with one row?", while the "Is there a way to ensure I always get back a data frame?" part has been answered by Dan Allan.
Since the pandas docs cited above say that the pandas data structures are best seen as containers of lower dimensional data, it seemed to me that the why would be found in the nature of the DataFrame structure.
However, I realized that this advice must not be taken as a precise description of the nature of pandas data structures.
It doesn't mean that a DataFrame is a container of Series.
It means that picturing a DataFrame as a container of Series (either rows or columns, depending on the view taken at a given point in the reasoning) is a good way to think about DataFrames, even if it isn't strictly the case in reality. "Good" here means that this view lets one use DataFrames efficiently. That's all.
Then what is a DataFrame object?
The DataFrame class produces instances whose structure originates in the NDFrame base class, itself derived from the PandasContainer base class, which is also a parent class of the Series class.
Note that this is correct for pandas up to version 0.12. In the upcoming version 0.13, Series will also derive from NDFrame, and from NDFrame only.
# with pandas 0.12
from pandas import Series
print 'Series :\n',Series
print 'Series.__bases__ :\n',Series.__bases__
from pandas import DataFrame
print '\nDataFrame :\n',DataFrame
print 'DataFrame.__bases__ :\n',DataFrame.__bases__
print '\n-------------------'
from pandas.core.generic import NDFrame
print '\nNDFrame.__bases__ :\n',NDFrame.__bases__
from pandas.core.generic import PandasContainer
print '\nPandasContainer.__bases__ :\n',PandasContainer.__bases__
from pandas.core.base import PandasObject
print '\nPandasObject.__bases__ :\n',PandasObject.__bases__
from pandas.core.base import StringMixin
print '\nStringMixin.__bases__ :\n',StringMixin.__bases__
result
Series :
<class 'pandas.core.series.Series'>
Series.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>, <type 'numpy.ndarray'>)
DataFrame :
<class 'pandas.core.frame.DataFrame'>
DataFrame.__bases__ :
(<class 'pandas.core.generic.NDFrame'>,)
-------------------
NDFrame.__bases__ :
(<class 'pandas.core.generic.PandasContainer'>,)
PandasContainer.__bases__ :
(<class 'pandas.core.base.PandasObject'>,)
PandasObject.__bases__ :
(<class 'pandas.core.base.StringMixin'>,)
StringMixin.__bases__ :
(<type 'object'>,)
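The hierarchy printed above is specific to pandas 0.12. A minimal sketch for inspecting the bases on whatever version is installed, assuming a pandas release from 0.13 onward in which NDFrame is still a common ancestor; the helper `base_names` is hypothetical and introduced only for this illustration:

```python
import numpy as np
import pandas as pd

def base_names(cls):
    """Return the class names found in cls's method resolution order."""
    return [base.__name__ for base in cls.__mro__]

print(base_names(pd.Series))
print(base_names(pd.DataFrame))

# Check for a shared NDFrame ancestor and for any ndarray inheritance.
print("NDFrame" in base_names(pd.Series), "NDFrame" in base_names(pd.DataFrame))
print(issubclass(pd.Series, np.ndarray), issubclass(pd.DataFrame, np.ndarray))
```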
So my understanding is now that a DataFrame instance has certain methods that have been crafted to control the way data are extracted from rows and columns.
How these extraction methods work is described on this page:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
It covers the method given by Dan Allan as well as other methods.
Why were these extraction methods designed the way they were?
Surely because they were judged to offer the best possibilities and the most ease in data analysis.
That is precisely what this sentence expresses:
The best way to think about the pandas data structures is as flexible containers for lower dimensional data.
The why of data extraction from a DataFrame instance doesn't lie in its structure; it lies in the why of that structure. I guess that the structure and functioning of the pandas data structures have been chiseled to be as intellectually intuitive as possible, and that to understand the details one must read Wes McKinney's blog.
jobevers
Updated on May 07, 2022
Comments
-
jobevers about 2 years
In Pandas, when I select a label that has only one entry in the index I get back a Series, but when I select a label that has more than one entry I get back a data frame.
Why is that? Is there a way to ensure I always get back a data frame?
In [1]: import pandas as pd
In [2]: df = pd.DataFrame(data=range(5), index=[1, 2, 3, 3, 3])
In [3]: type(df.loc[3])
Out[3]: pandas.core.frame.DataFrame
In [4]: type(df.loc[1])
Out[4]: pandas.core.series.Series
-
jobevers over 10 years
That's the behavior I would expect. I don't understand the design decision for single rows to get converted into a series - why not a data frame with one row?
-
joris over 10 years
Ah, why selecting a single row returns a Series, I don't really know.
-
joris over 10 years
Indeed, this is cleaner than my options :-)
-
jobevers over 10 years
Thanks. Worth noting that this returns a DataFrame even if the label isn't in the index.
-
Jeff over 10 years
FYI, with a non-duplicate index and a single indexer (e.g. a single label), you will ALWAYS get back a Series; it's only because you have duplicates in the index that it is a DataFrame.
-
Jeff over 10 years
FYI, DataFrame is NOT an ndarray subclass, and neither is a Series (starting with 0.13; prior to that it was, though). These are more dict-like than anything.
-
eyquem over 10 years
Thank you for informing me. I really appreciate it because I am new to pandas. But I need more information to understand well. Why is it written in the docs that a Series is a subclass of ndarray?
-
Jeff over 10 years
It was before 0.13 (releasing shortly); here are the dev docs: pandas.pydata.org/pandas-docs/dev/dsintro.html#series
-
eyquem over 10 years
OK. Thank you very much. However, it doesn't change the basis of my reasoning and understanding, does it? In pandas before 0.13, what are DataFrame and the other pandas objects besides Series subclasses of?
-
eyquem over 10 years
@Jeff Thank you. I modified my answer after your information. I would be pleased to know what you think of my edit.
-
Paul Oyster over 9 years
Note that there is yet another gotcha: if using the suggested workaround, and there are no matching rows, the result will be a DataFrame with a single row, all NaN.
-
Dan Allan over 9 years
Paul, what version of pandas are you using? On the latest version, I get a KeyError when I try .loc[[nonexistent_label]].
-
Wouter over 5 years
If you are selecting on both the index and the columns, then loc requires two lists to get a DataFrame rather than a Series: df.loc[[indexlist], [columnlist]] (even if a list contains just a single item).
-
Shoonya about 5 years
Very useful observation, @Jeff. Adding to it, the index also needs to be sorted.
-
smci almost 5 years
Beware that this takes a copy of the original df.
-
Jonathan almost 5 years
Using a list in .loc is much slower than without it. To stay readable but be much faster, better use df.loc[1:1]
-
Willem over 3 years
In my opinion this answer is misleading. Both df['column'] and df.loc[:, 'column'] will return a Series. Both df[['column']] and df.loc[:, ['column']] will return a DataFrame. The difference lies not in loc, but in whether you are using a string ('column') or a list (['column']) to index. So the answer is really: string --> Series, list --> DataFrame. loc has nothing to do with it.
-
MrR about 3 years
This is incorrect. df.loc[:, ["col_name"]] would return a series if only one row is selected.
-
Colin Anthony about 3 years
Right, if the dataframe consisted of just a single row, since the : selects all rows
-
MrR about 3 years
So since one is concerned with the type of the results, perhaps you could add different sections specifying that the type is different depending on the cardinality of the results.
-
ghukill over 2 years
The syntax result = df[df.index == idx] is a really nice option; fit my purposes perfectly.
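Several points raised in the comments can be checked with a short script. This is a sketch only, assuming a recent pandas release (older versions behaved differently for missing labels, as the exchange above suggests); the two-column frame and the names `cell`, `missing` and `raised` are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5, 10)}, index=[1, 2, 3, 3, 3])

# A row list plus a column list gives a DataFrame, even for a single cell.
cell = df.loc[[1], ["a"]]
print(type(cell).__name__, cell.shape)

# A boolean mask on the index does not raise; no match gives an empty DataFrame.
missing = df[df.index == 99]
print(type(missing).__name__, len(missing))

# A list containing an absent label raises KeyError on recent pandas versions.
try:
    df.loc[[99]]
    raised = False
except KeyError:
    raised = True
print(raised)
```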