Convert string to dict, then access key:values??? How to access data in a <class 'dict'> for Python?

24,841

Solution 1

Just ran into this problem. My solution:

import ast
import pandas as pd

df = pd.DataFrame(["{u'type': u'Point', u'coordinates': [-43,144]}","{u'type': u'Point', u'coordinates': [-34,34]}","{u'type': u'Point', u'coordinates': [-102,344]}"],columns=["Coordinates"])

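# Make sure every value is a string, then parse each one into a dict
df = df["Coordinates"].astype('str')
df = df.apply(ast.literal_eval)
# Expand the dicts so that each key becomes its own column
df = df.apply(pd.Series)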
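
The errors reported in the comments ("'bool' object is not subscriptable", "expected string or bytes-like object") suggest that not every row holds a dict string. As a minimal sketch, assuming that is the cause, a small wrapper can return None for anything that is not a string or does not parse to a dict. The helper name parse_dict_string and the sample rows are hypothetical.

```python
import ast
import pandas as pd

def parse_dict_string(value):
    """Return the dict encoded in `value`, or None if there is none."""
    if not isinstance(value, str):
        return None  # e.g. NaN from an empty cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None  # not a valid Python literal
    # Strings such as "False" parse fine but do not produce a dict
    return parsed if isinstance(parsed, dict) else None

# Hypothetical sample: one good row, one bool-like string, one missing value
raw = pd.Series([
    "{u'type': u'Point', u'coordinates': [-43, 144]}",
    "False",
    float("nan"),
])

parsed = raw.apply(parse_dict_string)
coords = parsed.apply(lambda d: d["coordinates"] if isinstance(d, dict) else None)
```

Rows that could not be parsed end up as missing values, so they can be inspected or dropped afterwards instead of raising in the middle of apply.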

Solution 2

My first instinct is to use json.loads to turn the strings into dicts. But the example you've posted is not valid JSON: it uses single quotes instead of double quotes, and the strings carry a u prefix. So you have to convert the strings first.

A second option is to just use regex to parse the strings. If the dict strings in your actual DataFrame do not exactly match my examples, I expect the regex method to be more robust since lat/long coords are fairly standard.

import json
import re
import pandas as pd

df = pd.DataFrame(data={'Coordinates':["{u'type': u'Point', u'coordinates': [-43.30175, 123.45]}",
    "{u'type': u'Point', u'coordinates': [-51.17913, 123.45]}"],
    'idx': [130, 278]})


##
# Solution 1- use json.loads
##

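def string_to_dict(dict_string):
    # Convert to proper JSON: double quotes, no u prefix
    # (naive: also alters any value that contains ' or ends in u)
    dict_string = dict_string.replace("'", '"').replace('u"', '"')
    return json.loads(dict_string)

# Use bracket assignment; attribute assignment does not create a new column
df['CoordDicts'] = df.Coordinates.apply(string_to_dict)
df.CoordDicts[0]['coordinates']
#>>> [-43.30175, 123.45]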


##
# Solution 2 - use regex
##
def get_lat_lon(dict_string):
    # Get the coordinates string with regex (raw string keeps the backslashes literal)
    rs = re.search(r"(-?\d+(\.\d+)?),\s*(-?\d+(\.\d+)?)", dict_string).group()
    # Cast to floats
    coords = [float(x) for x in rs.split(',')]
    return coords

# Use bracket assignment; attribute assignment does not create a new column
df['Coords'] = df.Coordinates.apply(get_lat_lon)
df.Coords[0]
#>>> [-43.30175, 123.45]
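
The same regex idea can also be applied to the whole column at once with Series.str.extract, which returns one column per capture group. This is only a sketch: the column names "first" and "second" are hypothetical, and the pattern uses non-capturing groups for the decimal part so that exactly two groups are returned.

```python
import pandas as pd

df = pd.DataFrame({
    "Coordinates": [
        "{u'type': u'Point', u'coordinates': [-43.30175, 123.45]}",
        "{u'type': u'Point', u'coordinates': [-51.17913, 123.45]}",
    ]
})

# Two capture groups: the number before the comma and the number after it
pattern = r"(-?\d+(?:\.\d+)?),\s*(-?\d+(?:\.\d+)?)"

# str.extract gives a DataFrame of strings; convert them to floats
pairs = df["Coordinates"].str.extract(pattern).astype(float)
pairs.columns = ["first", "second"]  # hypothetical names
```

Rows where the pattern does not match come back as NaN rather than raising, which is the main difference from calling re.search(...).group() row by row.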

Solution 3

Assuming you start with a Series of dicts, you can use the .tolist() method to create a list of dicts and use this as input for a DataFrame. This approach will map each distinct key to a column.

You can filter by keys on creation by setting the columns argument in pd.DataFrame(), giving you the neat one-liner below. Hope that helps.

import pandas as pd

# Starting assumption:
data = ["{'coordinates': [-43.301755, -22.990065], 'type': 'Point', 'elevation': 1000}",
        "{'coordinates': [-51.17913026, -30.01201896], 'type': 'Point'}"]
# eval executes arbitrary code; only use it on strings you trust
s = pd.Series(data).apply(eval)

# Create a DataFrame with a list of dicts with a selection of columns
pd.DataFrame(s.tolist(), columns=['coordinates'])
Out[1]: 
                    coordinates
0      [-43.301755, -22.990065]
1  [-51.17913026, -30.01201896]
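
A variation on the approach above, sketched under two assumptions: ast.literal_eval is swapped in for eval (it only accepts Python literals), and each coordinates list has exactly two entries. The column names "x" and "y" are hypothetical.

```python
import ast
import pandas as pd

data = [
    "{'coordinates': [-43.301755, -22.990065], 'type': 'Point', 'elevation': 1000}",
    "{'coordinates': [-51.17913026, -30.01201896], 'type': 'Point'}",
]

# Parse each string into a dict without executing arbitrary code
s = pd.Series(data).apply(ast.literal_eval)

# One column per distinct key; keys missing from a dict become NaN
frame = pd.DataFrame(s.tolist())

# Split the two-element coordinate lists into separate columns
xy = pd.DataFrame(frame["coordinates"].tolist(), columns=["x", "y"], index=frame.index)
frame = frame.join(xy)
```

After this, frame has the columns coordinates, type, elevation, x and y, with elevation missing in the second row.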
Author: Linwoodc3

Updated on July 05, 2022

Comments

  • Linwoodc3
    Linwoodc3 almost 2 years

    I am having issues accessing data inside a dictionary.

    Sys: Macbook 2012
    Python: Python 3.5.1 :: Continuum Analytics, Inc.

    I am working with a dask.dataframe created from a csv.

    Edit Question

    How I got to this point

    Assume I start out with a Pandas Series:

    df.Coordinates
    130      {u'type': u'Point', u'coordinates': [-43.30175...
    278      {u'type': u'Point', u'coordinates': [-51.17913...
    425      {u'type': u'Point', u'coordinates': [-43.17986...
    440      {u'type': u'Point', u'coordinates': [-51.16376...
    877      {u'type': u'Point', u'coordinates': [-43.17986...
    1313     {u'type': u'Point', u'coordinates': [-49.72688...
    1734     {u'type': u'Point', u'coordinates': [-43.57405...
    1817     {u'type': u'Point', u'coordinates': [-43.77649...
    1835     {u'type': u'Point', u'coordinates': [-43.17132...
    2739     {u'type': u'Point', u'coordinates': [-43.19583...
    2915     {u'type': u'Point', u'coordinates': [-43.17986...
    3035     {u'type': u'Point', u'coordinates': [-51.01583...
    3097     {u'type': u'Point', u'coordinates': [-43.17891...
    3974     {u'type': u'Point', u'coordinates': [-8.633880...
    3983     {u'type': u'Point', u'coordinates': [-46.64960...
    4424     {u'type': u'Point', u'coordinates': [-43.17986...
    

    The problem is that this is not a true column of dictionaries. Instead, it's a column full of strings that LOOK like dictionaries. Running this shows it:

    df.Coordinates.apply(type)
    130      <class 'str'>
    278      <class 'str'>
    425      <class 'str'>
    440      <class 'str'>
    877      <class 'str'>
    1313     <class 'str'>
    1734     <class 'str'>
    1817     <class 'str'>
    1835     <class 'str'>
    2739     <class 'str'>
    2915     <class 'str'>
    3035     <class 'str'>
    3097     <class 'str'>
    3974     <class 'str'>
    3983     <class 'str'>
    4424     <class 'str'>
    

    My Goal: Access the coordinates key and value in the dictionary. That's it. But each value is a str.

    I converted the strings to dictionaries using eval.

    new = df.Coordinates.apply(eval)
    130      {'coordinates': [-43.301755, -22.990065], 'typ...
    278      {'coordinates': [-51.17913026, -30.01201896], ...
    425      {'coordinates': [-43.17986794, -22.91000096], ...
    440      {'coordinates': [-51.16376782, -29.95488677], ...
    877      {'coordinates': [-43.17986794, -22.91000096], ...
    1313     {'coordinates': [-49.72688407, -29.33757253], ...
    1734     {'coordinates': [-43.574057, -22.928059], 'typ...
    1817     {'coordinates': [-43.77649254, -22.86940539], ...
    1835     {'coordinates': [-43.17132318, -22.90895217], ...
    2739     {'coordinates': [-43.1958313, -22.98755333], '...
    2915     {'coordinates': [-43.17986794, -22.91000096], ...
    3035     {'coordinates': [-51.01583481, -29.63593292], ...
    3097     {'coordinates': [-43.17891379, -22.96476163], ...
    3974     {'coordinates': [-8.63388008, 41.14594453], 't...
    3983     {'coordinates': [-46.64960938, -23.55902666], ...
    4424     {'coordinates': [-43.17986794, -22.91000096], ...
    

    Next I test the type of each object and get:

    130      <class 'dict'>
    278      <class 'dict'>
    425      <class 'dict'>
    440      <class 'dict'>
    877      <class 'dict'>
    1313     <class 'dict'>
    1734     <class 'dict'>
    1817     <class 'dict'>
    1835     <class 'dict'>
    2739     <class 'dict'>
    2915     <class 'dict'>
    3035     <class 'dict'>
    3097     <class 'dict'>
    3974     <class 'dict'>
    3983     <class 'dict'>
    4424     <class 'dict'>
    

    If I try to access my dictionaries with new.apply(lambda x: x['coordinates']), I get:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-71-c0ad459ed1cc> in <module>()
    ----> 1 dfCombined.Coordinates.apply(coord_getter)
    
    /Users/linwood/anaconda/envs/dataAnalysisWithPython/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
       2218         else:
       2219             values = self.asobject
    -> 2220             mapped = lib.map_infer(values, f, convert=convert_dtype)
       2221 
       2222         if len(mapped) and isinstance(mapped[0], Series):
    
    pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:62658)()
    
    <ipython-input-68-748ce2d8529e> in coord_getter(row)
          1 import ast
          2 def coord_getter(row):
    ----> 3     return (ast.literal_eval(row))['coordinates']
    
    TypeError: 'bool' object is not subscriptable
    

    It's some type of class, because when I run dir I get this for one object:

    new.apply(lambda x: dir(x))[130]
    130           __class__
    130        __contains__
    130         __delattr__
    130         __delitem__
    130             __dir__
    130             __doc__
    130              __eq__
    130          __format__
    130              __ge__
    130    __getattribute__
    130         __getitem__
    130              __gt__
    130            __hash__
    130            __init__
    130            __iter__
    130              __le__
    130             __len__
    130              __lt__
    130              __ne__
    130             __new__
    130          __reduce__
    130       __reduce_ex__
    130            __repr__
    130         __setattr__
    130         __setitem__
    130          __sizeof__
    130             __str__
    130    __subclasshook__
    130               clear
    130                copy
    130            fromkeys
    130                 get
    130               items
    130                keys
    130                 pop
    130             popitem
    130          setdefault
    130              update
    130              values
    Name: Coordinates, dtype: object
    

    My Problem: I just want to access the dictionary. But, the object is <class 'dict'>. How do I convert this to a regular dict or just access the key:value pairs?

    Any ideas??

  • Linwoodc3
    Linwoodc3 almost 8 years
    Thanks for the help @piRSquared, but that gave me the same error. I added more information above. When I run dir on the objects, it's some type of class. Any suggestions?
  • andrew
    andrew almost 8 years
    @Linwoodc3, FYI, on my system, your method of using eval works with my example DataFrame. I am using Python 2.7. Despite the version differences, I expect the regex solution to still work.
  • Linwoodc3
    Linwoodc3 almost 8 years
    Sorry, just came back. Will check!
  • Linwoodc3
    Linwoodc3 almost 8 years
    Got an error again. "TypeError: expected string or bytes-like object"
  • fpersyn
    fpersyn over 4 years
    Note - The dicts in your list do not need to be of the same length for this to work. Dicts may miss multiple keys that are present in other dicts and vice versa. For example, when you run pd.DataFrame(s.tolist()) you will notice that elevation is set to NaN in the second row.
  • szeitlin
    szeitlin over 2 years
    So the string.replace for the quotes, followed by json.loads, works in my case. However, I think this shouldn't happen - in my case the original data was formatted correctly as dictionaries, and only got coerced to strings after I wrote it out to CSV and read it back in.
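
The CSV round trip described in the last comment can be reproduced with a small sketch. It assumes the dicts were written with DataFrame.to_csv, which stores each dict as its str() representation, and it uses the converters argument of pd.read_csv to parse the column back while reading. The sample data and the in-memory buffer are hypothetical stand-ins for a real file.

```python
import ast
import io
import pandas as pd

original = pd.DataFrame({
    "idx": [130, 278],
    "Coordinates": [
        {"type": "Point", "coordinates": [-43.30175, 123.45]},
        {"type": "Point", "coordinates": [-51.17913, 123.45]},
    ],
})

# Write to an in-memory CSV; each dict is stored as its string form
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# A plain read gives the column back as strings
buffer.seek(0)
plain = pd.read_csv(buffer)

# A converter parses each cell of the column while reading
buffer.seek(0)
parsed = pd.read_csv(buffer, converters={"Coordinates": ast.literal_eval})
```

The converter runs on every cell of that column, so an empty cell would make ast.literal_eval raise; a wrapper that handles empty strings would be needed for data with gaps.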