Export data to CSV from MongoDB using Python


So... The problem is that the csv writer doesn't understand the concept of "subdictionaries" as Mongo returns them.

If I understood correctly, when you query Mongo, you get a dictionary like this:

{
   "_id": "a hex ID that correspond with the record that contains several answers",
   "answers": [ ... a list with a bunch of dicts in it... ]
}

So when csv.DictWriter tries to write that, it only writes one dictionary (the topmost). It doesn't know (or care) that answers is a list of dictionaries whose values need to be written to columns as well. Accessing nested fields with dot notation such as answers.order is something only Mongo understands, not the csv writer.
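To see why, here is a minimal sketch (with made-up values, just for illustration) of what happens if you hand one of those records straight to csv.DictWriter: the dotted fieldnames never match any top-level key, and the unexpected answers key is rejected:

import csv
import sys

# A record shaped like the ones Mongo returns (values are made up for illustration)
record = {
    '_id': 'some hex id',
    'answers': [{'order': 0, 'text': {'en': 'Yes'}, 'answerId': 'abc'}]
}

writer = csv.DictWriter(sys.stdout,
                        fieldnames=['_id', 'answers.order', 'answers.text', 'answers.answerId'])
writer.writeheader()
writer.writerow(record)
# ValueError: dict contains fields not in fieldnames: 'answers'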

What I understand you should do is "walk" the list of answers and create one dictionary out of each record (each dictionary) in that list. Once you have a list of "flattened" dictionaries you can pass those and write them to your csv file:

import csv

from pymongo import MongoClient

client = MongoClient()  # assumes a local MongoDB instance; adjust the connection to your setup

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)

Watch for the use of plurals: answers_record is different from answer_record.

That creates a file like this:

$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
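Note that the answers.text column still contains a whole dict, because text is itself nested. If you only want, say, the English string, you could pull it out while flattening. A sketch (assuming every answer has an 'en' key, which may not match your data):

flattened_record = {
    '_id': answers_record_id,
    'answers.order': answer_record['order'],
    'answers.text.en': answer_record['text'].get('en', ''),  # hypothetical 'en' key; adjust to your schema
    'answers.answerId': answer_record['answerId']
}

(If you do this, list 'answers.text.en' in fields instead of 'answers.text'.)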

EDIT:

Your query (the one that builds cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})) will return all the entries in the questions collection. If this collection is very large, you might want to use the cursor as an iterator.

As you might have already realized, the first for loop in my code above puts all the records in a list (the flattened_records list). You can do lazy loading by iterating through the cursor instead: rather than loading all the items into memory, fetch one, do something with it, get the next, do something with it...

It's slightly slower, but more memory efficient.

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)

It will produce the same .csv file as shown above.


Comments

  • user7070824 almost 2 years

    I am having problems exporting to CSV with a Python script. Some array data needs to be exported from MongoDB to CSV, but the following script does not export it properly because the three subfield values are dumped into a single column. I want to separate the three fields (order, text, answerId) under the answers field into three different columns in the CSV.

    A sample of the MongoDB document:

    "answers": [
            {
                "order": 0,
                "text": {
                    "en": "Yes"
                },
                "answerId": "527d65de7563dd0fb98fa28c"
            },
            {
                "order": 1,
                "text": {
                    "en": "No"
                },
                "answerId": "527d65de7563dd0fb98fa28b"
            }
        ]
    

    The Python script:

    import csv
    cursor = db.questions.find ({},{'_id':1, 'answers.order':1, 'answers.text':1, 'answers.answerId':1})
    cursor = list(cursor)
    with open('answer_2.csv', 'w') as outfile:   
    
        fields = ['_id','answers.order', 'answers.text', 'answers.answerid']
        write = csv.DictWriter(outfile, fieldnames=fields)
        write.writeheader()
        for x in cursor: 
            for y, v in x.iteritems():
                if y == 'answers'
                    print (y, v)
                    write.writerow(v)
                    write.writerow(x)
    
  • user7070824 over 7 years
    It worked. It is awesome! I got four separate fields in the CSV after running the script. Thanks so much.
  • BorrajaX over 7 years
    Glad I was able to help! By the way, if my answer solved your problem, feel free to click the big checkbox to its left to accept it as the answer (SO netiquette: meta.stackexchange.com/questions/14994/…)
  • user7070824 over 7 years
    If anyone wants to use the script, make sure to put a blank line between step 1 and step 2. It works perfectly. Thanks for your help.
  • BorrajaX over 7 years
    @user7070824, I added a "bonus" in my answer (how to make it more memory efficient)
  • user7070824 over 7 years
    The second script worked too. I tested it. @BorrajaX, I already posted another problem. Could you have a look if you have time? The issue is that the type field cannot be added into a list in the Python script, because another collection, stages, has field names that do not match in every record. Thanks again.
  • VGupta over 6 years
    What if we have one more nested document like 'answers' here?
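For the deeper-nesting case asked about in the last comment, one possible approach (just a sketch, reusing the cursor and write objects from the snippets above) is a small recursive helper that turns nested dicts into dotted keys; lists like answers still need to be walked explicitly, as in the loops above:

def flatten(doc, prefix=''):
    # Turn nested dicts into dotted keys, e.g. {'text': {'en': 'Yes'}} -> {'text.en': 'Yes'}
    flat = {}
    for key, value in doc.items():
        dotted = '%s.%s' % (prefix, key) if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat

for answers_record in cursor:
    for answer_record in answers_record['answers']:
        row = flatten(answer_record, 'answers')  # e.g. answers.order, answers.text.en, answers.answerId
        row['_id'] = answers_record['_id']
        write.writerow(row)                      # fieldnames must list every dotted key you expect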