Export data to CSV from MongoDB using Python
So... the problem is that the csv
writer doesn't understand the concept of "sub-dictionaries" as Mongo returns them.
If I understood correctly, when you query Mongo, you get a dictionary like this:
{
    "_id": "a hex ID that corresponds to the record that contains several answers",
    "answers": [ ... a list with a bunch of dicts in it ... ]
}
So when the csv.DictWriter
tries to write that, it only writes one dictionary (the topmost). It doesn't know (or care) that answers
is a list containing dictionaries whose values need to be written in columns as well (accessing fields in dictionaries using dot notation, such as answers.order,
is only understood by Mongo, not by the csv writer).
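To see the problem concretely, here is a minimal, self-contained demonstration (with made-up data, written to an in-memory buffer instead of a real file): DictWriter writes the raw repr of any nested value into a single cell; it never descends into lists or sub-dictionaries.

```python
import csv
import io

# Hypothetical record shaped like what Mongo returns
record = {'_id': 'abc123', 'answers': [{'order': 0, 'text': 'Yes'}]}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['_id', 'answers'])
writer.writeheader()
writer.writerow(record)

# The 'answers' cell contains the whole list's repr, not separate columns
print(buf.getvalue())
```

The `answers` column ends up holding the string `[{'order': 0, 'text': 'Yes'}]` verbatim, which is exactly the "dumped into a column" behavior described in the question.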
What I understand you should do is "walk" the list of answers and create one dictionary out of each record (each dictionary) in that list. Once you have a list of "flattened" dictionaries, you can pass those along and write them to your csv
file:
import csv

# `client` is assumed to be a pymongo.MongoClient instance
cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)
Watch out for the use of plurals: answers_record
is different from answer_record.
That creates a file like this:
$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
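Note that the answers.text column contains the raw repr of the text sub-dictionary ({u'en': u'Yes'}). If you only want the English string, you can pull out the 'en' key while flattening. A small sketch (assuming every answer's text dict has an 'en' key; the helper name flatten_answer is made up for illustration):

```python
def flatten_answer(record_id, answer):
    """Flatten one entry of the `answers` list into a single flat dict."""
    return {
        '_id': record_id,
        'answers.order': answer['order'],
        # .get() so records without an 'en' key yield '' instead of a KeyError
        'answers.text': answer['text'].get('en', ''),
        'answers.answerId': answer['answerId'],
    }

answer = {'order': 0, 'text': {'en': 'Yes'}, 'answerId': '527d65de7563dd0fb98fa28c'}
print(flatten_answer('580f9aa82de54705a2520833', answer))
```

With that change the answers.text column would contain `Yes` instead of the dict repr.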
EDIT:
Your query (the one that creates the cursor: cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1}))
will return all the entries in the questions
collection. If this collection is very large, you might want to use the cursor
as an iterator.
As you might have already realized, the first for
loop in my code above puts all the records in a list (the flattened_records
list). You can do lazy loading by iterating through the cursor
instead of loading all the items into memory: fetch one record, do something with it, fetch the next, and so on.
It's slightly slower, but more memory efficient.
cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)
It will produce the same .csv
file as shown above.
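The flattening above hard-codes the three answer fields. If your documents nest deeper (say, more sub-documents inside answers), the same idea can be generalized. Here is a hedged sketch of a recursive flattener (the function name flatten and the sample doc are made up for illustration; it assumes lists contain dicts and returns one flat row per list element):

```python
def flatten(doc, prefix=''):
    """Return a list of flat dicts (dotted keys) for one document.
    A list of sub-documents multiplies the rows: one row per element."""
    rows = [{}]
    for key, value in doc.items():
        dotted = prefix + key
        if isinstance(value, dict):
            # Nested dict: recurse and merge its keys into every row
            sub_rows = flatten(value, dotted + '.')
            rows = [dict(r, **s) for r in rows for s in sub_rows]
        elif isinstance(value, list) and value and isinstance(value[0], dict):
            # List of sub-documents: one output row per element
            sub_rows = [s for item in value for s in flatten(item, dotted + '.')]
            rows = [dict(r, **s) for r in rows for s in sub_rows]
        else:
            # Scalar (or empty/non-dict list): copy as-is into every row
            for r in rows:
                r[dotted] = value
    return rows

doc = {'_id': 'x', 'answers': [{'order': 0, 'text': {'en': 'Yes'}},
                               {'order': 1, 'text': {'en': 'No'}}]}
for row in flatten(doc):
    print(row)
```

Each printed row is a flat dict like {'_id': 'x', 'answers.order': 0, 'answers.text.en': 'Yes'}, which you can feed straight to csv.DictWriter with the dotted keys as fieldnames.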
user7070824
Updated on June 28, 2022

Comments

-
user7070824 almost 2 years
I am having problems exporting to CSV using a Python script. Some array data needs to be exported from MongoDB to CSV, but the following script does not export it properly because the three subfield values are dumped into a single column. I want to separate the three fields (order, text, answerId) under the answers field into three different columns in the CSV.
the sample of Mongodb:
"answers": [
    { "order": 0, "text": { "en": "Yes" }, "answerId": "527d65de7563dd0fb98fa28c" },
    { "order": 1, "text": { "en": "No" }, "answerId": "527d65de7563dd0fb98fa28b" }
]
the python script:
import csv

cursor = db.questions.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})
cursor = list(cursor)
with open('answer_2.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerid']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor:
        for y, v in x.iteritems():
            if y == 'answers':
                print(y, v)
                write.writerow(v)
        write.writerow(x)
-
user7070824 over 7 years: it worked. it is awesome! I got four separate fields in the csv after running the script. thanks so much.
-
BorrajaX over 7 years: Glad I was able to help! By the way, if my answer solved your problem, feel free to click the big checkbox to its left to accept it as the answer (SO netiquette: meta.stackexchange.com/questions/14994/…)
-
user7070824 over 7 years: if anyone wants to use the script, make sure to put a space between Step 1 and Step 2. It works perfectly. thanks for your help.
-
BorrajaX over 7 years: @user7070824, I added a "bonus" in my answer (how to make it more memory efficient)
-
user7070824 over 7 years: The second script worked too. I tested it. @BorrajaX, I already posted another problem. Could you have a look if you have time? The issue is that the type field cannot be added into a list in the Python script, because another collection, stages, has a number of field names which do not match in every record. thanks again
-
VGupta over 6 years: what if we have one more nested document like 'answers' here?