Python searches CSV for string in one column, returns string from another column in the same row

Solution 1

For something like this, I think using the pandas library will keep your sanity in check. Assume a 15,000-row CSV file with two columns, Strings and ID:

In [1]: import pandas as pd

In [2]: words = ['happy','sad','good','bad','sunny','rainy']

In [3]: df = pd.read_csv('data.csv')

In [4]: df.head(5)
Out[4]: 
  Strings  ID
0   happy   1
1     sad   2
2   happy   3
3     sad   4
4    good   5

In [5]: for word in words:
   ...:     print('{} : {}'.format(word, df['Strings'].str.lower().str.contains(word).sum()))
   ...:     
happy : 2501
sad : 2500
good : 2500
bad : 2500
sunny : 2499
rainy : 2500

Alternatively, you can create a pivot table, which gives similar results. Note that it counts exact matches of each distinct value in Strings rather than substring matches.

In [30]: df_pt = df.pivot_table(index='Strings',values='ID',aggfunc=len)

In [31]: df_pt
Out[31]: 
Strings
bad        2500
good       2500
happy      2501
rainy      2500
sad        2500
sunny      2499
Name: ID, dtype: int64

If you need the IDs for a given word, you can filter the data on that word:

In [6]: df_happy = df[df['Strings'] == 'happy']

In [7]: df_happy.head(5)
Out[7]: 
   Strings  ID
0    happy   1
2    happy   3
12   happy  13
14   happy  15
18   happy  19

If you need it as a list, then:

In [8]: list_happy = df_happy['ID'].tolist()

In [9]: list_happy[:5]
Out[9]: [1, 3, 13, 15, 19]

I've truncated some parts, obviously, but the idea remains the same.

Solution 2

You said you would like to print the id of the row when a word is found. Assuming a comma-separated CSV file with only two columns, this is how I would do it:

words = ["happy", "sad", "good", "bad", "sunny", "rainy"]
found = {}
with open('data.csv') as fin:
    for line in fin:
        # Strip the trailing newline, then split into string and id
        str1, row_id = line.strip().split(',')
        for w in words:
            if w in str1:
                print(row_id)
                found[w] = found.get(w, 0) + 1
                break

print(found)

If the file has more than two columns, you could instead do:

split_line = line.strip().split(',')
str1 = split_line[0]    # whichever column holds the string
row_id = split_line[1]  # whichever column holds the id
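
To return the ids rather than only print them, one option is to store them in a dictionary of lists as they are found. The following is a minimal stdlib-only sketch, not part of the original answer. It assumes a header row with columns named Strings and ID (borrowed from the sample data in Solution 1), and the helper name ids_by_word is made up for illustration.

```python
import csv
import io
from collections import defaultdict


def ids_by_word(csv_file, words, text_col="Strings", id_col="ID"):
    """Map each search word to the ids of the rows whose text contains it.

    The column names are assumptions based on the sample data; adjust
    them to match the real file.
    """
    found = defaultdict(list)
    for row in csv.DictReader(csv_file):
        text = row[text_col].lower()
        for word in words:
            if word in text:
                # Store the id instead of only counting the match
                found[word].append(row[id_col])
    return dict(found)


# Hypothetical sample data standing in for data.csv
sample = io.StringIO("Strings,ID\nhappy,1\nsad,2\nHappy days,3\ngood,5\n")
result = ids_by_word(sample, ["happy", "sad", "good", "bad"])
print(result)
```

With a real file, pass open('data.csv', newline='') in place of the StringIO object. The ids come back as strings, and len(result[word]) gives the count for each word that was found.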
Author by

Chris Thurber

Updated on June 04, 2022

Comments

  • Chris Thurber
    Chris Thurber almost 2 years

    I'm attempting to write a program in python that searches ~27,000 rows for each string in a list. Each string I am searching for is in one column, and has an 'id' value in another column that I would like printed if found. The code I currently have counts the number of times that string appears in the document, but I am still unable to find a way to return specific values for each unique row in which the strings are found.

    import csv
    fin = open('data.csv')
    words = ["happy","sad","good","bad","sunny","rainy"]
    found = {}
    count = 0
    for line in fin:
        for word in words:
            if word in line:
                count = count + 1
        found[word] = count
    print(found)
    

    The main semantic problem with the code above is that printing the 'found' dictionary only yields one of the results and its count from the 'words' list.
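
One reading of the problem described above: count is shared by all words, and found[word] is assigned outside the inner loop, so only the last word in the list gets an entry, holding the running total for every word. A minimal sketch of per-word counting, assuming plain substring matching on each line; the helper name count_words is hypothetical:

```python
from collections import Counter


def count_words(lines, words):
    """Count, per word, how many lines contain that word."""
    found = Counter()
    for line in lines:
        for word in words:
            if word in line:
                # Each word gets its own counter
                found[word] += 1
    return found


# Hypothetical lines standing in for the rows of data.csv
lines = ["happy,1", "sad,2", "happy,3", "good,5"]
counts = count_words(lines, ["happy", "sad", "good", "bad"])
print(dict(counts))
```

A Counter returns 0 for words that never matched, so counts["bad"] is safe to read even though no line contains it.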

  • Chris Thurber
    Chris Thurber over 9 years
    Awesome. I'm still a novice in Python, so thanks for the added comments and solution. This solves the counting problem, but how can I return the 'id' value I mentioned for each row that 'word' is found in?
  • ch3ka
    ch3ka over 9 years
    Well, you'd have to keep track of the id's. Just store them away, as you find them.