How to count the number of words in a paragraph and exclude some words (from a file)?
Solution 1
The first part, where you get the total number of words and print the result, is OK.
Where you fall down is here:
words_par = 0
for words_par in lines:
    if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):
        words_par = line.split()
    print len(words_par)
    print words_par.replace('P1', '') #doesn't display it but still counts
else:
    print 'No words'
words_par starts out as a string containing one line from the file. Under a condition that will never be met (each line ends with its own text and a newline, not with "P1"), it is turned into a list by the line.split() expression. Even if the test
words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3")
were ever to return True, line.split() would always split the last line of your file, because line was last assigned in the first part of your program, where you counted all the words in the file. That call should really be
words_par.split()
Also,
words_par.startswith("P1" or "P2" or "P3")
will always be the same as
words_par.startswith("P1")
since
"P1" or "P2" or "P3"
always evaluates to the first operand that is truthy, which is the first string in this case. Read http://docs.python.org/reference/expressions.html if you want to know more.
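As a small illustration of the point above, the sketch below shows what the or chain evaluates to, and one possible way to test several prefixes at once by passing a tuple to str.startswith. The sample line is made up to match the P1/P2/P3 layout, and the code is written so that it runs under Python 3.

```python
# `or` returns its first truthy operand, so the chain collapses to "P1".
prefixes = "P1" or "P2" or "P3"
print(prefixes)

# Made-up sample line in the same layout as the data file.
line = "P2: Bla bla bla bla.\n"

# Only "P1" is actually tested here, so a "P2" line is missed.
missed = line.startswith("P1" or "P2" or "P3")
print(missed)

# str.startswith accepts a tuple of prefixes and tests each of them.
matched = line.startswith(("P1", "P2", "P3"))
print(matched)
```

The tuple form keeps the test on one line without relying on how or picks its result.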
While we are at it, unless you want a bitwise operation, avoid writing
something & something
and instead write
something and something
The first evaluates both expressions no matter what the result of the first one is, whereas the second only evaluates the second expression if the first is true. Written that way, your code will run a little more efficiently.
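To make the difference visible, here is a minimal sketch that records which operands actually get evaluated. The helper check and the list calls are hypothetical names introduced only for this illustration.

```python
calls = []

def check(name, result):
    """Hypothetical helper: note that it ran, then return `result`."""
    calls.append(name)
    return result

# `&` evaluates both operands even though the left one is already False.
bitwise = check("left", False) & check("right", True)
calls_bitwise = list(calls)

# `and` stops as soon as the left operand is false.
del calls[:]
logical = check("left", False) and check("right", True)
calls_logical = list(calls)

print(calls_bitwise)
print(calls_logical)
```

Both expressions produce False here; the only difference is that the right-hand call is skipped with and.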
The
print len(words_par)
on the next line is always going to count the number of characters in the line, since the if condition always evaluates to False and words_par never gets split into a list of words.
Also, the else clause on the for loop will always be executed, whether the sequence is empty or not, because the loop never exits through a break statement. Have a look at http://docs.python.org/reference/compound_stmts.html#the-for-statement for more information.
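The behaviour of for ... else can be checked with a short sketch like the one below. The function scan and its stop_at parameter are hypothetical, chosen only to show when the else clause runs.

```python
def scan(items, stop_at=None):
    """Hypothetical helper: report whether the loop's else clause ran."""
    else_ran = False
    for item in items:
        if item == stop_at:
            break
    else:
        # Reached whenever the loop finishes without hitting `break`,
        # including when `items` is empty.
        else_ran = True
    return else_ran

print(scan([]))
print(scan(["a", "b"]))
print(scan(["a", "b"], stop_at="a"))
```

Only the last call skips the else clause, because it is the only one that leaves the loop through break.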
I wrote a version of what I think you are after as an example. I tried to keep it simple and avoided things like list comprehensions, since you say you are just starting to learn, so it is not optimal, but hopefully it is clear. Also note that I added no comments, so feel free to hassle me to explain things for you.
words = None
with open('data.txt') as f:
    words = f.read().split()
total_words = len(words)
print 'Total words:', total_words
in_para = False
para_count = 0
para_type = None
paragraph = list()
for word in words:
    if ('P1' in word or
        'P2' in word or
        'P3' in word ):
        if in_para == False:
            in_para = True
            para_type = word
        else:
            print 'Words in paragraph', para_type, ':', para_count
            print ' '.join(paragraph)
            para_count = 0
            del paragraph[:]
            para_type = word
    else:
        paragraph.append(word)
        para_count += 1
else:
    if in_para == True:
        print 'Words in last paragraph', para_type, ':', para_count
        print ' '.join(paragraph)
    else:
        print 'No words'
EDIT:
I actually just noticed some redundant code in the example. The variable para_count is not needed, since the words are being appended to the paragraph variable. So instead of
print 'Words in paragraph', para_type, ':', para_count
You could just do
print 'Words in paragraph', para_type, ':', len(paragraph)
One less variable to keep track of. Here is the corrected snippet.
in_para = False
para_type = None
paragraph = list()
for word in words:
    if ('P1' in word or
        'P2' in word or
        'P3' in word ):
        if in_para == False:
            in_para = True
            para_type = word
        else:
            print 'Words in paragraph', para_type, ':', len(paragraph)
            print ' '.join(paragraph)
            del paragraph[:]
            para_type = word
    else:
        paragraph.append(word)
else:
    if in_para == True:
        print 'Words in last paragraph', para_type, ':', len(paragraph)
        print ' '.join(paragraph)
    else:
        print 'No words'
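For readers on Python 3, the same idea could be sketched as a function that returns the counts instead of printing them. This is only one possible variant, not the snippet above: the name count_paragraph_words, the markers parameter and the sample text are assumptions made for the example, it uses a generator expression with any(), and unlike the snippet above it ignores words that appear before the first marker.

```python
def count_paragraph_words(text, markers=("P1", "P2", "P3")):
    """Return a list of (marker_word, word_count) pairs, one per paragraph.

    A word containing one of `markers` (for example "P1:") starts a new
    paragraph and is itself left out of the count.
    """
    results = []
    current = None  # marker word of the paragraph being read
    count = 0
    for word in text.split():
        if any(marker in word for marker in markers):
            if current is not None:
                results.append((current, count))
            current = word
            count = 0
        elif current is not None:
            count += 1
    if current is not None:
        results.append((current, count))
    return results

# Made-up sample in the layout described in the question.
sample = "P1: Bla bla bla.\nP2: Bla bla bla bla.\nP1: Bla bla.\nP3: Bla.\n"
for marker, count in count_paragraph_words(sample):
    print(marker, count)
```

Returning a list rather than printing makes it easy to reuse the counts afterwards.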
Solution 2
You shouldn't give the name text to the result of open ('zery.txt', 'r').
It is not the text in the file, it is the file handle, described as a "file-like object" in the docs (I never understood what "file-like object" means, by the way).
.
with open ('C:/data.txt', 'r') as f:
    ........
    ........
is better than
f = open ('C:/data.txt', 'r')
......
.....
f.close()
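One reason the with form is usually preferred is that the file gets closed when the block is left, even if an error occurs inside it. The sketch below illustrates this under Python 3; it writes a temporary scratch file (a stand-in for C:/data.txt, not the real file) so that it can run anywhere.

```python
import os
import tempfile

# Hypothetical scratch file standing in for C:/data.txt.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("P1: Bla bla bla.\n")

# Normal case: the file is closed as soon as the block is left.
with open(path, "r") as f:
    text = f.read()
closed_after_with = f.closed

# Error case: the file is still closed, although the block raised.
try:
    with open(path, "r") as g:
        raise ValueError("simulated failure")
except ValueError:
    pass
closed_after_error = g.closed

os.remove(path)
print(closed_after_with, closed_after_error)
```

With the explicit open/close pair, the second case would leave the file open unless close() were placed in a finally clause.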
.
You should read the documentation for split(); you'll see that you can do:
with open ('C:/data.txt', 'r') as f:
    text = f.read()
words_all = len(text.split())
print 'Total words: ', words_all
.
If the structure of your text is:
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.
then words_par.endswith("P1" or "P2" or "P3")
is always False, hence the desired splitting isn't performed.
Consequently, words_par doesn't become a list; it remains a string, which is why the characters are counted.
.
Also, your code is certainly wrong in another way.
If the splitting were performed, it would repeatedly split the last line obtained in the first for-loop at the beginning of the code, because line still refers to it.
So, instead of
for words_par in lines:
    if words_par.startswith("P1" or "P2" or "P3"):
        words_par = line.split()
it should certainly be:
for line in lines:
    if line[0:2] in ("P1","P2","P3") :
        words_par = line.split()
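Building on the corrected loop, a line-by-line count that leaves out the leading label might look like the sketch below. It assumes that every paragraph sits on a single line and that the label ("P1:" and so on) is always the first word; the list lines is made-up sample data, and the code targets Python 3.

```python
# Made-up sample lines in the layout shown above.
lines = [
    "P1: Bla bla bla.\n",
    "P2: Bla bla bla bla.\n",
    "P1: Bla bla.\n",
    "P3: Bla.\n",
]

counts = []
for line in lines:
    if line[0:2] in ("P1", "P2", "P3"):
        words_par = line.split()
        # The first item is the label itself, so it is not counted.
        counts.append((words_par[0], len(words_par) - 1))

print(counts)
```

Subtracting one from len(words_par) is what excludes the label from the word count.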
Solution 3
Maybe I didn't understand the requirements completely, but I'll do my best.
The first part about counting all words is quite ok. I'd shorten it a bit:
with open('C:/data.txt', 'r') as textfile:
    lines = list(textfile)
words_all = sum([len(line.split()) for line in lines])
print 'Total words: ', words_all
In the second part, something seems to go wrong.
words_par = 0 # You can leave out this line,
# 'words_par' is initialized in the for-statement
More problems here:
if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):
"P1" or "P2" or "P3" evaluates to "P1" (non-empty strings are "truthy" values). So you could shorten the line to
if words_par.startswith("P1") & words_par.endswith("P1"):
which is probably not what you wanted.
When the condition evaluates to False, the split method is not called and words_par remains a string (and not a list of strings as expected). So len(words_par) returns the number of characters instead of the number of words.
(A little digression on names: IMHO this error arose from an inaccurate naming of a variable. A different naming
for line in lines:
    if line.startswith(...):
        words_par = line.split()
    print len(words_par)
would have produced a clear error message. On a second reading, that must have been what you meant anyway.)
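To make the character-versus-word count concrete, here is a tiny sketch using one made-up line in the question's layout. It shows what len() returns for the unsplit string and for the list produced by split().

```python
# Made-up sample line, including the trailing newline a file would give.
line = "P1: Bla bla bla.\n"

chars = len(line)          # length of the string, newline included
words = len(line.split())  # number of whitespace-separated items

print(chars)  # 17
print(words)  # 4
```

The 4 still includes the label "P1:", so the paragraph itself has three words.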
Comments
-
epo3 over 1 year
I've just started to learn Python so my question might be a bit silly. I'm trying to create a program that would:
- import a text file (got it)
- count the total number of words (got it),
- count the number of words in a specific paragraph, starting with a specific phrase (e.g. "P1", ending with another participant "P2") and exclude these words from my word count. Somehow I ended up with something that counts the number of characters instead :/
- print paragraphs separately (got it)
- exclude "P1" "P2" etc. words from my word count.
My text files look like this:
P1: Bla bla bla.
P2: Bla bla bla bla.
P1: Bla bla.
P3: Bla.
I ended up with this code:
text = open (r'C:/data.txt', 'r')
lines = list(text)
text.close()
words_all = 0
for line in lines:
    words_all = words_all + len(line.split())
print 'Total words: ', words_all
words_par = 0
for words_par in lines:
    if words_par.startswith("P1" or "P2" or "P3") & words_par.endswith("P1" or "P2" or "P3"):
        words_par = line.split()
    print len(words_par)
    print words_par.replace('P1', '') #doesn't display it but still counts
else:
    print 'No words'
Any ideas how to improve it?
Thanks
-
eyquem about 12 years
r in r'C:/data.txt' is completely vain
-
Jakob Bowyer about 12 years
Sometimes it's nice to be explicit.
-
eyquem about 12 years
@Jakob Bowyer It is because it is explicit that it is vain. So your sentence means "Sometimes, it's nice to be vain".
-
SingleNegationElimination about 12 years
it should probably be r'C:\data.txt', since the correct directory separator on windows is \, and 'C:\\data.txt' is too awful.
-
MattH about 12 years
line.startswith("P1" or "P2" or "P3") is equivalent to line.startswith("P1") and misleading at best.
-
eyquem about 12 years
@MattH Oh! I didn't see that. I went to your last answer (Linux non-blocking FIFO) and upvoted it
-
epo3 about 12 years
Thanks guys! @james: you got it right, it works as I wanted. Now I have to digest all the knowledge and try to understand what went wrong :)
-
James Hurford about 12 years
@epo3 You're welcome. Have a look at my corrected snippet for a better way of doing it.
-
epo3 about 12 years
I don't understand this bit:
if in_para == False:
    in_para = True
How can I add up all the values for a certain paragraph type, e.g. summing up all P1 word counts? I tried writing some code but didn't come up with anything that would make sense :/
-
James Hurford about 12 years
in_para is a flag that records that you have encountered a P1, P2 or P3 word, so that the final report is only printed when at least one such word was found. How to sum up the count of words in all P1 paragraphs sounds like a new question, which, if you posted it, I would be happy to answer.