How to read the text word by word
Solution 1
A simple approach is to use string.Split without arguments (it splits on white-space characters):
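using (StreamReader sr = new StreamReader(path))
{
    while (sr.Peek() >= 0)
    {
        string line = sr.ReadLine();
        // Split() with no arguments splits on white space; consecutive white-space
        // characters produce empty strings, for which the inner loop does nothing
        string[] words = line.Split();
        foreach (string word in words)
        {
            foreach (char c in word)
            {
                // ...
            }
        }
    }
}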
I've used StreamReader.ReadLine to read the entire line.
To parse HTML I would use a robust library like HtmlAgilityPack.
Solution 2
You can split the string on whitespace, but you will have to deal with punctuation and HTML markup (you said you were working with txt and htm files).
string[] tokens = text.Split(); // Split() with no arguments splits on white space
foreach (string tok in tokens)
{
    // process tok string here
}
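One way to deal with the punctuation mentioned above is to strip it from the edges of each token after splitting. The following is only a minimal sketch: the names WordSplitter and ExtractWords are made up for illustration, and it assumes any HTML markup has already been removed from the text.

```csharp
using System;
using System.Collections.Generic;

public static class WordSplitter // hypothetical helper, not part of the original answer
{
    // Splits text on white space and strips leading/trailing punctuation from each token.
    public static List<string> ExtractWords(string text)
    {
        var words = new List<string>();
        // A null separator array means "split on white space";
        // RemoveEmptyEntries drops the empty tokens produced by repeated white space.
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        foreach (string tok in tokens)
        {
            int start = 0;
            int end = tok.Length;
            // Only the edges are trimmed, so "don't" keeps its apostrophe.
            while (start < end && char.IsPunctuation(tok[start])) start++;
            while (end > start && char.IsPunctuation(tok[end - 1])) end--;
            if (end > start)
                words.Add(tok.Substring(start, end - start));
        }
        return words;
    }
}
```

Each returned word can then be walked char by char with an inner foreach, as in Solution 1.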
Solution 3
Here's my implementation of a lazy extension to StreamReader. The idea is not to load the entire file into memory, especially if your file is a single long line.
public static class StreamReaderExtensions
{
    // Reads the next white-space-delimited word one character at a time.
    // Returns an empty string if the end of the stream is reached first.
    public static string ReadWord(this StreamReader stream)
    {
        var word = new StringBuilder(); // requires using System.Text;
        int c;
        // Read() returns an already decoded character, or -1 at the end of the stream,
        // so no Encoding argument is needed here
        while ((c = stream.Read()) != -1)
        {
            char chr = (char)c;
            if (!char.IsWhiteSpace(chr))
                word.Append(chr); // append the char to our word
            else if (word.Length > 0)
                break; // white space after a word signals the end of the word
            // otherwise: leading white space, keep reading
        }
        return word.ToString();
    }
}
To use it:
using (var s = File.OpenText(file)) // File.OpenText decodes the file as UTF-8
{
    while (!s.EndOfStream)
    {
        string word = s.ReadWord();
        if (word.Length > 0) // empty only when trailing white space ends the file
            word.ToCharArray().DoSomething(); // DoSomething stands for your own per-word processing
    }
}
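The idea in Solution 3 can also be written as an iterator, so the caller gets the words through a plain foreach. This is a sketch under assumptions rather than part of the original answer: WordReader and ReadWords are hypothetical names, and the method takes a TextReader so that it works with both StreamReader and StringReader.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordReader // hypothetical name, for illustration only
{
    // Lazily yields white-space-delimited words, reading one character at a time,
    // so a file consisting of a single long line is never held in memory as a whole.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int c;
        // Read() returns the next decoded character, or -1 at the end of the input.
        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char)c))
            {
                // White space ends the current word, if there is one.
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append((char)c);
            }
        }
        // Emit the last word when the input does not end with white space.
        if (word.Length > 0)
            yield return word.ToString();
    }
}
```

With this, foreach (string word in WordReader.ReadWords(reader)) gives the outer loop over words, and a nested foreach (char c in word) gives the inner loop over characters.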
Author: Hurrem
Updated on June 04, 2022
Comments
Hurrem (almost 2 years):
I'm working with a txt or htm file. Currently I go through the document char by char using a for loop, but I need to go through the text word by word, and then char by char inside each word. How can I do this?
for (int i = 0; i < text.Length; i++) {}