How to read the text word by word

Solution 1

A simple approach is to call string.Split without arguments (it splits on white-space characters):

using (StreamReader sr = new StreamReader(path)) 
{
    while (sr.Peek() >= 0) 
    {
        string line = sr.ReadLine();
        string[] words = line.Split();
        foreach(string word in words)
        {
            foreach(Char c in word)
            {
                // ...
            }
        }
    }
}

I've used StreamReader.ReadLine, which reads one whole line at a time.
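
One thing to watch for: Split() without arguments returns empty strings when a line contains consecutive whitespace. A minimal sketch of the same line-by-line idea that skips those empty entries could look like this. The names LineWordReader and ReadWords are made up for illustration, and taking a TextReader instead of a path is an assumption so the method also works on a StringReader:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class LineWordReader
{
    // Hypothetical helper: yields the words of each line, one line at a time.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        string line;
        // ReadLine returns null at the end of the input.
        while ((line = reader.ReadLine()) != null)
        {
            // A null separator array means "split on whitespace";
            // RemoveEmptyEntries drops the empty strings produced by
            // runs of spaces or tabs.
            string[] words = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
            foreach (string word in words)
            {
                yield return word;
            }
        }
    }
}
```

With a file you would pass a StreamReader, for example `foreach (string word in LineWordReader.ReadWords(sr))`, and loop over the characters of each word inside.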

To parse HTML, I would use a robust library like HtmlAgilityPack.

Solution 2

You can split the string on whitespace, but you will have to deal with punctuation and HTML markup (you said you were working with txt and htm files).

string[] tokens = text.Split(); // Split() without arguments splits on white space
foreach(string tok in tokens)
{
    // process tok string here
}
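
For the punctuation part, one option is to match the words with a regular expression instead of splitting. The sketch below is only an illustration: the class name WordTokenizer and the choice to treat letters, digits and apostrophes as word characters are assumptions, not something from the question.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class WordTokenizer
{
    // Assumption: a "word" is a run of Unicode letters, digits or apostrophes.
    private static readonly Regex WordPattern = new Regex(@"[\p{L}\p{N}']+");

    // Hypothetical helper: returns the words of the text without punctuation.
    public static List<string> Words(string text)
    {
        var result = new List<string>();
        foreach (Match m in WordPattern.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Note that this does nothing special for HTML: tag names such as b or div would still come out as words, so markup has to be stripped by an HTML parser first.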

Solution 3

Here's my implementation of a lazy extension method for StreamReader. The idea is to avoid loading the entire file into memory, which matters especially if your file is a single long line.

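// requires: using System; using System.IO;
// Extension methods must be declared in a static class.
public static class StreamReaderExtensions
{
    public static string ReadWord(this StreamReader stream)
    {
        string word = "";
        // read a single character at a time, building a word
        // until reaching whitespace or end of stream (-1)
        while (stream.Read()
           .With(c => { // with each character . . .
                // StreamReader.Read() already returns a decoded character
                // (or -1), so no extra Encoding conversion is needed
                if (c == -1 || Char.IsWhiteSpace((char)c))
                     return -1; // signal end of word

                word = word + (char)c; // append the char to our word
                return c;
        }) > -1);  // the loop ends when the lambda returns -1
        return word;
    }

    public static T With<T>(this T obj, Func<T,T> f)
    {
        return f(obj);
    }
}

To use it, simply:

// File.OpenText reads UTF-8; for another encoding use new StreamReader(file, encoding)
using (var s = File.OpenText(file))
{
    while (!s.EndOfStream)
    {
        string word = s.ReadWord();
        if (word.Length == 0)
            continue; // consecutive whitespace produces empty words
        word.ToCharArray().DoSomething(); // DoSomething stands for your own per-word processing
    }
}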
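
The same lazy idea can also be written as an iterator, which some readers may find easier to follow. This is a sketch of an alternative, not the code above: the names WordStream and ReadWords are hypothetical, and it takes any TextReader so it can be tried on a StringReader as well as a StreamReader.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;

public static class WordStream
{
    // Hypothetical helper: lazily yields whitespace-separated words,
    // reading one character at a time so the whole file is never in memory.
    public static IEnumerable<string> ReadWords(TextReader reader)
    {
        var word = new StringBuilder();
        int c;
        // Read() returns the next character as an int, or -1 at the end.
        while ((c = reader.Read()) != -1)
        {
            if (char.IsWhiteSpace((char)c))
            {
                // Whitespace ends the current word, if there is one.
                if (word.Length > 0)
                {
                    yield return word.ToString();
                    word.Clear();
                }
            }
            else
            {
                word.Append((char)c);
            }
        }
        // Emit the last word when the input does not end with whitespace.
        if (word.Length > 0)
        {
            yield return word.ToString();
        }
    }
}
```

Usage would be along the lines of `foreach (string word in WordStream.ReadWords(s))` with an inner `foreach (char ch in word)`. Because empty words are skipped inside the iterator, the caller does not need to check for them.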

Author: Hurrem

Updated on June 04, 2022

Comments

  • Hurrem, almost 2 years ago

    I'm working with a txt or htm file. Currently I go through the document char by char using a for loop, but I need to go through the text word by word, and then char by char inside each word. How can I do this?

    for (int i = 0; i < text.Length; i++)
    {}