Is there a way to read a Word document line by line?


Solution 1

I would suggest following the code on this page here

The crux of it is that you read the document with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although it is not clear to me where the author gets the "Doc" object. I would assume you create it through the ApplicationClass.

EDIT: Document is retrieved by calling this:

    Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
                                          ref nullobj, ref nullobj, ref nullobj,
                                          ref nullobj, ref nullobj, ref nullobj,
                                          ref nullobj, ref nullobj, ref nullobj);

Sadly, the formatting of the code on the page I linked is not all that easy to read.

EDIT2: From there you can loop through the document's paragraphs. As far as I can see, however, there is no way of looping through lines, so I would suggest using some pattern matching to find line breaks.

To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside that paragraph. Then search it for line break characters. I'd use string.IndexOf().
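The search for line break characters could be sketched roughly as follows. This is only an illustration: the class and method names (`ParagraphLines`, `Split`) are made up for this example, and it uses `IndexOfAny` rather than plain `IndexOf` so that several break characters can be checked at once. It relies on Word returning a manual line break (Shift+Enter) as `'\v'` and a paragraph mark as `'\r'` in `Range.Text`. Lines that merely wrap because of the page layout contain no break character, so this approach cannot see them.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the Word object model.
public static class ParagraphLines
{
    // '\v' = manual line break, '\r' = paragraph mark; '\n' is included defensively.
    private static readonly char[] Breaks = { '\v', '\r', '\n' };

    // Splits the text of one paragraph at explicit break characters.
    // Empty pieces (for example a trailing paragraph mark) are skipped.
    public static List<string> Split(string paragraphText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(paragraphText))
            return lines;

        int start = 0;
        while (start < paragraphText.Length)
        {
            int hit = paragraphText.IndexOfAny(Breaks, start);
            if (hit < 0)
                hit = paragraphText.Length;   // no more breaks: take the rest
            if (hit > start)
                lines.Add(paragraphText.Substring(start, hit - start));
            start = hit + 1;                  // continue after the break character
        }
        return lines;
    }
}
```

With an opened document you would call it once per paragraph, for example `ParagraphLines.Split(para.Range.Text)` inside a `foreach (Word.Paragraph para in doc.Paragraphs)` loop.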

Alternatively, if by "lines" you mean extracting one sentence at a time, you can simply iterate through Range.Sentences.
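A minimal sketch of that iteration, assuming `doc` is a `Word.Document` that has already been opened as shown earlier. The class and method names (`SentenceDump`, `Print`) are invented for the example. Note that a `Word.Paragraph` is not enumerable by itself; you go through its `Range`, which exposes collections such as `Sentences` and `Words`.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

// Hypothetical helper for illustration only.
public static class SentenceDump
{
    // 'doc' is assumed to be an already opened document.
    public static void Print(Word.Document doc)
    {
        foreach (Word.Paragraph para in doc.Paragraphs)
        {
            // Enumerate the sentences of this paragraph via its Range.
            foreach (Word.Range sentence in para.Range.Sentences)
            {
                Console.WriteLine(sentence.Text.Trim());
            }
        }
    }
}
```

Swapping `para.Range.Sentences` for `para.Range.Words` should give the same per-paragraph loop over words instead.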

Solution 2

This reads the document's text line by line.

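    // Requires a reference to Microsoft.Office.Interop.Word and:
    // using System.IO; using System.Reflection; using System.Windows.Forms;
    // using Word = Microsoft.Office.Interop.Word;
    object file = Path.Combine(Path.GetDirectoryName(Application.ExecutablePath), "Answer.doc");

    // Word.Application also compiles when interop types are embedded,
    // where Word.ApplicationClass cannot be instantiated.
    Word.Application wordObject = new Word.Application();
    wordObject.Visible = false;

    object nullobject = Missing.Value;
    Word.Document docs = wordObject.Documents.Open
        (ref file, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject,
        ref nullobject, ref nullobject, ref nullobject, ref nullobject);

    String strLine;
    bool bolEOF = false;

    // Start with the first character of the document selected.
    docs.Characters[1].Select();

    int index = 0;
    do
    {
        // Extend the selection to the end of the current displayed line.
        object unit = Word.WdUnits.wdLine;
        object count = 1;
        int moved = wordObject.Selection.MoveEnd(ref unit, ref count);

        strLine = wordObject.Selection.Text;
        richTextBox1.Text += ++index + " - " + strLine + "\r\n"; //for our understanding

        // Collapse to the end so the next pass starts on the following line.
        object direction = Word.WdCollapseDirection.wdCollapseEnd;
        wordObject.Selection.Collapse(ref direction);

        // Stop at the end of the document, or when the selection could not be
        // extended any further (intended as a guard against looping forever).
        if (moved == 0 || wordObject.Selection.Bookmarks.Exists(@"\EndOfDoc"))
            bolEOF = true;
    } while (!bolEOF);

    // Close the document and shut down the Word instance.
    docs.Close(ref nullobject, ref nullobject, ref nullobject);
    wordObject.Quit(ref nullobject, ref nullobject, ref nullobject);
    docs = null;
    wordObject = null;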

Here's the genius behind the code. Follow the link for some more explanation on how it works.

Author: Fraiser

Updated on June 04, 2022

Comments

  • Fraiser, almost 2 years ago

    I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...

    Word.Application word = new Word.Application();
    doc = word.Documents.Open(@"C:\SampleText.doc");
    doc.Activate();
    
    foreach (Word.Range docRange in doc.Words) // loads all words in document
    {
        IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
    
        wordPosition =
            (int)
            docRange.get_Information(
                Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
    
        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }
    }
    

    However, I would prefer to load the document one line at a time. Is it possible to do so?

    I can load it by paragraph, but I am unable to iterate through the paragraphs to extract all the words.

    foreach (Word.Paragraph para in doc.Paragraphs)
    {
        foreach (Word.Range docRange in para) // Error: Word.Paragraph is not enumerable
        {
            IEnumerable<string> sortedSubstrings = Enumerable.Range(0, docRange.Text.Trim().Length)
                .Select(i => docRange.Text.Substring(i))
                .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));
    
            wordPosition =
                (int)
                docRange.get_Information(
                    Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);
    
            foreach (var substring in sortedSubstrings)
            {
                index = docRange.Text.IndexOf(substring) + wordPosition;
                charLocation[index] = substring;
            }
    
        }
    }
    
  • Fraiser, over 12 years ago
    Unfortunately, I cannot use the Word.ApplicationClass (Microsoft.Interop.Word) class in VS2010, so the code above does not work. What I need is for the Word.Paragraph para in doc.Paragraphs to be enumerable. Can you please help?
  • Nick Udell, over 12 years ago
    I have altered my answer to show you how to iterate through sentences. It's impossible to iterate through the file line by line, because the number of characters per line depends entirely on the page settings. You could get the page width and height and then use those to read off a certain number of characters at a time, but that seems like a lot of effort. What do you need this code for?
  • Bat_Programmer, over 10 years ago
    I ran this code, but unfortunately it went into an infinite loop. I don't know why.
  • nawfal, over 10 years ago
    No idea. What version of Word is it? Make sure the EndOfDoc bookmark exists by searching the bookmarks. By default it is typically at the end of each document.