Is there a way to read a Word document line by line?
Solution 1
I would suggest following the code on this page here
The crux of it is that you read it with a Word.ApplicationClass (Microsoft.Office.Interop.Word) object, although where he's getting the "Doc" object is beyond me. I would assume you create it with the ApplicationClass.
EDIT: Document is retrieved by calling this:
Word.Document doc = wordApp.Documents.Open(ref file, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj,
    ref nullobj, ref nullobj, ref nullobj);
Sadly, the formatting of the code on the page I linked wasn't all too easy to read.
EDIT2: From there you can loop through doc.Paragraphs; however, as far as I can see, there is no way of looping through lines directly. I would suggest using some pattern matching to find line breaks.
To extract the text from a paragraph, use Word.Paragraph.Range.Text, which returns all the text inside that paragraph. Then search it for line-break characters; I'd use string.IndexOf().
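A minimal sketch of that search, assuming the paragraph text has already been read from Word.Paragraph.Range.Text. In text returned by Word, a manual line break (Shift+Enter) shows up as a vertical tab ('\v') and the paragraph mark as a carriage return ('\r'). The names ParagraphLines and SplitParagraphIntoLines are hypothetical. This only finds explicit breaks, not the lines produced by automatic wrapping.

```csharp
using System;
using System.Collections.Generic;

public static class ParagraphLines
{
    // Hypothetical helper: splits one paragraph's text at manual line breaks.
    public static List<string> SplitParagraphIntoLines(string paragraphText)
    {
        var lines = new List<string>();

        // Drop the trailing paragraph mark ('\r') that Word appends.
        string text = paragraphText.TrimEnd('\r');

        int start = 0;
        while (true)
        {
            // Find the next manual line break at or after 'start'.
            int breakAt = text.IndexOf('\v', start);
            if (breakAt < 0)
            {
                // No more breaks: the remainder is the last line.
                lines.Add(text.Substring(start));
                break;
            }
            lines.Add(text.Substring(start, breakAt - start));
            start = breakAt + 1;
        }
        return lines;
    }
}
```

Inside the interop loop this would be called as ParagraphLines.SplitParagraphIntoLines(para.Range.Text) for each Word.Paragraph in doc.Paragraphs.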
Alternatively, if by "lines" you mean one sentence at a time, you can simply iterate through Range.Sentences.
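A minimal sketch of iterating by paragraph and then by sentence, assuming C# 4 or later (Visual Studio 2010), a reference to Microsoft.Office.Interop.Word, and Word installed on the machine. The class name SentenceReader and taking the document path from the first command-line argument are assumptions for illustration. A Word.Paragraph is not itself enumerable; the collections hang off its Range.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class SentenceReader
{
    public static void Main(string[] args)
    {
        // Assumption: the full document path is passed as the first argument.
        string path = args[0];

        var app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(path, ReadOnly: true);

            foreach (Word.Paragraph para in doc.Paragraphs)
            {
                // Enumerate the sentences of the paragraph's Range,
                // not the Paragraph object itself.
                foreach (Word.Range sentence in para.Range.Sentences)
                {
                    Console.WriteLine(sentence.Text.Trim());
                }
            }
        }
        finally
        {
            // Close without saving and shut Word down so no hidden
            // WINWORD.EXE process is left running.
            if (doc != null) doc.Close(SaveChanges: false);
            app.Quit();
        }
    }
}
```

Swapping para.Range.Sentences for para.Range.Words yields one Range per word instead.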
Solution 2
This lets you get the document's text as strings, one line at a time.
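// Requires: using System; using System.IO; using System.Reflection;
//           using System.Windows.Forms; using Word = Microsoft.Office.Interop.Word;
object file = Path.GetDirectoryName(Application.ExecutablePath) + @"\Answer.doc";
Word.Application wordObject = new Word.ApplicationClass();
wordObject.Visible = false;
object nullobject = Missing.Value;
Word.Document docs = wordObject.Documents.Open
    (ref file, ref nullobject, ref nullobject, ref nullobject,
    ref nullobject, ref nullobject, ref nullobject, ref nullobject,
    ref nullobject, ref nullobject, ref nullobject, ref nullobject,
    ref nullobject, ref nullobject, ref nullobject, ref nullobject);

String strLine;
bool bolEOF = false;
int index = 0;
int previousEnd = -1; // end position after the previous pass

// Start with the first character of the document selected.
docs.Characters[1].Select();
do
{
    // Extend the selection to the end of the current displayed line.
    object unit = Word.WdUnits.wdLine;
    object count = 1;
    wordObject.Selection.MoveEnd(ref unit, ref count);
    strLine = wordObject.Selection.Text;
    richTextBox1.Text += ++index + " - " + strLine + "\r\n"; //for our understanding

    // Collapse to the end of the line so the next pass starts on the following line.
    object direction = Word.WdCollapseDirection.wdCollapseEnd;
    wordObject.Selection.Collapse(ref direction);

    // Stop at the end of the document, or when the selection did not advance
    // (otherwise the loop would never terminate).
    if (wordObject.Selection.Bookmarks.Exists(@"\EndOfDoc")
        || wordObject.Selection.End == previousEnd)
        bolEOF = true;
    previousEnd = wordObject.Selection.End;
} while (!bolEOF);

docs.Close(ref nullobject, ref nullobject, ref nullobject);
wordObject.Quit(ref nullobject, ref nullobject, ref nullobject);
docs = null;
wordObject = null;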
Here's the genius behind the code. Follow the link for some more explanation on how it works.
Fraiser
Updated on June 04, 2022
Comments
-
Fraiser almost 2 years
I am trying to extract all the words in a Word document. I am able to do it all in one go as follows...
Word.Application word = new Word.Application();
doc = word.Documents.Open(@"C:\SampleText.doc");
doc.Activate();

foreach (Word.Range docRange in doc.Words) // loads all words in document
{
    IEnumerable<string> sortedSubstrings =
        Enumerable.Range(0, docRange.Text.Trim().Length)
            .Select(i => docRange.Text.Substring(i))
            .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

    wordPosition = (int) docRange.get_Information(
        Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

    foreach (var substring in sortedSubstrings)
    {
        index = docRange.Text.IndexOf(substring) + wordPosition;
        charLocation[index] = substring;
    }
}
However I would have preferred to load the document one line at a time... is it possible to do so?
I can load it by paragraph; however, I am unable to iterate through each paragraph to extract all of its words.
foreach (Word.Paragraph para in doc.Paragraphs)
{
    foreach (Word.Range docRange in para) // Error: type Word.Paragraph is not enumerable
    {
        IEnumerable<string> sortedSubstrings =
            Enumerable.Range(0, docRange.Text.Trim().Length)
                .Select(i => docRange.Text.Substring(i))
                .OrderBy(s => s.Length < 3 ? s : s.Remove(2, Math.Min(s.Length - 2, 2)));

        wordPosition = (int) docRange.get_Information(
            Microsoft.Office.Interop.Word.WdInformation.wdFirstCharacterColumnNumber);

        foreach (var substring in sortedSubstrings)
        {
            index = docRange.Text.IndexOf(substring) + wordPosition;
            charLocation[index] = substring;
        }
    }
}
-
Fraiser over 12 years
Unfortunately I cannot use the Word.ApplicationClass (Microsoft.Office.Interop.Word) class in VS2010. :( So the code above does not work. What I need is for the Word.Paragraph para in doc.Paragraphs to be enumerable. Can you please help?
-
Nick Udell over 12 years
I have altered my answer to show you how to iterate through sentences. It's impossible to iterate through the file line by line, as the number of characters per line is entirely dependent on the page settings. You could get the page width and height and then use those to read off a certain number of characters at a time, but that seems like a lot of effort. What do you need this code for?
-
Bat_Programmer over 10 years
Ran this code, but unfortunately it went into an infinite loop. I don't know why.
-
nawfal over 10 years
No idea. What version of Word is it? Ensure the EndOfDoc bookmark exists by searching for bookmarks. Typically, by default, that's at the end of each doc.