Powershell search matching string in word document
Solution 1
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = @{}
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += New-Object -TypeName PsCustomObject -Property $properties
}
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$document.close()
$application.quit()
}
getStringMatch
import-csv $output
There are a couple of ways to get what you want. A simple approach is since you have the text of the document already lets perform a regex match on it and return the results and more. This helps in trying to address getting some words around in document.
We have the variable $charactersAround
which sets the number of characters to match around the $findtext
. Also I though the output was a better fit for a CSV file so I used $results
to capture a hashtable of properties that, in the end, are output to a csv file.
Be sure to change the variables for your own testing. Now that we are using regex to locate the matches this opens up a world of possibilities.
Sample Output
Match TextAround File
----- ---------- ----
First dley Air Services Limited dba First Air meets or exceeds all term C:\Temp\20120315132117214.docx
Solution 2
Good answer from @Matt. I improved it a little (new PowerShell version have problems with the given array. And to search big amount of documents it runs out of memory. Here is my improved version:
#ERROR REPORTING ALL
Set-StrictMode -Version latest
$path = "c:\Temp"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$output = "c:\temp\wordfiletry.csv"
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "First"
$charactersAround = 30
$results = @{}
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files)
{
$document = $application.documents.open($file.FullName,$false,$true)
$range = $document.content
If($range.Text -match ".{$($charactersAround)}$($findtext).{$($charactersAround)}"){
$properties = @{
File = $file.FullName
Match = $findtext
TextAround = $Matches[0]
}
$results += @(New-Object -TypeName PsCustomObject -Property $properties)
}
$document.close()
}
If($results){
$results | Export-Csv $output -NoTypeInformation
}
$application.quit()
}
getStringMatch
import-csv $output
Solution 3
Thanks! You provided a great solution to use PowerShell regex expressions to look for information in a Word document. I needed to modify it to meet my needs. Maybe, it will help someone else. It reads each line of the word document, and then uses the regex expression to determine if the line is a match. The output could easily be modified or dumped to a log file.
Set-StrictMode -Version latest
$path = "c:\Temp\pii"
$files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) }
$application = New-Object -comobject word.application
$application.visible = $False
$findtext = "[0-9]" #regex
Function getStringMatch
{
# Loop through all *.doc files in the $path directory
Foreach ($file In $files) {
$document = $application.documents.open($file.FullName,$false,$true)
$arrContents = $document.content.text.split()
$varCounter = 0
ForEach ($line in $arrContents) {
$varCounter++
If($line -match $findtext) {
"File: $file Found: $line Line: $varCounter"
}
}
$document.close()
}
$application.quit()
}
getStringMatch
Related videos on Youtube
Yogesh
Updated on October 28, 2022Comments
-
Yogesh about 1 year
I have a simple requirement. I need to search a string in Word document and as result I need to get matching line / some words around in document.
So far, I could successfully search a string in folder containing Word documents but it returns True / False based on whether it could find search string or not.
#ERROR REPORTING ALL Set-StrictMode -Version latest $path = "c:\MORLAB" $files = Get-Childitem $path -Include *.docx,*.doc -Recurse | Where-Object { !($_.psiscontainer) } $output = "c:\wordfiletry.txt" $application = New-Object -comobject word.application $application.visible = $False $findtext = "CRHPCD01" Function getStringMatch { # Loop through all *.doc files in the $path directory Foreach ($file In $files) { $document = $application.documents.open($file.FullName,$false,$true) $range = $document.content $wordFound = $range.find.execute($findText) if($wordFound) { "$file.fullname has $wordfound" | Out-File $output -Append } } $document.close() $application.quit() } getStringMatch
-
Yogesh almost 9 yearsI need some more help from you. If you can please tell me why there are extra '$' sign you have put. Sorry for not being able to understand myself. .{$($charactersAround)}$($findtext).{$($charactersAround)}")
-
Matt almost 9 years
$(..)
means evaluate the contents as a subexpression. It was to ensure the string variables expanded properly something like$findtext.{$charactersAround}
would not work properly since it would try to find a property$charactersAround
( in our case is 30) of the variable$findtext
. @Yogesh