Excel VBA to Search for Text in PDF and Extract and Name Pages
Loops are definitely excellent for some things, but can tie down processing with these higher queries. Recently, a colleague and I were doing a similar task (not pdf-related though), and we had much success with using a range.find method instead of a loop executing instr on each cell.
Some points of interest: -To mimic the “loop cells” functionality when using the .find method, we ended our range statement with .cells, as seen below:
activesheet.usedrange.cells.find( )
Where the desired string goes within the ( ).
-The return value: “A Range object that represents the first cell where that information is found.”
Once the .find method returns a range, a subsequent subroutine can extract the page number and document name.
-If you need to find the nth instance of an occurrence, “You can use the FindNext andFindPrevious methods to repeat the search.” (Microsoft)
Microsoft overview of range.find: https://msdn.microsoft.com/en-us/vba/excel-vba/articles/range-find-method-excel
So with this approach, the user can use a loop based on a count of cells in your list to execute the .find method for each string.
Downside is (I assume) that this must be done on text within the excel application; also, I’ve not tested it to determine if the string has to inhabit the cell by itself (I don’t think this is a concern).
‘===================
Another suggestion that might be beneficial is to first bulk-rip all text from the .pdf with as little looping as possible (direct actions at the document object level). Then your find/return approach can be applied to the bulk text.
I did a similar activity when creating study notes from a professor’s PowerPoints; I grabbed all the text into a .txt file, then returned every sentence containing the instance of a list of strings.
‘=====================
A few caveats: I admit that I have not executed parsing at the sheer size of your project, so my suggestions might not be advantageous in practice.
Also, I have not done much work parsing .pdf documents, as I try to opt for anything that is .txt/excel app first, and engage it instead.
Good luck in your endeavors; I hope I was able to at least provide food for thought!
Admin
Updated on July 09, 2022Comments
-
Admin almost 2 years
I have the following code, which looks at each cell in column A of my spreadsheet, searches for the text it finds there in the specified PDF and then extracts the page where it finds the text as a PDF, naming it with the value in the cell of the spreadsheet. The code works but is rather slow, I may need to search for as many as 200 words in a PDF which could be as long as 600 pages. Is there a way to make the code faster? Currently it loops through each cell searches through each page looping through each word until it finds the word in the cell.
Sub test_with_PDF() Dim objApp As Object Dim objPDDoc As Object Dim objjso As Object Dim wordsCount As Long Dim page As Long Dim i As Long Dim strData As String Dim strFileName As String Dim lastrow As Long, c As Range Dim PageNos As Integer Dim newPDF As Acrobat.CAcroPDDoc Dim NewName As String Dim Folder As String lastrow = Sheets("Sheet1").Cells(Rows.Count, "A").End(xlUp).Row strFileName = selectFile() Folder = GetFolder() Set objApp = CreateObject("AcroExch.App") Set objPDDoc = CreateObject("AcroExch.PDDoc") 'AD.1 open file, if =false file is damage If objPDDoc.Open(strFileName) Then Set objjso = objPDDoc.GetJSObject PageNos = 0 For Each c In Sheets("Sheet1").Range("A2:A" & lastrow) For page = 0 To objPDDoc.GetNumPages - 1 wordsCount = objjso.GetPageNumWords(page) For i = 0 To wordsCount If InStr(1, c.Value, ", ") = 0 Then If objjso.getPageNthWord(page, i) = c.Value Then PageNos = PageNos + 1 If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then Set newPDF = CreateObject("AcroExch.pdDoc") NewName = Folder & "\" & c.Offset(0, 4) & ".pdf" newPDF.Open (NewName) newPDF.InsertPages lngPages, objPDDoc, page, 1, 0 newPDF.Save 1, NewName newPDF.Close Set newPDF = Nothing Exit For Else Set newPDF = CreateObject("AcroExch.PDDoc") newPDF.Create NewName = Folder & "\" & c.Offset(0, 4) & ".pdf" newPDF.InsertPages -1, objPDDoc, page, 1, 0 newPDF.Save 1, NewName newPDF.Close Set newPDF = Nothing Exit For End If End If Else If objjso.getPageNthWord(page, i) = c.Offset(0, 1).Value Then If objjso.getPageNthWord(page, i + 1) = c.Offset(0, 2).Value Then PageNos = PageNos + 1 If FileExist(Folder & "\" & c.Offset(0, 4) & ".pdf") Then Set newPDF = CreateObject("AcroExch.pdDoc") NewName = Folder & "\" & c.Offset(0, 4) & ".pdf" newPDF.Open (NewName) newPDF.InsertPages lngPages, objPDDoc, page, 1, 0 newPDF.Save 1, NewName newPDF.Close Set newPDF = Nothing Exit For Else Set newPDF = CreateObject("AcroExch.PDDoc") newPDF.Create NewName = Folder & "\" & c.Offset(0, 4) & ".pdf" newPDF.InsertPages -1, objPDDoc, page, 1, 0 newPDF.Save 1, NewName newPDF.Close Set newPDF = Nothing Exit For End If Exit For End If End If End If Next i Next page c.Offset(0, 3).Value = PageNos PageNos = 0 Next c MsgBox "Done" Else MsgBox "error!" End If End Sub Function FileExist(path As String) As Boolean If Dir(path) <> vbNullString Then FileExist = True End Function Function selectFile() Dim fd As FileDialog, fileName As String On Error GoTo ErrorHandler Set fd = Application.FileDialog(msoFileDialogFilePicker) fd.AllowMultiSelect = False If fd.Show = True Then If fd.SelectedItems(1) <> vbNullString Then fileName = fd.SelectedItems(1) End If Else 'Exit code if no file is selected End End If 'Return Selected FileName selectFile = fileName Set fd = Nothing Exit Function ErrorHandler: Set fd = Nothing MsgBox "Error " & Err & ": " & Error(Err) End Function Function GetFolder() As String Dim fldr As FileDialog Dim sItem As String Set fldr = Application.FileDialog(msoFileDialogFolderPicker) With fldr .Title = "Select the Folder where you want you new PDFs to go" .AllowMultiSelect = False .InitialFileName = Application.DefaultFilePath If .Show <> -1 Then GoTo NextCode sItem = .SelectedItems(1) End With NextCode: GetFolder = sItem Set fldr = Nothing End Function
Many thanks in advance.