Best practices for searchable archive of thousands of documents (pdf and/or xml)

Solution 1

In summary: I'm going to be recommending ElasticSearch, but let's break the problem down and talk about how to implement it:

There are a few parts to this:

  1. Extracting the text from your docs to make them indexable
  2. Making this text available as full text search
  3. Returning highlighted snippets of the doc
  4. Knowing where in the doc those snippets are found to allow for paging
  5. Returning the full doc

What can ElasticSearch provide:

  1. ElasticSearch (like Solr) uses Tika to extract text and metadata from a wide variety of doc formats
  2. It, pretty obviously, provides powerful full text search. It can be configured to analyse each doc in the appropriate language, with stemming, boosting the relevance of certain fields (eg title more important than content), ngrams etc - ie standard Lucene stuff (see the mapping sketch after this list)
  3. It can return highlighted snippets for each search result
  4. It DOESN'T know where those snippets occur in your doc
  5. It can store the original doc as an attachment, or it can store and return the extracted text. But it'll return the whole doc, not a page.
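
Regarding point 2: the analysis setup is just index mapping configuration. A minimal sketch, assuming the "my_index"/"doc" names used later in this answer, a made-up "content" field, and the built-in english analyzer (exact syntax varies by ElasticSearch version):

# Sketch only - field names, analyzer choice and boost values are illustrative.
# The english analyzer handles stemming; "boost" makes title matches count
# for more than content matches.
curl -XPUT 'http://127.0.0.1:9200/my_index/doc/_mapping'  -d '
{
   "doc" : {
      "properties" : {
         "title"   : { "type" : "string", "analyzer" : "english", "boost" : 2.0 },
         "content" : { "type" : "string", "analyzer" : "english" }
      }
   }
}
'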

You could just send the whole doc to ElasticSearch as an attachment, and you'd get full text search. But the sticking points are (4) and (5) above: knowing where you are in a doc, and returning parts of a doc.
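
For the attachment route mentioned above, a rough sketch - assuming the elasticsearch-mapper-attachments plugin (which wraps Tika) is installed, and with made-up index/type/field names:

# Sketch only: requires the mapper-attachments plugin; Tika extraction then
# happens server-side when the doc is indexed.
curl -XPUT 'http://127.0.0.1:9200/my_index/attachment_doc/_mapping'  -d '
{
   "attachment_doc" : {
      "properties" : {
         "file" : { "type" : "attachment" }
      }
   }
}
'

# The document bytes are sent base64-encoded (base64 -w 0 is GNU coreutils syntax):
curl -XPUT 'http://127.0.0.1:9200/my_index/attachment_doc/1'  -d '
{
   "file" : "'"$(base64 -w 0 my_document.pdf)"'"
}
'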

Storing individual pages is probably sufficient for your where-am-I purposes (although you could equally go down to paragraph level), but you want the pages grouped so that a doc is returned in the search results, even if the search keywords appear on different pages.

First the indexing part: storing your docs in ElasticSearch:

  1. Use Tika (or whatever you're comfortable with) to extract the text from each doc (a command-line sketch of these steps follows the list). Leave it as plain text, or as HTML to preserve some formatting. (Forget about XML; no need for it.)
  2. Also extract the metadata for each doc: title, authors, chapters, language, dates etc
  3. Store the original doc in your filesystem, and record the path so that you can serve it later
  4. In ElasticSearch, index a "doc" doc which contains all of the metadata, and possibly the list of chapters
  5. Index each page as a "page" doc, which contains:

    • A parent field which contains the ID of the "doc" doc (see "Parent-child relationship" below)
    • The text
    • The page number
    • Maybe the chapter title or number
    • Any metadata which you want to be searchable
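
Put together, the steps above might look roughly like this. It is only a sketch: file paths, IDs and field names are placeholders, and splitting the extracted text into pages is left to your extraction step (Tika's XHTML output marks page boundaries you can split on):

# 1. Extract the text with the Tika CLI (--text for plain text, --html to keep some formatting):
java -jar tika-app.jar --text my_document.pdf > my_document.txt

# 2. Index a "doc" doc holding the metadata and the path to the original file:
curl -XPUT 'http://127.0.0.1:9200/my_index/doc/123'  -d '
{
   "title"    : "My Document",
   "authors"  : [ "A. Author" ],
   "language" : "en",
   "path"     : "/archive/my_document.pdf"
}
'

# 3. Index each page as a "page" doc; here it gets an explicit $doc_id _ $page_num ID
#    (an option discussed further below) and the parent parameter
#    (see "Parent-child relationship" below):
curl -XPUT 'http://127.0.0.1:9200/my_index/page/123_2?parent=123'  -d '
{
   "doc_id" : 123,
   "page"   : 2,
   "text"   : "...the plain text of page 2..."
}
'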

Now for searching. How you do this depends on how you want to present your results - by page, or grouped by doc.

Results by page are easy. This query returns a list of matching pages (each page is returned in full) plus a list of highlighted snippets from the page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "text" : "interesting keywords"
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Displaying results grouped by "doc" with highlights from the text is a bit trickier. It can't be done with a single query, but a little client-side grouping will get you there. One approach might be:

Step 1: Do a top-children-query to find the parent ("doc") whose children ("page") best match the query:

curl -XGET 'http://127.0.0.1:9200/my_index/doc/_search?pretty=1'  -d '
{
   "query" : {
      "top_children" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "score" : "sum",
         "type" : "page",
         "factor" : "5"
      }
   }
}
'

Step 2: Collect the "doc" IDs from the above query and issue a new query to get the snippets from the matching "page" docs:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "filtered" : {
         "query" : {
            "text" : {
               "text" : "interesting keywords"
            }
         },
         "filter" : {
            "terms" : {
               "doc_id" : [ 1,2,3],
            }
         }
      }
   },
   "highlight" : {
      "fields" : {
         "text" : {}
      }
   }
}
'

Step 3: In your app, group the results from the above query by doc and display them.
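
If your client is a shell script, one way to sketch that grouping is with jq (field names as indexed above; step2_query.json is a hypothetical file holding the Step 2 query body):

# Sketch only: group the page hits by doc_id, keeping the page number and the
# highlighted snippets for display.
curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d @step2_query.json \
   | jq '[ .hits.hits[]
           | { doc_id: ._source.doc_id, page: ._source.page, snippets: .highlight.text } ]
         | group_by(.doc_id)'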

With the search results from the second query, you already have the full text of the page which you can display. To move to the next page, you can just search for it:

curl -XGET 'http://127.0.0.1:9200/my_index/page/_search?pretty=1'  -d '
{
   "query" : {
      "constant_score" : {
         "filter" : {
            "and" : [
               {
                  "term" : {
                     "doc_id" : 1
                  }
               },
               {
                  "term" : {
                     "page" : 2
                  }
               }
            ]
         }
      }
   },
   "size" : 1
}
'

Or alternatively, give the "page" docs an ID consisting of $doc_id _ $page_num (eg 123_2); then you can just retrieve that page:

curl -XGET 'http://127.0.0.1:9200/my_index/page/123_2'

Parent-child relationship:

Normally, in ES (and most NoSQL solutions) each doc/object is independent - there are no real relationships. By establishing a parent-child relationship between the "doc" and the "page", ElasticSearch makes sure that the child docs (ie the "page" docs) are stored on the same shard as the parent doc (the "doc").

This enables you to run the top-children-query which will find the best matching "doc" based on the content of the "pages".
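
A sketch of how that relationship is declared, using the index and type names from the examples above (the "page" mapping has to name its parent type before you index the pages, each of which is then sent with ?parent=<doc_id>):

# Declare "doc" as the parent type of "page":
curl -XPUT 'http://127.0.0.1:9200/my_index/page/_mapping'  -d '
{
   "page" : {
      "_parent" : { "type" : "doc" }
   }
}
'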

Solution 2

Use Sunspot or RSolr or similar; they handle most major document formats and are built on Solr/Lucene.

Solution 3

I've built and maintain an application that indexes and searches 70k+ PDF documents. I found it was necessary to pull out the plain text from the PDFs, store the contents in SQL and index the SQL table using Lucene. Otherwise, performance was horrible.
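
A sketch of just the extraction step, using pdftotext from Poppler (also recommended in the comments below); directory names are placeholders, and loading the .txt output into SQL and Lucene is not shown:

# Dump each PDF's plain text into a parallel .txt file, ready to be loaded
# into the database and indexed.
mkdir -p extracted
for pdf in pdfs/*.pdf; do
    pdftotext "$pdf" "extracted/$(basename "$pdf" .pdf).txt"
done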

Author: Meltemi (updated on December 27, 2020)

Comments

  • Meltemi over 3 years

    Revisiting a stalled project and looking for advice in modernizing thousands of "old" documents and making them available via web.

    Documents exist in various formats, some obsolete: (.doc, PageMaker, hardcopy (OCR), PDF, etc.). Funds are available to migrate the documents into a 'modern' format, and many of the hardcopies have already been OCR'd into PDFs - we had originally assumed that PDF would be the final format but we're open to suggestions (XML?).

    Once all docs are in a common format, we would like to make their contents available and searchable via a web interface. We'd like the flexibility to return only portions (pages?) of the entire document where a search 'hit' is found (I believe Lucene/ElasticSearch makes this possible?!?). Might it be more flexible if the content were all XML? If so, how/where to store the XML? Directly in a database, or as discrete files in the filesystem? What about embedded images/graphs in the documents?

    Curious how others might approach this. There is no "wrong" answer I'm just looking for as many inputs as possible to help us proceed.

    Thanks for any advice.

  • Dave Newton almost 12 years
    What was the benefit of storing the content in a DB? Wouldn't it have been easier to just extract the content (assuming you didn't just use Solr and skip the manual processing), index it, and throw away the plain-text content?
  • Meltemi almost 12 years
    Pros & cons to PDF over XML in this case? We have the option, at this stage, to go either way. I would think PDF might be easier to create at first but perhaps harder to maintain & "serve"?!? Dunno; looking for advice.
  • Josh Siok almost 12 years
    Alright... I had to go back and look at the code. Here's what I'm doing. First off, I must say, we have a separate indexing server that handles just this function. Here's the process: 1) extract text from PDFs on the content server; 2) store the text in .txt files using similar directory/file names; 3) index the text files. Upon searching, we are able to correlate the results to the original PDFs based on file paths/naming.
  • Dave Newton almost 12 years
    @Meltemi I don't see how a PDF would be more difficult to serve; a file is a file. XML files would need to be formatted, and you'd need to do conversion between all formats to XML.
  • Meltemi almost 12 years
    A file is a file, but we would like to "serve" only portions of the complete document at a time. So I suppose we could break each PDF up into hundreds of smaller PDFs, but it starts to become unwieldy. Wondering if XML might make this easier over the long haul?!? Perhaps not.
  • Dave Newton almost 12 years
    @Meltemi Totally depends; without knowing exact requirements it's difficult to say. XML DBs kind of fell out of favor. Content would still need to be formatted/transformed, which can be as simple or complex as you'd like. Transformation from the original source to XML, again depending on your needs, could be trivial, or essentially impossible. You might be better off using a big data solution and dropping files-at-the-application-level completely: an HBase row can have millions of columns, each containing a paragraph or whatever, each row being a single doc. Tons of solutions.
  • Meltemi almost 12 years
    @D.Newton - "tons of solutions". Well, that's why I'm asking the questions. I'm looking for ideas, not trying to pick sides. As for the "requirements", they're tied to what's possible, complexity & cost. Basically all I KNOW is that we'd like users to be able to query all these reports and, if there is a 'hit', present "some" portion of the document that includes the 'hit'. And, from there, I believe we'd like the user to be able to continue paging through the document, but not download the whole thing. Hope that makes sense?!?
  • Meltemi almost 12 years
    OK, I'll say it: "DrTech for President!" ;-) Fantastic answer! Wish I could upvote more. Thank you!
  • DrTech almost 12 years
    :) Funny that, my name is Clinton, after all :)
  • Marko Bonaci almost 12 years
    I don't see any benefit in using a relational DB here. @Dave, one correction: you don't throw away the original text content, you use the search engine (Solr, ES, ...) to both index and store it. Then, in the search results, you simply show a link to the original file.
  • Josh Siok almost 12 years
    There are two reasons we did it this way. First, overall indexing time was faster. Second, there is related data in the database that corresponds to each document, thus it was simpler to build the full index this way.
  • Meltemi almost 12 years
    You don't know, offhand, how to go about indexing each "page" of a PDF?
  • DrTech almost 12 years
    The Poppler tools (poppler.freedesktop.org), available by default on most Linux distros, are very fast and very good.
  • Meltemi over 11 years
    Any idea (example?) how to establish this parent/child relationship via the Tire gem?
  • AlvaroAV over 9 years
    Awesome explanation, you saved me infinite research hours! Thanks!!
  • windup about 9 years
    If you split by page then you will also possibly be unable to find phrases that are split across multiple pages, no?
  • Igor Beaufils about 8 years
    I know I'm a bit late, but when you say "1. Use Tika (or whatever you're comfortable with)", what are the alternatives to Tika (with ES)? I heard about GNU libextractor, but it seems a bit old; are there others?