What is a good Java Library to use for searching through several files for a list of search terms?

10,694

Solution 1

I would not recommend using Lucene (or Solr) for these requirements.

  1. First of all, there is no need for a full-featured text search library that (to put it simply) does all kinds of magic to provide robust text search, drawing on linguistic knowledge such as stemming, grammar and syntax tricks.

  2. While Lucene is powerful, you cannot get everything from it out of the box. As an example, it is relatively easy to configure it to find "apples" with an "apple" term. Okay. But with the same configuration it will not find "123" in the string "12345". And forget about "non-readable" texts like application logs. Lucene is a Google-like engine: it searches human-readable, properly written text on behalf of human readers. To handle all sorts of "basic" string matches you would need to write custom processing code that integrates with Lucene, and then it is not simple any more.

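With Java it is much simpler and quicker to write a BufferedReader-based scanner that recursively processes the files and folders and searches for exact or partial matches using String.contains and String.matches (note that String.matches tests the whole string against a regular expression, so partial matches need String.contains or a pattern such as ".*term.*").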
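As a rough illustration of the approach described above, here is a minimal sketch using only the standard library. The class name `FolderSearch`, the `Occurrence` type and the `search` method are hypothetical names made up for this example, not part of any existing library. It assumes the files are UTF-8 text and simply skips anything that cannot be read as such.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper written for this answer; not an existing library class.
class FolderSearch {

    /** One hit: the file, the 1-based line number, the term found, and the line text. */
    static final class Occurrence {
        final Path file;
        final int lineNumber;
        final String term;
        final String line;

        Occurrence(Path file, int lineNumber, String term, String line) {
            this.file = file;
            this.lineNumber = lineNumber;
            this.term = term;
            this.line = line;
        }

        @Override
        public String toString() {
            return file + ":" + lineNumber + " [" + term + "] " + line;
        }
    }

    /** Walks root recursively and reports every line that contains any of the terms. */
    static List<Occurrence> search(Path root, List<String> terms) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        List<Occurrence> hits = new ArrayList<>();
        for (Path file : files) {
            List<Occurrence> fileHits = new ArrayList<>();
            try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
                String line;
                int lineNumber = 0;
                while ((line = reader.readLine()) != null) {
                    lineNumber++;
                    for (String term : terms) {
                        // Plain substring match: "123" is found inside "12345".
                        if (line.contains(term)) {
                            fileHits.add(new Occurrence(file, lineNumber, term, line));
                        }
                    }
                }
                hits.addAll(fileHits);
            } catch (IOException e) {
                // Assumption: unreadable or non-UTF-8 (e.g. binary) files are skipped entirely.
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: FolderSearch <directory> <term> [<term> ...]");
            return;
        }
        List<String> terms = Arrays.asList(args).subList(1, args.length);
        for (Occurrence hit : search(Paths.get(args[0]), terms)) {
            System.out.println(hit);
        }
    }
}
```

This reads each file line by line, so a term that spans a line break will not be found; for log-style files that is usually acceptable.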

Solution 2

Have you considered using Lucene? It can index and search through text files for search terms as you require. It is not difficult to integrate into your app either, but it is not quite as simple as "ArrayList occurrences = SomeLibrary.parse("directoryPath","searchTerm");" :) I don't think you will find a solution that simple.

The performance of the search will also be good if you use Lucene.

You could go a step further and use Solr (also an Apache product) but this may be overkill for you.

If you decide to look into Lucene then this may be of some assistance to you.

Solution 3

I recommend Apache Solr. It is easy to configure and can index millions of documents. Solr applies many optimizations to its index and queries. There is plenty of documentation. And best of all, it is open source.

Solution 4

Grae, it goes like this:

  • Lucene is a native Java search library. It has a somewhat steep learning curve.
  • Solr is a search engine built using Lucene as a web application. It is much easier to learn, and can be used via an HTTP interface or a Java interface called Solrj.

If you prefer the minimal Java version, you need Lucene. If you want the quickest-to-implement solution, use Solr. Here's a Solr tutorial and a Lucene tutorial.

Both approaches here require an indexing stage and a later retrieval stage. Your question seems to have a more grep-like flavor, but I do not know of a matching Java library for this. You also did not describe the file types: bare Lucene works with raw text, so you may need Apache Tika to extract text and metadata from your files.
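For the grep-like flavor, the "Match Case" and "Whole Words Only" options from the question could be approximated with `java.util.regex`. The following is only a sketch under that assumption; `TermMatcher`, `compile` and `lineMatches` are hypothetical names invented here, and the whole-word check relies on the regex `\b` word boundary, which may not match everyone's idea of a "word".

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Lucene, Solr or the JDK.
class TermMatcher {

    /**
     * Builds a pattern that looks for the literal term.
     * matchCase = false makes the match case-insensitive;
     * wholeWord = true requires word boundaries on both sides.
     */
    static Pattern compile(String term, boolean matchCase, boolean wholeWord) {
        // Pattern.quote treats the term as literal text, so "a.b" does not match "axb".
        String regex = Pattern.quote(term);
        if (wholeWord) {
            // Caveat: \b only behaves as expected when the term starts and ends
            // with a word character (letter, digit or underscore).
            regex = "\\b" + regex + "\\b";
        }
        int flags = matchCase ? 0 : (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        return Pattern.compile(regex, flags);
    }

    /** Returns true if the term occurs anywhere in the line under the given options. */
    static boolean lineMatches(String line, String term, boolean matchCase, boolean wholeWord) {
        return compile(term, matchCase, wholeWord).matcher(line).find();
    }

    public static void main(String[] args) {
        System.out.println(lineMatches("An Apple a day", "apple", false, true));
        System.out.println(lineMatches("12345", "123", false, true));
    }
}
```

When scanning many lines, it would be better to call `compile` once per search term and reuse the resulting `Pattern`, rather than calling `lineMatches` for every line.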

Author by

GC_

Updated on June 04, 2022

Comments

  • GC_
    GC_ almost 2 years

    Basically, what I would like to do is search through a folder and its subfolders for a list of search terms. It does not have to be highly optimized or anything like that. I would like the library to be able to "Match Case," match "Whole Words Only," etc.

    I think I could write something like this myself, opening each file in a folder and searching each word, etc., but I really want a shortcut. Is there some library that already does most of this?

    My dream code would be something like:

    ArrayList occurrences = SomeLibrary.parse("directoryPath","searchTerm");
    

    Is there anything close to this high level?

    Thanks, Grae