Apache Lucene vs Google Search Appliance

It's probably hard to find a comparison between Apache Lucene and the Google Search Appliance (GSA) because they're such different things. Lucene is a software component for indexing documents with basic relevance "boosting" built in, while the GSA is an enterprise search product (an appliance, i.e. physical hardware) with lots of out-of-the-box functionality to tune and optimize search results, based on the Google search algorithm.

So they are basically two great tools for different implementation scenarios. Of course they overlap, especially when used to provide search on your average website.

Off the top of my head, here are a few topics you might want to start with for a comparison:

Deployment/Architecture

  • Lucene is a software component that can be deeply integrated into your own software, providing an index (usually file based, sometimes in memory) to index and retrieve content quickly.
  • The Lucene project provides quite a large list of analyzers for proper indexing of different languages (Western languages, Arabic, Asian languages etc.), though the analyzers still have room for improvement
  • Lucene.NET is quite a popular port for integration on Microsoft .NET platforms.
  • GSA is software and hardware bundled together and sold as an appliance with an HTTP(S) interface that provides the search results in either HTML (through its own XSLTs) or XML (for better integration into your website)
  • GSA comes with language bundles (installed and downloadable). You have to choose one of the bundles. If you need support for more languages, you might need to add another GSA to the infrastructure (if the required languages are not all in a single bundle)
  • GSA performs excellently and requires very little maintenance
  • GSA lets you scale with almost no engineering effort. Globally distributed but connected GSAs can be set up through the web interface
  • GSA can be made HA by purchasing a cheaper hot-backup module
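
To make the "HTTP(S) interface, HTML or XML" point a bit more concrete, here is a minimal Python sketch of how a request URL for such an appliance might be put together. The host name is a placeholder, and the parameter names (q, site, client, output, proxystylesheet) are given as I recall them from the GSA search protocol, so treat them as assumptions and check them against the appliance documentation before relying on them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_gsa_search_url(host, query, collection="default_collection",
                         frontend="default_frontend", as_xml=True):
    """Build a search URL for a GSA-style HTTP interface.

    host, collection and frontend are placeholder values; the parameter
    names are assumptions based on the GSA search protocol.
    """
    params = {
        "q": query,              # the user's search terms
        "site": collection,      # which collection to search
        "client": frontend,      # which front end configuration to use
        "output": "xml_no_dtd",  # raw XML results
    }
    if not as_xml:
        # Ask the appliance to render the XML to HTML via the front end's XSLT.
        params["proxystylesheet"] = frontend
    return f"https://{host}/search?{urlencode(params)}"


xml_url = build_gsa_search_url("gsa.example.com", "lucene vs gsa")
html_url = build_gsa_search_url("gsa.example.com", "lucene vs gsa", as_xml=False)

# Decode the query string again to show what would be sent.
print(parse_qs(urlsplit(xml_url).query))
print(parse_qs(urlsplit(html_url).query))
```

With XML output your own application parses the result list and renders it, which is the "better integration" case; with the stylesheet parameter the appliance returns ready-made HTML.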

Indexing

  • Lucene itself does not ship a web crawler (Apache Nutch is a separate Apache crawler project); it provides an API for adding documents to the index. It doesn't care whether your crawler actually crawls a website like Google does, runs SQL statements against a database, or reads a text stream from flat files. Usually you have to implement the crawler yourself, or adapt an existing one, to fit your needs
  • GSA uses the crawler technology used by Google and respects robots instructions (in robots.txt or meta tags). It provides a feed API for sources that cannot be crawled (i.e. no linking between them), and it supports setting up SQL queries against all major DBs to retrieve data from a database (be it URLs to crawl or the data itself)
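
The two ideas in this section, an index that does not care where documents come from and a crawler that respects robots instructions, can be sketched in a few lines of Python. This is a toy illustration only: the inverted index below is not Lucene's index format, the function names, URLs and the "db:42" id are made up, and the robots.txt handling uses Python's standard urllib.robotparser rather than anything from Google.

```python
from collections import defaultdict
from urllib.robotparser import RobotFileParser


def crawlable(robots_txt, user_agent, pages):
    """Yield the (url, text) pairs that robots.txt allows user_agent to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url, text in pages:
        if parser.can_fetch(user_agent, url):
            yield url, text


def build_index(documents):
    """Map each lowercased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents:
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


robots_txt = "User-agent: *\nDisallow: /private/\n"
pages = [
    ("https://example.com/index.html", "Lucene indexing basics"),
    ("https://example.com/private/report.html", "internal report"),
]
# Hypothetical rows returned by a SQL query; the index treats them like pages.
db_rows = [("db:42", "GSA feed example")]

index = build_index(list(crawlable(robots_txt, "toy-crawler", pages)) + db_rows)
print(sorted(index["lucene"]))
print(sorted(index["gsa"]))
```

The page under /private/ never reaches the index because the robots check filters it out first, while the database row is indexed exactly like a crawled page.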

Retrieval / relevance tuning

  • Lucene does not aim at and has no good support for relevance tuning (except boosting entries in the index). It's up to the application using the index results to do the tuning
  • Lucene is the index used by Solr, which provides tuning and an architecture more similar to a GSA (including result retrieval over HTTP(S))
  • GSA lets you bias result sets based on metadata, date and URL patterns. In the latest version you can even set up your own entities and bias the results based on them
  • GSA supports facets for metadata out of the box, plus some fancier features in its interface such as preview images for documents, autosuggest etc.
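
As a rough illustration of what "boosting" and "biasing" mean here, the following Python sketch scores documents with per-field boosts and then adjusts the score by URL pattern and by date. It is neither Lucene's scoring formula nor the GSA's biasing logic; the weights, patterns, recency factor and sample documents are all invented for the example.

```python
from datetime import date
from fnmatch import fnmatchcase

FIELD_BOOSTS = {"title": 3.0, "body": 1.0}        # hypothetical field weights
URL_BIAS = {"*/docs/*": 1.5, "*/archive/*": 0.5}  # hypothetical URL patterns
RECENT_FACTOR = 1.25                              # hypothetical date bias


def score(doc, term, today):
    """Term count per field times field boost, then URL and date bias."""
    value = sum(
        boost * doc.get(field, "").lower().split().count(term)
        for field, boost in FIELD_BOOSTS.items()
    )
    for pattern, factor in URL_BIAS.items():
        if fnmatchcase(doc["url"], pattern):
            value *= factor
    if (today - doc["date"]).days <= 365:
        value *= RECENT_FACTOR  # favour documents from the last year
    return value


docs = [
    {"url": "https://example.com/archive/old.html", "title": "Old notes",
     "body": "lucene lucene lucene", "date": date(2010, 5, 1)},
    {"url": "https://example.com/docs/search.html", "title": "Lucene search",
     "body": "lucene is a library", "date": date(2016, 1, 10)},
]

today = date(2016, 2, 4)
ranked = sorted(docs, key=lambda d: score(d, "lucene", today), reverse=True)
for d in ranked:
    print(d["url"], score(d, "lucene", today))
```

The archived page mentions the term more often, but the title boost, the URL bias and the date bias together push the newer documentation page to the top. With plain Lucene this kind of business rule lives in your application (or in Solr), whereas the GSA exposes comparable knobs in its admin interface.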

Commercial things

  • Lucene is an open source product (no license costs), but requires hardware to be purchased
  • GSA starts at around $20k for 500k documents/URLs
  • Google provides several support levels
  • GSA licenses have to be renewed every 2 or 3 years (you get new hardware)
  • GSA does not require any additional hardware (appliance is included)

...there's so much more to add, but I hope you get the point.


Update February 2016:

Google has informed partners that the GSA will be discontinued around 2019. The best site to link to at the moment seems to be http://fortune.com/2016/02/04/google-ends-search-appliance/.

Author: Riju Mahna

Working as a Technical Lead in LnT Infotech. Learning Java for almost 9 years now....and will continue to do so...as there is no end in sight, yet !!!

Updated on June 03, 2022

Comments

  • Riju Mahna, almost 2 years ago

    Has anyone come across the features of Apache Lucene? I heard it's even comparable to the Google Search Appliance (GSA). I was looking for a definitive comparison between the two, if possible.

    The comparisons available online are pretty vague.

  • Doug T., over 8 years ago
    I disagree with this comment: "Lucene does not aim at and has no good support for relevance tuning (except boosting entries in the index). It's up to the application using the index results to do the tuning", considering there are whole books on Lucene-based relevance tuning: manning.com/turnbull (yes, that's my book)
  • Reto Hugi, over 8 years ago
    Agreed, I was not specific enough regarding the area of "tuning". Lucene provides a scoring mechanism, and scores can be boosted at document and field level and at query time. But AFAIK it is still up to the application using Lucene to apply business rules (Solr, Elasticsearch etc. provide such mechanisms). Would you mind explaining where specifically you would disagree with that? I would update my answer accordingly. Thank you.