Configuring Tika With Solr

solr apache-tika

11,425

See ExtractingRequestHandler for integrating Solr with Tika.
Solr ships with a built-in Tika configuration, so you only need to define tika.config if you want to override it.
You can go with the default handler configuration as defined in solrconfig.xml:

<!-- Solr Cell Update Request Handler

   http://wiki.apache.org/solr/ExtractingRequestHandler 

-->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">links</str>
    <str name="fmap.div">ignored_</str>
  </lst>
</requestHandler>

You can use a command like the following to index a file into Solr with additional metadata:

curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.title=Test&commit=true&fmap.content=text" -F "[email protected]"

By default, the extracted content of the file goes into the content field and is copied over to text; you can override these settings with request parameters such as fmap.content.
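As a rough sketch of how the pieces of that request fit together, the small shell helper below rebuilds the same /update/extract URL from an id and a title and posts a file with curl. The function names build_extract_url and index_file, the SOLR_BASE variable, and the form-field name myfile are all assumptions made for illustration; curl's -F option takes the form name=@path, and the values are not URL-encoded here, so the id and title must already be URL-safe.

```shell
#!/bin/sh
# Assumed base URL; adjust host, port, and core/collection for your setup.
SOLR_BASE="${SOLR_BASE:-http://localhost:8983/solr}"

# build_extract_url ID TITLE   (hypothetical helper)
# Prints the extract URL with literal.* metadata, an immediate commit,
# and the Tika "content" output mapped to the "text" field.
# Note: no URL encoding is done, so ID and TITLE must be URL-safe.
build_extract_url() {
  printf '%s/update/extract?literal.id=%s&literal.title=%s&commit=true&fmap.content=text\n' \
    "$SOLR_BASE" "$1" "$2"
}

# index_file ID TITLE PATH   (hypothetical helper)
# Uploads PATH as multipart form data; "myfile" is an arbitrary field name.
index_file() {
  curl "$(build_extract_url "$1" "$2")" -F "myfile=@$3"
}
```

With a running Solr instance, calling index_file 2 Test followed by the path of a local document should send a request equivalent to the curl command above.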


Author by

user2475624

Updated on June 04, 2022

Comments

user2475624 almost 2 years

I am looking to index rich document types (PDF, DOC, RTF, TXT) into Solr. I found Tika as a solution. I searched the web but didn't find any docs or links on making it work with ExtractingRequestHandler.

Can anyone please provide a step-by-step way to configure Tika with ExtractingRequestHandler?

Thanks in advance :)
user2475624 almost 11 years

@jayedra One issue: while indexing file types other than PDF, it throws a java.lang.NoClassDefFoundError. Any clue?
user2475624 almost 11 years

Thanks Jayendra. That was a URL issue, so Jetty was throwing an exception; anyway, it's solved. But now I can't see my docs with a Solr query. What might be wrong? Any clue?
user2475624 almost 11 years

@jayedra here is my query: stackoverflow.com/questions/17697019/…