How can I index .html files in Solr?

You can use the Solr ExtractingRequestHandler to feed Solr the HTML file and extract its contents.

Solr uses Apache Tika to extract content from the uploaded HTML file.
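
For reference, a minimal sketch of registering that handler in solrconfig.xml, following the stock example config (the /update/extract name and the mappings below are the conventional defaults; the attr_ prefix assumes a matching attr_* dynamic field in schema.xml):

    <!-- Solr Cell / ExtractingRequestHandler: Tika parses uploaded files -->
    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>   <!-- lowercase the extracted field names -->
        <str name="fmap.content">text</str> <!-- map the parsed body to the "text" field -->
        <str name="uprefix">attr_</str>     <!-- prefix for fields not in the schema -->
        <str name="captureAttr">true</str>  <!-- also index element attributes -->
      </lst>
    </requestHandler>

HTML meta names that Tika extracts can then be mapped to schema fields with additional fmap.<name>=<field> request parameters.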

Nutch with Solr is a broader solution if you want to crawl websites and have them indexed.
The Nutch with Solr Tutorial will get you started.
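
For a rough idea of the workflow, the one-shot crawl command from the Nutch 1.x tutorial of that era looked something like this (urls is a directory of seed URL files; the Solr URL, depth, and topN values are illustrative):

    bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5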


Comments

  • Anand Khatri, almost 2 years ago

    The files I want to index are stored on the server (I don't need to crawl anything); they live under /path/to/files/. A sample HTML file looks like this:

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    <meta name="product_id" content="11"/>
    <meta name="assetid" content="10001"/>
    <meta name="title" content="title of the article"/>
    <meta name="type" content="0xyzb"/>
    <meta name="category" content="article category"/>
    <meta name="first" content="details of the article"/>
    
    <h4>title of the article</h4>
    <p class="link"><a href="#link">How cite the Article</a></p>
    <p class="list">
      <span class="listterm">Length: </span>13 to 15 feet<br>
      <span class="listterm">Height to Top of Head: </span>up to 18 feet<br>
      <span class="listterm">Weight: </span>1,200 to 4,300 pounds<br>
      <span class="listterm">Diet: </span>leaves and branches of trees<br>
      <span class="listterm">Number of Young: </span>1<br>
      <span class="listterm">Home: </span>Sahara<br>
    
    </p>
    

    I have added the request handler to the solrconfig.xml file:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">/path/to/data-config.xml</str>
      </lst>
    </requestHandler>
    

    My data-config.xml looks like this:

    <dataConfig>
      <dataSource type="FileDataSource" />
      <document>
        <entity name="f" processor="FileListEntityProcessor"
                baseDir="/path/to/html/files/" fileName=".*html"
                recursive="true" rootEntity="false" dataSource="null">
          <field column="plainText" name="text"/>
        </entity>
      </document>
    </dataConfig>
    

    I have kept the default schema.xml file and added the following to it:

     <field name="product_id" type="string" indexed="true" stored="true"/>
     <field name="assetid" type="string" indexed="true" stored="true" required="true" />
     <field name="title" type="string" indexed="true" stored="true"/>
     <field name="type" type="string" indexed="true" stored="true"/>
     <field name="category" type="string" indexed="true" stored="true"/>
     <field name="first" type="text_general" indexed="true" stored="true"/>
    
     <uniqueKey>assetid</uniqueKey>
    

    When I run a full import after setting this up, it reports that all the HTML files were fetched. But when I search in Solr, I get no results. Does anyone have an idea of the possible cause?

    My understanding is that all the files are fetched correctly but not indexed in Solr. Does anyone know how I can index those meta tags and the content of the HTML files in Solr?

    Your reply will be appreciated.
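
    A likely cause, for anyone hitting the same wall: FileListEntityProcessor only lists files and emits metadata such as their paths; it never reads file contents, so the plainText column above is never populated. One commonly cited fix is to nest a TikaEntityProcessor inside it, sketched here with the field mappings as illustrative assumptions:

    <dataConfig>
      <!-- BinFileDataSource streams each file's raw bytes to Tika -->
      <dataSource type="BinFileDataSource" name="bin"/>
      <document>
        <entity name="f" processor="FileListEntityProcessor"
                baseDir="/path/to/html/files/" fileName=".*html"
                recursive="true" rootEntity="false" dataSource="null">
          <!-- Nested entity: Tika parses each file the outer entity lists -->
          <entity name="tika" processor="TikaEntityProcessor"
                  url="${f.fileAbsolutePath}" dataSource="bin" format="text">
            <field column="title" name="title"/>
            <field column="text" name="text"/>
          </entity>
        </entity>
      </document>
    </dataConfig>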

  • Anand Khatri, over 11 years ago
    Your comment is not very clear to me. Can you please elaborate on how you created the program, how you created the XML, and how you linked that to Solr?
  • Chris Warner, over 11 years ago
    Sure. The program could be a C# or Java program that reads your HTML files and builds, from their meta fields, a formatted <add><doc><field1/><field2/></doc></add> XML file or files. Then point the DataImportHandler at these properly formatted XML files to update the index. Does that help?
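    For concreteness, the standard Solr XML update message being described looks roughly like this (the field names follow the schema from the question and are illustrative):

    <add>
      <doc>
        <field name="assetid">10001</field>
        <field name="title">title of the article</field>
        <field name="category">article category</field>
        <field name="first">details of the article</field>
      </doc>
    </add>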
  • Anand Khatri, over 11 years ago
    Oh, so it means I have to write an external program, initially feed all the files to it, have it generate the related XML files, and only then can Solr do the indexing. I want something automated and fast because I have several terabytes of files, so it's good to have an automated process.
  • Chris Warner, over 11 years ago
    You mentioned not wanting to crawl the HTML files, which would be very easy with nutch.apache.org. I think I'd use Nutch to crawl the HTML files, or I'd write a program to read the HTML files and update the index. I wouldn't use the DataImportHandler at all.
  • Anand Khatri, over 11 years ago
    Do you know how to configure Apache Nutch with Solr? I tried Nutch once but didn't succeed, and the Nutch documentation is not very clear. If you know how, can you please help me set it up and configure it?
  • Anand Khatri, over 11 years ago
    I am more interested in the Tika configuration. But in the documentation they use the curl command. I don't want to go with curl; I want an automated process. Do you have any working example with Tika and Solr? It would be clearer and more helpful.
  • Jayendra, over 11 years ago
    The curl is only an example. You can use a client like SolrJ to check your folder and push the changes to Solr. You can schedule a job to do the same. Tika acts as a wrapper to identify the file type and parse it with the appropriate libraries. You do not need to make any changes.
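    A minimal sketch of that SolrJ approach, assuming a Solr 4.x-era client and an extract handler registered at /update/extract (the URL, file path, and literal field value are placeholders):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class HtmlIndexer {
        public static void main(String[] args) throws Exception {
            // Placeholder URL; point this at your Solr core.
            SolrServer server = new HttpSolrServer("http://localhost:8983/solr");

            // Send the raw HTML to the extract handler; Tika parses it server-side.
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File("/path/to/html/files/sample.html"), "text/html");
            req.setParam("literal.assetid", "10001"); // supply the uniqueKey explicitly
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);

            server.request(req);
            server.shutdown();
        }
    }

    A scheduled job could walk the directory tree and post each new or changed file this way.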
  • Anand Khatri, over 11 years ago
    I have posted another question about the Tika 1.2 and Solr 4 configuration. Can you please take a look over there and tell me what I am doing wrong?