Search/Find a file and file content in Hadoop


Solution 1

  1. You can do this: hdfs dfs -ls -R / | grep [search_term].
  2. It sounds like a MapReduce job might be suitable here, something along the lines of a distributed grep over text files (see the sketch below). However, if these documents are small, you may run into inefficiencies: each file will be assigned to one map task, and if the files are small, the overhead of setting up the map task may be significant compared to the time needed to process the file.
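
If you want to try that without writing your own job, the grep example bundled with the Hadoop distribution is one starting point. This is only a sketch: the jar path and the /user/leon paths below are assumptions to adapt to your install, the job works on plain text rather than Word docs, and it reports match counts rather than the names of the matching files.

hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar \
    grep /user/leon/docs /user/leon/grep-out 'computer science'
hdfs dfs -cat /user/leon/grep-out/part-r-00000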

Solution 2

You can use the HdfsFindTool from Solr (org.apache.solr.hadoop.HdfsFindTool); it is quicker than hdfs dfs -ls -R and more flexible.

hadoop jar search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool -find /user/hive/tmp -mtime 7

Usage: hadoop fs [generic options]
    [-find <path> ... <expression> ...]
    [-help [cmd ...]]
    [-usage [cmd ...]]
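
On Hadoop 2.7 and later, the regular shell also ships a built-in -find (name matching only, no content search), so for simple cases you can skip the extra jar; the path and pattern here are just examples:

hdfs dfs -find / -name 'testfile*' -print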

Solution 3

Depending on how the data is stored in HDFS, you may need to use the -text option of hdfs dfs for a string search. In my case I had thousands of messages stored daily in a series of HDFS sequence files in Avro format. From the command line on an edge node, this script:

  1. Searches the /data/lake/raw directory at its first level for a list of files.
  2. Passes the result to awk, which outputs columns 6 and 8 (date and file name).
  3. Greps for lines containing the file date in question (2018-05-03).
  4. Passes those two-column lines to awk, which outputs only column 2, the list of files.
  5. Reads that list with a while loop, which takes each file name and extracts it from HDFS as text.
  6. Greps each line of the file for the string "7375675".
  7. Outputs lines meeting that criterion to the screen (stdout).

There is a Solr jar-file implementation that is supposedly faster, but I have not tried it.

hadoop fs -ls /data/lake/raw | awk '{print $6"   "$8}' | grep 2018-05-03 | awk '{print $2}' | while read -r f; do hadoop fs -text "$f" | grep 7375675 && echo "$f"; done
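
For readability, here is the same pipeline broken across lines with comments; the behavior is unchanged, and it assumes HDFS paths without embedded spaces:

hadoop fs -ls /data/lake/raw |      # 1. list the first level of the directory
    awk '{print $6"   "$8}' |       # 2. keep columns 6 and 8 (date and path)
    grep 2018-05-03 |               # 3. keep files dated 2018-05-03
    awk '{print $2}' |              # 4. keep only the path column
    while read -r f; do
        # 5-7. decode each file to text, search it, print matching file names
        hadoop fs -text "$f" | grep 7375675 && echo "$f"
    done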

Solution 4

Usually when I'm searching for files in Hadoop, as ajduff574 stated, it's done with:

hdfs dfs -ls -R $path | grep "$file_pattern" | awk '{print $8}'

This simply prints out the path of each matching file, and the output can then be manipulated further in case you wish to search within the content of the files. For example:

hdfs dfs -cat $(hdfs dfs -ls -R $path | grep "$file_pattern" | awk '{print $8}') | grep "$search_pattern"

search_pattern: The content that you are looking for within the file.

file_pattern: The file that you are looking for.

path: The path for the search to look into recursively; this includes subfolders as well.
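
One caveat with cat-ing every matching file into a single grep is that the output no longer tells you which file contained the hit (a limitation also raised in the comments below). Here is a minimal per-file sketch that prints the names of matching files instead; hdfsgrep is a hypothetical helper, not a standard command:

hdfsgrep() {
    path="$1"; file_pattern="$2"; search_pattern="$3"
    hdfs dfs -ls -R "$path" | grep "$file_pattern" | awk '{print $8}' |
    while read -r f; do
        # print the file name once if its content matches
        hdfs dfs -cat "$f" | grep -q "$search_pattern" && echo "$f"
    done
}

For example: hdfsgrep /user/leon 'testfile' 'computer science'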

Comments

  • leon, about 3 years ago

    I am currently working on a project using Hadoop DFS.

    1. I notice there is no search or find command in Hadoop Shell. Is there a way to search and find a file (e.g. testfile.doc) in Hadoop DFS?

    2. Does Hadoop support file content search? If so, how do I do it? For example, I have many Word doc files stored in HDFS, and I want to list which files have the words "computer science" in them.

    What about other distributed file systems? Is file content search a weak spot of distributed file systems?

  • ajduff574, almost 13 years ago
    Maybe I should also mention that Lucene (lucene.apache.org) can do indexing and search, and I think there is a plugin for Word docs. You can probably rig something together. I think there has been some work done on Lucene + Hadoop.
  • leon, almost 13 years ago
    Thanks for your reply. But isn't hadoop dfs -lsr / | grep [search_term] very slow against many files or directories?
  • ajduff574, almost 13 years ago
    It's definitely not fast, but it isn't too bad. On our cluster, with >100,000 files, it still takes less than a minute, which I think is pretty acceptable.
  • leon, almost 13 years ago
    @ajduff574 I assume the recursive list (lsr) command does not use any map/reduce function to do the search, right? Why doesn't Hadoop support search at the metadata level, since all the metadata is stored in the NameNode's RAM?
  • mwol, about 3 years ago
    Unfortunately it seems it does not output the actual files that contained the search term, so it is probably not really helpful in your case.