Get a list of files from an HDFS (Hadoop) directory using a Python script


Solution 1

Use subprocess

import subprocess
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> |  awk '{print $8}'",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT)

for line in p.stdout.readlines():
    print line

EDIT: An answer without Python. The first command can be used to recursively print all the sub-directories as well. The final redirect can be omitted or changed to suit your requirements.

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt
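
For reference, here is a minimal sketch of driving the same recursive listing from Python with subprocess.check_output instead of redirecting to a file (the <HDFS LOCATION> placeholder and an hdfs client on the PATH are assumptions):

import subprocess

# Recursively list everything under the given HDFS directory and keep only the path column.
out = subprocess.check_output(
    "hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}'",
    shell=True)
paths = out.decode().splitlines()  # one HDFS path per entry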


Solution 2

import subprocess

path = "/data"
args = "hdfs dfs -ls "+path+" | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

s_output, s_err = proc.communicate()
all_dart_dirs = s_output.split() #stores list of files and sub-directories in 'path'
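
Since the path is spliced into a shell command, a hedged variant using shlex.quote guards against spaces or special characters in the directory name (the "/data" value is just the example from above):

import shlex
import subprocess

path = "/data"
args = "hdfs dfs -ls " + shlex.quote(path) + " | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
s_output, s_err = proc.communicate()
entries = s_output.decode().split()  # same result, but safe against odd characters in 'path'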

Solution 3

Why not have the HDFS client do the hard work by using the -C flag, instead of relying on awk or Python to print the specific columns of interest?

i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on new lines and then you will have your list of paths.
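
A minimal sketch of that idea (the list_hdfs_dir name is just for illustration, and it assumes the hdfs client is on the PATH):

from subprocess import Popen, PIPE

def list_hdfs_dir(dirname):
    # -C makes the client print only the paths, one per line, so no awk is needed
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    return out.decode().splitlines()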

Here's a fuller example with logging and error handling (including the case where the directory or file doesn't exist):

from subprocess import Popen, PIPE
import logging

logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Return a list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
    # universal_newlines=True makes stdout/stderr text rather than bytes
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname],
                 stdout=PIPE, stderr=PIPE, universal_newlines=True)
    (out, err) = proc.communicate()
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = ('Failed to list HDFS directory "' + dirname +
                  '", return code ' + str(proc.returncode))
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()
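
A small usage example, assuming the function above and an HDFS directory such as /data:

if __name__ == '__main__':
    logging.basicConfig(level=logging.DEBUG)
    for entry in hdfs_ls('/data'):
        print(entry)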
Author: sara

Updated on June 08, 2022

Comments

  • sara (about 2 years ago)

    How do I get a list of files from an HDFS (Hadoop) directory using a Python script?

    I have tried with following line:

    dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

    The directory has a list of files "file1, file2, file3, ..., fileN". Using that line, I only get the contents of the files, but I need the list of file names.

    Can anyone please help me figure this out?

    Thanks in advance.

  • Adrian W (about 5 years ago)
    This looks like an improvement of this answer. You might want to edit your answer to explain the improvements.