process files in a directory as they appear

Solution 1

It sounds as if you simply should write a small processing script and use GNU Parallel for parallel processing:

http://www.gnu.org/software/parallel/man.html#example__gnu_parallel_as_dir_processor

So something like this:

inotifywait -q -m -r -e CLOSE_WRITE --format %w%f my_dir |
  parallel 'mv {} /tmp/processing/{/};myscript.sh /tmp/processing/{/} other_inputs; rm /tmp/processing/{/}'

Watch the intro videos to learn more: http://pi.dk/1

Edit:

It is required that myscript.sh can deal with zero-length files (e.g. ignore them).
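A minimal sketch of such a guard at the top of myscript.sh (how myscript.sh receives its filename is an assumption, not something stated in the answer):

```shell
#!/bin/bash
# Hypothetical first lines of myscript.sh: ignore files that are still
# zero-length (touched, but not yet written to).
infile=$1
if [ ! -s "$infile" ]; then
    exit 0   # empty file: nothing to process yet
fi
# ... real processing of "$infile" continues below ...
```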

If you can avoid the touch you can even do:

inotifywait -q -m -r -e CLOSE_WRITE --format %w%f my_dir |
  parallel myscript.sh {} other_inputs

Installing GNU Parallel is as easy as:

wget http://git.savannah.gnu.org/cgit/parallel.git/plain/src/parallel
chmod 755 parallel

Solution 2

First, your script will operate on only one file (the last in the list). Also, I don't think a one-liner is always appropriate or elegant. Cron does a lot behind the scenes, and you need to be able to review things that fail. Running cron "frequently" may be an issue: you may end up with dozens of these processes running, slowing down the system as they all try to process the files in their queue.
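One common guard against that pile-up is flock(1): if a previous cron invocation still holds the lock, the new one exits immediately. This is a sketch under the assumption that processing is driven from a single cron entry; the lock-file path is hypothetical.

```shell
#!/bin/bash
# Sketch: exit at once if a previous cron invocation is still running.
# /tmp/process_inbox.lock is a hypothetical lock-file path.
exec 9>/tmp/process_inbox.lock
if ! flock -n 9; then
    exit 0   # another instance still holds the lock; try again next minute
fi
# ... the processing loop would go here, protected by the lock ...
```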

This is what I'd do.

Dir="$HOME/input_files"   # never hardcode when you have variables
for filename in "$Dir"/*.xml; do
    # is the file non-empty AND is it still there, or was it caught by
    # another process?
    if [ -s "$filename" ]; then
        # moving files locally will be faster than crossing filesystems to /tmp
        mkdir -p "$Dir/.processing"
        # temp name should use the pid, in case another input with the same name comes in
        tempname="$Dir/.processing/$(basename "$filename" .xml).$$"
        mv "$filename" "$tempname"
        # send stdout and stderr to a .output file
        myscript.sh "$tempname" other_inputs > "$tempname.output" 2>&1
        rc=$?
        if [ $rc -eq 0 ]; then
            rm "$tempname" "$tempname.output"
        else
            echo "Error processing $filename; rc=$rc" >&2
            echo "File in $tempname" >&2
        fi
    fi
done

This will either remove the file after processing, or on error keep the file in the .processing directory along with the output of the command. The loop above doesn't throttle anything, but it does allow more than one instance to run without interfering with the others. There are other questions here on how to build fairly efficient work queues if you need to go further.
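If you do want a throttle, a bounded worker pool can be had without extra tooling via xargs -P. This is a sketch, not part of the answer above; myscript.sh and the directory layout are the same assumptions as in the question.

```shell
#!/bin/bash
# Sketch: process every non-empty .xml, at most 4 workers at a time.
# myscript.sh is the hypothetical per-file processor from the question.
find "$HOME/input_files" -maxdepth 1 -name '*.xml' -type f -size +0c -print0 |
    xargs -0 -r -P 4 -n 1 myscript.sh
```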

Solution 3

Use the inotify(7) interface to monitor the incoming directory rather than polling through cron. inotify-tools gives you the inotifywait program, which you can use to monitor a directory if you don't want to write code against the system-call interface.
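As a sketch (assuming inotify-tools is installed, and treating myscript.sh as the per-file processor from the question), the event loop can be as small as:

```shell
#!/bin/bash
# Sketch: react to each fully written file instead of polling with cron.
# close_write fires when a writer closes the file, i.e. when it is complete.
inotifywait -m -q -e close_write --format '%w%f' "$HOME/input_files" |
while IFS= read -r path; do
    myscript.sh "$path" other_inputs
done
```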


Author: J Jones

Updated on September 18, 2022

Comments

  • J Jones
    J Jones over 1 year

    Possible Duplicate:
    How to run a command when a directory's contents are updated?

    I'm trying to write a simple ETL process that would check a directory for files each minute and, if any exist, load them onto a remote system (via a script) and then delete them.

    Things that complicate this: the loading may take more than a minute. To get around that, I figured I could move all files into a temporary processing directory, act on them there, and then delete them from there. Also, in my attempt to get better at command line scripting, I'm trying for a more elegant solution. I started out by writing a simple script to accomplish my task, shown below:

    #!/bin/bash
    
    for i in $(find /home/me/input_files/ -name "*.xml"); do
    FILE=$i;
    done;
    BASENAME=`basename $FILE`
    mv $FILE /tmp/processing/$BASENAME
    myscript.sh /tmp/processing/$BASENAME other_inputs
    rm /tmp/processing/$BASENAME
    

    This script removes the file from the processing directory almost immediately (which stops the duplicate processing problem), cleans up after itself at the end, and allows the file to be processed in between.

    However, this is U/Linux after all. I feel like I should be able to accomplish all this in a single line by piping and moving things around instead of a bulky script to maintain.

    Also, using parallel to process this concurrently would be a plus.

    Addendum: some sort of FIFO queue might be the answer to this as well. Or maybe some other sort of directory watcher instead of a cron job. I'm open to all suggestions that are more elegant than my little script. The only issue is that the files in the "input directory" are touched moments before they are actually written to, so some sort of ! -size -0 test would be needed to only handle real files.
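    The size test gestured at above ("! -size -0") can be spelled out with find's -size +0c, i.e. more than zero bytes; a sketch, using the path from the script above:

```shell
# Sketch: list only regular, non-empty .xml files, skipping the
# zero-byte placeholders created by the touch.
find /home/me/input_files -name '*.xml' -type f -size +0c -print
```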

  • clerksx
    clerksx about 12 years
    Danger Will Robinson! This will break if $HOME has spaces in it; you must quote every instance of $Dir. There's also a fairly useless use of $? when you could just run the command within the if statement.
  • Arcege
    Arcege about 12 years
    Quite right. I doubt that a home directory would have spaces, though. The adduser routines raise an error if a username has a non-standard character, but the home directory can be set independently. Answer edited.
  • Arcege
    Arcege about 12 years
    Preserving $? is rarely useless. I commonly use the value further down the chain, either OR'ing it to return to the calling routine or for some other action.
  • J Jones
    J Jones about 12 years
    How will the inotify interface handle my issue of touched files (i.e., zero-byte versions being created), when I only want to be notified of (or take action on) the complete files?
  • R Perrin
    R Perrin about 12 years
    See Ole Tange's informative example: CLOSE_WRITE is the event that occurs at the time you want - end of file creation. Here's another example: sauers.com/blog/linux-tip-inotify
  • J Jones
    J Jones about 12 years
    Ole, the touch actually comes from parallel it seems. I have parallel which operates on a list of files, changing their format and contents. However, it quickly runs through the inputs creating output files, and then slowly processes the inputs. (You can see my earlier process here: unix.stackexchange.com/questions/32162/… )
  • J Jones
    J Jones about 12 years
    And my next question would be: how can I use something like awk (which needs to be escaped by single quotes) in a parallel statement where I want to send multiple commands (and therefore need to enclose the commands in single quotes as well), such as in your top example. A simple use case is: ls ~ | parallel ' du -k | awk '{ print $1 }';ls -l'
  • J Jones
    J Jones about 12 years
    Answering my own comment: you can escape characters with quoting (i.e., take them out of parallel's execution), so my parallel looks like parallel ' awk '\''{ print $1 } '\''...
  • Ole Tange
    Ole Tange about 12 years
    For more advanced quoting see: --shellquote, --quote, gnu.org/software/parallel/man.html#quoting
  • Ole Tange
    Ole Tange about 12 years
    The 'touch' is probably not a touch then: it is simply an open file with no content. Thus you can use the version above with no 'touch'. The key here is whether the file is closed and re-opened or whether it stays open, and it most likely does the latter.
  • J Jones
    J Jones about 12 years
    After exploring what alerts get raised on it, I would agree that it is just an open but empty file as data is being streamed into it. Using CLOSE_WRITE was perfect for my situation.
  • J Jones
    J Jones about 12 years
    Using inotifywait and running a string of commands (as described here: gnu.org/software/parallel/man.html#example__composed_commands) was my final solution.