Sort an array of pathnames of files by their basenames

Solution 1

Contrary to ksh or zsh, bash has no builtin support for sorting arrays or lists of arbitrary strings. It can sort globs or the output of alias or set or typeset (though those last 3 not in the user's locale sorting order), but that can't be used practically here.

There's nothing in the POSIX toolchest that can readily sort arbitrary lists of strings either¹: sort sorts lines, so it only handles short sequences of characters other than NUL and newline (LINE_MAX is often smaller than PATH_MAX), while file paths are non-empty sequences of bytes other than 0.

So while you could implement your own sorting algorithm in awk (using the < string comparison operator) or even bash (using [[ < ]]), for arbitrary paths in bash, portably, the easiest may be to resort to perl:

With bash 4.4 or newer, you could do:

readarray -td '' sorted_filearray < <(perl -MFile::Basename -l0 -e '
  print for sort {basename($a) cmp basename($b)} @ARGV' -- "${filearray[@]}")

That gives a strcmp()-like order. For an order based on the locale's collation rules, as in globs or the output of ls, add a -Mlocale argument to perl. For a numeric sort, use <=> instead of cmp (and again -Mlocale for the user's decimal mark to be honoured, as it is for the sort command). That behaves more like GNU sort -g, in that it supports numbers like +3 and 1.2e-5 and does not support thousand separators, though it does not support hexadecimals.

You'll be limited by the maximum size of arguments to a command. To avoid that, you could pass the list of files to perl on its stdin instead of via arguments:

readarray -td '' sorted_filearray < <(
  printf '%s\0' "${filearray[@]}" | perl -MFile::Basename -0le '
    chomp(@files = <STDIN>);
    print for sort {basename($a) cmp basename($b)} @files')

With older versions of bash, you could use a while IFS= read -rd '' loop instead of readarray -d '' or get perl to output the list of paths properly quoted so you can pass it to eval "array=($(perl...))".

With zsh, you can fake a glob expansion for which you can define a sort order:

sorted_filearray=(/(e{'reply=($filearray)'}oe{'REPLY=$REPLY:t'}))

With reply=($filearray) we actually force the glob expansion (which initially was just /) to be the elements of the array. Then we define the sort order to be based on the tail of the filename.

For a strcmp()-like order, fix the locale to C. For numeric sort (similar to GNU sort -V, not sort -n which makes a significant difference when comparing 1.4 and 1.23 (in locales where . is the decimal mark) for instance), add the n glob qualifier.

Instead of oe{expression}, you can also use a function to define a sorting order like:

by_tail() REPLY=$REPLY:t

or more advanced ones like:

by_numbers_in_tail() REPLY=${(j:,:)${(s:,:)${REPLY:t}//[^0-9]/,}}

(so a/foo2bar3.pdf (2,3 numbers) sorts after b/bar1foo3.pdf (1,3) but before c/baz2zzz10.pdf (2,10)) and use as:

sorted_filearray=(/(e{'reply=($filearray)'}no+by_numbers_in_tail))

Of course, those can be applied on real globs as that's what they're primarily intended for. For instance, for a list of pdf files in any directory, sorted by basename/tail:

pdfs=(**/*.pdf(N.oe+by_tail))

¹ If a strcmp()-based sorting is acceptable, and for short strings, you could transform the strings to their hex-encoding with awk before passing to sort and transform back after sorting.
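For completeness, the pure-bash route mentioned earlier (a hand-written sort using [[ < ]] or [[ > ]]) might be sketched as a simple insertion sort. This is only an illustration: it is quadratic in the number of paths, it takes the basename with ${path##*/} (which yields an empty string for paths ending in /), and in recent bash versions the comparison inside [[ ]] follows the current locale's collation.

```bash
filearray=("dir1/0010.pdf" "dir2/0003.pdf" "dir3/0040.pdf")

sorted=()
for path in "${filearray[@]}"; do
  base=${path##*/}          # basename: everything after the last slash
  i=${#sorted[@]}
  # Shift elements whose basename sorts after $base one slot to the right.
  while (( i > 0 )) && [[ ${sorted[i-1]##*/} > $base ]]; do
    sorted[i]=${sorted[i-1]}
    (( i-- ))
  done
  sorted[i]=$path           # drop the new path into the gap
done

printf '%s\n' "${sorted[@]}"
```

No external command is started, so neither the argument size limit nor the newline problem applies, at the cost of speed on long lists.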

Solution 2

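sort supports a custom field separator (-t) and sort key (-k); both are standard POSIX options rather than GNU extensions. Set / as the field separator and sort from the second field onwards to sort on the basename instead of the entire path. This only works when every path consists of a single directory followed by the file name.

printf "%s\n" "${filearray[@]}" | sort -t/ -k2

will produce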

dir2/0003.pdf
dir1/0010.pdf
dir3/0040.pdf

Solution 3

oldIFS="$IFS"; IFS=$'\n'
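if [[ -o noglob ]]; then
  setglob=0                 # globbing is already off; nothing to restore later
else
  setglob=1; set -o noglob  # turn globbing off for the unquoted expansion below
fi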

sorted=( $(printf '%s\n' "${filearray[@]}" |
            awk '{ print $NF, $0 }' FS='/' OFS='/' |
            sort | cut -d'/' -f2- ) )

IFS="$oldIFS"; unset oldIFS
(( setglob == 1 )) && set +o noglob
unset setglob

File names that contain newlines will cause issues at the sort step.

It generates a /-delimited list with awk that contains the basename in the first column and the complete path as the remaining columns:

0003.pdf/dir2/0003.pdf
0010.pdf/dir1/0010.pdf
0040.pdf/dir3/0040.pdf

This is what is sorted, and cut is used to remove the first /-delimited column. The result is turned into a new bash array.
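With bash 4 or newer, the same decorate, sort, undecorate pipeline could probably be written without touching IFS or noglob by letting readarray -t split the output on newlines. This is a sketch of that variant, not the code above; the newline limitation still applies.

```bash
filearray=("dir1/0010.pdf" "dir2/0003.pdf" "dir3/0040.pdf")

# Prefix each path with its basename, sort, then strip the prefix again.
readarray -t sorted < <(
  printf '%s\n' "${filearray[@]}" |
    awk '{ print $NF, $0 }' FS='/' OFS='/' |
    sort | cut -d'/' -f2-
)

printf '%s\n' "${sorted[@]}"
```

readarray -t stores each line as one element without word splitting or globbing, which is why the save and restore steps are not needed here.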

Solution 4

Sorting with a gawk expression, reading the result back with bash's readarray:

Sample array of file names containing whitespace:

filearray=("dir1/name 0010.pdf" "dir2/name  0003.pdf" "dir3/name 0040.pdf")

readarray -t sortedfilearr < <(printf '%s\n' "${filearray[@]}" | awk -F'/' '
   BEGIN{PROCINFO["sorted_in"]="@val_num_asc"}
   { a[$0]=$NF }
   END{ for(i in a) print i}')

The output:

echo "${sortedfilearr[*]}"
dir2/name 0003.pdf dir1/name 0010.pdf dir3/name 0040.pdf

Accessing single item:

echo "${sortedfilearr[1]}"
dir1/name 0010.pdf

That assumes that no file path contains newline characters. Note that the numerical sorting of the values in @val_num_asc only applies to the leading numerical part of the key (none in this example) with fallback to lexical comparison (based on strcmp(), not the locale's sorting order) for ties.

Solution 5

Since "dir1 and dir2 are arbitrary pathnames", we can't count on them consisting of a single directory (or of the same number of directories). So we need to convert the last slash in the pathnames to something that does not occur elsewhere in the pathname. Supposing the character @ does not occur in your data, you can sort by basename like this:

cat pathnames | sed 's|\(.*\)/|\1@|' | sort -t@ -k+2 | sed 's|@|/|'

The first sed command replaces the last slash in each pathname with the chosen separator, the second reverses the change. (For simplicity I'm assuming the pathnames can be delivered one per line. If they are in a shell variable, convert them to one-per-line format first.)
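When the pathnames are in a bash array rather than a file, the conversion to one-per-line format could look like this sketch. The sample paths are hypothetical and deliberately differ in depth; it assumes bash 4 or newer for readarray, that no path contains @ or a newline, and that every path contains at least one slash.

```bash
filearray=("a/b/c/0010.pdf" "dir2/0003.pdf" "x/y/0040.pdf")

readarray -t sorted < <(
  printf '%s\n' "${filearray[@]}" |
    sed 's|\(.*\)/|\1@|' |   # replace the last slash with @
    sort -t@ -k2 |           # sort on what follows @, i.e. the basename
    sed 's|@|/|'             # put the slash back
)

printf '%s\n' "${sorted[@]}"
```

Here -k2 is written in the standard form; it selects the same key as the -k+2 spelling used above.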

Author: Tim

Updated on September 18, 2022

Comments

  • Tim
    Tim almost 2 years

    Suppose that I have list of pathnames of files stored in an array

    filearray=("dir1/0010.pdf" "dir2/0003.pdf" "dir3/0040.pdf" ) 
    

    I want to sort the elements in the array according to the basenames of the filenames, in numeric order

    sortedfilearray=("dir2/0003.pdf" "dir1/0010.pdf" "dir3/0040.pdf") 
    

    How can I do that?

    I can only sort their basename parts:

    basenames=()
    for file in "${filearray[@]}"
    do
        filename=${file##*/}
        basenames+=(${filename%.*})
    done
    sortedbasenamearr=($(printf '%s\n' "${basenames[@]}" | sort -n))
    

    I am thinking about

    • creating an associative array whose keys are the basenames and values are the pathnames, so access to the pathnames is always done via basenames.
    • creating another array for basenames only, and apply sort to the basename array.

    Thanks.

    • Jeff Schaller
      Jeff Schaller almost 7 years
      It's not a good idea, but you can sort in bash
    • Jeff Schaller
      Jeff Schaller almost 7 years
      Careful with an array keyed on the basenames, if you could have dir1/42.pdf and dir2/42.pdf
    • Tim
      Tim almost 7 years
      That (different pathnames with the same basename) doesn't happen in my case. But if a bash script can deal with it, that will be great. I don't have reasonably good requirements on how to sort pathnames with the same basename, maybe someone else may. dir1 dir2 are just made up, and they are actually arbitrary pathnames.
  • Tim
    Tim almost 7 years
    thanks. what is a "list" in bash? Is it different from bash array? I never heard of it and it would be great. yes, storing the filenames in a "list" could be a good idea. I got the filenames as $@ or $* from command line arguments for running a script
  • Jeff Schaller
    Jeff Schaller almost 7 years
    Storing the file names in a file allows for external utilities, but also risks misinterpretation of, say, newlines.
  • Tim
    Tim almost 7 years
    Is Schwartzian Transform used in sorting some kind of design pattern, e.g. template, strategy, ... patterns, as introduced in the book Design Pattern by Gang of Four?
  • roaima
    roaima almost 7 years
    @JeffSchaller fortunately there are no newlines in numbers. If I was writing completely generic filename-safe code I quite possibly wouldn't be using bash.
  • Kusalananda
    Kusalananda almost 7 years
    This is a standard option for sort, not a GNU extension. This will work if the paths are all of the same length.
  • MiniMax
    MiniMax almost 7 years
    Same answer in the same time :)
  • Kusalananda
    Kusalananda almost 7 years
    @StéphaneChazelas A bit hairy, but ok...
  • Stéphane Chazelas
    Stéphane Chazelas almost 7 years
    Note that arguably, it computes the wrong basename for paths like /some/dir/.
  • Kusalananda
    Kusalananda almost 7 years
    @StéphaneChazelas Yes, but the OP specifically said he had paths of files, so I'll just assume that there is a proper basename at the end of the path.
  • roaima
    roaima almost 7 years
    +1 for a solution that doesn't require passing each file's full name to sort
  • Stéphane Chazelas
    Stéphane Chazelas almost 7 years
    Note that in a typical GNU non-C locale, a/x.c++ b/x.c-- c/x.c++ would be sorted in that order even though - sorts before +, because the primary weight of -, + and / is IGNORE (so comparing x.c++/a/x.c++ against x.c--/b/x.c-- first compares xcaxc against xcbxc, and only in case of ties would the other weights, where - comes before +, be considered).
  • Stéphane Chazelas
    Stéphane Chazelas almost 7 years
    That could be worked around by joining on /x/ instead of /, but that wouldn't address the case where in the C locale on ASCII based systems, a/foo would sort after a/foo.txt for instance because / sorts after ..
  • Federico Poloni
    Federico Poloni almost 7 years
    This works only if the paths contain a single directory each. What about some/long/path/0011.pdf? As far as I can see from its man page, sort has no option to sort by the last field.
  • kael
    kael about 6 years
    Ha! This is great! I made it slightly more robust (and slightly uglier) by subbing a non-displaying character like so: cat pathnames | sed 's|\(.*\)/|\1'$'\4''|' | sort -t$'\4' -k+2nr | sed 's|'$'\4''|/|'. (I just grabbed \4 from the ascii table. Apparently "END OF TEXT"?)
  • kael
    kael about 6 years
    See this answer below for a great bash one-liner: unix.stackexchange.com/a/394166/41735
  • alexis
    alexis about 6 years
    @kael, \4 is ^D (control-D). Unless you type it yourself at the terminal, it's an ordinary control character. In other words, safe to use in this way.
  • eMPee584
    eMPee584 over 3 years
    had hoped for exactly this, thanks for covering my laziness 😀
  • Flurrywinde
    Flurrywinde about 3 years
    Note that the n in -k+2nr in @kael's modification makes this only work on filenames that consist only of numbers. Leave it out if the filename has other characters. (The r reverses the sort.)