Sort an array of pathnames of files by their basenames
Solution 1
Contrary to ksh or zsh, bash has no builtin support for sorting arrays or lists of arbitrary strings. It can sort globs or the output of alias, set or typeset (though those last 3 not in the user's locale sorting order), but that can't be used practically here.
There's nothing in the POSIX toolchest that can readily sort arbitrary lists of strings either¹ (sort sorts lines, so only short (LINE_MAX being often shorter than PATH_MAX) sequences of characters other than NUL and newline, while file paths are non-empty sequences of bytes other than 0).
So while you could implement your own sorting algorithm in awk (using the < string comparison operator) or even bash (using [[ < ]]), for arbitrary paths in bash, portably, the easiest may be to resort to perl.
With bash 4.4+, you could do:
readarray -td '' sorted_filearray < <(perl -MFile::Basename -l0 -e '
print for sort {basename($a) cmp basename($b)} @ARGV' -- "${filearray[@]}")
That gives a strcmp()-like order. For an order based on the locale's collation rules like in globs or the output of ls, add a -Mlocale argument to perl. For numeric sort (more like GNU sort -g as it supports numbers like +3 or 1.2e-5 and not thousand separators, though not hexadecimals), use <=> instead of cmp (and again -Mlocale for the user's decimal mark to be honoured like for the sort command).
You'll be limited by the maximum size of arguments to a command. To avoid that, you could pass the list of files to perl on its stdin instead of via arguments:
readarray -td '' sorted_filearray < <(
printf '%s\0' "${filearray[@]}" | perl -MFile::Basename -0le '
chomp(@files = <STDIN>);
print for sort {basename($a) cmp basename($b)} @files')
With older versions of bash, you could use a while IFS= read -rd '' loop instead of readarray -d '', or get perl to output the list of paths properly quoted so you can pass it to eval "array=($(perl...))".
With zsh, you can fake a glob expansion for which you can define a sort order:
sorted_filearray=(/(e{'reply=($filearray)'}oe{'REPLY=$REPLY:t'}))
With reply=($filearray) we actually force the glob expansion (which initially was just /) to be the elements of the array. Then we define the sort order to be based on the tail of the filename.
For a strcmp()-like order, fix the locale to C. For numeric sort (similar to GNU sort -V, not sort -n, which makes a significant difference when comparing 1.4 and 1.23 (in locales where . is the decimal mark) for instance), add the n glob qualifier.
Instead of oe{expression}, you can also use a function to define a sorting order like:
by_tail() REPLY=$REPLY:t
or more advanced ones like:
by_numbers_in_tail() REPLY=${(j:,:)${(s:,:)${REPLY:t}//[^0-9]/,}}
(so a/foo2bar3.pdf (2,3 numbers) sorts after b/bar1foo3.pdf (1,3) but before c/baz2zzz10.pdf (2,10))
and use as:
sorted_filearray=(/(e{'reply=($filearray)'}no+by_numbers_in_tail))
Of course, those can be applied on real globs as that's what they're primarily intended for. For instance, for a list of pdf files in any directory, sorted by basename/tail:
pdfs=(**/*.pdf(N.oe+by_tail))
¹ If a strcmp()-based sorting is acceptable, and for short strings, you could transform the strings to their hex-encoding with awk before passing to sort and transform back after sorting.
Solution 2
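sort lets you set a custom field separator and sort key (with -t and -k, both standard options rather than GNU extensions). Setting / as the field separator and sorting on the second field sorts on the basename instead of the entire path, provided every path consists of exactly one directory followed by the file name.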
printf "%s\n" "${filearray[@]}" | sort -t/ -k2
will produce
dir2/0003.pdf
dir1/0010.pdf
dir3/0040.pdf
Solution 3
oldIFS="$IFS"; IFS=$'\n'
# Disable globbing for the unquoted expansion below,
# remembering whether it has to be re-enabled afterwards.
if [[ -o noglob ]]; then
    setglob=0
else
    setglob=1; set -o noglob
fi
sorted=( $(printf '%s\n' "${filearray[@]}" |
    awk '{ print $NF, $0 }' FS='/' OFS='/' |
    sort | cut -d'/' -f2- ) )
IFS="$oldIFS"; unset oldIFS
(( setglob == 1 )) && set +o noglob
unset setglob
Sorting of file names with newlines in their names will cause issues at the sort step.
It generates a /-delimited list with awk that contains the basename in the first column and the complete path as the remaining columns:
0003.pdf/dir2/0003.pdf
0010.pdf/dir1/0010.pdf
0040.pdf/dir3/0040.pdf
This is what is sorted, and cut is used to remove the first /-delimited column. The result is turned into a new bash array.
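The same decorate, sort, undecorate pipeline can be sketched with readarray -t, which avoids changing IFS and the noglob option. This assumes bash 4 or later and paths without newlines; the extra deeper path is only there to illustrate that the number of directories does not matter:

```bash
#!/usr/bin/env bash
filearray=("dir1/0010.pdf" "dir2/0003.pdf" "dir3/0040.pdf" "some/long/path/0011.pdf")

# Prefix each path with its basename, sort the lines, then drop the prefix again.
readarray -t sorted < <(
  printf '%s\n' "${filearray[@]}" |
    awk -F/ -v OFS=/ '{ print $NF, $0 }' |
    sort |
    cut -d/ -f2-
)

printf '%s\n' "${sorted[@]}"
```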
Solution 4
Sorting with a gawk expression (the result is read into an array with bash's readarray):
Sample array of filenames containing whitespace:
filearray=("dir1/name 0010.pdf" "dir2/name 0003.pdf" "dir3/name 0040.pdf")
readarray -t sortedfilearr < <(printf '%s\n' "${filearray[@]}" | awk -F'/' '
BEGIN{PROCINFO["sorted_in"]="@val_num_asc"}
{ a[$0]=$NF }
END{ for(i in a) print i}')
The output:
echo "${sortedfilearr[*]}"
dir2/name 0003.pdf dir1/name 0010.pdf dir3/name 0040.pdf
Accessing single item:
echo "${sortedfilearr[1]}"
dir1/name 0010.pdf
That assumes that no file path contains newline characters. Note that the numerical sorting of the values in @val_num_asc only applies to the leading numerical part of the key (none in this example) with fallback to lexical comparison (based on strcmp(), not the locale's sorting order) for ties.
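If a true numeric sort on the digits inside the basename is wanted, one possible alternative (a different technique from the gawk array traversal above, and only a sketch) is to prepend the first run of digits as a numeric key and let sort -n do the ordering. It assumes bash 4 or later, paths without newlines, and that the first digit run in the basename is the number to sort on:

```bash
#!/usr/bin/env bash
filearray=("dir1/name 0010.pdf" "dir2/name 0003.pdf" "dir3/name 0040.pdf")

readarray -t sortedfilearr < <(
  printf '%s\n' "${filearray[@]}" |
    awk -F/ '{
      digits = $NF                      # work on the basename only
      gsub(/[^0-9]+/, " ", digits)      # keep digit runs, separated by spaces
      split(digits, run, " ")           # run[1] is the first digit run, if any
      printf "%d\t%s\n", run[1], $0     # numeric key, a tab, then the full path
    }' |
    sort -t $'\t' -k1,1n |
    cut -f2-
)

printf '%s\n' "${sortedfilearr[@]}"
```

Basenames without any digits get the key 0 and therefore sort first.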
Solution 5
Since "dir1 and dir2 are arbitrary pathnames", we can't count on them consisting of a single directory (or of the same number of directories). So we need to convert the last slash in the pathnames to something that does not occur elsewhere in the pathname. Supposing the character @ does not occur in your data, you can sort by basename like this:
cat pathnames | sed 's|\(.*\)/|\1@|' | sort -t@ -k+2 | sed 's|@|/|'
The first sed command replaces the last slash in each pathname with the chosen separator, the second reverses the change. (For simplicity I'm assuming the pathnames can be delivered one per line. If they are in a shell variable, convert them to one-per-line format first.)
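For paths held in a bash array, the conversion to one-per-line format and back might look like the following sketch. It assumes bash 4 or later for readarray, that no path contains @ or a newline, and that every path contains at least one slash; the array name sorted_by_tail is made up for the example:

```bash
#!/usr/bin/env bash
filearray=("a/b/0010.pdf" "c/0003.pdf" "d/e/f/0040.pdf")

readarray -t sorted_by_tail < <(
  printf '%s\n' "${filearray[@]}" |   # one path per line
    sed 's|\(.*\)/|\1@|' |            # last slash becomes @
    sort -t@ -k2 |                    # sort on everything after the @
    sed 's|@|/|'                      # restore the slash
)

printf '%s\n' "${sorted_by_tail[@]}"
```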
Tim
Updated on September 18, 2022
Comments
-
Tim almost 2 years
Suppose that I have a list of pathnames of files stored in an array
filearray=("dir1/0010.pdf" "dir2/0003.pdf" "dir3/0040.pdf" )
I want to sort the elements in the array according to the basenames of the filenames, in numeric order
sortedfilearray=("dir2/0003.pdf" "dir1/0010.pdf" "dir3/0040.pdf")
How can I do that?
I can only sort their basename parts:
basenames=()
for file in "${filearray[@]}"
do
    filename=${file##*/}
    basenames+=(${filename%.*})
done
sortedbasenamearr=($(printf '%s\n' "${basenames[@]}" | sort -n))
I am thinking about
- creating an associative array whose keys are the basenames and values are the pathnames, so access to the pathnames is always done via basenames.
- creating another array for basenames only, and applying sort to the basename array.
Thanks.
-
Jeff Schaller almost 7 years
It's not a good idea, but you can sort in bash
-
Jeff Schaller almost 7 years
Careful with an array keyed on the basenames, if you could have dir1/42.pdf and dir2/42.pdf
-
Tim almost 7 years
That (different pathnames with the same basename) doesn't happen in my case. But if a bash script can deal with it, that will be great. I don't have reasonably good requirements on how to sort pathnames with the same basename, maybe someone else may. dir1 and dir2 are just made up, and they are actually arbitrary pathnames.
-
Tim almost 7 years
thanks. what is a "list" in bash? Is it different from bash array? I never heard of it and it would be great. yes, storing the filenames in a "list" could be a good idea. I got the filenames as $@ or $* from command line arguments for running a script
-
Jeff Schaller almost 7 years
Storing the file names in a file allows for external utilities, but also risks misinterpretation of, say, newlines.
-
Tim almost 7 years
Is Schwartzian Transform used in sorting some kind of design pattern, e.g. template, strategy, ... patterns, as introduced in the book Design Patterns by the Gang of Four?
-
roaima almost 7 years
@JeffSchaller fortunately there are no newlines in numbers. If I was writing completely generic filename-safe code I quite possibly wouldn't be using bash.
-
Kusalananda almost 7 years
This is a standard option for sort, not a GNU extension. This will work if the paths are all of the same length.
-
MiniMax almost 7 years
Same answer in the same time :)
-
Kusalananda almost 7 years
@StéphaneChazelas A bit hairy, but ok...
-
Stéphane Chazelas almost 7 years
Note that arguably, it computes the wrong basename for paths like /some/dir/.
-
Kusalananda almost 7 years
@StéphaneChazelas Yes, but the OP specifically said he had paths of files, so I'll just assume that there is a proper basename at the end of the path.
-
roaima almost 7 years
+1 for a solution that doesn't require passing each file's full name to sort
-
Stéphane Chazelas almost 7 years
Note that in a typical GNU non-C locale, a/x.c++ b/x.c-- c/x.c++ would be sorted in that order even though - sorts before + because the primary weight of -, + and / is IGNORE (so comparing x.c++/a/x.c++ against x.c--/b/x.c-- first compares xcaxc against xcbxc, and only in case of ties would the other weights (where - comes before +) be considered).
-
Stéphane Chazelas almost 7 years
That could be worked around by joining on /x/ instead of /, but that wouldn't address the case where in the C locale on ASCII based systems, a/foo would sort after a/foo.txt for instance because / sorts after the . character.
-
Federico Poloni almost 7 years
This works only if the paths contain a single directory each. What about some/long/path/0011.pdf? As far as I can see from its man page, sort contains no option to sort by the last field.
kael about 6 years
Ha! This is great! I made it slightly more robust (and slightly uglier) by subbing a non-displaying character like so:
cat pathnames | sed 's|\(.*\)/|\1'$'\4''|' | sort -t$'\4' -k+2nr | sed 's|'$'\4''|/|'
(I just grabbed \4 from the ascii table. Apparently "END OF TEXT"?)
-
kael about 6 years
See this answer below for a great bash one-liner: unix.stackexchange.com/a/394166/41735
-
alexis about 6 years
@kael, \4 is ^D (control-D). Unless you type it yourself at the terminal, it's an ordinary control character. In other words, safe to use in this way.
-
eMPee584 over 3 years
had hoped for exactly this, thanks for covering my laziness 😀
-
Flurrywinde about 3 years
Note that the n in -k+2nr in @kael's modification makes this only work on filenames that consist only of numbers. Leave it out if the filename has other characters. (The r reverses the sort.)