Can tr commands be chained to avoid multiple tr processes in a pipeline?
Solution 1
You can combine multiple translations (except in complex cases involving overlapping locale-dependent sets), but you can't combine deletion with translation.
<doyle_sherlock_holmes.txt tr -d '[:punct:]' | tr '[:upper:] ' '[:lower:]\n'
Two calls to tr are likely to be faster than a single call to more complex tools, but this is very dependent on the input size, the proportions of different characters, the implementations of tr and competing tools, the operating system, the number of cores, etc.
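For instance, the two-pass pipeline behaves like this on a small sample (the sample text here is invented for illustration):

```shell
# Hypothetical sample input; the first tr strips punctuation,
# the second lowercases and breaks words onto separate lines.
printf 'Hello, World! Foo' |
  tr -d '[:punct:]' |
  tr '[:upper:] ' '[:lower:]\n'
```

This prints "hello", "world", and "foo" on three lines.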
Solution 2
Yes. You can do that with tr in an ASCII locale (which is, for GNU tr anyway, kind of its only purview). You can use the POSIX classes, or you can reference the byte value of each character by octal number. You can split the transformations across ranges as well.
LC_ALL=C tr '[:upper:]\0-\101\133-140\173-\377' '[:lower:][\n*]' <input
The above command would transform all uppercase characters to lowercase, ignore lowercase chars entirely, and transform all other characters to newlines. Of course, then you wind up with a ton of blank lines. The tr -s (squeeze repeats) switch could be useful in that case, but if you use it alongside the [:upper:] to [:lower:] transformation then you wind up squeezing uppercase characters as well. In that way it still requires a second filter like...
LC... tr ... | tr -s \\n
...or...
LC... tr ... | grep .
...and so it winds up being a lot less convenient than doing...
LC_ALL=C tr -sc '[:alpha:]' \\n <input | tr '[:upper:]' '[:lower:]'
...which squeezes each sequence in the -c (complement) of alphabetic characters into a single newline apiece, then does the upper-to-lower transform on the other side of the pipe.
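A minimal sketch of that approach on an invented sample, assuming GNU tr (which pads a one-character set2 to the length of set1):

```shell
# Runs of non-alphabetic bytes collapse to a single newline each,
# then the second tr lowercases the result.
printf 'One, two... THREE!' |
  LC_ALL=C tr -sc '[:alpha:]' '\n' |
  tr '[:upper:]' '[:lower:]'
```

This yields "one", "two", "three", one word per line, with no blank lines in between.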
That isn't to say that ranges of that nature are not useful. Stuff like:
tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]' </dev/random
...can be pretty handy as it converts the input bytes to all digits over a spread spectrum of their values. Waste not, want not, you know.
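To make the banding concrete, here is the same spec applied to a few fixed bytes instead of /dev/random (the input bytes are chosen arbitrarily):

```shell
# The repeats carve 0-255 into bands: bytes 0-24 map to 1,
# 25-49 to 2, ..., 200-224 to 9, and the [0*] fills 225-255.
# \0 = 0 -> 1, \100 = 64 -> 3, \200 = 128 -> 6, \300 = 192 -> 8
printf '\0\100\200\300' |
  LC_ALL=C tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]'
```

This prints "1368".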
Another way to do the transform could involve dd.
tr '\0-\377' '[A*64][B*64][C*64][D*64]' </dev/urandom |
dd bs=32 cbs=8 conv=unblock,lcase count=1
dadbbdbd
ddaaddab
ddbadbaa
bdbdcadd
Because dd can do both unblock and lcase conversions at the same time, it might even be possible to pass much of the work off to it. But that can only be really useful if you can accurately predict the number of bytes per word - or at least pad each word with spaces beforehand to a predictable byte count - because unblock eats the trailing spaces at the end of each block.
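A sketch of that padding idea, with two invented words hand-padded to 8-byte blocks so that unblock and lcase can finish the job:

```shell
# Each 8-byte block has its trailing spaces eaten by unblock,
# which appends a newline; lcase folds the letters at the same time.
printf 'CAT     DOG     ' |
  dd cbs=8 conv=unblock,lcase 2>/dev/null
```

This prints "cat" and "dog" on separate lines (dd's transfer statistics are sent to stderr).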
Solution 3
Here are a few approaches:
GNU grep and tr: find all words and make them lower case
grep -Po '\w+' file | tr '[A-Z]' '[a-z]'
GNU grep and perl: as above but perl handles the conversion to lower case
grep -Po '\w+' file | perl -lne 'print lc()'
perl: find all alphabetic characters and print them in lower case (thanks @steeldriver):
perl -lne 'print lc for /[a-z]+/ig' file
sed: remove all characters that are not alphabetic or spaces, substitute all alphabetic characters with their lower case versions, and replace all spaces with newlines. Note that this assumes all whitespace is spaces, not tabs.
sed 's/[^a-zA-Z ]\+//g;s/[a-zA-Z]\+/\L&/g; s/ \+/\n/g' file
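The question also mentions awk; here is a comparable sketch (not from the original answers) relying only on POSIX awk's tolower() and split():

```shell
# Lowercase the whole line, split on runs of non-alphabetic
# characters, and print one word per line (skipping empty fields).
printf 'Elementary, my dear Watson!\n' |
  awk '{ n = split(tolower($0), w, /[^a-z]+/)
         for (i = 1; i <= n; i++) if (w[i] != "") print w[i] }'
```

This prints "elementary", "my", "dear", "watson", one per line.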
Updated on September 18, 2022
Comments
- tlehman over 1 year: I have a bunch of txt files. I'd like to output them lower-cased, only alphabetic, and one word per line. I can do it with several tr commands in a pipeline like this:
tr -d '[:punct:]' <doyle_sherlock_holmes.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n'
Is it possible to do this in one scan? I could write a C program to do this, but I feel like there's a way to do it using tr, sed, awk, or perl.
- terdon over 9 years: What OS are you using? Do you have access to the GNU tools?
- smw over 9 years: Would something like
perl -lne 'print lc for /[[:alpha:]]+/g'
also work? Or is it poor style? (I'm new to perl and trying to learn!)
- terdon over 9 years: @steeldriver yes it would, nice one! If you're learning Perl, I'm sure you've come across its motto: TMTOWTDI :) Thanks, I'll add that one.
- Costas over 9 years: I am not sure re combining:
tr -s '[:upper:] [:punct:]' '[:lower:]\n' <doyle_sherlock_holmes.txt
- Costas over 9 years: With a new sed version (> 4.2.1):
sed -z 's/\W*\(\w\+\)\W*/\L\1\n/g'
- terdon over 9 years: @Costas ah, sed can do \w now? Cool!
- tlehman over 9 years: +2 bonus points for getting dd involved :)
- mikeserv over 9 years: @TobiLehman - I'm very pleased you approve.
- mikeserv over 9 years: @Costas - while the newline thing might be acceptable here, I don't think squeezing the uppercase chars would be. For example:
printf 'A.AAAA,A' | tr -s '[:upper:] [:punct:]' '[:lower:][\n*]'
gets a\na\na, and the transformation for '[:lower:]\n' might not necessarily do anything at all to '[:punct:]' anyway - some trs will truncate set1 to match set2 and some will do an implied [\n*]. It's better just to use the range there.
- mikeserv over 9 years: @terdon - it's done that for a while, but, because Costas didn't mention it, I think the most interesting thing about the above comment is GNU sed's -z (zero delimit) switch - it cycles over NULs rather than newlines. Pretty cool when you do something like
tar -c . | tr -s \\0 | sed -z ...
- but kinda slow.