Can tr commands be chained to avoid multiple tr processes in a pipeline?


Solution 1

You can combine multiple translations (excepting complex cases involving overlapping locale-dependent sets), but you can't combine deletion with translation.

<doyle_sherlock_holmes.txt tr -d '[:punct:]' | tr '[:upper:] ' '[:lower:]\n'
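To illustrate on a small sample (output shown under the command):

printf 'Hello, World!' | tr -d '[:punct:]' | tr '[:upper:] ' '[:lower:]\n'

hello
world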

Two calls to tr are likely to be faster than a single call to more complex tools, but this is very dependent on the input size, on the proportions of different characters, on the implementation of tr and competing tools, on the operating system, on the number of cores, etc.
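Which case wins is easy enough to measure on your own data, though; for example, by timing this answer's pipeline against the GNU sed approach from the last answer below:

time tr -d '[:punct:]' <doyle_sherlock_holmes.txt | tr '[:upper:] ' '[:lower:]\n' >/dev/null
time sed 's/[^a-zA-Z ]\+//g;s/[a-zA-Z]\+/\L&/g;s/ \+/\n/g' doyle_sherlock_holmes.txt >/dev/null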

Solution 2

Yes. You can do that with tr in an ASCII locale (which, for GNU tr anyway, is kind of its only purview). You can use the POSIX classes, or you can reference the byte value of each character by octal number. You can also split the transformations across ranges.

LC_ALL=C tr '[:upper:]\0-\100\133-\140\173-\377' '[:lower:][\n*]' <input

The above command would transform all uppercase characters to lowercase, ignore lowercase chars entirely, and transform all other characters to newlines. Of course, then you wind up with a ton of blank lines. tr's -s (squeeze-repeats) switch could be useful in that case, but if you use it alongside the [:upper:] to [:lower:] transformation then you wind up squeezing runs of uppercase characters as well.
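For example, with -s in place the doubled letters collapse (output shown under the command):

printf 'AABB cc' | LC_ALL=C tr -s '[:upper:]\0-\100\133-\140\173-\377' '[:lower:][\n*]'

ab
c

So it still requires a second filter like...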

LC... tr ... | tr -s \\n

...or...

LC... tr ... | grep .

...and so it winds up being a lot less convenient than doing...

LC_ALL=C tr -sc '[:alpha:]' \\n <input | tr '[:upper:]' '[:lower:]'

...which squeezes the complement (-c) of alphabetic characters by sequence into a single newline apiece, then does the upper-to-lower transform on the other side of the pipe.
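A quick sanity check of that pipeline:

printf 'Hello, World! 42' | LC_ALL=C tr -sc '[:alpha:]' \\n | tr '[:upper:]' '[:lower:]'

hello
world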

That isn't to say that ranges of that nature are not useful. Stuff like:

tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]' </dev/random

...can be pretty handy as it converts the input bytes to all digits over a spread spectrum of their values. Waste not, want not, you know.
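If you only want a finite sample of that stream, /dev/urandom will not block the way /dev/random can, and head can cut it off:

tr '\0-\377' '[1*25][2*25][3*25][4*25][5*25][6*25][7*25][8*25][9*25][0*]' </dev/urandom | head -c 10; echo

Note that 0 comes out slightly over-represented: the nine [n*25] classes cover only 225 of the 256 byte values, so [0*] pads out the remaining 31.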

Another way to do the transform could involve dd.

tr '\0-\377' '[A*64][B*64][C*64][D*64]' </dev/urandom |
dd bs=32 cbs=8 conv=unblock,lcase count=1

dadbbdbd
ddaaddab
ddbadbaa
bdbdcadd

Because dd can do both unblock and lcase conversions at the same time, it might even be possible to pass much of the work off to it. But that can only be really useful if you can accurately predict the number of bytes per word - or at least can pad each word with spaces beforehand to a predictable byte count, because unblock eats trailing spaces at the end of each block.
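A minimal sketch of that padding idea, assuming every word fits in 8 bytes (printf reuses its format string for each argument, so each word gets space-padded to exactly 8 bytes; output shown under the command):

printf '%-8s' Hello World | dd cbs=8 conv=unblock,lcase 2>/dev/null

hello
world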

Solution 3

Here are a few approaches:

  • GNU grep and tr: find all words and make them lower case

    grep -Po '\w+' file | tr '[A-Z]' '[a-z]'
    
  • GNU grep and perl: as above but perl handles the conversion to lower case

    grep -Po '\w+' file | perl -lne 'print lc()'
    
  • perl: find all alphabetic characters and print them in lower case (thanks @steeldriver):

    perl -lne 'print lc for /[a-z]+/ig' file
    
  • sed: remove all characters that are not alphabetic or spaces, convert all alphabetic characters to lower case, and replace runs of spaces with newlines. Note that this assumes all whitespace is spaces, not tabs.

    sed 's/[^a-zA-Z ]\+//g;s/[a-zA-Z]\+/\L&/g; s/ \+/\n/g' file
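  • GNU awk: a sketch in the same spirit (GNU awk only, since it relies on a regex record separator): every run of non-alphabetic characters ends a record, so each record is a single word, printed in lower case (the NF guard skips the empty record that a leading delimiter would create)

    awk -v RS='[^[:alpha:]]+' 'NF{print tolower($0)}' file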
    
Comments

  • tlehman
    tlehman over 1 year

    I have a bunch of txt files and I'd like to output them lower-cased, alphabetic characters only, one word per line. I can do it with several tr commands in a pipeline like this:

    tr -d '[:punct:]' <doyle_sherlock_holmes.txt | tr '[:upper:]' '[:lower:]' | tr ' ' '\n'
    

    Is it possible to do this in one scan? I could write a C program to do this, but I feel like there's a way to do it using tr, sed, awk or perl.

    • terdon
      terdon over 9 years
      What OS are you using? Do you have access to the GNU tools?
  • smw
    smw over 9 years
    Would something like perl -lne 'print lc for /[[:alpha:]]+/g' also work? or is it poor style? (I'm new to perl and trying to learn!)
  • terdon
    terdon over 9 years
    @steeldriver yes it would, nice one! If you're learning Perl, I'm sure you've come across its motto: TMTOWTDI :) Thanks, I'll add that one.
  • Costas
    Costas over 9 years
    I am not sure re combining tr -s '[:upper:] [:punct:]' '[:lower:]\n' <doyle_sherlock_holmes.txt
  • Costas
    Costas over 9 years
    With new version (> 4.2.1) sed -z 's/\W*\(\w\+\)\W*/\L\1\n/g'
  • terdon
    terdon over 9 years
    @Costas ah, sed can do \w now? Cool!
  • tlehman
    tlehman over 9 years
    +2 bonus points for getting dd involved :)
  • mikeserv
    mikeserv over 9 years
    @TobiLehman - I'm very pleased you approve.
  • mikeserv
    mikeserv over 9 years
    @Costas - while the newline thing might be acceptable here, I don't think squeezing the uppercase chars would be. For example: printf 'A.AAAA,A' | tr -s '[:upper:] [:punct:]' '[:lower:][\n*]' gets a\na\na, and the transformation for ... '[:lower:]\n' might not necessarily do anything at all to '[:punct:]' anyway - some trs will truncate set1 to match set2 and some will do an implied [\n*]. It's better just to use the range there.
  • mikeserv
    mikeserv over 9 years
    @terdon - it's done that for a while, but, because Costas didn't mention it, I think the most interesting thing about the above comment is GNU sed's -z (zero-delimited) switch - it cycles over NULs (\0) rather than newlines. Pretty cool when you do something like tar -c . | tr -s \\0 | sed -z ... - but kinda slow.