Sort with unequal whitespace in first column

11,385

Solution 1

As others have commented, it will make it easier to work with the data if it is comma separated values (CSV).

Here is my solution for converting the data to CSV:

$ cat file | sed 's/ \([0-9]\)/,\1/g' 
Kuala Lumpur,78,56
Seoul,85,66
Karachi,95,75
Tokyo,85,60
Lahore,85,75
Manila,90,85

It replaces any space preceding a digit with a comma. \1 references the group ([0-9]), the digit after the space. From there you can use sort with the -t argument to specify a field separator.

$ cat file | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2  
Kuala Lumpur,78,56
Tokyo,85,60
Seoul,85,66
Lahore,85,75
Manila,90,85
Karachi,95,75

If you'd like to convert back to spaces or make a table, here are two examples:

$ cat test | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2 | tr , ' '
Kuala Lumpur 78 56
Tokyo 85 60
Seoul 85 66
Lahore 85 75
Manila 90 85
Karachi 95 75

$ cat test | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2 | column -s, -t 
Kuala Lumpur  78  56
Tokyo         85  60
Seoul         85  66
Lahore        85  75
Manila        90  85
Karachi       95  75

Solution 2

awk '{print $NF,$0}' file.txt | sort -nr -k1 | cut -d' ' -f2-
  • $NF: number of fields, $0: whole line
  • sort -nr: numeric reverse (descending)
  • sort -k1: sort by first column (delimited by sequences of spaces and tabs)
  • cut -d: delimiter (defaults to tab)
  • cut -f2-: fields 2 to last (doesn't collapse or strip delimiters)
ruby -e 'puts readlines.sort_by{|l|l.split[-1].to_i}.reverse' file.txt
  • readlines = ARGF.readlines
  • split splits on whitespace by default

Solution 3

If you can change your file to have tab separated columns, your life will be easier. If changing the file is not an option, this Perl one-liner will do it for you:

perl -ne 's/\s+/\t/g; s/([a-z])\s([a-z])/$1 $2/ig; s/\t$/\n/; print;' file |
  sort -t$'\t' -nk3
Kuala Lumpur    78  56  
Tokyo   85  60  
Seoul   85  66  
Karachi 95  75  
Lahore  85  75  
Manila  90  85  

EXPLANATION:

  • s/\s+/\t/g : change ALL spaces to TABs.
  • s/([a-z])\s([a-z])/$1 $2/ig : change TABs that are between two letters (no numbers) back to single spaces.
  • s/\t$/\n/ : the first substitution introduces a TAB at the end of each line, change that back to a newline character (\n).

  • sort -t$'\t' -nk3 : use TAB as a delimiter (sort has a strange syntax there, I know, see here for more information) and sort numerically on the third column.

Share:
11,385

Related videos on Youtube

healix
Author by

healix

moving from c++ into Java and it ain't pretty

Updated on September 18, 2022

Comments

  • healix
    healix almost 2 years

    I need to sort this list by name, high temp and low temp:

    Kuala Lumpur 78 56
    Seoul 85 66
    Karachi 95 75
    Tokyo 85 60
    Lahore 85 75
    Manila 90 85
    

    I figured since whitespace is the delimiter for a column I could just sort -k 1 which gives me this:

    Karachi         95 75
    Kuala Lumpur    78 56
    Lahore          85 75
    Manila          90 85
    Seoul           85 66
    Tokyo           85 60
    

    But the "Kuala Lumpur" is causing problems because of the space.

    So I tried to treat "Lumpur" as a column and to sort the first set of nums I did

    sort -k 3n
    

    but I get this:

    Tokyo           85 60
    Seoul           85 66
    Karachi         95 75
    Lahore          85 75
    Kuala Lumpur    78 56 <---Why is this out of order?
    Manila          90 85
    

    How do I deal with this one space?

    • Bernhard
      Bernhard about 11 years
      How is the file generated? You should use a different separator.
    • healix
      healix about 11 years
      the file was generated with cat. What different separator? I was thinking of using tr and removing the space
    • Bernhard
      Bernhard about 11 years
      A comma , would be a decent separator here. I don't see how cat would generate this file. There is something behind, right?
    • terdon
      terdon about 11 years
      Could you post the desired output? Sorting alphabetically seems to work in your example. The only problem seems to be sorting by temp, is that correct?
    • healix
      healix about 11 years
      @terdon Correct. When the temp is sorted the "Kuala Lumpur" doesn't sort properly.
    • healix
      healix about 11 years
      @Bernhard I didn't want to use a comma. I tried tabs after the names for the nums but didn't help much.
    • terdon
      terdon about 11 years
      So changing the delimiter is an option? In that case, you should be fine. Just add a tab between the name and the numbers and between each temperature.
    • healix
      healix about 11 years
      @terdon I'm having a problem with that. When I do sort -t' '... and put a tab in with CTR-V then TAB I don't think its reading in the tab. I tried <TAB> as well
    • terdon
      terdon about 11 years
      See my answer for how to use sort with tabs.
  • chris
    chris over 2 years
    I like that this answer addresses the OP without merely saying that if the OP had a different format, the problem would be easier. However, this did not work when I tried it with /proc/filesystems, but the answer from @Lri did. /proc/filesystems may be a special case because the first field is either present or missing, but it fits the pattern of trying to sort on a field that appears after a variable amount of whitespace characters. As an example mistake, the output has <whitespace> btrfs as the first entry, but nodev autofs is further down the list.
  • chris
    chris over 2 years
    I really like this solution. I was going to use awk, but not in such an elegant way. Thank you from 8.5 years in the future.
  • terdon
    terdon over 2 years
    @chris that's a completely different situation since you have empty fields, so sort cannot find a "second" field to sort on. You might want to ask a new question, but I think this should do what you probably want: sed 's/^\t/XXX\t/' /proc/filesystems | sort -k2,2 | sed 's/XXX//'.
  • chris
    chris over 2 years
    Thanks, terdon. That's a fair criticism. However, I view them as the same class of problem, because "Kuala Lumpur" essentially has a field in it ("Lumpur") that the other rows do not. Also, thanks for the sed suggestion. I wound up using an answer based on @Lri's post. However, I like this one also. I'm going to keep both of them in my script for now (one in comments, both with attribution :) ) while I think about which one I like better for the general case of sorting based on a missing nth column. Yours may prove more straightforward for that.