Sort with unequal whitespace in first column

linux bash sort

11,385

Solution 1

As others have commented, it will make it easier to work with the data if it is comma separated values (CSV).

Here is my solution for converting the data to CSV:

$ cat file | sed 's/ \([0-9]\)/,\1/g' 
Kuala Lumpur,78,56
Seoul,85,66
Karachi,95,75
Tokyo,85,60
Lahore,85,75
Manila,90,85

It replaces any space preceding a digit with a comma. \1 references the group ([0-9]), the digit after the space. From there you can use sort with the -t argument to specify a field separator.

$ cat file | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2  
Kuala Lumpur,78,56
Tokyo,85,60
Seoul,85,66
Lahore,85,75
Manila,90,85
Karachi,95,75

If you'd like to convert back to spaces or make a table, here are two examples:

$ cat test | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2 | tr , ' '
Kuala Lumpur 78 56
Tokyo 85 60
Seoul 85 66
Lahore 85 75
Manila 90 85
Karachi 95 75

$ cat test | sed 's/ \([0-9]\)/,\1/g' | sort -t, -k2 | column -s, -t 
Kuala Lumpur  78  56
Tokyo         85  60
Seoul         85  66
Lahore        85  75
Manila        90  85
Karachi       95  75

Solution 2

awk '{print $NF,$0}' file.txt | sort -nr -k1 | cut -d' ' -f2-

$NF: number of fields, $0: whole line
sort -nr: numeric reverse (descending)
sort -k1: sort by first column (delimited by sequences of spaces and tabs)
cut -d: delimiter (defaults to tab)
cut -f2-: fields 2 to last (doesn't collapse or strip delimiters)

ruby -e 'puts readlines.sort_by{|l|l.split[-1].to_i}.reverse' file.txt

readlines = ARGF.readlines
split splits on whitespace by default

Solution 3

If you can change your file to have tab separated columns, your life will be easier. If changing the file is not an option, this Perl one-liner will do it for you:

perl -ne 's/\s+/\t/g; s/([a-z])\s([a-z])/$1 $2/ig; s/\t$/\n/; print;' file |
  sort -t$'\t' -nk3
Kuala Lumpur    78  56  
Tokyo   85  60  
Seoul   85  66  
Karachi 95  75  
Lahore  85  75  
Manila  90  85

EXPLANATION:

s/\s+/\t/g : change ALL spaces to TABs.
s/([a-z])\s([a-z])/$1 $2/ig : change TABs that are between two letters (no numbers) back to single spaces.
s/\t$/\n/ : the first substitution introduces a TAB at the end of each line, change that back to a newline character (\n).
sort -t$'\t' -nk3 : use TAB as a delimiter (sort has a strange syntax there, I know, see here for more information) and sort numerically on the third column.

11,385

healix

moving from c++ into Java and it ain't pretty

Updated on September 18, 2022

Comments

healix almost 2 years
I need to sort this list by name, high temp and low temp:
```
Kuala Lumpur 78 56
Seoul 85 66
Karachi 95 75
Tokyo 85 60
Lahore 85 75
Manila 90 85
```
I figured since whitespace is the delimiter for a column I could just sort -k 1 which gives me this:
```
Karachi         95 75
Kuala Lumpur    78 56
Lahore          85 75
Manila          90 85
Seoul           85 66
Tokyo           85 60
```
But the "Kuala Lumpur" is causing problems because of the space.

So I tried to treat "Lumpur" as a column and to sort the first set of nums I did
```
sort -k 3n
```
but I get this:
```
Tokyo           85 60
Seoul           85 66
Karachi         95 75
Lahore          85 75
Kuala Lumpur    78 56 <---Why is this out of order?
Manila          90 85
```
How do I deal with this one space?
- Bernhard about 11 years
  
  How is the file generated? You should use a different separator.
- healix about 11 years
  
  the file was generated with cat. What different separator? I was thinking of using tr and removing the space
- Bernhard about 11 years
  
  A comma , would be a decent separator here. I don't see how cat would generate this file. There is something behind, right?
- terdon about 11 years
  
  Could you post the desired output? Sorting alphabetically seems to work in your example. The only problem seems to be sorting by temp, is that correct?
- healix about 11 years
  
  @terdon Correct. When the temp is sorted the "Kuala Lumpur" doesn't sort properly.
- healix about 11 years
  
  @Bernhard I didn't want to use a comma. I tried tabs after the names for the nums but didn't help much.
- terdon about 11 years
  
  So changing the delimiter is an option? In that case, you should be fine. Just add a tab between the name and the numbers and between each temperature.
- healix about 11 years
  
  @terdon I'm having a problem with that. When I do sort -t' '... and put a tab in with CTR-V then TAB I don't think its reading in the tab. I tried <TAB> as well
- terdon about 11 years
  
  See my answer for how to use sort with tabs.
chris over 2 years

I like that this answer addresses the OP without merely saying that if the OP had a different format, the problem would be easier. However, this did not work when I tried it with /proc/filesystems, but the answer from @Lri did. /proc/filesystems may be a special case because the first field is either present or missing, but it fits the pattern of trying to sort on a field that appears after a variable amount of whitespace characters. As an example mistake, the output has <whitespace> btrfs as the first entry, but nodev autofs is further down the list.
chris over 2 years

I really like this solution. I was going to use awk, but not in such an elegant way. Thank you from 8.5 years in the future.
terdon over 2 years

@chris that's a completely different situation since you have empty fields, so sort cannot find a "second" field to sort on. You might want to ask a new question, but I think this should do what you probably want: sed 's/^\t/XXX\t/' /proc/filesystems | sort -k2,2 | sed 's/XXX//'.
chris over 2 years

Thanks, terdon. That's a fair criticism. However, I view them as the same class of problem, because "Kuala Lumpur" essentially has a field in it ("Lumpur") that the other rows do not. Also, thanks for the sed suggestion. I wound up using an answer based on @Lri's post. However, I like this one also. I'm going to keep both of them in my script for now (one in comments, both with attribution :) ) while I think about which one I like better for the general case of sorting based on a missing nth column. Yours may prove more straightforward for that.