Text file look-up by column


Solution 1

It's not like I haven't tried before asking; here is my attempt, but it looks way too complicated to me. Disregard the logic that handles dirty files gracefully: it was not part of the question, and it is not the focus of the text look-up anyway. It just so happens that the files I have sometimes do not start with "HEADER" but with some garbage, with all the rest of the data being absolutely fine, always.

#!/bin/bash

file_to_scan="${1}"
name_to_lookup="${2}"

ASSUME_FIRST_LINE_IS_HEADER="false" # Sometimes input files begin with spurious lines

# The header row: "[#]" followed by the three column titles
FILE_HEADER_REGEX='^\[#\][[:blank:]]+OWNER_NAME[[:blank:]]+NAME[[:blank:]]+SIZE[[:space:]]*$'

# The leading space makes sure we find the standalone column titles,
# not e.g. the NAME inside OWNER_NAME
FIELD_HEADER_NAME=' NAME'
FIELD_HEADER_SIZE=' SIZE'

if [ "$ASSUME_FIRST_LINE_IS_HEADER" == "true" ]; then
    header_line=$(head -n 1 "${file_to_scan}")
else
    header_line="$(
        grep \
            --colour=never \
            --extended-regexp \
            "${FILE_HEADER_REGEX}" \
            "${file_to_scan}"
        )"
fi

colstartend=($(
    printf '%s' "${header_line}" \
        | \
        awk \
            -v name="${FIELD_HEADER_NAME}" \
            -v size="${FIELD_HEADER_SIZE}" \
            '{
                 print index($0, name)+1;  # first column of the NAME field
                 print index($0, size);    # position of the blank before SIZE (field end)
             }'
))

sed -E "1,/${FILE_HEADER_REGEX}/d" "${file_to_scan}" \
    | \
    awk \
        -v name_to_lookup="${name_to_lookup}" \
        -v colstart="${colstartend[0]}" \
        -v offset="$(( ${colstartend[1]} - ${colstartend[0]} ))" \
        '{
             name_field = substr($0, colstart, offset);
             sub(/ *$/, "", name_field);
             if (name_field == name_to_lookup) {
               print substr($1, 2, length($1)-2)
             }
         }'
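
Saved as, say, lookup.sh (a name picked here purely for illustration) and run against the sample file, it behaves like this:

 $ ./lookup.sh file.txt "Ideas worth zero"
 42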

Solution 2

If the field widths are constant - i.e. the file format you've shown has each field at its maximum width - you can use GNU awk (gawk(1)) and set the FIELDWIDTHS variable to use fixed-width parsing:

gawk -v searchstr="Ideas worth zero" -- '
    BEGIN { FIELDWIDTHS="6 15 27 5" }  # assuming the final field width is 5
    # Pre-process data
    {
        gsub(/[^[:digit:]]/, "", $1)  # strip out non-numbers
        for (i = 2; i <= NF; i++)
            gsub(/[[:space:]]*$/, "", $i)  # strip trailing whitespace
    }
    # match here
    $3 == searchstr { print $1 }
' file.txt

You can wrap that in a shell script or a function and parameterise searchstr (-v searchstr="$1").
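
For instance, a minimal sketch of such a wrapper, with my_command and file.txt as placeholder names:

my_command() {
    gawk -v searchstr="$1" -- '
        BEGIN { FIELDWIDTHS = "6 15 27 5" }  # fixed widths as above
        {
            gsub(/[^[:digit:]]/, "", $1)  # strip out non-numbers
            for (i = 2; i <= NF; i++)
                gsub(/[[:space:]]*$/, "", $i)  # strip trailing whitespace
        }
        $3 == searchstr { print $1 }
    ' file.txt
}

my_command "Ideas worth zero"   # prints 42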

However, if the fields are of variable width - i.e. if the data changes, the width of the fields may change - you'll need to be a little more clever and dynamically determine the field widths from inspecting the first line. Given that one field is called OWNER_NAME, using an underscore, I'm assuming that spaces are not present in field names, so I can assume that whitespace separates the field names.

With that defined, you can replace the BEGIN... line with this code:

NR == 1 {
    for (i = 2; i <= NF; i++)
        FIELDWIDTHS=FIELDWIDTHS index($0" ", " "$i" ")-index($0" ", " "$(i-1)" ") " "
    FIELDWIDTHS=FIELDWIDTHS "5"  # assuming 5 is the width of the last field
    next
}

That will look at the fields on the first line and derive each width from the difference between the positions of consecutive field names, for the second field through the last. I've assumed the width of the last field is 5, but I think you can just put a big number there and it will work with what's left over.

We need to look for a space before and after the name to ensure we do not find NAME inside OWNER_NAME (or, had there been a field called OWNER, OWNER inside OWNER_NAME), and instead match the whole field (we also need to append a space to $0 to ensure we can match a trailing space even if there were none there).

You could get fancier so that you can query by field name instead of matching only on $3, but I'll leave that to you.
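
Putting the two pieces together, the full dynamic-width version would look roughly like this (the last-field width of 5 is still an assumption):

gawk -v searchstr="Ideas worth zero" -- '
    # Derive FIELDWIDTHS from the header line, then skip it
    NR == 1 {
        for (i = 2; i <= NF; i++)
            FIELDWIDTHS=FIELDWIDTHS index($0" ", " "$i" ")-index($0" ", " "$(i-1)" ") " "
        FIELDWIDTHS=FIELDWIDTHS "5"
        next
    }
    # Pre-process data
    {
        gsub(/[^[:digit:]]/, "", $1)  # strip out non-numbers
        for (i = 2; i <= NF; i++)
            gsub(/[[:space:]]*$/, "", $i)  # strip trailing whitespace
    }
    # match here
    $3 == searchstr { print $1 }
' file.txt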

Solution 3

OK, if the lengths of the columns are not known in advance, I'd switch to a more powerful language than bash:

#!/usr/bin/perl
use warnings;
use strict;

my $string = shift;                                    # the NAME to look up
open my $FH, '<', '1.txt' or die $!;
my $first_line = <$FH>;
# Everything before the standalone NAME title gives the starting column;
# NAME plus its trailing padding gives the field width.
my ($before, $name) = $first_line =~ /(.* )(NAME *)/;
my $column = length $before;
$string .= ' ' x (length($name) - length $string);     # adjust the length of $string
while (<$FH>) {
    if ($column == index $_, $string, $column) {       # exact match at the NAME column
        /^\[([0-9]+)\]/ and print "$1\n";              # print the number from "[42]"
    }
}
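
With the data in 1.txt as the script expects, and the script saved as lookup.pl (an illustrative name), the call would be:

 $ ./lookup.pl "Ideas worth zero"
 42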

Solution 4

Probably the simplest is to filter the lines first by 'Ideas worth zero', then toss the lines containing '... or more':

grep 'Ideas worth zero' | grep -v 'Ideas worth zero or more'

And to get the number, pipe the output of that into:

cut -d' ' -f1 | tr -d ']['

Which cuts out the first field (delimited by a space) and removes the square brackets.

Best would be if you could slightly change the file format so that it comes with proper field delimiters.
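
Combined, and assuming the data is in a file named file.txt (a placeholder), the whole pipeline reads:

grep 'Ideas worth zero' file.txt | grep -v 'Ideas worth zero or more' | cut -d' ' -f1 | tr -d '[]'

As the comments below point out, every longer variant ('Ideas worth zero and a bit', ...) needs its own grep -v, so this approach does not scale well.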

Solution 5

 $ cat test
[#]     OWNER_NAME      NAME    SIZE
[6]     Robottinosino   Software        200
[42]    Robottinosino   Ideas worth zero        188
[12]    Robottinosino   Ideas worth zero or more        111
[13]    I am Batman     Hardware        180
[25]    Robottinosino   Profile Pictures        170

 $ cat test.sh
#!/bin/bash -
awk -F"\t" -v col="$1" -v val="$2" '
    NR == 1 { for (i = 1; i <= NF; i++) if (toupper(col) == toupper($i)) field = i; next }
    toupper($field) == toupper(val) { print $1 }'

 $ cat test | ./test.sh NAME "Ideas worth zero"
[42]

I'm not sure that the delimiter is a tab. But it's pretty easy to change it with sed. For example, sed 's/\s\s\+/\t/g' will do the job.

Also, you can specify any other field, not only NAME. It will find the right column number by itself.

In case you only ever need the third column, the script will be much simpler.

PS: I've used this in my own project, so it has a bit more functionality than you need.

Update: since the delimiter is not a tab, change the launch line to:

 cat test | sed 's/\s\s\+/\t/g' | ./test.sh NAME "Ideas worth zero"

It works perfectly on my machine.

Comments

  • JohnyMoraes
    JohnyMoraes over 1 year

    I have a file in this format:

    [#]   OWNER_NAME     NAME                       SIZE
    [6]   Robottinosino  Software                   200
    [42]  Robottinosino  Ideas worth zero           188
    [12]  Robottinosino  Ideas worth zero or more   111
    [13]  I am Batman    Hardware                   180
    [25]  Robottinosino  Profile Pictures           170
    

    and I would like to be able to do the following using command line tools:

    my_command "Ideas worth zero"
    

    and get this result:

    42
    

    and not risk getting this result:

    12
    

    I have thought of using grep to identify the line and awk to get the 1st field, but I am not sure how to reliably and efficiently match on the whole 'NAME' field, short of counting at which columns the texts 'OWNER_NAME' and 'SIZE' appear in the header and getting everything in between, with some whitespace trimming.

    Notice 'OWNER_NAME' could be more than one word: e.g. 'OWNER_NAME' = "I am Batman".

    Any ideas with accompanying implementation?

    What I have to go by here, is just the old family of cat, head, tail, awk, sed, grep, cut, etc.

    • camh
      camh almost 12 years
      Are the field widths variable (i.e. do fields get wider if there is more text in a field than last time)? Can you count the field widths once and it will always remain correct?
    • JohnyMoraes
      JohnyMoraes almost 12 years
      Field widths are variable for this kind of file but are constant within the file; that's why I thought I should base my "text cutting" on the HEADER row and just trim the whitespace... I am looking for an elegant and simple solution too.
  • JohnyMoraes
    JohnyMoraes almost 12 years
    Does this "grep -v" approach scale on occurrences like 'Ideas worth zero and a bit', 'Ideas worth zero or something like that', 'Ideas work zero comma fourtytwo', etc?
  • rush
    rush almost 12 years
    Yep. But only in case you specify them all =)
  • JohnyMoraes
    JohnyMoraes almost 12 years
    The first does not have tab ('\t') as a delimiter, unfortunately.
  • JohnyMoraes
    JohnyMoraes almost 12 years
    Unfortunately column widths are not known before running the script (i.e. they are automatically adjusted to fit the fields, which contain strings of variable lengths)
  • JohnyMoraes
    JohnyMoraes almost 12 years
    Nice Perl script, but unfortunately I don't know Perl and it would be difficult to maintain this...
  • rush
    rush almost 12 years
    I've updated the answer to fix delimiter issue.
  • choroba
    choroba almost 12 years
    @Robottinosino: It should be rather easy to rewrite it in python or any other scripting language that is easier to maintain for you.
  • JohnyMoraes
    JohnyMoraes almost 12 years
    Seems to me like you are doing a lot of work even for cases in which there is no match? For example, you are trimming whitespace on each field even on a non-matching line? You are cleaning up the first field ("[#]") a priori even if you only use it on a match? I very much liked the code that calculates FIELDWIDTHS and I learned some awk from it, but... aren't you again performing this calculation for all fields when the only one whose length you really need is the "NAME" one?
  • JohnyMoraes
    JohnyMoraes almost 12 years
    Python is golden for me, and I agree with you: probably better to use a fully-fledged scripting language here. What I have to go by here, though, is just the old family of cat, head, tail, awk, sed, grep, cut, etc.
  • JohnyMoraes
    JohnyMoraes almost 12 years
    If any field contains a double space, this would break, wouldn't it? E.g. owner name: "I am a spacious username". Just glancing at it, it does not seem like "just replacing the delimiter" is so trivial, unfortunately... correct me if I am wrong, I may be :)
  • JohnyMoraes
    JohnyMoraes almost 12 years
    I was thinking of using grep --byte-offset to get the starting and ending column numbers for the NAME field but bytes aren't characters and characters here may be Unicode and have more bytes per char... Hmmm...
  • rush
    rush almost 12 years
    Yes, unfortunately you're right. =)
  • camh
    camh almost 12 years
    @Robottinosino: I was making it a little more general than it needed to be so that you could easily match on different fields if you needed to. I find it makes it clearer to separate the cleansing code from the actual matching code (the business logic, so to speak). If you have a very large data set and the extra unnecessary processing is significant, by all means, optimise out the extra work.
  • Kevin
    Kevin almost 12 years
    FYI, -F can take a full regex, so you don't even have to use sed.