how to subset a file - select a number of rows or columns


Solution 1

Filtering rows is easy, for example with AWK:

cat largefile | awk 'NR >= 10000  && NR <= 100000 { print }'

Filtering columns is easier with cut, which splits on tabs by default:

cat largefile | cut -f 10000-100000

As Rahul Dravid mentioned, cat is not needed here, and as Zsolt Botykai added, you can improve performance by making awk exit as soon as it has passed the last wanted row:

awk 'NR > 100000 { exit } NR >= 10000 && NR <= 100000' largefile
cut -f 10000-100000 largefile
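
A scaled-down sketch of these two commands, assuming a small tab-delimited sample built with seq and paste. The file name sample.tsv and the ranges (rows 3 to 5, columns 2 to 3) are illustrative and not part of the question:

```shell
# Build a 10-row, 4-column tab-delimited sample (hypothetical stand-in for largefile).
tmpdir=$(mktemp -d)
seq 40 | paste - - - - > "$tmpdir/sample.tsv"

# Rows 3 to 5: awk exits once the range has been passed, so later rows are never read.
rows=$(awk 'NR > 5 { exit } NR >= 3' "$tmpdir/sample.tsv")

# Columns 2 to 3: cut uses a tab as its default delimiter, so no -d option is needed.
cols=$(cut -f 2-3 "$tmpdir/sample.tsv")

printf '%s\n' "$rows"
printf '%s\n' "$cols"
rm -r "$tmpdir"
```

For the real file, the same pattern applies with 10000 and 100000 in place of the small bounds.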

Solution 2

Some different solutions:

For row ranges, in sed:

sed -n 10000,100000p somefile.txt
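
On a file with millions of lines, sed can also be told to quit once the last wanted line has been printed, in the same spirit as the awk exit trick in Solution 1. A minimal sketch on a scaled-down range (lines 3 to 5 of a 20-line sample, chosen only for illustration):

```shell
# Create a small sample file: one number per line, 1 to 20.
tmpfile=$(mktemp)
seq 20 > "$tmpfile"

# Print lines 3 through 5, then quit at line 5 so the rest of the file is not read.
# The full-size equivalent would be: sed -n '10000,100000p;100000q' somefile.txt
picked=$(sed -n '3,5p;5q' "$tmpfile")

printf '%s\n' "$picked"
rm "$tmpfile"
```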

For column ranges in awk:

awk -F'\t' -v OFS='\t' -v f=10000 -v t=100000 '{ for (i=f; i<=t; i++) printf("%s%s", $i, (i==t) ? "\n" : OFS) }' details.txt
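
The loop can be tried on a small tab-delimited sample. This sketch assumes tab separators on both input and output (set through -F and OFS) and uses the illustrative column range 2 to 4; f and t are the same variables as in the command above:

```shell
# Create a 3-row, 5-column tab-delimited sample.
tmpfile=$(mktemp)
seq 15 | paste - - - - - > "$tmpfile"

# Print fields f through t, separated by OFS, ending each row with a newline.
subset=$(awk -F'\t' -v OFS='\t' -v f=2 -v t=4 \
  '{ for (i = f; i <= t; i++) printf("%s%s", $i, (i == t) ? "\n" : OFS) }' "$tmpfile")

printf '%s\n' "$subset"
rm "$tmpfile"
```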

Solution 3

For the first problem, selecting a set of rows from a large file, piping tail to head is very simple. Rows 10000 to 100000 inclusive make 90001 rows. tail -n +10000 outputs largefile from row 10000 onward, and head then keeps only the first 90001 of those rows.

tail -n +10000 largefile | head -n 90001
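
The count passed to head is last - first + 1, which is easy to get wrong by one. A small sketch that computes it from the two endpoints; the variable names first and last and the 20-line sample are illustrative:

```shell
# Create a small sample file: one number per line, 1 to 20.
tmpfile=$(mktemp)
seq 20 > "$tmpfile"

first=3
last=5
# An inclusive range first..last contains last - first + 1 lines.
count=$((last - first + 1))

# tail starts output at line $first; head keeps the next $count lines.
slice=$(tail -n +"$first" "$tmpfile" | head -n "$count")

printf '%s\n' "$slice"
rm "$tmpfile"
```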

Solution 4

I was beaten to the sed solution, so I'll post a Perl ditto instead. To print selected lines:

$ seq 100 | perl -ne 'print if $. >= 10 && $. <= 20' 
10
11
12
13
14
15
16
17
18
19
20

To print selected columns, use an array slice of @F, the array that -a fills with the fields of each line:

perl -lane 'print join "\t", @F[1..3]'

-F is used in conjunction with -a, to choose the delimiter on which to split lines.

To test, use seq and paste to generate some columns:

$ seq 50 | paste - - - - -
1   2   3   4   5
6   7   8   9   10
11  12  13  14  15
16  17  18  19  20
21  22  23  24  25
26  27  28  29  30
31  32  33  34  35
36  37  38  39  40
41  42  43  44  45
46  47  48  49  50

Let's print everything except the first and the last column:

$ seq 50 | paste - - - - - | perl -lane 'print join "   ", @F[1..3]'
2   3   4
7   8   9
12  13  14
17  18  19
22  23  24
27  28  29
32  33  34
37  38  39
42  43  44
47  48  49

The separator in the join statement above is a literal tab, which you can type in the shell with Ctrl-V followed by Tab. Writing "\t" inside the double quotes works as well.

Author: jianfeng.mao

Updated on August 01, 2022

Comments

  • jianfeng.mao, almost 2 years ago

    I would like to have your advice/help on how to subset a big file (millions of rows or lines).

    For example,

    (1) I have a big file (millions of rows, tab-delimited). I want a subset of this file with only rows 10000 to 100000.

    (2) I have a big file (millions of columns, tab-delimited). I want a subset of this file with only columns 10000 to 100000.

    I know there are tools like head, tail, cut, split, awk and sed, and I can use them for simple subsetting, but I do not know how to apply them to this job.

    Could you please give any advice? Thanks in advance.