How to read a file starting from a specific line number using Scanner?

Solution 1

If you don't want to read but just skip the lines you read previously, you need to acquire the position where you left off.

The different solutions are presented in the form of a function which takes the input to read from and the byte position to start reading lines from, e.g.:

func solution(input io.ReadSeeker, start int64) error

A special io.Reader input is used which also implements io.Seeker, the common interface which allows skipping data without having to read them. *os.File implements this, so you are allowed to pass a *File to these functions. Good. The "merged" interface of both io.Reader and io.Seeker is io.ReadSeeker.
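As a minimal sketch of what io.ReadSeeker allows, the program below seeks past the first line and reads the rest. The helper names readFrom and restOf are made up for illustration, and strings.NewReader stands in for an *os.File here.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// readFrom is a hypothetical helper: it seeks to byte offset start
// and returns everything from there to the end of the input.
func readFrom(input io.ReadSeeker, start int64) (string, error) {
	if _, err := input.Seek(start, io.SeekStart); err != nil {
		return "", err
	}
	rest, err := io.ReadAll(input)
	return string(rest), err
}

// restOf wraps readFrom for plain strings; *strings.Reader implements
// io.ReadSeeker just like *os.File does.
func restOf(content string, start int64) (string, error) {
	return readFrom(strings.NewReader(content), start)
}

func main() {
	// "first\n" is 6 bytes long, so offset 6 is where the second line starts.
	rest, err := restOf("first\nsecond\n", 6)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%q\n", rest) // "second\n"
}
```

The skipped bytes are never read by the program, which is the point of asking for an io.Seeker in addition to an io.Reader.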

If you want a clean start (to start reading from the beginning of the file), simply pass start = 0. If you want to resume a previous processing, pass the byte position where the last processing was stopped/aborted. This position is the value of the pos local variable in the functions (solutions) below.

All the examples below with their testing code can be found on the Go Playground.

1. With bufio.Scanner

bufio.Scanner does not maintain the position, but we can very easily extend it to maintain the position (the read bytes), so when we want to restart next, we can seek to this position.

In order to do this with minimal effort, we can use a new split function which splits the input into tokens (lines). We can use Scanner.Split() to set the split function (the logic that decides where the boundaries of tokens/lines are). The default split function is bufio.ScanLines().

Let's take a look at the split function declaration: bufio.SplitFunc

type SplitFunc func(data []byte, atEOF bool) (advance int, token []byte, err error)

It returns the number of bytes to advance: advance. Exactly what we need to maintain the file position. So we can create a new split function using the builtin bufio.ScanLines(), so we don't even have to implement its logic, just use the advance return value to maintain position:

func withScanner(input io.ReadSeeker, start int64) error {
    fmt.Println("--SCANNER, start:", start)
    if _, err := input.Seek(start, io.SeekStart); err != nil {
        return err
    }
    scanner := bufio.NewScanner(input)

    pos := start
    scanLines := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        advance, token, err = bufio.ScanLines(data, atEOF)
        pos += int64(advance)
        return
    }
    scanner.Split(scanLines)

    for scanner.Scan() {
        fmt.Printf("Pos: %d, Scanned: %s\n", pos, scanner.Text())
    }
    return scanner.Err()
}

2. With bufio.Reader

In this solution we use the bufio.Reader type instead of the Scanner. bufio.Reader already has a ReadBytes() method which is very similar to the "read a line" functionality if we pass the '\n' byte as the delimiter.

This solution is similar to JimB's, with the addition of handling all valid line terminator sequences and also stripping them off from the read line (it is very rare they are needed); in regular expression notation, it is \r?\n.

func withReader(input io.ReadSeeker, start int64) error {
    fmt.Println("--READER, start:", start)
    if _, err := input.Seek(start, io.SeekStart); err != nil {
        return err
    }

    r := bufio.NewReader(input)
    pos := start
    for {
        data, err := r.ReadBytes('\n')
        pos += int64(len(data))
        if err == nil || err == io.EOF {
            if len(data) > 0 && data[len(data)-1] == '\n' {
                data = data[:len(data)-1]
            }
            if len(data) > 0 && data[len(data)-1] == '\r' {
                data = data[:len(data)-1]
            }
            fmt.Printf("Pos: %d, Read: %s\n", pos, data)
        }
        if err != nil {
            if err != io.EOF {
                return err
            }
            break
        }
    }
    return nil
}

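Note: If the content ends with a line terminator, the final ReadBytes() call returns no data together with io.EOF, so this solution will process one extra empty line. If you don't want this, you can check the length before printing. Do the check on the data as returned by ReadBytes(), before the terminator is stripped, otherwise empty lines in the middle of the input are skipped too: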

if len(data) != 0 {
    fmt.Printf("Pos: %d, Read: %s\n", pos, data)
} else {
    // Last line is empty, omit it
}
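The following sketch puts that check into a complete loop. It is only an illustration, not part of the solutions above: the helper name linesOf is made up, and it collects the lines into a slice instead of printing them so the effect is easy to see.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// linesOf (hypothetical helper) splits content with ReadBytes, strips \r?\n,
// and skips only the empty read that arrives together with an error (io.EOF).
func linesOf(content string) []string {
	r := bufio.NewReader(strings.NewReader(content))
	var lines []string
	for {
		data, err := r.ReadBytes('\n')
		if err != nil && len(data) == 0 {
			break // nothing was read: the input ended right after a terminator
		}
		data = bytes.TrimSuffix(data, []byte("\n"))
		data = bytes.TrimSuffix(data, []byte("\r"))
		lines = append(lines, string(data))
		if err != nil {
			break // last line had no terminator
		}
	}
	return lines
}

func main() {
	// The empty line in the middle is kept, the empty read at the end is not.
	fmt.Printf("%q\n", linesOf("a\n\nb\n")) // ["a" "" "b"]
}
```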

Testing the solutions:

The testing code simply uses the content "first\r\nsecond\nthird\nfourth", which contains multiple lines with varying line terminators. We will use strings.NewReader() to obtain an io.ReadSeeker whose source is a string.

The test code first calls withScanner() and withReader() passing a start position of 0: a clean start. In the next round we pass start = 14, which is the byte position of the third line ("first\r\n" is 7 bytes and "second\n" is 7 bytes), so the first 2 lines are not processed (printed): this simulates a resume.

func main() {
    const content = "first\r\nsecond\nthird\nfourth"

    if err := withScanner(strings.NewReader(content), 0); err != nil {
        fmt.Println("Scanner error:", err)
    }
    if err := withReader(strings.NewReader(content), 0); err != nil {
        fmt.Println("Reader error:", err)
    }

    if err := withScanner(strings.NewReader(content), 14); err != nil {
        fmt.Println("Scanner error:", err)
    }
    if err := withReader(strings.NewReader(content), 14); err != nil {
        fmt.Println("Reader error:", err)
    }
}

Output:

--SCANNER, start: 0
Pos: 7, Scanned: first
Pos: 14, Scanned: second
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 0
Pos: 7, Read: first
Pos: 14, Read: second
Pos: 20, Read: third
Pos: 26, Read: fourth
--SCANNER, start: 14
Pos: 20, Scanned: third
Pos: 26, Scanned: fourth
--READER, start: 14
Pos: 20, Read: third
Pos: 26, Read: fourth

Try the solutions and testing code on the Go Playground.

Solution 2

Instead of using a Scanner, use a bufio.Reader, specifically the ReadBytes or ReadString methods. This way you can read up to each line termination, and still receive the full line with line endings.

r := bufio.NewReader(inputFile)

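var (
    line []byte
    err  error // declared outside the loop so it can be checked afterwards
)
fPos := int64(0) // or saved position; int64 so it can be passed to Seek

for i := 1; ; i++ {
    line, err = r.ReadBytes('\n')
    fmt.Printf("[line:%d pos:%d] %q\n", i, fPos, line)

    if err != nil {
        break
    }
    // line still contains its terminator, so len(line) is the exact byte count
    fPos += int64(len(line))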
}

if err != io.EOF {
    log.Fatal(err)
}

You can store the combination of file position and line number however you choose, and the next time you start, you use inputFile.Seek(fPos, io.SeekStart) to move to where you left off. (The older os.SEEK_SET constant is deprecated in favor of io.SeekStart.)
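One possible way to store the position is a small text file holding the offset as a decimal number. The sketch below is only an assumption about how that could look: the function names saveOffset, loadOffset and roundTrip and the file name progress.txt are made up for illustration. Something along these lines could back the SaveProgress() and GetCounter() stubs from the question.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// saveOffset (hypothetical) writes the byte offset as decimal text to path.
func saveOffset(path string, offset int64) error {
	return os.WriteFile(path, []byte(strconv.FormatInt(offset, 10)), 0o644)
}

// loadOffset (hypothetical) reads the offset back.
// A missing progress file is treated as a clean start (offset 0).
func loadOffset(path string) (int64, error) {
	b, err := os.ReadFile(path)
	if errors.Is(err, os.ErrNotExist) {
		return 0, nil
	}
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(b)), 10, 64)
}

// roundTrip saves and reloads an offset in a temporary directory,
// only to demonstrate the two helpers together.
func roundTrip(offset int64) (int64, error) {
	dir, err := os.MkdirTemp("", "progress")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "progress.txt") // assumed file name
	if err := saveOffset(path, offset); err != nil {
		return 0, err
	}
	return loadOffset(path)
}

func main() {
	got, err := roundTrip(14)
	fmt.Println(got, err) // 14 <nil>
}
```

The value returned by loadOffset is an int64, so it can be passed directly as the first argument of Seek.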

Solution 3

If you want to use Scanner, you have to go through the beginning of the file until you have passed GetCounter() line terminators.

scanner := bufio.NewScanner(inputFile)
// context line above

// skip first GetCounter() lines
for i := 0; i < GetCounter(); i++ {
    scanner.Scan()
}

// context line below
for scanner.Scan() {
    fmt.Println(scanner.Text())
}

Alternatively, you could store a byte offset instead of a line number in the counter. Remember, though, that Scanner strips the line terminator, and for lines the terminator is \r?\n (regexp notation), so it isn't clear whether you should add 1 or 2 to the token length:

// Not clear how to store offset unless custom SplitFunc provided
inputFile.Seek(GetCounter(), 0)
scanner := bufio.NewScanner(inputFile)

So it is better to use the previous solution, or not to use Scanner at all.
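To make the ambiguity concrete, here is a small sketch (the helper name tokenLengths is made up for illustration). It compares the real length of the input with the sum of the token lengths a default Scanner reports; the difference is the stripped terminators, and the tokens alone don't tell you whether each one was 1 or 2 bytes.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// tokenLengths (hypothetical helper) scans content with the default line
// splitter and returns the sum of the token lengths, that is, the number
// of bytes seen without any line terminators.
func tokenLengths(content string) int {
	total := 0
	s := bufio.NewScanner(strings.NewReader(content))
	for s.Scan() {
		total += len(s.Bytes())
	}
	return total
}

func main() {
	content := "first\r\nsecond\n"
	fmt.Println(len(content))          // 14
	fmt.Println(tokenLengths(content)) // 11: "\r\n" and "\n" were stripped
}
```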

Solution 4

There are a lot of words in the other answers, and they're not really reusable code, so here's a reusable function that seeks to the given line number and returns the line together with the offset where it starts. play.golang

func SeekToLine(r io.Reader, lineNo int) (line []byte, offset int, err error) {
    s := bufio.NewScanner(r)

    var pos int

    s.Split(func(data []byte, atEof bool) (advance int, token []byte, err error) {
        advance, token, err = bufio.ScanLines(data, atEof)
        pos += advance
        return advance, token, err
    })

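    for i := 0; i < lineNo; i++ {
        offset = pos // start of the line about to be scanned

        if !s.Scan() {
            // Report a real read error if there was one, otherwise EOF.
            if err := s.Err(); err != nil {
                return nil, 0, err
            }
            return nil, 0, io.EOF
        }
    }

    // Return the start offset of the line, not pos (which is already past it).
    return s.Bytes(), offset, nil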
}
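A short usage sketch for such a function follows. To keep it self-contained it carries its own copy, named seekToLine here, which returns the offset where the line starts, as the description says; the wrapper lineAt is made up for illustration. Line numbers are 1-based.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// seekToLine returns line number lineNo (1-based) and the byte offset
// at which that line starts.
func seekToLine(r io.Reader, lineNo int) (line []byte, offset int, err error) {
	s := bufio.NewScanner(r)
	pos := 0
	s.Split(func(data []byte, atEOF bool) (int, []byte, error) {
		advance, token, err := bufio.ScanLines(data, atEOF)
		pos += advance // bytes consumed so far, terminators included
		return advance, token, err
	})
	for i := 0; i < lineNo; i++ {
		offset = pos
		if !s.Scan() {
			if err := s.Err(); err != nil {
				return nil, 0, err
			}
			return nil, 0, io.EOF
		}
	}
	return s.Bytes(), offset, nil
}

// lineAt (hypothetical) is a convenience wrapper for string input.
func lineAt(content string, lineNo int) (string, int, error) {
	line, offset, err := seekToLine(strings.NewReader(content), lineNo)
	return string(line), offset, err
}

func main() {
	line, offset, err := lineAt("first\r\nsecond\nthird\nfourth", 3)
	fmt.Printf("%q %d %v\n", line, offset, err) // "third" 14 <nil>
}
```

The offset 14 matches the start position used in the test of Solution 1, so it can be stored and later passed to Seek.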
Amyth
Author by

Amyth

Updated on September 15, 2022

Comments

  • Amyth
    Amyth over 1 year

    I am new to Go and I am trying to write a simple script that reads a file line by line. I also want to save the progress (i.e. the last line number that was read) on the filesystem somewhere so that if the same file was given as the input to the script again, it starts reading the file from the line where it left off. Following is what I have started off with.

    package main
    
    // Package Imports
    import (
        "bufio"
        "flag"
        "fmt"
        "log"
        "os"
    )
    
    // Variable Declaration
    var (
        ConfigFile = flag.String("configfile", "../config.json", "Path to json configuration file.")
    )
    
    // The main function that reads the file and parses the log entries
    func main() {
        flag.Parse()
        settings := NewConfig(*ConfigFile)
    
        inputFile, err := os.Open(settings.Source)
        if err != nil {
            log.Fatal(err)
        }
        defer inputFile.Close()
    
        scanner := bufio.NewScanner(inputFile)
        for scanner.Scan() {
            fmt.Println(scanner.Text())
        }
    
        if err := scanner.Err(); err != nil {
            log.Fatal(err)
        }
    }
    
    // Saves the current progress
    func SaveProgress() {
    
    }
    
    // Get the line count from the progress to make sure
    func GetCounter() {
    
    }
    

    I could not find any methods that deal with line numbers in the bufio package. I know I can declare an integer, say counter := 0, and increment it each time a line is read, like counter++. But the next time, how do I tell the scanner to start from a specific line? So for example, if I read up to line 30, the next time I run the script with the same input file, how can I make the scanner start reading from line 31?

    Update

    One solution I can think of here is to use the counter as I stated above and use an if condition like the following.

        scanner := bufio.NewScanner(inputFile)
        for scanner.Scan() {
            if counter > progress {
                fmt.Println(scanner.Text())
            }
        }
    

    I am pretty sure something like this would work, but it is still going to loop over the lines that we have already read. Please suggest a better way.

    • JimB
      JimB over 8 years
      There's no metadata in a file to indicate where "line 30" is. unless you store the byte offset somewhere, you need to read from the start every time.
  • Amyth
    Amyth over 8 years
    Amazing, Thanks. I am gonna try this and accept your answer.
  • JimB
    JimB over 8 years
    @Amyth: it just occurred to me that I quickly wrote that example from a combination of 2 other pieces, and you don't really need the Seek call to get the position when using the bufio.Reader, just add up the bytes read (this will save a lot of syscalls too)
  • Amyth
    Amyth over 8 years
    Amazing answer and explanation. Thanks @icza
  • 0xcaff
    0xcaff about 7 years
    Your regexp for multiple line ending types is wrong. \r\n, \r and \n are all valid line endings. The correct regexp would be /(\r\n|\r|\n)/
  • icza
    icza about 7 years
    @caffinatedmonkey bufio.ScanLines() also mentions / uses \r?\n.
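
    The behavior discussed in the last two comments can be checked with a small sketch (the helper name scanLines is made up for illustration): with the default bufio.ScanLines splitter, a lone \r does not end a line, only \n does, and a \r directly before the \n is dropped.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanLines (hypothetical helper) returns the tokens produced by a Scanner
// that uses the default split function, bufio.ScanLines.
func scanLines(content string) []string {
	var out []string
	s := bufio.NewScanner(strings.NewReader(content))
	for s.Scan() {
		out = append(out, s.Text())
	}
	return out
}

func main() {
	// The "\r" between b and c stays inside the token.
	fmt.Printf("%q\n", scanLines("a\r\nb\rc\nd")) // ["a" "b\rc" "d"]
}
```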