How to write a Generic Log Parser


Solution 1

I ended up using Logstash instead of writing my own.

Solution 2

AWStats is a great log parser, open source, and you can do whatever you want with the resulting database that it generates.

Solution 3

You can use a Scanner, for example, together with some regexes. Here is a snippet of what I did to parse some complex logs:

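import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// DATE_PATTERN and EventLog are defined elsewhere in the original class (not shown here).
private static final Pattern LINE_PATTERN = Pattern.compile(
  "(\\S+:)?(\\S+? \\S+?) \\S+? DEBUG \\S+? - DEMANDE_ID=(\\d+?) - listener (\\S+?) : (\\S+?)");

public static EventLog parse(String line) throws ParseException {
    String demandeId;
    String listenerClass;
    long startTime;
    long endTime;

    // SimpleDateFormat is not thread-safe, so a fresh instance is created on each call.
    SimpleDateFormat sdf = new SimpleDateFormat(DATE_PATTERN);
    Matcher matcher = LINE_PATTERN.matcher(line);
    if (matcher.matches()) {
        // groupCount() counts the groups declared in the pattern (5 here), so offset is always 1.
        // It skips group 1, the optional prefix, which is null when the prefix is absent.
        int offset = matcher.groupCount() - 4; // 4 interesting groups, the first is optional
        demandeId = matcher.group(2 + offset);
        listenerClass = matcher.group(3 + offset);
        long time = sdf.parse(matcher.group(1 + offset)).getTime();
        // A line is either a "starting" event or an end event; the unknown time is set to -1.
        if ("starting".equals(matcher.group(4 + offset))) {
            startTime = time;
            endTime = -1;
        } else {
            startTime = -1;
            endTime = time;
        }
        return new EventLog(demandeId, listenerClass, startTime, endTime);
    }
    // The line does not have the expected format.
    return null;
}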

So, with regexes and groups, it works pretty well.
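To get closer to the format strings from the question (such as [%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE), the same regex-and-groups idea can be generated from the format itself. The following is only a minimal sketch: the class name LogFormat, the rule that MESSAGE takes the rest of the line, and the rule that every field other than TIMESTAMP and MESSAGE is a single whitespace-free token are all assumptions made for illustration, not part of any library.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: compiles a format like "[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE" into a regex. */
final class LogFormat {
    private static final Pattern PLACEHOLDER = Pattern.compile("%([A-Z]+)");

    private final Pattern linePattern;
    private final List<String> fields = new ArrayList<>();

    LogFormat(String format) {
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(format);
        int last = 0;
        while (m.find()) {
            appendLiteral(regex, format.substring(last, m.start()));
            String name = m.group(1);
            fields.add(name);
            String body;
            if (name.equals("MESSAGE")) {
                body = ".*";      // assumption: the message runs to the end of the line
            } else if (name.equals("TIMESTAMP")) {
                body = ".+?";     // timestamps may contain spaces, so match lazily
            } else {
                body = "\\S+";    // assumption: other fields are single tokens
            }
            regex.append("(?<").append(name).append('>').append(body).append(')');
            last = m.end();
        }
        appendLiteral(regex, format.substring(last));
        linePattern = Pattern.compile(regex.toString());
    }

    /** Quotes literal characters and lets any run of whitespace match one or more spaces. */
    private static void appendLiteral(StringBuilder regex, String literal) {
        boolean inWhitespace = false;
        for (char c : literal.toCharArray()) {
            if (Character.isWhitespace(c)) {
                if (!inWhitespace) {
                    regex.append("\\s+");
                }
                inWhitespace = true;
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
                inWhitespace = false;
            }
        }
    }

    /** Returns field name to raw text in format order, or null if the line does not match. */
    Map<String, String> parse(String line) {
        Matcher m = linePattern.matcher(line);
        if (!m.matches()) {
            return null;
        }
        Map<String, String> values = new LinkedHashMap<>();
        for (String field : fields) {
            values.put(field, m.group(field));
        }
        return values;
    }
}
```

With this sketch, new LogFormat("[%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE").parse(line) on the first sample line from the question yields TIMESTAMP "11/17/11 14:07:14:030 EST", ORIGIN "MyXmlParser", LEVEL "E" and MESSAGE "Premature end of file". The lazy TIMESTAMP group relies on the literal text that follows it, so ambiguous formats may still need a hand-written pattern.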

Solution 4

If you have the possibility (and you should, with a good logging framework), I would recommend duplicating the logs in a parsable format. For example, with log4j, use an XMLLayout or something similar. It will be a lot easier to parse because you will then know the exact format of the logs.

You can do this quite transparently to the running app, purely through configuration. Consider using an asynchronous appender so as not to disturb the running application too much.

Also, if XMLLayout suits your needs, have a look at Apache Chainsaw.

Solution 5

Log4j's LogFilePatternReceiver does exactly that...

This log entry: 17-11-2011 14:07:14 ERROR MyXmlParser - Premature end of file

can be parsed using the following logformat (assuming origin is the same as 'logger'), with a timestamp format that uses Java's SimpleDateFormat pattern dd-MM-yyyy kk:mm:ss:

TIMESTAMP LEVEL LOGGER - MESSAGE

The timezone and the level in the other format are a little trickier: there is the ability to remap strings to levels (E to ERROR), but I don't know whether the timezone will quite work.
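Outside Chainsaw, the two conversions mentioned here (the timestamp pattern and the remapping of level strings) can be sketched in plain Java. This is only an illustration: the class FieldConverters and its method names are made up, and of the level table only the E to ERROR entry comes from the text above; W and I are assumed abbreviations.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-field converters for the raw strings extracted from a log line. */
final class FieldConverters {
    // Only E -> ERROR is taken from the answer; the other abbreviations are assumptions.
    private static final Map<String, String> LEVELS = new HashMap<>();
    static {
        LEVELS.put("E", "ERROR");
        LEVELS.put("W", "WARN");
        LEVELS.put("I", "INFO");
    }

    /** Remaps an abbreviated level to its full name; unknown values pass through unchanged. */
    static String toLevel(String raw) {
        String mapped = LEVELS.get(raw);
        return mapped != null ? mapped : raw;
    }

    /** Parses a timestamp with a SimpleDateFormat pattern, e.g. "dd-MM-yyyy kk:mm:ss". */
    static Date toTimestamp(String raw, String pattern) {
        try {
            // SimpleDateFormat is not thread-safe, so a new instance is created per call.
            return new SimpleDateFormat(pattern).parse(raw);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + raw, e);
        }
    }
}
```

For example, FieldConverters.toTimestamp("17-11-2011 14:07:14", "dd-MM-yyyy kk:mm:ss") parses the entry above in the JVM's default timezone. The bracketed format with its millisecond and timezone suffix would presumably need a different pattern, which is not attempted here.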

Try it out, check out the source, and play with support for it in the latest developer snapshot of Chainsaw:

http://people.apache.org/~sdeboy

Mario Duarte
Author by

Mario Duarte

Senior Software Engineer

Updated on June 04, 2022

Comments

  • Mario Duarte, almost 2 years ago

    We need to parse several log files and run some statistics on the log entries found (things such as the number of occurrences of certain messages, spikes of occurrences, etc.). The problem is writing a log parser that will handle several log formats and will allow me to add a new log format with very little work.

    To make things easier for now I'm only looking at logs that will basically look similar to this:

    [11/17/11 14:07:14:030 EST] MyXmlParser     E   Premature end of file
    

    so each log entry will contain a timestamp, originator (of the log message), level and log message. One important detail is that a message may have more than one line (e.g. stacktrace). Another instance of the log entry could be:

    17-11-2011 14:07:14 ERROR    MyXmlParser   - Premature end of file
    

    I'm looking for a good way to specify the log format, as well as the most adequate technology to implement the parser for it. I thought about regular expressions, but I think it will be tricky to handle situations such as the multi-line message (e.g. stacktrace).

    Actually the task of writing a parser for a specific log format does not sound so easy itself when I consider the possibility of multi-line messages. How do you go about parsing those files?

    Ideally I would be able to specify something like this as a log format:

    [%TIMESTAMP] %ORIGIN %LEVEL %MESSAGE
    

    or

    %TIMESTAMP %LEVEL %ORIGIN - %MESSAGE
    

    Obviously I would have to assign the right converter to each field so that it is handled correctly (e.g. the timestamp).

    Could anyone give me some good ideas on how to implement this in a robust and modular way (I'm using Java)?
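One common way to deal with the multi-line messages raised in the question is to group physical lines into logical entries before any field parsing: a line that looks like the start of an entry opens a new one, and anything else (such as a stack trace line) is appended to the entry in progress. The sketch below assumes the entry start can be recognized by a regex; the class name EntryGrouper is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: merges continuation lines (e.g. stack traces) into the preceding entry. */
final class EntryGrouper {
    /**
     * @param lines      the physical lines of the log file, in order
     * @param entryStart a pattern that matches the beginning of a new log entry
     * @return one string per logical entry, continuation lines joined with '\n'
     */
    static List<String> group(List<String> lines, Pattern entryStart) {
        List<String> entries = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (entryStart.matcher(line).lookingAt()) {
                // A new entry begins: flush the one being built, if any.
                if (current != null) {
                    entries.add(current.toString());
                }
                current = new StringBuilder(line);
            } else if (current != null) {
                // Continuation line: attach it to the current entry.
                current.append('\n').append(line);
            }
            // Lines before the first recognizable entry are dropped in this sketch.
        }
        if (current != null) {
            entries.add(current.toString());
        }
        return entries;
    }
}
```

For the second sample format, an entry-start pattern such as Pattern.compile("\\d{2}-\\d{2}-\\d{4} \\d{2}:\\d{2}:\\d{2} ") would be one possible choice. When a grouped entry is then matched against a line regex, the message group needs Pattern.DOTALL so that it can span the embedded newlines.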