Parsing XML file using Logstash

The multiline codec lets Logstash read the whole XML file as a single event, which the xml filter (with XPath expressions) can then parse before the data is ingested into Elasticsearch. In the multiline codec we specify a pattern ("<stations>" in the example below) that Logstash uses to scan the file; with negate => "true" and what => "previous", every line that does not match the pattern is appended to the previous one, so everything after the match is treated as a single event.

The following is an example of a working config file for my data:

input {
    file {
        path => "C:\Users\186181152\Downloads\stations3.xml"
        start_position => "beginning"
        # on Windows, "/dev/null" has no effect; "nul" can be used instead (see comments below)
        sincedb_path => "/dev/null"
        exclude => "*.gz"
        type => "xml"
        # every line that does not match <stations> is appended to the previous line,
        # so the whole file is read as one event
        codec => multiline {
            pattern => "<stations>"
            negate => "true"
            what => "previous"
        }
    }
}

filter {
    xml {
        source => "message"
        # do not store the parsed document itself; keep only the xpath results below
        store_xml => false
        target => "stations"
        xpath => [
            "/stations/station/id/text()", "station_id",
            "/stations/station/name/text()", "station_name"
        ]
    }
}

output {
    elasticsearch {
        codec => json
        hosts => "localhost"
        index => "xmlns24"
    }
    stdout {
        codec => rubydebug
    }
}   
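For reference, with a station entry like the one described in the question (id 1, name "Finch"), the rubydebug output of this pipeline should look roughly like the sketch below; the host, timestamp, and message values are placeholders rather than captured output, and the XPath results are stored as arrays.

{
    "station_id" => ["1"],
    "station_name" => ["Finch"],
    "path" => "C:\Users\186181152\Downloads\stations3.xml",
    "@timestamp" => ...,
    "@version" => "1",
    "host" => "...",
    "message" => "<stations> ... </stations>",
    "tags" => ["multiline"],
    "type" => "xml"
}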

Comments

  • KARAN SHAH
    KARAN SHAH over 1 year

    I am trying to parse an XML file in Logstash and want to use XPath to parse the documents in the XML. When I run my config file, the data does load into Elasticsearch, but not the way I want: each line of the XML document ends up as a separate event.

    Structure of my XML file

    (image of the XML file structure, not available here)
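    Based on the XPath expressions and the desired fields below, the structure is presumably something like this (a sketch, not the original file):

    <stations>
        <station>
            <id>1</id>
            <name>Finch</name>
        </station>
    </stations>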

    What I want to achieve:

    Create fields in Elasticsearch that store the following:

    ID = 1
    Name = "Finch"
    

    My Config file:

    input{
        file{
            path => "C:\Users\186181152\Downloads\stations.xml"
            start_position => "beginning"
            sincedb_path => "/dev/null"
            exclude => "*.gz"
            type => "xml"
        }
    }
    filter{
        xml{
            source => "message"
            store_xml => false
            target => "stations"
            xpath => [
                "/stations/station/id/text()", "station_id",
                "/stations/station/name/text()", "station_name"
            ]
        }
    }
    
    output{
        elasticsearch{
            codec => json
            hosts => "localhost"
            index => "xmlns"
        }
        stdout{
            codec => rubydebug
        }
    }
    

    Output in Logstash:

    {
        "station_name" => "%{station_name}",
        "path" => "C:\Users\186181152\Downloads\stations.xml",
        "@timestamp" => 2018-02-09T04:03:12.908Z,
        "station_id" => "%{station_id}",
        "@version" => "1",
        "host" => "BW",
        "message" => "\t\r",
        "type" => "xml"
    }
    
    • baudsp
      baudsp about 6 years
      I don't think /dev/null is supported on Windows.
    • baudsp
      baudsp about 6 years
      Is the whole xml file on the same line, i.e. no line break? Because if it's not the case, the file will be treated line by line (as indicated in the doc), thus causing the empty station_id and station_name.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp /dev/null works fine. I tried a CSV file and it loaded the data correctly.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp The whole XML file is not on the same line. The file follows standard XML conventions: one tag per line.
    • baudsp
      baudsp about 6 years
      What I meant is that setting sincedb_path => "/dev/null" will not have the same behavior as on a Linux system. The purpose of setting sincedb_path => "/dev/null" is that the sincedb file will not be written, so Logstash will not remember how much of each file has been read. But it won't prevent Logstash from running.
    • baudsp
      baudsp about 6 years
      The file input reads line by line, creating one message per line, which explains your result. You'll have to use the multiline codec on your input. See stackoverflow.com/questions/34800559/…
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp Thanks. I will try the multiline codec and let you know if that works out for me.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp I tried the multiline codec.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp I tried the multiline codec with the following pattern below type => "xml", and now it does not even create an index anymore. What should the sincedb path be on the Windows operating system? codec => multiline { pattern => "<stations>" negate => "true" what => "previous" }
    • baudsp
      baudsp about 6 years
      From what I've read elsewhere, you can use nul as the sincedb path to the same effect as Unix /dev/null.
    • baudsp
      baudsp about 6 years
      I think that, since Logstash has already read the file, it won't do anything with it. You'll have to add lines to it or use another file. Or find the sincedb file and delete it. Or use another sincedb path.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp The multiline codec solves the problem, and yes, I am facing the sincedb problem. If there are any possible workarounds to fix sincedb, let me know. Also, I would request you to answer this question so I can mark it completed.
    • baudsp
      baudsp about 6 years
      sincedb_path => "nul" works, I've just tested it; you can use this so that Logstash doesn't remember what has been read (see the sketch after these comments). You can answer your own question; I don't know if I'll have time to answer this one.
    • KARAN SHAH
      KARAN SHAH about 6 years
      @baudsp Yes, sincedb_path works, but then why does my Logstash load data only when my system restarts or boots up? I mean, I tried another configuration on the same file and it works flawlessly. Can you help me out with that?
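As noted in the comments, on Windows the sincedb file can be disabled by pointing sincedb_path at the nul device instead of /dev/null. A minimal file input sketch (the path is the one from the question, used here only as an example):

input {
    file {
        path => "C:\Users\186181152\Downloads\stations.xml"
        start_position => "beginning"
        sincedb_path => "nul"
        type => "xml"
    }
}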