Parsing XML file using Logstash
The multiline codec lets Logstash read the whole XML file as a single event, which can then be parsed with the xml filter and XPath expressions to ingest the data into Elasticsearch. In the multiline codec we specify a pattern ("<stations>" in the example below) that Logstash uses to scan the XML file. With negate => "true" and what => "previous", every line that does not match the pattern is appended to the previous event, so once the pattern matches, all the lines after it are treated as part of a single event.
The following is a working config file for my data:
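input {
  file {
    path => "C:\Users\186181152\Downloads\stations3.xml"
    # read the file from the start the first time it is seen
    start_position => "beginning"
    # do not persist read positions; on Windows, the comments below suggest "nul"
    sincedb_path => "/dev/null"
    exclude => "*.gz"
    type => "xml"
    codec => multiline {
      # a line matching <stations> starts a new event; every line that
      # does NOT match (negate) is appended to the previous event
      pattern => "<stations>"
      negate => "true"
      what => "previous"
    }
  }
}
filter {
  xml {
    source => "message"
    # keep only the XPath results, not the whole parsed document
    store_xml => false
    target => "stations"
    # pairs of: XPath expression, destination field
    xpath => [
      "/stations/station/id/text()", "station_id",
      "/stations/station/name/text()", "station_name"
    ]
  }
}
output {
  elasticsearch {
    codec => json
    hosts => "localhost"
    index => "xmlns24"
  }
  # also print each event to the console for debugging
  stdout {
    codec => rubydebug
  }
}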
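As a rough illustration of the grouping rule used by the codec settings in this answer, here is a minimal Python sketch. It is not Logstash code: the function name group_multiline and the sample XML lines are assumptions made up for the example (the sample is modelled on the XPath expressions and the ID/Name values mentioned in the question below).

```python
import re


def group_multiline(lines, pattern, negate=True):
    """Simplified model of the multiline rule with what => "previous".

    A line is appended to the previous event when (matched XOR negate)
    is true; otherwise it starts a new event.
    """
    regex = re.compile(pattern)
    events = []
    current = []
    for line in lines:
        matched = regex.search(line) is not None
        belongs_to_previous = matched != negate
        if belongs_to_previous and current:
            current.append(line)
        else:
            if current:
                events.append("\n".join(current))
            current = [line]
    # This sketch flushes the pending event at end of input.
    if current:
        events.append("\n".join(current))
    return events


# Hypothetical sample file, one tag per line.
sample_lines = [
    "<stations>",
    "  <station>",
    "    <id>1</id>",
    "    <name>Finch</name>",
    "  </station>",
    "</stations>",
]

events = group_multiline(sample_lines, "<stations>")
print(len(sample_lines))  # 6 lines, i.e. 6 events without the codec
print(len(events))        # 1 event with the codec settings above
```

One difference worth keeping in mind: the sketch emits the last event as soon as the input ends, whereas a tailing reader cannot know that the file is finished. As far as I know, the multiline codec has an auto_flush_interval option for that situation, but check the codec documentation for your version before relying on it.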
Comments
-
KARAN SHAH over 1 year
I am trying to parse an XML file in Logstash, and I want to use XPath to parse the documents in it. When I run my config file, the data loads into Elasticsearch, but not in the way I want: each line of the XML document is loaded as a separate entry.
Structure of my XML file
What I want to achieve:
Create fields in Elasticsearch that store the following:
ID = 1 Name = "Finch"
My Config file:
input {
  file {
    path => "C:\Users\186181152\Downloads\stations.xml"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    exclude => "*.gz"
    type => "xml"
  }
}
filter {
  xml {
    source => "message"
    store_xml => false
    target => "stations"
    xpath => [
      "/stations/station/id/text()", "station_id",
      "/stations/station/name/text()", "station_name"
    ]
  }
}
output {
  elasticsearch {
    codec => json
    hosts => "localhost"
    index => "xmlns"
  }
  stdout {
    codec => rubydebug
  }
}
Output in Logstash:
{
    "station_name" => "%{station_name}",
    "path" => "C:\Users\186181152\Downloads\stations.xml",
    "@timestamp" => 2018-02-09T04:03:12.908Z,
    "station_id" => "%{station_id}",
    "@version" => "1",
    "host" => "BW",
    "message" => "\t\r",
    "type" => "xml"
}
-
baudsp about 6 years
I don't think /dev/null is supported on Windows.
-
baudsp about 6 years
Is the whole XML file on a single line, i.e. with no line breaks? If it is not, the file will be treated line by line (as indicated in the doc), which causes the empty station_id and station_name.
-
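To make the point about line-by-line reading concrete, here is a small Python sketch, not the xml filter itself. It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so the paths are written relative to the root element. The function name extract_stations and the sample document are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def extract_stations(message):
    """Return (ids, names) found in one event's message.

    Returns two empty lists when the message is not a well-formed
    document rooted at <stations>, as happens with single lines.
    """
    try:
        root = ET.fromstring(message)
    except ET.ParseError:
        return [], []
    if root.tag != "stations":
        return [], []
    # Rough equivalents of /stations/station/id/text() and .../name/text()
    ids = [element.text for element in root.findall("./station/id")]
    names = [element.text for element in root.findall("./station/name")]
    return ids, names


# Hypothetical document with one tag per line.
whole = "\n".join([
    "<stations>",
    "  <station>",
    "    <id>1</id>",
    "    <name>Finch</name>",
    "  </station>",
    "</stations>",
])

# The whole document as one message: both values are found.
print(extract_stations(whole))

# One message per line: no line yields a match on its own.
for line in whole.splitlines():
    print(repr(line), extract_stations(line))
```

A line such as "<stations>" alone is not well-formed XML, and a line such as "<id>1</id>" parses but has no /stations/station ancestor, so in both cases the expressions have nothing to select.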
KARAN SHAH about 6 years
@baudsp /dev/null works fine. I tried a CSV file and it loaded the data correctly.
-
KARAN SHAH about 6 years
@baudsp The whole XML file is not on the same line. The file follows standard XML file conventions: one tag per line.
-
baudsp about 6 years
What I meant is that setting sincedb_path => "/dev/null" will not have the same behavior as on a Linux system. The purpose of setting sincedb_path => "/dev/null" is that the sincedb file will not be written, so Logstash will not remember what has been read in each file. But it won't prevent Logstash from running.
-
baudsp about 6 years
The file input reads line by line, creating one message per line, which explains your result. You'll have to use the multiline codec on your input. See this: stackoverflow.com/questions/34800559/…
-
KARAN SHAH about 6 years
@baudsp Thanks. I will try the multiline codec and let you know if that works out for me.
-
KARAN SHAH about 6 years
@baudsp. I tried the multiline codec
-
KARAN SHAH about 6 years
@baudsp. I tried the multiline codec with the following pattern, placed below type => "xml", and now it does not even create an index anymore. What should the sincedb path be on a Windows operating system?
codec => multiline { pattern => "<stations>" negate => "true" what => "previous" }
-
baudsp about 6 years
From what I've read elsewhere, you can use nul as the sincedb path, to the same effect as the Unix /dev/null.
-
baudsp about 6 years
I think that, since Logstash has already read the file, it won't do anything with it. You'll have to add lines to it or use another file. Or find the sincedb file and delete it. Or use another sincedb path.
-
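The behavior described in the comment above can be modelled with a short Python sketch. This is a deliberately simplified stand-in and not how the file input is implemented: here the "sincedb" is just a dict mapping a path to the byte offset already read, whereas the real plugin tracks more than that. The names read_new_lines and demo, and the file contents, are made up for the example.

```python
import os
import tempfile


def read_new_lines(path, sincedb):
    """Return lines added since the last call, using offsets stored in sincedb."""
    offset = sincedb.get(path, 0)
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
        sincedb[path] = handle.tell()  # remember how far we have read
    return data.decode("utf-8").splitlines()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "stations.xml")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("<stations>\n</stations>\n")

        sincedb = {}
        first = read_new_lines(path, sincedb)   # whole file
        second = read_new_lines(path, sincedb)  # nothing new to read

        with open(path, "a", encoding="utf-8") as handle:
            handle.write("<!-- appended -->\n")
        third = read_new_lines(path, sincedb)   # only the appended line

        # A discarded sincedb behaves like a never-seen file.
        fresh = read_new_lines(path, {})
        return first, second, third, fresh


first, second, third, fresh = demo()
print(first)
print(second)
print(third)
print(fresh)
```

In this model, the second read returns nothing because the stored offset already points at the end of the file, which mirrors the suggestions above: add lines, use another file, or get rid of the stored position.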
KARAN SHAH about 6 years
@baudsp The multiline codec solves the problem, and yes, I am facing the sincedb problem. If there is a possible workaround to fix the sincedb issue, let me know. Also, I would request you to post this as an answer so I can mark the question as completed.
-
baudsp about 6 years
sincedb_path => "nul" works, I've just tested it. You can use it so that Logstash doesn't remember what has been read. You can answer your own question; I don't know if I'll have time to answer this one.
-
KARAN SHAH about 6 years
@baudsp Yes, sincedb_path works, but then why does my Logstash load data only when my system restarts or boots up? I mean, I tried another configuration on the same file and it works flawlessly. Can you help me out with that?
-