Ruby - Reading and editing XML file

Does that extra \n always appear in the <SUPPLIER> node? As others have suggested, Nokogiri is a great choice for parsing XML (or HTML). You could iterate through each <SUPPLIER> node and remove the \n character, then save the XML as a new file.

require 'nokogiri'

# read and parse the old file
file = File.read("old.xml")
xml = Nokogiri::XML(file)

# replace each line break, plus any whitespace around it, with a single space
xml.xpath("//SUPPLIER").each do |node|
  node.content = node.content.gsub(/\s*\n\s*/, " ")
end

# save the output into a new file
File.open("new.xml", "w") do |f|
  f.write xml.to_xml
end
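
The question also asks whether to write to a temporary file and then overwrite the original. A minimal sketch of that pattern follows, assuming a hypothetical helper named `rewrite_file` (not part of Ruby or Nokogiri) and a `.tmp` suffix chosen only for illustration. It writes the updated content next to the original and renames it into place, so the original is never left half-written.

```ruby
# Hypothetical helper (an assumption, not from the original answer):
# yields the current content of `path` to the block and, if the block
# returns something different, replaces the file with that result.
def rewrite_file(path)
  original = File.read(path)
  updated  = yield(original)
  return false if updated == original   # nothing to fix, leave the file alone

  tmp_path = "#{path}.tmp"              # assumed naming scheme for the temp file
  File.write(tmp_path, updated)         # write the full new content first
  File.rename(tmp_path, path)           # then swap it in place of the original
  true
end
```

To process a whole folder, you could call this inside `Dir.glob("C:/testing/corrected/*.xml").each`, parse the yielded string with `Nokogiri::XML`, apply the `<SUPPLIER>` cleanup shown above, and return `xml.to_xml` from the block.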
Author by

Silver

Updated on June 08, 2022

Comments

  • Silver
    Silver almost 2 years

    I am writing a Ruby (1.9.3) script that reads XML files from a folder and then edits them if necessary.

    My issue is that I was given XML files converted by Tidy, but its output is a little strange, for example:

    <?xml version="1.0" encoding="utf-8"?>
    <XML>
      <item>
          <ID>000001</ID>
          <YEAR>2013</YEAR>
          <SUPPLIER>Supplier name test,
          Coproration</SUPPLIER>
    ...
    

    As you can see, the <SUPPLIER> element has an extra CRLF. I don't know why Tidy behaves this way, but I am addressing it with a Ruby script. I am having trouble, though, because I need to check whether the last character of a line is ">" or the first is "<" to tell whether something is wrong with the markup.

    I have tried:

    Dir.glob("C:/testing/corrected/*.xml").each do |file|
    
    puts file
    
      File.open(file, 'r+').each_with_index do |line, index|
    
        first_char = line[0,1]
    
        if first_char != "<"
            # copy this line to the previous line and delete this one?
        end
    
      end
    
    end
    

    I also feel like I should copy the original file content to a temporary file as I read it and then overwrite the original. Is that the best way? Any tips are welcome, as I do not have much experience in altering a file's content.

    Regards
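
For comparison, the line-based check attempted in the question could be sketched roughly as below. The helper name `join_broken_lines` is hypothetical, and the rule it applies (a line whose first non-blank character is not "<" belongs to the previous line) is only the heuristic described in the question. It would also merge text that is legitimately spread over several lines, which is one reason the parser-based approach in the answer is usually safer.

```ruby
# Hypothetical helper (an assumption for illustration): merges any line
# whose first non-blank character is not "<" into the line before it.
def join_broken_lines(text)
  lines = text.each_line.each_with_object([]) do |line, out|
    stripped = line.strip
    if out.empty? || stripped.empty? || stripped.start_with?("<")
      out << line.chomp                              # keep the line (CRLF or LF removed)
    else
      out[-1] = "#{out[-1].rstrip} #{stripped}"      # append to the previous line
    end
  end
  lines.join("\n") + "\n"                            # output uses LF line endings
end
```

The result is a plain string, so it can be written back with the same read, transform, and overwrite steps used for any other edit to the file.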