Reading contents of XML file without having to remove the XML declaration

Solution 1

Nothing, not even whitespace, may appear before the <?xml ?> declaration other than a BOM, and no text may appear between the declaration and the root element other than whitespace (comments and processing instructions are also permitted there).

Anything else is not a well-formed XML document.

UPDATE:

I think your expectation of XmlTextReader.Read() is incorrect.

Each call to XmlTextReader.Read() advances to the next "token" (node) in the XML document, one token at a time. "Token" here means XML elements, whitespace, text, comments, and the XML declaration.

Your call to reader.ReadOuterXml() is returning an empty string because the first token in your XML file is the XML declaration, and ReadOuterXml() only returns markup when the reader is positioned on an element or attribute.

Consider this code:

    // Each Read() call advances the reader by exactly one node.
    XmlTextReader reader = new XmlTextReader("test.xml");
    reader.Read();
    Console.WriteLine(reader.NodeType);  // XmlDeclaration
    reader.Read();
    Console.WriteLine(reader.NodeType);  // Whitespace
    reader.Read();
    Console.WriteLine(reader.NodeType);  // Element
    // The reader is now on the root element, so this returns its markup.
    string rs = reader.ReadOuterXml();

The code above produces this output:

XmlDeclaration
Whitespace
Element

The first "token" is the XML declaration.

The second "token" encountered is the line break after the XML declaration.

The third "token" encountered is the <s:Envelope> element. From here a call to reader.ReadOuterXml() will return what I think you're expecting to see: the markup of the <s:Envelope> element, which is the entire SOAP envelope.
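Rather than counting Read() calls by hand, XmlReader.MoveToContent() can be used to skip ahead to the first content node. The following is a minimal sketch, not the original poster's code: the class name RootElementReader, the method ReadRootOuterXml, and the inline sample XML are illustrative assumptions.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class RootElementReader
{
    // Returns the markup of the root element, skipping the XML declaration,
    // whitespace, and comments that may precede it.
    public static string ReadRootOuterXml(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            // MoveToContent() advances past non-content nodes and stops on
            // the first content node, which here is the root element.
            reader.MoveToContent();
            return reader.ReadOuterXml();
        }
    }

    public static void Main()
    {
        // Sample input standing in for the file in the question.
        string xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                   + "<root><child>text</child></root>";
        Console.WriteLine(ReadRootOuterXml(new StringReader(xml)));
    }
}
```

For a file on disk, a StreamReader over the path (or XmlReader.Create(path)) can be used in place of the StringReader.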

If what you really want is to load the XML file into memory as objects, just call var doc = XDocument.Load("test.xml") and be done with the parsing in one fell swoop.

Unless you're working with an XML doc that is so monstrously huge that it won't fit in system memory, there's really not a lot of reason to go poking through the XML document one token at a time.
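As a rough sketch of the load-everything approach, the example below parses a document and then rebuilds a string from it. XDocument.ToString() leaves out the XML declaration, so the sketch adds it back from doc.Declaration. The class and method names are hypothetical, and XDocument.Parse on an inline string stands in for XDocument.Load("test.xml") so the example runs without a file.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class LoadWholeDocument
{
    // Returns the declaration (if any) followed by the root element's markup.
    public static string DeclarationAndRoot(XDocument doc)
    {
        // XDocument.ToString() omits the declaration, so prepend it explicitly.
        string declaration = doc.Declaration == null
            ? ""
            : doc.Declaration.ToString() + Environment.NewLine;
        return declaration + doc.Root.ToString(SaveOptions.DisableFormatting);
    }

    public static void Main()
    {
        // With a real file this would be: XDocument.Load("test.xml")
        XDocument doc = XDocument.Parse(
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root><child>text</child></root>");
        Console.WriteLine(DeclarationAndRoot(doc));
    }
}
```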

Solution 2

What about

XmlDocument doc = new XmlDocument();
doc.Load(@"c:\my path\a.xml");
// Now we have the XML document; convert it to a string.
// There are many ways to do this; one is:
StringWriter sw = new StringWriter();
// Note: saving to a StringWriter writes the declaration's encoding as utf-16.
doc.Save(sw);
String finalresult = sw.ToString();

Solution 3

EDIT: I'm assuming you mean you actually have text between the document declaration and the root element. If that's not the case, please clarify.

Without removing the extra text, it's simply an invalid XML file. I wouldn't expect it to work. You don't have an XML file - you have something a bit like an XML file, but with extraneous stuff before the root element.
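To illustrate the point, the sketch below checks a few inputs with XDocument.Parse, which throws XmlException when the input is not well-formed. The helper name IsWellFormed and the sample strings are assumptions made for this example.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
public static class WellFormedCheck
{
    // Returns true if the parser accepts the text, false if it throws XmlException.
    public static bool IsWellFormed(string xml)
    {
        try
        {
            XDocument.Parse(xml);
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    public static void Main()
    {
        // Declaration, line break, root element.
        Console.WriteLine(IsWellFormed("<?xml version=\"1.0\"?>\n<root/>"));
        // A comment between the declaration and the root element.
        Console.WriteLine(IsWellFormed("<?xml version=\"1.0\"?>\n<!-- note -->\n<root/>"));
        // Plain text between the declaration and the root element.
        Console.WriteLine(IsWellFormed("<?xml version=\"1.0\"?>\nstray text\n<root/>"));
    }
}
```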

Solution 4

IMHO you can't read this file, because there is plain text before the root element <s:Envelope>, which makes the whole document invalid.

Author by

Pingpong

Updated on June 05, 2022

Comments

  • Pingpong
    Pingpong almost 2 years

    I want to read all XML contents from a file. The code below only works when the XML declaration (<?xml version="1.0" encoding="UTF-8"?>) is removed. What is the best way to read the file without removing the XML declaration?

    XmlTextReader reader = new XmlTextReader(@"c:\my path\a.xml");
    reader.Read();
    string rs = reader.ReadOuterXml();
    

    Without removing the XML declaration, reader.ReadOuterXml() returns an empty string.

    <?xml version="1.0" encoding="UTF-8"?>  
    <s:Envelope xmlns:s="http://www.w3.org/2003/05/soap-envelope" xmlns:a="http://www.w3.org/2005/08/addressing">
      <s:Header>
        <a:Action s:mustUnderstand="1">http://www.as.com/ver/ver.IClaimver/Car</a:Action>
        <a:MessageID>urn:uuid:b22149b6-2e70-46aa-8b01-c2841c70c1c7</a:MessageID>
        <ActivityId CorrelationId="16b385f3-34bd-45ff-ad13-8652baeaeb8a" xmlns="http://schemas.microsoft.com/2004/09/ServiceModel/Diagnostics">04eb5b59-cd42-47c6-a946-d840a6cde42b</ActivityId>
        <a:ReplyTo>
          <a:Address>http://www.w3.org/2005/08/addressing/anonymous</a:Address>
        </a:ReplyTo>
        <a:To s:mustUnderstand="1">http://localhost/ver.Web/ver2011.svc</a:To>
      </s:Header>
      <s:Body xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
        <Car xmlns="http://www.as.com/ver">
          <carApplication>
            <HB_Base xsi:type="HB" xmlns="urn:core">
              <Header>
                <Advisor>
                  <AdvisorLocalAuthorityCode>11</AdvisorLocalAuthorityCode>
                  <AdvisorType>1</AdvisorType>
                </Advisor>
              </Header>
              <General>
                <ApplyForHB>yes</ApplyForHB>
                <ApplyForCTB>yes</ApplyForCTB>
                <ApplyForFSL>yes</ApplyForFSL>
                <ConsentSupplied>no</ConsentSupplied>
                <SupportingDocumentsSupplied>no</SupportingDocumentsSupplied>
              </General>
            </HB_Base>
          </carApplication>
        </Car>
      </s:Body>
    </s:Envelope>
    

    Update

    I know of other methods that do not use an XML reader (e.g. File.ReadAllText()), but I need a way that uses an XML API.

  • Pingpong
    Pingpong over 12 years
    Sorry for the confusion. It is a comment, not for testing. Please refer to the updated contents.
  • Pingpong
    Pingpong over 12 years
    Thanks. Your answer is very helpful. I wonder why a line break or whitespace between them is still considered a token; that seems unnecessary.
  • dthorpe
    dthorpe over 12 years
    Treating whitespace (and comments) as a token is important to preserve the line breaks and indents that users use to format the XML code. Without that, you would constantly be fighting with your XML editor as it reformats your XML and destroys your indent style. If you are not working in a round-trip user interactive manner, just straight processing of XML, you can ignore whitespace and comments. But you have to actively ignore them. ;>
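One way to "actively ignore" whitespace and comments is to ask the reader to drop them through XmlReaderSettings. The sketch below lists the node types the reader reports with and without that setting. The class NodeTypeLister and the sample document are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class NodeTypeLister
{
    // Collects the node type reported by each successful Read() call.
    public static List<XmlNodeType> ListNodeTypes(string xml, bool ignoreWhitespaceAndComments)
    {
        var settings = new XmlReaderSettings
        {
            IgnoreWhitespace = ignoreWhitespaceAndComments,
            IgnoreComments = ignoreWhitespaceAndComments
        };

        var types = new List<XmlNodeType>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                types.Add(reader.NodeType);
            }
        }
        return types;
    }

    public static void Main()
    {
        string xml = "<?xml version=\"1.0\"?>\n<!-- note -->\n<root/>";
        Console.WriteLine(string.Join(", ", ListNodeTypes(xml, false)));
        Console.WriteLine(string.Join(", ", ListNodeTypes(xml, true)));
    }
}
```

With the settings enabled, the reader goes straight from the declaration to the root element, so a single extra Read() after the declaration lands on <root/>.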