Start tag expected, '<' not found in sitemap.xml — Not sure what's wrong


Solution 1

UPDATE: As stated in the comments, the sitemap validators in question have trouble parsing gzipped sitemaps (in the OP's case, Amazon S3 only serves gzipped text responses).


I'm now in the camp that thinks this is a server issue, and I have some data to back that up (so I didn't edit the other answer). Here is what I did (my original point about making the sitemap "more valid" is still below): I copied your file (viewing the source in the browser), created a sitemap.xml, uploaded it to my S3 bucket, and confirmed that all the validators mentioned in this question consider it valid. Then I used wget to fetch both your sitemap and my copy. Here is what I found (my bucket name is obscured as [myexamples3bucket.example], but you can see it resolves to an AWS IP address):

:~# wget http://[myexamples3bucket.example]/original.xml
--2013-04-02 13:26:42--  http://[myexamples3bucket.example]/original.xml
Resolving [myexamples3bucket.example]... 207.171.189.80
Connecting to [myexamples3bucket.example]|207.171.189.80|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4578 (4.5K) [text/xml]
Saving to: `original.xml'

100%[======================================>] 4,578       --.-K/s   in 0.002s

2013-04-02 13:26:42 (1.97 MB/s) - `original.xml' saved [4578/4578]

Then I tried to fetch your sitemap:

:~# wget http://aahank.com/sitemap.xml
--2013-04-02 13:26:55--  http://aahank.com/sitemap.xml
Resolving aahank.com... 178.236.4.60
Connecting to aahank.com|178.236.4.60|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 766 [application/xml]
Saving to: `sitemap.xml'

100%[======================================>] 766         --.-K/s   in 0s

2013-04-02 13:26:55 (144 MB/s) - `sitemap.xml' saved [766/766]

The contents of these two files are very different. While the "copied" sitemap looks exactly like what you would expect, your original sitemap looks like this:

^_�^H^@^@^@^@^@^@^CÍM�Ú0^P����^_^P×j��^O>,����=�J�ï¿ï¿½^Rq��1�^XY�Lnw���^R�^V�l
                           �jO$+U���:z�s�i�2V�Ë���u�]��Þ8_;����EcÑ9È[�M����^BwJjhw��-�4^Z^\ZJ��0I^O�0^Q�!���9��^^^]�1;^N�^]����Ǫ^Z̪^_��˪ڪB$Aɪ^M�^DmHcT-
�Ns,ªAÚª^Z�a�T�XÄV5��^[^^����A�F9^KTpÆÖe�AÔ���2È^_�$

This points to Amazon S3 being the culprit. I'm offering this up in case anyone else can figure out how to fix it. Good luck!
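If anyone wants to verify this kind of garbage output locally: gzip streams always begin with the magic bytes 0x1f 0x8b, so a quick check on the downloaded bytes tells you whether the server sent compressed content. A small sketch (the sample XML is illustrative, not the OP's actual sitemap):

```python
import gzip

def is_gzipped(data: bytes) -> bool:
    """Gzip streams always start with the two magic bytes 0x1f 0x8b."""
    return data[:2] == b"\x1f\x8b"

def maybe_decompress(data: bytes) -> bytes:
    """Return plain bytes, transparently gunzipping if needed."""
    return gzip.decompress(data) if is_gzipped(data) else data

# Example: a gzipped payload is detected and round-trips back to readable XML.
xml = b"<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>"
compressed = gzip.compress(xml)
assert is_gzipped(compressed) and not is_gzipped(xml)
assert maybe_decompress(compressed) == xml
```

Run against a file saved by wget, this would immediately distinguish "broken XML" from "perfectly fine XML served gzipped".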


As for being "more valid": using the official definition of a valid sitemap, I made the following (small) changes to your sitemap, uploaded it to my S3 bucket, and tested it against the two validators you linked to; it now passes:

<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

Everything else is unchanged. The error messages on those two sites are very unhelpful, but the important additions are the xmlns:xsi and xsi:schemaLocation attributes, which tell a validator the intended schema. I would expect crawlers to assume these, but for the two linked services the absence of these attributes technically makes the document invalid.
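As a quick sanity check (full XSD validation would need an external library such as lxml; the standard library can only confirm well-formedness and the namespace), this sketch parses a sitemap with the header shown above and pulls out the URLs. The single-URL sitemap string is a made-up example:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

sitemap = """<?xml version='1.0' encoding='UTF-8'?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd"
        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc></url>
</urlset>"""

root = ET.fromstring(sitemap)  # raises ET.ParseError if not well-formed
assert root.tag == f"{{{SITEMAP_NS}}}urlset"

# Element tags are namespace-qualified, so iterate with the full name.
locs = [e.text for e in root.iter(f"{{{SITEMAP_NS}}}loc")]
print(locs)  # ['http://example.com/']
```

Note that this does not check the schemaLocation itself; it only proves the document is well-formed XML in the sitemap namespace, which is what the picky validators appear to trip over first.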

Solution 2

The problem must lie in some server setting - port blocking, proxy settings, or security settings. I copied your unchanged sitemap file to my server, and both validators read and validate it. That is all I can tell you, but you can be sure the problem is not your XML code.

Author: its_me (updated on June 04, 2022)

Comments

  • its_me
    its_me about 2 years

    This is what my website's sitemap.xml looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    
      <url>
        <loc>http://example.com/</loc>
        <lastmod>2013-04-02T12:45:31+00:00</lastmod>
        <changefreq>daily</changefreq>
        <priority>1</priority>
      </url>
    
      <url>
        <loc>http://example.com/2013/wordpress-customize-login-page/</loc>
        <lastmod>2013-03-01T12:06:00+00:00</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    
    </urlset>
    

And here's the original sitemap. First, I made sure the XML markup was valid, then checked my sitemap on xmlcheck and sitemapxml.

    The two sitemap validators gave this error:

    Fatal Error 4: Start tag expected, '<' not found in http://example.com/sitemap.xml on line 1 column 1
    

    As I see it, nothing's missing. Not sure what I am doing wrong. (Googling didn't help either.)

    • Deepanshu Goyal
      Deepanshu Goyal about 11 years
    • Amit
      Amit about 11 years
      Check for extra line/spaces before <?xml tag
    • its_me
      its_me about 11 years
@Amit I provided a link to the sitemap itself in my question. As you can see, there are no visible/invisible lines before the <?xml tag (checked using type sitemap.xml | more).
    • its_me
      its_me about 11 years
      @Deepanshu I am not sure how that relates in my specific case.
    • Jason Sperske
      Jason Sperske about 11 years
      @TheoneManis, this is strange, this validator seems to take it, while the two you linked to do not
    • its_me
      its_me about 11 years
@JasonSperske Yes, strange. I am really unsure where the actual problem is, and I don't see how Amazon S3 could be messing it up like Martin said. Now I am starting to wonder if it's some kind of cache issue on their (i.e. the XML validators') end.
    • Jason Sperske
      Jason Sperske about 11 years
      @TheoneManis, could you be using CloudFront to host this sitemap?
    • Jason Sperske
      Jason Sperske about 11 years
Sorry about that, I took a look at the HTTP response headers and I can clearly see you are hosting this from Amazon S3. I tried busting the cache of those validator services (added ?cache=somerandomstuff to the URL; S3 ignores query string params but the URL is different) and it didn't make a difference. I also tested for a byte order mark (none found), which can really mess up validators.
    • its_me
      its_me about 11 years
      @JasonSperske So, you meant to say, there doesn't seem to be anything wrong with my sitemap, correct? So, like Martin suggested, it could be S3, in some weird way.
  • its_me
    its_me about 11 years
    Oh, it's Amazon S3, and I am not sure how it's meddling. I guess that warrants a new question. Thanks for the excellent finding by the way. :)
  • Jason Sperske
    Jason Sperske about 11 years
@TheoneManis, it could be the ACL settings of the bucket; maybe the URL is visible to you (because of a cookie) but from another server they just get an access denied message?
  • its_me
    its_me about 11 years
    @JasonSperske No. I don't think that's the case. I linked to the original sitemap file in my question. It's visible to the public. Please take a shot, if you can.
  • Jason Sperske
    Jason Sperske about 11 years
    @Martin, I think this might actually be an invalid sitemap.xml. Adding the extra xmlns:xsi and xsi:schemaLocation as outlined here allows the document to pass. These might be really picky validators though, because the "invalid" sitemap is likely to be perfectly understandable by crawlers.
  • its_me
    its_me about 11 years
Ah... looks like it's gzip :D All text responses served by my site are gzipped by default (since S3 doesn't do Vary: Accept-Encoding, I made sure it's just gzip). It seems the two sitemap validator sites have trouble with this, while validome.org/google doesn't. I wouldn't have known this without your detailed input. Thanks so much!
  • Martin Turjak
    Martin Turjak about 11 years
@JasonSperske First, I want to say I like your elaborate answer above. But what you say here... even if the xsi parts added to the XML code make it "more" valid, it does not explain why putting Theone's unchanged code on another server made it pass the validators; in both cases it passes as valid. So it still puzzles me a little.
  • Jason Sperske
    Jason Sperske about 11 years
I clearly missed the larger issue of Amazon S3 serving gzipped content without the appropriate header. When I added the missing attributes and tested on my S3 bucket, I jumped to the conclusion that they were what was missing. My comment to you was written before I tried the unmodified sitemap as well.
  • Jason Sperske
    Jason Sperske about 11 years
Also, you totally called the "it's the server" diagnosis, so your answer has my vote.
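To reproduce what the failing validators likely saw, per the gzip discovery in the comments above: feeding gzipped bytes straight to an XML parser fails at line 1, column 1 with exactly a "start tag expected" style error, while gunzipping first parses fine. A small sketch (the payload is illustrative):

```python
import gzip
import xml.etree.ElementTree as ET

# Simulate what S3 actually served: a gzipped body without the
# validator honoring Content-Encoding.
body = gzip.compress(b"<?xml version='1.0'?><urlset/>")

# A validator that treats the bytes as plain text chokes immediately,
# because the first byte is 0x1f rather than '<'.
try:
    ET.fromstring(body)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False

assert not parsed_raw  # i.e. "Start tag expected, '<' not found"
assert ET.fromstring(gzip.decompress(body)).tag == "urlset"
```

This matches the symptom in the question: the XML itself is fine, and only clients that decompress the response before parsing (browsers, crawlers, validome) see it correctly.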