Google Webmaster Tools says my XML sitemap "appears to be an HTML page"
Solution 1
Farseeker's suggestion is a good first step in troubleshooting (a text/html content-type would certainly produce this result). Google Webmaster Tools should display a different error message if the sitemap file contains invalid XML.
Given the temporary nature of the issue, have you checked your server logs to determine whether an error page was produced on Google's prior requests?
If you are dynamically generating sitemap files, a scripting error, database timeout, or other issue could produce an HTML error page intermittently.
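One way to catch an intermittent error page is to save what the server actually returns for the sitemap URL and classify the body. The sketch below is only an illustration under stated assumptions: `classify_body` and `SITEMAP_ROOTS` are hypothetical names, and it only checks the root element, not full schema validity.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Root elements defined by the sitemap protocol (hypothetical helper constant).
SITEMAP_ROOTS = {f"{{{SITEMAP_NS}}}urlset", f"{{{SITEMAP_NS}}}sitemapindex"}


def classify_body(body: bytes) -> str:
    """Return 'sitemap', 'html', or 'other' for a raw response body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML. Most HTML error pages end up here.
        head = body[:512].lower()
        if b"<html" in head or b"<!doctype html" in head:
            return "html"
        return "other"
    if root.tag in SITEMAP_ROOTS:
        return "sitemap"
    # A well-formed (X)HTML error page parses as XML but has an html root.
    if root.tag.lower().endswith("html"):
        return "html"
    return "other"
```

Running this against bodies captured at different times (for example from a scheduled fetch) would show whether the server occasionally answers with an HTML page instead of the sitemap.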
Solution 2
Probably because of the Content-Type header that the server is spitting out. Inspect it with your favourite tool (Firebug, etc.) and see what it's sending.
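If you would rather script the check than use a browser tool, something along these lines could work. It is a minimal sketch using only the Python standard library; `fetch_status_and_type`, `media_type`, and `is_xml_content_type` are hypothetical helper names, and treating `text/xml` and `application/xml` as the acceptable types is an assumption based on the two types discussed in this thread.

```python
import urllib.error
import urllib.request
from email.message import Message

# Assumption: these are the two XML media types discussed in this thread.
XML_TYPES = {"text/xml", "application/xml"}


def media_type(content_type_header: str) -> str:
    """Return the lowercased media type, without parameters such as charset."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def is_xml_content_type(content_type_header: str) -> bool:
    return media_type(content_type_header) in XML_TYPES


def fetch_status_and_type(url: str):
    """Fetch the URL and return (HTTP status, raw Content-Type header)."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status, resp.headers.get("Content-Type", "")
    except urllib.error.HTTPError as err:
        # Error responses (4xx/5xx) are exactly what we want to inspect.
        return err.code, err.headers.get("Content-Type", "")
```

Calling `fetch_status_and_type` on the sitemap URL and passing the second value to `is_xml_content_type` shows what a plain client receives. Googlebot may be served differently, so the server logs remain the authoritative record.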
Solution 3
You could extend the root element to include the schema attributes:
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
and then validate it online.
If it passes that, it must be Google's problem.
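Note that the value of xsi:schemaLocation is a whitespace-separated list of namespace/location pairs, so it needs an even number of URIs or a validator will reject it. A quick local sanity check before validating online might look like the sketch below; `schema_location_pairs` is a hypothetical helper, and it only checks the pairing, not the schema itself.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def schema_location_pairs(xml_bytes: bytes) -> List[Tuple[str, str]]:
    """Return (namespace, schema URL) pairs from the root's xsi:schemaLocation.

    Raises ValueError if the attribute holds an odd number of URIs.
    """
    root = ET.fromstring(xml_bytes)
    raw = root.get(f"{{{XSI_NS}}}schemaLocation", "")
    tokens = raw.split()  # any whitespace, including newlines, separates URIs
    if len(tokens) % 2 != 0:
        raise ValueError(
            f"schemaLocation has {len(tokens)} URI(s); expected namespace/location pairs"
        )
    return list(zip(tokens[0::2], tokens[1::2]))
```

An empty list means the attribute is absent, which is still a valid sitemap; it just cannot be checked against the schema by a generic XML validator.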
Adam Lynch
Updated on September 18, 2022
Comments
-
Adam Lynch almost 2 years
We're running a lot of sites and we've started to get a lot of these errors in Webmaster Tools:
Sitemap is HTML
Your Sitemap appears to be an HTML page. Please use a supported sitemap format instead.
One of the problematic sitemaps:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.same_domain.co.uk/folder/file1.shtml</loc>
    <lastmod>2011-05-11</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.same_domain.co.uk/folder/file2.shtml</loc>
    <lastmod>2011-05-11</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.same_domain.co.uk/folder/file3.shtml</loc>
    <lastmod>2011-05-11</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
  <url>
    <loc>http://www.same_domain.co.uk/folder/file4.shtml</loc>
    <lastmod>2011-05-11</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.5</priority>
  </url>
</urlset>
Why would GWTs think this is anything but XML?
(Server: IIS)
Edit:
"This document was successfully checked as well-formed XML!" -W3C Validator.
Edit:
I resubmitted two problematic sitemaps, one with no changes and one with a couple of extra lines to ensure it's treated as XML, then ran the "Fetch as Googlebot" diagnostic tool. Both are fine now. I'm just going to re-submit all sitemaps with the "Sitemap is HTML" error.
The question remains:
Why did this happen? Why did GWTs think these XML sitemaps were HTML?
-
Adam Lynch about 13 years
Nope. Content-Type: text/xml
-
Adam Lynch about 13 years
"Line 2 (<urlset>): 192 SchemaLocation: schemaLocation value = 'sitemaps.org/schemas/sitemap/0.9/sitemap.xsd' must have even number of URI's. Line 2: 192 cvc-elt.1: Cannot find the declaration of element 'urlset'"
-
Adam Lynch about 13 years
"No errors were found", but does that mean that this will solve the errors? We have a lot of sitemaps.
-
paulmorriss about 13 years
It means it's valid, so I guess it's Google's problem. You could report it on the webmaster forums google.com/support/forum/p/Webmasters/… with all the info you've put here (right content type, extension, valid XML).
-
Adam Lynch about 13 years
I've now added the schema attributes & values you've given, so I'm waiting to see if the problem is gone.
-
paulmorriss about 13 years
They aren't necessary according to the minimal sitemap example on sitemaps.org/protocol.php; however, they do mean the XML validator can check the file, and they won't do any harm.
-
Adam Lynch about 13 years
Ok then, I'll post in the forum.
-
Jonx about 13 years
@Adam Lynch - The correct content type is application/xml (Edit: A review of RFC 3023 leaves some ambiguity on this point, but try the application type for troubleshooting) ... Strike that - tested and GWT works with either content-type.
-
John Conde about 13 years
An error message from the server would definitely send out HTML headers and is a very plausible explanation.