Error: XML Content does not seem to be XML | R 3.1.0

Solution 1

Remove the s from https, so that the file is requested over plain http:

library(XML)

fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
# Anchor the pattern to the start so only the scheme is changed
doc <- xmlTreeParse(sub("^https", "http", fileURL), useInternalNodes = TRUE)
class(doc)
## [1] "XMLInternalDocument" "XMLAbstractDocument"
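
A sketch of the same idea as a reusable helper: the pattern is anchored to the start of the string, so only the scheme is rewritten and any other s in the URL is left alone. The name to_http is hypothetical and not part of any package.

```r
# Hypothetical helper: downgrade an https URL to http by rewriting only the scheme.
to_http <- function(url) {
    # "^https://" can match only at the very start of the string
    sub("^https://", "http://", url)
}

to_http("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
## [1] "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
```

A URL that already starts with http:// is returned unchanged. This only helps when the server also answers over plain http.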

Solution 2

You can use RCurl to fetch the content as a string; XML is then able to parse that string:

library(XML)
library(RCurl)
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
xData <- getURL(fileURL)
doc <- xmlParse(xData)

Solution 3

xmlTreeParse does not support https.

You can load the data with getURL (from RCurl) and then parse it.
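
The parse half of this two-step approach can be tried without any network access. The sketch below assumes the fetched content is already held in a character string; the sample XML is made up for illustration and is not the contents of restaurants.xml.

```r
library(XML)

# Made-up sample standing in for the string that getURL() would return
xml_text <- "<rows><row><name>A</name><zipcode>21231</zipcode></row><row><name>B</name><zipcode>21201</zipcode></row></rows>"

# asText = TRUE says the argument is XML content, not a file name or URL
doc <- xmlParse(xml_text, asText = TRUE)

# Pull the text of every <zipcode> node
zips <- xpathSApply(doc, "//zipcode", xmlValue)
zips
## [1] "21231" "21201"
```

Passing asText = TRUE explicitly avoids relying on the parser to guess whether a string is content or a location.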

Solution 4

The answer is in the getURL documentation at http://www.omegahat.net/RCurl/installed/RCurl/html/getURL.html. The key point is to pass ssl.verifypeer = FALSE to getURL if a certificate error is shown; this turns off verification of the server's certificate.

library(RCurl)
library(XML)
curlVersion()$features
curlVersion()$protocols
## These should show ssl and https. I can see these on Windows 8.1 at least.
## It may differ on other OSes.

temp <- getURL("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml", ssl.verifypeer = FALSE)
DFX <- xmlTreeParse(temp, useInternalNodes = TRUE)

If neither ssl nor https is listed in the curlVersion() output, look into how to use RCurl with HTTPS on your system.
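
The manual inspection of curlVersion() can also be done as a quick programmatic check. This is a minimal sketch, assuming that curlVersion() returns a list whose protocols element is a character vector of supported protocol names; the variable name has_https is illustrative.

```r
library(RCurl)

# TRUE if the libcurl that RCurl was built against lists https as a protocol
# (assumes the protocols element is a character vector of protocol names)
has_https <- "https" %in% curlVersion()$protocols
has_https
```

If this prints FALSE, getURL is unlikely to fetch an https URL regardless of the ssl.verifypeer setting.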

Solution 5

Using download.file avoids introducing another dependency. The following function returns the output of XML::xmlParse even when the URL starts with https. It caches the file in the session's temporary directory, so the file is downloaded only once no matter how many times the function is called during an R session.

xml_parse <- function(xml_url){
    # Temporary copy of the xml file, valid for this R session
    xml_temp_file <- file.path(tempdir(), basename(xml_url))
    if (!file.exists(xml_temp_file)){
        message(sprintf("Downloading to %s.", xml_temp_file))
        download.file(xml_url, xml_temp_file)
    }
    return(XML::xmlParse(xml_temp_file))
}

# Example
xml_content <- xml_parse("https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml")
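
Because the cache key is basename(xml_url), two URLs that end in the same file name share one cached copy, and a cached copy is reused until the session ends. A small sketch of how the cache path is derived and how a fresh download could be forced; xml_cache_path and xml_cache_clear are hypothetical helper names, not part of the function above.

```r
# Hypothetical helpers mirroring the cache-path logic of xml_parse()
xml_cache_path <- function(xml_url) {
    file.path(tempdir(), basename(xml_url))
}

xml_cache_clear <- function(xml_url) {
    # unlink() silently does nothing if the file is not there
    unlink(xml_cache_path(xml_url))
    invisible(!file.exists(xml_cache_path(xml_url)))
}

url <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
basename(xml_cache_path(url))
## [1] "getdata%2Fdata%2Frestaurants.xml"
```

Calling xml_cache_clear(url) before xml_parse(url) would make the next call download the file again.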
Author by

Admin

Updated on August 05, 2022

Comments

  • Admin
    Admin over 1 year

    I am trying to get this XML file, but am unable to. I checked the other solutions on the same topic, but I couldn't understand them. I am an R newbie.

    > library(XML)
    > fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
    > doc <- xmlTreeParse(fileURL,useInternal=TRUE)
    

    Error: XML content does not seem to be XML: 'https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml'

    Can you please help?

  • Admin
    Admin almost 10 years
    Thanks @jdharrison for the reply. I got the following error message when I typed the fourth line, xData <- getURL(fileURL): Error in function (type, msg, asError = TRUE): SSL certificate problem, verify that the CA cert is OK. Details: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed. What does it mean?
  • jdharrison
    jdharrison almost 10 years
    @ArpanGanguli Use xData <- getURL(fileURL, ssl.verifypeer = FALSE). The error is explained in depth at omegahat.org/RCurl/FAQ.html
  • Sean
    Sean almost 8 years
    should that be omegahat.net ??
  • jdharrison
    jdharrison almost 8 years
    @Sean yes it is now .net omegahat.net/RCurl/FAQ.html
  • Sean
    Sean almost 8 years
    I think that should now be omegahat.net
  • Atul Kumar
    Atul Kumar almost 8 years
    Updated link base URL from omegahat.org to omegahat.net
  • agent18
    agent18 over 5 years
    I am getting an error: Unknown IO errorfailed to load external entity
  • Guy Manova
    Guy Manova over 3 years
    good one! why does that happen though? (the https ruining the xml read?)
  • mccurcio
    mccurcio over 3 years
    In other words, replace "HTTPS" with "HTTP"
  • Paul Rougieux
    Paul Rougieux over 2 years
    The issue is that some sources only provide https URLs.
  • captcoma
    captcoma about 2 years
    I still get the error mentioned above but I can repeat this until it works for every file.