How can I strip invalid XML characters from strings in Perl?
Solution 1
The complete regex for removing characters that are invalid in XML 1.0 is:
# allowed: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
For XML 1.1 it is:
# allowed: [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
$str =~ s/[^\x01-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go;
# restricted: [#x1-#x8] | [#xB-#xC] | [#xE-#x1F] | [#x7F-#x84] | [#x86-#x9F]
$str =~ s/[\x01-\x08\x0B-\x0C\x0E-\x1F\x7F-\x84\x86-\x9F]//go;
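The XML 1.0 substitution above could be wrapped in a small reusable function. This is only a sketch: the name strip_invalid_xml10 is an assumption made for illustration, and it presumes the string has already been decoded into Perl characters rather than raw bytes.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns a copy of
# the input with every character outside the XML 1.0 Char production removed.
# Assumes $str holds decoded characters, not raw bytes.
sub strip_invalid_xml10 {
    my ($str) = @_;
    $str =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;
    return $str;
}

# Backspace (\x08) and form feed (\x0C) are dropped; tab and newline survive.
my $clean = strip_invalid_xml10("a\x08b\x0C\tc\n");
print $clean eq "ab\tc\n" ? "ok\n" : "not ok\n";
```

Returning a cleaned copy instead of modifying the argument in place keeps the original text available, for example for logging what was removed.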
Solution 2
As almost everyone else has said, use a regular expression. It's honestly not complex enough to be worth adding to a library. Preprocess your text with a substitution.
Your comment about linefeeds above suggests that the formatting is of some importance to you so you will possibly have to decide exactly what you want to replace some characters with.
The list of invalid characters is clearly defined in the XML spec (here - http://www.w3.org/TR/REC-xml/#charsets - for example). The disallowed characters are the ASCII control characters bar carriage return, linefeed and tab. So, you are looking at a 29 character regular expression character class. That's not too bad surely.
Something like:
$text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
should do it.
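If formatting matters, the characters can be replaced rather than deleted. The following is a sketch of one possible policy, which is an assumption and not something the answer prescribes: vertical tab and form feed become newlines, and every other disallowed control character becomes a space. The function name replace_control_chars is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical policy, chosen only for illustration:
#   vertical tab (\x0B) and form feed (\x0C) -> newline
#   all other disallowed control characters  -> single space
sub replace_control_chars {
    my ($text) = @_;
    $text =~ s/[\x0B\x0C]/\n/g;
    $text =~ s/[\x00-\x08\x0E-\x1F]/ /g;
    return $text;
}

my $fixed = replace_control_chars("page one\x0Cpage\x08two");
print $fixed eq "page one\npage two" ? "ok\n" : "not ok\n";
```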
Solution 3
I've found a solution, but it uses the iconv command instead of Perl.
$ iconv -c -f UTF-8 -t UTF-8 invalid.utf8 > valid.utf8
The regular-expression solutions given above do not work when the input is not valid UTF-8. Consider the following example:
$ perl -e 'print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<root>\x{A0}\x{A0}</root>"' > invalid.xml
$ perl -e 'use XML::Simple; XMLin("invalid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F
$ perl -ne 's/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//go; print' invalid.xml > valid.xml
$ perl -e 'use XML::Simple; XMLin("valid.xml")'
invalid.xml:2: parser error : Input is not proper UTF-8, indicate encoding !
Bytes: 0xA0 0xA0 0x3C 0x2F
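In fact, the two files invalid.xml and valid.xml are identical.
The range "\x20-\x{D7FF}" matches valid representations of those Unicode characters, but not, for example, the invalid byte sequence "\x{A0}\x{A0}". The one-liner reads raw bytes, so each 0xA0 byte is treated as the character U+00A0, which falls inside the allowed range and is left untouched.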
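For a pure-Perl route, one option is to decode the bytes first and only then apply a character-level filter. The sketch below uses the core Encode module; the function name clean_utf8_for_xml is hypothetical, and dropping every U+FFFD is a simplifying assumption that also removes replacement characters that were genuinely present in the input.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: takes raw bytes that are supposed to be UTF-8 and
# returns UTF-8 bytes containing only characters allowed by XML 1.0.
sub clean_utf8_for_xml {
    my ($bytes) = @_;

    # With FB_DEFAULT, malformed byte sequences decode to U+FFFD.
    my $chars = decode('UTF-8', $bytes, Encode::FB_DEFAULT());

    # Drop the replacement characters (assumption: the input contained
    # no U+FFFD that should be kept).
    $chars =~ tr/\x{FFFD}//d;

    # The string now holds characters, so the XML 1.0 filter applies.
    $chars =~ s/[^\x09\x0A\x0D\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g;

    return encode('UTF-8', $chars);
}

my $out = clean_utf8_for_xml("<root>\xA0\xA0\x08</root>");
print $out eq "<root></root>" ? "ok\n" : "not ok\n";
```

Decoding first separates the two problems: malformed UTF-8 is handled by Encode, and characters that are valid Unicode but illegal in XML are handled by the regex.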
Solution 4
Translate is a lot faster than regex substitution, especially if all you want to do is delete characters. Using newt's set:
$string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
A test like this:
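use Benchmark qw(cmpthese);

# $text holds the sample string being cleaned (its definition is not shown).
cmpthese 1_000_000, {
    translate => sub {
        my $copy = $text;
        $copy =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    },
    substitute => sub {
        my $copy = $text;
        $copy =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
    },
};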
yielded:
                Rate substitute  translate
substitute  287770/s         --       -86%
translate  2040816/s       609%         --
And the more characters I needed to delete the faster tr got in relation.
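A side benefit of tr is that it returns the number of characters it matched, which makes it easy to report whether anything was stripped. A small sketch, with a made-up sample string:

```perl
use strict;
use warnings;

# Sample input invented for illustration: contains a backspace and a form feed.
my $string_to_clean = "first\x08 line\x0C\n";

# tr/// returns how many characters it matched; with /d those are the
# characters deleted from the string.
my $removed = ($string_to_clean =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d);

print "removed $removed control character(s)\n" if $removed;
```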
Solution 5
Okay, this seems to be already answered, but what the hey. If you want to author XML documents, you must use an XML library.
#!/usr/bin/perl
use strict;
use XML::LibXML;
my $doc = XML::LibXML::Document->createDocument('1.0');
$doc->setURI('http://example.com/myuri');
$doc->setDocumentElement($doc->createElement('root-node'));
$doc->documentElement->appendTextChild('text-node',<<EOT);
This node contains &, ñ, á, <, >...
EOT
print $doc->toString;
This produces the following:
$ perl test.pl
<?xml version="1.0"?>
<root-node><text-node> This node contains &amp;, ñ, á, &lt;, &gt;...
</text-node></root-node>
Edit: I now see that you are already using XML::LibXML. This should do the trick.
Comments
-
AndrewR almost 2 years
I'm looking for the standard, approved, and robust way of stripping invalid characters from strings before writing them to an XML file. I'm talking here about blocks of text containing backspace (^H) and formfeed characters etc.
There has to be a standard library/module function for doing this but I can't find it.
I'm using XML::LibXML to build a DOM tree that I then serialize to disk.
-
AnthonyWJones almost 15 years
@David: do these libraries simply strip the control characters from the incoming string?
-
AndrewR almost 15 years
...which also strips linefeeds - so not very useful :)
-
aks almost 15 years
Ouch, didn't think about the linefeeds. newt's answer seems ok then for what you're trying to do.
-
AndrewR almost 15 years
Yep. This is pretty much what I ended up doing.
-
Nic Gibson almost 15 years
I must admit that I only posted after I'd searched CPAN because I was convinced that RE must be in Regexp::Common somewhere!
-
Nic Gibson almost 15 years
Absolutely true - I generally don't use tr// because it's so limited but this is certainly an appropriate use.
-
Nic Gibson almost 15 years
As far as I'm aware, XML::LibXML doesn't do anything to text node content apart from reject it if it contains invalid characters. I'd be surprised if the other libraries did anything either.
-
ysth almost 15 years
Yes, it's a lot faster, but 287770/s is plenty fast.
-
crazy_in_love almost 15 years
newt, that's the point of using an XML library in the first place.
-
ysth almost 15 years
Thanks for the example; I was a little shocked at the comment that claimed XML::LibXML didn't handle this for you.
-
Nic Gibson almost 15 years
Of course it is, but he was asking about how to ensure that he didn't get this problem by ensuring that the text content didn't contain invalid characters.
-
Nic Gibson almost 15 years
Of course it does. But the original question was about removing the characters that will cause XML::LibXML to reject the content (characters below ASCII space bar the whitespace chars). This is not quite the same thing.
-
ysth almost 15 years
"use strict" is nice, but warnings are even more important. Don't forget -w or "use warnings"!
-
ysth almost 15 years
@newt: I'm not completely sure what you mean by "this problem". I see XML::LibXML stripping out the "illegal" characters, except for nul, which it treats as the end of the data :(
-
derby over 14 years
hmm ... just came across this ... XML::LibXML does not handle this if you use $node->appendText( $str ) ... but does if you use $parent->appendTextChild( 'node', $str ) ... weirdness
-
Juan A. Navarro over 13 years
This solution based on regular expressions doesn't work. See my answer below.
-
mikebabcock about 8 years
Unfortunately that does not work to remove invalid control characters between tags such as <D>[015][015]</D> where [015] is an invalid character that has gotten into the string.
-
Brian Tingle over 7 years
The issue is that there are code points that are valid UTF-8 that are illegal in XML.