Is there a way in Ruby 1.9 to remove invalid byte sequences from strings?
Solution 1
"€foo\xA0".chars.select(&:valid_encoding?).join
Solution 2
"€foo\xA0".encode('UTF-16le', invalid: :replace, replace: '').encode('UTF-8')
Solution 3
Ruby 2.0 and 1.9.3
"€foo\xA0".encode(Encoding::UTF_8, Encoding::UTF_8, :invalid => :replace)
Ruby 2.1+
"€foo\xA0".scrub
Author by
StefanH
Updated on April 27, 2020
Comments
-
StefanH about 4 years
Suppose you have a string like "€foo\xA0", encoded as UTF-8. Is there a way to remove invalid byte sequences from this string (so you get "€foo")?
In Ruby 1.8 you could use
Iconv.iconv('UTF-8//IGNORE', 'UTF-8', "€foo\xA0")
but that is now deprecated.
"€foo\xA0".encode('UTF-8')
doesn't do anything, since the string is already UTF-8. I tried:
"€foo\xA0".force_encoding('BINARY').encode('UTF-8', :undef => :replace, :replace => '')
which yields
"foo"
but that also loses the valid multibyte character €.
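The reason € disappears in the BINARY attempt can be sketched as follows: once the string is tagged as binary, every byte above 0x7F has no defined mapping to UTF-8, and that includes the three bytes that make up €. The variable names are illustrative, and `String#b` is used here to get a binary-tagged copy instead of `force_encoding`.

```ruby
s = "€foo\xA0"

# The euro sign is three bytes in UTF-8, all of them above 0x7F.
euro_bytes = "€".bytes.map { |b| format('%02X', b) }  # => ["E2", "82", "AC"]

# Tagged as binary, those bytes are indistinguishable from the stray \xA0,
# so undef: :replace drops all four of them.
binary = s.b
result = binary.encode('UTF-8', undef: :replace, replace: '')  # => "foo"
```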
-
Van der Hoorn about 12 years
I was under the impression that UTF-16 has a larger character set than UTF-8, meaning you don't lose any valid data. Unfortunately the following doesn't work:
"€foo\xA0".encode('UTF-8', :invalid => :replace, :replace => '')
because the string is already UTF-8, so it will not be encoded again.
-
jwadsack over 11 years
FWIW, running a test on a large file, I found this method to be an order of magnitude faster than the valid_encoding? approach.
-
Zr40 over 11 years
UTF-8 and UTF-16 can both represent all Unicode characters. The only difference is the way the characters are encoded.
-
tadman over 11 years
UTF-32 is also an option, but UTF-16 seems to work well enough. The new emoji characters might need the extra space.
-
Zr40 almost 11 years
All UTF encodings are equally capable of encoding all possible Unicode characters; there's no difference in that regard between UTF-8, UTF-16 and UTF-32. The only practical difference is the output size.
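A quick way to check this is to encode the same text in each form and compare byte sizes. The sketch below uses € plus one emoji (U+1F600) as sample text; the variable names are illustrative.

```ruby
text = "€\u{1F600}"   # euro sign followed by an emoji outside the BMP

encodings = %w[UTF-8 UTF-16LE UTF-32LE]

# Byte size of the same two characters in each encoding.
sizes = encodings.map { |name| [name, text.encode(name).bytesize] }.to_h
# => {"UTF-8"=>7, "UTF-16LE"=>6, "UTF-32LE"=>8}

# Each encoding round-trips back to the identical UTF-8 string.
lossless = encodings.all? { |name| text.encode(name).encode('UTF-8') == text }
# => true
```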
-
Dorian over 9 years
Throws an error with this string: "eEspa\xF1a;FB"
-
Dorian over 9 years
It doesn't remove the \xF1 in this string: "eEspa\xF1a;FB"
-
Severin over 9 years
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post.
-
John Dvorak over 9 years
@Severin how come not? It looks like an (incorrect) answer to the question. It removes all invalid byte sequences from a string. It just removes all valid ones as well.
-
Van der Hoorn about 9 years
@Dorian: what Ruby version?
-
Dorian about 9 years
@VanderHoorn: it was Ruby < 2.1, because it works with Ruby 2.1+
-
Van der Hoorn about 9 years
@Dorian: I see. Could it be a Ruby 2.0.x issue? Because I think I used Ruby 1.9.3 when I answered the original question.
-
acobster about 9 years
@Dorian, on a 1.9.3 IRB console,
"eEspa\xF1a;FB".chars.select{|i| i.valid_encoding?}.join
returns "eEspaa;FB". Do you not get that behavior, or have I misunderstood?