Convert non-breaking spaces to spaces in Ruby
Solution 1
For old versions of Ruby (1.8.x), the fixes are the ones described in the question.
This is fixed in newer versions of Ruby (1.9+).
Solution 2
Use /\u00a0/ to match non-breaking spaces. For instance, s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.
Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
See also: Ruby Regexp documentation
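A minimal sketch of the three patterns side by side, assuming Ruby 1.9 or later and UTF-8 strings. The sample string and variable names are made up for illustration.

```ruby
# Hypothetical input: two NO-BREAK SPACE characters (U+00A0) and one ordinary space.
s = "foo\u00a0bar\u00a0 baz"

only_nbsp  = s.gsub(/\u00a0/, ' ')       # replaces only U+00A0
all_space  = s.gsub(/[[:space:]]/, ' ')  # replaces any whitespace, Unicode included
ascii_only = s.gsub(/\s/, '_')           # \s sees only the ASCII space

puts only_nbsp  == "foo bar  baz"
puts all_space  == "foo bar  baz"
puts ascii_only.include?("\u00a0")       # the non-breaking spaces survive \s
```

If the goal is to normalize every kind of whitespace, /[[:space:]]/ is the broader choice. /\u00a0/ is narrower and leaves tabs and newlines alone.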
Solution 3
If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to UTS#18 “Unicode Regular Expressions” Annex C on Compatibility Properties, \s is absolutely required to match any Unicode whitespace code point.
There is no wiggle room allowed, since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular with UTS#18’s RL1.2a, if you do not do this.
If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.
Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
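The difference between \p{Zs} and the full whitespace set can be checked directly. The sketch below assumes Ruby 1.9 or later and uses /[[:space:]]/ as the whitespace class. The variable names are illustrative only.

```ruby
tab  = "\t"       # U+0009, general category Cc (control), but still whitespace
nbsp = "\u00a0"   # U+00A0, general category Zs (space separator)

zs_matches_tab     = !!(tab  =~ /\p{Zs}/)
zs_matches_nbsp    = !!(nbsp =~ /\p{Zs}/)
space_matches_tab  = !!(tab  =~ /[[:space:]]/)
space_matches_nbsp = !!(nbsp =~ /[[:space:]]/)

puts "\\p{Zs} matches tab:       #{zs_matches_tab}"
puts "\\p{Zs} matches nbsp:      #{zs_matches_nbsp}"
puts "[[:space:]] matches tab:  #{space_matches_tab}"
puts "[[:space:]] matches nbsp: #{space_matches_nbsp}"
```

The tab is a control character, so \p{Zs} misses it, while /[[:space:]]/ catches both characters.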
Solution 4
Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)
Ruby 1.9
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' } #=> 32 160 32
s.strip.each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' } #=> 160
s.gsub(/\u00A0/,'').strip.empty? #=> true
Ruby 1.8
require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? #=> true
Solution 5
For whatever reason \s doesn't match \u00a0.
I think the "whatever reason" is that it is not supposed to. Only the POSIX and \p construct character classes are Unicode aware. The character-class abbreviations are not:
Sequence  As [...]          Meaning
\d        [0-9]             ASCII decimal digit character
\D        [^0-9]            Any character except a digit
\h        [0-9a-fA-F]       Hexadecimal digit character
\H        [^0-9a-fA-F]      Any character except a hex digit
\s        [ \t\r\n\f]       ASCII whitespace character
\S        [^ \t\r\n\f]      Any character except whitespace
\w        [A-Za-z0-9_]      ASCII word character
\W        [^A-Za-z0-9_]     Any character except a word character
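As a rough check of the table above, the sketch below compares each abbreviation with a Unicode-aware POSIX bracket class on one non-ASCII character. It assumes Ruby 1.9 or later and UTF-8 strings. The sample characters and the samples/results names are chosen for illustration only.

```ruby
# Each entry maps a non-ASCII character to [abbreviation, POSIX bracket class].
samples = {
  "\u00a0" => [/\s/, /[[:space:]]/],  # NO-BREAK SPACE
  "\u0663" => [/\d/, /[[:digit:]]/],  # ARABIC-INDIC DIGIT THREE
  "\u00e9" => [/\w/, /[[:alpha:]]/],  # LATIN SMALL LETTER E WITH ACUTE
}

results = samples.map do |char, (abbreviation, posix)|
  [char, !!(char =~ abbreviation), !!(char =~ posix)]
end

results.each do |char, abbr_hit, posix_hit|
  printf("U+%04X  abbreviation: %-5s  POSIX class: %s\n", char.ord, abbr_hit, posix_hit)
end
```

In each row the abbreviation fails to match and the POSIX class matches, which is the ASCII-only behaviour the table describes.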
coolaj86
Updated on June 24, 2022
Comments
-
coolaj86 almost 2 years
I have cases where user-entered data from an HTML textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as UTF-8 JSON.
I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.
There are also two bugs in Ruby, one of which can be used to combat the other.
For whatever reason \s doesn't match \u00a0.
However, [^[:print:]] (which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.
Are there other recommendations for getting around this issue?
-
steenslag almost 14 years
Which Ruby version? In 1.9.2 /\u00a0/ does match.
-
coolaj86 almost 14 years
\s doesn't match \u00a0. \u00a0 matches in 1.9, but I'm not sure about 1.8.
-
PJP about 13 years
Rule #1: When you think you have found a bug in an extremely popular program, especially in something that is tested and used extensively, such as Firefox's textarea handling, very quietly and carefully go over your testing. 99 times out of 100 the problem will be on your side of the fence. When I see non-breaking spaces show up in a text field, where it's likely that people would paste text in, I suspect Microsoft Word, or an editor that is set to substitute &nbsp; for spaces. You can easily test your theory by creating a page, putting a text area in it, and trying to duplicate the problem.
-
tchrist about 13 years
Oh, it’s supposed to, alright. It just doesn’t. See my answer.
-
PJP about 13 years
There's a difference between it being in a spec, and it being in the code. Whether or not it's supposed to because of the spec is a moot point right now, because it isn't there, and no matter how much we want it to be there it won't until someone in the core-team decides to add it. So, the reality is, it isn't supposed to work because it isn't coded to. Maybe in future revs that will change. I'd like to see it meet the specs, but they don't ask me.
-
owenmarshall about 12 years
That's a really odd take on things. tchrist is absolutely correct, and saying that something "isn't supposed to work" because it currently doesn't work is the best vacuous truth I've read in a while. Either way, gsub on [[:space:]] until someone makes Ruby actually comply with standards.
-
nasmorn over 11 years
Can you get more specific? I just had the same problem on 1.9.3p194, which is fairly 1.9ish. \s doesn't match the Unicode non-breaking space but \u00a0 does.
-
Andrei Botalov over 11 years
Look at unicode.org/versions/Unicode6.2.0/ch06.pdf - Space characters. But it does look incomplete.
-
Jo Liss about 11 years
Fixed my answer to simply use [[:space]] (note to self: not [:space]).
-
P.M about 11 years
"s.gsub(/\u00a0/, ' ')" is what I have been looking for.
-
Kelvin over 10 years
@JoLiss Your answer is correct, but your "note to self" is missing the trailing colon. I made this same mistake myself several times.