Convert non-breaking spaces to spaces in Ruby


Solution 1

For old versions of Ruby (1.8.x), the fixes are the ones described in the question.

This is fixed in newer versions of Ruby (1.9+), where /\u00a0/ matches a non-breaking space.

Solution 2

Use /\u00a0/ to match non-breaking spaces. For instance s.gsub(/\u00a0/, ' ') converts all non-breaking spaces to regular spaces.

Use /[[:space:]]/ to match all whitespace, including Unicode whitespace like non-breaking spaces. This is unlike /\s/, which matches only ASCII whitespace.
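
As a minimal sketch of the two patterns above (the helper names nbsp_to_space and squeeze_whitespace are made up for illustration and are not part of Ruby), assuming Ruby 1.9 or later and UTF-8 strings:

```ruby
# Hypothetical helpers, named here only for illustration.

# Replace each non-breaking space (U+00A0) with a regular space.
def nbsp_to_space(str)
  str.gsub(/\u00a0/, ' ')
end

# Collapse every run of whitespace, Unicode whitespace included,
# into a single regular space.
def squeeze_whitespace(str)
  str.gsub(/[[:space:]]+/, ' ')
end

sample = "foo\u00a0bar\u00a0\tbaz"

p nbsp_to_space(sample)       # "foo bar \tbaz"
p squeeze_whitespace(sample)  # "foo bar baz"

# \s only covers ASCII whitespace, so it does not see the NBSP.
p("\u00a0" =~ /\s/)           # nil
p("\u00a0" =~ /[[:space:]]/)  # 0
```

The first helper touches only U+00A0, which is usually what you want when cleaning user input. The second is broader and also rewrites tabs and newlines.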

See also: Ruby Regexp documentation

Solution 3

If you cannot use \s for Unicode whitespace, that’s a bug in the Ruby regex implementation, because according to Annex C, “Compatibility Properties”, of UTS#18 “Unicode Regular Expressions”, \s is absolutely required to match any Unicode whitespace code point.

There is no wiggle-room allowed since the two columns detailing the Standard Recommendation and the POSIX Compatibility are the same for the \s case. You cannot document your way around this: you are out of compliance with The Unicode Standard, in particular, with UTS#18’s RL1.2a, if you do not do this.

If you do not meet RL1.2a, you do not meet the Level 1 requirements, which are the most basic and elementary functionality needed to use regular expressions on Unicode. Without that, you are pretty much lost. This is why standards exist. My recollection is that Ruby also fails to meet several other Level 1 requirements. You may therefore wish to use a programming language that meets at least Level 1 if you actually need to handle Unicode with regular expressions.

Note that you cannot use a Unicode General Category property like \p{Zs} to stand for \p{Whitespace}. That’s because the Whitespace property is a derived property, not a general category. There are also control characters included in it, not just separators.
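
To illustrate the difference, here is a small sketch that compares the Zs general category with Ruby's [[:space:]] bracket expression on three characters. The method names zs? and posix_space? are invented for this example, and [[:space:]] is used only as the closest readily available stand-in for the Whitespace property, not as an exact equivalent.

```ruby
# Hypothetical predicates, named here only for illustration.

# True when the single character is in general category Zs (space separator).
def zs?(ch)
  !!(ch =~ /\A\p{Zs}\z/)
end

# True when the single character matches the POSIX bracket expression.
def posix_space?(ch)
  !!(ch =~ /\A[[:space:]]\z/)
end

{ "NBSP" => "\u00a0", "tab" => "\t", "line feed" => "\n" }.each do |name, ch|
  puts "#{name}: Zs=#{zs?(ch)} [[:space:]]=#{posix_space?(ch)}"
end
# NBSP: Zs=true [[:space:]]=true
# tab: Zs=false [[:space:]]=true
# line feed: Zs=false [[:space:]]=true
```

Tab and line feed are control characters, so a pattern built on \p{Zs} alone would miss them.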

Solution 4

Actual functioning IRB code examples that answer the question, with latest Rubies (May 2012)

Ruby 1.9

require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text
s.each_codepoint {|c| print c, ' ' }                 # prints: 32 160 32
s.strip.each_codepoint {|c| print c, ' ' }           # prints: 160 (strip leaves the NBSP)
s.gsub(/\s+/,'').each_codepoint {|c| print c, ' ' }  # prints: 160 (\s does not match the NBSP)
s.gsub(/\u00A0/,'').strip.empty?                     # => true
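
Since strip leaves the non-breaking space in place, one possible workaround is a strip built on [[:space:]]. This is only a sketch: unicode_strip is a hypothetical helper, not a Ruby or Nokogiri method, and the string below stands in for the inner_text value from the session above.

```ruby
# Hypothetical helper: remove leading and trailing whitespace,
# including Unicode whitespace such as U+00A0.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, '')
end

s = " \u00a0 "  # space, NBSP, space

p s.strip.codepoints       # [160]
p unicode_strip(s).empty?  # true

# Non-breaking spaces inside the text are left alone.
p unicode_strip(" \u00a0hi\u00a0there\u00a0 ").codepoints
# [104, 105, 160, 116, 104, 101, 114, 101]
```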

Ruby 1.8

require 'rubygems'
require 'nokogiri'
RUBY_DESCRIPTION # => "ruby 1.8.7 (2012-02-08 patchlevel 358) [x86_64-linux]"
doc = '<html><body> &nbsp; </body></html>'
page = Nokogiri::HTML(doc)
s = page.inner_text # " \302\240 "
s.gsub(/\s+/,'') # "\302\240"
s.gsub(/\302\240/,'').strip.empty? # => true

Solution 5

For whatever reason \s doesn't match \u00a0.

I think the "whatever reason" is that it is not supposed to. Only the POSIX bracket expressions and the \p{} property constructs are Unicode-aware. The character-class abbreviations are not:

Sequence   As [...]         Meaning
\d         [0-9]            ASCII decimal digit character
\D         [^0-9]           Any character except a digit
\h         [0-9a-fA-F]      Hexadecimal digit character
\H         [^0-9a-fA-F]     Any character except a hex digit
\s         [ \t\r\n\f]      ASCII whitespace character
\S         [^ \t\r\n\f]     Any character except whitespace
\w         [A-Za-z0-9_]     ASCII word character
\W         [^A-Za-z0-9_]    Any character except a word character
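
The table can be checked with a short sketch like the following, assuming Ruby 1.9 or later. The method name shorthand_vs_unicode is made up for this example. It reports, for one character, which of the ASCII-only abbreviations and which of the Unicode-aware classes match.

```ruby
# Hypothetical helper: report which character classes match a single character.
def shorthand_vs_unicode(ch)
  {
    '\w'          => !!(ch =~ /\w/),
    '[[:alpha:]]' => !!(ch =~ /[[:alpha:]]/),
    '\p{L}'       => !!(ch =~ /\p{L}/),
    '\d'          => !!(ch =~ /\d/),
    '[[:digit:]]' => !!(ch =~ /[[:digit:]]/),
    '\p{Nd}'      => !!(ch =~ /\p{Nd}/)
  }
end

e_acute = "\u00e9"        # LATIN SMALL LETTER E WITH ACUTE
persian_two = "\u06f2"    # EXTENDED ARABIC-INDIC DIGIT TWO

letter = shorthand_vs_unicode(e_acute)
p letter['\w']            # false
p letter['[[:alpha:]]']   # true
p letter['\p{L}']         # true

digit = shorthand_vs_unicode(persian_two)
p digit['\d']             # false
p digit['[[:digit:]]']    # true
p digit['\p{Nd}']         # true
```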
Author by

coolaj86

Updated on June 24, 2022

Comments

  • coolaj86
    coolaj86 almost 2 years

    I have cases where user-entered data from an html textarea or input is sometimes sent with \u00a0 (non-breaking spaces) instead of spaces when encoded as utf-8 json.

    I believe that to be a bug in Firefox, as I know that the user isn't intentionally putting in non-breaking spaces instead of spaces.

    There are also two bugs in Ruby, one of which can be used to combat the other.

    For whatever reason \s doesn't match \u00a0.

    However, [^[:print:]] (which definitely should not match) and \xC2\xA0 both will match, but I consider those to be less-than-ideal ways to deal with the issue.

    Are there other recommendations for getting around this issue?

    • steenslag
      steenslag almost 14 years
      Which Ruby version? In 1.9.2 /\u00a0/ does match.
    • coolaj86
      coolaj86 almost 14 years
      \s doesn't match \u00a0. \u00a0 matches in 1.9, but I'm not sure about 1.8.
    • PJP
      PJP about 13 years
      Rule #1: When you think you have found a bug in an extremely popular program, especially in something that is tested and used extensively, such as Firefox's textarea handling, very quietly and carefully go over your testing. 99 times out of 100 the problem will be on your side of the fence. When I see non-breaking spaces show up in a text field, where it's likely that people would paste text in, I suspect Microsoft Word, or an editor that is set to substitute &NBSP; for spaces. You can easily test your theory by creating a page, put a text area in it and try to duplicate the problem.
  • tchrist
    tchrist about 13 years
    Oh, it’s supposed to, alright. It just doesn’t. See my answer.
  • PJP
    PJP about 13 years
    There's a difference between it being in a spec, and it being in the code. Whether or not it's supposed to because of the spec is a moot point right now, because it isn't there, and no matter how much we want it to be there it won't until someone in the core-team decides to add it. So, the reality is, it isn't supposed to work because it isn't coded to. Maybe in future revs that will change. I'd like to see it meet the specs, but they don't ask me.
  • owenmarshall
    owenmarshall about 12 years
    That's a really odd take on things. tchrist is absolutely correct, and saying that something "isn't supposed to work" because it currently doesn't work is the best vacuous truth I've read in a while. Either way - gsub on [[:space:]] until someone makes Ruby actually comply with standards.
  • nasmorn
    nasmorn over 11 years
    Can you get more specific? I just had the same problem on 1.9.3p194, which is fairly 1.9ish. \s doesn't match the Unicode non-breaking space, but \u00a0 does.
  • Andrei Botalov
    Andrei Botalov over 11 years
    Look at unicode.org/versions/Unicode6.2.0/ch06.pdf - Space characters. But it does look incomplete.
  • Jo Liss
    Jo Liss about 11 years
    Fixed my answer to simply use [[:space]] (note to self: not [:space]).
  • P.M
    P.M about 11 years
    "s.gsub(/\u00a0/, ' ') " is what I have been looking for.
  • Kelvin
    Kelvin over 10 years
    @JoLiss Your answer is correct, but your "note to self" is missing the trailing colon. I made this same mistake myself several times.