JavaScript regex whitespace characters


Solution 1

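HTML != JavaScript. JavaScript is completely literal: %20 is just the three characters %, 2 and 0, and &nbsp; is just the string of characters &, n, b, s, p and ;. Neither is treated as whitespace. For character classes, I consider nearly everything that works in Perl regex to be applicable in JS (with some exceptions; named groups, for example, were not available when this was written).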

http://www.regular-expressions.info/javascript.html is the reference I use.
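
To illustrate the point about literal strings, here is a minimal sketch. The sample strings are hypothetical, and the last check assumes a modern engine such as V8, where U+00A0 counts as whitespace:

```javascript
// Hypothetical sample strings, chosen only for illustration.
const urlEncoded = "foo%20bar";
const htmlEntity = "foo&nbsp;bar";

const results = {
  // The encoded forms are ordinary characters, so \s finds nothing.
  urlEncoded: /\s/.test(urlEncoded),
  htmlEntity: /\s/.test(htmlEntity),
  // After URL-decoding, %20 becomes a real space (U+0020).
  decodedUrl: /\s/.test(decodeURIComponent(urlEncoded)),
  // An actual no-break space character (U+00A0), not the HTML entity text.
  realNbsp: /\s/.test("foo\u00A0bar"),
};

console.log(results);
```

If the input comes from a URL or from HTML source, it has to be decoded first (for example with decodeURIComponent) before \s can see any whitespace in it.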

Solution 2

A simple test:

// Print every char code below 1000 whose character is matched by \s.
for (var i = 0; i < 1000; i++) {
    if (String.fromCharCode(i).replace(/\s+/, "") === "") console.log(i);
}

The char codes (Chrome):

9
10
11
12
13
32
160

Solution 3

For Mozilla it is:

 [ \f\n\r\t\v\u00A0\u2028\u2029]

(Ref)

For IE (JScript) it is:

[ \f\n\r\t\v] 

(Ref)
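
If you need the same behaviour across engines, one option is to spell the class out instead of relying on \s. This is only a sketch: the class below mirrors the code points listed for V8 in Solution 5, and the names EXPLICIT_WS_RUN and collapseWhitespace are made up for this example.

```javascript
// Explicit whitespace class (assumption: mirrors the V8 list in Solution 5).
// The trailing + and the g flag make it match every run of whitespace.
const EXPLICIT_WS_RUN =
  /[\t\n\v\f\r \u00A0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000\uFEFF]+/g;

// Hypothetical helper: collapse each whitespace run into one plain space.
function collapseWhitespace(text) {
  return text.replace(EXPLICIT_WS_RUN, " ");
}

// No-break space, em space, plain space and tab all collapse the same way.
console.log(JSON.stringify(collapseWhitespace("a\u00A0\u2003 b\tc")));
```

Because every code point is written out, the result does not depend on which characters a particular engine includes in \s.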

Solution 4

In Firefox, \s matches a single whitespace character, including space, tab, form feed, and line feed. It is equivalent to [ \f\n\r\t\v\u00A0\u2028\u2029].

For example, /\s\w*/ matches ' bar' in "foo bar."
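
The example above can be checked directly. A small sketch (the variable name match is just for illustration):

```javascript
// \s consumes the space, \w* consumes "bar" and stops at the period.
const match = "foo bar.".match(/\s\w*/);

console.log(JSON.stringify(match[0])); // " bar"
console.log(match.index);              // 3
```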

https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions

Solution 5

Here's an expansion of primvdb's answer, covering the entire 16-bit space, including Unicode code point values and a comparison with str.trim(). I tried to edit the answer to improve it, but my edit was rejected, so I had to post this new one.

Identify all characters in the 16-bit range (U+0000 to U+FFFF) that are matched by the regex \s or stripped by String.prototype.trim():

const regexList = [];
const trimList = [];

for (let codePoint = 0; codePoint < 2 ** 16; codePoint += 1) {
  const str = String.fromCodePoint(codePoint);
  const unicode = codePoint.toString(16).padStart(4, '0');

  if (str.replace(/\s/, '') === '') regexList.push([codePoint, unicode]);
  if (str.trim() === '') trimList.push([codePoint, unicode]);
}

const identical = JSON.stringify(regexList) === JSON.stringify(trimList);
const list = regexList.reduce((str, [codePoint, unicode]) => `${str}${unicode} ${codePoint}\n`, '');

console.log({identical});
console.log(list);

The list (in V8):

0009 9
000a 10
000b 11
000c 12
000d 13
0020 32
00a0 160
1680 5760
2000 8192
2001 8193
2002 8194
2003 8195
2004 8196
2005 8197
2006 8198
2007 8199
2008 8200
2009 8201
200a 8202
2028 8232
2029 8233
202f 8239
205f 8287
3000 12288
feff 65279

Author: beatgammit

Updated on June 13, 2022

Comments

  • beatgammit, almost 2 years ago

    I have done some searching, but I couldn't find a definitive list of whitespace characters included in the \s in JavaScript's regex.

    I know that I can rely on space, line feed, carriage return, and tab as being whitespace, but I thought that since JavaScript was traditionally only for the browser, maybe URL encoded whitespace and things like &nbsp; and %20 would be supported as well.

    What exactly is considered whitespace by JavaScript's regex compiler? If there are differences between browsers, I only really care about WebKit browsers, but it would be nice to know of any differences. Also, what about Node.js?

  • beatgammit, about 13 years ago
    Right, but JavaScript often runs in the browser, so I assumed that there might be some special cases.
  • beatgammit, about 13 years ago
    I've marked yours as correct because you cleared up the root question. Thanks!
  • Alex K., about 13 years ago
    It looks like Firefox does; for IE, use the class [\s\u00A0\u2028\u2029] instead of \s.