Replacing UTF-8 characters

13,914

Solution 1

First off: I urge you to stop using jsPDF if it doesn't support Unicode. It's mid 2014, and the lack of support should have meant the death of the project two years ago. But that's just my personal conviction and not part of the answer you're looking for.

If jsPDF only supports ANSI (a 256-character block, rather than ASCII's 128-character block), then you can simply do a regex replace for everything above \xFF:

"lolテスト".replace(/[\u0100-\uFFFF]/g,'');
// gives us "lol"

If you only want to get rid of quotation marks (but leave in Unicode that could still break jsPDF), you can use a pattern for "just quotation marks" based on where they live in the Unicode map:

string.replace(/[\u2018-\u201F\u275B-\u275E]/g, '')

will catch ['‘','’','‚','‛','“','”','„','‟','❛','❜','❝','❞'], although of course what you probably want to do is replace them with the corresponding safe character instead. Good news: just make a replacement array for the list just presented, and work with that.
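A minimal sketch of that replacement approach follows. The mapping `QUOTE_MAP` and the function name `toAsciiQuotes` are illustrative assumptions, not part of jsPDF, and which ASCII character each quote maps to is a judgment call:

```javascript
// Hypothetical mapping from typographic quotes to plain ASCII quotes.
// Covers exactly the characters matched by the pattern above.
const QUOTE_MAP = {
  '\u2018': "'", '\u2019': "'", '\u201A': "'", '\u201B': "'",
  '\u201C': '"', '\u201D': '"', '\u201E': '"', '\u201F': '"',
  '\u275B': "'", '\u275C': "'", '\u275D': '"', '\u275E': '"',
};

// Replace each matched quote with its ASCII counterpart from the map.
function toAsciiQuotes(text) {
  return text.replace(/[\u2018-\u201F\u275B-\u275E]/g, (ch) => QUOTE_MAP[ch]);
}

console.log(toAsciiQuotes('\u201CHello\u201D, it\u2019s fine'));
```

Any other non-ANSI characters are left untouched, so this would still need to be combined with the stripping pattern if they have to go as well.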

2017 edit:

ES6 introduced a new escape for Unicode code points in the form of the \u{...} pattern, which can take "any number of hexdigits" inside the curly braces. In a regexp it needs the u flag, so a pattern covering everything above \xFF, up to the last code point U+10FFFF, would now be:

// with the u flag, \u{...} escapes work directly inside a regexp literal
const searchPattern = /[\u{100}-\u{10FFFF}]/gu;
const c = `lolテスト`.replace(searchPattern, ``);
// gives us "lol"
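The practical difference between matching UTF-16 code units and matching whole code points shows up when characters are substituted rather than deleted. The sketch below assumes an ES2015+ engine; the variable names are made up for illustration:

```javascript
// "a" + テ (U+30C6) + 😀 (U+1F600, stored as two UTF-16 code units) + "b"
const sample = 'a\u30C6\u{1F600}b';

// Without the `u` flag the regex sees UTF-16 code units,
// so the emoji is replaced twice (once per surrogate half).
const perUnit = sample.replace(/[\u0100-\uFFFF]/g, '?');

// With the `u` flag the regex sees whole code points,
// so the emoji is replaced once.
const perCodePoint = sample.replace(/[\u{100}-\u{10FFFF}]/gu, '?');

console.log(perUnit);      // "a???b"
console.log(perCodePoint); // "a??b"
```

When the replacement is an empty string both patterns produce the same result, because both surrogate halves fall inside \u0100-\uFFFF.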

Solution 2

Use

htmlstring.replace(/[^\x00-\x7F]/g,'')

to remove all non-ASCII characters.

(via regex-any-ascii-character)

Author by

Admin

Updated on June 11, 2022

Comments

  • Admin
    Admin almost 2 years

    I am working on an open jQuery library, jsPDF. This library does not support UTF-8 characters. Is there any way I can remove all the UTF-8 quote characters in my HTML string, using a regex or any other method?

    PSEUDO CODE:
    
    $(htmlstring).replace("utf-8 quotes character" , "") 
    
  • Álvaro González
    Álvaro González almost 10 years
    A little clarification. Once we have a JavaScript string we no longer have UTF-8 (or ISO-8859-1 or whatever encoding the file is saved as): JavaScript makes a transparent conversion to its internal encoding (UCS-2 or UTF-16, the engine can choose). Good news is that we don't need to think about encodings any more, we can refer to characters by their \u escape sequence, which is basically its universal Unicode code point. Bad news is that JavaScript will split characters beyond 0xFFFF due to incomplete Unicode support.
  • Admin
    Admin almost 10 years
    Thanks for your advice, but I needed this just for a small job. I just want to know how to remove UTF quote characters only (, ’ and others)
  • Mike 'Pomax' Kamermans
    Mike 'Pomax' Kamermans almost 10 years
    @SomPathak you can, but any remaining non-ANSI Unicode is still going to break jsPDF. Simply find out the specific Unicode number for your quote symbols, and use a straight-up pattern like /[\u2018-\u201F\u275B-\u275E]/g
  • Álvaro González
    Álvaro González almost 10 years
    ... or do a simple .replace(/[«»]/g, ''). The problem with Unicode characters belongs to the library, not JavaScript itself.
  • Admin
    Admin almost 10 years
    @ÁlvaroG.Vicario what would be the regex to replace all UTF-8 code?
  • Mike 'Pomax' Kamermans
    Mike 'Pomax' Kamermans almost 10 years
    fun fact, « and » are single byte in ANSI, and not a problem in this case (\xAB and \xBB)
  • Álvaro González
    Álvaro González almost 10 years
    @Mike'Pomax'Kamermans Of course, it was just an example, we don't know exactly what "utf-8 quotes character" stands for.
  • Álvaro González
    Álvaro González almost 10 years
    @SomPathak Replace all UTF-8 code? htmlstring = "";, because UTF-8 includes all characters that exist. Do you have a clear idea so far of what you want to remove?
  • Mike 'Pomax' Kamermans
    Mike 'Pomax' Kamermans almost 10 years
    yes, yes, hilarious, but let's not go literal because someone uses "unicode" wrong =) His original post is pretty clear in that he wants higher unicode quotes replaced.