JavaScript regular expression for Unicode emoji

Solution 1

In ECMAScript 6 you can detect emoji fairly simply by using the u flag together with \u{...} code point escapes. I have compiled a simple regex made up of several Unicode blocks and a handful of individual code points.

Regex:

/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug

Playground: play around with emoji and regex

This doesn't directly answer the question, but it gives some insight into how to handle emoji using Unicode blocks and ES6.
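As a rough sketch of how that character class might be used, the snippet below wraps every match in a tag and also shows one limitation. The helper name wrapEmoji and the span markup are my own assumptions, not part of the answer. Because the class matches one code point at a time, an emoji followed by a skin-tone modifier comes back as two separate matches.

```javascript
// Character class copied from the regex above.
const emojiBlocks = /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug;

// Hypothetical helper: wrap each matched code point in a <span>.
function wrapEmoji(text) {
  return text.replace(emojiBlocks, (ch) => `<span class="emoji">${ch}</span>`);
}

// U+1F60A is the smiling face from the question.
console.log(wrapEmoji("Hi \u{1F60A}!"));

// Raised fist (U+270A) plus a dark skin tone (U+1F3FF) is one visible emoji,
// but the class reports it as two separate matches.
console.log("\u270A\u{1F3FF}".match(emojiBlocks).length); // 2
```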

Solution 2

You could also use Unicode character properties. The Unicode Consortium itself provides a regex, which can be adapted for ECMAScript fairly easily (by replacing all occurrences of \x with \u and putting it all on one line). It selects possible emoji, though, which means it will yield false positives. The Consortium explicitly advises validating all matches before assuming they are in fact emoji.
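To make the false-positive point concrete, here is a minimal sketch using Unicode property escapes (they need the u flag). The two helper names are my own. The Emoji property is broad and includes ASCII digits, "#" and "*", while Emoji_Presentation covers only characters that display as emoji by default.

```javascript
// Hypothetical helpers for probing a single character.
const hasEmojiProperty = (ch) => /\p{Emoji}/u.test(ch);
const hasEmojiPresentation = (ch) => /\p{Emoji_Presentation}/u.test(ch);

console.log(hasEmojiProperty("5"));             // true: digits carry the Emoji property
console.log(hasEmojiPresentation("5"));         // false
console.log(hasEmojiPresentation("\u{1F601}")); // true: U+1F601 is a grinning face
```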

Here's a somewhat stricter version of that regex, which returns fewer false positives, with a mini demo:

const sentence = 'A ticket to 大阪 costs ¥2000 👌. Repeated emojis: 😁😁. Crying cat: 😿. Repeated emoji with skin tones: ✊🏿✊🏿✊🏿✊✊✊🏿. Flags: 🇱🇹🏴󠁧󠁢󠁷󠁬󠁳󠁿. Scales ⚖️⚖️⚖️.';

const regexpUnicodeModified = /\p{RI}\p{RI}|\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?(\u{200D}\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?)+|\p{EPres}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})?|\p{Emoji}(\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})/gu
console.log(sentence.match(regexpUnicodeModified));

This will log the following:

> Array ["👌", "😁", "😁", "😿", "✊🏿", "✊🏿", "✊🏿", "✊", "✊", "✊🏿", "🇱🇹", "🏴󠁧󠁢󠁷󠁬󠁳󠁿", "⚖️", "⚖️", "⚖️"]

which means it matches:

  • simple emoji
  • emoji with modifiers (skin tones)
  • country flags
  • region flags
  • emoji presentation sequences

Note that I don't see how this could be used for replacing specific emoji with images, as the OP wanted, but it does make it possible to place the emoji inside extra tags and such.
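For the "extra tags" idea, a sketch along these lines might work. It rebuilds the same regex from named pieces so the structure is easier to follow, then wraps each match in a span. The names suffix, emojiRegex and wrapEmoji, and the span markup, are my own assumptions. I also used non-capturing groups, which shouldn't change what gets matched.

```javascript
// Optional tail after a base emoji: skin-tone modifiers, a variation selector
// (optionally followed by the keycap mark), or a tag sequence ending in U+E007F.
const suffix = String.raw`(?:\p{EMod}+|\u{FE0F}\u{20E3}?|[\u{E0020}-\u{E007E}]+\u{E007F})`;

const emojiRegex = new RegExp([
  String.raw`\p{RI}\p{RI}`,                                        // country flags
  String.raw`\p{Emoji}${suffix}?(?:\u{200D}\p{Emoji}${suffix}?)+`, // ZWJ sequences
  String.raw`\p{EPres}${suffix}?`,                                 // emoji by default
  String.raw`\p{Emoji}${suffix}`,                                  // emoji only with a tail
].join("|"), "gu");

// Hypothetical helper: wrap every matched emoji sequence in a <span>.
function wrapEmoji(text) {
  return text.replace(emojiRegex, (m) => `<span class="emoji">${m}</span>`);
}

console.log(wrapEmoji("ok \u{1F44C}"));                // the OK hand gets wrapped
console.log("\u270A\u{1F3FF}".match(emojiRegex).length); // 1: fist and skin tone stay together
console.log("2000".match(emojiRegex));                 // null: bare digits are not matched
```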

Solution 3

You can convert emoji to \u escape sequences with the function below.

var emojiToUnicode = function (message) {
    // Matches private-use characters (U+E000-U+F8FF), surrogate pairs for
    // U+1F000-U+1F7FF, the symbols U+2694-U+2697, and U+1F910-U+1F95D.
    var emojiRegexp = /([\uE000-\uF8FF]|\uD83C[\uDC00-\uDFFF]|\uD83D[\uDC00-\uDFFF]|[\u2694-\u2697]|\uD83E[\uDD10-\uDD5D])/g;
    if (!message)
        return;
    // Replace every match with one \uXXXX escape per UTF-16 code unit, so
    // single-unit matches and surrogate pairs are both handled correctly.
    return message.replace(emojiRegexp, function (match) {
        var escaped = "";
        for (var i = 0; i < match.length; i++) {
            escaped += "\\u" + match.charCodeAt(i).toString(16);
        }
        return escaped;
    });
};
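If you want to escape a whole string rather than only the characters picked out by a regex, something like the sketch below might do. Both function names are my own. The first emits one \uXXXX escape per UTF-16 code unit, which is the form asked for in the question. The second emits ES6 \u{...} code point escapes, which only work in a regex that has the u flag.

```javascript
// One \uXXXX escape per UTF-16 code unit (surrogate pairs become two escapes).
function toUtf16Escapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    out += "\\u" + text.charCodeAt(i).toString(16).padStart(4, "0");
  }
  return out;
}

// One \u{...} escape per code point.
function toCodePointEscapes(text) {
  return Array.from(text, (ch) => "\\u{" + ch.codePointAt(0).toString(16) + "}").join("");
}

const grin = "\u{1F601}"; // the grinning face from the question's example
console.log(toUtf16Escapes(grin));     // \ud83d\ude01
console.log(toCodePointEscapes(grin)); // \u{1f601}

// The escaped string can be fed straight into a RegExp.
const grinRegex = new RegExp(toUtf16Escapes(grin), "g");
console.log((grin + "wew" + grin).match(grinRegex).length); // 2
```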

Solution 4

A lot of the suggested patterns do not match Modifier Sequence emojis (skin tones) or compound emojis correctly, or are outdated and don't match newer emojis, or both.

Consider this doozy of an emoji and the regular expression that would match it:

console.log("👩🏽‍❤️‍💋‍👨".split('').map(function(chr) { return '\\u' + chr.charCodeAt(0).toString(16); }).join(''))

That's quite the pattern. It's because it's a bunch of other emojis joined with the U+200D ZERO WIDTH JOINER:

👩🏽 + U+200D + ❤️ + U+200D + 💋 + U+200D + 👨
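To see the individual pieces, you can list the code points instead of the UTF-16 code units. This is only a sketch, and it assumes the emoji is the sequence written out with escapes below (woman, medium skin tone, heart with variation selector, kiss mark, man).

```javascript
// Assumed sequence for the kiss emoji above, written with escapes so the
// invisible joiners are explicit.
const kiss = "\u{1F469}\u{1F3FD}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F468}";

// Array.from iterates by code point, so surrogate pairs stay together.
const codePoints = Array.from(kiss, (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

console.log(codePoints.join(" "));
console.log(kiss.length);       // 13 UTF-16 code units
console.log(codePoints.length); // 9 code points
```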

So, you want your pattern to match the longer sequences first or you'll match those "inner emojis" erroneously.

Solution? Use a pattern like this, which, while long, is drop dead simple because it's a single alternation (?:longest|secondLongest|....|secondShortest|shortest): https://github.com/sweaver2112/Regex-combined-emojis/blob/master/regex.js
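A tiny illustration of why the order matters, using just two entries. The helper names escapeRegExp and buildEmojiPattern are my own, and the real pattern in the linked file is far longer. Regex alternation takes the first alternative that matches, so sorting by length, longest first, keeps the full ZWJ sequence from being cut short.

```javascript
// Escape regex metacharacters (needed for keycap bases such as "*").
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a single alternation, longest sequence first.
function buildEmojiPattern(sequences) {
  const sorted = [...sequences].sort((a, b) => b.length - a.length);
  return "(?:" + sorted.map(escapeRegExp).join("|") + ")";
}

const woman = "\u{1F469}";
const judge = "\u{1F469}\u200D\u2696\uFE0F"; // woman + ZWJ + scales + variation selector

// Shortest first: the "inner" woman emoji wins and the rest is left behind.
const naive = new RegExp("(?:" + [woman, judge].join("|") + ")", "gu");
// Longest first: the whole sequence is matched.
const ordered = new RegExp(buildEmojiPattern([woman, judge]), "gu");

console.log(judge.match(naive)[0] === woman);   // true
console.log(judge.match(ordered)[0] === judge); // true
```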

Here's a working example:

/*compile the pattern string into a regex (emojiPattern is defined by the regex.js script loaded at the end of this snippet)*/
let emoRegex = new RegExp(emojiPattern, "g")

/*extracting the emojis*/
let emojis = [..."This 😀👩‍⚖️is the 🧗‍♀️text🥣.".matchAll(emoRegex)];
console.log(emojis)

/*count of emojis*/
let emoCount = [..."This 😀👩‍⚖️is the 🧗‍♀️text🥣.".matchAll(emoRegex)].length
console.log(emoCount)

/*strip emojis from text*/
let stripped = "This 😀👩‍⚖️is the 🧗‍♀️text🥣.".replaceAll(emoRegex, "")
console.log(stripped)

/*use the pattern string to build a custom regex*/
let customRegex = new RegExp(".*"+emojiPattern+"{3}$") //match a string ending in 3 emojis
console.log(customRegex.test("yep three here 😀👩‍⚖️🥣"))
console.log(customRegex.test("nope 🥣😀"))
<script src="https://gitcdn.link/repo/sweaver2112/Regex-combined-emojis/master/regex.js"></script>

Regex 101 Demo matches all 3521 Emojis as of May 2021

The demo includes all characters from https://unicode.org/emoji/charts/full-emoji-list.html and https://unicode.org/emoji/charts-13.1/full-emoji-modifiers.html.

Author: Mohamed Mohamed

Updated on November 16, 2021

Comments

  • Mohamed Mohamed (over 2 years ago)

    I want to replace all the emoji in a string with an icon. I successfully replaced these: {:) :D :P :3 <3 XP .... etc} to icons, so if the user writes :) in a string, it will be replaced with an icon.

    But I have a problem: what if the user directly pastes the Unicode character 😊, which is equivalent to :)?

    What I need: how can I convert a Unicode emoji into a JavaScript regular expression, something like \ud800-\udbff? I have many emoji, so I need a way to convert them, and after converting them I want to match them with regular expressions.

    Example: 😁wew😁
    Change those emoji to \uD83D\uDE01|\uD83D\uDE4F|. I don't know how to change them, so I need to know how to change any emoji to those characters.