Tokenize in JavaScript
Solution 1
I would do what you started: split on /\W+/
and then validate each token (length and stopwords) in the resulting array using .filter():
var text = "This is a short text about StackOverflow.";
var stopwords = ['this'];
var words = text.split(/\W+/).filter(function(token) {
token = token.toLowerCase();
return token.length >= 2 && stopwords.indexOf(token) == -1;
});
console.log(words); // ["is", "short", "text", "about", "StackOverflow"]
You could easily tweak the regex to match only words of two or more
characters, but there's little point if you're already going to post-process to remove stopwords (a token.length
check will be faster than any fancy regex you write).
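For completeness, if you did want the regex itself to enforce the length, matching is simpler than splitting. A sketch reusing the text and stopwords variables from above (the || [] guards against .match() returning null when nothing matches):
var words = (text.match(/\w{2,}/g) || []).filter(function(token) {
    // drop stopwords; the regex already enforced the minimum length
    return stopwords.indexOf(token.toLowerCase()) == -1;
});
console.log(words); // ["is", "short", "text", "about", "StackOverflow"]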
Solution 2
Easy with Ramda:
var text = "This is a short text about how StackOverflow has gas.";
var stopWords = ['have', 'has'];
var isLongWord = R.compose(R.gte(R.__, 2), R.length); // token length >= 2
var isGoWord = R.compose(R.not, R.contains(R.__, stopWords)); // not a stopword
var tokenize = R.compose(R.filter(isGoWord), R.filter(isLongWord), R.split(' '));
tokenize(text); // ["This", "is", "short", "text", "about", "how", "StackOverflow", "gas."]
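Note that R.split(' ') leaves punctuation attached to tokens ("gas."). R.split delegates to String.prototype.split, so it also accepts a regex; a sketch reusing the predicates above with Solution 1's /\W+/ split (tokenize2 is just an illustrative name):
var tokenize2 = R.compose(R.filter(isGoWord), R.filter(isLongWord), R.split(/\W+/));
tokenize2(text); // ["This", "is", "short", "text", "about", "how", "StackOverflow", "gas"]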
Solution 3
If you want a pure regex approach, what about splitting on something like this:
\W+|\b\w{1,2}\b
https://regex101.com/r/rB4cJ4/1
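One caveat: when two delimiters are adjacent (e.g., the space before "is" and "is" itself), String.prototype.split emits empty strings between them, so you still need a small filter pass, which can also handle the stopwords. A sketch reusing Solution 1's text and stopwords:
var words = text.split(/\W+|\b\w{1,2}\b/).filter(function(token) {
    // drop the empty strings left by adjacent delimiters, plus stopwords
    return token.length > 0 && stopwords.indexOf(token.toLowerCase()) == -1;
});
console.log(words); // ["short", "text", "about", "StackOverflow"]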
Jamgreen
Updated on June 06, 2022

Comments

- Jamgreen, almost 2 years ago:
If I have a string, how can I split it into an array of words and filter out some stopwords? I only want words of length 2 or greater.
If my string is
var text = "This is a short text about StackOverflow.";
I can split it with
var words = text.split(/\W+/);
But using split(/\W+/), I get all words. I could check that a word has a length of at least 2 with
function validate(token) { return /\w{2,}/.test(token); }
but I guess I could do this smarter/faster with a regex.
I also have an array
var stopwords = ['has', 'have', ...]
of words which shouldn't be allowed in the result. Actually, if I can find a way to filter out stopwords, I could just add all the letters a, b, c, ..., z to the stopwords array to only accept words with at least 2 characters.
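For reference, building that extended list might look like this (a sketch; letters is just an illustrative name):
var letters = 'abcdefghijklmnopqrstuvwxyz'.split('');
var stopwords = ['has', 'have'].concat(letters); // single letters now count as stopwords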