Remove HTML In Google Sheets Cells

regex google-sheets

14,956

Solution 1

You could use REGEXREPLACE function:

=REGEXREPLACE(A1,"</?\S+[^<>]*>","")

This formula also can be converted into ArrayFormula:

=ARRAYFORMULA(REGEXREPLACE(offset(A1,,,COUNTA(A:A)),"</?\S+[^<>]*>",""))

This formula would populate down automatically. COUNTA(A:A) is number of rows in column [A] with your data

Solution 2

You can just use a Find and Replace dialog.

Find what: </?\S+[^<>]*>

Replace with: <empty string>

Search: Specific range = '<SHEET_NAME>'!<COLUMN>:<COLUMN> (e.g. 'Copy of Sheet1'!A:A)

Make sure Search using regular expressions is checked. Then, click Replace all.

Here is a screen:

enter image description here

Solution 3

I like Max Makhrov's approach, but a slightly simpler regex would be:

"<[^<>]+>"

In the formula:

=REGEXREPLACE(A1,"<[^<>]+>","")

If you look back at Max's, the other elements he included were:

/?

He's saying that the first character is either a '/' or not. Because you see things like "<div>" or "</div>" tags, and we want to match them both.

\S+

"\S" means any "non-whitespace" character. "+" means one or more times. So that slash is followed by some number of visible characters.

Then he has "[^<>]*" -- any non angle bracket character zero or more times, eventually finding that close bracket. So after our last group of characters stops (ie, we hit whitespace), THEN we start searching for that close bracket.

But slashes, nonwhitespace -- we can just let "[^<>]" do all the work for us from the beginning. We might use a "+" instead of "*", to make sure we find at least one character in the tag. (If you want to strip "empty" tags too, you can make it a "*"... I'm not sure when those would come up.)

So my version is "dumber" in a sense. It just says let's not worry about whether there are slashes or whitespace or whatever, and just take whatever we get between two angle brackets.

You might notice this fixes a subtle bug with Max's original answer, where it would remove the whole field if the entry and tags contained no whitespace at all. The "\S+" is "greedy". So if you hit something like this:

<b>test1</b><div>test2</div>

The "\S" will match every single character from the first 'b' to the last 'div'. It never stops to look for nonbracket characters, because it hasn't found a space yet. So you won't get anything back in your field. (It will appear to work find if you have spaces in your tags, that's why it works on first glance with the examples above.)

If you want to make that operator non-greedy, you have to add a "?" character after it, like so:

"</?\S+?[^<>]*>"

If you use my simpler version, though, <[^<>]+>, it's ok that it's greedy, because it's going to stop anytime it finds that next angle bracket, without really caring about spaces at all.

So either:

=REGEXREPLACE(A1,"<[^<>]+>","")

=ARRAYFORMULA(REGEXREPLACE(offset(A1,,,COUNTA(A:A)),"<[^<>]+>",""))

Best of luck!

14,956

Author by

Eddie Padin

Updated on June 06, 2022

Comments

Eddie Padin almost 2 years
Trying to determine the best method to automatically remove HTML in all cells within a column in Google Sheets.

Example of cell data:
```
<span style="color:#0000FF">test</span>
```
I'd like to strip out all of the HTML and leave the plain text, automatically whenever a new row is added.

I'm not sure if using a regex in the cell as a formula or some type of script would be a better approach.