Remove HTML from SQL Server column

sql regex sql-server-2008-r2

13,912

Solution 1

I ended up using a solution that i got from here https://www.simple-talk.com/sql/t-sql-programming/tsql-regular-expression-workbench/. It works exceptionaly well, this is what worked for me. Create the function inside the DB:

IF OBJECT_ID(N'dbo.RegexReplace') IS NOT NULL 
   DROP FUNCTION dbo.RegexReplace
GO
CREATE FUNCTION dbo.RegexReplace
(
  @pattern VARCHAR(255),
  @replacement VARCHAR(255),
  @Subject VARCHAR(MAX),
  @global BIT = 1,
 @Multiline bit =1
)
RETURNS VARCHAR(MAX)

AS BEGIN
DECLARE @objRegexExp INT,
    @objErrorObject INT,
    @strErrorMessage VARCHAR(255),
    @Substituted VARCHAR(8000),
    @hr INT,
    @Replace BIT

SELECT  @strErrorMessage = 'creating a regex object'
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
IF @hr = 0 
    SELECT  @strErrorMessage = 'Setting the Regex pattern',
            @objErrorObject = @objRegexExp
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
IF @hr = 0 /*By default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive.*/
    SELECT  @strErrorMessage = 'Specifying the type of match' 
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'MultiLine', @Multiline
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'Global', @global
IF @hr = 0 
    SELECT  @strErrorMessage = 'Doing a Replacement' 
IF @hr = 0 
    EXEC @hr= sp_OAMethod @objRegexExp, 'Replace', @Substituted OUT,
        @subject, @Replacement
 /*If the RegExp.Global property is False (the default), Replace will return the @subject string with the first regex match (if any) substituted with the replacement text. If RegExp.Global is true, the @Subject string will be returned with all matches replaced.*/   
IF @hr <> 0 
    BEGIN
        DECLARE @Source VARCHAR(255),
            @Description VARCHAR(255),
            @Helpfile VARCHAR(255),
            @HelpID INT

        EXECUTE sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
            @Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
        SELECT  @strErrorMessage = 'Error whilst '
                + COALESCE(@strErrorMessage, 'doing something') + ', '
                + COALESCE(@Description, '')
        RETURN @strErrorMessage
    END
EXEC sp_OADestroy @objRegexExp
RETURN @Substituted
END
Go

Now simply run this:

--replace all the HTML in a field example
update TableName set fieldName=dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
 '',fieldName,1,1)

It works exceptionally well

Solution 2

Let's create a sample table for discussion

create table #tmp (id int identity primary key, sometext varchar(max));
insert #tmp values ('This <has some> HTML in it <and some more> yeah>');
insert #tmp values ('No HTML here');
insert #tmp values ('This is > than that');
insert #tmp values ('And This is < than that');
insert #tmp values ('');
insert #tmp values ('I have many        blanks. Don''t lose them <y>');
insert #tmp values (null);
insert #tmp values ('<b>This is bold</b> and <i>this is in italics</i>');
insert #tmp values ('I <<<<stttuttter> a lot <>');
GO

Next, if you don't already have a Numbers (aka Tally) table in your database, create one.

create table Numbers (number int identity primary key);
GO
insert Numbers default values
GO 1000

It's better to approach this using multi-statement TSQL, which can make use of indexes and materialize the intermediate results. The first step here is to collect all the characters that are not surrounded by < and > tags.

create table #tmp2 (id int, number int, primary key(id,number), char char(1));

insert #tmp2
select u.id, N.Number, substring(u.sometext, N.number, 1) char
from #tmp u
join Numbers N on N.number <= len(u.sometext)
left join
    (
        select t.id, N.number lt, charindex('>', t.sometext, N.number+1) gt
        from #tmp t
        join Numbers N on substring(t.sometext, N.number, 1) = '<'
    ) exclusions on u.id = exclusions.id and n.number between exclusions.lt and exclusions.gt
where exclusions.id is null
GO

The next step just zips them up using FOR XML.

update #tmp
set sometext = (select char+''
            from #tmp2 b
            where b.id = #tmp.id
            order by Number
            for xml path(''), type).value('/','nvarchar(max)')
where sometext like '%<%>%'

Check our table again

select * from #tmp

id          sometext
----------- --------------------------------------------------
1           This  HTML in it  yeah>
2           No HTML here
3           This is > than that
4           And This is < than that
5           
6           I have many        blanks. Don't lose them 
7           NULL
8           This is bold and this is in italics
9           I  a lot

For a small table, doing the exclusion in line in a single query (as above) would be sufficient. For a larger table, a CTE approach appears to work better as it traverses the data only once. The table in the code above can be enLARGEd by running these against the table after creation:

-- replicate the table data to about 20K records
insert #tmp select sometext from #tmp
GO 11

-- expand each string by 8 times the original length
update #tmp set sometext = sometext + sometext
GO 3

And using this to create the #tmp2 table

;with cte(id, pos, sometext, size, char, flag) as (
    select id, 1, sometext, len(sometext), substring(sometext, 1, 1),
        case when substring(sometext, 1, 1) = '<' then 1 else 0 end
    from #tmp
    where sometext like '%<%>%'
    union all
    select id, pos+1, sometext, size, substring(sometext, pos+1, 1),
        case when substring(sometext, pos+1, 1) = '<' then 1 -- starts a new html tag section
             when char = '>' then 0 -- closed in prior iteration
             else flag end -- continue flag
    from cte
    where pos < len(sometext)
)
insert #tmp2 (id, Number, char)
select id, pos, char
from cte
where flag = 0
option (maxrecursion 500)
GO

This results in a timing of

62 seconds to create #tmp2
1 second to update #tmp using #tmp2

The algorithm has linear complexity, so if your strings are shorter, or you have fewer records, just make approximate scaling to get a feel for how long the process will take.

13,912

Author by

Mat41

Updated on June 14, 2022

Comments

Mat41 almost 2 years
We used to use a WYSIWYG editor and have four text type columns in a SQL Server 2008 R2 table. I would like to remove all the HTML tags. This works when used in an ASP (Classic ASP) page:
```
Function RemoveHTML(strText )
  Dim RegEx

  Set RegEx = New RegExp

  RegEx.Pattern = "<[^>]*>"
  RegEx.Global = True

  RemoveHTML = RegEx.Replace(strText, "")
End Function
```
However I would like a different solution perhaps SQL driven. Would very much like some help. Can regular expressions be used in an SQL statement in this way. Any help suggestions would be excellent? I think I would like to keep the break tags but all the rest can go.

TYIA
Adir D about 11 years

Would be interesting to see how this performs against a large table, of course.
Mat41 about 11 years

@richardTheKiwi thank you for your time from mattThekiwi. Wow this seems very very long. Im just taking it all in...
Mat41 about 11 years

can anybody help me ommiting br and p tags from the REGEX used above