Remove HTML from SQL Server column

13,912

Solution 1

I ended up using a solution that i got from here https://www.simple-talk.com/sql/t-sql-programming/tsql-regular-expression-workbench/. It works exceptionaly well, this is what worked for me. Create the function inside the DB:

IF OBJECT_ID(N'dbo.RegexReplace') IS NOT NULL 
   DROP FUNCTION dbo.RegexReplace
GO
CREATE FUNCTION dbo.RegexReplace
(
  @pattern VARCHAR(255),
  @replacement VARCHAR(255),
  @Subject VARCHAR(MAX),
  @global BIT = 1,
 @Multiline bit =1
)
RETURNS VARCHAR(MAX)

AS BEGIN
DECLARE @objRegexExp INT,
    @objErrorObject INT,
    @strErrorMessage VARCHAR(255),
    @Substituted VARCHAR(8000),
    @hr INT,
    @Replace BIT

SELECT  @strErrorMessage = 'creating a regex object'
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
IF @hr = 0 
    SELECT  @strErrorMessage = 'Setting the Regex pattern',
            @objErrorObject = @objRegexExp
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
IF @hr = 0 /*By default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive.*/
    SELECT  @strErrorMessage = 'Specifying the type of match' 
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'MultiLine', @Multiline
IF @hr = 0 
    EXEC @hr= sp_OASetProperty @objRegexExp, 'Global', @global
IF @hr = 0 
    SELECT  @strErrorMessage = 'Doing a Replacement' 
IF @hr = 0 
    EXEC @hr= sp_OAMethod @objRegexExp, 'Replace', @Substituted OUT,
        @subject, @Replacement
 /*If the RegExp.Global property is False (the default), Replace will return the @subject string with the first regex match (if any) substituted with the replacement text. If RegExp.Global is true, the @Subject string will be returned with all matches replaced.*/   
IF @hr <> 0 
    BEGIN
        DECLARE @Source VARCHAR(255),
            @Description VARCHAR(255),
            @Helpfile VARCHAR(255),
            @HelpID INT

        EXECUTE sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
            @Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
        SELECT  @strErrorMessage = 'Error whilst '
                + COALESCE(@strErrorMessage, 'doing something') + ', '
                + COALESCE(@Description, '')
        RETURN @strErrorMessage
    END
EXEC sp_OADestroy @objRegexExp
RETURN @Substituted
END
Go

Now simply run this:

--replace all the HTML in a field example
update TableName set fieldName=dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
 '',fieldName,1,1)

It works exceptionally well

Solution 2

Let's create a sample table for discussion

create table #tmp (id int identity primary key, sometext varchar(max));
insert #tmp values ('This <has some> HTML in it <and some more> yeah>');
insert #tmp values ('No HTML here');
insert #tmp values ('This is > than that');
insert #tmp values ('And This is < than that');
insert #tmp values ('');
insert #tmp values ('I have many        blanks. Don''t lose them <y>');
insert #tmp values (null);
insert #tmp values ('<b>This is bold</b> and <i>this is in italics</i>');
insert #tmp values ('I <<<<stttuttter> a lot <>');
GO

Next, if you don't already have a Numbers (aka Tally) table in your database, create one.

create table Numbers (number int identity primary key);
GO
insert Numbers default values
GO 1000

It's better to approach this using multi-statement TSQL, which can make use of indexes and materialize the intermediate results. The first step here is to collect all the characters that are not surrounded by < and > tags.

create table #tmp2 (id int, number int, primary key(id,number), char char(1));

insert #tmp2
select u.id, N.Number, substring(u.sometext, N.number, 1) char
from #tmp u
join Numbers N on N.number <= len(u.sometext)
left join
    (
        select t.id, N.number lt, charindex('>', t.sometext, N.number+1) gt
        from #tmp t
        join Numbers N on substring(t.sometext, N.number, 1) = '<'
    ) exclusions on u.id = exclusions.id and n.number between exclusions.lt and exclusions.gt
where exclusions.id is null
GO

The next step just zips them up using FOR XML.

update #tmp
set sometext = (select char+''
            from #tmp2 b
            where b.id = #tmp.id
            order by Number
            for xml path(''), type).value('/','nvarchar(max)')
where sometext like '%<%>%'

Check our table again

select * from #tmp

id          sometext
----------- --------------------------------------------------
1           This  HTML in it  yeah>
2           No HTML here
3           This is > than that
4           And This is < than that
5           
6           I have many        blanks. Don't lose them 
7           NULL
8           This is bold and this is in italics
9           I  a lot 

For a small table, doing the exclusion in line in a single query (as above) would be sufficient. For a larger table, a CTE approach appears to work better as it traverses the data only once. The table in the code above can be enLARGEd by running these against the table after creation:

-- replicate the table data to about 20K records
insert #tmp select sometext from #tmp
GO 11

-- expand each string by 8 times the original length
update #tmp set sometext = sometext + sometext
GO 3

And using this to create the #tmp2 table

;with cte(id, pos, sometext, size, char, flag) as (
    select id, 1, sometext, len(sometext), substring(sometext, 1, 1),
        case when substring(sometext, 1, 1) = '<' then 1 else 0 end
    from #tmp
    where sometext like '%<%>%'
    union all
    select id, pos+1, sometext, size, substring(sometext, pos+1, 1),
        case when substring(sometext, pos+1, 1) = '<' then 1 -- starts a new html tag section
             when char = '>' then 0 -- closed in prior iteration
             else flag end -- continue flag
    from cte
    where pos < len(sometext)
)
insert #tmp2 (id, Number, char)
select id, pos, char
from cte
where flag = 0
option (maxrecursion 500)
GO

This results in a timing of

  • 62 seconds to create #tmp2
  • 1 second to update #tmp using #tmp2

The algorithm has linear complexity, so if your strings are shorter, or you have fewer records, just make approximate scaling to get a feel for how long the process will take.

Share:
13,912
Mat41
Author by

Mat41

Updated on June 14, 2022

Comments

  • Mat41
    Mat41 almost 2 years

    We used to use a WYSIWYG editor and have four text type columns in a SQL Server 2008 R2 table. I would like to remove all the HTML tags. This works when used in an ASP (Classic ASP) page:

    Function RemoveHTML(strText )
      Dim RegEx
    
      Set RegEx = New RegExp
    
      RegEx.Pattern = "<[^>]*>"
      RegEx.Global = True
    
      RemoveHTML = RegEx.Replace(strText, "")
    End Function
    

    However I would like a different solution perhaps SQL driven. Would very much like some help. Can regular expressions be used in an SQL statement in this way. Any help suggestions would be excellent? I think I would like to keep the break tags but all the rest can go.

    TYIA

  • Adir D
    Adir D about 11 years
    Would be interesting to see how this performs against a large table, of course.
  • Mat41
    Mat41 about 11 years
    @richardTheKiwi thank you for your time from mattThekiwi. Wow this seems very very long. Im just taking it all in...
  • Mat41
    Mat41 about 11 years
    can anybody help me ommiting br and p tags from the REGEX used above