Remove HTML from SQL Server column
Solution 1
I ended up using a solution that i got from here https://www.simple-talk.com/sql/t-sql-programming/tsql-regular-expression-workbench/. It works exceptionaly well, this is what worked for me. Create the function inside the DB:
IF OBJECT_ID(N'dbo.RegexReplace') IS NOT NULL
DROP FUNCTION dbo.RegexReplace
GO
CREATE FUNCTION dbo.RegexReplace
(
@pattern VARCHAR(255),
@replacement VARCHAR(255),
@Subject VARCHAR(MAX),
@global BIT = 1,
@Multiline bit =1
)
RETURNS VARCHAR(MAX)
AS BEGIN
DECLARE @objRegexExp INT,
@objErrorObject INT,
@strErrorMessage VARCHAR(255),
@Substituted VARCHAR(8000),
@hr INT,
@Replace BIT
SELECT @strErrorMessage = 'creating a regex object'
EXEC @hr= sp_OACreate 'VBScript.RegExp', @objRegexExp OUT
IF @hr = 0
SELECT @strErrorMessage = 'Setting the Regex pattern',
@objErrorObject = @objRegexExp
IF @hr = 0
EXEC @hr= sp_OASetProperty @objRegexExp, 'Pattern', @pattern
IF @hr = 0 /*By default, the regular expression is case sensitive. Set the IgnoreCase property to True to make it case insensitive.*/
SELECT @strErrorMessage = 'Specifying the type of match'
IF @hr = 0
EXEC @hr= sp_OASetProperty @objRegexExp, 'IgnoreCase', 1
IF @hr = 0
EXEC @hr= sp_OASetProperty @objRegexExp, 'MultiLine', @Multiline
IF @hr = 0
EXEC @hr= sp_OASetProperty @objRegexExp, 'Global', @global
IF @hr = 0
SELECT @strErrorMessage = 'Doing a Replacement'
IF @hr = 0
EXEC @hr= sp_OAMethod @objRegexExp, 'Replace', @Substituted OUT,
@subject, @Replacement
/*If the RegExp.Global property is False (the default), Replace will return the @subject string with the first regex match (if any) substituted with the replacement text. If RegExp.Global is true, the @Subject string will be returned with all matches replaced.*/
IF @hr <> 0
BEGIN
DECLARE @Source VARCHAR(255),
@Description VARCHAR(255),
@Helpfile VARCHAR(255),
@HelpID INT
EXECUTE sp_OAGetErrorInfo @objErrorObject, @source OUTPUT,
@Description OUTPUT, @Helpfile OUTPUT, @HelpID OUTPUT
SELECT @strErrorMessage = 'Error whilst '
+ COALESCE(@strErrorMessage, 'doing something') + ', '
+ COALESCE(@Description, '')
RETURN @strErrorMessage
END
EXEC sp_OADestroy @objRegexExp
RETURN @Substituted
END
Go
Now simply run this:
--replace all the HTML in a field example
update TableName set fieldName=dbo.RegexReplace('<(?:[^>''"]*|([''"]).*?\1)*>',
'',fieldName,1,1)
It works exceptionally well
Solution 2
Let's create a sample table for discussion
create table #tmp (id int identity primary key, sometext varchar(max));
insert #tmp values ('This <has some> HTML in it <and some more> yeah>');
insert #tmp values ('No HTML here');
insert #tmp values ('This is > than that');
insert #tmp values ('And This is < than that');
insert #tmp values ('');
insert #tmp values ('I have many blanks. Don''t lose them <y>');
insert #tmp values (null);
insert #tmp values ('<b>This is bold</b> and <i>this is in italics</i>');
insert #tmp values ('I <<<<stttuttter> a lot <>');
GO
Next, if you don't already have a Numbers (aka Tally) table in your database, create one.
create table Numbers (number int identity primary key);
GO
insert Numbers default values
GO 1000
It's better to approach this using multi-statement TSQL, which can make use of indexes and materialize the intermediate results. The first step here is to collect all the characters that are not surrounded by <
and >
tags.
create table #tmp2 (id int, number int, primary key(id,number), char char(1));
insert #tmp2
select u.id, N.Number, substring(u.sometext, N.number, 1) char
from #tmp u
join Numbers N on N.number <= len(u.sometext)
left join
(
select t.id, N.number lt, charindex('>', t.sometext, N.number+1) gt
from #tmp t
join Numbers N on substring(t.sometext, N.number, 1) = '<'
) exclusions on u.id = exclusions.id and n.number between exclusions.lt and exclusions.gt
where exclusions.id is null
GO
The next step just zips them up using FOR XML
.
update #tmp
set sometext = (select char+''
from #tmp2 b
where b.id = #tmp.id
order by Number
for xml path(''), type).value('/','nvarchar(max)')
where sometext like '%<%>%'
Check our table again
select * from #tmp
id sometext
----------- --------------------------------------------------
1 This HTML in it yeah>
2 No HTML here
3 This is > than that
4 And This is < than that
5
6 I have many blanks. Don't lose them
7 NULL
8 This is bold and this is in italics
9 I a lot
For a small table, doing the exclusion in line in a single query (as above) would be sufficient. For a larger table, a CTE approach appears to work better as it traverses the data only once. The table in the code above can be enLARGEd by running these against the table after creation:
-- replicate the table data to about 20K records
insert #tmp select sometext from #tmp
GO 11
-- expand each string by 8 times the original length
update #tmp set sometext = sometext + sometext
GO 3
And using this to create the #tmp2 table
;with cte(id, pos, sometext, size, char, flag) as (
select id, 1, sometext, len(sometext), substring(sometext, 1, 1),
case when substring(sometext, 1, 1) = '<' then 1 else 0 end
from #tmp
where sometext like '%<%>%'
union all
select id, pos+1, sometext, size, substring(sometext, pos+1, 1),
case when substring(sometext, pos+1, 1) = '<' then 1 -- starts a new html tag section
when char = '>' then 0 -- closed in prior iteration
else flag end -- continue flag
from cte
where pos < len(sometext)
)
insert #tmp2 (id, Number, char)
select id, pos, char
from cte
where flag = 0
option (maxrecursion 500)
GO
This results in a timing of
- 62 seconds to create #tmp2
- 1 second to update #tmp using #tmp2
The algorithm has linear complexity, so if your strings are shorter, or you have fewer records, just make approximate scaling to get a feel for how long the process will take.
Mat41
Updated on June 14, 2022Comments
-
Mat41 almost 2 years
We used to use a WYSIWYG editor and have four text type columns in a SQL Server 2008 R2 table. I would like to remove all the HTML tags. This works when used in an ASP (Classic ASP) page:
Function RemoveHTML(strText ) Dim RegEx Set RegEx = New RegExp RegEx.Pattern = "<[^>]*>" RegEx.Global = True RemoveHTML = RegEx.Replace(strText, "") End Function
However I would like a different solution perhaps SQL driven. Would very much like some help. Can regular expressions be used in an SQL statement in this way. Any help suggestions would be excellent? I think I would like to keep the break tags but all the rest can go.
TYIA
-
Adir D about 11 yearsWould be interesting to see how this performs against a large table, of course.
-
Mat41 about 11 years@richardTheKiwi thank you for your time from mattThekiwi. Wow this seems very very long. Im just taking it all in...
-
Mat41 about 11 yearscan anybody help me ommiting br and p tags from the REGEX used above