How to strip HTML tags from a string in SQL Server?
Solution 1
There is a UDF that will do that described here:
User Defined Function to Strip HTML
CREATE FUNCTION [dbo].[udf_StripHTML] (@HTMLText VARCHAR(MAX))
RETURNS VARCHAR(MAX) AS
BEGIN
DECLARE @Start INT
DECLARE @End INT
DECLARE @Length INT
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
WHILE @Start > 0 AND @End > 0 AND @Length > 0
BEGIN
SET @HTMLText = STUFF(@HTMLText,@Start,@Length,'')
SET @Start = CHARINDEX('<',@HTMLText)
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
GO
Edit: note this is for SQL Server 2005, but if you change the keyword MAX to something like 4000, it will work in SQL Server 2000 as well.
Solution 2
Derived from @Goner Doug answer, with a few things updated:
- using REPLACE where possible
- conversion of predefined entities like é
(I chose the ones I needed :-)
- some conversion of list tags <ul> and <li>
ALTER FUNCTION [dbo].[udf_StripHTML]
--by Patrick Honorez --- www.idevlop.com
--inspired by http://stackoverflow.com/questions/457701/best-way-to-strip-html-tags-from-a-string-in-sql-server/39253602#39253602
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
set @HTMLText = replace(@htmlText, '<br>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br/>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<br />',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '<li>','- ')
set @HTMLText = replace(@htmlText, '</li>',CHAR(13) + CHAR(10))
set @HTMLText = replace(@htmlText, '’' collate Latin1_General_CS_AS, '''' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '"' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '€' collate Latin1_General_CS_AS, '€' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '<' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '>' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'œ' collate Latin1_General_CS_AS, 'oe' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, ' ' collate Latin1_General_CS_AS, ' ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '©' collate Latin1_General_CS_AS, '©' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '«' collate Latin1_General_CS_AS, '«' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '®' collate Latin1_General_CS_AS, '®' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '±' collate Latin1_General_CS_AS, '±' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '²' collate Latin1_General_CS_AS, '²' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '³' collate Latin1_General_CS_AS, '³' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'µ' collate Latin1_General_CS_AS, 'µ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '·' collate Latin1_General_CS_AS, '·' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'º' collate Latin1_General_CS_AS, 'º' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '»' collate Latin1_General_CS_AS, '»' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '¼' collate Latin1_General_CS_AS, '¼' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '½' collate Latin1_General_CS_AS, '½' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '¾' collate Latin1_General_CS_AS, '¾' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&Aelig' collate Latin1_General_CS_AS, 'Æ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'Ç' collate Latin1_General_CS_AS, 'Ç' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'È' collate Latin1_General_CS_AS, 'È' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'É' collate Latin1_General_CS_AS, 'É' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'Ê' collate Latin1_General_CS_AS, 'Ê' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'Ö' collate Latin1_General_CS_AS, 'Ö' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'à' collate Latin1_General_CS_AS, 'à' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'â' collate Latin1_General_CS_AS, 'â' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ä' collate Latin1_General_CS_AS, 'ä' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'æ' collate Latin1_General_CS_AS, 'æ' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ç' collate Latin1_General_CS_AS, 'ç' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'è' collate Latin1_General_CS_AS, 'è' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'é' collate Latin1_General_CS_AS, 'é' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ê' collate Latin1_General_CS_AS, 'ê' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ë' collate Latin1_General_CS_AS, 'ë' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'î' collate Latin1_General_CS_AS, 'î' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ô' collate Latin1_General_CS_AS, 'ô' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ö' collate Latin1_General_CS_AS, 'ö' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '÷' collate Latin1_General_CS_AS, '÷' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ø' collate Latin1_General_CS_AS, 'ø' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ù' collate Latin1_General_CS_AS, 'ù' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ú' collate Latin1_General_CS_AS, 'ú' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'û' collate Latin1_General_CS_AS, 'û' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, 'ü' collate Latin1_General_CS_AS, 'ü' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '"' collate Latin1_General_CS_AS, '"' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '&' collate Latin1_General_CS_AS, '&' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '‹' collate Latin1_General_CS_AS, '<' collate Latin1_General_CS_AS)
set @HTMLText = replace(@htmlText, '›' collate Latin1_General_CS_AS, '>' collate Latin1_General_CS_AS)
-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
Solution 3
Here's an updated version of this function that incorporates the RedFilter answer (Pinal's original) with the LazyCoders additions and the goodeye typo corrections AND my own addition to handle in-line <STYLE>
tags inside the HTML.
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start int
DECLARE @End int
DECLARE @Length int
-- Replace the HTML entity & with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity < with the '<' character
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity > with the '>' character
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity & with the '&' character
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace the HTML entity with the ' ' character
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX(' ', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END
-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END
-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <STYLE> tags
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('<', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<STYLE', @HTMLText)
SET @End = CHARINDEX('</STYLE>', @HTMLText, CHARINDEX('</STYLE>', @HTMLText)) + 7
SET @Length = (@End - @Start) + 1
END
-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END
RETURN LTRIM(RTRIM(@HTMLText))
END
Solution 4
If your HTML is well formed, I think this is a better solution:
create function dbo.StripHTML( @text varchar(max) ) returns varchar(max) as
begin
declare @textXML xml
declare @result varchar(max)
set @textXML = REPLACE( @text, '&', '' );
with doc(contents) as
(
select chunks.chunk.query('.') from @textXML.nodes('/') as chunks(chunk)
)
select @result = contents.value('.', 'varchar(max)') from doc
return @result
end
go
select dbo.StripHTML('This <i>is</i> an <b>html</b> test')
Solution 5
Here is a version that doesn't require an UDF and works even if the HTML contains tags without matching closing tags.
TRY_CAST(REPLACE(REPLACE(REPLACE([HtmlCol], '>', '/> '), '</', '<'), '--/>', '-->') AS XML).value('.', 'NVARCHAR(MAX)')
Comments
-
irag10 over 3 years
I've got data in SQL Server 2005 that contains HTML tags and I'd like to strip all that out, leaving just the text between the tags. Ideally also replacing things like
<
with<
, etc.Is there an easy way to do this or has someone already got some sample T-SQL code?
I don't have the ability to add extended stored procs and the like, so would prefer a pure T-SQL approach (preferably one backwards compatible with SQL 2000).
I just want to retrieve the data with stripped out HTML, not update it, so ideally it would be written as a user-defined function, to make for easy reuse.
So for example converting this:
<B>Some useful text</B> <A onclick="return openInfo(this)" href="http://there.com/3ce984e88d0531bac5349" target=globalhelp> <IMG title="Source Description" height=15 alt="Source Description" src="/ri/new_info.gif" width=15 align=top border=0> </A>> <b>more text</b></TD></TR>
to this:
Some useful text > more text
-
irag10 over 15 yearsGreat, thanks. Comments there link to an improved version: lazycoders.blogspot.com/2007/06/… which deals with more html entities.
-
D'Arcy Rittich over 14 yearsNote that as a string-intensive UDF in SQL Server 2005 or later, this is a perfect candidate for implementing a CLR UDF function for a massive performance boost. More info on doing so here: stackoverflow.com/questions/34509/…
-
goodeye over 12 yearsNote the lazycoders post has two typos. Remove the single quotes from around the
CHAR(13) + CHAR(10)
in two of the sections that have these. Subtle enough I didn't catch it until it exceeded the length of a short field (interestingly, and required for me, all replacements are shorter than the original string). -
Saeed Neamati over 12 yearsThis worked for me. +1. But could you please explain your code, so that developers understand it more easily? :)
-
JDPeckham over 11 yearsit looks like it loads the html as an xml document then selects all of the values out of it. Note: this code pukes on
-
JDPeckham over 11 yearsWhat about html encoded values? would need them decoded. Thanks.
-
dudeNumber4 over 11 yearsPut a hack in for not bombing on HTML codes. Obviously just a quick hack for in-house use or whatever (just as with accepted UDF).
-
ladieu about 11 yearsdidn't work for me SELECT dbo.StripHtml('<b>somestuff</b>'); returns that exact string
-
Peter Herdenborg about 11 years@ladieu, this is expected. Check very first line of the answer ("this SQL removes all tags except BR, B, and P with any accompanying attributes").
-
qxotk over 10 yearsI used the lazycoders, plus the typo fix from @goodeye above - works great. To save time, the lazycoders blog version is here: lazycoders.blogspot.com/2007/06/…
-
Micah B. over 9 yearsIt does have to be well formed, so it's not as fault tolerant as RedFilter's.
-
futureelite7 almost 9 yearsthis SQL function is incorrect. Please refer to the answer below for the corrected function.
-
David over 8 yearsHTML is not a subset of XML. XHTML is, but HTML is not headed down that road any more.
-
Shadi Alnamrouti almost 8 yearsThis worked for me after I used nvarchar instead of varchar because I use unicode characters in html tags
-
Shadi Alnamrouti almost 8 yearsThis worked for me after I used nvarchar instead of varchar because I use unicode characters inside html tags
-
lanartri over 7 yearsFor my information, any reason using
STUFF()
instead ofREPLACE()
(which woudl be mush shorter IMO) ? -
Goner Doug over 7 yearsI hadn't really thought about it. I simply copied/modified the original, as indicated. Replace might very-well be a better option. I wonder if there's a performance comparison between the two functions to consider...
-
Dennis Jaheruddin about 6 yearsWithout the actual solution to cast from xml I feel like this is at best a partial solution.
-
Vinnie Amir about 6 yearsCAST(@xml as varchar(max)). Or CONVERT(xml), @XML). Assumed most developers would figure that out.
-
Dennis Jaheruddin about 6 yearsIt is definitely reasonable to assume that developers know how to cast, but keep in mind that someone reading your answer may not directly see that 'simply' casting is all that needs to be done. Especially because it is mentioned that we can cast where appropriate. -- I am not trying to be negative, just hope this helps you in creation of answers that are more easy to recognize as being usefull!
-
Caius Jard almost 6 years@futureelite7 using "below" and "above" as a reference for where to find answers on an SO page is nonsensical, because the answer order can be changed using the tabs at the top (and further more, voting can change answer order). Please specify an answer using the name of the author posting it
-
Caius Jard almost 6 yearsWhy does
SET @End = CHARINDEX('>',@HTMLText,CHARINDEX('<',@HTMLText))
recalculateCHARINDEX('<',@HTMLText)
instead of reusing@Start
, which has only just been calculated to be that exact same thing on the preceding statement? -
D.R. almost 6 yearsI've used this and love it, but I did add one more replace to the top group: </p> I changed to a char 13 + char 10 also since the end of a paragraph tag would typically indicate a new line. It worked perfectly in my particular scenario
-
Felix Eve over 5 yearsSo what part of this is the column name? Lets say I have a table called
data
with a column calledhtml
and I want to select all values in that column but strip the html tags how could I use your answer to achieve that? -
Jono over 5 years@GonerDoug cheers for this, was reading through the accepted comments being like, this really needs updating.
-
Mojtaba almost 5 years@RedFilter That does not remove Embedded CSS Styles.
-
matt123788 about 4 yearsThis answer worked great for the most part, but there is an assumption that all of your HTML tags are valid. In my case, there was a truncation issue on the VARCHAR upload that eliminated some closing tags. A simple PATINDEX RTrim did the trick to remove everything else.
-
Baahubali almost 4 yearsafter using the above, i had some weird characters added to my text. changing the varchar(max) parameter to nvarchar(max) fixed it.
-
a_hardin almost 4 yearsIn addition to the change @D.R. made (plus a few more that needed carriage returns), I also moved the replaces that result in
<
and>
to the very end. Otherwise they got removed with the tags. -
Marcello Miorelli over 3 yearsI have added the following line in the first block:
set @HTMLText = replace(@htmlText, '
',CHAR(13) + CHAR(10))
-
Ben about 3 yearsI was not able to get this answer to work on a simple query like
<font color=black><font color=navy><b>SELECT</b></font> *<BR><font color=navy><b>FROM</b></font> Table<font>
-
Ben about 3 yearsI was unable to get this to work using the sample HTML from the original question.
-
Ben about 3 yearsI was unable to get this method to work using the sample HTML from the original question.
-
Ben about 3 yearsThis answer does not remove predefined entities like
and it will strip out the>
that was requested in the original question. The answer by Patrick Honorez does handle the original question and is a more complete solution. YMMV. -
dudeNumber4 about 3 yearsYes, it does have to be well formed, which is a tyrant. Closest to op's snippet would be
select dbo.StripHTML('<B>Some useful text</B> <A onclick="return openInfo(this)" href="http://there.com/3ce984e88d0531bac5349" target="globalhelp"> <IMG title="Source Description" height="15" alt="Source Description" src="/ri/new_info.gif" width="15" align="top" border="0" /> </A>> <b>more text</b>')
-
user752746 about 3 yearsThank you, this worked perfectly for me. Cheers!
-
Asiri Dissanayaka over 2 yearsThis breaks on <br/> and <hr/> and <img/> and <input />