SQLTeam.com | Weblogs | Forums

Removing carriage returns and HTML tags


#1

Hello, I have some data imported into a database that has html tags. I have the code below which replaces certain tags, and the last set of code removes all other between < > tags. This is working well. The issue is that if the tag has a carriage return after the tag I also need to remove those. Here is two examples. In the first example the code I have will work fine.

Eg 1.

<html><body>enter new academic year, change data and save.</body></html>

Desired Result = enter new academic year, change data and save.

In this second example. I also need to remove the carriage return after the <html>, <body>, </div> and &gt;. Note sometimes a tag may have a carriage return after it and in other cases it will not.

Eg 2.

  <html>
    <body>
    <div align="left"><font face="Arial"><span style="font-size:8pt">Navigate to Setup SACR &gt; Foundation 
    Tables &gt; 
    Academic Structure &gt; Program Planner<br />
    Search for Program/Plan combination and check 
    highest Academic Year still this year</span></font></div>
    </body>
    </html>

Desired Result = Navigate to Setup SACR > Foundation
Tables > Academic Structure > Program Planner
Search for Program/Plan combination and check
highest Academic Year still this year

Usage: dbo.udf_StripHTML(TS_Description) TS_Description

USE [UofA]
GO
/****** Object:  UserDefinedFunction [dbo].[udf_StripHTML]    Script Date: 18/08/2015 8:43:19 AM ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO


ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN
DECLARE @Start  int
DECLARE @End    int
DECLARE @Length int

-- Replace the HTML entity &amp; with the '&' character (this needs to be done first, as
-- '&' might be double encoded as '&amp;amp;')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &lt; with the '<' character
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '<')
SET @Start = CHARINDEX('&lt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &gt; with the '>' character
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '>')
SET @Start = CHARINDEX('&gt;', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &amp; with the '&' character
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '&')
SET @Start = CHARINDEX('&amp;amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &nbsp; with the ' ' character
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, ' ')
SET @Start = CHARINDEX('&nbsp;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace the HTML entity &quot; with the '"' character
SET @Start = CHARINDEX('&quot;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '"')
SET @Start = CHARINDEX('&quot;', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Replace any <br> tags with a newline
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
--CHAR(13) + CHAR(10)
SET @Start = CHARINDEX('<br>', @HTMLText)
SET @End = @Start + 3
SET @Length = (@End - @Start) + 1
END

-- Replace any <br/> tags with a newline
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
--CHAR(13) + CHAR(10)
SET @Start = CHARINDEX('<br/>', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

-- Replace any <br /> tags with a newline
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, CHAR(13) + CHAR(10))
--CHAR(13) + CHAR(10)
SET @Start = CHARINDEX('<br />', @HTMLText)
SET @End = @Start + 5
SET @Length = (@End - @Start) + 1
END

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
SET @HTMLText = STUFF(@HTMLText, @Start, @Length, '')
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, CHARINDEX('<', @HTMLText))
SET @Length = (@End - @Start) + 1
END

RETURN LTRIM(RTRIM(@HTMLText))

END

#2
set @HTMLText=replace(@HTMLText,'\r\n','');
set @HTMLText=replace(@HTMLText,'\n','');

#3

Thanks for your reply bitsmed however this will not solve my problem. This is because if you look at my examples I only want to remove carriage returns under certain circumstances. This is shown in Eg 2. Where a carriage return is directly after given html tags. The carriage return after the word 'check' for example needs to be kept.


#4

How about this:

set @HTMLText=replace(@HTMLText,'>\r\n','>');
set @HTMLText=replace(@HTMLText,'>\n','>');

This will remove linefeeds after tags.


#5

Thanks bitsmed. Worked with a minor mod.

dbo.udf_StripHTML(REPLACE(TS_Description, '>'+CHAR(13)+CHAR(10), '> ')) TS_Description

#6

Good luck with this :smile:We have never been able to remove HTML tags without using an HTML parser; any broken HTML tags, which the browser may very well display on screen "just fine", will ruin your day with a Find & Replace type approach; maybe your HTML is mechanically-generated and it doesn't have n-degrees-of-freedom in the way that it is formed, but IME the biggest issue is that a "Remove" done mechanically "steals" some real data in the process.

The looping approach you are having to use is also horrifically slow. Of course if you only have to process one record now-and-again (e.g. as a user saves it) that's probably not a problem, but for a batch process cleaning up lots of rows of data (e.g. an overnight import) it will scale very badly.

Where we need to do this we use JavaScript on the user's data entry form to sort it out and provide both "full fat" and "tag-free" parameters to the APP. Languages such as ASP / PHP include parsing capabilities (i.e. far superior to what you can do in SQL)

Personally I would avoid this:

SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1

WHILE (@Start > 0 AND @End > 0 AND @Length > 0) BEGIN
...
SET @Start = CHARINDEX('&amp;', @HTMLText)
SET @End = @Start + 4
SET @Length = (@End - @Start) + 1
END

where the various pieces of code are repeated, as changing anything for Code Maintenance latter on is likely to introduce inconsistencies.

Personally, again, I would use SELECT rather than SET so that all variables could be adjusted in a single (SELECT) statement, rather than multiple (SET) statements.

No idea why SET was ever constrained to only being able to update a single variable ... strikes me as being nuts! but there we are.

In any event if you want to capture @@ROWNUMBER and @@ERROR you HAVE to use SELECT (because it has to be done as a single statement) so in circumstances like that you can't use SET (so if the objective is to be standards-compliant IMHO there are situations where you can't, so I always use SELECT instead of SET for @Variable manipulation in order to be consistent). its just personal preference though.


#7

Functions should always be written for efficiency -- btw, even at the expense of ease of maintenance -- so let's touch this one up some.

SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
ALTER FUNCTION [dbo].[udf_StripHTML]
(
@HTMLText varchar(MAX)
)
RETURNS varchar(MAX)
AS
BEGIN

DECLARE @Start  int
DECLARE @End    int

--Replace HTML entities and selected tags with plain text chars.
SET @HTMLText = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
    @HTMLText,
    '&amp;', '&'),
    '&amp;amp;', '&'),
    '&lt;', '<'),
    '&gt;', '>'),
    '&nbsp;', ' '),
    '&quot;', '"'),
    '<br>', CHAR(13) + CHAR(10)),
    '<br/>', CHAR(13) + CHAR(10)),
    '<br />', CHAR(13) + CHAR(10))

-- Remove anything between <whatever> tags
SET @Start = CHARINDEX('<', @HTMLText)
SET @End = CHARINDEX('>', @HTMLText, @Start + 1)

WHILE (@Start > 0 AND @End > 0) BEGIN
    SET @HTMLText = STUFF(@HTMLText, @Start, @End - @Start + 1, '')
    SET @Start = CHARINDEX('<', @HTMLText)
    SET @End = CHARINDEX('>', @HTMLText, @Start + 1)
END

RETURN LTRIM(RTRIM(@HTMLText))

END