Any reason NOT to use VARCHAR(MAX) instead of VARCHAR(n)?

Kristen · August 4, 2015, 9:57am

I have a logging table, created back in the days of SQL 2000, which is something along the lines of:

CREATE TABLE dbo.MyLogTable
(
	LogID		int IDENTITY(1,1) NOT NULL,
	LogName		varchar(255) NOT NULL,
	LogValue1	varchar(1000) NOT NULL,
	LogValue2	varchar(255) NULL,
	LogValue3	varchar(255) NULL,
	LogValue4	varchar(255) NULL,
	LogValue5	varchar(255) NULL,
	PRIMARY KEY
	(
		LogID
	)
)

The intention, at the time, was that the main "Log Message" was stored in the [LogValue1] column (given a "generous" 1,000 characters width at the time ) and any other Odds&Sods stuff that I needed to log would be in one of the other LogValue2...5 columns. Normally they would be empty.

In the event that the "Log Message" is more than 1,000 it is truncated ... which is a PITA - although it doesn't happen often, when it does there is every chance that the critical piece of information is lost. (A quick SELECT suggests that 2% of rows are "max width" on that column, so would probably use more if it were available)

Is there any reason NOT to just change this to VARCHAR(MAX) on all columns? From what I have read if the total collective size of the VARCHAR() columns is less than 8,000 approx (i.e. allowing for max page size 8,060 bytes) the record and all its columns will be stored as IN_ROW_DATA, beyond that columns that will fit within the page will still be stored IN_ROW_DATA but anything too big to fit will be stored as LOB_DATA.

Thus it seems to me that for all the current data that "will still fit" there is absolutely no change, and no performance penalty etc. I'm guessing that that is a huge proportion of the logged data ... but ... for the rest is the INSERT performance penalty significant? (The SELECT cost is negligible, we almost never query the log table, unless there is a problem that needs to be diagnosed).

Other possible issue is that we Insert-once and then (60% of the time) update-once. e.g. we log a "We started, here are the parameters" at the start of a routine and then at the end of it we update with an INT "Result code" (changing an existing value, NOT changing NULL to Sometime) and SOMETIMES add a [LogValueN] (i.e. changing from NULL to Something, or Something to SomethingElse).

The logging table gets millions of inserts a day (mostly between 9am and 5pm - my guess is that it peaks at around 500 inserts a second) hence INSERT and then optional UPDATE-once performance is very important.

robert_volk · August 4, 2015, 11:31am

Only real downside I see is excessive page splitting if you have short messages that fit multiple rows on a page, and then need to move if they're updated to a longer length. I imagine you're getting quite a few splits now, it's just a matter of how much impact they have. Index padding can help there, however...

I don't think it's a bad idea to use varchar(max) but...are you sure your messages exceed 8000 bytes? If you can profile some average sizes you may consider fixing your row size around 8000 bytes anyway, that way you always get one row per page. It sounds wasteful, but it avoids page splits and the consequences of page locking/latching during updates. It should still work with varchar(max), however if they go >8000 characters you'd go off page anyway, with increased I/O hit. If each column is MAX, you can end up with 5 off-page I/Os for a single row SELECT. VARCHAR(MAX) also can't use data compression; if that's available to you, and the messages you get are somewhat regular, you could save space.

Kristen · August 4, 2015, 12:39pm

Thanks Rob. My simple answer to the valid points you mention is ">8,000 will happen so rarely that I don't care about any performance hit" (so long as: the rest of the time, < 8,000 chars, there is NO performance hit compared to before)

Pondering this (with Mr Google's help ) has send me on a long-research-read from which it seems:

Any rows that fit IN ROW, including after any UPDATE, are likely to perform identically.

Convenience of VARCHAR(MAX) for things that MIGHT be larger is worth having. Performance penalty, given that only 2% of rows will be excess-sized, will be acceptable. (Although as Rob pointed out my 2% is > current 1,000 byte limit, it might be that none of them would actually exceed 8,000 bytes)

If the majority of rows were > 8,060 bytes it might be worth forcing the column to be LOB_DATA - an update of LOB_DATA is faster if it remains out-of-row, rather than shrinking to be accommodated IN_ROW. That's not the case for me.

Side issues:

varchar(max) in a table prevents an online index rebuild. We only ever REORGANISE large indexes so I presume that's not a problem for me.

Cannot move tables with LOB data to a different filegroup. Can't see that being an issue for me, certainly not a frequent one!! - if it arose we'd have to do a Create TempTable, copy data over, drop original, rename table name and all the associated Indexes, Constraints, etc etc.

VARCHAR(max) can't be part of an index - but nor can VARCHAR(900+1). Again not an issue for me. Similarly a SORT could force use of TEMPDB - that might be a consideration for not excessively sizing a column's width.

Data in LOB doesn't get compressed. Not sure that bothers me - its not a reason to truncate I'd-really-prefer-this-was-kept data.

Should I just use VARCHAR(MAX) all the time? Tempting but the consensus is that it encourages goofy-data. If a CODE column should be 10 characters max then don't give the user the opportunity to enter 11 or more We have generous sized email columns, because the RFC says they can be long ... so users put multiple email addresses in there (well, attempt to, we don't allow 2x "@" in the field)

However, if the data is genuinely variable in size - a "Notes" column for example - then I can't see a reason to set an arbitrary 1,000 character limit, as we have done in the past. Might as well make it VARCHAR(MAX) and if the user wants to store War & Peace in there let them worry about how to edit it given the letter-box sized form-field that we provide!

String manipulations on VARCHAR(MAX) datatype columns are slower Not a problem in my example, but might be a consideration for the "VARCHAR(MAX) everything" line of thought! if the columns are used in functions like CHARINDEX & PATINDEX, or in any UDFs. (Side note: I have separate UDFs to split VARCHAR(8000) and VARCHAR(MAX) because of the performance differences, I can continue to do that, if necessary - i.e. selectively calling One or Other depending on actual column size)

I also found some references to tables that are used by SSIS - in that it would allocate memory buffers for the size of column being imported. Don't know if that is still a consideration? (It is not one in my situation)

I also found some chatter that VARCHAR(MAX) columns should be grouped at the end of the table (something referencing something that they thought that Kimberly and/or Kalen may have said ...)

I haven't found anything about possible performance differences of @MyParameter VARCHAR(8000) compared to @MyParameter VARCHAR(MAX), although it seems that it depends on the size of the actual stored data - so, again, perhaps "no measurable difference" when the actual data stored is < 8,000 characters (which, for me, is 98% of the time)

I wonder if the total defined size of the table's columns (excluding VARCHAR(MAX) is > 8K whether it would be better to use VARCHAR(MAX). For example "Company names and Address lines MIGHT be long, I'll allocate VARCHAR(1000) to all of them". Is there then a risk that, collectively, the VARCHAR(n) columns are >8K and SQL then raises an error on INSERT - whereas with VARCHAR(MAX) columns it would just store them as LOB_DATA. That might be an edge-condition worth considering for a table with a lot of large VARCHAR(n) columns

Tangential thoughts:

I do have SProcs that do this

CREATE PROCEDURE MyProc
    @SearchString varchar(40)
AS
SELECT @SearchString = '%' + NullIf(RTrim(@SearchString), '') + '%'

SELECT Col1, Col2, ...
FROM   Mytable
WHERE     @SearchString IS NULL
       OR MyVarchar_40_Column LIKE @SearchString

of course there is an edge condition here where the user has provided a search string EXACTLY 40 character long (actually 39 Chars would do). @SearchString is not big enough to accommodate the pair of '%' wildcards. So I have to remember to DECLARE them as 42, rather than 40, and they are no longer an exact match of the column's defined width - so code-checking becomes less straightforward. @SearchString varchar(MAX) would fix that.

It begs the question that IF @StringParameter varchar(MAX) has no performance penalty when actually used for small strings whether that isn't preferable. Saves me having to remember to DECLARE @SearchString as "Maximum length of underlying table's column PLUS TWO" and also handy where a user could provide excessively long data (e.g. APP error failing to enforce form-field length), or a wayward data import. SProc could then check that length of all parameters was <= Max permitted length instead of SQL Engine begin responsible for that.

Currently all our SPRoc parameters are same-size-as-table's-column (with the plus-two exception above). If we increase the size of a table's column we have an avalanche of SProc changes that we have to make. Missing just one will (silently ) truncate data which is REALLY bad ... on the plus side we DO have to visit every place in the code where the column is used, so we do actually get the chance to think about whether the increase in length impacts something. Although, arguing black-is-white for a moment!, if the target column is, in practical terms, a 10-char column but is actually declared as VARCHAR(MAX) it will fail-safe it we accidentally store the new column's 12-char value into it.

It also causes me to wonder if all VARCHAR table columns should be created as VARCHAR(MAX) and their actual max length enforced within the APP. Saves any issue with increasing the column width in the future - just change some meta data within the APP and carry on ... no APP changes, no Database DDL changes.

Discuss!!!!

We've had clients with stupidly long Name & Address columns in their associated DBs (how do they ever print an address label for an envelop? and yet they have EMail address columns defined oo small to fit the Max Possible RFC definied length ). If we had VARCHAR(MAX) for all Name and Address etc. columns we could adjust Max Validated Size in our Metadata on a client-by-client basis.

I suppose it all comes down to me THINKING there was a performance penalty of VARCHAR(MAX), particularly for Small Data, and now finding that there does not appear to be one ...

gbritton · August 4, 2015, 1:27pm

You get a penalty for every MAX column, since the length field is 4 bytes instead of 2.

Kristen · August 4, 2015, 2:34pm

I found an interesting article

http://rusanu.com/2010/03/22/performance-comparison-of-varcharmax-vs-varcharn/

"The code path that handles the MAX types (varchar, nvarchar and varbinary) is different from the code path that handles their equivalent non-max length types. The non-max types can internally be represented as an ordinary pointer-and-length structure. But the max types cannot be stored internally as a contiguous memory area, since they can possibly grow up to 2Gb. So they have to be represented by a streaming interface, similar to COM's IStream. This carries over to every operation that involves the max types, including simple assignment and comparison, since these operations are more complicated over a streaming interface. The biggest impact is visible in the code that allocates and assign max-type variables"

So just using

DECLARE @MyLocalStringVariable VARCHAR(MAX)

will run slower (assuming it does not ever need to store more than 8,000 characters) than a regular

DECLARE @MyLocalStringVariable VARCHAR(8000)

because it uses a streaming interface.

SO sounds like it might be worth changing all SProc @Parameters from VARCHAR(n) to VARCHAR(8000) to avoid having to adjust them if the underlying table column increases in size etc. but resist changing them to VARCHAR(MAX) until the underlying tables's column also moves to MAX.

JeffModen · August 5, 2015, 1:40am

Heh... nicely done, Kristen. Ah... not good enough. That's a freakin' awesome write up. You have the makings of a nice article there. You should write it up as one and submit it. Especially since you did finally come to the realization that it does actually slow things down (by a factor of 2 in my personal experience) .

JeffModen · August 5, 2015, 1:46am

As a bit of a sidebar, if the LogValuen columns can contain a mixture of datatypes, then instead of converting them to VARCHAR(8000), seriously consider using SQL_VARIANT, instead.

Kristen · August 5, 2015, 6:38am

Interesting. I decided to change my logging table yesterday, big job of course as it is the most active table in the database and critical that performance is top notch. That said, there are a couple of things that have crept in over the years which I'm not very happy with, now that I look at them again ..., so a great opportunity to have a look at SQL_VARIANT too. I would never have thought of that, thanks.

JeffModen · August 15, 2015, 2:48am

Thanks for the feedback, Kristen. I'd be interested in how things went for you. When we did it, it worked very well.