Fix performance of database 60GB with json strings

sapator · September 21, 2016, 8:38am

Hi.
Having a database like this:

The main problems are.
Apilog:
AbsolutePath,Query,RequestHeaders,RequestBody,ResponseHeaders,ResponseBody
These are Json strings that contain large information. These have different info all the time.

Database size: It can go to 60GB every 10 days, so we are cleaning it constantly.

Response: A select from one row can take 30 seconds and for more rows 1-2 minutes.
The DB is sql 2008 R2 and the server is a new HP one (that is probably why it can handle 60GB selects)

Web site. This db is bound to a web site. Each time we do a select the web site is blocked.
We do not have set any indexes as we are not sure if is worth to set to json nvarchar and what the complications would be. As this site is live, we cannot do much experiment.

So the question is, what can we do to improve performance and if we need to set some indexes and where.
Thanks.

Kristen · September 21, 2016, 9:27am

What is the SQL for the query which is slow? That's the thing to look at, not the size / structure of the database. Looking at the Query Plan, and seeing "how" SQL is tackling the query, and then trying indexes to improve the Query Plan / Performance.

Rebuilding Indexes, when their fragmentation is above some threshold that you set, would then be import, along with ensuring that Statistics are maintained periodically for those indexes. You might also consider changing Adhoc Statistic Updates to be ASync rather than Sync (which is the default) as once that is triggered in Sync mode all queries sit&wait for the new Stats, whereas in ASync the queries will run, using the old stats, in to my mind that would be no worse than if the Stats were not updated for another millisecond or two!

sapator · September 21, 2016, 9:40am

Hi.
The sql is slow when trying to parse the Json - using like (at least slower than just selecting a non json column).
So doing

select something from Requestbody where where id = xxx and Requestbody like '%somesearch%'

That will slow down everything.
Also if I do not specify specific id's and just try to search the Json, it can take forever.

We have a rebuild index and statistics plan that runs on every DB every week.
It is a standard rebuild that our DB admin is using for years in every database and it works well.
I'm not sure only about the adhoc statistics .
Thanks.

Kristen · September 21, 2016, 9:58am

select something from Requestbody where where id = xxx and Requestbody like '%somesearch%'

The wildcard on [Requestbody] will be slow, it would be helped if there was an index on [id] AND the selectivity of [id] is reasonably good (it you have thousands of rows for each [id] value, for THAT table, then SQL may choose not to use the index and table-scan every row instead

-- SET SHOWPLAN_TEXT ON
GO
-- SET STATISTICS IO ON; SET STATISTICS TIME ON

DECLARE @StartTime datetime = GetDate();

select something 
from Requestbody 
where id = xxx 
and Requestbody like '%somesearch%'


SELECT	[Elapsed(ms)] = DATEDIFF(ms, @StartTime, GetDate())

SET STATISTICS IO OFF; SET STATISTICS TIME OFF; SET STATISTICS PROFILE OFF
GO
SET SHOWPLAN_TEXT OFF
GO
SET SHOWPLAN_ALL OFF
GO

Run from the SHOWPLAN comment for the Query Plan (the query is NOT run), and from the STATISTICS comment for the number of scans and the amount of I/O (the query IS run, so this will be the actual run-time for the query). I would do the statistics first, to get the base line.

In the SHOWPLAN I would look for which index is used for each table(s) in the query, and if that seems like the best choice. For example, if the Clustered Index is used when there is a dedicated index on the main column in your query (i.e. a column which is NOT in the Clustered Index) I would investigate why that is.

Also look at which tables/indexes use a SCAN rather than a SEEK

There is more to the tuning process than this, but they are my first "basic" steps to getting a handle on how things might be improved.

Try commenting out the LIKE as well, and seeing how much that improves things - although that's only good just-for-comparison of the baseline, of course.

sapator · September 21, 2016, 10:07am

Hi.
Will have a look and see what I can find out.
Unfortunately the "like" is commonly used as that is how we search a Json in pre 2016, so I don't think I can do something about that.
Will have a look on everything else you write.
Thanks again!

Kristen · September 21, 2016, 10:29am

Understood. Only thing that will improve it is an index on [id] (unless [Requestbody] is short enough to, also, put in that index)

Only other choice is to free-word-index the [Requestbody] column. You could use SQL's text tools for that but, I suspect, you will fall foul of its reserve words and stuff like that. We do [our own] free word indexing of things like Names and Addresses, and product names and descriptions, including using STEMming of words etc., for faster searching, but building your own is not a trivial undertaking.

sapator · September 21, 2016, 12:31pm

Thanks.

Ifor · September 21, 2016, 1:03pm

It may be worth trying to setup transactional replication to another DB for table where selects block your web site. The selects could then be run from the replicated DB.

You may also want to look at these JSON functions. eg It might be possible to change to application to save the JSON as XML which could then be indexed etc.

It may just be simpler to upgrade to SQL2016.

sapator · September 21, 2016, 1:17pm

You are correct on both.
Unfortunately
1)WE do not have the luxury to copy 60GB per day to another DB ( or at least we are talking about that and will see)
2)I has insisting on creating the functions to insert XML and not JSON but nobody listened

The xml try is here:

Can't get it to work tho as doing something lie this will complain about cannot insert identity


  declare @json nvarchar(max) 


 DECLARE @intErrNo int, @intRowCount int

select IDENTITY(int, 1, 1) AS [MyID],(sessionid),(AbsolutePath),(query),(RequestHeaders),(RequestBody),(PartnerId),
(VistaMember),(ResponseHTTPStatusCode),(ResponseHeaders),(ResponseBody),(LogCreated) Into 
    #MyTempTable
    FROM [APILog]
  where AbsolutePath like '%complete%'
  and id = 52342

ALTER TABLE #MyTempTable ADD OriginalID INT

  DECLARE    @intLoop int = 1    

WHILE @intLoop >= 1
BEGIN
SELECT TOP 1
    @json = ResponseBody,
        @intLoop = MyID+1
    FROM    #MyTempTable
    WHERE    [MyID] >= @intLoop
    ORDER BY [MyID]
    SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT

    IF @intRowCount = 0
    BEGIN
        SELECT    @intLoop = 0    -- Prevent any more iterations
        GOTO LoopExit
    END

--    ... perform loop tasks based on @Col1, @Col2 ...
--insert into #MyTempTable 
if @intLoop =2 
Begin

 select * Into 
    #tmpBigTable
    from (
     select * from parseJSON(@json) ) AS MyAliasName
    
     SET IDENTITY_INSERT #tmpBigTable ON;
     End
ELSE  
 SET IDENTITY_INSERT #tmpBigTable ON;
    Insert into  #tmpBigTable
      select * from parseJSON(@json)  AS MyAliasName;
--select * from parseJSON(@json) 

LoopExit:

END    -- Next loop iteration
    -- Equivalent to FETCH NEXT FROM MyCursor INTO @Col1, @Col2, ...

select * from #tmpBigTable
DROP TABLE #tmpBigTable
-- Cleanup
DROP TABLE #MyTempTable    -- Equivalent to CLOSE MyCursor DEALLOCATE MyCursor

ScottPletcher · September 21, 2016, 2:24pm

In general, for this structure of WHERE condition:
where where id = xxx and Requestbody like '%somesearch%'
You would need an index keyed on id that included Requestbody. That will be some overhead, but it will be allowed by SQL.

It also might be worth converting the json to text and using full-text search instead. Then again, it may not. Just a possibility, not enough details to be sure.

Kristen · September 21, 2016, 2:44pm

[Requestbody] is nvarchar(max), is that a problem for an Index column? I this there is a limit, but maybe SQL can check the index [containing a truncated value] and if it found, or the index contains the whole width of that particular row's column, all well and good and if not found then go check the full column value for that row anyway ...

Perhaps if Clustered Index is on [id] that would be the smallest amount of head-travel to get to the data, relevant to a specific [id] ?

I'm guessing all these things, as usual!, but would be interested in how you see it.

ScottPletcher · September 21, 2016, 3:39pm

Since your performance is so poor now, I'd consider forcing all nvarchar(max) values out of row. And I'd review and verify the daily indexing code, no matter how well it seems to have worked in the past.

@Kristen:
Yes, nvarchar max can be included in an index.

And based on the table definition shown in the initial post, the table is already uniquely clustered on id. But because of that, requestor can't realistically always specify the id as part of a "search", because that limits the search to one row.

Kristen · September 21, 2016, 3:49pm

I missed that ... thanks. I was thinking of:

and of no Schema known for that table.

In that case I'm at a lost because of your comment:

and this snippet from the code:

Surely that is going to match the single row id = 52342 (provided that that row also contains "complete" in [AbsolutePath]), and there will be no performance issue with that.

So I must be missing something obvious as to where the performance problem is coming from.

yosiasz · September 21, 2016, 7:15pm

do it in batches

first
Create an index on id

create table #theforceawakens(col1 , col2, col3, Requestbody )

insert into #theforceawakens
select col1 , col2, col3, Requestbody from Requestbody where where id = xxx

--now you will be working in a subset
select
from #theforceawakens
where Requestbody like '%somesearch%'

otherwise doing where Requestbody like '%somesearch%' on the source table with millions of rows is gonna hurt real bad

sapator · September 22, 2016, 12:02pm

OK.
So basically we are thinking about what to do.
The db is set up with indexes and rebuilds as I saw.
I think the problem is that there is just too much data to the DB.
We are trying to clean out some calls we do not need to be inserted to SQL (p.e. when a client opens up the site and before log in, we get 7 rows of unwanted data) and we are trying to figure out if we are going to have a backup db from 7-10 days before so we can run quires there.
Thanks.

yosiasz · September 22, 2016, 4:23pm

are you able to create an Archive database and push older unneeded data to it and keep the needed data in the main database?

sapator · September 23, 2016, 6:54am

We are thinking about it.
Keep data on archive.
Thing is we need to find some space and also we need to find a "hole" for the job to run so it does not block other stuff that run at night.

Ifor · September 23, 2016, 10:27am

If the structure of Requestbody is completely different for each row, and you do not want to upgrade to SQL2016, it may be worth looking at removing it from SQL and storing it in a document database like MongoDB.

sapator · September 23, 2016, 11:05am

Never heard of that.
I'm not sure how will use a select though.
Will it be something like a "linked server"?

Ifor · September 23, 2016, 11:17am

You would have to look into how to use it. I suspect it would need to be accessed via .NET.

It just occurred to me that a SQL database might not be the best place to store lots of very different JSON documents and a nosql document centric DB, like MongoDB, might be a better fit. The SQL database could then just be used for the relational data which is, after all, what it is good at. We do not know your data so it is difficult to tell.