Fix performance of a 60GB database with JSON strings

select something from Requestbody where id = xxx and Requestbody like '%somesearch%'

The wildcard on [Requestbody] will be slow. It would help if there was an index on [id] AND the selectivity of [id] is reasonably good (if you have thousands of rows for each [id] value in that table, then SQL may choose not to use the index and table-scan every row instead :( )

-- SET SHOWPLAN_TEXT ON
GO
-- SET STATISTICS IO ON; SET STATISTICS TIME ON

DECLARE @StartTime datetime = GetDate();

select something 
from Requestbody 
where id = xxx 
and Requestbody like '%somesearch%'


SELECT	[Elapsed(ms)] = DATEDIFF(ms, @StartTime, GetDate())

SET STATISTICS IO OFF; SET STATISTICS TIME OFF; SET STATISTICS PROFILE OFF
GO
SET SHOWPLAN_TEXT OFF
GO
SET SHOWPLAN_ALL OFF
GO

Run the script from the SHOWPLAN comment to get the Query Plan (the query is NOT run), and from the STATISTICS comment to get the number of scans and the amount of I/O (the query IS run, so this will be the actual run time for the query). I would do the statistics first, to get the baseline.

In the SHOWPLAN I would look at which index is used for each table in the query, and whether that seems like the best choice. For example, if the Clustered Index is used when there is a dedicated index on the main column in your query (i.e. a column which is NOT in the Clustered Index), I would investigate why that is.

Also look at which tables/indexes use a SCAN rather than a SEEK
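One quick way to compare, a sketch only (the index name is hypothetical): force the dedicated index with a table hint and compare the STATISTICS IO and elapsed time against the plan the optimiser chose on its own.

select something
from Requestbody WITH (INDEX (IX_Requestbody_id))   -- hypothetical index name, for comparison only
where id = xxx
and Requestbody like '%somesearch%'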

There is more to the tuning process than this, but those are my first "basic" steps to getting a handle on how things might be improved.

Try commenting out the LIKE as well, and see how much that improves things - although that's only good for comparison against the baseline, of course.


Hi.
Will have a look and see what I can find out.
Unfortunately the "like" is commonly used, as that is how we search JSON pre-2016, so I don't think I can do anything about that.
Will have a look at everything else you wrote.
Thanks again!

Understood. The only thing that will improve it is an index on [id] (unless [Requestbody] is short enough to also put in that index).

The only other choice is to free-word-index the [Requestbody] column. You could use SQL's full-text tools for that but, I suspect, you will fall foul of its reserved words and stuff like that. We do [our own] free-word indexing of things like Names and Addresses, and product names and descriptions, including using STEMming of words etc., for faster searching, but building your own is not a trivial undertaking.
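For what it's worth, a minimal sketch of the built-in full-text route; the catalog and key-index names are assumptions, and the table/column names are taken from the code later in the thread:

CREATE FULLTEXT CATALOG ftcAPILog;

CREATE FULLTEXT INDEX ON dbo.APILog (RequestBody)
    KEY INDEX PK_APILog        -- must be an existing single-column, unique, non-nullable index; name assumed
    ON ftcAPILog;

-- CONTAINS can use the full-text index; LIKE '%...%' never can
SELECT sessionid, RequestBody
FROM   dbo.APILog
WHERE  id = 52342
  AND  CONTAINS(RequestBody, 'somesearch');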


Thanks.

It may be worth trying to set up transactional replication to another DB for tables where selects block your web site. The selects could then be run against the replicated DB.

You may also want to look at the JSON functions. E.g. it might be possible to change the application to save the JSON as XML, which could then be indexed etc.
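A rough sketch of that idea, assuming a new xml column were added alongside the existing data (the column name is made up):

ALTER TABLE dbo.APILog ADD RequestBodyXml xml NULL;

-- A primary XML index needs the table to have a clustered primary key
CREATE PRIMARY XML INDEX PXML_APILog_RequestBodyXml
    ON dbo.APILog (RequestBodyXml);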

It may just be simpler to upgrade to SQL2016.
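To show why 2016 would help, a hedged sketch of what its JSON functions allow; the '$.status' property path is purely an assumption about the shape of the JSON:

-- Query a specific JSON property instead of LIKE over the whole string
SELECT something
FROM   dbo.APILog
WHERE  id = 52342
  AND  JSON_VALUE(RequestBody, '$.status') = 'complete';

-- A computed column over a JSON property can even be indexed
ALTER TABLE dbo.APILog
    ADD RequestStatus AS JSON_VALUE(RequestBody, '$.status');
CREATE INDEX IX_APILog_RequestStatus ON dbo.APILog (RequestStatus);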


You are correct on both.
Unfortunately:
1) We do not have the luxury of copying 60GB per day to another DB (or at least we are talking about that and will see).
2) I had insisted on creating the functions to insert XML rather than JSON, but nobody listened.

The XML try is here:

Can't get it to work though, as doing something like this will complain about not being able to insert an identity:


declare @json nvarchar(max)

DECLARE @intErrNo int, @intRowCount int

-- Number the matching rows so they can be processed one at a time
SELECT IDENTITY(int, 1, 1) AS [MyID], sessionid, AbsolutePath, query, RequestHeaders, RequestBody, PartnerId,
       VistaMember, ResponseHTTPStatusCode, ResponseHeaders, ResponseBody, LogCreated
INTO   #MyTempTable
FROM   [APILog]
WHERE  AbsolutePath like '%complete%'
AND    id = 52342

-- Create an empty results table with the same columns as parseJSON's output.
-- The join in this SELECT ... INTO stops the IDENTITY property of parseJSON's
-- identity column being copied to #tmpBigTable, which is what made the later
-- plain INSERTs complain about inserting an identity column.
SELECT j.*
INTO   #tmpBigTable
FROM   parseJSON(N'{}') AS j
CROSS JOIN (SELECT 1 AS one) AS x
WHERE  1 = 0

DECLARE @intLoop int = 1

WHILE @intLoop >= 1
BEGIN
    SELECT TOP 1
        @json = ResponseBody,
        @intLoop = MyID + 1
    FROM    #MyTempTable
    WHERE   [MyID] >= @intLoop
    ORDER BY [MyID]
    SELECT @intErrNo = @@ERROR, @intRowCount = @@ROWCOUNT

    IF @intRowCount = 0
    BEGIN
        SELECT @intLoop = 0    -- Prevent any more iterations
        BREAK
    END

--  ... perform loop tasks based on @json ...
    INSERT INTO #tmpBigTable
    SELECT * FROM parseJSON(@json)

END    -- Next loop iteration
    -- Equivalent to FETCH NEXT FROM MyCursor INTO @Col1, @Col2, ...

SELECT * FROM #tmpBigTable
DROP TABLE #tmpBigTable
-- Cleanup
DROP TABLE #MyTempTable    -- Equivalent to CLOSE MyCursor; DEALLOCATE MyCursor

In general, for this structure of WHERE condition:
where id = xxx and Requestbody like '%somesearch%'
You would need an index keyed on id that included Requestbody. That will be some overhead, but it will be allowed by SQL.
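A sketch of that index, using the table/column names from the code earlier in the thread:

CREATE NONCLUSTERED INDEX IX_APILog_id_incl_RequestBody
    ON dbo.APILog (id)
    INCLUDE (RequestBody);    -- nvarchar(max) can be an INCLUDEd column, but not a key column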

It also might be worth converting the json to text and using full-text search instead. Then again, it may not. Just a possibility, not enough details to be sure.

[Requestbody] is nvarchar(max); is that a problem for an index column? I think there is a limit, but maybe SQL could check the index [containing a truncated value] and, if the value is found, or the index contains the whole width of that particular row's column, all well and good; if not found, then go and check the full column value for that row anyway ...

Perhaps if the Clustered Index is on [id] that would be the smallest amount of head-travel to get to the data relevant to a specific [id]?

I'm guessing at all these things, as usual, but would be interested in how you see it.

Since your performance is so poor now, I'd consider forcing all nvarchar(max) values out of row. And I'd review and verify the daily indexing code, no matter how well it seems to have worked in the past.
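A hedged sketch of the out-of-row option (table name assumed); note that existing values only move out of row when they are next updated or rewritten:

EXEC sp_tableoption 'dbo.APILog', 'large value types out of row', 1;

-- Optional, and very heavy on a 60GB table: rewriting the column pushes existing values out of row
-- UPDATE dbo.APILog SET RequestBody = RequestBody;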

@Kristen:
Yes, nvarchar(max) can be included in an index.

And based on the table definition shown in the initial post, the table is already uniquely clustered on id. But because of that, the requestor can't realistically always specify the id as part of a "search", because that limits the search to one row.


I missed that ... thanks. I was thinking of the ad-hoc query in the original post, and of no schema being known for that table.

In that case I'm at a loss, because of your comment that specifying the id limits the search to one row, and this snippet from the code:

  where AbsolutePath like '%complete%'
  and id = 52342

Surely that is going to match the single row id = 52342 (provided that that row also contains "complete" in [AbsolutePath]), and there will be no performance issue with that.

So I must be missing something obvious as to where the performance problem is coming from.

Do it in batches.

First, create an index on id.

create table #theforceawakens (col1 int, col2 int, col3 int, Requestbody nvarchar(max))   -- adjust the column types to match your table

insert into #theforceawakens
select col1, col2, col3, Requestbody from Requestbody where id = xxx

-- now you will be working in a subset
select *
from #theforceawakens
where Requestbody like '%somesearch%'

otherwise doing where Requestbody like '%somesearch%' on the source table with millions of rows is gonna hurt real bad


OK.
So basically we are thinking about what to do.
The DB is set up with indexes and rebuilds, as I saw.
I think the problem is that there is just too much data in the DB.
We are trying to clean out some calls that do not need to be inserted into SQL (e.g. when a client opens up the site, before logging in, we get 7 rows of unwanted data), and we are trying to figure out if we are going to have a backup DB from 7-10 days before so we can run queries there.
Thanks.

Are you able to create an Archive database and push older, unneeded data to it, keeping the needed data in the main database?

We are thinking about it.
Keep data in an archive.
Thing is, we need to find some space, and we also need to find a "hole" for the job to run in so it does not block other stuff that runs at night.
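If it helps, a hedged sketch of the kind of batched move that keeps each transaction, and therefore the blocking, short; the archive table, the 5000-row batch size and the 10-day cut-off are all assumptions:

WHILE 1 = 1
BEGIN
    -- Move one small batch per iteration; each DELETE is its own short transaction
    DELETE TOP (5000) src
    OUTPUT deleted.* INTO dbo.APILog_Archive
    FROM   dbo.APILog AS src
    WHERE  LogCreated < DATEADD(day, -10, GETDATE());

    IF @@ROWCOUNT = 0 BREAK;
END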

If the structure of Requestbody is completely different for each row, and you do not want to upgrade to SQL2016, it may be worth looking at removing it from SQL and storing it in a document database like MongoDB.


Never heard of that.
I'm not sure how we would use a select, though.
Will it be something like a "linked server"?

You would have to look into how to use it. I suspect it would need to be accessed via .NET.

It just occurred to me that a SQL database might not be the best place to store lots of very different JSON documents, and a NoSQL, document-centric DB like MongoDB might be a better fit. The SQL database could then just be used for the relational data, which is, after all, what it is good at. We do not know your data, so it is difficult to tell.

The JSON data is (as you can see in the table picture) inside the web log.
What you suggest is that the application write the log data to MongoDB and everything else to SQL.
Again, we would need to combine the info as, from the log, we are breaking down the data (ids, transaction numbers etc.) so we can find the data in SQL.
So if we do that, we need some link between MongoDB and SQL Server so we can use the data on both sides.

P.S. I'm not sure we can get a go-ahead on something like this, since this is a multinational company and I'm not sure about the budgets.

Presumably you need to ask yourself how much of the data in Requestbody you will actually use. (1%, 10%, 90%?)

One would really need to know a lot more about the system to make recommendations - not something that can be done on a forum. I only floated the idea as a possibility.

Yes, I understand.
Anyhow, it will probably be 30-40% of the Requestbody and 5-10% of the other JSON columns.
The system is about 40-50 SQL servers divided among 10 sub-companies. We get the log data from these servers into the APILOG (for the web site, so not all the SQL data, just internet sales).
Everything is Microsoft based.
Thanks for the suggestions, btw.