Alternate method to getting MAX date other than sub-query

Damian39 · May 19, 2015, 3:42pm

I've written a SQL query with multiple temp tables. Within one of these temp tables I have also written a sub-query in order to try and bring back only the MAX date values for each account. The sub-query is causing the query to run for a very long time, and I wanted to know if there was an alternate way to retrieve the true max date for each account. In the past I've tried to add MAX to the date column within the SELECT statement, but I continue to receive multiple dates for the same account instead of the max date. Using a sub-query in the past has worked, but in this case the sub-query is simply taking too long. I was hoping someone might have an alternate method I haven't tried yet. Thank you all for taking the time to read this post.

Damian

gbritton · May 19, 2015, 4:01pm

Please post your work so far -- table defs and queries

Damian39 · May 19, 2015, 4:09pm

Here's the coding I have done thus far:

[code]IF OBJECT_ID('TEMPDB..#TMP1C') IS NOT NULL DROP TABLE #TMP1C
IF OBJECT_ID('TEMPDB..#TMP2C') IS NOT NULL DROP TABLE #TMP2C
IF OBJECT_ID('TEMPDB..#TMP3C') IS NOT NULL DROP TABLE #TMP3C

SELECT PRIN_MON, AGNT_MON, SUB_ACCT_NO_MON, TRAN_CDE_MON, TRAN_AMT_MON, TRAN_DTE_MON, OP_ID_MON,
TERM_ID_MON, ITM_TYP_MON, ENTRY_REFNO_MON

INTO #TMP1C

FROM Vantage.dbo.MON_TRAN_BASE (NOLOCK)

WHERE ENTRY_REFNO_MON LIKE ('DEPO%')
AND TRAN_DTE_MON BETWEEN '2015-04-01' AND GETDATE()
AND PRIN_MON IN (6000,7500)
--AND SUB_ACCT_NO_MON = '8495600011049428'

--SELECT * FROM #TMP1C

SELECT A.PRIN_MON, A.AGNT_MON, A.SUB_ACCT_NO_MON AS DEP_ACCT_NO,
A.TRAN_CDE_MON AS DEPOSIT_CODE, A.ENTRY_REFNO_MON AS REFERENCE,
A.TRAN_AMT_MON AS DEPOSIT_AMT, A.TRAN_DTE_MON AS DEPOSIT_DTE,
A.OP_ID_MON AS DEPOSIT_OPID, A.TERM_ID_MON AS DEPOSIT_TERM,
A.ITM_TYP_MON AS ITEM_TYPE, M.PRIN_MON AS REV_PRIN, M.AGNT_MON AS REV_AGNT,
M.SUB_ACCT_NO_MON AS REV_ACCT_NO, M.TRAN_CDE_MON AS REV_CODE,
M.TRAN_DTE_MON AS REV_DATE, M.ENTRY_REFNO_MON AS REFERENCE2,
M.TRAN_AMT_MON AS REV_AMOUNT, M.OP_ID_MON AS REV_OPID,
M.TERM_ID_MON AS REV_TERM, M.ITM_TYP_MON AS ITEM_TYPE2

INTO #TMP2C

FROM #TMP1C AS A (NOLOCK) LEFT JOIN Vantage.dbo.MON_TRAN_BASE AS M (NOLOCK)
ON A.PRIN_MON = M.PRIN_MON
AND A.AGNT_MON = M.AGNT_MON
AND A.SUB_ACCT_NO_MON = M.SUB_ACCT_NO_MON

WHERE M.ENTRY_REFNO_MON LIKE ('DEPREV%')
AND M.TRAN_DTE_MON BETWEEN '2015-04-01' AND GETDATE()
AND M.TRAN_AMT_MON < 0
AND M.PRIN_MON IN (6000,7500)

--SELECT * FROM #TMP2C

SELECT DISTINCT A.PRIN_MON, A.AGNT_MON, A.DEP_ACCT_NO,
C.RES_NAME_SBB AS CUSTOMER_NAME,
CASE WHEN C.EXT_STAT_SBB = ' ' THEN 'ACTIVE'
ELSE ' ' END AS CUSTOMER_STATUS,
A.DEPOSIT_CODE,
A.REFERENCE,
A.DEPOSIT_AMT,
A.DEPOSIT_DTE,
A.DEPOSIT_OPID,
A.DEPOSIT_TERM,
H.CREATE_DTE_OHI AS DEP_CREATE_DTE,
H.ORDER_NO_OHI AS DEP_ORD_NO,
H.LS_CHG_DTE_OHI AS DEP_CHG_DTE,
A.ITEM_TYPE,
A.REV_PRIN,
A.REV_AGNT,
A.REV_CODE,
A.REV_DATE,
A.REFERENCE2,
A.REV_AMOUNT,
A.REV_OPID,
A.REV_TERM,
A.ITEM_TYPE2,
H.CONNECT_DTE_OHI,
H2.CREATE_DTE_OHI AS REV_CREATE_DTE,
H2.ORDER_NO_OHI as REV_ORD_NO,
H2.LS_CHG_DTE_OHI AS REV_CHG_DTE,
-- H.SERV_CDE_OHI,
H.BEF_QTY_OHI,
H.AFT_QTY_OHI

INTO #TMP3C

FROM #TMP2C AS A (NOLOCK) LEFT JOIN Vantage.dbo.OHI_HIST_ITEM AS H (NOLOCK)

ON A.PRIN_MON = H.PRIN_OHI
AND A.AGNT_MON = H.AGNT_OHI
AND A.DEP_ACCT_NO = H.SUB_ACCT_NO_OHI
AND H.ACCT_STAGE_OHI = 'C' AND H.ITEM_STATUS_OHI IN ('C','O','P','B')

LEFT JOIN Vantage.dbo.OHI_HIST_ITEM AS H2 (NOLOCK)

ON A.REV_PRIN = H2.PRIN_OHI
AND A.REV_AGNT = H2.AGNT_OHI
AND A.REV_ACCT_NO = H2.SUB_ACCT_NO_OHI
AND H2.ACCT_STAGE_OHI = 'C' AND H2.ITEM_STATUS_OHI IN ('C','O','P','B')

    LEFT JOIN Vantage.dbo.SBB_BASE AS C (NOLOCK)

ON H.PRIN_OHI = C.PRIN_SBB
AND H.AGNT_OHI = C.AGNT_SBB
AND H.SUB_ACCT_NO_OHI = C.SUB_ACCT_NO_SBB

--WHERE A.SUB_ACCT_NO_MON = '8495752510982979'

WHERE DATEDIFF(D,A.DEPOSIT_DTE,A.REV_DATE) <= 90
AND C.EXT_STAT_SBB = ' '
AND H.CONNECT_DTE_OHI = (SELECT MAX(CONNECT_DTE_OHI) AS CONNECT_DTE_OHI FROM Vantage.dbo.OHI_HIST_ITEM as B (NOLOCK)
WHERE A.DEP_ACCT_NO = B.SUB_ACCT_NO_OHI)

GROUP BY A.PRIN_MON, A.AGNT_MON, A.DEP_ACCT_NO, C.RES_NAME_SBB, C.EXT_STAT_SBB,
A.DEPOSIT_CODE,
A.REFERENCE,
A.DEPOSIT_AMT,
A.DEPOSIT_DTE,
A.DEPOSIT_OPID,
A.DEPOSIT_TERM,
H.CREATE_DTE_OHI,
H.LS_CHG_DTE_OHI,
H.ORDER_NO_OHI,
A.ITEM_TYPE,
A.REV_PRIN,
A.REV_AGNT,
A.REV_CODE,
A.REV_DATE,
A.REFERENCE2,
A.REV_AMOUNT,
A.REV_OPID,
A.REV_TERM,
A.ITEM_TYPE2,
H.CONNECT_DTE_OHI, H2.CREATE_DTE_OHI, H2.ORDER_NO_OHI, H2.LS_CHG_DTE_OHI,
--H.SERV_CDE_OHI,
H.BEF_QTY_OHI,
H.AFT_QTY_OHI

--HAVING H.AFT_QTY_OHI - H.BEF_QTY_OHI > 0

SELECT DISTINCT A.PRIN_MON, A.AGNT_MON, A.DEP_ACCT_NO,
A.CUSTOMER_NAME,
A.CUSTOMER_STATUS,
A.DEPOSIT_CODE,
A.REFERENCE,
A.DEPOSIT_AMT,
A.DEPOSIT_DTE,
A.DEPOSIT_OPID,
A.DEPOSIT_TERM,
--A.DEP_CREATE_DTE,
A.DEP_ORD_NO,
--A.DEP_CHG_DTE,
A.ITEM_TYPE,
A.REV_PRIN,
A.REV_AGNT,
A.REV_CODE,
A.REV_DATE,
A.REFERENCE2,
A.REV_AMOUNT,
A.REV_OPID,
A.REV_TERM,
A.ITEM_TYPE2,
A.CONNECT_DTE_OHI,
--A.REV_CREATE_DTE,
A.REV_ORD_NO--,
--A.REV_CHG_DTE

FROM #TMP3C AS A (NOLOCK)
[/code]

gbritton · May 19, 2015, 4:24pm

First off, remove all the (NOLOCK) hints. NOLOCK is considered harmful and should only be used when the consequences are fully understood and accepted.

Second, try adding indexes to the temp tables you are creating. Ensure that the JOIN and WHERE predicates are covered by indexes.

Damian39 · May 19, 2015, 4:29pm

I have been told numerous times by the developers at the company I work at using NOLOCK is crucial, because there are numerous automated jobs that run off of these databases all day. When NOLOCK isn't used, queries tie up the databases thus hindering these automated jobs from completing. Also, I've never had a need to add indexes to any temp tables before and am unfamiliar with how to use them. Is there a tutorial anywhere I can refer to in order to better understand how to use indexing within my temp tables?

gbritton · May 19, 2015, 4:39pm

The other devs at your company are flirting with danger. NOLOCK should never, ever be used in production code. You run the risk (which approaches certainty on a busy system) of dirty reads. Worse yet, if you use NOLOCK for financial reporting, your company can misstate its results which carries very real legal risks as well. Whenever I find NOLOCK in a production job, I remove it immediately. My company understands the risks involved. The most I would allow is READPAST, which takes the same locks as NOLOCK but eliminates dirty reads. Of course, with both you may miss rows that are being (or have been just) inserted.

Locking is there to maintain data integrity -- both for data going in (INSERT, UPDATE, DELETE) and data going out (SELECT). Trying to defeat locking is saying that you don't care about data integrity. Try stating that on your next job interview!

To add an index (to any table, including temp tables)

CREATE INDEX myindex ON mytable(col1, col2, col3)

Kristen · May 19, 2015, 5:50pm

Leave!

Absolutely terrifying to have NOLOCK on production code. It does't happen often, but sooner or later someone will have some rows included twice, or missing altogether, from a report, or get a Dirty Read of an uncommitted transaction included in a report when the data is then rolled back. If they take a business decision based on a report it could be critical for them and the company. There are absolutely no circumstances where NOLOCK should be used on production code.

We use it here on 2 or 3 reports used only by DBAs (who know the consequences) just so that they can run diagnostics on a live system without causing interference to the users.

What you are probably looking for, instead, is to set the database to READ_COMMITTED_SNAPSHOT. There are a very few circumstances where this might make your code behave differently, so you can't just set that and cross your fingers! but there is a 99% or better chance that that setting will work for you without code change (other than taking all the NOLOCK statements out of your code ...)

I can't see that the NOLOCK on your #TEMP tables will do anything anyway - #TEMP tables are only available to one process, so you are the only user of those tables and you aren't going to BLOCK yourself!

We try to always put a Primary Key on a #TEMP table. That's normally enough to solve improve performance by ensuring that any columns used in a JOIN are part of the Clustered Index on the #TEMP table. Although looking at your code you have
PRIN_MON, AGNT_MON, DEP_ACCT_NO
on #TMP2C in the first JOIN but different columns in the second join:
REV_PRIN, REV_AGNT, REV_ACCT_NO
so a Primary Key on one set and an Index on the other should help.

You are also likely to find that indexes on the underlying tables involved in the join (if not already present) will help:
OHI_HIST_ITEM = PRIN_OHI, AGNT_OHI, SUB_ACCT_NO_OHI, ACCT_STAGE_OHI, ITEM_STATUS_OHI
That will do for both joins to #TMP2C
SBB_BASE = PRIN_SBB, AGNT_SBB, SUB_ACCT_NO_SBB, EXT_STAT_SBB

I have sometimes found that the SQL optimises misses an opportunity with inner sub-selects, so you might, hopefully!, find that changing

(SELECT MAX(CONNECT_DTE_OHI) AS CONNECT_DTE_OHI FROM Vantage.dbo.OHI_HIST_ITEM as B
                            WHERE    A.DEP_ACCT_NO    =    B.SUB_ACCT_NO_OHI)

to add a JOIN to #TMP2C (i.e. duplicating the "filter" which exists in the outer code) might reduce the number of rows that the inner MAX() sub-select processes, and that that speeds things up.l

WHERE DATEDIFF(D,A.DEPOSIT_DTE,A.REV_DATE) <= 90

is probably expensive because you are using a function in there WHERE clause, which is probably preventing any index on the two date columns from being used.

Can you apply that filter to the initial populating of #TEMP2C instead?

Are the DEPOSIT_DTE and REV_DATE columns DATE datatype? or DATETIME but only contain "pure dates", or do they also contain a TIME element? If they have a TIME element just comparing them using

WHERE DATEDIFF(D,A.DEPOSIT_DTE,A.REV_DATE) <= 90

has an edge condition where one is "early in the day" and the other "late in the day" which will probably mean that you are excluding some rows that your intention was to include.

SELECT DISTINCT ... INTO #TMP3C ...

I haven't looked carefully at what your code is doing at that point, but I always aim to code a SELECT in such a way as to not need a DISTINCT. If DISTINCT is needed then for sure SQL will be collecting excess rows, and then sorting them, and then throwing away the duplicates. That inevitably takes time. For example, using EXISTS instead of a JOIN can remove the need for a DISTINCT in some circumstances.

Again, you have

SELECT DISTINCT ...
FROM #TMP3C

when you have just used DISTINCT to populate #TMP3C so it shouldn't be needed (but I haven't checked that the column list is the same). Leaving that in the code might disguise a problem, in future, that will be hard to find as a consequence. "Keep it tidy"

Damian39 · May 19, 2015, 6:03pm

Thanks so much for your input Kristen. Just to clarify, I am not a DBA and really only have read-only access to the databases. I reached out to one of our senior managers that writes and runs queries all day, and this was his reply to why we use NOLOCK within our queries:

Well if you were working at say a facebook that
has millions of transactions then I would say don’t use it. Since we are
strictly read, non-transactional shop and wouldn’t see a dirty record, we would
always want to use it here.

I have also gone through my query to remove any distincts that I do not need. The DATE datatype also contain a time stamp to them. I will see if I can place the DateDiff somewhere other than a WHERE clause.

Kristen · May 19, 2015, 6:24pm

The database could be set to READ ONLY then ...

... but if there are any updates to the tables that you are referencing then your manger is wrong I'm afraid.

In case you are interested here's a potential scenario:

Another user inserts a record into a table. It happens that the index page is full, so SQL splits the page into two halves, writes the first half back to the original page and creates a brand new page for the second half. Either the first or second half's page has the additional new entry.

Your query (using NOLOCK) is running at exactly the same time ... one of two things may happen:

Your query reads the index page just before it is split. Yout query does a bit of processing and then your query reads the next index page ... by which time the other half-page has been written already, so your query includes the second half of the page again. You get those records included twice in your Select (which will either screw up your Totals, or the duplicate rows in the resultset will break your application (the second outcome is obviously less dangerous but can be a cause of completely unexplained, and unreproducible, errors reported by users)

In the second scenario your query reads the index page just after it has been split. You get half the rows, from the first page, but your query processes the index before SQL has had a chance to insert the new "half page", so your query is totally missing those rows. That is unlikely to break your application (unless it happens to compare the totals are compared to some other query and found to be different), so the report will just be missing those rows.

Most people think of dirty reads as "Someone saves a record, it is written to disk and then some logic triggers a ROLLBACK" and that record got included in your report (and is now absent from the database). IF someone created a transaction for a $Billion and it was rejected then, yeah, that might screw up your report!!

In the real world "dirty reads" from index page splits are far, far more likely, and far, far more damaging, than a single record being rolled back.

Impossible to reproduce, almost impossible to diagnose, but looks and feels just like a smoking gun ...

gbritton · May 19, 2015, 6:31pm

"Since we are strictly read, non-transactional shop"

Then NOLOCK will not make a spec of difference. Since no updates are being done to your data, according to your manager, all the data that will be committed has been committed. You can't block an INSERT/UPDATE/DELETE since no one is doing that (since you're strictly read). You can have any number of simultaneous reads, hence NOLOCK in your case does nothing.

Damian39 · May 19, 2015, 7:37pm

I truly thank you all for the information you've given me concerning NOLOCK and using them within my queries. gbritton, I thought the same thing, that since I am not writing directly to a table, then NOLOCK shouldn't make any difference. Until I ran a query one night, and it was running for a while so I allowed it to run overnight. Upon coming to work the next morning, I was greeted by multiple users jumping on my case, because there were a number of automated jobs that run during the evening that were unable to run due to me not using NOLOCK within my query. As a result, a number of reports that were dependent on those jobs running successfully were unable to be produced until the jobs completed. This is one case in which they felt the need to further explain to me the importance of using NOLOCK when creating my queries. After reading both you and Kristen's posts, I have to say, I am thoroughly confused. I understand the access I have is read only, but according to everyone at my company, creating a query without using NOLOCK can, and in most cases will, lock up tables preventing anyone else from querying the same tables, or an automated job from running that is also accessing the same tables.

Kristen · May 20, 2015, 7:43am

Its the design of your system that is at fault ...

It doesn't work like that, if SELECT users caused no interference with Updating users there would be no need to have any locks at all - other than Updating Users locking out other Updating Users. What you describe / envisage is exactly what READ_COMMITTED_SNAPSHOT achieves

READ_COMMITTED_SNAPSHOT is probably the solution you actually need but that wasn't introduced until (I think) SQL2005 and many places that use NOLOCK did so prior to that version because there was nothing else available at the time (and the real dangers of index page splits were not well known at the time). READ_COMMITTED_SNAPSHOT would allow your report to run without interfering with other updating processes (but its a database-wide setting, so you can't just use that on your query).

READ_COMMITTED_SNAPSHOT uses the TEMPDB temporary database to create Snapshots of data that SELECT queries need so that they do not interference with Updating users

So you have been told to use NOLOCK so that you don't interfere with overnight processes etc. that are running ...

... what about when YOUR report gets a duplicate row, or has rows missing, from an Index Page Split. Who gets the chop then ... you? or your senior manager?

gbritton · May 20, 2015, 1:37pm

So, your manager is lying to you! You are not a read-only shop. Others are updating the tables. Kristen's suggestion will help you. Use RCSI.

I'm also concerned that you are apparently running test queries on a production system. Surely you have a development environment for that sort of thing?