Confused about Joins

Josh_Hawley · September 9, 2015, 11:22pm

I'm a little new to sql that is this complex. Most of the stuff I have done is simple calls to single tables (or maybe 2). I have read the article here on groupby, and it seemed to get me close, but I am somehow still getting duplicate rows, and I am not sure why.

select 
cd.customer_id,
cd.bill2_name,
cd.bill2_csz,
h.total_lines,
h.total_sales,
h.total_cost,
h.total_sales - h.total_cost as Profit,
(h.total_sales - h.total_cost) / h.total_sales as ppct

from 
(select 
	customer_id,
	count(1) as total_lines, 
	sum(p21_sales_history_report_view.sales_price) as total_sales, 
	sum(p21_sales_history_report_view.cogs_amount) as total_cost 

from p21_sales_history_report_view 
where invoice_date between '2015-09-07 00:00:00.000' and '2015-09-08 23:59:59.999' 
group by customer_id
) h 
inner join
(Select
	customer_id,
	bill2_name as bill2_name,
	CONCAT(bill2_city,' ',bill2_state, ' ', bill2_postal_code) as bill2_csz

from invoice_hdr) cd
on h.customer_id = cd.customer_id

I get results that include the proper data (totals/etc are correct), but I get many duplicate lines of each. When I run the query for h (the table that pulls from the view) I get the data that I want without customer name/etc. h also does not have any duplication in it, so I know that part of the query is correct.

Any help would be greatly appreciated. Also any other suggestions about better ways to do this are also welcome. It is basically a simple sales history report. The catch is that the customer data has to come off of the invoice, rather than from a separate table.

Thanks in advance
Josh Hawley

ps: I know that the dates are hard coded and ugly, they will be replaced with parameters when I get this working.

Dohsan · September 10, 2015, 8:21am

Without any example data, it is hard to say exactly why, but looks like you've started to do the right thing - break up the query into it's separate parts to check them

If you run the second sub query

Select
	customer_id,
	bill2_name as bill2_name,
	CONCAT(bill2_city,' ',bill2_state, ' ', bill2_postal_code) as bill2_csz

from invoice_hdr

Are there duplicates in there? Check with:

SELECT	A.customer_id,
		COUNT(*)
FROM	invoice_hdr AS A
GROUP	BY A.customer_id
HAVING	COUNT(*) > 1;

Kristen · September 10, 2015, 9:21am

That's fine but two things to note

'yyyy-mm-dd' is ambiguous to SQL and may be parsed wrongly (or raise an error). Better to use 'yyyymmdd" (without punctuation) which is guaranteed to be parsed unambiguously. If you are bored! try this:

SET LANGUAGE 'French'
SELECT CONVERT(datetime, '2015-12-31 00:00:00.000')
SET LANGUAGE 'English'
SELECT CONVERT(datetime, '2015-12-31 00:00:00.000')

"between '2015-09-07 00:00:00.000' and '2015-09-08 23:59:59.999' is wrong I'm afraid. You can't do that

SELECT	CONVERT(datetime, '20150908 23:59:59.999')

will round the date to "tomorrow" i.e. 09-09-2015

You need to do:

where     invoice_date >= '20150907 00:00:00.000' 
      and invoice_date <  '20150909 00:00:00.000' 	-- i.e. "less than tomorrow"

There are various ways to get to "tomorrow"; e.g. if you have a parameter which contains @LastDateToIncludeInReport. This is pretty ghastly to read, but uses only Integer arithmetic, so performs well (MUCH better than using Strong conversion, for example)

DATEADD(Day, DATEDIFF(Day, 0, @LastDateToIncludeInReport)+1, 0)
-- Test:
SELECT GetDate() AS Now, DATEADD(Day, DATEDIFF(Day, 0, GetDate())+1, 0) AS Tomorrow

note that the TIME part is removed, so you wind up with "Midnight tonight" which you can then use with a LESS THAN test for the end-point.

Re: your duplicates it might be enough to just do:

(Select DISTINCT
	customer_id,
	bill2_name as bill2_name,
	CONCAT(bill2_city,' ',bill2_state, ' ', bill2_postal_code) as bill2_csz

from invoice_hdr)

to get rid of the duplicates. BUT ... ... I hate using DISTINCT because, in effect, SQL has to get all the data, sort it, and then remove the duplicates, which does not perform / scale well. So better to work around it, rather than use DISTINCT if you can.

I presume you have many [invoice_hdr] records PER customer_id, and SOME of them will have different BILLING addresses?

There is probably a CUSTOMER table you could use instead, which has a "Primary Address", and thus will only have one-per-customer??

If you need to use the Billing Address from [invoice_hdr], and you want only one but you don't care which one, then you could do

(Select
	[RowNumber] = ROW_NUMBER()
			OVER
			(
				PARTITION BY customer_id 
				ORDER BY customer_id, SomeDateColumn DESC
			),
	customer_id,
	bill2_name as bill2_name,
	CONCAT(bill2_city,' ',bill2_state, ' ', bill2_postal_code) as bill2_csz
from invoice_hdr) cd
on h.customer_id = cd.customer_id
AND cd.RowNumber =1

this will only include rows from [invoice_hdr] where the calculated [RowNumber] = 1, and if you use a suitable sort order (e.g. by Invoice Date DESCending) you will get the billing address from the most recent invoice.

Josh_Hawley · September 10, 2015, 10:19pm

Yes there are. I thought that since this table was joined to the other one, that it wouldn't matter...

Josh_Hawley · September 10, 2015, 10:29pm

That worked perfectly. There are some new keywords in there that I will have to read about so that I understand exactly how they work, but I think that I understand what you did.

I can't use the customer data because there are multiple contacts/locations for many of them. When our database creates an invoice, it copies the contact data from the correct contact to the invoice_hdr table. I'm still not guaranteed that it is the correct data (because there is a one -> many relationship), but I do know that it showed up on at least one of the invoices in the selected date range, which is good enough for this report.

Josh_Hawley · September 10, 2015, 10:30pm

Thank you both for helping me get this figured out

Josh_Hawley · September 10, 2015, 10:31pm

Also thanks for the explanation about the date issues. They were not causing an issue with my numbers in my test data, but i'm sure that it would have later.

Kristen · September 11, 2015, 7:26am

"Normally" (which comes with a big "IF" !! ) there would be a Flag or somesuch to indicate which was the "main" contact address. If you have that I would go with Customer Table as the indexing will almost certainly give you better performance than trying to find "Most recent invoice header record for this customer".

But the problem may well be best-solved, in the users' eyes, by "Address we have most recently billed to" (i.e. as you are doing it) - I can imagine that the "main address" doesn't get changed reliably when a customer starts getting billed at a new address ... and, of course, conversely someone who frequently asked for goods to be drop-shipped at different addresses is not well described by "most recent billing address" ...

These type of "what would best suit the user" questions plaque me!! Its no use asking the user, I never get the right answer to the right question UNTIL I have built the report and THEN they say "That's not what I wanted" or "wouldn't it be nice if ..."

Josh_Hawley · September 11, 2015, 8:32pm

You must have a crystal ball... The whole scope of this report has changed... right after I got it working. /sigh

Kristen · September 12, 2015, 4:20am

Sorry about that