Number of records returned in aggregation

innerjoin · March 24, 2017, 10:34am

This statement works fine and returns only the groups which total sum is >= than 20000:

Select THIS_OFFICE_ID as OFID_PCC_NR, PUSHSTAT_DATE AS THE_DATE,
SUM(TRANSACTION_CNT) as Total_Count, REQUIRED.REQ_TOTAL as OFID_PCC_Total FROM CAM_PUSH_STATISTICSTB
INNER JOIN
(SELECT THIS_OFFICE_ID, SUM(CAST(TRANSACTION_CNT as BIGINT)) FROM CAM_PUSH_STATISTICSTB
WHERE pushstat_date BETWEEN '3/3/2017' AND '3/7/2017'
GROUP BY THIS_OFFICE_ID) AS REQUIRED(REQ_OFFICE_ID, REQ_TOTAL) ON THIS_OFFICE_ID=REQUIRED.REQ_OFFICE_ID
WHERE REQUIRED.REQ_TOTAL > 20000
AND pushstat_date BETWEEN '3/3/2017' AND '3/7/2017'
GROUP by REQUIRED.REQ_TOTAL, THIS_OFFICE_ID, PUSHSTAT_DATE
ORDER by REQUIRED.REQ_TOTAL desc, THIS_OFFICE_ID, PUSHSTAT_DATE;

OFFICE_ID TRANSACTION_DATE TRANSACTION_CNT TOTAL_SUMMA

(1) Stockholm 03Mar17 5000 24000
(2) Stockholm 04Mar17 6000 24000
(3) Stockholm 05Mar17 7000 24000
(4) Stockholm 06Mar17 3000 24000
(5) Stockholm 07Mar17 3000 24000

(6) Kiruna 03Mar17 4000 22000
(7) Kiruna 04Mar17 5000 22000
(8) Kiruna 05Mar17 6000 22000
(9) Kiruna 07Mar17 7000 22000

(10) Copenhagen 03Mar17 1000 6000
(11) Copenhagen 04Mar17 5000 6000

Stockholm and Kiruna are returned but not Copenhagen. That's fine, working as designed.

I would like to expand this statement to also return the days containing data for each group, i.e. 5 for Stockholm and 4 for Kiruna. So it doesn't work to just compute max(date) minus min(date) since usually there are missing days in between.

I suspect count(TRANSACTION_CNT) for each group must be involved. Thanks in advance for all suggestions.

ScottPletcher · March 24, 2017, 2:45pm

Best guess, with no useable sample data to test it, would be:

Remove PUSHSTAT_DATE from the GROUP BY
and add
COUNT(DISTINCT PUSHSTAT_DATE)
to the SELECT list.

innerjoin · March 24, 2017, 3:27pm

my typo I should have written "I suspect count(PUSHSTAT_DATE)"

@ScottPletcher: your suggestion returns the number of data days but not as in expanding my query, rather giving a different query, it does not fulfill my requirement. I need the data I get plus number of data days. This is what I need:

(1) Stockholm 03Mar17 5000 24000 5
(2) Stockholm 04Mar17 6000 24000 5
(3) Stockholm 05Mar17 7000 24000 5
(4) Stockholm 06Mar17 3000 24000 5
(5) Stockholm 07Mar17 3000 24000 5

(6) Kiruna 03Mar17 4000 22000 4
(7) Kiruna 04Mar17 5000 22000 4
(8) Kiruna 05Mar17 6000 22000 4
(9) Kiruna 07Mar17 7000 22000 4

ScottPletcher · March 24, 2017, 3:49pm

Good luck then. Without useable sample data to this point, I can't participate further.

Do notice, though, that I specified DISTINCT in the COUNT, which could make a difference.

innerjoin · March 24, 2017, 9:59pm

Oh! I missed that about the usable data. Here it is:

Summary

2017-03-03 00:00:00.000 KIRUNA 2318
2017-03-04 00:00:00.000 KIRUNA 1583
2017-03-05 00:00:00.000 KIRUNA 3947
2017-03-06 00:00:00.000 KIRUNA 6127
2017-03-07 00:00:00.000 KIRUNA 9633
2017-03-04 00:00:00.000 COPENHAGEN 3379
2017-03-05 00:00:00.000 COPENHAGEN 4198
2017-03-06 00:00:00.000 COPENHAGEN 4699
2017-03-07 00:00:00.000 COPENHAGEN 4861
2017-03-03 00:00:00.000 STOCKHOLM 4608
2017-03-04 00:00:00.000 STOCKHOLM 4069
2017-03-05 00:00:00.000 STOCKHOLM 5038
2017-03-06 00:00:00.000 STOCKHOLM 4962
2017-03-07 00:00:00.000 STOCKHOLM 5271

My SQL statement returns:

[details=Summary]STOCKHOLM 2017-03-03 00:00:00.000 4608 23948
STOCKHOLM 2017-03-04 00:00:00.000 4069 23948
STOCKHOLM 2017-03-05 00:00:00.000 5038 23948
STOCKHOLM 2017-03-06 00:00:00.000 4962 23948
STOCKHOLM 2017-03-07 00:00:00.000 5271 23948
KIRUNA 2017-03-03 00:00:00.000 2318 23608
KIRUNA 2017-03-04 00:00:00.000 1583 23608
KIRUNA 2017-03-05 00:00:00.000 3947 23608
KIRUNA 2017-03-06 00:00:00.000 6127 23608
KIRUNA 2017-03-07 00:00:00.000 9633 23608
COPENHAGEN 2017-03-04 00:00:00.000 3379 17137
COPENHAGEN 2017-03-05 00:00:00.000 4198 17137
COPENHAGEN 2017-03-06 00:00:00.000 4699 17137
COPENHAGEN 2017-03-07 00:00:00.000 4861 17137[/details]

with the condition ”WHERE REQUIRED.REQ_TOTAL between 17000 and 24000”
While I would like to have:

[details=Summary]STOCKHOLM 2017-03-03 00:00:00.000 4608 23948 5
STOCKHOLM 2017-03-04 00:00:00.000 4069 23948 5
STOCKHOLM 2017-03-05 00:00:00.000 5038 23948 5
STOCKHOLM 2017-03-06 00:00:00.000 4962 23948 5
STOCKHOLM 2017-03-07 00:00:00.000 5271 23948 5
KIRUNA 2017-03-03 00:00:00.000 2318 23608 5
KIRUNA 2017-03-04 00:00:00.000 1583 23608 5
KIRUNA 2017-03-05 00:00:00.000 3947 23608 5
KIRUNA 2017-03-06 00:00:00.000 6127 23608 5
KIRUNA 2017-03-07 00:00:00.000 9633 23608 5
COPENHAGEN 2017-03-04 00:00:00.000 3379 17137 4
COPENHAGEN 2017-03-05 00:00:00.000 4198 17137 4
COPENHAGEN 2017-03-06 00:00:00.000 4699 17137 4
COPENHAGEN 2017-03-07 00:00:00.000 4861 17137 4
[/details]

bitsmed · March 25, 2017, 9:43am

Using @ScottPletcher method, I came up with:

select a.this_office_id as ofid_pcc_nr
      ,a.pushstat_date as the_date
      ,sum(a.transaction_cnt) as total_count
      ,b.req_total as ofid_pcc_total
      ,b.date_count
  from cam_push_statisticstb as a
       inner join (select this_office_id
                         ,sum(transaction_cnt) as req_total
                         ,count(distinct pushstat_date) as date_count
                     from cam_push_statisticstb
                    group by this_office_id
                  ) as b
               on b.this_office_id=a.this_office_id
              and b.req_total between 17000 and 24000
 where a.pushstat_date>=cast('2017-03-03' as datetime)
   and a.pushstat_date<cast('2017-03-08' as datetime)
 group by a.this_office_id
         ,a.pushstat_date
         ,b.req_total
         ,b.date_count
 order by b.req_total desc
         ,a.this_office_id
         ,a.pushstat_date
;

innerjoin · March 26, 2017, 6:12am

It works like a charm, thanks a lot bitsmed.

I just had to do a small adjustment and add "WHERE pushstat_date BETWEEN '3/3/2017' AND '3/07/2017'" in the inner statement as well for the cases where there is more data for the office while i just want data for the given date interval. Used as is it was counting all data days existing in db for the given office rather than only those inside the given period. You couldn't possibly know that with the limited data I provided.

QUESTION: I am grouping by pushstat_date in the outer statement so I am positively getting one and only one record per pushstat_date. Do I really need to specify "DISTINCT PUSHSTAT_DATE" or is it redundant? I have namely removed DISTINCT and it returns the same. Am I narrowly missing something fundamental?

Just out of curiosity: I use the WHERE statement with the dates as given above and it works just fine. Is your casting to datetime due to 'good coding discipline' or is there other reason?

Thanks A LOT again ScottPletcher and bitsmed for your time!

bitsmed · March 26, 2017, 11:18am

My mistake - the where statement filtering the date should be in the subselect aswell (nice spotted).

If table "cam_push_statisticstb" only have one row per "pushstat_date", then the distinct is not needed. If there are more rows per "cam_push_statisticstb", you'd get wrong count.

Whenever you compare fields, it's faster to compare same field types. I'm sure the optimizer is smart enough to do it for me, but this way, I show "the next reader", I know the field type. Also depending on your locale settings, "3/7/2017" could be either Marsh 7th or July 3rd. Additionally I never use "between" when comparing datetimes. In your where statement (using between), you'd miss records on March 7th if time portion was a factor.

Hope this makes sence

innerjoin · March 26, 2017, 1:41pm

Tanks a LOT again bitsmed. I will be using the casting to datetime and avoid between to make the statement optimal, I will be executing some pretty heavy searches so every bit counts (pun intended).

Regarding DISTINCT I've been doing quite a lot of spot checks without it, with a date interval of 5-10 days (easy to control check) having data for 200+ days in the background and it counts right both when the 5-10 day interval is complete as well as when 1 or more days have no data. I will consider keeping it anyway but would like to know if I am unnecessarily spending some valuable cpu power resources.

Thanks a lot again.

bitsmed · March 26, 2017, 2:09pm

Here's an easy check to see if table holds more than one this_office_id per day this year:

select this_office_id
      ,pushstat_date
  from cam_push_statisticstb
 where pushstat_date>=cast('2017-01-01' as datetime)
 group by this_office_id
         ,pushstat_date
 having count(*)>1
;

If this shows no rows, then you don't need the distinct. Infact your query might be optimized like this:

select this_office_id as ofid_pcc_nr
      ,pushstat_date as the_date
      ,transaction_cnt as total_count
      ,ofid_pcc_total
      ,date_count
  from (select this_office_id
              ,pushstat_date
              ,transaction_cnt
              ,sum(transaction_cnt) over(partition by this_office_id) as ofid_pcc_total
              ,count(*) over(partition by this_office_id) as date_count
          from cam_push_statisticstb
         where pushstat_date>=cast('2017-03-03' as datetime)
           and pushstat_date<cast('2017-03-08' as datetime)
       ) as a
 where ofid_pcc_total between 17000 and 24000
 order by b.req_total desc
         ,a.this_office_id
         ,a.pushstat_date
;

This will "only" run thru your table once, instead of twice (in the previous queries).

More speed you say??
You might consider creating a table (or temp table) and push the result of the subselect to that table, and with the right index(es), you might gain more speed

Edit: corrected query for speed (no need to sum for total_cnt, as it's already summed)

innerjoin · March 27, 2017, 7:17am

Thanks a LOT again bitsmed. I will try this improved version that loops the table only once. It sounds very promising. Thanks!

innerjoin · June 7, 2017, 8:41am

Only now I had the opportunity to implement bitsmed's last suggestion and it works fine .... well, almost.
With small variations I implemented this code:

SELECT this_office_id                         AS ofid_pcc_nr, 
       pushstat_date                          AS the_date, 
       transaction_cnt                        AS total_count, 
       ofid_pcc_total, 
       real_count, 
       Datediff(day, '2017-06-01', '2017-06-06') AS calender_count 
FROM   (SELECT this_office_id, 
               pushstat_date, 
               transaction_cnt, 
               Sum(transaction_cnt) OVER (partition BY this_office_id) AS ofid_pcc_total, 
               Count(*) OVER(partition BY this_office_id) AS real_count 
        FROM   cam_push_statisticstb 
        WHERE  pushstat_date BETWEEN '2017-06-01' AND '2017-06-06' 
       ) AS a 
WHERE  ofid_pcc_total >= 240000 
       AND Datediff(day, '2017-06-01', '2017-06-06') - real_count BETWEEN 0 AND 5
ORDER  BY ofid_pcc_total DESC, a.this_office_id, a.pushstat_date

This returns a neat list of dates and offices with sum.
The reason why it 'almost' works is that it returns several records for the same date, same Office id (there are several sub categories under each Office Id), obviously since I do not 'group by' date or Office ID.

Problem is that when I 'group by' this_office_id in the inner or the outer statement I get an error for other columns!

Any ideas?

bitsmed · June 10, 2017, 6:17am

Trying to figure out what this should do:

Are you only interested in records NOT having several resords for the same day?

If so, this might be useful:

select this_office_id as ofid_pcc_nr
      ,pushstat_date as the_date
      ,transaction_cnt as total_count
      ,ofid_pcc_total
      ,real_count
  from (select this_office_id
              ,pushstat_date
              ,transaction_cnt
              ,sum(transaction_cnt) over(partition by this_office_id) as ofid_pcc_total
              ,sum(real_count) over(partition by this_office_id) as real_count
              ,sum(1) over(partition by this_office_id) as day_count
          from (select this_office_id
                      ,pushstat_date
                      ,sum(transaction_cnt) as transaction_cnt
                      ,count(*) as real_count
                  from cam_push_statisticstb
                 where pushstat_date>=cast('2017-06-01' as datetime)
                   and pushstat_date<cast('2017-06-07' as datetime)
                 group by this_office_id
                         ,pushstat_date
               ) as a
       ) as a
 where ofid_pcc_total>=240000
   and real_count=day_count
 order by ofid_pcc_total desc
         ,this_office_id
         ,pushstat_date
;

innerjoin · June 15, 2017, 6:57pm

Thanks bitsmed. I just want to determine the number of missing dates in a period and be able to choose whatever amount of missing days that I want. Datediff gives me the real number of dates in a chosen period while real_count the number of actual date records. If they are equal their difference is 0, no date is missing. If the difference is 1 there is one day missing, only I do not know which but it is not important. So in the statement above I want all offices that lack between 0 and 5 dates in the period.

I have now a working statement combining one of your proposals above. The problem is it is working perfectly in MS Visual Studio but gives an error when embedded in VBA SQL with OLEDB Provider. I have simplified the statement and increased it step by step, until it gives an error, the line rejected in VBA SQL is:

Sum(Cast(transaction_cnt AS BIGINT)) 
	OVER(partition BY this_office_id, pushstat_date) AS sum_transaction_for_this_day,

The complete statement working like a charm in MS Visual Studio:

SELECT this_office_id	AS ofid_pcc_nr, 
       pushstat_date		AS the_date, 
	   sum_transaction_for_this_day,
       sum_transaction_for_this_office, 
       returned_records_from_db, 
       Datediff(day, '2017-05-01', '2017-06-11') + 1 AS records_acc_to_calender,
	Datediff(day, '2017-05-01', '2017-06-11') + 1 - returned_records_from_db AS MISSING_DATES
FROM   (SELECT this_office_id, office_pcc_id, pushstat_date, 
	Sum(Cast(transaction_cnt AS BIGINT)) 
		OVER(partition BY this_office_id, pushstat_date) AS sum_transaction_for_this_day,
	Sum(Cast(transaction_cnt AS BIGINT)) 
		OVER(partition BY this_office_id) AS sum_transaction_for_this_office, 
		DENSE_RANK() OVER (PARTITION BY this_office_id ORDER BY this_office_id, pushstat_date ASC) +
		DENSE_RANK() OVER (PARTITION BY this_office_id ORDER BY this_office_id, pushstat_date DESC) - 	1 AS returned_records_from_db
        FROM   cam_push_statisticstb 
        WHERE  pushstat_date BETWEEN '2017-05-01' AND '2017-06-11') AS a 
	 --WHERE Datediff(day, '2017-06-01', '2017-06-11') + 1 - returned_records_from_db > 3 
GROUP BY sum_transaction_for_this_office, this_office_id, 
pushstat_date, returned_records_from_db, sum_transaction_for_this_day
ORDER  BY sum_transaction_for_this_office desc, a.this_office_id, a.pushstat_date

When I parse the statement in MySQL Syntax prüfen I get also that sum over partition by is an error in MySQL.

Question is how can I rewrite the statement to fit VBA SQL.

I have a lot of heavy statements running in VBA SQL and always thought they had the same syntax requirements this is the first time a statement executes fine in MS Visual Studio and not in VBA SQL.

I am executing against the exact same db. VBA SQL gives very poor error messages so I have to check the syntax somewhere else as I did.

innerjoin · June 15, 2017, 7:01pm

I use DENSE_RANK because VBA SQL gives syntax error for count(*) OVER Partition By

DENSE_RANK gives the correct answer and is ok in VBA SQL

bitsmed · June 15, 2017, 7:32pm

MySQL and MSSQL are very different, so it makes no sence, validating MSSQL queries with a MySQL parser.

I think your problem lies with the version of the oledb driver you're using. Can you try with a newer version that matches your MSSQL version?

innerjoin · June 15, 2017, 7:45pm

Thanks

Yeah, i suspected they are not he same but casually they complain both ref the same command.

The VBA code with embedded SQL is installed in a server that I remote control I cannot update much there so I guess I'll have to settle with what is installed there.

I'll have to replace SUM OVER PARTITION BY with something else as I did with count(*) over partition -> DENSE_RANK

bitsmed · June 15, 2017, 8:18pm

Whats the version of your db engine (on the remote server)?

innerjoin · June 16, 2017, 7:38am

Not sure if this is enough:

Microsoft SQL Server 2008 R2 (RTM) - 10.50.1600.1 (X64)
Apr 2 2010 15:48:46
Copyright (c) Microsoft Corporation
Enterprise Edition (64-bit) on Windows NT 6.1 (Build 7601: Service Pack 1) (Hypervisor)

innerjoin · June 16, 2017, 11:23am

I have solved the problem and the solution is silly considering the time I spent troubleshooting.

MS VBA SQL accepts the SUM OVER PARTITION BY after the DENSE_RANK but not before. Go figure.

Not only has VBA SQL extremely poor error messages ('error 13 string mismatch' for almost everything) and non existent debug tools but it is also VEEERY touchy in stupid things. I inverted the SELECT order of sum_transaction_for_this_day and sum_transaction_for_this_office and got an error. When I removed one of them, executed, introduced the second one in a new position, executed, then it was accepted.

GO FIGURE! Is kind of finnicky and gave me lots of stomach-churning moments until I finally solved it ....

Thanks for your help and time bitsmed! really