Grouping time periods with gaps in them

llatinsky · November 17, 2015, 2:21pm

Hello, this is a simplified sample to demonstrate the problem I am having:

CREATE TABLE #Sample (ID INT NOT NULL, Yr INT NOT NULL, Col_A VARCHAR(10), Col_B INT, Col_C INT)
INSERT INTO #Sample values
(1, 2004, 'N/S', 10, 255),
(1, 2005, 'N/S', 10, 255),
(1, 2006, 'N/S', NULL, 1),
(1, 2007, 'N/S', 10, 255),
(1, 2008, 'N/S', 10, 255),
(1, 2009, 'N/S', 10, 255),
(1, 2010, 'N/S', 0, 255),
(1, 2011, 'N/A', 10, 255),
(1, 2012, 'N/A', 10, 255),
(1, 2013, 'N/A', 0, 0),
(1, 2014, 'Y', 101, 256),
(1, 2015, 'N/S', 1, 32),
(1, 2016, 'N/A', NULL,NULL );

SELECT CASE WHEN MIN(YR )=MAX(YR) THEN CAST(MIN(YR) AS CHAR(4)) ELSE CAST(MIN(YR ) AS CHAR(4))+'-'+RIGHT(CAST(MAX(YR) AS CHAR(4)),2) END AS FROMYEAR_TOYEAR,
ID , Col_A , Col_B , Col_C
FROM #Sample GROUP BY
ID , Col_A , Col_B , Col_C ORDER BY 1

RESULTS:
FROMYEAR_TOYEAR ID Col_A Col_B Col_C
2004-09 1 N/S 10 255
2006 1 N/S NULL 1
2010 1 N/S 0 255
2011-12 1 N/A 10 255
2013 1 N/A 0 0
2014 1 Y 101 256
2015 1 N/S 1 32
2016 1 N/A NULL NULL

CORRECT RESULTS SHOULD BE:

FROMYEAR_TOYEAR ID Col_A Col_B Col_C
2004-05 1 N/S 10 255
2006 1 N/S NULL 1
2007-09 1 N/S 10 255
2010 1 N/S 0 255
2011-12 1 N/A 10 255
2013 1 N/A 0 0
2014 1 Y 101 256
2015 1 N/S 1 32
2016 1 N/A NULL NULL

Please notice that periods from 2004 to 2009 though having same data have to be broken down, as the year in between(2006) had different data. I was trying to process the records by using windowing functions to break the group with the gap in years but as of today I don't have a viable solution...It seems to me there should be a way to do it in a single select statement...
Thank you for your help in advance

JamesK · November 17, 2015, 3:26pm

This is the class of problems that is usually referred to as the "Islands and Gaps" problem. You will find a number of SQL solutions if you search for it.

One way to solve it (and perhaps the most efficient) is to use the row_number function as shown below:

SELECT MIN(Yr),MAX(Yr),MAX(col_A), MAX(Col_B), MAX(Col_C)
FROM
(
	SELECT *,
		ROW_NUMBER() OVER(PARTITION BY id ORDER BY Yr)
		-ROW_NUMBER() OVER(PARTITION BY id ORDER BY col_b,Yr) Grp
	FROM
		#Sample
)s
GROUP BY Grp
ORDER BY 1;

My partitioning and ordering columns may not be exactly what you need for your more general problem, but for the sample data order by col_B, Yr should work. Also, I have not combined the min and max years as you have done, but that is easy enough to do.

llatinsky · November 17, 2015, 3:50pm

Hello, James, thank you very much for your reply. I tried windowing functions and looking for gaps in years but couldn't think up a good way to put them together. This will definitely work. Appreciate your help and reference to "Islands and Gaps" problem

jcelko · November 17, 2015, 10:26pm

Tables have keys that this thing does not. Identifiers are not numeric because you do no math on them. I prefer the MySQL convention for years so they can sort with ISO-8601 dates and you can be ready if ISO adopts it (under discussion).

Why is your generic, magic “id” a constant? Drop it as redundant. . Here is my guess at corrections, with ISO-11179 rules, a key, etc. :

CREATE TABLE Samples
(sample_year CHAR(10) NOT NULL PRIMARY KEY,
col_a VARCHAR(10) NOT NULL,
col_b INTEGER,
col_c INTEGER);

INSERT INTO Sample
VALUES
('2004-00-00', 'N/S', 10, 255),
('2005-00-00', 'N/S', 10, 255),
('2006-00-00', 'N/S', NULL, 1),
('2007-00-00', 'N/S', 10, 255),
('2008-00-00', 'N/S', 10, 255),
('2009-00-00', 'N/S', 10, 255),
('2010-00-00', 'N/S', 0, 255),
('2011-00-00', 'N/A', 10, 255),
('2012-00-00', 'N/A', 10, 255),
('2013-00-00', 'N/A', 0, 0),
('2014-00-00', 'Y', 101, 256),
('2015-00-00', 'N/S', 1, 32),
('2016-00-00', 'N/A', NULL, NULL);

We do not format data in a query, so all that string manipulation is not done in the database. We pass data to a presentation layer to make it pretty. Your mindset is still locked in 1970's monolithic COBOL programming. Your query should be:

SELECT MIN(sample_year) AS start_sample_year,
MAX(sample_year) AS end_sample_year,
sample_id, col_a, col_b, col_c
FROM Samples
GROUP BY col_a, col_b, col_c;

Please notice that periods from 2004 to 2009 though having same data have to be broken down, as the year in between(2006) had different data. I was trying to process the records [sic: rows are not records; more COBOL!] by using windowing functions to break the group with the gap in years but as of today I don't have a viable solution...It seems to me there should be a way to do it in a single select statement… <<
Use the “island & gaps” that JamesK gave you. The only thing I might change is having a Calendar table.

WITH X
AS
(SELECT sample_year, col_a, col_b, col_c,
(ROW_NUMBER() OVER(ORDER BY sample_year)
-ROW_NUMBER() OVER(ORDER BY col_a, col_b, col_c)) AS sample_grp
FROM Samples
GROUP BY sample_year, col_a, col_b, col_c)

SELECT MIN(sample_year), MAX(sample_year), col_a, col_b, col_c --, sample_grp
FROM X
GROUP BY col_a, col_b, col_c, sample_grp
ORDER BY MIN(sample_year); –-- optional sort

llatinsky · November 18, 2015, 1:51pm

Hello,Mr. Celko. Have been reading your articles since forever, and consider them a really good read, but this is about your reply. First of all, this is how I started my question:

"Hello, this is a simplified sample to demonstrate the problem I am having:"

Has nothing to do with the table structure or data i'm dealing with in real life. This was just a simple way to put some data together and show the desired output. There is no presentation layer, there is nothing, just a concept of dealing with same data that cannot be aggregated together because there are gaps. In real life I WAS using a calendar table and the row_number function, just tried to solve with LEAD/LAG, and got a little lost in the time allotted to this. That's why I resorted to asking the "Team" and got a very fast and spot-on reply from JamesK. This forum is called T-SQL , NOT database design 101 (though I always thought rows and records were interchangeable terms), I apologize for insulting your feelings but believe me if I ever thought you would be reading this with anything more than "Islands and Gaps" problem in mind - thank you JamesK again - I would have spent more time on thinking how to present it. Regards