SQLTeam.com | Weblogs | Forums

Pick 5% random sample from a set of records


#1

Greetings everyone!
I am new to SQL. I have a table in an SQL server 2012.

Information:

  1. ItemReviewId column is the PK.
  2. Deleted column will have only "yes" and "No" value.
  3. Audited column will have only "Yes" and "No" value.

I want to create a stored procedure to do the followings:

  1. Pick a random sample of 5% for 'Deleted' "Yes" and '5% for 'Deleted' "No" for distinct 'UserId' and distinct 'ReviewDate' in a given date range. Audited ="Yes" will be excluded from the sample.
  2. Update Audited to "Check"
  3. Update AssignedTo to "jsmith"

How can I achieve this?

Any help would be highly appreciated. Thanks in advance.


#2

If I understand your requerements correctly, this should get you the expected results:

with cte
  as (select itemreviewid
            ,100.0
            *row_number() over(partition by userid
                                           ,reviewdate
                                           ,deleted
                                   order by newid()
                              )
            /count(*) over(partition by userid
                                       ,reviewdate
                                       ,deleted
                          )
             as pct
        from yourtable
       where reviewdate>='2017-01-01'
         and reviewdate<'2017-02-01'
         and deleted in ('Yes','No')
         and audited!='Yes'
     )
select a.*
  from yourtable as a
       inner join cte as b
               on b.itemreviewid=a.itemreviewid
              and b.pct<=5
;

If you do get the expected results, you can update like this:

with cte
  as (select itemreviewid
            ,100.0
            *row_number() over(partition by userid
                                           ,reviewdate
                                           ,deleted
                                   order by newid()
                              )
            /count(*) over(partition by userid
                                       ,reviewdate
                                       ,deleted
                          )
             as pct
        from yourtable
       where reviewdate>='2017-01-01'
         and reviewdate<'2017-02-01'
         and deleted in ('Yes','No')
         and audited!='Yes'
     )
update a
   set audited='Check'
  from yourtable as a
       inner join cte as b
               on b.itemreviewid=a.itemreviewid
              and b.pct<=5
;

Now you "wrap" the above in a stored procedure.


#3

Thank you very much for looking into this.
I created a stored procedure as given below. But it does not return any records. I executed the stored procedure with the parameters like -- "EXEC spItemReviewPickSample 'jsmith', '6/22/17', 6/22/17'". Can anyone please take a look into the procedure?
Thanks in advance!

USE [SebpoDatabase]
GO
ALTER PROC [dbo].[spItemReviewPickSample]
(
@AssignedTo VarChar(50)
,@StartDate Datetime
,@EndDate Datetime
)
AS
BEGIN
WITH CTE
AS (SELECT ItemReviewId
,100.0
row_number() over(partition by UserId
,ReviewDate
,Deleted
order by newid()
)
/count(
) over(partition by UserId
,Reviewdate
,Deleted
)
as pct
FROM tblItemReviewItems
WHERE ReviewDate>=@StartDate
AND ReviewDate<@EndDate
AND Deleted in ('Yes','No')
AND Audited!='No'
)
SELECT a.*
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=5
;
WITH CTE
AS (SELECT ItemReviewId
,100.0
row_number() over(partition by UserId
,ReviewDate
,Deleted
order by newid()
)
/count(
) over(partition by UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE reviewdate>=@StartDate
and reviewdate<@EndDate
and deleted in ('Yes','No')
and audited!='No'
)
UPDATE a
SET Audited='Check', AssignedTo=@AssignedTo
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
and b.pct<=5
;
END


#4

May not be your issue, but @EndDate needs to be "one day after the cutoff date" (or if you want to define @EndDate as "the last actual day" then round @EndDate up one [at the start of the Sproc])

Your '6/22/17' string-date format may be fine, depending on all sorts of things that influence how SQL parses string dates, but to be guaranteed unambiguous use '20170622'


#5

Thank you very much for the input. I changed "ReviewDate>= and Reviewdate<" to "ReviewDate BETWEEN Start and End". This worked, but sample is rounding down.

The procedure picked 37 records from 945 records. How can I round this up?

ALTER PROC [dbo].[spItemReviewPickSample]
(
@AssignedTo VarChar(50)
,@StartDate Datetime
,@EndDate Datetime
)
AS
BEGIN
WITH CTE
AS (SELECT ItemReviewId
,100.0
row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
order by newid()
)
/count(
) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND Deleted in ('Yes','No')
AND Audited='No'
)
SELECT a.*
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=5
;
WITH CTE
AS (SELECT ItemReviewId
,100.0
row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
ORDER BY newid()
)
/COUNT(
) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND deleted IN ('Yes','No')
AND audited='No'
)
UPDATE a
SET Audited='Check', AssignedTo=@AssignedTo
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=5
;
END


#6

Change b.pct<=5 to b.pct<6 (both places)


#7

Personally (and I think in the view of many/most? here) I wouldn't do that, particularly with a DateTime. You will be including midnight start of the following day, but not a few ms past that. There is no value for the last ms-value before midnight (that you could safely base your endpoint on), assuming you do not want to include "tomorrow" at all. That ms value is defined, but then along comes DATETIME2 with more precision and the sands-shift.

Better as you were IMO to use

WHERE ReviewDate>=@StartDate
AND ReviewDate<@EndDatePlusOne

#8

Thanks for all the help!