SQLTeam.com | Weblogs | Forums

Pick a random sample based on multiple criteria from a set of records


#1

Greetings everyone!

I have a table in an SQL server 2012. Few days ago I requested help for a picking a random sample of 5% from the table with only one criterion. This forum helped me with the problem and solved it. Thanks for that but this time the criteria are much more complicated. I know there are many wise guys out here and I hope this forum can help me solve this.

Information:

  1. ItemReviewId column is the PK.
  2. Deleted column will have only "yes" and "No" value.
  3. Audited column will have only "Yes" and "No" value.

I want to create a stored procedure to do the followings:

  1. Pick a random sample of 10% of all ItemReviewId for distinct 'UserId' and distinct 'ReviewDate' in a given date range. 10% sample should include- 5% of the total population from Deleted (No) and 5% of the total population from Deleted (Yes). Audited ="Yes" will be excluded from the sample.

For example – A user has 118 records. Out of the 118 records, 17 records have Deleted column value "No" and 101 records have Deleted column value "Yes". I need to pick a random sample of 12 records. Out of those 12 records, 6 should have Deleted column value "No" and 6 should have Deleted column value "Yes".
2. Update Audited to "Check"

In a nutshell pick a sample 5% of the population having Deleted="No" and 5% of the population having Deleted="Yes" and update Audited column value to "Check" of the sample.

How can I achieve this?

Any help would be highly appreciated. Thanks in advance.


#2

did you try 'top 5 percent' and specify order by newid()?


#3

I tried the below stored procedure. But it gives me 5% of Deleted="Yes" and 5% of Deleted="No". For the example above -- it picks only 6 having Deleted="Yes". But I want 6 records (5% of population (118)) having Deleted="Yes" and 6 records (5% of population (118)) having Deleted="No".

ALTER PROC [dbo].[spItemReviewQcPickSample]  
    (  
         @StartDate Datetime  
        ,@EndDate Datetime 
     )  
    AS  
    BEGIN  
    WITH CTE  
      AS (SELECT ItemReviewId  
                ,100.0  
                *row_number() OVER(PARTITION BY UserId  
                                               ,ReviewDate  
                                               ,Deleted  
                                       order by newid()  
                                  )
                /count(*) OVER(PARTITION BY UserId  
                                           ,Reviewdate  
                                           ,Deleted  
                              )  
                 AS pct  
           FROM tblItemReviewItems  
           WHERE ReviewDate BETWEEN @StartDate AND @EndDate  
             AND Deleted in ('Yes','No')  
             AND Audited='No'  
         )  
    SELECT a.*  
      FROM tblItemReviewItems AS a  
           INNER JOIN cte AS b  
                   ON b.ItemReviewId=a.ItemReviewId  
                  AND b.pct<=6  
    ;  
    WITH CTE  
      AS (SELECT ItemReviewId  
                ,100.00  
                *row_number() OVER(PARTITION BY UserId  
                                               ,ReviewDate  
                                               ,Deleted  
                                       ORDER BY newid()  
                                  )
                /COUNT(*) OVER(PARTITION BY UserId  
                                           ,Reviewdate  
                                           ,Deleted  
                              )  
                 AS pct  
            FROM tblItemReviewItems  
           WHERE ReviewDate BETWEEN @StartDate AND @EndDate  
             AND deleted IN ('Yes','No')  
             AND audited='No'  
         )  
    UPDATE a  
       SET Audited='Check'  
      FROM tblItemReviewItems AS a  
           INNER JOIN cte AS b  
                   ON b.ItemReviewId=a.ItemReviewId  
                  AND b.pct<=6  
    ;  
    END  

How can I achieve that?


#4

Try using two separate queries (one for Deleted=yes and another for "no") then union the results, maybe


#5

I can do that but how can I pick 5% (6) of the population (118) having Deleted='No' not the 5% (1) of Deleted='No' (17)?


#6

look at it another way, Your 118 has deleted either yes or no. 17 have no. so 5% of 118 is what pct of 17? 17/118*.05 in other words


#7

Thank you for looking into this!
I created a stored procedure that way fixing the %. But my problem is that the % you are referring to is not always the same. A user might have 17 deleted='No' out of 118, another user might have 38 deleted='No' out of 118. So for the first user the % is 35.29 and for the second user the % is 15.79. So that approach does not address the problem.


#8

i wasn't suggesting hard-coding the numbers. that was just an example. you'll have to count things each time


#9

Here is my stored procedure. How can I count things each time?

ALTER PROC [dbo].[spItemReviewQcPickSampleUpdate]
(
	@StartDate Datetime
	,@EndDate Datetime
)
AS
--Completed Items
BEGIN
WITH CTE
  AS (SELECT ItemReviewId
            ,100.0
            *row_number() OVER(PARTITION BY UserId
                                           ,ReviewDate
                                           --,Deleted
                                   order by newid()
                              )
            /count(*) OVER(PARTITION BY UserId
                                       ,Reviewdate
                                       --,Deleted
                          )
             AS pct
       FROM tblItemReviewItems
       WHERE ReviewDate BETWEEN @StartDate AND @EndDate
         AND Deleted = 'No'
         AND Audited='No'
	AND UserId NOT LIKE 'v_s2_'+ '%'
     )
SELECT a.*
  FROM tblItemReviewItems AS a
       INNER JOIN cte AS b
               ON b.ItemReviewId=a.ItemReviewId
              AND b.pct<=31
;
WITH CTE
  AS (SELECT ItemReviewId
            ,100.00
            *row_number() OVER(PARTITION BY UserId
                                           ,ReviewDate
                                           --,Deleted
                                   ORDER BY newid()
                              )
            /COUNT(*) OVER(PARTITION BY UserId
                                       ,Reviewdate
                                       --,Deleted
                          )
             AS pct
        FROM tblItemReviewItems
       WHERE ReviewDate BETWEEN @StartDate AND @EndDate
         AND deleted ='No'
         AND audited='No'
		 AND UserId NOT LIKE 'v_s2_'+ '%'
     )
UPDATE a
   SET Audited='Check'
  FROM tblItemReviewItems AS a
       INNER JOIN cte AS b
               ON b.ItemReviewId=a.ItemReviewId
              AND b.pct<=31
;
END

#10

easy way:

declare @nos float = select count() from ... where deleted = 'no'
declare @yes float = select count(
) from ... where delete = 'yes'
declare @samplepct float = .05 * @nos/(@nos + @yes)

then use the samplepct in

select top @samplepct percent from <population> where delete = 'No'

etc


#11

Sir, I have been trying for the last couple of days but could not create a stored procedure following your suggestions. May be I am asking too much. Could you please create a stored procedure for this? Your help is greatly appreciated.


#12

I was able to pick sample for a UserId. Now the questions are- i. how can I select samples for all the users? ii. how can I update the "AuditStatus" column value to "Check" to the selected rows?

ALTER PROC [dbo].[spPickSample]
AS
BEGIN
	DECLARE @Nos INT = (SELECT COUNT (*) FROM tblItemReviewItems WHERE Deleted='No' and UserId='jdoe1')
	DECLARE @Yes INT = (SELECT COUNT (*) FROM tblItemReviewItems WHERE Deleted='Yes' and UserId='jdoe1')
	DECLARE @Samplepct FLOAT =(.05*(@Nos + @Yes)) / @Nos*100
	DECLARE @Percentage AS INT = @Samplepct
	SELECT
		TOP (@Percentage) PERCENT *
	FROM
		tblItemReviewItems
	WHERE
		Deleted='No' AND UserId='jdoe1'
	ORDER BY
		NEWID()
END

#13

Could this be what you're looking for:

select b.*
  from (select itemreviewid
              ,ceiling((sum(case when deleted='No' then 1 else 0 end) over(partition by userid)
                       +sum(case when deleted='Yes' then 1 else 0 end) over(partition by userid)
			           )
                      *.05
                      )
               as records
              ,row_number() over(partition by deleted,userid order by newid()) as rn
          from tblitemreviewitems
       ) as a
       inner join tblitemreviewitems as b
               on b.itemreviewid=a.itemreviewid
 where a.rn<=a.records
 order by b.userid
         ,b.deleted
         ,b.itemreviewid
;

#14

Thank you very much. You are awesome! This is exactly what I needed. Here is my final stored procedure:

ALTER PROC [dbo].[spPickSample]
	(
		@StartDate AS DATETIME
		,@EndDate AS DATETIME
	)
AS
BEGIN
	DECLARE @tempItems TABLE
	(
		ItemReviewId VarChar(50)
		,UserId VarChar(50)
		,ReviewDate Datetime
		,Deleted varchar(10)
		,Audited varchar(10)
	)
	INSERT INTO @tempItems (ItemReviewId, UserId, ReviewDate, Deleted, Audited)

	SELECT b.ItemReviewId, UserId, ReviewDate, Deleted, Audited
	FROM (SELECT ItemreviewId
              ,CEILING((SUM(CASE WHEN Deleted='No' AND ReviewDate BETWEEN @StartDate AND @EndDate AND UserId NOT LIKE 'b_d2_' +'%' THEN 1 ELSE 0 END) OVER(PARTITION BY UserId)
                       +SUM(CASE WHEN Deleted='Yes' AND ReviewDate BETWEEN @StartDate AND @EndDate AND UserId NOT LIKE 'b_d2_' +'%' THEN 1 ELSE 0 END) OVER(PARTITION BY UserId)
			           )
                      *.05
                      )
               AS records
              ,row_number() OVER(PARTITION BY Deleted,UserId ORDER BY NEWID()) AS rn
          FROM tblItemReviewItems
       ) AS a
       INNER JOIN tblItemReviewItems AS b
               ON b.ItemreviewId=a.ItemreviewId
	WHERE
		a.rn<=a.records
	ORDER BY
		b.UserId
		,b.Deleted
		,b.ItemreviewId
;
	UPDATE a
	SET Audited='Check'
	FROM tblItemReviewItems AS a
		INNER JOIN @tempItems AS b
		ON b.ItemReviewId=a.ItemReviewId
	Begin
		UPDATE tblItemReviewItems SET Audited='Check' WHERE ReviewDate BETWEEN @StartDate AND @EndDate AND UserId LIKE 'b_d2_' +'%'
	End
END