Greetings everyone!
I have a table in an SQL server 2012. Few days ago I requested help for a picking a random sample of 5% from the table with only one criterion. This forum helped me with the problem and solved it. Thanks for that but this time the criteria are much more complicated. I know there are many wise guys out here and I hope this forum can help me solve this.

Information:
- ItemReviewId column is the PK.
- Deleted column will have only "yes" and "No" value.
- Audited column will have only "Yes" and "No" value.
I want to create a stored procedure to do the followings:
- Pick a random sample of 10% of all ItemReviewId for distinct 'UserId' and distinct 'ReviewDate' in a given date range. 10% sample should include- 5% of the total population from Deleted (No) and 5% of the total population from Deleted (Yes). Audited ="Yes" will be excluded from the sample.
For example – A user has 118 records. Out of the 118 records, 17 records have Deleted column value "No" and 101 records have Deleted column value "Yes". I need to pick a random sample of 12 records. Out of those 12 records, 6 should have Deleted column value "No" and 6 should have Deleted column value "Yes".
2. Update Audited to "Check"
In a nutshell pick a sample 5% of the population having Deleted="No" and 5% of the population having Deleted="Yes" and update Audited column value to "Check" of the sample.
How can I achieve this?
Any help would be highly appreciated. Thanks in advance.
did you try 'top 5 percent' and specify order by newid()?
I tried the below stored procedure. But it gives me 5% of Deleted="Yes" and 5% of Deleted="No". For the example above -- it picks only 6 having Deleted="Yes". But I want 6 records (5% of population (118)) having Deleted="Yes" and 6 records (5% of population (118)) having Deleted="No".
ALTER PROC [dbo].[spItemReviewQcPickSample]
(
@StartDate Datetime
,@EndDate Datetime
)
AS
BEGIN
WITH CTE
AS (SELECT ItemReviewId
,100.0
*row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
order by newid()
)
/count(*) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND Deleted in ('Yes','No')
AND Audited='No'
)
SELECT a.*
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=6
;
WITH CTE
AS (SELECT ItemReviewId
,100.00
*row_number() OVER(PARTITION BY UserId
,ReviewDate
,Deleted
ORDER BY newid()
)
/COUNT(*) OVER(PARTITION BY UserId
,Reviewdate
,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND deleted IN ('Yes','No')
AND audited='No'
)
UPDATE a
SET Audited='Check'
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=6
;
END
How can I achieve that?
Try using two separate queries (one for Deleted=yes and another for "no") then union the results, maybe
I can do that but how can I pick 5% (6) of the population (118) having Deleted='No' not the 5% (1) of Deleted='No' (17)?
look at it another way, Your 118 has deleted either yes or no. 17 have no. so 5% of 118 is what pct of 17? 17/118*.05 in other words
Thank you for looking into this!
I created a stored procedure that way fixing the %. But my problem is that the % you are referring to is not always the same. A user might have 17 deleted='No' out of 118, another user might have 38 deleted='No' out of 118. So for the first user the % is 35.29 and for the second user the % is 15.79. So that approach does not address the problem.
i wasn't suggesting hard-coding the numbers. that was just an example. you'll have to count things each time
Here is my stored procedure. How can I count things each time?
ALTER PROC [dbo].[spItemReviewQcPickSampleUpdate]
(
@StartDate Datetime
,@EndDate Datetime
)
AS
--Completed Items
BEGIN
WITH CTE
AS (SELECT ItemReviewId
,100.0
*row_number() OVER(PARTITION BY UserId
,ReviewDate
--,Deleted
order by newid()
)
/count(*) OVER(PARTITION BY UserId
,Reviewdate
--,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND Deleted = 'No'
AND Audited='No'
AND UserId NOT LIKE 'v_s2_'+ '%'
)
SELECT a.*
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=31
;
WITH CTE
AS (SELECT ItemReviewId
,100.00
*row_number() OVER(PARTITION BY UserId
,ReviewDate
--,Deleted
ORDER BY newid()
)
/COUNT(*) OVER(PARTITION BY UserId
,Reviewdate
--,Deleted
)
AS pct
FROM tblItemReviewItems
WHERE ReviewDate BETWEEN @StartDate AND @EndDate
AND deleted ='No'
AND audited='No'
AND UserId NOT LIKE 'v_s2_'+ '%'
)
UPDATE a
SET Audited='Check'
FROM tblItemReviewItems AS a
INNER JOIN cte AS b
ON b.ItemReviewId=a.ItemReviewId
AND b.pct<=31
;
END
easy way:
declare @nos float = select count() from ... where deleted = 'no'
declare @yes float = select count() from ... where delete = 'yes'
declare @samplepct float = .05 * @nos/(@nos + @yes)
then use the samplepct in
select top @samplepct percent from <population> where delete = 'No'
etc
1 Like
Sir, I have been trying for the last couple of days but could not create a stored procedure following your suggestions. May be I am asking too much. Could you please create a stored procedure for this? Your help is greatly appreciated.
I was able to pick sample for a UserId. Now the questions are- i. how can I select samples for all the users? ii. how can I update the "AuditStatus" column value to "Check" to the selected rows?
ALTER PROC [dbo].[spPickSample]
AS
BEGIN
DECLARE @Nos INT = (SELECT COUNT (*) FROM tblItemReviewItems WHERE Deleted='No' and UserId='jdoe1')
DECLARE @Yes INT = (SELECT COUNT (*) FROM tblItemReviewItems WHERE Deleted='Yes' and UserId='jdoe1')
DECLARE @Samplepct FLOAT =(.05*(@Nos + @Yes)) / @Nos*100
DECLARE @Percentage AS INT = @Samplepct
SELECT
TOP (@Percentage) PERCENT *
FROM
tblItemReviewItems
WHERE
Deleted='No' AND UserId='jdoe1'
ORDER BY
NEWID()
END
Could this be what you're looking for:
select b.*
from (select itemreviewid
,ceiling((sum(case when deleted='No' then 1 else 0 end) over(partition by userid)
+sum(case when deleted='Yes' then 1 else 0 end) over(partition by userid)
)
*.05
)
as records
,row_number() over(partition by deleted,userid order by newid()) as rn
from tblitemreviewitems
) as a
inner join tblitemreviewitems as b
on b.itemreviewid=a.itemreviewid
where a.rn<=a.records
order by b.userid
,b.deleted
,b.itemreviewid
;
1 Like
Thank you very much. You are awesome! This is exactly what I needed. Here is my final stored procedure:
ALTER PROC [dbo].[spPickSample]
(
@StartDate AS DATETIME
,@EndDate AS DATETIME
)
AS
BEGIN
DECLARE @tempItems TABLE
(
ItemReviewId VarChar(50)
,UserId VarChar(50)
,ReviewDate Datetime
,Deleted varchar(10)
,Audited varchar(10)
)
INSERT INTO @tempItems (ItemReviewId, UserId, ReviewDate, Deleted, Audited)
SELECT b.ItemReviewId, UserId, ReviewDate, Deleted, Audited
FROM (SELECT ItemreviewId
,CEILING((SUM(CASE WHEN Deleted='No' AND ReviewDate BETWEEN @StartDate AND @EndDate AND UserId NOT LIKE 'b_d2_' +'%' THEN 1 ELSE 0 END) OVER(PARTITION BY UserId)
+SUM(CASE WHEN Deleted='Yes' AND ReviewDate BETWEEN @StartDate AND @EndDate AND UserId NOT LIKE 'b_d2_' +'%' THEN 1 ELSE 0 END) OVER(PARTITION BY UserId)
)
*.05
)
AS records
,row_number() OVER(PARTITION BY Deleted,UserId ORDER BY NEWID()) AS rn
FROM tblItemReviewItems
) AS a
INNER JOIN tblItemReviewItems AS b
ON b.ItemreviewId=a.ItemreviewId
WHERE
a.rn<=a.records
ORDER BY
b.UserId
,b.Deleted
,b.ItemreviewId
;
UPDATE a
SET Audited='Check'
FROM tblItemReviewItems AS a
INNER JOIN @tempItems AS b
ON b.ItemReviewId=a.ItemReviewId
Begin
UPDATE tblItemReviewItems SET Audited='Check' WHERE ReviewDate BETWEEN @StartDate AND @EndDate AND UserId LIKE 'b_d2_' +'%'
End
END