1 year ago

#77306

test-img

SidC

How to Consistently Return 2.5% of a Result Set using NEWID() and Table Variable?

I'm using the NEWID() approach to obtain a sample of records from my population. I've created a table variable to define my population and I need to consistently return 2.5% of records from it. The query is as follows:

DECLARE @FirstofLastMonth date = DATEADD(DD,1,EOMONTH(Getdate(),-2)), 
@EndofLastMonth date =  EOMONTH(Getdate(), -1)

DECLARE @Sample table  (ID uniqueidentifier PRIMARY KEY, ClaimNO varchar(50), Company_ID 
varchar(10), FinancialResp varchar(30), ProviderType varchar(50), 
DateOfService date, ClaimType varchar(20), TotalBilled numeric(10,2), TotalPaid numeric(10,2) )
Insert into @Sample (ID, ClaimNo,Company_ID, FinancialResp, ProviderType, DateOfService, 
ClaimType, TotalBilled, TotalPaid)
Select distinct NEWID() as ID,  cd.ClaimNo, cd.Company_ID, cd.FinancialResp,
CASE
when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '1' then 'Contracted'
when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '0' then 'Non-Contracted'
when cd.FinancialResp = 'KDMA' and pu.UDF_Value = 'Y' then 'Contracted' 
when cd.FinancialResp = 'KDMA' and pu.UDF_Value <> 'Y' then 'Non-Contracted'
else 'Missing'
end as ProviderType,
Format( Min(cd.FromDateSvc), 'MM/dd/yyyy') as DateOfService, 
CASE 
when substring(cd.ClaimNo, 9,1) = '8' then 'Institutional'
when substring(cd.ClaimNO, 9,1) = '9' then 'Professional'
end as Claim_Type,
sum(cd.Billed) as TotalBilled,
SUM(cd.Net) as TotalPaid
from Claim_Details cd 
inner join Claim_Masters_V cm on cd.ClaimNo = cm.ClaimNo and cd.Company_ID = cm.Company_ID
inner join Prov_VendInfo pv on cm.Prov_KeyID = pv.Prov_KeyID and cm.Company_ID = pv.Company_ID
inner join Vend_Masters vm on pv.Vendor = vm.VendorID 
            and pv.Company_ID = vm.Environment_ID
inner join Prov_Master_V pm on cm.Prov_KeyID = pm.Prov_KeyID 
            and cm.Company_ID = pm.Company_ID
            and pv.Default_YN = 1
inner join BM_Prov_HP_LineOfBuss lob on cm.Prov_KeyID = lob.Prov_KeyID
            and cm.Company_ID = lob.Company_ID
            and cm.HPCode = lob.HPCode
            and pv.Vendor = lob.Vendor
            and lob.LOB_Start <= GETDATE()
left join Prov_UDF pu on cm.Prov_KeyID = pu.Prov_KeyID
            and cm.Company_ID = pu.Company_ID
            and pu.UDFNO = 10
where cm.HPCode = 'KD'
and cd.Status = '9'
and cd.FromDateSvc between @FirstofLastMonth and @EndofLastMonth
group by cd.ClaimNo, cd.FinancialResp, cd.Company_ID, pu.UDF_Value, lob.LOB_Network

--Select Count(*)
--from @Sample  --3,517 records 

select ClaimNo, FinancialResp, ProviderType, DateOfService, ClaimType, TotalBilled, TotalPaid  from @Sample 
WHERE 0.025 >= CAST(CHECKSUM(NEWID(),ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)
--We need to receive ~88 records each time above query is run.

When I run the above query, my results vary from between 80 and 100 records. The business case requires just 2.5% of qualifying records to be returned, no more and no less. How do I ensure that this is always the case, all things being equal?

UPDATE I've modified the approach a bit. I'm now loading all qualifying records into a table. I've refined the query which now returns the required 2.5% of the table population. However, each time I run the query, the results do not appear to be random samples in that the same RecordIDs are returned each time.

    select TOP 2.5 PERCENT RecordID, ClaimNo, HPCode, 
    FinancialResponsibility, ProviderType, ProcessorType, DateOfService, 
    DatePaid, ClaimType, TotalBilled, TotalPaid  
    from [Z_Monthly_Quality_Review]  
    order by RecordId

The table definition is:

CREATE TABLE [dbo].[Z_Monthly_Quality_Review] (
  [RecordId]                UNIQUEIDENTIFIER DEFAULT (newid()) NOT NULL,
  [ClaimNO]                 VARCHAR (50)     NULL,
  [Company_ID]              VARCHAR (10)     NULL,
  [HPCode]                  VARCHAR (10)     NULL,
  [FinancialResponsibility] VARCHAR (30)     NULL,
  [ProviderType]            VARCHAR (50)     NULL,
  [DateOfService]           DATE             NULL,
  [DatePaid]                DATE             NULL,
  [ClaimType]               VARCHAR (50)     NULL,
  [TotalBilled]             NUMERIC (11, 2)  NULL,
  [TotalPaid]               NUMERIC (11, 2)  NULL,
  [ProcessorType]           VARCHAR (100)    NULL,
  PRIMARY KEY CLUSTERED ([RecordId] ASC) WITH (FILLFACTOR = 90)
 );

Should I modify the query or table structure in order to consistently return a 2.5% random sample (different RecordIDs each time its run) of the table data?

tsql

0 Answers

Your Answer

Accepted video resources