Logo

SQL Server ROCTable Function

Updated 2023-11-02 12:19:09.947000

Description

Use the table-valued function ROCTABLE to show the calculation of the area under the ROC curve. The function accepts either raw or grouped data as input.

The function has a single input parameter which is an SQL SELECT statement which, when executed, returns a resultant table where the first column is the predicted probabilities. For the raw case, the SQL will return one additional column, consisting of zeroes and ones indicating the absence (0) or presence (1) of the characteristic of interest. You may also think of these as indicating the failure (0) or success (1) of the observation.

For the grouped case, the input SQL will return two additional columns containing the count of failures and successes for a predicted probability.

The function returns a table (described below) sorted by ascending predicted probability which calculates the True Positive Rate, the False Positive Rate, and the area under the ROC curve (AUROC). This is the same value returned by the LOGIT and LOGITSUM functions.

Syntax

SELECT * FROM [westclintech].[wct].[ROCTable](
   <@Matrix_RangeQuery, nvarchar(max),>)

Arguments

@Matrix_RangeQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of predicted probabilities and Y values.

Return Type

table

colNamecolDatatypecolDesc
idxinta unique identifier for the row identifying its positon in the resultant table
ppredfloatpredicted probability
failureintfor raw data, the count of the number of rows for the predicted probability having a value of 0. For grouped data, the sum of the second column passed into the function grouped by predicted probability
successintfor raw data, the sum of the second column grouped by predicted probability. For grouped data, the sum of the third column passed into the function grouped by the predicted probability
cumfailureintthe sum of failure for the current row and all preceding rows
cumsuccessintthe sum of success for the current row and all preceding rows
FalsePositiveRatefloatcumfailure(idx) /cumfailure(idxmax)
TruePositiveRatefloatcumsuccess(idx) /cumsuccess(idxmax)
AUROCfloat[FalsePositiveRate(idx+1) – FalsePositiveRate(idx)] * TruePositiveRate(idx)
cumAUROCfloatthe sum of AUROC for the current row and all preceding rows

Remarks

The first column returned by @Matrix_RangeQuery should contain the predicted probabilities where 0 <= predicted probability <= 1.

The resultant table returned by @Matrix_RangeQuery should return either 2 columns or 3 columns.

When the resultant table contains 2 columns the function assumes that the second column contains binary responses consisting of zero (0) or (1); the use of other values will produce unreliable results.

When the resultant table contains 3 columns the function assumes that the second column contains a count of the failures or absences and the third column contains a count of the successes or presence.

Examples

In this example we use the same data as was used in the LOGIT documentation, consisting of the Coronary Heart Disease data from Applied Logistic Regression, Third Edition by David W. Hosmer, Jr., Stanley Lemeshow, and Rodney X. Sturdivant . The data consist of a single independent variable (age) and an outcome (chd) which indicates the absence (0) or presence (1) of coronary heart disease. We will put the data into a temporary table, run the logistic regression, use the coefficients from the logistic to create the predicted probabilities and then produce the ROC table. Note that the AUROC value is actually returned by the LOGIT function; this example simply explains the calculation.

--Put the Hosmer data into the #chd table
SELECT *
INTO   #chd
  FROM (   VALUES (20, 0),
                  (23, 0),
                  (24, 0),
                  (25, 0),
                  (25, 1),
                  (26, 0),
                  (26, 0),
                  (28, 0),
                  (28, 0),
                  (29, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 1),
                  (32, 0),
                  (32, 0),
                  (33, 0),
                  (33, 0),
                  (34, 0),
                  (34, 0),
                  (34, 1),
                  (34, 0),
                  (34, 0),
                  (35, 0),
                  (35, 0),
                  (36, 0),
                  (36, 1),
                  (36, 0),
                  (37, 0),
                  (37, 1),
                  (37, 0),
                  (38, 0),
                  (38, 0),
                  (39, 0),
                  (39, 1),
                  (40, 0),
                  (40, 1),
                  (41, 0),
                  (41, 0),
                  (42, 0),
                  (42, 0),
                  (42, 0),
                  (42, 1),
                  (43, 0),
                  (43, 0),
                  (43, 1),
                  (44, 0),
                  (44, 0),
                  (44, 1),
                  (44, 1),
                  (45, 0),
                  (45, 1),
                  (46, 0),
                  (46, 1),
                  (47, 0),
                  (47, 0),
                  (47, 1),
                  (48, 0),
                  (48, 1),
                  (48, 1),
                  (49, 0),
                  (49, 0),
                  (49, 1),
                  (50, 0),
                  (50, 1),
                  (51, 0),
                  (52, 0),
                  (52, 1),
                  (53, 1),
                  (53, 1),
                  (54, 1),
                  (55, 0),
                  (55, 1),
                  (55, 1),
                  (56, 1),
                  (56, 1),
                  (56, 1),
                  (57, 0),
                  (57, 0),
                  (57, 1),
                  (57, 1),
                  (57, 1),
                  (57, 1),
                  (58, 0),
                  (58, 1),
                  (58, 1),
                  (59, 1),
                  (59, 1),
                  (60, 0),
                  (60, 1),
                  (61, 1),
                  (62, 1),
                  (62, 1),
                  (63, 1),
                  (64, 0),
                  (64, 1),
                  (65, 1),
                  (69, 1)) n (age, chd);

--Run LOGIT and store the results in #mylogit
SELECT *
INTO   #mylogit
  FROM wct.LOGIT('SELECT age,chd FROM #chd', 2);

--Calculate the predicted probabilities for each row in #chd and store the
--predicted probability and the chd value in #t
SELECT wct.LOGITPRED('SELECT stat_val FROM #mylogit where stat_name = ''b'' ORDER BY idx', cast(age as varchar(max))) as [p predicted],
       chd as y
INTO   #t
  FROM #chd;

--Run the ROCTable function
SELECT *
  FROM wct.ROCTable('SELECT * FROM #t');

This produces the following result.

idxppredfailuresuccesscumfailurecumsuccessFalsePositiveRateTruePositiveRateAUROCcumAUROC
00.912464554564153010100.023255813953488400
10.869939152344419010200.04651162790697670.0008159934720522230.000815993472052223
20.85686593067653611130.01754385964912280.069767441860465100.000815993472052223
30.84271622060268301140.01754385964912280.093023255813953500.000815993472052223
40.82744940176391402160.01754385964912280.1395348837209300.000815993472052223
50.81103299288096801170.01754385964912280.1627906976744190.002855977152182780.00367197062423501
60.79344461565528711280.03508771929824560.18604651162790700.00367197062423501
70.774673993551717022100.03508771929824560.2325581395348840.004079967360261120.00775193798449612
80.754724899971724123120.05263157894736840.279069767441860.009791921664626680.0175438596491228
90.733616953220639245160.0877192982456140.37209302325581400.0175438596491228
100.711387142595015035190.0877192982456140.4418604651162790.007751937984496120.0252957976336189
110.688090963392313126210.1052631578947370.48837209302325600.0252957976336189
120.663803041111905016220.1052631578947370.51162790697674400.0252957976336189
130.638617138505235026240.1052631578947370.5581395348837210.009791921664626680.0350877192982456
140.61264546440856117250.122807017543860.5813953488372090.01019991840065280.0452876376988984
150.586017240033851108250.1403508771929820.5813953488372090.01019991840065280.0554875560995512
160.558876524531328119260.1578947368421050.6046511627906980.02121583027335780.076703386372909
170.5313793534369512111270.1929824561403510.6279069767441860.0110159118727050.087719298245614
180.5036902959935131212290.2105263157894740.6744186046511630.02366381068951450.111383108935129
190.4759785844732812114300.2456140350877190.6976744186046510.01223990208078340.123623011015912
200.4484140048604641115310.2631578947368420.7209302325581390.01264789881680950.136270909832721
210.4211627589753441116320.2807017543859650.7441860465116280.02611179110567120.162382700938392
220.3943835106261782218340.3157894736842110.7906976744186050.02774377804977560.190126478988168
230.3682238123282762120350.3508771929824560.8139534883720930.04283965728274170.23296613627091
240.3428170766427843123360.4035087719298250.8372093023255810.02937576499388010.26234190126479
250.3182802114257522025360.438596491228070.8372093023255810.014687882496940.27702978376173
260.2947119867178421126370.4561403508771930.860465116279070.01509587923296610.292125662994696
270.2721921485117541127380.4736842105263160.8837209302325580.03100775193798450.323133414932681
280.2507812465609692029380.5087719298245610.8837209302325580.03100775193798450.354141166870665
290.2305211038773862131390.5438596491228070.9069767441860460.03182374541003670.385964912280702
300.2114358271319042133400.5789473684210530.9302325581395350.03263973888208890.418604651162791
310.1935332406631262035400.6140350877192980.9302325581395350.06527947776417790.483884128926969
320.1768066215825864139410.6842105263157890.9534883720930230.03345573235414120.51733986128111
330.1612366178210712041410.7192982456140350.9534883720930230.03345573235414120.550795593635251
340.1467932425433172043410.7543859649122810.9534883720930230.08363933088535290.634434924520604
350.1211250535032685148420.8421052631578950.9767441860465120.01713586291309670.651570787433701
360.109804435463621049420.8596491228070180.9767441860465120.03427172582619340.685842513259894
370.09942217640138632051420.8947368421052630.9767441860465120.03427172582619340.720114239086087
380.08124847365986182053420.9298245614035090.9767441860465120.01713586291309660.737250101999184
390.07334378841000281154430.94736842105263210.01754385964912290.754793961648307
400.06615277830121591055430.96491228070175410.01754385964912280.77233782129743
410.05962144972811551056430.98245614035087710.01754385964912290.789881680946553
420.04347875674882361057431100.789881680946553

You can see from the table that the cumulative AUROC value is 0.789881680946553. This is the same as the value returned by LOGIT .

--Get the AUROC value from #mylogit
SELECT stat_val
  FROM #mylogit
 WHERE stat_name = 'AUROC';

This produces the following result.

stat_val
0.789881680946553

However, ROCTABLE does return the False Positive Rate and the True Positive Rate, which can be graphed using SSRS, Excel, or any tool that you prefer. In this example, I have simply copied the FalsePositiveRate and TruePositiveRate from ROCTABLE , pasted them into Excel and then produced the following graph.

http://westclintech.com/Portals/0/images/doc_stats_ROCTable_img1.jpg

It is worth noting that our input data consisted of 100 rows, yet ROCTABLE only returned 43 rows of data from the temporary table #t, even though we generated a predicted probability for all 100 rows. This is because there were not 100 unique predicted probabilities. We can get the number of unique predicted probabilities using the following SQL.

SELECT COUNT(DISTINCT [p predicted]) as [COUNT p predicted]
  FROM #t;

This produces the following result, which matches what was returned by ROCTable .

COUNT p predicted
43

As Hosmer points out in section 5.4.2 "let n1 denote the number of subjects with y = 1 and n0 denote the number of subjects with y = 0. We can then create n1 x n0 pairs; each subject with y = 1 , paired with each subject with y = 0 . Of these n1 x n0 pairs, we determine the proportion of the pairs where the subject with y = 1 had the higher of the two probabilities. This proportion may be shown to be equal to the area under the ROC Curve."

The technique that he is suggesting lends itself quite well to SQL and we can use it to check the AUROC calculation in both LOGIT and ROCTable . We would not recommend this calculation as a practical matter as it requires a Cartesian product; in this case 57 x 43 combinations.

--Calculate the area under the ROC Curve using a Cartesian product
;with mycte
as (SELECT n1.y as y1,
           n1.[p predicted] as p1,
           n0.y as y0,
           n0.[p predicted] as p0
    FROM #t n1,
         #t n0
    WHERE n1.y = 1
          AND n0.y = 0)
SELECT COUNT(m1.y1) / cast(n.pairs as float) As AUROC,
       n.pairs
FROM
(SELECT SUM(y1)FROM mycte) n(pairs) ,
mycte m1
WHERE m1.p1 > m1.p0
GROUP BY n.pairs;

This produces the following result.

AUROCpairs
0.7898816809465522451

Now, let's look at an example using grouped data. The data consist of 3 independent variables; x1, x2, and x3, and 2 additional columns; the number of successes for that combination of independent variables and the number of observations for that combination of independent variables.

--Put grouped data into a temporary table #x
SELECT *
INTO #x
FROM
(
    VALUES
        (100, 1, 10, 28, 156),
        (150, 1, 10, 33, 144),
        (200, 1, 10, 44, 171),
        (250, 1, 10, 56, 196),
        (300, 1, 10, 55, 158),
        (350, 1, 10, 44, 100),
        (400, 1, 10, 57, 126),
        (450, 1, 10, 77, 166),
        (500, 1, 10, 84, 166),
        (100, 2, 10, 23, 153),
        (150, 2, 10, 31, 165),
        (200, 2, 10, 40, 179),
        (250, 2, 10, 42, 152),
        (300, 2, 10, 55, 181),
        (350, 2, 10, 68, 200),
        (400, 2, 10, 59, 148),
        (450, 2, 10, 69, 156),
        (500, 2, 10, 75, 157),
        (100, 1, 11, 19, 164),
        (150, 1, 11, 23, 147),
        (200, 1, 11, 35, 182),
        (250, 1, 11, 46, 196),
        (300, 1, 11, 41, 143),
        (350, 1, 11, 60, 189),
        (400, 1, 11, 59, 162),
        (450, 1, 11, 75, 187),
        (500, 1, 11, 59, 129),
        (100, 2, 11, 9, 105),
        (150, 2, 11, 22, 179),
        (200, 2, 11, 30, 182),
        (250, 2, 11, 32, 155),
        (300, 2, 11, 41, 164),
        (350, 2, 11, 58, 200),
        (400, 2, 11, 60, 181),
        (450, 2, 11, 75, 199),
        (500, 2, 11, 59, 141)
) n (x1, x2, x3, success, N);
--Run LOGIT and store the results in #mylogit
SELECT *
INTO #mylogit
FROM wct.LOGITSUM('SELECT x1,x2,x3,success,n-success from #x', 4, 5);
--Calculate the predicted probabilities using LOGITPROB for each row in #x and store the
--predicted probability and the group totals in #t
SELECT wct.LOGITPROB(n.x, m.stat_val) as [p predicted],
       n - Success as failure,
       success as success
INTO #t
FROM #x
    CROSS APPLY
(
    VALUES
        (0, 1),
        (1, x1),
        (2, x2),
        (3, x3)
) n (idx, x)
    INNER JOIN #mylogit m
        ON m.idx = n.idx
WHERE m.stat_name = 'b'
GROUP BY n - Success,
         success;
--Run ROCTable function
SELECT *
FROM wct.ROCTable('SELECT * FROM #t');

This produces the following result.

idxppredfailuresuccesscumfailurecumsuccessFalsePositiveRateTruePositiveRateAUROCcumAUROC
00.540977960616935828482840.0198259187620890.04819277108433740.0009554659644380230.000955465964438023
10.4987626276624982751641590.03965183752417790.09122203098106710.001962949892967840.00291841585740586
20.48856563643700589772532360.06117021276595740.1353987378083760.002291564711457050.00520998056886291
30.46015763004419270593232950.07809477756286270.169248422260470.00356010946244220.00877009003130511
40.44646221482361587694103640.09912959381044490.2088353413654620.003483955163011810.0122540451943169
50.43640352977477869574794210.1158123791102510.2415375788869760.004788704417004850.0170427496113218
60.41849907113284482595614800.135638297872340.2753872633390710.007457295332199210.024500044943521
70.40860533932183112756735550.1627176015473890.31841652323580.006851806230170750.0313518511736917
80.395320678867389597626140.1842359767891680.3522662076878940.004769561806219070.0361214129799108
90.38561153759944856448186580.1977756286266920.3775100401606430.01131799926980650.0474394122497173
100.368428698069553124759427330.2277562862669250.4205393000573720.01047281138924310.0579122236389604
110.3589879175070591035910457920.2526595744680850.4543889845094660.01450177610136590.0724139997403263
120.3463716165860071326811778600.2845744680851060.4934021801491680.01228733669133570.084701336431662
130.3371942774217651035512809150.3094777562862670.5249569707401030.01535778371846050.100059120150122
140.321041547879041216014019750.3387330754352030.5593803786574870.01744682515638680.117505945306509
150.31221476744549512960153010350.3699226305609280.5938037865748710.01808976719256130.135595712499071
160.30047174182625312655165610900.4003868471953580.6253585771658060.02116784352108630.156763556020157
170.29196731245262614056179611460.4342359767891680.6574870912220310.0225732995535610.179336855573718
180.27707542492828114258193812040.4685686653771760.6907630522088350.01703525902449260.19637211459821
190.26897857010244110241204012450.4932301740812380.7142857142857140.01899696048632220.215369075084533
200.25825113551110311042215012870.5198259187620890.7383820998278830.02267275790090450.238041832985437
210.25051375162483212744227713310.5505319148936170.7636259323006310.02270937854762520.260751211533062
220.23702839382728612341240013720.580270793036750.787148594377510.02854745869357510.289298670226637
230.22972992706715046255014180.6165377176015470.8135398737808380.02734091935578730.316639589582425
240.22009652392386413940268914580.6501450676982590.8364888123924270.02244928872716620.339088878309591
250.21317374821532711133280014910.6769825918762090.8554216867469880.02543928130316240.364528159612753
260.20115894604779412332292315230.7067214700193420.8737808376362590.03105555685022490.395583716462978
270.19468313179057414735307015580.7422630560928430.89386115892140.02895971839832390.424543434861302
280.18616417467176913431320415890.7746615087040620.9116465863453820.02821343400682030.452756868868122
290.18006227285707412828333216170.8056092843326890.9277108433734940.03409382209689820.486850690965021
300.16951163277135815230348416470.8423597678916830.9449225473321860.02832939938810230.515180090353123
310.16384566383021612423360816700.8723404255319150.958118187033850.03011493334487440.545295023697997
320.15641400752325913023373816930.9037717601547390.9713138267355130.03687047166283250.58216549536083
330.14195845040594115722389517150.9417311411992260.9839357429718880.03449484592140330.616660341282233
340.13706145888192714519404017340.976789168278530.9948364888123920.02309098233220250.639751323614436
350.118246213552031969413617431100.639751323614436

You can see from the table that the cumulative AUROC value is 0.639751323614436. This is the same as the value returned by LOGITSUM .

--Get the AUROC value from #mylogit
SELECT stat_val
FROM #mylogit
WHERE stat_name = 'AUROC';

This produces the following result.

stat_val
0.639751323614436

We can modify our SQL slightly from the previous example in order to verify the calculation of the area under the ROC curve using the Cartesian product.

--Calculate the area under the ROC Curve using a Cartesian product
SELECT SUM(n.y0 * n.y1) / cast(p.pairs as float) as AUROC,
       p.pairs
FROM
(
    SELECT t1.success as y1,
           t1.[p predicted] as p1,
           t2.failure as y0,
           t2.[p predicted] as p0
    FROM #t t1,
         #t t2
) n ,
(SELECT SUM(success) * SUM(failure)FROM #t) p(pairs)
WHERE p1 > p0
GROUP BY p.pairs;

This produces the following result.

AUROCpairs
0.6397513236144367209048

See Also

LINEST - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

LOGEST - Logarithmic regression

LOGIT - Logit regression

LOGITPRED - Calculate predicted values based on a logit regression

LOGITPROB - Calculate the probability that Y = 1 given a set of coefficients from a logistic regression

LOGITSUM - Logit regression using summary data

VIF - Variance inflation factors