Logo

SQL Server LOGITSUM Function

Updated 2024-02-13 20:13:13.277000

Description

Use the table-valued function LOGITSUM to calculate the binary logistic regression coefficients from a table of grouped x-values (the independent variables) with counts of the successes and failures for each group. The successes and failures can appear in any column and the column number of the successes and failures is specified at input.

x1x2x3xnsuccessfailure
x 1,1x 1,2x 1,3x 1,nsuccess 1failure 1
x 2,1x 2,2x 2,3x 2,nsuccess 2failure 2
x m,1x m,2x m,3x m,nsuccess mfailure m

For dichotomous values, where the y-values are {0,1}, use the LOGIT function.

Logistic regression estimates the probability of an event occurring. Unlike ordinary least squares (see LINEST and LINEST_q) which estimates the value of a dependent variable based on the independent variables, LOGIT measures the probability (p) of an event occurring (1) or not occurring (0). The probability is estimated as:

p = \hat{\pi} = \frac{e^{\beta_0 + \beta_1x_1 + ... + ... \beta_nx_n}}{1 + e^{\beta_0 + \beta_1x_1 + ... + ... \beta_nx_n}}

The LOGIT function works by finding the coefficient (ß) values that maximize the log-likelihood statistic, which is defined as:

LL = \sum_{i=1}^n N_i \times \{y_i \times \ln \hat{\pi}(x_i) + (1 - y_i) \times \ln (1 - \hat{\pi}(x_i)\}

using a method of iteratively re-weighted least squares.

Syntax

SELECT * FROM [westclintech].[wct].[LOGITSUM](
  <@Matrix_RangeQuery, nvarchar(max),>
 ,<@Success_ColumnNumber, int,>
 ,<@Failure_ColumnNumber, int,>)

Arguments

@Matrix_RangeQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of x-values including the counts of successes and failures.

@Success_ColumnNumber

the number of the column in the resultant table returned by @Matrix_RangeQuery which contains the count of successes. @Success_ColumnNumber must be of the type int or of a type that implicitly converts to int.

@Failure_ColumnNumber

the number of the column in the resultant table returned by @Matrix_RangeQuery which contains the count of failures. @Failure_ColumnNumber must be of the type int or of a type that implicitly converts to int.

Return Type

table

colNamecolDatatypecolDesc
stat_namenvarchar(4000)Identifies the statistic being returned: b estimated coefficient for each independent variable plus the intercept se standard error of b z z statistic for b pval p-value (normal distribution) for the z statistic Wald Wald statistic LL0 log-likelihood with just the intercept and no other coefficients LLM model log-likelihood chisq chi squared statistic df degrees of freedom p_chisq p-value of the chi-squared statistic AIC Akaike information criterion BIC Bayesian information criterion Nobs number of observations rsql log-linear ratio R2 rsqcs Cox and Snell's R2 rsqn Nagelkerke's R2 D deviance Iterations number of iteration in iteratively re-weighted least squares Converged a bit value identifying whether a solution was found (1) or not (0) AUROC area under the ROC curve For more information on how these statistics are calculated see the LOGIT documentation.
idxintIdentifies the subscript for the estimated coefficient (b), standard error of the coefficient (se), z statistics (z), p-value of the z statistic (pval), and the Wald statistic. When the idx is 0, it is referring to the intercept, otherwise the idx identifies the column number of independent variable. Descriptive statistics other than the ones mentioned above will have an idx of NULL.
stat_valfloatthe calculated value of the statistic.

Remarks

If @Success_ColumnNumber is NULL then the second right-most column in the resultant table is assumed to contain the count of successes.

If @Failure_ColumnNumber is NUL then the right-most column in the resultant table is assumed to contain the count of failures.

@Matrix_RangeQuery must return at least 3 columns or an error will be returned.

If @Success_ColumnNumber is not NULL and @Success_ColumnNumber < 1 an error will be returned.

If @Failure_ColumnNumber is not NULL and @Success_ColumnNumber < 1 an error will be returned.

If @Success_ColumnNumber is not NULL and @Success_ColumnNumber greater than the number of columns returned by @Matrix_RangeQuery an error will be returned.

If @Failure_ColumnNumber is not NULL and @Failure_ColumnNumber greater than the number of columns returned by @Matrix_RangeQuery an error will be returned.

Examples

Using the coronary heart disease data from LOGIT Example #1, we summarize the data into the 8 different age groups in Hosmer and use the summarized data in the logistic regression. This summarization can be handled entirely within the @MatrixRangeQuery string passed into the function.

SELECT *
INTO   #chd
  FROM (   VALUES (20, 0),
                  (23, 0),
                  (24, 0),
                  (25, 0),
                  (25, 1),
                  (26, 0),
                  (26, 0),
                  (28, 0),
                  (28, 0),
                  (29, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 0),
                  (30, 1),
                  (32, 0),
                  (32, 0),
                  (33, 0),
                  (33, 0),
                  (34, 0),
                  (34, 0),
                  (34, 1),
                  (34, 0),
                  (34, 0),
                  (35, 0),
                  (35, 0),
                  (36, 0),
                  (36, 1),
                  (36, 0),
                  (37, 0),
                  (37, 1),
                  (37, 0),
                  (38, 0),
                  (38, 0),
                  (39, 0),
                  (39, 1),
                  (40, 0),
                  (40, 1),
                  (41, 0),
                  (41, 0),
                  (42, 0),
                  (42, 0),
                  (42, 0),
                  (42, 1),
                  (43, 0),
                  (43, 0),
                  (43, 1),
                  (44, 0),
                  (44, 0),
                  (44, 1),
                  (44, 1),
                  (45, 0),
                  (45, 1),
                  (46, 0),
                  (46, 1),
                  (47, 0),
                  (47, 0),
                  (47, 1),
                  (48, 0),
                  (48, 1),
                  (48, 1),
                  (49, 0),
                  (49, 0),
                  (49, 1),
                  (50, 0),
                  (50, 1),
                  (51, 0),
                  (52, 0),
                  (52, 1),
                  (53, 1),
                  (53, 1),
                  (54, 1),
                  (55, 0),
                  (55, 1),
                  (55, 1),
                  (56, 1),
                  (56, 1),
                  (56, 1),
                  (57, 0),
                  (57, 0),
                  (57, 1),
                  (57, 1),
                  (57, 1),
                  (57, 1),
                  (58, 0),
                  (58, 1),
                  (58, 1),
                  (59, 1),
                  (59, 1),
                  (60, 0),
                  (60, 1),
                  (61, 1),
                  (62, 1),
                  (62, 1),
                  (63, 1),
                  (64, 0),
                  (64, 1),
                  (65, 1),
                  (69, 1)) n (age, chd)
SELECT *
  FROM wct.LOGITSUM(
           ' SELECT a.age as [Age] ,SUM(c.chd) as Present ,COUNT(*) - SUM(c.chd) as Absent FROM #chd c CROSS APPLY ( SELECT TOP 1 age FROM (VALUES (20,1),(30,2),(35,3),(40,4) ,(45,5),(50,6),(55,7),(60,8) )n(age,grp) WHERE n.age <= c.age ORDER BY n.age DESC )a GROUP BY a.age',
           2,
           3)

This produces the following result.

stat_nameidxstat_val
b0-4.85868536330207
b10.106379775834921
se01.059814083208
se10.0235010242724494
z0-4.58446952185716
z14.526601674959
pval04.55140467350453E-06
pval15.99397669650406E-06
Wald021.0173607968372
Wald120.4901227237416
LL0NULL-68.3314913574166
LLMNULL-54.2283793396647
chisqNULL28.2062240355037
dfNULL1
p_chisqNULL1.09053326927106E-07
AICNULL112.456758679329
BICNULL117.667099051306
NobsNULL100
rsqlNULL0.206392568603307
rsqcsNULL0.245773260267886
rsqnNULL0.32988140227239
DNULL-0.462332717792356
AUROCNULL0.745818033455732
IterationsNULL5
ConvergedNULL1

As you can see the results are similar but not exactly the same as those produced using the raw data in LOGIT . Using the summarized data, the fitted values are given by the equation:

https://westclintech.com/Portals/0/images/doc_stats_LOGITSUM_img1.jpg

We can compare the results of the two models for a 60-year old using the following SQL.

SELECT 1 / (1 + EXP(-1 * (-5.30945337391905 + 0.1109211422069 * 60))) as P_raw,
       1 / (1 + EXP(-1 * (-4.85868536330207 + 0.106379775834921 * 60))) as P_sum

This produces the following result.

P_rawP_sum
0.7934446156552870.821141607483233

Example #2

In this example we have 3 independent variables, a count of the number of successes and a count of the number of observations for each grouping. We will need to calculate the count of failures in @Matrix_RangeQuery .

SELECT *
INTO   #t
  FROM (   VALUES (100, 1, 10, 28, 156),
                  (150, 1, 10, 33, 144),
                  (200, 1, 10, 44, 171),
                  (250, 1, 10, 56, 196),
                  (300, 1, 10, 55, 158),
                  (350, 1, 10, 44, 100),
                  (400, 1, 10, 57, 126),
                  (450, 1, 10, 77, 166),
                  (500, 1, 10, 84, 166),
                  (100, 2, 10, 23, 153),
                  (150, 2, 10, 31, 165),
                  (200, 2, 10, 40, 179),
                  (250, 2, 10, 42, 152),
                  (300, 2, 10, 55, 181),
                  (350, 2, 10, 68, 200),
                  (400, 2, 10, 59, 148),
                  (450, 2, 10, 69, 156),
                  (500, 2, 10, 75, 157),
                  (100, 1, 11, 19, 164),
                  (150, 1, 11, 23, 147),
                  (200, 1, 11, 35, 182),
                  (250, 1, 11, 46, 196),
                  (300, 1, 11, 41, 143),
                  (350, 1, 11, 60, 189),
                  (400, 1, 11, 59, 162),
                  (450, 1, 11, 75, 187),
                  (500, 1, 11, 59, 129),
                  (100, 2, 11, 9, 105),
                  (150, 2, 11, 22, 179),
                  (200, 2, 11, 30, 182),
                  (250, 2, 11, 32, 155),
                  (300, 2, 11, 41, 164),
                  (350, 2, 11, 58, 200),
                  (400, 2, 11, 60, 181),
                  (450, 2, 11, 75, 199),
                  (500, 2, 11, 59, 141)) n (x1, x2, x3, success, N)
SELECT *
  FROM wct.LOGITSUM('SELECT x1,x2,x3,success,n-success from #t', 4, 5)

This produces the following result.

stat_nameidxstat_val
b01.47313667574166
b10.00420051489570411
b2-0.169229814246798
b3-0.323988399455435
se00.628307634096669
se10.000241365967082602
se20.0589477537583747
se30.0590000710920504
z02.34461049937682
z117.4030951690326
z2-2.8708441536291
z3-5.4913221875607
pval00.0190469656457266
pval17.81629936122213E-68
pval20.00409377299880153
pval33.98935835481458E-08
Wald05.49719839378803
Wald1302.867721462405
Wald28.24174615442641
Wald330.1546193675964
LL0NULL-3573.5582397508
LLMNULL-3396.62788760086
chisqNULL353.860704299878
dfNULL3
p_chisqNULL2.17621977974947E-76
AICNULL6801.25577520171
BICNULL6827.97234303532
NobsNULL5879
rsqlNULL0.0495109748546528
rsqcsNULL0.0584149763613681
rsqnNULL0.083034978045949
DNULL-0.101557327043684
AUROCNULL0.639751323614436
IterationsNULL4
ConvergedNULL1

See Also

LINEST - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

LOGEST - Logarithmic regression

LOGIT - Logit regression

LOGITPRED - Calculate predicted values based on a logit regression

LOGITPROB - Calculate the probability that Y = 1 given a set of coefficients from a logistic regression

VIF - Variance inflation factors