Logo

SQL Server VIF Function

Updated 2023-11-02 13:41:11.003000

Description

Use the table-valued function VIF to calculate the R2, tolerance and the variance inflation factor for a set of independent variables.

VIF performs a linear regression for each independent variable against all the other independent variables. For each of these regression analyses it calculates R2 as:

R^2 = 1 - \frac{SS_{res}}{SS_{tot}}

where SSres is the residual sum of squares and SStot is the total sum of squares.

The tolerance is derived directly from R2.

\text{VIF} = \frac{1}{\text{Tolerance}}

The variance inflation factor is calculated from the tolerance.

Syntax

SELECT * FROM [westclintech].[wct].[VIF](
   <@Matrix_RangeQuery, nvarchar(max),>)

Arguments

@Matrix_RangeQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of x-values used in the calculation.

Return Type

table

colNamecolDatatypecolDesc
ColNumintThe column number from the resultant table returned by @Matrix_RangeQuery.
RsquaredfloatThe R-squared value.
VIFfloatThe variance inflation factor.
TolerancefloatThe tolerance.

Remarks

@Matrix_RangeQuery must return at least 2 columns or an error will be returned.

If Rsquared = 1 then Tolerance is zero and VIF is NULL.

If there is only one column of data, no rows will be returned.

Examples

In this example we evaluate 3 independent variables: age, height and weight.

SELECT *
INTO #t
FROM
(
    VALUES
        (22, 76, 197.4),
        (27, 68.9, 189.3),
        (28, 67.7, 215.1),
        (20, 72.3, 173.6),
        (29, 71.2, 171),
        (23, 70.3, 184.6),
        (20, 73.4, 178.7),
        (23, 64.6, 217),
        (23, 68.2, 204),
        (24, 78.9, 249.9),
        (25, 76.3, 209.3),
        (22, 78.8, 209.7),
        (26, 61.4, 137.3),
        (24, 73.4, 186.2),
        (28, 76.5, 220.4),
        (20, 75.9, 171.8),
        (25, 77.1, 239.3),
        (25, 67.3, 169.9),
        (28, 71.8, 184.3),
        (21, 77.6, 180.3)
) n (Age, Height, Weight);
SELECT *
FROM wct.VIF('SELECT * FROM #t');

This produces the following result.

ColNumRsquaredVIFTolerance
10.1664696155057461.19971631340920.833530384494254
20.3497824142585771.537946714959630.650217585741423
30.2961591107112041.420775654296610.703840889288796

While we put our data into a temporary table (#t) and executed the table-valued function by referencing #t, we could have just as easily put the SQL inside the call to the function as the following SQL demonstrates.

SELECT *
FROM wct.VIF('
   SELECT
       *
   FROM (VALUES
        (22,76,197.4)
       ,(27,68.9,189.3)
       ,(28,67.7,215.1)
       ,(20,72.3,173.6)
       ,(29,71.2,171)
       ,(23,70.3,184.6)
       ,(20,73.4,178.7)
       ,(23,64.6,217)
       ,(23,68.2,204)
       ,(24,78.9,249.9)
       ,(25,76.3,209.3)
       ,(22,78.8,209.7)
       ,(26,61.4,137.3)
       ,(24,73.4,186.2)
       ,(28,76.5,220.4)
       ,(20,75.9,171.8)
       ,(25,77.1,239.3)
       ,(25,67.3,169.9)
       ,(28,71.8,184.3)
       ,(21,77.6,180.3)
       )n(Age,Height,Weight)');

Since the table-valued function uses dynamic SQL, however, it will not produce the desired result if it references a derived table or a common table expression outside the scope of the dynamic SQL. For example, the following SQL will not work.

WITH mycte
AS (SELECT *
    FROM
    (
        VALUES
            (22, 76, 197.4),
            (27, 68.9, 189.3),
            (28, 67.7, 215.1),
            (20, 72.3, 173.6),
            (29, 71.2, 171),
            (23, 70.3, 184.6),
            (20, 73.4, 178.7),
            (23, 64.6, 217),
            (23, 68.2, 204),
            (24, 78.9, 249.9),
            (25, 76.3, 209.3),
            (22, 78.8, 209.7),
            (26, 61.4, 137.3),
            (24, 73.4, 186.2),
            (28, 76.5, 220.4),
            (20, 75.9, 171.8),
            (25, 77.1, 239.3),
            (25, 67.3, 169.9),
            (28, 71.8, 184.3),
            (21, 77.6, 180.3)
    ) n (Age, Height, Weight) )
SELECT *
FROM wct.VIF('SELECT * FROM mycte');

This produces the following message.

Exception:

[Product version 1.13 Build: 0219.573]

An error occurred in Function Package while attempting to access database data:

Invalid object name 'mycte'. (msg:208 state:1 ln:0)

In this example we generate a random variable (x1) from the normal distribution with a mean of 0 and standard deviation of 1, another random variable (x2) from the normal distribution a mean of 2 and a standard deviation of 4. The x3 variable is calculated as x2 plus a uniform random variable between 0 and 1.

SELECT x1,
       x2,
       x2 + wct.RAND() as x3
INTO #t1
FROM
(
    SELECT X as X1,
           wct.RANDNORM(2, 4) as X2
    FROM wct.RANDSNORMAL(1000)
) n;
SELECT *
FROM wct.VIF('SELECT x1,x2,x3 FROM #t1');

This produces the following result (your result will be different).

ColNumRsquaredVIFTolerance
10.0009334730473665731.000934345233460.999066526952633
20.994621815206927185.9363406939730.00537818479307317
30.994621666804285185.9312102115010.00537833319571512

As expected the R2 between x2 and x3 is very close to and the VIF for the 2 columns is very high.

In this example, we generate 3 independent random variables and a fourth variable which is calculated as x1 + 2*x2 - 3*x3. In other words, co-linearity arises not between any two columns but among all the columns.

SELECT x1,
       x2,
       x3,
       x1 + 2 * x2 - 3 * x3 as x4
INTO #t2
FROM
(
    SELECT X as X1,
           wct.RANDNORM(2, 4) as X2,
           wct.RANDNORM(5, 10) as X3
    FROM wct.RANDSNORMAL(1000)
) n;
SELECT *
FROM wct.VIF('SELECT x1,x2,x3,x4 FROM #t2');

This produces the following result (your result will be different).

ColNumRsquaredVIFTolerance
11NULL0
21NULL0
31NULL0
41NULL0

Eliminating any of the columns will eliminate the co-linearity.

SELECT *
FROM wct.VIF('SELECT x1,x2,x4 FROM #t2');

This produces the following result (your result will be different).

ColNumRsquaredVIFTolerance
18.47196531994499E-051.000084726831230.999915280346801
20.07460468784015491.080619262773260.925395312159845
30.07463070673403271.080649646878420.925369293265967

See Also

LINEST - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

LOGEST - Logarithmic regression

LOGIT - Logit regression

LOGITPRED - Calculate predicted values based on a logit regression

LOGITPROB - Calculate the probability that Y = 1 given a set of coefficients from a logistic regression

LOGITSUM - Logit regression using summary data