Logo

SQL Server WLS_q Function

Updated 2024-02-13 19:59:02.277000

Description

Use the SQL Server table-valued function WLS_q to calculate the Ordinary Least Squares (OLS) solution for a series of x- and y-values and an associated column of weights; sometimes referred to as Weighted Least-Squares (WLS). WLS_q returns the coefficients of regression, standard errors, Student's T and associated p-value for each of the independent variables. It also returns summary statistics about the regression including the standard error of y, R2, adjusted R2, the F-statistic and its p-value, the regression sum of squares, the residual sum of squares and the quartiles of the residuals. WLS_q is closely related to LINEST_q and the regression coefficients and their standard errors, T statistics and p-values can be calculated in LINEST_q, though LINEST_q will not produce the correct summary statistics. See Examples to find out more.

Syntax

SELECT * FROM [westclintech].[wct].[WLS_q] (
   <@Matrix_RangeQuery, nvarchar(max),>
  ,<@LConst, bit,>
  ,<@y_Column, nvarchar(4000),>
  ,<@w_Column, nvarchar(4000),>)

Arguments

@Matrix_RangeQuery

The SELECT statement, as a string, which, when executed, creates the resultant table of w-, x- and y-values which will be used in the calculation. Data returned from the @Matrix_RangeQuery must be of a type float or of a type that intrinsically converts to float.

@LConst

A bit value specifying the calculation of a y-intercept (@LConst = 1) or regression through the origin (@LConst = 0).

@y_Column

The column name or column number containing the dependent (y) variable.

@w_Column

The column name or column number containing the weight (w) variable.

Return Type

table

colNamecolDatatypecolDesc
stat_namenvarchar(4000)Identifies the statistic being returned: m – estimated coefficient se –standard error of the estimated coefficient tstat – Student’s T statistic pval – p-values of the tstat rsq – R2 rsqa – adjusted R2 rsqm – multiple R2 sey – standard error for the y estimate F – F statistic F_pval – p-value of F df – residual degrees of freedom ss_resid – weighted sum of squares mss – modified sum-of-squares w_resid_quart – weighted residual quartile
idxintUniquely identifies a return value for the stat_names where multiple values are returned: m, se, tstat, pval, and w_resid_quart. For m, se, tstat, and pval, idx identifies that subscript of the estimated coefficient. For example, the stat_name m with an idx of 0, specifies that the stat_val is for m0, or the y-intercept (which is b in y = mx + b). An idx of 1 for the same stat_name identifies m1. For w_resid_quart idx identifies the quartile being returned. For all other stat_names returning a single value, the idx will be NULL.
stat_valfloatThe return value.
col_namenvarchar(4000)The column name from the resultant table produced by the dynamic SQL for the m, se, tstat, and pval stat_names.

Remarks

If @y_Column is NULL then @y_Column is the left-most column in @ColumnNames .

If @w_Column is NULL the @w_Column is the right-most column in @ColumnNames.

I f @y_Column = @w_Column then no rows are returned.

I f @y_Column is numeric and less than 1 or greater than the number of columns in @ColumnNames then no rows are returned.

I f @w_Column is numeric and less than 1 or greater than the number of columns in @ColumnNames then no rows are returned.

W eight values must be greater than zero.

The number of rows in the regression must be greater than or equal to the number of columns.

Examples

This example explains the calculation of the regression coefficients and the summary statistics in WLS_q . Let's set up an example in SQL Server and take a closer look at those calculation. We will put the WLS_q results into a temp table, #wls.

SELECT *
INTO #t
FROM
(
    VALUES
        (103, 126.8, 62.3, 0.420928305104083),
        (127.2, 115.7, 98, 0.642347072957175),
        (118, 103.4, 92.2, 0.503672280805613),
        (121.8, 95.2, 74.2, 0.349193063289055),
        (106.1, 96, 78.9, 0.321793289794097),
        (124.6, 124.7, 96.1, 1.34249371606786),
        (116.9, 122.2, 94.1, 0.401800920329203),
        (118.6, 128.2, 79.2, 0.67140606947821),
        (125.2, 116.9, 79.6, 0.336969408869812),
        (123.3, 112.3, 87.8, 0.556210387357181)
) n (y, x1, x2, w);

Use the WLS_q function to calculate the coefficients of regression and the associated statistics.

SELECT *
INTO #wls
FROM wct.WLS_q('SELECT y,x1,x2,w FROM #t', 1, 'y', 'w');
SELECT *
FROM #wls;

This produces the following result.

stat_nameidxstat_valcol_name
m076.2158913852846Intercept
m10.0222877042102519x1
m20.47373007655067x2
se024.1127459683416Intercept
se10.171731881726625x1
se20.173375419548835x2
tstat03.16081343391378Intercept
tstat10.129781983322882x1
tstat22.73239469460799x2
pval00.0159102496734965Intercept
pval10.900389539811493x1
pval20.0292374089699169x2
rsqNULL0.520577705092377NULL
seyNULL4.29237110562461NULL
FNULL3.80045314366198NULL
F_pvalNULL0.0762981122063386NULL
dfNULL7NULL
mssNULL140.04251562907NULL
ss_residNULL128.971147958807NULL
rsqmNULL0.721510710310233NULL
rsqaNULL0.383599906547341NULL
w_resid_quart0-5.46438974663061NULL
w_resid_quart1-3.44808553096924NULL
w_resid_quart20.839382652539098NULL
w_resid_quart32.08237170553033NULL
w_resid_quart45.03271582569117NULL

There is nothing further that needs to be done in terms of getting the regression results. The rest of this example serves as an explanation of how the results are calculated. To calculate weighted least squares the dependent variable and all the independent variables are multiplied by the square root of the weights. We can achieve that result in LINEST_q by setting @LConst = 0 and by manually creating the intercept in the result table. The following SQL does that and puts the results into a temp table #ols.

SELECT stat_name,
       idx - 1 as idx,
       stat_val,
       col_name
INTO #ols
FROM wct.LINEST_q('SELECT y*SQRT(w) as y, sqrt(w) as Intercept, x1*SQRT(w) as x1,
          x2*SQRT(w) as x2 FROM #t', 1, 0)
WHERE idx > 0
      OR idx IS NULL;
SELECT *
FROM #ols;

This produces the following result.

stat_nameidxstat_valcol_name
m076.2158913852846Intercept
m10.0222877042102519x1
m20.47373007655067x2
se024.1127459683416Intercept
se10.171731881726625x1
se20.173375419548835x2
tstat03.16081343391378Intercept
tstat10.129781983322882x1
tstat22.73239469460799x2
pval00.0159102496734965Intercept
pval10.900389539811493x1
pval20.0292374089699169x2
rsqNULL0.998391679632623NULL
seyNULL4.29237110562461NULL
FNULL1448.45556461448NULL
dfNULL7NULL
ss_regNULL80060.9901152796NULL
ss_residNULL128.971147958807NULL
rsqmNULL0.999195516219235NULL
rsqaNULL0.997702399475176NULL

The following SQL produces a side-by-side comparison of the regression results.

SELECT w.stat_name,
       w.idx,
       w.stat_val as [wls],
       o.stat_val as [ols]
FROM #wls w
    FULL OUTER JOIN #ols o
        ON (
               w.stat_name = o.stat_name
               AND ISNULL(w.idx, 0) = ISNULL(o.idx, 0)
           )
           OR
           (
               w.stat_name = 'mss'
               AND o.stat_name = 'ss_reg'
           );

This produces the following result.

stat_nameidxwlsols
m076.215891385284676.2158913852846
m10.02228770421025190.0222877042102519
m20.473730076550670.47373007655067
se024.112745968341624.1127459683416
se10.1717318817266250.171731881726625
se20.1733754195488350.173375419548835
tstat03.160813433913783.16081343391378
tstat10.1297819833228820.129781983322882
tstat22.732394694607992.73239469460799
pval00.01591024967349650.0159102496734965
pval10.9003895398114930.900389539811493
pval20.02923740896991690.0292374089699169
rsqNULL0.5205777050923770.998391679632623
seyNULL4.292371105624614.29237110562461
FNULL3.800453143661981448.45556461448
dfNULL77
mssNULL140.0425156290780060.9901152796
ss_residNULL128.971147958807128.971147958807
rsqmNULL0.7215107103102330.999195516219235
rsqaNULL0.3835999065473410.997702399475176
F_pvalNULL0.0762981122063386NULL
w_resid_quart0-5.46438974663061NULL
w_resid_quart1-3.44808553096924NULL
w_resid_quart20.839382652539098NULL
w_resid_quart32.08237170553033NULL
w_resid_quart45.03271582569117NULL

Notice that LINEST_q does a pretty good job with the m, se, tstat, pval, sey, df, and ss_resid statistics. And LINEST_q does not calculate F_pval nor does it calculate the quartiles of the residuals. The real difference arises with the calculation of the sum of squares of regression (ss_reg in LINEST_q; mss in WLS ) which is used in the calculation of rsq, F, rsqm and rsqa. The ss_reg value in LINEST_q is calculated as sum of yhat (ŷ) squared, where yhat is simply the independent variables multiplied by the coefficients of regression. We can see that calculation in the following SQL.

SELECT SUM(SQUARE([m0] * n.Intercept + [m1] * x1 + [m2] * x2)) as ss_reg
FROM
(
    SELECT 'm' + cast(idx as char(1)) as coef,
           stat_val
    FROM #ols
    WHERE stat_name = 'm'
) d
PIVOT
(
    max(stat_val)
    FOR coef in (m0, m1, m2)
) pvt
    CROSS JOIN
    (
        SELECT sqrt(w) as Intercept,
               x1 * SQRT(w) as x1,
               x2 * SQRT(w) as x2
        FROM #t
    ) n;

This produces the following result.

ss_reg
80060.9901152796

For weighted least squares, this calculation needs to be adjusted for the weights. The following SQL shows how to make this adjustment.

DECLARE @wavg_y as float =
        (
            SELECT wct.WAVG(w, y)FROM #t
        );
DECLARE @mss as float =
        (
            SELECT SUM(POWER([m0] * n.Intercept + [m1] * x1 + [m2] * x2 - @wavg_y,
                      2) * w) as mss
            FROM
            (
                SELECT 'm' + cast(idx as char(1)) as coef,
                       stat_val
                FROM #ols
                WHERE stat_name = 'm'
            ) d
            PIVOT
            (
                max(stat_val)
                FOR coef in (m0, m1, m2)
            ) pvt
                CROSS JOIN
                (SELECT 1 as Intercept, x1, x2, w FROM #t) n
        );
SELECT @mss as mss;

This produces the following result.

mss
140.04251562907

Having gotten the regression modified sum-of-squares value, we can then use the same formulas as in ordinary least squares to calculate the remaining statistics.

DECLARE @ssresid as float =
        (
            SELECT stat_val from #wls WHERE stat_name = 'ss_resid'
        );
DECLARE @df as float =
        (
            SELECT stat_val from #wls WHERE stat_name = 'df'
        );
DECLARE @rsq as float = @mss / (@mss + @ssresid);
DECLARE @rsqm as float = SQRT(@rsq);
DECLARE @p as float =
        (
            SELECT COUNT(*) - 1 FROM #t
        );
DECLARE @rsqa as float = 1 - (1 - @rsq) * @p / @df;
DECLARE @Fobs as float = @mss / ((@p - @df) * @ssresid / @df);
DECLARE @Fdist as float = wct.F_DIST_RT(@Fobs, @p - @df, @df);
SELECT stat_name,
       NULL as idx,
       stat_val
FROM
(
    VALUES
        ('rsq', @rsq),
        ('mss', @mss),
        ('rsqm', @rsqm),
        ('rsqa', @rsqa),
        ('F', @Fobs),
        ('F_pval', @Fdist)
) x (stat_name, stat_val);

This produces the following result.

stat_nameidxstat_val
rsqNULL0.520577705092377
mssNULL140.04251562907
rsqmNULL0.721510710310233
rsqaNULL0.383599906547341
FNULL3.80045314366198
F_pvalNULL0.0762981122063386

Finally, the w_resid_quart values are simply the quartiles of the residuals.

SELECT 'w_resid_quart' as stat_name,
       x.k as idx,
       wct.QUARTILE(y - ([m0] * n.Intercept + [m1] * x1 + [m2] * x2), x.k) as 
                 stat_val
FROM
(
    SELECT 'm' + cast(idx as char(1)) as coef,
           stat_val
    FROM #ols
    WHERE stat_name = 'm'
) d
PIVOT
(
    max(stat_val)
    FOR coef in (m0, m1, m2)
) pvt
    CROSS JOIN
    (
        SELECT y * sqrt(w) as y,
               sqrt(w) as Intercept,
               x1 * sqrt(w) as x1,
               x2 * sqrt(w) as x2
        FROM #t
    ) n
    CROSS APPLY
(
    VALUES
        (0),
        (1),
        (2),
        (3),
        (4)
) x (k)
GROUP BY x.k;

This produces the following result.

stat_nameidxstat_val
w_resid_quart0-5.46438974663062
w_resid_quart1-3.44808553096925
w_resid_quart20.839382652539094
w_resid_quart32.08237170553032
w_resid_quart45.03271582569117

See Also

LINEST - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

LINEST_q - the Ordinary Least Squares (OLS) solution for a series of x-values and y-values

WLS - Calculate a weighted least squares (WLS) solution for a series of x- and y-values and their associated weights.