Logo

SQL Server KMEANS Function

Updated 2023-11-06 19:36:10.820000

Description

Use the table-valued function KMEANS to perform k-means clustering in N-dimensions. The clustering algorithm minimizes the Euclidean distance among the cluster means and the supplied points.

The default solution for the KMEANS function uses an Expectation / Maximization (EM) scheme to assign the data points to a cluster. As such, it always returns a solution, but can be computationally intensive. Alternatively, you can specify other clustering algorithms which are computationally more efficient but may not return a solution. See the algorithm section of this document.

The KMEANS function allows you to specify the starting point for the clusters or you can specify that the function should randomly choose the clusters from the supplied data. If the clusters are to be randomly generated, you also have the ability to specify the number of times to generate clusters and KMEANS will return the best solution.

Given the data points and the clusters, the KMEANS EM process consists of calculating the Euclidean distance between each row of data and each cluster and assigning the row to the cluster with the smallest (i.e. minimum) distance. It then calculates the mean of all the data points assigned to a cluster and recalculate the Euclidean distance and re-assigns data points to different clusters as required. This process repeats until there are no changes in the assignment process.

The function then returns a variety of information. First, it will return the cluster that each row of data has been assigned to. In order to link the cluster assignment back to the supplied data, you need to provide a unique identifier for each input row. This unique identifier and the cluster assignment are returned by the function.

Second, KMEANS will return the calculated cluster means. These means will be returned in 3rd normal form, consisting of zero-based row and column identifiers. The row identifier links to the cluster number returned to the by the assignment process. The column identifier corresponds to the number of dimensions in the data; in other words if the input data being clustered contained 2 columns the column identifiers would be 0 and 1. If it contained 3 columns, the column identifiers would be 0, 1 and 2.

Third, KMEANS returns a count of the number of data rows assigned to the each cluster.

Fourth, KMEANS returns the within sum-of squares for each cluster. This is simply the sum of the Euclidean distance for all the data points in a cluster.

Fifth, KMEANS returns the total within sum-of-squares which is the sum of the within sum-squares.

Sixth, KMEANS returns the total sum-of squares for all the columns in the data points.

Seventh, KMEANS return the between sum-of-squares as the difference between the total sum-of-squares and the total within sum-of-squares.

Finally, KMEANS returns the number of iterations actually used to return the solution.

Syntax

SELECT * FROM [westclintech].[wct].[KMEANS](
  <@dataQuery, nvarchar(max),>
 ,<@meansQuery, nvarchar(max),>
 ,<@numclusters, int,>
 ,<@nstart, int,>
 ,<@algorithm, nvarchar(4000),>
 ,<@itermax, int,>
 ,<@seed, int,>
 ,<@Formula, nvarchar(max),>)

Arguments

@dataQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of data points used in the calculation. The first column should contain a unique row identifier so that the data points can be linked to the cluster.

@meansQuery

the SELECT statement, as a string, which, when executed, creates the resultant table of clusters used in the calculation. If you want the function to randomly select clusters from the data points simply enter a NULL.

@numclusters

the number of clusters to assign the data points to. @numclusters is not used if @meansQuery is not NULL. @numclusters is of type int or of a type that implicitly converts to int.

@nstart

The number of times to generate clusters and calculate the k-means. @nstart is not used if @meansQuery is not NULL. @nstart is of type int or of a type that implicitly converts to int.

@algorithm

The algorithm used to calculate the cluster means. Use 'EM' for the method described in [1]. Use 'HartiganWong' for the Method described in [2]. Use 'Lloyd' for the method described in [3]. Use 'MacQueen' for the method described in [4].

@itermax

The maximum number of iterations to be used if @algorithm is not 'EM'.

@seed

an optional integer value to be passed in as a seed value to the random number generator. @seed is only used if @meansQuery is NULL to support the reproducibility of the result. @seed must be of type int or of a type that implicitly converts to int.

@Formula

a string identifying the distance calculation method. Only squared Euclidean distance is supported at this time.

Return Type

table

colNamecolDatatypecolDesc
labelnvarchar(4000)Identifies the value being returned: assign cluster assigned to the input row. means average of all the data points assigned to a cluster count number of input rows assigned to each cluster withinss within sum-of-square for each cluster tot_withinss total within sum-of-squares betweenss between sum-of-squares For more information on how these statistics are calculated see the Examples.
rowidsql_varianta row identifier. When label = 'assign' the row identifier is the unique identifier passed into the function as part of @dataQuery. When label = 'means', 'count', or withinss the rowid is the cluster number of the means. For all other returned values the rowid is zero.
colidsql_varianta column identifier. colid is only meaningful when label = 'means'.
valfloatthe value associated with the label, rowid, and colid.

Remarks

If the number of rows returned by @meansQuery is greater than the number of rows returned by @dataQuery, then no rows are returned by the function.

Examples

Example #1

In this example we create 100 x- and y- data points. This first 50 have their co-ordinates generated from a random normal distribution with mean 0 and standard deviation of 0.5. The next 50 from a random normal distribution with mean 1 and standard deviation of 0.5. We want to cluster into 2 groups. We store the data points in #a and the KMEANS results in #cl.

SELECT *
INTO #a
FROM
(
    VALUES
        (1, -0.288734000529778, 0.028008366637481),
        (2, 1.00124136514641, -0.0259909530904495),
        (3, 0.0333504354650915, -0.876618679571135),
        (4, 0.933425922353431, 0.0496637970439137),
        (5, -0.675451343015356, -0.285925028947782),
        (6, 0.0104917931771187, -0.487004791402046),
        (7, 0.624957285484608, -0.0899531155237702),
        (8, -0.357621093611398, 0.507471586371829),
        (9, -0.376344484108871, -0.996374244344291),
        (10, -0.469269351803447, -0.213639643602714),
        (11, -0.526256639669368, 0.0583186417913532),
        (12, -0.218579766590199, -0.446603785027476),
        (13, 0.165589586479491, 0.166951471249613),
        (14, -1.00710524896036, 0.205714960307865),
        (15, 0.105990216686146, -0.0165180796379959),
        (16, 0.618337523208285, -1.23294909688001),
        (17, 1.01878700912022, 1.28572907293332),
        (18, 0.650587996100294, -0.1026496287341),
        (19, 0.378387381897981, 0.325596640793835),
        (20, -0.863365199557165, 0.136883245518274),
        (21, -0.300753354003391, 0.512336617409175),
        (22, -0.176023228291308, 0.408829723187044),
        (23, 0.351761951378447, -0.104896585614254),
        (24, -0.0528356670018872, 0.189083886104254),
        (25, -0.629324314030086, -0.472704415561946),
        (26, 0.842217854047057, 0.428461505449659),
        (27, 0.455695645897981, -0.230519169442177),
        (28, 0.118715136245513, 1.20838667689411),
        (29, 0.609054305162906, -0.825524447844094),
        (30, -0.669387143617486, -0.231993621483007),
        (31, 0.3304101488949, 0.412689931379621),
        (32, -0.261456188156711, 0.255066273439332),
        (33, 0.341872760925356, -0.2947405192575),
        (34, -0.0304109773300371, -0.498390371103761),
        (35, 0.316480356515725, 0.0722378523553483),
        (36, 0.667758807529696, -0.00715370658345669),
        (37, 0.0036450451584494, -0.89514061863203),
        (38, 0.508779318476046, 0.0172755335669255),
        (39, -0.594217017573989, 0.0951151578462286),
        (40, -0.360802220218008, 0.0873631984909208),
        (41, 0.759608855694089, -0.527508521301341),
        (42, 0.188693986511965, 0.238066639151315),
        (43, -1.02611141021686, 0.68928506847962),
        (44, -0.682018726041188, 0.228118201589906),
        (45, -0.100390507794561, -0.567794235187172),
        (46, 0.432889702167244, -0.217822734845953),
        (47, -0.0509416278576112, 0.173051809776803),
        (48, 0.312093736010326, -0.323522815659134),
        (49, 0.479502688893913, -1.07882316750764),
        (50, 0.83552741443147, 0.442125410014104),
        (51, 0.585261194187741, 0.900055085949283),
        (52, 0.713219864615859, 0.677443021570104),
        (53, 1.75195030450227, 1.08266051062103),
        (54, 0.612927535197297, 1.21940935035723),
        (55, 1.42286577009465, 1.44165140997435),
        (56, 0.369658560590612, -0.0261684919509764),
        (57, 0.822728798463011, 0.181810365970488),
        (58, 0.963221990431761, 1.71520117059409),
        (59, 0.415674287787822, 1.52331442355577),
        (60, 0.682625867545545, 1.21764447445384),
        (61, 0.985579223535843, 1.35758920355553),
        (62, 1.33534798436697, 1.4585874588205),
        (63, 0.174726728287445, -0.33046139923284),
        (64, 0.825122880394366, 1.55513854831941),
        (65, 1.37820321927274, 0.757506201718838),
        (66, 0.730595420011204, 1.1153084153156),
        (67, 1.11364596083647, 0.852421099558211),
        (68, 1.24611428503222, 1.43598247703937),
        (69, 1.13391750766584, 0.825763775522194),
        (70, 1.32662883973569, 1.25925188294412),
        (71, 0.93864566951131, 0.804657510680511),
        (72, 0.79316174320394, 0.453606395449247),
        (73, -0.321574476014489, 1.60500525522013),
        (74, 0.953529490760264, 1.37045000563714),
        (75, 1.21514234817999, 1.86213111957823),
        (76, 1.26769942043408, 1.03257696628965),
        (77, 0.722360824343374, 1.56250137291409),
        (78, 1.88975145488757, 1.98770952700954),
        (79, 1.14321220981441, 0.859258942480517),
        (80, 1.06315792922944, 0.338524443622392),
        (81, 1.63613338973426, 0.880324216430942),
        (82, 0.64076688934037, 0.892979380007308),
        (83, 0.774830688037644, 1.07584025225466),
        (84, 2.19872624002488, 1.85615248865964),
        (85, 1.00556459361044, 0.836928053747326),
        (86, 1.81678421070415, 1.18650232790246),
        (87, 0.280746677542739, 0.886157967552234),
        (88, 0.904741598978549, 1.01022535426252),
        (89, 1.18921195181842, 1.15702883175313),
        (90, 1.1500192726029, 1.66410734801849),
        (91, 0.497181870244582, 1.06065918874539),
        (92, 1.00962963731269, 1.35642116001556),
        (93, 0.461289673444009, 1.38943001489177),
        (94, 1.35635166262214, 1.45738663542782),
        (95, 1.54238754493032, 0.712802723907808),
        (96, -0.112493848243706, 1.81344060709734),
        (97, 1.61784673115007, 0.809521630352994),
        (98, 0.379477751595754, 0.947107916175124),
        (99, 1.22738463449132, 1.70202513385709),
        (100, 1.3299513190713, 1.64704195307279)
) n (r, x, y);
--Run KMEANS and store in #cl
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   2,                  --@numclusters
                   1,                  --@nstart
                   NULL,               --@algorithm
                   NULL,               --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

We can run the following SQL to get the x- and y- data points and their assigned cluster.

--Get the data points and the assigned cluster.
SELECT #a.r,
       #a.x,
       #a.y,
       #cl.val as cluster
FROM #a
    INNER JOIN #cl
        ON #a.r = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY r;

This produces the following result.

rxycluster
1-0.28873400052977800.028008366637481001
21.0012413651464100-0.025990953090449501
30.0333504354650915-0.876618679571135001
40.93342592235343100.049663797043913701
5-0.6754513430153560-0.285925028947782001
60.0104917931771187-0.487004791402046001
70.6249572854846080-0.089953115523770201
8-0.35762109361139800.507471586371829001
9-0.3763444841088710-0.996374244344291001
10-0.4692693518034470-0.213639643602714001
11-0.52625663966936800.058318641791353201
12-0.2185797665901990-0.446603785027476001
130.16558958647949100.166951471249613001
14-1.00710524896036000.205714960307865001
150.1059902166861460-0.016518079637995901
160.6183375232082850-1.232949096880010001
171.01878700912022001.285729072933320000
180.6505879961002940-0.102649628734100001
190.37838738189798100.325596640793835001
20-0.86336519955716500.136883245518274001
21-0.30075335400339100.512336617409175001
22-0.17602322829130800.408829723187044001
230.3517619513784470-0.104896585614254001
24-0.05283566700188720.189083886104254001
25-0.6293243140300860-0.472704415561946001
260.84221785404705700.428461505449659000
270.4556956458979810-0.230519169442177001
280.11871513624551301.208386676894110000
290.6090543051629060-0.825524447844094001
30-0.6693871436174860-0.231993621483007001
310.33041014889490000.412689931379621001
32-0.26145618815671100.255066273439332001
330.3418727609253560-0.294740519257500001
34-0.0304109773300371-0.498390371103761001
350.31648035651572500.072237852355348301
360.6677588075296960-0.007153706583456691
370.0036450451584494-0.895140618632030001
380.50877931847604600.017275533566925501
39-0.59421701757398900.095115157846228601
40-0.36080222021800800.087363198490920801
410.7596088556940890-0.527508521301341001
420.18869398651196500.238066639151315001
43-1.02611141021686000.689285068479620001
44-0.68201872604118800.228118201589906001
45-0.1003905077945610-0.567794235187172001
460.4328897021672440-0.217822734845953001
47-0.05094162785761120.173051809776803001
480.3120937360103260-0.323522815659134001
490.4795026888939130-1.078823167507640001
500.83552741443147000.442125410014104000
510.58526119418774100.900055085949283000
520.71321986461585900.677443021570104000
531.75195030450227001.082660510621030000
540.61292753519729701.219409350357230000
551.42286577009465001.441651409974350000
560.3696585605906120-0.026168491950976401
570.82272879846301100.181810365970488001
580.96322199043176101.715201170594090000
590.41567428778782201.523314423555770000
600.68262586754554501.217644474453840000
610.98557922353584301.357589203555530000
621.33534798436697001.458587458820500000
630.1747267282874450-0.330461399232840001
640.82512288039436601.555138548319410000
651.37820321927274000.757506201718838000
660.73059542001120401.115308415315600000
671.11364596083647000.852421099558211000
681.24611428503222001.435982477039370000
691.13391750766584000.825763775522194000
701.32662883973569001.259251882944120000
710.93864566951131000.804657510680511000
720.79316174320394000.453606395449247000
73-0.32157447601448901.605005255220130000
740.95352949076026401.370450005637140000
751.21514234817999001.862131119578230000
761.26769942043408001.032576966289650000
770.72236082434337401.562501372914090000
781.88975145488757001.987709527009540000
791.14321220981441000.859258942480517000
801.06315792922944000.338524443622392000
811.63613338973426000.880324216430942000
820.64076688934037000.892979380007308000
830.77483068803764401.075840252254660000
842.19872624002488001.856152488659640000
851.00556459361044000.836928053747326000
861.81678421070415001.186502327902460000
870.28074667754273900.886157967552234000
880.90474159897854901.010225354262520000
891.18921195181842001.157028831753130000
901.15001927260290001.664107348018490000
910.49718187024458201.060659188745390000
921.00962963731269001.356421160015560000
930.46128967344400901.389430014891770000
941.35635166262214001.457386635427820000
951.54238754493032000.712802723907808000
96-0.11249384824370601.813440607097340000
971.61784673115007000.809521630352994000
980.37947775159575400.947107916175124000
991.22738463449132001.702025133857090000
1001.32995131907130001.647041953072790000

If we take these results and put them in a scatter plot in Excel it would look like this.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img1.jpg

The following SQL returns the cluster means.

--Get the cluster means
SELECT cast(rowid as int) as cluster,
       [0] as x,
       [1] as y
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY 1;

This produces the following result.

clusterxy
00.9923484049494361.17604188035636
10.0393943141342429-0.12996842652057

We can add the means to the scatter plot.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img2.jpg

This SQL returns the rest of the statistics from #cl.

SELECT label,
       cast(rowid as int) as rowid,
       val
FROM #cl
WHERE label <> 'means'
      AND label <> 'assign';

This produces the following result.

labelrowidval
count049
count151
withinss022.0992379163228
withinss120.4154902467054
tot_withinss042.5147281630282
totss0107.833200840574
betweenss065.3184726775461
iterations010

Example #2

To create some data for this example, we create 1,000 data points consisting of 2 columns which contain uniformly randomly generated integer values between 1 and 100.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');

If we were to create a scatter plot of the input data using EXCEL it would look something like this.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img3.jpg

In this example, we will pass 9 starting values into the function.

SELECT *
INTO #m
FROM
(
    VALUES
        (25, 25),
        (25, 50),
        (25, 75),
        (50, 25),
        (50, 50),
        (50, 75),
        (75, 25),
        (75, 50),
        (75, 75)
) n (x, y);

A scatter plot of the starting points would look like this.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img4.jpg

We will pass this data into the function and store the results in the #cl temp table.

SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   'SELECT * FROM #m', --@meansQuery
                   NULL,               --@numclusters
                   NULL,               --@nstart
                   NULL,               --@algorithm
                   NULL,               --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

We can extract the means (clusters) from the #cl table with the following SQL, which will also PIVOT the result into 2 columns. Your results will be different.

SELECT cast(rowid as int) as rowid,
       [0],
       [1]
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY rowid;

This produces the following result.

rowidxy
017.209090909090917.6727272727273
119.305785123966951.5785123966942
219.007194244604381.5539568345324
349.597701149425315.4137931034483
45446.6826923076923
555.168141592920483.6548672566372
681.653846153846214.1826923076923
785.946236559139847.9784946236559
886.620155038759782.1317829457364

We have 9 means (clusters), numbered 0 through 8 and each cluster contains 2 columns, 0 and 1. We are able to identify the entries in the #cl table because they have the label 'means'. The following scatter plot shows the original clusters (the blue circles) passed into the function and the values calculated by the function (the orange triangles).

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img5.jpg

The means are simply the column averages of the input data grouped by the cluster to which the input data have been assigned. These could be manually calculated (though there is no reason to as they are already returned by the function) with the following SQL.

SELECT #cl.val,
       AVG(cast(#a.x as float)) as x,
       AVG(cast(#a.y as float)) as y
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
GROUP BY #cl.val
ORDER BY 1;

The following SQL demonstrates how to link the original data points to their assigned cluster.

--Get the data points and the assigned cluster.
SELECT TOP 20
       #a.seq,
       #a.x,
       #a.y,
       #cl.val as cluster
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY seq;

This produces the following result. Your results will be different.

seqxycluster
15631
239494
349514
477196
594878
614100
79842
899116
966735
1054735
1158564
1242153
1373928
1427110
159240
1629942
1798667
186752
1923431
204150

In this scatter plot we have returned all thousand rows and created an Excel scatter plot showing the clustering for all the data points.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img6.jpg

KMEANS returns the count of clusters for each mean. You can get this from the temp table with the following SQL.

--Get the cluster count for each mean
SELECT cast(rowid as int) as cluster,
       val as [cluster count]
FROM #cl
WHERE label = 'count'
ORDER BY 1;

This produces the following result.

clustercluster count
092
1117
2118
3116
4124
5123
6110
7115
885

You could get the same result with the following SQL.

--Manually calculate the cluster count
SELECT val as cluster,
       COUNT(*) as [cluster count]
FROM #cl
WHERE label = 'assign'
GROUP BY val
ORDER BY 1;

You can get the within sum-of-squares from the temp table with the following SQL.

--Get the within sum-of-squares
SELECT cast(rowid as int) as cluster,
       val as withinss
FROM #cl
WHERE label = 'withinss';

This produces the following result.

clusterwithinss
021127.55
118819.7196261682
222099.3675213675
321264.6434782609
416794.6728971963
523644.7608695652
616077.0297029703
725104.6280991735
817831.6595744681

The within sum-squares is the sum of the Euclidean distance from each data point to its assigned mean. The following SQL replicates the within sum-of-squares values returned by the function.

--Manually calculate the within sum-of squares
;with mycte
as (SELECT rowid as rowid,
           [0] as x,
           [1] as y
    FROM
    (
        SELECT cast(rowid as int) as rowid,
               cast(colid as int) as colid,
               val
        FROM #cl
        WHERE label = 'means'
    ) d
    PIVOT
    (
        sum(val)
        for colid in ([0], [1])
    ) pvt)
SELECT #cl.val,
       SUM(POWER(#a.x - m.x, 2) + POWER(#a.y - m.y, 2)) as withinss
FROM #cl
    INNER JOIN #a
        ON #cl.rowid = #a.seq
    INNER JOIN mycte m
        ON #cl.val = m.rowid
WHERE #cl.label = 'assign'
GROUP BY #cl.val;

The total within sum-of-squares is the sum of the within sum-of-squares. This can be obtain directly from the temp table.

SELECT val as [total withinss]
FROM #cl
WHERE label = 'tot_withinss';

This produces the following result.

total withinss
179066.367176113

The following SQL demonstrates the logic behind the calculation of the total within sum-of-squares.

--Manually calculate the total within sum-of-squares
;with mycte
as (SELECT rowid as rowid,
           [0] as x,
           [1] as y
    FROM
    (
        SELECT cast(rowid as int) as rowid,
               cast(colid as int) as colid,
               val
        FROM #cl
        WHERE label = 'means'
    ) d
    PIVOT
    (
        sum(val)
        for colid in ([0], [1])
    ) pvt)
SELECT SUM(POWER(#a.x - m.x, 2) + POWER(#a.y - m.y, 2)) as [total withinss]
FROM #cl
    INNER JOIN #a
        ON #cl.rowid = #a.seq
    INNER JOIN mycte m
        ON #cl.val = m.rowid
WHERE #cl.label = 'assign';

The following SQL obtains the total sum-squares.

--Get the total sum-of-squares
SELECT val as totss
FROM #cl
WHERE label = 'totss';

This produces the following result.

totss
1629658.059

The following SQL demonstrates the logic behind the calculation of the total sum-of-squares.

--Manually calculate the total sum-of-squares
SELECT (sx + sy) * n as totss
FROM
(SELECT VARP(x) as sx, VARP(y) as sy, COUNT(*) as n FROM #a) b;

The between sum-of-squares is simply the difference between the total sum-of-squares and the total within sum-of squares. You can obtain between sum-of-squares from the temp table with the following SQL.

--Get the between sum-of-squares
SELECT val as betweenss
FROM #cl
WHERE label = 'betweenss';

This produces the following result.

betweenss
1492654.98818542

Example #3

Cluster starting points are randomly chosen from the supplied data points.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   9,                  --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

By passing 9 in as @numclusters we told the function to group the data into 9 clusters. By passing 10 in as @nstart we told the function to run the clustering algorithm 10 times, with 10 different starting clusters chosen from the supplied data points. Passing 'HW' in as @algorithm directs the function to use the technique described in [2]. The @itermax value of 25 indicates that the maximum number of iterations to be performed within in each of the 10 @nstart cases is 25; in other words if the algorithm does not find a solution within 25 iterations then no solution is returned for those starting points; though solutions might be returned for other starting points.

We can use the SQL from Example #2 to obtain the means.

--Get the means from #cl
SELECT cast(rowid as int) as rowid,
       [0] as x,
       [1] as y
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY rowid;

This produces the following result. Your results will be different.

rowidxy
050.108108108108184.0990990990991
114.844827586206917.1379310344828
281.228070175438618.0175438596491
349.34862385321114.0733944954128
487.916666666666749.8214285714286
516.078571428571448.5928571428571
616.924528301886884.0377358490566
752.560747663551444.2616822429907
882.637168141592978.6725663716814

As you can see, the means returned by this example are similar to the means returned by the first example, though they are actually assigned to different clusters. This doesn't really matter. The following scatter plot shows the means calculated by the first example against the means calculated by the second example.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img7.jpg

Again, using our SQL from Example #2, we can take all of our input data and their assigned means and put the results into an Excel scatter plot which will look something like this.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img8.jpg

Example #4

Increasing the number of clusters.

In this example, we are still going to randomly generate 1,000 data points but instead of grouping them into 9 clusters, we are going to group them into 16.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   16,                 --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

The EXCEL scatter plot for that analysis looks like this.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img9.jpg

This what an Excel scatter plot of the means would look like.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img10.jpg

Example #5

Increasing the number of data points to 100,000.

Whether you have 1,000 or 100,000 data points, the SQL is the same. In this example we put 100,000 rows into our table and then run KMEANS using the Hartigan & Wong algorithm.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) * wct.RAND() as y
INTO #a
FROM wct.SeriesFloat(1, 100, NULL, 100000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   16,                 --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

This is what the means look like.

rowidxy
034.592330046749949.3355719394316
184.112929193520972.6446708395648
288.016378863495224.3943521405385
339.538791308398426.2196413808555
414.299488466126125.8790052132487
510.31916710633836.98509642430662
618.173431924442974.9411059872743
760.148708067573848.1067823976652
890.4014352975286.54734212773786
964.17421196765426.174037542974
1011.499205034923547.7269035294419
1186.86931091212645.7490396549393
1252.616718520197874.4334388457509
1370.61398435537457.01043760864472
1450.32713959303887.41215021214116
1529.86254166833057.22745384678385

We run the following SQL to extract the input data and the assigned cluster and then drop the results into an Excel scatter plot.

SELECT #a.seq,
       #cl.val as cluster,
       #a.x,
       #a.y
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY seq;

This is what the scatter plot looks like.

http://westclintech.com/Portals/0/images/doc_stats_KMEANS_img11.jpg

Example #6

More than 2 columns per row

KMEANS is designed to support as many columns per row as you need (remember that the first column should uniquely identify the row). In this example we create 10,000 input rows which have 5 columns of data, with data randomly generated from the normal distribution with each column having a different mean and standard deviation. As with the previous examples, the results are in a temporary table.

SELECT k.Seq,
       k.SeriesValue as x,
       wct.RANDNORM(5, 10) as y,
       wct.RANDNORM(1, 0.5) as z,
       wct.RANDNORM(10, 2) as a,
       wct.RANDNORM(100, 15) as b
INTO #k
FROM wct.SeriesFloat(0, 1, NULL, 10000, 'N') k;
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #k', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

The results of the calculation should have assigned the input rows to 100 clusters, each cluster having 5 columns.

Example #7

Scaling the data

In this final example, we will use the same approach as in the previous example, except that we will execute KMEANS a second time having standardized the columnar data. We can then compare the column assignments as see how scaling has affected the clustering.

--Create the test data
SELECT k.Seq,
       k.SeriesValue as x,
       wct.RANDNORM(5, 10) as y,
       wct.RANDNORM(1, 0.5) as z,
       wct.RANDNORM(10, 2) as a,
       wct.RANDNORM(100, 15) as b
INTO #k
FROM wct.SeriesFloat(0, 1, NULL, 10000, 'N') k;
--Perform the KMEANS on the unscaled data
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #k', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );
--scale and center the data and save in #k_scaled
;
with mycte
as (SELECT data,
           AVG(val) as mean,
           STDEV(val) as sigma
    FROM
    (SELECT x, y, z, a, b FROM #k) p
        UNPIVOT
        (
            val
            for data in (x, y, z, a, b)
        ) as UNPVT
    GROUP BY data)
SELECT k.seq,
       wct.STANDARDIZE(x, mx.mean, mx.sigma) as x,
       wct.STANDARDIZE(y, my.mean, my.sigma) as y,
       wct.STANDARDIZE(z, mz.mean, mz.sigma) as z,
       wct.STANDARDIZE(a, ma.mean, ma.sigma) as a,
       wct.STANDARDIZE(b, mb.mean, mb.sigma) as b
INTO #k_scaled
FROM #k k
    CROSS JOIN mycte mx
    CROSS JOIN mycte my
    CROSS JOIN mycte mz
    CROSS JOIN mycte ma
    CROSS JOIN mycte mb
WHERE mx.data = 'x'
      AND my.data = 'y'
      AND mz.data = 'z'
      AND ma.data = 'a'
      AND mb.data = 'b';
--Perform KMEANS on the scaled data and
--store in #cl_scaled.
SELECT *
INTO #cl_scaled
FROM wct.KMEANS(   'SELECT * FROM #k', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );
--Count the number of rows belonging to
--the same cluster as the first row in #k
SELECT COUNT(*)
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      );

This produces the following result.

count
126
--List the rows belonging the cluster in unscaled data
--but belonging to a different cluster in the scaled data
SELECT rowid
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      )
EXCEPT
SELECT rowid
FROM #cl_scaled
WHERE label = 'assign'
      AND val =
      (
          SELECT val
          FROM #cl_scaled
          WHERE label = 'assign'
                AND cast(rowid as int) = 1
      );

This produces the following result. Your results will be different.

rowid
40
374
598
714
796
831
933
1009
1114
1124
1268
1291
1363
1414
1481
1620
1659
1743
1786
1893
1939
2023
2121
2618
2660
3036
3131
3379
3583
3592
3763
3984
4086
4183
4381
4472
4478
4512
4616
4738
4855
4976
5225
5364
5572
5748
5844
5932
5994
6097
6310
6369
6393
6490
6596
6917
7123
7131
7255
7400
7412
7558
7603
7740
7837
7855
7924
7926
8045
8219
8319
8401
8405
8533
8541
9164
9244
9263
9290
9299
9381
9508
9819
9953
--List the rows that are common to the cluster in both the
--unscaled and scaled data
SELECT rowid
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      )
INTERSECT
SELECT rowid
FROM #cl_scaled
WHERE label = 'assign'
      AND val =
      (
          SELECT val
          FROM #cl_scaled
          WHERE label = 'assign'
                AND cast(rowid as int) = 1
      );

This produces the following result. Your results will be different.

rowid
1
125
579
1163
1261
1392
1744
1946
2236
2505
2814
3306
3541
4021
4027
4121
4457
4479
4559
4662
5239
5332
5438
5763
6045
6315
6319
6364
6573
6628
7048
7055
7067
7205
7433
8741
8848
8883
9114
9117
9212
9890

References

[1]     Dempster, A.P., Laird, N.M., and Rubin, D.B. 1977 “Maximum Likelihood from Incomplete Data via the EM Algorithm,” Journal of the Royal Statistical Society , Series B, vol. 39, pp 1 – 38.

[2]     Hartigan, J.A. and Wong, M.A., 1979 “A K-means clustering algorithm”, Journal of the Royal Statistical Society , Series C (Applied Statistics), vol. 28, pp 100-108.

[3]     Lloyd, S.P., 1982 “Least Squares Quantization in PCM”, IEEE Transactions on Information Theory , vol. IT-28, no. 2, pp 129 – 137.

[4]     MacQueen J., 1967 “Some Methods for Classification and Analysis of Multivariate Observations”, Proceedings of the 5 th Berkeley Symposium on Statistics and Probability , vol. 1, pp 281 – 297.