SQL Server KMEANS Function

Updated 2023-11-06 19:36:10.820000

Description

Use the table-valued function KMEANS to perform k-means clustering in N dimensions. The clustering algorithm minimizes the sum of the squared Euclidean distances between the supplied points and the mean of the cluster to which each point is assigned.

The default solution for the KMEANS function uses an Expectation / Maximization (EM) scheme to assign the data points to a cluster. As such, it always returns a solution, but can be computationally intensive. Alternatively, you can specify other clustering algorithms that are computationally more efficient but may not return a solution. See the @algorithm argument below.

The KMEANS function allows you to specify the starting point for the clusters or you can specify that the function should randomly choose the clusters from the supplied data. If the clusters are to be randomly generated, you also have the ability to specify the number of times to generate clusters and KMEANS will return the best solution.

Given the data points and the clusters, the KMEANS EM process calculates the Euclidean distance between each row of data and each cluster mean and assigns the row to the cluster with the smallest distance. It then calculates the mean of all the data points assigned to each cluster, recalculates the distances, and re-assigns data points to different clusters as required. This process repeats until an iteration produces no change in the assignments.
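
The assign-then-average cycle described above can be sketched in plain T-SQL. This is only an illustration of a single pass on made-up data, not the function's internal implementation; the table variables @pts and @means and their contents are hypothetical.

```sql
--Hypothetical toy data: 4 points and 2 starting means
DECLARE @pts TABLE (id int PRIMARY KEY, x float, y float);
DECLARE @means TABLE (cluster int PRIMARY KEY, x float, y float);

INSERT INTO @pts VALUES (1, 0.0, 0.0), (2, 0.0, 1.0), (3, 4.0, 4.0), (4, 5.0, 4.0);
INSERT INTO @means VALUES (0, 0.0, 0.0), (1, 5.0, 5.0);

--Assignment step: rank the means for each point by squared Euclidean distance
WITH d
as (SELECT p.id,
           p.x,
           p.y,
           m.cluster,
           ROW_NUMBER() OVER (PARTITION BY p.id
                              ORDER BY SQUARE(p.x - m.x) + SQUARE(p.y - m.y), m.cluster
                             ) as rn
    FROM @pts p
        CROSS JOIN @means m)
--Update step: the new mean of each cluster is the average of its assigned points
SELECT cluster,
       AVG(x) as x,
       AVG(y) as y,
       COUNT(*) as n
FROM d
WHERE rn = 1
GROUP BY cluster
ORDER BY cluster;
```

For this toy data, points 1 and 2 are closest to cluster 0 and points 3 and 4 are closest to cluster 1, so the updated means are (0, 0.5) and (4.5, 4). A full run would repeat both steps with the updated means until the assignments stop changing.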

The function then returns a variety of information. First, it will return the cluster that each row of data has been assigned to. In order to link the cluster assignment back to the supplied data, you need to provide a unique identifier for each input row. This unique identifier and the cluster assignment are returned by the function.

Second, KMEANS will return the calculated cluster means. These means will be returned in third normal form, consisting of zero-based row and column identifiers. The row identifier links to the cluster number returned by the assignment process. The column identifier corresponds to the position of the column in the data; in other words, if the input data being clustered contained 2 columns the column identifiers would be 0 and 1. If it contained 3 columns, the column identifiers would be 0, 1 and 2.

Third, KMEANS returns a count of the number of data rows assigned to each cluster.

Fourth, KMEANS returns the within sum-of-squares for each cluster. This is simply the sum of the squared Euclidean distances between each data point in a cluster and the cluster mean.

Fifth, KMEANS returns the total within sum-of-squares, which is the sum of the within sum-of-squares over all clusters.

Sixth, KMEANS returns the total sum-of-squares for all the columns in the data points.

Seventh, KMEANS returns the between sum-of-squares as the difference between the total sum-of-squares and the total within sum-of-squares.

Finally, KMEANS returns the number of iterations actually used to return the solution.

Syntax

SELECT * FROM [westclintech].[wct].[KMEANS](
  <@dataQuery, nvarchar(max),>
 ,<@meansQuery, nvarchar(max),>
 ,<@numclusters, int,>
 ,<@nstart, int,>
 ,<@algorithm, nvarchar(4000),>
 ,<@itermax, int,>
 ,<@seed, int,>
 ,<@Formula, nvarchar(max),>)

Arguments

@dataQuery

The SELECT statement, as a string, which, when executed, creates the resultant table of data points used in the calculation. The first column should contain a unique row identifier so that the data points can be linked to the cluster.

@meansQuery

The SELECT statement, as a string, which, when executed, creates the resultant table of clusters used in the calculation. If you want the function to randomly select clusters from the data points, pass NULL.

@numclusters

The number of clusters to assign the data points to. @numclusters is not used if @meansQuery is not NULL. @numclusters is of type int or of a type that implicitly converts to int.

@nstart

The number of times to generate clusters and calculate the k-means. @nstart is not used if @meansQuery is not NULL. @nstart is of type int or of a type that implicitly converts to int.

@algorithm

The algorithm used to calculate the cluster means. Use 'EM' for the method described in [1]. Use 'HartiganWong' for the method described in [2]. Use 'Lloyd' for the method described in [3]. Use 'MacQueen' for the method described in [4].

@itermax

The maximum number of iterations to be used if @algorithm is not 'EM'.

@seed

An optional integer value to be passed in as a seed value to the random number generator, to support the reproducibility of the result. @seed is only used if @meansQuery is NULL. @seed must be of type int or of a type that implicitly converts to int.

@Formula

A string identifying the distance calculation method. Only squared Euclidean distance is supported at this time.

Return Type

table

colName	colDatatype	colDesc
label	nvarchar(4000)	Identifies the value being returned: 'assign' = cluster assigned to the input row; 'means' = average of all the data points assigned to a cluster; 'count' = number of input rows assigned to each cluster; 'withinss' = within sum-of-squares for each cluster; 'tot_withinss' = total within sum-of-squares; 'totss' = total sum-of-squares; 'betweenss' = between sum-of-squares; 'iterations' = number of iterations used. For more information on how these statistics are calculated see the Examples.
rowid	sql_variant	a row identifier. When label = 'assign' the row identifier is the unique identifier passed into the function as part of @dataQuery. When label = 'means', 'count', or 'withinss' the rowid is the cluster number. For all other returned values the rowid is zero.
colid	sql_variant	a column identifier. colid is only meaningful when label = 'means'.
val	float	the value associated with the label, rowid, and colid.
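
As a quick orientation, a query along the following lines shows how many rows come back for each label. It is a sketch that assumes the function output has been saved to a temp table named #cl, as the Examples below do, and it uses only the columns documented above.

```sql
--Count the rows returned for each label
--(assumes the KMEANS output was stored in #cl)
SELECT label,
       COUNT(*) as [rows returned]
FROM #cl
GROUP BY label
ORDER BY label;
```

Based on the descriptions above, we would expect one 'assign' row per input row, one 'means' row per cluster per data column, one 'count' row and one 'withinss' row per cluster, and a single row for each of the remaining labels.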

Remarks

If the number of rows returned by @meansQuery is greater than the number of rows returned by @dataQuery, then no rows are returned by the function.

Examples

Example #1

In this example we create 100 data points with x- and y-coordinates. The first 50 have their coordinates generated from a random normal distribution with mean 0 and standard deviation of 0.5. The next 50 are generated from a random normal distribution with mean 1 and standard deviation of 0.5. We want to cluster the points into 2 groups. We store the data points in #a and the KMEANS results in #cl.

SELECT *
INTO #a
FROM
(
    VALUES
        (1, -0.288734000529778, 0.028008366637481),
        (2, 1.00124136514641, -0.0259909530904495),
        (3, 0.0333504354650915, -0.876618679571135),
        (4, 0.933425922353431, 0.0496637970439137),
        (5, -0.675451343015356, -0.285925028947782),
        (6, 0.0104917931771187, -0.487004791402046),
        (7, 0.624957285484608, -0.0899531155237702),
        (8, -0.357621093611398, 0.507471586371829),
        (9, -0.376344484108871, -0.996374244344291),
        (10, -0.469269351803447, -0.213639643602714),
        (11, -0.526256639669368, 0.0583186417913532),
        (12, -0.218579766590199, -0.446603785027476),
        (13, 0.165589586479491, 0.166951471249613),
        (14, -1.00710524896036, 0.205714960307865),
        (15, 0.105990216686146, -0.0165180796379959),
        (16, 0.618337523208285, -1.23294909688001),
        (17, 1.01878700912022, 1.28572907293332),
        (18, 0.650587996100294, -0.1026496287341),
        (19, 0.378387381897981, 0.325596640793835),
        (20, -0.863365199557165, 0.136883245518274),
        (21, -0.300753354003391, 0.512336617409175),
        (22, -0.176023228291308, 0.408829723187044),
        (23, 0.351761951378447, -0.104896585614254),
        (24, -0.0528356670018872, 0.189083886104254),
        (25, -0.629324314030086, -0.472704415561946),
        (26, 0.842217854047057, 0.428461505449659),
        (27, 0.455695645897981, -0.230519169442177),
        (28, 0.118715136245513, 1.20838667689411),
        (29, 0.609054305162906, -0.825524447844094),
        (30, -0.669387143617486, -0.231993621483007),
        (31, 0.3304101488949, 0.412689931379621),
        (32, -0.261456188156711, 0.255066273439332),
        (33, 0.341872760925356, -0.2947405192575),
        (34, -0.0304109773300371, -0.498390371103761),
        (35, 0.316480356515725, 0.0722378523553483),
        (36, 0.667758807529696, -0.00715370658345669),
        (37, 0.0036450451584494, -0.89514061863203),
        (38, 0.508779318476046, 0.0172755335669255),
        (39, -0.594217017573989, 0.0951151578462286),
        (40, -0.360802220218008, 0.0873631984909208),
        (41, 0.759608855694089, -0.527508521301341),
        (42, 0.188693986511965, 0.238066639151315),
        (43, -1.02611141021686, 0.68928506847962),
        (44, -0.682018726041188, 0.228118201589906),
        (45, -0.100390507794561, -0.567794235187172),
        (46, 0.432889702167244, -0.217822734845953),
        (47, -0.0509416278576112, 0.173051809776803),
        (48, 0.312093736010326, -0.323522815659134),
        (49, 0.479502688893913, -1.07882316750764),
        (50, 0.83552741443147, 0.442125410014104),
        (51, 0.585261194187741, 0.900055085949283),
        (52, 0.713219864615859, 0.677443021570104),
        (53, 1.75195030450227, 1.08266051062103),
        (54, 0.612927535197297, 1.21940935035723),
        (55, 1.42286577009465, 1.44165140997435),
        (56, 0.369658560590612, -0.0261684919509764),
        (57, 0.822728798463011, 0.181810365970488),
        (58, 0.963221990431761, 1.71520117059409),
        (59, 0.415674287787822, 1.52331442355577),
        (60, 0.682625867545545, 1.21764447445384),
        (61, 0.985579223535843, 1.35758920355553),
        (62, 1.33534798436697, 1.4585874588205),
        (63, 0.174726728287445, -0.33046139923284),
        (64, 0.825122880394366, 1.55513854831941),
        (65, 1.37820321927274, 0.757506201718838),
        (66, 0.730595420011204, 1.1153084153156),
        (67, 1.11364596083647, 0.852421099558211),
        (68, 1.24611428503222, 1.43598247703937),
        (69, 1.13391750766584, 0.825763775522194),
        (70, 1.32662883973569, 1.25925188294412),
        (71, 0.93864566951131, 0.804657510680511),
        (72, 0.79316174320394, 0.453606395449247),
        (73, -0.321574476014489, 1.60500525522013),
        (74, 0.953529490760264, 1.37045000563714),
        (75, 1.21514234817999, 1.86213111957823),
        (76, 1.26769942043408, 1.03257696628965),
        (77, 0.722360824343374, 1.56250137291409),
        (78, 1.88975145488757, 1.98770952700954),
        (79, 1.14321220981441, 0.859258942480517),
        (80, 1.06315792922944, 0.338524443622392),
        (81, 1.63613338973426, 0.880324216430942),
        (82, 0.64076688934037, 0.892979380007308),
        (83, 0.774830688037644, 1.07584025225466),
        (84, 2.19872624002488, 1.85615248865964),
        (85, 1.00556459361044, 0.836928053747326),
        (86, 1.81678421070415, 1.18650232790246),
        (87, 0.280746677542739, 0.886157967552234),
        (88, 0.904741598978549, 1.01022535426252),
        (89, 1.18921195181842, 1.15702883175313),
        (90, 1.1500192726029, 1.66410734801849),
        (91, 0.497181870244582, 1.06065918874539),
        (92, 1.00962963731269, 1.35642116001556),
        (93, 0.461289673444009, 1.38943001489177),
        (94, 1.35635166262214, 1.45738663542782),
        (95, 1.54238754493032, 0.712802723907808),
        (96, -0.112493848243706, 1.81344060709734),
        (97, 1.61784673115007, 0.809521630352994),
        (98, 0.379477751595754, 0.947107916175124),
        (99, 1.22738463449132, 1.70202513385709),
        (100, 1.3299513190713, 1.64704195307279)
) n (r, x, y);
--Run KMEANS and store in #cl
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   2,                  --@numclusters
                   1,                  --@nstart
                   NULL,               --@algorithm
                   NULL,               --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

We can run the following SQL to get the x- and y- data points and their assigned cluster.

--Get the data points and the assigned cluster.
SELECT #a.r,
       #a.x,
       #a.y,
       #cl.val as cluster
FROM #a
    INNER JOIN #cl
        ON #a.r = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY r;

This produces the following result.

r	x	y	cluster
1	-0.2887340005297780	0.02800836663748100	1
2	1.0012413651464100	-0.02599095309044950	1
3	0.0333504354650915	-0.87661867957113500	1
4	0.9334259223534310	0.04966379704391370	1
5	-0.6754513430153560	-0.28592502894778200	1
6	0.0104917931771187	-0.48700479140204600	1
7	0.6249572854846080	-0.08995311552377020	1
8	-0.3576210936113980	0.50747158637182900	1
9	-0.3763444841088710	-0.99637424434429100	1
10	-0.4692693518034470	-0.21363964360271400	1
11	-0.5262566396693680	0.05831864179135320	1
12	-0.2185797665901990	-0.44660378502747600	1
13	0.1655895864794910	0.16695147124961300	1
14	-1.0071052489603600	0.20571496030786500	1
15	0.1059902166861460	-0.01651807963799590	1
16	0.6183375232082850	-1.23294909688001000	1
17	1.0187870091202200	1.28572907293332000	0
18	0.6505879961002940	-0.10264962873410000	1
19	0.3783873818979810	0.32559664079383500	1
20	-0.8633651995571650	0.13688324551827400	1
21	-0.3007533540033910	0.51233661740917500	1
22	-0.1760232282913080	0.40882972318704400	1
23	0.3517619513784470	-0.10489658561425400	1
24	-0.0528356670018872	0.18908388610425400	1
25	-0.6293243140300860	-0.47270441556194600	1
26	0.8422178540470570	0.42846150544965900	0
27	0.4556956458979810	-0.23051916944217700	1
28	0.1187151362455130	1.20838667689411000	0
29	0.6090543051629060	-0.82552444784409400	1
30	-0.6693871436174860	-0.23199362148300700	1
31	0.3304101488949000	0.41268993137962100	1
32	-0.2614561881567110	0.25506627343933200	1
33	0.3418727609253560	-0.29474051925750000	1
34	-0.0304109773300371	-0.49839037110376100	1
35	0.3164803565157250	0.07223785235534830	1
36	0.6677588075296960	-0.00715370658345669	1
37	0.0036450451584494	-0.89514061863203000	1
38	0.5087793184760460	0.01727553356692550	1
39	-0.5942170175739890	0.09511515784622860	1
40	-0.3608022202180080	0.08736319849092080	1
41	0.7596088556940890	-0.52750852130134100	1
42	0.1886939865119650	0.23806663915131500	1
43	-1.0261114102168600	0.68928506847962000	1
44	-0.6820187260411880	0.22811820158990600	1
45	-0.1003905077945610	-0.56779423518717200	1
46	0.4328897021672440	-0.21782273484595300	1
47	-0.0509416278576112	0.17305180977680300	1
48	0.3120937360103260	-0.32352281565913400	1
49	0.4795026888939130	-1.07882316750764000	1
50	0.8355274144314700	0.44212541001410400	0
51	0.5852611941877410	0.90005508594928300	0
52	0.7132198646158590	0.67744302157010400	0
53	1.7519503045022700	1.08266051062103000	0
54	0.6129275351972970	1.21940935035723000	0
55	1.4228657700946500	1.44165140997435000	0
56	0.3696585605906120	-0.02616849195097640	1
57	0.8227287984630110	0.18181036597048800	1
58	0.9632219904317610	1.71520117059409000	0
59	0.4156742877878220	1.52331442355577000	0
60	0.6826258675455450	1.21764447445384000	0
61	0.9855792235358430	1.35758920355553000	0
62	1.3353479843669700	1.45858745882050000	0
63	0.1747267282874450	-0.33046139923284000	1
64	0.8251228803943660	1.55513854831941000	0
65	1.3782032192727400	0.75750620171883800	0
66	0.7305954200112040	1.11530841531560000	0
67	1.1136459608364700	0.85242109955821100	0
68	1.2461142850322200	1.43598247703937000	0
69	1.1339175076658400	0.82576377552219400	0
70	1.3266288397356900	1.25925188294412000	0
71	0.9386456695113100	0.80465751068051100	0
72	0.7931617432039400	0.45360639544924700	0
73	-0.3215744760144890	1.60500525522013000	0
74	0.9535294907602640	1.37045000563714000	0
75	1.2151423481799900	1.86213111957823000	0
76	1.2676994204340800	1.03257696628965000	0
77	0.7223608243433740	1.56250137291409000	0
78	1.8897514548875700	1.98770952700954000	0
79	1.1432122098144100	0.85925894248051700	0
80	1.0631579292294400	0.33852444362239200	0
81	1.6361333897342600	0.88032421643094200	0
82	0.6407668893403700	0.89297938000730800	0
83	0.7748306880376440	1.07584025225466000	0
84	2.1987262400248800	1.85615248865964000	0
85	1.0055645936104400	0.83692805374732600	0
86	1.8167842107041500	1.18650232790246000	0
87	0.2807466775427390	0.88615796755223400	0
88	0.9047415989785490	1.01022535426252000	0
89	1.1892119518184200	1.15702883175313000	0
90	1.1500192726029000	1.66410734801849000	0
91	0.4971818702445820	1.06065918874539000	0
92	1.0096296373126900	1.35642116001556000	0
93	0.4612896734440090	1.38943001489177000	0
94	1.3563516626221400	1.45738663542782000	0
95	1.5423875449303200	0.71280272390780800	0
96	-0.1124938482437060	1.81344060709734000	0
97	1.6178467311500700	0.80952163035299400	0
98	0.3794777515957540	0.94710791617512400	0
99	1.2273846344913200	1.70202513385709000	0
100	1.3299513190713000	1.64704195307279000	0

If we take these results and put them in a scatter plot in Excel it would look like this.

The following SQL returns the cluster means.

--Get the cluster means
SELECT cast(rowid as int) as cluster,
       [0] as x,
       [1] as y
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY 1;

This produces the following result.

cluster	x	y
0	0.992348404949436	1.17604188035636
1	0.0393943141342429	-0.12996842652057

We can add the means to the scatter plot.

This SQL returns the rest of the statistics from #cl.

SELECT label,
       cast(rowid as int) as rowid,
       val
FROM #cl
WHERE label <> 'means'
      AND label <> 'assign';

This produces the following result.

label	rowid	val
count	0	49
count	1	51
withinss	0	22.0992379163228
withinss	1	20.4154902467054
tot_withinss	0	42.5147281630282
totss	0	107.833200840574
betweenss	0	65.3184726775461
iterations	0	10
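
The relationships among these statistics can be checked directly from #cl. The query below is a sketch that relies only on the labels shown in the result above; the column aliases are arbitrary.

```sql
--Compare the reported statistics with values derived from other rows in #cl
SELECT SUM(CASE WHEN label = 'withinss' THEN val END) as [sum of withinss],
       SUM(CASE WHEN label = 'tot_withinss' THEN val END) as tot_withinss,
       SUM(CASE WHEN label = 'totss' THEN val END)
       - SUM(CASE WHEN label = 'tot_withinss' THEN val END) as [totss - tot_withinss],
       SUM(CASE WHEN label = 'betweenss' THEN val END) as betweenss
FROM #cl
WHERE label IN ( 'withinss', 'tot_withinss', 'totss', 'betweenss' );
```

For the values shown above, the first two columns are both 42.5147281630282 (22.0992379163228 + 20.4154902467054), and the last two columns agree with each other up to floating-point rounding in the final displayed digits (about 65.31847267754).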

Example #2

To create some data for this example, we create 1,000 data points consisting of 2 columns which contain uniformly randomly generated integer values between 1 and 100.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');

If we were to create a scatter plot of the input data using EXCEL it would look something like this.

In this example, we will pass 9 starting values into the function.

SELECT *
INTO #m
FROM
(
    VALUES
        (25, 25),
        (25, 50),
        (25, 75),
        (50, 25),
        (50, 50),
        (50, 75),
        (75, 25),
        (75, 50),
        (75, 75)
) n (x, y);

A scatter plot of the starting points would look like this.

We will pass this data into the function and store the results in the #cl temp table.

SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   'SELECT * FROM #m', --@meansQuery
                   NULL,               --@numclusters
                   NULL,               --@nstart
                   NULL,               --@algorithm
                   NULL,               --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

We can extract the means (clusters) from the #cl table with the following SQL, which will also PIVOT the result into 2 columns. Your results will be different.

SELECT cast(rowid as int) as rowid,
       [0] as x,
       [1] as y
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY rowid;

This produces the following result.

rowid	x	y
0	17.2090909090909	17.6727272727273
1	19.3057851239669	51.5785123966942
2	19.0071942446043	81.5539568345324
3	49.5977011494253	15.4137931034483
4	54	46.6826923076923
5	55.1681415929204	83.6548672566372
6	81.6538461538462	14.1826923076923
7	85.9462365591398	47.9784946236559
8	86.6201550387597	82.1317829457364

We have 9 means (clusters), numbered 0 through 8 and each cluster contains 2 columns, 0 and 1. We are able to identify the entries in the #cl table because they have the label 'means'. The following scatter plot shows the original clusters (the blue circles) passed into the function and the values calculated by the function (the orange triangles).

The means are simply the column averages of the input data grouped by the cluster to which the input data have been assigned. These could be manually calculated (though there is no reason to as they are already returned by the function) with the following SQL.

SELECT #cl.val,
       AVG(cast(#a.x as float)) as x,
       AVG(cast(#a.y as float)) as y
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
GROUP BY #cl.val
ORDER BY 1;

The following SQL demonstrates how to link the original data points to their assigned cluster.

--Get the data points and the assigned cluster.
SELECT TOP 20
       #a.seq,
       #a.x,
       #a.y,
       #cl.val as cluster
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY seq;

This produces the following result. Your results will be different.

seq	x	y	cluster
1	5	63	1
2	39	49	4
3	49	51	4
4	77	19	6
5	94	87	8
6	14	10	0
7	9	84	2
8	99	11	6
9	66	73	5
10	54	73	5
11	58	56	4
12	42	15	3
13	73	92	8
14	27	11	0
15	9	24	0
16	29	94	2
17	98	66	7
18	6	75	2
19	23	43	1
20	4	15	0

In this scatter plot we have returned all thousand rows and created an Excel scatter plot showing the clustering for all the data points.

KMEANS returns the count of data points assigned to each cluster. You can get this from the temp table with the following SQL.

--Get the cluster count for each mean
SELECT cast(rowid as int) as cluster,
       val as [cluster count]
FROM #cl
WHERE label = 'count'
ORDER BY 1;

This produces the following result.

cluster	cluster count
0	92
1	117
2	118
3	116
4	124
5	123
6	110
7	115
8	85

You could get the same result with the following SQL.

--Manually calculate the cluster count
SELECT val as cluster,
       COUNT(*) as [cluster count]
FROM #cl
WHERE label = 'assign'
GROUP BY val
ORDER BY 1;

You can get the within sum-of-squares from the temp table with the following SQL.

--Get the within sum-of-squares
SELECT cast(rowid as int) as cluster,
       val as withinss
FROM #cl
WHERE label = 'withinss';

This produces the following result.

cluster	withinss
0	21127.55
1	18819.7196261682
2	22099.3675213675
3	21264.6434782609
4	16794.6728971963
5	23644.7608695652
6	16077.0297029703
7	25104.6280991735
8	17831.6595744681

The within sum-of-squares is the sum of the squared Euclidean distances from each data point to its assigned mean. The following SQL replicates the within sum-of-squares values returned by the function.

--Manually calculate the within sum-of-squares
;with mycte
as (SELECT rowid as rowid,
           [0] as x,
           [1] as y
    FROM
    (
        SELECT cast(rowid as int) as rowid,
               cast(colid as int) as colid,
               val
        FROM #cl
        WHERE label = 'means'
    ) d
    PIVOT
    (
        sum(val)
        for colid in ([0], [1])
    ) pvt)
SELECT #cl.val,
       SUM(POWER(#a.x - m.x, 2) + POWER(#a.y - m.y, 2)) as withinss
FROM #cl
    INNER JOIN #a
        ON #cl.rowid = #a.seq
    INNER JOIN mycte m
        ON #cl.val = m.rowid
WHERE #cl.label = 'assign'
GROUP BY #cl.val;

The total within sum-of-squares is the sum of the within sum-of-squares. This can be obtained directly from the temp table.

SELECT val as [total withinss]
FROM #cl
WHERE label = 'tot_withinss';

This produces the following result.

total withinss
179066.367176113

The following SQL demonstrates the logic behind the calculation of the total within sum-of-squares.

--Manually calculate the total within sum-of-squares
;with mycte
as (SELECT rowid as rowid,
           [0] as x,
           [1] as y
    FROM
    (
        SELECT cast(rowid as int) as rowid,
               cast(colid as int) as colid,
               val
        FROM #cl
        WHERE label = 'means'
    ) d
    PIVOT
    (
        sum(val)
        for colid in ([0], [1])
    ) pvt)
SELECT SUM(POWER(#a.x - m.x, 2) + POWER(#a.y - m.y, 2)) as [total withinss]
FROM #cl
    INNER JOIN #a
        ON #cl.rowid = #a.seq
    INNER JOIN mycte m
        ON #cl.val = m.rowid
WHERE #cl.label = 'assign';

The following SQL obtains the total sum-of-squares.

--Get the total sum-of-squares
SELECT val as totss
FROM #cl
WHERE label = 'totss';

This produces the following result.

totss
1629658.059

The following SQL demonstrates the logic behind the calculation of the total sum-of-squares.

--Manually calculate the total sum-of-squares
SELECT (sx + sy) * n as totss
FROM
(SELECT VARP(x) as sx, VARP(y) as sy, COUNT(*) as n FROM #a) b;

The between sum-of-squares is simply the difference between the total sum-of-squares and the total within sum-of-squares. You can obtain the between sum-of-squares from the temp table with the following SQL.

--Get the between sum-of-squares
SELECT val as betweenss
FROM #cl
WHERE label = 'betweenss';

This produces the following result.

betweenss
1492654.98818542
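
The between sum-of-squares can also be built up directly rather than by subtraction: for each cluster, take the squared distance from the cluster mean to the overall mean of the data and weight it by the number of points in the cluster. The following sketch assumes, as described above, that each cluster mean is the average of the points assigned to it; the CTE names are arbitrary.

```sql
--Manually calculate the between sum-of-squares
;with assigned
as (SELECT cast(#cl.val as int) as cluster,
           cast(#a.x as float) as x,
           cast(#a.y as float) as y
    FROM #a
        INNER JOIN #cl
            ON #a.seq = #cl.rowid
    WHERE #cl.label = 'assign'),
      grand
as (SELECT AVG(x) as gx,
           AVG(y) as gy
    FROM assigned),
      clust
as (SELECT cluster,
           COUNT(*) as n,
           AVG(x) as mx,
           AVG(y) as my
    FROM assigned
    GROUP BY cluster)
SELECT SUM(c.n * (POWER(c.mx - g.gx, 2) + POWER(c.my - g.gy, 2))) as betweenss
FROM clust c
    CROSS JOIN grand g;
```

Under that assumption the result should match the betweenss value stored in the same #cl table, apart from floating-point rounding.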

Example #3

Cluster starting points are randomly chosen from the supplied data points.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   9,                  --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

By passing 9 in as @numclusters we told the function to group the data into 9 clusters. By passing 10 in as @nstart we told the function to run the clustering algorithm 10 times, with 10 different sets of starting clusters chosen from the supplied data points. Passing 'HW' in as @algorithm directs the function to use the technique described in [2]. The @itermax value of 25 indicates that the maximum number of iterations to be performed within each of the 10 @nstart cases is 25; in other words, if the algorithm does not find a solution within 25 iterations, no solution is returned for those starting points, though solutions might be returned for other starting points.

We can use the SQL from Example #2 to obtain the means.

--Get the means from #cl
SELECT cast(rowid as int) as rowid,
       [0] as x,
       [1] as y
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1])
) pvt
ORDER BY rowid;

This produces the following result. Your results will be different.

rowid	x	y
0	50.1081081081081	84.0990990990991
1	14.8448275862069	17.1379310344828
2	81.2280701754386	18.0175438596491
3	49.348623853211	14.0733944954128
4	87.9166666666667	49.8214285714286
5	16.0785714285714	48.5928571428571
6	16.9245283018868	84.0377358490566
7	52.5607476635514	44.2616822429907
8	82.6371681415929	78.6725663716814

As you can see, the means returned by this example are similar to the means returned by Example #2, though they are assigned to different cluster numbers. This doesn't really matter, since the cluster numbers are only labels. The following scatter plot shows the means calculated in Example #2 against the means calculated in this example.

Again, using our SQL from Example #2, we can take all of our input data and their assigned means and put the results into an Excel scatter plot which will look something like this.
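
Because the starting clusters are chosen at random, repeated runs can produce different assignments. The @seed argument is described above as supporting reproducibility, so one way to check this is to run the same call twice with the same seed and compare the assignments. This is a sketch only: the seed value 12345 and the temp table names #cl_seed1 and #cl_seed2 are arbitrary choices, and it assumes #a is unchanged between the two calls.

```sql
--First run with a fixed (arbitrary) seed
SELECT *
INTO #cl_seed1
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   9,                  --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   12345,              --@seed
                   NULL                --@formula
               );
--Second run with the same seed
SELECT *
INTO #cl_seed2
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   9,                  --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   12345,              --@seed
                   NULL                --@formula
               );
--Count the input rows whose cluster assignment differs between the two runs
SELECT COUNT(*) as [differences]
FROM #cl_seed1 s1
    INNER JOIN #cl_seed2 s2
        ON cast(s1.rowid as int) = cast(s2.rowid as int)
WHERE s1.label = 'assign'
      AND s2.label = 'assign'
      AND s1.val <> s2.val;
```

If the seed makes the run repeatable as described, the count of differences should be zero.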

Example #4

Increasing the number of clusters.

In this example, we are still going to randomly generate 1,000 data points but instead of grouping them into 9 clusters, we are going to group them into 16.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) as y
INTO #a
FROM wct.SeriesInt(1, 100, NULL, 1000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   16,                 --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

The Excel scatter plot for that analysis looks like this.

This is what an Excel scatter plot of the means would look like.

Example #5

Increasing the number of data points to 100,000.

Whether you have 1,000 or 100,000 data points, the SQL is the same. In this example we put 100,000 rows into our table and then run KMEANS using the Hartigan & Wong algorithm.

SELECT seq,
       seriesValue as x,
       wct.RANDBETWEEN(1, 100) * wct.RAND() as y
INTO #a
FROM wct.SeriesFloat(1, 100, NULL, 100000, 'R');
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #a', --@dataQuery
                   NULL,               --@meansQuery
                   16,                 --@numclusters
                   10,                 --@nstart
                   'HW',               --@algorithm
                   25,                 --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

This is what the means look like.

rowid	x	y
0	34.5923300467499	49.3355719394316
1	84.1129291935209	72.6446708395648
2	88.0163788634952	24.3943521405385
3	39.5387913083984	26.2196413808555
4	14.2994884661261	25.8790052132487
5	10.3191671063383	6.98509642430662
6	18.1734319244429	74.9411059872743
7	60.1487080675738	48.1067823976652
8	90.401435297528	6.54734212773786
9	64.174211967654	26.174037542974
10	11.4992050349235	47.7269035294419
11	86.869310912126	45.7490396549393
12	52.6167185201978	74.4334388457509
13	70.6139843553745	7.01043760864472
14	50.3271395930388	7.41215021214116
15	29.8625416683305	7.22745384678385

We run the following SQL to extract the input data and the assigned cluster and then drop the results into an Excel scatter plot.

SELECT #a.seq,
       #cl.val as cluster,
       #a.x,
       #a.y
FROM #a
    INNER JOIN #cl
        ON #a.seq = #cl.rowid
WHERE #cl.label = 'assign'
ORDER BY seq;

This is what the scatter plot looks like.

Example #6

More than 2 columns per row

KMEANS is designed to support as many columns per row as you need (remember that the first column should uniquely identify the row). In this example we create 10,000 input rows which have 5 columns of data, with data randomly generated from the normal distribution with each column having a different mean and standard deviation. As with the previous examples, the results are in a temporary table.

SELECT k.Seq,
       k.SeriesValue as x,
       wct.RANDNORM(5, 10) as y,
       wct.RANDNORM(1, 0.5) as z,
       wct.RANDNORM(10, 2) as a,
       wct.RANDNORM(100, 15) as b
INTO #k
FROM wct.SeriesFloat(0, 1, NULL, 10000, 'N') k;
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #k', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );

The results of the calculation should have assigned the input rows to 100 clusters, each cluster having 5 columns.
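
The PIVOT query used in the earlier examples extends naturally to 5 columns. The sketch below assumes the column identifiers 0 through 4 follow the order of the data columns in #k after the row identifier (x, y, z, a, b), as described in the Description.

```sql
--Get the cluster means for the 5 data columns
SELECT cast(rowid as int) as cluster,
       [0] as x,
       [1] as y,
       [2] as z,
       [3] as a,
       [4] as b
FROM
(
    SELECT rowid,
           cast(colid as int) as colid,
           val
    FROM #cl
    WHERE label = 'means'
) d
PIVOT
(
    SUM(val)
    FOR colid in ([0], [1], [2], [3], [4])
) pvt
ORDER BY 1;
```

This should return one row per cluster, with one column for each of the 5 data columns.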

Example #7

Scaling the data

In this final example, we will use the same approach as in the previous example, except that we will execute KMEANS a second time having standardized the columnar data. We can then compare the cluster assignments and see how scaling has affected the clustering.

--Create the test data
SELECT k.Seq,
       k.SeriesValue as x,
       wct.RANDNORM(5, 10) as y,
       wct.RANDNORM(1, 0.5) as z,
       wct.RANDNORM(10, 2) as a,
       wct.RANDNORM(100, 15) as b
INTO #k
FROM wct.SeriesFloat(0, 1, NULL, 10000, 'N') k;
--Perform the KMEANS on the unscaled data
SELECT *
INTO #cl
FROM wct.KMEANS(   'SELECT * FROM #k', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );
--scale and center the data and save in #k_scaled
;
with mycte
as (SELECT data,
           AVG(val) as mean,
           STDEV(val) as sigma
    FROM
    (SELECT x, y, z, a, b FROM #k) p
        UNPIVOT
        (
            val
            for data in (x, y, z, a, b)
        ) as UNPVT
    GROUP BY data)
SELECT k.seq,
       wct.STANDARDIZE(x, mx.mean, mx.sigma) as x,
       wct.STANDARDIZE(y, my.mean, my.sigma) as y,
       wct.STANDARDIZE(z, mz.mean, mz.sigma) as z,
       wct.STANDARDIZE(a, ma.mean, ma.sigma) as a,
       wct.STANDARDIZE(b, mb.mean, mb.sigma) as b
INTO #k_scaled
FROM #k k
    CROSS JOIN mycte mx
    CROSS JOIN mycte my
    CROSS JOIN mycte mz
    CROSS JOIN mycte ma
    CROSS JOIN mycte mb
WHERE mx.data = 'x'
      AND my.data = 'y'
      AND mz.data = 'z'
      AND ma.data = 'a'
      AND mb.data = 'b';
--Perform KMEANS on the scaled data and
--store in #cl_scaled.
SELECT *
INTO #cl_scaled
FROM wct.KMEANS(   'SELECT * FROM #k_scaled', --@dataQuery
                   NULL,               --@meansQuery
                   100,                --@numclusters
                   10,                 --@nstart
                   'MacQueen',         --@algorithm
                   100,                --@itermax
                   NULL,               --@seed
                   NULL                --@formula
               );
--Count the number of rows belonging to
--the same cluster as the first row in #k
SELECT COUNT(*)
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      );

This produces the following result.

count
126

--List the rows belonging to the cluster in the unscaled data
--but belonging to a different cluster in the scaled data
SELECT rowid
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      )
EXCEPT
SELECT rowid
FROM #cl_scaled
WHERE label = 'assign'
      AND val =
      (
          SELECT val
          FROM #cl_scaled
          WHERE label = 'assign'
                AND cast(rowid as int) = 1
      );

This produces the following result. Your results will be different.

	rowid
	40
	374
	598
	714
	796
	831
	933
	1009
	1114
	1124
	1268
	1291
	1363
	1414
	1481
	1620
	1659
	1743
	1786
	1893
	1939
	2023
	2121
	2618
	2660
	3036
	3131
	3379
	3583
	3592
	3763
	3984
	4086
	4183
	4381
	4472
	4478
	4512
	4616
	4738
	4855
	4976
	5225
	5364
	5572
	5748
	5844
	5932
	5994
	6097
	6310
	6369
	6393
	6490
	6596
	6917
	7123
	7131
	7255
	7400
	7412
	7558
	7603
	7740
	7837
	7855
	7924
	7926
	8045
	8219
	8319
	8401
	8405
	8533
	8541
	9164
	9244
	9263
	9290
	9299
	9381
	9508
	9819
	9953

--List the rows that are common to the cluster in both the
--unscaled and scaled data
SELECT rowid
FROM #cl
WHERE label = 'assign'
      AND val =
      (
          SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
      )
INTERSECT
SELECT rowid
FROM #cl_scaled
WHERE label = 'assign'
      AND val =
      (
          SELECT val
          FROM #cl_scaled
          WHERE label = 'assign'
                AND cast(rowid as int) = 1
      );

This produces the following result. Your results will be different.

	rowid
	1
	125
	579
	1163
	1261
	1392
	1744
	1946
	2236
	2505
	2814
	3306
	3541
	4021
	4027
	4121
	4457
	4479
	4559
	4662
	5239
	5332
	5438
	5763
	6045
	6315
	6319
	6364
	6573
	6628
	7048
	7055
	7067
	7205
	7433
	8741
	8848
	8883
	9114
	9117
	9212
	9890
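
The two row lists above can also be reduced to a single summary row. The following is a sketch only, reusing the #cl and #cl_scaled tables and the label, val and rowid columns from the queries above; the CTE names u and s and the output column names are illustrative.

```sql
--Compare the cluster containing row 1 in the unscaled and scaled results.
--u = rows sharing a cluster with row 1 in #cl
--s = rows sharing a cluster with row 1 in #cl_scaled
;
WITH u
as (SELECT rowid
    FROM #cl
    WHERE label = 'assign'
          AND val =
          (
              SELECT val FROM #cl WHERE label = 'assign' AND cast(rowid as int) = 1
          )),
     s
as (SELECT rowid
    FROM #cl_scaled
    WHERE label = 'assign'
          AND val =
          (
              SELECT val FROM #cl_scaled WHERE label = 'assign' AND cast(rowid as int) = 1
          ))
SELECT
    (SELECT COUNT(*) FROM u) as unscaled_size,
    (SELECT COUNT(*) FROM s) as scaled_size,
    (SELECT COUNT(*) FROM u INNER JOIN s ON u.rowid = s.rowid) as in_both;
```

In the output listed above, 84 rows fall only in the unscaled cluster and 42 are common to both, which accounts for the 126 rows counted earlier.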

References

^[1] Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977 “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, vol. 39, pp 1-38.

^[2] Hartigan, J.A. and Wong, M.A., 1979 “A K-means clustering algorithm”, Journal of the Royal Statistical Society, Series C (Applied Statistics), vol. 28, pp 100-108.

^[3] Lloyd, S.P., 1982 “Least Squares Quantization in PCM”, IEEE Transactions on Information Theory, vol. IT-28, no. 2, pp 129-137.

^[4] MacQueen, J., 1967 “Some Methods for Classification and Analysis of Multivariate Observations”, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp 281-297.

seq	x	y	cluster
1	5	63	1
2	39	49	4
3	49	51	4
4	77	19	6
5	94	87	8
6	14	10	0
7	9	84	2
8	99	11	6
9	66	73	5
10	54	73	5
11	58	56	4
12	42	15	3
13	73	92	8
14	27	11	0
15	9	24	0
16	29	94	2
17	98	66	7
18	6	75	2
19	23	43	1
20	4	15	0
