Clustering Basic (DataMining)

K-means Clustering with R

In [2]:
library(cluster)
library(NbClust)
library(kohonen)
library(ggplot2)
library(gridExtra)
library(scales)
In [4]:
# Read Data
cdata <- read.delim("Cluster.txt", stringsAsFactors=FALSE)
In [5]:
head(cdata)
IDMONEYVISITCROSSAPI
1 1367900 15 3 14
2 264000 3 1 109
3 3467400 10 8 12
4 461000 3 1 70
5 5128000 4 2 45
6 6353620 6 5 22
In [6]:
set.seed(1)
km <- kmeans(subset(cdata, select=-c(ID)), centers=4)
str(km)
List of 9
 $ cluster     : Named int [1:1000] 2 2 2 2 2 2 1 2 2 2 ...
  ..- attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
 $ centers     : num [1:4, 1:4] 1.28e+06 2.55e+05 1.13e+07 3.42e+06 3.26e+01 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:4] "1" "2" "3" "4"
  .. ..$ : chr [1:4] "MONEY" "VISIT" "CROSS" "API"
 $ totss       : num 1.65e+15
 $ withinss    : num [1:4] 2.80e+13 2.77e+13 1.42e+14 3.39e+13
 $ tot.withinss: num 2.32e+14
 $ betweenss   : num 1.42e+15
 $ size        : int [1:4] 181 770 8 41
 $ iter        : int 3
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"
In [9]:
km
K-means clustering with 4 clusters of sizes 181, 770, 8, 41

Cluster means:
       MONEY      VISIT     CROSS       API
1  1284818.8  32.607735 10.966851  7.232044
2   255051.4   8.124675  3.654545 34.877922
3 11323243.8 101.125000 20.750000  0.750000
4  3421840.0  62.146341 15.390244  3.024390

Clustering vector:
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
   2    2    2    2    2    2    1    2    2    2    1    2    1    2    1    2
  17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32
   2    2    2    1    2    2    1    1    2    2    1    2    1    1    2    2
  33   34   35   36   37   38   39   40   41   42   43   44   45   46   47   48
   3    4    1    2    2    2    2    2    2    2    2    2    2    2    2    2
  49   50   51   52   53   54   55   56   57   58   59   60   61   62   63   64
   1    2    2    2    1    1    1    2    2    2    2    2    2    2    4    3
  65   66   67   68   69   70   71   72   73   74   75   76   77   78   79   80
   2    2    2    2    1    2    2    2    2    2    1    2    1    4    2    1
  81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96
   4    2    2    2    1    2    2    2    2    2    2    2    2    2    2    4
  97   98   99  100  101  102  103  104  105  106  107  108  109  110  111  112
   1    2    2    2    2    1    2    1    1    2    1    2    2    2    2    1
 113  114  115  116  117  118  119  120  121  122  123  124  125  126  127  128
   2    4    2    1    2    2    2    2    2    2    1    2    4    1    2    2
 129  130  131  132  133  134  135  136  137  138  139  140  141  142  143  144
   2    1    2    2    2    1    1    2    2    1    2    2    2    1    1    1
 145  146  147  148  149  150  151  152  153  154  155  156  157  158  159  160
   2    2    1    2    2    2    2    2    2    2    2    2    2    2    1    2
 161  162  163  164  165  166  167  168  169  170  171  172  173  174  175  176
   2    2    4    2    2    2    2    2    2    2    2    2    2    2    2    2
 177  178  179  180  181  182  183  184  185  186  187  188  189  190  191  192
   2    2    2    2    2    1    4    2    2    2    2    2    4    2    2    2
 193  194  195  196  197  198  199  200  201  202  203  204  205  206  207  208
   2    2    1    1    2    2    2    2    2    2    2    2    2    2    2    4
 209  210  211  212  213  214  215  216  217  218  219  220  221  222  223  224
   2    2    2    2    2    1    2    2    2    2    2    2    2    4    2    2
 225  226  227  228  229  230  231  232  233  234  235  236  237  238  239  240
   1    4    2    2    2    4    2    2    2    2    2    2    2    2    1    2
 241  242  243  244  245  246  247  248  249  250  251  252  253  254  255  256
   1    1    2    2    2    1    2    2    2    2    2    2    2    2    2    2
 257  258  259  260  261  262  263  264  265  266  267  268  269  270  271  272
   2    1    2    1    2    2    2    2    2    2    2    2    4    2    2    2
 273  274  275  276  277  278  279  280  281  282  283  284  285  286  287  288
   2    2    2    2    2    1    2    2    2    2    2    2    2    2    2    2
 289  290  291  292  293  294  295  296  297  298  299  300  301  302  303  304
   2    2    2    2    2    1    1    2    1    2    1    2    2    2    2    2
 305  306  307  308  309  310  311  312  313  314  315  316  317  318  319  320
   2    2    1    2    2    2    2    2    2    2    2    2    2    2    1    2
 321  322  323  324  325  326  327  328  329  330  331  332  333  334  335  336
   2    2    2    2    2    2    1    2    1    1    2    1    2    2    2    2
 337  338  339  340  341  342  343  344  345  346  347  348  349  350  351  352
   2    2    1    2    2    2    1    1    2    2    2    1    2    2    2    1
 353  354  355  356  357  358  359  360  361  362  363  364  365  366  367  368
   2    2    1    2    2    2    2    2    2    2    2    2    2    2    2    2
 369  370  371  372  373  374  375  376  377  378  379  380  381  382  383  384
   2    2    2    2    2    2    2    1    2    2    1    2    2    2    2    2
 385  386  387  388  389  390  391  392  393  394  395  396  397  398  399  400
   2    2    2    2    1    1    2    2    2    2    2    1    2    2    2    2
 401  402  403  404  405  406  407  408  409  410  411  412  413  414  415  416
   2    1    2    2    2    2    1    2    2    2    2    1    2    1    2    2
 417  418  419  420  421  422  423  424  425  426  427  428  429  430  431  432
   2    2    2    2    2    2    2    2    2    2    2    2    1    2    2    2
 433  434  435  436  437  438  439  440  441  442  443  444  445  446  447  448
   1    2    2    1    2    1    2    2    2    2    1    2    2    1    2    1
 449  450  451  452  453  454  455  456  457  458  459  460  461  462  463  464
   1    2    2    2    2    2    2    1    2    2    2    2    4    1    2    2
 465  466  467  468  469  470  471  472  473  474  475  476  477  478  479  480
   2    2    1    2    2    2    2    2    2    2    2    2    2    2    2    1
 481  482  483  484  485  486  487  488  489  490  491  492  493  494  495  496
   2    2    4    2    2    2    4    2    2    2    1    2    2    2    2    2
 497  498  499  500  501  502  503  504  505  506  507  508  509  510  511  512
   2    2    2    2    2    2    2    2    2    4    2    2    2    2    2    2
 513  514  515  516  517  518  519  520  521  522  523  524  525  526  527  528
   2    2    2    2    1    2    2    2    2    2    2    2    2    2    2    1
 529  530  531  532  533  534  535  536  537  538  539  540  541  542  543  544
   1    2    2    2    2    2    2    4    2    2    2    2    1    2    2    2
 545  546  547  548  549  550  551  552  553  554  555  556  557  558  559  560
   2    2    1    2    4    1    4    2    2    2    2    2    2    2    2    2
 561  562  563  564  565  566  567  568  569  570  571  572  573  574  575  576
   2    1    2    1    2    1    2    1    2    1    2    4    2    1    2    2
 577  578  579  580  581  582  583  584  585  586  587  588  589  590  591  592
   2    2    4    2    1    2    1    1    2    2    4    2    2    1    1    2
 593  594  595  596  597  598  599  600  601  602  603  604  605  606  607  608
   2    2    2    2    2    2    1    1    2    2    2    4    2    2    2    1
 609  610  611  612  613  614  615  616  617  618  619  620  621  622  623  624
   1    2    2    2    1    1    2    2    2    1    1    1    2    1    2    2
 625  626  627  628  629  630  631  632  633  634  635  636  637  638  639  640
   2    2    2    2    2    2    2    2    2    2    2    2    1    1    2    1
 641  642  643  644  645  646  647  648  649  650  651  652  653  654  655  656
   1    2    1    2    2    2    2    2    2    2    4    2    2    2    2    2
 657  658  659  660  661  662  663  664  665  666  667  668  669  670  671  672
   2    2    2    4    2    2    2    2    2    2    3    2    1    2    1    2
 673  674  675  676  677  678  679  680  681  682  683  684  685  686  687  688
   2    1    2    2    2    1    1    2    2    2    1    2    2    2    1    2
 689  690  691  692  693  694  695  696  697  698  699  700  701  702  703  704
   1    2    2    2    2    1    1    2    2    4    2    2    2    2    2    2
 705  706  707  708  709  710  711  712  713  714  715  716  717  718  719  720
   2    2    2    1    2    2    2    1    2    2    1    1    2    2    2    2
 721  722  723  724  725  726  727  728  729  730  731  732  733  734  735  736
   2    2    2    2    2    2    2    2    2    2    2    2    2    1    1    4
 737  738  739  740  741  742  743  744  745  746  747  748  749  750  751  752
   2    2    2    2    2    2    2    2    2    2    2    4    2    2    4    2
 753  754  755  756  757  758  759  760  761  762  763  764  765  766  767  768
   2    1    2    2    2    2    2    3    2    2    2    1    2    1    2    2
 769  770  771  772  773  774  775  776  777  778  779  780  781  782  783  784
   3    2    2    2    2    2    2    2    2    4    1    2    2    2    4    1
 785  786  787  788  789  790  791  792  793  794  795  796  797  798  799  800
   1    2    2    2    2    2    2    1    2    2    2    1    2    2    1    2
 801  802  803  804  805  806  807  808  809  810  811  812  813  814  815  816
   1    2    2    2    2    2    2    2    2    4    2    2    1    1    2    1
 817  818  819  820  821  822  823  824  825  826  827  828  829  830  831  832
   2    1    1    2    2    2    2    2    2    1    2    2    2    1    2    2
 833  834  835  836  837  838  839  840  841  842  843  844  845  846  847  848
   1    1    1    2    1    2    1    1    2    2    1    2    2    2    2    2
 849  850  851  852  853  854  855  856  857  858  859  860  861  862  863  864
   1    2    2    2    2    4    2    2    2    2    1    2    2    3    2    2
 865  866  867  868  869  870  871  872  873  874  875  876  877  878  879  880
   2    2    2    1    2    2    2    2    2    2    1    1    3    2    2    2
 881  882  883  884  885  886  887  888  889  890  891  892  893  894  895  896
   2    2    2    2    2    1    2    2    2    1    2    2    1    2    2    2
 897  898  899  900  901  902  903  904  905  906  907  908  909  910  911  912
   2    2    2    2    2    2    2    1    4    1    2    2    2    2    2    3
 913  914  915  916  917  918  919  920  921  922  923  924  925  926  927  928
   2    2    1    2    2    2    2    2    2    2    2    2    1    2    2    1
 929  930  931  932  933  934  935  936  937  938  939  940  941  942  943  944
   1    2    2    2    2    2    1    2    2    2    2    2    2    2    2    2
 945  946  947  948  949  950  951  952  953  954  955  956  957  958  959  960
   2    2    2    4    2    1    1    2    2    2    1    2    2    2    4    1
 961  962  963  964  965  966  967  968  969  970  971  972  973  974  975  976
   2    2    2    2    2    2    2    2    2    1    2    2    2    1    2    2
 977  978  979  980  981  982  983  984  985  986  987  988  989  990  991  992
   2    2    2    4    2    2    2    2    2    2    4    2    2    2    2    2
 993  994  995  996  997  998  999 1000
   2    2    2    2    2    2    2    2

Within cluster sum of squares by cluster:
[1] 2.803588e+13 2.770304e+13 1.424076e+14 3.391002e+13
 (between_SS / total_SS =  85.9 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"         "iter"         "ifault"      

군집의 반경과 관계를 2차원으로 도식

  • 군집간의 유사성도 알 수 있다.
  • Too many Outliers
    • 모양이 좋지 않다.
In [10]:
options(repr.plot.width=5,repr.plot.height=5)
clusplot(subset(cdata, select=-c(ID)), km$cluster, main="Cluster")
In [11]:
# 군집의 분포를 도식
cdata$cluster <- as.factor(km$cluster)
qplot(MONEY, VISIT, colour=cluster, data=cdata)
In [13]:
plot(subset(cdata, select=-c(ID,cluster)), col=km$cluster)

군집별로 각 군집화변수의 밀도를 도식: 방법1

  • 2번 Cluster는 자주오지도 많이 사지도 않는 뜨내기 고객.
In [15]:
options(repr.plot.width=7,repr.plot.height=5)
p1 <- qplot(MONEY, fill=cluster, alpha=.5, data=cdata, geom="density") + scale_alpha(guide="none")
p2 <- qplot(VISIT, fill=cluster, alpha=.5, data=cdata, geom="density") + theme(legend.position="none")
p3 <- qplot(CROSS, fill=cluster, alpha=.5, data=cdata, geom="density") + theme(legend.position="none")
p4 <- qplot(API, fill=cluster, alpha=.5, data=cdata, geom="density") + theme(legend.position="none")
grid.arrange(p1, p2, p3, p4, ncol=2, nrow=2)

군집별로 각 군집화변수의 밀도를 도식: 방법2

In [16]:
p1 <- ggplot(cdata, aes(MONEY)) + geom_density(fill='deeppink3', adjust=1) + facet_grid(. ~ cluster) + scale_x_continuous(breaks=NULL) + scale_y_continuous("", breaks=NULL)
p2 <- ggplot(cdata, aes(VISIT)) + geom_density(fill='deeppink3', adjust=1) + facet_grid(. ~ cluster) + scale_x_continuous(breaks=NULL) + scale_y_continuous("", breaks=NULL) + theme(strip.text.x=element_blank())
p3 <- ggplot(cdata, aes(CROSS)) + geom_density(fill='deeppink3', adjust=1) + facet_grid(. ~ cluster) + scale_x_continuous(breaks=NULL) + scale_y_continuous("", breaks=NULL) + theme(strip.text.x=element_blank())
p4 <- ggplot(cdata, aes(API)) + geom_density(fill='deeppink3', adjust=1) + facet_grid(. ~ cluster) + scale_x_continuous(breaks=NULL) + scale_y_continuous("", breaks=NULL) + theme(strip.text.x=element_blank())
grid.arrange(p1, p2, p3, p4, ncol=1, nrow=4)

군집의 크기를 도식

In [19]:
options(repr.plot.width=5,repr.plot.height=5)
x <- ggplot(cdata, aes(x=factor(1), fill=cluster))
x + geom_bar(width=1) + coord_polar(theta="y")

Determining the optimal number of clusters

1. 최적의 군집 수 찾기: 방법1

In [22]:
# 최적의 군집 수 찾기: 방법1
options(repr.plot.width=9,repr.plot.height=6)
set.seed(1)
sd <- cdata[sample(1:nrow(cdata),100),-1]
d <- dist(sd, method = "euclidean")
fit <- hclust(d, method="ave")
plot(fit)
# CutTree 형태로 Cluster Cut. k=n
rect.hclust(fit, k=4, border = "red")

2. 최적의 군집 수 찾기: 방법2

  • 군집내 분산이 적고 군집간 분산이 큰 것을 찾기.
    • $tot.withinss$ 군집 내 분산의 합
  • 기울기가 급격히 감소하는 지점에서 군집 수를 정함
In [24]:
# 최적의 군집 수 찾기: 방법2
options(repr.plot.width=7,repr.plot.height=4)
wss <- 0; set.seed(1)
for(i in 1:15) wss[i] <- kmeans(subset(cdata, select=-c(ID)), centers=i)$tot.withinss
plot(1:15, wss, type="b", xlab="# of clusters", ylab="Within group sum of squares")

3. 최적의 군집 수 찾기: 방법3

In [25]:
nc = NbClust(subset(cdata, select=-c(ID,cluster)), min.nc=2, max.nc=15, method='kmeans') # 2 ~ 15개 kmeans를 비교.
barplot(table(nc$Best.nc[1,]), xlab="# of clusters", ylab="# of criteria", main="Number of clusters chosen by 26 criteria")
*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters.
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure.

*******************************************************************
* Among all indices:
* 7 proposed 2 as the best number of clusters
* 5 proposed 3 as the best number of clusters
* 1 proposed 11 as the best number of clusters
* 2 proposed 12 as the best number of clusters
* 7 proposed 13 as the best number of clusters
* 1 proposed 14 as the best number of clusters
* 1 proposed 15 as the best number of clusters

                   ***** Conclusion *****

* According to the majority rule, the best number of clusters is  2


*******************************************************************