Cross Validation

The holdout method

  • Cross-validation is needed when data is scarce => with plenty of data, it is fine to leave a single random split to chance.
  • The 60:40 train/test split we have been using in class is an example of the holdout method (a sketch follows below).
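For reference, a minimal sketch of that 60:40 holdout split in R, assuming a data frame cb with a factor target REFUND like the one loaded below (createDataPartition is caret's stratified splitter):

library(caret)

set.seed(1)
inTrain  <- createDataPartition(cb$REFUND, p=0.6, list=FALSE)  # stratified 60% sample of row indices
cb.train <- cb[inTrain, ]   # 60% used for training
cb.test  <- cb[-inTrain, ]  # 40% held out for testing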

K-fold Cross Validation (KFCV)

In [13]:
library(caret)         # createFolds(), confusionMatrix()
library(randomForest)  # randomForest()
library(ROCR)          # prediction(), performance() for AUC
In [4]:
cb <- read.delim("../1022_Decision Tree_2/Hshopping.txt", stringsAsFactors=FALSE)
colnames(cb) <- c("ID","SEX","AGE","AMT","STAR","REFUND")  # English column names (Korean headers render incorrectly in Jupyter)
cb$REFUND <- factor(cb$REFUND)  # target variable as a factor

set.seed(1)
flds <- createFolds(cb$REFUND, k=5, list=TRUE, returnTrain=FALSE)  # 5 stratified folds of held-out (test) indices
In [6]:
str(flds)
List of 5
 $ Fold1: int [1:99] 5 7 12 14 18 19 21 25 30 32 ...
 $ Fold2: int [1:101] 11 33 39 41 45 80 85 86 90 94 ...
 $ Fold3: int [1:100] 10 15 16 22 24 31 34 35 36 48 ...
 $ Fold4: int [1:100] 2 3 4 8 9 13 20 23 26 29 ...
 $ Fold5: int [1:100] 1 6 17 27 28 43 46 47 53 54 ...
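Because createFolds stratifies on the target, each fold should contain roughly the same mix of classes. A quick sanity check (a sketch; it simply tabulates REFUND within each fold):

sapply(flds, function(idx) table(cb$REFUND[idx]))  # class counts per fold should be nearly equal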

Perform 5 experiments

In [7]:
experiment <- function(train, test, m) {
    # fit a random forest on the training folds (ID is excluded as a non-predictor)
    rf <- randomForest(REFUND ~ .-ID, data=train, ntree=50)

    # accuracy on the held-out fold
    rf_pred <- predict(rf, test, type="response")
    m$acc <- c(m$acc, confusionMatrix(rf_pred, test$REFUND)$overall[1])

    # AUC on the held-out fold (note: test$REFUND, not the global cb.test)
    rf_pred_prob <- predict(rf, test, type="prob")
    rf_pred <- prediction(rf_pred_prob[,2], test$REFUND)
    m$auc <- c(m$auc, performance(rf_pred, "auc")@y.values[[1]])

    return(m)
}
In [14]:
measure <- list()
for (i in 1:5) {
    inTest   <- flds[[i]]        # indices of the i-th fold, used as the test set
    cb.test  <- cb[inTest, ]
    cb.train <- cb[-inTest, ]    # remaining four folds form the training set

    measure <- experiment(cb.train, cb.test, measure)
}
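The same 5-fold scheme can also be delegated to caret's own resampling machinery. A sketch, not a drop-in replacement: train() additionally tunes the random forest's mtry parameter over a default grid, so its numbers will not match the manual loop exactly:

ctrl <- trainControl(method="cv", number=5)  # 5-fold cross-validation
set.seed(1)
rf_cv <- train(REFUND ~ .-ID, data=cb, method="rf", ntree=50, trControl=ctrl)
rf_cv$results  # Accuracy and Kappa averaged over the five folds, per mtry value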
In [16]:
measure
$acc
  1. 0.888888888888889
  2. 0.871287128712871
  3. 0.92
  4. 0.91
  5. 0.91
$auc
  1. 0.967030360531309
  2. 0.957653985507246
  3. 0.960729312762973
  4. 0.958859280037401
  5. 0.98153342683497
In [17]:
mean(measure$acc); sd(measure$acc)
0.900035203520352
0.0196715503407551
In [18]:
mean(measure$auc); sd(measure$auc)
0.96516127313478
0.00983943166216546
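With only five folds the fold-to-fold spread matters, so a rough 95% confidence interval for the mean accuracy is worth computing. A sketch under the (optimistic) assumption that the five fold estimates are independent:

n  <- length(measure$acc)
se <- sd(measure$acc) / sqrt(n)                         # standard error of the mean
mean(measure$acc) + c(-1, 1) * qt(0.975, df=n-1) * se   # approx. 0.876 to 0.924 here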