Transformation of Data¶

library(tidyverse)
library(nycflights13)
library(gridExtra)

5. Data Transformation¶

5.1.3 dplyr basics¶

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:¶

Pick observations by their values (filter()).
Reorder the rows (arrange()).
Pick variables by their names (select()).
Create new variables with functions of existing variables (mutate()).
Collapse many values down to a single summary (summarise()).

head(flights) # 2013년 nyc에 이착률 비행기.

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
3 2013     1   1      542            540         2      923            850
4 2013     1   1      544            545        -1     1004           1022
5 2013     1   1      554            600        -6      812            837
6 2013     1   1      554            558        -4      740            728
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00

5.2 Filter rows with filter()¶

data.table을 데이터 규모가 큰 경우 dplyr보단 더 많이 사용하지만 불편하다.

filter(flights, month==1, day==1) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

flights %>% filter(month==1,day==1) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Logical Operations¶

| vs || (or)

c(1,2,3) > 2

c(1,2,3) > 2  | c(4,5,6) <= 4  # 짝을 지어서 한다. 순서대로 1>2,4<=4 이런식으로 짝을 지어서

c(1,2,3) > 2  || c(4,5,6) <= 4 # 맨앞에꺼 하나만 가지고 한다.

filter(flights, month==11 | month==12) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013    11   1        5           2359         6      352            345
2 2013    11   1       35           2250       105      123           2356
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1         7      B6    745  N568JB    JFK  PSE      205     1617   23     59
2        87      B6   1816  N353JB    JFK  SYR       36      209   22     50
            time_hour
1 2013-11-01 23:00:00
2 2013-11-01 22:00:00

%in% 연산자 : 포함하는 것

filter(flights, month %in% c(11,12)) %>% head(2) # 중의 연산자 : month가 11월 12월 중.

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013    11   1        5           2359         6      352            345
2 2013    11   1       35           2250       105      123           2356
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1         7      B6    745  N568JB    JFK  PSE      205     1617   23     59
2        87      B6   1816  N353JB    JFK  SYR       36      209   22     50
            time_hour
1 2013-11-01 23:00:00
2 2013-11-01 22:00:00

5.2.3 Na Missing Value¶

is.na(NA)

!is.na("123")

NA ^ 0 # -> 1 모든 수의 0 승은 1 
NA | TRUE
FALSE & NA

But...

## but 
NA * 0 #NA....?

[1] NA

5.2.4 Exercises¶

1. Find all flights that
  1.1. Had an arrival delay of two or more hours

filter(flights, arr_delay>=120) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      811            630       101     1047            830
2 2013     1   1      848           1835       853     1001           1950
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       137      MQ   4576  N531MQ    LGA  CLT      118      544    6     30
2       851      MQ   3944  N942MQ    JFK  BWI       41      184   18     35
            time_hour
1 2013-01-01 06:00:00
2 2013-01-01 18:00:00

1.2. Flew to Houston (IAH or HOU)

filter(flights, dest %in% c('IAH','HOU')) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

1.3. Were operated by United, American, or Delta

unique(flights$carrier)

filter(flights, carrier %in% c('UA','AA','DL')) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

1.4. Departed in summer (July, August, and September)

filter(flights, month %in% 7:9) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     7   1        1           2029       212      236           2359
2 2013     7   1        2           2359         3      344            344
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       157      B6    915  N653JB    JFK  SFO      315     2586   20     29
2         0      B6   1503  N805JB    JFK  SJU      200     1598   23     59
            time_hour
1 2013-07-01 20:00:00
2 2013-07-01 23:00:00

filter(flights, between(month,7,9)) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     7   1        1           2029       212      236           2359
2 2013     7   1        2           2359         3      344            344
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       157      B6    915  N653JB    JFK  SFO      315     2586   20     29
2         0      B6   1503  N805JB    JFK  SJU      200     1598   23     59
            time_hour
1 2013-07-01 20:00:00
2 2013-07-01 23:00:00

1.5. Arrived more than two hours late, but didn’t leave late

filter(flights, dep_delay<=0, arr_delay > 120) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1  27     1419           1420        -1     1754           1550
2 2013    10   7     1350           1350         0     1736           1526
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       124      MQ   3728  N1EAMQ    EWR  ORD      135      719   14     20
2       130      EV   5181  N611QX    LGA  MSN      117      812   13     50
            time_hour
1 2013-01-27 14:00:00
2 2013-10-07 13:00:00

1.6. Were delayed by at least an hour, but made up over 30 minutes in flight

filter(flights, dep_delay >= 60, arr_delay < dep_delay - 30) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1     2205           1720       285       46           2040
2 2013     1   1     2326           2130       116      131             18
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       246      AA   1999  N5DNAA    EWR  MIA      146     1085   17     20
2        73      B6    199  N594JB    JFK  LAS      290     2248   21     30
            time_hour
1 2013-01-01 17:00:00
2 2013-01-01 21:00:00

1.7. Departed between midnight and 6am (inclusive)

filter(flights, dep_time <= 600) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

filter(flights,is.na(dep_time)) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1       NA           1630        NA       NA           1815
2 2013     1   1       NA           1935        NA       NA           2240
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        NA      EV   4308  N18120    EWR  RDU       NA      416   16     30
2        NA      AA    791  N3EHAA    LGA  DFW       NA     1389   19     35
            time_hour
1 2013-01-01 16:00:00
2 2013-01-01 19:00:00

5.3 Arrange¶

arrange(flights, arr_delay) %>% head(2)# 오름 차순

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     5   7     1715           1729       -14     1944           2110
2 2013     5  20      719            735       -16      951           1110
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       -86      VX    193  N843VA    EWR  SFO      315     2565   17     29
2       -79      VX     11  N840VA    JFK  SFO      316     2586    7     35
            time_hour
1 2013-05-07 17:00:00
2 2013-05-20 07:00:00

arrange(flights, desc(arr_delay)) %>% head(2) #내림 차순

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   9      641            900      1301     1242           1530
2 2013     6  15     1432           1935      1137     1607           2120
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1      1272      HA     51  N384HA    JFK  HNL      640     4983    9      0
2      1127      MQ   3535  N504MQ    JFK  CMH       74      483   19     35
            time_hour
1 2013-01-09 09:00:00
2 2013-06-15 19:00:00

5.4 Select¶

select(flights, year, month,day) %>% head(2)

select(flights, year:day) %>% head(2)

select(flights, -(year:day)) %>% head(2)

  dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
1      517            515         2      830            819        11      UA
2      533            529         4      850            830        20      UA
  flight tailnum origin dest air_time distance hour minute           time_hour
1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01 05:00:00
2   1714  N24211    LGA  IAH      227     1416    5     29 2013-01-01 05:00:00

starts_with("abc"): matches names that begin with “abc”.¶

select(flights, starts_with('dep')) %>% head(2)

ends_with("xyz"): matches names that end with “xyz”.¶

select(flights, ends_with('delay')) %>% head(2)

Starts with & End with¶

select(flights, ends_with('delay'),starts_with('dep')) %>% head(2)

contains("ijk"): matches names that contain “ijk”.¶

select(flights, contains('d')) %>% head(3)

matches("(.)\1")¶

selects variables that match a regular expression.
This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
Regular Expression.

select(flights, matches('d[aeiou]')) %>% head(2) # d뒤에 aeiou가 붙는 경우

num_range("x", 1:3) matches x1, x2 and x3. 동일한 변수 뒤에 숫자가 붙은 경우¶

Rename : Change Colnum's Name¶

rename(flights, tail_num = tailnum) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tail_num origin dest air_time distance hour minute
1        11      UA   1545   N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714   N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Everything : 모든 Column's Names¶

Column의 순서를 변경할 때 사용.

select(flights, time_hour, air_time, everything()) %>% head(2)

            time_hour air_time year month day dep_time sched_dep_time dep_delay
1 2013-01-01 05:00:00      227 2013     1   1      517            515         2
2 2013-01-01 05:00:00      227 2013     1   1      533            529         4
  arr_time sched_arr_time arr_delay carrier flight tailnum origin dest distance
1      830            819        11      UA   1545  N14228    EWR  IAH     1400
2      850            830        20      UA   1714  N24211    LGA  IAH     1416
  hour minute
1    5     15
2    5     29

5.5 Add New variables with mutate()¶

flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
)

mutate(flights_sml, gain=arr_delay - dep_delay, speed = distance / air_time * 60) %>% head(2)

pipe¶

arrange(filter(flights,month==1),day) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

위 식을 보기 편하게 변경.

flights %>% filter(month==1) %>% arrange(day) %>% head(2)

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Summarize¶

summarise(flights, delay=mean(dep_delay,na.rm=T))

by_day <- flights %>% group_by(year,month,day) %>% summarise(delay=mean(dep_delay,na.rm=T))
head(by_day)

flights %>% group_by(year,month,day) %>% summarise(delay=mean(dep_delay,na.rm=T), cnt=n()) %>% head(2)

5. 탐색적 모형 분석¶

모형이 있는 것이 아니라, 그냥 데이터를 뒤져보는 것을 말한다.
환인적 탐색 분석보다 탐색적 모형분석이 오래 걸린다.
그래프를 그려본다. (특징을 슥 찾아본다.)

options(repr.plot.height=3)
ggplot(data=diamonds) + geom_histogram(aes(x=carat))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

특이한 패턴이 나온다. 특정한 무게에서 짤라서 쓰지 않을까?

ggplot(data=diamonds) + geom_histogram(aes(x=carat), binwidth = 0.01)

기술적으로 대단한 것은 아니지만, 발상이 대단한 부분이다.¶

양봉 그래프의 경우 boxplot을 그리게 되면 중간의 데이터가 있던 없던 동일한 모양이 나오는데
Histogram을 세로로 세운다면 정보의 손실이 없다.

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(data = mpg) + 
 geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) # reorder median을 기준으로

diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

7.5.3 Two continuous variables¶

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.

library(hexbin)

Warning message:
: package 'hexbin' was built under R version 3.3.2

gg1 <- ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
gg2 <- ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

grid.arrange(gg1,gg2,ncol=2)

Another option is to bin one continuous variable so it acts like a categorical variable.

ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Boxplot Outlier Text Label¶

options(repr.plot.height=4)
ggplot(mpg,aes(class,hwy)) + 
  geom_boxplot(outlier.alpha = 0) + # Ouliter 가 보기 힘들다(점과 겹쳐서) outlier.alpha = 0 하면 그림을 없애준다.
  geom_text(aes(label=rownames(mpg)))

Only Text on Outliers¶

ggplot(mpg,aes(class,hwy)) + geom_boxplot() # <- 활용

q1 = quantile(mpg$hwy,.25)
q3 = quantile(mpg$hwy,.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr

filter(mpg, hwy>upper | hwy < lower)

각각의 Class에 적용하기 귀찮다. 함수로 만든다.

filter.Outlier <- function(df){
  q1 = quantile(df$hwy,.25)
  q3 = quantile(df$hwy,.75)
  iqr = q3 - q1
  upper = q3 + 1.5 * iqr
  lower = q1 - 1.5 * iqr
  df %>% filter(hwy < lower | hwy > upper)
  # R에서는 Return을 안하면 마지막행이 Return 된다.
}

mpg <- mpg %>% mutate(name=row_number())

mpg %>% filter.Outlier()

dplyr 의 do function¶

do function : group 별로 적용되는 함수 또는 행위 적용
원리 : group_by 로 짤랐던 df를 각각에 함수를 적용하고 그결과를 rbind 한다고 생각하면 된다.
input data 에는 Dataframe이다.
dataframe이 들어갈 자리에 . 을 사용.

mpg %>% group_by(class) %>% do(head(.,1)) # 그룹바이한 class에 대해 각 df가 만들어지고 그게 .자리에 들어간다.

mpg %>% group_by(class) %>%
  do(filter.Outlier((.))) 
# do -> 그룹별로 함수 적용 
# (짤라냈던 각각의 df를 . 자리가 각각 넣어서 한다. 그리고 결과를 합친다.)

name자리에 새로운 Column 명을 한다면 해당 부분이 Label이 된다.

x = mpg %>% group_by(class) %>% do(filter.Outlier(.))

ggplot(mpg, aes(class,hwy)) + geom_boxplot(outlier.alpha = 0) +
  geom_text(data=x,aes(label=name))

2. 연비 top3¶

filter.hwy <- function(df){
  df %>% arrange(desc(hwy)) %>% head(3)
}
mpg %>% group_by(model)  %>% do(filter.hwy(.))

mpg %>% group_by(model) %>% do((.) %>% arrange(desc(hwy)) %>% head(3)) # do((.))  do안에서 사용하려면 (.) 로 묶어줘야된다.

Tips¶

rbind, cbind 보다 빠른 함수 : bind_rows,bind_cols 사용법은 동일하다.

filtering 후 차이를 알고 싶을때¶

setdiff() 함수 활용

flights <- flights %>% mutate(name= row_number())
aa <- filter(flights, carrier == 'AA') 
july <- filter(flights, month == 7)
i = setdiff(aa$name,july$name)

head(flights[i,])

  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      542            540         2      923            850
2 2013     1   1      558            600        -2      753            745
3 2013     1   1      559            600        -1      941            910
4 2013     1   1      606            610        -4      858            910
5 2013     1   1      623            610        13      920            915
6 2013     1   1      628            630        -2     1137           1140
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
2         8      AA    301  N3ALAA    LGA  ORD      138      733    6      0
3        31      AA    707  N3DUAA    LGA  DFW      257     1389    6      0
4       -12      AA   1895  N633AA    EWR  MIA      152     1085    6     10
5         5      AA   1837  N3EMAA    LGA  MIA      153     1096    6     10
6        -3      AA    413  N3BAAA    JFK  SJU      192     1598    6     30
            time_hour name
1 2013-01-01 05:00:00    3
2 2013-01-01 06:00:00   10
3 2013-01-01 06:00:00   15
4 2013-01-01 06:00:00   23
5 2013-01-01 06:00:00   32
6 2013-01-01 06:00:00   37

	year	month	day	dep_delay	arr_delay	distance	air_time	gain	speed
1	2013.0000	1.0000	1.0000	2.0000	11.0000	1400.0000	227.0000	9.0000	370.0441
2	2013.0000	1.0000	1.0000	4.0000	20.0000	1416.0000	227.0000	16.0000	374.2731

	year	month	day	delay
1	2013.00000	1.00000	1.00000	11.54893
2	2013.00000	1.00000	2.00000	13.85882
3	2013.00000	1.00000	3.00000	10.98783
4	2013.000000	1.000000	4.000000	8.951595
5	2013.000000	1.000000	5.000000	5.732218
6	2013.000000	1.000000	6.000000	7.148014

	year	month	day	delay	cnt
1	2013.00000	1.00000	1.00000	11.54893	842.00000
2	2013.00000	1.00000	2.00000	13.85882	943.00000

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class
1	volkswagen	jetta	1.9	1999	4	manual(m5)	f	33	44	d	compact
2	volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact
3	volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	name
1	volkswagen	jetta	1.9	1999	4	manual(m5)	f	33	44	d	compact	213
2	volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact	222
3	volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact	223

	day	dep_time	sched_dep_time	dep_delay	sched_arr_time	arr_delay	dest	distance
1	1	517	515	2	819	11	IAH	1400
2	1	533	529	4	830	20	IAH	1416
3	1	542	540	2	850	33	MIA	1089

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	name
1	chevrolet	corvette	5.7	1999	8	manual(m6)	r	16	26	p	2seater	24
2	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	1
3	audi	a6 quattro	2.8	1999	6	auto(l5)	4	15	24	p	midsize	16
4	dodge	caravan 2wd	2.4	1999	4	auto(l3)	f	18	24	r	minivan	38
5	dodge	dakota pickup 4wd	3.7	2008	6	manual(m6)	4	15	19	r	pickup	49
6	ford	mustang	3.8	1999	6	manual(m5)	r	18	26	r	subcompact	91
7	chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	14	20	r	suv	19

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	name
1	toyota	corolla	1.8	1999	4	manual(m5)	f	26	35	r	compact	196
2	toyota	corolla	1.8	2008	4	manual(m5)	f	28	37	r	compact	197
3	toyota	corolla	1.8	2008	4	auto(l4)	f	26	35	r	compact	198
4	volkswagen	jetta	1.9	1999	4	manual(m5)	f	33	44	d	compact	213
5	dodge	caravan 2wd	3.3	2008	6	auto(l4)	f	11	17	e	minivan	44
6	dodge	dakota pickup 4wd	4.7	2008	8	auto(l5)	4	9	12	e	pickup	55
7	dodge	ram 1500 pickup 4wd	4.7	2008	8	auto(l5)	4	9	12	e	pickup	66
8	dodge	ram 1500 pickup 4wd	4.7	2008	8	manual(m6)	4	9	12	e	pickup	70
9	toyota	toyota tacoma 4wd	2.7	2008	4	manual(m5)	4	17	22	r	pickup	203
10	volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact	222
11	volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact	223
12	dodge	durango 4wd	4.7	2008	8	auto(l5)	4	9	12	e	suv	60
13	jeep	grand cherokee 4wd	4.7	2008	8	auto(l5)	4	9	12	e	suv	127
14	subaru	forester awd	2.5	1999	4	manual(m5)	4	18	25	r	suv	160
15	subaru	forester awd	2.5	1999	4	auto(l4)	4	18	24	r	suv	161
16	subaru	forester awd	2.5	2008	4	manual(m5)	4	20	27	r	suv	162
17	subaru	forester awd	2.5	2008	4	manual(m5)	4	19	25	p	suv	163
18	subaru	forester awd	2.5	2008	4	auto(l4)	4	20	26	r	suv	164
19	subaru	forester awd	2.5	2008	4	auto(l4)	4	18	23	p	suv	165

	manufacturer	model	displ	year	cyl	trans	drv	cty	hwy	fl	class	name
1	toyota	4runner 4wd	2.7	1999	4	manual(m5)	4	15	20	r	suv	174
2	toyota	4runner 4wd	2.7	1999	4	auto(l4)	4	16	20	r	suv	175
3	toyota	4runner 4wd	4	2008	6	auto(l5)	4	16	20	r	suv	178
4	audi	a4	2	2008	4	manual(m6)	f	20	31	p	compact	3
5	audi	a4	2	2008	4	auto(av)	f	21	30	p	compact	4
6	audi	a4	1.8	1999	4	auto(l5)	f	18	29	p	compact	1
7	audi	a4 quattro	2	2008	4	manual(m6)	4	20	28	p	compact	10
8	audi	a4 quattro	2	2008	4	auto(s6)	4	19	27	p	compact	11
9	audi	a4 quattro	1.8	1999	4	manual(m5)	4	18	26	p	compact	8
10	audi	a6 quattro	3.1	2008	6	auto(s6)	4	17	25	p	midsize	17
11	audi	a6 quattro	2.8	1999	6	auto(l5)	4	15	24	p	midsize	16
12	audi	a6 quattro	4.2	2008	8	auto(s6)	4	16	23	p	midsize	18
13	nissan	altima	2.5	2008	4	manual(m6)	f	23	32	r	midsize	145
14	nissan	altima	2.5	2008	4	auto(av)	f	23	31	r	midsize	144
15	nissan	altima	2.4	1999	4	manual(m5)	f	21	29	r	compact	142
16	chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	14	20	r	suv	19
17	chevrolet	c1500 suburban 2wd	5.3	2008	8	auto(l4)	r	14	20	r	suv	21
18	chevrolet	c1500 suburban 2wd	5.7	1999	8	auto(l4)	r	13	17	r	suv	22
19	toyota	camry	2.4	2008	4	manual(m5)	f	21	31	r	midsize	182
20	toyota	camry	2.4	2008	4	auto(l5)	f	21	31	r	midsize	183
21	toyota	camry	2.2	1999	4	manual(m5)	f	21	29	r	midsize	180
22	toyota	camry solara	2.4	2008	4	manual(m5)	f	21	31	r	compact	189
23	toyota	camry solara	2.4	2008	4	auto(s5)	f	22	31	r	compact	190
24	toyota	camry solara	2.2	1999	4	manual(m5)	f	21	29	r	compact	188
25	dodge	caravan 2wd	2.4	1999	4	auto(l3)	f	18	24	r	minivan	38
26	dodge	caravan 2wd	3	1999	6	auto(l4)	f	17	24	r	minivan	39
27	dodge	caravan 2wd	3.3	2008	6	auto(l4)	f	17	24	r	minivan	42
28	honda	civic	1.8	2008	4	auto(l5)	f	25	36	r	subcompact	106
29	honda	civic	1.8	2008	4	auto(l5)	f	24	36	c	subcompact	107
30	honda	civic	1.8	2008	4	manual(m5)	f	26	34	r	subcompact	105
...	...	...	...	...	...	...	...	...	...	...	...	...
84	ford	mustang	3.8	1999	6	manual(m5)	r	18	26	r	subcompact	91
85	ford	mustang	4	2008	6	manual(m5)	r	17	26	r	subcompact	93
86	ford	mustang	3.8	1999	6	auto(l4)	r	18	25	r	subcompact	92
87	lincoln	navigator 2wd	5.4	2008	8	auto(l6)	r	12	18	r	suv	137
88	lincoln	navigator 2wd	5.4	1999	8	auto(l4)	r	11	17	r	suv	135
89	lincoln	navigator 2wd	5.4	1999	8	auto(l4)	r	11	16	p	suv	136
90	volkswagen	new beetle	1.9	1999	4	manual(m5)	f	35	44	d	subcompact	222
91	volkswagen	new beetle	1.9	1999	4	auto(l4)	f	29	41	d	subcompact	223
92	volkswagen	new beetle	2	1999	4	manual(m5)	f	21	29	r	subcompact	224
93	volkswagen	passat	1.8	1999	4	manual(m5)	f	21	29	p	midsize	228
94	volkswagen	passat	1.8	1999	4	auto(l5)	f	18	29	p	midsize	229
95	volkswagen	passat	2	2008	4	manual(m6)	f	21	29	p	midsize	231
96	nissan	pathfinder 4wd	4	2008	6	auto(l5)	4	14	20	p	suv	153
97	nissan	pathfinder 4wd	5.6	2008	8	auto(s5)	4	12	18	p	suv	154
98	nissan	pathfinder 4wd	3.3	1999	6	auto(l4)	4	14	17	r	suv	151
99	dodge	ram 1500 pickup 4wd	4.7	2008	8	auto(l5)	4	13	17	r	pickup	67
100	dodge	ram 1500 pickup 4wd	4.7	2008	8	auto(l5)	4	13	17	r	pickup	68
101	dodge	ram 1500 pickup 4wd	5.7	2008	8	auto(l5)	4	13	17	r	pickup	73
102	land rover	range rover	4.2	2008	8	auto(s6)	4	12	18	r	suv	132
103	land rover	range rover	4.4	2008	8	auto(s6)	4	12	18	r	suv	133
104	land rover	range rover	4	1999	8	auto(l4)	4	11	15	p	suv	131
105	hyundai	sonata	2.4	2008	4	manual(m5)	f	21	31	r	midsize	112
106	hyundai	sonata	2.4	2008	4	auto(l4)	f	21	30	r	midsize	111
107	hyundai	sonata	3.3	2008	6	auto(l5)	f	19	28	r	midsize	115
108	hyundai	tiburon	2	1999	4	manual(m5)	f	19	29	r	subcompact	117
109	hyundai	tiburon	2	2008	4	manual(m5)	f	20	28	r	subcompact	118
110	hyundai	tiburon	2	2008	4	auto(l4)	f	20	27	r	subcompact	119
111	toyota	toyota tacoma 4wd	2.7	2008	4	manual(m5)	4	17	22	r	pickup	203
112	toyota	toyota tacoma 4wd	2.7	1999	4	manual(m5)	4	15	20	r	pickup	201
113	toyota	toyota tacoma 4wd	2.7	1999	4	auto(l4)	4	16	20	r	pickup	202

Transformation of Data¶

5. Data Transformation¶

5.1.3 dplyr basics¶

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:¶

5.2 Filter rows with filter()¶

Logical Operations¶

5.2.3 Na Missing Value¶

5.2.4 Exercises¶

5.3 Arrange¶

5.4 Select¶

starts_with("abc"): matches names that begin with “abc”.¶

ends_with("xyz"): matches names that end with “xyz”.¶

Starts with & End with¶

contains("ijk"): matches names that contain “ijk”.¶

matches("(.)\1")¶

num_range("x", 1:3) matches x1, x2 and x3. 동일한 변수 뒤에 숫자가 붙은 경우¶

Rename : Change Colnum's Name¶

Everything : 모든 Column's Names¶

5.5 Add New variables with mutate()¶

pipe¶

Summarize¶

5. 탐색적 모형 분석¶

기술적으로 대단한 것은 아니지만, 발상이 대단한 부분이다.¶

7.5.3 Two continuous variables¶

Boxplot Outlier Text Label¶

Only Text on Outliers¶

dplyr 의 do function¶

2. 연비 top3¶

Tips¶

filtering 후 차이를 알고 싶을때¶

교집합을 알고 싶다면 intersect()를 사용.¶