Transformation of Data

In [2]:
library(tidyverse)
library(nycflights13)
library(gridExtra)

5. Data Transformation

5.1.3 dplyr basics

In this chapter you are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

  • Pick observations by their values (filter()).
  • Reorder the rows (arrange()).
  • Pick variables by their names (select()).
  • Create new variables with functions of existing variables (mutate()).
  • Collapse many values down to a single summary (summarise()).
In [3]:
head(flights) # 2013년 nyc에 이착률 비행기.
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
3 2013     1   1      542            540         2      923            850
4 2013     1   1      544            545        -1     1004           1022
5 2013     1   1      554            600        -6      812            837
6 2013     1   1      554            558        -4      740            728
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
3        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
4       -18      B6    725  N804JB    JFK  BQN      183     1576    5     45
5       -25      DL    461  N668DN    LGA  ATL      116      762    6      0
6        12      UA   1696  N39463    EWR  ORD      150      719    5     58
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3 2013-01-01 05:00:00
4 2013-01-01 05:00:00
5 2013-01-01 06:00:00
6 2013-01-01 05:00:00

5.2 Filter rows with filter()

  • data.table을 데이터 규모가 큰 경우 dplyr보단 더 많이 사용하지만 불편하다.
In [4]:
filter(flights, month==1, day==1) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
In [5]:
flights %>% filter(month==1,day==1) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Logical Operations

  • | vs || (or)
In [6]:
c(1,2,3) > 2
  1. FALSE
  2. FALSE
  3. TRUE
In [7]:
c(1,2,3) > 2  | c(4,5,6) <= 4  # 짝을 지어서 한다. 순서대로 1>2,4<=4 이런식으로 짝을 지어서
  1. TRUE
  2. FALSE
  3. TRUE
In [8]:
c(1,2,3) > 2  || c(4,5,6) <= 4 # 맨앞에꺼 하나만 가지고 한다.
TRUE
In [9]:
filter(flights, month==11 | month==12) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013    11   1        5           2359         6      352            345
2 2013    11   1       35           2250       105      123           2356
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1         7      B6    745  N568JB    JFK  PSE      205     1617   23     59
2        87      B6   1816  N353JB    JFK  SYR       36      209   22     50
            time_hour
1 2013-11-01 23:00:00
2 2013-11-01 22:00:00
  • %in% 연산자 : 포함하는 것
In [10]:
filter(flights, month %in% c(11,12)) %>% head(2) # 중의 연산자 : month가 11월 12월 중.
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013    11   1        5           2359         6      352            345
2 2013    11   1       35           2250       105      123           2356
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1         7      B6    745  N568JB    JFK  PSE      205     1617   23     59
2        87      B6   1816  N353JB    JFK  SYR       36      209   22     50
            time_hour
1 2013-11-01 23:00:00
2 2013-11-01 22:00:00

5.2.3 Na Missing Value

In [11]:
is.na(NA)
TRUE
In [12]:
!is.na("123")
TRUE
In [13]:
NA ^ 0 # -> 1 모든 수의 0 승은 1 
NA | TRUE
FALSE & NA
1
TRUE
FALSE
  • But...
In [14]:
## but 
NA * 0 #NA....?
[1] NA

5.2.4 Exercises

1. Find all flights that
  1.1. Had an arrival delay of two or more hours
In [15]:
filter(flights, arr_delay>=120) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      811            630       101     1047            830
2 2013     1   1      848           1835       853     1001           1950
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       137      MQ   4576  N531MQ    LGA  CLT      118      544    6     30
2       851      MQ   3944  N942MQ    JFK  BWI       41      184   18     35
            time_hour
1 2013-01-01 06:00:00
2 2013-01-01 18:00:00
1.2. Flew to Houston (IAH or HOU)
In [16]:
filter(flights, dest %in% c('IAH','HOU')) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
1.3. Were operated by United, American, or Delta
In [17]:
unique(flights$carrier)
  1. "UA"
  2. "AA"
  3. "B6"
  4. "DL"
  5. "EV"
  6. "MQ"
  7. "US"
  8. "WN"
  9. "VX"
  10. "FL"
  11. "AS"
  12. "9E"
  13. "F9"
  14. "HA"
  15. "YV"
  16. "OO"
In [18]:
filter(flights, carrier %in% c('UA','AA','DL')) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
1.4. Departed in summer (July, August, and September)
In [19]:
filter(flights, month %in% 7:9) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     7   1        1           2029       212      236           2359
2 2013     7   1        2           2359         3      344            344
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       157      B6    915  N653JB    JFK  SFO      315     2586   20     29
2         0      B6   1503  N805JB    JFK  SJU      200     1598   23     59
            time_hour
1 2013-07-01 20:00:00
2 2013-07-01 23:00:00
In [20]:
filter(flights, between(month,7,9)) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     7   1        1           2029       212      236           2359
2 2013     7   1        2           2359         3      344            344
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       157      B6    915  N653JB    JFK  SFO      315     2586   20     29
2         0      B6   1503  N805JB    JFK  SJU      200     1598   23     59
            time_hour
1 2013-07-01 20:00:00
2 2013-07-01 23:00:00
1.5. Arrived more than two hours late, but didn’t leave late
In [21]:
filter(flights, dep_delay<=0, arr_delay > 120) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1  27     1419           1420        -1     1754           1550
2 2013    10   7     1350           1350         0     1736           1526
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       124      MQ   3728  N1EAMQ    EWR  ORD      135      719   14     20
2       130      EV   5181  N611QX    LGA  MSN      117      812   13     50
            time_hour
1 2013-01-27 14:00:00
2 2013-10-07 13:00:00
1.6. Were delayed by at least an hour, but made up over 30 minutes in flight
In [22]:
filter(flights, dep_delay >= 60, arr_delay < dep_delay - 30) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1     2205           1720       285       46           2040
2 2013     1   1     2326           2130       116      131             18
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       246      AA   1999  N5DNAA    EWR  MIA      146     1085   17     20
2        73      B6    199  N594JB    JFK  LAS      290     2248   21     30
            time_hour
1 2013-01-01 17:00:00
2 2013-01-01 21:00:00
1.7. Departed between midnight and 6am (inclusive)
In [23]:
filter(flights, dep_time <= 600) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
3. How many flights have a missing dep_time? What other variables are missing? What might these rows represent?
In [24]:
filter(flights,is.na(dep_time)) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1       NA           1630        NA       NA           1815
2 2013     1   1       NA           1935        NA       NA           2240
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        NA      EV   4308  N18120    EWR  RDU       NA      416   16     30
2        NA      AA    791  N3EHAA    LGA  DFW       NA     1389   19     35
            time_hour
1 2013-01-01 16:00:00
2 2013-01-01 19:00:00

5.3 Arrange

In [25]:
arrange(flights, arr_delay) %>% head(2)# 오름 차순
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     5   7     1715           1729       -14     1944           2110
2 2013     5  20      719            735       -16      951           1110
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1       -86      VX    193  N843VA    EWR  SFO      315     2565   17     29
2       -79      VX     11  N840VA    JFK  SFO      316     2586    7     35
            time_hour
1 2013-05-07 17:00:00
2 2013-05-20 07:00:00
In [26]:
arrange(flights, desc(arr_delay)) %>% head(2) #내림 차순
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   9      641            900      1301     1242           1530
2 2013     6  15     1432           1935      1137     1607           2120
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1      1272      HA     51  N384HA    JFK  HNL      640     4983    9      0
2      1127      MQ   3535  N504MQ    JFK  CMH       74      483   19     35
            time_hour
1 2013-01-09 09:00:00
2 2013-06-15 19:00:00

5.4 Select

In [27]:
select(flights, year, month,day) %>% head(2)
yearmonthday
12013 1 1
22013 1 1
In [28]:
select(flights, year:day) %>% head(2)
yearmonthday
12013 1 1
22013 1 1
In [29]:
select(flights, -(year:day)) %>% head(2)
  dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
1      517            515         2      830            819        11      UA
2      533            529         4      850            830        20      UA
  flight tailnum origin dest air_time distance hour minute           time_hour
1   1545  N14228    EWR  IAH      227     1400    5     15 2013-01-01 05:00:00
2   1714  N24211    LGA  IAH      227     1416    5     29 2013-01-01 05:00:00

starts_with("abc"): matches names that begin with “abc”.

In [30]:
select(flights, starts_with('dep')) %>% head(2)
dep_timedep_delay
1517 2
2533 4

ends_with("xyz"): matches names that end with “xyz”.

In [31]:
select(flights, ends_with('delay')) %>% head(2)
dep_delayarr_delay
1 211
2 420

Starts with & End with

In [32]:
select(flights, ends_with('delay'),starts_with('dep')) %>% head(2)
dep_delayarr_delaydep_time
1 2 11517
2 4 20533

contains("ijk"): matches names that contain “ijk”.

In [33]:
select(flights, contains('d')) %>% head(3)
daydep_timesched_dep_timedep_delaysched_arr_timearr_delaydestdistance
11 517 515 2 819 11 IAH 1400
21 533 529 4 830 20 IAH 1416
31 542 540 2 850 33 MIA 1089

matches("(.)\1")

  • selects variables that match a regular expression.
    This one matches any variables that contain repeated characters. You’ll learn more about regular expressions in strings.
  • Regular Expression.
In [34]:
select(flights, matches('d[aeiou]')) %>% head(2) # d뒤에 aeiou가 붙는 경우
daydep_timesched_dep_timedep_delayarr_delaydestdistance
11 517 515 2 11 IAH 1400
21 533 529 4 20 IAH 1416

num_range("x", 1:3) matches x1, x2 and x3. 동일한 변수 뒤에 숫자가 붙은 경우

Rename : Change Colnum's Name

In [35]:
rename(flights, tail_num = tailnum) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tail_num origin dest air_time distance hour minute
1        11      UA   1545   N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714   N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Everything : 모든 Column's Names

  • Column의 순서를 변경할 때 사용.
In [36]:
select(flights, time_hour, air_time, everything()) %>% head(2)
            time_hour air_time year month day dep_time sched_dep_time dep_delay
1 2013-01-01 05:00:00      227 2013     1   1      517            515         2
2 2013-01-01 05:00:00      227 2013     1   1      533            529         4
  arr_time sched_arr_time arr_delay carrier flight tailnum origin dest distance
1      830            819        11      UA   1545  N14228    EWR  IAH     1400
2      850            830        20      UA   1714  N24211    LGA  IAH     1416
  hour minute
1    5     15
2    5     29

5.5 Add New variables with mutate()

In [37]:
flights_sml <- select(flights, 
                      year:day, 
                      ends_with("delay"), 
                      distance, 
                      air_time
)
In [38]:
mutate(flights_sml, gain=arr_delay - dep_delay, speed = distance / air_time * 60) %>% head(2)
yearmonthdaydep_delayarr_delaydistanceair_timegainspeed
12013.0000 1.0000 1.0000 2.0000 11.00001400.0000 227.0000 9.0000 370.0441
22013.0000 1.0000 1.0000 4.0000 20.00001416.0000 227.0000 16.0000 374.2731

pipe

In [39]:
arrange(filter(flights,month==1),day) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00
  • 위 식을 보기 편하게 변경.
In [40]:
flights %>% filter(month==1) %>% arrange(day) %>% head(2)
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      517            515         2      830            819
2 2013     1   1      533            529         4      850            830
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        11      UA   1545  N14228    EWR  IAH      227     1400    5     15
2        20      UA   1714  N24211    LGA  IAH      227     1416    5     29
            time_hour
1 2013-01-01 05:00:00
2 2013-01-01 05:00:00

Summarize

In [41]:
summarise(flights, delay=mean(dep_delay,na.rm=T))
delay
112.63907
In [42]:
by_day <- flights %>% group_by(year,month,day) %>% summarise(delay=mean(dep_delay,na.rm=T))
head(by_day)
yearmonthdaydelay
12013.00000 1.00000 1.00000 11.54893
22013.00000 1.00000 2.00000 13.85882
32013.00000 1.00000 3.00000 10.98783
42013.000000 1.000000 4.000000 8.951595
52013.000000 1.000000 5.000000 5.732218
62013.000000 1.000000 6.000000 7.148014
In [43]:
flights %>% group_by(year,month,day) %>% summarise(delay=mean(dep_delay,na.rm=T), cnt=n()) %>% head(2)
yearmonthdaydelaycnt
12013.00000 1.00000 1.00000 11.54893 842.00000
22013.00000 1.00000 2.00000 13.85882 943.00000

5. 탐색적 모형 분석

  • 모형이 있는 것이 아니라, 그냥 데이터를 뒤져보는 것을 말한다.
  • 환인적 탐색 분석보다 탐색적 모형분석이 오래 걸린다.
  • 그래프를 그려본다. (특징을 슥 찾아본다.)
In [44]:
options(repr.plot.height=3)
ggplot(data=diamonds) + geom_histogram(aes(x=carat))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
  • 특이한 패턴이 나온다. 특정한 무게에서 짤라서 쓰지 않을까?
In [45]:
ggplot(data=diamonds) + geom_histogram(aes(x=carat), binwidth = 0.01)

기술적으로 대단한 것은 아니지만, 발상이 대단한 부분이다.

  • 양봉 그래프의 경우 boxplot을 그리게 되면 중간의 데이터가 있던 없던 동일한 모양이 나오는데
  • Histogram을 세로로 세운다면 정보의 손실이 없다.
In [46]:
ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()
In [69]:
ggplot(data = mpg) + 
 geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) # reorder median을 기준으로
In [47]:
diamonds %>% 
  count(color, cut) %>%  
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

7.5.3 Two continuous variables

In [48]:
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
In [49]:
ggplot(data = diamonds) + 
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
  • geom_bin2d() and geom_hex() divide the coordinate plane into 2d bins and then use a fill color to display how many points fall into each bin.
In [50]:
library(hexbin)
Warning message:
: package 'hexbin' was built under R version 3.3.2
In [51]:
gg1 <- ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
gg2 <- ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))

grid.arrange(gg1,gg2,ncol=2)
  • Another option is to bin one continuous variable so it acts like a categorical variable.
In [52]:
ggplot(data = diamonds, mapping = aes(x = carat, y = price)) + 
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

Boxplot Outlier Text Label

In [53]:
options(repr.plot.height=4)
ggplot(mpg,aes(class,hwy)) + 
  geom_boxplot(outlier.alpha = 0) + # Ouliter 가 보기 힘들다(점과 겹쳐서) outlier.alpha = 0 하면 그림을 없애준다.
  geom_text(aes(label=rownames(mpg)))

Only Text on Outliers

In [54]:
ggplot(mpg,aes(class,hwy)) + geom_boxplot() # <- 활용
In [55]:
q1 = quantile(mpg$hwy,.25)
q3 = quantile(mpg$hwy,.75)
iqr = q3 - q1
upper = q3 + 1.5 * iqr
lower = q1 - 1.5 * iqr
In [56]:
filter(mpg, hwy>upper | hwy < lower)
manufacturermodeldisplyearcyltransdrvctyhwyflclass
1volkswagenjetta 1.9 1999 4 manual(m5)f 33 44 d compact
2volkswagennew beetle1.9 1999 4 manual(m5)f 35 44 d subcompact
3volkswagennew beetle1.9 1999 4 auto(l4) f 29 41 d subcompact
  • 각각의 Class에 적용하기 귀찮다. 함수로 만든다.
In [57]:
filter.Outlier <- function(df){
  q1 = quantile(df$hwy,.25)
  q3 = quantile(df$hwy,.75)
  iqr = q3 - q1
  upper = q3 + 1.5 * iqr
  lower = q1 - 1.5 * iqr
  df %>% filter(hwy < lower | hwy > upper)
  # R에서는 Return을 안하면 마지막행이 Return 된다.
}
In [58]:
mpg <- mpg %>% mutate(name=row_number())
In [59]:
mpg %>% filter.Outlier()
manufacturermodeldisplyearcyltransdrvctyhwyflclassname
1volkswagenjetta 1.9 1999 4 manual(m5)f 33 44 d compact 213
2volkswagennew beetle1.9 1999 4 manual(m5)f 35 44 d subcompact222
3volkswagennew beetle1.9 1999 4 auto(l4) f 29 41 d subcompact223

dplyr 의 do function

  • do function : group 별로 적용되는 함수 또는 행위 적용
  • 원리 : group_by 로 짤랐던 df를 각각에 함수를 적용하고 그결과를 rbind 한다고 생각하면 된다.
  • input data 에는 Dataframe이다.
  • dataframe이 들어갈 자리에 . 을 사용.
In [60]:
mpg %>% group_by(class) %>% do(head(.,1)) # 그룹바이한 class에 대해 각 df가 만들어지고 그게 .자리에 들어간다.
manufacturermodeldisplyearcyltransdrvctyhwyflclassname
1chevrolet corvette 5.7 1999 8 manual(m6)r 16 26 p 2seater 24
2audi a4 1.8 1999 4 auto(l5)f 18 29 p compact 1
3audi a6 quattro2.8 1999 6 auto(l5) 4 15 24 p midsize 16
4dodge caravan 2wd2.4 1999 4 auto(l3) f 18 24 r minivan 38
5dodge dakota pickup 4wd3.7 2008 6 manual(m6) 4 15 19 r pickup 49
6ford mustang 3.8 1999 6 manual(m5)r 18 26 r subcompact91
7chevrolet c1500 suburban 2wd5.3 2008 8 auto(l4) r 14 20 r suv 19
In [61]:
mpg %>% group_by(class) %>%
  do(filter.Outlier((.))) 
# do -> 그룹별로 함수 적용 
# (짤라냈던 각각의 df를 . 자리가 각각 넣어서 한다. 그리고 결과를 합친다.)
manufacturermodeldisplyearcyltransdrvctyhwyflclassname
1toyota corolla 1.8 1999 4 manual(m5)f 26 35 r compact 196
2toyota corolla 1.8 2008 4 manual(m5)f 28 37 r compact 197
3toyota corolla 1.8 2008 4 auto(l4)f 26 35 r compact 198
4volkswagenjetta 1.9 1999 4 manual(m5)f 33 44 d compact 213
5dodge caravan 2wd3.3 2008 6 auto(l4) f 11 17 e minivan 44
6dodge dakota pickup 4wd4.7 2008 8 auto(l5) 4 9 12 e pickup 55
7dodge ram 1500 pickup 4wd4.7 2008 8 auto(l5) 4 9 12 e pickup 66
8dodge ram 1500 pickup 4wd4.7 2008 8 manual(m6) 4 9 12 e pickup 70
9toyota toyota tacoma 4wd2.7 2008 4 manual(m5) 4 17 22 r pickup 203
10volkswagennew beetle1.9 1999 4 manual(m5)f 35 44 d subcompact222
11volkswagennew beetle1.9 1999 4 auto(l4) f 29 41 d subcompact223
12dodge durango 4wd4.7 2008 8 auto(l5) 4 9 12 e suv 60
13jeep grand cherokee 4wd4.7 2008 8 auto(l5) 4 9 12 e suv 127
14subaru forester awd2.5 1999 4 manual(m5) 4 18 25 r suv 160
15subaru forester awd2.5 1999 4 auto(l4) 4 18 24 r suv 161
16subaru forester awd2.5 2008 4 manual(m5) 4 20 27 r suv 162
17subaru forester awd2.5 2008 4 manual(m5) 4 19 25 p suv 163
18subaru forester awd2.5 2008 4 auto(l4) 4 20 26 r suv 164
19subaru forester awd2.5 2008 4 auto(l4) 4 18 23 p suv 165
  • name자리에 새로운 Column 명을 한다면 해당 부분이 Label이 된다.
In [62]:
x = mpg %>% group_by(class) %>% do(filter.Outlier(.))

ggplot(mpg, aes(class,hwy)) + geom_boxplot(outlier.alpha = 0) +
  geom_text(data=x,aes(label=name))

2. 연비 top3

In [63]:
filter.hwy <- function(df){
  df %>% arrange(desc(hwy)) %>% head(3)
}
mpg %>% group_by(model)  %>% do(filter.hwy(.))
manufacturermodeldisplyearcyltransdrvctyhwyflclassname
1toyota 4runner 4wd2.7 1999 4 manual(m5) 4 15 20 r suv 174
2toyota 4runner 4wd2.7 1999 4 auto(l4) 4 16 20 r suv 175
3toyota 4runner 4wd4 2008 6 auto(l5) 4 16 20 r suv 178
4audi a4 2 2008 4 manual(m6)f 20 31 p compact 3
5audi a4 2 2008 4 auto(av)f 21 30 p compact 4
6audi a4 1.8 1999 4 auto(l5)f 18 29 p compact 1
7audi a4 quattro2 2008 4 manual(m6)4 20 28 p compact 10
8audi a4 quattro2 2008 4 auto(s6) 4 19 27 p compact 11
9audi a4 quattro1.8 1999 4 manual(m5)4 18 26 p compact 8
10audi a6 quattro3.1 2008 6 auto(s6) 4 17 25 p midsize 17
11audi a6 quattro2.8 1999 6 auto(l5) 4 15 24 p midsize 16
12audi a6 quattro4.2 2008 8 auto(s6) 4 16 23 p midsize 18
13nissan altima 2.5 2008 4 manual(m6)f 23 32 r midsize 145
14nissan altima 2.5 2008 4 auto(av)f 23 31 r midsize 144
15nissan altima 2.4 1999 4 manual(m5)f 21 29 r compact 142
16chevrolet c1500 suburban 2wd5.3 2008 8 auto(l4) r 14 20 r suv 19
17chevrolet c1500 suburban 2wd5.3 2008 8 auto(l4) r 14 20 r suv 21
18chevrolet c1500 suburban 2wd5.7 1999 8 auto(l4) r 13 17 r suv 22
19toyota camry 2.4 2008 4 manual(m5)f 21 31 r midsize 182
20toyota camry 2.4 2008 4 auto(l5)f 21 31 r midsize 183
21toyota camry 2.2 1999 4 manual(m5)f 21 29 r midsize 180
22toyota camry solara2.4 2008 4 manual(m5) f 21 31 r compact 189
23toyota camry solara2.4 2008 4 auto(s5) f 22 31 r compact 190
24toyota camry solara2.2 1999 4 manual(m5) f 21 29 r compact 188
25dodge caravan 2wd2.4 1999 4 auto(l3) f 18 24 r minivan 38
26dodge caravan 2wd3 1999 6 auto(l4) f 17 24 r minivan 39
27dodge caravan 2wd3.3 2008 6 auto(l4) f 17 24 r minivan 42
28honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact106
29honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact107
30honda civic 1.8 2008 4 manual(m5)f 26 34 r subcompact105
.......................................
84ford mustang 3.8 1999 6 manual(m5)r 18 26 r subcompact91
85ford mustang 4 2008 6 manual(m5)r 17 26 r subcompact93
86ford mustang 3.8 1999 6 auto(l4) r 18 25 r subcompact92
87lincoln navigator 2wd5.4 2008 8 auto(l6) r 12 18 r suv 137
88lincoln navigator 2wd5.4 1999 8 auto(l4) r 11 17 r suv 135
89lincoln navigator 2wd5.4 1999 8 auto(l4) r 11 16 p suv 136
90volkswagennew beetle1.9 1999 4 manual(m5)f 35 44 d subcompact222
91volkswagennew beetle1.9 1999 4 auto(l4) f 29 41 d subcompact223
92volkswagennew beetle2 1999 4 manual(m5)f 21 29 r subcompact224
93volkswagenpassat 1.8 1999 4 manual(m5)f 21 29 p midsize 228
94volkswagenpassat 1.8 1999 4 auto(l5) f 18 29 p midsize 229
95volkswagenpassat 2 2008 4 manual(m6)f 21 29 p midsize 231
96nissan pathfinder 4wd4 2008 6 auto(l5) 4 14 20 p suv 153
97nissan pathfinder 4wd5.6 2008 8 auto(s5) 4 12 18 p suv 154
98nissan pathfinder 4wd3.3 1999 6 auto(l4) 4 14 17 r suv 151
99dodge ram 1500 pickup 4wd4.7 2008 8 auto(l5) 4 13 17 r pickup 67
100dodge ram 1500 pickup 4wd4.7 2008 8 auto(l5) 4 13 17 r pickup 68
101dodge ram 1500 pickup 4wd5.7 2008 8 auto(l5) 4 13 17 r pickup 73
102land rover range rover4.2 2008 8 auto(s6) 4 12 18 r suv 132
103land rover range rover4.4 2008 8 auto(s6) 4 12 18 r suv 133
104land rover range rover4 1999 8 auto(l4) 4 11 15 p suv 131
105hyundai sonata 2.4 2008 4 manual(m5)f 21 31 r midsize 112
106hyundai sonata 2.4 2008 4 auto(l4)f 21 30 r midsize 111
107hyundai sonata 3.3 2008 6 auto(l5)f 19 28 r midsize 115
108hyundai tiburon 2 1999 4 manual(m5)f 19 29 r subcompact117
109hyundai tiburon 2 2008 4 manual(m5)f 20 28 r subcompact118
110hyundai tiburon 2 2008 4 auto(l4) f 20 27 r subcompact119
111toyota toyota tacoma 4wd2.7 2008 4 manual(m5) 4 17 22 r pickup 203
112toyota toyota tacoma 4wd2.7 1999 4 manual(m5) 4 15 20 r pickup 201
113toyota toyota tacoma 4wd2.7 1999 4 auto(l4) 4 16 20 r pickup 202
In [64]:
mpg %>% group_by(model) %>% do((.) %>% arrange(desc(hwy)) %>% head(3)) # do((.))  do안에서 사용하려면 (.) 로 묶어줘야된다.
manufacturermodeldisplyearcyltransdrvctyhwyflclassname
1toyota 4runner 4wd2.7 1999 4 manual(m5) 4 15 20 r suv 174
2toyota 4runner 4wd2.7 1999 4 auto(l4) 4 16 20 r suv 175
3toyota 4runner 4wd4 2008 6 auto(l5) 4 16 20 r suv 178
4audi a4 2 2008 4 manual(m6)f 20 31 p compact 3
5audi a4 2 2008 4 auto(av)f 21 30 p compact 4
6audi a4 1.8 1999 4 auto(l5)f 18 29 p compact 1
7audi a4 quattro2 2008 4 manual(m6)4 20 28 p compact 10
8audi a4 quattro2 2008 4 auto(s6) 4 19 27 p compact 11
9audi a4 quattro1.8 1999 4 manual(m5)4 18 26 p compact 8
10audi a6 quattro3.1 2008 6 auto(s6) 4 17 25 p midsize 17
11audi a6 quattro2.8 1999 6 auto(l5) 4 15 24 p midsize 16
12audi a6 quattro4.2 2008 8 auto(s6) 4 16 23 p midsize 18
13nissan altima 2.5 2008 4 manual(m6)f 23 32 r midsize 145
14nissan altima 2.5 2008 4 auto(av)f 23 31 r midsize 144
15nissan altima 2.4 1999 4 manual(m5)f 21 29 r compact 142
16chevrolet c1500 suburban 2wd5.3 2008 8 auto(l4) r 14 20 r suv 19
17chevrolet c1500 suburban 2wd5.3 2008 8 auto(l4) r 14 20 r suv 21
18chevrolet c1500 suburban 2wd5.7 1999 8 auto(l4) r 13 17 r suv 22
19toyota camry 2.4 2008 4 manual(m5)f 21 31 r midsize 182
20toyota camry 2.4 2008 4 auto(l5)f 21 31 r midsize 183
21toyota camry 2.2 1999 4 manual(m5)f 21 29 r midsize 180
22toyota camry solara2.4 2008 4 manual(m5) f 21 31 r compact 189
23toyota camry solara2.4 2008 4 auto(s5) f 22 31 r compact 190
24toyota camry solara2.2 1999 4 manual(m5) f 21 29 r compact 188
25dodge caravan 2wd2.4 1999 4 auto(l3) f 18 24 r minivan 38
26dodge caravan 2wd3 1999 6 auto(l4) f 17 24 r minivan 39
27dodge caravan 2wd3.3 2008 6 auto(l4) f 17 24 r minivan 42
28honda civic 1.8 2008 4 auto(l5) f 25 36 r subcompact106
29honda civic 1.8 2008 4 auto(l5) f 24 36 c subcompact107
30honda civic 1.8 2008 4 manual(m5)f 26 34 r subcompact105
.......................................
84ford mustang 3.8 1999 6 manual(m5)r 18 26 r subcompact91
85ford mustang 4 2008 6 manual(m5)r 17 26 r subcompact93
86ford mustang 3.8 1999 6 auto(l4) r 18 25 r subcompact92
87lincoln navigator 2wd5.4 2008 8 auto(l6) r 12 18 r suv 137
88lincoln navigator 2wd5.4 1999 8 auto(l4) r 11 17 r suv 135
89lincoln navigator 2wd5.4 1999 8 auto(l4) r 11 16 p suv 136
90volkswagennew beetle1.9 1999 4 manual(m5)f 35 44 d subcompact222
91volkswagennew beetle1.9 1999 4 auto(l4) f 29 41 d subcompact223
92volkswagennew beetle2 1999 4 manual(m5)f 21 29 r subcompact224
93volkswagenpassat 1.8 1999 4 manual(m5)f 21 29 p midsize 228
94volkswagenpassat 1.8 1999 4 auto(l5) f 18 29 p midsize 229
95volkswagenpassat 2 2008 4 manual(m6)f 21 29 p midsize 231
96nissan pathfinder 4wd4 2008 6 auto(l5) 4 14 20 p suv 153
97nissan pathfinder 4wd5.6 2008 8 auto(s5) 4 12 18 p suv 154
98nissan pathfinder 4wd3.3 1999 6 auto(l4) 4 14 17 r suv 151
99dodge ram 1500 pickup 4wd4.7 2008 8 auto(l5) 4 13 17 r pickup 67
100dodge ram 1500 pickup 4wd4.7 2008 8 auto(l5) 4 13 17 r pickup 68
101dodge ram 1500 pickup 4wd5.7 2008 8 auto(l5) 4 13 17 r pickup 73
102land rover range rover4.2 2008 8 auto(s6) 4 12 18 r suv 132
103land rover range rover4.4 2008 8 auto(s6) 4 12 18 r suv 133
104land rover range rover4 1999 8 auto(l4) 4 11 15 p suv 131
105hyundai sonata 2.4 2008 4 manual(m5)f 21 31 r midsize 112
106hyundai sonata 2.4 2008 4 auto(l4)f 21 30 r midsize 111
107hyundai sonata 3.3 2008 6 auto(l5)f 19 28 r midsize 115
108hyundai tiburon 2 1999 4 manual(m5)f 19 29 r subcompact117
109hyundai tiburon 2 2008 4 manual(m5)f 20 28 r subcompact118
110hyundai tiburon 2 2008 4 auto(l4) f 20 27 r subcompact119
111toyota toyota tacoma 4wd2.7 2008 4 manual(m5) 4 17 22 r pickup 203
112toyota toyota tacoma 4wd2.7 1999 4 manual(m5) 4 15 20 r pickup 201
113toyota toyota tacoma 4wd2.7 1999 4 auto(l4) 4 16 20 r pickup 202

Tips

  • rbind, cbind 보다 빠른 함수 : bind_rows,bind_cols 사용법은 동일하다.

filtering 후 차이를 알고 싶을때

  • setdiff() 함수 활용
In [65]:
flights <- flights %>% mutate(name= row_number())
aa <- filter(flights, carrier == 'AA') 
july <- filter(flights, month == 7)
i = setdiff(aa$name,july$name)
In [66]:
head(flights[i,])
  year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
1 2013     1   1      542            540         2      923            850
2 2013     1   1      558            600        -2      753            745
3 2013     1   1      559            600        -1      941            910
4 2013     1   1      606            610        -4      858            910
5 2013     1   1      623            610        13      920            915
6 2013     1   1      628            630        -2     1137           1140
  arr_delay carrier flight tailnum origin dest air_time distance hour minute
1        33      AA   1141  N619AA    JFK  MIA      160     1089    5     40
2         8      AA    301  N3ALAA    LGA  ORD      138      733    6      0
3        31      AA    707  N3DUAA    LGA  DFW      257     1389    6      0
4       -12      AA   1895  N633AA    EWR  MIA      152     1085    6     10
5         5      AA   1837  N3EMAA    LGA  MIA      153     1096    6     10
6        -3      AA    413  N3BAAA    JFK  SJU      192     1598    6     30
            time_hour name
1 2013-01-01 05:00:00    3
2 2013-01-01 06:00:00   10
3 2013-01-01 06:00:00   15
4 2013-01-01 06:00:00   23
5 2013-01-01 06:00:00   32
6 2013-01-01 06:00:00   37

교집합을 알고 싶다면 intersect()를 사용.