Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Characteristics of tidy data:
Happy families are all alike; every unhappy family is unhappy in its own way.
Leo Tolstoy
Characteristics of tidy data:
Characteristics of untidy data:
!@#$%^&*()
What makes this data not tidy?
WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
What makes this data not tidy?
[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)
Is each of the following a dataset or a summary table?
## # A tibble: 87 x 3## name height mass## <chr> <int> <dbl>## 1 Luke Skywalker 172 77## 2 C-3PO 167 75## 3 R2-D2 96 32## 4 Darth Vader 202 136## 5 Leia Organa 150 49## 6 Owen Lars 178 120## 7 Beru Whitesun lars 165 75## 8 R5-D4 97 32## 9 Biggs Darklighter 183 84## 10 Obi-Wan Kenobi 182 77## # … with 77 more rows## # A tibble: 3 x 2## gender avg_height## <chr> <dbl>## 1 feminine 165.## 2 masculine 177.## 3 <NA> 181.starwars %>% select(name, height, mass)
starwars %>% group_by(gender) %>% summarize( avg_height = mean(height, na.rm = TRUE) %>% round(2) )... based on the concepts of functions as verbs that manipulate data frames

select: pick columns by namearrange: reorder rowsslice: pick rows using index(es)filter: pick rows matching criteriadistinct: filter for unique rowsmutate: add new variablessummarise: reduce variables to valuesgroup_by: for grouped operationsFirst argument is always a data frame
Subsequent arguments say what to do with that data frame
Always return a data frame
Data from two hotels: one resort and one city hotel
Observations: Each row represents a hotel booking
Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled (Antonia et al., 2019)
Featured in TidyTuesday!
hotels <- read_csv("data/hotels.csv")names(hotels)
## [1] "hotel" "is_canceled" ## [3] "lead_time" "arrival_date_year" ## [5] "arrival_date_month" "arrival_date_week_number" ## [7] "arrival_date_day_of_month" "stays_in_weekend_nights" ## [9] "stays_in_week_nights" "adults" ## [11] "children" "babies" ## [13] "meal" "country" ## [15] "market_segment" "distribution_channel" ## [17] "is_repeated_guest" "previous_cancellations" ## [19] "previous_bookings_not_canceled" "reserved_room_type" ## [21] "assigned_room_type" "booking_changes" ## [23] "deposit_type" "agent" ## [25] "company" "days_in_waiting_list" ## [27] "customer_type" "adr" ## [29] "required_car_parking_spaces" "total_of_special_requests" ## [31] "reservation_status" "reservation_status_date"glimpse(hotels)
## Rows: 119,390## Columns: 32## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resor…## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…## $ arrival_date_month <chr> "July", "July", "July", "July", "July"…## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpora…## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpora…## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C"…## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C"…## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposi…## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", …## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ customer_type <chr> "Transient", "Transient", "Transient",…## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out",…## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…View only the lead_time (number of days between booking and arrival date):
View only the lead_time (number of days between booking and arrival date):
hotels %>% select(lead_time)
## # A tibble: 119,390 x 1## lead_time## <dbl>## 1 342## 2 737## 3 7## 4 13## 5 14## 6 14## 7 0## 8 9## 9 85## 10 75## # … with 119,380 more rowsView only the lead_time (number of days between booking and arrival date):
hotels %>% select(lead_time)
## # A tibble: 119,390 x 1## lead_time## <dbl>## 1 342## 2 737## 3 7## 4 13## 5 14## 6 14## 7 0## 8 9## 9 85## 10 75## # … with 119,380 more rowsselect() function.lead_timeView only the hotel type and lead_time:
hotels %>% select(hotel, lead_time)
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 342## 2 Resort Hotel 737## 3 Resort Hotel 7## 4 Resort Hotel 13## 5 Resort Hotel 14## 6 Resort Hotel 14## 7 Resort Hotel 0## 8 Resort Hotel 9## 9 Resort Hotel 85## 10 Resort Hotel 75## # … with 119,380 more rowsView only the hotel type and lead_time:
hotels %>% select(hotel, lead_time)
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 342## 2 Resort Hotel 737## 3 Resort Hotel 7## 4 Resort Hotel 13## 5 Resort Hotel 14## 6 Resort Hotel 14## 7 Resort Hotel 0## 8 Resort Hotel 9## 9 Resort Hotel 85## 10 Resort Hotel 75## # … with 119,380 more rowsWhat if we wanted to select these columns, and then arrange the data in descending order of lead time?
Select:
hotels %>% select(hotel, lead_time)
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 342## 2 Resort Hotel 737## 3 Resort Hotel 7## 4 Resort Hotel 13## 5 Resort Hotel 14## 6 Resort Hotel 14## 7 Resort Hotel 0## 8 Resort Hotel 9## 9 Resort Hotel 85## 10 Resort Hotel 75## # … with 119,380 more rowsSelect:
hotels %>% select(hotel, lead_time)
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 342## 2 Resort Hotel 737## 3 Resort Hotel 7## 4 Resort Hotel 13## 5 Resort Hotel 14## 6 Resort Hotel 14## 7 Resort Hotel 0## 8 Resort Hotel 9## 9 Resort Hotel 85## 10 Resort Hotel 75## # … with 119,380 more rowsSelect, then arrange:
hotels %>% select(hotel, lead_time) %>% arrange(desc(lead_time))
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 737## 2 Resort Hotel 709## 3 City Hotel 629## 4 City Hotel 629## 5 City Hotel 629## 6 City Hotel 629## 7 City Hotel 629## 8 City Hotel 629## 9 City Hotel 629## 10 City Hotel 629## # … with 119,380 more rowsIn programming, a pipe is a technique for passing information from one process to another.
In programming, a pipe is a technique for passing information from one process to another.
hotels, and pass it to the select() function,hotels %>% select(hotel, lead_time) %>% arrange(desc(lead_time))
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 737## 2 Resort Hotel 709## 3 City Hotel 629## 4 City Hotel 629## 5 City Hotel 629## 6 City Hotel 629## 7 City Hotel 629## 8 City Hotel 629## 9 City Hotel 629## 10 City Hotel 629## # … with 119,380 more rowsIn programming, a pipe is a technique for passing information from one process to another.
hotels, and pass it to the select() function,hotel and lead_time,hotels %>% select(hotel, lead_time) %>% arrange(desc(lead_time))
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 737## 2 Resort Hotel 709## 3 City Hotel 629## 4 City Hotel 629## 5 City Hotel 629## 6 City Hotel 629## 7 City Hotel 629## 8 City Hotel 629## 9 City Hotel 629## 10 City Hotel 629## # … with 119,380 more rowsIn programming, a pipe is a technique for passing information from one process to another.
hotels, and pass it to the select() function,hotel and lead_time,lead_time in descending order.hotels %>% select(hotel, lead_time) %>% arrange(desc(lead_time))
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 737## 2 Resort Hotel 709## 3 City Hotel 629## 4 City Hotel 629## 5 City Hotel 629## 6 City Hotel 629## 7 City Hotel 629## 8 City Hotel 629## 9 City Hotel 629## 10 City Hotel 629## # … with 119,380 more rowsThe pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.
The pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.
Any guesses as to why the package is called magrittr?
The pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.
Any guesses as to why the package is called magrittr?


park(drive(start_car(find("keys")), to = "work"))park(drive(start_car(find("keys")), to = "work"))
find("keys") %>% start_car() %>% drive(to = "work") %>% park()Use the dot to
hotels %>% filter(hotel == "Resort Hotel") %>% lm(adr ~ lead_time, data = .)
## ## Call:## lm(formula = adr ~ lead_time, data = .)## ## Coefficients:## (Intercept) lead_time ## 93.16876 0.01925select to keep variableshotels %>% select(hotel, lead_time)
## # A tibble: 119,390 x 2## hotel lead_time## <chr> <dbl>## 1 Resort Hotel 342## 2 Resort Hotel 737## 3 Resort Hotel 7## 4 Resort Hotel 13## 5 Resort Hotel 14## 6 Resort Hotel 14## 7 Resort Hotel 0## 8 Resort Hotel 9## 9 Resort Hotel 85## 10 Resort Hotel 75## # … with 119,380 more rowsselect to exclude variableshotels %>% select(-agent)
## # A tibble: 119,390 x 31## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…## <chr> <dbl> <dbl> <dbl> <chr> ## 1 Reso… 0 342 2015 July ## 2 Reso… 0 737 2015 July ## 3 Reso… 0 7 2015 July ## 4 Reso… 0 13 2015 July ## 5 Reso… 0 14 2015 July ## 6 Reso… 0 14 2015 July ## 7 Reso… 0 0 2015 July ## 8 Reso… 0 9 2015 July ## 9 Reso… 1 85 2015 July ## 10 Reso… 1 75 2015 July ## # … with 119,380 more rows, and 26 more variables:## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,## # deposit_type <chr>, company <chr>, days_in_waiting_list <dbl>,## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,## # total_of_special_requests <dbl>, reservation_status <chr>,## # reservation_status_date <date>select a range of variableshotels %>% select(hotel:arrival_date_month)
## # A tibble: 119,390 x 5## hotel is_canceled lead_time arrival_date_year arrival_date_month## <chr> <dbl> <dbl> <dbl> <chr> ## 1 Resort Hotel 0 342 2015 July ## 2 Resort Hotel 0 737 2015 July ## 3 Resort Hotel 0 7 2015 July ## 4 Resort Hotel 0 13 2015 July ## 5 Resort Hotel 0 14 2015 July ## 6 Resort Hotel 0 14 2015 July ## 7 Resort Hotel 0 0 2015 July ## 8 Resort Hotel 0 9 2015 July ## 9 Resort Hotel 1 85 2015 July ## 10 Resort Hotel 1 75 2015 July ## # … with 119,380 more rowsarrange in ascending / descending orderhotels %>% select(adults, children, babies) %>% arrange(babies)
## # A tibble: 119,390 x 3## adults children babies## <dbl> <dbl> <dbl>## 1 2 0 0## 2 2 0 0## 3 1 0 0## 4 1 0 0## 5 2 0 0## 6 2 0 0## 7 2 0 0## 8 2 0 0## 9 2 0 0## 10 2 0 0## # … with 119,380 more rowshotels %>% select(adults, children, babies) %>% arrange(desc(babies))
## # A tibble: 119,390 x 3## adults children babies## <dbl> <dbl> <dbl>## 1 2 0 10## 2 1 0 9## 3 2 0 2## 4 2 0 2## 5 2 0 2## 6 2 0 2## 7 2 0 2## 8 2 0 2## 9 2 0 2## 10 2 0 2## # … with 119,380 more rowsslice for certain row numbers# first fivehotels %>% slice(1:5)
## # A tibble: 5 x 32## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…## <chr> <dbl> <dbl> <dbl> <chr> <dbl>## 1 Reso… 0 342 2015 July 27## 2 Reso… 0 737 2015 July 27## 3 Reso… 0 7 2015 July 27## 4 Reso… 0 13 2015 July 27## 5 Reso… 0 14 2015 July 27## # … with 26 more variables: arrival_date_day_of_month <dbl>,## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,## # total_of_special_requests <dbl>, reservation_status <chr>,## # reservation_status_date <date>In R, you can use the # (hashtag or pound sign, depending on your age
😜) for adding comments to your code.
Any text following # will be printed as is, and won't be run as R code.
This is useful for leaving comments in your code and for temporarily disabling
certain lines of code while debugging.
hotels %>% # slice the first five rows # this line is a comment #select(hotel) %>% # this one doesn't run slice(1:5) # this line runs
## # A tibble: 5 x 32## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…## <chr> <dbl> <dbl> <dbl> <chr> <dbl>## 1 Reso… 0 342 2015 July 27## 2 Reso… 0 737 2015 July 27## 3 Reso… 0 7 2015 July 27## 4 Reso… 0 13 2015 July 27## 5 Reso… 0 14 2015 July 27## # … with 26 more variables: arrival_date_day_of_month <dbl>,## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,## # total_of_special_requests <dbl>, reservation_status <chr>,## # reservation_status_date <date>slice for certain row numbers# last fivelast_row <- nrow(hotels) # nrow() gives the number of rows in a data framehotels %>% slice((last_row - 4):last_row)
## # A tibble: 5 x 32## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…## <chr> <dbl> <dbl> <dbl> <chr> <dbl>## 1 City… 0 23 2017 August 35## 2 City… 0 102 2017 August 35## 3 City… 0 34 2017 August 35## 4 City… 0 109 2017 August 35## 5 City… 0 205 2017 August 35## # … with 26 more variables: arrival_date_day_of_month <dbl>,## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,## # total_of_special_requests <dbl>, reservation_status <chr>,## # reservation_status_date <date>filter to select a subset of rows# bookings in City Hotelshotels %>% filter(hotel == "City Hotel")
## # A tibble: 79,330 x 32## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…## <chr> <dbl> <dbl> <dbl> <chr> ## 1 City… 0 6 2015 July ## 2 City… 1 88 2015 July ## 3 City… 1 65 2015 July ## 4 City… 1 92 2015 July ## 5 City… 1 100 2015 July ## 6 City… 1 79 2015 July ## 7 City… 0 3 2015 July ## 8 City… 1 63 2015 July ## 9 City… 1 62 2015 July ## 10 City… 1 62 2015 July ## # … with 79,320 more rows, and 27 more variables:## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,## # total_of_special_requests <dbl>, reservation_status <chr>,## # reservation_status_date <date>filter for many conditions at oncehotels %>% filter( adults == 0, children >= 1 ) %>% select(adults, babies, children)
## # A tibble: 223 x 3## adults babies children## <dbl> <dbl> <dbl>## 1 0 0 3## 2 0 0 2## 3 0 0 2## 4 0 0 2## 5 0 0 2## 6 0 0 3## 7 0 1 2## 8 0 0 2## 9 0 0 2## 10 0 0 2## # … with 213 more rowsfilter for more complex conditions# bookings with no adults and some children or babies in the roomhotels %>% filter( adults == 0, children >= 1 | babies >= 1 # | means or ) %>% select(adults, babies, children)
## # A tibble: 223 x 3## adults babies children## <dbl> <dbl> <dbl>## 1 0 0 3## 2 0 0 2## 3 0 0 2## 4 0 0 2## 5 0 0 2## 6 0 0 3## 7 0 1 2## 8 0 0 2## 9 0 0 2## 10 0 0 2## # … with 213 more rows| operator | definition | operator | definition | |
|---|---|---|---|---|
< |
less than | x | y |
x OR y |
|
<= |
less than or equal to | is.na(x) |
test if x is NA |
|
> |
greater than | !is.na(x) |
test if x is not NA |
|
>= |
greater than or equal to | x %in% y |
test if x is in y |
|
== |
exactly equal to | !(x %in% y) |
test if x is not in y |
|
!= |
not equal to | !x |
not x |
|
x & y |
x AND y |
distinct to filter for unique rows... and arrange to order alphabetically
hotels %>% distinct(market_segment) %>% arrange(market_segment)
## # A tibble: 8 x 1## market_segment## <chr> ## 1 Aviation ## 2 Complementary ## 3 Corporate ## 4 Direct ## 5 Groups ## 6 Offline TA/TO ## 7 Online TA ## 8 Undefinedhotels %>% distinct(hotel, market_segment) %>% arrange(hotel, market_segment)
## # A tibble: 14 x 2## hotel market_segment## <chr> <chr> ## 1 City Hotel Aviation ## 2 City Hotel Complementary ## 3 City Hotel Corporate ## 4 City Hotel Direct ## 5 City Hotel Groups ## 6 City Hotel Offline TA/TO ## 7 City Hotel Online TA ## 8 City Hotel Undefined ## 9 Resort Hotel Complementary ## 10 Resort Hotel Corporate ## 11 Resort Hotel Direct ## 12 Resort Hotel Groups ## 13 Resort Hotel Offline TA/TO ## 14 Resort Hotel Online TAcount to create frequency tables# alphabetical order by defaulthotels %>% count(market_segment)
## # A tibble: 8 x 2## market_segment n## <chr> <int>## 1 Aviation 237## 2 Complementary 743## 3 Corporate 5295## 4 Direct 12606## 5 Groups 19811## 6 Offline TA/TO 24219## 7 Online TA 56477## 8 Undefined 2count to create frequency tables# alphabetical order by defaulthotels %>% count(market_segment)
## # A tibble: 8 x 2## market_segment n## <chr> <int>## 1 Aviation 237## 2 Complementary 743## 3 Corporate 5295## 4 Direct 12606## 5 Groups 19811## 6 Offline TA/TO 24219## 7 Online TA 56477## 8 Undefined 2# descending frequency orderhotels %>% count(market_segment, sort = TRUE)
## # A tibble: 8 x 2## market_segment n## <chr> <int>## 1 Online TA 56477## 2 Offline TA/TO 24219## 3 Groups 19811## 4 Direct 12606## 5 Corporate 5295## 6 Complementary 743## 7 Aviation 237## 8 Undefined 2count and arrange# ascending frequency orderhotels %>% count(market_segment) %>% arrange(n)
## # A tibble: 8 x 2## market_segment n## <chr> <int>## 1 Undefined 2## 2 Aviation 237## 3 Complementary 743## 4 Corporate 5295## 5 Direct 12606## 6 Groups 19811## 7 Offline TA/TO 24219## 8 Online TA 56477# descending frequency order# just like adding sort = TRUEhotels %>% count(market_segment) %>% arrange(desc(n))
## # A tibble: 8 x 2## market_segment n## <chr> <int>## 1 Online TA 56477## 2 Offline TA/TO 24219## 3 Groups 19811## 4 Direct 12606## 5 Corporate 5295## 6 Complementary 743## 7 Aviation 237## 8 Undefined 2count for multiple variableshotels %>% count(hotel, market_segment)
## # A tibble: 14 x 3## hotel market_segment n## <chr> <chr> <int>## 1 City Hotel Aviation 237## 2 City Hotel Complementary 542## 3 City Hotel Corporate 2986## 4 City Hotel Direct 6093## 5 City Hotel Groups 13975## 6 City Hotel Offline TA/TO 16747## 7 City Hotel Online TA 38748## 8 City Hotel Undefined 2## 9 Resort Hotel Complementary 201## 10 Resort Hotel Corporate 2309## 11 Resort Hotel Direct 6513## 12 Resort Hotel Groups 5836## 13 Resort Hotel Offline TA/TO 7472## 14 Resort Hotel Online TA 17729count# hotel type firsthotels %>% count(hotel, market_segment)
## # A tibble: 14 x 3## hotel market_segment n## <chr> <chr> <int>## 1 City Hotel Aviation 237## 2 City Hotel Complementary 542## 3 City Hotel Corporate 2986## 4 City Hotel Direct 6093## 5 City Hotel Groups 13975## 6 City Hotel Offline TA/TO 16747## 7 City Hotel Online TA 38748## 8 City Hotel Undefined 2## 9 Resort Hotel Complementary 201## 10 Resort Hotel Corporate 2309## 11 Resort Hotel Direct 6513## 12 Resort Hotel Groups 5836## 13 Resort Hotel Offline TA/TO 7472## 14 Resort Hotel Online TA 17729# market segment firsthotels %>% count(market_segment, hotel)
## # A tibble: 14 x 3## market_segment hotel n## <chr> <chr> <int>## 1 Aviation City Hotel 237## 2 Complementary City Hotel 542## 3 Complementary Resort Hotel 201## 4 Corporate City Hotel 2986## 5 Corporate Resort Hotel 2309## 6 Direct City Hotel 6093## 7 Direct Resort Hotel 6513## 8 Groups City Hotel 13975## 9 Groups Resort Hotel 5836## 10 Offline TA/TO City Hotel 16747## 11 Offline TA/TO Resort Hotel 7472## 12 Online TA City Hotel 38748## 13 Online TA Resort Hotel 17729## 14 Undefined City Hotel 2mutate to add a new variablehotels %>% mutate(little_ones = children + babies) %>% select(children, babies, little_ones) %>% arrange(desc(little_ones))
## # A tibble: 119,390 x 3## children babies little_ones## <dbl> <dbl> <dbl>## 1 10 0 10## 2 0 10 10## 3 0 9 9## 4 2 1 3## 5 2 1 3## 6 2 1 3## 7 3 0 3## 8 2 1 3## 9 2 1 3## 10 3 0 3## # … with 119,380 more rows# Resort Hotelhotels %>% mutate(little_ones = children + babies) %>% filter( little_ones >= 1, hotel == "Resort Hotel" ) %>% select(hotel, little_ones)
## # A tibble: 3,929 x 2## hotel little_ones## <chr> <dbl>## 1 Resort Hotel 1## 2 Resort Hotel 2## 3 Resort Hotel 2## 4 Resort Hotel 2## 5 Resort Hotel 1## 6 Resort Hotel 1## 7 Resort Hotel 2## 8 Resort Hotel 2## 9 Resort Hotel 1## 10 Resort Hotel 1## # … with 3,919 more rows# City Hotelhotels %>% mutate(little_ones = children + babies) %>% filter( little_ones > 1, hotel == "City Hotel" ) %>% select(hotel, little_ones)
## # A tibble: 2,140 x 2## hotel little_ones## <chr> <dbl>## 1 City Hotel 2## 2 City Hotel 2## 3 City Hotel 2## 4 City Hotel 2## 5 City Hotel 2## 6 City Hotel 2## 7 City Hotel 2## 8 City Hotel 2## 9 City Hotel 2## 10 City Hotel 3## # … with 2,130 more rowsWhat is happening in the following chunk?
hotels %>% mutate(little_ones = children + babies) %>% count(hotel, little_ones) %>% mutate(prop = n / sum(n))
## # A tibble: 12 x 4## hotel little_ones n prop## <chr> <dbl> <int> <dbl>## 1 City Hotel 0 73923 0.619 ## 2 City Hotel 1 3263 0.0273 ## 3 City Hotel 2 2056 0.0172 ## 4 City Hotel 3 82 0.000687 ## 5 City Hotel 9 1 0.00000838## 6 City Hotel 10 1 0.00000838## 7 City Hotel NA 4 0.0000335 ## 8 Resort Hotel 0 36131 0.303 ## 9 Resort Hotel 1 2183 0.0183 ## 10 Resort Hotel 2 1716 0.0144 ## 11 Resort Hotel 3 29 0.000243 ## 12 Resort Hotel 10 1 0.00000838summarise for summary stats# mean average daily rate for all bookingshotels %>% summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1## mean_adr## <dbl>## 1 102.summarise for summary stats# mean average daily rate for all bookingshotels %>% summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1## mean_adr## <dbl>## 1 102.summarise() changes the data frame entirely, it collapses rows down to a single
summary statistics, and removes all columns that are irrelevant to the calculation.
summarise() also lets you get away with being sloppy and not naming your new
column, but that's not recommended!
❌
hotels %>% summarise(mean(adr))
## # A tibble: 1 x 1## `mean(adr)`## <dbl>## 1 102.✅
hotels %>% summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1## mean_adr## <dbl>## 1 102.group_by for grouped operations# mean average daily rate for all booking at city and resort hotelshotels %>% group_by(hotel) %>% summarise(mean_adr = mean(adr))
## # A tibble: 2 x 2## hotel mean_adr## <chr> <dbl>## 1 City Hotel 105. ## 2 Resort Hotel 95.0The following two give the same result, so count is simply short for group_by then determine frequencies
hotels %>% group_by(hotel) %>% summarise(n = n())
## # A tibble: 2 x 2## hotel n## <chr> <int>## 1 City Hotel 79330## 2 Resort Hotel 40060hotels %>% count(hotel)
## # A tibble: 2 x 2## hotel n## <chr> <int>## 1 City Hotel 79330## 2 Resort Hotel 40060summarise can be used for multiple summary statistics as well
hotels %>% summarise( min_adr = min(adr), mean_adr = mean(adr), median_adr = median(adr), max_adr = max(adr) )
## # A tibble: 1 x 4## min_adr mean_adr median_adr max_adr## <dbl> <dbl> <dbl> <dbl>## 1 -6.38 102. 94.6 5400Keyboard shortcuts
| ↑, ←, Pg Up, k | Go to previous slide |
| ↓, →, Pg Dn, Space, j | Go to next slide |
| Home | Go to first slide |
| End | Go to last slide |
| Number + Return | Go to specific slide |
| b / m / f | Toggle blackout / mirrored / fullscreen mode |
| c | Clone slideshow |
| p | Toggle presenter mode |
| t | Restart the presentation timer |
| ?, h | Toggle this help |
| Esc | Back to slideshow |