+ - 0:00:00
Notes for current slide
Notes for next slide

Tidy data and data wrangling

Prof. Maria Tackett

1

Tidy data

3

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

4

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.
4

Tidy data

Happy families are all alike; every unhappy family is unhappy in its own way.

Leo Tolstoy

Characteristics of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Characteristics of untidy data:

!@#$%^&*()

4

What makes this data not tidy?

WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)

WW2 Army Air Force combat aircraft from [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)

5

What makes this data not tidy?


[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)

[US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)

6

Summary tables

Is each of the following a dataset or a summary table?

## # A tibble: 87 x 3
## name height mass
## <chr> <int> <dbl>
## 1 Luke Skywalker 172 77
## 2 C-3PO 167 75
## 3 R2-D2 96 32
## 4 Darth Vader 202 136
## 5 Leia Organa 150 49
## 6 Owen Lars 178 120
## 7 Beru Whitesun lars 165 75
## 8 R5-D4 97 32
## 9 Biggs Darklighter 183 84
## 10 Obi-Wan Kenobi 182 77
## # … with 77 more rows
## # A tibble: 3 x 2
## gender avg_height
## <chr> <dbl>
## 1 feminine 165.
## 2 masculine 177.
## 3 <NA> 181.
7

Displaying data

starwars %>%
select(name, height, mass)
8

Displaying data

starwars %>%
select(name, height, mass)


Summarizing data

starwars %>%
group_by(gender) %>%
summarize(
avg_height = mean(height, na.rm = TRUE) %>% round(2)
)
8

Grammar of data wrangling

9

A grammar of data wrangling...

... based on the concepts of functions as verbs that manipulate data frames

  • select: pick columns by name
  • arrange: reorder rows
  • slice: pick rows using index(es)
  • filter: pick rows matching criteria
  • distinct: filter for unique rows
  • mutate: add new variables
  • summarise: reduce variables to values
  • group_by: for grouped operations
  • ... (many more)
10

Rules of dplyr functions

  • First argument is always a data frame

  • Subsequent arguments say what to do with that data frame

  • Always return a data frame

11

Data: Hotel bookings

  • Data from two hotels: one resort and one city hotel

  • Observations: Each row represents a hotel booking

  • Goal for original data collection: Development of prediction models to classify a hotel booking's likelihood to be cancelled (Antonia et al., 2019)

  • Featured in TidyTuesday!

hotels <- read_csv("data/hotels.csv")
12

First look: Variables

names(hotels)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
13

Second look: Overview

glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resor…
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75,…
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 20…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July"…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4,…
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2,…
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "B…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GB…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corpora…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corpora…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C"…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C"…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposi…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", …
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL"…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ customer_type <chr> "Transient", "Transient", "Transient",…
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3,…
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out",…
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 2…
14

Select a single column

View only the lead_time (number of days between booking and arrival date):

15

Select a single column

View only the lead_time (number of days between booking and arrival date):

hotels %>%
select(lead_time)
## # A tibble: 119,390 x 1
## lead_time
## <dbl>
## 1 342
## 2 737
## 3 7
## 4 13
## 5 14
## 6 14
## 7 0
## 8 9
## 9 85
## 10 75
## # … with 119,380 more rows
15

Select a single column

View only the lead_time (number of days between booking and arrival date):

hotels %>%
select(lead_time)
## # A tibble: 119,390 x 1
## lead_time
## <dbl>
## 1 342
## 2 737
## 3 7
## 4 13
## 5 14
## 6 14
## 7 0
## 8 9
## 9 85
## 10 75
## # … with 119,380 more rows
  • Start with a data frame
  • Pass it to the select() function.
  • Second argument is variable we want to select: lead_time
  • The result is a data frame with 119,300 and 1 column: --dplyr functions always expect a data frame and always yield a data frame.
15

Select multiple columns

View only the hotel type and lead_time:

16

Select multiple columns

View only the hotel type and lead_time:

hotels %>%
select(hotel, lead_time)
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 342
## 2 Resort Hotel 737
## 3 Resort Hotel 7
## 4 Resort Hotel 13
## 5 Resort Hotel 14
## 6 Resort Hotel 14
## 7 Resort Hotel 0
## 8 Resort Hotel 9
## 9 Resort Hotel 85
## 10 Resort Hotel 75
## # … with 119,380 more rows
16

Select multiple columns

View only the hotel type and lead_time:

hotels %>%
select(hotel, lead_time)
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 342
## 2 Resort Hotel 737
## 3 Resort Hotel 7
## 4 Resort Hotel 13
## 5 Resort Hotel 14
## 6 Resort Hotel 14
## 7 Resort Hotel 0
## 8 Resort Hotel 9
## 9 Resort Hotel 85
## 10 Resort Hotel 75
## # … with 119,380 more rows

What if we wanted to select these columns, and then arrange the data in descending order of lead time?

16

Data wrangling, step-by-step

Select:

hotels %>%
select(hotel, lead_time)
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 342
## 2 Resort Hotel 737
## 3 Resort Hotel 7
## 4 Resort Hotel 13
## 5 Resort Hotel 14
## 6 Resort Hotel 14
## 7 Resort Hotel 0
## 8 Resort Hotel 9
## 9 Resort Hotel 85
## 10 Resort Hotel 75
## # … with 119,380 more rows
17

Data wrangling, step-by-step

Select:

hotels %>%
select(hotel, lead_time)
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 342
## 2 Resort Hotel 737
## 3 Resort Hotel 7
## 4 Resort Hotel 13
## 5 Resort Hotel 14
## 6 Resort Hotel 14
## 7 Resort Hotel 0
## 8 Resort Hotel 9
## 9 Resort Hotel 85
## 10 Resort Hotel 75
## # … with 119,380 more rows

Select, then arrange:

hotels %>%
select(hotel, lead_time) %>%
arrange(desc(lead_time))
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 737
## 2 Resort Hotel 709
## 3 City Hotel 629
## 4 City Hotel 629
## 5 City Hotel 629
## 6 City Hotel 629
## 7 City Hotel 629
## 8 City Hotel 629
## 9 City Hotel 629
## 10 City Hotel 629
## # … with 119,380 more rows
17

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

19

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
hotels %>%
select(hotel, lead_time) %>%
arrange(desc(lead_time))
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 737
## 2 Resort Hotel 709
## 3 City Hotel 629
## 4 City Hotel 629
## 5 City Hotel 629
## 6 City Hotel 629
## 7 City Hotel 629
## 8 City Hotel 629
## 9 City Hotel 629
## 10 City Hotel 629
## # … with 119,380 more rows
19

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
  • then we select the variables hotel and lead_time,
hotels %>%
select(hotel, lead_time) %>%
arrange(desc(lead_time))
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 737
## 2 Resort Hotel 709
## 3 City Hotel 629
## 4 City Hotel 629
## 5 City Hotel 629
## 6 City Hotel 629
## 7 City Hotel 629
## 8 City Hotel 629
## 9 City Hotel 629
## 10 City Hotel 629
## # … with 119,380 more rows
20

What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

  • Start with the data frame hotels, and pass it to the select() function,
  • then we select the variables hotel and lead_time,
  • and then we arrange the data frame by lead_time in descending order.
hotels %>%
select(hotel, lead_time) %>%
arrange(desc(lead_time))
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 737
## 2 Resort Hotel 709
## 3 City Hotel 629
## 4 City Hotel 629
## 5 City Hotel 629
## 6 City Hotel 629
## 7 City Hotel 629
## 8 City Hotel 629
## 9 City Hotel 629
## 10 City Hotel 629
## # … with 119,380 more rows
21

Aside

The pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.

22

Aside

The pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.

Any guesses as to why the package is called magrittr?

22

Aside

The pipe operator is implemented in the package magrittr, though we don't need to load this package explicitly since tidyverse does this for us.

Any guesses as to why the package is called magrittr?

22

How does a pipe work?

  • You can think about the following sequence of actions - find keys, start car, drive to work, park.
23

How does a pipe work?

  • You can think about the following sequence of actions - find keys, start car, drive to work, park.
  • Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to = "work"))
23

How does a pipe work?

  • You can think about the following sequence of actions - find keys, start car, drive to work, park.
  • Expressed as a set of nested functions in R pseudocode this would look like:
park(drive(start_car(find("keys")), to = "work"))
  • Writing it out using pipes give it a more natural (and easier to read) structure:
    • Read the pipe as "and then"
find("keys") %>%
start_car() %>%
drive(to = "work") %>%
park()
23

What about other arguments?

Use the dot to

  • send results to a function argument other than first one or
  • use the previous result for multiple arguments
hotels %>%
filter(hotel == "Resort Hotel") %>%
lm(adr ~ lead_time, data = .)
##
## Call:
## lm(formula = adr ~ lead_time, data = .)
##
## Coefficients:
## (Intercept) lead_time
## 93.16876 0.01925
24

Working with a single data frame

25

You have a single data frame, and you want to process it and prepare it for anlaysis!

26

select to keep variables

hotels %>%
select(hotel, lead_time)
## # A tibble: 119,390 x 2
## hotel lead_time
## <chr> <dbl>
## 1 Resort Hotel 342
## 2 Resort Hotel 737
## 3 Resort Hotel 7
## 4 Resort Hotel 13
## 5 Resort Hotel 14
## 6 Resort Hotel 14
## 7 Resort Hotel 0
## 8 Resort Hotel 9
## 9 Resort Hotel 85
## 10 Resort Hotel 75
## # … with 119,380 more rows
27

select to exclude variables

hotels %>%
select(-agent)
## # A tibble: 119,390 x 31
## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Reso… 0 342 2015 July
## 2 Reso… 0 737 2015 July
## 3 Reso… 0 7 2015 July
## 4 Reso… 0 13 2015 July
## 5 Reso… 0 14 2015 July
## 6 Reso… 0 14 2015 July
## 7 Reso… 0 0 2015 July
## 8 Reso… 0 9 2015 July
## 9 Reso… 1 85 2015 July
## 10 Reso… 1 75 2015 July
## # … with 119,380 more rows, and 26 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
28

select a range of variables

hotels %>%
select(hotel:arrival_date_month)
## # A tibble: 119,390 x 5
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # … with 119,380 more rows
29

arrange in ascending / descending order

hotels %>%
select(adults, children, babies) %>%
arrange(babies)
## # A tibble: 119,390 x 3
## adults children babies
## <dbl> <dbl> <dbl>
## 1 2 0 0
## 2 2 0 0
## 3 1 0 0
## 4 1 0 0
## 5 2 0 0
## 6 2 0 0
## 7 2 0 0
## 8 2 0 0
## 9 2 0 0
## 10 2 0 0
## # … with 119,380 more rows
hotels %>%
select(adults, children, babies) %>%
arrange(desc(babies))
## # A tibble: 119,390 x 3
## adults children babies
## <dbl> <dbl> <dbl>
## 1 2 0 10
## 2 1 0 9
## 3 2 0 2
## 4 2 0 2
## 5 2 0 2
## 6 2 0 2
## 7 2 0 2
## 8 2 0 2
## 9 2 0 2
## 10 2 0 2
## # … with 119,380 more rows
30

slice for certain row numbers

# first five
hotels %>%
slice(1:5)
## # A tibble: 5 x 32
## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Reso… 0 342 2015 July 27
## 2 Reso… 0 737 2015 July 27
## 3 Reso… 0 7 2015 July 27
## 4 Reso… 0 13 2015 July 27
## 5 Reso… 0 14 2015 July 27
## # … with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
31

In R, you can use the # (hashtag or pound sign, depending on your age 😜) for adding comments to your code. Any text following # will be printed as is, and won't be run as R code. This is useful for leaving comments in your code and for temporarily disabling certain lines of code while debugging.

hotels %>%
# slice the first five rows # this line is a comment
#select(hotel) %>% # this one doesn't run
slice(1:5) # this line runs
## # A tibble: 5 x 32
## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 Reso… 0 342 2015 July 27
## 2 Reso… 0 737 2015 July 27
## 3 Reso… 0 7 2015 July 27
## 4 Reso… 0 13 2015 July 27
## 5 Reso… 0 14 2015 July 27
## # … with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
32

slice for certain row numbers

# last five
last_row <- nrow(hotels) # nrow() gives the number of rows in a data frame
hotels %>%
slice((last_row - 4):last_row)
## # A tibble: 5 x 32
## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo… arrival_date_we…
## <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 City… 0 23 2017 August 35
## 2 City… 0 102 2017 August 35
## 3 City… 0 34 2017 August 35
## 4 City… 0 109 2017 August 35
## 5 City… 0 205 2017 August 35
## # … with 26 more variables: arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
33

filter to select a subset of rows

# bookings in City Hotels
hotels %>%
filter(hotel == "City Hotel")
## # A tibble: 79,330 x 32
## hotel is_canceled lead_time arrival_date_ye… arrival_date_mo…
## <chr> <dbl> <dbl> <dbl> <chr>
## 1 City… 0 6 2015 July
## 2 City… 1 88 2015 July
## 3 City… 1 65 2015 July
## 4 City… 1 92 2015 July
## 5 City… 1 100 2015 July
## 6 City… 1 79 2015 July
## 7 City… 0 3 2015 July
## 8 City… 1 63 2015 July
## 9 City… 1 62 2015 July
## 10 City… 1 62 2015 July
## # … with 79,320 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>,
## # total_of_special_requests <dbl>, reservation_status <chr>,
## # reservation_status_date <date>
34

filter for many conditions at once

hotels %>%
filter(
adults == 0,
children >= 1
) %>%
select(adults, babies, children)
## # A tibble: 223 x 3
## adults babies children
## <dbl> <dbl> <dbl>
## 1 0 0 3
## 2 0 0 2
## 3 0 0 2
## 4 0 0 2
## 5 0 0 2
## 6 0 0 3
## 7 0 1 2
## 8 0 0 2
## 9 0 0 2
## 10 0 0 2
## # … with 213 more rows
35

filter for more complex conditions

# bookings with no adults and some children or babies in the room
hotels %>%
filter(
adults == 0,
children >= 1 | babies >= 1 # | means or
) %>%
select(adults, babies, children)
## # A tibble: 223 x 3
## adults babies children
## <dbl> <dbl> <dbl>
## 1 0 0 3
## 2 0 0 2
## 3 0 0 2
## 4 0 0 2
## 5 0 0 2
## 6 0 0 3
## 7 0 1 2
## 8 0 0 2
## 9 0 0 2
## 10 0 0 2
## # … with 213 more rows
36

Logical operators in R


operator definition operator definition
< less than x | y x OR y
<= less than or equal to is.na(x) test if x is NA
> greater than !is.na(x) test if x is not NA
>= greater than or equal to x %in% y test if x is in y
== exactly equal to !(x %in% y) test if x is not in y
!= not equal to !x not x
x & y x AND y
37

distinct to filter for unique rows

... and arrange to order alphabetically

hotels %>%
distinct(market_segment) %>%
arrange(market_segment)
## # A tibble: 8 x 1
## market_segment
## <chr>
## 1 Aviation
## 2 Complementary
## 3 Corporate
## 4 Direct
## 5 Groups
## 6 Offline TA/TO
## 7 Online TA
## 8 Undefined
hotels %>%
distinct(hotel, market_segment) %>%
arrange(hotel, market_segment)
## # A tibble: 14 x 2
## hotel market_segment
## <chr> <chr>
## 1 City Hotel Aviation
## 2 City Hotel Complementary
## 3 City Hotel Corporate
## 4 City Hotel Direct
## 5 City Hotel Groups
## 6 City Hotel Offline TA/TO
## 7 City Hotel Online TA
## 8 City Hotel Undefined
## 9 Resort Hotel Complementary
## 10 Resort Hotel Corporate
## 11 Resort Hotel Direct
## 12 Resort Hotel Groups
## 13 Resort Hotel Offline TA/TO
## 14 Resort Hotel Online TA
39

count to create frequency tables

# alphabetical order by default
hotels %>%
count(market_segment)
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
40

count to create frequency tables

# alphabetical order by default
hotels %>%
count(market_segment)
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
# descending frequency order
hotels %>%
count(market_segment,
sort = TRUE)
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Online TA 56477
## 2 Offline TA/TO 24219
## 3 Groups 19811
## 4 Direct 12606
## 5 Corporate 5295
## 6 Complementary 743
## 7 Aviation 237
## 8 Undefined 2
40

count and arrange

# ascending frequency order
hotels %>%
count(market_segment) %>%
arrange(n)
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Undefined 2
## 2 Aviation 237
## 3 Complementary 743
## 4 Corporate 5295
## 5 Direct 12606
## 6 Groups 19811
## 7 Offline TA/TO 24219
## 8 Online TA 56477
# descending frequency order
# just like adding sort = TRUE
hotels %>%
count(market_segment) %>%
arrange(desc(n))
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Online TA 56477
## 2 Offline TA/TO 24219
## 3 Groups 19811
## 4 Direct 12606
## 5 Corporate 5295
## 6 Complementary 743
## 7 Aviation 237
## 8 Undefined 2
41

count for multiple variables

hotels %>%
count(hotel, market_segment)
## # A tibble: 14 x 3
## hotel market_segment n
## <chr> <chr> <int>
## 1 City Hotel Aviation 237
## 2 City Hotel Complementary 542
## 3 City Hotel Corporate 2986
## 4 City Hotel Direct 6093
## 5 City Hotel Groups 13975
## 6 City Hotel Offline TA/TO 16747
## 7 City Hotel Online TA 38748
## 8 City Hotel Undefined 2
## 9 Resort Hotel Complementary 201
## 10 Resort Hotel Corporate 2309
## 11 Resort Hotel Direct 6513
## 12 Resort Hotel Groups 5836
## 13 Resort Hotel Offline TA/TO 7472
## 14 Resort Hotel Online TA 17729
42

order matters when you count

# hotel type first
hotels %>%
count(hotel, market_segment)
## # A tibble: 14 x 3
## hotel market_segment n
## <chr> <chr> <int>
## 1 City Hotel Aviation 237
## 2 City Hotel Complementary 542
## 3 City Hotel Corporate 2986
## 4 City Hotel Direct 6093
## 5 City Hotel Groups 13975
## 6 City Hotel Offline TA/TO 16747
## 7 City Hotel Online TA 38748
## 8 City Hotel Undefined 2
## 9 Resort Hotel Complementary 201
## 10 Resort Hotel Corporate 2309
## 11 Resort Hotel Direct 6513
## 12 Resort Hotel Groups 5836
## 13 Resort Hotel Offline TA/TO 7472
## 14 Resort Hotel Online TA 17729
# market segment first
hotels %>%
count(market_segment, hotel)
## # A tibble: 14 x 3
## market_segment hotel n
## <chr> <chr> <int>
## 1 Aviation City Hotel 237
## 2 Complementary City Hotel 542
## 3 Complementary Resort Hotel 201
## 4 Corporate City Hotel 2986
## 5 Corporate Resort Hotel 2309
## 6 Direct City Hotel 6093
## 7 Direct Resort Hotel 6513
## 8 Groups City Hotel 13975
## 9 Groups Resort Hotel 5836
## 10 Offline TA/TO City Hotel 16747
## 11 Offline TA/TO Resort Hotel 7472
## 12 Online TA City Hotel 38748
## 13 Online TA Resort Hotel 17729
## 14 Undefined City Hotel 2
43

mutate to add a new variable

hotels %>%
mutate(little_ones = children + babies) %>%
select(children, babies, little_ones) %>%
arrange(desc(little_ones))
## # A tibble: 119,390 x 3
## children babies little_ones
## <dbl> <dbl> <dbl>
## 1 10 0 10
## 2 0 10 10
## 3 0 9 9
## 4 2 1 3
## 5 2 1 3
## 6 2 1 3
## 7 3 0 3
## 8 2 1 3
## 9 2 1 3
## 10 3 0 3
## # … with 119,380 more rows
45

Little ones in resort and city hotels

# Resort Hotel
hotels %>%
mutate(little_ones = children + babies) %>%
filter(
little_ones >= 1,
hotel == "Resort Hotel"
) %>%
select(hotel, little_ones)
## # A tibble: 3,929 x 2
## hotel little_ones
## <chr> <dbl>
## 1 Resort Hotel 1
## 2 Resort Hotel 2
## 3 Resort Hotel 2
## 4 Resort Hotel 2
## 5 Resort Hotel 1
## 6 Resort Hotel 1
## 7 Resort Hotel 2
## 8 Resort Hotel 2
## 9 Resort Hotel 1
## 10 Resort Hotel 1
## # … with 3,919 more rows
# City Hotel
hotels %>%
mutate(little_ones = children + babies) %>%
filter(
little_ones > 1,
hotel == "City Hotel"
) %>%
select(hotel, little_ones)
## # A tibble: 2,140 x 2
## hotel little_ones
## <chr> <dbl>
## 1 City Hotel 2
## 2 City Hotel 2
## 3 City Hotel 2
## 4 City Hotel 2
## 5 City Hotel 2
## 6 City Hotel 2
## 7 City Hotel 2
## 8 City Hotel 2
## 9 City Hotel 2
## 10 City Hotel 3
## # … with 2,130 more rows
46

What is happening in the following chunk?

hotels %>%
mutate(little_ones = children + babies) %>%
count(hotel, little_ones) %>%
mutate(prop = n / sum(n))
## # A tibble: 12 x 4
## hotel little_ones n prop
## <chr> <dbl> <int> <dbl>
## 1 City Hotel 0 73923 0.619
## 2 City Hotel 1 3263 0.0273
## 3 City Hotel 2 2056 0.0172
## 4 City Hotel 3 82 0.000687
## 5 City Hotel 9 1 0.00000838
## 6 City Hotel 10 1 0.00000838
## 7 City Hotel NA 4 0.0000335
## 8 Resort Hotel 0 36131 0.303
## 9 Resort Hotel 1 2183 0.0183
## 10 Resort Hotel 2 1716 0.0144
## 11 Resort Hotel 3 29 0.000243
## 12 Resort Hotel 10 1 0.00000838
47

summarise for summary stats

# mean average daily rate for all bookings
hotels %>%
summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1
## mean_adr
## <dbl>
## 1 102.
48

summarise for summary stats

# mean average daily rate for all bookings
hotels %>%
summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1
## mean_adr
## <dbl>
## 1 102.

summarise() changes the data frame entirely, it collapses rows down to a single summary statistics, and removes all columns that are irrelevant to the calculation.

48

summarise() also lets you get away with being sloppy and not naming your new column, but that's not recommended!

hotels %>%
summarise(mean(adr))
## # A tibble: 1 x 1
## `mean(adr)`
## <dbl>
## 1 102.

hotels %>%
summarise(mean_adr = mean(adr))
## # A tibble: 1 x 1
## mean_adr
## <dbl>
## 1 102.
49

group_by for grouped operations

# mean average daily rate for all booking at city and resort hotels
hotels %>%
group_by(hotel) %>%
summarise(mean_adr = mean(adr))
## # A tibble: 2 x 2
## hotel mean_adr
## <chr> <dbl>
## 1 City Hotel 105.
## 2 Resort Hotel 95.0
50

Calculating frequencies

The following two give the same result, so count is simply short for group_by then determine frequencies

hotels %>%
group_by(hotel) %>%
summarise(n = n())
## # A tibble: 2 x 2
## hotel n
## <chr> <int>
## 1 City Hotel 79330
## 2 Resort Hotel 40060
hotels %>%
count(hotel)
## # A tibble: 2 x 2
## hotel n
## <chr> <int>
## 1 City Hotel 79330
## 2 Resort Hotel 40060
51

Multiple summary statistics

summarise can be used for multiple summary statistics as well

hotels %>%
summarise(
min_adr = min(adr),
mean_adr = mean(adr),
median_adr = median(adr),
max_adr = max(adr)
)
## # A tibble: 1 x 4
## min_adr mean_adr median_adr max_adr
## <dbl> <dbl> <dbl> <dbl>
## 1 -6.38 102. 94.6 5400
52
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow