Joining data from multiple sources

# Joining data from multiple sources
### Prof. Maria Tackett

---

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---

## [Click for PDF of slides](06-data-join-merge.pdf)

---

# Working with multiple data frames

---

## Fisheries of the world

Fisheries and Aquaculture Department of the Food and Agriculture Organization of 
the United Nations collects data on fisheries production of countries.

<img src="img/06/fisheries-data.png" width="60%" style="display: block; margin: auto;" />
.center[
.hand[...]
]

---

## Load data

```r
fisheries <- read_csv("data/fisheries.csv")
```

---

## First look at the data

```r
glimpse(fisheries)
```

```
## Rows: 216
## Columns: 4
## $ country     <chr> "Afghanistan", "Albania", "Algeria", "American Samoa", "A…
## $ capture     <dbl> 1000, 7886, 95000, 3047, 0, 486490, 3000, 755226, 3758, 1…
## $ aquaculture <dbl> 1200, 950, 1361, 20, 0, 655, 10, 3673, 16381, 0, 96847, 3…
## $ total       <dbl> 2200, 8836, 96361, 3067, 0, 487145, 3010, 758899, 20139, …
```

---

## Quick summaries of the data

```r
skim(fisheries) #skimr package
```

```
## ── Data Summary ────────────────────────
##                            Values   
## Name                       fisheries
## Number of rows             216      
## Number of columns          4        
## _______________________             
## Column type frequency:              
##   character                1        
##   numeric                  3        
## ________________________            
## Group variables            None     
## 
## ── Variable type: character ────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate   min   max empty n_unique whitespace
## 1 country               0             1     4    32     0      215          0
## 
## ── Variable type: numeric ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
##   skim_variable n_missing complete_rate    mean       sd    p0    p25    p50     p75     p100 hist 
## 1 capture               0             1 421916. 1478638.     0 3280.  33797  221884. 17800000 ▇▁▁▁▁
## 2 aquaculture           0             1 508368. 4496073.     0   25.2  1574.  25998  63700000 ▇▁▁▁▁
## 3 total                 0             1 930284. 5846301.     0 7270.  44648. 271901. 81500000 ▇▁▁▁▁
```
]

---

## Some summary stats

```r
fisheries %>%
  summarise(
    mean_cap = mean(capture),
    mean_aqc = mean(aquaculture),
    mean_tot = mean(total)
  )
```

```
## # A tibble: 1 x 3
##   mean_cap mean_aqc mean_tot
##      <dbl>    <dbl>    <dbl>
## 1  421916.  508368.  930284.
```

<br>

---

## A new approach!

```r
fisheries %>%
  summarise(across(capture:total, mean))
```

```
## # A tibble: 1 x 3
##   capture aquaculture   total
##     <dbl>       <dbl>   <dbl>
## 1 421916.     508368. 930284.
```

---

.discussion[
The (not-so-great) visualization below shows the distribution of fishery harvest 
of countries for 2016, by capture and aquaculture. What are some ways you would 
improve this visualization? Note that countries whose total harvest was less 
than 100,000 tons are not included in the visualization.
]

---

### Goal: calculate summary statistics at the continent level and visualize them

---

## Data prep

```r
continents <- read_csv("data/continents.csv")
```
]
--

Filter out countries whose total harvest was less than 100,000 tons since they 
are not included in the visualization:

```r
fisheries <- fisheries %>%
  filter(total >= 100000)

fisheries
```

```
## # A tibble: 82 x 4
##    country    capture aquaculture   total
##    <chr>        <dbl>       <dbl>   <dbl>
##  1 Angola      486490         655  487145
##  2 Argentina   755226        3673  758899
##  3 Australia   174629       96847  271476
##  4 Bangladesh 1674770     2203554 3878324
##  5 Brazil      705000      581230 1286230
##  6 Cambodia    629950      172500  802450
##  7 Cameroon    233190        2315  235505
##  8 Canada      874727      200765 1075492
##  9 Chad        110000          94  110094
## 10 Chile      1829238     1050117 2879355
## # … with 72 more rows
```

---

# Data joins

---

```r
fisheries %>% select(country)
```

```
## # A tibble: 82 x 1
##    country   
##    <chr>     
##  1 Angola    
##  2 Argentina 
##  3 Australia 
##  4 Bangladesh
##  5 Brazil    
##  6 Cambodia  
##  7 Cameroon  
##  8 Canada    
##  9 Chad      
## 10 Chile     
## # … with 72 more rows
```
]
.pull-right[

```r
continents
```

```
## # A tibble: 245 x 2
##    country           continent
##    <chr>             <chr>    
##  1 Afghanistan       Asia     
##  2 Åland Islands     Europe   
##  3 Albania           Europe   
##  4 Algeria           Africa   
##  5 American Samoa    Oceania  
##  6 Andorra           Europe   
##  7 Angola            Africa   
##  8 Anguilla          Americas 
##  9 Antigua & Barbuda Americas 
## 10 Argentina         Americas 
## # … with 235 more rows
```
]

---

## Joining data frames

```
something_join(x, y)
```

- `inner_join()`: all rows from x where there are matching values in y, return 
all combination of multiple matches in the case of multiple matches
- `left_join()`: all rows from x
- `right_join()`: all rows from y
- `full_join()`: all rows from both x and y
- `semi_join()`: all rows from x where there are matching values in y, keeping just columns from x.
- `anti_join()`: return all rows from x where there are not matching values in y, never duplicate rows of x
- ...
 
---

## Setup

For the next few slides...

```r
x
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     3
```
]
.pull-right[

```r
y
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     4
```
]

---

## `inner_join()`

```r
inner_join(x, y)
```

```
## # A tibble: 2 x 1
##   value
##   <dbl>
## 1     1
## 2     2
```
]
.pull-right[
![](img/06/inner-join.gif)
]

---

## `left_join()`

```r
left_join(x, y)
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     3
```
]
.pull-right[
![](img/06/left-join.gif)
]

---

## `right_join()`

```r
right_join(x, y)
```

```
## # A tibble: 3 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     4
```
]
.pull-right[
![](img/06/right-join.gif)
]

---

## `full_join()`

```r
full_join(x, y)
```

```
## # A tibble: 4 x 1
##   value
##   <dbl>
## 1     1
## 2     2
## 3     3
## 4     4
```
]
.pull-right[
![](img/06/full-join.gif)
]

---

## `semi_join()`

```r
semi_join(x, y)
```

```
## # A tibble: 2 x 1
##   value
##   <dbl>
## 1     1
## 2     2
```
]
.pull-right[
![](img/06/semi-join.gif)
]

---

## `anti_join()`

```r
anti_join(x, y)
```

```
## # A tibble: 1 x 1
##   value
##   <dbl>
## 1     3
```
]
.pull-right[
![](img/06/anti-join.gif)
]

---

.discussion[
We want to keep all rows and columns from `fisheries` and add a column for 
corresponding continents. Which join function should we use?
]

```r
fisheries %>% select(country)
```

```r
continents
```

---

## Join fisheries and continents

```r
fisheries <- left_join(fisheries, continents) 
```

Hint:

- Variables in the original fisheries dataset:

```
## [1] "country"     "capture"     "aquaculture" "total"
```

- Variables in the continents dataset:

```
## [1] "country"   "continent"
```

---

## Check the data

```r
fisheries %>%
  filter(is.na(continent))
```

```
## # A tibble: 3 x 5
##   country                          capture aquaculture   total continent
##   <chr>                              <dbl>       <dbl>   <dbl> <chr>    
## 1 Democratic Republic of the Congo  237372        3161  240533 <NA>     
## 2 Hong Kong                         142775        4258  147033 <NA>     
## 3 Myanmar                          2072390     1017644 3090034 <NA>
```

---

## Implement fixes

```r
fisheries <- fisheries %>%
  mutate(continent = case_when(
    country == "Democratic Republic of the Congo" ~ "Africa",
    country == "Hong Kong"                        ~ "Asia",
    country == "Myanmar"                          ~ "Asia",
    TRUE                                          ~ continent
    )
  )
```

...and check again

]

```r
fisheries %>%
  filter(is.na(continent))
```

```
## # A tibble: 0 x 5
## # … with 5 variables: country <chr>, capture <dbl>, aquaculture <dbl>, total <dbl>,
## #   continent <chr>
```

---

```r
fisheries %>%
  mutate(aquaculture_perc = aquaculture / total)
```

---

## Demo

---

## Demo

<!--
.your-turn[
- [RStudio Cloud](http://rstd.io/dsbox-cloud) > `AE 05 - Fisheries + Data joins` > open `fisheries.Rmd` and knit.
- Knit the document and work on the Exercises 2 - 4.
- Once done, place a green sticky on your laptop. If you have questions, place a pink sticky.
]

-->
---

## Visualize continent summary stats

```r
ggplot(fisheries_summary, aes(x = continent, y = mean_ap)) +
  geom_col()
```

![](06-data-join-merge_files/figure-html/unnamed-chunk-39-1.png)

---

## Improve visualization

```r
ggplot(fisheries_summary, 
*      aes(x = fct_reorder(continent, mean_ap), y = mean_ap)) +
  geom_col()
```

![](06-data-join-merge_files/figure-html/unnamed-chunk-40-1.png)

---

## Improve visualization further

```r
ggplot(fisheries_summary, 
       aes(y = fct_reorder(continent, mean_ap), x = mean_ap)) +
  geom_col() +
* scale_x_continuous(labels = label_percent(accuracy = 1)) +
* labs(
*   x = "",
*   y = "",
*   title = "Average share of aquaculture by continent",
*   subtitle = "out of total fisheries harvest, 2016",
*   caption = "Source: bit.ly/2VrawTt"
* ) +
* theme_minimal()
```

➡️ See next slide...

---

![](06-data-join-merge_files/figure-html/unnamed-chunk-42-1.png)