class: center, middle, inverse, title-slide

# Data and Visualization
## Visualizing different types of data

---
layout: true

<div class="my-footer">
<span>
<a href="http://datasciencebox.org" target="_blank">datasciencebox.org</a>
</span>
</div>

---
class: middle, center

## [Click for PDF of slides](04-data-and-viz-pt2.pdf)

---
class: middle, center

# Identifying variables

---

## Number of variables involved

- .vocab[Univariate data analysis]: distribution of a single variable
<br>
- .vocab[Bivariate data analysis]: relationship between two variables
<br>
- .vocab[Multivariate data analysis]: relationship between many variables at once, usually focusing on the relationship between two while conditioning on others

---

## Types of variables

- .vocab[Numerical variables] can be classified as .vocab[continuous] or .vocab[discrete] based on whether the variable can take on an infinite number of values or only non-negative whole numbers, respectively.
  - *height* is continuous
  - *number of siblings* is discrete

--

- If the variable is .vocab[categorical], we can determine if it is .vocab[ordinal] based on whether or not the levels have a natural ordering.
  - *hair color* is unordered
  - *year in school* is ordinal

---
class: center, middle

# Visualizing numerical data

---

## Describing numerical distributions

- .vocab[shape:]
  - skewness: right-skewed, left-skewed, symmetric
  - modality: unimodal, bimodal, multimodal, uniform
- .vocab[center:] mean (`mean`), median (`median`), mode (not always useful)
- .vocab[spread:] range (`range`), standard deviation (`sd`), inter-quartile range (`IQR`)
- .vocab[outliers:] observations outside of the usual pattern

---

## Starwars data

```r
starwars
```

```
## # A tibble: 87 x 14
##    name  height  mass hair_color skin_color eye_color birth_year sex   gender
##    <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr>
##  1 Luke…    172    77 other      fair       blue            19   male  mascu…
##  2 C-3PO    167    75 none       gold       yellow         112   none  mascu…
##  3 R2-D2     96    32 none       white, bl… red             33   none  mascu…
##  4 Dart…    202   136 none       white      yellow          41.9 male  mascu…
##  5 Leia…    150    49 brown      light      brown           19   fema… femin…
##  6 Owen…    178   120 brown      light      blue            52   male  mascu…
##  7 Beru…    165    75 brown      light      blue            47   fema… femin…
##  8 R5-D4     97    32 none       white, red red             NA   none  mascu…
##  9 Bigg…    183    84 black      light      brown           24   male  mascu…
## 10 Obi-…    182    77 other      fair       blue-gray       57   male  mascu…
## # … with 77 more rows, and 5 more variables: homeworld <chr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>
```

---

## Histograms

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_histogram(binwidth = 10)
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-3-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Density plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = height)) +
  geom_density()
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-4-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Side-by-side box plots

.small[
```r
ggplot(data = starwars, mapping = aes(y = height, x = hair_color)) +
  geom_boxplot()
```

<img
src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-5-1.png" width="80%" style="display: block; margin: auto;" />
]

---
class: center, middle

# Visualizing categorical data

---

## Bar plots

.small[
```r
ggplot(data = starwars, mapping = aes(x = hair_color)) +
  geom_bar()
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-6-1.png" width="80%" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, counts

.small[
```r
# `eye_color2` is not a column of `starwars` as printed earlier; it is
# presumably derived from `eye_color` in a step not shown on these slides.
ggplot(data = starwars, mapping = aes(x = hair_color, fill = eye_color2)) +
  geom_bar()
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-8-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Segmented bar plots, proportions

.small[
```r
ggplot(data = starwars, mapping = aes(x = hair_color, fill = eye_color2)) +
* geom_bar(position = "fill") +
  labs(y = "proportion")
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-9-1.png" width="70%" style="display: block; margin: auto;" />
]

---

## Which bar plot is more appropriate?

.question[
Which plot is more useful for visualizing the relationship between hair color and eye color? Why?
]

.pull-left[
<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />
]
.pull-right[
<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />
]

---
class: center, middle

# Data visualization

---

## What is data visualization?

Anything that converts data sources into a visual representation

- charts
- plots
- maps
- tables
- etc.

.footnote[
Source: https://guides.library.duke.edu/datavis
]

---
class: center, middle

# Why do we visualize?
---

## Data: `datasaurus_dozen`

Below is an excerpt from the `datasaurus_dozen` dataset, shown in wide format with one `_x`/`_y` column pair per dataset:

```
## # A tibble: 142 x 8
##    away_x away_y bullseye_x bullseye_y circle_x circle_y dino_x dino_y
##     <dbl>  <dbl>      <dbl>      <dbl>    <dbl>    <dbl>  <dbl>  <dbl>
##  1   32.3   61.4       51.2       83.3     56.0     79.3   55.4   97.2
##  2   53.4   26.2       59.0       85.5     50.0     79.0   51.5   96.0
##  3   63.9   30.8       51.9       85.8     51.3     82.4   46.2   94.5
##  4   70.3   82.5       48.2       85.0     51.2     79.2   42.8   91.4
##  5   34.1   45.7       41.7       84.0     44.4     78.2   40.8   88.3
##  6   67.7   37.1       37.9       82.6     45.0     77.9   38.7   84.9
##  7   53.3   97.5       39.5       80.8     48.6     78.8   35.6   79.9
##  8   63.5   25.1       39.6       82.7     42.1     76.9   33.1   77.6
##  9   68.0   81.0       34.8       80.0     41.0     76.4   29.0   74.5
## 10   67.4   29.7       27.6       72.8     34.6     72.7   26.2   71.4
## # … with 132 more rows
```

---

## Summary statistics

```r
# Uses the long format: one row per point, with columns `dataset`, `x`, `y`
datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(r = cor(x, y))
```

```
## # A tibble: 13 x 2
##    dataset          r
##    <chr>        <dbl>
##  1 away       -0.0641
##  2 bullseye   -0.0686
##  3 circle     -0.0683
##  4 dino       -0.0645
##  5 dots       -0.0603
##  6 h_lines    -0.0617
##  7 high_lines -0.0685
##  8 slant_down -0.0690
##  9 slant_up   -0.0686
## 10 star       -0.0630
## 11 v_lines    -0.0694
## 12 wide_lines -0.0666
## 13 x_shape    -0.0656
```

---

.question[
How similar do the relationships between `x` and `y` look based on the plots? Based on the summary statistics?
]

<img src="04-data-and-viz-pt2_files/figure-html/datasaurus-plot-1.png" width="80%" style="display: block; margin: auto;" />

---

## Anscombe's quartet

```r
library(Tmisc)
quartet
```

.pull-left[
```
##    set  x     y
## 1    I 10  8.04
## 2    I  8  6.95
## 3    I 13  7.58
## 4    I  9  8.81
## 5    I 11  8.33
## 6    I 14  9.96
## 7    I  6  7.24
## 8    I  4  4.26
## 9    I 12 10.84
## 10   I  7  4.82
## 11   I  5  5.68
## 12  II 10  9.14
## 13  II  8  8.14
## 14  II 13  8.74
## 15  II  9  8.77
## 16  II 11  9.26
## 17  II 14  8.10
## 18  II  6  6.13
## 19  II  4  3.10
## 20  II 12  9.13
## 21  II  7  7.26
## 22  II  5  4.74
```
]
.pull-right[
```
##    set  x     y
## 23 III 10  7.46
## 24 III  8  6.77
## 25 III 13 12.74
## 26 III  9  7.11
## 27 III 11  7.81
## 28 III 14  8.84
## 29 III  6  6.08
## 30 III  4  5.39
## 31 III 12  8.15
## 32 III  7  6.42
## 33 III  5  5.73
## 34  IV  8  6.58
## 35  IV  8  5.76
## 36  IV  8  7.71
## 37  IV  8  8.84
## 38  IV  8  8.47
## 39  IV  8  7.04
## 40  IV  8  5.25
## 41  IV 19 12.50
## 42  IV  8  5.56
## 43  IV  8  7.91
## 44  IV  8  6.89
```
]

---

## Summarising Anscombe's quartet

```r
quartet %>%
  group_by(set) %>%
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  )
```

```
## # A tibble: 4 x 6
##   set   mean_x mean_y  sd_x  sd_y     r
##   <fct>  <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 I          9   7.50  3.32  2.03 0.816
## 2 II         9   7.50  3.32  2.03 0.816
## 3 III        9   7.5   3.32  2.03 0.816
## 4 IV         9   7.50  3.32  2.03 0.817
```

---

## Visualizing Anscombe's quartet

```r
ggplot(quartet, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set, ncol = 4)
```

<img src="04-data-and-viz-pt2_files/figure-html/quartet-plot-1.png" width="75%" style="display: block; margin: auto;" />

---

## Do you see anything out of the ordinary?

```r
ggplot(student_survey, aes(x = first_kiss)) +
  geom_histogram(binwidth = 1) +
  labs(title = "How old were you when you had your first kiss?")
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-12-1.png" width="75%" style="display: block; margin: auto;" />

---

## Reporting lower vs. higher values

```r
ggplot(student_survey, aes(x = fb_visits_per_day)) +
  geom_dotplot(binwidth = 5, dotsize = 0.4) +
  labs(title = "How many times do you go on Facebook per day?")
```

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-13-1.png" width="70%" style="display: block; margin: auto;" />

---
class: center, middle

# Designing effective visualizations

---

## Keep it simple

<img src="img/04/pie-3d.jpg" width="300" style="display: block; margin: auto;" />

<img src="04-data-and-viz-pt2_files/figure-html/pie-to-bar-1.png" width="600" style="display: block; margin: auto;" />

---

## Use color to draw attention

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-14-1.png" width="500" style="display: block; margin: auto;" />

<img src="04-data-and-viz-pt2_files/figure-html/unnamed-chunk-15-1.png" width="600" style="display: block; margin: auto;" />

---

## Tell a story

<img src="img/04/time-series.story.png" width="800" style="display: block; margin: auto;" />

.footnote[
Credit: Angela Zoss and Eric Monson, Duke DVS
]
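
---

## Sketch: using color to draw attention

The "Use color to draw attention" slide shows only rendered plots, so here is a minimal sketch of one way to get a similar effect with `ggplot2`. The `responses` data frame, its values, and the choice of highlighted category are made up for illustration; they are not the data or code behind the slide images.

```r
library(ggplot2)

# Hypothetical data, invented for this sketch
responses <- data.frame(
  category = c("A", "B", "C", "D"),
  count    = c(12, 30, 18, 7)
)

# Flag the single category we want the viewer to notice
responses$highlight <- responses$category == "B"

p <- ggplot(responses, aes(x = category, y = count, fill = highlight)) +
  geom_col() +
  # Muted gray for context, one saturated color for the focal bar
  scale_fill_manual(
    values = c("FALSE" = "gray70", "TRUE" = "firebrick"),
    guide = "none"
  ) +
  labs(x = NULL, y = "Count")

p
```

Mapping `fill` to a logical flag keeps every bar on the plot but reserves the saturated color for the one being discussed.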