HW 01 - Data wrangling and visualization

Due Wed, Sep 16 at 11:59 PM

The goal of this homework assignment is to apply everything you’ve learned over the past few weeks. The exercises are intentionally more open-ended than ones you’ve seen previously in the course. They are similar to the types of exercises you may see on the exam in that they require you to think holistically about all that you’ve learned. Please refer to previous labs, lectures, and application exercises for code examples as you complete the assignment.

Make sure to have at least three (3) commits for this homework assignment, and don’t forget to give your code chunks meaningful names!

Clone assignment repo + start new project

Packages

In this assignment we will work with the tidyverse package as usual.

library(tidyverse)

Lego sales

We have (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US. The dataset is called lego_sales and can be found in the data folder in your repo. You can find descriptions of each of the variables in the dataset here.

lego_sales <- read_csv("data/lego_sales.csv")

Answer the following questions using pipes (%>%) and the tidyverse functions we’ve discussed. For each question, show your code and output, and state your answer in a sentence, e.g. “The three most common first names of customers are …”.

  1. What are the three most common first names of customers?

  2. What are the three most common themes of Lego sets purchased?

  3. Within the most common theme of Lego sets purchased, what is the most common subtheme?
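As a reminder of one common pattern for "most common value" questions, the sketch below counts values with count(), sorts by frequency, and keeps the top rows with slice_head(). The tibble toy_orders and its color column are made up for illustration and are not part of lego_sales.

```r
library(tidyverse)

# Hypothetical toy data; not the lego_sales dataset
toy_orders <- tibble(
  color = c("red", "blue", "red", "green", "blue", "red")
)

# Count rows per color, most frequent first, then keep the top two
top_colors <- toy_orders %>%
  count(color, sort = TRUE) %>%
  slice_head(n = 2)

top_colors
```

To restrict the count to one category first, a filter() step can go before count().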

Hint: Use the case_when() function.
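A minimal sketch of how case_when() turns a numeric variable into labeled categories is shown here. The tibble toy_scores, the score_band variable, and the cutoffs are all hypothetical and chosen only to show the pattern.

```r
library(tidyverse)

# Hypothetical toy data; scores and cutoffs are made up for illustration
toy_scores <- tibble(score = c(45, 60, 72, 88, 95))

# Conditions are checked in order; the first one that is TRUE sets the label
toy_scores <- toy_scores %>%
  mutate(
    score_band = case_when(
      score < 60 ~ "low",
      score < 90 ~ "medium",
      TRUE       ~ "high"   # fallback for everything not matched above
    )
  )

toy_scores
```

Because the result is assigned back to toy_scores, the new variable is available in later steps.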

  4. Create a new variable called age_group and group the ages into the following categories: “18 and under”, “19 - 25”, “26 - 35”, “36 - 50”, “51 and over”. Be sure to save the updated data set so you can use the new variable in other questions.

  5. What is the probability a randomly selected customer

    • is in the 19 - 25 age group?
    • is in the 19 - 25 age group and purchased a Duplo theme set?
    • is in the 19 - 25 age group given they purchased a Duplo theme set?
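The three bullets correspond to a marginal, a joint, and a conditional probability. The sketch below illustrates the difference on made-up data, assuming each row represents one customer. The tibble toy_customers and its group and product columns are hypothetical, and whether one row equals one customer in lego_sales is something to check for yourself.

```r
library(tidyverse)

# Hypothetical toy data; assumes one row per customer
toy_customers <- tibble(
  group   = c("A", "A", "B", "B", "B", "A", "B", "A"),
  product = c("x", "y", "x", "x", "y", "x", "y", "y")
)

# Marginal probability: P(group is A)
p_a <- toy_customers %>%
  summarise(p = mean(group == "A")) %>%
  pull(p)

# Joint probability: P(group is A and product is x)
p_a_and_x <- toy_customers %>%
  summarise(p = mean(group == "A" & product == "x")) %>%
  pull(p)

# Conditional probability: P(group is A | product is x),
# computed by restricting to the rows where product is x
p_a_given_x <- toy_customers %>%
  filter(product == "x") %>%
  summarise(p = mean(group == "A")) %>%
  pull(p)

c(p_a, p_a_and_x, p_a_given_x)
```

The conditional probability equals the joint probability divided by P(product is x), which is a useful check on the filtered calculation.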

Hint: You will need to consider quantity of purchases.
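To see why quantity matters, the sketch below compares groups by total units rather than by number of rows, using the wt argument of count(). The tibble toy_purchases and its columns are hypothetical.

```r
library(tidyverse)

# Hypothetical toy data; each row is one purchase record with a quantity
toy_purchases <- tibble(
  group    = c("A", "B", "A", "B"),
  quantity = c(1, 3, 2, 1)
)

# With wt, count() sums quantity per group instead of counting rows
units_by_group <- toy_purchases %>%
  count(group, wt = quantity, sort = TRUE, name = "units")

units_by_group
```

Both groups have two rows here, so counting rows alone would show a tie, while the weighted count separates them.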

  6. Which age group has purchased the largest number of Lego sets? How many did they purchase?

Hint: You will need to consider quantity of purchases as well as price of lego sets.
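One possible way to combine price and quantity is to compute the amount per row first and then sum it within each group. This is a sketch on made-up data; the tibble toy_purchases and the column names price, quantity, amount, and total_spent are hypothetical here.

```r
library(tidyverse)

# Hypothetical toy data; price is per unit
toy_purchases <- tibble(
  group    = c("A", "B", "A", "B"),
  price    = c(10, 5, 20, 50),
  quantity = c(1, 3, 2, 1)
)

# Amount per row is price times quantity; then total it per group
spend_by_group <- toy_purchases %>%
  mutate(amount = price * quantity) %>%
  group_by(group) %>%
  summarise(total_spent = sum(amount)) %>%
  arrange(desc(total_spent))

spend_by_group
```

Summing price alone would ignore rows where more than one unit was bought.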

  7. Which age group has spent the most money on Legos? How much did they spend?

  8. Come up with a question you want to answer using these data, and write it down. Then, create a data visualization that answers the question, and briefly explain how your visualization answers the question.

For inspiration, check out The Evolution of a ggplot, Ep(1) by Cédric Scherer.

  9. Add one element to the plot from the previous exercise to change the look of the plot without changing the underlying data. For example, you can change the theme, background color, add annotations, etc. State the change you’re making and display the updated visualization. We encourage you to be creative!
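As an illustration of changing a plot's look without touching its data, the sketch below saves a plot to an object and adds a single theme layer to it. It uses the mpg data set that ships with ggplot2 as a stand-in; the object names base_plot and styled_plot are hypothetical.

```r
library(tidyverse)

# mpg ships with ggplot2; used here only as a stand-in data set
base_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

# Same data and geoms; the only added element is the theme
styled_plot <- base_plot +
  theme_minimal()

styled_plot
```

Saving the original plot as an object makes it easy to display the before and after versions side by side.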

Submission

Knit to PDF to create a PDF document. Commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.

Please upload only your PDF document to Gradescope. Before you submit the uploaded document, mark the page where the answer to each exercise appears. If an answer spans multiple pages, mark all of those pages. Make sure to associate the “Overall” section with the first page.



This lab was adapted from Data Science in a Box.