The goal of this homework assignment is to apply everything you’ve learned over the past few weeks. The exercises in this homework are intentionally more open-ended than ones you’ve seen previously in the course. They are similar to the types of exercises you may see on the exam in that they require you to think holistically about all that you’ve learned. Please refer to previous labs, lectures, and application exercises for code examples as you complete the assignment.
Make sure to have at least three (3) commits for this homework assignment, and don’t forget to give your code chunks meaningful names!
Go to the course organization on GitHub.
In addition to your private individual repositories, you should now see a repo named hw-01-lego-[github_username_name]. Go to that repository.
Clone the repo and start a new project in RStudio. See Lab 01 for more details about how to clone a repo and start a new project.
We have (simulated) data from Lego sales in 2018 for a sample of customers who bought Legos in the US. The data set is called lego_sales and can be found in the data folder in your repo. You can find descriptions of each of the variables in the data set here.
Answer the following questions using pipes (%>%) and the tidyverse functions we’ve discussed. For each question, show your code and output, and state your answer in a sentence, e.g. “The three most common first names of customers are …”.
What are the three most common first names of customers?
What are the three most common themes of Lego sets purchased?
Among the most common theme of Lego sets purchased, what is the most common subtheme?
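As a reminder of the general pattern for "most common value" questions, here is a minimal sketch on a small made-up tibble. The data frame toy_orders and its columns (customer, category, subcategory) are hypothetical and are not the variables in lego_sales; adapt the idea to the real column names.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy_orders <- tibble(
  customer    = c("Ana", "Ben", "Ana", "Cal", "Ana", "Ben"),
  category    = c("cars", "space", "cars", "cars", "space", "castles"),
  subcategory = c("race", "rockets", "race", "trucks", "rovers", "knights")
)

# count() tallies rows per value; sort = TRUE puts the largest counts first.
top_customers <- toy_orders %>%
  count(customer, sort = TRUE) %>%
  slice_head(n = 2)

# To count within one category, filter to it first, then count again.
top_car_subcategories <- toy_orders %>%
  filter(category == "cars") %>%
  count(subcategory, sort = TRUE)

top_customers
top_car_subcategories
```

Note that slice_head() simply keeps the first rows, so it does not account for ties; slice_max(n, n = 2) is one alternative that keeps tied rows.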
Hint: Use the case_when() function.
Create a new variable called age_group and group the ages into the following categories: “18 and under”, “19 - 25”, “26 - 35”, “36 - 50”, “51 and over”. Be sure to save the updated data set so you can use the new variable in other questions.
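If you need a refresher on how case_when() behaves, the sketch below bins a numeric column into labeled bands on a made-up tibble. The names toy_sets, price, and price_band, along with the cut points, are hypothetical and chosen only for illustration; they are not the age groups asked for above.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy_sets <- tibble(
  set_name = c("A", "B", "C", "D"),
  price    = c(4.99, 19.99, 49.99, 120.00)
)

# case_when() checks the conditions in order and uses the first one that is TRUE,
# so each later condition only sees the rows not matched above it.
# Assigning the result back to toy_sets saves the new variable for later use.
toy_sets <- toy_sets %>%
  mutate(
    price_band = case_when(
      price < 10 ~ "under 10",
      price < 50 ~ "10 to under 50",
      TRUE       ~ "50 and over"
    )
  )

toy_sets
```

Because the conditions are evaluated top to bottom, the second condition does not need to repeat price >= 10. The final TRUE line acts as a catch-all for everything else.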
What is the probability a randomly selected customer
Hint: You will need to consider quantity of purchases.
Hint: You will need to consider quantity of purchases as well as price of Lego sets.
Which age group has spent the most money on Legos? How much did they spend?
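The general pattern for "total spent per group" combines a per-row amount with a grouped sum. Here is a minimal sketch on a made-up tibble; toy_purchases and its columns (segment, unit_price, units) are hypothetical stand-ins, not the actual variables in lego_sales.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only.
toy_purchases <- tibble(
  segment    = c("x", "x", "y", "y", "y"),
  unit_price = c(10, 20, 5, 5, 30),
  units      = c(1, 2, 3, 1, 2)
)

spend_by_segment <- toy_purchases %>%
  # The amount for a row is the price of one item times how many were bought.
  mutate(amount = unit_price * units) %>%
  group_by(segment) %>%
  summarise(total_spent = sum(amount)) %>%
  # Largest total first.
  arrange(desc(total_spent))

spend_by_segment
```

Summing the price column alone would undercount any row where more than one item was bought, which is why the multiplication comes before the grouped sum.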
Come up with a question you want to answer using these data, and write it down. Then, create a data visualization that answers the question, and briefly explain how your visualization answers the question.
For inspiration, check out The Evolution of a ggplot, Ep(1) by Cédric Scherer.
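As a starting point only, here is a minimal sketch of a summarized bar chart built with ggplot2 on a made-up tibble. The names toy_counts, category, and n_sold are hypothetical, and the question the plot answers ("Which category sold the most sets?") is just an example; your own question and variables will differ.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data, invented for illustration only.
toy_counts <- tibble(
  category = c("cars", "space", "castles"),
  n_sold   = c(12, 7, 3)
)

# reorder() sorts the bars by value so the ranking is easy to read.
p <- ggplot(toy_counts, aes(x = reorder(category, n_sold), y = n_sold)) +
  geom_col() +
  coord_flip() +
  labs(
    title = "Which category sold the most sets?",
    x = NULL,
    y = "Sets sold"
  )

p
```

Putting the question in the title and ordering the bars are two small choices that help a reader see the answer at a glance.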
Knit to PDF to create a PDF document. Commit all remaining changes, and push your work to GitHub. Make sure all files are updated on your GitHub repo.
Please upload only your PDF document to Gradescope. Before you submit the uploaded document, mark where the answer to each exercise is located. If any answer spans multiple pages, mark all of those pages. Make sure to associate the “Overall” section with the first page.
This lab was adapted from Data Science in a Box.