library(dplyr)
library(dbplyr)
library(janitor)
library(lubridate)
library(tibble)
library(tidyr)
library(purrr)
library(ggplot2)
library(stats)
library(infer)
library(stringr)
library(styler)
library(tidylog)
library(lintr)
library(remedy)
library(readr)
library(httr)
library(jsonlite)
library(httr2)
library(magrittr)
library(knitr)
library(rmarkdown)

Exercises

Part 1

Instructions

Inspired by the example query above, create queries to request the following information:

  • What are the highest-grossing dramas from 2010?
  • Have Will Ferrell and Liam Neeson ever been in a movie together?
  • Can you find kids’ movies with Tom Cruise in it?

Answer the questions and include the queries and results (tibbles) in the report.

Explanation

I fist decide to save my api_key and token in my environment in case I need them.

api_key <- "1c4727e7dfc2f3b66ede7491b77a8acd"

bearer_token <- "eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiIxYzQ3MjdlN2RmYzJmM2I2NmVkZTc0OTFiNzdhOGFjZCIsIm5iZiI6MTc0OTEzNjQ4NC4xNjk5OTk4LCJzdWIiOiI2ODQxYjQ2NGU2Mjg4OTQwYWU3ZjUyMjUiLCJzY29wZXMiOlsiYXBpX3JlYWQiXSwidmVyc2lvbiI6MX0.z5_r5mffW3BUEeB0n9MeEK9nd27m5EbABC6QJ95TqIk"

To find the dramas, I have to identify the code for the drama genre. This is the input from the movie website.

library(httr)

genre <- "https://api.themoviedb.org/3/genre/movie/list"

queryString1 <- list(language = "en")

response1 <- VERB("GET", genre, query = queryString1, add_headers('Authorization' = 'Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiIxYzQ3MjdlN2RmYzJmM2I2NmVkZTc0OTFiNzdhOGFjZCIsIm5iZiI6MTc0OTEzNjQ4NC4xNjk5OTk4LCJzdWIiOiI2ODQxYjQ2NGU2Mjg4OTQwYWU3ZjUyMjUiLCJzY29wZXMiOlsiYXBpX3JlYWQiXSwidmVyc2lvbiI6MX0.z5_r5mffW3BUEeB0n9MeEK9nd27m5EbABC6QJ95TqIk'), content_type("application/octet-stream"), accept("application/json"))

content(response1, "text")
## [1] "{\"genres\":[{\"id\":28,\"name\":\"Action\"},{\"id\":12,\"name\":\"Adventure\"},{\"id\":16,\"name\":\"Animation\"},{\"id\":35,\"name\":\"Comedy\"},{\"id\":80,\"name\":\"Crime\"},{\"id\":99,\"name\":\"Documentary\"},{\"id\":18,\"name\":\"Drama\"},{\"id\":10751,\"name\":\"Family\"},{\"id\":14,\"name\":\"Fantasy\"},{\"id\":36,\"name\":\"History\"},{\"id\":27,\"name\":\"Horror\"},{\"id\":10402,\"name\":\"Music\"},{\"id\":9648,\"name\":\"Mystery\"},{\"id\":10749,\"name\":\"Romance\"},{\"id\":878,\"name\":\"Science Fiction\"},{\"id\":10770,\"name\":\"TV Movie\"},{\"id\":53,\"name\":\"Thriller\"},{\"id\":10752,\"name\":\"War\"},{\"id\":37,\"name\":\"Western\"}]}"

Comment

I see that the code for drama is 18. The query parameters should be: release_year=2010, sort_by=revenue.desc, with_genres=18.

Highest-grossing dramas from 2010

Explanation

I first need to create the request, then parse the response to a json file to read it better on R studio and finally, I need to pull all original titles and not just the first one. That is why I use map_chr() on try2_list to look for “original title” in a loop.

try2 <- 
  request("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&primary_release_year=2010&sort_by=revenue.desc&with_genres=18") %>% 
  req_perform()

try2_list <- try2 %>% 
  resp_body_json()

original_titles <- map_chr(try2_list$results, "original_title")

dramas <- tibble(movie_title = original_titles)

kable(dramas, caption= "Top 20 highest-grossing dramas from 2010")
Top 20 highest-grossing dramas from 2010
movie_title
The Twilight Saga: Eclipse
The King’s Speech
The Karate Kid
Black Swan
Shutter Island
Sex and the City 2
True Grit
The Social Network
Due Date
Eat Pray Love
Step Up 3D
The Town
Wall Street: Money Never Sleeps
The Fighter
Dear John
Hereafter
Life As We Know It
Love & Other Drugs
唐山大地震
Green Zone

Have Will Ferrell and Liam Neeson ever been in a movie together?

Explanation

I first have to find the codes for Will Ferrell and Liam Neeson. The query parameters I will use are: with_cast= 3896%2C23659 and sort_by=title.asc. Again in this case I am looking for movie titles and I can use map_chr().

try3 <- 
  request("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&language=en-US&sort_by=title.asc&with_cast=3896%2C23659") %>% 
  req_perform()

try3_list <- try3 %>% 
  resp_body_json()

original_titles_2 <- map_chr(try3_list$results, "original_title")

movies2 <- tibble(movie_title = original_titles_2)

kable(movies2, caption="Movies with Liam Neeson and Will Ferrel")
Movies with Liam Neeson and Will Ferrel
movie_title
Anchorman 2: The Legend Continues
Daddy’s Home 2
The Lego Movie

Comment

Yes, Will Ferrell and Liam Neeson have been in 3 movies together.

Can you find kids’ movies with Tom Cruise in it?

Explanation

For this question I need the following queries: certification.lte = “G”, certification_country = “US” and with_cast = “137113”. 137113 is the code for Tom Cruise.

try4 <- 
  request("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&certification.lte=G&certification_country=US&sort_by=title.asc&with_cast=137113") %>% 
  req_perform()

try4_list <- try4 %>% 
  resp_body_json()

original_titles_4 <- map_chr(try4_list$results, "original_title") %>% 
  print()
## character(0)

Comment

There is no kid’s movie with Tom Cruise.

Part 2

Instructions

Write a query to find the ID of the animation movie company “Pixar”. Show both the query and how you extract the ID from the result in your report. Add the ID as inline code in your report.

Explanation

I first get the first part of the URL: https://api.themoviedb.org/3/search/company? with the end point “search/company” and then the query parameter I’m interested in, which is “query=pixar”. With resp_body_json() I get a long list but I can locate the where the id is.

pixar <- "https://api.themoviedb.org/3/search/company?api_key={api_key}&query=pixar"%>% 
  str_glue() %>% 
  request() %>% 
  req_perform()

pixar_list <- pixar %>% 
  resp_body_json()

pixar_id <- pixar_list$results[[1]]$id %>% 
  print()
## [1] 3

Comment

The ID of Pixar is 3.

Part 3

Instructions

Now that we have the ID of Pixar, we should be able to find all the movies they have worked on. Write a query to get all the Pixar movies and sort them by descending revenue. The result will be given to you as a JSON (parsed to a list by {httr}). Convert this list to a tibble so you have one row per film and one column per interesting piece of information. Show the query and the tibble in your report.

Explanation

As a first step, I need to make a request for the movies from Pixar and parse the result to a list.

pixar_movies <- "https://api.themoviedb.org/3/discover/movie?api_key={api_key}&sort_by=revenue.desc&with_companies=3"%>% 
  str_glue() %>% 
  request() %>% 
  req_perform()

pixar_movies_list <- pixar_movies %>% 
  resp_body_json() 

As a second step, I pull the “original titles” from the list.

original_titles_pixar <- map_chr(pixar_movies_list$results, "original_title") %>% 
  print()
##  [1] "Inside Out 2"        "Incredibles 2"       "Toy Story 4"        
##  [4] "Toy Story 3"         "Finding Dory"        "Finding Nemo"       
##  [7] "Inside Out"          "Coco"                "Monsters University"
## [10] "Up"                  "The Incredibles"     "Ratatouille"        
## [13] "Monsters, Inc."      "Cars 2"              "Brave"              
## [16] "WALL·E"              "Toy Story 2"         "Elemental"          
## [19] "Cars"                "Toy Story"

Comment

Unfortunately, if I use this method I only get the first 20 movies and not all movies. This is because the results are paginated by the movie website and there are 20 movies per page. Therefore, I need to get all pages and not just the first one.

Explanation

I create a function named “get_page” to get all pages from the start. I know there are 8 pages in total. map_chr() helps running get_page on the numbers 1 to 8 once for each number, using the numbers as the page number.

get_page <- function(page_num) {
  stringr::str_glue("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page={page_num}&sort_by=revenue.desc&with_companies=3")
}

pixar_urls <- map_chr(1:8, get_page)

print(pixar_urls)
## [1] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=1&sort_by=revenue.desc&with_companies=3"
## [2] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=2&sort_by=revenue.desc&with_companies=3"
## [3] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=3&sort_by=revenue.desc&with_companies=3"
## [4] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=4&sort_by=revenue.desc&with_companies=3"
## [5] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=5&sort_by=revenue.desc&with_companies=3"
## [6] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=6&sort_by=revenue.desc&with_companies=3"
## [7] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=7&sort_by=revenue.desc&with_companies=3"
## [8] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=8&sort_by=revenue.desc&with_companies=3"

Comment

As a result I get all 8 urls containing a list of 20 movies from pixar. Now I need to make a request for each one of them, and parse them. The issue is, I have to run request(), req_perform() and resp_body_json() on a list, within a list, within a list. Therefore I need to use map(). There are no brackets after request in map() because it would immediately call request before it’s mapped, and since R Studio sees no string argument because there is no value yet, I would get an error.

parsed_pages <- pixar_urls %>% 
  map(request) %>%  # builds the request
  map(req_perform) %>% # sends the request and return the http response
  map(resp_body_json) # parse the response

Explanation

I’m working with a nested list :

  • There are 8 pages of results,

  • Each page contains a list of 20 movies, and

  • Each movie is itself a named list with fields like “original_title”, “release_date”, “id”, and “popularity”.

To process this structure:

I first use map(pluck, “results”) to extract the “results” field from each page. This gives me a list of 8 elements, where each element is a list of 20 movies.

Then, I use map(~ map(.x, extract, …)) to go one level deeper:

  • The outer map() iterates over the list of pages.

  • .x represents one page (i.e., a list of 20 movies).

  • Inside the anonymous function ~ map(.x, extract, …), I use map() again to loop over each movie on that page and extract the specific fields I need.

This nested use of map() allows me to (1) go page by page, (2) then go movie by movie within each page, and (3) extract only the relevant data fields from each movie.

I use an anonymous function (~) because I need to apply a function (like extract()) to each sublist (each movie) within a list (each page). So .x refers to the list of movies on a single page — the result of the first map().

pixar_movie_info_list <- parsed_pages %>% 
  map(pluck, "results") %>% 
  map(~map(.x, extract, c("original_title", "id", "release_date", "popularity")))

Next, I want to transform the extracted fields from each movie into one big tibble. In this tibble, each row represents a movie, and the columns are the fields I previously extracted (like original_title, release_date, etc.).

To do this, I use map_dfr(as_tibble) inside an outer map(). The inner map_dfr() goes over the list of movies on one page (represented by .x), converts each movie (a list) into a tibble row, and returns a single tibble with 20 rows (one per movie). The outer map() does this for every page — so it returns one tibble per page.

Finally, I use list_rbind() to combine the tibbles from each page into one big tibble containing all the movies across all pages.

pixar_movie_info_tbl <- pixar_movie_info_list %>%
  map(~ map_dfr(.x, as_tibble)) %>%
  list_rbind()

paged_table(pixar_movie_info_tbl)

Part 4

Instructions

You may know that Pixar was acquired by Disney in 2006 after they had already been collaborating on films for more than a decade. For the last part of the report, we will look into whether this was a smart strategic decision by Disney by comparing the popularity of both Disney and Pixar films that came out from 2006 onwards.

  • First, acquire the “ID” for Disney using the search endpoint again. Note that if you try to find the company ID for Disney, there will be more than one result (Disney has many subsidiaries with the name “Disney” in it). For this exercise, specifically look for “Walt Disney Pictures” in the USA. Add the ID as inline code in your report.

  • Second, get the vote averages and vote counts for films from Walt Disney Productions and Pixar using the discover/movies endpoint. Use the API documentation again to find out how to get films from 2006 onwards.

  • Now, answer the following question: Are the films from Pixar, on average, more popular than those from Walt Disney Pictures? Compare the vote averages using boxplots. A suggestion would be to filter the data by including only films with a vote_count of at least 50. Consider that if only a few people voted on the film, the vote average would not be as representative as when many voted. After writing down your initial remarks based on the boxplot, state a statistical hypothesis and use a t-test to compare the average popularity for the two groups statistically. What is your conclusion? Do the test results confirm your observations? Was it a smart decision by Disney to acquire Pixar?

Explanation

The first step is to get the right id for Disney. Next I have to get all movies by Pixar from 2006 onwards and all movies by Walt Disney Pictures from 2006 onward and merge them into a big tibble. And finally create 2 boxplot, one for each production company to compare the average popularity for their respective movies and analyse if there is a difference.

I already have the id for Pixar. Once I have the id for both production companies and the correct query parameter to find the movies from 2006 onward, I can get all of the results thanks to the function I created earlier get_page(). I just have to change the url for each production company.

Next I have to apply the same steps I did before on the urls: (1) request and parse them, (2) extract the fields I’m interested in, (3) transform the results into one big tibble for each production company.

Keeping in mind I’m still working on a list of pages, a list of movies in each page, and variables in each movie, which is a list too. So a three layer list. Therefore, I have to use map() again and map(~map(.x…)) to access the inner layers of lists.

disney <- "https://api.themoviedb.org/3/search/company?api_key={api_key}&query=walt%20disney%20pictures"%>% 
  str_glue() %>% 
  request() %>% 
  req_perform()

disney_list <- disney %>% 
  resp_body_json()

disney_id <- disney_list$results[[1]]$id %>% 
  print()
## [1] 2

The id of Walt Disney Pictures is 2.

get_page_pixar2 <- function(page_num) { stringr::str_glue("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page={page_num}&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3")
}

pixar_2006_urls <- map_chr(1:6, get_page_pixar2)

print(pixar_2006_urls)
## [1] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=1&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"
## [2] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=2&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"
## [3] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=3&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"
## [4] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=4&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"
## [5] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=5&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"
## [6] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=6&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=3"

I know there are 6 pages of result, so I use get_page_pixar2() on the 1:6 vector.

parsed_pages_pixar_url_2006 <- pixar_2006_urls %>% 
  map(request) %>% 
  map(req_perform) %>% 
  map(resp_body_json)

I parse the urls for the movies from Pixar from 2006 onward.

pixar_movies_2006_info_list <- parsed_pages_pixar_url_2006 %>% 
  map(pluck, "results") %>% 
  map(~map(.x, extract, c("original_title", "id", "vote_average", "vote_count", "popularity", "release_date")))

Next, I turn the extracted fields into a single tibble for all the pages and movies: “pixar_movies_2006_info_tbl”.

pixar_movies_2006_info_tbl <- pixar_movies_2006_info_list %>%
  map(~ map_dfr(.x, as_tibble)) %>%
  list_rbind()

paged_table(pixar_movies_2006_info_tbl)

I repeat the operation for Disney movies. I start by getting all the pages for the movies from 2006 onward.

get_page_disney <- function(page_num) { stringr::str_glue("https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page={page_num}&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2")
}

disney_2006_urls <- map_chr(1:11, get_page_disney)

print(disney_2006_urls)
##  [1] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=1&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [2] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=2&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [3] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=3&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [4] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=4&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [5] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=5&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [6] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=6&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [7] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=7&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [8] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=8&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
##  [9] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=9&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2" 
## [10] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=10&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2"
## [11] "https://api.themoviedb.org/3/discover/movie?api_key=1c4727e7dfc2f3b66ede7491b77a8acd&page=11&primary_release_date.gte=2006-01-01&sort_by=primary_release_date.desc&with_companies=2"

Next, I need to parse the result I get from the request.

parsed_pages_disney_url_2006 <- disney_2006_urls %>% 
  map(request) %>% 
  map(req_perform) %>% 
  map(resp_body_json)

The, I extract the same fields I did for Pixar.

disney_movies_2006_info_list <- parsed_pages_disney_url_2006 %>% 
  map(pluck, "results") %>% 
  map(~map(.x, extract, c("original_title", "id", "vote_average", "vote_count", "popularity", "release_date")))

Finally, I turn the extracted field into a single tibble: “disney_movies_2006_info_tbl” and exclude the films by Pixar that might be in the list because of the acquisition.

disney_movies_2006_info_tbl <- disney_movies_2006_info_list %>%
  map(~ map_dfr(.x, as_tibble)) %>%
  list_rbind()

disney_movies_2006_info_tbl_clean <- disney_movies_2006_info_tbl %>% 
  filter(!(id %in% pixar_movies_2006_info_tbl$id))
## filter: removed 29 rows (14%), 171 rows remaining
paged_table(disney_movies_2006_info_tbl_clean)

Given that I want to merge the tibbles without forgetting the production company of each tibble and that I also need a column with the company names to create my boxplots, I create a new column for each tibble, called “company” with their respective company names: Pixar and Disney.

pixar_2006 <- pixar_movies_2006_info_tbl %>% 
  mutate(company = "pixar")
disney_2006 <- disney_movies_2006_info_tbl_clean %>% 
  mutate(company = "disney")

In the following code, I merge both tibbles by rows since they have the same column names.

movies_2006 <- pixar_2006 %>% 
  bind_rows(disney_2006)

paged_table(movies_2006)

Creating a boxplot

The following boxplot compared the average popularity (i.e. the “average_vote” column) for movies form Pixar and Walt Disney Pictures from 2006 onward. I excluded the movies with less than 50 vote counts as it wouldn’t give a representative result.

movies_2006 %>% 
  filter(vote_count > 50) %>% 
  ggplot(aes(x=company, y=vote_average)) +
  geom_point(size = 3, alpha = 0.05, col = "purple") +
  geom_boxplot(alpha = 0.1, fill = "purple")+
  theme_minimal()+
  labs(title = "Average popularity for Walt Disney Pictures and Pixar movies",
       subtitle = "From 2006 onwards",
       x= "Company",
       y= "Average popularity")

Comment

The movies by Pixar have a slightly higher median (7) than Disney for the average popularity (6.7). The score for Pixar movies have a wider range than Disney movies. The maximum values for Pixar films is also higher (7.5) than Disney. I think the difference in popularity average between Pixar and Disney movies is not significant enough, and that on average films from Pixar are not more popular than those from Walt Disney Pictures.

The null hypothesis in this case is that “on average, the average popularity of films from Pixar is the same as the films from Walt Disney Pictures”. The t-test will confirm or reject the null hypothesis.

movies_2006 %>% 
  filter(vote_count > 50) %>% 
  t_test(vote_average ~ company,
         order= c("pixar", "disney"))

Comment

The p-value from this t-test is lower than 0.05. This means that the popularity average of Pixar films and Disney films is on average very unlikely to be similar and that the null hypothesis can be rejected. My primary conclusion following the boxplot is consequently wrong. Based on this data, it is possible that the average popularity of Pixar movies is on average higher or more popular than Disney movies and that acquiring Pixar was a smart decision.