tl;dr: a walkthrough of a set of questions about a movie database, answered by selecting data with SQL queries.
Week Review
Week 7 has flown by. It was interesting to learn more about databases: before this lecture I had imagined them as an unknown blob of info that you handle with a special power (aka code), but I’ve finally found this missing piece of information, and they are simpler than I thought. Basically, a database is like a set of spreadsheet pages linked to each other: a place where data is stored, organized and shared between different tables (spreadsheets). To put data into a database and retrieve it, you need a specific set of instructions: here comes SQL, a language with specific keywords for handling data in databases.
Spreadsheets and databases both store and organize data; the main difference is that a spreadsheet is a standalone document, while a database can be relational, with tables that reference each other.
To me, they seemed easier to control than 2D arrays or files!
As I wrote on Twitter:
Both the lab and this problem set were a breath of fresh air, especially after the tough week 6. Completing them in a couple of hours, and enjoying them without too much struggle, really lifted me up! It felt more like a quiz than a problem set (and I love quizzes and puzzles!).
DEFINITION
For each request, we should write a single SQL query that outputs the specified results. Our responses must take the form of a single SQL query, but we may nest other queries inside.
For each request, the first thing I always do is run .schema, to see how the database is organized so I can navigate its tables and write correct queries. Thanks to this we know there are 5 tables:
one with movie titles and years
one with rating data (like the number of votes and the rating for each movie)
one with people data (like name and birth year)
two more tables (directors and stars) that link people to the movies they directed or starred in through ID values.
IDs represent a unique identifying number assigned to each row of data. These are important for identifying each movie and become even more useful when dealing with multiple tables.
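To make the five-table layout concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is my reconstruction from the column names used in the queries of this post; the `votes` column name and all the column types are assumptions, and the real movies.db may declare things differently.

```python
import sqlite3

# Hypothetical schema, inferred from the columns the queries in this post use.
# The real movies.db may use other types or extra constraints.
SCHEMA = """
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL, year NUMERIC);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, birth NUMERIC);
CREATE TABLE ratings (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    rating REAL NOT NULL,
    votes INTEGER NOT NULL  -- assumed column name
);
CREATE TABLE stars (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    person_id INTEGER NOT NULL REFERENCES people(id)
);
CREATE TABLE directors (
    movie_id INTEGER NOT NULL REFERENCES movies(id),
    person_id INTEGER NOT NULL REFERENCES people(id)
);
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript(SCHEMA)

# sqlite_master holds one row per table, with the CREATE statement .schema shows.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
```

The stars and directors tables hold nothing but pairs of IDs, which is why every later query that connects a person to a movie has to pass through one of them.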
REQUEST 1 - WRITE A SQL QUERY TO LIST THE TITLES OF ALL MOVIES RELEASED IN 2008
We know, thanks to .schema, that the movies table has both a title and a year column for each movie. The query is then pretty straightforward:
SELECT title
FROM movies
WHERE year = 2008;
SELECT names the desired output, in this case the movie title.
FROM gives the location of the titles: the “movies” table.
WHERE is the constraint put on the output, in this case only the rows whose year equals 2008.
REQUEST 2 - WRITE A SQL QUERY TO DETERMINE THE BIRTH YEAR OF EMMA STONE
Very similar to the previous one; the only thing that changes is the table to refer to (in this case, people).
SELECT birth
FROM people
WHERE name = 'Emma Stone';
REQUEST 3 - WRITE A SQL QUERY TO LIST THE TITLES OF ALL MOVIES WITH A RELEASE DATE ON OR AFTER 2018, IN ALPHABETICAL ORDER
A very similar request to the first one, except for the alphabetical order, which is solved thanks to a new keyword: ORDER BY, which sorts the returned rows by the column specified (in this case, the same column we select).
SELECT title
FROM movies
WHERE year >= 2018
ORDER BY title;
REQUEST 4 - WRITE A SQL QUERY TO DETERMINE THE NUMBER OF MOVIES WITH AN IMDB RATING OF 10.0
Since movie titles and movie ratings are stored in different tables, we have two options here: joining the two tables with the JOIN keyword, matching rows on their shared ID, or nesting one query inside another as a subquery. In this case, I’ve preferred the latter.
SELECT COUNT(title)
FROM movies
WHERE id IN (
SELECT movie_id FROM ratings WHERE rating = 10.0);
In the WHERE condition, we nested another query that finds the movie_id of every row in the ratings table with a rating of 10.0; the outer query then keeps only the movies whose id appears in that list. We also used a SQL function, COUNT(), which, as the name suggests, counts the rows returned by the SELECT statement (strictly, COUNT(title) counts the rows whose title is not NULL).
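As a quick check that the two options really are interchangeable here, this sketch runs the subquery version and a JOIN version against a tiny in-memory database. The three movies and their ratings are made up for illustration, and the table layout is a trimmed-down assumption of the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE ratings (movie_id INTEGER, rating REAL);
-- Made-up sample rows, not taken from the real database.
INSERT INTO movies VALUES (1, 'Made-up A', 2001), (2, 'Made-up B', 2002), (3, 'Made-up C', 2003);
INSERT INTO ratings VALUES (1, 10.0), (2, 7.5), (3, 10.0);
""")

# Option used in the post: filter movies by a nested query on ratings.
subquery_count = conn.execute("""
    SELECT COUNT(title) FROM movies
    WHERE id IN (SELECT movie_id FROM ratings WHERE rating = 10.0)
""").fetchone()[0]

# Alternative: join the two tables on the shared ID, then filter.
join_count = conn.execute("""
    SELECT COUNT(title) FROM movies
    JOIN ratings ON movies.id = ratings.movie_id
    WHERE rating = 10.0
""").fetchone()[0]

print(subquery_count, join_count)
```

Both counts agree because each movie has at most one row in ratings; if a movie could have several rating rows, the JOIN version would count it once per matching row while the IN version would still count it once.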
REQUEST 5 - WRITE A SQL QUERY TO LIST THE TITLES AND RELEASE YEARS OF ALL HARRY POTTER MOVIES, IN CHRONOLOGICAL ORDER
Here we need to output two columns: title and year from the movies table. We can easily select both using a comma , as a separator. Since the requested order is chronological, we ORDER BY the year column of the movies table.
The new keyword here is LIKE, which checks whether the values in a column follow a pattern (the pattern is the piece of text after LIKE that we want the values to match). Here we used the percent symbol %, called a wildcard because it matches any text of zero or more characters. Placed after “Harry Potter”, it matches every title that starts with that string, returning all the Harry Potter movies.
SELECT title, year
FROM movies
WHERE title LIKE 'Harry Potter%'
ORDER BY year;
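Where the % goes matters, so here is a small sketch comparing a trailing wildcard with wildcards on both sides. The two Harry Potter titles are real; the third title is made up to show the difference, and the table is a trimmed-down assumption of the real one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
INSERT INTO movies VALUES
    (1, 'Harry Potter and the Goblet of Fire', 2005),
    (2, 'Harry Potter and the Chamber of Secrets', 2002),
    (3, 'A Made-up Harry Potter Documentary', 2010),  -- hypothetical title
    (4, 'Toy Story', 1995);
""")

# 'Harry Potter%' : the title must START with "Harry Potter".
prefix = [row[0] for row in conn.execute(
    "SELECT title FROM movies WHERE title LIKE 'Harry Potter%' ORDER BY year")]

# '%Harry Potter%' : "Harry Potter" may appear ANYWHERE in the title.
anywhere = [row[0] for row in conn.execute(
    "SELECT title FROM movies WHERE title LIKE '%Harry Potter%' ORDER BY year")]

print(prefix)
print(anywhere)
```

The prefix pattern skips the made-up documentary, while the two-sided pattern picks it up, so the choice depends on whether the request means titles that begin with the string or titles that merely contain it.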
REQUEST 6 - WRITE A SQL QUERY TO DETERMINE THE AVERAGE RATING OF ALL MOVIES RELEASED IN 2012
We need a subquery again, that is, a query nested within a query. Since the request asks for the average rating, we start from the ratings table and then nest a query on the movies table, matching IDs, to filter by release year. The new thing here is AVG(), another SQL function, which returns the average value of a numeric column.
SELECT AVG(rating)
FROM ratings
WHERE movie_id IN (
SELECT id FROM movies WHERE year = 2012);
REQUEST 7 - WRITE A SQL QUERY TO LIST ALL MOVIES RELEASED IN 2010 AND THEIR RATINGS, IN DESCENDING ORDER BY RATING. FOR MOVIES WITH THE SAME RATING, ORDER THEM ALPHABETICALLY BY TITLE.
We need to search data in two tables for this query, and this time I’ve used JOIN instead of nesting. We can JOIN as many tables as needed, as long as the tables being joined share a common identifier. These identifiers may have different column names, but their values should be the same.
In this case, we link the id column in movies with the movie_id column in ratings. Once we’ve joined the two tables, we can use columns from either table for adding conditions or for ordering data. Now we can sort by rating in descending order with the DESC keyword and, as a tie-breaker for movies with the same rating, by title in ascending order with the ASC keyword.
SELECT title, rating
FROM movies
JOIN ratings ON movies.id = ratings.movie_id
WHERE year = 2010
ORDER BY rating DESC, title ASC;
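To see the tie-breaker at work, this sketch runs the same query on three made-up movies, two of which share a rating. The titles, ratings and trimmed-down tables are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE ratings (movie_id INTEGER, rating REAL);
-- Made-up rows: 'A movie' and 'B movie' deliberately tie on rating.
INSERT INTO movies VALUES (1, 'B movie', 2010), (2, 'A movie', 2010), (3, 'C movie', 2010);
INSERT INTO ratings VALUES (1, 8.0), (2, 8.0), (3, 9.0);
""")

rows = conn.execute("""
    SELECT title, rating
    FROM movies
    JOIN ratings ON movies.id = ratings.movie_id
    WHERE year = 2010
    ORDER BY rating DESC, title ASC
""").fetchall()

for title, rating in rows:
    print(rating, title)
```

The highest rating comes first, and the two movies rated 8.0 are then sorted alphabetically, because the second sort key only applies among rows that tie on the first.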
REQUEST 8 - WRITE A SQL QUERY TO LIST THE NAMES OF ALL PEOPLE WHO STARRED IN TOY STORY
Here we need to join three tables. Since we need to show the names of all people who starred in the movie, we start from the people table. We then join stars, the table that connects people and movies, and then movies. Because stars links each movie ID to the IDs of the people in it, filtering on the movie title gives us back the names of the people connected to that specific movie.
SELECT name
FROM people
JOIN stars ON people.id = stars.person_id
JOIN movies ON stars.movie_id = movies.id
WHERE title = 'Toy Story';
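The role of the stars table as a bridge is easier to see on a handful of rows. In this sketch the names, titles and years are real, but the IDs and the trimmed-down tables are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
-- IDs are invented; only the pairs in stars connect a person to a movie.
INSERT INTO people VALUES (1, 'Tom Hanks', 1956), (2, 'Tim Allen', 1953), (3, 'Emma Stone', 1988);
INSERT INTO movies VALUES (10, 'Toy Story', 1995), (11, 'La La Land', 2016);
INSERT INTO stars VALUES (10, 1), (10, 2), (11, 3);
""")

# people -> stars -> movies, then filter on the title (passed as a parameter).
names = [row[0] for row in conn.execute("""
    SELECT name
    FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN movies ON stars.movie_id = movies.id
    WHERE title = ?
""", ("Toy Story",))]

print(sorted(names))
```

Emma Stone is left out because her only row in stars points at a different movie ID.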
REQUEST 9 - WRITE A SQL QUERY TO LIST THE NAMES OF ALL PEOPLE WHO STARRED IN A MOVIE RELEASED IN 2004, ORDERED BY BIRTH YEAR
Pretty similar to the previous one: we join three tables (people, stars, and movies), select people’s names without duplicates (that is what the DISTINCT keyword is for), filter on the release year, and order the results by birth year.
SELECT DISTINCT name
FROM people
JOIN stars ON people.id = stars.person_id
JOIN movies ON stars.movie_id = movies.id
WHERE year = 2004
ORDER BY birth;
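Here is a small sketch of why DISTINCT is needed: a person who starred in two 2004 movies comes back twice from the join. The people and movies are placeholders I made up, and the tables are a trimmed-down assumption of the real ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
-- Made-up rows: Person A stars in both 2004 movies, Person B in one.
INSERT INTO people VALUES (1, 'Person A', 1970), (2, 'Person B', 1960);
INSERT INTO movies VALUES (10, 'Movie 1', 2004), (11, 'Movie 2', 2004);
INSERT INTO stars VALUES (10, 1), (11, 1), (10, 2);
""")

JOINS = """
    FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN movies ON stars.movie_id = movies.id
    WHERE year = 2004
    ORDER BY birth
"""

# One row per (person, movie) pair, so Person A shows up twice.
without_distinct = [r[0] for r in conn.execute("SELECT name" + JOINS)]
# DISTINCT collapses identical names into one row.
with_distinct = [r[0] for r in conn.execute("SELECT DISTINCT name" + JOINS)]

print(without_distinct)
print(with_distinct)
```

One caveat with this approach: DISTINCT compares names, so two different people who happen to share a name would also be merged into a single row.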
REQUEST 10 - WRITE A SQL QUERY TO LIST THE NAMES OF ALL PEOPLE WHO HAVE DIRECTED A MOVIE THAT RECEIVED A RATING OF AT LEAST 9.0
Here we join 4 tables (people, directors, movies, and ratings). We connect people, directors, and movies to find everyone who directed a movie, and then we add the ratings table to check the condition (since the rating is stored there).
SELECT DISTINCT name
FROM people
JOIN directors ON people.id = directors.person_id
JOIN movies ON directors.movie_id = movies.id
JOIN ratings ON movies.id = ratings.movie_id
WHERE rating >= 9.0;
REQUEST 11 - WRITE A SQL QUERY TO LIST THE TITLES OF THE FIVE HIGHEST-RATED MOVIES (IN ORDER) THAT CHADWICK BOSEMAN STARRED IN, STARTING WITH THE HIGHEST RATED.
Similar to the previous one, we join 4 tables; the only difference is that here we need actors and not directors, so we go through the stars table instead of directors. We then set the condition on the actor’s name and order by rating in descending order, since we want the highest ratings first. Finally, we cap the result at 5 rows thanks to a new keyword, LIMIT, which restricts the output to the number of rows specified.
SELECT title
FROM movies
JOIN ratings ON movies.id = ratings.movie_id
JOIN stars ON movies.id = stars.movie_id
JOIN people ON stars.person_id = people.id
WHERE name = 'Chadwick Boseman'
ORDER BY rating DESC
LIMIT 5;
REQUEST 12 - WRITE A SQL QUERY TO LIST THE TITLES OF ALL MOVIES IN WHICH BOTH JOHNNY DEPP AND HELENA BONHAM CARTER STARRED
This query was harder for me: I struggled for quite some time to figure out how to select only the movies where BOTH actors appeared, and not every movie with one or the other.
I searched on Stack Overflow and found some useful tools:
GROUP BY, a statement that groups rows that have the same values into a summary row
HAVING, which is basically a WHERE for groups: the difference is that HAVING can be used with aggregate functions (like COUNT())
COUNT(*) = 2, which keeps only the movies where both actors matched: in a grouped query, COUNT(*) returns the number of rows in each group that passed the WHERE clause, and a movie with both actors has exactly two such rows.
SELECT title
FROM movies
JOIN stars ON movies.id = stars.movie_id
JOIN people ON stars.person_id = people.id
WHERE name IN ('Johnny Depp', 'Helena Bonham Carter')
GROUP BY movies.id, movies.title
HAVING COUNT(*) = 2;
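What helped me most was looking at the per-movie counts before HAVING filters them. This sketch does that on made-up movies: the two actor names come from the request, while the titles, the third person and the trimmed-down tables are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
-- Placeholder titles; only movie 10 features both actors.
INSERT INTO people VALUES (1, 'Johnny Depp'), (2, 'Helena Bonham Carter'), (3, 'Someone Else');
INSERT INTO movies VALUES (10, 'Both'), (11, 'Depp only'), (12, 'Carter only');
INSERT INTO stars VALUES (10, 1), (10, 2), (10, 3), (11, 1), (11, 3), (12, 2);
""")

BASE = """
    FROM movies
    JOIN stars ON movies.id = stars.movie_id
    JOIN people ON stars.person_id = people.id
    WHERE name IN ('Johnny Depp', 'Helena Bonham Carter')
    GROUP BY movies.id, movies.title
"""

# Step 1: how many of the two actors matched in each movie?
counts = conn.execute("SELECT title, COUNT(*)" + BASE + " ORDER BY movies.id").fetchall()

# Step 2: keep only the groups where both matched.
both = [r[0] for r in conn.execute("SELECT title" + BASE + " HAVING COUNT(*) = 2")]

print(counts)
print(both)
```

The WHERE clause throws away 'Someone Else' before grouping, so a count of 2 can only mean that both named actors are in the movie (assuming no one is listed twice for the same movie).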
REQUEST 13 - WRITE A SQL QUERY TO LIST THE NAMES OF ALL PEOPLE WHO STARRED IN A MOVIE IN WHICH KEVIN BACON ALSO STARRED
The last two queries were the tricky ones. Even though this one is pretty similar to the 11th, the struggle here was to NOT include Kevin Bacon in the results.
I needed to search on Stack Overflow, where I found that we can use IS NOT to keep only the rows where a comparison is NOT true (in SQLite, IS NOT behaves like != for values that aren’t NULL). That way the row whose name is Kevin Bacon is excluded from the results.
We proceed by using the DISTINCT keyword to get rid of duplicates and nesting three queries to get the list of actors who have been in movies with Kevin Bacon.
Thanks to this query I’ve learned that the order in which data is retrieved is pretty important (especially if you choose to nest queries).
SELECT DISTINCT name
FROM people
WHERE name IS NOT 'Kevin Bacon' AND id IN (
SELECT person_id FROM stars WHERE movie_id IN (
SELECT movie_id FROM stars WHERE person_id IN (
SELECT id FROM people WHERE name IS 'Kevin Bacon' AND birth = 1958)))
ORDER BY name;
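Reading the nested query from the inside out made it click for me, so this sketch runs each layer separately and then the full query. Apart from Kevin Bacon's name and birth year, which come from the request, every row here is made up, including a hypothetical namesake born in 1990.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT, birth INTEGER);
CREATE TABLE stars (movie_id INTEGER, person_id INTEGER);
-- Person 2 is a made-up namesake; movies 10 and 11 feature the 1958 Kevin Bacon.
INSERT INTO people VALUES
    (1, 'Kevin Bacon', 1958), (2, 'Kevin Bacon', 1990),
    (3, 'Co-star A', 1960), (4, 'Co-star B', 1970), (5, 'Unrelated', 1980);
INSERT INTO stars VALUES (10, 1), (10, 3), (11, 1), (11, 4), (11, 3), (12, 2), (12, 5);
""")

# Innermost layer: which id belongs to the right Kevin Bacon?
bacon_ids = [r[0] for r in conn.execute(
    "SELECT id FROM people WHERE name IS 'Kevin Bacon' AND birth = 1958")]

# Middle layer: which movies did that person star in?
bacon_movies = sorted(r[0] for r in conn.execute("""
    SELECT movie_id FROM stars WHERE person_id IN (
        SELECT id FROM people WHERE name IS 'Kevin Bacon' AND birth = 1958)"""))

# Full query: everyone in those movies, minus anyone named Kevin Bacon.
costars = [r[0] for r in conn.execute("""
    SELECT DISTINCT name
    FROM people
    WHERE name IS NOT 'Kevin Bacon' AND id IN (
        SELECT person_id FROM stars WHERE movie_id IN (
            SELECT movie_id FROM stars WHERE person_id IN (
                SELECT id FROM people WHERE name IS 'Kevin Bacon' AND birth = 1958)))
    ORDER BY name""")]

print(bacon_ids, bacon_movies, costars)
```

The birth year in the innermost query keeps the namesake's movie out of the list, and the name filter in the outermost query drops Kevin Bacon himself.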