CS50 Week 7 - SQL

Lecture notes, review and Anki cards

This lecture was all about data: how to collect, store, and search it, and much more. We need data that persists so that we can build programs that remember what users did last time, or programs that can grow.

Databases let us do just that. The two approaches covered are the FLAT-FILE DATABASE, which stores data in a simple text file (such as a CSV), and the RELATIONAL DATABASE, which stores data through a dedicated program. Inside a database, data is stored in an organized structure: related data is grouped into tables, each of which consists of rows and columns (like a spreadsheet).

But data in the real world can get messy fast, since people can write the same thing in different ways, make typos, or capitalize inconsistently. That’s why we need a way to clean data before we can manipulate it; this process is called CANONICALIZATION (or STANDARDIZATION).

To canonicalize data, we first need a programming language, like Python, to:

  • eliminate duplicates (using a set)
  • ignore capitalization (using the .upper() or .lower() methods)
  • strip leading and trailing whitespace (using the .strip() method)
  • sort the output (using the sorted() function)

import csv

titles = set()

with open("filename.csv") as file:
  reader = csv.DictReader(file)
  for row in reader:
    title = row["title"].strip().upper()
    titles.add(title)

for title in sorted(titles):
  print(title)

We can now manipulate the data, for example by counting how many times each value appears in the file (counting) or by searching for a specific value (searching):

import csv

# Maps each canonicalized title to the number of times it appears
titles = {}

with open("filename.csv") as file:
  reader = csv.DictReader(file)
  for row in reader:
    title = row["title"].strip().upper()
    if title not in titles:
      titles[title] = 0
    titles[title] += 1

# Sort by count, most popular first
for title in sorted(titles, key=lambda title: titles[title], reverse=True):
  print(title, titles[title])
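Searching works much like counting. The following is a minimal sketch, assuming a small in-memory CSV in place of a real file; the sample rows and the `count_matches` helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample data standing in for filename.csv
SAMPLE = "title\nThe Office\n the office \nFriends\nTHE OFFICE\n"


def count_matches(file, wanted):
    """Count rows whose canonicalized title equals the wanted title."""
    wanted = wanted.strip().upper()
    counter = 0
    for row in csv.DictReader(file):
        # Canonicalize each title the same way before comparing
        if row["title"].strip().upper() == wanted:
            counter += 1
    return counter


print(count_matches(io.StringIO(SAMPLE), "the office"))
```

Because both the stored titles and the search term are canonicalized in the same way, differences in spacing and capitalization no longer matter.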

Even with standardized data, finding all occurrences of a specific title can become messy because of typos and spelling variants. In this case, we can use REGULAR EXPRESSIONS, a feature of many programming languages (almost a mini-language in themselves) that represents patterns in a standardized way and is used to clean up or validate data.

Some of the regular expression syntax:

  • . any single character
  • .* zero or more characters
  • .+ one or more characters
  • ? makes the preceding character optional
  • ^ start of the input
  • $ end of the input

import csv
import re

counter = 0

with open("filename.csv") as file:
  reader = csv.DictReader(file)
  for row in reader:
    title = row["title"].strip().upper()
    if re.search("^(OFFICE|THE.OFFICE)$", title):
      counter += 1

print(f"Number of people who like The Office: {counter}")
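To see how the individual symbols behave, here is a small sketch that tries a few patterns against a handful of made-up, already canonicalized titles. The titles and the pattern names are assumptions chosen only for illustration.

```python
import re

# Hypothetical titles, already stripped and uppercased
titles = ["OFFICE", "THE OFFICE", "THEOFFICE", "THE OFFICE US", "OFFICES"]

patterns = {
    "exact": "^(OFFICE|THE.OFFICE)$",  # . requires exactly one character
    "optional": "^(THE ?)?OFFICE$",    # ? makes the preceding item optional
    "prefix": "^THE.*",                # .* matches zero or more characters
}

# For each pattern, collect the titles it matches
matches = {
    name: [t for t in titles if re.search(pattern, t)]
    for name, pattern in patterns.items()
}

for name, found in matches.items():
    print(name, found)
```

Note that the "exact" pattern does not match THEOFFICE, because the dot insists on one character between THE and OFFICE, while the "optional" pattern accepts it.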

RELATIONAL DATABASES

Relational databases are programs that store data more efficiently, especially large amounts of data, and that we can interact with using SQL (Structured Query Language). Many software packages implement the SQL language, such as Oracle Database, MySQL, PostgreSQL, Microsoft Access, and SQLite (often used on smartphones because it is very lightweight).

SQL, too, has its own data types to optimize the amount of space used for storing data, which we need to specify when creating a table manually; in SQLite these are BLOB, INTEGER, NUMERIC, REAL, and TEXT.

These databases also rely on just four basic operations, known by the acronym CRUD: CREATE, READ, UPDATE, DELETE.

  • CREATE, INSERT

    Used to create a new table (like a spreadsheet with rows and columns) in a database, or insert new items in a table. We need to know in advance what kind of data the different columns will store (int, text …).
    CREATE TABLE table_name (column_name datatype, …);
    INSERT INTO table (column1,column2) VALUES(value1, value2);
    
    We can also import a csv file as a table in a database, which will automate the process of creating a table.
    sqlite3 db_name.db                 opens SQLite and creates a new database
    .mode csv                          tells SQLite to enter CSV mode
    .import filename.csv table_name    creates a table in our database

If we want to check a database’s design (which tables, columns, and data types it contains), we can use .schema

  • SELECT

    Used to read and show data in a database, either by naming one or more specific columns or by selecting all of them with the wildcard *
    SELECT column_name FROM table_name;
    SELECT * FROM table_name;
    
    SQL can also operate on entire columns through built-in functions and keywords, which can be combined. Some of them are:

    AVG()       Finds the average of a column’s values
    COUNT()     Finds out how many rows are in a table or in a specific column
    SUM()       Sums up the numeric values
    MAX()       Selects the greatest value
    MIN()       Selects the smallest value
    DISTINCT    Selects unique values, avoiding duplicates (values that differ only in formatting still count as different)
    UPPER()     Converts the items to uppercase
    LOWER()     Converts the items to lowercase

    We can also qualify our selection according to some conditions; some useful keywords are:

    WHERE       Filters data according to a condition that is true or false (like a boolean expression)
    LIKE        Allows approximate matching against a string pattern. It’s case-insensitive and supports wildcards (like %, a placeholder for zero or more characters)
    ORDER BY    Sorts the results by the values of one or more columns
    GROUP BY    Groups results by column values
    HAVING      Filters data with conditions, like WHERE, but for groups
    LIMIT       Limits the result list to a specified number of rows

  • UPDATE

    Used to update or change the data in a database. Without a WHERE clause, every row in the table is changed.
    UPDATE table_name SET column_name = new_value;
    
    UPDATE table_name SET column_name = new_value WHERE condition;
    
  • DELETE, DROP

    Used to delete the data in a database or to drop an entire table.
    DELETE FROM table_name WHERE column_name LIKE condition;
    
    DROP TABLE table_name;
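
The four CRUD operations can be tried out from Python with the standard-library `sqlite3` module. This is only a sketch under stated assumptions: the `favorites` table, its columns, and the sample rows are invented for illustration, and an in-memory database is used so nothing is written to disk.

```python
import sqlite3

# In-memory database, so nothing is written to disk
db = sqlite3.connect(":memory:")

# CREATE: the column types are declared up front
db.execute("CREATE TABLE favorites (id INTEGER PRIMARY KEY, title TEXT, genre TEXT)")

# INSERT: hypothetical sample rows
rows = [
    ("The Office", "Comedy"),
    ("the office", "Comedy"),
    ("Friends", "Comedy"),
    ("Breaking Bad", "Drama"),
]
db.executemany("INSERT INTO favorites (title, genre) VALUES (?, ?)", rows)

# UPDATE: canonicalize every title in place (no WHERE, so all rows change)
db.execute("UPDATE favorites SET title = UPPER(TRIM(title))")

# SELECT with COUNT, GROUP BY, ORDER BY and LIMIT: the most popular title
top = db.execute(
    "SELECT title, COUNT(*) AS n FROM favorites "
    "GROUP BY title ORDER BY n DESC LIMIT 1"
).fetchone()
print(top)

# DELETE: remove every row whose title matches a LIKE pattern
db.execute("DELETE FROM favorites WHERE title LIKE '%office%'")
remaining = db.execute("SELECT COUNT(*) FROM favorites").fetchone()[0]
print(remaining)
```

The DELETE removes both Office rows even though the pattern is lowercase, since LIKE in SQLite ignores case for ASCII letters.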
    

The efficiency of relational databases comes from the ability to have multiple tables and link them together; this way we avoid redundancy, and the data is cleaner and better designed. Tables are connected to each other through PRIMARY and FOREIGN KEYS: typically, each table has a numeric ID column that other tables can refer to. The PRIMARY KEY uniquely identifies each row in its own table, while a FOREIGN KEY is a column that refers to the primary key of some other table.

Relationships between tables in a SQL database come in three types:

  • One-to-one: one row in table A is linked with zero or one row in table B.
  • One-to-many: one row in table A may be linked with many rows in table B.
  • Many-to-many: each row in one table can be linked to many rows in another table, and vice versa (for example, one show can have many genres, and one genre can belong to many shows).
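
As a rough sketch of how keys express these relationships, the following builds a many-to-many link between people and shows through a third table. The table and column names (`people`, `shows`, `stars`) are assumptions chosen to match the queries in these notes, and the sample rows are invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    -- Join table: each row links one person to one show (many-to-many overall)
    CREATE TABLE stars (
        person_id INTEGER REFERENCES people(id),
        show_id INTEGER REFERENCES shows(id)
    );
    INSERT INTO people (id, name) VALUES (1, 'Steve Carell');
    INSERT INTO shows (id, title) VALUES (1, 'The Office');
    INSERT INTO stars (person_id, show_id) VALUES (1, 1);
""")

# A foreign key pointing at a row that does not exist is rejected
try:
    db.execute("INSERT INTO stars (person_id, show_id) VALUES (1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The `stars` table holds nothing but pairs of IDs, so a person's name or a show's title is stored exactly once, however many links exist.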

By using IDs, we can now combine data from two or more tables to obtain our result, and there are a few ways to do that:

  • By listing multiple tables in the query and setting conditions that let them combine properly
    SELECT title 
    FROM people, stars, shows 
    WHERE people.id = stars.person_id 
    AND stars.show_id = shows.id 
    AND name = 'Steve Carell';
    
  • By nesting one query inside another: the inner query returns a list of IDs, and the outer query uses those to select the rows that match.
    SELECT column 
    FROM table1 
    WHERE id IN (
      SELECT id_column FROM table2 WHERE condition);
    
  • By using JOIN syntax, combining tables, and using their columns as though they were one table.
    SELECT title 
    FROM people 
    JOIN stars ON people.id = stars.person_id 
    JOIN shows ON stars.show_id = shows.id 
    WHERE name = 'Steve Carell';
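
The three approaches above should return the same rows. Here is a minimal sketch that checks this on a tiny in-memory database; the sample people and shows are made up, and the name is passed through a ? placeholder rather than written into the query.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE stars (person_id INTEGER, show_id INTEGER);
    INSERT INTO people VALUES (1, 'Steve Carell'), (2, 'Jennifer Aniston');
    INSERT INTO shows VALUES (1, 'The Office'), (2, 'Friends');
    INSERT INTO stars VALUES (1, 1), (2, 2);
""")

# Several tables in FROM, matched up by conditions in WHERE
implicit_join = """
    SELECT title FROM people, stars, shows
    WHERE people.id = stars.person_id
    AND stars.show_id = shows.id
    AND name = ?
"""
# Nested queries: each inner query returns IDs for the outer one
nested = """
    SELECT title FROM shows WHERE id IN (
        SELECT show_id FROM stars WHERE person_id IN (
            SELECT id FROM people WHERE name = ?))
"""
# Explicit JOIN syntax
explicit_join = """
    SELECT title FROM people
    JOIN stars ON people.id = stars.person_id
    JOIN shows ON stars.show_id = shows.id
    WHERE name = ?
"""

results = [
    db.execute(query, ("Steve Carell",)).fetchall()
    for query in (implicit_join, nested, explicit_join)
]
print(results)
```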
    

INDEX

An index is a database data structure that lets us retrieve data faster than linear search would. Creating one builds a B-tree (like C’s binary trees, but with more children per node) over the specified column, which speeds up queries on it. The downside is that each index takes up some space, so we will be using more memory.

CREATE INDEX index_name ON table (column);
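
One way to observe the effect of an index is SQLite's EXPLAIN QUERY PLAN, which describes how a query would be executed. The sketch below assumes a hypothetical `shows` table and an index named `title_index`; the exact wording of the plan differs between SQLite versions, so it is printed rather than assumed.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE shows (id INTEGER PRIMARY KEY, title TEXT, year INTEGER)")

query = "EXPLAIN QUERY PLAN SELECT * FROM shows WHERE title = 'The Office'"


def plan(connection):
    # The last column of each plan row is a human-readable description
    return " ".join(row[-1] for row in connection.execute(query))


plan_before = plan(db)  # without an index, the whole table is scanned
db.execute("CREATE INDEX title_index ON shows (title)")
plan_after = plan(db)   # with the index, the plan mentions title_index

print(plan_before)
print(plan_after)
```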

SQL PROBLEMS

SQL can also run into problems, such as the SQL injection attack, where someone injects their own commands into inputs that we then run on our database. An attacker can type something that looks like SQL and trick the database into doing something we didn’t intend (for example, logging in without a password). This happens especially when a SQL query is built by formatting user input into a string; a solution is to use SQL’s ? placeholder, which prevents dangerous characters from being interpreted as part of the command.
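
The difference between a formatted string and a placeholder can be sketched as follows. The `users` table, the stored credentials, and the malicious input are all invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'secret')")

# Malicious input: the quote ends the string and -- comments out the password check
username = "alice'--"
password = "wrong"

# Unsafe: the input is pasted into the query text and becomes part of the command
unsafe_query = (
    f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
)
unsafe_rows = db.execute(unsafe_query).fetchall()

# Safe: the ? placeholders pass the input as plain values, never as SQL
safe_rows = db.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (username, password),
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```

The unsafe query finds alice's row despite the wrong password, while the placeholder version looks for a user literally named `alice'--` and finds nothing.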

Another set of problems with databases is race conditions, where shared data is unintentionally changed by code running on different devices or servers at the same time. This is a serious problem for applications with multiple servers.

To solve this problem, SQL supports transactions, where we can lock rows in a database so that a particular set of actions is guaranteed to happen together (atomically). To do that, we wrap the relevant lines of our code in specific statements:

  • BEGIN TRANSACTION
  • COMMIT
  • ROLLBACK

But the more transactions we have, the slower our applications might be since each server has to wait for other servers’ transactions to finish.
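
A minimal sketch of these statements in use, assuming a hypothetical `accounts` table whose balances may not go negative. The connection is opened with `isolation_level=None` so that Python's `sqlite3` module leaves transaction control to the explicit statements.

```python
import sqlite3

# isolation_level=None leaves transaction control to our explicit statements
db = sqlite3.connect(":memory:", isolation_level=None)
db.execute("CREATE TABLE accounts (name TEXT, balance INTEGER CHECK (balance >= 0))")
db.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")


def transfer(amount, sender, receiver):
    """Move money between accounts; either both updates happen or neither."""
    db.execute("BEGIN TRANSACTION")
    try:
        db.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?",
            (amount, receiver),
        )
        db.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?",
            (amount, sender),
        )
        db.execute("COMMIT")
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so undo the credit as well
        db.execute("ROLLBACK")
        return False


def balances():
    return dict(db.execute("SELECT name, balance FROM accounts"))


ok = transfer(30, "alice", "bob")       # succeeds and is committed
failed = transfer(500, "alice", "bob")  # would overdraw alice, so it is rolled back
final = balances()
print(ok, failed, final)
```

In the failed transfer, bob's credit had already been applied when alice's debit was rejected; ROLLBACK undoes it, so the balances are left as they were after the first transfer.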

If you'd like to practice with spaced repetition, feel free to download my Anki cards.