SQL / Relational

[1] Why we (I) love relational databases

  • ACID
    • Atomicity (transactions [multiple operations can be "all or nothing"] - even with externalities such as power loss)
    • Consistency (what's on disk is always acceptable as a state - including constraints, etc.)
    • Isolation (no transaction affects another)
    • Durability (if an insert completes, it is really written to disk)
  • Do we need these?
  • What apps don't need these things?
    • If Facebook fails to provide you with 1 in 100,000 status updates from your friends, what is the cost?
  • Can we get these properties from a database distributed across a network? (CAP Theorem: See http://www.youtube.com/watch?v=Jw1iFr4v58M and http://blog.beany.co.kr/wp-content/uploads/2011/03/nosql_cap.png)

[2] Creating a database and connecting (from shell)

# PostgreSQL

dropdb example
createdb example
psql example

# sqlite3
rm example.db        # drop a sqlite3 database (just remove the file)
sqlite3 example.db   # if the db doesn't exist, creates it; starts the command-line interface

[3] Dropping and creating a table

The column with type serial for PostgreSQL, or 'integer primary key autoincrement` for sqlite3 is a primary key that gets a value automatically when you insert a new row. By this means, you don't have to find out what the highest id is; the database system sets the key value appropriately automatically.

The -- indicates a comment.

# PostgreSQL

drop table names;
create table names (
    id serial, -- "auto increment"
    first_name text
);

# sqlite3

drop table names;
create table names (
    id integer primary key autoincrement,
    first_name text
);

[4] Another

drop table books;
create table books (
    id integer primary key autoincrement, -- remember, for PostgreSQL, use: serial
    name_id int,
    title text
);

[5] Adding data

delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');
  • Tip: Always give the column names so that you are not dependent on the created order

[6] more

delete from books;
insert into books (name_id, title) values (1, 'The C Programming Language');
insert into books (name_id, title) values (1, 'The Ruby Programming Language');
insert into books (name_id, title) values (2, 'Java Concurrency in Practice');
insert into books (name_id, title) values (null, 'SQL for Smarties');
-- Notice: We are assuming we know the values for names.id
-- What if we re-ran the inserts for names?

[7] query - wildcard for columns

select * from names;
select * from books;

[8] query - rearranging columns and specifying order

select first_name, id from names;
select name_id, id, title from books;

[9] cross join - return "product" (all combinations) of two tables

select *
from names, books;
-- don't do this with big tables

[10] same thing, but putting ids next together

select names.first_name, names.id as "names.id",
  books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
order by names.id;
-- don't do this with big tables

[11] inner join - Venn diagram


Inner join: What we want is represented by the orange section


[12] inner join - SQL - "Give me data from rows from both tables where the condition is true"

-- "explicit join notation"
select first_name, title
from books join names
  on names.id = books.name_id;

-- "implicit join notation"
select first_name, title
from books, names
where names.id = books.name_id;

-- Same as "product" join above, but using where clause
select names.first_name, names.id as "names.id",
  books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
where names.id = books.name_id
order by names.id;

[13] left join - Venn diagram


Left join: What we want is represented by the salmon and orange sections


[14] left join - SQL - "Give me data for 'left' table, and data the 'right' table when the condition is true, but nulls for the 'right' table when the condition is false"

select names.first_name, books.title
from names left join books
  on books.name_id = names.id;

 first_name |             title             
------------+-------------------------------
 John       | The C Programming Language
 John       | The Ruby Programming Language
 James      | Java Concurrency in Practice
 Tom        | 
(4 rows)

[15] left join is dominate pattern in ActiveRecord . . .

  • why?

[16] left join is dominate pattern in ActiveRecord

  • when we send an association message to a relation, we expect that association to be applied to each element
  • Note, however, that ActiveRecord avoids real left joins

[17] Now that you're so smart

  • How many rows will each of the following queries return?

      select * from names left  join books on books.name_id = names.id;
      select * from names right join books on books.name_id = names.id;
      select * from books left  join names on books.name_id = names.id;
      select * from books right join names on books.name_id = names.id;
    

[18] demo - study the log

    irb
    require 'active_record'
    require 'pg'
    require 'logger'
    B = ActiveRecord::Base
    B.logger = Logger.new(STDOUT)
    M = ActiveRecord::Migration
    
    spec = { adapter: :postgresql, database: :ex }
    B.establish_connection(adapter: :postgresql, database: :postgres, schema_search_path: :public)
    B.connection.drop_database spec[:database]
    B.connection.create_database spec[:database]
    B.establish_connection spec
    
    B.mass_assignment_sanitizer = :logger
    CN = Class.new(M) { def change; create_table :names do |t| t.text :first_name end; end }
    CB = Class.new(M) { def change; create_table :books do |t| t.text :title; t.references :name; end; end }
    CN.migrate(:up)
    CB.migrate(:up)
    Name = Class.new(B) { has_many :books }
    Book = Class.new(B) { belongs_to :name }

    john = Name.create! first_name: 'John'
    james = Name.create! first_name: 'James'
    tom = Name.create! first_name: 'Tom'
    c = Book.create! name: john, title: 'The C Programming Language'
    ruby = Book.create! name: john, title: 'The Ruby Programming Language'
    java = Book.create! name: james, title: 'Java Concurrency in Practice'

    # Let's discuss what we see in the log
    
    Name.all(include: :books)
    Name.all(joins: :books)
    Name.all(joins: 'left join books on books.name_id = names.id')
    Name.all(joins: 'left join books on books.name_id = names.id', include: :books)

[19] obviously not the way we would represent book ownership for realsies

  • We would use a join table in between names and books - Now a book can be owned by more than one person (but notice: this is arguably the wrong modeling)
  • Names and ownership

      select names.first_name, books.title
      from names, books, names_books
      where names.id = names_books.name_id
      and books.id = names_books.book_id;
    

[20] explain plan

  • table with slow query

      drop table example;
      create table example
        as select generate_series(1, 1000000) as id,
          'some message'::varchar(100) as message;
      insert into example values (1000001, 'different message');
      select * from example where message = 'different message'; -- 250ms
    
  • see the query plan

      explain select * from example where message = 'different message';
    
  • add an index, re-run query, and check plan again

      create index on example (message);
      select * from example where message = 'different message'; --16ms
      explain select * from example where message = 'different message';
    

[21] Aggregation and group by, having, etc.

  • group by provides for a means of showing the distinct values for a column

      select message from example group by message;
      -- compare:
      -- select distinct message from example;
    
  • But even better, we can aggregate according to the groups

      select message, count(*) from example group by message;
    
  • But even mo' better, we can as for groups having certain characteristics

      select message, count(*) from example group by message having count(*) = 1;
    

[22] Subquery expressions

  • You want a predicate that checks another table

  • Example: find all patients with an appointment

      select count(*) from patients
      where exists (
        select 1
        from appointment_storages
        where appointment_storages.patient_id = patients.id
        );
    
  • Other subquery expressions: in, not in, any; see docs (http://www.postgresql.org/docs/9.2/static/functions-subquery.html)

  • Why? Expressive. Compare this wrong query trying to do the same thing

      select count(*)
      from patients, appointment_storages
      where appointment_storages.patient_id = patients.id;
    
  • And compare this correct query trying to do the same thing

      select count(distinct patients.id)
      from patients, appointment_storages
      where appointment_storages.patient_id = patients.id;
    

[23] More about explain

  • Can see cost estimates; first value is startup, second is doing the work

      explain select count(*) from patients
      where exists (
        select 1
        from appointment_storages
        where appointment_storages.patient_id = patients.id
        );
    
  • Can also run (with analyze) to see actuals vs. estimates

      explain analyze select count(*) from patients
      where exists (
        select 1
        from appointment_storages
        where appointment_storages.patient_id = patients.id
        );
    
  • Hey, why is explain telling me it's going to do a sequential scan even though I have an index?

  • pgadmin has a visual explainer (demo)

  • Recommended: http://wiki.postgresql.org/images/4/45/Explaining_EXPLAIN.pdf

  • Want more? See PostgreSQL 9.0 High Performance (in Engineering library)

[24] self-joins

  • The classic problem: Find all instances of a staff member creating a task before another is completed.

  • Always explore your data first <-- best practice

  • First find a user with not too many tasks

      select users.id, users.email, practices.name, count(tasks.*)
      from users, tasks, practices
      where tasks.created_by_id = users.id
      and practices.id = users.practice_id
      group by users.id, users.email, practices.name
      having count(tasks.*) < 50
      order by count(tasks.*) desc;
    
  • Look at tasks for such a user to get a feel for the data

      select tasks.title, tasks.created_at, tasks.completed_at
      from tasks, users
      where tasks.created_by_id = users.id
      and users.id = 66
      order by created_at asc;
    
  • now join tasks with tasks

      select a.title || ' (' || a.id || ') overlaps ' || b.title || ' (' || b.id || ')'
      from tasks a, tasks b
      where a.created_by_id = b.created_by_id
      and b.created_at >= a.created_at
      and b.created_at <= a.completed_at
      and a.id != b.id
      and a.created_by_id = 66
      and b.created_by_id = 66
      order by a.id asc;
    
  • inspect some examples

      select * from tasks where id in (17359, 17401, 17402, 20457);
    

[25] Another: Find the first weight measurement (or null) for every patient

  • Let's get the earliest measured_at for each patient/marker combination

      select patient_id, marker_cname, min(measured_at) as first_measured_at
      from measurements
      group by patient_id, marker_cname;
    
  • We are using min() here. Could there be two measurements for the same patient and marker with the same measured_at?

      select m.patient_id, m.marker_cname, m.measured_at, count(*)
      from (
        select patient_id, marker_cname, min(measured_at) as first_measured_at
        from measurements
        group by patient_id, marker_cname) firsts,
      measurements m
      where firsts.patient_id = m.patient_id
      and firsts.marker_cname = m.marker_cname
      and firsts.first_measured_at = m.measured_at
      group by m.patient_id, m.marker_cname, m.measured_at
      having count(*) > 1;
    
  • Yes. Well that sucks. That might be a bug or deserve a constraint. What we will do is get the first id for the earliest measured_at for each patient/marker combination.

      select m.patient_id, m.marker_cname, min(id) as id
      from (
        select patient_id, marker_cname, min(measured_at) as first_measured_at
        from measurements
        group by patient_id, marker_cname) firsts,
      measurements m
      where firsts.patient_id = m.patient_id
      and firsts.marker_cname = m.marker_cname
      and firsts.first_measured_at = m.measured_at
      group by m.patient_id, m.marker_cname;
    
  • Now this is a very useful set of rows. We can do left joins from the patients table into this and select for various markers; and then join from this table into measurements. We can create a temporary table and direct PG to drop it after a commit. Then we can use and re-use this temporary table in a series of left joins.

      create temporary table firsts on commit drop as
      select m.patient_id, m.marker_cname, min(id) as id
      from (
        select patient_id, marker_cname, min(measured_at) as first_measured_at
        from measurements
        group by patient_id, marker_cname) firsts,
      measurements m
      where firsts.patient_id = m.patient_id
      and firsts.marker_cname = m.marker_cname
      and firsts.first_measured_at = m.measured_at
      group by m.patient_id, m.marker_cname;
    
      select p.id, weight, weight_first_measured_at, height, height_first_measured_at, ldl, ldl_first_measured_at
      from patients p
        left join (
          select f.patient_id, m.value as weight, m.measured_at as weight_first_measured_at
          from firsts f, measurements m where f.marker_cname = 'weight' and m.id = f.id) weights
          on weights.patient_id = p.id
        left join (
          select f.patient_id, m.value as height, m.measured_at as height_first_measured_at
          from firsts f, measurements m where f.marker_cname = 'height' and m.id = f.id) heights
          on heights.patient_id = p.id
        left join (
          select f.patient_id, m.value as ldl, m.measured_at as ldl_first_measured_at
          from firsts f, measurements m where f.marker_cname = 'ldl' and m.id = f.id) ldls
          on ldls.patient_id = p.id
      ;
    

[26] PG "distinct on"

  • When you want the first row in a group, you can use PG's "distinct on" clause (http://www.postgresql.org/docs/9.2/static/sql-select.html#SQL-DISTINCT)

  • But watch out, this can return surprising results

  • People commonly use it to mean: "Give me the 'top' something in this group" -- but what if there is a tie? Example

      drop table if exists employees;
      create table employees (
      name varchar(100),
      department integer,
      salary integer);
      insert into employees (name, department, salary) values ('John', 1, 20000);
      insert into employees (name, department, salary) values ('Joe', 1, 20000);
      insert into employees (name, department, salary) values ('Jill', 2, 40000);
    
      select department, max(salary) as max_salary
      from employees
      group by department;
    
      select e.department, e.name, e.salary
      from employees e,
      (select department, max(salary) as max_salary
      from employees
      group by department) maxs
      where e.department = maxs.department
      and e.salary = maxs.max_salary;
    
      select distinct on (department) *
      from employees
      order by department, salary desc;
    
  • Advice: Learn your SQL according to the standard (see Appendix B in SQL Cookbook)

[27] What is a PG database?

  • A PG database is that to which you connect
  • Got it? You can only query a database to which you are connected
  • A database contains schemas which contain tables
  • We don't deal with schemas because our tables are in the default schema (public)

[28] What is a PG schema?

  • Think of a PostgreSQL schema as a prefix to a table name (public.patients)
  • A database can have tables in different schemas
  • PG will match unprefixed table names according to a search path
  • Why do we like it?
    • Tables in a schema can be dropped together
    • Tables in a schema can be authorized separately
    • Tables can have the same name if in different schemas

[29] What is a PG tablespace?

  • A PG tablespace is a location in the file system where tables (etc.) go
  • Put frequently-queried tables on a fast disk
  • Make it easier for backups
  • NOTE: In Master/Slave, must have the identical tablespace (location) on both machines

[30] When should I add an index?

  • When your query is slow
  • What is slow?
    • Look at the logs and compare to other queries
  • Will an index help me?
    • Try explain analyze on your query
    • Add an index
    • Rerun explain analyze
  • Don't waste your time on this exercise on your development machine (PG will for instance just use sequential access for small data sets, 'cos it's fast); use something closer to production

[31] What is a constraint?

  • Check constraint: Disallow insertion of a row where a column value violates a rule (e.g., height > 144 inches)
  • Not null constraint: Disallow insertion of a row where a column value is null
  • Foreign key constraint: Disallow insertion of a row with foreign key value not found as a primary key in the other table

[32] Why don't Rails developers use foreign key constraints? (and other constraints)

  • History: Very difficult to create an ORM that can define constraint types in a database-agnostic way
    • Surely you jest?
  • History: Lack of clarity around tier responsibility: E.g., who should validate?
  • History: Inexperience with transactions
  • History: Probably not important in Basecamp (except for billing)
  • History: There was a time when people liked polymorphic associations in ActiveRecord; and how would you constrain such keys? (Triggers?)
  • Lazy

[33] Why don't we use foreign key constraints? (and other constraints)

  • Lazy
  • To date, stories have not defined data semantics, only behavioral semantics
  • To date, stories have not defined cost of anomalies
  • schema.rb

[34] pros and cons of deleting rows from the DB (vs. flagging them as 'deleted')

  • Discussion

[35] pros and cons of free-text fields

  • Discussion

[36] editing data in-place vs. storing new versions with each edit

  • Discussion

[37] Recommendations for Further Reading

  1. "SQL Reference" in Enterprise Java in a Nutshell (O'Reilly, 2005)
  2. Molinaro, SQL Cookbook (O'Reilly, 2005)
  3. Rozenshtein, The Essence of SQL (1997)
  4. Celko, SQL for Smarties (2010) (other books can be good; some people like The SQL Puzzle Book)
  5. Karwin, SQL Antipatterns (2010) (considers EAV to be an anti-pattern; discusses rounding errors; problems with grouping; much else)
  6. Venn diagrams: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html

[37] Resources to avoid

  1. Elmasri, Fundamentals of Database Systems (6th ed., 2010). Regrettably, this book is a mess and tends to go for the most impenetrable explanation possible. Still, it has stuff on modeling and normalization that is hard to find.
  2. Anything on the web; seriously, watch out. People say really stupid stuff about SQL.
  3. Any book that is focused on one database. You will be unhappy.

[38] Populating names and books with join table

drop table names;
create table names (
    id serial, -- "auto increment"
    first_name text
);
drop table books;
create table books (
    id serial, -- "auto increment"
    title text
);

delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');
insert into names (first_name) values ('Patrick');

delete from books;
insert into books (title) values ('The C Programming Language');
insert into books (title) values ('The Ruby Programming Language');
insert into books (title) values ('Java Concurrency in Practice');
insert into books (title) values ('SQL for Smarties');
insert into books (title) values ('Specification by Example');

drop table if exists names_books;
create table names_books (
  name_id integer,
  book_id integer
);

insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The C Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The Ruby Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'SQL for Smarties'));
insert into names_books values ((select id from names where first_name = 'James'), (select id from books where title = 'Java Concurrency in Practice'));
insert into names_books values ((select id from names where first_name = 'Tom'), (select id from books where title = 'SQL for Smarties'));


select names.first_name, books.title
from names, books, names_books
where names.id = names_books.name_id
and books.id = names_books.book_id;

Other topics

[39] The meaning of null

See Elmasri.

[40] SQL injection

[41] Prepared statements

[42] Domain modeling

See Elmasri.

[43] Normalization

See Elmasri.

[44] Star schemas

[45] Data warehousing

[46] ETL