SQL / Relational

[1] Why we (I) love relational databases

ACID
- Atomicity (transactions [multiple operations can be "all or nothing"] - even with externalities such as power loss)
- Consistency (what's on disk is always acceptable as a state - including constraints, etc.)
- Isolation (no transaction affects another)
- Durability (if an insert completes, it is really written to disk)
Do we need these?
What apps don't need these things?
- If Facebook fails to provide you with 1 in 100,000 status updates from your friends, what is the cost?
Can we get these properties from a database distributed across a network? (CAP Theorem: See http://www.youtube.com/watch?v=Jw1iFr4v58M and http://blog.beany.co.kr/wp-content/uploads/2011/03/nosql_cap.png)

[2] Creating a database and connecting (from shell)

# PostgreSQL

dropdb example
createdb example
psql example

# sqlite3
rm example.db        # drop a sqlite3 database (just remove the file)
sqlite3 example.db   # if the db doesn't exist, creates it; starts the command-line interface

[3] Dropping and creating a table

The column with type serial for PostgreSQL, or 'integer primary key autoincrement` for sqlite3 is a primary key that gets a value automatically when you insert a new row. By this means, you don't have to find out what the highest id is; the database system sets the key value appropriately automatically.

The -- indicates a comment.

# PostgreSQL

drop table names;
create table names (
    id serial, -- "auto increment"
    first_name text
);

# sqlite3

drop table names;
create table names (
    id integer primary key autoincrement,
    first_name text
);

[4] Another

drop table books;
create table books (
    id integer primary key autoincrement, -- remember, for PostgreSQL, use: serial
    name_id int,
    title text
);

[5] Adding data

delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');

Tip: Always give the column names so that you are not dependent on the created order

[6] more

delete from books;
insert into books (name_id, title) values (1, 'The C Programming Language');
insert into books (name_id, title) values (1, 'The Ruby Programming Language');
insert into books (name_id, title) values (2, 'Java Concurrency in Practice');
insert into books (name_id, title) values (null, 'SQL for Smarties');
-- Notice: We are assuming we know the values for names.id
-- What if we re-ran the inserts for names?

[7] query - wildcard for columns

select * from names;
select * from books;

[8] query - rearranging columns and specifying order

select first_name, id from names;
select name_id, id, title from books;

[9] cross join - return "product" (all combinations) of two tables

select *
from names, books;
-- don't do this with big tables

[10] same thing, but putting ids next together

select names.first_name, names.id as "names.id",
  books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
order by names.id;
-- don't do this with big tables

[11] inner join - Venn diagram

Inner join: What we want is represented by the orange section

[12] inner join - SQL - "Give me data from rows from both tables where the condition is true"

-- "explicit join notation"
select first_name, title
from books join names
  on names.id = books.name_id;

-- "implicit join notation"
select first_name, title
from books, names
where names.id = books.name_id;

-- Same as "product" join above, but using where clause
select names.first_name, names.id as "names.id",
  books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
where names.id = books.name_id
order by names.id;

[13] left join - Venn diagram

Left join: What we want is represented by the salmon and orange sections

[14] left join - SQL - "Give me data for 'left' table, and data the 'right' table when the condition is true, but nulls for the 'right' table when the condition is false"

select names.first_name, books.title
from names left join books
  on books.name_id = names.id;

 first_name |             title             
------------+-------------------------------
 John       | The C Programming Language
 John       | The Ruby Programming Language
 James      | Java Concurrency in Practice
 Tom        | 
(4 rows)

[15] left join is dominate pattern in ActiveRecord . . .

why?

[16] left join is dominate pattern in ActiveRecord

when we send an association message to a relation, we expect that association to be applied to each element
Note, however, that ActiveRecord avoids real left joins

[17] Now that you're so smart

How many rows will each of the following queries return?

  select * from names left  join books on books.name_id = names.id;
  select * from names right join books on books.name_id = names.id;
  select * from books left  join names on books.name_id = names.id;
  select * from books right join names on books.name_id = names.id;

[18] demo - study the log

    irb
    require 'active_record'
    require 'pg'
    require 'logger'
    B = ActiveRecord::Base
    B.logger = Logger.new(STDOUT)
    M = ActiveRecord::Migration
    
    spec = { adapter: :postgresql, database: :ex }
    B.establish_connection(adapter: :postgresql, database: :postgres, schema_search_path: :public)
    B.connection.drop_database spec[:database]
    B.connection.create_database spec[:database]
    B.establish_connection spec
    
    B.mass_assignment_sanitizer = :logger
    CN = Class.new(M) { def change; create_table :names do |t| t.text :first_name end; end }
    CB = Class.new(M) { def change; create_table :books do |t| t.text :title; t.references :name; end; end }
    CN.migrate(:up)
    CB.migrate(:up)
    Name = Class.new(B) { has_many :books }
    Book = Class.new(B) { belongs_to :name }

    john = Name.create! first_name: 'John'
    james = Name.create! first_name: 'James'
    tom = Name.create! first_name: 'Tom'
    c = Book.create! name: john, title: 'The C Programming Language'
    ruby = Book.create! name: john, title: 'The Ruby Programming Language'
    java = Book.create! name: james, title: 'Java Concurrency in Practice'

    # Let's discuss what we see in the log
    
    Name.all(include: :books)
    Name.all(joins: :books)
    Name.all(joins: 'left join books on books.name_id = names.id')
    Name.all(joins: 'left join books on books.name_id = names.id', include: :books)

[19] obviously not the way we would represent book ownership for realsies

We would use a join table in between names and books - Now a book can be owned by more than one person (but notice: this is arguably the wrong modeling)

Names and ownership

  select names.first_name, books.title
  from names, books, names_books
  where names.id = names_books.name_id
  and books.id = names_books.book_id;

[20] explain plan

table with slow query

  drop table example;
  create table example
    as select generate_series(1, 1000000) as id,
      'some message'::varchar(100) as message;
  insert into example values (1000001, 'different message');
  select * from example where message = 'different message'; -- 250ms

see the query plan

  explain select * from example where message = 'different message';

add an index, re-run query, and check plan again

  create index on example (message);
  select * from example where message = 'different message'; --16ms
  explain select * from example where message = 'different message';

[21] Aggregation and group by, having, etc.

group by provides for a means of showing the distinct values for a column

  select message from example group by message;
  -- compare:
  -- select distinct message from example;

But even better, we can aggregate according to the groups

  select message, count(*) from example group by message;

But even mo' better, we can as for groups having certain characteristics

  select message, count(*) from example group by message having count(*) = 1;

[22] Subquery expressions

You want a predicate that checks another table

Example: find all patients with an appointment

  select count(*) from patients
  where exists (
    select 1
    from appointment_storages
    where appointment_storages.patient_id = patients.id
    );

Other subquery expressions: in, not in, any; see docs (http://www.postgresql.org/docs/9.2/static/functions-subquery.html)

Why? Expressive. Compare this wrong query trying to do the same thing

  select count(*)
  from patients, appointment_storages
  where appointment_storages.patient_id = patients.id;

And compare this correct query trying to do the same thing

  select count(distinct patients.id)
  from patients, appointment_storages
  where appointment_storages.patient_id = patients.id;

[23] More about explain

Can see cost estimates; first value is startup, second is doing the work

  explain select count(*) from patients
  where exists (
    select 1
    from appointment_storages
    where appointment_storages.patient_id = patients.id
    );

Can also run (with analyze) to see actuals vs. estimates

  explain analyze select count(*) from patients
  where exists (
    select 1
    from appointment_storages
    where appointment_storages.patient_id = patients.id
    );

Hey, why is explain telling me it's going to do a sequential scan even though I have an index?
pgadmin has a visual explainer (demo)
Recommended: http://wiki.postgresql.org/images/4/45/Explaining_EXPLAIN.pdf
Want more? See PostgreSQL 9.0 High Performance (in Engineering library)

[24] self-joins

The classic problem: Find all instances of a staff member creating a task before another is completed.
Always explore your data first <-- best practice

First find a user with not too many tasks

  select users.id, users.email, practices.name, count(tasks.*)
  from users, tasks, practices
  where tasks.created_by_id = users.id
  and practices.id = users.practice_id
  group by users.id, users.email, practices.name
  having count(tasks.*) < 50
  order by count(tasks.*) desc;

Look at tasks for such a user to get a feel for the data

  select tasks.title, tasks.created_at, tasks.completed_at
  from tasks, users
  where tasks.created_by_id = users.id
  and users.id = 66
  order by created_at asc;

now join tasks with tasks

  select a.title || ' (' || a.id || ') overlaps ' || b.title || ' (' || b.id || ')'
  from tasks a, tasks b
  where a.created_by_id = b.created_by_id
  and b.created_at >= a.created_at
  and b.created_at <= a.completed_at
  and a.id != b.id
  and a.created_by_id = 66
  and b.created_by_id = 66
  order by a.id asc;

inspect some examples

  select * from tasks where id in (17359, 17401, 17402, 20457);

[25] Another: Find the first weight measurement (or null) for every patient

Let's get the earliest measured_at for each patient/marker combination

  select patient_id, marker_cname, min(measured_at) as first_measured_at
  from measurements
  group by patient_id, marker_cname;

We are using min() here. Could there be two measurements for the same patient and marker with the same measured_at?

  select m.patient_id, m.marker_cname, m.measured_at, count(*)
  from (
    select patient_id, marker_cname, min(measured_at) as first_measured_at
    from measurements
    group by patient_id, marker_cname) firsts,
  measurements m
  where firsts.patient_id = m.patient_id
  and firsts.marker_cname = m.marker_cname
  and firsts.first_measured_at = m.measured_at
  group by m.patient_id, m.marker_cname, m.measured_at
  having count(*) > 1;

Yes. Well that sucks. That might be a bug or deserve a constraint. What we will do is get the first id for the earliest measured_at for each patient/marker combination.

  select m.patient_id, m.marker_cname, min(id) as id
  from (
    select patient_id, marker_cname, min(measured_at) as first_measured_at
    from measurements
    group by patient_id, marker_cname) firsts,
  measurements m
  where firsts.patient_id = m.patient_id
  and firsts.marker_cname = m.marker_cname
  and firsts.first_measured_at = m.measured_at
  group by m.patient_id, m.marker_cname;

Now this is a very useful set of rows. We can do left joins from the patients table into this and select for various markers; and then join from this table into measurements. We can create a temporary table and direct PG to drop it after a commit. Then we can use and re-use this temporary table in a series of left joins.

  create temporary table firsts on commit drop as
  select m.patient_id, m.marker_cname, min(id) as id
  from (
    select patient_id, marker_cname, min(measured_at) as first_measured_at
    from measurements
    group by patient_id, marker_cname) firsts,
  measurements m
  where firsts.patient_id = m.patient_id
  and firsts.marker_cname = m.marker_cname
  and firsts.first_measured_at = m.measured_at
  group by m.patient_id, m.marker_cname;

  select p.id, weight, weight_first_measured_at, height, height_first_measured_at, ldl, ldl_first_measured_at
  from patients p
    left join (
      select f.patient_id, m.value as weight, m.measured_at as weight_first_measured_at
      from firsts f, measurements m where f.marker_cname = 'weight' and m.id = f.id) weights
      on weights.patient_id = p.id
    left join (
      select f.patient_id, m.value as height, m.measured_at as height_first_measured_at
      from firsts f, measurements m where f.marker_cname = 'height' and m.id = f.id) heights
      on heights.patient_id = p.id
    left join (
      select f.patient_id, m.value as ldl, m.measured_at as ldl_first_measured_at
      from firsts f, measurements m where f.marker_cname = 'ldl' and m.id = f.id) ldls
      on ldls.patient_id = p.id
  ;

[26] PG "distinct on"

When you want the first row in a group, you can use PG's "distinct on" clause (http://www.postgresql.org/docs/9.2/static/sql-select.html#SQL-DISTINCT)
But watch out, this can return surprising results

People commonly use it to mean: "Give me the 'top' something in this group" -- but what if there is a tie? Example

  drop table if exists employees;
  create table employees (
  name varchar(100),
  department integer,
  salary integer);
  insert into employees (name, department, salary) values ('John', 1, 20000);
  insert into employees (name, department, salary) values ('Joe', 1, 20000);
  insert into employees (name, department, salary) values ('Jill', 2, 40000);

  select department, max(salary) as max_salary
  from employees
  group by department;

  select e.department, e.name, e.salary
  from employees e,
  (select department, max(salary) as max_salary
  from employees
  group by department) maxs
  where e.department = maxs.department
  and e.salary = maxs.max_salary;

  select distinct on (department) *
  from employees
  order by department, salary desc;

Advice: Learn your SQL according to the standard (see Appendix B in SQL Cookbook)

[27] What is a PG database?

A PG database is that to which you connect
Got it? You can only query a database to which you are connected
A database contains schemas which contain tables
We don't deal with schemas because our tables are in the default schema (public)

[28] What is a PG schema?

Think of a PostgreSQL schema as a prefix to a table name (public.patients)
A database can have tables in different schemas
PG will match unprefixed table names according to a search path
Why do we like it?
- Tables in a schema can be dropped together
- Tables in a schema can be authorized separately
- Tables can have the same name if in different schemas

[29] What is a PG tablespace?

A PG tablespace is a location in the file system where tables (etc.) go
Put frequently-queried tables on a fast disk
Make it easier for backups
NOTE: In Master/Slave, must have the identical tablespace (location) on both machines

[30] When should I add an index?

When your query is slow
What is slow?
- Look at the logs and compare to other queries
Will an index help me?
- Try explain analyze on your query
- Add an index
- Rerun explain analyze
Don't waste your time on this exercise on your development machine (PG will for instance just use sequential access for small data sets, 'cos it's fast); use something closer to production

[31] What is a constraint?

Check constraint: Disallow insertion of a row where a column value violates a rule (e.g., height > 144 inches)
Not null constraint: Disallow insertion of a row where a column value is null
Foreign key constraint: Disallow insertion of a row with foreign key value not found as a primary key in the other table

[32] Why don't Rails developers use foreign key constraints? (and other constraints)

History: Very difficult to create an ORM that can define constraint types in a database-agnostic way
- Surely you jest?
History: Lack of clarity around tier responsibility: E.g., who should validate?
History: Inexperience with transactions
History: Probably not important in Basecamp (except for billing)
History: There was a time when people liked polymorphic associations in ActiveRecord; and how would you constrain such keys? (Triggers?)
Lazy

[33] Why don't we use foreign key constraints? (and other constraints)

Lazy
To date, stories have not defined data semantics, only behavioral semantics
To date, stories have not defined cost of anomalies
schema.rb

[34] pros and cons of deleting rows from the DB (vs. flagging them as 'deleted')

Discussion

[35] pros and cons of free-text fields

Discussion

[36] editing data in-place vs. storing new versions with each edit

Discussion

[37] Recommendations for Further Reading

"SQL Reference" in Enterprise Java in a Nutshell (O'Reilly, 2005)
Molinaro, SQL Cookbook (O'Reilly, 2005)
Rozenshtein, The Essence of SQL (1997)
Celko, SQL for Smarties (2010) (other books can be good; some people like The SQL Puzzle Book)
Karwin, SQL Antipatterns (2010) (considers EAV to be an anti-pattern; discusses rounding errors; problems with grouping; much else)
Venn diagrams: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html

[37] Resources to avoid

Elmasri, Fundamentals of Database Systems (6th ed., 2010). Regrettably, this book is a mess and tends to go for the most impenetrable explanation possible. Still, it has stuff on modeling and normalization that is hard to find.
Anything on the web; seriously, watch out. People say really stupid stuff about SQL.
Any book that is focused on one database. You will be unhappy.

[38] Populating names and books with join table

drop table names;
create table names (
    id serial, -- "auto increment"
    first_name text
);
drop table books;
create table books (
    id serial, -- "auto increment"
    title text
);

delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');
insert into names (first_name) values ('Patrick');

delete from books;
insert into books (title) values ('The C Programming Language');
insert into books (title) values ('The Ruby Programming Language');
insert into books (title) values ('Java Concurrency in Practice');
insert into books (title) values ('SQL for Smarties');
insert into books (title) values ('Specification by Example');

drop table if exists names_books;
create table names_books (
  name_id integer,
  book_id integer
);

insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The C Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The Ruby Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'SQL for Smarties'));
insert into names_books values ((select id from names where first_name = 'James'), (select id from books where title = 'Java Concurrency in Practice'));
insert into names_books values ((select id from names where first_name = 'Tom'), (select id from books where title = 'SQL for Smarties'));


select names.first_name, books.title
from names, books, names_books
where names.id = names_books.name_id
and books.id = names_books.book_id;

sql