[1] Why we (I) love relational databases
-
ACID
- Atomicity (if you cluster a number of operations as a "transaction," then they will all be executed or none will be executed - even with externalities such as power loss)
- Consistency (what's on disk is always acceptable - including constraints, etc. In other words, if you unplug the computer that has your database, the database stored on disk will work when the power is restored: The on-disk version will not be corrupted.)
- Isolation (no transaction affects another; so multiple users can all be running their separate transactions and they won't interfere with one another.)
- Durability (if an insert completes, it is really written to disk)
- Do we need these?
- What apps don't need these things?
- Example: If Facebook fails to provide you with 1 in 100,000 status updates from your friends, what is the cost? Maybe we can sacrifice durability if it speeds things up.
- Can we get these properties from a database distributed across a network? (CAP Theorem: See http://www.youtube.com/watch?v=Jw1iFr4v58M)
[2] Creating a database and connecting (from shell)
# PostgreSQL
dropdb example
createdb example
psql example
# sqlite3
rm example.db # drop a sqlite3 database (just remove the file)
sqlite3 example.db # if the db doesn't exist, creates it; starts the command-line interface
[3] Dropping and creating a table
The column with type serial
for PostgreSQL, or 'integer primary key autoincrement` for sqlite3
is a primary key that gets a value automatically when you insert a new row. By this means, you don't
have to find out what the highest id is; the database system sets the key value appropriately
automatically.
The --
indicates a comment.
PostgreSQL
drop table names;
create table names (
id serial, -- "auto increment"
first_name text
);
sqlite3
drop table names;
create table names (
id integer primary key autoincrement,
first_name text
);
[4] Another
drop table books;
create table books (
id integer primary key autoincrement, -- remember, for PostgreSQL, use: serial
name_id int,
title text
);
[5] Adding data
delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');
- Tip: Always give the column names so that you are not dependent on the created order
[6] more
delete from books;
insert into books (name_id, title) values (1, 'The C Programming Language');
insert into books (name_id, title) values (1, 'The Ruby Programming Language');
insert into books (name_id, title) values (2, 'Java Concurrency in Practice');
insert into books (name_id, title) values (null, 'SQL for Smarties');
-- Notice: We are assuming we know the values for names.id
-- What if we re-ran the inserts for names?
[7] query - wildcard for columns
select * from names;
select * from books;
[8] query - rearranging columns and specifying order
select first_name, id from names;
select name_id, id, title from books;
[9] cross join - return "product" (all combinations) of two tables
select *
from names, books;
-- don't do this with big tables
[10] same thing, but putting ids next together
select names.first_name, names.id as "names.id",
books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
order by names.id;
-- don't do this with big tables
[11] inner join - Venn diagram
Inner join: What we want is represented by the orange section
[12] inner join - SQL - "Give me data from rows from both tables where the condition is true"
-- "explicit join notation"
select first_name, title
from books join names
on names.id = books.name_id;
-- "implicit join notation"
select first_name, title
from books, names
where names.id = books.name_id;
-- Same as "product" join above, but using where clause
select names.first_name, names.id as "names.id",
books.name_id as "books.name_id", books.id as "books.id", books.title
from names, books
where names.id = books.name_id
order by names.id;
[13] left join - Venn diagram
Left join: What we want is represented by the salmon and orange sections
[14] left join - SQL - "Give me data for 'left' table, and data the 'right' table when the condition is true, but nulls for the 'right' table when the condition is false"
select names.first_name, books.title
from names left join books
on books.name_id = names.id;
first_name | title
------------+-------------------------------
John | The C Programming Language
John | The Ruby Programming Language
James | Java Concurrency in Practice
Tom |
(4 rows)
[15] left join is dominate pattern in ActiveRecord . . .
- why?
[16] left join is dominate pattern in ActiveRecord
- when we send an association message to a relation, we expect that association to be applied to each element
- Note, however, that ActiveRecord avoids real left joins
[17] Now that you're so smart
-
How many rows will each of the following queries return?
select * from names left join books on books.name_id = names.id; select * from names right join books on books.name_id = names.id; select * from books left join names on books.name_id = names.id; select * from books right join names on books.name_id = names.id;
[18] demo - study the log
irb
require 'active_record'
require 'pg'
require 'logger'
B = ActiveRecord::Base
B.logger = Logger.new(STDOUT)
M = ActiveRecord::Migration
spec = { adapter: :postgresql, database: :ex }
B.establish_connection(adapter: :postgresql, database: :postgres, schema_search_path: :public)
B.connection.drop_database spec[:database]
B.connection.create_database spec[:database]
B.establish_connection spec
B.mass_assignment_sanitizer = :logger
CN = Class.new(M) { def change; create_table :names do |t| t.text :first_name end; end }
CB = Class.new(M) { def change; create_table :books do |t| t.text :title; t.references :name; end; end }
CN.migrate(:up)
CB.migrate(:up)
Name = Class.new(B) { has_many :books }
Book = Class.new(B) { belongs_to :name }
john = Name.create! first_name: 'John'
james = Name.create! first_name: 'James'
tom = Name.create! first_name: 'Tom'
c = Book.create! name: john, title: 'The C Programming Language'
ruby = Book.create! name: john, title: 'The Ruby Programming Language'
java = Book.create! name: james, title: 'Java Concurrency in Practice'
# Let's discuss what we see in the log
Name.all(include: :books)
Name.all(joins: :books)
Name.all(joins: 'left join books on books.name_id = names.id')
Name.all(joins: 'left join books on books.name_id = names.id', include: :books)
[19] obviously not the way we would represent book ownership for realsies
- We would use a join table in between names and books - Now a book can be owned by more than one person (but notice: this is arguably the wrong modeling)
-
Names and ownership
select names.first_name, books.title from names, books, names_books where names.id = names_books.name_id and books.id = names_books.book_id;
[20] explain plan
-
table with slow query
drop table example; create table example as select generate_series(1, 1000000) as id, 'some message'::varchar(100) as message; insert into example values (1000001, 'different message'); select * from example where message = 'different message'; -- 250ms
-
see the query plan
explain select * from example where message = 'different message';
-
add an index, re-run query, and check plan again
create index on example (message); select * from example where message = 'different message'; --16ms explain select * from example where message = 'different message';
[21] Aggregation and group by, having, etc.
-
group by
provides for a means of showing the distinct values for a columnselect message from example group by message; -- compare: -- select distinct message from example;
-
But even better, we can aggregate according to the groups
select message, count(*) from example group by message;
-
But even mo' better, we can as for groups
having
certain characteristicsselect message, count(*) from example group by message having count(*) = 1;
[22] Subquery expressions
-
You want a predicate that checks another table
-
Example: find all patients with an appointment
select count(*) from patients where exists ( select 1 from appointment_storages where appointment_storages.patient_id = patients.id );
-
Other subquery expressions:
in
,not in
,any
; see docs (http://www.postgresql.org/docs/9.2/static/functions-subquery.html) -
Why? Expressive. Compare this wrong query trying to do the same thing
select count(*) from patients, appointment_storages where appointment_storages.patient_id = patients.id;
-
And compare this correct query trying to do the same thing
select count(distinct patients.id) from patients, appointment_storages where appointment_storages.patient_id = patients.id;
[23] More about explain
-
Can see cost estimates; first value is startup, second is doing the work
explain select count(*) from patients where exists ( select 1 from appointment_storages where appointment_storages.patient_id = patients.id );
-
Can also run (with
analyze
) to see actuals vs. estimatesexplain analyze select count(*) from patients where exists ( select 1 from appointment_storages where appointment_storages.patient_id = patients.id );
-
Hey, why is explain telling me it's going to do a sequential scan even though I have an index?
-
pgadmin has a visual explainer (demo)
-
Recommended: http://wiki.postgresql.org/images/4/45/Explaining_EXPLAIN.pdf
-
Want more? See PostgreSQL 9.0 High Performance (in Engineering library)
[24] self-joins
-
The classic problem: Find all instances of a staff member creating a task before another is completed.
-
Always explore your data first <-- best practice
-
First find a user with not too many tasks
select users.id, users.email, practices.name, count(tasks.*) from users, tasks, practices where tasks.created_by_id = users.id and practices.id = users.practice_id group by users.id, users.email, practices.name having count(tasks.*) < 50 order by count(tasks.*) desc;
-
Look at tasks for such a user to get a feel for the data
select tasks.title, tasks.created_at, tasks.completed_at from tasks, users where tasks.created_by_id = users.id and users.id = 66 order by created_at asc;
-
now join tasks with tasks
select a.title || ' (' || a.id || ') overlaps ' || b.title || ' (' || b.id || ')' from tasks a, tasks b where a.created_by_id = b.created_by_id and b.created_at >= a.created_at and b.created_at <= a.completed_at and a.id != b.id and a.created_by_id = 66 and b.created_by_id = 66 order by a.id asc;
-
inspect some examples
select * from tasks where id in (17359, 17401, 17402, 20457);
[25] Another: Find the first weight measurement (or null) for every patient
-
Let's get the earliest measured_at for each patient/marker combination
select patient_id, marker_cname, min(measured_at) as first_measured_at from measurements group by patient_id, marker_cname;
-
We are using min() here. Could there be two measurements for the same patient and marker with the same measured_at?
select m.patient_id, m.marker_cname, m.measured_at, count(*) from ( select patient_id, marker_cname, min(measured_at) as first_measured_at from measurements group by patient_id, marker_cname) firsts, measurements m where firsts.patient_id = m.patient_id and firsts.marker_cname = m.marker_cname and firsts.first_measured_at = m.measured_at group by m.patient_id, m.marker_cname, m.measured_at having count(*) > 1;
-
Yes. Well that sucks. That might be a bug or deserve a constraint. What we will do is get the first id for the earliest measured_at for each patient/marker combination.
select m.patient_id, m.marker_cname, min(id) as id from ( select patient_id, marker_cname, min(measured_at) as first_measured_at from measurements group by patient_id, marker_cname) firsts, measurements m where firsts.patient_id = m.patient_id and firsts.marker_cname = m.marker_cname and firsts.first_measured_at = m.measured_at group by m.patient_id, m.marker_cname;
-
Now this is a very useful set of rows. We can do left joins from the patients table into this and select for various markers; and then join from this table into measurements. We can create a temporary table and direct PG to drop it after a commit. Then we can use and re-use this temporary table in a series of left joins.
create temporary table firsts on commit drop as select m.patient_id, m.marker_cname, min(id) as id from ( select patient_id, marker_cname, min(measured_at) as first_measured_at from measurements group by patient_id, marker_cname) firsts, measurements m where firsts.patient_id = m.patient_id and firsts.marker_cname = m.marker_cname and firsts.first_measured_at = m.measured_at group by m.patient_id, m.marker_cname; select p.id, weight, weight_first_measured_at, height, height_first_measured_at, ldl, ldl_first_measured_at from patients p left join ( select f.patient_id, m.value as weight, m.measured_at as weight_first_measured_at from firsts f, measurements m where f.marker_cname = 'weight' and m.id = f.id) weights on weights.patient_id = p.id left join ( select f.patient_id, m.value as height, m.measured_at as height_first_measured_at from firsts f, measurements m where f.marker_cname = 'height' and m.id = f.id) heights on heights.patient_id = p.id left join ( select f.patient_id, m.value as ldl, m.measured_at as ldl_first_measured_at from firsts f, measurements m where f.marker_cname = 'ldl' and m.id = f.id) ldls on ldls.patient_id = p.id ;
[26] PG "distinct on"
-
When you want the first row in a group, you can use PG's "distinct on" clause (http://www.postgresql.org/docs/9.2/static/sql-select.html#SQL-DISTINCT)
-
But watch out, this can return surprising results
-
People commonly use it to mean: "Give me the 'top' something in this group" -- but what if there is a tie? Example
drop table if exists employees; create table employees ( name varchar(100), department integer, salary integer); insert into employees (name, department, salary) values ('John', 1, 20000); insert into employees (name, department, salary) values ('Joe', 1, 20000); insert into employees (name, department, salary) values ('Jill', 2, 40000); select department, max(salary) as max_salary from employees group by department; select e.department, e.name, e.salary from employees e, (select department, max(salary) as max_salary from employees group by department) maxs where e.department = maxs.department and e.salary = maxs.max_salary; select distinct on (department) * from employees order by department, salary desc;
-
Advice: Learn your SQL according to the standard (see Appendix B in SQL Cookbook)
[27] What is a PG database?
- A PG database is that to which you connect
- Got it? You can only query a database to which you are connected
- A database contains schemas which contain tables
- We don't deal with schemas because our tables are in the default schema (
public
)
[28] What is a PG schema?
- Think of a PostgreSQL schema as a prefix to a table name (public.patients)
- A database can have tables in different schemas
- PG will match unprefixed table names according to a search path
- Why do we like it?
- Tables in a schema can be dropped together
- Tables in a schema can be authorized separately
- Tables can have the same name if in different schemas
[29] What is a PG tablespace?
- A PG tablespace is a location in the file system where tables (etc.) go
- Put frequently-queried tables on a fast disk
- Make it easier for backups
- NOTE: In Master/Slave, must have the identical tablespace (location) on both machines
[30] When should I add an index?
- When your query is slow
- What is slow?
- Look at the logs and compare to other queries
- Will an index help me?
- Try
explain analyze
on your query - Add an index
- Rerun
explain analyze
- Try
- Don't waste your time on this exercise on your development machine (PG will for instance just use sequential access for small data sets, 'cos it's fast); use something closer to production
[31] What is a constraint?
- Check constraint: Disallow insertion of a row where a column value violates a rule (e.g., height > 144 inches)
- Not null constraint: Disallow insertion of a row where a column value is null
- Foreign key constraint: Disallow insertion of a row with foreign key value not found as a primary key in the other table
[32] Why don't Rails developers use foreign key constraints? (and other constraints)
- History: Very difficult to create an ORM that can define constraint types in a database-agnostic way
- Surely you jest?
- History: Lack of clarity around tier responsibility: E.g., who should validate?
- History: Inexperience with transactions
- History: Probably not important in Basecamp (except for billing)
- History: There was a time when people liked polymorphic associations in ActiveRecord; and how would you constrain such keys? (Triggers?)
- Lazy
[33] Why don't we use foreign key constraints? (and other constraints)
- Lazy
- To date, stories have not defined data semantics, only behavioral semantics
- To date, stories have not defined cost of anomalies
- schema.rb
[34] pros and cons of deleting rows from the DB (vs. flagging them as 'deleted')
- Discussion
[35] pros and cons of free-text fields
- Discussion
[36] editing data in-place vs. storing new versions with each edit
- Discussion
[37] Recommendations for Further Reading
- "SQL Reference" in Enterprise Java in a Nutshell (O'Reilly, 2005)
- Molinaro, SQL Cookbook (O'Reilly, 2005)
- Rozenshtein, The Essence of SQL (1997)
- Celko, SQL for Smarties (2010) (other books can be good; some people like The SQL Puzzle Book)
- Karwin, SQL Antipatterns (2010) (considers EAV to be an anti-pattern; discusses rounding errors; problems with grouping; much else)
- Venn diagrams: http://www.codinghorror.com/blog/2007/10/a-visual-explanation-of-sql-joins.html
[37] Resources to avoid
- Elmasri, Fundamentals of Database Systems (6th ed., 2010). Regrettably, this book is a mess and tends to go for the most impenetrable explanation possible. Still, it has stuff on modeling and normalization that is hard to find.
- Anything on the web; seriously, watch out. People say really stupid stuff about SQL.
- Any book that is focused on one database. You will be unhappy.
[38] Populating names and books with join table
drop table names;
create table names (
id serial, -- "auto increment"
first_name text
);
drop table books;
create table books (
id serial, -- "auto increment"
title text
);
delete from names;
insert into names (first_name) values ('John');
insert into names (first_name) values ('James');
insert into names (first_name) values ('Tom');
insert into names (first_name) values ('Patrick');
delete from books;
insert into books (title) values ('The C Programming Language');
insert into books (title) values ('The Ruby Programming Language');
insert into books (title) values ('Java Concurrency in Practice');
insert into books (title) values ('SQL for Smarties');
insert into books (title) values ('Specification by Example');
drop table if exists names_books;
create table names_books (
name_id integer,
book_id integer
);
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The C Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'The Ruby Programming Language'));
insert into names_books values ((select id from names where first_name = 'John'), (select id from books where title = 'SQL for Smarties'));
insert into names_books values ((select id from names where first_name = 'James'), (select id from books where title = 'Java Concurrency in Practice'));
insert into names_books values ((select id from names where first_name = 'Tom'), (select id from books where title = 'SQL for Smarties'));
select names.first_name, books.title
from names, books, names_books
where names.id = names_books.name_id
and books.id = names_books.book_id;
Other topics
[39] The meaning of null
See Elmasri.
[40] SQL injection
[41] Prepared statements
[42] Domain modeling
See Elmasri.
[43] Normalization
See Elmasri.