A couple days ago Jeremy Miller, who might have the best .NET blog on the internet, wrote about the evil of composite keys in database design. Jeremy asks an honest question at the end: why are composite keys so prevalent in legacy databases?

The answer is pretty simple and relatively boring: database developers with educational backgrounds in Relational Calculus and Relational Algebra were just designing databases properly according to the standards and guidelines set forth by E.F. Codd. The issue Jeremy brings up isn’t really “why use composite keys?” but “why does writing persistence code with composite keys have to suck?” This is especially relevant today as more and more development gets done with O/R mappers, and O/R mappers seem to be written with an expectation that your database contains auto-incremented primary key values.

All of this brings forth the really important issue (in my eyes), and that is this: a lot of database developers are charged with designing databases and they don’t understand the difference between a primary key and a surrogate key, and they don’t understand what criteria you use to actually choose a primary key.

A perfect example of this is captured in an article written by Josh Berkus, a lead developer on PostgreSQL. In the article he transcribes a couple of chats he had with database developers who were querying him for solutions to their problems.

agliodbs: What’s the key for “sessions”?
newbie2: it has an “id” column
agliodbs: Yes, but what’s the real key? Which columns determine a unique row?
newbie2: I told you, the “id” column. It’s a primary key and everything.
agliodbs: That’s not going to help you identify duplicate sessions. You need another key … a unique constraint on real data columns, not just an “id” column.
newbie2: no I don’t
agliodbs: Good luck with your problem then.

Josh continues on with the article, describing why this ignorance seems so pervasive:

Inevitably, practices which are “necessary evils” tend to become “pervasive evils” in the hands of the untrained and the lazy. Not realizing that ID columns are a pragmatic compromise with application performance, application developers with a shallow grounding in RDBMSes are enshrining numeric IDs as part of the logical and theoretical model of their applications. Worse yet, even RDBMS book authors are instructing their readers to “always include an ID column,” suturing this misunderstanding into the body of industry knowledge like a badly wired cybernetic implant.

That explains a lot. You see, I’ve ran into the same thing in my career – I’ve had discussions with developers who tell me that they’ve been instructed to always include the id field first when designing tables because it is the primary key. I couldn’t understand what responsible database person or author would instruct a person to do such a thing. Then I read Josh’s article and it became clear: you don’t have to be responsible to write a book, you just have to be good enough with words to get published. The accuracy of the information isn’t what determines whether your book gets out or not.

That’s disappointing. Our job as a software developers is already tough enough; we spend our days solving problems, learning new technology and trying to make our customers happy. To top it off, we have to deal with people feeding us misinformation at a technical level. That can’t be good.

So What Is A Surrogate Key?

To get back to the issue, a surrogate key is essentially a placeholder for the real primary key.

Every table has to have a primary key; a combination of attributes (columns) that determine uniqueness. Databases are all about storing data reliably, and that means keeping duplicate data out of a table. There shouldn’t be two Dave Smith’s in your database if you only have one working for you (you wouldn’t want him collecting two paychecks would you?)

To avoid a table containing duplicate entries we turn to the primary key. A primary key can actually be one of many candidate keys. A table may have one or more candidate keys, each of which can uniquely identify a tuple (row) in the table.

Take, for instance, the case of a movie table. It contains movie names, year released, and director name.

Movie Director Year
Star Wars Episode 1 : The Phantom Menace George Lucas 1999
Star Wars Episode 2 : The Clone Wars George Lucas 2003
Psycho Alfred Hitchcock 1960
Psycho Gus Van Sant 1998

Movie name by itself is not a good candiate key, because as we can see, Alfred Hitchcock and Gus Van Sant each directed a version of “Psycho”. However, movie name + director name is a decent candidate key because we can be fairly sure that no director is going to direct the same movie twice. Another, possibly better candidate key is movie name + year released, since we can be fairly certain that no movie with the same name will be released in the same year. A third candidate key would be all three fields, and provides the best assurance of uniqueness.

However, the difficulty of working with a primary key composed from the natural attributes of the table is that it can be cumbersome to write persistence code. Also, composite primary keys make for large foreign keys in releated tables, since all of those attributes have to be copied.

The solution then is a “surrogate key” – a numeric key (here called “ID”), typically auto incremented, that can be used as a placeholder to the real key, providing us with a table that looks like so:

ID Movie Director Year
1 Star Wars Episode 1 : The Phantom Menace George Lucas 1999
2 Star Wars Episode 2 : The Clone Wars George Lucas 2003
3 Psycho Alfred Hitchcock 1960
4 Psycho Gus Van Sant 1998

The problem with a surrogate key is that some developers actually start to think that the surrogate is the primary key. They insert this value before determining the candidate keys and before they know if their table structure needs to be altered to include, or exclude, attributes (a sign that a table isn’t well designed is when there are no obvious candidate keys – that means there’s not enough information in the table to determine uniqueness).

All of this brings me back to one comment on Jeremy’s blog post:

Life is so much easier with surrogate keys. You can always make unique constraints where it’s necessary.

I have to ding Jeremy a couple of points for that comment. What he should have said was: Life is so much easier with surrogate keys. But you must always make unique constraints on your primary key before adding a surrogate.

Surrogate keys don’t exist by themselves; they require the presence of a real primary key in order to be useful. Without unique constraints the other attributes in the table provide no help in determining uniqueness.

One Comment

  1. vorpal.cc/blog » Primary, Composite, and Surrogate Keys says:

    [...] Meaningless Keys keys, more keys and nothing but keys… Primary Keyvil, Part I (checkout parts 2 and 3 also) Composite keys are evil Surrogate Key != Primary Key [...]