Choosing the right key debate


A surrogate key is a unique primary key generated by the RDMS that is not derived from any entity instance present in the database and whose only significance is to act as the primary key, definition adapted from [13].
An intelligent key is a type of key that contains embedded in it attributes of the entity, so data with actual significance. Some authors even use the term "meaningful key" or "natural key", even though if a natural key is in fact a special case of intelligent key that is in some way natural and immutable to the application.
The classifications according to the number of attributes contained by a key differentiates the simple key ( consisting of just one attribute ) and the compound or composite key ( consisting of more than one attribute ).
A Candidate Key is one or more columns whose values could be used to uniquely identify a row in a table. The Primary Key is chosen among a table\'s Candidate Keys.

There are many other types of classifications of keys, out of which we'll define in this paper only the ones that concern the development of Epocrates directly. In order to be a good or viable solution as any key, the desirable properties identified by Celko [1] that have to be met are:
 Familiarity
 Stability

 Minimality
 Simplicity
These properties can be expressed in other words like this :
To qualify as a primary key for an entity, an attribute must have the following properties:
 it must have a non-null value for each instance of the entity
 the value must be unique for each instance of an entity
 the values must not change or become null during the life of each entity instance

The practice has learned us that in many situations there are more attributes that can serve as a primary key for an entity. The problem is now to select a primary key out of all the possible candidate keys. The candidate keys that will not be chosen as primary key will be called alternate keys.

Intelligent Versus Surrogate Keys
Should I use business columns as primary key fields for tables in the database, or generate artificial primary key values?

The introduction of key-related notions and definitions in the preceding was meant to sustain and prove from a theoretical, scientific point of view of the existence and actuality of this dilemma with which we have been confronted in the design of the Epocrates database.

Some database designers will always use a surrogate key despite the eventual suitability of an intelligent key. On the other hand, some database designers will use a key already present in the data, if there is one: the addition of a surrogate key will slow down access to the table, particularly if it is indexed they argue.
To exemplify we present two of the most appreciated authors in the field, whose work have been quoted also in this paper more than once. Chris Date is a declared fan of surrogate keys and Joe Celko's option for intelligent keys is well known; besides this, it is admitted that each alternative has a number of real-world advantages and disadvantages that depend on the actual model under discussion.

Advantages for intelligent keys
The most important advantage of intelligent keys comes from their definition; having a meaning for the user, they can be very easily understood. Custom applications can hide surrogate keys from the end user making the process transparent, but adhoc database tools cannot, so a user entering data directly into the tables via a simple adhoc interface will face the problem of understanding what the surrogate key means and how does its mechanism functions.
Other advantages of intelligent keys that are worth to be mentioned are the space saving ( generated by the use of an existing attribute and not adding a new one ), ease of verification of data, and issues that arise when using a surrogate key approach in a replicated and distributed database environment.
Disadvantages for intelligent keys
We have presented before the properties of a potential key and one of them referred to the fact that key values should not be volatile. This makes it obvious that primary keys are hard to change. Whenever a primary key changes in value, or columns are added to or dropped from a primary key, the effects cascade down through foreign key relationships. This makes it a disadvantage for the intelligent keys in the situation when the business changes and this automatically leads to a change in the primary key value and / or structure with all the burning pains that this arises. Surrogate primary keys avoid this problem because they never need to be changed. The business columns that were making part of the intelligent primary key before, are now simple data columns that can be changed whenever this is needed and even added or dropped with relative ease. The problem of the uniqueness of the business column values is addressed now by means of a secondary index.
Performance may differ between the two techniques and a black and white verdict cannot be issued out of this criteria. It has been proven that a single-column surrogate key may be faster than a multi-column intelligent key. The explanation for this is the fact that usually RDBMS generate numeric data type surrogate keys that have also hardware support for operations which ends up in performance gain. On the other hand, a simple incrementing surrogate key may cause index contention problems not found with a randomly distributed intelligent key.
For the design of Epocrates we have chosen a solution with surrogated keys, mainly because in the type of application like this the users will not need to be concerned about how the data is represented inside the database, as the big majority of operations will be of the data retrieval type. The part of entering data into the database being left to the authors and the database specialists that will deal with actualization and maintenance of Epocrates after release, we consider that the disadvantages of using surrogate keys are exceeded by the advantages. There are many techniques for generating the surrogate keys which we do not present here that can be integrated as part of the business logic of the application; we will use SQL Server 2005 RDBMS tools for generation of surrogate keys.