Using Natural Keys

Hello. First time poster here who is trying to get back into databases after a couple year hiatus!

I realize this is a heated topic, but am hoping to get some reasonable advice on it...

A lot of people insist that you have to use System-generated/Sequence/AutoNumber/Surrogate keys as the primary key for a table.

While I can see the benefits if you have a very large table, for smaller tables I question the need.

For example, I have the following tiny tables in a website I am building... (**Disclosure: This is in MySQL, so hope you won't hate on me for that?!)

FEATURE
*************
- id (current pk)
- name


MEMBER_PLAN
- id (current pk)
- name


ENTITLEMENT
*******************
- id (pk)
- feature_id (fk)
- member_plan_id (fk)

In the first two tables, I am inclined to make "id" a (uk) and make "name" the (pk).

Why?

Because then in my junction table (i.e. Entitlement), it will be self-documenting and won't require me to join three tables for simple things I need to do!!

Thoughts?

Personally, I'd recommend a numeric primary key as they are more efficient and perform better than character based primary keys...typically. If name can be changed/edited than I would definitely NOT make it a PK.

Basically:

  • Keep it short . Because the primary key is used for lookups and comparisons, a short primary key means the database management system can process it more quickly than a long primary key.
  • Use a number for the primary key whenever possible. SQL Server or other database management systems process number data types faster than character data types.
  • Keep it simple. Don't use any special characters, embedded spaces, or a mix of upper and lower capitalization.
  • Never change the primary key after you assign it.


The problem is that these are super small tables and its a PITA to have to join 3 tables to get what I need.

Here is what I have currently...

MEMBER_PLAN

  • id (pk)
  • name (uk)

FEATURE

  • id (pk)
  • name (uk)

ENTITLEMENT

  • id (pk)
  • member_plan_id (uk1)
  • feature_id (uk1)




And here is what I would like to do...

MEMBER_PLAN

  • id (uk)
  • name (pk) <===

FEATURE

  • id (uk)
  • name (pk) <===

ENTITLEMENT

  • id (uk)
  • member_plan_id (fk)(uk1) <===
  • feature_id (fk)(uk1) <===

By doing this, I can look at "Entitlement" and easily see the relationships between "member Plans" and "Features" and not have to joing 3 tables to get that.

If we were taling about a table like "Member" or "Orders" I would use surrogate keys, but in a case like this it seems silly to use surrogate keys.

Nothing is chipped in stone here. If the tables are only ever going to be 2 columns and an association table and the PK is not going to change over time and you're not talking about millions of rows...do what's easiest....in the long run! Don't paint yourself into a corner for short term ease if long term things are going to get complicated!

I can't see "Features" being more than 50-100 records. And "Member Plans" being a dozen. And thus, "Entitlements" maybe being a few hundred.

All of this is for a website I am building and managing myself.

Doing it the way I proposed would just make my life easier since everything will now be in the "Entitlements" table and I can just query that to see which "Features" to display on a given page.

You should 100% use an id on the first two tables. I've been a full-time DBA for 30+ years, including extensive actual data modeling, and you only want to use an id there, period.

The third table doesn't need a separate identity; this is a bad error, but one you see over and over. Instead, use the keys from the other table as the natural key:
ENTITLEMENT ( member_plan_id, feature_id )

It's actually not a pain to join three tables, because you do not have to do it. Create a view that does the joins, and then reference the view. That's much better anyway, as it makes changes a lot easier.

Even if you decide to skip view, it would still be horrible to do a poor table design just to avoid a couple of simple joins.

Why?

I have seen a lot of very experienced DBA's argue that natural keys are, well, simply more "natural" and make sense in a lot of cases.

My premise is that these three tables are so insignificant and contrived, why not?

If I was Amazon . com, though, I certainly wouldn't be doing that!

Just trying to learn better from someone with 30+ yrs experience. :slight_smile:

Again, why?

The approach I have taken more in recent years is what you saw above...

Every table has a surrogate key using an sequence/autonumber/whatever to ensure uniqueness always. However, I set this with a UK constraint versus a PK.

Then, for tables where it has practical application, and won't come back to bite me, I use a natural key as a pk.

For the junction table, I have a surrgate key but this time make it the pk just so the table has that guaranteed uniqueness again.

Truth be told, this is for a website and I am using MySQL. (Not sure if this is a Microsoft-only website?)

Do you know if I can do Views in MySQL?

I know lots of database gurus that wouldn't see it as a mortal sin to use natural keys as the primary keys in my parent tables and then have the natural keys be the co-foreign keys in the junction table.

Again, can you explain why you feel this is a bad idea?

Look at the context...

It's not like I am using Address or Last Name as the natural key!

Thanks.

Feature names can change.
People's names can change.

It's that simple why you don't want to use them as a key. Even more so here, since you'll have the actual data in multiple tables and thus have to change it in multiple places.

Every table has a surrogate key using an sequence/autonumber/whatever to ensure uniqueness always.

Yes, that's the most common thing, and that's so damaging it's sad. The single biggest myth in table "design" is that an-autonumber-must-be-the-key.

First, it prevents any data normalization, period. OK, I know, developers (and many DBAs) never even attempt normalization anyway, but it's still an awful idea to skip the most pivotal, early design steps. Normalization depends on key values, but when the key value is meaningless, relationships to it are also meaningless.

Second, and just as important, it slows down subsequent queries, which forces extra indexes to be built just to partially compensate for the improper autonumber indexing. Terrible for performance.

If I join parent table MEMBER_PLAN to ENTITLEMENT and entitlement is keyed on ( member_plan_id, ... ), I get a wonderfully efficient MERGE join on the clustering index, and read the minimum amount of data possible on ENT. This is true no matter how many rows are read.

But when SQL has to go thru a non-clus index, you get lookups. Worse, unless it's an extremely small number of rows, you're likely to get a scan-and-hash-match of the ENT table or one its nonclus indexes. You'll also typically see far more nonclus indexes on the ENT/child table, used to avoid all the scans that would otherwise occur.

The "rules" for a "short" clustering key [] are just that, very general rules. Often the best performance comes from a set of keys. [] I don't really care what the PK is, since I don't have to cluster on it [in SQL Server, at least], performance depends by far most on the clus key.

You're totally losing me on your responses...
First you are saying it is bad to use Natural Keys...

Then in your next response, you NOT to use surrogate keys...

So which one is it?

If I can't use natural Keys and I can't use Surrogate keys, then what in the world do I use?

Normalization is an abstraction at a logical level. The type of keys you use have nothing to do how you normalize tables...

People use Surrogate Keys because they HELP performance...

And a Surrogate Key would be more easily indexed, so what is your point?

So why not use some sort of cached table or small footprint in memory database, or db such as sqllite etc? Then there is no heated topic to debate ?

As I stated, the problem is automatically assigning an autonumber surrogate to all tables.

"Master" tables will need a surrogate, esp. if they are char strings. FEATURES and MEMBER_PLANS (common practice is to use plurals on tables, for various reasons) are master tables, and will be an artificial id assigned.

Child tables may or may not need an autonumber, to maintain some type of sequence, but they should not be clustered first on that value. But you specified they should be.

A child table should be clustered first on the keys of their parents. ENTITLEMENT may need an id, but if needed in the clustering key, it would be the last column, not the first.

Using the actual values for FEATURE and MEMBER here could be a nightmare, as they appear in many tables, and have to change them, in coordinated fashion, across a dozen (more?) tables would be a royal pain.

And a Surrogate Key would be more easily indexed, so what is your point?

Huh? How is it "easier"? SQL doesn't at all mind two keys in an index rather than one; it's just some people that don't care for that.

People use Surrogate Keys because they HELP performance...

You should give some thought to what I actually wrote. That for child tables, you don't need to generate a surrogate key because you have a natural keys in the parent keys. The parent keys themselves are almost always surrogates, but that's irrelevant at the child level. To the child, the two parent keys are a natural key rather than blindly generating yet another autonumber to use and clustering on it. That is what hurts performance so much.