I am extremely new to SQL and trying to design my database.
I am trying my best to follow normalization practices.
I have an owner table with owner_id (pk), firstname, lastname, middlename etc. and then I have a contactinfo table with contact_id(pk), city, state, zip, email, owner_id(fk).
Is it a good idea to just make fk columns in my contactinfo table referencing columns in the owner table like firstname, lastname etc so that in my application if I want to mail something to someone, all the necessary data is at least referenced in a single table.
Or is it better to have only a couple relationships between tables and I would pull the data I want from many tables with the code in my vb.net application? Or do am I completely misunderstanding normalization?
Also the idea of having an auto increment primary key column for each table seems less scary to me than actually using a legit unique piece of data in my application as the pk. Does it really matter? I was worried about if I ever wanted to change data in a PK field...it might not let me or it would let me and I just accidentally leveled Moscow.
Foreign Keys are used to enforce a Parent-Child relationship. For any one Child, it must have one and only one Parent. In your example, one Address must have an Owner but could multiple Owners exist? If it is guaranteed that there is only one Owner then a FK would be a sound approach. Having a FK does not make them one table; you still would need to join the Owner and Address tables together. The existence of the FK could be used by the SQL Engine to create a more efficient plan. It is useful, also, to have an index on the FK columns so that the table joins are more efficient.
If, OTOH, the Address is not guaranteed to be unique to an Owner, you can use an intermediate table, sometimes called a junction table though it goes by a variety of names. This table would hold a reference to one Owner and one Address. This way if one address had multiple Owners, there would only be one entry in the Address table, two entries in the Owner table (one for each owner) but two in the junction table. One junction table entry would reference the address and owner #1 and the other entry in the junction table would reference the same address and owner #2. The junction table would have a FK referencing the Owner table and another FK referencing the Address table.
As far as the auto-increment primary key goes, you have just opened a can of worms that could lead to major religious wars. Some people love them; others hate them. From the sample columns you have listed, I don't see a way to have a unique Owner name. Too many "John Q. Smith" entries. Using an identity column would certainly give you a unique handle to an owners. The question becomes when someone says ship this to "John Q Smith" how do you know which one is being referred to so that you can find the proper Address?
I'm not quite sure what you are saying here. What you describe is wrong, I'm afraid, but I haven't quite grasped why / how you are envisaging your Address table [somehow including the firstname, lastname columns from the Owner table], so consequently I might be answering the wrong question! but here goes:
To normalise your data you should only have firstname, lastname in your Owner table. If you query your Contactinfo table, and you also need to include firstname, lastname in the results, you need to JOIN the Owner and Contactinfo tables:
SELECT C.contact_id, C.city, C.state, C.zip, C.email, O.firstname, O.lastname
FROM contactinfo AS C
JOIN owner AS O
ON O.owner_id = C.owner_id
The important thing (critically so, to my mind) about normalising this data is that IF you were to store [firstname] in both the Owner and the Contactinfo tables then at some point, for sure!, they would become different. At that point which, of the two values, do you trust? OTOH ff [firstname] is stored in only place then if it needs to be changed you are sure that that single change is the only one that is needed and anything else that needs [firstname] will "reference" it from that location.
If you frequently need firstname and lastname when querying Contactinfo you could make a VIEW that includes the extra columns. We have a VIEW for pretty much every table in our database which includes any "common lookup columns", and we have a naming convention to make that straightforward for use to use. Typically the view includes lookup names/descriptions for things such as StatusCode and CountryCode etc.
The view is somewhat wasteful because the queries that reference it mostly don't use ALL the lookup columns, and SQL will not completely optimise those lookups out, but it saves us a lot of time, means that we don't have JOINs scattered all over our code (which ALL need to be Found and Fixed if something changes - whereas with a VIEW if the relationship does change then we only have to fix the one View)
Here's an example assuming that you adopt a naming convention of adding "_View" to the name of your table for the "main lookup view"
CREATE VIEW contactinfo_View
SELECT C.contact_id, C.city, C.state, C.zip, C.email, C.owner_id,
FROM contactinfo AS C
JOIN owner AS O
ON O.owner_id = C.owner_id
We include ALL the columns from the core table (so that we can use the view anywhere in our code where we might have used the table itself), and any columns from other tables that are commonly required as Lookups
Here's an example of using the view
SELECT contact_id, city, state, zip, email, owner_id,
WHERE contact_id = 1234
There are Pros and Cons and Learned Debate (which will go on until the end-of-time!)
Here's my view:
We have Natural Keys for some things. For example, 2-letter Country Codes such as "US" and "GB" (but maybe that should be "UK"? ...)
We have 6 character codes for Names - such as SMIT01 for John Smith. Users will remember the codes for commonly used Name records, they can guess the code for many of them (not for Smith, but for unusual names then "First four letters plus 01" will work), and reports are "broadly alphabetical", users find it convenient as they have to look up relatively few Record Codes ... but ... for something like universities operators may allocate codes UNIV01, UNIV02 which is definitely NOT guessable!
We have an IDENTITY column in EVERY table. Even if we also have a Natural Key such as the Country Code. The IDENTITY is not always the primary key, not always (actually probably "not often") the Clustered Index, but always has a UNIQUE INDEX. We find having a unique, short, numeric code for EVERY record hugely useful. On Web Forms we only need to include the IDENTITY to uniquely identify a record. If you use Natural Keys instead then you stumble over side effects such as a "Hyphen" or "Space" being included in the code and that "breaking" the URL for the record in the browser.
Users will squirt their data into Excel, and then "back again". Excel will never mess with an IDENTITY value ... however, if the PKey is alphanumeric it might well. If the first few rows are all-digits then Excel will assume that that column will be numeric. If strings have leading "0" then Excel will chop them off (now you are hosed!!) If a numeric value looks like a date then Excel will convert it accordingly ... and so on. None of those problems with IDENTITY values.
IDENTITY numbers are also very skinny. They take up very little index space, so you get more index entries on a page (than a multi-character natural key) - that saves disk space and improves performance. We do have multi-part keys on some tables but we also givem then single, one-part, IDENTITY columns too; we find it MUCH easier to reference the record with the IDENTITY. Imagine that you have an Audit Table where all previous edits are stored. Lets also say that you have some Staging Tables to import and merge data from a completely different 3rd party system. IME writing the code for those to support a multi-part key is way more complex, far harder to "mechanically generate" the code, and as a consequence "more to go wrong". So even where such things have natural multi-part keys we also include an IDENTITY so that we can use that to easily reference a record.
Example: An Order Item would typically have a Primary Key of OrderID (from the parent record) and ItemNo (and that would also be the best candidate for the Clustered Index). We also have an OrderItemID unique IDENTITY value on the Order Item. In that situation we don't rely on it much (e.g. if there are not child tables of Order Item that need to reference a parent) but being able to reference the Order Items by one-part ID on Web Forms is MUCH easier than a multi-part key would be.
However, one tip. We start our IDENTITY values, in each table, at a different starting point - e.g. 1,000 different. Thus in our test data (which normally doesn't have very many rows) we do not have an Owner #1 AND a Contactinfo #1 so there is ZERO chance of us accidentally, and successfully, joining ownerid and contact_id - that weeds out errors where we accidentally did:
ON contact_id = owner_id -- ERROR!!
because, using different IDENTITY ranges, that will match zero rows and thus show up as a problem in testing.
Thank you so much the both of you. There is a lot in both of your responses that I have read over several times.
I am ultimately creating a property value database. So there could be multiple owners to a single property and they would be mailed stuff.
But even before that. I guess what I am trying to understand is by designing my database is....
First I want to split my tables out based on (and I apologize for not grasping the correct terminology) function and that is my "entity"
Secondly I want to limit repeat information in my tables.
The central importance of the application will actually revolve around the property and its value. The "owners" are linked to the property.
So my owner table I guess could have the contactinfo columns and it could just be one Table.
But if I were to split them out, then the owners name would obviously be required when trying to contact an owner. If I put owner names in the contractinfo table then I have duplicate information.
So I think from what you guys have said is basically if I split them out its fine, I just need to use a join table to link them together and that way I also have the added benefit of handling multiple owners to a single property.
and you answered my question Kristen and gave me a TON of things to think about. I need to use the SQL language to bring the data together in my application.
So I really dont understand when and partially why to use keys to create relationships between tables. I need to maybe read some more on that. I think this will become very clear to me as I make more tables.
I started with the owner table, but really I should probably start with the property table as sort of the "Mother" table and then create the other tables in relationship to the property table and go from there.
Properties naturally have a parcel number that uniquely identifies it. However parcel numbers have dashes and decimals in them. We also use a distinct 10 digit "control number" to reference really the property and the owners...I could use that also as a primary key and just have an auto incrementing id IDENTITY column. Of course if you start it at different number ranges for different tables then you sort of have to think about how many records each table may contain. The property table could get upwards of several hundred thousand records. Multiple owners sometimes to a property so maybe a million for that one. Alot to think about.
For an initial design, you really need to start with a logical (entity) design rather than a physical (table) design [Google "logical data modeling" for more info]. This seems like just semantics but it's very much not, it's a critical step to a good design.
Initially just concentrate on identifying all the Entities you need -- Owner, Property, etc.. Taking this step, you may realize that "contactinfo" is not an Entity. Therefore, while it may become a table(s) later, its data must be part of / related to another entity. Thus, the logical design process helps drive an accurate physical design process.
After you identify the Entities and what pieces of data you need about them -- called "Attributes" at this stage, rather than "columns" -- you go thru the normalization steps [Google "data normalization", for example]. First normal form and second normal form are absolute dead minimums, but almost everyone also goes thru third normal form, as otherwise the design can be poor. Almost no one goes to 4th normal form (or beyond), although some go on to BCNF (Boyce-Codd normal form).
This sounds complex, but it doesn't really have to be, if you follow it step by step.
If you skip the logical design phase, you frankly will end up with a design mess and will re-code and re-work tables multiple times and still stuffer from a mediocre (at best) design.