Normalized database design

simflex1 · March 3, 2017, 9:58pm

Greetings mates,

Can someone please help me review this database design to see if it needs tweaking?

First, a little explanation.

We have an app that will require each employee to answers some questions.

Question number is for the employee to provide his/her employeeid, name, tittle and email address.

Then employee is to provide information about his or her other source of income, including source name, source address and income

Then, the employee is provide his/her spouses name, address and income source.
Income source could come from one or more sources.

Finally, employee is to provide organization name, organization address and income if s/he has done business with any organizations.

These sort of questionnaire type information led to the following database design:

CREATE TABLE [dbo].[Main](
	[employeeID] [int] NOT NULL,
	[sourceID] [int] NOT NULL,
	[spouseID] [int] NOT NULL,
	[orgID] [int] NOT NULL,
        [questionID] [int] NOT NULL
) 

CREATE TABLE [dbo].[SpouseDetails](
	[spouseID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[spouseName] [nvarchar](50) NULL,
	[spouseAddress] [nvarchar](50) NULL,
	[spouseIncome] [numeric](18, 0) NULL
) 

CREATE TABLE [dbo].[SourceDetails](
	[sourceID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[sourceName] [nvarchar](50) NULL,
	[sourceAddress] [nvarchar](50) NULL,
	[sourceIncome] [numeric](18, 0) NULL
) 

CREATE TABLE [dbo].[OrgDetails](
	[OrgID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[orgName] [nvarchar](50) NULL,
	[orgAddress] [nvarchar](50) NULL,
	[orgIncome] [numeric](18, 0) NULL
) 

CREATE TABLE [dbo].[Questions](
	[questionID] [int] IDENTITY(1,1) NOT NULL,
	[Question] [nvarchar](255) NULL
) 

CREATE TABLE [dbo].[Employees](
	[employeeID] [int] IDENTITY(1,1) NOT NULL,
	[employeeName] [nvarchar](50) NULL,
	[ttitle] [nvarchar](50) NULL,
	[email] [nvarchar](50) NULL

)

I am wondering if this is fully normalized?

Thanks for your help in advance

ScottPletcher · March 3, 2017, 11:20pm

No, it's not. What's an easy way to tell if a design is normalized? Do you have twice as many tables as you expected to :-).

Tables SpouseDetails, SourceDetails and OrgDetails are not correct. You need separate Spouses, Sources and Organizations tables, and then intersection tables that relate the other tables, but that contain only information relating that one intersection.

The most helpful rules here are:

give each table a unique key column(s)
insure that every column in the table relates to that key, the whole key and nothing but the key.

simflex1 · March 4, 2017, 1:03am

Hi Scott,
Thanks for you kindness.

If I understand correctly, you are suggesting that employeeID be removed from Source. Spouse and Organization tables?

Something like:

Intersection table (for lack of a better name)
EmployeeID,
QuestionID,
SpouseID,
OrgID,
SourceID

Something like this?

Kristen · March 4, 2017, 10:04am

@ScottPletcher your views are always enlightening so I'd appreciate clarification pls

You wouldn't have EmployeeID in the Spouse table (for example)?

I'm presuming that we are not attempting to deduce who ELSE the Spouse has also been married to?

maybe I have misunderstood "and then intersection tables that relate the other tables" which perhaps refers to all the tables OTHER than Spouses, Sources and Organizations?

If Spouses, Sources and Organisations contain the EmployeeID (rather than having their own unique PKey and then there being a Link Table connecting Employee and Spouse / Former Spouse etc.) when crap-happens, and it always does!, from a practical standpoint it is easy to reunite! the Spouse Record with the Employee record. Indeed, no Link Table required at all.

I'm thinking of the Invoice Header / Invoice Items scenario - I would create a column in Invoice Items that contained the Invoice Header ID. Same for Spouses - have an Employee ID column.

If the same Spouse record can be associated with different Employees then I would have a Link Table.

simflex1 · March 4, 2017, 12:22pm

Ok, I am completely confused as to what you are advising me to do.

There are 5 tables and they are listed above.

You said they are not normalized as I currently have them,

Can you please show me how they can be normalized?

Kristen · March 4, 2017, 2:43pm

I have no view on what would make it well normalised, @ScottPletcher will have useful advice on that, and also on Clustered Indexes.

I would do it like this:

CREATE TABLE [dbo].[Employees](
	[employeeID] [int] IDENTITY(1,1) NOT NULL,
	[employeeName] [nvarchar](50) NULL,
	[ttitle] [nvarchar](50) NULL, -- Spelling
	[email] [nvarchar](50) NULL
)

If stuff like Address is specific to employee, and you don't need any "previous addresses" nor "alternative" addresses - i.e. its 1:1 - then I would consider storing that data in [Employees] table too.

CREATE TABLE [dbo].[SpouseDetails](
	[spouseID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[spouseName] [nvarchar](50) NULL,
	[spouseAddress] [nvarchar](50) NULL, -- Too short for address. Consider having State / ZIP Code / Country in separate fields for searchability
	[spouseIncome] [numeric](18, 0) NULL -- Money datatype instead??
) 

Unique Index spouseID
Clustered Index employeeID, spouseID

CREATE TABLE [dbo].[SourceDetails](
	[sourceID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[sourceName] [nvarchar](50) NULL,
	[sourceAddress] [nvarchar](50) NULL, -- Ditto re: address
	[sourceIncome] [numeric](18, 0) NULL
) 

Unique Index sourceID
Clustered Index employeeID, sourceID

CREATE TABLE [dbo].[OrgDetails](
	[OrgID] [int] IDENTITY(1,1) NOT NULL,
	[employeeID] [int] NULL,
	[orgName] [nvarchar](50) NULL,
	[orgAddress] [nvarchar](50) NULL, -- Ditto re: address
	[orgIncome] [numeric](18, 0) NULL
) 

Unique Index OrgID
Clustered Index employeeID, OrgID

CREATE TABLE [dbo].[Questions](
	[questionID] [int] IDENTITY(1,1) NOT NULL,
	[Question] [nvarchar](255) NULL
)

Other than making the text of the questions easy to alter I am not sure what this does? It does not, for example, indicate which table the result needs to be stored in / how it is stored. Clearly that can be programmed, but ...

You might not want the questionID to be Identity, but assign a number manually 10, 20, 30 ... then you can "insert one" if you need to.

I would not bother with this at all:

CREATE TABLE [dbo].[Main](
	[employeeID] [int] NOT NULL,
	[sourceID] [int] NOT NULL,
	[spouseID] [int] NOT NULL,
	[orgID] [int] NOT NULL,
        [questionID] [int] NOT NULL
)

Do you need Nvarchar everywhere? We rarely bother with it, unless we know we are doing to have to store wide characters. Most common European language accents etc. are available in 8-bit character sets

We never use Plurals for table names. Opinions vary on that, but worth considering - once you've built the whole thing it will be harder to change that!

We make all column names unique (using a prefix on each column which represents the table). I've seen people do that just for FKeys - so instead of [employeeID] in [OrgDetails] perhaps do [orgEmployeeID]. Again, opinions vary, but you might want to do a little reading around that. You have that on MOST bu not ALL your columns

Clearly the thing is going to work either way so its all about Defensive Programming - reducing bugs both during initial code development, but also during subsequent maintenance.

simflex1 · March 4, 2017, 4:00pm

Kristen,

Thanks so much for your help.

I really appreciate it.

First, I know you spotted ttitle as possible typo. Actually, it was intentional because I was under the impression that title is a reserved word.

About the overall DB design, maybe a little clarity from me.

As stated initially, this entire questionnaire is for and about employees.

An employee has to provide the following information.
I know it may sound like I am repeating myself again but please bear with me for a second.

Question1: Provide the following information about yourself:
Employee Name, Title, and Email. (This is why we have Employee table)

Question 2: Provide Information about your spouse's income source
Spouse Name, Spouse Address, Spouse Income (Hence SpouseDetails table. Income source could be more than one row,)

Question 3: Provide Your Source(s) of income other than regular pay
Source Name, Source Address, Source Income. (Hence SourceDetails. Income source could be more than one row).

Question 4: Provide Information About the Organization(s) you deal with outside of work.
OrgName, OrgAddress, OrgIncome, (Hence OrgDetails table. Could be more than one org).

So, that's how the Questions table came about.

How do I relate the Questions table to other tables?

I agree that more questions could be added later but I have always liked the idea of making the questionID a primary, auto increment ID. We can design the increment to be in 5s or 10s like 1, 10, 20. I worry that the admin who will tasked with managing the addition or removal of questions may run into issues by forgetting not just to add QuestionID followed by question but may not necessarily follow the increment convention.

Kristen · March 4, 2017, 5:36pm

If you prefix all the columns in the Employee table with "Emp" of "Employee" that problem will go away [quote="simflex1, post:7, topic:9574"]
I have always liked the idea of making the questionID a primary, auto increment ID. We can design the increment to be in 5s or 10s
[/quote]

Its a PITA the add a non-automatically-incrementing number to a table with an IDENTITY. I'm sensing that you only have half a dozen questions, in which case Manual Numbering seems OK to me. But you could have IDENTITY and a Sequence column - then use the Sequence column to control what order questions are asked. Easy to change the Sequence in future, if needed, such that "2" comes before "1" even ... the APP can use the QuestionID to actually decide where to store the data etc.

Sorry, I don't "get" why you need to.

simflex1 · March 4, 2017, 6:05pm

Hi again Kristen,

Great point about question ordering.

I didn't think of that. Thank you.

One of the great benefits of coming here.

About why I need the questions related, if I need to write a report about which employee has taken this questionnaire and the responses to these questions (about source, spouse, org), how do I relate the questions to those who answered them?

Don't the questions need to be inserted into the database just like the rest?

So that when the results are displayed, we would have something like:

Question #1: The question...

Then the response(s) to the question?

Kristen · March 5, 2017, 9:41am

[quote="simflex1, post:9, topic:9574"]
About why I need the questions related, if I need to write a report about which employee has taken this questionnaire and the responses to these questions (about source, spouse, org), how do I relate the questions to those who answered them?[/quote]

Can the data get into the database OTHER than by the questionnaire?

Can the data get into the database by a DIFFERENT Question#? (Seems to me that "Spouse Details" will be Question #2, or whatever, and not "one of these N-questions"??)

So either have a BIT column [IsFromQuestionnaire] on each of the tables, or an INT column [questionID] on those tables (indicating which Question# resulted in that row being created.

Could be

You have to ask the questions on a Web Form, APP, or somesuch. A Program will have to do that ... so the Program has to get the Questions from somewhere. That could be hardwired in the program itself.

The advantage of having the questions in the DB are:

1a. PRO: Easy to change the text (spelling mistake etc.)
1b. PRO: Easy to support multiple languages / variants (if that becomes an issue). Seems improbable in this case
1c. CON: APP has to do round-trip to DB to get the questions every time. probably not a big deal, but might not be worth building all that if just for small-usage
2a. Helpful to be able to JOIN , say, Spouse Income to the actual question that was asked and caused that row to be created.
2b. Unless you have WAY more questions / complexity I don't think this is necessary

simflex1 · March 6, 2017, 2:32am

OK, just clarify, from what you have provided so far, we are back almost to my original post.

Just two changes from my original design.

1, Main table is not needed, neither is Questions table.

Everything else is fine.
Could you please confirm at your convenient time.

Thanks a lot for your help.

Kristen · March 6, 2017, 7:22am

I would say the Questions table is optional. Handy to modify the text ...

Now ... IF the APP needs to use the Questions Table to process the data, then it becomes useful (well ... to my mind anyway!)

For example, if the Question Table had an SProc and Parameter name then the APP could just ask questions, call SProc, with @Parameters, and would not need to do any processing, per se, at all. You could add questions freely without changing the APP - just write an SProc each time there are more questions.

I'll be interested to hear what @ScottPletcher has to say though.

simflex1 · March 6, 2017, 2:38pm

@ScottPletcher has gone AWOL, LOL

ScottPletcher · March 6, 2017, 4:44pm

I'm very busy at work and naturally that comes first. I like to help, but it's not my major priority.

Normalization is a step by step process, particularly as you still getting more familiar with how to do it. [Plus "fully normalized" is a rather dangerous term to DBAs, implying something like 5NF (or more?). That's almost never necessary. Proper normalization to 3NF or BCNF is enough.]

So, let's start with the Employees table, since Employee is the key Entity here:

CREATE TABLE [dbo].[Employees](
	[employeeID] [int] IDENTITY(1,1) NOT NULL,
	[employeeName] [nvarchar](50) NULL,
	[ttitle] [nvarchar](50) NULL,
	[email] [nvarchar](50) NULL
)

Is that normalized? Let's review the rules for 1NF, and check each one for violations:
Rules:
1A) Every column has an "atomic" (indivisible) value (as it relates to the current and anticipated business needs).
1B) The row has a unique column/set of columns that distinguishes it from every other row (a unique key).

1A) Any column(s) not pass this test?:

employeeName. An emp name has multiple parts/values: first name, middle name and last name, at least, plus optionally a prefix/title ("Dr.", etc.) and/or suffix ("Jr.", "Sr.", "III", etc.). This column needs broken apart.

email? This depends on your view. Mine is that the local part (before the @) and the domain (after the @) should be stored separately, partly because it simplifies querying. For example, "Some of our employees use Yahoo, which is down right now, which of our employees is affected?". (It also allows one to switch to using a short numeric code for the domain, saving space and simplifying domain value changes, some key goals of normalization).
But, it is fairly common to (lazily, in my view ) use a single email address value, so you have your choice here.

1B) Rule is met: employeeId is a suitable key.

In passing, I'll note that Address would have the same 1NF issue as Name -- an Address is multiple parts, not a single value.

ScottPletcher · March 6, 2017, 4:56pm

On to 2NF and 3NF. The rules are:
2A) Table is in 1NF.
2B,3A) (In non-tech terms) Every non-key column depends on the key, the whole key and nothing but the key.

2A) Check: we just did that.

2B,3A) "ttitle"? Nope. The job title relates to the company as well, not just the Employee. For example, "Head Accountant" is Sally; I fire Sally (she was always late), so I delete Sally from the (current) emp table -- my job title just went away?! That should not happen! ttitle should be a title_id that points to another table that contains the actual title.

Let's consider OrgDetails briefly too. It's certainly possible that more than one employee would belong to the same Org. It's also possible I might want to store Org data before any Employee actually belongs to that Org (perhaps rare, but possible). Therefore, separate Org and OrgDetails tables are needed, and an intersection table that tells which emp(s) belong to which org(s):
employeeOrg
that contains:
(employeeId, orgId) as the key. Data might include the date that emp joined, date left, etc..
Also, just because the only emp currently belonging to OrgA were to quit that org, and so I deleted the emp's org record, I wouldn't necessarily want to lose the data I had about OrgA itself.

ScottPletcher · March 6, 2017, 5:22pm

[quote="Kristen, post:4, topic:9574"]
You wouldn't have EmployeeID in the Spouse table (for example)?
the Spouse Record with the Employee record. Indeed, no Link Table required at all.[/quote]

No, I would create a separate intersection table, like "employeeSpouses", to hold that data. Mainly because I need to store data about that specific relationship (date started, date ended, etc). Also, it's possible that a current spouse and a former spouse of the same Emp would still both be Emps themselves. In that case, for benefit calcs, retirement calcs, child support, scholarships, etc., I might need specific data on all those relationships.

Besides, if they ever legalize polygamy, my db structure can handle it

simflex1 · March 6, 2017, 6:17pm

@ScottPletcher said:

I'm very busy at work and naturally that comes first. I like to help, but it's not my major priority

I completely agree.

Thanks to you and Kristen for sparring sometime out of your busy schedules to help.

ScottPletcher · March 6, 2017, 7:18pm

If you'll look above, I did find time to do three separate posts about 1NF, 2NF and some other of my normalization thoughts (seems like every DBA has his/her own method ).

ScottPletcher · March 6, 2017, 10:41pm

Normalization is not especially difficult, although experience helps. But the main thing is to take it a step at a time, not try to normalize everything at once in your head. Instead, go thru column by column and check the rules to see if they are met, starting at 1NF and going thru 3NF/BCNF.