SQL Project Design and Approach

shazimkhan574 · July 29, 2015, 5:11pm

SQL Server Project.

Hi
I am into a sql project. The project is scheduled to start during sept and is for a month duration. I am a beginner in sql, however I have experience in control system domain.
I can execute basic sql that are required, however this is for the first time I am going to be a part of full fledged sql project.
Can anyone help me in clearing few of my doubts.
Below are the details of the project.

        ******************** Brief about the project ********************

Live data from ground level instruments is coming via a scada application (which acts as a front end software) needs to be dumped on sql.
There are 8 machines from which data needs to be taken. Hence it has been decided to have 8 tables, 9th table will have consolidated data of all 8 tables.
After the data is collected, a report needs to be generated, one individual report for each machine. And a consolidated report for the all the machine.
We also need to have different authenticated levels for different users, and based on their login the user information ll be captured.

        ******************** Tools/Platforms that ll be used ********************

SQL SERVER 2008, SSRS Reporting tool

        ******************** Approach ********************

During the meeting with the client, it was decided first we ll make the database ready and deploy it on the production plant, thereafter we ll optimise the data once the needs arises in future. However I have explained this is not the correct approach as the data ll grow in size with a short time, as it has been found out during first stage analysis.

The data generated is huge, around a 1000 rows per machine/hr. So roughly speaking the calculation goes like (1000*24hours) = 24000 rows per day per tble. So in a months time the data will be very huge.

So I explained to them the contraints that we ll encounter in the near future & I have been asked to come with an alternate solution.

        ******************** Design of the table ********************

I am planning to first address the database design needs. Any suggestions for the improvemnt of my points are highly appreciable. Please excuse if anything silly, as I am executing full fledged db project for the first time. Though I have seen database servers during my earlier projects.

        ******************** A small note ********************

Based on my experience to earlier plants, I have seen when the data grows exponentially beyond , say 500 mb. The retrieval time increases.
I have seen people struggling to retrieve data which consumes a good 1 hour +. However my question is it common for the system to take soo long to retriev the data when it grows to such an extent?
I am asking this because even a days data that needs to be displayed the next business day comes around 192000 rows. And the consolidated report needs to have atleast a months data which is very huge.

        ******************** Coming back to the design part ********************

Each table needs to have the above columns, each tbl is addressed as Machine1 to Machine8
SlNo
OperatorName
OperatorID
Dept
MachineName
MachineID
PowerGenerated
AlarmPriority
AlarmsAcknowledged
DateTime

I am planning to have a separate table with operator name, operatorID and dept as primary key and to call it in another table through foriegn keys. I hope it will be of help to a suffiecient level to keep the data from growing.
I am plannig to have the last table as a view of the 8 physical tables.

Does having indexes on the table help to optimise the exectution time, I am not sure how indexing ll help to reduce the execution time.

Can anyone suggest if the above approach is good or is it lacking any key points.
Also can anyone help me in optimising the data for faster retrival time.?
Is there anyway to store the data, say 6 months data, which is readily accessible than to just maintain backups on external hard disks
If you require any more info I would be happy to share.

Awaiting your replies

Kristen · July 29, 2015, 6:53pm

I think this is debatable, compared to having a single table with a Machine_ID column. (You have the Machine_ID column in the table anyway??)

But, that said, a partitioned-view (i.e. a view UNION'ing several physical tables together) can be queried very efficiently - if the correct indexes are in place SQL will only query those physical underlying tables that actually contain relevant data.

My guess is that it will be more common to query on a Date range, rather than on specific Machines, and thus partitioning the data (into physical tables) based on Date might be better.

Partitioning on Date - e.g. create, and start populating, a new table each month - means that it is easy to DROP an old table once it becomes stale - e.g. after 12 or 24 months. OTOH deleting millions of rows from a table (i.e. where the CreateDate < 12-months-ago) is a huge task on large tables and needs complex programming to handle it (because just using a simple DELETE will bring the machine to its knees! not to mention the disk space required to LOG the operation).

I would want Operator_ID to be unique (regardless of department). Then the Operator Table can have a [Department_ID] column (linking to a [Department] table containing Name and anything else of interest.

Ditto for Machine_ID - link to a [Machine] table containing [MachineName] etc.

This is "Database Normalisation" - you should not repeat any data in a table, so if MachineID and MachineName are always matched then just have an ID and link that to a table that provides the [Name].

Also the fewer the columns in the main table, and the narrower they are (as INT for an ID is much narrower than a Character Column for the Machine/Person Name) reduces table size, and increases the number of rows that can be stored in an index page. If the OperatorID will be less than 256, or 256*256, then use a BYTE or 2xBYTE datatype rather than INT (4xBYTE) to save space. MachineID seems like a good candidate for a small integer datatype

Absolutely, and its very important in terms of improving performance, more especally on large tables. It helps in several ways:

First for a Range Query:

SELECT Col1, Col2, ...
FROM MyTable
WHERE     MyDateTime >= @StartDate 
      AND MyDateTime < @EndDate

If there is an index on MyDateTime then SQL will use that to quickly find the relevant rows between the date range limits, and will then look up those rows in the underlying [MyTable]

SQL keeps statistics about the distribution of values in the index, so if you run the query with a date range that SQL estimates will match a huge number of rows it won't bother to use the index as index-lookup PLUS data-lookup will be slower than just scanning the whole table. SO SQL will use indexes in a smart way.

Then there is a table-join:

SELECT Col1, Col2, ...
FROM MyTable AS T1
     JOIN MyOtherTable AS T2
          ON T2.SomeColumn = T1.SomeColumn
WHERE     ...

in this example an index on [SomeColumn] can be used to speed up retrieval of the associated data.

Consideration needs to be given to which columns are used in the CLUSTERED INDEX (only one per table), as that provides the fastest retrieval route. Tables are often joined on the Primary Key column(s), and Primary Key columns are typically used for the Clustered Index, because in general that works out best. That said, the Primary Key is not always the best candidate for the clustered index. In your case you might use [DateTime], [SomeUniqueID] (where the Primary Key would be just [SomeUniqueID]) deliberately to get the benefit of having the [DateTime] at the start of the Clustered Index so that any Date Range query can make best use of finding all rows within that date range.

Then you get a Covering Index. Going back to my first example:

SELECT Col1, Col2
FROM MyTable
WHERE     MyDateTime >= @StartDate 
      AND MyDateTime < @EndDate

If you have an index [MyDateTime], [Col1], [Col2] then that is said to be a "covering index" for the query. Everything SQL needs for the WHERE and SELECT is in the index - the date range on [MyDateTime] is easily deduced (because that is the first column in the index) and the Col1, Col2 for the SELECT is also available - this then AVOIDS SQL having to do an additional Data Lookup on the underlying table. For that reason it is sometimes desirable to add columns to an index which are not intended to be directly part of a Range / WHERE clause test but which are used in the SELECT (in modern versions of SQL there is a way to INCLUDE such columns so that they use less disk space than if they were just declared as being part of the Keys in the index)

SQL has a tool that shows you the Query Plan that it will use to solve a particular query. This will show which indexes are being used and the estimated "Logical Disk Reads" and "Number of Table Scans". You can then experiment by creating a new index and reviewing the Query Plan again to see if it will use your new index and, if so, how much the Disk Reads and Table Scans have been reduced

SQL can also recommend indexes, based on queries that are used and which would perform better with additional.different indexes, which is a pretty cool feature! but don't blindly follow that advice as its a very blunt weapon, but it sure beats empirically creating indexes and then scratching your head as to why SQL hasn't used them!

Downside: Indexes take up disk space and each time you INSERT a new row (or update any relevant columns - but that doesn't apply to you) an entry in each Index has to be created/re-created. That's both Disk Space and Time. Logging data from 8 machines, 1,000 items per hour, is 133 inserts per minute ... presuming that you are also going to Query on the same machine? (You could restore yesterday's backup to a different, Enquiry, database server and only query on that) then it will be critical that the queries are optimised so as not to interfere with new data inserts.

Perhaps better? would be to log the data from the machines to, for example, CSV files and then IMPORT those once an hour or so (rather than in real time). That gives you a) no issue with performance as the Bulk Import will be much faster than individual data row inserts, and B) repeatability - e.g. if your database is corrupted you can just re-import the files since the backup file which you restored from

Well you can keep an old backup and then delete the stale data in the live database, but then what? How would you make use of the data of the backup? (Yeah, restore it and query it then DROP the restored database etc etc but I suspect its not really very viable as the temporary database you restored to would not be the one that the Application will be connecting to ...)

Partitioned Tables would probably offer a way, you could even used names storage - e.g. have storage called YEAR2015 that has all the monthly data partitioned tables for 2015, and then maybe you could mount that (from the Restore of a Backup) if the need arose in the future ... and easily DROP that storage unit when your archive reporting was completed. I've never done that, so I'm just guessing, but if you will need that you'd have to build that in from Day One.

1000 rows per machine/hr, and 8 machines, and assuming that the data is relatively skinny, sounds like only 10-50GB per annum. The queries definitely have to be optimised, but I can't see (in terms of disk storage) that you will have any difficulty retaining several years worth of data.