Is sacrificing database integrity for speed a good or bad idea?

Let's say you have a site that allows users to post questions and answers, like this one. Each question has associated items like the number of flags, answers, likes, views, etc.

One option is to have separate tables for flags, answers, likes, etc., hit the database to find the totals for each question, then display that data with each question listing. This maintains the integrity of the data throughout, which is good, but it requires several additional database hits/queries.
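
Something like this is what I have in mind for the first option (just a rough sketch; the schema and names are made up):

```python
import sqlite3

# Sketch of the "separate tables" option; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE likes   (id INTEGER PRIMARY KEY, question_id INTEGER);
    CREATE TABLE answers (id INTEGER PRIMARY KEY, question_id INTEGER);
    CREATE TABLE flags   (id INTEGER PRIMARY KEY, question_id INTEGER);
""")

def question_totals(question_id):
    """One COUNT(*) per metric, i.e. several extra queries for every question listed."""
    totals = {}
    for table in ("likes", "answers", "flags"):
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM {table} WHERE question_id = ?", (question_id,)
        ).fetchone()
        totals[table] = count
    return totals
```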

Or I could keep these quantities as part of the question record, with a field each for flags, answers, likes, views, etc., and have those fields updated each time something is added or deleted. This would be faster, as the system would not need to hit the database 4 or 5 extra times to get these quantities.
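
The second option would look roughly like this (again a sketch with made-up names; the counter gets bumped in the same transaction as the insert):

```python
import sqlite3

# Sketch of the "counter columns" option; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT,
                            like_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE likes (id INTEGER PRIMARY KEY, question_id INTEGER, user_id INTEGER);
""")

def add_like(question_id, user_id):
    """Insert the like and bump the counter in one transaction, so a single
    SELECT on questions is enough when listing."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT INTO likes (question_id, user_id) VALUES (?, ?)",
                     (question_id, user_id))
        conn.execute("UPDATE questions SET like_count = like_count + 1 WHERE id = ?",
                     (question_id,))
```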

Then you would run a cron job every night to correct the values in the question fields if the two sources are out of sync (treating the individual tables as the source of truth).
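
The nightly job would do something along these lines, assuming the counter-column schema from the sketch above (one such statement per metric):

```python
def reconcile_like_counts(conn):
    """Nightly cron job: recompute each counter from its source table (the source
    of truth) and overwrite any value that has drifted."""
    with conn:
        conn.execute("""
            UPDATE questions
            SET like_count = (SELECT COUNT(*) FROM likes
                              WHERE likes.question_id = questions.id)
            WHERE like_count <> (SELECT COUNT(*) FROM likes
                                 WHERE likes.question_id = questions.id)
        """)
```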

Is this a good idea or a bad idea? Should I worry about hitting the database 4 or 5 extra times for each question that gets listed? I'm just starting development at this point and would appreciate feedback on when this would ever be a good idea, if ever. If never, then at what point does it become a concern, or do you just add more $$ and bandwidth as your community grows and not worry about it?

Perhaps there is a better approach?

Thanks in advance!

I lean towards data integrity over performance every time. My experience has been that poor integrity = months of time spent correcting bad data that would have been prevented in milliseconds with proper constraints. Most of that experience is in financial industries (credit card processing, utility billing, cellular billing and service).

If you don't have any financial repercussions from bad data, you could perhaps relax the integrity. But if you care about good data, and you'd prefer not to spend time fixing it, then take the integrity route. You can incorporate processes that summarize data on the fly as you modify it (or use triggers, though try to avoid them). Without specifics, it's difficult to give advice.

The Stack Overflow database design is openly available; I don't know if the code is, but you could look at the DB schema to get some ideas on how they address things like view counts, etc.

Reduce the I/O and go for a reasonably good total rather than a 100% accurate total if necessary. It's not like you're showing a customer balance or other critical data here. It'd be crazy to scroll through counting up 100s, 1000s, 1Ms(?) of likes, etc. every time someone went to the page where the total was displayed.

But, to keep the totals as up to date as possible, add triggers to the other tables to maintain the totals. This is relatively easy to do and should keep the totals accurate. But, yes, probably once a day, still run an actual count to verify the counters: if any counters are not accurate, correct them, then review the triggers to see how the totals got out of sync.
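
For example, with SQLite-style triggers it could look something like this (names are placeholders, and you'd want one pair of triggers per counted table):

```python
import sqlite3

# Sketch of trigger-maintained counters; schema and names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY,
                            like_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE likes (id INTEGER PRIMARY KEY, question_id INTEGER);

    CREATE TRIGGER likes_insert AFTER INSERT ON likes
    BEGIN
        UPDATE questions SET like_count = like_count + 1 WHERE id = NEW.question_id;
    END;

    CREATE TRIGGER likes_delete AFTER DELETE ON likes
    BEGIN
        UPDATE questions SET like_count = like_count - 1 WHERE id = OLD.question_id;
    END;
""")

# Quick check: the counter follows inserts/deletes with no application code involved.
conn.execute("INSERT INTO questions (id) VALUES (1)")
conn.execute("INSERT INTO likes (question_id) VALUES (1)")
print(conn.execute("SELECT like_count FROM questions WHERE id = 1").fetchone())  # (1,)
```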

Thanks Scott! I tend to agree with this method for non-critical metrics such as likes, hearts, views, etc. Nobody really cares too much about them being accurate, but I can see there could be a database loading/speed issue if the user base grows to something like 100,000+ and they are happily scrolling through tons of lists/images/posts with these metrics. Since I don't have enough experience with database loading/server performance, etc., I'll let my fear dictate this direction, knowing that at least I won't need code rewrites if the user base ever gets large...

Over time, the triggers should end up being 100% accurate, but perhaps not when you first start using them (hint: be sure to spend some time working out the best, and most accurate, way to implement the triggers).