I'm trying to understand whats happened - we had an 8 CPU server that was maxing out at 100% CPU across all CPUs with a heavy read component to queries.
What stopped it was changing the maxdop from 0 ( use all CPUs for each query ) to 1 ( use 1 CPU per query ).
Can anyone explain in plain language why this massive decrease in CPU utilization happened?
The only explanation I can come up with is that with the 0 MaxDop setting the SQL Server was "drinking from a fire hose" and was able to process the data as fast as it was sent to it, and as such is basically "drowned" in data at 100% CPU.
The MaxDop setting of 1 forced SQL Server to use 1 CPU per query, which has now dropped CPU to about 20% across all 8 CPUs.
Maybe I dont understand SQL as well as I thought - can someone enlighten me as to whats happened please? I hate not understanding....
In simple terms I now know that I have a "lever" I can pull to throttle throughput when a server is overwhelmed.
and read the definition of Maxdop
What you now know is that parallelism was causing problems for your queries. What you have also discovered is the ultimate sledgehammer solution, ban all queries from using parallelism entirely. That's step one into a bigger, wider world.
Using multiple CPUs to solve a query seems like a good idea, but it introduces hidden costs. If two CPUs each do half the necessary work, then at some point additional work needs to be done to recombine the two sets of results - and that work isn't free. In the worst case, with lots of queries hitting the database, the overhead of lots of parallel queries becomes worse than never allowing parallel processing at all and you start to see the problems you describe. At that point, setting MaxDOP to 1 can seem like the ultimate band-aid solution - it does make the problem go away, but at the expense of never letting the queries which could and should benefit from parallel processing do so.
So step 2 into the wider world is finding better solutions. There are multiple areas that can be investigated to find a better solution. A good place to start is the "Cost Threshold for Parallelism" setting - this determines how expensive a query plan needs to be before SQL Server will start considering parallel plans and the default (5 )is generally considered to be too low for most workloads on modern systems. Increasing this (and putting MaxDOP back!) might help to avoid simpler queries from switching to parallel plans, leaving only the more heavyweight work to grab extra CPU resource. Starting at 50 is a common rule-of-thumb, but experimenting a little can help find a good value.
Another thread to pull on at the same time is why those more expensive query plans exist at all. Often it's a sign of poor (or no) indexing and bad choices for clustered indexes. Re-assessing the types of queries that run, what data they access and what the expensive query plans look like can help guide better indexing strategies and reduce the amount of work required by SQL Server to resolve a request. Designing good indexing strategies can be hard, too few or too many will create issues but investing time to get it right can be well worth it.
And finally there is still wiggle room in the MaxDOP setting. It would be nice to be in a place where you can leave it unconstrained and just have everything work, but that's not always the case. However there is a fair chunk of room between "as much parallel as you want" and "never go parallel" and sometimes a setting somewhere in the middle is more appropriate. Again finding that spot is often a bit trial and error (and definitely something you want to do after the other options) but it may well be that letting your queries use some extra CPU without going crazy gets you the best performance overall.