Sum of rounded sub-values doesn't match the total value

dcristut · August 27, 2019, 12:28pm

Hello,
We need to split the value of a field (a production cost per item) according to a value key (hours / phase from total hours). We use rounded values and there is an inherent difference between the total value and the sum of the calculated "sub-values".
Example:

Costs/Items
+=====+========+==========+
| ItemId | Total Cost | Total Hours |
+---------+---------------+----------------+
| 7 | 3729 | 147 |
+---------+---------------+----------------+
Hours/Phases
+=====+=======+======+=========+
| Item Id | Phase Id | Hours | Phase cost |
+=====+=======+======+=========+
| 7 | 1 | 23 | 583 |
| 7 | 2 | 31 | 786 |
| 7 | 3 | 4 | 101 |
| 7 | 4 | 52 | 1319 |
| 7 | 5 | 37 | 939 |
+=====+=======+======+=========+
| "recomposed" value | 3728 |
+=====+=======+======+=========+
The Difference=1 and it needs to be added to one of the "sub-values" to balance the cost split.
Is there a way to distribute that difference to one (the first, the last or the one with the greatest value) of the "sub-values" ?

Thank you, Daniel
(sorry, I don't know hot to insert aligned tables )

yosiasz · August 27, 2019, 1:36pm

Best to provide sample data with real ddl and dml

Create table items(itemId int, name vachar(50))

Insert into items
Select 1, 'chicken'

dcristut · August 27, 2019, 2:30pm

Errrr, thank you, I think ....
I don’t understand what this answer is about.
We need an idea on how to distribute the difference over the calculated values.
Thank you, Daniel

dcristut · August 27, 2019, 3:21pm

Oh, I see now
Thank you, the data is, indeed, more readable.
Thx, Daniel

ScottPletcher · August 27, 2019, 4:28pm

Here's code to add the difference(s) to the highest existing value (but if there's a tie for highest, the diff would add to both, causing totals to be off: if that's possible, let me know).
Note that I would not have had time to work on this without usable sample data. I'm not trying to be difficult or harsh, but I help up to a dozen people a day online with SQL qs, I just don't have time to create sample data for everyone. I can write the SQL far more quickly than I can do all the code to create sample data.

;WITH cte_totals AS (
    SELECT HP.ItemId, SUM(HP.Hours) AS HP_TotalHours, SUM(HP.[Phasecost]) AS Total_Phasecost,
        MAX(HP.Hours) AS Max_Hours, 
        MAX(CI.[TotalCost]) AS CI_TotalCost, MAX(CI.[TotalHours]) AS CI_TotalHours,
        MAX(CI.[TotalCost]) - SUM(HP.Phasecost) AS Phase_Diff,
        MAX(CI.[TotalHours]) - SUM(HP.Hours) AS Hours_Diff
    FROM [Hours/Phases] HP
    CROSS APPLY (
        SELECT [TotalCost], [TotalHours]
        FROM [Costs/Items]
    ) AS CI
    GROUP BY HP.ItemId
)
SELECT HP.ItemId, HP.PhaseId, 
    CASE WHEN HP.Hours = CT.Max_Hours THEN HP.Hours + Hours_Diff 
        ELSE HP.Hours END AS Hours,
    CASE WHEN HP.Hours = CT.Max_Hours THEN HP.[Phasecost] + Phase_Diff 
        ELSE HP.Phasecost END AS [Phase cost]
    , HP.Phasecost AS original_phase_cost /*just for verification, remove after testing*/
FROM [Hours/Phases] HP
CROSS JOIN cte_totals CT

dcristut · August 27, 2019, 5:03pm

Hi Scott,
Thank you for the elegant solution.
Actually it IS possible to have a tie for the highest values, or have the values be evenly distributed ( like 10 hours = 2+2+2+2+2 hrs... ). In can be done somehow programatically in three steps (like splitting values, reassembling the results by summing them and then distribute the difference) ... a brute-force approach though
Sorry for the lack of data, I really didn't understand the helping hand @harish gave .... (thank you again harish).
Daniel

ScottPletcher · August 27, 2019, 5:59pm

Probably the easiest way is to combine hours and phase id for the MAX(), and then split them back out afterward. The code below does that. It will not perform as well as the first query but should perform than methods that require any additional I/O / table queries.

/* first, let's set the the last phase to the same highest hours as the 4th phase. */
UPDATE [Hours/Phases] SET Hours = 52 WHERE PhaseId = 5; SELECT * FROM [Hours/Phases];

/* then run the new query: note that only phase 5 values are adjusted, 
   and that the hours are adjusted too, since I didn't adjust the CI row 
   when I increased the total hours in HP */
/* note also you can run the CTE query by itself so you can see the preliminary query result */
;WITH Cte_Totals AS (
    SELECT HP.ItemId, SUM(HP.Hours) AS HP_TotalHours, 
        SUM(HP.[Phasecost]) AS Total_Phasecost,
        MAX(CAST(HP.Hours AS bigint) * 10000000000 + HP.PhaseId) AS Max_Hours_with_PhaseId,
        MAX(CI.[TotalCost]) AS CI_TotalCost, MAX(CI.[TotalHours]) AS CI_TotalHours,
        MAX(CI.[TotalCost]) - SUM(HP.Phasecost) AS Phase_Diff,
        MAX(CI.[TotalHours]) - SUM(HP.Hours) AS Hours_Diff
    FROM [Hours/Phases] HP
    CROSS APPLY (
        SELECT CI.[TotalCost], CI.[TotalHours]
        FROM [Costs/Items] CI
        WHERE CI.ItemId = HP.ItemId
    ) AS CI
    GROUP BY HP.ItemId
)
SELECT HP.ItemId, HP.PhaseId, 
    CASE WHEN HP.Hours = HP_calcs1.Max_Hours AND HP.PhaseId = HP_calcs1.Max_PhaseId
         THEN HP.Hours + Hours_Diff ELSE HP.Hours END AS Hours,
    CASE WHEN HP.Hours = HP_calcs1.Max_Hours AND HP.PhaseId = HP_calcs1.Max_PhaseId
         THEN HP.[Phasecost] + Phase_Diff ELSE HP.Phasecost END AS [Phase cost]
    , HP.Phasecost AS original_phase_cost /*just for verification, remove after testing*/
    , HP.Hours AS original_hours /*just for verification, remove after testing*/
FROM [Hours/Phases] HP
CROSS JOIN cte_totals CT
CROSS APPLY (
    SELECT Max_Hours_with_PhaseId / 10000000000 AS Max_Hours,
        Max_Hours_with_PhaseId % 10000000000 AS Max_PhaseId
) AS HP_calcs1

dcristut · August 28, 2019, 5:00am

Thank you very much, Scott

khtan · August 28, 2019, 8:47am

Assuming you wanted to adjust the existing Phasecost value. You can make use of the window function to calculate total Phasecost "SUM(hp.Phasecost) OVER(PARTITION BY hp.ItemId)" and compare with your original TotalCost, if there is a difference, adjust the different to the line with highest PhaseCost value "ROW_NUMBER() OVER (PARTITION BY hp.ItemId ORDER BY Hours DESC) = 1"

SELECT	
   hp.ItemId, 
   hp.PhaseId,
   hp.Hours,
   hp.Phasecost,
   hp.Phasecost 
  + CASE	WHEN	SUM(hp.Phasecost) OVER(PARTITION BY hp.ItemId) <> ci.TotalCost
	        AND	    ROW_NUMBER() OVER (PARTITION BY hp.ItemId ORDER BY Hours DESC) = 1
	        THEN	ci.TotalCost - SUM(hp.Phasecost) OVER(PARTITION BY hp.ItemId)
	        ELSE	0
	        END     AS PhasecostAdjusted
FROM	   [Costs/Items] AS ci
INNER JOIN [Hours/Phases] AS hp 
        ON ci.ItemId = hp.ItemId
ORDER BY hp.ItemId, hp.PhaseId

Or you can apply similar technic when you are inserting to the [Hours/Phases] table. Then you are skip this step as you are getting the data correct in the first place.

And here it is.

SELECT	hp.ItemId, hp.PhaseId, hp.Hours,
	CONVERT(int, ci.TotalCost * 1.0 * hp.Hours / ci.TotalHours)
	+
	CASE WHEN   SUM(CONVERT(INT, ci.TotalCost * 1.0 * hp.Hours / ci.TotalHours)) OVER(PARTITION BY hp.ItemId) <> ci.TotalCost
		 AND	ROW_NUMBER() OVER (PARTITION BY hp.ItemId ORDER BY Hours desc) = 1
		 THEN	ci.TotalCost - SUM(hp.Phasecost) OVER(PARTITION BY hp.ItemId)
		 ELSE	0
		 END	AS PhasecostCorrected
FROM	[Costs/Items] AS ci
INNER JOIN [Hours/Phases] AS hp 
        ON ci.ItemId = hp.ItemId
ORDER BY hp.ItemId, hp.PhaseId