Residual cost and related issues
=================================
We estimate age- and gender-specific average residual costs by restricting the sample to people who had
no model-defined diagnosed diseases. These people could have other diagnosed conditions, including
chronic ones, or be otherwise healthy, and they may or may not have had zero health expenditures.
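
As a minimal sketch of this estimation step, the snippet below averages costs by age group and gender among people without a model-defined disease. The records, field layout and function name are hypothetical and serve only as illustration; they are not taken from the administrative datasets.

```python
from collections import defaultdict

# Hypothetical person-level records (illustrative values only):
# (age_group, gender, has_model_defined_disease, annual_cost)
records = [
    ("50-59", "F", False, 0.0),     # no model-defined disease, zero expenditure
    ("50-59", "F", False, 400.0),   # may have other diagnosed conditions
    ("50-59", "F", True, 2500.0),   # excluded: has a model-defined disease
    ("50-59", "M", False, 300.0),
    ("60-69", "M", False, 900.0),
    ("60-69", "M", True, 4100.0),   # excluded: has a model-defined disease
]


def average_residual_costs(rows):
    """Mean cost per (age_group, gender) among people with no model-defined disease."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for age_group, gender, has_model_disease, cost in rows:
        if has_model_disease:
            continue  # restrict the sample to people without modelled diseases
        key = (age_group, gender)
        totals[key] += cost  # zero-cost people still count in the denominator
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


residual = average_residual_costs(records)
for key in sorted(residual):
    print(key, residual[key])
```

People with zero expenditures stay in the denominator, so the result is an average over the whole disease-free group rather than over users of care only.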

A potential complication is that in microsimulations, people are assigned disease status based mostly
on IHME epidemiological prevalence (with some additional calibrations as appropriate), which may well
differ from the prevalence observed in administrative datasets. Conversely, in the administrative
datasets that we have analysed, people with undiagnosed health conditions may be misclassified as
not having the conditions that we are modelling, so their costs are wrongly counted as
residual costs. Such costs are also not captured when estimating the extra cost of disease.

One potential way to deal with this is to assume that IHME-based prevalence reflects “correct”
epidemiological prevalence (i.e. including both diagnosed and undiagnosed cases). Under this
assumption, we could in theory adjust the predicted costs of disease by multiplying them
by a factor based on the ratio of “diagnosed” to “real” prevalence. If we find,
for example, that for women aged 50-59, the prevalence of diabetes based on administrative data is
10%, while IHME-based prevalence is 12%, then we could multiply the estimated extra cost of disease
in this group by 10/12 ≈ 0.83, so that the cost is representative of all women with the disease,
diagnosed or not. Alternatively, we could assume that the extra disease cost equals zero
for the proportion of people who are undiagnosed according to IHME data. Likewise, we could
re-categorise our residual costs accordingly, which is likely to increase residual costs because
the number of cases with zero expenditures would be reduced. The net effect on total costs
is therefore ambiguous.
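
The arithmetic of the diabetes example can be sketched as follows. The prevalence figures come from the example above; the extra cost per diagnosed case and all names are hypothetical, chosen only for illustration. The last step reflects one reading of the alternative approach, in which undiagnosed cases are given a zero extra cost and the result is averaged over all IHME cases.

```python
def diagnosed_share(admin_prevalence, ihme_prevalence):
    """Ratio of diagnosed (administrative) to total (IHME-based) prevalence."""
    if ihme_prevalence <= 0:
        raise ValueError("IHME prevalence must be positive")
    return admin_prevalence / ihme_prevalence


# Prevalence figures from the example in the text: diabetes, women aged 50-59.
admin_prev = 0.10
ihme_prev = 0.12

# Hypothetical extra cost per diagnosed case (not a value from the source).
extra_cost_diagnosed = 1000.0

# Option 1: scale the estimated extra cost by the diagnosed share.
factor = diagnosed_share(admin_prev, ihme_prev)
adjusted_extra_cost = extra_cost_diagnosed * factor

# Option 2 (assumed reading): full extra cost for diagnosed cases, zero for
# undiagnosed cases, averaged over all IHME cases.
undiagnosed_prev = ihme_prev - admin_prev
average_over_all_cases = (
    admin_prev * extra_cost_diagnosed + undiagnosed_prev * 0.0
) / ihme_prev

print(round(factor, 2))               # 0.83
print(round(adjusted_extra_cost, 2))  # 833.33
```

Under these assumptions both options give the same average extra cost per IHME case. This sketch leaves out the re-categorisation of residual costs, which is the part that makes the net effect on total costs ambiguous.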

Nevertheless, it is not certain that IHME-estimated disease prevalence is superior to
the administratively derived one, as it relies on data of varying quality and methodological basis
(e.g. it can be based on multiple sources of survey data, with additional assumptions to correct for
self-reporting bias). Some analysis shows that in France, for example, age- and gender-specific
prevalence of diabetes and of several cancers is higher in the administrative dataset than in the IHME dataset,
which suggests that this divergence may not be due to the inclusion of undiagnosed cases in IHME data.
In some other cases the prevalence was considerably higher in the IHME dataset, but this
was mostly true at the oldest and the youngest ages, where IHME estimation methodology might rely
on too little data and on too many assumptions. In addition, at the oldest ages (generally older than 60-70),
where the prevalence rates diverge the most, the absolute number of affected people falls with
each year of life, so the total impact on costs is reduced.
Therefore, we prefer not to further adjust the extra disease costs or the residual costs. Besides, since we are
interested mostly in the “delta effect” when comparing interventions or scenarios,
the potential overestimation stemming from assigning the estimated costs to undiagnosed
cases is probably of minor significance.