4.2. Extra cost of disease

As discussed in , extra disease costs are estimated as “the mean marginal difference of the predicted outcome with a disease variable switched on or off”. There are two different (but related) approaches, depending on whether there was a chronic disease comorbidity.

In general, average healthcare cost for any age and gender can be predicted as follows:

(4.3)$\begin{split}E(C) = P(C>0) \times E(C|C>0) + P(C=0) \times E(C|C=0) \\ = P(C>0) \times E(C|C>0)\end{split}$
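As a quick numerical check of (4.3), the sketch below uses a made-up list of individual costs (the values and variable names are illustrative assumptions, not project data). It shows that the share of people with positive costs, multiplied by the mean positive cost, reproduces the overall mean cost.

```python
# Hypothetical annual costs for five people; two of them have no expenditure.
costs = [0.0, 0.0, 120.0, 480.0, 900.0]

positive_costs = [c for c in costs if c > 0]

# First part: P(C > 0), the share of people with any expenditure.
p_positive = len(positive_costs) / len(costs)

# Second part: E(C | C > 0), the mean cost among those with expenditure.
mean_positive = sum(positive_costs) / len(positive_costs)

# Two-part prediction of E(C); the zero-cost branch contributes nothing.
two_part_mean = p_positive * mean_positive

# Direct sample mean over everyone, for comparison.
overall_mean = sum(costs) / len(costs)

print(two_part_mean, overall_mean)
```

The two printed values agree, which is why the $P(C=0) \times E(C|C=0)$ term can be dropped.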

The extra cost of a disease can therefore be estimated, for a given gender and age group, as the difference in the predicted costs, conditional on the disease status:

(4.4)$\begin{split}{\hat Cost}_{extra} = P(C>0|disease=1) \times E(C|C>0, disease=1) - \\ P(C>0|disease=0) \times E(C|C>0, disease=0)\end{split}$

Another way to think about the first part of this formula is that the $$E(C|C>0,disease=1)$$ component is representative of the population with a given disease who have positive healthcare expenditures, while multiplying by the $$P(C>0|disease=1)$$ factor makes these costs representative of the medically diagnosed population with the disease (who may or may not have positive healthcare expenditures).

Estimating the first-part (i.e. probability) component of the two-part model in (4.4) is, however, complicated: for a number of diseases in France (and for all diseases in Estonia and in the Netherlands), the disease definition in the administrative data depended on whether positive costs were reported. Estimating $$P(C>0|disease=1)$$ with a logit or a similar approach was therefore generally impossible. Even when this was not strictly the case (i.e. when a small proportion of patients with a disease had zero costs), the French research team considered that estimating $$P(C>0|disease=1)$$ was not feasible, because the disease definition was strongly endogenous to the probability of having non-zero expenditures. It was therefore decided to estimate the extra costs of diseases using the following formula:

(4.5)$\begin{split}{\hat C}_{disease=1} - {\hat C}_{disease=0} = P(C>0)E(C|C>0, disease=1) - \\ P(C>0)E(C|C>0, disease=0)\end{split}$

The first-part probability is predicted unconditionally on disease status (but conditional on age). This is not ideal, given that the probability of having non-zero costs is likely to be higher among sick people than among healthy people. To deal with this, one could have assumed, for example, that the probability of having non-zero expenditures was equal to 1 in the sample of people with a disease. However, this assumption is arbitrary and might lead to cost overestimation. By contrast, our decision to use the $$P(C>0)$$ probability in the first part is likely to lead to a conservative estimate of the extra cost.
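The trade-off described above can be illustrated with a small sketch. The function names and all input values are hypothetical assumptions chosen for illustration. The sketch compares formula (4.5), which applies a common $P(C>0)$ to both groups, with the alternative of setting the probability to 1 for people with the disease.

```python
def extra_cost_common_p(p_positive, cost_disease, cost_no_disease):
    """Formula (4.5): one probability of non-zero costs for both groups."""
    return p_positive * cost_disease - p_positive * cost_no_disease


def extra_cost_p_one(p_no_disease, cost_disease, cost_no_disease):
    """Alternative: assume P(C>0 | disease=1) = 1 for the diseased group."""
    return 1.0 * cost_disease - p_no_disease * cost_no_disease


# Hypothetical inputs (assumptions for illustration only).
p_all = 0.95           # P(C>0), unconditional on disease status
p_healthy = 0.94       # P(C>0 | disease=0)
cost_sick = 3000.0     # E(C | C>0, disease=1)
cost_healthy = 1000.0  # E(C | C>0, disease=0)

conservative = extra_cost_common_p(p_all, cost_sick, cost_healthy)
alternative = extra_cost_p_one(p_healthy, cost_sick, cost_healthy)

print(conservative, alternative)
```

With these inputs the common-probability estimate is the smaller of the two, consistent with the text's description of (4.5) as the conservative choice.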

In any case, our estimates suggest that the difference between these probabilities is relatively small for those who are middle-aged or elderly (i.e. in the 50-90 age group), especially among women. For example:

• $$P(C>0|d=0, women, age=60-64)=0.94$$

• $$P(C>0|d=1, women, age=60-64)=0.99$$

• $$P(C>0|d=0, men, age=60-64)=0.98$$

• $$P(C>0|d=1, men, age=60-64)=0.90$$

In the samples with at least one comorbidity, there is very little difference in the predicted probabilities depending on the main disease status.

(4.5) was estimated in two samples, by age and gender:

• Without any comorbidity (i.e., predicted average costs were compared among patients with a disease and without a disease, in the sample with no other chronic diseases)

• With at least one comorbidity (i.e., the predicted costs for patients without the disease of interest were subtracted from the predicted costs for patients with the disease of interest, within the sample with at least one comorbidity)

The parameters in the second part of the two-part model described by (4.5) were estimated similarly to (4.2), but without the interactions and with a dummy for a single disease of interest (rather than a vector of diseases):

(4.6)$ln({Cost}_{i}) = \alpha + \beta \times {age}_{i} + {\gamma}_{k} \times {D}_{i,k} + {\epsilon}_{i}$

This equation is estimated as a generalized linear model (GLM) with a log link and a gamma family distribution. This is a frequently used approach to modelling highly skewed data such as healthcare expenditures: the so-called index function, based on the covariates of interest, is exponentiated via the log link, which guarantees non-negative predictions of healthcare costs. This approach has an advantage over, for example, the ordinary least squares (OLS) estimator, as it avoids the need for retransformation when the goal is to predict actual, rather than log-transformed, expenditures.
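To make the estimation step concrete, here is a minimal sketch of fitting a gamma GLM with a log link by iteratively reweighted least squares (IRLS). It is an illustration under stated assumptions, not the software used by the research teams: the data are synthetic, age enters as a single age-group dummy, and the helpers `solve_linear` and `fit_gamma_log_glm` are hypothetical names written for this example. For the gamma family with a log link the IRLS weights are constant, so each iteration is an ordinary least-squares fit to a working response.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_gamma_log_glm(X, y, n_iter=50):
    """IRLS for a gamma GLM with log link (first column of X = intercept)."""
    p = len(X[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(n_iter):
        eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
        mu = [math.exp(e) for e in eta]
        # Working response for the log link: z = eta + (y - mu) / mu.
        z = [e + (yi - m) / m for e, yi, m in zip(eta, y, mu)]
        xtx = [[sum(r[j] * r[k] for r in X) for k in range(p)] for j in range(p)]
        xtz = [sum(r[j] * zi for r, zi in zip(X, z)) for j in range(p)]
        beta = solve_linear(xtx, xtz)
    return beta


# Synthetic positive costs (assumption): base cost 200, the older age group
# multiplies it by 1.5, the disease multiplies it by 3. Three people per cell.
X, y = [], []
for age_old in (0, 1):
    for disease in (0, 1):
        cell_mean = 200.0 * (1.5 ** age_old) * (3.0 ** disease)
        for noise in (0.5, 1.0, 1.5):
            X.append([1.0, float(age_old), float(disease)])
            y.append(cell_mean * noise)

alpha_hat, beta_hat, gamma_hat = fit_gamma_log_glm(X, y)

# Exponentiated coefficients are multiplicative effects on the cost scale.
print(math.exp(alpha_hat), math.exp(beta_hat), math.exp(gamma_hat))
```

Because predictions are obtained as `exp` of the linear index, they are already on the cost scale and need no retransformation. In practice a statistics package would normally be used instead of a hand-written fitting loop.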

For example, the extra cost of diabetes for a woman aged 55 is predicted using parameters estimated in (4.5) as follows (separately for samples with and without any comorbidities):

(4.7)$\begin{split}{\hat Cost}_{extra} = \frac{exp(\hat a + {\hat b}_{50-55})}{1 + exp(\hat a + {\hat b}_{50-55})} \times [exp(\hat \alpha + {\hat \beta}_{50-55} + {\gamma}_{diabetes}) - exp(\hat \alpha + {\hat \beta}_{50-55})] \\ = \Phi (\hat a + {\hat b}_{50-55}) \times [exp(\hat \alpha + {\hat \beta}_{50-55} + {\gamma}_{diabetes}) - exp(\hat \alpha + {\hat \beta}_{50-55})]\end{split}$

As shown in (4.7), the effect of having a disease on predicted extra healthcare costs is nonlinear and depends on the age category.
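The prediction in (4.7) can be sketched in a few lines. All coefficient values below are hypothetical and chosen only for illustration. The first-part probability is assumed to come from a logit model, so that $\Phi$ denotes the logistic function. The function names are likewise assumptions for this example.

```python
import math


def logistic(x):
    """Logistic function exp(x) / (1 + exp(x))."""
    return math.exp(x) / (1.0 + math.exp(x))


def extra_cost(a, b_age, alpha, beta_age, gamma_disease):
    """Extra cost of a disease for one age group, following (4.7)."""
    p_positive = logistic(a + b_age)                         # first part
    cost_with = math.exp(alpha + beta_age + gamma_disease)   # disease = 1
    cost_without = math.exp(alpha + beta_age)                # disease = 0
    return p_positive * (cost_with - cost_without)


# Hypothetical coefficients (not estimates from the project).
a, alpha, gamma_diabetes = 1.0, 6.0, 0.7
b_younger, beta_younger = 0.5, 0.2
b_older, beta_older = 1.5, 0.8

extra_younger = extra_cost(a, b_younger, alpha, beta_younger, gamma_diabetes)
extra_older = extra_cost(a, b_older, alpha, beta_older, gamma_diabetes)

print(extra_younger, extra_older)
```

Although the disease coefficient is the same in both calls, the predicted extra cost differs between the two age groups, because the disease effect is multiplicative on the cost scale and is further scaled by the age-specific probability of non-zero costs.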