Algorithms to Predict Breast Cancer Stage

September 11, 2017

Cost-effectiveness and quality analysis of cancer treatment has long been a goal of health services researchers.  In particular, researchers aim to determine whether various treatments are cost-effective ways to improve longevity and quality of life.  Physicians, however, choose different treatments depending on the patient’s cancer stage.  Although most cost-effectiveness researchers want to account for cancer stage in their analyses, these data are not available in many administrative data files, such as the Medicare claims files.

To overcome this problem, recent studies have examined how to develop accurate algorithms that infer cancer stage from claims data.  A paper by Cooper et al. made an initial attempt, and a more recent paper by Smith et al. (2010) offers an alternative.  Today, I will review the Smith paper.


The initial study population consisted of 150,764 women (age ≥ 65 years) diagnosed with breast cancer between 1992 and 2002, identified through the Surveillance, Epidemiology, and End Results (SEER)-Medicare database.  From this population, beneficiaries with any of the following characteristics were excluded:

  • Unknown SEER stage history
  • In situ rather than invasive cancer
  • Lack of continuous enrollment in Medicare fee-for-service (FFS), including any Medicare Advantage coverage, from 12 months before to 9 months after diagnosis
  • Age less than 66 (excluded to ensure a complete year of prior claims history)
  • Death
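The exclusion criteria above can be sketched as a simple filter.  This is only an illustration, not the authors' code: the `Beneficiary` record and all of its field names are hypothetical, and the death criterion is left out because the time window it applies to is not stated here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Beneficiary:
    # All field names are hypothetical; SEER-Medicare files use different layouts.
    age_at_diagnosis: int
    seer_stage: Optional[str]   # None when the SEER stage is unknown
    in_situ: bool               # True for in situ (non-invasive) cancer
    ffs_months_before: int      # consecutive FFS-only months just before diagnosis
    ffs_months_after: int       # consecutive FFS-only months just after diagnosis


def is_eligible(b: Beneficiary) -> bool:
    """Apply the exclusions listed above (the death criterion is omitted)."""
    if b.seer_stage is None:
        return False            # unknown SEER stage
    if b.in_situ:
        return False            # in situ rather than invasive cancer
    if b.ffs_months_before < 12 or b.ffs_months_after < 9:
        return False            # not continuously enrolled in Medicare FFS
    if b.age_at_diagnosis < 66:
        return False            # less than a full year of prior history
    return True


cohort = [
    Beneficiary(70, "II", False, 12, 9),
    Beneficiary(65, "I", False, 12, 9),    # too young
    Beneficiary(72, None, False, 12, 9),   # unknown stage
    Beneficiary(80, "IV", False, 12, 4),   # enrollment gap after diagnosis
]
eligible = [b for b in cohort if is_eligible(b)]
```

Only the first record in this toy cohort passes every check.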

To determine the cancer stage, physicians typically use the following heuristic:

  • If a distant tumor (metastasis) is observed, the patient is stage IV.
  • Otherwise, the patient is assigned to a lower stage based on tumor size and the extent of the disease.

This spreadsheet explains the cancer stage classification according to the American Joint Committee on Cancer (AJCC).

The study relied on demographic, tumor, and treatment characteristics to identify the cancer stage.  One of the key variables in the breast cancer algorithm was axillary lymph node involvement.  This spreadsheet also lists all the covariates included in the prediction algorithm.

To test the accuracy of the algorithm, the authors relied on four metrics: sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV).  The authors calibrated the model on a baseline sample of the SEER data and tested its accuracy on a separate validation sample.
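For readers less familiar with these four metrics, here is a minimal sketch of how they could be computed for a single stage, treating "is this stage" as a yes/no prediction.  The function name `stage_metrics` and the toy stage labels are my own assumptions for illustration; they are not taken from the Smith paper.

```python
from typing import Dict, Sequence


def stage_metrics(actual: Sequence[str], predicted: Sequence[str],
                  target: str) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV for one stage versus all others."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == target and a == target:
            tp += 1   # predicted the stage, and SEER agrees
        elif p == target:
            fp += 1   # predicted the stage, but SEER disagrees
        elif a == target:
            fn += 1   # missed a patient who truly has the stage
        else:
            tn += 1   # correctly predicted "not this stage"

    def ratio(num: int, den: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # true positives among actual positives
        "specificity": ratio(tn, tn + fp),  # true negatives among actual negatives
        "ppv": ratio(tp, tp + fp),          # true positives among predicted positives
        "npv": ratio(tn, tn + fn),          # true negatives among predicted negatives
    }


# Toy data: SEER stage (treated as truth) versus claims-based prediction.
seer_stage = ["IV", "IV", "I", "II", "I"]
claims_stage = ["IV", "I", "I", "IV", "II"]
metrics = stage_metrics(seer_stage, claims_stage, target="IV")
```

In a real analysis, these would be computed on the validation sample only, so that the reported accuracy is not inflated by the data used to fit the model.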

One drawback of the Smith et al. algorithm is that it requires claims data for up to 1 year before and 1 year after the date of diagnosis.  Further, patients have to be continuously enrolled in Medicare FFS for the algorithm to work properly.  Those who join a Medicare Advantage plan are dropped from the sample.


The authors claimed the following results:

“A claims-based algorithm was utilized to predict breast cancer stage, and was particularly successful when used to identify early stage disease. These prediction equations may be applied in future studies of breast cancer patients, substantially improving the utility of claims-based studies in this group. This method may similarly be employed to develop algorithms permitting claims-based epidemiologic studies of patients with other cancers.”