The Hard Truth about Artificial Intelligence in Healthcare: Clinical Effectiveness is Everything, not Flashy Tech
Lessons from HeartFlow’s Incursion into Cardiovascular Imaging
Introduction
Building machine learning/artificial intelligence medical devices (MAMDs) is much like bringing a new drug to market. First, both must be developed “in the lab.” Then, rigorous testing for efficacy and safety is conducted. Finally, physicians must be convinced to use the product and payers to reimburse it. Along this path, the vast majority of ostensibly promising drugs fail because the bar for commercial success is set high. This bar is no different for MAMDs. Yet, in the public discourse, too much weight is given to the technological sophistication of MAMDs instead of what is needed for successful implementation. Healthcare is complicated, and product development decisions should be tailored to real-world settings rather than built around existing algorithms designed for different use cases. This shoehorning is perhaps one reason why MAMD success stories are few and far between. To bridge the gap, what remains lacking is the application of evidence-based medicine concepts and clinical perspectives to the deployment of MAMDs. Without them, we may never see meaningful changes in medical practice.
In this piece, we discuss what it takes to implement MAMDs into clinical practice with a case study of HeartFlow, a Bay Area, Calif.-based company, and their artificial intelligence (AI) cardiac imaging product “fractional flow reserve derived from computed tomography” (FFRCT).[i] FFRCT is intended to improve non-invasive screening for stable suspected coronary artery disease (CAD) by using AI to interpret images gathered by computed tomography. HeartFlow has sponsored high-quality evidence generation on diagnostic accuracy, clinical endpoints, and cost-effectiveness. The product has been integrated into current clinical guidelines for CAD and is reimbursed by Medicare.
Despite the growing acceptance of FFRCT into the US healthcare paradigm, we find that the current evidence does not support widespread adoption of FFRCT on top of CCTA for patients with stable suspected CAD. Diagnostic evidence fails to show the superiority of FFRCT to other imaging methods, while clinical evidence does not clearly demonstrate a direct benefit of adding FFRCT to clinical practice, in part due to trial design limitations. We believe that further work is needed to identify an optimal subpopulation for FFRCT based on additional clinical and cost-effectiveness studies.
There are two broad takeaways for MAMDs in general. First, it is crucial to precisely define who will be best served by the device. This requires careful consideration of the technical limitations around data acquisition in the clinic and the preferred place of the device in the medical treatment pipeline. Second, for the targeted subpopulation of patients, there should be randomized, well-controlled trials that definitively establish the impact of the MAMD. In contrast, large, pragmatic trials may fail to give clear answers.
The remainder of this commentary is organized as follows. We begin by providing background on the screening and treatment paradigm for stable CAD. Next, we briefly review the relevant clinical and diagnostic evidence. Then, we analyze and critically appraise the evidence surrounding FFRCT. Finally, we conclude by applying the principles discussed in the appraisal of FFRCT to the broader context of MAMDs. For readers who would like a deeper dive into this topic, a more detailed review and commentary of the data is available in the appendix.
Background
In the U.S., over 8.7 million patients per year are screened for CAD, a potentially life-threatening condition.[ii] For patients with predictably occurring chest pain (stable angina), cardiologists will opt for non-invasive screening to determine the presence of a blockage of the arteries leading to the heart (myocardial ischemia). The best method to find blockages involves inserting a catheter to measure blood flow, but this is an invasive and potentially risky procedure. Therefore, non-invasive screening is first employed to identify probable cases of CAD. Depending on the findings, physicians may: (1) monitor the patient, (2) pursue optimal medical treatment with a drug regimen, (3) confirm the non-invasive result using invasive catheterization, and/or (4) pursue a revascularization procedure such as a percutaneous coronary intervention (PCI).
There are several options for non-invasive screening. Traditionally, cardiologists have utilized exercise stress testing. Recently, however, heart vasculature imaging technology has risen in popularity due to improvements in overall diagnostic accuracy. One such imaging technology is coronary computed tomography angiography (CCTA), a diagnostic test that produces detailed 3D images of the arteries. FFRCT builds upon CCTA screening by estimating the flow of blood through a vessel (i.e., fractional flow reserve (FFR)) using a physiologic simulation technique that models coronary flow from a high-quality CCTA image. The value proposition of FFRCT is as follows: for the extra cost of ordering HeartFlow to analyze the CCTA off-site, the clinician is provided with more complete and accurate information for future actions. Alternative imaging techniques include positron emission tomography (PET) and single-photon emission computed tomography (SPECT). However, CCTA must be ordered for FFRCT to be used.
Clinical Evidence for CCTA and FFRCT
Clinical evidence is the bedrock of proving that any proposed change in how we treat patients is worth making; what may work in theory may not work in practice. In the case of screening, the main burden of proof is two-fold: 1) the proposed method is more accurate than alternatives, and 2) the information provided helps physicians make better, or at least more cost-effective, decisions for a patient compared to standard of care. Diagnostic performance data answer the first while well-controlled clinical trials answer the second. This distinction is crucial and often overlooked in the case of MAMDs. Excellent predictive performance does not imply clinical effectiveness, which matters the most.
We will first review early clinical evidence around CCTA without FFRCT. Then, we will move to two recent trials that have integrated FFRCT into the clinical workflow. Lastly, we will cover relevant diagnostic performance evidence of FFRCT. While this evidence will be primarily used to support our opinion of FFRCT, it additionally provides useful examples of how MAMDs are studied and the possible shortcomings of different study designs.
Recall that to apply FFRCT, one must be screened with CCTA first. So, does CCTA already meaningfully change clinical outcomes by itself? Two pragmatic trials, “Prospective Multicenter Imaging Study for Evaluation of Chest Pain” (PROMISE) and “Scottish Computed Tomography of the Heart” (SCOT-HEART), attempted to answer this question.[iii], [iv] A pragmatic trial is designed to mimic real-world medical practice with heterogeneous populations. PROMISE failed to find a difference in clinical outcomes after two years between CCTA and functional testing methodologies. SCOT-HEART, which ran for five years, found a meaningful difference in non-fatal myocardial infarction (MI) but not in cardiovascular or overall mortality. Notably, there were no significant differences between the groups in invasive catheterizations, pharmaceutical management, or revascularization procedures. This, in turn, left some experts unsure regarding the mechanism by which CCTA was influencing treatment decisions.[v]
For CCTA, it remains unclear whether it should be pursued over alternatives. Perhaps adding FFRCT on top of CCTA would demonstrate a clearer benefit? For this question, we focus on two pragmatic trials: “Fractional Flow Reserve-Derived from Computed Tomography Coronary Angiography in the Assessment and Management of Stable Chest Pain” (FORECAST) and “Prospective Randomized Trial of the Optimal Evaluation of Cardiac Symptoms and Revascularization” (PRECISE).[vi], [vii] PRECISE is regarded by HeartFlow to be their strongest source of evidence surrounding FFRCT.[viii]
It is important to point out a particularly salient limitation of both trials: not every patient in the experimental arm received FFRCT. Rather, the decision to order FFRCT was at the discretion of the physician and likely based on the CCTA result. This introduces confounding by indication and limits the direct causal claim we can make surrounding FFRCT. Another limitation arises when comparing FORECAST and PRECISE: PRECISE did not allow physicians in the control arm to use CCTA while FORECAST did. This is a problem for interpreting the results of PRECISE because we will again have trouble isolating the direct contribution of FFRCT to outcomes. The last limitation we will mention is that the physician in the control arm can select the screening method, which for practical reasons may not be the most accurate one. Thus, we risk making “unfair” comparisons across experimental arms. We note that these sorts of limitations are present in most pragmatic trials.
In the UK-based FORECAST trial, after nine months, there was no detectable difference in clinical outcomes, but catheterization utilization was lower. Average costs were slightly higher for FFRCT patients, but this was not statistically significant. For the PRECISE trial, the one-year results were similar to FORECAST: there was no significant difference found for non-fatal MI (though the rate in the experimental arm was slightly higher) or death. There were positive findings for the endpoint of invasive catheterization without obstructive CAD (i.e., there were fewer false positives from initial screening in the CCTA + FFRCT arm). As we argue in the appendix, however, this endpoint is not very informative about the population because it is driven by false positives, which leaves us unable to draw conclusions about the false negatives. Furthermore, for both FORECAST and PRECISE, it is far too early to determine whether hard clinical outcomes differ; recall that SCOT-HEART ran for five years.
In terms of diagnostic performance, we focus on the results of the study highlighted on HeartFlow’s webpage for FFRCT.[ix] The study took 208 patients from an active clinical trial on suspected stable CAD and ran CCTA, CCTA + FFRCT, PET, and SPECT.[x] Patients were classified as having significant ischemia or not. Then, for ground truth, the investigators classified patients via an invasive catheterization. On their website, HeartFlow reports the per-vessel classification results on a per-protocol (PP) basis, for which FFRCT is superior to other techniques in terms of area under the receiver operating characteristic curve (AUC). We argue that the intention to diagnose (ITD) results are more pertinent because they account for the 25% of patients that were unable to be screened by FFRCT. For the ITD results, the AUC failed to show superiority compared to CCTA and PET. In the per-patient results, PET’s AUC was in fact superior. We conjecture that the 25% rejection rate is a lower bound for the real-world rejection rate due to differing patient characteristics and overall imaging quality of screening centers.
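To make the difference between the two analyses concrete, the following is a minimal sketch in Python and not the method used in the cited study. It uses sensitivity and specificity as simpler stand-ins for AUC, and it assumes one conservative convention: every scan that cannot be analyzed is counted as a wrong answer in the intention-to-diagnose analysis. The cohort counts are hypothetical and chosen only so that 25% of scans are rejected; the names `Counts` and `intention_to_diagnose` are ours.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts against the invasive reference standard."""
    tp: int  # ischemia present, test positive
    fn: int  # ischemia present, test negative
    tn: int  # ischemia absent, test negative
    fp: int  # ischemia absent, test positive


def sensitivity(c: Counts) -> float:
    """Share of patients with ischemia that the test flags."""
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    """Share of patients without ischemia that the test clears."""
    return c.tn / (c.tn + c.fp)


def intention_to_diagnose(pp: Counts, rejected_diseased: int,
                          rejected_healthy: int) -> Counts:
    """Fold non-evaluable scans back into the per-protocol counts.

    Assumed convention (not taken from the cited study): a rejected scan
    counts as a miss for a diseased patient and as a false alarm for a
    healthy patient.
    """
    return Counts(
        tp=pp.tp,
        fn=pp.fn + rejected_diseased,
        tn=pp.tn,
        fp=pp.fp + rejected_healthy,
    )


# Hypothetical cohort: 100 patients with ischemia and 100 without,
# with 25 scans in each group rejected for image quality (25% overall).
per_protocol = Counts(tp=66, fn=9, tn=60, fp=15)
itd = intention_to_diagnose(per_protocol, rejected_diseased=25,
                            rejected_healthy=25)

for label, counts in (("PP", per_protocol), ("ITD", itd)):
    print(f"{label}: sensitivity={sensitivity(counts):.2f}, "
          f"specificity={specificity(counts):.2f}")
```

Under these assumed numbers, a test that looks strong among evaluable scans looks considerably weaker once rejected scans are counted, which is why we weight the ITD results more heavily.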
Analysis of HeartFlow’s FFRCT
The findings from the available evidence leave us with a relatively neutral and unclear view of FFRCT. The results from the PRECISE and FORECAST trials fail to show a meaningful difference in the hard clinical outcomes of non-fatal MI and death as a result of integrating FFRCT into medical practice. For cost-effectiveness, FORECAST failed to show a difference in total costs while PRECISE showed that there were fewer false positive invasive catheterizations. Nevertheless, both trials are designed such that we cannot directly measure the effect of FFRCT, and the PRECISE trial unfairly restricted the control arm from using CCTA. In terms of diagnostic accuracy, FFRCT seems to be on par with alternatives, but with much higher rejection rates.
Taking a step back and reading the tea leaves, we believe that FFRCT could provide both clinical usefulness and cost-effectiveness in a more focused market. To define this market, there is a set of sequential filters that the patient must pass through: (1) the patient must first be eligible for CCTA, (2) an FFR measurement must be desired for future medical management, and (3) the image must be of high enough quality to accurately measure this non-invasively.
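The effect of these sequential filters on market size can be sketched with simple arithmetic. The Python snippet below is illustrative only. The starting figure of 8.7 million screened patients per year and the 25% rejection rate come from the sources cited in this piece; the pass rates for the first two filters are hypothetical placeholders, not estimates, and the function name `apply_filters` is ours.

```python
def apply_filters(starting_population, filters):
    """Apply sequential pass rates and return the population after each stage."""
    remaining = float(starting_population)
    stages = []
    for label, pass_rate in filters:
        if not 0.0 <= pass_rate <= 1.0:
            raise ValueError(f"pass rate for {label!r} must be in [0, 1]")
        remaining *= pass_rate
        stages.append((label, remaining))
    return stages


# Over 8.7 million patients per year are screened for CAD in the U.S.
SCREENED_PER_YEAR = 8_700_000

filters = [
    ("eligible for CCTA", 0.50),         # hypothetical placeholder
    ("FFR measurement desired", 0.20),   # hypothetical placeholder
    ("image quality sufficient", 0.75),  # 1 minus the 25% rejection rate
]

stages = apply_filters(SCREENED_PER_YEAR, filters)
for label, patients in stages:
    print(f"{label}: {patients:,.0f} patients per year")
```

Because the filters multiply, even moderately selective criteria at each step leave a small fraction of the screened population, which is the sense in which we describe the market as more focused.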
FFRCT is likely best employed in a situation where we have first viewed the CCTA results, initially treated the patient, and then can gauge the marginal utility of FFRCT. The gap in diagnostic performance between the per-protocol (PP) and ITD per-patient results, which is driven by rejection rates, would support this approach. This post-hoc use of FFRCT is supported both by the authors of the FORECAST trial and the accompanying editorial.[xi] The proposed subpopulation would only include “moderate” risk patients, which is implied by the PRECISE trial stratifying patients before screening in the experimental arm. We would first treat these patients with a conservative, pharmaceutical strategy and only request FFRCT further down the road if revascularization is considered, as the FORECAST editorial authors suggest.
The potential contingency of FFRCT on revascularization would further shrink the market: a growing number of physicians may never even consider revascularization due to the results of the landmark International Study of Comparative Health Effectiveness with Medical and Invasive Approaches (ISCHEMIA) trial.[xii] ISCHEMIA failed to find a difference after five years between the aforementioned conservative strategy and immediate invasive treatment. Importantly, only 20% of patients in the conservative arm went on to undergo revascularization, which means potentially less use of FFRCT.
On the other hand, FFRCT may make CCTA more attractive because if we choose PET imaging, for example, we will not have the option to apply FFRCT down the road. It may be useful for HeartFlow to explore whether FFRCT can be applied to other imaging techniques to decrease the reliance on CCTA. Much more work is likely needed to accurately identify the optimal subpopulation for FFRCT based on severity, technical feasibility, and intended treatment options.
When the targeted population changes, clinical and cost-effectiveness evidence must be updated, much like in the case of an expanded indication for a drug. The current evidence suggests there may be greater cost-effectiveness in the moderate-severity population who have previously undergone CCTA and for whom medical management has failed, but it is far from confirmatory. Despite the limitations of the trials, the positive results in PRECISE regarding invasive catheterization and the equivalency of cost-effectiveness in FORECAST are encouraging. With proper selection of patients and trial design, we could potentially see improvements on this front.
A possible trial to address both issues would be to recruit ideal patients for FFRCT and screen the patients with CCTA. Then, one group would all receive FFRCT while another group would continue with standard of care based on only CCTA. We may wish to also add an arm with PET scan, the best-performing comparator to FFRCT. Using this framework, investigators can iterate upon the optimal use case for FFRCT and observe how clinical outcomes may improve as a result. On this note, HeartFlow has integrated FFRCT into a larger software suite called RoadMap to manage CAD patients.[xiii] Perhaps it would be more productive to test the efficacy of the whole platform compared to standard of care.
Broader Lessons for AI in Healthcare
There are three important takeaways FFRCT provides to those wishing to build and integrate MAMDs into medical practice. First, it is important to ensure the data to be analyzed is reasonably attainable in practice. MAMD developers should clearly state which patients and healthcare centers can realistically use their device. This additionally lays out clear next steps for iteration of the product. In our example, access to FFRCT depends on whether the centers are equipped with the appropriate technical specifications to consistently produce high-quality CCTA images. This raises the question: how much investment by the health system is needed to ensure the smooth performance of FFRCT compared to functional testing? Physicians must be open to their patients receiving increased radiation exposure as a result of CCTA compared to alternatives even when the screening may fail. In which scenarios is it worth it for patients to undergo such radiation exposure?
Second, to clearly demonstrate clinical benefit, the MAMD must be examined in medical practice using randomized studies. Useful endpoints include not only clinical outcomes but also cost-effectiveness, utilization, and burden on a health system. This way, even if there is no improvement in medical outcomes, benefits in cost-effectiveness and clinical efficiency could be demonstrated. In planning the study, investigators should keep in mind that an a priori and concretely defined clinical use case maximizes the chance for positive trial outcomes and provides a clear direction for product development and adoption. A well-controlled trial on a focused population is far preferable from a product standpoint to a large, heterogeneous pragmatic trial where it can be difficult to ascribe causality to the MAMD. Critically, HeartFlow’s evidence for FFRCT falls short in this sense. As the FORECAST editorial states: “the FORECAST trial does not address the critical question of when exactly FFRCT should be utilized in the clinical care pathway”; PRECISE faces the same limitation.
Third, current pricing models must be adapted for the MAMD. HeartFlow has chosen a per-use price model that is reimbursed by the insurer. Whether this is the best pricing model for what is essentially a piece of software is unclear. For example, what should happen if FFRCT fails to screen? As opposed to a per-use model reimbursed by the payer, perhaps a software subscription model for screening centers would be more appropriate. MAMDs may be best encapsulated in an overarching platform that is sold directly to medical centers on the premise that they improve efficiency and save costs. The role of the payer is unclear in the subscription scenario, and MAMDs will almost surely necessitate the development of new policies.
HeartFlow reveals a hard truth about MAMDs: product-market fit is more about ease of integration into a current clinical workflow than the complexity of the underlying technology. Product decisions should thus be weighted towards practicality and ease of demonstrating effectiveness. This involves an alignment of interests and incentives between data scientists, statisticians, clinicians, payers, patients, and health policymakers that we have had difficulty executing up to this point. Unfortunately, this may mean truly “futuristic” advancements in healthcare may be far away. Nevertheless, this is what is required for the dominance of AI in large healthcare markets.
Appendix
Disclaimer
This piece is an analysis of the company HeartFlow based on my interpretation of publicly available peer-reviewed information. This analysis is intended solely for educational purposes and should not be considered as medical advice. There is no affiliation with HeartFlow, and there are no conflicts of interest to declare.
[i] https://www.heartflow.com/heartflow-ffrct-analysis/
[ii] https://pubmed.ncbi.nlm.nih.gov/29386200/
[iii] https://www.nejm.org/doi/full/10.1056/nejmoa1415516
[iv] https://www.nejm.org/doi/full/10.1056/NEJMoa1805971
[v] https://www.nejm.org/doi/full/10.1056/NEJMe1809203
[vi] https://pubmed.ncbi.nlm.nih.gov/34269376/
[vii] https://jamanetwork.com/journals/jamacardiology/fullarticle/2808765
[viii] https://www.heartflow.com/newsroom/late-breaking-aha-acc-guideline-directed-ccta-ffrct-precise-trial/
[ix] https://www.heartflow.com/clinical-evidence/
[x] https://www.sciencedirect.com/science/article/pii/S0735109718391381?via%3Dihub
[xi] https://academic.oup.com/eurheartj/article/42/37/3853/6352228