(R)evolution Interviews

How do you know if innovation is working?

We need rigorous experimental studies to assess our use of artificial intelligence. We spoke about it with a researcher who is working on the AI extensions to CONSORT and SPIRIT.

Interview with Lavinia Ferrante di Ruffano

Test Evaluation Research Group, Inst. Applied Health Research, University of Birmingham

December 2019 · June 24th, 2020
Photo by Lorenzo De Simone

What are the current limits of artificial intelligence (AI) algorithms applied to patient care?

AI has a broad range of applications in patient healthcare, from patient identification all the way through to diagnosis and treatment prescription. While these algorithms have the potential to transform healthcare in a myriad of different ways (such as providing earlier or more accurate diagnosis, enabling faster and more efficient service delivery, and facilitating access to medical care), the key limit at the moment is the dearth of evidence that the use of these interventions does more good than harm to patients. This is one of the key reasons underlying the slow uptake of AI healthcare technologies around the world. The majority of AI intervention studies so far are validation studies (for example diagnostic accuracy studies), and even then few present externally validated results or compare the performance of AI with that of health-care professionals in the same patient sample [1]. In order to translate the potential of AI into clinical practice, studies are needed that evaluate patient and health service outcomes resulting from the use of an AI intervention, compared with current practice. Optimal reporting of these studies is critical to ensure that their results can be used to inform policy decisions and health technology assessments.

Do clinical trials still have a role in evaluating healthcare interventions such as AI algorithms?

Prior to their implementation in practice, all healthcare interventions must be evaluated rigorously to demonstrate that their use will do more good than harm to patient health. Randomised controlled trials provide the highest quality evidence for the effectiveness of healthcare interventions, and we do not see AI interventions as an exception to this. In the case of black-box algorithms, where the intended and unintended consequences of implementation may be unpredictable, the need for this level of evaluation will be even more critical.

“The key limit at the moment is the dearth of evidence that the use of these interventions does more good than harm to patients.”

What critical points does the current CONSORT and SPIRIT guidance miss regarding AI algorithms?

The original SPIRIT and CONSORT guidance was designed for the evaluation of therapeutic treatments (such as a drug or a surgical intervention), and so the AI extensions were conceived to identify and incorporate the additional or different challenges of evaluating AI interventions. Through discussion with all interested stakeholders, we are currently in the process of identifying all potential critical additions. However, we hypothesise that the elements which will require detailed and specific reporting include the study setting and its ability to administer a machine learning intervention in real time, the criteria for inclusion at the input-data level as well as at the participant level, the interactions between the human and the algorithm and their potential knock-on effects downstream, and the effects of adaptive machine learning technologies (which have the potential to continuously improve in performance) [2].

How will this problem be addressed by the CONSORT-AI and SPIRIT-AI steering group?

The CONSORT-AI and SPIRIT-AI steering group has designed an international project to develop AI extensions to the existing CONSORT and SPIRIT checklists and guidance documents, which will focus specifically on clinical trials in which the intervention includes a machine learning or other AI component. Using the EQUATOR (Enhancing Quality and Transparency of Health Research) Network methodological framework for guideline development [3], the extensions will be produced in four stages: initial generation of additional items, two phases of Delphi participation, and a final consensus meeting to vote on the most accepted additions. Our initiative is complementary to the efforts of others working on reporting standards, such as the TRIPOD-ML initiative of Collins and Moons (TRIPOD: Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis), which seeks to improve the reporting of machine-learning-driven predictive model development and validation [4].

Are you going to involve all the relevant stakeholders in the consensus process?

In a consensus project like the CONSORT and SPIRIT AI extensions, the integrity of the output is directly related to the breadth of stakeholders who can contribute to the project. The CONSORT-AI and SPIRIT-AI steering group has given serious and lengthy consideration to ensuring that representatives from all identified stakeholder groups, and from a range of nations, are involved in initial item generation, as well as in the Delphi stages and the final consensus meeting. We can confirm that individuals from the following stakeholder groups have already contributed or agreed to take part (listed in no particular order): patient representatives, policy-makers (government bodies, medical bodies and research institutes), regulatory bodies, medical journals, AI developers and industry, methodologists, statisticians, trialists, AI standardisation groups, clinicians from a range of specialties, AI health research institutes, computational scientists, machine learning scientists, clinical/health informatics specialists, ethicists and research funding bodies.

Why do you consider the role of medical journal editors so important?

Any guidance document will only be successful if it is visible and can be easily applied to all relevant evaluations. Medical journals, represented by their editors, therefore play a critical role in the success of reporting and methods guidelines. They do so in two ways. First, by participating in the generation and discussion of new CONSORT and SPIRIT items, journal editors allow us to incorporate the unique perspective of those who see the full breadth of submitted and published AI research and who have extensive experience in implementing existing checklists with authors. Second, medical journals play a substantive role in disseminating and publicising reporting guidelines and checklists, ensuring that authors around the globe see the checklists and requesting that submitting authors use them.

Do you expect the guidance will have an impact on the FDA regulatory process?

Regulatory bodies are important stakeholders in the evaluation of healthcare interventions, and we are engaging with several international regulatory bodies as part of the consensus process for producing the CONSORT-AI and SPIRIT-AI checklists. Changing or influencing current regulatory processes is not within the remit of this project. Instead, our central aim is to improve the reporting and design of trials used to evaluate the effectiveness of AI healthcare interventions, so that regulatory and health technology assessment bodies have access to an evidence base of sufficient quality to facilitate the introduction of effective AI interventions into healthcare.


[1] Liu X, Faes L, Kale AU, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digital Health 2019;1:e271-97.
[2] CONSORT-AI and SPIRIT-AI Steering Group. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nature Medicine 2019, Sep 24.
[3] EQUATOR Network. Reporting guidelines under development. EQUATOR Network (accessed 4 August 2019).
[4] Collins GS, Moons KGM. Reporting of artificial intelligence prediction models. Lancet 2019;393:1577-9.