Considerations for Scalable AI in Clinical Dermatology

By Jochen Weber, Project Manager in Dermatology, Memorial Sloan Kettering Cancer Center

Learning to differentiate malignancies from benign skin lesions has become a staple application for fine-tuned, off-the-shelf computer vision models. In the ISIC 2024 Skin Cancer Detection challenge hosted on Kaggle, the winning entry combined two transformer-based image architectures (EVA and EdgeNext) with a gradient-boosted-tree ensemble applied to image metadata. These recent technical advances are creating strong adoption pressure: on highly standardized images, state-of-the-art models now routinely outperform expert readers relying solely on the same data.
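As a rough illustration of this kind of late-fusion ensemble, the sketch below blends a per-lesion malignancy probability from an image model with one from a metadata model via a weighted average. The weight and probability values are invented for illustration and are not the winning entry's actual configuration.

```python
import numpy as np

def ensemble_probability(p_image: np.ndarray, p_metadata: np.ndarray,
                         w_image: float = 0.7) -> np.ndarray:
    """Blend an image-model score with a metadata-model score.

    A common late-fusion pattern (the ISIC 2024 winning entry combined
    transformer image models with boosted trees on metadata) is a simple
    weighted average of per-lesion malignancy probabilities.
    The weight w_image here is a hypothetical choice, not a tuned value.
    """
    return w_image * p_image + (1.0 - w_image) * p_metadata

# Invented per-lesion probabilities for three lesions:
p_img = np.array([0.9, 0.2, 0.55])   # e.g., averaged image-model outputs
p_meta = np.array([0.6, 0.1, 0.40])  # e.g., boosted-tree output on metadata
blended = ensemble_probability(p_img, p_meta)
```

In practice the blending weight would itself be selected on a validation split, and the blended score compared against a clinically chosen decision threshold.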


AI models have already been tested both in single-image dermoscopy assessment and in whole-body-surface triage settings, where they can be applied at the patient level (determining broad skin-type characteristics, such as atypical mole syndrome) as well as at the lesion level (determining which specific spots on a patient’s skin the clinician ought to include in a detailed examination). These findings further increase the incentives for scaling AI to widespread use.

This article highlights three often-overlooked aspects of AI deployment in dermatology clinics. Even with these aspects successfully incorporated into an AI training and inference procedure, patients may still prefer having humans weigh in on these decisions, so that all stakeholders can readily inquire into the reasons behind specific trade-offs.

Biases in existing training datasets. Given the historically high cost of collecting and ground-truth labeling images, most high-quality data available at present comes from populations whose characteristics are far from average. For instance, the images available via the ISIC Archive, the preeminent publicly available source of training data for skin cancer detection models, were contributed almost exclusively by clinics specializing in high-risk patient populations. This biases which kinds of malignancies, and which associated images, are over-represented in the training data. In addition, given the geographic locations of the contributing clinics and their hard-to-quantify patient self-selection (e.g., with respect to the representativeness of ethnicity and skin tone), it remains to be seen whether algorithms deployed in clinics can achieve fairness in accuracy across populations, which has become a critical factor in deciding how to deploy AI ethically.

Differential utility based on patient characteristics is unavailable to AI. The primary utility of diagnosing melanoma at an early rather than a late stage is to prevent death. However, for many patients it may take years, if not decades, before locally detectable, non-invasive in situ disease would ever pose a serious threat. Patients also inherently exhibit idiosyncratic traits and preferences that are difficult to capture outside of the patient-doctor relationship. Among these are family history (in the relevant categories, given a specific suspected diagnosis), comorbidities, genetic and other known risk factors, overall risk tolerance, cosmetic considerations, and the willingness to consider alternative treatment options.
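One way to see why this matters: in a standard decision-theoretic framing, the probability threshold at which a biopsy becomes worthwhile depends directly on patient-specific costs that an image-only model never sees. A minimal sketch, with invented cost values standing in for the factors listed above:

```python
def biopsy_threshold(cost_missed: float, cost_unneeded: float) -> float:
    """Probability threshold above which a biopsy maximizes expected utility.

    Standard decision-theoretic result: act when
        p * cost_missed > (1 - p) * cost_unneeded,
    i.e. when p > cost_unneeded / (cost_unneeded + cost_missed).
    The cost values are illustrative stand-ins for patient-specific
    factors such as family history, comorbidities, cosmetic concerns,
    and risk tolerance.
    """
    return cost_unneeded / (cost_unneeded + cost_missed)

# A high-risk patient (e.g., strong family history) tolerates more biopsies:
t_high_risk = biopsy_threshold(cost_missed=50.0, cost_unneeded=1.0)
# A low-risk patient weighing cosmetic harm more heavily:
t_low_risk = biopsy_threshold(cost_missed=5.0, cost_unneeded=1.0)
```

The same model output can therefore warrant opposite actions for two patients; a single fixed threshold baked into an AI pipeline erases exactly this distinction.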

Overdiagnosis and the risk of being scarred for life. The clinical literature now offers fairly strong evidence that the incidence of melanoma findings, especially melanoma in situ, has increased rapidly over the past decades without the corresponding expected decrease in mortality; see, for instance, the article by Kurtansky et al., “An Epidemiologic Analysis of Melanoma Overdiagnosis in the United States, 1975–2017.” This strongly suggests that the desire to prevent locally occurring melanoma from metastasizing has created signal detection pressures beyond what patient advocates might consider good medical practice: patients who receive a melanoma (in situ) diagnosis followed by treatment (excisional biopsy) are often left scarred and with a recommendation for follow-up skin checks for the rest of their lives. Not only does this create anxiety in “cancer survivors,” but it also places a high cost on the healthcare system. Absent much deeper insight into which locally occurring melanoma cases actually create mortality risk, and when, treating more and more patients seems a poor choice overall. Clinicians who rely primarily on image-based AI classification may feel pressure to act on these seemingly high-quality but poorly traded-off recommendations in order to avoid subsequent accusations of a missed malignancy, since overtreatment (in the absence of an unintended injury to the patient) is much harder to prove than undertreatment in individual cases.
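The base-rate arithmetic behind this pressure is easy to sketch: even a classifier with high sensitivity and specificity flags mostly benign lesions when disease prevalence in the screened population is low. The numbers below are illustrative, not estimates from the literature.

```python
def positive_predictive_value(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Bayes' rule: P(disease | positive test)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Illustrative screening scenario: 1% prevalence, strong classifier.
ppv = positive_predictive_value(sensitivity=0.95, specificity=0.90,
                                prevalence=0.01)
# With these (invented) numbers, roughly 9 out of 10 flagged
# lesions would be benign.
```

Every one of those false positives is a candidate for the biopsy, scar, and lifelong follow-up described above, which is why threshold choices deserve as much scrutiny as model accuracy.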

In summary, there is enormous potential for freeing up clinicians’ valuable time for the truly hard choices: practicing dermatology well requires genuine value judgments, trading off costs (generally paid by the community via insurance) as well as risks and benefits for the patient. These trade-offs can only be made by taking a large number of contextual variables into account. Blindly trusting AI recommendations can turn decision making into the healthcare equivalent of the paperclip maximizer, offering superficial efficiency gains paid for with unintended consequences. As more and more patients gain access to AI-augmented clinical diagnosis and treatment options, in itself an excellent development, great care will have to be taken to prevent good intentions from turning into widespread harm to patients who may never know whether treatment truly improved their lives.