Abnormality Detection in PET Imaging Using Foundation Models: A Comparative Study and Analysis

Abstract

The escalating clinical demand for positron emission tomography (PET) is accompanied by heterogeneity in scanners, tracers, and acquisition protocols, making PET data complex to interpret, especially when the lesions of interest are subtle or minute. Motivated by this challenge, we investigate abnormality detection on PET scans through the lens of large-scale foundation models and other pre-trained vision encoders, whose broad applicability has recently energised deep-learning research. We present the first systematic 2-D study that contrasts such pretrained backbones with traditional, task-specific architectures in both binary classification and semantic segmentation. For classification, embeddings are extracted from frozen encoders (DINOv2, RAD-DINO, ConvNeXt, etc.) and fed to a lightweight fully connected head, while a traditional CNN provides the baseline. For segmentation, frozen pretrained encoders, both off-the-shelf and custom-built, are grafted onto the nnU-Net pipeline, exploring information fusion at the bottleneck and across multiple feature levels. Experiments demonstrate that DINOv2-based embeddings achieve the most promising results in classification, surpassing the CNN baseline across all metrics while requiring only lightweight training once embeddings are harvested. For segmentation, plain nnU-Net remains superior, indicating that PET-specific fine-tuning of general-purpose encoders may be necessary. These findings chart the promise and current limits of foundation models for clinical PET analysis.
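The classification pipeline described above (harvest embeddings once from a frozen encoder, then train only a lightweight head) can be sketched in miniature. This is a toy NumPy illustration, not the paper's implementation: the frozen encoder is replaced by a fixed random projection standing in for a backbone such as DINOv2, the data are synthetic stand-ins for PET slices, and the head is a simple logistic-regression classifier trained by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen foundation-model encoder (e.g. DINOv2):
# a fixed random projection whose weights are never updated.
D_IN, D_EMB = 64, 16
W_frozen = rng.normal(size=(D_IN, D_EMB)) / np.sqrt(D_IN)

def encode(x):
    """Frozen encoder: maps flattened inputs to embeddings; no training."""
    return np.tanh(x @ W_frozen)

# Synthetic binary task: two separable clusters standing in for
# "normal" vs "abnormal" PET slices (illustrative only).
n = 200
X = np.concatenate([rng.normal(-1.0, 1.0, (n, D_IN)),
                    rng.normal(+1.0, 1.0, (n, D_IN))])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Embeddings are harvested once and reused for all head training.
Z = encode(X)

# Lightweight head: logistic regression via batch gradient descent.
w, b = np.zeros(D_EMB), 0.0
lr = 0.5
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigmoid predictions
    w -= lr * (Z.T @ (p - y)) / len(y)      # gradient of log-loss w.r.t. w
    b -= lr * np.mean(p - y)                # gradient w.r.t. bias

acc = np.mean(((1.0 / (1.0 + np.exp(-(Z @ w + b)))) > 0.5) == y)
print(f"train accuracy: {acc:.2f}")
```

Because the encoder is frozen, only the small head is optimised, which mirrors why the paper's classification experiments require "only lightweight training once embeddings are harvested".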

Publication
13th European Workshop on Visual Information Processing