Medical Image Retrieval Needs a New Benchmark

2025/08/27 05:35

:::info Authors:

(1) Farnaz Khun Jush, Bayer AG, Berlin, Germany (farnaz.khunjush@bayer.com);

(2) Steffen Vogler, Bayer AG, Berlin, Germany (steffen.vogler@bayer.com);

(3) Tuan Truong, Bayer AG, Berlin, Germany (tuan.truong@bayer.com);

(4) Matthias Lenga, Bayer AG, Berlin, Germany (matthias.lenga@bayer.com).

:::

Abstract and 1. Introduction

  2. Materials and Methods

    2.1 Vector Database and Indexing

    2.2 Feature Extractors

    2.3 Dataset and Pre-processing

    2.4 Search and Retrieval

    2.5 Re-ranking retrieval and evaluation

  3. Evaluation and 3.1 Search and Retrieval

    3.2 Re-ranking

  4. Discussion

    4.1 Dataset and 4.2 Re-ranking

    4.3 Embeddings

    4.4 Volume-based, Region-based and Localized Retrieval and 4.5 Localization-ratio

  5. Conclusion, Acknowledgement, and References

ABSTRACT

While content-based image retrieval (CBIR) has been extensively studied for natural images, its application to medical images presents ongoing challenges, primarily due to the 3D nature of medical images. Recent studies have shown the potential of pre-trained vision embeddings for CBIR in the context of radiology image retrieval. However, a benchmark for the retrieval of 3D volumetric medical images is still lacking, hindering the ability to objectively evaluate and compare the efficiency of proposed CBIR approaches in medical imaging. In this study, we extend previous work and establish a benchmark for region-based and localized multi-organ retrieval using the TotalSegmentator dataset (TS) with detailed multi-organ annotations. We benchmark embeddings derived from models pre-trained with supervision on medical images against embeddings derived from models pre-trained without supervision on non-medical images, for 29 coarse and 104 detailed anatomical structures at the volume and region levels. For volumetric image retrieval, we adopt a late interaction re-ranking method inspired by text matching. We compare it against the original method proposed for volume and region retrieval and achieve a retrieval recall of 1.0 for diverse anatomical regions with a wide size range. The findings and methodologies presented in this paper provide insights and benchmarks for further development and evaluation of CBIR approaches in the context of medical imaging.

1 Introduction

In the realm of computer vision, content-based image retrieval (CBIR) has been the subject of extensive research for several decades [Dubey, 2021]. CBIR systems typically utilize low-dimensional image representations stored in a database and subsequently retrieve similar images based on distance metrics or similarity measures of the image representations. Early approaches to CBIR involved manually crafting distinctive features, which led to a semantic gap, resulting in the loss of crucial image details due to the limitations of low-dimensional feature design [Dubey, 2021, Wang et al., 2022]. However, recent studies in deep learning have redirected attention towards the creation of machine-generated discriminative feature spaces, effectively addressing and bridging this semantic gap [Qayyum et al., 2017]. This shift has significantly enhanced the potential for more accurate and efficient CBIR methods [Dubey, 2021].
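The core mechanism described above can be illustrated with a minimal sketch: images are mapped to low-dimensional embeddings, stored in an index, and queried by a similarity measure. The function name, embedding dimensionality, and toy data below are illustrative assumptions, not part of the paper's implementation.

```python
import numpy as np

def cosine_topk(query: np.ndarray, index: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k database embeddings most similar to the query."""
    q = query / np.linalg.norm(query)
    db = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity to every stored image
    return np.argsort(-sims)[:k]       # highest similarity first

# Toy database: 100 images represented by 512-d embeddings (hypothetical values).
rng = np.random.default_rng(0)
database = rng.normal(size=(100, 512))
query = database[42] + 0.01 * rng.normal(size=512)  # near-duplicate of image 42

assert cosine_topk(query, database, k=1)[0] == 42
```

In practice, the embeddings would come from a pre-trained vision model and the index from a vector database rather than an in-memory array, but the retrieval step reduces to the same nearest-neighbor search.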

While natural image retrieval has been extensively researched, the application of retrieval frameworks to medical images, particularly radiology images, presents ongoing challenges. CBIR offers numerous advantages for medical images. Radiologists can utilize CBIR to search for similar cases, enabling them to review the history, reports, patient diagnoses, and prognoses, thereby enhancing their decision-making process. In real-world use-cases, we often encounter huge anonymized and unannotated datasets available from different studies or institutions where the available meta-information, such as DICOM header data, has been removed or is inconsistent. Manually searching for relevant images in such databases is extremely time-consuming. Moreover, the development of new tools and research in the medical field requires trustworthy dataset sources and therefore a reliable method for retrieving images, making CBIR an essential component in advancing computer-aided medical image analysis and diagnosis. One of the key challenges with applying standard CBIR techniques to medical images lies in the fact that algorithms developed for natural images are typically designed for 2D images, while medical images are often 3D volumes, which adds a layer of complexity to the retrieval process.

Recent studies have proposed and demonstrated the potential use of pre-trained vision embeddings for CBIR in the context of radiology image retrieval [Khun Jush et al., 2023, Abacha et al., 2023, Denner et al., 2024, Truong et al., 2023]. However, these studies have primarily focused on 2D images [Denner et al., 2024] or specific pathologies or tasks [Abacha et al., 2023, Khun Jush et al., 2023, Truong et al., 2023], overlooking the presence of multiple organs in the volumetric images, which is a critical aspect of real-world scenarios. Large multi-organ medical image datasets can be leveraged to thoroughly evaluate the efficacy of the proposed methods, enabling a more comprehensive assessment of CBIR approaches for radiology images. Despite previous efforts, there is still no established benchmark available for comparing methods for the retrieval of 3D volumetric medical images. This absence of a benchmark impedes the ability to objectively evaluate and compare the efficiency of the proposed CBIR approaches in the context of medical imaging.

Our previous work [Khun Jush et al., 2023] demonstrated the potential of utilizing pre-trained embeddings, originally trained on natural images, for various medical image retrieval tasks using the Medical Segmentation Decathlon Challenge (MSD) dataset [Antonelli et al., 2022]. The approach is outlined in Figure 1. Building upon this, the current study extends the methodology proposed in Khun Jush et al. [2023] to establish a benchmark for anatomical region-based and localized multi-organ retrieval. While the focus of Khun Jush et al. [2023] was on evaluating the feasibility of using 2D embeddings and benchmarking different aggregation strategies of 2D information for 3D medical image retrieval within the context of the single-organ MSD dataset [Antonelli et al., 2022], it was observed that single-organ labeling hinders the evaluations for images containing multiple organs. The main objective of this study is to set a benchmark for organ retrieval at the localized level, which is particularly valuable in practical scenarios, such as when users zoom in on specific regions of interest to retrieve similar images of the precise organ under examination. To achieve this, we evaluate a count-based method in regions using the TotalSegmentator dataset (TS) [Wasserthal et al., 2023]. The TS dataset, along with its detailed multi-organ annotations, is a valuable resource for medical image analysis and research. This dataset provides comprehensive annotations for 104 organs or anatomical structures, which allow us to derive fine-grained retrieval tasks and comprehensively evaluate the proposed methods.
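The count-based aggregation referenced above can be sketched as a simple voting scheme: each query slice retrieves its nearest database slices, each retrieved slice votes for the volume it belongs to, and volumes are ranked by vote count. The function name and vote weighting below are illustrative assumptions, not the exact implementation from Khun Jush et al. [2023].

```python
from collections import Counter

def count_based_volume_retrieval(per_slice_hits, top_n=3):
    """Aggregate slice-level search hits into a ranked list of volumes.

    per_slice_hits: for each query slice, the volume IDs of its nearest
    database slices (e.g. from a k-NN search over slice embeddings).
    """
    votes = Counter()
    for hits in per_slice_hits:
        votes.update(hits)             # every retrieved slice votes for its volume
    return [vol for vol, _ in votes.most_common(top_n)]

# Three query slices, each returning the source volumes of its 3 nearest slices.
hits = [["volA", "volB", "volA"],
        ["volA", "volC", "volB"],
        ["volB", "volA", "volA"]]
assert count_based_volume_retrieval(hits, top_n=2) == ["volA", "volB"]
```

A weighted variant could scale each vote by the slice similarity score; the plain count shown here is the simplest form of the idea.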

The contributions of this work are as follows:

• We benchmarked 2D embeddings pre-trained with supervision on medical images against self-supervised embeddings pre-trained on non-medical images for 3D radiology image retrieval. We utilize a count-based method that aggregates slice-similarity search results into volume-level retrieval.

• We propose evaluation schemes based on the TotalSegmentator dataset [Wasserthal et al., 2023] for 29 aggregated coarse anatomical regions and all 104 original anatomical regions. Our proposed evaluation assesses the capabilities of a 3D image search system at different levels, including a fine-grained measure related to the localization of anatomical regions.

• We adopted a late interaction re-ranking method originally proposed for text retrieval, ColBERT [Khattab and Zaharia, 2020], for volumetric image retrieval. For a 3D image query, this two-stage method generates a candidate 3D image result list utilizing a fast slice-wise similarity search and count-based aggregation. In the second stage, the full similarity information between all query and candidate slices is aggregated to determine re-ranking scores.

• We benchmarked the proposed re-ranking method against the original method proposed in Khun Jush et al. [2023] for volume, region, and localized retrieval on 29 modified coarse anatomical regions and 104 original anatomical regions from the TS dataset [Wasserthal et al., 2023].
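The second-stage scoring described in the contributions can be sketched with ColBERT-style "MaxSim" late interaction: each query slice is matched to its most similar candidate slice, and these maxima are summed into a re-ranking score. The function names and toy embeddings below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def maxsim_score(query_emb: np.ndarray, cand_emb: np.ndarray) -> float:
    """ColBERT-style late-interaction score between two volumes.

    query_emb: (m, d) L2-normalized slice embeddings of the query volume.
    cand_emb:  (n, d) L2-normalized slice embeddings of a candidate volume.
    Each query slice contributes its best match among the candidate slices.
    """
    sims = query_emb @ cand_emb.T          # (m, n) slice-to-slice cosine similarities
    return float(sims.max(axis=1).sum())   # MaxSim per query slice, then sum

def rerank(query_emb, candidates):
    """Order (name, embeddings) candidate volumes by descending score."""
    scored = [(name, maxsim_score(query_emb, emb)) for name, emb in candidates]
    return sorted(scored, key=lambda x: -x[1])

# Toy check: a candidate identical to the query volume should rank first.
rng = np.random.default_rng(1)
def unit(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

q = unit(rng.normal(size=(4, 64)))         # 4 query slices, 64-d embeddings
good = q.copy()                            # candidate matching the query exactly
other = unit(rng.normal(size=(6, 64)))     # unrelated 6-slice candidate
assert rerank(q, [("other", other), ("good", good)])[0][0] == "good"
```

Because MaxSim compares every query slice against every candidate slice, it is too expensive to run over the whole database; restricting it to the candidate list from the fast first stage keeps the cost manageable.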


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
