SO-DRCNN WITH TERNION PARADIGM EXTRACTION ROUTINE FOR AN EFFECTIVE IMAGE RETRIEVAL SYSTEM
by
Akram Kazemisisi
B.Sc., University of Science and Engineering of Tehran, 2012
THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE
UNIVERSITY OF NORTHERN BRITISH COLUMBIA
June 2025
© Akram Kazemisisi, 2025

Abstract

This thesis addresses the challenges of semantic image retrieval and labeled data scarcity in Content-Based Image Retrieval (CBIR) by introducing SO-DRCNN, a novel Self-Optimizing DeepRec Convolutional Neural Network framework. SO-DRCNN leverages a hybrid approach, combining the strengths of handcrafted features (Ternion Paradigm: HOG, ICH, SERC) and deep learning. A pre-trained ResNet-50 backbone, enhanced with Recurrent Patching (Bi-LSTM), Spatial Pyramid Pooling (SPP/ASPP), and Attention mechanisms, extracts high-level semantic features. A key innovation is the Siamese-Driven Feature Fusion, where a Siamese network, trained with a contrastive loss, learns to adaptively combine handcrafted and deep features, optimizing the fused representation for similarity. This self-supervised training strategy (Auto-Embedder) eliminates the need for manual image labels. Experiments on benchmark datasets demonstrate that SO-DRCNN achieves state-of-the-art retrieval accuracy, outperforming traditional methods and demonstrating the effectiveness of the learned fusion strategy. The system is also integrated with Elasticsearch for scalable retrieval. This work contributes a robust, efficient, and interpretable solution for semantic CBIR.

Table of Contents
Abstract ............................ ii Table of Contents ............................ iii List of Tables ............................ vii List of Figures ............................ viii Glossary ............................ x Acknowledgement ............................ xvii CHAPTER 1 ............................ 18 INTRODUCTION ............................ 18 1.1. Motivation ............................ 18 1.2. Statement of Purpose ............................ 20 1.3. Research Objectives ............................ 20 1.4. Organization of Thesis ............................ 22 CHAPTER 2 ............................ 25 BACKGROUND ............................ 25 2.1.
Image Processing .................................................................................................... 25 2.1.1. Image Processing Techniques 27 2.1.2. Relationship Between Image Processing and Image Retrieval 34 2.2. Retrieval Strategies ................................................................................................. 38 2.2.1. Text-Based Image Retrieval (TBIR) 38 2.2.1.1. Challenges and Limitations 38 2.2.1.2. Future Directions 39 2.2.2. Content-Based Image Retrieval 41 2.2.2.1. Challenges and Limitations 42 2.2.2.2. Future Directions 43 2.3. IR Techniques ......................................................................................................... 45 iii 2.3.1. TBIR 2.3.1.1 Advanced TBIR Methodologies 2.3.1.2 Application Examples 2.3.2. CBIR 2.3.2.1 Advanced CBIR Methodologies 2.3.2.2 Application Examples 45 45 50 53 53 57 CHAPTER 3 ................................................................................................................... 64 LITERATURE SURVEY .............................................................................................. 64 3.1. CBIR Evolution ....................................................................................................... 64 3.2. The Paradigm Shift Toward Learned Features ....................................................... 66 3.2.1. Image Similarity Measures Used in CBIR 67 3.2.2. Important Points Descriptors In CBIR Frameworks 69 3.2.3. Distance Metric Utilized In CBIR System 72 3.2.3.3 CBIR with Relevance Feedback 73 3.2.3.4. Color-Based Features in CBIR 74 3.2.3.5. Image Retrieval using Transformed Image Content 76 3.2.3.6. Image Acquisition Employing Textured Information: 78 3.2.3.7. Image Retrieval using Shape Content 80 3.2.3.8. Metric Learning: 82 3.2.3.9. Current state-of-the-art CBIR techniques: 86 3.2.3.10. Comparative Analysis of Methods and Techniques in CBIR Systems 88 3.2.4. Multimodal Fusion in Image Retrieval (MFIR): 91 3.2.5. Semantic-Based Image Retrieval (SIBR): 92 3.3. Summary ................................................................................................................. 93 CHAPTER 4 ................................................................................................................... 94 METHODOLOGY ......................................................................................................... 94 4.1. Introduction ............................................................................................................. 94 4.1.1. Research Design and Experimental Approach 96 4.1.2. System Architecture 97 4.2. Proposed CBIR Pipeline .......................................................................................... 99 4.2.1. Ternion Paradigm Feature Extraction Routine 102 4.3. Bag-of-Visual-Words Framework for Image Representation ............................... 103 4.4. SO-DRCNN Model ............................................................................................... 107 iv 4.4.1. Utilize Pre-trained ResNet-50 CNN Backbone 4.4.2. Enhancing ResNet-50 Features for Contextual Semantic Understanding 4.4.3. Siamese-Driven Feature Fusion 4.4.4. Self-Optimizing Fusion Module in SO-DRCNN 4.4.5. Weight Optimization - Adjusting Network Parameters to Minimize Loss 107 108 112 118 120 4.5. Feature Extraction for the CBIR Database - Preparing the Index ......................... 122 4.6. Indexing and Retrieval Implementation ................................................................ 122 4.6.1. 
Elasticsearch Setup 123 4.6.2. Indexing Procedure 124 4.7. Evaluation Methodology ....................................................................................... 124 4.7.1 Evaluation Metrics 125 4.7.2 Comparative Experiments 125 4.7.3 Results Analysis and Impracticality Considerations 126 4.8. Querying Process and Retrieval at Runtime ......................................................... 127 4.9. Data Collection and Analysis ................................................................................ 128 4.9.1. Unlabeled Data for Self-Supervised Training: 129 4.9.2. Labeled Data for Evaluation and (Optional) Semi-Supervised Enhancement: 130 4.9.3. Data Augmentation for Self-Supervised Pair Generation: 131 4.9.4. Analysis of Results and Trends: 131 4.10. Implementation .................................................................................................... 134 Appendix A: Detailed Algorithmic Description of SERC 135 Appendix B: Detailed Algorithmic Description of BoVW Pipeline 146 Appendix C: Detailed Implementation of SO-DRCNN and Siamese Training 154 CHAPTER 5 ................................................................................................................. 165 FINDINGS ................................................................................................................... 165 5.1. Performance metrics .............................................................................................. 165 5.2. Comparative Analysis of SO-DRCNN, CLIP, and DINO .................................... 167 5.2.1 Comparative Summary and Future Directions 171 5.3. Data Analysis and Examples ................................................................................. 173 5.3. Key Findings ......................................................................................................... 199 5.4. Summary ............................................................................................................... 200 v CHAPTER 6 ................................................................................................................. 202 CONCLUSION ............................................................................................................ 202 6.1. Key Findings and Results 203 6.2. Future Work 204 6.3. Implications 205 REFERENCES ............................................................................................................. 
207 vi List of Tables Table 1: Performance Comparison 145 Table 2: Aggregate performance 145 Table 3: Overview of SO-DRCNN (Adapted for CBIR), CLIP, and DINO 172 Table 4: Comparative Analysis of Accuracy With HOG+ICH 183 Table 5: Comparative Analysis of Accuracy With HOG+SERC 184 Table 6: Comparative analysis of accuracy with HOG+ICH+SERC 185 Table 7: Comparative Analysis of Precision With HOG+ICH 188 Table 8: Comparative Analysis of Precision With HOG+SERC 189 Table 9: Comparative Analysis of Precision With HOG+ICH+SERC 191 Table 10: Testing And Training Accuracy Analysis 192 Table 11: Training And Testing Precision Analysis 194 Table 12: MAP analysis 196 Table 13: Performance Analysis of Proposed Work 197 vii List of Figures Figure 1- Common CBIR approach 41 Figure 2 - CBIR Evolution 65 Figure 3 - Taxonomy of Distance Metrics 83 Figure 4 - SBIR framework 92 Figure 5 - Research Design Phase 97 Figure 6 - CBIR Architecture 99 Figure 7 - CBIR Pipeline 102 Figure 8 - Recurrent Patching Module 110 Figure 9 - Spatial Pyramid 111 Figure 10 - SERC Training Phase 136 Figure 11 - Keypoint Detection and Refinement 139 Figure 12 - Multi Directional Edge Extraction 139 Figure 13 - Spatial Grid Partitioning and PCA-based dimensionality reduction 141 Figure 14 - Binary Test Correlation Matrix 144 Figure 15 - Bi-LSTM 156 Figure 16 - Input image 1-Preprocessing 174 Figure 17 - Input image 2-Preprocessing 174 Figure 18 - Input image 3- pre-processing 174 Figure 19 - Input image 4-Preprocessing 175 Figure 20 - Input Image 5-Preprocessing 175 Figure 21 - Input Image 6-Preprocessing 175 Figure 22 - Input Image 7-Preprocessing 175 Figure 23 - Input Image 8-Preprocessing 176 Figure 24 - Input Image 9-Preprocessing 176 viii Figure 25 - Input Image 10-Preprocessing 176 Figure 26 - Retrieval Image Belongs To Food And Drinks 177 Figure 27 - Retrieval Image of Art And Culture Belonging 178 Figure 28 - Retrieval Of Travel And Adventure Belonging Image 178 Figure 29 - Retrieval Of Travel And Adventure Belonging Image 179 Figure 30 - Retrieval Of Travel And Adventure Belonging Image 179 Figure 31 - 1st Iterated Histogram 180 Figure 32 - 2nd iterated histogram 181 Figure 33 - 3rd Iterated Histogram 181 Figure 34 - 4th Iterated Histogram 182 Figure 35 - 5th Iterated Histogram 182 Figure 36 - Illustration Accuracy of HOG+ICH Features 184 Figure 37 - Illustration Accuracy Of HOG+SERC Features 185 Figure 38 - Illustration Accuracy of HOG+ICH+SERC Features 186 Figure 39 - Illustration Of Precision With HOG+ICH Features 188 Figure 40 - Illustration Of Precision With HOG+SERC Features 190 Figure 41 - Illustration of precision with HOG+ICH +SERC features 190 Figure 42 - Comparative Analysis Of Training Accuracy 191 Figure 43 - Comparative Analysis of Testing Accuracy 193 Figure 44 - A Comparative Analysis Of Training Precision Score 194 Figure 45 - Comparative Analysis Of The Testing Precision Score 195 Figure 46 - Comparative Analysis Of MAP 196 Figure 47 - Proposed TP, TN, FP And FN Validation 198 Figure 48 - Proposed Sensitivity, Specificity, Precision, Recall, Accuracy And FMeasure Validation 198 ix Glossary 1- Adaptive Histogram Equalization (AHE): A contrast enhancement technique that improves local image contrast by applying histogram equalization to localized regions, rather than the entire image. Mentioned in the context of image preprocessing techniques. 
2- Atrous Spatial Pyramid Pooling: A module used in deep convolutional neural networks (CNNs) that captures multi-scale contextual information by applying convolutions with different dilation rates (atrous convolutions) to feature maps. Part of the SO-DRCNN architecture to enhance feature representation.
3- Auto-Embedder Architecture: A self-supervised learning framework, based on a Siamese network, that trains a model to generate embeddings optimized for similarity comparisons. In this thesis, it refers to the Siamese network used to train the Fusion Module for feature fusion, enabling data-driven weight learning.
4- Bag-of-Visual-Words (BoVW): A technique, adapted from text retrieval, that represents images as histograms of visual word occurrences. Local image features (extracted using ORB and Ternion descriptors) are quantized into a "visual vocabulary," and each image is represented by the frequency of each visual word. Used in this thesis for handcrafted feature extraction.
5- Bidirectional Long Short-Term Memory (Bi-LSTM): A type of recurrent neural network (RNN) that processes sequential data (like image patches in SO-DRCNN) in both forward and backward directions, capturing contextual dependencies from both preceding and succeeding elements in the sequence. Used in the Recurrent Patching Module of SO-DRCNN.
6- CBIR (Content-Based Image Retrieval): A technique for retrieving images from a database based on their visual content (features extracted from the images themselves), rather than relying on textual annotations or metadata. This is the core task addressed by this thesis.
7- CNN Embedding: A feature vector representing an image, extracted from a Convolutional Neural Network (CNN). In this thesis, the CNN embedding is generated by the SO-DRCNN model.
8- Contrastive Loss: A loss function used in self-supervised learning (particularly with Siamese networks) that encourages similar inputs to have close embeddings and dissimilar inputs to have distant embeddings. This is the core training mechanism for the Siamese Network in this thesis, used to train the Fusion Module.
9- Contrastive Language-Image Pre-training (CLIP): A specific, multi-modal model trained with contrastive learning to align visual and textual representations. Mentioned as an inspiration for the self-supervised training approach, but not directly used in the methodology.
10- Convolutional Neural Network (CNN): A type of deep neural network that is particularly effective for processing images. CNNs use convolutional layers to automatically learn hierarchical features from raw pixel data. ResNet-50 is the CNN backbone used in SO-DRCNN.
11- Davies-Bouldin Index (DBI): A metric used to evaluate the quality of clustering algorithms (like k-means used for BoVW vocabulary construction). Lower DBI values indicate better clustering, with compact and well-separated clusters.
12- Deep Convolutional Neural Network (DCNN/DRCNN): A CNN with many layers, enabling the learning of complex, hierarchical features. SO-DRCNN is a specific type of DCNN used in this thesis.
13- DeepRec Convolutional Neural Network (DRCNN): Refers to the deep network architecture used in SO-DRCNN, incorporating a recurrent component and spatial pyramid modules.
14- Dimensionality Reduction: The process of reducing the number of dimensions (features) in a dataset while preserving important information. PCA is used for dimensionality reduction in this thesis.
15- Discrete Wavelet Transform (DWT): A signal processing technique that decomposes an image into different frequency sub-bands. Mentioned in the literature review, but not directly used in the core methodology.
16- Elasticsearch: A distributed search and analytics engine used in this thesis for efficient indexing and retrieval of image feature vectors. It enables fast similarity search on high-dimensional data.
17- Embedding Layer: The final fully connected layer in the SO-DRCNN architecture that outputs the visual embedding vector. It transforms the fused and processed features into a compact representation suitable for CBIR.
18- Embedding Space: A vector space where images are represented by their feature vectors (embeddings). In this thesis, the goal is to learn an embedding space where distance corresponds to semantic similarity.
19- Euclidean Distance: A common metric for measuring the distance between two vectors in a multi-dimensional space. Used in the contrastive loss function and potentially for similarity search.
20- Feature Fusion: The process of combining multiple feature representations (e.g., CNN embeddings and handcrafted features) into a single, richer feature vector. This is a core component of the proposed methodology, implemented using a Siamese-trained Fusion Module.
21- Fusion Module: A neural network module, trained within the Siamese Network, that learns to combine CNN embeddings and handcrafted features. This is the key component that performs the Siamese-Driven Feature Fusion.
22- Global Color Histogram (GCH): A feature representation that quantifies the distribution of colors in an image, disregarding spatial information. Referred to as ICH (Inclusive Color Histogram) in this thesis.
23- Handcrafted Features: Image features that are designed by humans based on domain knowledge and intuition, rather than learned automatically from data. In this thesis, BoVW histograms with Ternion descriptors (HOG, ICH, SERC) are used as handcrafted features.
24- Histogram of Oriented Gradients (HOG): A handcrafted feature descriptor that captures the distribution of gradient orientations in localized portions of an image, representing shape and texture information. Part of the Ternion descriptor set.
25- ICH (Inclusive Color Histogram): The term used in this thesis for a global color histogram, computed over the entire image, that quantifies the distribution of colors. Part of the Ternion descriptor set.
26- Keypoint: A salient and stable point in an image, often associated with corners, edges, or other distinctive local features. ORB is used to detect keypoints in this thesis.
27- Mean Average Precision (mAP): A common metric for evaluating the performance of information retrieval systems, including CBIR. It measures the average precision of retrieval results across multiple queries.
28- Metric Learning: A machine learning approach that focuses on learning a distance metric or similarity function from data. The Siamese Network with contrastive loss is a form of metric learning.
29- Multi-Probe LSH: An enhanced version of Locality Sensitive Hashing (LSH) that improves search accuracy by probing multiple hash buckets. Mentioned in the context of SERC descriptor matching.
30- Natural Language Processing (NLP): A field of computer science focused on enabling computers to understand and process human language. Mentioned in the context of Text-Based Image Retrieval (TBIR) in the literature review.
31- ORB (Oriented FAST and Rotated BRIEF): A fast and rotation-invariant feature detector and descriptor. Used in this thesis for keypoint detection in the BoVW framework.
32- Pairwise Constraints: Training signals used in self-supervised learning, consisting of pairs of images labeled as "similar" (Can-Link) or "dissimilar" (Cannot-Link). Used to train the Siamese Network with contrastive loss.
33- Patch: A small, rectangular region of an image. Used in the Recurrent Patching Module of SO-DRCNN.
34- Principal Component Analysis (PCA): A dimensionality reduction technique that finds the principal components (directions of maximum variance) in a dataset and projects the data onto these components. Used in this thesis for dimensionality reduction of feature vectors.
35- Recurrent Patching Module: A component of the SO-DRCNN architecture that processes image patches sequentially using a Bi-LSTM network to capture spatial context.
36- Region-of-Interest (ROI): A specific region within an image that is of particular interest for analysis or processing. Not directly used in the core methodology, but mentioned in the literature review.
37- ResNet-50: A deep convolutional neural network architecture (50 layers) that uses residual connections to enable effective training of very deep networks. Used as the pre-trained CNN backbone in SO-DRCNN.
38- Self-Supervised Learning: A machine learning paradigm where a model is trained on unlabeled data by generating its own supervisory signals from the data itself. The Auto-Embedder framework with Siamese Network and contrastive loss is a form of self-supervised learning.
39- SERC (Slanting Express Revolves Concise): A handcrafted feature descriptor designed to capture edges and structural patterns. Part of the Ternion descriptor set.
40- Siamese Network: A neural network architecture consisting of two (or more) identical subnetworks (twins) that share weights. Siamese networks are used to learn similarity metrics by comparing the outputs of the twins for pairs of inputs. Used in this thesis to train the Fusion Module for feature fusion.
41- Similarity Matching: The process of comparing feature vectors to find images that are similar to a query image. The core task in CBIR.
42- Smooth L1 Loss (Huber Loss): A loss function that combines the properties of L1 and L2 loss, making it more robust to outliers. Mentioned in the context of SO-DRCNN training.
43- Spatial Pyramid Pooling (SPP): A technique used in CNNs to capture multi-scale information by pooling feature maps at different spatial resolutions. Part of the SO-DRCNN architecture.
44- SO-DRCNN (Self-Optimizing DeepRec Convolutional Neural Network): The proposed deep learning architecture for CBIR, combining a pre-trained ResNet-50 backbone with Recurrent Patching, SPP/ASPP, and Attention modules, and trained using a Siamese Network with contrastive loss.
45- Ternion Paradigm: The combination of HOG, ICH, and SERC descriptors used for handcrafted feature extraction in this thesis.
46- Text-Based Image Retrieval (TBIR): A traditional approach to image retrieval that relies on textual annotations or metadata associated with images. Contrasted with CBIR in the literature review.
47- Visual Vocabulary: In the BoVW framework, a set of representative local feature patterns (cluster centroids) learned by clustering a large collection of local descriptors. Used to create BoVW histograms.
Key Terms: Content-Based Image Retrieval (CBIR), Text-Based Image Retrieval (TBIR), Self-Supervised Learning, Feature Extraction, Elasticsearch, Semantic Gap, Embedding Space, Deep Learning, Self-Optimization.
Acknowledgement
I would like to express my sincere gratitude to my supervisor, Dr. Chen, for his invaluable guidance and continued support throughout the course of this research. I am also thankful to Dr. Fan Jiang for his insightful feedback and encouragement. I extend my deepest appreciation to my family, to my beloved mother, and in loving memory of my late father, whose unwavering love, strength, and support gave me the courage to begin this journey.
Chapter 1
Introduction
1.1. Motivation
Content-Based Image Retrieval (CBIR) has become increasingly important across numerous fields, from medical imaging and architecture to crime prevention and geographic information systems. However, existing CBIR systems often face significant challenges, including limited retrieval accuracy, high computational demands, and a strong reliance on manually labeled data. A key limitation is the semantic gap: the discrepancy between low-level image features easily extracted by computers and the high-level semantic concepts that humans use to understand and judge image similarity (Smeulders et al., 2000). Furthermore, the need for extensive manual data labeling creates a bottleneck, hindering the scalability and adaptability of CBIR systems to new datasets and domains (Datta et al., 2008). This thesis addresses these challenges by proposing a novel Self-Optimizing DeepRec Convolutional Neural Network (SO-DRCNN) framework for CBIR, enhanced with Siamese-Driven Feature Fusion. Our approach makes the following key contributions:
Hybrid Feature Representation: We integrate advanced feature extraction techniques, including the Ternion Paradigm (HOG, ICH, and the novel SERC descriptor), to create a robust and multi-faceted image representation that combines both interpretable local visual cues and high-level semantic features learned by a deep CNN (LeCun et al., 2015). This hybrid approach aims to bridge the semantic gap by leveraging the strengths of both handcrafted and learned features.
Self-Supervised Learning: We employ a self-supervised Siamese network architecture (Hadsell et al., 2006), trained with a contrastive loss function (Chopra et al., 2005), to learn discriminative image embeddings without relying on manually labeled data. This addresses the labeling bottleneck and enables the system to adapt more easily to new datasets. This approach builds upon recent advancements in self-supervised representation learning (Chen et al., 2020).
Siamese-Driven Feature Fusion: We introduce a novel feature fusion strategy where a Fusion Module, trained within the Siamese network, learns to adaptively combine the handcrafted features and deep CNN embeddings, optimizing the fused representation for semantic similarity. This data-driven fusion approach goes beyond simple concatenation or fixed-weight combinations, allowing for a more nuanced and effective integration of heterogeneous feature modalities.
Scalable Retrieval: We integrate our system with Elasticsearch (Gormley & Tong, 2015) to enable efficient and scalable image retrieval from large databases, addressing the practical challenges of real-world CBIR applications.
1.2. Statement of Purpose
Content-Based Image Retrieval has emerged as a robust alternative to text-based retrieval methods, which often rely on keyword-driven annotations and may overlook the full complexity of image content. By analyzing intrinsic visual characteristics—such as texture, shape, and color—CBIR systems automatically compare a user-provided query image to database items that share similar features. Despite these benefits, a major hurdle is the semantic gap: the mismatch between low-level descriptors (e.g., color histograms) and the high-level concepts that users perceive (Vo et al., 2021). To address this gap, the present work integrates advanced feature extraction approaches—namely Slanting Express Revolves Concise, Inclusive Color Histogram, and Histogram of Oriented Gradients—with a Self-Optimizing DeepRec Convolutional Neural Network. By combining these methods, the research aims to enhance retrieval accuracy, reduce labeling burdens, and capture nuanced image details. Building on three decades of CBIR advancements, this thesis examines the state-of-the-art in feature modeling, similarity metrics, and machine learning strategies, ultimately seeking to improve real-world image retrieval across large, heterogeneous datasets.
1.3. Research Objectives
This research aims to significantly enhance content-based image retrieval by overcoming persistent limitations of traditional retrieval methods, such as reliance on textual annotations, inability to fully capture semantic meaning, and inefficient scalability. The goal is to create an intuitive, robust, and scalable CBIR framework capable of accurately interpreting visual content within extensive image repositories. Specifically, this research sets out to:
1. Identify and Address Limitations in Current Image Retrieval Approaches
Systematically analyze current retrieval methodologies to understand critical weaknesses such as the semantic gap (disconnect between pixel-level representation and high-level concepts) and over-dependence on manual labeling. This analysis will cover traditional methods (e.g., color histograms), conventional deep learning models (e.g., VGG-16, ResNet-50), and hybrid systems.
2. Develop a Self-Optimizing Neural Network (SO-DRCNN) for Autonomous Learning
Design and implement a novel neural architecture, the Self-Optimizing DeepRec Convolutional Neural Network (SO-DRCNN), which autonomously learns to identify visual patterns without explicit labeling. This network will integrate spatial recurrent networks (SRNs) for spatial reasoning and attention mechanisms to prioritize image features that are critical for accurate retrieval.
3. Implement an Advanced Multi-Descriptor Framework (Ternion Paradigm)
Combine three complementary feature descriptors to capture comprehensive image characteristics:
- HOG (Histogram of Oriented Gradients) for detecting edges and textures.
- ICH (Inclusive Color Histogram) for encoding global color distributions.
- SERC (Slanting Express Revolves Concise) for identifying distinctive structural patterns robust to rotations and transformations.
This combination aims to effectively bridge low-level visual features and high-level semantic understanding.
4. Quantitatively Evaluate the Proposed System's Performance
Evaluate the CBIR system rigorously against state-of-the-art benchmarks (e.g., the CIFAR-10 dataset), with explicit performance goals:
- Accuracy: Achieve ≥95% Mean Average Precision (MAP).
- Efficiency: Retrieval responses in ≤0.5 seconds for databases exceeding one million images.
- Robustness: Maintain high retrieval accuracy under challenging conditions (e.g., noisy, rotated, cropped images).
5. Demonstrate Applicability Across Diverse Real-World Domains
Validate system usability without extensive retraining for multiple practical scenarios, including medical diagnostics (e.g., tumor detection in X-rays), e-commerce (e.g., fashion item retrieval), and surveillance. Optimize the architecture for deployment in both high-performance cloud environments and resource-limited mobile platforms.
6. Establish a Foundation for Future Research and Innovation
Provide comprehensive documentation and guidelines to enable further system enhancement, addressing limitations such as specialized-domain adaptation, ultra-large-scale database retrieval, and extensions to video-based or dynamic image retrieval scenarios.
1.4. Organization of Thesis
The remainder of this thesis is structured into five chapters, systematically addressing the motivations, methodologies, experimental evaluations, and implications of advanced CBIR approaches:
Chapter 2: Background
This chapter introduces the foundational concepts of CBIR, discussing its significance, practical applications, and inherent challenges. It presents an overview of feature extraction, similarity matching, and advanced retrieval methodologies, establishing a clear rationale for the research. Additionally, this chapter specifies the objectives, motivation, and anticipated contributions of the thesis.
Chapter 3: Literature Review
A comprehensive review of existing literature is conducted, exploring foundational methods and state-of-the-art advancements in CBIR. The chapter critically examines various feature extraction methods, region-based retrieval (RBIR), semantic-based retrieval techniques, and hybrid multimodal systems. Emphasis is placed on identifying the limitations of current methodologies, thereby clearly delineating how the thesis advances the field through novel contributions.
Chapter 4: Methodology
This chapter details the novel methodologies proposed in this research. It systematically describes the feature extraction processes, data augmentation techniques, and the integration of advanced deep learning architectures, specifically highlighting the proposed Self-Optimizing DeepRec Convolutional Neural Network (SO-DRCNN). It further elaborates on self-supervised training frameworks and indexing mechanisms, providing theoretical justifications for all methodological choices.
Chapter 5: Experimental Evaluation
Experimental validation of the proposed methodologies is presented, covering dataset descriptions, experimental setup, and detailed evaluation metrics such as Mean Average Precision (MAP), precision-recall analysis, and the Davies–Bouldin Index (DBI). Results are thoroughly examined through quantitative assessments, graphical visualizations, and comparative studies with established benchmarks, affirming the performance and effectiveness of the proposed systems.
Chapter 6: Conclusion and Future Work
The thesis concludes by summarizing the key research findings and contributions. It highlights the theoretical and practical implications of the developed methodologies and clearly outlines the limitations encountered during the research. Directions for future research are proposed, emphasizing potential enhancements and opportunities for advancing CBIR further.
Chapter 2
Background
2.1. Image Processing
Image processing is a computational discipline that involves the acquisition, enhancement, analysis, and retrieval of images using mathematical models and algorithms. It plays a fundamental role in computer vision, AI, medical imaging, and multimedia retrieval, enabling the extraction of meaningful information from visual data. Image processing techniques are designed to improve image quality, facilitate feature detection, and enable automated decision-making in scientific, industrial, and technological applications (Gonzalez & Woods, 2018).
Digital image processing consists of several key stages: image acquisition, preprocessing, feature extraction, segmentation, and recognition. Image acquisition involves capturing images using cameras, scanners, or remote sensing devices. Preprocessing enhances image quality through noise reduction, contrast adjustments, and edge enhancement to improve subsequent analysis (Jain et al., 1995). Feature extraction focuses on identifying key characteristics such as textures, edges, and color distributions, which are essential for classification and retrieval tasks. Segmentation partitions an image into meaningful regions, facilitating object recognition, medical diagnostics, and scene understanding (Szeliski, 2010).
Image processing techniques are broadly classified into spatial domain methods and frequency domain methods. Spatial domain methods operate directly on image pixels, applying transformations such as filtering, morphological operations, and histogram equalization. Frequency domain methods, on the other hand, transform images into the Fourier domain to analyze patterns, compress image data, and enhance specific features (Pratt, 2007). Advances in machine learning and deep learning have led to the development of automated image processing models, significantly improving performance in tasks such as object detection, facial recognition, and CBIR (LeCun et al., 2015).
A key application of image processing is in image retrieval, where computational techniques enable the efficient searching and indexing of visual content. Traditional image retrieval systems often relied on Text-Based Image Retrieval, where images were annotated with metadata, captions, or textual descriptions to enable searchability. However, TBIR faced limitations due to manual annotation costs and semantic gaps (Datta et al., 2008; Smeulders et al., 2000). On the other hand, CBIR leverages computer vision and pattern recognition to analyze color, texture, and shape features, allowing for more precise and automated retrieval processes (Smeulders et al., 2000).
With the growing volume of digital images across domains such as medical imaging, remote sensing, security, and multimedia archiving, the role of advanced image processing techniques in retrieval, classification, and recognition continues to expand. Advances in deep learning, particularly through CNNs and multimodal AI models, have significantly enhanced image retrieval accuracy by learning hierarchical feature representations. These methods reduce the semantic gap between low-level visual features (e.g., color, texture) and high-level human interpretation (e.g., object categories, contextual meaning) by capturing semantically meaningful patterns directly from data (He et al., 2016; Radford et al., 2021). While challenges persist in abstract or fine-grained retrieval tasks, modern AI-driven systems outperform classical methods in aligning machine-extracted features with human perception (Krizhevsky et al., 2012; Smeulders et al., 2000).
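To make the contrast between spatial-domain and frequency-domain processing described above concrete, the short Python sketch below applies a spatial filter and a Fourier transform to the same image. It assumes OpenCV and NumPy are installed and uses a hypothetical file name; it is an illustrative sketch rather than part of the thesis implementation.

import cv2
import numpy as np

# Load a grayscale test image (hypothetical file name).
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)

# Spatial-domain methods operate directly on pixel values.
smoothed = cv2.GaussianBlur(image, (5, 5), 1.0)   # low-pass spatial filtering
equalized = cv2.equalizeHist(image)               # global histogram equalization

# Frequency-domain methods analyze the image in the Fourier domain.
spectrum = np.fft.fftshift(np.fft.fft2(image.astype(np.float32)))
log_magnitude = np.log1p(np.abs(spectrum))        # log-magnitude spectrum for inspection

print(image.shape, smoothed.shape, equalized.shape, log_magnitude.shape)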
Ongoing research in deep learning, CNNs, and multimodal fusion is further driving progress in intelligent image analysis and retrieval systems (Goodfellow et al., 2016). These image processing advancements have laid the foundation for modern Information Retrieval systems, enabling more efficient indexing, classification, and retrieval of images. The following section explores specific retrieval strategies that leverage these image processing techniques, ranging from traditional TBIR to more advanced content-based and hybrid retrieval frameworks.
2.1.1. Image Processing Techniques
This section examines three primary categories of image processing techniques:
I. Image enhancement
II. Image segmentation
III. Object detection & recognition
These categories are foundational for extracting meaningful information and improving computational analysis.
I. Image Enhancement
Image enhancement modifies an image to increase its interpretability and visibility, optimizing it for computer vision applications, medical diagnostics, and feature extraction tasks (Pratt, 2007). These techniques work to reduce noise and improve image contrast. As a result, critical features become more prominent, making the images better suited for analysis and automated processing.
§ Noise Reduction: Noise reduction techniques eliminate unwanted distortions, such as random intensity variations resulting from sensor limitations or environmental factors (Jain et al., 1995).
o Median filtering: Suppresses salt-and-pepper noise while preserving edges.
o Gaussian filtering: Smooths intensity variations by applying a weighted average of neighboring pixels.
o Non-local Means filtering: Reduces noise by averaging similar pixel intensities across distant regions (Buades et al., 2005).
§ Contrast Adjustment & Histogram Equalization: Contrast enhancement techniques expand an image's dynamic range, improving the visibility of subtle details, which is particularly beneficial for medical imaging, satellite imagery, and digital photography (Szeliski, 2010).
o Histogram equalization: Redistributes pixel intensities to improve global contrast.
o AHE: Enhances contrast in localized regions to emphasize finer details (Pizer et al., 1987).
§ Edge Detection: Edge detection is crucial for feature extraction and object recognition, as it identifies boundaries and structural elements within an image (Canny, 1986).
o Sobel operator: Detects edges based on intensity gradients.
o Canny edge detector: Applies Gaussian smoothing and gradient detection to identify edges with minimal noise interference.
o Laplacian operator: Highlights regions of rapid intensity change, aiding in object boundary detection.
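A minimal sketch of the enhancement operations listed above, assuming OpenCV is available; the file names are hypothetical and the parameter values are ordinary defaults, not tuned settings.

import cv2

image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input

# Noise reduction: median filtering suppresses salt-and-pepper noise while preserving edges.
denoised = cv2.medianBlur(image, 5)

# Contrast adjustment: CLAHE, a contrast-limited form of adaptive histogram equalization (AHE).
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(denoised)

# Edge detection: the Canny detector combines smoothing with gradient analysis.
edges = cv2.Canny(enhanced, 100, 200)

cv2.imwrite("edges.jpg", edges)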
II. Image Segmentation
Image segmentation partitions an image into distinct regions corresponding to objects or areas of interest, facilitating medical diagnostics, autonomous navigation, and scene analysis (Shi & Malik, 2000). This technique enables more effective object recognition, classification, and retrieval by isolating relevant image components.
§ Thresholding Techniques: Thresholding converts grayscale images into binary representations, distinguishing foreground objects from the background based on intensity variations (Otsu, 1979).
o Otsu's method: Automatically selects an optimal threshold to maximize inter-class variance.
o Adaptive thresholding: Adjusts the threshold dynamically across different image regions, accommodating variations in lighting conditions.
§ Region-Based Segmentation: Region-based methods group pixels with similar characteristics to delineate meaningful structures (Comaniciu & Meer, 2002).
o Watershed segmentation: Treats the image as a topographic surface and identifies object boundaries using gradient-based ridges.
o Active Contour Models (Snakes): Employ energy minimization to refine object boundaries through iterative deformation (Kass et al., 1988).
§ Deep Learning-Based Segmentation: Deep learning models have significantly advanced segmentation accuracy, particularly in biomedical imaging, autonomous vehicles, and satellite image analysis (Ronneberger et al., 2015).
o U-Net: A CNN designed for precise biomedical image segmentation.
o Mask R-CNN: Extends Faster R-CNN by incorporating pixel-wise instance segmentation for detecting multiple objects (K. He et al., 2017).
III. Object Detection & Recognition
Object detection identifies and classifies objects within an image, enabling applications in face recognition, autonomous navigation, surveillance, and image retrieval (Felzenszwalb et al., 2010). These methods fall into two categories: traditional feature-based approaches and deep learning-based techniques.
§ Traditional Object Detection Methods: Earlier methods relied on handcrafted features to detect objects based on predefined patterns (Dalal & Triggs, 2005).
o Haar cascades: Utilize edge and texture patterns for real-time face and object detection.
o HOG: Captures gradient distributions to recognize shapes and contours.
§ Deep Learning-Based Object Detection: Recent advancements in deep learning have led to more robust and scalable object detection frameworks (Redmon et al., 2016).
o YOLO (You Only Look Once): Processes an entire image in a single pass, enabling efficient real-time detection.
o Faster R-CNN: Enhances object detection accuracy by incorporating a region proposal network for improved bounding box predictions (Ren et al., 2016).
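As a small illustration of one of the handcrafted descriptors mentioned above (and one of the Ternion descriptors used later in this thesis), the sketch below computes a HOG feature vector with scikit-image. The library, file name, and parameter values are assumptions for illustration, not the exact configuration adopted in the methodology.

import cv2
from skimage.feature import hog

# Resizing to a fixed size keeps the HOG descriptor length constant across images.
image = cv2.imread("sample.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
image = cv2.resize(image, (128, 128))

# 9 orientation bins over 8x8-pixel cells, normalized in 2x2-cell blocks (common HOG settings).
descriptor = hog(
    image,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    block_norm="L2-Hys",
)
print(descriptor.shape)   # a fixed-length vector of gradient-orientation statistics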
Image retrieval systems rely on specialized image processing techniques to extract, represent, and index visual features, allowing for efficient searching and matching of images in large databases. Unlike general image processing, where the goal is image enhancement or segmentation, image retrieval techniques focus on identifying distinctive image features and organizing them into structured representations for fast and accurate retrieval (Smeulders et al., 2000). This section explores:
I. Feature extraction
II. Deep Learning-Based Feature Extraction
III. Preprocessing for Large-Scale Image Retrieval
I. Feature Extraction
In feature extraction, visual elements such as color, texture, and shape are analyzed to create a numerical representation of an image (Datta et al., 2008). These features serve as a compact and discriminative description that allows retrieval systems to compare and rank image similarity efficiently.
§ Color Features: Color is one of the most commonly used features in image retrieval, as it provides a straightforward way to differentiate images (Swain & Ballard, 1991). Color-based retrieval methods rely on statistical representations of pixel intensities rather than object recognition, making them useful for applications such as multimedia search and digital library indexing.
o RGB histograms: Represent the distribution of red, green, and blue intensities in an image.
o HSV histograms: Capture hue, saturation, and value, offering robustness to lighting variations.
§ Texture Features: Texture describes the spatial arrangement of pixel intensities, enabling retrieval systems to differentiate surfaces and patterns that may not be easily distinguishable by color alone (Manjunath & Ma, 1996).
o Local Binary Patterns (LBP): Encode local texture characteristics by thresholding neighborhood pixels.
o Gabor filters: Analyze texture frequencies and orientations, making them useful for biomedical image retrieval and fingerprint recognition.
§ Shape Features: Shape-based retrieval techniques are effective for identifying images containing specific objects or geometric patterns, particularly in biomedical and industrial applications (Zhang & Lu, 2002).
o Contour descriptors: Extract object outlines to facilitate shape-based matching.
o Fourier descriptors: Convert shape boundaries into frequency components for comparison.
Feature extraction methods allow images to be represented in high-dimensional feature spaces, where similarity measures such as Euclidean distance and cosine similarity are used for ranking retrieved images.
II. Deep Learning-Based Feature Extraction
Traditional feature extraction methods rely on handcrafted features, which may not always capture high-level semantic information. Recent advances in deep learning have significantly improved image retrieval by enabling models to automatically learn hierarchical feature representations from large datasets (LeCun et al., 2015).
§ CNNs for Feature Embeddings: CNNs have become the standard for image feature extraction, transforming raw pixel values into a structured feature vector (Krizhevsky et al., 2012).
o ResNet and VGGNet extract multi-layer feature embeddings, capturing texture, shape, and spatial structure.
o These embeddings are used in vector search engines for large-scale image retrieval.
§ Self-Supervised Learning for Feature Representation: Recent developments in self-supervised learning (SSL) allow models to learn meaningful feature representations without labeled data, making them ideal for scalable retrieval systems (Chen et al., 2020).
o SimCLR (Simple Contrastive Learning Representation) learns visual similarities using contrastive loss.
o MoCo (Momentum Contrast) stores large feature dictionaries for improved retrieval accuracy.
o DINO (Self-Distillation with No Labels) enhances feature quality for unsupervised retrieval applications.
§ Vision Transformers (ViTs) in Image Retrieval: ViTs have emerged as an alternative to CNNs, processing images using self-attention mechanisms to capture long-range dependencies (Dosovitskiy et al., 2021).
o Unlike CNNs, which extract local features, ViTs model entire image patches simultaneously, improving retrieval for complex scenes and fine-grained image categorization.
Deep learning-based feature extraction enables more accurate and semantically meaningful retrieval, reducing the reliance on manually engineered descriptors.
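The sketch below contrasts a handcrafted descriptor (an HSV color histogram) with a deep descriptor (a ResNet-50 embedding) and simply concatenates them, loosely mirroring the hybrid handcrafted-plus-learned representation pursued in this thesis. It assumes OpenCV, PyTorch, and torchvision 0.13 or newer (for the pretrained-weights API); file names are hypothetical and the configuration is illustrative only.

import cv2
import numpy as np
import torch
from torchvision import models, transforms

def color_histogram(path, bins=(8, 8, 8)):
    """Handcrafted descriptor: a normalized HSV color histogram (512-D here)."""
    hsv = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins), [0, 180, 0, 256, 0, 256])
    return cv2.normalize(hist, hist).flatten()

# Deep descriptor: 2048-D ResNet-50 embedding (classification head removed).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()
preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_embedding(path):
    rgb = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    with torch.no_grad():
        feature = backbone(preprocess(rgb).unsqueeze(0))
    return feature.squeeze().numpy()

# A simple fused representation: concatenate the two descriptors.
img = "query.jpg"   # hypothetical file name
fused = np.concatenate([color_histogram(img), cnn_embedding(img)])
print(fused.shape)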
III. Preprocessing for Large-Scale Image Retrieval
Efficient image retrieval requires optimizing feature storage, indexing, and retrieval speed. Preprocessing techniques such as dimensionality reduction, feature indexing, and compression improve scalability and search efficiency in high-dimensional feature spaces (Jégou et al., 2011).
§ Dimensionality Reduction Techniques: High-dimensional feature representations can be computationally expensive. Dimensionality reduction improves retrieval efficiency while preserving important feature details.
o Principal Component Analysis (PCA): Reduces feature dimensions by identifying principal components in the data.
o t-SNE (t-Distributed Stochastic Neighbor Embedding): Projects high-dimensional data into a lower-dimensional space for visualization and clustering.
§ Feature Indexing Methods: For real-time image retrieval, indexing techniques allow faster nearest neighbor searches in large-scale databases (Muja & Lowe, 2009).
o KD-Trees: Partition feature space into hierarchical subregions to accelerate similarity searches.
o Hashing methods: Convert feature vectors into compact binary representations for fast lookup.
o Approximate Nearest Neighbor (ANN) search: Balances search accuracy and computational efficiency for large-scale datasets.
§ Compression for Efficient Storage & Retrieval: Storage-efficient retrieval systems require compression techniques to reduce memory footprint while preserving retrieval accuracy.
o Vector quantization: Compresses feature vectors while maintaining similarity relationships.
o Product Quantization (PQ): Enables fast approximate nearest neighbor search in high-dimensional feature spaces.
These preprocessing techniques ensure that image retrieval systems remain scalable and computationally efficient, allowing for real-time search and indexing in extensive datasets.
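A brief sketch of the preprocessing steps just described, assuming scikit-learn is available: PCA compresses the descriptors and a nearest-neighbor index answers queries over the reduced vectors. Random placeholder data stand in for real image features.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
database = rng.normal(size=(10_000, 2560)).astype(np.float32)   # placeholder descriptors

# 1. Dimensionality reduction: keep the 128 leading principal components.
pca = PCA(n_components=128)
reduced = pca.fit_transform(database)

# 2. Index the reduced vectors for fast nearest-neighbor search.
index = NearestNeighbors(n_neighbors=10, metric="euclidean")
index.fit(reduced)

# 3. Query: project a new descriptor and retrieve the ten closest database images.
query = rng.normal(size=(1, 2560)).astype(np.float32)
distances, neighbor_ids = index.kneighbors(pca.transform(query))
print(neighbor_ids[0])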
2.1.2. Relationship Between Image Processing and Image Retrieval
Image processing serves as the foundation for image retrieval, enabling the extraction, representation, and indexing of visual features that facilitate efficient search and retrieval operations. While general image processing techniques focus on enhancement, segmentation, and object detection, their integration into image retrieval ensures more accurate and meaningful search results (Smeulders et al., 2000). This section explores how feature extraction, image enhancement, segmentation, and object detection contribute to various retrieval approaches.
§ Feature Extraction
CBIR systems, in particular, rely on feature extraction, enhancement, and object recognition to improve retrieval accuracy and search efficiency (Datta et al., 2008). CBIR converts images into structured representations based on color, texture, shape, and deep learning-based embeddings. CBIR systems use different feature extraction approaches depending on the task. Traditional methods, such as color histograms, Local Binary Patterns (LBP), and shape descriptors, provide handcrafted feature representations that capture essential visual attributes (Swain & Ballard, 1991). In contrast, deep learning-based methods utilize CNNs to generate more robust feature embeddings, with architectures such as ResNet and VGGNet, as well as self-supervised learning models like SimCLR and MoCo, which further improve retrieval performance without relying on labeled data (LeCun et al., 2015). By integrating feature extraction techniques, CBIR systems significantly enhance retrieval accuracy, ensuring that query images return visually similar results based on content rather than textual descriptions, making them particularly effective in multimedia search, medical imaging, and large-scale visual databases.
§ Image Enhancement
Image enhancement plays a vital role in CBIR by improving feature distinctiveness in low-quality or noisy datasets, ensuring accurate feature extraction and similarity matching (Pratt, 2007). Poor image quality can degrade retrieval accuracy, as retrieval systems rely on well-defined features to perform efficient indexing and ranking. Techniques such as noise reduction, contrast adjustment, and edge sharpening improve clarity, texture representation, and boundary detection, all of which enhance retrieval performance (Buades et al., 2005; Szeliski, 2010). In medical image retrieval, contrast enhancement improves tumor and lesion visibility, aiding in clinically relevant case matching (Ronneberger et al., 2015). Similarly, in historical document retrieval, noise removal and sharpening refine text extraction, facilitating accurate archival searches. By ensuring that images contain well-preserved, high-contrast visual features, enhancement techniques optimize retrieval accuracy, allowing retrieval systems to perform more precise ranking and similarity comparisons.
§ Segmentation
Segmentation is a key image processing step that partitions an image into meaningful regions (Shi & Malik, 2000). By isolating specific objects or areas, segmentation bridges the gap between raw pixel data and higher-level retrieval tasks. In region-based retrieval, only these segmented regions are compared, filtering out irrelevant background and improving precision—a benefit particularly evident in medical imaging, where U-Net-based segmentation (Ronneberger et al., 2015) helps identify pathologies for targeted comparisons, and in satellite analysis, where isolating regions of interest (e.g., deforestation zones) refines query relevance. Thus, effective segmentation not only enhances object-based retrieval accuracy but also demonstrates how image processing techniques are fundamentally tied to more advanced image retrieval strategies.
§ Object Detection
Similarly, object detection localizes and classifies objects within an image, allowing retrieval systems to focus on relevant content rather than irrelevant background details. Models such as YOLO, Faster R-CNN, and Mask R-CNN detect and segment objects with high accuracy, eliminating the need for manual cropping (Redmon et al., 2016; Ren et al., 2016; He et al., 2017). By generating precise bounding boxes or masks, these algorithms streamline the retrieval process—rather than comparing entire scenes, the system compares only the detected objects. For example, in automotive image retrieval, an object detection model can identify vehicles within complex street scenes and classify them by make or model, letting the retrieval system rank images based on specific car attributes instead of irrelevant background elements. In facial recognition, object detection isolates faces within crowded images, allowing the retrieval system to match identities more efficiently and accurately. Through these targeted detections, object detection ensures more domain-specific and precise image retrieval, reducing computational overhead and improving user satisfaction.
2.2. Retrieval Strategies
Efficient image retrieval plays a critical role in academic research, digital archiving, medical imaging, surveillance, and multimedia applications. Various retrieval strategies have been developed to improve search accuracy, scalability, and relevance by leveraging different aspects of image representation (Datta et al., 2008; J. Z. Wang et al., 2001). These strategies range from traditional text-based retrieval to advanced AI-driven content-based and hybrid models.
2.2.1. Text-Based Image Retrieval (TBIR)
TBIR relies on textual metadata such as titles, descriptions, and manually assigned tags to search for images. This method is widely used in digital libraries and web search engines, where textual annotations are available. However, TBIR suffers from subjectivity and annotation inconsistencies, as different users may describe the same image differently, leading to mismatches in search results (Datta et al., 2008). A distinct contribution of TBIR lies in its suitability for structured archives, such as museum databases and academic repositories, where detailed textual descriptions are readily available.
2.2.1.1. Challenges and Limitations
Despite its advantages, TBIR faces several challenges:
- Subjectivity: The subjective nature of textual descriptions can lead to inconsistent annotations, affecting retrieval accuracy (Goodrum, 2000).
- Scalability: Manual annotation of large image collections is not feasible. Automated techniques, although helpful, are not always accurate and can miss contextual nuances.
- Semantic Gap: The gap between textual descriptions and visual content impacts retrieval effectiveness. TBIR systems need to bridge this gap to improve accuracy (Smeulders et al., 2000).
- Language Variability: Differences in language, spelling, and phrasing affect a TBIR system's effectiveness. Multilingual support is crucial for global applications (L.-J. Li & Fei-Fei, 2010).
2.2.1.2. Future Directions
Research in TBIR aims to address these challenges and improve retrieval accuracy. Key areas of focus include:
- Improved NLP Techniques: Advances in NLP can enhance automated annotation and semantic analysis, making TBIR systems more accurate and context-aware (Brown et al., 2020).
- Machine Learning: Incorporating machine learning can improve TBIR scalability and accuracy. These technologies can learn from user interactions, continually refining retrieval results (Khan et al., 2010).
- Multimodal Retrieval: Combining TBIR with other retrieval methods, like CBIR, leverages the strengths of both approaches. Multimodal retrieval systems use textual and visual features to enhance accuracy.
- User Interaction and Feedback: Incorporating user feedback into TBIR systems refines annotations and improves retrieval accuracy. Interactive systems that learn from user behavior are more effective.
While manual annotation provides high-quality, contextually rich image descriptions, its labor-intensive nature, high cost, time requirements, and issues with subjectivity and scalability present significant challenges. These limitations confirm the need for integrating automated and hybrid approaches to enhance the efficiency, scalability, and consistency of TBIR systems.
2.2.2. Content-Based Image Retrieval
A CBIR approach replaces traditional text-driven searches with visual feature analysis. In this strategy, images are first preprocessed (e.g., normalized or filtered) to ensure consistent input quality. Next, feature extraction algorithms encode each image into a descriptive representation, capturing essential properties such as color, texture, or shape (Chopra et al., 2021). These representations are then indexed for rapid comparison.
[Figure 1 depicts the common CBIR approach as two segments: an offline database indexing segment (input image file → image pre-processing → essential feature estimation for every image → features database) and an online query-with-exploration segment (query image → image pre-processing → essential feature estimation → image retrieval with similarity measurement → retrieved images).]
Figure 1 - Common CBIR approach
When a query image is provided, the system repeats the feature extraction process to create a query representation, which is subsequently matched against the indexed database using similarity measures. Results are ranked based on their resemblance to the query, returning only the most visually similar images (Georgiou, 2021). By focusing on the content itself, CBIR circumvents the need for comprehensive text labeling or manual annotations, resulting in a robust, scalable, and efficient retrieval solution.
CBIR eliminates TBIR's dependence on text by retrieving images by visual content, such as color histograms, textures, and spatial structures (Smeulders et al., 2000). CBIR systems extract these features and visual descriptors and compute similarity scores between a query image and database images (J. Z. Wang et al., 2001).
How CBIR Differs from TBIR:
- Uses visual analysis instead of textual annotations.
- Independent of language barriers – useful in domains like biomedical imaging and forensic investigations, where textual descriptions are insufficient (Rui et al., 1999).
2.2.2.1. Challenges and Limitations
Despite its strengths, CBIR also faces a number of significant challenges:
- Semantic Gap: Low-level visual features (e.g., color, shape) often fail to capture high-level semantic concepts, creating a gap between the system's representation and the user's intent (Smeulders et al., 2000).
- High-Dimensional Feature Space: CBIR extracts high-dimensional feature vectors (e.g., deep CNN embeddings), which can be computationally expensive to index and compare, especially as databases grow (Datta et al., 2008).
- Computational Overhead: Advanced feature extraction (e.g., deep learning) boosts retrieval accuracy but imposes significant demands on processing power and storage (Krizhevsky et al., 2012).
- Domain Specificity: Handcrafted features or CNN-based descriptors may need to be tailored for specific domains (e.g., medical imaging vs. fashion), limiting the generality of a single CBIR system.
In summary, bridging the semantic gap, managing high-dimensional data, and balancing accuracy with computational efficiency are central hurdles for CBIR. Ongoing research focuses on improved feature extraction, intelligent indexing, and hybrid approaches (e.g., combining text and visual features) to address these limitations and better align retrieval results with user expectations.
2.2.2.2. Future Directions
Research in CBIR continues to tackle its core challenges, particularly the semantic gap and high computational demands. Key areas of focus include:
- Advanced Deep Learning Architectures: Building on CNN-based methods (Krizhevsky et al., 2012), new architectures (e.g., Vision Transformers) and self-supervised frameworks aim to extract richer, more semantically meaningful representations without extensive labeled data (Chen et al., 2020).
- Hybrid Feature Integration: Combining handcrafted features (e.g., color histograms) with deep descriptors can yield robust retrieval performance, especially in domain-specific contexts like medical imaging or fashion (Smeulders et al., 2000).
- Efficient Indexing and Similarity Search: As image databases grow, approximate nearest-neighbor techniques and hash-based methods help reduce retrieval latency and memory usage (He et al., 2018). Ongoing work focuses on scalable indexing structures for high-dimensional embeddings.
- User-Centric Enhancements: Incorporating relevance feedback and interactive interfaces refines retrieval outcomes over time, aligning system results more closely with user intent (Torres & Reis, 2008). This iterative process mitigates the semantic gap by capturing contextual or personal preferences.
- Domain Adaptation & Transfer Learning: Tailoring CNN-based CBIR systems to specific domains (e.g., remote sensing, forensics) often requires fine-tuning or domain adaptation strategies that leverage pre-trained models (Doersch & Zisserman, 2017). Research explores how to adapt such models efficiently for niche applications.
2.3. IR Techniques
The major IR techniques for images can be broadly categorized into two main approaches:
I. TBIR
II. CBIR
2.3.1. TBIR
TBIR was already explained in the Retrieval Strategies section; this section focuses only on its operational techniques. In TBIR, image retrieval is performed using structured metadata indexing, allowing images to be searched based on associated textual descriptions rather than their visual content. TBIR systems primarily utilize natural language processing (NLP) and information retrieval algorithms to improve query matching. Advanced NLP techniques, AI-driven annotation, and ontology-based metadata expansion enhance search accuracy and relevance. This section presents the key techniques in TBIR.
2.3.1.1 Advanced TBIR Methodologies
I. Semantic Text Processing & Query Matching
Traditional keyword-based retrieval is limited when user queries do not match stored metadata exactly. Word embeddings, such as Word2Vec and BERT, transform words into high-dimensional vector representations, enabling semantic similarity-based retrieval (Mikolov et al., 2013).
§ Mathematical Model for Word Embeddings
Each word $w$ in an image description is mapped to a vector $v_w$ in an $n$-dimensional space ($v_w \in \mathbb{R}^n$):
$$v_w = W \cdot w + b$$
where:
- $W \in \mathbb{R}^{n \times |V|}$ is a learned weight matrix encoding semantic relationships (analogous to an embedding lookup table),
- $w \in \mathbb{R}^{|V|}$ is a one-hot encoded vector representing the word in a vocabulary $V$,
- $b \in \mathbb{R}^{n}$ is a bias term refining the vector representation.
Key Insight: This step converts discrete words into continuous vectors, enabling machines to interpret linguistic semantics geometrically (e.g., "cat" and "kitten" are close in the embedding space).
For multi-word queries, an overall query vector $v_Q$ is computed as the mean of the individual word embeddings:
$$v_Q = \frac{1}{n} \sum_{i=1}^{n} v_{w_i}$$
The similarity score between a query $Q$ and an image annotation (represented as $v_I$) is computed using cosine similarity:
$$\mathrm{Sim}(Q, I) = \frac{v_Q \cdot v_I}{\lVert v_Q \rVert \, \lVert v_I \rVert}$$
Values range from $-1$ (dissimilar) to $1$ (identical). High scores indicate semantic alignment between the query and the image annotation. Word embeddings improve query relevance by allowing concept-based matching rather than exact word searches (Bengio et al., 2003).
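As a concrete illustration of the word-embedding query matching described above, the following Python sketch averages per-word vectors into a query embedding and ranks image annotations by cosine similarity. The tiny EMBEDDINGS table and the annotation vectors are illustrative stand-ins (in practice they would come from a trained model such as Word2Vec or BERT), not part of any specific TBIR system.

```python
import numpy as np

# Toy word-embedding table; in practice these vectors come from a trained
# model such as Word2Vec or BERT. Values here are illustrative only.
EMBEDDINGS = {
    "red":     np.array([0.9, 0.1, 0.0]),
    "leather": np.array([0.2, 0.8, 0.1]),
    "handbag": np.array([0.1, 0.7, 0.6]),
    "purse":   np.array([0.1, 0.6, 0.7]),
    "crimson": np.array([0.8, 0.2, 0.1]),
}

def query_vector(words):
    """Mean of the individual word embeddings: v_Q = (1/n) * sum(v_wi)."""
    vectors = [EMBEDDINGS[w] for w in words if w in EMBEDDINGS]
    return np.mean(vectors, axis=0)

def cosine_similarity(a, b):
    """Sim(Q, I) = (v_Q . v_I) / (|v_Q| |v_I|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical image annotations, each already reduced to a mean embedding.
annotations = {
    "img_001": query_vector(["crimson", "leather", "purse"]),
    "img_002": query_vector(["red", "handbag"]),
}

q = query_vector(["red", "leather", "handbag"])
ranked = sorted(annotations.items(),
                key=lambda kv: cosine_similarity(q, kv[1]),
                reverse=True)
for image_id, vec in ranked:
    print(image_id, round(cosine_similarity(q, vec), 3))
```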
II. Sentiment Analysis
Sentiment analysis involves determining the sentiment expressed in the textual annotations. Understanding the emotional context of descriptions can add another layer of relevance in TBIR systems.
§ VADER (Valence Aware Dictionary and sEntiment Reasoner): VADER is a lexicon and rule-based sentiment analysis tool that is particularly attuned to sentiments expressed in social media. It helps understand the emotional tone of the annotations (Hutto & Gilbert, 2014).
§ TextBlob: A simple library for processing textual data, TextBlob provides tools for common NLP tasks, including sentiment analysis. It captures the sentiment of the textual descriptions, useful for more nuanced image retrieval (Loria, 2020).
III. Automated Image Annotation Using AI
Manual image annotation is time-consuming. AI-powered models automatically generate textual metadata by analyzing image content. These models integrate:
- CNNs for image feature extraction.
- Recurrent Neural Networks (RNNs) or Transformer Models for text sequence generation.
§ Mathematical Model for Image Captioning:
The AI model assigns a caption $T$ to an image $I$ by maximizing the probability of words in $T$, given $I$:
$$T^{*} = \arg\max_{T} P(T \mid I)$$
where:
- $P(T \mid I)$ represents the probability of generating a description $T$ given the image $I$.
- The probability is computed using a sequence prediction model:
$$P(T \mid I) = \prod_{t=1}^{m} P(w_t \mid w_1, \ldots, w_{t-1}, I)$$
- $T$ = the sequence of words forming the image caption.
- $I$ = the image input, which provides visual features extracted by a CNN.
- $P(w_t \mid w_1, \ldots, w_{t-1}, I)$ = the probability of generating the next word $w_t$, given the previous words in the sequence and the image representation $I$.
The first word is predicted from the image alone, $P(w_1 \mid I)$; the second word is predicted from $w_1$ and the image, $P(w_2 \mid w_1, I)$; and this continues until the model generates all $m$ words of the caption. This approach is used in Recurrent Neural Networks (RNNs) and Transformer-based models for image captioning, where the probability of each word is computed sequentially based on previous words and the image features.
IV. Named Entity Recognition (NER)
NER involves identifying and classifying proper nouns in the text into predefined categories such as names of people, organizations, and locations. NER enhances the precision of TBIR by extracting specific entities from the annotations.
• SpaCy: An open-source library for advanced natural language processing in Python, SpaCy offers pre-trained models for NER. It effectively extracts entities from image annotations, improving the retrieval process.
• Stanford NER: This tool uses statistical models trained on labeled data to provide reliable entity recognition. Its application in TBIR helps accurately identify and classify entities within the textual descriptions.
V. Metadata Expansion via Ontologies
To enhance retrieval accuracy, TBIR systems use domain-specific ontologies (e.g., MeSH for medical images; U.S. National Library of Medicine, n.d.) to expand queries with synonyms and related terms. This ensures that searches retrieve all relevant images, even if the user does not specify the exact metadata terms. For example, "X-ray" and "Radiograph" are considered synonymous. This broader search vocabulary helps ensure that all relevant images are retrieved, even when a user does not use the exact technical terminology.
§ Mathematical Model for Query Expansion:
An ontology can be represented as a graph where nodes correspond to concepts (e.g., "X-ray," "Radiograph") and edges define relationships between terms (e.g., "X-ray → Radiograph (synonym)").
For a given query $Q$, the expanded query $Q'$ includes all semantically related words:
$$Q' = Q \cup \{\, w' \mid \exists\, (w, w') \in E \,\}$$
where $w'$ is a related word from the ontology and $E$ is the set of ontology edges.
§ Term Weighting in Expanded Queries (TF-IDF Representation)
Each term in $Q'$ is assigned a relevance weight using TF-IDF (Term Frequency–Inverse Document Frequency) (Salton & McGill, 1983):
$$\mathrm{TF\text{-}IDF}(t, d) = \mathrm{TF}(t, d) \times \log\!\left(\frac{N}{\mathrm{DF}(t)}\right)$$
where:
- $\mathrm{TF}(t, d)$ = frequency of term $t$ in document $d$,
- $\mathrm{DF}(t)$ = number of documents containing $t$,
- $N$ = total number of documents in the collection.
TF-IDF ensures that terms with high descriptive power (i.e., those that appear rarely across the collection but are frequent in a specific document) receive higher weights (Salton & Buckley, 1988; Manning et al., 2008).
Ontology-driven query expansion can significantly enhance retrieval performance in text-based image retrieval (TBIR). By incorporating synonymous or related terms, expanded queries cover concept-dense domains more comprehensively, such as those found in medical imaging. As a result, recall is elevated due to broader keyword coverage. Precision often remains stable—or even improves—thanks to TF-IDF weighting, which filters out irrelevant results by emphasizing contextually critical terms. Empirical findings (X. Li et al., 2021) indicate that such expansions accommodate nuanced terminological variations without degrading overall result quality. Ontologies also function as knowledge graphs that capture key semantic relationships within specialized domains. Leveraging these connections helps unify diverse references and mitigates the variability of user queries and metadata descriptions. At the same time, TF-IDF weighting ensures that distinctive terms retain prominence, balancing comprehensive retrieval with robust relevance ranking. Consequently, combining ontology-based query expansion with TF-IDF weighting is shown to both increase recall—through broader term sets—and sustain precision—via the strategic weighting of essential concepts.
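To make the query-expansion and TF-IDF weighting steps concrete, the following minimal Python sketch expands a query through a hypothetical synonym graph and scores toy annotation documents with the TF-IDF formula above. The ONTOLOGY_EDGES set and the small document collection are illustrative assumptions rather than a real ontology such as MeSH or RadLex.

```python
import math
from collections import Counter

# Hypothetical ontology edges: each pair (w, w') marks w' as a synonym/related term.
ONTOLOGY_EDGES = {
    ("x-ray", "radiograph"),
    ("tumor", "lesion"),
    ("tumor", "mass"),
}

def expand_query(query_terms):
    """Q' = Q ∪ {w' | (w, w') in E} — add every related term from the ontology."""
    expanded = set(query_terms)
    for w, w_related in ONTOLOGY_EDGES:
        if w in query_terms:
            expanded.add(w_related)
    return expanded

def tf_idf(term, doc_terms, collection):
    """TF-IDF(t, d) = TF(t, d) * log(N / DF(t))."""
    tf = Counter(doc_terms)[term]
    df = sum(1 for d in collection if term in d)
    if tf == 0 or df == 0:
        return 0.0
    return tf * math.log(len(collection) / df)

# Toy annotation "documents" attached to images (illustrative only).
collection = [
    ["chest", "x-ray", "tumor"],
    ["brain", "mri", "lesion"],
    ["chest", "radiograph", "mass"],
]

query = expand_query({"x-ray", "tumor"})
for i, doc in enumerate(collection):
    score = sum(tf_idf(t, doc, collection) for t in query)
    print(f"image {i}: score={score:.3f}")
```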
2.3.1.2 Application Examples
This section explores real-world applications of TBIR across various domains, categorized by the primary approaches used. Each example highlights the techniques employed, the specific application, and its significance, supported by references to high-impact academic literature.
• Textual Query-Based Retrieval
Textual query-based retrieval relies on natural language queries and structured metadata to retrieve images. This approach is widely used in domains where user-friendly and intuitive search is essential. In e-commerce, online marketplaces like Amazon and eBay require efficient retrieval of product images based on user queries. Textual query-based retrieval uses product descriptions and metadata to achieve this. Techniques include Natural Language Processing (NLP) models like BERT to understand user queries (e.g., "red leather handbag") and metadata tagging to annotate product images with attributes like color, material, and style (Devlin et al., 2018; W. Liu et al., 2011). For example, a user searching for "red leather handbag" retrieves relevant product images, enhancing user experience and increasing sales conversion rates. On social media platforms like Instagram and Flickr, users rely on user-generated content and tags for image retrieval. Textual query-based retrieval uses hashtags and captions to improve search accuracy. Techniques include user-generated tags (e.g., "#sunset," "mountain hiking") and NLP for semantic analysis of captions (C. Hu, 2021; Z. Wang et al., 2018). For instance, a user searching for "#sunset" retrieves images tagged with this keyword, improving content discoverability and user engagement.
• Semantic and Ontology-Based Retrieval
Semantic and ontology-based retrieval bridges the semantic gap by mapping textual queries to high-level concepts using ontologies and knowledge graphs. This approach is critical in domains requiring domain-specific knowledge. In healthcare, radiologists and researchers need accurate retrieval of medical images for diagnostics and analysis. Semantic-based retrieval uses medical ontologies and knowledge graphs to achieve this. Techniques include NLP for medical terms using models like BioBERT to extract terms from queries (e.g., "early-stage lung cancer") and ontology development using standards like SNOMED CT to standardize terms and relationships (Lee et al., 2020; Rui et al., 1999). For example, a radiologist querying "early-stage lung cancer in non-smokers" retrieves relevant X-rays, improving diagnostic accuracy and workflow efficiency. In biodiversity research, researchers and conservationists need to retrieve images of species for biodiversity studies. Semantic-based retrieval uses domain-specific ontologies to link species names to related concepts (e.g., habitats, conservation status). Techniques include NLP for species identification and domain-specific ontologies (Zhang, 2021). For instance, a researcher querying "endangered bird species in the Amazon rainforest" retrieves relevant images, supporting biodiversity research and conservation efforts.
• Multimodal Retrieval (Text + Visual Features)
Multimodal retrieval combines textual queries with visual features to retrieve images using multimodal embeddings. This approach is ideal for domains requiring context-aware and semantically rich results. In search engines, platforms like Google Images and Bing Visual Search require accurate retrieval of images based on complex queries. Multimodal retrieval uses text and visual features to achieve this. Techniques include multimodal embeddings using models like CLIP to align textual and visual representations and visual feature extraction using CNNs or Vision Transformers (ViTs) (He et al., 2017; Radford et al., 2021). For example, a user querying "red dress with floral patterns" retrieves visually similar images, enhancing retrieval accuracy and user satisfaction. In advertising, advertisers need to retrieve images for campaigns based on descriptive queries. Multimodal retrieval uses text and visual features to achieve this. Techniques include multimodal embeddings to align textual queries with visual features and relevance feedback to iteratively refine results based on user input (Frome et al., 2013; W. Wang et al., 2021). For instance, an advertiser querying "diverse team collaborating in a modern office" retrieves relevant stock photos, enabling targeted and contextually relevant advertising.
2.3.2. CBIR
Since CBIR has already been explained in the Retrieval Strategies section (2.2.2), this section focuses specifically on its operational techniques, such as feature extraction, similarity measurement, and indexing methods. CBIR is a technique that searches for images by analyzing visual features (color, texture, shape, etc.) rather than relying on textual descriptions. Each image is mapped to a feature vector, and retrieval is performed by comparing these vectors with a similarity measure (e.g., Euclidean distance, Earth Mover's Distance).
By focusing on the image content itself, CBIR can uncover relevant results even when textual labels are absent or incomplete. The following subsections introduce the key formulas for feature representation, discuss common distance metrics, and highlight both the advantages of CBIR (e.g., metadata independence) and its challenges (such as bridging the semantic gap).
2.3.2.1 Advanced CBIR Methodologies
I. Feature-Based Retrieval
Feature-based retrieval is a specialized subset of CBIR that extracts and compares mathematical feature descriptors instead of raw pixel-based properties (Swain & Ballard, 1991). Unlike general CBIR, which may rely on simple color histograms, feature-based retrieval employs advanced descriptors such as:
- Scale-Invariant Feature Transform (SIFT) – Detects stable key points for object matching (Lowe, 1999).
- Histogram of Oriented Gradients (HOG) – Useful in object detection and facial recognition (Dalal & Triggs, 2005).
Unique Contribution: Feature-based retrieval is more precise than traditional CBIR since it can detect robust object features regardless of lighting, rotation, or scaling differences (Lowe, 1999).
II. Semantic-Based Image Retrieval
Semantic-based image retrieval bridges the semantic gap in traditional CBIR by inferring high-level conceptual meaning rather than relying solely on low-level features like color histograms and texture patterns. Unlike classical CBIR, which matches images based primarily on pixel-derived properties, semantic-based retrieval employs:
• Deep Learning: Deep neural networks (e.g., ResNet, Vision Transformers) extract hierarchical features that capture objects, scenes, and contextual relationships (He et al., 2016).
• Ontological Knowledge: Domain-specific ontologies (e.g., WordNet, RadLex) enable query expansion by incorporating synonymous and related terms, ensuring broader and more precise retrieval (Miller, 1995).
• Cross-Modal Alignment: Models such as CLIP align visual and textual semantics, allowing natural language queries (e.g., "beach sunset") to retrieve contextually relevant images (Radford et al., 2021).
III. Reverse Image Search
Reverse image search is a specialized application of CBIR that focuses on instance-level retrieval—identifying near-duplicate or highly similar images to a query image. Unlike traditional CBIR systems, which often retrieve semantically related images (e.g., "beaches" for a "sunset" query), reverse image search prioritizes visual similarity at the pixel or feature level.
Key Differentiators:
• Focus on Near-Duplicates: Reverse image search targets near-identical matches (e.g., cropped, resized, or slightly modified versions of the query image). Example: detecting copyrighted images or identifying fake social media profiles using facial recognition.
• Technology: While traditional CBIR relies on handcrafted features (e.g., color histograms, texture descriptors), modern reverse image search systems use deep CNNs to generate high-dimensional embeddings that capture fine-grained visual patterns (Krizhevsky et al., 2012). Example: Google's Reverse Image Search employs CNN architectures like Inception-v3 (Szegedy et al., 2016) to encode images into feature vectors for similarity matching.
• Applications: Copyright enforcement (e.g., identifying unauthorized image use), plagiarism detection in academic/creative work, and fact-checking by tracing image origins (e.g., debunking misinformation).
IV.
Region-Based Image Retrieval
RBIR focuses on retrieving images based on specific localized regions rather than analyzing the entire image (Rui et al., 1999). It is used in medical imaging, for example, where a system searching for lung tumors focuses only on the lung region rather than matching full-body X-rays (J. Z. Wang et al., 2001). Unlike object detection, which localizes and labels objects (e.g., tumors), RBIR retrieves images with regions that are visually or semantically similar to a query region. For example, a radiologist can select a lung nodule in a CT scan, and RBIR will retrieve studies with analogous nodules, leveraging texture and shape features rather than object labels (Litjens et al., 2017).
V. Sketch-Based Image Retrieval
SBIR enables users to query image databases using freehand sketches, retrieving images that align with the geometric structure and spatial layout of the sketch. Unlike CBIR, which relies on low-level features like color and texture, SBIR prioritizes structural similarity (e.g., edges, contours, shapes), making it ideal for applications requiring abstract or conceptual matching (Eitz et al., 2012). Distinct from CBIR:
- Instead of relying on color or texture features, SBIR matches geometric properties.
VI. Relevance Feedback Mechanisms
Relevance feedback improves retrieval by iteratively refining search results based on user input (Torres & Reis, 2008). Example: a user searching for "wildlife photography" can mark relevant results, prompting the system to improve subsequent searches dynamically.
Distinct Contribution:
- Unlike fixed retrieval models, feedback-based retrieval adapts over time, improving personalized search.
VII. Hybrid Approaches
Hybrid retrieval combines textual, visual, and semantic features to optimize accuracy (Bose et al., 2015). Multimodal search engines use hybrid retrieval by integrating keyword search with CBIR-based visual filtering.
Distinct from Other Methods:
- Rather than relying on a single approach, hybrid retrieval balances multiple strategies, ensuring adaptability across different datasets.
2.3.2.2 Application Examples
There are several academically framed application examples that demonstrate the use of advanced CBIR methodologies:
• Application Examples of CBIR Feature-Based Retrieval
In forensic image analysis, advanced feature-based CBIR systems have proven indispensable for matching and retrieving near-duplicate images from extensive digital evidence databases. These systems rely on robust local feature descriptors that capture invariant image characteristics despite changes in scale, rotation, or illumination. For example, a forensic tool may employ the following techniques:
- Scale-Invariant Feature Transform (SIFT): SIFT extracts distinctive keypoints and computes high-dimensional descriptors that remain stable under affine transformations. In forensic applications, SIFT is used to identify and match local keypoints between a query image and database images, allowing investigators to retrieve evidence even if the images have been manipulated or captured under different conditions (Lowe, 1999; DOI: 10.1109/ICCV.1999.790410).
- Histogram of Oriented Gradients (HOG): Complementing SIFT, HOG captures the distribution of edge orientations within localized regions, thereby providing structural information about the objects in an image.
This descriptor is particularly useful in scenarios where the overall shape and texture are critical for verification, such as matching faces or specific objects in crime scene images (Dalal & Triggs, 2005; DOI: 10.1109/CVPR.2005.177).
- Geometric Verification: After extracting features using SIFT and HOG, the system typically applies geometric verification techniques—such as Random Sample Consensus (RANSAC)—to eliminate false matches. This step ensures that only spatially consistent correspondences contribute to the final retrieval decision.
In a practical forensic scenario, an analyst submits a query image suspected to be a modified copy of illicit content. The system first extracts SIFT keypoints and HOG descriptors from the query image and the entire database. A nearest-neighbor search identifies candidate matches based on descriptor similarity. Subsequently, RANSAC* is used to verify the spatial consistency of the matched features. The final ranking of candidate images is determined by the number of inlier matches and the quality of the geometric transformation between images. This robust approach significantly narrows down the search space and helps investigators pinpoint images that are highly similar, even under variations due to cropping, rotation, or scale.
*RANSAC (Random Sample Consensus) is a robust model-fitting algorithm that repeatedly selects random data subsets to estimate parameters, discarding outliers to achieve reliable results even under significant noise (Fischler & Bolles, 1981).
• Application of RBIR
RBIR plays a critical role in domains where the retrieval task requires localized analysis of image content rather than global image features. In medical imaging, for example, precise retrieval of pathological regions—such as tumors or lesions—from a large database of radiological scans is essential for accurate diagnosis and comparative analysis. A typical RBIR system in medical applications operates as follows:
- Advanced Segmentation: The system employs deep learning–based segmentation models, such as U-Net, to partition radiological images (e.g., CT or MRI scans) into meaningful regions. U-Net has been widely adopted due to its encoder–decoder architecture, which effectively captures both global context and fine-grained details (Ronneberger et al., 2015). This step isolates the regions of interest (ROIs) that may contain lesions or other abnormalities. An ROI is a specified subset of an image—often defined by coordinates or masks—that highlights the critical area for focused analysis or processing, such as detecting objects or measuring localized features.
- Region-Specific Feature Extraction: Once the regions are segmented, region-specific features are computed. These features typically include texture descriptors (e.g., Local Binary Patterns) and shape-based metrics that characterize the morphology of the detected region. This targeted feature extraction helps to capture the essential properties of the pathology, reducing the influence of irrelevant background information.
- Object Detection and Localization: In addition to segmentation, object detection models such as Faster R-CNN can be integrated to further refine the localization of pathological areas. Faster R-CNN generates bounding boxes that highlight the precise locations of lesions, thereby complementing the segmentation process (Ren et al., 2016; DOI: 10.1109/TPAMI.2016.2577031).
- Similarity Matching and Ranking: The extracted region-based features are then used to perform similarity matching across a database of segmented images. The system computes a similarity score between the query ROI and each candidate ROI in the database, ranking the images such that those with the most similar pathological features appear at the top of the retrieval results.
In a real-world application, a radiologist submits a query image that contains a segmented lesion. The RBIR system first isolates the lesion using U-Net segmentation, extracts texture and shape descriptors from the lesion area, and applies Faster R-CNN to verify the region's localization. The system then computes similarity scores based on these region-specific features and ranks the images accordingly. As a result, the top retrieved images display lesions that share similar morphological characteristics, thus assisting the radiologist in making more informed diagnostic decisions. In remote sensing, RBIR is similarly applied by segmenting satellite images to extract regions corresponding to specific land-cover types—such as deforested areas or urban expansions.
• Application of Semantic-Based Retrieval
In modern medical imaging, semantic-based retrieval systems have become essential for aiding clinicians in diagnosing and researching complex conditions. For example, consider a radiology retrieval system designed to assist in the diagnosis of early-stage lung cancer. In this application, the system leverages advanced CBIR methodologies that integrate deep learning, ontology-driven query expansion, and cross-modal alignment to overcome the limitations of traditional, low-level feature matching. Techniques and Workflow:
- Deep Feature Extraction: The system employs a deep CNN—for instance, a ResNet architecture (He et al., 2016)—pre-trained on large natural image datasets and fine-tuned on medical images. This network extracts high-level features that capture complex visual patterns such as tissue textures, shapes of lung nodules, and contextual anatomical structures.
- Ontology-Driven Query Expansion: To bridge the semantic gap, the system integrates a domain-specific ontology (e.g., RadLex or a customized medical ontology) that encodes relationships between diagnostic terms. When a clinician queries the system using a term such as "early-stage lung cancer," the ontology expands the query to include related concepts such as "nodule," "mass," and "lesion" (Miller, 1995). This ensures that the retrieval process considers a broader, more clinically relevant set of features.
- Cross-Modal Alignment: Advanced models like CLIP (Radford et al., 2021) are employed to align textual descriptions from radiology reports with visual features extracted from images. This cross-modal alignment allows the system to interpret natural language queries in the context of the visual data, enhancing the semantic understanding of the query.
- Similarity Measurement and Ranking: The deep features are compared using cosine similarity, which, after normalization, effectively measures the angular distance between the query and database feature vectors. Images with the highest similarity scores are ranked at the top, ensuring that the retrieved images are semantically and visually aligned with the diagnostic query.
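The similarity-measurement step described above reduces to a cosine comparison between L2-normalized feature vectors. The short Python sketch below illustrates this ranking; the random arrays merely stand in for deep CNN embeddings (e.g., 2048-dimensional ResNet features) of a query and a database.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for deep CNN embeddings (e.g., 2048-D ResNet features).
database = rng.normal(size=(1000, 2048)).astype(np.float32)   # indexed images
query = rng.normal(size=(2048,)).astype(np.float32)           # query image embedding

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)

database_n = l2_normalize(database, axis=1)
query_n = l2_normalize(query)

scores = database_n @ query_n            # cosine similarity to every database image
top_k = np.argsort(-scores)[:5]          # indices of the 5 most similar images

for rank, idx in enumerate(top_k, start=1):
    print(f"rank {rank}: image {idx}, cosine similarity {scores[idx]:.4f}")
```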
• Application Example of Reverse Image Search
Reverse image search is a specialized CBIR application designed to identify near-duplicate or highly similar images, a critical capability for enforcing copyright and detecting unauthorized image reuse. This approach leverages advanced CBIR methodologies to extract and compare robust visual features, even when images are modified by cropping, scaling, or color adjustments. Techniques and Workflow:
- Deep Feature Extraction: A state-of-the-art CNN such as Inception-v3 (Szegedy et al., 2016) or ResNet (He et al., 2016) is used to generate high-dimensional feature embeddings that capture fine-grained visual details and are robust to minor image variations. Additionally, neural codes as proposed by Babenko et al. (2014) can be extracted to represent image content in a compact form.
- Instance-Level Matching: The system computes similarity scores between the query image's embedding and those stored in the database using cosine similarity. For normalization and efficient matching, approximate nearest neighbor (ANN) search algorithms—such as those based on Hierarchical Navigable Small World (HNSW) graphs (Malkov & Yashunin, 2018)—are employed.
- Ranking: Images are ranked in descending order based on their cosine similarity scores, with the top $K$ matches displayed to the user. This ranking enables rapid identification of potential copyright infringements or unauthorized usage.
• Application Example of Hybrid/Multimodal Retrieval
Hybrid/multimodal retrieval systems integrate visual features with textual metadata to enhance product search in e-commerce platforms. In this application, a user may enter a natural language query (e.g., "red leather jacket with zipper") while the system simultaneously analyzes product images using a deep CNN. The retrieval pipeline fuses textual embeddings—generated by models such as BERT (Devlin et al., 2018)—with visual embeddings obtained from a state-of-the-art network (e.g., ResNet-50; He et al., 2016). An attention-based fusion module combines these complementary representations into a unified feature vector, which is then used to compute similarity scores via cosine similarity. The system ranks the products based on the joint relevance of visual appearance and descriptive text, enabling more accurate and context-aware recommendations. Key Techniques and Workflow:
- Textual Embedding: Text from product descriptions is processed using BERT to generate semantic embeddings that capture contextual information (Devlin et al., 2018).
- Visual Embedding: Images are encoded using a deep CNN, such as ResNet-50, to obtain robust visual features that capture fine-grained details (He et al., 2016).
- Fusion Strategy: An attention-based multimodal fusion mechanism—similar to approaches discussed by Ngiam et al. (2011) and further refined in recent works—integrates the textual and visual embeddings into a single, hybrid feature vector.
- Similarity Matching and Ranking: The fused representation is compared against a database of product embeddings using cosine similarity, and the top $K$ matches are returned, ensuring that the retrieved products closely align with both the visual style and descriptive attributes of the query.
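A minimal sketch of this fusion-and-ranking idea is given below, assuming pre-computed text and image embeddings. For brevity the attention-based fusion is simplified to a fixed weighted blend in a shared projection space, and the random vectors and projection matrices (W_text, W_image) are purely illustrative stand-ins for BERT and ResNet-50 outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# In a real pipeline the text vector would come from BERT and the image vector
# from ResNet-50; random values are used here purely for illustration.
TEXT_DIM, IMAGE_DIM, JOINT_DIM = 768, 2048, 256

# Hypothetical learned projections mapping both modalities into a joint space.
W_text = rng.normal(scale=0.02, size=(TEXT_DIM, JOINT_DIM))
W_image = rng.normal(scale=0.02, size=(IMAGE_DIM, JOINT_DIM))

def fuse(text_vec, image_vec, alpha=0.5):
    """Project each modality to the joint space and blend them.

    A full attention-based fusion would learn the mixing weights; a single
    scalar alpha is used here only to illustrate the idea.
    """
    t = text_vec @ W_text
    v = image_vec @ W_image
    fused = alpha * t + (1.0 - alpha) * v
    return fused / np.linalg.norm(fused)

# One query and a small catalogue of products (illustrative only).
query = fuse(rng.normal(size=TEXT_DIM), rng.normal(size=IMAGE_DIM))
catalogue = np.stack([
    fuse(rng.normal(size=TEXT_DIM), rng.normal(size=IMAGE_DIM)) for _ in range(100)
])

scores = catalogue @ query               # cosine similarity (vectors are unit length)
best = np.argsort(-scores)[:3]
print("top-3 products:", best.tolist(), scores[best].round(3).tolist())
```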
Chapter 3
Literature Survey
3.1. CBIR Evolution
Information Retrieval (IR) originated from early library and information management systems, emerging as a formal academic discipline in the mid-20th century with the advancement of electronic technologies. A landmark event was Vannevar Bush's 1945 conceptualization of the Memex, establishing fundamental principles for contemporary IR systems (Bush, 1945). Further pivotal developments, such as Gerard Salton's probabilistic retrieval models, significantly shaped modern retrieval methodologies (Salton, 1989).
With technological evolution, digital image databases have grown exponentially, impacting fields like medicine, art preservation, and geographic information systems. This rapid expansion underscored the inadequacies of traditional text-based retrieval methods, particularly their heavy reliance on manual annotation. Such reliance is problematic, especially in domains requiring accurate recognition of complex visual content, exemplified by the medical field, where accurately retrieving images depicting subtle anomalies or tumors is crucial.
CBIR, introduced in the 1990s with systems like QBIC and VisualSEEk, offered significant advances by indexing and retrieving images based on their intrinsic visual attributes, such as color, texture, and shape (Faloutsos et al., 1994; Smith & Chang, 1996). Despite these advancements, CBIR continues to face considerable limitations, such as insufficient retrieval accuracy, poor scalability to large image collections, and the persistent semantic gap—the disconnect between low-level image features and high-level semantic interpretation.
Recent developments leveraging deep learning, especially convolutional neural networks (CNNs), have substantially improved feature representation and retrieval accuracy. These networks automatically extract hierarchical features, significantly reducing manual intervention and enhancing semantic understanding (Krizhevsky et al., 2012; Doersch & Zisserman, 2017). However, deep learning methodologies still confront challenges, including high computational costs, scalability issues, and dependence on large, annotated datasets. Hybrid methods integrating handcrafted and deep-learned features have provided partial solutions, combining the advantages of classical image descriptors with deep learning's automated feature extraction, resulting in improved retrieval robustness (Wan et al., 2014; Babenko et al., 2014). Nonetheless, opportunities remain to further optimize accuracy, computational efficiency, and generalization, particularly through effectively leveraging partially labeled datasets. Addressing these identified gaps and limitations provides the central motivation for this research. Consequently, this thesis proposes the Self-Optimizing DeepRec Convolutional Neural Network (SO-DRCNN), an innovative framework integrating the Ternion Paradigm—comprising Histogram of Oriented Gradients (HOG), Inclusive Color Histogram (ICH), and Slanting Express Revolves Concise (SERC)—to minimize reliance on labeled data, enhance computational efficiency, and significantly improve the accuracy and scalability of CBIR systems.
Figure 2 - CBIR Evolution
3.2. The Paradigm Shift Toward Learned Features
CBIR applications have been built to use various image features depending on the application area. The practical CBIR application utilizes the elements of visible contents known as image features. The popular features used in CBIR are color distribution, texture assembly, and shape form. They can practically identify and relate to the contents of an image. These features are carefully designed and have proven (A. Kumar et al., 2021) to work efficiently in most CBIR applications. Such features are generally called handcrafted features.
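To ground the notion of handcrafted features, the following Python sketch computes two classical descriptors side by side: a per-channel color histogram and a Histogram of Oriented Gradients (HOG). A random image stands in for a database image, and the bin and cell sizes are illustrative choices rather than values used by any particular CBIR system.

```python
import numpy as np
from skimage.feature import hog
from skimage.color import rgb2gray

# A random RGB image stands in for a database image (illustrative only).
rng = np.random.default_rng(42)
image = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)

def color_histogram(img, bins=8):
    """Concatenated per-channel histograms, normalized to sum to 1."""
    channels = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
                for c in range(img.shape[-1])]
    hist = np.concatenate(channels).astype(np.float32)
    return hist / hist.sum()

def hog_descriptor(img):
    """Histogram of Oriented Gradients over the grayscale image."""
    return hog(rgb2gray(img), orientations=9,
               pixels_per_cell=(16, 16), cells_per_block=(2, 2),
               feature_vector=True)

# A simple handcrafted representation: color and shape/texture cues side by side.
feature_vector = np.concatenate([color_histogram(image), hog_descriptor(image)])
print("handcrafted feature length:", feature_vector.shape[0])
```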
Even though many different CBIR systems have been developed and put to productive use, these systems still struggle with the conceptual (semantic) gap and with identifying genuinely useful features. Conventional handcrafted descriptors perform well in image retrieval, yet they show significant shortcomings. These challenges must be carefully examined to develop a better-performing system. The main shortcomings of handcrafted features are the following:
- The semantic gap remains the main problem notwithstanding the progress made in CBIR. There is still a degree of disconnect (Srivastava et al., 2023) between the attributes a system computes (such as texture and colour distribution) and people's cognitive perceptions of objects and situations.
- Handcrafted features are inefficient and not adaptable, which makes it difficult to develop and deploy a new CBIR system. A wide range of handcrafted features is available, and the chosen features significantly impact retrieval results. System developers and end users require comprehensive studies to determine the most appropriate attributes. Selecting appropriate characteristics requires a good understanding of the domain in which CBIR is being employed, and the selected features must actually improve the system's overall performance. Finally, depending on the content and nature of the images, a set of attributes that performs effectively in a particular field may not produce satisfactory outcomes in another.
As a result, a requirement exists to develop characteristics that do not require prior knowledge of the application domain. The system should automatically generate or learn these characteristics based on the input data. Furthermore, the system should outperform the standard CBIR system, which uses handcrafted features. To overcome these issues in image retrieval, an increasing number of CBIR approaches have been introduced and are being investigated. Machine learning approaches allow (Radha et al., 2021) systems to derive substantial insights from incoming data. Systems of this kind will be able to recognize similarities and execute autonomous decisions without human involvement. Machine learning techniques are widely used in various fields, including medicine, protection, networking websites, data science, and aerospace. In addition, artificial intelligence is widely used in traditional image-processing activities, including categorizing and recognizing objects and segments. Machine learning algorithms can help deal with the shortcomings of handcrafted characteristics in image retrieval.
3.2.1. Image Similarity Measures Used in CBIR
By comparing the feature vectors of each image, we can determine how similar two images are. Diverging images have a larger difference value than comparable images. Various metrics for similarity have been proposed in numerous image retrieval systems. To be effective, a similarity measure must meet several criteria (Salih & Abdulla, 2021):
- Local Consistency: Following the triangular inequality in a neighborhood.
- Computational Effectiveness: The ability to work in real-time and on a large scale.
- Handling Characteristics as Matrices or Non-Vector Interpretations: Features are organized into a matrix format where each element represents a specific attribute or relationship in the image. This structure allows for more complex and nuanced representations of image features. - Modelling approaches: supervised (monitored), semi-supervised, and unsupervised (unregulated) techniques. Supervised modelling uses labelled data to train models for high accuracy, semi-supervised combines small labelled with large unlabeled datasets for efficient learning, and unsupervised identifies patterns solely from unlabeled data, enabling the discovery of hidden structures. Each approach offers distinct advantages and challenges tailored to the specific needs of image retrieval tasks. - Calculating Commonalities: Across linear space or nonlinear manifolds. Linear space methods use straightforward algebraic techniques for simplicity and efficiency, while nonlinear manifold approaches capture more complex relationships and structures within the data, providing a more nuanced similarity measure. Each method is chosen based on the specific characteristics and requirements of the image retrieval task. - Significance of Image Portions: Considering the importance of different image parts in similarity calculations. - Stochastic, Fuzzy, or Consistent Measures: Using different mathematical approaches for similarity measures. To achieve tailored image searches, visual comparison metrics need to consider subjectivity more seriously. Besides terminology, concepts like aesthetics and individual preferences for content and style may also be included. Research is ongoing to extend 68 the idea of unpredictable image topologies to cover the entire range of natural visuals and enable customization. In brief, different distance measures vary in their input type, computation method, computational difficulty, and metricity. The specific program and the feature vectors created determine which distance metric is employed. By considering various factors and incorporating subjectivity, future research aims to improve the effectiveness of image similarity metrics, making them more practical and user-centered. 3.2.2. Important Points Descriptors In CBIR Frameworks Key points descriptors, such as regions, objects of interest, edges, or corners, have become invaluable in CBIR and various computer vision applications. These descriptors offer robustness and invariance to scale and rotation, providing significant advantages over traditional global features. One of the most popular key point descriptors in recent years is the Scale-Invariant Feature Transform (SIFT). SIFT is well-known for its ability to match different views of objects or scenes, making it a key tool in many CBIR systems (Kapoor et al., 2021). However, SIFT's high dimensionality can slow down feature computation, especially when combined with techniques like Principal Component Analysis (PCA-SIFT) and Gradient Location and Orientation Histogram (GLOH). To overcome these problems, Speeded Up Robust Features (SURF) was developed as a faster and more robust alternative. SURF keeps many of SIFT's advantages but improves computational efficiency. Another method, the Bag-of-Words (BoW) model, uses keypoint-based descriptors like SIFT to create a visual vocabulary. While BoW is accurate, it can be computationally demanding and memory-intensive, making it less suitable for large image collections (Patil & Kumar, 2013). 
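As a sketch of how a BoW visual vocabulary can be built and applied, the following Python example clusters local descriptors into visual words with k-means and represents each image as a normalized word histogram. The randomly generated descriptor sets are stand-ins for real SIFT or ORB descriptors, and the vocabulary size K = 64 is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Stand-ins for local keypoint descriptors (in practice, SIFT/ORB descriptors
# extracted per image); each image yields a variable number of 128-D vectors.
image_descriptors = [rng.normal(size=(rng.integers(80, 120), 128)) for _ in range(20)]

# 1) Build the visual vocabulary by clustering all descriptors into K "visual words".
K = 64
all_descriptors = np.vstack(image_descriptors)
vocabulary = KMeans(n_clusters=K, n_init=4, random_state=0).fit(all_descriptors)

def bow_histogram(descriptors):
    """Quantize each local descriptor to its nearest visual word and count occurrences."""
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=K).astype(np.float32)
    return hist / hist.sum()              # normalize so image size does not matter

# 2) Represent every image as a K-bin histogram of visual words.
bow_vectors = np.stack([bow_histogram(d) for d in image_descriptors])
print("BoW representation shape:", bow_vectors.shape)   # (num_images, K)
```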
69 The Fisher Vector (FV) method, based on a Gaussian Mixture Model (GMM), offers a more informative image representation by encoding higher-order statistics. This approach leads to better performance compared to BoW. As a non-probabilistic alternative, the Vector of Locally Aggregated Descriptors (VLAD) aggregates residuals associated with each codeword to represent images, providing a simpler yet effective method (Patil & Kumar, 2017). Local Binary Patterns (LBP), introduced for texture classification, have shown great utility in image retrieval. Variations like Local Ternary Patterns (LTP) and Center Symmetric Local Binary Patterns (CSLBP) further enhance performance. Local Tetra Patterns (LTrPs) and newer techniques such as Local Mesh Pattern (LMeP) and Local Ternary Co-occurrence (LTCoP) continue to push the boundaries in CBIR system design (G.-H. Liu & Yang, 2013). Additionally, methods like Histograms of Oriented Gradients (HOG) and Compressed Histogram of Gradients (CHoG) provide robust feature descriptions. The Differential Between Pixels of Scans Pattern (DBPSP) focuses on pixel differences within scanning patterns for texture features. ORB (Oriented FAST and Rotated BRIEF), BRISK (Binary Robust Invariant Scalable Keypoints), and FREAK (Fast Retina Keypoint) are newer binary descriptors inspired by the human visual system, offering efficient alternatives to SIFT and SURF. The Nested Shape Descriptor (NSD) is a recent development that outperforms SIFT in binary form (Kapoor et al., 2021). In conclusion, the evolution of key point descriptors has significantly enhanced CBIR systems' performance, making them more robust, scalable, and efficient in various applications. Their robustness and invariance to scale and rotation offer significant advantages over traditional global features. 1- Popular Key Point Descriptors: 70 o SIFT (Scale-Invariant Feature Transform): o Reliable for matching different perspectives of objects or scenes. o Widely used in CBIR systems. o High dimensionality can be a drawback, leading to slower feature computation. o SURF (Speeded Up Robust Features): o Faster and more resilient to picture alterations than SIFT. o Bag-of-Words (BoW) Model: o Utilizes keypoint-based descriptors like SIFT to create a visual vocabulary. o Computationally intensive and memory-heavy, less scalable for large image collections. 2- Advanced Feature Representations: o Fisher Vector (FV): o Based on a Gaussian Mixture Model (GMM). o Encodes higher-order statistics for better performance. o Vector of Locally Aggregated Descriptors (VLAD): o Aggregates residuals associated with each codeword to represent images. o Enhancements include intra-normalization, residual normalization, and localized location. 3- Local Binary Patterns (LBP): o Used for texture classification and image retrieval. o Variations like LTP, CSLBP, and LTrPs enhance performance. 4- Histograms and Gradients: 71 o Histograms of Oriented Gradients (HOG): Locally normalized descriptions for robust feature descriptions. o Compressed Histogram of Gradients (CHoG): Reduced bit-rate characterization. 5- Binary Descriptors: o ORB (Oriented FAST and Rotated BRIEF): Efficient alternative to SIFT and SURF. o BRISK (Binary Robust Invariant Scalable Keypoints):Novel key point descriptor. o FREAK (Fast Retina Keypoint): Inspired by the Human Visual System. o Nested Shape Descriptor (NSD): Outperforms SIFT in binary form on the VGG-Affine test. 3.2.3. 
Distance Metric Utilized In CBIR System In CBIR, accurately measuring the similarity or dissimilarity between images is crucial for effective retrieval. This process, following feature extraction, hinges on the use of distance metrics that can capture perceptual similarity accurately. Traditional distance metrics like Manhattan Distance (MD), Euclidean Distance (ED), and Vector Cosine Angle Distance (VCAD) are commonly used but often fall short in reflecting human perception accurately (Chugh et al., 2021). The Minkowski distance, despite its popularity, also struggles with perceptual accuracy. Advanced metrics such as Kullback-Leibler Divergence (KLD) and Earth Mover's Distance (EMD) provide a more nuanced approach. EMD, based on the transportation problem, has proven effective across various applications, including color, contour matching, texture, melodies, and visual tracking. These advanced metrics offer better perceptual distance representation, but their benefits are maximized only with efficient storage and query processing. 72 Detailed assessments have shown that EMD, among others, excels in picture similarity searches. However, to leverage its high-quality retrieval capabilities, efficient storage and query processing are essential. CBIR systems often use low-level features like color, texture, shape, and corners to approximate the perceptual representation of an image. Yet, these features alone are insufficient for capturing the full semantic relationships within an image. Innovative learning techniques are being explored to overcome these limitations. For instance, Bian and Tao introduced the Biased Discriminative Euclidean Embedding (BDEE), enhancing relevance feedback mechanisms by integrating relevant and irrelevant data. Similarly, Biased Maximum Margin Analysis (BMMA) and Semisupervised BMMA (Semi-BMMA) incorporate feedback and unlabeled data, refining the image retrieval process (Ghrabat et al., 2019). Despite these advancements, active learning techniques remain crucial, especially with insufficient training examples. To address long-term feedback challenges, strategies like Case-Based Long Term Learning (CB-LTL) are proposed, focusing on capturing user preferences over time. Additionally, graph-based re-ranking methods, such as the random walker algorithm, offer innovative solutions for ranking images based on user-labeled data. These methods calculate ranking scores by determining the likelihood of a random walker reaching a relevant seed node before an irrelevant one, thus improving retrieval accuracy. 3.2.3.3 CBIR with Relevance Feedback Since there is currently no dependable framework for modeling high-level image semantics that is unaffected by perceptual subjectivity, particular to the case, query interpretations can be understood by looking at user input. RF is a query adjustment method that aims to extract semantic information particular to the user and the query, then adjusts the results accordingly. This CBIR approach requires a significant amount 73 of user interaction along with input on relevance. Systems that are founded on subjective requests from users cannot coexist with a completely automated, unsupervised system; relevance feedback offers a middle ground. The primary challenge in establishing such a paradigm is the increasing user interaction, given the highly diversified user population. Additionally, there's the matter of how well the input can be improved. 
Although users would prefer fewer feedback sessions (Veselý, 2023), there is a tension with the amount of feedback that is necessary for the system to understand the needs of the user. Beyond the weaker assumption of a fixed search goal, a problem that has been widely overlooked in feedback-based CBIR research is the possibility that the user's demands change as the assessment phase progresses.
3.2.3.4. Color-Based Features in CBIR
Color is one of the most accessible visual elements in digital images, typically displayed as color components or planes. Extracting color-based features involves three main steps: selecting the color space (Garg & Dhiman, 2021), quantizing the color space, and extracting the color features (Muthukkumar & Seenivasagam, 2022).
Key Techniques in Color-Based Feature Extraction:
1. Color Histograms:
- Conventional Color Histogram (CCH): Represents the frequency of each color in an image. It is straightforward but may lack robustness against variations in lighting and quantization errors.
- Fuzzy Color Histogram (FCH): Uses a fuzzy-set membership function to record each pixel's color similarity to all histogram bins (Giannoulakis et al., 2023). FCH is more resistant to lighting variations and quantization errors, but determining the proper fuzzy membership function can be computationally challenging.
2. Color Correlogram (CC):
- Describes how the spatial correlation of color pairs changes with distance. It is indexed by color pairs, indicating the likelihood of finding a pixel of color "j" at a distance "d" from a pixel of color "i" (Alyosef, 2023). CC captures both local and global spatial data, making it effective for coarse-grain color images. However, it has a high computational cost due to the quadratic increase in image dimensions.
3. Color Averages and Block Truncation Coding (BTC):
- Color Averages: These methods aggregate an image's color information into feature vectors such as row mean (RM), column mean (CM), forward diagonal mean (FDM), and backward diagonal mean (BDM) (Warburg et al., 2021). These reduced-dimension vectors facilitate more efficient image retrieval.
- Block Truncation Coding (BTC): This technique divides an image into non-overlapping square segments and applies color averaging. It has been expanded to various color spaces, showing that luminance-chrominance spaces like Kekre's LUV provide superior retrieval performance compared to non-chrominance spaces (Kekre et al., 2010).
Research has demonstrated the effectiveness of different color-based techniques in CBIR:
• Comparative Performance:
- Shen et al. (2018): Found that 8-color CC outperforms 64-color CCH, highlighting the importance of considering spatial information in color feature extraction.
- Amitha et al. (2021): Suggested new feature vectors based on image partitioning combined with color averaging approaches, noting the superiority of techniques like FDM in performance.
Future directions: continued research in color-based CBIR involves exploring hybrid approaches that combine color features with other visual features to enhance retrieval performance. Additionally, refining existing techniques to reduce computational costs while maintaining robustness and accuracy remains a key focus.
3.2.3.5. Image Retrieval using Transformed Image Content
Image transforms are essential for altering the representation of an image by projecting it into a collection of basis functions, commonly referred to as basis images.
This transformation shifts the image representation from one domain (e.g., time domain) to another domain (e.g., frequency domain), without altering the intrinsic information in the image (Zhuang et al., 2022). There are two primary benefits to this transformation: I. Separation of Visual Patterns: Image transforms effectively separate critical elements of visual patterns, making them directly accessible for analysis (C. Hu, 2021). II. Efficient Storage and Transmission: Transforming visual data into a more compact format facilitates efficient storage and transmission.(Datta et al., 2008) These benefits make image transforms a vital tool for feature vector size reduction in image retrieval systems. Various CBIR techniques exploit these properties of image transformations, including fractional energy, row mean of columns converted image, energy compaction, and Principal Component Analysis (PCA). Key Techniques and Findings a. Fractional Energy: Fifteen fractional parameter types, encompassing seven image transforms, are considered in CBIR when utilizing the fractional energy of the modified image (Jardim et al., 2022). 76 The Kekre transform with 6.25% fractional coefficients has been found to perform the best. Fractional energy-based CBIR outperforms approaches using the entire transformed image as the feature vector across all considered image transformations (P. He et al., 2022). b. Independent Cosine Transformation: Among the seven visual transformations considered, the independent cosine transformation with a DC element has provided the best results for image retrieval using the row mean of columns converted image material (Prasomphan & Pinngoen, 2021). c. Energy Compaction: In CBIR with energy compaction in the transform domain, compressed energy for converted color averages performs better with a significantly smaller feature vector size. The 94% energy Kekre transform has outperformed other methods in cases involving row mean, column mean, and row-column mean combinations (Oyewole, 2021). d. Discrete Sine Transform: The discrete sine transform provides superior image retrieval in both forward and backward diagonal means. e. Principal Component Analysis (PCA): When PCA is applied to color averages, image retrieval performance is somewhat reduced compared to when PCA is applied to the entire dataset (Feng et al., 2022). However, combining PCA with other CBIR approaches has demonstrated significant reductions in computational complexity while maintaining adequate retrieval performance. In conclusion, Image transforms are crucial in CBIR for enhancing the accessibility and efficiency of image feature extraction. Techniques leveraging fractional energy, independent cosine transformation, energy compaction, and PCA are pivotal in 77 improving image retrieval performance while effectively managing computational demands. 3.2.3.6. Image Acquisition Employing Textured Information: Texture is a crucial aspect of human vision, instrumental in distinguishing between various regions within an image. Unlike color and shape features, texture features are adept at capturing both the macrostructure and microstructure of images by indicating the distribution of shapes (Zhuang et al., 2022). Texture is typically identified as a spatial pattern with certain homogeneity-related features. Methods for Extracting Texture Features • To obtain texture information from images, several directional feature extraction techniques are employed: I. 
3.2.3.6. Image Acquisition Employing Textured Information:

Texture is a crucial aspect of human vision, instrumental in distinguishing between various regions within an image. Unlike color and shape features, texture features are adept at capturing both the macrostructure and microstructure of images by indicating the distribution of shapes (Zhuang et al., 2022). Texture is typically identified as a spatial pattern with certain homogeneity-related features.

Methods for Extracting Texture Features

• To obtain texture information from images, several directional feature extraction techniques are employed:

I. Steerable Pyramid: Produces a multi-scale, multi-directional representation of an image, consisting of one decimated low-pass sub-band and several un-decimated directional sub-bands. The decomposition is iterated at the low-pass sub-band (Gayathri & Mahesh, 2022). Because the directional sub-bands are un-decimated, this method yields a representation with 4K/3 times as many coefficients as the original image.

II. Contourlet Transform: Decomposes an image into multiple scales and directions by combining a directional filter bank (DFB) with a Laplacian pyramid. The DFB processes the band-pass images from the Laplacian pyramid to obtain directional data. Because the directional sub-bands are decimated, the redundancy ratio is less than 4/3 (Sain, 2023).

III. Gabor Wavelet Transform: Utilizes a bank of Gabor filters, obtained by dilating and rotating the Gabor functions to produce a filter bank with K orientations and S scales. The image is then convolved with each Gabor function, providing detailed texture retrieval outcomes (Tena et al., 2021). However, this method leads to a highly redundant representation of the original image.

• Texture Representation Techniques

Three main types of texture representation techniques are used to develop image retrieval strategies based on texture content:

1. Statistical Techniques:
- Employ non-deterministic features to analyze the spatial distribution of grayscale values. First-order statistics consider individual pixel values, while second-order and higher-order statistics account for spatial interactions between pixels. The co-occurrence matrix is a commonly used method for second-order statistical texture analysis (Rayavaram, 2023); a minimal sketch of this approach is given at the end of this subsection.

2. Model-Based Techniques:
- Represent an image using models such as the Markov model and the fractal model, which describe textures as combinations of fundamental functions or probability models. These techniques are useful for texture analysis, discrimination, and representing natural textures with statistical roughness and self-similarity.

3. Transform-Based Techniques:
- Aim to find a compact, lower-dimensional representation of texture features by transforming the image into a space where most of the data energy is concentrated in a few coefficients. Examples include the Fourier, Gabor, Curvelet, and wavelet transforms. These methods enhance feature extraction efficiency by eliminating unnecessary coefficients (Yang et al., 2024).

Applications and Advancements

Texture features have significant applications in domains such as aerial imagery and medical imaging. They offer valuable insights into the surface granularity and recurring patterns within an image, making them crucial for domain-specific image retrieval tasks (Imbriaco, 2024). Recent advancements include the development of texture thesauruses for aerial image retrieval (Uddin Molla, 2021) and affine-invariant texture feature extraction techniques for texture recognition (Kanwal et al., 2021).

In summary, texture features are vital in CBIR for providing a detailed and nuanced understanding of image content. By leveraging statistical, model-based, and transform-based techniques, researchers can develop robust image retrieval systems that effectively utilize texture information.
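As a concrete example of the second-order statistical approach, the following sketch computes gray-level co-occurrence matrix (GLCM) features with scikit-image. The distances, angles, and chosen properties are illustrative defaults, not parameters taken from the cited work, and the input image is a random placeholder.

import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image (>= 0.19 naming)

def glcm_texture_features(gray_uint8, distances=(1, 2), angles=(0, np.pi / 2)):
    # Build the co-occurrence matrix of gray-level pairs at the given offsets,
    # then summarize it with a few standard second-order texture properties.
    glcm = graycomatrix(gray_uint8, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    props = ["contrast", "homogeneity", "energy", "correlation"]
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

if __name__ == "__main__":
    texture = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in image
    print(glcm_texture_features(texture).shape)  # (16,): 4 properties x 2 distances x 2 angles

In practice the resulting vector would be computed per image (or per image block) and compared with a distance measure, in the same way as the color and transform-domain descriptors discussed earlier.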
3.2.3.7. Image Retrieval using Shape Content

Shape representation is a critical element in image discrimination and serves as an effective feature vector for image retrieval. There are two primary approaches to shape representation: region-based and boundary-based.

1. Boundary-Based Shape Representation: This method utilizes the external boundaries of objects. Gradient operators and morphological procedures are typically used to extract a shape's border from an image. Gradient operations produce the image's first-order derivatives, enabling the identification of boundaries in horizontal, vertical, or diagonal directions (Majhi et al., 2022). A gradient-magnitude approach with gradient operators is then applied to obtain the entire border of the shape in the image as connected edges.

2. Shape Representations and Design Commonalities: Shape representations benefit greatly from an effective and reliable depiction, especially in segmented image regions. The representation of shapes often involves geometric representations paired with one another (Chan, 2021). Over time, there has been a shift from global shape descriptions toward more local, regional descriptors. Simplifying the contour through discrete curve evolution helps eliminate unimportant or noisy shape elements (V. Kumar et al., 2022).

3. Shape Context and Shape Matching: A shape descriptor called shape context has been proposed for similarity matching. It is compact yet resilient to several geometric transformations. For effective shape matching and shape-based image retrieval, curves are represented by segments or tokens whose feature representations (curvature and orientation) are grouped into a metric tree (Liao et al., 2024). Dynamic programming (DP) methods are also used for shape matching, where shapes are represented as sequences of concave and convex segments.

4. Fourier Descriptors for Shape Matching:
- An accurate shape-matching strategy using Fourier descriptors exploits both amplitude and phase, together with the dynamic time warping (DTW) distance rather than the Euclidean distance. This approach retains rotation and starting-point invariance by adding compensating terms to the original phase, enhancing shape discrimination (Aboali et al., 2023).

5. Edge Detection Methods:
- Several edge detection techniques are employed for shape-content-based image retrieval. These include the Sobel mask with the slope magnitude method (Sobel-SMEI), the Robert mask (Robert-SMEI), the Prewitt mask (Prewitt-SMEI), Canny operators (Canny-SMEI), morphological operations (Perez, 2021), the top-hat transformation (Top-Hat-EI), and the bottom-hat transformation (Bot-Hat-EI). The edge images obtained from these methods are used as feature vectors in CBIR (a minimal edge-image and Fourier-descriptor sketch follows this subsection).

6. Block Truncation Coding (BTC):
- BTC applied to edge image content is used in novel image retrieval algorithms. Shape-edge images processed with BTC have shown outstanding accuracy. The row average of column-transformed edge images further improves performance, with Kekre transforms combined with Robert slope-amplitude edge images providing excellent effectiveness (Zhang, 2021).

7. Walsh Texture Pattern Image Retrieval Techniques:
- Recently proposed shape Walsh texture pattern image retrieval techniques, combined with augmentation of even image parts, demonstrate exceptional performance. Applying Walsh texture patterns to the original plus even image sections with Robert slope-amplitude boundary images produces significant results.

It can be seen that shape features play a vital role in CBIR, offering detailed and robust representations for effective image retrieval. By leveraging various shape representation techniques, including boundary-based methods, shape context, Fourier descriptors, and edge detection methods, researchers can enhance the accuracy and efficiency of CBIR systems.
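The sketch below illustrates two of the ingredients discussed above: a Sobel slope-magnitude edge image and a simple Fourier descriptor of an object boundary. For brevity it uses amplitude-only Fourier coefficients, unlike the amplitude-plus-phase DTW variant cited above; the file path, threshold, and coefficient count are illustrative assumptions.

import numpy as np
import cv2  # OpenCV, assumed available

def sobel_edge_magnitude(gray):
    # Slope-magnitude edge image (Sobel-SMEI style): combine horizontal and
    # vertical first-order derivatives into a single gradient-magnitude map.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return np.hypot(gx, gy)

def fourier_descriptor(contour, n_coeffs=16):
    # Amplitude-only Fourier descriptor of a closed boundary: discarding phase
    # gives rotation and starting-point invariance at some cost in discrimination.
    pts = contour.reshape(-1, 2).astype(np.float64)
    z = pts[:, 0] + 1j * pts[:, 1]            # boundary as a complex signal
    spectrum = np.fft.fft(z - z.mean())       # subtract mean to remove translation
    mags = np.abs(spectrum)[1:n_coeffs + 1]
    return mags / (mags[0] + 1e-12)           # scale normalization by the first harmonic

if __name__ == "__main__":
    gray = cv2.imread("shape.png", cv2.IMREAD_GRAYSCALE)          # hypothetical path
    edges = (sobel_edge_magnitude(gray) > 50).astype(np.uint8)    # illustrative threshold
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    fd = fourier_descriptor(max(contours, key=cv2.contourArea))   # descriptor of the largest shape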
3.2.3.8. Metric Learning:

In practical applications, learning a good distance metric over the feature space is essential. Every problem has its own semantic notion of similarity, which common metrics (such as the Euclidean distance) frequently fail to capture. The fundamental principle of metric learning is to learn a metric that assigns a small distance to pairs of examples that are semantically similar and a larger distance to dissimilar pairs. Distance Metric Learning (DML), a subfield of machine learning, seeks to extract such distance information from data. Beyond its applications in dimensionality reduction, distance metric learning can be used to enhance similarity learning algorithms in practice.

[Figure: taxonomy of distance metrics. Recoverable labels: distance metric; fixed metric (Euclidean); learning metric; distance characteristics: linear, non-linear; learning algorithm: supervised, semi-supervised, un-supervised; discrete, numeric, continuous.]
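A minimal sketch of the DML principle, small distances for similar pairs and larger distances for dissimilar ones, is given below using PyTorch, a learned linear projection, and a standard contrastive loss. The dimensions, margin, and randomly generated pairs are placeholders for illustration; this is a generic example of supervised metric learning, not the Siamese fusion network developed later in this thesis.

import torch
import torch.nn as nn

class LinearMetric(nn.Module):
    # Learned linear (Mahalanobis-style) metric: d(x, y) = ||W x - W y||_2.
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, y):
        return torch.norm(self.proj(x) - self.proj(y), dim=1)

def contrastive_loss(dist, same, margin=1.0):
    # Pull semantically similar pairs together (same = 1) and push dissimilar
    # pairs (same = 0) at least `margin` apart.
    return torch.mean(same * dist.pow(2) +
                      (1 - same) * torch.clamp(margin - dist, min=0).pow(2))

if __name__ == "__main__":
    metric = LinearMetric(in_dim=128, out_dim=32)      # placeholder dimensions
    opt = torch.optim.Adam(metric.parameters(), lr=1e-3)
    x, y = torch.randn(64, 128), torch.randn(64, 128)  # random feature pairs
    same = torch.randint(0, 2, (64,)).float()           # random similarity labels
    loss = contrastive_loss(metric(x, y), same)
    loss.backward()
    opt.step()

After training, the learned projection defines the task-specific distance used for retrieval; with a non-linear network in place of the linear layer, the same loss underlies Siamese-style similarity learning.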