Digital Image Forgery Detection and Localization using the Innovated U-Net

ABSTRACT
A reliable image copy-move forgery detection approach that adapts to different scenarios of tampering with color images is crucial for many applications. Various methods have been proposed, but they are still subject to false positive/negative detections and cannot handle the full variety of copy-move forgeries. In this paper, a machine learning model that combines the ResNet50 and U-Net architectures for automatic forgery detection in color images is presented. The proposed system uses the ResNet50 architecture as an encoder and the U-Net architecture as a decoder. The encoder applies convolution and normalization for feature extraction, whereas the decoder locates the spatial features. The decoder in the U-Net network comprises multiple decoder blocks, which are connected to the corresponding encoder blocks through concatenation layers. A binary mask is then produced to represent the tampered regions in the image. Quantitative experimental results on two standard public datasets and a comparison with state-of-the-art methods demonstrate the effectiveness and robustness of the proposed model.


INTRODUCTION
With the development of digital image processing software packages and other editing tools, image forgery has become simple and popular [1]. Image forgery now covers many kinds of tampering with and modification of the visual contents of images, often done so well that the changes are unnoticeable to casual observers. An image whose visual contents have been altered is called a "forged" image [2]. In many instances, the purpose of this manipulation is to influence the attention and opinions of the recipient. As the world becomes more dependent on digital images as a source of information, the need to verify the authenticity (i.e., originality) and dependability of these images increases. Consequently, researchers and specialists are collaborating to develop computer-based systems that detect such forgeries automatically [3] [1]. Using forged images for malicious purposes may have hazardous consequences for society, because images are used in several application sectors, such as politics, forensic investigations, journalism, business, arts, and medical imaging.

Generally, image tampering can be classified into two categories: tampering with innocent intent and tampering with noninnocent intent. The first category is employed to enhance images and/or eliminate distortions while preserving the semantic content of the image. This category includes contrast enhancement, color enhancement, blurring, retouching, and red-eye correction. For example, it is widely used in fashion photographs, beauty care photos, tourism, business, and marketing. The second category involves malicious intentions and/or criminal activities. Based on the type of content tampering, this type of manipulation has been classified into five categories: image splicing, copy-move forgery, geometric transformation forgery, text editing, and deep fake forgery [2]. Consequently, we focus on the detection of manipulated images, for which various methods offer efficient solutions.

Solutions for detecting image forgery can be broadly classified into two main categories: passive and active techniques. In active methods, some type of authentication data is embedded in the source image before distribution. The authentication data can subsequently be used during a forensic examination to confirm whether the image has been altered. One constraint of this technique is its reliance on either specialized cameras or subsequent image processing procedures. Examples of active techniques commonly employed in digital image security include digital image cryptography, digital signatures, and embedding a watermark into the original image prior to its use. Passive techniques are the dominant methods for forgery detection. They determine whether an image has been tampered with even in the absence of any pre-existing authentication data, such as a signature or watermark. Hence, passive forgery detection is regarded as a more appealing approach [4] [5].

Copy-move is considered the most common method of image forgery. This technique replaces one or more image fragments with fragments from the same image to produce a forged image. The main purpose of copy-move forgery is either to conceal objects or to generate duplicates of a certain object. Copy-move forgery detection (CMFD) is a challenging task. The technical complications of automatic CMFD can be related to the following issues [6]:

• Presence or absence of structural components: Tampering can alter the image's contents, causing some features to become partially or completely obscured.
• Geometric transformations: Object appearance can be changed by geometric transformations, such as rotation, scaling, and translation.
• Image retouching: When the image is captured, lighting and camera attributes can affect the visual representation of objects. Generally, image retouching may change the appearance of objects and consequently affect the feature-set extraction. This situation may lead to many false negative and/or false positive results.
• Intensity and color adjustment: These processes also affect the output of the feature extraction step and consequently may cause many false negative and/or false positive results.

In this paper, a novel method for identifying and locating manipulated regions in digital images using a machine learning-based semantic segmentation approach is presented. The proposed system uses the ResNet50 model as an encoder and the U-Net architecture as a decoder [7]. The encoder applies convolution and normalization for feature extraction, whereas the decoder locates the spatial features. The decoder in the U-Net network comprises multiple decoder blocks, which are connected to the corresponding encoder blocks through concatenation layers. A binary mask is then generated to denote the manipulated regions in the image. Figure 1 shows the general architecture of the proposed model for automatic forgery detection.

The subsequent sections of this research study are structured as follows: In Section II, a brief overview of the recent related works is provided. Subsequently, the preprocessing and feature extraction procedures are delineated. In Section III, the machine learning techniques employed for predicting forged regions are discussed. In Section IV, the results are discussed.

RELATED WORKS
Generally, CMFD techniques can be classified into three primary categories: block-based, key point-based, and machine learning-based [8]. The block-based technique initially divides the input image into overlapping or nonoverlapping blocks. Then, feature-sets are extracted for each block. A matching phase determines similar blocks based on their feature-sets [6]. The key point-based approach extracts local features from the whole image and represents them as a set of descriptors. The descriptors are then compared to identify forged regions [9]. Deep learning algorithms depend on convolutional neural network (CNN) models, which have a high capacity for extracting meaningful information from input images and other digital data [8].

Parveen et al. utilized a block-based method for CMFD [10]. The suggested approach comprises a series of steps: the color image is converted to grayscale, the grayscale image is divided into overlapping 8 × 8 blocks, the DCT is used to extract features, and finally feature matching is conducted via the radix sort method. Yang et al. introduced a key point-based approach using the SIFT technique [11]. A distribution strategy was devised to ensure the even placement of key-points within an image. An enhanced SIFT descriptor was developed to describe key-points precisely in the CMFD scenario, and the key-points were matched using the agglomerative hierarchical clustering (AHC) algorithm. Elaskily proposed a model for CMFD based on deep learning [8]. Specifically, a CNN model was proposed to generate a representation of categorized descriptors. After the CNN was trained, the system could classify images to identify instances of copy-move forgery.
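The block-based pipeline described above (overlapping blocks, per-block features, sort-based matching) can be illustrated with a minimal sketch. It is not the method of [10]: raw pixel values stand in for DCT coefficients, Python's built-in lexicographic sort stands in for radix sort, and only exact duplicates are matched. The function name `find_duplicate_blocks` and the `min_offset` parameter are hypothetical.

```python
def find_duplicate_blocks(gray, block=8, min_offset=8):
    """Return pairs of top-left (row, col) positions whose blocks are identical.

    gray is a grayscale image given as a list of rows of integer intensities.
    """
    height, width = len(gray), len(gray[0])
    features = []
    for y in range(height - block + 1):
        for x in range(width - block + 1):
            # Feature vector: the raw pixels of the block (a stand-in for DCT coefficients).
            patch = tuple(gray[y + dy][x + dx]
                          for dy in range(block) for dx in range(block))
            features.append((patch, (y, x)))

    # Lexicographic sorting places identical feature vectors next to each other.
    features.sort()

    matches = []
    for (feat_a, pos_a), (feat_b, pos_b) in zip(features, features[1:]):
        if feat_a != feat_b:
            continue
        dy, dx = pos_b[0] - pos_a[0], pos_b[1] - pos_a[1]
        # Skip close neighbours, which match trivially in flat regions.
        if max(abs(dy), abs(dx)) >= min_offset:
            matches.append((pos_a, pos_b))
    return matches


# Toy 16 x 16 image with unique intensities, then an 8 x 8 copy-move forgery.
image = [[y * 16 + x for x in range(16)] for y in range(16)]
for dy in range(8):
    for dx in range(8):
        image[8 + dy][8 + dx] = image[dy][dx]

print(find_duplicate_blocks(image))  # [((0, 0), (8, 8))]
```

Exact matching breaks down as soon as the copied region is rotated, scaled, compressed, or retouched, which is why practical block-based detectors rely on more robust features and why learning-based approaches are attractive.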

PROPOSED SYSTEM
Generally, CNNs are a sophisticated form of artificial neural network that uses convolutional kernels for pattern recognition and image processing tasks. In this paper, the proposed model uses the ResNet50 model as an encoder within the U-Net architecture. The encoder applies a set of operations, such as convolution and normalization, to extract features. Conversely, the decoder locates the spatial features by combining two inputs: one from the preceding layer of the decoder and the other from the symmetrical residual stage output of the encoder.
The proposed model includes three basic stages:

A. Preprocessing
Enhancing the quality of the dataset and the corresponding ground truth masks is highly important for training. This stage comprises the following steps:
• Splitting, resizing, and labeling: The images are divided into three distinct sets, namely, 80% for training, 10% for validation, and 10% for testing. Subsequently, the images and ground truth masks are resized to 256 × 256 so that all inputs to the model have uniform dimensions. Each manipulated image is paired with its corresponding mask by means of the shared part of the file name.
• Normalization: The pixel values in the "image" and "mask" arrays are normalized, which is a widely accepted practice in image processing and deep learning. Normalization scales the pixel values to the range 0 to 1. Assuming an 8-bit representation, pixel values initially lie in the range 0 to 255 for each channel, so normalization is achieved by dividing these values by 255.
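The preprocessing steps above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, the shuffle seed is arbitrary, and nearest-neighbor interpolation is an assumption because the text does not state which resizing method was used.

```python
import random


def split_dataset(names, seed=0):
    """Shuffle sample names and split them 80% / 10% / 10%."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_train = len(names) * 8 // 10
    n_val = len(names) // 10
    train = names[:n_train]
    val = names[n_train:n_train + n_val]
    test = names[n_train + n_val:]
    return train, val, test


def resize_nearest(pixels, out_h=256, out_w=256):
    """Resize a 2-D list of pixels with nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [[pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def normalize(pixels):
    """Scale 8-bit values from the range 0..255 to the range 0..1."""
    return [[value / 255.0 for value in row] for row in pixels]


train, val, test = split_dataset([f"img_{i:03d}" for i in range(100)])
print(len(train), len(val), len(test))  # 80 10 10
print(normalize([[0, 255]]))            # [[0.0, 1.0]]
```

Nearest-neighbor sampling has the side benefit of keeping a binary mask binary after resizing, whereas smoother interpolation would require re-thresholding the mask.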

B. Architecture of the Innovated U-Net
The proposed model is partitioned into two distinct components: the encoder and the decoder.
• Encoder: ResNet50 is an appropriate encoder because of the inherent advantages of ResNet [7]. First, ResNet50 enhances feature extraction. Second, reducing the number of parameters makes the system more effective. Third, it offers skip connections, which allow the model to preserve information from earlier layers. Fourth, combining ResNet50 with other models shows excellent results, because one model can overcome the weaknesses of the other. The proposed model accepts an image with dimensions of 256 × 256 × 3 as input. The initial block of the encoder performs convolution with a kernel size of 7 × 7 and a stride of 2, followed by max-pooling with a stride of 2. Subsequently, four consecutive residual stages, namely, Res1, Res2, Res3, and Res4, are applied sequentially. Figure 2 shows the general architecture of ResNet50.
• Decoder: Generally, classic CNNs make no assumptions about the spatial correlations between the extracted features or the spatial relations among pixels. In this work, the U-Net network architecture is used as a decoder. The U-Net is a powerful CNN model that can capture detailed features and their spatial coherence with neighboring pixels, which makes it highly suitable for image segmentation applications [19]. The main idea is to take the whole image as an input and produce a full binary image as an output. The decoder comprises multiple decoder blocks, which are connected to the corresponding encoder blocks through concatenation layers. Each decoder block upsamples the feature maps passed to it by the preceding block. This upsampling involves a convolution with a kernel size of 3 × 3 followed by batch normalization, and it is repeated four times. The four upSample blocks are aligned with Res4, Res3, Res2, and Res1 of the encoder and produce feature maps with dimensions of (16×16×256), (32×32×128), (64×64×64), and (128×128×32), respectively.

To predict the manipulated regions, the decoder network concludes with a pixel-wise sigmoid classification function. The term P(TC) represents the likelihood of the two classes, namely, 0: forged and 1: original, and is determined by the sigmoid function. Finally, binary masks are generated to denote the manipulated regions in the image.
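As a sanity check on the dimensions quoted above, the following sketch traces feature-map shapes through the encoder and decoder and converts sigmoid outputs into a binary mask. It rests on assumptions: Res1 to Res4 are taken to be the four residual stages of a standard ResNet50, the decoder channel counts are copied from the text, and the 0.5 threshold is a common default that the text does not state. All function names are hypothetical.

```python
import math


def encoder_shapes(size=256):
    """(name, height, width, channels) after each stage of a standard ResNet50."""
    stages = [("conv7x7/2", 2, 64), ("maxpool/2", 2, 64), ("Res1", 1, 256),
              ("Res2", 2, 512), ("Res3", 2, 1024), ("Res4", 2, 2048)]
    shapes = []
    for name, stride, channels in stages:
        size //= stride  # each strided stage halves the spatial resolution
        shapes.append((name, size, size, channels))
    return shapes


def decoder_shapes(start_size=8, channels=(256, 128, 64, 32)):
    """(height, width, channels) after each of the four upSample blocks."""
    shapes = []
    size = start_size
    for ch in channels:
        size *= 2  # each block doubles the spatial resolution
        shapes.append((size, size, ch))
    return shapes


def sigmoid_mask(logits, threshold=0.5):
    """Pixel-wise sigmoid followed by thresholding (threshold is an assumption).

    Following the class convention in the text, 0 marks forged pixels
    and 1 marks original pixels.
    """
    return [[1 if 1.0 / (1.0 + math.exp(-z)) >= threshold else 0 for z in row]
            for row in logits]


print(encoder_shapes()[-1])           # ('Res4', 8, 8, 2048)
print(decoder_shapes())               # [(16, 16, 256), (32, 32, 128), (64, 64, 64), (128, 128, 32)]
print(sigmoid_mask([[-2.0, 3.0]]))    # [[0, 1]]
```

Under these assumptions the last listed decoder output is 128 × 128, so a further upsampling step would be needed to reach the 256 × 256 input resolution; the text does not describe that step.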

PROBLEM OF IMBALANCED CLASSES
Generally, researchers in machine learning have to deal with imbalanced datasets. This problem arises when the number of samples in one class (i.e., pixels in the genuine regions) greatly exceeds the number in the other classes (i.e., pixels in the forged regions), resulting in inadequate pixel classification [21]. In practice, traditional classifiers are likely to be biased toward the class with many samples and to ignore the class with few samples [21][22]. To tackle this issue, the probability of the forged and genuine regions is calculated from the statistical information obtained from the ground truth samples. The class weights are inversely related to the frequency at which each class occurs: a higher frequency of occurrence leads to a lower weight. The weight for each class is determined by the following [23]:

$$w_j = \frac{N}{k \, n_j}$$

where $N$ represents the total number of samples, $k$ is the number of classes, $n_j$ is the number of samples for class $j$, and $w_j$ is the weight allocated to class $j$. The strategy is to supply the class weights during training and optimization, thereby modifying the effect of each class on the overall loss. This approach enables the model to assign greater significance to under-represented classes, thus effectively tackling the problem of imbalanced classes.
Each image in the CoMoFoD dataset is coupled with its corresponding ground-truth mask, which accurately outlines the forged regions. Five different categories of tampering are applied to the images: translation, rotation, scaling, combination, and distortion. Several postprocessing methods are used to modify the forged and original images, such as JPEG compression, blurring, noise addition, and color reduction. For the CASIA dataset, each image is also coupled with its corresponding ground-truth mask.

PERFORMANCE MEASURES
Many measures are used to evaluate the performance of the proposed model, including the sensitivity, the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the F1-score, the Matthews correlation coefficient (MCC), and the Jaccard similarity index or intersection over union (IoU) [26][27]:
• Sensitivity is the ratio of true positives to the total number of true positives and false negatives. Sensitivity is calculated as follows [26]:
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
• The ROC curve is another measure used to assess the performance of different models. A curve close to the 45° diagonal of the ROC space indicates a less accurate classifier. In general, the closer the curve is to the top-left corner, the more accurate the classifier [26].
• The F1-score is the harmonic mean of recall and precision, so it takes both false negatives and false positives into account. The F1-score is more informative than accuracy, especially when dealing with imbalanced classes [26]:
$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
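The pixel-level measures listed above can all be derived from the four confusion-matrix counts. The sketch below uses the standard textbook definitions of sensitivity, precision, F1-score, MCC, and IoU. Which label counts as "positive" is an assumption exposed as a parameter, because the text does not state it, and the function name is hypothetical.

```python
import math


def segmentation_metrics(y_true, y_pred, positive=1):
    """Compute pixel-level metrics from two flat sequences of binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    sensitivity = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    iou = ratio(tp, tp + fp + fn)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"sensitivity": sensitivity, "precision": precision,
            "f1": f1, "iou": iou, "mcc": mcc}


truth = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(segmentation_metrics(truth, pred)["iou"])  # 0.5
```

For this toy example the counts are TP = 2, FN = 1, FP = 1, and TN = 4, giving a sensitivity, precision, and F1-score of 2/3, an IoU of 0.5, and an MCC of 7/15.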

RESULTS
The experimental results reveal the excellent performance of the proposed U-Net model in detecting and locating copy-move forged regions in images. The performance metrics of the proposed model trained on the CoMoFoD and CASIA2 datasets are summarized in Tables 2 and 3, respectively. For quantitative comparison with other works, Tables 4 and 5 provide a comprehensive overview of key performance metrics on CoMoFoD and CASIA2, respectively. Moreover, the model achieves a recall rate of 99.71%, implying that it captures nearly all of the actual forged regions in the dataset and thereby minimizes the occurrence of false negatives.
The visual representation of the model's output is highly important for qualitative evaluation, as shown in Figures 6 and 7. The predicted CMFD output masks are highly similar to the ground truth masks. This finding implies that the proposed U-Net model exhibits robust performance in detecting and localizing manipulated regions in images.

CONCLUSIONS
In this paper, an innovated machine learning-based model is proposed for the automatic detection of image forgery. The qualitative performance of this model in locating forged regions is excellent: the predicted forged regions closely resemble the ground truth masks. For the quantitative evaluation, several metrics are used, such as accuracy, F1-score, AUC, MCC, and IoU. The accuracy and effectiveness in identifying forged areas emphasize the potential of deep learning approaches to improve forgery detection methods and to reduce the need for human intervention. The proposed model is tested and evaluated on two different datasets, CoMoFoD and CASIA2, which offer various copy-move scenarios. The experimental results reveal that the model can detect forged images with high accuracy, reaching up to 92.74%.

Fig. 1. General Architecture of the Proposed Model for Automatic Forgery Detection

Yao et al. presented a model based on deep learning for forgery detection in videos [12]. The proposed model uses CNNs for feature extraction. To reduce temporal redundancy between video frames, the frames undergo preprocessing in three stages, which include a frame absolute difference layer. Additionally, data augmentation techniques are applied to prepare image patches for the training phase. Wu et al. presented a deep learning model called BusterNet for CMFD [13]. BusterNet comprises two CNN architectures followed by a fusion model. This model identifies potentially manipulated regions by exploiting feature similarities. According to the authors, BusterNet surpasses existing state-of-the-art models in terms of performance. Liu employed multiscale convolution to produce forgery probability maps and combined it with segmentation to obtain the final tampered maps [14]. Bi et al. introduced the ringed residual U-Net (RRU-Net) for splicing forgery detection [15]. The RRU-Net made better use of contextual spatial details and effectively resolved the issue of gradient degradation in the detection of splicing forgery. Zhu et al. proposed an end-to-end neural network called AR-Net [16]. The network is based on adaptive attention and residual refinement, and it aims to enhance the representation of features by fusing position and channel attention features. Additionally, deep matching is employed to calculate the self-correlation between feature maps, and the atrous spatial pyramid pooling (ASPP) technique combines the scaled correlation maps to produce the mask. Finally, the mask is refined through the residual refinement module, which preserves the boundary structure of objects. Kumar et al. employed unsupervised domain adaptation to learn the

TABLE II. RESULTS OF THE COMOFOD DATASET

The U-Net model exhibited promising efficacy in detecting and locating copy-move forgery within digital images. This model achieved notable levels of accuracy during the training and testing phases on the CoMoFoD and CASIA2 datasets. Specifically, on the CoMoFoD dataset, the model attained an accuracy of approximately 97.83% during training and a similar accuracy of 98.05% during validation, which decreased to 91.83% during testing. On the CASIA2 dataset, the model achieved a slightly lower accuracy of around 93.72% during training, which decreased to 93.32% and 92.74% during validation and testing, respectively.

TABLE IV. COMPARISON OF OUR U-NET WITH OTHER MODELS IN THE LITERATURE ON COMOFOD DATASET

TABLE V. COMPARISON OF OUR U-NET WITH OTHER MODELS IN THE LITERATURE ON CASIA2 DATASET