ENSEMBLE MACHINE LEARNING APPROACH FOR IOT INTRUSION DETECTION SYSTEMS

The rapid growth and development of the Internet of Things (IoT) have had an important impact on various industries, including smart cities, healthcare, automotive, and logistics tracking. However, the benefits of the IoT come with security concerns that are becoming increasingly prevalent. This issue is being addressed by developing intelligent network intrusion detection systems (NIDS) that use machine learning (ML) techniques to detect constantly changing network threats and patterns. Ensemble ML represents a recent direction in the ML field. This research proposes a new anomaly-based solution for IoT networks utilizing ensemble ML algorithms, including logistic regression, naive Bayes, decision trees, extra trees, random forests, and gradient boosting. The algorithms were tested on three different intrusion detection datasets. The ensemble ML method achieved an accuracy of 98.52% on the UNSW-NB15 dataset, 88.41% on the IoTID20 dataset, and 91.03% on the BoTNeTIoT-L01-v2 dataset.


I. INTRODUCTION
Discovering emerging and unknown attacks requires an approach that can detect Internet of Things (IoT) intrusion; machine learning (ML) possesses this ability [1]. The rapid growth of cyberattacks has created a need for intrusion detection in the IoT security architecture. The security field faces serious challenges as technology and the IoT develop. Current security methods do not provide adequate protection; hence, cyberattacks are increasing [2]. An ML-based intrusion detection system (IDS) has been proposed for use on the IoT. The proposed model can be trained on different sources from large and classified datasets. This model can work effectively after being trained on smaller-sized data and classifying them in the target domain [3]. Another IoT IDS has been proposed using ML and enhanced transient search optimization. That system uses an enhanced transient search optimization algorithm to optimize the hyperparameters of the ML model. The results of that work show that the recommended system outperforms other IDSs in terms of accuracy and false alarm rate [4]. This work uses ensemble ML methods to detect intrusion in IoT networks. This article is organized as follows: Section 2 presents the related work, Section 3 presents the IoT intrusion detection system, Section 4 introduces ensemble ML, Section 5 provides the classifiers, Section 6 presents the proposed method, Sections 7 and 8 detail the experimental results, and Section 9 concludes this paper.

II. RELATED WORK
In this section, some previous works in the field of IoT IDS are reviewed. In [5], clusters of features from the UNSW-NB15 dataset, including features based on message queue telemetry transport (MQTT) and on TCP, were used with several ML methods (artificial neural networks (NN), support vector machines, and random forests (RF)). The best features in the groups were obtained, giving high accuracy and less time for the ML algorithms. The RF-based configurations, including binary classification and the use of RF on flow and MQTT features, achieved accuracies of 97.37%, 98.67%, and 97.54%, respectively.
In [6], four algorithms (naive Bayes (NB), RF, J48, and ZeroR) were utilized to categorize cyberattacks on the UNSW-NB15 dataset. Two groups were created from the UNSW-NB15 dataset using K-means and expectation maximization clustering techniques, depending on whether a record represents an attack or regular network traffic. Following this clustering, correlation-based feature selection was used to create a subset of features. The techniques are useful for research on intrusion detection in widespread networks. The results demonstrate that the RF and J48 algorithms achieved accuracies of 97.59% and 93.78%, respectively.
In [7], NN, logistic regression (LR), NB, decision tree (DT), SGD, and RF classifiers were evaluated empirically and tested using the UNSW-NB15 dataset. Accuracy indicates a correlation between classifiers. The RF classifier outperformed the other methods, having an accuracy of 95.43%.
In [8], a system called MidSiot was proposed for use on the IoT. It consists of several stages, including identifying and classifying attacks and real network traffic, and achieved an average accuracy of 99.68%.
In [9], an IDS called Pearson correlation coefficient-convolutional neural networks (PCC-CNN) was established as a deep learning model. Intrusion detection was performed by collecting features, detecting changes, and extracting linear operations. Attacks were detected using a binary classifier on three sets of data, achieving accuracies of 98%, 99%, and 98% on the three datasets.
In [10], a modified IDS was proposed based on ML, and the RF algorithm was applied to the input features. After the nominal features were removed, the IoTID20 dataset contained 79 features. The accuracy of the proposed model was 96.5%. The categorical values were converted into numeric values because the inputs of all algorithms must be numeric. Most researchers used binary classification. In this paper, multi-class classification with 9 or 10 categories is used.

III. IOT INTRUSION DETECTION SYSTEM
The intrusion detection process involves monitoring and analyzing the events in a computer system or network for indicators of intrusions (attempts to undermine the confidentiality, integrity, or availability of a computer system or network). Attackers who access systems over the Internet, authorized users who try to gain unauthorized access rights, and authorized users who abuse their powers are all sources of intrusion. This monitoring and analysis process is automated by software or hardware solutions.
Intrusion detection enables organizations to defend their systems against risks brought on by growing network connections and dependence on information systems. Given the severity and type of contemporary network security threats, the question for security professionals is no longer whether to utilize intrusion detection but which intrusion detection features and capabilities to deploy. IDSs are now widely recognized as crucial to any organization's security architecture. Even though IDSs have been shown to improve system security, many organizations still need justification to purchase an IDS [11].
A security system for an IoT environment needs to be created while considering security precautions. Data-oriented security mechanisms must be prioritized to stop hostile users from gaining unauthorized access to data sources. Focusing on data integrity and confidentiality is crucial because doing so significantly lowers the major security dangers in an IoT context. Conventional security procedures, which are designed using cryptographic techniques, are not often used in IoT environments because of the huge amount of data. Network problems are lessened if threats are discovered quickly, but conventional security models take more time to evaluate such a large volume of data to identify the risks. A malicious user needs only brief unauthorized access to obtain sensitive information, and changing that information might significantly harm the legitimate user. By blocking access from unauthorized users, an IDS identifies intruders and safeguards the network and data. A central IDS that monitors the network and distant nodes and detects intrusions might be employed to decrease this complexity. As a result, the network administrator receives a notification to take action on the security vulnerabilities [12].
Three phases make up the IDS's functionality. The first phase is monitoring, which is based on network or host sensors. The second phase is analysis, which involves feature extraction and pattern recognition. The last phase is detection, which involves finding network anomalies or intrusions. An IDS aids in quickly detecting vulnerabilities and in monitoring and analyzing data, services, and networks, including traffic analysis via efficient network management. It enhances data and network confidentiality and integrity while defending the network against threats. An IDS compiles and examines the system's data stream to find any malicious or dangerous activity. Traditional IDS design lacks real-time security for high-volume data streams and primarily focuses on providing security for Internet management features.
The IDS operates primarily in the network layer of the IoT system [11]. At the network layer, an IoT NIDS monitors the Internet data transferred between the network's devices. It also serves as a second line of defense to detect threats from unauthorized users and protect the network from them [12].
Typically, an IDS consists of sensors, which collect the data to be analyzed by IDS tools. These tools report abnormal activities such as attacks or unauthorized access. An intrusion can be defined as any assault that compromises the availability, confidentiality, or integrity of information. An IoT system's IDS should be able to analyze data packets and respond in real time at different IoT network levels utilizing different protocol stacks and adjust to different threats [13].

IV. ENSEMBLE MACHINE LEARNING
Ensemble approaches combine many algorithms instead of relying on just one ML classification algorithm. The model's accuracy is enhanced by using this method. Ensemble approaches are supervised learning algorithms. Different training algorithms benefit from ensemble approaches, which increase the training accuracy to raise the testing accuracy. The ensemble approach may use different training algorithms to provide flexible training [14].

V. CLASSIFIERS
ML is a subtype of artificial intelligence that enables computers to learn and make decisions without human input and without being explicitly programmed. The fundamental objective of ML is to create computer software that can access data and use it for learning procedures.
Several kinds of ML exist [15]. Six ML methods (both linear and nonlinear) were extensively utilized for IDS data classification. Therefore, the background of ensemble ML and the six methods (DT, RF, NB, LR, GB, and extra trees) should be understood so they can be utilized for intrusion detection.

A. Decision Tree
The DT is a supervised learning technique that can handle both classification and regression problems. It is a tree-structured classifier in which the interior nodes represent the dataset's features and each leaf node represents a classification outcome. A DT comprises two types of nodes: decision nodes and leaf nodes. Decision nodes are used to make decisions and contain multiple branches, whereas leaf nodes indicate the results of those decisions and have no further branches. Each question in a DT has two possible answers, "yes" or "no," which enables the creation of branches. The tree can be split into smaller subtrees (Figure 1) [16].
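The yes/no structure described above can be illustrated with a minimal sketch. The tree below is hand-built, and its feature names (`packet_rate`, `failed_logins`), thresholds, and labels are hypothetical values chosen for illustration; they are not taken from the datasets used in this work.

```python
# Hypothetical hand-built decision tree. Decision nodes test one feature;
# leaf nodes only hold a class label. All names and thresholds are
# illustrative assumptions.
TREE = {
    "feature": "packet_rate",
    "threshold": 1000.0,
    "yes": {"label": "attack"},
    "no": {
        "feature": "failed_logins",
        "threshold": 5.0,
        "yes": {"label": "attack"},
        "no": {"label": "normal"},
    },
}


def predict(node, sample):
    """Walk from the root to a leaf, answering one yes/no question per node."""
    while "label" not in node:
        answer = "yes" if sample[node["feature"]] > node["threshold"] else "no"
        node = node[answer]
    return node["label"]


print(predict(TREE, {"packet_rate": 50.0, "failed_logins": 9.0}))
print(predict(TREE, {"packet_rate": 50.0, "failed_logins": 1.0}))
```

In practice the questions are learned from training data rather than written by hand; the sketch only shows how a trained tree is traversed.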

B. Random Forest
The RF classifier is made up of many DT classifiers, each built using a random vector sampled independently from the input vector. To classify an input vector, each tree casts a unit vote for the dominant class. Individual DTs may not perform well on their own, but together they provide the foundation for the ensemble to work better. The Gini index, which measures an attribute's impurity with respect to the classes, is used as the attribute selection metric. Each tree is grown to its maximum depth on a fresh sample of the training data using a combination of features. These fully grown trees are not pruned. This ability is one of the RF classifier's main benefits over other DT approaches (Figure 2) [17] [18].
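As an illustration of the attribute selection metric mentioned above, the following sketch computes the Gini impurity of a set of class labels and the size-weighted impurity of a candidate split. The function names and the toy labels are assumptions made for this example.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gini_of_split(left_labels, right_labels):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini_impurity(left_labels) + (
        len(right_labels) / n
    ) * gini_impurity(right_labels)


mixed = ["attack", "attack", "normal", "normal"]
print(gini_impurity(mixed))                      # 0.5 for two balanced classes
print(gini_of_split(mixed[:2], mixed[2:]))       # 0.0 for a perfect split
```

A split with lower weighted impurity separates the classes better, so a tree-growing procedure would prefer it.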

C. Naive Bayes
The Bayes theorem is the foundation of NB classifiers. It is based on conditional probability, which refers to the chance that an event (A) will occur given that another event (B) has already occurred. Essentially, the theorem permits a hypothesis to be revised whenever new data are presented. NB is a simple and effective predictive modeling technique. The model can directly extract two types of probabilities from the training data: the probability of each class and the conditional probability of each x value given each class. The Bayes theorem may then be used to make predictions for new data using the probability model, as shown in Eq. (2) [19].
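In its standard form, the Bayes theorem states that P(A|B) = P(B|A) P(A) / P(B). The sketch below applies it to a two-class case; the prior and likelihood values are made-up numbers for illustration only, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Return P(class | x) from P(class) and P(x | class) via the Bayes theorem.

    priors:      {class: P(class)}
    likelihoods: {class: P(x | class)} for one observed x
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: joint[c] / evidence for c in joint}


# Illustrative numbers: attacks are rare, but the observed x is far more
# likely under the attack class than under the normal class.
post = posterior(
    priors={"normal": 0.9, "attack": 0.1},
    likelihoods={"normal": 0.05, "attack": 0.6},
)
print(post)
print(max(post, key=post.get))
```

A full NB classifier additionally assumes that the features are conditionally independent given the class, so the likelihood of a feature vector is the product of per-feature likelihoods.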

D. Logistic Regression
LR is used to predict a binary or categorical outcome (1 or 0, yes or no, true or false) given a collection of independent variables. LR can be viewed as a special case of linear regression in which the outcome variable is categorical and the log of the odds is used as the dependent variable (Figure 3) [20], [18], [21].
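The relationship between the linear model and the log of the odds can be sketched as follows. The weights, bias, and input values are arbitrary assumptions; in a trained model they would be estimated from data.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability (numerically stable form)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    """P(y = 1 | x) for a logistic regression model."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)


def log_odds(p):
    """Log of the odds p / (1 - p); inverse of the sigmoid."""
    return math.log(p / (1.0 - p))


# Arbitrary illustrative parameters: the linear score is 0.5 + 2*1 - 1*3 = -0.5.
p = predict_proba(weights=[2.0, -1.0], bias=0.5, x=[1.0, 3.0])
print(p)
print(log_odds(p))  # recovers the linear score, up to rounding
```

The log of the odds of the predicted probability equals the linear combination of the inputs, which is the sense in which LR is linear.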

E. Gradient Boosting
Gradient-boosted machines (GBMs) are popular ML algorithms that are widely used in many different sectors and are among the most effective methods in Kaggle competitions. While RF constructs an ensemble of deep, independent trees, GBMs construct an ensemble of shallow, weak, successive trees, with each tree learning from and improving upon the previous ones. These numerous weak successive trees come together to form a potent "committee" that is frequently competitive with other algorithms [22].
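The idea of successive weak trees, each correcting the previous ones, can be sketched with one-split trees (stumps) on one-dimensional data. This is a simplified illustration that assumes a squared-error loss, for which the quantity each stump fits is simply the current residual; library implementations for classification typically use a different loss. All function names and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold on 1-D inputs that minimizes squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the maximum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return t, lm, rm


def stump_predict(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def fit_gbm(xs, ys, n_trees=20, learning_rate=0.5):
    """Fit stumps one after another, each on the residuals of the ensemble so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def gbm_predict(model, x):
    base, stumps, lr = model
    return base + lr * sum(stump_predict(s, x) for s in stumps)


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = fit_gbm(xs, ys)
print(gbm_predict(model, 2.0), gbm_predict(model, 5.0))
```

The learning rate shrinks each stump's contribution, so the ensemble approaches the targets gradually over many rounds instead of in one step.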

F. Extra Tree
Extra trees and RF differ primarily in two ways. First, unlike RF, extra trees do not create the training subset for each tree using the tree bagging step; all DTs in the ensemble are trained using the whole training set. Second, extra trees randomly choose the feature and its corresponding split value during the node-splitting stage. As a result of these two variations, the trees are less prone to overfitting and have improved performance [23].
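The random node-splitting step can be sketched as below. This is a simplified reading of the description above in which one feature and one threshold are drawn purely at random; published extra-trees variants may draw several random candidates and keep the best one. The helper names and the toy rows are assumptions.

```python
import random


def random_split(rows, rng):
    """Pick a random feature and a random threshold within its observed range."""
    n_features = len(rows[0])
    feature = rng.randrange(n_features)
    values = [row[feature] for row in rows]
    threshold = rng.uniform(min(values), max(values))
    return feature, threshold


def partition(rows, feature, threshold):
    """Send each row left or right of the chosen threshold."""
    left = [r for r in rows if r[feature] <= threshold]
    right = [r for r in rows if r[feature] > threshold]
    return left, right


# The whole training set is used for every tree (no bootstrap sample).
rows = [[0.1, 10.0], [0.4, 30.0], [0.9, 20.0], [0.7, 50.0]]
rng = random.Random(0)  # fixed seed only to make the sketch repeatable
feature, threshold = random_split(rows, rng)
left, right = partition(rows, feature, threshold)
print(feature, threshold, len(left), len(right))
```

Because no search over thresholds is performed, each split is cheap to compute, and the added randomness decorrelates the trees in the ensemble.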

VI. PROPOSED METHOD
This research used three datasets: UNSW-NB15, IoTID-20, and BotNetIoT. Six types of ML architectures were tested to determine their effectiveness on these datasets. Before the models were trained, the data underwent preprocessing. Subsequently, two of the datasets, namely, UNSW-NB15 and BotNetIoT, were split into training and testing sets in a 70:30 ratio, while the IoTID-20 dataset was split into training and testing sets in an 80:20 ratio. The training data were then fed into the ML algorithms, which included LR, NB, DT, extra trees, RF, and gradient boosting. Finally, the strongest results were combined by voting using the ensemble method. The effectiveness of the trained models was evaluated using the test data, as presented in Figure 4.
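The 70:30 and 80:20 splits can be sketched as follows. Whether the records were shuffled or stratified before splitting is not stated, so the shuffling and the seed here are assumptions, and `train_test_split` is a hypothetical helper written for this example.

```python
import random


def train_test_split(rows, test_fraction, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for testing."""
    rng = random.Random(seed)  # assumed seed, for repeatability only
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for dataset rows

train, test = train_test_split(records, test_fraction=0.30)  # 70:30 split
print(len(train), len(test))

train, test = train_test_split(records, test_fraction=0.20)  # 80:20 split
print(len(train), len(test))
```

For imbalanced multi-class data, a stratified split that preserves class proportions in both sets is a common alternative.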

A. Datasets
This paper used three IoT intrusion detection datasets. First, the UNSW-NB15 [24] dataset is a labeled network traffic dataset that contains more than two million records of network traffic captured from a realistic network environment, including benign and malicious attributes. The dataset includes 49 network features extracted from each flow and labels that indicate whether the traffic is malicious or benign, making it a useful resource for evaluating the effectiveness of intrusion detection methods for IoT networks. Second, the IoTID-20 [25], [26] dataset is a publicly available labeled dataset that was specifically designed for IoT intrusion detection research. It contains network traffic data collected from a real-world IoT environment with 20 different types of IoT devices. The dataset includes benign and malicious attributes, with a total of 15 attack scenarios generated by using various network attacks, such as brute-force attacks, DoS attacks, and malware infections. The IoTID-20 dataset is useful for evaluating the effectiveness of various IDS and ML algorithms in detecting IoT-specific attacks. Third, the Malicious BotNet dataset (BotNetIoT) consists of data files collected during the detection of IoT botnet attacks on a cybersecurity system. This dataset is publicly available on Kaggle [27]. Table I shows the attack types in each dataset.
To create the BotNetIoT dataset, researchers used Wireshark software to capture network traffic data from nine IoT devices in a local network. The data were collected in packet capture (PCAP) file format, which is commonly used for network analysis. The PCAP file contains data packets from the network, including 23 statistical features for the central switch in the network.
The data in the BotNetIoT dataset include benign and malicious traffic, with the malicious traffic generated by various IoT-specific attacks, such as botnets and infiltration attacks. The dataset is useful for evaluating the effectiveness of IDS in detecting IoT-specific attacks and assessing network health. It is also useful for training and testing ML algorithms for IoT intrusion detection. Table II shows the specification of the three datasets.

B. Data Preprocessing
1) Data Cleaning: In this preprocessing step, the features that were not useful in the prediction process and had only one value were deleted. Moreover, rows that contain duplicate data were identified and deleted.
2) Handling Missing Values: The dataset has some missing values, which were substituted with the value of 0.
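The cleaning and missing-value steps can be sketched in plain Python as follows. The column names and rows are invented for illustration, `None` is assumed to mark a missing value, and the order of operations follows the order in which the steps are listed above.

```python
def clean(rows, columns):
    """Drop single-valued columns, drop duplicate rows, then fill missing with 0.

    rows: list of dicts keyed by column name; None marks a missing value.
    """
    # 1) Keep only columns that take more than one distinct value.
    kept = [c for c in columns if len({r[c] for r in rows}) > 1]

    # 2) Remove duplicate rows, preserving first occurrences.
    seen, unique = set(), []
    for r in rows:
        key = tuple(r[c] for c in kept)
        if key not in seen:
            seen.add(key)
            unique.append({c: r[c] for c in kept})

    # 3) Substitute 0 for missing values.
    filled = [{c: (0 if v is None else v) for c, v in r.items()} for r in unique]
    return filled, kept


# Hypothetical toy records; "flag" is constant and the first two rows are duplicates.
rows = [
    {"dur": 1.0, "proto": "tcp", "flag": 1, "bytes": 10},
    {"dur": 1.0, "proto": "tcp", "flag": 1, "bytes": 10},
    {"dur": None, "proto": "udp", "flag": 1, "bytes": 20},
]
cleaned, kept = clean(rows, ["dur", "proto", "flag", "bytes"])
print(kept)
print(cleaned)
```

Categorical columns such as the hypothetical `proto` would still need to be encoded as numbers before being passed to the classifiers.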
3) Normalization: Feature normalization is an essential step in data preprocessing and a practical approach to improving ML accuracy. The standard scaler transforms the data of the three datasets to a range between 0 and 1. It was applied before the data were passed to the proposed classification model, as shown in Eq. (2)
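The text names a standard scaler but describes a range between 0 and 1, and the referenced equation is not reproduced here, so the exact transformation is uncertain. As a hedged illustration, the sketch below shows both common options: min-max scaling, which maps values to the range [0, 1], and standardization, which produces zero mean and unit variance. The function names are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: no spread to rescale
    return [(v - lo) / (hi - lo) for v in values]


def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


feature = [2.0, 4.0, 6.0]
print(min_max_scale(feature))   # [0.0, 0.5, 1.0]
print(standard_scale(feature))  # symmetric around 0, not bounded to [0, 1]
```

In either case, the scaling parameters should be computed on the training set only and then reused on the test set to avoid leaking test information.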

C. Ensemble Machine Learning Approach to Detecting IoT Intrusion
A voting-based ensemble classification technique is used. Several voting procedures exist, such as hard voting (voting based on a majority) and soft voting. Soft voting may be performed by using the average of probabilities, the product of probabilities, or the minimum or maximum of probabilities.
In this work, hard voting (voting based on a majority) was used to combine the predictions of the classifiers.
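The difference between hard and soft voting can be sketched as follows. The class labels and probabilities are invented for illustration, and the soft-voting variant shown is the average of probabilities; ties in the hard vote are resolved here in favor of the label seen first, which is an assumption of this sketch.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over one predicted label per classifier."""
    return Counter(predictions).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the per-class probabilities of all classifiers, then take the argmax."""
    classes = probabilities[0].keys()
    avg = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(avg, key=avg.get)


# Six hypothetical classifiers voting on one record.
print(hard_vote(["DoS", "Normal", "DoS", "Exploits", "DoS", "Normal"]))  # DoS

# Three hypothetical classifiers: one is very confident, two are mildly opposed.
probs = [
    {"attack": 0.9, "normal": 0.1},
    {"attack": 0.4, "normal": 0.6},
    {"attack": 0.4, "normal": 0.6},
]
labels = [max(p, key=p.get) for p in probs]
print(hard_vote(labels))  # normal (two votes against one)
print(soft_vote(probs))   # attack (average probability is higher)
```

The second example shows that the two schemes can disagree: hard voting counts heads, whereas soft voting also takes each classifier's confidence into account.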

VII. EXPERIMENTAL RESULTS
In this section, the confusion-matrix-based findings for multi-class classification are provided. The model's performance was assessed based on accuracy, precision, recall, and F1 score. Precision is calculated by dividing the number of true positive predictions by the total number of positive predictions, whereas recall is determined by dividing the number of true positive predictions by the total number of positive class values in the test data. The F1 score is the harmonic mean of precision and recall. Accuracy is determined by dividing the number of correct predictions (true positive and true negative predictions) by the total number of predictions. Poor recall reflects a large number of false negative predictions, and low precision indicates a high proportion of false positive predictions. A high F1 score indicates precision and recall that are in balance, with few false negatives and false positives. These measures were calculated using the following equations, which are based on [28][29][30][31]:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP is the true positive, TN is the true negative, FP is the false positive, and FN is the false negative.
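As a worked illustration of these four measures, the sketch below computes them from confusion-matrix counts for a single class. The counts are made-up numbers, and `classification_metrics` is a hypothetical helper; for multi-class results, such per-class values are typically computed one class against the rest and then averaged.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Illustrative counts: 80 attacks caught, 20 missed, 10 false alarms, 90 true negatives.
m = classification_metrics(tp=80, tn=90, fp=10, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

With these counts, accuracy is 170/200 = 0.85, precision is 80/90, recall is 80/100, and the F1 score is 160/190.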

VIII. CONCLUSION
Ensemble techniques combine several learning algorithms to achieve prediction performance that is better than that of any one of the component learning algorithms alone. Empirically, ensemble ML provides more accurate findings when the models exhibit considerable variation. As a result, many ensemble approaches encourage variation among the models they combine. In this research, three intrusion detection datasets for the IoT (IoTID20, UNSW-NB15, and BoTNeTIoT-L01-v2) were employed to evaluate the performance of the ensemble classification method. The results indicate a preference for the ensemble classification method over the other algorithms, with accuracy rates of 88.41% on the IoTID20 dataset, 98.52% on the UNSW-NB15 dataset, and 91.03% on the BoTNeTIoT-L01-v2 dataset. In conclusion, ML approaches show great potential for IoT IDS. They can provide important solutions with their anomaly-based approach and ability to detect unknown attacks. As a future research direction, a recommendation using several feature selection methods can be formulated. Hybrid feature selection methods can also be used.

Fig. 4 Applying ensemble ML algorithms to different datasets.

TABLE I TYPES OF ATTACKS IN EACH DATASET.

TABLE IV PERFORMANCE METRICS IN THE UNSW_NB15 DATASET.

TABLE V PERFORMANCE METRICS IN THE BOTNETIOT DATASET.
Table V shows the preference for extra tree algorithms over other algorithms, and the accuracy of this algorithm was 91.03%. In this study, this method was compared with methods in several recent studies. Table VI provides a comparison of the overall performance in multiple classifications on the UNSW_NB15 dataset in terms of accuracy. Table VII compares studies conducted on the IoTID20 dataset for subcategories in terms of accuracy. The proposed approach outperformed the other methods in terms of accuracy measures.
TABLE VI GENERAL COMPARISON OF MULTIPLE CLASSIFICATION ACCURACY MEASURES FOR THE UNSW_NB15 DATASET.
TABLE VII GENERAL COMPARISON OF SUBCATEGORIES WITH PRECISION MEASURES OF THE IOTID20 DATASET.