A Binary Particle Swarm Algorithm with A Simple Mutation Operator for Solving Phylogenies Problems

Handling very large data sets makes it difficult to search for the best feature combinations in phylogenetic tree construction. This article proposes a binary particle swarm optimization algorithm that randomly selects chloroplast genes to construct phylogenetic trees. It also suggests a simple mutation operator that speeds up the search and makes it easier to find the best combinations of chloroplast genes. These combinations are evaluated according to the bootstrap values of all the nodes of the constructed tree.


I. INTRODUCTION
Feature selection in DNA sequences has become an important research tool in biological experiments. Binary particle swarm optimization (BPSO) selects helpful gene DNA sequences and omits gene sequences that have a negative impact. In phylogenetic tree construction, the choice of features strongly influences the obtained results. Many algorithms have been proposed for feature selection. Some of them filter each element separately according to specific measures. The most helpful ones try to divide the elements into subsets [1] in order to reduce execution time, find the best subset of elements, and avoid high computational complexity. The binary particle swarm optimization technique is considered one of the best-known algorithms for feature selection, especially when the search space is large. More than one swarm can be launched at the same time. Adaptive mutation operators have been proposed in many articles [2], [3]. These operators make the swarm highly capable of finding the optimal solution in local search. The objective of this article is to find the best phylogenetic tree with the largest and best combinations of genes. Applying BPSO alone is not enough to reach the intended results, so a mutation operator is proposed to control the local search mechanism, in which each element of the particle can make a difference. This article is organized as follows. The MBPSO mechanism is presented in Section II. The obtained results are shown in Section III, together with a detailed comparison with other algorithms. Finally, the article ends with a conclusion section that summarizes the work.

II. METHODOLOGY
Suppose a population of NB particles, where NB is the number of particles in the swarm. Each particle is described by four pieces of information: position, velocity, score value, and local best position. The position and the velocity are represented by two arrays of length L, where L is the number of all genes. The position is an array of binary numbers in which 1 represents an included gene and 0 represents an excluded gene. The score is the minimum bootstrap value [4] of the tree, and it is the fitness value of the proposed phylogenetic tree. More precisely, each particle is represented by a proposed phylogenetic tree built from randomly selected elements (genes) and a fitness value (a measure of the confidence level of the tree). Each element has a position and a velocity. If the fitness value is more than 90 and the share of selected genes is greater than or equal to 90%, the particle is considered the best particle the swarm has had so far. Otherwise, the swarm keeps searching until it finds the optimum. First, we propose a swarm with a specific number of particles. Each particle is a phylogenetic tree constructed from a random number of genes. Each gene has a proposed velocity, and the velocity value determines whether the gene will be selected in the next iteration. The velocity is controlled by several variables [5], [6]. Depending on these variables, the swarm can either rush, which is considered a global search (we get the same topologies with more missing genes than expected), or move slowly toward the objective, which yields more possibilities, each of which can be a good solution (we get the same topologies with more included genes, or new topologies that increase the variety). Applying the traditional swarm mechanism produced many missing genes, which is inadequate in our case. We therefore introduce a parameter that controls the velocity of the elements represented by zero in the position array of the particle. To increase the number of included genes in the proposed tree, we count the number of missing genes for the particle. If they exceed 10%, we choose a random number z of these elements with position zero and multiply their velocities by a random parameter r3 in the interval [0.1, 0.5]. We restrict r3 to this interval to increase the probability that ones ("included genes") appear. To make the effect of this step noticeable, z should be in the interval [z/2, the total number of missing genes]. For instance, suppose the position of the proposed particle is initialized to p[01100000001010110000000011]. The number of missing genes is more than 10%. First, we choose a random number in the interval [13, 18], such as 14. This number represents z in our proposed algorithm. Then, 14 cells containing zero are chosen randomly from the position array, and their velocities are multiplied by r3 (v0 *0. In the second step, the positions of these elements are updated according to the previous step, and the position becomes p[11111111111111110101000111], while the remaining elements keep their previous positions. The number of included genes clearly increases. The fitness function is computed only once per iteration. We found that computing the fitness twice, before and after the mutation step, as in many articles [2], [3], makes the search process take too long. At the end, we choose the tree that has the largest subset of included elements.
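The velocity-scaling part of the mutation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mutate_velocities` is hypothetical, a single r3 is assumed to be drawn once and shared by all chosen genes, and the lower bound for z is assumed to be half of the missing genes rounded up (the worked example in the text uses 13 as its lower bound). The subsequent position update is left out because it depends on the velocity rule, which is not reproduced here.

```python
import random


def mutate_velocities(position, velocity, rng=random, threshold=0.10):
    """Scale the velocities of a random subset of excluded genes.

    position: list of 0/1 flags (1 = gene included, 0 = gene missing)
    velocity: list of floats with the same length as position
    Returns (new_velocity, chosen_indices, r3); r3 is None if nothing changed.
    """
    zeros = [i for i, bit in enumerate(position) if bit == 0]
    # The mutation only runs when more than 10% of the genes are missing.
    if len(zeros) <= threshold * len(position):
        return list(velocity), [], None
    # Assumption: z is drawn between half of the missing genes and all of them.
    low = (len(zeros) + 1) // 2
    z = rng.randint(low, len(zeros))
    chosen = rng.sample(zeros, z)
    # Assumption: one r3 in [0.1, 0.5] is shared by every chosen gene.
    r3 = rng.uniform(0.1, 0.5)
    new_velocity = list(velocity)
    for i in chosen:
        new_velocity[i] *= r3
    return new_velocity, sorted(chosen), r3


# Usage with the 26-gene position from the text and placeholder velocities.
example_position = [int(c) for c in "01100000001010110000000011"]
example_velocity = [-1.0] * len(example_position)
scaled, picked, factor = mutate_velocities(
    example_position, example_velocity, random.Random(0)
)
```

Only velocities of genes whose position is 0 are touched, so genes that are already included keep their current state.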
Our mechanism has been applied to the Rosales family. This family has been analyzed and examined in many articles (for more information, see [7] and [8]). The Rosales family consists of 10 genomes according to [9] and [10]. These genomes contain "9 in-group species and 1 outgroup (Mollissima)". Each genome of this family consists of common genes. The total number of genes across all the genomes is 82, which is L (the length of the two arrays, position and velocity). Each genome represents a branch in the constructed tree. The essential idea is to omit a random number of genes, within a specific interval, from each branch in order to observe their effect on the bootstrap value of the branch itself and on its relation to the other branches (which can give a new topology). The minimum bootstrap value among all these branches is taken as the fitness value, which should be equal to or larger than 90. That indicates a well-supported tree in which all branches consist of the best subset of genes. Our proposed MBPSO algorithm illustrates the idea of our methodology, where POP represents the population size, p[velocity] represents the velocity of particle p at the current time, and p[score] represents the fitness value of particle p. The fitness is computed with the RAxML application (for more information, see [10] and [11]). p[best] represents the local best position of the particle. Following many articles [6], [5] and [13], Equation (1) is used to guarantee the convergence of the swarm, where k is a random number in the interval [0.0, 1.0].
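The fitness definition and the acceptance condition described above can be illustrated with a short sketch. The function names `tree_fitness` and `is_accepted` are hypothetical, and the bootstrap values are assumed to be already available as a list of numbers; in the article they come from RAxML, which is not invoked here. The threshold of 90 is applied as "equal to or larger than", following the wording of this paragraph.

```python
def tree_fitness(branch_bootstraps):
    """Fitness of a tree: the minimum bootstrap value over its branches."""
    return min(branch_bootstraps)


def is_accepted(position, branch_bootstraps, min_bootstrap=90, min_included=0.90):
    """Check the two conditions stated in the text for a well-supported tree.

    position: list of 0/1 flags of length L (1 = gene included)
    branch_bootstraps: bootstrap values of the tree built from those genes
    """
    included_share = sum(position) / len(position)
    return (
        tree_fitness(branch_bootstraps) >= min_bootstrap
        and included_share >= min_included
    )


# Example with L = 82 genes and 5 missing genes, as reported for the best tree.
# The individual branch values are made up; only the minimum (92) is reported.
best_position = [1] * 77 + [0] * 5
hypothetical_bootstraps = [100, 92, 97]
print(is_accepted(best_position, hypothetical_bootstraps))  # True
```

With 19 missing genes out of 82, the included share drops to about 77%, so the same check would reject the particle regardless of its bootstrap values.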

III. RESULTS AND DISCUSSIONS
Our method, MBPSO (mutated binary particle swarm optimization), with φ1 = 2.5 and φ2 = 2.5 (in the interval 0, 1), has been executed 10 times for swarms of 3, 5 and 10 particles. The results have been compared with BPSO, the traditional binary particle swarm optimization algorithm, with φ1 (c1 = 1 and r1 in the interval 0, 1) and φ2 (c2 = 1 and r2 in the interval 0, 1), as shown in Table 1. They have also been compared with another method, DPSO (discrete particle swarm optimization), described in [8], with φ1 (c1 = 1 and r1 in the interval 0, 1) and φ2 (c2 = 1 and r2 in the interval 0, 1). The idea proposed there works by narrowing the interval of r (a random number in the interval 0.0, 0.5) in Equation (4) and by a condition named b+p (where p represents the number of included genes and b represents the minimum bootstrap value of the obtained tree). We noticed that BPSO loses many genes compared with MBPSO and DPSO. The best topologies obtained with MBPSO have fewer missing genes than those obtained with DPSO. Moreover, MBPSO produced 13 topologies, whereas only 7 topologies were obtained with DPSO. Table 1 presents the results of the three methods. For each method, three swarms were launched (3 particles, 5 particles and 10 particles). The column Particle gives the number of particles in the swarm, Min Bootstrap gives the minimum bootstrap value of the best tree that the current swarm reached, and Missing Genes gives the genes omitted from the best obtained tree. Figure 1 shows the best topology obtained by the MBPSO method, with a minimum bootstrap value of 92.
Table 1. The results of the three experiments.
Fig. 1. The best tree obtained.

IV. CONCLUSION
A simple mutation operator for the binary particle swarm optimization algorithm has been proposed in this article. It is designed to solve the problem of finding the best and largest subset of gene sequences. This approach has been applied to the Rosales family. To validate MBPSO, the proposed method has been compared with the traditional particle swarm optimization technique (BPSO) and a discrete particle swarm optimization technique (DPSO) for well-supported phylogenies. 13 topologies have been obtained. The best topology was obtained with a swarm of 10 particles, where only 5 genes were missing and the minimum bootstrap value was 92 (compared with 19 missing genes for DPSO and 33 missing genes for BPSO).