PAAD: POLITICAL ARABIC ARTICLES DATASET FOR AUTOMATIC TEXT CATEGORIZATION

Text classification and sentiment analysis are now among the most popular Natural Language Processing (NLP) tasks. These techniques play a significant role in human activities and influence daily behaviour. Articles in fields such as politics and business express different opinions according to the writer's leanings, and this diversity yields a huge amount of data. The ability to determine the political orientation of an online article automatically is therefore valuable. However, no corpus for political categorization has been directed towards this task in Arabic, owing to the lack of rich, representative resources for training an Arabic text classifier. We therefore introduce the Political Arabic Articles Dataset (PAAD), a collection of textual data gathered from newspapers, social networks, general forums and ideological websites. The dataset contains 206 articles distributed over three categories (Reform, Conservative and Revolutionary), which we offer to the research community on Arabic computational linguistics. We anticipate that this dataset will be a valuable aid for a variety of NLP tasks on Modern Standard Arabic, in particular political text classification. We present the data in raw form and as Excel files. The Excel files come in four versions: V1 raw data, V2 preprocessed, V3 root stemmed and V4 light stemmed.


I. INTRODUCTION
Nowadays more and more Arabic political content is presented in electronic form. Arabic political articles inform us about current events and explain situations; they cover education, war, politics, the economy and so on. Political articles also record people's opinions and actions regarding these situations, and Arab countries such as Iraq, Syria and Lebanon have many of them. This area therefore calls for the collection and analysis of Arabic political articles. The Arabic language is considered one of the most powerful and influential languages in the world. According to the latest report, it is the fifth most widely used language around the globe, with roughly 422 million people speaking it as a mother tongue and 250 million as an additional language [1]. The Arabic alphabet consists of 28 letters, and Arabic is written from right to left [2]. Compared with the large number of studies conducted on English sentiments, opinions, attitudes and emotions, the number of similar studies on the Arabic language is very small.
This paper presents four corpora, V1, V2, V3 and V4, describes the process used to create each one, and makes them available online for anyone interested in the NLP field. V1 is the original corpus; the other three are derived by preprocessing, root stemming and light stemming. A statistical analysis is carried out for each corpus. Statistical analysis is a very effective tool for assessing research work, and methods of this kind have been applied in a number of studies that propose techniques for comparing various algorithms [3][4][5].
In this paper, the data analysis concentrates on exploring the statistical properties of the variables that are fed to machine learning models. This is done by plotting the data as scatter plots that summarize each class visually. To visualize the data used in these experiments, summary statistics are computed first, followed by the visualisation methods t-distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA) [6]. PCA and t-SNE are powerful statistical methods whose great advantage is that they reveal the distribution of the examined data. This paper presents scatter plots of the input variables for the three output classes.

II. LITERATURE REVIEW
Annotation is the process of manually or automatically adding information to text for a given purpose. In computational linguistics, the people who perform it are called annotators or taggers [7]. Annotation, and its companion activity of corpus creation, has become an important activity in computational linguistics since the widespread application of machine learning algorithms and lexicons. One of the most important preconditions for building an efficient model is to understand the characteristics of the input data, and specialized models always have the potential to outperform general ones. The real contribution of this work is the exploration of characteristics in the data and its application to the process of building a classification model. Table 1 lists political corpora in different languages. Many papers have been published on political text; most of them collect data from Twitter [21][22][23] and a few from newspapers [24]. As Table 1 shows, across different natural languages most researchers use Twitter because its API makes text easy to collect [25][26][27]. We also searched for Arabic corpora from 2015 to 2020; Table 2 lists, for each corpus, the year, the purpose, whether it is published, the source and the type. In the survey by Zaghouani [33], all the corpora are published but none of them is political, and in the survey of Arabic corpora from 2005 to 2015 by Al-Thubaity [34] none is political either. Hence no published political corpus of the kind addressed here exists for Arabic or for the other languages surveyed from 2015 to 2020. The closest work, which is also political, is by Abooraig et al., but its categories are different and it is not available online. This led us to prepare our own corpus, particularly with M, S and T, because each one addresses a current problem. Some problems with Arabic corpora become apparent when examining the previous work, for the simple reason that much work is being done on non-public corpora.
The comparison in Table 1 shows that none of the 13 corpora evaluated has been published. In Table 2, 2 of the 5 corpora are available online. The corpora also come from several different domains, including newspapers and Twitter.

III. METHODOLOGY
The data of interest are unstructured, so they must be collected, cleaned and converted to structured data. Figure 1 shows these steps.

A. Corpus Collection
Searching the internet for Arabic articles poses a crucial challenge: there is no standard Arabic dataset, so we collected and constructed our own dataset for this study. We collected 206 political Arabic articles, called PAAD, from different sources (social networks, global forums, ideological websites, newspapers and TV). PAAD is distributed among three categories (Conservative, Reform and Revolutionary), which are summarized in Table 3. The corpus is freely available in the Mendeley repository 3 . As shown in Table 4, there are three types of articles (short, medium and large). Only online articles written in Modern Standard Arabic were gathered; the vast majority of modern articles written in colloquial Arabic were excluded. The corpus was built manually from a number of different sources. Roughly 60% of the corpus was gathered from the period of the Arab Spring revolutions, and 30% from the period preceding them. The remaining 10% concerns the beliefs and ideas on which party thought is founded and political events.

B. Corpus sources
In this section, we discuss the sources from which our dataset was collected.
1) Social networks: Social networks are very effective tools for spreading news and information anywhere in the world. Facebook 4 was used for gathering crowds during the Arab Spring revolutions. Many people used these platforms to express their support for the revolutions, while others used them to express opposing views. We collected these posts in order to understand ideologies and trends in the Arab world.
2) Global forums: Global forums are forums that do not require a user to share the ideology of the forum members in order to join. We chose global forums over ideology-specific forums because they are open to members of any ideology. We therefore used the writings of users in global forums as a source for our dataset. The Arab policy forum 5 is an example of such forums.
3) Ideological websites: There are a number of ideological websites with various opinions. Websites of this kind discuss political events by publishing articles or news, and their administrators usually share the same ideology. For an ideological website to be used as a source, it had to contain articles by party members.

4) TV websites and newspapers: TV websites and newspapers have sections dedicated to articles. We gathered a number of online articles from such sources, for example BBC 6 . Table 5 shows the number of articles from the various sources.

C. Collection Method
The collected data are organized in three folders (Reform, Conservative and Revolutionary). Each folder holds a list of sequentially numbered text files, each of which corresponds to one whole article. The articles contain some English symbols, punctuation, digits and Arabic diacritics, as shown in Figure 2. Data collection involved five steps. First, we extracted and collected online Arabic articles manually from various internet sources. Second, we removed unwanted components. Third, we corrected words containing spelling or other serious mistakes. Fourth, we placed every article in its own file and collected the documents belonging to the same ideology in one folder. Finally, the raw data were made available for preprocessing.

D. Building corpus
Although the labelling process is straightforward and simple, it is a crucial part of data preprocessing for machine learning algorithms. Labelling must be carried out precisely, as any error can degrade the quality of the dataset and the overall accuracy of the various machine learning approaches. The articles were assembled into an Excel file using Python scripts written specifically for scraping the three classes. The scripts read each folder and each article in it, as shown in Figure 3.
6 https://www.bbc.com/arabic
Building the Excel file gives us the dataset, from which four versions are built, as listed in Table 6. This section presents the methodology used to build them. Corpus V1 is the raw data stored in Excel. Three new versions are built from V1.

Table 6. The versions of the dataset.
Method | Version | Description
Raw data | V1 | The version that is built from the raw articles
Pre-processing | V2 | The version that is built with preprocessing
ISRI stemmer | V3 | Root stemming applied to V2
Light stemmer | V4 | Light stemming applied to V2
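The folder-to-table step described in this section can be sketched as follows. This is a minimal illustration and not the authors' script: the `.txt` extension, the UTF-8 encoding, the column names and the function names `load_corpus` and `write_table` are all assumptions, and a CSV file stands in for the Excel file so that the sketch needs only the standard library.

```python
import csv
from pathlib import Path

# Folder names follow the three categories named in the text.
LABELS = ("Reform", "Conservative", "Revolutionary")


def load_corpus(root):
    """Read every article under root/<label>/ into a list of row dicts."""
    rows = []
    for label in LABELS:
        folder = Path(root) / label
        # Assumption: one article per UTF-8 ".txt" file; files are visited in name order.
        for path in sorted(folder.glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            rows.append({"label": label, "file": path.name, "text": text})
    return rows


def write_table(rows, out_path):
    """Write the rows to a CSV table (a stand-in for the Excel file)."""
    with open(out_path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["label", "file", "text"])
        writer.writeheader()
        writer.writerows(rows)
```

Calling `write_table(load_corpus("PAAD"), "v1.csv")` would then produce one labelled row per article, where `"PAAD"` is a hypothetical path to the folder that holds the three category folders.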

1) Preprocessing for building corpus V2
After building V1, we build V2 by cleaning the text of unwanted words, noise words and stop words. V2 is important for assessing the effect of preprocessing on text classification. Corpus V1 needs to be cleaned and prepared for further classification: online articles contain uninformative and noisy data such as HTML tags, scripts and advertisements, which hinder the extraction of words, and the presence of special Arabic characters or punctuation marks is not reliably identified. Figure 4 shows the preprocessing steps. The first step is tokenization, which breaks the text sequence into pieces such as words; the output of the tokenizer becomes the input of the next step. The second step removes all punctuation marks, such as (: " ، ؛ ' - ؟ ! ? ; _ { } / * … [ ] ,). The third step removes the unwanted words shown in Table 7 using a list of regular expressions. The most important preprocessing step for the Arabic language is normalization, because Arabic words can be written in many forms. Our normalization involves three steps (diacritics, tatweel and letters): it begins with removing the diacritics from an Arabic word and ends with converting some letters into others. An example of this process is displayed in Figure 5. If diacritics are not removed, the same word appears in several forms, the feature vector becomes large, and more articles and more time are needed to build a model. For example, "تعبيراً" and "تعبيرا" become a single word once the diacritic ( ً ) is removed.
Tatweel: in Arabic, a character can be elongated with the tatweel (ـ), which is used to make the shape of a word more attractive. This is a problem because the same word can be written in many ways; for example, "الاصــــلاح", "الاصــــــــــــلاح" and "الاصـلاح" are all the same word, "الاصلاح". The tatweel is produced by pressing Shift and the (ت) key, and it must be removed to reduce the number of distinct words. The final normalization step maps some letters to a single form, since several Arabic characters can be written in more than one way, as shown in Table 8. After all the steps above, any word consisting of only two characters is removed, because two-character words in Arabic do not carry meaning of their own. The last step handles stop words: each token is matched against a stop word list and removed from the article if it appears there. The stop word list is the one built into the NLTK library, in which every entry is a single word.
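The normalization and filtering steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the letter mapping is a commonly used one and may differ from Table 8, the punctuation set is abbreviated, the unwanted-word patterns of Table 7 are omitted, and the stop word list is passed in by the caller instead of being loaded from NLTK. The names `normalize` and `preprocess` are hypothetical.

```python
import re
import string

DIACRITICS = re.compile("[\u064B-\u0652]")  # fathatan through sukun
TATWEEL = "\u0640"                          # the elongation character
# Assumed letter mapping (Table 8 is authoritative):
# alef variants -> bare alef, alef maqsura -> ya, ta marbuta -> ha.
LETTER_MAP = str.maketrans({
    "\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627",
    "\u0649": "\u064A", "\u0629": "\u0647",
})
# ASCII punctuation plus the Arabic comma, semicolon, question mark and the ellipsis.
PUNCT_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation + "\u060C\u061B\u061F\u2026"}
)


def normalize(token):
    """Remove diacritics and tatweel, then unify letter variants."""
    token = DIACRITICS.sub("", token)
    token = token.replace(TATWEEL, "")
    return token.translate(LETTER_MAP)


def preprocess(text, stop_words):
    """Strip punctuation, tokenize on whitespace, normalize, then filter tokens."""
    tokens = text.translate(PUNCT_TABLE).split()
    kept = []
    for token in tokens:
        token = normalize(token)
        # Drop tokens of one or two characters, then drop stop words.
        if len(token) > 2 and token not in stop_words:
            kept.append(token)
    return kept
```

In this sketch punctuation is stripped before whitespace tokenization, which gives the same tokens as tokenizing first when punctuation is simply discarded.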

E. Root Stemming for Building V3
In this section, the root-stemmed corpus V3 is built from V2. Stemming [35] is very important because removing prefixes and suffixes reduces the number of features by mapping words to the same generated stem. Different stemming approaches have been proposed for many languages. Arabic is a morphologically complex language, which is why a strong stemmer is needed to handle it. In this study we use the Information Science Research Institute (ISRI) stemmer [36], an algorithm with 7 steps.

F. Light stemming for building V4
Corpus V4 is built from corpus V2 using light stemming. All the light stemming rules used here are taken from the ISRI stemmer. Light stemming is used when the meaning of the word matters and the word should stay clear for a human reader. The ISRI light stemmer consists of a number of conditions that determine whether stemming is applied to a given word, together with the rules for removing suffixes, prefixes and waw. Table 9 shows the three conditions with the lengths of the words to be stemmed. The rules are applied in order: first waw, then prefixes, and finally suffixes.
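The rule order described for light stemming (waw, then prefixes, then suffixes, each guarded by a word-length condition) can be illustrated with the sketch below. The affix lists and the length thresholds are assumptions chosen for illustration, loosely modelled on affixes commonly associated with the ISRI stemmer; the actual conditions are those of Table 9, and the function names are hypothetical.

```python
# ASSUMED affix lists (illustrative subsets; Table 9 is authoritative).
PREFIXES_3 = ("\u0643\u0627\u0644", "\u0628\u0627\u0644",   # كال  بال
              "\u0648\u0644\u0644", "\u0648\u0627\u0644")   # ولل  وال
PREFIXES_2 = ("\u0627\u0644", "\u0644\u0644")               # ال  لل
SUFFIXES_3 = ("\u062A\u0627\u0646", "\u062A\u064A\u0646")   # تان  تين
SUFFIXES_2 = ("\u0648\u0646", "\u0627\u062A", "\u0627\u0646",  # ون  ات  ان
              "\u064A\u0646", "\u0647\u0627", "\u0647\u0645")  # ين  ها  هم
DOUBLE_WAW = "\u0648\u0648"                                  # وو


def strip_waw(word):
    """Drop one leading waw when the word starts with a doubled waw."""
    if len(word) >= 4 and word.startswith(DOUBLE_WAW):
        return word[1:]
    return word


def strip_prefix(word):
    """Remove at most one prefix; longer words may lose a three-letter prefix."""
    if len(word) >= 6 and word.startswith(PREFIXES_3):
        return word[3:]
    if len(word) >= 5 and word.startswith(PREFIXES_2):
        return word[2:]
    return word


def strip_suffix(word):
    """Remove at most one suffix, using the same assumed length thresholds."""
    if len(word) >= 6 and word.endswith(SUFFIXES_3):
        return word[:-3]
    if len(word) >= 5 and word.endswith(SUFFIXES_2):
        return word[:-2]
    return word


def light_stem(word):
    """Apply the rules in the stated order: waw, prefix, suffix."""
    return strip_suffix(strip_prefix(strip_waw(word)))
```

Under these assumed lists, the word المعلمون loses the prefix ال and the suffix ون, leaving معلم, which is still readable as a word. Short words are left untouched by the length guards.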

G. Statistical analysis
Statistical analysis is very important for analysing the data and judging its quality. Here we use tables to show the frequency of the words belonging to each class, among other properties, and visualizations such as PCA, t-SNE and word clouds. Corpus visualization is a very important step in opinion mining, allowing the human analyst to gain an intuition of the data and of its potential learnability. The results of corpus visualization can guide the modelling phase, since a major component of learnability is known to be a function of the correspondence between the learning approach and the type of representation it is supplied with. To visualize the data, summary statistics are computed first, followed by the visualisation methods t-SNE and PCA. The main visualization tools are discussed in the following sections.
1) Principal component analysis (PCA)
PCA reduces the dimensionality of the data by finding the orthogonal linear combinations of the original features with the largest variance [37]. Consider a sample of $P$ observations on a vector of $N$ variables $x = (x_1, \dots, x_N)$. For each observation, $x$ is an $N$-dimensional vector representing the features. The main purpose is to find a mapping from $\mathbb{R}^N$ to $\mathbb{R}^d$, where $d < N$ is the reduced dimension. The first principal component of the sample is identified by the linear transformation in Equation 4 [38,39]:

$$z_1 = a_1^{T} x = \sum_{i=1}^{N} a_{i1} x_i \qquad (4)$$

where the vector $a_1 = (a_{11}, a_{21}, \dots, a_{N1})$ is selected such that $\mathrm{Var}[z_1]$ is maximum.
So $a_1$ must be chosen such that the variance of $z_1$ is maximum; the projection onto $a_1$ then corresponds to the largest variance of $x$. PCA is an effective way of selecting a suitable number of features with an accurate mapping to the lower-dimensional space. The principal components are constructed so that the original instances can be recovered from the reduced representation with minimum error [39].
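To make the idea of a maximum-variance direction concrete, the sketch below estimates the first principal component of a small data matrix with power iteration on the sample covariance matrix. Power iteration is a swapped-in technique chosen to keep the example dependency-free; the paper does not say which PCA implementation was used, and a library routine would normally be preferred. The function name and the iteration count are assumptions.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, variance) of the first principal component of rows."""
    n, d = len(rows), len(rows[0])
    # Centre each feature on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centred = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and renormalise.
    # The all-ones start vector is an assumption; it fails only if it is
    # exactly orthogonal to the leading eigenvector.
    direction = [1.0] * d
    for _ in range(iterations):
        product = [sum(cov[i][j] * direction[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in product))
        if norm == 0.0:
            break
        direction = [v / norm for v in product]
    # Variance of the data projected onto the direction found.
    variance = sum(direction[i] * cov[i][j] * direction[j]
                   for i in range(d) for j in range(d))
    return direction, variance
```

For points lying on a straight line, the returned direction is parallel to that line and the returned variance equals the total variance of the data, which is the behaviour the maximization above describes.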

2) t-distributed Stochastic Neighbor Embedding (t-SNE)
A popular method for exploring high-dimensional data is t-SNE, introduced by van der Maaten and Hinton in 2008 [40]. Exploratory analysis is a vital step in machine learning, allowing people to gain an intuition of the data and of its potential learnability. In addition, the outcomes of data exploration can guide the modelling stage, since a major component of learnability is known to be a function of the correspondence between the algorithms and the type of representation they are supplied with.

3) Word Cloud
This paper uses word clouds to visualize the most frequent words in the corpus; they provide a simple and effective means of showing which words occur most often [41,42].
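The word counts behind a word cloud or a top-10 table can be obtained with a simple frequency count per class. The sketch below assumes whitespace tokenization and a hypothetical input format of `(label, text)` pairs; it shows only the counting step, not the rendering of the cloud.

```python
from collections import Counter


def top_words_per_class(articles, k=10):
    """articles: iterable of (label, text); returns {label: [(word, count), ...]}."""
    counters = {}
    for label, text in articles:
        # One Counter per class, updated with the whitespace tokens of each article.
        counters.setdefault(label, Counter()).update(text.split())
    return {label: counter.most_common(k) for label, counter in counters.items()}
```

The resulting counts can be passed to any word cloud renderer or printed directly as a frequency table.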

IV. EXPERIMENTAL RESULT
All corpus building and testing was done in Python. The tests proceed in four steps, one for each of the four corpora.

A. CORPUS V1
This section presents in detail the statistical analysis and characteristics of V1. The characteristics are given in Table 10. Revolutionary articles are the longest in terms of both the number of sentences and the number of digits, followed by Reform articles, while Conservative articles are the shortest. For the number of punctuation marks, Reform articles rank highest, followed by Revolutionary articles and then Conservative articles. Regarding English words, Reform articles rank first, followed by Conservative articles, and Revolutionary articles have the fewest English words. For the number of unique words, Reform articles have the most, then Revolutionary articles, and Conservative articles have the fewest. The same order holds for the longest words: Reform, then Revolutionary, then Conservative. Table 10 also lists the dot ('.') as a special character; it is very important for detecting or splitting sentences in further work. Figure 6 shows the most frequent words as a word cloud. Table 11 shows the 10 most frequent words in PAAD corpus V1 for each class, and the most frequent punctuation marks are also tabulated. Figure 7 shows the distribution of articles in corpus V1 for each class. For both visualizations, PCA and t-SNE, the distribution of articles is non-linear. Each point in the scatter plot represents one article, and the plot shows the clusters of similar articles.
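Characteristics of the kind reported for V1 (sentences, digits, punctuation marks, English words, unique words) can be computed per article with a short function. The definitions used below are assumptions made for illustration, for example a sentence is taken to be a non-empty segment between dots and an English word a run of Latin letters; the paper does not state how its counts were defined.

```python
import re
import string

# ASCII punctuation plus the Arabic comma, semicolon and question mark.
PUNCTUATION = set(string.punctuation + "\u060C\u061B\u061F")


def article_stats(text):
    """Return simple surface statistics for one article."""
    tokens = text.split()
    return {
        # Assumed definition: non-empty segments separated by the dot character.
        "sentences": len([s for s in text.split(".") if s.strip()]),
        "digits": sum(ch.isdigit() for ch in text),
        "punctuation": sum(ch in PUNCTUATION for ch in text),
        # Assumed definition: maximal runs of Latin letters.
        "english_words": len(re.findall(r"[A-Za-z]+", text)),
        "unique_words": len(set(tokens)),
    }
```

Summing or averaging these dictionaries over the articles of each class would give a per-class summary comparable in form to a characteristics table.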

B. CORPUS V2
This section shows how the preprocessing steps affect the corpus, reducing the number of words and keeping only the important ones. Table 12 shows the number of words during the preprocessing steps, ending with the final word counts. Revolutionary articles have the most words, followed by Reform articles, and Conservative articles have the fewest. Figure 8 presents the most frequent words in corpus V2 as a word cloud; these words, which influence the classification decision, differ from those in Figure 6 for corpus V1. Table 13 presents the 10 most frequent words for each class. As the table shows, many of the words have the same meaning but different written forms, so root and light stemming are applied next to merge these words in corpus V2. Figure 9 shows the PCA and t-SNE distributions of corpus V2, which are non-linear. The distribution in this figure is better than that in Figure 7 for corpus V1, which should make classification decisions more accurate.

C. CORPUS V3
This section shows how root stemming affects corpus V2 and reduces the number of words. Table 14 shows the number of words after root stemming is applied. In Table 13 for corpus V2, many words share the same root but have different forms; after root stemming, as Table 14 shows, the words are reduced to their original roots. Figure 10 shows the distribution of articles for corpus V3, which is non-linear.

D. CORPUS V4
This section shows how light stemming affects corpus V2 and reduces the number of words. Table 15 shows the number of words after light stemming is applied. In Table 14 for corpus V3, root stemming reduces words to their roots, but the meaning of the words is almost lost. With light stemming, as Table 15 shows, the meaning is preserved. Figure 11 shows the distribution of articles in corpus V4.

In this article, we have presented the Political Arabic Articles Dataset (PAAD). We address the research problem by collecting and building four versions of the corpus from PAAD and making them available online. V1 is the original raw data; corpus V2 is built from it by applying the preprocessing steps. From corpus V2, two further corpora are built: V3 by applying root stemming and V4 by applying light stemming. PAAD is intended for anyone interested in NLP or text categorization.