Abstract
Stop word identification is one of the most important tasks for many text processing applications such as information retrieval. Stop words occur too frequently in documents in a collection and do not contribute significantly to determining the context or information about the documents. These words are worthless as index terms and should be removed during indexing as well as before querying by an information retrieval system. In this paper, we propose an automatic aggregated methodology based on term frequency, normalized inverse document frequency and information model to extract the light stop words from Persian text. We define a ‘light stop word’ as a stop word that has few letters and is not a compound word. In the Persian language, a complete stop word list can be derived by combining the light stop words. The evaluation results, using a standard corpus, show a good percentage of coincidence between the Persian and English stop words and a significant improvement in the number of index terms. Specifically, the first 32 Persian light stop words have a great impact on the index size reduction and the set of stop words can reduce the number of index terms by about 27%.
1. Introduction
One of the initial tasks to build an information retrieval (IR) system consists of identifying the words that are too frequent among the documents in the collection. These words are called stop words because text processing stops when it finds one of them, discarding it as a candidate to be included in the index. Stop words have a significant impact on the text retrieval process in different languages. Throwing out these words decreases index size and generally improves retrieval effectiveness [1]. In TREC (Text REtrieval Conference), the top 33 stop words account for 30% of all the words [2]. According to Francis and Kucera [3], the 10 most frequently occurring words in English typically account for 20–30% of the tokens in a document. These words are said to have a very low discrimination value because they make an irrelevant document appear in the results when they are considered for searching purpose [4]. In other words, the amount of information carried by these words is negligible. Consequently, it is usually worthwhile to ignore all stop word terms when indexing documents and processing queries.
One way to improve the information retrieval system performance, then, is to eliminate stop words during the automatic indexing phase. Traditionally, a stop word list is supposed to include the most frequently occurring words. As with lexical analysis, however, some frequently occurring words are important as index terms as well. Generally, the stop word list policy will depend on the document database, features of the user and the indexing process. The use of a single fixed stop word list across different document collections could be detrimental to the retrieval effectiveness.
Many stop word lists have been developed for the English language, usually based on frequency statistics of a large corpus. The English stop word list available online [5, 6] is a good example. However, research and experimentation in the field of IR for the Persian language is relatively new and limited compared with the research that has been done in other languages. Persian is the language of many documents published on the internet and it is used by approximately 0.7% of all websites [7]. The number of Persian language blogs has also undergone dramatic growth, raising Persian to one of the top 10 languages of the global blogosphere. The number of internet users in Iran has increased since 2001, and Iran currently ranks 13th [7] in the world in terms of number of internet users. The construction of technological tools for minority languages such as Persian is one of the initiatives to preserve them on the internet. Researchers and linguists should work together to develop specific tools for these languages so that they can adapt better to the information society.
One of the challenges in Persian text processing is that there are several prescribed forms of writing. Another difficulty that may affect the automatic identification of Persian stop words is how to determine word boundaries. The space is not a deterministic delimiter: it may appear within a word or between words, and there may be no space between two words. Many words can be written with a space, a short space or no space. That is why the Academy of Persian Language and Literature (APLL) introduced new rules and a standard style. One of the APLL recommendations is to write adverbs, prepositions and conjunctions separately in sentences. Generally, these categories of words are less significant and can be considered as stop words. For example, the word ‘
The remainder of the paper is structured as follows: Section 2 reviews related work in this area and Section 3 presents a very brief overview of the Persian language. Section 4 describes our light stop word list construction method. Experimental results and discussion are presented in Section 5 and, finally, Section 6 presents the conclusions and future work.
2. Related works
Usually, the documents are the primary object in an IR system and some process is performed on them to make the system ready to operate. One of the traditional lexical IR processes includes the construction of the stop words list. Some related research has been developed for the English language. Francis and Kucera [3], for instance, worked on the Brown Corpus and were able to extract 425 stop words. Similarly, Van Rijsbergen [4] produced a stop word list for English comprising 250 words and ‘fluff words’. Words like below, near and always have a low frequency but do not usually have a significant discriminating power.
Lo et al. [8] proposed a new method, called term-based random sampling, for automatically generating a stop word list for a given collection. This approach, inspired by the query expansion technique, is based on how informative a given term is. The importance of a term can be assessed using the Kullback–Leibler divergence measure. This approach is then compared with various classical approaches based on Zipf’s law. Results show that the stop word lists derived from the methods inspired by Zipf’s law are reliable but very expensive to carry out. On the other hand, the computational effort taken to derive the stop word lists using the new approach was minimal compared with the baseline approaches, while achieving a comparable performance.
Zou et al. [9] suggested a method to construct a stop word list for the Chinese language. They used an aggregated model to measure both the word frequency characteristic by a statistical model and its information characteristic using an information model. This approach has been developed based on the idea that stop words are ranked at the top with far greater frequency than other words, while maintaining stable distribution in different documents. A combination of these two observations redefines the stop words as those words with stable and high frequency in documents. The generated list was compared with other existing lists and showed an improvement over the others. Alajmi et al. [10] used the same method to generate an Arabic stop word list.
El-Khair [11] conducted a comparative study on the effect of stop words elimination on Arabic IR. Three stop lists were used in the comparison: general stop words, corpus-based stop words and a combined stop word. The general stop word list was created based on Arabic language structure characteristics. The second list was elaborated based on the word frequency in the corpus, and a third list combined general and corpus-based stop words. It was concluded that the combined and the general stop word lists produce the best performing functions for retrieving in the Arabic language using the BM25 algorithm. The performance of the general and the combined stop word lists was relatively close. The use of any of them is recommended, but the general stop word list is certainly preferred when dealing with different corpora.
As for the Persian language, there are only a few studies on the construction of stop word lists. Based on corpus statistics, Taghva et al. [12] present a list of 155 stop words and 12 verbs as verbal stop words in Persian text. They used a collection of 1850 documents collected from Persian websites to identify a Persian stop word list based on the distribution of the words. Referring to well-known English stop words and to common sense, they manually edited the result to remove some of the words from the list. These words, although frequent in their collection, should not be considered stop words in a general collection. The identification of this list was part of the Farsi project carried out by the Information Science Research Institute at the University of Nevada [13]. As mentioned by the authors, this list is relatively short and incomplete.
Another work was carried out by AleAhmad et al. [14]. They built a standard Persian corpus, called the Hamshahri collection, from the texts of the Hamshahri newspaper. They also identified a list of 50 high-frequency words in the corpus. Given their meaning and their high frequency, these words can be considered stop words in Persian text. The corpus also provides a file with more than 800 stop words. This file contains the most frequent terms in the collection, but it also includes punctuation marks and other terms that cannot be considered words.
In addition, Esmaili et al. [15] built a test collection of Persian text named ‘Mahak’ and performed some statistical computations on it. One of these computations produced a list of high-frequency words. The list contains 35 words which, from a semantic point of view, can be considered stop words. The last two lists are very short and incomplete, because the authors’ aim was to construct a corpus rather than to identify Persian stop words; the lists are a by-product of some statistical properties of Persian text. Almost all of the words in these two lists also appear in the first list referred to above.
Finally, Davarpanah et al. [16] developed a Persian stop word list based on corpus statistics, domain dependence and expert judgement. In the first stage, the authors randomly extracted 63 articles from 12 high-ranking Persian journals on psychology, education, and library and information science. Then, based on the syntactic function and the frequency of the words, they produced a list of 746 words. In the second stage, they used the Hamshahri corpus [14] to generate stable high-frequency words; this second list contains 422 words. The two lists were then aggregated, yielding 922 stop words defined as those words with stable and high frequency in documents. According to different grammatical criteria, they added some words to obtain the final list of 927 stop words. Only 14 stop words in the first list, elaborated by Taghva et al. [12], are not included in this list.
3. Persian language
Persian (also known as Farsi or Parsi) is the official language of Iran and, along with Pashto, one of the official languages of Afghanistan. It is also spoken in Tajikistan and parts of Uzbekistan. Persian is a member of the Indo-European family of languages, and within that family, it belongs to the Indo-Iranian branch, within which the Iranian subbranch includes the following chronological linguistic path: Old Persian, Middle Persian and Modern Persian. The Arabic script has been adopted for writing Farsi in Iran. However, four sounds that did not exist in Arabic were added to the alphabet for Persian (
Persian is an SOV language – the sentences appear in the word order subject–object–verb. The verb is marked for tense and aspect and usually agrees with the subject in person and number. Persian is a pro-drop language, thus the subject is optional. Although Persian is a verb-final language, it does not adhere to a strict word order and the sentential constituents may occur in various positions in the clause; this is especially the case for preposition phrases and adverbials. In addition, there are no overt markers, such as case morphology, to indicate the function of a noun phrase or its boundary; in Persian, only specific direct objects receive an overt marker. The object marker râ (
4. Light stop words construction method
A review of Persian linguistics and grammar [16, 22, 23] reveals that words in Persian, as in other languages, have two distinct levels of representation: semantic and syntactic. Syntactic words form a finite (closed) set, whereas semantic words form an open-ended one. As mentioned in the related literature, stop words primarily serve a syntactic function. Indeed, they are used just because of grammar and carry no significant information [9]. Therefore, the words that may be considered as stop words should be collected systematically from the different syntactic classes in Persian to ensure the completeness of the list. From a linguistic viewpoint, Persian stop words usually belong to the following word categories: adverbs of time and place, pronouns, prepositions, determiners, conjunctions, interjections, interrogative words, ordinal numbers, auxiliaries and some verbs (verbal stop words). Generally, stop words in the Persian language, as in other languages, have certain properties:
they have little meaning if they are used separately;
they appear many times in a text;
they are necessary for the construction of the language;
they are general words and not particularly used in a certain field;
they are not used as a search keyword;
they never form a full sentence when used alone.
The construction of the Persian light stop word list has several steps. Our hypothesis is that a Persian light stop word is short (very few characters) and that a complete set of stop words can be derived by combining light stop words. To determine the light stop word list, we follow an aggregation-based methodology. The first step identifies a list of high-frequency words in the Persian lexicon. The second step generates a list of terms that have a low inverse document frequency. The third step calculates the entropy of each word and builds a list of the words with high entropy values. Finally, the three lists are aggregated to derive the final list.
4.1. Length of words
The words with high frequency in Persian text have a very interesting characteristic: they are not part of compounds and, generally, their length (number of characters) varies from two to five letters. A statistical study of the properties of Persian words revealed this characteristic. Figure 1 depicts the total occurrence of n-letter words (tokens) in the Hamshahri corpus. As we can see, words of between two and five characters have a high frequency in the collection. Consequently, the n-letter words (n = 2, …, 5) are the basis of our experiments to identify the Persian light stop words.

Figure 1. The number of total n-letter words (tokens).
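The length-based selection described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the toy tokens (Latin transliterations, not corpus data) are hypothetical, and word length is taken as the number of Unicode code points, which may differ from the letter count used in the paper when a word contains a short space.

```python
from collections import Counter

def length_histogram(tokens):
    """Count token occurrences by word length (number of code points)."""
    return Counter(len(tok) for tok in tokens)

def light_candidates(tokens, min_len=2, max_len=5):
    """Return the distinct words whose length lies in [min_len, max_len]."""
    return {tok for tok in tokens if min_len <= len(tok) <= max_len}

# Toy tokens (hypothetical transliterations, for illustration only).
tokens = ["va", "dar", "be", "az", "ketabkhaneh", "ke", "in", "daneshgah", "va"]

hist = length_histogram(tokens)        # occurrences per word length
candidates = light_candidates(tokens)  # words kept for the later steps
```

Only the words returned by `light_candidates` would be passed on to the frequency, IDF and entropy rankings.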
4.2. Collection term frequency
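One of the most obvious features of text from a statistical point of view is that the distribution of word frequencies is very skewed. George Kingsley Zipf observed that a term’s rank-frequency distribution can be fitted very closely by the relation:
$$f(r) = \frac{C}{r^{\alpha}}$$
where $r$ is the rank of a term when terms are sorted by decreasing frequency, $f(r)$ is the frequency of the term at rank $r$, $C$ is a constant that depends on the collection and $\alpha$ is an exponent close to 1.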
Table 1. Top 20 Persian words with highest frequencies.
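As a rough illustration of the first ranked list, the following sketch counts collection term frequencies and orders words by decreasing frequency. The function names and the toy documents are hypothetical; documents are assumed to be already tokenized and restricted to light-word candidates, and ties are broken alphabetically only to make the output deterministic.

```python
from collections import Counter

def collection_term_frequency(documents):
    """Sum the occurrences of every word over all documents."""
    tf = Counter()
    for doc in documents:
        tf.update(doc)
    return tf

def rank_by_frequency(tf):
    """Order words by decreasing frequency (ties broken alphabetically)."""
    return sorted(tf, key=lambda w: (-tf[w], w))

# Toy tokenized documents (hypothetical transliterations).
docs = [["va", "dar", "ketab"], ["va", "az", "dar"], ["va", "ketab"]]

tf = collection_term_frequency(docs)
ranked_by_tf = rank_by_frequency(tf)
```

The head of `ranked_by_tf` plays the role of the high-frequency list; how many entries to keep is a separate design decision.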
4.3. Document term frequency
Luhn [25] used Zipf’s law as a null hypothesis to specify two cut-offs, an upper and a lower, thus excluding non-significant words. The words exceeding the upper cut-off were considered to be common and those below the lower cut-off rare, and therefore not contributing significantly to the content of the document [4]. The words exceeding the upper cut-off can be considered stop words because they have a low weight in the indexing process. Conversely, infrequently occurring terms have a greater probability of occurring in relevant documents and should be considered more informative and therefore more important in these documents. By substituting ‘term frequency’ with normalized inverse document frequency (IDF), another stop word list can be computed. Normalized IDF is the most common form of IDF weighting, used by Robertson and Sparck-Jones [8, 26]; it normalizes with respect to the number of documents not containing the term ($N_{doc} - D_k$) and adds a constant of 0.5 to both numerator and denominator to moderate extreme values:
$$\mathrm{idf}_k = \log\frac{N_{doc} - D_k + 0.5}{D_k + 0.5}$$
where $N_{doc}$ is the total number of documents in the collection and $D_k$ is the number of documents containing term $k$. Stop words are known to make poor index terms and naturally they have a low inverse document frequency weight. Table 2 lists the 20 Persian words with the lowest $\mathrm{idf}_k$ values, in ascending order of $\mathrm{idf}_k$.
Table 2. Top 20 Persian words with lowest inverse document frequency weight.
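The normalized IDF weight described in this section, with 0.5 added to the numerator (documents not containing the term) and to the denominator (documents containing it), can be sketched as follows. The function name and the toy documents are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here; the base does not change the ordering of the words.

```python
import math

def normalized_idf(documents):
    """Return log((N_doc - D_k + 0.5) / (D_k + 0.5)) for every term k."""
    n_doc = len(documents)
    doc_freq = {}
    for doc in documents:
        for word in set(doc):  # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {
        word: math.log((n_doc - d_k + 0.5) / (d_k + 0.5))
        for word, d_k in doc_freq.items()
    }

# Toy tokenized documents (hypothetical transliterations).
docs = [["va", "dar", "ketab"], ["va", "az", "dar"], ["va", "ketab"]]

idf = normalized_idf(docs)
# Ascending IDF: the most widespread words come first.
ranked_by_idf = sorted(idf, key=lambda w: (idf[w], w))
```

A word that occurs in every document gets the lowest weight, which is why the head of `ranked_by_idf` is a natural source of stop word candidates.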
4.4. Information model
Entropy is a measure of unpredictability or information content. The Shannon entropy, due to Claude E. Shannon [27], is a mathematical function that intuitively corresponds to the amount of information contained in or supplied by a source of information. This source can be written text in a particular language or any computer file (collection of bytes). From the viewpoint of information theory, stop words are also those words which carry little information. Entropy, one of the fundamental measurements of information [9], offers us another method for better describing stop word selection. Suppose there are M distinct words and N documents altogether. We denote each word as $w_j$ (j = 1, …, M) and each document as $D_i$ (i = 1, …, N). For each word $w_j$, we calculate its frequency in the document $D_i$, denoted as $f_{i,j}$. However, documents have different lengths. To normalize for document length, we calculate the probability $P_{i,j}$ of the word $w_j$ in the document $D_i$, which is its frequency in $D_i$ divided by the total number of words in $D_i$. We then measure the information value of the word $w_j$ by its entropy. We calculate the entropy value (H) for word $w_j$ as follows:
$$H(w_j) = \sum_{i=1}^{N} P_{i,j}\,\log\frac{1}{P_{i,j}}$$
where documents in which $w_j$ does not occur ($P_{i,j} = 0$) contribute nothing to the sum.
Once the entropy of each word in the dataset has been calculated, the resulting list can be ordered by descending entropy to reveal the words that have a greater probability of being noise words. This entropy-ordered list is the third list prepared for aggregation. The higher the entropy of a word, the lower its information value [9]. Therefore, the words with the highest entropy are extracted as candidates for stop words. The top 20 Persian words with highest entropy are shown in Table 3.
Table 3. Top 20 Persian words with highest entropy.
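The entropy ranking can be illustrated with the sketch below. It assumes the Shannon form H(w_j) = Σ_i P_ij · log(1/P_ij), where P_ij is the frequency of the word in document i divided by the length of that document, as defined in the text; the natural logarithm, the function name and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def word_entropy(documents):
    """Accumulate P_ij * log(1 / P_ij) over all documents for every word."""
    entropy = {}
    for doc in documents:
        if not doc:
            continue  # an empty document contributes nothing
        counts = Counter(doc)
        total = len(doc)
        for word, freq in counts.items():
            p = freq / total  # length-normalized probability P_ij
            entropy[word] = entropy.get(word, 0.0) - p * math.log(p)
    return entropy

# Toy tokenized documents (hypothetical transliterations).
docs = [["va", "dar", "ketab"], ["va", "az", "dar"], ["va", "ketab"]]

H = word_entropy(docs)
# Descending entropy: words spread evenly over many documents come first.
ranked_by_entropy = sorted(H, key=lambda w: (-H[w], w))
```

Words that appear in many documents with a stable share of the text accumulate the largest sums, so they rise to the top of `ranked_by_entropy`.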
4.5. Aggregation
The three ordered lists reveal different aspects of the features of stop words. How does one aggregate them? What kind of rules could ensure the fairness of the final result? The same problem was faced before social choice theory came into being. The Borda rule [28] is a well-known method in social choice theory. Its purpose is to aggregate the information contained in a set of crisp relations in order to obtain a choice set containing the most preferred alternatives or a ranking. Using terminology from the voting literature, we can see each stop word of a ranked list as a candidate and each list as a voter. Each candidate receives points from each voter according to its rank in the voter’s list. For example, the top-ranked candidate receives n points, where n is the number of candidates in the respective ranked list. The total Borda score of a candidate is the sum of its scores from each ranked list in which it appears. If a candidate is not in the top-k list of some voter, it receives a portion of the remaining points of that voter (each voter has a fixed number of points available for distribution). Using the Borda rule, we obtain the top 20 Persian light stop words, which are ranked in Table 4.
Table 4. Top 20 Persian light stop words after applying ‘Borda’ ranking.
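One possible reading of the Borda aggregation is sketched below. It is not necessarily the exact variant used in the paper: here n is assumed to be the number of distinct candidates over all lists, each voter distributes the points n, n − 1, …, 1, and candidates missing from a voter's list share that voter's undistributed points equally. The function name and the toy lists are hypothetical.

```python
def borda_aggregate(ranked_lists):
    """Aggregate ranked lists with a Borda count.

    Assumption: every voter has n + (n - 1) + ... + 1 points, where n is
    the number of distinct candidates over all lists. Candidates absent
    from a list share that voter's leftover points equally.
    """
    candidates = set().union(*ranked_lists)
    n = len(candidates)
    scores = dict.fromkeys(candidates, 0.0)
    for ranking in ranked_lists:
        for position, word in enumerate(ranking):
            scores[word] += n - position  # n points for the top rank
        missing = candidates - set(ranking)
        if missing:
            leftover = sum(range(1, n - len(ranking) + 1))
            for word in missing:
                scores[word] += leftover / len(missing)
    final = sorted(candidates, key=lambda w: (-scores[w], w))
    return final, scores

# Toy ranked lists standing in for the frequency, IDF and entropy lists.
by_tf = ["va", "dar", "az"]
by_idf = ["va", "az", "dar"]
by_entropy = ["dar", "va", "ke"]

final_ranking, scores = borda_aggregate([by_tf, by_idf, by_entropy])
```

With these toy lists the word ranked first by two voters and second by the third ends up at the top of `final_ranking`, which is the behaviour the voting analogy suggests.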
5. Results and discussion
5.1. Materials
For our experiments in the Persian language, we used a standard test collection for Persian text called Hamshahri [14]. The Hamshahri collection is the largest test collection for Persian text and is prepared and distributed by the University of Tehran. Hamshahri is one of the most popular daily newspapers in Iran and has been published for more than 20 years; it was one of the first online Persian newspapers. Since 1996, Hamshahri has presented its archive to the public through its website. The documents of the Hamshahri collection are news articles from the Hamshahri newspaper from 1996 to 2002. The corpus size is 345 MB, or 564 MB with tags. The corpus contains 166,774 textual news articles on a variety of subjects (82 categories, such as politics, literature, art and economy) and includes nearly 417,000 different words. The total number of words in the corpus is about 63 million. Hamshahri articles vary between 1 and 140 KB in size, with an average of 1.8 KB. Although the Hamshahri collection is built according to TREC specifications, it is much smaller than TREC: the TREC corpus contains about 1.25 million documents and is 18 GB in size.
5.2. Results
From the 417,339 unique words in the corpus, we first extract all words of between one and five letters (79,989 words). Only one one-letter word has meaning in Persian text. This word is ‘
5.3. Discussion
From a linguistic viewpoint, Persian stop words, like English stop words, are usually words whose parts of speech are adverbs, prepositions, interjections and auxiliaries. According to their domain, we can classify all stop words into two categories. The first kind is domain-independent, or ‘generic stop words’, which are stop words in the general domain. The second kind is document- or domain-dependent stop words; we call them ‘domain stop words’. As we applied our method to words with n letters (n = 2, …, 5), we obtain mostly generic stop words, but we also find a few stop words which are domain-dependent, like ‘
A comparison of our results and the previous Persian stop word lists is shown in Table 5. Our list was checked against the Persian stop word list identified by Davarpanah et al. [16]. This list was obtained based on the dictionary and expert judgement. The authors extend their list by combining the stop words that already exist in the list and, at least semantically, it can be powerful. The comparison shows that there is a high percentage of coincidence. The difference in coincidence between these two lists is explained by the fact that the corpus was different for generating the two stop word lists. Meanwhile, another difference of these two lists occurs in their building method. Another comparison between the Persian stop word list generated in our algorithm and the stop word list of Brown corpus [3], which is a well-known and widely used corpus in English, was performed. The results show a high coincidence between the English stop words and the top 100 Persian stop words. This similarity is stronger in the top 10 of Persian stop words. In this particular case, there is only one word, which is the imperfective verbal particle sign in Persian language, without English equivalent. The difference in coincidence can be explained by the fact that, generally, a large number of stop words are due to the language characteristics. These two languages are different in nature and, certainly, there are some words without equivalences between them.
Table 5. Overlapping comparison of Persian stop words and general English stop words.
As previously mentioned, dropping stop words in an IR system reduces the index size. Accordingly, stop word elimination decreases the search time for a given query and does not influence the retrieval effectiveness. Figure 2 shows that the top 20 light stop words reduce the index size of the Persian text by about 22%.

Figure 2. Contribution of stop words to index size reduction.
We also notice that the first 32 Persian light stop words have a great impact on the index size reduction. Beyond that point, adding more stop words yields very little further reduction. The experimental results (see Figure 2) indicate that there is a cut-off in the index size reduction once the first 32 stop words are considered. We can, therefore, state that there is a ceiling in the contribution of stop words to the index size reduction. The set of Persian stop words reduces the size of index terms by about 27% and the optimal number of stop words to be considered is 32. Table 6 shows the remaining light stop words that, together with those shown in Table 4, complete the list of the first 32 Persian stop words.
Table 6. The remainder of the first 32 stop words (see Table 4).
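The cumulative effect of removing the top-ranked stop words can be approximated with the sketch below. It assumes that index size is measured by the number of token occurrences that would otherwise be indexed, which may not be exactly how the index size was measured in the experiments; the function name and the toy frequencies are hypothetical and are not corpus data.

```python
def reduction_curve(term_freq, ranked_stop_words):
    """Cumulative share of token occurrences removed by the first k stop words.

    Assumption: index size is proportional to the number of token
    occurrences, so removing a word saves its collection frequency.
    """
    total = sum(term_freq.values())
    if total == 0:
        return []
    curve, removed = [], 0
    for word in ranked_stop_words:
        removed += term_freq.get(word, 0)
        curve.append(removed / total)
    return curve

# Toy collection frequencies (hypothetical, summing to 100 for readability).
term_freq = {"va": 50, "dar": 30, "az": 10, "ketab": 6, "shahr": 4}

curve = reduction_curve(term_freq, ["va", "dar", "az"])
```

Plotting such a curve against k is one way to look for the point where adding further stop words stops paying off.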
The 10 most common Persian stop words occur, on average, in 86% of Persian documents. They also occupy about 18% of tokens in a document. The details of the 10 most common words are shown in Table 7, including the percentage of Persian documents in which each word appears. The word ‘
Table 7. The 10 most common words in Persian text.
A question arises: which of the three lists is closest to the final list? To answer it, we measured the similarity between each of the three lists and the final aggregated list using the Jaccard index. The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for comparing the similarity and diversity of sample sets. This index only uses presence–absence data. Suppose A and B are two non-empty finite sets. The Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Table 8 shows the Jaccard index for the top 20 Persian light stop words. The best similarity is between the final list and the list of terms with high frequency in the collection. In this case, the Jaccard index is equal to 1; in other words, the two lists contain exactly the same elements. We have also calculated the distance between each list and the final list using Spearman’s footrule, which is the sum of the absolute values of the differences between the ranks. This measure gives the total displacement needed for all elements of one list to be ranked as in the other. For each word $w_i$ (i = 1, …, N), its displacement between two lists A and B is given by:
$$d_i = |r_A(w_i) - r_B(w_i)|$$
and the total displacement of all elements is given by:
$$F(A, B) = \sum_{i=1}^{N} d_i$$
where $r_A(w_i)$ and $r_B(w_i)$ denote the position of $w_i$ in lists A and B, respectively.
Table 8 also shows the result of the distance measurement. The distance between the final list and the list of terms with high frequency in the collection is smaller than for the other lists.
Table 8. Similarity and distance measure between the final list and other lists.
We could, therefore, conclude that a simple way to identify a stop word list for a given Persian corpus is to extract the high-frequency terms in the collection.
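Both comparison measures used in this section are easy to sketch. The Jaccard function follows the definition given in the text (size of the intersection over size of the union); the footrule function sums absolute rank differences and assumes, as a convention chosen only for this example, that a word missing from the second list is placed just after its last position. Function names and toy lists are hypothetical.

```python
def jaccard(list_a, list_b):
    """Jaccard index of two non-empty lists treated as sets."""
    a, b = set(list_a), set(list_b)
    return len(a & b) / len(a | b)

def footrule(list_a, list_b):
    """Spearman's footrule: sum of absolute rank differences.

    Assumption: a word of list_a that is absent from list_b is given the
    rank len(list_b), i.e. the position just after the end of list_b.
    """
    rank_b = {word: pos for pos, word in enumerate(list_b)}
    absent = len(list_b)
    return sum(
        abs(pos - rank_b.get(word, absent)) for pos, word in enumerate(list_a)
    )

# Toy lists (hypothetical): an aggregated list and two single-criterion lists.
final = ["va", "dar", "az", "ke"]
by_tf = ["va", "dar", "ke", "az"]
by_idf = ["va", "az", "dar", "be"]

sim_tf, dist_tf = jaccard(final, by_tf), footrule(final, by_tf)
sim_idf, dist_idf = jaccard(final, by_idf), footrule(final, by_idf)
```

In this toy case the frequency list has the same members as the final list (Jaccard index 1) and a smaller footrule distance, mirroring the kind of comparison summarized in Table 8.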
6. Conclusions and future work
As with lexical analysis in general, the stop word list policy will depend on the database and features of the user and the indexing process. From an information retrieval system perspective, stop words are not good discriminators and are useless for the purposes of retrieval. This paper has described a method to automatically create a stop word list for a Persian textual information retrieval system. It could be used for natural language processing and, in particular, for information retrieval purposes. First, different stop word lists are extracted based on the collection term frequency, the normalized inverse document frequency ($\mathrm{idf}_k$) of a word in the collection and the word entropy measure. The principal list is then the aggregation of the three lists obtained by each method. Our method only extracts the Persian light stop words. A ‘light stop word’ is a frequently occurring, non-compound word containing few letters (between two and five). By combining light stop words, we can construct a complete stop word list for a Persian textual information retrieval system.
Our results also show that extracting the high-frequency terms in the collection is a very effective way to obtain a light stop word list, because this list has the highest similarity to, and the smallest distance from, the final list obtained by aggregating the three lists.
Since stop words (verbal and non-verbal) represent a good percentage of the tokens in Persian texts, removing them reduces the size of the indexing structure considerably. For example, the top 20 Persian stop words can reduce the size of index terms by about 22%. This result is roughly similar to the English case, where the 10 most frequently occurring words typically account for 20–30% of the tokens in a document. The first 32 Persian light stop words have a great impact on the index size reduction and the set of stop words can reduce the size of index terms by about 27% in the Persian text.
Our stop word list is derived from a Persian corpus and can be considered a general Persian stop word list. Only a few words, although present in the list, should not be considered stop words in a general collection because they depend on the corpus. It remains to apply our method to different corpora to extract several stop word lists and, by aggregating them, to identify a more general Persian stop word list.
Funding
This work has been partially supported by the project entitled ‘Compresión y Recuperación de Contenidos Multilingües’ financed by the National Plan I + D/I + D + I (TIN2009-14009-C02-02).
