Abstract
A core task in text prediction is to use the label attributes of texts, together with an algorithm and a heuristic approach, to capture the association between texts and to extract the texts that are useful to the user. This paper proposes a new method for this task. First, multi-label attributes are chosen as the feature structure of a text, and they are classified and assigned values according to the types of statistical data. Second, considering the relations between texts, we improve the traditional maximum entropy method so that the numbers of leading texts and subsequent texts can be controlled at the same time. Our method strengthens the association between texts, and the texts obtained through the label attributes are more consistent in direction and more highly correlated, so that similar texts can then be predicted. Experiments show that, by considering the multi-label attributes of texts and controlling the numbers of leading and subsequent texts, recall and precision improve compared with similar existing methods.
Introduction
Nowadays, prediction algorithms help users get the texts they need. Consider the following situation: when a user obtains a text, it means that the user is interested in its content [1]. Since text data is itself a kind of refined data, the ideas it expresses always have some limitations. To respond to the needs of users, a text usually has to be analyzed comprehensively and from multiple angles, which can produce a large number of label attributes that meet the needs of users from different aspects [2]. Texts are usually obtained through their label attributes, but these attributes are abstract and cannot reflect all the content of a text. As a result, the user often cannot capture the needed text all at once and has to interact with the texts through their label attributes many times. During this interaction, there will be some correlation between the texts [3]. We call the text obtained first the leading text and the text obtained by prediction the subsequent text; that is, the subsequent text is based on the leading text. For text prediction specifically, we need a method that analyzes the characteristics of the texts the user has obtained and predicts texts with similar characteristics [4].
The principle of a text prediction method is thus to estimate the probability of a text in a particular context, that is, a conditional probability [5]. Applying information entropy to obtain such a conditional probability is currently a popular research direction [6]. For example, Rao Y et al. pointed out that the maximum entropy model is appropriate for social media use cases and found that the user emotion shown in short texts can be classified [7]. Bhattacharya S discussed how different text types reflect their unique linguistic features in their respective characteristic entropy [8]. Xu L H et al. extracted the common strings from texts and put forward a text similarity method [9].
However, the work above, which applies information entropy to text prediction, has the following problems. (1) Whether the context is an existing context or an event-based context, the leading text may have to be taken as a premise. The user follows this premise to obtain further texts in sequence until the obtained texts meet the user's requirements. The texts obtained at different times are not independent of each other, yet the user cannot quantify the relations between them. Moreover, most existing methods use a single text as the leading text to predict the subsequent text. (2) For both the obtained text and the predicted text, existing methods usually consider only a single label attribute, which is quite similar to the 1-gram problem [10, 11]. This approach is very successful in word segmentation, POS tagging, word sense disambiguation and phrase recognition. For text access, however, users generally cannot obtain a text through a single label attribute. As the texts are not mutually independent, the correlation model of text data is complicated. Moreover, this problem cannot be transformed into an n-gram problem, because the corpus is so large that the statistical model cannot be established accurately.
It is therefore necessary to propose a method that allows users to obtain one or more texts continuously on the basis of one or more existing texts. To solve the problems above, this paper proposes a text prediction method based on multi-label attributes and an improved maximum entropy model. The differences between our method and the existing methods are as follows. (1) The improved method fully considers the non-independence of the texts; we do not avoid the fact that the predicted texts are related. We redesign the initial conditions of the maximum entropy model and obtain the conditional probability of the subsequent texts given the leading texts. (2) Since text data is a kind of unstructured data, a text is obtained mainly through its label attributes. Existing entropy-based text prediction methods analyze a single label attribute, whereas the method proposed in this paper refines multi-label attributes for this purpose. (3) Although label attributes are unstructured data, many of them can be classified once the general rules of natural language and their reflection in the text are taken into account. In this paper, different types of label attributes are assigned values by different methods, which allows texts to be refined and screened more effectively. (4) To distinguish our method from existing entropy-based text prediction methods, the label attributes of a text are defined as parameters in the maximum entropy model, which gives better prediction than ordinary methods that process a single label attribute. It is also more accurate than methods that use the raw text data directly in the maximum entropy algorithm.
The main contributions of this paper are as follows. (1) We introduce an improved maximum entropy model that obtains the subsequent texts on the premise of multiple leading texts and extracts the subsequent texts with larger conditional probability. Neither the number of subsequent texts nor the number of leading texts is restricted to one. (2) We convert the text information into multi-label attributes according to the types of statistical data and classify the texts according to the label attributes. (3) We take the leading texts, the subsequent texts and the statistical data types into consideration in the prediction, and we verify the practicality of the proposed method on a larger scale with a real data set. The method presented in this paper analyzes the text more comprehensively in order to provide more accurate text prediction.
The rest of this paper is organized as follows. In the second chapter, we give a brief overview of related work on text prediction. In the third chapter, we introduce some basic definitions, analyze the possible correlation between texts with information entropy after obtaining the joint self-information of the text data, and discuss the possibility of applying multiple leading texts and subsequent texts to the maximum entropy model. In the fourth chapter, we use the improved maximum entropy model to extract the subsequent texts. In the fifth chapter, we convert text data into multi-label attributes by using the classification of statistical data, and sort and assign values to these label attributes. In the sixth chapter, we give examples to verify the proposed method. Finally, we draw conclusions in the seventh chapter.
Related work
In recent years, much research has focused on applying information entropy to text. This work includes the following.
Tripathy A et al. considered the application of the maximum entropy model to social media use cases and introduced a maximum entropy model for intensive sentiment classification. They obtained the probability under dense sampling using the characteristic function of densely sampled short texts, and suggested that sentiment analysis is more practical for short texts than for long ones [12]. Yin C et al. studied steganography in plain text. They considered that finding hidden secret information in plain text is a challenge because of the lack of redundant information, and used a method based on N-grams and entropy to measure stego-text [13]. Wu G et al. applied the maximum entropy model to mobile text classification to realize an automatic text categorization system based on cloud computing, implemented with MapReduce [14]. According to Wu G and Wang H, feature selection plays an important role in text classification and directly affects classification accuracy. Because the traditional expected cross entropy method does not consider document frequency, they improved the traditional formula and proposed a text feature selection method based on word frequency information [14, 15]. Chandrasekar P et al. held that text classification is mainly done by the classifier and that naive Bayes and maximum entropy are the most effective approaches; their paper implemented an improved method based on the maximum entropy classifier [16]. Wang T noted that term weighting schemes are widely used in information retrieval and text classification, but that term weighting is limited, since specific terms may be more useful for distinguishing different categories of text, and these terms tend to have smaller entropy. The author discussed the relation between the discriminating power of terms and their entropy over a set of categories, then proposed two entropy-based term weighting methods and showed that they were more effective with KNN and SVM [17]. Kuruvila M et al. applied information entropy theory to the Himalayan Rahm language and pointed out that each character needed 4.8 bits [18, 19]. According to Williams R et al., the data can be layered; the authors suggested that the maximum entropy model could be introduced into the hierarchical structure to predict the maximum entropy of unknown text [20].
In summary, the existing methods based on information entropy have not considered the problem of using comprehensive, detailed information in text prediction. These methods basically distinguish only a limited number of label attributes and pay little attention to the prediction of multiple texts. This paper proposes a text prediction method from this point of view.
The correlation of text data and entropy
In this section, we will discuss the entropy of text data, and describe the correlation of text according to the obtained entropy.
Suppose that
Equation (1) is the original definition of information entropy. Equation (1) can be written in the following:
From Equations (2) and (3), we can see that there is some relation among {x_1, x_2, ⋯, x_i, ⋯, x_n}. As there is a certain relationship between the texts selected by the user, it then follows that:
Here, the Equation (4) becomes:
Equations (4) and (5) denote the loss of information when we convert the self-information of
Given the above derivation, we can make the following three observations.
First, for the self-information of
Second, for any text x_i other than x_n, when the user obtains x_i, it means that the information of the whole set of texts {x_1, x_2, ⋯, x_{i-1}} before x_i could meet the user's needs. Therefore, x_i is chosen on the basis of the amount of information provided by {x_1, x_2, ⋯, x_{i-1}}. For the user, the amount of information I(x_i) should satisfy I(x_i) ≥ 0. When I(x_i) > 0, x_i is not independent of {x_1, x_2, ⋯, x_{i-1}}. When I(x_i) = 0, the text x_i is found to be a useless text, and the user u cannot get useful information from it. I(x_i) = 0 indicates that x_i is not independent of {x_1, x_2, ⋯, x_{i-1}}, but is independent of {x_{i+1}, ⋯, x_n}. From Equation (4), the difference value δ_n between
Third, based on the analysis above and on Equations (4) and (5), we also obtain the relation p(x_1, x_2, ⋯, x_i)·p(x_{i+1}, ⋯, x_{i+k}) ≥ p(x_1, x_2, ⋯, x_{i+k}), where 1 ≤ k ≤ n - i. Then, once {x_1, x_2, ⋯, x_i} is fixed, we can analyze each text in {x_{i+1}, ⋯, x_{i+k}} accurately.
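The quantities used in this chapter can be illustrated numerically. The following is a minimal sketch, not the authors' derivation: it computes self-information and entropy with the standard Shannon definitions and checks the chain rule I(x, y) = I(x) + I(y | x) on a toy access log. The log pairs and all function names are hypothetical.

```python
import math
from collections import Counter


def self_information(p: float) -> float:
    """Self-information I(x) = -log2 p(x), in bits."""
    return -math.log2(p)


def entropy(probs) -> float:
    """Shannon entropy H = -sum p log2 p, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical access log: each tuple is (leading text, subsequent text).
pairs = [("a", "b"), ("a", "b"), ("a", "c"), ("d", "b")]
n = len(pairs)
joint = {k: v / n for k, v in Counter(pairs).items()}
lead = {k: v / n for k, v in Counter(x for x, _ in pairs).items()}

# Chain rule: I(x, y) = I(x) + I(y | x), with p(y | x) = p(x, y) / p(x).
x, y = "a", "b"
p_cond = joint[(x, y)] / lead[x]
i_joint = self_information(joint[(x, y)])
i_chain = self_information(lead[x]) + self_information(p_cond)
```

In this toy log the subsequent text depends on the leading text, so p(y | x) differs from the marginal p(y); that dependence is what the conditional models in the following chapter are meant to capture.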
Through the above analysis, we obtain three kinds of relations between the texts, and these relations become the parameters of different maximum entropy models. Firstly, the relation between x_n and the texts {x_1, x_2, ⋯, x_{n-1}} before x_n in the text set
From the third chapter, we can see that the texts are not independent of each other: the previous texts affect the user's choice of the later texts. To account for this influence, we analyze three different situations.
Maximum entropy model based on multiple text
We can confirm the relation between X_{i-1} = {x_1, x_2, ⋯, x_{i-1}} and Y_i = {x_i, ⋯, x_n} in text set
Three kinds of improvement of maximum entropy model based on multiple text
Assuming that the S represents a series of label attributes of text
The first situation: when num(Y_i) = 1, the model denotes the relation between the texts X_{i-1} and x_i, which means that x_i is connected with all the elements in X_{i-1}. With
Here,
The second situation: when num(Y_i) = n - j, the model denotes the relation between the texts X_{i-1} and {x_i, x_{i+1}, ⋯, x_n}. With n texts there are n - 1 conditional probabilities, which means that, under the same initial conditions, n - 1 situations have to be taken into account. Here,
Here,
The third situation: when 1 < num(Y_i) < n - j, the model denotes the relation between the texts X_{i-1} and {x_i, x_{i+1}, ⋯, x_{i+k}}, where k = num(Y_i). Since there are n texts for each user, there may be k - i + 1 possibilities. This case presents the comprehensive situation of the texts appropriate for the user under the same initial conditions. There is,
Here,
In the above three situations, Z(X_{i-1}) is the normalization factor and λ is the characteristic parameter, which indicates the importance of the characteristic function f(Y_i, X_{i-1}) to the model. Once the text data is given, we can calculate the empirical distribution of
The three situations above are compared in four aspects in Table 1.
Comparison of four different parameters in three situations
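The three situations share the usual conditional maximum entropy form, in which a weighted sum of characteristic functions f(Y_i, X_{i-1}) is exponentiated and divided by the normalization factor Z(X_{i-1}). The following minimal sketch illustrates that generic form only; it is not the authors' improved model. The texts, the two feature functions and the weights are hypothetical, and in practice the weights λ would be fitted to data rather than set by hand.

```python
import math


def maxent_conditional(candidates, context, features, weights):
    """Return p(y | context) for each candidate y under a log-linear model."""
    scores = {
        y: sum(w * f(y, context) for f, w in zip(features, weights))
        for y in candidates
    }
    m = max(scores.values())  # subtract the max score for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())  # normalization factor Z(context)
    return {y: v / z for y, v in exp_scores.items()}


# Hypothetical candidate subsequent texts, each described by a set of labels.
texts = {
    "t1": {"entropy", "prediction"},
    "t2": {"entropy", "classification"},
    "t3": {"steganography"},
}
# Hypothetical leading texts X_{i-1}, also given as label sets.
leading = [{"entropy", "prediction"}, {"prediction", "recommendation"}]


def shares_label(y, ctx):
    """1.0 if candidate y shares at least one label with any leading text."""
    return 1.0 if any(texts[y] & x for x in ctx) else 0.0


def overlap_fraction(y, ctx):
    """Fraction of y's labels that occur somewhere in the leading texts."""
    labels = set().union(*ctx)
    return len(texts[y] & labels) / len(texts[y])


# Weights are illustrative values, not fitted parameters.
probs = maxent_conditional(texts, leading, [shares_label, overlap_fraction], [1.0, 2.0])
best = max(probs, key=probs.get)
```

Passing several leading texts as the context mirrors the idea of conditioning on X_{i-1} rather than on a single text; ranking the candidates by probs and keeping the top k would correspond to choosing num(Y_i) = k.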
First of all, for the representation of conditional probabilities, the second and third situations are more constrained than the first. In the first situation, for the keyword X_{i-1}, we only need to consider the next text x_i that corresponds to X_{i-1}. In the second situation, the number of elements in the text set is n, which covers the user's first text through the last one. The third situation takes n - i texts into consideration and discusses them separately n - i times. Then, for the characteristic function, once the texts X_{i-1} are fixed, the number and meaning of the following texts differ. Secondly, for the constraint conditions, the second and third situations each have one element that is not taken into account: the last one in the second situation and the first one in the third situation. These constraints are related to the operation of the model as well as to its definition, and the different constraint conditions determine the difference in the normalization factor. Finally, we compare the conditional probabilities of the methods in the first and second situations when i ≠ n,
For text data, the vast majority of label attributes are not numerical and therefore cannot be processed directly. Even if they are treated as numerical data, different label attributes have different units. Therefore, to obtain the optimal value of p(Y_i | X_{i-1}), we have to quantify the label attributes.
According to the type description of statistical data [21], considering the particularity of the text data, the label attribute is divided into the following four types: frequency statistics data, interval-level data, sequencing data, and ratio-level data.
(1) Frequency statistical data. For text data, the most useful frequency statistical data is the keyword information. In text mining, the frequency statistics of a word count the same keyword and similar keywords across many texts. They include the keyword importance degree and the number of keywords in these texts. To describe the keyword importance degree, we use the TF-IDF method to give a normalized assignment [22]. Assume that an arbitrary keyword is denoted by z_j. The number of texts that contain the keyword z_j in the text set
The keyword importance degree is between 0 and 1.
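As a rough illustration of a normalized TF-IDF assignment, the sketch below uses the standard TF-IDF weighting and divides by the largest score so that every keyword importance degree lies between 0 and 1. The paper's exact formula is not reproduced here; the max-normalization, the function name and the sample documents are assumptions.

```python
import math
from collections import Counter


def keyword_importance(docs):
    """docs: list of keyword lists. Returns, per document, keyword -> score in [0, 1]."""
    n_docs = len(docs)
    # Document frequency: number of texts containing each keyword.
    df = Counter(k for doc in docs for k in set(doc))
    raw = []
    for doc in docs:
        tf = Counter(doc)
        raw.append(
            {k: (c / len(doc)) * math.log(n_docs / df[k]) for k, c in tf.items()}
        )
    top = max((v for d in raw for v in d.values()), default=0.0)
    if top == 0:
        return raw  # every keyword occurs in every text, so all scores are 0
    # Assumed normalization: divide by the largest score in the collection.
    return [{k: v / top for k, v in d.items()} for d in raw]


# Hypothetical keyword lists for three texts.
docs = [
    ["entropy", "prediction", "entropy"],
    ["prediction", "label"],
    ["label", "text", "prediction"],
]
scores = keyword_importance(docs)
```

With this weighting, a keyword that appears in every text (here "prediction") receives importance 0, while a keyword concentrated in one text receives a value close to 1.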
(2) Interval-level data. Compared with sequencing data, interval-level data is an accurate measure of a certain attribute of the text. For text data, the number of citations is a typical interval-level attribute, and it can be obtained easily [23]. A higher citation count indicates a more valuable text, but if the difference between the citation counts of two texts is very small, their relative importance becomes difficult to determine. Therefore, in this paper the citation count is assigned according to the taxonomy of interval-level data. Four levels of citation are defined, which correspond to the intensity of citation. Denoting the four levels by {D1, D2, D3, D4}, they represent {strong citation intensity, medium citation intensity, weak citation intensity, no citation}, respectively. After discarding level D4, the correlation coefficients of the three remaining levels can be adjusted according to their mutual citation intensity. Specifically, following a well-known budget allocation rule in economics, the 60:30:10 rule, the strongly correlated texts account for 10% of the total number, the medium correlated texts for about 30%, and the weakly correlated texts for 60%. The citation intensity is then adjusted as follows: when D2 is cited by D1, D2 is adjusted from medium to strong citation intensity; when D3 is cited by D1, D3 is adjusted from weak to medium citation intensity; and when D3 is cited by D2, the weak correlation strength of D3 is multiplied by 2. In this way, the citation intensity is first assigned according to the rule and then adjusted according to the citation relations between the texts, which completes the assignment of interval-level data.
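The level assignment and the three adjustment rules can be sketched as follows. This is only an illustration under stated assumptions: the numeric intensities in BASE are hypothetical (the text above does not give them), ranking by raw citation count is assumed, and integer division is used for the 10% and 30% cut-offs.

```python
def citation_levels(cited_counts):
    """Map text id -> 'D1'..'D4' using a 10/30/60 split over the cited texts."""
    uncited = [t for t, c in cited_counts.items() if c == 0]
    cited = sorted(
        (t for t, c in cited_counts.items() if c > 0),
        key=lambda t: cited_counts[t],
        reverse=True,
    )
    n = len(cited)
    n_strong = n // 10        # top 10%: strong citation intensity
    n_medium = (3 * n) // 10  # next 30%: medium citation intensity
    levels = {t: "D4" for t in uncited}
    for rank, t in enumerate(cited):
        if rank < n_strong:
            levels[t] = "D1"
        elif rank < n_strong + n_medium:
            levels[t] = "D2"
        else:
            levels[t] = "D3"  # remaining 60%: weak citation intensity
    return levels


# Hypothetical numeric intensities; these values are assumptions.
BASE = {"D1": 0.60, "D2": 0.25, "D3": 0.05}


def adjusted_intensity(level, cited_by_levels):
    """Apply the adjustment rules for a text at `level` cited by texts at the given levels."""
    if level == "D2" and "D1" in cited_by_levels:
        return BASE["D1"]      # medium promoted to strong
    if level == "D3" and "D1" in cited_by_levels:
        return BASE["D2"]      # weak promoted to medium (checked before the D2 rule)
    if level == "D3" and "D2" in cited_by_levels:
        return 2 * BASE["D3"]  # weak value doubled
    return BASE.get(level, 0.0)  # D4 is discarded and contributes 0


# Hypothetical citation counts: ten cited texts and two uncited ones.
counts = {f"t{i}": 100 - i for i in range(10)}
counts.update({"u0": 0, "u1": 0})
levels = citation_levels(counts)
```

For very small collections the integer cut-offs leave the strong level empty, so a real implementation would need an explicit rule for that case.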
(3) Sequencing data. Text data contains a lot of sequencing data, such as the subject category of the text and the authority of the text. The order of such an attribute is given at the same time as the attribute is classified: the attribute values fall into classes, the classes differ in rank, and we can compare which is better. For instance, we can label the subject category of a text with three types: original disciplines, related disciplines and other disciplines. Likewise, we can divide the authority of a text into five types, namely core A, core B, core C, regular D and others. Although such sequencing data gives an obvious criterion for judging the quality of an attribute, it cannot measure the exact difference between the types. Thus, to convert sequencing data into comparable numerical data, we use the "rule of thumb" to assign values [24]. That is, we define the connection between two variables as the correlation strength. A correlation strength of 0–0.05 means no correlation; 0.05–0.25 means weak correlation; 0.25–0.60 means medium correlation; and 0.60–1 means strong correlation. This method can be applied to sequencing data to compare this attribute across texts. For instance, for the connection between a text and its subject category, the strengths are as follows: original disciplines 0.60, related disciplines 0.25 and other disciplines 0.05. As another example, for the connection between a text and the authority of its affiliated institution, the correlation strength is assigned as follows: core A = 1, core B = 0.60, core C = 0.25, regular D = 0.05 and others = 0.
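The assignments above amount to a lookup table plus a banding rule, which the following minimal sketch makes explicit. The dictionary and function names are hypothetical, and treating each band's lower bound as inclusive (so that 0.05 counts as weak and 0.60 as strong) is an assumption, since the stated ranges share their endpoints.

```python
# Values taken from the assignments described in the text.
SUBJECT_STRENGTH = {"original": 0.60, "related": 0.25, "other": 0.05}
AUTHORITY_STRENGTH = {
    "core A": 1.0,
    "core B": 0.60,
    "core C": 0.25,
    "regular D": 0.05,
    "others": 0.0,
}


def correlation_band(strength: float) -> str:
    """Classify a correlation strength in [0, 1] into one of four bands.

    Assumption: lower bounds are inclusive (0.05 -> weak, 0.25 -> medium, 0.60 -> strong).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("correlation strength must lie in [0, 1]")
    if strength >= 0.60:
        return "strong"
    if strength >= 0.25:
        return "medium"
    if strength >= 0.05:
        return "weak"
    return "none"


# Example: a text in a related discipline from a core B institution.
subject_band = correlation_band(SUBJECT_STRENGTH["related"])
authority_band = correlation_band(AUTHORITY_STRENGTH["core B"])
```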
(4) Ratio-level data. As a data type measured on an interval scale, ratio-level data in text data is reflected in the publication time of the text. Because texts are time-sensitive, the value of text information generally decays over time; for example, a more recent text is more important than one that appeared earlier. Although there is no absolute zero for the time of a text, to make the data easier to assign values to, this paper sets a lower limit on the time, that is, a minimum value for the time of a text. The time of a text in our work therefore refers to the difference between the publication time and this limit. After setting the initial time, we take the time of every text published before the initial time as 0. Starting from the initial time, the text time is the calendar publication time minus the initial time. The assigned time value t is given by (text time - 1)/text time.
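A minimal sketch of this time assignment follows. Measuring time in whole calendar years and returning 0 for a text published exactly at the initial time (where the formula would divide by zero) are assumptions; the function name is hypothetical.

```python
def time_assignment(published_year: int, initial_year: int) -> float:
    """Return t = (text_time - 1) / text_time, where text_time = published - initial.

    Texts published at or before the initial year are assigned 0.
    """
    text_time = published_year - initial_year
    if text_time <= 0:
        return 0.0
    return (text_time - 1) / text_time


# Hypothetical initial year 2000: more recent texts receive larger values.
values = [time_assignment(year, 2000) for year in (1995, 2001, 2002, 2010)]
```

The value grows towards 1 as the publication time moves away from the initial time, which matches the statement that more recent texts are more important.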
Several points about the classification of label attributes need to be explained. (1) Generally speaking, the more label attributes we have, the more clearly the user's needs are expressed. If all four types of data are extracted, the information provided for predicting the unknown data is the most accurate. In practice, however, users cannot always provide all four kinds of data at the same time. Therefore, when selecting label attributes, at least the frequency statistical data needs to be provided. Once the user selects a text, the sequencing data, interval-level data and ratio-level data can be obtained directly, whereas the frequency statistical data has to be extracted manually by the user, and word frequency reflects the most refined information about the user's needs. (2) If the user wants to choose among the four types of label attribute data, there is a selection order determined by the particularity of text data, which also ranks the importance of the four types: frequency statistical data > interval-level data > sequencing data > ratio-level data. However, there is no specific requirement on the order of the label attributes within each type, as long as it is applied uniformly across the texts.
We selected a set of texts from the Web of Science (WOS) to test our method. Searching with the keywords "recommendation system" (RS) returns 31393 texts in the area of prediction systems. Using the analytical tool of WOS, we obtained a data set in a universal network file format (*.net) for citation analysis. In this way, we can not only reveal the internal connections between the texts (which provide the intrinsic links of scientific research), but also find suitable texts to recommend to the user with the help of the provided analytical tool. Among the more than thirty thousand selected texts, 26993 texts have no correlation, and these texts are not considered in the TSLI method. Meanwhile, following the evaluation criteria used in most text prediction work, we define the following concepts:
Here, R_C is the number of recommendable texts, R_T is the total number of texts obtained by the user, and k_T is the number of recommended texts.
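Assuming the usual definitions, precision = R_C / k_T and recall = R_C / R_T, with R_C counted as the recommended texts that the user actually obtained, the two measures can be computed as in the sketch below. This reading of the symbols is an assumption, and the function name and sample identifiers are hypothetical.

```python
def precision_recall(recommended, obtained):
    """Return (precision, recall) for a list of recommended texts.

    recommended: texts returned by the prediction method (k_T of them).
    obtained:    texts the user actually obtained (R_T of them).
    """
    recommended, obtained = set(recommended), set(obtained)
    r_c = len(recommended & obtained)  # recommended texts that were useful
    k_t = len(recommended)
    r_t = len(obtained)
    precision = r_c / k_t if k_t else 0.0
    recall = r_c / r_t if r_t else 0.0
    return precision, recall


# Hypothetical example: four recommendations, three texts obtained by the user.
precision, recall = precision_recall(["a", "b", "c", "d"], ["a", "b", "e"])
```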
Next, we convert the text information into label attributes and determine the values of num(X_{i-1}) and num(Y_i) according to the descriptions in the fourth and fifth chapters. Then we determine the number of keywords.
The evaluation of label attributes
This section discusses the number of keywords, which should be determined according to the application. For keyword searching, the more detailed the keywords we enter, the more accurate the results. For text recommendation, however, too few keywords may return a large number of irrelevant texts to the user, which reduces the precision of the prediction. If there are too many keywords, the user may miss some useful texts, which reduces recall. Therefore, the choice of the type and number of keywords has a direct impact on the quality of a text prediction algorithm. In this paper, keywords are divided into primary keywords, secondary keywords and non-keywords, and we only discuss primary and secondary keywords. We stipulate that there is only one primary keyword, while the number of secondary keywords is not limited, so that recommendable texts can be found in a wider scope. Based on the above analysis, we select num(X_{i,j-1}) = 5 and num(Y_{ij}) = 40. The results are shown in Fig. 1.

Comparison of precision and recall for different number of keywords.
As Fig. 1 shows, the value of avg tf-idf increases steadily as the number of keywords increases. Here,
To verify the correlation between the recommended texts, we compare our method with four other text prediction methods, as follows:
KMR [25] is a keyword matching method that can only recommend articles with similar keywords to users by comparing the keywords of different articles.
BPR [26] is a ranking-oriented algorithm for the item prediction problem on implicit data, introduced by Rendle et al. The BPR algorithm obtains the user's preference information for items by reconstructing the user-item scoring matrix and then optimizes the ranking objective by maximizing the posterior probability.
ItemKNN [27] uses the item's content/attributes as a vector to find similarity relations between the users and thereby realize the prediction process.
ItemAverage [28] predicts unknown texts through the mean value of the text attributes of the user's existing texts.
For convenience of comparison, we abbreviate our method as TPSM. The experimental results are shown in Figs. 2 and 3.

Comparison of precision for different numbers of Y_i and X_{i-1}.

Comparison of recall for different numbers of Y_i and X_{i-1}.
According to the analysis in the fourth chapter, in this section we choose 6 as the number of keywords in the frequency statistical data. In this experiment, we assume that num(Y_i) is used as the ratio-level data, which is analyzed in steps of 5, and this assumption is not related to num(X_{i-1}).
The experiment shows that, for almost all values of num(X_{i-1}) and num(Y_i), the precision and recall of TPSM are higher than those of the other four methods, and the advantage becomes more obvious as num(X_{i-1}) increases. This is because, as num(X_{i-1}) increases, more information becomes available to the TPSM method. This information tends to be more similar, which makes it easier to obtain useful information. At the same time, when the value of num(X_{i-1}) is 40, the precision of the TPSM method is relatively stable, and for any value of num(X_{i-1}) the recall of the TPSM method is relatively high. For example, when num(X_{i-1}) = 5 and num(Y_i) = 40, the average differences between the TPSM method and the other methods in average precision and average recall reach 6.25% and 5%, respectively, and the maximum differences reach 12.5% and 6%, respectively. In these experiments, the TPSM method therefore gives better prediction results across the tested values of num(X_{i-1}) and num(Y_i).
Traditional text prediction with information entropy often lacks alternative approaches: the majority of methods predict texts from a single label attribute. Moreover, those methods do not control the numbers of leading and subsequent texts or the parameters of the conditional probability. To address these problems, this paper proposed a method based on multi-label attributes and an improved maximum entropy model. Our method refines text data into structured data with multi-label attributes and proposes a classification and value assignment for those label attributes. In addition, the prior and posterior data of the conditional probability in the maximum entropy model are controlled and adjusted. In short, the method refines the text data and supports prediction accuracy through precise control of the leading and subsequent texts.
Acknowledgments
This work was partly supported by the National Basic Research 973 Program of China under Grant No. 2011CB 302301 and by the Hubei Province Key Laboratory of Systems Science in Metallurgical Process under Grant No. M201402.
