Abstract
Academic thesis writing is a complex task that requires the author to be skilled in argumentation. The goal of the academic author is to communicate clear ideas and to convince the reader of the presented claims. However, few students are good arguers, and this is a skill that takes time to master. In this paper, we present an exploration of lexical features used to model automatic detection of argumentative paragraphs using machine learning techniques. We present a novel proposal, which combines the information in the complete paragraph with the detection of argumentative segments in order to achieve improved results for the detection of argumentative paragraphs. We propose two approaches: a more descriptive one, which uses a decision tree classifier with indicator and lexical features, and a more efficient one, which uses an SVM classifier with lexical features and a Document Occurrence Representation (DOR). Both approaches consider the detection of argumentative segments to ensure that a paragraph detected as argumentative does indeed contain segments with argumentation. We achieved encouraging results for both approaches.
Introduction
The arguments in academic theses are essential to sustain their assertions. These arguments have a structure that provides numerous statements to support each claim presented in the thesis. An argument is a set of statements (i.e. premises) that individually or collectively provide support to a claim (a conclusion).
Recent research has begun to study automatic argument processing, applying artificial intelligence and argumentation theories in an interdisciplinary way to improve the process of extracting and retrieving information.
For example, in the legal field, argument analysis seeks to facilitate access to jurisprudence that supports a case [18]. Along another line, in scientific biomedical articles, argumentation is studied to identify arguments for or against a hypothesis under investigation [13].
In social networks, arguments are analyzed to identify comments for or against a topic in debate, to observe what is the position of the majority [4] and eventually evaluate these arguments based on whether they comply with an admissible structure [21].
Finally, essay writing is another area where the level of argumentation is evaluated, to assess the student and offer immediate feedback [23]. However, we have not found studies aimed at analyzing textual argumentation in larger academic works such as theses. These writings are often prepared at the end of academic programs to demonstrate the student’s research and writing skills. As a result, such products are important for the student’s further education prospects or future employment. For this reason, a model to detect argumentative paragraphs in theses is essential.
A simple argument contains at least one premise and a conclusion [8]. In Example 1 below, an argument from our corpus is presented with a conclusion which is supported by a single premise.
Example 1:
[The generation of an ontology for a domain represents a challenge,]
The conclusion asserts that the generation of an ontology is a challenge. To support this assertion, the premise gives a reason: the need for experts to complete the task. In addition, we observe the argumentative marker ‘since’, which allows the premise that supports the conclusion to be identified. Identifying the premise provides the basis for taking the given conclusion as true. In the example, we observe two argumentative segments, which provide evidence to identify the paragraph as argumentative.
In this paper, we present a model for the detection of argumentative paragraphs supported with the information of argumentative segments. We propose a method for the classification of the presence of argumentative paragraphs using machine learning techniques together with lexical and indicator features. We also offer a method to perform argumentative unit segmentation employing Conditional Random Fields (CRF) with lexical, syntactic, structural and indicator features to classify sequences by token, to identify segments in the paragraph, and then obtain the possible segments of argumentative components. We integrate the results of both methods to perform an effective detection of argumentative paragraphs. To evaluate our model, we created a corpus of thesis sections (problem statement, justification, conclusions) with annotated argumentative paragraphs.
This article is organized as follows. Section 2 briefly reviews related work, closing with details of the corpus for experiments. The detection of argumentative paragraphs as a whole is presented in Section 3, while Section 4 focuses on the identification of argumentative segments. Section 5 details the fusion of the information for the improved detection of argumentative paragraphs by the two methods. The last section discusses conclusions and further work.
Related work
The detection of argumentative paragraphs, sentences or clauses has been established as a preliminary step to identify the presence of premises or conclusions. For this, researchers [19] classify argumentative and non-argumentative sentences in the Araucaria corpus, representing sentences with features such as combinations of word pairs, verbs and text statistics. Using a Bayes classifier, they report an accuracy of 73.75%.
In addition, [18] employ the corpus of legal texts ECHR with 47 annotated documents, where clauses (sub-sentences) are classified as argumentative or not, using a maximum entropy classifier, and report an accuracy of 80% for the task. Note that legal texts have a particular structure that allows lawyers to identify the arguments.
The identification of argumentative paragraphs in public policy formulation is investigated by [7], who employ five sets of argument categories (justification, explanation, deduction, refutation and conditional) and features based on the mode and tense of verbs. In this work, the authors identify segments of text with argumentation using a J48 decision tree classification algorithm, reporting an F measure of 0.764.
Also, the identification of argumentation in text segments is studied by [12], who built a corpus of 204 documents collected from social networks, which were annotated with their premises. They used structural, lexical, contextual and grammatical features to represent each sentence. They report an F measure of 0.77 employing a logistic regression classifier.
The classification of segments of text that correspond to argumentative components is analyzed by [25]. They employed part-of-speech (POS) tags, a list of keywords and distributional representations to characterize the texts. They report an F measure of 0.3221 using CRF.
Furthermore, deep learning architectures have been employed in the argumentative unit segmentation task, as observed in the investigation of [1], which uses algorithms such as SVM, CRF and Bi-LSTM (bi-directional long short-term memory) with semantic, syntactic, structural and pragmatic features. The authors report the best performance when utilising all the features and an architecture that employs several recurrent Bi-LSTM networks, thus reaching an F measure of 0.885 in the identification of argumentative segments in academic essays.
Corpus
The corpus to identify the argumentative aspects in undergraduate and graduate academic writings comes from the Coltypi collection [11]. This collection has 968 theses and research proposals from the area of information technology and computing, written in Spanish. In particular, our study focuses on sections of Problem statement, Justification and Conclusions, since these are mainly argumentative.
The dataset includes 444 sections with at least two annotations per section. Seven annotators from fields related to Linguistics worked on different subsets of the corpus. This was a challenging task for the annotators, since advanced concepts of computer and information science were discussed in the theses. The annotation study covered 1,973 paragraphs. Cohen's kappa [6] agreement between annotators for the identification of argumentative paragraphs was 0.399, which corresponds to a ‘fair’ level [16].
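The agreement figure above can be illustrated with a minimal sketch of Cohen's kappa for two annotators. The function names are hypothetical, and the verbal bands are an assumption: they follow the commonly used Landis and Koch thresholds, which are consistent with the ‘fair’ and ‘moderate’ labels reported in this paper but are not spelled out in it.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)

def agreement_level(kappa):
    """Verbal band for a kappa value (assumed Landis and Koch thresholds)."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"

# Toy example with two annotators and four paragraphs (invented labels).
a = ["arg", "arg", "none", "none"]
b = ["arg", "none", "none", "none"]
print(cohen_kappa(a, b), agreement_level(0.399))
```

Under these assumed bands, 0.399 falls in the ‘fair’ range and 0.461 in the ‘moderate’ range.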
We also performed the annotation of argumentative components for the task of argumentative unit segmentation. The annotators tagged the segments as premises and conclusions; segments without argumentation received no annotation. We extracted the segments where the annotators agreed on the limits of the selection, and we also accepted the cases where the segment of one annotator is contained within the segment selected by the other annotator.
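The segment extraction criterion can be sketched as follows. This is only an illustration under stated assumptions: segments are represented as hypothetical `(start, end)` token offsets, and in the containment case the sketch keeps the longer span, a choice that the text does not specify.

```python
def contains(outer, inner):
    """True if span `inner` lies within span `outer` (offsets inclusive)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def agreed_segments(spans_a, spans_b):
    """Segments on which two annotators agree.

    A pair matches when the limits are identical or when one span is
    contained in the other. Keeping the outer span is an assumption.
    """
    agreed = []
    for a in spans_a:
        for b in spans_b:
            if contains(a, b):      # identical limits, or b inside a
                agreed.append(a)
                break
            if contains(b, a):      # a inside b
                agreed.append(b)
                break
    return agreed

# Invented spans: exact match, containment, and no match.
annotator_1 = [(0, 10), (15, 20), (30, 35)]
annotator_2 = [(0, 10), (14, 22), (40, 45)]
print(agreed_segments(annotator_1, annotator_2))
```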
Finally, a total of 4,989 segments of premise, conclusion or none types were obtained. The agreement reached for the annotation of argumentative components was 0.461, i.e. a ‘moderate’ level [16]. The complete corpus is described in detail in [10], and is available online at the corpus site. In the following sections, we detail the number of instances taken for the experiments.
Detection of argumentative paragraphs
This section presents the method for the automatic detection of argumentative paragraphs using machine learning techniques. We use the corpus of arguments detailed in the previous section to perform the experiments. We consider a paragraph to contain argumentation when both annotators indicated the presence of argumentation and at least two argumentative components were identified. We adopted these criteria because some paragraphs were annotated as argumentative without agreement at the component level, that is, without at least two components (one premise and one conclusion) to analyze (260 paragraphs).
We found 1,174 paragraphs annotated as argumentative (with_argument) or non-argumentative (without_argument), which met the stated criteria. A proportion of 70.7% (830) of paragraphs was found with arguments and 29.3% (344) paragraphs without arguments.
We employed the NLTK tool to process the paragraphs for tokenization and feature extraction [3]. Representations created for our experiments reflect lexical, syntactic, semantic, and indicator features, which are described next.
Justification Category: a causa de, a el fin y a el cabo, a el fin y a el postre, a fin de cuenta, como, como mostrar, como ser indicar por, con decir te, dar que, de acuerdo con, de hecho, deber a, deber se a, después de todo, el anterior porque, el motivo ser, el razón ser, el razón ser que, en tanto que, en vista de que, gracia a, motivo de que, no en vano, poner que ser consecuencia de, por causa de, por cuanto, por todo ello, porque, pues, puesto que, razón de que, se poder deducir de, se poder derivar de, se seguir de, ser que, ver que, ya que.
Explanation Category: a causa de, a fin de cuenta, así, de otro modo, deber a, decir de otro modo, el motivo ser, en concreto, en definitivo, en otro palabra, en particular, el razón ser, motivo de que, poner, poner por caso, por ejemplo, por ello, por ese razón, por este motivo, por este razón, razón de que, uno ejemplo, uno poner.
Deduction Category: a consecuencia de, a el fin y a el cabo, ante el anterior, así, así pues, así que, como conclusión, como consecuencia, como resultado, concluir que, conclusión, consecuentemente, consiguientemente, correspondientemente, de acuerdo a el anterior, de ahí que, de este forma, de manera que, de tal forma, de tal manera, deducir que, demostrar que, el cual apuntar a el conclusión de que, el cual implicar que, el cual mostrar que, el cual nos permitir inferir que, el cual probar que, el cual significar que, en conclusión, en consecuencia, en definitivo, en fin, en resumen, en resumir cuenta, en sí, en síntesis, en suma, en tal caso, entonces, establecer que, finalmente, implicar que, inferir que, llegar a el, llegar a el conclusión, para, para concluir, para terminar, poder inferir que, por consiguiente, por el que, por el tanto, por ello, por ende, por ese, por este razón, por tanto, por último, probar que, que, resumir, se desprender, se desprender de, se seguir que, ser por ese que.
Refutation Category: a el contrario, a menos, a pesar de, a pesar de todo, ahora, antes bien, aun así, aunque, bien a el contrario, de cualquiera modo, de todo modo, después de todo, empero, en cambio, ese sí, mas, más aun, más bien, muy a el contrario, no obstante, no parecer, pero, pero sin embargo, pesar a, por contra, por el contrario, pues, si bien, sin embargo, sino, sólo que.
Conditional Category: según, con tal que, a condición de que, a menos que, con que, suponer que, aunque, si, en caso de, si y solo si.
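A minimal sketch of how the marker categories listed above could be turned into indicator features is shown next. It assumes that the paragraph has already been lemmatized and that each category contributes a count of matched markers; whether the original representation uses counts or binary flags is not stated, so this is an assumption. The dictionary holds only a few markers per category for brevity, and the `C_` prefix mirrors the ‘C_justification’ attribute name mentioned later in the paper.

```python
# Small excerpt of the marker lists above (lemmatized forms).
INDICATORS = {
    "justification": ["ya que", "puesto que", "porque"],
    "deduction": ["por el tanto", "en conclusión", "por consiguiente"],
    "refutation": ["sin embargo", "no obstante", "aunque"],
}

def count_marker(lemmas, marker):
    """Number of times a (possibly multi-word) marker occurs in the lemma list."""
    parts = marker.split()
    k = len(parts)
    return sum(lemmas[i:i + k] == parts for i in range(len(lemmas) - k + 1))

def indicator_features(lemmas, indicators=INDICATORS):
    """One feature per category: total occurrences of its markers (assumed counts)."""
    return {"C_" + category: sum(count_marker(lemmas, m) for m in markers)
            for category, markers in indicators.items()}

# Invented lemmatized paragraph used only to exercise the function.
lemmas = "el tarea ser difícil ya que requerir experto , sin embargo ser posible".split()
print(indicator_features(lemmas))
```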
First, the feature representations described above were subjected to a feature selection process based on information gain, applied to each type of feature. Where selection was applied, the corresponding column (FS) of Table 1 indicates that the attributes with information gain greater than zero were kept. In two cases, this was not applicable.
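The selection criterion (keep attributes whose information gain is greater than zero) can be sketched in plain Python for discrete features. This is an illustrative stand-in, not the Weka procedure used in the experiments; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Class entropy minus the entropy remaining after splitting on the feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def select_features(columns, labels, eps=1e-12):
    """Names of features whose information gain is greater than zero."""
    return [name for name, values in columns.items()
            if information_gain(values, labels) > eps]

# Toy data: 'ya-que' separates the classes, the other two attributes do not.
labels = ["with", "with", "without", "without"]
columns = {"ya-que": [1, 1, 0, 0], "de": [1, 0, 1, 0], "el": [1, 1, 1, 1]}
print(select_features(columns, labels))
```

The small `eps` tolerance guards against floating-point noise when a gain is mathematically zero.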
Results of argumentative paragraphs detection
The task of detecting argumentative paragraphs was approached as a binary classification to detect whether a given paragraph contains argumentation. In the experiments, we applied 10-fold stratified cross-validation with the Scikit-learn toolkit [22]. We used the same train/test split of each fold in all the experiments. The machine learning tool Weka [14] was used to perform the classification with the default hyper-parameters. We trained the classifiers on the training portion of each fold and applied them to the corresponding test portion. We explored the efficacy of four classifiers: Naive Bayes (NB), Decision Tree (DT), Random Forest, and Support Vector Machine (SVM).
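To make the evaluation protocol concrete, the following standard-library sketch assigns instances to ten stratified folds with a fixed seed, so that the same splits can be reused across experiments. It is a simplified stand-in for the stratified splitting that the text attributes to Scikit-learn, not a reproduction of it; the function names and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Assign instance indices to k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)          # fixed seed: identical folds on every call
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin; the offset staggers where each class starts.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds

def train_test_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Class sizes taken from the text: 830 with and 344 without arguments.
labels = ["with_argument"] * 830 + ["without_argument"] * 344
for train, test in train_test_splits(labels):
    pass  # train a classifier on `train` and evaluate it on `test`
```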
Table 1, in the upper (Single Features) part, shows the Cohen's kappa agreement, accuracy and F1 measures for each representation. We observed that the Decision Tree classifier with lexical features achieved the best macro F measure of 0.752, while the best accuracy of 80.4% was reached by the SVM classifier with the DOR representation.
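The accuracy and macro F1 measures reported in Table 1 follow the standard definitions, sketched below for reference. The function names and the toy predictions are hypothetical; the macro average is taken over the classes present in the gold labels.

```python
def accuracy(gold, pred):
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_for_class(gold, pred, cls):
    """F1 of one class: harmonic mean of its precision and recall."""
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 values."""
    classes = sorted(set(gold))
    return sum(f1_for_class(gold, pred, c) for c in classes) / len(classes)

# Invented gold labels and predictions for five paragraphs.
gold = ["with", "with", "with", "without", "without"]
pred = ["with", "with", "without", "without", "with"]
print(accuracy(gold, pred), macro_f1(gold, pred))
```

Because the macro average weights both classes equally, it is less influenced than accuracy by the majority class (70.7% of the paragraphs contain arguments).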
We analyzed the decision tree created with the lexical features using the attributes with information gain (265), and in its first levels we observed word patterns that are part of argumentative markers. For instance, the bigram ‘ya-que’ (since), an argumentative marker for identifying premises, appears at the root. In lower levels of the tree, we noticed words that are part of markers, such as ‘por-lo’ (for-it), from the marker “por lo tanto” (therefore) that indicates conclusions, as well as the word “debido” (due), which indicates a premise as part of the marker “debido a” (due to). In addition, we notice that the semantic features and the NB classifier obtain the lowest efficacy in terms of accuracy.
Once the efficacy of the features had been observed individually, we evaluated their combinations. We paired the best individual feature, ‘Lexical’, with each of the others. We also created a representation with all the features to test whether it achieves improved efficacy.
Additionally, we include the results of solutions previously proposed by authors such as [19] and [7], without seeking a thorough comparison, since we have a different data set. The representation developed by Florou consists of categories of argumentative markers and features based on the mode and tense of verbs. The representation proposed by Moens consists of combinations of all possible word pairs, main verbs, and text statistics.
In Table 1, the results of the described combinations of features are presented, with the maximum values in bold. We consider the best model to be the combination of lexical and indicator features (Lex+Ind) using feature selection with information gain (279 attributes) and a DT classifier, which reaches the best macro F1 measure of 0.7623. The model with the best accuracy reaches 81.43% using lexical features together with a DOR representation (Lex+DOR), feature selection with information gain, and an SVM classifier. Observe that this model is better at identifying the paragraphs with arguments (with_arg), reaching an F measure of 0.8735. Moreover, we notice that the representations ‘Lex+Ind’ and ‘Lex+DOR’ achieved higher results than our implementations of the representations proposed by Moens and Florou.
An additional analysis was carried out to identify the contribution of each set of features in the classification task. A full representation was built that includes all types of features. To examine them, representations were created in which only one type of feature of the complete representation is omitted.
In Table 1 bottom, we present the results of the representations with information gain as the feature selection criterion, including the best classifier for each representation. Omitting the lexical feature (Lex) yields the lowest efficacy, indicating that it provides the most information to the model for identifying argumentative paragraphs. Omitting the syntactic feature (Syn) yields the highest result, even higher than that of the full representation, which means that syntactic features negatively affect the efficacy of the model.
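The leave-one-out analysis can be sketched as the construction of one reduced representation per feature type. The group names follow the abbreviations used in the text where available (`Lex`, `Syn`, `Ind`); `Sem`, the `All-` naming and the toy feature blocks are assumptions made for illustration.

```python
# Feature types named in the text; "Sem" is an assumed abbreviation.
FEATURE_GROUPS = ["Lex", "Syn", "Sem", "Ind"]

def leave_one_out_configs(groups):
    """Map each ablation name to the feature groups that remain."""
    return {"All-" + g: [x for x in groups if x != g] for g in groups}

def build_representation(feature_blocks, groups):
    """Concatenate the per-group feature vectors of one paragraph."""
    vector = []
    for g in groups:
        vector.extend(feature_blocks[g])
    return vector

# Invented feature blocks for a single paragraph.
blocks = {"Lex": [1, 0], "Syn": [3], "Sem": [0.5], "Ind": [2]}
for name, groups in leave_one_out_configs(FEATURE_GROUPS).items():
    print(name, build_representation(blocks, groups))
```

Each reduced representation is then evaluated with the same folds and classifiers as the full one, so that any change in efficacy can be attributed to the omitted feature type.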
The segmentation of argumentative units, which consists of identifying the argumentative sequences of the text, is an indispensable task for the detection of argumentation. We approached this task by classifying the sequences token by token. To capture the context around each token, we extract several features, detailed below. A Conditional Random Field (CRF) was employed to label the token sequences, as in [15].
The argumentative components were coded using the IOB scheme (abbreviation of Inside, Outside, Beginning) as in [27], considering each sentence as a sequence. The argumentative components were represented using the IOB format tags, where the first token of the component is indicated by ‘B-Arg’, the tokens inside the component with ‘I-Arg’ and all non-argumentative tokens with the ‘O’ label.
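The IOB coding described above can be sketched as follows. The sketch assumes that each argumentative component is given as a hypothetical `(start, end)` pair of token offsets with an exclusive end; the example sentence is invented.

```python
def iob_tags(n_tokens, components):
    """IOB tags for a sentence of n_tokens tokens.

    components: list of (start, end) token offsets, end exclusive (assumed format).
    """
    tags = ["O"] * n_tokens
    for start, end in components:
        tags[start] = "B-Arg"              # first token of the component
        for i in range(start + 1, end):
            tags[i] = "I-Arg"              # remaining tokens inside the component
    return tags

tokens = ["Therefore", ",", "the", "method", "works", "."]
print(list(zip(tokens, iob_tags(len(tokens), [(2, 5)]))))
```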
The representation of each token in the sentences was done by lexical, syntactic, structural and indicator features. The description of each of the features used in the experiments is presented next.
For example, to characterize the second token of the text “La primera definición” (in English, ‘The first definition’), that is, the word ‘primera’ (first), with a window of size one, three variables are used: the word in question (token = first), the previous word (-1: token = the) and the following word (+1: token = definition). With a window of size one, only one word before and one word after the word being characterized are employed.
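The window-based characterization in this example can be sketched as a small feature function. The key format (`-1:token`, `+1:token`) follows the example in the text; lowercasing the tokens and skipping positions outside the sentence are assumptions of this sketch.

```python
def window_features(tokens, i, window=1):
    """Lexical features of token i plus its neighbours within the window."""
    feats = {"token": tokens[i].lower()}
    for offset in range(1, window + 1):
        if i - offset >= 0:                       # neighbour to the left
            feats[f"-{offset}:token"] = tokens[i - offset].lower()
        if i + offset < len(tokens):              # neighbour to the right
            feats[f"+{offset}:token"] = tokens[i + offset].lower()
    return feats

tokens = ["La", "primera", "definición"]
print(window_features(tokens, 1, window=1))
```

With a larger window, such as the 15 tokens used in the experiments, the same function simply adds more neighbouring positions on each side.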
In experiments, we used the corpus of arguments presented previously with 2,971 sentences, composed of 2,865 labels annotated as ‘B-Arg’, 63,399 labels of class ‘I-Arg’ and 59,557 labels of class ‘O’. We performed 10-fold cross-validation using CRF Suite with the averaged perceptron training method.
Table 2, in the single features part, presents the macro F1 measure and accuracy. The best efficacy was observed using a window size of 15 tokens around the token to be tagged. We assume that this window size makes it possible to incorporate information from argumentative markers near the token, which helps in the segmentation. The feature with the best result was T_Lexsyn, achieving a macro F1 measure of 0.548. In contrast, the structure feature T_Est achieves the best efficacy for the labeling of the class ‘I-Arg’, which corresponds to the internal part of an argumentative sequence; an F measure of 0.613 is reached. The indicator feature T_Ind is the best for predicting labels that are not part of an argument, i.e., the ‘O’ class.
Results of argumentative unit segmentation
Finally, we note that the lexical feature T_Lex is the most effective to identify the beginning of an argument sequence ‘B-Arg’. Based on these results, we combine the representations of T_Lex, T_Ind, and T_Est to verify their performance together.
Table 2, in the combination of features part, shows that the best efficacy is obtained with the representation that uses all the features (Previous + T_Lexsyn), with the best accuracy (60.3%) and the best macro F1 measure (0.573). This means that the tokens, the grammatical category, the structure, and the markers all contribute to the segmentation. This representation identifies the tokens that are part of an argument (‘I-Arg’) better than the tokens of the ‘B-Arg’ and ‘O’ classes.
Finally, we generate representations where only one type of feature was omitted to determine the impact of its absence on the model.
In Table 2 bottom, we present the results of these representations (all the features excluding only one), with the highest values in bold and the lowest values underlined. The structural feature (T_Est) is observed to provide the most information to the model, since omitting it yields the lowest performance; in particular, it provides information for the segmentation of the B-Arg and I-Arg classes.
After reviewing the performance of the other features, we deduce that they all provide information for the argumentative segmentation of texts since we did not observe a combination with higher efficacy than the representation with all the features. According to these results, our best model for the argumentative unit segmentation employs all the features (T_Lex + T_Ind + T_Est + T_Syn + T_Lexsyn), reaching a macro F of 0.5733.
The prediction of argumentative paragraphs is improved by incorporating the information of the argumentative segments in the paragraph. The fusion takes the prediction of the argumentative paragraph detection method presented above; in those cases in which no argumentative segments are detected, the paragraph is classified as ‘without argumentation’. Using this criterion, in addition to improving the task of detecting argumentative paragraphs, we guarantee that there are argumentative segments.
The idea is depicted in Fig. 1, which illustrates the process for paragraph argument analysis. The process begins with the input text to the model. Each paragraph is then processed by the argumentative paragraph identification method to obtain a prediction; at the same time, the argumentative unit segmentation method is applied to the paragraph in order to identify the segments that correspond to argumentation. Next, the fusion of both results is performed using the criterion described above. The output of the process is made up of only those paragraphs with arguments and their corresponding identified segments.
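The fusion criterion can be sketched as a small rule over the two predictions. The label names `with_argument` and `without_argument` and the IOB tags follow the text; the function names and the `(start, end)` output format (end exclusive) are assumptions of this sketch.

```python
def extract_segments(tags):
    """Turn a sequence of IOB tags into (start, end) spans, end exclusive."""
    segments, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B-Arg" or (tag == "I-Arg" and start is None):
            if start is not None:
                segments.append((start, i))   # close the previous segment
            start = i
        elif tag == "O" and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(tags)))
    return segments

def fuse(paragraph_prediction, predicted_tags):
    """Keep a 'with_argument' prediction only if some segment was detected."""
    segments = extract_segments(predicted_tags)
    if paragraph_prediction == "with_argument" and segments:
        return "with_argument", segments
    return "without_argument", []

tags = ["O", "B-Arg", "I-Arg", "O", "B-Arg", "I-Arg"]
print(fuse("with_argument", tags))
print(fuse("with_argument", ["O", "O"]))
```

A paragraph is therefore reported as argumentative only when both methods support that decision, and its segments are returned together with the label.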

Fig. 1. Argument analysis model with method fusion.
Some evidence to support this idea is that, if we consider only the presence of argumentative segments for the identification of argumentative paragraphs, a macro F measure of 0.666 and a kappa of 0.36 are obtained. This is due to the large number of paragraphs with detected argumentative sequences (a total of 1,027 paragraphs), against the 830 argumentative paragraphs in the gold standard. For this reason, it is necessary to consider the classification results of both methods to detect the paragraphs with argumentation more accurately.
We propose two approaches for the fusion of the methods:
First approach. This involves a more descriptive model that uses lexical and indicator features with information gain (285 attributes) and a DT classifier. The tree can be presented to the student to explain the decision taken, indicating the rule used, and so provide more complete feedback.
In Fig. 2, we observe in the upper nodes of the tree argumentative markers such as the attributes ‘ya-que’ (‘already-that’), ‘por-lo’ (‘for-it’) and ‘debido’ (‘due’); their presence in a paragraph leads the tree to classify it as argumentative. In addition, the nodes with argumentative categories are highlighted in the tree; these are mostly observed in lower nodes, where they provide information for decision making, in particular the sub-tree with the category ‘C_justification’.

Fig. 2. Decision tree for detection of argumentative paragraphs.
Second approach. This involves using lexical features with DOR and with information gain (1,327 attributes), employing an SVM classifier. With this model a faster response could be given, which would be adequate to offer the instructor a quick global report of the performance of the group.
The fusion of Lex+DOR features with an SVM classifier and the argumentative segment detection model reaches 78.2% accuracy with a macro F1 measure of 0.823, i.e. the best efficacy. The other proposed alternative offers a descriptive solution by using the Lex+Ind features with a DT classifier and the segment detection model, obtaining values close to the best model: 77.8% accuracy with a macro F1 measure of 0.81.
Finally, we compare our results with the Cohen's kappa agreement between annotators for the labeling of argumentative paragraphs in the corpus, which was 0.399 (a ‘fair’ level). Our models reach a higher value of 0.5651, which corresponds to an improved ‘moderate’ level. These results provide support for the use of the proposed models to identify argumentative paragraphs in academic texts.
Automatic argument analysis is a field of research that combines, in an interdisciplinary way, artificial intelligence and theories of argumentation, with the purpose of improving the process of extracting and analyzing information.
In this article, we focus on the argument analysis of academic theses, intending to support students in writing their texts. To accomplish this, we develop and validate methods for the detection of argumentative paragraphs, improved using the information of the argumentative unit segmentation model. These methods could be incorporated into a system to provide the student with assessment and feedback, to improve the argumentation in their writings.
We proposed two approaches for argumentative paragraph detection. The first, more descriptive, employs a decision tree algorithm as a classifier with lexical features and indicators. The second, more efficient, uses an SVM classifier with lexical features with DOR. Both approaches consider the detection of argumentative segments to ensure that a paragraph detected as argumentative has segments with argumentation. In addition, we observe that individual features contribute differently: lexical features provide more information for the detection of paragraphs without argumentation, whereas the DOR representation is useful for identifying paragraphs with argumentation.
One benefit of our method fusion is that the paragraphs identified as argumentative already have their argumentative segments determined, so a component classification task can be applied directly to decide their type (i.e. premise or conclusion), and then to establish whether they are related by support or attack.
The proposed approaches also provide hints on how to tackle the problem with a deep learning approach, focusing initially on lexical and indicator features, possibly coded as embeddings. Such a representation could then be fed to architectures that simultaneously process the paragraph and its sub-sentence segments.
Although our representations, methods and experiments were designed for Spanish, they can also serve other languages with slight changes, since they do not depend on sophisticated resources.
Further work involves the identification of argumentative components (e.g. premises, conclusions), as well as their relations (support or attack), to indicate precisely to students the argument errors and deficiencies identified in their writings.
