Argument component classification in academic writings

Abstract

Argumentation in academic writing is a challenging task required to communicate clear ideas. Exposed ideas have to be supported by reasoned arguments. Arguments are composed of components such as premises and conclusions. In this paper, we present an approach to classify argumentative components using language models and machine learning algorithms on a new corpus of academic theses and research proposals. We explore the use of lexical, syntactic, semantic and indicator features to tackle this task. We found that lexical features provide the best efficacy for the classification. For language models, the best features were syntactical. But our experiments showed that a document occurrence representation with unigrams achieved the best accuracy. We also tested the conclusions about the representation and classifier on theses according to their study level (undergraduate, master, and doctoral). We analyzed the information gain of features and found patterns that are part of argumentative markers.

Keywords

Computer-assisted argument analysis; academic writing; argumentation studies; argument components; annotated theses corpus

1 Introduction

Writing an academic text such as a thesis can be a challenge for students. This kind of writing requires argumentation skills to support presented claims with solid arguments. An argument is a set of statements (i.e., premises) that individually or collectively provide support to a claim (conclusion). In scientific texts, evidence has to be presented in the form of reasoned arguments so that the reader is led to an appropriate conclusion [21]. The automatic analysis of arguments in theses has become a necessity in the academic field, to facilitate the analysis of long texts and shorten revision times. Academic theses are often written at the end of college and constitute one of the major intellectual requirements for a degree, which makes them quite important for students. In previous studies, we approached the task of argument identification [13] and assessment at the paragraph level in justifications [11]. In the research reported here, we focus on a fine-grained analysis at the argumentative component level.

In recent years, researchers have studied the task of automatic processing of arguments in a variety of fields such as Law, with the aim of facilitating access to jurisprudence that supports a case [23, 41]. In political documents, automatic argument analysis has been studied for classification of arguments into concepts and modes such as policy, fact, and value [9]. In scientific articles, particularly in biomedicine, automatic processing of arguments can be applied to more rapidly identify arguments for or against a hypothesis under investigation [15]. In social networks, argumentation analysis is also employed to identify the overall stance (for or against) of comments in a debate [5]. There are also studies oriented to the assessment of argumentation in essays [29, 34]. Still, there remains an absence of studies for larger and more complex academic works such as academic theses, in particular in Spanish.

In this paper, we present methods for argumentative component classification using language models and machine learning techniques with representations of several lexical, syntactic, semantic and indicator features. To evaluate our methods, we create a corpus of thesis sections (problem statement, justification, conclusions) with annotated argumentative components.

The paper is structured as follows. In Section 2, we discuss related work for argumentative component classification. In Section 3, we present a theoretical background for argumentation structures. Section 4 details the corpus used in the experiment. The proposed features and learning approaches are described in Section 5. In Section 6, we report the results of the efficacy of our models for argumentative component classification. Finally, we conclude with final remarks and work in progress.

2 Related work

Recently there has been intensive research in the field of argument detection and classification (e.g., claim or premise). Mochales and Moens [24] performed argument segmentation and classification in legal text from the European Court of Human Rights (ECHR) corpus. They first detect clauses with argumentation to later classify them into the respective argumentative component. They used lexical, syntactic, structural, contextual and discourse features. The classifier employed was a Support Vector Machine (SVM), reporting an F-measure of 0.68 for premises and 0.74 for conclusions. In social media, Goudas et al. [14] identified text fragments that correspond to a component (i.e., claim or premise) using Conditional Random Fields (CRF), achieving an F-measure of 0.4237. Sardianos et al. [32] presented a similar approach with CRF and distributed representations of words to identify segments that correspond to argument components. For the task, they reported an F1-measure of 0.3221.

Stab and Gurevych [33] employed an SVM to classify segments as non-argumentative, major claim, claim or premise, in academic essays. They considered several structural, lexical, syntactic, and contextual features, reporting an accuracy of 77%. Also, Nguyen and Litman [26] performed the same argument component classification with SVM and achieved an accuracy of 79%, using argument and domain words extracted from unlabeled persuasive essays by applying Latent Dirichlet Allocation (LDA). In [8], Daxenberger et al. analyzed claim detection across six datasets. They report that the essence of a claim lies in the lexical features. They achieved an F1-measure of 0.625 on persuasive essays using a logistic regression classifier with discourse, embeddings, lexical and structural features.

Another approach to identify premises is applying techniques of sentiment analysis. Villalba and Saint-Dizier [37] identified discourse structures such as justification, elaboration and illustration, that support opinions (evaluative expressions) in a corpus of hotels and restaurants reviews. They designed argument extraction rules with lexical features such as: terms expressing polarity, adverbs of intensity, and domain verbs, to identify discourse structures. They reported a precision measure of 0.92 and recall value of 0.86, when identifying justifications.

The closest work to the analysis of academic theses in argument mining is that dealing with scientific articles. In [19], the authors reported the creation of a corpus with 24 scientific articles in education, for the sections of introduction and discussion. Four participants annotated argument components as premises or conclusions, as well as four relations (support, attack, sequence, and detail) between these argumentative components, with an average Fleiss' Kappa agreement of 0.41. Thus, we observe that obtaining acceptable levels of annotation agreement in scientific texts is a complex task, which depends on an appropriate annotation guide and on regularly monitoring annotators during the corpus construction. Scientific articles are close to academic theses since they share a similarly complex structure and technical vocabulary. However, each section of an undergraduate thesis is longer, and student writing often contains argumentative errors. In contrast, scientific articles are often prepared by researchers who have more experience in writing.

3 Argument components

An argument consists of several components. These components are a set of assertions that individually or as a whole support another statement [6]. The assertion supported is a conclusion (claim). The conclusion is the principal component of an argument. There is only one conclusion for each argument; however, each conclusion can be based on a series of supporting assertions. Assertions that provide support (or attack) are called premises. A premise is a reason provided by the writer to convince the reader of the conclusion. These components are linked with support or attack relations to model the structure of the argument. Among argumentation theories [10, 39], the consensus is that the structure of an argument consists of several argumentative components. In this work, we adhere to the conclusion-premise model presented by Freeman [10], since it introduces structures employed in argument mining [16].

A graphical representation of an argument structure facilitates the understanding of how argument components interact. Argument diagramming is the area that supports students in formulating their arguments. Each premise and conclusion is associated with a letter that is further expressed as a node of a graph. Then, directed edges (arrows) are established between nodes to indicate relationships between the components. A simple argument has only one premise that is used to support one conclusion [38].

As we can observe in Fig. 1, the first sentence is a premise (in square brackets /P1) supporting a conclusion in the second sentence (in square brackets /C2). In a simple argument, a premise provides elements to sustain the veracity of the associated conclusion. Figure 2 illustrates a simple argument structure in which the premise P1 supports the conclusion C2 (on the right side).

Fig.1

Example of annotated text with a simple structure: A single premise supporting a conclusion.

Fig.2

A simple argument structure as graph: A premise with a support relation to a conclusion.

As we can notice in the argument example, the word “therefore” plays an essential role in the identification of a possible conclusion; these patterns of words are called argumentative markers and can help with the detection of elements in an argument. In Table 1 we present some argumentative markers to identify premises and conclusions [6].

Table 1

Argumentative markers for premises and conclusions (claims). The argumentative marker in Spanish is given in parentheses

Premises	Conclusions
as (puesto que)	therefore (por lo tanto)
since (ya que)	thus (por ende)
as (como)	so (así)
insofar as (en tanto que)	hence (de ahí que)
since (dado que)	in consequence (en consecuencia)
seeing that (viendo que)	consequently (por consiguiente)

As mentioned previously, arguments can have more than one premise as support. In Fig. 3, four types of argument structures are depicted. A convergent structure (a) has several supporting premises. In this case, it is possible to eliminate either one of the two premises and the conclusion is still supported [38]. A linked structure (b) has two premises which together are used to provide support for a conclusion. A divergent structure (c) has a single premise that supports several conclusions. A serial structure (d) has arguments deployed in successive order. In this type of structure, the conclusion of one component acts as a premise for another element [10]. These structures can be joined to form a more complex graph.

Fig.3

Types of argument structures.
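The structures described above can be encoded as a small data structure. The following is a minimal sketch, not the representation used in this work: the `Support` class, the `supporters` helper, and the node labels are hypothetical names chosen for illustration. Premises that act together (linked) share one relation, while premises that act independently (convergent) each get their own relation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Support:
    """One support relation; several premises in one relation act jointly (linked)."""
    premises: tuple
    conclusion: str

def supporters(relations, conclusion):
    """Return the groups of premises that support the given conclusion."""
    return [r.premises for r in relations if r.conclusion == conclusion]

# Hypothetical examples of the four structure types.
convergent = [Support(("P2",), "C1"), Support(("P3",), "C1")]  # independent premises
linked = [Support(("P1", "P2"), "C3")]                         # premises act together
divergent = [Support(("P1",), "C2"), Support(("P1",), "C3")]   # one premise, two conclusions
serial = [Support(("P1",), "C2"), Support(("C2",), "C3")]      # C2 is reused as a premise
```

For instance, `supporters(convergent, "C1")` yields two separate one-premise groups, whereas `supporters(linked, "C3")` yields a single group containing both premises.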

4 Corpus

For our experiments, we utilized an annotated corpus of theses and research proposals in the computer and information technologies domain, in Spanish. The texts are at the undergraduate and graduate levels and come from the Coltypi collection of theses and research proposals [12]. The annotated sections of each document were those considered highly argumentative [22]: problem statement, justification, and conclusions. The annotation was performed by two experts who have experience reviewing theses. The corpus contains annotations of argument components such as conclusions and premises. The inter-annotator agreement in terms of Kappa for argument components was 0.578, corresponding to a “moderate level” according to the Landis and Koch scale [20]. The corpus consists of 1216 paragraphs in 300 sections (100 sections per type).

An example of an argumentative paragraph of a justification of undergraduate level, with annotated premises and conclusions, is the following:

[The lack of security measures in networks is a problem that is growing.]/C1 [The number of attackers is increasing and more and more organized,]/P2 so they are acquiring day-to-day more specialized skills that allow them to obtain greater benefits. [Neither the security breaches within the organization itself should be underestimated.]/P3

The paragraph includes a convergent structure consisting of two premises (P2 and P3) which support a conclusion C1. We observed in this example an absence of argumentative markers, so the annotators had to carefully identify the semantic relation between the conclusion C1 and the premises P2 and P3.

For the annotation process of argument components, the annotators followed a procedure provided in an annotation guide, which includes the following steps. First, the annotator was required to read the title and objective of the thesis or proposal. Then, the annotator had to identify whether the text includes a conclusion. Next, the annotator had to determine the ideas that support the conclusion and mark them as supporting premises. We also advised the annotator to mark complete sentences or clauses as a conclusion or premise. The premises were indicated using square brackets and adding /P at the end, that is: [text of premise]/P. Similarly, in the case of conclusions, the text had to be enclosed in square brackets but ending with /C, i.e. [text of conclusion]/C. This was the procedure suggested to the annotators to segment and label the argument components.
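The bracket notation lends itself to automatic extraction. The sketch below is an illustration under assumptions, not the tooling used to build the corpus: the regular expression, the `parse_components` function, and the optional numeric index after /P or /C (as seen in the example paragraph) are our own choices.

```python
import re

# Matches "[text]/P" or "[text]/C", optionally followed by an index as in "/P2".
COMPONENT_RE = re.compile(r"\[([^\[\]]+)\]/([PC])(\d*)")

def parse_components(paragraph):
    """Return (kind, index, text) tuples for each annotated component."""
    components = []
    for match in COMPONENT_RE.finditer(paragraph):
        text, label, index = match.groups()
        kind = "premise" if label == "P" else "conclusion"
        components.append((kind, int(index) if index else None, text.strip()))
    return components

example = (
    "[The lack of security measures in networks is a problem that is growing.]/C1 "
    "[The number of attackers is increasing and more and more organized,]/P2 "
    "so they are acquiring day-to-day more specialized skills."
)
parsed = parse_components(example)
```

Text outside the brackets, such as the clause starting with "so they are acquiring", is left unlabeled by this sketch.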

Table 2
Class distribution among instances

Conclusions	Premises	None
368	833	482

In total, 1683 instances were used for the experiments. Table 2 shows the class distribution. We only selected the annotated components with exact agreement in their boundaries. Instances of the class “None” were taken from sentences of paragraphs without argumentation. An instance of the premise or conclusion class can be a clause (segment) or a complete sentence. We obtained a proportion of 49.5% for premises, 21.9% for conclusions and 28.6% for non-argumentative examples. We notice that there are more than twice as many premises as conclusions. Such a ratio between premises and conclusions is usually observed in academic writing, since students often sustain a conclusion with one or more premises.

Table 3

Distribution of argument components per level of study

	Undergraduate	Master	Doctoral
Conclusions	242	84	42
Premises	499	249	85
None	275	157	50
Total	1016	490	177

We also considered the different levels of study in the corpus. We observe in Table 3 a majority of instances from the undergraduate level with a proportion of 60.4%, then the master level with a proportion of 29.1%, and finally doctoral with 10.5%. We employed several study levels to evaluate our model on more polished texts such as master and doctoral degree theses.

5 Features and learning approaches

In this section, we discuss several features considered for the representation of argument components, and the learning approaches applied for the task.

5.1 Features

The features extracted from the components were lexical, syntactic, semantic, and indicator features.

5.1.1 Lexical features

Lexical features are based on words, lemmas, and terms in the text. These features are taken as unigrams for all terms in the component, including some punctuation marks (;:,.) and bigrams for consecutive pairs of terms in the paragraph, that also include punctuation marks. We computed a TF-IDF (Term Frequency - Inverse Document Frequency) weight for the lexical features, to assign higher values to the most discriminative terms.
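As an illustration of this weighting, the sketch below builds unigram and bigram features and weights them with one common TF-IDF variant (raw term frequency times log(N/df)). The exact variant used in our experiments is not specified above, so the formula and the function names `ngrams` and `tfidf_vectors` are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf_vectors(segments):
    """segments: list of token lists. Returns one {term: weight} dict per segment."""
    docs = [Counter(ngrams(toks, 1) + ngrams(toks, 2)) for toks in segments]
    # Document frequency: number of segments in which each term appears.
    df = Counter(term for doc in docs for term in doc)
    n_docs = len(docs)
    return [
        {term: tf * math.log(n_docs / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]

# Toy segments (hypothetical, already tokenized and lowercased).
toy = [["ya", "que", "el"], ["por", "lo", "tanto", "el"]]
weights = tfidf_vectors(toy)
```

In this toy example the term "el" occurs in every segment and therefore receives weight 0, while "ya que" occurs in only one segment and receives a positive weight.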

5.1.2 Syntactic features

We used lemmas, part-of-speech tags (tagPOS) and the grammatical category (categoryPOS), i.e. the first letter of the POS tag, as syntactic features. We analyzed the part-of-speech (POS) tags, which indicate the role a token has in a sentence, e.g. nouns, verbs, adjectives, adverbs, etc. We employed the Freeling language analysis tool [27] for POS tagging. An example of a POS tag for the Spanish word “investigar” (“investigate” in English) is VMN0000, in which the first letter “V” indicates a verb category and the second letter “M” a main verb. The POS tags have the format proposed by the EAGLES group for the morphosyntactic annotation of lexicons and corpora for all European languages. Freeling also provides the lemmas, which are the canonical forms of words. For example, for “ya - que” (“since” in English) the POS tags are “RG - CS”, which indicate an adverb (“RG”) and a conjunction (“CS”).

5.1.3 Semantic features

The semantic features are expressed as word embeddings, specifically those of Polyglot [1] trained with Spanish Wikipedia. Word embeddings map the index of a word in a dictionary to a feature vector in a high-dimensional space. This vector is learned with a neural network by back-propagating the error in the model to update the initialized embeddings. The model utilised is CBOW (continuous bag of words), since it uses the context to train the vector of each word. The context is a set of randomly initialized word vectors corresponding to the terms that precede and follow the trained word. Polyglot contains words with vectors which represent the meaning of the word. The number of word embeddings contained in Polyglot for Spanish is 100,004. The word embedding vector has a size of 64 features. For each segment, we calculate the average of the word embedding vectors of the words contained in the argumentative component.
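The averaging step can be sketched as follows. This is a minimal illustration with a toy lookup table rather than the actual Polyglot vectors; skipping out-of-vocabulary tokens and returning a zero vector when no token is known are assumptions, as is the function name `average_embedding`.

```python
def average_embedding(tokens, embeddings, dim=64):
    """Average the vectors of the tokens found in the embedding table."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # Assumption: segments with no known word map to the zero vector.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy 2-dimensional table (hypothetical values; Polyglot vectors have 64 dimensions).
toy_embeddings = {"ya": [1.0, 3.0], "que": [3.0, 5.0]}
segment_vector = average_embedding(["ya", "que", "desconocida"], toy_embeddings, dim=2)
```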

5.1.4 Indicator features

Argumentative markers reveal components of an argument. The indicator features correspond to five sets of word patterns (argumentative markers) that were applied to obtain the occurrence frequency of each of the following categories: justification, explanation, deduction, refutation, and conditional. We created the sets of patterns based on a review of different sources of argumentative markers in Spanish [3, 31]. Examples of these patterns are “ya - que” (“since” in English) for the justification category, related to the premise type, “por - ello” (“thus” in English) for the explanation category, and “por-lo-tanto” (“therefore” in English) for the deduction category, indicating a conclusion.
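A minimal sketch of how such category frequencies could be computed is given below. Only the three patterns quoted above are included; the full pattern sets are not reproduced in this paper, so the refutation and conditional lists are left empty, and the names `MARKERS`, `count_pattern`, and `indicator_features` are hypothetical.

```python
# Pattern sets per category; only the examples quoted in the text are filled in.
MARKERS = {
    "justification": ["ya que"],
    "explanation": ["por ello"],
    "deduction": ["por lo tanto"],
    "refutation": [],   # patterns not listed in the text
    "conditional": [],  # patterns not listed in the text
}

def count_pattern(tokens, pattern_tokens):
    """Count occurrences of a token sequence inside a token list."""
    n = len(pattern_tokens)
    return sum(tokens[i:i + n] == pattern_tokens for i in range(len(tokens) - n + 1))

def indicator_features(tokens, markers=MARKERS):
    """Return one occurrence count per category, in the order of the dictionary."""
    return [
        sum(count_pattern(tokens, pattern.split()) for pattern in patterns)
        for patterns in markers.values()
    ]

tokens = "por lo tanto el sistema falla ya que no hay respaldo".split()
features = indicator_features(tokens)
```

Matching on token sequences rather than substrings avoids counting a marker that merely appears inside a longer word.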

5.1.5 Document occurrence representation

We also explored a representation based on document occurrence which, to our knowledge, has not been tried in argument mining. The Document Occurrence Representation (DOR) is based on the “latent” semantics of a term, which can be revealed by the statistical distribution of its occurrences over the documents in the corpus. A term is represented as a vector of weights associated with documents in the collection. The weights represent the contribution of a document to the semantics of a term [4]. The term vector size for a DOR representation is equal to the number of documents (segments) in the training set. In our case the training set size is 1513 segments, so the vectors generated by the DOR representation had 1513 features. We associated a vector with each word in our vocabulary. We calculated the DOR representation using unigrams. The DOR representation for each instance is calculated by adding all the word vectors in a given segment.
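The construction can be sketched as follows. This is a simplified illustration: it uses the raw count of a term in a training segment as its weight, which is an assumption and not necessarily the weighting scheme of [4], and the function names are hypothetical.

```python
from collections import Counter

def build_dor_term_vectors(train_segments):
    """Map each term to a vector with one weight per training segment."""
    n_docs = len(train_segments)
    term_vectors = {}
    for doc_idx, tokens in enumerate(train_segments):
        for term, count in Counter(tokens).items():
            vector = term_vectors.setdefault(term, [0.0] * n_docs)
            vector[doc_idx] = float(count)  # assumption: raw count as weight
    return term_vectors

def dor_instance_vector(tokens, term_vectors, n_docs):
    """Represent a segment as the sum of the DOR vectors of its words."""
    total = [0.0] * n_docs
    for token in tokens:
        # Words unseen in training have no vector and contribute nothing.
        for i, weight in enumerate(term_vectors.get(token, ())):
            total[i] += weight
    return total

# Toy training set with three segments (hypothetical).
train = [["ya", "que"], ["por", "lo", "tanto"], ["ya", "ya"]]
term_vectors = build_dor_term_vectors(train)
instance = dor_instance_vector(["ya", "que", "nuevo"], term_vectors, len(train))
```

With a training set of 1513 segments, each term vector and each instance vector produced this way would have 1513 entries, matching the dimensionality stated above.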

5.2 Learning approaches

We now present two learning approaches for the task of classifying a text segment as premise, conclusion or none. We addressed only the classification task, and assumed that the segmentation task has already been done in some way. The segments employed were those determined by annotators in the corpus.

Our first approach is based on capturing the characteristics of components as sequences of words, lemmas and POS tags. For this approach, we tackle the problem with language models, which have been used in several language processing tasks, such as speech recognition, machine translation, part-of-speech tagging, parsing, handwriting recognition and information retrieval. Our second approach formulates the problem as a straightforward classification task, employing machine learning algorithms. For training, we create representations based on several models, as described in the previous subsection.

5.2.1 Language models

Language models define probability distributions on word sequences. These probabilistic models are based on N-gram statistics as counts of unigrams, bigrams, trigrams, and so on. The language models are employed here to capture the regularities of the argumentative components. They can be used to predict the next word from previous words. Estimating the probability of the next word is close to estimating the probability of a sequence of words [18]. The perplexity is used to evaluate the probability that a given sequence belongs to the language model. The smaller the value of perplexity, the better the language model is at approximating the unseen supplied sequence.
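To make the notion of perplexity concrete, the sketch below trains a bigram model with add-one smoothing and computes the perplexity of a token sequence as exp(-(1/N) Σ log P(w_i | w_{i-1})). It is an illustration only: the model order, the smoothing method, and the class name `BigramModel` are assumptions and do not correspond to the toolkit configuration used in our experiments.

```python
import math
from collections import Counter

class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            self.context_counts.update(padded[:-1])
            self.bigram_counts.update(zip(padded, padded[1:]))
        # Predictable symbols: training words, end marker, and an unknown-word slot.
        self.vocab = {tok for s in sentences for tok in s} | {"</s>", "<unk>"}

    def prob(self, prev, word):
        numerator = self.bigram_counts[(prev, word)] + 1
        return numerator / (self.context_counts[prev] + len(self.vocab))

    def perplexity(self, tokens):
        padded = ["<s>"] + tokens + ["</s>"]
        log_prob = sum(math.log(self.prob(p, w)) for p, w in zip(padded, padded[1:]))
        return math.exp(-log_prob / (len(padded) - 1))

# Toy training data (hypothetical).
model = BigramModel([["ya", "que"], ["ya", "que"]])
seen = model.perplexity(["ya", "que"])
unseen = model.perplexity(["que", "ya"])
```

A sequence that resembles the training data (`seen`) obtains a lower perplexity than one with the same words in an unfamiliar order (`unseen`).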

The developed method used language models to capture the particularities of the sequences of terms in the premises and conclusions. The task was approached as a binary classification to determine if a segment belongs to the argumentative component (i.e. premise or conclusion). We split the corpus in 60% for training, 20% for tuning and 20% for testing. In Fig. 4, we describe the methodology for the classification task. First, we create five configurations for our sequences using words, lemmas and POS tags. In Table 4, we illustrate the sequence configurations of terms prepared beforehand to train the models.

Fig.4

Process for classification using language models.

Table 4

Sequence configuration examples for language model training

Type	Example
Word	esto lleva
Lemma	esto llevar
TagPOS	PD00S00 VMIP3S0
Lemma + tagPOS	llevar VMIP3S0
Lemma + categoryPOS	llevar V

Second, we trained two language models, one for premises and another for conclusions, using the training set sentences. We obtained the two language models for each configuration of sequences of terms. We used the open-source SRILM software [35] to train our language models with a 4-gram model. Third, to compute the threshold, we calculated the perplexity of the segments in the tuning set to find the mean and the standard deviation. We established a threshold of +/- 1 standard deviation around the mean to classify the argumentative component. Finally, we calculated the perplexity of the segments in the testing set to determine whether they belong to the argumentative component. The classified instances were then used to calculate the accuracy for each configuration.
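The thresholding step can be sketched as follows, assuming the perplexity values have already been computed by a language model. The use of the population standard deviation, the inclusive bounds, and the function names are assumptions made for this illustration.

```python
import statistics

def fit_band(tuning_perplexities, k=1.0):
    """Return the (low, high) band of k standard deviations around the mean."""
    mean = statistics.mean(tuning_perplexities)
    std = statistics.pstdev(tuning_perplexities)  # assumption: population std
    return mean - k * std, mean + k * std

def belongs_to_component(perplexity, band):
    """A segment is accepted when its perplexity falls inside the band."""
    low, high = band
    return low <= perplexity <= high

# Toy tuning-set perplexities (hypothetical values).
band = fit_band([10.0, 12.0, 14.0])
decisions = [belongs_to_component(p, band) for p in (12.0, 20.0)]
```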

5.2.2 Machine learning

The machine learning approach for argument component classification relies on learning algorithms. These algorithms learn from annotated texts how to classify a new segment. However, before the text is fed to the algorithm, an important step is to create a representation. The first step is to pre-process the text with a text normalization, converting all words to lowercase. The next step is to perform tokenization, that is, segmenting the text into tokens (basic units) such as words or symbols [18], using the space symbol to separate words. We applied regular expressions for tokenization. Next, a representation can be created using a vector space model such as the bag-of-words model, i.e. an unordered set of tokens, ignoring their position in the text, to represent each segment as a feature vector with the tokens in the vocabulary. The value of each feature is the frequency of appearance in the segment. That is, each text is represented by a vector of token counts (frequencies). The representation created for our experiments utilises diverse features, as detailed in the previous section. The vocabulary size for the features is presented in Table 5. We filtered the bigrams to keep the 20,000 most frequent terms.
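The pre-processing and bag-of-words steps can be sketched as follows. The regular expression is an assumption (the exact expressions we applied are not listed above); it keeps word characters together and treats each punctuation mark as a separate token.

```python
import re
from collections import Counter

# Assumed pattern: runs of word characters, or single non-space symbols.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

def bag_of_words(text, vocabulary):
    """Return the count of each vocabulary term in the text, ignoring order."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

tokens = tokenize("Por lo tanto, el sistema falla.")
vector = bag_of_words("ya que ya", ["ya", "que", "por"])
```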

Table 5
Sizes per feature type

Type	Size
Unigrams	6614
Bigrams	20000
Lemmas	4116
Word embedding vector	64
Argumentative markers categories	5

Once we have obtained the representation, a machine learning algorithm was employed to perform the classification task. The model was trained to choose the correct class (e.g. premise, conclusion, none) for a given text input. We applied several algorithms such as Support Vector Machine (SVM), Naive Bayes and Random Forest. The model was trained using the set of components previously annotated by human experts. For our experiments, the train set had 1513 instances, i.e. 90% of the corpus. The set of annotated texts acts as the instructor to our model. A test set is used to evaluate the model, and the efficacy depends on how similar the predictions of the classifier model are to the human annotation in the test set. The evaluation is done using customary metrics such as accuracy, recall, precision, and F-measure. We computed the accuracy metric to evaluate the efficacy of classifiers. It measures the fraction of inputs in the test set that the classifier model correctly separated [2]. We employed a test set from our corpus consisting of 170 instances to evaluate each configuration of features.
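For reference, the evaluation metrics mentioned above can be computed as in the following sketch. The function names and the toy labels are illustrative; per-class precision, recall and F-measure follow their standard definitions.

```python
def accuracy(gold, predicted):
    """Fraction of instances whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def precision_recall_f1(gold, predicted, label):
    """Per-class precision, recall and F1 for the given label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == label and p == label for g, p in pairs)
    fp = sum(g != label and p == label for g, p in pairs)
    fn = sum(g == label and p != label for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy annotations and predictions (hypothetical).
gold = ["premise", "premise", "conclusion", "none"]
predicted = ["premise", "conclusion", "conclusion", "none"]
acc = accuracy(gold, predicted)
p, r, f1 = precision_recall_f1(gold, predicted, "premise")
```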

As previously mentioned, we also considered the Document Occurrence Representation. Once the corresponding vectors were built, we also applied a machine learning algorithm to train and test this DOR representation using unigrams since they achieved better results than bigrams and trigrams.

6 Experiments and results

In this section, we report the experimental results for our two approaches, i.e. language models and machine learning. Also, we analyze the impact of testing the model at different academic levels. Finally, we present the information gain observed in several patterns of terms for the identification of argument components.

6.1 Language models

To identify the best performing language model, we carried out a series of experiments on the corpus using 5-fold cross-validation with the Scikit-learn toolkit [28].

In Table 6, we show the different configurations for conclusion identification. We observed that lemmas with their tagPOS produced the best results in terms of accuracy. Although the accuracy in the experiments was not very high, the precision to identify the components exceeds 0.8. However, for the language model of conclusion, we can also notice that the second best accuracy corresponds to tagPOS configuration, with even higher precision.

Table 6
Classification results using language model of Conclusions

	Accuracy	Precision
Words	0.515	0.834
Lemmas	0.505	0.761
Lemma tagPOS	0.522	0.821
Lemma categoryPOS	0.514	0.75
TagPOS	0.52	0.861

In Table 7, we notice that the language model of premises with the tagPOS configuration obtains the best accuracy. For this model, we calculated a mean perplexity of 16.99 for the tuning set; for the training set, the mean was 16.97 for the positive class (premises) and 18.85 for the negative class (conclusions). We observed that the perplexity levels for premises and conclusions are quite close. We assume that this is due to the small number of instances, which prevented the language model from capturing the particularities of the premises and conclusions.

Table 7

Classification results using language model of Premises

	Accuracy	Precision
Words	0.505	0.855
Lemmas	0.499	0.791
Lemma tagPOS	0.508	0.816
Lemma categoryPOS	0.495	0.765
TagPOS	0.518	0.852

6.2 Machine learning

The detection of argumentative components was also approached as a multi-class classification task. We classified each segment in the corpus as premises, conclusions or none. Using lexical, syntactic, semantic and indicators features with algorithms of machine learning, we performed 10-fold cross-validation. We employed classifiers such as Support Vector Machine (SVM) [33], Naive Bayes [25] and Random Forest [7] since they have been previously used in argument mining. We ran their implementation from Weka machine learning toolkit [17], finding the best performance with the SVM classifier.

In Table 8, we observe that the lexical features obtain a good efficacy; however, when all the features are combined, the accuracy is slightly improved. The feature type with the lowest accuracy is the indicators type, as expected.

Table 8
Identification of argumentative components

	Accuracy	Type
Unigrams and bigrams (TF-IDF)	72.61	lexical
Lemma with POS category	57.81	syntactic
Word embeddings	65.06	semantic
Argumentative marker categories	49.49	indicators
All features	73.74

After observing that lexical features had the best accuracy, we combined the lexical features with syntactic, semantic and indicator features. Table 9 shows the accuracy obtained with these combinations. We found the best combination to be the lexical and semantic features. These two kinds of features also achieved the highest accuracies individually. In addition, we explored using lexical, semantic and indicator features together, obtaining an accuracy of 74.09. Again the syntactic features combined with the lexical features produced the lowest accuracy, so syntax does not seem to help differentiate components.

Table 9

Combining features for identification of argumentative components

Type	Accuracy
Lexical + indicators	73.09
Lexical + syntactic	71.72
Lexical + semantic	74.10
Lexical + indicators + semantic	74.09

6.2.1 DOR representation

We performed experiments with several DOR representations using 10-fold cross-validation with the SVM classifier of Weka. In Table 10, we present DOR representations for unigrams, bigrams and trigrams. We can notice that the DOR unigram representation achieved the best accuracy, 76.35, with an average F-measure of 0.74. Analyzed by class, the F1-measure is 0.80 for premises and 0.50 for conclusions. The model identified premises better than conclusions, perhaps because the number of premises is more than twice that of conclusions. In terms of the representation, our results lend some support to the view that DOR works well in the classification of short texts.

Table 10
Argument component classification with DOR representation

	Accuracy	F-measure
DOR unigram	76.35	0.74
DOR bigram	57.87	0.55
DOR trigram	55.07	0.47

We also performed experiments considering academic levels to test the efficacy of the best model found so far on master and doctoral theses. We therefore restricted the experiments to the unigram DOR representation with an SVM classifier. In Table 11 we show three configurations of train/test sets to analyze the efficacy on different test sets. We found a better accuracy when we used the undergraduate level components as the training set and tested on the master level set. We presume that master's theses are closer in style to undergraduate theses than doctoral theses are, which would explain the lower accuracy obtained on the latter. However, taking the undergraduate and master level sets together, we observed a loss in accuracy when classifying the doctoral level theses. We suppose that master theses are quite different from doctoral theses, thus introducing noise to the model. We obtained an F1-measure for premise classification of 0.70 or above for the three configurations, which indicates that premises are more easily classified than conclusions. These results were also consistent with those given in Table 10.

Table 11

Classification results of different train and test sets among Academic Levels

Sets (train,test)	Train set	Test set	Accuracy	F1 Premises	F1 Conclusions
Undergraduate, Master	1016	490	73.46	0.77	0.48
Undergraduate, Doctoral	1016	177	65.53	0.70	0.41
Undergraduate + Master, Doctoral	1506	177	68.36	0.73	0.32

In the same direction, we formulated a second experiment to examine the quality of texts according to study levels, under the assumption that a higher study level corresponds to a better quality of writing. Hence, we performed three experiments, training an SVM classifier with the DOR unigram representation on master theses, doctoral theses, or both. We employed the undergraduate theses as the test set. As Table 12 shows, the best accuracy was obtained when we employed the master level theses as the training set. We also observed better classification of premises than of conclusions, with practically the same F1-measure as in Table 11 but with a more compact DOR representation of only 490 instances. These results indicate that the features of master theses are closer to those of undergraduate theses than to doctoral theses. We presume that doctoral theses as the training set obtained a low accuracy because the set is smaller (177 instances), which produced a limited DOR representation. However, another reason may be that doctoral theses have a higher number of sophisticated terms than master or undergraduate theses. We can also notice that when training with master and doctoral theses together, the classification of conclusions shows a slight improvement, possibly caused by the additional examples.

Table 12

Classification results for training with graduate levels

Sets (train, test)	Train set size	Test set size	Accuracy (%)	F1 Premises	F1 Conclusions
Master, Undergraduate	490	1016	69.49	0.76	0.25
Doctoral, Undergraduate	177	1016	55.41	0.69	0.08
Doctoral + Master, Undergraduate	667	1016	69.39	0.76	0.27
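
The train/test configurations of Tables 11 and 12 can be outlined as follows. This is only a sketch: the number of components per level is taken from the tables, but the placeholder instances, the make_split helper and the field names are hypothetical, and no classifier is trained here.

```python
# Components per academic level, as reported in Tables 11 and 12.
LEVEL_SIZES = {"undergraduate": 1016, "master": 490, "doctoral": 177}

# Placeholder instances (hypothetical structure): only the level is recorded.
INSTANCES = [
    {"id": f"{level}-{i}", "level": level}
    for level, size in LEVEL_SIZES.items()
    for i in range(size)
]


def make_split(instances, train_levels, test_level):
    """Train on every instance of train_levels, test on test_level only."""
    if test_level in train_levels:
        raise ValueError("the test level must not be used for training")
    train = [x for x in instances if x["level"] in train_levels]
    test = [x for x in instances if x["level"] == test_level]
    return train, test


# The three configurations of Table 11 followed by the three of Table 12.
CONFIGURATIONS = [
    (("undergraduate",), "master"),
    (("undergraduate",), "doctoral"),
    (("undergraduate", "master"), "doctoral"),
    (("master",), "undergraduate"),
    (("doctoral",), "undergraduate"),
    (("doctoral", "master"), "undergraduate"),
]


def split_sizes():
    """Return (train size, test size) for every configuration."""
    sizes = []
    for train_levels, test_level in CONFIGURATIONS:
        train, test = make_split(INSTANCES, train_levels, test_level)
        sizes.append((len(train), len(test)))
    return sizes


if __name__ == "__main__":
    for (train_levels, test_level), (n_train, n_test) in zip(CONFIGURATIONS, split_sizes()):
        print(" + ".join(train_levels), "->", test_level, n_train, n_test)
```

Keeping the test level disjoint from the training levels is what makes each row a cross-level evaluation; the resulting sizes match the train and test columns of both tables.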

6.3 Information gain analysis in argument components

We analyzed the information gain of n-grams with the Weka toolkit to find the most discriminative patterns. Table 13 lists the patterns with their corresponding information gain. Punctuation marks (e.g. “.” and “,”) appear among the top patterns, with the period obtaining by far the highest information gain (0.489), as do words and patterns that are part of argumentative markers. We found the argumentative marker “since” (“ya - que” in Spanish), which helps to identify premises. Parts of markers also appear, such as “for the” (“por - lo” in Spanish), which is part of the marker “therefore”, used to indicate a conclusion. We also noticed stop words with information gain, such as articles (e.g. “the”), prepositions (e.g. “to”), and conjunctions (e.g. “and”), which can reveal a particular writing style.

Table 13
N-grams with information gain. Patterns in Spanish

No.	Pattern	English translation	IG value
1	.	–	0.489
2	ya - que	since	0.082
3	ya	already	0.079
4	por - lo	for the, by the	0.054
5	por	for, to	0.050
6	y	and	0.049
7	en	in, on	0.045
8	este	this	0.038
9	a	to	0.034
10	,	–	0.034
11	(	–	0.032
12	se	one, oneself	0.032
13	lo	it	0.031
14	del	of the	0.029
15	)	–	0.026
16	los	the	0.025
17	por - ello	by that, thus	0.023
18	es - por	is by	0.023
19	de - la	of the	0.023
20	investigacion	research	0.022
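
For reference, the information gain of a pattern can be computed as the reduction in class entropy obtained by knowing whether the pattern occurs in a component. The Python sketch below uses this standard definition for a binary presence feature; the toy components, the function names and the tokenization are assumptions for illustration, and it does not reproduce the Weka configuration used in the experiments, which is not detailed here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def has_ngram(tokens, ngram):
    """True if the n-gram occurs as a contiguous token sequence."""
    n = len(ngram)
    target = tuple(ngram)
    return any(tuple(tokens[i:i + n]) == target for i in range(len(tokens) - n + 1))


def information_gain(components, ngram):
    """IG = H(class) - H(class | n-gram present or absent)."""
    labels = [label for _, label in components]
    present = [label for tokens, label in components if has_ngram(tokens, ngram)]
    absent = [label for tokens, label in components if not has_ngram(tokens, ngram)]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


# Hypothetical toy components (tokenized Spanish text, gold label).
TOY_COMPONENTS = [
    (["ya", "que", "el", "sistema", "falla"], "premise"),
    (["ya", "que", "no", "existe", "soporte"], "premise"),
    (["por", "lo", "tanto", "se", "propone", "un", "modelo"], "conclusion"),
    (["el", "modelo", "es", "necesario"], "conclusion"),
]

if __name__ == "__main__":
    print("IG of 'ya que':", information_gain(TOY_COMPONENTS, ("ya", "que")))
    print("IG of 'el'    :", information_gain(TOY_COMPONENTS, ("el",)))
```

In this toy set the bigram “ya que” separates the two classes perfectly and receives the maximum gain of one bit, whereas the article “el” occurs equally in both classes and receives no gain. On real data the values are much smaller, as Table 13 shows.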

7 Conclusions

We presented an annotated corpus of argumentative components drawn from theses and research proposals in Spanish, in the computer and information technologies domain. In the corpus, premises are more than twice as frequent as conclusions. This proportion arises in academic writing because students often provide more than one piece of evidence to support their ideas.

We used the corpus in experiments on argumentative component classification. After tackling the problem with language models, we conclude that these models require more training instances to capture the particularities of the argumentative components. We therefore tried an approach based on machine learning classifiers. We analyzed several feature types and found that lexical and semantic features achieved the best accuracy. However, the DOR representation with unigrams achieved an even better accuracy. We then used this representation to evaluate a model trained only on undergraduate theses and tested on graduate-level theses. We found that master’s theses seem to be closer to undergraduate theses than to doctoral theses.

After experimenting with the DOR representation and the collection divided by academic level, we also found that master-level theses were quite descriptive of premises, reaching acceptable efficacy with a smaller training corpus. In any case, the classification of conclusions still has a good margin for improvement.

We also analyzed the information gain of n-grams. We obtained several patterns through the information gain that correspond to parts of argumentative markers. For example, the argumentative marker “since” (“ya - que” in Spanish) provides information to classify argument components. Also, we found that some stop words provided information for the classification of argument components. We presume these patterns also reveal stylistic aspects of academic writings in Spanish.

In future work, we will explore features such as the context of nearby sentences, the location of the component in the paragraph, and the syntactic structure of the component for argument component classification. We also plan to experiment on identifying relations between argument components using semantic similarity and machine learning algorithms. To conduct a complete analysis of new students’ texts, we intend to tackle text segmentation together with argumentative component classification in future experiments.

Footnotes

English translation of a justification in Spanish.

We are currently working to make the corpus available to the community. Researchers interested in it can contact the corresponding author.

Acknowledgments

We thank the annotators for the assistance in the corpus creation. The first author was partially supported by CONACYT, México, under scholarship 357381.
