Solving arithmetic word problems: A deep learning based approach

Abstract

This paper presents a novel deep learning based approach to solving arithmetic word problems. Solving different types of mathematical (math) word problems (MWPs) is a complex and challenging task, as it requires Natural Language Understanding (NLU) and commonsense knowledge. Such a solver can benefit learning (education) technologies such as e-learning systems, intelligent tutoring, Learning Management Systems (LMS), and innovative teaching/learning. We propose the Deep Learning based Arithmetic Word Problem Solver (DLAWPS), an intelligent MWP solver system. DLAWPS consists of a Recurrent Neural Network (RNN) based Bi-directional Long Short-Term Memory (BiLSTM) network that classifies the operation among the four basic operations {+, -, *, /}, and a knowledge-based irrelevant information removal unit (IIRU) that identifies the relevant quantities used to form an equation for solving arithmetic MWPs. Our system generates state-of-the-art results on the standard arithmetic word problem datasets AddSub and SingleOp, and on a Combined dataset.

Keywords

Solving arithmetic word problems; solving math word problems; BiLSTM-based operation prediction; irrelevant information removal

1 Introduction

Researchers in the field of MWP solving have applied various Artificial Intelligence (AI) based techniques, such as Natural Language Processing (NLP), Natural Language Understanding (NLU) using Information Extraction (IE), Machine Learning (ML), Deep Learning (DL), and Learning (Education) Technologies, to solve word problems. Parallel research has been carried out by researchers in other fields such as Cognitive Science, Psychology, and Education [18]. The field is therefore interdisciplinary and has attracted researchers from different domains since the 1960s, starting with [2].

[32] described word problems as a mathematical query where the background information is presented in natural language texts rather than mathematical notations. Researchers have worked on standard datasets consisting of arithmetic word problems of elementary level school education with the basic arithmetic operations - addition (‘+’), subtraction (‘-’), division (‘/’) and multiplication (‘*’). However, a math word problem can be any numerical problem consisting of numbers and operations following mathematical theories like geometry, number word problem, algebra, probability, etc. [23]. Table 1 shows a simple word problem taken from a standard dataset. Challenges in solving arithmetic word problems can be found in Section 1 of [18].

Table 1
A sample word problem from the AddSub dataset

Problem: Dan has 32 green and 38 violet marbles.

Mike took 23 of Dan’s green marbles.

How many green marbles does Dan now have?

Operation: ‘–’

Relevant Quantities: 32 and 23

Irrelevant Quantity: 38

Answer: 32-23=9

Researchers in this field have generally employed rule-based approaches and machine learning techniques to solve simple arithmetic MWPs. Recently, motivated by the success of deep learning in NLP and IE, deep learning techniques such as sequence-to-sequence (seq2seq) RNNs and Reinforcement Learning have been proposed for solving MWPs. We propose an RNN based BiLSTM network to classify operations. We also developed a knowledge-based IIRU, a relevance classifier, to identify the relevant quantities (RQ) required to solve MWPs. We propose a single equation schema, given in Equation 1, into which the relevant quantities and the predicted operation are fitted; the equation is then evaluated to generate the final answer.

$X = {RQ}_{1} \; (op) \; {RQ}_{2} \ldots$ (1)

Here, op refers to the predicted operator among {‘+’, ‘-’, ‘*’, ‘/’}, and X is the system-generated answer.

1.1 Major contributions in this work

The proposed work contributes in various aspects of solving word problems as given in the list below.

A bi-directional LSTM based deep learning technique to classify operations

An intelligent Irrelevant information removal unit (IIRU) to identify the relevant quantities

A template-free solution to generate the final answer

New state-of-the-art results on two standard datasets

We propose a simple approach to form the final equation without using any templates, unlike some existing DL based math solvers. Additionally, the code will be released after the publication of the paper. In Section 2, we describe closely related work on solving arithmetic word problems. Section 3 explains our system in detail, with all its important components. Section 4 describes the datasets, dataset variants, and results with critical discussions, followed by conclusions in Section 5.

2 Related work

Arithmetic MWPs, their ambiguities and varieties, and how a child in elementary school is able to solve them through coherent learning and understanding, have long been an interesting research problem for researchers in cognitive science, education, child psychology, and problem solving. The works of [3, 33] were based on cognitive science and psychology, and simulated human cognition of problem solving to solve word problems. These works supplied important clues towards automatically solving word problems, especially in categorizing word problems or problem schemas. This research deals with the natural human intelligence of problem understanding and solving through learning. It therefore attracts the attention of researchers in AI domains such as NLU and machine learning, who have been developing computer algorithms to solve this problem automatically since the 1960s [2, 5]. Reviews on solving various word problems are available in [18, 41].

Automatically solving word problems was first proposed by [2] with the accompanying system STUDENT. Based on the works of [3, 31], [6] developed WORDPRO, which implemented the theory of comprehension. [4] developed ARITHPRO, which represented word problems by a lexicon of knowledge. [1] developed ROBUST, a system capable of solving multi-step (i.e., multiple equation) arithmetic word problems. With the advances in AI, NLP, ML, and DL techniques, the research has gained momentum in recent years. [15] learns to solve word problems statistically with predefined equation templates along with the answers. [14] developed ALGES, which represents MWPs as a set of possible expression trees and learns to choose the ‘best’ expression tree. [27–29] proposed various methods to understand a word problem and developed algorithms to form an expression tree (representing a word problem) with relevant quantities. [30] prepared a dataset containing 1878 number word problems and developed the DOL language to solve them. [9] developed ARIS, which solved addition-subtraction type word problems. [19] proposed a system that automatically creates an executable computer program (in Java) from a subset of the addition-subtraction word problems of [9]; the program generates the final answer when executed. [21] categorized addition-subtraction word problems into separate problem categories and proposed category-specific mathematical formulas into which the problems are fitted in order to solve them. [16, 17] developed a meaning-based (referred to as ‘Tag’) word problem solver with a pipeline of sub-processes.

[10] prepared a large dataset, Dolphin18k, consisting of 18,711 problems with a large number of algebraic operations in addition to the basic operations. Solving them requires knowledge of algebraic and mathematical concepts related to number word problems, scale, ratio, percentage, exponentials, unit conversion, etc. The answers to the word problems in this dataset are generated from multiple equations. Our objective in this work is to solve standard datasets of single-equation problems of different types; our proposed system presently cannot solve such a large and complex dataset, although it can solve a subset of similar problems from it. [10, 38] proposed an RNN-based LSTM network to map word problem text to math equation templates, with the capability of mapping the relevant numbers. The main drawback of the method is that it demands a huge training sample, which may not be available in many cases (such as the datasets used in our case). Very recently, [12] used sequence-to-equation mapping based on an RNN and an attention regularization technique to represent the intermediate meaning of a word problem, specifically for problems that require multiple equations. They used 10,664 math word problems, a subset of the Dolphin18k dataset, paired with equation templates. [34] proposed a BiLSTM architecture to translate (i.e., map) a math word problem to an equation (expression) tree. Their objective was not exactly to solve math word problems, but rather to normalize duplicated equation templates to the desired expression tree in order to improve the performance of systems such as [39]. Recent progress in computation hardware and machine learning techniques has made natural language understanding easier. [11, 36] proposed deep reinforcement learning based methods for solving math word problems. [11] proposed a mechanism for copying numbers and aligning them to an equation using a seq2seq RNN based model, and their system was able to address some shortcomings found in the RNN based models [10, 38] for solving word problems. [36] used a reward-based operation classifier and a re-ordering mechanism to align numbers in the desired order in the final equation. They used the same AddSub and SingleOp datasets as ours, along with a multiple arithmetic dataset, CC [27], and two variants based on these datasets.

Although BiLSTM has been used successfully in many NLP tasks including word problem solving, to the best of our knowledge, no other work has yet used BiLSTM to predict operations in solving math word problems. Some very recent works made use of LSTM [10, 38] and BiLSTM [34] to map an input word problem to an equation system, but not exactly to predict the operation. Comparisons of DLAWPS with some of the most relevant systems are given in Section 4.3, with critical discussions in Section 4.4.

3 System description

Problem Formulation: A math word problem M is a set of problem descriptors given as sentences, M = {S₁, S₂, . . . , S_q}, and contains n sentences including a single question sentence (S_q).

Simplified Statement: Statements (S) are preprocessed by splitting complex and compound sentences into simple, single-quantity statements, or micro statements ( $s_{i = 1}^{n}$ ). The problem is then represented by a set of simple micro statements as in Equation 2.

$M = \{s_{1}, s_{2}, \ldots, s_{n}\}$ (2)

Global cue: A global cue (G) is a set of information extracted from each micro statement (s_i). G is defined in Equation 3, where owner is the ‘subject’ or ‘owner’ of the verb in a sentence in the MWP, item is the ‘item’ or ‘object’, attribute is the ‘attribute’ of that item, and location represents the ‘location’ of the event of the operation.

$G = \{owner, item, attribute, location, verb\}$ (3)

Micro cue: A micro cue (μ) is defined by a set of features of a micro statement, with μ ⊆ G. Hence, a problem M can also be expressed in terms of a set of micro cues (M_μ) as in Equation 4.

$M = M_{μ} = \{μ_{1}, μ_{2}, \ldots, μ_{n}\}$ (4)

Cue difference: The cue difference (κ) is the set of cues that are present in the question but are either absent from the micro cue (μ) or present with a different value. κ is defined as a set operation in Equation 5, where μ_i is the micro cue of the i-th statement and μ_q is the micro cue of the question.

$κ_{i} = μ_{q} - (μ_{i} ⋂ μ_{q})$ (5)

Quantity: Quantities are the set of numeric values present in the problem. Quantity (Q) is defined as in Equation 6.

$Q = \{q_{1}, q_{2}, \ldots, q_{n}\}$ (6)

Irrelevant and Relevant Quantity: An irrelevant quantity (IQ) is a quantity associated with a micro cue μ_j which is not a cue of the question (μ_q), IQ ∈ Q. A relevant quantity (RQ) is the converse: a quantity associated with a micro cue that matches the cue of the question, RQ ∈ Q.

Knowledge Base: A word problem is represented by an operation (OP), a set of micro cues (M_μ), a set of quantities (Q), and a set of cue differences (κ). A knowledge base (β) is defined as in Equation 7.

$β = \{OP, Q, M_{μ}, κ\}$ (7)
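As a worked instance of Equation 5, consider the problem of Table 1, using the micro cues listed for it in Table 2 (μ₂ for the statement about violet marbles, μ₃ for the statement about Mike, μ_q for the question). This is only an illustration of the set operation; the symbols are those defined above.

$μ_{2} ⋂ μ_{q} = \{Dan, violet, marble\} ⋂ \{Dan, green, marble\} = \{Dan, marble\}$

$κ_{2} = μ_{q} - (μ_{2} ⋂ μ_{q}) = \{Dan, green, marble\} - \{Dan, marble\} = \{green\}$

$κ_{3} = μ_{q} - (μ_{3} ⋂ μ_{q}) = \{Dan, green, marble\} - \{Dan, green, marble\} = \{ϕ\}$

The non-empty κ₂ signals that the quantity 38 belongs to a statement whose cues differ from the question, while the empty κ₃ signals that 23 belongs to a statement that matches it.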

The proposed method for solving arithmetic word problems consists of three modules. First, an input word problem is passed through a BiLSTM to predict the desired operation. In parallel, irrelevant quantities (if any) are removed using a knowledge-based system. Finally, the answer is generated by forming an equation from the identified relevant quantities (cf. Equation 1). The procedure is presented in Fig. 1 where our contributions are marked with the circles. The following subsections describe the system components in detail.

Fig. 1

The complete pipeline of the proposed arithmetic word problem solver, Deep Learning-based Arithmetic Word Problem Solver aka DLAWPS. The major components are marked with circles.

3.1 Information extraction (IE) and preprocessing

We adapted and modified the object-oriented modeling approach to arithmetic word problems from the work of [19], which reported entity role labeling for math word problems based on Semantic Role Labelling (SRL) and FrameNet. Our work requires the extraction of the desired pieces of information to identify the global cue (G). The elements of the global cue G = {‘owner’, ‘item’, ‘attribute’, ‘location’, ‘verb’} are extracted using a rule-based approach similar to [20].

Preprocessing: A rule-based preprocessing module with two sub-components is integrated with the Knowledge Base. One substitutes coreferences, if any (i.e., ‘he’, ‘she’, ‘his’, ‘her’, ‘it’, etc.), and the other eliminates conjunctions such as ‘and’, ‘but’, ‘,’, etc., to simplify and rephrase an input problem. This module is also used to prepare the different variants of the datasets described in Section 4.

Various case studies based on word problem examples are given in Table 2 to demonstrate preprocessing and the IE tasks related to different system components such as G, Q, μ, and κ.

Table 2
Case studies of categorically different word problems from the datasets

Category	Problem and Operator	Simplified Problem	Knowledge base (β)	Relevant Quantities	Equation with Answer
Combine	Alyssa has 37 blue balloons, Sandy has 28 blue balloons, and Sally has 39 blue balloons. How many blue balloons do they have in all? Operator: Addition (+)	Alyssa has 37 blue balloons, Sandy has 28 blue balloons. Sally has 39 blue balloons. How many blue balloons do they have in all?	Q = {37, 28, 39}; μ₁ = {Alyssa, blue, balloon}; μ₂ = {Sandy, blue, balloon}; μ₃ = {Sally, blue, balloon}; μ_q = {blue, balloon}; κ₁ = {ϕ}; κ₂ = {ϕ}; κ₃ = {ϕ}	RQ = {37, 28, 39}	37 + 28 + 39 = 104
Change	Dan has 32 green and 38 violet marbles. Mike took 23 of Dan’s green marbles. How many green marbles does Dan now have? Operator: Subtraction (-)	Dan has 32 green marbles. Dan has 38 violet marbles. Mike took 23 of Dan’s green marbles. How many green marbles does Dan now have?	Q = {32, 38, 23}; μ₁ = {Dan, green, marble}; μ₂ = {Dan, violet, marble}; μ₃ = {Mike, Dan, green, marble}; μ_q = {Dan, green, marble}; κ₁ = {ϕ}; κ₂ = {green}; κ₃ = {ϕ}	RQ = {32, 23}	32 - 23 = 9
Compare	Annie has 6 apples. She gets 6 more from Nathan. Later, Annie buys 10 crayons at the store. How many apples does Annie have in all? Operator: Addition (+)	Annie has 6 apples. Annie gets 6 more from Nathan. Later, Annie buys 10 crayons at the store. How many apples does Annie have in all?	Q = {6, 6, 10}; μ₁ = {Annie, apple}; μ₂ = {Annie, Nathan, apple}; μ₃ = {Annie, Crayon}; μ_q = {Annie, apple}; κ₁ = {ϕ}; κ₂ = {ϕ}; κ₃ = {apple}	RQ = {6, 6}	6 + 6 = 12
Division-Multiplication	Virginia has 16 eggs and 8 Skittles. If she shares the eggs among 4 friends, how many eggs does each friend get? Operator: Division (/)	Virginia has 16 eggs. Virginia has 8 Skittles. If Virginia shares the eggs among 4 friends. How many eggs does each friend get?	Q = {16, 8, 4}; μ₁ = {egg}; μ₂ = {Skittles}; μ₃ = {egg}; μ_q = {egg}; κ₁ = {ϕ}; κ₂ = {egg}; κ₃ = {ϕ}	RQ = {16, 4}	16 / 4 = 4
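The quantity sets Q shown in Table 2 can be illustrated with a very small extraction routine. The sketch below is not the rule-based IE module used by the system; it assumes that every numeric token in the problem text is a quantity and uses a plain regular expression as a stand-in. The function name `extract_quantities` is hypothetical.

```python
import re

# Matches integers and decimals such as 32, 9.46 or 0.4166666666666667.
# Assumption: every numeric token in the text is a candidate quantity.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_quantities(problem: str) -> list:
    """Return the numeric values of a word problem in textual order (the set Q)."""
    return [float(tok) if "." in tok else int(tok) for tok in NUMBER.findall(problem)]


problem = (
    "Dan has 32 green and 38 violet marbles. "
    "Mike took 23 of Dan's green marbles. "
    "How many green marbles does Dan now have?"
)
print(extract_quantities(problem))  # [32, 38, 23]
```

Numbers written as words (for example "three") are not handled by this sketch.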

3.2 Deep learning-based operation classification

The BiLSTM architecture [7] consists of a set of recurrently connected network modules, known as memory blocks or LSTM cells. In this work, we used bi-directional memory blocks to classify the four basic operations. Unlike the state-of-the-art equation template-based method [15, 42], we propose a template-free framework for the operation classification task.

First, the word problem is converted into a vector using pre-trained word vectors. We considered an input dimension of 1000. The widely used GloVe word embedding [25] is used to generate the input sequence from the word problem (M). Given an input sequence x = x₁ . . . x_T, a standard BiLSTM classifier computes the hidden vector sequence h = h₁ . . . h_T and the output class y. The model is defined using the following equations.

$\overrightarrow{h_{i}} = {LSTM}_{fw} (\overrightarrow{h}_{i - 1}, x_{i})$ (8)

$\overleftarrow{h_{i}} = {LSTM}_{bw} (\overleftarrow{h}_{i + 1}, x_{i})$ (9)

$δ = drop (\overrightarrow{h_{i}}, \overleftarrow{h_{i}})$ (10)

$y = softmax ([\overrightarrow{h_{i}}, \overleftarrow{h_{i}}])$ (11)

Following the state-of-the-art BiLSTM reported in [7], a T + 100 layer model is used for the classification task. Finally, a 4-class softmax layer is added to classify the operation.
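The final classification step of Equation 11 can be sketched in a few lines. The example below is only an illustration: it takes the forward and backward hidden states as plain lists, concatenates them, applies a linear layer, and picks the operator with the highest softmax probability. The weight matrix, bias, and hidden-state values are hypothetical placeholders, not trained parameters, and the recurrent part of the network is not reproduced here.

```python
import math

OPERATORS = ["+", "-", "*", "/"]


def softmax(logits):
    """Numerically stable softmax over a list of real-valued scores."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_operation(h_fw, h_bw, weights, bias):
    """Concatenate the two directional states and return (operator, probabilities)."""
    features = h_fw + h_bw  # list concatenation, i.e. [h_fw, h_bw]
    logits = [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(OPERATORS)), key=lambda k: probs[k])
    return OPERATORS[best], probs


# Hypothetical 2-dimensional states and a 4 x 4 weight matrix (one row per operator).
h_fw, h_bw = [1.0, 0.0], [0.0, 1.0]
weights = [[0, 0, 0, 0], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
bias = [0.0, 0.0, 0.0, 0.0]
op, probs = classify_operation(h_fw, h_bw, weights, bias)
print(op)  # -
```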

We also explored two other deep learning frameworks, Convolutional Neural Network (CNN) [37] and Hierarchical Attention Network (HAN) [40], for solving arithmetic word problems, in order to compare them with our proposed BiLSTM based method. We used a 3-layer convolutional architecture consisting of 128 filters (size 5) and max-pooling of 5 × 5. We also experimented with a HAN consisting of a hierarchically connected time-distributed layer and an LSTM layer, along with the word embedding layer. In each case, the Rectified Linear Unit (ReLU) is used as the activation function.

3.3 Irrelevant information removal

Irrelevant information removal by identifying relevant quantities is one of the important tasks in solving arithmetic MWPs. We developed IIRU, a relevance classifier, and propose a set theory-based approach to find the desired cues from the question sentence and the relevant micro statements (s_i) of a word problem. We categorized word problems by analyzing the word problem categories proposed in the literature [6, 27] and set miscellaneous rules to identify each category. Table 2 shows one example from each of the categories. The problems belonging to these categories have some basic differences in characteristics.

Initially, the word problems are categorized as Change (Cha), Compare (Com), Combine (Cob), or Division-Multiplication (Div-Mul), as in Equation 12. The set of quantities Q is extracted from M as defined in Equation 6.

$category = \{Cha, Com, Cob, Div\text{-}Mul\}$ (12)

Next, we describe the generic procedure to identify the relevant quantities (RQ) for all the categories (cf. Table 2 with examples from each category). First, the knowledge base (β) is taken as input. We define the rules to identify the relevant micro statements in Equations 13 and 14. Equation 13 is tested first; if it fails to retrieve at least two such statements, each with a single quantity, the second condition, presented in Equation 14, is tested to retrieve two quantities according to a set of precedences. We set the precedence as {location, attribute, item, owner} to match (γ ≠ ϕ) (cf. Table 2) and find at least two quantities from the word problem following the match (γ) in Equation 14. In Equation 13, R and I denote relevant and irrelevant information (quantities), respectively.

$class (μ_{i}) = \begin{cases} R & \text{if } κ_{i} = \{ϕ\} \\ I & \text{otherwise} \end{cases}$ (13)

$match (γ) = μ_{i} ⋂ μ_{q}$ (14)
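The first rule (Equation 13) can be sketched directly with set operations. The example below is a minimal illustration, assuming that each micro statement has already been reduced to a pair of one quantity and one micro cue (a Python set of strings); the function names are hypothetical, and the precedence-based fallback of Equation 14 and the category-specific rules are not covered.

```python
def cue_difference(mu_i: set, mu_q: set) -> set:
    """Equation 5: cues of the question that the statement does not share."""
    return mu_q - (mu_i & mu_q)


def classify(mu_i: set, mu_q: set) -> str:
    """Equation 13: 'R' when the cue difference is empty, otherwise 'I'."""
    return "R" if not cue_difference(mu_i, mu_q) else "I"


def relevant_quantities(statements, mu_q: set) -> list:
    """Keep the quantities of statements classified as relevant, in textual order."""
    return [q for q, mu_i in statements if classify(mu_i, mu_q) == "R"]


# 'Change' example of Table 2.
statements = [
    (32, {"Dan", "green", "marble"}),
    (38, {"Dan", "violet", "marble"}),
    (23, {"Mike", "Dan", "green", "marble"}),
]
mu_q = {"Dan", "green", "marble"}
print(relevant_quantities(statements, mu_q))  # [32, 23]
```

Representing cues as sets keeps the rule independent of the order in which owner, item, and attribute were extracted.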

Some category-specific and problem-specific rules are also applied to identify relevant quantities for different categories. E.g., we separately set up rules to identify relevant quantities for ‘money word problems’ (consisting of ‘$’ or dollar in case of our dataset) by grouping the verbs related to monetary transaction sense (using VerbNet) such as ‘pay’, ‘spend’, ‘buy’, ‘purchase’, etc. We select the corresponding micro statements containing these verbs as relevant micro statements. For Div-Mul type problems, owner is not considered as part of G, since it has no importance to identify relevant quantities as observed in the datasets (cf. last example of Table 2). It is to be noted that if a word problem contains only 2 quantities, these quantities are automatically relevant, otherwise, the word problem is incomplete, as noticed in the datasets.

3.4 Answer generation

Answer generation is the final task in the pipeline. This task combines the predicted operation and the relevant quantities to form the desired equation (cf. Equation 1), which is then evaluated to generate the answer to the input word problem. Algorithm 1 presents the pipeline of the proposed DLAWPS.
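Evaluating Equation 1 amounts to folding the predicted operator over the relevant quantities. The sketch below assumes that the quantities are applied from left to right in the order in which they appear in the problem; this ordering matters for subtraction and division, and it is an assumption of the example rather than something stated above. The function name `generate_answer` is hypothetical.

```python
import operator
from functools import reduce

# Map each operator symbol to the corresponding binary function.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def generate_answer(op: str, relevant_quantities):
    """Evaluate RQ_1 (op) RQ_2 (op) ... from left to right."""
    if len(relevant_quantities) < 2:
        raise ValueError("at least two relevant quantities are required")
    return reduce(OPERATIONS[op], relevant_quantities)


print(generate_answer("-", [32, 23]))      # 9
print(generate_answer("+", [37, 28, 39]))  # 104
print(generate_answer("/", [16, 4]))       # 4.0
```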

Algorithm 1
DLAWPS –the Complete Pipeline

4 Results and discussions

4.1 Datasets

[14] proposed a framework that can check grammatical errors and reduce (or extend) arithmetic MWP datasets by minimizing template overlaps and lexical overlaps among the word problems. [14] also categorized all such datasets (cf. Table 1 of [14]) with characteristically similar word problems to motivate researchers to solve one or more of such arithmetic MWP datasets. We carried out our experiments on two such datasets available in the MAWPS [14] word problem repository: the AddSub (MWP_AddSub) dataset, consisting of a reduced set of addition-subtraction word problems [9], and the SingleOp (MWP_SingleOp) [27] dataset, consisting of single operation word problems. [14] published a reduced AddSub dataset, which is reduced in terms of the number of problems, not in terms of the types of problems. They eliminated similar (redundant) problems. For example, in the original AddSub dataset we found the following two such problems.

“There are 7 crayons in the drawer. Mary took 3 crayons out of the drawer. How many crayons are there now?”

“There are 46 rulers in the drawer. Tim took 25 rulers from the drawer. How many rulers are now in the drawer?”

In the reduced AddSub dataset, only the first word problem from the above list is kept.

These datasets consist of MWPs that require a single equation with a single operation in order to be solved; this inspired us to formulate the single final equation as in Equation 1. We combined these 2 datasets into a Combined dataset, i.e., MWP_Combined = MWP_AddSub ∪ MWP_SingleOp, with 917 word problems covering all four basic operations.

We prepared 4 dataset variants (DV) for each of the 3 datasets (MWP_Combined, MWP_AddSub, MWP_SingleOp) as given below, thus giving rise to a total of 12 datasets.

(1) DV1: Actual dataset, i.e., the standard dataset in its original form.

(2) DV2: Modified dataset after eliminating the conjunctions.

(3) DV3: Modified dataset after complete preprocessing.

(4) DV4: Rephrased word problems after eliminating sentences with irrelevant information after complete preprocessing.

The main motivation for creating the variants of the dataset is to understand the effectiveness of CNN, HAN and the proposed BiLSTM on the different representations of the word problems. We carried out our experiments using them on all 12 dataset variants. It is to be noted that IIRU is used only as a preprocessing step to prepare the DV4 dataset variants, in which sentences with irrelevant information or quantities are removed to rephrase the word problems.

4.2 Experiments and results

All the experiments were evaluated using a 10-fold cross-validation framework with a validation size of 20%. We used categorical cross-entropy loss, the RMSprop optimizer, a batch size of 2, word embedding vectors of dimension 100, and 40 epochs in each case. Table 3 summarizes the results of our experiments with RNN (BiLSTM), CNN and HAN, on each of the 4 variants (DV1, DV2, DV3 and DV4) for each of the 3 datasets (AddSub, SingleOp and Combined). Dataset variant-specific best scores are shown in Table 3 in italics and the dataset-specific best scores are shown in bold. The overall best score is underlined in Table 3. It can be observed from the results that our RNN based BiLSTM method performs better than CNN and HAN on most of the dataset variants. It can also be observed from the results in Table 3 that DV1 is the most effective dataset variant for solving word problems for AddSub and SingleOp, while DV2 and DV3 produce better accuracies than DV1 for the Combined dataset. The performances on DV4 are mixed; however, DV4 performances are steadier (for the proposed BiLSTM) than the other variants. We expected the system to produce better results on DV4, but this did not happen. Possibly, since the structure of the input problems is changed by preprocessing, some information is lost when training the models. Also, preprocessing is not completely accurate in all cases.
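The 10-fold partition used for evaluation can be sketched as follows. This is a generic illustration of k-fold splitting over the 917 problems of the Combined dataset; the shuffling seed, the way folds are formed, and the function name `k_fold_indices` are assumptions of the example, and the way the 20% validation portion is taken from each training split is not specified here and is left out.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(917, k=10))
print(len(splits))  # 10
```

Each problem appears in exactly one test fold, so the reported accuracy is an average over disjoint held-out subsets.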

Table 3
Comparative results of the different deep learning methods on different datasets and variants

Dataset	Method	Accuracy (%)
		DV1	DV2	DV3	DV4
AddSub	CNN	92.96	92.96	92.68	91.83
	HAN	93.52	93.24	92.68	91.55
	BiLSTM	94.08	93.80	89.86	94.08
SingleOp	CNN	96.26	96.62	95.55	96.78
	HAN	96.26	95.55	95.91	96.26
	BiLSTM	97.15	94.31	96.26	95.02
Combined	CNN	94.11	94.44	94.00	94.77
	HAN	94.77	94.77	94.22	94.87
	BiLSTM	94.77	95.09	95.09	94.44

Figure 3 presents the training and validation accuracy and Fig. 4 presents the loss, both on the DV1 variant of the Combined dataset for the BiLSTM model. It is observed that the proposed network does not over-fit with the limited amount of training data. Figure 5 presents the confusion matrix for classifying the four operations (‘+’,‘-’,‘*’, and ‘/’).

Fig. 2

Framework of the arithmetic MWP operation classifier, consisting of a BiLSTM with a 4-class softmax layer to classify the operation (‘+’, ‘-’, ‘*’, and ‘/’).

Fig. 3

Training and validation accuracy of the proposed architecture for DV1 of Combined dataset.

Fig. 4

Training and validation loss of the proposed architecture for DV1 of Combined dataset.

Fig. 5

Confusion matrix of the proposed operation classifier for DV1 of the Combined dataset.

It was noticed that 17.34% of the word problems (i.e., 159 out of 917 problems) in the Combined dataset contain irrelevant quantities. Without IIRU, none of these 159 problems would be solved correctly, which would reduce the accuracy from 94.77% to 77.43% for Combined - DV1 (cf. Table 3). The same holds for all variants. Therefore, the identification of such quantities is important. Figure 6 presents the confusion matrix for our relevance classifier, IIRU, where the labels ‘R’ and ‘I’ refer to Equation 13.
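The arithmetic behind these figures can be written out as follows, under the paragraph's own assumption that every problem containing an irrelevant quantity would be answered incorrectly without IIRU and that these problems are otherwise counted among the correct ones.

$\frac{159}{917} \approx 0.1734 = 17.34\%$

$94.77\% - 17.34\% = 77.43\%$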

4.3 Comparative analysis

Table 4 presents the performance comparison of our system with other similar systems on the AddSub and SingleOp datasets. Table 4 indicates that our system outperforms the current state-of-the-art systems on the same datasets (as available in the literature). Our BiLSTM based system provides accuracies of 94.08% and 97.15% on the AddSub and SingleOp datasets, respectively, on which the previous state-of-the-art results were 86.07% [21] and 79.5% [17]. It is to be noted that we used a reduced AddSub dataset consisting of 355 problems instead of the 395 problems in the AddSub dataset on which the results of other systems are reported. All the results on the SingleOp dataset are on 562 problems for all the systems, including ours.

Table 4
Comparison of our system performance with the similar systems on the same dataset(s)

Systems	AddSub (%)	SingleOp (%)
KAZB	64.0	73.7
ARIS [9]	77.7	-
Roy & Roth [27]	72.0	73.9
Tag-Based [17]	85.3	79.5
Formulation Based [21]	86.07	-
MathDQN [35]	78.5	73.3
Our System (DLAWPS)	94.08	97.15

4.4 Critical discussion

The existing state-of-the-art systems generally used supervised learning approaches to learn various system components such as equation templates [15], verb categories [9], equation trees [27, 29], equation formulation [21], etc. [17] used a rule-based approach to identify the desired operation and used Tags (cf. Section 2) to identify the relevant quantities. Our system is therefore fundamentally different, as we use a deep learning based approach that removes the need for manual feature engineering. [27] used a supervised relevance classifier trained on a dataset with very few problems containing irrelevant quantities. In contrast, we used a more realistic approach to identify the relevant quantities (cf. Section 2). [35] used a reinforcement learning approach to classify the operations and tested their performance on the same datasets.

Our system generates outstanding results and beats other similar systems mainly for two reasons as given below.

To the best of our knowledge, this is the first use of BiLSTM in classifying operations in solving word problems from AddSub and SingleOp datasets and the BiLSTM model outperforms other state-of-the-art systems.

The proposed IIRU successfully identifies most of the relevant and irrelevant quantities (cf. Fig. 6) from the word problems in the datasets. This leads to such high accuracies.

Fig. 6

Confusion matrix of the proposed relevance classifier (IIRU) for DV1 of the Combined dataset.

4.5 Error analysis

Our system made 47 errors overall on the Combined actual (i.e., DV1) dataset, of which 24 errors are in the AddSub dataset and 23 errors are in the SingleOp dataset. The sources of errors are given below.

Classification error: The system predicts the wrong operation for 18 (5 AddSub + 13 SingleOp) word problems.

Irrelevant information removal error: The IIRU failed in 29 (20 AddSub + 9 SingleOp) cases, for the reasons given below.

Logical and erroneous preprocessing (8 such cases): Our system depends solely on the Stanford CoreNLP 3.9.0 tools for the NLP preprocessing tasks. The output of these tools is sometimes inaccurate, particularly for long or complex word problems, and our system could not properly process and rephrase some of them. Rephrasing a sentence after eliminating conjunctions is a very challenging task in NLP. For example, the word problem “During a school play, Jonah staffed the snack bar. He served 0.25 pitcher of lemonade during the first intermission, 0.4166666666666667 pitcher during the second, and 0.25 pitcher during the third. How many pitchers of lemonade did Jonah pour in all ?” is too complex to be preprocessed and rephrased properly.

Lack of world knowledge and inference (21 such cases): Let us consider the example, “Tom bought a skateboard for $ 9.46. Tom spent $ 9.56 on marbles. Tom also spent $ 14.50 on shorts. In total, how much did Tom spend on toys?”. The system needs additional knowledge to understand that ‘skateboard’ and ‘marbles’ are toys, while ‘shorts’ are not.
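
The kind of world knowledge this example calls for can be sketched as a category lookup. The Python snippet below is purely illustrative: the `CATEGORY_OF` table is a hand-written stand-in for an external knowledge source, and `total_spent_on` is a hypothetical helper; neither is part of the system described here. `Decimal` is used so that the currency amounts add exactly.

```python
from decimal import Decimal

# Hypothetical stand-in for world knowledge (e.g. a lexical ontology).
CATEGORY_OF = {
    "skateboard": "toys",
    "marbles": "toys",
    "shorts": "clothes",
}

# (item, price) pairs as they might be extracted from the problem text.
purchases = [
    ("skateboard", Decimal("9.46")),
    ("marbles", Decimal("9.56")),
    ("shorts", Decimal("14.50")),
]


def total_spent_on(category, items):
    """Sum only the prices whose item belongs to the queried category."""
    return sum(
        (price for item, price in items if CATEGORY_OF.get(item) == category),
        Decimal("0"),
    )


print(total_spent_on("toys", purchases))  # 19.02
```

Without the category table, all three amounts look equally relevant to the question, which is the situation the IIRU faces in these 21 cases.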

5 Conclusion

Our system, DLAWPS, achieves state-of-the-art results in solving arithmetic word problems using an RNN-based bidirectional LSTM. The proposed irrelevant information removal unit performs well in identifying and removing irrelevant quantities from the input word problems, using rules built on an object-oriented approach. Although it is presently rule-based and problem-specific, it can be scaled up with more rules to cover more problem types. The hand-crafted rules can later be used for feature extraction for supervised learning.

As an immediate extension, we would like to explore a deep learning-based relevance classifier. We would also like to try the proposed BiLSTM method on other standard datasets of different characteristics.

Acknowledgment

Sudip Kumar Naskar is supported by Digital India Corporation (formerly Media Lab Asia), MeitY, Government of India, under the Young Faculty Research Fellowship of the Visvesvaraya PhD Scheme for Electronics & IT.

References

1. Bakman, Robust understanding of word problems with extraneous information, arXiv preprint math/0701393, 2007.

2. Bobrow D.G., Natural language input for a computer problem solving system, 1964.

3. Carpenter T.P., Hiebert and Moser J.M., Problem structure and first-grade children’s initial solution processes for simple addition and subtraction problems, Journal for Research in Mathematics Education, pages 27–39, 1981.

4. Dellarosa, A computer simulation of children’s arithmetic word-problem solving, Behavior Research Methods, Instruments, & Computers 18(2) (1986), 147–154.

5. Feigenbaum E.A., Feldman, et al., Computers and thought, volume 7, McGraw-Hill New York, 1963.

6. Fletcher C.R., Understanding and solving arithmetic word problems: A computer simulation, Behavior Research Methods 17(5) (1985), 565–571.

7. Graves and Schmidhuber, Framewise phoneme classification with bidirectional LSTM and other neural network architectures, Neural Networks 18(5-6) (2005), 602–610.

8. Heller and Greeno, Semantic processing in arithmetic word problem solving, in: Annual Meeting of the Midwestern Psychological Association, Chicago, 1978.

9. Hosseini M.J., Hajishirzi, Etzioni and Kushman, Learning to solve arithmetic word problems with verb categorization, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25–29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 523–533, 2014. URL http://aclweb.org/anthology/D/D14/D14-1058.pdf.

10. Huang, Shi, Lin C.-Y., Yin and Ma W.-Y., How well do computers solve math word problems? large-scale dataset construction and evaluation, Proceedings of the 2016 North American Chapter of the ACL (NAACL HLT), 2016.

11. Huang, Liu, Lin C.-Y. and Yin, Neural math word problem solver with reinforcement learning, in: Proceedings of the 27th International Conference on Computational Linguistics, pages 213–223, 2018.

12. Huang, Yao J.-G., Lin C.-Y., Zhou and Yin, Using intermediate representations to solve math word problems, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 419–428, 2018.

13. Kingsdorf and Krawec, A broad look at the literature on math word problem-solving interventions for third graders, Cogent Education 3(1) (2016), 1135770. URL https://doi.org/10.1080/2331186X.2015.1135770.

14. Koncel-Kedziorski, Roy, Amini, Kushman and Hajishirzi, MAWPS: A math word problem repository, in: NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, pages 1152–1157, 2016. URL http://aclweb.org/anthology/N/N16/N16-1136.pdf.

15. Kushman, Zettlemoyer, Barzilay and Artzi, Learning to automatically solve algebra word problems, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, June 22–27, 2014, Baltimore, MD, USA, Volume 1: Long Papers, pages 271–281.

16. Liang, Hsu, Huang, Li, Miao and Su, A tag-based English math word problem solver with understanding, reasoning and explanation, in: Proceedings of the Demonstrations Session, NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12–17, 2016, pages 67–71, 2016.

17. Liang, Tsai, Chang, Lin and Su, A meaning-based English math word problem solver with understanding, reasoning and explanation, in: COLING 2016, 26th International Conference on Computational Linguistics, Proceedings of the Conference System Demonstrations, December 11–16, 2016, Osaka, Japan, pages 151–155, 2016. URL http://aclweb.org/anthology/C/C16/C16-2032.pdf.

18. Mandal and Naskar S.K., Solving arithmetic mathematical word problems: A review and recent advancements, in: Information Technology and Applied Mathematics – ICITAM 2017, Haldia, Purba Medinipur, West Bengal, India, October 30 – November 1, 2017, pages 95–114, 2017. doi: 10.1007/978-981-10-7590-2_7. URL https://doi.org/10.1007/978-981-10-7590-2_7.

19. Mandal and Naskar S.K., Towards generating object-oriented programs automatically from natural language texts for solving mathematical word problems, in: Natural Language Processing and Information Systems – 22nd International Conference on Applications of Natural Language to Information Systems, NLDB 2017, Liège, Belgium, June 21–23, 2017, Proceedings, pages 222–226, 2017. doi: 10.1007/978-3-319-59569-6_26. URL https://doi.org/10.1007/978-3-319-59569-6_26.

20. Mandal and Naskar S.K., Natural language programing with automatic code generation towards solving addition-subtraction word problems, in: Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pages 146–154, Kolkata, India, December 2017. NLP Association of India. URL http://www.aclweb.org/anthology/W/W17/W17-7519.

21. Mitra and Baral, Learning to use formulas to solve simple arithmetic problems, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. URL http://aclweb.org/anthology/P/P16/P16-1202.pdf.

22. Morales R.V., Shute V.J. and Pellegrino J.W., Developmental differences in understanding and solving simple mathematics word problems, Cognition and Instruction 2(1) (1985), 41–57.

23. Mukherjee and Garain, A review of methods for automatic understanding of natural language mathematical problems, Artif. Intell. Rev 29(2) (2008), 93–122. doi: 10.1007/s10462-009-9110-0. URL https://doi.org/10.1007/s10462-009-9110-0.

24. Nesher, Greeno J.G. and Riley M.S., The development of semantic categories for addition and subtraction, Educational Studies in Mathematics 13(4) (1982), 373–394.

25. Pennington, Socher and Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.

26. Riley M.S., et al., Development of children’s problem-solving ability in arithmetic, 1984.

27. Roy and Roth, Solving general arithmetic word problems, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 1743–1752, 2015. URL http://aclweb.org/anthology/D/D15/D15-1202.pdf.

28. Roy and Roth, Unit dependency graph and its application to arithmetic word problem solving, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, pages 3082–3088, 2017. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14764.

29. Roy, Vieira and Roth, Reasoning about quantities in natural language, TACL 3 (2015), 1–13. URL https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/452.

30. Shi, Wang, Lin, Liu and Rui, Automatically solving number word problems by semantic parsing and reasoning, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 1132–1142, 2015. URL http://aclweb.org/anthology/D/D15/D15-1135.pdf.

31. Vergnaud, A classification of cognitive tasks and operations of thought involved in addition and subtraction problems, Addition and subtraction: A cognitive perspective, (1982), pages 39–59.

32. Verschaffel, Greer and De Corte, Making sense of word problems, Lisse: Swets and Zeitlinger, 2000.

33. Wang A.Y., Fuchs L.S. and Fuchs, Cognitive and linguistic predictors of mathematical word problems with and without irrelevant information, Learning and Individual Differences 52 (2016), 79–87.

34. Wang, Wang, Cai, Zhang and Liu, Translating math word problem to expression tree, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 – November 4, 2018, pages 1064–1069, 2018. URL https://www.aclweb.org/anthology/D18-1132/.

35. Wang, Zhang, Gao, Song, Guo and Shen H.T., MathDQN: Solving arithmetic word problems via deep reinforcement learning, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, AAAI-18, the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, pages 5545–5552, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/16749.

36. Wang, Zhang, Gao, Song, Guo and Shen H.T., MathDQN: Solving arithmetic word problems via deep reinforcement learning, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

37. Wang, Xu, Tian, Liu C.-L. and Hao, Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification, Neurocomputing 174 (2016), 806–814.

38. Wang, Liu and Shi, Deep neural solver for math word problems, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9–11, 2017, pages 845–854, 2017. URL https://aclanthology.info/papers/D17-1088/d17-1088.

39. Wang, Liu and Shi, Deep neural solver for math word problems, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, 2017.

40. Yang, Yang, Dyer, He, Smola and Hovy, Hierarchical attention networks for document classification, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, 2016.

41. Zhang, Wang, Xu, Dai B.T. and Shen H.T., The gap of semantic parsing: A survey on automatic math word problem solvers, CoRR, abs/1808.07290, 2018. URL http://arxiv.org/abs/1808.07290.

42. Zhou, Dai and Chen, Learn to solve algebra word problems using quadratic programming, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17–21, 2015, pages 817–822, 2015.
