Towards corpus and model: Hierarchical structured-attention-based features for Indonesian named entity recognition

Abstract

Named entity recognition (NER) is fundamental to natural language processing (NLP). Most state-of-the-art research on NER is based on pre-trained language models (PLMs) or classic neural models. However, this research is mainly oriented to high-resource languages such as English, whereas for Indonesian the related resources (both datasets and technology) are not yet well developed. Moreover, affixation is an important word-formation process in Indonesian, which makes character and token features essential for token-wise Indonesian NLP tasks, yet the features extracted by current top-performing models are insufficient. In this paper, we target the Indonesian NER task and build an Indonesian NER dataset (IDNER) comprising over 50 thousand sentences (over 670 thousand tokens) to alleviate the shortage of labeled resources in Indonesian. Furthermore, we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER to extract sequence features from different perspectives. Specifically, we use an enhanced convolutional structure as well as an enhanced attention structure to extract deeper features from characters and tokens. Experimental results show that HSA achieves competitive performance on IDNER and three benchmark datasets.

Keywords

Indonesian named entity recognition; named entity corpus; structured-attention; residual gated convolution neural network

1 Introduction

Named entity recognition (NER) plays a basic role in natural language processing (NLP) [1]. It is designed to identify the spans of entities and classify them into pre-defined categories, and it is fundamental to many advanced NLP tasks such as machine translation and knowledge graph construction [2]. With the rapid development of deep learning, many neural methods have been proposed for the NER task. However, these methods rely heavily on abundant labeled data, so most state-of-the-art research is oriented to high-resource languages such as Chinese [3–5] and English [6–9]. There are few studies on low-resource languages such as Indonesian [10]. Most current methods for Indonesian NER are still based on rules and machine learning (ML) [11–14].

Some previous studies have constructed Indonesian NER datasets [13, 15–17]; however, these datasets are unsatisfactory in both size and quality. Limited labeled resources lead to limited development of language technology for Indonesian. To ease this situation, we build a large Indonesian NER dataset (IDNER) from Indonesian news, comprising over 50,000 sentences, and explore the impact of corpus size on Indonesian NER models.

Current NER methods can be divided into two categories: (1) fine-tuned pre-trained language models (PLMs) such as deep neural models [7, 25] and transformer-based models [8, 39–42]; (2) classic neural models such as Bi-LSTM-CNNs-CRF [6, 18–22]. Unfortunately, both categories have shortcomings on the NER task. Fine-tuned PLMs focus more on acquiring contextual information, which is crucial for NER. However, NER is a token-wise task in which token and character features are of great significance. This is especially true for Indonesian, which is an agglutinative language: affixation is an important word-formation process, and new words are formed by adding one or several (fewer than three) affixes to a base word. The affixes can host proclitics, enclitics and particles. Figure 1 shows the morphological structure of an Indonesian word. Therefore, it is essential to fuse fine-tuned PLMs with efficient token and character feature extractors. In classic neural models, the convolutional neural network (CNN) is a strong character feature extractor [23] that can effectively mitigate the out-of-vocabulary (OOV) problem. However, CNNs only extract locally optimal features, which is insufficient for global character features such as Indonesian circumfixes (e.g., ke...an). Bidirectional long short-term memory (Bi-LSTM) can obtain global features but lacks locally optimal features. These problems may negatively affect the performance of NER models. The attention mechanism [31] has proven effective in many NLP tasks, but we argue that a simple 1-dimensional (1-D) attention layer is insufficient for learning the different tokens and characters in the NER task. The deficiency of a 1-D attention vector is that it focuses on only one or a few perspectives of the characters and tokens [24], so that other semantic aspects are missed, which impairs NER models.

Fig. 1

The morphological structure of an Indonesian word.

To this end, we propose a hierarchical structured-attention-based model (HSA) for Indonesian NER, in which semantic and syntactic features of a given input sequence are captured simultaneously from different perspectives. Specifically, we employ an enhanced convolutional structure named residual gated convolution neural network (RGCNN) and an enhanced attention structure (structured-attention) with two pooling strategies (average and max pooling) to extract sequence features. It is worth noting that our model is closely related to AMFF [22], with three main improvements: (1) Unlike the work in [22], which directly extracts global and local word features from pre-trained word vectors and then feeds them into a Bi-LSTM layer, we argue that these features should depend on the input sequence, so extracting token features of different levels through an attention mechanism over the hidden states of the Bi-LSTM layer is more helpful to the NER task; (2) Structured-attention with different pooling operations is more efficient and effective than a simple attention mechanism: average pooling keeps more global background knowledge of the input sequence for the tokens, and max pooling performs feature selection to retain the locally optimal features of the input sequence for the tokens; (3) Because the Indonesian NER task is more sensitive to character features, we use the enhanced convolutional structure and attention structure to extract character features.

In summary, the contributions of this paper are: (1) we construct a large Indonesian NER dataset to alleviate the shortage of labeled resources for Indonesian; (2) we construct a hierarchical structured-attention-based model (HSA) for Indonesian NER that makes use of multi-perspective sequence features; (3) HSA achieves competitive performance on three Indonesian benchmark datasets and IDNER.

The remainder of the paper is organized as follows. Section 2 briefly presents related work on Indonesian NER. Section 3 describes the process of building the Indonesian NER dataset. Section 4 introduces our NER model. Section 5 presents the experimental settings and analyzes the experimental results. Finally, Section 6 gives conclusions and remarks.

2 Related work

2.1 Top-performance NER

Many researchers focus on NER in high-resource languages such as English. Current NER methods can be divided into two categories: fine-tuned PLMs [7, 39–42] and classic neural models [6, 18–22]. Fine-tuned PLMs focus more on extracting contextual information, which is crucial for the NER task. Akbik et al. [9] propose a pooled contextualized embedding approach for NER in the FLAIR 1 framework. Matthew et al. [7] apply ELMo, which is based on a two-layer Bi-LSTM, to the NER task. Transformer-based PLMs [8, 39–42] are also widely used for fine-tuned NER models. They all achieve superior performance on the CoNLL datasets [25]. Classic NER techniques [6, 18–22] use a combination of pre-trained word embeddings [11, 12] and character embeddings derived from a CNN layer or a Bi-LSTM layer. These features are passed to a Bi-LSTM layer, which may be followed by a CRF layer [6, 18–20].

2.2 Indonesian NER

Indonesian NER has been studied for many years. Existing methods can be divided into three categories: (1) rule-based methods; (2) ML methods; and (3) neural methods.

Rule-based Indonesian NER. The rule-based method appeared first. In this method, a set of rules is manually crafted by experts to recognize a particular entity type [12]. The rules are based on syntactic, linguistic and domain knowledge. However, this method requires linguistic experts to construct the rule sets, which is time-consuming and expensive. Moreover, it places high demands on the scale and quality of the rule sets.

ML-based Indonesian NER. Subsequently, with the rapid development of NER research, many new ML methods such as conditional random fields (CRF) have emerged. Leonandya [13] uses the semi-supervised learning method for the unlabeled data of Wikipedia and DBPedia to construct an Indonesian NER model. Alfina [16] develops some rules to expand the DBPedia entity corpus and uses Stanford NER tool to build an Indonesian named entity recognition classifier. Wibawa [14] proposes an ensemble supervised learning method for Indonesian NER.

Neural-based Indonesian NER. With the popularity of neural networks, many neural Indonesian NER models have been proposed. Gunawan [26] applies a deep learning model to Indonesian and constructs a Bi-LSTM-CNNs model for Indonesian NER. Kurniawan [23] investigates neural model performance with word-level and character-level features in Indonesian conversational texts. Transfer learning is also widely used in low-resource language processing. With the rapid development of PLMs, one can fine-tune either monolingual PLMs 2 or multilingual PLMs [8, 42] to build Indonesian NER models. Besides, [27] proposes to fine-tune high-resource language models for a low-resource language to improve the performance of Indonesian NER, and [28] explores a transfer learning method for NER in Indonesian conversational texts.

3 Dataset construction

As the basis of NLP research, dataset construction is significant. However, existing Indonesian datasets cannot meet the requirements of deep learning technology in terms of size and quality. For the follow-up study, we focus on three entity types (person, location and organization) and build a large dataset, IDNER, containing 50,098 sentences (677,933 tokens in total) in IOB2 format. We will publicly release the dataset. An example is shown in Table 1. Some statistics of the dataset are presented in Table 2.

Table 1
An Indonesian annotation example

Token	Tag
Jadi	O
,	O
kata	O
Sandi	B-PER
,	O
wisatawan	O
pencinta	O
olahraga	O
dapat	O
semakin	O
tertarik	O
berbondong-bondong	O
datang	O
ke	O
Pulau	B-LOC
Tidung	B-LOC
.	O
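As an illustration of how IOB2 annotations are read, the following minimal Python sketch groups a tag sequence into entity spans. The function name `iob2_to_spans` and the short example sentence with its tags are assumptions made for illustration; they are not taken from IDNER or from the authors' tooling.

```python
def iob2_to_spans(tokens, tags):
    """Collect (entity_type, start, end_exclusive, text) spans from IOB2 tags."""
    spans = []
    start, etype = None, None
    # A trailing sentinel "O" flushes a span that ends at the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            # The open span ends here (on "O", on a new "B-", or on a type change).
            spans.append((etype, start, i, " ".join(tokens[start:i])))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


# Hypothetical example (not from IDNER).
tokens = ["Sandi", "datang", "ke", "Pulau", "Tidung"]
tags = ["B-PER", "O", "O", "B-LOC", "I-LOC"]
spans = iob2_to_spans(tokens, tags)
print(spans)
```

In this sketch, a `B-` tag always opens a new span, so two consecutive `B-LOC` tags are read as two separate location entities, while `B-LOC` followed by `I-LOC` is read as one.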

Table 2

Statistics of IDNER

Entity Type	Abbreviation	Number
Person	PER	37,473
Location	LOC	20,234
Organization	ORG	19,646

3.1 Dataset design

Fully human-annotated NER datasets are expensive and time-consuming to build and are therefore relatively small. Distant supervision [29], a technique that uses existing knowledge bases as a source of weak supervision to annotate datasets automatically, can substantially reduce the burden of manual annotation. Based on distant supervision, we collect a large number of entity instances from different sources such as DBPedia, and then link these instances to raw text to construct a preliminary NER dataset. However, a dataset labeled with distant supervision often suffers from low recall and mislabeling errors. For further optimization, we iteratively revise the corpus based on manual audit.

3.2 Data source

We use Indonesian news articles to construct the dataset because they are abundant and easy to obtain. To obtain attested Indonesian data, we crawl articles from Indonesian news websites whose content covers various topics including politics, finance, society and military affairs. The websites are shown in Table 3. After separating paragraphs into individual sentences, we randomly select 60,000 sentences as the corpus to be annotated. After iterative training and auditing, we discard some sentences that do not contain entities. Altogether, the dataset has 50,098 sentences.

Table 3
Indonesian news websites

Medium	Websites
Kompas	http://www.kompas.com
Detiknews	http://www.detiknews.com
Media Indonesia	http://www.mediaindonesia.com
Koran Tempo	http://koran.tempo.co
Republika	http://www.republika.co.id/
Rakyat Merdeka	http://www.rakyatmerdeka.co.id/
Suara Pembaruan	http://www.suarapembaruan.com/home/

3.3 Construction process

This section briefly introduces the construction process of our dataset, which contains six steps (shown in Fig. 2).

Fig. 2

Process of IDNER Construction.

(1) DBPedia Indonesia 3 provides various structured information from Indonesian Wikipedia and describes 19,567 persons, 57,702 locations and 5,773 organizations. To collect more entity instances, we crawl entity instances from DBPedia as well as other sources (such as travel websites and Indonesian textbooks). In total, 20,126 person instances, 57,702 location instances and 6,547 organization instances are collected.

(2) Based on the idea of distant supervision, we construct a preliminary NER dataset. In nested entity cases, we take the outermost (longest) entity as the label and discard the inner entity labels. A corpus constructed in this way often suffers from mislabeling problems and boundary errors.

(3) All selected sentences (60,000 sentences) are used for both model training and testing.

(4) Sentences whose labels differ between the training and testing phases are manually audited. Each such sentence is audited by two auditors to ensure quality. If the two audit results conflict, a third auditor checks the results.

(5) The new entities extracted during the audit process are added to the entity set. The entity set is then used to re-align the whole dataset.

(6) The re-aligned dataset is fed into the model for another round of training and testing. Three iterations are carried out.
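The alignment of an entity set to raw text, with the longest entity taking precedence in nested cases, can be sketched as follows. This is a minimal illustration under stated assumptions: greedy left-to-right longest matching, a dictionary keyed by token tuples, and the `max_len` limit are choices made here, and the function name `distant_label` and the example entries are hypothetical. The text does not specify the authors' actual matching procedure.

```python
def distant_label(tokens, entity_dict, max_len=5):
    """Tag tokens in IOB2 by greedy longest match against an entity dictionary.

    entity_dict maps a tuple of tokens to an entity type,
    e.g. ("Pulau", "Tidung") -> "LOC".
    """
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate first so that the outermost entity wins.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            etype = entity_dict.get(tuple(tokens[i:i + n]))
            if etype is not None:
                tags[i] = "B-" + etype
                for k in range(i + 1, i + n):
                    tags[k] = "I-" + etype
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical entity set with a nested case: "Tidung" inside "Pulau Tidung".
entity_dict = {("Sandi",): "PER", ("Tidung",): "LOC", ("Pulau", "Tidung"): "LOC"}
tokens = ["Sandi", "datang", "ke", "Pulau", "Tidung"]
tags = distant_label(tokens, entity_dict)
print(list(zip(tokens, tags)))
```

Re-aligning the corpus after an audit round then amounts to adding the newly found entities to `entity_dict` and running the same labeling again.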

3.4 Discussion

How many sentences are audited? The numbers of audited sentences in the three iterations are 12,321, 6,363 and 2,367. The original intention of this construction method is to reduce the cost of manual audit. Through this method, we can maintain the quality of the corpus to a certain extent while greatly reducing the workload of manual annotation: fully human annotation would require annotating at least 50,098 sentences, whereas only 21,051 sentences are audited with this method.
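The figures above can be checked with a few lines of Python. The variable names are chosen here for illustration; the counts are the ones reported in the text.

```python
audited_per_iteration = [12321, 6363, 2367]  # audited sentences in iterations 1 to 3
corpus_size = 50098                          # sentences in the final IDNER corpus

total_audited = sum(audited_per_iteration)
audited_share = total_audited / corpus_size  # fraction relative to full annotation

print(total_audited)           # 21051
print(f"{audited_share:.1%}")  # 42.0%
```

Under this comparison, the audit covers roughly 42% of the sentences that full manual annotation would require.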

Why are only three iterations conducted? The stopping criterion is the label difference between the training and testing phases. Specifically, in the testing phase of the first iteration, we find that the main inconsistency is non-entities being labeled as entities. That is, either the model wrongly labels non-entities as entities, or the original dataset contains annotation errors in which entities are labeled as non-entities (the latter was confirmed in the audit phase). A possible reason is that the collected entity set is insufficient in both size and coverage. New entities extracted during the audit process are therefore added to the entity set, which is then used to re-align the dataset. After three iterations, we find that the main errors are entities wrongly labeled as non-entities and entity type errors (accounting for 92.3%), and the entity set is no longer updated, which suggests that the remaining inconsistencies stem from the limited performance of the NER model. Besides, the results of the fourth training run tend to converge compared with the third. At this point, we consider the quality of the corpus good enough and turn to improving the model.

Are there any strategies to effectively evaluate the quality of the corpus? Nine auditors participate in the audit process, and each sentence is audited by two or more auditors. The auditors are students or teachers majoring in Indonesian and international students from Indonesia. To verify the quality of the corpus, we randomly select 2,000 sentences from the corpus for manual verification; the labeling error rate is 0.09%.

4 Model

This section introduces the proposed model for Indonesian NER.

NER can be formulated as a sequence labeling problem. Given an input sentence S composed of n tokens $\{w_{1}, w_{2}, w_{3}, \dots, w_{n}\}$ and its corresponding labels $\{l_{1}, l_{2}, l_{3}, \dots, l_{n}\}$, the proposed model infers the entity label $l_{i}$ for each token $w_{i}$ and outputs a label sequence. Figure 3 gives an overview of the proposed model, which consists of three main components: (1) Character Encoder; (2) Word Encoder; and (3) CRF Decoder.

Fig. 3

The architecture of HSA. First, an RGCNN and a structured-attention layer are used to extract character features. Second, character features and word embeddings (contextual embedding and static embedding) are concatenated and then fed into a Bi-LSTM layer and another structured-attention layer to obtain token features. Finally, token features are fed into a CRF layer for final label prediction.

4.1 Character encoder

4.1.1 Global character feature

As noted for the Bi-LSTM-CRF model [18], a Bi-LSTM is useful for modeling the input sequence in many sequence labeling problems, but it does not handle long-distance dependencies well, which may cause vanishing and exploding gradients. Since the attention mechanism can relieve the limitation of encoding all information equally [30], we use a Bi-LSTM network followed by a structured self-attention layer to extract global character features. Specifically, taking the character embedding $x_{t}^{c}$ at time step t as input, we use a Bi-LSTM layer to capture the contextual information of the character sequence:

$h_{t}^{c} = [\overrightarrow{h_{t}^{c}}; \overleftarrow{h_{t}^{c}}]$ (1)

$\overrightarrow{h_{t}^{c}} = \overrightarrow{\mathrm{BiLSTM}}(x_{t}^{c})$ (2)

$\overleftarrow{h_{t}^{c}} = \overleftarrow{\mathrm{BiLSTM}}(x_{t}^{c})$ (3)

Then, the character vector $h_{t}^{c}$ is fed into a structured self-attention layer [31] to capture the multi-aspect information of the input vector:

$A_{c} = \mathrm{softmax}(W_{s2}^{c} \tanh(W_{s1}^{c} (h_{t}^{c})^{T}))$ (4)

where $W_{s1}^{c}$ is a weight matrix of size $d_{a}^{c} \times u$, u is the hidden size of the Bi-LSTM, and $d_{a}^{c}$ is the hidden size of the attention layer. $W_{s2}^{c}$ is another weight matrix of size $r^{c} \times d_{a}^{c}$, where $r^{c}$ is a hyper-parameter representing the number of perspectives used to extract character features.

In order to obtain the global feature, we average the 2-D matrix $A_{c}$ into a 1-D vector $\bar{A_{c}}$ which has the dimension of u. Finally, we obtain the global character feature under the weight vector $\bar{A_{c}}$:

$h_{t}^{GC} = \bar{A_{c}} (h_{t}^{c})^{T}$ (5)
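A minimal NumPy sketch of Eqs. (4) and (5) is given below. It assumes the usual formulation of structured self-attention, in which the hidden states of all n characters of a token are stacked into a matrix H of shape n × u, so that the attention matrix has shape r × n and its row average is a distribution over the n characters. These shapes, the random weights, the sizes and all variable names are illustrative assumptions and may differ from the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def structured_attention(H, W_s1, W_s2):
    """H: (n, u) hidden states; W_s1: (d_a, u); W_s2: (r, d_a). Returns A: (r, n)."""
    return softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=1)


rng = np.random.default_rng(0)
n, u, d_a, r = 6, 8, 5, 2           # assumed sizes: 6 characters, 2 perspectives
H = rng.normal(size=(n, u))         # stand-in for the Bi-LSTM character states
W_s1 = rng.normal(size=(d_a, u))
W_s2 = rng.normal(size=(r, d_a))

A = structured_attention(H, W_s1, W_s2)  # one attention distribution per perspective
A_bar = A.mean(axis=0)                   # average over the r perspectives, shape (n,)
h_global = A_bar @ H                     # attention-weighted sum of states, shape (u,)
print(A.shape, A_bar.shape, h_global.shape)
```

Each row of `A` sums to one, so the averaged vector `A_bar` is still a valid weighting over the characters.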

4.1.2 Local character feature

In view of the CNN's insufficient capability to extract Indonesian character features, we use a residual gated CNN (RGCNN) to extract local character features, as shown in Fig. 4. The sigmoid function allows the network to exploit the full input field, or to focus on fewer elements if needed. Hence, based on the GLU [32], our convolutional structure consists of two blocks: a normal 1-D convolution and a 1-D convolution followed by a sigmoid function. In addition, we use a residual connection to enable character information to be transmitted through multiple channels. In order to keep the same dimension as the convolution output, we learn a linear projection for the character embedding. Given the character embedding $x_{t}^{c}$, the local feature $g_{t}^{c}$ is calculated by Eq. (6):

Fig. 4

The architecture of RGCNN.

$g_{t}^{c} = M x_{t}^{c} + \mathrm{Conv1D}_{1}(x_{t}^{c}) \otimes \sigma(\mathrm{Conv1D}_{2}(x_{t}^{c}))$ (6)

where M is the mapping matrix, ⊗ is point-wise multiplication, and $\mathrm{Conv1D}_{k}(x_{t}^{c})$ denotes the k-th convolution operation applied to $x_{t}^{c}$. The gates $\sigma(\mathrm{Conv1D}_{2}(x_{t}^{c}))$ control which inputs $\mathrm{Conv1D}_{1}(x_{t}^{c})$ of the current context are relevant. The combination of the residual connection and the gated convolution achieves multi-channel transmission.

Then, a max pooling layer is adopted to capture the significant local features, i.e., those assigned the highest value for a given filter [33]. The final local character feature is obtained as follows:

$h_{t}^{LC} = \mathrm{MaxPooling}(g_{t}^{c})$ (7)
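The following NumPy sketch illustrates Eqs. (6) and (7) for the characters of one token. The filter size 3, the 30 filters and the 52-dimensional character embeddings follow the settings reported in the experiments; the zero-padded "same" convolution, the bias terms, the random weights, the shape of M and all function names are assumptions made for illustration.

```python
import numpy as np


def conv1d_same(X, K, b):
    """X: (T, d_in); K: (k, d_in, d_out); b: (d_out,). Zero-padded 1-D convolution."""
    k = K.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))  # pad only along the character axis
    out = [np.einsum("kd,kdo->o", Xp[t:t + k], K) for t in range(X.shape[0])]
    return np.stack(out) + b


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def rgcnn(X, M, K1, b1, K2, b2):
    """Residual gated convolution (Eq. 6) followed by max pooling over characters (Eq. 7)."""
    gated = conv1d_same(X, K1, b1) * sigmoid(conv1d_same(X, K2, b2))
    g = X @ M + gated          # projected residual path plus gated convolution
    return g.max(axis=0)       # strongest response of each filter


rng = np.random.default_rng(0)
T, d_in, d_out, k = 7, 52, 30, 3        # 7 characters is an arbitrary example length
X = rng.uniform(-0.5, 0.5, size=(T, d_in))
M = rng.normal(scale=0.1, size=(d_in, d_out))
K1 = rng.normal(scale=0.1, size=(k, d_in, d_out))
K2 = rng.normal(scale=0.1, size=(k, d_in, d_out))
b1, b2 = np.zeros(d_out), np.zeros(d_out)

h_lc = rgcnn(X, M, K1, b1, K2, b2)
print(h_lc.shape)
```

If the first convolution outputs zeros, the gated term vanishes and only the projected residual path remains, which is one way to see that the residual connection gives the character embedding a direct route to the output.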

4.1.3 Multi-level character features

We combine the global and local character features through a weighted sum to exploit the advantages of both:

$h_{t}^{C} = \lambda_{1} h_{t}^{GC} + \lambda_{2} h_{t}^{LC}$ (8)

where $\lambda_{1}$ and $\lambda_{2}$ are the weights of the features $h_{t}^{GC}$ and $h_{t}^{LC}$, which are randomly initialized and trainable.

4.2 Word encoder

As shown in previous works [9, 18–22], pre-trained word embeddings play an important role in capturing word similarity and relations with other words. We concatenate static pre-trained word embeddings [34, 35], contextual pre-trained word embeddings [7, 8] and character features, since they capture different semantic features of tokens. A Bi-LSTM layer is then applied:

$h_{t}^{w} = [\overrightarrow{h_{t}^{w}}; \overleftarrow{h_{t}^{w}}]$ (9)

$\overrightarrow{h_{t}^{w}} = \overrightarrow{\mathrm{BiLSTM}}([x_{t}^{sw}; x_{t}^{cw}; h_{t}^{C}])$ (10)

$\overleftarrow{h_{t}^{w}} = \overleftarrow{\mathrm{BiLSTM}}([x_{t}^{sw}; x_{t}^{cw}; h_{t}^{C}])$ (11)

where $x_{t}^{sw}$ and $x_{t}^{cw}$ represent one of the static pre-trained embeddings and one of the contextual pre-trained embeddings, respectively.

After that, another structured self-attention layer is adopted to capture the multi-perspective information of the tokens:

$A_{wc} = \mathrm{softmax}(W_{s2}^{wc} \tanh(W_{s1}^{wc} H^{wc}))$ (12)

$H^{wc} = (h_{1}^{wc}, h_{2}^{wc}, h_{3}^{wc}, \dots, h_{N}^{wc})^{T}$ (13)

where $h_{t}^{wc} = [h_{t}^{w}; h_{t}^{c}]$ at time step t; T is the transpose operation; $W_{s1}^{wc}$ is a weight matrix of size $d_{a}^{wc} \times uc$, where uc is the first dimension of $h_{t}^{wc}$ and $d_{a}^{wc}$ is the hidden size of the attention network. $W_{s2}^{wc}$ is another weight matrix of size $r^{wc} \times d_{a}^{wc}$, where $r^{wc}$ is a hyper-parameter representing the number of perspectives used to extract token features.

Then, we compute the weighted matrix $M^{wc}$ by multiplying the attention matrix $A_{wc}$ and $H^{wc}$:

$M^{wc} = A_{wc} H^{wc}$ (14)

4.2.1 Global word feature

In order to obtain the global feature, we apply average pooling to the 2-D $M_{t}^{wc}$ at time step t to obtain a 1-D vector $h_{t}^{GW}$ of dimension uc:

$h_{t}^{GW} = \bar{M_{t}^{wc}}$ (15)

4.2.2 Local word feature

Inspired by the model in [33], where the max pooling operation facilitates the selection of prominent features, we apply max pooling to $M_{t}^{wc}$ to obtain the local word feature:

$h_{t}^{LW} = \mathrm{MaxPooling}(M_{t}^{wc})$ (16)

4.2.3 Multi-level word features

We combine the global and local word features through a weighted sum with automatically adjusted weights to exploit the advantages of both:

$h_{t}^{W} = \lambda_{3} h_{t}^{GW} + \lambda_{4} h_{t}^{LW}$ (17)

where $\lambda_{3}$ and $\lambda_{4}$ are the weights of the features $h_{t}^{GW}$ and $h_{t}^{LW}$, which are randomly initialized and trainable.
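The two pooling operations and the weighted fusion of Eqs. (15) to (17) can be illustrated with a toy example. The sketch assumes that, for one token, the r perspectives form an r × uc matrix whose rows are pooled; this reading, the numeric values and the fixed fusion weights are assumptions for illustration (in the model, the fusion weights are trainable).

```python
import numpy as np

# Hypothetical per-token matrix: r = 3 perspectives (rows) x uc = 4 features (columns).
M_t = np.array([[0.2, -1.0, 0.5, 0.0],
                [0.4,  0.0, 0.1, 0.3],
                [0.0,  1.0, 0.3, 0.6]])

h_gw = M_t.mean(axis=0)  # average pooling keeps a smoothed, global view of all perspectives
h_lw = M_t.max(axis=0)   # max pooling keeps the most prominent value of each feature

lam3, lam4 = 0.5, 0.5    # fixed here for illustration only
h_w = lam3 * h_gw + lam4 * h_lw

print(h_gw)
print(h_lw)
print(h_w)
```

In the second column, the values -1.0 and 1.0 cancel under average pooling, while max pooling retains the strong positive response, which shows why the two pooled features carry complementary information.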

4.3 CRF decoder

A conditional random field (CRF) [36] is a conditional probability model for labeling and segmenting sequence data. For a linear-chain CRF, given an input sentence s, the score of a possible tag sequence l is calculated by:

$sc(l | s) = \sum_{j=1}^{m} \sum_{i=1}^{n} \lambda_{j} f_{j}(l_{i-1}, l_{i}, i, s)$ (18)

where m is the number of feature functions, n is the length of the sentence, i is a token index in the sentence, $l_{i}$ is the label of the current token, $l_{i-1}$ is the label of the previous token, $f_{j}$ is a feature function, and $\lambda_{j}$ is its weight. After computing the score of each possible tag sequence, we obtain the probability of a tag sequence by:

$p(l | s) = \frac{\exp[sc(l | s)]}{\sum_{l'} \exp[sc(l' | s)]} = \frac{1}{Z(s)} \exp[sc(l | s)]$ (19)

After that, we use the Viterbi algorithm to decode the most probable label sequence.
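Viterbi decoding for a linear-chain model can be sketched in plain Python as follows. The sketch assumes that the weighted feature sum of Eq. (18) has already been reduced to per-position emission scores and label-to-label transition scores, which is a common simplification and not necessarily the authors' parameterization. The label set and all score values are hypothetical.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label path and its score.

    emissions[i][y]: score of label y at position i.
    transitions[p][y]: score of moving from label p to label y.
    """
    n, k = len(emissions), len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    back = []                   # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_score, ptr = [], []
        for y in range(k):
            best_prev = max(range(k), key=lambda p: score[p] + transitions[p][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + emissions[i][y])
            ptr.append(best_prev)
        score = new_score
        back.append(ptr)
    last = max(range(k), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(back):  # follow back-pointers from the end to the start
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


labels = ["O", "B-PER", "I-PER"]
emissions = [[1.0, 2.0, 0.0],
             [0.5, 0.0, 1.0],
             [2.0, 0.0, 0.5]]
transitions = [[0.5, 0.0, -10.0],  # from O: moving to I-PER is strongly penalized
               [0.0, -1.0, 1.0],   # from B-PER
               [0.0, -1.0, 0.5]]   # from I-PER

path, best = viterbi(emissions, transitions)
print([labels[y] for y in path], best)
```

The large negative score for the transition from O to I-PER shows how transition scores let the decoder avoid label sequences that are invalid under IOB2.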

5 Experiments

5.1 Datasets

We evaluate the proposed model on three publicly available datasets including DEE corpus, MDEE corpus and MDEE+Gazz corpus [17], as well as IDNER. Table 4 shows some statistics of the four datasets. Specifically, IDNER is divided into (80%, 8%, 12%) for (training, development, test) sets.

Table 4
Indonesian datasets

Dataset	Train	Develop	Test
DEE	19000	1240	737
MDEE	19000	1240	737
MDEE+Gazz	19000	1240	737
IDNER	40079	3795	6224

5.2 Baseline models

We compare HSA with several top-performing neural NER models, including classic neural models (Bi-LSTM-CRF [18], NeuralNER [20], CNNs-Bi-LSTM [37], CNNs-Bi-LSTM-CRF [6] and AMFF [22]) and fine-tuned PLMs (mBERT 4 and ELMO [7]). We repeat each experiment 10 times and report the average results on the test set. The evaluation metrics are precision, recall and F1-score.
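For reference, precision, recall and F1-score over entity spans can be computed as in the sketch below. Exact-match scoring of (type, start, end) spans is assumed here; the text does not state the exact matching criterion, and the function name and the example spans are hypothetical.

```python
def precision_recall_f1(gold_spans, pred_spans):
    """Exact-match scores over sets of (entity_type, start, end) spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # predicted spans that match a gold span exactly
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: the LOC prediction has a wrong boundary and the ORG is missed.
gold = [("PER", 0, 1), ("LOC", 3, 5), ("ORG", 7, 9)]
pred = [("PER", 0, 1), ("LOC", 3, 4)]
p, r, f1 = precision_recall_f1(gold, pred)
print(p, r, f1)
```

F1 is the harmonic mean of precision and recall, so a model with high precision but low recall still receives a low F1-score.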

5.3 Parameter initialization

5.3.1 Word embedding

Due to the current lack of publicly available pre-trained Indonesian word embeddings, we pre-train GloVe and word2vec embeddings with 300 dimensions and ELMO embeddings with 512 dimensions on Indonesian news articles (more than 150 million tokens) crawled from several Indonesian news websites (shown in Table 3). If a token does not appear in the vocabulary of the pre-trained word embeddings, we assign it a random word embedding drawn from a Gaussian distribution. Besides, we continue pre-training mBERT³ on Indonesian news for domain adaptation.

5.3.2 Character embedding

We initialize the character embedding table for 114 Indonesian characters and punctuation marks from a random uniform distribution. Each character vector has 52 dimensions, with values in the range [-0.5, 0.5].
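This initialization can be sketched with NumPy as follows. The table size and value range are the ones stated above; the random seed and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # the seed is an arbitrary choice
num_chars, char_dim = 114, 52        # values stated in the text

# One row per character or punctuation mark, drawn uniformly from [-0.5, 0.5).
char_table = rng.uniform(-0.5, 0.5, size=(num_chars, char_dim))
print(char_table.shape)
```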

5.3.3 Hyperparameters setting

Keras 5 with TensorFlow backend is utilized to construct the model. The hyperparameters of our best model are shown in Table 5. In addition, we also use some other operations such as dropout [43] to avoid over-fitting.

Table 5
The value of hyperparameters

Component	Parameter	Value
CNN	Filter Size	3
	Number of Filters	30
Bi-LSTM	Number of Units (Character-level)	100
	Number of Units (Word-level)	200
Attention	Aspects (r^C) at character-level	2
	Aspects (r^W) at word-level	3
Dropout	CNN	0.55
	Bi-LSTM	0.5
Optimizer	Optimizer	Adam
	Learning Rate	0.01

5.4 Main performance

Table 6 shows that character feature extractors bring significant improvements to Indonesian NER, which indicates their effectiveness. Models that incorporate a CRF layer also show certain improvements. At the same time, classic neural models are slightly inferior to fine-tuned PLMs, which suggests that fine-tuned PLMs are better than classic neural models at mining semantic features in Indonesian NER. It is worth noting that mBERT does not perform as well as ELMO. We believe this is because mBERT pays more attention to learning language-independent features (which gives it superior performance in cross-lingual research), while language-specific features are ignored to a certain extent, which may cause a deficiency in NER tasks. In contrast, ELMO is pre-trained for a specific language and can therefore be better adapted to Indonesian NER tasks. Another point worth noting is that the overall performance of the models on the DEE, MDEE and MDEE+Gazz datasets is poor. A notable phenomenon is that precision on these three datasets is relatively high while recall is relatively low. The main reason is that the training sets of the three benchmark datasets contain many errors in which entities are annotated as non-entities, which results in insufficient training and in turn affects recall on the test sets. While maintaining precision, HSA effectively improves recall, indicating that HSA can better learn entity features from the training sets.

Table 6
The result in four datasets

Model	DEE P(%)	DEE R(%)	DEE F1(%)	MDEE P(%)	MDEE R(%)	MDEE F1(%)	MDEE+Gazz P(%)	MDEE+Gazz R(%)	MDEE+Gazz F1(%)	IDNER P(%)	IDNER R(%)	IDNER F1(%)
Bi-LSTM-CRF	80.12	36.13	48.80	82.11	38.14	52.09	82.32	50.47	62.58	86.11	87.29	86.70
NeuralNER	80.59	36.87	50.59	82.31	38.98	52.91	82.59	50.48	62.66	87.01	86.88	86.94
CNNs-Bi-LSTM	81.02	36.67	50.49	83.29	38.03	52.22	82.95	50.77	62.99	87.87	87.02	87.44
CNNs-Bi-LSTM-CRF	82.16	37.79	51.77	83.25	38.17	52.34	83.35	50.78	63.11	88.20	87.49	87.84
AMFF	83.23	39.11	53.21	83.95	39.32	53.56	84.13	51.25	63.70	89.83	88.77	89.30
mBERT	82.77	37.63	51.74	83.42	38.28	52.48	83.92	50.85	63.33	89.01	88.21	88.61
ELMO	81.47	38.68	52.46	83.23	39.86	53.90	84.03	50.84	63.35	88.44	88.99	88.71
HSA	84.53	40.02	54.32	84.99	40.81	55.14	84.02	52.18	64.38	90.83	91.21	91.02

HSA obtains the highest F1-score on all four datasets (Table 6), which demonstrates its effectiveness. By using an enhanced CNN structure and a structured-attention mechanism, we can extract deeper sequential features. Besides, different pooling operations (average and max pooling) obtain semantic information from different perspectives, which brings significant improvement to Indonesian NER. Another advantage of HSA is that the fusion of complementary static and contextual pre-trained word embeddings further boosts NER performance.

5.5 Ablation study

We carry out an ablation study of HSA on all datasets. Tables 7–10 show the results on the IDNER dataset. We omit the results on the other three datasets, which show similar trends.

Table 7
Performance of different word embeddings

Word Embedding	P (%)	R (%)	F1 (%)
word2vec	88.11	88.59	88.35
GloVe	88.63	88.24	88.43
ELMO	89.13	89.27	89.20
BERT	88.47	89.12	88.79
word2vec+ELMO	90.13	90.04	90.08
GloVe+ELMO	90.67	89.95	90.31
word2vec+mBERT	90.99	90.66	90.82
GloVe+mBERT	90.83	91.21	91.02

Table 8

Performance of different feature extractors

Model	P (%)	R (%)	F1 (%)
HSA_GC	85.21	85.78	85.49
HSA_LC	85.47	85.66	85.56
HSA_GW	87.21	87.59	87.40
HSA_LW	87.61	87.54	87.57
HSA_GC_LC	85.62	86.01	85.81
HSA_GW_LW	88.12	87.65	87.88
HSA_GC_GW	88.32	88.04	88.18
HSA_LC_LW	89.01	88.78	88.89
HSA_LC_GC_LW_GW	90.83	91.21	91.02

Table 9

Performance of three types of entities

Entity	P (%)	R (%)	F1 (%)
Name	92.19	93.08	92.63
Location	90.22	90.31	90.26
Organization	91.33	83.21	87.08
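As a consistency check on Table 9, the F1 column can be recomputed from the reported precision and recall with the standard harmonic-mean formula F1 = 2PR / (P + R). The sketch below does this in plain Python; the helper name `f1_score` and the dictionary layout are illustrative choices, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R, reported F1) per entity type, as listed in Table 9.
table9 = {
    "Name": (92.19, 93.08, 92.63),
    "Location": (90.22, 90.31, 90.26),
    "Organization": (91.33, 83.21, 87.08),
}

# Recompute F1 from P and R and round to the table's two decimals.
recomputed = {
    entity: round(f1_score(p, r), 2) for entity, (p, r, _) in table9.items()
}

for entity, (_, _, reported) in table9.items():
    print(entity, "reported:", reported, "recomputed:", recomputed[entity])
```

The recomputed values agree with the reported F1 column. The organization row also shows why F1 drops when recall (83.21) lags well behind precision (91.33).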

Table 10

Proportion of different error types

Entity	Error (1) (%)	Error (2) (%)	Error (3), L=2 (%)	Error (3), L=3 (%)	Error (3), L≥4 (%)
PER	15.21	65.13	2.14	4.78	12.74
LOC	13.99	62.34	3.58	8.01	12.08
ORG	16.22	75.09	0.00	1.90	6.79

5.5.1 The effect of different word embeddings

This experiment explores the impact of different word embeddings. As mentioned above, contextual word embeddings focus on the contextual information of the sequence, while static word embeddings focus, to a certain extent, on the information of the word itself. We therefore expect that fusing the two, so that they complement each other, will improve Indonesian NER performance. We use Word2Vec, GloVe, ELMO and BERT embeddings separately, as well as concatenations of a static and a contextual embedding, as the pre-trained word vectors (Table 7). Compared with directly fine-tuning mBERT, using mBERT vectors as external word vectors is more effective. We believe this is because fine-tuning mBERT on such a different dataset tends to destroy its pre-trained representations, leading to catastrophic forgetting [38]. In addition, combining the two kinds of vectors further improves NER performance, while the choice of static word vector has little effect. The combination of GloVe and mBERT achieves the best F1-score of 91.02%.
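The fusion described above can be sketched as a per-token concatenation of a static vector and a contextual vector. The example below is a minimal illustration in plain Python, not the authors' code: the function `fuse_embeddings`, the tiny lookup table, the zero vector used for unknown words, and all numeric values are hypothetical. In practice the static vectors would come from a pre-trained Word2Vec or GloVe table and the contextual vectors from ELMO or mBERT.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def fuse_embeddings(
    tokens: Sequence[str],
    static_table: Dict[str, Vector],
    contextual: Sequence[Vector],
    unk: Vector,
) -> List[Vector]:
    """Concatenate a static and a contextual vector for every token.

    `static_table` maps a word to its fixed vector; `contextual` holds one
    context-dependent vector per position; `unk` is used for unseen words.
    """
    if len(tokens) != len(contextual):
        raise ValueError("need exactly one contextual vector per token")
    fused = []
    for token, ctx_vec in zip(tokens, contextual):
        static_vec = static_table.get(token, unk)
        fused.append(list(static_vec) + list(ctx_vec))
    return fused


# Toy example: 2-dimensional static vectors, 3-dimensional contextual vectors.
tokens = ["ke", "Pulau", "Tidung"]
static_table = {"ke": [0.1, 0.2], "Pulau": [0.3, 0.4]}
contextual = [[1.0, 1.1, 1.2], [2.0, 2.1, 2.2], [3.0, 3.1, 3.2]]
fused = fuse_embeddings(tokens, static_table, contextual, unk=[0.0, 0.0])
print(fused)
```

Each fused vector has the combined dimensionality of its two parts, so downstream layers see both the word-level and the context-level signal.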

5.5.2 The effect of different corpus sizes

To explore the impact of data size on the model, we train it on 10,000, 15,000, 20,000, 25,000, 30,000, 35,000, 40,000, 45,000 and 50,000 sentences (each figure being the sum of the training, development and test sets) and observe how the results change as the data size increases. As Fig. 5 shows, once the corpus reaches 35,000 sentences, the improvement from additional data gradually flattens out.
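One simple way to build corpora of these sizes is to shuffle the sentences once and take prefixes, so that every smaller corpus is contained in the larger ones. The sketch below illustrates that idea only. Whether the authors used nested subsets, how they split each subset into training, development and test sets, and the names `nested_subsets` and `seed` are all assumptions.

```python
import random
from typing import Dict, List, Sequence


def nested_subsets(
    sentences: Sequence[str], sizes: Sequence[int], seed: int = 13
) -> Dict[int, List[str]]:
    """Return one subset per requested size; smaller subsets are prefixes of larger ones."""
    if max(sizes) > len(sentences):
        raise ValueError("requested size exceeds corpus size")
    shuffled = list(sentences)
    # A fixed seed keeps the subsets reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    return {n: shuffled[:n] for n in sorted(sizes)}


# The nine corpus sizes used in the experiment: 10,000 to 50,000 in steps of 5,000.
sizes = list(range(10_000, 50_001, 5_000))
# Placeholder sentences standing in for the real corpus.
corpus = [f"sentence-{i}" for i in range(50_000)]
subsets = nested_subsets(corpus, sizes)
print({n: len(subset) for n, subset in subsets.items()})
```

Nesting the subsets means that differences between adjacent points on the curve come from the added sentences rather than from a different random draw.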

Fig. 5

F1-score of different corpus sizes.

5.5.3 The effect of different feature extractors

To verify the effectiveness of the four feature extraction modules for Indonesian NER, we conduct experiments on different combinations of modules. As Table 8 shows, using only character features or only token features is not sufficient. Token features contribute more to Indonesian NER than character features, thanks to the superior performance of pre-trained word vectors and the efficient structured-attention operations. Local and global features perform similarly, which indicates that both are essential to Indonesian NER. Combining all four kinds of features maximizes NER performance.

5.6 Discussion

Table 9 shows the recognition results for the three entity types. The F1-scores for person and location are higher than that for organization. NER errors fall into three categories: (1) labeling an entity as a non-entity or a non-entity as an entity, (2) entity type errors, and (3) boundary errors.
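The three error categories can be made operational by comparing gold and predicted entity spans. The following sketch is one possible reading, not the authors' evaluation script. It assumes spans are `(start, end_exclusive, type)` tuples, treats a prediction with the same boundaries but a different type as a type error, and treats an overlapping prediction with different boundaries as a boundary error. The function and category names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def overlaps(a: Span, b: Span) -> bool:
    """True if the two spans share at least one token position."""
    return a[0] < b[1] and b[0] < a[1]


def categorize_errors(gold: List[Span], pred: List[Span]) -> Counter:
    """Count errors of category (1), (2) and (3) under the stated assumptions."""
    counts: Counter = Counter()
    gold_set, pred_set = set(gold), set(pred)
    for g in gold:
        if g in pred_set:
            continue  # exact match: correct prediction
        overlapping = [p for p in pred if overlaps(g, p)]
        if not overlapping:
            counts["missed_or_spurious"] += 1  # (1) entity labeled as non-entity
        elif any(p[:2] == g[:2] for p in overlapping):
            counts["type"] += 1                # (2) right boundaries, wrong type
        else:
            counts["boundary"] += 1            # (3) overlapping, wrong boundaries
    for p in pred:
        if p not in gold_set and not any(overlaps(p, g) for g in gold):
            counts["missed_or_spurious"] += 1  # (1) non-entity labeled as entity
    return counts


# Toy spans: one type error, one boundary error, one miss, one spurious entity.
gold = [(3, 4, "PER"), (14, 16, "LOC"), (20, 23, "ORG")]
pred = [(3, 4, "LOC"), (14, 15, "LOC"), (30, 31, "PER")]
counts = categorize_errors(gold, pred)
print(dict(counts))
```

Dividing each count by the total number of errors for an entity type gives proportions of the kind reported in Table 10. Boundary errors could additionally be bucketed by gold span length (`g[1] - g[0]`).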

As Table 10 shows, most recognition errors are entity type errors and boundary errors. Boundary errors occur mainly when the entity length is greater than or equal to 4, which indicates that entity length is an important factor affecting the performance of NER models. Among entity type errors, person and location are relatively easy to confuse with each other, and mislabeled organizations are mostly tagged as locations.

6 Conclusion

Given the scarcity of Indonesian language resources and the immaturity of language processing technologies for Indonesian, this paper constructs a large, high-quality Indonesian NER dataset. In addition, in view of the particularities of Indonesian word composition, we construct a novel neural model that extracts sequence features from different perspectives. Experimental results show that the dataset and model constructed in this paper achieve superior performance compared with previous Indonesian NER datasets and models. As future work, we plan to extend the proposed model to other sequence labeling tasks, such as POS tagging, and to explore more effective feature extractors.

Acknowledgments

This work is supported by grants from the National Social Science Foundation of China (No. 17CTQ045), the Soft Science Research Project of Guangdong Province (No. 2019A101002108), the Science and Technology Program of Guangzhou (No. 202002030227), the Social Science Foundation of Guangdong Province (GD20CWY10), and the Key Field Project for Universities of Guangdong Province (No. 2019KZDZX1016).

References

1. Morwal, Jahan and Chopra, Named Entity Recognition using Hidden Markov Model (HMM), International Journal on Natural Language Computing (IJNLC) 1(4) (2012), 15–23.

2. Sekine, Named Entity: History and Future, NYU Computer Science, 2003.

3. Zhang and Yang, Chinese NER Using Lattice LSTM, The 56th Annual Meeting of the Association for Computational Linguistics (ACL), 2018, 1554–1564.

4. Jin, Xie, Guo, Luo, Wu and Wang, LSTM-CRF Neural Network with Gated Self Attention for Chinese NER, IEEE Access 7 (2019), 136694–136703.

5. Gui, Ma, Zhang and Huang, CNN-Based Chinese NER with Lexicon Rethinking, Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), 2019, 4982–4988.

6. Ma and Hovy, End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF, The 54th Annual Meeting of the Association for Computational Linguistics (ACL), 2016, 1064–1074.

7. Matthew E.P., Mark, Mohit, et al., Deep Contextualized Word Representations, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2018, 2227–2237.

8. Devlin, Chang M.W., Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019, 4171–4186.

9. Akbik, Bergmann and Vollgraf, Pooled Contextualized Embeddings for Named Entity Recognition, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2019, 724–728.

10. Gary F.S. and Charles D.F., Summary by Language Size: Language Size, SIL International, 2017.

11. Suwarningsih, Supriana and Purwarianti, ImNER Indonesian Medical Named Entity Recognition, IEEE International Conference on Technology, Informatics, Management, Engineering and Environment, 2014, 184–188.

12. Budi, Bressan, Wahyudi, Hasibuan Z.A. and Nazief B.A.A., Named Entity Recognition for the Indonesian Language: Combining Contextual, Morphological and Part-of-Speech Features into a Knowledge Engineering Approach, Discovery Science, 2005, 57–69.

13. Leonandya R.A., Distiawan and Praptono N.H., A Semi-supervised Algorithm for Indonesian Named Entity Recognition, IEEE International Symposium on Computational and Business Intelligence, 2015, 45–50.

14. Wibawa A.S. and Purwarianti, Indonesian Named-entity Recognition for 15 Classes Using Ensemble Supervised Learning, Procedia Computer Science 81 (2016), 221–228.

15. Luthfi, Distiawan and Manurung, Building an Indonesian named entity recognizer using Wikipedia and DBpedia, Proceedings of the International Conference on Asian Language Processing (IALP), 2014, 19–22.

16. Alfina, Manurung and Fanany M.I., DBpedia Entities Expansion in Automatically Building Dataset for Indonesian NER, IEEE International Conference on Advanced Computer Science and Information Systems, 2016, 335–340.

17. Alfina, Manurung and Fanany M.I., Modified DBpedia Entities Expansion for Tagging Automatically NER Dataset, Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems, 2017, 216–221.

18. Huang, Xu and Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, Computer Science, 2015.

19. Kuru, Can O.A. and Yuret, CharNER: Character-Level Named Entity Recognition, Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers (COLING), 2016, 911–921.

20. Lample, Ballesteros, Subramanian, Kawakami and Dyer, Neural architectures for named entity recognition, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2016, 260–270.

21. Liu, Shang, Ren, Xu F.F., Gui, Peng and Han, Empower sequence labeling with task-aware neural language model, The 32nd AAAI Conference on Artificial Intelligence, 2018, 5253–5260.

22. Yang, Chen, Zhang, Ma and Chang, Attention-based Multi-level Feature Fusion for Named Entity Recognition, Proceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), 2020, 3594–3600.

23. Kurniawan and Louvan, Empirical Evaluation of Character-Based Model on Neural Named-Entity Recognition in Indonesian Conversational Texts, 2018, arXiv preprint arXiv:1805.12291.

24. Lin, Feng, Santos C.N.D., Yu, Xiang, Zhou and Bengio, A structured self-attentive sentence embedding, International Conference on Learning Representations (ICLR), 2017.

25. Sang E.F. and De Meulder, Introduction to the CoNLL-2003 shared task: language independent named entity recognition, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL, 2003, 142–147.

26. Gunawan, Suhartono, Purnomo and Ongko, Named-Entity Recognition for Indonesian Language using Bidirectional LSTM-CNNs, The 3rd International Conference on Computer Science and Computational Intelligence (ICCSCI) 135 (2018), 425–432.

27. Ikhwantri, Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER, 2019, arXiv preprint arXiv:1907.11158.

28. Kosakih J.A. and Khodra M.L., Transfer Learning for Indonesian Named Entity Recognition, International Symposium on Advanced Intelligent Informatics (SAIN), 2018.

29. Mintz, Bills, Snow, et al., Distant supervision for relation extraction without labeled data, International Joint Conference on Natural Language Processing, 2009, 1003–1011.

30. Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, The 3rd International Conference on Learning Representations (ICLR), 2015.

31. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N. and Polosukhin, Attention is all you need, Neural Information Processing Systems, 2017, 5998–6008.

32. Gehring, Auli, Grangier, Yarats and Dauphin Y.N., Convolutional Sequence to Sequence Learning, Proceedings of the 34th International Conference on Machine Learning (ICML) 70 (2017), 1243–1252.

33. Kim, Jernite, Sontag and Rush A.M., Character-aware neural language models, The 13th AAAI Conference on Artificial Intelligence, 2016, 2741–2749.

34. Mikolov, Chen, Corrado and Dean, Efficient Estimation of Word Representations in Vector Space, Proceedings of Workshop at ICLR, 2013.

35. Pennington, Socher and Manning C.D., GloVe: Global Vectors for Word Representation, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, 1532–1543.

36. Lafferty, McCallum and Pereira F.C.N., Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, Proceedings of International Conference on Machine Learning 3(2) (2001), 282–289.

37. Chiu J.P.C. and Nichols, Named Entity Recognition with Bidirectional LSTM-CNNs, Transactions of the Association for Computational Linguistics 4 (2016), 357–370.

38. Zhu, Xia, Wu, He, Qin, Zhou, Li and Liu, Incorporating BERT into Neural Machine Translation, The 8th International Conference on Learning Representations (ICLR), 2020.

39. Clark, Luong M.T., Le Q.V. and Manning C.D., ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, Proceedings of International Conference on Learning Representations (ICLR), 2020.

40. Liu, Ott, Goyal, Du, Joshi, Chen, Levy, Lewis, Zettlemoyer and Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, CoRR, 2019.

41. Xue, Constant, Roberts, Kale, Al-Rfou, Siddhant, Barua and Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, CoRR, 2021.

42. Conneau, Khandelwal, Goyal, Chaudhary, Wenzek, Guzmán, Grave, Ott, Zettlemoyer and Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, The 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020, 8440–8451.

43. Srivastava, Hinton, Krizhevsky, Sutskever and Salakhutdinov, Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Journal of Machine Learning Research 15(56) (2014), 1929–1958.
