CMCEE: A joint learning framework for cascade decoding with multi-feature fusion and conditional enhancement for overlapping event extraction

Abstract

Event extraction (EE) is an important natural language processing task, and many effective models have been developed for it. However, research on complex overlapping event extraction remains limited. We therefore propose CMCEE, a joint learning framework for cascade decoding with multi-feature fusion and conditional enhancement for overlapping event extraction. 1) We introduce a cascade decoding mechanism with multi-feature fusion to better capture the interaction between decoding layers. 2) We introduce an enhanced conditional layer normalization (ECLN) mechanism to strengthen the interaction between subtasks. The cascade decoding design also addresses the problem of overlapping events: the model successively performs three subtasks, namely type detection, trigger word extraction, and argument extraction. All three subtasks are learned jointly in one framework, and the new conditional normalization mechanism captures the dependencies among them. Experiments are conducted on FewFC, a benchmark for overlapping event extraction. The evaluation shows that our model achieves a higher F1 score on overlapping event extraction than the original overlapping event extraction model.

Keywords

Event extraction, overlapping events, ECLN, cascade decoding

1. Introduction

Event extraction (EE) involves identifying the event type, trigger words, and arguments within a given sentence. As shown in Fig. 1(a), the sentence describes a Judgment event with the trigger word “sentenced” and the arguments “court”, “Qinshang Group”, and “3 million yuan”. “Court” plays the subject role, while “Qinshang Group” plays the object role.

Figure 1.

There are two examples of events, flat event (a) and overlapping event (b). Trigger words are marked with gray boxes and argument words with underscores for easy distinction. At the same time, to make the overlapping events (b) clearer, we use two colors to distinguish the two types.

Nowadays, there are primarily two methods for extracting useful information from massive data. One is manual analysis and extraction, while the other is automatic analysis and extraction. With the advancement of science, the latter method typically yields far better results than the former. Currently, event extraction tasks are widely used in the construction of knowledge graphs, news summarization, and financial consulting, among other fields.

In the past, traditional event extraction tasks [13, 12] treated event extraction as a sequence labeling problem. These models are classic, but they have overlooked a significant issue: the presence of complex and irregular event extraction samples in the dataset. Consequently, they have not explored solutions for handling the problem of overlapping events in event extraction tasks.

Traditional event extraction methods often assume that there are no overlapping events in the dataset. In this study, we will discuss the extraction task of overlapping events. What are overlapping events? For example, Fig. 1(b) shows that “acquired” triggers both an Investment event and a Share Transfer simultaneously. In a single sentence, one word is used as the trigger for two different event types due to different arguments.

Sheng et al. [14] categorized overlapping events into three types: 1) a word can serve as a trigger for different event types across multiple events; 2) a word can act as an argument with different roles across multiple events; and 3) a word can play different roles as an argument within a single event.

Of course, in previous research, some researchers have also recognized the issue of overlapping events and provided proposed solutions. Yang et al. [20] introduced a framework based on pre-trained language models (PLMEE), which achieved remarkable results in event extraction tasks using pre-trained language models. They also observed the phenomenon of argument overlap but did not provide a corresponding solution for trigger word overlap, which can easily lead to error propagation. Sheng et al. [14] proposed a joint learning framework for event extraction based on cascade decoding (CasEE), the first paper to simultaneously handle all three types of overlap. However, we believe that CasEE is not fully utilizing the conditions, and the model rarely incorporates interactions between words, indicating that there is still significant room for improvement in the task of extracting overlapping events.

To address the above-mentioned issues, we propose the CMCEE model (A Joint Learning Framework for Cascade Decoding with Multi-Feature Fusion and Conditional Enhancement for Overlapping Event Extraction), shown in Fig. 2. CMCEE consists of three sub-tasks: event type detection, trigger word extraction, and argument extraction. These sub-tasks correspond to three decoders: the event type detection decoder, the trigger extraction decoder, and the argument extraction decoder. Within the cascade decoding method, we design a residual structure and incorporate it into the argument decoder module, which enables the model to capture more text features. Additionally, we introduce a new conditional layer normalization mechanism with the Enhanced Conditional Layer Normalization (ECLN) function. These innovations help integrate multiple conditions into the text encoding simultaneously and model the interaction between subtasks more effectively.

There are four contributions to this article:

We propose CMCEE, a multi-feature fusion and conditionally enhanced cascade decoding joint learning framework for overlapping event extraction. It can simultaneously address three types of overlapping events more effectively.

We have designed an enhanced conditional fusion function, which can integrate multiple conditions simultaneously and obtain more effective interaction information between subtasks.

We have designed a cascade decoding mechanism with multi-feature fusion, providing a new approach for the application of cascade decoding models in text processing tasks.

We conducted experiments on the FewFC dataset and compared our model with similar original models. Our model demonstrated superior performance.

2. Related work

Event extraction technology extracts events of interest from unstructured information and presents them to users in a structured way. It is one of the key tasks of natural language processing [1, 6]. Traditional event extraction methods fall into two groups. The first is pipeline-based event extraction [1, 17], which divides event extraction into sub-tasks that are executed in sequence and are independent of each other. First, trigger words are extracted and event types are detected based on the triggers. Then, argument extraction is carried out, relying on the predicted event types and trigger words. However, the pipeline method suffers from error propagation, which can degrade the model’s performance. The second is the joint extraction model [21, 8, 15], which completes trigger word extraction and argument extraction simultaneously. This not only avoids the propagation of errors from event detection but also shares hidden layer features. For instance, Nguyen et al. [13] proposed JRNN, a joint event extraction model based on recurrent neural networks, which uses two bidirectional RNNs to obtain richer representations and effectively avoids the error propagation of pipeline models. The methods above pay little attention to overlapping event extraction because they assume that the dataset is regular. However, this assumption does not imply the absence of overlapping events in the dataset.

In existing methods, Yang et al. [20] noted the phenomenon of argument overlap, but did not provide a corresponding solution to the issue of trigger word overlap. Furthermore, since PLMEE is a pipeline model that performs event extraction tasks in a pipeline manner, it is prone to the impact of error propagation. Wei et al. [19] introduced CasRel, a novel cascading labeling strategy for extracting overlapping relational triplets without being affected by the problem of overlapping triplets. For complex event extraction tasks, Sheng et al. [14] proposed CasEE, which represented the first attempt to handle overlapping event extraction within a cascading decoding joint framework. Sheng et al. [14] divided overlapping event extraction into three sub-tasks that were executed sequentially: event type detection, event trigger extraction, and argument extraction. Simultaneously, they leveraged the interaction between these sub-tasks to extract overlapping targets separately.

Figure 2.

Our proposed CMCEE model can be broadly divided into a BERT encoder and three sub-tasks. These three sub-tasks correspond to three modules: type detection decoder, trigger word extraction decoder, and argument extraction decoder. Furthermore, we adopt the conditional layer normalization mechanism (CLN) and our newly proposed enhanced conditional layer normalization mechanism (ECLN).

3. Our method

The structure of our model is shown in Fig. 2, featuring one encoder and three decoders: event type detection, event trigger word extraction, and event argument extraction. Meanwhile, to address the previously mentioned overlapping event problem, we adopt the cascaded decoding joint learning framework. Simultaneously, we incorporate the enhanced conditional layer normalization mechanism and the multi-feature fusion cascade decoding mechanism.

Specifically, we input the text sequence into the BERT encoder to obtain word-level and sentence-level features. Then, we identify the event types, trigger words, and arguments of the sentence, together with the role of each argument. Based on the predefined event schema, we have an event type set $C$ and an argument role set $R$. The overall goal is to predict all events in the golden set $\varepsilon_{x}$ of each sentence $x$ in the training data $D$; that is, we maximize the joint likelihood of $D$ in the overlapping event extraction task. The specific design is as follows:

$\displaystyle\prod\limits_{x\in D}\left[\prod\limits_{(c,t,a_{r})\in\varepsilon_{x}}p((c,t,a_{r})|x)\right]$

$\displaystyle=\prod\limits_{x\in D}\left[\prod\limits_{c\in C_{x}}p(c|x)\prod\limits_{t\in T_{x,c}}p(t|x,c)\prod\limits_{a_{r}\in A_{x,c,t}}p(a_{r}|(x+x_{t}),c,(t+c))\right]$ (1)

where $\varepsilon_{x}$ represents the golden set, $C_{x}$ represents the type set of $x\in D$ , $T_{x,c}$ represents the trigger word set of type $c$ , $A_{x,c,t}$ represents the argument set of type $c$ and trigger $t$ , and $x_{c}$ represents the textual representation of event type $c$ integrated into the encoder output.

Equation (1) exploits the interactions and dependencies among the type detection decoder, trigger word extraction decoder, and argument extraction decoder. The expression $p(c|x)$ represents the event type detection decoder, which identifies the event types of a sentence using the output of the BERT encoder as input. The expression $p(t|x,c)$ represents the event trigger extraction decoder, which integrates the event type as conditional information into the text representation and extracts event triggers. The expression $p(a_{r}|(x+x_{t}),c,(t+c))$ denotes the event argument extraction decoder, which incorporates both the event type and the trigger word as conditional information into the text representation. The argument extraction decoder fuses the text representations $x$ and $x_{c}$ into a text representation with more features, which is then used as its input.
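The cascade order of this factorization can be sketched in a few lines of Python. This is only an illustrative sketch: the three decoders are replaced by hypothetical lookup tables, and the function name `cascade_extract`, the toy sentence, and the role names are assumptions rather than part of the model.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # inclusive (start, end) token indices


def cascade_extract(
    tokens: List[str],
    detect_types: Callable[[List[str]], List[str]],
    extract_triggers: Callable[[List[str], str], List[Span]],
    extract_arguments: Callable[[List[str], str, Span], Dict[str, List[Span]]],
) -> List[dict]:
    """Enumerate events in the order p(c|x) -> p(t|x,c) -> p(a_r|x,c,t)."""
    events = []
    for c in detect_types(tokens):                  # type detection
        for t in extract_triggers(tokens, c):       # triggers conditioned on c
            args = extract_arguments(tokens, c, t)  # arguments conditioned on c and t
            events.append({"type": c, "trigger": t, "arguments": args})
    return events


# Hypothetical stand-ins for the three trained decoders (lookup tables, not models).
tokens = ["CompanyA", "acquired", "shares", "of", "CompanyB"]
toy_triggers = {"Investment": [(1, 1)], "Share Transfer": [(1, 1)]}
toy_arguments = {
    ("Investment", (1, 1)): {"subject": [(0, 0)], "object": [(4, 4)]},
    ("Share Transfer", (1, 1)): {"subject": [(0, 0)], "target": [(2, 2)]},
}

events = cascade_extract(
    tokens,
    detect_types=lambda toks: list(toy_triggers),
    extract_triggers=lambda toks, c: toy_triggers[c],
    extract_arguments=lambda toks, c, t: toy_arguments[(c, t)],
)
for event in events:
    print(event)
```

Because triggers are extracted once per detected type and arguments once per (type, trigger) pair, the same trigger span can appear in two events with different arguments, which is how a cascade handles overlap.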

3.1 BERT encoder

Devlin et al. [12] introduced BERT, which is a pre-training model, utilizing the strategy of randomly masking words and predicting them subsequently. Through unsupervised training on a large corpus in its early stages, BERT acquires rich textual information and learns language, syntax, word meanings, and other information beneficial for downstream tasks. In downstream tasks, it employs a small number of target domain datasets for supervised training, thereby addressing the resource cost issue associated with extensive labeled datasets for supervised learning. Before the emergence of the BERT model, the primary approach involved identifying trigger words within the text and determining the event type based on these triggers. However, with the introduction of event extraction models based on the BERT model, new approaches have been proposed for event extraction methods grounded in deep learning.

We utilize BERT as our encoder. Specifically, given a token sequence $X=\{w_{1},w_{2},\ldots,w_{n}\}$, where $n$ represents the number of words in the sequence, we input the sequence into the BERT encoder and take the output $H=\{h_{1},h_{2},h_{3},\ldots,h_{n}\}$ of the last hidden layer. Once we have the output of the hidden layer, we can proceed with the downstream task.

3.2 Event type detection decoder

In the previous step, we obtained the output of the BERT encoding layer. In this section, we will introduce the event type detection module’s contents. In the previous discussion, we emphasized our intention to fully leverage the interaction between the three decoder layers. Therefore, to enable subsequent event trigger and argument extraction, we must first detect the event type. Then, we can use type detection as conditional information to address the issue of overlapping trigger words.

Liu et al. [11] proposed an event type detection method that doesn’t rely on trigger words. Sheng et al. [14] also utilized trigger word-free event type detection in a cascaded decoding joint learning framework. Their experimental results highlighted the effectiveness of trigger word-free event type detection in event extraction tasks. Therefore, we also adopted a similar approach and designed an event type detection decoder to predict event types.

Specifically, we use an attention mechanism [18] to detect event types and to capture as much information relevant to each event type as possible. The input to the event type decoder comes directly from the output of the BERT encoding layer. We randomly initialize a learnable type embedding matrix, in which each embedding $c$ represents a candidate type. To measure the correlation between each token representation and a candidate event type, we define a similarity function $\delta$ that computes the similarity between $h_{i}$ and the candidate type $c$. To capture similarity information from as many perspectives as possible, this function is an expressive learnable function [14]. The specific design is as follows:

$\displaystyle\delta(c,h_{i})=K^{\top}\tanh(W[c;h_{i};|c-h_{i}|;c\odot h_{i}])$

$\displaystyle s_{c}=\sum\limits_{i=1}^{N}\frac{\exp(\delta(c,h_{i}))}{\sum\limits_{j=1}^{N}\exp(\delta(c,h_{j}))}h_{i}$ (2)

where $K\in\textbf{R}^{4d\times 1}$ and $W\in\textbf{R}^{4d\times 4d}$ are learnable parameter matrices, $[\cdot;\cdot]$ denotes concatenation, $|\cdot|$ is the element-wise absolute value, and $\odot$ denotes the element-wise product.

Finally, we calculate $s_{c}$ according to Eq. (2) and then use the similarity function again to measure the similarity between $s_{c}$ and $c$. Afterward, we apply an activation function: when the resulting value exceeds a certain threshold, the sentence is considered to contain this type of event. The specific design is as follows:

$\displaystyle\hat{c}=p(c|x)=\sigma(\delta(c,s_{c}))$ (3)

where $\sigma$ represents the sigmoid function. We set a threshold $\varepsilon_{1}\in[0,1]$: after calculating $\hat{c}$ with Eq. (3), we select every event type with $\hat{c}>\varepsilon_{1}$ as a result of event classification. Finally, all predicted event types of the sentence are collected in the event type set $C_{x}$.
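The type detection step can be illustrated with a small pure-Python sketch. It assumes that $s_{c}$ is the attention-weighted sum of the token representations, so that it can be compared with $c$ again; the random parameters, the dimension $d=3$, and the threshold value 0.5 are illustrative assumptions, not settings from the paper.

```python
import math
import random


def similarity(c, h, W, K):
    """delta(c, h) = K^T tanh(W [c; h; |c - h|; c * h])."""
    feats = c + h + [abs(a - b) for a, b in zip(c, h)] + [a * b for a, b in zip(c, h)]
    hidden = [math.tanh(sum(w * f for w, f in zip(row, feats))) for row in W]
    return sum(k * v for k, v in zip(K, hidden))


def detect_type(c, H, W, K):
    """Return p(c | x) and the attention weights for one candidate type embedding c."""
    scores = [similarity(c, h, W, K) for h in H]
    top = max(scores)                                  # subtract max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]                # softmax over tokens
    s_c = [sum(w * h[k] for w, h in zip(weights, H)) for k in range(len(c))]
    prob = 1.0 / (1.0 + math.exp(-similarity(c, s_c, W, K)))  # sigmoid
    return prob, weights


random.seed(0)
d = 3                                                  # assumed hidden size for the demo
H = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(4)]   # 4 token states
c = [random.uniform(-1, 1) for _ in range(d)]                       # one type embedding
W = [[random.uniform(-0.5, 0.5) for _ in range(4 * d)] for _ in range(4 * d)]
K = [random.uniform(-0.5, 0.5) for _ in range(4 * d)]

prob, weights = detect_type(c, H, W, K)
threshold = 0.5                                        # assumed value of the type threshold
print("type selected:", prob > threshold)
```

In a full model, `detect_type` would be evaluated for every candidate type embedding, and every type whose probability exceeds the threshold would be passed to the trigger decoder.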

3.3 Event trigger extraction decoder

Overlapping events can involve the issue of overlapping trigger words. To address the problem of overlapping event trigger words, we employ a trigger word decoder combined with conditional layer normalization (CLN). In essence, the input to the event trigger word decoding layer incorporates not only the output of the encoding layer but also takes into account the characteristics of the event type. Past research [19, 14] has proven the effectiveness of this approach.

In transformer models such as BERT, the primary normalization method is Layer Normalization, so it is natural to turn the corresponding $\beta$ and $\gamma$ into functions of the input conditions. This idea follows Conditional Batch Normalization (CBN), introduced by De Vries et al. [3]. Inspired by CBN, conditional layer normalization dynamically generates the gain $\gamma$ and bias $\beta$ from conditional information in order to control the behavior of the transformer model. This is the idea of Conditional Layer Normalization.

Previous research [14] obtained the conditional features for the other decoder layers with the conditional layer normalization proposed by Su [16]. This normalization mechanism dynamically generates gain $\gamma$ and bias $\beta$ based on conditional information. The specific design is as follows:

$\displaystyle\textit{CLN}(c,h_{i})=\gamma_{c}\odot\left(\frac{h_{i}-\mu}{\sigma}\right)+\beta_{c}$

$\displaystyle\gamma_{c}=W_{\gamma}c+b_{\gamma},\quad\beta_{c}=W_{\beta}c+b_{\beta}$ (4)

where $c$ represents the conditional information, $\mu$ and $\sigma$ are the mean and standard deviation of the elements of the hidden representation $h_{i}$, and $\gamma_{c}$ and $\beta_{c}$ are the conditional gain and bias. Through this process, we can integrate the conditions into $\beta$ and $\gamma$ of Layer Normalization.
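A minimal pure-Python sketch of Eq. (4) for a single token vector is given below. The function name, the small stabilizing constant `eps`, and the toy parameter values are assumptions made for illustration.

```python
import math


def conditional_layer_norm(h, c, W_gamma, b_gamma, W_beta, b_beta, eps=1e-12):
    """CLN(c, h) = gamma_c * (h - mu) / sigma + beta_c for one token vector h."""
    mu = sum(h) / len(h)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / len(h) + eps)
    # gamma_c = W_gamma c + b_gamma and beta_c = W_beta c + b_beta
    gamma = [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(W_gamma, b_gamma)]
    beta = [sum(w * ci for w, ci in zip(row, c)) + b for row, b in zip(W_beta, b_beta)]
    return [g * (x - mu) / sigma + b for g, x, b in zip(gamma, h, beta)]


h = [1.0, 2.0, 3.0, 4.0]          # one hidden vector h_i
c = [0.5, -0.5]                   # condition, e.g. a type embedding
zeros = [[0.0, 0.0] for _ in h]   # zero projection: the condition has no effect

# With zero projections, unit gain and zero bias, CLN reduces to plain layer normalization.
plain = conditional_layer_norm(h, c, zeros, [1.0] * 4, zeros, [0.0] * 4)

# A non-zero W_beta lets the condition shift every feature (here by 1.0 * 0.5 = 0.5).
W_beta = [[1.0, 0.0] for _ in h]
shifted = conditional_layer_norm(h, c, zeros, [1.0] * 4, W_beta, [0.0] * 4)
print(plain)
print(shifted)
```

The comparison between `plain` and `shifted` shows the design idea: the normalized features stay the same, and only the gain and bias change with the condition.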

Next, to further enhance the representation for trigger word extraction [14], we employ a multi-head attention mechanism in conjunction with the conditional layer normalization mechanism. The advantage of the attention mechanism lies in its ability to capture global connections. Additionally, it allows for parallelized calculations and, when compared with CNN and RNN, the model is simpler with fewer parameters. The specific design is as follows:

$\displaystyle Z^{c}=\textit{SelfAttention}(G^{c})$ (5)

where $G^{c}$ represents the text representation fused with the conditional information. After passing through the multi-head attention mechanism, the output $Z^{c}$ is obtained.

To predict trigger words, we employ a pair of binary taggers. To address the issue of overlapping events, we extract all trigger words within a sentence. Specifically, for the sentence $X=\{w_{1},w_{2},\ldots,w_{n}\}$, we predict whether each token $w_{i}$ is the start or end position of a trigger word. For the start probability $\hat{t}^{sc}_{i}$ and the end probability $\hat{t}^{ec}_{i}$, we set two thresholds $\xi_{2}$ and $\xi_{3}$, with $\xi_{2},\xi_{3}\in[0,1]$. A token with $\hat{t}^{sc}_{i}>\xi_{2}$ is chosen as a start position, and a token with $\hat{t}^{ec}_{i}>\xi_{3}$ is chosen as an end position of a trigger word. The specific design is as follows:

$\displaystyle\hat{t}^{sc}_{i}=p(t_{s}|w_{i},c)=\sigma(w_{t_{s}}^{\top}z^{c}_{i}+b_{t_{s}})$

$\displaystyle\hat{t}^{ec}_{i}=p(t_{e}|w_{i},c)=\sigma(w_{t_{e}}^{\top}z^{c}_{i}+b_{t_{e}})$ (6)

where $\sigma$ represents the sigmoid function and $z^{c}_{i}$ is the $i$-th token representation in $Z^{c}$. As we aim to address the issue of overlapping trigger words, we extract trigger words conditioned on the event type. Through Eq. (6), we calculate $\hat{t}^{sc}_{i}$ and $\hat{t}^{ec}_{i}$. Then, we enumerate all start positions and pair each one with the nearest end position that follows it to form a complete trigger word. In this way, we identify every trigger word within a sentence. Finally, the extracted trigger words are grouped by event type.
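The span decoding rule described above (threshold the start and end probabilities, then pair each start with the nearest following end) can be sketched as follows. The function name `decode_spans` and the default threshold of 0.5 are assumptions; the thresholds used in the experiments are not stated in this section.

```python
def decode_spans(start_probs, end_probs, start_threshold=0.5, end_threshold=0.5):
    """Pair every start position with the nearest end position at or after it."""
    starts = [i for i, p in enumerate(start_probs) if p > start_threshold]
    ends = [i for i, p in enumerate(end_probs) if p > end_threshold]
    spans = []
    for s in starts:
        following = [e for e in ends if e >= s]   # ends are already in ascending order
        if following:                             # a start without any end is dropped
            spans.append((s, following[0]))
    return spans


# Toy tagger outputs for a six-token sentence under one event type.
start_probs = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
end_probs = [0.1, 0.2, 0.7, 0.1, 0.9, 0.1]
trigger_spans = decode_spans(start_probs, end_probs)
print(trigger_spans)  # [(1, 2), (4, 4)]
```

Running the decoder separately for each detected event type yields one list of trigger spans per type, so the same span may be returned under several types.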

3.4 Event argument extraction decoder

To better solve the problem of overlapping arguments, we feed two text representations into the argument extraction decoder: the output of the encoding layer and the output of the event trigger word decoding layer. This approach not only allows us to obtain rich semantic features from the encoder but also gives access to feature information from the trigger word extraction module.

In this section, we discuss how to extract arguments for a specific event type $c\in C_{s}$ and event trigger $t\in T_{c}$. We also describe the enhanced conditional layer normalization layer, the enhanced conditional fusion function, and the cascaded decoder for multi-feature fusion.

When it comes to Enhanced Conditional Layer Normalization (ECLN), this is a normalization technique used in neural networks that incorporates the concept of conditional layer normalization. The aim of this technique is to better handle input data with multiple conditions in a neural network, such as those associated with different categories, styles, or other characteristics.

We call it a multi-feature fusion cascade decoder because, when designing the cascaded decoding model, our goal is to capture as many text features as possible at the input stage of the decoder, including the rich semantic features of the encoder and the text interaction provided by other decoders. Specifically, in the event argument extraction decoder, we combine the outputs of the encoder and the event trigger word decoder and use this combination as the decoder input.

In the enhanced conditional layer normalization of argument extraction module, the key is to combine the two sets of conditional information with the normalized parameters. This is usually done through the following steps:

First, for each sample, we calculate the mean and variance of its features. Then, the outputs of the event classification decoder and the trigger word extraction decoder are provided as conditional information to the argument extraction decoder. Second, this conditional information is used to calculate the scaling and shift parameters under the given conditions. Finally, the calculated scaling and shift parameters are applied to the normalized features to keep them within an appropriate range.

In conclusion, enhanced conditional layer normalization is a method that combines the concept of conditional layer normalization to make neural networks better adapt to a variety of input conditions. By incorporating multiple conditions into the normalization process, this technique can improve the flexibility and generalization performance of the model, and is suitable for a variety of tasks that involve dealing with different input conditions.

Building on previous research, we use relative position embeddings with respect to the trigger word to obtain token-level representations:

$\displaystyle Z^{ct}=[Z^{ct};P]$ (7)

where $P\in R^{N\times d_{p}}$ represents relative position embedding, $Z^{ct}$ represents the input text representation for the argument extraction decoder, and $[;]$ represents the concatenation of representations.

Figure 3.

Specifically, Enhanced Conditional Layer Normalization (ECLN) takes two sets of conditional information $c_{1}$ and $c_{2}$, as well as a textual representation $H$, and integrates the conditional information $c_{1}$ and $c_{2}$ into the text representation $H$.

During experiments with prior work [19, 14], we observed that the conditional fusion function of Su [16] can fuse only one condition. However, our model has three decoders, and the argument extraction decoder depends on two conditions. Therefore, building upon prior work, we design an enhanced conditional fusion function. As shown in Fig. 3, on the basis of the conditional layer normalization mechanism proposed by Su [16], we transform the multiple input conditions to the same dimension as $\beta$ and $\gamma$ through two different transformation matrices, and then add the transformed results to $\beta$ and $\gamma$, respectively. The specific formula is as follows:

$\displaystyle\textit{ECLN}(c_{1},c_{2},h_{i})=\gamma_{c_{1,2}}\odot\left(\frac{h_{i}-\mu}{\sigma}\right)+\beta_{c_{1,2}}$

$\displaystyle\gamma_{c_{1,2}}=W_{\gamma}c_{1}+W_{\gamma}c_{2}+b_{\gamma}$ (8)

$\displaystyle\beta_{c_{1,2}}=W_{\beta}c_{1}+W_{\beta}c_{2}+b_{\beta}$

where $c_{1}$ and $c_{2}$ represent two types of conditional information, $\mu$ and $\sigma$ are the mean and standard deviation of the elements of the hidden representation $h_{i}$, and $\gamma_{c_{1,2}}$ and $\beta_{c_{1,2}}$ are the gain and bias generated from both conditions. Through this process, we can integrate the enhanced conditions into $\beta$ and $\gamma$ of Enhanced Conditional Layer Normalization.
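The following pure-Python sketch mirrors Eq. (8) as printed for a single token vector. The helper names, the constant `eps`, and the toy values are illustrative assumptions; the two conditions stand for a type embedding and a trigger embedding of equal dimension.

```python
import math


def project(W, v):
    """Matrix-vector product W v with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def ecln(h, c1, c2, W_gamma, b_gamma, W_beta, b_beta, eps=1e-12):
    """ECLN(c1, c2, h): layer normalization whose gain and bias depend on two conditions."""
    mu = sum(h) / len(h)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in h) / len(h) + eps)
    gamma = [p1 + p2 + b for p1, p2, b in
             zip(project(W_gamma, c1), project(W_gamma, c2), b_gamma)]
    beta = [p1 + p2 + b for p1, p2, b in
            zip(project(W_beta, c1), project(W_beta, c2), b_beta)]
    return [g * (x - mu) / sigma + b for g, x, b in zip(gamma, h, beta)]


h = [1.0, 2.0, 3.0, 4.0]
type_cond = [1.0, 0.0]        # assumed toy type embedding (c_1)
trigger_cond = [0.0, 2.0]     # assumed toy trigger embedding (c_2)
W_gamma = [[0.1, 0.2] for _ in h]
W_beta = [[0.0, 0.0] for _ in h]

out = ecln(h, type_cond, trigger_cond, W_gamma, [1.0] * 4, W_beta, [0.0] * 4)
swapped = ecln(h, trigger_cond, type_cond, W_gamma, [1.0] * 4, W_beta, [0.0] * 4)
print(out)
```

Since the printed formula applies the same $W_{\gamma}$ and $W_{\beta}$ to both conditions, the result is symmetric in $c_{1}$ and $c_{2}$ and equals a single-condition normalization with condition $c_{1}+c_{2}$; the sketch makes this property easy to check.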

After obtaining a text representation that combines multiple features, we input it into the enhanced conditional layer normalization mechanism. Here, we take the average of the token representations at the start and end positions of $t$ as the trigger embedding. The type embedding and the trigger embedding are then integrated into the text representation to model the conditional dependencies between type detection and argument extraction, and between trigger word extraction and argument extraction.

Compared with previous work (CasEE) [14], we have reduced model redundancy by implementing “model pruning” to remove some “unimportant” parameters and modules from the model. Therefore, when designing the argument decoder structure, we did not utilize a multi-head attention mechanism.

Finally, to predict arguments, we use a set of role-specific binary tagger pairs [14]. Similar to the prediction of trigger words, for each token $w_{i}$ we predict whether it is the start position of an argument with role $r\in R$, and we then pair each start position with the nearest end position that follows it to form a complete argument. The detailed design is as follows:

$\displaystyle\hat{r}^{\textit{sct}}_{i}=p(a_{r}^{s}|w_{i},c,t)=I(r,c)\sigma(w_{r_{s}}^{\top}z^{ct}_{i}+b_{r_{s}})$

$\displaystyle\hat{r}^{\textit{ect}}_{i}=p(a_{r}^{e}|w_{i},c,t)=I(r,c)\sigma(w_{r_{e}}^{\top}z^{ct}_{i}+b_{r_{e}})$ (9)

where $\sigma$ represents the sigmoid function and $z^{ct}_{i}$ denotes the $i$-th token representation in $Z^{ct}$.

Considering that not all roles belong to a specific type $c$, we follow past work [14] and adopt a learnable indicator function $I(r,c)$. Following the predefined event schema, it indicates whether a role $r$ belongs to a type $c$ and thereby models the link between roles and event types. We parameterize the indicator function $I(r,c)$ as follows:

$\displaystyle I(r,c)=\sigma(w_{r}^{\top}c+b_{r})$ (10)

where $\sigma$ represents the sigmoid function. Similarly, we set two thresholds $\xi_{4},\xi_{5}\in[0,1]$. For each role $r$, we take each token with $\hat{r}^{\textit{sct}}_{i}>\xi_{4}$ as a start position of the argument role and each token with $\hat{r}^{\textit{ect}}_{i}>\xi_{5}$ as an end position of the argument role. To obtain the arguments $a$ with role $r$, we identify all start and end positions. Each start position paired with its closest following end position in the sentence forms a complete argument. To address the issue of overlapping event arguments, we extract arguments separately for each type and trigger word, so the generation of all predicted arguments $a_{r}$ is conditioned on a specific type $c$ and trigger $t$ in sentence $x$. Finally, the predicted arguments form the set $A_{x,c,t}$.
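The effect of the indicator function on the role-specific taggers can be sketched as below. All parameter values are hypothetical and chosen only to show the gating behaviour of Eqs. (9) and (10); the threshold 0.5 is likewise an assumption.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def role_indicator(type_emb, w_r, b_r):
    """I(r, c) = sigmoid(w_r^T c + b_r): soft test of whether role r belongs to type c."""
    return sigmoid(dot(w_r, type_emb) + b_r)


def role_position_probs(Z, type_emb, w_pos, b_pos, w_r, b_r):
    """Per-token probability that a token starts (or ends) an argument of role r."""
    gate = role_indicator(type_emb, w_r, b_r)
    return [gate * sigmoid(dot(w_pos, z) + b_pos) for z in Z]


Z = [[2.0, 0.0], [-2.0, 0.0], [3.0, 1.0]]   # toy token representations z_i^{ct}
type_emb = [1.0, -1.0]                      # toy type embedding c
w_start, b_start = [1.0, 0.0], 0.0          # toy start-tagger parameters for role r

# A role that belongs to the type (indicator close to 1) keeps its tagger scores.
allowed = role_position_probs(Z, type_emb, w_start, b_start, w_r=[0.0, 0.0], b_r=20.0)
# A role that does not belong to the type (indicator close to 0) is suppressed everywhere.
blocked = role_position_probs(Z, type_emb, w_start, b_start, w_r=[0.0, 0.0], b_r=-20.0)

threshold = 0.5                             # assumed start threshold
starts = [i for i, p in enumerate(allowed) if p > threshold]
print(starts)
```

The resulting start and end positions would then be paired with the same nearest-end rule that is used for trigger words.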

3.5 Model training

In this section, we describe how the model is trained. First, we define the overall training objective $\mathcal{L}$:

$\displaystyle\mathcal{L}=\sum\limits_{x\in D}\left[\sum\limits_{c\in C_{x}}\log p_{\theta_{1}}(c|x)+\sum\limits_{t\in T_{x,c}}\log p_{\theta_{2}}(t|x,c)+\sum\limits_{a_{r}\in A_{x,c,t}}\log p_{\theta_{3}}(a_{r}|(x+x_{t}),c,(t+c))\right]$ (11)

Here, $\mathcal{L}$ represents the total training objective, while the sub-objectives $p_{\theta_{1}}(c|x)$, $p_{\theta_{2}}(t|x,c)$ and $p_{\theta_{3}}(a_{r}|(x+x_{t}),c,(t+c))$, corresponding to the three subtasks of event type detection, trigger word extraction, and argument extraction, respectively, are defined as:

$\displaystyle p_{\theta_{1}}(c|x)=(\hat{c})^{\bar{c}}(1-\hat{c})^{(1-\bar{c})}$

$\displaystyle p_{\theta_{2}}(t|x,c)=\prod\limits_{z\in\{s,e\}}\prod\limits_{i=1}^{N}(\hat{t}^{zc}_{i})^{\bar{t}^{zc}_{i}}(1-\hat{t}^{zc}_{i})^{(1-\bar{t}^{zc}_{i})}$ (12)

$\displaystyle p_{\theta_{3}}(a_{r}|(x+x_{t}),c,(t+c))=\prod\limits_{r\in R_{z}}\prod\limits_{z\in\{s,e\}}\prod\limits_{i=1}^{N}(\hat{r}^{zc(t+c)}_{i})^{\bar{r}^{zc(t+c)}_{i}}(1-\hat{r}^{zc(t+c)}_{i})^{(1-\bar{r}^{zc(t+c)}_{i})}$

where $\theta_{1}\triangleq\{\theta_{\textit{bert}},\theta_{c}\}$, $\theta_{2}\triangleq\{\theta_{\textit{bert}},\theta_{t}\}$, and $\theta_{3}\triangleq\{\theta_{\textit{bert}},\theta_{a_{r}}\}$. Here, $\theta_{\textit{bert}}$, $\theta_{c}$, $\theta_{t}$ and $\theta_{a_{r}}$ represent the parameters of the BERT encoder, type detection decoder, trigger word extraction decoder, and argument extraction decoder, respectively. Additionally, $\bar{c}$, $\bar{t}^{sc}_{i}$, $\bar{t}^{ec}_{i}$, $\bar{r}^{sc(t+c)}_{i}$, and $\bar{r}^{ec(t+c)}_{i}$ are the true 0/1 labels of the training data. We train the model by maximizing $\mathcal{L}$ using the Adam stochastic optimization method [7].
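Taking the logarithm of the Bernoulli products in Eq. (12) gives a sum of binary cross-entropy terms, which the following sketch computes for one toy training instance. The function names, the clipping constant `eps`, and the toy predictions and labels are assumptions for illustration.

```python
import math


def bernoulli_log_likelihood(preds, labels, eps=1e-12):
    """Sum of y*log(p) + (1-y)*log(1-p), i.e. the log of one product in Eq. (12)."""
    total = 0.0
    for p, y in zip(preds, labels):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def joint_log_likelihood(type_term, trigger_term, argument_term):
    """Add the three subtask terms; each term is a (predictions, labels) pair."""
    return sum(bernoulli_log_likelihood(p, y)
               for p, y in (type_term, trigger_term, argument_term))


type_term = ([0.9], [1])                               # one candidate type
trigger_term = ([0.1, 0.8, 0.2, 0.7], [0, 1, 0, 1])    # start and end tags, flattened
argument_term = ([0.6, 0.3], [1, 0])                   # role-specific tags, flattened

objective = joint_log_likelihood(type_term, trigger_term, argument_term)
loss = -objective   # an optimizer would minimize the negative log-likelihood
print(loss)
```

Maximizing the objective is equivalent to minimizing `loss`, which is the form usually handed to a gradient-based optimizer such as Adam.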

4. Experiment

In the previous section, we introduced the concept and structure of the CMCEE model for overlapping event extraction. In this section, we design experiments to verify the effectiveness of CMCEE and compare its performance with baseline models.

4.1 Datasets and evaluation metrics

Our experiments use the Chinese financial event extraction benchmark FewFC [22]. FewFC contains 12,890 events of 10 financial event types in 8,982 sentences, of which 1,975 sentences contain overlapping elements. As shown in Table 1, we divided the 8,982 sentences into training, validation, and testing sets in an 8:1:1 ratio, giving 7,185 training sentences, 899 validation sentences, and 898 testing sentences.

In the FewFC dataset, there are 10 event types, such as Pledge, Sue and Investment. There are 18 argument role classes, such as “proportion”, “money”, and “number”. The validation and test sets cover all ten event types with a similar distribution. FewFC is a Chinese financial event extraction dataset, although the extraction method itself does not depend on the language. The FewFC dataset is publicly available at https://github.com/TimeBurningFish/FewFC.

The statistics of the FewFC [22] dataset are shown in Table 1. The columns give the number of sentences with overlapping elements, sentences without overlapping elements, all sentences, and all events. The rows correspond to the training set, validation set, testing set, and the whole dataset.

Table 1
Statistics for the dataset

	Overlap	Non-overlap	Sentence	Event
Training	1,560	5,625	7,185	10,277
Validation	205	694	899	1,281
Testing	210	688	898	1,332
All	1,975	7,007	8,982	12,890
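
As a quick arithmetic check on the split, the sketch below recomputes each subset's share of the sentences and its proportion of overlapping sentences. It uses only the counts from Table 1; the variable names are illustrative.

```python
# Sentence counts per split, copied from Table 1.
sentences = {"training": 7185, "validation": 899, "testing": 898}
# Sentences with overlapping elements per split, copied from Table 1.
overlap = {"training": 1560, "validation": 205, "testing": 210}

total = sum(sentences.values())  # 8,982 sentences overall

for split, n in sentences.items():
    share = n / total                  # fraction of all sentences in this split
    overlap_rate = overlap[split] / n  # fraction of this split with overlap
    print(f"{split:<10} {n:>5}  share={share:.3f}  overlap={overlap_rate:.3f}")
```

The shares come out at roughly 0.80, 0.10 and 0.10, in line with the 8:1:1 ratio, and between about 22% and 23% of the sentences in each split contain overlapping elements.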

The event type distribution of the data is shown in Table 2. Column 1 represents 10 event types, while columns 2, 3, and 4 represent the number of samples of event types in the training set, validation set, and testing set, respectively.

Table 2
Event type distribution of dataset

Event type	Training	Validation	Testing	All
Pledge	1,092	139	151	1,382
Share transfer	2,149	270	283	2,702
Sue	731	95	74	900
Investment	1,495	170	198	1,863
Share reduction	897	127	131	1,155
Acquisition	1,054	135	116	1,305
Guarantee	677	80	100	857
Acceptance of bid	809	92	101	1,002
Sign the contract	537	70	78	685
Judgment	836	103	100	1,039
Event	10,277	1,281	1,332	12,890

Following previous work [1, 5, 14], we evaluate with four metrics: 1) Trigger Identification (TI): a trigger is correctly identified if the span given by its predicted start and end positions matches the gold span; 2) Trigger Classification (TC): a trigger is correctly classified if it is correctly identified and its classification result is correct; 3) Argument Identification (AI): an argument is correctly identified if its event type is correct and the span given by its predicted start and end positions matches the gold span; 4) Argument Classification (AC): an argument is correctly classified if it is correctly identified and its predicted role matches the gold role.

For each of the four metrics, we report precision (P), recall (R) and F-measure (F1). Precision is the proportion of correct extraction results among all extraction results. Recall is the proportion of correct extraction results among all positive samples in the corpus. F1 is the harmonic mean of precision and recall and is the most comprehensive of the three.
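
The matching criteria and the P/R/F1 definitions above can be sketched as comparisons between sets of tuples. This is a minimal illustration, not the evaluation script used in the experiments: the function name `prf1`, the tuple layout `(start, end, event_type)` and the example triggers are all assumptions.

```python
from typing import Iterable, Set, Tuple


def prf1(predicted: Iterable[Tuple], gold: Iterable[Tuple]) -> Tuple[float, float, float]:
    """Precision, recall and F1 over two collections of hashable items."""
    pred_set: Set[Tuple] = set(predicted)
    gold_set: Set[Tuple] = set(gold)
    correct = len(pred_set & gold_set)  # items that match exactly
    precision = correct / len(pred_set) if pred_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical triggers for one sentence: (start, end, event_type).
gold_triggers = [(3, 4, "Judgment"), (10, 11, "Investment")]
pred_triggers = [(3, 4, "Judgment"), (10, 11, "Pledge"), (20, 21, "Sue")]

# TI compares spans only; TC additionally requires the event type to match.
ti = prf1([(s, e) for s, e, _ in pred_triggers],
          [(s, e) for s, e, _ in gold_triggers])
tc = prf1(pred_triggers, gold_triggers)
print("TI P/R/F1:", ti)
print("TC P/R/F1:", tc)
```

In this toy case two of the three predicted spans are correct, so TI has P = 2/3, R = 1 and F1 = 0.8, while only one prediction also has the right type, so TC has P = 1/3, R = 1/2 and F1 = 0.4. AI and AC can be scored the same way by extending the tuples with the argument span and, for AC, the role.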

4.2 Baseline model

In the comparison experiments, we establish several baselines based on existing approaches. The comparison models fall into two categories: multi-stage approaches for overlapping events and joint sequence labeling approaches.

The joint sequence labeling method transforms the event extraction task into a sequence labeling task. For the BERT-softmax model, we apply the softmax function on the last layer to output the probability distribution of each label. For the BERT-CRF model, we use a linear chain conditional random field (CRF) as the output layer, which is trained and tested by maximizing the joint probability.
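
To illustrate why a single tag sequence struggles with overlapping events, the sketch below writes spans into one BIO sequence and records where two spans compete for the same token. The BIO scheme, the function name `bio_tags` and the example spans are assumptions made for illustration; the baselines may use a different tagging scheme.

```python
from typing import List, Tuple


def bio_tags(n_tokens: int, spans: List[Tuple[int, int, str]]) -> Tuple[List[str], List[int]]:
    """Write (start, end, label) spans (end inclusive) into one BIO sequence.

    Returns the tag sequence and the token positions where a later span
    wanted a different tag than the one already stored (label conflicts).
    """
    tags = ["O"] * n_tokens
    conflicts: List[int] = []
    for start, end, label in spans:
        for i in range(start, end + 1):
            tag = ("B-" if i == start else "I-") + label
            if tags[i] != "O" and tags[i] != tag:
                conflicts.append(i)  # slot already taken by another label
            else:
                tags[i] = tag
    return tags, conflicts


# Hypothetical 6-token sentence in which token 2 triggers two events.
spans = [(2, 2, "Investment"), (2, 2, "ShareTransfer")]
tags, conflicts = bio_tags(6, spans)
print(tags)
print("conflicting positions:", conflicts)
```

Only the first label survives at position 2, so the second event cannot be represented. This is the kind of label conflict that limits the recall of sequence labeling approaches on overlapping events.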

The multi-stage methods for overlapping events extract event triggers and arguments sequentially. PLMEE [20], an event extraction method based on a pre-trained language model, addresses role overlap by separating argument prediction according to roles. Event extraction based on machine reading comprehension (MRC) has attracted increasing attention [9, 5, 10, 2]. Inspired by this line of work, we trained two MRC BERT models for overlapping event extraction and compared them with our method. For the MRC model design, we followed the approach of Sheng et al. [14] to build MQAEE in the following two variants:

1) MQAEE-1 first uses BERT to predict event types and then uses an MRC BERT to predict overlapping triggers and arguments based on the event types.
2) MQAEE-2 extracts arguments based on both type and trigger to address overlapping arguments. An MRC BERT first predicts overlapping triggers together with their types, and overlapping arguments are then predicted based on the type and trigger.

CasEE [14] was the pioneer in adopting cascading decoding to tackle the problem of overlapping event extraction. It involves joint extraction learning for three subtasks and emphasizes interaction among these subtasks.

We evaluated all of these models on the public FewFC dataset, applying the same data preprocessing and the same training, validation and test split to each of them.

4.3 Implementation details

To ensure a fair comparison between CMCEE and the comparison models, all models use the Chinese BERT-base model as the text encoder, which has 12 layers, a hidden size of 768, 12 attention heads and 110M parameters. As for the hyper-parameters, the prediction thresholds $\xi_{1}$, $\xi_{2}$, $\xi_{3}$, $\xi_{4}$, $\xi_{5}$ are tuned within the range $[0,1]$. The specific hyper-parameter settings are listed in Table 3.

Table 3
Hyper-parameter settings of CMCEE

Hyper-parameter	Value
Type embedding dimension $d$	768
Position embedding dimension $d_{p}$	64
Dropout rate of decoders	0.3
Batch size	8
Training epoch	20
Initial learning rate of BERT	$2\times10^{-5}$
Learning rate of decoders	$1\times10^{-4}$
$\xi_{1}$	0.5
$\xi_{2}$	0.5
$\xi_{3}$	0.5
$\xi_{4}$	0.5
$\xi_{5}$	0.5
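
The thresholds turn per-token probabilities into binary decisions. The sketch below shows one plausible way to use them when decoding spans: positions whose start or end probability exceeds the threshold are kept, and each start is paired with the nearest end at or after it. The pairing rule, the function name `decode_spans` and the probabilities are assumptions for illustration; the exact decoding rule of the model may differ.

```python
from typing import List, Tuple


def decode_spans(start_probs: List[float], end_probs: List[float],
                 start_threshold: float = 0.5,
                 end_threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Decode (start, end) spans from per-token start and end probabilities."""
    # Keep positions whose probability exceeds the respective threshold.
    starts = [i for i, p in enumerate(start_probs) if p > start_threshold]
    ends = [i for i, p in enumerate(end_probs) if p > end_threshold]
    spans = []
    for s in starts:
        # Pair each start with the nearest end at or after it, if any.
        following = [e for e in ends if e >= s]
        if following:
            spans.append((s, following[0]))
    return spans


# Hypothetical probabilities for a 6-token sentence.
start_probs = [0.1, 0.9, 0.2, 0.1, 0.7, 0.1]
end_probs = [0.0, 0.2, 0.8, 0.1, 0.6, 0.3]
print(decode_spans(start_probs, end_probs))              # → [(1, 2), (4, 4)]
print(decode_spans(start_probs, end_probs, 0.75, 0.75))  # → [(1, 2)]
```

With the default value of 0.5 both spans are recovered, whereas raising the thresholds to 0.75 drops the less confident span, which illustrates how the thresholds trade recall against precision.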

5. Experimental results

In this section we present the experimental results.

5.1 Compared with the baseline model

We trained and tested the models described above and recorded their precision, recall and F1 scores. Table 4 compares them with our proposed CMCEE model. From Table 4, we draw the following conclusions:

1) Our model achieves the highest F1 score among all compared models on each of the four evaluation metrics on the FewFC dataset.
2) Compared with the joint sequence labeling methods, CMCEE improves the TC F1 score by 7.8 and 8.5 percentage points over BERT-CRF and BERT-CRF-joint, respectively. The improvement is most pronounced in recall; for example, TC recall rises from 63.6% for BERT-CRF to 81.7% for CMCEE. We believe this is because sequence labeling methods suffer from label conflicts, which CMCEE effectively avoids.
3) Compared with the multi-stage methods for overlapping events, CMCEE also achieves higher F1 scores. Relative to PLMEE, CMCEE improves F1 by 4.2 and 3.7 points on TC and AC, respectively. Relative to MQAEE-2, it improves F1 on TC and AC by 1.5 and 5.2 points, respectively. The reason is that CMCEE exploits useful interactions and connections between subtasks when it jointly learns their text representations. Compared with CasEE, our model improves F1 on TI, TC, AI and AC by 0.1, 1.1, 1.3 and 1.1 points, respectively. We attribute this to the proposed enhanced conditional fusion function and the cascade decoding mechanism with multi-feature fusion, which strengthen the interaction between subtasks and also enrich the text features, so that CMCEE achieves better results on overlapping event extraction.
4) The Trigger Classification (TC) results in Table 4 reflect the performance of CMCEE on event type detection. The TC F1 score improves over previous methods: it is 1.1 points higher than that of CasEE and 4.2 points higher than that of PLMEE, which shows that CMCEE performs better than these models on the event type detection task. We attribute this improvement to the enhanced conditional layer normalization mechanism, which helps the model handle overlapping events more effectively than the earlier models.

Table 4
Experimental results of overlapping event tasks on the FewFC dataset (all values in %)

Model	TI P	TI R	TI F1	TC P	TC R	TC F1	AI P	AI R	AI F1	AC P	AC R	AC F1
BERT-softmax	89.8	79.0	84.0	80.2	61.8	69.8	74.6	62.8	68.2	72.5	60.2	65.8
BERT-CRF	90.8	80.8	85.5	81.7	63.6	71.5	75.1	64.3	69.3	72.9	61.8	66.9
BERT-CRF-joint	89.5	79.8	84.4	80.7	63.0	70.8	76.1	63.5	69.2	74.2	61.2	67.1
PLMEE	83.7	85.8	84.7	75.6	74.5	75.1	74.3	67.3	70.6	72.5	65.5	68.8
MQAEE-1	90.1	85.5	87.7	77.3	76.0	76.6	62.9	71.5	66.9	51.7	70.4	59.6
MQAEE-2	89.1	85.5	87.4	79.7	76.1	77.8	70.3	68.3	69.3	68.2	66.5	67.3
CasEE	89.4	87.7	88.5	77.9	78.5	78.2	72.8	73.1	72.9	71.3	71.5	71.4
CMCEE	88.6	88.6	88.6	77.0	81.7	79.3	72.0	76.5	74.2	70.3	74.8	72.5
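
The F1 gains discussed in this subsection can be recomputed directly from Table 4. The sketch below does so for the baselines mentioned in the text; the dictionary layout and the helper name `gains` are illustrative, while the numbers are copied from the table.

```python
# F1 scores (%) copied from Table 4, in the order TI, TC, AI, AC.
f1 = {
    "BERT-CRF":       (85.5, 71.5, 69.3, 66.9),
    "BERT-CRF-joint": (84.4, 70.8, 69.2, 67.1),
    "PLMEE":          (84.7, 75.1, 70.6, 68.8),
    "MQAEE-2":        (87.4, 77.8, 69.3, 67.3),
    "CasEE":          (88.5, 78.2, 72.9, 71.4),
    "CMCEE":          (88.6, 79.3, 74.2, 72.5),
}
metrics = ("TI", "TC", "AI", "AC")


def gains(baseline: str) -> dict:
    """Absolute F1 gain of CMCEE over a baseline, in percentage points."""
    return {m: round(ours - base, 1)
            for m, ours, base in zip(metrics, f1["CMCEE"], f1[baseline])}


for name in f1:
    if name != "CMCEE":
        print(f"{name:<15}", gains(name))
```

The differences are absolute gains in percentage points rather than relative improvements; for example, the TC gain over BERT-CRF is 79.3 − 71.5 = 7.8 points.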

5.2 Analysis

To demonstrate the effectiveness of the enhanced conditional layer normalization (ECLN) mechanism, we conducted a separate experiment, designed as follows:

We modified the CasEE model by replacing the conditional layer normalization layer (CLN) with our enhanced conditional layer normalization (ECLN) in the argument extraction decoder, and named the modified model CasEE-E.

Subsequently, we trained and tested BERT-CRF, PLMEE, MQAEE-2, CasEE and CasEE-E on the FewFC dataset and recorded their precision, recall and F1 scores. As before, the prediction thresholds $\xi_{1}$, $\xi_{2}$, $\xi_{3}$, $\xi_{4}$, $\xi_{5}$ are tuned within $[0,1]$, and the specific hyper-parameter settings are those in Table 3.

Table 5
F1 scores of CasEE-E and the baseline models on the FewFC dataset

Model	TI (%)	TC (%)	AI (%)	AC (%)
BERT-CRF	85.5	71.5	69.3	66.9
PLMEE	84.7	75.1	70.6	68.8
MQAEE-2	87.4	77.8	69.3	67.3
CasEE	88.5	78.2	72.9	71.4
CasEE-E	88.6	78.8	73.5	71.8

Table 6
F1 scores of CasEE, CasEE-E and CMCEE on the FewFC dataset

Model	TI (%)	TC (%)	AI (%)	AC (%)
CasEE	88.5	78.2	72.9	71.4
CasEE-E	88.6	78.8	73.5	71.8
CMCEE (ours)	88.6	79.3	74.2	72.5

Table 5 presents the F1 scores of BERT-CRF, PLMEE, MQAEE-2, CasEE and CasEE-E on the four evaluation metrics. CasEE-E outperforms the four baseline models on every metric, which confirms the effectiveness of the proposed enhanced conditional layer normalization (ECLN) in overlapping event extraction. We attribute the advantage of CasEE-E over CasEE to ECLN: compared with CasEE, CasEE-E can effectively integrate the features of the event type detection encoder and the trigger extraction decoder into the argument extraction decoder.

Table 6 demonstrates that our method outperforms CasEE and CasEE-E in all four evaluation metrics. Consequently, we can conclude that the multi-feature fusion cascading decoding mechanism we added to CasEE-E is effective. This mechanism allows us to obtain richer text features in the input of the argument extraction decoder, including the semantic features from the encoder and the text interaction provided by the trigger extraction decoder.

5.3 Argument extraction decoder variants

Table 7
Results of the argument extraction decoder variants on the AC metric

Variants	P (%)	R (%)	F1 (%)
AED-1	69.0	73.8	71.3
AED-2	67.0	76.5	71.4
AED-3	69.4	74.8	72.0
CMCEE	70.3	74.8	72.5

Table 7 reports a further analysis of our method. We experimented with different variants of the argument extraction decoder, each combined with type detection and trigger extraction. There are three variants: 1) AED-1 only adds the attention layer to the argument extraction module of CMCEE; 2) AED-2 only removes the residual mechanism from the argument extraction module of CMCEE; 3) AED-3 removes the residual mechanism and adds a self-attention layer. We then computed the P, R and F1 scores of each variant on the argument classification (AC) metric. CMCEE achieves the highest F1 score among the variants.

The results indicate that removing the residual structure lowers the F1 score: AED-2 and AED-3 reach 71.4% and 72.0%, compared with 72.5% for CMCEE. This decrease occurs because the model fails to capture additional text features. In addition, the F1 score of AED-3 is higher than that of AED-2, which indicates that the self-attention mechanism can further refine the representation used for argument extraction.

6. Conclusion

In this paper, we proposed the CMCEE model and explained its architecture in detail. We then conducted experiments on the FewFC dataset. In comparison with multiple baseline models, CMCEE proved effective on the overlapping event extraction task and achieved higher F1 scores. Building on the earlier conditional layer normalization, we further proposed an enhanced conditional layer normalization (ECLN), which strengthens the interaction between the event argument extraction decoder and the other subtasks. A separate experiment showed that the proposed ECLN mechanism is effective on the overlapping event dataset FewFC.

Finally, we compare the metrics of CasEE-E and CMCEE on the overlapping event extraction task, demonstrating the effectiveness of our multi-feature conditional fusion cascaded decoding mechanism. To further demonstrate the performance of our method, we conducted experiments on different argument extraction decoder variants, combined with type detection and trigger extraction.

At present, the enhanced conditional layer normalization (ECLN) has only been shown to be effective on the event extraction task of the overlapping event dataset FewFC. In future work, we will study how to make the ECLN mechanism applicable to other fields.

Acknowledgments

We would like to thank the anonymous reviewers for their insightful comments and suggestions. This research is supported by the National Natural Science Foundation of China [grant number U2003208], the Xinjiang Autonomous Region key research and development project [grant number 2021B01002], the Xinjiang Autonomous Region major scientific and technological projects [grant number 2020A03004-4] and the University research program projects, China [grant number XJEDU2022P018].

Conflict of interest

Competing interests statement: The authors declare that they have no competing financial interests.

Data availability statement

Data openly available in a public repository.

FewFC contains 10 financial field event types and 8982 sentences. The FewFC data supporting the results of this study is publicly available at https://github.com/TimeBurningFish/FewFC.

References

1. Chen, Liu, Zeng and Zhao, Event extraction via dynamic multi-pooling convolutional neural networks, in: Association for Computational Linguistics, Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Vol. 1, 2015, pp. 167–176.

2. Chen, Ebner, White A.S. and Van Durme, Reading the Manual: Event Extraction as Definition Comprehension, in: Association for Computational Linguistics, Proceedings of the Fourth Workshop on Structured Prediction for NLP, 2020, pp. 74–83.

3. De Vries, Strub, Mary, Larochelle, Pietquin and Courville A.C., Modulating early visual processing by language, Advances in Neural Information Processing Systems 30 (2017).

4. Devlin, Chang M.W., Lee and Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Association for Computational Linguistics, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, 2019, pp. 4171–4186.

5. Du and Cardie, Event Extraction by Answering (Almost) Natural Questions, in: Association for Computational Linguistics, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 671–683.

6. Fei, Zhang and Ji, End-to-end semantic role labeling with neural transition-based model, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, issue 14, 2021, pp. 12803–12811.

7. Kingma D.P. and Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980, 2014.

8. Huang and Han, Biomedical event extraction based on knowledge-driven tree-LSTM, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, 2019, pp. 1421–1430.

9. Peng, Chen, Wang, Pan, Lyu et al., Event extraction as multi-turn question answering, Findings of the Association for Computational Linguistics: EMNLP, 2020, 829–838.

10. Liu, Chen, Liu and Liu, Event extraction as machine reading comprehension, in: Association for Computational Linguistics, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020, pp. 1641–1651.

11. Liu, Zhang, Yang and Zhou, Event detection without triggers, in: Association for Computational Linguistics, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, 2019, pp. 735–744.

12. Liu, Luo and Huang, Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation, in: Association for Computational Linguistics, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 1247–1256.

13. Nguyen T.H., Cho and Grishman, Joint event extraction via recurrent neural networks, in: Association for Computational Linguistics, Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 300–309.

14. Sheng, Guo, Hei, Wang et al., CasEE: A Joint Learning Framework with Cascade Decoding for Overlapping Event Extraction, Association for Computational Linguistics, Findings of the Association for Computational Linguistics: ACL-IJCNLP, 2021, 164–174.

15. Sheng, Hei, Guo, Wang et al., A joint learning framework for the CCKS-2020 financial event extraction task, Data Intelligence 3(3) (2021), 444–459.

16. Conditional text generation based on conditional layer normalization, Available from: https://spaces.ac.cn/archives/7124, 2019.

17. Subburathinam, May, Chang S.F., Sil et al., Cross-lingual structure transfer for relation and event extraction, in: Association for Computational Linguistics, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 313–325.

18. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N. et al., Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).

19. Wei, Wang, Tian and Chang, A Novel Cascade Binary Tagging Framework for Relational Triple Extraction, in: Association for Computational Linguistics, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1476–1488.

20. Yang, Feng, Qiao, Kan and Li, Exploring pre-trained language models for event extraction and generation, in: Association for Computational Linguistics, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 5284–5294.

21. Zhang, Qin, Zhang, Liu and Ji, Extracting Entities and Events as a Single Task Using a Transition-Based Neural Model, in: International Joint Conferences on Artificial Intelligence Organization, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 5422–5428.

22. Zhou, Chen, Zhao and Li, What the role is vs. What plays the role: Semi-supervised Event Argument Extraction via Dual Question Answering, Proceedings of AAAI-21. AAAI Press 35(16) (2021), 14638–14646.
