Abstract
We develop a system that generates minutes while the system and user interactively structure and summarize the discussion. With this system, a user can look back on a meeting from a specific viewpoint or purpose by sending an inquiry to the minutes, receiving the result, and performing a new operation. To improve the reusability of the minutes, it is necessary to understand the semantic relationships among utterances at a deep level. We therefore propose a novel tree structure called the discussion time-span tree for representing a hierarchical discussion based on the relative importance of utterances derived from the meeting. The discussion time-span tree is a binary tree, each leaf of which corresponds to an utterance. Adjacent relevant utterances are hierarchically grouped, and the most important utterance becomes a head. In this way, the tree represents the implicit meaning and intention contained in meeting records, and can be used for summarization. Our system aims to help users grasp the main points of a meeting by interactively manipulating this tree structure and offering a different viewpoint to each user. An evaluation of important utterance extraction using the proposed system showed that its ROUGE-2 score was higher than those of existing document summarization techniques. In a subsequent user evaluation conducted with ten participants, we found that the system can be used to grasp conference contents efficiently. The results of both evaluations demonstrate the usefulness of the system.
Introduction
Meetings are one of the most significant activities in collective intelligence and play a pivotal role in the smooth advancement of work and research [15]. Minutes are documents that record the content of a meeting and disseminate any decisions made to the relevant parties. As such, they are a crucial component when it comes to sharing information. Because recorded minutes detail how opinions were exchanged, what paths were taken until conclusions were reached, and many other pieces of information, they serve as content that yields new value through reuse. However, it is difficult to grasp information that is not explicitly recorded in conventional minutes, as they often summarize only superficial details [18].
In light of these circumstances, minutes data sets with various annotations added by hand, such as the AMI Meeting Corpus [1] and Discussion Mining1
Nagao Laboratory of Nagoya University: Discussion Mining Project, Source
The aim of the present study is to develop a method for the accurate, flexible extraction of knowledge from meeting records and evaluate its validity. To deal with the above-mentioned problems, we need to analyze the semantic relationships among utterances in a meeting at a deep level. Therefore, we propose a novel tree structure that hierarchically expresses the importance of each utterance on the basis of both verbal and non-verbal information. The underlying concept is to represent organized knowledge and clarify semantic relationships by means of the hierarchical structure. Our design of the tree structure is inspired by the Generative Theory of Tonal Music (GTTM) [10], which is a theory that parses the chronological sequence of musical events. GTTM derives hidden hierarchical structures from the surface structure of scores, thereby extracting the underlying deep structures. We apply music theory to the analysis of a meeting record by hypothesizing that both the musical events in a musical composition and the utterances in a discussion generate meaningful groups (gestalts) along a time axis.
The final goal of our research is to develop a system that generates minutes using the question and answer (Q&A) statements exchanged between users and the system, targeting the discussion corpus recorded by Discussion Mining [15]. The proposed system can extract intention, compose a summary, and switch viewpoints by operating on the above-mentioned tree structure interactively. The user inquires about the minutes in various ways and repeatedly makes new inquiries that take the previous results into account. Exploratory data analysis of discussions requires a trial-and-error process because the information needed by the user is initially undefined, and only becomes clearer during the data analysis process [6]. In addition, Tao et al. emphasized the importance of reviewing the result from a personal viewpoint in information search [19]. Our system enables users to satisfy their information needs by repeating this exploratory operation from a personal viewpoint.
So far, there have been numerous studies on the generation of meeting summaries by applying document summarization techniques. Much research has applied conventional summarizing methods, such as the LEAD-based method [3], which extracts opening lines from an input document, and LexRank [4], which is based on the concept of PageRank [17]. Although these methods can greatly reduce the number of utterances to be analyzed, they pay little attention to utterance structures or the correspondence among utterances. There is also the problem that currently available techniques generate syntactically unnatural sentences and meanings. In addition, exploratory analysis is important in information retrieval, such as the extraction of important statements for meeting contents, because the information required by the user is initially undefined, and only becomes clearer during the data analysis process [6].
There have also been many studies on discourse structure analysis [7,12,14]. These methods can represent the relationship between sentences and rapidly discover and understand the points made during meetings. However, although they can acquire superficial group structures of a discussion, it is not possible to acquire hierarchical group structures. In addition, since utterances tend to be made spontaneously during meetings, syntactic analysis often does not work well. This is because conventional research has focused on text-only analysis, and does not consider the unique situation of an utterance scenario.
For these reasons, many studies on discussion analysis stress the importance of focusing not only on verbal information but also on non-verbal information, such as utterance duration and turn-taking [2,9]. From the 2000s onward, research has focused on discussion analysis of the non-verbal communication of meeting participants, particularly from the perspective of portraying meetings as multimodal interactions between multiple individuals [8,16]. Ichino demonstrated several feature quantities of non-verbal information that work effectively to distinguish the discussion state [8]. Okada et al. estimated communication ability from non-verbal information, such as utterance turns and prosody, that participants exhibit in group discussions [16]. These studies clarified the effective feature quantities with high accuracy in discussion analysis. Therefore, we conclude that developing practical applications, such as an automatic summarization system based on these research results, will be useful.
Discussion time-span tree
Representation of discussion time-span tree
In this study, we adopt a hierarchical, ordered structure as a method to represent hidden meaning and/or intention in meeting records. Data of this nature are time-series data. People comprehend the discussion structure from the relationships between utterances, e.g., whether one is for or against the other, or a question and an answer, which yield a gestalt in understanding the discussion. For adjacent utterances, one is more important than the other. Therefore, the semantic structure of a discussion is modeled using a tree structure that represents the importance of the utterances hierarchically. We refer to this as a discussion time-span tree (Fig. 1). By using this expression method, we can understand the discussion structure and the general flow of a discussion step by step on the basis of the relationship between the utterances.

Discussion time-span tree.
The discussion time-span tree is a binary tree that has a branch for each utterance. The tree (i) regards adjoining relevant utterances as one group and (ii) represents the hierarchical importance of the utterances. Table 1 shows an example of a thread that is composed of eight utterances. The discussion time-span tree is constructed on this thread. Regarding (i),
Outline of sample discussion
A discussion time-span tree is generated from the information in a bottom-up manner in accordance with two kinds of rules: (1) grouping preference rules (GPRs), which acquire the grouping structures in the discussion, and (2) significance preference rules (SPRs), which select a significant utterance that represents the duration of a certain entire group. In this research, we designed the GPRs and SPRs with reference to [8] and [16], which clarified the effective feature quantities with high accuracy in discussion analysis. Tables 2 and 3 present the list of GPRs and SPRs, respectively.
These rules can judge the similarity of topics on the basis of temporal proximity, speaking order, and text information. In addition, we assume that the importance of an utterance can be determined by the number of times it is uttered, the duration, and the frequency of occurrence of important words. As a concrete example of a rule implementation for (i), GPR1a could be: “Consider a sequence of four utterances, n1–n4; the transition n2–n3 may be considered a group boundary if the interval between utterances changes there.” For (ii), SPR3b could be: “Consider a sequence of four utterances, n1–n4; n1 may be considered more significant if it is the utterance that contains important words.”
List of grouping preference rules (GPRs)
List of significance preference rules (SPRs)
The procedure for generating a discussion time-span tree is as follows (Fig. 2). First, to obtain the hierarchical group structures, we need to acquire the top-down structures using local boundaries determined through bottom-up processing. Therefore, we compute the strength of each boundary between utterances as a boundary value according to the applied GPRs (Fig. 2(a)). We then split the group into two chunks at the strongest boundary, and repeat the process as long as a group contains an internal local boundary. Thus, a local/global hierarchy can be obtained (Fig. 2(b)). The discussion time-span tree is generated bottom-up from this local/global hierarchy, and the degree of importance of each utterance is determined by applying the SPRs to the whole group (Fig. 2(c)).

Generation method of discussion time-span tree.
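The generation procedure can be illustrated with a minimal sketch. It assumes that the GPRs have already been reduced to one numeric boundary value per pair of adjacent utterances and the SPRs to one numeric significance score per utterance; the names `Node`, `build_tree`, `boundaries`, and `significance` are hypothetical and are not taken from the actual implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A group of adjacent utterances; `head` is its most significant utterance."""
    start: int                      # index of the first utterance in the group
    end: int                        # index of the last utterance (inclusive)
    head: int                       # index of the head utterance
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(boundaries: List[float], significance: List[float],
               start: int = 0, end: Optional[int] = None) -> Node:
    """Split [start, end] at its strongest boundary, then pick heads bottom-up.

    boundaries[i] is the assumed GPR-derived strength of the boundary between
    utterances i and i + 1; significance[i] is the assumed SPR-derived
    importance of utterance i.
    """
    if end is None:
        end = len(significance) - 1
    if start == end:                # a single utterance is a leaf
        return Node(start, end, head=start)
    # Strongest boundary inside the group; ties go to the earliest boundary.
    split = max(range(start, end), key=lambda i: boundaries[i])
    left = build_tree(boundaries, significance, start, split)
    right = build_tree(boundaries, significance, split + 1, end)
    # The more significant of the two child heads represents the whole group.
    if significance[left.head] >= significance[right.head]:
        head = left.head
    else:
        head = right.head
    return Node(start, end, head, left, right)


# Four utterances with three boundaries between them (illustrative values).
tree = build_tree(boundaries=[0.2, 0.9, 0.1],
                  significance=[0.8, 0.3, 0.4, 0.6])
print(tree.head, tree.left.head, tree.right.head)  # 0 0 3
```

In this toy input the strongest boundary lies between the second and third utterances, so the thread is first divided into two groups of two utterances each, and the first utterance becomes the head of the whole tree.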
We designed parameters that manipulate the weights of the rules, called weight parameters, for generating the discussion time-span tree. These parameters are weighted by considering the number of applications of each rule, normalizing it in the range of values (0.0 to 1.0), and then adjusting the weight value (default value: 0.5). We utilized a slider GUI as the user interface of this parameter. The final boundary value is obtained by multiplying the boundary value given by applying rules by the weight value set by the parameter. Users can improve the analysis accuracy and generate a tree structure corresponding to different viewpoints by setting this weight parameter freely. Setting this weight parameter properly is an important task in terms of providing new viewpoints for information retrieval. One disadvantage is that users need to set various weight parameters through trial-and-error, since all the parameters have mutual relationships with each other.
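As a rough illustration of how the weight parameters act on the boundary values, the following sketch multiplies the value contributed by each rule by its weight and sums the results per boundary. The summation over rules, the input format `rule_hits`, and the function name `weighted_boundaries` are assumptions made for this example; only the value range (0.0 to 1.0) and the default of 0.5 come from the description above.

```python
from typing import Dict, List

DEFAULT_WEIGHT = 0.5  # default slider position stated in the text


def weighted_boundaries(rule_hits: List[Dict[str, float]],
                        weights: Dict[str, float]) -> List[float]:
    """Combine per-rule boundary values into one weighted value per boundary.

    rule_hits[i] maps a rule name to the value that rule assigns to the
    boundary after utterance i (hypothetical input format).
    """
    result = []
    for hits in rule_hits:
        total = 0.0
        for rule, value in hits.items():
            w = weights.get(rule, DEFAULT_WEIGHT)
            if not 0.0 <= w <= 1.0:
                raise ValueError(f"weight for {rule} must lie in [0.0, 1.0]")
            total += w * value      # boundary value times weight value
        result.append(total)
    return result


# Two rules fire on different boundaries; the third boundary triggers no rule.
hits = [{"GPR1a": 1.0}, {"GPR2b": 1.0}, {}]
default_view = weighted_boundaries(hits, {})
timing_view = weighted_boundaries(hits, {"GPR1a": 1.0, "GPR2b": 0.2})
topic_view = weighted_boundaries(hits, {"GPR1a": 0.2, "GPR2b": 1.0})
print(default_view)  # [0.5, 0.5, 0.0]
print(timing_view)   # [1.0, 0.2, 0.0]
print(topic_view)    # [0.2, 1.0, 0.0]
```

With the default weights the first two boundaries are tied, whereas each adjusted setting moves the strongest boundary, and hence the top-level split of the tree, to a different position.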
System overview
In this section, we give an overview of our proposed system, which provides the user with an interactive way to generate minutes by using the mechanism based on a discussion time-span tree. First, as an example, we describe the various information extraction methods for the scenario shown in Table 1 by generating a discussion time-span tree.
Intention extraction from discussion. The grouping structure enables the extraction of various types of information from the discussion, such as unity of meaning and utterance, including the identified subject. Assume we explore the generated discussion time-span tree from the top down, where global grouping structures (
Descriptions of functions
Summarization by reduction. The reduction of a discussion time-span tree assigns structural importance to each utterance in a hierarchical manner. The reduction is identified with the subsumption relation, which is the most fundamental relation in knowledge representation. Figure 1 shows the reduction concept. The use of reduction enables us to produce a group of utterances that summarizes the original meeting records. For example, cutting at the upper level (level a) generates a summary composed of two utterances (
Viewpoint switching by parameter adjustment. The adjustable parameters alter the weighted value of the rules for extracting intended important utterances and structuring the information for different purposes and from different viewpoints. The discussion time-span tree in Fig. 1 is generated by giving importance to utterances such as questions (
On the other hand, the discussion time-span tree in Fig. 3 is generated by giving high weight values to GPR1b, GPR1b, SPR2a, and SPR2c. Regarding SPR, for instance, giving priority to SPR2a means that the first and last utterances are of high importance, and priority for SPR2c means that importance is given to the presenter’s utterances (the presenter is denoted by “W” in Table 1). The generated summary ranks the utterances that raise a problem or introduce a new topic (

Viewpoint switching of discussion time-span tree.
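The summarization by reduction described above can be sketched as a cut through the tree at a given level. The sketch assumes that a cut keeps the head utterance of every group found at that depth, plus any leaf reached earlier; `Node` and `reduce_tree` are hypothetical names, and the small tree is illustrative rather than the one in Fig. 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    head: int                       # index of the head utterance of this group
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def reduce_tree(root: Node, level: int) -> List[int]:
    """Return the heads of the groups found `level` steps below the root.

    Leaves shallower than `level` are kept, so every branch contributes
    exactly one utterance, and the result stays in chronological order.
    """
    if level == 0 or root.left is None or root.right is None:
        return [root.head]
    return reduce_tree(root.left, level - 1) + reduce_tree(root.right, level - 1)


# Hypothetical tree over four utterances (indices 0-3).
tree = Node(0, Node(0, Node(0), Node(1)), Node(3, Node(2), Node(3)))
print(reduce_tree(tree, 0))  # [0]
print(reduce_tree(tree, 1))  # [0, 3]
print(reduce_tree(tree, 2))  # [0, 1, 2, 3]
```

Cutting one level below the root yields a two-utterance summary, and deeper cuts progressively restore the original thread.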

System diagram.
The system diagram is shown in Fig. 4. It is constructed using flexible information retrieval based on the discussion time-span tree. Exploratory data analysis of discussions requires a trial-and-error process because the information needed by the user is initially undefined, and only becomes clearer during the data analysis process. The problem is solved by repeating this exploratory operation. When generating a discussion time-span tree for a switched viewpoint, the discussion is represented by a hierarchical representation of the important utterances based on different viewpoints and intentions, such as Q&A. We use the mechanism to provide the user with a new viewpoint for information retrieval. In addition, the user can continue exploring by asking new questions based on the answer they obtain. A series of these operations enables the user to satisfy their information needs. The user interacts with the system by issuing commands to the system: specifically, the user selects a function, and the system outputs the results. We provide six system functions (listed in Table 4), all of which are aimed at efficiently grasping the contents on the basis of the structural information of the discussion.
The purpose of the proposed system is to help users efficiently grasp the contents of the minutes by using the mechanism of the discussion time-span tree. To determine its validity, we conducted the following two evaluations. (1) An evaluation of the extraction of important utterances from meeting records. With this evaluation, we investigate whether important utterances can be accurately identified from the meeting records. (2) A usefulness evaluation of the system by user experiment. With this evaluation, we investigate under which of the following two conditions users can grasp meeting contents more efficiently: (x) when using the proposed system or (y) when viewing a Web browser that displays only utterance text information.
Discussion corpus
For the corpus in this experiment, we used a meeting record2
Nagao Laboratory of Nagoya University: Discussion Mining Project, Source
We refer to an utterance that presents a new topic as a start-up utterance and an utterance that concerns the same topic as the previous one as a follow-up utterance. One start-up utterance and at least one follow-up utterance form the unity of each topic. We call this a discussion segment (Fig. 5). The root of the discussion segment is the start-up utterance, and the rest is composed of follow-up utterances. A discussion time-span tree is generated for each discussion segment.

Discussion segment.
In the implementation of important words in GPR2b, SPR3a, and SPR3b, we obtained the part-of-speech information from the text data of utterances by means of morphological analysis using MeCab.3
MeCab
Setting values of weight parameters of grouping preference rules (GPRs) and significance preference rules (SPRs)
In this experiment, we conducted an evaluation on extracting important utterances from meeting records. The proposed method was compared with conventional methods commonly used in the relevant field in order to determine whether extracting important utterances by discussion time-span trees is effective. The following two methods were compared in this experiment.
LEAD-based method. Text that appears near the beginning of a document tends to contain important information, so this method selectively extracts these particular texts [3]. In this experiment, we extract opening lines from discussion segments until a specified summarization criterion is met.
LexRank. This is a classic summarization technique based on the concept of PageRank [17], which utilizes a graphical representation of Erkan et al.’s sentence similarity measure [4]. This method involves (a) calculating the similarity between statements in discussion segments using TF-IDF and creating a similarity graph with the statements as nodes and the relationships between them as edges, (b) creating an adjacency matrix where, when the degree of similarity is above the threshold, this is denoted by 1, and 0 otherwise, and (c) calculating the principal eigenvector of the adjacency matrix of the aforementioned graph, followed by sequentially extracting statements in order of decreasing node importance until a specified summarization criterion is met.
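The two comparative methods can be sketched as follows. This is a simplified illustration rather than the configuration used in the experiment: the threshold of 0.1, the damping factor of 0.85, the self-links in the adjacency matrix, and the use of damped power iteration on the row-normalized matrix are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def lead(n_sentences: int, k: int) -> List[int]:
    """LEAD baseline: indices of the first k statements."""
    return list(range(min(k, n_sentences)))


def tfidf_vectors(sentences: List[List[str]]) -> List[Dict[str, float]]:
    """TF-IDF vector per statement, treating each statement as a document."""
    n = len(sentences)
    df = Counter(term for s in sentences for term in set(s))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(s).items()}
            for s in sentences]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def lexrank(sentences: List[List[str]], threshold: float = 0.1,
            damping: float = 0.85, iterations: int = 50) -> List[float]:
    vecs = tfidf_vectors(sentences)
    n = len(sentences)
    # (a)-(b): binary adjacency matrix from thresholded cosine similarity;
    # every statement is linked to itself, so no row is empty.
    adj = [[1.0 if i == j or cosine(vecs[i], vecs[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    # (c): power iteration approximates the principal eigenvector.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(adj[j][i] / degree[j] * scores[j] for j in range(n))
                  for i in range(n)]
    return scores


def summarize(sentences: List[List[str]], k: int) -> List[int]:
    """Indices of the k highest-scoring statements, in original order."""
    scores = lexrank(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


statements = [["budget", "plan"], ["budget", "schedule"], ["schedule", "risk"]]
scores = lexrank(statements)
print(lead(len(statements), 1))   # [0]
print(summarize(statements, 1))   # [1]
```

In the toy input the middle statement shares a term with each of its neighbors, so it receives the highest score, while LEAD simply returns the opening statement.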
We conducted experiments on 25 items of discussion data recorded by Discussion Mining (total discussion duration: 48 hours, 36 minutes; number of discussion segments: 339; number of utterances: 1661). ROUGE [11], which is the most widely utilized method in automatic summarization evaluation, was used as an evaluation index. The measure is computed by counting the number of overlapping words or N-grams between the system-generated summary to be evaluated and the reference summaries. We adopted ROUGE-2, the most widely used variant, which counts overlapping bigrams. The reference summaries were independently annotated by two authors of the paper who are familiar with the generating method, and the inter-annotator agreement between the two was 68%. The summarization rate was about 50% on average.
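For reference, the bigram overlap behind ROUGE-2 can be sketched as below. The sketch computes recall against a single reference summary with whitespace tokenization; how the two reference summaries were combined and how Japanese text was tokenized in the experiment are not specified here, so both are simplifying assumptions, and `rouge_2_recall` is a hypothetical helper name.

```python
from collections import Counter
from typing import List


def bigrams(tokens: List[str]) -> Counter:
    """Multiset of adjacent token pairs."""
    return Counter(zip(tokens, tokens[1:]))


def rouge_2_recall(candidate: List[str], reference: List[str]) -> float:
    """Share of reference bigrams that also occur in the candidate (clipped)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[bg]) for bg, count in ref.items())
    return overlap / total


reference = "the tree represents the discussion structure".split()
candidate = "the tree shows the discussion structure".split()
score = rouge_2_recall(candidate, reference)
print(score)  # 0.6
```

Three of the five reference bigrams appear in the candidate, giving a recall of 0.6.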
The proposed technique uses a discussion time-span tree corresponding to an output that serves as the baseline determining the value of each weight parameter. Each weight parameter needs to be assigned the optimum value in order to reproduce reference summaries, so we carried out a preliminary experiment to assign such values. In this preliminary experiment, we used 20 meeting and conference minutes, excluding the minutes used for the main experiment, as training data. One of the experimenters spent 2–5 minutes on each data set, adjusting the values with reference to the meeting records and reference summaries so that the system output resembled the summaries. The weight parameters were set to be equal to the median of the values assigned to them during the preliminary experiment (Table 5).
Results and discussion
Table 6 lists the results of this experiment. As shown, the proposed method produced results closer to the reference summaries than the comparative methods, which are standard summarization techniques. This is presumably because the comparative methods do not take into account the structure of a discussion and form a summary based solely on verbal information, resulting in a lower score. The results could be improved by taking into account not only verbal but also non-verbal information. These results demonstrate that our summarization method utilizing the discussion time-span tree is highly effective. We can also assume that the proposed system works properly to some extent. However, there is a high possibility that fluctuation will occur in the analysis, because even a slight change in a weight parameter tends to change the result significantly. For this reason, we conclude that, to increase the accuracy of the summaries, we need a method for assigning and controlling values that takes into account the particular characteristics of each meeting record.
Evaluation results of important utterance extraction by ROUGE-2
Experimental method
Following the results of the previous section, we conducted a comprehension test to evaluate the usefulness of the system. In this test, users answered questions about the contents of the discussion under the following two conditions: (x) when using the proposed system and (y) when viewing a Web browser that displayed only utterance text information.
Questions with four potential answers were prepared in advance, and users would be able to answer them correctly if they had read all the important utterances. The test consisted of eight questions and users were given ten minutes to respond. Participants were ten students in their twenties who were majoring in computer science. For condition (x), a five-minute practice session was held prior to the experiment to let them get a feel for how to use the various system functions (adjusting weight parameters etc.). Each user individually evaluated different meeting records under their respective conditions. The initial values of the weight parameters are as shown in Table 5. Out of the 25 meeting records available online, two were randomly selected 4
Table 7 lists the results of the comprehension test. A t-test at the 5% significance level rejected the null hypothesis of no difference in the number of correct answers between (x) and (y). This result demonstrates the effectiveness of using the proposed system. We also observed the extent to which the users made use of the various system functions during the experiment. Table 8 shows the frequency of occurrence (“Usage transition”) of each pair of consecutively used functions in the time series of usage of the six functions.
Results of comprehension test for each participant
Trends in use of system functions for two consecutive functions
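The usage transition can be understood as the relative frequency of each ordered pair of consecutively used functions. A minimal sketch, assuming each participant's session is logged as a sequence of function numbers and that percentages are taken over all transitions; the logs below are invented for illustration and are not the experimental data.

```python
from collections import Counter
from typing import Dict, List, Tuple


def usage_transitions(logs: List[List[int]]) -> Dict[Tuple[int, int], float]:
    """Percentage of each consecutive function pair over all participants."""
    counts: Counter = Counter()
    for log in logs:
        counts.update(zip(log, log[1:]))   # pairs never span two participants
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pair: 100.0 * c / total for pair, c in counts.items()}


# Hypothetical usage logs of two participants (function numbers 1-6).
logs = [[1, 4, 6, 4, 5], [1, 2, 4, 6, 4]]
transitions = usage_transitions(logs)
print(transitions[(6, 4)])  # 25.0
```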
The findings are as follows.
Participants used the system functions an average of 12 times in total.
Throughout the experiment, participants used Function 4 the most, accounting for 38% of total usage. They also repeatedly changed the weight values of the rules and the reduction level while reading the minutes.
The participants were able to use each function effectively for an exchange consisting of six to 15 utterances. For fewer utterances, they did not make use of each function.
After Function 4, the most frequently used ones were Functions 2 and 3 (11.3% of the total). A continuous process of performing a local search for more detailed information after generating an overview of the content of the minutes was observed.
After participants had adjusted the parameters (Function 6), they mostly used Function 4: 16.3% of the total. Changing the reduction level tended to narrow down the number of utterances.
Four of the participants had some difficulty operating the system when trying to adjust the parameters that controlled the rules. In an interview conducted after the experiment, we found that these participants had difficulty understanding how to adjust the parameters for a certain purpose and how analyzing rules would apply to the result.
Setting value of weight parameter at end of task for each participant
We demonstrated the usefulness of the system from the results of the user experiment detailed in Section 5.3. In this section, in order to examine these results more deeply, we analyze the following two points. (a) Verification of the effectiveness of information retrieval. We clarify the causality between system functions. (b) Analysis of the method of reproducing the viewpoint for information retrieval. We analyze the usage tendency of the weight parameter operation in the user experiment.
Analysis of usage results using structured modeling method
In this section, we examine user utilization to verify the effectiveness of information retrieval using the proposed system. Specifically, we carried out a structural analysis of the user utilization results. Causality between functions in the system utilization is structured by using the interpretive structural modeling (ISM) method [20], which is an application of graph theory. In the ISM method, by replacing the relationship between elements with binary values by pairwise comparison, the relationship between the various items can be visualized with a multi-layer directed graph. The generated model is used as an interpretation and examination tool for objectively solving problems. Due to space constraints, we omit the description of the ISM method’s application procedures. One multi-layer directed graph for the user utilization results of the participants as a whole is obtained, as well as one for each participant. In the multi-layer directed graph shown in Fig. 6, Level 1 is more affected by other factors, and Level 3 expresses a stronger effect on other elements. For the participants’ overall usage results, the ISM method was applied to the top 8 items that satisfy a threshold value of 5.0% with respect to usage transition.
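Although the application procedure is omitted here, the core of a standard ISM analysis can be sketched as follows: the binary relation matrix is closed transitively into a reachability matrix, and elements are then partitioned into levels by comparing reachability and antecedent sets. This is the textbook procedure and not necessarily the exact variant used in this study; the relation matrix below is hypothetical and does not reproduce the experimental data.

```python
from typing import List, Set


def reachability(adj: List[List[int]]) -> List[List[bool]]:
    """Transitive closure of the binary relation, with every element reaching itself."""
    n = len(adj)
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    for k in range(n):              # Warshall's algorithm
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True
    return reach


def ism_levels(adj: List[List[int]]) -> List[Set[int]]:
    """Level partition: Level 1 first (elements most affected by others)."""
    reach = reachability(adj)
    remaining = set(range(len(adj)))
    levels = []
    while remaining:
        level = set()
        for i in remaining:
            reach_set = {j for j in remaining if reach[i][j]}
            antecedent = {j for j in remaining if reach[j][i]}
            # An element belongs to the current level when everything it
            # reaches can also reach it back.
            if reach_set <= antecedent:
                level.add(i)
        levels.append(level)
        remaining -= level
    return levels


# Hypothetical relation among four elements: adj[i][j] = 1 if i affects j.
adj = [
    [0, 1, 0, 0],   # element 0 affects element 1
    [0, 0, 1, 1],   # element 1 affects elements 2 and 3
    [0, 1, 0, 0],   # element 2 affects element 1
    [0, 0, 0, 0],   # element 3 affects nothing
]
levels = ism_levels(adj)
print(levels)  # [{3}, {1, 2}, {0}]
```

In this toy relation, element 3 is only affected by others and lands on Level 1, the mutually dependent elements 1 and 2 share Level 2, and element 0, which affects all the others, lands on Level 3.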
Looking at the results of the overall graph in Fig. 6, we can observe in system usage a tendency to narrow down to the conclusion section after grasping the structure of the entire discussion. This trend is classified into three stages: Levels 1 to 3. First, with respect to Level 3, we use Function 1 to grasp the overall structure of the discussion. Next, with respect to Level 2, there is a process that flexibly utilizes the features of Functions 2, 3, 4, and 6. Here, focusing on Functions 4 and 6, the total of the usage transition of both is 25%; in this process, the system user refines the information request while switching the purpose and viewpoint. Finally, with respect to Level 1, Function 5 is used to grasp the conclusion section.

Multi-layer directed graph of overall user utilization.
In this section, we describe our analysis on the usage trends of the weight parameter operation (
Factor loadings of components 1–3 to weight parameter values
Next, we carried out a principal component analysis of the values of these weight parameters. This is a multivariate analysis that reduces the number of dimensions in data by obtaining the eigenvalues of the covariance (or correlation) matrix of the data, thereby allowing easy knowledge extraction from data distributions. The input data consist of the final values assigned to the weight parameters (0.0–1.0) by the participants at the end of the task. We carried out a principal component analysis of the data set consisting of the values assigned to the weight parameters by all ten participants and computed the contribution and cumulative contribution ratios of each component, the factor loadings for each variable with respect to the components, and the score plots for each principal component by each participant.
According to the analysis, the contribution ratios of each principal component were 28.1% for Component 1, 27.2% for Component 2, and 17.6% for Component 3; the cumulative contribution ratio was 72.8%. This indicates that the first three principal components retain over 70% of the information contained in the 14 weight parameters. Table 10 shows the factor loadings of principal components 1–3 with respect to the values of each weight parameter. The underlined parts indicate moderately relevant parameters (r = 0.31–0.50). It can be seen from Table 10 that GPR1a contributes to principal components 2 and 3, SPR2a to 2 and 3, and SPR2b to 1 and 3, with each contributing to more than one principal component. This demonstrates the effectiveness of these parameters. Finally, the principal components and their relevant weight parameters along with their defining characteristics may be divided as follows: Component 1 with verbal information (such as GPR1c, 2b, SPR3a), Component 2 with temporal information (such as GPR1a, 1d, SPR2a), and Component 3 with social signals information such as those indicating agreement or the strength of social influence (such as GPR1c, SPR2b) (Table 10). This suggests that the participants had a tendency to search for information on the basis of these three perspectives.
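The quantities reported above (contribution ratios, cumulative contribution ratios, and factor loadings) can be computed as in the following sketch. It assumes NumPy is available and that the analysis is based on the correlation matrix of the parameters, which is not stated in the text; the input is random placeholder data with the same shape as the experiment (10 participants, 14 parameters), not the actual parameter values.

```python
import numpy as np


def pca_contributions(data: np.ndarray):
    """Return contribution ratios, cumulative ratios, and factor loadings.

    data has one row per participant and one column per weight parameter.
    """
    corr = np.corrcoef(data, rowvar=False)       # parameter correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    eigvals = np.clip(eigvals, 0.0, None)        # remove tiny negative round-off
    ratios = eigvals / eigvals.sum()
    loadings = eigvecs * np.sqrt(eigvals)        # variable-component correlations
    return ratios, np.cumsum(ratios), loadings


# Placeholder data: 10 participants x 14 weight parameters in [0.0, 1.0].
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=(10, 14))
ratios, cumulative, loadings = pca_contributions(data)
print(np.round(cumulative[:3], 3))   # cumulative contribution of components 1-3
```

Because there are only ten participants, at most nine components can carry non-zero variance, which is one reason a few components can account for most of the information.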
In this paper, we have proposed a summarization system of meeting records based on a hierarchical structure. To create the system, we presented a data model for representing a hierarchical discussion by means of a discussion time-span tree based on verbal and non-verbal information. We then described how to implement the discussion structure and demonstrated the advantages of our method. By using this model, we were able to build an automatic summary of a meeting record corresponding to different viewpoints.
The results of an evaluation of important utterance extraction showed that the ROUGE-2 score of the proposed method was higher than those of conventional document summarization techniques. In a subsequent usefulness evaluation of the system by user experiment, we found that the system users improved their degree of comprehension significantly compared to when not using the system. Generally, the system users were able to reach the same comprehension regarding the flow of the meeting and the important arguments. These results demonstrate that the proposed system enables users to grasp discussion contents efficiently. Moreover, they corroborate the utility of the discussion time-span tree that represents the hierarchical discussion structure. Finally, our analysis of the usage tendency of the weight parameter operation in the user experiment showed that the participants had a tendency to search for information on the basis of three perspectives: verbal information, temporal information, and social signals information.
As future work, we will implement a function of assigning default values for a specific retrieval mode to the rule parameters. We will also carry out large-scale experiments with a greater number of participants and more varied tasks, and classify their usage trends.
Acknowledgement
This research received a grant from the JSPS Grant-in-Aid for Scientific Research (16H01744).
