Abstract
Active learning is an effective approach to reducing the human effort required to generate labels for machine learning. The iterative process of active learning involves a human annotation step, for which crowdsourcing can be leveraged. For organisations adopting active learning, obtaining high model performance is essential. This study aims to identify effective crowdsourcing interaction designs that improve the quality of human annotations and, in turn, the performance of natural language processing (NLP)-based machine learning models. Specifically, the study experimented with four human-centred design techniques: highlight, guidelines, validation and text amount. Based on different combinations of the four techniques, the study developed 15 annotation interfaces and recruited crowd workers to annotate texts with them. The data annotated under each design were used separately to iteratively train a machine learning model. The results show that highlight and guidelines play an essential role in improving the quality of human labels and therefore the performance of active learning models, while the impact of validation and text amount on model performance is positive in some cases and negative in others. The ‘simple’ designs (i.e. D1, D2, D7 and D14), which use only a few design techniques, yield the best-performing models. The results have practical implications for the design of crowdsourcing labelling systems used in active learning.
1. Introduction
Artificial intelligence (AI) and machine learning rely on human-generated ground truth, which is usually labour-intensive and expensive to collect. Active learning is widely believed to be an effective approach to reducing the cost of human labels [1]. Instead of sending an arbitrary set of training instances to human annotators, active learning proactively selects informative instances for humans to label. The process can be summarised as follows: a machine learning model identifies the sample data that are ‘difficult’ to classify; humans review and label those samples; and the manually labelled data are then used to retrain the model, gradually improving its effectiveness. Both algorithms and humans are in the loop to make the learning process effective and efficient. Existing studies have explored improving the effectiveness and efficiency of active learning from the angle of algorithm optimisation (e.g. [2]). While algorithms have been well studied, few studies focus on human-centred designs that help humans cooperate with algorithms, a gap that calls for further research.
In recent years, crowdsourcing services such as Amazon Mechanical Turk (MTurk), Figure Eight (formerly known as CrowdFlower) and Prolific Academic have provided AI researchers with a diverse, on-demand and scalable workforce to complete annotation tasks that establish the ground truth for machine learning models. There is a growing consensus that data collected through crowdsourcing can be just as valid as data collected via expert opinions [3,4]. Crowdsourcing has thus become popular among machine learning researchers as a low-cost way to collect human labels. Meanwhile, great importance is attached to the performance of machine learning models, which depends on annotation quality. Crowdsourcing and active learning share a purpose: reducing human annotation costs without sacrificing machine learning quality. This inspires the present study to combine crowdsourcing with the active learning process and to find crowdsourcing annotation designs that improve the performance of active learning models. There is a natural trade-off between cost and quality. This study investigates whether the quality of human annotation can be improved through human-centred design at the same annotation cost. The study therefore aims to answer two questions: RQ1: How do we use different techniques to design an effective crowdsourcing tool for crowd workers to provide high-quality annotations? RQ2: How do different crowdsourcing designs affect the performance of active learning?
The motivation for these research questions is detailed in Section 3.
Specifically, this study experimented with different crowdsourcing interaction designs for a text annotation tool used during the human annotation step of active natural language processing (NLP)-based machine learning. The designs resulted from 15 combinations of four techniques for crowdsourcing annotation, namely highlight, validation, guidelines and text amount (detailed in Sections 4.1 and 4.2). The designs are the interfaces of the text annotation tool used in the crowdsourcing experiment, and the techniques are the visual and strategic elements from which those interfaces are built. Crowd workers were recruited to complete the annotation tasks via the 15 designs and thereby provide the labels used in active NLP-based machine learning. Active learning performance was then compared across the interaction designs. In this way, we examined the individual and interactive impact of the annotation techniques to identify the effective design(s). The comparison also took the annotation cost into consideration. The cost was measured as the total annotation time, determined by the task completion time and the number of crowd workers.
The study is structured as follows. We first review existing literature related to interactive machine learning, active learning and crowdsourcing. In Section 3, we explain our motivation for studying the effect of crowdsourcing interaction designs on the quality of human annotation, and therefore on active learning model performance, and propose specific research questions. In Section 4, we describe a crowdsourcing experiment involving 15 designs based on combinations of four interactive annotation techniques: we recruit crowd workers to label texts via the interfaces of the 15 designs and train active NLP-based machine learning models on the resulting labels. In Section 5, we compare the 15 designs in terms of the change in model performance and the model performance at the same cost. Finally, in Section 6, we answer the research questions according to the analysis results and discuss the effect of different crowdsourcing designs and annotation techniques on active learning model performance, together with the implications.
The main contribution of this study is summarised as follows: (1) this study uses human-centred approaches to improve the quality of human labels and therefore the performance of active learning in NLP; (2) this study focuses on the human annotation aspect of active learning that theoretically supplements a different perspective of active learning optimisation and (3) this study contributes effective interaction designs that have practical implications for crowdsourcing approaches in general.
2. Related work
This research connects three lines of research in machine learning: interactive machine learning, active learning and crowdsourcing. We will review each of them in the following subsections.
2.1. Interactive machine learning
Interactive machine learning brings human-centred approaches into the training of mathematical models. An early study first summarised the interactive machine learning model and distinguished it from the classical machine learning model [5]. In the classical machine learning process, the end-user selects features, and the training is then performed ‘off-line’. The algorithm training is independent of the user and need not be interactive. By contrast, in the interactive machine learning model the end-user is encouraged to correct and teach the classifier. Specifically, interactive machine learning involves a cycle: the training provides the user with feedback, the user then creates training data that help correct errors, and the algorithm performs the training again. Interactions between humans and machine learning systems have been empirically confirmed to improve machine learning performance [6]. The end-user does not need to be an expert in machine learning: an investigation revealed the potential of non-experts to build learning algorithms [7]. Active learning is one of the use cases of interactive machine learning [8], where the user labels the important instances provided by the learning model.
As the user’s teaching role in training models has been highlighted, machine teaching, a new paradigm for interactive machine learning, is emerging. While algorithms are regarded as ‘learners’, the human is regarded as the ‘teacher’. Machine teaching focuses on the teacher and the teacher’s interaction with data, and seeks to improve the teacher’s productivity through techniques and interaction design [9]. Machine teaching is closely related to the user interaction side of active learning, although in active learning the active learner explores the optimal parameters by itself rather than being guided by the teacher [10]. Reviewing studies in interactive machine learning and machine teaching has contextualised this research. In addition, the user interface is the bridge between humans and interactive machine learning systems [11], which inspires this research to explore interaction designs for active learning.
2.2. Active learning
Active learning is a branch of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points [12]. It is a typical situation that unlabelled data are abundant but manual labelling is expensive. In active learning, learning algorithms can actively and interactively query the user for labels. Since the learning algorithm chooses the data samples to label, the number of data samples to train a model is often lower than the number required in normal supervised learning, saving the cost and effort of human labelling.
There are various approaches through which a learning algorithm queries a user, including membership query synthesis [13], stream-based selective sampling [14] and pool-based sampling [15]. Pool-based sampling, common in practice, involves these three steps: (1) a previously trained model selects the informative instance from a pool of unlabelled instances, (2) the informative instance is labelled by an expert user and added to the training set and (3) the model is trained again based on the new training set. How to select informative instances has been a key question in this process. Uncertainty sampling [12] is a popular strategy, which selects the instances closest to the prediction boundary. Another popular strategy is query-by-committee [16], which has a committee of models to vote on the labels for instances. The instances with the highest disagreement among committee members get selected. The two strategies were compared in a smartwatch activity recognition task [17]. It was found that the uncertainty sampling strategy performed better than the query-by-committee approach. In addition to those two strategies, expected model change, expected error reduction, variance reduction and density-weighted methods are also used in selecting informative instances. Pereira-Santos et al. [18] presented a thorough review of different informative data instances selection strategies. These studies have provided this research the background knowledge about active learning, and more importantly, the strategy to select informative data instances for crowd workers.
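The two selection strategies described above can be illustrated with a minimal sketch. It assumes a binary task in which each pool instance already has a predicted positive-class probability; the function names and the toy pool are hypothetical and are not taken from the cited studies.

```python
from typing import Dict, List


def select_uncertain(probabilities: Dict[str, float], k: int) -> List[str]:
    """Uncertainty sampling: return the k instance IDs whose predicted
    positive-class probability is closest to the 0.5 decision boundary."""
    ranked = sorted(probabilities, key=lambda i: (abs(probabilities[i] - 0.5), i))
    return ranked[:k]


def committee_disagreement(votes: List[int]) -> float:
    """Query-by-committee: share of committee members (binary votes, 0 or 1)
    that disagree with the majority label. 0.0 means full agreement."""
    positives = sum(votes)
    return min(positives, len(votes) - positives) / len(votes)


# Hypothetical pool of five instances with model probabilities.
pool = {"a": 0.95, "b": 0.52, "c": 0.10, "d": 0.45, "e": 0.70}
print(select_uncertain(pool, 2))             # ['b', 'd']
print(committee_disagreement([1, 1, 0, 0]))  # 0.5
```

Under query-by-committee, the instances with the highest disagreement value would be sent to annotators first.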
Annotation cost is an essential topic in active learning research because the purpose of active learning is to achieve the best possible model performance for a given annotation cost. A common assumption is that every instance costs the same to label. However, the real annotation cost varies between instances, especially in the field of NLP [19]. Much effort has gone into reducing annotation costs by optimising active learning algorithms. For example, Dong et al. [2] proposed a cost-sensitive active learning algorithm that combines the uncertainty and diversity sampling strategies to select training samples from an unlabelled sample pool. However, active learning involves not only algorithms but also humans. Unlike previous work that controls the annotation cost through algorithmic designs, this study tackles the cost problem through interaction designs that yield efficient and high-quality human annotations.
2.3. Crowdsourcing and active learning
The rapid development of crowdsourcing research and platforms offers new ideas and challenges to conventional methods for machine learning [20]. Crowdsourcing approaches can harness human intelligence on a large scale and in near-real time from a vast and diverse population [21]. Human annotation is an essential component of active learning, which opens an avenue for incorporating crowdsourcing. Gilyazev and Turdakov [22] reviewed methods of crowdsourced labelling, as well as approaches to multi-annotator active learning, and discussed the benefits of combining active learning with crowdsourcing. Recently, such a combination has become more common in the NLP field. For example, Hong et al. [23] integrated active learning and crowdsourcing mechanisms to improve the accuracy and efficiency of knowledge extraction from Tang poetry. They found that the integration outperformed the state-of-the-art methods and validated human-in-the-loop knowledge extraction. Mairittha et al. [24] developed a mobile activity recognition system named CrowdAct, which exploits crowdsourced data labelling and active learning, and showed experimentally that it improved activity recognition accuracy. Given the popularity of crowdsourced data in the active learning process, Sayin et al. [25] studied the cost-effectiveness of existing active learning approaches over crowdsourced data and identified factors that affect the performance of those approaches. These studies have informed this research on the effectiveness of combining active learning with crowdsourcing. Therefore, we selected such a combination as an ‘environment’ in which to investigate the impacts of different interaction designs.
Interaction design differs depending on whether the crowd annotation tasks involve texts, images or videos, but all designs share a common motivation: helping crowd workers to provide high-quality labels. Gamification has been popular in recent years. Dergousoff and Mandryk [26] developed a data collection tool combining gamification and crowdsourcing. Madge et al. [27] demonstrated a text labelling game called WordClicker to argue that games help gather more high-quality annotations. Beyond gamification, other studies have demonstrated the usefulness of guidelines [28] and of validation mechanisms [29] in obtaining high-quality human labels. Together, those studies provide ideas for the design techniques (i.e. highlight, validation, guidelines and text amount) used in this research. They are reviewed in more detail in Section 4, where we introduce the designs of our crowdsourcing experiments.
3. Research questions
Machine learning relies on human labels. The related work indicates that crowdsourcing and active learning are both effective approaches to saving annotation costs, and that their combination provides a cost-saving environment. The open question is whether, within such an environment, there is a way to efficiently collect high-quality labels and therefore build machine learning models with decent performance. To explore this problem, this study investigates the impact of different crowdsourcing interaction designs on the quality of labels and therefore on the performance of active NLP-based machine learning models.
The active NLP-based machine learning process was conducted in the context of Chinese web fiction because the text collection was convenient to obtain: the co-authors offered a collection of Chinese web fiction to be used as labelling documents. The popularity of web fiction has been increasing in China in recent years. Based on data from 83 web fiction companies, the Chinese Web Fiction Development Report of 2018 [30] states that by June 2018 the number of web fiction readers had exceeded 400 million, more than half of Chinese Internet users. With the popularity of web fiction, the emergence and spread of pornographic web fiction have become an issue for the online environment, especially for children and young people. Various AI models have been used to detect and remove pornography. However, recognising pornography is a very difficult task for AI models because of its complexity, ambiguity and subjectiveness. Part of the reason is the lack of abundant human intelligence. Considering this issue in the Chinese web fiction industry, labelling pornography in web fiction was selected as the crowdsourcing task of this research.
Detecting sensitive content, such as adult content and pornography, is not new to the field of text classification [31]. Machine learning methods are widely applied to classify adult and pornographic content not only in text but also in images [32,33]. In addition, an active learning approach has been shown to build a sensitive content classifier with decent accuracy [34]. These previous studies indicate that the target concept of pornography is learnable by algorithms. On this premise, the current research aims to leverage human-centred approaches, as well as the strengths of active learning and crowdsourcing, to offer high-quality collective intelligence in a scalable and efficient way and thereby address the challenge of pornography detection. Specifically, we put forward two research questions:
RQ1: How do we use different techniques to design an effective crowdsourcing tool for crowd workers to provide high-quality annotations?
RQ2: How do different crowdsourcing designs affect the performance of active learning?
4. Crowdsourcing experiment
4.1. Techniques for crowdsourcing annotation
To collect human labels, an interactive annotation tool is necessary as referred to in RQ1. The design of the interactive annotation tool took account of techniques in both interaction and annotation aspects. By reviewing existing work and brainstorming among the co-authors, four design techniques were considered in this study: highlight, guidelines, validation and text amount. Highlight, guidelines and text amount are deemed to be interaction techniques that are relevant to the interface, while validation is a strategy for labelling. We will talk about each of them in the following paragraphs.
Wilson et al. [35] found that highlighting relevant paragraphs in the privacy policy could reduce crowdsourcing task completion time. Therefore, highlight is proposed in this study as a design technique. In this study, the highlighted words were determined by the pornography probability that was predicted by the support vector machine (SVM) model. Section 4.4 describes the complete training procedure of active learning. Each round of learning trained an SVM model to predict the probability of paragraphs including pornography. The same model was used to predict the probability of words that form the given paragraph. Words with a high probability of being pornography (over 60%) were highlighted. While highlight helps draw attention to the most relevant information, it might mislead people to miss some other relevant information that the interface fails to highlight. Therefore, we will design interfaces with and without such a highlight feature to see the effect on the quality of human annotations.
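The highlighting rule can be sketched as follows. This is an illustration under stated assumptions rather than the annotation system's actual code: the text is assumed to be already segmented into words, the per-word probabilities are assumed to come from the trained classifier, and the `[[...]]` markers and function name are hypothetical stand-ins for the interface's visual highlight. Only the 60% threshold is taken from the description above.

```python
from typing import Dict, List

HIGHLIGHT_THRESHOLD = 0.6  # "over 60%" in the text


def highlight(words: List[str], word_prob: Dict[str, float],
              threshold: float = HIGHLIGHT_THRESHOLD) -> str:
    """Wrap every word whose predicted probability exceeds the threshold
    in [[...]] markers; words without a score are left unmarked."""
    marked = []
    for word in words:
        if word_prob.get(word, 0.0) > threshold:
            marked.append(f"[[{word}]]")
        else:
            marked.append(word)
    return " ".join(marked)


# Placeholder words and hypothetical probabilities.
probs = {"lorem": 0.20, "ipsum": 0.75, "dolor": 0.60}
print(highlight(["lorem", "ipsum", "dolor"], probs))  # lorem [[ipsum]] dolor
```

A word scored at exactly 0.60 is not highlighted here because the rule is read as strictly "over 60%".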
In a crowdsourcing task on annotating drug-use-related tweets, Alvaro et al. [28] provided annotators with guidelines on what kind of tweets could be counted as drug-use-related. They found the guidelines were effective in obtaining quality labels. Guidelines are believed to be more important for crowd workers than for expert annotators, because they help establish a common understanding of the task and consistent criteria for labelling. On the other hand, guidelines can pose a risk to annotation quality when annotating subjective concepts. One study found that annotators’ insensitivity to language could lead to poor performance in hate speech detection models [36]. Similarly, differences in annotators’ understanding of guidelines might cause poor-quality annotations. Therefore, the effect of guidelines is investigated in this study. Two versions of guidelines on the definition of pornography were provided. One was adapted from the definition of pornography in the Criminal Law of the People’s Republic of China [37] and the other from the interpretation in a highly cited academic paper on pornography [38]. It is acknowledged that the concept of pornography is subjective and culturally biased. Since the annotation documents were Chinese web fiction and the recruited annotators were Chinese, we chose guidelines that reflect the values regarding pornography in China.
Validation was also selected as a design technique in this study. Validation refers to the process that multiple annotators need to reach an agreement to establish a label for a data instance. Validation is a common practice for the crowdsourcing approach. For example, TagCaptcha, an image annotation system developed by Morrison, Marchand-Maillet and Bruno [39], specified that an image must be assigned to the same label by multiple annotators before it could be counted as a case with a known label. Chamberlain et al. [29] reported that adding the validation step through multi-annotator agreement reduced the noises of annotations. On the other hand, validation means having multiple crowd workers work on the same task, which will increase the annotation cost. This study will examine whether the validation mechanism justifies its added cost and contributes to the improvement of the human annotation quality.
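A minimal sketch of two-annotator validation is given below. It assumes binary labels and assumes that texts on which the two annotators disagree are discarded; the handling of disagreements is an assumption made for illustration, and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def validate(label_a: int, label_b: int) -> Optional[int]:
    """Return the agreed label, or None when the two annotators disagree."""
    return label_a if label_a == label_b else None


def validated_labels(pairs: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Keep only the texts on which both annotators agree (assumed policy)."""
    result = {}
    for text_id, (label_a, label_b) in pairs.items():
        label = validate(label_a, label_b)
        if label is not None:
            result[text_id] = label
    return result


# Hypothetical labels from two crowd workers for three texts.
pairs = {"t1": (1, 1), "t2": (1, 0), "t3": (0, 0)}
print(validated_labels(pairs))  # {'t1': 1, 't3': 0}
```

The sketch makes the cost trade-off visible: six individual judgements produce at most three validated labels, and fewer when annotators disagree.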
We believe that text amount or length plays a role in a text labelling task. For pornography detection, short text may lack important context information. For example, the description of a nude person might be considered pornography, but the description of a nude person in a painting may not. However, long text may contain much unnecessary information, and therefore takes more unnecessary effort from a crowd worker. This study is interested in finding an appropriate text amount that balances the quality of information and the effort. Therefore, text amount was selected as a design technique. The variations regarding text amount include a single sentence (s), a sentence and its two adjacent sentences (ss) and a paragraph (p).
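The three text amount variations can be sketched as below. The sketch assumes that a paragraph has already been split into sentences and that "two adjacent sentences" means the sentence immediately before and the one immediately after the target; both the function name and the space-joined output are illustrative choices (Chinese text would normally be joined without spaces).

```python
from typing import Dict, List


def text_variants(sentences: List[str], index: int) -> Dict[str, str]:
    """Build the s / ss / p variants for the sentence at `index`.

    s  : the target sentence alone
    ss : the target sentence plus its neighbours (fewer at paragraph edges)
    p  : the whole paragraph
    """
    start = max(index - 1, 0)
    end = min(index + 2, len(sentences))
    return {
        "s": sentences[index],
        "ss": " ".join(sentences[start:end]),
        "p": " ".join(sentences),
    }


paragraph = ["A.", "B.", "C.", "D."]
print(text_variants(paragraph, 1))
# {'s': 'B.', 'ss': 'A. B. C.', 'p': 'A. B. C. D.'}
```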
4.2. Design of crowdsourcing annotation interactions
An annotation system developed by our research team was used in the experiment. Crowd workers needed to sign up for an account with the system. On the registration page, they filled in a questionnaire about their demographic information, such as age, gender, education and career. In addition, they reported their experience with crowdsourcing annotation and with reading web fiction. Given the sexually explicit nature of the pornography detection task and the social norms around explicit content, before the experiment we informed the crowd workers that they might encounter sexually explicit content and suggested that those who were likely to be uncomfortable with this should quit.
Based on different combinations of the four techniques, there were 15 annotation interfaces. Table 1 shows the details of the 15 interface designs, as well as the number of crowd workers and tasks assigned to each design. The designs involving the validation technique (D3, D6, D8, D10, D11, D13, D14 and D15) were assigned twice as many crowd workers as the remaining designs because the validation process required agreement between two crowd workers. In addition, the designs involving the text amount technique (D4, D7, D9, D10, D12, D13, D14 and D15) were assigned 180 tasks per crowd worker, as opposed to 120 tasks per crowd worker for the remaining designs. Of the 180 tasks, 60 were a single sentence (s), 60 were a sentence and its two adjacent sentences (ss) and the remaining 60 were a paragraph (p). Generally, Chinese native speakers are able to read 300–500 words per minute. Therefore, we assumed that one crowd worker would spend about an hour on all the tasks assigned to him or her (either 120 or 180 tasks). A pilot test confirmed this assumption.
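The count of 15 designs follows from taking every non-empty subset of the four techniques (2^4 − 1 = 15). The short sketch below enumerates them by size. With the techniques listed in the order highlight, guidelines, validation, text amount, the positions of the subsets containing validation and text amount coincide with the design IDs listed above, although Table 1 remains the authoritative mapping; the variable and function names are illustrative.

```python
from itertools import combinations

TECHNIQUES = ("highlight", "guidelines", "validation", "text amount")

# All non-empty subsets, ordered by size and then by the tuple order above.
designs = [combo
           for size in range(1, len(TECHNIQUES) + 1)
           for combo in combinations(TECHNIQUES, size)]


def ids_with(technique):
    """1-based positions of the designs that include the given technique."""
    return [i for i, combo in enumerate(designs, start=1) if technique in combo]


print(len(designs))             # 15
print(ids_with("validation"))   # [3, 6, 8, 10, 11, 13, 14, 15]
print(ids_with("text amount"))  # [4, 7, 9, 10, 12, 13, 14, 15]
```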
Table 1. Fifteen interaction designs.
The interface of D15 is presented in Figure 1 as an example. We used Chinese text in the experiment; English is shown in Figure 1 for demonstration purposes. The text to be annotated is presented in the middle with highlighted words. In Figure 1, we replaced the original text with placeholder Lorem Ipsum content to avoid exposing readers to sexually explicit material. Below the text, there are two label buttons for the crowd workers to vote on whether the piece of text is pornography. If annotators think the text is pornography, they need to mark up the content that makes them think so. In the upper left corner, the number of completed tasks and the total number of tasks are presented to let the crowd workers know their progress. Guidelines on what counts as pornography are presented on the right panel. The system also tracks the username, the assigned design ID, the text IDs, the labels, the user’s marked-up content and the time spent on each task.

Figure 1. The annotation interface of D15.
4.3. Setup of crowdsourcing experiments
The original data set, provided by the co-authors, contains 100,000 paragraphs of web fiction from a Chinese commercial website, together with ground truth labels on pornography. An initial model was trained on a subset of the data set to predict each paragraph’s probability of being pornography. The goal was to select uncertain paragraphs (informative instances) for human annotation. Details of the training procedure are given in Section 4.4, while this section explains the process of crowdsourcing annotation.
Due to the iterative nature of the active learning approach, the annotation was conducted in three rounds. After each round, the annotated uncertain cases were added to the training data set to train the model again. The updated model was applied to the pool of candidates to select the uncertain cases for the second round of human annotations. The process was repeated for the third round of human annotations. In total, the model was trained four times, including the initial training, and human annotations were conducted in three rounds. The three rounds of annotation experiments collected three sets of labels for uncertain cases. Each set had 15 subsets corresponding to the 15 designs. The size of each set and subset are presented in Table 2.
Table 2. Annotations collected from the 15 designs from the three rounds.
The number of annotators in each annotation round was 230 (the sum of the column of no. of annotators in Table 1). Annotators were recruited from the Alibaba Crowdsourcing platform. They were asked questions about their demographics when signing up for an account with the annotating system developed by us. Before starting off to label text, each crowd worker watched an instruction video about how to use the annotation system. Each of them was randomly assigned to one of the 15 designs and was required to label either 120 or 180 tasks. They were paid 0.3 Chinese Yuan for each completed task.
4.4. Setup of active learning iterations
The nature of the pornography detecting task is a text classification problem. SVM, a commonly used classifier for text classification, is a supervised machine learning algorithm that learns a linear separating hyper-plane to divide the data into two classes within a vector space. SVM showed its efficiency in identifying pornographic web content [40] and was therefore used in this study. Paragraphs were represented as vectors using the term frequency–inverse document frequency (TF-IDF) technique.
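For reference, one common formulation of the two components is given below. The exact TF-IDF weighting variant and SVM settings used in the study are not specified, so the formulas are an assumed textbook form: tf(t, d) is the frequency of term t in paragraph d, N is the number of paragraphs, df(t) is the number of paragraphs containing t, and w and b are the learned hyper-plane parameters.

```latex
\[
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)},
\qquad
f(\mathbf{x}) = \mathbf{w}^{\top}\mathbf{x} + b,
\qquad
\hat{y} = \operatorname{sign}\bigl(f(\mathbf{x})\bigr)
\]
```

Here x is the TF-IDF vector of a paragraph, and the sign of f(x) gives the predicted class; paragraphs with f(x) near zero lie close to the separating hyper-plane.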
This study adopted the uncertainty sampling strategy based on pool-based sampling [41]. The training procedure of active learning was iterative and the learning was supposed to stop when the model stabilised [42]. The training procedure is shown in Figure 2.

Figure 2. The training procedure.
As mentioned, the original data set consists of 100,000 paragraphs of web fiction and corresponding labels. Before the first round of annotation, an initial SVM model training was conducted on 5000 randomly selected paragraphs from the data set. Another 5000 randomly selected samples were set as the testing data and were fixed for all the rounds of testing. The remaining 90,000 paragraphs were used as a selection pool to select uncertain cases for human annotations.
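The partition described above can be sketched as a single shuffle followed by three cuts. The seed and the function name are hypothetical; only the sizes (5000 initial training, 5000 fixed testing, 90,000 pool) come from the text.

```python
import random
from typing import List, Tuple


def split_dataset(n_total: int, n_train: int, n_test: int,
                  seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle paragraph indices once, then cut them into the initial
    training set, the fixed testing set and the selection pool."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    pool = indices[n_train + n_test:]
    return train, test, pool


train_idx, test_idx, pool_idx = split_dataset(100_000, 5_000, 5_000)
print(len(train_idx), len(test_idx), len(pool_idx))  # 5000 5000 90000
```

Because the three parts are slices of one shuffled list, they are disjoint by construction, which keeps the testing set untouched by later annotation rounds.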
The F1-score of the initial model was 0.44. We applied this initial model to the pool of 90,000 samples to select the most uncertain cases (those with a probability close to 0.5) for the first round of human annotation. The first round of crowdsourcing experiments was then conducted, resulting in 15 subsets of labelled cases. Each of the 15 subsets was added to the original training samples to train the model again. The fixed testing set of 5000 samples was used to test the performance of the 15 models, and the best one was applied to the remainder of the selection pool to select uncertain cases for the second round of human annotation. The same process was repeated for the third round of human annotation. It is worth noting that the model training was cumulative, meaning that the training set of a round consisted of the newly labelled data as well as the labelled data from all previous rounds.
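The iteration described above can be summarised in a schematic loop. This is a sketch under stated assumptions, not the study's implementation: training, scoring, annotation and evaluation are passed in as hypothetical callables, and every design is assumed to annotate the same selected batch in each round, which simplifies how tasks were actually distributed among crowd workers.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, int]  # (text, label)


def active_learning(
    seed_train: List[Example],
    pool: List[str],
    designs: List[str],
    rounds: int,
    batch_size: int,
    train: Callable[[List[Example]], object],
    score_pool: Callable[[object, List[str]], List[float]],
    annotate: Callable[[str, List[str]], List[Example]],
    evaluate: Callable[[object], float],
) -> Dict[str, List[float]]:
    """Pool-based uncertainty sampling with one cumulative training set per
    design; returns the per-round evaluation score of each design."""
    train_sets = {d: list(seed_train) for d in designs}
    history: Dict[str, List[float]] = {d: [] for d in designs}
    selector = train(seed_train)  # the initial model selects the first batch
    remaining = list(pool)
    for _ in range(rounds):
        probs = score_pool(selector, remaining)
        order = sorted(range(len(remaining)), key=lambda i: abs(probs[i] - 0.5))
        chosen = set(order[:batch_size])  # most uncertain instances
        batch = [remaining[i] for i in sorted(chosen)]
        remaining = [t for i, t in enumerate(remaining) if i not in chosen]
        best_score, best_model = float("-inf"), selector
        for d in designs:
            train_sets[d].extend(annotate(d, batch))  # cumulative across rounds
            model = train(train_sets[d])
            score = evaluate(model)  # e.g. F1 on the fixed testing set
            history[d].append(score)
            if score > best_score:
                best_score, best_model = score, model
        selector = best_model  # the best model selects the next batch
    return history
```

With `rounds=3`, the loop trains one initial model and then one model per design per round, matching the description that the model was trained four times in total for each design.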
5. Results
Six hundred and ninety crowd workers were randomly recruited in this study via the Alibaba Crowdsourcing platform, 230 in each of the three rounds. Table 3 presents their demographic information. Gender distribution in each round was roughly balanced. The majority of the crowd workers were between 18 and 40 (94.8%). The largest group held a bachelor’s degree (48.7%). Most were students (42.2%) or company staff (25.1%). Most of them (65.1%) had no previous crowdsourcing annotation experience. Over 90% of them were web fiction readers.
Table 3. Demographics of crowd workers.
5.1. Change of model performance over three rounds of crowdsourcing annotations
F1-scores of the 15 models for each of the three rounds are presented in Table 4. The 15 models were learned based on the annotations from the 15 designs, while the baseline model results were obtained based on the annotations from D3 but without applying the validation process, equivalent to a ‘vanilla’ interface without any design technique. As the result, most designs (except D9 and D15) have a better performance than the baseline design after the third round of annotations. In addition, the F1-scores of 11 designs (except D9, D11, D14 and D15) increase after the third round of annotations. In contrast, the F1-scores of D9, D11, D14 and D15 are more stabilised over the three rounds. It is worth noting that the F1-scores of D7 keeps increasing after each of the three rounds of human annotations and gets the highest in the last round, indicating the effectiveness of highlight and text amount techniques.
F1-scores of the 15 designs for the three rounds of human annotations.
For the final F1-score after the third round of annotations, it is shown that D7 achieves the highest while D15 is the lowest. Meanwhile, the top 5 designs are D7, D4, D2, D1 and D10, all of which involve one or two design techniques. This suggests that more than two design techniques being included in the design may lead to the deterioration of model performance. D4 is the top among the designs with one design technique (D1–D4) and D7 is the top among the designs with two design techniques (D5–D10). If we look at only the designs with three design techniques (D11–D14), D12 is the top one. D4, D7 and D12 share a common design technique: text amount, suggesting that text amount plays an important role in obtaining quality labels. It is worth noting that adding highlight on top of D4 is equivalent to D7, and adding guidelines on top of D7 is equivalent to D12. Surprisingly, the F1-score of D15, with every design technique included, is lower than the baseline result. Compared with D12, D15 has one more design technique of validation. In addition, the F1-score of D3 is the lowest among the designs with one technique (D1–D4). This means the validation technique has negatively impacted the model performance. The explanation may be the design with the validation technique has collected a fewer number of labels compared with other designs. The improved confidence of labels does not compensate for the less number of labels.
Comparing the change in model performance over the three rounds of human annotations, the F1-scores of D10, D12, D14 and D15 drop after the second round of annotations, although the decline is small. For the remaining designs, the F1-scores increase monotonically over the three rounds. The increases for D7, D9 and D11 after the third round are smaller than those after the second round. This slowdown in performance gains suggests that these models are likely to stabilise soon. It is worth noting that D7 (highlight + text amount) achieves the highest performance after the third round of annotation, while D8 (guidelines + validation) achieves the largest growth over the three rounds. This suggests that different design techniques play different roles in improving model performance.
5.2. Model performance with the same cost
In the previous analysis, the cost of each round of crowd annotations differed across the 15 designs because of the different task completion times and the different numbers of crowd workers. It is practically useful to compare the model performance of the different designs under the same annotation cost. The top performers would be effective and efficient designs for crowdsourcing and active learning. Annotation cost was defined as the total annotation time, which is determined by the task completion time and the number of crowd workers. The average task completion time of the 15 designs in the third round is listed in Table 5. To fix the cost for each of the 15 designs, we needed to select (sample) more crowd workers for the designs with shorter task completion times, and fewer crowd workers for the designs that took longer to finish. To test the statistical significance of the differences between designs under the same cost, we drew multiple (five) sets of samples for each design. The sampling size and number of sample sets for each of the 15 designs are presented in Table 5.
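The fixed-cost sampling described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the procedure actually used: it takes the cost to be the average task completion time multiplied by the number of workers, and the function names, the budget value and the random seed are hypothetical.

```python
import random


def workers_for_budget(budget_seconds, mean_task_seconds):
    """Return how many workers fit a fixed total annotation time.

    Assumes cost = mean task completion time x number of workers, so faster
    designs get more workers and slower designs get fewer.
    """
    if mean_task_seconds <= 0:
        raise ValueError("mean_task_seconds must be positive")
    return int(budget_seconds // mean_task_seconds)


def draw_sample_sets(worker_ids, sample_size, n_sets=5, seed=0):
    """Draw n_sets random subsets of workers, each without replacement."""
    if sample_size > len(worker_ids):
        raise ValueError("sample_size exceeds the number of available workers")
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    return [rng.sample(worker_ids, sample_size) for _ in range(n_sets)]


# Hypothetical example: a 600-second budget for two designs of different speed.
fast_design_workers = workers_for_budget(600, 60)  # 10 workers
slow_design_workers = workers_for_budget(600, 90)  # 6 workers
sample_sets = draw_sample_sets(list(range(20)), slow_design_workers)
```

Each sample set would then supply the annotations for one training run, giving five F1-scores per design for the significance test.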
Training set size for each design given the same annotation cost.
Analysis of variance (ANOVA) was conducted to compare the F1-scores of the different interaction designs under a fixed cost. We used SPSS Statistics 23 to conduct the ANOVA. The F1-scores of the 15 models are presented in Table 6. The result of Welch’s ANOVA shows a significant difference at the 0.05 level among the 15 designs (F = 2.286, p = 0.014). Dunn’s t-test was conducted to make the post hoc pairwise comparisons. Table 6 shows the p-values of the comparisons with a significant difference at the 0.05 level. Overall, D2 and D14 perform better than most of the other designs (six designs). In contrast, D9 under-performs most of the other designs (nine designs). D1 (0.491), D2 (0.488), D7 (0.486) and D14 (0.486) are the top four performers, while D9 (0.460), D13 (0.469) and D15 (0.469) are the bottom three.
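For readers who want to see how Welch’s ANOVA differs from the classical test, the statistic can be computed by hand. The sketch below is an independent illustration using only the Python standard library; it is not the SPSS procedure used in the study, the function name is hypothetical, and the p-value is omitted because it requires the F-distribution, which the standard library does not provide.

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    Each group is weighted by n / s^2, so groups with unequal variances are
    handled without pooling. Every group needs at least two observations and
    a non-zero sample variance.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("at least two groups are required")
    sizes = [len(g) for g in groups]
    means = [mean(g) for g in groups]
    weights = [n / variance(g) for n, g in zip(sizes, groups)]
    total_weight = sum(weights)
    # Variance-weighted grand mean.
    grand_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - grand_mean) ** 2 for w, m in zip(weights, means)) / (k - 1)
    lam = sum((1 - w / total_weight) ** 2 / (n - 1) for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Toy data (not from the study): with two groups, F equals the squared Welch t.
f_stat, df1, df2 = welch_anova([[1, 2, 3], [2, 3, 4]])
```

In the setting above, each group would hold the F1-scores obtained from the sample sets of one design.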
Average F1-score comparison for the 15 designs.
*p < 0.05; **p < 0.01.
6. Discussion and implications
6.1. Effect of different crowdsourcing design techniques
To answer RQ1, this study designed a crowdsourcing annotation tool with combinations of four different techniques: highlight, guidelines, validation and text amount. These techniques were used to collect labels of web fiction, which provided the input for the active learning models. The model performance resulting from the combinations of these techniques was analysed to identify the effect of the different techniques on the design of an effective crowdsourcing tool for obtaining high-quality annotations.
D7 (highlight + text amount) keeps improving its performance after each of the three rounds of human annotations. It achieves the highest F1-score in the last round. Under a fixed cost, the model of D1 (highlight) is the top performer, and the model of D7 also achieves a top-tier performance. The success of D1 and D7 suggests the value of the technique of highlight. The technique of highlight is to emphasise the content that is important for annotators to make a decision. For a text annotation task, the annotator gains an understanding of the content by reading. Highlighting key words (e.g. sensitive words in this experiment) may be helpful for people to focus on important content and make quick and accurate decisions. This finding suggests the potential impact of highlight on annotation time and accuracy, which is also found by Ramírez et al. [43]. Their study indicates that bad highlight can hurt accuracy, and good highlight can decrease decision time. In this study, sensitive words with a high possibility of being pornography were highlighted, which can be considered as good highlight.
D2 (guidelines) performs well both with and without a fixed cost, suggesting the usefulness of guidelines, whose purpose is to build a common understanding of the concept and a consistent labelling criterion across multiple crowd workers. The target concept in this study is subjective, and every annotator may have their own understanding of it. In fact, the guidelines are not intended to make annotators agree with the opinion they express. Guidelines are used to educate annotators on which kind of content is expected to be labelled; in other words, they tell annotators what text is wanted in this crowdsourcing experiment. For example, an annotator labels a given text following the criteria in the guidelines, even though he or she might not label it that way from his or her own point of view. Therefore, guidelines can facilitate crowdsourcing annotation of subjective concepts.
It is interesting to note that the impact of validation on model performance can be positive in some cases and negative in others. For example, the under-performing design D9 (guidelines + text amount) becomes a well-performing design, D14 (guidelines + validation + text amount), when the validation technique is added. Ramírez et al. [43] applied an aggregation strategy to text highlighting tasks. The aggregation strategy is to highlight the words that are highlighted by more people, which is similar to the idea of the validation technique in this study. They reported an increase in classification model performance with text highlighting using the aggregation strategy. However, this study also observes a negative effect of the validation technique on model performance. The well-performing design D7 (highlight + text amount) becomes an under-performing design, D13 (highlight + validation + text amount), when the validation technique is added. Interaction between validation and the other techniques could be a possible explanation, and further exploration is needed.
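The trade-off between label confidence and label count that validation introduces can be illustrated with a simple agreement filter. This is a hypothetical sketch: this section does not spell out the exact validation procedure, so the rule used here (keep a label only when a strict majority of at least two workers agree) and all names are assumptions made for illustration.

```python
from collections import Counter


def validate_labels(annotations, min_agreement=2):
    """Keep only the labels that enough workers agree on.

    annotations maps a text id to the list of labels given by different
    workers. A text is kept when its most common label has at least
    min_agreement votes and a strict majority; otherwise it is dropped.
    """
    validated = {}
    for text_id, labels in annotations.items():
        if not labels:
            continue
        label, votes = Counter(labels).most_common(1)[0]
        if votes >= min_agreement and votes > len(labels) / 2:
            validated[text_id] = label
    return validated


# Toy annotations (not from the study): 1 = target concept, 0 = not.
raw = {"t1": [1, 1, 0], "t2": [1, 0], "t3": [0, 0], "t4": [1]}
kept = validate_labels(raw)  # only t1 and t3 survive the filter
```

In this toy example, four annotated texts shrink to two validated labels, which mirrors the observation that higher label confidence comes at the price of a smaller training set.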
When examining the top design among designs with an equal number of techniques, D4, D7 and D12 perform best after the three rounds of annotation, and they share a common design technique: text amount. Under a fixed cost, D9, D13 and D15 are the bottom designs, and they also share the text amount technique. In addition, although the design consisting of only the text amount technique (D4) achieves good model performance in the last round of annotation, its model performance becomes poor under a fixed cost. The text amount technique therefore has an impact on the performance of active learning, which can be either positive or negative depending on the specific text amount. A previous study revealed that crowd workers correctly annotated shorter tweets with fewer entities, indicating that the text amount can influence the quality of crowdsourcing annotation [44]. Our findings are consistent with this.
6.2. Implications of crowdsourcing designs for active learning
This study selects the combination of crowdsourcing and active learning as a cost-saving environment. Within this environment, it investigates efficient designs that help crowd workers provide quality labels and therefore improve the performance of active learning models. The results provide evidence that incorporating crowdsourcing into the active learning approach can benefit model performance, which is consistent with previous explorations [23,24]. Unlike those studies, this study not only adopted the combination of crowdsourcing and active learning but also identified an effective way of combining them from the perspective of human-centred interaction design. The findings on the impact of different design techniques on the performance of active learning models fill the knowledge gap about how to design an effective crowdsourcing labelling system for active learning. The implications can be applied beyond the current domain, in wider crowdsourcing contexts where high-quality collective intelligence is needed.
The analysis of the impact of different designs on model performance sheds light on practical implications for crowdsourcing designs used for active learning. In this study, the models based on most designs outperform the baseline model after three rounds of annotation, which indicates that, compared with adopting no technique, crowdsourcing designs that use the four techniques can generally improve the performance of active learning. Therefore, human-centred design shows its advantage in collecting quality labels when crowdsourcing and active learning are combined. Designers can use different interaction and annotation techniques as a promising approach to improving crowdsourcing annotation tools. Different hybrid techniques improve the performance of active learning in different respects. For example, this study shows that the mix of highlight and text amount (D7) leads to the highest performance after the last round of annotation, while the hybrid of guidelines and validation (D8) results in the largest performance growth over the three rounds. Crowdsourcing designers need to adopt different techniques depending on the desired effect on model performance. Moreover, the top-performing designs (D1, D2, D7 and D14) involve fewer techniques on average than the bottom ones (D9, D13 and D15). This suggests that crowdsourcing tools that include more design techniques may have a negative impact on label quality, probably because of visual complexity and extra cognitive burden. Crowdsourcing designers can take from this that simpler designs tend to bring better performance.
7. Conclusion
Most active learning research focuses on algorithms that optimise for an ideal balance between the cost of the training set, annotation labour and model performance, and overlooks a potentially critical factor: the design of the crowdsourcing annotation interaction. This study addresses this missing component of research into active learning. It leverages human-centred designs to improve the quality of crowdsourcing labels with a controlled cost. In total, 690 crowd workers were recruited for three rounds of annotations. Fifteen interaction designs involving four design techniques were used in the crowdsourcing experiments. Model results for the 15 designs with and without fixed costs were analysed. We find that the design techniques, both individually and through their interactions, play different roles in improving model performance. The top designs are D1, D2, D7 and D14, which are relatively ‘simple’ designs with a few design techniques.
Meanwhile, we acknowledge a limitation of this study: all models were reported with relatively low F1-scores. A probable factor in the poor model performance is that subjective target concepts often pose far more complex challenges to machine learning algorithms. Another limitation is that no validation set was used in the training regime. Furthermore, this study reveals the impact of text amount, but the text amount that most efficiently induces a positive effect on active learning performance remains unknown and could be addressed in future work. The focus of this study is on human-centred designs that improve the quality of crowdsourcing labels. Therefore, a supervised machine learning model, the SVM, was used. We acknowledge that the bidirectional encoder representations from transformers (BERT) model, a self-supervised method, has become popular in NLP. In future work, the BERT model will be adopted to see whether the F1-score increases.
Footnotes
Declaration of conflicting interests
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China (No. 92370112) and the Science Fund for Creative Research Groups of the Natural Science Foundation of Hubei Province (No. 2023AFA012).
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
