Abstract
Current crowdsourcing platforms provide an attractive solution for processing high-volume tasks at low cost. However, quality control remains a major concern. In the present work, we developed a private crowdsourcing system (PCSS) running in an intranetwork, which allows us to devise quality control methods. We introduce four worker selection methods and a grade-based training method. The four worker selection methods consist of preprocessing filtering, real-time filtering, post-processing filtering, and guess-processing filtering. In addition to a basic approach involving initial training or the use of gold-standard data, these methods include a novel approach utilizing collaborative filtering techniques. Using PCSS, we collected a large amount of vocabulary data for natural language processing (NLP) applications, such as voice recognition and text to speech. The quality control methods increased accuracy by 32.4 points in the vocabulary collection tasks. We also implemented the grade-based training method to avoid claims of unfair dismissal and shrinkage of the crowdsourcing market caused by excluding workers. This training method uses Bayesian networks to calculate correlations between tasks based on workers’ records, and then allocates learning tasks to the workers to raise the results of target tasks according to the correlations. In an experiment, the method automatically allocated learning tasks for target tasks, and after the training, we confirmed that the workers raised the accuracy of the target task by 10.77 points on average. Therefore, by combining the filtering methods and the training method, task requesters in microtask crowdsourcing can obtain higher-quality results without dismissing valuable workers.
Introduction
There is a trend toward analysis of ever-larger amounts of data. Although the attributes of speed and quality are required for practicable analysis, the traditional methods of big data analytics applied by researchers or specialists are costly. Crowdsourcing services have made it easy to analyze big data. Crowdsourcing helps reduce both the time required for analysis and the cost. However, microtask crowdsourcing services, such as Amazon Mechanical Turk (AMT),1
Conventional quality control methods for microtask crowdsourcing are targeted at task requesters, such as a good task design and exclusion of low-quality workers. Most of the methods, however, require the requesters to have sufficient knowledge of crowdsourcing [2]. As a result, the requesters have to accept low-quality results of the tasks because of a lack of such knowledge.
We propose a method of creating a crowdsourcing system in a private environment (private crowdsourcing system, PCSS), and incorporate a new quality control mechanism in this system. We introduce our PCSS and four quality control methods, namely, preprocessing filtering, real-time filtering, post-processing filtering, and guess-processing filtering, and evaluate the PCSS by using it to collect vocabulary data from the Web. NLP, such as voice recognition or text to speech, requires a lot of vocabulary data, consisting of reading data, class data and accent data, in order to improve.
However, excluding workers will shrink the crowdsourcing market. Therefore, support for crowdsourcing workers, such as education and improvement of the working environment, is becoming important [19]. Crowdsourcing workers are easily employed and dismissed, since they are unspecified. This poor management of workers could lead to complaints regarding unfair dismissal, and the quality of task results would likely decline. Therefore, education for crowdsourcing workers is expected to be raised to a level corresponding to that of education in the traditional working environment. However, education for crowdsourcing workers is subject to several problems. For example, it is difficult to educate the workers individually, because the workers in microtask crowdsourcing are many and unspecified. In addition, education for crowdsourcing workers undermines the merits of microtask crowdsourcing, such as low cost and rapidity.
Thus, we propose a grade-based training method for the education of microtask crowdsourcing workers based on the method proposed in [33]. In this method, workers process appropriate training tasks before processing difficult tasks. The workers’ skill is upgraded by processing the training tasks. Specifically, our system creates appropriate training tasks by analyzing the correlations between tasks and workers’ records using a Bayesian network.
Thus, the main contribution of this paper is the proposal of a set of methods for improving the quality of the results without requiring the requesters’ knowledge. Our system improved the quality of results by excluding spam workers and by educating workers.
The rest of the paper is organized as follows. In Section 2, we review existing quality control methods in microtask crowdsourcing. Section 3 presents PCSS and the four quality control methods. Section 4 shows the evaluation of collecting vocabulary data from the Web using PCSS. Section 5 shows the grade-based training method for educating workers. The paper finishes with Section 6, which presents conclusions and indicates directions for future research.
Crowdsourcing is defined as the process of obtaining desired contents by soliciting contributions from a large group of people. In this paper, we discuss microtask crowdsourcing, which is our research target.
Microtask crowdsourcing is a service or system, whereby tasks are produced by a company or a group, and processed by many people (workers). These workers are collected from unspecified Internet users. The tasks tend to be tiny, easy, numerous, and modestly rewarded. AMT and CrowdFlower2
There are many metrics for measuring the quality of microtask crowdsourcing. In the present work, we used cost, accuracy, and rapidity as quality indicators. These indicators are mutually dependent. For example, in order to reduce the cost, the system must reduce workers’ rewards for tasks. However, this reduces workers’ motivation, leading to a decline in the quality of the data. In order to improve the quality of processed data, the system should raise workers’ rewards.
Microtask crowdsourcing is generally advantageous in terms of cost and rapidity, not quality. Therefore, when using microtask crowdsourcing for research purposes, the quality of the processed data is the most important topic. Thus, many quality control methods for microtask crowdsourcing have been proposed in the literature. These methods can be classified into three categories: 1) control of crowdsourcing tasks, 2) control of requesters and 3) control of workers, such as excluding spam workers and finding skillful workers.
1) The control methods for crowdsourcing tasks include: improving task design [18], supplying tasks to workers and combining the results [10,23,31,34,35], having workers process tasks and simultaneously vote on the reliability of their work [27,30], and checking workers’ skill against tasks that have already been completed (gold-standard data).
2) The control methods proposed for requesters include: excluding spam requesters by checking task contents using machine learning [5], finding bad requesters by analyzing workers’ opinions [24], and having workers determine whether a requester is a bad requester [15].
3) The control methods for workers can be classified into three categories: 3a) quality control by excluding low-quality workers, 3b) quality control by controlling allocation of tasks to workers, and 3c) quality control by analyzing communication among workers.
3a) The quality control techniques for excluding low-quality workers include: excluding spam workers by applying learning data from task results [12], making conjectures about workers’ skill based on their behavior in unrelated tasks [17], excluding spam workers and poorly qualified workers by ranking them [28], and excluding spam workers and poorly qualified workers by calculating the threshold between highly qualified workers and poorly qualified workers [11].
3b) The quality control techniques for controlling the allocation of tasks to workers include: controlling task allocation by analyzing correlations between workers and requesters [24], allocating easy tasks to workers before allocating difficult tasks to them, with the easy tasks being derived from difficult tasks [4], allocating tasks to workers according to the degree of difficulty of tasks [6], allocating tasks to workers based on the task type and the percentage completed [3], allocating tasks to workers based on workers’ action history and the tasks workers prefer [36], and allocating tasks to workers based on the task difficulty and worker skill [13].
3c) The quality control techniques for analyzing communication among workers include: introducing a worker to a requester by analyzing communication among workers [24] and creating training data for new workers by analyzing conversation among workers [25].
The mixed control techniques for excluding low-quality workers and controlling the allocation of tasks to workers include: filtering low-quality workers by a pretest and controlling the allocation of tasks according to task difficulty [16].
In this paper, we propose four worker selection methods corresponding to category 3a), and a grade-based training method corresponding to category 3b). PCSS does not utilize a method corresponding to category 3c), because communication among workers tends to lead to trouble among workers and increases the cost of worker maintenance.
Furthermore, in this paper we explain how vocabulary data are collected from the Web using PCSS. Many methods related to collecting data from the Web have been proposed in the literature, including: extraction of words from a corpus for the Semantic Web [37], development of middleware for the Semantic Web [32], extraction of Web ontologies for the Semantic Web [20], development of a collaborative environment for semantic-enabled mobile devices [29], a method of designing executable Web services using process algebra [9], modeling of semantic-rich and self-contained knowledge units as a website abstraction [8], development of visual queries for Web search [14], reduction of the cost of Internet search [21,26], and evaluation of conversations in Web service processes [22].

Fig. 1. Relation of PCSS, market research company and workers.

Fig. 2. PCSS architecture.
We used microtask crowdsourcing to collect a lot of vocabulary data for NLP in Section 4, and such data needs to be of high quality. However, the task results from existing microtask crowdsourcing services were of such low quality as to be unsuitable for the requesters. The quality control methods included in existing crowdsourcing services are insufficient and it is difficult to incorporate new quality control mechanisms in them. Therefore, we developed PCSS.
Microtask crowdsourcing requires many workers. Whereas AMT can collect workers from its Internet customers, it is difficult to recruit workers for PCSS from the public, because PCSS is used in a private environment. The workers for PCSS are recruited from the workers of a market research company. We conducted a questionnaire survey of the market research company’s workers. The matters addressed by the questionnaire were “available working time”, “ambitions concerning work”, “hourly pay rates”, “educational background”, and “ICT experience”. The questionnaire targeted 80,000 possible workers. We call this method preprocessing filtering. The relations among PCSS, the market research company, and workers are shown in Fig. 1.
PCSS can be used for many kinds of tasks. In this paper, we focus on an example, namely, collection of vocabulary data for NLP from Web data. The task involves a noise reduction task, a phonetics data addition task, a correct grammar data collection task and an accent data addition task.
PCSS, shown in Fig. 2, has been in operation since 2011. The record of PCSS is shown in Table 1.
Table 1. Record of PCSS
In terms of quality control, PCSS focuses on quality control methods for workers. The workers for PCSS comprise high-quality workers capable of high-quality processing of specific domain tasks, low-quality workers capable of low-quality processing of specific domain tasks, workers with skill capable of processing tasks in accordance with the requester’s desire, workers without skill who do not process tasks in accordance with the requester’s desire, and spam workers who process tasks with a script and process low-quality tasks in all domains. PCSS improves the quality of task processing by excluding the low-quality workers and the spam workers. PCSS controls workers’ behavior based on workers’ action history.
PCSS implements four worker selection methods: 1) preprocessing filtering, 2) real-time filtering, 3) post-processing filtering, and 4) guess-processing filtering. These filtering methods are applied in descending order of the number of workers as shown in Fig. 3.

Fig. 3. Filtering methods in PCSS.
As described above, preprocessing filtering is applied in the recruitment of workers. A questionnaire survey is conducted, and workers who are clearly of low quality are excluded based on the results. The questionnaire ascertains “available working time”, “ambitions concerning work”, “educational background”, and “ICT experience”.
Real-time filtering
This filtering method is applied while workers are processing tasks. Preprocessing filtering cannot exclude all low-quality workers. Workers’ quality changes throughout the processing time, and thus PCSS must check workers’ quality in real time. The accuracy rate of all tasks

Fig. 4. Example of displaying the accuracy rate for all tasks and experience points in PCSS.
An example of workers’ accuracy rates for all tasks and their experience points throughout the working time is shown in Fig. 4. PCSS warns each worker that any worker whose accuracy rate for all tasks falls under 70% cannot process tasks, and thus all workers tend to process tasks carefully. Therefore, real-time filtering improves the quality of task results.
The role of the real-time filtering in changing workers’ behavior is similar to that of game mechanics. Game mechanics comprises rules and methods for controlling users’ motivation [1]. In order to calculate accuracy, PCSS must judge whether a task result is correct or incorrect. Since most task answers are unknown, PCSS adopts the well-known method of judging whether a task is correct or incorrect by majority vote [31]. PCSS does not change the accuracy in a task, such as a questionnaire, for which there is no correct answer.
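The majority-vote judgment described above can be illustrated with a minimal sketch. The exact accuracy formula used by PCSS is not reproduced here; the sketch assumes that a worker's accuracy rate is the fraction of that worker's judged answers that match the majority answer, and that tasks without a strict majority leave the accuracy unchanged. The function names and the data layout are hypothetical.

```python
from collections import Counter


def majority_answer(answers):
    """Return the answer given by a strict majority, or None if there is none."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count * 2 > len(answers) else None


def accuracy_rates(task_answers):
    """task_answers maps task_id -> {worker_id: answer}.

    Returns worker_id -> fraction of judged answers that match the majority.
    Assumption: tasks without a strict majority do not change any accuracy.
    """
    correct = Counter()
    judged = Counter()
    for answers in task_answers.values():
        truth = majority_answer(list(answers.values()))
        if truth is None:
            continue  # no majority, so no correct answer can be assumed
        for worker, answer in answers.items():
            judged[worker] += 1
            if answer == truth:
                correct[worker] += 1
    return {worker: correct[worker] / judged[worker] for worker in judged}


example = {
    "t1": {"w1": "a", "w2": "a", "w3": "b"},
    "t2": {"w1": "x", "w2": "y", "w3": "x"},
    "t3": {"w1": "p", "w2": "q", "w3": "r"},  # no majority, ignored
}
rates = accuracy_rates(example)
```

In this example, worker w1 agrees with the majority in both judged tasks, while w2 and w3 each agree in one of two.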
However, some workers whose accuracy rates become low because of difficult tasks with high rewards tend to recover their accuracy rates by processing easy tasks with low rewards. This is undesirable from the viewpoint of quality control. Therefore, PCSS also calculates the accuracy rate for each task category
Workers tend to have strong task categories and weak task categories. PCSS analyses a worker’s strong task categories and weak task categories based on the results of the tasks that the worker performed, and controls task allocation accordingly, as shown in Fig. 5. The requester can analyze workers’ strong task categories and weak task categories based on the task results. The requester collects workers whose strengths correspond to the requester’s task categories. This information is registered in PCSS and shared by all requesters as ‘skill’.

Fig. 5. Post-processing filtering.
Since the real-time filtering and the post-processing filtering filter workers using task results, many low-quality results remain. Thus, other workers must process these low-quality data, and the requester must pay a higher reward to workers.
Various methods of allocating tasks to workers are proposed in the literature. One method involves allocating tasks to workers based on the task type and the percentage completed [3]. Since this method does not filter low-quality workers, low-quality workers create many low-quality task results. Another method involves allocating tasks to workers based on workers’ action history and the tasks workers prefer [36]. This method has a drawback in that it is difficult to calculate large data for many workers. Yet another method involves allocating tasks to workers based on the task difficulty and worker skill [13]. This method has a drawback in that categories are limited.
The proposed guess-processing filtering method controls the allocation of tasks to workers based on the degree of worker similarity. Specifically, in the case that worker 1 is similar to worker 2, it is assumed that worker 1’s strong task categories and weak task categories are similar to worker 2’s. For example, PCSS has some accuracy data. As shown in Table 2, PCSS allocates tasks as follows: 1) if worker 1 is similar to worker 2 and task 1 is worker 1’s strong task category, PCSS estimates task 2 is worker 2’s strong task category. Then, PCSS allocates task 2 to worker 2; 2) if worker 1 is similar to worker 3 and task 5 is worker 1’s weak task category, PCSS estimates task 5 is worker 3’s weak task category. Then, PCSS does not allocate task 5 to worker 3.
Table 2. Sample of collaborative filtering for workers (“–” means not processed)
Table 3. Part of calculated similarity scores between workers
The similarity score between workers is calculated with the Pearson correlation coefficient. Similarity
PCSS allocates tasks to workers according to the conjectured accuracy. If the conjectured accuracy of worker 1 in category 1 is under 70%, PCSS does not allocate a task in category 1 to worker 1. Part of the calculated similarity scores between workers is shown in Table 3.
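The following sketch illustrates how worker similarity and a conjectured accuracy could be computed. The Pearson correlation coefficient and the 70% threshold come from the text. The weighting scheme is an assumption: the conjectured accuracy is taken as the similarity-weighted mean of the accuracies of positively correlated workers, which is a common user-based collaborative filtering estimate and not necessarily the formula used in PCSS. All names and sample values are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over the categories that both workers have processed."""
    shared = [c for c in a if c in b]
    n = len(shared)
    if n < 2:
        return 0.0
    mean_a = sum(a[c] for c in shared) / n
    mean_b = sum(b[c] for c in shared) / n
    cov = sum((a[c] - mean_a) * (b[c] - mean_b) for c in shared)
    var_a = sum((a[c] - mean_a) ** 2 for c in shared)
    var_b = sum((b[c] - mean_b) ** 2 for c in shared)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def conjectured_accuracy(records, worker, category):
    """Similarity-weighted mean accuracy of positively correlated workers.

    Assumption: only workers with positive similarity contribute.
    Returns None when no such worker has processed the category.
    """
    num = den = 0.0
    for other, acc in records.items():
        if other == worker or category not in acc:
            continue
        sim = pearson(records[worker], acc)
        if sim > 0:
            num += sim * acc[category]
            den += sim
    return num / den if den else None


def may_allocate(records, worker, category, threshold=0.70):
    """Allocate unless the conjectured accuracy is under the threshold.

    Assumption: with no estimate available, the task is still allocated.
    """
    estimate = conjectured_accuracy(records, worker, category)
    return estimate is None or estimate >= threshold


# Hypothetical accuracy records: worker -> {category: accuracy}.
records = {
    "w1": {"c1": 0.9, "c2": 0.6, "c3": 0.8, "c4": 0.5},
    "w2": {"c1": 0.8, "c2": 0.5, "c3": 0.7},
    "w3": {"c1": 0.5, "c2": 0.9, "c3": 0.6, "c4": 0.95},
}
estimate = conjectured_accuracy(records, "w2", "c4")
```

Here w2 has the same accuracy profile as w1 and an opposite profile to w3, so the estimate for the unprocessed category c4 follows w1 and the category is not allocated to w2.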
Research on NLP, such as voice recognition and text to speech, requires a lot of vocabulary data. In this section, we explain how vocabulary data are collected using PCSS. The flow of collecting vocabulary data is shown in Fig. 6. First, webpages are collected using a Web crawler. Web text data are extracted from the crawled webpages by deleting HTML tags. Next, candidate vocabulary data are extracted using morphological analysis. Finally, noise data are filtered and data are annotated using PCSS.

Fig. 6. Flow of collecting vocabulary data from the Web.
Web text from the Internet contains many proper nouns and unknown words. In the present work, we use Apache Nutch3
The words are extracted from the Web text using morphological analysis. There are 230,000 candidate words.
Table 4. Collected Web text
The words extracted from the Web text using morphological analysis are of insufficient quality as data for NLP research. PCSS rejects noise data from the candidate words and annotates the remaining words with phonetics data, grammar data, and accent data. PCSS treats this work as crowdsourcing microtasks. The word collection involves the following PCSS tasks: the ‘noise reduction task’, the ‘phonetics data addition task’, the ‘correct grammar data collection task’, and the ‘accent data addition task’. The same question is given to 3 workers. For the ‘phonetics data addition task’, the ‘correct grammar data collection task’, and the ‘accent data addition task’, PCSS treats a majority answer as correct data. Because ‘noise reduction’ requires high quality, PCSS accepts only unanimous task results. The agreement rate of each task is shown in Table 5. Workers can skip a task that they cannot solve, for example, when a word has multiple phonetic readings and the worker cannot specify its phonetics data. Tasks that are skipped by workers 6 times are treated as unsolvable and rejected as noise.
The number of collected words is shown in Table 6. In total, 138 thousand words were collected from 12.5 billion Web text data.
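The acceptance rules above can be sketched as follows. The rules themselves (three workers per question, majority for annotation tasks, unanimity for noise reduction, rejection after 6 skips) come from the text. The handling of skipped answers inside a single question and all identifiers are assumptions made for illustration.

```python
from collections import Counter

SKIP = None  # hypothetical marker for a skipped question


def aggregate(task_type, answers):
    """Return the accepted answer for one question, or None if none is accepted.

    'noise_reduction' requires all workers to give the same answer;
    the other task types accept a strict majority of the assigned workers.
    """
    given = [a for a in answers if a is not SKIP]
    if not given:
        return None
    answer, count = Counter(given).most_common(1)[0]
    if task_type == "noise_reduction":
        return answer if count == len(answers) else None
    return answer if count * 2 > len(answers) else None


def rejected_as_noise(skip_count, limit=6):
    """A question skipped `limit` times or more is rejected as noise."""
    return skip_count >= limit
```

For example, two matching readings out of three are accepted for the phonetics task, whereas the same two-to-one split is not accepted for noise reduction.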
Table 5. Agreement rate of results of each task
Table 6. Number of collected words
Morphological analysis cannot always extract words correctly, so the extracted words contain noise data. Crowdsourcing workers check and eliminate noise data in crowdsourcing tasks. The task shows an original sentence that includes the target word and another sentence that includes the target word. The workers answer whether the second sentence is correct. If the workers answer that it is correct, the target word is not noise. The noise reduction task is shown in Fig. 7.

Fig. 7. Noise reduction task.
The extracted word data contain kanji characters. In order to use kanji characters in voice recognition or text to speech, phonetics data are required. Crowdsourcing workers add phonetics data to the extracted words in crowdsourcing tasks. Because phonetics data change with the context, this task shows an original sentence that includes the target word. The target word is emphasized in red. The phonetics data addition task is shown in Fig. 8.

Fig. 8. Phonetics data addition task.

Fig. 9. Correct grammar data collection task.

Fig. 10. Accent data addition task.
In order to use the extracted words in voice recognition or text to speech, grammar data are required. Crowdsourcing workers select correct grammar data from candidate grammar data in crowdsourcing tasks. Because grammar data change with the context, this task shows an original sentence that includes the target word. The target word is emphasized in red. The correct grammar data collection task is shown in Fig. 9.
Accent data addition task
In order to use the extracted words in text to speech, accent data are also required. Crowdsourcing workers select correct accent data from candidate accent data in crowdsourcing tasks. This task shows candidate accent data, which are generated by the system. Crowdsourcing workers can check the sound of an accent by clicking the target icon. The accent data addition task is shown in Fig. 10.
Effect of quality control method in PCSS
The quality of the task results in Section 4.3 has been improved by the quality control methods in Section 3. In this section, we explain the effect of the quality control methods. The PCSS system manager classifies the PCSS tasks into 13 categories. This section describes the four categories used for word collection: ‘noise reduction’, ‘phonetics data addition’, ‘correct grammar data collection’ and ‘accent data addition’. The number of tasks and worker information in each category is shown in Table 7. PCSS defines an ‘active worker’ as a worker who has processed 50 tasks in a category.
Table 7. Number of tasks and worker information in each category
Table 8. Number of skill workers and non-skill workers filtered by post-processing filtering
Table 9. Comparison of guessed accuracy and actual accuracy, where high quality means over 90% accuracy
Questionnaires for preprocessing filtering were sent to 80 thousand workers of a market research company. The questionnaires ascertained time available for work, ambitions concerning work, educational background, and ICT experience. As a result, there were 2457 suitable workers. Then, we sent a recruitment e-mail to the suitable workers. Finally, the number of workers who have processed all the tasks at least once was 1630. The number of workers who work on a monthly basis was around 150.
Real-time filtering
We tested the accuracy of task results with the lower limit of accuracy set to 60%. As a result, the task results contained many low-quality results. We also tested with the lower limit of accuracy set to 80%. As a result, the number of workers became too small to process tasks rapidly. Therefore, PCSS sets the lower limit of accuracy to 70%. Real-time filtering excluded 62 low-quality workers as spam workers in all categories. Low-quality workers whose accuracy in a category falls under 70% cannot select tasks in that category. The number of low-quality workers is shown in Table 7.
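The two-level check described above (overall accuracy and per-category accuracy, both with a 70% lower limit) can be sketched as follows. The function name, its arguments, and the treatment of categories without any record are assumptions made for illustration.

```python
def selectable_categories(all_categories, category_accuracy, overall_accuracy,
                          threshold=0.70):
    """Return the task categories that a worker may still select.

    A worker whose overall accuracy is under the threshold is excluded
    from all tasks; otherwise each category is checked separately.
    """
    if overall_accuracy < threshold:
        return []
    return [
        c for c in all_categories
        # Assumption: a category without a record stays selectable.
        if category_accuracy.get(c, 1.0) >= threshold
    ]


categories = ["noise reduction", "phonetics data addition", "accent data addition"]
accuracy = {"noise reduction": 0.90, "phonetics data addition": 0.65}
allowed = selectable_categories(categories, accuracy, overall_accuracy=0.80)
```

In this example the worker keeps access to ‘noise reduction’ and to the not yet attempted ‘accent data addition’, but can no longer select ‘phonetics data addition’ tasks.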

Comparison between guessed accuracy
The skills that are analyzed by the post-processing filtering are shown in Table 8. The difficult tasks in the ‘correct grammar data collection task’ are processed only by workers who have the skill of the ‘worker who can answer correct grammar data’. This skill is set for workers whose task results in the ‘correct grammar data collection task’ are of high quality. The requesters analyze the results of tasks and choose high-quality workers for their own tasks. PCSS sets a skill label for workers based on the requester’s request.
There are four skill labels for ‘accent data addition’. Three of them, the ‘worker who can answer true or false question about accent task’, the ‘worker who can select correct accent data from candidates’ and the ‘worker who can write accent data’, are set for high-quality workers and are listed in ascending order of difficulty. The difficult tasks in ‘accent data addition’ are processed only by workers who have these skills. The fourth, the ‘worker who processes accent task with low quality’, is a skill label that is set for low-quality workers. The workers for whom this skill is set cannot process tasks in ‘accent data addition’.
Guess-processing filtering
In order to evaluate the accuracy of
Improvement on task quality
To verify the quality control methods employed in PCSS, we compared the accuracy in each task without the quality control and with the quality control. We manually calculated accuracy by randomly selecting task results from each of the categories. The results are shown in Table 10 and Fig. 12. The accuracy of the noise reduction task improved from 65.9% to 89.6%, the accuracy of the phonetics data addition task improved from 56.3% to 94.0%, the accuracy of the correct grammar data collection task improved from 71.0% to 90.4% and the accuracy of the accent data addition task improved from 54.1% to 98.7%. This figure indicates that the quality control methods improved the accuracy of the data for NLP research.
However, methods for filtering workers may result in shortages of workers. Also, the recruitment of new workers leads to a decline in the quality of task results, since most new workers are not of high quality. In addition, some dismissed workers complained about us. Thus, in the next section, we propose a method of transforming low-quality workers into high-quality workers rather than simply filtering them out.
Table 10. Effect of quality control

Fig. 12. The impact of the quality control methods in PCSS.
Education is based on the assumption that experience in certain tasks has a positive effect on people’s behavior in the same or similar tasks. Therefore, in order to obtain good results in any situation, people should have experience in the same or similar situations. Education for workers is very important, since an attempt to process a difficult task without any preparation poses a high degree of difficulty for low-quality workers and inexperienced workers. Many people start training with easier tasks than the target tasks. The efficacy of the grade-based training method is demonstrated in school education. Therefore, we propose a method of upgrading workers’ skill with training tasks allocated in stages.
However, school education involves teachers who can create perfect subjects from wide-ranging educational resources according to the purpose of study, and create a consecutive curriculum based on the students’ accumulated educational experience. The created subjects and curriculum are studied by many students, and then are improved based on feedback from students.
However, this grade-based training method has a cost problem. It is difficult to create the subjects and the curriculum automatically in a crowdsourcing system, since the purposes of tasks vary and there are many unspecified workers. Creating the subjects and the curriculum is difficult in regard to cost for the task requester and system manager. Therefore, a crowdsourcing system typically produces basic training and simple explanation of tasks. Our proposed method emulates experienced teachers in that it creates task categories according to purposes of tasks and their contents, and allocates tasks by analyzing workers’ behavior.
The method that we propose creates training tasks automatically by reusing existing tasks. If the workers processing task A before task B are superior to the workers who do not process task A before task B, task A can be defined as a training task of task B. Accordingly, in order to upgrade workers’ skill for task B, the system allocates task A to workers before they process task B.
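The criterion for a training task can be illustrated with a simple comparison. In this sketch, "superior" is interpreted as a higher mean accuracy on task B; this interpretation, the optional margin, and the data layout are assumptions, and the analysis described later in this section uses a Bayesian network rather than this direct comparison.

```python
from statistics import mean


def is_training_task(records, margin=0.0):
    """Decide whether task A looks like a training task for task B.

    records: list of (processed_a_before_b, accuracy_on_b) pairs, one per worker.
    Returns True when workers who processed A first score higher on B on average.
    """
    with_a = [acc for did_a, acc in records if did_a]
    without_a = [acc for did_a, acc in records if not did_a]
    if not with_a or not without_a:
        return False  # nothing to compare against
    return mean(with_a) - mean(without_a) > margin


# Hypothetical worker records for illustration only.
sample = [(True, 0.9), (True, 0.8), (False, 0.6), (False, 0.7)]
candidate = is_training_task(sample)
```

A positive margin could be used to ignore small differences that are likely due to chance.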
Automatic task classification
In order to create training tasks by reusing existing tasks, it is necessary to analyze the correlations between tasks and task contents. However, analyzing each task is costly in terms of both time and calculation, because PCSS has many tasks, as shown in Table 1. Therefore, PCSS automatically classifies task contents into task category groups to reduce the cost.
The tasks in PCSS have titles and descriptions written by requesters. PCSS calculates the tf-idf of each task using keywords extracted from the title and description by morphological analysis.
Consequently, PCSS classifies tasks with the cosine similarity of 0.4 or more into the same category. PCSS classifies tasks as follows: 1) PCSS calculates the cosine similarity between the target task and the representative task in each category. 2) PCSS classifies the target task into the task category that includes the most similar task. 3) If all calculated cosine similarities in 1) are under 0.4, PCSS creates a new task category. As a result, PCSS classifies 7 million tasks into 50 task categories.
Nine tasks, for which all calculated cosine similarities are under 0.4, do not have similar tasks. PCSS excludes these 9 tasks.
Analysis of correlation between task categories
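The classification steps above can be sketched in a few lines. The 0.4 threshold and the comparison against one representative task per category follow the text. The smoothed idf variant, the choice of the first task of a category as its representative, and the sample keyword lists are assumptions; keyword extraction by morphological analysis is assumed to have been done beforehand.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of keyword lists. Returns one {term: weight} dict per task."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            # Smoothed idf (assumption) so that common terms keep a weight.
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(vectors, threshold=0.4):
    """Assign each task to the category with the most similar representative.

    A new category is created when every similarity is under the threshold.
    Assumption: the first task of a category serves as its representative.
    """
    representatives = []  # index of the representative task of each category
    labels = []
    for i, vec in enumerate(vectors):
        sims = [cosine(vec, vectors[r]) for r in representatives]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            labels.append(best)
        else:
            representatives.append(i)
            labels.append(len(representatives) - 1)
    return labels


tasks = [
    ["accent", "select", "word"],
    ["accent", "select", "sound"],
    ["webpage", "similar", "content"],
]
labels = classify(tfidf_vectors(tasks))
```

With these keyword lists, the first two tasks share two of three keywords and fall into the same category, while the third task opens a new category.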
An intelligent tutoring system (ITS) [33] is a computer system that aims to provide immediate and customized instruction or feedback to learners, usually without intervention from a human teacher. ITS is very useful for a grade-based training method and is used for many purposes, for example, computer programming [7]. Methods for representing learner models in ITS using a predicate logical representation have been popular for many years. However, a predicate logical representation has some problems, such as difficulty in handling exceptions to rules and inconsistent worker behavior. For example, even if there is a rule that a worker who creates correct results in a ‘spell checking’ task can create correct results in a ‘put phonetics data to words’ task, there are cases in which a worker who cannot create correct results in a ‘spell checking’ task can create correct results in a ‘put phonetics data to words’ task. This is caused by workers’ careless mistakes and guesswork, and such cases often happen in microtask crowdsourcing. A stochastic method, such as a Bayesian network, can treat such exceptions to rules and inconsistent worker behavior. A Bayesian network is a probabilistic graphical model that represents a set of random variables and their conditional dependencies with a directed acyclic graph. PCSS analyzes correlations between task categories using a Bayesian network.
A Bayesian network is used for learning and inference. For example, the directed acyclic graph shown in Fig. 13 can be created by Bayesian network learning. In the figure, task A affects task B, and tasks B and C affect task D. Therefore, if the workers processing task B and task C before task D are superior to the workers who do not process task B and task C before task D, tasks B and C can be defined as training tasks of task D. Accordingly, in order to upgrade workers’ skill for task D, the system allocates task B and task C to workers before they process task D.
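To make the role of the graph concrete, the sketch below estimates one conditional probability table for the example in which tasks B and C are the parents of task D. It only counts outcomes for a structure that is already given; it does not perform the structure learning that produces the directed acyclic graph. The binarization of accuracy into high (True) and low (False) and all sample rows are assumptions.

```python
from collections import defaultdict


def conditional_table(rows, parents, child):
    """Estimate P(child is high | parent states) by counting.

    rows: list of dicts mapping task name -> bool (True means high accuracy).
    Returns {tuple of parent states: probability that the child is high}.
    """
    counts = defaultdict(lambda: [0, 0])  # parent states -> [high, total]
    for row in rows:
        key = tuple(row[p] for p in parents)
        counts[key][1] += 1
        if row[child]:
            counts[key][0] += 1
    return {key: high / total for key, (high, total) in counts.items()}


# Hypothetical worker records for illustration only.
rows = [
    {"B": True, "C": True, "D": True},
    {"B": True, "C": True, "D": True},
    {"B": True, "C": True, "D": False},
    {"B": False, "C": False, "D": False},
    {"B": False, "C": False, "D": True},
    {"B": False, "C": False, "D": False},
    {"B": False, "C": False, "D": False},
]
table = conditional_table(rows, parents=["B", "C"], child="D")
```

In these made-up records, workers with high accuracy on both B and C are more likely to reach high accuracy on D than workers with low accuracy on both, which is the kind of dependency that would mark B and C as training tasks for D.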
A method for representing a learner model in ITS with a Bayesian network has been proposed [33]. However, this method is unsuitable for crowdsourcing, and nodes of a directed acyclic graph are defined manually. PCSS creates nodes of a directed acyclic graph automatically and verifies them using the data in Table 1.

Sample of directed acyclic graph in PCSS.
Task categories of low-quality
To use a Bayesian network in crowdsourcing, PCSS calculates the average accuracy of task results in each task category. The Bayesian network is learned from the average accuracy of 7 million tasks processed by 798 workers, and the learning yields a directed acyclic graph. In the present work, we used the Waikato Environment for Knowledge Analysis (Weka) 3.6.11 to apply the Bayesian network to crowdsourcing. As target task categories, we selected those for which the accuracy of task results needs to be improved. The target task categories are listed in Table 11 in ascending order of accuracy.
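As an illustration of the first step, the average accuracy per worker and task category can be computed from raw task records as sketched below. The record layout `(worker_id, category_id, is_correct)` and the function name `average_accuracy` are assumptions made for this sketch; the actual PCSS data format and the way the table is passed to Weka are not described here.

```python
from collections import defaultdict


def average_accuracy(records):
    """Compute accuracy per worker and task category.

    records: iterable of (worker_id, category_id, is_correct) tuples
             (a hypothetical layout, not the PCSS schema).
    Returns {worker_id: {category_id: accuracy in [0, 1]}}.
    """
    counts = defaultdict(lambda: [0, 0])  # (worker, category) -> [correct, total]
    for worker, category, is_correct in records:
        cell = counts[(worker, category)]
        cell[0] += int(is_correct)
        cell[1] += 1

    table = defaultdict(dict)
    for (worker, category), (correct, total) in counts.items():
        table[worker][category] = correct / total
    return dict(table)


# Hypothetical records for two workers.
sample = [
    ("w1", "ID3", True),
    ("w1", "ID3", False),
    ("w1", "ID13", True),
    ("w2", "ID3", True),
]
print(average_accuracy(sample))
```

Each row of the resulting table (one worker, one accuracy value per task category) could then serve as one training instance for structure learning.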

Directed acyclic graph for target task categories.
As a result, PCSS created directed acyclic graphs for ID3: determine if two webpages have similar content or not and ID11: cutting a sentence by paragraph. The created directed acyclic graphs are shown in Fig. 14.
The created directed acyclic graphs show the correlations between target tasks and training tasks. The directed acyclic graph on the left in Fig. 14 shows that workers who processed ID12: select correct accent data from the following options, ID13: spell checking, ID14: determine if two synthesized speeches have similar sound or not, ID15: put phonetics data to words and ID16: checking if phonetics data of word are correct can process ID3 with high accuracy. For example, in order to find a training task for ID3, PCSS calculates the probability
If
To check the effectiveness of grade-based training, we ran the following test:
1) Allocate the target task ID3 to all workers to check the initial accuracy.
2) Divide the workers who processed the target task in 1) into the following groups:
   Worker group 1 processes all training tasks in ID12, ID13, ID14, ID15 and ID16. Each task has 10 questions.
   Worker group 2 processes a single training task in ID12, ID13, ID14, ID15 and ID16. This training task has 50 questions.
   Worker group 3 processes the same task as in 1), that is, ID3. This task has 50 questions.
   Worker group 4 processes a task different from the target task in 1), for example, ID11. This task has 50 questions.
3) Allocate the same task as in 1) to the workers who processed tasks in 2). This task has 50 questions.
4) Repeat 2) to 3) three times.
5) Compare the accuracy of the task result in 1) with the accuracy of the final task result in 3). The workers do not perform other tasks between the steps.
Effect of training task
We evaluate this method in terms of the improvement in average accuracy and the number of workers whose accuracy improved.
The average accuracy results are shown in Table 12. In the case of ID3, accuracy improved by 10.8 points on average in worker group 1 and by 9.8 points on average in worker group 2. The same tendency is observed in the case of ID11. These results indicate that training tasks created from other tasks according to the Bayesian network are beneficial. On the other hand, accuracy improved by only 2.2 points in worker group 3 and by 0.3 points in worker group 4. Thus, the improvement is slight when no training task is processed. We speculate that this slight improvement is due to workers getting used to processing the same tasks.
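The two evaluation measures can be computed as in the following minimal sketch. The accuracy values are hypothetical and are not taken from Table 12; the function names are likewise chosen only for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def improvement_points(initial, final):
    """Average accuracy gain in percentage points (final minus initial)."""
    return mean(final) - mean(initial)


def workers_improved(initial, final):
    """Number of workers whose final accuracy exceeds their initial accuracy."""
    return sum(1 for before, after in zip(initial, final) if after > before)


# Hypothetical accuracies (%) of three workers, before and after training.
initial = [60.0, 70.0, 80.0]
final = [72.0, 82.0, 80.0]

print(improvement_points(initial, final))  # 8.0
print(workers_improved(initial, final))    # 2
```

Reporting the difference in percentage points, rather than as a relative change, follows the way improvements are stated in the text (for example, 10.8 points).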

Workers’ growth patterns in ID3 (in the case of training task ID13).
Workers’ growth follows many patterns, as shown in Fig. 15. These patterns can be classified into three categories: 1) the continuously growing type (workers E and F); 2) the rising and falling type in which the total quality improves (workers A, B, C, D, G, H and I); and 3) the rising and falling type in which the total quality falls (workers J, K, L and M). The number of workers in each category is shown in Table 13. We assume that a grade-based training method is effective because the number of workers in 1) and 2) combined is greater than the number of workers in 3). For example, in the case of ID13, categories 1) and 2) together account for 9 workers, whereas category 3) accounts for 4 workers. We conjecture that the differences between these types are due to worker profiles. Development of a method for generating effective training tasks that uses worker profiles is a subject for future work.
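The three-way classification can be expressed as a simple rule on each worker's sequence of accuracies, as in the sketch below. The exact criteria are assumptions made for illustration: a sequence that never decreases and ends higher than it starts is treated as type 1, and a worker whose final accuracy equals the initial accuracy is counted as type 3. The accuracy sequences are hypothetical; only the category counts (2, 7 and 4 workers) come from the text.

```python
def growth_type(accuracies):
    """Classify one worker's accuracy sequence into type 1, 2 or 3.

    1: continuously growing (never decreases, net gain)
    2: rising and falling, total quality improves
    3: rising and falling, total quality falls (ties counted here by assumption)
    """
    first, last = accuracies[0], accuracies[-1]
    steps = [b - a for a, b in zip(accuracies, accuracies[1:])]
    if last > first:
        return 1 if all(step >= 0 for step in steps) else 2
    return 3


def training_effective(type_counts):
    """Criterion from the text: types 1 and 2 together outnumber type 3."""
    return type_counts[1] + type_counts[2] > type_counts[3]


# Hypothetical accuracy sequences (%), one per growth type.
print(growth_type([50, 60, 70, 80]))  # 1
print(growth_type([50, 70, 60, 65]))  # 2
print(growth_type([50, 60, 40, 45]))  # 3

# Counts for training task ID13 as stated above: 2 + 7 = 9 versus 4.
print(training_effective({1: 2, 2: 7, 3: 4}))  # True
```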
In addition, reduced worker motivation lowers the quality of task results. Workers’ motivation is governed by the reward and by the pleasure derived from task contents [18]. Since workers could not perform other tasks during this test, the rising and falling growth type may be attributable to boredom associated with the repeated test. Development of a training test that keeps the interest of workers is also a subject for future work.
We could also create directed acyclic graphs for ID5: phonetics data checking, ID8: select correct phonetics data from the following option and ID2: correct grammar data collection. However, we excluded these task categories from the target tasks to be tested. ID5 already had a high accuracy of 97.5% at the first check; we speculate that this is because the check task is too easy. The directed acyclic graph for ID8 has no training task category that influences the target task category. The directed acyclic graph for ID2 has only a task category with a harmful influence on the target task category. We could not create directed acyclic graphs for the task categories ID4: accent data checking, ID6: accent data and phonetics data checking and ID7: select correct accent data from the following options; we speculate that this is because the workers are already limited by PCSS. We also could not create directed acyclic graphs for the task categories ID9: put phonetics data to words and ID10: noise reduction, since the workers for these tasks are already high-quality workers.
Workers’ growth types in ID3
In this paper, we developed PCSS, a crowdsourcing system used in a private environment. In PCSS, we implemented four quality control methods on the server side: preprocessing filtering, real-time filtering, post-processing filtering and guess-processing filtering. PCSS improves the accuracy of task results through these quality control methods. Using PCSS, we also collected a large amount of vocabulary data for NLP applications such as voice recognition and text to speech. The accuracy of the noise reduction task improved from 65.9% to 89.6%, the accuracy of the phonetics data addition task improved from 56.3% to 94.0%, the accuracy of the correct grammar data collection task improved from 71.0% to 90.4%, and the accuracy of the accent data addition task improved from 54.1% to 98.7%. In addition, we conjecture that the difficulty of a task depends on the workers’ profiles. Development of a task allocation method using workers’ profiles is a subject for future work.
Furthermore, PCSS creates a grade-based training task automatically by reusing other tasks. Workers’ skill for the target task is improved by processing a training task before processing the target task. To create a grade-based training task automatically, PCSS classifies tasks into task categories by comparing the similarity of task contents, and obtains the influence of each task on the target task by creating a directed acyclic graph using a Bayesian network. As a result, PCSS automatically creates a training task for each target task whose accuracy needs to be improved. (Further explanation of the grade-based training task in PCSS is omitted from this paper because of the page limitation.) Additionally, to check the effectiveness of grade-based training, we created 2 directed acyclic graphs for 2 task categories, and then 8 training tasks from the directed acyclic graphs. These training tasks improved the accuracy of the target tasks by 10.77 points on average. On the other hand, there are some task categories for which PCSS created no training tasks. We speculate that PCSS could not create training tasks for these task categories because of a shortage of data on workers’ behavior. Development of a method for generating effective training tasks that requires little data on workers’ behavior and that uses workers’ profiles is a subject for future work.
