Abstract
This study proposes a machine learning-based evaluation function that intentionally weakens the competency level of an AI program based on the game records of amateur players. By regulating the evaluation function, it is possible to intentionally weaken the AI relative to existing methods in an equivalent search space. In addition, a subjective evaluation of naturalness was conducted by using a panel of experts to rate the proposed AI technique against conventional AI techniques at equivalent levels of weakness. The analysis revealed the elements of “human-likeness” present in the shogi game records and identified key similarities and differences between amateur and professional players.
Introduction
The inaugural human vs. computer shogi championships were held on January 14, 2012, and at the championship in 2013, five professional players competed against computers. This demonstrates that computer shogi programs have become increasingly capable of matching their human counterparts. With recent advances in artificial intelligence (AI), computer opponents have become too powerful for most amateur players. A new challenge in this field is to develop more human-like programs and programs that can serve as genuine opponents in a game of shogi, rather than opponents that simply win every game.
Several studies have applied machine learning via the Bonanza method (Hoki, 2006; Hoki & Kaneko, 2014) to enhance entertainment value. A computer program that incorporates the Kifuu playing style, developed by Namai et al. (Namai & Ito, 2010), uses the Bonanza method to emulate notable defensive plays and moves involving the king, as executed by professional shogi players in the middle and end of games; it can also emulate a range of opening moves. It is possible to construct an evaluation function that emulates specific professional players by attaching greater values to selected moves made by those players based on game records.
As mentioned above, several studies have focused on developing computer programs that entertain players, but the method utilized to weaken computer programs is still considered unsatisfactory. There have been several attempts to weaken computer programs, such as limiting the search space and adding random numbers to the evaluation function. However, relying on such methods exclusively merely engenders computer-likeness, rather than human-likeness. Therefore, this study focuses on two research questions: How can we weaken computer programs? And what is human-like about weak players?
This study uses the Bonanza method to create an evaluation function based on the game records of amateur players at a designated level. The goal is to develop an evaluation function capable of emulating the characteristics of amateur players.
We analyzed the elements of human-likeness (as opposed to the skill level) in the gaming experience along with the sources of these elements. Using a panel of experts, we performed a subjective evaluation of the naturalness of weak AI programs based on the proposed and conventional techniques at equivalent levels of weakness. Based on this evaluation, we identified the human-like and computer-like elements in shogi game records. By analyzing the human-like elements, we were able to design a human-like AI program for playing shogi.
Related research
Skill versus enjoyment
The enjoyment of the game is influenced by the relative skill levels of the computer and human player. Li et al. described the development of simple, weaker AI systems for strategy games such as Reversi, where all combinations of the evaluation features are used to tailor the AI competency to the skill level of the player (Li & Grimbergen, 2012). Yamashita et al. analyzed the extent to which the skill level of the opponent affects the psychological status of subjects in a competitive setting (Yamashita & Okubo, 2011). A follow-up survey of subjects revealed that they were able to concentrate better when the opponent was of a similar skill level, which suggests a higher level of enjoyment in the game. Although the study was ostensibly designed to examine response times, it is important to remember that a weaker opponent is more likely to bore the player, whereas an overly powerful opponent is more likely to cause anxiety and hinder concentration. Sweetser and Wyeth focused on the effects of challenge on game experiences. They classified enjoyment in games into eight main groups and proposed the GameFlow model as an evaluation tool for measuring enjoyment (Sweetser & Wyeth, 2005).
Although there are many examples that support the utility of weak AI programs, this study is interested in developing weak AI programs for various complex games. The aim of this study is to create a weak shogi AI program that is designed to automatically function at a skill level similar to that of the player.
Regulation of evaluation function through machine learning
Newer computer shogi programs use machine learning to improve the accuracy of the positional evaluation. There is considerable research underway on how learning techniques may be applied to evaluation functions to improve AI performance. For example, Kaneko used multiple varied sets of shogi game records as the basis for AI systems (Kaneko, 2012). Each collection consisted of 10,000 randomly selected amateur games, professional games, and higher-order games from the Floodgate AI server. The strongest AI was found to be the one based on the professional games, whereas the amateur games generated the weakest AI. Thus, training data should be selected from amateur game records in accordance with the skill level of the actual player.
Human-like game-playing AI program
As described above, the importance of weak AI programs for enhancing the gaming experience cannot be overemphasized. The question that arises is as follows: Is it possible that careless moves and moves that deliberately go easy on the player leave the player feeling unsatisfied? To answer this question, several studies have been conducted on human-like, game-playing AI programs. Examples of these attempts include Turing-test competitions involving platformer games, first-person shooter games, and Go (Livingstone, 2006; JAIST, 2011).
Fujii et al. made several important observations of human-like, game-playing AI programs during their evaluation of the human-likeness of AI programs. They also proposed a method for autonomously acquiring human-like behavior with biological constraints (Fujii et al., 2013). They asserted that users tend to judge the computer-likeness of players by optimal behavior, whereas they judge human-likeness by mistakes and fluctuations. Next, they instructed novice players on the ways of expert players, and asked them to watch a gameplay and identify whether a player was a human or a computer (Fujii et al., 2015). The results indicated that once the novices were instructed on how experts play, they gave higher scores to the human-likeness of expert players.
This leads to the conclusion that human-likeness is not an absolute value but is instead a relative value that changes based on the skill differences between users and AI programs. The problem we must consider next is how to change the human-likeness of the program by using this skill difference. We performed a subjective evaluation of the naturalness of weak AI programs. The participants in this experiment were professional and amateur shogi players. Based on this evaluation, we analyzed the effects of the skill differences on human-likeness.
Experimental conditions
Bonanza
The proposed technique is predicated on the Bonanza 6.0 AI shogi program (Hoki, 2011), which was victorious at the 16th World Computer Shogi Championships in 2006. Currently available as open-source software, the program employs the Bonanza method, which is a technique that automatically regulates an evaluation function with vast quantities of feature elements in accordance with game records. All of the leading AI programs competing at the top level of subsequent computer shogi championships have used similar forms of regulation involving automated learning.
Bonanza method
In the Bonanza method, the parameters of the static evaluation function are optimized via the method of steepest descent to ensure consistency between the training data game records and the game tree search results (Hoki, 2006; Hoki & Kaneko, 2014). The evaluation function in Bonanza employs feature vectors with over 10,000 elements, which are adjusted via the Bonanza method. The Bonanza method attempts to minimize an objective function J, which measures the disagreement between the moves in the game records and the moves preferred by the game tree search.
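The following toy sketch illustrates the kind of objective described above. It assumes the commonly described form in which a sigmoid-like function penalizes every alternative move whose search value approaches or exceeds that of the recorded move. Regularization and constraint terms are omitted, and the function names, the scale constant, and the input format are illustrative assumptions rather than details taken from Bonanza.

```python
import math


def smooth_step(x, scale=0.01):
    """Sigmoid used as a smooth penalty; the scale value is an arbitrary assumption."""
    return 1.0 / (1.0 + math.exp(-scale * x))


def objective(positions, scale=0.01):
    """Toy disagreement measure between game records and search results.

    Each position is a pair (recorded_score, other_scores). The scores stand in
    for the search values of the recorded move and of the alternative legal
    moves in that position (hypothetical input format).
    """
    total = 0.0
    for recorded_score, other_scores in positions:
        for score in other_scores:
            # The penalty grows as an alternative move outscores the recorded move.
            total += smooth_step(score - recorded_score, scale)
    return total
```

Under these assumptions, a parameter set that scores the recorded moves above their alternatives yields a smaller value of the objective, which is the direction in which steepest descent would move the parameters.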
Source of game records
Shogi Club 24 is a website run by the Japan Shogi Association where gamers can play one another online. This study used game records from Shogi Club 24 as training data for machine learning (Kume, 2002).
The study also used the Floodgate server as a general yardstick for higher-order AI performance. Launched in 2008, Floodgate is an automated server created by shogi game developers to provide an unrestricted 24-hour platform for shogi gamers (The University of Tokyo, 2008; Moriwaki & Kaneko, 2007).
Ratings
Skill level ratings are used to match players with different abilities. The most common skill level rating, as used by Shogi Club 24 and Floodgate, is calculated by assigning participants an initial rating of R. The R value is then adjusted up or down in accordance with game outcomes. Skilled players will see their R value steadily increase. Shogi Club 24 uses the following rating algorithm:
The expected win rate
Floodgate ratings are anchored by fixing the rating of the gps_normal program at 2,150, its anticipated Shogi Club 24 rating, so that Floodgate ratings can be used as a yardstick for Shogi Club 24 ratings.
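As a rough illustration of how a rating difference translates into an expected win rate, the following sketch uses the standard Elo logistic curve. This is an assumption: the exact formulas used by Shogi Club 24 and Floodgate may differ, and the function name is hypothetical.

```python
def expected_win_rate(rating_a, rating_b):
    """Expected score of player A against player B under the Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))
```

Under this curve, two players with equal ratings each have an expected win rate of 0.5, and the expected win rate of the higher-rated player rises toward 1 as the rating gap widens.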
Proposed technique
Methodology
The proposed technique involves the automatic regulation of the evaluation function for the purpose of creating a weaker AI. By tailoring the evaluation function to the gameplay of less skilled players, it is possible to limit the horizon effect (which arises when the search space is restricted to weaken the AI) and to encourage moves that are likely to be employed by amateur players.
In the proposed method, amateur game records are selected in accordance with skill level, and machine learning is based on the game records of players at a certain designated skill level. The aim is to assign higher evaluations to moves preferred by players of the target skill level. In this way it should be possible to create multiple AI regimes with different skill levels.
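The selection step can be sketched as a simple filter over game records. The record structure, the field names, and the rule that both players must fall inside the rating band are assumptions made for illustration; the study does not specify these details.

```python
from dataclasses import dataclass


@dataclass
class GameRecord:
    """Hypothetical container for one game record."""
    black_rating: int
    white_rating: int
    moves: list


def select_by_rating(records, low, high):
    """Keep games in which both players' ratings fall within [low, high]."""
    return [
        r for r in records
        if low <= r.black_rating <= high and low <= r.white_rating <= high
    ]
```

For example, `select_by_rating(records, 1200, 1499)` would produce the training set for one skill band, and the same call with a different band would produce the training set for another AI regime.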
Adjusting the skill level
Results. The skill level was adjusted to R1300 and R800. Approximately 30,000 game records from Shogi Club 24 were used as training data. These data were split into two groups: R1200–R1499 (equivalent to 3rd class) and R700–R999 (equivalent to 8th class).
After the learning phase was completed, the ratings for the two AI regimes were calculated. Bonanza with the standard evaluation function was registered in Floodgate from 10/29/2012 to 11/14/2012 as “stdd6.” The rating was based on approximately 100 games to provide an indicator of AI strength. The search depth was set to five plies with a hash size of 100 MB. Set moves were taken from Bonanza 6.0 with no restriction on length. The rating was R1984. This was then played against the AI regimes created with evaluation functions from R1300 and R800 game records under the same conditions. The resulting ratings were R1688 and R1617, respectively.
The resulting ratings are shown in Table 1. The search space was further restricted to make the strength equivalent to the game records used for machine learning. The search depth was reduced to three plies and then two plies, and these were compared to the AI with the standard evaluation function. In both cases, the AI trained on amateur game records was weaker by approximately one ply of search depth; that is, it required one additional ply to match the AI with the standard evaluation function.
Discussion. As shown above, the selected training set can have different effects on weakening a computer shogi program. We will compare how strength is weakened between our method and a related study. Omori et al. modified the steps and training set of the Bonanza method in order to realize playing styles in shogi (Omori & Kaneko, 2016). They focused on playing style without weakening the computer program. Therefore, they compared the winning rate against the original program for each ratio of offensive and defensive records: as a result, the winning rate changed from
Game records used for machine learning vs. AI strength
Subjects were asked to access game records on web pages and to provide their evaluations. No time restrictions were imposed and subjects did not know whether the computer moved first or second. In addition, the subjects were asked to rate their assessment of whether or not the computer had moved first, using a five-point scale. Given that the conception of “human-likeness” is likely to vary from person to person, subjects were also required to complete preliminary and post-evaluation questionnaire surveys on this topic. The subjects were asked to provide reasoning for their answers in the form of open-ended responses, such as “The 70th move was typical of those often employed by less skilled players.” These responses were used to assess perceptions of human-like behavior.
Experimental conditions
Target AI. Three types of Bonanza 6.0 AI were used to generate the computer game records that constitute the target of the investigation, and each featured a different evaluation function. In all cases, the search depth of the AI was adjusted to the equivalent of R1300 through internal game-playing. The AI with the existing technique had a depth of three plies, and the AI using the proposed technique had a depth of four plies, based on internal game-playing against an AI with evaluation functions to a depth of six plies, as calculated by Floodgate. We also prepared an AI with a basic search depth of six plies and a random number added to the evaluation function, based on the existing technique (Obata et al., 2010). The random number is drawn from a normal distribution with a standard deviation of 1,000. This allows a win ratio of approximately 50% to be achieved against the AI with three plies.
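The randomized baseline can be sketched as follows. The sketch assumes zero-mean Gaussian noise and, for brevity, applies the noise to per-move scores rather than to leaf evaluations inside a six-ply search; the function names and the input format are hypothetical.

```python
import random


def noisy_evaluation(base_score, sigma=1000.0, rng=random):
    """Add zero-mean Gaussian noise (assumed mean) to an evaluation score."""
    return base_score + rng.gauss(0.0, sigma)


def choose_move(scored_moves, sigma=1000.0, rng=random):
    """Pick the move with the highest noisy score.

    scored_moves is a list of (move, base_score) pairs (hypothetical format).
    """
    return max(
        scored_moves,
        key=lambda pair: noisy_evaluation(pair[1], sigma, rng),
    )[0]
```

With a large standard deviation, moves whose base scores differ by less than the noise level are chosen almost interchangeably, which is one way to see why this kind of weakening can produce moves without a discernible intent.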
Target game records. The game records that constitute the target of the investigation involved games between players at approximately the R1300 skill level. While the computer game records were of games where the same AI played itself, the human game records comprised 20 selected games supplied by Shogi Club 24 involving R1200–R1499 players. Table 2 lists the game numbers and game types. There were five games using the proposed AI technique, five using an AI with randomized numbers, five using AI with only the search space restricted, and five between human players. Game records involving set moves were also selected to enable an evaluation of differences between moves by the evaluation function.
For the computer game records, one AI regime had its set moves reduced to ten while another AI had its set moves reduced to 20; in addition, set moves that are not currently used were deleted from the game records of internal play. For the human game records, games that deviated from the standard set moves were omitted. Similarly, nyugyoku-moyo or impasse games deemed to have little benefit in terms of training data for the evaluation function were also omitted.
Subjects. Subjective evaluations were performed by six students from a university Shogi club and five professional players from the Japan Shogi Association. Given that an understanding of shogi moves was required, the recruitment guidelines stipulated that games to be assessed would involve players with shogi club ratings of approximately 1,300. All six university students had Shogi Club 24 ratings of over 2,000, which was considered more than adequate for the task of evaluating game records.
Game numbers and types
A subjective evaluation of game records was conducted as follows.
1. 20 games were rearranged in random order.
2. Subjects were given an information sheet describing the experiment.
3. Subjects were asked to fill in a preliminary questionnaire.
4. Subjects rated each game in turn on a five-point scale and provided reasoning in the form of an open-ended response.
5. Subjects were asked to repeat Step 4 until all games were evaluated.
6. Subjects were asked to complete a post-evaluation questionnaire.
The preliminary questionnaire, game evaluation, and post-evaluation questionnaire are described in more detail below.
Preliminary questionnaire. The preliminary questionnaire consisted of the following questions.
1. Indicate your skill level, such as your Shogi Club 24 rating.
2. What do you consider to be indicators of human-like play in a shogi game?
3. What will you be looking for in particular when evaluating shogi games for this study?
4. Have you ever played a computer at shogi and felt that it was strange or unnatural? If so, please describe how.
Evaluation process. Subjects were first asked to evaluate each game without a time limit. They were then asked whether they thought the player who moved first was a computer or a human. The answer was an evaluation of human-like game play based on the following five-point scale:
- Definitely a human
- Probably a human
- Cannot say either way
- Probably a computer
- Definitely a computer
For the non-computer games, given the distinct possibility that the person performing the evaluation may have been involved in the game when it was originally played, either as a player or as an observer, the question “Do you have any memory of this game?” was inserted. Games with a “yes” response were omitted.
Post-evaluation questionnaire. The post-evaluation questionnaire consisted of the following questions.
1. Has your perception of human-like behavior changed during this experiment?
2. What sort of situations did you consider to be human-like?
3. What sort of situations did you consider to be computer-like?
4. Please provide any other thoughts or ideas with regard to the experiment.
Analysis and discussion
Comparison by type of game. Figure 1 shows the average evaluation scores for each set of game records based on the ratings provided by all subjects. It can be seen that the non-computer games were considered human-like, the games involving the proposed and conventional AI systems were difficult for the participants to judge, and the games involving randomized AI were seen to be computer-like.
Professional vs. amateur evaluations. Figure 2 shows the average ratings for each set of games from the professional and amateur players, illustrating the difference in evaluations by professional and amateur players. See Tables 4 and 5 in Appendix A.1 for details of the responses.

Average human-likeness rating for each set of games.
Both professional and amateur players judged the games between human players to be the most human-like and the games involving randomized AI to be the most computer-like. However, they had opposing opinions on which of the proposed AI and conventional AI were more human-like. In addition, the average ratings from the amateur players were higher than the average ratings from the professional players across all game categories.

Human-likeness ratings by professional and amateur players.
Discussion. Before we discuss the results, we should examine the subjects’ experience with computer shogi programs as a lurking variable. More specifically, we will determine whether the subjects were familiar with computer programs or not. We asked the subjects to complete a preliminary questionnaire, one item of which concerns experience with computer shogi programs: “What sort of situations did you consider to be computer-like?” Of course, in order to examine what experiences produce representations of computer-likeness, a stricter and more thorough survey would be needed. In any event, the responses showed that all of the subjects had experience using computer shogi programs and held representations of computer-likeness.
By the same token, we should examine whether the subjects identified the computer programs used in this experiment as the Bonanza shogi program. If the subjects could identify the program as Bonanza, they might have easily judged their opponent to be a computer. However, this concern would be unlikely for the following reasons. First, no subject mentioned the name of a computer program in the evaluation. Second, it is unlikely that the subjects were familiar with exactly the same computer programs used in our experiment. This is because of the differences in the skill level of the subjects and computer programs: the subjects were at least the rank of amateur 5-dan, whereas the computer programs were merely the rank of 1-dan. Moreover, we modified the source code of Bonanza 6.0 in order to realize the proposed AI and randomized AI. For these reasons, it is probable that the subjects’ experience with computer shogi programs, especially Bonanza, did not affect the results.
Having examined these matters regarding the subjects, we now turn to our main concern. The above results indicate that randomization lowers the skill level at shogi and as such cannot be considered a suitable natural means of introducing weakness. The professional players correctly judged both conventional and proposed AI to be AI, and their average scores were no greater than four in any category (including human players). This is analyzed further from the open-ended responses.
The average scores for both conventional and proposed AI were over three from the amateur players but less than three from the professional players. Moreover, the amateur players gave consistently higher scores than the professional players for the human games. This suggests that amateur players were more likely to think that their opponents were human, and that the professional players were more skeptical.
Main response categories. The reasons given by subjects for distinguishing between human and computer players, as stated in the open-ended responses to the preliminary and post-evaluation questionnaires, were divided into seven categories:
1. Looks like a mistake
2. Not a normal/expected move
3. Betrays emotion
4. Not a normal/expected passage of play
5. Inconsistency in skill level
6. Looks like a move that a human player/computer would make
7. Final moves seem like a human player/computer
Sample responses are listed in Appendix A.2.
Analysis of responses. Figure 3 shows the results of an analysis of the 20 questions for amateur and professional subjects, respectively. The most common response category among both amateur and professional subjects was “looks like a move that a computer would make.” Indeed, this accounted for approximately 50% of all responses from the professional subjects. The corresponding response, “looks like a move that a human player would make” was second among the amateur subjects but sixth among professional subjects.
“Looks like a mistake” was used by approximately 10% of amateurs but only 1% of professionals.

Breakdown of responses from amateur and professional subjects.

Human player game No. 4, up to △5 three horse move.
Discrepancies in ratings. Game No. 4 involving human players scored a rating of 4.7 from amateur subjects but a much lower rating of 2.6 from professional subjects. A key point of difference was the rating of the move ▲3 five rook against △5 three bishop, as shown in Fig. 4. Professional players judged that a genuine human player would be unlikely to make such a move, and therefore assumed it more likely to be the computer. However, amateur players tended to regard the move as a simple mistake or clicking error, and therefore assumed it to be a human player.
Similarly, Game No. 5 involving conventional AI was rated 4.0 by amateurs and only 2.2 by professionals. The move ▲2 four pawn against △6 five knight, depicted in Fig. 5, was mentioned only by the professional subjects, with comments such as “a fatal mistake,” “no human player of any skill level would ever do that,” and “you wouldn’t normally attack in order to avoid losing a piece.” None of the amateur subjects picked up on this move, and overall they saw it as consistent with a move from a human player.
The board on the left displays pieces according to the first letter of the piece name, with the exception of the knight, denoted “N.” The white pieces are represented by underlined letters.

Conventional AI game No. 5, up to △6 five knight (N) move.
Discussion. There were several key differences in the language employed by amateurs and professionals in their responses. Amateurs were more likely to talk about human-like moves and errors, while professionals were more likely to refer to computer-like moves. Certain specific moves were deemed to be human error by amateurs, but not by professionals, suggesting that the two groups use different sets of evaluation criteria. Professionals were more likely to identify bad moves in a game, while amateurs tended to miss the bad moves and view the game as normal or natural. Thus, it would appear that amateurs were more likely to judge bad moves to be part of the natural gameplay.
Let us compare the human-like elements between the above results and the related studies. Khalifa et al. developed human-like general game playing computer programs (Khalifa et al., 2016). They revealed human-likeness derived from physical constraints in player manipulation, such as jitteriness and sudden reaction time. However, it is difficult to analyze the behavior and intentions of players because the responses of the subjects to the manipulation problem were weighted heavily based on the game type. In other words, being familiar with operations and acting as you think is also a goal of the game. We investigated human-likeness without the manipulation problem because our study used shogi, which is a sequential game. Consequently, it was found from the results that bad moves affected computer-likeness, and the skill level of subjects determined whether bad moves were detectable. These results correspond to the experiment conducted by Fujii et al. (2013). That is, human-likeness is not an absolute value, but a relative value that changes based on the skill differences between users and AI programs.
In a study focused on realizing human-likeness, it is necessary to analyze response reasoning because the type of game and test subject affect the generality of the algorithm. For example, Luong et al. succeeded at creating human-like behavior in an opponent within a platform action game (Luong et al., 2017). However, their results significantly differed from ours. Specifically, the subjects in Luong et al. (2017) misjudged their opponents less than 5% of the time. It seems that the behavior of human players differed conspicuously from the behavior of computers within the platform action game. Furthermore, the reasons why their method was capable of enhancing human-likeness are unknown.
Their evaluation of the skill level of computer programs showed that the skillfulness of the programs utilizing the proposed method (3.22) was higher than that of the existing method (3.10). The strengthening of the computer programs may account for the enhancement of human-likeness in their research. This difference in skill level between computer programs and subjects corresponds to the results in our experiment. That is, amateur players tend to rate behavior as more human-like than professional players do. The effect of differences in skill level on determining human-likeness should be examined in the evaluation of a method that enhances the human-likeness of a computer program.
Questionnaires and interview process. As mentioned above, there was considerable variation among subjects in the evaluation ratings and the criteria used to identify bad moves. There were many common aspects of criteria for human-like behavior in the responses to the preliminary and post-evaluation questionnaires, as outlined in Table 3.
When quizzed specifically on the elements of computer-like and human-like behavior, one subject responded: “when you come across a move that does not seem right, even a very poor move, if you can nevertheless see the intent or the idea behind the move, then it is more likely to have been made by a human player.”
Criteria for human-like behavior in subject responses. Note: amateur subjects are denoted by capital letters, and professional subjects are denoted by lower-case letters
Discussion. The references to consistency in the questionnaire and the references to intent and reasoning in the interview indicate that the ability to envision an opponent was seen as an important criterion in discerning human-like behavior, among both amateurs and professionals alike. Conversely, where it was difficult to envision the opponent, this was seen as an indication of computer-like behavior. Professional players tended to expect a higher level of skill from the opponent than was actually the case, and therefore identified more moves as poor, resulting in a lower rating for the game.
In connection with a consistent human-like opponent model, let us discuss the generality of human-likeness. The question we have to ask here is whether human-like elements, such as mistakes, are immutable. Currently, programs based on deep learning and reinforcement learning appear to influence human players (Silver et al., 2017a, 2017b). For example, the opening moves played by AlphaGo are provided as a learning tool, and professional players have adopted these moves (DeepMind Technologies, 2017). It is reasonable to suppose that the definition of human-likeness can change over time.
Even so, it is unlikely that our findings on human-like elements would change completely. Strong computer programs affect the human-likeness of strong players, but not the human-likeness of weak players. Although it is possible that strong moves made by computer programs come to be regarded as common-sense moves, those moves are seen in matches between strong players. This improvement in human players is unlikely to be a serious problem because our research focused on matches between weak players. In support of the existence of human-likeness specific to weak players, our experiment shows that a non-ideal move can be a human-like move. Amateur players judging bad moves to be mistakes is a good example of this. Selecting bad moves that look like mistakes can be a useful way to realize the human-likeness of weak players.
In this study, we refined an evaluation function through machine learning based on the records of shogi games between amateur players of around R1300 and R800 level, in order to adjust the shogi skill level to the equivalent of omitting one level of depth in the search space.
When subjects were asked to evaluate human-like behavior in games at the R1300 level, the professional players correctly identified games involving both conventional and proposed AI systems as being non-human.
Responses to the preliminary and post-evaluation questionnaires were used to classify the criteria used by professional and amateur subjects to distinguish between human and AI games. Professional subjects were more likely to identify poor moves as evidence of AI, while amateurs were more likely to regard poor moves as the result of human error. For this reason, if we limit our target to amateur players, we can reproduce human errors using training data to reflect moves judged to be human mistakes. However, there will be issues with preparing a large amount of labeled data and with the timing for introducing such moves.
In the preliminary and post-evaluation questionnaires and interviews, many subjects cited the existence of discernible strategies behind poor moves and the overall consistency during a game as indicators of human-like behavior. Moves that were deemed to be incompatible with the skill level of the opponent or inconsistent with the general passage of play were considered likely to be indicative of AI.
In this experiment, subjective evaluations were performed by amateur and professional players rated at R2000 and above. If we assume that differences in skill levels between subjects translate into different expectations, and therefore different game ratings, then a game at an equivalent skill level to the subject performing the evaluation is more likely to be judged a human player game. Further investigation is required to examine the correlation with skill level, in order to determine whether subjects of equivalent or lower skill level are even more likely to identify games as played by human players.
Another key finding in this experiment is that consistency is an indicator of human-like behavior. It is important to achieve skill level consistency by analyzing the conditions for identifying differing skill levels, and also by investigating techniques to adjust all three phases of the game: beginning, middle, and end.
