Abstract
Technology holds the promise of greatly altering the conduct of interest assessment. I review five technological advances that currently exist and present how they can be incorporated into our interest measures and procedures: (a) dynamic assessment using item response theory, (b) adapting interpretations to individual users, (c) incorporating response latency, (d) gamification of interest measures, and (e) incorporating big data and machine learning. Using these advances in our assessments and procedures can structurally change what we do and enhance the precision of our measures.
I have been asked to address the future of vocational interests over the next 10 years, in a concise, pithy manner. I will leave the evaluation of how well I do any of this to the reader, but I do wish to point out some important changes and opportunities that I see in the near future regarding interest assessment. As a caveat, my comments are intended to serve as stimuli and not as thorough explications. I cite other sources that provide a more thorough presentation.
Parsons (1909) proposed long ago that the assessment of one’s self is a cornerstone of vocational development, and a good deal of this assessment has focused on interests. Interest assessment has served the field of vocational psychology well over the past century, as it has demonstrated strong predictive validity with respect to performance and satisfaction (e.g., Nye, Su, Rounds, & Drasgow, 2012, 2017; Tracey & Robbins, 2006) as well as major life outcomes such as getting and staying married and having children (Stoll et al., 2017). Indeed, there are no other vocational constructs that have demonstrated such a strong relation to key outcomes. However, interest assessment is at a juncture. Changes in technology have altered what people do and how they interact. I see similar changes arising in interest assessment regarding how we gather and disseminate interest information. Technological changes are already having a profound effect on interest assessment. Original instruments relied on hand scoring by skilled professionals, and this limited test users’ access to the information. The development of computers made the scoring and administration of instruments relatively easy, resulting in increased access and usage. However, the promise of technology extends far beyond current assessment. I posit five specific technological advances that can, and will, change how interest assessment is conducted: (a) dynamic assessment using item response theory (IRT), (b) adaptation of interpretations to individual users, (c) incorporation of response latency, (d) gamification of interest measures, and (e) incorporation of big data (BD) and machine learning (ML).
Future Direction #1: Dynamic Assessment Using IRT
Computer adaptive testing has been around for decades, and early branching programs such as DISCOVER have provided important advances. Users were provided with different information depending upon their responses. In theory, no two individuals would go through the process in the same manner. However, the assessment instruments themselves and their interpretation remained unchanged. People were administered a common set of items and then these were scored and a common interpretative format presented, that is, a common table or summary of results with only the specific scores varying.
Recent advances in IRT (Embretson & Reise, 2000) have yielded changes in how instruments are constructed and applied, specifically in the creation of unique assessments for each test taker. Typically, IRT has served as an important means of developing and shortening current instruments (e.g., Roberson-Nay, Strong, Nay, Beidel, & Turner, 2007; Sharp, Goodyer, & Croudace, 2006). An example of how this has been applied to interest assessment is the development of the Personal Globe Inventory-Short (PGI-S; Tracey, 2010a). IRT involves the examination of responses to each item as a function of the underlying trait being examined and characteristics of the item itself (e.g., difficulty and discrimination). There are many unique aspects involved in IRT that are beyond the scope of this article, but the most salient in this context is the concept of reliability. Reliability refers to the precision of any assessment, where the relative magnitude of error is estimated; the less error, the greater the precision and reliability. Traditional classical test theory (CTT) estimates reliability at a global level that applies equally to everyone in a sample, for example via internal consistency estimates or test–retest correlations. One estimate is derived that applies to everyone, so the same error is associated with those who score low on the trait as with those who score high. IRT is different in that it provides reliability estimates across all levels of the underlying trait being measured. For example, IRT provides a reliability estimate of an item for those scoring very low on the underlying trait, a different estimate for those in the middle, and a different one again for those scoring at the extreme high end. Given this differential reliability basis of IRT, it is possible to select items that demonstrate maximal reliability (i.e., precision) for the specific areas of the underlying trait that one wishes to assess. The CTT requirement that any item act the same over all possible underlying trait scores no longer applies.
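To make the idea of trait-level-specific precision concrete, the following is a minimal sketch that assumes a two-parameter logistic (2PL) model, which is one common IRT model and not necessarily the one used for any instrument mentioned here. Under that assumption, the information an item provides at trait level theta is a^2 * P(theta) * (1 - P(theta)). The three items and their parameters are hypothetical.

```python
import math

def p_endorse(theta, a, b):
    """2PL probability of a keyed response at trait level theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)

# Hypothetical items: (discrimination a, location b). Not real item parameters.
items = {"easy": (1.5, -1.5), "medium": (1.5, 0.0), "hard": (1.5, 1.5)}

def most_informative(theta):
    """Name of the item that is most precise at this trait level."""
    return max(items, key=lambda name: item_information(theta, *items[name]))

for theta in (-2.0, 0.0, 2.0):
    print(theta, most_informative(theta))
```

In this sketch each item is most precise near its own location, which is why a different item is preferred for low, middle, and high scorers.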
It is this item reliability information that enables dynamic assessments such as the Graduate Record Exam (GRE). Instead of having a common test for every test taker, the GRE customizes the test so that each test taker gets a unique set of items. All test takers get a short common set of items that were designed to cover the full range of scores. From this common pool of items, an estimated score on the underlying trait is generated. Items that have the most reliability at that specific estimated score are then selected and presented to the test taker. If too many of these new items are answered correctly, then harder items are provided. If too few items are answered correctly, then easier items are provided. This process of providing items that are most precise at the level of the estimated score is repeated until a very precise final score is determined. Thus, someone with lower ability on the underlying trait would be getting very different items (easier) than someone with higher trait ability. The old need to have everyone take the same test is obviated in this dynamic testing format resulting in a more precise and efficient test result with less need of having people respond to items that provide no information (e.g., too easy ones for those with high ability or too hard ones for those with low ability).
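The adaptive loop described above can be sketched as follows. This is an illustration only and not the procedure of any operational test: it assumes 2PL items, a simulated test taker, an expected a posteriori (EAP) trait estimate under a standard normal prior, and a fixed number of adaptive items as the stopping rule. The item pool, the routing set, and all parameter values are hypothetical.

```python
import math
import random

def prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info(theta, a, b):
    """2PL item information at theta."""
    p = prob(theta, a, b)
    return a * a * p * (1.0 - p)

GRID = [g / 10.0 for g in range(-40, 41)]  # theta grid from -4 to 4

def eap_estimate(responses):
    """Posterior mean of theta under a standard normal prior (EAP)."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * t * t)  # unnormalised N(0, 1) prior
        for (a, b), correct in responses:
            p = prob(t, a, b)
            w *= p if correct else (1.0 - p)
        weights.append(w)
    return sum(t * w for t, w in zip(GRID, weights)) / sum(weights)

def run_cat(pool, common, true_theta, n_adaptive, rng):
    """Give the common items, then n_adaptive maximum-information items."""
    responses, used = [], []

    def administer(idx):
        a, b = pool[idx]
        correct = rng.random() < prob(true_theta, a, b)  # simulated answer
        responses.append(((a, b), correct))
        used.append(idx)

    for idx in common:
        administer(idx)
    for _ in range(n_adaptive):
        theta_hat = eap_estimate(responses)
        remaining = [i for i in range(len(pool)) if i not in used]
        administer(max(remaining, key=lambda i: info(theta_hat, *pool[i])))
    return eap_estimate(responses), used

# Hypothetical pool: 25 items with locations spread evenly from -3 to 3.
pool = [(1.2, -3.0 + 0.25 * i) for i in range(25)]
common = [0, 6, 12, 18, 24]  # short routing set spanning the range
theta_hat, used = run_cat(pool, common, true_theta=1.5, n_adaptive=10,
                          rng=random.Random(0))
print(round(theta_hat, 2), used)
```

A production system would more likely stop when the standard error of the estimate falls below a target rather than after a fixed item count; the fixed count is used here only to keep the sketch short.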
While such applications have been demonstrated with cognitive tests which are dichotomously scored (i.e., correct vs. incorrect), I know of none applied to interest assessments. Part of the reason for this lack of application is that most interest assessments use polytomous response scales (e.g., Likert-type 1–5 Scales) making the IRT analysis more complex. There are many differences with respect to IRT examination of polytomous items, but the reliability of each item at all levels of the underlying trait is still provided. In dichotomous response formats, this reliability function is an inverted U curve with high reliability only in a narrow span. With polytomous items, this reliability function can have any form ranging from a uniform flat function showing that the item is equally precise at all levels of the underlying trait to very skewed functions that show one end of the distribution having much higher reliability. This reliability function can be used to select optimal items at different trait levels.
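As a rough illustration of how a polytomous item's precision can vary in shape, the sketch below computes item information under Samejima's graded response model, which is one of several polytomous IRT models and is assumed here only for the example. The two 5-point items and their threshold values are hypothetical.

```python
import math

def grm_information(theta, a, thresholds):
    """Item information under the graded response model.

    thresholds: increasing boundary locations b_1 < ... < b_{m-1}
    for an item with m ordered response categories.
    """
    # Cumulative probabilities P(X >= k), padded with 1 and 0 at the ends.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    total = 0.0
    for k in range(len(cum) - 1):
        p_k = cum[k] - cum[k + 1]  # probability of category k
        # Derivative of the category probability with respect to theta.
        d_k = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p_k > 0:
            total += d_k * d_k / p_k
    return total

# Hypothetical 5-point items (four thresholds each).
spread = (1.5, [-2.0, -0.7, 0.7, 2.0])    # thresholds cover the whole range
clustered = (1.5, [1.0, 1.3, 1.6, 1.9])   # thresholds bunched at the high end

for theta in (-1.5, 0.0, 1.5):
    print(theta,
          round(grm_information(theta, *spread), 3),
          round(grm_information(theta, *clustered), 3))
```

With a single threshold the function reduces to the dichotomous 2PL case, which is a convenient sanity check on the formula.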
With respect to interests, IRT could be applied in a similar manner to the GRE to create dynamic measures. Similar to the GRE, a very short set of items (perhaps as short as 2 items for each of the RIASEC Scales) that demonstrate good reliability across the range of scores could be administered to all test takers. These items would generate initial interest estimates for each RIASEC Scale. Then for each RIASEC Scale, a few items (2 or 3) that have maximum reliability at the estimated score can be pulled from the item pool and presented to the test taker. For example, an individual with a mean level initial score on E and a +2 SD score on R would get E items that have maximum precision around mean levels of E and R items that show maximum precision at upper levels of R. Each test taker would get a very different set of items, and the items would be differentially sampled for each of the RIASEC Scales.
There would not be a need for many items because the items chosen would have maximal properties at the level of interest. The resulting precision of the scores would be as good as or better than the scores derived from traditional longer scales where a full battery is administered to everyone. Each test taker could thus get a unique set of items to which to respond, and the resulting scores would be more precise. The tools of such applications currently exist but have not, as yet, been applied to interest assessment.
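The per-scale selection step can be sketched for the example given above (a mean-level initial estimate on E and a +2 SD estimate on R). For brevity, the sketch uses a dichotomous 2PL-style information function as a stand-in; real interest items are polytomous and would need a polytomous model. The item pool, item identifiers, and parameters are all hypothetical.

```python
import math

def info(theta, a, b):
    """Stand-in information function (2PL form) for one item."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def select_items(pool, estimates, k=3):
    """For each scale, pick the k items most informative at its estimate."""
    chosen = {}
    for scale, theta_hat in estimates.items():
        ranked = sorted(pool[scale],
                        key=lambda it: info(theta_hat, it["a"], it["b"]),
                        reverse=True)
        chosen[scale] = [it["id"] for it in ranked[:k]]
    return chosen

# Hypothetical pool: seven items per scale, located from -3 to +3 SD.
pool = {s: [{"id": f"{s}{i}", "a": 1.4, "b": float(i - 3)} for i in range(7)]
        for s in "RIASEC"}

# Initial estimates in SD units from the short common item set.
estimates = {"E": 0.0, "R": 2.0}
chosen = select_items(pool, estimates)
print(chosen)
```

In this sketch the E items selected sit around the mean, while the R items selected sit near the upper end, so the two scales draw on different regions of their item pools.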
Future Direction #2: Adapting Interpretations to Individual User
Another more obvious user-specific application is the tailoring of the back end of interest assessment, specifically, doing away with the common reporting format. Most current tests provide a common template of scale scores such as a profile of RIASEC scores with subscales and occupations. Everyone sees the same scores. While this approach is thorough and relatively simple, it can be overwhelming to the test taker and it does not match how users approach the information (Tracey, 2010b). Do users look at all of the scores? Do they read through all the scores to make conclusions? While I have no formal data on how users read interest scales, personal experience has shown me that individuals very quickly focus only on the very high scores and ignore all the others. When asked why, the frequent response is that “there is too much and I only need the high stuff.” If this is a common experience, it makes sense to tailor the presentation of the interpretation to the specific individual. The score pattern of the individual could be used as a guide in presenting test results. For example, if an individual’s scores are very high on E but low on I, then presenting a wealth of occupations high on I interests may not be the most beneficial. A narrower presentation on E occupations would better fit the individual’s focus. Such is the approach taken in the web assessment of the PGI (Tracey, 2002, 2019; pgi.asu.edu). An assumption is that average scores are not especially important in helping individuals understand themselves. More extreme scores yield the most information on individual uniqueness. Thus, scores that are more outside of the norm are given greater interpretation and focus in the output. One of the broad interest dimensions assessed is prestige. Individuals who score in the average range are not provided with graphs of their many prestige scores; however, if someone scores high or low on prestige, then the appropriate scores are graphed for the user, as the individual is indicating that prestige is salient. Of course, it should be possible for the test user to see all scores, but these should not be the primary presentation mode. Individual assessment reporting should be adapted to the individual based on his or her responses. Such uses of Internet assessments are not common but are easily enabled.
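One simple way to implement this kind of tailored reporting is to highlight only scores that fall far enough from the norm and to keep the rest available on request. The sketch below assumes standardized (z) scores and a cutoff of 1 SD; both the cutoff and the function name are assumptions made for illustration and do not describe the rule used on the PGI website.

```python
def tailored_report(z_scores, threshold=1.0):
    """Split scale scores into highlighted (salient) and background sets."""
    highlighted = {s: z for s, z in z_scores.items() if abs(z) >= threshold}
    background = [s for s, z in z_scores.items() if abs(z) < threshold]
    # Present the most extreme scores first.
    ordered = sorted(highlighted, key=lambda s: abs(highlighted[s]),
                     reverse=True)
    return {"highlighted": ordered, "available_on_request": sorted(background)}

# Hypothetical standardized scores for one test taker.
scores = {"R": -0.2, "I": -1.4, "A": 0.3, "S": 0.5,
          "E": 1.9, "C": 0.1, "Prestige": 0.4}
report = tailored_report(scores)
print(report)
```

Here only E and I would be featured, and prestige, being in the average range, would not be graphed unless the user asked for the full profile.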
Future Direction #3: Incorporating Response Latency
The integration of response latency (i.e., the amount of time that one takes to make a response) into our measures has the promise of streamlining our assessments with no loss of information. Response latency has a long history in the implicit assessment literature (Greenwald, McGhee, & Schwartz, 1998), but it has not been applied to interest assessment. The covariation of implicit measures to explicit, self-report measures is moderate at best (Nosek, 2007), and each type of assessment is uniquely predictive of outcome criteria. I recently examined the viability of response latency in interest assessment. With Internet assessments, response time is easily collected, and I hypothesized that it would provide unique, nonredundant information on the strength of interests (Tracey & Tao, 2018). Response time was related to interest endorsement in a curvilinear manner, and this provided added information with respect to the reliability and validity of the measures. However, the effect of adding in response time was demonstrated most with shorter measures where it significantly enhanced both the internal consistency and predictive validity. With longer measures, including response time did not appreciably add to either reliability or predictive validity. Response time can thus be used with very brief measures as a means of improving psychometric properties to such an extent that it enables existing measures to be cut in half with no loss of information.
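A curvilinear relation of this kind can be checked by fitting a quadratic of response time on the item rating and inspecting the squared term. The sketch below does this with ordinary least squares in plain Python. The five data points are synthetic and chosen only to show the mechanics; they assume slower responses at mid-scale ratings, which is just one possible curvilinear shape and is not taken from the cited study.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = c0 + c1*x + c2*x^2 via the normal equations."""
    # Power sums S[p] = sum(x^p) and moments T[p] = sum(y * x^p).
    S = [sum(x ** p for x in xs) for p in range(5)]
    T = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]
    # Augmented 3x3 system: sum_j S[i+j] * c_j = T[i].
    M = [[S[i + j] for j in range(3)] + [T[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [v - f * w for v, w in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

# Synthetic illustration only: rating (1-5) and response time in seconds.
ratings = [1, 2, 3, 4, 5]
latencies = [1.0, 2.5, 3.0, 2.5, 1.0]

coeffs = fit_quadratic(ratings, latencies)
print([round(c, 3) for c in coeffs])
```

A clearly nonzero coefficient on the squared term (the last value) is what would indicate curvilinearity; with real data one would also test it against sampling error.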
Future Direction #4: Gamification of Interest Measures
Another Internet-based application that has not been applied much in interest assessment is the gamification of assessments, which is the presentation of information/items in a computer gaming format. There is a current trend toward using computer games as a means of teaching key concepts. The literature has shown that users become more involved in the task if it is presented in a game format and that learning can be greater relative to standard presentations (Bavelier et al., 2011; Young et al., 2012). There are two major ways that gamification could be used in interest assessment. The first is to make the assessment process itself a game, where instead of just answering the amount of liking or preference of an activity, one could present individuals with objects in space and instruct them to shoot the objects in order of preference. This shooting (along with response time of shooting) could be used as a means of scoring interests. Current scales could be adapted to a game presentation model, and this may have particular application to children.
The other means of using games would be on the back end, where assessment information is presented in a manner where the individual has to navigate through it and gains points. For example, I have demonstrated (Tracey, 1997, 2002, 2010a; Tracey & Rounds, 1996) that interests and occupational environments can be viewed in a three-dimensional space defined by people–things, data–ideas, and prestige. Individuals and occupations can be represented as points in this three-dimensional space. Scores could be represented in a simple extension of the Space Invaders game, where an individual is placed in this space at the point where he or she scores and occupations are presented in the space around him or her, with those closer being more similar. The individual could jet around the space, thus seeing the similarity/dissimilarity dimension (further away = more dissimilar), and shoot occupations to get information about that occupation and even video clips. Such a presentation is hypothesized to enable better integration of the interest assessment and its application.
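The geometry behind such a display is straightforward: if a person and each occupation are points on the three dimensions, similarity can be taken as Euclidean distance. The sketch below ranks occupations by that distance. The coordinates and occupation labels are placeholders and are not actual PGI values; treating the three axes as equally scaled is also an assumption.

```python
import math

def nearest_occupations(person, occupations, k=3):
    """Rank occupations by Euclidean distance from the person's point."""
    ranked = sorted(occupations.items(),
                    key=lambda kv: math.dist(person, kv[1]))
    return [(name, round(math.dist(person, xyz), 2))
            for name, xyz in ranked[:k]]

# Axes: (people-things, data-ideas, prestige). Placeholder coordinates.
person = (1.0, 0.5, 0.0)
occupations = {
    "Occupation A": (1.0, 0.5, 1.0),
    "Occupation B": (-1.0, 0.0, 0.0),
    "Occupation C": (1.2, 0.4, 0.1),
    "Occupation D": (0.0, -2.0, 1.0),
}
closest = nearest_occupations(person, occupations)
print(closest)
```

In a game front end, the same distances would determine how far from the player each occupation is drawn.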
Future Direction #5: Incorporation of BD and ML Into Our Assessments
The above technological changes still involve using assessment in the traditional manner, where there is a set of items that are administered to individuals and the responses to these items are then scored. Advances in BD and ML have implications that render such approaches obsolete. People use the Internet in a variety of ways and access a wealth of information. The information gathered and sources used provide data relating to that individual. Such digital records have been shown to predict personality traits and attributes (Kosinski, Stillwell, & Graepel, 2013). These data include such things as Facebook likes, contents of personal websites, properties of web profiles, spending habits, friendship networks, browsing histories, and music selected. Clearly, such data represent one’s choices among options and are thus very similar to preferences and liking assessed in interest assessments. Indeed, many of the aspects endorsed or selected for viewing are similar or identical to items on interest inventories. Thus, the content demonstrated in web behavior is representative of one’s interests and perhaps an even better indication of interest patterns than responding to items as this involves actual behavior.
However, such data are different from data we typically examine. First, there is a huge amount of data, far beyond conventional data sets for instrument development and examination. More importantly, the data are very incomplete. There is no piece of data that exists for all individuals; hence, the data set is filled with many empty cells. In traditional assessment development, we gather responses from a large number of individuals on a common set of items. In BD, there is a massive amount of data, but there are no items or digital pieces of information that are shared by everyone. Not everyone has seen the same sites, so it is not possible to say if a lack of a “like” response is a function of not liking a site or just not seeing it. Such data sets make it difficult to apply our traditional analytical models. There are new pools of analyses that are being developed for the examination of BD (Chen & Wojcik, 2016; Zafarani, Abbasi, & Liu, 2014). These BD analytic tools involve looking for patterns in the data that do exist and using these patterns to make predictions about individual characteristics. Given BD approaches, it is possible to generate RIASEC scores for each individual.
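As a deliberately simple stand-in for the pattern-based tools cited above, the sketch below stores only the likes that were actually observed and predicts a scale score for a new user as a similarity-weighted average of scores from users who also completed an inventory. It is not the method of any cited study. The like labels and scores are invented, and the approach only partly addresses the missing-data problem, because an item liked by one user and absent for another is still counted as a mismatch even if the second user never saw it.

```python
def jaccard(a, b):
    """Overlap of two like-sets; items neither user liked are ignored."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def predict_score(target_likes, labelled_users):
    """Similarity-weighted mean of known scores (a simple stand-in model)."""
    weighted = [(jaccard(target_likes, likes), score)
                for likes, score in labelled_users]
    total = sum(w for w, _ in weighted)
    if total == 0:
        return None  # no overlapping evidence: abstain rather than guess
    return sum(w * s for w, s in weighted) / total

# Invented data: (set of likes, standardized score on one interest scale).
labelled_users = [
    ({"robotics", "woodwork"}, 1.2),
    ({"poetry", "theatre"}, -0.8),
    ({"woodwork", "cars"}, 0.9),
]
prediction = predict_score({"woodwork", "cars", "poetry"}, labelled_users)
print(prediction)
print(predict_score({"gardening"}, labelled_users))
```

The abstention branch reflects the point made above: when a user shares no observed content with anyone in the reference set, there is no basis for a score.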
One advantage to the BD approach using web behavior is that there is no formal assessment using self-ratings. An issue with many traditional measures, especially with children, is that the measures are very long and somewhat tedious to complete, which can lead to less diligent responding. Being able to collect existing data on individuals would obviate requiring individuals to take existing assessments. Such digital information could be gathered and interest profiles generated.
An important addition to BD is ML, which involves generating models and algorithms from the data itself, then applying the models to examining model-data fit, and then modifying models and retesting in an iterative manner. As such, there is no a priori specification of specific variables or patterns (i.e., RIASEC scores) that are used but just a general search for patterns in the data. The resulting patterns are tested against other data sets in an examination of validity. Such approaches are starting to be applied in personality assessment (Gladstone, Matz, & Lemaire, 2019; Kern et al., 2014; Park et al., 2015; Quercia, Kosinski, Stillwell, & Crowcroft, 2011; Schwartz et al., 2013). Indeed, some recent applications of ML models using Facebook likes (Youyou, Kosinski, & Stillwell, 2015) have demonstrated that personality predictions generated from the data may be superior to human judgments and self-ratings (i.e., traditional format of personality assessments).
It is relatively easy to fit ML models to data generated, but it is essential to understand how well the models work. Bleidorn and Hopwood (2019) present a very nice summary of the research and issues involved. Issues of the generalizability of the models with different groups must be established, and this includes both structural aspects and predictive aspects. Much of this validation is done using k-fold cross-validation, where different partitions (folds) of the data (different individuals, different groups, different items, or aspects) are held out in turn and used to examine the validity. Like many issues, distinctions must be made between theory-expected results (confirmation) and theory-unexpected results (disconfirmation). How well do these ML models account for unexpected results, and how well do they improve on traditional prediction, if at all? ML approaches have the potential of creating entirely new models that demonstrate better validity. As is true in existing measures, validity is never established but it is something that is ever being examined (i.e., one never arrives at validity). New methods and models require an even larger focus on validity.
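The partitioning logic of k-fold cross-validation can be sketched in a few lines. The example below splits cases at random into k folds, fits a model on the other folds, and averages the held-out error. The simple regression line stands in for whatever ML model is being evaluated, and the data are synthetic; both are assumptions made for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def cross_validated_mse(xs, ys, k=5):
    """Average held-out squared error across the k folds."""
    fold_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        intercept, slope = fit_line([xs[i] for i in train],
                                    [ys[i] for i in train])
        errs = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in fold]
        fold_errors.append(sum(errs) / len(errs))
    return sum(fold_errors) / k

# Synthetic, noise-free data so the held-out error should be near zero.
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
mse = cross_validated_mse(xs, ys, k=5)
print(mse)
```

To examine generalizability across groups rather than across individuals, the folds would be formed by group membership instead of at random, so that each group is held out in turn.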
Issues in Digital Data Use: Security
As is commonly noted in the popular media, privacy of digital data is crucial. The dangers of being able to predict personal characteristics are great. Models can be built on de-identified data, and such applications are important in model generation and validation. Nevertheless, when it comes to applying any model to individuals, care must be taken to ensure that any application is done with the understanding and consent of the user. Appropriate legal and ethical notices are needed, and it must be clear that any use of an individual’s data is done with consent that is well understood. While there are legal waivers attached to almost all sites, these waivers generally are quite complex and serve a legal function more than an information function. It is also important to present a clear consent form that individuals can understand and use as a basis of deciding. This consent should include clear explication of what will be used, within what time period, and how long it will be stored, and it should state that the data will not be shared. As with all assessment information, it is crucial that the test administrator ensure that the materials (the test and the responses) are secure. Security sounds basic, but there is ample evidence that breaches of Internet information are not rare events. Moving into web applications such as BD and ML carries new manifestations of crucial assessment issues of information privacy and security.
Issues in Digital Data Use: Support
While the potential of using digital assessments and information carries great promise (instant, briefer, more valid assessments), such applications are not easily done. The technology and analyses require skills that few if any vocational psychologists possess. The process of web design and BD extraction requires strong coding skills and hiring individuals who possess such skills. I have noticed that even putting a somewhat traditional assessment instrument on the Internet (i.e., PGI) requires reliance on a number of individuals with specialized technology skills. The financial costs of such instrument delivery and site maintenance are high, perhaps taking instrument development and assessment out of the hands of individual vocational psychologists and placing them in the hands of large testing corporations that have the resources. Further, with the advent of better Internet assessment, the roles of vocational psychologists may narrow. Such testing, while holding merit, poses potential threats to the job functions of vocational and career specialists. Like changing assessments, the field too will have to change with respect to the professional roles that are filled. While the Internet and advanced assessment hold great promise for information provision, the individual guidance and counseling function will remain and perhaps even be strengthened.
Potential Applications of BD: International Assessment
Besides ML models having the potential of extensive use, such models and the BD on which they are based have promise of improving the quality of our measures on all groups. Certainly, a major issue in interest assessment is its valid use with different groups. This goes beyond just the appropriateness of norms for different individuals (e.g., Can a Somali immigrant’s responses be interpreted using U.S. norms?) but applies to the overall validity of any assessment across different groups. It is difficult to establish cross-cultural validity for any measure. The measure must be used to collect large samples from many different groups. Access to and ability to get such data is very difficult and costly. However, with the ability to access de-identified digital data, models can be tested on a far greater number of groups and these can be narrowly targeted (e.g., Somali immigrants to the United States).
This focus on specific groups makes ML important with respect to assessment of interests in international contexts. Given global changes in occupational availability and entry, it is even more important to provide valid assessments to assist in career choice and development. Yet this need exists in juxtaposition with the lack of valid measures. There currently are several interest measures, but most of these have been developed on U.S. samples and their validity in other countries is often lacking (e.g., Rounds & Tracey, 1996). The standard approach is to alter or translate the U.S. measure to fit the specific local context and then gather data and examine validity. Altering and translating is a difficult process because items sometimes do not translate easily or at all. For example, an item from the PGI, “dance instructor,” does not exist in Iran, so there is no translation possible. There are also subtle differences between cultures that are masked in the items. The PGI also has an item “vacuuming,” and when it was given in Ireland, there was no variance because the word is not used or known; the Irish use the term “hoovering.” So alteration and translation is a very long and expensive process and is fraught with problems. Gathering data from Internet usage obviates many of these alteration and translation problems in that the Internet behavior is conducted in the user’s own language and uses content that is understood by the user. BD and ML provide a relatively easy means of developing and/or validating interest assessment and models. Given the high usage of cell phones, the data could be gathered, and models of interests for each unique country or area could be developed. Therefore, adopting BD and ML could lead to usage of validated instruments and/or entirely locally derived and validated models and instruments.
Potential Applications of BD: Issues of Gender
One interesting application of ML models could be the importance of gender in interest assessment. One of the key gender differences is the people–things difference between men and women. Indeed, of all psychological constructs, this is the one where there are the greatest mean differences between the sexes (Lippa, 1998; Lubinski, 2000; Su, Rounds, & Armstrong, 2009). The presence of this difference creates concern with the application of interest tests. Most women score high on S, while men more commonly score high on R and I. The implication of these differences is that providing interest scores may perpetuate existing gender differences in occupations. This has been a hotly debated issue in the literature (Holland, 1976; Prediger & Hansen, 1976), and it continues today (Tracey & Caulum, 2015) regarding how best to deal with this, specifically by norming within gender and/or selecting only items that minimize these differences, perhaps at the cost of validity. Test developers have had to make decisions about how this should be done. Recent thinking in the area of gender has dictated a move away from viewing gender as binary categories. It is unclear how viewing gender as nonbinary would relate to the model of interests and how interests should be reported. With traditional models of test development, interest measures would have to be administered to large samples of individuals representing all aspects of nonbinary gender identification to develop norms. This is a difficult endeavor not only with respect to sample sizes but also with varying gender definitions. Short of this research being conducted, we are left with no information on how to incorporate more current definitions of gender into our assessments. A possible short-term solution is to allow the test taker to select the norm group (male or female) against which the person wishes to be compared (e.g., Colton & Fitzgerald, 2017).
However, using existing digital data and gender self-characterizations, it would be much easier to expand our model and measures. Given the larger data sets and large amounts of data included, various gender self-definitions could be examined with respect to interest data in a way that current procedures cannot. Further, with this enhanced data set, application of ML may enable very different and novel conceptions of how gender manifests itself in interest data. The promise of BD and ML is that they enable generation of valid models and assessments for almost any group of individuals because the reach is great.
Potential Applications of BD: Environmental Assessment
ML also has potential application in the assessment of environments. Almost all psychological models see the environment as a key component in determining behavior (e.g., Funder, 2006); however, the assessment and categorization of environments is relatively primitive (Sherman, Nave, & Funder, 2010). The history of situational taxonomies has demonstrated that the process is very difficult given the amount of information required, and generalizations tend to be weak. Like the assessment of personal characteristics, BD and ML can provide needed models. With respect to career assessment, the main difficulty has been the categorizing of occupational or educational/major environments. The predominant person–environment matching model of Holland (1997) and others requires that we have not only quality assessment of the individual but also of the environment. Such provision of RIASEC codes for environments has generally been done in one of two ways: either by job analysis (e.g., Gottfredson & Holland, 1996) or by using the mean incumbent profiles as representative of the environment (e.g., Harmon, Hansen, Borgen, & Hammer, 1994; Tracey, Allen, & Robbins, 2012). Both of these approaches are difficult and time-consuming. Another problem of both approaches is that each is very delayed with respect to newer occupations. The occupational landscape is constantly changing and there are many new jobs arising, others dying off, and others changing characteristics (Levy & Murnane, 2004; Lyons, Schweitzer, & Ng, 2015). Current models of occupations are based on occupations from at least a decade ago. There is little incorporation of newer jobs as these take time to be identified and then categorized. However, with ML, it can be relatively easy to have current information (e.g., RIASEC codes, occupational requirements) on any occupation or job. Indeed, in larger companies, it could even be possible to categorize the different occupations throughout the organization easily and readily.
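For reference, the mean-incumbent-profile approach mentioned above reduces to a short computation once incumbents' scores are available: average each RIASEC scale across incumbents and take the highest-ranked letters as the environment code. The sketch below shows only that traditional step, with made-up scores for a hypothetical occupation; an ML pipeline would replace the averaging with a learned model, which is not shown here.

```python
def environment_code(incumbent_profiles, n_letters=3):
    """Average incumbents' RIASEC scores and return the top-letter code."""
    scales = "RIASEC"
    n = len(incumbent_profiles)
    means = {s: sum(p[s] for p in incumbent_profiles) / n for s in scales}
    ranked = sorted(scales, key=lambda s: means[s], reverse=True)
    return "".join(ranked[:n_letters]), means

# Made-up scale scores for three incumbents of one hypothetical occupation.
incumbents = [
    {"R": 20, "I": 35, "A": 15, "S": 12, "E": 18, "C": 30},
    {"R": 25, "I": 38, "A": 10, "S": 14, "E": 16, "C": 28},
    {"R": 22, "I": 32, "A": 14, "S": 10, "E": 20, "C": 26},
]
code, means = environment_code(incumbents)
print(code, means)
```

The bottleneck in practice is not this arithmetic but obtaining incumbent data for each occupation, which is where readily available digital data could shorten the delay for newer jobs.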
Implications
As noted, the incorporation of any of these technological advances into interest assessment requires advanced skills in both method (e.g., IRT, BD, and ML) and technology (coding and web design). There are few current members of the profession who have these skills or even have a good grasp of the content. These advances will fall to the next generation of interest researchers and instrument designers. As with many fields, technology has changed and will change how things are done. While at times I view myself as a Luddite, that does not alter the fact that technology is changing how we do our jobs and how we approach things. With respect to training, it is important to incorporate technology into what we teach our students but also to select those who have these skills and can teach us. We do need interest researchers who have these technological skills, and we also need to get more skilled in working with others who have these skills so they can be applied to interest measurement.
These new technologies, especially BD and ML, hold the potential to expand our models or even create new ones that are more valid. While Holland’s model has served the field well, new technologies may yield newer models. Holland’s RIASEC model is elegant in its simplicity, appropriateness for both the individual and environment, and has good research support. With the capabilities of BD and especially ML, the simplicity may no longer be an asset. Newer, more complex, more valid models may enable far more accurate assessment and interpretation of interests.
With the advent of web-based assessment, career professionals are doing less interest assessment themselves and referring more to these external sources. With the incorporation of these technological changes, career development professionals will become even further removed from the assessment process. This presents a challenge to the field with respect to how services are provided. More and more services and information that have traditionally been in our purview are being provided on the Internet. The focus of our services will have to shift, be it to web-based interventions or to more personal counseling.
Summary
Each of these current technological advances (dynamic assessment using IRT, adapting interpretations to individual users, incorporating response latency, gamification of interest measures, and incorporation of BD and ML) has been applied to some areas of psychological assessment but not to interest assessment. While these applications are not easily done, the promise of incorporating these into our assessments is high. It is exciting to ponder the possibilities with respect to technological applications to interest assessment. I envision our assessments being very different and in many ways very unfamiliar, moving away from set items, scoring rubrics and interpretation formats and into novel models and predictions. There are many possibilities. However, I am reminded of watching TV shows when I was young that made bold future predictions. I am still waiting for the abandonment of cars and the use of personal planes, not to mention teleporting.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
