Abstract
LinkedIn is a social medium oriented to professional career handling and networking. In it, users write a textual profile on their experience, and add skill labels in a free format. Users are able to apply for different jobs, but specific feedback on the appropriateness of their application according to their skills is not provided to them. In this work we particularly focus on applicants of the project management branch from information technologies—although the presented methodology could be extended to any area following the same mechanism. Using the information users provide in their profile, it is possible to establish the corresponding level in a predefined Project Manager career path (PM level). 1500+ experiences and skills from 300 profiles were manually tagged to train and test a model to automatically estimate the PM level. In this proposal we were able to perform such prediction with a precision of 98%. Additionally, the proposed model is able to provide feedback to users by offering a guideline of necessary skills to be learned to fulfill the current PM level, or those needed in order to upgrade to the following PM level. This is achieved through the clustering of skill qualification labels. Results of experiments with several clustering algorithms are provided as part of this work.
Introduction
The use of online recruitment services such as LinkedIn has greatly increased in recent years; however, recruiters still have to define their own ways of selecting the best person-job fit, which involves matching the right job seekers to the right positions [1]. Most theories of person-job fit, however, focus on explaining how resumes measure the degree of matching between talent qualifications and job requirements based on manual inspection by human resource experts, even though these experts may lack knowledge of certain specific areas.
The aim of this research is to explore the idea of using the contents of a resume to establish a correspondence to a certain level in a career path from an IT area, as well as to provide a guideline for job seekers to complete a generalized list of skills automatically collected and grouped from actual skills required for each level.
Accurate and realistic job information should enable applicants to assess the degree of congruence between their knowledge, skills and abilities, and their job requirements [2, 3]. When a good degree of congruence is achieved, the recruitment and selection process is generally associated with positive work outcomes [4]. Although assessment has been used in employment settings for over a century to consider these crucial points, the main problem arises when recruiters must find the best fit in a domain different from their own, as they may lack the domain knowledge to know exactly whether applicants’ skills are a good fit [5].
On the other hand, in a few particular cases of a recruitment process, the recruiter provides feedback to job seekers so that they can improve their abilities. However, this step takes place only once the resume is accepted. To better understand this, consider the example of a recent graduate who, because of inexperience, forgot to specify a relevant skill in his/her resume. While evaluating the resume, recruiters would simply reject it without giving the candidate any feedback on the cause of rejection.
Another case is when a job seeker wants to know which abilities must be acquired in order to move up in the job career path; this information is not easy to obtain, as there is generally no explicit reference that includes it.
The work of [6] establishes a relationship between experience and job performance. Based on this, we propose a method to obtain the classification of the level in a job career path based on the textual context expressed in the experience description of a job seeker. In addition to that, each job level is related to a set of particular skills. We propose a method that is able to automatically bind the most relevant abilities that correspond to a particular job level, given a set of profiles that are known to be suited to the same job level of qualification.
This document is divided as follows: In Section 2 related works to this research are discussed, from general organizational psychology foundations for resume analysis, to specific implementations that automate the process of the ideal candidate selection and job recommendation. Then, our proposal is detailed in Section 3. Experiments and results are described in Section 4; and finally in Section 5, our conclusions are drawn, and possible future directions of this research are outlined.
Related work
Extensive research supports the proposition that individuals are satisfied with, and adjust most easily, to jobs that are congruent with their own career-relevant personality types [7, 8] and the fact that retaining personnel in any industry has been recognized as important.
Several works develop models to define how to match a person with a specific job [9]. In particular, some researchers propose to focus not only on person-job fit [10] but also on other perspectives, such as how attractive the environment is for job seekers (person-environment fit [11]) and the fit between the organization and the person (person-organization fit [12]), so that both parties benefit when each satisfies the needs of the other. Carless [5] compared person-job fit with person-organization fit as components of talent attraction and job acceptance, based on a three-step process: initial interview, final job choice, and durability of the job seeker.
In general, resume analysis has long been an important factor in deciding whether or not to interview a candidate [9]. Particular features have been found to be crucial for catching recruiters’ attention.
Internet-based recruitment platforms have become a main recruitment channel for most businesses, since they improve the efficiency of recruitment, save costs and relieve information overload. For instance, LinkedIn is one of the favoured recruiting platforms in the market [13]. One of the key aspects of its effectiveness is the sophisticated series of search and recommendation algorithms that it applies [14].
With the high mobility of talent, and the increase in job seekers using this channel, it becomes important for recruitment teams to seek intelligent ways of matching the right job seekers to the right positions. This crucial person-job fit task has been thoroughly studied as candidate matching [1, 15–17], job recommendation [13, 18], job transitions [19–21] and other methods for talent identification [22].
In the following paragraphs, we describe 8 works in particular that are similarly oriented to our purpose. [18] proposes a system for mining related job patterns. The authors argue that by collecting information from different social media platforms, a better understanding of the job market is possible, which in turn could be helpful to construct a job-hopping network. In this way they study the latent behaviour and relationships of people and companies in the job market. Their main purpose is to highlight and rank, within social media, influential companies in the job market, and to specify how individuals move among them and which jobs are involved. While this study can be used as a job recommendation system for both job seekers and employers, it does not provide an exact analysis for a perfect match in the job fit approach; nevertheless, it can be used as an auxiliary tool to analyze job market information and its impact on social network postings.
[20] proposes the analysis of job transition networks by extracting talent circles, such that different organizations with similar exchange patterns are identified. Patterns are mainly identified with the aid of a directed graph that models job transitions among organizations in a specific period. By weighting each edge with the percentage of job transitions in a period, node similarity and the egocentric node can be easily found. Detected talent circle patterns can be used for talent recruitment and job search. Xu et al. state that it is possible to capture hidden recruitment patterns and identify the right talent resources. Although the results of that paper show that their approach outperformed the benchmark methods, the research is focused on widely known and highly relevant companies, so the evaluation is limited to that universe; additionally, it is only designed to match candidates according to job transitions, even though they may belong to the same level of qualification.
[23] proposes the use of word2vec to develop an automated system to discover additional names for an entity using Latent Semantic Analysis—LSA [24]. The authors consider the skills as words in the model, and try to predict the word or words that belong to the same category by applying a model similar to skip-gram model. In this way if a skill like java is introduced, the possible outputs could be HTML, CSS, and JavaScript.
An important work related to automatic resume evaluation is [25]. By analyzing resumes or curricula vitae, this work proposes to enhance the recruiting process by categorizing and scoring them through diverse metrics. They propose a computational technique to analyze the resume using methods like LSA and LDA (Latent Dirichlet Allocation [26]), as well as word2vec models [27]. Additional textual processing is performed, including word complexity, spelling, and the use of predefined categorized lexicons. This CV analysis tool 1 provides personalized recommendations based on a feedback score to assess the quality of the CVs.
The application of neural networks to text processing in the recruitment area has improved considerably, and Recurrent Neural Network (RNN)-based models have been reported as the best option, since they are a natural fit for modelling sequential textual data [1, 17].
To give an example, [17] shows that, apart from using a word2vec approach for feeding the neural network, their model has less dependence on unsupervised word embedding. The authors compared it with other methods such as support vector machines and decision trees. Their study proposes two kinds of features: one is a categorical feature, which is manually designed from resume structure; and another is the context feature automatically learned by using word2vec. By merging both features, it is possible to explore deep feature interactions with a deep neural network model. The authors mention that the main categorical features to obtain a perfect match are: (a) personal information such as gender, age, major and degree, (b) age when first employed, (c) previous position, (d) previous salaries, and (e) working periods. They propose that, by merging these features with the context of the resume, a better prediction of which position fits best to it would be obtained. Their results show that the maximum precision obtained using this model is around 54.5%. In addition to that, the use of the gender and age, as well as salaries, were found as possibly noisy, as the circumstances of the employment vary among individuals.
In contrast, [1], based on a word2vec model representation similar to [23], aims to build a hierarchical ability categorization to predict the best matching between a person and a job posting, in order to reduce the dependence on manual labour and to provide a better interpretation of results. It is divided in three phases: (1) word-level representation through word2vec; (2) hierarchical ability-aware representation, which extracts the ability representation for job postings and resumes simultaneously using a BiLSTM (Bidirectional Long Short-Term Memory) in order to capture the semantic relationship between job postings and resumes; and finally, (3) evaluation of the matching degree between them. Although the results presented in that work are highly accurate, the person-job fit depends exclusively on the requirements description provided by an expert in the area, which means that if the requirements are wrongly specified, the calculation will not be carried out correctly.
Although several works claim that convolutional networks are not the best approach for text processing, [15] proposes a novel approach using a deep Siamese convolutional network to obtain the characteristics needed for resume matching. It consists of a pair of identical CNNs that contain repeating convolution, max pooling and ReLU layers, with a fully connected layer on top. Parameter sharing between the twin networks is a fundamental characteristic of this model; the parameters are learnt to minimize the semantic distance between resumes and relevant job descriptions and to maximize the semantic distance between resumes and irrelevant job descriptions. As in [1] and [17], the model is dependent on the job requirements that are specified by recruiters and defined based on their expertise in the area.
Following a different approach, [13] explains the methodology used by LinkedIn for a recommendation system based on gradient boosted decision trees [28]. The main importance of this model is the use of the comparison of candidates that have similar contexts. In addition to that, LinkedIn tries to match candidates with related titles. In this schema, information on job career level is not considered or provided, nor advice on missing skills, or those that would be needed to upgrade current career level.
Despite all the progress achieved by using computational techniques for natural language processing and analysis of resumes, most of the work depends on a predefined ranking by an expert in the area, which most of the time is defined by the recruiters. This can leave gaps in the way job-matching candidates are compared. On the other hand, most of these works focus on a perfect match between a resume and what is requested in a job posting, but not on summarizing the abilities that people with a similar profile should have, or on identifying which points can be improved and developed. In the next section we propose a model focused on addressing these points.
Proposed model
The main diagram of our proposal is shown in Fig. 1. This proposal focuses on two main tasks: the estimation of career path level based on text from a job experience description (PM Level estimation), and the qualification of skills that belong to a particular job level (Matched and missing category skills), once the profile has been automatically classified as belonging to a certain PM Level.

Proposed Career path levels for a PM.
For this purpose, we divide our proposal in four stages:
(1) Data extraction, analysis and training (Section 3.1)
(2) PM Level estimation (Section 3.2)
(3) Clustering of skills associated to each PM Level (Section 3.3)
(4) Skill Qualification, matched and missing skills (Section 3.4)
The universe of job titles defined in the different careers and vocations is broad and extensive, so that we delimit the scope of this work to the IT area, particularly the Project Management (PM) career path. We establish the career path of PM following [29], observing that a person starts at a certain level in his or her job, and gradually grows into higher levels, depending on education, experience, skills and performance. Figure 2 shows our proposed PM Career Path levels.

Proposed Career path levels for a PM.
For building our dataset, we collected 300 profiles from the LinkedIn website, including the following fields:
Candidate name, Current location, Current job.
Summary: Brief description of the job experience defined and generated by the user. In most of the cases, this is the way the user highlights his/her capabilities and achievements.
Experience: This field might be empty depending on the level of career path of the user; for example, referring to a junior job seeker, there is no experience to be registered. More than one job description could be registered as an experience.
Educational information: In the same way as with experience, more than one item could be registered, referring to the different levels of education, like bachelor degree, master degree, etc.
Skills and Endorsements: Involves all the skills acquired through professional experience and education, classified in the following types: Industry knowledge, Tools and technologies, Interpersonal skills, and Other skills.
From this extracted information, we focused on the fields: current job, experience, and skills. Skills were extracted and linked to the PM level corresponding to the current job. All the categories specified in the corpus (skills and endorsements, tools and technologies, interpersonal skills and other skills) were considered at the same level of importance, and endorsements were not considered. 293 sets of skills were successfully linked to a PM level. The remaining 7 profiles did not describe skills or a current job, and were therefore discarded.
As each profile may include multiple job experiences, they were considered separately, obtaining 1,574 experience descriptions related to job titles (approximately 5 experience descriptions per profile, on average). All job experiences were manually labeled with one of the PM levels shown in Fig. 2. Labels were assigned according to the keywords mentioned in each experience, following Table 1. An additional level, “OOB” (Out of Business), was included for those descriptions that did not match any keyword corresponding to the PM levels. Table 2 shows the number of sets of skills and experiences identified for each level in the profile corpus 2 .
Keywords corresponding to each PM level
Total number of Skill sets and experiences linked to each PM Level from the profile corpus
After collecting the information from profiles, descriptions and skills must be converted to an appropriate representation such as embeddings, using a distributed representation (doc2vec [30]). Nowadays, it is common practice to use the gensim tool 3 with embeddings pretrained on large corpora, such as Wikipedia or Associated Press 4 ; however, given the particular vocabulary used in the IT job market, we determined it was more convenient to train a doc2vec model on the whole corpus of descriptions, treating each experience description as a document. A total of 1,574 experiences were used, with 93,817 tokens and 11,469 types. The parameters used for this training were: distributed bag of words, initial learning rate of 0.025, vector size of 200, maximum distance of 10 between the current and predicted word, no minimum frequency threshold, and 500 epochs.
Once we have a pre-trained model, we are able to represent both experiences and job skills as a vector.
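As a rough sketch, the training configuration described above could be expressed with gensim's Doc2Vec class. The mapping of the reported parameters to keyword arguments (dm=0 for distributed bag of words, min_count=1 as one reading of "no minimum frequency threshold") and the helper name train_doc2vec are assumptions made for illustration, not the authors' actual script. gensim is imported inside the helper so that the configuration can be inspected without it.

```python
# Hypothetical mapping of the reported settings to gensim Doc2Vec arguments.
DOC2VEC_PARAMS = {
    "dm": 0,             # distributed bag of words (PV-DBOW)
    "vector_size": 200,  # embedding dimensionality
    "window": 10,        # max distance between current and predicted word
    "alpha": 0.025,      # initial learning rate
    "min_count": 1,      # assumed reading of "no minimum frequency threshold"
    "epochs": 500,       # training epochs
}


def train_doc2vec(token_lists, params=DOC2VEC_PARAMS):
    """Train a doc2vec model on tokenized experience descriptions."""
    # Imported here so the configuration can be inspected without gensim.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    corpus = [TaggedDocument(words=tokens, tags=[i])
              for i, tokens in enumerate(token_lists)]
    model = Doc2Vec(**params)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Under these assumptions, a new experience description or a list of skills could then be embedded with model.infer_vector(tokens), where tokens is the tokenized text.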
To analyze the behavior of the sentence description embeddings, we used the TensorBoard tool 5 with the t-SNE method to reduce dimensions from 200 to 3 in order to visualize them. Figure 3 shows the mapping of embeddings generated from experience descriptions, where each color represents a different PM Level. Similarly, Fig. 4 shows the embeddings generated from the skills. From these figures we anticipate that skills may be a better estimator for PM Level, as there are fewer overlapping areas in their representation.

Experience embeddings colored by PM Level.

Skills embeddings colored by PM Level.
This stage aims to estimate the level at which a user’s profile is located, which helps users know the category they may be considered to belong to. This information is also used in the next stage, to give them advice on suggested skills to complete the current level, as well as a list of recommended skills to upgrade their current PM level.
There are two ways to estimate the PM Level: using the text of each job experience associated to each profile, or using the set of skills associated to a profile, as if it were a single document.
As an example, consider the profile 6 shown in Fig. 5. From this profile, the PM level could be estimated using the whole list of skills as a single document. Alternatively, the PM level could be estimated separately for each of the 10 experience items listed in this profile (not shown in Fig. 5), and the PM level of the profile could then be taken as the highest PM level found in the list of experiences.

Real profile example from LinkedIn.
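The experience-based alternative can be sketched as follows. This is only an illustration under two assumptions: PM levels are encoded as integers where a larger number means a higher level, and a profile whose experiences are all "OOB" remains "OOB". The function name is hypothetical.

```python
def profile_level_from_experiences(experience_levels):
    """Return the highest PM level among a profile's experience items.

    `experience_levels` holds one predicted label per experience; the
    string "OOB" marks descriptions outside the PM career path.
    """
    numeric = [lvl for lvl in experience_levels if lvl != "OOB"]
    # Assumption: a profile with only OOB experiences stays OOB.
    return max(numeric) if numeric else "OOB"


print(profile_level_from_experiences([1, "OOB", 3, 2]))  # 3
```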
To estimate the PM Level, a classifier is trained with the information obtained in the previous section. Several experiments were performed, as described in Section 4.
To better explain this section, let us suppose a group of users who belong to the same PM Level. Each of them performs similar tasks in their job, and has particular activities depending on the business area or company. To define a general list of skills for each PM Level, it is necessary to find the common skills defined for each level, so that similar characteristics can be grouped and ranked to specify the importance of each skill.
In order to find the relevant skills for each PM Level, the subset of skills for each PM Level, in the vector representation explained in Section 3.1, was clustered using the DBSCAN clustering algorithm [31] with Euclidean distance, ε=2.3 and a minimum number of samples of two 7 . Each profile lists 32 skills on average, ranging from 3 to 50. Figure 6 shows the result of clustering the skill representations for PM Level 4 (Project Manager): 393 clusters over 2107 skills in total for this level 8 .

Skills clustering for PM Level 4 using DBSCAN.
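To make the clustering step concrete, the following is a minimal pure-Python sketch of DBSCAN with Euclidean distance, using the reported ε=2.3 and a minimum of two samples as defaults. It is written for readability, not speed, and it is not the implementation used in this work; in practice a library version such as scikit-learn's DBSCAN(eps=2.3, min_samples=2, metric="euclidean") would be the usual choice. The toy points are hypothetical two-dimensional vectors rather than 200-dimensional skill embeddings.

```python
import math


def dbscan(points, eps=2.3, min_samples=2):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)

    def neighbors(i):
        # A point counts as its own neighbor, as in the usual formulation.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbors(j)
            if len(reach) >= min_samples:  # j is a core point: keep expanding
                queue.extend(reach)
    return labels


toy_points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 30)]
print(dbscan(toy_points))  # [0, 0, 0, 1, 1, -1]
```

With a minimum of two samples, any pair of skills closer than ε forms a cluster, which is consistent with the large number of small clusters reported per level.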
The main purpose is to identify skills that could be equivalent, such as portfolio management and software project management, or magento and e-commerce, and select the most salient label for each cluster.
For each profile $p$, let $skills(p)$ be the set of skills defined for that profile. If $P_{level}$ is the set of profiles for a particular PM level, then the set of all skills listed for that level can be defined as
$$S_{level} = \bigcup_{p \in P_{level}} skills(p)$$
Each skill $s_j \in S_{level}$ associated with a profile $p$ is compared, using cosine similarity, to the previously clustered skills $SC$. Once the skill $y_i \in SC$ most similar to $s_j$ is found, its cluster $cluster(y_i) \in C_{level}$ is added to the set of clusters related to profile $p$. This is done for all profiles $p_i \in P_{level}$, so that at the end every profile $p_i$ has an associated set of clusters, $clusters(p_i)$.
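The assignment of a profile's skills to clusters can be sketched as a nearest-neighbor search under cosine similarity. The names cosine and clusters_for_profile, and the representation of the clustered skills SC as a list of (vector, cluster id) pairs, are assumptions made for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clusters_for_profile(skill_vectors, clustered):
    """Map each profile skill to the cluster of its most similar clustered skill.

    `clustered` is a list of (vector, cluster_id) pairs standing in for SC.
    """
    found = set()
    for s in skill_vectors:
        _, cluster_id = max(clustered, key=lambda item: cosine(s, item[0]))
        found.add(cluster_id)
    return found
```

For instance, with clustered = [((1, 0), "C1"), ((0, 1), "C2")], a profile with the single skill vector (0.9, 0.1) would be associated with the set containing only "C1".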
As a final step, the frequency of any cluster $c \in C_{level}$ is calculated as the number of profiles it appears in:
$$freq(c) = |\{p_i \in P_{level} : c \in clusters(p_i)\}|$$
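A minimal sketch of this frequency count is given below. The input format (a dictionary from profile id to its set of clusters) and the toy assignment of clusters to three profiles are hypothetical.

```python
from collections import Counter


def cluster_frequencies(profile_clusters):
    """Count, for every cluster, the number of profiles it appears in.

    `profile_clusters` maps a profile id to its set of clusters, so each
    cluster is counted at most once per profile.
    """
    freq = Counter()
    for clusters in profile_clusters.values():
        freq.update(set(clusters))
    return freq


# Hypothetical assignment of clusters to three profiles.
toy = {"p1": {"C1", "C2", "C3"}, "p2": {"C1", "C2"}, "p3": {"C1"}}
print(cluster_frequencies(toy))
```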
This information is used to calculate the most representative clusters associated to each PM Level. Now that we can calculate the frequency of each cluster, we are able to rank them and categorize them as follows:
Essential/Necessary Skills: are those skills that are related to the most frequent skills used by the profiles in the corpus. They can be considered as a must in the specific job level.
General Skills: are related to the average profiles that contain some of the skills defined in the cluster selection.
Aggregated Value: are those skills that are less repetitive in the profiles but that can contribute to a better development within the job level.
Figure 7 shows the percentages used for assigning a category to each cluster: when more than 70% of the population of the corpus for a PM Level has this cluster, skills belonging to this cluster are considered essential or necessary skills. When the frequency is between 40% and 69%, skills of this cluster are treated as general skills. Finally, when the frequency of the cluster is between 25% and 39%, it is considered an aggregated value. Clusters with a frequency lower than 25% were omitted, since they may be considered not relevant for the description of a level.

Percentages for categorizing clusters of skills.
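The thresholds above can be written as a small decision function. This is a sketch: the text does not say how a value of exactly 70% is treated, so the inclusive boundaries at 70, 40 and 25 are an assumption, and the function name is hypothetical.

```python
def categorize_cluster(percentage):
    """Return the skill category for a cluster frequency given in percent."""
    # Assumption: boundaries are inclusive at 70, 40 and 25.
    if percentage >= 70:
        return "essential"
    if percentage >= 40:
        return "general"
    if percentage >= 25:
        return "aggregated value"
    return None  # below 25%: omitted


for pct in (85, 55, 30, 10):
    print(pct, categorize_cluster(pct))
```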
Finally, the label for any cluster c ∈ C level is assigned as the most frequent skill within this cluster. This label is used only for display purposes, as in the qualification stage (described in the next section) all skills in each cluster are considered.
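Label selection amounts to a frequency count within the cluster, as in the short sketch below. The function name and the tie-breaking rule (the skill encountered first wins) are assumptions, since the text does not discuss ties.

```python
from collections import Counter


def cluster_label(skill_occurrences):
    """Pick the most frequent skill in a cluster as its display label.

    `skill_occurrences` lists every occurrence of a skill assigned to the
    cluster across profiles, repeats included.
    """
    # Ties resolve to the skill encountered first (Counter keeps insertion order).
    return Counter(skill_occurrences).most_common(1)[0][0]
```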
Let us present an example to illustrate the aforementioned concepts. Table 3 shows a toy example of three profiles, which are supposed to belong to the project management PM Level (Level 4). Different skills are listed for each profile p1, p2, p3, with some intersections between them. Each skill is matched to a cluster, shown in the second column of each profile. From this information, we are able to obtain freq (C1) =3 (50%), freq (C2) =2 (33%) and freq (C3) =1 (17%). Following the percentages shown in Fig. 7, the skills listed in C1 would be considered as essential, those in C2 as aggregated value, and the skills belonging to C3 would not be considered.
Toy example of profiles belonging to PM Level 4
In order to select the label of each cluster, the most frequent skill within each cluster would be selected, as shown in bold in Table 4.
Skill frequency in clusters, according to toy example of Table 3
Qualification of skills for a particular profile is done by comparing the skills of this profile against the clustered skills associated with the PM Level obtained in the previous section. To improve the matching of skills, they were POS-tagged and lemmatized using the NLTK library 9 .
An example of the application of this comparison is shown in Section 4.3.
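The comparison can be sketched as a set operation between the profile's skills and the skills in each cluster of the estimated level. In this illustration, normalize is a crude stand-in (lowercasing and whitespace cleanup) for the POS tagging and lemmatization step, and the rule that a cluster is matched when the profile lists any of its skills is an assumption. All names are hypothetical.

```python
def normalize(skill):
    """Crude stand-in for the POS tagging and lemmatization step."""
    return " ".join(skill.lower().split())


def qualify(profile_skills, level_clusters):
    """Split a level's clusters into matched and missing ones for a profile.

    `level_clusters` maps a cluster label to the set of skills in it; a
    cluster is matched when the profile lists any of its skills.
    """
    owned = {normalize(s) for s in profile_skills}
    matched, missing = [], []
    for label, skills in level_clusters.items():
        members = {normalize(s) for s in skills}
        (matched if owned & members else missing).append(label)
    return matched, missing
```

Running the same comparison against the clusters of the next PM level would give the list of skills needed to upgrade.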
Experiments and results
Several experiments were carried out with different classifiers to estimate the PM Level from two different sources: (1) Experience descriptions; and (2) Set of Skills. 80% of each source was used for training, and 20% for testing. 10-fold cross-validation was used during the training stage. Table 5 shows accuracy results of PM level estimation with different classifiers. Best results for both sources were obtained with the Logistic Regression classifier.
Accuracy results of PM Level estimation using different sources: (1) Experience descriptions, (2) Set of Skills
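The sizes of the partitions follow from simple arithmetic, sketched below. Rounding the test share up reproduces the 1,259/315 split of the 1,574 experience descriptions and the 59 test profiles out of 293; applying the 10 folds to the training portion only is an assumption based on the description above, and the function names are hypothetical.

```python
import math


def split_sizes(n_samples, test_fraction=0.2):
    """Return (train, test) sizes, rounding the test share up."""
    n_test = math.ceil(n_samples * test_fraction)
    return n_samples - n_test, n_test


def fold_sizes(n_samples, k=10):
    """Sizes of k near-equal folds; the first n % k folds get one extra item."""
    base, extra = divmod(n_samples, k)
    return [base + 1 if i < extra else base for i in range(k)]


print(split_sizes(1574))  # (1259, 315)
print(split_sizes(293))   # (234, 59)
print(fold_sizes(1259))
```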
In general, results were more accurate for skills than using the description of the experience. This is a result of the words used; despite the limited vocabulary in the skills to express ideas, it is a specialized vocabulary and most of the profiles repeat the same ability, making it easier to construct a better context of what the job seeker is referring to.
Figure 8 shows the confusion matrix for 315 unseen job descriptions (20%), obtained with a classifier trained on 1,259 (80%) experience descriptions represented as vector embeddings of 200 dimensions. OOB represents those descriptions that could not be labeled with one of the proposed PM Levels. Table 6 shows the results of estimating the PM Level from experience descriptions, for each PM Level.

Confusion matrix of unseen job descriptions.
Precision, recall and F1-score for PM Level estimation based on experience descriptions
Table 7 shows precision, recall and F1-score on the classification of 59 profiles (∼ 20% of the 293 profiles collected). These profiles had never been seen, nor were their words used for the pre-training of the embeddings. None of these 59 profiles belonged to the director PM Level; therefore, that level was not considered in the calculation of average values.
Precision, recall and F1-score for PM Level estimation based on skills
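The per-level metrics and the exclusion of a level with no test instances from the average can be sketched as follows. Using an unweighted (macro) average is an assumption, since the averaging scheme is not stated, and the function name is hypothetical. Labels are assumed to be of a single comparable type.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall and F1 per class, plus a macro average.

    Classes with no true instances in the test set are left out of the
    average, mirroring how an unseen level is excluded.
    """
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[c] = (precision, recall, f1, tp + fn)  # last item: support
    present = [s[:3] for s in scores.values() if s[3] > 0]
    macro = tuple(sum(col) / len(present) for col in zip(*present))
    return scores, macro
```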
As shown in Fig. 1, the estimation of the PM level and the feedback given in the Skills Qualification rely on the skills-based classifier. More information on skills qualification is given in the next section.
It is important to notice that, although the performance of PM Level classification based on experience descriptions may seem lower, the granularity of the information provided by experience descriptions is greater, and it could be considered for other purposes as future work.
For the profile shown in Fig. 5, the predicted PM Level was 1–Consultant IT or equivalent. The skills for which this profile qualifies are shown in Table 8, along with other needed skills to fully qualify for this level. In case this user would like to upgrade PM Level, current qualifications and needed skills are shown in Table 9.
Example of Skills Qualification
Example of Skills needed to upgrade PM Level
Conclusions
This work presented a model able to determine the job level position from the context of an experience description, as well as to provide a guide of the necessary and suggested skills according to the predicted job level. This model has the following advantages: it requires only the experience description and the defined skills from a resume; the prediction of the job level is performed automatically; the evaluation of the skills is done automatically; and it is a single model based on the expertise of people related to a common particular area.
The proposed model is not intended to be useful only in the IT area; it can be used for any area with the respective corpora, giving recruiters and job seekers a clearer idea of what is expected to be learned in real life, according to what a set of abilities and experiences define.
As future work, we plan to increase the size of the dataset for training and testing purposes, as well as to experiment with the application of this model to other areas. To improve the clustering of skills, we plan to experiment with other clustering algorithms, such as OPTICS [32].
Other possible directions for future work are: summarizing experiences; considering other features from LinkedIn profiles, for example dates of experience, so that the stability or the length of stay in each level could be analyzed and perhaps predicted; and finally, applying sentiment analysis to experience descriptions, to analyze how comfortable a person is in his/her current job.
Acknowledgments
We thank the support of the Mexican Government and Instituto Politécnico Nacional (IPN): COFAA, EDI, Projects SIP 20190077, 20195886, 20200811, 20200640; and Consejo Nacional de Ciencia y Tecnología–CONACyT (SNI, RedTTL). Particularly, through FOINS 360, Problemas Nacionales 5241, and Cátedras CONACyT 556.
Footnotes
1 Readerbench.com/demo/cv
2 Available at github.com/likufanele/pmlevel
3 radimrehurek.com/gensim/models/doc2vec.html
4 github.com/jhlau/doc2vec
5 tensorflow.org/tensorboard
6 linkedin.com/in/melenealcameron/
7 These parameters were experimentally found.
8 The total number of clusters per level / total number of skills per level was: 1:232/1880, 2:204/851, 3:350/2291, 4:393/2107, 5:284/1369, 6:26/33, 7:146/196, 8:38/526.
