Abstract
Kanban, which is an agile process methodology as well as a means to implement lean principles, has been growing as a project management framework across a range of domains, including manufacturing, software development and data science. This paper explores, for teams using Kanban, the ability to predict low team performance. The prediction is based on an analytical model that uses specific project metrics that can be collected via the team’s visual Kanban board. Specifically, data from 80 teams was used to build and test machine learning models that predict teams at risk for delivering low quality results. The model developed was significantly better than the baseline situation of thinking that all teams were at risk. While this analysis was done within a data science project context, the results are likely applicable across a range of information system projects.
Introduction
Predicting team performance, and in particular identifying teams at risk of poor performance, has been studied for decades in both academia (Minaei-Bidgoli et al., 2003; Kabakchieva, 2013; Osmanbegović & Suljić, 2012; El-Halees, 2008) and industry (Wolf et al., 2009; Levesque et al., 2001). In both settings, the goal of identifying such teams is to reduce the risk of poor performance by allowing the team leaders (e.g., managers or classroom instructors) to provide appropriate advising in a timely manner (Minaei-Bidgoli et al., 2003). One way to help identify teams at risk is via the development of a predictive model, which would learn, from previous projects, the characteristics of teams that did not do well.
However, the factors impacting a project’s success or failure are dependent on the project’s context (Shenhar et al., 2002; Dvir et al., 1998). Hence, it is not surprising that researchers have called for more research that adopts a project-specific approach to identifying the exact causes of a project’s success or failure (Shenhar et al., 2002; Dvir et al., 1998). One example of a context that could impact the factors driving a project’s success is the project management framework used by the project team.
Agile process methodologies are a commonly used project management framework within information system projects. Specifically, agile frameworks are a group of incremental and iterative methods that have been shown to be effective in helping to manage projects. During the past 25
Kanban, a process framework that focuses on visualizing the flow and minimizing work in progress, is an agile approach that aims to streamline the amount of work done at the moment. Kanban has been growing in use across a range of domains such as software development (Kirovska & Koceski, 2015) and data science (Saltz et al., 2017). However, to the best of our knowledge, for teams using Kanban, there have been no studies on how to predict projects at risk of delivering poor results.
This study aims to help address this gap by collecting easily obtained Kanban metrics and then building a predictive model that identifies teams at risk for delivering low quality results. In short, the research questions addressed by this research are:
The rest of the paper is organized as follows. First, a review of project metrics is presented, followed by a discussion of the key Kanban concepts as well as metrics specifically designed for Kanban. Next, a classification model is developed to help identify teams at risk. Finally, a synthesis of our observations is provided within our discussion and conclusion.
Background
Project management and project metrics
Traditionally, due to the multi-attribute nature of a project’s success, the success of a project is assessed using multiple measures, including internal (meeting goals, schedule, budget) and external (project’s impact on its customers, and on the developing organization itself) measures (Lipovetsky et al., 1997). These measures are often called metrics or dimensions (Lipovetsky et al., 1997).
According to Fenton & Pfleeger (1998), teams use metrics to understand, control and improve what is done and how it gets done. Common needs for metrics are related to supporting communication and decision making. Pulford et al. (1995) give the following motivations for the use of metrics:
Project planning and estimation; Project management and tracking; Understanding quality and business objectives; Improved software development communication, processes, and tools.
However, metrics are only valuable if one can use them. Specifically, good metrics have three key attributes: the data is consistent, inexpensive to collect, and quick to collect (Bladt & Filbin, 2013). With this in mind, a literature review identified several categories of metrics (Kupiainen et al., 2014), including: Iteration Planning, Iteration Tracking, Motivating and Improving, Identifying Process Problems, Pre-release Quality, Post-release Quality. Some of these categories could be useful for predicting project success. Specifically, process problem metrics, which are metrics used to identify or predict problems in order to solve or avoid them, have been extensively studied (Petersen & Wohlin, 2011; Trapa & Rao, 2006; Mahnič & Zabkar, 2012; Petersen & Wohlin, 2010; Shen & Ju, 2007; Mujtaba et al., 2010; Tudor & Walter, 2006). However, none of these studies explored Kanban related metrics. Perhaps the closest relevant study is that of Catal & Diri (2009), who explored software fault prediction. Although that effort focused on risk prediction, it did not explore this prediction challenge within a Kanban context. In fact, to the best of our knowledge, there are no studies that have focused on predicting a project’s success based on Kanban metrics.
A Gantt chart is a type of bar chart that illustrates a project schedule as well as dependency relationships between activities and current schedule status. It is a common way to structure and visualize work. The critical path of a project is a key project management concept when using Gantt charts. It is “the sequence of scheduled activities that determines the duration of the project” (PMI, 2004). In other words, it is the longest sequence of tasks in a project plan that must be completed on time in order for the project to meet its deadline. If there is a delay in any task on the critical path, then the entire project will be delayed. One can view the critical path method as a step-by-step project management technique to identify activities on the critical path. It is an approach to project scheduling that breaks the project into several work tasks, displays them in a flow chart, and then calculates the project duration based on estimated durations for each task.
While Gantt charts (in conjunction with critical path analysis) have often been used to help manage projects, for better or worse, they display the most detail and complexity with respect to the relationship among the project tasks (Landry & McDaniel, 2015). It has also been noted that Gantt charts focus on providing visual effects, but contain no quantitative information to help monitor a software project (Zhang, 2011). Furthermore, with respect to critical path analysis, when using Kanban, the focus is not on the overall critical path, but on limiting work-in-progress and bottlenecks, which can cause tasks to not be completed. As described in the next section, this is done via the use of the Kanban board and defining a maximum number of work-in-progress tasks. When using Kanban, when a task becomes a bottleneck, the entire team then focuses, if appropriate, on eliminating the bottleneck. This suggests why, in contrast to Gantt charts, Kanban is increasing in popularity (Al-Aidaros & Omar, 2017). For example, it has been noted that “Gantt charts and reports that looked solid on paper often have failed to deliver the software on time” (Sutherland, 2004) and that many companies are switching from Gantt chart-based planning to other, more agile methodologies (Moore et al., 2007).
Kanban is Japanese for “visual signal” or “card” (Ohno, 1988; Sugimori et al., 1977). Starting in the 1940s, Toyota line-workers used Kanban (with physical cards) to improve their manufacturing process. The system’s highly visual nature allowed teams to communicate more easily on what work needed to be done. In the first academic study about Kanban (Sugimori et al., 1977), three reasons for its use were proposed: reduction in information processing cost, rapid and precise acquisition of facts, and limiting surplus capacity of preceding shops or stages.
Based on a systematic mapping study (Ahmad et al., 2018), the most common definition of the Kanban methodology is the one given by Anderson (2010, p. 6): an “evolutionary change method that utilizes a Kanban pull system, visualization, and other tools to catalyze the introduction of Lean ideas…the process is evolutionary and incremental”. More specifically, Kanban is based on three key principles:
Visualize the workflow – split the work into pieces, write each item on a “card”, put the cards on the “wall”, and use named columns to illustrate where each item is in the workflow. By creating a visual model of work and workflow, the team can observe the flow of work moving through its Kanban system. Making the work visible (along with blockers, bottlenecks and queues) instantly leads to increased communication and collaboration.
Limit WIP (work in progress) – assign explicit limits to how many items may be in progress at each workflow state. By limiting how much unfinished work is in process, the team can reduce the time it takes an item to travel through the Kanban system. The team can also avoid problems caused by task switching and reduce the need to constantly reprioritize items.
Focus on flow – by using work-in-process (WIP) limits and developing team-driven policies, the team can smooth the flow of work and make sure the team is focused on getting work completed.
Limiting the amount of work-in-progress at each step in the process prevents overproduction and reveals bottlenecks dynamically; it is one of the key differences between a Kanban board and the visual storyboards used within other methodologies. Note that the team collectively determines which columns are most appropriate for its Kanban board. As shown in Fig. 1, columns often include “to do”, “doing”, “validate” and “done”. One can see that the project shown in Fig. 1 did not have any bottlenecks. However, if the “validate” column, for example, reached the work-in-progress limit, other work that was finished in the “doing” phase could not be moved into the “validate” column. Hence, the team would understand that there is a bottleneck and would collectively work to help ensure that at least one of the validate tasks gets completed (and hence, moved to the “done” column, thus enabling another task to be moved to the “validate” column).
Example Kanban board.
Note that many teams now use an e-version of the Kanban board (Nakazawa & Tanaka, 2015, 2016; Ostergaard, 2016), and Fig. 1 shows an e-version created using a web-based tool known as Trello.
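The pull behaviour described above can be illustrated with a small sketch. The class names, column limits and card identifiers below are hypothetical and chosen only for illustration; they are not taken from the boards used in this study.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Column:
    name: str
    wip_limit: Optional[int] = None  # None means unlimited (e.g., "to do", "done")
    cards: List[str] = field(default_factory=list)

    def is_full(self) -> bool:
        return self.wip_limit is not None and len(self.cards) >= self.wip_limit


class Board:
    def __init__(self, columns: List[Column]):
        self.columns: Dict[str, Column] = {c.name: c for c in columns}

    def move(self, card: str, source: str, target: str) -> bool:
        """Pull a card into `target`; refuse if `target` is at its WIP limit."""
        src, dst = self.columns[source], self.columns[target]
        if card not in src.cards or dst.is_full():
            return False
        src.cards.remove(card)
        dst.cards.append(card)
        return True

    def bottlenecks(self) -> List[str]:
        """Columns that have reached their WIP limit."""
        return [c.name for c in self.columns.values() if c.is_full()]


# Hypothetical board state: "validate" has reached its limit of 2.
board = Board([
    Column("to do", cards=["t5"]),
    Column("doing", wip_limit=3, cards=["t3", "t4"]),
    Column("validate", wip_limit=2, cards=["t1", "t2"]),
    Column("done"),
])

moved = board.move("t3", "doing", "validate")        # refused: "validate" is full
board.move("t1", "validate", "done")                 # completing a task frees a slot
moved_after = board.move("t3", "doing", "validate")  # now the pull succeeds
print(moved, moved_after, board.columns["validate"].cards)
```

The refused move is the signal the text refers to: the finished "doing" card cannot advance until the team completes one of the "validate" tasks.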
With respect to information systems development, it has been shown that Kanban has a positive impact on project results (Lei et al., 2017). For example, at BBC Worldwide (Middleton & Joyce, 2012), the lead time to deliver software improved by 37%, the consistency of delivery rose by 47%, and defects reported by customers fell 24% as compared to the previously adopted agile method.
The value of Kanban has also been observed in other information system contexts, such as a data science project in which a Kanban-based approach was shown to be a more effective framework than the other approaches it was compared to (Saltz et al., 2017). Furthermore, although Kanban is often used within an information systems context, the Kanban method has been applied to other aspects of knowledge work (Leybourn, 2013), such as human resources (Benson, 2011) and recruitment (Lambert, 2014). Overall, it has been noted that Kanban offers improved project visibility, quality, team motivation, communication and collaboration (Ahmad et al., 2013, 2016). Hence, it is not surprising that the use of Kanban is growing quickly. For example, according to the “State of Agile Report” (VersionOne, 2016, 2018), the use of Kanban as an agile technique grew from 31% in 2014 to 50% in 2016 and to 65% in 2017.
In a literature review focusing on Kanban projects (Kupiainen et al., 2015), none of the identified studies explored predicting project success using Kanban-based metrics. There are, however, several commonly used metrics to help track the progress of a Kanban project. These metrics might also be leading indicators of the overall performance of the Kanban team (Power, 2014). Below, we review the most frequently used metrics and explain the purpose of each metric.
Lead time: Lead time is the total elapsed time from when the work item was started until it is declared complete and accepted by the customer. This is a key measure of the team’s throughput and productivity (Mahnič, 2013). Lead time measures duration from beginning to end. This includes process time, as well as time that work spends sitting in queues, or wait states. By tracking lead time metrics over a set period of time, one can determine the impact of any changes one makes – if the change helps deliver value faster, or if the change gets in the way of delivering value.
Cycle time: Cycle time measures how long it takes a work item to get from point A to point B. Since cycle time can be measured between any two starting and ending points on a Kanban board, it is common for several categories of cycle time to exist on one board (e.g., deployment cycle time, development cycle time, QA cycle time, etc.). Lead time and cycle time are similar and easily confused; both help the team understand how long work takes to flow through its value streams, but they measure different segments of the process.
Work in process (or work in progress) – WIP length: In general terms, “WIP” refers to any work that has been started but is not yet providing value to the customer (i.e., work that is not yet “done”). In short, it is all the work that is actively being worked on at any one time. Tracking work items that are started, but not yet finished, can help the team improve the overall flow of value through the system. Practically speaking, work cannot add value to the customer, team, or organization unless it’s finished work. In addition, a team with fewer WIP items is more agile, since the team can get feedback more frequently with respect to completed tasks and which new item to start.
Queue length: Queues form in a team’s process when work waits between different stages (or columns on the Kanban board). Since queues often represent the majority of a work item’s total life cycle, it’s important to understand how they affect the team and where they occur in a team’s process. Limiting the time that work spends in queues can help reduce the overall cycle time and keep work flowing through the system. An efficiency diagram measures the difference between total WIP and the work that’s waiting in queues. It illustrates when the work in queue is growing as a percentage of the total WIP. This allows one to pinpoint where work is likely stuck in a queue and also investigate what can be done to get work flowing again.
Number of blockers: Kanban systems often use a “blocker” symbol to visually indicate work that cannot move forward in the process. Different from work in a queue – which is often simply waiting its turn to be pulled into process – a blocker is typically waiting on an external dependency or some failure condition. Blockers double as a useful signal that a piece of work needs immediate attention as well as a valuable flow metric. Blockers are one of the easiest board elements to measure, especially for new teams that lack the work history necessary for other metrics. Simply stated, count how many items are blocked and record how long those work items stay blocked.
Throughput: Throughput is the average number of units processed per time unit. In a Kanban system, examples can include “cards/tasks per day” or “cards/tasks per week”.
Figure 2 shows a cumulative flow diagram (CFD), which visualizes two of the most frequent metrics (WIP and cycle time).
Cumulative flow diagram for WIP and cycle time (Scotland, 2009).
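As a rough illustration of how the metrics reviewed above could be derived from board data, the following sketch computes mean lead time, throughput, WIP and the number of blockers. The card records, field names and dates are hypothetical; they only show the arithmetic and do not reflect the data or the script used in this study.

```python
from datetime import datetime
from statistics import mean

# Hypothetical card records: when work started, when it was accepted
# (None if unfinished), and whether the card is currently flagged as blocked.
cards = [
    {"id": "c1", "started": datetime(2018, 9, 3), "done": datetime(2018, 9, 7), "blocked": False},
    {"id": "c2", "started": datetime(2018, 9, 4), "done": datetime(2018, 9, 14), "blocked": False},
    {"id": "c3", "started": datetime(2018, 9, 10), "done": None, "blocked": True},
    {"id": "c4", "started": datetime(2018, 9, 12), "done": None, "blocked": False},
]


def kanban_metrics(cards, window_start, window_end):
    """Compute simple flow metrics for the period [window_start, window_end]."""
    finished = [
        c for c in cards
        if c["done"] is not None and window_start <= c["done"] <= window_end
    ]
    # Lead time: elapsed days from start of work until acceptance.
    lead_times = [(c["done"] - c["started"]).days for c in finished]
    weeks = (window_end - window_start).days / 7
    # WIP: started on or before the end of the window but not finished by then.
    in_progress = [
        c for c in cards
        if c["started"] <= window_end and (c["done"] is None or c["done"] > window_end)
    ]
    return {
        "mean_lead_time_days": mean(lead_times) if lead_times else None,
        "throughput_per_week": len(finished) / weeks,
        "wip": len(in_progress),
        "blockers": sum(c["blocked"] for c in in_progress),
    }


metrics = kanban_metrics(cards, datetime(2018, 9, 3), datetime(2018, 9, 17))
print(metrics)
```

Cycle time for a specific pair of columns could be computed in the same way, provided the timestamps at which a card enters and leaves those columns are recorded.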
Following the findings from Lipovetsky et al. (1997) and Tishler et al. (1996), which suggest that managerial activities should emphasize managerial variables that can maximize the benefits to the customer (via meeting the design goals), this study explores metrics that can help predict successful project outcomes (i.e., deliverables to the customer).
To investigate the predictive power of existing and potentially newly devised Kanban metrics, 80 Kanban projects were analyzed. Specifically, metrics were collected for 80 student teams in a master’s level data science course, across four semesters. The instructors acted as the customer for each team, and each team worked on a semester long data science project using the Kanban process methodology, including the use of an online Kanban board tool (
The use of students and their applicability to industry
The generalization to industry, when using students, has often been criticized (Runeson, 2003; Sjøberg et al., 2002, 2005; Salman et al., 2015). This is due to the fact that in-class tasks are often far from reality, and therefore, the task is not representative of tasks in industry. This is one of the key reasons the results of an in-class experiment often do not transfer to industry (Sjøberg et al., 2005).
More generally, some think that using students is an external threat to validity (i.e., when the researcher generalizes beyond the groups in the experiment to other groups not under study). However, based on additional analysis of the literature, the results from students can be applied to an industry context, if the use of students is done in an appropriate manner. Within this research context, we first note that the students worked on a semester long project, which was defined in such a way to be as realistic as possible. In addition, master’s students, with an average of 3 years of IT experience, were selected as participants. The use of graduate level students ensures that the subjects had more advanced information systems knowledge and, as noted by many, such students constitute a good sample of the next generation of professionals entering the job market (Taibi et al., 2017; Runeson, 2003; Kitchenham et al., 2002; Tichy, 2000; Salman et al., 2015).
Furthermore, many studies have not found a significant difference in quality of the code produced by students as compared to the quality of code produced by professionals (Höst et al., 2000; Runeson, 2003; Berander, 2004; Svahnberg et al., 2008; Salman et al., 2015). For example, in Salman et al. (2015), professionals produced better code than students only when they were already familiar with the tasks to be done.
Finally, classifying experimental subjects (students) by their status (student or not student) is a proxy for more important and meaningful classifications, such as classifying the participants according to their abilities and experience, and effort should be invested in defining and using these more meaningful classifications (Feitelson, 2015).
Hence, within our context of a new realistic project task, graduate student teams are a good proxy for teams in industry. In short, one can view our analysis as focusing on junior professionals with, on average, three years of IT experience.
Project context
As part of the introductory data science course, students were required to work on a group project, which started early in the semester and continued until the end of the semester. The project was twenty-five percent of the course grade; thus, the students were highly motivated to work on the project. The project was done using the R programming language, a popular data science tool that is used in both industry and academia. The analysis required the team to perform many typical data science tasks, such as data cleaning and leveraging machine learning algorithms to deliver insight to their client.
Specifically, the project was positioned in a way that the students were to act as consultants and analyze a large data set of customer survey responses for their client. In order to have the project be as realistic as possible, specific instructions were not provided to the students (on what analysis was to be done). The goal for each team was to identify and then answer “interesting” questions, such as understanding how customer satisfaction varied across surveys (frequent vs non-frequent customers, etc.). This goal was the same for all teams (i.e., to get actionable insight from the data so that the organization could improve customer satisfaction). For example, if customer service was observed to be an important driver for improving customer satisfaction, a team might suggest that the organization should improve customer service by providing extra services to gold members.
All teams worked on well understood datasets, and as such, all the teams had the same level of difficulty with respect to working on the project. Note that no specific questions/goals were provided to any of the teams. For example, what to analyze was determined by each team, and was a function of how the team determined what might be useful, what was possible with the data that was provided as well as project duration. In other words, teams tried to identify important drivers (factors) for improving customer satisfaction. In doing the projects, each team needed to create tasks such as “data preprocessing: N/A handling” or “find most important factors that impact frequent customers”. To identify the appropriate questions, each team explored the data available as well as discussed possible analysis options with their client (or more accurately, a person acting as the client). The dataset was a modified version of a real dataset of survey responses. Hence, the data was not real, but was representative of the actual challenges one might face in executing a data science project. For example, some values in the dataset were blank. This reflected a typical ‘real-life’ challenge of handling missing values, which arose because different surveys were given to different customers and some surveys asked more questions than others.
Finally, students had access to a description of each column of the dataset. The attributes included information about the person who responded to the survey (e.g., place of residence, a member of their rewards program, and if so, what level), information about the service (e.g. location) and information about the responses to the survey from the customer (e.g., would they recommend the service/company to a friend).
Data collection
As shown in Table 1, 362 students participated in the study across 80 teams during four semesters.
Number of teams and students that participated in the experiment.
Each team used a web-based Trello board, which can be used as a virtual Kanban board. The tasks were managed by the students (i.e., the team members). Each of the team’s activities (such as task moves, adding comments, etc.) was logged within Trello and retrieved using the Trello API. Specifically, as part of this research effort, an R script was developed to pull the data from Trello (using their API). Based on the data retrieved from Trello, the appropriate metrics were then calculated. These metrics were then provided as input to the machine learning (predictive modeling) algorithms.
Note that each of the metrics was evaluated three times during the semester: after two weeks (i.e., near the beginning of the project), at the mid-point of the project and two weeks prior to the project being submitted. Therefore, the data can be thought of as an 80
In addition to the Kanban process metrics, the target variable “grade” was also part of the dataset for each team. The grade was used to identify teams with poor performance. The grades were assigned to the teams by two faculty members independently, after the project was completed. The grade was assigned to the team as a whole (i.e., same grade for all team members). To generate a score for each team, the evaluations from the two faculty members were averaged. Furthermore, the grades were scaled such that approximately half the student teams were deemed “at risk” of a poor result. Specifically, grades were scaled to be out of 20, and a team that scored less than 19 was deemed at risk. Forty-six percent of the teams (37 of the 80 teams) were deemed “teams at risk” (i.e., had a grade less than 19), and fifty-four percent of the teams (43 of the 80 teams) were deemed not at risk for poor performance (i.e., the project was deemed a success). In other words, the grades were used as an estimation of project quality, and the concept of risk in the experiment was with respect to low quality. Hence, the predictive algorithms could use either the “grade” or the “at risk” variable as their dependent variable.
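The labeling rule described above can be written down in a few lines. This is only a sketch: the function name and the example grades are hypothetical, and it assumes both evaluations are already on the 20-point scale.

```python
def label_team(grade_a: float, grade_b: float, threshold: float = 19.0) -> dict:
    """Average two independent evaluations (out of 20) and flag teams below the threshold."""
    grade = (grade_a + grade_b) / 2
    return {"grade": grade, "at_risk": grade < threshold}


# Hypothetical evaluations from the two faculty members for three teams.
examples = [label_team(19, 20), label_team(18, 19.5), label_team(19, 19)]
print(examples)
```

Note that a team scoring exactly 19 is not flagged, since the rule is a strict "less than 19".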
The following, previously defined, Kanban metrics were used:
Lead time, WIP, blockers; Throughput – average number of units processed per time unit. In a Kanban system, examples can include “cards per day”, “cards per week”, or “story points per iteration” – in our case we can have “X per update”;
However, other existing metrics were not used:
Queues – for this project, queues were not used; Cycle time (how long it takes a work item to get from point A to point B) – it was not possible to calculate this in a consistent manner, since the column names differed from team to team.
In addition to previously defined Kanban metrics, several additional metrics were also calculated and evaluated:
Number of tasks: Total number of tasks on the board (in any state); Number of actions: Total number of transactions by the time of measurement; this includes commenting, moving cards, renaming card headers or adding descriptions; Number of completed tasks (in “done” column): Helps to understand the real progress of teams; Number of tasks after “to do”: Tasks in progress and completed tasks; Number of students who created or edited cards: This metric shows how equally work is distributed across team members; Number of words in “done”: The idea is that very short task descriptions might not be as valuable as tasks with more detailed information.
The goal of the analysis was to propose metrics (i.e., features for ML models) that have high predictive power (i.e., that can distinguish teams at risk from other teams). As previously noted, teams “at risk” means teams that are likely to show low performance by the end of the semester (i.e., that will get a low grade).
In order to analyze the effectiveness of the models, a baseline solution is needed. However, since no predictive model has previously been generated and published, the baseline is defined as the task of assigning one label to all objects:
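A minimal sketch of how these additional metrics might be computed from a board snapshot and its activity log is given below. The record layout, the member identifiers and the set of action types treated as "created or edited" are assumptions made for illustration; the actual Trello export contains more fields and was processed with an R script in this study.

```python
# Hypothetical, simplified board snapshot and activity log.
cards = [
    {"name": "data preprocessing: N/A handling", "list": "done"},
    {"name": "find most important factors that impact frequent customers", "list": "doing"},
    {"name": "plot satisfaction by customer type", "list": "to do"},
    {"name": "clean dates", "list": "done"},
]
actions = [
    {"member": "s1", "type": "createCard"},
    {"member": "s1", "type": "updateCard"},
    {"member": "s2", "type": "commentCard"},
    {"member": "s3", "type": "updateCard"},
    {"member": "s2", "type": "updateCard"},
]

# Assumption: only these action types count as creating or editing a card.
CARD_EDIT_TYPES = {"createCard", "updateCard"}


def board_features(cards, actions, todo="to do", done="done"):
    """Compute the additional board metrics for one team at one point in time."""
    done_cards = [c for c in cards if c["list"] == done]
    return {
        "n_tasks": len(cards),
        "n_actions": len(actions),
        "n_done": len(done_cards),
        "n_after_todo": sum(c["list"] != todo for c in cards),
        "n_editors": len({a["member"] for a in actions if a["type"] in CARD_EDIT_TYPES}),
        "n_words_done": sum(len(c["name"].split()) for c in done_cards),
    }


features = board_features(cards, actions)
print(features)
```

Because column names differed from team to team, a real implementation would need a per-team mapping for the "to do" and "done" columns, which is why they are parameters here.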
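Assign “not at risk” to all objects (majority vote rule, maximizing accuracy, since 43 of the 80 teams were not at risk).
Confusion matrix for the “not at risk” baseline.
Assign “at risk” to all objects (since we do not want to overlook teams at risk).
Confusion matrix for the “at risk” baseline.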
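The confusion matrices of the two single-label baselines follow directly from the class counts reported earlier (37 teams at risk, 43 not at risk). A short sketch using scikit-learn, with the label encoding 1 = “at risk” as an assumption, is:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# 1 = "at risk", 0 = "not at risk" (encoding chosen for this sketch).
y_true = np.array([1] * 37 + [0] * 43)

all_at_risk = np.ones_like(y_true)       # baseline: every team is flagged
all_not_at_risk = np.zeros_like(y_true)  # baseline: no team is flagged

# Rows are true classes, columns are predicted classes, in the order [0, 1]:
# [[TN, FP],
#  [FN, TP]]
cm_at_risk = confusion_matrix(y_true, all_at_risk, labels=[0, 1])
cm_not_at_risk = confusion_matrix(y_true, all_not_at_risk, labels=[0, 1])

acc_at_risk = accuracy_score(y_true, all_at_risk)
acc_not_at_risk = accuracy_score(y_true, all_not_at_risk)

print(cm_at_risk, acc_at_risk)
print(cm_not_at_risk, acc_not_at_risk)
```

Flagging every team finds all 37 teams at risk at the cost of 43 false alarms, while flagging no team is correct for 43 of the 80 teams but misses every team at risk.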
Evaluating metrics with the baseline approach
Below, we review typical approaches that might be used to evaluate different models within this project context. Table 2 shows the value of each metric for each of the two baseline approaches.
Sensitivity/recall/True Positive Rate (TPR): $\mathrm{TPR} = \frac{TP}{TP + FN}$;
Precision/Positive Predictive Value (PPV): $\mathrm{PPV} = \frac{TP}{TP + FP}$;
Specificity/selectivity/True Negative Rate (TNR): $\mathrm{TNR} = \frac{TN}{TN + FP}$;
Accuracy: percentage of correct predictions, $\frac{TP + TN}{TP + TN + FP + FN}$;
F1 score: harmonic mean of precision and recall, $F_1 = \frac{2 \cdot \mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}$;
Sensitivity and specificity score: harmonic mean of sensitivity and specificity, $\frac{2 \cdot \mathrm{TPR} \cdot \mathrm{TNR}}{\mathrm{TPR} + \mathrm{TNR}}$.
Here TP, FP, TN and FN denote the number of true positives, false positives, true negatives and false negatives, respectively.
It has been noted that precision and recall should only be used in situations where the correct identification of the negative class does not play a role. This is true in information retrieval, which is where these terms originated (Powers, 2011). In information retrieval, there is generally an unknown and much larger number of true negatives, as compared to the actual numbers of relevant and retrieved documents, and hence, not exploring true negatives is not an issue.
In contrast to information retrieval, where the number of true negatives is large and unknown, in many other applications (e.g., medicine, team performance analysis), the assumption of very large numbers of true negatives versus positives rarely holds (Powers, 2011). For example, in this project management context, providing support to “at risk” teams requires the use of a limited resource and providing this support to teams that don’t need support may leave other teams unsupported (or incur undue expense). Thus, in this project modeling context, precision and recall, as well as F1 (harmonic mean of recall and precision) should not be used.
In this situation (where there is not the assumption of a very large number of true negatives), evaluating a model based on both sensitivity and specificity is appropriate (Yerushalmy, 1947; Parikh et al., 2008). This is because these measures consider all entries in the confusion matrix. While sensitivity deals with true positives and false negatives, specificity deals with false positives and true negatives. Furthermore, sensitivity and selectivity are closely related to
This suggests that the combination of sensitivity and specificity is an appropriate holistic measure when both true positives and true negatives should be considered (Baron, 1994; Boyko, 1994; Pewsner et al., 2004). In terms of combining sensitivity and selectivity, one should consider both the arithmetic and harmonic mean:
Arithmetic mean: $\frac{\mathrm{TPR} + \mathrm{TNR}}{2}$
Harmonic mean: $\frac{2 \cdot \mathrm{TPR} \cdot \mathrm{TNR}}{\mathrm{TPR} + \mathrm{TNR}}$
One can note that using the harmonic mean puts more of a focus on the lower of the two numbers being averaged, which is why it is used in F1 and why the harmonic mean is also widely leveraged when combining sensitivity and selectivity (Hong et al., 2009), especially in medicine (Hoff et al., 2009; Hildebrandt et al., 2017). Thus, using the harmonic mean of sensitivity and selectivity was the approach used to evaluate the different models.
In addition to linear modeling, additional Machine Learning (ML) models, listed below, were also evaluated. These models were evaluated using the Python machine learning library scikit-learn (Pedregosa et al., 2011), where they were used to predict the “at risk” variable:
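The difference between the two ways of combining sensitivity and selectivity is easiest to see on a degenerate classifier. The sketch below applies both means to the baseline that labels every team “at risk” (TP = 37, FN = 0, FP = 43, TN = 0, from the class counts reported earlier) and to one purely illustrative pair of values; the function names are our own.

```python
def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn) if (tp + fn) else 0.0


def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp) if (tn + fp) else 0.0


def arithmetic_mean(a: float, b: float) -> float:
    return (a + b) / 2


def harmonic_mean(a: float, b: float) -> float:
    # Defined as 0 when both inputs are 0 to avoid division by zero.
    return 2 * a * b / (a + b) if (a + b) else 0.0


# Baseline that labels all 80 teams "at risk".
sens = sensitivity(tp=37, fn=0)  # every team at risk is found
spec = specificity(tn=0, fp=43)  # no team is correctly cleared
baseline_arithmetic = arithmetic_mean(sens, spec)
baseline_harmonic = harmonic_mean(sens, spec)

# Illustrative values only (not results from this study).
example_arithmetic = arithmetic_mean(0.9, 0.5)
example_harmonic = harmonic_mean(0.9, 0.5)

print(baseline_arithmetic, baseline_harmonic)
print(example_arithmetic, example_harmonic)
```

The arithmetic mean credits the trivial baseline with a score of 0.5, whereas the harmonic mean scores it at 0, which matches the intuition that a classifier with zero specificity is not useful.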
K-Nearest Neighbors algorithm (KNN); SVM (with multiple kernels); Gaussian process classification (GPC) based on Laplace approximation; Decision Trees (DT); Random Forests (RF); Multi-layer Perceptron (MLP); Gaussian Naïve Bayes (NB); Quadratic Discriminant Analysis (QDA).
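For reference, the model families listed above correspond to the following scikit-learn estimators. The hyperparameters shown (the choice of kernels, `max_iter`, `random_state`) are assumptions for this sketch; the settings actually used in the study are not stated in this section.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Mostly library defaults; kernels and seeds are illustrative choices.
classifiers = {
    "KNN": KNeighborsClassifier(),
    "SVM (linear kernel)": SVC(kernel="linear"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "GPC": GaussianProcessClassifier(),  # scikit-learn's GPC uses the Laplace approximation
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
    "NB": GaussianNB(),
    "QDA": QuadraticDiscriminantAnalysis(),
}

for name, clf in classifiers.items():
    print(name, type(clf).__name__)
```

Keeping the models in a dictionary makes it straightforward to run the same evaluation procedure over each of them in turn.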
Cross-validation was used to evaluate the algorithms’ performance. Cross-validation is one of many similar model validation techniques for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice (James, 2013). Specifically, 5-fold cross-validation was used, together with MinMaxScaler (which transforms all feature values into the range [0, 1]).
However, running cross-validation once might generate unstable results. This is due to the fact that splitting the dataset into train and test parts (which is what K-fold cross-validation does) is a random process and there is no way to ensure replicability and reliability of the results. In short, if one replicates the experiment with the same parameters of models, the quality of the prediction might not be the same. In contrast, if one repeats the cross-validation, and shuffles the data before each iteration, and then takes an average of those repeated cross-validations, the results will converge to a stable mean (with a normal distribution). Therefore, we performed the 5-fold cross-validation 1000 times as demonstrated via the following logical flow:
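A minimal sketch of such a repeated procedure is shown below. It is not the authors' code: the data is synthetic, the function name is hypothetical, the scaler is fitted inside each training fold via a pipeline, and the score for each repetition is computed on the predictions pooled over the five folds. These are all assumptions of the sketch.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def repeated_cv_score(model, X, y, n_repeats=1000, n_splits=5, seed=0):
    """Mean and std of the sensitivity/specificity harmonic mean over repeated shuffled K-fold CV."""
    rng = np.random.RandomState(seed)
    scores = []
    for _ in range(n_repeats):
        # A fresh shuffle for every repetition.
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=rng.randint(2**31 - 1))
        pipe = make_pipeline(MinMaxScaler(), model)
        y_pred = cross_val_predict(pipe, X, y, cv=cv)
        tn, fp, fn, tp = confusion_matrix(y, y_pred, labels=[0, 1]).ravel()
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        scores.append(2 * sens * spec / (sens + spec) if (sens + spec) else 0.0)
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in for the team data: 80 rows, 37 labeled "at risk" (1).
data_rng = np.random.RandomState(42)
y = np.array([1] * 37 + [0] * 43)
X = data_rng.normal(size=(80, 6)) + 0.8 * y[:, None]

# A small number of repetitions keeps the demo fast; the text uses 1000.
mean_score, std_score = repeated_cv_score(KNeighborsClassifier(), X, y, n_repeats=20)
print(f"harmonic mean of sensitivity and specificity: {mean_score:.3f} +/- {std_score:.3f}")
```

Fixing `seed` makes the whole sequence of shuffles reproducible, while averaging over many repetitions reduces the dependence of the reported score on any single split.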
It is often the case that when there are many features (relative to the training sample size), it is beneficial to reduce the number of features (Hua et al., 2005). This could be the situation with our dataset. Specifically, since the dataset being explored had many features/attributes (compared to number of observations – rows of data), it is possible (or even likely) that one could get better results if one reduced the number of features/metrics.
To do the feature selection, we employed multivariate analysis methods such as Principal Component Analysis, which has been proven to be an effective approach within a project management context (Tishler et al., 1996; Shenhar et al., 2002; Dvir et al., 2003). In short, since the dataset had only 80 observations, it was hypothesized that having fewer features (e.g., 5-10) would improve results. Therefore, the following feature selection techniques were leveraged within this analysis (with a goal of improving the predictive accuracy of the analysis):
Principal component analysis (PCA) – A dimensionality reduction method based on the correlation matrix. This multivariate analysis method has been shown to be effective in multiple applications (Shenhar et al., 2002).
Manual feature selection based on metrics of different categories – Based on exploratory metrics analysis and high correlation between some variables, one can hypothesize that some metrics can be grouped together (to avoid multicollinearity). If two numerical features are highly correlated, then one doesn’t add any additional information (it is determined by the other). For example, number of tasks after “to do” might be eliminated from the analysis if one uses number of WIP tasks, since these two metrics are conceptually very close to each other. In other words, when using this approach, the analysis will use one metric from each logical group of concepts.
Univariate feature selection – Select the top K features based on Chi-squared univariate statistical test (scikit-learn, 2018). The chi-squared test measures dependence between stochastic variables and the target class variable, hence using this function eliminates the features that are the most likely to be independent of class and therefore irrelevant for classification. Therefore, this score can be used to select the K features with the highest values for the test chi-squared statistic.
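Two of these techniques, PCA and chi-squared univariate selection, can be sketched as scikit-learn pipelines. The sketch below is illustrative only: the synthetic data, the choice of four components/features, the pairing of each technique with a particular classifier, and the use of balanced accuracy as a stand-in score are assumptions. Standardizing before PCA is one way to obtain components based on the correlation matrix, and min-max scaling before the chi-squared test ensures the non-negative inputs that test requires.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for the 80-team dataset (assumption).
X, y = make_classification(n_samples=80, n_features=12, n_informative=4, random_state=0)

# PCA on standardized features, which corresponds to PCA of the correlation matrix.
pca_model = make_pipeline(StandardScaler(), PCA(n_components=4), GaussianNB())

# Chi-squared selection needs non-negative inputs, hence the min-max scaling.
chi2_model = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=4), KNeighborsClassifier())

for name, model in [("PCA + NB", pca_model), ("chi2 top-4 + KNN", chi2_model)]:
    # Balanced accuracy is the arithmetic mean of sensitivity and selectivity;
    # it is used here only as a convenient built-in stand-in score.
    scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
    print(name, round(scores.mean(), 2))

# Inspect which original feature columns the chi-squared step keeps.
chi2_model.fit(X, y)
selected = chi2_model.named_steps["selectkbest"].get_support(indices=True)
print("selected feature indices:", selected)
```

Placing the selection step inside the pipeline means it is refit on each training fold, which avoids leaking information from the test fold into the choice of features.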
Linear regression
Linear regression was first explored, but the results didn’t yield any models with high predictive value (i.e., high R-sq). For example, when trying to create a linear model using ordinary least squares (OLS) and using all the metrics, the R squared value was 0.1 and all
Comparison of predictive power of the metrics
The results, using 5-fold cross-validation, with 1000 iterations and a goal of maximizing the harmonic mean of sensitivity and selectivity, for predicting teams at risk, using KNN, SVM, DT, RF, MLP, NB, GPC, QDA machine learning techniques, ranged from 0.27 to 0.38, with KNN performing the best. These results were after tuning the key parameters of the model, such as the cost ‘C’ for SVM. Note that the average Sensitivity and Selectivity were calculated during each iteration, and then averaged (i.e., those calculations did not use the averaged recall, selectivity, and precision values). The statistical
Feature engineering/selection
As described in our methodology, feature engineering/selection was explored via principal component analysis, manual feature selection and univariate feature selection. The results of the feature selection analysis are presented below.
PCA: With respect to principal component analysis (PCA), an analysis was done when the dimensions were reduced to 2, 3, 4, 6 and 8. Using PCA, with four components, led to the highest improvement of the harmonic mean of selectivity and sensitivity (0.46 when using NB). We also note that with six components, the harmonic mean of selectivity and sensitivity was 0.45 (also when using NB). Table 3 has the full PCA results.
Performance of ML models depending on number of PCA components
Manual feature selection: The attributes were put into logical groups, and one key attribute from each group was used, as noted below:
WIP (removed ‘Number of tasks after to do’). Number of blockers (no attributes were removed). Throughput (removed ‘Number of tasks’, ‘Number of tasks in done’, ‘Number of Actions’). Number of words in done (removed ‘Project update evaluation’).
As shown in Table 4, the manual selection of features led to an improvement of the harmonic mean of selectivity and sensitivity to 0.45 (using NB), which is approximately the same result as was derived via PCA.
Performance of KNN models for the four manually selected features
Univariate feature selection: As shown in the previous analyses, SVM, MLP and GPC performed significantly worse than other methods. Therefore, these approaches were not evaluated during univariate feature selection. As shown in Table 5, univariate feature selection generated a result with a harmonic mean of 0.49 (using KNN), which was the best model identified.
Univariate feature selection performance
In summary, univariate feature selection was the best feature selection approach, and it outperformed the results obtained when no feature selection was used. Specifically, the following attributes were deemed most important: the number of items in progress during update 3, the throughput during the period before update 3, the number of tasks defined during update 2, and the average number of words in the description of tasks done during update 3. These attributes, using KNN, generated a result of close to 0.5 (0.49), which was much better than using all the attributes (which generated a sensitivity and selectivity harmonic mean of 0.38). However, it should also be noted that PCA and manual feature selection generated similar results (0.46 and 0.45, respectively). Hence, this suggests that the results were stable, since similar results were obtained across different approaches.
Planning fallacy is a situation in which predictions about how much time will be needed to complete a future task are optimistically biased and hence, the time needed to do the task is underestimated (Kahneman & Tversky, 1979). While the planning fallacy is an important issue within project management, it is not a key issue when teams use the Kanban framework. This is because when teams use Kanban, they do not generate task estimates. In short, as previously mentioned, the Kanban team members work on tasks as the team puts the tasks on their Kanban board (without an explicit task estimation). Hence, because Kanban does not require task estimation, this study did not focus on task estimation or on the planning fallacy related to task estimation.
We also note that even if one considers the final project deadline, Kanban’s focus on minimizing work-in-progress helps each team ensure a usable result at the project due date. Specifically, with respect to information systems development (e.g., software applications, data science), at the project’s deadline, there might be some features or analyses that have not yet been completed. However, even if not everything has been completed, there is still a usable and working system. For example, with respect to the data science projects analyzed during this research effort, the goal was to have an analysis with actionable insights by the end of the semester. More advanced teams might have explored additional models or improved their tuning parameters within the models that were implemented. In short, within this type of data science analysis, there is always room for improvement, and even in industry, some tasks are typically not completed due to the perceived diminished return on investment.
The study does have several limitations, which suggest potential next steps. For example, there are other forms of project risk beyond quality (e.g., budget issues or the delay of a project) and work could be done to build models that predict these other types of project challenges.
In addition, while the model was far better than the default baseline, there is still significant room for improvement. One way to improve the model might be to explore additional data/metrics that could be collected from within a project. Another way to improve the model is to collect data from additional teams. As previously mentioned, the data was collected from 80 project teams. While 80 projects might seem like a lot of projects, having 80 projects implies that the analysis was performed on a relatively small dataset of 80 rows of data. Thus, data could be collected across additional teams, which would likely help refine and improve the model.
Finally, this study was focused on student-based data science teams (i.e., graduate students taking an applied data science class). The results might be different for other teams, such as software development teams or when reviewing projects with experienced industry professionals. Hence, additional studies could be conducted with industry-based teams, across a range of domains.
Conclusion
In exploring the impact of the predictive algorithms, the baseline would predict that all teams were at risk, and so while all at risk teams would be identified, 46% of the teams would be treated as if they were at risk, even though they were not at risk. Thus, 100% of the teams “at risk” would be correctly identified. However, 0% of the teams “not at risk” would be correctly identified.
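The baseline figures above can be checked with a short calculation. The class counts below are a hypothetical split of the 80 teams that is consistent with roughly 46% of teams being "not at risk"; the exact counts are an assumption made for illustration.

```python
# Hypothetical class counts (assumption): 37 of 80 teams (about 46%) are "not at risk".
n_at_risk, n_not_at_risk = 43, 37

# The baseline labels every team "at risk".
true_positives = n_at_risk       # every "at risk" team is flagged
false_negatives = 0              # no "at risk" team is missed
true_negatives = 0               # no team is ever labelled "not at risk"
false_positives = n_not_at_risk  # every "not at risk" team is flagged anyway

sensitivity = true_positives / (true_positives + false_negatives)
selectivity = true_negatives / (true_negatives + false_positives)
total = sensitivity + selectivity
harmonic_mean = 2 * sensitivity * selectivity / total if total else 0.0

print(sensitivity, selectivity, harmonic_mean)  # 1.0 0.0 0.0
```

Because the harmonic mean is zero whenever either component is zero, the all-at-risk baseline scores 0 on the evaluation measure used in this study, regardless of the exact class counts.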
Using the univariate feature selection approach with 4 features, the model provides a manager with the opportunity to identify, on average, 50% of the teams “at risk” (i.e., recall). At the same time, on average, 70% of the teams “not at risk” will be correctly classified as “not at risk”. Although the results reported are for a relatively small dataset, we believe that the characteristics of the classification model are stable because we used 1000 iterations of 5-fold cross-validation. In other words, the results are expected to generalize to new data, and if one applies the model to new teams, the class instructor or data science team manager would be able to detect approximately 50% of the teams at risk as well as correctly identify approximately 70% of the teams “not at risk”.
More generally, this study makes several contributions. First, Kanban-based metrics were reviewed and additional metrics proposed. In addition, an analysis of the existing and proposed Kanban metrics was conducted. The analysis specifically explored each metric's predictive power with respect to team outcomes. Our results suggest that some of the newly proposed metrics were helpful in identifying teams at risk of poor performance (as compared to teams not at risk). Finally, as part of this work, several models were built to demonstrate that the selected metrics can help to identify teams at risk when the team is using a Kanban framework, with the best performing model using KNN with univariate feature selection.
