Abstract
BACKGROUND:
Occupational accidents in the plumbing activity of the construction sector in developing countries cause high rates of work absenteeism, which heavily affects the productivity of enterprises.
OBJECTIVE:
To propose a model based on the Plan, Do, Check, and Act cycle and data mining for the prevention of occupational accidents in the plumbing activity in the construction sector.
METHODS:
This cross-sectional study was conducted with a total of 200 male technical workers in plumbing. It considers biological, biomechanical, chemical, and physical risk factors. Three data mining algorithms were compared for classifying the occurrence of occupational accidents: Logistic Regression, Naive Bayes, and Decision Trees. The model was validated using 20% of the data collected, maintaining the same proportion between accidents and non-accidents. The model was applied to data collected over the last 17 years on occupational accidents in the plumbing activity in a Colombian construction company.
RESULTS:
The results showed that, in 90.5% of the cases, the decision tree classifier (J48) correctly identified the possible cases of occupational accidents when the biological, chemical, and biomechanical risk factors were applied as training variables in the model.
CONCLUSION:
The results of this study are promising in that the model is efficient in predicting the occurrence of an occupational accident in the plumbing activity in the construction sector. For the accidents identified and the associated causes, a plan of measures to mitigate the risk of occupational accidents is proposed.
Introduction
The lack of control measures for risk factors present in the work environment negatively influences company operations, resulting in an increase in occupational accidents due to non-compliance with regulations and the lack of management in the prevention of occupational risks [1]. The International Labour Organization estimated that an average of 651,279 workers die each year due to occupational diseases and notes that the construction industry has a disproportionately high rate of recorded accidents [2]. Based on these figures, it was estimated that one person dies every three and a half minutes due to occupational accidents in the European Union.
The construction industry is known for a large number of deaths worldwide [6]. It is also one of the most dangerous industries for workers’ safety. Studies show that incidence and mortality rates in this sector are high due to the presence of multiple occupational risk factors in work environments [7], and they relate the age of the worker to the occurrence of work accidents [9, 10].
In this industry, plumbers determine the quality of facilities, networks, and water management, which makes their role important for maintaining the health of the environment and buildings. Results observed in the sector in recent years show that the instrument used to assess plumbers’ competence needs to be substantially improved in its content, constructs, and criteria, which should be adjusted to the development of the labor market and the current needs of the industry [10].
Plumbing is one of the most dangerous trades due to the risks that operators face when using machinery in their activities. On average, out of 25,128 plumbing workers, 252 suffer accidents with permanent injuries in various parts of the body due to the loss of control of tools or manual machinery [11]. Some authors have identified plumbing work in confined spaces as being classified at a high risk level [12]. However, the conditions related to biological risk and chemical risk have not been analyzed. The non-use of Personal Protective Equipment (PPE) is one of the main risk factors in the industrial construction sector due to the high incidence of injuries and accidents that workers are exposed to. According to research on construction workers in Egypt, 40.6% of the workers studied did not use safety equipment in their work, followed by Kenya, where a low PPE use rate of 45.2% was reported [13–16].
Some of the mechanisms used by companies to reduce the level of accidents in workspaces are: (i) establishing promotion and prevention strategies that allow workers to comply in a timely manner with the safety rules established by the Occupational Safety and Health Administration [4], and (ii) promoting the redesign of traditional training methodology into an inclusive and collaborative training model that encourages learning and increases plumbers’ confidence in their abilities [5].
To contribute to the development of possible solutions, this study proposes a predictive model for the prevention of workplace accidents in the plumbing activity in the construction sector, based on data mining techniques and structured under the Plan, Do, Check, and Act (PDCA) methodology.
Methods
This cross-sectional study reviewed historical data from the evaluation of health and safety hazards in the workplace and from the investigation and reporting of work-related accidents. A total of 100 incidents and occupational accidents registered in the plumbing activity of a construction-sector company with 200 basic and technical workers were analyzed. There were 59 accidents. The data cover the period from 2005 to 2022. The inclusion criteria were all reported incidents and occupational accidents. The dataset is an occupational accident database administered by Occupational Health and Safety coordinators, and its use was approved by the company under confidentiality. The developed model is based on the PDCA cycle, and Fig. 1 shows the process flowchart.

Process flowchart of the model PDCA-DM-OHS.
Step 1 consists of reviewing the organization’s dataset of incidents and work-related accidents that occurred over the last seventeen years; a frequency table, an accidents vs frequency bar chart, and a Pareto chart show the behavior of the accident rate. The objective of Step 2 is to identify the causes associated with the occurrence or non-occurrence of work-related accidents related to the use of PPE, according to the activities involved. In Step 3, records are classified as accident or non-accident and the associated causes (model variables) are identified; the dataset matrix records one when the associated controls are complied with and zero when the operator does not comply with them. In Step 4, the classification results of the three algorithms used to predict work-related accidents are obtained and compared, using the dataset for each algorithm’s training, validation, and testing. Three data mining models are proposed: i) Logistic Regression, ii) Naïve Bayes, and iii) Decision Trees.
To determine which parameters best classify each model, 10-fold cross-validation is used as the initial step to extract results. Next, the dataset is divided into multiple training sets and test sets. In this step, accident-related causes are entered as data into the predictive algorithms.
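The abstract notes that validation used 20% of the data while keeping the same proportion of accidents and non-accidents. A minimal sketch of such a stratified hold-out split is given below; the function name, the seed, and the 59/41 label counts used in the example are assumptions for illustration, not the authors' procedure.

```python
import random


def stratified_holdout(labels, test_fraction=0.2, seed=0):
    """Split record indices into train/test sets, preserving class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the validation set.
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Illustrative labels only: 1 = accident, 0 = non-accident (assumed 59/41 split).
labels = [1] * 59 + [0] * 41
train_idx, test_idx = stratified_holdout(labels)
```

Splitting each class separately is what keeps the accident share of the validation set close to that of the full dataset.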
Logistic regression
One-dimensional logistic regression can be used to relate the probability of a binary qualitative variable (assumed to take the values “0” and “1”) to a scalar variable x. The idea is that logistic regression approximates the probability of obtaining “0” (the accident does not occur) or “1” (the accident occurs) from the value of the explanatory variable x, according to Equation (1) [17].
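For reference, the standard logistic form consistent with the terms described for this equation can be written as follows; this is the usual textbook expression, given here as an assumed reconstruction.

```latex
Y(x) = \frac{e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_i X_i}}
            {1 + e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_i X_i}}
```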
Where Y(x) represents the estimated probability of being in one binary outcome category (i = accident or non-accident) versus the other, and $e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_i X_i}$ represents the linear regression equation for the independent variables (Personal Protective Equipment) expressed on the logit scale, rather than in the original linear format.
Naïve Bayes is a technique based on probability theory that estimates conditional probabilities and makes predictions about new cases based on data frequencies. Equation (2) defines it [18].
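A standard statement of the Naïve Bayes posterior is given here as an assumed form, in which the risk variables are treated as conditionally independent given the class:

```latex
P(C = c \mid X_1 = x_1, \dots, X_n = x_n)
  = \frac{P(C = c) \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}
         {P(X_1 = x_1, \dots, X_n = x_n)}
```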
In which n corresponds to the number of risk variables X1, ..., X7. The class variable C takes the value c, and the predicted class is the value that maximizes the posterior probability of C given the risk variables X = (X1, ..., X7). That is, in the Naïve Bayes paradigm, the search for the most probable prediction, c*, knowing the values of the risk variables (X1, ..., X7) of a particular at-risk individual, reduces to [19]:
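In its standard form, that most-probable-class rule is written as follows (a reconstruction from the usual Naïve Bayes formulation with n = 7 risk variables, not a formula confirmed by the source):

```latex
c^{*} = \arg\max_{c} \; P(C = c) \prod_{i=1}^{7} P(X_i = x_i \mid C = c)
```

The denominator of the posterior does not depend on c, which is why it can be dropped when only the most probable class is needed.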
Decision trees are prediction models whose algorithms “learn” patterns from data to produce a simplified analysis that closely represents the original data. The accident data presented, S = {s1, s2, ..., s7}, are classified data, with X1, ..., X7 referring to the engineering controls missing at the time of the accident. In the development of the model, 100 variables are taken as predictors, and a class variable indicates whether or not the accident occurred.
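J48, the decision tree classifier named in this study, is Weka's implementation of the C4.5 algorithm, which chooses each split by gain ratio. The sketch below computes that criterion for a single binary control; the toy vectors are invented for illustration and are not taken from the study's dataset.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature, labels):
    """C4.5 gain ratio: information gain divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class after splitting on the feature.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(feature)
    return info_gain / split_info if split_info > 0 else 0.0


# Toy data (hypothetical): 1 = control complied with, class 1 = accident.
accident = [1, 1, 0, 0]
separating_control = [0, 0, 1, 1]     # accidents occur exactly when the control is missing
uninformative_control = [0, 1, 0, 1]  # unrelated to the outcome
```

A control whose compliance perfectly separates accidents from non-accidents gets a gain ratio of 1, while an unrelated control gets 0, so the tree would split on the former first.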
10-fold cross-validation
In k-fold cross-validation, the data set is randomly divided into k partitions, and then we fit our model to a data set consisting of k-1 of the original k parts and use the remaining part for validation. That is, we estimate out-of-sample error using the portion of the data that has been left out of the fitting procedure. We repeat this k times and our estimate of out-of-sample error is the average of the k validations [20, 21].
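The procedure described above can be sketched in a few lines. The learner used here is a deliberately trivial majority-class placeholder, and the 59/41 label counts are an illustrative assumption; neither represents the models or results of the study.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Randomly partition indices 0..n_samples-1 into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i.
    return [indices[i::k] for i in range(k)]


def cross_val_error(features, labels, fit, predict, k=10, seed=0):
    """Average out-of-sample error rate over the k held-out folds."""
    fold_errors = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held_out_set = set(held_out)
        train = [i for i in range(len(labels)) if i not in held_out_set]
        model = fit([features[i] for i in train], [labels[i] for i in train])
        wrong = sum(predict(model, features[i]) != labels[i] for i in held_out)
        fold_errors.append(wrong / len(held_out))
    return sum(fold_errors) / k


# Placeholder learner (assumption): always predicts the majority training class.
def fit_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, x):
    return model


# Illustrative data: 7 binary controls per record, 1 = accident, 0 = non-accident.
features = [[0] * 7 for _ in range(100)]
labels = [1] * 59 + [0] * 41
error = cross_val_error(features, labels, fit_majority, predict_majority)
```

Any classifier exposing the same `fit`/`predict` pair could be passed in place of the placeholder.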
In Step 5, we select the optimal algorithm, which demonstrates the highest classification accuracy and the lowest error rate. The error rate, defined as the average frequency of incorrect predictions of attribute class values [22], is used as a key criterion for this selection. Subsequently, a stability analysis is carried out to ensure the consistency of the results, using an analysis of variance to assess the chosen algorithm’s stability. In the analysis of variance model (Equation 5), Yi corresponds to the response variable, Ti is the effect caused by the i-th treatment, and ɛi is the i-th experimental error. The analysis is carried out at a 99.5% confidence level. The data must meet the required independence and normality assumptions.
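As a sketch of the stability test, the F statistic of a one-way analysis of variance can be computed directly from replicated accuracy values. The groups below are hypothetical numbers chosen to make the arithmetic easy to follow; they are not the replicas reported in the study, and the tabulated F value would still have to be looked up separately.

```python
def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of treatment groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-treatment and within-treatment sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, group_means) for y in g)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical treatments (e.g. three train/validation configurations).
f_value = one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

If the calculated F is smaller than the tabulated F at the chosen confidence level, the treatments are not significantly different, which is the sense in which the classifier's results are considered stable.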
In Step 6 of the methodology, after obtaining the results from the algorithm that yields the most accurate classification outcomes, the interpretation of these findings is performed.
Finally, in Step 7, once the problem and its causes have been identified, an action plan is proposed for the mitigation of workplace accidents in plumbing activities in the construction sector. The action plan i) addresses human talent management (training and education) to determine the objectives to be achieved, what to do, and how to do it, preparing the intervention at the workplace with the proper use of tools, safe handling of machinery, and mandatory use of Personal Protective Equipment. After completing the planning process, the next steps of the proposed model are: do, check, and act.
Do
In this phase, the action plan for mitigating workplace accidents in plumbing activity in the construction sector is implemented; ii) the results of the plan’s execution are measured; and iii) the activities outlined are monitored to ensure that they are executed according to the plan.
Check
The results obtained under the action plan are verified by evaluating work carried out after the training and by observing whether occupational safety techniques are applied according to the instructions.
Act
In this phase, improvement actions are implemented according to the activities performed in plumbing workplaces.
Results
Plan
The observed results from Steps 1 and 2 involve the review and analysis of data and causes associated with occupational accidents in plumbing activity within the construction sector. A nomenclature was used to ease the processing of each type of accident and its characteristics in the data mining algorithms (Table 1). A total of 200 male technical workers in plumbing with a mean age of 30.30±10.16 were analyzed. Of them, 59 had an occupational accident. The most frequent causes of accidents were: meniscus rupture (10 cases, 16.95%), fibrosis and scar tissue formation in the skin (6 cases, 10.17%), infection in the skin (5 cases, 8.47%), and leg injury (5 cases, 8.47%). Other minor causes of accidents were: hematomas, respiratory tract infections, deep cuts on the right hand while removing the probe from a pipeline, and minor injury and dizziness (4 cases) (6.78%). The total number of accidents over the collection period is analyzed through the occupational accident frequency (Table 2), the accidents vs frequency bar chart (Fig. 2), and the Pareto chart (Fig. 3).
Description of type of occupational accident and its nomenclature for use in the algorithm
Frequency of occupational accident type in plumbing activity

Occupational accidents vs frequency bar chart for the plumbing activity, 2005-2022.

Pareto chart for the plumbing activity occupational accidents.
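The percentages reported for the most frequent accident types, and the cumulative shares that a Pareto chart plots, follow directly from the case counts and the total of 59 accidents. The short sketch below reproduces that arithmetic for the four types whose counts are stated in the text; the variable names are illustrative.

```python
TOTAL_ACCIDENTS = 59

# Case counts stated in the text for the four most frequent accident types.
counts = {
    "Meniscus rupture": 10,
    "Fibrosis and scar tissue formation in the skin": 6,
    "Infection in the skin": 5,
    "Leg injury": 5,
}

# Share of each type over all accidents, in percent.
shares = {name: round(100 * n / TOTAL_ACCIDENTS, 2) for name, n in counts.items()}

# Cumulative share in descending order of frequency, as drawn on a Pareto chart.
cumulative = []
running = 0
for name, n in sorted(counts.items(), key=lambda item: item[1], reverse=True):
    running += n
    cumulative.append((name, round(100 * running / TOTAL_ACCIDENTS, 2)))
```

These four types together account for 26 of the 59 accidents, so the cumulative curve reaches about 44% after the fourth bar.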
A review was carried out to define the variables that influence the risk prediction model, listed in Table 3, identifying a total of 7 variables. The variables were defined from the occupational accident investigation reports together with the occupational health and safety coordinator. The dataset contains quantitative values and 101 registry data, of which 100 are predictor variables (type of accident) and 1 is the class variable (accident/non-accident).
Input variables definition
In Step 3, for each accident, the characteristics corresponding to the administrative controls that the worker must observe when carrying out activities in the workplace were identified and described: a) Gloves: whether the worker was wearing gloves at the time of the event; b) Glasses: whether the worker was wearing protective glasses at the time of the event; c) Protective safety boots: whether the worker was wearing the appropriate boots at the time of the event; d) Industrial double-filter mask: whether the worker had the appropriate mask during the execution of the activity at the time of the event; e) Jeans without rips: whether the operator was wearing jeans without rips when performing the task at the time of the event; f) Long-sleeved vest: whether the operator was wearing a long-sleeved vest at the time of the event; and g) Mishandling of machinery related to musculoskeletal accidents: whether the worker followed adequate safety steps and handling procedures for machinery at the time of the event. Each characteristic is marked 0 when the control was not complied with at the time of the accident and 1 otherwise. An example is illustrated in Fig. 4, after which each record is classified as either “accident” or “non-accident.”
In Step 4, results are obtained with the algorithms used for the prediction of the accident rate, based on the data from the accidents that occurred between 2005 and 2022. The evaluation of the data involved applying three data mining techniques, with the variables as input. The technique that yielded the best classification result using the seven characteristics corresponding to the administrative controls was selected for further analysis with the software Weka (Hall et al., 2009).
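The 0/1 encoding of the seven controls described in Step 3 can be sketched as follows. The key names and the example record are hypothetical, and treating a missing entry as non-compliance is an assumption made only for this illustration.

```python
# Seven controls from Step 3; the key names are hypothetical labels for this sketch.
CONTROLS = [
    "gloves",
    "glasses",
    "safety_boots",
    "double_filter_mask",
    "jeans_without_rips",
    "long_sleeved_vest",
    "proper_machinery_handling",
]


def encode_record(complied):
    """Map a dict of control -> bool to the 0/1 vector used as model input.

    1 means the control was complied with, 0 means it was not.
    A control absent from the dict is treated as not complied with (assumption).
    """
    return [1 if complied.get(name, False) else 0 for name in CONTROLS]


# Hypothetical event: no glasses and ripped jeans at the time of the event.
example = {
    "gloves": True,
    "glasses": False,
    "safety_boots": True,
    "double_filter_mask": True,
    "jeans_without_rips": False,
    "long_sleeved_vest": True,
    "proper_machinery_handling": True,
}
row = encode_record(example)  # [1, 0, 1, 1, 0, 1, 1]
```

Each encoded row is then paired with its class label (accident or non-accident) before being passed to the classifiers.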
First, the Logistic Regression, Naive Bayes, and Decision Trees algorithms were configured to fit the classification data, categorized by accident type and controls, using 10-fold cross-validation. The results generated for each algorithm are shown in Table 4. Based on these results, it was observed that the percentage of correctly classified instances (Accident or Non-Accident) was similar for the Logistic Regression and Decision Trees (J48) algorithms. Due to its simple graphical visualization, the occupational health and safety coordinators preferred the implementation of the Decision Trees model.
In Step 5, the error rate and classification accuracy were calculated for each prediction algorithm, giving the outcomes presented in Table 5. Five data treatments were executed, as the values obtained remained consistent across multiple algorithm runs. Based on the superior results from the validation and training tests achieved by the Decision Tree algorithm (J48), the stability of the results was assessed. This evaluation began with initial values of 10% (Seed) and involved partitioning the dataset into three different configurations for training and validation. This process facilitated an analysis and verification of the accuracy of the selected data mining model. The results are shown in Table 6. A stability test of the algorithm was performed using an analysis of variance, as shown in Table 7. In this table, it can be observed that the calculated F-value is less than the tabulated F-value, indicating that there is no statistically significant difference between the replicas. This implies that, at a 99.5% confidence level, the obtained values are statistically equal and the model is stable.

Example of the dataset used in the algorithm for accident classification.
Results of classification algorithms applying 10-fold cross-validation. (*accuracy)
Values between classification algorithms (*Accuracy)
Replicas with the J48 classification algorithm and their overall average
Analysis of Variance (ANOVA)
In Step 6, the previous results show that the decision tree (J48), with an average of 90.5% of correctly classified instances and a rate of incorrectly classified instances of 9.5%, is the algorithm that best predicts occupational accidents in plumbing activity. The results obtained produced the decision tree shown in Fig. 5.

Decision tree for the prediction of occupational accidents.
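The accuracy and error rate reported in Step 6 are complementary: the share of incorrectly classified instances is one minus the share of correctly classified ones (90.5% and 9.5% in this study). A minimal sketch of how both are obtained from actual and predicted classes is shown below; the ten labels are hypothetical and unrelated to the study's data.

```python
def classification_summary(actual, predicted):
    """Count confusion-matrix cells and derive accuracy and error rate.

    Class 1 = accident, class 0 = non-accident.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)  # accidents correctly predicted
    tn = sum(a == 0 and p == 0 for a, p in pairs)  # non-accidents correctly predicted
    fp = sum(a == 0 and p == 1 for a, p in pairs)  # false alarms
    fn = sum(a == 1 and p == 0 for a, p in pairs)  # missed accidents
    accuracy = (tp + tn) / len(pairs)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "error_rate": 1.0 - accuracy}


# Hypothetical labels for ten records.
actual = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
summary = classification_summary(actual, predicted)
```

In a prevention setting the missed accidents (`fn`) are usually the costlier error, so it can be worth reporting them alongside the overall accuracy.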
After the results have been obtained, Step 7 is the proposal of the action plan for the mitigation of the accident rate, which focuses on addressing the causes identified and related through the decision tree. It was formulated as a knowledge management model based on Occupational Health and Safety for the prevention of occupational accidents in the plumbing activity, and it is made up of the following training activities:
1) Initial activity (discussion among the operators: how do they communicate with each other about the proper way to operate machinery?);
2) Feedback from the expert to the group leaders on the inappropriate postures that have been observed in the operators;
3) Delivery of brochures to workers regarding the importance of maintaining proper postures while performing activities classified as hazardous;
4) The trainers’ intervention regarding the consequences for operators of assuming incorrect postures during their activities;
5) The leading operator’s testimony on how his daily routine has been impacted by the absence of proper postures during the execution of his tasks;
6) Feedback from the operators and the trainers;
7) Delivery of brochures to workers emphasizing the importance of using Personal Protective Equipment (PPE);
8) Trainers’ intervention to explain how the failure to use Personal Protective Equipment (PPE) can impact the health of each operator;
9) The leading operator’s testimony on how his health has been negatively affected by the absence of Personal Protective Equipment;
10) Final feedback from the operators and the trainers;
11) Closing of the training;
12) Positioning posters in strategic locations within the company;
13) Mailbox for questions, where workers can raise doubts they have regarding safety and health issues at work;
14) Rotation of workers every three days across different types of work, to reduce each worker’s contact with the sewage canals;
15) Acknowledgment by the group leaders of operators who perform their tasks in accordance with the prescribed safety measures.
Do
The process of customizing the action plan for mitigating occupational accidents in the plumbing activity within the construction sector is currently in the implementation phase. Its progress is shown in Table 7.
Occupational accident and risk prevention plan
Check
Upon completing the implementation phase, a pilot program was carried out with four operators to assess whether the information shared in the program was applied correctly.
Act
After the verification stage has been reviewed, the next steps involve:
Conducting periodic evaluations of the operators to determine their understanding of the content taught in the training program for the action plan aimed at mitigating occupational accidents in plumbing activities within the construction sector.
Discussion
In this study, data mining techniques were applied within the PDCA methodology (Fig. 1) to model and predict occupational accidents in the plumbing activity in the construction sector in Colombia. The contribution of this model is the integration of data mining as a fundamental part of the PDCA continuous improvement cycle to predict occupational accidents in plumbing activity. The occurrence of occupational accidents was associated with the use of Personal Protective Equipment and with occupational conditions in the workplace. The approach seeks to benefit from data mining and to increase precision in developing a plan for the prevention of occupational accidents (Table 7).
The review of reports of occupational accidents in the plumbing activity allowed us to identify the frequency of accidents, the type of accident, and the resulting injury or condition (Tables 1 and 2, Figs. 2 and 3). With the use of data mining techniques, it was possible to predict the causes that contribute to the probability of an accident at work in the workplaces of the plumbing activity in the construction sector.
Thanks to the project’s development, it has become evident that several issues have a significant impact:
The use of torn jeans by workers, which can lead to skin diseases and infections; inadequate machinery handling, which contributes to musculoskeletal diseases; the absence of personal protection glasses in the work area, which results in occupational eye injuries and diseases (as shown in Fig. 5); and the workspace itself, which can influence accident occurrence since employees need to be transported to locations predetermined by the company, and sometimes these sites lack optimal physical conditions.
According to [8, 9], the aging workforce in the plumbing activity is the population most vulnerable to occupational accidents. In this study, the older workforce is close to 40 years old, which seems to agree with the literature. This group also shows resistance to the use of Personal Protective Equipment, so it is constantly trained.
The literature review revealed that some researchers, such as [23], use data mining to predict occupational accidents on construction sites through the random forest algorithm, which combines several decision trees built from randomly chosen data samples, and whose precision was 71.3%. For accident prediction in plumbing activities, a similar methodology was employed, involving five runs, with an average classification accuracy of 90.5% (as shown in Table 5). This suggests that the decision tree technique offers higher prediction accuracy for the dataset analyzed here.
Other authors used data mining to assess the risk of occupational accidents with data provided by the Ministry of Labor of Iran [24]. They ran the model on three types of cases; the first case, which yielded the greatest precision (89.3%), has the following characteristics: a) workplaces with fewer than 17 workers who work one, two, or three shifts; b) workplaces in which workers work one, two, or three shifts without a work permit; c) workplaces with fewer than 17 workers that do not have a technical protection and health committee; d) workplaces with fewer than three workers that do have a technical protection and health committee; e) workplaces without a technical protection and health committee that do not have stipulated work hours; and f) workplaces without a technical protection and health committee. Compared with that result, the precision obtained in this study for the prevention of occupational accidents in the plumbing activity, 90.5%, indicates that the use of decision trees is viable, given its close approximation to the correctly classified values.
Based on the information gathered from occupational accidents in the plumbing activity spanning from 2005 to 2022, a decision tree has been constructed. This decision tree highlights the administrative controls that are not being properly implemented, which, in turn, contribute to the rise in workplace accidents within the company. Thanks to the decision tree, it is possible to extract and analyze real data, which can then be used by occupational health and safety coordinators to proactively prevent accidents. The classification of occupational accidents using the decision tree technique has allowed for the development of a plan aimed at preventing workplace accidents. This plan has had a positive impact on worker health, and it has also contributed to the reduction and control of factors that previously led to occupational illnesses and sick leave. Furthermore, it has had a positive influence on labor productivity.
Limitations
Our study had some limitations. The first limitation pertained to the number of participating companies in the study, which was influenced by voluntary participation. The second limitation was that a significant portion of companies in the construction sector showed little interest in sharing information regarding occupational accidents.
Conclusion
A novel approach, the Data Mining–PDCA Cycle–OHS (PDCA-DM-OHS), was proposed in this study to predict occupational accidents among Colombian plumbing workers in the construction sector. The results obtained indicate that the following variables enable the prediction of occupational accidents in plumbing activities within Colombian companies:
a) The use of protective safety boots; b) Mishandling of machinery; c) Failure to use personal protection glasses; and d) Wearing jeans without tears.
By analyzing these characteristics, it is feasible to create a data mining model based on the decision tree (J48), which is highly beneficial for developing a plan aimed at preventing occupational accidents. This plan is designed in accordance with the PDCA (Plan-Do-Check-Act) methodology to effectively control risk factors (biological, chemical, physical, and biomechanical). Strategies have been devised to control and mitigate the factors that contribute to occupational accidents, thereby enhancing the working conditions of employees within construction sites. It is imperative to raise the usage rates of Personal Protective Equipment (PPE) among plumbing workers in the Colombian construction sector.
Footnotes
Acknowledgments
The authors would like to express their gratitude to the plumbing activity operators and supervisors, as well as to the Universidad Industrial de Santander.
Ethical approval
The study was developed under the Declaration of Helsinki and was approved by the Non-Interventional Research Ethics Committee at Universidad de San Buenaventura, Cali (Date of approval 20 December 2021).
Informed consent
All participants were informed that participation was based on the principles of confidentiality and volunteerism. Before data collection, informed consent was obtained from all participants.
Conflict of interest
The authors declare that they have no conflict of interest.
Funding
This work was funded by grants from the Vicerrectoría de Investigación y Extensión (Project Code 3769), Universidad Industrial de Santander.
