Abstract
Addressing undeclared work is a high priority for government policymakers in the labor field, since it adversely affects all involved parties and results in significant losses in tax and social security contribution revenues. In recent years, the wide use of ICT in labor inspectorates and the considerable progress in data exchange have resulted in numerous databases dispersed across various units, yet these are not used effectively to increase the productivity of the inspectorates' functions.
This study presents a detailed analysis of a data mining project, following the CRISP-DM methodology, that aims to assist labor inspectorates in dealing with undeclared work and other labor law violations. It uses real data from past inspections merged with companies' characteristics and their employment details, and examines the application of two Associative Classification algorithms, CBA and CBA2, in combination with two types of datasets, a binary and a four-class one. The produced models are assessed against the data mining goals and the initial business objectives, and the research concludes by proposing an innovative inspections recommendation tool shown to offer two major benefits: a mechanism for planning targeted inspections of improved efficiency and a knowledge repository for enhancing the inspectors' understanding of the features linked with labor law violations.
Keywords
Introduction
Undeclared work is a complicated and multisided socio-economic issue and is considered one of the structural parts of the shadow economy. At the EU level, it is defined as “paid activities that are lawful as regards their nature but not declared to the public authorities, taking into account differences in the regulatory system of Member States”, yet criminal activities are excluded from the scope [1, 2, 3, 4]. Thus, only the legal and payable activities that are not declared to the state institutions with the aim to evade taxes, social security contributions, and compliance with labor laws, are attributed to undeclared work.
This fraudulent practice can be found in a wide variety of workplaces and sectors and involves workers of different backgrounds and profiles, which makes its measurement and monitoring significantly difficult. The scale and growth of the phenomenon cannot be estimated well, since it is by definition hidden, takes various forms, and is influenced by a wide range of economic, social, institutional, and cultural factors; in turn, it negatively affects workers, companies, and states [3, 5, 6].
Informal workers are usually paid below the minimum wage and have little or no coverage under labor and social security law, while also being excluded from benefits in case of disease, work accident, or unemployment. Compliant companies suffer unfair competition when their competitors use undeclared work, paying lower wages and evading taxes. Finally, informal employment undermines the sustainability of the state social model through a reduced collection of taxes and social security contributions and, in the long term, damages the economy and competitiveness by lowering working conditions and obstructing skills development [4, 5].
Dealing with undeclared workers is, therefore, high on the list of challenges facing governments and labor inspection authorities, both at the national and international levels [5]. Yet, while labor inspectorates play a strategic role in responding to informal work and transforming it into formal employment, they often suffer many financial and human resource gaps and lack effective tools, thus facing significant difficulties in carrying out their functions [5]. Hence, and since undeclared work is not static but constantly evolving, innovative solutions and improved techniques must be incorporated into the labor inspection planning and practices to identify and prevent incidences of this phenomenon with success.
In recent years, labor inspectorates have made substantial progress in using ICT in their business processes and, as a result, there exist several databases with information on companies and workplaces, employment details, and inspection visits and findings. However, in most cases, the authorities do not use these data efficiently to analyze the patterns and frequency of undeclared work and to adapt to new approaches. Instead, their planning of inspection visits is usually focused on high-risk sectors and seasonal work, whereas the selection is often arbitrary or contingent on complaints or the inspectors' discretion [7]. On other occasions, they might use risk analysis systems configured with manually created rules, based on the experience of labor inspectors. Yet previous studies have shown that even the most experienced labor inspection professionals can have a prejudiced conception of the patterns of non-compliant employers, and thus leave a large proportion of companies out of planned inspection visits [8].
At this point, advanced data mining and machine learning techniques offer a promising opportunity, since they can automatically reveal interesting and actionable insights into the patterns of undeclared work and help decision-makers develop customized policies to prevent or reduce the occurrences of this practice. The aspiration of the current data mining project is, thus, to build a classification system using past inspections data that offers two main uses: first, as a tool that classifies companies into risk levels according to their likelihood of being found with labor law violations, which can contribute to planning targeted labor inspection visits; second, as a knowledge repository containing the discovered patterns of those attributes that are highly associated with labor law violations, which can be exploited to enhance the labor inspectors' understanding and ability to identify these violations, and also to assist decision-makers in designing successful campaigns focused on specific target groups.
This work constitutes an extended study of our first application of Associative Classification (AC) in the domain of targeted inspections to address the issue of undeclared work [9]. AC is a machine learning method that combines Classification and Association Rule Mining (ARM) [10], aiming to produce a set of Class Association Rules (CARs) to be used in assigning class labels to unknown data instances as accurately as possible. The scope of this data mining project is to further analyze the use of AC algorithms in this domain through four different approaches, which result from an expansion in two parts. First, in the dataset construction phase, we produce two final datasets for building the models: one with two class labels, as in [9], and another with four class labels. Second, in the model building phase, two AC algorithms are applied: CBA [11], as in [9], and CBA2 [12], an enhanced successor of CBA. Consequently, four different classification models are obtained from the above combinations, which are then compared and assessed using well-known data mining evaluation measures, and also with respect to the initial success criteria of the project.
The analysis of the project is guided by the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology [13], which describes a set of phases and processes a data mining project should follow to be solid but also flexible, quick but also qualitative. The application is based on the business needs of the Hellenic Labor Inspectorate and on real-life data containing findings of inspection visits conducted in Attica in 2018–2019, integrated with data on companies and employment. Ultimately, this study shows that the aforementioned machine learning technique is able to reach the business goals and help the Authority improve its functioning at various levels.
In the following section, the related work is introduced, and the rest of the paper analyzes the project, consistent with the CRISP-DM phases. Section 3 introduces the Hellenic Labor Inspection Authority and the goals of this project, from both a business and a data mining perspective; Section 4 refers to the data repositories available and the most interesting database tables for this project, and Section 5 studies the construction process of the final dataset, including several steps such as cleaning, anonymization, primary feature selection, derived feature creation, discretization, and categorization. Next, Section 6 analyzes the building of the four classifiers through training and testing, and their evaluation against the data mining goals, and Section 7 reviews the overall project assessment in relation to its original business goals and determines the best option for establishing an inspections recommendation system. Section 8 concludes the study and identifies future work.
Related work
In our first machine learning application in the domain of undeclared work detection [8], we presented the use of Association Rule Mining (ARM) [10], a well-known method based on Frequent Pattern Analysis, which discovers interesting correlations, i.e., association rules, between attributes in a large dataset [14]. The goal of that application was to enhance the inspectors' ability to identify employers likely to be involved in undeclared work and other labor law violations, through the extraction of patterns that reveal specific attributes strongly associated with each other and with those violations. We used real-life data of around 2.5K labor inspection visits conducted in an area of Attica in 2018–2019, and the study showed that this method could offer advantages to the Labor Inspection Authority at different levels. At the local offices level, it can provide the labor inspectors with valuable and interpretable knowledge regarding the behavior patterns of both the compliant and non-compliant employers of their area, which can then be used in designing their targeting plans. At the central administration level, it can yield an overall understanding of trends and prevalence of undeclared work and other violations of labor law provisions across the country, which can support decision-makers in designing improved and adapted preventative and curative measures to address these violations.
In addition, the study revealed some specific characteristics highly correlated with undeclared and underdeclared work that were earlier believed to indicate compliance with labor law, which surprised even the experts of the labor inspectorate. More specifically, it was verified that the attribute ‘rare changes’ (when employers rarely or never change the working schedule of their employees) is considerably associated with undeclared and underdeclared work, and that ‘often changes’ indicates significant compliance of employers; however, the labor inspection professionals were convinced of exactly the opposite and had even manually created a rule in their risk analysis system based on this perception. Also, it was demonstrated that most of the inspection visits triggered by complaints uncovered underdeclared work, i.e., work that is partially declared to the authorities with the aim of limiting taxes and social security contributions, whereas fully undeclared work was mostly exposed through scheduled inspections; yet the inspectors believed that complaints are mainly linked with fully undeclared work.
The above study revealed various patterns of correlated attributes regarding inspection findings, companies' details, and employment data; yet its critical contribution was the opening of a promising new research area of data mining and machine learning in the domain of tackling undeclared work.
Indeed, in [9], we introduced our first application of an Associative Classification (AC) algorithm, CBA (Classification Based on Associations) [11], where the generated classifier exceeded the initial objectives of the project, since it reached an efficiency of more than 65%, and it additionally disclosed a set of intriguing and comprehensible patterns regarding compliant and non-compliant employers' practices with respect to labor law, hence reinforcing their recognition by the labor inspectors in the field.
In the present extended study, we additionally present the use of another similar AC algorithm, CBA2 [12], introduced by the same authors, Liu et al., in 2001. CBA was the first AC algorithm proposed, constructing an associative classifier in two main steps. First, all Class Association Rules (CARs) that satisfy the user-defined minimum support (minsup) and confidence (minconf) thresholds are produced using the Apriori [10] algorithm. Then, sorting and pruning follow, where rules not covering enough data instances of the dataset are removed. Finally, a default class is determined and added to the classifier, to be used in classifying new unseen data instances that are not covered by any CAR.
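The two CBA steps described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the implementation of [11]: candidate rules are enumerated by brute force instead of Apriori, ties in rule precedence are broken by rule length, a rule is kept if it correctly classifies at least one still-uncovered training instance, and the default class is the majority class of the instances left uncovered. The error-based truncation of the rule list is omitted. The attribute values and class labels in the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_cars(rows, labels, minsup, minconf, max_len=2):
    """Enumerate class association rules (itemset -> class) by brute force."""
    n = len(rows)
    cond_count = Counter()  # rows containing the itemset
    rule_count = Counter()  # rows containing the itemset AND carrying the class
    for items, label in zip(rows, labels):
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(items), k):
                cond_count[itemset] += 1
                rule_count[(itemset, label)] += 1
    cars = []
    for (itemset, label), cnt in rule_count.items():
        support = cnt / n
        confidence = cnt / cond_count[itemset]
        if support >= minsup and confidence >= minconf:
            cars.append((itemset, label, support, confidence))
    # Precedence: higher confidence, then higher support,
    # then shorter rule (assumed tie-breaker for this sketch).
    cars.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return cars

def build_classifier(cars, rows, labels):
    """Keep rules that correctly classify an uncovered row; add a default class."""
    uncovered = list(range(len(rows)))
    classifier = []
    for itemset, label, _, _ in cars:
        matched = [i for i in uncovered if set(itemset) <= rows[i]]
        if any(labels[i] == label for i in matched):
            classifier.append((itemset, label))
            uncovered = [i for i in uncovered if i not in matched]
    # Default class: majority among uncovered rows, or overall if none remain.
    remaining = [labels[i] for i in uncovered] or labels
    default = Counter(remaining).most_common(1)[0][0]
    return classifier, default

def classify(classifier, default, items):
    """Return the class of the first matching rule, else the default class."""
    for itemset, label in classifier:
        if set(itemset) <= items:
            return label
    return default

# Hypothetical toy inspections: each row is a set of attribute=value items.
rows = [
    {"trigger=complaint", "changes=rare"},
    {"trigger=complaint", "changes=rare"},
    {"trigger=scheduled", "changes=rare"},
    {"trigger=scheduled", "changes=often"},
    {"trigger=scheduled", "changes=often"},
    {"trigger=complaint", "changes=often"},
]
labels = ["non-compliant"] * 3 + ["compliant"] * 3

cars = mine_cars(rows, labels, minsup=0.3, minconf=0.8)
classifier, default = build_classifier(cars, rows, labels)
print(classifier)
print(classify(classifier, default, {"changes=rare", "trigger=scheduled"}))
```

Because the resulting classifier is an ordered list of readable rules, the same structure serves both prediction and inspection by domain experts.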
CBA produces effective and competent classifiers, as also demonstrated in [9], yet it uses a single minsup, which is inadequate for datasets with an uneven class frequency distribution, since in such cases the produced classifier may not include sufficient rules for infrequent classes. This problem is solved by CBA2, which uses multiple minimum class supports in the rule generation phase, i.e., the user-specified total minsup is distributed to each class according to the class distribution in the given dataset. This allocation of minsup to the different classes of the dataset ensures that sufficient rules are generated for scarce classes and prevents the generation of excessive, overfitting rules for frequent classes.
Last, the CRISP-DM methodology used for the analysis in this study was developed in 1996 by SPSS [13] and has been used in a wide range of projects, at both the commercial and the research level [e.g., 15–18]. It consists of six main phases, each composed of various tasks and generally following one another, although iterations are possible when the results are not satisfactory or need further examination. The first phase concerns Business Understanding and the assessment of the present situation. At this phase, the goals of the project are also determined and further translated into a data mining problem with distinct success criteria. Then, in the Data Understanding phase, the available data are collected, explored, and described, focusing on their source, values, and quality. The Data Preparation phase is dedicated to appropriately integrating the available information and transforming it into an organized dataset to be used in the Modeling phase, where the models are selected, built, and assessed. Next is the Evaluation phase, which assesses the degree to which the models meet the initial business objectives and reviews the whole data mining engagement, also covering quality assurance issues. The final tasks of this phase include the identification of possible next steps and activities, highlighting the reasons for and against each option. The last phase is Deployment, which is carried out when the evaluation results are satisfactory; otherwise, a new CRISP-DM cycle is launched. At this final stage, the deployment strategy is determined, and a monitoring and maintenance plan for the data mining project is drawn up. Finally, a final report is produced, including all deliverables and documentation and summarizing the results.
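The proportional allocation of the total minsup can be sketched as follows, assuming that each class receives the total minsup multiplied by its relative frequency in the dataset. The class labels, class distribution, and total minsup value below are hypothetical and serve only as an illustration.

```python
from collections import Counter

def class_minsups(labels, total_minsup):
    """Distribute the total minsup over the classes in proportion to class frequency."""
    n = len(labels)
    return {c: total_minsup * cnt / n for c, cnt in Counter(labels).items()}

# Hypothetical, deliberately imbalanced class distribution.
labels = ["compliant"] * 80 + ["underdeclared"] * 15 + ["undeclared"] * 5
minsups = class_minsups(labels, total_minsup=0.01)
for cls, ms in minsups.items():
    print(cls, ms)
```

With a single minsup of 0.01, a rule for the scarcest class would need to cover 1% of all instances, i.e., a fifth of that class; under the proportional scheme its threshold is twenty times lower, so rules for that class are far more likely to survive rule generation.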
Understanding the business
The organization
The analysis of the present study is based on real business needs of a labor law inspection and enforcement authority, the Hellenic Labor Inspectorate (HLI), and was carried out in close cooperation with the domain experts at its different phases. The HLI lies under the Ministry of Labor and Social Affairs in Greece, aiming to guarantee a decent working environment by ensuring the application of the labor law provisions in both fields of labor relations and occupational safety and health. This is accomplished by providing advice to employers and employees regarding the most effective means of complying with labor legislation and regulations, but also by performing inspections in workplaces to examine the lawful insurance coverage and employment of workers, aiming to protect their rights. Thus, one of the main objectives of the Authority is to efficiently identify and deal with undeclared work and other labor law violations and, at the same time, use its resources as effectively as possible.
The HLI consists of the labor relations inspection services and the occupational safety and health inspection services, having central offices in Athens and 125 local departments countrywide, with around 650 labor inspectors in total. At present, the labor market consists of around 300K active registered employers and around 2M active registered employees, with both numbers generally increasing each year. This fact, along with the reduction of the Authority's workforce by around 30% over the last decade and the increasingly complex labor regulatory setting, has exposed the HLI's difficulties in carrying out its role and the critical need for new and advanced tools to be incorporated into its activities to achieve higher inspection yields.
For the last three years, the HLI has owned a risk analysis tool able to rank employers based on a calculated risk, using predefined user-specified rules composed of various criteria. This tool can be employed in the planning and selection of onsite inspections, yet it has some major disadvantages resulting in a low level of acceptance by the labor inspectors. To begin with, the risk analysis rules need to be regularly updated by hand, following the results of new scheduled inspections, since the system does not support this process automatically. Moreover, the creation and update of the rules rely on the experience and perceptions of labor inspection professionals, who can have biased ideas regarding patterns of labor law non-compliance, thus rendering the tool misleading, as was shown in [8]. Another issue concerns the configuration of the tool, which is done by experts at the central offices of the Authority since it requires specific competency. As a result, the local inspectors do not participate in this process and become unwilling to follow the tool's suggestions and inspect companies scored as ‘risky’ without actually knowing the factors affecting this score. The above issues brought about the gradual abandonment of this risk analysis tool and, as a consequence, the selection of companies for inspection is still often done arbitrarily, or is based on reported complaints and whistleblower information (
Business objectives and goals
Yearly, the HLI inspectors perform around 35K on-site visits to inspect labor relations and around 30K visits to check the conditions of occupational safety and health. Focusing on the results of the labor relations inspections, only about half of them uncover violations, of which around 4% correspond to fully undeclared work and nearly 35% to underdeclared work. However, the latest EU survey on undeclared work, conducted in September 2019 (Special Eurobarometer 498 [4]), reported that in Greece, on the demand side, 27% of the respondents admitted having paid for goods or services that they believed involved undeclared work and, on the supply side, 59% of the respondents said that they personally know someone who does not declare all or part of their income to tax or social security authorities, with the corresponding rates at 10% and 33% at the EU level. The figures evidently differ significantly between EU Member States, with Greece displaying one of the highest rates of fully or partially undeclared work prevalence in the EU.
Thus, considering these statistics and the said operational difficulties the HLI encounters, the primary business objective of the Authority is to detect and assimilate into its business processes an advanced method that will assist it to, first, overcome its functioning constraints and, secondly, use its resources more properly, aiming to reach an increased proficiency in dealing with the problem of undeclared work.
Consequently, this study focuses on the analysis of an advanced tool to accomplish the above two goals of the Authority, which can be converted into the following two equally important success criteria. First, the tool must offer significant help to the labor inspectors in selecting the companies to be inspected, attaining higher effectiveness in detecting labor law violations than the present one, which lies around 50%. In addition, the tool must not need manual update and configuration, but should work fully automatically. Second, the tool should provide the labor inspectors with understandable knowledge of the factors that affect each of the tool's suggestions regarding the companies selected for inspection visits. This knowledge will greatly enhance the labor inspectors' ability to identify by themselves the factors indicating labor law violations and, moreover, it will keep them actively involved in the inspection selection process, thus helping to overcome their objections to a wide use of the tool. In addition, at the administrative level, this knowledge will strongly support the decision-makers in designing targeted campaigns for specific groups of companies exhibiting increased levels of non-compliance, to advance the transition of undeclared work into the formal economy.
Situation assessment
Human resources
The present data mining project is based on the needs and data of the Hellenic Labor Inspectorate; hence, it required the regular engagement and use of the Authority's resources throughout all the phases of the analysis.
Most importantly, regarding the personnel, there had to be close cooperation of the researchers of this study with staff from different backgrounds and posts within the Organization, depending on the various phases of the project analysis. As a start, the head and upper management of the HLI needed to specify the project objectives and later contribute to assessing its results considering the initial business success criteria.
Labor inspection experts had to offer an overall comprehension of the Authority's operation and tasks, as well as the business processes in place that are related to the project. They were also closely engaged in explaining to the data mining researchers the functional meaning and relations of all available data in the data understanding phase, and they contributed significantly with their domain knowledge to the various stages of the dataset construction.
The IT experts' contribution was crucial in understanding the information systems infrastructure and interoperability, and they cooperated closely with the researchers in collecting the necessary data, as well as assessing their quality. Last, the Data Protection Officer (DPO) of the Authority was also involved in the data collection stage, offering advice regarding anonymization techniques and ensuring compliance with the GDPR.
Data resources
The data sources mainly used in this project were derived from the two information systems of the Authority, the Employment Information System (Employment-IS, also called ERGANI) and the Integrated Information System of the HLI (HLI-IIS), which interoperate at various points. The Employment-IS has been used since 2013 by employers, who are under the legal obligation to declare any changes in their business employment before these are applied. Thus, in this IS, the labor inspectors can access the employment details of a specific company and know in real time the declared working hours, wages, contract information, etc.
The HLI-IIS is composed of several subsystems digitalizing most of the Authority's business processes, but most relevant to the project are the Inspections subsystem and the Complaints filing process. The Inspections subsystem has been used since 2016 by all labor inspectors, who are required to enter all their inspections data, findings, confirmed violations, and any imposed administrative measures, along with all actions performed by the companies or the inspectors after the inspection visit. Regarding the complaints, the HLI receives anonymous information about possible violations of labor law provisions mostly through three channels. The first is a five-digit telephone line operating at headquarters, the second is a form on the portal site, and the third is any type of citizen contact with the HLI local departments, such as telephone, email, or physical presence. In all three cases, the details of each complaint are formally recorded in the HLI-IIS, and the local department that is accountable for further investigating the complaint, according to its territorial area of responsibility, is automatically informed about its details through the HLI-IIS document management system.
Last, other data resources that proved useful were various documentation regarding labor law provisions and means of application, as well as business processes of the Authority.
Assumptions, constraints, and risks
At this early phase of the analysis, the researchers of this study had to state some assumptions made about the project and some possible constraints and risks that could have jeopardized its success.
Regarding the thoroughness of the Inspections subsystem data, it was assumed that all the inspectors comply with the obligation to insert all their inspections correctly into the system and update them when necessary. Similarly, it was assumed that all complaints are formally registered into the HLI-IIS and soon directed to the correct local department for investigation. In their turn, the local inspectors were assumed to correctly flag the ‘trigger’ of an inspection visit as complaint, when this is initiated due to a complaint or whistleblower information. Also, it was generally assumed that the accuracy and quality of data are good and with a low level of missing values, especially in critical attributes that would be used in the data mining analysis.
Concerning the HLI personnel engagement with the project and their cooperation with the researchers, it was assumed that there would be a high level of availability with respect to time, required effort, and domain knowledge, in all its stages.
Possible constraints and risks in the successful completion of the study included the inability to access the necessary data due to various technical complications or limited access rights to data sources because of GDPR restrictions.
Last, other likely risks that would cause major problems in the progress of the research relate to an inadequate contribution of the Authority staff, possibly as a result of poor collaboration between personnel in critical posts or disbelief in the study's potential benefits.
Data mining goals
Following the above thorough analysis of the Organization at different levels, the last task of this phase is the transformation of the business objectives into a clearly specified data mining problem, with the respective success criteria defined in technical terms, to move forward with the analysis.
Hence, this study focuses on the analysis of an innovative classification and prediction method that can use past experiences, i.e., past inspections data, and produce patterns of employers' practices indicating compliance or non-compliance with labor law. A tool embodying this machine learning method should be able to predict, with increased accuracy, the employers likely involved in undeclared work and other labor law violations and thus assist the Authority in planning inspection visits. This tool must not need manual configuration or update, but ought to be able to update itself automatically, learning from the new inspections data inserted into the HLI-IIS.
In addition, the patterns produced relating to the employers' practices must be easily understandable and provide the labor inspectors with valuable insights into those attributes that are highly correlated with these practices, thus improving their expertise in detecting violations and their ability to plan effective inspection visits.
In conclusion, the success criteria of this data mining problem include the provision of a set of comprehensible rules classifying employers as ‘compliant’ or ‘non-compliant’ with a predictive accuracy higher than the current inspection efficiency of 50%; the target was finally set to 65%.
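The accuracy criterion above can be expressed as a simple check. The sketch below is only illustrative: it assumes that reaching exactly 65% counts as success, and the predicted and actual labels are hypothetical.

```python
BASELINE_EFFICIENCY = 0.50  # current share of inspections that find violations
TARGET_ACCURACY = 0.65      # success threshold set for the classifier

def accuracy(predicted, actual):
    """Fraction of instances whose predicted class equals the actual class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def meets_success_criterion(predicted, actual, target=TARGET_ACCURACY):
    """Assumption of this sketch: the target is met when accuracy >= target."""
    return accuracy(predicted, actual) >= target

# Hypothetical test-set labels: 7 of 10 predictions are correct.
actual = ["non-compliant"] * 5 + ["compliant"] * 5
predicted = ["non-compliant"] * 4 + ["compliant"] * 4 + ["non-compliant"] * 2
print(accuracy(predicted, actual), meets_success_criterion(predicted, actual))
```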
INSPECTIONS table
The INSPECTIONS database table of the HLI-IIS contains all the inspections of the HLI since 2016.
At this phase, the researchers had to consult jointly with labor inspection experts and with the database administrators of the Authority to comprehend the available data in detail, both from an operational point of view and in terms of how they are interpreted and managed in the databases. The challenges faced mostly related to the highly complicated database schemas and to choosing the data to be collected and used for the project. Thus, at this point, the data selection criteria had to be specified clearly, as well as the reasoning on which they were based.
HLI-IIS data
First, it was realized that the HLI maintains a large volume of data regarding the actions taken after an inspection visit, such as imposed fines and other administrative measures, objections, appeals, court decisions, etc. Clearly, though, all these actions are highly related to human decisions and are affected by various unregistered factors. It was therefore decided to leave them out of this data mining analysis, whose scope is mainly to predict the discoveries of inspection visits and not the final outcome of each inspection case. Hence, only the findings and discovered violations of past inspections are used for classification purposes in this study, i.e., the class attribute relies on the onsite findings and infringements and not on any data related to the further course of the inspection cases.
Table 1 describes the INSPECTIONS table containing all the inspection cases inserted into the HLI-IIS since it was put in place. Each inspection case is associated with a specific company and branch, and its type and initiation trigger are indicated. Regarding the type of inspection, this study is focused on addressing undeclared work and other labor relations violations, which fall within the inspection scope of only one of the two types; therefore, this analysis used the past inspections data of only the labor relations inspection services. The occupational safety and health inspections were omitted since they deal with labor accidents and labor safety conditions. Consequently, the initiation trigger, which normally takes four values (see Table 1), is reduced to two values in our study (i.e., complaint/scheduled), since accident investigations are not included, and labor disputes constitute another type of labor relations inspection that is related to solving issues between employer and employees and is not examined within the scope of our study.
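The selection described above can be sketched as a simple filter. The field names (`type`, `trigger`) and the value encodings below are hypothetical, since the actual column names and codes of the INSPECTIONS table are not reproduced here.

```python
# Hypothetical encodings of the inspection type and initiation trigger.
KEPT_TYPE = "labor_relations"
KEPT_TRIGGERS = {"complaint", "scheduled"}

def select_inspections(inspections):
    """Keep labor relations inspections triggered by a complaint or scheduled."""
    return [
        row for row in inspections
        if row["type"] == KEPT_TYPE and row["trigger"] in KEPT_TRIGGERS
    ]

# Hypothetical inspection records.
inspections = [
    {"id": 1, "type": "labor_relations", "trigger": "complaint"},
    {"id": 2, "type": "labor_relations", "trigger": "scheduled"},
    {"id": 3, "type": "labor_relations", "trigger": "labor_dispute"},
    {"id": 4, "type": "safety_and_health", "trigger": "accident_investigation"},
]
selected = select_inspections(inspections)
print([row["id"] for row in selected])
```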
INSP_FND table
The INSP_FND database table of the HLI-IIS contains all the findings related to the conducted inspections. One inspection can have multiple findings.
INSP_INFR table
The INSP_INFR database table of the HLI-IIS contains all the infringements related to the conducted inspections. One inspection can have multiple infringements.
COMPLAINTS table
The COMPLAINTS database table of the HLI-IIS contains all the filed complaints to the HLI. The data relating to the employer are given by the person filing the complaint.
The findings and infringements of all the inspection visits exist in the INSP_FND and INSP_INFR database tables respectively (see Tables 2 and 3). An inspection visit can consist of multiple findings and infringements, with only the second ones to be considered negative discoveries and require further investigation. If an inspection visit is registered to contain only findings, then the employer was found compliant with labor law provisions. Yet, inspection visits to non-compliant employers (i.e., with infringements registered) may include some (positive) findings as well.
In light of this, and given that the main scope of the study is to discriminate between compliant and non-compliant employers, the most important data selection criterion for our analysis was to include in the final dataset one entry for each inspection visit in which the employer was found compliant, and one entry for each infringement when the employer was found non-compliant with labor law. The reasoning behind this criterion lies in the data mining goal of discovering different patterns of employer non-compliance for different types of violations (e.g., undeclared work, underdeclared work, other violations), whereas we are not interested in discovering different patterns of compliance for different findings, but only in whether the employer was compliant at all.
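The record-construction criterion can be illustrated with a minimal Python sketch. The function name, the `visit` key, and the infringement labels used here are hypothetical placeholders, not fields of the HLI-IIS tables; the sketch only shows the rule "one record per compliant visit, one record per infringement otherwise".

```python
from typing import Dict, List


def build_records(visit_id: str, infringement_types: List[str]) -> List[Dict[str, str]]:
    """Return the dataset records contributed by one inspection visit.

    A visit without infringements yields a single record; a visit with
    infringements yields one record per infringement (findings are ignored).
    """
    if not infringement_types:
        # Compliant employer: exactly one record for the whole visit.
        return [{"visit": visit_id, "result": "NO_INFR"}]
    # Non-compliant employer: one record per registered infringement.
    return [{"visit": visit_id, "result": t} for t in infringement_types]


# Toy input: one compliant visit and one visit with two infringements.
records = build_records("V1", []) + build_records("V2", ["UDW", "OTHER_INFR"])
```

Under this rule, the two toy visits contribute three records in total, which is why several records of the dataset may stem from the same inspection visit.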
The last interesting data source from the HLI-IIS for our project is the COMPLAINTS database table, given in Table 4. This table holds the filed complaints as described in §3.3.2, with each one of them likely connected with an inspection visit. The purpose of using this information in our analysis is to investigate the correlations between the inspections outcome and the complaint data if any.
For each entry in the dataset, some data related to the corresponding company and its employment status are retrieved from the Employment-IS and integrated to analyze possible correlations of those features with the inspection discoveries. Table 5 describes the EMPLOYERS database table with all the registered employers since March 2013 from where interesting attributes for our study can be obtained, such as the legal form, the starting date, and the business sectors of the company.
EMPLOYERS table
The EMPLOYERS database table of the Employment-IS contains all the registered employers since 2013.
EMPL_BRANCHES table
The EMPL_BRANCHES database table of the Employment-IS contains all the registered branches of the employers since 2013. An employer can have one or multiple branches registered.
Employers may have multiple branches registered, which are all saved in the EMPL_BRANCHES database table, illustrated in Table 6. This table can offer information about the specific business sector of the branch and its area.
Concerning the employment details, these are obtained from the various types of declarations the employers are obliged to submit at specific dates or deadlines, according to the labor law regulations. All these employer declarations are stored in the EMPL_DECLARATIONS database table described in Table 7. The volume of data in this table is huge, considering that a company can submit an unlimited number of declarations for each of its branches, depending on its needs.
EMPL_DECLARATIONS table
The EMPL_DECLARATIONS database table of the Employment-IS contains all the declarations of the employers regarding their business employment since 2013. An employer can submit for each branch multiple declarations.
DECL_CHNG_EMPLOYEES table
Some types of employer declarations (in Table 7) can refer to changes in employment details of one or more of the company employees. So, an entry in Table 7 can be connected with one or more entries in the DECL_CHNG_EMPLOYEES database table of the Employment-IS, which contains the declared employment changes for each employee.
More specifically, the Employment-IS offers around 50 different types of electronic services to the employers who are required to use each one of them on different occasions to keep the state informed about their business employment. Yet, this study cannot examine the effects associated with such a large diversity of declaration types, but its interest focuses mostly on the types that modify the working schedule and wage of the employees, which are the major characteristics of a company employment status. Each of these declarations can refer to changes in the employment of one or more employees of the company branch. Thus, an entry in Table 7 can be connected with one or more entries in the DECL_CHNG_EMPLOYEES database table (see Table 8) that contains the declared employment changes for each employee.
While examining the data and selecting the sources of information based on their relevance to the goal of our project, the researchers and the HLI IT personnel also investigated their quality, mainly with respect to the connections and keys of the database tables, the missing values of the attributes, and the formatting and encoding of the values. In general, the databases were considered well-structured, with no major issues detected, and the table connections worked well in the integration stage of the selected data fields, as described in the following section. Some missing values and a few wrong (future) dates were detected, yet they did not notably affect the progress of the analysis.
Preparing the data
This phase covers all the tasks relating to the creation of the final dataset, which was subsequently used for training and assessing the classifiers. The knowledge and engagement of the domain and ICT experts of the Authority and their cooperation with the researchers were required for this challenging stage to be completed as planned. In particular, the HLI IT experts were more actively involved in implementing the first three tasks presented below, following the DPO guidelines, according to which the data had to be anonymized before their delivery to the researchers of the study for further analysis.
Data selection
Selecting the data to be fed into a data mining model is, one can say, the most significant part of the process since the obtained results depend totally on the data quality and relevance to the project objectives, the distribution of the attribute values, the limits on data volumes, and many other features, with each one being able to affect the quality of the results drastically. Thus, data selection should be made with care, both in the dimension of attributes but also of records – i.e., in columns and rows respectively – and inclusion or exclusion criteria should be documented in detail.
Summarizing here the selection criteria outlined in the previous section, the dataset was defined to be structured over the findings and infringements of the past labor relations inspection visits. Namely, each record in the dataset corresponds to either an inspection visit encountering a labor law compliant employer, thus, its ‘result’ attribute is marked as ‘NO_INFR’ – i.e., no infringements, see also §5.6 and Table 10 – or, to an infringement of an inspection visit detecting a non-compliant employer, with its ‘result’ containing the type of the infringement (see Table 3). Regarding the employment data, these are selected based on the employers declarations of those types that report/modify the employees weekly working hours and wages.
Beyond the above criteria, the time period of data collected and used is another type of criterion to be identified as well. In this study, the decision was grounded on two facts: the first was the obligation starting date of the inspectors inserting their inspection data into the HLI-IIS, which was the beginning of 2018, hence ensuring that the data would be more complete and accurate after that date; the second was the pandemic of Covid-19 appearing to vastly impact the Greek labor market after March of 2020, thus employment data of that year were widely affected by an overly uncommon factor, whose effects, though, lie out of the scope of this study. Consequently, here, only the data of past inspections conducted in 2018–2019 were chosen to be used.
The last data selection criterion applied concerned the area of inspection, opted to be Attica, the broader area of the country capital. The reasoning behind this choice rests on the proven broad diversity of employment trends and patterns between the various regions of the country, with this variety being dominated by multiple seasonality and locality factors, such as tourism, or local festivities at specific periods. Hence, the area of Attica was a safe decision for our initial application of Associative Classification in this domain since it does not exhibit substantial differentiation in employment trends and figures within the year, and data collection with this criterion can supply an adequate volume of information for our analysis.
Applying the aforesaid criteria, 18,988 records were collected, with around half of them corresponding to inspection cases of labor law compliant employers and the rest representing violations found at non-compliant companies.
Data integration
As previously explained, the dataset was determined to be constructed on the basis of findings and infringements (Tables 2 and 3), and the rest of the attributes considered interesting, to be accordingly integrated. Thus, data fields related to the inspection visit were merged, such as the date and time, the day and type of day, the total workers found, and the initiation trigger (Table 1). The source and the objects of the complaint were also integrated (Table 4) but contained data only in the cases where the inspection is triggered by a complaint. Other information joined pertained to the company (Table 5) – such as the legal form, the business starting date, and the types of business industry – and some others to the inspected employer branch (Table 6), i.e., city, municipality, region, and the branch business sector.
Employment information was challenging to be combined in the dataset since it had to be drawn from the employers declarations (Table 7) and the changes in employment details (Table 8). Yet, both of those tables hold numerous entries connected with an inspected employer branch, which, in this study, constitutes the inspected entity.
After consultation with the domain experts of the Authority, it was identified that obtaining some meaningful insights into the employment status of the inspected entity required a summarization of these data. So, for each dataset entry, the average weekly working hours, and the average hourly wage of all the employees of the inspected location were calculated based on all the declarations submitted in the six months period before the inspection. In addition, the sum of all these declarations was also joined in the dataset as a new attribute. Regarding the total number of employees in the company and of those in the inspected branch-entity, these two figures were taken from the last declaration before the inspection.
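The summarization step can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (`date`, `weekly_hours`, `hourly_wages`) is hypothetical, the six-month period is approximated by 183 days, and the averages are taken over all employee entries of the declarations in the window.

```python
from datetime import date, timedelta
from statistics import mean


def summarize_declarations(declarations, inspection_date, window_days=183):
    """Summarize the declarations submitted in the window before an inspection."""
    start = inspection_date - timedelta(days=window_days)
    # Keep only declarations submitted within the window before the inspection.
    recent = [d for d in declarations if start <= d["date"] < inspection_date]
    hours = [h for d in recent for h in d["weekly_hours"]]
    wages = [w for d in recent for w in d["hourly_wages"]]
    return {
        "avg_weekly_hours": mean(hours) if hours else None,
        "avg_hourly_wage": mean(wages) if wages else None,
        "declaration_count": len(recent),
    }


# Toy declarations of one branch; the first one falls outside the window.
declarations = [
    {"date": date(2018, 6, 1), "weekly_hours": [40], "hourly_wages": [7.0]},
    {"date": date(2019, 1, 10), "weekly_hours": [40, 20], "hourly_wages": [5.0, 4.0]},
    {"date": date(2019, 5, 2), "weekly_hours": [30], "hourly_wages": [6.0]},
]
summary = summarize_declarations(declarations, date(2019, 6, 1))
```

How the study weighted employees who appear in several declarations is not stated, so the plain average used here is only one possible reading.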
Last, the researchers wished to integrate information related to the compliance level of the inspected entity, prior to the last inspection. Actually, they wanted to investigate if the employers are affected by inspection visits and to what degree they subsequently change their level of compliance with labor law. Hence, the sum of inspections conducted before the inspection date was calculated, as well as the sum of detected violations in these inspections, in the period of 2018–2019. These figures, later on, were used to indicate the level of past labor law compliance of the inspected entity.
Data cleaning and anonymization
Following the integration of all attributes considered interesting with respect to the project objectives, data cleaning was required before proceeding with the next stages. A few records were found with no values in some attributes, possibly due to momentary technical complications in the saving process of the software application; these were omitted from the dataset, leaving 18,548 records.
Anonymization was next required by the DPO since any data linking company details with inspection findings could not be made known to the researchers. Therefore, attributes identifying companies and their branches, such as Tax ID, name, and address, were omitted from the dataset, which, at this stage, contained the attributes listed in Table 9.
Dataset – after the anonymization stage
At the data preparation phase, after the anonymization stage, the dataset comprises the attributes listed above. Those in bold letters were subsequently selected as primary attributes.
At this point, the researchers obtained a dataset collected using some selection criteria identified at the data understanding phase per the project objectives. Yet, this dataset is comprised of 25 attributes (Table 9), which are way too many for a model to process and produce interpretable patterns of employers compliance and non-compliance to labor law. The next step at this stage was, then, a more detailed examination of each one of them and the final selection, based on their interestingness and importance to this study, to come up with a smaller set of attributes.
The primary attributes were selected either to participate themselves in the data mining analysis or to be used in creating other derived attributes for the analysis. Principally, as explained earlier, the attribute ‘result’ plays the role of the ‘class’ in our classification problem and thus was firstly selected since the classifiers to be produced aim to predict the result of the inspection visits. Among the inspection features, the ‘starting time’, the ‘day’, and the ‘type of trigger’ were picked as most intriguing, to check if and how they can be associated with specific violations. If the analysis were to focus specifically on seasonal employment, the ‘inspection starting date’ attribute could have been used as well, after clustering in seasons. Yet, in the selected area of Attica, the ratio of seasonal businesses is not high, denoting that seasonality itself would have a low impact on employment figures and trends in this area along the year. For this reason, the ‘inspection date’ was not preferred for inclusion in the final dataset, compared to other attributes.
Regarding the ‘source’ and ‘objects’ of complaints, these were not of much concern for further investigation to the Authority, and also, values existed in only 19% of the records, i.e., in the cases of initiation due to a complaint (see ratio in Table 10), hence they were omitted from the analysis.
Final dataset
At the completion of the data preparation phase, the final dataset consists of the attributes listed above. Their categorical values with their corresponding ranges (at the numerical attributes) are presented, along with their ratio in the collected records.
From the company-related attributes, the ‘legal form’ was selected aiming to discriminate the business leadership between single-person and directing-board and explore the way this feature is correlated with specific levels of compliance with labor law. The ‘business starting date’, serving to distinguish between old and new companies, could have been used similarly, yet its values range was identified as very broad and tricky to be partitioned, so it was finally neglected.
Among the four available attributes in the dataset related to the ‘business sector’, that of the inspected entity was chosen, i.e., of the inspected branch of the employer, being the most relevant to the specific inspection case.
The attributes identifying the inspection location area are the ‘city’, the ‘municipality’, and the ‘region’ of the branch, out of which the ‘region’ was selected to participate in the analysis since it had fewer values and did not need further categorization. Furthermore, the ‘total number of employees in the branch’, and not ‘in the company’, was chosen to identify the size of the inspected entity. Finally, the remaining attributes related to the employment status and the level of past labor law compliance of the inspected entity were all chosen to be used in the construction of derived features, as analyzed below.
Derived attributes are new attributes constructed from one or more of the existing ones. In this study, we built two new attributes whose association with the results of an inspection the domain experts were curious to see: the level of past compliance with labor law and the frequency of changes in employment.
The level of ‘past compliance’ was calculated by dividing the sum of detected past infringements by the sum of past inspections – if any – in the inspected entity. Yet, in only around one-third of the cases, previous inspections data were available, whereas, in two-thirds of the cases, the entity was being inspected for the first time, as later noticed.
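As a minimal sketch of this calculation (the function name is hypothetical, and returning `None` for entities without past inspections is an assumption made for illustration):

```python
from typing import Optional


def past_compliance(past_infringements: int, past_inspections: int) -> Optional[float]:
    """Ratio of past infringements to past inspections of an inspected entity.

    Returns None when the entity has never been inspected before, so that
    this case stays distinguishable from a ratio of zero.
    """
    if past_inspections == 0:
        return None
    return past_infringements / past_inspections


first_time = past_compliance(0, 0)      # no earlier inspection on record
clean_history = past_compliance(0, 3)   # inspected before, no infringements
repeat_offender = past_compliance(3, 2)
```

Since one inspection can register several infringements, the ratio is not bounded by 1.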
The ‘frequency of changes’ corresponds to the number of changes in employment details per employee in the inspected entity, in the last semester prior to the inspection. The labor inspection experts believed that this attribute reveals non-compliance when it takes high values, i.e., in cases of frequent submissions of changes, though in our first machine learning application in this domain [8] exactly the opposite was shown.
Discretization and clustering
Most of the primary and derived attributes, as presented earlier, take either numeric values or values from a wide range of category codes. Yet, Associative Classification, the modeling technique we are using in this study, as also many other machine learning methods, requires the selected attributes to be taking discrete values. Thus, before building the classifiers, one last task is necessary, discretization, in which the values range of numeric attributes is partitioned into different non-overlapping intervals, with each one being portrayed by a discrete value. Similarly, the multiple category codes of the categorical attributes must get clustered, i.e., aggregated into groups according to a homogeneity feature, with each group being represented by a single value.
Discretization of quantitative data values and clustering of qualitative data values demanded an in-depth understanding of their meaning and consultation with the labor inspection experts since an arbitrary grouping of data values would produce meaningless data mining results.
Starting from the values of the ‘result’ as the most significant attribute and considering the study focus on addressing undeclared work, the around 200 category codes corresponding to the different infringements were clustered into three main groups: undeclared work (UDW), underdeclared work (UNDER_DW), and other infringements (OTHER_INFR). These three values plus the no infringements (NO_INFR) constitute the new categorical values of this attribute, which, as said, plays the role of the class. Keeping this categorization, we have a four-class classification problem, whereas grouping the violations categories into one, the infringements (INFR), we transform it into a binary, i.e., a two-class problem. In the following section, we are studying both types of classification problems, in combination with the two selected AC algorithms.
The ‘inspection time’ was readily partitioned into three 8-hour time zones, the ‘inspection day’ was split into weekday and weekend, and the around 40 category codes of the ‘legal form’ attribute were segmented into Sole proprietorship and Corporation, determining the business leadership. Similarly, the numerous ‘business sector’ codes of the inspected entity were divided into four main groups: Hotels-Restaurants-Catering, Production-Construction, Sales, and Services. The ‘region’ attribute had seven categorical values, constituting the different parts of Attica, and did not need further grouping.
The ‘entity size’ was created by discretization of the quantitative attribute of the ‘total number of employees in the branch’ relying on labor regulations defining the ranges of each of the categories: small, medium, large, and very large size as illustrated in Table 10.
The employment-related attributes, i.e., the ‘average weekly working hours’, the ‘average hourly wage’, and the constructed ‘frequency of changes’, all taking numerical continuous values, had to be discretized as well. Thus, the new feature ‘employment’, illustrating the level of average working time in the inspected entity, was created to take the values: low employment for up to 16 hours per week, medium employment corresponding to 17–32 hours per week, and full employment for more than 32 hours per week. Accordingly, the new attribute ‘payment’ exhibiting the level of average payment in the branch was generated to pick values from: low, medium, high, and very high paid, with their ranges as displayed in Table 10. Last, the ‘frequency of changes’ was partitioned per the labor inspectors perception and expertise ending with the categories: rare changes with up to 2 changes per employee within the last semester before the inspection, medium frequency changes matching to 2–4 changes per employee, often changes and very often changes with ranges 4–10 and more than 10 changes per employee accordingly.
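The discretization of two of these attributes can be sketched as below, using the thresholds stated in the text. The handling of boundary values is an assumption: each stated upper bound is treated as inclusive, so a value of exactly 2 changes counts as rare and any average above 16 and up to 32 hours counts as medium. The ‘payment’ ranges are given only in Table 10 and are therefore not reproduced here.

```python
def employment_level(avg_weekly_hours: float) -> str:
    """Map average weekly working hours to the 'employment' category."""
    if avg_weekly_hours <= 16:
        return "low"
    if avg_weekly_hours <= 32:
        return "medium"
    return "full"


def change_frequency(changes_per_employee: float) -> str:
    """Map changes per employee in the last semester to a frequency category."""
    if changes_per_employee <= 2:
        return "rare"
    if changes_per_employee <= 4:
        return "medium"
    if changes_per_employee <= 10:
        return "often"
    return "very often"


levels = [employment_level(h) for h in (12, 24, 40)]
frequencies = [change_frequency(c) for c in (1, 3, 7, 12)]
```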
Lastly, the derived numerical attribute ‘past compliance’ had to be turned into categorical. For values above zero, the range was split into four intervals as shown in Table 10, matching to categories: low, medium, high, and very high non-compliance. Along with these, two more categories participated: compliant, to correspond to the cases that no infringements had been detected in the past inspections, and uninspected, to be attributed to those cases where no past inspection had been registered for the inspected entity.
At this point, having completed all data preparation steps, we have constructed a dataset of 12 categorical attributes and 18,548 records. Table 10 presents the final attributes, their categorical values with their corresponding ranges (for the numerical attributes), and their ratio of occurrence in the obtained dataset.
Building the models
In this section, the selected modeling techniques are described for the reader to understand the main steps of the algorithms used. The training, testing, and assessment plans are then presented, followed by the construction of the models and the final appraisal per the data mining goals.
Modeling technique
In this study, Associative Classification (AC) is put forward for building our classification models for various reasons. Primarily, studies [19, 20, 21] have proved that this innovative machine learning method is able to create classifiers with higher predictive accuracy than other traditional data mining techniques, such as decision trees [22, 23] and rule induction [24, 25]. Foremost though, the appropriateness of the method relies on its core characteristics that assure fulfillment of our project goals. Among them, AC produces very simple knowledge in the form of rules that are easily interpretable and manually updated by the end-user if need be [19]. Moreover, this data mining approach often discovers further hidden knowledge neglected by other classification methods, hence succeeding in minimized error rate. This feature derives from Association Rule Mining (ARM) [10] being used in the training phase of AC, where all possible correlations among the attribute values and the class values in the dataset are identified and extracted [19].
AC algorithms normally follow three main phases. In the first, training phase, the algorithm generates the Class Association Rules (CARs), i.e., rules in an if-then form revealing the connections among attribute values and class values. In the second, pruning phase, ranking and pruning processes are activated to exclude weak rules from the set of CARs, according to certain user-defined thresholds. At this phase, conflicting and duplicate rules are also omitted. The last, testing phase aims to examine the classifier on a new separate dataset and assess its predictive accuracy and error rate [19].
The AC problem is defined by Liu et al. as follows [11]:
‘Let
In this study, we are using two AC algorithms, CBA and CBA2, whose key features are described below.
CBA
CBA [11] was one of the first algorithms demonstrating the use of Apriori technique [10] in solving classification challenges, which, in the learning phase finds frequent ruleitems, i.e., associations among attribute values and a class label, whose support and confidence satisfy the user-defined minimum support (minsup) and minimum confidence (minconf) thresholds.
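To make the two thresholds concrete, the following sketch computes the support and confidence of a single ruleitem on a toy dataset. The attribute names, values, and records are invented for illustration, and the code evaluates one candidate rule only; it is not the Apriori-based rule generation itself.

```python
def matches(condset, attrs):
    """True if every attribute-value pair of the rule body holds in the case."""
    return all(attrs.get(a) == v for a, v in condset.items())


def support_confidence(dataset, condset, label):
    """Support and confidence of the ruleitem condset -> label."""
    cond_count = sum(1 for attrs, _ in dataset if matches(condset, attrs))
    rule_count = sum(1 for attrs, y in dataset if matches(condset, attrs) and y == label)
    support = rule_count / len(dataset)
    confidence = rule_count / cond_count if cond_count else 0.0
    return support, confidence


# Toy training cases: (attribute values, class label).
toy = [
    ({"day": "weekend", "trigger": "complaint"}, "INFR"),
    ({"day": "weekend", "trigger": "scheduled"}, "INFR"),
    ({"day": "weekend", "trigger": "scheduled"}, "NO_INFR"),
    ({"day": "weekday", "trigger": "scheduled"}, "NO_INFR"),
]
support, confidence = support_confidence(toy, {"day": "weekend"}, "INFR")
# The ruleitem is kept only if it passes both user-defined thresholds.
is_frequent_and_confident = support >= 0.01 and confidence >= 0.50
```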
In the pruning phase, CBA ranks the rules based on their confidence, support, and length and prunes redundant rules according to the database coverage method. This method starts with the highest-ranked rule and marks for deletion all those training cases covered by this rule, which is added into the classifier. It similarly proceeds with the rest of the rules and, when a rule cannot cover a training case, i.e., the rule body does not match the attribute values of any training instance, then this rule is omitted. This process ends when the training dataset gets empty, or all rules have been assessed. In the second case, the remaining unused training instances will generate the default class rule, i.e., the class with the largest frequency among them, which is also inserted into the classifier. The default class rule is fired in the prediction phase in cases when there is no other rule fitting the test instance [19].
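The database coverage procedure, as described above, can be sketched as follows. This is a simplified illustration, not the reference implementation: it keeps a rule whenever its body matches at least one remaining training case, whereas the original CBA additionally checks that the rule classifies at least one of those cases correctly. Preferring shorter rules on ties and falling back to the majority class of the full training set when no case remains are assumptions.

```python
from collections import Counter


def covers(body, attrs):
    """True if the rule body matches the attribute values of a case."""
    return all(attrs.get(a) == v for a, v in body.items())


def rank(rules):
    """Sort (body, label, confidence, support) by confidence, support, then length."""
    return sorted(rules, key=lambda r: (-r[2], -r[3], len(r[0])))


def database_coverage(rules, training):
    """Select rules in rank order and derive the default class."""
    remaining = list(training)
    classifier = []
    for body, label, _conf, _sup in rank(rules):
        if not remaining:
            break  # every training case is already covered
        if any(covers(body, attrs) for attrs, _ in remaining):
            classifier.append((body, label))
            remaining = [c for c in remaining if not covers(body, c[0])]
    pool = remaining if remaining else training
    default_class = Counter(y for _, y in pool).most_common(1)[0][0]
    return classifier, default_class


training = [
    ({"day": "weekend", "size": "small"}, "INFR"),
    ({"day": "weekend", "size": "large"}, "INFR"),
    ({"day": "weekday", "size": "small"}, "NO_INFR"),
    ({"day": "weekday", "size": "large"}, "NO_INFR"),
    ({"day": "weekday", "size": "small"}, "NO_INFR"),
]
rules = [
    ({"size": "large"}, "NO_INFR", 0.5, 0.2),
    ({"day": "weekend", "size": "small"}, "INFR", 1.0, 0.2),
    ({"day": "weekend"}, "INFR", 1.0, 0.4),
]
classifier, default_class = database_coverage(rules, training)
```

In this toy run the more specific weekend rule is dropped because the general weekend rule has already covered its cases, which illustrates why the resulting classifier tends to be compact.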
The classifier resulting from CBA employing the database coverage pruning method is usually smaller in size than other AC algorithms, offering an advantage in applications where a compact set of rules is preferred for the users to effortlessly maintain the classifiers.
CBA2
CBA, using Apriori in the training phase, has also inherited some of its weaknesses, one of which is the exponential growth of rules [12]. If the minsup is set too low then too many unnecessary and overfitting rules will be produced, whereas if it is set too high then very few or no rules will be produced for the infrequent classes. Yet, in many classification problems, the classes are not evenly distributed, as it also happens in our project when dealing with the four-class case study, where, e.g., the frequency of undeclared work in our dataset is 4%.
This problem is addressed by CBA2, the enhanced version of CBA, proposed by the same authors [11, 12], which measures the frequency of the classes in the given dataset and assigns a different minimum support value to each class. Namely, the user-defined minsup is distributed to each class based on their class distribution in the dataset, ensuring sufficient rules generation for rare classes and preventing excessive rules production for frequent classes.
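A minimal sketch of this distribution of the minimum support, assuming that each class receives the user-defined minsup multiplied by its relative frequency in the dataset. The toy class counts are invented, apart from the 4% share of undeclared work mentioned above.

```python
from collections import Counter


def class_minsups(labels, total_minsup):
    """Distribute the user-defined minsup over classes by their frequency."""
    counts = Counter(labels)
    n = len(labels)
    return {c: total_minsup * k / n for c, k in counts.items()}


# Toy class distribution with a rare class at 4% of the records.
labels = ["NO_INFR"] * 50 + ["OTHER_INFR"] * 46 + ["UDW"] * 4
minsups = class_minsups(labels, total_minsup=0.01)
```

With a total minsup of 1%, the rare class obtains a threshold of 0.04% and the frequent class one of 0.5%, so rules for the rare class are not suppressed by a single global threshold.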
Training, testing, and assessment design
Ten-fold cross-validation
For training and testing the classifiers, the prepared dataset must be split into two non-overlapping parts, one to be used for learning and the other for testing. Several techniques for this process exist in the literature. One of them is ten-fold cross-validation, which is widely used in the data mining research community [26, 27, 28, 29] and has been shown to be fair and accurate; it is adopted in this study as well.
This method partitions the dataset in 10 folds and repeats the training and testing process 10 times, in which the classifier is trained on the 9 folds and tested on the other fold to produce prediction results. At the end of the classifier training and testing across the ten-fold cross-validation method, classification results are obtained for all the dataset instances, which are all placed in a matrix as described below and further used to calculate the prediction measures and assess the classifier performance.
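The partitioning scheme can be sketched in a few lines of Python. The shuffling with a fixed seed and the round-robin assignment of instances to folds are assumptions made for illustration; how the library used in the study forms its folds is not stated here.

```python
import random


def k_fold_splits(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Round-robin assignment keeps the fold sizes within one instance of each other.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_splits(25))
# Every instance is used for testing exactly once across the ten rounds.
tested = sorted(j for _, test in splits for j in test)
```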
Confusion matrix
To evaluate the prediction results of the classifiers, we use the confusion matrix structure [30], where the classification results are displayed in detail. Table 11 illustrates the confusion matrix for the two-class classification problem, where the rows represent the true classes and the columns represent the predicted classes. In this two-class type of problem, the INFR plays the role of the positive class and the NO_INFR corresponds to the negative class. The positives classified correctly by the classifier are called True Positives (TP), and their misclassifications are the False Negatives (FN). Accordingly, for the negatives, the correct classifications are the True Negatives (TN), and the wrong ones are the False Positives (FP).
Confusion matrix for the binary (two-class) classification problem
The two-class type classification results are presented in a table of the following structure.
Confusion matrix for the four-class classification problem
The four-class type classification results are placed in a table of this form.
In the case of the four-class classification problem, we use the confusion matrix in Table 12 to present the prediction results. This matrix is a variation of the typical confusion matrix used in the literature for multi-class classification problems [30], which normally treats only one cell per row and column as True Positive (TP), whereas the rest in the row are regarded as False Negatives (FN), and the rest in the column as False Positives (FP). This variation is made here solely because of the structure of this project's dataset.
Recall that the dataset construction was grounded in the findings and infringements of past inspections. Namely, the inspection cases of compliant employers were represented by one record in the dataset and labeled with the class NO_INFR, whereas the inspection cases of non-compliant employers may be represented by more than one record, since, in inspection cases where violations were discovered, one record was inserted for each violation, labeled with the class corresponding to the violation. This practically means that there can be records in the dataset with identical attribute values but labeled with different infringement classes (UDW, UNDER_DW, OTHER_INFR), and these records can derive from the same inspection visit. In other words, these three infringement classes are not mutually exclusive with one another, yet each is mutually exclusive with the fourth one, NO_INFR.
With this in mind and under the scope of this study, we regard these three infringement classes as parts of a broader class, the INFR class, yet, in the case of the four-class classification problem, we keep them discriminated for examining them in more detail. In this respect, when a testing case of infringement is predicted to fall under a different infringement class, then it is considered and attributed to True Positives, as shown in Table 12.
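This counting convention can be sketched as follows, assuming the predictions are available as (true class, predicted class) pairs; the function name and the toy pairs are illustrative only.

```python
INFRINGEMENT_CLASSES = {"UDW", "UNDER_DW", "OTHER_INFR"}


def collapse_to_binary(pairs):
    """Count TP/FN/FP/TN, treating all infringement classes as one positive class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for true_label, predicted_label in pairs:
        truly_infr = true_label in INFRINGEMENT_CLASSES
        predicted_infr = predicted_label in INFRINGEMENT_CLASSES
        if truly_infr and predicted_infr:
            counts["TP"] += 1  # includes predictions of a different infringement class
        elif truly_infr:
            counts["FN"] += 1
        elif predicted_infr:
            counts["FP"] += 1
        else:
            counts["TN"] += 1
    return counts


pairs = [
    ("UDW", "OTHER_INFR"),   # different infringement class, still a true positive
    ("UDW", "UDW"),
    ("UNDER_DW", "NO_INFR"),
    ("NO_INFR", "UDW"),
    ("NO_INFR", "NO_INFR"),
]
counts = collapse_to_binary(pairs)
```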
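The evaluation metrics presented below are calculated using the prediction results of the confusion matrices and are used to assess the classifiers' performance in various aspects. Two common classification evaluation metrics are the Accuracy (Acc) (Eq. (1)), which is the ratio of correctly classified data instances to the total number of instances, and its complement, the Error Rate (Err) (Eq. (2)), also called the misclassification rate [30].
\[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
\[ \mathrm{Err} = 1 - \mathrm{Acc} = \frac{FP + FN}{TP + TN + FP + FN} \tag{2} \]
Yet, in application domains where the users are more interested in the prediction performance of the classifier for the positive class, as is the INFR class here, two more metrics are suitable, the Precision and the Recall. Precision ($\mathrm{Prec}$) is the ratio of true positives to all instances predicted as positive, and Recall ($\mathrm{Rec}$) is the ratio of true positives to all actually positive instances:
\[ \mathrm{Prec} = \frac{TP}{TP + FP}, \qquad \mathrm{Rec} = \frac{TP}{TP + FN} \]
Precision and Recall are usually complementary quantities: higher Precision may be obtained at the price of lower Recall and vice versa. If we need a single measure to compare different classifiers, the F-score ($F$), the harmonic mean of Precision and Recall, can be used:
\[ F = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}} \]
One last evaluation metric indicative for this study is the Specificity ($\mathrm{Spec}$), the ratio of true negatives to all actually negative instances:
\[ \mathrm{Spec} = \frac{TN}{TN + FP} \]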
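The metrics named above can be computed from the four cells of a binary confusion matrix as in the following sketch. It assumes the standard definitions of these measures, with the F-score taken as the harmonic mean of Precision and Recall; the counts are toy values, not results of this study.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fn, fp, tn):
    """Compute the evaluation metrics from binary confusion matrix counts."""
    total = tp + fn + fp + tn
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    return {
        "accuracy": _ratio(tp + tn, total),
        "error_rate": _ratio(fp + fn, total),
        "precision": precision,
        "recall": recall,
        "f_score": _ratio(2 * precision * recall, precision + recall),
        "specificity": _ratio(tn, tn + fp),
    }


# Toy counts for illustration only.
scores = evaluate(tp=40, fn=10, fp=20, tn=30)
```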
Other figures considered in this study while assessing the models are the total number of CARs representing the size of the classifier, the average number of attributes per rule corresponding to the average length of CARs, and the processing time needed for its learning and testing.
Last, the distribution of the CARs per class is also calculated to enhance the understanding of the classifiers performance and their assessment.
Classification results of the CBA-2class model
Classification results of the CBA2-2class model
Classification results of the CBA-4class model
Classification results of the CBA2-4class model
Prediction metrics and other indicators of the four models
The measurements of the two four-class models (two last columns) are calculated considering the models as binary (two-class), i.e., when an instance of an infringement class is classified into a different infringement class (i.e., among UDW, UNDER_DW, OTHER_INFR), then it is considered a True Positive (TP) (see Table 12).
As explained earlier, this project uses two types of datasets, one with two classes and one with four classes, and applies two AC algorithms, CBA and CBA2. Combining these two variations generates four different models: (a) CBA with the two-class dataset, referred to as CBA-2class from here on, (b) CBA2 with the two-class dataset, referred to as CBA2-2class, (c) CBA with the four-class dataset, referred to as CBA-4class, and (d) CBA2 with the four-class dataset, referred to as CBA2-4class.
All four classifiers were built using LAC [31], a Java library for Associative Classification. The system ran on Windows 10 Pro 64-bit, with an Intel Core i5-10210U 2.3 GHz processor and 8 GB of memory. For the parameter settings of both CBA and CBA2, the recommendations of the algorithms' authors were followed [11, 12], i.e., the minimum support was set to 1% and the minimum confidence to 50%. Tables 13–16 display the classification results of the models CBA-2class, CBA2-2class, CBA-4class, and CBA2-4class, respectively.
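To make the two thresholds concrete, the sketch below computes the support and confidence of a single class association rule over a toy dataset and checks them against the 1% and 50% settings. It is plain Python and does not use the LAC library; the attribute names and records are hypothetical.

```python
MIN_SUPPORT = 0.01     # 1%, as set in the study
MIN_CONFIDENCE = 0.50  # 50%, as set in the study


def support_and_confidence(records, antecedent, label):
    """Support and confidence of the rule `antecedent -> label`.

    records    : list of (attribute_dict, class_label) pairs
    antecedent : dict of attribute -> required value
    support    : share of all records matching antecedent AND label
    confidence : share of antecedent-matching records that carry the label
    """
    covered = [
        cls for attrs, cls in records
        if all(attrs.get(k) == v for k, v in antecedent.items())
    ]
    correct = sum(1 for cls in covered if cls == label)
    support = correct / len(records) if records else 0.0
    confidence = correct / len(covered) if covered else 0.0
    return support, confidence


# Hypothetical toy records (attribute names are illustrative only).
toy = [
    ({"time": "night", "changes": "rare"}, "INFR"),
    ({"time": "night", "changes": "rare"}, "INFR"),
    ({"time": "night", "changes": "rare"}, "NO_INFR"),
    ({"time": "morning", "changes": "frequent"}, "NO_INFR"),
]
sup, conf = support_and_confidence(toy, {"time": "night", "changes": "rare"}, "INFR")
keep_rule = sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE
```

In this toy case the rule covers three records, two of which are INFR, so it clears both thresholds.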
Table 17 summarizes the values of all evaluation metrics for all four models. For the two four-class models, a data instance of an infringement class that is predicted to fall into a different infringement class is counted as a True Positive (TP), following the structure of Table 12.
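The convention described here can be sketched as a small function that collapses a four-class confusion matrix into binary counts, treating any infringement-to-infringement prediction as a True Positive. The matrix values below are invented for illustration and do not reproduce Tables 15 or 16.

```python
INFRINGEMENT_CLASSES = {"UDW", "UNDER_DW", "OTHER_INFR"}


def collapse_to_binary(matrix):
    """Collapse matrix[actual][predicted] counts into (tp, fp, fn, tn).

    Any instance of an infringement class predicted into ANY infringement
    class counts as a True Positive, as in the Table 12 convention.
    """
    tp = fp = fn = tn = 0
    for actual, row in matrix.items():
        for predicted, count in row.items():
            actual_pos = actual in INFRINGEMENT_CLASSES
            predicted_pos = predicted in INFRINGEMENT_CLASSES
            if actual_pos and predicted_pos:
                tp += count
            elif actual_pos:
                fn += count
            elif predicted_pos:
                fp += count
            else:
                tn += count
    return tp, fp, fn, tn


# Hypothetical counts, for illustration only.
toy_matrix = {
    "UDW":        {"UDW": 2, "UNDER_DW": 3,  "OTHER_INFR": 0, "NO_INFR": 5},
    "UNDER_DW":   {"UDW": 1, "UNDER_DW": 20, "OTHER_INFR": 1, "NO_INFR": 18},
    "OTHER_INFR": {"UDW": 0, "UNDER_DW": 4,  "OTHER_INFR": 3, "NO_INFR": 13},
    "NO_INFR":    {"UDW": 1, "UNDER_DW": 6,  "OTHER_INFR": 2, "NO_INFR": 71},
}
binary_counts = collapse_to_binary(toy_matrix)
```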
Prediction metrics of the two four-class models, regarded here as classical multi-class models
These measurements are calculated for the two four-class models but with all infringement classes being considered as mutually exclusive, i.e., a positive instance is regarded as True Positive only when it is classified into its own infringement class.
In Table 18, by contrast, the evaluation figures for the two four-class models are recalculated with all infringement classes treated as mutually exclusive, i.e., a positive instance is regarded as a True Positive only when it is classified into its own infringement class.
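Under this stricter, mutually exclusive reading, precision and recall are computed per class from the diagonal of the confusion matrix. The following sketch shows one way to do this; the matrix is a hypothetical example and not the study's data.

```python
def per_class_precision_recall(matrix):
    """Strict multi-class metrics from matrix[actual][predicted] counts.

    A prediction is a True Positive for class c only if both the actual
    and the predicted class are c.
    """
    classes = list(matrix)
    result = {}
    for c in classes:
        tp = matrix[c].get(c, 0)
        predicted_as_c = sum(matrix[a].get(c, 0) for a in classes)
        actual_c = sum(matrix[c].values())
        precision = tp / predicted_as_c if predicted_as_c else 0.0
        recall = tp / actual_c if actual_c else 0.0
        result[c] = (precision, recall)
    return result


# Hypothetical counts, for illustration only.
strict_matrix = {
    "UDW":        {"UDW": 2, "UNDER_DW": 3,  "OTHER_INFR": 0, "NO_INFR": 5},
    "UNDER_DW":   {"UDW": 1, "UNDER_DW": 20, "OTHER_INFR": 1, "NO_INFR": 18},
    "OTHER_INFR": {"UDW": 0, "UNDER_DW": 4,  "OTHER_INFR": 3, "NO_INFR": 13},
    "NO_INFR":    {"UDW": 1, "UNDER_DW": 6,  "OTHER_INFR": 2, "NO_INFR": 71},
}
strict_metrics = per_class_precision_recall(strict_matrix)
```

With these invented counts, the three UDW cases predicted as UNDER_DW no longer count as hits, so the recall for UDW is only 2 out of 10.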
Next, the classification results and the prediction measurements of each of the four models are analyzed and compared with reference to the data mining goals and success criteria.
CBA-2class
This model was also analyzed in our first AC application in this domain [9], where it was shown to be efficient, offering an accuracy of 66.41% and, above all, a precision of 66.50%, meaning that it predicts positive cases (i.e., cases with infringements) at a rate above the project's data mining goal of 65%. Its recall is 63.32%, indicating the proportion of infringement cases in the dataset that the classifier managed to detect, which can also be deemed sufficient given the complex and ever-changing nature of undeclared work and other labor relations violations. Yet, due to the multifaceted character of these violations, as also explained in [9], some inspection visits should still be selected randomly in order to feed the classifiers with data from inspections not initiated by prediction models, with the aim of uncovering patterns of labor law non-compliance that have not yet been discovered.
Specificity is 69.37%, which means that around seven-tenths of the cases corresponding to labor-law-compliant employers were identified as such by the model. This measure should not be underestimated in this application domain, since such a classifier would avoid suggesting most of the unnecessary onsite visits, thus saving human and financial resources.
Building the model was quick, taking only around 2 minutes per round of the ten-fold cross-validation and producing on average 793 CARs, of which 46.65% classify into the INFR class and the rest into the NO_INFR class. The set of produced rules is balanced and adequate to offer understandable knowledge to the labor inspectors regarding patterns of compliance and non-compliance with labor law, such as those in Table 19. For instance, the first rule reveals that an inspection scheduled in the morning at a very large business paying high wages will most probably detect no violations, whereas the fourth rule indicates that a night inspection at a business that rarely changes the working schedule of its employees is highly likely to discover labor law infringements.
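To illustrate how such a rule set is applied, the sketch below represents CARs as small records and classifies a case with the first matching rule in a ranked list, falling back to a default class, which is how a CBA-style classifier is generally used. The two rules paraphrase the examples just described; their attribute names, confidence values, and support values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    antecedent: tuple  # tuple of (attribute, value) pairs
    label: str
    confidence: float
    support: float


def rank_rules(rules):
    # Higher confidence first, then higher support; Python's sort is stable,
    # so remaining ties keep their original (generation) order.
    return sorted(rules, key=lambda r: (-r.confidence, -r.support))


def classify(case, ranked_rules, default_class):
    """Return (label, fired_rule); fired_rule is None if the default applies."""
    for rule in ranked_rules:
        if all(case.get(attr) == value for attr, value in rule.antecedent):
            return rule.label, rule
    return default_class, None


# Hypothetical rules and values, for illustration only.
example_rules = rank_rules([
    Rule((("inspection_time", "night"), ("schedule_changes", "rare")),
         "INFR", 0.75, 0.03),
    Rule((("inspection_time", "morning"), ("size", "very_large"), ("wages", "high")),
         "NO_INFR", 0.80, 0.02),
])
label, fired = classify(
    {"inspection_time": "night", "schedule_changes": "rare", "size": "medium"},
    example_rules,
    default_class="NO_INFR",
)
```

Returning the fired rule together with the label is what allows a prediction to be explained to an inspector in terms of attribute values.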
Examples of class association rules of the CBA-2class model
In conclusion, this model is considered efficient, achieving the data mining goals described in §3.4, i.e., efficiency above 65% and simplicity of the produced knowledge.
CBA2-2class
The application of the CBA2 algorithm in this domain is analyzed for the first time in this study. Judging by its prediction measurements with the two-class dataset in Table 17, the CBA2-2class model can be regarded as a successful classifier. Specifically, all its figures are around 1% higher than those of CBA, indicating a consistently, if slightly, better overall performance. The construction indicators suggest the reason for the improved efficiency: the algorithm seems to produce more detailed CARs than CBA, since the classifier is considerably larger and the average length of the CARs is also greater.
Table 20 illustrates some of this model's CARs. The discovered knowledge is similar to that produced by CBA, yet some rules contain slightly more detail, thus contributing to the higher accuracy. For instance, the fourth rule is a more detailed version of the fourth CBA rule (Table 19), indicating that the rule applies mostly to medium-size businesses.
Examples of class association rules of the CBA2-2class model
The processing time for the model construction was about four times longer than for CBA, i.e., on average 8–9 minutes for one round of training and testing in the ten-fold cross-validation process. Nevertheless, the present application domain does not require immediate and automatic decision-making from the prediction system, as other domains do, e.g., online fraud detection. Hence, the classifier building time is not a significant metric in selecting the best model.
In light of all the above, the CBA2-2class model clearly achieves the initially set data mining goals and may also be regarded as more effective than CBA with the two-class dataset, and therefore preferable for predicting labor law non-compliant employers.
CBA-4class
A four-class classification problem is, by definition, more challenging to address and analyze than a binary one, even more so when some of the classes are not mutually exclusive, as is the case in this application domain. It should be recalled that the dataset was constructed such that an inspection case corresponds to more than one data instance, each with a different infringement class, when the inspection detects multiple violations. Thus, several data instances ended up with the same attribute values but different classes, which made it harder for the classifiers to discover and produce strong classification rules.
This led us to treat the four-class problem partially as a binary problem, by viewing the three classes of infringements (undeclared work, underdeclared work, and other infringements) as subclasses of a broader infringement class, while still keeping them distinguished in the dataset for the analysis of the model's prediction efficiency and its produced CARs. Under this handling, a variation of the typical confusion matrix was created, as illustrated in Table 12, based on which the model's prediction metrics are calculated and presented in Table 17.
Yet, before the prediction measurements are interpreted, the classification results (Table 15) deserve attention: one instantly perceives the model's failure to distinguish any of the cases labeled as undeclared work (UDW) or other infringements (OTHER_INFR). Indeed, the produced CARs, as shown in Table 17, are distributed only between the dominating classes of no infringements (NO_INFR) and underdeclared work (UNDER_DW), with proportions of 84% and 16%, respectively. Regarding the OTHER_INFR class specifically, it is interesting to note that although it is present in the dataset with a non-negligible proportion of 10.93%, the model does not produce any rules classifying into this class. OTHER_INFR was assigned only as the default class, in two of the ten rounds of classifier training and testing, so that only 8 test instances were classified into this class.
Examples of class association rules classifying into underdeclared work (UNDER_DW) of the CBA-4class model
This phenomenon originates in the inability of the CBA algorithm to account for infrequent classes during training unless the user-defined minimum support threshold is set very low, since the algorithm treats all classes equally, regardless of their distribution in the dataset. Consequently, although the algorithm operates well on a balanced binary dataset and produces an efficient classifier, as shown in §6.4.1, on unevenly distributed multi-class datasets CBA does not perform well in classifying instances of infrequent classes.
Indeed, examining the figures in Table 17, the Accuracy of 60.73% (calculated in the binary form of the dataset) does not reveal significant inefficiency, but the Recall of 31.57% exposes the model's failure to detect about seven-tenths of the cases with infringements. This figure, along with the F-score of 44.06%, renders this classifier ineffective as a core prediction model in the domain of labor law violations detection.
Nonetheless, one major advantage of this model must be acknowledged, revealed by the noticeably high Precision of 72.91%, i.e., the correctness of the predicted cases with infringements. With around 73% of its positive predictions being correct, this model offers two important assets. First, the positive cases predicted by this model should not be overlooked by the labor inspectors; they should all trigger an inspection visit, since more than seven out of ten of these inspections will detect labor law violations. Second is the knowledge gained from this model. Specifically, 16% of the rules classify cases into underdeclared work, reaching a Precision, p(UNDER_DW), of 72.88% for this sub-class of infringements. This corresponds to a small and concrete set of strong CARs revealing the patterns of underdeclared work, offering valuable understanding to the labor inspectors. For instance, more than half of these rules contain the attribute value ‘complaint’, more than six-tenths contain the value ‘evening’, and eight out of ten include ‘rare changes’, i.e., many of the inspections uncovering underdeclared work were conducted after a complaint and/or in the evening hours, and in most cases the employer was neglecting to declare changes in employment hours. Some indicative CARs classifying into UNDER_DW are illustrated in Table 21.
Last, the Specificity stands out as exceptionally high at 88.73%, meaning that about nine out of ten of the NO_INFR cases are classified correctly by the model. This is mainly due to the high share (84%) of CARs classifying cases into this class; yet this set of rules cannot safely provide patterns of labor law compliance, since several of these rules misclassify positive instances into the negative class.
Overall, this model cannot be considered effective, as it falls short of some of the project's data mining goals. In effect, it fails to detect around 7 out of 10 real infringement cases, yet those that it classifies as positive are highly likely to be positive. Regarding the generated knowledge, it produces a set of accurate patterns of underdeclared work, offering the labor inspectors valuable insights into this type of violation. However, there are no rules classifying cases into the other infringement classes, whereas the generated rules classifying instances into NO_INFR are too dominant to be trusted as a source of safe and precise understanding of patterns of compliance with labor law.
CBA2-4class
This last model, which combines the CBA2 algorithm with the four-class dataset, is also handled partially as a binary classification problem, like the previous one; thus all calculations of the prediction metrics are based on a confusion matrix of the form in Table 12. Table 16 presents its classification results, and the prediction measurements are presented in Table 17.
Examining the model's output and figures, CBA2-4class exhibits an overall better performance than CBA-4class. First of all, some instances of undeclared work have been successfully identified, and the Precision for this class specifically is 83%. This signifies that a few strong and accurate rules classifying cases into UDW have been formed, supplying a valuable understanding of the patterns related to the type of violation this study is most interested in, undeclared work.
Similarly, the model succeeded in generating CARs for the other two infringement classes as well, achieving a Precision of 72.64% for underdeclared work (UNDER_DW) and 83.72% for other infringements (OTHER_INFR), thus achieving one of the major goals of this project, the generation of valuable and understandable knowledge of the patterns linked to labor law non-compliance. This was made possible by the CBA2 feature of dealing with unbalanced datasets more effectively than CBA, by assigning a different minimum support value to each class based on its distribution in the given dataset.
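The per-class threshold can be sketched as the overall minimum support scaled by each class's share of the dataset, which is how the multiple-minimum-support idea of CBA2 is usually stated. In the example below, the 4% (UDW) and 10.93% (OTHER_INFR) shares come from this study, whereas the UNDER_DW and NO_INFR shares are placeholder assumptions chosen only so that the distribution sums to one.

```python
def class_min_supports(class_distribution, total_min_support):
    """Per-class minimum support: total_min_support * share of the class."""
    return {
        cls: total_min_support * share
        for cls, share in class_distribution.items()
    }


distribution = {
    "UDW": 0.04,          # reported in the study
    "OTHER_INFR": 0.1093, # reported in the study
    "UNDER_DW": 0.3507,   # placeholder assumption
    "NO_INFR": 0.50,      # placeholder assumption
}
per_class = class_min_supports(distribution, total_min_support=0.01)
```

With an overall minimum support of 1%, a rule predicting UDW would then need to cover only about 0.04% of the instances, instead of the flat 1% that applies to every class under CBA.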
The model's prediction measurements (in binary form, Table 17) also show an improvement over CBA, yet its Recall of 38.14% cannot be considered sufficient, since the model recognizes only around 4 out of 10 actual cases of infringements. Although Precision is high at 73.37%, the F-score, defined as the harmonic mean of Precision and Recall, is 50.19%, exposing the model's low aptitude for identifying a large share of the cases of violations. Thus, although CBA2 is designed to deal satisfactorily with infrequent classes, it did not manage to build an overall efficient classifier with this four-class dataset. This could be explained by the dataset construction technique followed in this study, which led to the dataset containing several data instances with the same attribute values but different infringement classes, a fact that complicated the construction of a set of robust CARs for all classes.
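As a quick check of the reported figures, the harmonic-mean definition of the F-score can be applied to the Precision and Recall values quoted for the two four-class models. The symbols P, R, and F are used here only as shorthand for this worked example.

```latex
F = \frac{2 \, P \, R}{P + R}

\text{CBA-4class:}\quad
F = \frac{2 \times 0.7291 \times 0.3157}{0.7291 + 0.3157}
  = \frac{0.4604}{1.0448} \approx 0.4406

\text{CBA2-4class:}\quad
F = \frac{2 \times 0.7337 \times 0.3814}{0.7337 + 0.3814}
  = \frac{0.5597}{1.1151} \approx 0.5019
```

Both results agree with the reported 44.06% and 50.19%, and they show how a low Recall pulls the F-score down even when Precision exceeds 70%.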
The positive cases predicted by this model, though, should not be ignored by the users. On the contrary, this model, with the highest Precision among all four, must be regarded as trustworthy for its positive predictions, which should initiate the onsite inspections yielding the highest productivity. Its classification efficiency for negative cases is reflected in the Specificity of 86.70%, denoting that this model rarely initiates redundant inspection visits. The set of CARs classifying data instances into NO_INFR (72.1%) is smaller than that of CBA-4class, but it is still regarded as dominating, since many positive instances are misclassified as negatives. Yet the strong ones among those rules, ranked high in the set of CARs, can contribute to enhancing the labor inspectors' knowledge of labor law compliance patterns. Table 22 displays some of the strong rules of this model for each class.
Examples of class association rules of the CBA2-4class model
Finally, the size of the model is not larger than that of CBA2-2class, but the processing time is three times longer, which may be due to the presence of four classes and the complicated dataset described above. This metric, however, is not of crucial importance in this application domain, as discussed in §6.4.2.
To conclude this model's evaluation, the produced classifier can contribute greatly to enhancing the labor inspectors' knowledge and understanding of the patterns related to labor law compliance and non-compliance, thus helping them to identify both types of behavior more easily on their own. It is also considered valuable to the Authority's upper management, since it provides policy-makers with meaningful insights into the characteristics linked with undeclared work, a benefit afforded only by this model.
However, it does not detect a high share of existing violations; therefore, it is not suggested as a core model for predicting employers' non-compliance and planning targeted inspection visits.
Models assessment per the business goals
In the previous section, the four generated models were assessed per the data mining goals, whereas this last phase deals with the overall evaluation of the project, also with respect to the initial business objectives. It should be recalled that the Authority's initial objective was to embed innovative technology solutions into its business processes in pursuit of two major goals. The first was increased efficiency of onsite visits through the use of a prediction-recommendation system, which should be automatically configured and updated and contribute to a more effective allocation of the Authority's resources. The second was the provision of simple and comprehensible knowledge to the labor inspectors, in terms of patterns of labor law compliance and non-compliance, that would improve their ability to understand the system's recommendations for targeted onsite inspections and, at the same time, to swiftly recognize employers likely to be involved in undeclared work and other violations.
Out of the four generated models, as explained in detail above, only the first two, built using the two-class dataset, succeed in both business goals, with the CBA2-2class model exhibiting a slightly better overall prediction performance and offering more exhaustive knowledge about the characteristics found to be linked with compliance and non-compliance with labor regulations. Thus, if only one of the four models were to be applied, it should be CBA2-2class, which offers a complete and balanced set of readable and understandable rules classifying employers into the compliance (NO_INFR) and non-compliance (INFR) classes, with an accuracy of 67.15% for both classes.
Yet, this model cannot offer enhanced knowledge of the specific patterns associated with each of the infringement sub-classes, i.e., undeclared work (UDW), underdeclared work (UNDER_DW), and other infringements (OTHER_INFR), as the four-class models can. The CBA2-4class model in particular manages to produce CARs for all classes, even for undeclared work, whose share in the dataset is very low, at 4%. This accomplishment should not be overlooked, since this model meets the knowledge provision business goal at a much higher level than the selected CBA2-2class model, even if it cannot reach the accuracy threshold of 65% set by the initial data mining goal. In particular, Table 22 shows that the patterns of labor law non-compliance can be differentiated for each infringement class, and the provision of such detailed insight into the different types of violations is of significant value for the Authority. It should be noted, at this point, that one of the HLI's major issues is the labor inspectors' unwillingness to follow the inspection suggestions of the current risk analysis tool, mainly due to lack of knowledge of how the tool functions and of the factors affecting its results. If a newly established inspections recommendation tool does not effectively address this issue, by helping the labor inspectors thoroughly understand the various patterns of labor law non-compliance and actively involving them in the inspections selection process, it runs a high risk of being gradually abandoned as well.
Considering the above business risk of the tool falling into disuse, the CBA2-4class model should unquestionably be integrated into the inspections recommendation tool, all the more so if the business goal of improved allocation of the Authority's human and financial resources is taken into account. Every inspection visit is associated with a high operational cost, which should be considered when deciding which of the tool's suggestions warrant a targeted onsite inspection. With this model's Precision at 73.37% (i.e., the correctness of its infringement predictions), and notably that for undeclared work at 83%, its suggestions should not be questioned; they should all trigger an inspection visit, thereby increasing the likelihood of detecting violations.
Determination of best option
To conclude from the above, a combination of the CBA2-2class model, displaying the highest overall accuracy at 67.15%, with the CBA2-4class model, exhibiting the peak Precision at 73.37% and supplying the most detailed knowledge regarding patterns associated with all infringement classes, stands as the best option for setting up the required inspections recommendation tool.
The tool may consist of two classifiers: one built by the CBA2-2class model, reaching an adequate and stable predictive accuracy for both the positive and the negative class and thus assisting the Authority in planning on-site inspections at a larger scale, and one built by the CBA2-4class model, supplying a simple and thorough understanding of the characteristics linked with the different types of infringements.
Knowing the pros and cons of each model, the tool may be set to use both of them to classify employers and benefit accordingly. When both models predict a company to be positive (i.e., with infringements), it should be prioritized for inspection without hesitation, whereas if both classify it as negative (i.e., compliant), it can be ranked as lower risk for an onsite inspection. Employers classified as non-compliant (INFR) by the CBA2-2class model but as compliant (NO_INFR) by the CBA2-4class model should be candidates for an inspection visit, ranked by the confidence value of their classification rule. Notably, all inspection suggestions of both models should be accompanied by the corresponding activated classification rule and its confidence value, supplying the tool users with the knowledge necessary, first, to understand the reasons (i.e., the attribute values) for which a company is suggested for a targeted inspection and, second, to be aware of how confident this suggestion is.
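The combination logic described above can be sketched as a small triage function. The tier names, the `Prediction` record, and the scoring scheme are illustrative assumptions. The text does not spell out the case where only the four-class model predicts an infringement; the sketch treats it as a candidate as well, which is an assumption.

```python
from typing import NamedTuple

NEGATIVE = "NO_INFR"
TIER_ORDER = {"priority": 0, "candidate": 1, "low_risk": 2}


class Prediction(NamedTuple):
    label: str         # predicted class
    confidence: float  # confidence of the fired rule
    rule: str          # readable form of the fired rule, shown to inspectors


def triage(two_class: Prediction, four_class: Prediction):
    """Return (tier, score) for one employer from the two models' outputs."""
    two_pos = two_class.label == "INFR"
    four_pos = four_class.label != NEGATIVE
    if two_pos and four_pos:
        return "priority", max(two_class.confidence, four_class.confidence)
    if not two_pos and not four_pos:
        return "low_risk", 0.0
    # Exactly one model predicts an infringement: rank by that rule's confidence.
    # (Four-class-only positives are handled the same way by assumption.)
    fired = two_class if two_pos else four_class
    return "candidate", fired.confidence


def rank_employers(predictions):
    """predictions: dict employer_id -> (two_class_pred, four_class_pred)."""
    scored = [(emp, *triage(p2, p4)) for emp, (p2, p4) in predictions.items()]
    return sorted(scored, key=lambda item: (TIER_ORDER[item[1]], -item[2]))
```

Keeping the fired rule in each `Prediction` lets the resulting list be displayed together with the attribute values that triggered each suggestion.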
In this way, the labor inspectors will be highly stimulated and actively involved in the whole inspections selection process, improving their domain knowledge and violations identification ability, while, at the same time, they will still be in authority over deciding which of the tool suggestions to follow. Thus, the inspectors will perceive this inspections recommendation tool as a helping means to improve their productivity and not as yet another complicated system producing inexplicable employers ranking lists, which they are inevitably obliged to follow.
The tool will be able to update its classification mechanisms through an automated periodic process of rebuilding the classifiers, each time also including the new inspections data inserted into the Inspections subsystem of the HLI-IIS. For this reason, the labor inspectors are advised to keep conducting a proportion of randomly selected inspection visits, so as to give the tool the chance to discover new patterns of labor law non-compliance. Consequently, the tool will need no manual configuration or update but will function completely automatically, offering unbiased inspection recommendations and thus contributing to the enhancement of the labor inspectors' trust in it.
In closing, the combination of the two models, as explained above, incorporated in a recommendation system, best satisfies both the business and the data mining goals and is suggested as the best option in this data mining application.
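One simple way to reserve a share of visits for random selection is sketched below. The function name, the `capacity` and `random_share` parameters, and the 20% value in the usage note are hypothetical; the study does not specify what proportion of inspections should be random.

```python
import random


def plan_inspections(ranked_ids, all_ids, capacity, random_share, seed=None):
    """Split `capacity` visits between top-ranked and randomly chosen employers.

    ranked_ids   : employer ids ordered by the tool's recommendation
    all_ids      : every employer id eligible for inspection
    random_share : fraction of visits reserved for random selection (assumed)
    """
    rng = random.Random(seed)
    n_random = round(capacity * random_share)
    targeted = list(ranked_ids[: capacity - n_random])
    already_chosen = set(targeted)
    pool = [e for e in all_ids if e not in already_chosen]
    random_picks = rng.sample(pool, min(n_random, len(pool)))
    return targeted, random_picks
```

For instance, `plan_inspections(ranked, everyone, capacity=10, random_share=0.2)` would return eight targeted visits and two random ones, and the outcomes of the random visits can be added to the training data at the next rebuild.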
Next steps
Following the in-depth analysis of this data mining project through the different stages of the CRISP-DM methodology, several parts were identified that could be examined further to allow an even more thorough understanding of the knowledge produced by the models, as well as an increased prediction capability.
More specifically, for every task to be completed, various decisions had to be taken by the researchers of this study in close cooperation with the Authority's personnel involved, based mainly on their domain expertise and thorough investigation. For instance, from data selection and integration, through the selection of primary attributes and the creation of derived attributes, to the discretization and clustering tasks, the dataset preparation phase required several choices to be made, where, evidently, different variations would produce different results. Similarly, in the modeling phase, other algorithms could be used which, combined with a diversity of dataset structures, would create a variety of different models to be analyzed and assessed.
Under this view and per the CRISP-DM methodology, the present analysis constitutes a complete life cycle of our data mining project, which, eventually, may trigger new, more focused business goals and another life cycle, benefitting from the gained experiences and lessons learned.
In particular, in this cycle, the two models with the four-class dataset were found not to detect an adequate percentage of actual infringement cases, both reaching a Recall of under 40%; hence they could not be considered effective as a core classification model. Yet, the vital knowledge they provide makes them beneficial, so they are regarded as worthy of further examination, especially the CBA2-4class model, which handles unevenly distributed multi-class datasets more effectively.
Based on the experience acquired in the first life cycle of this data mining project, the researchers presume that the four-class models' prediction capability may have been affected by the dataset construction design, which led to several instances with identical attribute values but different infringement classes (i.e., when inspection cases pertained to a non-compliant employer with several violations) and which probably impeded the models from constructing strong and accurate rules for each infringement class.
For this reason, a new iteration initiation of the data mining project life cycle was identified as one of the next steps, to examine the predictive potential and the discovered knowledge of the four-class models, but with a variation in the current dataset structure, so as only one data instance per inspection case should exist in the final dataset. This practically means that the cases of non-compliant employers detected with multiple violations should be represented in the dataset by one data record and labeled with the most severe infringement class out of the discovered ones. If the infringement classes are ranked per severity, i.e. UDW
Concluding, it is estimated and remains to be proved that this new construction variation of the dataset, containing only one record per detected non-compliant employer being labeled with his/her most severe violation, might allow the four-class models to generate more accurate and stronger CARs for each infringement class and achieve higher prediction capability.
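The proposed restructuring can be sketched as a reduction that keeps one record per inspection case, labeled with its most severe class. The severity order used below is an assumption made purely for illustration, as are the `inspection_id` key and the record layout; the actual ranking should be taken from the study's own definition.

```python
# Assumed severity order, most severe first (illustrative assumption only).
SEVERITY_ORDER = ["UDW", "UNDER_DW", "OTHER_INFR", "NO_INFR"]
SEVERITY_RANK = {label: i for i, label in enumerate(SEVERITY_ORDER)}


def most_severe_per_inspection(records):
    """Reduce (inspection_id, attributes, label) records to one per inspection.

    When an inspection appears several times with different classes, only
    the record with the most severe class (lowest rank) is kept.
    """
    best = {}
    for inspection_id, attrs, label in records:
        current = best.get(inspection_id)
        if current is None or SEVERITY_RANK[label] < SEVERITY_RANK[current[1]]:
            best[inspection_id] = (attrs, label)
    return [(iid, attrs, label) for iid, (attrs, label) in best.items()]


# Hypothetical records, for illustration only.
raw = [
    (1, {"size": "small"}, "UNDER_DW"),
    (1, {"size": "small"}, "UDW"),
    (2, {"size": "large"}, "NO_INFR"),
    (3, {"size": "medium"}, "OTHER_INFR"),
    (3, {"size": "medium"}, "UNDER_DW"),
]
deduplicated = most_severe_per_inspection(raw)
```

After this step no two records share identical attribute values because of multiple violations in the same inspection, which is the condition suspected of weakening the four-class rules.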
Additionally, other life cycle iterations could include the analysis and comparison of other AC algorithms, such as the Classification based on Multiple Association Rules (CMAR) [20], the Classification based on Predictive Association Rules (CPAR) [21], the Multiclass Associative Classification (MAC) [32], the Live and Let Live (
Conclusions and future work
This study presents a data mining project aiming to address the issue of undeclared work and other labor law violations. Undeclared work is a complex and deliberately concealed phenomenon originating from the employers' illegal practice of not declaring all or part of their employed staff to the public authorities, aiming to avoid paying taxes and social security dues and complying with labor law regulations. This practice costs states considerable revenue, and it also negatively affects, in various respects, the labor force and competition among companies.
Public authorities in charge of dealing with this phenomenon often face diminishing human and financial resources and lack the appropriate tools to fulfill their business goals successfully. Although ICT has been widely introduced into the public sector, resulting in several databases dispersed within the authorities, these data are still not used effectively to increase the productivity of their functions. Consequently, the use of data mining and machine learning techniques emerges as the sole option, offering considerable help to organizations at different levels.
This data mining application was inspired by the business needs of the Hellenic Labor Inspectorate and uses real past inspections data merged with employment figures and companies details. It aims to help the Authority in dealing with undeclared work and other labor relations violations offering two major benefits: a mechanism to plan targeted inspection visits with improved accuracy and a knowledge base to enhance the labor inspectors understanding of the features correlated with these violations.
The project is analyzed comprehensively through all its phases per the CRISP-DM methodology, as also illustrated in Fig. 1. It incorporates the examination of two Associative Classification algorithms, CBA and CBA2, in combination with two datasets, one two-class and one four-class, providing four different models. The models are thoroughly examined and assessed from the data mining perspective, using well-established evaluation metrics, as well as from the business perspective, considering the initial objectives and ambitions of the Authority. Comparing the two AC algorithms, CBA2 demonstrated classification results superior to those of CBA, with both the two-class and the four-class dataset, due to its ability to handle unbalanced datasets more effectively; thus, it is preferred for this application domain.
The data mining project life cycle.
In particular, CBA2-2class is selected as the best model among the four, given that it met both the data mining goals and the business objectives, achieving an accuracy of 67.15% and supplying a balanced set of patterns revealing compliance and non-compliance with labor law. Nevertheless, it cannot offer more detailed insights into the features linked with each type of violation; this is accomplished by the CBA2-4class model, which supplies valuable understanding even for undeclared work, which is present in the dataset with a share of 4%. Indeed, this model confirmed that there can be different patterns for each type of violation, and the pool of knowledge it produced was regarded as invaluable for the Authority, both for the higher management and for the labor inspectors.
Hence, the study concluded that the best option in establishing an inspections recommendation tool is to combine these two models, with the CBA2-2class model offering the highest overall prediction accuracy and the CBA2-4class model supplying the most detailed understandable knowledge regarding the patterns associated with each class of infringements.
In the future, a new iteration of the project life cycle will be initiated to examine the effectiveness of the four-class models with a specific variation of the dataset structure, where each discovered non-compliant employer will be represented by only one data instance, labeled with the infringement class of his/her most severe violation. This new way of representing the inspection cases in the dataset is expected to help the models generate a set of more accurate and robust classification rules, and it remains to be studied.
Last, other new life cycles of the data mining project will involve the exploration of different Associative Classification algorithms, which, combined with the variety of constructed datasets, will produce diverse models to be compared and assessed, both per the data mining goals, as well as per the Authority business purposes, where the most effective ones will be finally distinguished.
Acknowledgments
The authors are very thankful to the Hellenic Labour Inspectorate for the cooperation and provision of all necessary data for this project.
