Abstract
Regression analysis is a widely used statistical technique for estimating the relationship between two variables. These two variables are called independent and dependent variables. The regression techniques are classified into two broad categories such as linear and logistic regression. Based on the input dataset, these two techniques are chosen and implemented. Many organizations and institutions are trying to use the decision support system for extracting the relationship between the employees’ salaries based on the target achieved and the years of experience. In this paper, the relationship extraction between two variables is analysed and studied. Based on the Experience, the salary of employees is predicted. Here the model extracts the relationship among the variables first, next to that forecasting of new observations is carried out. In this phased approach, the data pre-processing is carried out to clean the noise on the dataset. Followed by, fitting the model to train the train set and testing test. The third phase predicts the results based on the two variables to draw some observations. As a final step, visualization is employed on training and testing datasets. To implement this proposed work, the employee database from an organization is considered. This dataset contains 115 technical and non-technical staff details with their profile information.
Introduction
Data analytics techniques are playing crucial roles in all types of businesses to draw efficient decisions in the result analysis phases. These techniques provide huge insights and meaningful relationships between datasets (Razia et al.,2017). Multiple data processing methods are originated from these techniques. Many supervised and unsupervised learning algorithms are being used in industries for analysis purposes. Among them, regression techniques are considered the widely used method to extract the relationships between two variables.
Types of regression
Regression analysis techniques are classified into three types such as linear, logistic, and multiple linear regression for multiple values. Linear regression is employed to deal with single dependent and single independent variables. Whereas logistic regression is applied for predicting binary dependant variables. On the other hand, multiple linear regression is applied to cope with more than two independent variables. The following Table 1 clearly explains the types of regression analysis.
Types of regression analysis
Types of regression analysis
Information processing. Figure 1 clearly narrates the information processing from different sources like B – Business Data, S – Social media data, M – Machine generated data.
Many machine learning algorithms have been proposed for disease analysis and predictions (Dudi et al., 2019). But the ML algorithms are used for enhancing the DSS. The decision support system of any organization produces the expected and predicted information from the hidden data. In this proposed work, an institution’s worker data set has been applied to DSS for drawing the final conclusions. Many stochastic regression models have been implemented so far were failed to produce the expected results (Patel et al., 2019).
This DSS has many phases to process the given dataset such as names of the workers, employee ID, years of experience, target achieved, and performance metrics.
Salary prediction methods
The salary prediction process also involves the promotional statements as well as the hike to employees. As per the Fig. 2 the employee data has been collected as 115 samples.
Salary and hike prediction process.
Initially, the correlation analysis is performed on the given dataset. Further to this, the strength of association between variables is carried out (Kawahara et al., 2020). This implementation involves the initial phase of importing the required libraries to apply the data pre-processing. It involves the removal of repeated data, missing values, and unappropriated data. NumPy array is used to analyse the data and with the help of panda’s library, the given dataset is imported to the implementation environment (Donthi et al., 2019).
The given dataset is separated as training set and testing set. For training 80% of the dataset is considered whereas for testing the remaining are considered.
Applying the model to fit on the data
With help of the linear regression model, the dataset is employed to fit in the conditions. Initially, a single independent variable is considered for the prediction of a single dependent variable as shown in the following Table 2.
Employee dataset
Employee dataset
Dependent attribute
Independent attribute
As per the Tables 1 and 2 the dependent variable is salary, and the independent variable is years of experience. So, this is called the single linear regression model. In case of two or more independent variables are considered, that is called multiple linear regression. The likelihood relationship of these variables is found clearly (Satish Kumar et al., 2019).
Multiple regression with three variables.
As per the results shown in Fig. 3, the correlation coefficient has been carried out from dependent and independent variables. Based on the performance and years of experience the salary is increased. These visual-based results are clarifying the relationship between the employees’ salaries. The linear and multiple regression are purely different from other available regression types.
In logistic regression, the categorical values as binary values of the variable are predicted with help of the fitting models. In contrast to that, the polynomial regression is applied and implemented on the
Conclusion
In this proposed work, the widely used regression analysis techniques and their types are implemented on 115 sample datasets. The results are carried out about the relationship and correlation coefficient parameters. In the meanwhile, multiple regression lines are plotted to explain the results in feature space. Due to the inefficiency of classification algorithms and results, in the future, advanced algorithms will be used (Donthi et al., 2019) These ML-based classification algorithms are being used in the healthcare domain in recent days. Hence, as future work, these regression algorithms will be employed in the healthcare domain also (Saba et al., 2020).
Future work
The employees of other department salaries also considered in the future work. The logistic regression will be implemented to find the relationship between the entry and exit of the employees. With help of the support vector, machine-based regression models will be used to predict the earlier results between two unknown variables (Mahaboob et al., 2019). The logistics regression will produce a higher degree of relationship extraction between more datasets (Satish Kumar et al., 2019).
