Towards a classification of sustainable software development process using manifold machine learning techniques

Abstract

With the evolution of the software industry, a huge number of software applications are being designed, developed, and uploaded to multiple online repositories. To identify applications of the same category and their resource utilization, researchers currently rely on manual work. To reduce this effort, we propose a solution that works in two phases. In the first phase, a semantic analysis-based process identifies keywords and variables. Based on these semantics, we designed a dataset with two classes: one represents the application type and the other corresponds to the application keywords. In the second phase, the preprocessed dataset is given as input to manifold machine learning techniques (Decision Table, Random Forest, OneR, Randomizable Filtered Classifier, Logistic Model Tree), and their performance is computed in terms of TP Rate, FP Rate, Precision, Recall, F1-Score, MCC, ROC Area, PRC Area, and Accuracy (%). For evaluation, we used an R language library for latent semantic analysis to create the semantics, and the Weka tool to measure the performance of the algorithms. Results show that random forest achieves the highest accuracy, 99.3%, which we attribute to its parametric function evaluation and lower misclassification error.

Keywords

Machine learning, software classification, software sustainability, data analytics

1 Introduction

Automatic software classification in different databases is gaining attention because it greatly reduces manual work. Several software repositories have sprung up in recent years, many of which contain huge volumes of source code and other software artifacts [38]. To make these resources easier to discover, software systems are classified into different types. Since so many people are involved in software maintenance, well-organized software repositories benefit them in two ways. First, by grouping apps with similar qualities, users can choose which features to put in their own apps in the same category. Second, users may be able to identify the problems or issues that are common to various apps in a class and predict the problems that future apps in the same category may encounter. This kind of prediction could be used as a control technique to discover common code smells or programming problems [1]. To categorize software programs, text classification algorithms are typically used; keywords are mined from the source code and used as features. Although automatic categorization techniques may not achieve perfect accuracy, stakeholders can still benefit from classified applications while dealing with software issues and maintenance obligations [30].

Categorization is the ability and activity of recognizing common characteristics or similarities among elements of one’s experience of the world (such as objects, events, or ideas), and then organizing and classifying that experience by assigning it to a more abstract group (that is, a category, class, or type) based on traits, attributes, similarities, or other criteria [31]. Software categorization and detection of similar software can be beneficial for a variety of reasons, including knowledge exchange, application understanding, and rapid prototyping. To take advantage of the wide availability of open-source projects and to expose their functionality to search engines, approaches for automatic categorization and similar-software detection are necessary. Automatic software categorization and search for similar applications are particularly critical when software is migrated from one platform to another. When system requirements and deployment circumstances change, developers may move their applications to different software or hardware platforms [32]. Consequently, application components and libraries that worked in one environment might not be usable on the other platform [2, 3]. Architects and developers must then find or build alternative software to replace the components that no longer work.

Automatic categorization is a technology aimed at limiting the flood of unorganized, unindexed, and unstructured digital content that threatens to overwhelm knowledge workers in corporations and government [29]. Auto-categorization software can categorize digital material according to defined taxonomies, extract concepts and entities for the construction of taxonomies, and tag information with subject-related metadata. Open-source software repositories like SourceForge.net store massive volumes of source code and software artifacts and, to make them easier to access and search, group systems into categories (for example, text editors, anti-virus software, and databases). Systems must be manually classified into these groups by users or administrators based on their functionality [33]. This classification is time-consuming and labor-intensive, and it demands an understanding of the underlying functionality of the software projects in the repository. Identifying similar software projects across several programming languages is difficult for a variety of reasons. Consequently, present methods are either incapable of cross-language identification of related systems or have only limited capabilities for it. Given the growth of open-source platforms and the variety of software programs that have been created, a tool that can discover and categorize similar apps written in numerous programming languages would be useful [4].

The work in [34] addresses three problems: (1) matching items described on the web with structured annotations, (2) supplementing an existing product database with web-based product data, and (3) sorting items into categories. For these tasks, product attributes are extracted from textual descriptions using Conditional Random Fields and Convolutional Neural Networks. Generating classes with decision trees is a common method in supervised machine learning: the properties of the items are given as input to a machine learning system, and a rule generator then produces rules for a hypothetical set of categories. Latent Semantic Analysis (LSA) is a statistical method for extracting and representing the meaning of words in context from a large corpus of text [36]. LSA can be used for a variety of purposes, including the study of human cognition. It is also used for clustering in data mining. In software engineering, it has been applied to cluster the components of a software system and to recover traceability links from documentation to source code. Code clones are replicated code sections that appear in different places throughout the source code [5].

The similarity of two separate software systems can be defined as the proportion of the total lines of code clones to the total lines of code. In real-world applications, text categorization usually needs a system that can handle tens of thousands of categories spread across a broad taxonomy. Automated text categorization has grown in popularity over time because of the time and cost of manually building text classifiers [35]. The text classification problem is well served by machine learning-based classification techniques. Text classification, or categorization, is the process of assigning documents to predetermined categories [6]. A repository is the basic unit on GitHub and commonly comprises the source code and resource files of a software project, together with the archived history of the project’s development. GitHub hosts a wide range of projects, including database software, operating systems, games, web applications, smartphone apps, and much more. GitHub is used by large corporations such as Google, Microsoft, and Facebook to host their open-source projects. GitHub hosts millions of repositories, and many of them provide related functionality [37]. Nonetheless, they are created by a variety of people and organizations. Unfortunately, to the best of our knowledge, there is currently no technology on GitHub that can determine the similarity of repositories. GitHub is a sharing platform with a search engine that helps developers find relevant information among the millions of repositories that it hosts [7].

This research seeks to classify manually uploaded source code and determine which category it belongs to. The research gap is how a user can tell which category new source code belongs to once it is published to a repository. To address this, semantic analysis is performed on the source code using latent semantic analysis (LSA), which identifies several keywords and variables; these keywords and variables differ in each application. Therefore, we split all keywords and variables into arrays of 10 words and stored them along with their category. Two classes are created for the machine learning model: one for the application category and the other for the application keywords. Several machine learning classifiers were used to categorize software, and this proved to be a cost-effective option for software classification.

The rest of the paper is structured as follows. Section 2 presents the related work. Section 3 outlines the research methodology, while Section 4 presents the implementation and results. Section 5 focuses on the discussion. Finally, Section 6 offers conclusions and future directions.

2 Materials and methods

A new accessible and efficient method for language-agnostic program categorization and similar application identification was presented by Altarawy et al. [2]. An existing data set of about 103 applications was reused, and a new unlabeled data set of 5,220 applications was created. In this approach, the source code of programs is subjected to Latent Dirichlet Allocation (LDA) and hierarchical clustering. For the two data sets, the Top-1 retrieved applications had 70 and 71 percentiles, respectively. Guendouz and colleagues proposed a method for categorizing software called LACT, which works on open-source repositories and systems [8]. In the first study, LACT was compared against MUDABlue on 41 software systems written in C, which were classified into problem domain categories. In the second study, LACT was extended to 43 software systems written in a variety of programming languages. The results show that LACT can automatically generate meaningful category names and produce classification results that are comparable to those of MUDABlue. The outcome of the second study indicates that it can be used to categorize applications regardless of the underlying programming paradigm. Ugurel et al. show how to use machine learning to automatically categorize source code in eleven different implementations and ten different programming languages [9]. Their findings show that enormous repositories of heterogeneous data records, text, and source code can be categorized automatically.

Linares-Vásquez et al. provide a method for classifying software projects without using any source code [10]. Three datasets were provided with minor changes, each having its own directory hierarchy. Their method was 80.22 percent accurate. McMillan et al. proposed CLAN (Closely reLated ApplicatioNs), an approach for automatically detecting closely related applications that allows users to find applications related to a specified Java application [11]. Their findings show that CLAN finds similar applications more consistently than MUDABlue, with statistical significance. Nguyen et al. investigated the potential of deep neural networks (DNNs) in automatic software categorization as part of a thesis, exploring the characteristics and variants of DNNs under a variety of settings and data sets [12]. Kawaguchi et al. proposed a software system, MUDABlue, that automatically categorizes software based on nothing but the source code [13]. Its GUI is a tool that supports category-based browsing of software archives. Zhang et al. suggested a method, RepoPal, for effectively detecting similar repositories on GitHub [4]. Their empirical evaluation demonstrates that RepoPal has a higher success rate, accuracy, and confidence than CLAN [14].

Prana et al. conducted a thorough investigation that included the manual annotation of 4,226 sections of README files from 393 randomly selected GitHub repositories [15]. They built a classifier and a set of features that can automatically categorize these sections. On the manually annotated dataset, they evaluated the effectiveness of the classifier and identified the most valuable features for distinguishing the various types of sections. The classifier was then used to label sections with badges in unseen GitHub README files. The findings give repository owners a benchmark against which they can model and review their files, resulting in more consistent software descriptions.

Velazquez-Rodriguez et al. propose an approach to automated library categorization in which machine learning classifiers are trained on class and method names [16]. The approach, which was trained on a huge dataset, can assign a category to an existing library. The method is based on text categorization machine learning algorithms that are trained and validated using a text corpus collected from libraries. The results show that the approach is accurate, implying that large-scale application is possible. Qadir et al. proposed a basic static analysis methodology that first extracts the functionality of Android application groups based on purpose and intent [17]. A list of pre-defined malware is kept, and each Android application is checked against this database. On this basis, a static detection approach for malicious software is proposed and implemented. As a result, individuals can find out which applications use functionality that they are not supposed to use or do not need. Auch et al. presented a comprehensive literature review of current similarity, categorization, and relevance research approaches for software applications [18].

Many software design patterns have been created and cataloged as alternative solutions to design problems. In the design phase of a project, the available automatic algorithms for design pattern selection assist inexperienced software developers in selecting the most relevant design pattern(s) from a list of suitable patterns to address a design problem. However, present automated solutions are confined to semi-formal specifications, multi-class problems, sufficient sample sizes for precise learning, and individual classifier training to establish a candidate design pattern class and recommend more relevant patterns [21]. The literature review identified numerous approaches to the categorization of source code; these are summarized in Table 1.

Table 1
Recent approaches of software categorization

Reference	Idea	Methodology	Results
2017 [39]	The purpose of the thesis was to investigate the potential of DNNs in automatic software categorization.	DNN model for categorization of software.	Explored further characteristics and variants of DNNs under numerous configurations and data sets.
2020 [40]	Research on software categorization applications for software dictionaries.	Singular Value Decomposition, Natural Language Processing (NLP), and Deep Learning.	The findings suggest an overall accuracy of more than 65%.
2018 [41]	Categorization of source code with a classifier that identifies the programming language.	A Multinomial Naive Bayes (MNB) classifier trained on Stack Overflow data.	A precision of 75 percent, which is better than that of Programming Languages Identification (PLI, a proprietary online snippet classifier), whose quality is only 55.5%.
2016 [42]	Classification of source code based on the sequence in which the text appears.	A Recurrent Neural Network was used.	80.22 percent accuracy.
2019 [43]	A detailed analysis involving the manual annotation of 4,226 sections of README files from a random sample of 393 GitHub repositories.	A classifier is used to label sections with badges in unseen GitHub README files.	The results provide repository owners with a reference point against which they can model and review their README files.
2020 [44]	A review of similarity, categorization, and relevance research methods for software applications.	The analysis aims to build an understanding of common techniques and future implementations for automated software identification.	The findings give a structured summary that helps in selecting a system and in identifying processes that need further study.

3 Methodology

This section explains the proposed machine learning approach for classification in detail, including the dataset used, data preprocessing, and the architecture of the approach.

3.1 System model

The proposed model consists of two steps: the first performs a semantic analysis of GitHub repositories, and the second applies machine learning techniques for software classification. For the first step, we explain dataset preprocessing, semantic analysis, and dataset preparation for the machine learning models. For the second step, we describe the working behavior of the selected machine learning models in detail.

3.2 Dataset preprocessing

Over 29 million repositories have been created on GitHub by more than 11 million developers from all over the world. In GitHub, a repository is a fundamental unit that generally comprises a software project’s source code and resource files. It keeps track of the project’s progress and high-level features, as well as the people who develop, contribute to, and maintain it. Users can fork a repository and watch it to follow its updates. We downloaded almost 150 Java projects from GitHub; the projects differ in functionality, but all are written in the same language. We placed all projects in one folder, as depicted in Fig. 2, and created an MS Access table in which we recorded, for each project, the code ID, project name, programming language, author, URL, main category, and subcategory, as shown in Fig. 1.

Fig.1

Dataset Gathered from GitHub.

Fig.2

Collection of Applications Folder.

3.3 Semantic analysis

Data preprocessing is applied to the whole dataset to bring it into the required shape. Normalization is used to equalize the dataset types and the number of required values for further processing. After these adjustments, semantic analysis is performed on the dataset. The default structure of one application folder is shown in Fig. 3. This structure is complex, with multiple folders, subfolders, and so on. To extract the keywords and variables, the Latent Semantic Analysis (LSA) library is used. LSA treats the text as a bag of words in a vector space and applies singular value decomposition to capture the semantic structure of a document, as shown in Fig. 4. After applying LSA, a CSV file is generated that contains multiple rows and columns: the first column lists all keywords and variables, while the other columns hold their occurrence frequencies. These columns contain some unnecessary values such as empty cells, special characters, and numbers. These values were removed manually using predefined Excel spreadsheet functions. Dataset cleaning and semantics-based feature selection are completed in this phase.
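The keyword-frequency table described above can be illustrated with a minimal Python sketch. It is not the R library used in this study, and it only builds the term-frequency input that LSA would then decompose; the sample sources, the function names, and the token pattern are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical stand-in for the source files of two applications.
SOURCES = {
    "calculator": "double result = Math.cos(angle); int count = 0; count = count + 1;",
    "attendance": "Intent intent = new Intent(); ArrayList<String> attendance = new ArrayList<>();",
}

def tokenize(code):
    """Lowercase the text and keep identifier-like tokens only.

    Tokens must start with a letter, so numbers, special characters and
    empty strings (the unnecessary values mentioned in the text) are dropped.
    """
    return re.findall(r"[a-z][a-z0-9_]*", code.lower())

def term_frequency_table(sources):
    """Return (sorted vocabulary, {application: Counter of keyword counts})."""
    counts = {app: Counter(tokenize(code)) for app, code in sources.items()}
    vocabulary = sorted(set().union(*counts.values()))
    return vocabulary, counts

def to_rows(vocabulary, counts):
    """One row per keyword: [keyword, frequency in app 1, frequency in app 2, ...]."""
    apps = sorted(counts)
    return [[term] + [counts[app][term] for app in apps] for term in vocabulary]

vocabulary, counts = term_frequency_table(SOURCES)
rows = to_rows(vocabulary, counts)  # same layout as the CSV: keyword, then frequencies
```

Filtering at tokenization time is a design choice of this sketch; it replaces the manual spreadsheet cleaning step rather than reproducing it.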

Fig.3

Structure of Application Folder.

Fig.4

Latent semantic analysis of keywords and variables.

3.3.1 Prepare dataset for machine learning models

Out of the 150 repositories, we consider 5 major categories having more than 30 repositories. These five categories are attendance system, basic calculator, ludo game, ordering system, and desktop applications. As shown in Fig. 5, several keywords and variables are found during the semantic analysis process, and these keywords and variables differ in each application. Therefore, we split all keywords and variables into arrays of 10 words and stored them along with their category. These categories and their keywords are listed in Table 2. In this way, we prepared our dataset for the machine learning models. The dataset is divided into 80% for training and 20% for testing.
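The grouping and splitting steps described above could be sketched in Python as follows. The keyword lists, function names, and random seed are hypothetical; the study itself used Weka, so this is only an illustration of the idea under those assumptions.

```python
import random

def chunk_keywords(keywords, size=10):
    """Split a sorted keyword list into consecutive groups of `size` words."""
    ordered = sorted(keywords)
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

def build_instances(keywords_by_category, size=10):
    """Turn {category: keywords} into (application type, keyword group) pairs."""
    instances = []
    for category, keywords in sorted(keywords_by_category.items()):
        for group in chunk_keywords(keywords, size):
            instances.append((category, ", ".join(group)))
    return instances

def train_test_split(instances, test_fraction=0.2, seed=42):
    """Shuffle reproducibly, then hold out `test_fraction` of the data for testing."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical keyword lists; real ones would come from the LSA output.
demo = {
    "basic_calculator": [f"calc_word{i:02d}" for i in range(25)],
    "ludo_game": [f"ludo_word{i:02d}" for i in range(25)],
}
instances = build_instances(demo)
train, test = train_test_split(instances)
```

Each instance pairs an application type with one keyword group, which mirrors the two-column layout of Table 2.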

Fig.5

Workflow of the proposed model.

Table 2

Dataset categories description

Application type	Application keywords
basic_calculator	then, there, these, this, throws, to, toast, tochararray, todegrees
ludo_game	access_coarse_location, access_fine_location, activity_main, activity_main2, activitycompat, add, addapi, addconnectioncallbacks, addition_iscorrect
attendance_system	inflater, information, int, integer, integers, intent, intent2, interruptedexception, io
ordering_system	dismiss, distinguished, documentation, donald, drawable, drop, drop_table, each, einstein
basic_calculator	convertingunits, cos, cosh, cosinv, count, count1, counter, create, create_table
desktop_applications	mengoneksikanya, mengubah, menit, menjadi, menjalankan, menonaktifkan, menu, menubar, menuitem
attendance_system	edittext7, edittext8, edittext9, else, email, emails, empty, emptylist, entering
desktop_applications	close, cloud, cloud∖∖n, color, colorboard, colorpin, cols, com, commons
ludo_game	blankfile, boolean, break, build, builder, bundle, button, by, calling
ordering_system	cliton, close, coin, cointext, coinvalue, collections, color, colordrawable, coloumn
attendance_system	oncreate, oncreateviewholder, ondatachange, onitemclick, onitemclicklistener, onpostexecute, onpreexecute, oops, os
attendance_system	department, dept, details, dialog, dialoginterface, dismiss, doinbackground, doonclick, dude
ordering_system	train, transparent, triviaquestion, triviaquiz, triviaquizhelper, triviaquiztext, trump, try, ttf
ludo_game	switch, t1, test, testing, textview, the, then, this, throws
basic_calculator	other, ounces, ouncestokilo, override, package, pair, parameters, params, parent
basic_calculator	getitemid, getmenuinflater, getreadabledatabase, getres, gets, getselecteditemposition, getstring, getstringextra, gettag
basic_calculator	activity_scientific_cal, activity_standard_cal, activity_unit_area, activity_unit_coverter, activity_unit_length, activity_unit_temperature, activity_unit_weight, adapter, add
desktop_applications	net, new, next, nextint, nextline, nilai, no, no_option, north
attendance_system	arrayadapter, arraylist, asynctask, attapattud, attendance, attendanceactivity, attendanceregister, auth, authresult
ordering_system	ardeshir, are, aristotle, arraylist, as, asha, assert, assertequals, astronomer
basic_calculator	public, put, putextra, query, radian, rawquery, represented, res, res_size

3.4 Semantic analysis based machine learning models

Five machine learning classifiers (decision table, random forest, OneR, Randomizable Filtered Classifier, and logistic model tree) are used for software classification. These models use different parameters and protocols to obtain results on the dataset for the software classification process. The details of the classifiers are given below. Fig. 5 shows the workflow of the proposed model.

A decision table represents conditional logic as a set of rules, each of which pairs a combination of conditions with the action to take. Decision tables are useful when a collection of conditions must be examined and a specified set of actions must be assigned once the conditions are met. The input values are represented in tabular form. The technique is most effective for software testing and initial software requirement management. In our work, it takes the software category dataset as input and produces the software classification. It works logically against every class of software category, and its output is recorded as true for a correctly classified value and false for a misclassified value [19].
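As a rough illustration of the tabular lookup idea, the Python sketch below maps each combination of condition values to its majority class. It is a simplified stand-in, not Weka's decision table implementation, and the binary features `has_cos` and `has_intent` are hypothetical.

```python
from collections import Counter, defaultdict

def fit_decision_table(rows, labels, key_columns):
    """Map each combination of key-column values to its majority class."""
    votes = defaultdict(Counter)
    for row, label in zip(rows, labels):
        votes[tuple(row[c] for c in key_columns)][label] += 1
    table = {key: counter.most_common(1)[0][0] for key, counter in votes.items()}
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen combinations
    return table, default

def predict(table, default, key_columns, row):
    """Look the row up in the table; fall back to the overall majority class."""
    return table.get(tuple(row[c] for c in key_columns), default)

# Hypothetical binary features: does the keyword group contain "cos" / "intent"?
COLUMNS = ["has_cos", "has_intent"]
rows = [
    {"has_cos": 1, "has_intent": 0},
    {"has_cos": 1, "has_intent": 0},
    {"has_cos": 0, "has_intent": 1},
    {"has_cos": 0, "has_intent": 1},
    {"has_cos": 0, "has_intent": 1},
]
labels = ["basic_calculator"] * 2 + ["attendance_system"] * 3
table, default = fit_decision_table(rows, labels, COLUMNS)
```

An unseen combination such as `{"has_cos": 1, "has_intent": 1}` falls back to the overall majority class, which is one simple way to handle gaps in the table.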

Random Forests (RF) consist of multiple decision tree classifiers [20]. Compared with a single decision tree (DT), this design handles the overfitting problem effectively. Rather than building one decision tree, it builds a forest and combines the predictions of the individual trees, for example by averaging them. It works in two parts: random sampling of the training dataset, and random feature selection during classification. It gives output in the form of Yes and No.
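The two sources of randomness mentioned above can be shown in a deliberately small Python sketch: a bagged ensemble of one-level trees combined by majority vote. This is a simplification and an assumption of this example, since a full random forest grows deeper trees and draws a feature subset at every split; the features `has_cos` and `has_dice` are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    """Most frequent label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, labels, feature):
    """One-level tree: majority class for each value of a single feature."""
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(label)
    rule = {value: majority(group) for value, group in by_value.items()}
    return feature, rule, majority(labels)

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    features = sorted(rows[0])
    forest = []
    for _ in range(n_trees):
        # Part 1: bootstrap sample of the training data (with replacement).
        picks = [rng.randrange(len(rows)) for _ in rows]
        sample_rows = [rows[i] for i in picks]
        sample_labels = [labels[i] for i in picks]
        # Part 2: random feature selection for this tree.
        feature = rng.choice(features)
        forest.append(fit_stump(sample_rows, sample_labels, feature))
    return forest

def predict_forest(forest, row):
    """Majority vote over the predictions of all trees."""
    votes = [rule.get(row[feature], default) for feature, rule, default in forest]
    return majority(votes)

# Hypothetical training data with two informative binary features.
rows = [{"has_cos": 1, "has_dice": 0}] * 10 + [{"has_cos": 0, "has_dice": 1}] * 10
labels = ["basic_calculator"] * 10 + ["ludo_game"] * 10
forest = fit_forest(rows, labels)
```

Because every tree sees a different sample and a different feature, individual errors tend to be outvoted, which is the intuition behind the reduced overfitting.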

OneR stands for “One Rule,” and it is a basic but accurate classification method that generates one rule for each predictor in the data and then chooses the rule with the smallest overall error as its “one rule.” To make a rule for a predictor, we build a frequency table that compares each value of the predictor to the target. OneR develops rules that are just slightly less accurate than state-of-the-art classification algorithms while also being easy for people to understand [21].

The Randomizable Filtered Classifier is used to randomize the features extracted during the preprocessing phase. This classifier has many filters to classify the features. It performs the classification process based on random features [22].

The logistic model tree is a combination of logistic regression and a decision tree. It is also known as LMT and gives competitive results compared to other existing techniques. It extracts features by combining regression values with decision tree induction [23].
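The OneR procedure described above (one frequency table per predictor, then the rule with the smallest total error) is compact enough to sketch in Python. The feature names and data are hypothetical, and ties between predictors are broken by name order, which is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (best_feature, rule, errors) following the OneR procedure."""
    best = None
    for feature in sorted(rows[0]):
        # Frequency table: predictor value -> counts of each class.
        table = defaultdict(Counter)
        for row, label in zip(rows, labels):
            table[row[feature]][label] += 1
        # The candidate rule predicts the most frequent class for each value.
        rule = {value: counts.most_common(1)[0][0] for value, counts in table.items()}
        errors = sum(rule[row[feature]] != label for row, label in zip(rows, labels))
        # Keep the predictor whose rule makes the fewest training errors.
        if best is None or errors < best[2]:
            best = (feature, rule, errors)
    return best

# Hypothetical data: "has_cos" separates the classes, "has_button" does not.
rows = [
    {"has_button": 1, "has_cos": 1},
    {"has_button": 0, "has_cos": 1},
    {"has_button": 1, "has_cos": 0},
    {"has_button": 1, "has_cos": 0},
]
labels = ["basic_calculator", "basic_calculator", "ludo_game", "ludo_game"]
feature, rule, errors = one_r(rows, labels)
```

Here the rule on `has_button` misclassifies one training row, while the rule on `has_cos` misclassifies none, so `has_cos` becomes the single rule.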

4 Results

This section briefly explains the results of the implemented methodology. It covers the implementation of the proposed method and the training and testing process. Five machine learning classifiers are used for the software classification process. The proposed model is implemented with the Weka tool and its libraries. A system with 4 CPUs at 2.5 GHz, 8 GB RAM, and the Windows operating system was used for training and testing.

4.1 Evaluation criterion

The proposed model used different classifiers for software classification. The evaluation is done with the help of Eqs. 1–7 below. The cross-validation process is shown in Fig. 6. True positive (TP) denotes the number of correctly classified positive instances, true negative (TN) the number of correctly classified negative instances, false positive (FP) the number of instances incorrectly classified as positive, and false negative (FN) the number of instances incorrectly classified as negative. Recall, accuracy, and precision were used to measure the model’s output. MCC denotes the Matthews correlation coefficient.

$FP\ Rate = \frac{FP}{FP + TN}$ (1)

$TP\ Rate = \frac{TP}{TP + FN}$ (2)

$Precision = \frac{TP}{TP + FP}$ (3)

$Recall = \frac{TP}{TP + FN}$ (4)

$F1\text{-}score = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$ (5)

$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$ (6)

$MCC = \frac{(TP \cdot TN) - (FP \cdot FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$ (7)
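For readers who want to reproduce these measures from a confusion matrix, the following Python sketch uses the standard textbook definitions of the metrics. The counts in the example are invented for illustration and are not results from this study.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Standard evaluation measures from the four confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "tp_rate": ratio(tp, tp + fn),       # same quantity as recall
        "fp_rate": ratio(fp, fp + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }

# Hypothetical counts, chosen only to exercise the formulas.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
```

For a multi-class problem such as the five application categories, these measures are typically computed per class (one class against the rest) and then averaged.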

4.1.1 Cross-validation

Cross-validation shuffles the dataset and partitions it into different groups (folds) so that it can be used for both training and testing. It provides preprocessed data to the trained model for evaluation, as displayed in Fig. 6.
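A minimal Python sketch of this shuffling and grouping is given below. The number of folds is not stated in the text, so the default of 10 is an assumption, as are the function name and the seed.

```python
import random

def k_fold_indices(n_items, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

# Example: 25 instances and 5 folds, so each fold holds 5 test instances.
splits = list(k_fold_indices(25, k=5))
```

Every instance is used for testing exactly once and for training in the remaining rounds, so the reported score is an average over all folds.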

Fig.6

Workflow of the cross-validation process.

4.2 Proposed method evaluation

The evaluation metrics of the proposed method are given in Table 3, and the correctly and incorrectly classified applications are shown graphically in Fig. 8 in the form of confusion matrices. The decision table classifier shows the lowest result because this algorithm does not consider a complete process for all the test cases, and it must be fed with more details for all the cases. Random forest reduces overfitting and improves results compared to a single decision tree. The OneR result differs slightly from that of random forest because OneR generates only a one-level decision tree. The Randomizable Filtered Classifier has an average result due to the random values assigned during the classification process. The logistic model tree is simple, has less parametric functionality, and performs well in comparison with random forest.

Table 3
Evaluation Metrics of Proposed Models

ML Model	TP Rate	FP Rate	Precision	Recall	F1-Score	MCC	ROC Area	PRC Area
Decision table	1.000	0.179	0.611	1.000	0.759	0.708	0.982	0.929
Random forest	1.000	0.000	1.000	1.000	1.000	1.000	1.000	1.000
OneR	1.000	0.013	0.956	1.000	0.977	0.971	0.994	0.956
Randomizable Filtered	1.000	0.013	0.956	1.000	0.977	0.971	0.994	0.956
Logistic Model Tree	0.990	0.000	1.000	0.990	0.995	0.993	0.996	0.994
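As a quick consistency check, the F1-Score column of Table 3 can be recomputed from the Precision and Recall columns using the harmonic-mean form of the F1 definition. The Python sketch below does this with a tolerance of 0.001, an assumption made because the table values are rounded to three decimals; the short row keys are abbreviations introduced here.

```python
# (precision, recall, reported F1) for the rows of Table 3, in table order.
TABLE_3 = {
    "DT": (0.611, 1.000, 0.759),
    "RF": (1.000, 1.000, 1.000),
    "OneR": (0.956, 1.000, 0.977),
    "RFC": (0.956, 1.000, 0.977),
    "LMT": (1.000, 0.990, 0.995),
}

def f1_from(precision, recall):
    """F1 as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Absolute difference between the recomputed and the reported F1 per row.
differences = {
    name: abs(f1_from(precision, recall) - reported)
    for name, (precision, recall, reported) in TABLE_3.items()
}
consistent = all(diff <= 0.001 for diff in differences.values())
```

Small differences are expected here because the check starts from already rounded precision and recall values.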

5 Discussion

Table 4 compares the classifier results on the software classification dataset. The decision table classifier shows the lowest accuracy, 85.3%, because this algorithm does not consider the complete process for all the test cases and has to be fed with more details for all the cases. The second-lowest is the Randomizable Filtered Classifier, with an accuracy of 90%, due to the random values assigned during the classification process. OneR, at 98.9%, is slightly below random forest because it generates only a one-level decision tree. Random forest (99.3%) and the logistic model tree (99%) have almost the same accuracy, with minor differences: random forest reduces overfitting and thereby improves accuracy, while the logistic model tree is simple and has less parametric functionality. Fig. 7 plots the accuracy comparison and shows the lowest and highest accuracy values.

Table 4
Proposed Model Accuracy Comparison

Proposed methodology results with the following approaches
Method	Accuracy (%)
Decision table	85.3
Random Forest	99.3
OneR	98.9
Randomizable Filtered Classifier	90.0
Logistic Model Tree	99.0
State of the art approaches results
Deep Learning [40]	65
MNB [41]	55.5
RNN [42]	80.22

Fig. 7

Graphical representation of classifier accuracy comparison.

Fig. 8

Confusion Matrix of Proposed Models.

6 Conclusion

This work seeks to support a sustainable software development process by using multiple machine learning approaches for software classification. The LSA library has been used to extract the values of keywords and variables in applications. LSA applies singular value decomposition to a bag-of-words representation, treating the text as a vector space, so that the reduced space corresponds to the semantic structure of a document. Afterward, the proposed approach used five classifiers to evaluate the model: decision table, random forest, OneR, Randomizable Filtered Classifier, and logistic model tree. The model evaluation is done on the software classification dataset after preprocessing. The resulting accuracies are decision table: 85.3%, random forest: 99.3%, OneR: 98.9%, Randomizable Filtered Classifier: 90.0%, and logistic model tree: 99.0%. The random forest achieves the highest accuracy, 99.3%, due to its parametric function evaluation and lower misclassification error. The decision table yields the lowest accuracy, 85.3%, due to misclassification error and the fewer parameters used in the evaluation process. Therefore, based on the accuracy values, the random forest classifier is the best one for the software classification process.
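For reference, the decomposition underlying LSA can be written compactly. The notation below is the standard textbook formulation and is an assumption here, since the text does not fix symbols: X is the term-document matrix with m terms (keywords) and n documents (applications), and k is the number of retained latent dimensions.

```latex
X \in \mathbb{R}^{m \times n}, \qquad
X = U \Sigma V^{\top}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\top}, \quad k \ll \min(m, n)
```

Here U_k and V_k hold the first k left and right singular vectors and Σ_k the k largest singular values. Each column of Σ_k V_k^⊤ is a k-dimensional representation of one document, so documents that use related keywords end up close together even when they share few exact terms.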

In future work, we will analyze the resource utilization of applications of the same type by adopting manifold machine learning algorithms.

References

1. Linares-Vásquez, McMillan, Poshyvanyk and Grechanik, On using machine learning to automatically classify software applications into domain categories, Empirical Software Engineering 19(3) (2014), 582–618.

2. Altarawy, Shahin, Mohammed and Meng, Lascad: Language-agnostic software categorization and similar application detection, Journal of Systems and Software 142 (2018), 21–34.

3. Patil and Kale, Category Based Application Engine, IRJCS: International Research Journal, 2017.

4. Nafi K.W., Roy C.K. and Schneider K.A., A universal cross language software similarity detector for open source software categorization, Journal of Systems and Software 162 (2020), 110491.

5. Kim, Cho S.-J., Han and You, A software classification scheme using binary-level characteristics for efficient software filtering, Soft Computing 22(2) (2018), 595–606.

6. Bikki, Machine learning for text categorization: experiments using clustering and classification, 2018.

7. Zhang, Lo, Kochhar P.S., Xia, Li and Sun, Detecting similar repositories on GitHub, IEEE, 2017, pp. 13–23.

8. Guendouz, Amine and Hamou R.M., Recommending relevant open source projects on github using a collaborative-filtering technique, International Journal of Open Source Software and Processes (IJOSSP) 6(1) (2015), 1–16.

9. Reyes, Ramírez and Paciello, Automatic classification of source code archives by programming language: A deep learning approach, IEEE, 2016, pp. 514–519.

10. Catal, Tugul and Akpinar, Automatic software categorization using ensemble methods and bytecode analysis, International Journal of Software Engineering and Knowledge Engineering 27(07) (2017), 1129–1144.

11. Nguyen A.T. and Nguyen T.N., Automatic categorization with deep neural network for open-source Java projects, IEEE, 2017, pp. 164–166.

12. LeClair, Eberhart and McMillan, Adapting neural text classification for improved software categorization, IEEE, 2018, pp. 461–472.

13. Alreshedy, Dharmaretnam, German D.M., Srinivasan and Gulliver T.A., SCC: automatic classification of code snippets, arXiv preprint arXiv:1809.07945, 2018.

14. Chen, Huang, Liu, Chen, Zhou and Luo, Automatically detecting the scopes of source code comments, Journal of Systems and Software 153 (2019), 45–63.

15. Prana G.A.A., Treude, Thung, Atapattu and Lo, Categorizing the content of GitHub README files, Empirical Software Engineering 24(3) (2019), 1296–1327.

16. Velázquez-Rodríguez and De Roover, Automatic library categorization, 2020, pp. 733–734.

17. Qadir M.Z., Jilani A.N. and Sheikh H.U., Automatic Feature Extraction, Categorization and Detection of Malicious Code in Android Applications, arXiv preprint arXiv:2006.02758, 2020.

18. Auch, Weber, Mandl and Wolff, Similarity-based analyses on software applications: A systematic literature review, Journal of Systems and Software 168 (2020), 110669.

19. Nguyen P.T., Di Rocco, Di Ruscio and Di Penta, CrossRec: Supporting software developers by recommending third-party libraries, Journal of Systems and Software 161 (2020), 110460.

20. […], Fakhoury, Christensen, Arnaoudova, Zogaan and Mirakhorli, Automatic classification of software artifacts in open-source applications, IEEE, 2018, pp. 414–425.

21. Hussain, Keung and Khan A.A., Software design patterns classification and selection using text categorization approach, Applied Soft Computing 58 (2017), 225–244.

22. Pérez, Pérez, Casillas and Gojenola, Cardiology record multi-label classification using latent Dirichlet allocation, Computer Methods and Programs in Biomedicine 164 (2018), 111–119.

23. Mahmoud and Bradshaw, Semantic topic models for source code analysis, Empirical Software Engineering 22(4) (2017), 1965–2000.

24. Huysmans, Dejaeger, Mues, Vanthienen and Baesens, An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models, Decision Support Systems 51(1) (2011), 141–154.

25. Zhang, Zulkernine and Haque, Random-forests-based network intrusion detection systems, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 38(5) (2008), 649–659.

26. Singh, Singh and Singh, Optimization of sentiment analysis using machine learning classifiers, Human-centric Computing and Information Sciences 7(1) (2017), 1–12.

27. Asaju L.a.B., Shola P.B., Franklin and Abiola H.M., Intrusion detection system on a computer network using an ensemble of randomizable filtered classifier, K-nearest neighbor algorithm, FUW Trends in Science & Technology Journal 2(1) (2017), 550–553.

28. Chen, Shahabi, Shirzadi, Li, Guo, Hong, Li, Pan, Hui and Ma, A novel ensemble approach of bivariate statistical-based logistic model tree classifier for landslide susceptibility assessment, Geocarto International 33(12) (2018), 1398–1420.

29. Dimitrijevic I.R. and Parausic, Overview and Classification of Open-Source Databases on Security Issues, International Organizing Committee: 106.

30. AlOmar E.A., Mkaouer M.W. and Ouni, Toward the automatic classification of self-affirmed refactoring, Journal of Systems and Software 171 (2021), 110821.

31. Wikipedia Contributors, “Categorization,” Wikipedia, Wikimedia Foundation, 5 May 2019, en.wikipedia.org/wiki/Categorization, Accessed 18 November 2021.

32. Khalilian, Baraani-Dastjerdi and Zamani, CGenProg: Adaptation of cartesian genetic programming with migration and opposite guesses for automatic repair of software regression faults, Expert Systems with Applications 169 (2021), 114503.

33. Hung C.S. and Dyer, Boa views: Easy modularization and sharing of msr analyses, Proceedings of the 17th International Conference on Mining Software Repositories, 2020.

34. Blobel, Rumo and Lames, Sports Information Systems: A systematic review, Journal homepage: http://iacss.org/index.php?id 20.1(2021).

35. Dhar, et al., Text categorization: past and present, Artificial Intelligence Review 54(4) (2021), 3007–3054.

36. Kim, Park and Lee, Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis, Expert Systems with Applications 152 (2020), 113401.

37. Gürsakal, Gürsakal and Çelik, Big Data Companies and Open Source Movement, Avrupa Bilim ve Teknoloji Dergisi 21 (2021), 680–689.

38. Kagdi and Maletic, Software repositories: A source for traceability links, International Workshop on Traceability in Emerging Forms of Software Engineering (GCT/TEFSE’07), 2007.

39. Nguyen A.T. and Nguyen T.N., Automatic categorization with deep neural network for open-source Java projects, in 2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C), IEEE, 2017.

40. Nafi K.W., et al., A universal cross language software similarity detector for open source software categorization, Journal of Systems and Software 162 (2020), 110491.

41. Alreshedy, et al., SCC: automatic classification of code snippets, arXiv preprint arXiv:1809.07945, 2018.

42. Reyes, Ramírez and Paciello, Automatic classification of source code archives by programming language: A deep learning approach, in 2016 International Conference on Computational Science and Computational Intelligence (CSCI), IEEE, 2016.

43. Prana G.A.A., et al., Categorizing the content of GitHub README files, Empirical Software Engineering 24(3) (2019), 1296–1327.

44. Auch, et al., Similarity-based analyses on software applications: A systematic literature review, Journal of Systems and Software 168 (2020), 110669.