Abstract
The information age has made accurate information retrieval a challenge. In this paper, we analyze the intelligent recommendation of innovative entrepreneurial projects based on collaborative filtering, one of the most widely used and successful techniques in recommendation systems. A time-dependent interest migration function is introduced to address the shortcomings of traditional collaborative filtering recommendation algorithms. We also build an intelligent recommendation engine for innovative entrepreneurial projects on the Hadoop open-source distributed computing framework, sustainable PSCM, and the Mahout collaborative filtering recommendation engine. Experiments test and evaluate the overall performance of the distributed recommendation platform and the improved collaborative filtering recommendation algorithm. The algorithm outperforms similar algorithms in terms of data volume and coverage of recommended innovation and entrepreneurship projects, which indicates that collaborative filtering and sustainable PSCM are useful for the intelligent recommendation analysis of innovative entrepreneurial projects.
Introduction
The emergence and development of the Internet economy has led to explosive growth in the amount of information on the Internet, and information consumers must find what is truly useful to them within this vast volume. Many scholars have studied this problem, which has led to the emergence and development of personalized recommendation systems. With the rapid development of information network technology, the contradiction between the efficiency with which users acquire information and the way information resources are allocated is deepening. The previously popular information push mode can no longer meet the basic needs of users. The rapid expansion of the amount of information reduces the retrieval efficiency of databases, which creates obstacles for users querying information [1]. The lack of information integration and personalized service in traditional entrepreneurial information platforms limits the breadth of communication channels between platforms and users and seriously affects the user experience. Most traditional information integration systems in common use are based on TRS network radar and are developed in combination with web spider software. Although such systems have some information integration ability, their research and development cost is too high, and the information they provide is not accurate enough to gain user recognition.
Recommendation algorithms include content-based, association rule-based, collaborative filtering-based, and knowledge-based recommendation, among others. Collaborative filtering is one of the most mainstream of these, but it also faces several problems and challenges, such as data sparsity, new-user cold start, and scalability. User-based collaborative filtering rests on one assumption: if two users have similar interests, one user may be interested in what the other is interested in. The algorithm relies heavily on users' historical rating data when modeling users, and this rating data tends to be highly sparse [2], which makes it very difficult to build an accurate user model. Current optimization schemes for the sparsity problem fall into three categories. The first introduces other factors into the similarity calculation to reduce the impact of sparse rating data on it. The second fills the missing values of the rating matrix in some way to reduce the sparsity of the dataset. The third introduces a clustering algorithm: such methods generally calculate the distance between the target user and each cluster centroid, determine the cluster in which the target user is located, and thereby obtain an initial neighbor user set for the target user.
The traditional solution to the sparsity of rating data is to fill the missing item ratings with the mean or with zeros, which inevitably introduces large errors into the prediction results. In response, some scholars have proposed converting user comment information into a feature matrix and combining the similarity calculated from the feature matrix with that from the rating matrix to obtain a comprehensive similarity for the target user. Others consider the similarity of user ratings and the similarity of user review characteristics together and control the importance of each through weights [3]. These approaches have one thing in common: they introduce another factor to reduce the dependence of the similarity on sparse data and thus improve the accuracy of the similarity calculation. Their advantage is that they mitigate well the inaccurate selection of neighbor users caused by data sparsity; their disadvantage is that they increase the complexity of the algorithm. For this reason, some scholars have proposed filling missing values with the mean. Although this alleviates the problems caused by data sparsity to a certain extent, it is not ideal for datasets that are too sparse.
Scalability is also a very important indicator of the performance of recommender systems. Thousands of new users and goods appear on the Internet every day, and a recommendation system that can still provide personalized service to each user quickly and in real time [4] does so because its algorithm is highly scalable. Collaborative filtering recommendation algorithms often run into high-dimensional vector calculations when finding neighbors, which consume a great deal of system time. If the algorithm can reduce the dimension of the high-dimensional rating matrix, its scalability increases greatly. For the processing of high-dimensional data, some scholars have proposed introducing principal component analysis (PCA) into the recommendation system. This technique reduces the data dimension and hence the time consumption of the system.
In this paper, a new method of filling and dimensionality reduction is proposed for the problems of data sparsity and scalability. We also introduce the mean difference of user ratings into the similarity calculation and the mean of user ratings into the rating prediction for the target item. This greatly reduces the inaccuracy in recommendation results caused by different user rating scales. We further propose and design an entrepreneurial information integration system based on personalized parameters. The platform can integrate entrepreneurial information in a big data environment [5] and effectively eliminate disordered information. The system realizes the optimal allocation of entrepreneurial information resources and greatly enhances the user experience through efficient, personalized service.
Algorithm and design
Traditional user-based collaborative filtering recommendation algorithm
The key to the traditional user-based collaborative filtering recommendation algorithm is to accurately obtain the target user's neighbor user set. Whether the nearest-neighbor user set found by the algorithm is accurate affects the recommendation accuracy of the recommender system [6], which in turn directly determines how much users recognize and rely on the system. The traditional user-based collaborative filtering recommendation algorithm can be divided into the following three steps:
First, we obtain the users' rating data for the items and transform it into a user-item rating matrix with m rows and n columns. Second, we calculate the similarity between each user and all other users according to the rating matrix determined in the first step, which yields the user similarity matrix. Third, we determine the neighbor user set and use the ratings of the users in that set to make rating predictions for the target item.
Cosine similarity: The size of the cosine value
Modified Cosine Similarity: When the user’s rating scale is very close, the calculated result is still relatively accurate. The rating scales of users in datasets tend to be different. Suppose the modified cosine similarity of user
Similarity based on the Pearson correlation coefficient: The Pearson correlation coefficient method is a very common similarity calculation method. The setting
Similarity based on Euclidean distance: We use Euclidean distance to measure similarity. Its expression is shown in Eq. (5). Where
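As an illustration of the similarity measures listed above, the following is a minimal sketch using their common textbook definitions; the paper's own equations are not reproduced here, so the exact forms may differ. The function names, the dictionary representation of a user's ratings, the restriction to co-rated items, and the mapping 1 / (1 + d) from Euclidean distance to similarity are all assumptions made for this example.

```python
import math


def _co_rated(u, v):
    """Items rated by both users; ratings are dicts mapping item -> score."""
    return list(u.keys() & v.keys())


def cosine_similarity(u, v):
    """Cosine of the angle between the two co-rated rating vectors."""
    items = _co_rated(u, v)
    dot = sum(u[i] * v[i] for i in items)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in items))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in items))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def pearson_similarity(u, v):
    """Pearson correlation over co-rated items (means taken over those items)."""
    items = _co_rated(u, v)
    if not items:
        return 0.0
    mean_u = sum(u[i] for i in items) / len(items)
    mean_v = sum(v[i] for i in items) / len(items)
    cov = sum((u[i] - mean_u) * (v[i] - mean_v) for i in items)
    var_u = sum((u[i] - mean_u) ** 2 for i in items)
    var_v = sum((v[i] - mean_v) ** 2 for i in items)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def euclidean_similarity(u, v):
    """Assumed mapping from distance d to similarity: 1 / (1 + d)."""
    items = _co_rated(u, v)
    if not items:
        return 0.0
    dist = math.sqrt(sum((u[i] - v[i]) ** 2 for i in items))
    return 1.0 / (1.0 + dist)


# Hypothetical ratings: bob rates every shared project one point lower than alice.
alice = {"p1": 5, "p2": 3, "p3": 4, "p4": 1}
bob = {"p1": 4, "p2": 2, "p3": 3}
```

In this toy case the Pearson similarity is 1 because the two users differ only by a constant offset, while the plain cosine similarity is slightly below 1; this is the rating-scale effect that the mean-centered measures are meant to absorb.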
In this paper, the original user-item rating matrix
Filling and dimensionality reduction based on average Euclidean distance and mean: We fill in the missing values of the target vector with a vector that is relatively similar to the target vector.
Transferred to:
For example, given the vectors
When
Similarity based on cosine distance and mean difference: Cosine similarity is one of the most classic similarity calculation methods. Because it completely ignores the differences in rating scale between users, improved methods such as modified cosine similarity and the Pearson correlation coefficient appeared later [8]. To a certain extent, these make up for the influence of different user rating scales on the similarity calculation results. Cosine distance focuses on the difference in direction between two vectors. This paper presents a method that also compensates for the influence of different user rating scales on the similarity calculation results. We combine the similarity of the two users' rating scales with their cosine similarity to obtain a comprehensive similarity, and we control the importance of the two through the weight k, as given in Eq. (8).
Where
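To make the idea of a weighted comprehensive similarity concrete, here is a hedged sketch. Eq. (8) itself is not reproduced in the text, so the linear blend k * cosine + (1 - k) * scale similarity, the definition of scale similarity as one minus the normalized difference of the users' mean ratings, and the rating bounds `r_min` and `r_max` are all assumptions chosen only to be consistent with the statement that the result approaches cosine similarity as k approaches 1.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def scale_similarity(u, v, r_min=1.0, r_max=5.0):
    """Assumed rating-scale similarity: 1 minus the normalized mean difference."""
    mean_diff = abs(sum(u) / len(u) - sum(v) / len(v))
    return 1.0 - mean_diff / (r_max - r_min)


def combined_similarity(u, v, k, r_min=1.0, r_max=5.0):
    """Assumed blend: k weights the cosine term, (1 - k) the scale term."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return k * cosine(u, v) + (1.0 - k) * scale_similarity(u, v, r_min, r_max)


# Hypothetical users whose ratings differ by a constant offset of one point.
u = [5, 3, 4]
v = [4, 2, 3]
sim_half = combined_similarity(u, v, k=0.5)
```

With k = 0 only the rating-scale term remains, and with k = 1 the function reduces to plain cosine similarity.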
Rating prediction based on mean user ratings: This paper assigns an appropriate value to the weight k in Eq. (8). The value of k is related to how concentrated the user rating scales are: the more concentrated the rating scales, the larger the value of k. We then calculate the similarity between users according to this formula and obtain the target user's neighbor user set.
We make a rating prediction for the target item based on
Step 1: we select a neighbor user from the neighbor user set. Step 2: we set the score of the target item for that neighbor. Step 3: we repeat steps 1 and 2 until all users in the neighbor user set have been processed. The final predicted score is then obtained according to Eq. (9).
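The prediction step can be sketched as follows. Eq. (9) is not reproduced in the text, so this example uses the widely known mean-centered weighted average as a stand-in: each neighbor's deviation from its own mean rating is weighted by its similarity to the target user and added to the target user's mean. The function name and the tuple layout are hypothetical.

```python
def predict_rating(target_mean, neighbors):
    """Predict the target user's rating for one item.

    neighbors: list of (similarity, neighbor_rating, neighbor_mean) tuples,
    one per neighbor user who has rated the target item.
    """
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(abs(sim) for sim, _, _ in neighbors)
    if denominator == 0:
        # No usable neighbor: fall back to the target user's own mean rating.
        return target_mean
    return target_mean + numerator / denominator


# Hypothetical neighbors: one rates the item above their mean, one below.
prediction = predict_rating(3.0, [(1.0, 5.0, 4.0), (0.5, 2.0, 3.0)])
```

Because only deviations from each user's mean enter the sum, a neighbor who rates everything high contributes no more than one who rates everything low, which is the role the mean user rating plays in the prediction.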
The principle of entrepreneurial information integration is to bring together the entrepreneurial information resources on the Internet, which come in different types, on different carriers, and in decentralized and heterogeneous forms. In this way, the collection, integration, and sorting of entrepreneurial information are completed, and users can query all kinds of entrepreneurial information through a unified client interface [9]. Information service based on personalized parameters refers to analyzing the demand for entrepreneurial project information according to the user's age, identity, interests, and other factors, so that a targeted, personalized service strategy can be formulated. The entrepreneurial information integration system based on personalized parameters mainly includes a resource publishing client, a unified service platform, a project retrieval subsystem, a processing subsystem, an information retrieval subsystem, and various interfaces. Its overall structure is shown in Fig. 1.
Schematic diagram of the overall architecture of the system.
The system adopts the structural form of combining entity components and virtual systems. Entity components are selected based on the principle of improving the efficiency of secondary development of the system and ensuring the speed of information update. The virtual system invokes the subsystems of the traditional platform. In this way, the repeated occupation of resource storage space is avoided, and the trouble caused by network address change is also avoided.
Project retrieval subsystem: The project retrieval subsystem is a core component of the platform. Its function is to extract information from the platform database according to the keywords entered by the user and to feed the results back to the user through the client. The subsystem includes four components: an index database, a retrieval module, a load balancing module, and a user interface. After receiving the user's query request, the user interface generates a query string and packs it into a large string for transmission downstream. The load balancing module distributes users' retrieval requests, selecting a suitable node for each request according to the load capacity of each retrieval node. The retrieval module processes the retrieval request [10]: it extracts the retrieval information entered by the user from the large string and creates a word dataset. The database locates the corresponding file based on the interface information of the Index function. The results are sorted, processed, and fed back to the load balancing module.
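The node selection performed by the load balancing module could look like the sketch below. The text only says that nodes are chosen according to their load capacity, so the specific policy (pick the node with the lowest ratio of active requests to capacity, skipping saturated nodes), the function name, and the node names are assumptions for illustration.

```python
def pick_node(nodes):
    """Choose a retrieval node for the next request.

    nodes: dict mapping node name -> (active_requests, capacity).
    Returns the name of the node with the lowest utilization.
    """
    available = {
        name: (active, capacity)
        for name, (active, capacity) in nodes.items()
        if active < capacity
    }
    if not available:
        raise RuntimeError("all retrieval nodes are at capacity")
    return min(available, key=lambda name: available[name][0] / available[name][1])


# Hypothetical cluster state: node-c is saturated, node-b is least utilized.
cluster = {"node-a": (3, 10), "node-b": (1, 10), "node-c": (5, 5)}
chosen = pick_node(cluster)
```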
Processing subsystem: After generating collection tasks and selecting data collection methods, the processing subsystem performs centralized processing of all data in the platform based on the dynamic data maintenance mechanism. Under the B/S architecture, we collect the information resources that have been configured. The processing subsystem is designed according to the structure shown in Fig. 2.
Schematic diagram of the structure of the processing subsystem.
We create data sources according to the user's personalized parameters (requirements), which sets the acquisition and output conditions for the sorted database. The time variable is marked by the time field between adjacent acquisition actions, and we take the time division point as the acquisition interruption signal [11]. Data collection is completed based on serial numbers. The project resource integration platform monitors the transmission route of the data according to the user's processing request. We start the various databases and generate SQL statements that meet the requirements based on the field mirroring mechanism. The system thus supports updating, editing, and querying information.
Interface design: Different users can query this system according to their needs. The system provides both a computer terminal interface and a mobile terminal interface to meet the needs of terminal queries.
Mobile phone interface design: The mobile phone has a central processing unit. The mobile phone can download the corresponding equipment to compile and run the program through the signal of the remote output terminal. Users can scan the QR code and read the detailed information of an entrepreneurial project when querying entrepreneurial information. The mobile terminal interface design is shown in Fig. 3. The central processing unit consists of the MICRF007 on-off keying receiver chip in an SOP(M)-8 package. The chip contains the reference controller, demodulator, and converter [12]. The receiver consists of power supply decoupling capacitors, crystal oscillators, and external capacitors. We use the chip's own narrow RF internal tuning to complete the interaction with the mobile phone. Pins 4 to 7 provide the comparator reference threshold, data output, receiver control, and external capacitor connections, respectively. This design greatly increases the range of information input. The system and the mobile terminals form a one-to-many, point-to-point transmission mode, which greatly reduces the time needed for information integration.
Mobile phone interface circuit.
Computer interface circuit.
Computer interface design: Users can retrieve historical records and query related information through the computer, which also wakes up hibernation files and resumes them. The computer interface design is shown in Fig. 4. The computer interface adopts a trigger interface circuit with strong load capacity [13]. The drive current is 90 mA. The main control unit generates a trigger signal when the user sends an instruction to retrieve information and detects the trigger information through the FPGA program. This ensures the accuracy of entrepreneurial project information retrieval.
This platform integrates personalized entrepreneurial information based on the mining and analysis of user information in the database. We provide personalized resource recommendations for users by summarizing their retrieval history, resource retrieval categories, and hobbies. Figure 5 shows the process of personalized integration on the platform.
Implementation process of personalized information integration system.
Retrieval of project information resources: The project information integration system collects and processes information according to user requests. Entrepreneurial project information retrieval is the premise of platform function realization [14]. This platform extracts and distributes the information requested by users based on the binary data conversion method. The specific process is:
The system builds a binary sequence from adjacent points to judge whether the trend between two points is upward or downward. The data-based trend proportional reduction mechanism merges the candidate sequences of adjacent points with clear relationships, together with their patterns, into an interval. The similarity of each sequence in the interval is then obtained according to the established algorithm, which determines the magnitude of the upward and downward growth. We use this as the basis for clustering similar subsequences and for completing pattern matching.
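The binary encoding of trends between adjacent points can be illustrated with the short sketch below. The text does not give the encoding or the similarity algorithm, so the convention (1 for a rise, 0 otherwise) and the similarity measure (fraction of matching positions) are assumptions made only for this example.

```python
def trend_bits(series):
    """Encode each pair of adjacent points as 1 (upward) or 0 (flat or downward)."""
    return [1 if nxt > cur else 0 for cur, nxt in zip(series, series[1:])]


def bit_similarity(x, y):
    """Fraction of positions at which two equal-length bit sequences agree."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(1 for a, b in zip(x, y) if a == b) / len(x)


# Hypothetical subsequences whose trends differ at a single position.
first = trend_bits([1, 3, 2, 2, 5])
second = trend_bits([2, 4, 3, 6, 7])
similarity = bit_similarity(first, second)
```

Subsequences whose similarity exceeds a chosen threshold could then be placed in the same cluster for pattern matching.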
Integration of project information resources: The information obtained through information retrieval is multiple, heterogeneous, and discrete, which may lead to data redundancy and to the emergence of information islands. Integration measures are therefore necessary.
This paper integrates heterogeneous information based on database connection pool technology. Connection pools are used to store connection objects, and we control the amount of information according to certain connection rules. In this way, the query interface is used to connect to the database. Heterogeneous databases are connected through JDBC API functions [15], which allows the transformed user request to be assigned to a specific database. We extract the required information and create an image file. To increase the speed of creating image files, the delay must be reduced, so that the trigger interface on the PC side and the interface circuit on the mobile side work at the highest transmission efficiency. The principle of combinational logic delay is shown in Fig. 6.
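The connection pool idea described above can be sketched in a few lines. The paper's system uses JDBC; the example below substitutes Python's standard `sqlite3` and `queue` modules purely for illustration, and the class name `ConnectionPool` and its methods are hypothetical. It is intended for single-threaded demonstration only, since `sqlite3` connections are by default bound to the thread that created them.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool that hands out and takes back reusable connections."""

    def __init__(self, database, size):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            # Connections are created once, up front, and then reused.
            self._idle.put(sqlite3.connect(database))

    @contextmanager
    def connection(self, timeout=None):
        # Blocks until a connection is free (or raises queue.Empty on timeout).
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always return the connection so later requests can reuse it.
            self._idle.put(conn)

    def close(self):
        while not self._idle.empty():
            self._idle.get_nowait().close()


# Note: with ":memory:", each pooled connection is its own separate database.
pool = ConnectionPool(":memory:", size=2)
with pool.connection() as conn:
    value = conn.execute("SELECT 1").fetchone()[0]
idle_after = pool._idle.qsize()
pool.close()
```

Bounding the pool size is one way to limit how many requests reach a database at once, which corresponds to controlling the load through connection rules.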
Schematic diagram of combinational logic delay.
Release of consolidated information: This platform publishes the integrated information through web pages, and information navigation further enhances the user experience and helps meet users' information access needs. Users can read the information in the local database or enter the original address for a query [16]. The platform uses Windows 2000 Server to create the information release environment. The system provides users with an open database, and information is released both manually and intelligently. The system thus meets users' needs for quick queries.
This section first introduces the experimental environment, the selected dataset, and the evaluation indicators of the experiment. We then compare and analyze the improved method and the traditional method on this dataset.
Introduction to the experimental environment and dataset: The PC used in the experiment has an Intel Core i5-3230M CPU, 8 GB of memory, and the 64-bit Windows 8 Enterprise operating system. The programming language is Python 2.7, and the editor is PyCharm Community Edition.
This experiment uses the 100k MovieLens dataset provided by the GroupLens group at the University of Minnesota [17]. The dataset contains 943 users and 1682 entrepreneurial projects. Scores range from 0.5 to 5 points. Each user has rated no fewer than 20 entrepreneurial projects. The experiment randomly selects 80% of the data as the training set and 20% as the test set, and the results are cross-validated.
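A random 80/20 split of the rating records can be sketched as follows. The function name, the `(user, item, rating)` tuple layout, and the fixed seed are assumptions for illustration; the paper does not describe how its split was implemented.

```python
import random


def train_test_split(ratings, test_fraction=0.2, seed=0):
    """Randomly split rating records into a training set and a test set."""
    shuffled = list(ratings)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical toy data: 10 users x 10 items, every rating set to 3.
toy_ratings = [(user, item, 3) for user in range(10) for item in range(10)]
train, test = train_test_split(toy_ratings)
```

Repeating the split with different seeds, or rotating which fifth of the data is held out, gives a simple form of cross-validation.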
Experiment evaluation index: Current performance evaluation indicators for recommendation algorithms include mean absolute error (MAE), recall, precision, and others. The evaluation index used in this paper is MAE, which reflects the gap between the predicted value and the true value; the smaller the value, the better. Assuming that the predicted score of the recommendation system for the item is set
The MAE of the system can be expressed by Eq. (11):
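For reference, MAE is conventionally the average of the absolute differences between predicted and actual scores. The sketch below computes it under that standard definition; the function and argument names are chosen for this example only.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical predictions with errors of 1, 0 and 2 points.
mae = mean_absolute_error([4.0, 3.0, 5.0], [5.0, 3.0, 3.0])
```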
Experimental results: The dataset used in this experiment is a 100k entrepreneurial project dataset with a sparsity of 93.69%. The experiment was conducted in two rounds. The data sparsity of the training set is lower than 93.69% in the first round and higher than 93.69% in the second round.
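The reported sparsity can be checked from the dataset dimensions given earlier, assuming that "100k" means exactly 100,000 ratings and that sparsity is defined as the fraction of empty cells in the user-item matrix. Both are assumptions made for this sketch.

```python
def sparsity(n_ratings, n_users, n_items):
    """Fraction of user-item cells that hold no rating."""
    return 1.0 - n_ratings / (n_users * n_items)


# 943 users and 1682 items as stated in the text; 100,000 ratings assumed.
value = sparsity(100_000, 943, 1682)
```

Under these assumptions the result is approximately 0.93695, which matches the reported 93.69% when truncated to two decimal places.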
Results of the first round of experiments: We use the Pearson correlation coefficient method to calculate the similarity before and after filling and dimensionality reduction of the training set data. The change curve of the MAE of the system with the value of the neighboring user
The system MAE changes with the number of neighbors.
After the training set data is filled and dimensionally reduced, the system MAE is slightly larger than before filling and dimensionality reduction when the number of neighboring users is less than 200 or greater than 700 [18]. The MAE of the improved system is lower when the number of neighbors is between 200 and 600. This shows that the improved results are more reliable when an appropriate number of neighbors is chosen. In addition, the training set data in our experiment is processed through several rounds of filling and dimensionality reduction, which replace the 1345 entrepreneurial projects in the training set with 71 new 'features'. This greatly reduces the time consumed by user modeling and increases the scalability of the system.
Fig. 8 shows how the system MAE changes with the number of neighboring users when, after the training set data is filled and dimensionally reduced, the similarity is calculated by the Pearson correlation coefficient method. The MAE of the system is greatly reduced, and the recommendation accuracy increased, when we predict the target item's rating from the neighbor users.
The system MAE changes with the number of neighbors.
System MAE versus constant k.
When the number of neighbors is 250, the MAE of the system is the lowest. The best result obtained with the Pearson correlation coefficient method is compared with the result of calculating the similarity by Eq. (8). The results are shown in Fig. 9.
Compared with the Pearson correlation coefficient method, the improved similarity calculation method performs considerably better. When k is close to 1, the improved method is closer to cosine similarity [19]. Pearson similarity is an improved version of cosine similarity that mitigates the inaccurate similarity calculation caused by different rating scales. It can be seen from the figure that the advantage of the Pearson correlation coefficient method is not obvious when
First, under the Pearson correlation coefficient method or the modified cosine similarity, the smaller the angle between two vectors and the closer their moduli, the more similar the two vectors are. When k approaches 1, the criterion by which the improved similarity calculation method judges the similarity of two vectors changes: it hardly takes the modulus of the vectors into account. This paper uses Eq. (7) to adjust the prediction results when the target item's rating is predicted, which is the optimization for users whose rating scales differ. From Eq. (9), the similarity value only determines the weight given to the corresponding neighbor user's predicted score for the target item in our experiments, and the final prediction is the sum of these weighted results.
After filling and dimensionality reduction, we use the Pearson correlation coefficient method to compute the similarity. The optimal number of neighbors here is 300. The purpose is to select the optimal number of neighbors for the Pearson correlation coefficient method. The MAE of the second-round system was higher overall than that of the first-round system, mainly because the sparsity of the training set data in the second round is higher than in the first round, so the user model built from the training set produces relatively large errors. We can therefore conclude that, with other conditions unchanged, the denser the training set data, the more reliable the resulting model.
In this paper, for the intelligent recommendation of innovative and entrepreneurial projects based on collaborative filtering and sustainable PSCM, an improved collaborative filtering algorithm is used to address the weakness of traditional similarity measures on sparse data. The method improves the similarity calculation and combines filling and dimensionality reduction of the rating matrix with innovation and sustainability in PSCM to improve recommendation accuracy. In this way, the time spent searching for target users is reduced and the scalability of the system is improved. However, owing to time constraints and problems with the experimental data, the research in this paper still has certain limitations. Further research will be carried out in this area in order to provide better services for the intelligent recommendation of innovative and entrepreneurial projects.
Data availability
The data used to support the findings of this study are included within the article.
Funding
This research study is sponsored by the "Thirteenth Five-Year Plan" Project of Educational Science of Jiangsu Province. Project name of collaborative education project of industry university cooperation of the Ministry of Education: Exploration on the path of talent training serving the region under the mode of collaborative education and integration of industry and education. The project numbers are 202102364002 and X-b/2018/02. The paper is published for the conclusion of the project. Thank the project for supporting this article.
Footnotes
Conflict of interest
The authors declare no conflicts of interest.
