Abstract
Because the big data generated by major operational systems is large in volume and complex in structure, the traditional data retrieval method built on a cloud computing distributed multi-layer architecture struggles to meet big data retrieval needs. Related research therefore has to be organized around the structure of complex big data, especially unstructured big data, in order to shape a new data cluster infrastructure. The cluster environment needs to provide stable retrieval of unstructured data through a cloud-centric method. The system treats an unstructured big data retrieval framework as an unstructured database and uses it as the data retrieval engine, which improves the unstructured big data retrieval service under cloud computing; the key code for unstructured big data retrieval is also given. Finally, the system is applied in an experimental setting: an experimental unstructured big data integration system is built that provides unified retrieval across large-scale, heterogeneous business functions and business data, so that users can quickly retrieve information from many heterogeneous business systems and massive data. Building on search engine technology, techniques from text mining, natural language processing, information retrieval, and related fields are used to further improve the precision and recall of full-text search. Applying this technology meets the need for unified retrieval of large-scale, heterogeneous business data and, at the same time, the rapid-response requirements of large-scale data retrieval requests. The experimental results show that the designed system achieves high precision and low retrieval time when retrieving unstructured big data under cloud computing, and can therefore provide stable retrieval of unstructured cloud data.
Introduction
With the rapid development of cloud computing technology, the volume of data held on different types of servers has exploded, and the market urgently needs advanced large-scale cloud computing data storage and retrieval technology [1]. Research on and application of big data retrieval under cloud computing therefore matches the market's development needs. More than 80% of the massive data currently held under cloud computing is unstructured. However, current information retrieval relies mainly on establishing structured association rules, which cannot meet the needs of cloud information retrieval. Finding effective methods to quickly retrieve valuable information from this data has become a hot topic among scholars [2].
Literature [3] proposed a heat-sensitive ranking algorithm for unstructured data retrieval, but the method is highly sensitive to the attribute characteristics of the data, which limits its applicability. Literature [4] analyzed a full-text file retrieval solution based on Lucene, which can quickly and effectively analyze data of different structures but suffers from high energy consumption and low retrieval efficiency. The distributed index method analyzed in [5] uses multi-node backup to implement system retrieval. However, when the backup nodes fail at the same time, the index on the failed nodes cannot be restored, which lowers retrieval accuracy. Literature [6] proposed a local indexing method based on indexing services, which directly supports the service method, keeps the search closely integrated with the index cluster, and greatly improves the fault tolerance of the method, but also increases the method.
To address these problems, this paper analyzes the characteristics of storing unstructured big data on the Hadoop distributed multi-layer architecture under cloud computing, and treats the unstructured big data indexing framework as an unstructured database that provides an unstructured big data retrieval service under cloud computing. The paper applies the system to an experimental big data retrieval system, implementing an experimental enterprise-level distributed full-text retrieval system with the “searching” search engine at its core, and ensures that the full-text retrieval system is high-performance, scalable, and maintainable. The experimental results show that the designed system achieves high precision and low retrieval time when retrieving unstructured big data under cloud computing, and can provide stable retrieval of unstructured big data [7].
Related technology introduction
Cloud computing and big data technology
Cloud computing is another major change in electronic information technology after the personal computer and the Internet. It aggregates various resources through virtualization, supplies resources on demand over the network, and provides rich application services through specialization. This new way of organizing, distributing, and using computing resources helps allocate computing resources rationally, improve utilization [8], reduce costs, promote energy conservation and emission reduction, and achieve green computing. Big data is a collection of information technologies covering data collection, data management, computational processing, data analysis, and data presentation [13–20].
Distributed technology
Distributed technology mainly covers two aspects: distributed computing and distributed storage. The “search for” search engine performs distributed full-text search over a distributed storage data source, so distributed technology needs to be studied. Elasticsearch is a real-time distributed search and analysis engine that supports cloud services. It is built on the Apache Lucene search library and provides full-text search, multi-language support, a specialized query language, support for geolocation services, context-based search suggestions, auto-completion, and the ability to search for fragments.

Big data processing technology.

Distributed system structure.
Search technology
Search technology checks a request for text or materials and finds the required information and data in network information, literature, and other information collections. To search quickly, it is usually necessary to index the keywords in the data. Full-text search has developed rapidly into a particularly efficient search technology because of its thoroughness and originality.
Full-text search technology
Full-text search technology is a search technology that takes text, audio, pictures, video, and other data as its main processing objects and retrieves information by its content rather than its appearance. Simple full-text search can be implemented with string matching. Advanced full-text search technology can support large-scale software for the comprehensive management of unstructured data such as large text, audio, pictures, and video. As research on full-text search technology deepens and its applications spread, the full-text search system has gradually become a model for efficient enterprise management information systems.
Full-text search platform
The full-text search platform is a service system built on full-text search technology, mainly used to provide full-text search services. Fig. 3 shows the structure of a full-text search platform, in which the full-text search engine is the key part and core. As the figure shows, the full-text search engine mainly comprises three major modules: text analysis, index creation, and index query [9]. First, information is extracted from documents in various formats and from database data; a text analyzer is then selected according to the file type, performs text analysis, creates the index, and generates the index database. This is the index creation module. When the user enters a query condition, the system first analyzes the query text, then looks it up in the index database, and finally returns the results to the user. In addition, a well-designed full-text search system should be easy to extend and maintain. It should support operations such as Chinese text processing, logging, and Word document processing and downloading, so alongside high search efficiency it should have an open framework and architecture.

Full-text search platform structure.
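The three modules described for the full-text search engine (text analysis, index creation, index query) can be illustrated with a minimal Python sketch. It is not the platform's implementation: the analyzer registry `ANALYZERS`, the two analyzers, and the sample files are hypothetical names chosen for illustration, and the index is a plain in-memory dictionary rather than an index database.

```python
import re

def analyze_plain(text):
    """Default analyzer: lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def analyze_html(text):
    """Strip tags first, then reuse the plain analyzer."""
    return analyze_plain(re.sub(r"<[^>]+>", " ", text))

# Hypothetical registry: the analyzer is chosen by file type.
ANALYZERS = {"txt": analyze_plain, "html": analyze_html}

def create_index(files):
    """files: iterable of (name, file_type, raw_text). Returns term -> set of file names."""
    index = {}
    for name, file_type, raw_text in files:
        analyzer = ANALYZERS.get(file_type, analyze_plain)
        for term in analyzer(raw_text):
            index.setdefault(term, set()).add(name)
    return index

def query_index(index, query):
    """Analyze the query text, then intersect the file sets of its terms."""
    terms = analyze_plain(query)
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

files = [
    ("a.txt", "txt", "Full-text search platform"),
    ("b.html", "html", "<p>Search <b>engine</b> platform</p>"),
]
index = create_index(files)
print(sorted(query_index(index, "search platform")))  # ['a.txt', 'b.html']
```

Running the query text through an analyzer before the lookup mirrors the order given above: text analysis first, then the index query.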
Single sign-on (SSO) is one of the more popular solutions for enterprise business integration. With SSO, users of multiple applications only need to log in once to access all trusted applications. Single sign-on can be applied to the presentation of search results, since the information in the search results comes from each business system.
Natural language processing technology
Since text data and user search input are basically composed of natural language, natural language processing technology is an important auxiliary part of search applications. Natural language processing techniques include word segmentation, part-of-speech tagging, syntactic analysis, and named entity recognition. Applying these techniques helps the search system understand the semantics of both the data and the user's query, further improving search indicators such as precision and recall.
Elasticsearch
Elasticsearch (ES) is an open source, distributed, RESTful search engine built on Lucene. In cloud computing, ES provides real-time search that is stable, reliable, and fast, and it supports indexing data over HTTP. ES builds its index with an inverted indexing mechanism. An inverted index differs from a forward index: a forward index is organized by document, whereas an inverted index is organized by term, and the stored index consists of index entries made up of key-value pairs. The ES index data structure includes terms, fields, documents, and segments, as shown in Fig. 4. Term: the smallest index unit, which directly represents a keyword together with its positions and number of occurrences in the source document. Field: an associated tuple consisting of a field name and a field value. The field name is a string and the field value is a term. Document: includes all field information. Segment: contains several documents; several segments form a sub-index or index.

Data structure diagram of the ES index.
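To make the difference between a document-oriented and a term-oriented index concrete, the following Python sketch builds a small inverted index that records, for each term, the documents it occurs in and its positions. It illustrates the general technique only and is not Elasticsearch or Lucene code; the whitespace tokenizer and the two sample documents are assumptions made for the example.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer (a stand-in for a real analyzer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.

    `docs` itself is the forward view (doc_id -> text); the inverted view
    turns it around so that a lookup starts from the term.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, term):
    """Return {doc_id: occurrence_count} for a single term."""
    postings = index.get(term.lower(), {})
    return {doc_id: len(positions) for doc_id, positions in postings.items()}

docs = {
    1: "cloud search engine",
    2: "distributed search and cloud storage search",
}
index = build_inverted_index(docs)
print(search(index, "search"))  # {1: 1, 2: 2}
```

Each entry pairs a key (the term) with a value (its positions per document), which corresponds to the key-value index entries and the position and occurrence information described above.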
Unstructured analysis of Hadoop architecture
Many types of big data are stored under cloud computing, and all of them exist in unstructured retrieval relationships. The overall architecture is shown in Fig. 5. The search platform relies on cloud computing. The cloud platform adopts the Hadoop distributed multi-layer architecture to store unstructured big data and shapes the basic environment of the cluster. Stable retrieval of unstructured big data is achieved through a cloud-plus-terminal arrangement.

Unstructured platform of the Hadoop architecture.
The current unstructured big data retrieval system under cloud computing collects massive data over the Internet and processes and aggregates unstructured big data autonomously through the background system. The platform can also perform statistical analysis of hotspot information. After obtaining a large amount of unstructured data, the system provides the storage service for unstructured big data retrieval according to the defined retrieval business strategy, in cooperation with the retrieval engine.
The overall system is designed with a multi-layer architecture and is built on the basic services of that architecture. Under the multi-layer architecture, stable storage of unstructured big data is achieved through the cloud-plus-terminal arrangement. The unstructured characteristics of the data stored at each layer are as follows:
Unstructured data of the user retrieval terminal. The terminal layer serves retrieval users, providing big data retrieval through the web portal application and the mobile terminal app. Because user information is diverse, the data at this layer cannot form a stable structure.
Unstructured data of the business application layer. The business application layer provides the system's various applications; the system offers big data support as services, and these services are deployed on the cloud platform. The user retrieval terminal application completes stable unstructured big data retrieval under cloud computing by accessing the business application layer services in the cloud platform. However, because applications grow ever more diverse in type and in development process, the data at this layer is also largely unstructured.
Unstructured data of the platform service layer. The platform service layer provides services to the business application layer and the basic resource layer, including media processing services, scheduling, and process engine services. It contains the key big data retrieval engines. However, as the number of retrieval models keeps growing, the data at this layer is also largely unstructured.
Unstructured data of the basic resource layer. The basic resource service layer is the basic device layer of the cloud platform. Through the computing resource service, the storage resource service, and the network resource service, the cloud platform is controlled as a logical resource pool. The basic resource services in the search platform mainly refer to the basic resources of the cloud platform, including cloud storage, virtual computing resources, the operating system, and other basic components. As basic equipment keeps increasing, device data cannot form stable structural features and therefore takes on unstructured features.
Realization of unstructured retrieval of big data under cloud computing
Under cloud computing, unstructured retrieval of differentiated big data is a complex process. The analysis in the preceding section shows that massive unstructured data is stored in the platform. Traditional structured indexing cannot meet the stable retrieval requirements of unstructured data. By constructing a distributed index system for unstructured big data, this paper meets the stable retrieval requirements of unstructured big data under cloud computing [10].
Designing an index framework for unstructured data
By shaping a distributed unstructured big data indexing framework, a structurally similar framework can be built for unstructured data. The design framework is shown in Fig. 6. The distributed indexing framework includes an index cluster, a retrieval cluster, and a distributed file system.

Distributed unstructured data indexing framework.
Within the unstructured data framework, the index cluster builds the index for distributed unstructured big data retrieval. The index cluster adopts a master-slave structure consisting of one index master node and multiple index nodes. With this structure, an indexing task can be split across index nodes so that they build indexes in parallel, improving the system's performance on unstructured data under cloud computing. The index cluster supports both bulk and incremental indexing modes. After the system saves unstructured data, it passes an incremental index task message to the index master node. The index master node applies the index sharding scheme to the data characteristics and content in the message, determines which index shard the data belongs to, and stores the message in the distributed index message queue [11].
Index nodes for data of different structures are independent of each other and collect messages from the message queue. If a collected message belongs to the collecting index node, the node indexes it; otherwise the message is passed on to the index node it belongs to, which then processes it. If the responsible index node cannot operate normally, the index master node takes over the message and assigns a new index node to it. Index clustering increases the overall throughput of the system.
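The flow of an incremental index message (shard assignment by the master, queuing, delivery to the owning index node, and reassignment when that node has failed) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the sharding rule (a CRC32 hash of the document id), the node names, and the in-process `deque` standing in for the distributed message queue are all hypothetical, and routing is done centrally instead of having nodes pass messages to one another.

```python
import zlib
from collections import deque

class IndexMaster:
    """Assigns each incremental-index message to a shard and tracks shard owners."""

    def __init__(self, node_names):
        self.nodes = {name: True for name in node_names}  # node name -> alive?
        self.owners = list(node_names)                     # shard number -> node name
        self.queue = deque()                               # stand-in for the message queue

    def shard_for(self, doc_id):
        # Hypothetical sharding rule: stable hash of the document id.
        return zlib.crc32(doc_id.encode("utf-8")) % len(self.owners)

    def submit(self, doc_id, text):
        message = {"doc_id": doc_id, "text": text, "shard": self.shard_for(doc_id)}
        self.queue.append(message)

    def owner_of(self, shard):
        owner = self.owners[shard]
        if not self.nodes[owner]:
            # Failover: hand the shard to the first node that is still alive.
            owner = next(name for name, alive in self.nodes.items() if alive)
            self.owners[shard] = owner
        return owner

def drain(master):
    """Deliver every queued message to the node that owns its shard."""
    indexed = {name: [] for name in master.nodes}
    while master.queue:
        message = master.queue.popleft()
        indexed[master.owner_of(message["shard"])].append(message["doc_id"])
    return indexed

master = IndexMaster(["node-a", "node-b", "node-c"])
for i in range(6):
    master.submit(f"doc-{i}", "some text")
master.nodes["node-b"] = False  # simulate a failed index node
result = drain(master)
print(result)
```

After the simulated failure, no message is delivered to `node-b`, and every queued message is still indexed by one of the remaining nodes.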
Design of the retrieval cluster for unstructured data
In the unstructured data framework, the retrieval cluster contains the retrieval master node, the retrieval nodes, and the retrieval client. Through its master-slave structure, the retrieval cluster deploys index files efficiently to the different retrieval nodes, improving the efficiency of the data retrieval service. The retrieval master node obtains the load status of the retrieval nodes across the cluster through the master-slave structure. After the user sends a retrieval request through the retrieval client, the retrieval master node builds a node list according to the load of the retrieval nodes and returns it to the retrieval client, which then searches the nodes on the list. The user thus requests a search through the retrieval client and obtains the corresponding results.
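A minimal Python sketch of load-based node selection is given below. It assumes, beyond what the text states, that each index shard is replicated on several retrieval nodes and that the master picks the least-loaded replica per shard; the node names, load values, and documents are made up for illustration, and the client's "search" is an in-process substring match standing in for a network call.

```python
def pick_nodes(replicas, loads):
    """Master side: for each shard, choose the least-loaded node holding a replica."""
    return {shard: min(nodes, key=loads.get) for shard, nodes in replicas.items()}

def client_search(query, node_list, shard_docs):
    """Client side: query the chosen node for every shard and merge the hits."""
    hits = []
    for shard, node in sorted(node_list.items()):
        # In a real deployment this would be a request sent to `node`.
        hits.extend((node, doc) for doc in shard_docs[shard] if query in doc)
    return hits

# Hypothetical cluster state.
replicas = {"shard-0": ["r1", "r2"], "shard-1": ["r2", "r3"]}
loads = {"r1": 0.2, "r2": 0.7, "r3": 0.4}
shard_docs = {
    "shard-0": ["cloud report", "budget"],
    "shard-1": ["search log", "cloud backup"],
}

node_list = pick_nodes(replicas, loads)
print(node_list)  # {'shard-0': 'r1', 'shard-1': 'r3'}
print(client_search("cloud", node_list, shard_docs))
# [('r1', 'cloud report'), ('r3', 'cloud backup')]
```

The heavily loaded node `r2` holds replicas of both shards but is never selected, which is the effect the load-aware node list is meant to achieve.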
Big data unstructured retrieval code under cloud computing
At present, SQL full-text search technology is used to implement unstructured big data retrieval under cloud computing [12]. The detailed process is as follows. Start the full-text search service of SQL Server and set the default language of the database server to 2052 (Chinese). Enable full-text search by running the SQL statement EXECUTE sp_fulltext_database 'enable'. Select “Define full-text index” under “Full-text index” to open the full-text indexing wizard, then select the fields to be full-text indexed and the full-text catalog. Restart SQL Server, then query the configured table with the search predicates CONTAINS and FREETEXT. CONTAINS searches the columns of a table for words or phrases, and for words similar to a given word; FREETEXT searches all columns or specified columns of a table for free text and returns the rows that match it.
To find the rows of the Doc table whose file content contains “terror”, the SQL statement is:
SELECT * FROM Doc WHERE CONTAINS(DocumentContent, 'terror')
In the interface for retrieving unstructured data, entering a keyword and clicking the “Search” button displays the name and type of each file whose content contains the keyword. The main code that implements keyword retrieval in documents is:
public DataTable DataSearch(string keyword, string strConn) {
    // Requires System.Data and System.Data.SqlClient. The keyword is bound as a parameter.
    string sql = "SELECT * FROM Doc WHERE CONTAINS(DocumentContent, @keyword)";
    SqlDataAdapter da = new SqlDataAdapter(sql, strConn);
    da.SelectCommand.Parameters.AddWithValue("@keyword", keyword);
    DataSet ds = new DataSet();
    da.Fill(ds);
    return ds.Tables[0];
}
To open a document, its full content is read from the database. In the .NET environment, the document content can be displayed in the browser by setting the Response.ContentType property and calling the Response.BinaryWrite method. The key code for displaying the contents of a Word document or an Excel document is:
// Look up the type and content of the document selected in the grid.
string sql = "SELECT DocumentType, DocumentContent FROM Doc WHERE id = @id";
SqlCommand cmd = new SqlCommand(sql, cn);
cmd.Parameters.AddWithValue("@id", GridView1.SelectedRow.Cells[0].Text);
SqlDataReader dr = cmd.ExecuteReader();
if (dr.Read())
{
    string documentType = dr.GetValue(0).ToString();
    byte[] documentData = (byte[])dr.GetValue(1);
    // Choose the MIME type from the stored document type.
    if (documentType == "doc")
        Response.ContentType = "application/msword";
    else if (documentType == "xls")
        Response.ContentType = "application/vnd.ms-excel";
    else
        Response.ContentType = "application/octet-stream";
    Response.Flush();
    Response.BinaryWrite(documentData);
}
Experiment analysis
The experiment uses the unstructured NUS data set in a cloud computing setting, and the visual word features extracted from the low-dimensional features in the data set are used as the test set. One million features are randomly drawn from the unstructured data set as the training sample. Two indicators, retrieval time and precision, are used to compare the proposed system with an average-distribution retrieval system.
Precision = number of relevant results in the search results / total number of search results
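As a worked instance of the precision definition, the following Python sketch computes it for one query. The result identifiers are invented for the example, and returning 0.0 for an empty result list is a convention assumed here, not something the paper specifies.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved results that are relevant (0.0 if nothing was retrieved)."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0
    relevant = set(relevant)
    return sum(1 for item in retrieved if item in relevant) / len(retrieved)

retrieved = ["d1", "d2", "d3", "d4"]   # what the system returned
relevant = {"d1", "d3", "d9"}          # what is actually relevant
print(precision(retrieved, relevant))  # 2 relevant out of 4 retrieved -> 0.5
```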
To obtain more comprehensive results, 10 groups of experiments were run, with 100,000, 200,000, 300,000, 400,000, 500,000, 600,000, 700,000, 800,000, 900,000, and 1,000,000 files respectively. The average search time and the average precision are calculated to evaluate each system.
Precision as the evaluation index
The experiment was first carried out with precision as the evaluation index. In each experimental system, different search targets were entered in turn, with the maximum search time set to 60. The precision for each search target was recorded and averaged, and the precision of the systems was then compared. Fig. 7 compares the precision of the proposed system with that of the average-distribution system.

Comparison of the precision of the proposed system and the average-distribution system.
Fig. 7 shows that the precision of the proposed method is significantly better than that of the average-distribution retrieval system, indicating that the method has a strong advantage in unstructured big data retrieval.
The experiment then evaluates the retrieval systems by average retrieval time. In the two experimental systems, different retrieval targets were entered, and multiple experiments were run with different numbers of retrieval results. The average search time for each setting was calculated in each system, and the retrieval times of the two systems were compared. In the proposed system and the average-distribution system, 10 experimental targets were searched, the retrieval time of each target in each system was recorded, and the average retrieval time was calculated, as shown in Fig. 8.

Comparison of retrieval time between the proposed system and the average-distribution system.
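The averaging procedure used for the retrieval-time comparison (time each target, then take the mean) can be sketched in Python as follows. The corpus, the linear search function, and the targets are toy stand-ins, not the paper's data set or systems, so the measured times say nothing about the reported results.

```python
import time
from statistics import mean

def average_retrieval_time(search, targets):
    """Run `search` once per target and return the mean wall-clock time in seconds."""
    timings = []
    for target in targets:
        start = time.perf_counter()
        search(target)
        timings.append(time.perf_counter() - start)
    return mean(timings)

# Toy corpus and search function, used only to exercise the timing helper.
corpus = [f"document {i}" for i in range(10_000)]

def linear_search(target):
    return [doc for doc in corpus if target in doc]

targets = [f"document {i}" for i in range(10)]  # 10 search targets, as in the experiment
avg = average_retrieval_time(linear_search, targets)
print(f"average retrieval time: {avg:.6f} s")
```

To compare two systems in this way, the same list of targets would be passed to each system's search function and the two averages compared.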
Fig. 8 shows that the average-distribution retrieval system has a higher average search time than the proposed system. Together with the precision results, this shows that the proposed retrieval system outperforms the traditional average-distribution system on both evaluation indexes, query time and precision.
As the experimental comparison above shows, the proposed retrieval system for unstructured data outperforms the traditional average-distribution system. The system provides stable retrieval of unstructured big data under cloud computing and has high application value.
Based on cloud computing, this paper uses the Hadoop distributed multi-layer architecture to store unstructured big data and to shape the basic environment of the cluster. Stable retrieval of unstructured big data is achieved through a cloud-plus-terminal arrangement. The unstructured big data indexing framework is treated as an unstructured database and used as the data retrieval engine, providing an unstructured big data retrieval service under cloud computing. The distributed index framework includes an index cluster, a retrieval cluster, and a distributed file system. The key code for retrieving unstructured big data with SQL Server 2008 full-text search technology is given. The experimental results show that the designed system achieves high precision and low retrieval time when retrieving unstructured big data under cloud computing, and can provide stable retrieval of unstructured big data.
Acknowledgments
The work is partially supported by (1) the Fundamental Research Funds for the Central Universities, Research on Industry-oriented Service Data Processing in Big Data Environment (Grant No. ZY20160106), and (2) the National Key Research and Development Project, China (Grant No. 2018YFC1503805).
