Abstract
Intelligent scheduling of hierarchical data migration is a key issue in data management for mass media content platforms, and discovering the usage patterns of content objects is the basis for scheduling data migration. We use QPop (query popularity), the result of reducing the dimensionality of media content usage logs, as the feature of content objects for discovering usage patterns. On this basis, an improved QPop incremental clustering algorithm that adds time-domain segmentation is proposed to improve mining performance. We adopt the standard C-Means algorithm as the clustering core and conduct mining experiments, with and without segmentation, on QPop increments collected in a practical application environment. The results show that the improved algorithm is robust in cluster cohesion and other indicators and performs slightly better than the basic model.
Introduction
In recent years, with the development and maturity of key technologies such as multimedia data compression, content-based retrieval, high-speed Internet, and mass storage (SAN, NAS, iSCSI, and others), mass media content management systems have become widely used by film and television media, content providers (including mobile multimedia value-added service providers), and cultural communication agencies. A key issue facing mass media content platforms is effective data migration. Through HSM (hierarchical storage management) and VSM (virtual storage) technologies, mass storage has achieved data-oriented and content-oriented tiering at the architecture level, consistent with access-oriented and application-oriented requirements. However, HSM also brings the problem of migrating media data between different storage levels and storage modes (hierarchical data migration).
Many mass storage solutions in the industry currently lack an intelligent and effective data migration scheduling mechanism. As a result, in actual application environments, frequent and unpredictable pushing and pulling of multimedia data greatly affects system availability and robustness. Therefore, using intelligent information processing methods such as feature extraction and knowledge discovery to establish an intelligent data migration scheduling mechanism based on the application characteristics of content objects is necessary to guarantee the normal operation of mass media content management systems, as shown in Fig. 1.
With the rapid development of Internet technology, we-media on the Internet has grown explosively in recent years. It has spread virally into all walks of life, giving everyone a way to present themselves and understand others [1]. Sina Weibo, one of the large-scale we-media platforms in China, has a huge user base, and these users generate a large amount of information related to personal life or social phenomena [2]. With the development and maturity of the Web 2.0 era, in addition to conventional data suitable for data mining, microblogs also contain a large amount of data with latitude and longitude attributes [3]. Such spatial location data allow mining results to be displayed intuitively through the front-end APIs of map services, so that the spatial distribution rules of topics related to personal life, social phenomena, and people or goods of interest can be discovered, which makes this a valuable research direction.
Spatial data mining and knowledge discovery (SDMKD) is a branch of data mining and knowledge discovery. Through a series of processing steps on spatial data sets, SDMKD obtains spatial feature rules, spatial clustering rules, and spatial distribution rules, which can directly display the information of spatial entities. Academician Li Deren was the first to pay attention to the field of spatial data mining. He first proposed the theory of spatial data mining and knowledge discovery at an international geographic information system academic conference held at the end of the 20th century and proposed its theoretical framework [4–7]. Existing spatial databases contain huge amounts of information, including shallow information such as mountain heights and river widths that can be found with GIS query tools [8]; there is also much deep-seated information, such as spatial classification rules and spatial deviation [9–12]. Such information is difficult to obtain with GIS queries and can only be found through computation or mining.
The rapid rise of cloud computing provides an excellent solution for the complex, heavily iterative computation involved in clustering problems in machine learning. Among the many distributed computing frameworks, the open-source framework Hadoop is favored by many enterprises and research institutions for its stable performance and low cost. Compared with traditional parallel frameworks, Hadoop has the advantages of high efficiency and low cost. Based on this platform, Apache has developed Mahout, a computing framework for machine learning algorithms. This paper uses Mahout and Hadoop as the platform: HDFS in the Hadoop ecosystem is the data storage system, and MapReduce is the distributed computing framework. K-means clustering optimized by the canopy algorithm is selected as the analysis algorithm, and the Mahout data mining framework is used on a Hadoop cluster to run the parallel clustering algorithm [13]. Finally, using visual analysis, the clusters with themes are displayed on a map, and the information contained in the microblog data is analyzed in this more intuitive way, in order to study the society- and life-related information hidden in online public opinion and to support the harmonious and stable development of society.

Fig. 1. Mass data storage technology framework.
Because the data structure of MongoDB resembles a loosely structured JSON object, is easy to operate, and performs well in distributed storage, all the data crawled by the Scrapy project are stored in a MongoDB database. Based on these data, preprocessing, clustering analysis, similarity evaluation, and visual analysis are carried out. The specific methods are as follows:
1) Microblog text preprocessing. In this paper, the microblog user is regarded as a spatial entity, and the microblog text and its longitude and latitude are taken as its attributes. The IK word segmenter is used with a stop word list and an EXT table (an additional custom word base) added. The weighted TF-IDF algorithm is used to weight the segmentation results of the microblog text. Java IO streams are then used to read the data and write the processed output. Finally, we obtained a complete microblog data set that includes 110,000 microblogs with longitudes and latitudes. Because the large number of emoticons in microblog text would affect the clustering results, emoticons are replaced during crawling with bracketed names; for example, the "smile" emoticon is written as "[smile]". The text also contains special symbols such as "#" and "@", which would likewise affect the results, so they are removed with regular expressions.
2) Clustering analysis based on the Hadoop and Mahout parallel framework. Chinese text clustering of the microblog text attributes is carried out on the Hadoop platform, and the clustering quality is optimized by adding or deleting stop words, adding new words, and removing high-weight words. Finally, hot information in the microblog data set, such as hot people, goods, and topics, is discovered together with the keywords carried by each hot item (for example, Paris is accompanied by the keywords taste, art, and travel).
3) User similarity evaluation based on text similarity. The similarity between users is usually used for friend recommendation, and traditional user similarity measures are usually calculated from the personal information and social network of microblog users, such as user interests and follow lists [14, 15]. In this paper, we consider both the similarity of users' microblog texts and their spatial locations to determine whether users are similar. Similar microblog texts indicate that two users have similar interests, while users in nearby geographical locations may share regional and cultural backgrounds. In a good clustering result, the topics of the samples in the same cluster are similar: the closer two text vectors are after vectorization, the higher their text similarity and, naturally, their topic similarity. By extracting similar text vectors and using the spatial location carried in the microblogs to evaluate the correlation of regional background, users with similar regions and similar topics of interest can be obtained.
4) Visual analysis using data visualization methods from spatial data mining. The results of the clustering and similarity evaluation steps are visualized with visualization software or an open API, in order to intuitively find the spatial distribution rules of the keywords or key topics corresponding to each cluster and the general content of the keywords.
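The preprocessing and similarity evaluation described above can be illustrated with a minimal Python sketch. It is not the system used in this paper: whitespace tokenization and raw term counts stand in for IK segmentation and weighted TF-IDF, the haversine formula is one possible measure of spatial closeness, and the function names and the thresholds `min_cos` and `max_km` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed cleaning rules: drop bracketed emoticon placeholders such as
# "[smile]" and the special symbols "#" and "@".
EMOTICON = re.compile(r"\[[^\[\]]*\]")
SYMBOLS = re.compile(r"[#@]")

def clean(text):
    """Remove emoticon placeholders and special symbols, then squeeze spaces."""
    text = EMOTICON.sub(" ", text)
    text = SYMBOLS.sub(" ", text)
    return " ".join(text.split())

def term_weights(text):
    """Stand-in for segmentation + TF-IDF: lowercase whitespace tokens, raw counts."""
    return Counter(clean(text).lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def similar_users(u, v, min_cos=0.5, max_km=50.0):
    """Two users count as similar when their texts and their locations are both close."""
    text_sim = cosine(term_weights(u["text"]), term_weights(v["text"]))
    dist = haversine_km(u["lat"], u["lon"], v["lat"], v["lon"])
    return text_sim >= min_cos and dist <= max_km
```

Requiring both conditions mirrors the idea that similar interests and a shared regional background together indicate similar users; a weighted combination of the two scores would be an equally plausible design.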
Since the end of the 20th century, the migration of massive media data has received widespread attention in international information processing and computer science circles. In 1978, IBM engineer S. Todd discussed data migration between geographically distributed database systems that support authority management, which was the earliest research on data migration. In 1994, the IEEE Storage System Standards Working Group gave the basic architecture and key technology models of mass storage [16]. Since then, research on data migration has mainly been based on distributed mass storage environments represented by SAN, NAS, and iSCSI.
This research is roughly divided into three aspects: first, case studies of data migration models based on actual data migration projects; second, research on data migration scheduling algorithms based on control theory; and third, strategies that combine data migration with related technologies. Recently, Khuller et al. [17] proposed a polynomial-based time-domain analysis model of data migration and established a scheduling method that achieves a worst-case bound of 9.5. Gandhi et al. established a 5.06-approximation scheduling algorithm for the open shop migration problem, with the migration completion time of all devices as the cost function, which improves on the 9.0 of Kim (2003) and the 5.83 of Queyranne (2002).
Most early domestic research on mass storage focused on the architecture level. Dedicated data migration research began in the 1990s, driven by the problem of heterogeneous and cross-platform data migration that appeared in large numbers in application system development and system integration, and research on strategies and algorithms under a unified data migration model followed. The digitization and networking of the radio and television industry, together with the development of multimedia services in telecommunications and other industries, have made the multimedia content management platform an important support platform for the information service industry, and the data migration involved has become one of the important topics in the field of multimedia information processing. Research on intelligent data migration in the integrated content business environment is currently more active [18]. There are also many case studies, such as research on the automatic migration strategy of TSM and DIVA based on the IBM platform [19].
With breakthroughs in the foundations of intelligent science at home and abroad, establishing an intelligent distributed media data migration scheduling system that draws on the high-level semantic characteristics of media content and user behavior characteristics, and that is based on ontology semantic description and a multi-agent behavior structure, has become a research trend in the field of data migration.

Fig. 2. Intelligent data migration based on application pattern discovery.
Intelligent hierarchical data migration based on application pattern clustering
Application mode is a concept shared by the fields of human-computer interaction and artificial intelligence. It characterizes the pattern in which users access or invoke resources or functions in a system. In a mass media content environment, the application mode refers to the regular characteristics of a large number of accesses, such as query, preview, flipping, and downloading, to a specific set of media content objects, as shown in Fig. 2.
The basic storage levels of a mass media content platform include at least three types: online, near-line, and offline, with disk arrays, tape libraries, and optical disk libraries as the physical media. The disk array provides high-speed data retrieval and a seamless interface with application systems such as non-linear editing and hard disk broadcasting, which makes it suitable for storing online resources. However, disk arrays are expensive, so data streaming tapes, which have lower read and write speeds but relatively low prices, are needed as near-line storage devices. At the same time, for material and film resources whose media data will not change, low-cost, large-capacity optical disk media can be used for read-only storage. Therefore, during system operation, a large amount of media data is dumped and copied between different storage modes; this is the hierarchical data migration of multimedia content. Hierarchical data migration of massive media content consumes huge resources and affects system stability, so migration must be made as efficient as possible to improve the performance of the media asset platform.
Exploring the group application characteristics of media content objects to form application mode knowledge, and formulating group-level, a priori migration scheduling strategies for the groups of media content objects corresponding to different application modes, can effectively avoid the blindness and disorder of data migration in conventional media asset application platforms and improve the quality of system data migration.
Application pattern clustering based on QPop features
We use the QPop (query popularity) feature as an important feature for describing the application mode of content objects [20]. QPop features are extracted by reducing the log data of the various access operations on a content object to a finite-dimensional vector (or scalar). During the operation of the mass storage system, a QPop growth algorithm must be established according to the characteristics of the application environment to quantify the impact of each access operation on the application attention of content objects; the dynamic characteristics of QPop are reflected through QPop growth. The intelligent data migration scheduling system performs clustering knowledge discovery on the QPop features of a large number of content objects and produces the input parameters that guide the migration engine in performing data migration operations.
Many current content platforms, such as portals, discover the application modes of the large number of content objects they manage by defining simple QPop models (such as clicks and recommendations) and QPop growth models (mostly incrementing by one), and plan their content architecture on this basis.
In this work, to facilitate the study of time-domain segmentation and dimension upgrade strategies, we use the basic QPop model (one-dimensional, counting media data accesses) and the basic QPop growth model (each access increases QPop by 1), and carry out QPop data collection and QPop incremental mining experiments in the application environment of the experimental system.
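To make the growth model concrete, the following Python sketch accumulates a scalar QPop per content object from an access log. The log layout `(timestamp, object_id, operation)`, the function names, and the optional per-operation weights are assumptions made for illustration; with no weights given, every access adds 1, which corresponds to the basic growth model.

```python
from collections import defaultdict

def qpop_series(log, weights=None):
    """Return {object_id: [(timestamp, cumulative QPop), ...]}.

    log: iterable of (timestamp, object_id, operation) records in any order.
    weights: optional {operation: increment}; operations not listed add 1.
    """
    weights = weights or {}
    series = defaultdict(list)
    totals = defaultdict(int)
    for ts, obj, op in sorted(log):          # process accesses in time order
        totals[obj] += weights.get(op, 1)
        series[obj].append((ts, totals[obj]))
    return dict(series)

def qpop_at(points, t):
    """Cumulative QPop of one object at time t (0 before its first access)."""
    value = 0
    for ts, q in points:
        if ts > t:
            break
        value = q
    return value
```

A growth model involving several factors could be sketched by passing weights such as `{"download": 3}`; these values are hypothetical and would have to be chosen for the application environment.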
QPop incremental mining based on C-Means
C-Means is an important class of partition clustering algorithm, which is usually selected as a standard algorithm to verify the effectiveness of mining strategies and compare the performance of different mining strategies. This paper chooses C-Means to mine QPop incremental data to discover the application mode of content objects.
The QPop incremental mining algorithm based on C-Means obtains the QPop increment of each content object in a specific period and clusters the increments with the C-Means algorithm. The clustering results reflect the different application characteristics of media content objects in the current period. For example, the QPop of some content objects grows rapidly while that of others does not grow at all; C-Means can distinguish these two types of content objects well, and migration strategies can be formulated accordingly. The algorithm flow is as follows:
1) For a given start time $t_0$ and end time $t$, calculate the QPop increment of each media content object $i$ in this interval, $x_i = \Delta Q_i = Q_{i,t} - Q_{i,t_0}$, and take $\{x_i\}$ as the training set.
2) Determine the number of clustering patterns $k$, and randomly select $k$ samples $c_1, c_2, \ldots, c_k$ from the training set as cluster centers.
3) Assign each sample $x$ in the training set to the class of its nearest center $c_i$ according to Euclidean distance.
4) Re-adjust each cluster center $c_i$, which is calculated by the following formula:
$$c_i = \frac{1}{N_i} \sum_{x \in C_i} x$$
where $C_i$ is the set of samples assigned to the $i$-th cluster and $N_i$ is the number of objects in the $i$-th cluster.
5) If the cluster centers $c_i$ ($i = 1, 2, \ldots, k$) in step 4) no longer change, stop; otherwise go to step 3).
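The flow above can be sketched in plain Python as follows. This is an illustrative implementation of standard hard C-Means (k-means), not the code of the experimental system; the function names, the fixed random seed, the iteration cap, and the handling of empty clusters are assumptions. Samples are tuples, so the same routine accepts one-dimensional increments wrapped as 1-tuples.

```python
import math
import random

def qpop_increments(q_start, q_end):
    """Step 1: increment Q(t) - Q(t0) per object, wrapped as a 1-tuple."""
    return {obj: (q_end[obj] - q_start.get(obj, 0),) for obj in q_end}

def c_means(samples, k, seed=0, max_iter=100):
    """Hard C-Means on a list of equal-length tuples.

    Returns (centers, labels); labels come from the last assignment step.
    """
    rng = random.Random(seed)
    centers = rng.sample(samples, k)              # step 2: random initial centers
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # step 3: assign every sample to its nearest center (Euclidean distance)
        labels = [min(range(k), key=lambda j: math.dist(x, centers[j]))
                  for x in samples]
        # step 4: move each center to the mean of its members
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members)
                                         for col in zip(*members)))
            else:
                new_centers.append(centers[j])    # assumption: keep an empty cluster's center
        # step 5: stop once the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels
```

For example, increments of 0, 1, 90, and 100 with `k = 2` separate into a group with no growth and a rapidly growing group, which is the distinction the migration strategy relies on.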
The experimental system in this article first implements this basic algorithm as part of the control experiment.
Algorithm improvement based on time-domain segmentation and dimension upgrade
The basic QPop incremental mining algorithm reduces all access logs in the period between the start and end times to a single one-dimensional integer, which does not reflect the trend of QPop changes within this period. For this reason, we propose an improved QPop incremental clustering mining algorithm based on time-domain segmentation and dimension upgrade.
The basic idea of this method is to set m intermediate points between the start and end times and obtain the QPop increment of each segment between adjacent points, so that there are m + 1 segment increments between the start and end times. For each content object, the QPop increments on these segments constitute an (m + 1)-dimensional vector, and these vectors are clustered. Because the QPop increment vector after the dimension upgrade better reflects the detailed growth trend of the content object's QPop during this period, it is more beneficial to clustering.
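A small worked example with made-up numbers (not taken from the experiment) shows why the dimension upgrade helps. Suppose two hypothetical objects A and B each gain 12 QPop over the whole interval, but A is accessed mostly at the beginning and B mostly at the end. With m = 2, each object has three segment increments:

```latex
\[
\Delta Q_A = \Delta Q_B = 12, \qquad
X_A = [10,\ 1,\ 1], \qquad X_B = [1,\ 1,\ 10],
\]
\[
\lVert X_A - X_B \rVert = \sqrt{9^2 + 0^2 + (-9)^2} = \sqrt{162} \approx 12.73 .
\]
```

The basic algorithm sees two identical scalars and cannot separate the objects, whereas the segmented vectors lie far apart in Euclidean distance and can fall into different clusters.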
The specific steps of this improved algorithm are:
1) For the given start and end times $t_0$ and $t$ and the number of divisions $m$, calculate the segment length $\tau = (t - t_0)/(m + 1)$ and the time points $t_j = t_0 + j\tau$, $j = 1, 2, \ldots, m + 1$.
2) For each media content object $i$, take the QPop segment increment $x_{ij} = \Delta Q_{i,j} = Q_{i,t_j} - Q_{i,t_{j-1}}$ on each segment.
3) Take the $(m + 1)$-dimensional vectors $X_i = [x_{i1}, x_{i2}, \ldots, x_{i(m+1)}]$ of all objects $i$ as the training set.
4) Randomly take $k$ $(m + 1)$-dimensional vectors $c_1, c_2, \ldots, c_k$ from the training set as cluster centers.
5) Assign each sample $X$ in the training set to the class of its nearest center $c_i$ according to Euclidean distance.
6) Re-adjust each cluster center $c_i$, calculated by the following formula:
$$c_{ip} = \frac{1}{N_i} \sum_{X \in C_i} X_p, \quad p = 1, 2, \ldots, m + 1$$
where $C_i$ is the set of samples assigned to the $i$-th cluster, $N_i$ is the number of objects in the $i$-th cluster, $X_p$ is the $p$-th component of $X$, and $c_{ip}$ is the $p$-th component of $c_i$.
7) If the cluster centers $c_i$ ($i = 1, 2, \ldots, k$) in step 6) no longer change, stop; otherwise go to step 5).
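The time-domain segmentation that builds the training vectors can be sketched as below. The helper names are hypothetical, and the sketch assumes the basic growth model, so the cumulative QPop of an object at time p is simply the number of its accesses with timestamp at most p.

```python
import bisect

def make_counter(access_times):
    """Cumulative QPop under the basic growth model: accesses up to time p."""
    times = sorted(access_times)
    return lambda p: bisect.bisect_right(times, p)

def segment_increments(qpop_at, t0, t, m):
    """Return [x_1, ..., x_{m+1}] with x_j = Q(t_j) - Q(t_{j-1}).

    qpop_at: function mapping a time to the cumulative QPop of one object.
    """
    tau = (t - t0) / (m + 1)
    points = [t0 + j * tau for j in range(m + 2)]   # t_0, t_1, ..., t_{m+1}
    points[-1] = t                                  # guard against rounding drift
    values = [qpop_at(p) for p in points]
    return [b - a for a, b in zip(values, values[1:])]
```

The resulting vectors, one per content object, form the training set and can be passed to any hard C-Means routine. Setting `m = 0` reduces the vector to the single increment used by the basic algorithm.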
The main environmental parameters of the QPop incremental mining experiment
Based on the standard C-Means clustering algorithm and the basic QPop growth model, we built an experimental system for discovering the application patterns of media content objects and implemented in it both the basic QPop incremental mining algorithm and the improved QPop incremental mining algorithm with time-domain segmentation and dimension upgrade. A test team composed of professionals engaged in video program production accessed and performed various operations on the content objects (372 objects, including video material, audio material, movie material, and film material) to simulate access to media content objects in an actual application environment, yielding a large amount of raw data close to the real application environment. Based on the resulting QPop incremental data, the two algorithms are used for clustering knowledge discovery on the QPop increments, and their clustering performance is compared and analyzed. The main parameters of the experimental environment and process are given in Table 1.
The number of target modes of the clustering algorithm is set to 5 to simulate the five storage modes of “data stream tape library + optical disc library + SAN + non-linear editing workstation local SCSI + non-linear editing workstation local IDE” that often appear in actual mass media asset applications.
Comparison of mode cohesion
The cluster output of the experimental system first lists each cluster center and then the categories of all 372 objects. We conducted clustering experiments in 12 cases: the basic algorithm, and the improved algorithm with m = 1, 2, ..., 11 (corresponding to vector dimensions 2, 3, ..., 12 after the dimension upgrade). Based on the experimental results, the core performance indicator of the basic and improved QPop incremental mining algorithms is compared and analyzed: the mode cohesion, which is quantitatively described by Sum(Adi), the sum over all target modes of the mean Euclidean distance Adi between the elements of each target mode and its cluster center.
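Read literally, the indicator is $\mathrm{Sum}(Ad_i) = \sum_i \frac{1}{N_i} \sum_{x \in C_i} \lVert x - c_i \rVert$, where $C_i$ is the set of elements in target mode $i$, $N_i$ is its size, and $c_i$ is its cluster center. A minimal Python sketch of this computation follows; the function name is hypothetical, and skipping empty modes is an assumption, since the text does not say how they are treated.

```python
import math

def mode_cohesion(samples, labels, centers):
    """Return Sum(Ad_i): the mean Euclidean distance of each mode's members
    to that mode's center, summed over all non-empty modes."""
    totals = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x, lab in zip(samples, labels):
        totals[lab] += math.dist(x, centers[lab])
        counts[lab] += 1
    return sum(t / n for t, n in zip(totals, counts) if n)
```

The function takes the samples, their cluster labels, and the cluster centers, which are the quantities listed in the cluster output of the experimental system.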
Table 2 shows that the Sum(Adi) of the improved algorithm is significantly higher than that of the basic algorithm. As the cohesion of each target mode improves, the corresponding application mode becomes clearer and the migration strategy configured for each application mode becomes more accurate, which helps improve migration performance.
At the same time, analysis of how objects transfer between clusters across dimensions in the improved algorithm shows that, as the dimension increases, the classification and membership of objects tend to stabilize. If a fuzzy clustering core algorithm is introduced in follow-up research, the degree of membership of each object in each target mode can describe the membership relationship, which would better handle the problem of object transfer between different dimensions.
This paper studies intelligent data migration for mass media content platforms in the broadcasting and television media industry. Based on time-domain segmentation and dimension upgrade of query popularity (QPop) increments, an algorithm and an experimental system are established for discovering the application characteristics of media content objects from QPop increments. The experimental results verify that the improved algorithm with time-domain segmentation and dimension upgrade has better clustering performance. In follow-up work, we will study QPop growth models involving multiple factors and introduce fuzzy clustering algorithms such as FCM, PIM, FCS, and RCP to achieve more effective mining of content object application patterns.
