Abstract
To address the scattered data and limited professional knowledge that hamper traditional power data quality assessment and verification governance, our approach applied natural language processing (NLP) to text preprocessing, drawing on power system research findings and case analyses. Named entity recognition improved entity identification accuracy, while relationship weighting supported entity classification and the assignment of relationship weights. The resulting power system knowledge graph was integrated into a graph database for real-time updates. By combining relationship weighting with graph convolutional networks (GCN), the method achieved precise representation and modeling of the power system knowledge graph. In real-time anomaly detection, the anomaly detection rate stayed between 1% and 10%, and accuracy ranged from 80% to 98%. These results indicate that the method is effective for processing power data and robust in power system assessment and verification.
Introduction
The power system is critical infrastructure in modern society, and its stable operation is crucial for energy security and socio-economic development. However, traditional data quality assessment and verification governance face problems such as scattered data [1, 2], insufficient professional knowledge [3, 4], and high real-time requirements [5, 6], which restrict the ability to accurately monitor the system’s operating status and correct problems promptly.
We propose an innovative approach to enhance the quality assessment and verification governance of traditional power data by integrating knowledge graphs with machine learning algorithms. This comprehensive methodology addresses the limitations of existing methods in power system data quality assessment. It encompasses text preprocessing, entity recognition, knowledge graph construction, relationship extraction, and machine learning-based data quality assessment, providing a systematic framework for robust analysis and improvement of power data quality.
The main innovation of the article lies in proposing a power data quality assessment and verification governance method based on knowledge graphs. This approach effectively addresses the issues of data dispersion and limited professional knowledge in traditional power data quality assessment and governance. Key contributions include: (1) Collecting and processing professional literature, technical manuals, and case studies from the power system, utilizing NLP technology for text preprocessing and entity recognition to construct a comprehensive power system knowledge graph, along with a real-time data synchronization and update strategy to reflect the latest changes in the power system. (2) Employing advanced BiLSTM-CRF models for entity recognition to improve accuracy and generalization, and using graph convolutional networks to better understand the relationships between entities. (3) Designing feature engineering methods to transform power data into features suitable for machine learning algorithms, using RNNs to adapt to the dynamic changes of the power system, and dynamically assessing power data to timely identify and respond to potential issues. (4) Demonstrating outstanding real-time processing, accuracy, and anomaly detection rates, especially in simulated anomaly injection experiments and actual power system anomaly scenarios, showing high sensitivity in detecting and governing power data anomalies.
This article first extracts domain knowledge and applies NLP technology [7, 8] for text preprocessing and entity recognition, establishing a comprehensive knowledge graph of the power system. The related work section reviews advances in power data quality assessment, including fault detection methods, differential protection schemes, and transient stability prediction techniques, and discusses challenges such as interpretability and adaptability in machine learning applications. In knowledge graph construction and data integration (Section 3.1), the text data is processed and integrated into the graph database for visualization, with continuous updates ensuring real-time performance. In entity annotation and boundary extraction (Section 3.2), relationship weighting [9, 10] and GCN [11, 12] are employed for precise relationship extraction, optimizing attribute extraction and mitigating noise in the power knowledge graph. The machine learning techniques section (Section 3.3) focuses on data quality assessment, covering feature extraction, anomaly detection algorithms, and real-time evaluation methods. The experimental and evaluation phase trains machine learning models for power data quality assessment using supervised, unsupervised, and semi-supervised learning. The evaluation metrics are real-time performance, processing time, accuracy, and anomaly detection rate across varied data volumes and complexities. Simulated anomaly injection experiments and real-world anomaly detection scenarios further examine the efficacy and robustness of the proposed methods.
Related works
Previous research has focused on addressing the issue of power data quality assessment, achieving online detection, localization, and discrimination of power facility faults through current measurement rate index, unified power flow controller connection fault detection, and air gap rotating magnetic field analysis. This not only improves the operational efficiency of the power system [13, 14] but also lays the foundation for the development of the field of power data quality assessment. Chatterjee B and his team proposed a method of identifying the poles involved in faults by measuring the rate of change of current, to distinguish faults involved in ground faults and specialized metal circuit conductors [15]. Haleem N M and colleagues proposed a method for identifying internal and external faults and implementing positive sequence components connected to the Unified Power Flow Controller (UPFC) to detect internal faults, which can measure positive and negative sequence voltage and current components at both ends of the line [16]. Faced with the problem of online detection, localization, and discrimination of faults between stator and rotor turns in synchronous generators, Afrandideh S’s team indicated the presence of stator or rotor faults and provided information about the type and location of faults by analyzing the measured air gap rotating magnetic field [17]. The induction charging system requires effective foreign object detection methods to cope with external interference and different alignment changes. Jafari H and others introduced a foreign object detection method that only requires main side measurement, and utilized the main resonant current and resonant frequency offset. This method successfully achieved high-speed detection of large-sized objects while reducing system complexity [18]. The direct current (DC) power grid composed of a modular multilevel converter (MMC) requires an effective protection scheme to deal with DC line faults. 
He Y’s team proposed a transient information DC line protection scheme, which quickly and accurately identified DC line faults and provided effective protection for the safe and stable operation of the DC power grid [19]. These technologies are expected to improve the operational efficiency of the power system by identifying and locating faults more quickly and accurately, reducing their negative impact on system performance. However, their scalability to different types of power systems and operating conditions remains insufficient, which limits their wider application.
In light of the burgeoning advancements in machine learning, the amalgamation of machine learning technology with power data quality assessment has yielded significant breakthroughs. Numerous studies have excelled in enhancing the reliability and efficacy of assessments. For instance, Afrasiabi S and team innovatively employed a convolutional neural network (CNN) to devise a novel differential protection scheme, effectively discerning transformer magnetization current from internal faults. Their method exhibited superior performance in speed, hardware utilization, and accuracy, thereby mitigating the risk of operational errors [20]. However, despite these strides, challenges persist in the reliability assessment and control of energy systems. Scholars such as Duchesne L advocate for the synergy between machine learning and reliability management, particularly in large-scale power systems, offering valuable insights into the potential applications of machine learning technology in energy system reliability management, thus charting the course for future research directions [21]. Similarly, while machine learning algorithms have been instrumental in power system safety assessment, concerns linger regarding their interpretability. Cremer J L and colleagues addressed this concern by exploring the delicate balance between prediction accuracy and interpretability, introducing decision tree learning of safety rules, and proposing innovative training methods to enhance algorithm quality while preserving interpretability [22]. Furthermore, online transient stability prediction necessitates efficient techniques to navigate the dynamic changes in power systems. Zhu L devised a hierarchical deep learning machine, leveraging anti-noise graphical transient feature technology and hierarchical convolutional neural networks to achieve precise and adaptive online transient stability prediction, particularly excelling in forecasting stable states and stability margins [23]. 
To overcome the limitations that missing phasor measurement unit (PMU) data imposes on dynamic security assessment, Ren C and Xu Y proposed a fully data-driven approach based on generative adversarial networks. This method not only resolved the issue of missing PMU data but also demonstrated better universality and scalability [24]. Although machine learning has made substantial progress in improving power system reliability and safety assessment, challenges remain in interpretability, adaptability to dynamic changes, handling of missing data, and effective integration of professional knowledge [25, 26]. These challenges stem from the intricate nature of power systems and highlight the need for a deeper understanding of domain-specific knowledge. In response, this study integrates knowledge graph construction with machine learning techniques to enhance power data quality assessment and verification governance, aiming to overcome these limitations of existing methods.
Method
Knowledge graph construction
The literature search is conducted through academic databases such as IEEE Xplore and ScienceDirect in the field of power, and professional literature containing knowledge of power systems is obtained. Technical manuals such as power system equipment, standards, and operation manuals are collected, which usually contain information and relevant specifications of the power system in practical applications. Academic journals in the field of power are traversed to obtain published research results and case studies related to the power system.
In the knowledge extraction stage of the power system field, this study fully utilizes natural language processing technology to extract key knowledge from literature and professional manuals through the following steps:
First, the literature and professional manuals in the field of power systems are preprocessed: stop words and punctuation are removed and word segmentation is performed to prepare the text for subsequent processing.
A stop-word list is used to remove common words that carry little meaning, such as “de” and “zai”, which reduces the size of the text data and improves the efficiency of subsequent processing. The Jieba segmentation tool splits continuous text into meaningful lexical units, providing a foundation for subsequent entity recognition and keyword extraction.
Punctuation marks, numbers, and other symbols that are neither letters nor Chinese characters are removed so that they do not interfere with subsequent processing steps. All letters are converted to lowercase, which eliminates differences in capitalization and ensures that the same word is recognized consistently. Abbreviations are expanded or replaced to reduce ambiguity and ensure that they are interpreted correctly.
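The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the stop-word list and abbreviation table are hypothetical, and a whitespace split stands in for Jieba segmentation, which Chinese text would require. The steps are applied in an order that suits English text (lowercasing before stop-word filtering).

```python
import re

# Hypothetical stop-word list and abbreviation table, for illustration only.
STOP_WORDS = {"the", "of", "a", "is", "in", "and"}
ABBREVIATIONS = {
    "upfc": "unified power flow controller",
    "pmu": "phasor measurement unit",
}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip symbols, expand abbreviations, drop stop words."""
    text = text.lower()
    # Keep only ASCII letters, CJK ideographs, and whitespace.
    text = re.sub(r"[^a-z\u4e00-\u9fff\s]", " ", text)
    # Whitespace split is a stand-in for Jieba word segmentation.
    tokens = text.split()
    expanded = []
    for token in tokens:
        # Replace a known abbreviation by the words of its long form.
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return [token for token in expanded if token not in STOP_WORDS]


print(preprocess("The PMU in Substation 3 is offline."))
```

Keeping the abbreviation table separate from the code makes it easy to extend as new domain terms are encountered.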
Named entity recognition (NER) is used to identify power system entities in the text, such as generators, substations, and transmission lines. A deep learning model, bi-directional long short-term memory with a conditional random field (BiLSTM-CRF), annotates the text and improves the accuracy and generalization of entity recognition.
Entity recognition results
Table 1 annotates the sentence “Power data quality assessment is crucial for maintaining a reliable electrical grid. The accuracy of metering devices, such as transformers and substations, impacts the overall efficiency of the power system.” using the BiLSTM-CRF model. Annotations are in BIO (Begin, Inside, Outside) format. Among them, “B-” represents the beginning of the entity; “I-” represents the interior of the entity; “O” represents nonentity.
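To make the BIO scheme concrete, the sketch below groups BIO-tagged tokens back into entity spans. The tag names `EQUIP` and `STATION` and the tags assigned to the example tokens are assumptions chosen for illustration; they are not taken from Table 1.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # Continuation of the entity that is currently open.
            current.append(token)
        else:
            # "O" (or an inconsistent "I-") closes any open entity.
            if current:
                spans.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        spans.append((" ".join(current), current_type))
    return spans


tokens = ["The", "accuracy", "of", "metering", "devices", "such", "as",
          "transformers", "and", "substations"]
# Hypothetical tags, not the annotation reported in the paper.
tags = ["O", "O", "O", "B-EQUIP", "I-EQUIP", "O", "O",
        "B-EQUIP", "O", "B-STATION"]
print(bio_to_spans(tokens, tags))
```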
When constructing the BiLSTM-CRF model, a context window is defined around each word to determine the range of contextual information that the model should consider. The words within the context window are encoded and incorporated into the model input, enabling the model to better understand the context and features of entities in the text.
Figure 1. Knowledge graph of power infrastructure.
To better understand the relationships between entities, entities in the power system are divided into different categories such as equipment and station categories. The data with contextual features and category labels can be used to train the BiLSTM-CRF model, which can accurately identify and classify entities in the field of power systems.
Establishing an entity dictionary is crucial for entity recognition and classification in the field of power systems. This dictionary contains various types of power equipment, components, and professional terminology, improving the accuracy of entity recognition. When preprocessing the recognized text, operations such as word segmentation and removing stop words are included to match with the entity dictionary. Through precise or fuzzy matching, the vocabulary in the text corresponds to the entity dictionary. After successful matching, the vocabulary is marked as an entity. Finally, the matched entities are integrated into the text and marked with their positions, and each entity is assigned a corresponding label for subsequent processing.
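The dictionary matching step can be sketched as below, with exact lookup first and fuzzy matching as a fallback. The dictionary entries, the labels, and the 0.8 similarity cutoff are assumptions for illustration; fuzzy matching here uses Python's standard `difflib`, which may differ from the matcher the authors used.

```python
import difflib

# Hypothetical entity dictionary mapping terms to category labels.
ENTITY_DICT = {
    "transformer": "EQUIPMENT",
    "generator": "EQUIPMENT",
    "substation": "STATION",
}


def match_entities(tokens, cutoff=0.8):
    """Return (position, dictionary_term, label, match_kind) per matched token."""
    matches = []
    for position, token in enumerate(tokens):
        if token in ENTITY_DICT:
            matches.append((position, token, ENTITY_DICT[token], "exact"))
            continue
        # Fall back to the single closest dictionary term above the cutoff.
        close = difflib.get_close_matches(token, list(ENTITY_DICT), n=1, cutoff=cutoff)
        if close:
            matches.append((position, close[0], ENTITY_DICT[close[0]], "fuzzy"))
    return matches


print(match_entities(["the", "transformers", "near", "substation", "tripped"]))
```

A lower cutoff catches more spelling variants but also admits more false matches, so the threshold would need tuning on real text.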
To improve the representation ability of the knowledge graph, weights are assigned to the relationships in the graph based on their context and importance in the text, so that the graph more accurately reflects the strength of relationships between entities in the power system. A node is created for each power system entity, recording the entity’s type, name, and other relevant information. A node is also created for each keyword, recording the keyword itself, its word frequency, and its weight. Relationships between entities, such as the connections between power equipment and the association between power stations and equipment, are modeled on the actual topology and connection method of the power system. Relationships between keywords and entities are modeled as well: a keyword node can be linked to every entity node that contains the keyword. Finally, the processed node and relationship data are imported into the graph database in a format it supports. In Neo4j, node and relationship data are described in the Cypher language and imported by running Cypher scripts or using the database’s import tools. Cypher queries are then used to retrieve specific information about the power system, revealing the relationships between entities and the distribution of keywords. The graph database also serves as a visualization tool, giving a visual representation of the power system knowledge graph that helps in understanding the relationships and interactions between components. Figure 1 shows a partial example of the power structure.
Rectangles represent entities, such as power stations and generators, while solid lines represent direct relationships between entities, such as containment relationships. The dashed line represents relationships that are linked through a series of methods, such as electricity supply relationships. These dashed lines are not the original attributes of entities but are formed through the association methods in the graph, emphasizing the complex and organic interrelationships between entities.
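As a rough sketch of how weighted relationships might be prepared for import, the code below derives a weight for each relationship and renders Cypher `MERGE` statements as text. The weighting scheme (relative frequency of a triple in the text), the node labels, and the entity names are all assumptions for illustration; the paper does not specify its weighting formula.

```python
from collections import Counter

# Hypothetical (source, relation, target) mentions extracted from text.
observations = [
    ("Plant A", "CONTAINS", "Generator 1"),
    ("Plant A", "CONTAINS", "Generator 1"),
    ("Plant A", "SUPPLIES", "Substation B"),
]


def relation_weights(obs):
    """Weight each triple by its frequency relative to the most frequent one."""
    counts = Counter(obs)
    top = max(counts.values())
    return {triple: count / top for triple, count in counts.items()}


def node_cypher(label, name):
    """Cypher statement that creates the node if it does not exist yet."""
    return f"MERGE (:{label} {{name: '{name}'}})"


def relationship_cypher(src, rel_type, dst, weight):
    """Cypher statement that links two existing nodes and stores the weight."""
    return (
        f"MATCH (a {{name: '{src}'}}), (b {{name: '{dst}'}}) "
        f"MERGE (a)-[r:{rel_type}]->(b) SET r.weight = {weight}"
    )


print(node_cypher("Station", "Plant A"))
for (src, rel, dst), weight in relation_weights(observations).items():
    print(relationship_cypher(src, rel, dst, weight))
```

String interpolation keeps the sketch short; in a real import, query parameters passed through the database driver would be safer than building statements by hand.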
Figure 2. Schematic diagram of feature extraction process.
Figure 2 depicts the retrieval of device data, references, and related research results from public networks and open-source datasets. Feature extraction, fusion, and classification are then applied to obtain accurate entity labels. The strength of this process lies in combining targeted use of open networks and open-source datasets with further processing of the retrieved information through feature extraction and fusion, yielding entity labels that meet the requirements.
Information on entities and relationships, including structured, semi-structured, and unstructured data, can be obtained from BetterGrids, OpenStreetMap (OSM), and power company data. BetterGrids provides power grid topology data, most of which is modeled on real power grids. OSM is an open-source map data source that can provide electricity-related entity and relationship information. In addition, some companies have disclosed data on electricity supply and demand. To enable effective data integration and subsequent processing, entity and relationship information from these sources is integrated into a unified format. During integration, differences in data format, units, and accuracy must be resolved through data conversion, unit unification, and accuracy adjustment to ensure that the data is accurate and consistent. The integrated entity and relationship information is added to the knowledge graph, and the nodes and relationships in the graph database are updated so that they reflect the latest information. This enables knowledge sharing and reuse and improves data processing efficiency and accuracy.
The Stanford NER model is used to annotate text, which outputs entity labels for each word and entity boundary information. The output of the model is parsed to extract entity boundaries and corresponding entity categories from the text [27, 28]. The results are stored in files or databases for subsequent analysis and application.
The TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is used to extract keywords from the text, capturing its theme and important information. The resulting keywords provide knowledge clues that lay the foundation for subsequent knowledge graph construction. The term frequency (TF) of each term is calculated first, as shown in Eq. (1). The inverse document frequency (IDF), which measures the importance of a term across the entire text set, is then calculated as shown in Eq. (2). Finally, the TF-IDF value is calculated as shown in Eq. (3).
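For readers who want a concrete version, the sketch below implements the common textbook form of TF-IDF: TF is a term's count divided by the number of terms in the document, IDF is the natural logarithm of the number of documents divided by the number of documents containing the term, and the score is their product. This is an assumption about the general form; the paper's own equations may use a smoothed or differently normalized variant.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [["transformer", "fault", "transformer"], ["generator", "fault"]]
for row in tf_idf(docs):
    print(row)
```

In this form a term that appears in every document, such as "fault" here, scores zero, which is why it is not selected as a keyword.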
Before model training, data preprocessing is required to convert the original text into a format suitable for relation extraction model input, which can represent entities and keywords in the text as nodes to construct a graph. The connection relationship between nodes represents the syntactic and semantic connections between entities. This structure of the graph provides the foundation for subsequent graph convolutional networks.
Graph convolutional network is a deep learning model suitable for graph-structured data. In the knowledge graph of the power system, GCN is used to learn the relationships between nodes. The core idea of GCN is to update the representation of each node by aggregating the information of adjacent nodes, which enables nodes to consider the contextual information of their neighboring nodes. GCN is utilized to learn feature representations of entity nodes, which contain contextual information about entities, enabling the model to better understand the relationships between entities. During the training process, the model gradually extracts higher-level semantic features through multiple rounds of graph convolution. The learned feature representations are used for relationship classification, which determines the types of relationships between entities. This is a multi-classification problem, where each relationship type corresponds to a connection method in the graph. Through model training, the model can accurately classify the relationships between entities.
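The neighbor aggregation described above can be illustrated with a single graph convolution layer in plain Python. The sketch uses a widely used propagation rule (self-loops added, symmetric degree normalization, then a linear map and ReLU) as an assumption; the paper does not state which GCN variant or which weights it uses, and the three-node graph and weight values are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adjacency, features, weights):
    """One layer: ReLU(D^-1/2 (A + I) D^-1/2 * H * W)."""
    n = len(adjacency)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adjacency[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    degree = [sum(row) for row in a_hat]
    # Symmetric normalization by the degrees of both endpoints.
    norm = [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]
    hidden = matmul(matmul(norm, features), weights)
    return [[max(0.0, value) for value in row] for row in hidden]


# Hypothetical graph: station - generator - transformer (a path of 3 nodes).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot inputs
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]  # untrained example values
for row in gcn_layer(adjacency, features, weights):
    print(row)
```

Stacking several such layers lets information travel over multiple hops; the final node representations would then feed a classifier over relationship types.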
For each entity type, the rule of attribute extraction [29, 30] is utilized. For example, the installed capacity of power stations can be extracted by identifying specific keywords and values. The BiLSTM-CRF model is trained to automatically extract entity attributes. The training dataset containing attribute information needs to be annotated. Rules or trained models are applied to extract attributes from entities in the knowledge graph, ensuring that the extracted attributes accurately reflect the characteristics of the entities.
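A rule of the kind mentioned above, extracting installed capacity from a keyword and a value, might look like the following. The regular expression, the accepted units, and the attribute name are assumptions for illustration only.

```python
import re

# Hypothetical rule: the keyword "installed capacity of" followed by a number and a unit.
CAPACITY_PATTERN = re.compile(
    r"installed capacity of\s+(\d+(?:\.\d+)?)\s*(kW|MW|GW)", re.IGNORECASE
)


def extract_capacity(sentence):
    """Return the installed-capacity attribute found in the sentence, or None."""
    match = CAPACITY_PATTERN.search(sentence)
    if match is None:
        return None
    return {
        "attribute": "installed_capacity",
        "value": float(match.group(1)),
        "unit": match.group(2),
    }


print(extract_capacity("The power station has an installed capacity of 600 MW."))
```

Rules like this are precise but brittle, which is one reason to complement them with a trained model for attributes that are phrased in many different ways.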
Figure 3 shows multiple power entities, which are visually represented as circles, and their specific properties and functions are presented in the form of rounded rectangles. For example, the generator entity includes attributes such as size, shell style, and cooling system, and its main components are the stator and rotor. These two parts work together to generate an electric current. In addition, the generator also forms a substation along with other entities such as cables and transformers.
The data in the knowledge graph is analyzed to identify and process noisy data. Noise may come from errors, inconsistencies, or outliers in the data source. The entity disambiguation problem is addressed when similar entities from different data sources are merged or identified. Entity-linking technology is used to determine whether entities refer to the same entity by comparing their attributes and contextual information. Assessment indicators are designed to quantitatively assess the quality of the knowledge graph, which can include the completeness of nodes, accuracy of entity disambiguation, and graphic quality ratings, as shown in Table 2. Operations are optimized based on assessment results, including deleting redundant information, fixing errors, adjusting weights, etc., to improve the quality and accuracy of the graph.
Table 2. Part of the assessment results
Figure 3. Schematic diagram of attribute extraction.
According to Table 2, the current knowledge graph has a high coverage of power system elements, which includes comprehensive representations of various entities related to the power system. Although the relationships in the knowledge graph are moderately accurate, there is still room for improvement. This indicates that some connections may not accurately reflect the actual relationships in the power system. The entity disambiguation process has high accuracy. Similar entities are effectively distinguished, indicating a strong disambiguation mechanism. Considering multiple factors and assigning appropriate weights results in a higher overall quality score for the knowledge graph. This indicates that although there are specific areas for improvement, knowledge graphs perform well in various assessment criteria.
Precise identification and resolution of specific quality issues in power data: with the support of knowledge graphs, achieving precise identification and resolution of specific quality issues in power data is a key task. To address the issue of power data quality, the first step is to use the problem localization module to determine the specific domain and entity type of the problem. If the generator data is abnormal, it would be located in the generator entity. The entity relationships in the knowledge graph are utilized, and knowledge related to the problem entity is obtained through querying the graph. The associated information of the generator entity is queried, including its affiliated power station, operating status, etc. For the identified problem entity, the key features of the problem entity are extracted through the professional knowledge provided by the knowledge graph, which can include entity attribute information, historical data trends, etc. The associated entity information in the graph is utilized to obtain other entity features related to the problem entity, expanding the dimension of the problem feature. The similarity problem matching module in the knowledge graph is used to compare the features of the current problem entity with those of similar historical problems, which can be achieved through the similarity measurement algorithm in the graph. Whether there are historical issues similar to the current problem is determined. If it exists, solutions to historical problems can be borrowed. Based on the solution library in the knowledge graph, corresponding solutions are recommended for the current problem entity. This requires matching existing solutions based on problem characteristics and utilizing the relationship information in the knowledge graph to obtain detailed content of the solutions. Multiple possible solutions are provided to meet the needs of different contexts. 
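The similarity matching and solution recommendation steps can be sketched as follows. Cosine similarity is used here as one possible similarity measure; the feature vectors, the historical cases, the solution texts, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical historical problems with feature vectors and recorded solutions.
HISTORY = [
    {"features": [1.0, 0.0, 0.2], "solution": "recalibrate sensor"},
    {"features": [0.0, 1.0, 0.9], "solution": "fill gaps by interpolation"},
]


def recommend(features, history, threshold=0.8):
    """Return solutions of sufficiently similar past problems, best match first."""
    scored = [(cosine(features, case["features"]), case["solution"]) for case in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [solution for score, solution in scored if score >= threshold]


print(recommend([0.9, 0.1, 0.2], HISTORY))
```

Returning a ranked list rather than a single answer matches the idea of offering several candidate solutions for different contexts.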
For scenarios involving data quality issues, intelligent cleaning of power data can be achieved through the data cleaning rules in the knowledge graph, which can include operations such as removing abnormal data and filling in missing data. The correlation information in the graph is utilized to ensure that the cleaning operation is not limited to the current problem entity, but also covers relevant entities, ensuring data consistency.
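A simple cleaning rule of the kind described, removing abnormal values and filling missing ones, is sketched below. The valid range and the choice of linear interpolation are assumptions; the actual cleaning rules stored in the knowledge graph are not given in the text.

```python
def clean_series(values, low, high):
    """Drop out-of-range readings, then fill gaps by linear interpolation."""
    # Treat out-of-range readings as missing.
    marked = [v if v is not None and low <= v <= high else None for v in values]
    known = [i for i, v in enumerate(marked) if v is not None]
    if not known:
        return marked
    cleaned = list(marked)
    for i, value in enumerate(marked):
        if value is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            cleaned[i] = marked[right]  # leading gap: copy first valid value
        elif right is None:
            cleaned[i] = marked[left]  # trailing gap: copy last valid value
        else:
            fraction = (i - left) / (right - left)
            cleaned[i] = marked[left] + fraction * (marked[right] - marked[left])
    return cleaned


# Hypothetical voltage readings with one missing value and one spike.
print(clean_series([220.0, None, 224.0, 9999.0, 228.0], low=200.0, high=240.0))
```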
Figure 4. Model training structure diagram.
Figure 4 shows the training structure, which combines soft-label training with multiple bootstrapping iterations [31, 32]. In the initial stage, a BiLSTM-CRF model trained on annotated power data predicts labels for unlabeled data, generating pseudo labels. These pseudo labels are merged with the annotated data to form soft labels, and all pseudo labels are integrated to construct a new training set. The training set is used to train a new model, whose performance is assessed on the validation set. Based on the predictions of the current model, the annotated data is updated and new pseudo labels are generated, which are again merged with the annotated data to form a new training set. This loop is repeated, gradually improving model performance to meet the needs of power data assessment.
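The iterative pseudo-labeling loop can be illustrated with a deliberately small stand-in. In the sketch below, a one-dimensional nearest-centroid classifier replaces BiLSTM-CRF, the confidence measure and the 0.8 acceptance threshold are assumptions, and the validation step is omitted for brevity; only the loop structure reflects the description above.

```python
def centroids(samples):
    """Mean feature value per label for (value, label) pairs."""
    sums = {}
    for value, label in samples:
        total, count = sums.get(label, (0.0, 0))
        sums[label] = (total + value, count + 1)
    return {label: total / count for label, (total, count) in sums.items()}


def predict(model, value):
    """Return (label, confidence) from distances to nearest and farthest centroid."""
    ranked = sorted(model, key=lambda label: abs(value - model[label]))
    nearest, farthest = ranked[0], ranked[-1]
    d_near, d_far = abs(value - model[nearest]), abs(value - model[farthest])
    confidence = 1.0 if d_near + d_far == 0 else d_far / (d_near + d_far)
    return nearest, confidence


def self_train(labeled, unlabeled, rounds=3, min_confidence=0.8):
    """Add confident pseudo labels to the training set and retrain each round."""
    training, pool = list(labeled), list(unlabeled)
    model = centroids(training)
    for _ in range(rounds):
        predictions = [(v, *predict(model, v)) for v in pool]
        accepted = [(v, label) for v, label, conf in predictions if conf >= min_confidence]
        if not accepted:
            break  # nothing confident enough is left to add
        training.extend(accepted)
        accepted_values = {v for v, _ in accepted}
        pool = [v for v in pool if v not in accepted_values]
        model = centroids(training)  # retrain on labeled plus pseudo-labeled data
    return model, pool


model, remaining = self_train([(1.0, "normal"), (9.0, "anomaly")], [1.5, 8.5, 5.0])
print(model, remaining)
```

Samples that never reach the confidence threshold stay in the unlabeled pool, which limits how far early mistakes can propagate.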
By combining the real-time update mechanism of the knowledge graph, it is ensured that the data and knowledge in the graph can reflect the changes in the power system promptly, which involves the design of real-time data synchronization and update strategies for the graph. By combining the professional knowledge provided by the knowledge graph, precise identification and resolution of specific quality issues in power data are achieved. This method integrates rich information from the knowledge graph, improving the accuracy and real-time performance of problem-solving.
To address the high real-time requirements and significant dynamic changes in the power system, using machine learning algorithms for dynamic assessment is an effective strategy. Firstly, through the data acquisition module of the power system, real-time power data can be obtained, including multiple indicators such as generator output, voltage, current, etc. Real-time is the foundation for ensuring dynamic assessment. The real-time obtained power data is preprocessed, including missing value processing, to ensure that the input data to the machine learning model has a certain quality and stability. Feature engineering is designed to transform power data into features that are acceptable to machine learning algorithms, including time series features, frequency domain features, and statistical features, to capture the dynamic changes in data.
According to the requirements of dynamic assessment, a recurrent neural network (RNN) is selected to capture the temporal relationships in the data and adapt to the dynamic changes of the power system, as shown in Eq. (4).
During the training process, historical time series data is used to adjust the model parameters to minimize the cross-entropy loss between predicted and actual values.
The calculation of gradients involves backpropagation through time (BPTT), which requires cumulative gradient calculation of parameters at each time step. In this process, the chain rule is used to calculate the gradients of each layer, and then the model parameters are updated through the gradient descent method to gradually reduce the loss and better fit the temporal patterns of historical data. The hyperparameters of the model are optimized through cross-validation and other means to improve its generalization performance in practical scenarios. Finally, the trained model is deployed to the real-time power data assessment system.
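For reference, a standard single-layer (Elman) RNN with a softmax output, cross-entropy loss, and gradient descent update can be written as follows. This is the textbook formulation, offered as an assumption: the symbols $W_{xh}$, $W_{hh}$, $W_{hy}$, $b_h$, $b_y$, the per-step loss $\mathcal{L}_t = -\sum_k y_{t,k} \log \hat{y}_{t,k}$, and the learning rate $\eta$ are illustrative and may differ from the notation and equation numbering used by the authors.

```latex
\begin{aligned}
h_t &= \tanh\left(W_{xh} x_t + W_{hh} h_{t-1} + b_h\right) \\
\hat{y}_t &= \operatorname{softmax}\left(W_{hy} h_t + b_y\right) \\
\mathcal{L} &= \sum_{t=1}^{T} \mathcal{L}_t = -\sum_{t=1}^{T} \sum_{k} y_{t,k} \log \hat{y}_{t,k} \\
\frac{\partial \mathcal{L}}{\partial W_{hh}} &= \sum_{t=1}^{T} \frac{\partial \mathcal{L}_t}{\partial W_{hh}},
\qquad \theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}
\end{aligned}
```

The sum over time steps in the last line is what backpropagation through time accumulates: each $\partial \mathcal{L}_t / \partial W_{hh}$ is itself obtained by applying the chain rule back through the hidden states $h_t, h_{t-1}, \dots, h_1$.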
The deployed model assesses data dynamically by continuously monitoring the real-time stream. For power output, for instance, it can detect in real time when a value is abnormal or exceeds the normal range. A threshold on the model output determines when a data point is flagged as anomalous. When abnormal power data is detected, the system responds promptly by activating the corresponding warning mechanism or automatic control system. The performance of the model is monitored regularly; if it degrades in practical use, the model is retrained or its parameters are updated to adapt to the dynamic changes of the power system. Through these steps, machine learning algorithms can dynamically assess power data in systems with high real-time requirements and significant dynamic changes, identify and respond to potential problems in time, and improve system stability and reliability.
Power system data is usually time series data, so time series analysis is necessary. The Fourier transform is a commonly used time-frequency analysis method [33, 34] that converts time-domain data into frequency-domain data to reveal periodic patterns and the frequency distribution of the data. By calculating the frequency components of a signal, the transform shows how the signal is distributed across frequencies, as shown in Eq. (8).
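As an illustration of frequency-domain feature extraction, the sketch below computes a discrete Fourier transform directly from its definition and returns normalized magnitudes. The direct O(n²) formula is used for clarity (a fast Fourier transform would be used in practice), and the test signal is hypothetical.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitude of each DFT bin, divided by the number of samples."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))) / n
        for k in range(n)
    ]


# Hypothetical signal: a sine wave with 2 full cycles over 16 samples.
signal = [math.sin(2 * math.pi * 2 * t / 16) for t in range(16)]
magnitudes = dft_magnitudes(signal)
dominant_bin = max(range(1, 8), key=lambda k: magnitudes[k])
print(dominant_bin)
```

The index of the dominant bin and its magnitude are examples of frequency-domain features that could be passed to a downstream model.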
Time-delay embedding maps temporal data into a high-dimensional space while preserving temporal information. A sliding window is used to generate local statistical features, such as the mean and variance, as shown in Eq. (9).
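The two feature constructions can be sketched as follows. The embedding dimension `dim`, the delay `tau`, the window `width`, and the use of the population variance are assumptions chosen for illustration, since the exact form of Eq. (9) is not reproduced here.

```python
from statistics import fmean, pvariance

def delay_embed(series, dim, tau=1):
    """Map a 1-D series to vectors (x[t], x[t+tau], ..., x[t+(dim-1)*tau])."""
    span = (dim - 1) * tau
    return [tuple(series[t + j * tau] for j in range(dim))
            for t in range(len(series) - span)]

def window_stats(series, width):
    """Mean and population variance over each sliding window of the given width."""
    return [(fmean(series[i:i + width]), pvariance(series[i:i + width]))
            for i in range(len(series) - width + 1)]

# Hypothetical load readings.
readings = [1, 2, 3, 4, 5]
embedded = delay_embed(readings, dim=3, tau=1)
stats = window_stats(readings, width=2)
```

Each embedded vector keeps the ordering of its samples, which is how the temporal information survives the mapping, while the window statistics summarize local level and variability.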
The comprehensive application of these feature engineering methods can provide richer and more meaningful inputs for real-time anomaly detection, which helps to improve the accuracy and robustness of the model.
When splitting each node, an isolation forest [35] randomly selects a feature and then randomly selects a split point between the minimum and maximum values of that feature. During tree construction, the path length is calculated recursively, from which the isolation forest determines the relative degree of isolation of outliers, as shown in Eq. (10).
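A compact sketch of this procedure is given below. It follows the standard isolation forest scoring rule, in which the score is 2 raised to the negative ratio of the mean path length to the normalizing constant c(n); whether Eq. (10) uses exactly this form is an assumption. The tree count, subsample size, seeds, and synthetic data are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    """Recursively split on a random feature at a random value in [min, max]."""
    if depth >= max_depth or len(points) <= 1:
        return ("leaf", len(points))
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:  # identical values cannot be split
        return ("leaf", len(points))
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return ("node", feature, split,
            build_tree(left, depth + 1, max_depth, rng),
            build_tree(right, depth + 1, max_depth, rng))

def path_length(x, tree, depth=0):
    """Depth at which x lands, plus c(size) for the unsplit remainder of the leaf."""
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, feature, split, left, right = tree
    return path_length(x, left if x[feature] < split else right, depth + 1)

def fit(points, n_trees=100, sample_size=256, seed=0):
    """Build n_trees isolation trees, each on a random subsample."""
    rng = random.Random(seed)
    size = min(sample_size, len(points))
    max_depth = math.ceil(math.log2(size))
    forest = [build_tree(rng.sample(points, size), 0, max_depth, rng)
              for _ in range(n_trees)]
    return forest, size

def anomaly_score(x, forest, size):
    """Score in (0, 1]; values near 1 indicate points that are isolated quickly."""
    mean_path = sum(path_length(x, t) for t in forest) / len(forest)
    return 2.0 ** (-mean_path / c(size))

# Synthetic data: 200 points near the origin plus one distant point.
data_rng = random.Random(1)
data = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(200)]
data.append((8.0, 8.0))
forest, size = fit(data)
```

Points that are separated after only a few random splits have short average paths and therefore scores close to 1, while points deep inside the bulk of the data score lower.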
Anomaly detection is a process that involves training a model to recognize normal patterns and identify abnormal patterns. The isolation forest algorithm follows a specific training process, which starts with preparing training data. The model then learns the features of normal patterns and isolates them by constructing a random tree structure. Once the training is complete, the model’s effectiveness is assessed using indicators such as precision and recall. If the performance of the model meets the requirements, it can be saved for future use.
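For reference, precision and recall can be computed from labeled outcomes as in the sketch below. The label vectors are hypothetical, and treating 1 as the anomalous class is an assumption.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (anomalous) class, labeled 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical ground truth and model predictions.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

Precision measures how many flagged points were truly anomalous, and recall measures how many true anomalies were flagged; both matter when anomalies are rare.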
Deploying the trained model into real-time data streams is the first step in achieving real-time anomaly detection. The model is saved and loaded into the real-time system, where incoming data is first preprocessed and then passed to the model’s prediction method. The model assigns each data point an anomaly score between 0 and 1, which is used to determine whether the data is abnormal.
Since the power system is constantly changing, the model needs to be regularly updated to adapt to new patterns and changes. This involves retraining the model using the latest historical data. Once an anomaly is detected, an alarm notification is triggered to alert relevant personnel. The severity and urgency of the anomaly determine the alarm level, so that personnel can prioritize high-priority abnormal situations.
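One way to connect anomaly scores to alarm levels is sketched below. The two cut-offs (0.6 and 0.8), the level names, and the toy scoring function based on deviation from a 50 Hz nominal frequency are hypothetical; the study does not specify how severity is mapped to alarm levels.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

@dataclass
class Alarm:
    index: int    # position of the data point in the stream
    score: float  # anomaly score in [0, 1]
    level: str    # "warning" or "critical"

def alarm_level(score: float, warn: float = 0.6, critical: float = 0.8) -> Optional[str]:
    """Map an anomaly score to an alarm level, or None if no alarm is needed."""
    if score >= critical:
        return "critical"
    if score >= warn:
        return "warning"
    return None

def monitor(stream: Iterable[float], score_fn: Callable[[float], float]) -> Iterator[Alarm]:
    """Score each incoming point and yield an Alarm for every flagged point."""
    for i, point in enumerate(stream):
        score = score_fn(point)
        level = alarm_level(score)
        if level is not None:
            yield Alarm(i, score, level)

def toy_score(freq_hz: float) -> float:
    """Toy stand-in for a trained model: scaled deviation from 50 Hz, capped at 1."""
    return min(1.0, abs(freq_hz - 50.0) / 10.0)

alarms = list(monitor([50.0, 50.1, 57.0, 49.9, 62.0], toy_score))
```

Because `monitor` is a generator, alarms are produced as data arrives, and the scoring function can be replaced with a retrained model without changing the alarm logic.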
After an anomaly is detected, the automatic control system executes predefined measures to minimize potential risks. For example, it may automatically switch to a redundant backup system to ensure service continuity. However, some situations may require manual intervention and decision-making. In such cases, the system provides personnel with detailed information to make quick and accurate decisions. The system generates detailed anomaly reports, including the type of anomaly, occurrence time, and scope of impact. Each anomaly is recorded for subsequent analysis and improvement.
Figure 5. The entire process of constructing a knowledge graph.
Figure 5 illustrates the comprehensive process of knowledge graph construction and data integration, encompassing critical stages such as data acquisition, domain knowledge extraction, entity recognition, relationship modeling, data integration, attribute extraction, graph optimization, knowledge graph-enabled governance, and method effectiveness assessment. This holistic approach ensures the accuracy, completeness, and practicality of the knowledge graphs at every stage.
The study trained machine learning models using pre-prepared power data and domain knowledge graphs to assess the quality of power data. These models employed various learning methods, including supervised learning, unsupervised learning, and semi-supervised learning, to address the challenges encountered in power data quality assessment.
In supervised learning, a series of classifiers and regression models were trained to detect anomalies and identify potential issues in power data. These models were trained using labeled data to learn the normal and abnormal patterns of the data, enabling real-time anomaly detection and issue identification when new data is received.
In unsupervised learning, clustering and anomaly detection algorithms were used for pattern discovery on power data. These algorithms automatically discovered hidden patterns and anomalies in the data, aiding in the identification of data quality issues and providing improvement recommendations.
Additionally, semi-supervised learning methods were employed, leveraging both labeled and unlabeled data to enhance model performance and generalization capability. These methods utilized large amounts of unlabeled data to further improve the training effectiveness of the models, thereby enhancing the accuracy and efficiency of power data quality assessment. The study collected power data of varying volumes and complexities and evaluated the system’s real-time performance and processing time when handling these data.
In this study, the capability and effectiveness of power data quality assessment and verification governance were assessed through a series of objective indicators. The assessment covered real-time performance, accuracy, and anomaly detection rate, and multiple methods were adopted to ensure the rigor and reliability of the assessment.
Figure 6. Assessment of real-time performance and processing time.
Figure 6 shows the assessment of real-time performance and processing time. The assessment examined how quickly the system responds to newly received data, so that the latest power information is obtained promptly, and how the system performs when processing data of different scales and complexities.
In Fig. 6a, different data volumes were used to measure the time required for the key stages of power data processing. For example, with a data volume of 300,000 data points, the data acquisition stage, in which raw data is collected from the actual power dataset, required 1 hour. The knowledge graph construction stage required 2 hours; it included extracting key knowledge in the field of power systems from literature and professional manuals using natural language processing techniques, and using knowledge graph technology for entity recognition and relationship modeling. The machine learning model application stage, in which machine learning algorithms perform real-time anomaly detection to catch quality issues during power data processing, required 1.5 hours. These measurements show how the required time is distributed across stages under different data volumes, indicate the efficiency and performance of the power data processing pipeline, and show how the system responds to different data scales, providing a reference for optimizing the process.
In Fig. 6b, the performance of the model over different periods was assessed through two indicators: accuracy and anomaly detection rate. The overall anomaly detection rate remained within the range of 1% to 10%, while the accuracy fluctuated between 80% and 98%. The anomaly detection rate remaining stably at a low level indicated that the model had relatively good sensitivity to anomalies in the data, and the accuracy range reflects the overall performance level of the model in real-time assessment of power data. Together, these results indicate that the method adapts to system changes in real time and can maintain efficient and stable operation even in rapidly changing situations.
Figure 7. Comparison of anomaly detection.
The anomaly detection rate is an important indicator of the governance effectiveness of the method and directly reflects its sensitivity to power data issues, as shown in Fig. 7. In the simulated anomaly injection experiment, Fig. 7a, Fig. 7b, and Fig. 7c correspond to injecting small, moderate, and large amounts of anomalies, respectively. The anomaly detection module of the method was used to identify these anomalies, and the detection results were compared with the anomalies actually injected. The results indicated that the method could maintain a high detection rate when dealing with various amounts of abnormal data. To verify the anomaly detection performance of the method in real scenarios, abnormal cases from actual power systems were collected and the method was applied to detect them, including sudden load fluctuations in the power system (Fig. 7d), power equipment failures (Fig. 7e), and power system network attacks (Fig. 7f). In each panel, the x-axis represents time, with each point marking when a period was measured. The left y-axis represents the amount of data (yellow line), and the right y-axis represents the anomaly detection rate (green line). The results indicated that the method also exhibited an excellent anomaly detection rate in real scenarios. This series of experiments demonstrates the high sensitivity of the method to abnormal data of different scales and types, making it well suited to the detection and governance of power data anomalies.
This study has made advances in the power system field, encompassing knowledge extraction, knowledge graph construction, data integration, and quality assessment. It has established effective methods for power system data quality assessment and verification governance. Leveraging deep learning and NLP technology, the approach produced a knowledge graph with high representation ability and precise relationship weights. Experimental results demonstrate its strong performance in real-time processing, accuracy, and anomaly detection, with high sensitivity to various abnormal situations. Despite these achievements, the stability of anomaly detection remains a concern in certain scenarios. Future research can focus on optimizing the graph updating process and enhancing the real-time performance of knowledge graphs.
