Feature Selection Before Propositionalization of Multi-Source Oil Drilling Data

Abstract

Despite recent improvements in collected drilling data quality and volume, the actual number of wells being used in studies remain low and are often limited to a single source and oil field, producing results that are prone to overfitting and are non-transferable.

In our study, we access oil drilling data from 5 of more than 20 oil drilling companies collected from 2005 to 2016 from our industrial partner to create well drilling duration models for well planning. This project could lead to the creation of more generalized models from larger datasets than others in literature. However, the data is difficult to process without expert knowledge, further complicated by properties such as unharmonized, source-locked, semantic heterogeneity, sparse and unlabelled. Conventional automated methods for feature selection, propositionalization, multi-source, or block-wise missing techniques could not be used.

In this paper, we describe our method to assist the Knowledge Discovery in Databases (KDD) Selection stage of the abovementioned data - Feature Selection before Propositionalization (FSbP) via Database Attribute Health Feature Reduction (DAHFR) and Report Feature Correlation Matrix (RFCM), collectively known as FvDR. DAHFR and RFCM are filter-type feature selection techniques that could measure relational missingness and keyword correlations respectively despite the complexity of multi-source oil drilling data. FvDR successfully reduced the scope from 700 tables containing 20,000 columns to 22 tables containing fewer than 707 columns while successfully selecting 13 of 16 relevant tables suggested by literature. Despite the loss of information from limitations of subsequent KDD procedures, preliminary models show promising results with over half the test predictions falling within the 20% error margin required for well planning. FvDR proves to be indispensable in KDD as a FSbP framework as it reduces features for examination and streamlines the research process necessary to understand business rules for data harmonization and propositionalization.

Keywords

Feature selection before propositionalization big data harmonization knowledge discovery in databases oil drilling

1. Introduction

Oil and gas exploration is a high risk, high rewards endeavour which requires a combination of specialized workforce and equipment to explore, plan and drill oil wells to extract crude oil and natural gas from oil fields. In recent years, oil drilling companies must contend with the high cost of oil drilling, where the cost of renting an offshore drilling alone could cost upwards of USD500,000 per day, the declining numbers of reservoirs which are easy to access while providing high yield, and the declined price of crude oil (during the time of the study). To economize well drilling operations, research is focused on methods to optimize the speed, safety of drilling processes and to provide benchmarks to measure the project’s progress.

Although mathematical models are still being created, most studies are now data-driven, taking advantage of recent drilling data from oil drilling companies which are steadily increasing in volume, scope, and quality with the advancement of data collection and storage technologies (originally used for reporting and progress tracking). The data-driven nature of these models leans their targeted usage to be close to where the data is collected, specifically as real-time predictive models during live drilling operations.

However, all the recent data-driven studies are limited in nature, with most available research papers using a subset of the available data collected from the drilling of one or a few local wells. Very few papers have access to data from an entire oilfield. From a machine learning perspective, the final model will be overfitted to that local oilfield, because patterns in oil drilling are defined by geological formations and the depth of formation transition.

In our study, we are grateful to our industrial partner, Independent Data Services for the access to oil well drilling data from 5 drilling companies and their oilfields. Independent Data Services provides data storage, reporting, and analytical services to more than 20 oil drilling companies. The data is collected over the duration of 10 years, from about 2005 to 2016 and provides an opportunity to create predictive models with more data than other available studies both in the number of wells and the number of available training features. The objective of the study is to create a phase duration prediction model from the data which could be used to aid well planners.

A defining problem in our KDD of this multi-sourced oil drilling data is the encountered in the Selection stage of KDD (Fig. 1). It is the nature of the available oil drilling data (relational, unharmonized, heterogenous, sparse, unlabelled (requires propositionalization), multi-sourced) as well as the unavailability of expert knowledge. This environment impedes manual or automated Selection and Transformation stages of KDD, of which Transformation must be performed before the data could be used for machine learning.

Figure 1.

The KDD Process.

We address this problem by performing Feature Selection before Propositionalization (FSbP) via Database Attribute Health Feature Reduction (DAHFR) and Report Features Correlation Matrix (RFCM), collectively known as FvDR. The reduction in project scope via FvDR is essential for a timely project completion, as the addition of each of the 20,000 possible features in 700 tables (per database) in 5 databases (sources) into the scope is an increase in time complexity – time is required to individually examine, study, and identify the feature’s viability, and then transform it.

This paper is divided into 6 chapters. In the Introduction, we have described the opportunity which motivates this project and briefly described the problems which come with this opportunity. In Background and Motivation, we further describe the motivation for the use of multi-source data as well as the difficulties of using multi-source oil drilling data. In Methodology, we describe the FvDR framework and model evaluation method used during our KDD of multi-sourced oil drilling data. FvDR uses the combination of two novel Filter Feature Selection (FS) methods, DAHFR Ting and Then (2022) and RFCM which can be used without the need to transform the relational data. DAHFR is a filter feature selection metric that measures the quality of data in the form of relational missingness, while RFCM produces a correlation chart of features to represent business rules based on business reports produced from the database. In Results, we reveal the feature scope reduction rate performed by FvDR in the form of % of features selected, visualization of FvDR data, the selected features compared against features recommended by literature and the resulting models’ performance. In Discussion and Future Work, and Conclusion, we discuss the efficacy of FvDR both as a FSbP process and as a research tool in the multi-sourced KDD pipeline, areas for further improvements both for FvDR and our KDD process and place our concluding remarks on this paper.

2. Background and Motivation

2.1. Using Multi-source Data From Multiple oil Fields to Improve Generalization

Recent data-centric literature in oil drilling have been taking advantage of improvements in data collection technology such as Mud Logging and Wireline Logging, which increases the volume, reliability, and frequency of data collection. However, these literatures tend to be limited to data from a low number of wells, often involving just one up to a hundred wells or oil fields while data from multiple oilfields or more wells (2000 wells for Brett and Millheim (1986)) are rare.

The use of low number of wells and oil fields is problematic. There are at least two factors that can cause overfitting or overfitting-like issues when using data from only one oil field.

Most importantly, formation characteristics (Anemangely et al., 2018) and the depth of transition of formations define the planned well dimensions and drilling plan. During normal drilling operations, the unreinforced section of the well is stabilized by the weight of the mud which must follow a Safe Mud Weight Window (Gholilou et al., 2017; Ma et al., 2019), which changes from formation to formation. These depths of transition define depths drilling must be stopped to install (Al Ramadan et al., 2019; Byrom, 2014) and cement in place (Ahmed et al., 2019; Soares et al., 2017) a casing or liner down to the new depth which serves to isolate and reinforce the well from the surrounding formation.

It is to be noted that there is another methodology, where casing drilling is performed (Mohammed et al., 2012; Steppe et al., 2005). In this process, the casing also serves as the drill string (the moving component transferring torque from the surface to the drill bit) and the drill bit is attached to the end of the casing.

The availability of formation characteristics and depths data makes the difference of speed and cost of exploratory/appraisal wells versus regular development/production wells. Exploratory/appraisal wells must be drilled slower and carefully with limited formation information to gather the information. Development/production wells have advanced knowledge of safe and unsafe depths from offset (nearby, any type) wells and could be operated at speed until the unsafe zone is reached.

A second factor which is affected by rigs and oil fields is crew efficiency (Adeleye et al., 2004; Amorim et al., 2019; Anemangely, et al., 2018; Brett & Millheim, 1986; Hossain, 2015; Kaiser, 2007; Lukawski et al., 2014; Whelehan & Thorogood, 1994) even though many papers have limited features available and must perform predictions without experience data or predict the Rate of Penetration (ROP) using only wireline data (Abtahi et al., 2011; Amar & Ibrahim, 2012; Hamrick, 2011; Huang et al., 2011; Paiaman & Moghadasi, 2009; Sultan & Al-Kaabi, 2002). Adeleye et al. (2004) talks about how experience and better information allows drilling performances to approach the Best Composite Time (best predicted time) of a drilling operation. Lukawski et al. (2014) and Amorim et al. (2019) tells of how experience from drilling offset (nearby) wells will improve decision making on the choices of drill bits to be used for drilling through the formation, better insight into the site’s lithology or simply eliminate past mistakes from being repeated.

By using data from wells of one oil field, data will be overfitted to the formation characteristics (and their drilling properties) and decisions made in the formations for that oil field and the experience of the crew operating in that oil field (which improves the more wells they drill in the same field). This leaves an issue in transferring this knowledge to a new oilfield, where the crews are different, and the depths of different formations and transitions are different.

2.2. Multi-source Oil Drilling Data is Difficult to Propositionalize

The use of multi-sourced oil drilling for KDD is a double-edged sword. While the usage data from multiple sources improves the generalization ability of the resulting model, the amount of work to prepare the data increments with each source, as work must be done to individually examine and study the sources, and them merge them into a harmonized dataset.

In this section subsections, we’ll be using LazyBum (Schouterden et al., 2019) as a baseline when discussing difficulties handling multi-sourced oil drilling data. LazyBum is a lazy variant of the One Button Machine, which automatically converts tables in relational databases into a propositional dataset, while handling one-to-many and many-to-many relationships without the need for expert knowledge.

2.2.1. Sparse Data Causes Block-Wise Missing Data in Multi-source Data

During preliminary work on the underlying data, the absence of an Entity Relationship (ER) model makes it difficult to compile relevant data. Manual and automated inference of the ER diagram makes joining possible, but several sparsity issues interfere with the process which causes rows of data to be dropped when intermediate tables turn out to be unused or barely used. These issues stemmed from various causes: some tables were filled mostly or exclusively with test data, some data were available only in random cases with whole rows missing at seemingly random rates, and some data were missing at the feature level, interfering with initial aggregation attempts.

LazyBum already handles this problem by automatically aggregating missing data as part of its propositionalization process, but the block-wise missing data created could not be dismissed. When the multiple sources are merged, it creates a dataset where entire groups of features are empty (depending on the source), a situation called block-wise missing data.

In literature, block-wise missing data is often encountered in heterogenous multi-source clinical data, where study, test, or departmental data are joined on the individual patients (Li et al., 2019; Xiang et al., 2014; Xue & Qu, 2021; Zhu et al., 2020). Block-wise missing data occurs when the patient data is not available in some of the sources, of which the sources represent entire feature sets being missing for groups of patients.

There are two main strategies of dealing with heterogenous multi-source data and its resulting block-wise missing data. The first strategy is to divide the main dataset into multiple subsets that are less afflicted by sparse data and then either impute them (Xue & Qu, 2021) or train multiple learning algorithms and rejoin the algorithms into one (Li et al., 2019; Xiang et al., 2014). Otherwise, the data is not divided, and imputation is performed anyway (Du et al., 2023; Zhou et al., 2021; Zhu et al., 2020), or machine working frameworks that works with block-wise missing data are used (Ghorpade-Aher & Sonkamble 2022).

However, examination of the characteristics of multi-source oil drilling data shows that it is closer to being homogeneous multi-source data than heterogeneous. Homogeneous multi-source data are data from different sources that are identical in format and attributes (Wu et al., 2017). They usually consist of sensor data or data from multiple studies of the same subjects, and objectives include the fusion of the data into a singular dataset for further analysis, which often requires harmonization. Difficulties during the fusion process tends to be attributed to differences in semantics, sensor settings, and entity identification. Missing data are barely mentioned in homogeneous multi-source data studies, and block-wise missing data are not mentioned at all (Li et al., 2017; Misiuk et al., 2020; Zhang et al., 2019; Zhang et al., 2022).

A key difference between oil drilling data versus heterogeneous or homogeneous multi-source data is the data distribution. Homogeneous studies aim to create datasets by unioning cases, while joining cases that represent the same entity. Heterogeneous studies perform the same, except with features instead of entities. Even though harmonization may be required for homogeneous or heterogeneous (Dinov et al., 2016; Kaiser et al., 2024) multi-source data, in the homogeneous or heterogeneous cases, some of the data are connected cross-source to provide a basis for comparison and missing data recovery.

Ultimately, block-wise missing oil drilling data could not be handled in the same manner as those encountered from other heterogeneous or homogeneous (non-existent) sources found in literature. Regular heterogeneous multi-source data in literature involve data taken from the whole population, so training or imputation could be made with the existing biases (confidently representing the global bias and distributions). However, because of the non-overlap of well/oil field data across sources and the unavailability of some data (specifically formation data) and the difficulty in comparing data across oilfields as mentioned in chapter 0, block-wise missing oil drilling values could be unrecoverable.

For for the lack of a better name, we differentiate the oil drilling data used in this study as ‘source-locked’ homogeneous multi-source data to represent its distinctiveness from other multi-source data and the need for different strategies and decisions in response to the resulting block-wise missing data.

2.2.2. Harmonization and feature engineering

Despite the availability of automatic propositionalization techniques like Lazybum, the variety of data that are available could cause issue. Plenty of the data require expert knowledge or additional techniques to convert it into usable data. For example, activity data is a treasure trove of information but requires natural language processing to extract further information and is an entire field on its own (AbouLaban et al., 2022; Castiñeira et al., 2018).

In this project, several propositionalization and harmonization techniques require expert consideration and research to properly transform the data to make it useful for machine learning, while keeping it comparable when merging multi-source data. To get the label (phase duration), irregular-length time-series activity data must be merged into phases, which must then be harmonized into standardized phase set (Figs. 2, 4a, 4b). This data must be joined with other time series data (aggregated daily, or irregular-length as well) and relational data. The bharun data must be harmonized as well to standardized categorization of bha types, then based on the bha type, decisions must be made on whether they should be joined to a given phase (Fig. 4d). Bit types are often uncategorized to the point that rules from expert knowledge and machine learning must be employed to fill in areas of confidence (Fig. 4c). Hole dimensions must be examined from multiple time-series data related to the operation due to sparsity (Fig. 3), to determine one given diameter of a particular hole or casing section. Many of the underlying data could not be propositionalized without expert knowledge.

Figure 2.

Number of Unique Activity (a) and Phase (b) Codes That Must be Harmonized Across Sources.

Figure 3.

Hole Sizes, Phase and Depth over Time for Operations in DB4. Each Legend Represents a Different Hole Size Data Source: report_daily.last_holesize (daily time-series), activity.hole_size (irregular time-series), bitrun.bit_diameter (once per bitrun), casing_section.casing_od & casing_section.casing_id (outer or inner diameter of casing, once per casing/liner run). Data is Compared to Identify the Singular Hole Size for any Given Phase (Drilling, Casing/Liner) (Conductor, Surface, Intermediate, Production).

Figure 4.

Various Data Processed. (a) Standardizing Phases (b) Merging Standardized Phases (c) Missing or Miscategorized Bit Type Problem (d) Standardizing Bha Types and Associating Them to Phases.

2.3. Feature Selection Before Propositionalization is Necessary, Hence FvDR

In 2.2, we establish that automated propositionalization could not perform all the tasks required to propositionalize and harmonize multi-source oil drilling data. Manual examination and use of expert knowledge is still preferred to perform this task, but the research and work required is massive for each additional feature added to the dataset.

FSbP could aid in prioritizing features that are relevant by performing feature selection before data compilation. There are a few papers that focuses on this topic. Motl and Kordík (2018) performed FSbP by using meta-learners to perform a filter-method selection of meta-features (unpropositionalized features). In their study, the meta-features are measured based on their relevancy to the task, feature redundancy, and runtime of feature calculation. Krogel and Wrobel (2002) on the other hand, used the wrapper method to perform feature-selection by relying on the available schema and label to automatically propositionalize the target feature to the label to determine the feature’s relevance.

Both papers rely on the capability to relate meta-features to the label to measure its relevance to the task. However, as discussed in 2.2, the labels are not available at the early stages of examination and the irregular time-series, need-to-harmonize and multi-source nature of the data makes it difficult to perform the comparison both on a per-source basis and on a multi-source basis. FvDR aims to perform FSbP without directly observing the data.

3. Methodology

FvDR is performed in four steps as seen in Fig. 5 as part of the KDD of Multi-source Oil Drilling Data:

Figure 5.

The FvDR Process for Multi-Sourced Oil Drilling Data.

(1)

Database Attribute Health Feature Reduction (DAHFR) (Ting & Then, 2022) removes unused or extremely sparse multi-source database features from the project scope.

(2)

A report Feature Correlation Matrix (RFCM) is created from business reports generated by the system. Feature Correlation data from RFCM serves as a representation of expert knowledge to identify concepts that are relevant to a target machine learning objective.

(3)

The keywords from RFCM are related to database features found in DAHFR.

(4)

Relevant database features are identified by tracing RFCM keywords that are relevant to the target label.

These steps and the evaluation method are described further in the following subchapters.

3.1. Step 1: Database Attribute Health Feature Reduction (DAHFR) (Ting & Then, 2022)

DAHFR is a Filter type Feature Selection metric we created which identifies which tables and columns ‘contain enough data’ which in our multi-sourced data project, is also used to perform the comparison across multiple databases. Instead of examining the table-rows for missing data, DAHFR uses ‘relational missingness’ (Fig. 6) as a metric to estimate how many % of the data will missing for the given feature or table after data has been propositionalized. The following are the steps to perform DAHFR.

Figure 6.

DAHFR Rationale for Relational Missingness.

3.1.1. Determine a Baseline Feature

A baseline feature is selected for measuring the DAH score. The feature is selected based on the following characteristics:

(1)
Is a Foreign Key: For easier cross-table comparisons, while ensuring that the values being compared are unique.
(2)
Relatability: It must be relatable to as many tables as possible.
(3)
Granularity: It must contain a high number of unique values to be able to obtain a final resolution of the DAH score.
(4)
Meaningfulness: The definition of the baseline feature should be significant and easily interpretable, eg: x% of baseline feature y are represented in feature z

3.1.2. Measure DAH

For each table, its relationship to the target label is traced to baseline feature. The DAH (consisting of H_t (1) and H_a (2)) are measured.

Table H e a l t h H_{t} = \frac{O_{t} \cap O_{v}}{O_{v}}

(1)

Attribute Health H_{a} = \frac{C_{o}}{O_{v}} / H_{t}

(2)

O_t = count of distinct, valid baseline uids represented by the table.

O_v = count of valid baseline uids in the baseline table.

C_o = count of valid baseline uids represented by the column. A baseline uid is represented when at least one row holding the baseline UID contains data in the column.

Relational missing % = 1 – Health

3.1.3. Perform Feature Reduction

Features with poor scores are reduced using the threshold specified in Table 1. This threshold is based on the user’s decision. In this project, several factors were considered. Firstly, the actual missing rate after propositionalization will higher than the DAH due to the ‘one row or more = 1 weight’ rule. Second, a low table health threshold is employed because MNAR (Missing Not at Random) data can significantly skew results. Examples of observed tables that are significant but have low table health, tables that record events (rare, causes delays or even casualties) and drilling data (a source contains many operations but only 45% of the operations are drilling operations).

Table 1.
DAHFR Threshold Used in a Single-Source and Multi-Source for This Study.

Attribute Health H_a Table Health H_t

Single Database Threshold > = 60% > 5%

Multi-Sourced Harmonization Threshold > 60% databases passes Single Database Threshold

	Attribute Health H_a	Table Health H_t
Single Database Threshold	> = 60%	> 5%
Multi-Sourced Harmonization Threshold	> 60% databases passes Single Database Threshold

Although skipped in this study for the sake of reduced complexity (due to complexity in dealing with missing data at a source-locked multi-source level) there are further considerations in the realm of missing data. A low Attribute Health or Missing Rate is a complex issue in missingness. (Poor) Missing rates does not inform the user on how well the missing data could be recovered from the available observable data (Lang & Little, 2018). A consensus is that Fraction of Missing Information (FMI) is a better metric than missing rate, as it quantifies the amount of data that is unrecoverable from non-response (Chen & Savalei, 2021; Chen et al., 2023; Lang & Little, 2018; Madley-Dowd, et al., 2019). However, for thoroughness, some lessons or advice from literature on the appropriate missing rate thresholds could be found in Table 2.

Table 2.

Lessons and Missing Rate Thresholds in Literature.

Source	Missing Rate Thresholds Lessons or Recommendations
Graham and Schafer (1999)	Multiple Imputation (MI) could be performed with 50% missing rate even with samples as low as N = 50
Collins et al. (2001)	MI and Full Information Maximum Likelihood (FIML) abilities are indistinguishable at 25% and 50% missing rates with acceptably low bias
Graham (2009); (Nakagawa, 2015)	Complete Case Analysis (CCA) could be performed for <5% missing rates, while MI must be performed for higher missing rates
Lodder (2013)	When testing for missing rates (10%, 20%, 30%, 40%), MI is very effective, and its effectiveness is not affected by the missing rate or missing data mechanisms (MCAR, MAR, MNAR).
Seijo-Pardo et al. (2019)	Recommends caution when handling data with missing rates >80%
Madley-Dowd et al. (2019)	MI is beneficial and can reduce bias and improve efficiency for analysis of MAR data at any proportion of missingness (rates between 10% and 90%) provided sufficient relevant auxiliary variables are available. Missing rate thresholds vary in literature, with thresholds varying between 5% and 40%, while studies that include 50% missing rates and above are limited in their methods.

3.2. Step 2: Report Feature Correlation Matrix (RFCM)

The objective of Knowledge Discovery in Databases (KDD) is to transform secondary data (information collected by others for different purposes and then utilized in a study) into actionable knowledge. This process is particularly relevant for business-oriented databases, where the primary goal of collecting data is often to provide key stakeholders with insights into the business’s status through reports.

This implies that within the context of KDD, each database has an associated set of reports generated from its data. We can target these reports as a rich source for extracting correlation data, operating under the premise that the features used to generate these reports are either correlated, or thought to correlate to other features within the report. These reports, created by experts could therefore mirror some of their business knowledge in the form of the correlation data.

Given this framework, the process of extracting correlation data from reports presents a scalable and repeatable opportunity across different databases. This is because the generation of reports from data is a common practice across various databases, making this approach widely applicable and potentially valuable for gaining insights across multiple domains.

These are the steps used to perform RFCM. For this project, only the reports of DB4 (Source #4) are examined due to the time constraint. The schemas and reports are expected to be generally the same, although it is known that custom reports have been requested (generated manually) which are not available through the reporting website.

3.2.1. Examine System Reports and Gather Characteristics into Report Features Index (RFI)

Distinct business reports are generated by accessing test or staging websites for the system, ensuring the live data and websites are unaffected.

For each report, we identify several key components, as illustrated in Fig. 7. The Scope of the report refers to the inputs used to filter data during report generation. The Group-by parameters are the criteria by which the data is organized, both during the report generation process and in the final report. Additionally, the report Features and Keywords include the keywords and known database features that appear in both the inputs and the resulting report. These components are used to form a Report Features Index.

Figure 7.

Sample Oil Drilling Reports and Keywords Gathered from DB4 Reports.

3.2.2. Refine RFI Keywords Based on Business Terminologies

The keywords identified in the Report Features Index are searched for online and then grouped manually. Synonymous keywords are combined. When keywords fall under the scope of a more significant keyword, they are typically merged into this larger keyword, except in cases where the smaller keyword frequently appears in reports.

3.2.3. Convert the RFI into a Report Features Correlation Matrix (RFCM)

The RFI is converted into the RFCM, where the RFCM compares features based on how many reports they have co-occurred in. The RFCM provides an overview of the distribution and prevalence of features while providing the viewer with an analogy of business rules in the form of ‘feature closeness’.

3.3. Step 3: Linking DAHFR to RFCM

Using the knowledge gained during the RFCM process, the RFCM keywords (representing the business domain) are related to features (by column names or table names) from filtered from DAHFR as shown in Table 3. It is expected that the RFCM keywords may be related to several tables/columns, as RFCM keywords may be generated either directly from, or calculated from DAHFR features.

Table 3.
Examples of RFCM Keyword and DAHFR Feature Pairings.

RFCM Keyword DAHFR Keyword Database Table.Column

Country Country well.country, rushmore_wells.country, lesson_ticket.country

Formation Formation rushmore_wells.deepest_formation, rushmore_wells.formation_ evaluation, formation., ops_datum.

Wellbore Wellbore wellbore.*

Rig Rig rig_information.*

Rig Average Move Speed Rig rig_information.*

Survey Survey survey_reference., survey_station., target_zone.*

RFCM Keyword	DAHFR Keyword	Database Table.Column
Country	Country	well.country, rushmore_wells.country, lesson_ticket.country
Formation	Formation	rushmore_wells.deepest_formation, rushmore_wells.formation_ evaluation, formation., ops_datum.
Wellbore	Wellbore	wellbore.*
Rig	Rig	rig_information.*
Rig Average Move Speed	Rig	rig_information.*
Survey	Survey	survey_reference., survey_station., target_zone.*

3.4. Step 4: Select Relevant Features

The relevant features from the database are gathered via the following steps.

(1)
Identify the label keyword (the KDD/ML objective) from the RFCM.
(2)
Use the RFCM to search for keywords relevant to the label keyword.
(3)
Using the links created in Step 3 to identify the label features, and relevant features from the DAHFR (database domain).

3.5. Step 5: KDD, Validation and Evaluation

Steps 1 through 4 constitute the procedure for FvDR. In this paper, we use two methods to evaluate FvDR. The first method consists of the comparison of FvDR’s selection result, either as keywords or as features, versus features recommended by literature. The second method consists of running the entire KDD process, which includes manual selection, transformation (propositionalization and harmonization) and machine learning. The machine learning models are used to assess the FvDR’s effectiveness, which is the common method to compare the effectiveness of feature selection methods.

4. Results

4.1. FvDR

During FvDR, the scope of features to observe is reduced from columns in 700 tables to 30 tables as seen in Table 4. This is a reduction of the scope to 4.3% tables of the original. DAHFR provides a significant reduction due to the very sparse nature of the multi-source oil drilling data. After the reduction, the DAH scores serve as an intuitive cheat sheet (Tables 5, 9) to allow the user to quickly identify relevant features or features of interest to examine, while also preventing potential MNAR data from skewing the result by allowing the user to focus on the Column/Attribute Health during comparison.

Table 4.
Reduction of features during the FvDR process.

FvDR

RAW Features DAHFR RFCM Linking DAHFR AND RFCM Feature Selection

700 tables 37 tables Processed 125 reports Linked 48 keywords to 50 tables Target Label/Keyword: Duration

20,000 columns 707 columns Identified 48 keywords 21/48 relevant keywords

22/50 database tables

FvDR
700 tables	37 tables	Processed 125 reports	Linked 48 keywords to 50 tables	Target Label/Keyword: Duration
20,000 columns	707 columns	Identified 48 keywords		21/48 relevant keywords
				22/50 database tables

RFCM (Fig. 8) serves as a repeatable source of business knowledge to find keywords that are relevant to any target prediction objective. The process of gathering and merging keywords also serves as a scope-limiter, as it provides the user with a list of keywords relevant to the database to study in the expert-knowledge gathering process without needing to search too deep or wide until required (extensive expert knowledge is required for propositionalization and harmonization).

Figure 8.

RFCM Produced From the RFI Comparing Co-Occurrence Counts of Keywords.

4.2. KDD

The rest of KDD was performed after FvDR. Of the 30 tables recommended by FvDR (Table 5), 8 were selected manually for Propositionalization and Harmonization. This is because of the difficulty in transforming the features, which is complex and time-consuming.

Table 5.
Visualization of DAH Scores for Activity (a) and Bharun (b) tables across 5 sources (db1-db5). A better score (green) suggests a lower relational missingness.

After KDD Transformation, the dataset (consisting of parameters in Table 6) was used for ML processes to predict the duration of oil drilling phases. The problem is treated as a regression problem with the use of RMSE to identify the best performing models.

Table 6.

Features/Parameter Available After Harmonization and Propositionalization.

Two rounds of ML were performed. The first ML involved the use of Deep Learning which was manually finetuned over many iterations to create the best model, but it did not perform better than the baseline model, which was the Random Forest Regressor. In the second round, the dataset was split in a ratio of 70/30 operations into a training set and test set accordingly. Although the prediction is to be made for the duration of phases (one operation has multiple phases), the separation prevents the contamination of data from the same operation, which is not available during well planning. The separation also makes it easier to create DvDs for each operation which are entirely training-validation or test results.

Multiple combinations of ML preprocessing are created and tested with three model types (Random Forest Regressor, Linear Regression, and Multi Layer Perceptron) with the model configuration left at default settings. Although not used for evaluation of FvDR due to the problem discussed in chapter 5.3.3, feature selection data were collected to identify the relevance of each parameter that were created in KDD Transformation (Table 7). This data provides an outlook whether certain data are more useful than others, and whether feature engineering using expert knowledge (conversion of some features into others) is worth the resources spent. For example, survey data, which are coordinates along the drilled oil well was converted to represent the complexity of the hole (survey_station_count, max_turn_rate_degrees_per_m, and max_inclination_degrees) and are considered important features for prediction by the FS algorithms (Table 7).

Table 7.

Top Performing Features Identified by FS Techniques After Harmonization and Propositionalization. Average Score Refers to the Average Score Across Model Combinations During Training.

Rank	Feature Importance (Decision Tree)		K-Best (F Regression)		K-Best (Mutual Info Regression)
	Feature Name	Average Score	Feature Name	Average Score	Feature Name	Average Score
1	bha_count	0.3949	bha_count	777.11	max_depth_md_msl	0.3296
2	flag_bit_type_pdc	0.1155	flag_bit_type_pdc	538.56	fixed_end_depth_md_msl	0.3238
3	flag_bop	0.0879	max_depth_md_msl	179.39	end_depth_md_msl	0.3044
4	water_depth	0.0796	fixed_end_depth_md_msl	179.11	hole_section_end_depth	0.2980
5	field	0.0341	hole_section_end_depth	177.70	bha_count	0.2947
6	survey_station_count	0.0311	end_depth_md_msl	162.60	survey_station_count	0.2645
7	max_inclination_degrees	0.0276	flag_bop	153.51	fixed_holesize	0.1987
8	max_turnrate_degrees_per_m	0.0211	fixed_holesize	83.99	max_inclination_degrees	0.1981
9	hole_section_end_depth	0.0204	phase_code_IH	82.60	bhatype_rotary_steerable	0.1908
10	fixed_end_depth_md_msl	0.0201	survey_station_count	63.25	hole_section_start_depth	0.1836
11	phase_number	0.0176	flag_bit_type_rc	58.69	prev_casing_depth	0.1828
12	max_depth_md_msl	0.0170	bhatype_rotary_steerable	52.12	start_depth_md_msl	0.1804
13	bhatype_rotary_steerable	0.0142	phase_code_PH	48.13	min_depth_md_msl	0.1803
14	prev_casing_depth	0.0138	max_inclination_degrees	47.80	fixed_start_depth_md_msl	0.1802
15	bottom_of_well_for_operation_md_msl	0.0118	prev_casing_depth	47.58	max_turnrate_degrees_per_m	0.1753
16	flag_bit_type_rc	0.0084	phase_code_CH	35.01	bhatype_unknown	0.1644
17	hole_section_start_depth	0.0083	phase_code_CC	26.37	flag_bit_type_pdc	0.1507
18	coord_long	0.0078	bhatype_whipstock	26.05	flag_bop	0.1359
19	end_depth_md_msl	0.0068	max_turnrate_degrees_per_m	24.09	flag_bit_type_rc	0.1186
20	bhatype_unknown	0.0066	hole_section_start_depth	23.92	phase_code_CH	0.0981
21	bhatype_milling	0.0062	bhatype_unknown	23.11	phase_code_IH	0.0720
22	paxmax	0.0061	flag_bit_type_hybrid	18.56	bottom_of_well_for_operation_md_msl	0.0683
23	field_REDACTED	0.0057	bhatype_milling	18.19	phase_code_CC	0.0681
24	top_of_well_md_msl	0.0055	field_REDACTED	18.02	phase_number	0.0547
25	flag_bit_type_core	0.0055	phase_number	15.51	planned_td_md_msl	0.0520
26	wellbore_final_purpose_PROD	0.0053	start_depth_md_msl	13.35	water_depth	0.0506
27	field_REDACTED	0.0050	min_depth_md_msl	13.35	coord_lat	0.0402
28	field_REDACTED	0.0044	fixed_start_depth_md_msl	13.35	top_of_well_md_msl	0.0393
29	coord_lat	0.0043	bhatype_coring	13.10	phase_code_PH	0.0392
30	min_depth_md_msl	0.0032	bottom_of_well_for_operation_md_msl	12.40	coord_long	0.0365

Ultimately, among learning algorithms available, Random Forest Regressor outperformed the rest of the models in both ML rounds, in both the RMSE score and the ‘larger than 0’ score (Table 8). The results of this best model are displayed and discussed in Chapter 4.2.2.

Table 8.

Top 5 Scoring Configuration for Each Model Type, as Well as Their Ranking in Accuracy (>0 for % Positive Values, RMSE).

						>0		RMSE
Rank	Model ID	Imputaton Method	Duration Scale	Feature Selection	Model	Train	Test	Train	Test
1	1_1_3_1_rfr	DEFAULT_0_BEFORE_ENCODE	Scale to 0-1	None	rfr	100	100	38.29	84.14
2	1_1_4_5_rfr	DEFAULT_0_BEFORE_ENCODE	Standardization	K-Best(Mutual Info Regression)	rfr	100	100	37.03	84.56
3	2_1_2_5_rfr	DEFAULT_0_AFTER_ENCODE	Hour	K-Best(Mutual Info Regression)	rfr	100	100	38.14	84.64
4	1_1_1_1_rfr	DEFAULT_0_BEFORE_ENCODE	No Change (Seconds)	None	rfr	100	100	37.55	84.71
5	2_1_4_5_rfr	DEFAULT_0_AFTER_ENCODE	Standardization	K-Best(Mutual Info Regression)	rfr	100	100	37.66	84.73
65	3_1_1_5_lr	DOUBLE_KNN	No Change (Seconds)	K-Best(Mutual Info Regression)	lr	87.8	85.96	93.73	92.1
66	4_1_4_2_lr	TRIPLE_KNN	Standardization	Feature Importance (Decision Tree)	lr	87.98	84.27	93.76	92.41
67	4_1_2_5_lr	TRIPLE_KNN	Hour	K-Best(Mutual Info Regression)	lr	87.52	84.55	94.04	92.43
68	3_1_4_5_lr	DOUBLE_KNN	Standardization	K-Best(Mutual Info Regression)	lr	87.8	86.24	93.47	92.71
69	4_1_2_2_lr	TRIPLE_KNN	Hour	Feature Importance (Decision Tree)	lr	88.07	85.67	93.26	92.76
113	4_1_1_5_mlp	TRIPLE_KNN	No Change (Seconds)	K-Best(Mutual Info Regression)	mlp	100	99.72	126.93	102.72
114	3_1_1_5_mlp	DOUBLE_KNN	No Change (Seconds)	K-Best(Mutual Info Regression)	mlp	100	100	127.88	102.99
115	4_1_1_2_mlp	TRIPLE_KNN	No Change (Seconds)	Feature Importance (Decision Tree)	mlp	100	100	128.97	103.31
116	4_1_1_1_mlp	TRIPLE_KNN	No Change (Seconds)	None	mlp	100	100	128.16	103.53
117	3_1_1_1_mlp	DOUBLE_KNN	No Change (Seconds)	None	mlp	100	100	128.18	103.58

4.2.1. FvDR Selection vs Literature Review

The first evaluation was done by comparing the FvDR’s selected tables against the tables suggested by literature (Table 9) that are relevant to the prediction of Rate of Penetration (a form of duration) and cost of drilling, which is directly affected by the drilling duration. Despite a reduction of scope from 700 tables to 30 tables in a multi-source environment, FvDR was able to identify 13 of the 16 tables recommended by literature. The following are the tables missed and explanations where only 1 of the tables missed is considered a failure in RFCM/FvDR.

(1)
Experience: Does not exist in the scope of RFCM nor the database, though it may be feature-engineered from other features during KDD Transformation.
(2)
Formation: Is not associated to the keyword ‘duration’ which was used for this projects FvDR. However, examining the RFCM reveals that ‘Formation’ is associated to RoP, which is the label used by literature.
(3)
Weather: Is found in only one report, ‘Weather/Marine Breakdown’, which does not show any duration. This is considered a failure of RFCM/FvDR.

Table 9.
Comparison of tables recommended by FvDR, Manual Selection, and Literature. YES refers to recommended tables/keywords, while X refers to tables or keywords that are unused with a reason. The literatures used are Anemangely et al. (2018); Ashrafi et al. (2019); Brenjkar et al. (2021); Hossain (2015); Kaiser (2007); Mehrad et al. (2020); Nzeda et al. (2014); Veeningen et al. (2009) which predicts either the cost of drilling or ROP.

4.2.2. Predictive Modeling Results

Despite using only 8 tables of the 30 suggested by FvDR, the best performing model was able to estimate convincing DvDs that mimic the true DvDs of the operations (Fig. 9). To get a better estimate of the model’s viability, we use threshold of ±20% DvD duration error to be considered a successful prediction. The threshold is based on Hossain (2015) who says that operations should not exceed initial budgets by 20% and with the assumption that drilling cost is linearly related to the drilling duration. Based on the observation of training and testing cases, the accuracy increases as the length of the operations (and number of phases) and the number of phases in the operation and/or the duration of the phases increases (Figs. 10, 11). This is expected because the models trained were trained to minimize RMSE instead of percentage error.

Figure 9.

Sample Model Predictions of Well Drilling Duration.

Figure 10.

Cumulative Count Comparing Successful Predictions vs Length of Actual Operation. Failure Rates were Expected to be Higher for Shorter Duration Operations due to Use of RMSE, but the Chart Shows the Distribution to be Relatively Stable.

Figure 11.

Prediction Accuracy of Operation Duration Prediction Model. Dotted Lines Represent the ±20% Range Used to Determine the Accuracy.

A total of 53% of test operation predictions falls within this threshold (Table 10) which shows some level of prediction success despite the limited number of features used to create the dataset.

Table 10.

Passing Rate of the Best Model According to the Success Criteria Related to Well Planning (‘Total Cost of Well Should not Exceed 20% of Planned Cost’) with the Reasoning That Well Cost is Directly Related to Drilling Time.

	Training			Test
	Yes	No	% pass	Yes	No	% pass
Actual <= (Predicted * 120%)	151	3	98.05%	43	8	84.31%
(Predicted * 80%) <= Actual <= (Predicted * 120%)	131	23	85.06%	27	24	52.94%

5. Discussion and future work

5.1. Limitations

During this project, several limitations are acknowledged:

(1)
Limited attempts at automated propositionalization were performed. It is possible that with more time and resource spent in building a framework of automated propositionalization, then harmonization, etc, that a better predictive model could be created without FvDR.
(2)
Very few of the features selected by FvDR were used to create the dataset, due to the work required to add each additional feature as seen in background.
(3)
In the same vein, the full effectiveness of FvDR is difficult to determine, given that FvDR’s effectiveness is entirely dependent on the predictive model, which is created after the data is processed several times over in the KDD process causing several stages of potential loss in information.
(4)
The evaluation of the model’s ‘accuracy’ of RMSE is also non-ideal, as the prediction of any duration model is a probabilistic problem. Even in the oil and gas industry, duration prediction in well planning is done with the estimation of p10, p60 and p90 values (Coelho et al., 2005; Løberg et al., 2008) or probability distribution graphs (Adams et al., 2010; Codling & Leatherby, 2013; Moeinikia et al., 2014) for each phase, which are then used for monte carlo simulation to create probabilistic duration estimates.

5.2. Efficacy of FvDR

FvDR has been instrumental as an entry point for the observation of multi-sourced oil drilling data. By reducing the scope of features to observe (from 700 to 30 tables), FvDR allows the database to be more manageable for the non-expert user to understand and study. In the case of multi-source oil drilling data where the features are often very sparse, DAHFR is important to identify features with ‘enough data’ for the user to use in follow-up decision making processes while providing a quick comparison across sources. RFCM also serves as a guide, not just by selecting the relevant keyword but for providing a scope of keywords to study, ensuring the non-expert does not spend effort in researching outside the scope.

As a FSbP process, FvDR must not only reduce the scope, but avoid removing relevant features in the process, while also not having the target label immediately available (it was feature-engineered during KDD Transformation). Preliminary results are promising. As observed in the Results, despite selecting only 4.3% of the possible tables, FvDR successfully selected 13 of the 16 tables suggested by literature, with only 1 of the missed tables being unexplainable.

As for the final predictive model’s accuracy, despite the limitations and loss of information in the rest of the KDD process, the final model was able to achieve a 52.94% success rate.

5.3. Future Work

5.3.1. DAHFR

DAHFR was created due to necessity when severe sparsity makes it difficult to identify which tables and features are worth examining in the database. With current work, DAH is already employed with limited success to identify less sparse data while hinting at the existence of MNAR data, as opposed to directly having MNAR data be the cause of entire features being dropped.

However, as discussed in chapter 3.1.3, missing data is a complex topic, but here are some statements and questions as food for thought.

MI is the best option for most cases, regardless of the threshold, MAR and MNAR debacle. While missing rate as a threshold is less relevant for MI (MI could still return acceptable results), the result of MI could be improved with the addition of expert knowledge and auxiliary data. DAHFR was initially created as a necessity, a method to view the extent of the sparsity in the database in the case where too little knowledge is available in the database.

Could DAHFR be modified to behave more like FMI (the better alternative to missing rate) without directly observing the data (which is a requisite for DAHFR)? Could DAHFR be used to better identify MNAR? From a single-source point of view, MNAR data could or could not be a problem when the availability of auxiliary data is not known (due to the unpropositionalized state), so should a threshold still be implemented?

When dealing with source-locked homogeneous multi-source data, the author asserted that the nature of the drilling data prevents the use of data from other sources from informing the imputation method. But to what extent is this correct, at various levels of ‘block-wise missing’ (eg: completely missing vs 95% missing) at source levels? There is room to study this in the future with the addition of more sources - only db4 was completely propositionalized and harmonized in this study so comparisons can’t be made.

5.3.2. RFCM

RFCM as a component of FvDR has found surprising success in identifying targeted, relevant feature groups (tables) when compared to features suggested in literature. At current levels, RFCM could only show correlation, but not the intensity of correlation between features and it is hoped that the improvements in this process could lead to better RFCM methods in the future.

By improving the correlation estimation and giving directions to the correlation, we hypothesize that RFCM could also be used to not only identify correlated data, but to outright estimate features that that predict the labels by first converting RFCM into a graph structure then searching for the Markov Blanket (Gao et al., 2015) of the target label.

Another area for improvement could be the automation of RFCM processes. Three areas could be automated in RFCM. First is the gathering of keywords, which must be done during the form-filling (to generate the report) and the resulting report, which could be done either by web and document crawling, or by image processing. Second is the grouping of keywords of similar nature. Third is the linking of these keywords to database (by identifying relevant keywords associated to the main keyword). It is possible with the increasing potency of Large Language Models, that the second and third step could be automated.

5.3.3. How do you compare FSbP versus other FSbP or FS techniques

In this project, the comparison in performance can’t be made because of the difficulty in processing features in the pipeline (which made FvDR necessary at the first place). Comparison in this project is limited to comparison of features selected vs those suggested in literature as well as the resulting model on its own. Future studies with data that are easier to process in its entirety is necessary for further understanding of FvDR’s capabilities as a FSbP technique, as well as its effect and competitiveness on FSbP/FS techniques.

With that, we suggest the following methodology for optimal comparisons. For this discussion, we identify Features before Propositionalization (FbP, eg: relational database) and Features after Propositionalization (FaP, eg: dataset). The following are their characteristics. FbP is converted to FaP during KDD Transformation, therefore FaP features could be traced back in a many-to-many relationship to their corresponding FbP. FSbP is made to process FbP, while regular FS algorithms (associated with machine learning) are compatible only with FaP.

When comparing multiple FSbPs, FaP could be created based on a union of the FbPs selected by FSbPs, after which the FaP dataset is split based on their corresponding FbP (Selected by the FSbP). These split FaPs datasets could be used to train models, and the accuracy of the resulting models could be used to compare the efficacy of the FSbPs.

However, when attempting to compare FSbPs versus FS techniques, all the FbP must be transformed into FaP. This is to ensure FSs has access the same features (FaP) that were observed the FSbPs. By following the same FaP dataset splitting based on the selected FaP or FbP, models could be created based on the FSbP or FS selections for comparison. Comparing FSbPs versus FS is unfeasible for oil drilling data due to the scope of data that must be processed to represent ‘all FbP’.

6. Conclusion

Source-locked multi-sourced oil drilling data promises a larger, better generalized dataset for oil drilling data analysis but comes with a host of complexities which negates the effectiveness of conventional automated methods in data analysis.

The FvDR framework discussed in this paper was created as a response to be performed as an aid during the KDD Selection process to reduce the scope of data processing and analysis, which is heavily dependent on manual work especially due to unavailability expert knowledge. FvDR successfully performs the FSbP (automated feature selection of relational data) functions by removing less relevant features while preserving features relevant to the learning objective in this multi-source environment and could easily be used in single-source environments too. In addition to the feature selection aspect, FvDR has the additional benefits of guiding the expert/knowledge gathering process for the user, minimizing searching for unnecessary resources.

FvDR’s success is measured by comparing the selected features against features recommended in literature, as well as based on the performance of the preliminary KDD model. The preliminary model’s results are promising despite the limitations of the KDD Transformation stage (not a fault of FvDR). The KDD Transformation stage used only a small fraction of the features recommended by FvDR and literature, and additionally due to complications during the project, only one source was ultimately processed.

The two algorithms used in FvDR measures the correlation and sparsity of the features, but there is still room for improvement. DAHFR (relational missingness) could be improved with further research and experimentation into missing data (missing rate, FMI) and source-locked block-wise missing concepts and behaviours. RFCM is rudimentary but has shown success in estimating feature relevance and could potentially be improved, one direction being with the introduction of the Markov Blanket concept.

There is also a need to properly measure FsBPs (FvDR being a FSbP framework)’s effectiveness. Both FsBP and FS are measured by the resulting model’s effectiveness, but the features examined by either process are vastly different, with significant levels of data processing potentially required to make a true comparison a reality.

Footnotes

Acknowledgments

This research was supported by Independent Data Services (IDS) who provided the research grant and monthly stipend for the project as well as Yayasan Sarawak whom through the Sarawak Foundation Tun Taib Scholarship (BYSTT) provided support for the tuition fee and monthly stipend.

Author Biographies

Clement Ting is PhD student in Swinburne University of Technology Sarawak Campus with a background in robotics and computer science. His interest in research is on Artificial Intelligence and Data Science.

Professor Patrick Then is the head of School of Information and Communications Technology in Swinburne University of Technology Sarawak Campus and Health Informatics cluster at Swinburne Melbourne Data Science. His interests in research include Digital Health, Artificial Intelligence and Data Science.

Dr Man Fai Lau is a senior lecturer in Swinburne University of Technology. His current research interests are on Data Mining and Testing Prediction Models.

Appendix

Appendix A. Abbreviations and Terminologies.

Concept	Abbreviations/Terminologies	Meanings
FvDR	KDD	Knowledge Discovery in Databases
	KDD Selection	The Selection stage of KDD: Feature selection which is performed after data collection but before it is compiled
	FSbP	Feature Selection before Propositionalization: Performed during KDD Selection, differentiated by FSbP being an automated process done withut expert knowledge
	DAHFR	Database Attribute Health Feature Reduction: A filter feature selection process that removes sparse features based on DAH
	DAH Score	Database Attribute Health Score: A relational missingness metric consisting of Table Health and Attribute/Feature/Column Health
	H _a	Attribute Health: Aka Feature Health/Column Health
	H _t	Table Health
	RFCM	Report Feature Correlation Matrix: A filter feature selection process that converts business reports into a Report Feature Correlation Matrix (RFCM, same name) which represents correlation data for selecting features that are relevant to a target label
	RFI	Report Features Index: The output from initial reports examination that is later converted into the RFCM
	FvDR	FSbP via DAHFR and RFCM
	FbP	Features before Propositionalization: Representing features (Eg: database columns, or keywords) before they are propositionalized. Used in the context of comparing FsBP algorithms agaisnt regular FS algorithms. FSbP can be performed with FbP
	FaP	Features before Propositionalization: Representing features (Eg: attributes in a dataset/training set) after they are propositionalized. Used in the context of comparing FsBP algorithms agaisnt regular FS algorithms. FS can be performed with FaP
Missing Data	Missing Rate	Missing Rate: A percentage value representing the ratio of rows/cases with missing data, either for a specific feature/attribute or overall.
	MCAR	Missing Completely At Random: A special case of MAR where the missing rate at the feature level is unaffected by values of other features
	MAR	Missing At Random: Missing data where the Missing Rate is affected by values of observed data
	MNAR	Missing Not At Random: Missing data where the Missing Rate is affected by unobserved data. Eg: the data might be missing because of the actual value of the data
	MI	Multiple Imputation
	FIML	Full Information Maximum Likelihood
	FMI	Fraction of Missing Information: An alternative to Missing Rate that quantifies the amount of data that is unrecoverable from non-response
Oil Drilling	Source/Database/Company	The oil drilling company that produces the oil drilling data during oil drilling
	Oil Field	An area. May contain oil and gas reserviors. A source handles one or more oil field.
	(Oil) Well	Wells are drilled in oil fields to collect data or to access reservoirs.
	ROP/RoP	Rate of Penetration: A measure of rate of progress. Is aggregated at a Well level, Hole level, Bitrun level, or even the current ROP (in a d/dx manner), as it represents the rate at which the well is being drilled and can be found in the units feet/hr or m/hr
	PT	Productive Time
	NPT	Non-Productive Time
	DvD	Depth versus Day (Chart): Chart that represents the progress of well drillling over the duration of the drilling project.
Databases	ER	Entity Relationship (Model/Diagram): Representation of how concepts or tables are linked
Data Analysis	ML	Machine Learning
	FS	Feature Selection
	RMSE	Root Mean Square Error: A metric used in detemining the accuracy of a regression model. A smaller value is preferred.

References

AbouLaban

AlAwami

Herve

Amrite

(2022). Correcting classified activities with natural language processing. International Petroleum Technology Conference, D021S042R002. doi: https://doi.org/10.2523/IPTC-22057-EA

Abtahi

Butt

Molgaard

Arvani

(2011). Wear analysis and optimization on impregnated diamond bits in vibration assisted rotary drilling (Vard). ARMA US Rock Mechanics/Geomechanics Symposium, ARMA-11-266.

Adams

A.J.

Gibson

Smith

(2010). Probabilistic well-time estimation revisited. SPE Drilling & Completion, 25(04), 472–499. doi: https://doi.org/10.2118/119287-PA

Adeleye

Virginillo

Iyoho

Parenteau

Licis

(2004). Improving drilling performance through systematic analysis of historical data: Case study of a canadian field. . SPE/IADC Drilling Conference and Exhibition, SPE–87177–MS. doi: https://doi.org/10.2118/87177-MS

Ahmed

Mahmoud

A.A.

Elkatatny

Chen

(2019). The effect of weighting materials on oil-well cement properties while drilling deep wells. Sustainability, 11(23), 6776. doi: https://doi.org/10.3390/su11236776

Al Ramadan

Salehi

Ezeakacha

Teodoriu

(2019). Analytical and experimental investigation of the critical length in casing– liner overla. Sustainability, 11(23), 6861.

Amar

Ibrahim

(2012). Rate of Penetration Prediction and Optimization Using Advances in Artificial Neural Networks, a Comparative Study. Proceedings of the 4th International Joint Conference on Computational Intelligence, Barcelona, Spain, 5–7.

Amorim

D.S.

Santos

O.L.A.

Azevedo

R.C.D.

(2019). A statistical solution for cost estimation in oil well drilling. REM-International Engineering Journal, 72, 675–683. doi: https://doi.org/10.1590/0370-44672018720183

Anemangely

Ramezanzadeh

Tokhmechi

Molaghab

Mohammadian

(2018). Drilling rate prediction from petrophysical logs and mud logging data using an optimized multilayer perceptron neural network. Journal of Geophysics and Engineering, 15(4), 1146–1159. doi: https://doi.org/10.1088/1742-2140/aaac5d

10.

Ashrafi

S.B.

Anemangely

Sabah

Ameri

M.J.

(2019). Application of hybrid artificial neural networks for predicting rate of penetration (Rop): A case study from marun oil field. Journal of Petroleum Science and Engineering, 175, 604–623.

11.

Brenjkar

Biniaz Delijani

Karroubi

(2021). Prediction of penetration rate in drilling operations: A comparative study of three neural network forecast methods. Journal of Petroleum Exploration and Production, 11(2), 805–818.

12.

Brett

Millheim

(1986). The drilling performance curve: A yardstick for judging drilling performance. SPE Annual Technical Conference and Exhibition?, SPE–15362–MS. doi: https://doi.org/10.2118/15362-MS

13.

Byrom

T.G.

, (2014) Casing and Liners for Drilling and Completion: Design and Application: Elsevier.

14.

Castiñeira

Toronyi

Saleri

(2018). Machine learning and natural language processing for automated analysis of drilling and completion data. SPE Kingdom of Saudi Arabia Annual Technical Symposium and Exhibition, SPE–192280–MS. doi: https://doi.org/10.2118/192280-MS

15.

Chen

Savalei

(2021). Three sample estimates of fraction of missing information from full information maximum likelihood. Frontiers in Psychology, 12, 667802. https://doi.org/10.3389/fpsyg.2021.667802

16.

Chen

Savalei

Rhemtulla

(2023). Pay attention to the ignorable missing data mechanisms! An exploration of their impact on the efficiency of regression coefficients. Multivariate Behavioral Research, 58(6), 1134–1159. doi: https://doi.org/10.1080/00273171.2023.2193600

17.

Codling

Leatherby

(2013). Probabilistic well time estimation using operations reporting data. SPE Digital Energy Conference and Exhibition, SPE–163687–MS. doi: https://doi.org/10.2118/163687-MS

18.

Coelho

D.K.

Roisenberg

F Filho

P.J.

Jacinto

C.M.C.

(2005). Risk assessment of drilling and completion operations in petroleum wells using a monte carlo and a neural network approach. Proceedings of the Winter Simulation Conference, 2005, p. 6. doi: https://doi.org/10.1109/WSC.2005.1574466

19.

Collins

L.M.

Schafer

J.L.

Kam

C.-M.

(2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6(4), 330. doi: https://doi.org/10.1037/1082-989X.6.4.330

20.

Dinov

I.D.

Heavner

Tang

Glusman

Chard

Darcy

Madduri

Spino

Kesselman

(2016). Predictive big data analytics: A study of Parkinson’s disease using large, complex, heterogeneous, incongruent, multi-source and incomplete observations. PloS One, 11(8), e0157077. doi: https://doi.org/10.1371/journal.pone.0157077

21.

Jiang

Lin

Initiative

A. s. D. N.

(2023). Multinomial logistic factor regression for multi-source functional block-wise missing data. Psychometrika, 88(3), 975–1001. doi: https://doi.org/10.1007/s11336-023-09918-5

22.

Gao

Wang

(2015). Structured Feature Selection. Proceedings of the IEEE International Conference on Computer Vision, pp. 4256–4264.

23.

Gholilou

Vialle

Madadi

(2017). Determination of safe mud window considering time-dependent variations of temperature and pore pressure: Analytical and numerical approaches. Journal of Rock Mechanics and Geotechnical Engineering, 9(5), 900–911. doi: https://doi.org/10.1016/j.jrmge.2017.02.002

24.

Ghorpade-Aher

Sonkamble

(2022). A machine learning algorithm for multi-source heterogeneous data with block-wise missing information. Indian Journal of Computer Science and Engineering, 13(6), 1893–1904.

25.

Graham

J.W.

(2009). Missing data analysis: Making it work in the real world. Annual Review of Psychology, 60, 549–576. doi: https://doi.org/10.1146/annurev.psych.58.110405.085530

26.

Graham

J.W.

Schafer

J.L.

(1999). On the performance of multiple imputation for multivariate data with small sample size. Statistical Strategies for Small Sample Research, 50, 1–27.

27.

Hamrick

T.R.

, (2011). Optimization of Operating Parameters for Minimum Mechanical Specific Energy in Drilling: West Virginia University.

28.

Hossain

M.E.

(2015). Drilling costs estimation for hydrocarbon wells. Journal of Sustainable Energy Engineering, 3(1), 3–32. doi: https://doi.org/10.7569/jsee.2014.629520

29.

Huang

Crews

J.B.

Clark

D.E.

(2011). Protecting the reservoir with surfactant micellar drill-in fluids in carbonate-containing formations. SPE Drilling & Completion, 26(04), 492–498. doi: https://doi.org/10.2118/135166-PA

30.

Kaiser

M.J.

(2007). A survey of drilling cost and complexity estimation models. International Journal of Petroleum Science and Technology, 1(1), 1–22.

31.

Kaiser

Boike

Grosse

Langer

(2024). Multi-source synthesis, harmonization, and inventory of critical infrastructure and human-impacted areas in permafrost regions of alaska (Sirius). Earth System Science Data Discussions, 2024, 1–53. doi: https://doi.org/10.5194/essd-2023-393

32.

Krogel

M.-A.

Wrobel

(2002). Feature Selection for Propositionalization.beck, Germany, Proceedings. Discovery Science: 5th International Conference, DS 2002 Lübeck, Germany, Proceedings 5, pp. 430–434.

33.

Lang

K.M.

Little

T.D.

(2018). Principled missing data treatments. Prevention Science, 19(3), 284–294.

34.

De la Prieta Pintado

Corchado

J.M.

Bajo

(2017). Multi-source homogeneous data clustering for multi-target detection from cluttered background with misdetection. Applied Soft Computing, 60, 436–446. doi: https://doi.org/10.1016/j.asoc.2017.07.012

35.

Wang

Zhou

(2019). Multi-task learning based survival analysis for multi-source block-wise missing data. Neurocomputing, 364, 95–107. doi: https://doi.org/10.1016/j.neucom.2019.07.010

36.

Løberg

Arild

Ø.

Merlo

D’Alesio

(2008). The how’s and why’s of probabilistic well cost estimation. IADC/SPE Asia Pacific Drilling Technology Conference and Exhibition?, SPE–114696–MS. doi: https://doi.org/10.2118/114696-MS

37.

Lodder

(2013). To impute or not impute: That’s the question. Advising on Research Methods: Selected Topics, 1–7.

38.

Lukawski

M.Z.

Anderson

B.J.

Augustine

Capuano

Jr, L.E.

Beckers

K.F.

Livesay

Tester

J.W.

(2014). Cost analysis of oil, gas, and geothermal well drilling. Journal of Petroleum Science and Engineering, 118, 1–14. doi: https://doi.org/10.1016/j.petrol.2014.03.012

39.

Tang

Chen

Yang

(2019). Uncertainty evaluation of safe mud weight window utilizing the reliability assessment method. Energies, 12(5), 942. doi: https://doi.org/10.3390/en12050942

40.

Madley-Dowd

Hughes

Tilling

Heron

(2019). The proportion of missing data should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiology, 110, 63–73. doi: https://doi.org/10.1016/j.jclinepi.2019.02.016

41.

Mehrad

Bajolvand

Ramezanzadeh

Neycharan

J.G.

(2020). Developing a new rigorous drilling rate prediction model using a machine learning technique. Journal of Petroleum Science and Engineering, 192, 107338.

42.

Misiuk

Brown

C.J.

Robert

Lacharité

(2020). Harmonizing multi-source sonar backscatter datasets for seabed mapping using bulk shift approaches. Remote Sensing, 12(4), 601. doi: https://doi.org/10.3390/rs12040601

43.

Moeinikia

Fjelde

K.K.

Saasen

Vrålstad

(2014)An investigation of different approaches for probabilistic cost and time estimation of rigless P&a in subsea multi-well campaign. SPE Norway Subsurface Conference?, SPE–169203–MS. doi: https://doi.org/10.2118/169203-MS

44.

Mohammed

Okeke

C.J.

Abolle-Okoyeagu

(2012). Current trends and future development in casing drilling. International Journal of Science and Technology, 2(8), 567–582.

45.

Motl

Kordík

(2018). Do we need to observe features to perform feature selection? ITAT, 44–51.

46.

Nakagawa

(2015). Missing data: Mechanisms, methods and messages. Ecological Statistics: Contemporary Theory and Application, 81–105.

47.

Nzeda

B.G.

Schamp

J.H.

Schmitt

(2014). Development of well complexity index to improve risk and cost assessments of oil and gas wells. SPE/IADC Drilling Conference and Exhibition, SPE–167932–MS.

48.

Paiaman

A.M.

Moghadasi

(2009). An overview to applicability of multilateral drilling in the middle east fields. SPE Offshore Europe Conference and Exhibition, SPE–123955–MS. doi: https://doi.org/10.2118/123955-MS

49.

Schouterden

Davis

Blockeel

(2019). Lazybum: Decision tree learning using lazy propositionalization. International Conference on Inductive Logic Programming, pp.98–113. doi: https://doi.org/10.1007/978-3-030-49210-6_9

50.

Seijo-Pardo

Alonso-Betanzos

Bennett

K.P.

Bolón-Canedo

Josse

Saeed

Guyon

(2019). Biases in feature selection with missing data. Neurocomputing, 342, 97–112. doi: https://doi.org/10.1016/j.neucom.2018.10.085

51.

Soares

A.A.

de Oliveira Freitas

J.C.

de Araujo Melo

D.M.

Braga

R.M.

Amaral-Machado

Santos

P.H.S.

Soares

L.W.O.

(2017). Cement slurry contamination with oil-based drilling fluids. Journal of Petroleum Science and Engineering, 158, 433–440. doi: https://doi.org/10.1016/j.petrol.2017.08.064

52.

Steppe

Clark

Johns

(2005). Casing drilling Vs. liner drilling: Critical analysis of an operation in the gulf of mexico. SPE Annual Technical Conference and Exhibition?SPE–96810–MS. doi: https://doi.org/10.2118/96810-MS

53.

Sultan

M.A.

Al-Kaabi

A.U.

(2002). Application of neural network to the determination of well-test interpretation model for horizontal wells. SPE Asia Pacific Oil and Gas Conference and Exhibition, SPE–77878–MS. doi: https://doi.org/10.2118/77878-MS

54.

Ting

C.P.W.

Then

P.H.H.

(2022). Feature reduction of relational oil drilling data before propositionalization and harmonization by measuring relational data missingness. ASEAN Australian Engineering Congress, 45–55. doi: https://doi.org/10.1007/978-981-99-5547-3_4

55.

Veeningen

Givens

Ravichandran

Jeffers

(2009). Method system and program storage device for automatically calculating and displaying time and cost data in a well planning system using a monte carlo simulation software: Google patents.

56.

Whelehan

Thorogood

(1994). An automated system for predicting drilling performance. SPE/IADC Drilling Conference and Exhibition, SPE–27487–MS. doi: https://doi.org/10.2118/27487-MS

57.

Zhou

Tan

Yan

Hao

(2017). Online transfer learning with multiple homogeneous or heterogeneous sources. IEEE Transactions on Knowledge and Data Engineering, 29(7), 1494–1507. doi: https://doi.org/10.1109/TKDE.2017.2685597

58.

Xiang

Yuan

Fan

Wang

Thompson

P.M.

Initiative

A. s. D. N.

(2014). Bi-level multi-source learning for heterogeneous block-wise missing data. NeuroImage, 102, 192–206. doi: https://doi.org/10.1016/j.neuroimage.2013.08.015

59.

Xue

(2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, 116(536), 1914–1927. doi: https://doi.org/10.1080/01621459.2020.1751176

60.

Zhang

Zhu

Huang

Zhang

Liu

(2019). Harmonizing multi-source remote sensing images for summer corn growth monitoring. Remote Sensing, 11(11), 1266. doi: https://doi.org/10.3390/rs11111266

61.

Zhang

Yuan

Luo

Wang

Liu

(2022). A data-level fusion model for unsupervised attribute selection in multi-source homogeneous data. Information Fusion, 80, 87–103. doi: https://doi.org/10.1016/j.inffus.2021.10.017

62.

Zhou

Cai

(2021). Multi-source learning via completion of block-wise overlapping noisy matrices. arXiv preprint arXiv:2105.10360. doi: https://doi.org/10.48550/arXiv.2105.10360

63.

Zhu

Lock

E.F.

(2020). Generalized integrative principal component analysis for multi-type data with block-wise missing structure. Biostatistics, 21(2), 302–318. doi: https://doi.org/10.1093/biostatistics/kxy052

FvDR
RAW Features	DAHFR	RFCM	Linking DAHFR AND RFCM	Feature Selection
700 tables	37 tables	Processed 125 reports	Linked 48 keywords to 50 tables	Target Label/Keyword: Duration
20,000 columns	707 columns	Identified 48 keywords		21/48 relevant keywords
				22/50 database tables

Feature Selection Before Propositionalization of Multi-Source Oil Drilling Data

Abstract

Keywords

1. Introduction

2.1. Using Multi-source Data From Multiple oil Fields to Improve Generalization

2.2. Multi-source Oil Drilling Data is Difficult to Propositionalize

2.2.1. Sparse Data Causes Block-Wise Missing Data in Multi-source Data

2.2.2. Harmonization and feature engineering

3. Methodology

Table 1. DAHFR Threshold Used in a Single-Source and Multi-Source for This Study. Attribute Health H a Table Health H t Single Database Threshold > = 60% > 5% Multi-Sourced Harmonization Threshold > 60% databases passes Single Database Threshold

3.2.1. Examine System Reports and Gather Characteristics into Report Features Index (RFI)

3.2.3. Convert the RFI into a Report Features Correlation Matrix (RFCM)

3.3. Step 3: Linking DAHFR to RFCM

4. Results

4.1. FvDR

Table 5. Visualization of DAH Scores for Activity (a) and Bharun (b) tables across 5 sources (db1-db5). A better score (green) suggests a lower relational missingness.

5.1. Limitations

5.3. Future Work

5.3.1. DAHFR

5.3.2. RFCM

5.3.3. How do you compare FSbP versus other FSbP or FS techniques

6. Conclusion

Footnotes

Acknowledgments

Author Biographies

Appendix

References

Table 1.
DAHFR Threshold Used in a Single-Source and Multi-Source for This Study.

Attribute Health H_a Table Health H_t

Single Database Threshold > = 60% > 5%

Multi-Sourced Harmonization Threshold > 60% databases passes Single Database Threshold

Table 5.
Visualization of DAH Scores for Activity (a) and Bharun (b) tables across 5 sources (db1-db5). A better score (green) suggests a lower relational missingness.