“Data is the new oil” is often quoted when big data are used in machine learning (ML) models, as a way of expressing the value of large datasets. The metaphor extends further, however: like hydrocarbons, raw data must be broken down and refined before they have commercial or scientific value (Flender, 2019) (Figure 2).
Figure 2> Machine learning model data preparation and project workflow: input data; (1) data cleaning and preprocessing; (2) train and test regression algorithm; (3) classify data; (4) train and test classifier algorithms (RF, DT, SGD); (5) output accuracy and confusion matrix analysis. Hyperparameters are modified to improve accuracy, iterating until a maximum-accuracy model is reached.
Figure 1> Location of unconventional resource plays in the lower 48 United States used in this project, showing current, prospective, and stacked shale plays, with stacked plays subdivided by relative depth and age. (Source: EIA, 2018)
DATA PREPARATION
Methodical cleaning of the data enables the underlying relationships between production and geology to emerge. To decrease bias, median values were calculated and imputed for missing data wherever less than 50% of the values for an attribute were unavailable, followed by normalization and scaling of the data so that wells could be compared. Using principal component analysis (PCA), the number of components onto which the data could be projected while maintaining 95% of the explained variance was reduced to six, as shown in Figure 3. Feature extraction was also conducted to assess the correlation of each of the nine geological input features with initial production. As shown in Table 1, features such as gas-to-oil ratio and porosity can be excluded from the model input to reduce noise.
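As a rough illustration of this preparation step (not the study's actual code), the following Python sketch uses pandas and scikit-learn with illustrative attribute names: it imputes medians only where an attribute is mostly populated, scales the data, and keeps the principal components needed to retain 95% of the explained variance.

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    # Illustrative attribute names; the study's actual schema is not given in the article.
    FEATURES = ["avg_thickness", "pore_pressure", "tvd", "resource_concentration",
                "geothermal_gradient", "reservoir_pressure", "max_burial_temp",
                "gor", "porosity"]

    def prepare(wells: pd.DataFrame):
        X = wells[FEATURES].copy()

        # Impute the median only for attributes where less than 50% of values are
        # missing, mirroring the bias-reduction rule described above.
        for col in FEATURES:
            if X[col].isna().mean() < 0.5:
                X[col] = X[col].fillna(X[col].median())

        # Attributes that remain mostly empty are simply dropped in this sketch.
        X = X.dropna(axis=1)

        # Normalize/scale so wells can be compared across attributes with different units.
        X_scaled = StandardScaler().fit_transform(X)

        # Keep the smallest number of principal components that retains 95% of the
        # explained variance (six components for the dataset in this study, Figure 3).
        pca = PCA(n_components=0.95)
        return pca.fit_transform(X_scaled), pca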
Figure 3> Graph of principal component analysis (PCA) explained variance within the dataset used in this study, showing that approximately 95% of the variance is explained by six principal components.

Table 1> Feature weighting of all input parameters with respect to normalized initial production.

Feature                        Weighting
Average Thickness              0.543
Pore Pressure                  0.185
TVD                            0.156
Resource Concentration         0.0487
Geothermal Gradient            0.0303
Reservoir Pressure             0.0204
Maximum Burial Temperature     0.0152
GOR                            0.00127
Porosity                       0.000807
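The article does not say how the Table 1 weightings were computed; one common way to produce this kind of ranking is to fit a regression of the cleaned features against normalized initial production and read off tree-based feature importances. A minimal sketch under that assumption (X_clean and y_ip are placeholder names for the prepared features and target):

    from sklearn.ensemble import RandomForestRegressor

    # X_clean: prepared feature matrix (columns as in FEATURES above, before PCA)
    # y_ip:    normalized initial production per well (placeholder target name)
    reg = RandomForestRegressor(n_estimators=500, random_state=0)
    reg.fit(X_clean, y_ip)

    # Importances sum to 1.0 and can be ranked as in Table 1 to flag low-weight
    # features (e.g. GOR, porosity) for exclusion from the classification input.
    for name, weight in sorted(zip(FEATURES, reg.feature_importances_),
                               key=lambda kv: kv[1], reverse=True):
        print(f"{name:28s} {weight:.4g}")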
OVERVIEW OF ALGORITHMS AND PERFORMANCE EVALUATION
Three algorithms were used to assess the accuracy of predicting success in shale plays: a Support Vector Machine (SVM) classifier, a Decision Tree (DT) classifier, and a Random Forest (RF) classifier. A model was built using a Stochastic Gradient Descent (SGD) classifier with an SVM optimizer, as illustrated in Figure 4.
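A minimal sketch of how such classifiers could be trained and tested with scikit-learn, reporting the accuracy and confusion matrix called for in step 5 of the Figure 2 workflow (class labels, split ratio, and hyperparameter values here are illustrative, not the study's):

    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import SGDClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    # X: prepared feature matrix; y: per-well success class labels (placeholder names).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0, stratify=y)

    models = {
        # hinge loss makes SGDClassifier optimise a linear SVM objective
        "SGD (linear SVM)": SGDClassifier(loss="hinge", random_state=0),
        "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=0),
        "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    }

    for name, model in models.items():
        model.fit(X_train, y_train)
        predicted = model.predict(X_test)
        # Step 5 of the Figure 2 workflow: accuracy and a confusion matrix per model;
        # hyperparameters would be adjusted and the models retrained while accuracy is low.
        print(name, accuracy_score(y_test, predicted))
        print(confusion_matrix(y_test, predicted))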
Figure 4> Stochastic Gradient Descent (SGD) classifier using a Support Vector Machine (SVM) linear classifier: (A) SVM linear classification; (B) SVM non-linear classification with kernel transformation; (C) SVM with overlapping data points; (D) SVM with regularisation hyperparameter. Modified from Patel (2017).
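To make the four Figure 4 panels concrete, the sketch below uses scikit-learn's SVC on toy two-class data (parameter values are illustrative): a linear kernel for panel A, a kernel transformation for panel B, and the regularisation hyperparameter C for the overlapping-class cases in panels C and D.

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    # Toy, non-linearly separable two-class data standing in for the Figure 4 panels.
    X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

    svms = {
        "A: linear kernel": SVC(kernel="linear"),
        "B: RBF kernel transformation": SVC(kernel="rbf", gamma=1.0),
        # C trades margin width against misclassified points when classes overlap:
        "C: soft margin (C=0.1)": SVC(kernel="rbf", C=0.1),
        "D: harder margin (C=100)": SVC(kernel="rbf", C=100.0),
    }

    for label, model in svms.items():
        model.fit(X, y)
        print(label, round(model.score(X, y), 3))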