A Machine Learning Approach to Enhanced Oil Recovery Prediction

Fedor Krasnov, Nikolay Glavnov, and Alexander Sitnikov

Gazpromneft NTC, 75-79 Moika River emb., St Petersburg, 190000, Russia
[email protected], http://ntc.gazprom-neft.ru, ORCID 0000-0002-9881-7371
Abstract. In a number of computational experiments, a meta-algorithm is used to solve problems of the oil and gas industry. Such experiments begin in a hydrodynamic simulator, where the value of a function is calculated at specific nodal values of the parameters based on the physical laws of fluid flow through porous media. The values of the function are then calculated either on a more detailed set of parameter values or for parameter values that go beyond the nodal values. Among other purposes, this approach is used to calculate the incremental oil production resulting from various methods of enhanced oil recovery (EOR). The authors found that, compared with traditional computational experiments on a regular grid, computation using machine learning algorithms can be more productive.

Keywords: Enhanced Oil Recovery, EOR, random forest, regular grid interpolation
1 Proxy Model Approach
One of the main reasons for the appearance of meta-algorithms in the oil and gas industry is the limited speed of hydrodynamic modeling. In the future, when any specialist in an organization can vary parameter values at any time, within a wide range, and obtain the required function values in near-real time, the need for a meta-algorithm will disappear. Meanwhile, it takes experts hours or even days to perform modeling for one set of parameters on costly high-performance computing (HPC) clusters. Thus, there is a need for astute preparation of data for further processing. Since parameter changes can occur several times a day and come from a whole variety of specialists, an efficient meta-algorithm is an urgent necessity. Applying the meta-algorithm yields a model, sometimes called a proxy model, as in [1] and [2]. At the input, the proxy model receives a set of parameters; it then outputs the value of the physical function for these parameters, performing interpolation or extrapolation based on the previously calculated function values at the nodal parameter values.
The completed proxy model does not require large computational resources and works in near-real time, yielding immediate results. In this article, we consider two different approaches to constructing a proxy model, using as an example a computational experiment studying the increment in oil production from miscible displacement by carbon dioxide injection.
2 Hydrodynamic Simulation
To obtain a first estimate of the additional oil production from tertiary methods of increasing oil recovery, representative curves are used, plotted in coordinates of hydrocarbon pore volume injected (HCPVI) versus enhanced oil recovery.
Fig. 1. Representative curve of enhanced oil recovery versus HCPVI
Fig. 2. Representative curve of enhanced CO2 production versus HCPVI
These curves, shown in Figures 1 and 2, are most often obtained from statistical analysis of actually implemented field projects, from simplified analytical dependencies, or, less often, from multivariate calculations on synthetic simulation models. The latter method of obtaining representative curves, for the technology of alternating injection of carbon dioxide and water into a reservoir under miscible conditions, was applied in this article. The simulation was carried out with the Eclipse 300 compositional simulator (Schlumberger), which can reproduce the process of miscible displacement. The model is a segment of a five-spot element of the development system, with vertical wells in the corners. In our experiment, additional oil production is calculated by varying the following parameters:

– Oil properties (density, viscosity, saturation pressure): to take into account the influence of the reservoir-fluid properties on displacement efficiency, three models of reservoir oil were created with characteristics covering the entire range (223 objects) of the reservoir-oil properties of the available oil samples.
– Residual oil saturation: its actual value, obtained from core filtration experiments in the oil-water system, determines the amount of oil not recovered by water flooding. During carbon dioxide displacement, the residual oil saturation decreases due to reduced interfacial tension between the displacing agent and the oil, accompanied by the dissolution process.
– Heterogeneity of permeability: based on interpretation of logs from exploratory wells, the Dykstra-Parsons coefficient was calculated for all the objects considered. For the sake of variety, four values almost uniformly covering the whole interval were picked out.
– Relative phase permeability: to determine the nodal values of the endpoints of the relative phase permeability and the values of residual water and oil saturation, the results of laboratory core studies were generalized. To cover the whole range of phase permeability in the modeling, the values of the maximum relative permeability to gas and water were chosen to correspond to the average value, as well as to two values close to the maximum and the minimum.
– Current oil saturation: in the calculations, the degree of reserve depletion was modeled by changing the initial oil saturation of the grid when the model was initialized. Three states were identified: first production, average depletion, and a fully produced object.

A total of 324 simulation models were generated. Based on them, 972 calculations were performed (3 per model). The simulation results were grouped into one summary database, in which up to 486 representative curves were processed. Figure 3 shows the statistics of the time spent calculating one variant; the average was 90 minutes.
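The size of the experiment matrix can be sketched as a full factorial design over the five varied factors. The counts below follow the text where given (three oil models, four Dykstra-Parsons values, three relative-permeability variants, three saturation states); the three nodal values assumed for residual oil saturation are an inference needed to reach the stated 324 models.

```python
# Full factorial design over the varied parameters; the residual oil
# saturation count (3) is an assumption, the rest follow the text.
from itertools import product

levels = {
    "oil_model": 3,          # three reservoir-oil models
    "residual_oil_sat": 3,   # assumed: three nodal values
    "dykstra_parsons": 4,    # four heterogeneity values
    "rel_permeability": 3,   # average, near-minimum, near-maximum endpoints
    "current_oil_sat": 3,    # first production / average / produced
}

models = list(product(*(range(n) for n in levels.values())))
print(len(models))       # 3*3*4*3*3 = 324 simulation models
print(len(models) * 3)   # 972 calculations (3 per model)
```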
Fig. 3. Simulation time
2.1 An Approach to Creating a Proxy Model Based on Multidimensional Linear Interpolation
One approach to creating a proxy model is linear multidimensional interpolation, described in [3]. To understand it, consider the parameter space as a multidimensional cube in which each dimension is formed by a vector of values of one parameter. The resulting function can then also be represented as a vector. The process of creating a proxy model contains the following steps:

1. Reading the parameters and function values from the results of hydrodynamic modeling;
2. Vectorizing the parameters and the resulting function;
3. Constructing a multidimensional cube of parameters;
4. Constructing an interpolation function;
5. Determining the dimensions of the new parameter vectors;
6. Creating the new parameter vectors;
7. Interpolating the new parameter vectors with the interpolation function;
8. Exporting the resulting proxy model in a format convenient for use.

The essence of modeling based on multidimensional linear interpolation is to choose the step of the new parameters so that the resulting parameter vectors contain the nodal values and cover the range needed for the model with a sufficient number of steps. In other words, if Parameter P1 has dimension 3, with calculations made at the nodal values 0.1, 0.5, and 0.9, and function values are needed at 0.4 and 0.8, then for the new vector it is sufficient to select a step of 0.1 and a dimension of 9. Thus, the parameters are meshed on a regular grid. When the dimension of the parameter space is greater than 2, we can no longer apply the spline methods described, among other resources, in [5]. In our case, the dimension of the parameter space is 6 (including the injected pore volume parameter).
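The interpolation steps above can be sketched with SciPy's RegularGridInterpolator, the implementation used in this work. The two-parameter grids and the stand-in function values below are toy data for illustration only; the real model has six parameter dimensions filled from simulator output.

```python
# Minimal sketch of steps 3-7 on toy data (two parameters instead of six);
# the nodal values and function are illustrative, not the paper's.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

# Nodal values of two parameters
p1_nodes = np.array([0.1, 0.5, 0.9])
p2_nodes = np.array([1.0, 2.0, 3.0, 4.0])

# Function values at every node combination (stand-in for simulator output)
f_nodes = np.add.outer(p1_nodes, p2_nodes)

# Interpolation function on the regular grid (no triangulation needed)
interp = RegularGridInterpolator((p1_nodes, p2_nodes), f_nodes)

# Denser parameter vectors that still contain the nodal values
p1_new = np.linspace(0.1, 0.9, 9)   # step 0.1, dimension 9
p2_new = np.linspace(1.0, 4.0, 7)   # step 0.5, dimension 7
grid = np.array([(a, b) for a in p1_new for b in p2_new])

values = interp(grid)   # proxy-model values on the refined grid
print(values.shape)     # (63,)
```

Because the toy function is linear in both parameters, the piecewise-linear interpolation reproduces it exactly at all refined points.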
In addition, the choice of step should take into account the capabilities of the computing resources: the vectorized parameter space, represented as a multidimensional array of rational numbers, must fit within the RAM of the server available for calculations. The finished proxy model, in our case, can be understood as an MS Excel table with seven columns: six for the input parameters and one for the resulting function. Such a presentation is as clear as possible to a wide range of specialists within an organization and allows further research based on the proxy-model data. Figure 4 below shows the dependence of additional production on residual oil saturation and heterogeneity of permeability, with the other parameters fixed.
Fig. 4. Additional oil production on a regular interpolation grid model
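Step 8, the export of the seven-column table, corresponds to a straightforward pandas operation. The column names and the random values below are illustrative placeholders, not the paper's actual schema or data.

```python
# Hedged sketch of exporting the proxy model as a seven-column table
# (six parameters plus the function value); names are hypothetical.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5  # a handful of illustrative rows
table = pd.DataFrame({
    "oil_model": rng.integers(0, 3, n),
    "residual_oil_sat": rng.uniform(0.2, 0.4, n),
    "vdp": rng.uniform(0.3, 0.9, n),
    "rel_perm": rng.uniform(0.1, 0.8, n),
    "current_oil_sat": rng.uniform(0.3, 0.7, n),
    "hcpvi": rng.uniform(0.0, 2.0, n),
    "delta_recovery": rng.uniform(0.0, 0.15, n),
})
print(table.shape)  # (5, 7): six parameter columns + one function column
table.to_excel("proxy_model.xlsx", index=False)  # requires openpyxl
```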
2.2 An Approach to Creating a Proxy Model Based on Machine Learning Methods
From the point of view of machine learning, our task belongs to the class of regression problems. One common and universal regressor is Random Forest, a method introduced by Leo Breiman [7]. A random forest is a set of decision trees: in a regression problem their answers are averaged, while in a classification problem a decision is made by majority vote. All trees are constructed independently according to the following scheme [6]:

– A subsample of the training sample of a certain size is chosen, and a decision tree is constructed on it (a separate subsample for each tree);
– For each split in the tree, a certain number of randomly chosen features is examined, with a separate random feature set for each new split;
– Finally, the best feature is selected according to a predetermined criterion, and the split is performed on it.

In the original algorithm, the tree is grown until the subsample is exhausted and only representatives of one class remain in the leaves. In modern implementations, however, there are parameters that limit the height of the tree, the number of objects in the leaves, and the number of objects in the subsample on which splitting is performed. This construction scheme corresponds to the main principle of ensemble learning [9], the construction of a machine learning algorithm from several base algorithms (here, decision trees): the base algorithms must be good and diverse. In the above formulation of the problem of predicting additional oil production, we train the regressor on the six available parameters and the values of the additional oil recovery factor. We then use the resulting regression model to calculate the values of the additional oil recovery factor for new parameter values. When evaluating the accuracy of the model by predicting the values for known parameters, the coefficient of determination (R-squared) is 0.99 for a test sample of 100 parameter sets. We can also immediately rank the features by importance:

Table 1. Feature importance

Feature                                               Importance
Oil properties (density, viscosity, sat. pressure)       0.016
The actual value of the residual oil saturation          0.032
Heterogeneity of permeability                            0.488
Relative phase permeability                              0.041
Current oil saturation                                   0.026
Pore volume injection (time)                             0.397
The accuracy of the result depends much more on the heterogeneity of permeability (Vdp) and the pore volume injected (time) than on the other parameters, as shown in Figure 5. One can also evaluate the effect of the number of hydrodynamic simulations on the accuracy of the prediction.
Fig. 5. Additional oil recovery factor on the Random Forest model
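The training procedure described above can be sketched with scikit-learn's RandomForestRegressor, the implementation used in this work. The data below are synthetic: six uniform features stand in for the parameters of Table 1, and the target is constructed so that features 2 and 5 (mimicking Vdp and pore volume injected) dominate; the real training set comes from the 972 simulator runs.

```python
# Hedged sketch of the regression workflow on synthetic data; the target
# function is an illustrative stand-in for the simulator output.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# Six parameters; columns 2 and 5 play the roles of Vdp and HCPVI
X = rng.uniform(0.0, 1.0, size=(n, 6))
y = (0.5 * X[:, 2] + 0.4 * X[:, 5]
     + 0.02 * X[:, :2].sum(axis=1)
     + 0.01 * rng.normal(size=n))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

print(r2_score(y_test, model.predict(X_test)))  # close to 1 on this data
print(model.feature_importances_)               # columns 2 and 5 dominate
```

On this synthetic target the feature_importances_ attribute reproduces the qualitative pattern of Table 1: the two dominant inputs absorb nearly all of the importance mass.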
3 Computational Methods and Algorithms
For the calculations, the Python environment was chosen, owing to its extensive capabilities for working with data arrays as matrices, provided by the NumPy library [13]. For exporting and importing data in MS Excel format, the pandas library [14] was used. For the interpolation of multidimensional surfaces, SciPy classes were used, and 3D surfaces were rendered with Matplotlib. As the software implementation of multidimensional linear interpolation, the RegularGridInterpolator class from SciPy was selected [11], [12]. One of its advantages is that it exploits the regular grid instead of performing resource-intensive triangulation of the parameter space. For Random Forest, the scikit-learn implementation was used [10]. The performance of these libraries fell within the requirements of "on demand" computing time. Calculations were performed on a 64-core Linux server (CentOS 7). To make the fullest possible use of the multi-core processors, the authors used the Intel Math Kernel Library.
4 Conclusions and Future Directions of Research
It is important to note that this approach should be applied on a regional scale. The authors made the calculation for the entire range of parameter values of the Gazprom Neft fields. In other words, having calculated the additional oil recovery factor once, one can continue working with proxy models (MS Excel tables) without resorting to further simulations, simply by finding the required set of parameters and the corresponding additional oil recovery factor. Multidimensional regression using the Random Forest method is a modern, high-performance tool. It meets the requirements of the task of constructing proxy models for calculating the additional oil recovery factor across the range of properties covering all Gazprom Neft fields. It is important to note the complete continuity with respect to multidimensional linear interpolation: on the same data, the same results are obtained to within the accuracy of the method. Multidimensional regression using Random Forest also has a number of advantages over multidimensional linear interpolation, namely:

– the ability to take into account the importance of the parameters;
– the ability to determine a sufficient number of hydrodynamic-model calculations based on the required accuracy.

The main conclusion of this article is that applying machine learning methods significantly simplifies the computations, significantly reduces the requirements for computational resources, and achieves better predictability of the simulated function.
References

1. Guo, Z., Reynolds, A. C., Zhao, H.: A Physics-Based Data-Driven Model for History-Matching, Prediction and Characterization of Waterflooding Performance. Society of Petroleum Engineers. doi:10.2118/182660-MS
2. Shehata, A. M., El-Banbi, A. H., Sayyouh, H.: Guidelines to Optimize CO2 EOR in Heterogeneous Reservoirs. Society of Petroleum Engineers. doi:10.2118/151871-MS
3. Weiser, A., Zarantonello, S. E.: A note on piecewise linear and multilinear table interpolation in many dimensions. Math. Comput. 50(181), 189-196 (1988)
4. Ghassemzadeh, S., Hashempour Charkhi, A.: Optimization of integrated production system using advanced proxy based models. Journal of Natural Gas Science and Engineering 35, 89-96 (2016). ISSN 1875-5100
5. Dierckx, P.: Curve and Surface Fitting with Splines. Monographs on Numerical Analysis, Oxford University Press (1993)
6. Dyakonov, A.: Blog post "Random Forest", 2016-11-14. https://alexanderdyakonov.wordpress.com
7. Breiman, L.: Random Forests. Machine Learning 45(1), 5-32 (2001). doi:10.1023/A:1010933404324
8. Gashler, M., Giraud-Carrier, C., Martinez, T.: Decision Tree Ensemble: Small Heterogeneous Is Better Than Large Homogeneous. The Seventh International Conference on Machine Learning and Applications, pp. 900-905 (2008). doi:10.1109/ICMLA.2008.154
9. Opitz, D., Maclin, R.: Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research 11, 169-198 (1999). doi:10.1613/jair.614
10. Pedregosa, F., et al.: Scikit-learn: Machine Learning in Python. JMLR 12, 2825-2830 (2011)
11. Oliphant, T. E.: Python for Scientific Computing. Computing in Science and Engineering 9, 10-20 (2007). doi:10.1109/MCSE.2007.58
12. Millman, K. J., Aivazis, M.: Python for Scientists and Engineers. Computing in Science and Engineering 13, 9-12 (2011). doi:10.1109/MCSE.2011.36
13. van der Walt, S., Colbert, S. C., Varoquaux, G.: The NumPy Array: A Structure for Efficient Numerical Computation. Computing in Science and Engineering 13, 22-30 (2011). doi:10.1109/MCSE.2011.37
14. McKinney, W.: Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 51-56 (2010)