Ambyint Insights
Two Paths to Incorporating Value-Adding Data Science into Production Operations: The Hard Way or the Easy Way
September 24, 2018

With data science and analytics permeating industrial, financial, and consumer technology sectors, oil and gas producers have begun to invest in and explore the potential for these tools to add value in upstream activities. To date, most E&Ps have experienced mixed success, with value coming mainly in the drilling & completions business units (1-4).

Forming an analytics team is only the tip of the iceberg. Building a value-adding data science competency requires significantly more than simply adding headcount and implementing a Tableau or Spotfire visualization system. We present a roadmap for what is needed to build a value-adding data science capability and how Ambyint can catapult E&Ps to the forefront of production analytics today.

We see two paths for E&Ps to achieve value-adding data science capabilities: the ‘hard way’ and the ‘easy way’. This post (originally released as a white paper) lays out the path to data science and predictive analytics that deliver actionable insights with operational merit. We break this path down into four main pillars: high quality data, infrastructure, domain expertise, and a full cycle adaptive control analytics solution. The easiest way to complete the adaptive control loop is by executing and delivering novel data science ‘over-the-air’.


The need for high quality data seems obvious, but the devil is in the details. We will start at the wellsite. Key questions to ask when designing a data science initiative are: What is the right data? How do you acquire and store it so that it can be used both in real time and retrospectively? At what resolution should you collect it?

Most E&P operators have implemented a SCADA system across at least a portion of their wells, which enables them to acquire data, but is that data of high enough quality for modern data science? In our experience, no.


A SCADA-based polling architecture produces a mountain of data regardless of how much insight that data offers into production trends and anomalies. In other words, the haystack is always the same size regardless of operating conditions, which makes finding the value difficult. Separating the signal from the high volume of noise created in polling-based systems can inhibit the development of relevant algorithms. Furthermore, mountains of data without context can lead to runaway models that produce misleading results and actually destroy value.


Production data is not generally being collected at the frequency and resolution required for modern data science, which is effectively statistical analysis of extremely large data sets (i.e., Big Data). For data to be usable for effective anomaly detection and characterization, it must be sampled at a resolution on the order of milliseconds and in such a way that data from relevant sensors and devices are time synchronized. Most current data science in artificial lift is based on random sampling, which can be problematic in a cyclic system such as rod pumping. We base our approach to acquiring and analyzing data at the stroke level. More specifically, the Ambyint analytics approach samples the pumping system at sub-second intervals throughout a single stroke, providing a richer picture of how the system is operating.
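To make the idea of stroke-level sampling concrete, here is a minimal sketch in Python. It is not Ambyint's implementation: the 20 Hz sampling rate, the 6-second stroke period, the synthetic position signal, and the function names are all assumptions chosen for illustration. The sketch splits a sub-second position signal into complete strokes by locating the bottom of each stroke, so that every sample can be analyzed in the context of the cycle it belongs to.

```python
import math

SAMPLE_HZ = 20          # assumed sub-second sampling rate (hypothetical)
STROKE_PERIOD_S = 6.0   # assumed pumping speed of 10 strokes per minute


def simulate_position(duration_s):
    """Synthetic rod position: 0 = bottom of stroke, 1 = top of stroke."""
    n = int(duration_s * SAMPLE_HZ)
    return [
        0.5 * (1 - math.cos(2 * math.pi * (k / SAMPLE_HZ) / STROKE_PERIOD_S))
        for k in range(n)
    ]


def stroke_boundaries(position):
    """Indices of local minima, taken here as the start of each stroke."""
    return [
        i for i in range(1, len(position) - 1)
        if position[i] < position[i - 1] and position[i] <= position[i + 1]
    ]


def split_into_strokes(position):
    """Return (start, end) index pairs, one per complete stroke."""
    bounds = stroke_boundaries(position)
    return list(zip(bounds[:-1], bounds[1:]))


position = simulate_position(31)
strokes = split_into_strokes(position)
for start, end in strokes:
    print(f"stroke covers samples {start}-{end - 1} ({end - start} points)")
```

A real signal would be noisy, so boundary detection would likely need smoothing or a dedicated position switch. The point is only that grouping samples by stroke preserves the cyclic structure that random sampling discards.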


Legacy automation and control hardware at the wellsite was never built with data science in mind, nor was that its intended use. Modern devices are required to access machine data and provide the requisite resolution for modern data science-based analytics. Additionally, standard off-the-shelf, open-source hardware has not been able to provide this granular level of data acquisition while also satisfying the field's need for robustness. Operators need to consider these barriers carefully when scoping data science initiatives.


We have already touched on the devices at the wellsite, but how do we connect to them, and how do we acquire and store their data? What is involved in data transmission? What do we mean by large volumes of data?

Large unconventional-focused E&Ps already generate on the order of tens of terabytes (TB) of data annually. According to one data scientist we spoke with at a prominent independent E&P, less than 30% of the data his company acquires is useful for scientific analysis because it is not captured at the resolution and quality required for data science. What does it take for E&Ps to get to ‘data science grade’ infrastructure?


The infrastructure required for value-adding data science must be built with data science in mind. Given their legacy data architecture systems (SCADA, OSIsoft PI), E&Ps would need to make multi-million dollar capital investments to upgrade their infrastructure wholesale, and/or live with significantly higher operational expenditures from increased polling costs and equipment maintenance. The average mid-life or late-in-life producing asset, with wells on the flat part of their decline, cannot justify the capital for a complete overhaul or the higher maintenance and data transmission costs that come with it. We do not recommend a wholesale overhaul. Instead, we favor an approach that integrates modern technology upgrades across the hardware/software stack and can deliver these enhanced capabilities cost-effectively and seamlessly.


Data transmission in data science means more than simply dumping mounds of polled data into a server. A full cycle of data science information transmission involves collecting, scaling, and transmitting large volumes of data, and spinning up large clusters of processing power to analyze the data in real time. The primary options for collecting larger volumes of high-resolution data are: 1) ratchet up polling costs significantly on existing SCADA networks, 2) install a costly Wi-Fi mesh network, 3) invest in high bandwidth fiber-optics for every pad, or 4) install devices with computing power that can efficiently communicate on novel machine-to-machine (M2M) or satellite-based systems.

Advances in modern communications networks provide economical alternatives to conventional data transmission protocols. Combined with compression algorithms embedded in edge devices and sophisticated push-based architectures that transmit on insightful events, these advances have made these communication modes the most economic and reliable option for value-adding data science.
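The difference between polling and an event-driven push can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any vendor's protocol: the readings are synthetic, and the deadband value, polling interval, and function names are hypothetical. Polling sends a reading at a fixed interval whether or not anything changed, while the push approach sends a reading only when it moves beyond a deadband from the last value sent.

```python
def poll(readings, every_n):
    """Fixed-interval polling: send every n-th reading regardless of content."""
    return [(i, v) for i, v in enumerate(readings) if i % every_n == 0]


def push_on_change(readings, deadband):
    """Send a reading only when it differs from the last sent value by more than deadband."""
    sent = []
    last_sent = None
    for i, value in enumerate(readings):
        if last_sent is None or abs(value - last_sent) > deadband:
            sent.append((i, value))
            last_sent = value
    return sent


# Synthetic signal: steady around 100 with small ripple, plus a short excursion.
readings = [100.0 + 0.2 * ((i % 5) - 2) for i in range(60)]
readings[43:46] = [140.0, 141.0, 139.5]

polled = poll(readings, every_n=10)
pushed = push_on_change(readings, deadband=5.0)

print("polled messages:", len(polled), "max value seen:", max(v for _, v in polled))
print("pushed messages:", len(pushed), "at samples:", [i for i, _ in pushed])
```

In this synthetic case the polling schedule sends more messages yet never observes the excursion, while the push approach sends fewer messages and captures both its onset and its end. A production system would add compression and richer event definitions on top of this basic idea.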


For data storage and processing, the choices are to 1) buy dozens of servers and take on the requisite staff, electricity costs, and complexity of operating dedicated data centers, or 2) rely on cloud-based storage and computational services like Amazon Web Services (AWS) or Microsoft Azure. E&Ps could choose to host themselves, but the question is why? Companies that are born in the cloud (“cloud native”) have an advantage over cloud-migrating companies because they do not need to integrate any prior processes or architecture, and have 77% lower hosting costs as a result, according to cloud vs. on-premises studies. Given E&Ps’ core competency of project management, they are generally better served by treating hosting like any other service, such as drilling or completion, where they rely on vendors to bring the assets and deliver the service, with appropriate engineering guidance and field project management.

Domain Expertise

We have seen a wave of data science and artificial intelligence (AI) companies from Silicon Valley seeking to sell their solutions to the oil and gas industry in recent years. Based on feedback from numerous E&Ps, these efforts have resulted in little measurable success to-date. Why?

Beyond the limitations related to data quality and infrastructure discussed above, we believe that lack of domain expertise is the primary factor. Most operators would not allow a new production engineer to operate a workover rig by themselves, so why would they expect a data scientist from Silicon Valley to understand the complexity of E&P operations and nuanced data without having spent significant field time learning the industry and its complex operations?


Value is created from data science when insights and recommendations are provided. Without context and a deep understanding of the operations on which the data science is being conducted, the critical components of a learning system architecture and the subsequent machine learning algorithms are useless. Even in-house data science teams can produce flawed recommendations when their domain expertise is limited.

We recently heard of one example where an in-house analytics team presented a statistical model to field engineers. When the data scientist presented a snapshot of the data, it included a recommendation for a 20,000′ pump set depth when no well in the field had a measured depth (MD) that great! The model's credibility was instantly in question. Another example we heard was an external statistical reservoir study that concluded GOR was important – not exactly a novel finding. While the statistical methods and data science resources in many companies are top notch, it can be a bit like bringing the best rifle to a duck hunt – not the right tool for the job.
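A simple domain check would have caught the pump set depth error before it reached the field engineers. The Python sketch below is a hypothetical illustration: the well names, measured depths, and function name are invented for the example and do not come from the case described above. It rejects any recommended pump set depth that is deeper than the well itself.

```python
def validate_pump_set_depth(recommended_ft, measured_depth_ft):
    """Return a list of reasons the recommendation is physically implausible."""
    problems = []
    if recommended_ft <= 0:
        problems.append("pump set depth must be positive")
    if recommended_ft > measured_depth_ft:
        problems.append(
            f"pump set depth {recommended_ft:,.0f} ft exceeds well measured depth "
            f"{measured_depth_ft:,.0f} ft"
        )
    return problems


# Hypothetical wells and measured depths, for illustration only.
field_md_ft = {"WELL-A": 9500.0, "WELL-B": 11200.0}
recommendation_ft = 20000.0

rejected = {
    well: validate_pump_set_depth(recommendation_ft, md)
    for well, md in field_md_ft.items()
}
for well, problems in rejected.items():
    print(well, problems or "ok")
```

Checks like this are trivial to write once someone with field experience has said which constraints matter, which is the argument for pairing data scientists with production engineers.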


How do you bridge data science and production engineering fundamentals to generate actionable insights that create value? We believe that collaboration between experts in both domains best enables the development of high quality machine learning algorithms. The two aspects of machine learning most influenced by domain expertise are feature engineering and marked data sets.

Feature engineering is “the process of transforming raw data into features that better represent the underlying problem…resulting in improved model accuracy.” (5) Features shape models and, if properly done, enable flexibility and the creation of simple, elegant models. It is at this step in the process that data scientists translate rod pumps into regressions. Thus, if data scientists truly know how artificial lift works, they will create the right features and better data science will ensue.
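As a hedged example of what feature engineering for a rod pump might look like, the Python sketch below turns one stroke of position and load samples into a handful of summary features. The choice of features, the units, and the function name are assumptions for illustration and are not taken from this paper. The enclosed area of the load-versus-position loop is computed with the shoelace formula; for a closed loop, that area corresponds to the work done over the stroke.

```python
def card_features(position_in, load_lbf):
    """Summarize one stroke (a closed load-vs-position loop) as a feature dict."""
    n = len(position_in)
    twice_area = 0.0
    for i in range(n):
        j = (i + 1) % n  # wrap around to close the loop
        twice_area += position_in[i] * load_lbf[j] - position_in[j] * load_lbf[i]
    return {
        "peak_load_lbf": max(load_lbf),
        "min_load_lbf": min(load_lbf),
        "load_range_lbf": max(load_lbf) - min(load_lbf),
        "stroke_length_in": max(position_in) - min(position_in),
        "card_area_in_lbf": abs(twice_area) / 2.0,  # shoelace formula
    }


# Idealized rectangular loop (hypothetical values): load rises at the bottom,
# stays high on the upstroke, then drops at the top for the downstroke.
position = [0.0, 0.0, 100.0, 100.0]
load = [5000.0, 9000.0, 9000.0, 5000.0]

features = card_features(position, load)
for name, value in features.items():
    print(f"{name}: {value:,.0f}")
```

A data scientist who has not seen a pumping unit would be unlikely to think of loop area as a feature. An engineer who has would recognize it immediately, which is where domain expertise earns its keep.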

Once the algorithm is created, data scientists refine their model features using marked data sets. Marking a set of data means providing human context to the data, which allows the machine to train itself faster and more accurately. To properly understand the operational context, data scientists should go to the field to see how production operations work. Additionally, production staff should receive more advanced training on statistical techniques so they can provide better insight into the data science. This is a time-consuming process, but well worth it.


Even then, expertly marked data sets are often not possible to create from historical data. Well historian data quality is often limited. In today’s world, it is also common that the knowledge needed to understand the data in the historian has departed due to layoffs or retirement. With the quarter-to-quarter focus of public E&Ps and margin pressure from lower oil prices, will management have the patience to persevere and see data science initiatives through to generate meaningful value? Our guess is likely no.

This puts a premium on partnering with companies that can deliver both the relevant domain expertise and modern data science capabilities to speed up the timeline. When a data science partner has lived the same production pain points as its E&P customers, its diagnoses, recommendations, and analytics are relevant and focused on creating value.

Adaptive Analytics Solution

Completing a full cycle in a truly adaptive analytics solution is the fourth ingredient to enabling value-add data science. It is the execution of the first three pieces in combination. Fast forward a few years and several million dollars. Let us assume that a forward-thinking operator has invested to:

  • Upgrade their hardware at the wellsite
  • Increase data transmission capacity with a Wi-Fi mesh network or fiber-optics, or absorb significantly higher polling costs
  • Staff up IT and servers or build hosted capabilities in-house
  • Train internal staff on integrated team disciplines and hire requisite data scientist staff
  • Clean and mark historical data sets
  • Build advanced statistical models and machine learning algorithms
  • Create buy-in with business units and inculcate an understanding and culture of data science across the organization

What the company has now created is the base level of what we believe is a value-adding data science system. In other words, we believe true data science starts with hardware or devices seamlessly feeding software with clean, high-resolution data via a reliable communications protocol. The loop continues into the analytics platform, where machine learning algorithms can backtest against historic data and integrate current data simultaneously to continuously train and improve themselves. The loop is closed when this process is actively monitored in real time, data science algorithms are refined as new data is acquired, and those improved models are then seamlessly delivered back to the edge devices ‘over-the-air’ as updates without requiring someone to visit the wellsite. This creates a dynamic ecosystem of continuous improvement and learning that unlocks a step-change in value for operators.
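The closed loop described above can be sketched as a toy Python example. Everything in it is a simplifying assumption made for illustration: the class names, the single alarm threshold standing in for a model, and the mean-plus-three-standard-deviations retraining rule are hypothetical and are not Ambyint's algorithms. The edge device scores data locally, the cloud side retrains on what it receives, and the improved model is delivered back as a versioned update.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class EdgeModel:
    version: int
    alarm_threshold: float  # stand-in for a real model's parameters


class EdgeDevice:
    def __init__(self, model):
        self.model = model

    def is_anomalous(self, peak_load):
        return peak_load > self.model.alarm_threshold

    def apply_update(self, model):
        # 'Over-the-air' update: accept only a newer model version.
        if model.version > self.model.version:
            self.model = model


class CloudTrainer:
    def __init__(self):
        self.history = []

    def ingest(self, values):
        self.history.extend(values)

    def retrain(self, current_version, k=3.0):
        # Assumed rule: flag anything more than k standard deviations above the mean.
        threshold = mean(self.history) + k * pstdev(self.history)
        return EdgeModel(current_version + 1, threshold)


device = EdgeDevice(EdgeModel(version=1, alarm_threshold=12000.0))
trainer = CloudTrainer()

before = device.is_anomalous(9500.0)
trainer.ingest([9000.0, 9100.0, 8900.0, 9050.0, 8950.0])  # hypothetical peak loads
device.apply_update(trainer.retrain(device.model.version))
after = device.is_anomalous(9500.0)

print("model version:", device.model.version)
print("flagged before update:", before, "after update:", after)
```

The version check on the device is the design choice worth noting: it lets updates be pushed repeatedly or out of order without an older model ever overwriting a newer one.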


Snapshots in time of operations data can be misleading, as well conditions change continuously. Algorithms without domain expertise and context are useless. Software and analytics without clean, high-resolution data are ineffective. Most E&Ps are trying to take shortcuts in lieu of the necessary investments, leaving their data science teams without the proper data, infrastructure, and tools to add value with analytics. That is understandable given the time and effort involved in following the ‘hard way’ outlined in this paper, but for an E&P going it alone, it is the only path that will result in the industry truly taking full advantage of the new wave of data science tools and capabilities available today.

We advise E&Ps to consider the benefits of working with companies like Ambyint that have already developed adaptive analytics solutions and have deep domain expertise from years of delivering solutions to oil and gas customers. Ambyint can help operators develop best-in-class capabilities in artificial lift analytics and optimization – adding significant value at the operational level – for a fraction of the time, effort, and cost of going it alone.