By Mike Ferguson

November 2021

Data Science Workbenches and Machine Learning Automation

In the last decade the rise of data science has been nothing short of spectacular. Today, almost every business has placed a priority on data and analytics in its determination to become data driven. Many companies have undergone what could only be described as a frenzy of activity in this area, with different business units buying many different technologies to start developing machine learning models. However, with so many advances in the field of machine learning and data science, we have seen technologies in this area leapfrog each other at a very rapid rate. The result is that many companies now have a broad range of tools and libraries of algorithms scattered around their organisation. For example, some teams may be using Jupyter notebooks and Spark MLlib with Python. Others may be using TensorFlow and Keras. R and RStudio are also popular, as are drag-and-drop data mining tools from vendors like KNIME, SAS, TIBCO and Dataiku. Then there is streaming analytics, which might be built using Kafka, Python and Flink. There are a multitude of options available, many not mentioned here, quite apart from emerging ones such as the rapidly growing Ray distributed computing framework, with its RLlib reinforcement learning library.

In that sense it is fair to say that most companies have a fractured and siloed approach to the development of machine learning models. Skills are thinly spread across multiple technologies, reuse is limited, development is slower than expected and maintenance has become a major and costly challenge. Feature engineering is another headache. Some data scientists have written their own code to clean and prepare data, for example in Python or R. Others may have used self-service data preparation tools like Trifacta or DataRobot Paxata, or even traditional ETL tools. Also, different data scientists have repeatedly gone back to the same data sources to produce the features needed to train their models. As a result, in many cases, re-invention has occurred, with people creating the same features again and again.

Looking at it from a corporate perspective, companies are eager to manage data science and multiple projects much better than they do today, and also to accelerate both the development and the deployment of machine learning models.

There are four key things that need to be done to accelerate development of machine learning models and strengthen the management of data science. These are:

  • Improve the operating model (organisation) to speed up development
  • Align project teams with business strategy objectives and facilitate collaborative development
  • Create a data catalogue and an analytics catalogue
  • Improve data science productivity by adopting a common data analytics platform that can integrate stand-alone analytics technologies

When it comes to operating models, discussion often centres on whether the organisational set-up should be centralised or decentralised. At present many companies are decentralised, with stand-alone data science teams scattered around the organisation. Both approaches have their problems, but what about a federated approach? Many companies are now looking at this as a way to bring together and coordinate multiple disparate data science teams. A federated approach includes a centre of excellence and a central programme office supporting and linking data science teams. It also means establishing information producers and information consumers, so that producers create trusted, reusable data and features to speed up the work of the data scientists who consume them. Things like a data catalogue / marketplace, a common data fabric for data and feature engineering, a feature store, and a common data science workbench with machine learning automation and MLOps would all help. Ideally, collaborative development on top of a common data fabric, a data catalogue and a common data science workbench could make a significant difference to shortening time to value.

Key requirements for an enterprise data science workbench supporting multiple teams would include:

  • End-to-end lifecycle support – model development, model deployment, model execution and model management
  • Ability to create projects that consist of data assets, data prep jobs, notebooks, analytical models and collaborations (discussions etc.) …
  • Ability to create communities and facilitate collaboration
  • Track all activity within a project using metadata to record what project team members are doing
  • Integration with a data catalogue to automatically discover, profile and help find data
  • Built-in integration with data preparation tools to prepare data
  • Support for a feature store to avoid unnecessary feature re-invention
  • Integration with multiple statistical and algorithmic libraries / frameworks
    • e.g., Google TensorFlow, PyTorch, Spark MLlib, H2O, XGBoost, scikit-learn, …, and visualisation libraries
  • Integrate notebooks into the workbench, e.g., Jupyter, Zeppelin, RStudio
  • Integration with code repositories (e.g., Git on GitHub, Bitbucket) and CI/CD technologies
  • Machine learning automation for rapid training, testing, evaluation and hyperparameter tuning of machine learning models
  • MLOps for rapid model deployment in multiple environments, automated model monitoring and model re-training
  • Support model management including model versioning
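
To make the feature store requirement above more concrete, the following is a minimal sketch in Python of the idea behind it: a feature is defined and registered once, and other teams find and reuse it rather than re-creating it. The class names, the feature name and the "marketing-ds" owner are all hypothetical, and the registry is held in memory purely for illustration; commercial feature stores also persist and serve the feature values themselves.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureDefinition:
    """A named, documented feature that any team can reuse (hypothetical structure)."""
    name: str
    description: str
    owner: str
    compute: Callable[[dict], float]  # derives the feature from one raw record


class FeatureStore:
    """Illustrative in-memory registry, not a real product API."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureDefinition] = {}

    def register(self, feature: FeatureDefinition) -> None:
        # Refuse duplicates so teams reuse a definition instead of re-inventing it.
        if feature.name in self._features:
            existing = self._features[feature.name]
            raise ValueError(
                f"feature '{feature.name}' already registered by {existing.owner}"
            )
        self._features[feature.name] = feature

    def search(self, keyword: str) -> List[str]:
        # Simple catalogue-style lookup over names and descriptions.
        keyword = keyword.lower()
        return sorted(
            f.name for f in self._features.values()
            if keyword in f.name.lower() or keyword in f.description.lower()
        )

    def build_row(self, record: dict, feature_names: List[str]) -> Dict[str, float]:
        # Compute the requested features for one raw record.
        return {name: self._features[name].compute(record) for name in feature_names}


store = FeatureStore()
store.register(FeatureDefinition(
    name="avg_order_value",
    description="Total spend divided by number of orders",
    owner="marketing-ds",
    compute=lambda r: r["total_spend"] / r["order_count"],
))

# A second team finds the existing feature and reuses it.
print(store.search("spend"))
row = store.build_row({"total_spend": 300.0, "order_count": 4}, ["avg_order_value"])
print(row)
```

Rejecting a duplicate registration is one possible design choice; it forces the second team to discover the existing definition, which is the behaviour a feature store is meant to encourage.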

Deploying machine learning models into multiple environments includes the need to deploy models as a service (API) with elastic execution, in a database (e.g., a data warehouse DBMS), in-Spark, in a streaming analytics environment at the edge to analyse real-time data and in an application as code. All of these are needed.
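
As an illustration of the first of these options, deploying a model as a service, here is a minimal sketch that uses only the Python standard library. The scoring function predict_churn, its "complaints" input, the 0.25 weighting and port 8080 are all assumptions made up for the example; in practice a workbench would generate and host the endpoint for a trained model and add elastic scaling, security and monitoring.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict_churn(features: dict) -> float:
    """Hypothetical stand-in for a trained model's scoring function."""
    score = 0.25 * features.get("complaints", 0)
    return min(score, 1.0)


def score_request(body: bytes) -> bytes:
    """Turn a JSON request body into a JSON response body."""
    features = json.loads(body)
    return json.dumps({"churn_probability": predict_churn(features)}).encode()


class ScoringHandler(BaseHTTPRequestHandler):
    """Exposes the model over HTTP POST so applications can call it as an API."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        response = score_request(self.rfile.read(length))
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)


def serve(port: int = 8080) -> None:
    # Blocks until interrupted; call this only when you want to run the service.
    HTTPServer(("localhost", port), ScoringHandler).serve_forever()


# The scoring logic can be exercised without starting the server.
print(score_request(b'{"complaints": 2}').decode())
```

Keeping the scoring logic in a plain function, separate from the HTTP handler, means the same code could in principle be reused for the other targets listed above, such as batch scoring or embedding in an application.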

In order to address these needs, new data science workbench tools have emerged that allow multiple teams to manage and organise data science projects on a single platform. These include products such as:

  • AWS SageMaker
  • Cloudera CDP Machine Learning
  • Google Vertex AI
  • IBM Watson Studio
  • Microsoft Azure Machine Learning
  • SAS Viya 4

Others like DataRobot and Dataiku are also popular.

In many cases these data science workbenches offer a range of services that govern the entire machine learning model life cycle. AWS, Google, IBM, and Microsoft all offer this.

For example, AWS SageMaker includes an integrated development environment (IDE), management of multiple projects, pipeline development, built-in data preparation, lineage tracking of ML workflows, a feature store, integration with Jupyter notebooks and GitHub, CI/CD, AutoML to train and test multiple algorithms in parallel, explainable AI, bias detection, model management, automated model deployment, and automated model monitoring, retraining and redeployment.

IBM Watson Studio also supports all of these capabilities. It integrates with RStudio and Jupyter Notebooks and has CI/CD workflows for managing code in GitHub or Bitbucket. It is also part of IBM Cloud Pak for Data (IBM’s data fabric offering), which allows data scientists direct access to the Watson Knowledge Catalog and many data connectors from within the Watson Studio IDE to quickly find and access data. You can create multiple teams, manage multiple projects, and speed up development of notebooks using its automatic code generation capability, which quickly generates Spark code to connect to data and bring it into a dataframe. There is also a drag-and-drop option for low-code / no-code model development if you don’t want to write code. Similar to AWS, it supports AutoML to automatically train and test multiple models to identify the most accurate, supports model management, and can deploy and automatically monitor, re-train and re-deploy models when accuracy drops below user-defined thresholds.
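
The idea of re-training when accuracy drops below a user-defined threshold can be sketched in a few lines of Python. This is not how any of the products above implement it; the AccuracyMonitor class, the window size and the 0.8 threshold are assumptions chosen only to show the principle of comparing recent predictions with actual outcomes.

```python
from collections import deque


class AccuracyMonitor:
    """Tracks rolling accuracy of a deployed model over the last `window` predictions."""

    def __init__(self, threshold: float, window: int = 100) -> None:
        self.threshold = threshold            # user-defined minimum acceptable accuracy
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect

    def record(self, predicted, actual) -> None:
        # Called whenever the true outcome for an earlier prediction becomes known.
        self.outcomes.append(1 if predicted == actual else 0)

    def accuracy(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 1.0

    def needs_retraining(self) -> bool:
        # Only judge once the window is full, to avoid reacting to a handful of results.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(threshold=0.8, window=5)
for predicted, actual in [(1, 1), (0, 0), (1, 0), (1, 0), (0, 0)]:
    monitor.record(predicted, actual)

# Three of the five predictions were correct, which is below the 0.8 threshold.
print(monitor.accuracy(), monitor.needs_retraining())
```

In an MLOps pipeline the True result would typically trigger a re-training job followed by re-testing and re-deployment, rather than just being printed.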

These are just two examples of new data science workbenches. Google has also just entered the market, and SAS, which has a major presence in Italy, is also in the mix. If you are unfamiliar with machine learning automation (also known as AutoML), it offers major productivity benefits, including automated feature engineering. AutoML can also automatically rank the importance of variables in a model in terms of their contribution to a prediction, which makes automatic variable selection possible. This, along with the ability to automatically determine how much data you need to maximise model accuracy, is a major benefit. AutoML also includes automatic training, testing and evaluation of multiple algorithms to get the best models. Finally, it can be used to automatically explain predictions (needed for compliance) as well as to automate model deployment, monitoring, re-training, retesting and model refresh.
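
The core loop of AutoML, training, testing and evaluating multiple algorithms to find the best one, can be illustrated with a deliberately tiny sketch. The three candidate models and the toy data below are invented for the example and are written in plain Python so that it runs without any libraries; real AutoML services search over far richer algorithms, hyperparameters and feature transformations.

```python
from collections import Counter


class MajorityClass:
    """Baseline: always predicts the most common training label."""
    def fit(self, X, y):
        self.label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.label


class DecisionStump:
    """Single threshold on a single feature, chosen to maximise training accuracy."""
    def fit(self, X, y):
        best = (-1.0, 0, 0.0)  # (accuracy, feature index, threshold)
        for j in range(len(X[0])):
            for t in sorted({row[j] for row in X}):
                acc = sum((row[j] > t) == bool(label) for row, label in zip(X, y)) / len(y)
                best = max(best, (acc, j, t))
        _, self.feature, self.threshold = best
        return self

    def predict(self, x):
        return int(x[self.feature] > self.threshold)


class NearestNeighbour:
    """1-NN: copies the label of the closest training point."""
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def auto_select(candidates, X_train, y_train, X_test, y_test):
    """Train every candidate, score it on held-out data, return a ranked leaderboard."""
    board = []
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        hits = sum(model.predict(x) == label for x, label in zip(X_test, y_test))
        board.append((hits / len(y_test), name))
    return sorted(board, reverse=True)


# Toy data: the label depends on the first variable; the second is noise.
X_train = [[1, 9], [2, 1], [3, 8], [5, 2], [6, 9], [7, 1], [8, 8], [9, 2]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [[2, 8], [4, 1], [6, 2], [9, 9]]
y_test = [0, 0, 1, 1]

candidates = {
    "majority_class": MajorityClass(),
    "decision_stump": DecisionStump(),
    "nearest_neighbour": NearestNeighbour(),
}
leaderboard = auto_select(candidates, X_train, y_train, X_test, y_test)
for accuracy, name in leaderboard:
    print(f"{name}: {accuracy:.2f}")
```

Here the decision stump comes out on top because it learns to use only the informative first variable, which also hints at how automatic ranking of variables can feed into model selection.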

All of this raises the level of abstraction of data science and lowers the bar on skillsets. For example, programming skills are no longer needed to prepare and integrate data. This is good news for many companies amid a global shortage of data scientists. It improves productivity, increases agility, reduces time to value and, most importantly, enables companies to turn many of their business analysts into ‘Citizen Data Scientists’.

There is no doubt that data science has come a long way in the last decade. The time is now right to industrialise this process. To do that, companies need to organise to succeed by appointing owners responsible for the contribution of analytics to business outcomes, and by tasking data science teams to develop models that help achieve strategic business objectives. Companies should also consider introducing a data catalogue and a common data fabric (data management platform) with self-service data preparation to find and prepare data. This helps get everyone on a common platform to maximise sharing and re-use, and to move away from the complexity of fractured stand-alone tools that are not integrated. In addition, getting teams to build models using a common, extensible data science workbench that supports automated ML, with automated model monitoring, retraining and refresh, significantly improves productivity and reduces time to value. It is also recommended that companies look towards publishing trusted data, trained and deployed models, BI reports and dashboards in a catalogue / marketplace so that they are easy to find, consume and use. This would also allow them to be rated in terms of business value. Finally, publishing models as a service and/or in a data warehouse makes it easy to integrate predictive and prescriptive analytics with applications, processes, websites, and BI tools.