Mike Ferguson

April 2017

Accelerating Data Science Using Machine Learning Automation

For many companies today, the demand for data and analytics is everywhere in the enterprise. Projects are underway to improve customer engagement, reduce risk and optimise business operations. Data sources are also growing rapidly, with new data of many different varieties coming from both inside and outside the enterprise. Yet although analytics are needed almost everywhere, current approaches to developing them are slow and expensive.


To support this demand, the modern analytical environment has expanded well beyond the traditional data warehouse to become an analytical ecosystem comprising multiple data stores and platforms optimised for different kinds of analytic workloads. This ecosystem includes data warehouses, NoSQL databases, Hadoop and cloud storage, and streaming analytics platforms made up of technologies such as Kafka, Apache Spark and Apache Flink. Some vendors, such as Cloudera and MapR, are trying to position themselves as a data platform that does all of these things on a single system.


In addition, companies have created data science teams to focus on specific business problems, analysing data on different underlying platforms such as Spark/Hadoop, stream processing systems and the analytical RDBMSs that underpin data warehouses. Data science teams are therefore operating across the analytical ecosystem on data housed in multiple data stores, in what is turning out to be a distributed logical data lake rather than a single centralised data store.


The demand for new data and new kinds of analytics on high-volume, multi-structured data has also created a need to scale at low cost, which in turn has triggered the rapid adoption of technologies such as Hadoop and Apache Spark to scale data and analytical processing.


While the adoption of these technologies was slow to begin with, they are now mainstream, with many different types of analytics being developed in data science projects and used to analyse data in both traditional and big data environments. The types of analysis being undertaken include supervised and unsupervised machine learning, text analysis, graph analysis, deep learning and artificial intelligence.


Probably the fastest growing of these is machine learning: the use of supervised machine learning to classify (predict) and unsupervised machine learning to describe patterns in data, e.g. to cluster similar customers into groups for customer segmentation. With supervised learning, data is first prepared and then fed into an algorithm to train a model to correctly predict an outcome. Examples would be predicting customer churn or equipment failure. The data is often split into training and test data, with a further subset held back altogether to see how a model performs on totally unseen data once trained. A number of algorithms can be used for prediction, and data scientists will typically develop multiple models, each with a different algorithm, and compare the results to find the most accurate.
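As a rough illustration of that workflow, the Python sketch below splits a synthetic churn dataset into training, test and held-back subsets, trains two candidate models and compares them. Everything in it is an assumption made for illustration: the data is randomly generated, the two "algorithms" (a majority-class baseline and a nearest-centroid classifier) are deliberately simple stand-ins, and names such as make_customers are hypothetical. A real project would normally use a machine learning library instead.

```python
import random

def make_customers(n, seed=0):
    """Synthetic ((support_calls, usage), churned) rows; illustrative only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        churned = rng.random() < 0.4
        # Assumption: churners make more support calls and use the service less.
        calls = rng.gauss(6.0 if churned else 2.0, 1.5)
        usage = rng.gauss(20.0 if churned else 35.0, 8.0)
        rows.append(((calls, usage), int(churned)))
    return rows

def split(rows, train=0.6, test=0.2, seed=1):
    """Shuffle, then cut into training, test and held-back subsets."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    a = int(len(rows) * train)
    b = a + int(len(rows) * test)
    return rows[:a], rows[a:b], rows[b:]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def train_majority(rows):
    """Baseline: always predict the most common label seen in training."""
    ones = sum(label for _, label in rows)
    majority = int(ones * 2 >= len(rows))
    return lambda x: majority

def train_nearest_centroid(rows):
    """Predict the label whose mean feature vector is closest."""
    centroids = {}
    for label in (0, 1):
        pts = [x for x, y in rows if y == label]
        centroids[label] = tuple(sum(col) / len(pts) for col in zip(*pts))
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

train_rows, test_rows, holdout_rows = split(make_customers(500))

# Train one candidate per algorithm on the training data only.
candidates = {
    "majority": train_majority(train_rows),
    "nearest_centroid": train_nearest_centroid(train_rows),
}

# Compare candidates on the test data, then check the winner on unseen data.
test_scores = {name: accuracy(m, test_rows) for name, m in candidates.items()}
best = max(test_scores, key=test_scores.get)
holdout_score = accuracy(candidates[best], holdout_rows)
print(test_scores, best, holdout_score)
```

The held-back score is computed only once, for the winning model, so it remains an honest check on data that played no part in training or model selection.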


Unsupervised machine learning is when an algorithm is run on data that has no known outcomes to train against, in order to find patterns in that data. Good examples are clustering to group together similar data and association detection for market basket analysis.
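To make clustering concrete, here is a minimal sketch of k-means, one common clustering algorithm, in plain Python. The customer data (visits per month, monthly spend) is synthetic, the two-segment structure is an assumption built into that data, and the function names are hypothetical. Note that no outcome labels are supplied anywhere: the algorithm only sees the feature values.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points, and repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Synthetic customers as (visits per month, monthly spend); illustrative only.
rng = random.Random(42)
customers = [(rng.gauss(2, 0.5), rng.gauss(20, 5)) for _ in range(50)]
customers += [(rng.gauss(10, 1), rng.gauss(200, 20)) for _ in range(50)]

centroids, segments = kmeans(customers, k=2)
for centre, members in zip(centroids, segments):
    print(f"segment centre={centre}, customers={len(members)}")
```

Because spend is on a much larger scale than visits, it dominates the distance calculation here. In practice features are usually rescaled before clustering.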


However, current approaches to developing machine learning models have run into a number of problems. For a start, there is a real shortage of skilled data scientists who, even if you can hire them, are likely to be head-hunted pretty quickly. Analytical model development is also often slow, especially on big data platforms, where data scientists often prefer to develop everything manually by writing code in popular languages like Java, R, Scala and Python. As a result, data science becomes a bottleneck, with a backlog building up of analytics that still need to be built.


The cost of development is also higher than it should be because data science is often fractured, with different people using inconsistent development approaches and different libraries and tools. This spreads skills thinly across too many technologies and limits re-use and sharing of datasets and models. Maintenance then becomes a problem, costs increase and there is no integrated programme of analytical activity. All of this limits agility and adds complexity. In addition, because people are held up working through the aforementioned backlog, existing analytics running in production can become stale.


Furthermore, fractured and overly complex data science can also lead to unmanaged self-service data preparation if teams all adopt different approaches to preparing data. This is especially the case if everything is hand-coded with no metadata lineage and no way to know how data was prepared. The problem is that this encourages re-invention rather than reuse, which is why a governed environment is preferable. Productivity and governance also suffer, and so time to value is impacted. The backlog of analytics yet to be built also continues to grow, and we lose the ability to prevent, optimise and respond. Opportunities are therefore missed and business problems become unavoidable.


The question is how do you solve this? The answer is to accelerate data science by automating the development of predictive models. This is sometimes referred to as machine learning automation. New technologies are emerging to do this, e.g. DataRobot, Tellmeplus and IBM Data Science Experience. Google is also headed down this road. Machine learning automation allows you to rapidly build and compare predictive models and so enables less-skilled business analysts to become ‘citizen data scientists’.

It also allows you to integrate automatically built and custom-built models into a common champion/challenger programme so that you can co-ordinate and govern all machine learning projects from a single place.
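The champion/challenger idea can be sketched in a few lines of Python. This is only one possible reading of it, under stated assumptions: the model names and accuracy figures are made up for illustration, every model is assumed to have been scored on the same validation data, and the min_gain margin is a hypothetical parameter rather than a feature of any particular product.

```python
def pick_champion(champion, challengers, scores, min_gain=0.01):
    """Return the model that should be in production: the current champion,
    unless the best challenger beats it by at least min_gain."""
    best = max(challengers, key=scores.get, default=champion)
    if scores[best] - scores[champion] >= min_gain:
        return best
    return champion

# Made-up validation accuracies, all measured on the same data.
scores = {
    "hand_built_v3": 0.81,  # custom-built model, the current champion
    "auto_gbm": 0.84,       # automatically built challenger
    "auto_logit": 0.79,     # automatically built challenger
}

new_champion = pick_champion("hand_built_v3", ["auto_gbm", "auto_logit"], scores)
print(new_champion)
```

Requiring a minimum gain before promoting a challenger avoids swapping production models over differences that are too small to matter.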


If you are looking to evaluate tools that automate machine learning, some of the key requirements you might want to consider for this kind of technology include:

  • Project management and alignment with business strategy
  • Collaborative development and sharing
  • Integration with an information catalog to make it easy for data scientists to find data
  • Providing access to data on big data and small data platforms inside or outside the enterprise
  • Support for easy exploration and profiling of data
  • Built-in and integrated 3rd party self-service data preparation
  • Ability to automate or manually define datasets to train and validate models
  • The ability to automate variable selection for input to a machine learning algorithm
  • The ability to automatically create, train and validate multiple candidate models using different algorithms to find the best predictor
  • Support for interactive training for better accuracy
  • The ability to include algorithms from 3rd party libraries to integrate technologies and train candidate models from a common platform
  • The ability to integrate custom models built in various languages in interactive notebooks like Zeppelin and Jupyter
  • The ability to easily compare the predictive accuracy of multiple candidate models
  • The ability to select and easily deploy models for consumption by other applications and tools
  • The ability to easily deploy models to run in different executing environments e.g. in-cloud, in-Spark, in-database, in-Hadoop or in-stream
  • The ability to set up a machine learning ‘factory’ to not only industrialise development but also to automate the monitoring of model accuracy and the maintenance and refresh of existing models
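
To give a feel for the last requirement, automated monitoring of model accuracy, here is a minimal Python sketch. It assumes that actual outcomes eventually become known for each prediction, and it flags a model for refresh when accuracy over a rolling window drops below a threshold. The class name, window size and threshold are all hypothetical choices for illustration, not taken from any specific tool.

```python
from collections import deque

class AccuracyMonitor:
    """Tracks accuracy over the most recent predictions of a deployed model."""

    def __init__(self, window=100, threshold=0.75):
        self.outcomes = deque(maxlen=window)  # True where prediction was right
        self.threshold = threshold

    def record(self, predicted, actual):
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_refresh(self):
        # Only judge the model once a full window of outcomes is available.
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.threshold

monitor = AccuracyMonitor(window=50, threshold=0.8)

# Fifty correct predictions: the model looks healthy.
for _ in range(50):
    monitor.record(predicted=1, actual=1)
healthy = monitor.needs_refresh()

# Twenty wrong predictions push the oldest results out of the window.
for _ in range(20):
    monitor.record(predicted=1, actual=0)
degraded = monitor.needs_refresh()

print(healthy, degraded, monitor.accuracy())
```

In a machine learning ‘factory’, a positive needs_refresh() result would be the trigger for retraining the model on fresh data and re-running the candidate comparison.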

If you are using machine learning to develop new analytical models, I would recommend that you look at machine learning automation, not only to accelerate model development and reduce time to value but also to monitor what has been built and automatically keep it fresh while you focus attention on new problems.