Get Ahead of Automated Machine Learning (AutoML) to Accelerate Your AI Roadmap
Being great at data science, and using it to keep your business ahead of the competition, is finally becoming more affordable and less complex to manage as open source technology becomes commonplace.
In this POV, we’ll explore the emergence of Automated Machine Learning (AutoML), which is making it much more feasible to use machine learning algorithms to develop machine learning algorithms. This is how quickly the AI industry is progressing today.
We are already seeing the data science community explore ways to make analytics and machine learning tasks cheaper, faster, easier, and increasingly autonomous and self-remediating. Business leaders should prepare for automated data science to become commonplace – not necessarily as a way to entirely replace data scientists, but to significantly boost their capabilities and provide a starting point for ML. AutoML is a step in this direction.
Automated data science and AutoML are fast becoming a reality
The data science lifecycle covers a wide range of tasks that data scientists perform to extract insights and knowledge from data. In any project, teams need to explore and clean data, select, test, train, and tune algorithms, and then monitor and maintain performance over time (see Exhibit 1).
Exhibit 1: The Data Science Process
Source: CRISP-DM, 2018
Automated data science (AutoDS) is a dizzying new development in the AI technology industry that promises to automate some or all of these tasks for two key benefits:
- Opening up the almost-mystical world of data science, including machine learning, to non-data scientists, and potentially, non-programmers
- Easing the burden of hiring scores of data scientists, who are among the most sought-after talent across enterprises
Enterprises are struggling to attract and then retain data science talent across global markets. Some of the industry's biggest challenges thus center on improving the productivity and quality of work of data science teams, who spend an inordinate amount of time on low-value work in areas like data preparation. The industry is responding by focusing on automation and technology enablement to improve processes across the lifecycle, automate repetitive tasks, and get the most out of the available talent.
Machine learning is one of the techniques that data scientists can deploy to solve business problems using data, making automated machine learning (or AutoML) a subset of the automated data science movement. Automated data science and AutoML have the potential to accelerate the adoption of these technologies within businesses, with lowered costs and higher productivity of data science teams.
Not every ML task can actually be automated
Automation can be applied to multiple tasks in the data science process. The biggest areas of attention today are the following:
- Data preprocessing and exploratory analysis. Data preprocessing converts raw data into clean data sets that can be used for analytical modeling. This stage is one of the most repetitive and thus a great candidate for automation, including tasks like data cleaning and integration. Once data scientists complete this stage, they undertake exploratory data analysis to visualize data before modeling. Automation tools can be used to auto-generate variables and compare them against desired predictions.
- Model selection and hyperparameter tuning. Data scientists need to select from hundreds of algorithms to find the ones best suited to the business problem at hand and the data available. They also need to tune hyperparameters for ML projects. Hyperparameters are the variables set before training, such as the number of layers in a neural network or the learning rate, that determine the model's structure and how it is trained. Both these tasks can be automated by frameworks that can recommend suitable models and optimize hyperparameters based on an adaptive knowledge bank.
- Feature engineering, extraction, and selection. Feature engineering is an integral part of applied machine learning, where data science teams use their domain understanding to create “features” that make ML models work. For many use cases, these features can be generated and selected automatically.
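To make the steps listed above more concrete, here is a minimal sketch of automated preprocessing followed by automated model selection and hyperparameter tuning. It uses only the Python standard library and is not how any commercial AutoML product works. The function names (`impute_median`, `auto_select`), the synthetic two-class dataset, and the two candidate models (k-nearest neighbors and nearest centroid) are all assumptions chosen for brevity.

```python
import random
import statistics
from collections import Counter


def impute_median(rows):
    """Preprocessing: replace None entries with the median of their column."""
    columns = list(zip(*rows))
    medians = [statistics.median(v for v in col if v is not None) for col in columns]
    return [[medians[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    order = sorted(range(len(train_x)), key=lambda i: squared_distance(train_x[i], point))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def centroid_predict(train_x, train_y, point):
    """Assign `point` to the class whose mean feature vector is closest."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroid = [statistics.mean(col) for col in zip(*members)]
        dist = squared_distance(centroid, point)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def accuracy(predict, val_x, val_y):
    return sum(predict(x) == y for x, y in zip(val_x, val_y)) / len(val_y)


def auto_select(train_x, train_y, val_x, val_y):
    """Score every candidate (model, hyperparameters) pair on held-out data
    and return the best as a (score, model_name, params) tuple."""
    candidates = [("knn", {"k": k}) for k in (1, 3, 5, 7)] + [("centroid", {})]
    results = []
    for name, params in candidates:
        if name == "knn":
            predict = lambda p, k=params["k"]: knn_predict(train_x, train_y, p, k)
        else:
            predict = lambda p: centroid_predict(train_x, train_y, p)
        results.append((accuracy(predict, val_x, val_y), name, params))
    return max(results, key=lambda r: r[0])


def make_data(n, seed):
    """Hypothetical toy data: two Gaussian clusters centered at (0, 0) and (3, 3)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        label = rng.choice([0, 1])
        center = 0.0 if label == 0 else 3.0
        xs.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        ys.append(label)
    return xs, ys


train_x, train_y = make_data(120, seed=1)
val_x, val_y = make_data(40, seed=2)

# Simulate dirty raw data by blanking a few training values, then clean it.
for i in range(0, len(train_x), 10):
    train_x[i][0] = None
train_x = impute_median(train_x)

best_score, best_name, best_params = auto_select(train_x, train_y, val_x, val_y)
print(best_name, best_params, round(best_score, 3))
```

An exhaustive search like this only works for a handful of candidates. Larger search spaces usually call for random or adaptive search, and the choice of candidate models and validation data still depends on human judgement.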
Multiple repetitive tasks in the data science process can be automated to some extent. However, it is important to note that there is no “end-to-end” automated data science solution that can expertly execute all data science activities. Setting up impactful ML and other data science projects requires a meaningful combination of domain knowledge, human judgement, and technology.
Related advancements in the research community include transfer learning and the availability of pre-trained models.
The open source community, tech majors, and a select few startups are fueling AutoML development
Enterprises will typically invest in analytics and machine learning platforms and tools that enable data scientists to go through the data science lifecycle and deploy models into production environments. The introduction of automated data science is thus quickly becoming the territory of the major and emerging analytics and ML platforms. Simultaneously, the data science community is advancing AutoML through open source libraries such as auto-sklearn. Some of the key players in this fast-developing market niche are:
- Google announced its Cloud AutoML suite back in January 2018, with CEO Sundar Pichai setting a broad vision, stating, “We hope AutoML will take an ability that a few PhDs have today and will make it possible in three to five years for hundreds of thousands of developers to design new neural nets for their particular needs.” Google Cloud AutoML will thus enable you to train your own self-learning model using tools within Google’s own cloud ecosystem. The cloud provider has wisely prioritized user friendliness with this suite, which features an easy GUI and drag-and-drop functionalities. As of today, AutoML is available in beta for training vision, natural language, and translation models.
- Baidu, the Chinese search leader, recently launched its Google Cloud AutoML competitor, EZDL. This “no code” service platform also features a simple drag-and-drop interface to allow non-programmers to design and build custom deep learning models. Baidu’s focus seems to differ slightly from Google’s, in that it is promising model performance even with “a small amount of data.” Baidu has released image and sound classification at the outset, and hopes that EZDL will bring AI to small and medium businesses.
- RapidMiner Auto Model is the latest product release from this analytics platform provider. Built to work with RapidMiner Studio, this solution promises predictive models “in 4 clicks.” The most compelling aspect of RapidMiner's approach to AutoML is its focus on transparency, both in the auto-generated model and in the method used to create it. RapidMiner Auto Model produces a flowchart-like process that can be inspected, fine-tuned, and tested by analysts and data scientists.
- Founded in 2014, SigOpt has specifically chosen the automation of hyperparameter optimization as its core purpose. SigOpt automates the tuning of a model’s hyper, feature, and architecture parameters. This fine-tuning stage can sometimes result in significant performance gains, but it can also be time-consuming and ultimately fruitless for data scientists. That makes it a perfect candidate for automation, and SigOpt has built APIs that can be integrated easily into most data science workflows and platforms.
Exhibit 2: SigOpt’s Automation-Led Model Improvement
Source: SigOpt, 2018
AutoML has a growing place in AI, with implications for both existing and new adopters
The scope of automated data science and AutoML will only continue to expand, not least with the forces of Google and technology majors behind the movement. The open source community is constantly updating the multiple libraries available, and more and more commercial AutoML products are making their way into the market. However, the industry has several challenges to overcome for this concept to gain more maturity:
- Black box solutions: Some AutoML products operate as black box solutions, which limit the visibility and control that ML teams can exert. They also present a challenge when models need to comply with data protection and other regulations, particularly around customer datasets. XAI, or explainable AI, is an increasingly important feature of many AI practices that AutoML solutions will need to address.
- Canned solutions for AutoML are unlikely to be suitable for all types of model development and client environments. Complex use cases, “data swamps”, and opportunistic data exploration are common scenarios that AutoML solutions cannot address in their current state.
- Some data science tasks are much harder to automate, such as the process of selecting and testing hypotheses. “End-to-end” automation is thus not a realistic outcome for the data science process.
- Further, data science is a challenging discipline to incubate within an organization. Data science, analytics, and ML leaders and practitioners spend a lot of time on non-technical tasks including overall program management, setting and managing expectations, identifying and bringing in domain experts, developing use cases, and communicating technical results to business leaders. While you may be able to automate some aspects of data preparation and model building, the key to making business impact with AI goes beyond pure technology performance and productivity.
Think of AutoML as a way to remove rote tasks from data scientists and create a starting point for ML for non-data scientists. Over time, this will enable organizations to build broader, more diversified teams that combine data scientists with other resources. As a result, you might reduce the heavy reliance on data scientists and use their talents for the most complex tasks, where they are needed most.
Bottom line: You’re not going to replace data scientists anytime soon, but AutoML will make AI easier over time
Despite the challenges, automated data science and automated machine learning have great potential to improve the democratization of AI tools and speed up the laborious process of deriving insights from data. If your organization has an established AI-focused data science team, you will most likely benefit from investing in AutoML tools such as SigOpt or auto-sklearn that can ease the workload of analysts and data scientists. If you are new to AI-focused data science, AutoML options such as Baidu’s EZDL can help non-programmers start to explore the data science process and create an appetite for AI.