AutoML: Getting started

If you are thinking about using AutoML, you may have questions about how it works and what steps you should take to get started. This section dives deeper into common AutoML patterns, explores how AutoML works, and examines what steps you may need to take before you begin using AutoML for your project.

AutoML tools

AutoML tools fall into two main categories:

- No-code tools are typically web applications that let you configure and run experiments through a user interface to find the best model for your data.
- API and CLI tools provide advanced automation features, but require more (sometimes significantly more) programming and ML expertise.

AutoML tools that require coding can be more powerful and more flexible than no-code tools, but they can also be more difficult to use. This module focuses on the no-code options for model development, but be aware that API and CLI options can help if you require customized automation.

AutoML workflow

Let's walk through a typical ML workflow and see how things work when you use AutoML. The high-level steps in the workflow are the same as those you use for custom training; the main difference is that AutoML handles some tasks for you.

Problem definition

The first step in any ML workflow is to define your problem. When you are using AutoML, ensure that the tool you choose can support the objectives of your ML project. Most AutoML tools support a variety of supervised machine learning algorithms and input data types.

For more information about problem framing, take a look at the module on Introduction to Machine Learning Problem Framing.

Data gathering

Before you can start working with an AutoML tool, you need to collect your data into a single data source. Check the product documentation to make sure that your tool supports your data source, the data types in your dataset, and the size of your dataset.
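
Checks like these can be scripted before you upload anything. The following is a minimal sketch using only the Python standard library. The limits (`SUPPORTED_EXTENSIONS`, `MIN_ROWS`, `MAX_ROWS`) and the helper names are hypothetical placeholders, not values from any real AutoML product; replace them with what your tool's documentation states.

```python
import csv
import io

# Hypothetical limits for illustration; use the values from your tool's documentation.
SUPPORTED_EXTENSIONS = (".csv",)
MIN_ROWS = 1000
MAX_ROWS = 1_000_000


def infer_type(values):
    """Guess whether a column is numeric or text by trying to parse every value."""
    try:
        for value in values:
            float(value)
        return "numeric"
    except ValueError:
        return "text"


def summarize_dataset(filename, csv_text):
    """Summarize the source format, size, and column types of a CSV dataset."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = rows[0].keys() if rows else []
    return {
        "format_supported": filename.lower().endswith(SUPPORTED_EXTENSIONS),
        "size_supported": MIN_ROWS <= len(rows) <= MAX_ROWS,
        "column_types": {c: infer_type([row[c] for row in rows]) for c in columns},
    }


sample = "sqft,city,price\n1200,Austin,250000\n900,Denver,180000\n"
print(summarize_dataset("houses.csv", sample))
```

With the two-row sample, the size check fails against the assumed minimum, which is the kind of mismatch you want to catch before starting an import.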

Data preparation

Data preparation is an area where AutoML tools can help, but no tool can do everything automatically, so expect to do some work before you import your data into the tool. Data preparation for AutoML is similar to what you would do to train a model manually. To learn more about preparing data for training, see the Data Preparation section and the Working with Numerical Data and Working with Categorical Data modules.

Before importing your data for AutoML training, you need to complete these steps:

Label your data

Every example in your dataset needs a label.

Clean and format data

Real-world data tends to be messy, so expect to clean your data before using it. Even with AutoML, you need to determine the best treatments for your particular dataset and problem. This might require some exploration and potentially multiple AutoML runs before you get the best results.

Perform feature transformations

Some AutoML tools handle certain feature transformations for you. However, if the tool you are using does not support a feature transform that you need, or does not support it well, you may need to perform the transformations ahead of time.
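
As an illustration of the labeling and cleaning steps above, here is a minimal sketch in plain Python. The function name `prepare_rows` and the specific treatments (trimming whitespace, lowercasing text, dropping unlabeled rows and exact duplicates) are assumptions chosen for the example; the right treatments depend on your dataset and problem.

```python
def prepare_rows(rows, label_key):
    """Normalize text values, then drop unlabeled rows and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Normalize strings so that " Great " and "great" are treated as the same value.
        row = {k: v.strip().lower() if isinstance(v, str) else v for k, v in row.items()}
        # Every example needs a label, so skip rows where it is missing or empty.
        if row.get(label_key) in (None, ""):
            continue
        # Use the sorted (key, value) pairs as a fingerprint to detect duplicates.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        cleaned.append(row)
    return cleaned


raw = [
    {"text": " Great ", "label": "pos"},
    {"text": "great", "label": "pos"},   # duplicate after normalization
    {"text": "bad", "label": ""},        # empty label
    {"text": "meh", "label": None},      # missing label
]
print(prepare_rows(raw, "label"))
```

Only one row survives in this toy input, which shows how much a dataset can shrink once unlabeled and duplicate examples are removed.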

Model development (with a no-code AutoML)

AutoML does the work for you during training. However, before you start training, you need to configure your experiment. To set up an AutoML training run, you typically need to complete these high-level steps:

Import your data

To import your data, specify your data source. During the import process, the AutoML tool assigns a semantic data type to each feature.

Analyze your data

AutoML products usually provide tools to analyze your dataset before and after training. As a best practice, use these analysis tools to understand and verify your data before starting an AutoML run.

Refine your data

AutoML tools often provide mechanisms to help you refine your data after importing and before training. Here are a few tasks you may want to complete to refine your data:
- Semantic Checking: During import, AutoML tools try to determine the correct semantic type for each feature, but these are only guesses. Check the type assigned to each feature and change any that were assigned incorrectly.
  
  For example, you may have postal codes stored as numbers in a column in your database. Most AutoML systems would detect this data as continuous numeric data. That is incorrect for a postal code, so you would probably want to change the semantic type of this feature column from continuous to categorical.
- Transformations: Some tools allow users to customize data transformations as part of the refinement process. Sometimes this is needed when a dataset has potentially predictive features that need to be transformed or combined in a way that is difficult for AutoML tools to determine without help.
  
  For example, consider a housing dataset that you are using to predict the sale price for a house. Suppose there is a feature called description that holds the text of a house listing, and you would like to use this data to create a new feature called description_length. Some AutoML systems offer ways to use custom transformations. For this example, there might be a LENGTH function to generate the new feature like this: LENGTH(description).
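
If your tool does not offer these refinements, you can apply both examples above to the data before import. The sketch below is plain Python with made-up rows; the function name `refine` is hypothetical, and `len()` stands in for the LENGTH function that an AutoML tool might provide.

```python
rows = [
    {"postal_code": 94043, "description": "Sunny two-bedroom near the park"},
    {"postal_code": 10001, "description": "Compact studio"},
]


def refine(row):
    """Apply the semantic-type fix and the custom transformation to one row."""
    refined = dict(row)
    # Store the postal code as a string so it is read as a category,
    # not as a continuous number.
    refined["postal_code"] = str(row["postal_code"])
    # Python equivalent of LENGTH(description).
    refined["description_length"] = len(row["description"])
    return refined


refined_rows = [refine(row) for row in rows]
print(refined_rows[1])
```

Storing the postal code as text is one simple way to signal a categorical feature; many tools also let you override the detected type directly in the user interface.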
Configure AutoML run parameters

The last step before running your training experiment is to choose a few configuration settings to tell the tool how you want it to train your model. Though each AutoML tool has its own unique set of configuration options, here are a few of the significant configuration tasks you may need to complete:
- Select the ML problem type you plan to solve. For example, are you solving a classification or regression problem?
- Select which column in your dataset is the label.
- Select the set of features to use to train the model.
- Select the set of ML algorithms AutoML considers in the model search.
- Select the evaluation metric AutoML uses to choose the best model.
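
To make the settings in this list concrete, here is a minimal sketch that collects them in one place and checks them for common mistakes. The class `AutoMLRunConfig`, its field names, and the algorithm and metric names are all hypothetical; they do not correspond to any specific AutoML product.

```python
from dataclasses import dataclass, field


@dataclass
class AutoMLRunConfig:
    """Hypothetical container for the settings an AutoML run typically needs."""
    problem_type: str                 # "classification" or "regression"
    label_column: str                 # column the model should predict
    feature_columns: list             # columns used as model inputs
    algorithms: list = field(default_factory=lambda: ["linear", "boosted_trees"])
    evaluation_metric: str = "rmse"   # metric used to rank candidate models

    def validate(self, dataset_columns):
        """Return a list of configuration errors; empty if none were found."""
        errors = []
        if self.problem_type not in ("classification", "regression"):
            errors.append(f"unknown problem type: {self.problem_type}")
        if self.label_column not in dataset_columns:
            errors.append(f"label column not in dataset: {self.label_column}")
        if self.label_column in self.feature_columns:
            errors.append("label column must not also be a feature")
        for column in self.feature_columns:
            if column not in dataset_columns:
                errors.append(f"feature column not in dataset: {column}")
        return errors


config = AutoMLRunConfig(
    problem_type="regression",
    label_column="price",
    feature_columns=["sqft", "postal_code", "description_length"],
)
print(config.validate(["sqft", "postal_code", "description_length", "price"]))
```

The check that the label is not also listed as a feature guards against a common source of label leakage.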

After configuring your AutoML experiment, you are ready to start the training run. Training may take a while to complete (on the order of hours).

Evaluate model

After training, you can examine the results by using the tools your AutoML product provides to help you:

- Evaluate your features by examining feature importance metrics.
- Understand your model by examining the architecture and hyperparameters used to build it.
- Evaluate top-level model performance with plots and metrics collected during training for the output model.
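
To give a sense of what a feature importance metric measures, the sketch below implements permutation importance, one common technique: shuffle one feature at a time and record how much the model's error grows. This is an illustrative assumption, not a description of how any particular AutoML product computes importance. The toy model and data are made up.

```python
import random


def mean_abs_error(model, X, y):
    """Average absolute difference between predictions and targets."""
    return sum(abs(model(row) - target) for row, target in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Return, per feature, the mean increase in error after shuffling that feature."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, X, y)
    importances = []
    for col in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the dataset with only this column permuted.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            increases.append(mean_abs_error(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances


def toy_model(row):
    """Toy model whose prediction depends only on the first feature."""
    return 100 * row[0]


X = [[10, 3], [20, 1], [30, 2], [40, 5]]
y = [toy_model(row) for row in X]
scores = permutation_importance(toy_model, X, y)
print(scores)
```

Because the toy model ignores the second feature, shuffling it leaves the error unchanged and its importance is zero, while the first feature gets a positive score.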

Productionization

Though it is outside the scope of this module, some AutoML systems can help you test and deploy your model.

Retrain model

You might need to retrain the model with new data. This might happen after you evaluate your AutoML training run or after your model is in production for some time. Either way, AutoML systems can help with retraining too. It is not uncommon to take another look at your data after an AutoML run, and retrain with an improved dataset.

What's next

Congratulations on finishing this module!

We encourage you to explore the various MLCC modules at your own pace and interest. If you'd like to follow a recommended order, we suggest that you move to the following module next: ML Fairness.
