Experiments

Experiments drive a project toward viability. Each experiment is a testable, reproducible hypothesis. When running experiments, the goal is to make continual, incremental improvements by evaluating a variety of model architectures and features. When experimenting, you'll want to do the following:

  • Determine baseline performance. Start by establishing a baseline metric. The baseline acts as a measuring stick to compare experiments against.

    In some cases, the current non-ML solution can provide the first baseline metric. If no solution currently exists, create an ML model with a simple architecture and a few features, and use its metrics as the baseline.

  • Make single, small changes. Make only a single, small change at a time, for example, to the hyperparameters, architecture, or features. If the change improves the model, that model's metrics become the new baseline to compare future experiments against.

    The following are examples of experiments that make a single, small change:

    • Include feature X.
    • Use 0.5 dropout on the first hidden layer.
    • Take the log transform of feature Y.
    • Change the learning rate to 0.001.
  • Record the progress of the experiments. You'll most likely need to run lots of experiments. Experiments with poor (or neutral) prediction quality compared to the baseline are still useful to track: they signal which approaches won't work. Because progress is typically non-linear, it's important to show that you're working on the problem by highlighting all the approaches you found that don't work, in addition to your progress at improving on the baseline quality. A minimal sketch of this workflow follows the list.
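
The following is a minimal sketch of this workflow, using scikit-learn and synthetic data purely for illustration; the models, metric, and experiment names stand in for whatever your project actually uses.

```python
# Sketch of an experiment loop: establish a baseline, run single-change
# experiments, and record every result. scikit-learn and synthetic data are
# used purely for illustration.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=0)

def evaluate(model):
    """Train the model and return the evaluation metric (AUC here)."""
    model.fit(X_train, y_train)
    return roc_auc_score(y_eval, model.predict_proba(X_eval)[:, 1])

# 1. Establish the baseline: a simple model with default settings.
baseline_auc = evaluate(LogisticRegression(max_iter=1000))

# 2. Run experiments that each change a single hyperparameter.
experiments = {
    "stronger_regularization": LogisticRegression(C=0.1, max_iter=1000),
    "weaker_regularization": LogisticRegression(C=10.0, max_iter=1000),
}

# 3. Record every result, even those that don't beat the baseline.
for name, model in experiments.items():
    auc = evaluate(model)
    improved = auc > baseline_auc
    print(f"{name}: AUC={auc:.4f} baseline={baseline_auc:.4f} improved={improved}")
    if improved:
        baseline_auc = auc  # the improved metric becomes the new baseline
```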

Because each full training on a real-world dataset can take hours (or days), consider running multiple independent experiments concurrently to explore the space quickly. As you continue to iterate, you'll hopefully get closer and closer to the level of quality you'll need for production.

Noise in experimental results

Note that you might encounter noise in experimental results that isn't caused by changes to the model or the data, making it difficult to determine whether a change you made actually improved the model. The following are examples of things that can produce noise in experimental results:

  • Data shuffling: The order in which the data is presented to the model can affect the model's performance.

  • Variable initialization: The way in which the model's variables are initialized can also affect its performance.

  • Asynchronous parallelism: If the model is trained using asynchronous parallelism, the order in which the different parts of the model are updated can also affect its performance.

  • Small evaluation sets: If the evaluation set is too small, it may not be representative of the overall performance of the model, producing uneven variations in the model's quality.

Running an experiment multiple times helps confirm experimental results.
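
One way to gauge this noise is to repeat the same experiment under several random seeds and look at the spread of the metric. The following sketch assumes scikit-learn and synthetic data, purely for illustration:

```python
# Sketch: repeat the same experiment with several random seeds to estimate how
# much metric variation is noise. scikit-learn and synthetic data are used
# purely for illustration.
import statistics

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

scores = []
for seed in range(5):
    # The seed varies both the data shuffling/split and the model's randomness.
    X_train, X_eval, y_train, y_eval = train_test_split(X, y, random_state=seed)
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_eval, model.predict(X_eval)))

mean, spread = statistics.mean(scores), statistics.stdev(scores)
print(f"accuracy = {mean:.4f} ± {spread:.4f} over {len(scores)} runs")
# An "improvement" smaller than this spread may just be noise.
```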

Align on experimentation practices

Your team should have a clear understanding of what exactly an "experiment" is, with a defined set of practices and artifacts. You'll want documentation that outlines the following:

  • Artifacts. What are the artifacts for an experiment? In most cases, an experiment is a tested hypothesis that can be reproduced, typically by logging the metadata (like the features and hyperparameters) that indicates what changed between experiments and how the change affected model quality. A sketch of one such logging record follows this list.

  • Coding practices. Will everyone use their own experimental environments? How possible (or easy) will it be to unify everyone's work into shared libraries?

  • Reproducibility and tracking. What are the standards for reproducibility? For instance, should the team use the same data pipeline and versioning practices, or is it OK to show only plots? How will experimental data be saved: as SQL queries or as model snapshots? Where will the logs from each experiment be documented: in a doc, a spreadsheet, or a CMS for managing experiments?
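
As one illustration of what such an artifact could look like, the following sketch appends each experiment's metadata to a shared JSONL log. The field names and example values are hypothetical, not a standard; adapt them to whatever your team agrees on.

```python
# Sketch: append each experiment's metadata to a shared JSONL log so it can be
# reproduced and compared later. Field names and values are illustrative only.
import json
import time
from pathlib import Path

def log_experiment(log_path: Path, *, name: str, data_version: str,
                   features: list[str], hyperparameters: dict, metrics: dict) -> None:
    """Append one experiment record to an append-only JSONL log."""
    record = {
        "name": name,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "data_version": data_version,        # e.g. a dataset snapshot or pipeline version
        "features": features,                # which features were included
        "hyperparameters": hyperparameters,  # what changed in this experiment
        "metrics": metrics,                  # how the change affected model quality
    }
    with log_path.open("a") as f:
        f.write(json.dumps(record) + "\n")

# Example record for a single-change experiment (values are made up).
log_experiment(
    Path("experiments.jsonl"),
    name="dropout_0.5_first_hidden_layer",
    data_version="2024-05-01-snapshot",
    features=["feature_x", "log_feature_y"],
    hyperparameters={"dropout": 0.5, "learning_rate": 0.001},
    metrics={"auc": 0.871, "baseline_auc": 0.864},
)
```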

Wrong predictions

No real-world model is perfect. How will your system handle wrong predictions? Begin thinking early on about how to deal with them.

A best practice is to design the system so that users can correctly label wrong predictions. For example, mail apps capture misclassified email by logging the mail users move into their spam folder, as well as the reverse. By capturing ground truth labels from users, you can design automated feedback loops for data collection and model retraining.
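
The following sketch shows one way such a feedback loop might capture corrections. The function, fields, and local-file storage are hypothetical; a production system would more likely log to a managed data store.

```python
# Sketch of the feedback loop described above: a user's move into (or out of)
# the spam folder is logged as a ground-truth label for the next retraining
# run. Function names, fields, and local-file storage are hypothetical.
import json
from pathlib import Path

CORRECTIONS_LOG = Path("label_corrections.jsonl")

def record_user_correction(message_id: str, predicted_label: str, user_label: str) -> None:
    """Store a user correction as a labeled example for retraining."""
    event = {
        "message_id": message_id,
        "predicted_label": predicted_label,  # what the model predicted
        "user_label": user_label,            # what the user's action implies
    }
    with CORRECTIONS_LOG.open("a") as f:
        f.write(json.dumps(event) + "\n")

# User moved a message the model left in the inbox into the spam folder.
record_user_correction("msg-123", predicted_label="not_spam", user_label="spam")
# User rescued a message the model sent to spam (the reverse case).
record_user_correction("msg-456", predicted_label="spam", user_label="not_spam")
```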

Note that although UI-embedded surveys capture user feedback, the data is typically qualitative and can't be incorporated into the retraining data.

Implement an end-to-end solution

While your team is experimenting on the model, it's a good idea to start building out parts of the final pipeline (if you have the resources to do so).

Establishing different pieces of the pipeline, like data intake and model retraining, makes it easier to move the final model to production. For example, setting up an end-to-end pipeline for ingesting data and serving predictions can help the team start integrating the model into the product and begin conducting early-stage user testing.
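
As a rough illustration, the following sketch stubs out the main stages of such a pipeline so each piece can be built and tested independently. The stage names and placeholder implementations are assumptions, not a prescribed design.

```python
# Sketch: skeletal end-to-end pipeline whose stages can be built and tested
# independently before the final model exists. All implementations are stubs.
def ingest_data(source: str) -> list[dict]:
    """Pull raw examples from the data source (stubbed)."""
    return [{"feature_x": 1.0, "label": 0}]

def train_model(examples: list[dict]):
    """Train (or retrain) the current best model (stubbed)."""
    return lambda features: 0.0  # placeholder model: always predicts 0.0

def serve_prediction(model, features: dict) -> float:
    """Serve a single prediction, e.g. behind an API endpoint."""
    return model(features)

# Wiring the stubs together lets integration work and early user testing start
# before the model itself is finished.
examples = ingest_data("path/to/raw-data")  # hypothetical data source
model = train_model(examples)
print(serve_prediction(model, {"feature_x": 2.0}))
```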

Troubleshooting stalled projects

You might find yourself in a scenario where a project's progress stalls. Maybe your team has been working on a promising experiment but hasn't had success improving the model for weeks. What should you do? The following are some possible approaches:

  • Strategic. You might need to reframe the problem. After spending time in the experimentation phase, you probably understand the problem, the data, and the possible solutions better. With a deeper knowledge of the domain, you can probably frame the problem more precisely.

    For instance, maybe you initially wanted to use linear regression to predict a numeric value. Unfortunately, the data wasn't good enough to train a viable linear regression model. Maybe further analysis reveals the problem can be solved by predicting whether an example is above or below a specific value. This lets you reframe the problem as a binary classification problem (see the sketch after this list).

    If progress is slower than expected, don't give up. Incremental improvements over time might be the only way to solve the problem. As noted earlier, don't expect the same amount of progress week over week. Often, getting a production-ready version of a model requires substantial amounts of time. Model improvement can be irregular and unpredictable. Periods of slow progress can be followed by spikes in improvement, or the reverse.

  • Technical. Spend time diagnosing and analyzing wrong predictions. In some cases, you can find the issue by isolating a few wrong predictions and diagnosing the model's behavior in those instances. For example, you might uncover problems with the architecture or the data. In other cases, getting more data can help. You might get a clearer signal that suggests you're on the right path, or it might produce more noise, indicating other issues exist in the approach.

    If you're working on a problem that requires human-labeled datasets, a labeled dataset for model evaluation might be hard to obtain. Find resources to get the datasets you'll need for evaluation.
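
As a concrete illustration of the reframing described in the Strategic bullet, the following sketch converts a numeric target into a binary label by thresholding it, then trains a classifier instead of a regressor. The threshold and synthetic data are illustrative only.

```python
# Sketch: threshold the numeric target into a binary label and train a
# classifier instead of a regressor. Threshold and data are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y_numeric = X @ rng.normal(size=5) + rng.normal(scale=2.0, size=2000)  # noisy numeric target

THRESHOLD = 0.0
y_binary = (y_numeric > THRESHOLD).astype(int)  # 1 if above the threshold, else 0

X_train, X_eval, y_train, y_eval = train_test_split(X, y_binary, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)
print("classification accuracy:", accuracy_score(y_eval, clf.predict(X_eval)))
```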

Maybe no solution is possible. Time-box your approach, stopping if you haven't made progress within the timeframe. However, if you have a strong problem statement, then it probably warrants a solution.

Check Your Understanding

A team member found a combination of hyperparameters that improves the baseline model metric. What should the other members of the team do?
  • Maybe incorporate one hyperparameter, but continue with their experiments.

    Correct. If one of their co-worker's hyperparameters seems like a reasonable choice, try it. However, not all hyperparameter choices make sense in every experimental context.

  • Change all the hyperparameters in their current experiment to match their co-worker's.

    Incorrect. Hyperparameters that improved one model won't necessarily improve a different model. The other teammates should continue with their experiments, which might actually improve the baseline even more later on.

  • Start building an end-to-end pipeline that will be used to implement the model.

    Incorrect. A model that improves the baseline isn't necessarily the model that will ultimately be used in production. The team should continue with their experiments, which might actually improve the baseline even more later on.