The Process for Data Preparation and Feature Engineering

What's the Process Like?

As mentioned earlier, this course focuses on two stages: constructing your dataset and transforming your data.

Constructing your dataset consists of the following tasks:

  1. Collect raw data.
  2. Identify feature and label sources.
  3. Select a sampling strategy.
  4. Split the data.

Transforming data consists of the following tasks:

  1. Explore and clean your data.
  2. Perform feature engineering.
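
To make the tasks concrete, here is a minimal sketch of the two stages in plain Python. It is an illustration only, not part of the course material: the toy rows, the "sqft" and "price" field names, and the helper functions (split_data, clean, fit_standardizer, standardize) are all hypothetical, and z-score scaling stands in for whatever feature engineering a real project needs. The dataset is small enough that the sampling step simply keeps every row.

```python
import random
import statistics

# 1. Collect raw data (hypothetical toy rows; one row has a missing value).
raw_rows = [{"sqft": 50.0 + 10 * i, "price": 100.0 + 25 * i} for i in range(10)]
raw_rows.append({"sqft": None, "price": 300.0})

# 2. Identify feature and label sources: here "sqft" is the feature, "price" the label.
# 3. Select a sampling strategy: the data is tiny, so this sketch keeps every row.

def split_data(rows, test_fraction=0.2, seed=0):
    """4. Split the data: shuffle a copy, then hold out a fraction for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def clean(rows):
    """Explore and clean: drop rows with a missing feature or label."""
    return [r for r in rows if r["sqft"] is not None and r["price"] is not None]

def fit_standardizer(rows, key):
    """Compute the mean and population standard deviation of one column."""
    values = [r[key] for r in rows]
    return statistics.mean(values), statistics.pstdev(values)

def standardize(rows, key, mean, std):
    """Feature engineering: add a z-scored copy of the column to each row."""
    return [{**r, f"{key}_z": (r[key] - mean) / std} for r in rows]

train, test = split_data(raw_rows)
train, test = clean(train), clean(test)

# Fit the scaling statistics on the training rows only, then apply them to both sets.
mean, std = fit_standardizer(train, "sqft")
train = standardize(train, "sqft", mean, std)
test = standardize(test, "sqft", mean, std)

print(len(train), len(test))
```

In this sketch the split happens before the transformation, and the scaling statistics come from the training rows alone, so the test rows do not influence the engineered feature. That ordering is one reasonable choice, not a requirement.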

Keep in mind:

  • The figure shows a typical process, which might not be ideal for every project. This course applies primarily to linear regression and neural nets.
  • The process shown is not always sequential. You might, for example, split your data after you transform it. You might need to collect more data. You might need to modify the feature set, even after training begins, as you learn empirically what works and what doesn't.

How Much Time Does It Take?


Take a guess: In your machine learning project, how much of the time will you typically spend on data preparation and transformation?

  • More than half of the project time
    Correct. You'll spend the majority of the time on a machine learning project constructing datasets and transforming data.
  • Less than half of the project time
    Plan for more! Typically, 80% of the time on a machine learning project is spent constructing datasets and transforming data.