Text classification algorithms are at the heart of a variety of software systems that process text data at scale. Email software uses text classification to determine whether incoming mail is sent to the inbox or filtered into the spam folder. Discussion forums use text classification to determine whether comments should be flagged as inappropriate.
These are two examples of topic classification, categorizing a text document into one of a predefined set of topics. In many topic classification problems, this categorization is based primarily on keywords in the text.
Figure 1: Topic classification is used to flag incoming spam emails, which are filtered into a spam folder.
Another common type of text classification is sentiment analysis, whose goal is to identify the polarity of text content: the type of opinion it expresses. This can take the form of a binary like/dislike rating, or a more granular set of options, such as a star rating from 1 to 5. Examples of sentiment analysis include analyzing Twitter posts to determine if people liked the Black Panther movie, or extrapolating the general public’s opinion of a new brand of Nike shoes from Walmart reviews.
This guide will teach you some key machine learning best practices for solving text classification problems. Here’s what you’ll learn:
- The high-level, end-to-end workflow for solving text classification problems using machine learning
- How to choose the right model for your text classification problem
- How to implement your model of choice using TensorFlow
Text Classification Workflow
Here’s a high-level overview of the workflow used to solve machine learning problems:
- Step 1: Gather Data
- Step 2: Explore Your Data
- Step 2.5: Choose a Model*
- Step 3: Prepare Your Data
- Step 4: Build, Train, and Evaluate Your Model
- Step 5: Tune Hyperparameters
- Step 6: Deploy Your Model
Figure 2: Workflow for solving machine learning problems
The following sections explain each step in detail, and how to implement them for text data.