Step 2: Explore Your Data
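Building and training a model is only one part of the workflow. Understanding
the characteristics of your data beforehand will enable you to build a better
model. This could simply mean obtaining a higher accuracy. It could also mean
requiring less data for training, or fewer computational resources.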
Load the Dataset
First, let's load the dataset into Python.
import os
import random

import numpy as np

def load_imdb_sentiment_analysis_dataset(data_path, seed=123):
    """Loads the IMDb movie reviews sentiment analysis dataset.

    # Arguments
        data_path: string, path to the data directory.
        seed: int, seed for randomizer.

    # Returns
        A tuple of training and validation data.
        Number of training samples: 25000
        Number of test samples: 25000
        Number of categories: 2 (0 - negative, 1 - positive)

    # References
        Maas et al., http://www.aclweb.org/anthology/P11-1015

        Download and uncompress archive from:
        http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data.
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(train_path, fname)) as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data (the 'test' split of the archive).
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                with open(os.path.join(test_path, fname)) as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels. Reseeding with the same seed
    # applies the same permutation to both lists, so pairs stay aligned.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
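The shuffle step at the end of the loader reseeds the random generator with the same seed before each call, so that texts and labels receive the same permutation. Here is a minimal sketch of that idea in isolation. The helper name `shuffle_in_unison` and the toy data are assumptions made for illustration and are not part of the original code.

```python
import random


def shuffle_in_unison(texts, labels, seed=123):
    """Shuffles copies of two equal-length lists with the same permutation.

    Hypothetical helper for illustration only.
    """
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length')
    shuffled_texts = list(texts)
    shuffled_labels = list(labels)
    # Same seed + same list length => same sequence of swaps for both lists.
    random.seed(seed)
    random.shuffle(shuffled_texts)
    random.seed(seed)
    random.shuffle(shuffled_labels)
    return shuffled_texts, shuffled_labels


# Toy data: each text still carries its original label after shuffling.
toy_texts = ['good', 'bad', 'great', 'awful']
toy_labels = [1, 0, 1, 0]
new_texts, new_labels = shuffle_in_unison(toy_texts, toy_labels)
print(list(zip(new_texts, new_labels)))
```

This only works because both lists have the same length; if they differ, the two shuffles diverge and the pairing is lost, which is why the sketch checks lengths first.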
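Check the Data
After loading the data, it's good practice to run some checks on it: pick a
few samples and manually check whether they are consistent with your expectations.
For example, print a few random samples to see whether the sentiment label
corresponds to the sentiment of the review. Here is a review we picked at random
from the IMDb dataset: "Ten minutes worth of story stretched out into the
better part of two hours. When nothing of any significance had happened at the
halfway point I should have left." The expected sentiment (negative) matches
the sample's label.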
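A spot check like this can be scripted. The following is a rough sketch, assuming the texts and labels are parallel sequences as returned by the loader. The function name `pick_random_samples` is hypothetical, and the second demo review is made up for illustration (the first is the review quoted above).

```python
import random


def pick_random_samples(texts, labels, num_samples=3, seed=None):
    """Returns up to num_samples random (text, label) pairs.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    count = min(num_samples, len(texts))
    # Sample indices without replacement so no review is shown twice.
    indices = rng.sample(range(len(texts)), count)
    return [(texts[i], int(labels[i])) for i in indices]


demo_texts = [
    'Ten minutes worth of story stretched out into the better part of '
    'two hours.',
    'A made-up positive review: I loved every minute of it.',
]
demo_labels = [0, 1]  # 0 - negative, 1 - positive

for text, label in pick_random_samples(demo_texts, demo_labels, seed=0):
    sentiment = 'positive' if label == 1 else 'negative'
    print('[{}] {}'.format(sentiment, text))
```

Read each printed review and confirm by eye that the label in brackets matches its tone.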
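Collect Key Metrics
Once you've verified the data, collect the following important metrics that can
help characterize your text classification problem:
Number of samples: Total number of examples you have in the data.
Number of classes: Total number of topics or categories in the data.
Number of samples per class: Number of samples per class
(topic/category). In a balanced dataset, all classes will have a similar number
of samples; in an imbalanced dataset, the number of samples in each class will
vary widely.
Number of words per sample: Median number of words in one sample.
Frequency distribution of words: Distribution showing the frequency
(number of occurrences) of each word in the dataset.
Distribution of sample length: Distribution showing the number of words
per sample in the dataset.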
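The scalar metrics in this list can be computed in a few lines. Below is a minimal sketch that uses only the standard library and splits on whitespace to count words. The function name `summarize_dataset` and the returned dictionary keys are assumptions for illustration, not an API from the guide.

```python
import collections
import statistics


def summarize_dataset(texts, labels):
    """Computes basic metrics for a text classification dataset.

    Hypothetical helper for illustration only.
    """
    # Convert labels to plain ints so lists and NumPy arrays both work.
    samples_per_class = collections.Counter(int(label) for label in labels)
    words_per_sample = [len(text.split()) for text in texts]
    return {
        'num_samples': len(texts),
        'num_classes': len(samples_per_class),
        'samples_per_class': dict(samples_per_class),
        'median_words_per_sample': statistics.median(words_per_sample),
    }


# Toy data, not taken from the IMDb dataset.
summary = summarize_dataset(
    ['a fine film', 'dull and slow', 'great'], [1, 0, 1])
print(summary)
```

Comparing the values in `samples_per_class` is a quick way to see whether a dataset is balanced.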
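Let's see what the values for these metrics are for the IMDb reviews dataset
(see Figures 3 and 4 for plots of the word-frequency and sample-length
distributions).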
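| Metric name                 | Metric value |
|-----------------------------|--------------|
| Number of samples           | 25000        |
| Number of classes           | 2            |
| Number of samples per class | 12500        |
| Number of words per sample  | 174          |
Table 1: IMDb reviews dataset metrics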
explore_data.py contains functions to calculate and analyse these metrics.
Here are a couple of examples:
import numpy as np
import matplotlib.pyplot as plt

def get_num_words_per_sample(sample_texts):
    """Returns the median number of words per sample given corpus.

    # Arguments
        sample_texts: list, sample texts.

    # Returns
        float, median number of words per sample.
    """
    # Words are approximated by splitting each sample on whitespace.
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

def plot_sample_length_distribution(sample_texts):
    """Plots the sample length distribution.

    # Arguments
        sample_texts: list, sample texts.
    """
    # Note: len(s) measures each sample's length in characters, not words.
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()
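The word-frequency distribution from the metrics list can be approximated in a similar way. The sketch below is a simplified stand-in, not the function from explore_data.py: it lowercases the text, splits on whitespace, and counts single words. The name `get_word_frequencies` and its parameters are assumptions for illustration.

```python
import collections


def get_word_frequencies(sample_texts, num_top_words=50):
    """Returns the most frequent words as (word, count) pairs.

    Hypothetical helper for illustration only.
    """
    counts = collections.Counter()
    for text in sample_texts:
        # Naive tokenization: lowercase, then split on whitespace.
        counts.update(text.lower().split())
    return counts.most_common(num_top_words)


# Toy data, not taken from the IMDb dataset.
for word, count in get_word_frequencies(['The plot was thin',
                                         'the cast was great'], 3):
    print(word, count)
```

The returned pairs can be passed to a bar chart to produce a plot in the spirit of Figure 3. Punctuation stays attached to words under this tokenization, so a real analysis would likely want a proper tokenizer.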

Figure 3: Frequency distribution of words for IMDb

Figure 4: Distribution of sample length for IMDb
Last updated (UTC): 2025-07-27.