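Step 2: Explore Your Data

Building and training a model is only one part of the workflow. Understanding the characteristics of your data beforehand will enable you to build a better model. This could simply mean obtaining higher accuracy. It could also mean requiring less data for training, or fewer computational resources.

Load the Dataset

First, let's load the dataset into Python.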

import os
import random

import numpy as np

def load_imdb_sentiment_analysis_dataset(data_path, seed=123):
    """Loads the IMDb movie reviews sentiment analysis dataset.

    # Arguments
        data_path: string, path to the data directory.
        seed: int, seed for randomizer.

    # Returns
        A tuple of training and validation data.
        Number of training samples: 25000
        Number of test samples: 25000
        Number of categories: 2 (0 - negative, 1 - positive)

    # References
        Mass et al., http://www.aclweb.org/anthology/P11-1015

        Download and uncompress archive from:
        http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
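                # Read as UTF-8 explicitly so the result does not depend on
                # the platform's default encoding.
                with open(os.path.join(train_path, fname), encoding='utf-8') as f: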
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
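                # Read as UTF-8 explicitly so the result does not depend on
                # the platform's default encoding.
                with open(os.path.join(test_path, fname), encoding='utf-8') as f: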
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))

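Check the Data

After loading the data, it's good practice to run some checks on it: pick a few samples and manually check whether they are consistent with your expectations. For example, print a few random samples to see if the sentiment label corresponds to the sentiment of the review. Here is a review we picked at random from the IMDb dataset: "Ten minute story stretched into the better part of two hours. When nothing of any significance had happened at the halfway point I should have left." The expected sentiment (negative) matches the sample's label.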
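A minimal sketch of such a spot check follows. The helper name `sample_reviews` and its parameters are hypothetical and not part of the original code; the sketch only assumes that texts and labels are parallel sequences, like those returned by `load_imdb_sentiment_analysis_dataset`.

```python
import random


def sample_reviews(texts, labels, num_samples=3, seed=None):
    """Returns a few random (label, text) pairs for manual inspection.

    Hypothetical helper for illustration; not part of the original code.
    """
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length.')
    rng = random.Random(seed)
    # Draw distinct indices so the same review is never shown twice.
    indices = rng.sample(range(len(texts)), min(num_samples, len(texts)))
    return [(int(labels[i]), texts[i]) for i in indices]


if __name__ == '__main__':
    # Toy data standing in for the real reviews.
    toy_texts = ['Loved every minute.', 'I should have left halfway.',
                 'A wonderful film.', 'Dull and far too long.']
    toy_labels = [1, 0, 1, 0]
    for label, text in sample_reviews(toy_texts, toy_labels, 2, seed=0):
        # Truncate long reviews so the output stays readable.
        print('positive' if label == 1 else 'negative', '|', text[:200])
```

Reading the printed label next to the text makes it easy to spot mislabeled or garbled samples before any training starts.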

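Collect Key Metrics

Once you've verified the data, collect the following important metrics that can help characterize your text classification problem:

Number of samples: Total number of examples you have in the data.
Number of classes: Total number of topics or categories in the data.
Number of samples per class: Number of samples per class (topic/category). In a balanced dataset, all classes have a similar number of samples; in an imbalanced dataset, the number of samples in each class varies widely.
Number of words per sample: Median number of words in one sample.
Frequency distribution of words: Distribution showing the frequency (number of occurrences) of each word in the dataset.
Distribution of sample length: Distribution showing the number of words per sample in the dataset.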
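The first four metrics above can be sketched in a few lines. The function name `summarize_dataset` and the dictionary keys are hypothetical, chosen here for illustration; words are assumed to be whitespace-separated tokens, and only the standard library is used.

```python
import collections
import statistics


def summarize_dataset(texts, labels):
    """Computes basic metrics for a text classification dataset.

    Hypothetical helper for illustration. Words are assumed to be
    whitespace-separated tokens.
    """
    num_words = [len(text.split()) for text in texts]
    # Count how many samples carry each label value.
    samples_per_class = collections.Counter(int(label) for label in labels)
    return {
        'num_samples': len(texts),
        'num_classes': len(samples_per_class),
        'samples_per_class': dict(samples_per_class),
        'median_words_per_sample': float(statistics.median(num_words)),
    }


if __name__ == '__main__':
    toy_texts = ['good movie', 'bad', 'really very good film',
                 'not good at all']
    toy_labels = [1, 0, 1, 0]
    print(summarize_dataset(toy_texts, toy_labels))
```

Comparing the values in `samples_per_class` is a quick way to tell whether a dataset is balanced.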

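Let's see what the values of these metrics are for the IMDb reviews dataset (see Figures 3 and 4 for plots of the word-frequency and sample-length distributions).

Metric name	Metric value
Number of samples	25000
Number of classes	2
Number of samples per class	12500
Number of words per sample	174

Table 1: IMDb reviews dataset metrics

explore_data.py contains functions to calculate and analyze these metrics. Here are a couple of examples: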

import numpy as np
import matplotlib.pyplot as plt

def get_num_words_per_sample(sample_texts):
    """Returns the median number of words per sample given corpus.

    # Arguments
        sample_texts: list, sample texts.

    # Returns
        float, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

def plot_sample_length_distribution(sample_texts):
    """Plots the sample length distribution.

    # Arguments
        sample_texts: list, sample texts.
    """
    # Note: len(s) measures each sample's length in characters, not words.
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()
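The word-frequency metric can be approximated in a similar way. The sketch below is an assumption-laden illustration, not necessarily how explore_data.py computes it: the function names are hypothetical, and tokenization is a simple lowercase whitespace split.

```python
import collections


def get_word_frequencies(sample_texts, top_k=50):
    """Returns the top_k most frequent words with their counts.

    Hypothetical helper for illustration. Tokenization is a plain
    lowercase whitespace split, which is an assumption of this sketch.
    """
    counts = collections.Counter()
    for text in sample_texts:
        counts.update(text.lower().split())
    return counts.most_common(top_k)


def plot_word_frequencies(sample_texts, top_k=50):
    """Plots a bar chart of the top_k most frequent words."""
    # Imported here so the counting helper works without matplotlib.
    import matplotlib.pyplot as plt

    frequencies = get_word_frequencies(sample_texts, top_k)
    words = [word for word, _ in frequencies]
    counts = [count for _, count in frequencies]
    positions = range(len(words))
    plt.bar(positions, counts)
    plt.xticks(positions, words, rotation=45)
    plt.xlabel('Words')
    plt.ylabel('Frequencies')
    plt.title('Frequency distribution of words')
    plt.show()
```

Calling `get_word_frequencies(train_texts, top_k=20)` on the loaded training texts would return a list of `(word, count)` pairs sorted from most to least frequent.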

Figure 3: Frequency distribution of words for IMDb

Figure 4: Distribution of sample length for IMDb
