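Step 2: Explore Your Data

Building and training a model is only one part of the workflow. Understanding the characteristics of your data beforehand will enable you to build a better model. This could simply mean obtaining a higher accuracy. It could also mean requiring less data for training, or fewer computational resources.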

Load the Dataset

First, let's load the dataset into Python.

import os
import random

import numpy as np


def load_imdb_sentiment_analysis_dataset(data_path, seed=123):
    """Loads the IMDb movie reviews sentiment analysis dataset.

    # Arguments
        data_path: string, path to the data directory.
        seed: int, seed for randomizer.

    # Returns
        A tuple of training and validation data.
        Number of training samples: 25000
        Number of test samples: 25000
        Number of categories: 2 (0 - negative, 1 - positive)

    # References
        Mass et al., http://www.aclweb.org/anthology/P11-1015

        Download and uncompress archive from:
        http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    """
    imdb_data_path = os.path.join(data_path, 'aclImdb')

    # Load the training data
    train_texts = []
    train_labels = []
    for category in ['pos', 'neg']:
        train_path = os.path.join(imdb_data_path, 'train', category)
        for fname in sorted(os.listdir(train_path)):
            if fname.endswith('.txt'):
                # Read as UTF-8 explicitly so the platform default encoding does not matter.
                with open(os.path.join(train_path, fname), encoding='utf-8') as f:
                    train_texts.append(f.read())
                train_labels.append(0 if category == 'neg' else 1)

    # Load the validation data.
    test_texts = []
    test_labels = []
    for category in ['pos', 'neg']:
        test_path = os.path.join(imdb_data_path, 'test', category)
        for fname in sorted(os.listdir(test_path)):
            if fname.endswith('.txt'):
                # Read as UTF-8 explicitly so the platform default encoding does not matter.
                with open(os.path.join(test_path, fname), encoding='utf-8') as f:
                    test_texts.append(f.read())
                test_labels.append(0 if category == 'neg' else 1)

    # Shuffle the training data and labels.
    random.seed(seed)
    random.shuffle(train_texts)
    random.seed(seed)
    random.shuffle(train_labels)

    return ((train_texts, np.array(train_labels)),
            (test_texts, np.array(test_labels)))
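
The loader shuffles the texts and the labels in two separate calls, reseeding the generator with the same seed before each one. Because both lists have the same length, the same permutation is applied to both, so each review keeps its label. The following is a minimal sketch of that idea on toy data; the helper name `shuffle_in_unison` and the sample strings are made up for illustration and are not part of the original code.

```python
import random


def shuffle_in_unison(texts, labels, seed=123):
    """Shuffles two equal-length lists in place with the same permutation."""
    if len(texts) != len(labels):
        raise ValueError('texts and labels must have the same length')
    # Reseeding before each shuffle replays the same sequence of random
    # draws, so both lists are reordered identically.
    random.seed(seed)
    random.shuffle(texts)
    random.seed(seed)
    random.shuffle(labels)


# Toy data: each text names its own label, so alignment is easy to check.
toy_texts = ['review-%d' % i for i in range(10)]
toy_labels = list(range(10))
shuffle_in_unison(toy_texts, toy_labels)
pairs_aligned = all(
    text == 'review-%d' % label for text, label in zip(toy_texts, toy_labels))
print(pairs_aligned)
```

Note that this only holds when the two lists have equal length; lists of different lengths would consume the random draws differently and drift out of alignment.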

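Check the Data

After loading the data, it's good practice to run some checks on it: pick a few samples and manually check whether they are consistent with your expectations. For example, print a few random samples to see if the sentiment label corresponds to the sentiment of the review. Here is a review picked at random: "Ten minutes' worth of story stretched out to almost two hours. When nothing of any significance had happened by that point, ..." The expected sentiment (negative) matches the sample's label.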
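
A spot check like this could be sketched as follows. The function name `print_random_samples`, the 200-character preview cutoff, and the toy reviews are assumptions made for illustration; with the real dataset you would pass in the training texts and labels returned by the loader.

```python
import random


def print_random_samples(texts, labels, num_samples=3, seed=None):
    """Prints a few randomly chosen (label, text) pairs; returns their indices."""
    rng = random.Random(seed)
    count = min(num_samples, len(texts))
    # Sample indices without replacement so no review is shown twice.
    indices = rng.sample(range(len(texts)), count)
    for i in indices:
        sentiment = 'positive' if labels[i] == 1 else 'negative'
        # Show only the start of long reviews to keep the output readable.
        print('[%s] %s' % (sentiment, texts[i][:200]))
    return indices


# Made-up stand-ins for the loaded reviews (1 = positive, 0 = negative).
example_texts = [
    'Loved every minute of it.',
    'A dull, overlong mess.',
    'Great cast and a sharp script.',
]
example_labels = [1, 0, 1]
chosen = print_random_samples(example_texts, example_labels, num_samples=2, seed=0)
```

Reading the printed pairs by eye is the point of the exercise: if a clearly negative review shows up under a positive label, the loading or shuffling step deserves a second look.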

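Collect Key Metrics

Once you've verified the data, collect the following important metrics, which can help characterize your text classification problem: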

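Number of samples: Total number of examples you have in the data.
Number of classes: Total number of topics or categories in the data.
Number of samples per class: Number of samples per class (topic/category). In a balanced dataset, all classes have a similar number of samples; in an imbalanced dataset, the number of samples in each class varies widely.
Number of words per sample: Median number of words in one sample.
Frequency distribution of words: Distribution showing the frequency (number of occurrences) of each word in the dataset.
Distribution of sample length: Distribution showing the number of words per sample in the dataset.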
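
The first four metrics in the list above can be computed with the standard library alone. The sketch below is one possible way to do it, not the implementation in explore_data.py; the function name `collect_key_metrics`, the dictionary keys, and the toy data are assumptions, and words are taken to be whitespace-separated tokens.

```python
from collections import Counter
from statistics import median


def collect_key_metrics(texts, labels):
    """Returns basic dataset metrics as a dictionary."""
    # Count how many samples carry each label.
    samples_per_class = Counter(int(label) for label in labels)
    # Approximate word counts by splitting on whitespace.
    words_per_sample = [len(text.split()) for text in texts]
    return {
        'num_samples': len(texts),
        'num_classes': len(samples_per_class),
        'samples_per_class': dict(samples_per_class),
        'median_words_per_sample': median(words_per_sample),
    }


# Toy data standing in for the loaded reviews.
toy_texts = ['good film', 'bad film overall', 'really very good', 'not good at all']
toy_labels = [1, 0, 1, 0]
metrics = collect_key_metrics(toy_texts, toy_labels)
print(metrics)
```

Comparing the per-class counts is a quick way to tell whether the dataset is balanced. Note that `median` raises an error on an empty list, so the sketch assumes at least one sample.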

Let's see what the values for these metrics are for the IMDb reviews dataset. (See Figures 3 and 4 for plots of the word-frequency and sample-length distributions.)

Metric name	Metric value
Number of samples	25000
Number of classes	2
Number of samples per class	12500
Number of words per sample (median)	174

Table 1: IMDb reviews dataset metrics

explore_data.py contains functions to calculate and analyze these metrics. Here are a couple of examples:

import numpy as np
import matplotlib.pyplot as plt

def get_num_words_per_sample(sample_texts):
    """Returns the median number of words per sample in the given corpus.

    # Arguments
        sample_texts: list, sample texts.

    # Returns
        float, median number of words per sample.
    """
    num_words = [len(s.split()) for s in sample_texts]
    return np.median(num_words)

def plot_sample_length_distribution(sample_texts):
    """Plots the sample length distribution.

    # Arguments
        sample_texts: list, sample texts.
    """
    # Sample length here is measured in characters (len of each string).
    plt.hist([len(s) for s in sample_texts], 50)
    plt.xlabel('Length of a sample')
    plt.ylabel('Number of samples')
    plt.title('Sample length distribution')
    plt.show()

Figure 3: Frequency distribution of words for IMDb

Figure 4: Distribution of sample length for IMDb
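
A word-frequency distribution like the one in Figure 3 can be approximated with a simple counter. The sketch below is an illustration only and is not taken from explore_data.py; the function names `get_most_common_words` and `plot_word_frequencies` are hypothetical, tokenization is plain lowercasing plus whitespace splitting, and the plotting helper assumes matplotlib is installed.

```python
from collections import Counter


def get_most_common_words(texts, top_k=10):
    """Returns the top_k (word, count) pairs across all texts."""
    counts = Counter()
    for text in texts:
        # Lowercase and split on whitespace; punctuation stays attached.
        counts.update(text.lower().split())
    return counts.most_common(top_k)


def plot_word_frequencies(word_counts):
    """Draws a bar chart of (word, count) pairs; requires matplotlib."""
    import matplotlib.pyplot as plt  # Imported here so counting works without it.

    words = [word for word, _ in word_counts]
    frequencies = [count for _, count in word_counts]
    plt.bar(words, frequencies)
    plt.xlabel('Words')
    plt.ylabel('Frequencies')
    plt.title('Frequency distribution of words')
    plt.xticks(rotation=45)
    plt.show()


# Toy corpus standing in for the loaded reviews.
toy_texts = ['the film was good', 'the plot was thin', 'the cast was good']
top_words = get_most_common_words(toy_texts, top_k=3)
print(top_words)
```

Calling `plot_word_frequencies(top_words)` would then draw the bar chart. On real review text the most frequent entries tend to be common function words, which is useful to know before deciding how to tokenize and filter the vocabulary.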
