Use convolutional neural networks (CNNs) with large datasets to avoid overfitting

About this codelab

Written by Laurence Moroney

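1. Before you begin

In this codelab, you'll learn to train a CNN on a large dataset, which helps avoid overfitting.

Prerequisites

If you've never built convolutions with TensorFlow before, you may want to complete the Build convolutions and perform pooling codelab, which introduces convolutions and pooling, and the Build convolutional neural networks (CNNs) to enhance computer vision codelab, which discusses how to make computers more efficient at recognizing images.

What you'll learn

  • How to avoid overfitting

What you'll build

  • A CNN trained to classify images of cats and dogs (based on the classic Kaggle challenge).

What you'll need

You can find the code for the rest of the codelab running in Colab.

You'll also need TensorFlow installed, along with the libraries you installed in the previous codelab.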

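2. Train with a large dataset of cats and dogs

In this codelab, you'll use a real and very large dataset, and see the impact it has on avoiding overfitting.

First, set up your development environment with the required libraries.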

import os
import zipfile
import random
import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from shutil import copyfile

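3. Get the data

The full dataset for the Kaggle challenge is provided by Microsoft. If the URL in the following code block doesn't work, follow the instructions in its comments to get a new one.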

# If the URL doesn't work, visit https://www.microsoft.com/en-us/download/confirmation.aspx?id=54765
# And right click on the 'Download Manually' link to get a new URL to the dataset
# Note: This is a very large dataset and will take time to download
!wget --no-check-certificate "https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip" -O "/tmp/cats-and-dogs.zip"
local_zip = '/tmp/cats-and-dogs.zip'
zip_ref = zipfile.ZipFile(local_zip, 'r')
zip_ref.extractall('/tmp')
zip_ref.close()
print(len(os.listdir('/tmp/PetImages/Cat/')))
print(len(os.listdir('/tmp/PetImages/Dog/')))
# Expected Output:
# 12501
# 12501
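
The extract-then-count step can be tried without the large download. The following is a minimal sketch, assuming a tiny stand-in archive with made-up file names (they are placeholders, not real dataset entries). It uses a with statement so the archive is closed even if extraction fails.

```python
import os
import tempfile
import zipfile

# Build a tiny stand-in archive so the sketch runs without the real download.
# The entry names are hypothetical placeholders, not files from the dataset.
work_dir = tempfile.mkdtemp()
archive_path = os.path.join(work_dir, "sample.zip")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("PetImages/Cat/0.jpg", b"fake image bytes")
    archive.writestr("PetImages/Cat/1.jpg", b"")  # zero-length file
    archive.writestr("PetImages/Dog/0.jpg", b"fake image bytes")

# The with statement closes the archive automatically after extraction.
extract_dir = os.path.join(work_dir, "extracted")
with zipfile.ZipFile(archive_path, "r") as archive:
    archive.extractall(extract_dir)

# Count the entries per class directory, as the codelab does with os.listdir.
cat_count = len(os.listdir(os.path.join(extract_dir, "PetImages", "Cat")))
dog_count = len(os.listdir(os.path.join(extract_dir, "PetImages", "Dog")))
print(cat_count, dog_count)
```

Note that os.listdir counts every entry, including zero-length files, so a raw count can be higher than the number of usable images.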

4. Prepare the data

Now that you've downloaded the data, split it into training and testing directories. The following code does that:

try:
    os.mkdir('/tmp/cats-v-dogs')
    os.mkdir('/tmp/cats-v-dogs/training')
    os.mkdir('/tmp/cats-v-dogs/testing')
    os.mkdir('/tmp/cats-v-dogs/training/cats')
    os.mkdir('/tmp/cats-v-dogs/training/dogs')
    os.mkdir('/tmp/cats-v-dogs/testing/cats')
    os.mkdir('/tmp/cats-v-dogs/testing/dogs')
except OSError:
    pass

def split_data(SOURCE, TRAINING, TESTING, SPLIT_SIZE):
    # Collect only non-empty files; zero-length files can't be decoded as images.
    files = []
    for filename in os.listdir(SOURCE):
        file = SOURCE + filename
        if os.path.getsize(file) > 0:
            files.append(filename)
        else:
            print(filename + " is zero length, so ignoring.")

    # Shuffle, then cut the list once so the two sets never overlap.
    training_length = int(len(files) * SPLIT_SIZE)
    shuffled_set = random.sample(files, len(files))
    training_set = shuffled_set[:training_length]
    testing_set = shuffled_set[training_length:]

    for filename in training_set:
        this_file = SOURCE + filename
        destination = TRAINING + filename
        copyfile(this_file, destination)

    for filename in testing_set:
        this_file = SOURCE + filename
        destination = TESTING + filename
        copyfile(this_file, destination)

CAT_SOURCE_DIR = "/tmp/PetImages/Cat/"
TRAINING_CATS_DIR = "/tmp/cats-v-dogs/training/cats/"
TESTING_CATS_DIR = "/tmp/cats-v-dogs/testing/cats/"
DOG_SOURCE_DIR = "/tmp/PetImages/Dog/"
TRAINING_DOGS_DIR = "/tmp/cats-v-dogs/training/dogs/"
TESTING_DOGS_DIR = "/tmp/cats-v-dogs/testing/dogs/"

split_size = .9
split_data(CAT_SOURCE_DIR, TRAINING_CATS_DIR, TESTING_CATS_DIR, split_size)
split_data(DOG_SOURCE_DIR, TRAINING_DOGS_DIR, TESTING_DOGS_DIR, split_size)
# Expected output
# 666.jpg is zero length, so ignoring
# 11702.jpg is zero length, so ignoring
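
A train/test split is only meaningful when the two sets share no files; otherwise validation accuracy is measured partly on images the model has already seen. The following is a minimal sketch of a disjoint split, run on throwaway files in a temporary directory. The helper name split_files, the seed parameter, and the file names are assumptions made for illustration and are not part of the codelab.

```python
import os
import random
import tempfile
from shutil import copyfile


def split_files(source_dir, training_dir, testing_dir, split_size, seed=None):
    """Copy non-empty files into disjoint training and testing directories."""
    files = [name for name in sorted(os.listdir(source_dir))
             if os.path.getsize(os.path.join(source_dir, name)) > 0]
    # A seeded generator makes the shuffle repeatable (illustrative choice).
    shuffled = random.Random(seed).sample(files, len(files))
    training_length = int(len(shuffled) * split_size)
    # One cut point: everything before it trains, everything after it tests.
    training_set = shuffled[:training_length]
    testing_set = shuffled[training_length:]
    for name in training_set:
        copyfile(os.path.join(source_dir, name), os.path.join(training_dir, name))
    for name in testing_set:
        copyfile(os.path.join(source_dir, name), os.path.join(testing_dir, name))
    return training_set, testing_set


# Demo on hypothetical files: ten non-empty files and one empty file.
root = tempfile.mkdtemp()
source, training, testing = (os.path.join(root, d) for d in ("src", "train", "test"))
for directory in (source, training, testing):
    os.mkdir(directory)
for i in range(10):
    with open(os.path.join(source, f"{i}.jpg"), "wb") as handle:
        handle.write(b"fake image bytes")
open(os.path.join(source, "empty.jpg"), "wb").close()

training_set, testing_set = split_files(source, training, testing, 0.9, seed=42)
print(len(training_set), len(testing_set))
```

Slicing the shuffled list at a single index guarantees that every usable file lands in exactly one of the two sets.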

You can check that your data was split correctly with the following code:

print(len(os.listdir('/tmp/cats-v-dogs/training/cats/')))
print(len(os.listdir('/tmp/cats-v-dogs/training/dogs/')))
print(len(os.listdir('/tmp/cats-v-dogs/testing/cats/')))
print(len(os.listdir('/tmp/cats-v-dogs/testing/dogs/')))
# Expected output:
# 11250
# 11250
# 1250
# 1250

5. Define the model

Next, define the model as a series of convolutional layers with max pooling.

model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    # A single sigmoid unit outputs the probability of the second class.
    tf.keras.layers.Dense(1, activation='sigmoid')
])
# learning_rate replaces the deprecated lr argument.
model.compile(optimizer=RMSprop(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])
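
To see how the image shrinks as it passes through the layers, the shape arithmetic can be worked out by hand. The following is a rough sketch, assuming the Keras defaults that the model definition relies on: 3x3 convolutions with no padding and stride 1, and 2x2 max pooling that rounds down. The helper names are illustrative.

```python
def conv_output(size, kernel=3):
    """Spatial size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Spatial size after non-overlapping max pooling (rounds down)."""
    return size // pool


size = 150      # input images are 150x150
channels = 3    # RGB
total_params = 0

for filters in (16, 32, 64):
    size = conv_output(size)
    # Each filter has 3*3*channels weights plus one bias.
    total_params += (3 * 3 * channels + 1) * filters
    channels = filters
    size = pool_output(size)
    print(f"after conv({filters}) + pool: {size}x{size}x{channels}")

flat_units = size * size * channels
total_params += (flat_units + 1) * 512   # Dense(512): weights plus biases
total_params += 512 + 1                  # Dense(1): weights plus bias
print("flattened units:", flat_units)
print("total parameters:", total_params)
```

Almost all of the parameters sit in the first dense layer, which is one reason a small dataset is easy for this architecture to memorize.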

6. Train the model

Now that the model is defined, you can train it using an ImageDataGenerator.

TRAINING_DIR = "/tmp/cats-v-dogs/training/"
train_datagen = ImageDataGenerator(rescale=1.0/255.)
train_generator = train_datagen.flow_from_directory(TRAINING_DIR,
                                                    batch_size=100,
                                                    class_mode='binary',
                                                    target_size=(150, 150))

VALIDATION_DIR = "/tmp/cats-v-dogs/testing/"
validation_datagen = ImageDataGenerator(rescale=1.0/255.)
validation_generator = validation_datagen.flow_from_directory(VALIDATION_DIR,
                                                              batch_size=100,
                                                              class_mode='binary',
                                                              target_size=(150, 150))

# Expected Output:
# Found 22498 images belonging to 2 classes.
# Found 2500 images belonging to 2 classes.
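
The batch size determines how many steps make up one epoch. The following is a small sketch of that arithmetic using the image counts reported above, assuming the generator serves every image once per epoch with a smaller final batch. It also shows the effect of the rescale factor on raw pixel values. The helper name batches_per_epoch is illustrative.

```python
import math


def batches_per_epoch(num_images, batch_size):
    """Number of batches needed to serve every image once."""
    return math.ceil(num_images / batch_size)


train_steps = batches_per_epoch(22498, 100)
validation_steps = batches_per_epoch(2500, 100)
print(train_steps, validation_steps)

# The last training batch holds the leftover images.
last_batch_size = 22498 - (train_steps - 1) * 100
print(last_batch_size)

# rescale=1.0/255. maps 8-bit pixel values from [0, 255] into [0, 1].
scaled = [pixel * (1.0 / 255.0) for pixel in (0, 128, 255)]
print(scaled)
```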

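To train the model, call model.fit and pass it the generators that you created. (Older TensorFlow releases used model.fit_generator for this; it is deprecated, and model.fit accepts generators directly.)

# Note that this may take some time.
history = model.fit(train_generator,
                    epochs=15,
                    verbose=1,
                    validation_data=validation_generator)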

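7. Explore the results

You can use the following code to plot the training and validation accuracy and loss. Use it to see when training stops improving and whether the model is overfitting.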

%matplotlib inline
import matplotlib.pyplot as plt
#-----------------------------------------------------------
# Retrieve per-epoch results for the training and
# validation data sets
#-----------------------------------------------------------
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))  # Get number of epochs

#------------------------------------------------
# Plot training and validation accuracy per epoch
#------------------------------------------------
plt.plot(epochs, acc, 'r', label="Training Accuracy")
plt.plot(epochs, val_acc, 'b', label="Validation Accuracy")
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()  # Start a new figure for the loss curves

#------------------------------------------------
# Plot training and validation loss per epoch
#------------------------------------------------
plt.plot(epochs, loss, 'r', label="Training Loss")
plt.plot(epochs, val_loss, 'b', label="Validation Loss")
plt.title('Training and validation loss')
plt.legend()
plt.show()
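
Overfitting usually shows up as training accuracy that keeps rising while validation loss stops falling. The following is a minimal sketch of reading that from per-epoch lists like the ones in history.history. The numbers are made up purely for illustration and are not results from this model; the helper names are hypothetical.

```python
def best_epoch(val_loss):
    """Index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=lambda epoch: val_loss[epoch])


def accuracy_gap(acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    return [train - val for train, val in zip(acc, val_acc)]


# Illustrative values only (not measured): training keeps improving,
# validation peaks and then degrades.
acc = [0.65, 0.72, 0.78, 0.84, 0.90, 0.95]
val_acc = [0.66, 0.72, 0.76, 0.78, 0.78, 0.77]
val_loss = [0.62, 0.55, 0.50, 0.48, 0.52, 0.60]

best = best_epoch(val_loss)
gaps = accuracy_gap(acc, val_acc)
print("lowest validation loss at epoch index", best)
print("final accuracy gap:", round(gaps[-1], 2))
```

A growing gap after the best epoch suggests that further training is fitting the training set rather than improving generalization.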

8. Test the model

If you want to take the model for a spin, use the following code. Upload images and see how the model classifies them!

# Here's a codeblock just for fun. You should be able to upload an image here
# and have it classified without crashing
import numpy as np
from google.colab import files
from tensorflow.keras.preprocessing import image

uploaded = files.upload()

for fn in uploaded.keys():

  # predicting images
  path = '/content/' + fn
  img = image.load_img(path, target_size=(150, 150))
  x = image.img_to_array(img)
  x = x / 255.0  # Match the 1/255 rescaling used during training
  x = np.expand_dims(x, axis=0)  # Add the batch dimension

  images = np.vstack([x])
  classes = model.predict(images, batch_size=10)
  print(classes[0])
  # flow_from_directory assigns labels alphabetically: cats = 0, dogs = 1
  if classes[0] > 0.5:
    print(fn + " is a dog")
  else:
    print(fn + " is a cat")
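
The "dog if greater than 0.5" rule follows from how labels are assigned: flow_from_directory numbers the class subdirectories in alphanumeric order, so cats becomes 0 and dogs becomes 1, and the sigmoid output is the probability of class 1. The following is a small sketch of that mapping in plain Python. The helper name label_for and the probability values are hypothetical.

```python
def label_for(probability, class_indices, threshold=0.5):
    """Map a sigmoid output to a class name using a name-to-index mapping."""
    index_to_class = {index: name for name, index in class_indices.items()}
    predicted_index = 1 if probability > threshold else 0
    return index_to_class[predicted_index]


# Rebuild the mapping the same way: sort the subdirectory names, then number them.
class_dirs = ["dogs", "cats"]
class_indices = {name: index for index, name in enumerate(sorted(class_dirs))}
print(class_indices)

# Hypothetical model outputs, for illustration only.
print(label_for(0.93, class_indices))
print(label_for(0.12, class_indices))
```

With a trained generator in hand, train_generator.class_indices exposes the actual mapping, which is safer than hard-coding the class order.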

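9. Congratulations

You've now learned the fundamentals of machine learning, from the basic principles through creating convolutional neural networks!

Learn more

To learn how machine learning and TensorFlow can help you build computer vision models, visit TensorFlow.org.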