Lesson 13: TensorFlow Deep Learning in Practice: Model Training

With the training data in hand, we can build our own deep learning model and train it so that it can recognize and classify images.
A model is essentially an algorithm. There are many to choose from, such as CNN, RNN, DQN, LSTM, DenseNet, and so on; each has its own strengths in different domains.
The optimization algorithms used in deep learning boil down to gradient descent. There are two basic schemes for updating the parameters.
The first traverses the entire dataset to compute the loss once, then computes the gradient of the loss with respect to each parameter and updates. Every single update requires a pass over all samples in the dataset, so it is computationally expensive, slow, and does not support online learning. This is called batch gradient descent.
The other computes the loss and its gradient for every single example and updates the parameters immediately. This is called stochastic gradient descent (SGD). It is fast, but its convergence behavior is poor: it may oscillate around the optimum without ever hitting it, and consecutive updates can partially cancel each other out, making the objective fluctuate violently.
To overcome the drawbacks of both, the usual compromise today is mini-batch gradient descent. The data is split into small batches and the parameters are updated once per batch; the examples in a batch jointly determine the gradient direction, so the descent is less likely to go astray and the randomness is reduced, while a batch is far smaller than the whole dataset, so the per-update cost stays low.
Essentially all gradient descent in use today is mini-batch based (see the short sketch below).
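
To make the mini-batch idea concrete, here is a minimal NumPy sketch that fits a toy linear regression with one parameter update per mini-batch. The data, model, and learning rate are illustrative assumptions, separate from the cats-vs-dogs code below:

import numpy as np

np.random.seed(0)
X = np.random.randn(1000, 3)                     # 1000 samples, 3 features
true_w = np.array([1.5, -2.0, 0.5])
y = X.dot(true_w) + 0.1 * np.random.randn(1000)  # noisy targets

w = np.zeros(3)
lr, batch_size = 0.1, 32
for epoch in range(20):
    idx = np.random.permutation(len(X))          # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]        # indices of one mini-batch
        grad = 2 * X[b].T.dot(X[b].dot(w) - y[b]) / len(b)  # MSE gradient on the batch
        w -= lr * grad                           # one update per mini-batch

print(w)  # close to true_w after a few epochs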

Here we choose a CNN, a common and easy-to-understand algorithm. Later we will analyze the strengths, weaknesses, and applicable ranges of the various algorithms in detail.

CNNs themselves come in many variants. My model follows the network structure of TensorFlow's official cifar-10 example: two convolutional layers (each followed by a pooling layer), two fully connected layers, and finally a softmax that outputs the classification result.
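
As a quick orientation before the code, this is how tensor shapes flow through the structure (a sketch; H and W stand for the image height and width produced by the previous lesson's preprocessing):

    input             [batch, H,   W,   3]
    conv1 (3x3, SAME) [batch, H,   W,   16]
    pool1 (stride 2)  [batch, H/2, W/2, 16]
    conv2 (3x3, SAME) [batch, H/2, W/2, 16]
    pool2 (stride 1)  [batch, H/2, W/2, 16]
    flatten           [batch, H/2 * W/2 * 16]
    local3            [batch, 128]
    local4            [batch, 128]
    softmax_linear    [batch, n_classes]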

The implementation is shown below (this is one method of a class; for the complete code see: https://github.com/zhimengzhe/iBrain/blob/master/neuron/cnn.py):

# images is the training data, batch_size is how many examples are fed in at a
# time, and n_classes is the number of classes the data is split into. This is
# a fairly general method; the cats-vs-dogs example only needs two classes.
def inference(self, images, batch_size, n_classes):
    '''Build the model
    Args:
        images: image batch, 4D tensor, tf.float32, [batch_size, width, height, channels]
    Returns:
        output tensor with the computed logits, float, [batch_size, n_classes]
    '''
    # conv1, shape = [kernel size, kernel size, channels, kernel numbers]

    # The code contains many `with tf.variable_scope("name")` statements. This is
    # TensorFlow's variable scope mechanism, whose purpose is to manage variables
    # conveniently and effectively. It consists of two main parts (see the short
    # standalone example after this method):
    #   tf.get_variable(<name>, <shape>, <initializer>): creates a variable
    #   tf.variable_scope(<scope_name>): specifies a namespace

    with tf.variable_scope('conv1') as scope:
        weights = tf.get_variable('weights',
                                  shape=[3, 3, 3, 16],
                                  dtype=tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=0.1, dtype=tf.float32))
        biases = tf.get_variable('biases',
                                 shape=[16],
                                 dtype=tf.float32,
                                 initializer=tf.constant_initializer(0.1))
        conv = tf.nn.conv2d(images, weights, strides=[1, 1, 1, 1], padding='SAME')
    # tf.nn.conv2d is TensorFlow's convolution function and one of the core
    # building blocks of a convolutional network, so it is worth knowing well.
    # The first argument is the input data and the second is the convolution
    # kernel; both must be Tensors. The third argument, strides, gives the
    # stride of the convolution along each dimension of the input as a 1-D
    # vector of length 4. The fourth argument, padding, is a string that must
    # be either "SAME" or "VALID" and selects the padding scheme. A fifth
    # argument, use_cudnn_on_gpu, is a bool controlling cuDNN acceleration and
    # defaults to True. The result is a Tensor, the feature map we usually talk
    # about, whose shape is still of the form [batch, height, width, channels].
        # tf.nn.bias_add adds the bias term biases to conv and returns a Tensor;
        # think of it as element-wise addition. biases must be 1-D. Without the
        # bias term the model can converge poorly: the bias is what lets the
        # activation shift away from the origin.
        pre_activation = tf.nn.bias_add(conv, biases)
        # tf.nn.relu computes the ReLU activation, max(features, 0), i.e. it sets
        # every negative entry to zero. This highlights the features extracted by
        # this layer and suppresses interference.
        conv1 = tf.nn.relu(pre_activation, name=scope.name)

    # pool1 and norm1: pooling is similar in spirit to tf.nn.conv2d and usually follows a convolution
    with tf.variable_scope('pooling1_lrn') as scope:
        pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1],
                               padding='SAME', name='pooling1')
        norm1 = tf.nn.lrn(pool1, depth_radius=4, bias=1.0, alpha=0.001 / 9.0,
                          beta=0.75, name='norm1')

    # conv2: the second convolutional layer
    with tf.variable_scope('conv2') as scope:
        weights = tf.get_variable('weights',
                                  shape=[3, 3, 16, 16],
                                  dtype=tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=0.1, dtype=tf.float32))
        biases = tf.get_variable('biases',
                                 shape=[16],
                                 dtype=tf.float32,
                                 initializer=tf.constant_initializer(0.1))
        conv = tf.nn.conv2d(norm1, weights, strides=[1, 1, 1, 1], padding='SAME')
        pre_activation = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(pre_activation, name='conv2')

    # pool2 and norm2: the second pooling layer (here the LRN is applied before the pooling)
    with tf.variable_scope('pooling2_lrn') as scope:
        norm2 = tf.nn.lrn(conv2, depth_radius=4, bias=1.0, alpha=0.001 / 9.0,
                          beta=0.75, name='norm2')
        pool2 = tf.nn.max_pool(norm2, ksize=[1, 3, 3, 1], strides=[1, 1, 1, 1],
                               padding='SAME', name='pooling2')

    # local3: the first fully connected layer
    with tf.variable_scope('local3') as scope:
        reshape = tf.reshape(pool2, shape=[batch_size, -1])
        dim = reshape.get_shape()[1].value
        # truncated_normal_initializer draws random values from a truncated normal
        # distribution with the specified mean and standard deviation; any value
        # more than two standard deviations from the mean is discarded and redrawn.
        weights = tf.get_variable('weights',
                                  shape=[dim, 128],
                                  dtype=tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=0.005, dtype=tf.float32))
        biases = tf.get_variable('biases',
                                 shape=[128],
                                 dtype=tf.float32,
                                 initializer=tf.constant_initializer(0.1))
        local3 = tf.nn.relu(tf.matmul(reshape, weights) + biases, name=scope.name)  # ReLU activation

    # local4: the second fully connected layer
    with tf.variable_scope('local4') as scope:
        weights = tf.get_variable('weights',
                                  shape=[128, 128],
                                  dtype=tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=0.005, dtype=tf.float32))
        biases = tf.get_variable('biases',
                                 shape=[128],
                                 dtype=tf.float32,
                                 initializer=tf.constant_initializer(0.1))
        local4 = tf.nn.relu(tf.matmul(local3, weights) + biases, name='local4')

    # softmax_linear: the output layer; it produces one logit per class (for cats
    # vs. dogs, n_classes = 2). Note that softmax itself is applied later, inside
    # the loss function.
    with tf.variable_scope('softmax_linear') as scope:
        weights = tf.get_variable('softmax_linear',
                                  shape=[128, n_classes],
                                  dtype=tf.float32,
                                  initializer=tf.truncated_normal_initializer(stddev=0.005, dtype=tf.float32))
        biases = tf.get_variable('biases',
                                 shape=[n_classes],
                                 dtype=tf.float32,
                                 initializer=tf.constant_initializer(0.1))
        softmax_linear = tf.add(tf.matmul(local4, weights), biases, name='softmax_linear')

    return softmax_linear
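
As promised above, a short standalone illustration of the variable scope mechanism (a minimal sketch with made-up names, independent of the model code):

import tensorflow as tf

with tf.variable_scope('demo'):
    v = tf.get_variable('v', shape=[2], initializer=tf.constant_initializer(1.0))
print(v.name)  # demo/v:0 -- the scope name becomes a prefix

# Requesting the same name again requires reuse=True; this is how different
# parts of a graph can share the same weights.
with tf.variable_scope('demo', reuse=True):
    v2 = tf.get_variable('v')
print(v2 is v)  # True -- the very same variable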

def losses(self, logits, labels):  # the loss function. loss is the cost we want to minimize; the training loop uses the loss returned here to correct the model and search for the minimum
    '''Compute loss from logits and labels
    Args:
        logits: logits tensor, float, [batch_size, n_classes]
        labels: label tensor, tf.int32, [batch_size]

    Returns:
        loss tensor of float type
    '''
    with tf.variable_scope('loss') as scope:
        # Cross entropy measures how far the model's predictions are from the true
        # labels. This particular cross-entropy function is for mutually exclusive
        # classes: an image may contain a cat or a dog, but not both. Its return
        # value is not a single number but a vector, one entry per example.
        # For classification problems, TensorFlow provides four cross-entropy
        # functions: tf.nn.sigmoid_cross_entropy_with_logits,
        # tf.nn.softmax_cross_entropy_with_logits,
        # tf.nn.sparse_softmax_cross_entropy_with_logits and
        # tf.nn.weighted_cross_entropy_with_logits.
        cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits \
            (logits=logits, labels=labels, name='xentropy_per_example')
        loss = tf.reduce_mean(cross_entropy, name='loss')  # to get a scalar loss, average the vector with tf.reduce_mean
        tf.summary.scalar(scope.name + '/loss', loss)  # log the loss for TensorBoard plots
    return loss
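
To see what sparse_softmax_cross_entropy_with_logits actually computes, here is a tiny self-contained check against the manual formula (the logits and labels are made-up values):

import tensorflow as tf

logits = tf.constant([[2.0, 0.5], [0.2, 1.5]])  # [batch_size=2, n_classes=2]
labels = tf.constant([0, 1])                    # true class index per example

auto = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)
# manual equivalent: -log(softmax(logits)[i, labels[i]]) for each example i
probs = tf.nn.softmax(logits)
manual = -tf.log(tf.stack([probs[0, 0], probs[1, 1]]))

with tf.Session() as sess:
    print(sess.run(auto))    # a vector: one loss value per example
    print(sess.run(manual))  # matches auto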

def trainning(self, loss, learning_rate):  # builds the training op (it only defines the op; it does not start training)
    '''Training ops, the Op returned by this function is what must be passed to
        'sess.run()' call to cause the model to train.

    Args:
        loss: loss tensor, from losses()

    Returns:
        train_op: The op for training
    '''
    with tf.name_scope('optimizer'):
        # The optimizer: ultimately every optimizer is some refinement of the
        # gradient descent algorithm. There are many to choose from; common ones
        # include GradientDescentOptimizer, AdagradOptimizer, AdagradDAOptimizer,
        # MomentumOptimizer, AdamOptimizer, FtrlOptimizer and RMSPropOptimizer,
        # each suited to different scenarios. We will cover them in detail later.
        optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)

        global_step = tf.Variable(0, name='global_step', trainable=False)
        train_op = optimizer.minimize(loss, global_step=global_step)  # minimize the loss
    return train_op
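
Swapping in a different optimizer is a one-line change. For example (the momentum value 0.9 is a common but illustrative choice):

# plain SGD instead of Adam:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# or SGD with momentum:
optimizer = tf.train.MomentumOptimizer(learning_rate=learning_rate, momentum=0.9)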

def evaluation(self, logits, labels):  # evaluate prediction quality
    """Evaluate the quality of the logits at predicting the label.
    Args:
      logits: Logits tensor, float - [batch_size, NUM_CLASSES].
      labels: Labels tensor, int32 - [batch_size], with values in the
        range [0, NUM_CLASSES).
    Returns:
      A scalar int32 tensor with the number of examples (out of batch_size)
      that were predicted correctly.
    """
    with tf.variable_scope('accuracy') as scope:
        correct = tf.nn.in_top_k(logits, labels, 1)
        correct = tf.cast(correct, tf.float16)
        accuracy = tf.reduce_mean(correct)
        tf.summary.scalar(scope.name + '/accuracy', accuracy)
    return accuracy
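
tf.nn.in_top_k(logits, labels, 1) returns, for each example, whether the true label is the top prediction; averaging the cast result gives the batch accuracy. A tiny illustration with made-up values:

import tensorflow as tf

logits = tf.constant([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
labels = tf.constant([1, 1, 1])
correct = tf.nn.in_top_k(logits, labels, 1)

with tf.Session() as sess:
    print(sess.run(correct))  # [ True False  True] -> accuracy 2/3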

def trainCnnNetwork(self, train_batch, train_label_batch, n_class = 2, batch_size = 16):  # the main method: running it starts the actual training of the model
    train_logits = self.inference(train_batch, batch_size, n_class)
    train_loss = self.losses(train_logits, train_label_batch)
    train_op = self.trainning(train_loss, LEARN_RATE)
    train_acc = self.evaluation(train_logits, train_label_batch)  # evaluation op
    summary_op = tf.summary.merge_all()
    sess = tf.Session()
    train_writer = tf.summary.FileWriter(self.logs_train_dir, sess.graph)
    saver = tf.train.Saver()

    sess.run(tf.global_variables_initializer())  # initialize all variables
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for step in np.arange(MAX_STEP):
            if coord.should_stop():
                break
            _, tra_loss, tra_acc = sess.run([train_op, train_loss, train_acc])

            if step % 50 == 0:
                print('Step %d, train loss = %.2f, train accuracy = %.2f%%' % (step, tra_loss, tra_acc * 100.0))
                summary_str = sess.run(summary_op)
                train_writer.add_summary(summary_str, step)

            if step % 2000 == 0 or (step + 1) == MAX_STEP:
                checkpoint_path = os.path.join(self.logs_train_dir, 'model.ckpt')
                saver.save(sess, checkpoint_path, global_step=step)

    except tf.errors.OutOfRangeError:
        print('Done training -- epoch limit reached')
    finally:
        coord.request_stop()

    coord.join(threads)
    sess.close()
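
Finally, kicking off training just means constructing the class and feeding it an image/label batch. A hedged sketch, assuming the input pipeline from the previous lesson exposes a get_batch helper and the class is named Cnn (both names are illustrative, not confirmed here):

# hypothetical usage; Cnn, get_batch and their arguments are assumptions
train_batch, train_label_batch = get_batch(image_list, label_list, batch_size=16)
net = Cnn()
net.trainCnnNetwork(train_batch, train_label_batch, n_class=2, batch_size=16)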