Object Detection | 东方少

One-stage, two-stage, anchor-based, anchor-free

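These four terms come from two independent axes of classification. The one-stage/two-stage split depends on whether there is an explicit RoI feature extraction step. In Faster R-CNN, for example, the RPN module proposes RoIs, which are then passed to the R-CNN head for classification and localization; a one-stage detector such as YOLO has no explicit RoI extraction step.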

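The anchor-based/anchor-free split depends on whether prior anchors must be defined explicitly. Mainstream algorithms such as Faster R-CNN and YOLOv2 require the sizes of the prior boxes to be defined, whereas YOLOv1 and DenseBox are anchor-free.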
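To make "explicitly defined prior anchors" concrete, here is a minimal sketch of dense anchor generation. The function name `make_anchors`, the scales, the ratios, and the stride are illustrative assumptions, not values taken from any particular detector.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * A, 4) anchors as [x1, y1, x2, y2] in image pixels."""
    # One (width, height) per (scale, ratio) pair; ratio is h / w, area is scale**2.
    shapes = []
    for s in scales:
        for r in ratios:
            shapes.append((s / np.sqrt(r), s * np.sqrt(r)))
    shapes = np.array(shapes)  # (A, 2)

    # Center of every feature-map cell, mapped back to image coordinates.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)  # each (feat_h, feat_w)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H*W, 2)

    # Broadcast centers (H*W, 1, 2) against half sizes (1, A, 2).
    half = shapes[None, :, :] / 2.0
    c = centers[:, None, :]
    anchors = np.concatenate([c - half, c + half], axis=2)  # (H*W, A, 4)
    return anchors.reshape(-1, 4)

anchors = make_anchors(feat_h=2, feat_w=3, stride=16)
print(anchors.shape)  # (36, 4): 2 * 3 positions, 6 anchors each
```

An anchor-free detector skips this step and predicts box geometry directly from each location.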

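One-stage methods place anchors densely at every position of the feature map, using several scales and aspect ratios, and classify and regress them directly. The main advantage is computational efficiency. Detection accuracy is somewhat lower, mainly because negative anchors are not filtered out, which causes class imbalance; RetinaNet addresses this with focal loss. Small objects are also detected poorly: in two-stage methods, RoI pooling resizes each proposal, so the features of a small object are enlarged and its outline becomes clearer. The simplest remedy is to increase the input size, but that gives up the speed advantage; other options are using FPN to fuse low-level features and using dilated convolution to enlarge the receptive field.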
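As a rough illustration of how focal loss counters the flood of easy negative anchors, the sketch below implements the binary form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The defaults alpha=0.25 and gamma=2 follow the commonly cited RetinaNet settings; the function name and the sample inputs are made up for this example.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Per-sample binary focal loss. p: predicted foreground probability, y: 0/1 labels."""
    p = np.clip(p, eps, 1.0 - eps)
    # p_t is the probability assigned to the true class.
    p_t = np.where(y == 1, p, 1.0 - p)
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    # (1 - p_t)**gamma shrinks the loss of well-classified (easy) samples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p = np.array([0.1, 0.9])  # two background anchors: one easy, one hard
y = np.array([0, 0])
loss = focal_loss(p, y)
print(loss)  # the easy negative contributes far less than the hard one
```

With gamma=0 and alpha=0.5 the expression reduces to half of the ordinary cross-entropy, which is a convenient sanity check.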

NMS

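Non-maximum suppression (NMS): used to remove redundant boxes.

Select the detection box with the highest confidence, remove every remaining box whose IoU with it exceeds the threshold, and repeat the process on the boxes that are left.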

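import numpy as np

def py_nms(dets, thresh):
    # dets: (N, 5) array, one row per box as [x1, y1, x2, y2, score]
    x1 = dets[:, 0]
    y1 = dets[:, 1]
    x2 = dets[:, 2]
    y2 = dets[:, 3]
    scores = dets[:, 4]

    # Area of each box (the +1 treats coordinates as inclusive pixel indices)
    areas = (x2 - x1 + 1) * (y2 - y1 + 1)

    # Sort boxes by confidence in descending order
    order = scores.argsort()[::-1]

    keep = []
    while order.size > 0:
        # Keep the box with the highest confidence
        i = order[0]
        keep.append(i)

        # Corners of the intersection with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])

        # Intersection area
        w = np.maximum(0.0, xx2 - xx1 + 1)
        h = np.maximum(0.0, yy2 - yy1 + 1)
        inter = w * h

        # IoU with the kept box; retain only boxes at or below the threshold
        over = inter / (areas[i] + areas[order[1:]] - inter)
        inds = np.where(over <= thresh)[0]

        # inds indexes order[1:], so shift by one
        order = order[inds + 1]

    return keep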

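NMS filters boxes directly by comparing IoU with a threshold, whereas Soft-NMS recomputes the confidence from the IoU and then filters boxes by comparing the confidence with a threshold. It decays the confidence of neighboring boxes that overlap the highest-confidence box; the larger the overlap, the stronger the decay.

The original NMS can be written as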

$s_i=\left\{ \begin{aligned} &s_i, &iou(M,b_i)<N_t\\ &0, &iou(M,b_i) \geq N_t\\ \end{aligned} \right.$

Soft-NMS with linear weighting:

$s_i=\left\{ \begin{aligned} &s_i, &iou(M,b_i)<N_t\\ &s_i(1-iou(M, b_i)), &iou(M,b_i) \geq N_t\\ \end{aligned} \right.$
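The linear rule can be sketched in code as follows. This is an illustrative version, not a reference implementation: the names `iou_one_to_many` and `soft_nms_linear`, the thresholds, and the use of continuous coordinates (no +1 in the areas) are all assumptions made for this example.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array, boxes as [x1, y1, x2, y2]."""
    xx1 = np.maximum(box[0], boxes[:, 0])
    yy1 = np.maximum(box[1], boxes[:, 1])
    xx2 = np.minimum(box[2], boxes[:, 2])
    yy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms_linear(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Return the kept indices (in selection order) and the decayed scores."""
    boxes = boxes.astype(float)
    scores = scores.astype(float).copy()
    idxs = np.arange(len(scores))
    keep = []
    while idxs.size > 0:
        # M: the remaining box with the highest (possibly decayed) score.
        top = np.argmax(scores[idxs])
        m = idxs[top]
        keep.append(int(m))
        idxs = np.delete(idxs, top)
        if idxs.size == 0:
            break
        # s_i <- s_i * (1 - iou) when iou >= N_t, unchanged otherwise.
        iou = iou_one_to_many(boxes[m], boxes[idxs])
        decay = np.where(iou >= iou_thresh, 1.0 - iou, 1.0)
        scores[idxs] = scores[idxs] * decay
        # Drop boxes whose score has decayed below the score threshold.
        idxs = idxs[scores[idxs] > score_thresh]
    return keep, scores

boxes = np.array([[0, 0, 10, 10], [1, 0, 11, 10], [20, 20, 30, 30]])
scores = np.array([0.9, 0.8, 0.7])
keep, new_scores = soft_nms_linear(boxes, scores)
print(keep, new_scores)
```

In this toy input the second box overlaps the first heavily, so it is not deleted but ranked last with a much lower score.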

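Metrics

Precision: the fraction of predicted positives that are correct
Recall: the fraction of all ground-truth positives that are successfully predicted
F1 score: the harmonic mean of precision and recall; the higher the F1, the more effective the method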

$F1 = \frac{Precision*Recall*2}{Precision+Recall}$
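A small worked example of these three metrics, computed from raw counts. The helper name `precision_recall_f1` and the counts are invented for illustration.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1

# 10 detections, 8 of them correct, against 16 ground-truth objects.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(p, r, f1)
```

Because F1 is a harmonic mean, it sits closer to the smaller of the two values than an arithmetic mean would.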

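AP is the precision averaged over points at different recall levels, i.e. the area under the P-R curve obtained by varying the confidence threshold. mAP computes the AP of each class and then takes the mean of the AP over all classes.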

mAP

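import numpy as np

# Excerpt of a per-class evaluation loop. `confidence`, `BB`, `image_ids`,
# `class_recs`, `npos`, and `ovthresh` are assumed to be defined earlier.

# Sort by confidence in descending order
sorted_ind = np.argsort(-confidence)
BB = BB[sorted_ind, :]   # predicted box coordinates
image_ids = [image_ids[x] for x in sorted_ind]  # image id of each predicted box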

# Iterate over the predicted boxes and count TPs and FPs
nd = len(image_ids)
tp = np.zeros(nd)
fp = np.zeros(nd)
for d in range(nd):
    R = class_recs[image_ids[d]]
    bb = BB[d, :].astype(float)
    ovmax = -np.inf
    BBGT = R['bbox'].astype(float)  # ground truth

    if BBGT.size > 0:
        # Compute IoU against every ground-truth box in this image
        # intersection
        ixmin = np.maximum(BBGT[:, 0], bb[0])
        iymin = np.maximum(BBGT[:, 1], bb[1])
        ixmax = np.minimum(BBGT[:, 2], bb[2])
        iymax = np.minimum(BBGT[:, 3], bb[3])
        iw = np.maximum(ixmax - ixmin + 1., 0.)
        ih = np.maximum(iymax - iymin + 1., 0.)
        inters = iw * ih

        # union
        uni = ((bb[2] - bb[0] + 1.) * (bb[3] - bb[1] + 1.) +
               (BBGT[:, 2] - BBGT[:, 0] + 1.) *
               (BBGT[:, 3] - BBGT[:, 1] + 1.) - inters)

        overlaps = inters / uni
        ovmax = np.max(overlaps)
        jmax = np.argmax(overlaps)
    # Use the largest IoU
    if ovmax > ovthresh:  # above the threshold?
        if not R['difficult'][jmax]:  # not a "difficult" object
            if not R['det'][jmax]:    # ground truth not yet matched
                tp[d] = 1.
                R['det'][jmax] = 1    # mark as matched
            else:
                fp[d] = 1.
    else:
        fp[d] = 1.

# Compute precision and recall
fp = np.cumsum(fp)
tp = np.cumsum(tp)
rec = tp / float(npos)
# avoid divide by zero in case the first detection matches a difficult
# ground truth
prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)


def voc_ap(rec, prec, use_07_metric=False):
    """Compute VOC AP given precision and recall. If use_07_metric is true, uses
    the VOC 07 11-point method (default:False).
    """
    if use_07_metric:  # VOC 2007 method
        # 11 points
        ap = 0.
        for t in np.arange(0., 1.1, 0.1):
            if np.sum(rec >= t) == 0:
                p = 0
            else:
                p = np.max(prec[rec >= t])  # interpolation
            ap = ap + p / 11.
    else:  # newer method: use all points
        # correct AP calculation
        # first append sentinel values at the end
        mrec = np.concatenate(([0.], rec, [1.]))
        mpre = np.concatenate(([0.], prec, [0.]))

        # compute the precision envelope (this also interpolates)
        for i in range(mpre.size - 1, 0, -1):
            mpre[i - 1] = np.maximum(mpre[i - 1], mpre[i])

        # to calculate area under PR curve, look for points
        # where X axis (recall) changes value
        i = np.where(mrec[1:] != mrec[:-1])[0]

        # and sum (\Delta recall) * prec
        ap = np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1])
    return ap
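
To see how the all-points AP and the final mean fit together, here is a compact, self-contained sketch on toy data. The function `ap_all_points`, the class names, and the detections are hypothetical; each detection is assumed to be already labeled as a true positive (1) or false positive (0).

```python
import numpy as np

def ap_all_points(scores, is_tp, npos):
    """All-points AP from detection scores, TP flags, and the number of ground truths."""
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    rec = tp / float(npos)
    prec = tp / np.maximum(tp + fp, np.finfo(np.float64).eps)
    # Sentinels at both ends, then the monotone precision envelope.
    mrec = np.concatenate(([0.0], rec, [1.0]))
    mpre = np.concatenate(([0.0], prec, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    # Sum (delta recall) * precision where recall changes.
    i = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]))

# Toy per-class detections: (scores, TP flags, number of ground-truth boxes).
per_class = {
    "cat": (np.array([0.9, 0.8, 0.7, 0.6]), np.array([1, 0, 1, 0]), 2),
    "dog": (np.array([0.9]), np.array([1]), 1),
}
aps = {name: ap_all_points(s, t, n) for name, (s, t, n) in per_class.items()}
mean_ap = float(np.mean(list(aps.values())))
print(aps, mean_ap)
```

For "cat", recall reaches 0.5 at precision 1 and 1.0 at precision 2/3, so the area is 0.5 * 1 + 0.5 * 2/3 = 5/6.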

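Traditional detection methods

Traditional methods usually consist of three parts: region selection, feature extraction, and classification/regression. Region selection uses sliding windows or selective search. The sliding-window strategy is untargeted: it produces a large number of redundant and distracting windows and has high time complexity.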
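
To give a feel for why sliding windows are expensive, the sketch below simply counts candidate windows. The image size, window sizes, and stride are arbitrary values picked for illustration.

```python
def count_windows(img_w, img_h, win_sizes, stride):
    """Count sliding-window positions over an image for several window sizes."""
    total = 0
    for w, h in win_sizes:
        nx = (img_w - w) // stride + 1 if img_w >= w else 0
        ny = (img_h - h) // stride + 1 if img_h >= h else 0
        total += nx * ny
    return total

n = count_windows(640, 480, [(64, 64), (128, 128), (128, 64)], stride=8)
print(n)  # 10239 windows for just three window shapes at a single image scale
```

Every one of these windows would need feature extraction and classification, and most of them contain no object at all.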

YOLOv1

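Pipeline

Resize the image to a fixed size (448x448) and divide it into a fixed 7x7 grid; each grid cell is responsible for detecting objects in its own region;
Each grid cell outputs a 30-dimensional vector and predicts multiple boxes. 10 of the 30 dimensions predict the coordinates and objectness confidence of two bounding boxes, and the remaining 20 predict the class probabilities of the object;
Apply NMS to obtain the final predicted boxes.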
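
The decoding step before NMS can be sketched as below. The memory layout of the 30-dimensional vector (two boxes of [x, y, w, h, conf] followed by 20 class probabilities) is an assumption for this example, as are the function name and the toy prediction; x and y are taken relative to the cell, w and h relative to the whole image.

```python
import numpy as np

S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

def decode_yolov1(pred, img_size=448):
    """pred: (S, S, B*5 + C). Returns boxes (S*S*B, 4) as [x1, y1, x2, y2] and scores (S*S*B, C)."""
    boxes, scores = [], []
    cell = img_size / S
    for row in range(S):
        for col in range(S):
            cls_prob = pred[row, col, B * 5:]
            for b in range(B):
                x, y, w, h, conf = pred[row, col, b * 5:(b + 1) * 5]
                # Center is an offset inside the cell; size is a fraction of the image.
                cx, cy = (col + x) * cell, (row + y) * cell
                bw, bh = w * img_size, h * img_size
                boxes.append([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2])
                # Class-specific score = box confidence * class probability.
                scores.append(conf * cls_prob)
    return np.array(boxes), np.array(scores)

pred = np.zeros((S, S, B * 5 + C))
pred[3, 3, :5] = [0.5, 0.5, 0.25, 0.25, 0.9]  # one confident box in the center cell
pred[3, 3, B * 5 + 7] = 1.0                   # class index 7
boxes, scores = decode_yolov1(pred)
print(boxes.shape, scores.shape)  # (98, 4) (98, 20)
```

The 98 candidate boxes would then be thresholded and passed through NMS per class.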

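Loss function

L = localization error + confidence error + classification error

Localization error: computed for the bounding box with the highest IoU; x and y use the squared error directly, while w and h use the squared error of their square roots (to balance large and small objects)
Confidence error: cells with and without an object get different weights, to balance the samples
Classification error: computed only for the grid cell that contains the object's center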
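
The effect of taking square roots of w and h can be checked numerically. In this sketch, with made-up normalized sizes and a hypothetical helper `wh_error`, the same absolute size error costs more on a small box than on a large one once the square root is applied.

```python
import numpy as np

def wh_error(pred_wh, true_wh, use_sqrt=True):
    """Sum-squared error on (w, h), optionally computed on square roots."""
    pred_wh = np.asarray(pred_wh, dtype=float)
    true_wh = np.asarray(true_wh, dtype=float)
    if use_sqrt:
        pred_wh, true_wh = np.sqrt(pred_wh), np.sqrt(true_wh)
    return float(np.sum((pred_wh - true_wh) ** 2))

# Both predictions are off by 0.05 in width and height (sizes normalized to [0, 1]).
small = wh_error([0.15, 0.15], [0.10, 0.10])
large = wh_error([0.85, 0.85], [0.80, 0.80])
print(small, large)  # the small box is penalized more heavily
```

Without the square root both cases would produce the same error, which would under-weight mistakes on small objects.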

YOLOv2

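Add BN and drop dropout (with normalization, convergence is easier)
Higher resolution (the classifier is fine-tuned at 448x448 for 10 epochs during training)
Use DarkNet; the actual input is 416x416, with no FC layers and 5 downsampling steps (13x13 output)
Prior boxes obtained by clustering (the clustering distance is 1 - IoU, k=5)
Introduce anchor boxes
Offsets relative to the grid cell: the cell is the reference point, a sigmoid is applied to the predicted center offset, and width and height are predicted as exponential scalings of the prior
Improved feature fusion: a very large receptive field loses small objects, so earlier features need to be fused in
Multi-scale training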
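
The offset parameterization in the list above can be written out as a short sketch: the center is the cell index plus a sigmoid of the raw output, and the size is the prior scaled by an exponential. The function name, the convention that priors are measured in grid cells, and the sample values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decode_yolov2_box(t, cell_xy, prior_wh, stride=32):
    """Decode raw outputs (tx, ty, tw, th) into a box (cx, cy, w, h) in pixels."""
    tx, ty, tw, th = t
    cx, cy = cell_xy   # integer index of the grid cell
    pw, ph = prior_wh  # prior box size, in grid cells
    # The sigmoid keeps the center inside its own cell.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # The exponential scales the prior and keeps the size positive.
    bw = pw * np.exp(tw) * stride
    bh = ph * np.exp(th) * stride
    return bx, by, bw, bh

# Zero outputs give a box centered in cell (6, 6) with exactly the prior's size.
box = decode_yolov2_box((0.0, 0.0, 0.0, 0.0), cell_xy=(6, 6), prior_wh=(3.0, 4.0))
print(box)  # (208.0, 208.0, 96.0, 128.0)
```

With 5 downsampling steps the stride is 32, so a 416x416 input yields the 13x13 grid mentioned above.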

YOLOv3

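Multi-scale prediction: three scales, nine boxes
Residual connections
Prior boxes: nine boxes
Logistic (sigmoid) classification in place of softmax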
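
As a quick sanity check on "three scales, nine boxes", the sketch below computes the output tensor shapes, assuming a 416x416 input, strides of 32, 16, and 8, three prior boxes per scale, and 80 classes. The function name and the class count are illustrative choices.

```python
def yolov3_output_shapes(img_size=416, num_classes=80, strides=(32, 16, 8), anchors_per_scale=3):
    """Return (grid, grid, channels) for each detection scale."""
    shapes = []
    for s in strides:
        g = img_size // s
        # Each prior box predicts 4 coordinates, 1 objectness score, and the class scores.
        shapes.append((g, g, anchors_per_scale * (5 + num_classes)))
    return shapes

shapes = yolov3_output_shapes()
total_boxes = sum(g * g * 3 for g, _, _ in shapes)
print(shapes)       # [(13, 13, 255), (26, 26, 255), (52, 52, 255)]
print(total_boxes)  # 10647 candidate boxes before thresholding and NMS
```

The finest 52x52 grid contributes most of the candidates, which is where small objects are expected to be picked up.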