Paper Note: Selective Search for Object Recognition

Posted on 2018-05-03

I first came across Selective Search in the well-known object detection paper Rich feature hierarchies for accurate object detection and semantic segmentation, so this paper serves as preparation for reading R-CNN.

Although the title of the paper also mentions Object Recognition, its real novelty lies in Selective Search. Therefore, this note only briefly introduces the idea and procedure of Selective Search and does not go into the Object Recognition part.

What is Selective Search

Put simply, Selective Search finds the regions of an image where objects might be located.

[Figure: Selective Search result]

In the astronaut image above, the red boxes are the regions that Selective Search identified as possibly containing objects.

Before digging into how it works, let us first consider: how do we decide which regions belong to a single object?

[Figure: image segmentation examples from the paper]

In the paper, the authors use the four images above to illustrate four possible situations:

  1. Image (a): objects may have a hierarchical relationship, e.g. a spoon inside a bowl;
  2. Image (b): the two cats can be separated by color, but not by texture;
  3. Image (c): the chameleon can be distinguished from its surroundings by texture, but not by color;
  4. Image (d): the wheels are part of the car not because their color or texture is similar to the car's, but because the wheels are attached to the car.

Therefore, no single feature is enough to locate objects; multiple strategies have to be combined, and this is the essence of Selective Search.

Issues to Consider

Before learning the Selective Search algorithm, I had already learned about object (mainly face) detection methods in a computer vision course. The most conventional, brute-force approach is to scan the whole image, row by row, with rectangular windows of different sizes, extract features inside each window, and decide whether it contains the target object. The complexity of this approach is extremely high, which is why it is also called exhaustive search. In face detection, thanks to Haar features, the integral image proposed by Paul Viola and Michael Jones makes detection feasible in reasonable time. However, not every kind of feature works with integral images; in particular, for neural networks this dynamic-programming trick is of little use.

To address the shortcomings of the traditional approach, Selective Search makes improvements from three angles:

  1. We cannot know the size of an object in advance; traditional methods have to scan with rectangles of different sizes to avoid missing anything, whereas Selective Search uses a hierarchical algorithm to handle this;
  2. The time complexity of detection can be very high. Following the principle that simple is beautiful, Selective Search is only responsible for quickly generating regions that are likely to contain objects, and leaves the actual detection to later stages;
  3. In addition, as discussed in the previous section, it combines several kinds of prior knowledge to judge the regions, avoiding useless searches and improving both speed and accuracy.

Algorithm Framework

[Figure: the Selective Search algorithm, as given in the paper]

The algorithm given in the paper is already quite detailed; here is a brief paraphrase.

  • Input: a color image.
  • Output: possible object locations, in practice a set of rectangle coordinates.
  • First, the image is over-segmented into many small regions $R=\{r_1, \cdots, r_n\}$ using the method of the graph-based segmentation paper (see References). Since our focus is Selective Search, that algorithm is treated as a black box here.
  • Initialize the similarity set as the empty set: $S=\varnothing$.
  • Compute the similarity between every pair of neighbouring regions (the similarity functions are analysed in detail later) and put it into S; each element of S is a pair of regions together with their similarity.
  • Find the pair with the highest similarity in S and merge it; remove from S every entry involving either of the two regions. Compute the similarities between the new region and its neighbours, add them to S, and add the new region to R. Repeat until S is empty.
  • Take the bounding box of every region in R (the smallest rectangle enclosing the region); these boxes are the possible object locations.

In addition, for speed, the features of a newly merged region can be derived from those of its two parent regions, without re-scanning the pixels of the new region. These features are what the similarity functions operate on.
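To make the grouping loop above concrete, here is a minimal Python sketch. All helper functions (initial_oversegmentation, neighbours, neighbours_of, similarity, merge, bounding_box) are hypothetical placeholders for the steps described in the paper, not the authors' code.

def selective_search(image):
    # Initial over-segmentation (Felzenszwalb & Huttenlocher), treated as a black box here.
    R = initial_oversegmentation(image)           # list of small starting regions
    S = {}                                        # (ri, rj) -> similarity
    for ri, rj in neighbours(R):
        S[(ri, rj)] = similarity(ri, rj)

    while S:
        # Merge the most similar pair of neighbouring regions.
        ri, rj = max(S, key=S.get)
        rt = merge(ri, rj)
        # Drop every similarity involving ri or rj ...
        S = {pair: s for pair, s in S.items() if ri not in pair and rj not in pair}
        # ... and add similarities between the new region and its neighbours.
        for rk in neighbours_of(rt, R):
            S[(rt, rk)] = similarity(rt, rk)
        R.append(rt)

    # Candidate object locations: the bounding boxes of every region ever created.
    return [bounding_box(r) for r in R]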

Similarity Measures

The similarity measures directly determine the order in which regions are merged, and hence the quality of the detection results.

The paper compares the properties of eight color spaces; in practice, only one color space (for example RGB) is chosen for the computation.

As argued at the beginning, several kinds of information need to be combined. The authors split the similarity measure into four sub-measures, called Complementary Similarity Measures. Each of the four is normalized to the interval [0, 1].

1. Color similarity $s_{color}(r_i, r_j)$

As mentioned at the beginning, color is an important cue for telling objects apart. For each region, the paper builds a histogram over the pixels of each color channel, with 25 bins per channel (for a channel ranging over 0–255, each bin covers roughly 10 consecutive values, since 256/25 ≈ 10). The three channels together give a 75-dimensional histogram vector $C_i=\{c_i^{1}, \dots, c_i^{n}\}$ with n = 75. The histogram is then normalized with the L1 norm (the sum of absolute values). From these histograms, the color similarity of two regions is:
$$
s_{color}(r_i, r_j) =\sum_{k=1}^{n}{min(c_{i}^{k}, c_{j}^{k})}
$$
The color histogram can be propagated efficiently when regions are merged; the histogram vector of the merged region is:
$$
C_t=\frac{size(r_i)C_i+size(r_j)C_j}{size(r_i)+size(r_j)}
$$
where $size(r_i)$ denotes the area (number of pixels) of region $r_i$, and the merged region has $size(r_t)=size(r_i)+size(r_j)$.
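A minimal NumPy sketch of these two formulas (25 L1-normalized bins per channel, histogram intersection, and the size-weighted merge); the pixel arrays and region sizes are hypothetical inputs:

import numpy as np

def color_histogram(pixels):
    # pixels: (N, 3) array of RGB values in [0, 255]; 25 bins per channel -> 75-dim vector.
    hist = np.concatenate([np.histogram(pixels[:, c], bins=25, range=(0, 256))[0]
                           for c in range(3)]).astype(float)
    return hist / hist.sum()                       # L1 normalization of the whole vector

def s_color(C_i, C_j):
    return np.minimum(C_i, C_j).sum()              # histogram intersection, in [0, 1]

def merged_histogram(C_i, size_i, C_j, size_j):
    # Histogram of the merged region, computed from the parents without revisiting pixels.
    return (size_i * C_i + size_j * C_j) / (size_i + size_j)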

2. Texture similarity $s_{texture}(r_i, r_j)$

Another factor to consider is texture, i.e. the gradient information of the image.

For texture the paper uses SIFT-like features. Following the idea of SIFT, Gaussian derivatives ($\sigma=1$) are computed in 8 orientations for each color channel, and a 10-bin histogram is accumulated per orientation. This gives an 80-dimensional vector per channel and 240 dimensions over the three channels: $T_i=\{t_i^{1}, \dots, t_i^{n}\}$ with n = 240. This histogram is again normalized with the L1 norm. The texture similarity of two regions is then computed in the same way as the color similarity:
$$
s_{texture}(r_i, r_j) =\sum_{k=1}^{n}{min(t_{i}^{k}, t_{j}^{k})}
$$

3. Size similarity $s_{size}(r_i, r_j)$

When merging regions, the paper gives priority to small regions. This keeps the areas of the regions being merged roughly comparable and prevents a large region from gradually swallowing the small ones around it. The rationale is simple: we want to generate regions of different sizes uniformly over every part of the image, which plays the same role as scanning the image with rectangles of different sizes in exhaustive search. The similarity is defined as:
$$
s_{size}(r_i, r_j)=1-\frac{size(r_i) + size(r_j)}{size(im)}
$$
where $size(im)$ denotes the number of pixels in the original image.

4. Fill similarity $s_{fill}(r_i, r_j)$

Fill similarity measures how well two regions fit into each other; in my view it addresses the containment relation raised at the beginning (e.g. a wheel is contained in a car). Before giving the formula, define $BB_{ij}$ as the smallest bounding box that contains both $r_i$ and $r_j$. The similarity is then:
$$
s_{fill}(r_i, r_j)=1-\frac{size(BB_{ij})-size(r_i)-size(r_j)}{size(im)}
$$
To compute $BB_{ij}$ efficiently, the bounding box of each region can be stored as the regions are built, so that $BB_{ij}$ can be derived quickly from the bounding boxes of the two regions.

5. Combined similarity

Combining the four sub-measures above, the final similarity is:
$$
s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)
$$
where each $a_i$ is 0 or 1, indicating whether the corresponding measure is used.
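A small sketch of the size, fill, and combined similarities under the same assumptions (region sizes, image size, and the bounding-box area passed in as plain numbers; the a flags select which measures are used):

def s_size(size_i, size_j, size_im):
    return 1.0 - (size_i + size_j) / size_im

def s_fill(size_i, size_j, bb_ij_size, size_im):
    # bb_ij_size: area of the tight bounding box around r_i and r_j combined.
    return 1.0 - (bb_ij_size - size_i - size_j) / size_im

def combined_similarity(color, texture, size, fill, a=(1, 1, 1, 1)):
    # a: 0/1 flags choosing which of the four complementary measures to use.
    return a[0] * color + a[1] * texture + a[2] * size + a[3] * fill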

Combining Locations

At this point the Selective Search pipeline is essentially complete and the possible object locations have been extracted from the image. One last issue remains: ranking these locations. The number of extracted boxes is huge, while a user may only need a few of them, so it is worth assigning the boxes a priority and returning them in that order. The authors call this step Combining Locations; lacking a good translation, I keep the English term.

The ranking method is simple. Each region is first given a level number: as noted earlier, Selective Search builds a hierarchy by progressive merging, so the region covering the whole image gets level 1, the two regions that were merged to form it get level 2, and so on. Ranking by level alone has a flaw: large regions would always come first. To avoid this, each level is multiplied by a random number $RND \in [0, 1]$, and the regions are ranked by this product in ascending order.
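A sketch of this ranking, assuming each candidate box already carries the hierarchy level described above (1 for the last merged region covering the whole image, larger numbers for earlier, smaller regions):

import random

def rank_locations(boxes_with_level):
    # boxes_with_level: list of (box, level) pairs.
    keyed = [(level * random.random(), box) for box, level in boxes_with_level]
    keyed.sort(key=lambda kv: kv[0])     # smaller randomized key = higher priority
    return [box for _, box in keyed]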

References

  • Selective Search for Object Recognition
  • Efficient Graph-Based Image Segmentation

Author: Jermmy

Original link: https://jermmy.github.io/2017/05/04/2017-5-4-paper-notes-selective-search/

Copyright: unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 3.0. Please cite the source when reposting!


Micro- and Macro-average of Precision, Recall and F-Score

Posted on 2018-04-27

Micro-average Method

In the micro-average method, you sum up the individual true positives, false positives, and false negatives of the system across the different sets and then compute the statistics from those totals. For example, for one set of data, the system's counts are

True positive (TP1)= 12
False positive (FP1)=9
False negative (FN1)=3

Then precision (P1) and recall (R1) will be 57.14 and 80

and for a different set of data, the system’s

True positive (TP2)= 50
False positive (FP2)=23
False negative (FN2)=9

Then precision (P2) and recall (R2) will be 68.49 and 84.75

Now, the average precision and recall of the system using the Micro-average method is

Micro-average of precision = (TP1+TP2)/(TP1+TP2+FP1+FP2) = (12+50)/(12+50+9+23) = 65.96
Micro-average of recall = (TP1+TP2)/(TP1+TP2+FN1+FN2) = (12+50)/(12+50+3+9) = 83.78

The Micro-average F-Score will be simply the harmonic mean of these two figures.

Macro-average Method

The method is straightforward. Just take the average of the precision and recall of the system on the different sets. For example, the macro-average precision and recall of the system for the given example are

Macro-average precision = (P1+P2)/2 = (57.14+68.49)/2 = 62.82
Macro-average recall = (R1+R2)/2 = (80+84.75)/2 = 82.38

The Macro-average F-Score will be simply the harmonic mean of these two figures.
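The worked example above can be reproduced with a few lines of Python (values expressed as percentages to match the text):

def precision(tp, fp): return 100.0 * tp / (tp + fp)
def recall(tp, fn):    return 100.0 * tp / (tp + fn)
def f_score(p, r):     return 2 * p * r / (p + r)   # harmonic mean

TP1, FP1, FN1 = 12, 9, 3
TP2, FP2, FN2 = 50, 23, 9

micro_p = precision(TP1 + TP2, FP1 + FP2)                    # 65.96
micro_r = recall(TP1 + TP2, FN1 + FN2)                       # 83.78
macro_p = (precision(TP1, FP1) + precision(TP2, FP2)) / 2    # 62.82
macro_r = (recall(TP1, FN1) + recall(TP2, FN2)) / 2          # 82.38

print(micro_p, micro_r, f_score(micro_p, micro_r))
print(macro_p, macro_r, f_score(macro_p, macro_r))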

Suitability
The macro-average method can be used when you want to know how the system performs overall across the sets of data. Since it weights every set equally, you should not base decisions about any specific set on this average.

On the other hand, micro-average can be a useful measure when your dataset varies in size.


Get financial data from Tushare

Posted on 2018-04-11

Introduction

TuShare is a well-known free, open-source Python financial data interface package. Its official home page is: TuShare - financial data interface package. The package provides a large amount of financial data covering stocks, fundamentals, macro indicators, news, and more (please check the official website for details), and it keeps being updated. At present, the stock data covers about three years; although a bit short, it can basically meet the needs of quantitative beginners for testing.

Tutorial

Install and Import

You need to install the following dependencies first:

  • Pandas
  • lxml

Two ways to install tushare:

  1. pip install tushare
  2. visit https://pypi.python.org/pypi/Tushare/, download and install

How to update:

pip install tushare --upgrade

Import package and view package version:

import tushare as ts   # imported as `ts` to match the calls used later in this post
print(ts.__version__)

Using some simple functions

Stock data

Update: many of the quotes returned by the get_hist_data function are wrong, but both get_h_data and get_k_data can be used instead.
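As a hedged example (assuming the ts alias from the import above, and that the installed tushare version still provides get_k_data), a minimal replacement call might look like this:

ts.get_k_data('600848', start='2015-01-05', end='2015-01-09')   # daily bars, same code/start/end style as get_hist_data
ts.get_k_data('600848', ktype='W')                              # weekly bars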

We should still master how to use tushare to obtain stock market data, using the ts.get_hist_data() function whose input parameters are:

  • code: stock code, i.e. the 6-digit code, or an index code (sh = Shanghai Composite Index, sz = Shenzhen Component Index, hs300 = CSI 300 Index, sz50 = SSE 50, zxb = SME board index, cyb = ChiNext (GEM) board index)
  • start: start date, format YYYY-MM-DD
  • end: end date, format YYYY-MM-DD
  • ktype: data type (D = daily k-line, W = weekly, M = monthly, 5 = 5 minutes, 15 = 15 minutes, 30 = 30 minutes, 60 = 60 minutes); the default is D
  • retry_count: the number of retries after a network error; the default is 3
  • pause: seconds to pause between retries; the default is 0

Return values:

  • date: date
  • open: opening price
  • high: highest price
  • close: closing price
  • low: lowest price
  • volume: trading volume
  • price_change: price change
  • p_change: percentage change
  • ma5: 5-day moving average price
  • ma10: 10-day moving average price
  • ma20: 20-day moving average price
  • v_ma5: 5-day moving average volume
  • v_ma10: 10-day moving average volume
  • v_ma20: 20-day moving average volume
  • turnover: turnover rate [Note: indices do not have this item]

Specific examples:

ts.get_hist_data('600848')
date open high close low volume p_change ma5
2012-01-11 6.880 7.380 7.060 6.880 14129.96 2.62 7.060
2012-01-12 7.050 7.100 6.980 6.900 7895.19 -1.13 7.020
2012-01-13 6.950 7.000 6.700 6.690 6611.87 -4.01 6.913
2012-01-16 6.680 6.750 6.510 6.480 2941.63 -2.84 6.813
2012-01-17 6.660 6.880 6.860 6.460 8642.57 5.38 6.822
2012-01-18 7.000 7.300 6.890 6.880 13075.40 0.44 6.788
2012-01-19 6.690 6.950 6.890 6.680 6117.32 0.00 6.770
2012-01-20 6.870 7.080 7.010 6.870 6813.09 1.74 6.832
date ma10 ma20 v_ma5 v_ma10 v_ma20 turnover
2012-01-11 7.060 7.060 14129.96 14129.96 14129.96 0.48
2012-01-12 7.020 7.020 11012.58 11012.58 11012.58 0.27
2012-01-13 6.913 6.913 9545.67 9545.67 9545.67 0.23
2012-01-16 6.813 6.813 7894.66 7894.66 7894.66 0.10
2012-01-17 6.822 6.822 8044.24 8044.24 8044.24 0.30
2012-01-18 6.833 6.833 7833.33 8882.77 8882.77 0.45
2012-01-19 6.841 6.841 7477.76 8487.71 8487.71 0.21
2012-01-20 6.863 6.863 7518.00 8278.38 8278.38 0.23

You can also set the start time and end time of historical data:

ts.get_hist_data('600848',start='2015-01-05',end='2015-01-09')
date open high close low volume p_change ma5 ma10
2015-01-05 11.160 11.390 11.260 10.890 46383.57 1.26 11.156 11.212
2015-01-06 11.130 11.660 11.610 11.030 59199.93 3.11 11.182 11.155
2015-01-07 11.580 11.990 11.920 11.480 86681.38 2.67 11.366 11.251
2015-01-08 11.700 11.920 11.670 11.640 56845.71 -2.10 11.516 11.349
2015-01-09 11.680 11.710 11.230 11.190 44851.56 -3.77 11.538 11.363
date ma20 v_ma5 v_ma10 v_ma20 turnover
2015-01-05 11.198 58648.75 68429.87 97141.81 1.59
2015-01-06 11.382 54854.38 63401.05 98686.98 2.03
2015-01-07 11.543 55049.74 61628.07 103010.58 2.97
2015-01-08 11.647 57268.99 61376.00 105823.50 1.95

Others:

ts.get_hist_data('600848', ktype='W') # Get weekly k-line data
ts.get_hist_data('600848', ktype='M') # Get monthly k-line data
ts.get_hist_data('600848', ktype='5') # Get 5 minutes k-line data
ts.get_hist_data('600848', ktype='15') # Get 15 minutes k-line data
ts.get_hist_data('600848', ktype='30') # Get 30 minutes k-line data
ts.get_hist_data('600848', ktype='60') # Get 60 minutes k-line data
ts.get_hist_data('sh')# Get data on the Shanghai index k-line, other parameters consistent with the stocks, the same below
ts.get_hist_data('sz')# Get Shenzhen Component Index k-line data
ts.get_hist_data('hs300')# Get the CSI 300 k line data
ts.get_hist_data('sz50')# Get SSE 50 Index k-line data
ts.get_hist_data('zxb')# Get the k-line data of small and medium board indices
ts.get_hist_data('cyb')# Get GEM Index k-line data

Fundamental data

With tushare we can also get fundamental data through ts.get_stock_basics() (only part of the result is shown below):

ts.get_stock_basics()
code name industry area pe outstanding totals totalAssets
300563 N神宇 通信设备 江苏 26.73 2000.00 8000.00 4.216000e+04
601882 海天精工 机床制造 浙江 26.83 5220.00 52200.00 1.877284e+05
601880 大连港 港口 辽宁 76.40 773582.00 1289453.63 3.263012e+06
300556 丝路视觉 软件服务 深圳 101.38 2780.00 11113.33 4.448248e+04
600528 中铁二局 建筑施工 四川 149.34 145920.00 145920.00 5.709568e+06
002495 佳隆股份 食品 广东 202.12 66611.13 93562.56 1.169174e+05
600917 重庆燃气 供气供热 重庆 76.87 15600.00 155600.00 8.444600e+05
002752 昇兴股份 广告包装 福建 75.14 12306.83 63000.00 2.387493e+05
002346 柘中股份 电气设备 上海 643.97 7980.00 44157.53 2.263010e+05
000680 山推股份 工程机械 山东 0.00 105694.97 124078.75 9.050701e+05
...

Macro data

We use the consumer price index (CPI) as an example, which can be obtained through the ts.get_cpi() function (it returns 322 items at a time; only some of them are shown):

print(ts.get_cpi())
month cpi
0 2016.10 102.10
1 2016.9 101.90
2 2016.8 101.34
3 2016.7 101.77
4 2016.6 101.88
5 2016.5 102.04
6 2016.4 102.33
7 2016.3 102.30
8 2016.2 102.28
9 2016.1 101.75
10 2015.12 101.64
...

Recent news

The tushare package can use the ts.get_latest_news() function to view the latest news, and it returns 80 items. For reasons of space, we only show the first 15 here. We can see that they are all Sina Finance news items.

print(ts.get_latest_news())
classify title time \
0 美股 “特朗普通胀”预期升温 美国国债下挫 11-14 23:10
1 美股 特朗普:脸书、推特等社交媒体助我入主白宫 11-14 23:10
2 证券 11月14日晚增减持每日速览 11-14 22:54
3 美股 财经观察:日本为何急于推动TPP批准程序 11-14 22:54
4 美股 新总统谜题:特朗普会连续加息吗? 11-14 22:52
5 证券 神州专车财报遭质疑 增发100亿股东退出需50年 11-14 22:41
6 证券 恒大闪电杀回马枪锁仓半年 戒短炒了吗? 11-14 22:38
7 国内财经 楼继伟力推改革做派 或加快国有资本划拨社保 11-14 22:36
8 美股 开盘:美股周一小幅高开 延续上周涨势 11-14 22:32
9 美股 喜达屋创始人:当好总统就要走中庸之道 11-14 22:24
10 证券 北京高华:将乐视网评级下调至中性 11-14 22:09
11 美股 11月14日22点交易员正关注要闻 11-14 22:02
12 美股 摩根大通:新兴市场股市、货币的前景悲观 11-14 21:55
13 国内财经 人民日报刊文谈全面深化改革这三年:啃下硬骨头 11-14 21:46
14 证券 泽平宏观:经济L型延续 地产销量回落投资超预期 11-14 21:43
15 证券 黄燕铭等五大券商大佬告诉你 2017年买点啥? 11-14 21:41
url
0 http://finance.sina.com.cn/stock/usstock/c/201...
1 http://finance.sina.com.cn/stock/usstock/c/201...
2 http://finance.sina.com.cn/stock/y/2016-11-14/...
3 http://finance.sina.com.cn/stock/usstock/c/201...
4 http://finance.sina.com.cn/stock/usstock/c/201...
5 http://finance.sina.com.cn/stock/marketresearc...
6 http://finance.sina.com.cn/stock/marketresearc...
7 http://finance.sina.com.cn/china/gncj/2016-11-...
8 http://finance.sina.com.cn/stock/usstock/c/201...
9 http://finance.sina.com.cn/stock/usstock/c/201...
10 http://finance.sina.com.cn/stock/s/2016-11-14/...
11 http://finance.sina.com.cn/stock/usstock/c/201...
12 http://finance.sina.com.cn/stock/usstock/c/201...
13 http://finance.sina.com.cn/china/gncj/2016-11-...
14 http://finance.sina.com.cn/stock/marketresearc...
15 http://finance.sina.com.cn/stock/marketresearc...

A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN

Posted on 2018-04-10

Ever since Alex Krizhevsky, Geoff Hinton, and Ilya Sutskever won ImageNet in 2012, Convolutional Neural Networks (CNNs) have become the gold standard for image classification. In fact, since then, CNNs have improved to the point where they now outperform humans on the ImageNet challenge!

CNNs now outperform humans on the ImageNet challenge. The y-axis in the above graph is the error rate on ImageNet.

While these results are impressive, image classification is far simpler than the complexity and diversity of true human visual understanding.

An example of an image used in the classification challenge. Note how the image is well framed and has just one object.

In classification, there’s generally an image with a single object as the focus and the task is to say what that image is (see above). But when we look at the world around us, we carry out far more complex tasks.

Sights in real life are often composed of a multitude of different, overlapping objects, backgrounds, and actions.

We see complicated sights with multiple overlapping objects, and different backgrounds and we not only classify these different objects but also identify their boundaries, differences, and relations to one another!

In image segmentation, our goal is to classify the different objects in the image, and identify their boundaries. Source: Mask R-CNN paper.

Can CNNs help us with such complex tasks? Namely, given a more complicated image, can we use CNNs to identify the different objects in the image, and their boundaries? As has been shown by Ross Girshick and his peers over the last few years, the answer is conclusively yes.

Goals of this Post

Through this post, we’ll cover the intuition behind some of the main techniques used in object detection and segmentation and see how they’ve evolved from one implementation to the next. In particular, we’ll cover R-CNN (Regional CNN), the original application of CNNs to this problem, along with its descendants Fast R-CNN, and Faster R-CNN. Finally, we’ll cover Mask R-CNN, a paper released recently by Facebook Research that extends such object detection techniques to provide pixel level segmentation. Here are the papers referenced in this post:

  1. R-CNN: https://arxiv.org/abs/1311.2524
  2. Fast R-CNN: https://arxiv.org/abs/1504.08083
  3. Faster R-CNN: https://arxiv.org/abs/1506.01497
  4. Mask R-CNN: https://arxiv.org/abs/1703.06870

2014: R-CNN - An Early Application of CNNs to Object Detection

Object detection algorithms such as R-CNN take in an image and identify the locations and classifications of the main objects in the image. Source: https://arxiv.org/abs/1311.2524.

Inspired by the research of Hinton’s lab at the University of Toronto, a small team at UC Berkeley, led by Professor Jitendra Malik, asked themselves what today seems like an inevitable question:

To what extent do [Krizhevsky et al.'s results] generalize to object detection?

Object detection is the task of finding the different objects in an image and classifying them (as seen in the image above). The team, composed of Ross Girshick (a name we’ll see again), Jeff Donahue, and Trevor Darrell, found that this problem can be solved with Krizhevsky’s results by testing on the PASCAL VOC Challenge, a popular object detection challenge akin to ImageNet. They write,

This paper is the first to show that a CNN can lead to dramatically higher object detection performance on PASCAL VOC as compared to systems based on simpler HOG-like features.

Let’s now take a moment to understand how their architecture, Regions With CNNs (R-CNN) works.

Understanding R-CNN

The goal of R-CNN is to take in an image and correctly identify where the main objects in the image are (via bounding boxes).

  • Inputs: Image
  • Outputs: Bounding boxes + labels for each object in the image.

But how do we find out where these bounding boxes are? R-CNN does what we might intuitively do as well - propose a bunch of boxes in the image and see if any of them actually correspond to an object.

Selective Search looks through windows of multiple scales and looks for adjacent pixels that share textures, colors, or intensities. Image source: https://www.koen.me/research/pub/uijlings-ijcv2013-draft.pdf

R-CNN creates these bounding boxes, or region proposals, using a process called Selective Search which you can read about here. At a high level, Selective Search (shown in the image above) looks at the image through windows of different sizes, and for each size tries to group together adjacent pixels by texture, color, or intensity to identify objects.

After creating a set of region proposals, R-CNN passes the image through a modified version of AlexNet to determine whether or not it is a valid region. Source: https://arxiv.org/abs/1311.2524.

Once the proposals are created, R-CNN warps the region to a standard square size and passes it through to a modified version of AlexNet (the winning submission to ImageNet 2012 that inspired R-CNN), as shown above.

On the final layer of the CNN, R-CNN adds a Support Vector Machine (SVM) that simply classifies whether this is an object, and if so what object. This is step 4 in the image above.

Improving the Bounding Boxes

Now, having found the object in the box, can we tighten the box to fit the true dimensions of the object? We can, and this is the final step of R-CNN. R-CNN runs a simple linear regression on the region proposal to generate tighter bounding box coordinates to get our final result. Here are the inputs and outputs of this regression model:

  • Inputs: sub-regions of the image corresponding to objects.
  • Outputs: New bounding box coordinates for the object in the sub-region.

So, to summarize, R-CNN is just the following steps:

  1. Generate a set of proposals for bounding boxes.
  2. Run the images in the bounding boxes through a pre-trained AlexNet and finally an SVM to see what object the image in the box is.
  3. Run the box through a linear regression model to output tighter coordinates for the box once the object has been classified.

2015: Fast R-CNN - Speeding up and Simplifying R-CNN

Ross Girshick wrote both R-CNN and Fast R-CNN. He continues to push the boundaries of Computer Vision at Facebook Research.

R-CNN works really well, but is really quite slow for a few simple reasons:

  1. It requires a forward pass of the CNN (AlexNet) for every single region proposal for every single image (that’s around 2000 forward passes per image!).
  2. It has to train three different models separately - the CNN to generate image features, the classifier that predicts the class, and the regression model to tighten the bounding boxes. This makes the pipeline extremely hard to train.

In 2015, Ross Girshick, the first author of R-CNN, solved both these problems, leading to the second algorithm in our short history - Fast R-CNN. Let’s now go over its main insights.

Fast R-CNN Insight 1: RoI (Region of Interest) Pooling

For the forward pass of the CNN, Girshick realized that for each image, a lot of proposed regions for the image invariably overlapped causing us to run the same CNN computation again and again (~2000 times!). His insight was simple — Why not run the CNN just once per image and then find a way to share that computation across the ~2000 proposals?

In RoIPool, a full forward pass of the image is created and the conv features for each region of interest are extracted from the resulting forward pass. Source: Stanford’s CS231N slides by Fei-Fei Li, Andrej Karpathy, and Justin Johnson.

This is exactly what Fast R-CNN does using a technique known as RoIPool (Region of Interest Pooling). At its core, RoIPool shares the forward pass of a CNN for an image across its subregions. In the image above, notice how the CNN features for each region are obtained by selecting a corresponding region from the CNN’s feature map. Then, the features in each region are pooled (usually using max pooling). So all it takes us is one pass of the original image as opposed to ~2000!
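A toy NumPy sketch of this idea (not the actual implementation): one feature map is computed for the whole image, and each region of interest simply crops its own window from it and max-pools it down to a fixed size.

import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    # feature_map: (H, W) array from the single forward pass over the whole image.
    # roi: (y0, x0, y1, x1) window in feature-map coordinates.
    y0, x0, y1, x1 = roi
    window = feature_map[y0:y1, x0:x1]
    h, w = window.shape
    pooled = np.zeros((output_size, output_size))
    for i in range(output_size):
        for j in range(output_size):
            ys = slice(i * h // output_size, (i + 1) * h // output_size)
            xs = slice(j * w // output_size, (j + 1) * w // output_size)
            pooled[i, j] = window[ys, xs].max()
    return pooled

features = np.random.rand(32, 32)              # one shared feature map
proposals = [(0, 0, 8, 8), (4, 6, 20, 30)]     # many RoIs reuse the same forward pass
pooled = [roi_pool(features, r) for r in proposals]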

Fast R-CNN Insight 2: Combine All Models into One Network

Fast R-CNN combined the CNN, classifier, and bounding box regressor into one, single network. Source: https://www.slideshare.net/simplyinsimple/detection-52781995.

The second insight of Fast R-CNN is to jointly train the CNN, classifier, and bounding box regressor in a single model. Where earlier we had different models to extract image features (CNN), classify (SVM), and tighten bounding boxes (regressor), Fast R-CNN instead used a single network to compute all three.

You can see how this was done in the image above. Fast R-CNN replaced the SVM classifier with a softmax layer on top of the CNN to output a classification. It also added a linear regression layer parallel to the softmax layer to output bounding box coordinates. In this way, all the outputs needed came from one single network! Here are the inputs and outputs to this overall model:

  • Inputs: Images with region proposals.
  • Outputs: Object classifications of each region along with tighter bounding boxes.

2016: Faster R-CNN - Speeding Up Region Proposal

Even with all these advancements, there was still one remaining bottleneck in the Fast R-CNN process — the region proposer. As we saw, the very first step to detecting the locations of objects is generating a bunch of potential bounding boxes or regions of interest to test. In Fast R-CNN, these proposals were created using Selective Search, a fairly slow process that was found to be the bottleneck of the overall process.

Jian Sun, a principal researcher at Microsoft Research, led the team behind Faster R-CNN. Source: https://blogs.microsoft.com/next/2015/12/10/microsoft-researchers-win-imagenet-computer-vision-challenge/#sm.00017fqnl1bz6fqf11amuo0d9ttdp

In mid-2015, a team at Microsoft Research composed of Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun found a way to make the region proposal step almost cost free through an architecture they (creatively) named Faster R-CNN.

The insight of Faster R-CNN was that region proposals depended on features of the image that were already calculated with the forward pass of the CNN (first step of classification). So why not reuse those same CNN results for region proposals instead of running a separate selective search algorithm?

In Faster R-CNN, a single CNN is used for region proposals, and classifications. Source: https://arxiv.org/abs/1506.01497.

Indeed, this is just what the Faster R-CNN team achieved. In the image above, you can see how a single CNN is used to both carry out region proposals and classification. This way, only one CNN needs to be trained and we get region proposals almost for free! The authors write:

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R- CNN, can also be used for generating region proposals [thus enabling nearly cost-free region proposals].

Here are the inputs and outputs of their model:

  • Inputs: Images (Notice how region proposals are not needed).
  • Outputs: Classifications and bounding box coordinates of objects in the images.

How the Regions are Generated

Let’s take a moment to see how Faster R-CNN generates these region proposals from CNN features. Faster R-CNN adds a Fully Convolutional Network on top of the features of the CNN creating what’s known as the Region Proposal Network.

The Region Proposal Network slides a window over the features of the CNN. At each window location, the network outputs a score and a bounding box per anchor (hence 4k box coordinates where k is the number of anchors). Source: https://arxiv.org/abs/1506.01497.

The Region Proposal Network works by passing a sliding window over the CNN feature map and at each window, outputting k potential bounding boxes and scores for how good each of those boxes is expected to be. What do these k boxes represent?

We know that the bounding boxes for people tend to be rectangular and vertical. We can use this intuition to guide our Region Proposal networks through creating an anchor of such dimensions. Image Source: http://vlm1.uta.edu/~athitsos/courses/cse6367_spring2011/assignments/assignment1/bbox0062.jpg.

Intuitively, we know that objects in an image should fit certain common aspect ratios and sizes. For instance, we know that we want some rectangular boxes that resemble the shapes of humans. Likewise, we know we won’t see many boxes that are very very thin. In such a way, we create k such common aspect ratios we call anchor boxes. For each such anchor box, we output one bounding box and score per position in the image.

With these anchor boxes in mind, let’s take a look at the inputs and outputs to this Region Proposal Network:

  • Inputs: CNN Feature Map.
  • Outputs: A bounding box per anchor. A score representing how likely the image in that bounding box will be an object.

We then pass each such bounding box that is likely to be an object into Fast R-CNN to generate a classification and tightened bounding boxes.
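A small sketch of how such anchor boxes can be laid out at one sliding-window position. The paper uses 3 scales and 3 aspect ratios (k = 9); the exact helper below is illustrative, not taken from the paper's code.

import numpy as np

def anchors_at(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    # Returns k = len(scales) * len(ratios) boxes (x0, y0, x1, y1) centred at (cx, cy),
    # each with area scale**2 and width/height ratio r.
    boxes = []
    for s in scales:
        for r in ratios:
            w = s * np.sqrt(r)
            h = s / np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

print(anchors_at(100, 100).shape)   # (9, 4): one box per anchor at this position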


2017: Mask R-CNN - Extending Faster R-CNN for Pixel Level Segmentation

The goal of image instance segmentation is to identify, at a pixel level, what the different objects in a scene are. Source: https://arxiv.org/abs/1703.06870.

So far, we’ve seen how we’ve been able to use CNN features in many interesting ways to effectively locate different objects in an image with bounding boxes.

Can we extend such techniques to go one step further and locate exact pixels of each object instead of just bounding boxes? This problem, known as image segmentation, is what Kaiming He and a team of researchers, including Girshick, explored at Facebook AI using an architecture known as Mask R-CNN.

Kaiming He, a researcher at Facebook AI, is lead author of Mask R-CNN and also a coauthor of Faster R-CNN.

Much like Fast R-CNN and Faster R-CNN, Mask R-CNN’s underlying intuition is straightforward. Given that Faster R-CNN works so well for object detection, could we extend it to also carry out pixel level segmentation?

In Mask R-CNN, a Fully Convolutional Network (FCN) is added on top of the CNN features of Faster R-CNN to generate a mask (segmentation output). Notice how this is in parallel to the classification and bounding box regression network of Faster R-CNN. Source: https://arxiv.org/abs/1703.06870.

Mask R-CNN does this by adding a branch to Faster R-CNN that outputs a binary mask that says whether or not a given pixel is part of an object. The branch (in white in the above image), as before, is just a Fully Convolutional Network on top of a CNN based feature map. Here are its inputs and outputs:

  • Inputs: CNN Feature Map.
  • Outputs: Matrix with 1s on all locations where the pixel belongs to the object and 0s elsewhere (this is known as a binary mask).

But the Mask R-CNN authors had to make one small adjustment to make this pipeline work as expected.

RoIAlign - Realigning RoIPool to be More Accurate

Instead of RoIPool, the image gets passed through RoIAlign so that the regions of the feature map selected by RoIPool correspond more precisely to the regions of the original image. This is needed because pixel level segmentation requires more fine-grained alignment than bounding boxes. Source: https://arxiv.org/abs/1703.06870.

When run without modifications on the original Faster R-CNN architecture, the Mask R-CNN authors realized that the regions of the feature map selected by RoIPool were slightly misaligned from the regions of the original image. Since image segmentation requires pixel level specificity, unlike bounding boxes, this naturally led to inaccuracies.

The authors were able to solve this problem by cleverly adjusting RoIPool to be more precisely aligned using a method known as RoIAlign.

How do we accurately map a region of interest from the original image onto the feature map?

Imagine we have an image of size 128x128 and a feature map of size 25x25. Let’s imagine we want the features for the region corresponding to the top-left 15x15 pixels in the original image (see above). How might we select these pixels from the feature map?

We know each pixel in the original image corresponds to ~25/128 pixels in the feature map. To select 15 pixels from the original image, we just select 15 × 25/128 ≈ 2.93 pixels.

In RoIPool, we would round this down and select 2 pixels causing a slight misalignment. However, in RoIAlign, we avoid such rounding. Instead, we use bilinear interpolation to get a precise idea of what would be at pixel 2.93. This, at a high level, is what allows us to avoid the misalignments caused by RoIPool.
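The 1-D version of this difference can be shown in a few lines (a toy sketch, not the Mask R-CNN implementation): RoIPool snaps 2.93 to an integer index, while bilinear (here, linear) interpolation blends the two neighbouring feature values.

import numpy as np

feature_row = np.array([0.0, 10.0, 20.0, 30.0])   # toy 1-D slice of a feature map
x = 2.93                                           # the non-integer position from the example

roi_pool_value = feature_row[int(x)]               # rounds down to index 2 -> 20.0 (misaligned)

lo, hi = int(np.floor(x)), int(np.ceil(x))
frac = x - lo
roi_align_value = (1 - frac) * feature_row[lo] + frac * feature_row[hi]   # 0.07*20 + 0.93*30 = 29.3

print(roi_pool_value, roi_align_value)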

Once these masks are generated, Mask R-CNN combines them with the classifications and bounding boxes from Faster R-CNN to generate such wonderfully precise segmentations:

Mask R-CNN is able to segment as well as classify the objects in an image. Source: https://arxiv.org/abs/1703.06870.


Code

If you’re interested in trying out these algorithms yourselves, here are relevant repositories:

Faster R-CNN

  • Caffe: https://github.com/rbgirshick/py-faster-rcnn
  • PyTorch: https://github.com/longcw/faster_rcnn_pytorch
  • MatLab: https://github.com/ShaoqingRen/faster_rcnn

Mask R-CNN

  • PyTorch: https://github.com/felixgwu/mask_rcnn_pytorch
  • TensorFlow: https://github.com/CharlesShang/FastMaskRCNN

Reblog from here.


Some Paper Summaries of Semantic Segmentation with Deep Learning

Posted on 2018-04-10

What exactly is semantic segmentation?

Semantic segmentation is understanding an image at the pixel level, i.e., we want to assign each pixel in the image an object class. For example, check out the following images.

Left: input image. Right: its semantic segmentation. Source.

Apart from recognizing the bike and the person riding it, we also have to delineate the boundaries of each object. Therefore, unlike classification, we need dense pixel-wise predictions from our models.

VOC2012 and MSCOCO are the most important datasets for semantic segmentation.

What are the different approaches?

Before deep learning took over computer vision, people used approaches like TextonForest and Random Forest based classifiers for semantic segmentation. As with image classification, convolutional neural networks (CNN) have had enormous success on segmentation problems.

One of the popular initial deep learning approaches was patch classification, where each pixel was separately classified using a patch of the image around it. The main reason to use patches was that classification networks usually have fully connected layers and therefore require fixed-size images.

In 2014, Fully Convolutional Networks (FCN) by Long et al. from Berkeley popularized CNN architectures for dense predictions without any fully connected layers. This allowed segmentation maps to be generated for images of any size and was also much faster compared to the patch classification approach. Almost all the subsequent state of the art approaches on semantic segmentation adopted this paradigm.

Apart from fully connected layers, one of the main problems with using CNNs for segmentation is pooling layers. Pooling layers increase the field of view and are able to aggregate the context while discarding the ‘where’ information. However, semantic segmentation requires the exact alignment of class maps and thus, needs the ‘where’ information to be preserved. Two different classes of architectures evolved in the literature to tackle this issue.

The first is the encoder-decoder architecture. The encoder gradually reduces the spatial dimension with pooling layers, and the decoder gradually recovers the object details and spatial dimension. There are usually shortcut connections from the encoder to the decoder to help the decoder recover the object details better. U-Net is a popular architecture from this class.

U-Net: an encoder-decoder architecture. Source.

Architectures in the second class use what are called dilated/atrous convolutions and do away with pooling layers.

Dilated/atrous convolutions. rate=1 is the same as a normal convolution. Source.

Conditional Random Field (CRF) postprocessing is usually used to improve the segmentation. CRFs are graphical models which ‘smooth’ the segmentation based on the underlying image intensities. They exploit the observation that pixels with similar intensity tend to be labeled with the same class. CRFs can boost scores by 1-2%.

CRF illustration. (b) The unary classifier output is the segmentation input to the CRF. (c, d, e) are variants of CRF, with (e) being the widely used one. Source.

In the next section, I’ll summarize a few papers that represent the evolution of segmentation architectures starting from FCN. All these architectures are benchmarked on VOC2012 evaluation server.

Summaries

The following papers are summarized (in chronological order):

  1. FCN
  2. SegNet
  3. Dilated Convolutions
  4. DeepLab (v1 & v2)
  5. RefineNet
  6. PSPNet
  7. Large Kernel Matters
  8. DeepLab v3

For each of these papers, I list down their key contributions and explain them. I also show their benchmark scores (mean IOU) on VOC2012 test dataset.

FCN

  • Fully Convolutional Networks for Semantic Segmentation
  • Submitted on 14 Nov 2014
  • Arxiv Link

Key Contributions:

  • Popularize the use of end to end convolutional networks for semantic segmentation
  • Re-purpose imagenet pretrained networks for segmentation
  • Upsample using deconvolutional layers
  • Introduce skip connections to improve over the coarseness of upsampling

Explanation:

The key observation is that fully connected layers in classification networks can be viewed as convolutions with kernels that cover their entire input regions. This is equivalent to evaluating the original classification network on overlapping input patches, but it is much more efficient because computation is shared over the overlapping regions of the patches. Although this observation is not unique to this paper (see OverFeat, this post), it improved the state of the art on VOC2012 significantly.

Fully connected layers as a convolution. Source.

After convolutionalizing the fully connected layers in an ImageNet-pretrained network like VGG, the feature maps still need to be upsampled because of the pooling operations in CNNs. Instead of using simple bilinear interpolation, deconvolutional layers can learn the interpolation. This layer is also known as upconvolution, full convolution, transposed convolution, or fractionally-strided convolution.
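A tiny NumPy check of the "fully connected layer as a convolution" equivalence (a sketch, assuming a single FC layer acting on a C×H×W input): the FC weight matrix can be viewed as a bank of convolution kernels the size of the whole input.

import numpy as np

C, H, W, n_out = 3, 4, 4, 10
x = np.random.rand(C, H, W)
W_fc = np.random.rand(n_out, C * H * W)                 # fully connected weights

fc_out = W_fc @ x.reshape(-1)                           # ordinary FC layer

W_conv = W_fc.reshape(n_out, C, H, W)                   # same weights, seen as n_out kernels of size C x H x W
conv_out = np.array([(k * x).sum() for k in W_conv])    # "convolution" evaluated at the single valid position

print(np.allclose(fc_out, conv_out))                    # True: same computation, but now slideable over larger inputs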

However, upsampling (even with deconvolutional layers) produces coarse segmentation maps because of loss of information during pooling. Therefore, shortcut/skip connections are introduced from higher resolution feature maps.

Benchmarks (VOC2012):

Score | Comment                                 | Source
62.2  | -                                       | leaderboard
67.2  | More momentum. Not described in paper   | leaderboard

My Comments:

  • This was an important contribution but state of the art has improved a lot by now though.

SegNet

  • SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation
  • Submitted on 2 Nov 2015
  • Arxiv Link

Key Contributions:

  • Maxpooling indices transferred to decoder to improve the segmentation resolution.

Explanation:

FCN, despite its upconvolutional layers and a few shortcut connections, produces coarse segmentation maps. Therefore, more shortcut connections are introduced. However, instead of copying the encoder features as in FCN, the indices from maxpooling are copied. This makes SegNet more memory efficient than FCN.

SegNet architecture. Source.

Benchmarks (VOC2012):

Score | Comment | Source
59.9  | -       | leaderboard

My comments:

  • FCN and SegNet are among the first encoder-decoder architectures.
  • SegNet's benchmark scores are no longer good enough for it to be used.

Dilated Convolutions

  • Multi-Scale Context Aggregation by Dilated Convolutions
  • Submitted on 23 Nov 2015
  • Arxiv Link

Key Contributions:

  • Use dilated convolutions, a convolutional layer for dense predictions.
  • Propose ‘context module’ which uses dilated convolutions for multi scale aggregation.

Explanation:

Pooling helps in classification networks because the receptive field increases, but this is not the best thing to do for segmentation because pooling decreases the resolution. Therefore, the authors use dilated convolution layers, which work like this:

Dilated/Atrous Convolutions. Source

A dilated convolutional layer (also called atrous convolution in DeepLab) allows an exponential increase in the field of view without a decrease in spatial dimensions.
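The "exponential increase" is easy to check with a little arithmetic (a sketch for stride-1 3x3 convolutions, where each layer adds 2*dilation to the receptive field):

def receptive_field(dilations, kernel=3):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

print(receptive_field([1, 1, 1]))          # 7  : three normal 3x3 convolutions
print(receptive_field([1, 2, 4]))          # 15 : doubling the dilation grows the field much faster
print(receptive_field([1, 2, 4, 8, 16]))   # 63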

The last two pooling layers of the pretrained classification network (here, VGG) are removed, and subsequent convolutional layers are replaced with dilated convolutions. In particular, convolutions between pool-3 and pool-4 have dilation 2, and convolutions after pool-4 have dilation 4. With this module (called the frontend module in the paper), dense predictions are obtained without any increase in the number of parameters.

A second module (called the context module in the paper) is trained separately with the outputs of the frontend module as inputs. It is a cascade of dilated convolutions with different dilation rates, so that multi-scale context is aggregated and the frontend's predictions are improved.

Benchmarks (VOC2012):

Score | Comment                        | Source
71.3  | frontend                       | reported in the paper
73.5  | frontend + context             | reported in the paper
74.7  | frontend + context + CRF       | reported in the paper
75.3  | frontend + context + CRF-RNN   | reported in the paper

My comments:

  • Note that the predicted segmentation map is 1/8th the size of the image. This is the case with almost all the approaches; the predictions are interpolated to get the final segmentation map.

DeepLab (v1 & v2)

  • v1 : Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs
  • Submitted on 22 Dec 2014
  • Arxiv Link
  • v2 : DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs
  • Submitted on 2 Jun 2016
  • Arxiv Link

Key Contributions:

  • Use atrous/dilated convolutions.
  • Propose atrous spatial pyramid pooling (ASPP)
  • Use Fully connected CRF

Explanation:

Atrous/dilated convolutions increase the field of view without increasing the number of parameters. The network is modified as in the dilated convolutions paper.

Multiscale processing is achieved either by passing multiple rescaled versions of original images to parallel CNN branches (Image pyramid) and/or by using multiple parallel atrous convolutional layers with different sampling rates (ASPP).

Structured prediction is done by fully connected CRF. CRF is trained/tuned separately as a post processing step.

DeepLab2 pipeline. Source.

Benchmarks (VOC2012):

Score | Comment                                        | Source
79.7  | ResNet-101 + atrous convolutions + ASPP + CRF  | leaderboard

RefineNet

  • RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation
  • Submitted on 20 Nov 2016
  • Arxiv Link

Key Contributions:

  • Encoder-Decoder architecture with well thought-out decoder blocks
  • All the components follow residual connection design

Explanation:

The approach of using dilated/atrous convolutions is not without downsides. Dilated convolutions are computationally expensive and take a lot of memory because they have to be applied to a large number of high-resolution feature maps. This hampers the computation of high-res predictions; DeepLab's predictions, for example, are 1/8th the size of the original input.

So, the paper proposes to use an encoder-decoder architecture. The encoder consists of ResNet-101 blocks. The decoder has RefineNet blocks, which concatenate/fuse high-resolution features from the encoder and low-resolution features from the previous RefineNet block.

RefineNet architecture. Source.

Each RefineNet block has a component that fuses the multi-resolution features by upsampling the lower-resolution ones, and a component that captures context with repeated 5 x 5, stride-1 pooling layers. Each of these components employs the residual connection design, following the identity mapping mindset.

RefineNet block. Source.

Benchmarks (VOC2012):

Score | Comment                                        | Source
84.2  | Uses CRF, multiscale inputs, COCO pretraining  | leaderboard

PSPNet

  • Pyramid Scene Parsing Network
  • Submitted on 4 Dec 2016
  • Arxiv Link

Key Contributions:

  • Propose pyramid pooling module to aggregate the context.
  • Use auxiliary loss

Explanation:

The global scene category matters because it provides clues about the distribution of the segmentation classes. The pyramid pooling module captures this information by applying large-kernel pooling layers.

Dilated convolutions are used, as in the dilated convolutions paper, to modify ResNet, and a pyramid pooling module is added to it. This module concatenates the feature maps from ResNet with the upsampled outputs of parallel pooling layers whose kernels cover the whole, half, and small portions of the image.

An auxiliary loss, in addition to the loss on the main branch, is applied after the fourth stage of ResNet (i.e. the input to the pyramid pooling module). This idea has also been called intermediate supervision elsewhere.

PSPNet architecture. Source.

Benchmarks (VOC2012):

Score | Comment                                           | Source
85.4  | MSCOCO pretraining, multi scale input, no CRF     | leaderboard
82.6  | no MSCOCO pretraining, multi scale input, no CRF  | reported in the paper

Large Kernel Matters

  • Large Kernel Matters – Improve Semantic Segmentation by Global Convolutional Network
  • Submitted on 8 Mar 2017
  • Arxiv Link

Key Contributions:

  • Propose an encoder-decoder architecture with very large convolution kernels

Explanation:

Semantic segmentation requires both segmentation and classification of the segmented objects. Since fully connected layers cannot be present in a segmentation architecture, convolutions with very large kernels are adopted instead.

Another reason to adopt large kernels is that although deeper networks like ResNet have a very large theoretical receptive field, studies show that the network tends to gather information from a much smaller region (the valid receptive field).

Larger kernels are computationally expensive and have a lot of parameters. Therefore, a k x k convolution is approximated by the sum of two branches, one stacking a 1 x k and a k x 1 convolution and the other stacking a k x 1 and a 1 x k convolution. This module is called the Global Convolutional Network (GCN) in the paper.
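The parameter savings are easy to verify with a rough per-channel-pair count (a sketch ignoring biases and channel counts; the paper experiments with kernel sizes up to 15):

k = 15
dense = k * k                              # full k x k kernel: 225 weights
gcn = (1 * k + k * 1) + (k * 1 + 1 * k)    # two stacked 1 x k / k x 1 branches: 60 weights
print(dense, gcn)                          # 225 vs 60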

As for the architecture, ResNet (without any dilated convolutions) forms the encoder, while GCNs and deconvolutions form the decoder. A simple residual block called Boundary Refinement (BR) is also used.

GCN architecture. Source.

Benchmarks (VOC2012):

Score | Comment                                         | Source
82.2  | -                                               | reported in the paper
83.6  | Improved training, not described in the paper   | leaderboard

DeepLab v3

  • Rethinking Atrous Convolution for Semantic Image Segmentation
  • Submitted on 17 Jun 2017
  • Arxiv Link

Key Contributions:

  • Improved atrous spatial pyramid pooling (ASPP)
  • A module which employs atrous convolutions in cascade

Explanation:

The ResNet model is modified to use dilated/atrous convolutions, as in DeepLabv2 and the dilated convolutions paper. The improved ASPP involves concatenation of image-level features, a 1x1 convolution, and three 3x3 atrous convolutions with different rates. Batch normalization is used after each of the parallel convolutional layers.

The cascaded module is a ResNet block, except that its component convolution layers are made atrous with different rates. It is similar to the context module used in the dilated convolutions paper, but is applied directly on intermediate feature maps instead of belief maps (belief maps are final CNN feature maps with a number of channels equal to the number of classes).

Both proposed models were evaluated independently, and an attempt to combine the two did not improve performance. They performed very similarly on the val set, with ASPP slightly better. CRF is not used.

Both models outperform the best model from DeepLabv2. The authors note that the improvement comes from batch normalization and a better way to encode multi-scale context.

DeepLabv3 ASPP (used for submission). Source.

Benchmarks (VOC2012):

Score | Comment                           | Source
85.7  | used ASPP (no cascaded modules)   | leaderboard

Reblog from here.
