Aggregating CNN features for remote sensing image retrieval

doi:10.6046/gtzyyg.2019.01.07

GE Yun¹,², JIANG Shunliang¹, YE Famao¹, JIANG Changlong², CHEN Ying², TANG Yiling¹

1.Information Engineering School, Nanchang University, Nanchang 330031, China

2.Software School, Nanchang Hangkong University, Nanchang 330063, China

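Corresponding author: YE Famao (b. 1978), male, associate professor, works mainly on remote sensing image processing and artificial intelligence. Email: yefamao@ncu.edu.cn.

Editor in charge: ZHANG Xian

Received: 2017-08-21 Revised: 2017-12-21 Online: 2019-03-15

Funding:

Jointly supported by the National Natural Science Foundation of China projects "Research on improving transferred convolutional neural network features for high spatial resolution remote sensing image retrieval" (No. 41801288), "Automatic registration of multi-source remote sensing images based on the artificial tabu immune principle" (No. 41261091), "No-reference stereoscopic image quality assessment based on multivariate natural scene statistics and local mean estimation" (No. 61662044) and "Object tracking in complex environments based on deep neural networks and memory mechanisms" (No. 61663031), and by the Jiangxi Provincial Youth Science Foundation project "Theory and new methods for user authentication and access rights in wireless sensor networks based on iris biometric keys" (No. 20161BAB212034).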

About the authors

GE Yun (b. 1983), female, PhD candidate, works mainly on remote sensing image retrieval and machine learning. Email: geyun@nchu.edu.cn.

Abstract

Hand-crafted features struggle to describe images accurately in high-resolution remote sensing image retrieval. A method that aggregates convolutional neural network (CNN) features is therefore proposed to improve the feature representation. First, the parameters of CNNs pre-trained on large-scale datasets are transferred to remote sensing images, and CNN features that represent local information are extracted for input images of different sizes. Then, average pooling with different pooling region sizes and bag of visual words (BoVW) are used to aggregate the CNN features, giving pooling features and BoVW features, respectively. Finally, the two aggregated features are used for remote sensing image retrieval. Experimental results show that a reasonable input image size improves the representation ability of the aggregated features. When the pooling region size is between 60% and 80% of the feature map, the vast majority of pooling features outperform the traditional average pooling method. The best average normalized modified retrieval rank values of the pooling feature and the BoVW feature are lower than that of the hand-crafted feature by 27.31 and 21.51 percentage points, respectively. Therefore, both average pooling and BoVW can effectively improve remote sensing image retrieval performance.

Keywords: remote sensing image; retrieval; convolutional neural network; average pooling; bag of visual words

Cite this article

GE Yun, JIANG Shunliang, YE Famao, JIANG Changlong, CHEN Ying, TANG Yiling. Aggregating CNN features for remote sensing image retrieval[J]. Remote Sensing for Land & Resources, 2019, 31(1): 49-57. doi:10.6046/gtzyyg.2019.01.07

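0 Introduction

With the development of remote sensing technology, the number of high-resolution remote sensing (HRRS) images has grown rapidly, and HRRS image retrieval has become both a research hotspot and a difficult problem. Content-based remote sensing image retrieval (CBRSIR) is currently the mainstream retrieval technique. It consists of two parts, feature extraction and similarity measurement, of which feature extraction is the key step.

Early CBRSIR retrieved images mainly through low-level features^[1], but low-level features can hardly express the high-level semantics of an image, which leads to a severe "semantic gap"^[2,3]. Three main approaches have been used to narrow this gap: ① relevance feedback^[2], which depends on the samples labelled during feedback; ② fusion of multiple features^[4], which combines the strengths of different features to describe an image more completely; ③ feature aggregation, which builds a more abstract, higher-level feature on top of local features. For example, bag of visual words (BoVW)^[5] is an aggregated feature obtained by K-means clustering of scale-invariant feature transform (SIFT) features, and local structure learning (LSL)^[6] is an aggregated feature obtained by combining local features with graph regularization. Aggregated features reduce redundant information, lower the feature dimension and improve representation ability, and thereby narrow the semantic gap.

Traditional aggregated features are all built on hand-crafted features, whose ability to represent images is limited and which are easily affected by human design choices. Deep convolutional neural networks (CNN) learn image features automatically, reduce this human influence, and are widely used in image classification, retrieval and object recognition^[7-11]. A CNN trained on a large-scale dataset such as ImageNet generalizes well and can be transferred effectively to other, smaller datasets. In CNN transfer learning, the outputs of the fully connected layers drew attention first^[7]; later, convolutional-layer features, which express local image information, received increasing attention^[8]. Convolutional-layer features are usually turned into aggregated features by encoding^[8] or pooling^[9].

In remote sensing image retrieval, the publicly available datasets are small and CNN parameters cannot be trained sufficiently on them, so existing work has focused on transferring CNNs to HRRS images for retrieval^[12,13,14]. Napoletano^[12] retrieved images with the fully connected layer features of a CNN. Zhou et al.^[13] and Hu et al.^[14] compared fully connected layer features with aggregated features based on convolutional-layer outputs, and fine-tuned the CNN. Zhou et al.^[13] also proposed a low-dimensional feature (low dimensional CNN, LDCNN), but its performance depends strongly on the dataset. Hu et al.^[14] proposed multi-scale concatenation for convolutional-layer features and multi-patch average pooling for fully connected layer features, but to extract the feature of a single image these methods must feed several inputs through the CNN, which makes feature extraction relatively complex.

These studies examined the fully connected and convolutional features of CNNs fairly thoroughly, but the aggregation applied to convolutional features was always an encoding method, and different aggregation methods for convolutional-layer features have not been compared. This paper therefore studies aggregation methods for CNN features in light of the characteristics of HRRS images and applies them to HRRS image retrieval. First, the parameters of the CNN are transferred to HRRS images, and CNN features that express local image information are extracted for input images of different sizes. Then, two methods, average pooling with different pooling regions and BoVW, are used to aggregate the CNN features into pooling features and BoVW features, and the pooling region and the number of visual words are studied. Finally, the two aggregated features are used for remote sensing image retrieval.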

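1 Image retrieval with aggregated CNN features

1.1 Network architecture

The CNN features to be aggregated are taken from the 16-layer VGG16^[15] and the 22-layer GoogLeNet^[16]. VGG16 increases network depth by stacking more convolutional layers, whereas GoogLeNet uses inception modules to increase both the depth and the width of the network. After the many levels of abstraction in the earlier layers, the later convolutional layers of both networks capture richer local information and also generalize better. For VGG16, the CNN features are the outputs of the last convolutional layer (conv5-3), its activation layer (relu5-3) and the following pooling layer (pool5). For GoogLeNet, they are the outputs of the second-to-last pooling layer (pool4) and the last two inception layers (incep5a and incep5b).

The output changes with the input image size, so the input size has a considerable effect on retrieval performance. Two sizes are considered: ① the default CNN input size, to which images are resized; for both VGG16 and GoogLeNet it is 224×224 pixels (all image sizes in this paper are in pixels, and the unit is omitted below); ② the original image size in the dataset. UC-Merced^[5] and WHU-RS^[17] are two commonly used HRRS datasets. Images in UC-Merced are 256×256, close to the default size, whereas images in WHU-RS are 600×600, far from it, so the two datasets allow the effect of input size on retrieval performance to be compared. Tab.1 and Tab.2 list the outputs of the selected layers for the different input sizes. Taking pool5 of VGG16 as an example, for a 224×224×3 input (3 being the R, G and B channels) the output of pool5 is 7×7×512, that is, 512 channels, each with a 7×7 feature map.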

Tab.1 Outputs of VGG16 under different input image sizes

Input image	conv5-3	relu5-3	pool5
224×224×3	14×14×512	14×14×512	7×7×512
256×256×3	16×16×512	16×16×512	8×8×512
600×600×3	38×38×512	38×38×512	19×19×512

Tab.2 Outputs of GoogLeNet under different input image sizes

Input image	pool4	incep5a	incep5b
224×224×3	7×7×832	7×7×832	7×7×1 024
256×256×3	8×8×832	8×8×832	8×8×1 024
600×600×3	18×18×832	18×18×832	18×18×1 024
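The spatial sizes in Tab.1 and Tab.2 can be reproduced with a short calculation. The sketch below is only an illustration: the kernel sizes, strides and the round-up (ceil) pooling behaviour are assumptions that the paper does not state, chosen because they match the tabulated values, and the function names are hypothetical.

```python
import math


def ceil_pool(n, kernel, stride):
    """Output side length of a pooling layer that rounds up (assumed ceil mode)."""
    return math.ceil((n - kernel) / stride) + 1


def vgg16_map_sizes(n):
    """Side lengths of the conv5-3/relu5-3 and pool5 feature maps for an n x n input."""
    for _ in range(4):              # pool1..pool4, assumed 2x2 with stride 2
        n = ceil_pool(n, 2, 2)
    return n, ceil_pool(n, 2, 2)    # (conv5-3 and relu5-3, pool5)


def googlenet_map_size(n):
    """Side length of the pool4, incep5a and incep5b feature maps."""
    n = (n + 2 * 3 - 7) // 2 + 1    # conv1, assumed 7x7 with stride 2 and padding 3
    for _ in range(4):              # pool1..pool4, assumed 3x3 with stride 2
        n = ceil_pool(n, 3, 2)
    return n


for size in (224, 256, 600):
    print(size, vgg16_map_sizes(size), googlenet_map_size(size))
```

Under these assumptions a 600×600 input gives 38×38 at conv5-3, 19×19 at pool5 and 18×18 for the GoogLeNet layers, in agreement with the two tables.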

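1.2 Feature extraction

1.2.1 CNN features

Let the output of layer l for image I be

(1) $f^{l}=s^{l}\times s^{l}\times c^{l}$

where $f^{l}$ is the CNN feature of layer l, $s^{l} \times s^{l}$ is the size of a feature map, and $c^{l}$ is the number of feature maps, i.e. the number of channels. Flattening $f^{l}$ directly into a feature vector gives too high a dimension and poor retrieval performance, so it has to be built into an aggregated feature.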

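1.2.2 Aggregated features

HRRS images have complex content and carry rich information. Average pooling with different pooling region sizes is therefore applied to them, in order to find a suitable pooling region that yields more discriminative pooling features. For feature encoding, the classical BoVW method is used.

1) Pooling features. The common average pooling method sets the pooling region equal to the feature map^[9]. Because HRRS images are rich in content, doing so may lose important information. Average pooling with pooling regions of different sizes is therefore proposed to obtain better features.

For the layer-l feature map of image I, of size $s^{l} \times s^{l}$, let the pooling region be $m^{l} \times m^{l}$, denoted $r^{l}$. With a stride of 1, the number of pooling regions is $(s^{l} - m^{l} + 1) \times (s^{l} - m^{l} + 1)$, denoted $k^{l}$. For each pooling region i, the pooling feature is

(2) $p^{l}(i) = \frac{1}{m^{l} \times m^{l}} \sum r^{l}(i), \quad i = 1,2, \dots, k^{l}$

where the sum runs over the elements of pooling region $r^{l}(i)$ and $m^{l} \times m^{l} \leq s^{l} \times s^{l}$, i.e. the pooling region is no larger than the feature map. When the pooling region is smaller than the feature map, more information is retained, which suits HRRS images with complex content. The output $p^{l}$ of Eq.(2) is a three-dimensional array of size $(s^{l} - m^{l} + 1) \times (s^{l} - m^{l} + 1) \times c^{l}$, which is converted into the pooling feature vector $A^{p} = [x_{1}, x_{2}, \dots, x_{D}]$, where $D = (s^{l} - m^{l} + 1) \times (s^{l} - m^{l} + 1) \times c^{l}$ is the dimension of the pooling feature.

With the proposed average pooling, an image is fed to the CNN only once. Setting a smaller pooling region on the output feature maps captures many pieces of local information and thus improves the feature representation.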
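As a minimal sketch of Eq.(2), the code below average-pools every channel with an m×m window and stride 1 and flattens the result. It uses plain Python lists so that it runs without extra libraries; the function names and the channel-major flattening order are assumptions made for illustration, since the paper does not specify the order of the vector elements.

```python
def average_pool_features(feature, m):
    """Average-pool each channel with an m x m window and stride 1.

    feature: list of c channels, each an s x s list of lists.
    Returns a flat vector of length (s - m + 1)**2 * c.
    """
    vector = []
    for channel in feature:
        s = len(channel)
        for top in range(s - m + 1):
            for left in range(s - m + 1):
                total = sum(channel[top + i][left + j]
                            for i in range(m) for j in range(m))
                vector.append(total / (m * m))
    return vector


def pooled_dimension(s, m, c):
    """Dimension D of the pooling feature for an s x s x c layer output."""
    return (s - m + 1) ** 2 * c


toy = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]   # one 3x3 channel
print(average_pool_features(toy, 2))         # [3.0, 4.0, 6.0, 7.0]
print(average_pool_features(toy, 3))         # [5.0], the traditional case m = s
print(pooled_dimension(7, 4, 512))           # 8192
print(pooled_dimension(7, 5, 832))           # 7488
```

The last two lines give the dimensions of a 4×4 region on the 7×7×512 pool5 output and of a 5×5 region on the 7×7×832 pool4 output.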

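2) BoVW features. Traditional BoVW features aggregate hand-crafted local features, whereas the BoVW features here aggregate CNN features that express local image information.

Following reference [8], the CNN feature $f^{l}$ can be viewed as giving, at each position (i,j) of the feature map, a $c^{l}$-dimensional feature vector $f_{i, j}^{l}$:

(3) $f_{i, j}^{l} = f^{l} (i, j), \quad i = 1,2, \dots, s^{l}; \; j = 1,2, \dots, s^{l}$

Layer l can therefore be regarded as outputting $n^{l}$ feature vectors of dimension $c^{l}$, where $n^{l} = s^{l} \times s^{l}$; they are denoted $B^{l} = [f_{1,1}^{l}, f_{1,2}^{l}, \dots, f_{s^{l}, s^{l}}^{l}]$. For pool5 of VGG16, the CNN feature at the default image size is 7×7×512, i.e. 49 local features of 512 dimensions.

Let N be the total number of images in the dataset. Features are extracted from all images as described above, giving the feature set $\{B_{1}^{l}, B_{2}^{l}, \dots, B_{N}^{l}\}$. K-means clustering of this set yields the BoVW codebook, each element of which is a visual word. Finally, hard assignment maps every feature vector of an image to its nearest visual word, and the occurrences of each visual word are counted. This gives the BoVW feature vector of each image, $A^{b} = [y_{1}, y_{2}, \dots, y_{K}]$, where K is the number of visual words and $y_{k}$ is the frequency of the k-th visual word.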
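The two stages described above, clustering the local CNN features into visual words and then counting hard assignments, can be sketched as follows. This is an illustrative toy version, not the authors' implementation: the helper names, the evenly spaced initialisation of the cluster centres and the fixed iteration count are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, words):
    """Index of the visual word closest to vec (hard assignment)."""
    return min(range(len(words)), key=lambda k: sq_dist(vec, words[k]))


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd iterations; initial words are evenly spaced samples (assumption)."""
    words = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, words)].append(v)
        for idx, group in enumerate(groups):
            if group:   # keep the old word if no vector was assigned to it
                words[idx] = [sum(col) / len(group) for col in zip(*group)]
    return words


def bovw_histogram(local_features, words):
    """Count how often each visual word is the nearest one."""
    hist = [0] * len(words)
    for v in local_features:
        hist[nearest(v, words)] += 1
    return hist


all_local = [[0, 0], [0, 1], [10, 10], [10, 11]]   # local features pooled over the dataset
words = kmeans(all_local, k=2)
print(words)                                        # [[0.0, 0.5], [10.0, 10.5]]
print(bovw_histogram([[1, 1], [9, 9], [0, 2]], words))   # [2, 1]
```

In the paper's setting each local feature would be one of the $c^{l}$-dimensional vectors $f_{i,j}^{l}$, and the histogram length would equal the chosen K.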

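1.3 Retrieval procedure

Fig.1 shows the whole retrieval procedure using VGG16 as an example, where ci (i = 1, 2, 3, 4, 5-1, 5-2, 5-3) denotes a convolutional layer. The procedure for GoogLeNet is similar; only the layers from which features are extracted differ.

Fig.1 Retrieval flow chart of VGG16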

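The retrieval steps are as follows:

1) Transfer the parameters of the pre-trained CNN to the HRRS dataset M and the query image q. Because aggregation operates on convolutional-layer features, the fully connected layers of VGG16 are removed and the convolutional-layer parameters of VGG16 are transferred directly to M and q. In Fig.1, the activation and pooling layers are omitted for all convolutional layers except conv5-3.

2) Extract the CNN features of M and q. Each image in M, and q, is fed to the CNN, and the outputs of conv5-3, relu5-3 and pool5 are taken as its CNN features. The CNN features of all images in M are $f_{M} = [f_{1}, f_{2}, \dots, f_{N}]$, where N is the total number of images in M; the CNN feature of q is denoted $f_{q}$.

3) Extract the aggregated features of M and q. Average pooling with different pooling regions and BoVW are applied to the CNN features of M and q to obtain the corresponding pooling features and BoVW features. For brevity, both are written in the same way: the aggregated feature of q is $F_{q}$, and the aggregated features of all images in M are $F_{M} = [A_{1}, A_{2}, \dots, A_{N}]$.

4) Normalize $F_{M}$ and $F_{q}$. The feature vectors of an image often represent different physical quantities, and even within one feature vector the components can differ greatly in range, so the aggregated features of M and q are normalized. The common L2 normalization is used.

5) Compute the similarity and complete the retrieval. Using the normalized aggregated features, the similarity between q and each image in M is computed, and the n most similar images are returned.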
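Steps 4) and 5) can be illustrated with a few lines of code. The sketch below assumes that similarity is the Euclidean distance between L2-normalized vectors (the distance named in Section 2.1); the function names and the toy vectors are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query_feature, dataset_features, n):
    """Return the indices of the n dataset images closest to the query."""
    q = l2_normalize(query_feature)
    scored = [(euclidean(q, l2_normalize(f)), idx)
              for idx, f in enumerate(dataset_features)]
    scored.sort()   # smaller distance means more similar
    return [idx for _, idx in scored[:n]]


print(retrieve([1, 0], [[2, 0], [0, 3], [1, 1]], n=2))   # [0, 2]
```

After normalization the first dataset vector coincides with the query, so it is returned first, followed by the vector at 45 degrees.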

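2 Experimental results and analysis

2.1 Data and evaluation criteria

The network models VGG16 and GoogLeNet are obtained with MatConvNet^[18]. Both were pre-trained on ILSVRC2012, a subset of ImageNet with 1 000 image classes, about 1.3 million training images, 50 000 validation images and 100 000 test images. The remote sensing datasets are UC-Merced and WHU-RS. UC-Merced consists of aerial orthoimagery collected from the United States Geological Survey, with 21 scene classes, 100 images per class and an image size of 256×256. WHU-RS consists of 19 scene classes downloaded from Google Earth, with 50 images per class and an image size of 600×600. Tab.3 shows sample images from the two datasets.

Tab.3 Sample images of UC-Merced and WHU-RS

Similarity is measured by the commonly used Euclidean distance. The evaluation criterion is the average normalized modified retrieval rank (ANMRR), which has been widely used for HRRS image retrieval in recent years. The smaller the ANMRR, the earlier the relevant images appear in the result list, i.e. the better the retrieval performance. The precision-recall curve, another important criterion in image retrieval, is also compared.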
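The paper does not spell out how ANMRR is computed. The sketch below follows the commonly cited MPEG-7 definition, which is assumed here rather than taken from the paper: for a query with NG relevant images, ranks beyond K = min(4·NG, 2·GTM) are replaced by 1.25·K, where GTM is the largest NG over all queries. Function names are hypothetical.

```python
def nmrr(ranks, ng, gtm):
    """Normalized modified retrieval rank for one query.

    ranks: 1-based ranks at which the ng relevant images were retrieved.
    gtm:   largest number of relevant images over all queries.
    """
    k = min(4 * ng, 2 * gtm)
    penalised = [r if r <= k else 1.25 * k for r in ranks]
    avr = sum(penalised) / ng                 # average rank
    mrr = avr - 0.5 * (1 + ng)                # modified retrieval rank
    return mrr / (1.25 * k - 0.5 * (1 + ng))  # scaled to the range [0, 1]


def anmrr(queries):
    """queries: list of (ranks, ng) pairs, one per query."""
    gtm = max(ng for _, ng in queries)
    return sum(nmrr(ranks, ng, gtm) for ranks, ng in queries) / len(queries)


perfect = ([1, 2], 2)    # both relevant images ranked first
missed = ([9, 10], 2)    # both relevant images ranked beyond K
print(anmrr([perfect]))          # 0.0
print(anmrr([missed]))           # 1.0
print(anmrr([perfect, missed]))  # 0.5
```

A value of 0 means every relevant image is ranked at the top, and a value of 1 means none is found within the first K results.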

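2.2 Comparison of pooling regions

When aggregated features are extracted by average pooling, the pooling region size affects retrieval performance. Fig.2 and Fig.3 compare the retrieval results for different pooling region sizes with VGG16 and GoogLeNet, respectively. For a 224×224 input, the feature maps of conv5-3 and relu5-3 in VGG16 are 14×14, and those of all other layers are 7×7. The abscissa values 2 to 7 in the figures denote pooling region sizes from 2×2 to 7×7. For convenience of display, for conv5-3 and relu5-3 in Fig.2 the pooling region size is twice the abscissa value, i.e. from 4×4 to 14×14.

Fig.2 ANMRR with different pooling region sizes in VGG16

Fig.3 ANMRR with different pooling region sizes in GoogLeNet

Fig.2(a) shows that the ANMRR values of all three features first fall and then rise; conv5-3 falls fastest, and pool5 reaches the smallest ANMRR, i.e. the best retrieval performance. Fig.2(b) shows that, as the pooling region grows, the ANMRR of relu5-3 decreases, whereas those of conv5-3 and pool5 first fall and then rise. When the pooling region is small, pool5 has the smallest ANMRR; as the region grows, the ANMRR of conv5-3 drops sharply and falls below that of pool5. In Fig.3(a) the minimum ANMRR of each of the three features lies close to the feature map size, and pool4 gives the best result. In Fig.3(b) the minimum ANMRR of pool4 is at 6×6 and that of the other layers is at 7×7; pool4 again has the best ANMRR of the three features.

Overall, in Fig.2 and Fig.3 the ANMRR of most features first decreases as the pooling region grows, reaches a minimum, and then increases. Except for relu5-3, incep5a and incep5b on WHU-RS, the features perform best when the pooling region is smaller than the feature map.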

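2.3 Pooling features with different input image sizes

Tab.4 and Tab.5 show the pooling-feature results on UC-Merced and WHU-RS for the two input sizes (default and original). For comparison with traditional average pooling, three pooling region sizes are listed for each feature: the first two are the two best results among region sizes from 2×2 to $(s^{l} - 1) \times (s^{l} - 1)$ (for a feature map of size $s^{l} \times s^{l}$), and the third is the result with the pooling region equal to the feature map, i.e. traditional average pooling. Bold marks the best result for each feature type, and an asterisk marks the overall best results.

Tab.4 ANMRR with different pooling features on the UC-Merced

CNN	Default size (224×224)						Original size (256×256)
VGG16	conv5-3		relu5-3		pool5		conv5-3		relu5-3		pool5
	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR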
	9×9	0.334 1	9×9	0.340 4	*4×4	0.324 3	11×11	0.341 8	10×10	0.342 0	4×4	0.327 6
	10×10	0.333 7	10×10	0.340 8	5×5	0.324 5	12×12	0.343 0	11×11	0.342 4	5×5	0.326 2
	14×14	0.361 4	14×14	0.368 2	7×7	0.356 7	16×16	0.369 3	16×16	0.369 9	8×8	0.358 9
GoogLeNet	pool4		incep5a		incep5b		pool4		incep5a		incep5b
	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR
	*5×5	0.317 9	5×5	0.346 0	5×5	0.340 9	5×5	0.324 4	5×5	0.354 7	6×6	0.343 8
	6×6	0.319 5	6×6	0.346 1	6×6	0.341 2	6×6	0.323 5	6×6	0.354 1	7×7	0.345 1
	7×7	0.326 9	7×7	0.348 4	7×7	0.344 1	8×8	0.337 5	8×8	0.359 6	8×8	0.349 3

Tab.5 ANMRR with different pooling features on the WHU-RS

CNN	Default size (224×224)						Original size (600×600)
VGG16	conv5-3		relu5-3		pool5		conv5-3		relu5-3		pool5
	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR
	10×10	0.240 0	12×12	0.273 6	5×5	0.253 5	30×30	0.237 5	28×28	0.243 1	*14×14	0.226 2
	11×11	0.237 9	13×13	0.273 1	6×6	0.251 8	31×31	0.237 5	29×29	0.242 9	15×15	0.226 5
	14×14	0.246 8	14×14	0.272 8	7×7	0.257 6	38×38	0.251 7	38×38	0.251 6	18×18	0.238 5
GoogLeNet	pool4		incep5a		incep5b		pool4		incep5a		incep5b
	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR	Size	ANMRR
	5×5	0.244 4	5×5	0.275 1	5×5	0.280 5	*14×14	0.232 1	16×16	0.262 5	16×16	0.250 9
	6×6	0.242 5	6×6	0.272 0	6×6	0.276 0	15×15	0.232 3	17×17	0.262 8	17×17	0.250 6
	7×7	0.245 3	7×7	0.270 1	7×7	0.273 4	18×18	0.241 5	18×18	0.263 4	18×18	0.251 0

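In Tab.4 the default and original input sizes are close, so the retrieval results are also close, with 256×256 slightly worse than 224×224 overall. A possible reason is that, compared with the only slightly different 256×256, images of 224×224 suit the CNN better and yield more discriminative features. In Tab.5 the two sizes differ greatly and so do the results: 600×600 outperforms 224×224, because resizing an image from 600×600 to 224×224 loses much information and directly degrades retrieval performance.

Comparing the two tables, as the input image size increases, the gap between the optimal pooling region size and the feature map size also increases. Simply setting the pooling region equal to the feature map therefore tends to lose important feature information, and a suitable pooling region should be chosen according to the input image size and the network layer. In these experiments, the optimal pooling region of most features is between 60% and 80% of the feature map size.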

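2.4 BoVW features with different input image sizes

Tab.6 and Tab.7 show the BoVW results for the two input sizes. To examine the effect of the number of visual words K on retrieval performance, K is set to 100, 150, 1 500, 2 000 and 4 000. Bold marks the best result for each feature type, and an asterisk marks the overall best results.

Tab.6 ANMRR with different BoVW features on the UC-Merced

CNN	Default size (224×224)				Original size (256×256)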
VGG16	K	conv5-3	relu5-3	pool5	conv5-3	relu5-3	pool5
	100	0.408 5	0.469 9	0.424 9	0.425 7	0.480 9	0.444 7
	150	*0.388 6	0.482 7	0.523 9	0.412 2	0.491 3	0.530 1
	1 500	0.410 5	0.474 8	0.482 1	0.414 5	0.477 2	0.475 2
	2 000	0.417 5	0.476 0	0.491 8	0.421 5	0.473 8	0.483 7
	4 000	0.432 1	0.478 1	0.508 6	0.436 5	0.480 0	0.497 7
GoogLeNet	K	pool4	incep5a	incep5b	pool4	incep5a	incep5b
	100	0.408 6	*0.375 9	0.402 3	0.400 5	0.394 5	0.397 5
	150	0.419 6	0.397 3	0.422 0	0.430 8	0.415 6	0.414 0
	1 500	0.536 7	0.535 4	0.480 3	0.541 4	0.532 7	0.480 1
	2 000	0.581 1	0.551 9	0.492 8	0.559 2	0.550 5	0.491 4
	4 000	0.613 8	0.616 9	0.535 3	0.616 1	0.609 8	0.540 6

Tab.7 ANMRR with different BoVW features on the WHU-RS

CNN	Default size (224×224)				Original size (600×600)
VGG16	K	conv5-3	relu5-3	pool5	conv5-3	relu5-3	pool5
	100	0.280 7	0.441 7	0.352 5	0.354 7	0.424 1	0.366 3
	150	*0.249 1	0.414 2	0.370 1	0.323 9	0.462 0	0.370 7
	1 500	0.275 2	0.411 2	0.391 8	0.268 2	0.356 6	0.355 7
	2 000	0.279 6	0.431 6	0.408 6	0.270 5	0.367 6	0.352 2
	4 000	0.311 5	0.425 4	0.440 1	0.278 7	0.350 8	0.344 5
GoogLeNet	K	pool4	incep5a	incep5b	pool4	incep5a	incep5b
	100	0.287 7	0.310 5	0.267 4	0.263 1	0.262 1	0.228 2
	150	0.298 7	0.311 6	0.286 4	0.226 8	0.259 6	*0.214 9
	1 500	0.440 4	0.449 4	0.425 7	0.278 8	0.304 2	0.258 7
	2 000	0.469 5	0.488 8	0.423 7	0.289 4	0.309 0	0.262 1
	4 000	0.653 8	0.656 0	0.491 4	0.309 5	0.337 0	0.283 6

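In Tab.6, most BoVW features perform better at 224×224 than at 256×256; the optimal K for most VGG16 features is 100 or 150, and for all GoogLeNet features it is 100. In Tab.7, most BoVW features perform better at 600×600 than at 224×224, most clearly for GoogLeNet. When the input image size grows markedly, the number of features used to build the visual words grows accordingly, and the optimal K also increases. For example, with a 600×600 input the optimal K for relu5-3 and pool5 rises to 4 000, and for all GoogLeNet layers it rises to 150.

Hence, for BoVW features, choosing a suitable K according to the image size and the feature map size is important for retrieval. When the input image size increases markedly, the optimal K also increases, most notably for VGG16.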

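2.5 Precision-recall curves

In the results above, most pooling features retrieve better than BoVW features. To compare the two kinds of aggregated features further, the best feature of each kind (those marked with an asterisk in Tab.4 to Tab.7) is selected and the precision-recall curves are compared. Precision is the ratio of the number of relevant images among the returned images to the number of returned images, and reflects retrieval accuracy. Recall is the ratio of the number of relevant images among the returned images to the total number of relevant images, and reflects retrieval completeness; it increases with the number of returned images. A higher curve means that both precision and recall are high, i.e. retrieval performance is better.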
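The two definitions above translate directly into code. The following sketch computes one point of a precision-recall curve from a ranked list of class labels; the function name and the toy labels are hypothetical, and relevance is assumed to mean "same class as the query".

```python
def precision_recall(ranked_labels, query_label, total_relevant, n):
    """Precision and recall when the first n ranked images are returned."""
    hits = sum(1 for label in ranked_labels[:n] if label == query_label)
    return hits / n, hits / total_relevant


ranked = ["a", "b", "a", "a", "c"]   # class labels of the ranked results
for n in (2, 5):
    print(n, precision_recall(ranked, "a", total_relevant=4, n=n))
# 2 (0.5, 0.25)
# 5 (0.6, 0.75)
```

Evaluating the function for a range of n and averaging over all queries gives one curve of the kind compared in this section.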

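Fig.4 compares the precision-recall curves of the different features. The best pooling features of VGG16 and GoogLeNet are denoted VGG16-P and GoogLeNet-P, and the best BoVW features are denoted VGG16-B and GoogLeNet-B. For UC-Merced the number of returned images ranges from 2 to 2 100, and for WHU-RS from 2 to 950. On UC-Merced, the curve of GoogLeNet-P lies on top, so GoogLeNet-P performs best, followed by VGG16-P. When few images are returned, the curve of GoogLeNet-B is above that of VGG16-B, i.e. GoogLeNet-B retrieves better than VGG16-B; as the number of returned images increases, the performance of GoogLeNet-B drops quickly and falls below that of VGG16-B. On WHU-RS, the curve of VGG16-B is the lowest, i.e. its retrieval performance is the worst, and the results of VGG16-P and GoogLeNet-P are close. The performance of GoogLeNet-B improves gradually as the number of returned images increases and even exceeds that of VGG16-P and GoogLeNet-P; once the number of returned images becomes large, it drops quickly again. Overall, on both datasets VGG16-P and GoogLeNet-P retrieve better than VGG16-B and GoogLeNet-B.

Fig.4 Precision-recall curves for different features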

2.6 Comparison with other methods

Tab.8 compares the ANMRR values and dimensions of shallow features and CNN features. The shallow features are the global morphological texture descriptors proposed by Aptoula^[3], BoVW built on hand-crafted SIFT features^[5], and the recently proposed LSL^[6]. The CNN features are those proposed in references [12-14] and the VGG16-P, GoogLeNet-P, VGG16-B and GoogLeNet-B features proposed in this paper. Because most of the other features were evaluated on UC-Merced, the comparison in Tab.8 is based on UC-Merced.

Tab.8 ANMRR and dimensions for different features

	Feature	ANMRR	Dimension
Shallow features	Aptoula^[3]	0.575 0	62
	BoVW^[5]	0.591 0	15 000
	BoVW^[5]	0.601 0	150
	LSL^[6]	0.555 6	2 048
CNN features	VGGM-fc^[12]	0.378 0	4 096
	VGGM-fc-RF^[12]	0.316 0	4 096
	VGG16-fc^[13]	0.394 0	4 096
	VGGM-conv5-IFK^[13]	0.458 0	102
	VGG16-conv5-IFK^[13]	0.407 0	102
	LDCNN^[13]	0.439 0	30
	GoogLeNet(FT)+MultiPatch^[14]	0.314 0	1 024
	VGG16-P	0.324 3	8 192
	GoogLeNet-P	0.317 9	7 488
	VGG16-B	0.388 6	150
	GoogLeNet-B	0.375 9	100

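Tab.8 shows that CNN features generally outperform shallow features. Compared with BoVW^[5] (ANMRR 0.591 0), the ANMRR values of GoogLeNet-P and GoogLeNet-B are lower by 0.273 1 and 0.215 1, i.e. by 27.31 and 21.51 percentage points, respectively.

Among the CNN features, VGGM-fc^[12] and VGGM-fc-RF^[12] are the fully connected layer features of VGGM without and with relevance feedback. VGG16-fc^[13] is the fully connected layer feature of VGG16. VGGM-conv5-IFK^[13] and VGG16-conv5-IFK^[13] encode the convolutional layers of VGGM and VGG16 with the improved Fisher kernel (IFK). GoogLeNet(FT)+MultiPatch^[14] is the fine-tuned GoogLeNet feature with multi-patch average pooling.

Tab.8 also shows that, apart from VGGM-fc-RF and GoogLeNet(FT)+MultiPatch, the two pooling features and GoogLeNet-B have lower ANMRR values than the other CNN features, and VGG16-B (0.388 6) is lower than all of them except VGGM-fc (0.378 0). The feature extraction of GoogLeNet(FT)+MultiPatch and VGGM-fc-RF is more complex than that of the methods proposed here. Choosing a suitable CNN and a reasonable aggregation method can therefore effectively improve HRRS image retrieval performance.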

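3 Conclusions

For the CNN features of VGG16 and GoogLeNet that express local information, two methods, average pooling with different pooling region sizes and BoVW, were used to obtain different aggregated features, which were then applied to HRRS image retrieval. The main conclusions are as follows:

1) For HRRS images, pooling features retrieve better than BoVW features.

2) The pooling region size directly affects the retrieval results of pooling features. For most pooling features the optimal pooling region size is between 60% and 80% of the feature map size. Such a size removes redundant information from the CNN features while retaining clearly discriminative information.

3) The number of visual words has a large effect on the retrieval performance of BoVW features. When the input image size increases markedly, the optimal number of visual words also increases, most notably for VGG16.

4) The input image size affects the retrieval performance of aggregated features. When the default size and the original size differ greatly, aggregated features from the original size retrieve better; when the two sizes are very close, the default size sometimes suits the CNN better.

5) Compared with traditional shallow features, the proposed aggregated features improve retrieval performance substantially. The ANMRR values of the best pooling feature (GoogLeNet-P) and the best BoVW feature (GoogLeNet-B) are 27.31 and 21.51 percentage points lower than that of the shallow BoVW feature, respectively. Compared with existing CNN features, the CNN features selected here are better suited to aggregation, and the aggregation methods are simple and effective.

The proposed aggregated features therefore improve HRRS image retrieval performance effectively, with pooling features giving the larger gain. However, the dimension of the pooling features is relatively high, and how to reduce it effectively will be studied in future work.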

参考文献

原文顺序

文献年度倒序

文中引用次数倒序

被引期刊影响因子

[1]

朱佳丽, 李士进, 万定生 , 等.

基于特征选择和半监督学习的遥感图像检索

[J]. 中国图象图形学报, 2011,16(8):1474-1482.

DOI:10.11834/jig.20110808 URL Magsci [本文引用: 1]

随着卫星遥感技术的不断发展，基于内容的遥感图像检索技术越来越受到关注。目前该方向的研究主要集中在对遥感图像中不同特征的提取和融合方面，这些方法普遍忽略了这样一个事实：对于不同类型的检索目标，特征应该是不同的。另外，小样本问题也是遥感图像检索中一个较为突出的问题。基于以上两方面考虑，本文提出一种基于特征选择和半监督学习的遥感图像检索新方法，该方法主要包括4个方面：1）利用最小描述长度准则自动确定聚类数目；2）结合聚类方法和适当的聚类有效性指标选择最能表示检索目标的特征，在计算聚类有效性指数时，针对遥感图像检索特点对原有的Davies-Bouldin指数进行了改进；3）动态确定最优颜色特征和最优纹理特征之间的权重；4）根据最优颜色特征和最优纹理特征的权重自动确定半监督学习方法，并进行遥感图像的检索。实验结果表明，与相关反馈方法的检索效果相比，该算法在土壤侵蚀区域检索以及其他一般地表覆盖目标检索中均获得了相近的检索效果，但不需要用户多次反馈。

Zhu J

, Li S

, Wan D

, et al.

Content-based remote sensing image retrieval based on feature selection and semi-supervised learning

[J]. Journal of Image and Graphics, 2011,16(8):1474-1482.

Magsci [本文引用: 1]

[2]

Demir

, Bruzzone

A novel active learning method in relevance feedback for content-based remote sensing image retrieval

[J]. IEEE Transactions on Geoscience and Remote Sensing, 2015,53(5):2323-2334.

DOI:10.1109/TGRS.2014.2358804 URL [本文引用: 2]

Conventional relevance feedback (RF) schemes improve the performance of content-based image retrieval (CBIR) requiring the user to annotate a large number of images. To reduce the labeling effort of the user, this paper presents a novel active learning (AL) method to drive RF for retrieving remote sensing images from large archives in the framework of the support vector machine classifier. The proposed AL method is specifically designed for CBIR and defines an effective and as small as possible set of relevant and irrelevant images with regard to a general query image by jointly evaluating three criteria: uncertainty; diversity; and density of images in the archive. The uncertainty and diversity criteria aim at selecting the most informative images in the archive, whereas the density criterion goal is to choose the images that are representative of the underlying distribution of data in the archive. The proposed AL method assesses jointly the three criteria based on two successive steps. In the first step, the most uncertain (i.e., ambiguous) images are selected from the archive on the basis of the margin sampling strategy. In the second step, the images that are both diverse (i.e., distant) to each other and associated to the high-density regions of the image feature space in the archive are chosen from the most uncertain images. This step is achieved by a novel clustering-based strategy. The proposed AL method for driving the RF contributes to mitigate problems of unbalanced and biased set of relevant and irrelevant images. Experimental results show the effectiveness of the proposed AL method.

[3]

Aptoula

Remote sensing image retrieval with global morphological texture descriptors

[J]. IEEE Transactions on Geoscience and Remote Sensing, 2014,52(5):3023-3034.

DOI:10.1109/TGRS.2013.2268736 URL [本文引用: 3]

In this paper, we present the results of applying global morphological texture descriptors to the problem of content-based remote sensing image retrieval. Specifically, we explore the potential of recently developed multiscale texture descriptors, namely, the circular covariance histogram and the rotation-invariant point triplets. Moreover, we introduce a couple of new descriptors, exploiting the Fourier power spectrum of the quasi-flat-zone-based scale space of their input. The descriptors are evaluated with the UC Merced Land Use-Land Cover data set, which has been only recently made public. The proposed approach is shown to outperform the best known retrieval scores, despite its shorter feature vector length, thus asserting the practical interest of global content descriptors as well as of mathematical morphology in this context.

[4]

陆丽珍, 刘仁义, 刘南 .

一种融合颜色和纹理特征的遥感图像检索方法

[J]. 中国图象图形学报, 2004,9(3):328-333.

DOI:10.3969/j.issn.1006-8961.2004.03.013 URL Magsci [本文引用: 1]

海量遥感图像的自动查询和选择，迫切需要有效的基于内容的图像检索方法。鉴于单一视觉特征不能很好地表达图像内容，为此提出一种基于五叉树分解的线性加权颜色和纹理特征距离的检索新方法。该方法首先采用五叉树分解法分解图像，然后在利用多通道Gabor滤波器与图像做卷积得到滤波能量值的基础上，提取各子图像滤波能量纹理特征，最后通过计算子图像的颜色均值和均方差来对查询图像和与其大小相当的数据库子图像进行颜色和纹理特征线性加权距离相似性测度。将该方法用于高分辨率卫星和航空遥感图像数据库检索的实验结果证明，该方法是有效的。

Lu L

, Liu R

, Liu

Remote sensing image retrieval using color and texture fused features

[J]. Journal of Image and Graphics, 2004,9(3):328-333.

Magsci [本文引用: 1]

[5]

Yang

, Newsam

Geographic image retrieval using local invariant features

[J]. IEEE Transactions on Geoscience and Remote Sensing, 2013,51(2):818-832.

DOI:10.1109/TGRS.2012.2205158 URL [本文引用: 5]

This paper investigates local invariant features for geographic (overhead) image retrieval. Local features are particularly well suited for the newer generations of aerial and satellite imagery whose increased spatial resolution, often just tens of centimeters per pixel, allows a greater range of objects and spatial patterns to be recognized than ever before. Local invariant features have been successfully applied to a broad range of computer vision problems and, as such, are receiving increased attention from the remote sensing community particularly for challenging tasks such as detection and classification. We perform an extensive evaluation of local invariant features for image retrieval of land-use/land-cover (LULC) classes in high-resolution aerial imagery. We report on the effects of a number of design parameters on a bag-of-visual-words (BOVW) representation including saliency-versus grid-based local feature extraction, the size of the visual codebook, the clustering algorithm used to create the codebook, and the dissimilarity measure used to compare the BOVW representations. We also perform comparisons with standard features such as color and texture. The performance is quantitatively evaluated using a first-of-its-kind LULC ground truth data set which will be made publicly available to other researchers. In addition to reporting on the effects of the core design parameters, we also describe interesting findings such as the performance-efficiency tradeoffs that are possible through the appropriate pairings of different-sized codebooks and dissimilarity measures. While the focus is on image retrieval, we expect our insights to be informative for other applications such as detection and classification.

[6]

Du Z

, Li X

, Lu X

Local structure learning in high resolution remote sensing image retrieval

[J].Neurocomputing, 2016(207):813-822.

DOI:10.1016/j.neucom.2016.05.061 URL [本文引用: 3]

High resolution remote sensing image captured by the satellites or the aircraft is of great help for military and civilian applications. In recent years, with an increasing amount of high resolution remote sensing images, it becomes more and more urgent to find a way to retrieve them. In this case, a few methods based on the statistical information of the local features are proposed, which have achieved good performances. However, most of the methods do not take the topological structure of the features into account. In this paper, we propose a new method to represent these images, by taking the structural information into consideration. The main contributions of this paper include: (1) mapping the features into a manifold space by a Lipschitz smooth function to enhance the representation ability of the features; (2) training an anchor set with several regularization constrains to get the intrinsic manifold structure. In the experiments, the method is applied to two challenging remote sensing image datasets: UC Merced land use dataset and Sydney dataset. Compared to the state-of-the-art approaches, the proposed method can achieve a more robust and commendable performance.

[7]

Babenko

, Slesarev

, Chigorin

, et al.

Neural codes for image retrieval [C]//Proceedings of European Conference on Computer Vision

.Springer, 2014: 584-599.

[本文引用: 2]

[8]

Ng J

, Yang

, Davis L

Exploiting local features from deep networks for image [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition workshops

.IEEE, 2015: 53-61.

[本文引用: 4]

[9]

Babenko

, Lempitsky

Aggregating deep convolutional features for image retrieval [C]//Proceedings of IEEE International Conference on Computer Vision

.IEEE, 2015: 1269-1277.

[本文引用: 3]

[10]

周飞燕, 金林鹏, 董军 .

卷积神经网络研究综述

[J]. 计算机学报, 2017,40(6):1229-1251.

DOI:10.11897/SP.J.1016.2017.01229 URL [本文引用: 1]

作为一个十余年来快速发展的崭新领域,深度学习受到了越来越多研究者的关注,它在特征提取和建模上都有着相较于浅层模型显然的优势.深度学习善于从原始输入数据中挖掘越来越抽象的特征表示,而这些表示具有良好的泛化能力.它克服了过去人工智能中被认为难以解决的一些问题.且随着训练数据集数量的显著增长以及芯片处理能力的剧增,它在目标检测和计算机视觉、自然语言处理、语音识别和语义分析等领域成效卓然,因此也促进了人工智能的发展.深度学习是包含多级非线性变换的层级机器学习方法,深层神经网络是目前的主要形式,其神经元间的连接模式受启发于动物视觉皮层组织,而卷积神经网络则是其中一种经典而广泛应用的结构.卷积神经网络的局部连接、权值共享及池化操作等特性使之可以有效地降低网络的复杂度,减少训练参数的数目,使模型对平移、扭曲、缩放具有一定程度的不变性,并具有强鲁棒性和容错能力,且也易于训练和优化.基于这些优越的特性,它在各种信号和信息处理任务中的性能优于标准的全连接神经网络.该文首先概述了卷积神经网络的发展历史,然后分别描述了神经元模型、多层感知器的结构.接着,详细分析了卷积神经网络的结构,包括卷积层、池化层、全连接层,它们发挥着不同的作用.然后,讨论了网中网模型、空间变换网络等改进的卷积神经网络.同时,还分别介绍了卷积神经网络的监督学习、无监督学习训练方法以及一些常用的开源工具.此外,该文以图像分类、人脸识别、音频检索、心电图分类及目标检测等为例,对卷积神经网络的应用作了归纳.卷积神经网络与递归神经网络的集成是一个途径.为了给读者以尽可能多的借鉴,该文还设计并试验了不同参数及不同深度的卷积神经网络来分析各参数间的相互关系及不同参数设置对结果的影响.最后,给出了卷积神经网络及其应用中待解决的若干问题.

Zhou F

, Jin L

, Dong

Review of convolutional neural network

[J]. Chinese Journal of Computers, 2017,40(6):1229-1251.

[本文引用: 1]

[11]

张洪群, 刘雪莹, 杨森 , 等.

深度学习的半监督遥感图像检索

[J]. 遥感学报, 2017,21(3):406-414.

DOI:10.11834/jrs.20176105 URL [本文引用: 1]

遥感图像数据的海量性、多样性和复杂性等特点对遥感图像检索的速度和精度提出了更高的要求,其中特征提取是影响遥感图像检索效果的关键。本文方法首先对遥感图像进行预处理,然后基于稀疏自动编码的方法在大量未标注的遥感图像上进行特征学习得到特征字典,基于卷积神经网络的思想,使用训练出来的特征字典对遥感图像进行卷积和池化得到每幅图像的特征图;接下来使用特征图训练Softmax分类器;最后对待检索图像分类,在同一类别中计算特征间的距离,进而实现遥感图像的检索。实验结果表明,该方法能够有效提高遥感图像检索的速度和准确度。

Zhang H

, Liu X

, Yang

, et al.

Retrieval of remote sensing images based on semisupervised deep learning

[J].Journal of Remote Sensing, 21(3):406-414.

[本文引用: 1]

[12]

Napoletano

Visual descriptors for content-based retrieval of remote sensing images

[J]. International Journal of Remote Sensing, 2018,39(5):1343-1376.

DOI:10.1080/01431161.2017.1399472 URL [本文引用: 6]

In this paper we present an extensive evaluation of visual descriptors for the content-based retrieval of remote sensing (RS) images. The evaluation includes global hand-crafted, local hand-crafted, and Convolutional Neural Network (CNNs) features coupled with four different Content-Based Image Retrieval schemes. We conducted all the experiments on two publicly available datasets: the 21-class UC Merced Land Use/Land Cover (LandUse) dataset and 19-class High-resolution Satellite Scene dataset (SceneSat). The content of RS images might be quite heterogeneous, ranging from images containing fine grained textures, to coarse grained ones or to images containing objects. It is therefore not obvious in this domain, which descriptor should be employed to describe images having such a variability. Results demonstrate that CNN-based features perform better than both global and and local hand-crafted features whatever is the retrieval scheme adopted. Features extracted from SatResNet-50, a residual CNN suitable fine-tuned on the RS domain, shows much better performance than a residual CNN pre-trained on multimedia scene and object images. Features extracted from NetVLAD, a CNN that considers both CNN and local features, works better than others CNN solutions on those images that contain fine-grained textures and objects.

[13]

Zhou W

, Newsam

, Li

, et al.

Learning low dimensional convolutional neural networks for high-resolution remote sensing image retrieval

[J]. Remote Sensing, 2017,9(5):489.

DOI:10.3390/rs9050489 URL [本文引用: 10]

Learning powerful feature representations for image retrieval has always been a challenging task in the field of remote sensing. Traditional methods focus on extracting low-level hand-crafted features which are not only time-consuming but also tend to achieve unsatisfactory performance due to the content complexity of remote sensing images. In this paper, we investigate how to extract deep feature representations based on convolutional neural networks (CNN) for high-resolution remote sensing image retrieval (HRRSIR). To this end, two effective schemes are proposed to generate powerful feature representations for HRRSIR. In the first scheme, the deep features are extracted from the fully-connected and convolutional layers of the pre-trained CNN models, respectively; in the second scheme, we propose a novel CNN architecture based on conventional convolution layers and a three-layer perceptron. The novel CNN model is then trained on a large remote sensing dataset to learn low dimensional features. The two schemes are evaluated on several public and challenging datasets, and the results indicate that the proposed schemes and in particular the novel CNN are able to achieve state-of-the-art performance.

[14]

, Tong X

, Xia G

, et al.

Delving into deep representations for remote sensing image retrieval [C]//Proceedings of IEEE International Conference on Signal Processing

.IEEE, 2016: 198-203.

[本文引用: 5]

[15]

Simonyan

,Zisserman

Very deep convolutional networks for large-scale image recognition

[EB/OL]..

URL [本文引用: 1]

[16]

Szegedy

, Liu

, Jia Y

, et al.

Going deeper with convolutions [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

.IEEE, 2015: 1-9.

[本文引用: 1]

[17]

, Xia G

, Hu J

, et al.

Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery

[J]. Remote Sensing, 2015,7(11):14680-14707.

DOI:10.3390/rs71114680 URL [本文引用: 1]

Learning efficient image representations is at the core of the scene classification task of remote sensing imagery. The existing methods for solving the scene classification task, based on either feature coding approaches with low-level hand-engineered features or unsupervised feature learning, can only generate mid-level image features with limited representative ability, which essentially prevents them from achieving better performance. Recently, the deep convolutional neural networks (CNNs), which are hierarchical architectures trained on large-scale datasets, have shown astounding performance in object recognition and detection. However, it is still not clear how to use these deep convolutional neural networks for high-resolution remote sensing (HRRS) scene classification. In this paper, we investigate how to transfer features from these successfully pre-trained CNNs for HRRS scene classification. We propose two scenarios for generating image features via extracting CNN features from different layers. In the first scenario, the activation vectors extracted from fully-connected layers are regarded as the final image features; in the second scenario, we extract dense features from the last convolutional layer at multiple scales and then encode the dense features into global image features through commonly used feature coding approaches. Extensive experiments on two public scene classification datasets demonstrate that the image features obtained by the two proposed scenarios, even with a simple linear classifier, can result in remarkable performance and improve the state-of-the-art by a significant margin. The results reveal that the features from pre-trained CNNs generalize well to HRRS datasets and are more expressive than the low- and mid-level features. Moreover, we tentatively combine features extracted from different CNN models for better performance.

[18] Vedaldi, Lenc. MatConvNet: convolutional neural networks for MATLAB [C]//Proceedings of 23rd ACM International Conference on Multimedia. ACM, 2015: 689-692.

Remote sensing image retrieval based on feature selection and semi-supervised learning

2011

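... Early CBRSIR relied mainly on low-level image features^[1] for retrieval, but low-level features struggle to express the high-level semantic content of an image, which creates a severe "semantic gap"^[2,3]. Three main approaches have been used to narrow this gap: (1) relevance feedback^[2], which depends on the sample examples labelled during feedback; (2) fusion of multiple features^[4], which combines the strengths of different features to describe an image more completely; (3) feature aggregation, which builds a more abstract, higher-level feature on top of local features. For example, bag of visual words (BoVW)^[5] is an aggregated feature obtained by K-means clustering of scale-invariant feature transform (SIFT) features, and local structure learning (LSL)^[6] is an aggregated feature obtained from local features combined with graph regularization. Aggregated features reduce redundant information, lower the feature dimension and improve representational power, thereby narrowing the semantic gap. ...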
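The BoVW aggregation mentioned above (K-means clustering of local descriptors, followed by a histogram of nearest visual words) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the deterministic initialisation from the first `k` descriptors, the fixed iteration count and the L1 normalisation are all assumptions made for the example.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, descriptor):
    """Index of the visual word closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(codebook[k], descriptor))


def kmeans(descriptors, k, iterations=10):
    """Plain Lloyd's K-means; initial centres are the first k descriptors (assumption)."""
    if not 1 <= k <= len(descriptors):
        raise ValueError("k must be between 1 and the number of descriptors")
    centers = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(centers, d)].append(d)
        for idx, members in enumerate(groups):
            if members:  # keep the old centre if a cluster is empty
                centers[idx] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def bovw_histogram(codebook, descriptors):
    """L1-normalised histogram of visual-word assignments for one image."""
    counts = [0] * len(codebook)
    for d in descriptors:
        counts[nearest(codebook, d)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Toy data: two well-separated groups of 2-D "local descriptors".
training = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0],
            [10.0, 11.0], [1.0, 0.0], [11.0, 10.0]]
codebook = kmeans(training, k=2)
image_descriptors = [[0.0, 0.5], [9.0, 9.0], [10.0, 10.0], [11.0, 11.0]]
print(bovw_histogram(codebook, image_descriptors))  # [0.25, 0.75]
```

The histogram length equals the codebook size, which is why an aggregated feature of this kind can be much lower-dimensional than the set of local descriptors it summarises.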

A novel active learning method in relevance feedback for content-based remote sensing image retrieval

2015

Remote sensing image retrieval with global morphological texture descriptors

2014

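... Table 8 compares the ANMRR values and dimensions of shallow features and CNN features. The shallow features are the global morphological texture descriptors proposed by Aptoula^[3], BoVW built on hand-crafted SIFT features^[5], and the recently proposed LSL^[6]. The CNN features are those proposed in [12-14] together with the VGG16-P, GoogLeNet-P, VGG16-B and GoogLeNet-B features proposed in this paper. Because most of the other features were evaluated on UC-Merced, the comparison in Table 8 is based on UC-Merced. ...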

Tab.8 ANMRR and dimensions for different features

Category	Feature	ANMRR	Dimension
Shallow features	Aptoula^[3]	0.5750	62
Shallow features	BoVW^[5]	0.5910	15,000
Shallow features	BoVW^[5]	0.6010	150
Shallow features	LSL^[6]	0.5556	2,048
CNN features	VGGM-fc^[12]	0.3780	4,096
CNN features	VGGM-fc-RF^[12]	0.3160	4,096
CNN features	VGG16-fc^[13]	0.3940	4,096
CNN features	VGGM-conv5-IFK^[13]	0.4580	102
CNN features	VGG16-conv5-IFK^[13]	0.4070	102
CNN features	LDCNN^[13]	0.4390	30
CNN features	GoogLeNet(FT)+MultiPatch^[14]	0.3140	1,024
CNN features	VGG16-P	0.3243	8,192
CNN features	GoogLeNet-P	0.3179	7,488
CNN features	VGG16-B	0.3886	150
CNN features	GoogLeNet-B	0.3759	100

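Table 8 shows that CNN features generally outperform shallow features. Compared with BoVW (ANMRR 0.5910), the best pooling feature (GoogLeNet-P, 0.3179) and the best BoVW feature built on CNN features (GoogLeNet-B, 0.3759) lower the ANMRR by 27.31 and 21.51 percentage points, respectively. ...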
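The reported reductions can be checked against the ANMRR values in Table 8. The short sketch below assumes, as the numbers suggest, that the percentages are absolute differences in ANMRR (percentage points) relative to the 15,000-dimensional BoVW baseline; the helper name `reduction_points` is hypothetical.

```python
# ANMRR values copied from Table 8 (UC-Merced); lower is better.
anmrr = {
    "BoVW": 0.5910,         # 15,000-dimensional hand-crafted baseline
    "VGG16-P": 0.3243,
    "GoogLeNet-P": 0.3179,
    "VGG16-B": 0.3886,
    "GoogLeNet-B": 0.3759,
}


def reduction_points(baseline, value):
    """Absolute ANMRR reduction expressed in percentage points."""
    return round((baseline - value) * 100, 2)


drops = {name: reduction_points(anmrr["BoVW"], value)
         for name, value in anmrr.items() if name != "BoVW"}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f} points lower than BoVW")
```

Under this reading, 27.31 corresponds to GoogLeNet-P and 21.51 to GoogLeNet-B. As a relative change, the GoogLeNet-P reduction would be about 46% of the baseline value, so the reported figures are best read as absolute differences.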

A remote sensing image retrieval method fusing color and texture features

2004

Geographic image retrieval using local invariant features

2013

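... The output differs with the input image size, so the input size has a considerable effect on retrieval performance. Two sizes are considered: (1) the default CNN input size, i.e. the image after resizing; the default for VGG16 and GoogLeNet is 224 pixels × 224 pixels (all image sizes in this paper are in pixels, and the unit is omitted below for brevity); (2) the original image size in the dataset. UC-Merced^[5] and WHU-RS^[17] are two commonly used HRRS datasets: 256×256 is the original size of UC-Merced images, which is close to the default size, while 600×600 is the original size of WHU-RS images, which is far from the default size, so the two datasets together allow an effective comparison of how input size affects retrieval performance. Tables 1 and 2 list the outputs of the relevant layers for different input sizes. Taking pool5 of VGG16 as an example, for an input of 224×224×3 (3 being the R, G and B channels), the output of pool5 is 7×7×512, i.e. 512 channels, each with a 7×7 feature map. ...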
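The dependence of the pool5 output on the input size can be illustrated with a small calculation. The sketch assumes the standard VGG16 layout (3×3 convolutions with padding 1 that preserve the spatial size, and five 2×2 max-pooling layers with stride 2) and floor rounding at each pooling step; the function name is hypothetical, and the value for 600×600 is what this assumption yields rather than a figure taken from Tables 1 and 2 of the paper.

```python
def vgg16_pool5_shape(height, width, channels=512, num_pools=5):
    """Spatial size after five stride-2 poolings, assuming floor rounding."""
    for _ in range(num_pools):
        height //= 2  # each 2x2, stride-2 pooling halves the height
        width //= 2   # and the width
    return height, width, channels


for size in (224, 256, 600):
    print(size, "->", vgg16_pool5_shape(size, size))
# 224 -> (7, 7, 512)
# 256 -> (8, 8, 512)
# 600 -> (18, 18, 512)
```

A larger input therefore yields more spatial positions per channel, and hence more local feature vectors to aggregate.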

Local structure learning in high resolution remote sensing image retrieval

2016

Neural codes for image retrieval [C]//Proceedings of European Conference on Computer Vision

2014

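... Traditional aggregated features are all built on hand-crafted features, but hand-crafted features have limited ability to represent images and are easily affected by human factors. The now-popular deep convolutional neural network (CNN) learns image features automatically, which reduces human interference, and is widely used in image classification, retrieval and object recognition^[7-11]. CNNs trained on large-scale datasets (such as ImageNet) generalize well and can be transferred effectively to other small-scale datasets. In CNN transfer learning, the outputs of the fully connected layers attracted attention first^[7]; convolutional-layer features, which represent local image information, have since received growing attention^[8] and are usually built into aggregated features by encoding^[8] or pooling^[9]. ...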

Exploiting local features from deep networks for image [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition workshops

2015

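... According to [8], the CNN feature $f^{l}$ above can be understood as providing, at each position $(i,j)$ of the feature map, a $c^{l}$-dimensional feature vector $f_{i,j}^{l}$, that is ...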
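To make this view concrete, the sketch below reads a feature stored channel by channel as one local vector per spatial position. The storage layout `feature[k][i][j]` and the function name are assumptions made for illustration only.

```python
def local_vectors(feature):
    """Map each position (i, j) to its c-dimensional vector.

    `feature[k][i][j]` is the activation of channel k at row i, column j
    (an assumed layout), so the result has height * width entries of length c.
    """
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return {
        (i, j): [feature[k][i][j] for k in range(channels)]
        for i in range(height)
        for j in range(width)
    }


# Two channels on a 2x2 feature map.
feature = [
    [[1, 2], [3, 4]],   # channel 0
    [[5, 6], [7, 8]],   # channel 1
]
vectors = local_vectors(feature)
print(vectors[(0, 1)])  # [2, 6]
print(len(vectors))     # 4
```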

Aggregating deep convolutional features for image retrieval [C]//Proceedings of IEEE International Conference on Computer Vision

2015

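... 1) Pooling features. The commonly used average pooling method sets the pooling region equal to the feature map size^[9]. HRRS images are rich in content, however, so pooling over the whole feature map may lose important information. An average pooling method with pooling regions of different sizes is therefore proposed to obtain more effective features. ...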
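A minimal sketch of average pooling with a region smaller than the feature map is given below. The square window, the stride of 1 and the function names are assumptions; the excerpt does not state the stride or how the pooled values are ordered in the final feature.

```python
def average_pool(feature_map, region, stride=1):
    """Average-pool one channel (a list of equal-length rows) with a square window."""
    height, width = len(feature_map), len(feature_map[0])
    if not 1 <= region <= min(height, width):
        raise ValueError("region must fit inside the feature map")
    pooled = []
    for top in range(0, height - region + 1, stride):
        row = []
        for left in range(0, width - region + 1, stride):
            window = [feature_map[top + r][left + c]
                      for r in range(region) for c in range(region)]
            row.append(sum(window) / (region * region))
        pooled.append(row)
    return pooled


def pooled_feature(channels, region, stride=1):
    """Concatenate the pooled maps of all channels into one vector."""
    vector = []
    for channel in channels:
        for row in average_pool(channel, region, stride):
            vector.extend(row)
    return vector


# One 7x7 channel with values 0..48.
channel = [[float(r * 7 + c) for c in range(7)] for r in range(7)]
print(average_pool(channel, region=7))          # [[24.0]]  (conventional global pooling)
print(len(average_pool(channel, region=5)))     # 3 rows: a 3x3 map is kept per channel
print(len(pooled_feature([channel] * 4, 4)))    # 64 values = 4 channels x 4 x 4
```

With the region equal to the feature map, each channel collapses to a single value; a smaller region keeps several averaged values per channel, which retains coarse spatial layout at the cost of a higher feature dimension.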

A survey of convolutional neural networks

2017

Semi-supervised remote sensing image retrieval using deep learning

2017

Visual descriptors for content-based retrieval of remote sensing images

2018

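... In remote sensing image retrieval, the publicly available remote sensing datasets are small and CNN parameters cannot be trained adequately on them, so research has focused on transferring CNNs to HRRS images for retrieval^[12,13,14]. Napoletano^[12] used features from the fully connected layers of a CNN for retrieval. Zhou et al.^[13] and Hu et al.^[14] compared fully connected layer features with aggregated features based on convolutional-layer outputs, and fine-tuned the CNN. Zhou et al.^[13] also proposed a low-dimensional feature (low dimensional CNN, LDCNN), but its performance depends closely on the dataset. Hu et al.^[14] proposed multi-scale concatenation for convolutional-layer features and multi-patch average pooling for fully connected layer features, but to extract the features of a single image these methods must feed multiple inputs to the CNN again, which makes feature extraction relatively complex. ...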

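... Among the CNN features, VGGM-fc^[12] and VGGM-fc-RF^[12] are the VGGM fully connected layer feature and the same feature with feedback information added; VGG16-fc^[13] is the VGG16 fully connected layer feature; VGGM-conv5-IFK^[13] and VGG16-conv5-IFK^[13] are features obtained by encoding the convolutional layers of VGGM and VGG16 with the improved Fisher kernel (IFK); and GoogLeNet(FT)+MultiPatch^[14] is the result of averaging the fine-tuned GoogLeNet feature over multiple patches. ...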

Learning low dimensional convolutional neural networks for high-resolution remote sensing image retrieval

2017

Delving into deep representations for remote sensing image retrieval [C]//Proceedings of IEEE International Conference on Signal Processing

2016

Very deep convolutional networks for large-scale image recognition

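... To aggregate CNN features, the 16-layer VGG16 network^[15] and the 22-layer GoogLeNet network^[16] are used. VGG16 increases network depth by adding convolutional layers, while GoogLeNet uses inception modules to increase both the depth and the width of the network. After the many levels of abstraction in the earlier layers of VGG16 and GoogLeNet, the later convolutional layers therefore not only capture more local information but also generalize better. The VGG16 CNN features are the outputs of the last convolutional layer (conv5-3), the activation layer (relu5-3) and the pooling layer (pool5); the GoogLeNet CNN features are the outputs of the second-to-last pooling layer (pool4) and the last two inception layers (incep5a and incep5b). ...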

Going deeper with convolutions [C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition

2015

Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery

2015

MatConvNet:convolutional neural networks for MATLAB [C]//Proceedings of 23rd ACM International Conference on Multimedia

2015

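... The experiments use MatConvNet^[18] to obtain the VGG16 and GoogLeNet network models. Both were pre-trained on ILSVRC2012, a subset of ImageNet that covers 1,000 image classes with about 1.3 million training images, 50,000 validation images and 100,000 test images. The remote sensing datasets are UC-Merced and WHU-RS. UC-Merced consists of aerial orthoimagery collected from the United States Geological Survey, with 21 scene classes of 100 images each and an image size of 256×256; WHU-RS consists of 19 scene classes downloaded from Google Earth, with 50 images per class and an image size of 600×600. Table 3 shows sample images from the two datasets. ...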
