Table of Contents
28. Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [PDF] Abstract
29. Pacemaker: Intermediate Teacher Knowledge Distillation For On-The-Fly Convolutional Neural Network [PDF] Abstract
32. ROSE: Real One-Stage Effort to Detect the Fingerprint Singular Point Based on Multi-scale Spatial Attention [PDF] Abstract
33. FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution [PDF] Abstract
39. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches [PDF] Abstract
40. No Surprises: Training Robust Lung Nodule Detection for Low-Dose CT Scans by Augmenting with Adversarial Attacks [PDF] Abstract
41. PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models [PDF] Abstract
44. Rectifying Pseudo Label Learning via Uncertainty Estimation for Domain Adaptive Semantic Segmentation [PDF] Abstract
45. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [PDF] Abstract
62. DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite images [PDF] Abstract
67. TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation [PDF] Abstract
74. Super Resolution Using Segmentation-Prior Self-Attention Generative Adversarial Network [PDF] Abstract
82. MLography: An Automated Quantitative Metallography Model for Impurities Anomaly Detection using Novel Data Mining and Deep Learning Approach [PDF] Abstract
84. Hybrid calibration procedure for fringe projection profilometry based on stereo-vision and polynomial fitting [PDF] Abstract
89. Online Self-Supervised Learning for Object Picking: Detecting Optimum Grasping Position using a Metric Learning Approach [PDF] Abstract
90. AL2: Progressive Activation Loss for Learning General Representations in Classification Neural Networks [PDF] Abstract
92. Diffusion State Distances: Multitemporal Analysis, Fast Algorithms, and Applications to Biological Networks [PDF] Abstract
93. STD-Net: Structure-preserving and Topology-adaptive Deformation Network for 3D Reconstruction from a Single Image [PDF] Abstract
94. Novel Radiomic Feature for Survival Prediction of Lung Cancer Patients using Low-Dose CBCT Images [PDF] Abstract
96. SuPer Deep: A Surgical Perception Framework for Robotic Tissue Manipulation using Deep Learning for Feature Extraction [PDF] Abstract
Abstracts
1. Multi-modal Self-Supervision from Generalized Data Transformations [PDF] Back to Contents
Mandela Patrick, Yuki M. Asano, Ruth Fong, João F. Henriques, Geoffrey Zweig, Andrea Vedaldi
Abstract: Self-supervised learning has advanced rapidly, with several results beating supervised models for pre-training feature representations. While the focus of most of these works has been new loss functions or tasks, little attention has been given to the data transformations that build the foundation of learning representations with desirable invariances. In this work, we introduce a framework for multi-modal data transformations that preserve semantics and induce the learning of high-level representations across modalities. We do this by combining two steps: inter-modality slicing, and intra-modality augmentation. Using a contrastive loss as the training task, we show that choosing the right transformations is key and that our method yields state-of-the-art results on downstream video and audio classification tasks such as HMDB51, UCF101 and DCASE2014 with Kinetics-400 pretraining.
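To make the contrastive training task concrete, here is a minimal PyTorch sketch of a symmetric cross-modal InfoNCE loss of the kind the abstract describes; the function name, batch layout, and temperature value are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of a cross-modal contrastive objective: given batches of
# video and audio embeddings produced from transformed versions of the same
# clips, matching (video, audio) pairs are positives and all other pairs in
# the batch are negatives.
import torch
import torch.nn.functional as F

def cross_modal_info_nce(video_emb, audio_emb, temperature=0.07):
    """video_emb, audio_emb: (batch, dim) tensors; row i of each comes from clip i."""
    v = F.normalize(video_emb, dim=1)
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric loss: match video -> audio and audio -> video.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = cross_modal_info_nce(torch.randn(8, 128), torch.randn(8, 128))
```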
2. Improved Baselines with Momentum Contrastive Learning [PDF] Back to Contents
Xinlei Chen, Haoqi Fan, Ross Girshick, Kaiming He
Abstract: Contrastive unsupervised learning has recently shown encouraging progress, e.g., in Momentum Contrast (MoCo) and SimCLR. In this note, we verify the effectiveness of two of SimCLR's design improvements by implementing them in the MoCo framework. With simple modifications to MoCo---namely, using an MLP projection head and more data augmentation---we establish stronger baselines that outperform SimCLR and do not require large training batches. We hope this will make state-of-the-art unsupervised learning research more accessible. Code will be made public.
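The two SimCLR-derived modifications are simple to express in PyTorch; the sketch below shows an MLP projection head and a stronger augmentation pipeline, with layer widths and augmentation parameters as assumptions rather than the released configuration.

```python
# A minimal sketch (not the released code) of the two SimCLR-style changes the
# note ports into MoCo: a 2-layer MLP projection head in place of a single fc
# layer, plus a stronger augmentation pipeline.
import torch.nn as nn
from torchvision import transforms

def mlp_projection_head(in_dim=2048, hidden_dim=2048, out_dim=128):
    return nn.Sequential(nn.Linear(in_dim, hidden_dim),
                         nn.ReLU(inplace=True),
                         nn.Linear(hidden_dim, out_dim))

# Stronger augmentation; the blur kernel and jitter strengths here are
# illustrative assumptions.
augment = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.2, 1.0)),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.RandomApply([transforms.GaussianBlur(23, sigma=(0.1, 2.0))], p=0.5),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
```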
3. Knowledge distillation via adaptive instance normalization [PDF] Back to Contents
Jing Yang, Brais Martinez, Adrian Bulat, Georgios Tzimiropoulos
Abstract: This paper addresses the problem of model compression via knowledge distillation. To this end, we propose a new knowledge distillation method based on transferring feature statistics, specifically the channel-wise mean and variance, from the teacher to the student. Our method goes beyond the standard way of enforcing the mean and variance of the student to be similar to those of the teacher through an $L_2$ loss, which we found to be of limited effectiveness. Specifically, we propose a new loss based on adaptive instance normalization to effectively transfer the feature statistics. The main idea is to transfer the learned statistics back to the teacher via adaptive instance normalization (conditioned on the student) and let the teacher network "evaluate" via a loss whether the statistics learned by the student are reliably transferred. We show that our distillation method outperforms other state-of-the-art distillation methods over a large set of experimental settings including different (a) network architectures, (b) teacher-student capacities, (c) datasets, and (d) domains.
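The channel-statistics transfer at the heart of the method can be illustrated with a standard adaptive instance normalization step. This sketch shows the AdaIN operation only (the paper's full loss, which routes the result back through the teacher, is not reproduced), and the function signature is our assumption.

```python
# Hedged sketch of the AdaIN step: re-normalize the teacher's feature map so
# it carries the student's channel-wise mean/std while keeping the teacher's
# spatial content.
import torch

def adain(teacher_feat, student_feat, eps=1e-5):
    """Both features: (batch, channels, H, W)."""
    t_mean = teacher_feat.mean(dim=(2, 3), keepdim=True)
    t_std = teacher_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = student_feat.mean(dim=(2, 3), keepdim=True)
    s_std = student_feat.std(dim=(2, 3), keepdim=True) + eps
    return s_std * (teacher_feat - t_mean) / t_std + s_mean
```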
4. Restore from Restored: Video Restoration with Pseudo Clean Video [PDF] Back to Contents
Seunghwan Lee, Seobin Park, Donghyeon Cho, Jiwon Kim, Tae Hyun Kim
Abstract: In this paper, we propose a self-supervised video denoising method called "restore-from-restored" that fine-tunes a baseline network by using a pseudo clean video at the test phase. The pseudo clean video can be obtained by applying an input noisy video to the pre-trained baseline network. By adopting a fully convolutional network (FCN) as the baseline, we can restore videos without accurate optical flow and registration due to its translation-invariant property unlike many conventional video restoration methods. Moreover, the proposed method can take advantage of the existence of many similar patches across consecutive frames (i.e., patch-recurrence), which can boost performance of the baseline network by a large margin. We analyze the restoration performance of the FCN fine-tuned with the proposed self-supervision-based training algorithm, and demonstrate that FCN can utilize recurring patches without the need for registration among adjacent frames. The proposed method can be applied to any FCN-based denoising models. In our experiments, we apply the proposed method to the state-of-the-art denoisers, and our results indicate a considerable improvement in task performance.
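Our reading of the test-time procedure suggests a loop of the following shape; this is a speculative sketch based only on the abstract (the exact fine-tuning target and schedule may differ), with all names and hyperparameters hypothetical.

```python
# Illustrative test-time fine-tuning for "restore-from-restored": the
# pre-trained denoiser's own output on the noisy input serves as a pseudo
# clean target for a few adaptation steps.
import torch

def restore_from_restored(model, noisy_video, steps=10, lr=1e-4):
    """noisy_video: (frames, C, H, W). Returns the fine-tuned model's output."""
    with torch.no_grad():
        pseudo_clean = model(noisy_video)          # initial restoration
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.nn.functional.l1_loss(model(noisy_video), pseudo_clean)
        loss.backward()
        opt.step()
    with torch.no_grad():
        return model(noisy_video)
```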
5. Cascaded Human-Object Interaction Recognition [PDF] Back to Contents
Tianfei Zhou, Wenguan Wang, Siyuan Qi, Haibin Ling, Jianbing Shen
Abstract: Rapid progress has been witnessed for human-object interaction (HOI) recognition, but most existing models are confined to single-stage reasoning pipelines. Considering the intrinsic complexity of the task, we introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. Each of the two networks is also connected to its predecessor at the previous stage, enabling cross-stage information propagation. The interaction recognition network has two crucial parts: a relation ranking module for high-quality HOI proposal selection and a triple-stream classifier for relation prediction. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding. Further beyond relation detection on a bounding-box level, we make our framework flexible to perform fine-grained pixel-wise relation segmentation; this provides a new glimpse into better relation modeling. Our approach reached the $1^{st}$ place in the ICCV2019 Person in Context Challenge, on both relation detection and segmentation tasks. It also shows promising results on V-COCO.
6. SOIC: Semantic Online Initialization and Calibration for LiDAR and Camera [PDF] Back to Contents
Weimin Wang, Shohei Nobuhara, Ryosuke Nakamura, Ken Sakurada
Abstract: This paper presents a novel semantic-based online extrinsic calibration approach, SOIC (so, I see), for Light Detection and Ranging (LiDAR) and camera sensors. Previous online calibration methods usually need prior knowledge of rough initial values for optimization. The proposed approach removes this limitation by converting the initialization problem to a Perspective-n-Point (PnP) problem with the introduction of semantic centroids (SCs). The closed-form solution of this PnP problem has been well researched and can be found with existing PnP methods. Since the semantic centroid of the point cloud usually does not accurately match with that of the corresponding image, the accuracy of the parameters is not improved even after a nonlinear refinement process. Thus, a cost function based on the constraint of the correspondence between semantic elements from both point cloud and image data is formulated. Subsequently, optimal extrinsic parameters are estimated by minimizing the cost function. We evaluate the proposed method either with GT or predicted semantics on KITTI dataset. Experimental results and comparisons with the baseline method verify the feasibility of the initialization strategy and the accuracy of the calibration approach. In addition, we release the source code at this https URL.
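Once semantic centroids have been extracted and matched, the initialization step reduces to a standard PnP solve. A minimal OpenCV sketch, assuming centroid extraction and matching happen elsewhere and using an EPnP solver as a stand-in for the closed-form solution the abstract mentions.

```python
# Sketch of the PnP-based initialization: matched semantic centroids
# (3D from the point cloud, 2D from the image) serve as correspondences.
import numpy as np
import cv2

def initialize_extrinsics(pc_centroids, img_centroids, K):
    """pc_centroids: (N, 3), img_centroids: (N, 2), K: 3x3 intrinsics.
    Needs N >= 4 correspondences for EPnP."""
    ok, rvec, tvec = cv2.solvePnP(
        pc_centroids.astype(np.float64),
        img_centroids.astype(np.float64),
        K, None, flags=cv2.SOLVEPNP_EPNP)
    R, _ = cv2.Rodrigues(rvec)           # rotation vector -> rotation matrix
    return ok, R, tvec
```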
7. Motion-Attentive Transition for Zero-Shot Video Object Segmentation [PDF] Back to Contents
Tianfei Zhou, Shunzhou Wang, Yi Zhou, Yazhou Yao, Jianwu Li, Ling Shao
Abstract: In this paper, we present a novel Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation, which provides a new way of leveraging motion information to reinforce spatio-temporal object representation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder, which transforms appearance features into motion-attentive representations at each convolutional stage. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance. This is superior to the typical two-stream architecture, which treats motion and appearance separately in each stream and often suffers from overfitting to appearance information. Additionally, a bridge network is proposed to obtain a compact, discriminative and scale-sensitive representation for multi-level encoder features, which is further fed into a decoder to achieve segmentation results. Extensive experiments on three challenging public benchmarks (i.e. DAVIS-16, FBMS and Youtube-Objects) show that our model achieves compelling performance against the state-of-the-arts.
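The asymmetry of the MAT block (motion informs appearance, not vice versa) can be caricatured with a small gating module; this is our simplification for illustration, not the paper's actual block design.

```python
# Loose sketch of an asymmetric motion-attentive gate: motion features gate
# the appearance stream via soft attention while the motion stream passes
# through unchanged.
import torch.nn as nn

class MotionAttentiveGate(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())

    def forward(self, appearance, motion):
        # Information flows motion -> appearance only (the asymmetry).
        return appearance * self.attn(motion), motion
```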
8. Hierarchical Kinematic Human Mesh Recovery [PDF] Back to Contents
Georgios Georgakis, Ren Li, Srikrishna Karanam, Terrence Chen, Jana Kosecka, Ziyan Wu
Abstract: We consider the problem of estimating a parametric model of 3D human mesh from a single image. While there has been substantial recent progress in this area with direct regression of model parameters, these methods only implicitly exploit the human body kinematic structure, leading to sub-optimal use of the model prior. In this work, we address this gap by proposing a new technique for regression of human parametric model that is explicitly informed by the known hierarchical structure, including joint interdependencies of the model. This results in a strong prior-informed design of the regressor architecture and an associated hierarchical optimization that is flexible to be used in conjunction with the current standard frameworks for 3D human mesh recovery. We demonstrate these aspects by means of extensive experiments on standard benchmark datasets, showing how our proposed new design outperforms several existing and popular methods, establishing new state-of-the-art results. With our explicit consideration of joint interdependencies, our proposed method is equipped to infer joints even under data corruptions, which we demonstrate with experiments under varying degrees of occlusion.
9. Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [PDF] Back to Contents
Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
Abstract: Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines are able to do the same now with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, purely based on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360 degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision `teacher' method and a sound `student' method -- the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks namely, a) a novel task on Spatial Sound Super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; and 2) the three tasks are mutually beneficial -- training them together achieves the best performance and 3) the number and orientations of microphones are both important. The data and code will be released to facilitate the research in this new direction.
10. Domain Adversarial Training for Infrared-colour Person Re-Identification [PDF] Back to Contents
Nima Mohammadi Meshky, Sara Iodice, Krystian Mikolajczyk
Abstract: Person re-identification (re-ID) is a very active area of research in computer vision, due to the role it plays in video surveillance. Currently, most methods only address the task of matching between colour images. However, in poorly-lit environments CCTV cameras switch to infrared imaging, hence developing a system which can correctly perform matching between infrared and colour images is a necessity. In this paper, we propose a part-feature extraction network to better focus on subtle, unique signatures on the person which are visible across both infrared and colour modalities. To train the model we propose a novel variant of the domain adversarial feature-learning framework. Through extensive experimentation, we show that our approach outperforms state-of-the-art methods.
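Domain adversarial feature learning is usually implemented with a gradient reversal layer; the DANN-style sketch below shows the standard construction (the paper proposes its own variant, which this does not reproduce).

```python
# Standard gradient-reversal layer: the forward pass is the identity, the
# backward pass flips (and scales) the gradient, so features are trained to
# fool a modality/domain classifier.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage: domain_logits = domain_classifier(grad_reverse(features))
```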
11. BirdNet+: End-to-End 3D Object Detection in LiDAR Bird's Eye View [PDF] Back to Contents
Alejandro Barrera, Carlos Guindel, Jorge Beltrán, Fernando García
Abstract: On-board 3D object detection in autonomous vehicles often relies on geometry information captured by LiDAR devices. Although image features are typically preferred for detection, numerous approaches take only spatial data as input. Exploiting this information in inference usually involves the use of compact representations such as the Bird's Eye View (BEV) projection, which entails a loss of information and thus hinders the joint inference of all the parameters of the objects' 3D boxes. In this paper, we present a fully end-to-end 3D object detection framework that can infer oriented 3D boxes solely from BEV images by using a two-stage object detector and ad-hoc regression branches, eliminating the need for a post-processing stage. The method outperforms its predecessor (BirdNet) by a large margin and obtains state-of-the-art results on the KITTI 3D Object Detection Benchmark for all the categories in evaluation.
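For context, a BEV projection of the kind such detectors consume can be rasterized in a few lines; the ranges, resolution, and max-height encoding below are common choices and our assumptions, not BirdNet+'s exact cell encoding.

```python
# Illustrative BEV rasterization: project LiDAR points onto a top-down grid,
# keeping the maximum height per cell.
import numpy as np

def pointcloud_to_bev(points, x_range=(0, 70), y_range=(-35, 35), res=0.1):
    """points: (N, 3) array of (x, y, z) in meters. Returns a (H, W) height map."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    xi = ((pts[:, 0] - x_range[0]) / res).astype(int)
    yi = ((pts[:, 1] - y_range[0]) / res).astype(int)
    H = int((x_range[1] - x_range[0]) / res)
    W = int((y_range[1] - y_range[0]) / res)
    bev = np.full((H, W), -np.inf)
    np.maximum.at(bev, (xi, yi), pts[:, 2])   # max z per cell
    bev[bev == -np.inf] = 0.0                 # empty cells -> ground level
    return bev
```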
12. Embedding Propagation: Smoother Manifold for Few-Shot Classification [PDF] Back to Contents
Pau Rodríguez, Issam Laradji, Alexandre Drouin, Alexandre Lacoste
Abstract: Few-shot classification is challenging because the data distribution of the training set can be widely different to the distribution of the test set as their classes are disjoint. This distribution shift often results in poor generalization. Manifold smoothing has been shown to address the distribution shift problem by extending the decision boundaries and reducing the noise of the class representations. Moreover, manifold smoothness is a key factor for semi-supervised learning and transductive learning algorithms. In this work, we present embedding propagation as an unsupervised non-parametric regularizer for manifold smoothing. Embedding propagation leverages interpolations between the extracted features of a neural network based on a similarity graph. We empirically show that embedding propagation yields a smoother embedding manifold. We also show that incorporating embedding propagation to a transductive classifier leads to new state-of-the-art results in mini-Imagenet, tiered-Imagenet, and CUB. Furthermore, we show that embedding propagation results in additional improvement in performance for semi-supervised learning scenarios.
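A plausible form of embedding propagation, assuming a Gaussian similarity graph and the standard label-propagation closed form; the paper's actual operator may differ, so treat this strictly as a sketch.

```python
# Sketch of feature smoothing over a similarity graph: build a normalized
# adjacency from pairwise distances, then diffuse the embeddings through it.
import torch

def embed_propagation(z, alpha=0.5, sigma=1.0):
    """z: (n, d) embeddings. Returns smoothed embeddings on the same manifold."""
    d2 = torch.cdist(z, z).pow(2)
    A = torch.exp(-d2 / (2 * sigma ** 2))     # Gaussian affinities
    A.fill_diagonal_(0)
    Dinv = A.sum(1).clamp(min=1e-8).rsqrt().diag()
    L = Dinv @ A @ Dinv                       # symmetrically normalized adjacency
    P = torch.linalg.inv(torch.eye(z.size(0)) - alpha * L)
    return P @ z                              # label-propagation closed form
```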
13. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network [PDF] Back to Contents
Jialin Gao, Zhixiang Shi, Jiani Li, Guanshuo Wang, Yufeng Yuan, Shiming Ge, Xi Zhou
Abstract: Accurate temporal action proposals play an important role in detecting actions from untrimmed videos. The existing approaches have difficulties in capturing global contextual information and simultaneously localizing actions with different durations. To this end, we propose a Relation-aware pyramid Network (RapNet) to generate highly accurate temporal action proposals. In RapNet, a novel relation-aware module is introduced to exploit bi-directional long-range relations between local features for context distilling. This embedded module enhances the RapNet in terms of its multi-granularity temporal proposal generation ability, given predefined anchor boxes. We further introduce a two-stage adjustment scheme to refine the proposal boundaries and measure their confidence in containing an action with snippet-level actionness. Extensive experiments on the challenging ActivityNet and THUMOS14 benchmarks demonstrate our RapNet generates superior accurate proposals over the existing state-of-the-art methods.
14. iFAN: Image-Instance Full Alignment Networks for Adaptive Object Detection [PDF] Back to Contents
Chenfan Zhuang, Xintong Han, Weilin Huang, Matthew R. Scott
Abstract: Training an object detector on a data-rich domain and applying it to a data-poor one with limited performance drop is highly attractive in industry, because it saves huge annotation cost. Recent research on unsupervised domain adaptive object detection has verified that aligning data distributions between source and target images through adversarial learning is very useful. The key is when, where and how to use it to achieve best practice. We propose Image-Instance Full Alignment Networks (iFAN) to tackle this problem by precisely aligning feature distributions on both image and instance levels: 1) Image-level alignment: multi-scale features are roughly aligned by training adversarial domain classifiers in a hierarchically-nested fashion. 2) Full instance-level alignment: deep semantic information and elaborate instance representations are fully exploited to establish a strong relationship among categories and domains. Establishing these correlations is formulated as a metric learning problem by carefully constructing instance pairs. Above-mentioned adaptations can be integrated into an object detector (e.g. Faster RCNN), resulting in an end-to-end trainable framework where multiple alignments can work collaboratively in a coarse-to-fine manner. In two domain adaptation tasks: synthetic-to-real (SIM10K->Cityscapes) and normal-to-foggy weather (Cityscapes->Foggy Cityscapes), iFAN outperforms the state-of-the-art methods with a boost of 10%+ AP over the source-only baseline.
15. The Utility of Feature Reuse: Transfer Learning in Data-Starved Regimes [PDF] Back to Contents
Edward Verenich, Alvaro Velasquez, M.G. Sarwar Murshed, Faraz Hussain
Abstract: The use of transfer learning with deep neural networks has increasingly become widespread for deploying well-tested computer vision systems to newer domains, especially those with limited datasets. We describe a transfer learning use case for a domain with a data-starved regime, having fewer than 100 labeled target samples. We evaluate the effectiveness of convolutional feature extraction and fine-tuning of overparameterized models with respect to the size of target training data, as well as their generalization performance on data with covariate shift, or out-of-distribution (OOD) data. Our experiments show that both overparameterization and feature reuse contribute to successful application of transfer learning in training image classifiers in data-starved regimes.
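The two transfer regimes being compared map onto a few lines of torchvision code; the backbone choice and head construction below are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of the two regimes: convolutional feature extraction (frozen
# backbone, trainable head) versus full fine-tuning of an overparameterized
# pretrained model.
import torch.nn as nn
from torchvision import models

def build_classifier(num_classes, finetune=False):
    model = models.resnet50(pretrained=True)   # ImageNet-pretrained backbone
    if not finetune:
        # Feature extraction: freeze the backbone; only the new head
        # trains on the small (<100 sample) target set.
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # head stays trainable
    return model
```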
16. Hazard Detection in Supermarkets using Deep Learning on the Edge [PDF] Back to Contents
M. G. Sarwar Murshed, Edward Verenich, James J. Carroll, Nazar Khan, Faraz Hussain
Abstract: Supermarkets need to ensure clean and safe environments for both shoppers and employees. Slips, trips, and falls can result in injuries that have a physical as well as financial cost. Timely detection of hazardous conditions such as spilled liquids or fallen items on supermarket floors can reduce the chances of serious injuries. This paper presents EdgeLite, a novel, lightweight deep learning model for easy deployment and inference on resource-constrained devices. We describe the use of EdgeLite on two edge devices for detecting supermarket floor hazards. On a hazard detection dataset that we developed, EdgeLite, when deployed on edge devices, outperformed six state-of-the-art object detection models in terms of accuracy while having comparable memory usage and inference time.
17. A Strong Baseline for Fashion Retrieval with Person Re-Identification Models [PDF] Back to Contents
Mikolaj Wieczorek, Andrzej Michalowski, Anna Wroblewska, Jacek Dabrowski
Abstract: Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image. Difficulties arise from the fine-grained nature of clothing items, very large intra-class and inter-class variance. Additionally, query and source images for the task usually come from different domains - street photos and catalogue photos respectively. Due to these differences, a significant gap in quality, lighting, contrast, background clutter and item presentation exists between domains. As a result, fashion retrieval is an active field of research both in academia and the industry. Inspired by recent advancements in Person Re-Identification research, we adapt leading ReID models to be used in fashion retrieval tasks. We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results despite a much simpler architecture. We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results. Finally, we propose a cross-domain (cross-dataset) evaluation method to test the robustness of fashion retrieval models.
18. Searching Central Difference Convolutional Networks for Face Anti-Spoofing [PDF] Back to Contents
Zitong Yu, Chenxu Zhao, Zezheng Wang, Yunxiao Qin, Zhuo Su, Xiaobai Li, Feng Zhou, Guoying Zhao
Abstract: Face anti-spoofing (FAS) plays a vital role in face recognition systems. Most state-of-the-art FAS methods 1) rely on stacked convolutions and expert-designed network, which is weak in describing detailed fine-grained information and easily being ineffective when the environment varies (e.g., different illumination), and 2) prefer to use long sequence as input to extract dynamic features, making them difficult to deploy into scenarios which need quick response. Here we propose a novel frame level FAS method based on Central Difference Convolution (CDC), which is able to capture intrinsic detailed patterns via aggregating both intensity and gradient information. A network built with CDC, called the Central Difference Convolutional Network (CDCN), is able to provide more robust modeling capacity than its counterpart built with vanilla convolution. Furthermore, over a specifically designed CDC search space, Neural Architecture Search (NAS) is utilized to discover a more powerful network structure (CDCN++), which can be assembled with Multiscale Attention Fusion Module (MAFM) for further boosting performance. Comprehensive experiments are performed on six benchmark datasets to show that 1) the proposed method not only achieves superior performance on intra-dataset testing (especially 0.2% ACER in Protocol-1 of OULU-NPU dataset), 2) it also generalizes well on cross-dataset testing (particularly 6.5% HTER from CASIA-MFSD to Replay-Attack datasets). The codes are available at \href{this https URL}{this https URL}.
19. When Person Re-identification Meets Changing Clothes [PDF] 返回目录
Fangbin Wan, Yang Wu, Xuelin Qian, Yanwei Fu
Abstract: Person re-identification (ReID) is now an active research topic for AI-based video surveillance applications such as specific person search, but the practical issue that the target person(s) may change clothes (the clothes inconsistency problem) has long been overlooked. This paper is the first to study this problem systematically. We first overcome the lack of a suitable dataset by collecting a small yet representative real dataset for testing while building a large, realistic synthetic dataset for training and deeper studies. Facilitated by our new datasets, we are able to conduct various interesting new experiments studying the influence of clothes inconsistency. We find that changing clothes makes ReID a much harder problem: it complicates the learning of effective representations and challenges the generalization ability of previous ReID models to identify persons in unseen (new) clothes. Representative existing ReID models are adopted to report informative results in this challenging setting, and we also provide some preliminary efforts on improving the robustness of existing models to the clothes inconsistency in the data. We believe this study can inspire and encourage more research in this direction.
20. On the Texture Bias for Few-Shot CNN Segmentation [PDF] 返回目录
Reza Azad, Abdur R Fayjie, Claude Kauffman, Ismail Ben Ayed, Marco Pedersoli, Jose Dolz
Abstract: Despite the initial belief that Convolutional Neural Networks (CNNs) are driven by shapes when performing visual recognition tasks, recent evidence suggests that texture bias in CNNs yields higher-performing and more robust models. This contrasts with the perceptual bias of the human visual cortex, which has a stronger preference for shape components. Perceptual differences may explain why CNNs achieve human-level performance when large labeled datasets are available, but their performance significantly degrades in low-labeled-data scenarios such as few-shot semantic segmentation. To remove the texture bias in the context of few-shot learning, we propose a novel architecture that integrates a set of difference-of-Gaussians (DoG) filters to attenuate high-frequency local components in the feature space. This produces a set of modified feature maps whose high-frequency components are diminished at different standard deviation values of the Gaussian distribution in the spatial domain. As this results in multiple feature maps for a single image, we employ a bi-directional convolutional long short-term memory to efficiently merge the multiple scale-space representations. We perform extensive experiments on two well-known few-shot segmentation benchmarks, PASCAL-5i and FSS-1000, and demonstrate that our method significantly outperforms state-of-the-art approaches. The code is available at: this https URL
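The DoG idea can be sketched as follows: blur the feature map with Gaussian kernels of increasing standard deviation and take adjacent differences, yielding multiple feature maps with high-frequency components progressively attenuated (these would then be merged, e.g., by the bi-directional convLSTM mentioned above). The kernel size and sigma values here are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel2d(sigma, size=5):
    # Separable 2D Gaussian built as an outer product, normalized to sum to 1.
    coords = torch.arange(size).float() - size // 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return torch.outer(g, g)

def dog_feature_maps(feat, sigmas=(1.0, 2.0, 4.0), size=5):
    """Return DoG bands of a (B, C, H, W) feature map at several scales."""
    b, c, h, w = feat.shape
    blurred = [feat]
    for s in sigmas:
        k = gaussian_kernel2d(s, size).to(feat).reshape(1, 1, size, size)
        k = k.repeat(c, 1, 1, 1)  # depthwise: one kernel per channel
        blurred.append(F.conv2d(feat, k, padding=size // 2, groups=c))
    # Each adjacent difference removes one band of high-frequency components.
    return [blurred[i] - blurred[i + 1] for i in range(len(blurred) - 1)]
```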
21. Transformation-based Adversarial Video Prediction on Large-Scale Data [PDF] 返回目录
Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, Karen Simonyan
Abstract: Recent breakthroughs in adversarial generative modeling have led to models capable of producing video samples of high quality, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where, given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance than previous approaches. We then analyze recurrent units in the generator and propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features, then refines it to handle dis-occlusions, scene changes and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model leads to a leap in state-of-the-art performance, obtaining a test-set Fréchet Video Distance of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
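As a toy illustration of a recurrent unit that transforms its past hidden state with predicted motion-like features, the sketch below warps the hidden state with per-pixel offsets predicted from the current input, then refines the result; the paper's actual unit differs in its parameterization and refinement stages, so treat this as a conceptual stand-in only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformingRecurrentCell(nn.Module):
    """Toy cell: warp the past hidden state by predicted offsets, then refine."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.offset = nn.Conv2d(in_ch + hid_ch, 2, 3, padding=1)   # per-pixel (dx, dy)
        self.update = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        flow = self.offset(torch.cat([x, h], dim=1))               # motion-like features
        b, _, hh, ww = flow.shape
        ys = torch.linspace(-1, 1, hh, device=x.device).view(1, hh, 1).expand(b, hh, ww)
        xs = torch.linspace(-1, 1, ww, device=x.device).view(1, 1, ww).expand(b, hh, ww)
        grid = torch.stack([xs, ys], dim=-1) + flow.permute(0, 2, 3, 1)
        warped = F.grid_sample(h, grid, align_corners=True)        # transform hidden state
        return torch.tanh(self.update(torch.cat([x, warped], dim=1)))
```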
22. Learning Delicate Local Representations for Multi-Person Pose Estimation [PDF] 返回目录
Yuanhao Cai, Zhicheng Wang, Zhengxiong Luo, Binyi Yin, Angang Du, Haoqian Wang, Xinyu Zhou, Erjin Zhou, Xiangyu Zhang, Jian Sun
Abstract: In this paper, we propose a novel method called the Residual Steps Network (RSN). RSN efficiently aggregates features with the same spatial size (intra-level features) to obtain delicate local representations, which retain rich low-level spatial information and result in precise keypoint localization. In addition, we propose an efficient attention mechanism, the Pose Refine Machine (PRM), to further refine the keypoint locations. Our approach won 1st place in the COCO Keypoint Challenge 2019 and achieves state-of-the-art results on both the COCO and MPII benchmarks, without extra training data or pretrained models. Our single model achieves 78.6 on COCO test-dev and 93.0 on the MPII test set. Ensembled models achieve 79.2 on COCO test-dev and 77.1 on the COCO test-challenge set. The source code is publicly available for further research at this https URL
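A minimal sketch of an attention-based refinement head in the spirit of the PRM: the feature map is reweighted by channel and spatial attention before predicting per-joint heatmaps. The real module differs in detail; names, kernel sizes, and the residual form are assumptions.

```python
import torch.nn as nn

class PoseRefineMachine(nn.Module):
    """Refinement head: channel and spatial attention gate the feature,
    with a residual connection, before the per-joint heatmap prediction."""
    def __init__(self, ch, num_joints):
        super().__init__()
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(ch, 1, 9, padding=4), nn.Sigmoid())
        self.head = nn.Conv2d(ch, num_joints, 1)

    def forward(self, feat):
        refined = feat * self.channel_att(feat) * self.spatial_att(feat) + feat
        return self.head(refined)  # per-joint keypoint heatmaps
```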
23. Dense Dilated Convolutions Merging Network for Land Cover Classification [PDF] 返回目录
Qinghui Liu, Michael Kampffmeyer, Robert Jessen, Arnt-Børre Salberg
Abstract: Land cover classification of remote sensing images is a challenging task due to limited amounts of annotated data, highly imbalanced classes, frequent incorrect pixel-level annotations, and an inherent complexity in the semantic segmentation task. In this article, we propose a novel architecture called the Dense Dilated Convolutions Merging Network (DDCM-Net) to address this task. The proposed DDCM-Net consists of dense dilated image convolutions merged with varying dilation rates. This effectively exploits rich combinations of dilated convolutions that enlarge the network's receptive fields with fewer parameters and features than state-of-the-art approaches in the remote sensing domain. Importantly, DDCM-Net obtains fused local- and global-context information, in effect incorporating the surrounding discriminative capability for multiscale and complex-shaped objects with similar color and textures in very high-resolution aerial imagery. We demonstrate the effectiveness, robustness, and flexibility of the proposed DDCM-Net on the publicly available ISPRS Potsdam and Vaihingen data sets, as well as the DeepGlobe land cover data set. Our single model, trained on the three-band Potsdam and Vaihingen data sets, achieves better accuracy in terms of both mean intersection over union (mIoU) and F1-score than other published models trained with more than three-band data. We further validate our model on the DeepGlobe data set, achieving a state-of-the-art result of 56.2% mIoU with far fewer parameters and at a lower computational cost than related recent work. Code is available at this https URL
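The dense dilated merging idea can be sketched as a stack of dilated 3x3 convolutions with growing dilation rates, each densely connected to all previous outputs, fused by a 1x1 merging convolution. The growth width and the specific dilation rates below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DDCMBlock(nn.Module):
    """Sketch of dense dilated convolutions merging: each dilated conv sees the
    input concatenated with all previous outputs; a 1x1 conv merges the stack."""
    def __init__(self, in_ch, growth=16, dilations=(1, 2, 3, 5, 7)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=d, dilation=d),
                nn.BatchNorm2d(growth), nn.ReLU(inplace=True)))
            ch += growth  # dense connectivity widens the next layer's input
        self.merge = nn.Conv2d(ch, in_ch, 1)  # fuse all scales

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return self.merge(torch.cat(feats, dim=1))
```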
24. Context-Aware Domain Adaptation in Semantic Segmentation [PDF] 返回目录
Jinyu Yang, Weizhi An, Chaochao Yan, Peilin Zhao, Junzhou Huang
Abstract: In this paper, we consider the problem of unsupervised domain adaptation in semantic segmentation. There are two primary issues in this field, i.e., what and how to transfer domain knowledge across two domains. Existing methods mainly focus on adapting domain-invariant features (what to transfer) through adversarial learning (how to transfer). Context dependency is essential for semantic segmentation; however, its transferability is still not well understood. Furthermore, how to transfer contextual information across two domains remains unexplored. Motivated by this, we propose a cross-attention mechanism based on self-attention to capture context dependencies between two domains and adapt transferable context. To achieve this goal, we design two cross-domain attention modules to adapt context dependencies from both spatial and channel views. Specifically, the spatial attention module captures local feature dependencies between each position in the source and target images. The channel attention module models semantic dependencies between each pair of cross-domain channel maps. To adapt context dependencies, we further selectively aggregate the context information from the two domains. The superiority of our method over existing state-of-the-art methods is empirically demonstrated on "GTA5 to Cityscapes" and "SYNTHIA to Cityscapes".
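A minimal sketch of the spatial cross-attention idea, where every position in the target feature map attends to positions in the source feature map; the shapes and residual gating follow common self-attention practice and are assumptions, not necessarily the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    """Positions in the target feature attend over positions in the source."""
    def __init__(self, ch, reduction=8):
        super().__init__()
        self.q = nn.Conv2d(ch, ch // reduction, 1)
        self.k = nn.Conv2d(ch, ch // reduction, 1)
        self.v = nn.Conv2d(ch, ch, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual gate

    def forward(self, tgt, src):
        b, c, h, w = tgt.shape
        q = self.q(tgt).flatten(2).transpose(1, 2)      # (b, hw_t, c')
        k = self.k(src).flatten(2)                      # (b, c', hw_s)
        attn = torch.softmax(q @ k, dim=-1)             # target-to-source affinities
        v = self.v(src).flatten(2).transpose(1, 2)      # (b, hw_s, c)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return tgt + self.gamma * out                   # residual adaptation
```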
25. Pseudo-Convolutional Policy Gradient for Sequence-to-Sequence Lip-Reading [PDF] 返回目录
Mingshuang Luo, Shuang Yang, Shiguang Shan, Xilin Chen
Abstract: Lip-reading aims to infer the speech content from a lip movement sequence and can be seen as a typical sequence-to-sequence (seq2seq) problem which translates the input image sequence of lip movements into the text sequence of the speech content. However, the traditional learning process of seq2seq models suffers from two problems: the exposure bias resulting from the "teacher-forcing" strategy, and the inconsistency between the discriminative optimization target (usually the cross-entropy loss) and the final evaluation metric (usually the character/word error rate). In this paper, we propose a novel pseudo-convolutional policy gradient (PCPG) based method to address these two problems. On the one hand, we introduce the evaluation metric (the character error rate in this paper) as a form of reward to optimize the model together with the original discriminative target. On the other hand, inspired by the local perception property of the convolutional operation, we perform a pseudo-convolutional operation on the reward and loss dimension, so as to take more context around each time step into account to generate a robust reward and loss for the whole optimization. Finally, we perform a thorough comparison and evaluation on both word-level and sentence-level benchmarks. The results show a significant improvement over other related methods and report either new state-of-the-art performance or competitive accuracy on all these challenging benchmarks, which clearly demonstrates the advantages of our approach.
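The pseudo-convolutional operation on the reward dimension can be sketched as a 1D convolution along the time axis, so each step's policy-gradient term aggregates the reward of its neighborhood; the mean kernel and window size below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pcpg_loss(log_probs, rewards, k=5):
    """Sketch: smooth per-step rewards with a 1D mean kernel along time
    (the pseudo-convolution), then apply REINFORCE with the smoothed return.

    log_probs, rewards: (batch, T) tensors; k: odd window size.
    """
    kernel = torch.full((1, 1, k), 1.0 / k, dtype=rewards.dtype, device=rewards.device)
    smoothed = F.conv1d(rewards.unsqueeze(1), kernel, padding=k // 2).squeeze(1)
    return -(log_probs * smoothed.detach()).mean()
```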
26. Cross-View Tracking for Multi-Human 3D Pose Estimation at over 100 FPS [PDF] 返回目录
Long Chen, Haizhou Ai, Rui Chen, Zijie Zhuang, Shuang Liu
Abstract: Estimating the 3D poses of multiple humans in real time is a classic but still challenging task in computer vision. Its major difficulty lies in the ambiguity of cross-view association of 2D poses and the huge state space when there are multiple people in multiple views. In this paper, we present a novel solution for multi-human 3D pose estimation from multiple calibrated camera views. It takes 2D poses in different camera coordinates as inputs and aims for accurate 3D poses in the global coordinate system. Unlike previous methods that associate 2D poses among all pairs of views from scratch at every frame, we exploit the temporal consistency in videos to match the 2D inputs with 3D poses directly in 3-space. More specifically, we propose to retain the 3D pose for each person and update it iteratively via cross-view multi-human tracking. This novel formulation improves both accuracy and efficiency, as we demonstrate on widely used public datasets. To further verify the scalability of our method, we propose a new large-scale multi-human dataset with 12 to 28 camera views. Without bells and whistles, our solution achieves 154 FPS on 12 cameras and 34 FPS on 28 cameras, indicating its ability to handle large-scale real-world applications. The proposed dataset will be released soon.
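The paper's cross-view tracking is more involved, but the core 2D-to-3D lifting step for calibrated cameras is standard linear (DLT) triangulation; a minimal NumPy sketch for one joint seen in two views:

```python
import numpy as np

def triangulate_joint(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one joint from two calibrated views.

    P1, P2: 3x4 camera projection matrices; x1, x2: 2D joint locations (pixels).
    Returns the 3D point in the global coordinate system.
    """
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, vt = np.linalg.svd(A)   # null-space vector minimizes the residual
    X = vt[-1]
    return X[:3] / X[3]           # dehomogenize
```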
27. BiDet: An Efficient Binarized Object Detector [PDF] 返回目录
Ziwei Wang, Ziyi Wu, Jiwen Lu, Jie Zhou
Abstract: In this paper, we propose a binarized neural network learning method called BiDet for efficient object detection. Conventional network binarization methods directly quantize the weights and activations in one-stage or two-stage detectors with constrained representational capacity, so that the information redundancy in the networks causes numerous false positives and degrades the performance significantly. On the contrary, our BiDet fully utilizes the representational capacity of binary neural networks for object detection via redundancy removal, through which the detection precision is enhanced and false positives are alleviated. Specifically, we generalize the information bottleneck (IB) principle to object detection, where the amount of information in the high-level feature maps is constrained and the mutual information between the feature maps and object detection is maximized. Meanwhile, we learn sparse object priors so that the posteriors are concentrated on informative detection predictions with false positives eliminated. Extensive experiments on the PASCAL VOC and COCO datasets show that our method outperforms state-of-the-art binary neural networks by a sizable margin.
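For background, here is a generic 1-bit convolution with a straight-through estimator; it shows what "binarized" means in this context, not BiDet's IB objective or its specific binarization scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarizeSTE(torch.autograd.Function):
    """Sign binarization with a straight-through estimator: the gradient
    passes through unchanged where |x| <= 1 and is zeroed elsewhere."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).float()

class BinaryConv2d(nn.Conv2d):
    def forward(self, x):
        wb = BinarizeSTE.apply(self.weight)   # 1-bit weights
        xb = BinarizeSTE.apply(x)             # 1-bit activations
        return F.conv2d(xb, wb, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```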
28. Cross-Modal Food Retrieval: Learning a Joint Embedding of Food Images and Recipes with Semantic Consistency and Attention Mechanism [PDF] 返回目录
Hao Wang, Doyen Sahoo, Chenghao Liu, Ke Shu, Palakorn Achananuparp, Ee-peng Lim, Steven C. H. Hoi
Abstract: Cross-modal food retrieval is an important task for analyzing food-related information, such as food images and cooking recipes. The goal is to learn an embedding of images and recipes in a common feature space, so that precise matching can be realized. Compared with existing cross-modal retrieval approaches, two major challenges in this specific problem are: 1) the large intra-class variance across cross-modal food data; and 2) the difficulty of obtaining discriminative recipe representations. To address these problems, we propose Semantic-Consistent and Attention-based Networks (SCAN), which regularize the embeddings of the two modalities by aligning output semantic probabilities. In addition, we exploit a self-attention mechanism to improve the embedding of recipes. We evaluate the performance of the proposed method on the large-scale Recipe1M dataset, and the results show that it outperforms the state of the art.
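The semantic consistency idea — aligning the output semantic probabilities of the two branches — can be sketched as a symmetric KL term between the image and recipe classifiers. This is a plausible reading of the abstract, not the authors' exact loss.

```python
import torch.nn.functional as F

def semantic_consistency_loss(img_logits, rec_logits):
    """Symmetric KL between the class-probability outputs of the two branches."""
    log_p_img = F.log_softmax(img_logits, dim=-1)
    log_p_rec = F.log_softmax(rec_logits, dim=-1)
    kl_ir = F.kl_div(log_p_img, log_p_rec.exp(), reduction="batchmean")
    kl_ri = F.kl_div(log_p_rec, log_p_img.exp(), reduction="batchmean")
    return 0.5 * (kl_ir + kl_ri)
```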
29. Pacemaker: Intermediate Teacher Knowledge Distillation For On-The-Fly Convolutional Neural Network [PDF] 返回目录
Wonchul Son, Youngbin Kim, Wonseok Song, Youngsu Moon, Wonjun Hwang
Abstract: There is a need for on-the-fly computation on very low-performance systems such as systems-on-chip (SoC) and embedded devices. This paper presents pacemaker knowledge distillation, which uses an intermediate ensemble teacher to bring convolutional neural networks to such systems. For the on-the-fly setting, we consider a student model that uses 1xN on-the-fly filters and a teacher model that uses normal NxN filters. We note three issues in training the student model that are caused by applying on-the-fly filters. First, the model is compressed to an unavoidably thin model of the same depth. Second, there are large capacity and parameter-size gaps, because only the horizontal receptive field can be selected, not the vertical one. Third, direct distillation suffers from performance instability and degradation. To solve these problems, we propose an intermediate teacher, named the pacemaker, for the on-the-fly student, so that the student can be trained step by step from the pacemaker and the original teacher. Experiments show that the proposed method yields significant accuracy improvements: on CIFAR-100, WRN-40-4 improves by 5.39% over conventional knowledge distillation, which performs even worse than the baseline. We also resolve the training instability that occurs when conventional knowledge distillation is applied without the proposed method, reducing the deviation range with pacemaker knowledge distillation.
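The step-by-step training can be sketched with a standard softened-logit distillation loss applied twice: teacher to pacemaker, then pacemaker to the on-the-fly student. The temperature and the two-stage schedule below are illustrative assumptions.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, T=4.0):
    """Hinton-style softened-logit distillation loss."""
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T * T

# Stage 1: distill the pacemaker from the original NxN teacher:
#   loss = kd_loss(pacemaker(x), teacher(x).detach())
# Stage 2: distill the 1xN on-the-fly student from the pacemaker:
#   loss = kd_loss(student(x), pacemaker(x).detach())
```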
30. MCMC Guided CNN Training and Segmentation for Pancreas Extraction [PDF] 返回目录
Jinchan He, Xiaxia Yu, Chudong Cai, Yi Gao
Abstract: Efficient organ segmentation is a precondition for various quantitative analyses. Segmenting the pancreas from abdominal CT images is a challenging task because of its high anatomical variability in shape, size and location. Moreover, the pancreas occupies only a small portion of the abdomen, and the organ border is very fuzzy. All these factors make segmentation methods designed for other organs less suitable for the pancreas. In this report, we propose a Markov chain Monte Carlo (MCMC) sampling guided convolutional neural network (CNN) approach, in order to handle such difficulties in morphological and photometric variability. Specifically, the proposed method contains three main steps: First, registration is carried out to mitigate body-weight and location variability. Then, MCMC sampling is employed to guide the sampling of 3D patches, which are fed to the CNN for training; at the same time, the pancreas distribution is also learned for the subsequent segmentation. Third, sampling from the learned distribution, an MCMC process guides the segmentation. Lastly, the patch-based segmentations are fused using a Bayesian voting scheme. The method is evaluated on the NIH pancreas dataset, which contains 82 abdominal contrast-enhanced CT volumes, achieving a competitive Dice Similarity Coefficient of 78.13% and a recall of 82.65% on the test data.
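The MCMC-guided patch sampling can be sketched with a random-walk Metropolis-Hastings sampler over patch centers, using a learned pancreas probability map as the target density; it is shown in 2D for brevity (the paper samples 3D patches), and the step size is an illustrative assumption.

```python
import numpy as np

def mh_patch_centers(prob_map, n_samples, step=8, rng=None):
    """Random-walk Metropolis-Hastings over patch centers; prob_map is the
    learned organ probability map used as the (unnormalized) target density."""
    rng = rng or np.random.default_rng()
    h, w = prob_map.shape
    cur = np.array([h // 2, w // 2])
    centers = []
    for _ in range(n_samples):
        prop = np.clip(cur + rng.integers(-step, step + 1, size=2), 0, [h - 1, w - 1])
        # Symmetric proposal -> accept with the ratio of target densities.
        if rng.random() < prob_map[tuple(prop)] / max(prob_map[tuple(cur)], 1e-8):
            cur = prop
        centers.append(cur.copy())
    return np.array(centers)
```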
31. Deconfounded Image Captioning: A Causal Retrospect [PDF] 返回目录
Xu Yang, Hanwang Zhang, Jianfei Cai
Abstract: The dataset bias in vision-language tasks is becoming one of the main problems hindering the progress of our community. However, recent studies lack a principled analysis of this bias. In this paper, we present a novel perspective, Deconfounded Image Captioning (DIC), to find out the cause of the bias in image captioning; we then retrospect modern neural image captioners and finally propose a DIC framework: DICv1.0. DIC is based on causal inference, whose two principles, the backdoor and front-door adjustments, help us review previous works and design effective models. In particular, we show that DICv1.0 can strengthen two prevailing captioning models, achieving single-model scores of 130.7 CIDEr-D and 128.4 c40 CIDEr-D on the Karpathy split and the online split of the challenging MS-COCO dataset, respectively. Last but not least, DICv1.0 is a natural derivation from our causal retrospect, which opens a promising direction for image captioning.
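For reference, the two causal-inference principles the abstract names have standard forms; e.g., the backdoor adjustment replaces the confounded conditional with an average over the confounder's prior:

```latex
% Backdoor adjustment over a confounder z (standard form):
P(Y \mid do(X)) \;=\; \sum_{z} P(Y \mid X, z)\, P(z)
% versus the observational P(Y \mid X) = \sum_{z} P(Y \mid X, z)\, P(z \mid X),
% whose weighting P(z | X) lets the dataset-induced confounder leak into the prediction.
```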
32. ROSE: Real One-Stage Effort to Detect the Fingerprint Singular Point Based on Multi-scale Spatial Attention [PDF] 返回目录
Liaojun Pang, Jiong Chen, Fei Guo, Zhicheng Cao, Heng Zhao
Abstract: Detecting the singular point accurately and efficiently is one of the most important tasks in fingerprint recognition. In recent years, deep learning has gradually been applied to fingerprint singular point detection. However, current deep learning-based singular point detection methods are either two-stage or multi-stage, which makes them time-consuming. More importantly, their detection accuracy is still unsatisfactory, especially for low-quality fingerprints. In this paper, we make a Real One-Stage Effort to detect fingerprint singular points more accurately and efficiently, and name the proposed algorithm ROSE for short, in which multi-scale spatial attention, a Gaussian heatmap and a variant of focal loss are applied together to achieve a higher detection rate. Experimental results on the FVC2002 DB1 and NIST SD4 datasets show that our ROSE outperforms state-of-the-art algorithms in terms of detection rate, false alarm rate and detection speed.
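The Gaussian heatmap used as the singular-point regression target can be generated as below; the sigma value is an illustrative choice.

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=4.0):
    """2D Gaussian centered at the labeled singular point (cy, cx),
    used as a dense regression target."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```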
33. FarSee-Net: Real-Time Semantic Segmentation by Efficient Multi-scale Context Aggregation and Feature Space Super-resolution [PDF] 返回目录
Zhanpeng Zhang, Kaipeng Zhang
Abstract: Real-time semantic segmentation is desirable in many robotic applications with limited computation resources. One challenge of semantic segmentation is to deal with object scale variations and leverage context. How to perform multi-scale context aggregation within a limited computation budget is important. In this paper, we first introduce a novel and efficient module called Cascaded Factorized Atrous Spatial Pyramid Pooling (CF-ASPP), a lightweight cascaded structure for Convolutional Neural Networks (CNNs) to efficiently leverage context information. Second, regarding runtime efficiency, state-of-the-art methods quickly decrease the spatial size of the inputs or feature maps in the early network stages, and the final high-resolution result is usually obtained by a non-parametric up-sampling operation (e.g. bilinear interpolation). Instead, we rethink this pipeline and treat it as a super-resolution process: we use an optimized super-resolution operation in the up-sampling step and improve the accuracy, especially in sub-sampled input image scenarios for real-time applications. By fusing the above two improvements, our method provides a better latency-accuracy trade-off than other state-of-the-art methods. In particular, we achieve 68.4% mIoU at 84 fps on the Cityscapes test set with a single Nvidia Titan X (Maxwell) GPU card. The proposed module can be plugged into any feature extraction CNN and benefits from CNN structure development.
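A minimal sketch of a factorized atrous pyramid in the spirit of CF-ASPP: each branch factorizes a dilated 3x3 convolution into a depthwise dilated convolution plus a pointwise 1x1, and the branches are fused by concatenation. The rates and the fusion are illustrative; the paper additionally cascades the module.

```python
import torch
import torch.nn as nn

class FactorizedASPP(nn.Module):
    """Factorized atrous pyramid: depthwise dilated 3x3 + pointwise 1x1 per
    branch, fused by concatenation and a 1x1 projection."""
    def __init__(self, in_ch, out_ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=r, dilation=r, groups=in_ch),
                nn.Conv2d(in_ch, out_ch, 1),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
            for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```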
34. Faster ILOD: Incremental Learning for Object Detectors based on Faster RCNN [PDF] 返回目录
Can Peng, Kun Zhao, Brian C. Lovell
Abstract: The human vision and perception system is inherently incremental: new knowledge is continually learned over time while existing knowledge is retained. Deep learning networks, on the other hand, are ill-equipped for incremental learning. When a well-trained network is adapted to new categories, its performance on the old categories dramatically degrades. To address this problem, incremental learning methods have been explored to preserve the old knowledge of deep learning models. However, the state-of-the-art incremental object detector employs an external fixed region proposal method, which increases overall computation time and reduces accuracy compared to object detectors such as Faster RCNN that use trainable Region Proposal Networks (RPNs). The purpose of this paper is to design an efficient end-to-end incremental object detector using knowledge distillation for RPN-based object detectors. We first evaluate and analyze the performance of an RPN-based detector with classic distillation on incremental detection tasks. Then, we introduce multi-network adaptive distillation, which properly retains knowledge from the old categories when fine-tuning the model for a new task. Experiments on the benchmark datasets PASCAL VOC and COCO demonstrate that the proposed incremental detector is more accurate as well as 13 times faster than the baseline detector.
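One plausible form of distillation for incremental detection (not necessarily the paper's multi-network adaptive scheme) penalizes the student only where it under-activates relative to the old-task teacher, leaving headroom for new classes:

```python
import torch.nn.functional as F

def one_sided_feature_distillation(student_feat, teacher_feat):
    """One-sided L2: only under-activation relative to the frozen old-task
    teacher is penalized, so new-class features can still grow."""
    return F.relu(teacher_feat - student_feat).pow(2).mean()
```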
35. An Empirical Evaluation on Robustness and Uncertainty of Regularization Methods [PDF] 返回目录
Sanghyuk Chun, Seong Joon Oh, Sangdoo Yun, Dongyoon Han, Junsuk Choe, Youngjoon Yoo
Abstract: Despite the apparent human-level performance of deep neural networks (DNNs), they behave fundamentally differently from humans. They easily change predictions when small corruptions such as blur and noise are applied to the input (lack of robustness), and they often produce confident predictions on out-of-distribution samples (improper uncertainty measure). While a number of studies have aimed to address these issues, the proposed solutions are typically expensive and complicated (e.g. Bayesian inference and adversarial training). Meanwhile, many simple and cheap regularization methods have been developed to enhance the generalization of classifiers. Such regularization methods have largely been overlooked as baselines for addressing the robustness and uncertainty issues, as they were not specifically designed for that purpose. In this paper, we provide extensive empirical evaluations of the robustness and uncertainty estimates of image classifiers (CIFAR-100 and ImageNet) trained with state-of-the-art regularization methods. Furthermore, experimental results show that certain regularization methods can serve as strong baseline methods for robustness and uncertainty estimation of DNNs.
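The kind of evaluation described above can be probed with a few lines. A sketch, assuming a generic PyTorch classifier and illustrative noise levels, that reports accuracy and mean max-softmax confidence under additive Gaussian corruption:

import torch

@torch.no_grad()
def noise_robustness(model, images, labels, sigmas=(0.0, 0.05, 0.1)):
    # Maps each noise level to (accuracy, mean confidence); a robust,
    # well-calibrated model should degrade gracefully on both.
    model.eval()
    stats = {}
    for s in sigmas:
        noisy = (images + s * torch.randn_like(images)).clamp(0.0, 1.0)
        probs = model(noisy).softmax(dim=1)
        conf, pred = probs.max(dim=1)
        stats[s] = (float((pred == labels).float().mean()), float(conf.mean()))
    return stats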
36. FoCL: Feature-Oriented Continual Learning for Generative Models [PDF] 返回目录
Qicheng Lao, Mehrzad Mortazavi, Marzieh Tahaei, Francis Dutil, Thomas Fevens, Mohammad Havaei
Abstract: In this paper, we propose a general framework for continual learning of generative models: Feature-oriented Continual Learning (FoCL). Unlike previous works that aim to solve the catastrophic forgetting problem by introducing regularization in the parameter space or image space, FoCL imposes regularization in the feature space. We show in our experiments that FoCL adapts faster to distributional changes in sequentially arriving tasks, and achieves state-of-the-art performance for generative models in task incremental learning. We discuss choices of combined regularization spaces for different use case scenarios to boost performance, e.g., tasks that have high variability in the background. Finally, we introduce a forgetfulness measure that fairly evaluates the degree to which a model suffers from forgetting. Interestingly, the analysis of our proposed forgetfulness score also implies that FoCL tends to mitigate forgetting on future tasks.
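A minimal sketch of feature-space regularization in the spirit of FoCL, assuming a frozen snapshot of the old generator and a hypothetical feature extractor; the paper's actual objective differs in detail.

import torch
import torch.nn.functional as F

def feature_space_penalty(feat_extractor, gen_new, gen_old, z, weight=1.0):
    # Regularize in feature space rather than parameter or image space:
    # features of the frozen old generator's samples anchor the new one.
    with torch.no_grad():
        f_old = feat_extractor(gen_old(z))
    f_new = feat_extractor(gen_new(z))
    return weight * F.mse_loss(f_new, f_old)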
37. A Tracking System For Baseball Game Reconstruction [PDF] 返回目录
Nina Wiedemann, Carlos Dietrich, Claudio T. Silva
Abstract: A baseball game is often seen as a series of contests between individuals. The duel between the pitcher and the batter, for example, is considered the engine that drives the sport. The pitchers use a variety of strategies to gain a competitive advantage over the batter, who does his best to figure out the ball trajectory and react in time for a hit. In this work, we propose a system that captures the movements of the pitcher, the batter, and the ball in a high level of detail, and discuss several ways in which this information may be processed to compute interesting statistics. We demonstrate on a large database of videos that our methods achieve results comparable to previous systems, while operating solely on video material. In addition, state-of-the-art AI techniques are incorporated to augment the amount of information that is made available for players, coaches, teams, and fans.
38. Active Fine-Tuning from gMAD Examples Improves Blind Image Quality Assessment [PDF] 返回目录
Zhihua Wang, Kede Ma
Abstract: The research in image quality assessment (IQA) has a long history, and significant progress has been made by leveraging recent advances in deep neural networks (DNNs). Despite high correlation numbers on existing IQA datasets, DNN-based models may be easily falsified in the group maximum differentiation (gMAD) competition with strong counterexamples being identified. Here we show that gMAD examples can be used to improve blind IQA (BIQA) methods. Specifically, we first pre-train a DNN-based BIQA model using multiple noisy annotators, and fine-tune it on multiple subject-rated databases of synthetically distorted images, resulting in a top-performing baseline model. We then seek pairs of images by comparing the baseline model with a set of full-reference IQA methods in gMAD. The resulting gMAD examples are most likely to reveal the relative weaknesses of the baseline, and suggest potential ways for refinement. We query ground truth quality annotations for the selected images in a well controlled laboratory environment, and further fine-tune the baseline on the combination of human-rated images from gMAD and existing databases. This process may be iterated, enabling active and progressive fine-tuning from gMAD examples for BIQA. We demonstrate the feasibility of our active learning scheme on a large-scale unlabeled image set, and show that the fine-tuned method achieves improved generalizability in gMAD, without destroying performance on previously trained databases.
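A sketch of gMAD-style pair selection under simplifying assumptions (scores as plain Python sequences, a single full-reference method, an illustrative tolerance): among pairs the full-reference method rates as near-equal in quality, pick the pair the blind model separates most, since that pair is most likely to expose a weakness.

import itertools

def select_gmad_pair(biqa_scores, fr_scores, tol=0.02):
    best, best_gap = None, float('-inf')
    for i, j in itertools.combinations(range(len(biqa_scores)), 2):
        if abs(fr_scores[i] - fr_scores[j]) < tol:  # FR method: same quality
            gap = abs(biqa_scores[i] - biqa_scores[j])
            if gap > best_gap:                      # BIQA model: max difference
                best, best_gap = (i, j), gap
    return best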
39. Fine-Grained Visual Classification via Progressive Multi-Granularity Training of Jigsaw Patches [PDF] 返回目录
Ruoyi Du, Dongliang Chang, Ayan Kumar Bhunia, Jiyang Xie, Yi-Zhe Song, Zhanyu Ma, Jun Guo
Abstract: Fine-grained visual classification (FGVC) is much more challenging than traditional classification tasks due to the inherently subtle intra-class object variations. Recent works mainly tackle this problem by focusing on how to locate the most discriminative parts, more complementary parts, and parts of various granularities. However, less effort has been devoted to which granularities are the most discriminative and how to fuse information across granularities. In this work, we propose a novel framework for fine-grained visual classification to tackle these problems. In particular, we propose: (i) a novel progressive training strategy that adds new layers at each training step to exploit information based on the smaller-granularity information found in the previous step and stage; (ii) a simple jigsaw puzzle generator to form images that contain information of different granularity levels. We obtain state-of-the-art performance on several standard FGVC benchmark datasets, where the proposed method consistently outperforms existing methods or delivers competitive results. The code will be available at this https URL.
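The jigsaw puzzle generator in (ii) is simple to sketch. A minimal PyTorch version, assuming image sides divisible by n; each call shuffles an n x n grid of patches, exposing granularity-n information:

import torch

def jigsaw(images, n):
    # images: (B, C, H, W) with H and W divisible by n.
    b, c, h, w = images.shape
    ph, pw = h // n, w // n
    patches = (images.unfold(2, ph, ph).unfold(3, pw, pw)  # B,C,n,n,ph,pw
                     .reshape(b, c, n * n, ph, pw))
    patches = patches[:, :, torch.randperm(n * n)]         # shuffle the grid
    return (patches.reshape(b, c, n, n, ph, pw)
                   .permute(0, 1, 2, 4, 3, 5)              # rows, ph, cols, pw
                   .reshape(b, c, h, w))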
40. No Surprises: Training Robust Lung Nodule Detection for Low-Dose CT Scans by Augmenting with Adversarial Attacks [PDF] 返回目录
Siqi Liu, Arnaud Arindra Adiyoso Setio, Florin C. Ghesu, Eli Gibson, Sasa Grbic, Bogdan Georgescu, Dorin Comaniciu
Abstract: Detecting malignant pulmonary nodules at an early stage allows medical interventions that increase the survival rate of lung cancer patients. Using computer vision techniques to detect nodules can improve the sensitivity and the speed of interpreting chest CT for lung cancer screening. Many studies have used CNNs to detect nodule candidates. Though such approaches have been shown to outperform conventional image-processing-based methods in detection accuracy, CNNs are also known to generalize poorly to under-represented samples in the training set and to be prone to imperceptible noise perturbations. Such limitations cannot be easily addressed by scaling up the dataset or the models. In this work, we propose to add adversarial synthetic nodules and adversarial attack samples to the training data to improve the generalization and robustness of lung nodule detection systems. In order to generate hard examples of nodules from a differentiable nodule synthesizer, we use projected gradient descent (PGD) to search, within a bounded neighbourhood, for latent codes that generate nodules which decrease the detector response. To make the network more robust to unanticipated noise perturbations, we use PGD to search for noise patterns that can trigger the network into over-confident mistakes. By evaluating on two different benchmark datasets containing consensus annotations from three radiologists, we show that the proposed techniques can improve the detection performance on real CT data. To understand the limitations of both the conventional networks and the proposed augmented networks, we also perform stress-tests on the false positive reduction networks by feeding them different types of artificially produced patches. We show that the augmented networks are more robust both to under-represented nodules and to noise perturbations.
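A minimal PGD sketch for the noise-pattern search, assuming a softmax classifier head and illustrative step sizes; the paper applies the same machinery to the nodule synthesizer's latent code as well.

import torch

def pgd_overconfidence(model, x, eps=0.01, alpha=0.002, steps=10):
    # Search, within an L_inf ball of radius eps, for a perturbation that
    # pushes the network towards over-confident outputs (here measured as
    # the mean maximal softmax response), as a stress test.
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        conf = model(x + delta).softmax(dim=1).max(dim=1).values.mean()
        conf.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (x + delta).detach()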
41. PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models [PDF] 返回目录
Sachit Menon, Alexandru Damian, Shijia Hu, Nikhil Ravi, Cynthia Rudin
Abstract: The primary aim of single-image super-resolution is to construct a high-resolution (HR) image from a corresponding low-resolution (LR) input. In previous approaches, which have generally been supervised, the training objective typically measures a pixel-wise average distance between the super-resolved (SR) and HR images. Optimizing such metrics often leads to blurring, especially in high variance (detailed) regions. We propose an alternative formulation of the super-resolution problem based on creating realistic SR images that downscale correctly. We present a novel super-resolution algorithm addressing this problem, PULSE (Photo Upsampling via Latent Space Exploration), which generates high-resolution, realistic images at resolutions previously unseen in the literature. It accomplishes this in an entirely self-supervised fashion and is not confined to a specific degradation operator used during training, unlike previous methods (which require training on databases of LR-HR image pairs for supervised learning). Instead of starting with the LR image and slowly adding detail, PULSE traverses the high-resolution natural image manifold, searching for images that downscale to the original LR image. This is formalized through the "downscaling loss," which guides exploration through the latent space of a generative model. By leveraging properties of high-dimensional Gaussians, we restrict the search space to guarantee that our outputs are realistic. PULSE thereby generates super-resolved images that both are realistic and downscale correctly. We show extensive experimental results demonstrating the efficacy of our approach in the domain of face super-resolution (also known as face hallucination). Our method outperforms state-of-the-art methods in perceptual quality at higher resolutions and scale factors than previously possible.
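The downscaling loss at the heart of the search is compact. A sketch, assuming a frozen generator and bicubic downscaling as an illustrative degradation operator; z is then optimized (e.g. with torch.optim.Adam([z])) while the generator stays fixed.

import torch
import torch.nn.functional as F

def downscaling_loss(generator, z, lr_image, scale):
    # Traverse latent space so the generated HR image downscales back
    # to the observed LR input.
    hr = generator(z)
    down = F.interpolate(hr, scale_factor=1.0 / scale,
                         mode='bicubic', align_corners=False)
    return F.mse_loss(down, lr_image)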
42. Mind the Gap: Enlarging the Domain Gap in Open Set Domain Adaptation [PDF] 返回目录
Dongliang Chang, Aneeshan Sain, Zhanyu Ma, Yi-Zhe Song, Jun Guo
Abstract: Unsupervised domain adaptation aims to leverage labeled data from a source domain to learn a classifier for an unlabeled target domain. Among its many variants, open set domain adaptation (OSDA) is perhaps the most challenging, as it further assumes the presence of unknown classes in the target domain. In this paper, we study OSDA with a particular focus on enriching its ability to traverse larger domain gaps. Firstly, we show that existing state-of-the-art methods suffer a considerable performance drop in the presence of larger domain gaps, especially on a new dataset (PACS) that we re-purposed for OSDA. We then propose a novel framework to specifically address the larger domain gaps. The key insight lies in how we exploit the mutually beneficial information between two networks: (a) to separate samples of known and unknown classes, and (b) to maximize the domain confusion between source and target domain without the influence of unknown samples. It follows that (a) and (b) will mutually supervise each other, alternating until convergence. Extensive experiments are conducted on the Office-31, Office-Home, and PACS datasets, demonstrating the superiority of our method in comparison to other state-of-the-art approaches. Code has been provided as part of the supplementary material and will be publicly released upon acceptance.
43. DADA: Differentiable Automatic Data Augmentation [PDF] 返回目录
Yonggang Li, Guosheng Hu, Yongtao Wang, Timothy Hospedales, Neil M. Robertson, Yongxing Yang
Abstract: Data augmentation (DA) techniques aim to increase data variability, and thus train deep networks with better generalisation. The pioneering AutoAugment automated the search for optimal DA policies with reinforcement learning. However, AutoAugment is extremely computationally expensive, limiting its wide applicability. Follow-up work such as PBA and Fast AutoAugment improved efficiency, but optimization speed remains a bottleneck. In this paper, we propose Differentiable Automatic Data Augmentation (DADA), which dramatically reduces the cost. DADA relaxes the discrete DA policy selection to a differentiable optimization problem via Gumbel-Softmax. In addition, we introduce an unbiased gradient estimator, RELAX, leading to an efficient and effective one-pass optimization strategy to learn an efficient and accurate DA policy. We conduct extensive experiments on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. Furthermore, we demonstrate the value of automatic DA in pre-training for downstream detection problems. Results show our DADA is at least one order of magnitude faster than the state of the art while achieving very comparable accuracy.
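A minimal sketch of the Gumbel-Softmax relaxation that makes the discrete augmentation choice differentiable, with a hypothetical list of candidate sub-policies; DADA's full search additionally learns magnitudes and uses the RELAX estimator.

import torch
import torch.nn.functional as F

def sample_augmented(x, policies, logits, tau=1.0):
    # Relax the categorical choice over sub-policies: with hard=True the
    # forward pass is one-hot, while gradients still flow to the logits.
    weights = F.gumbel_softmax(logits, tau=tau, hard=True)
    return sum(w * p(x) for w, p in zip(weights, policies))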
44. Rectifying Pseudo Label Learning via Uncertainty Estimation for Domain Adaptive Semantic Segmentation [PDF] 返回目录
Zhedong Zheng, Yi Yang
Abstract: This paper focuses on unsupervised domain adaptation, i.e. transferring knowledge from the source domain to the target domain, in the context of semantic segmentation. Existing approaches usually regard the pseudo label as the ground truth to fully exploit the unlabeled target-domain data. Yet the pseudo labels of the target-domain data are usually predicted by a model trained on the source domain. Thus, the generated labels inevitably contain incorrect predictions due to the discrepancy between the training domain and the test domain, which can be transferred to the final adapted model and largely compromise the training process. To overcome this problem, this paper proposes to explicitly estimate the prediction uncertainty during training to rectify pseudo label learning for unsupervised semantic segmentation adaptation. Given the input image, the model outputs the semantic segmentation prediction as well as the uncertainty of the prediction. Specifically, we model the uncertainty via the prediction variance and incorporate the uncertainty into the optimization objective. To verify the effectiveness of the proposed method, we evaluate it on two prevalent synthetic-to-real semantic segmentation benchmarks, i.e., GTA5 -> Cityscapes and SYNTHIA -> Cityscapes, as well as one cross-city benchmark, i.e., Cityscapes -> Oxford RobotCar. We demonstrate through extensive experiments that the proposed approach (1) dynamically sets different confidence thresholds according to the prediction variance, (2) rectifies the learning from noisy pseudo labels, and (3) achieves significant improvements over conventional pseudo label learning and yields competitive performance on all three benchmarks.
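A simplified reading of the uncertainty-rectified objective, assuming two hypothetical classifier heads whose prediction variance serves as the uncertainty estimate; uncertain pixels contribute less to the pseudo-label loss. The exact weighting in the paper may differ.

import torch
import torch.nn.functional as F

def rectified_pseudo_label_loss(logits_a, logits_b, pseudo_labels):
    # logits_*: (B, C, H, W); pseudo_labels: (B, H, W).
    var = (logits_a.softmax(1) - logits_b.softmax(1)).pow(2).sum(1)  # B,H,W
    ce = F.cross_entropy(logits_a, pseudo_labels, reduction='none')  # B,H,W
    # Down-weight uncertain pixels; the +var term keeps variance bounded.
    return (torch.exp(-var) * ce + var).mean()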
45. IMRAM: Iterative Matching with Recurrent Attention Memory for Cross-Modal Image-Text Retrieval [PDF] 返回目录
Hui Chen, Guiguang Ding, Xudong Liu, Zijia Lin, Ji Liu, Jungong Han
Abstract: Enabling bi-directional retrieval of images and texts is important for understanding the correspondence between vision and language. Existing methods leverage the attention mechanism to explore such correspondence in a fine-grained manner. However, most of them consider all semantics equally and thus align them uniformly, regardless of their diverse complexities. In fact, semantics are diverse (i.e. involving different kinds of semantic concepts), and humans usually follow a latent structure to combine them into understandable languages. It may be difficult for existing methods to optimally capture such sophisticated correspondences. In this paper, to address this deficiency, we propose an Iterative Matching with Recurrent Attention Memory (IMRAM) method, in which correspondences between images and texts are captured through multiple steps of alignment. Specifically, we introduce an iterative matching scheme to explore such fine-grained correspondence progressively. A memory distillation unit is used to refine alignment knowledge from early steps to later ones. Experimental results on three benchmark datasets, i.e. Flickr8K, Flickr30K, and MS COCO, show that our IMRAM achieves state-of-the-art performance, well demonstrating its effectiveness. Experiments on a practical business advertisement dataset, named \Ads{}, further validate the applicability of our method in practical scenarios.
46. Pixel-In-Pixel Net: Towards Efficient Facial Landmark Detection in the Wild [PDF] 返回目录
Haibo Jin, Shengcai Liao, Ling Shao
Abstract: Recently, heatmap-regression-based models have become popular because of their superior performance in locating facial landmarks. However, for such models, high-resolution feature maps have to be either generated repeatedly or maintained throughout the network, which is computationally inefficient for practical applications. Moreover, their generalization capabilities across domains are rarely explored. To address these two problems, we propose the Pixel-In-Pixel (PIP) Net for facial landmark detection. The proposed model is equipped with a novel detection head based on heatmap regression. Different from conventional heatmap regression, the new detection head conducts score prediction on low-resolution feature maps. To localize landmarks more precisely, it also conducts offset predictions within each heatmap pixel. By doing this, the inference time is largely reduced without losing accuracy. Besides, we also propose to leverage unlabeled images to improve the generalization capability of our model through image-translation-based data distillation. Extensive experiments on four benchmarks show that PIP Net is comparable to the state of the art while running at $27.8$ FPS on a CPU.
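Pixel-in-pixel decoding is straightforward to sketch. A minimal version, assuming per-landmark score and offset maps on the low-resolution grid and a hypothetical stride:

import torch

def decode_landmarks(scores, offset_x, offset_y, stride):
    # scores, offset_x, offset_y: (B, N, H, W) for N landmarks.
    b, n, h, w = scores.shape
    idx = scores.reshape(b, n, -1).argmax(dim=2)          # best-scoring cell
    ys = torch.div(idx, w, rounding_mode='floor')
    xs = idx % w
    ox = torch.gather(offset_x.reshape(b, n, -1), 2, idx.unsqueeze(2)).squeeze(2)
    oy = torch.gather(offset_y.reshape(b, n, -1), 2, idx.unsqueeze(2)).squeeze(2)
    # Cell coordinate plus within-cell offset, mapped back to image pixels.
    return torch.stack([(xs + ox) * stride, (ys + oy) * stride], dim=2)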
47. A Benchmark for Temporal Color Constancy [PDF] 返回目录
Yanlin Qian, Jani Käpylä, Joni-Kristian Kämäräinen, Samu Koskinen, Jiri Matas
Abstract: Temporal Color Constancy (CC) is a recently proposed approach that challenges conventional single-frame color constancy. The conventional approach is to use a single frame - the shot frame - to estimate the scene illumination color. In temporal CC, multiple frames from the viewfinder sequence are used to estimate the color. However, there are no realistic large-scale temporal color constancy datasets for method evaluation. In this work, a new temporal CC benchmark is introduced. The benchmark comprises (1) 600 real-world sequences recorded with a high-resolution mobile phone camera, (2) a fixed train-test split which ensures consistent evaluation, and (3) a baseline method which achieves high accuracy on the new benchmark and on the dataset used in previous works. Results for more than 20 well-known color constancy methods, including recent state-of-the-art approaches, are reported in our experiments.
48. Monocular 3D Object Detection in Cylindrical Images from Fisheye Cameras [PDF] 返回目录
Elad Plaut, Erez Ben Yaacov, Bat El Shlomo
Abstract: Detecting objects in 3D from a monocular camera has been successfully demonstrated using various methods based on convolutional neural networks. These methods have been demonstrated on rectilinear perspective images equivalent to being taken by a pinhole camera, whose geometry is explicitly or implicitly exploited. Such methods fail in images with alternative projections, such as those acquired by fisheye cameras, even when provided with a labeled training set of fisheye images and 3D bounding boxes. In this work, we show how to adapt existing 3D object detection methods to images from fisheye cameras, including in the case that no labeled fisheye data is available for training. We significantly outperform existing art on a benchmark of synthetic data, and we also experiment with an internal dataset of real fisheye images.
49. Domain-Specific Image Super-Resolution with Progressive Adversarial Network [PDF] 返回目录
Wong Lone, Zhao Deli, Wan Shaohua, Zhang Bo
Abstract: Single Image Super-Resolution (SISR) aims to improve the resolution of a small-size, low-quality image from a single input. With the popularity of consumer electronics in daily life, this topic has become more and more attractive. In this paper, we argue that the curse of dimensionality is the underlying reason limiting the performance of state-of-the-art algorithms. To address this issue, we propose the Progressive Adversarial Network (PAN), which is capable of coping with this difficulty for domain-specific image super-resolution. The key principle of PAN is that we do not apply any distance-based reconstruction errors as the loss to be optimized, and are thus free from the restriction of the curse of dimensionality. To maintain faithful reconstruction precision, we resort to U-Net and progressive growing of the neural architecture. With U-Net, the low-level features in the encoder can be transferred to the decoder to enhance textural details. Progressive growing enhances image resolution gradually, thereby preserving the precision of the recovered image. Moreover, to obtain high-fidelity outputs, we leverage the framework of the powerful StyleGAN to perform adversarial learning. Without the curse of dimensionality, our model can super-resolve large-size images with remarkable photo-realistic details and little distortion. Extensive experiments demonstrate the superiority of our algorithm over existing state-of-the-art methods, both quantitatively and qualitatively.
50. Better Captioning with Sequence-Level Exploration [PDF] 返回目录
Jia Chen, Qin Jin
Abstract: The sequence-level learning objective has been widely used in captioning tasks to achieve state-of-the-art performance for many models. In this objective, the model is trained by the reward on the quality of its generated captions (sequence-level). In this work, we show the limitation of the current sequence-level learning objective for captioning tasks from both theoretical and empirical perspectives. In theory, we show that the current objective is equivalent to optimizing only the precision side of the caption set generated by the model, and therefore overlooks the recall side. Empirical results show that a model trained by this objective tends to get lower scores on the recall side. We propose to add a sequence-level exploration term to the current objective to boost recall. It guides the model to explore more plausible captions during training. In this way, the proposed objective takes both the precision and recall sides of generated captions into account. Experiments show the effectiveness of the proposed method on both video and image captioning datasets.
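One illustrative way to combine a REINFORCE-style sequence-level loss with an exploration term; the alpha weight and the particular exploration term are assumptions for exposition, not the paper's exact formulation.

import torch

def sequence_level_loss(log_probs, rewards, alpha=0.1):
    # log_probs: (B, T) token log-probabilities of sampled captions;
    # rewards: (B,) caption-quality scores (e.g. CIDEr).
    seq_logp = log_probs.sum(dim=1)
    reinforce = -(rewards * seq_logp).mean()   # precision-oriented reward
    exploration = seq_logp.mean()              # penalize over-concentration
    return reinforce + alpha * exploration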
51. A Multi-scale CNN-CRF Framework for Environmental Microorganism Image Segmentation [PDF] 返回目录
Jinghua Zhang, Chen Li, Frank Kulwa, Xin Zhao, Changhao Sun, Zihan Li, Tao Jiang, Hong Li
Abstract: To assist researchers in identifying Environmental Microorganisms (EMs) effectively, a Multi-scale CNN-CRF (MSCC) framework for EM image segmentation is proposed in this paper. The framework has two parts: The first is a novel pixel-level segmentation approach, using a newly introduced Convolutional Neural Network (CNN), namely "mU-Net-B3", with dense Conditional Random Field (CRF) post-processing. The second is a VGG-16 based patch-level segmentation method with a novel "buffer" strategy, which further improves the segmentation quality of the details of the EMs. In the experiments, compared with state-of-the-art methods on 420 EM images, the proposed MSCC method reduces the memory requirement from 355 MB to 103 MB, improves the overall evaluation indexes (Dice, Jaccard, Recall, Accuracy) from 85.24%, 77.42%, 82.27% and 96.76% to 87.13%, 79.74%, 87.12% and 96.91% respectively, and reduces the volume overlap error from 22.58% to 20.26%. Therefore, the MSCC method shows great potential in the EM segmentation field.
52. Object-Oriented Video Captioning with Temporal Graph and Prior Knowledge Building [PDF] 返回目录
Fangyi Zhu, Jenq-Neng Hwang, Zhanyu Ma, Guo Jun
Abstract: Traditional video captioning requires a holistic description of the video, yet detailed descriptions of specific objects may not be available. Moreover, most methods adopt frame-level inter-object features and ambiguous descriptions during training, which makes it difficult to learn vision-language relationships. Without associating the transition trajectories, these image-based methods cannot understand activities from visual features alone. We propose a novel task, named object-oriented video captioning, which focuses on understanding videos at the object level. We re-annotate the object-sentence pairs for more effective cross-modal learning. Thereafter, we design the video-based object-oriented video captioning network (OVC-Net) to reliably analyze activities over time using only visual features and to capture vision-language connections stably on small datasets. To demonstrate the effectiveness, we evaluate the method on the new dataset and compare it with the state-of-the-art methods for video captioning. The experimental results show that OVC-Net can precisely describe concurrent objects and their activities in detail.
53. Meta3D: Single-View 3D Object Reconstruction from Shape Priors in Memory [PDF] 返回目录
Shuo Yang, Min Xu, Hongxun Yao
Abstract: 3D shape reconstruction from a single-view RGB image is an ill-posed problem due to the invisible parts of the object to be reconstructed. Most existing methods rely on large-scale data to obtain shape priors through tuning the parameters of reconstruction models. These methods may not be able to deal with cases involving heavy object occlusion and noisy backgrounds, since prior information cannot be retained completely or applied efficiently. In this paper, we are the first to develop a memory-based meta-learning framework for single-view 3D reconstruction. A write controller is designed to extract shape-discriminative features from images and store image features and their corresponding volumes into external memory. A read controller is proposed to sequentially encode shape priors related to the input image and predict a shape-specific refiner. Experimental results demonstrate that our Meta3D outperforms state-of-the-art methods by a large margin through retaining shape priors explicitly, especially for extremely difficult cases.
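A minimal sketch of an attention-based read from an external shape memory might look as follows; the slot count, dimensionality, and single-head dot-product design are assumptions, and the paper's read controller additionally predicts a shape-specific refiner:

```python
import torch
import torch.nn.functional as F

class MemoryReader(torch.nn.Module):
    """Attention-based read from an external shape memory (a sketch)."""
    def __init__(self, feat_dim=256, num_slots=128, mem_dim=256):
        super().__init__()
        # Learnable keys/values standing in for stored image features / volumes.
        self.keys = torch.nn.Parameter(torch.randn(num_slots, feat_dim))
        self.values = torch.nn.Parameter(torch.randn(num_slots, mem_dim))

    def forward(self, image_feat):                            # (batch, feat_dim)
        attn = F.softmax(image_feat @ self.keys.t(), dim=-1)  # (batch, num_slots)
        return attn @ self.values          # retrieved shape prior, (batch, mem_dim)
```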
54. Trajectory Grouping with Curvature Regularization for Tubular Structure Tracking [PDF] 返回目录
Li Liu, Jiong Zhang, Da Chen, Huazhong Shu, Laurent D. Cohen
Abstract: Tubular structure tracking is an important and difficult problem in the fields of computer vision and medical image analysis. Minimal path models have exhibited their power in tracing tubular structures, whereby a centerline can be naturally treated as a minimal path under a suitable geodesic metric. However, existing minimal path-based tubular structure tracing models still suffer from difficulties such as shortcut and short-branch combination problems, especially when dealing with images with complicated backgrounds. We introduce a new minimal path-based model for minimally interactive tubular structure centerline extraction in conjunction with a perceptual grouping scheme. We take into account the prescribed tubular trajectories and the relevant curvature-penalized geodesic distances for minimal path extraction in a graph-based optimization framework. Experimental results on both synthetic and real images show that the proposed model indeed outperforms state-of-the-art minimal path-based tubular structure tracing algorithms.
55. Adaptive Semantic-Visual Tree for Hierarchical Embeddings [PDF] 返回目录
Shuo Yang, Wei Yu, Ying Zheng, Hongxun Yao, Tao Mei
Abstract: Merchandise categories inherently form a semantic hierarchy with different levels of concept abstraction, especially for fine-grained categories. This hierarchy encodes rich correlations among various categories across different levels, which can effectively regularize the semantic space and thus make predictions less ambiguous. However, previous studies of fine-grained image retrieval focus primarily on semantic similarities or visual similarities. In a real application, merely using visual similarity may not satisfy consumers' need to search merchandise with real-life images: e.g., given a red coat as a query image, we might get a red suit in the recall results based only on visual similarity, since the two are visually similar; but the user actually wants a coat rather than a suit, even if the coat has different color or texture attributes. We introduce this new problem based on photo shopping in real practice. Semantic information is therefore integrated to regularize the margins and make "semantic" take precedence over "visual". To solve this new problem, we propose a hierarchical adaptive semantic-visual tree (ASVT) to depict the architecture of merchandise categories, which simultaneously evaluates semantic similarities between different semantic levels and visual similarities within the same semantic class. The semantic information satisfies consumers' demand for merchandise similar to the query, while the visual information optimizes the correlations within the semantic class. At each level, we set different margins based on the semantic hierarchy and incorporate them as prior information to learn a fine-grained feature embedding. To evaluate our framework, we propose a new dataset named JDProduct, with hierarchical labels collected from actual image queries and official merchandise images on an online shopping application. Extensive experimental results on the public CARS196 and CUB
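The level-dependent margins can be illustrated with a triplet loss whose margin grows with the semantic level at which the negative differs from the anchor. A hedged PyTorch sketch follows; the margin values and the three-level hierarchy are assumptions:

```python
import torch
import torch.nn.functional as F

def hierarchical_triplet_loss(anchor, positive, negative, levels,
                              margins=(0.2, 0.4, 0.8)):
    """Triplet loss with a per-sample margin chosen from the semantic level
    at which the negative differs from the anchor (0 = finest level).
    levels: iterable of ints indexing into `margins`."""
    m = torch.tensor([margins[int(l)] for l in levels], device=anchor.device)
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    # Negatives that differ at a coarser level must be pushed further away.
    return F.relu(d_pos - d_neg + m).mean()
```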
56. Transferring Cross-domain Knowledge for Video Sign Language Recognition [PDF] 返回目录
Dongxu Li, Xin Yu, Chenchen Xu, Lars Petersson, Hongdong Li
Abstract: Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation. It requires models to recognize isolated sign words from videos. However, annotating WSLR data requires expert knowledge, thus limiting WSLR dataset acquisition. In contrast, abundant subtitled sign news videos are available on the internet. Since these videos have no word-level annotations and exhibit a large domain gap from isolated signs, they cannot be directly used for training WSLR models. We observe that despite the large domain gap, isolated and news signs share the same visual concepts, such as hand gestures and body movements. Motivated by this observation, we propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news signs to them. To this end, we extract news signs using a base WSLR model, and then design a classifier jointly trained on news and isolated signs to coarsely align the two domains' features. In order to learn domain-invariant features within each class and suppress domain-specific features, our method further resorts to an external memory to store the class centroids of the aligned news signs. We then design a temporal attention based on the learnt descriptor to improve recognition performance. Experimental results on standard WSLR datasets show that our method outperforms previous state-of-the-art methods significantly. We also demonstrate the effectiveness of our method in automatically localizing signs from sign news, achieving 28.1 for AP@0.5.
57. Unifying Specialist Image Embedding into Universal Image Embedding [PDF] 返回目录
Yang Feng, Futang Peng, Xu Zhang, Wei Zhu, Shanfeng Zhang, Howard Zhou, Zhen Li, Tom Duerig, Shih-Fu Chang, Jiebo Luo
Abstract: Deep image embedding provides a way to measure the semantic similarity of two images. It plays a central role in many applications such as image search, face verification, and zero-shot learning. It is desirable to have a universal deep embedding model applicable to various domains of images. However, existing methods mainly rely on training specialist embedding models each of which is applicable to images from a single domain. In this paper, we study an important but unexplored task: how to train a single universal image embedding model to match the performance of several specialists on each specialist's domain. Simply fusing the training data from multiple domains cannot solve this problem because some domains become overfitted sooner when trained together using existing methods. Therefore, we propose to distill the knowledge in multiple specialists into a universal embedding to solve this problem. In contrast to existing embedding distillation methods that distill the absolute distances between images, we transform the absolute distances between images into a probabilistic distribution and minimize the KL-divergence between the distributions of the specialists and the universal embedding. Using several public datasets, we validate that our proposed method accomplishes the goal of universal image embedding.
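The distance-to-distribution idea can be sketched in a few lines: within a batch, pairwise distances are turned into a softmax distribution, and the universal model matches the specialist's distribution under a KL-divergence. The temperature and the Euclidean distance below are assumptions:

```python
import torch
import torch.nn.functional as F

def relational_kd_loss(student_emb, teacher_emb, temperature=1.0):
    """KL between in-batch pairwise-distance distributions of the specialist
    (teacher) and the universal model (student); a hedged sketch."""
    def dist_distribution(x):
        d = torch.cdist(x, x)                                # pairwise L2 distances
        d = d + 1e6 * torch.eye(x.size(0), device=x.device)  # exclude self-pairs
        return F.softmax(-d / temperature, dim=-1)           # closer = more probable
    p_teacher = dist_distribution(teacher_emb).detach()
    log_q_student = torch.log(dist_distribution(student_emb) + 1e-8)
    return F.kl_div(log_q_student, p_teacher, reduction='batchmean')
```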
58. Adaptive Offline Quintuplet Loss for Image-Text Matching [PDF] 返回目录
Tianlang Chen, Jiajun Deng, Jiebo Luo
Abstract: Existing image-text matching approaches typically leverage triplet loss with online hard negatives to train the model. For each image or text anchor in a training mini-batch, the model is trained to distinguish between a positive and the most confusing negative of the anchor mined from the mini-batch (i.e. online hard negative). This strategy improves the model's capacity to discover fine-grained correspondences and non-correspondences between image and text inputs. However, the above training approach has the following drawbacks: (1) the negative selection strategy still provides limited chances for the model to learn from very hard-to-distinguish cases. (2) The trained model has weak generalization capability from the training set to the testing set. (3) The penalty lacks hierarchy and adaptiveness for hard negatives with different "hardness" degrees. In this paper, we propose solutions by sampling negatives offline from the whole training set. It provides "harder" offline negatives than online hard negatives for the model to distinguish. Based on the offline hard negatives, a quintuplet loss is proposed to improve the model's generalization capability to distinguish positives and negatives. In addition, a novel loss function that combines the knowledge of positives, offline hard negatives and online hard negatives is created. It leverages offline hard negatives as intermediary to adaptively penalize them based on their distance relations to the anchor. We evaluate the proposed training approach on three state-of-the-art image-text models on the MS-COCO and Flickr30K datasets. Significant performance improvements are observed for all the models, demonstrating the effectiveness and generality of the proposed approach.
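A hedged sketch of a quintuplet-style ranking loss, with one positive and three negatives of increasing hardness under hierarchically increasing margins; the exact composition of the quintuplet and the adaptive weighting in the paper may differ:

```python
import torch.nn.functional as F

def quintuplet_loss(anchor, positive, neg_online, neg_off_hard, neg_off_harder,
                    m1=0.2, m2=0.1, m3=0.05):
    """Ranking loss over a quintuplet: anchor, positive, and three negatives
    of increasing hardness; harder negatives get larger cumulative margins.
    All margin values are illustrative assumptions."""
    d = F.pairwise_distance
    loss = F.relu(d(anchor, positive) - d(anchor, neg_online) + m1)
    loss = loss + F.relu(d(anchor, positive) - d(anchor, neg_off_hard) + m1 + m2)
    loss = loss + F.relu(d(anchor, positive) - d(anchor, neg_off_harder)
                         + m1 + m2 + m3)
    return loss.mean()
```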
59. SalsaNext: Fast Semantic Segmentation of LiDAR Point Clouds for Autonomous Driving [PDF] 返回目录
Tiago Cortinhal, George Tzelepis, Eren Erdal Aksoy
Abstract: In this paper, we introduce SalsaNext for the semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet [1], which has an encoder-decoder architecture where the encoder unit has a set of ResNet blocks and the decoder part combines upsampled features from the residual blocks. In contrast to SalsaNet, we have an additional layer in the encoder and decoder, introduce the context module, switch from strided convolution to average pooling, and also apply central dropout treatment. To directly optimize the Jaccard index, we further combine the weighted cross-entropy loss with the Lovasz-Softmax loss [2]. We provide a thorough quantitative evaluation on the Semantic-KITTI dataset [3], which demonstrates that the proposed SalsaNext outperforms other state-of-the-art semantic segmentation networks in terms of accuracy and computation time. We also release our source code at this https URL.
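The combined objective can be sketched as a weighted sum; `lovasz_softmax` is assumed to be an existing implementation such as the reference code of [2], and the weighting factor `alpha` is an assumption:

```python
import torch.nn.functional as F

def combined_loss(logits, target, class_weights, lovasz_softmax, alpha=1.0):
    """Weighted cross-entropy plus Lovasz-Softmax [2]; a hedged sketch.
    logits: (N, C, H, W); target: (N, H, W) class indices."""
    wce = F.cross_entropy(logits, target, weight=class_weights)
    # `lovasz_softmax` is an assumed external implementation that takes
    # class probabilities and integer labels.
    lov = lovasz_softmax(F.softmax(logits, dim=1), target)
    return wce + alpha * lov
```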
60. Inferring Spatial Uncertainty in Object Detection [PDF] 返回目录
Zining Wang, Di Feng, Yiyang Zhou, Wei Zhan, Lars Rosenbaum, Fabian Timm, Klaus Dietmayer, Masayoshi Tomizuka
Abstract: The availability of real-world datasets is a prerequisite for developing object detection methods for autonomous driving. While ambiguity exists in object labels due to the error-prone annotation process or sensor observation noise, current object detection datasets provide only deterministic annotations, without considering their uncertainty. This precludes an in-depth evaluation among different object detection methods, especially for those that explicitly model predictive probability. In this work, we propose a generative model to estimate bounding box label uncertainties from LiDAR point clouds, and define a new representation of the probabilistic bounding box through spatial distribution. Comprehensive experiments show that the proposed model represents uncertainties commonly seen in driving scenarios. Based on the spatial distribution, we further propose an extension of IoU, called the Jaccard IoU (JIoU), as a new evaluation metric that incorporates label uncertainty. Experiments on the KITTI and Waymo Open Datasets show that JIoU is superior to IoU when evaluating probabilistic object detectors.
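One plausible discretization of an IoU between two spatial probability distributions takes element-wise min and max on a shared grid; the paper's exact formulation may differ:

```python
import numpy as np

def jaccard_iou(p, q):
    """IoU generalized to two non-negative spatial distributions on one grid:
    'intersection' and 'union' via element-wise min and max.
    For {0, 1} masks this reduces to the ordinary IoU."""
    return np.minimum(p, q).sum() / np.maximum(p, q).sum()
```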
61. AlphaNet: An Attention Guided Deep Network for Automatic Image Matting [PDF] 返回目录
Rishab Sharma, Rahul Deora, Anirudha Vishvakarma
Abstract: In this paper, we propose an end-to-end solution for image matting, i.e., high-precision extraction of foreground objects from natural images. Image matting and background detection can be achieved easily through chroma keying in a studio setting when the background is pure green or blue. Nonetheless, image matting in natural scenes with complex and uneven-depth backgrounds remains a tedious task that requires human intervention. To achieve fully automatic foreground extraction in natural scenes, we propose a method that assimilates semantic segmentation and deep image matting processes into a single network to generate detailed semantic mattes for the image composition task. The contribution of our proposed method is two-fold: firstly, it can be interpreted as a fully automated semantic image matting method, and secondly, as a refinement of existing semantic segmentation models. We propose a novel model architecture as a combination of segmentation and matting that unifies the function of upsampling and downsampling operators with the notion of attention. As shown in our work, attention-guided downsampling and upsampling can extract high-quality boundary details, unlike normal downsampling and upsampling techniques. To achieve this, we utilize an attention-guided encoder-decoder framework that learns, without supervision, to generate an attention map adaptively from the data to direct the upsampling and downsampling operators. We also construct a fashion e-commerce focused dataset with high-quality alpha mattes to facilitate training and evaluation for image matting.
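Attention-guided downsampling can be sketched as a learned spatial gate applied before pooling; the kernel size, sigmoid gate, and pooling factor below are assumptions:

```python
import torch.nn as nn

class AttentionGuidedDownsample(nn.Module):
    """A learned spatial attention map gates features before average pooling,
    so boundary detail can be emphasized; a minimal sketch."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid())
        self.pool = nn.AvgPool2d(2)

    def forward(self, x):                    # x: (batch, channels, H, W)
        return self.pool(x * self.attn(x))   # gate, then downsample by 2
```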
62. DASNet: Dual attentive fully convolutional siamese networks for change detection of high resolution satellite images [PDF] 返回目录
Jie Chen, Ziyang Yuan, Jian Peng, Li Chen, Haozhe Huang, Jiawei Zhu, Tao Lin, Haifeng Li
Abstract: Change detection is a basic task of remote sensing image processing. The research objective is to identify the change information of interest and filter out irrelevant change information as interference factors. Recently, the rise of deep learning has provided new tools for change detection, which have yielded impressive results. However, the available methods focus mainly on the difference information between multitemporal remote sensing images and lack robustness to pseudo-change information. To overcome the lack of resistance of current methods to pseudo-changes, in this paper we propose a new method, namely dual attentive fully convolutional Siamese networks (DASNet), for change detection in high-resolution images. Through the dual attention mechanism, long-range dependencies are captured to obtain more discriminative feature representations and enhance the recognition performance of the model. Moreover, sample imbalance is a serious problem in change detection: unchanged samples far outnumber changed samples, which is one of the main causes of pseudo-changes. We put forward the weighted double-margin contrastive loss to address this problem by penalizing attention to unchanged feature pairs and increasing attention to changed feature pairs. The experimental results of our method on the change detection dataset (CDD) and the building change detection dataset (BCDD) demonstrate that, compared with other baseline methods, the proposed method realizes maximum improvements of 2.1% and 3.6%, respectively, in the F1 score. Our PyTorch implementation is available at this https URL.
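A hedged sketch of a weighted double-margin contrastive loss: unchanged pairs are penalized only beyond a small margin and down-weighted, while changed pairs are pushed apart beyond a larger margin and up-weighted. All margin and weight values are illustrative assumptions:

```python
import torch.nn.functional as F

def weighted_double_margin_loss(dist, changed, m1=0.3, m2=2.2, w_u=0.3, w_c=1.0):
    """dist: (N,) distances between feature pairs; changed: (N,) floats in {0, 1}.
    Unchanged pairs are penalized only beyond margin m1 (weight w_u);
    changed pairs are pushed apart to at least m2 (weight w_c)."""
    loss_unchanged = w_u * (1.0 - changed) * F.relu(dist - m1).pow(2)
    loss_changed = w_c * changed * F.relu(m2 - dist).pow(2)
    return (loss_unchanged + loss_changed).mean()
```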
63. Generative Low-bitwidth Data Free Quantization [PDF] 返回目录
Shoukai Xu, Haokun Li, Bohan Zhuang, Jing Liu, Jiezhang Cao, Chuangrun Liang, Mingkui Tan
Abstract: Neural network quantization is an effective way to compress deep models and improve execution latency and energy efficiency, so that they can be deployed on mobile or embedded devices. Existing quantization methods require the original data for calibration or fine-tuning to obtain better performance. However, in many real-world scenarios, the data may not be available due to confidentiality or privacy concerns, making existing quantization methods inapplicable. Moreover, in the absence of original data, the recently developed generative adversarial networks (GANs) cannot be applied to generate data. Although the full-precision model may contain the entire data information, such information alone is hard to exploit for recovering the original data or generating new meaningful data. In this paper, we investigate a simple-yet-effective method called Generative Low-bitwidth Data Free Quantization to remove the data-dependence burden. Specifically, we propose a Knowledge Matching Generator to produce meaningful fake data by exploiting classification boundary knowledge and distribution information in the pre-trained model. With the help of the generated data, we are able to quantize a model by learning knowledge from the pre-trained model. Extensive experiments on three datasets demonstrate the effectiveness of our method. More critically, our method achieves much higher accuracy on 4-bit quantization than the existing data-free quantization method.
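One generator update under this scheme can be sketched as follows; the conditional generator interface and the shapes are assumptions, and the distribution-matching part of the paper (e.g. matching batch-normalization statistics) is omitted:

```python
import torch
import torch.nn.functional as F

def generator_step(generator, teacher, batch_size=64, z_dim=100, num_classes=10):
    """One training step for a knowledge-matching generator: fake images
    should be classified confidently as their sampled labels by the frozen
    full-precision teacher; a hedged sketch."""
    z = torch.randn(batch_size, z_dim)              # latent noise
    y = torch.randint(num_classes, (batch_size,))   # sampled fake labels
    fake_images = generator(z, y)                   # assumed conditional generator
    logits = teacher(fake_images)                   # frozen pre-trained model
    return F.cross_entropy(logits, y)               # boundary/label-matching loss
```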
64. StyleGAN2 Distillation for Feed-forward Image Manipulation [PDF] 返回目录
Yuri Viazovetskyi, Vladimir Ivashkin, Evgeny Kashin
Abstract: StyleGAN2 is a state-of-the-art network in generating realistic images. Besides, it was explicitly trained to have disentangled directions in latent space, which allows efficient image manipulation by varying latent factors. Editing existing images requires embedding a given image into the latent space of StyleGAN2. Latent code optimization via backpropagation is commonly used for qualitative embedding of real world images, although it is prohibitively slow for many applications. We propose a way to distill a particular image manipulation of StyleGAN2 into image-to-image network trained in paired way. The resulting pipeline is an alternative to existing GANs, trained on unpaired data. We provide results of human faces' transformation: gender swap, aging/rejuvenation, style transfer and image morphing. We show that the quality of generation using our method is comparable to StyleGAN2 backpropagation and current state-of-the-art methods in these particular tasks.
65. CPM R-CNN: Calibrating Point-guided Misalignment in Object Detection [PDF] 返回目录
Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu, Mengjie Hu
Abstract: In object detection, offset-guided and point-guided regression dominate anchor-based and anchor-free methods, respectively. Recently, the point-guided approach has been introduced to anchor-based methods. However, we observe that points predicted in this way are misaligned with the matched proposal regions and the localization scores, causing a notable gap in performance. In this paper, we propose CPM R-CNN, which contains three efficient modules to optimize the anchor-based point-guided method. Extensive evaluations on the COCO dataset demonstrate that CPM R-CNN efficiently improves localization accuracy by calibrating the aforementioned misalignment. Compared with Faster R-CNN and Grid R-CNN based on ResNet-101 with FPN, our approach substantially improves detection mAP by 3.3% and 1.5% respectively, without bells and whistles. Moreover, our best model achieves a large improvement, reaching 49.9% on COCO test-dev. Code and models will be publicly available.
66. Crowd Counting via Hierarchical Scale Recalibration Network [PDF] 返回目录
Zhikang Zou, Yifan Liu, Shuangjie Xu, Wei Wei, Shiping Wen, Pan Zhou
Abstract: The task of crowd counting is extremely challenging due to compounded difficulties, especially the huge variation in visual scale. Previous works tend to adopt a naive concatenation of multi-scale information to tackle this, while the scale shifts between feature maps are ignored. In this paper, we propose a novel Hierarchical Scale Recalibration Network (HSRNet), which addresses the above issues by modeling rich contextual dependencies and recalibrating multiple scale-associated information. Specifically, a Scale Focus Module (SFM) first integrates global context into local features by sequentially modeling the semantic inter-dependencies along the channel and spatial dimensions. In order to reallocate channel-wise feature responses, a Scale Recalibration Module (SRM) adopts a step-by-step fusion to generate the final density maps. Furthermore, we propose a novel Scale Consistency loss to constrain the scale-associated outputs to be coherent with the ground truth at different scales. With the proposed modules, our approach can selectively ignore various noises and automatically focus on appropriate crowd scales. Extensive experiments on crowd counting datasets (ShanghaiTech, MALL, WorldEXPO'10, and UCSD) show that our HSRNet delivers superior results over all state-of-the-art approaches. More remarkably, we extend experiments to an extra vehicle dataset, whose results indicate that the proposed model generalizes to other applications.
67. TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation [PDF] 返回目录
Wen Wang, Xiaojiang Peng, Yanzhou Su, Yu Qiao, Jian Cheng
Abstract: Video action anticipation aims to predict future action categories from observed frames. Current state-of-the-art approaches mainly resort to recurrent neural networks to encode history information into hidden states, and predict future actions from the hidden representations. It is well known that the recurrent pipeline is inefficient in capturing long-term information, which may limit its performance in the prediction task. To address this problem, this paper proposes a simple yet efficient Temporal Transformer with Progressive Prediction (TTPP) framework, which repurposes a Transformer-style architecture to aggregate observed features, and then leverages a lightweight network to progressively predict future features and actions. Specifically, predicted features along with predicted probabilities are accumulated into the inputs of subsequent prediction. We evaluate our approach on three action datasets, namely TVSeries, THUMOS-14, and TV-Human-Interaction. Additionally, we conduct a comprehensive study of several popular aggregation and prediction strategies. Extensive results show that TTPP not only outperforms the state-of-the-art methods but is also more efficient.
摘要:视频动作预期目标预测从观察到的帧的未来行动的类别。当前国家的最先进的办法主要是采取经常性的神经网络来编码历史信息隐入状态,并预测从隐藏表示未来的行动。众所周知的是,经常性的管道是在捕捉这可能会限制其在预测任务绩效长期信息效率低下。为了解决这个问题,提出了一种简单而具有逐行预测(TTPP)框架,该框架repurposes一个变压器式的架构,以聚合观察到的特征,然后利用一个重量轻的网络逐步预测未来的特征和动作高效时间转换。具体而言,预测概率沿预测特征累积到随后的预测的输入端。我们评估我们的三个动作的数据集,即TVSeries方法,THUMOS-14,和TV-人机交互。此外,我们还为一些流行的聚集和预测策略进行了全面的研究。大量的研究结果表明TTPP不仅优于国家的最先进的方法,而且效率更高。
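The progressive prediction loop of TTPP — aggregate observed features with a Transformer, then feed each predicted feature back into the next step's input — could be sketched as follows. This is a hypothetical simplification: the layer sizes, the two-layer encoder, and feeding back only features (rather than features plus probabilities, as the abstract states) are our assumptions.

import torch
import torch.nn as nn

class ProgressivePredictor(nn.Module):
    """Illustrative sketch of TTPP-style progressive prediction: each predicted
    feature is appended to the inputs of the next prediction step."""
    def __init__(self, dim=256, num_classes=31, steps=4):
        super().__init__()
        self.steps = steps
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.feat_head = nn.Linear(dim, dim)        # predicts the next feature
        self.cls_head = nn.Linear(dim, num_classes) # predicts its action label

    def forward(self, observed):                    # observed: (B, T, dim)
        seq, outputs = observed, []
        for _ in range(self.steps):
            h = self.encoder(seq)[:, -1]            # aggregate history
            feat = self.feat_head(h)
            outputs.append(self.cls_head(feat))
            # accumulate the prediction into the inputs of the next step
            seq = torch.cat([seq, feat.unsqueeze(1)], dim=1)
        return outputs                              # one logits tensor per future step

preds = ProgressivePredictor()(torch.randn(2, 8, 256))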
68. MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision [PDF] 返回目录
Tingbo Hou, Adel Ahmadyan, Liangkai Zhang, Jianing Wei, Matthias Grundmann
Abstract: In this paper, we address the problem of detecting unseen objects from RGB images and estimating their poses in 3D. We propose two mobile-friendly networks: MobilePose-Base and MobilePose-Shape. The former is used when there is only pose supervision, and the latter is for the case when shape supervision is available, even a weak one. We revisit shape features used in previous methods, including segmentation and coordinate maps. We explain when and why pixel-level shape supervision can improve pose estimation. Consequently, we add shape prediction as an intermediate layer in MobilePose-Shape, and let the network learn pose from shape. Our models are trained on mixed real and synthetic data, with weak and noisy shape supervision. They are ultra-lightweight and can run in real time on modern mobile devices (e.g., 36 FPS on a Galaxy S20). Compared with previous single-shot solutions, our method achieves higher accuracy while using a significantly smaller model (2~3% in model size or number of parameters).
摘要:在本文中,我们解决从RGB图像检测看不见的对象和在3D中估计其姿势的问题。我们提出了两个移动网络友好:MobilePose-Base和MobilePose形。有当只有姿势监督,后者是针对情况下形状监督是可用的,即使是微弱的一个,前者使用。我们回访形状特征在以前的方法,包括分割和坐标图中。我们解释了什么时候,为什么像素水平形状监督可以提高姿态估计。因此,我们添加形状预测如在MobilePose-形状的中间层,并让网络从形状学习姿势。我们的模型进行培训,对混合真实和合成数据,以微弱和嘈杂的形状监督。他们是超轻,可实时对现代移动设备(例如36 FPS银河S20)运行。与先前的单次比较的解决方案,我们的方法具有更高的精度,而使用较小的显著模型(模型尺寸或参数编号为2〜3%)。
69. Distilling portable Generative Adversarial Networks for Image Translation [PDF] 返回目录
Hanting Chen, Yunhe Wang, Han Shu, Changyuan Wen, Chunjing Xu, Boxin Shi, Chao Xu, Chang Xu
Abstract: Although Generative Adversarial Networks (GANs) have been widely used in various image-to-image translation tasks, they can hardly be applied on mobile devices due to their heavy computation and storage cost. Traditional network compression methods focus on visual recognition tasks, but never deal with generation tasks. Inspired by knowledge distillation, a student generator with fewer parameters is trained by inheriting the low-level and high-level information from the original heavy teacher generator. To promote the capability of the student generator, we include a student discriminator to measure the distances between real images and images generated by the student and teacher generators. An adversarial learning process is therefore established to optimize the student generator and the student discriminator. Qualitative and quantitative analysis through experiments on benchmark datasets demonstrates that the proposed method can learn portable generative models with strong performance.
摘要:尽管剖成对抗性网络(甘斯)已经在各种图像到影像转换任务得到了广泛应用,它们可以几乎适用于移动设备由于其繁重的计算和存储成本。传统的网络压缩方法集中在视觉识别任务,但从来没有与生成任务处理。通过知识蒸馏启发,参数较少的学生发生器通过继承从原来的重老师产生低级和高级信息培训。为了促进学生发电机的能力,我们包括由学生和老师发生器产生一个学生鉴别衡量实际图像之间的距离,和图像。因此对抗性的学习过程,建立以优化学生的发电机和学生鉴别。通过开展对标准数据集实验定性和定量分析表明,该方法可以学习便携式生成模型具有很强的功能。
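A hedged sketch of the generator-side objective this abstract describes: the student imitates the teacher's translations while a student discriminator supplies an adversarial signal. The L1 imitation term, the loss weight and all names are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def student_generator_loss(x, teacher_G, student_G, student_D, lam=1.0):
    """Distillation for image translation: inherit the teacher's outputs
    (imitation term) while fooling a student discriminator. lam is arbitrary."""
    with torch.no_grad():
        y_teacher = teacher_G(x)           # heavy teacher, frozen
    y_student = student_G(x)               # compact student
    distill = F.l1_loss(y_student, y_teacher)
    logits = student_D(y_student)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return distill + lam * adv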
70. Cross-modal Learning for Multi-modal Video Categorization [PDF] 返回目录
Palash Goyal, Saurabh Sahu, Shalini Ghosh, Chul Lee
Abstract: Multi-modal machine learning (ML) models can process data in multiple modalities (e.g., video, audio, text) and are useful for video content analysis in a variety of problems (e.g., object detection, scene understanding, activity recognition). In this paper, we focus on the problem of video categorization using a multi-modal ML technique. In particular, we have developed a novel multi-modal ML approach that we call "cross-modal learning", where one modality influences another but only when there is correlation between the modalities --- for that, we first train a correlation tower that guides the main multi-modal video categorization tower in the model. We show how this cross-modal principle can be applied to different types of models (e.g., RNN, Transformer, NetVLAD), and demonstrate through experiments how our proposed multi-modal video categorization models with cross-modal learning outperform strong state-of-the-art baseline models.
摘要:多模态机器学习(ML)的模型可以处理多模态数据(例如,视频,音频,文本),并在各种问题的视频内容分析是有用的(例如,对象检测,场景的理解,行为识别) 。在本文中,我们专注于视频的分类问题采用多模态ML技术。特别是,我们已经开发出一种新型的多模ML的方法,我们称之为“跨模态学习”,其中一个方式影响另一个但只有当存在方式之间的关系---为此,我们首先培养的相关高塔在引导模式主体的多模态视频分类塔。我们表明,该跨模式的原则是如何被应用到不同类型的模型(例如,RNN,变压器,NetVLAD),并通过实验演示如何跨模态学习我们提出的多模态视频分类模型在性能强大的国有的最先进的基本模式。
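One plausible reading of the correlation tower — letting one modality influence another only in proportion to a learned correlation score — is a gated fusion like the sketch below. This is a guess at the mechanism from the abstract's wording, not the authors' architecture.

import torch
import torch.nn as nn

class CorrelationGatedFusion(nn.Module):
    """Sketch of correlation-gated cross-modal influence: modality B
    contributes to modality A's features only in proportion to a
    predicted correlation between the two."""
    def __init__(self, dim):
        super().__init__()
        self.correlation = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, feat_a, feat_b):              # (B, dim) each
        c = self.correlation(torch.cat([feat_a, feat_b], dim=1))  # (B, 1)
        return feat_a + c * self.proj(feat_b)       # c ~ 0: no cross-modal flow

fused = CorrelationGatedFusion(128)(torch.randn(2, 128), torch.randn(2, 128))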
71. Weight mechanism adding a constant in concatenation of series connect [PDF] 返回目录
Xiaojie Qi, Yindi Zhao
Abstract: It is a consensus that feature maps in the shallow layers are more related to image attributes such as texture and shape, whereas abstract semantic representation exists in the deep layers. Meanwhile, some image information is lost in the process of the convolution operation. Naturally, the direct method is to combine them together to regain the lost detailed information through concatenation or adding. In fact, the image representation flowing through feature fusion cannot match the semantic representation completely, and the semantic deviation between different layers also harms the information purification, which leads to useless information being mixed into the fusion layers. Therefore, it is crucial to narrow the gap among the fused layers and reduce the impact of noises during fusion. In this paper, we propose a method named weight mechanism to reduce the gap between feature maps in the concatenation of series connections, and we obtain a 0.80% mIoU improvement on the Massachusetts building dataset by changing the weight of the series-connection concatenation in a residual U-Net. Specifically, we design a new architecture named fused U-Net to test the weight mechanism, and it also gains a 0.12% mIoU improvement.
摘要:这是一个共识,即在浅层特征图更相关的图像属性,如纹理和形状,而在深层存在抽象语义表示。同时,一些图像信息将在卷积运算的过程中丢失。自然地,直接法他们结合在一起,以获得通过级联或添加丢失的详细信息。事实上,图像表示在特征融合流入不能与语义表示完全匹配,并且在不同层中的语义偏差也会破坏信息纯化,导致无用的信息混入熔融层。因此,关键的是要缩小熔合层之间的间隙并降低熔融时的噪声的影响。在本文中,我们提出了一个方法命名权重机制,减少缝隙之间串联连接的串联特征图,我们得到的0.80%米欧改善更好的结果在美国马萨诸塞州通过改变串联连接的串联的重构建数据集残余U形网。具体来说,我们设计了一个名为融合掌中宽带测试重量机制的新架构,它也获得0.12%米欧改善。
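The weight mechanism itself is small: scale the skip-connection operand by a constant before concatenating. A minimal sketch follows; the value 0.5 is an arbitrary placeholder, not the weight used in the paper.

import torch

def weighted_concat(decoder_feat, encoder_feat, w=0.5):
    """Concatenate a skip connection scaled by a constant weight,
    instead of the usual unweighted torch.cat([a, b], dim=1)."""
    return torch.cat([decoder_feat, w * encoder_feat], dim=1)

a, b = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
fused = weighted_concat(a, b)     # shape (1, 128, 32, 32)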
72. MatchingGAN: Matching-based Few-shot Image Generation [PDF] 返回目录
Yan Hong, Jianfu Zhang, Li Niu, Liqing Zhang
Abstract: To generate new images for a given category, most deep generative models require abundant training images from this category, which are often too expensive to acquire. To achieve the goal of generation based on only a few images, we propose matching-based Generative Adversarial Network (GAN) for few-shot generation, which includes a matching generator and a matching discriminator. Matching generator can match random vectors with a few conditional images from the same category and generate new images for this category based on the fused features. The matching discriminator extends conventional GAN discriminator by matching the feature of generated image with the fused feature of conditional images. Extensive experiments on three datasets demonstrate the effectiveness of our proposed method.
摘要:为给定类别生成新的图像,最深处生成模型需要从这一类丰富的训练图像,这往往过于昂贵收购。为了实现基于只有少数图像生成的目标,提出了一种基于匹配创成对抗性网络(GAN)的几个触发产生,其中包括一个匹配发电机和匹配鉴别。匹配发电机可以从同一类别搭配一些有条件的图像随机向量,并基于融合的特点这一类生成新的图像。匹配鉴别器通过生成图像的特征与条件的图像的融合特征匹配延伸常规GAN鉴别器。三个数据集大量实验证明我们提出的方法的有效性。
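The matching generator can be pictured as similarity-weighted fusion of a few conditional image features, conditioned on the random vector, with a decoder then producing a new same-category image from the fused feature. The sketch below is written under that assumption; the function and tensor names are hypothetical.

import torch
import torch.nn.functional as F

def match_and_fuse(z, cond_feats, proj):
    """z: (B, dz) random vector; cond_feats: (B, K, d) features of K conditional
    images from one category; proj: maps z into the feature space.
    Returns a fused feature for the decoder."""
    q = proj(z)                                    # (B, d)
    scores = torch.einsum('bd,bkd->bk', q, cond_feats)
    weights = F.softmax(scores, dim=1)             # matching scores
    return torch.einsum('bk,bkd->bd', weights, cond_feats)

proj = torch.nn.Linear(32, 64)
fused = match_and_fuse(torch.randn(2, 32), torch.randn(2, 4, 64), proj)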
73. Semantic Change Pattern Analysis [PDF] 返回目录
Wensheng Cheng, Yan Zhang, Xu Lei, Wen Yang, Guisong Xia
Abstract: Change detection is an important problem in vision field, especially for aerial images. However, most works focus on traditional change detection, i.e., where changes happen, without considering the change type information, i.e., what changes happen. Although a few works have tried to apply semantic information to traditional change detection, they either only give the label of emerging objects without taking the change type into consideration, or set some kinds of change subjectively without specifying semantic information. To make use of semantic information and analyze change types comprehensively, we propose a new task called semantic change pattern analysis for aerial images. Given a pair of co-registered aerial images, the task requires a result including both where and what changes happen. We then describe the metric adopted for the task, which is clean and interpretable. We further provide the first well-annotated aerial image dataset for this task. Extensive baseline experiments are conducted as reference for following works. The aim of this work is to explore high-level information based on change detection and facilitate the development of this field with the publicly available dataset.
摘要:变化检测是在视觉领域的一个重要问题,特别是对航拍图像。然而,大多数的作品注重传统的变化检测,即,其中发生的变化,不考虑改变类型的信息,即什么样的变化发生。虽然一些作品曾尝试申请语义信息传统的变化检测,他们要么只给新兴对象的标签没有考虑变化型的考虑,或设置一些种类的变化主观上没有指定的语义信息。为了利用语义信息和综合分析的变化类型,我们提出了一个所谓的语义变化模式分析航空图像的新任务。给定一对共同注册的航空影像,任务要求,包括两种情况:和什么样的变化发生的结果。然后,我们描述的任务,这是干净的,可解释的通过了度量。我们还提供了该任务的第一个注释良好的航拍图像数据集。大量的基线实验为以下工作参考进行。这项工作的目的是基于变化检测,探索高层次的信息,并促进这一领域的与可公开获得的数据集的发展。
74. Super Resolution Using Segmentation-Prior Self-Attention Generative Adversarial Network [PDF] 返回目录
Yuxin Zhang, Zuquan Zheng, Roland Hu
Abstract: Convolutional Neural Networks (CNNs) are intensively implemented to solve super resolution (SR) tasks because of their superior performance. However, the problem of super resolution is still challenging due to the lack of prior knowledge and the small receptive field of CNNs. We propose the Segmentation-Prior Self-Attention Generative Adversarial Network (SPSAGAN) to combine segmentation priors and feature attentions into a unified framework. This combination is led by a carefully designed weighted addition to balance the influence of feature and segmentation attentions, so that the network can emphasize textures in the same segmentation category and meanwhile focus on long-distance feature relationships. We also propose a lightweight skip-connection architecture called Residual-in-Residual Sparse Block (RRSB) to further improve the super-resolution performance and save computation. Extensive experiments show that SPSAGAN can generate more realistic and visually pleasing textures compared to the state-of-the-art SFTGAN and ESRGAN on many SR datasets.
摘要:卷积神经网络(CNN)的深入实施,以解决超分辨率(SR),因为其优越的性能的任务。然而,超分辨率的问题由于缺乏先验知识和CNN的小感受野的仍然是具有挑战性的。我们提出了分割的Piror自注意剖成对抗性网络(SPSAGAN)来分割的先验和功能的关注合并成一个统一的框架。这种组合是由一个精心设计的加权相加,导致平衡功能和分割关注的影响,使网络可以强调纹理在同一细分类别,同时着眼于长途功能的关系。我们还提出了一个轻量级的跳跃连接架构,称为剩余的残留疏堵(RRSB),以进一步提高超分辨率的性能和节省计算。大量的实验表明,SPSAGAN可以比国家的最先进的SFTGAN和ESRGAN许多SR数据集生成更加逼真,视觉愉悦的纹理。
75. ReActNet: Towards Precise Binary Neural Network with Generalized Activation Functions [PDF] 返回目录
Zechun Liu, Zhiqiang Shen, Marios Savvides, Kwang-Ting Cheng
Abstract: In this paper, we propose several ideas for enhancing a binary network to close its accuracy gap from real-valued networks without incurring any additional computational cost. We first construct a baseline network by modifying and binarizing a compact real-valued network with parameter-free shortcuts, bypassing all the intermediate convolutional layers including the downsampling layers. This baseline network strikes a good trade-off between accuracy and efficiency, achieving superior performance over most existing binary networks at approximately half of the computational cost. Through extensive experiments and analysis, we observed that the performance of binary networks is sensitive to activation distribution variations. Based on this important observation, we propose to generalize the traditional Sign and PReLU functions, denoted as RSign and RPReLU for the respective generalized functions, to enable explicit learning of the distribution reshape and shift at near-zero extra cost. Lastly, we adopt a distributional loss to further enforce the binary network to learn output distributions similar to those of a real-valued network. We show that after incorporating all these ideas, the proposed ReActNet outperforms all state-of-the-art methods by a large margin. Specifically, it outperforms Real-to-Binary Net and MeliusNet29 by 4.0% and 3.6% respectively in top-1 accuracy, and also reduces the gap to its real-valued counterpart to within 3.0% top-1 accuracy on the ImageNet dataset.
摘要:在本文中,我们提出了提高二进制网络关闭从实值网络其准确性的差距,而不会产生任何额外的计算成本几个想法。我们首先通过修改和二值化与无参数的快捷方式紧凑实值网络,绕过所有中间卷积层,包括下采样层构造基线网络。此基线网络撞击很好的权衡精度和效率之间,大约一半的计算成本的实现比大多数现有的二进制网络的卓越性能。通过大量的实验和分析,我们观察到二进制网络的性能对活化分布的变化是敏感的。在此基础上重要的观察,我们提出来概括传统的标牌和PReLU功能,表示为RSign和RPReLU为各自的广义的功能,在接近零的额外费用,以使分布重塑和转变的明确的学习。最后,我们采用分布式损失进一步执行二进制网络学习类似的输出分布为那些实值网络。我们表明,融合了所有这些想法后,建议ReActNet性能优于国家的的艺术大幅度的全部。具体而言,由4.0%和3.6%分别优于实时到二进制Net和MeliusNet29用于顶部-1的精度,也减少了对数据集ImageNet的间隙其实值对应至3.0%的范围内顶1的精度。
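The generalized activations can be sketched directly from the abstract's description: RSign is a sign function with a learnable per-channel threshold, and RPReLU is a PReLU with learnable shifts before and after the activation. A minimal PyTorch sketch under that reading; the straight-through gradient estimator needed to train through sign is omitted for brevity.

import torch
import torch.nn as nn

class RSign(nn.Module):
    """Sign with a learnable per-channel threshold (the 'generalized Sign')."""
    def __init__(self, channels):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        return torch.sign(x - self.alpha)

class RPReLU(nn.Module):
    """PReLU with learnable per-channel shifts before and after activation."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.zeta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.prelu = nn.PReLU(channels)

    def forward(self, x):
        return self.prelu(x - self.gamma) + self.zeta

x = torch.randn(2, 64, 8, 8)
y = RPReLU(64)(RSign(64)(x))   # illustrative chaining of the two activations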
76. PoseNet3D: Unsupervised 3D Human Shape and Pose Estimation [PDF] 返回目录
Shashank Tripathi, Siddhant Ranade, Ambrish Tyagi, Amit Agrawal
Abstract: Recovering 3D human pose from 2D joints is a highly unconstrained problem. We propose a novel neural network framework, PoseNet3D, that takes 2D joints as input and outputs 3D skeletons and SMPL body model parameters. By casting our learning approach in a student-teacher framework, we avoid using any 3D data such as paired/unpaired 3D data, motion capture sequences, depth images or multi-view images during training. We first train a teacher network that outputs 3D skeletons, using only 2D poses for training. The teacher network distills its knowledge to a student network that predicts 3D pose in the SMPL representation. Finally, both the teacher and the student networks are jointly fine-tuned in an end-to-end manner using temporal, self-consistency and adversarial losses, improving the accuracy of each individual network. Results on the Human3.6M dataset for 3D human pose estimation demonstrate that our approach reduces the 3D joint prediction error by 18% compared to previous unsupervised methods. Qualitative results on in-the-wild datasets show that the recovered 3D poses and meshes are natural, realistic, and flow smoothly over consecutive frames.
摘要:从2D关节恢复三维人体姿势是一个高度无约束问题。我们提出了一个新颖的神经网络架构,PoseNet3D,即采用2D接头作为输入,并输出3D骨架和SMPL体模型参数。铸造我们的学习方法,在师生的框架,我们避免使用任何3D数据,例如在训练期间配对/未配对的3D数据,运动捕捉序列,深度图像或者多视图图像。首先,我们培养了教师网络输出3D骨架,只使用2D训练姿势。教师网络提炼其学生网络预测3D姿态在SMPL表示知识。最后,使用时间,自我一致性和敌对的损失,提高每个单独的网络的精确度两者教师和学生网络是在端至端的方式共同微调。在Human3.6M数据集结果的3D人体姿态估计表明,我们的方法减少了18 \%,相比以前的非监督方法的3D关节的预测误差。在最疯狂的数据集显示,回收的3D姿势和网格定性结果是自然的,现实的,流程顺畅了连续帧。
77. Semi-Supervised StyleGAN for Disentanglement Learning [PDF] 返回目录
Weili Nie, Tero Karras, Animesh Garg, Shoubhik Debhath, Anjul Patney, Ankit B. Patel, Anima Anandkumar
Abstract: Disentanglement learning is crucial for obtaining disentangled representations and controllable generation. Current disentanglement methods face several inherent limitations: difficulty with high-resolution images, primarily on learning disentangled representations, and non-identifiability due to the unsupervised setting. To alleviate these limitations, we design new architectures and loss functions based on StyleGAN (Karras et al., 2019), for semi-supervised high-resolution disentanglement learning. We create two complex high-resolution synthetic datasets for systematic testing. We investigate the impact of limited supervision and find that using only 0.25%~2.5% of labeled data is sufficient for good disentanglement on both synthetic and real datasets. We propose new metrics to quantify generator controllability, and observe there may exist a crucial trade-off between disentangled representation learning and controllable generation. We also consider semantic fine-grained image editing to achieve better generalization to unseen images.
摘要:退纠缠学习是获得解开表示,可控发电的关键。当前的解开方法面临着一些固有的局限性:高分辨率图像,主要学习解开表示,和非可识别的难度,由于无人监督的设置。为了减轻这些限制,我们设计了一种基于StyleGAN新的架构和损失函数(卡拉斯等,2019),对半监督高分辨率的解开学习。我们创建系统的测试两种复杂高分辨率合成数据集。我们调查的监督有限的影响,并发现使用标记的数据只有0.25%〜2.5%就足够了在合成和真实数据的良好的解开。我们提出新的指标来进行量化发生器的可控性,并观察可能存在一个关键的权衡解缠结表示学习和可控代之间。我们也考虑语义细粒度的图像编辑,以达到更好的推广到看不见图像。
78. Scalable Uncertainty for Computer Vision with Functional Variational Inference [PDF] 返回目录
Eduardo D C Carvalho, Ronald Clark, Andrea Nicastro, Paul H J Kelly
Abstract: As Deep Learning continues to yield successful applications in Computer Vision, the ability to quantify all forms of uncertainty is a paramount requirement for its safe and reliable deployment in the real-world. In this work, we leverage the formulation of variational inference in function space, where we associate Gaussian Processes (GPs) to both Bayesian CNN priors and variational family. Since GPs are fully determined by their mean and covariance functions, we are able to obtain predictive uncertainty estimates at the cost of a single forward pass through any chosen CNN architecture and for any supervised learning task. By leveraging the structure of the induced covariance matrices, we propose numerically efficient algorithms which enable fast training in the context of high-dimensional tasks such as depth estimation and semantic segmentation. Additionally, we provide sufficient conditions for constructing regression loss functions whose probabilistic counterparts are compatible with aleatoric uncertainty quantification.
摘要:随着深继续学习计算机视觉中得到成功的应用,量化各种形式的不确定性的能力是在真实世界中的安全和可靠的部署极为重要的要求。在这项工作中,我们利用变分推理的配方功能空间,在这里我们高斯过程(GPS),以CNN贝叶斯先验而变的家人都联系起来。由于GPS完全由他们的均值和方差的功能决定的,我们可以在一个单一的直传通过任何选择CNN架构的成本和任何监督学习任务获得预测的不确定性估计。通过利用感应协方差矩阵的结构中,我们提出了数值高效的算法,其能够在高维任务的上下文快速训练如深度估计和语义分割。此外,我们还为构建其概率同行与肆意的不确定性量化兼容回归损失函数的充分条件。
79. Manifold Regularization for Adversarial Robustness [PDF] 返回目录
Charles Jin, Martin Rinard
Abstract: Manifold regularization is a technique that penalizes the complexity of learned functions over the intrinsic geometry of input data. We develop a connection to learning functions which are "locally stable", and propose new regularization terms for training deep neural networks that are stable against a class of local perturbations. These regularizers enable us to train a network to state-of-the-art robust accuracy of 70% on CIFAR-10 against a PGD adversary using $\ell_\infty$ perturbations of size $\epsilon = 8/255$. Furthermore, our techniques do not rely on the construction of any adversarial examples, thus running orders of magnitude faster than standard algorithms for adversarial training.
摘要:流形正则是,随着输入数据的内蕴几何的惩罚了解到功能的复杂的技术。我们开发学习它们是“局部稳定”功能的连接,并提出新的正则项训练深层神经网络,反对一类局部扰动的稳定。这些regularizers使我们使用$ \ ell_ \ infty $ $大小的扰动\小量= 8 / $ 255到列车内的网络的70%状态的最先进的鲁棒精度上CIFAR-10针对PGD对手。此外,我们的技术不依赖于任何对抗性的例子建设,从而运行几个数量级比对抗训练标准算法快。
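Consistent with the claim that no adversarial examples are constructed, a local-stability regularizer can be approximated by penalizing prediction divergence under random $\ell_\infty$ noise of size $\epsilon = 8/255$. The sketch below shows that idea; the KL divergence and the single random draw are our simplifications, not the paper's exact regularizer.

import torch
import torch.nn.functional as F

def local_stability_penalty(model, x, eps=8/255):
    """Penalize divergence between predictions on x and on a random
    l-infinity perturbation of x. No adversarial examples are constructed,
    so this costs a single extra forward pass."""
    delta = torch.empty_like(x).uniform_(-eps, eps)
    log_p_clean = F.log_softmax(model(x), dim=1)
    log_p_pert = F.log_softmax(model(x + delta), dim=1)
    return F.kl_div(log_p_pert, log_p_clean, log_target=True,
                    reduction='batchmean')

# usage: loss = task_loss + lam * local_stability_penalty(model, x)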
80. Deep Inverse Feature Learning: A Representation Learning of Error [PDF] 返回目录
Behzad Ghazanfari, Fatemeh Afghah
Abstract: This paper introduces a novel perspective on error in machine learning and proposes inverse feature learning (IFL), a representation learning approach that learns a set of high-level features based on the representation of error, for classification or clustering purposes. The proposed perspective on error representation is fundamentally different from current learning methods: classification approaches interpret the error as a function of the differences between the true labels and the predicted ones, while clustering approaches rely on objective functions such as compactness. The inverse feature learning method operates based on a deep clustering approach to obtain a qualitative form of the representation of error as features. The performance of the proposed IFL method is evaluated by applying the learned features along with the original features, or just the learned features, in different classification and clustering techniques on several data sets. The experimental results show that the proposed method leads to promising results in classification and especially in clustering. In classification, the proposed features along with the primary features improve the results of most classification methods on several popular data sets. In clustering, the performance of different clustering methods is considerably improved on different data sets. Interestingly, a few features of the error representation capture highly informative aspects of the primary features. We hope this paper helps to utilize error representation learning in different feature learning domains.
摘要:本文介绍了在机器学习误差的新颖的角度,并建议学会了一组高级的基于用于分类或聚类目的误差的表示的特征逆地物学习(IFL),为表示学习方法。关于误差表示所提出的透视图是从当前的学习方法,其中在分类方法他们解释误差的真实的标签和所预测的那些接近之间或在聚类的差的函数,从根本上不同,其中所述聚类的目标函数如紧凑被使用。逆特性的学习方法操作基于一个深聚类方法来获得误差的表示为特征的定性形式。所提出的IFL方法的性能是通过应用所学习的特征与原始特征一起,或只是使用不同的分类学特征和聚类的几个数据集技术评估。实验结果表明,该方法导致看好的分类结果,尤其是在集群。在分类上,与主要特征沿建议功能完善的大多数分类方法对几个常用的数据集的结果。在聚类,不同的聚类方法的性能上不同的数据集被显着改善。有迹象表明,展示的主要功能错误捕获高度信息方面表现的一些几个特点有趣的结果。我们希望本文有助于利用不同特征的学习域误差表示学习。
81. How to Train Your Super-Net: An Analysis of Training Heuristics in Weight-Sharing NAS [PDF] 返回目录
Kaicheng Yu, Rene Ranftl, Mathieu Salzmann
Abstract: Weight sharing promises to make neural architecture search (NAS) tractable even on commodity hardware. Existing methods in this space rely on a diverse set of heuristics to design and train the shared-weight backbone network, a.k.a. the super-net. Since heuristics and hyperparameters substantially vary across different methods, a fair comparison between them can only be achieved by systematically analyzing the influence of these factors. In this paper, we therefore provide a systematic evaluation of the heuristics and hyperparameters that are frequently employed by weight-sharing NAS algorithms. Our analysis uncovers that some commonly-used heuristics for super-net training negatively impact the correlation between super-net and stand-alone performance, and evidences the strong influence of certain hyperparameters and architectural choices. Our code and experiments set a strong and reproducible baseline that future works can build on.
摘要:重量共享的承诺,使神经结构搜索(NAS)温顺甚至在商品硬件。在这个空间中的现有方法依赖于一组不同的启发,设计和培训共享重骨干网,又名超级网。由于启发式和超参数在不同的方法显着地变化,它们之间的公平的比较只能通过系统分析这些因素的影响来实现的。在本文中,我们因此提供了其频繁重量共享NAS算法采用启发式和超参数的系统评价。我们的分析揭示,对于超级网培训一些常用的启发式超级网和单机性能之间的关系产生负面影响,并能证明某些超参数和架构选择的强烈影响。我们的代码和实验设置一个强大的和可重复的基线,未来的作品都可以建立在。
82. MLography: An Automated Quantitative Metallography Model for Impurities Anomaly Detection using Novel Data Mining and Deep Learning Approach [PDF] 返回目录
Matan Rusanovsky, Gal Oren, Sigalit Ifergane, Ofer Beeri
Abstract: The micro-structure of most engineering alloys contains some inclusions and precipitates, which may affect their properties; it is therefore crucial to characterize them. In this work we focus on the development of a state-of-the-art artificial intelligence model for anomaly detection, named MLography, to automatically quantify the degree of anomaly of impurities in alloys. For this purpose, we introduce several anomaly detection measures: spatial, shape and area anomaly, which successfully detect the most anomalous objects based on their objective, given that the impurities were already labeled. The first two measures quantify the degree of anomaly of each object by how distant and large each object is compared to its neighborhood, and by the abnormality of its own shape, respectively. The last measure combines the former two and highlights the most anomalous regions among all input images, for later (physical) examination. The performance of the model is presented and analyzed based on a few representative cases. We stress that although the models presented here were developed for metallography analysis, most of them can be generalized to a wider set of problems in which anomaly detection of geometrical objects is desired. All models, as well as the data set that was created for this work, are publicly available at: this https URL.
摘要:大多数工程合金的微观结构含有一些杂质和沉淀物,这可能影响其性能,因此它是表征他们的关键。在这项工作中,我们重点关注的异常检测命名MLography一个国家的最先进的人工智能的发展模式,以自动量化的异常杂质合金的程度。为此,我们介绍几种异常检测措施:空间,形状和面积异常,成功地检测根据他们的目标是最反常的对象,因为杂质已经标记。前两项措施量化异常每个对象的每个对象是如何遥远,大相比,其附近的程度,以及由异常自身的形状分别。最后一个措施,将所有输入图像中的前两个亮点是最反常的区域,供以后(物理)检查。该模型的性能,提出基于几个有代表性的案例进行分析。我们强调的是,虽然这里提出的模型进行了分析,金相开发,其中大部分可以推广到更广泛的,其中几何对象的异常检测需要的问题。所有型号以及这是对这项工作产生的数据集,都公布于:此HTTPS URL。
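As a toy illustration of the spatial anomaly measure ("distant and large compared to its neighborhood"), one could score each labeled impurity by its area times its mean distance to its k nearest neighbours. This is only our reading of the description, not the paper's formula.

import numpy as np

def spatial_anomaly(centers, areas, k=5):
    """Toy spatial anomaly score: an impurity is more anomalous the larger
    it is and the farther it lies from its k nearest neighbours.
    centers: (N, 2) impurity centroids; areas: (N,) impurity areas."""
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)                # ignore self-distance
    knn_mean = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    return areas * knn_mean                        # unnormalized score per impurity

scores = spatial_anomaly(np.random.rand(20, 2), np.random.rand(20))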
83. Toward Cross-Domain Speech Recognition with End-to-End Models [PDF] 返回目录
Thai-Son Nguyen, Sebastian Stüker, Alex Waibel
Abstract: In the area of multi-domain speech recognition, research in the past focused on hybrid acoustic models to build cross-domain and domain-invariant speech recognition systems. In this paper, we empirically examine the difference in behavior between hybrid acoustic models and neural end-to-end systems when mixing acoustic training data from several domains. For these experiments we composed a multi-domain dataset from public sources, with the different domains in the corpus covering a wide variety of topics and acoustic conditions such as telephone conversations, lectures, read speech and broadcast news. We show that for the hybrid models, supplying additional training data from other domains with mismatched acoustic conditions does not increase the performance on specific domains. However, our end-to-end models optimized with a sequence-based criterion generalize better than the hybrid models on diverse domains. In terms of word-error-rate performance, our experimental acoustic-to-word and attention-based models trained on the multi-domain dataset reach the performance of domain-specific long short-term memory (LSTM) hybrid models, thus resulting in multi-domain speech recognition systems that do not suffer in performance compared to domain-specific ones. Moreover, the use of neural end-to-end models eliminates the need for domain-adapted language models during recognition, which is a great advantage when the input domain is unknown.
摘要:在多域语音识别领域,研究过去侧重于混合动力声学模型构建跨域和域不变的语音识别系统。在本文中,我们凭经验多个域的混合声音的训练数据时,检查混合声模型和神经终端到终端系统之间的行为差异。对于这些实验,我们由来自公共资源多域数据集,与语料库中涵盖了各种各样的主题和声学条件,如电话交谈,讲座,读取语音和广播消息的不同领域。我们表明,对于混合动力车型,从供给不匹配的声学条件的其他域的其他训练数据不会增加特定域的性能。然而,我们的终端到终端的型号序列为基础的准则期广义优化比对不同领域的混合动力车型更好。在字误码率性能来看,我们的实验声对词和重视基于模型的培训上多域数据集中到特定领域的表现长短期记忆(LSTM)混合动力车型,从而导致多不-domain语音识别系统遭受了特定领域的那些性能。此外,使用神经终端到高端机型的消除域适应语言模型的识别过程中的需要,这是一个很大的优势,当输入域是未知的。
84. Hybrid calibration procedure for fringe projection profilometry based on stereo-vision and polynomial fitting [PDF] 返回目录
Raul Vargas, Andres G. Marrugo, Song Zhang, Lenny A. Romero
Abstract: The key to accurate 3D shape measurement in Fringe Projection Profilometry (FPP) is the proper calibration of the measurement system. Current calibration techniques rely on phase-coordinate mapping (PCM) or back-projection stereo-vision (SV) methods. PCM methods are cumbersome to implement as they require precise positioning of the calibration target relative to the FPP system but produce highly accurate measurements within the calibration volume. SV methods generally do not achieve the same accuracy level. However, the calibration is more flexible in that the calibration target can be arbitrarily positioned. In this work, we propose a hybrid calibration method that leverages the SV calibration approach using a PCM method to achieve higher accuracy. The method has the flexibility of SV methods, is robust to lens distortions, and has a simple relation between the recovered phase and the metric coordinates. Experimental results show that the proposed Hybrid method outperforms the SV method in terms of accuracy and reconstruction time due to its low computational complexity.
摘要:键在条纹投影轮廓(FPP)精确的三维形状的测量是测量系统的正确校准。当前校准技术依赖于相位坐标映射(PCM)或背投影立体视觉(SV)的方法。 PCM方法是麻烦的实施,因为它们需要相对FPP系统校准目标的精确定位,但产生的校准体积内高度精确的测量。 SV方法一般不会达到同样的精度水平。然而,校准是在校准目标可以任意地定位更加灵活。在这项工作中,我们提出了利用使用PCM方法来实现更高的精度SV校准方法的混合校准方法。该方法具有的SV方法的灵活性,是稳健的透镜畸变,并具有恢复的相位和所述度量坐标之间的简单关系。实验结果表明,所提出的混合方法优于在精度和重建时间上的SV方法由于其较低的计算复杂度。
85. Learned Spectral Computed Tomography [PDF] 返回目录
Dimitris Kamilis, Mario Blatter, Nick Polydorides
Abstract: Spectral Photon-Counting Computed Tomography (SPCCT) is a promising technology that has shown a number of advantages over conventional X-ray Computed Tomography (CT) in the form of material separation, artefact removal and enhanced image quality. However, due to the increased complexity and non-linearity of the SPCCT governing equations, model-based reconstruction algorithms typically require handcrafted regularisation terms and meticulous tuning of hyperparameters making them impractical to calibrate in variable conditions. Additionally, they typically incur high computational costs and in cases of limited-angle data, their imaging capability deteriorates significantly. Recently, Deep Learning has proven to provide state-of-the-art reconstruction performance in medical imaging applications while circumventing most of these challenges. Inspired by these advances, we propose a Deep Learning imaging method for SPCCT that exploits the expressive power of Neural Networks while also incorporating model knowledge. The method takes the form of a two-step learned primal-dual algorithm that is trained using case-specific data. The proposed approach is characterised by fast reconstruction capability and high imaging performance, even in limited-data cases, while avoiding the hand-tuning that is required by other optimisation approaches. We demonstrate the performance of the method in terms of reconstructed images and quality metrics via numerical examples inspired by the application of cardiovascular imaging.
摘要:光谱光子计数计算机断层扫描(SPCCT)是有前途的技术是,在材料分离,人工制品去除和增强的图像质量的形式已显示出许多优于常规的X射线计算机断层扫描(CT)的优点。然而,由于增加的复杂性和控制方程的SPCCT的非线性,基于模型的重建算法通常需要手工使其不切实际的校准在可变条件下正则项和超参数的细致调整。此外,它们通常招致高的计算成本和有限的角度数据的情况下,其成像能力显著劣化。近日,深学习证明,它们在医疗成像应用提供先进设备,最先进的重建效率,同时避开大部分的这些挑战。通过这些进步的启发,我们提出了一个SPCCT深度学习成像方法,它利用神经网络的表达能力同时也纳入模型的知识。该方法采用了两步的形式得知正在使用的情况下,特定的数据训练原始对偶算法。所提出的方法的特征在于快速重建能力和高成像性能,即使在有限的数据的情况下,同时避免由其他优化方法所需要的手动调整。我们证明了该方法在重构图像和质量度量方面经由由心血管成像的应用启发数值例子的性能。
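For readers unfamiliar with learned primal-dual schemes, the following PyTorch sketch shows the structural idea on a toy linear operator. The SPCCT forward model in the paper is non-linear and spectral, so this illustrates only the iteration pattern; layer sizes and the iteration count are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class PrimalDualStep(nn.Module):
    def __init__(self, n_primal, n_dual):
        super().__init__()
        self.dual_net = nn.Sequential(nn.Linear(n_dual * 2, 64), nn.PReLU(),
                                      nn.Linear(64, n_dual))
        self.primal_net = nn.Sequential(nn.Linear(n_primal * 2, 64), nn.PReLU(),
                                        nn.Linear(64, n_primal))

    def forward(self, f, h, g, A):
        # Dual update: compare the current measurement estimate with the data g.
        h = h + self.dual_net(torch.cat([f @ A.T, g], dim=-1))
        # Primal update: back-project the dual variable into image space.
        f = f + self.primal_net(torch.cat([h @ A, f], dim=-1))
        return f, h

n_pixels, n_meas = 32, 48
A = torch.randn(n_meas, n_pixels) / n_meas**0.5    # toy linear forward operator
steps = nn.ModuleList([PrimalDualStep(n_pixels, n_meas) for _ in range(2)])

g = torch.randn(1, n_meas)                         # measured (toy) data
f = torch.zeros(1, n_pixels)                       # primal iterate (image)
h = torch.zeros(1, n_meas)                         # dual iterate
for step in steps:
    f, h = step(f, h, g, A)
print(f.shape)
```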
86. $Π-$nets: Deep Polynomial Neural Networks [PDF] 返回目录
Grigorios G. Chrysos, Stylianos Moschoglou, Giorgos Bouritsas, Jiankang Deng, Yannis Panagakis, Stefanos Zafeiriou
Abstract: Deep Convolutional Neural Networks (DCNNs) are currently the method of choice both for generative and for discriminative learning in computer vision and machine learning. The success of DCNNs can be attributed to the careful selection of their building blocks (e.g., residual blocks, rectifiers, and sophisticated normalization schemes, to mention but a few). In this paper, we propose $\Pi$-Nets, a new class of DCNNs. $\Pi$-Nets are polynomial neural networks, i.e., the output is a high-order polynomial of the input. $\Pi$-Nets can be implemented using a special kind of skip connection, and their parameters can be represented via high-order tensors. We empirically demonstrate that $\Pi$-Nets have better representation power than standard DCNNs, and they even produce good results without the use of non-linear activation functions in a large battery of tasks and signals, i.e., images, graphs, and audio. When used in conjunction with activation functions, $\Pi$-Nets produce state-of-the-art results in challenging tasks, such as image generation. Lastly, our framework elucidates why recent generative models, such as StyleGAN, improve upon their predecessors, e.g., ProGAN.
摘要:深卷积神经网络(DCNNs)是目前既为生成,以及一种在计算机视觉和机器学习判别学习的首选方法。 DCNNs的成功可以归因于他们的积木仔细选择(例如,残余块,整流器,复杂的标准化方案,以仅举几例)。在本文中,我们提出了$ \丕$ -Nets,一类新DCNNs的。 $ \裨$ -Nets是多项式神经网络,即,输出是输入的高阶多项式。 $ \丕$ -Nets可以使用特殊的跳跃连接来实现和他们的参数可以通过高阶张量表示。我们经验表明,$ \丕$ -Nets有更好的表现力比标准DCNNs,他们甚至会产生没有大电池的任务和信号,即,图片,图表,音频的使用非线性激活功能良好的效果。当与活化的功能一起使用时,$ \裨$ -Nets产生状态的最先进的结果在挑战性的任务,如图像生成。最后,我们的框架阐明了为什么最近生成模型,如StyleGAN,完善其前辈,例如,ProGAN。
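The degree-raising mechanism is easy to demonstrate. The sketch below implements one common parameterisation of this idea: each block takes the Hadamard product of a linear transform of the input with the running representation through a skip connection, so a stack of N blocks realizes a degree-(N+1) polynomial of the input with no activation functions. The exact parameterisation in the paper may differ.

```python
import torch
import torch.nn as nn

class PolyBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.U = nn.Linear(dim, dim, bias=False)

    def forward(self, x, z):
        # z is the original network input; x is the current representation.
        return self.U(z) * x + x      # raises the polynomial degree by one

class PiNetSketch(nn.Module):
    def __init__(self, dim, depth=3):
        super().__init__()
        self.blocks = nn.ModuleList([PolyBlock(dim) for _ in range(depth)])

    def forward(self, z):
        x = z
        for block in self.blocks:
            x = block(x, z)
        return x                       # degree-(depth+1) polynomial in z

model = PiNetSketch(dim=16)
print(model(torch.randn(4, 16)).shape)  # torch.Size([4, 16])
```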
87. Deep Learning Guided Undersampling Mask Design for MR Image Reconstruction [PDF] 返回目录
Shengke Xue, Ruiliang Bai, Xinyu Jin
Abstract: In this paper, we propose a cross-domain network that achieves undersampled MR image reconstruction from raw k-space data. We design a 2D probability sampling mask layer to simulate a real undersampling operation. Then a 2D inverse FFT is deployed to reconstruct the MR image from the frequency domain to the spatial domain. By minimizing the Euclidean loss between the ground-truth image and the output, we train the parameters of our probability mask layer. We discover that, at certain undersampling rates, the learned probabilities exhibit special patterns that are quite different from the common assumption that masks should be Poisson-like. Our analysis shows that the probability mask follows Gaussian or quadratic distributions, and we argue that these patterns are more accurate and robust than traditional ones. Extensive experiments prove that the rules we discovered adapt to most cases. This can be useful guidance for further MR reconstruction mask designs.
摘要:在本文中,我们提出一种可实现从原始k空间空间欠采样的MR图像重构跨域网络。我们设计了一个二维概率抽样掩膜层模拟真实欠操作。然后,2D逆FFT被部署到从频域到空间域重建MR图像。通过减少地面实况图像和输出之间的欧几里德损失,我们训练我们的概率模板层的参数。我们发现的概率会出现特殊的模式,是由通用常识完全不同即面膜应该是泊松状,在某些欠率。我们分析的概率模板进行高斯或二次分布,并探讨这种模式会比传统更精确和稳健。大量的实验证明,我们发现规则适应于大多数情况下。这可以进一步MR重建面罩设计的有用的指导。
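A minimal sketch of such a learnable sampling layer is given below: per-location probabilities parameterise a Bernoulli mask over k-space, and a 2D inverse FFT maps the undersampled measurements back to image space. The straight-through gradient estimator used here is an assumption; the abstract does not specify the relaxation.

```python
import torch
import torch.nn as nn

class ProbMaskLayer(nn.Module):
    def __init__(self, h, w):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(h, w))  # learnable per-location logits

    def forward(self, kspace):
        p = torch.sigmoid(self.logits)
        hard = torch.bernoulli(p.detach())
        mask = hard + p - p.detach()   # straight-through: hard forward, soft backward
        return kspace * mask, mask

h, w = 64, 64
image = torch.randn(h, w, dtype=torch.complex64)
kspace = torch.fft.fft2(image)

layer = ProbMaskLayer(h, w)
undersampled, mask = layer(kspace)
recon = torch.fft.ifft2(undersampled)              # zero-filled reconstruction
loss = (recon.abs() - image.abs()).pow(2).mean()   # Euclidean loss vs. ground truth
loss.backward()                                    # gradients reach the mask logits
print(mask.mean().item())
```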
88. DFVS: Deep Flow Guided Scene Agnostic Image Based Visual Servoing [PDF] 返回目录
Y V S Harish, Harit Pandya, Ayush Gaud, Shreya Terupally, Sai Shankar, K. Madhava Krishna
Abstract: Existing deep learning based visual servoing approaches regress the relative camera pose between a pair of images. Therefore, they require a huge amount of training data and sometimes fine-tuning for adaptation to a novel scene. Furthermore, current approaches do not consider the underlying geometry of the scene and rely on direct estimation of the camera pose. Thus, inaccuracies in the predicted camera pose, especially for distant goals, lead to degraded servoing performance. In this paper, we propose a two-fold solution: (i) we consider optical flow as our visual features, predicted using a deep neural network; (ii) these flow features are then systematically integrated with depth estimates provided by another neural network through an interaction matrix. We further present an extensive benchmark in a photo-realistic 3D simulation across diverse scenes to study the convergence and generalisation of visual servoing approaches. We show convergence for over 3m and 40 degrees while maintaining precise positioning of under 2cm and 1 degree on our challenging benchmark, where existing approaches are unable to converge for the majority of scenarios beyond 1.5m and 20 degrees. Furthermore, we also evaluate our approach in a real scenario on an aerial robot. Our approach generalizes to novel scenarios, producing precise and robust servoing performance for 6-degrees-of-freedom positioning tasks, even with large camera transformations, without any retraining or fine-tuning.
摘要:现有的深度学习基于视觉伺服接近回归的一对图像之间的相对相机姿态。因此,它们需要大量的训练数据,有时微调以适应一个新的场景数量。此外,当前的方法没有考虑现场的基本几何体,依靠相机姿态的直接估计。因此,在相机姿势预测,尤其是对遥远的目标不准确,导致伺服性能下降。在本文中,我们提出了两方面的解决方案:(一)我们认为,光流作为我们的视觉特征,它使用的是深层神经网络预测。 (ⅱ)这些流的特征然后系统通过使用相互作用矩阵另一个神经网络提供深度估算集成。我们进一步提出了一个广泛的基准测试中跨不同场景照片般逼真的三维模拟研究收敛性和泛化的视觉伺服方法。我们显示超过3米和40度的收敛,同时保持我们的富有挑战性的基准下2厘米和1度的精确定位,其中是无法收敛为广大的情景逾一百五十万度和20度的现行做法。此外,我们还评估我们对空中机器人真实的情景的方法。我们的方法推广到制造用于与甚至大的相机变换6个自由度定位任务精确和鲁棒伺服性能,无需任何再培训或微调新颖方案。
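The interaction-matrix step the abstract refers to is the classical image-based servoing update; the sketch below shows it with the flow and depth networks stubbed out by random values, so only the control-law plumbing is illustrated.

```python
import numpy as np

def interaction_matrix(x, y, Z):
    """Interaction matrix of a normalised image point (x, y) at depth Z."""
    return np.array([
        [-1.0 / Z, 0.0, x / Z, x * y, -(1.0 + x**2), y],
        [0.0, -1.0 / Z, y / Z, 1.0 + y**2, -x * y, -x],
    ])

rng = np.random.default_rng(1)
pts = rng.uniform(-0.5, 0.5, size=(20, 2))     # normalised image points
depth = rng.uniform(0.5, 2.0, size=20)         # depth-network estimates (stub)
flow = rng.normal(scale=0.01, size=(20, 2))    # flow-network output (stub)

# Stack per-point interaction matrices; flow plays the role of the feature error.
L = np.vstack([interaction_matrix(x, y, Z) for (x, y), Z in zip(pts, depth)])
error = flow.reshape(-1)
lam = 0.5
velocity = -lam * np.linalg.pinv(L) @ error    # 6-DoF command [vx, vy, vz, wx, wy, wz]
print(velocity)
```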
89. Online Self-Supervised Learning for Object Picking: Detecting Optimum Grasping Position using a Metric Learning Approach [PDF] 返回目录
Kanata Suzuki, Yasuto Yokota, Yuzi Kanazawa, Tomoyoshi Takebayashi
Abstract: Self-supervised learning methods are attractive candidates for automatic object picking. However, the trial samples lack complete ground truth because the observable parts of the agent are limited. That is, the information contained in the trial samples is often insufficient to learn the specific grasping position of each object. Consequently, the training falls into a local solution, and the grasp positions learned by the robot are independent of the state of the object. In this study, the optimal grasping position of an individual object is determined from a grasping score, defined as the distance in the feature space obtained using metric learning. The closeness of the solution to the pre-designed optimal grasping position was evaluated in trials. The proposed method incorporates two types of feedback control: one feedback enlarges the grasping score when the grasping position approaches the optimum; the other reduces the negative feedback of the potential grasping positions among the grasping candidates. The proposed online self-supervised learning method employs two deep neural networks: an SSD that detects the grasping position of an object, and Siamese networks (SNs) that evaluate the trial sample using the similarity of two input data in the feature space. Our method embeds the relation of each grasping position as feature vectors by training on the trial samples and a few pre-samples indicating the optimum grasping position. By incorporating the grasping score based on the feature space of the SNs into the SSD training process, the method preferentially trains on the optimum grasping position. In the experiment, the proposed method achieved a higher success rate than a baseline method using simple teaching signals, and the grasping scores in the feature space of the SNs accurately represented the grasping positions of the objects.
摘要:自监督学习方法对于自动对象采摘吸引力的候选者。然而,试验样品缺乏完整的地面实况,因为代理的观察到的部分是有限的。也就是说,包含在试验样品中的信息往往不足以学习每个对象的特定位置抓握。因此,训练陷入局部解,并且由机器人学会掌握位置是独立于对象的状态。在这项研究中,最佳的抓单个对象的位置被从把持评分来确定,其定义为在使用度量学习获得的特征空间中的距离。该溶液到预先设计的最佳把持位置的接近程度在试验进行评价。所提出的方法采用了两种类型的反馈控制的:一个反馈放大抓分数时的把持位置接近最佳;其他降低了抓候选人之间的潜在抓位置的负反馈。所提出的在线自助监督学习方法采用两道深深的神经网络。 :SSD检测物体的把持位置,连体网络,评价使用在特征空间中的两个输入数据的相似性的试验样品(SNS)。我们的方法通过训练试验样品和指示最优把持位置几前样品嵌入每个抓握位置作为特征矢量之间的关系。通过将基于SN的特征空间在SSD训练过程把持得分,所述方法优选地训练的最佳把持位置。在实验中,所提出的方法实现了更高的成功率比使用简单教学信号基线方法。而在SN的特征空间中的把持分数精确地表示对象的把持位置。
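A minimal sketch of the metric-learning scoring idea follows: a shared embedding network maps a trial sample and an optimum pre-sample into feature space, and the grasping score is their negated distance, so grasps closer to the optimum score higher. The embedding architecture here is an assumption.

```python
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    def __init__(self, in_dim=128, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, emb_dim))

    def forward(self, x):
        return self.net(x)

def grasping_score(embed, trial, optimum):
    """Higher score = trial grasp closer to the optimum in feature space."""
    d = torch.norm(embed(trial) - embed(optimum), dim=-1)
    return -d

embed = EmbeddingNet()
trial = torch.randn(8, 128)      # features of candidate grasp positions
optimum = torch.randn(1, 128)    # features of a pre-sample at the optimum
scores = grasping_score(embed, trial, optimum.expand(8, -1))
best = scores.argmax()
print(best.item(), scores[best].item())
```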
90. AL2: Progressive Activation Loss for Learning General Representations in Classification Neural Networks [PDF] 返回目录
Majed El Helou, Frederike Dümbgen, Sabine Süsstrunk
Abstract: The large capacity of neural networks enables them to learn complex functions. However, to avoid overfitting, networks require a lot of training data that can be expensive and time-consuming to collect. A common practical approach to attenuating overfitting is the use of network regularization techniques. We propose a novel regularization method that progressively penalizes the magnitude of activations during training. The combined activation signals produced by all neurons in a given layer form the representation of the input image in that feature space. We propose to regularize this representation in the last feature layer before the classification layers. Our method's effect on generalization is analyzed with label randomization tests and cumulative ablations. Experimental results show the advantages of our approach in comparison with commonly-used regularizers on standard benchmark datasets.
摘要:大容量的神经网络,以使其了解复杂的功能。为了避免过度拟合,但是网络需要大量的训练数据,可以是昂贵和费时的收集的。衰减过度拟合常见的实用方法是使用网络正规化技巧。我们建议逐步惩罚训练期间激活的幅度新颖的正则化方法。通过在给定的层中的所有神经元所产生的组合的活化信号形成在该特征空间中的输入图像的表示。我们建议之前分类层来规范在最后一项功能层这种表示。我们的方法对推广效果进行了分析与标签随机试验和累积消融。实验结果表明,与标准的基准数据集常用regularizers对比我们的方法的优点。
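The penalty itself is a one-line addition to a standard training loop. The sketch below applies an activation-magnitude penalty on the last feature layer with a weight that ramps up over training; the linear ramp schedule and network sizes are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(784, 256), nn.ReLU(),
                                      nn.Linear(256, 64), nn.ReLU())
        self.classifier = nn.Linear(64, 10)

    def forward(self, x):
        feats = self.features(x)   # last feature layer before the classifier
        return self.classifier(feats), feats

model = SmallNet()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
ce = nn.CrossEntropyLoss()
total_epochs, lam_max = 10, 1e-3

for epoch in range(total_epochs):
    lam = lam_max * epoch / (total_epochs - 1)    # progressive schedule (assumed linear)
    x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
    logits, feats = model(x)
    loss = ce(logits, y) + lam * feats.pow(2).mean()  # activation-magnitude penalty
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```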
91. Explaining Knowledge Distillation by Quantifying the Knowledge [PDF] 返回目录
Xu Cheng, Zhefan Rao, Yilan Chen, Quanshi Zhang
Abstract: This paper presents a method to interpret the success of knowledge distillation by quantifying and analyzing the task-relevant and task-irrelevant visual concepts that are encoded in intermediate layers of a deep neural network (DNN). More specifically, three hypotheses are proposed as follows. 1. Knowledge distillation makes the DNN learn more visual concepts than learning from raw data. 2. Knowledge distillation ensures that the DNN is prone to learning various visual concepts simultaneously, whereas, in the scenario of learning from raw data, the DNN learns visual concepts sequentially. 3. Knowledge distillation yields more stable optimization directions than learning from raw data. Accordingly, we design three types of mathematical metrics to evaluate feature representations of the DNN. In experiments, we diagnosed various DNNs, and the above hypotheses were verified.
摘要:本文提出了一种方法,以通过量化和分析了在一个深的神经网络(DNN)的中间层编码任务相关的和任务不相关的视觉概念解释知识蒸馏的成功。更具体地讲,三个假设提出如下。 1.知识蒸馏使得DNN了解更多视觉概念不是从原始数据中学习。 2.知识蒸馏确保DNN易于同时学习各种视觉概念。然而,在从原始数据中学习的情况下,DNN学习视觉概念顺序。 3.知识蒸馏提供更稳定的优化方向不是从原始数据中学习。因此,我们设计了三种类型的数学指标来评估DNN的特征表示。在实验中,我们诊断各种DNNs,以上的假设进行验证。
92. Diffusion State Distances: Multitemporal Analysis, Fast Algorithms, and Applications to Biological Networks [PDF] 返回目录
Lenore Cowen, Kapil Devkota, Xiaozhe Hu, James M. Murphy, Kaiyi Wu
Abstract: Data-dependent metrics are powerful tools for learning the underlying structure of high-dimensional data. This article develops and analyzes a data-dependent metric known as diffusion state distance (DSD), which compares points using a data-driven diffusion process. Unlike related diffusion methods, DSDs incorporate information across time scales, which allows for the intrinsic data structure to be inferred in a parameter-free manner. This article develops a theory for DSD based on the multitemporal emergence of mesoscopic equilibria in the underlying diffusion process. New algorithms for denoising and dimension reduction with DSD are also proposed and analyzed. These approaches are based on a weighted spectral decomposition of the underlying diffusion process, and experiments on synthetic datasets and real biological networks illustrate the efficacy of the proposed algorithms in terms of both speed and accuracy. Throughout, comparisons with related methods are made, in order to illustrate the distinct advantages of DSD for datasets exhibiting multiscale structure.
摘要:数据相关的度量是学习高维数据的底层结构的有力工具。本文开发并分析数据相关度量被称为扩散状态的距离(DSD),其比较了使用数据驱动的扩散过程点。不同于相关扩散方法中,掺入的DSD跨越的时间尺度,这允许在一个自由参数的方式来推断固有数据结构信息。本文开发了基于细观平衡点在下面的扩散过程中,出现了多时的DSD的理论。去噪与DSD降维的新算法也被提出并分析。这些方法是基于下面的扩散过程的加权谱分解,并且在合成的数据集和真实生物网络实验示出了在速度和准确度方面所提出的算法的效力。自始至终,与相关方法比较是由,以示出用于显示多尺度结构数据集DSD的明显的优点。
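A small self-contained illustration of the DSD idea: two nodes are compared by the distance between their accumulated random-walk profiles. Summing powers of the walk matrix is one simple way to aggregate across time scales; the paper's exact multitemporal formulation and normalization may differ.

```python
import numpy as np

def dsd(A, t_max=50):
    """Pairwise diffusion-state distances from an adjacency matrix A (n x n)."""
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic random-walk matrix
    n = A.shape[0]
    H = np.zeros((n, n))
    Pk = np.eye(n)
    for _ in range(t_max):                 # accumulate diffusion profiles over time
        H += Pk
        Pk = Pk @ P
    # DSD(u, v) = || H[u] - H[v] ||_1
    return np.abs(H[:, None, :] - H[None, :, :]).sum(axis=2)

# Toy graph: two triangles joined by one bridge edge.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
D = dsd(A)
print(D[0, 1], D[0, 5])   # within-triangle distance vs. across the bridge
```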
93. STD-Net: Structure-preserving and Topology-adaptive Deformation Network for 3D Reconstruction from a Single Image [PDF] 返回目录
Aihua Mao, Canglan Dai, Lin Gao, Ying He, Yong-jin Liu
Abstract: 3D reconstruction from a single view image is a long-standing problem in computer vision. Various methods based on different shape representations (such as point cloud or volumetric representations) have been proposed. However, 3D shape reconstruction with fine details and complex structures is still challenging and has not yet been solved. Thanks to the recent advances in deep shape representations, it becomes promising to learn the structure and detail representation using deep neural networks. In this paper, we propose a novel method called STD-Net to reconstruct 3D models utilizing a mesh representation that is well suited for characterizing complex structure and geometry details. To reconstruct complex 3D mesh models with fine details, our method consists of (1) an auto-encoder network for recovering the structure of an object with a bounding-box representation from a single image, (2) a topology-adaptive graph CNN for updating vertex positions for meshes of complex topology, and (3) a unified mesh deformation block that deforms the structural boxes into structure-aware meshed models. Experimental results on images from ShapeNet show that our proposed STD-Net has better performance than other state-of-the-art methods in reconstructing 3D objects with complex structures and fine geometric details.
摘要:3D从单个视点图像重建是一个长期存在的概率-LEM在计算机视觉。基于不同的形状表示(如点云或体积表示)的各种方法已经被提出。然而,3D形状重建的细节和复杂的结构仍然CHAL-有挑战性,但尚未得到解决。由于近期提前deepshape表示的,它变得很有希望学习使用深层神经网络的结构和细节REP-resentation。在本文中,我们提出了一种新颖的methodcalled STD-Net的重建三维模型利用网格representationthat是非常适合用于表征复杂的结构和几何形状此http URL重构复杂3D网格模型具有精细的细节,我们的方法包括(1)一个自动编码器网络用于回收与结合-ING框表示从单个图像中的对象的结构,(2)拓扑自适应图形CNNfor更新顶点复杂拓扑的网格位置,以及(3)一个unifiedmesh变形块变形结构箱成结构awaremeshed模型。从ShapeNet图像实验结果表明,ourproposed STD-网具有比其他国家的最先进的方法更好的性能onreconstructing 3D结构复杂和精细几何细节对象。
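As a rough illustration of the kind of vertex update a topology-adaptive graph CNN performs, the sketch below refines mesh vertex positions from their own features and the mean of their neighbours via a residual offset; the aggregation rule and layer sizes are generic assumptions, not STD-Net's actual design.

```python
import torch
import torch.nn as nn

class VertexRefine(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        self.self_fc = nn.Linear(3 + feat_dim, 3)
        self.nbr_fc = nn.Linear(3 + feat_dim, 3)

    def forward(self, verts, feats, adj):
        """verts: (V, 3) positions, feats: (V, F) features, adj: (V, V) 0/1 adjacency."""
        x = torch.cat([verts, feats], dim=-1)
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        nbr_mean = adj @ x / deg      # mean over neighbouring vertices
        # Residual offset keeps the update topology-agnostic: any adjacency works.
        return verts + self.self_fc(x) + self.nbr_fc(nbr_mean)

V = 4  # toy tetrahedron
verts = torch.randn(V, 3)
feats = torch.randn(V, 16)
adj = torch.ones(V, V) - torch.eye(V)
print(VertexRefine()(verts, feats, adj).shape)  # torch.Size([4, 3])
```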
94. Novel Radiomic Feature for Survival Prediction of Lung Cancer Patients using Low-Dose CBCT Images [PDF] 返回目录
Bijju Kranthi Veduruparthi, Jayanta Mukherjee, Partha Pratim Das, Moses Arunsingh, Raj Kumar Shrimali, Sriram Prasath, Soumendranath Ray, Sanjay Chatterjee
Abstract: Prediction of survivability in a patient for tumor progression is useful for estimating the effectiveness of a treatment protocol. In our work, we present a model that takes into account the heterogeneous nature of a tumor to predict survival. The tumor heterogeneity is measured in terms of its mass by combining information regarding the radiodensity obtained in images with the gross tumor volume (GTV). We propose a novel feature called Tumor Mass within a GTV (TMG), which improves the prediction of survivability compared to existing models that use GTV. Weekly variation in the TMG of a patient is computed from the image data and also estimated from a cell survivability model. The parameters obtained from the cell survivability model are indicative of changes in TMG over the treatment period. We use these parameters along with other patient metadata to perform survival analysis and regression. Cox's Proportional Hazard survival regression was performed using these data. A significant improvement in the average concordance index, from 0.47 to 0.64, was observed when TMG was used in the model instead of GTV. The experiments show that there is a difference in treatment response between responsive and non-responsive patients and that the proposed method can be used to predict patient survivability.
摘要:在肿瘤进展的患者存活的预测是非常有用的估计治疗方案的有效性。在我们的工作中,我们提出了一个模型,考虑到肿瘤的异质性来预测生存。肿瘤异质性在其质量计通过把与肿瘤体积(GTV)图像获得的放射密度合成信息进行测量。我们提出了一个GTV(TMG)内称为肿瘤块新的功能,改善生存的预测,相比于使用GTV现有的模型。在患者的每周TMG变化被从图像数据计算,并且还从细胞生存能力模型来估计。从细胞生存性模型获得的参数是在治疗期间在TMG变化的指事。我们使用这些参数与其他病人的元数据进行生存分析和回归。利用这些数据进行的Cox比例风险回归生存。当TMG在模型代替GTV被用来观察在从0.47至0.64的平均一致性指数显著改善。实验结果表明,有可用于预测患者存活在敏感和非敏感患者,该方法治疗的反应的差异。
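The TMG feature can be illustrated with a few lines of numpy: CT numbers inside the GTV mask are converted to physical density and integrated over voxel volume. The linear HU-to-density mapping below is a textbook approximation, not the paper's calibration.

```python
import numpy as np

def tumor_mass_in_gtv(hu, gtv_mask, voxel_volume_cm3):
    """Mass (g) = sum of density (g/cm^3) * voxel volume over the GTV."""
    density = 1.0 + hu / 1000.0   # approximation: water = 0 HU -> 1 g/cm^3
    return float(np.sum(density[gtv_mask] * voxel_volume_cm3))

rng = np.random.default_rng(2)
hu = rng.normal(40.0, 15.0, size=(64, 64, 32))   # soft-tissue-like HU values
gtv_mask = np.zeros_like(hu, dtype=bool)
gtv_mask[20:40, 20:40, 10:20] = True             # toy GTV contour
voxel_volume_cm3 = 0.1 * 0.1 * 0.3               # 1 x 1 x 3 mm voxels

print(tumor_mass_in_gtv(hu, gtv_mask, voxel_volume_cm3), "g")
```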
95. Robust, Occlusion-aware Pose Estimation for Objects Grasped by Adaptive Hands [PDF] 返回目录
Bowen Wen, Chaitanya Mitash, Sruthi Soorian, Andrew Kimmel, Avishai Sintov, Kostas E. Bekris
Abstract: Many manipulation tasks, such as placement or within-hand manipulation, require the object's pose relative to a robot hand. The task is difficult when the hand significantly occludes the object. It is especially hard for adaptive hands, for which it is not easy to detect the finger's configuration. In addition, RGB-only approaches face issues with texture-less objects or when the hand and the object look similar. This paper presents a depth-based framework, which aims for robust pose estimation and short response times. The approach detects the adaptive hand's state via efficient parallel search given the highest overlap between the hand's model and the point cloud. The hand's point cloud is pruned and robust global registration is performed to generate object pose hypotheses, which are clustered. False hypotheses are pruned via physical reasoning. The remaining poses' quality is evaluated given agreement with observed data. Extensive evaluation on synthetic and real data demonstrates the accuracy and computational efficiency of the framework when applied on challenging, highly-occluded scenarios for different object types. An ablation study identifies how the framework's components help in performance. This work also provides a dataset for in-hand 6D object pose estimation. Code and dataset are available at: this https URL
摘要:许多操作任务,如放置或在手操纵,要求对象的姿势相对于机械手。任务是艰巨的时候手显著遮挡的对象。它是自适应的手里,因为它不容易检测手指的配置尤为严重。此外,RGB-只有接近脸的问题与无纹理对象或当手与物体的外观类似。本文提出了一种基于深度的框架,对于稳健的姿态估计和短的响应时间,其目的。该方法检测经由高效的并行自适应手的状态搜寻条件给出的手的模型和点云之间的最高重叠。手的点云修剪,并进行强有力的全球注册生成对象姿态的假设,这是集群。假的假设是通过物理推理修剪。剩下的姿势质量评估与观察到的数据给出的协议。当上具有挑战性,针对不同的对象类型高度遮挡的场景施加在合成和真实数据广泛的评估表明该框架的精度和计算效率。消融研究确定了框架的组件性能怎么样帮助。这项工作也提供了一个数据集在手6D对象姿态估计。代码和数据集,请访问:此HTTPS URL
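The hypothesis-quality step ("agreement with observed data") can be illustrated as follows: a candidate pose is scored by the fraction of transformed model points that find a close neighbour in the observed cloud. The brute-force nearest-neighbour search and the inlier threshold are assumptions standing in for the paper's internals.

```python
import numpy as np

def pose_score(model_pts, observed_pts, R, t, inlier_dist=0.005):
    """Fraction of model points within inlier_dist of an observed point."""
    transformed = model_pts @ R.T + t
    d2 = ((transformed[:, None, :] - observed_pts[None, :, :]) ** 2).sum(axis=2)
    return float((d2.min(axis=1) < inlier_dist**2).mean())

rng = np.random.default_rng(3)
model = rng.uniform(-0.05, 0.05, size=(200, 3))   # object model points (m)
R_true = np.eye(3)
t_true = np.array([0.3, 0.0, 0.5])
observed = model @ R_true.T + t_true + rng.normal(scale=0.001, size=model.shape)

# Score the true pose against a perturbed hypothesis; the better one survives pruning.
print(pose_score(model, observed, R_true, t_true),
      pose_score(model, observed, R_true, t_true + 0.02))
```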
96. SuPer Deep: A Surgical Perception Framework for Robotic Tissue Manipulation using Deep Learning for Feature Extraction [PDF] 返回目录
Jingpei Lu, Ambareesh Jayakumari, Florian Richter, Yang Li, Michael C. Yip
Abstract: Robotic automation in surgery requires precise tracking of surgical tools and mapping of deformable tissue. Previous works on surgical perception frameworks require significant effort in developing features for surgical tool and tissue tracking. In this work, we overcome the challenge by exploiting deep learning methods for surgical perception. We integrated deep neural networks, capable of efficient feature extraction, into the tissue reconstruction and instrument pose estimation processes. By leveraging transfer learning, the deep learning based approach requires minimal training data and reduced feature engineering efforts to fully perceive a surgical scene. The framework was tested on three publicly available datasets, which use the da Vinci Surgical System, for comprehensive analysis. Experimental results show that our framework achieves state-of-the-art tracking performance in a surgical environment by utilizing deep learning for feature extraction.
摘要:在手术机器人自动化要求的手术工具和可变形组织的测绘精确的跟踪。对手术的看法框架以前的作品需要在制定手术工具和组织追踪功能显著的努力。在这项工作中,我们克服了利用手术感知深度学习方法的挑战。我们综合深层神经网络,精干高效的特征提取,到组织重建和仪表姿态估计过程。通过利用迁移学习,深刻学习基础的方法只需要很少的训练数据和简化的功能设计努力,充分感知手术场面。该框架在三个公开可用的数据集,其使用达芬奇外科手术系统,进行综合分析测试。实验结果表明,我们的框架,利用深度学习的特征提取实现了手术环境的国家的最先进的跟踪性能。
97. Endoscopy disease detection challenge 2020 [PDF] 返回目录
Sharib Ali, Noha Ghatwary, Barbara Braden, Dominique Lamarque, Adam Bailey, Stefano Realdon, Renato Cannizzaro, Jens Rittscher, Christian Daul, James East
Abstract: Whilst many technologies are built around endoscopy, there is a need for a comprehensive dataset collected from multiple centers to address the generalization issues with most deep learning frameworks. What could be more important than disease detection and localization? Through our extensive network of clinical and computational experts, we have collected, curated and annotated gastrointestinal endoscopy video frames. We have released this dataset and have launched the disease detection and segmentation challenge EDD2020 (this https URL) to address the limitations and explore new directions. EDD2020 is a crowd-sourcing initiative to test the feasibility of recent deep learning methods and to promote research for building robust technologies. In this paper, we provide an overview of the EDD2020 dataset, challenge tasks, evaluation strategies and a short summary of results on test data. A detailed paper will be drafted after the challenge workshop, with more detailed analysis of the results.
摘要:虽然许多技术都是围绕内窥镜建成,还需要有多个中心收集处理与最深处的学习框架的推广问题的综合数据集。有什么能比疾病检测和定位更重要?通过我们广泛的临床和计算专家的网络,我们收集,策划并注明胃肠内窥镜视频帧。我们已经发布了这一数据集并相继推出疾病检测和分割挑战EDD2020此HTTPS URL地址来限制和探索新的方向。 EDD2020是人群采购计划,以测试近期深学习方法的可行性,并促进研究为构建健壮的技术。在本文中,我们提供的EDD2020数据集,挑战任务,评估策略和结果的测试数据的简短摘要的概述。详细的文件将与结果的更详细的分析挑战研讨会之后起草。
注:中文为机器翻译结果!