Contents
6. Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset [PDF] Abstract
9. Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [PDF] Abstract
19. UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders [PDF] Abstract
22. Deep Siamese Domain Adaptation Convolutional Neural Network for Cross-domain Change Detection in Multispectral Images [PDF] Abstract
28. Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions [PDF] Abstract
31. YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos [PDF] Abstract
34. Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications [PDF] Abstract
35. Feature Lenses: Plug-and-play Neural Modules for Transformation-Invariant Visual Representations [PDF] Abstract
36. OpenMix: Reviving Known Knowledge for Discovering Novel Visual Categories in An Open World [PDF] Abstract
37. Individual Tooth Detection and Identification from Dental Panoramic X-Ray Images via Point-wise Localization and Distance Regularization [PDF] Abstract
39. Online Initialization and Extrinsic Spatial-Temporal Calibration for Monocular Visual-Inertial Odometry [PDF] Abstract
42. A Novel Pose Proposal Network and Refinement Pipeline for Better Object Pose Estimation [PDF] Abstract
47. Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [PDF] Abstract
51. Multi-View Matching (MVM): Facilitating Multi-Person 3D Pose Estimation Learning with Action-Frozen People Video [PDF] Abstract
61. From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech [PDF] Abstract
66. A Comparison of Deep Learning Convolution Neural Networks for Liver Segmentation in Radial Turbo Spin Echo Images [PDF] Abstract
68. Towards an Efficient Deep Learning Model for COVID-19 Patterns Detection in X-ray Images [PDF] Abstract
70. Residual Attention U-Net for Automated Multi-Class Segmentation of COVID-19 Chest CT Images [PDF] Abstract
71. Relational Learning between Multiple Pulmonary Nodules via Deep Set Attention Transformers [PDF] Abstract
72. When Weak Becomes Strong: Robust Quantification of White Matter Hyperintensities in Brain MRI scans [PDF] Abstract
75. DeepEDN: A Deep Learning-based Image Encryption and Decryption Network for Internet of Medical Things [PDF] Abstract
79. Detection of Covid-19 From Chest X-ray Images Using Artificial Intelligence: An Early Review [PDF] Abstract
85. KD-MRI: A knowledge distillation framework for image reconstruction and image restoration in MRI workflow [PDF] Abstract
89. Shape Estimation for Elongated Deformable Object using B-spline Chained Multiple Random Matrices Model [PDF] Abstract
Abstracts
1. Adversarial Style Mining for One-Shot Unsupervised Domain Adaptation [PDF] Back to Contents
Yawei Luo, Ping Liu, Tao Guan, Junqing Yu, Yi Yang
Abstract: We aim at the problem named One-Shot Unsupervised Domain Adaptation. Unlike traditional Unsupervised Domain Adaptation, it assumes that only one unlabeled target sample is available when learning to adapt. This setting is realistic but more challenging, and conventional adaptation approaches are prone to failure due to the scarcity of unlabeled target data. To this end, we propose a novel Adversarial Style Mining (ASM) approach, which combines a style transfer module and a task-specific module in an adversarial manner. Specifically, the style transfer module iteratively searches for harder stylized images around the one-shot target sample according to the current learning state, leading the task model to explore potential styles that are difficult to solve in the almost unseen target domain, thus boosting the adaptation performance in a data-scarce scenario. The adversarial learning framework makes the style transfer module and the task-specific module benefit each other during the competition. Extensive experiments on both cross-domain classification and segmentation benchmarks verify that ASM achieves state-of-the-art adaptation performance under the challenging one-shot setting.
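The min-max interplay between the two modules can be pictured with a short PyTorch-style sketch. This is not the authors' code; `task_model` and `stylizer` are hypothetical modules, and the loop only illustrates the structure described above: the task model minimizes the task loss on stylized images while the style miner is updated to make that loss larger, producing harder styles.

```python
import torch
import torch.nn.functional as F

def asm_step(task_model, stylizer, source_x, source_y, target_x,
             task_opt, style_opt):
    """One adversarial round between the task model and the style miner."""
    # Task model adapts to images stylized toward the one-shot target sample.
    stylized = stylizer(source_x, target_x)
    task_loss = F.cross_entropy(task_model(stylized), source_y)
    task_opt.zero_grad()
    task_loss.backward()
    task_opt.step()

    # Style miner searches for harder styles by *maximizing* the task loss.
    harder = stylizer(source_x, target_x)
    style_loss = -F.cross_entropy(task_model(harder), source_y)
    style_opt.zero_grad()
    style_loss.backward()
    style_opt.step()
```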
2. Compositional Visual Generation and Inference with Energy Based Models [PDF] Back to Contents
Yilun Du, Shuang Li, Igor Mordatch
Abstract: A vital aspect of human intelligence is the ability to compose increasingly complex concepts out of simpler ideas, enabling both rapid learning and adaptation of knowledge. In this paper we show that energy-based models can exhibit this ability by directly combining probability distributions. Samples from the combined distribution correspond to compositions of concepts. For example, given a distribution for smiling faces, and another for male faces, we can combine them to generate smiling male faces. This allows us to generate natural images that simultaneously satisfy conjunctions, disjunctions, and negations of concepts. We evaluate compositional generation abilities of our model on the CelebA dataset of natural faces and synthetic 3D scene images. We also demonstrate other unique advantages of our model, such as the ability to continually learn and incorporate new concepts, or infer compositions of concept properties underlying an image.
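The core idea, composing concepts by combining probability distributions, has a compact expression in energy space. The sketch below is one plausible reading rather than the authors' exact formulation: with energies E_i(x) for individual concepts, conjunction adds energies, disjunction soft-mins them, and negation flips the sign (the temperature `alpha` and the sampler hyperparameters are assumptions). A Langevin-style sampler then draws images from the composed energy.

```python
import torch

def conjunction(energies):
    # AND: x must have low energy under every concept, so energies add.
    return sum(energies)

def disjunction(energies):
    # OR: low energy under any one concept suffices (soft minimum).
    return -torch.logsumexp(torch.stack([-e for e in energies]), dim=0)

def negation(energy, alpha=1.0):
    # NOT: invert the energy landscape; alpha is an assumed temperature.
    return -alpha * energy

def langevin_sample(energy_fn, x, steps=60, step_size=10.0, noise=0.005):
    # Draw samples from exp(-E) by noisy gradient descent on the energy.
    for _ in range(steps):
        x = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(energy_fn(x).sum(), x)
        x = x - step_size * grad + noise * torch.randn_like(x)
    return x.detach()
```

Generating "smiling AND male" faces then amounts to sampling from `conjunction([E_smiling(x), E_male(x)])`.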
3. Training End-to-end Single Image Generators without GANs [PDF] Back to Contents
Yael Vinker, Nir Zabari, Yedid Hoshen
Abstract: We present AugurOne, a novel approach for training single image generative models. Our approach trains an upscaling neural network using non-affine augmentations of the (single) input image, particularly including non-rigid thin plate spline image warps. The extensive augmentations significantly increase the in-sample distribution for the upsampling network, enabling the upscaling of highly variable inputs. A compact latent space is jointly learned, allowing for controlled image synthesis. Differently from Single Image GAN, our approach does not require GAN training and takes place in an end-to-end fashion, allowing fast and stable training. We experimentally evaluate our method and show that it obtains compelling novel animations from a single image, as well as state-of-the-art performance on conditional generation tasks, e.g., paint-to-image and edges-to-image.
4. Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training [PDF] Back to Contents
Hongkai Zhang, Hong Chang, Bingpeng Ma, Naiyan Wang, Xilin Chen
Abstract: Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, the training process itself is far from crystal clear. In this work, we first point out the inconsistency problem between the fixed network settings and the dynamic training procedure, which greatly affects the performance. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and thus are harmful to training high quality detectors. Consequently, we propose Dynamic R-CNN to adjust the label assignment criteria (IoU threshold) and the shape of the regression loss function (parameters of SmoothL1 Loss) automatically, based on the statistics of proposals during training. This dynamic design makes better use of the training samples and pushes the detector to fit more high quality samples. Specifically, our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead. Codes and models are available at this https URL.
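A minimal sketch of the dynamic adjustment, assuming access to per-proposal IoUs and regression errors (the function names and top-k statistics here are illustrative; the point is that both settings track proposal statistics instead of staying fixed):

```python
import torch

def dynamic_iou_threshold(proposal_ious, topk=75):
    # Let the label-assignment threshold follow proposal quality: as
    # proposals improve during training, the positive threshold rises too.
    k = min(topk, proposal_ious.numel())
    return proposal_ious.topk(k).values.mean()

def dynamic_smoothl1_beta(reg_errors, k=10):
    # Shrink the SmoothL1 beta toward the k-th smallest regression error,
    # concentrating gradient on increasingly high-quality samples.
    k = min(k, reg_errors.numel())
    return reg_errors.kthvalue(k).values

def smooth_l1(x, beta):
    absx = x.abs()
    return torch.where(absx < beta, 0.5 * absx.pow(2) / beta,
                       absx - 0.5 * beta)
```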
5. A Survey of Single-Scene Video Anomaly Detection [PDF] Back to Contents
Bharathkumar Ramachandra, Michael J. Jones, Ranga Raju Vatsavai
Abstract: This survey article summarizes research trends on the topic of anomaly detection in video feeds of a single scene. We discuss the various problem formulations, publicly available datasets and evaluation criteria. We categorize and situate past research into an intuitive taxonomy. Finally, we also provide best practices and suggest some possible directions for future research.
6. Speak2Label: Using Domain Knowledge for Creating a Large Scale Driver Gaze Zone Estimation Dataset [PDF] Back to Contents
Shreya Ghosh, Abhinav Dhall, Garima Sharma, Sarthak Gupta, Nicu Sebe
Abstract: Labelling of human behavior analysis data is a complex and time consuming task. In this paper, a fully automatic technique for labelling an image based gaze behavior dataset for driver gaze zone estimation is proposed. Domain knowledge can be added to the data recording paradigm, and labels can later be generated in an automatic manner using speech-to-text (STT) conversion. In order to remove the noise in STT due to different ethnicities, the speech frequency and energy are analysed. The resulting Driver Gaze in the Wild (DGW) dataset contains 586 recordings, captured during different times of the day including evening. The large scale dataset contains 338 subjects with an age range of 18-63 years. As the data is recorded in different lighting conditions, an illumination robust layer is proposed in the Convolutional Neural Network (CNN). The extensive experiments show the variance in the database resembling real-world conditions and the effectiveness of the proposed CNN pipeline. The proposed network is also fine-tuned for the eye gaze prediction task, which shows the discriminativeness of the representation learnt by our network on the proposed DGW dataset.
7. Dense Registration and Mosaicking of Fingerprints by Training an End-to-End Network [PDF] Back to Contents
Zhe Cui, Jianjiang Feng, Jie Zhou
Abstract: Dense registration of fingerprints is a challenging task due to elastic skin distortion, low image quality, and the self-similarity of ridge patterns. To overcome the limitations of handcrafted features, we propose to train an end-to-end network to directly output the pixel-wise displacement field between two fingerprints. The proposed network includes a siamese network for feature embedding, followed by an encoder-decoder network for regressing the displacement field. By applying displacement fields reliably estimated by tracing high quality fingerprint videos to challenging fingerprints, we synthesize a large number of training fingerprint pairs with ground truth displacement fields. In addition, based on the proposed registration algorithm, we propose a fingerprint mosaicking method based on optimal seam selection. Registration and matching experiments on the FVC2004 databases, the Tsinghua Distorted Fingerprint (TDF) database, and the NIST SD27 latent fingerprint database show that our registration method outperforms previous dense registration methods in accuracy and efficiency. A mosaicking experiment on FVC2004 DB1 demonstrates that the proposed algorithm produces higher quality fingerprints than other algorithms, which also validates the performance of our registration algorithm.
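To make the pipeline concrete, the sketch below shows only the final step: warping one fingerprint image by a predicted pixel-wise displacement field. It is an illustration under assumed tensor shapes, not the authors' code; the displacement field would come from the siamese encoder plus decoder described above.

```python
import torch
import torch.nn.functional as F

def warp_by_displacement(img, disp):
    # img: (N, 1, H, W) fingerprint; disp: (N, 2, H, W) x/y pixel offsets.
    n, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).to(img)
    grid = base + disp
    # Normalize coordinates to [-1, 1], as grid_sample expects, then resample.
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    return F.grid_sample(img, torch.stack((grid_x, grid_y), dim=-1),
                         align_corners=True)
```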
8. CVPR 2019 WAD Challenge on Trajectory Prediction and 3D Perception [PDF] Back to Contents
Sibo Zhang, Yuexin Ma, Ruigang Yang, Xin Li, Yanliang Zhu, Deheng Qian, Zetong Yang, Wenjing Zhang, Yuanpei Liu
Abstract: This paper reviews the CVPR 2019 challenge on Autonomous Driving. Baidu's Robotics and Autonomous Driving Lab (RAL) provides a 150-minute labeled Trajectory and 3D Perception dataset, including about 80k lidar point clouds and 1000 km of trajectories for urban traffic. The challenge has two tasks: (1) Trajectory Prediction and (2) 3D Lidar Object Detection. More than 200 teams submitted results to the leaderboard, and more than 1000 participants attended the workshop.
9. Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks [PDF] Back to Contents
Lin Wang, Kuk-Jin Yoon
Abstract: Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding heavy computational power and failing to be deployed on edge devices. Besides, the performance boost is highly dependent on redundant labeled data. To achieve faster speeds and to handle the problems caused by the lack of data, knowledge distillation (KD) has been proposed to transfer information learned from one model to another. KD is often characterized by the so-called 'Student-Teacher' (S-T) learning framework and has been broadly applied in model compression and knowledge transfer. This paper is about KD and S-T learning, which have been actively studied in recent years. First, we aim to provide explanations of what KD is and how/why it works. Then, we provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks, typically for vision tasks. In general, we consider some fundamental questions that have been driving this research area and thoroughly generalize the research progress and technical details. Additionally, we systematically analyze the research status of KD in vision applications. Finally, we discuss the potentials and open challenges of existing methods and prospect the future directions of KD and S-T learning.
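As a concrete anchor for the S-T framework surveyed here, below is a minimal sketch of the classic temperature-based logit distillation loss (the standard formulation, shown as a representative example rather than any single method from the review): the student matches the teacher's temperature-softened outputs while still fitting the hard labels.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Soft-target term: KL between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across T.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Hard-target term: the usual supervised loss.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```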
10. Unsupervised Facial Action Unit Intensity Estimation via Differentiable Optimization [PDF] Back to Contents
Xinhui Song, Tianyang Shi, Tianjia Shao, Yi Yuan, Zunlei Feng, Changjie Fan
Abstract: The automatic intensity estimation of facial action units (AUs) from a single image plays a vital role in facial analysis systems. One big challenge for data-driven AU intensity estimation is the lack of sufficient AU label data. Due to the fact that AU annotation requires strong domain expertise, it is expensive to construct an extensive database to learn deep models. The limited number of labeled AUs as well as identity differences and pose variations further increases the estimation difficulties. Considering all these difficulties, we propose an unsupervised framework GE-Net for facial AU intensity estimation from a single image, without requiring any annotated AU data. Our framework performs differentiable optimization, which iteratively updates the facial parameters (i.e., head pose, AU parameters and identity parameters) to match the input image. GE-Net consists of two modules: a generator and a feature extractor. The generator learns to "render" a face image from a set of facial parameters in a differentiable way, and the feature extractor extracts deep features for measuring the similarity of the rendered image and input real image. After the two modules are trained and fixed, the framework searches optimal facial parameters by minimizing the differences of the extracted features between the rendered image and the input image. Experimental results demonstrate that our method can achieve state-of-the-art results compared with existing methods.
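The test-time differentiable optimization admits a compact sketch. Module names are assumptions; the point is that with the generator and feature extractor trained beforehand and kept frozen, gradient descent runs on the facial parameters themselves until the rendered face matches the input in feature space.

```python
import torch

def fit_parameters(generator, extractor, image, init_params,
                   steps=200, lr=0.01):
    # generator and extractor are pre-trained and frozen; only the facial
    # parameters (head pose, AU and identity parameters) are optimized.
    params = init_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    target_feat = extractor(image).detach()
    for _ in range(steps):
        rendered = generator(params)  # differentiable "rendering"
        loss = (extractor(rendered) - target_feat).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()
```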
11. Regularizing Meta-Learning via Gradient Dropout [PDF] Back to Contents
Hung-Yu Tseng, Yi-Wen Chen, Yi-Hsuan Tsai, Sifei Liu, Yen-Yu Lin, Ming-Hsuan Yang
Abstract: With the growing attention on learning-to-learn new tasks using only a few examples, meta-learning has been widely used in numerous problems such as few-shot classification, reinforcement learning, and domain generalization. However, meta-learning models are prone to overfitting when there are insufficient training tasks for the meta-learners to generalize. Although existing approaches such as Dropout are widely used to address the overfitting problem, these methods are typically designed for regularizing models of a single task in supervised training. In this paper, we introduce a simple yet effective method to alleviate the risk of overfitting for gradient-based meta-learning. Specifically, during the gradient-based adaptation stage, we randomly drop the gradient in the inner-loop optimization of each parameter in deep neural networks, such that the augmented gradients improve generalization to new tasks. We present a general form of the proposed gradient dropout regularization and show that this term can be sampled from either the Bernoulli or Gaussian distribution. To validate the proposed method, we conduct extensive experiments and analysis on numerous computer vision tasks, demonstrating that the gradient dropout regularization mitigates the overfitting problem and improves the performance upon various gradient-based meta-learning frameworks.
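A minimal sketch of the Bernoulli variant, applied in a MAML-style inner loop (the meta-learning scaffolding is assumed, and the paper also derives a Gaussian-mask variant): each inner-loop gradient entry is randomly zeroed before the adaptation update.

```python
import torch

def inner_update(params, grads, lr=0.01, drop_p=0.1):
    """One inner-loop adaptation step with Bernoulli gradient dropout."""
    updated = []
    for p, g in zip(params, grads):
        keep = torch.bernoulli(torch.full_like(g, 1.0 - drop_p))
        updated.append(p - lr * g * keep)  # dropped entries take no step
    return updated
```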
12. SSP: Single Shot Future Trajectory Prediction [PDF] Back to Contents
Isht Dwivedi, Srikanth Malla, Behzad Dariush, Chiho Choi
Abstract: We propose a robust solution to future trajectory forecast, which can be practically applicable to autonomous agents in highly crowded environments. For this, three aspects are particularly addressed in this paper. First, we use composite fields to predict future locations of all road agents in a single shot, which results in a constant time complexity, regardless of the number of agents in the scene. Second, interactions between agents are modeled as a non-local response, enabling spatial relationships between different locations to be captured temporally as well (i.e., in spatio-temporal interactions). Third, the semantic context of the scene is modeled to take into account the environmental constraints that potentially influence future motion. To this end, we validate the robustness of the proposed approach using the ETH, UCY, and SDD datasets and highlight its practical functionality compared to the current state-of-the-art methods.
13. SPCNet: Spatial Preserve and Content-aware Network for Human Pose Estimation [PDF] Back to Contents
Yabo Xiao, Dongdong Yu, Xiaojuan Wang, Tianqi Lv, Yiqi Fan, Lingrui Wu
Abstract: Human pose estimation is a fundamental yet challenging task in computer vision. Although deep learning techniques have made great progress in this area, difficult scenarios (e.g., invisible keypoints, occlusions, complex multi-person scenarios, and abnormal poses) are still not well-handled. To alleviate these issues, we propose a novel Spatial Preserve and Content-aware Network (SPCNet), which includes two effective modules: the Dilated Hourglass Module (DHM) and the Selective Information Module (SIM). By using the Dilated Hourglass Module, we can preserve the spatial resolution along with a large receptive field. Similar to Hourglass Network, we stack the DHMs to obtain multi-stage and multi-scale information. Then, a Selective Information Module is designed to select relatively important features from different levels under sufficient consideration of the spatial content-aware mechanism, and thus considerably improves the performance. Extensive experiments on the MPII, LSP and FLIC human pose estimation benchmarks demonstrate the effectiveness of our network. In particular, we exceed previous methods and achieve state-of-the-art performance on the three aforementioned benchmark datasets.
14. Monocular Depth Estimation with Self-supervised Instance Adaptation [PDF] Back to Contents
Robert McCraith, Lukas Neumann, Andrew Zisserman, Andrea Vedaldi
Abstract: Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision. However, in robotics applications, multiple views of a scene may or may not be available, depending on the actions of the robot, switching between monocular and multi-view reconstruction. To address this mixed setting, we proposed a new approach that extends any off-the-shelf self-supervised monocular depth reconstruction system to use more than one image at test time. Our method builds on a standard prior learned to perform monocular reconstruction, but uses self-supervision at test time to further improve the reconstruction accuracy when multiple images are available. When used to update the correct components of the model, this approach is highly effective. On the standard KITTI benchmark, our self-supervised method consistently outperforms all the previous methods with an average 25% reduction in absolute error for the three common setups (monocular, stereo and monocular+stereo), and comes very close in accuracy when compared to the fully-supervised state-of-the-art methods.
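The "update the correct components" step can be sketched as follows. Everything here is an assumption for illustration (`decoder` as the tunable component, `photometric_loss` as the self-supervised objective); the sketch only shows the mechanism of fine-tuning a pre-trained depth network on the test instance itself.

```python
import torch

def adapt_on_instance(depth_net, frames, photometric_loss,
                      steps=20, lr=1e-4):
    # Tune only a chosen component (here: the decoder); freeze the rest.
    opt = torch.optim.Adam(depth_net.decoder.parameters(), lr=lr)
    for _ in range(steps):
        loss = photometric_loss(depth_net, frames)  # self-supervision only
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth_net
```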
15. MulayCap: Multi-layer Human Performance Capture Using A Monocular Video Camera [PDF] Back to Contents
Zhaoqi Su, Weilin Wan, Tao Yu, Lingjie Liu, Lu Fang, Wenping Wang, Yebin Liu
Abstract: We introduce MulayCap, a novel human performance capture method using a monocular video camera without the need for pre-scanning. The method uses "multi-layer" representations for geometry reconstruction and texture rendering, respectively. For geometry reconstruction, we decompose the clothed human into multiple geometry layers, namely a body mesh layer and a garment piece layer. The key technique behind is a Garment-from-Video (GfV) method for optimizing the garment shape and reconstructing the dynamic cloth to fit the input video sequence, based on a cloth simulation model which is effectively solved with gradient descent. For texture rendering, we decompose each input image frame into a shading layer and an albedo layer, and propose a method for fusing a fixed albedo map and solving for detailed garment geometry using the shading layer. Compared with existing single view human performance capture systems, our "multi-layer" approach bypasses the tedious and time consuming scanning step for obtaining a human specific mesh template. Experimental results demonstrate that MulayCap produces realistic rendering of dynamically changing details that has not been achieved in any previous monocular video camera systems. Benefiting from its fully semantic modeling, MulayCap can be applied to various important editing applications, such as cloth editing, re-targeting, relighting, and AR applications.
16. Unsupervised Few-shot Learning via Distribution Shift-based Augmentation [PDF] Back to Contents
Tiexin Qin, Wenbin Li, Yinghuan Shi, Yang Gao
Abstract: Few-shot learning aims to learn a new concept when only a few training examples are available, which has been extensively explored in recent years. However, most of the current works heavily rely on a large-scale labeled auxiliary set to train their models in an episodic-training paradigm. Such a kind of supervised setting basically limits the widespread use of few-shot learning algorithms, especially in real-world applications. Instead, in this paper, we develop a novel framework called Unsupervised Few-shot Learning via Distribution Shift-based Data Augmentation (ULDA), which pays attention to the distribution diversity inside each constructed pretext few-shot task when using data augmentation. Importantly, we highlight the value and importance of the distribution diversity in the augmentation-based pretext few-shot tasks. In ULDA, we systematically investigate the effects of different augmentation techniques and propose to strengthen the distribution diversity (or difference) between the query set and support set in each few-shot task, by augmenting these two sets separately (i.e. shifting). In this way, even incorporated with simple augmentation techniques (e.g. random crop, color jittering, or rotation), our ULDA can produce a significant improvement. In the experiments, few-shot models learned by ULDA can achieve superior generalization performance and obtain state-of-the-art results in a variety of established few-shot learning tasks on miniImageNet and tieredImageNet. The source code is available at this https URL.
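The "shifting" idea, augmenting the support and query sets of each constructed task with different transforms, is easy to picture in code. The specific transforms below are illustrative assumptions, not the paper's exact choices; the point is that augmenting the two sets separately builds a support/query distribution gap into every pretext task.

```python
import torchvision.transforms as T

# Two deliberately different augmentation pipelines, one per set.
support_aug = T.Compose([T.RandomResizedCrop(84), T.ToTensor()])
query_aug = T.Compose([T.ColorJitter(0.4, 0.4, 0.4),
                       T.RandomRotation(30),
                       T.ToTensor()])

def build_task(images):
    # Unlabeled images serve as their own pseudo-classes; the shifted
    # augmentations create the distribution diversity the method relies on.
    support = [support_aug(img) for img in images]
    query = [query_aug(img) for img in images]
    return support, query
```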
17. Learning Event-Based Motion Deblurring [PDF] Back to Contents
Zhe Jiang, Yu Zhang, Dongqing Zou, Jimmy Ren, Jiancheng Lv, Yebin Liu
Abstract: Recovering a sharp video sequence from a motion-blurred image is highly ill-posed due to the significant loss of motion information in the blurring process. For event-based cameras, however, fast motion can be captured as events at a high time rate, raising new opportunities for exploring effective solutions. In this paper, we start from a sequential formulation of event-based motion deblurring, then show how its optimization can be unfolded with a novel end-to-end deep architecture. The proposed architecture is a convolutional recurrent neural network that integrates visual and temporal knowledge of both global and local scales in a principled manner. To further improve the reconstruction, we propose a differentiable directional event filtering module to effectively extract rich boundary priors from the stream of events. We conduct extensive experiments on the synthetic GoPro dataset and a large newly introduced dataset captured by a DAVIS240C camera. The proposed approach achieves state-of-the-art reconstruction quality, and generalizes better to handling real-world motion blur.
18. Towards Transferable Adversarial Attack against Deep Face Recognition [PDF] Back to Contents
Yaoyao Zhong, Weihong Deng
Abstract: Face recognition has achieved great success in the last five years due to the development of deep learning methods. However, deep convolutional neural networks (DCNNs) have been found to be vulnerable to adversarial examples. In particular, the existence of transferable adversarial examples could severely hinder the robustness of DCNNs, since this type of attack can be applied in a fully black-box manner without queries on the target system. In this work, we first investigate the characteristics of transferable adversarial attacks in face recognition by showing the superiority of feature-level methods over label-level methods. Then, to further improve the transferability of feature-level adversarial examples, we propose DFANet, a dropout-based method used in convolutional layers, which could increase the diversity of surrogate models and obtain ensemble-like effects. Extensive experiments on state-of-the-art face models with various training databases, loss functions and network architectures show that the proposed method can significantly enhance the transferability of existing attack methods. Finally, by applying DFANet to the LFW database, we generate a new set of adversarial face pairs that can successfully attack four commercial APIs without any queries. This TALFW database is available to facilitate research on the robustness and defense of deep face recognition.
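One hedged way to picture the surrogate-dropout idea is a generic iterative feature-level attack with dropout left stochastic in the surrogate during the attack iterations. The skeleton below is standard iterative gradient descent on a feature distance, not the authors' exact method, and keeping the whole model in train mode is a crude stand-in for per-layer dropout:

```python
import torch

def feature_attack(surrogate, x, target_x, eps=8/255, alpha=1/255, steps=40):
    # Feature of the impersonation target, computed once in eval mode.
    surrogate.eval()
    with torch.no_grad():
        target_feat = surrogate(target_x)
    surrogate.train()  # keep dropout stochastic, diversifying the surrogate
    adv = x.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        loss = (surrogate(adv) - target_feat).pow(2).mean()
        grad, = torch.autograd.grad(loss, adv)
        adv = adv - alpha * grad.sign()          # pull features together
        adv = x + (adv - x).clamp(-eps, eps)     # stay in the eps-ball
        adv = adv.clamp(0, 1)
    return adv.detach()
```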
19. UC-Net: Uncertainty Inspired RGB-D Saliency Detection via Conditional Variational Autoencoders [PDF] 返回目录
Jing Zhang, Deng-Ping Fan, Yuchao Dai, Saeed Anwar, Fatemeh Sadat Saleh, Tong Zhang, Nick Barnes
Abstract: In this paper, we propose the first framework (UCNet) to employ uncertainty for RGB-D saliency detection by learning from the data labeling process. Existing RGB-D saliency detection methods treat the saliency detection task as a point estimation problem, and produce a single saliency map following a deterministic learning pipeline. Inspired by the saliency data labeling process, we propose a probabilistic RGB-D saliency detection network based on conditional variational autoencoders to model human annotation uncertainty, and generate multiple saliency maps for each input image by sampling in the latent space. With the proposed saliency consensus process, we are able to generate an accurate saliency map based on these multiple predictions. Quantitative and qualitative evaluations on six challenging benchmark datasets against 18 competing algorithms demonstrate the effectiveness of our approach in learning the distribution of saliency maps, leading to a new state-of-the-art in RGB-D saliency detection.
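At inference time, the stochastic part of such a model reduces to sampling the latent code several times and fusing the decoded maps. A minimal sketch (PyTorch), where encoder and decoder are hypothetical stand-ins for the trained CVAE, and mean-plus-majority-vote is one simple consensus rule, not necessarily the paper's:

import torch

@torch.no_grad()
def predict_with_consensus(encoder, decoder, rgb, depth, n_samples=10, z_dim=32):
    feat = encoder(rgb, depth)                                  # deterministic image features
    maps = []
    for _ in range(n_samples):
        z = torch.randn(rgb.size(0), z_dim, device=rgb.device)  # sample latent code
        maps.append(torch.sigmoid(decoder(feat, z)))            # one saliency map
    stack = torch.stack(maps)                                   # n_samples x B x 1 x H x W
    mean_map = stack.mean(0)                                    # average prediction
    votes = (stack > 0.5).float().mean(0)                       # fraction voting "salient"
    consensus = (votes > 0.5).float()                           # majority-vote consensus map
    return mean_map, consensus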
20. Self-supervised Feature Learning by Cross-modality and Cross-view Correspondences [PDF] 返回目录
Longlong Jing, Yucheng Chen, Ling Zhang, Mingyi He, Yingli Tian
Abstract: The success of supervised learning requires large-scale ground truth labels which are very expensive, time-consuming, or may need special skills to annotate. To address this issue, many self- or un-supervised methods have been developed. Unlike most existing self-supervised methods, which learn only 2D image features or only 3D point cloud features, this paper presents a novel and effective self-supervised learning approach that jointly learns both 2D image features and 3D point cloud features by exploiting cross-modality and cross-view correspondences, without using any human-annotated labels. Specifically, 2D image features of rendered images from different views are extracted by a 2D convolutional neural network, and 3D point cloud features are extracted by a graph convolutional neural network. Two types of features are fed into a two-layer fully connected neural network to estimate the cross-modality correspondence. The three networks are jointly trained (i.e. cross-modality) by verifying whether two samples of different modalities belong to the same object; meanwhile, the 2D convolutional neural network is additionally optimized by minimizing intra-object distance while maximizing inter-object distance of rendered images in different views (i.e. cross-view). The effectiveness of the learned 2D and 3D features is evaluated by transferring them to five different tasks, including multi-view 2D shape recognition, 3D shape recognition, multi-view 2D shape retrieval, 3D shape retrieval, and 3D part segmentation. Extensive evaluations on all five tasks across different datasets demonstrate the strong generalization and effectiveness of the 2D and 3D features learned by the proposed self-supervised method.
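The cross-modality objective boils down to a binary classifier over paired 2D and 3D embeddings. A compact sketch (PyTorch); cnn2d and gnn3d are placeholders for the 2D CNN and the point-cloud graph network, and the pairing logic is simplified:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityHead(nn.Module):
    """Two-layer MLP deciding whether an image feature and a point-cloud
    feature come from the same object (the cross-modality task)."""
    def __init__(self, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, img_feat, pc_feat):
        return self.mlp(torch.cat([img_feat, pc_feat], dim=1)).squeeze(1)

def training_step(cnn2d, gnn3d, head, renders, points, same_object):
    img_feat = cnn2d(renders)          # B x d, from rendered views
    pc_feat = gnn3d(points)            # B x d, from point clouds
    logits = head(img_feat, pc_feat)
    # cross-modality loss: same object or not
    return F.binary_cross_entropy_with_logits(logits, same_object.float())

The cross-view term would add a margin loss on rendered-view pairs (small distance for the same object, large for different objects), omitted here for brevity.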
21. Enabling Incremental Knowledge Transfer for Object Detection at the Edge [PDF] 返回目录
Mohammad Farhadi Bajestani, Mehdi Ghasemi, Sarma Vrudhula, Yezhou Yang
Abstract: Object detection using deep neural networks (DNNs) involves a huge amount of computation, which impedes its implementation on resource/energy-limited user-end devices. The success of DNNs stems from having knowledge over all the different domains of observed environments. However, only limited knowledge of the observed environment is needed at inference time, and it can be learned using a shallow neural network (SHNN). In this paper, a system-level design is proposed to improve the energy consumption of object detection on the user-end device. An SHNN is deployed on the user-end device to detect objects in the observed environment. Also, a knowledge transfer mechanism is implemented to update the SHNN model using the DNN knowledge when there is a change in the object domain. DNN knowledge can be obtained from a powerful edge device connected to the user-end device through LAN or Wi-Fi. Experiments demonstrate that the energy consumption of the user-end device and the inference time can be improved by 78% and 40%, respectively, compared with running the deep model on the user-end device.
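The knowledge-transfer step is essentially distillation from the remote DNN into the on-device SHNN. A minimal sketch (PyTorch), triggered when a domain change is detected; the temperature and the plain KL objective are illustrative assumptions:

import torch
import torch.nn.functional as F

def distill_update(shnn, dnn, frames, optimizer, T=4.0):
    """One incremental update of the shallow on-device model from the
    edge-hosted deep model's soft predictions on recent frames."""
    with torch.no_grad():
        teacher = F.softmax(dnn(frames) / T, dim=1)       # DNN knowledge
    student = F.log_softmax(shnn(frames) / T, dim=1)
    loss = F.kl_div(student, teacher, reduction="batchmean") * (T * T)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()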
22. Deep Siamese Domain Adaptation Convolutional Neural Network for Cross-domain Change Detection in Multispectral Images [PDF] 返回目录
Hongruixuan Chen, Chen Wu, Bo Du, Liangpei Zhang
Abstract: Recently, deep learning has achieved promising performance in the change detection task. However, the deep models are task-specific, and data set bias often exists; thus it is difficult to transfer a network trained on one multi-temporal data set (source domain) to another multi-temporal data set with very limited (even no) labeled data (target domain). In this paper, we propose a novel deep siamese domain adaptation convolutional neural network (DSDANet) architecture for cross-domain change detection. In DSDANet, a siamese convolutional neural network first extracts spatial-spectral features from multi-temporal images. Then, through multiple kernel maximum mean discrepancy (MK-MMD), the learned feature representation is embedded into a reproducing kernel Hilbert space (RKHS), in which the distributions of the two domains can be explicitly matched. By optimizing the network parameters and kernel coefficients with the source labeled data and target unlabeled data, DSDANet can learn a transferable feature representation that bridges the discrepancy between the two domains. To the best of our knowledge, it is the first time that such a domain adaptation-based deep network is proposed for change detection. The theoretical analysis and experimental results demonstrate the effectiveness and potential of the proposed method.
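The MK-MMD term itself is easy to write down: with a sum of Gaussian kernels, the (biased) estimate of the squared MMD between source and target feature batches looks like this (PyTorch; the bandwidth set is an illustrative choice):

import torch

def mk_mmd(x, y, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Biased estimate of squared MMD between batches x (B1 x d) and y (B2 x d)
    under a multi-kernel (sum of Gaussians) embedding."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)                 # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s * s)) for s in sigmas)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# the total objective would combine the change-detection loss on labeled
# source pairs with this discrepancy on the embedded features, e.g.
# loss = task_loss + lam * mk_mmd(src_feat, tgt_feat)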
23. Learning Spatial Relationships between Samples of Image Shapes [PDF] 返回目录
Juan Castorena, Manish Bhattarai, Diane Oyen
Abstract: Many applications, including image-based classification and retrieval of scientific and patent documents, involve images in which brightness or color is not representative of content. In such cases, it seems intuitive to perform analysis on image shapes rather than on texture variations (i.e., pixel values). Here, we propose a method that combines sparse sampling of points from image shapes with learning the spatial relationships between the extracted samples that characterize them. A dynamic graph CNN, producing a different graph at each layer, is trained and used as the learning engine for node and edge features in a classification/retrieval task. Our experiments on multiple datasets cover a variety of point-sampling sparsities, training-set sizes, rigid-body transformations, and scalings, and show that the accuracy of our approach is less likely to degrade due to small training sets or transformations of the data.
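A dynamic graph CNN recomputes a k-nearest-neighbour graph in feature space at every layer, so the graph changes as the features evolve. A minimal EdgeConv-style layer over sampled shape points (PyTorch; sizes and k are illustrative):

import torch
import torch.nn as nn

def knn_indices(x, k):            # x: B x N x d
    d = torch.cdist(x, x)         # pairwise distances in feature space
    return d.topk(k + 1, dim=-1, largest=False).indices[..., 1:]  # drop self

class EdgeConv(nn.Module):
    """One dynamic-graph layer: the graph is rebuilt from current features."""
    def __init__(self, d_in, d_out, k=16):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * d_in, d_out), nn.ReLU())

    def forward(self, x):                       # x: B x N x d_in
        B, N, D = x.shape
        idx = knn_indices(x, self.k)            # B x N x k
        nbrs = torch.gather(x.unsqueeze(1).expand(B, N, N, D), 2,
                            idx.unsqueeze(-1).expand(B, N, self.k, D))
        ctr = x.unsqueeze(2).expand_as(nbrs)
        edge = self.mlp(torch.cat([ctr, nbrs - ctr], dim=-1))
        return edge.max(dim=2).values           # aggregate over neighbours

Stacking such layers gives a different graph at each depth, since knn_indices is applied to each layer's own input features.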
24. A negative case analysis of visual grounding methods for VQA [PDF] 返回目录
Robik Shrestha, Kushal Kafle, Christopher Kanan
Abstract: Existing Visual Question Answering (VQA) methods tend to exploit dataset biases and spurious statistical correlations, instead of producing right answers for the right reasons. To address this issue, recent bias mitigation methods for VQA propose to incorporate visual cues (e.g., human attention maps) to better ground the VQA models, showcasing impressive gains. However, we show that the performance improvements are not a result of improved visual grounding, but a regularization effect which prevents over-fitting to linguistic priors. For instance, we find that it is not actually necessary to provide proper, human-based cues; random, insensible cues also result in similar improvements. Based on this observation, we propose a simpler regularization scheme that does not require any external annotations and yet achieves near state-of-the-art performance on VQA-CPv2.
25. Low-Resolution Overhead Thermal Tripwire for Occupancy Estimation [PDF] 返回目录
Mertcan Cokbas, Prakash Ishwar, Janusz Konrad
Abstract: Smart buildings use occupancy sensing for various tasks ranging from energy-efficient HVAC and lighting to space-utilization analysis and emergency response. We propose a people counting system which uses a low-resolution thermal sensor. Unlike previous thermal sensor based people counting systems, we use an overhead tripwire configuration at entryways to detect and track transient entries or exits. We develop two people counting algorithms for this system configuration. To evaluate our algorithms, we have collected and labeled a low-resolution thermal video dataset with the proposed system configuration. The dataset, the largest of its kind, will be published alongside the paper should it be accepted. We also propose new evaluation metrics that are more suitable for systems that are subject to drift and jitter.
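The tripwire logic itself reduces to detecting when a tracked centroid crosses a virtual line under the entryway. A plain-Python sketch; the track format and the convention that downward crossings are entries are assumptions:

def count_crossings(tracks, y_line):
    """Count entries/exits from centroid tracks.
    tracks: iterable of [(x, y), ...] centroid positions over time.
    A downward crossing of y_line counts as an entry, upward as an exit."""
    entries = exits = 0
    for track in tracks:
        for (x0, y0), (x1, y1) in zip(track, track[1:]):
            if y0 < y_line <= y1:
                entries += 1
            elif y1 < y_line <= y0:
                exits += 1
    return entries, exits

# occupancy change for a room with one monitored entryway
entries, exits = count_crossings([[(3, 0), (3, 2), (4, 5)]], y_line=3)
occupancy_change = entries - exits   # here: 1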
26. PatchAttack: A Black-box Texture-based Attack with Reinforcement Learning [PDF] 返回目录
Chenglin Yang, Adam Kortylewski, Cihang Xie, Yinzhi Cao, Alan Yuille
Abstract: Patch-based attacks introduce a perceptible but localized change to the input that induces misclassification. A limitation of current patch-based black-box attacks is that they perform poorly for targeted attacks, and even for the less challenging non-targeted scenarios, they require a large number of queries. Our proposed PatchAttack is query-efficient and can break models for both targeted and non-targeted attacks. PatchAttack induces misclassifications by superimposing small textured patches on the input image. We parametrize the appearance of these patches by a dictionary of class-specific textures. This texture dictionary is learned by clustering Gram matrices of feature activations from a VGG backbone. PatchAttack optimizes the position and texture parameters of each patch using reinforcement learning. Our experiments show that PatchAttack achieves > 99% success rate on ImageNet for a wide range of architectures, while manipulating only 3% of the image for non-targeted attacks and 10% on average for targeted attacks. Furthermore, we show that PatchAttack successfully circumvents state-of-the-art adversarial defense methods.
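Stripped of the learned policy, the attack's query loop can be sketched as follows. Random search stands in for the paper's RL agent, query_fn is a hypothetical black box returning only output probabilities, and img/textures are assumed to be torch tensors:

import random

def patch_attack_random(query_fn, img, textures, target, max_queries=5000):
    """Superimpose class-specific texture patches until the black-box model
    predicts `target`. Random search here replaces the paper's RL policy."""
    _, H, W = img.shape
    for q in range(1, max_queries + 1):
        tex = random.choice(textures)               # texture from the dictionary
        ph, pw = tex.shape[-2:]
        y = random.randrange(H - ph)
        x = random.randrange(W - pw)
        adv = img.clone()
        adv[:, y:y + ph, x:x + pw] = tex            # place the textured patch
        probs = query_fn(adv.unsqueeze(0))          # one black-box query
        if probs.argmax(dim=1).item() == target:
            return adv, q                           # success and query count
    return None, max_queries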
27. MLCVNet: Multi-Level Context VoteNet for 3D Object Detection [PDF] 返回目录
Qian Xie, Yu-Kun Lai, Jing Wu, Zhoutao Wang, Yiming Zhang, Kai Xu, Jun Wang
Abstract: In this paper, we address the 3D object detection task by capturing multi-level contextual information with the self-attention mechanism and multi-scale feature fusion. Most existing 3D object detection methods recognize objects individually, without giving any consideration to contextual information between these objects. Comparatively, we propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet. We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels. Specifically, a Patch-to-Patch Context (PPC) module is employed to capture contextual information between the point patches, before voting for their corresponding object centroid points. Subsequently, an Object-to-Object Context (OOC) module is incorporated before the proposal and classification stage, to capture the contextual information between object candidates. Finally, a Global Scene Context (GSC) module is designed to learn the global scene context. We demonstrate these by capturing contextual information at patch, object and scene levels. Our method is an effective way to improve detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets, i.e., SUN RGBD and ScanNet. We also release our code at this https URL.
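Each context module is, at its core, self-attention over a set of feature vectors (point patches, object candidates, or a global pool). A minimal single-head sketch with a residual connection (PyTorch), as an illustration rather than the paper's exact module:

import torch
import torch.nn as nn

class ContextSelfAttention(nn.Module):
    """Single-head self-attention over N feature vectors (B x N x d)."""
    def __init__(self, d):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        self.v = nn.Linear(d, d)

    def forward(self, x):
        attn = torch.softmax(
            self.q(x) @ self.k(x).transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        return x + attn @ self.v(x)   # residual: features enriched with context

# e.g. patch-level context: patch_feats = ContextSelfAttention(256)(patch_feats)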
28. Which visual questions are difficult to answer? Analysis with Entropy of Answer Distributions [PDF] 返回目录
Kento Terao, Toru Tamaki, Bisser Raytchev, Kazufumi Kaneda, Shun'ichi Satoh
Abstract: We propose a novel approach to identify the difficulty of visual questions for Visual Question Answering (VQA) without direct supervision or annotations of difficulty. Prior works have considered the diversity of ground-truth answers of human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline model that takes both images and questions as input, and two variants that take only images or only questions as input. We use simple k-means clustering to group the visual questions of the VQA v2 validation set. Then we use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of difficulty is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that are not answered correctly by state-of-the-art methods. Detailed analysis on the VQA v2 dataset reveals that 1) all methods show poor performance on the most difficult cluster (about 10% accuracy), 2) as the cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) the values of cluster entropy are highly correlated with cluster accuracy. We show that our approach has the advantage of being able to assess the difficulty of visual questions without ground truth (i.e. the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate the development of novel research directions and new algorithms. Clustering results are available online at this https URL.
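The clustering step is straightforward: compute the entropy of each model's predicted answer distribution per question, stack the three entropies into a feature vector, and run k-means. A sketch (NumPy/scikit-learn); p_full, p_img_only and p_q_only are hypothetical N x A answer-distribution matrices from the three models, and the cluster count is an illustrative choice:

import numpy as np
from sklearn.cluster import KMeans

def answer_entropy(p, eps=1e-12):
    """Entropy per row of an N x A matrix of answer distributions."""
    p = p / p.sum(axis=1, keepdims=True)
    return -(p * np.log(p + eps)).sum(axis=1)

def cluster_questions(p_full, p_img_only, p_q_only, n_clusters=5):
    feats = np.stack([answer_entropy(p_full),
                      answer_entropy(p_img_only),
                      answer_entropy(p_q_only)], axis=1)   # N x 3 entropy features
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)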
29. Sharing Matters for Generalization in Deep Metric Learning [PDF] 返回目录
Timo Milbich, Karsten Roth, Biagio Brattoli, Björn Ommer
Abstract: Learning the similarity between images constitutes the foundation for numerous vision tasks. The common paradigm is discriminative metric learning, which seeks an embedding that separates different training classes. However, the main challenge is to learn a metric that generalizes from training to novel, but related, test samples. It should also transfer to different object classes. So what complementary information is missed by the discriminative paradigm? Besides finding characteristics that separate between classes, we also need them to likely occur in novel categories, which is indicated if they are shared across training classes. This work investigates how to learn such characteristics without the need for extra annotations or training data. By formulating our approach as a novel triplet sampling strategy, it can be easily applied on top of recent ranking loss frameworks. Experiments show that, independent of the underlying network architecture and the specific ranking loss, our approach significantly improves performance in deep metric learning, leading to new state-of-the-art results on various standard benchmark datasets.
30. Image Co-skeletonization via Co-segmentation [PDF] 返回目录
Koteswar Rao Jerripothula, Jianfei Cai, Jiangbo Lu, Junsong Yuan
Abstract: Recent advances in the joint processing of images have clearly shown its advantages over individual processing. Different from the existing works geared towards co-segmentation or co-localization, in this paper we explore a new joint processing topic: image co-skeletonization, which is defined as joint skeleton extraction of objects in an image collection. Object skeletonization in a single natural image is a challenging problem because there is hardly any prior knowledge about the object. Therefore, we resort to the idea of object co-skeletonization, hoping that the commonness prior that exists across the images may help, just as it does for other joint processing problems such as co-segmentation. We observe that the skeleton can provide good scribbles for segmentation, and skeletonization, in turn, needs good segmentation. Therefore, we propose a coupled framework for the co-skeletonization and co-segmentation tasks so that they are well informed by each other, and benefit each other synergistically. Since it is a new problem, we also construct a benchmark dataset by annotating nearly 1.8k images spread across 38 categories. Extensive experiments demonstrate that the proposed method achieves promising results in all three possible scenarios of joint processing: weakly-supervised, supervised, and unsupervised.
31. YouMakeup VQA Challenge: Towards Fine-grained Action Understanding in Domain-Specific Videos [PDF] 返回目录
Shizhe Chen, Weiying Wang, Ludan Ruan, Linli Yao, Qin Jin
Abstract: The goal of the YouMakeup VQA Challenge 2020 is to provide a common benchmark for fine-grained action understanding in domain-specific videos, e.g., makeup instructional videos. We propose two novel question-answering tasks to evaluate models' fine-grained action understanding abilities. The first task is Facial Image Ordering, which aims to understand the visual effects that different actions, expressed in natural language, have on the facial object. The second task is Step Ordering, which aims to measure cross-modal semantic alignments between untrimmed videos and multi-sentence texts. In this paper, we present the challenge guidelines, the dataset used, and the performance of baseline models on the two proposed tasks. The baseline codes and models are released at this https URL.
32. Cross-domain Correspondence Learning for Exemplar-based Image Translation [PDF] 返回目录
Pan Zhang, Bo Zhang, Dong Chen, Lu Yuan, Fang Wen
Abstract: We present a general framework for exemplar-based image translation, which synthesizes a photo-realistic image from the input in a distinct domain (e.g., semantic segmentation mask, or edge map, or pose keypoints), given an exemplar image. The output has the style (e.g., color, texture) in consistency with the semantically corresponding objects in the exemplar. We propose to jointly learn the cross-domain correspondence and the image translation, where both tasks facilitate each other and thus can be learned with weak supervision. The images from distinct domains are first aligned to an intermediate domain where dense correspondence is established. Then, the network synthesizes images based on the appearance of semantically corresponding patches in the exemplar. We demonstrate the effectiveness of our approach on several image translation tasks. Our method significantly outperforms state-of-the-art methods in terms of image quality, with the image style faithful to the exemplar and semantically consistent. Moreover, we show the utility of our method for several applications.
33. FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions [PDF] 返回目录
Alvin Wan, Xiaoliang Dai, Peizhao Zhang, Zijian He, Yuandong Tian, Saining Xie, Bichen Wu, Matthew Yu, Tao Xu, Kan Chen, Peter Vajda, Joseph E. Gonzalez
Abstract: Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to those of other search methods, since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory- and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to $10^{14}\times$ over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421$\times$ less search cost, DMaskingNAS finds models with 0.9% higher accuracy and 15% fewer FLOPs than MobileNetV3-Small, and with similar accuracy but 20% fewer FLOPs than EfficientNet-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with an equivalent model size. FBNetV2 models are open-sourced at this https URL.
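The channel-dimension trick can be read as follows: keep a single max-width convolution and mix binary channel masks with Gumbel-softmax weights, so memory stays nearly constant however many widths are searched. A simplified sketch of such a masking mechanism (PyTorch; widths and sizes are illustrative):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMaskedConv(nn.Module):
    """One max-width conv shared by all candidate widths via mask mixing."""
    def __init__(self, in_ch, max_out, widths=(8, 16, 24, 32)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, max_out, 3, padding=1)
        masks = torch.zeros(len(widths), max_out)
        for i, w in enumerate(widths):
            masks[i, :w] = 1.0                      # binary mask for each width
        self.register_buffer("masks", masks)
        self.alpha = nn.Parameter(torch.zeros(len(widths)))  # architecture logits

    def forward(self, x, tau=1.0):
        w = F.gumbel_softmax(self.alpha, tau=tau)   # differentiable width choice
        mask = (w[:, None] * self.masks).sum(0)     # weighted sum of masks
        return self.conv(x) * mask[None, :, None, None]

After the search, the width whose logit wins is kept and the layer is re-instantiated at that width for training from scratch.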
34. Toward Hierarchical Self-Supervised Monocular Absolute Depth Estimation for Autonomous Driving Applications [PDF] 返回目录
Feng Xue, Guirong Zhuo, Ziyuan Huang, Wufei Fu, Zhuoyue Wu, Marcelo H. Ang Jr
Abstract: In recent years, self-supervised methods for monocular depth estimation have rapidly become a significant branch of the depth estimation task, especially for autonomous driving applications. Despite the high overall precision achieved, current methods still suffer from a) imprecise object-level depth inference and b) an uncertain scale factor. The former problem would cause texture copying or inaccurate object boundaries, and the latter would require current methods to have an additional sensor like LiDAR to provide depth ground truth, or a stereo camera as additional training input, which makes them difficult to implement. In this work, we propose to address these two problems together by introducing DNet. Our contributions are twofold: a) a novel dense connected prediction (DCP) layer is proposed to provide better object-level depth estimation, and b) specifically for autonomous driving scenarios, dense geometrical constraints (DGC) are introduced so that the precise scale factor can be recovered without additional cost for autonomous vehicles. Extensive experiments have been conducted, and both the DCP layer and the DGC module are shown to effectively solve the respective problems. Thanks to the DCP layer, object boundaries can now be better distinguished in the depth map, and the depth is more continuous at the object level. It is also demonstrated that the performance of using DGC to perform scale recovery is comparable to that of using ground-truth information, when the camera height is given and ground points take up more than 1.03% of the pixels. Code will be publicly available once the paper is accepted.
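The geometry behind recovering absolute scale from ground points is compact: back-project predicted ground-pixel depths with the camera intrinsics, fit a plane, and compare the camera-to-plane distance with the known mounting height. A sketch (NumPy) of this reading of the dense geometrical constraints, assuming a given ground mask and intrinsics K:

import numpy as np

def recover_scale(depth, ground_mask, K, camera_height):
    """Scale factor = true camera height / height implied by predicted depth."""
    v, u = np.nonzero(ground_mask)                  # pixel coords of ground points
    z = depth[v, u]
    x = (u - K[0, 2]) * z / K[0, 0]                 # back-project with intrinsics
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)
    centroid = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - centroid)        # plane normal = smallest
    normal = Vt[-1]                                 # singular vector
    est_height = abs(centroid @ normal)             # camera-to-ground distance
    return camera_height / est_height               # multiply depths by this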
35. Feature Lenses: Plug-and-play Neural Modules for Transformation-Invariant Visual Representations [PDF] 返回目录
Shaohua Li, Xiuchao Sui, Jie Fu, Yong Liu, Rick Siow Mong Goh
Abstract: Convolutional Neural Networks (CNNs) are known to be brittle under various image transformations, including rotations, scalings, and changes of lighting conditions. We observe that the features of a transformed image are drastically different from the ones of the original image. To make CNNs more invariant to transformations, we propose "Feature Lenses", a set of ad-hoc modules that can be easily plugged into a trained model (referred to as the "host model"). Each individual lens reconstructs the original features given the features of a transformed image under a particular transformation. These lenses jointly counteract feature distortions caused by various transformations, thus making the host model more robust without retraining. By only updating lenses, the host model is freed from iterative updating when facing new transformations absent in the training data; as feature semantics are preserved, downstream applications, such as classifiers and detectors, automatically gain robustness without retraining. Lenses are trained in a self-supervised fashion with no annotations, by minimizing a novel "Top-K Activation Contrast Loss" between lens-transformed features and original features. Evaluated on ImageNet, MNIST-rot, and CIFAR-10, Feature Lenses show clear advantages over baseline methods.
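The loss name suggests comparing reconstructed and original features only at the most salient activations. A hedged sketch, reading "Top-K Activation Contrast Loss" as a per-channel top-K regression term (the paper's exact form may differ):

    # Hypothetical reading: penalise lens-vs-original disagreement at the K
    # strongest activations of the original feature map.
    import torch
    import torch.nn.functional as F

    def topk_activation_contrast(lens_feat, orig_feat, k=64):
        # lens_feat, orig_feat: (B, C, H, W) host-model features
        orig_flat = orig_feat.flatten(2)            # (B, C, H*W)
        lens_flat = lens_feat.flatten(2)
        idx = orig_flat.topk(k, dim=-1).indices     # top-K per channel
        return F.mse_loss(lens_flat.gather(-1, idx),
                          orig_flat.gather(-1, idx))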
36. OpenMix: Reviving Known Knowledge for Discovering Novel Visual Categories in An Open World [PDF] 返回目录
Zhun Zhong, Linchao Zhu, Zhiming Luo, Shaozi Li, Yi Yang, Nicu Sebe
Abstract: In this paper, we tackle the problem of discovering new classes in unlabeled visual data given labeled data from disjoint classes. Existing methods typically first pre-train a model with labeled data, and then identify new classes in unlabeled data via unsupervised clustering. However, the labeled data that provide essential knowledge are often underexplored in the second step. The challenge is that the labeled and unlabeled examples are from non-overlapping classes, which makes it difficult to build the learning relationship between them. In this work, we introduce OpenMix to mix the unlabeled examples from an open set and the labeled examples from known classes, where their non-overlapping labels and pseudo-labels are simultaneously mixed into a joint label distribution. OpenMix dynamically compounds examples in two ways. First, we produce mixed training images by incorporating labeled examples with unlabeled examples. With the benefits of unique prior knowledge in novel class discovery, the generated pseudo-labels will be more credible than the original unlabeled predictions. As a result, OpenMix helps to prevent the model from overfitting on unlabeled samples that may be assigned with wrong pseudo-labels. Second, the first way encourages the unlabeled examples with high class-probabilities to have considerable accuracy. We introduce these examples as reliable anchors and further integrate them with unlabeled samples. This enables us to generate more combinations in unlabeled examples and exploit finer object relations among the new classes. Experiments on three classification datasets demonstrate the effectiveness of the proposed OpenMix, which is superior to state-of-the-art methods in novel class discovery.
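The mixing step can be illustrated with a mixup-style sketch, assuming a joint label space that concatenates the known classes with the novel (clustered) classes; the names and the Beta prior are illustrative:

    # Hypothetical sketch of OpenMix-style mixing into a joint label space.
    import torch

    def openmix(x_lab, y_lab, x_unlab, p_novel, n_known, n_novel, alpha=1.0):
        # y_lab: (B,) known-class indices; p_novel: (B, n_novel) pseudo-labels
        y_lab_joint = torch.zeros(x_lab.size(0), n_known + n_novel)
        y_lab_joint.scatter_(1, y_lab.unsqueeze(1), 1.0)        # known one-hot
        y_unlab_joint = torch.cat(
            [torch.zeros(x_unlab.size(0), n_known), p_novel], dim=1)
        lam = torch.distributions.Beta(alpha, alpha).sample().item()
        x_mix = lam * x_lab + (1 - lam) * x_unlab
        y_mix = lam * y_lab_joint + (1 - lam) * y_unlab_joint   # joint label
        return x_mix, y_mix

Because the labeled part of the mixed target is exact by construction, the mixed pseudo-label is more credible than the raw unlabeled prediction, which is the intuition stated above.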
37. Individual Tooth Detection and Identification from Dental Panoramic X-Ray Images via Point-wise Localization and Distance Regularization [PDF] 返回目录
Minyoung Chung, Jusang Lee, Sanguk Park, Minkyung Lee, Chae Eun Lee, Jeongjin Lee, Yeong-Gil Shin
Abstract: Dental panoramic X-ray imaging is a popular diagnostic method owing to its very small dose of radiation. For an automated computer-aided diagnosis system in dental clinics, automatic detection and identification of individual teeth from panoramic X-ray images are critical prerequisites. In this study, we propose a point-wise tooth localization neural network that introduces a spatial distance regularization loss. The proposed network initially performs center-point regression for all the anatomical teeth (i.e., 32 points), which automatically identifies each tooth. A novel distance regularization penalty is imposed on the 32 points by applying an $L_2$ regularization loss to the Laplacian of the spatial distances. Subsequently, tooth boxes are individually localized using a cascaded neural network on a patch basis. A multitask offset training is employed on the final output to improve the localization accuracy. Our method successfully localizes not only the existing teeth but also missing teeth; consequently, highly accurate detection and identification are achieved. The experimental results demonstrate that the proposed algorithm outperforms state-of-the-art approaches, increasing the average precision of tooth detection by 15.71% compared to the best-performing method. Identification achieved a precision of 0.997 and a recall of 0.972. Moreover, the proposed network does not require any additional identification algorithm, owing to the preceding regression of the fixed 32 points regardless of whether the teeth are present.
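One plausible reading of the distance regularizer is sketched below, under the assumptions that the 32 tooth centres are ordered along the dental arch and that the Laplacian is the discrete second difference over neighbouring distances; both assumptions are illustrative:

    # Hypothetical sketch: Laplacian smoothness penalty on neighbour distances.
    import torch

    def distance_regularization(points):
        # points: (B, 32, 2) predicted tooth centres, ordered along the arch
        d = (points[:, 1:] - points[:, :-1]).norm(dim=-1)  # (B, 31) distances
        lap = d[:, :-2] - 2 * d[:, 1:-1] + d[:, 2:]        # 1-D graph Laplacian
        return (lap ** 2).mean()                           # L2 penalty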
38. Self-Supervised Tuning for Few-Shot Segmentation [PDF] 返回目录
Kai Zhu, Wei Zhai, Zheng-Jun Zha, Yang Cao
Abstract: Few-shot segmentation aims at assigning a category label to each image pixel given only a few annotated samples. It is a challenging task since the dense prediction can only be achieved under the guidance of latent features defined by sparse annotations. Existing meta-learning methods tend to fail in generating category-specific discriminative descriptors when the visual features extracted from support images are marginalized in the embedding space. To address this issue, this paper presents an adaptive tuning framework, in which the distribution of latent features across different episodes is dynamically adjusted based on a self-segmentation scheme, augmenting category-specific descriptors for label prediction. Specifically, a novel self-supervised inner loop is first devised as the base learner to extract the underlying semantic features from the support image. Then, gradient maps are calculated by back-propagating a self-supervised loss through the obtained features, and leveraged as guidance for augmenting the corresponding elements in the embedding space. Finally, with the ability to continuously learn from different episodes, an optimization-based meta-learner is adopted as the outer loop of our proposed framework to gradually refine the segmentation results. Extensive experiments on the benchmark PASCAL-$5^{i}$ and COCO-$20^{i}$ datasets demonstrate the superiority of our proposed method over the state-of-the-art.
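The gradient-map step can be sketched as follows, with `self_sup_loss` standing in for the paper's self-segmentation objective (a placeholder, not the actual loss):

    # Hypothetical sketch: use the gradient of a self-supervised loss as guidance.
    import torch

    def gradient_guided_features(feat, self_sup_loss):
        # feat: (B, C, H, W) support features
        feat = feat.detach().requires_grad_(True)
        loss = self_sup_loss(feat)                    # e.g. self-segmentation
        grad = torch.autograd.grad(loss, feat)[0]     # gradient map
        w = grad.abs().mean(dim=1, keepdim=True)      # (B, 1, H, W) guidance
        w = w / (w.amax(dim=(2, 3), keepdim=True) + 1e-8)
        return feat * (1 + w)                         # augment salient elements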
39. Online Initialization and Extrinsic Spatial-Temporal Calibration for Monocular Visual-Inertial Odometry [PDF] 返回目录
Weibo Huang, Hong Liu, Weiwei Wan
Abstract: This paper presents an online initialization method for bootstrapping optimization-based monocular visual-inertial odometry (VIO). The method can calibrate online the relative transformation (spatial) and time offset (temporal) between the camera and IMU, as well as estimate the initial values of metric scale, velocity, gravity, gyroscope bias, and accelerometer bias during the initialization stage. To compensate for the impact of the time offset, our method includes two short-term motion interpolation algorithms for camera and IMU pose estimation. Besides, it includes a three-step process to incrementally estimate the parameters from coarse to fine. First, the extrinsic rotation, gyroscope bias, and time offset are estimated by minimizing the rotation difference between the camera and IMU. Second, the metric scale, gravity, and extrinsic translation are approximately estimated by using the compensated camera poses and ignoring the accelerometer bias. Third, these values are refined by taking into account the accelerometer bias and the gravitational magnitude. To further optimize the system states, a nonlinear optimization algorithm that considers the time offset is introduced for global and local optimization. Experimental results on public datasets show that the initial values and the extrinsic parameters, as well as the sensor poses, can be accurately estimated by the proposed method.
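The first step can be illustrated with a simple offset search. One fact makes this possible before the extrinsics are known: the rotation angle of a relative rotation is invariant under conjugation by the (unknown) camera-IMU extrinsic rotation, so camera and gyro rotation angles can be compared directly. The sketch below assumes a gyro-integration helper `imu_rel_rot_fn` and a grid search; both are illustrative simplifications of the paper's estimator.

    # Hypothetical sketch: grid-search the camera-IMU time offset.
    import numpy as np

    def rotation_angle(R):
        return np.arccos(np.clip((np.trace(R) - 1) / 2, -1.0, 1.0))

    def estimate_time_offset(cam_rel_rots, imu_rel_rot_fn, t_pairs, search=0.05):
        # cam_rel_rots: list of (3, 3) camera relative rotations
        # imu_rel_rot_fn(t0, t1): gyro-integrated rotation over [t0, t1]
        # t_pairs: camera timestamps (t0, t1) for each rotation pair
        best_dt, best_err = 0.0, np.inf
        for dt in np.linspace(-search, search, 101):
            err = np.mean([abs(rotation_angle(Rc) -
                               rotation_angle(imu_rel_rot_fn(t0 + dt, t1 + dt)))
                           for Rc, (t0, t1) in zip(cam_rel_rots, t_pairs)])
            if err < best_err:
                best_dt, best_err = dt, err
        return best_dt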
40. Building Disaster Damage Assessment in Satellite Imagery with Multi-Temporal Fusion [PDF] 返回目录
Ethan Weber, Hassan Kané
Abstract: Automatic change detection and disaster damage assessment currently require a huge amount of labor and manual work by satellite imagery analysts. When natural disasters occur, timely change detection can save lives. In this work, we report findings on problem framing, data processing, and training procedures that are specifically helpful for the task of building damage assessment using the newly released xBD dataset. Our insights lead to substantial improvement over the xBD baseline models, and we score among the top results on the xView2 challenge leaderboard. We release the code used for the competition.
41. Density Map Guided Object Detection in Aerial Images [PDF] 返回目录
Changlin Li, Taojiannan Yang, Sijie Zhu, Chen Chen, Shanyue Guan
Abstract: Object detection in high-resolution aerial images is a challenging task because of 1) the large variation in object size, and 2) non-uniform distribution of objects. A common solution is to divide the large aerial image into small (uniform) crops and then apply object detection on each small crop. In this paper, we investigate the image cropping strategy to address these challenges. Specifically, we propose a Density-Map guided object detection Network (DMNet), which is inspired from the observation that the object density map of an image presents how objects distribute in terms of the pixel intensity of the map. As pixel intensity varies, it is able to tell whether a region has objects or not, which in turn provides guidance for cropping images statistically. DMNet has three key components: a density map generation module, an image cropping module and an object detector. DMNet generates a density map and learns scale information based on density intensities to form cropping regions. Extensive experiments show that DMNet achieves state-of-the-art performance on two popular aerial image datasets, i.e. VisionDrone and UAVDT.
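A minimal sketch of how a density map can drive cropping, assuming a simple threshold-and-connected-components scheme (DMNet's actual region formation is more refined):

    # Hypothetical sketch: turn a density map into crop regions.
    import numpy as np
    from scipy import ndimage

    def density_crops(density, thresh=0.05, min_pixels=64):
        # density: (H, W) predicted object-density map
        labels, n = ndimage.label(density > thresh)   # connected components
        crops = []
        for i in range(1, n + 1):
            ys, xs = np.nonzero(labels == i)
            if ys.size < min_pixels:
                continue
            crops.append((xs.min(), ys.min(), xs.max(), ys.max()))
        return crops                                  # (x0, y0, x1, y1) boxes

Each crop is then passed to the detector, and detections are fused back into the full image, as described above.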
42. A Novel Pose Proposal Network and Refinement Pipeline for Better Object Pose Estimation [PDF] 返回目录
Ameni Trabelsi, Mohamed Chaabane, Nathaniel Blanchard, Ross Beveridge
Abstract: In this paper, we present a novel deep learning pipeline for 6D object pose estimation and refinement from RGB inputs. The first component of the pipeline leverages a region proposal framework to estimate multi-class single-shot 6D object poses directly from an RGB image and through a CNN-based encoder multi-decoders network. The second component, a multi-attentional pose refinement network (MARN), iteratively refines the estimated pose. MARN takes advantage of both visual and flow features to learn a relative transformation between an initially predicted pose and a target pose. MARN is further augmented by a spatial multi-attention block that emphasizes objects' discriminative feature parts. Experiments on three benchmarks for 6D pose estimation show that the proposed pipeline outperforms state-of-the-art RGB-based methods with competitive runtime performance.
43. FDA: Fourier Domain Adaptation for Semantic Segmentation [PDF] 返回目录
Yanchao Yang, Stefano Soatto
Abstract: We describe a simple method for unsupervised domain adaptation, whereby the discrepancy between the source and target distributions is reduced by swapping the low-frequency spectrum of one with the other. We illustrate the method in semantic segmentation, where densely annotated images are aplenty in one domain (synthetic data), but difficult to obtain in another (real images). Current state-of-the-art methods are complex, some requiring adversarial optimization to render the backbone of a neural network invariant to the discrete domain selection variable. Our method does not require any training to perform the domain alignment, just a simple Fourier Transform and its inverse. Despite its simplicity, it achieves state-of-the-art performance in the current benchmarks, when integrated into a relatively standard semantic segmentation model. Our results indicate that even simple procedures can discount nuisance variability in the data that more sophisticated methods struggle to learn away.
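The low-frequency swap is simple enough to sketch directly; this follows the description above, with the swapped window controlled by a fraction beta of the image size (the exact window handling is simplified here):

    # Sketch of the Fourier low-frequency amplitude swap (FDA-style).
    import numpy as np

    def fda_swap(src, trg, beta=0.01):
        # src, trg: (H, W, C) float images from source and target domains
        f_src = np.fft.fftshift(np.fft.fft2(src, axes=(0, 1)), axes=(0, 1))
        f_trg = np.fft.fftshift(np.fft.fft2(trg, axes=(0, 1)), axes=(0, 1))
        amp, pha = np.abs(f_src), np.angle(f_src)
        h, w = src.shape[:2]
        b = max(1, int(min(h, w) * beta))
        cy, cx = h // 2, w // 2
        # Replace the centred low-frequency block of the source amplitude.
        amp[cy - b:cy + b, cx - b:cx + b] = \
            np.abs(f_trg)[cy - b:cy + b, cx - b:cx + b]
        out = np.fft.ifft2(np.fft.ifftshift(amp * np.exp(1j * pha),
                                            axes=(0, 1)), axes=(0, 1))
        return np.real(out)

The source image keeps its phase (and hence its semantic content) while inheriting the target domain's low-frequency "style", so a segmentation network can be trained on `fda_swap(source, target)` with the original source labels.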
44. Learning to Manipulate Individual Objects in an Image [PDF] 返回目录
Yanchao Yang, Yutong Chen, Stefano Soatto
Abstract: We describe a method to train a generative model with latent factors that are (approximately) independent and localized. This means that perturbing the latent variables affects only local regions of the synthesized image, corresponding to objects. Unlike other unsupervised generative models, ours enables object-centric manipulation, without requiring object-level annotations, or any form of annotation for that matter. The key to our method is the combination of spatial disentanglement, enforced by a Contextual Information Separation loss, and perceptual cycle-consistency, enforced by a loss that penalizes changes in the image partition in response to perturbations of the latent factors. We test our method's ability to allow independent control of spatial and semantic factors of variability on existing datasets and also introduce two new ones that highlight the limitations of current methods.
45. Unveiling COVID-19 from Chest X-ray with deep learning: a hurdles race with small data [PDF] 返回目录
Enzo Tartaglione, Carlo Alberto Barbano, Claudio Berzovini, Marco Calandri, Marco Grangetto
Abstract: The possibility of using widespread and simple chest X-ray (CXR) imaging for early screening of COVID-19 patients is attracting much interest from both the clinical and the AI community. In this study we provide insights, and also raise warnings, about what it is reasonable to expect when applying deep learning to COVID classification of CXR images. We provide a methodological guide and a critical reading of an extensive set of statistical results that can be obtained using currently available datasets. In particular, we take up the challenge posed by the current small size of COVID data and show how significant the bias introduced by transfer learning from larger public non-COVID CXR datasets can be. We also contribute results on a medium-size COVID CXR dataset, recently collected by one of the major emergency hospitals in Northern Italy during the peak of the COVID pandemic. These novel data allow us to help validate the generalization capacity of preliminary results circulating in the scientific community. Our conclusions shed some light on the possibility of effectively discriminating COVID using CXR.
46. Spatially-Attentive Patch-Hierarchical Network for Adaptive Motion Deblurring [PDF] 返回目录
Maitreya Suin, Kuldeep Purohit, A. N. Rajagopalan
Abstract: This paper tackles the problem of motion deblurring of dynamic scenes. Although end-to-end fully convolutional designs have recently advanced the state-of-the-art in non-uniform motion deblurring, their performance-complexity trade-off is still sub-optimal. Existing approaches achieve a large receptive field by increasing the number of generic convolution layers and the kernel size, but this comes at the expense of increased model size and slower inference. In this work, we propose an efficient pixel-adaptive and feature-attentive design for handling large blur variations across different spatial locations, processing each test image adaptively. We also propose an effective content-aware global-local filtering module that significantly improves performance by considering not only global dependencies but also dynamically exploiting neighbouring pixel information. We use a patch-hierarchical attentive architecture composed of the above module that implicitly discovers the spatial variations in the blur present in the input image and, in turn, performs local and global modulation of intermediate features. Extensive qualitative and quantitative comparisons with prior art on deblurring benchmarks demonstrate that our design offers significant improvements over the state-of-the-art in accuracy as well as speed.
47. Improving Semantic Segmentation through Spatio-Temporal Consistency Learned from Videos [PDF] 返回目录
Ankita Pasad, Ariel Gordon, Tsung-Yi Lin, Anelia Angelova
Abstract: We leverage unsupervised learning of depth, egomotion, and camera intrinsics to improve the performance of single-image semantic segmentation, by enforcing 3D-geometric and temporal consistency of segmentation masks across video frames. The predicted depth, egomotion, and camera intrinsics are used to provide an additional supervision signal to the segmentation model, significantly enhancing its quality, or, alternatively, reducing the number of labels the segmentation model needs. Our experiments were performed on the ScanNet dataset.
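The consistency signal can be sketched as warping one frame's segmentation logits into another using the predicted depth and egomotion, then penalising disagreement; projection details such as valid-pixel masking are omitted, and all names are illustrative:

    # Hypothetical sketch: depth/egomotion-based segmentation consistency.
    import torch
    import torch.nn.functional as F

    def warp_consistency(logits_t, logits_t1, depth_t, T_t_to_t1, K):
        # logits_*: (B, C, H, W); depth_t: (B, 1, H, W); T: (B, 4, 4); K: (B, 3, 3)
        b, _, h, w = depth_t.shape
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()
        pix = pix.view(1, 3, -1).expand(b, -1, -1)                # (B, 3, H*W)
        cam = torch.linalg.inv(K) @ pix * depth_t.view(b, 1, -1)  # back-project
        cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)
        proj = K @ (T_t_to_t1 @ cam)[:, :3]                       # into frame t+1
        uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)
        grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                            2 * uv[:, 1] / (h - 1) - 1], -1).view(b, h, w, 2)
        warped = F.grid_sample(logits_t1, grid, align_corners=True)
        return F.mse_loss(warped, logits_t)                       # consistency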
48. Probabilistic Orientated Object Detection in Automotive Radar [PDF] 返回目录
Xu Dong, Pengluo Wang, Pengyue Zhang, Langechuan Liu
Abstract: Autonomous radar has been an integral part of advanced driver assistance systems due to its robustness to adverse weather and various lighting conditions. Conventional automotive radars use digital signal processing (DSP) algorithms to process raw data into sparse radar pins that do not provide information regarding the size and orientation of the objects. In this paper, we propose a deep-learning based algorithm for radar object detection. The algorithm takes in radar data in its raw tensor representation and places probabilistic oriented bounding boxes around the detected objects in bird's-eye-view space. We created a new multimodal dataset with 102544 frames of raw radar and synchronized LiDAR data. To reduce human annotation effort we developed a scalable pipeline to automatically annotate ground truth using LiDAR as reference. Based on this dataset we developed a vehicle detection pipeline using raw radar data as the only input. Our best performing radar detection model achieves 77.28% AP under oriented IoU of 0.3. To the best of our knowledge, this is the first attempt to investigate object detection with raw radar data for conventional corner automotive radars.
49. Inter-Region Affinity Distillation for Road Marking Segmentation [PDF] 返回目录
Yuenan Hou, Zheng Ma, Chunxiao Liu, Tak-Wai Hui, Chen Change Loy
Abstract: We study the problem of distilling knowledge from a large deep teacher network to a much smaller student network for the task of road marking segmentation. In this work, we explore a novel knowledge distillation (KD) approach that can transfer 'knowledge' on scene structure more effectively from a teacher to a student model. Our method is known as Inter-Region Affinity KD (IntRA-KD). It decomposes a given road scene image into different regions and represents each region as a node in a graph. An inter-region affinity graph is then formed by establishing pairwise relationships between nodes based on their similarity in feature distribution. To learn structural knowledge from the teacher network, the student is required to match the graph generated by the teacher. The proposed method shows promising results on three large-scale road marking segmentation benchmarks, i.e., ApolloScape, CULane and LLAMAS, by taking various lightweight models as students and ResNet-101 as the teacher. IntRA-KD consistently brings higher performance gains on all lightweight models, compared to previous distillation methods. Our code is available at this https URL.
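A minimal sketch of the affinity-graph matching, assuming region node features are mask-averaged and affinities are cosine similarities (the paper's node construction uses richer pooling):

    # Hypothetical sketch: inter-region affinity distillation.
    import torch
    import torch.nn.functional as F

    def region_affinity(feat, masks):
        # feat: (B, C, H, W); masks: (B, R, H, W) region masks
        pooled = torch.einsum("bchw,brhw->brc", feat, masks)
        pooled = pooled / masks.sum(dim=(2, 3)).clamp(min=1).unsqueeze(-1)
        nodes = F.normalize(pooled, dim=-1)       # (B, R, C) node embeddings
        return nodes @ nodes.transpose(1, 2)      # (B, R, R) affinity matrix

    def intra_kd_loss(feat_student, feat_teacher, masks):
        return F.mse_loss(region_affinity(feat_student, masks),
                          region_affinity(feat_teacher, masks))

Matching pairwise affinities rather than raw features transfers the scene-structure knowledge described above, while leaving the student free to choose its own feature basis.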
50. From Quantized DNNs to Quantizable DNNs [PDF] 返回目录
Kunyuan Du, Ya Zhang, Haibing Guan
Abstract: This paper proposes Quantizable DNNs, a special type of DNN that can flexibly quantize its bit-width (denoted as 'bit modes' hereafter) during execution without further re-training. To simultaneously optimize for all bit modes, a combinational loss over all bit modes is proposed, which enforces consistent predictions ranging from low-bit mode to 32-bit mode. This consistency-based loss may also be viewed as a certain form of regularization during training. Because the outputs of matrix multiplication in different bit modes have different distributions, we introduce Bit-Specific Batch Normalization so as to reduce conflicts among different bit modes. Experiments on CIFAR100 and ImageNet show that, compared to quantized DNNs, Quantizable DNNs not only have much better flexibility, but also achieve even higher classification accuracy. Ablation studies further verify that the regularization through the consistency-based loss indeed improves the model's generalization performance.
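A hedged sketch of the training objective: run the same network in several bit modes and pull each low-bit prediction toward the full-precision one. The `bits=` interface (which would internally select bit-specific batch-norm statistics) and the KL form of the consistency term are assumptions:

    # Hypothetical sketch: consistency-based loss across bit modes.
    import torch.nn.functional as F

    def quantizable_loss(model, x, y, bit_modes=(2, 4, 8)):
        logits_fp = model(x, bits=32)                  # full-precision mode
        loss = F.cross_entropy(logits_fp, y)
        p_fp = F.softmax(logits_fp.detach(), dim=1)
        for b in bit_modes:
            logits_b = model(x, bits=b)                # bit-specific BN inside
            loss = loss + F.cross_entropy(logits_b, y)
            loss = loss + F.kl_div(F.log_softmax(logits_b, dim=1), p_fp,
                                   reduction="batchmean")  # consistency term
        return loss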
51. Multi-View Matching (MVM): Facilitating Multi-Person 3D Pose Estimation Learning with Action-Frozen People Video [PDF] 返回目录
Yeji Shen, C.-C. Jay Kuo
Abstract: To tackle the challenging problem of multi-person 3D pose estimation from a single image, we propose a multi-view matching (MVM) method in this work. The MVM method generates reliable 3D human poses from a large-scale video dataset, called the Mannequin dataset, that contains action-frozen people imitating mannequins. With a large amount of in-the-wild video data labeled by 3D supervisions automatically generated by MVM, we are able to train a neural network that takes a single image as the input for multi-person 3D pose estimation. The core technology of MVM lies in the effective alignment of 2D poses obtained from multiple views of a static scene, which provides a strong geometric constraint. Our objective is to maximize the mutual consistency of 2D poses estimated in multiple frames, where geometric constraints as well as appearance similarities are taken into account simultaneously. To demonstrate the effectiveness of the 3D supervisions provided by the MVM method, we conduct experiments on the 3DPW and MSCOCO datasets and show that our proposed solution offers state-of-the-art performance.
52. Towards Anomaly Detection in Dashcam Videos [PDF] 返回目录
Sanjay Haresh, Sateesh Kumar, M. Zeeshan Zia, Quoc-Huy Tran
Abstract: Inexpensive sensing and computation, as well as insurance innovations, have made smart dashboard cameras ubiquitous. Increasingly, simple model-driven computer vision algorithms focused on lane departures or safe following distances are finding their way into these devices. Unfortunately, the long-tailed distribution of road hazards means that these hand-crafted pipelines are inadequate for driver safety systems. We propose to apply data-driven anomaly detection ideas from deep learning to dashcam videos, which hold the promise of bridging this gap. Unfortunately, there exists almost no literature applying anomaly understanding to moving cameras, and correspondingly there is also a lack of relevant datasets. To counter this issue, we present a large and diverse dataset of truck dashcam videos, namely RetroTrucks, that includes normal and anomalous driving scenes. We apply: (i) one-class classification loss and (ii) reconstruction-based loss, for anomaly detection on RetroTrucks as well as on existing static-camera datasets. We introduce formulations for modeling object interactions in this context as priors. Our experiments indicate that our dataset is indeed more challenging than standard anomaly detection datasets, and previous anomaly detection methods do not perform well here out-of-the-box. In addition, we share insights into the behavior of these two important families of anomaly detection approaches on dashcam data.
摘要:廉价的检测和计算,以及保险创新,取得了智能仪表板摄像机无处不在。越来越多,简单的模型驱动的计算机视觉算法集中在偏离车道或安全的跟车距离,发现他们的方式进入这些设备。不幸的是,道路危险手段长尾分布,这些手工制作的管道不适合用于驾驶员的安全系统。我们建议采用数据驱动的异常检测的思想从深学习dashcam视频,占据弥补这一差距的承诺。不幸的是,存在几乎没有文献将异常了解到移动摄像机,并相应地也有缺乏相关的数据集。为了解决这个问题,我们提出了一个庞大而多样的数据集卡车dashcam视频,即RetroTrucks,包括正常和反常的驾驶场景。我们应用:(i)一种类分类损耗和(ii)基于重建的损失,异常检测上RetroTrucks以及对现有静态相机数据集。我们引进配方在上下文先验对象的交互建模。我们的实验表明,我们的数据确实超过标准异常检测数据集挑战性,和以前的异常检测方法不能很好地执行在这里出的即装即用。此外,我们分享见解异常检测的这两个重要的家庭行为上dashcam数据接近。
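Of the two loss families applied above, the reconstruction-based one scores a frame as anomalous when a model trained only on normal driving fails to reproduce it. Below is a minimal sketch of that idea, assuming a trained autoencoder is available as `reconstruct` (a hypothetical stand-in, not the paper's model):

```python
import numpy as np

def anomaly_scores(frames, reconstruct):
    # Per-frame anomaly score = mean squared reconstruction error; an
    # autoencoder fit only on normal scenes reconstructs anomalies poorly.
    return np.array([np.mean((f - reconstruct(f)) ** 2) for f in frames])

def flag_anomalies(scores, normal_scores, k=3.0):
    # Threshold fitted on held-out normal data (illustrative choice of k).
    threshold = normal_scores.mean() + k * normal_scores.std()
    return scores > threshold
```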
53. Attend and Decode: 4D fMRI Task State Decoding Using Attention Models [PDF] 返回目录
Sam Nguyen, Brenda Ng, Alan K. Kaplan, Priyadip Ray
Abstract: Functional magnetic resonance imaging (fMRI) is a neuroimaging modality that captures the blood oxygen level in a subject's brain while the subject performs a variety of functional tasks under different conditions. Given fMRI data, the problem of inferring the task, known as task state decoding, is challenging due to the high dimensionality (hundreds of million sampling points per datum) and complex spatio-temporal blood flow patterns inherent in the data. In this work, we propose to tackle the fMRI task state decoding problem by casting it as a 4D spatio-temporal classification problem. We present a novel architecture called Brain Attend and Decode (BAnD), that uses residual convolutional neural networks for spatial feature extraction and self-attention mechanisms for temporal modeling. We achieve significant performance gain compared to previous works on a 7-task benchmark from the large-scale Human Connectome Project (HCP) dataset. We also investigate the transferability of BAnD's extracted features on unseen HCP tasks, either by freezing the spatial feature extraction layers and retraining the temporal model, or finetuning the entire model. The pre-trained features from BAnD are useful on similar tasks while finetuning them yields competitive results on unseen tasks/conditions.
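BAnD's temporal model is self-attention over per-time-step embeddings produced by the spatial CNN. The sketch below shows only the generic single-head scaled dot-product mechanism; the projection matrices `Wq`, `Wk`, `Wv` are assumed inputs, and the actual architecture (multi-head attention, residual connections, etc.) is richer than this.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feats, Wq, Wk, Wv):
    # feats: (T, F) array, one embedding per fMRI time step from the
    # spatial feature extractor. One scaled dot-product attention head.
    Q, K, V = feats @ Wq, feats @ Wk, feats @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)  # (T, T)
    return weights @ V  # temporally contextualized features
```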
54. End-to-end Learning Improves Static Object Geo-localization in Monocular Video [PDF] 返回目录
Mohamed Chaabane, Lionel Gueguen, Ameni Trabelsi, Ross Beveridge, Stephen O'Hara
Abstract: Accurately estimating the position of static objects, such as traffic lights, from the moving camera of a self-driving car is a challenging problem. In this work, we present a system that improves the localization of static objects by jointly-optimizing the components of the system via learning. Our system is comprised of networks that perform: 1) 6DoF object pose estimation from a single image, 2) association of objects between pairs of frames, and 3) multi-object tracking to produce the final geo-localization of the static objects within the scene. We evaluate our approach using a publicly-available data set, focusing on traffic lights due to data availability. For each component, we compare against contemporary alternatives and show significantly-improved performance. We also show that the end-to-end system performance is further improved via joint-training of the constituent models.
55. Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review [PDF] 返回目录
Yaodong Cui, Ren Chen, Wenbo Chu, Long Chen, Daxin Tian, Dongpu Cao
Abstract: Autonomous vehicles are experiencing rapid development in the past few years. However, achieving full autonomy is not a trivial task, due to the nature of the complex and dynamic driving environment. Therefore, autonomous vehicles are equipped with a suite of different sensors to ensure robust, accurate environmental perception. In particular, camera-LiDAR fusion is becoming an emerging research theme. However, so far there is no critical review that focuses on deep-learning-based camera-LiDAR fusion methods. To bridge this gap and motivate future research, this paper devotes to review recent deep-learning-based data fusion approaches that leverage both image and point cloud. This review gives a brief overview of deep learning on image and point cloud data processing. Followed by in-depth reviews of camera-LiDAR fusion methods in depth completion, object detection, semantic segmentation and tracking, which are organized based on their respective fusion levels. Furthermore, we compare these methods on publicly available datasets. Finally, we identified gaps and over-looked challenges between current academic researches and real-world applications. Based on these observations, we provide our insights and point out promising research directions.
56. A Review on Deep Learning Techniques for Video Prediction [PDF] 返回目录
Sergiu Oprea, Pablo Martinez-Gonzalez, Alberto Garcia-Garcia, John Alejandro Castro-Vargas, Sergio Orts-Escolano, Jose Garcia-Rodriguez, Antonis Argyros
Abstract: The ability to predict, anticipate and reason about future outcomes is a key component of intelligent decision-making systems. In light of the success of deep learning in computer vision, deep-learning-based video prediction emerged as a promising research direction. Defined as a self-supervised learning task, video prediction represents a suitable framework for representation learning, as it demonstrated potential capabilities for extracting meaningful representations of the underlying patterns in natural videos. Motivated by the increasing interest in this task, we provide a review on the deep learning methods for prediction in video sequences. We firstly define the video prediction fundamentals, as well as mandatory background concepts and the most used datasets. Next, we carefully analyze existing video prediction models organized according to a proposed taxonomy, highlighting their contributions and their significance in the field. The summary of the datasets and methods is accompanied with experimental results that facilitate the assessment of the state of the art on a quantitative basis. The paper is summarized by drawing some general conclusions, identifying open research challenges and by pointing out future research directions.
57. Hamiltonian Dynamics for Real-World Shape Interpolation [PDF] 返回目录
Marvin Eisenberger, Daniel Cremers
Abstract: We revisit the classical problem of 3D shape interpolation and propose a novel, physically plausible approach based on Hamiltonian dynamics. While most prior work focuses on synthetic input shapes, our formulation is designed to be applicable to real-world scans with imperfect input correspondences and various types of noise. To that end, we use recent progress on dynamic thin shell simulation and divergence-free shape deformation and combine them to address the inverse problem of finding a plausible intermediate sequence for two input shapes. In comparison to prior work that mainly focuses on small distortion of consecutive frames, we explicitly model volume preservation and momentum conservation, as well as an anisotropic local distortion model. We argue that, in order to get a robust interpolation for imperfect inputs, we need to model the input noise explicitly which results in an alignment based formulation. Finally, we show a qualitative and quantitative improvement over prior work on a broad range of synthetic and scanned data. Besides being more robust to noisy inputs, our method yields exactly volume preserving intermediate shapes, avoids self-intersections and is scalable to high resolution scans.
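For context, Hamiltonian dynamics evolves a configuration q and momentum p via dq/dt = ∂H/∂p and dp/dt = -∂H/∂q, and symplectic integrators keep the energy approximately conserved over long trajectories, which is what makes such interpolations physically plausible. The sketch below is the standard symplectic Euler step under assumed gradient callbacks, not the paper's full shape model:

```python
import numpy as np

def symplectic_euler(q, p, dH_dq, dH_dp, dt=0.01, steps=100):
    # Updating p first, then q with the *new* p, is what makes the
    # scheme symplectic, i.e. approximately energy-preserving.
    trajectory = [q.copy()]
    for _ in range(steps):
        p = p - dt * dH_dq(q)
        q = q + dt * dH_dp(p)
        trajectory.append(q.copy())
    return np.array(trajectory)

# Example: the harmonic oscillator H = (q^2 + p^2) / 2 stays near its circle.
traj = symplectic_euler(np.array([1.0]), np.array([0.0]),
                        dH_dq=lambda q: q, dH_dp=lambda p: p)
```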
58. k-decay: A New Method For Learning Rate Schedule [PDF] 返回目录
Tao Zhang
Abstract: It is well known that the learning rate is the most important hyper-parameter in Deep Learning, and learning rate schedules are commonly used to train neural networks. This paper puts forward a new method for learning rate scheduling, named k-decay, which can be applied to any differentiable schedule function to derive a new schedule function. The degree of decay of the new function is controlled by a new hyper-parameter k, with the original function recovered as the special case k = 1. The paper applies k-decay to the polynomial, cosine and exponential functions to obtain their new variants. We evaluate the k-decay method with the new polynomial function on the CIFAR-10 and CIFAR-100 datasets with different neural networks (ResNet, Wide ResNet and DenseNet), and the results improve over state-of-the-art results on most of them. Our experiments show that the performance of the model improves as k increases from 1.
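The abstract does not spell out the derivation, so the following is only one plausible k-parameterized family consistent with its description: warp the normalized training progress t/T by the exponent k before feeding it to a base schedule, so that k = 1 recovers the original function. Treat the paper's own construction as authoritative; this code is an illustrative assumption.

```python
import numpy as np

def k_decay(base_schedule, t, T, k):
    # Illustrative k-decay: warp normalized progress by k, then apply the
    # base schedule. k = 1 reproduces the base schedule, matching the
    # special case stated in the abstract.
    return base_schedule((t / T) ** k)

cosine = lambda s: 0.5 * (1 + np.cos(np.pi * s))  # base cosine schedule
poly = lambda s, n=2: (1 - s) ** n                # base polynomial schedule

lrs = [0.1 * k_decay(cosine, t, T=100, k=3) for t in range(100)]
```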
59. Revisiting Loss Landscape for Adversarial Robustness [PDF] 返回目录
Dongxian Wu, Yisen Wang, Shu-tao Xia
Abstract: The study on improving the robustness of deep neural networks against adversarial examples grows rapidly in recent years. Among them, adversarial training is the most promising one, based on which, a lot of improvements have been developed, such as adding regularizations or leveraging unlabeled data. However, these improvements seem to come from isolated perspectives, so that we are curious about if there is something in common behind them. In this paper, we investigate the surface geometry of several well-recognized adversarial training variants, and reveal that their adversarial loss landscape is closely related to the adversarially robust generalization, i.e., the flatter the adversarial loss landscape, the smaller the adversarially robust generalization gap. Based on this finding, we then propose a simple yet effective module, Adversarial Weight Perturbation (AWP), to directly regularize the flatness of the adversarial loss landscape in the adversarial training framework. Extensive experiments demonstrate that AWP indeed owns flatter landscape and can be easily incorporated into various adversarial training variants to enhance their adversarial robustness further.
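A minimal single-step reading of AWP on a toy least-squares model: nudge the weights a small, norm-scaled step in the direction that increases the loss, take the training gradient at that perturbed point, and apply it to the unperturbed weights. The step size `gamma` and the one-step approximation are illustrative assumptions; the paper's procedure operates inside adversarial training rather than plain regression.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(64, 8)), rng.normal(size=64)
w = rng.normal(size=8)

def loss_grad(w):
    r = X @ w - y
    return (r @ r) / len(y), 2 * X.T @ r / len(y)

gamma, lr = 0.01, 0.1
for step in range(100):
    _, g = loss_grad(w)
    # Worst-case weight perturbation, scaled relative to the weight norm.
    v = gamma * np.linalg.norm(w) * g / (np.linalg.norm(g) + 1e-12)
    _, g_adv = loss_grad(w + v)  # gradient of the flatness-seeking objective
    w -= lr * g_adv              # update the *unperturbed* weights
```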
60. Analysis of The Ratio of $\ell_1$ and $\ell_2$ Norms in Compressed Sensing [PDF] 返回目录
Yiming Xu, Akil Narayan, Hoang Tran, Clayton Webster
Abstract: We first propose a novel criterion that guarantees that an $s$-sparse signal is the local minimizer of the $\ell_1/\ell_2$ objective; our criterion is interpretable and useful in practice. We also give the first uniform recovery condition using a geometric characterization of the null space of the measurement matrix, and show that this condition is easily satisfied for a class of random matrices. We also present analysis on the stability of the procedure when noise pollutes data. Numerical experiments are provided that compare $\ell_1/\ell_2$ with some other popular non-convex methods in compressed sensing. Finally, we propose a novel initialization approach to accelerate the numerical optimization procedure. We call this initialization approach \emph{support selection}, and we demonstrate that it empirically improves the performance of existing $\ell_1/\ell_2$ algorithms.
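The sparsity-promoting behavior of the $\ell_1/\ell_2$ ratio is easy to verify numerically: for a flat $s$-sparse vector the ratio equals $\sqrt{s}$, so sparser vectors score lower, and the ratio is scale-invariant, avoiding the amplitude shrinkage that penalizing $\ell_1$ alone would cause. A quick check:

```python
import numpy as np

def l1_over_l2(x):
    return np.linalg.norm(x, 1) / np.linalg.norm(x, 2)

dense = np.ones(100)                       # not sparse at all
sparse = np.zeros(100); sparse[:5] = 1.0   # 5-sparse

print(l1_over_l2(dense))       # 10.0   (= sqrt(100))
print(l1_over_l2(sparse))      # ~2.236 (= sqrt(5))
print(l1_over_l2(2 * sparse))  # unchanged: the ratio is scale-invariant
```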
61. From Inference to Generation: End-to-end Fully Self-supervised Generation of Human Face from Speech [PDF] 返回目录
Hyeong-Seok Choi, Changdae Park, Kyogu Lee
Abstract: This work seeks the possibility of generating the human face from voice solely based on the audio-visual data without any human-labeled annotations. To this end, we propose a multi-modal learning framework that links the inference stage and generation stage. First, the inference networks are trained to match the speaker identity between the two different modalities. Then the trained inference networks cooperate with the generation network by giving conditional information about the voice. The proposed method exploits the recent development of GANs techniques and generates the human face directly from the speech waveform making our system fully end-to-end. We analyze the extent to which the network can naturally disentangle two latent factors that contribute to the generation of a face image - one that comes directly from a speech signal and the other that is not related to it - and explore whether the network can learn to generate natural human face image distribution by modeling these factors. Experimental results show that the proposed network can not only match the relationship between the human face and speech, but can also generate the high-quality human face sample conditioned on its speech. Finally, the correlation between the generated face and the corresponding speech is quantitatively measured to analyze the relationship between the two modalities.
62. Blockchain in the Internet of Things: Architectures and Implementation [PDF] 返回目录
Oscar Delgado-Mohatar, Ruben Tolosana, Julian Fierrez, Aythami Morales
Abstract: The world is becoming more interconnected every day. With the high technological evolution and the increasing deployment of it in our society, scenarios based on the Internet of Things (IoT) can be considered a reality nowadays. However, and before some predictions become true (around 75 billion devices are expected to be interconnected in the next few years), many efforts must be carried out in terms of scalability and security. In this study we propose and evaluate a new approach based on the incorporation of Blockchain into current IoT scenarios. The main contributions of this study are as follows: i) an in-depth analysis of the different possibilities for the integration of Blockchain into IoT scenarios, focusing on the limited processing capabilities and storage space of most IoT devices, and the economic cost and performance of current Blockchain technologies; ii) a new method based on a novel module named BIoT Gateway that allows both unidirectional and bidirectional communications with IoT devices on real scenarios, allowing to exchange any kind of data; and iii) the proposed method has been fully implemented and validated on two different real-life IoT scenarios, extracting very interesting findings in terms of economic cost and execution time. The source code of our implementation is publicly available in the Ethereum testnet.
63. Multi-modal Datasets for Super-resolution [PDF] 返回目录
Haoran Li, Weihong Quan, Meijun Yan, Jin zhang, Xiaoli Gong, Jin Zhou
Abstract: Nowadays, most datasets used to train and evaluate super-resolution models are single-modal simulation datasets. However, due to the variety of image degradation types in the real world, models trained on single-modal simulation datasets do not always have good robustness and generalization ability in different degradation scenarios. Previous work tended to focus only on true-color images. In contrast, we first propose a real-world black-and-white old photo dataset for super-resolution (OID-RW), constructed using two methods: manually filling in pixels and shooting with different cameras. The dataset contains 82 groups of images, including 22 groups of people and 60 groups of landscapes and architecture. At the same time, we also propose a multi-modal degradation dataset (MDD400) to address super-resolution reconstruction in real-life image degradation scenarios. We simulate the process of generating degraded images with the following four methods: interpolation algorithms, a CNN, a GAN, and capturing videos at different bit rates. Our experiments demonstrate that models trained on our datasets not only have better generalization capability and robustness, but also produce images that maintain better edge contours and texture features.
64. Rethinking Differentiable Search for Mixed-Precision Neural Networks [PDF] 返回目录
Zhaowei Cai, Nuno Vasconcelos
Abstract: Low-precision networks, with weights and activations quantized to low bit-width, are widely used to accelerate inference on edge devices. However, current solutions are uniform, using identical bit-width for all filters. This fails to account for the different sensitivities of different filters and is suboptimal. Mixed-precision networks address this problem, by tuning the bit-width to individual filter requirements. In this work, the problem of optimal mixed-precision network search (MPS) is considered. To circumvent its difficulties of discrete search space and combinatorial optimization, a new differentiable search architecture is proposed, with several novel contributions to advance the efficiency by leveraging the unique properties of the MPS problem. The resulting Efficient differentiable MIxed-Precision network Search (EdMIPS) method is effective at finding the optimal bit allocation for multiple popular networks, and can search a large model, e.g. Inception-V3, directly on ImageNet without proxy task in a reasonable amount of time. The learned mixed-precision networks significantly outperform their uniform counterparts.
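The differentiable-search idea can be sketched DARTS-style: keep one architecture parameter per candidate bit-width and form a softmax-weighted mixture of quantized copies, so the discrete choice becomes continuous and trainable by gradient descent. The quantizer and parameterization below are generic assumptions, not EdMIPS's exact formulation:

```python
import numpy as np

def quantize(x, bits):
    # Symmetric uniform quantizer on [-1, 1] (illustrative choice).
    levels = 2 ** (bits - 1) - 1
    return np.round(np.clip(x, -1.0, 1.0) * levels) / levels

def soft_mixed_precision(w, alphas, candidate_bits=(2, 4, 8)):
    # Softmax-weighted mixture over quantized copies: gradients flow to
    # the selection parameters `alphas`; after the search, one would keep
    # only the bit-width with the largest alpha.
    p = np.exp(alphas - alphas.max())
    p /= p.sum()
    return sum(pi * quantize(w, b) for pi, b in zip(p, candidate_bits))
```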
65. Deep Learning COVID-19 Features on CXR using Limited Training Data Sets [PDF] 返回目录
Yujin Oh, Sangjoon Park, Jong Chul Ye
Abstract: Under the global pandemic of COVID-19, the use of artificial intelligence to analyze chest X-ray (CXR) image for COVID-19 diagnosis and patient triage is becoming important. Unfortunately, due to the emergent nature of the COVID-19 pandemic, a systematic collection of the CXR data set for deep neural network training is difficult. To address this problem, here we propose a patch-based convolutional neural network approach with a relatively small number of trainable parameters for COVID-19 diagnosis. The proposed method is inspired by our statistical analysis of the potential imaging biomarkers of the CXR radiographs. Experimental results show that our method achieves state-of-the-art performance and provides clinically interpretable saliency maps, which are useful for COVID-19 diagnosis and patient triage.
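Patch-based classification typically crops many random patches and aggregates their predictions, which both stretches scarce training data and yields patch-level evidence for saliency maps. The sketch below shows generic majority-vote inference, with `classify_patch` a hypothetical stand-in for the trained network; the paper's exact cropping and aggregation scheme may differ.

```python
import numpy as np

def patch_vote_predict(image, classify_patch, n_patches=50, size=224, rng=None):
    # Majority vote over random crops; assumes the image is at least
    # `size` pixels per side and classify_patch returns a class index.
    if rng is None:
        rng = np.random.default_rng()
    H, W = image.shape[:2]
    votes = [classify_patch(image[i:i + size, j:j + size])
             for i, j in zip(rng.integers(0, H - size + 1, n_patches),
                             rng.integers(0, W - size + 1, n_patches))]
    return np.bincount(votes).argmax()
```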
66. A Comparison of Deep Learning Convolution Neural Networks for Liver Segmentation in Radial Turbo Spin Echo Images [PDF] 返回目录
Lavanya Umapathy, Mahesh Bharath Keerthivasan, Jean-Phillipe Galons, Wyatt Unger, Diego Martin, Maria I Altbach, Ali Bilgin
Abstract: The motion-robust 2D Radial Turbo Spin Echo (RADTSE) pulse sequence can provide a high-resolution composite image, T2-weighted images at multiple echo times (TEs), and a quantitative T2 map, all from a single k-space acquisition. In this work, we use a deep-learning convolutional neural network (CNN) for the segmentation of the liver in abdominal RADTSE images. A modified UNET architecture with a generalized dice loss objective function was implemented. Three 2D CNNs were trained, one for each image type obtained from the RADTSE sequence. On evaluating the performance of the CNNs on the validation set, we found that CNNs trained on TE images or the T2 maps had higher average dice scores than those trained on the composite images. This, in turn, implies that the information on T2 variation in tissues aids in improving segmentation performance.
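The generalized dice loss weights each class by the inverse of its squared label volume, so small structures are not swamped by large ones. A common formulation (Sudre et al., 2017) is sketched below; the exact variant used in the paper may differ in details such as smoothing:

```python
import numpy as np

def generalized_dice_loss(probs, onehot, eps=1e-6):
    # probs, onehot: (N, C) predicted probabilities and one-hot labels,
    # flattened over pixels. Class weights = 1 / (label volume)^2.
    w = 1.0 / (onehot.sum(axis=0) ** 2 + eps)
    intersection = (w * (probs * onehot).sum(axis=0)).sum()
    union = (w * (probs + onehot).sum(axis=0)).sum()
    return 1.0 - 2.0 * intersection / (union + eps)
```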
67. Principal Neighbourhood Aggregation for Graph Nets [PDF] 返回目录
Gabriele Corso, Luca Cavalleri, Dominique Beaini, Pietro Liò, Petar Veličković
Abstract: Graph Neural Networks (GNNs) have been shown to be effective models for different predictive tasks on graph-structured data. Recent work on their expressive power has focused on isomorphism tasks and countable feature spaces. We extend this theoretical framework to include continuous features - which occur regularly in real-world input domains and within the hidden layers of GNNs - and we demonstrate the requirement for multiple aggregation functions in this setting. Accordingly, we propose Principal Neighbourhood Aggregation (PNA), a novel architecture combining multiple aggregators with degree-scalers (which generalize the sum aggregator). Finally, we compare the capacity of different models to capture and exploit the graph structure via a benchmark containing multiple tasks taken from classical graph theory, which demonstrates the capacity of our model.
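A PNA-style layer concatenates several aggregators, each rescaled by degree-dependent scalers, before the usual learned update. The sketch below uses the mean/max/min/std aggregators and the logarithmic amplification/attenuation scalers described in the paper; treat the specific constants as a sketch rather than a reference implementation.

```python
import numpy as np

def pna_aggregate(neighbor_msgs, avg_log_deg):
    # neighbor_msgs: (d, F) messages from a node's d >= 1 neighbors.
    # avg_log_deg: mean of log(d + 1) over the training graphs.
    d = len(neighbor_msgs)
    aggregators = [neighbor_msgs.mean(0), neighbor_msgs.max(0),
                   neighbor_msgs.min(0), neighbor_msgs.std(0)]
    s = np.log(d + 1) / avg_log_deg
    scalers = [1.0, s, 1.0 / (s + 1e-12)]  # identity, amplify, attenuate
    return np.concatenate([c * a for c in scalers for a in aggregators])
```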
68. Towards an Efficient Deep Learning Model for COVID-19 Patterns Detection in X-ray Images [PDF] 返回目录
Eduardo Luz, Pedro Lopes Silva, Rodrigo Silva, Gladston Moreira
Abstract: Confronting the pandemic of COVID-19 caused by the new coronavirus, the SARS-CoV-2, is nowadays one of the most prominent challenges of the human species. A key factor in slowing down the virus propagation is rapid diagnosis and isolation of infected patients. Nevertheless, the standard method for COVID-19 identification, the RT-PCR, is time-consuming and in short supply due to the pandemic. Researchers around the world have been trying to find alternative screening methods. In this context, deep learning applied to chest X-rays of patients has been showing a lot of promise for the identification of COVID-19. Despite their success, the computational cost of these methods remains high, which imposes difficulties in their accessibility and availability. Thus, in this work, we address the hypothesis that better performance in terms of overall accuracy and COVID-19 sensitivity can be achieved with much more compact models. In order to test this hypothesis, we propose a modification of the EfficientNet family of models. By doing this we were able to produce a high-quality model with an overall accuracy of 91.4%, COVID-19, sensitivity of 90% and positive prediction of 100% while having about 30 times fewer parameters than the baseline model, 28 and 5 times fewer parameters than the popular VGG16 and ResNet50 architectures, respectively.
69. Y-net: Biomedical Image Segmentation and Clustering [PDF] 返回目录
Sharmin Pathan, Anant Tripathi
Abstract: We propose a deep clustering architecture alongside image segmentation for medical image analysis. The main idea is based on unsupervised learning to cluster images on severity of the disease in the subject's sample, and this image is then segmented to highlight and outline regions of interest. We start with training an autoencoder on the images for segmentation. The encoder part from the autoencoder branches out to a clustering node and segmentation node. Deep clustering using Kmeans clustering is performed at the clustering branch and a lightweight model is used for segmentation. Each of the branches use extracted features from the autoencoder. We demonstrate our results on ISIC 2018 Skin Lesion Analysis Towards Melanoma Detection and Cityscapes datasets for segmentation and clustering. The proposed architecture beats UNet and DeepLab results on the two datasets, and has less than half the number of parameters. We use the deep clustering branch for clustering images into four clusters. Our approach can be applied to work with high complexity datasets of medical imaging for analyzing survival prediction for severe diseases or customizing treatment based on how far the disease has propagated. Clustering patients can help understand how binning should be done on real valued features to reduce feature sparsity and improve accuracy on classification tasks. The proposed architecture can provide an early diagnosis and reduce human intervention on labeling as it can become quite costly as the datasets grow larger. The main idea is to propose a one shot approach to segmentation with deep clustering.
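The clustering branch amounts to running k-means on features from the shared encoder. A plain k-means loop over bottleneck features is sketched below, assuming the encoder yields one feature vector per image; the joint training with the segmentation branch is not shown.

```python
import numpy as np

def kmeans(feats, k=4, iters=50, rng=None):
    # feats: (N, F) encoder bottleneck features, one row per image.
    # k=4 mirrors the four severity clusters mentioned in the abstract.
    if rng is None:
        rng = np.random.default_rng(0)
    centers = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        assign = ((feats[:, None] - centers) ** 2).sum(-1).argmin(1)
        centers = np.array([feats[assign == j].mean(0) if (assign == j).any()
                            else centers[j] for j in range(k)])
    return assign, centers
```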
70. Residual Attention U-Net for Automated Multi-Class Segmentation of COVID-19 Chest CT Images [PDF] 返回目录
Xiaocong Chen, Lina Yao, Yu Zhang
Abstract: The novel coronavirus disease 2019 (COVID-19) has been spreading rapidly around the world and caused significant impact on the public health and economy. However, there is still lack of studies on effectively quantifying the lung infection caused by COVID-19. As a basic but challenging task of the diagnostic framework, segmentation plays a crucial role in accurate quantification of COVID-19 infection measured by computed tomography (CT) images. To this end, we proposed a novel deep learning algorithm for automated segmentation of multiple COVID-19 infection regions. Specifically, we use the Aggregated Residual Transformations to learn a robust and expressive feature representation and apply the soft attention mechanism to improve the capability of the model to distinguish a variety of symptoms of the COVID-19. With a public CT image dataset, we validate the efficacy of the proposed algorithm in comparison with other competing methods. Experimental results demonstrate the outstanding performance of our algorithm for automated segmentation of COVID-19 Chest CT images. Our study provides a promising deep leaning-based segmentation tool to lay a foundation to quantitative diagnosis of COVID-19 lung infection in CT images.
71. Relational Learning between Multiple Pulmonary Nodules via Deep Set Attention Transformers [PDF] 返回目录
Jiancheng Yang, Haoran Deng, Xiaoyang Huang, Bingbing Ni, Yi Xu
Abstract: Diagnosis and treatment of multiple pulmonary nodules are clinically important but challenging. Prior studies on nodule characterization apply solitary-nodule approaches to patients with multiple nodules, which ignores the relations between nodules. In this study, we propose a multiple instance learning (MIL) approach and empirically prove the benefit of learning the relations between multiple nodules. By treating the multiple nodules from the same patient as a whole, critical relational information between solitary-nodule voxels is extracted. To our knowledge, this is the first study to learn the relations between multiple pulmonary nodules. Inspired by recent advances in the natural language processing (NLP) domain, we introduce a self-attention transformer equipped with a 3D CNN, named NoduleSAT, to replace the typical pooling-based aggregation in multiple instance learning. Extensive experiments on lung nodule false positive reduction on the LUNA16 database and malignancy classification on the LIDC-IDRI database validate the effectiveness of the proposed method.
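The aggregation step can be pictured with a small sketch: each nodule embedding (which a 3D CNN would produce; stubbed here with random tensors) attends to every other nodule from the same patient, replacing mean/max pooling over the instance set. Dimensions and head count are illustrative:

```python
import torch
import torch.nn as nn

embed_dim = 128
# 5 nodules from one patient; layout is (set length, batch, embedding).
nodule_embeddings = torch.randn(5, 1, embed_dim)

attn = nn.MultiheadAttention(embed_dim, num_heads=4)
related, weights = attn(nodule_embeddings, nodule_embeddings, nodule_embeddings)
# 'related' holds per-nodule features informed by the other nodules; a
# per-nodule head would consume these for, e.g., malignancy classification.
print(related.shape, weights.shape)  # (5, 1, 128), (1, 5, 5)
```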
72. When Weak Becomes Strong: Robust Quantification of White Matter Hyperintensities in Brain MRI scans [PDF] 返回目录
Oliver Werner, Kimberlin M.H. van Wijnen, Wiro J. Niessen, Marius de Groot, Meike W. Vernooij, Florian Dubost, Marleen de Bruijne
Abstract: To measure the volume of specific image structures, a typical approach is to first segment those structures using a neural network trained on voxel-wise (strong) labels and subsequently compute the volume from the segmentation. A more straightforward approach would be to predict the volume directly, using a neural network based regression approach trained on image-level (weak) labels indicating volume. In this article, we compared networks optimized with weak and strong labels, and studied their ability to generalize to other datasets. We experimented with white matter hyperintensity (WMH) volume prediction in brain MRI scans. Neural networks were trained on a large local dataset and their performance was evaluated on four independent public datasets. We showed that networks optimized using only weak labels reflecting WMH volume generalized better for WMH volume prediction than networks optimized with voxel-wise segmentations of WMH. The attention maps of networks trained with weak labels did not seem to delineate WMHs, but instead highlighted areas with smooth contours around or near WMHs. By correcting for possible confounders, we showed that networks trained on weak labels may have learnt other meaningful features that are better suited to generalization to unseen data. Our results suggest that for imaging biomarkers that can be derived from segmentations, training networks to predict the biomarker directly may provide more robust results than solving an intermediate segmentation step.
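The weak-label route is easy to picture: a regression network maps the scan straight to a volume and is trained only on image-level labels. The sketch below is a deliberately tiny stand-in (architecture, sizes, and units are assumptions, not the paper's network):

```python
import torch
import torch.nn as nn

regressor = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 1),
)
scans = torch.randn(4, 1, 128, 128)     # dummy MRI slices
volumes = torch.rand(4, 1) * 20.0       # image-level (weak) labels, e.g. in mL
loss = nn.functional.mse_loss(regressor(scans), volumes)
loss.backward()                          # no voxel-wise mask ever needed
```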
73. A Unified DNN Weight Compression Framework Using Reweighted Optimization Methods [PDF] 返回目录
Tianyun Zhang, Xiaolong Ma, Zheng Zhan, Shanglin Zhou, Minghai Qin, Fei Sun, Yen-Kuang Chen, Caiwen Ding, Makan Fardad, Yanzhi Wang
Abstract: To address the large model size and intensive computation requirements of deep neural networks (DNNs), weight pruning techniques have been proposed; they generally fall into two categories, i.e., static regularization-based pruning and dynamic regularization-based pruning. However, the former currently suffers from either complex workloads or accuracy degradation, while the latter takes a long time to tune the parameters to achieve the desired pruning rate without accuracy loss. In this paper, we propose a unified DNN weight pruning framework with dynamically updated regularization terms bounded by the designated constraint, which can generate both non-structured sparsity and different kinds of structured sparsity. We also extend our method to an integrated framework that combines different DNN compression tasks.
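One plausible reading of a dynamically updated (reweighted) regularization term is sketched below: each output filter gets a penalty coefficient that is periodically reset inversely proportional to its current norm, so already-small filters are driven toward zero and can be pruned. The update rule, schedule, and constants are assumptions, not the authors' exact scheme:

```python
import torch

def reweighted_group_penalty(weight, alphas, eps=1e-4):
    """weight: (out_channels, ...) conv tensor; one group per output filter."""
    norms = weight.flatten(1).norm(dim=1)
    penalty = (alphas * norms).sum()
    return penalty, 1.0 / (norms.detach() + eps)   # candidate new coefficients

w = torch.randn(32, 16, 3, 3, requires_grad=True)
alphas = torch.ones(32)
for step in range(100):
    penalty, new_alphas = reweighted_group_penalty(w, alphas)
    loss = penalty                   # in practice: task_loss + lambda * penalty
    loss.backward()
    with torch.no_grad():
        w -= 0.01 * w.grad
        w.grad.zero_()
    if step % 10 == 0:
        alphas = new_alphas          # periodic dynamic reweighting
```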
74. Gradients as Features for Deep Representation Learning [PDF] 返回目录
Fangzhou Mu, Yingyu Liang, Yin Li
Abstract: We address the challenging problem of deep representation learning: the efficient adaptation of a pre-trained deep network to different tasks. Specifically, we propose to explore gradient-based features. These features are gradients of a task-specific loss, given an input sample, with respect to the model parameters. Our key innovation is the design of a linear model that incorporates both the gradients and the activations of the pre-trained network. We show that our model provides a local linear approximation to an underlying deep model, and discuss important theoretical insights. Moreover, we present an efficient algorithm for the training and inference of our model without computing the actual gradient. Our method is evaluated across a number of representation-learning tasks on several datasets and using different network architectures. Strong results are obtained in all settings, and are well-aligned with our theoretical insights.
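To make the feature construction concrete, the sketch below computes, for one sample, the gradient of a stand-in task-specific loss with respect to a network's final layer and concatenates it with the activation, as the linear model described above would consume. The backbone, head, and loss are all illustrative assumptions:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())
head = nn.Linear(128, 10)

def gradient_feature(x):
    """Feature for one sample: [activation, grad of loss w.r.t. head weights]."""
    act = backbone(x)
    loss = head(act).logsumexp(dim=1).sum()   # stand-in task-specific loss
    (g,) = torch.autograd.grad(loss, head.weight)
    return torch.cat([act.flatten(), g.flatten()])

phi = gradient_feature(torch.randn(1, 1, 28, 28))
print(phi.shape)   # 128 + 10*128 = 1408 dims; a linear model consumes phi
```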
75. DeepEDN: A Deep Learning-based Image Encryption and Decryption Network for Internet of Medical Things [PDF] 返回目录
Yi Ding, Guozheng Wu, Dajiang Chen, Ning Zhang, Linpeng Gong, Mingsheng Cao, Zhiguang Qin
Abstract: The Internet of Medical Things (IoMT) can connect many pieces of medical imaging equipment to a medical information network to facilitate diagnosis and treatment by doctors. As medical images contain sensitive information, it is important yet very challenging to safeguard the privacy and security of the patient. In this work, a deep learning based encryption and decryption network (DeepEDN) is proposed to carry out the process of encrypting and decrypting medical images. Specifically, in DeepEDN, a Cycle-Generative Adversarial Network (Cycle-GAN) is employed as the main learning network to transfer the medical image from its original domain into a target domain. The target domain is regarded as "hidden factors" that guide the learning model in realizing the encryption. The encrypted image is restored to the original (plaintext) image through a reconstruction network to achieve image decryption. In order to facilitate data mining directly in the privacy-protected environment, a region-of-interest (ROI) mining network is proposed to extract the object of interest from the encrypted image. The proposed DeepEDN is evaluated on a chest X-ray dataset. Extensive experimental results and a security analysis show that the proposed method can achieve a high level of security with good efficiency.
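The decryption constraint can be pictured as a cycle-consistency term: the reconstruction network must undo the encryption generator. The sketch below uses trivial stand-in generators; in the actual framework these would be full Cycle-GAN generators trained with adversarial losses as well:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1), nn.Tanh())  # encrypt
R = nn.Sequential(nn.Conv2d(1, 1, 3, padding=1))             # decrypt

x = torch.randn(2, 1, 64, 64)                 # stand-in medical images
cipher = G(x)                                 # image in the hidden target domain
recon = R(cipher)
cycle_loss = nn.functional.l1_loss(recon, x)  # decrypted image must match input
# Training would combine this with adversarial losses pushing 'cipher'
# toward the chosen hidden target domain.
```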
76. Verification of Deep Convolutional Neural Networks Using ImageStars [PDF] 返回目录
Hoang-Dung Tran, Stanley Bak, Weiming Xiang, Taylor T. Johnson
Abstract: Convolutional Neural Networks (CNN) have redefined the state-of-the-art in many real-world applications, such as facial recognition, image classification, human pose estimation, and semantic segmentation. Despite their success, CNNs are vulnerable to adversarial attacks, where slight changes to their inputs may lead to sharp changes in their output in even well-trained networks. Set-based analysis methods can detect or prove the absence of bounded adversarial attacks, which can then be used to evaluate the effectiveness of neural network training methodology. Unfortunately, existing verification approaches have limited scalability in terms of the size of networks that can be analyzed. In this paper, we describe a set-based framework that successfully deals with real-world CNNs, such as VGG16 and VGG19, that have high accuracy on ImageNet. Our approach is based on a new set representation called the ImageStar, which enables efficient exact and over-approximative analysis of CNNs. ImageStars perform efficient set-based analysis by combining operations on concrete images with linear programming (LP). Our approach is implemented in a tool called NNV, and can verify the robustness of VGG networks with respect to a small set of input states, derived from adversarial attacks, such as the DeepFool attack. The experimental results show that our approach is less conservative and faster than existing zonotope methods, such as those used in DeepZ, and the polytope method used in DeepPoly.
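A hedged sketch of the core star-set mechanics: a set x = c + V a with a in [-1, 1]^m passes through an affine layer exactly (only the center and generators change), and per-output ranges follow from small linear programs. ReLU layers, which require case splitting or over-approximation, are omitted, and all matrices below are random stand-ins:

```python
import numpy as np
from scipy.optimize import linprog

n, m = 4, 2
c = np.zeros(n); V = 0.1 * np.random.randn(n, m)   # star center and generators
W = np.random.randn(3, n); b = np.random.randn(3)  # one affine layer

c2, V2 = W @ c + b, W @ V                  # exact affine image of the star
for i in range(V2.shape[0]):
    # Min/max of output i over the predicate a in [-1, 1]^m via two LPs.
    lo = linprog(V2[i], bounds=[(-1, 1)] * m).fun + c2[i]
    hi = -linprog(-V2[i], bounds=[(-1, 1)] * m).fun + c2[i]
    print(f"output {i}: [{lo:.3f}, {hi:.3f}]")
```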
77. MetaIQA: Deep Meta-learning for No-Reference Image Quality Assessment [PDF] 返回目录
Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, Guangming Shi
Abstract: Recently, increasing interest has been drawn to exploiting deep convolutional neural networks (DCNNs) for no-reference image quality assessment (NR-IQA). Despite the notable success achieved, there is a broad consensus that training DCNNs heavily relies on massive annotated data. Unfortunately, IQA is a typical small-sample problem. Therefore, most of the existing DCNN-based IQA metrics operate on pre-trained networks. However, these pre-trained networks are not designed for the IQA task, leading to generalization problems when evaluating different types of distortions. With this motivation, this paper presents a no-reference IQA metric based on deep meta-learning. The underlying idea is to learn the meta-knowledge shared by humans when evaluating the quality of images with various distortions, which can then be adapted to unknown distortions easily. Specifically, we first collect a number of NR-IQA tasks for different distortions. Then meta-learning is adopted to learn the prior knowledge shared by diversified distortions. Finally, the quality prior model is fine-tuned on a target NR-IQA task to quickly obtain the quality model. Extensive experiments demonstrate that the proposed metric outperforms the state of the art by a large margin. Furthermore, the meta-model learned from synthetic distortions can also be easily generalized to authentic distortions, which is highly desirable in real-world applications of IQA metrics.
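The meta-learning stage can be sketched with a first-order (Reptile-style) update over distortion-specific tasks; the paper's exact meta-learning variant may differ, and the model, tasks, and learning rates below are toy stand-ins:

```python
import copy
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
# One small quality-regression task per distortion type (dummy data).
tasks = [(torch.randn(16, 1, 8, 8), torch.rand(16, 1)) for _ in range(5)]

meta_lr, inner_lr = 0.1, 0.01
for images, mos in tasks:
    learner = copy.deepcopy(model)
    opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
    for _ in range(5):                       # inner adaptation on this distortion
        opt.zero_grad()
        nn.functional.mse_loss(learner(images), mos).backward()
        opt.step()
    with torch.no_grad():                    # meta-update toward adapted weights
        for p, q in zip(model.parameters(), learner.parameters()):
            p += meta_lr * (q - p)
# 'model' now holds a quality prior; fine-tune it on a target NR-IQA task.
```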
78. Farmland Parcel Delineation Using Spatio-temporal Convolutional Networks [PDF] 返回目录
Han Lin Aung, Burak Uzkent, Marshall Burke, David Lobell, Stefano Ermon
Abstract: Farm parcel delineation provides cadastral data that are important in developing and managing climate change policies. Specifically, farm parcel delineation informs downstream governmental policies on land allocation, irrigation, fertilization, greenhouse gases (GHGs), etc. These data can also be useful for the agricultural insurance sector in assessing compensation following damage associated with extreme weather events - a growing trend related to climate change. Satellite imaging can be a scalable and cost-effective way to perform the task of farm parcel delineation and collect this valuable data. In this paper, we break down this task using satellite imaging into two approaches: 1) segmentation of parcel boundaries, and 2) segmentation of parcel areas. We implemented variants of UNet, one of which takes temporal information into account and achieved the best results on our dataset of farmland parcels in France from 2017.
79. Detection of Covid-19 From Chest X-ray Images Using Artificial Intelligence: An Early Review [PDF] 返回目录
Muhammad Ilyas, Hina Rehman, Amine Nait-ali
Abstract: In 2019, the entire world faced a health emergency due to a newly emerged coronavirus (COVID-19). Almost 196 countries are affected by COVID-19, with the USA, Italy, China, Spain, Iran, and France having the highest numbers of active cases. Medical and healthcare departments face delays in detecting COVID-19. Several artificial intelligence based systems have been designed for the automatic detection of COVID-19 using chest X-rays. In this article we discuss the different approaches used for the detection of COVID-19 and the challenges we are facing. It is mandatory to develop an automatic detection system to prevent the transfer of the virus through contact. Several deep learning architectures have been deployed for the detection of COVID-19, such as ResNet, Inception, GoogLeNet, etc. All these approaches detect subjects suffering from pneumonia, while it is hard to decide whether the pneumonia is caused by COVID-19 or by another bacterial or fungal infection.
80. Underwater Image Enhancement Based on Structure-Texture Reconstruction [PDF] 返回目录
Sen Lin, Kaichen Chi
Abstract: Aiming at the problems of color distortion, blur, and excessive noise in underwater images, an underwater image enhancement algorithm based on structure-texture reconstruction is proposed. Firstly, color equalization of the degraded image is achieved by an automatic color enhancement algorithm. Secondly, relative total variation is introduced to decompose the image into a structure layer and a texture layer. Then, the best background light point is selected based on brightness, gradient discrimination, and hue judgment, and the transmittance of the backscatter component is obtained with the red dark channel prior, which is substituted into the imaging model to remove the fogging phenomenon in the structure layer. Effective details in the texture layer are enhanced by a multi-scale detail enhancement algorithm and a binary mask. Finally, the structure layer and texture layer are recombined to get the final image. The experimental results show that the algorithm can effectively balance the hue, saturation, and clarity of underwater images, and performs well in different underwater environments.
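The transmittance step admits a compact sketch. Below is the standard dark-channel-prior computation with the red channel replaced by its complement, which is one common formulation of a red dark channel for underwater scenes; the patch size, omega, and background light values are assumptions:

```python
import numpy as np
from scipy.ndimage import minimum_filter

def transmittance(img, A, omega=0.95, patch=15):
    """img: HxWx3 float in [0, 1]; A: per-channel background light."""
    # Red dark channel: use (1 - R) in place of R before the local minimum.
    chans = np.stack([1.0 - img[..., 0], img[..., 1], img[..., 2]], axis=-1)
    A_adj = np.array([1.0 - A[0], A[1], A[2]])
    dark = minimum_filter((chans / A_adj).min(axis=-1), size=patch)
    return np.clip(1.0 - omega * dark, 0.05, 1.0)

img = np.random.rand(64, 64, 3)                     # stand-in structure layer
t = transmittance(img, A=np.array([0.8, 0.9, 0.9]))
# Defogging then recovers the structure layer via J = (I - A) / t + A.
```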
81. The Role of Stem Noise in Visual Perception and Image Quality Measurement [PDF] 返回目录
Arash Ashtari
Abstract: This paper considers reference-free quality assessment of distorted and noisy images. Specifically, it considers the first- and second-order statistics of stem noise that can be evaluated given any image. In the research field of Image Quality Assessment (IQA), the stem noise is defined as the input of an Auto-Regressive (AR) process from which a low-energy and de-correlated version of the image can be recovered. To estimate the AR model parameters and the associated stem noise energy, the Yule-Walker equations are used, such that the accompanying Auto-Correlation Function (ACF) coefficients can be treated as model parameters for image reconstruction. To characterize systematic signal-dependent and signal-independent distortions, the mean and variance of the stem noise can be evaluated over the image. Crucially, this paper shows that these statistics have predictive validity in relation to human ratings of image quality. Furthermore, under certain kinds of image distortion, stem noise statistics show very significant correlations with established measures of image quality.
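The stem-noise statistics are straightforward to compute once the AR fit is in place. The sketch below works on a 1D signal (e.g., an image row scan) for brevity, solving the Yule-Walker equations for an AR(p) model and summarizing the prediction residual, i.e., the stem noise, by its mean and variance; the order p and the 1D simplification are assumptions:

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def stem_noise_stats(x, p=4):
    x = x - x.mean()
    # Biased autocorrelation estimates for lags 0..p.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(p + 1)]) / len(x)
    a = solve_toeplitz(r[:p], r[1:p + 1])        # Yule-Walker AR coefficients
    # One-step AR prediction: x[n] ~ sum_k a[k] * x[n - 1 - k].
    pred = sum(a[k] * x[p - 1 - k:len(x) - 1 - k] for k in range(p))
    stem = x[p:] - pred                          # stem noise (AR residual)
    return stem.mean(), stem.var()

row = np.random.randn(512)                       # stand-in image row
print(stem_noise_stats(row))
```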
82. Trajectory annotation using sequences of spatial perception [PDF] 返回目录
Sebastian Feld, Steffen Illium, Andreas Sedlmeier, Lenz Belzner
Abstract: In the near future, more and more machines will perform tasks in the vicinity of human spaces or support them directly in their spatially bound activities. In order to simplify the verbal communication and interaction between robotic units and/or humans, systems that are reliable and robust with respect to noise and processing results are needed. This work builds a foundation to address this task. By using a continuous representation of spatial perception in interiors learned from trajectory data, our approach clusters movement in dependence on its spatial context. We propose an unsupervised learning approach based on neural autoencoding that learns semantically meaningful continuous encodings of spatio-temporal trajectory data. This learned encoding can be used to form prototypical representations. We present promising results that clear the path for future applications.
83. Bayesian Surprise in Indoor Environments [PDF] 返回目录
Sebastian Feld, Andreas Sedlmeier, Markus Friedrich, Jan Franz, Lenz Belzner
Abstract: This paper proposes a novel method to identify unexpected structures in 2D floor plans using the concept of Bayesian Surprise. Taking into account that a person's expectation is an important aspect of the perception of space, we exploit the theory of Bayesian Surprise to robustly model expectation and thus surprise in the context of building structures. We use Isovist Analysis, which is a popular space syntax technique, to turn qualitative object attributes into quantitative environmental information. Since isovists are location-specific patterns of visibility, a sequence of isovists describes the spatial perception during a movement along multiple points in space. We then use Bayesian Surprise in a feature space consisting of these isovist readings. To demonstrate the suitability of our approach, we take "snapshots" of an agent's local environment to provide a short list of images that characterize a traversed trajectory through a 2D indoor environment. Those fingerprints represent surprising regions of a tour, characterize the traversed map and enable indoor LBS to focus more on important regions. Given this idea, we propose to use "surprise" as a new dimension of context in indoor location-based services (LBS). Agents of LBS, such as mobile robots or non-player characters in computer games, may use the context surprise to focus more on important regions of a map for a better use or understanding of the floor plan.
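Bayesian Surprise is, at its core, the KL divergence between posterior and prior beliefs after an observation. The sketch below instantiates this with a Dirichlet belief over discretized isovist-feature categories; the discretization and the counts are illustrative stand-ins for the paper's feature space:

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_kl(a, b):
    """KL( Dir(a) || Dir(b) ), the standard closed form."""
    return (gammaln(a.sum()) - gammaln(b.sum())
            - (gammaln(a) - gammaln(b)).sum()
            + ((a - b) * (digamma(a) - digamma(a.sum()))).sum())

prior = np.ones(8)                           # uniform belief over 8 categories
counts = np.array([0, 0, 5, 0, 0, 1, 0, 0])  # isovist observations here
posterior = prior + counts                   # conjugate Dirichlet update
print("surprise:", dirichlet_kl(posterior, prior))
```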
84. Exploring The Spatial Reasoning Ability of Neural Models in Human IQ Tests [PDF] 返回目录
Hyunjae Kim, Yookyung Koh, Jinheon Baek, Jaewoo Kang
Abstract: Although neural models have performed impressively well on various tasks such as image recognition and question answering, their reasoning ability has been measured in only a few studies. In this work, we focus on spatial reasoning and explore the spatial understanding of neural models. First, we describe the following two spatial reasoning IQ tests: rotation and shape composition. Using well-defined rules, we constructed datasets that span various complexity levels. We designed a variety of experiments in terms of generalization, and evaluated six different baseline models on the newly generated datasets. We provide an analysis of the results and of the factors that affect the generalization abilities of the models. Also, we analyze how neural models solve spatial reasoning tests with visual aids. Our findings provide valuable insights into understanding a machine and the difference between a machine and a human.
85. KD-MRI: A knowledge distillation framework for image reconstruction and image restoration in MRI workflow [PDF] 返回目录
Balamurali Murugesan, Sricharan Vijayarangan, Kaushik Sarveswaran, Keerthi Ram, Mohanasankar Sivaprakasam
Abstract: Deep learning networks are being developed in every stage of the MRI workflow and have provided state-of-the-art results. However, this has come at the cost of increased computation requirements and storage. Hence, replacing the networks with compact models at various stages in the MRI workflow can significantly reduce the required storage space and provide considerable speedup. In computer vision, knowledge distillation is a commonly used method for model compression. In our work, we propose a knowledge distillation (KD) framework for image-to-image problems in the MRI workflow in order to develop compact, low-parameter models without a significant drop in performance. We propose a combination of an attention-based feature distillation method and an imitation loss, and demonstrate its effectiveness on the popular MRI reconstruction architecture, DC-CNN. We conduct extensive experiments using Cardiac, Brain, and Knee MRI datasets for 4x, 5x and 8x accelerations. We observed that the student network trained with the assistance of the teacher using our proposed KD framework provided significant improvement over the student network trained without assistance, across all the datasets and acceleration factors. Specifically, for the Knee dataset, the student network achieves a 65% parameter reduction, 2x faster CPU running time, and 1.5x faster GPU running time compared to the teacher. Furthermore, we compare our attention-based feature distillation method with other feature distillation methods. We also conduct an ablative study to understand the significance of attention-based distillation and imitation loss. Finally, we extend our KD framework to MRI super-resolution and show encouraging results.
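The two distillation signals can be sketched compactly: an attention-transfer term that matches channel-pooled, normalized spatial attention maps of teacher and student features (which works even when the compact student has fewer channels), plus an imitation loss on the reconstructed outputs. Shapes and the specific norms below are assumptions, not the paper's exact losses:

```python
import torch
import torch.nn.functional as F

def attention_map(feat):                     # feat: (B, C, H, W)
    a = feat.pow(2).mean(dim=1).flatten(1)   # channel-pooled spatial attention
    return F.normalize(a, dim=1)

f_teacher = torch.randn(2, 64, 32, 32)       # intermediate teacher features
f_student = torch.randn(2, 16, 32, 32)       # compact student: fewer channels
out_teacher = torch.randn(2, 1, 128, 128)    # teacher reconstruction
out_student = torch.randn(2, 1, 128, 128, requires_grad=True)

kd_loss = (F.mse_loss(attention_map(f_student), attention_map(f_teacher))
           + F.l1_loss(out_student, out_teacher))   # imitation loss
kd_loss.backward()
```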
86. Object-oriented SLAM using Quadrics and Symmetry Properties for Indoor Environments [PDF] 返回目录
Ziwei Liao, Wei Wang, Xianyu Qi, Xiaoyu Zhang, Lin Xue, Jianzhen Jiao, Ran Wei
Abstract: Aiming at the application environment of indoor mobile robots, this paper proposes a sparse object-level SLAM algorithm based on an RGB-D camera. A quadric representation is used as a landmark to compactly model objects, including their position, orientation, and occupied space. State-of-the-art quadric-based SLAM algorithms face an observability problem caused by the limited perspective under the planar trajectory of a mobile robot. To solve this problem, the proposed algorithm fuses both object detection and point cloud data to estimate the quadric parameters. It completes quadric initialization from a single frame of RGB-D data, which significantly reduces the requirements on perspective changes. As objects are often observed only locally, the proposed algorithm uses the symmetry properties of indoor artificial objects to estimate the occluded parts and obtain more accurate quadric parameters. Experiments have shown that, compared with the state-of-the-art algorithm, especially on the forward trajectory of mobile robots, the proposed algorithm significantly improves the accuracy and convergence speed of quadric reconstruction. Finally, we made available an open-source implementation to replicate the experiments.
87. Exploit Where Optimizer Explores via Residuals [PDF] 返回目录
An Xu, Zhouyuan Huo, Heng Huang
Abstract: To train neural networks faster, many research efforts have been devoted to exploring a better gradient descent trajectory, but little effort has been put into exploiting the intermediate results. In this work we propose a novel optimization method named (momentum) stochastic gradient descent with residuals (RSGD(m)) that exploits the gradient descent trajectory using proper residual schemes, which leads to a performance boost in both convergence and generalization. We provide a theoretical analysis showing that RSGD can achieve a smaller growth rate of the generalization error and the same convergence rate compared with SGD. Extensive deep learning experiments on image classification and word-level language modeling empirically show that both the convergence and the generalization of our RSGD(m) method are improved significantly compared with the existing SGD(m) algorithm.
88. FLIVVER: Fly Lobula Inspired Visual Velocity Estimation & Ranging [PDF] 返回目录
Bryson Lingenfelter, Arunava Nag, Floris van Breugel
Abstract: The mechanism by which a tiny insect or insect-sized robot could estimate its absolute velocity and distance to nearby objects remains unknown. However, this ability is critical for behaviors that require estimating wind direction during flight, such as odor-plume tracking. Neuroscience and behavior studies with insects have shown that they rely on the perception of image motion, or optic flow, to estimate relative motion, equivalent to a ratio of their velocity and distance to objects in the world. The key open challenge is therefore to decouple these two states from a single measurement of their ratio. Although modern SLAM (Simultaneous Localization and Mapping) methods provide a solution to this problem for robotic systems, these methods typically rely on computations that insects likely cannot perform, such as simultaneously tracking multiple individual visual features, remembering a 3D map of the world, and solving nonlinear optimization problems using iterative algorithms. Here we present a novel algorithm, FLIVVER, which combines the geometry of dynamic forward motion with inspiration from insect visual processing to \textit{directly} estimate absolute ground velocity from a combination of optic flow and acceleration information. Our algorithm provides a clear hypothesis for how insects might estimate absolute velocity, and also provides a theoretical framework for designing fast analog circuitry for efficient state estimation, which could be applied to insect-sized robots.
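One way to see how acceleration can break the scale ambiguity of optic flow, under the simplifying assumption of motion directly toward a surface (a back-of-the-envelope derivation consistent with the abstract, not taken from the paper):

```latex
\text{Let } d \text{ be the distance to the surface, } v = -\dot d \text{ the approach speed,}
\text{ and } r = v/d \text{ the measured optic-flow rate. With measured acceleration } a = \dot v:
\dot r \;=\; \frac{\dot v}{d} - \frac{v\,\dot d}{d^{2}}
       \;=\; \frac{a}{d} + r^{2}
       \;=\; \frac{a\,r}{v} + r^{2}
\quad\Longrightarrow\quad
v \;=\; \frac{a\,r}{\dot r - r^{2}}, \qquad d \;=\; \frac{v}{r}.
```

So whenever the acceleration is nonzero, the flow rate and its time derivative suffice to recover absolute velocity and range, which is exactly the kind of decoupling the abstract describes.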
89. Shape Estimation for Elongated Deformable Object using B-spline Chained Multiple Random Matrices Model [PDF] 返回目录
Gang Yao, Ryan Saltus, Ashwin Dani
Abstract: In this paper, a B-spline chained multiple random matrices representation is proposed to model the geometric characteristics of an elongated deformable object. The hyper-degree-of-freedom structure of the elongated deformable object makes its shape estimation challenging. Based on the likelihood function of the proposed model, an expectation-maximization (EM) method is derived to estimate the shape of the elongated deformable object. A split-and-merge method based on the Euclidean minimum spanning tree (EMST) is proposed to provide initialization for the EM algorithm. The proposed algorithm is evaluated for shape estimation of elongated deformable objects in scenarios such as a static rope in various configurations (including configurations with intersections), the continuous manipulation of a rope and a plastic tube, and the assembly of two plastic tubes. The execution time is computed, and the accuracy of the shape estimation results is evaluated through comparisons between the estimated width values and their ground truth, and through the intersection-over-union (IoU) metric.
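As a rough illustration of the EMST-based initialization idea, the sketch below orders unorganized 2D points along an elongated object by extracting the longest path of their Euclidean minimum spanning tree, then fits a cubic B-spline to the ordered chain with SciPy. The split-and-merge refinement and the random-matrix EM step are the paper's contribution and are not reproduced here; this only shows the initialization geometry on synthetic data.

```python
import numpy as np
from scipy.spatial import distance_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from scipy.interpolate import splprep, splev

# Noisy samples of an elongated curve, in arbitrary order.
rng = np.random.default_rng(0)
t = np.linspace(0, np.pi, 80)
pts = np.c_[t, np.sin(t)] + 0.02 * rng.standard_normal((80, 2))
rng.shuffle(pts)  # destroy any implicit ordering

# Euclidean MST over pairwise distances, plus all tree path lengths.
mst = minimum_spanning_tree(distance_matrix(pts, pts))
dist, pred = shortest_path(mst, directed=False, return_predecessors=True)

# The two most distant nodes in the tree are the object's endpoints;
# walking the predecessor array between them orders the points.
i, j = np.unravel_index(np.argmax(dist), dist.shape)
path = [j]
while path[-1] != i:
    path.append(pred[i, path[-1]])
ordered = pts[path]

# Smooth the ordered chain with a cubic B-spline.
tck, u = splprep(ordered.T, s=0.05)
curve = np.array(splev(np.linspace(0, 1, 200), tck)).T
print(curve.shape)  # (200, 2): a dense, ordered estimate of the shape
```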