目录
1. Self-Supervised Learning of Audio-Visual Objects from Video [PDF] 摘要
2. Deep learning for photoacoustic imaging: a survey [PDF] 摘要
3. Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [PDF] 摘要
4. Deep Learning-based Human Detection for UAVs with Optical and Infrared Cameras: System and Experiments [PDF] 摘要
5. Labels Are Not Perfect: Improving Probabilistic Object Detection via Label Uncertainty [PDF] 摘要
6. Depth Quality Aware Salient Object Detection [PDF] 摘要
7. Depth Quality-aware Selective Saliency Fusion for RGB-D Image Salient Object Detection [PDF] 摘要
8. Knowing Depth Quality In Advance: A Depth Quality Assessment Method For RGB-D Salient Object Detection [PDF] 摘要
9. Deep Sketch-guided Cartoon Video Synthesis [PDF] 摘要
10. Vision Meets Wireless Positioning: Effective Person Re-identification with Recurrent Context Propagation [PDF] 摘要
11. T-GD: Transferable GAN-generated Images Detection Framework [PDF] 摘要
12. Improved Adaptive Type-2 Fuzzy Filter with Exclusively Two Fuzzy Membership Function for Filtering Salt and Pepper Noise [PDF] 摘要
13. Fighting Deepfake by Exposing the Convolutional Traces on Images [PDF] 摘要
14. Adversarial Examples on Object Recognition: A Comprehensive Survey [PDF] 摘要
15. Driving among Flatmobiles: Bird-Eye-View occupancy grids from a monocular camera for holistic trajectory planning [PDF] 摘要
16. Cooperative Bi-path Metric for Few-shot Learning [PDF] 摘要
17. Invertible Neural BRDF for Object Inverse Rendering [PDF] 摘要
摘要
1. Self-Supervised Learning of Audio-Visual Objects from Video [PDF] 返回目录
Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman
Abstract: Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
摘要:我们的目标是利用自监督学习将视频转换为一组离散的视听对象。为此,我们提出了一个模型,利用注意力机制来定位并分组声源,并利用光流随时间聚合信息。我们将模型学到的视听对象嵌入用于四个下游语音相关任务,以验证其有效性:(a) 多说话人声源分离,(b) 说话人定位与跟踪,(c) 校正音视频错位数据,以及 (d) 活跃说话人检测。利用我们的表示,这些任务仅通过在无标注视频上训练即可解决,无需借助目标检测器。我们还将方法应用于卡通和木偶等非人类说话者,证明了其通用性。我们的模型显著优于其他自监督方法,并取得了与使用有监督人脸检测的方法相当的性能。
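As a minimal illustration of the attention mechanism this abstract describes, the sketch below scores each spatial location of a visual feature map against an audio embedding assumed to live in a shared space; the tensor shapes, names and temperature are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def localize_sound_source(visual_feats, audio_emb, temperature=0.07):
    """visual_feats: (B, C, H, W) frame features; audio_emb: (B, C) clip embedding.
    Returns a (B, H, W) attention map that peaks where the sound source is."""
    B, C, H, W = visual_feats.shape
    v = F.normalize(visual_feats.flatten(2), dim=1)   # (B, C, H*W), unit-norm channels
    a = F.normalize(audio_emb, dim=1).unsqueeze(1)    # (B, 1, C)
    sim = torch.bmm(a, v).squeeze(1) / temperature    # cosine similarity per location
    return F.softmax(sim, dim=-1).view(B, H, W)

# Usage: attn = localize_sound_source(torch.randn(2, 128, 14, 14), torch.randn(2, 128))
```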
2. Deep learning for photoacoustic imaging: a survey [PDF] 返回目录
Changchun Yang, Hengrong Lan, Feng Gao, Fei Gao
Abstract: Machine learning has developed dramatically and witnessed a lot of applications in various fields over the past few years. This boom originated in 2009, when a new model emerged, that is, the deep artificial neural network, which began to surpass other established mature models on some important benchmarks. Later, it was widely used in academia and industry. Ranging from image analysis to natural language processing, it has fully exerted its magic and has now become the state of the art among machine learning models. Deep neural networks have great potential in medical imaging technology, medical data analysis, medical diagnosis and other healthcare issues, and are being promoted in both pre-clinical and even clinical stages. In this review, we provide an overview of some new developments and challenges in the application of machine learning to medical image analysis, with a special focus on deep learning in photoacoustic imaging. The aim of this review is threefold: (i) introducing deep learning with some important basics, (ii) reviewing recent works that apply deep learning in the entire ecological chain of photoacoustic imaging, from image reconstruction to disease diagnosis, (iii) providing some open source materials and other resources for researchers interested in applying deep learning to photoacoustic imaging.
摘要:过去几年,机器学习得到了迅猛发展,并在各个领域涌现出大量应用。这一热潮始于2009年:一种新模型——深度人工神经网络——开始在一些重要基准上超越其他已有的成熟模型,此后被广泛应用于学术界和工业界。从图像分析到自然语言处理,它充分展现了自己的威力,如今已成为最先进的机器学习模型。深度神经网络在医学成像技术、医疗数据分析、医学诊断及其他医疗保健问题上具有巨大潜力,并已在临床前乃至临床阶段得到推广。在这篇综述中,我们概述了机器学习应用于医学图像分析的一些新进展与挑战,并特别关注深度学习在光声成像中的应用。本综述的目的有三:(i) 介绍深度学习及其重要基础知识;(ii) 回顾将深度学习应用于光声成像整个生态链(从图像重建到疾病诊断)的最新工作;(iii) 为有意将深度学习应用于光声成像的研究人员提供一些开源材料和其他资源。
3. Describe What to Change: A Text-guided Unsupervised Image-to-Image Translation Approach [PDF] 返回目录
Yahui Liu, Marco De Nadai, Deng Cai, Huayang Li, Xavier Alameda-Pineda, Nicu Sebe, Bruno Lepri
Abstract: Manipulating visual attributes of images through human-written text is a very challenging task. On the one hand, models have to learn the manipulation without the ground truth of the desired output. On the other hand, models have to deal with the inherent ambiguity of natural language. Previous research usually requires either the user to describe all the characteristics of the desired image or to use richly-annotated image captioning datasets. In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". Contrary to state-of-the-art approaches, our model does not require a human-annotated dataset nor a textual description of all the attributes of the desired image, but only those that have to be modified. Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation. Because text might be inherently ambiguous (blond hair may refer to different shades of blond, e.g. golden, icy, sandy), our method generates multiple stochastic versions of the same translation. Experiments show that the proposed model achieves promising performances on two large-scale public datasets: CelebA and CUB. We believe our approach will pave the way to new avenues of research combining textual and speech commands with visual attributes.
摘要:通过人类书写的文本来操纵图像的视觉属性是一项极具挑战性的任务。一方面,模型必须在没有期望输出真值的情况下学习这种操纵;另一方面,模型必须处理自然语言固有的歧义性。以往的研究通常要求用户描述期望图像的全部特征,或者依赖标注丰富的图像描述数据集。在这项工作中,我们提出了一种基于图像到图像翻译的新型无监督方法,通过诸如“把头发颜色改成黑色”这样的命令式语句来改变给定图像的属性。与最先进的方法不同,我们的模型既不需要人工标注的数据集,也不需要对期望图像所有属性的文本描述,而只需描述那些需要修改的属性。我们提出的模型将图像内容与视觉属性解耦,学习利用文本描述修改后者,再由内容和修改后的属性表示生成新图像。由于文本本身可能存在歧义(例如“金发”可以指金色、冷色、沙色等不同色调),我们的方法会为同一翻译生成多个随机版本。实验表明,该模型在 CelebA 和 CUB 这两个大规模公开数据集上取得了可喜的性能。我们相信这一方法将为把文本和语音命令与视觉属性相结合的研究开辟新的途径。
4. Deep Learning-based Human Detection for UAVs with Optical and Infrared Cameras: System and Experiments [PDF] 返回目录
Timo Hinzmann, Tobias Stegemann, Cesar Cadena, Roland Siegwart
Abstract: In this paper, we present our deep learning-based human detection system that uses optical (RGB) and long-wave infrared (LWIR) cameras to detect, track, localize, and re-identify humans from UAVs flying at high altitude. In each spectrum, a customized RetinaNet network with ResNet backbone provides human detections which are subsequently fused to minimize the overall false detection rate. We show that by optimizing the bounding box anchors and augmenting the image resolution the number of missed detections from high altitudes can be decreased by over 20 percent. Our proposed network is compared to different RetinaNet and YOLO variants, and to a classical optical-infrared human detection framework that uses hand-crafted features. Furthermore, along with the publication of this paper, we release a collection of annotated optical-infrared datasets recorded with different UAVs during search-and-rescue field tests and the source code of the implemented annotation tool.
摘要:本文介绍了我们基于深度学习的人体检测系统,它利用光学(RGB)相机和长波红外(LWIR)相机,从高空飞行的无人机上检测、跟踪、定位并重识别人体。在每个光谱中,带有 ResNet 骨干网络的定制 RetinaNet 网络给出人体检测结果,随后将两路结果融合,以最小化整体误检率。我们证明,通过优化边界框锚点并提高图像分辨率,可将高空漏检数量减少 20% 以上。我们将所提网络与不同的 RetinaNet 和 YOLO 变体以及一个使用手工特征的经典光学-红外人体检测框架进行了比较。此外,随本文一同发布的还有在搜救外场测试中用不同无人机录制的带标注光学-红外数据集,以及所实现标注工具的源代码。
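The abstract says the per-spectrum detections are fused to minimize the overall false detection rate. Below is a sketch of one plausible confirmation-style fusion rule (keep an RGB detection only when an LWIR detection overlaps it); the IoU threshold and score averaging are assumptions, not the paper's exact scheme.

```python
import numpy as np

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def fuse_detections(rgb_dets, lwir_dets, iou_thr=0.5):
    """Each det: (box, score). Keep RGB detections confirmed by an
    overlapping LWIR detection; average the scores of matched pairs."""
    fused = []
    for box_r, s_r in rgb_dets:
        matches = [s_l for box_l, s_l in lwir_dets if iou(box_r, box_l) >= iou_thr]
        if matches:
            fused.append((box_r, (s_r + max(matches)) / 2.0))
    return fused
```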
5. Labels Are Not Perfect: Improving Probabilistic Object Detection via Label Uncertainty [PDF] 返回目录
Di Feng, Lars Rosenbaum, Fabian Timm, Klaus Dietmayer
Abstract: Reliable uncertainty estimation is crucial for robust object detection in autonomous driving. However, previous works on probabilistic object detection either learn predictive probability for bounding box regression in an unsupervised manner, or use simple heuristics to do uncertainty regularization. This leads to unstable training or suboptimal detection performance. In this work, we leverage our previously proposed method for estimating uncertainty inherent in ground truth bounding box parameters (which we call label uncertainty) to improve the detection accuracy of a probabilistic LiDAR-based object detector. Experimental results on the KITTI dataset show that our method surpasses both the baseline model and the models based on simple heuristics by up to 3.6% in terms of Average Precision.
摘要:可靠的不确定性估计对于自动驾驶中鲁棒的目标检测至关重要。然而,以往关于概率目标检测的工作要么以无监督方式学习边界框回归的预测概率,要么使用简单的启发式方法进行不确定性正则化,导致训练不稳定或检测性能欠佳。在这项工作中,我们利用我们先前提出的方法来估计真值边界框参数中固有的不确定性(我们称之为标签不确定性),以提高基于激光雷达的概率目标检测器的检测精度。在 KITTI 数据集上的实验结果表明,我们的方法在平均精度(Average Precision)上比基线模型和基于简单启发式的模型最高提升 3.6%。
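A hedged sketch of how a per-parameter label variance could down-weight box regression, in the spirit of the abstract; the authors' exact loss is not given here, so this weighting form is an assumption.

```python
import torch

def uncertainty_weighted_l1(pred, target, label_var):
    """pred, target: (N, K) predicted and ground-truth box parameters;
    label_var: (N, K) estimated variance of each ground-truth parameter.
    Parameters with noisier labels contribute less to the loss."""
    weight = 1.0 / (1.0 + label_var)   # soft down-weighting of uncertain labels
    return (weight * torch.abs(pred - target)).mean()
```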
6. Depth Quality Aware Salient Object Detection [PDF] 返回目录
Chenglizhao Chen, Jipeng Wei, Chong Peng, Hong Qin
Abstract: Existing fusion-based RGB-D salient object detection methods usually adopt the bi-stream structure to strike the fusion trade-off between RGB and depth (D). The D quality usually varies from scene to scene, while the SOTA bi-stream approaches are depth quality unaware, which easily results in substantial difficulties in achieving complementary fusion status between RGB and D, leading to poor fusion results when facing low-quality D. Thus, this paper attempts to integrate a novel depth quality aware subnet into the classic bi-stream structure, aiming to assess the depth quality before conducting the selective RGB-D fusion. Compared with the SOTA bi-stream methods, the major highlight of our method is its ability to lessen the importance of those low-quality, no-contribution, or even negative-contribution D regions during the RGB-D fusion, achieving a much improved complementary status between RGB and D.
摘要:现有的基于融合的 RGB-D 显著目标检测方法通常采用双流结构,在 RGB 与深度(D)之间进行融合权衡。深度质量通常随场景变化,而最先进的双流方法并不感知深度质量,这使得它们难以在 RGB 与 D 之间达到互补的融合状态,在面对低质量深度时融合效果较差。因此,本文尝试将一个新颖的深度质量感知子网集成到经典双流结构中,旨在进行选择性 RGB-D 融合之前先评估深度质量。与最先进的双流方法相比,我们方法的主要亮点在于能够在 RGB-D 融合过程中降低那些低质量、无贡献甚至负贡献深度区域的权重,从而在 RGB 与 D 之间实现大幅改善的互补状态。
7. Depth Quality-aware Selective Saliency Fusion for RGB-D Image Salient Object Detection [PDF] 返回目录
Zhenyu Wu, Shuai Li, Chenglizhao Chen, Aimin Hao, Hong Qin
Abstract: Fully convolutional networks have shown outstanding performance in the salient object detection (SOD) field. The state-of-the-art (SOTA) methods have a tendency to become deeper and more complex, which easily homogenizes their learned deep features, resulting in a clear performance bottleneck. In sharp contrast to the conventional "deeper" schemes, this paper proposes a "wider" network architecture which consists of parallel sub-networks with totally different network architectures. In this way, the deep features obtained via these two sub-networks will exhibit large diversity, and thus have large potential to complement each other. However, a large diversity may easily lead to feature conflicts, thus we use dense short-connections to enable recursive interaction between the parallel sub-networks, pursuing an optimal complementary status between multi-model deep features. Finally, all these complementary multi-model deep features are selectively fused to make high-performance salient object detections. Extensive experiments on several famous benchmarks clearly demonstrate the superior performance, good generalization, and powerful learning ability of the proposed wider framework.
摘要:全卷积网络在显著目标检测(SOD)领域表现出色。最先进(SOTA)的方法趋向于更深、更复杂,这容易使其学到的深度特征同质化,造成明显的性能瓶颈。与传统“更深”的方案形成鲜明对比,本文提出一种“更宽”的网络架构,由网络结构完全不同的并行子网络组成。这样,经由两个子网络获得的深度特征将呈现出较大的多样性,从而具有相互补充的巨大潜力。然而,过大的多样性也容易导致特征冲突,因此我们使用稠密的短连接让并行子网络之间递归地交互,以追求多模型深度特征之间的最佳互补状态。最后,所有这些互补的多模型深度特征被选择性地融合,以实现高性能的显著目标检测。在多个著名基准上的大量实验清楚地证明了所提出的“更宽”框架的卓越性能、良好的泛化能力和强大的学习能力。
8. Knowing Depth Quality In Advance: A Depth Quality Assessment Method For RGB-D Salient Object Detection [PDF] 返回目录
Xuehao Wang, Shuai Li, Chenglizhao Chen, Aimin Hao, Hong Qin
Abstract: Previous RGB-D salient object detection (SOD) methods have widely adopted deep learning tools to automatically strike a trade-off between RGB and D (depth), whose key rationale is to take full advantage of their complementary nature, aiming for a much-improved SOD performance than that of using either of them solely. However, such fully automatic fusions may not always be helpful for the SOD task because the D quality itself usually varies from scene to scene. It may easily lead to a suboptimal fusion result if the D quality is not considered beforehand. Moreover, as an objective factor, the D quality has long been overlooked by previous work. As a result, it is becoming a clear performance bottleneck. Thus, we propose a simple yet effective scheme to measure D quality in advance, the key idea of which is to devise a series of features in accordance with the common attributes of high-quality D regions. To be more concrete, we conduct D quality assessments for each image region, following a multi-scale methodology that includes low-level edge consistency, mid-level regional uncertainty and high-level model variance. All these components are computed independently and then assembled with RGB and D features, applied as implicit indicators, to guide the selective fusion. Compared with the state-of-the-art fusion schemes, our method achieves a more reasonable fusion status between RGB and D. Specifically, the proposed D quality measurement method achieves steady performance improvements of almost 2.0% in general.
摘要:以往的 RGB-D 显著目标检测(SOD)方法广泛采用深度学习工具,自动在 RGB 与 D(深度)之间进行权衡,其核心思路是充分利用二者的互补性,以获得远优于单独使用其一的 SOD 性能。然而,这种全自动融合未必总是对 SOD 任务有益,因为深度质量本身通常随场景变化;若不事先考虑深度质量,很容易得到次优的融合结果。此外,作为一个客观因素,深度质量长期被以往工作所忽视,已成为明显的性能瓶颈。因此,我们提出一种简单而有效的方案来提前衡量深度质量,其核心思想是依照高质量深度区域的共同属性设计一系列特征。具体而言,我们遵循多尺度方法对每个图像区域进行深度质量评估,包括低层的边缘一致性、中层的区域不确定性和高层的模型方差。所有这些分量独立计算后,与 RGB 和 D 特征组合,作为隐式指标来引导选择性融合。与最先进的融合方案相比,我们的方法能在 RGB 与 D 之间实现更合理的融合状态;具体而言,所提出的深度质量度量方法总体上带来了接近 2.0% 的稳定性能提升。
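Of the three quality cues named above, the low-level edge-consistency term is the easiest to illustrate: measure how well depth edges coincide with RGB edges. A sketch using OpenCV; the Canny thresholds and pixel tolerance are illustrative assumptions, not the paper's settings.

```python
import cv2
import numpy as np

def edge_consistency(bgr, depth, tol=3):
    """bgr: (H, W, 3) uint8 image; depth: (H, W) depth map. Returns the
    fraction of depth edges lying within tol pixels of an RGB edge --
    higher means the depth map is better aligned with the scene."""
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    e_rgb = cv2.Canny(gray, 100, 200)
    d8 = cv2.normalize(depth, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    e_d = cv2.Canny(d8, 100, 200)
    kernel = np.ones((2 * tol + 1, 2 * tol + 1), np.uint8)
    near_rgb_edge = cv2.dilate(e_rgb, kernel) > 0       # tolerance band around RGB edges
    n_depth_edges = max(int((e_d > 0).sum()), 1)
    return float((near_rgb_edge & (e_d > 0)).sum()) / n_depth_edges
```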
9. Deep Sketch-guided Cartoon Video Synthesis [PDF] 返回目录
Xiaoyu Li, Bo Zhang, Jing Liao, Pedro V. Sander
Abstract: We propose a novel framework to produce cartoon videos by fetching the color information from two input keyframes while following the animated motion guided by a user sketch. The key idea of the proposed approach is to estimate the dense cross-domain correspondence between the sketch and cartoon video frames, followed by a blending module with occlusion estimation to synthesize the middle frame guided by the sketch. After that, the inputs and the synthetic frame equipped with established correspondence are fed into an arbitrary-time frame interpolation pipeline to generate and refine additional inbetween frames. Finally, a video post-processing approach is used to further improve the result. Compared to common frame interpolation methods, our approach can handle frames with relatively large motion and also has the flexibility to enable users to control the generated video sequences by editing the sketch guidance. By explicitly considering the correspondence between frames and the sketch, our method can achieve high-quality synthetic results compared with image synthesis methods. Our results show that our system generalizes well to different movie frames, achieving better results than existing solutions.
摘要:我们提出了一个新颖的框架,在跟随用户草图所引导的动画运动的同时,从两个输入关键帧中提取颜色信息来生成卡通视频。该方法的核心思想是估计草图与卡通视频帧之间的稠密跨域对应关系,随后由带遮挡估计的混合模块合成由草图引导的中间帧。之后,将输入帧和带有已建立对应关系的合成帧送入任意时刻帧插值流水线,以生成并细化更多中间帧。最后,使用视频后处理方法进一步改善结果。与常见的帧插值方法相比,我们的方法能够处理运动幅度较大的帧,并且允许用户通过编辑草图引导来灵活控制生成的视频序列。通过显式考虑帧与草图之间的对应关系,我们的方法能取得优于图像合成方法的高质量合成结果。结果表明,我们的系统能够很好地泛化到不同的电影帧,取得了优于现有方案的效果。
10. Vision Meets Wireless Positioning: Effective Person Re-identification with Recurrent Context Propagation [PDF] 返回目录
Yiheng Liu, Wengang Zhou, Mao Xi, Sanjing Shen, Houqiang Li
Abstract: Existing person re-identification methods rely on the visual sensor to capture the pedestrians. The image or video data from the visual sensor inevitably suffer from occlusion and dramatic variations of pedestrian postures, which degrade the re-identification performance and further limit its application to the open environment. On the other hand, for most people, one of the most important carry-on items is the mobile phone, which can be sensed by WiFi and cellular networks in the form of a wireless positioning signal. Such a signal is robust to pedestrian occlusion and visual appearance change, but suffers from some positioning error. In this work, we approach person re-identification with the sensing data from both vision and wireless positioning. To take advantage of such cross-modality cues, we propose a novel recurrent context propagation module that enables information to propagate between visual data and wireless positioning data and finally improves the matching accuracy. To evaluate our approach, we contribute a new Wireless Positioning Person Re-identification (WP-ReID) dataset. Extensive experiments are conducted and demonstrate the effectiveness of the proposed algorithm. Code will be released at this https URL.
摘要:现有的行人重识别方法依赖视觉传感器来捕捉行人。来自视觉传感器的图像或视频数据不可避免地受到遮挡和行人姿态剧烈变化的影响,这会降低重识别性能,并进一步限制其在开放环境中的应用。另一方面,对大多数人而言,最重要的随身物品之一是手机,它可以以无线定位信号的形式被 WiFi 和蜂窝网络感知。这种信号对行人遮挡和外观变化具有鲁棒性,但存在一定的定位误差。在这项工作中,我们同时利用视觉和无线定位的感知数据来进行行人重识别。为了利用这种跨模态线索,我们提出了一个新颖的循环上下文传播模块,使信息能够在视觉数据与无线定位数据之间传播,最终提高匹配精度。为了评估我们的方法,我们贡献了一个新的无线定位行人重识别(WP-ReID)数据集。大量实验验证了所提算法的有效性。代码将在 this https URL 发布。
11. T-GD: Transferable GAN-generated Images Detection Framework [PDF] 返回目录
Hyeonseong Jeon, Youngoh Bang, Junyaup Kim, Simon S. Woo
Abstract: Recent advancements in Generative Adversarial Networks (GANs) enable the generation of highly realistic images, raising concerns about their misuse for malicious purposes. Detecting these GAN-generated images (GAN-images) becomes increasingly challenging due to the significant reduction of underlying artifacts and specific patterns. The absence of such traces can hinder detection algorithms from identifying GAN-images and transferring knowledge to identify other types of GAN-images as well. In this work, we present the Transferable GAN-images Detection framework T-GD, a robust transferable framework for an effective detection of GAN-images. T-GD is composed of a teacher and a student model that can iteratively teach and evaluate each other to improve the detection performance. First, we train the teacher model on the source dataset and use it as a starting point for learning the target dataset. To train the student model, we inject noise by mixing up the source and target datasets, while constraining the weight variation to preserve the starting point. Our approach is a self-training method, but distinguishes itself from prior approaches by focusing on improving the transferability of GAN-image detection. T-GD achieves high performance on the source dataset by overcoming catastrophic forgetting and effectively detecting state-of-the-art GAN-images with only a small volume of data without any metadata information.
摘要:生成对抗网络(GAN)的最新进展使得生成高度逼真的图像成为可能,也引发了人们对其被恶意滥用的担忧。由于底层伪影和特定模式的显著减少,检测这些 GAN 生成的图像(GAN 图像)变得越来越困难。缺少这些痕迹会妨碍检测算法识别 GAN 图像,也妨碍其迁移知识去识别其他类型的 GAN 图像。在这项工作中,我们提出了可迁移的 GAN 图像检测框架 T-GD,一个用于有效检测 GAN 图像的鲁棒可迁移框架。T-GD 由教师模型和学生模型组成,二者可以迭代地相互教学和评估,以提升检测性能。首先,我们在源数据集上训练教师模型,并以其作为学习目标数据集的起点。为了训练学生模型,我们通过混合源数据集和目标数据集来注入噪声,同时约束权重变化以保持起点。我们的方法属于自训练方法,但与以往方法的区别在于专注于提高 GAN 图像检测的可迁移性。T-GD 通过克服灾难性遗忘,在源数据集上取得了高性能,并且仅用少量数据、无需任何元数据信息就能有效检测最先进的 GAN 图像。
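Two of T-GD's ingredients — injecting noise by mixing source and target batches, and constraining the weight variation to preserve the starting point — can be sketched as follows; the Beta parameter and penalty coefficient are assumptions, not the paper's settings.

```python
import torch

def mixup_batches(x_src, x_tgt, alpha=0.2):
    """Noise injection: blend a source batch with a target batch of the same
    shape; labels would be blended with the same lam."""
    lam = float(torch.distributions.Beta(alpha, alpha).sample())
    return lam * x_src + (1.0 - lam) * x_tgt, lam

def weight_anchor_penalty(student, start_state, coef=1e-4):
    """L2 penalty constraining the student's weights to stay near the starting
    point; start_state is a detached copy of the initial (teacher) parameters,
    e.g. {k: v.detach().clone() for k, v in student.named_parameters()}."""
    penalty = 0.0
    for name, p in student.named_parameters():
        penalty = penalty + ((p - start_state[name]) ** 2).sum()
    return coef * penalty
```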
12. Improved Adaptive Type-2 Fuzzy Filter with Exclusively Two Fuzzy Membership Function for Filtering Salt and Pepper Noise [PDF] 返回目录
Vikas Singh, Pooja Agrawal, Teena Sharma, Nishchal K. Verma
Abstract: Image denoising is one of the preliminary steps in image processing methods in which the presence of noise can deteriorate the image quality. To overcome this limitation, in this paper an improved two-stage fuzzy filter is proposed for filtering salt and pepper noise from images. In the first stage, the pixels in the image are categorized as good or noisy based on adaptive thresholding using type-2 fuzzy logic with exclusively two different membership functions in the filter window. In the second stage, the noisy pixels are denoised using modified ordinary fuzzy logic in the respective filter window. The proposed filter is validated on standard images with various noise levels. The proposed filter removes the noise and preserves useful image characteristics, i.e., edges and corners, at higher noise levels. The performance of the proposed filter is compared with various state-of-the-art methods in terms of peak signal-to-noise ratio and computation time. To show the effectiveness of the filter, statistical tests, i.e., the Friedman test and the Bonferroni-Dunn (BD) test, are also carried out, which clearly ascertain that the proposed filter outperforms various other filtering approaches.
摘要:图像去噪是图像处理方法中的预备步骤之一,噪声的存在会使图像质量下降。为克服这一问题,本文提出一种改进的两阶段模糊滤波器,用于滤除图像中的椒盐噪声。在第一阶段,基于自适应阈值,使用在滤波窗口内仅有两个不同隶属函数的二型模糊逻辑,将图像中的像素划分为正常像素或噪声像素。在第二阶段,在相应的滤波窗口内使用改进的普通模糊逻辑对噪声像素进行去噪。所提滤波器在具有不同噪声水平的标准图像上进行了验证:它既能去除噪声,又能在较高噪声水平下保留边缘和角点等有用的图像特征。在峰值信噪比和计算时间方面,所提滤波器与多种最先进的方法进行了性能比较。为展示其有效性,还进行了统计检验,即 Friedman 检验和 Bonferroni-Dunn(BD)检验,结果清楚地表明所提滤波器优于各种对比滤波方法。
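A minimal two-stage sketch in the spirit of the abstract: stage one flags likely salt-and-pepper pixels (values sitting at the extremes of their filter window), stage two replaces only those pixels using their good neighbors. Crisp thresholding stands in here for the type-2 fuzzy membership step, which is considerably more involved.

```python
import numpy as np

def two_stage_sp_filter(img, win=3):
    """img: 2-D uint8 grayscale image; returns a denoised copy.
    Unvectorized for clarity."""
    out = img.astype(np.float64)
    pad = win // 2
    padded = np.pad(out, pad, mode='reflect')
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            window = padded[i:i + win, j:j + win]
            # Stage 1: flag pixels sitting at the extremes of their window.
            if img[i, j] == window.min() or img[i, j] == window.max():
                good = window[(window > window.min()) & (window < window.max())]
                # Stage 2: replace only flagged pixels with a robust estimate.
                out[i, j] = np.median(good) if good.size else np.median(window)
    return out.astype(np.uint8)
```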
13. Fighting Deepfake by Exposing the Convolutional Traces on Images [PDF] 返回目录
Luca Guarnera, Oliver Giudice, Sebastiano Battiato
Abstract: Advances in Artificial Intelligence and Image Processing are changing the way people interact with digital images and video. Widespread mobile apps like FACEAPP make use of the most advanced Generative Adversarial Networks (GAN) to produce extreme transformations on human face photos such as gender swap, aging, etc. The results are utterly realistic and extremely easy to exploit even for non-experienced users. This kind of media object took the name of Deepfake and raised a new challenge in the multimedia forensics field: the Deepfake detection challenge. Indeed, discriminating a Deepfake from a real image could be a difficult task even for human eyes, but recent works are trying to apply the same technology used for generating images to discriminate them, with preliminary good results but many limitations: the employed Convolutional Neural Networks are not so robust, prove to be specific to the context, and tend to extract semantics from images. In this paper, a new approach aimed at extracting a Deepfake fingerprint from images is proposed. The method is based on the Expectation-Maximization algorithm trained to detect and extract a fingerprint that represents the Convolutional Traces (CT) left by GANs during image generation. The CT demonstrates high discriminative power, achieving better results than the state of the art in the Deepfake detection task while also proving to be robust to different attacks. Achieving an overall classification accuracy of over 98%, considering Deepfakes from 10 different GAN architectures not only involving images of faces, the CT proves to be reliable and independent of image semantics. Finally, tests carried out on Deepfakes generated by FACEAPP, achieving 93% accuracy in the fake detection task, demonstrated the effectiveness of the proposed technique in a real-case scenario.
摘要:人工智能和图像处理的进步正在改变人们与数字图像和视频交互的方式。像 FACEAPP 这样广泛流行的移动应用利用最先进的生成对抗网络(GAN)对人脸照片进行性别互换、老化等极端变换,其结果极为逼真,即使对非专业用户也极易被利用。这类媒体对象被称为 Deepfake,并给多媒体取证领域带来了新的挑战:Deepfake 检测挑战。事实上,即使对人眼而言,从真实图像中分辨 Deepfake 也可能很困难;近期的工作尝试使用与生成图像相同的技术来鉴别它们,取得了初步的良好结果,但也存在诸多局限:所采用的卷积神经网络不够鲁棒,往往局限于特定场景,并倾向于从图像中提取语义。本文提出了一种旨在从图像中提取 Deepfake 指纹的新方法。该方法基于期望最大化(EM)算法,训练其检测并提取表示 GAN 在图像生成过程中留下的卷积痕迹(CT)的指纹。CT 具有很强的判别能力,在 Deepfake 检测任务上取得了优于最先进方法的结果,并且对不同攻击具有鲁棒性。在考虑来自 10 种不同 GAN 架构(不仅限于人脸图像)的 Deepfake 时,总体分类准确率超过 98%,表明 CT 是可靠的,且不依赖于图像语义。最后,在由 FACEAPP 生成的 Deepfake 上进行的测试在伪造检测任务中取得了 93% 的准确率,证明了所提技术在真实场景中的有效性。
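The detector is built on an Expectation-Maximization loop. As a self-contained illustration of EM itself (not the paper's convolutional-trace model), here is EM for a two-component 1-D Gaussian mixture, e.g. fitted over filter residuals.

```python
import numpy as np

def em_gmm_1d(x, iters=50):
    """EM for a two-component 1-D Gaussian mixture over samples x (shape (N,))."""
    mu = np.array([x.min(), x.max()], dtype=np.float64)   # spread the initial means
    var = np.full(2, x.var() + 1e-6)
    w = np.array([0.5, 0.5])                              # mixture weights
    for _ in range(iters):
        # E-step: responsibility of each component for each sample.
        lik = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means and variances.
        n_k = resp.sum(axis=0)
        w = n_k / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
    return w, mu, var
```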
14. Adversarial Examples on Object Recognition: A Comprehensive Survey [PDF] 返回目录
Alex Serban, Erik Poll, Joost Visser
Abstract: Deep neural networks are at the forefront of machine learning research. However, despite achieving impressive performance on complex tasks, they can be very sensitive: Small perturbations of inputs can be sufficient to induce incorrect behavior. Such perturbations, called adversarial examples, are intentionally designed to test the network's sensitivity to distribution drifts. Given their surprisingly small size, a wide body of literature conjectures on their existence and how this phenomenon can be mitigated. In this article we discuss the impact of adversarial examples on security, safety, and robustness of neural networks. We start by introducing the hypotheses behind their existence, the methods used to construct or protect against them, and the capacity to transfer adversarial examples between different machine learning models. Altogether, the goal is to provide a comprehensive and self-contained survey of this growing field of research.
摘要:深度神经网络处于机器学习研究的最前沿。然而,尽管在复杂任务上取得了令人印象深刻的性能,它们可能非常敏感:输入的微小扰动就足以诱发错误行为。这类扰动被称为对抗样本,是有意设计用来测试网络对分布漂移的敏感性的。鉴于其规模小得惊人,大量文献对其存在性及如何缓解这一现象进行了猜想。在本文中,我们讨论对抗样本对神经网络的安全性、可靠性和鲁棒性的影响。我们首先介绍关于其存在性的假说、构造对抗样本或防御对抗样本的方法,以及对抗样本在不同机器学习模型之间的可迁移性。总体而言,本文的目标是对这一日益发展的研究领域提供一份全面而自成体系的综述。
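As a concrete instance of the perturbations this survey studies, the classic fast gradient sign method (FGSM) crafts an adversarial example in a single gradient step; the epsilon below is a typical but arbitrary budget.

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=8 / 255):
    """One-step FGSM: perturb x in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    x_adv = x + eps * x.grad.sign()        # small, worst-case-signed perturbation
    return x_adv.clamp(0, 1).detach()      # stay in valid image range
```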
15. Driving among Flatmobiles: Bird-Eye-View occupancy grids from a monocular camera for holistic trajectory planning [PDF] 返回目录
Abdelhak Loukkal, Yves Grandvalet, Tom Drummond, You Li
Abstract: Camera-based end-to-end driving neural networks bring the promise of a low-cost system that maps camera images to driving control commands. These networks are appealing because they replace laborious hand-engineered building blocks, but their black-box nature makes them difficult to delve into in case of failure. Recent works have shown the importance of using an explicit intermediate representation that has the benefits of increasing both the interpretability and the accuracy of networks' decisions. Nonetheless, these camera-based networks reason in camera view, where scale is not homogeneous and hence not directly suitable for motion forecasting. In this paper, we introduce a novel monocular camera-only holistic end-to-end trajectory planning network with a Bird-Eye-View (BEV) intermediate representation that comes in the form of binary Occupancy Grid Maps (OGMs). To ease the prediction of OGMs in BEV from camera images, we introduce a novel scheme where the OGMs are first predicted as semantic masks in camera view and then warped into BEV using the homography between the two planes. The key element allowing this transformation to be applied to 3D objects such as vehicles consists in predicting solely their footprint in camera view, hence respecting the flat-world hypothesis implied by the homography.
摘要:基于相机的端到端驾驶神经网络有望实现一种低成本系统,将相机图像直接映射为驾驶控制指令。这类网络之所以有吸引力,是因为它们取代了费力的手工设计模块,但其黑盒特性使其在出现故障时难以深入分析。近期的工作表明,使用显式的中间表示具有同时提升网络决策可解释性和准确性的优点。然而,这些基于相机的网络在相机视角下进行推理,而该视角下尺度并不均匀,因此并不直接适用于运动预测。在本文中,我们提出了一种新颖的、仅用单目相机的整体端到端轨迹规划网络,其中间表示为鸟瞰图(BEV)形式的二值占据栅格地图(OGM)。为了便于从相机图像预测 BEV 下的 OGM,我们提出了一个新方案:先在相机视角下将 OGM 预测为语义掩码,再利用两个平面之间的单应矩阵将其变换到 BEV。使这一变换能够应用于车辆等三维物体的关键在于,仅预测其在相机视角下的地面投影(footprint),从而满足单应变换所隐含的平面世界假设。
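For ground-plane pixels, the camera-to-BEV step described above reduces to a single homography warp. A sketch with OpenCV, assuming the 3x3 homography between the image plane and the BEV grid is known (e.g. recovered from extrinsics or four point correspondences):

```python
import cv2
import numpy as np

def mask_to_bev(footprint_mask, H, bev_size=(400, 400)):
    """Warp a camera-view footprint mask (uint8, H x W) into a bird-eye-view
    occupancy grid. H is the 3x3 homography from image to BEV coordinates."""
    bev = cv2.warpPerspective(footprint_mask, H, bev_size, flags=cv2.INTER_NEAREST)
    return (bev > 0).astype(np.uint8)   # binary Occupancy Grid Map

# H can be estimated once from four ground-plane correspondences (illustrative):
# H, _ = cv2.findHomography(pts_image, pts_bev)
```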
16. Cooperative Bi-path Metric for Few-shot Learning [PDF] 返回目录
Zeyuan Wang, Yifan Zhao, Jia Li, Yonghong Tian
Abstract: Given base classes with sufficient labeled samples, the target of few-shot classification is to recognize unlabeled samples of novel classes with only a few labeled samples. Most existing methods only pay attention to the relationship between labeled and unlabeled samples of novel classes, and thus do not make full use of information within the base classes. In this paper, we make two contributions to investigate the few-shot classification problem. First, we report a simple and effective baseline trained on base classes in the way of traditional supervised learning, which can achieve comparable results to the state of the art. Second, based on the baseline, we propose a cooperative bi-path metric for classification, which leverages the correlations between base classes and novel classes to further improve the accuracy. Experiments on two widely used benchmarks show that our method is a simple and effective framework, and a new state of the art is established in the few-shot classification field.
摘要:给定具有充足标注样本的基类,小样本分类的目标是仅凭少量标注样本识别新类别的未标注样本。大多数现有方法只关注新类别中标注样本与未标注样本之间的关系,没有充分利用基类中的信息。在本文中,我们为研究小样本分类问题做出了两项贡献。首先,我们报告了一个以传统监督学习方式在基类上训练的简单而有效的基线,其结果可与最先进方法相媲美。其次,在该基线的基础上,我们提出了一种用于分类的协同双路径度量,利用基类与新类别之间的相关性进一步提高准确率。在两个广泛使用的基准上的实验表明,我们的方法是一个简单而有效的框架,并在小样本分类领域确立了新的最先进水平。
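The baseline half of the paper — training an embedding on base classes, then classifying novel queries by similarity to class prototypes — can be sketched in a few lines; the cosine metric here is one standard choice, not necessarily the paper's exact bi-path metric.

```python
import torch
import torch.nn.functional as F

def prototype_classify(support_feats, support_labels, query_feats, n_way):
    """support_feats: (N, D) embeddings of the few labeled shots;
    support_labels: (N,) ints in [0, n_way); query_feats: (Q, D).
    Returns the predicted class index per query."""
    protos = torch.stack([support_feats[support_labels == c].mean(0)
                          for c in range(n_way)])                 # (n_way, D)
    sims = F.normalize(query_feats, dim=1) @ F.normalize(protos, dim=1).T
    return sims.argmax(dim=1)                                     # cosine nearest prototype
```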
17. Invertible Neural BRDF for Object Inverse Rendering [PDF] 返回目录
Zhe Chen, Shohei Nobuhara, Ko Nishino
Abstract: We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry. The BRDF is expressed with an invertible neural network, namely, normalizing flow, which provides the expressive power of a high-dimensional representation, computational simplicity of a compact analytical model, and physical plausibility of a real-world BRDF. We extract the latent space of real-world reflectance by conditioning this model, which directly results in a strong reflectance prior. We refer to this model as the invertible neural BRDF model (iBRDF). We also devise a deep illumination prior by leveraging the structural bias of deep neural networks. By integrating this novel BRDF model and reflectance and illumination priors in a MAP estimation formulation, we show that this joint estimation can be computed efficiently with stochastic gradient descent. We experimentally validate the accuracy of the invertible neural BRDF model on a large number of measured data and demonstrate its use in object inverse rendering on a number of synthetic and real images. The results show new ways in which deep neural networks can help solve challenging radiometric inverse problems.
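The MAP formulation reduces to gradient-based optimization of a likelihood term plus the learned priors. A toy sketch under stated assumptions: `render` stands in for the differentiable renderer and `log_prior` for the learned iBRDF/illumination priors, both of which are far richer in the paper.

```python
import torch

def map_estimate(render, observed, log_prior, params, steps=200, lr=1e-2):
    # Maximize log-likelihood + log-prior, i.e. minimize the negative MAP
    # objective, with plain stochastic gradient descent.
    opt = torch.optim.SGD(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nll = ((render(*params) - observed) ** 2).mean()  # Gaussian likelihood
        loss = nll - log_prior(*params)
        loss.backward()
        opt.step()
    return params

theta = [torch.randn(8, requires_grad=True)]          # reflectance/illumination params
map_estimate(lambda p: p.sum() * torch.ones(4, 4),    # placeholder renderer
             torch.zeros(4, 4),                       # observed image
             lambda p: -(p ** 2).sum(),               # placeholder Gaussian prior
             theta)
print(theta[0])
```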
18. Road Segmentation for Remote Sensing Images using Adversarial Spatial Pyramid Networks [PDF] 返回目录
Pourya Shamsolmoali, Masoumeh Zareapoor, Huiyu Zhou, Ruili Wang, Jie Yang
Abstract: Road extraction from remote sensing images is of great importance for a wide range of applications. Because of complex backgrounds and high density, most existing methods fail to accurately extract a road network that is correct and complete. Moreover, they suffer from either insufficient training data or the high cost of manual annotation. To address these problems, we introduce a new model that applies structured domain adaptation for synthetic image generation and road segmentation. We incorporate a feature pyramid network into generative adversarial networks to minimize the difference between the source and target domains. A generator is trained to produce high-quality synthetic images, and the discriminator attempts to distinguish them. We also propose a feature pyramid network that improves the performance of the proposed model by extracting effective features from all layers of the network to describe objects at different scales. Indeed, a novel scale-wise architecture is introduced to learn from the multi-level feature maps and improve the semantics of the features. For optimization, the model is trained with a joint reconstruction loss function, which minimizes the difference between the fake images and the real ones. A wide range of experiments on three datasets proves the superior performance of the proposed approach in terms of accuracy and efficiency. In particular, our model achieves a state-of-the-art 78.86 IoU on the Massachusetts dataset with 14.89M parameters and 86.78B FLOPs, with 4x fewer FLOPs but higher accuracy (+3.47% IoU) than the top performer among the state-of-the-art approaches used in the evaluation.
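The training signal combines an adversarial term with a reconstruction term. A minimal sketch of that joint objective, with tiny stand-in networks and an assumed reconstruction weight `lambda_rec`:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())        # toy generator
D = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten())            # toy discriminator
bce, lambda_rec = nn.BCEWithLogitsLoss(), 10.0

real = torch.rand(4, 3, 64, 64)
fake = G(torch.rand(4, 3, 64, 64))

# Discriminator tries to tell real from synthetic images ...
d_loss = bce(D(real), torch.ones(4, 1)) + bce(D(fake.detach()), torch.zeros(4, 1))
# ... while the generator fools it and stays close to the real images.
g_loss = bce(D(fake), torch.ones(4, 1)) + lambda_rec * (fake - real).abs().mean()
print(float(d_loss), float(g_loss))
```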
19. SynDistNet: Self-Supervised Monocular Fisheye Camera Distance Estimation Synergized with Semantic Segmentation for Autonomous Driving [PDF] 返回目录
Varun Ravi Kumar, Marvin Klingner, Senthil Yogamani, Stefan Milz, Tim Fingscheidt, Patrick Maeder
Abstract: State-of-the-art self-supervised learning approaches for monocular depth estimation usually suffer from scale ambiguity. They do not generalize well when applied to distance estimation for complex projection models such as fisheye and omnidirectional cameras. In this paper, we introduce a novel multi-task learning strategy to improve self-supervised monocular distance estimation on fisheye and pinhole camera images. Our contribution to this work is threefold: Firstly, we introduce a novel distance estimation network architecture using a self-attention based encoder coupled with robust semantic feature guidance to the decoder that can be trained in a one-stage fashion. Secondly, we integrate a generalized robust loss function, which improves performance significantly while removing the need for hyperparameter tuning with the reprojection loss. Finally, we reduce the artifacts caused by dynamic objects violating the static-world assumption by using a semantic masking strategy. We significantly improve upon the RMSE of previous work on fisheye, with a 25% reduction in RMSE. As there is limited work on fisheye cameras, we evaluated the proposed method on KITTI using a pinhole model. We achieved state-of-the-art performance among self-supervised methods without requiring an external scale estimation.
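The "generalized robust loss function" is plausibly Barron's general and adaptive robust loss; a sketch under that assumption (valid for alpha outside {0, 2}, whose limits recover the Cauchy and L2 losses):

```python
import numpy as np

def general_robust_loss(x, alpha=1.0, c=1.0):
    # Barron, "A General and Adaptive Robust Loss Function" (CVPR 2019):
    # rho(x) = (|a-2|/a) * (((x/c)^2 / |a-2| + 1)^(a/2) - 1)
    b = abs(alpha - 2.0)
    return (b / alpha) * (((x / c) ** 2 / b + 1.0) ** (alpha / 2.0) - 1.0)

residuals = np.linspace(-3.0, 3.0, 7)
print(general_robust_loss(residuals, alpha=1.0))   # alpha=1 gives pseudo-Huber
```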
20. MHSA-Net: Multi-Head Self-Attention Network for Occluded Person Re-Identification [PDF] 返回目录
Hongchen Tan, Xiuping Liu, Shengjing Tian, Baocai Yin, Xin Li
Abstract: This paper presents a novel person re-identification model, named Multi-Head Self-Attention Network (MHSA-Net), to prune unimportant information and capture key local information from person images. MHSA-Net contains two main novel components: the Multi-Head Self-Attention Branch (MHSAB) and the Attention Competition Mechanism (ACM). The MHSAB adaptively captures key local person information, and then produces effective, diverse embeddings of an image for person matching. The ACM further helps filter out attention noise and non-key information. Through extensive ablation studies, we verified that the Multi-Head Self-Attention Branch and the Attention Competition Mechanism both contribute to the performance improvement of MHSA-Net. Our MHSA-Net achieves state-of-the-art performance, especially on images with occlusions. We have released our models (and will release the source code after the paper is accepted) at this https URL.
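A minimal sketch of the multi-head self-attention building block behind the MHSAB, using PyTorch's stock layer; the branch's real structure, token layout and dimensions are assumptions here:

```python
import torch
import torch.nn as nn

feat = torch.randn(2, 49, 256)                   # (batch, 7x7 spatial tokens, channels)
mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
attended, attn = mhsa(feat, feat, feat)          # self-attention: Q = K = V
print(attended.shape, attn.shape)                # (2, 49, 256), (2, 49, 49)
```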
21. Incomplete Descriptor Mining with Elastic Loss for Person Re-Identification [PDF] 返回目录
Hongchen Tan, Yuhao Bian, Huasheng Wang, Xiuping Liu, Baocai Yin
Abstract: In this paper, we propose a novel person Re-ID model, the Consecutive Batch DropBlock Network (CBDB-Net), to capture an attentive and robust person descriptor. The CBDB-Net contains two novel modules: the Consecutive Batch DropBlock Module (CBDBM) and the Elastic Loss. The Consecutive Batch DropBlock Module (CBDBM) first partitions the feature maps uniformly. Then the CBDBM independently and consecutively drops each patch from top to bottom on the feature maps, outputting multiple incomplete features that push the model to capture a robust person descriptor. In the Elastic Loss, we design a novel weight-control term to help the deep model adaptively balance hard sample pairs and easy sample pairs during the whole training process. Through an extensive set of ablation studies, we verify that the Consecutive Batch DropBlock Module (CBDBM) and the Elastic Loss each contribute to the performance boost of CBDB-Net. We demonstrate that our CBDB-Net achieves competitive performance on three generic person Re-ID datasets (the Market-1501, DukeMTMC-Re-ID, and CUHK03 datasets), three occluded person Re-ID datasets (the Occluded DukeMTMC, Partial-REID, and Partial iLIDS datasets), and another image retrieval dataset (the In-Shop Clothes Retrieval dataset).
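A hedged sketch of the consecutive dropping idea: one horizontal stripe of the feature map is zeroed at a time, from top to bottom, producing several incomplete features per input. The stripe count and the uniform partition granularity are assumptions.

```python
import torch

def consecutive_stripe_drop(feat, num_stripes=4):
    b, c, h, w = feat.shape
    step = h // num_stripes
    outs = []
    for i in range(num_stripes):
        masked = feat.clone()
        masked[:, :, i * step:(i + 1) * step, :] = 0.0  # drop the i-th patch
        outs.append(masked)
    return outs                                         # multiple incomplete features

feats = consecutive_stripe_drop(torch.randn(8, 2048, 24, 8))
print(len(feats), feats[0].shape)
```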
22. HAPI: Hardware-Aware Progressive Inference [PDF] 返回目录
Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, Nicholas D. Lane
Abstract: Convolutional neural networks (CNNs) have recently become the state-of-the-art in a diversity of AI tasks. Despite their popularity, CNN inference still comes at a high computational cost. A growing body of work aims to alleviate this by exploiting the difference in the classification difficulty among samples and early-exiting at different stages of the network. Nevertheless, existing studies on early exiting have primarily focused on the training scheme, without considering the use-case requirements or the deployment platform. This work presents HAPI, a novel methodology for generating high-performance early-exit networks by co-optimising the placement of intermediate exits together with the early-exit strategy at inference time. Furthermore, we propose an efficient design space exploration algorithm which enables the faster traversal of a large number of alternative architectures and generates the highest-performing design, tailored to the use-case requirements and target hardware. Quantitative evaluation shows that our system consistently outperforms alternative search mechanisms and state-of-the-art early-exit schemes across various latency budgets. Moreover, it pushes further the performance of highly optimised hand-crafted early-exit CNNs, delivering up to 5.11x speedup over lightweight models on imposed latency-driven SLAs for embedded devices.
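To make the early-exit mechanism concrete, here is a toy progressive-inference network (not HAPI's searched design): a sample leaves at the first exit whose softmax confidence clears a threshold, otherwise it continues to the next stage.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEarlyExitNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.exit1 = nn.Linear(16, num_classes)   # cheap intermediate exit
        self.exit2 = nn.Linear(32, num_classes)   # final exit

    @staticmethod
    def head(fmap, classifier):
        return classifier(F.adaptive_avg_pool2d(fmap, 1).flatten(1))

    def forward(self, x, threshold=0.8):
        f1 = self.stage1(x)
        logits1 = self.head(f1, self.exit1)
        conf = F.softmax(logits1, dim=1).max(dim=1).values
        if bool((conf >= threshold).all()):       # batch-level early exit for brevity
            return logits1, "exit1"
        return self.head(self.stage2(f1), self.exit2), "exit2"

net = TinyEarlyExitNet()
logits, taken = net(torch.randn(4, 3, 32, 32))
print(logits.shape, taken)
```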
23. 2nd Place Scheme on Action Recognition Track of ECCV 2020 VIPriors Challenges: An Efficient Optical Flow Stream Guided Framework [PDF] 返回目录
Haoyu Chen, Zitong Yu, Xin Liu, Wei Peng, Yoon Lee, Guoying Zhao
Abstract: To address the problem of training on small datasets for action recognition tasks, most prior works either rely on a large number of training samples or require pre-trained models transferred from other large datasets to tackle overfitting problems. However, this limits such research to organizations with strong computational resources. In this work, we propose a data-efficient framework that can train the model from scratch on small datasets while achieving promising results. Specifically, by introducing a 3D central difference convolution operation, we propose a novel C3D neural network-based two-stream (Rank Pooling RGB and Optical Flow) framework for the task. The method is validated on the action recognition track of the ECCV 2020 VIPriors challenges and won 2nd place (88.31%). This proves that our method can achieve a promising result even without a model pre-trained on large-scale datasets. The code will be released soon.
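The 3D central difference convolution admits an efficient decomposition: a vanilla 3D convolution minus a theta-weighted term computed from the kernel's spatial sum. A sketch under that standard decomposition; theta = 0.7 is a common but assumed default.

```python
import torch
import torch.nn.functional as F

def cdc3d(x, weight, theta=0.7):
    out_vanilla = F.conv3d(x, weight, padding=1)            # ordinary 3D conv
    kernel_sum = weight.sum(dim=(2, 3, 4), keepdim=True)    # collapse T,H,W taps
    out_center = F.conv3d(x, kernel_sum)                    # acts like a 1x1x1 conv
    return out_vanilla - theta * out_center                 # central difference term

x = torch.randn(1, 3, 8, 32, 32)     # (B, C, T, H, W) clip
w = torch.randn(16, 3, 3, 3, 3)      # 16 output channels, 3x3x3 kernels
print(cdc3d(x, w).shape)             # (1, 16, 8, 32, 32)
```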
24. Deep Reinforcement Learning with Label Embedding Reward for Supervised Image Hashing [PDF] 返回目录
Zhenzhen Wang, Weixiang Hong, Junsong Yuan
Abstract: Deep hashing has shown promising results in image retrieval and recognition. Despite its success, most existing deep hashing approaches are rather similar: either a multi-layer perceptron or a CNN is applied to extract image features, followed by different binarization activation functions such as sigmoid, tanh or an autoencoder to generate binary codes. In this work, we introduce a novel decision-making approach for deep supervised hashing. We formulate the hashing problem as travelling across the vertices of the binary code space, and learn a deep Q-network with a novel label-embedding reward defined by Bose-Chaudhuri-Hocquenghem (BCH) codes to explore the best path. Extensive experiments and analysis on the CIFAR-10 and NUS-WIDE datasets show that our approach outperforms state-of-the-art supervised hashing methods under various code lengths.
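A hedged sketch of the reward idea only: the agent's partially built binary code is scored by Hamming similarity to the label's error-correcting codeword. The `codebook` below is a random stand-in for real BCH codewords, which would be generated with a coding-theory library.

```python
import numpy as np

def hamming_reward(code, label, codebook):
    target = codebook[label]
    return 1.0 - np.mean(code != target)          # 1 = exact codeword match

rng = np.random.default_rng(0)
codebook = rng.integers(0, 2, size=(10, 48))      # 10 classes, 48-bit codewords (stand-in)
code = rng.integers(0, 2, size=48)                # binary code produced by the agent
print(hamming_reward(code, label=3, codebook=codebook))
```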
25. Unsupervised Deep-Learning Based Deformable Image Registration: A Bayesian Framework [PDF] 返回目录
Samah Khawaled, Moti Freiman
Abstract: Unsupervised deep-learning (DL) models were recently proposed for deformable image registration tasks. In such models, a neural network is trained to predict the best deformation field by minimizing some dissimilarity function between the moving and the target images. After training on a dataset without reference deformation fields available, such a model can be used to rapidly predict the deformation field between newly seen moving and target images. Currently, the training process effectively provides a point estimate of the network weights rather than characterizing their entire posterior distribution. This may result in potential over-fitting, which may yield sub-optimal results at the inference phase, especially for small-size datasets, frequently present in the medical imaging domain. We introduce a fully Bayesian framework for unsupervised DL-based deformable image registration. Our method provides a principled way to characterize the true posterior distribution, thus avoiding potential over-fitting. We used stochastic gradient Langevin dynamics (SGLD) to conduct the posterior sampling, which is both theoretically well-founded and computationally efficient. We demonstrated the added value of our Bayesian unsupervised DL-based registration framework on the MNIST and brain MRI (MGH10) datasets in comparison to the VoxelMorph unsupervised DL-based image registration framework. Our experiments show that our approach provided better estimates of the deformation field, with improved mean squared error (0.0063 vs. 0.0065) and Dice coefficient (0.73 vs. 0.71) for the MNIST and MGH10 datasets, respectively. Further, our approach provides an estimate of the uncertainty in the deformation field by characterizing the true posterior distribution.
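The SGLD update is a gradient step on the negative log-posterior plus Gaussian noise whose variance equals the step size, so the iterates (after burn-in) sample the posterior. A toy sketch; the network and loss are placeholders.

```python
import torch

def sgld_step(params, neg_log_posterior, lr=1e-4):
    loss = neg_log_posterior(params)
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            # theta <- theta - (lr/2) * grad U(theta) + N(0, lr)
            p.add_(-0.5 * lr * g + torch.randn_like(p) * lr ** 0.5)
    return float(loss)

theta = [torch.randn(5, requires_grad=True)]
samples = []
for t in range(200):
    sgld_step(theta, lambda ps: (ps[0] ** 2).sum())   # toy negative log-posterior
    if t >= 100:                                      # discard burn-in
        samples.append(theta[0].detach().clone())
print(torch.stack(samples).std(0))                    # spread of posterior samples
```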
26. Projected-point-based Segmentation: A New Paradigm for LiDAR Point Cloud Segmentation [PDF] 返回目录
Shijie Li, Yun Liu, Juergen Gall
Abstract: Most point-based semantic segmentation methods are designed for indoor scenarios, but many applications such as autonomous driving vehicles require accurate segmentation for outdoor scenarios. For this goal, light detection and ranging (LiDAR) sensors are often used to collect outdoor environmental data. The problem is that directly applying previous point-based segmentation methods to LiDAR point clouds usually leads to unsatisfactory results due to the domain gap between indoor and outdoor scenarios. To address such a domain gap, we propose a new paradigm, namely projected-point-based methods, to transform point-based methods to a suitable form for LiDAR point cloud segmentation by utilizing the characteristics of LiDAR point clouds. Specifically, we utilize the inherent ordered information of LiDAR points for point sampling and grouping, thus reducing unnecessary computation. All computations are carried out on the projected image, and there are only pointwise convolutions and matrix multiplication in projected-point-based methods. We compare projected-point-based methods with point-based methods on the challenging SemanticKITTI dataset, and experimental results demonstrate that projected-point-based methods achieve better accuracy than all baselines more efficiently. Even with a simple baseline architecture, projected-point-based methods perform favorably against previous state-of-the-art methods. The code will be released upon paper acceptance.
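The "inherent ordered information" of a spinning LiDAR is usually exposed by projecting the cloud onto a 2D range image. A sketch of that spherical projection; the vertical field of view matches a typical 64-beam sensor but is an assumption here.

```python
import numpy as np

def spherical_projection(points, H=64, W=1024, fov_up=3.0, fov_down=-25.0):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points[:, :3], axis=1) + 1e-8
    yaw, pitch = np.arctan2(y, x), np.arcsin(z / depth)
    fu, fd = np.radians(fov_up), np.radians(fov_down)
    u = ((1.0 - (pitch - fd) / (fu - fd)) * (H - 1)).round().astype(int)  # row
    v = ((0.5 * (1.0 + yaw / np.pi)) * (W - 1)).round().astype(int)      # column
    img = np.zeros((H, W), dtype=np.float32)
    img[np.clip(u, 0, H - 1), np.clip(v, 0, W - 1)] = depth
    return img

cloud = np.random.randn(100000, 3) * 10.0       # stand-in LiDAR sweep
print(spherical_projection(cloud).shape)        # (64, 1024) range image
```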
27. Measuring shape relations using r-parallel sets [PDF] 返回目录
Hans JT Stephensen, Anne Marie Svane, Carlos Benitez, Steven A. Goldman, Jon Sporring
Abstract: Geometrical measurements of biological objects form the basis of many quantitative analyses. Hausdorff measures such as the volume and the area of objects are simple and popular descriptors of individual objects. However, for most biological processes, the interaction between objects cannot be ignored, and the shape and function of neighboring objects are mutually influential. In this paper, we present a theory of the geometrical interaction between objects based on the theory of spatial point processes. Our theory is based on the relation between two objects: a reference and an observed object. We generate the r-parallel sets of the reference object, calculate the intersection between the r-parallel sets and the observed object, and define measures on these intersections. Our measures are simple like the volume and area of an object, but describe further details about the shape of individual objects and their pairwise geometrical relations. Finally, we propose summary statistics for collections of shapes and their interactions. We evaluate these measures on a publicly available FIB-SEM 3D dataset of an adult rodent.
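On a voxel grid, the r-parallel set of the reference object is simply everything within distance r of it, which a Euclidean distance transform gives directly. A minimal sketch of measuring intersection volumes with the observed object:

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def r_parallel_intersection_volumes(reference, observed, radii):
    dist = distance_transform_edt(~reference)    # voxel distance to the reference
    return [int(np.logical_and(dist <= r, observed).sum()) for r in radii]

ref = np.zeros((64, 64, 64), dtype=bool)
ref[30:34, 30:34, 30:34] = True                  # reference object
obs = np.zeros_like(ref)
obs[20:44, 20:44, 20:44] = True                  # observed object
print(r_parallel_intersection_volumes(ref, obs, radii=[1, 2, 4, 8]))
```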
28. Lane Detection Model Based on Spatio-Temporal Network with Double ConvGRUs [PDF] 返回目录
Jiyong Zhang, Tao Deng, Fei Yan, Wenbo Liu
Abstract: Lane detection is one of the indispensable and key elements of self-driving environmental perception. Many lane detection models have been proposed to solve lane detection under challenging conditions, including intersection merging and splitting, curves, boundaries, occlusions and combinations of scene types. Nevertheless, lane detection will remain an open problem for some time to come. The ability to cope well with those challenging scenes greatly impacts the application of lane detection in advanced driver assistance systems (ADASs). In this paper, a spatio-temporal network with double Convolutional Gated Recurrent Units (ConvGRUs) is proposed to address lane detection in challenging scenes. Both ConvGRUs have the same structure but different locations and functions in our network. One is used to extract information about the most likely low-level features of lane markings. The extracted features are input into the next layer of the end-to-end network after being concatenated with the outputs of some blocks. The other takes several consecutive frames as its input to process the spatio-temporal driving information. Extensive experiments on the large-scale TuSimple lane marking challenge dataset and the Unsupervised LLAMAS dataset demonstrate that the proposed model can effectively detect lanes in challenging driving scenes. Our model outperforms state-of-the-art lane detection models.
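A minimal ConvGRU cell for reference: the GRU gates are computed by convolutions so the hidden state remains a spatial feature map. Kernel size and channel counts are illustrative; the paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)  # z, r
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)       # h~

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_tilde                                  # GRU update

cell = ConvGRUCell(in_ch=64, hid_ch=32)
h = torch.zeros(2, 32, 44, 80)
for frame_feat in torch.randn(5, 2, 64, 44, 80):   # 5 consecutive frames
    h = cell(frame_feat, h)
print(h.shape)
```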
29. Automatic Failure Recovery and Re-Initialization for Online UAV Tracking with Joint Scale and Aspect Ratio Optimization [PDF] 返回目录
Fangqiang Ding, Changhong Fu, Yiming Li, Jin Jin, Chen Feng
Abstract: Current unmanned aerial vehicle (UAV) visual tracking algorithms are primarily limited with respect to: (i) the kind of size variation they can deal with, and (ii) the implementation speed, which hardly meets the real-time requirement. In this work, a real-time UAV tracking algorithm with a powerful size estimation ability is proposed. Specifically, the overall tracking task is allocated to two 2D filters: (i) a translation filter for location prediction in the space domain, and (ii) a size filter for scale and aspect ratio optimization in the size domain. Besides, an efficient two-stage re-detection strategy is introduced for long-term UAV tracking tasks. Large-scale experiments on four UAV benchmarks demonstrate the superiority of the presented method, which is computationally feasible on a low-cost CPU.
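For reference, applying a learned correlation filter is a Fourier-domain element-wise product; the peak of the resulting response map gives the predicted translation. A single-channel sketch (real trackers use multi-channel features and a learned filter):

```python
import numpy as np

def dcf_response_peak(search_feat, filt):
    # Correlation theorem: corr(f, g) = IFFT(conj(FFT(f)) * FFT(g))
    R = np.fft.ifft2(np.conj(np.fft.fft2(filt)) * np.fft.fft2(search_feat)).real
    return np.unravel_index(R.argmax(), R.shape)   # predicted target location

search = np.random.rand(64, 64)
print(dcf_response_peak(search, filt=np.random.rand(64, 64)))
```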
30. DR^2Track: Towards Real-Time Visual Tracking for UAV via Distractor Repressed Dynamic Regression [PDF] 返回目录
Changhong Fu, Fangqiang Ding, Yiming Li, Jin Jin, Chen Feng
Abstract: Visual tracking has yielded promising applications with unmanned aerial vehicles (UAVs). In the literature, the advanced discriminative correlation filter (DCF) type trackers generally distinguish the foreground from the background with a learned regressor which regresses the implicitly circulated samples into a fixed target label. However, the predefined and unchanged regression target results in low robustness and adaptivity to uncertain aerial tracking scenarios. In this work, we exploit the local maximum points of the response map generated in the detection phase to automatically locate current distractors. By repressing the response of distractors in the regressor learning, we can dynamically and adaptively alter our regression target to improve both tracking robustness and adaptivity. Substantial experiments conducted on three challenging UAV benchmarks demonstrate both the excellent performance and extraordinary speed (~50 fps on a cheap CPU) of our tracker.
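A sketch of the distractor-localization step: local maxima of the response map, other than the global peak, that exceed a fraction of the peak value are treated as distractors. The ratio and neighbourhood size are assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_distractors(response, min_ratio=0.3, size=5):
    peak = response.max()
    local_max = response == maximum_filter(response, size=size)
    strong = local_max & (response >= min_ratio * peak) & (response < peak)
    return np.argwhere(strong)                   # (row, col) of each distractor

resp = np.zeros((50, 50))
resp[25, 25] = 1.0                               # tracked target
resp[10, 40] = 0.6                               # similar-looking object nearby
print(find_distractors(resp))                    # -> [[10 40]]
```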
31. Prototype Mixture Models for Few-shot Semantic Segmentation [PDF] 返回目录
Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, Qixiang Ye
Abstract: Few-shot segmentation is challenging because objects within the support and query images could significantly differ in appearance and pose. Using a single prototype acquired directly from the support image to segment the query image causes semantic ambiguity. In this paper, we propose prototype mixture models (PMMs), which correlate diverse image regions with multiple prototypes to enforce a prototype-based semantic representation. Estimated by an Expectation-Maximization algorithm, PMMs incorporate rich channel-wise and spatial semantics from limited support images. Utilized as representations as well as classifiers, PMMs fully leverage the semantics to activate objects in the query image while suppressing background regions in a duplex manner. Extensive experiments on the Pascal VOC and MS-COCO datasets show that PMMs significantly improve upon the state of the art. In particular, PMMs improve 5-shot segmentation performance on MS-COCO by up to 5.82% with only a moderate cost in model size and inference speed.
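A hedged sketch of the estimation step: an EM-style loop that soft-assigns masked support features to K prototypes and recomputes each prototype as a weighted mean. The temperature `tau` and the exact responsibilities are assumptions of this sketch.

```python
import numpy as np

def em_prototypes(features, K=3, iters=10, tau=0.1):
    rng = np.random.default_rng(0)
    protos = features[rng.choice(len(features), K, replace=False)]
    for _ in range(iters):
        sims = features @ protos.T                                   # (N, K)
        resp = np.exp((sims - sims.max(axis=1, keepdims=True)) / tau)
        resp /= resp.sum(axis=1, keepdims=True)                      # E-step
        protos = (resp.T @ features) / (resp.sum(axis=0)[:, None] + 1e-8)  # M-step
    return protos

feats = np.random.randn(500, 64).astype(np.float32)   # masked foreground features
print(em_prototypes(feats).shape)                     # (3, 64) prototypes
```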
32. IF-Net: An Illumination-invariant Feature Network [PDF] 返回目录
Po-Heng Chen, Zhao-Xu Luo, Zu-Kuan Huang, Chun Yang, Kuan-Wen Chen
Abstract: Feature descriptor matching is a critical step in many computer vision applications such as image stitching, image retrieval and visual localization. However, it is often affected by many practical factors that degrade its performance. Among these factors, illumination variations are the most influential, and no previous descriptor-learning work focuses on dealing with this problem. In this paper, we propose IF-Net, aimed at generating a robust and generic descriptor under challenging illumination conditions. We find that not only the kind of training data matters, but also the order in which it is presented. To this end, we investigate several dataset scheduling methods and propose a separation training scheme to improve the matching accuracy. Further, we propose an ROI loss and a hard-positive mining strategy along with the training scheme, which strengthen the ability of the generated descriptor to deal with large illumination changes. We evaluate our approach on a public patch matching benchmark and achieve the best results compared with several state-of-the-art methods. To show its practicality, we further evaluate IF-Net on the task of visual localization under large illumination changes, and it achieves the best localization accuracy.
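As an illustration of in-batch hard mining for descriptor learning (the paper's ROI loss and hard-positive mining are more involved), here is a standard hardest-in-batch triplet-style loss where row i of each batch forms a matching pair:

```python
import torch
import torch.nn.functional as F

def hardest_in_batch_loss(desc_a, desc_b, margin=1.0):
    d = torch.cdist(F.normalize(desc_a, dim=1), F.normalize(desc_b, dim=1))
    pos = d.diag()                                 # distances of matching pairs
    neg = d + torch.eye(len(d)) * 1e6              # mask out the positives
    hardest_neg = neg.min(dim=1).values            # hardest negative per anchor
    return F.relu(margin + pos - hardest_neg).mean()

a, b = torch.randn(32, 128), torch.randn(32, 128)  # descriptors of patch pairs
print(hardest_in_batch_loss(a, b))
```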
33. RocNet: Recursive Octree Network for Efficient 3D Deep Representation [PDF] 返回目录
Juncheng Liu, Steven Mills, Brendan McCane
Abstract: We introduce a deep recursive octree network for the compression of 3D voxel data. Our network compresses a voxel grid of any size down to a very small latent space in an autoencoder-like network. We show results for compressing 32, 64 and 128 grids down to just 80 floats in the latent space. We demonstrate the effectiveness and efficiency of our proposed method on several publicly available datasets with three experiments: 3D shape classification, 3D shape reconstruction, and shape generation. Experimental results show that our algorithm maintains accuracy while consuming less memory with shorter training times compared to existing methods, especially in 3D reconstruction tasks.
34. Nighttime Dehazing with a Synthetic Benchmark [PDF]
Jing Zhang, Yang Cao, Zheng-Jun Zha, Dacheng Tao
Abstract: Increasing the visibility of nighttime hazy images is challenging because of uneven illumination from active artificial light sources and haze absorption/scattering. The absence of large-scale benchmark datasets hampers progress in this area. To address this issue, we propose a novel synthetic method called 3R to simulate nighttime hazy images from daytime clear images, which first reconstructs the scene geometry, then simulates the light rays and object reflectance, and finally renders the haze effects. Based on it, we generate realistic nighttime hazy images by sampling real-world light colors from a prior empirical distribution. Experiments on the synthetic benchmark show that the degrading factors jointly reduce the image quality. To address this issue, we propose an optimal-scale maximum reflectance prior to disentangle the color correction from haze removal and address them sequentially. Besides, we also devise a simple but effective learning-based baseline which has an encoder-decoder structure based on the MobileNet-v2 backbone. Experimental results demonstrate their superiority over state-of-the-art methods in terms of both image quality and runtime. Both the dataset and source code will be available at \url{this https URL}.
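As a rough illustration of how a maximum-reflectance-style prior can drive the color-correction step, here is a simplified fixed-scale sketch; the paper searches for an optimal scale, and the patch size below is an arbitrary assumption of mine.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def mrp_color_correction(img, scale=15, eps=1e-6):
    """Divide out a locally estimated illuminant color.

    img: HxWx3 float image in [0, 1]. Under the maximum reflectance prior,
    the per-channel maximum over a local patch approximates the (spatially
    varying) illuminant color, so dividing it out neutralizes the color
    cast of artificial light sources before haze removal.
    """
    illum = np.stack(
        [maximum_filter(img[..., c], size=scale) for c in range(3)], axis=-1)
    return np.clip(img / (illum + eps), 0.0, 1.0)
```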
35. Domain Private and Agnostic Feature for Modality Adaptive Face Recognition [PDF]
Yingguo Xu, Lei Zhang, Qingyan Duan
Abstract: Heterogeneous face recognition is a challenging task due to the large modality discrepancy and insufficient cross-modal samples. Most existing works focus on discriminative feature transformation, metric learning and cross-modal face synthesis. However, the fact that cross-modal faces are always coupled by domain (modality) and identity information has received little attention. Therefore, how to learn and utilize the domain-private and domain-agnostic features for modality adaptive face recognition is the focus of this work. Specifically, this paper proposes a Feature Aggregation Network (FAN), which includes a disentangled representation module (DRM), a feature fusion module (FFM) and an adaptive penalty metric (APM) learning session. First, in DRM, two subnetworks, i.e. a domain-private network and a domain-agnostic network, are specially designed for learning modality features and identity features, respectively. Second, in FFM, the identity features are fused with domain features to achieve cross-modal bi-directional identity feature transformation, which, to a large extent, further disentangles the modality information from the identity information. Third, considering that a distribution imbalance between easy and hard pairs exists in cross-modal datasets, which increases the risk of model bias, an identity-preserving guided metric learning with adaptive hard-pair penalization is proposed in our FAN. The proposed APM also guarantees cross-modality intra-class compactness and inter-class separation. Extensive experiments on benchmark cross-modal face datasets show that our FAN outperforms SOTA methods.
36. Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild [PDF]
Jichao Zhang, Jingjing Chen, Hao Tang, Wei Wang, Yan Yan, Enver Sangineto, Nicu Sebe
Abstract: In this paper we address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose. We have created a new dataset called CelebAGaze, which consists of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our method consists of three novel modules: the Gaze Correction module (GCM), the Gaze Animation module (GAM), and the Pretrained Autoencoder module (PAM). Specifically, GCM and GAM separately train a dual in-painting network using data from the domain $X$ for gaze correction and data from the domain $Y$ for gaze animation. Additionally, a Synthesis-As-Training method is proposed when training GAM to encourage the features encoded from the eye region to be correlated with the angle information, resulting in a gaze animation which can be achieved by interpolation in the latent space. To further preserve the identity information~(e.g., eye shape, iris color), we propose the PAM with an Autoencoder, which is based on Self-Supervised mirror learning where the bottleneck features are angle-invariant and which works as an extra input to the dual in-painting models. Extensive experiments validate the effectiveness of the proposed method for gaze correction and gaze animation in the wild and demonstrate the superiority of our approach in producing more compelling results than state-of-the-art baselines. Our code, the pretrained models and the supplementary material are available at: this https URL.
37. Neural Reflectance Fields for Appearance Acquisition [PDF]
Sai Bi, Zexiang Xu, Pratul Srinivasan, Ben Mildenhall, Kalyan Sunkavalli, Miloš Hašan, Yannick Hold-Geoffroy, David Kriegman, Ravi Ramamoorthi
Abstract: We present Neural Reflectance Fields, a novel deep scene representation that encodes volume density, normal and reflectance properties at any 3D point in a scene using a fully-connected neural network. We combine this representation with a physically-based differentiable ray marching framework that can render images from a neural reflectance field under any viewpoint and light. We demonstrate that neural reflectance fields can be estimated from images captured with a simple collocated camera-light setup, and accurately model the appearance of real-world scenes with complex geometry and reflectance. Once estimated, they can be used to render photo-realistic images under novel viewpoint and (non-collocated) lighting conditions and accurately reproduce challenging effects like specularities, shadows and occlusions. This allows us to perform high-quality view synthesis and relighting that is significantly better than previous methods. We also demonstrate that we can compose the estimated neural reflectance field of a real scene with traditional scene models and render them using standard Monte Carlo rendering engines. Our work thus enables a complete pipeline from high-quality and practical appearance acquisition to 3D scene composition and rendering.
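Ray marching in this family of methods reduces, per ray, to the standard volume-rendering quadrature. The sketch below is a generic version of that quadrature, not the paper's code, and it assumes the per-sample radiance has already been shaded with the light-visibility/reflectance model.

```python
import torch

def composite_along_ray(sigma, rgb, deltas):
    """Alpha-composite N samples along one ray.

    sigma:  (N,) volume densities at the samples
    rgb:    (N, 3) reflected radiance at the samples
    deltas: (N,) distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)           # per-sample opacity
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:1]), 1.0 - alpha + 1e-10]), 0)[:-1]
    weights = alpha * trans                            # contribution weights
    return (weights[:, None] * rgb).sum(dim=0)         # final pixel color
```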
38. Unsupervised Feature Learning by Cross-Level Discrimination between Instances and Groups [PDF]
Xudong Wang, Ziwei Liu, Stella X. Yu
Abstract: Unsupervised feature learning has made great strides with invariant mapping and instance-level discrimination, as benchmarked by classification on common datasets. However, these datasets are curated to be distinctive and class-balanced, whereas naturally collected data could be highly correlated within the class (with repeats at the extreme) and long-tail distributed across classes. The natural grouping of instances conflicts with the fundamental assumption of instance-level discrimination. Contrastive feature learning is thus unstable without grouping, whereas grouping without contrastive feature learning is easily trapped into degeneracy. We propose to integrate grouping into instance-level discrimination, not by imposing group-level discrimination, but by imposing cross-level discrimination between instances and groups. Our key insight is that attraction and repulsion between instances work at different ranges. In order to discover the most discriminative feature that also respects natural grouping, we ask each instance to repel groups of instances that are far from it. By pushing against common groups, this cross-level repulsion actively binds similar instances together. To further avoid the clash between grouping and discrimination objectives, we also impose them on separate features derived from the common feature. Our extensive experimentation demonstrates not only significant gain on datasets with high correlation and long-tail distributions, but also leading performance on multiple self-supervision benchmarks including CIFAR-10, CIFAR-100 and ImageNet, bringing unsupervised feature learning closer to real data applications.
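A compact sketch of an instance-group contrastive term in the spirit of the abstract; the grouping itself (e.g. k-means over features), the temperature, and the plain cross-entropy form are all my own simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def cross_level_loss(feats, centroids, assign, tau=0.2):
    """Attract each instance to its own group centroid and repel it from
    the others (cross-level discrimination between instances and groups).

    feats:     (B, D) instance features
    centroids: (K, D) group centroids
    assign:    (B,) long tensor, group index of each instance
    """
    feats = F.normalize(feats, dim=1)
    centroids = F.normalize(centroids, dim=1)
    logits = feats @ centroids.t() / tau   # (B, K) instance-group similarity
    return F.cross_entropy(logits, assign)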
39. Neural Light Transport for Relighting and View Synthesis [PDF]
Xiuming Zhang, Sean Fanello, Yun-Ta Tsai, Tiancheng Sun, Tianfan Xue, Rohit Pandey, Sergio Orts-Escolano, Philip Davidson, Christoph Rhemann, Paul Debevec, Jonathan T. Barron, Ravi Ramamoorthi, William T. Freeman
Abstract: The light transport (LT) of a scene describes how it appears under different lighting and viewing directions, and complete knowledge of a scene's LT enables the synthesis of novel views under arbitrary lighting. In this paper, we focus on image-based LT acquisition, primarily for human bodies within a light stage setup. We propose a semi-parametric approach to learn a neural representation of LT that is embedded in the space of a texture atlas of known geometric properties, and model all non-diffuse and global LT as residuals added to a physically-accurate diffuse base rendering. In particular, we show how to fuse previously seen observations of illuminants and views to synthesize a new image of the same scene under a desired lighting condition from a chosen viewpoint. This strategy allows the network to learn complex material effects (such as subsurface scattering) and global illumination, while guaranteeing the physical correctness of the diffuse LT (such as hard shadows). With this learned LT, one can relight the scene photorealistically with a directional light or an HDRI map, synthesize novel views with view-dependent effects, or do both simultaneously, all in a unified framework using a set of sparse, previously seen observations. Qualitative and quantitative experiments demonstrate that our neural LT (NLT) outperforms state-of-the-art solutions for relighting and view synthesis, without separate treatment for both problems that prior work requires.
40. Spatiotemporal Contrastive Video Representation Learning [PDF]
Rui Qian, Tianjian Meng, Boqing Gong, Ming-Hsuan Yang, Huisheng Wang, Serge Belongie, Yin Cui
Abstract: We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn spatiotemporal visual representations from unlabeled videos. Inspired by the recently proposed self-supervised contrastive learning framework, our representations are learned using a contrastive loss, where two clips from the same short video are pulled together in the embedding space, while clips from different videos are pushed away. We study what makes for good data augmentation for video self-supervised learning and find both spatial and temporal information are crucial. In particular, we propose a simple yet effective temporally consistent spatial augmentation method to impose strong spatial augmentations on each frame of a video clip while maintaining the temporal consistency across frames. For Kinetics-600 action recognition, a linear classifier trained on representations learned by CVRL achieves 64.1\% top-1 accuracy with a 3D-ResNet50 backbone, outperforming ImageNet supervised pre-training by 9.4\% and SimCLR unsupervised pre-training by 16.1\% using the same inflated 3D-ResNet50. The performance of CVRL can be further improved to 68.2\% with a larger 3D-ResNet50 (4$\times$) backbone, significantly closing the gap between unsupervised and supervised video representation learning.
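Two of the abstract's ingredients are easy to sketch: the temporally consistent spatial augmentation (one set of crop parameters shared by all frames of a clip) and a contrastive loss between two clips of the same video. Tensor layout, crop size, and temperature below are assumptions, not CVRL's exact settings.

```python
import torch
import torch.nn.functional as F

def temporally_consistent_crop(clip, out_size=112):
    """Sample one random crop and apply it to every frame.
    clip: (T, C, H, W), assuming H, W > out_size."""
    _, _, H, W = clip.shape
    i = torch.randint(0, H - out_size + 1, (1,)).item()  # shared across frames
    j = torch.randint(0, W - out_size + 1, (1,)).item()
    return clip[:, :, i:i + out_size, j:j + out_size]

def info_nce(z1, z2, tau=0.1):
    """Contrastive loss; row b of z1 and z2 embed two clips of video b,
    and all other rows in the batch serve as negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)
```

Applying per-frame (inconsistent) crops instead would destroy the motion signal across frames, which is exactly the failure mode the temporally consistent variant avoids.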
41. Richly Activated Graph Convolutional Network for Robust Skeleton-based Action Recognition [PDF]
Yi-Fan Song, Zhang Zhang, Caifeng Shan, Liang Wang
Abstract: Current methods for skeleton-based human action recognition usually work with complete skeletons. However, in real scenarios, it is inevitable to capture incomplete or noisy skeletons, which could significantly deteriorate the performance of current methods when some informative joints are occluded or disturbed. To improve the robustness of action recognition models, a multi-stream graph convolutional network (GCN) is proposed to explore sufficient discriminative features spreading over all skeleton joints, so that the distributed redundant representation reduces the sensitivity of the action models to non-standard skeletons. Concretely, the backbone GCN is extended by a series of ordered streams which is responsible for learning discriminative features from the joints less activated by preceding streams. Here, the activation degrees of skeleton joints of each GCN stream are measured by the class activation maps (CAM), and only the information from the unactivated joints will be passed to the next stream, by which rich features over all active joints are obtained. Thus, the proposed method is termed richly activated GCN (RA-GCN). Compared to the state-of-the-art (SOTA) methods, the RA-GCN achieves comparable performance on the standard NTU RGB+D 60 and 120 datasets. More crucially, on the synthetic occlusion and jittering datasets, the performance deterioration due to the occluded and disturbed joints can be significantly alleviated by utilizing the proposed RA-GCN.
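The activation-masking step can be sketched as follows: given per-joint class-activation scores from an earlier stream, pass only the least-activated joints on to the next stream, forcing it to discover new discriminative joints. Shapes and the keep ratio are my assumptions.

```python
import torch

def mask_most_activated_joints(x, cam, keep_ratio=0.7):
    """x:   (B, C, T, V) skeleton features, V joints
       cam: (B, V) class-activation score per joint from the previous stream
    """
    k = int(cam.size(1) * keep_ratio)
    # Keep the joints the previous stream activated LEAST.
    keep = cam.topk(k, dim=1, largest=False).indices      # (B, k)
    mask = torch.zeros_like(cam).scatter_(1, keep, 1.0)   # (B, V) 0/1 mask
    return x * mask[:, None, None, :]                     # broadcast over C, T
```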
42. 3D Human Motion Estimation via Motion Compression and Refinement [PDF]
Zhengyi Luo, S. Alireza Golestaneh, Kris M. Kitani
Abstract: We develop a technique for generating smooth and accurate 3D human pose and motion estimates from RGB video sequences. Our technique, which we call Motion Estimation via Variational Autoencoder (MEVA), decomposes a temporal sequence of human motion into a smooth motion representation using auto-encoder-based motion compression and a residual representation learned through motion refinement. This two-step encoding captures human motion in two stages: a general human motion estimation step that captures the coarse overall motion, and a residual estimation that adds back person-specific motion details. Experiments show that our method produces both smooth and accurate 3D human pose and motion estimates.
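A toy sketch of the coarse-plus-residual decomposition; plain MLPs stand in for the paper's variational autoencoder and refinement network, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class TwoStageMotionEstimator(nn.Module):
    def __init__(self, pose_dim=72, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, pose_dim))
        self.refiner = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                     nn.Linear(128, pose_dim))

    def forward(self, pose_seq):                       # (B, T, pose_dim)
        # Stage 1: bottleneck the sequence for smooth, coarse overall motion.
        coarse = self.decoder(self.encoder(pose_seq))
        # Stage 2: add back person-specific motion details as a residual.
        return coarse + self.refiner(pose_seq)
```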
43. Flow-Guided Attention Networks for Video-Based Person Re-Identification [PDF]
Madhu Kiran, Amran Bhuiyan, Louis-Antoine Blais-Morin, Mehrsan Javan, Ismail Ben Ayed, Eric Granger
Abstract: Person Re-Identification (ReID) is an important problem in many video analytics and surveillance applications, where a person's identity must be associated across a distributed network of cameras. Video-based person ReID has recently gained much interest because it can capture discriminant spatio-temporal information that is unavailable for image-based ReID. Despite recent advances, deep learning models for video ReID often fail to leverage this information to improve the robustness of feature representations. In this paper, the motion pattern of a person is explored as an additional cue for ReID. In particular, two different flow-guided attention networks are proposed for fusion with any 2D-CNN backbone, allowing temporal information to be encoded along with spatial appearance information. Our first proposed network, called Gated Attention, relies on optical flow to generate gated attention over video-based features that embed spatial appearance. Hence the proposed framework allows the network to activate a common set of salient features across multiple frames. In contrast, our second network, called Mutual Attention, relies on the joint attention between image and optical flow features. This enables spatial attention between both sources of features, across motion and appearance cues. Both methods introduce a feature aggregation method that produces video features by identifying salient spatio-temporal information. Extensive experiments on two challenging video datasets indicate that the proposed flow-guided spatio-temporal attention networks improve recognition accuracy considerably, outperforming state-of-the-art methods for video-based person ReID. Additionally, our Mutual Attention network is able to process longer frame sequences with a wider range of appearance variations for highly accurate recognition.
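The Gated Attention branch can be sketched as a sigmoid gate computed from optical flow and applied multiplicatively to the appearance features. Channel counts and the single-conv gate below are my assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GatedFlowAttention(nn.Module):
    def __init__(self, feat_ch=256, flow_ch=2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(flow_ch, feat_ch, kernel_size=3, padding=1),
            nn.Sigmoid())

    def forward(self, feat, flow):
        # feat: (B, feat_ch, H, W) appearance features from a 2D-CNN backbone
        # flow: (B, 2, H, W) optical flow between consecutive frames
        return feat * self.gate(flow)  # motion-salient regions pass through
```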
44. SemEval-2020 Task 8: Memotion Analysis -- The Visuo-Lingual Metaphor! [PDF]
Chhavi Sharma, Deepesh Bhageria, William Scott, Srinivas PYKL, Amitava Das, Tanmoy Chakraborty, Viswanath Pulabaigari, Bjorn Gamback
Abstract: Information on social media comprises various modalities such as textual, visual and audio content. NLP and Computer Vision communities often leverage only one prominent modality in isolation to study social media. However, the computational processing of Internet memes needs a hybrid approach. The growing ubiquity of Internet memes on social media platforms such as Facebook, Instagram, and Twitter further suggests that we can no longer ignore such multimodal content. To the best of our knowledge, meme emotion analysis has so far received little attention. The objective of this proposal is to bring the attention of the research community towards the automatic processing of Internet memes. The Memotion analysis task released approximately 10K annotated memes with human-annotated labels, namely sentiment (positive, negative, neutral), type of emotion (sarcastic, funny, offensive, motivation) and their corresponding intensity. The challenge consisted of three subtasks: sentiment (positive, negative, and neutral) analysis of memes, overall emotion (humour, sarcasm, offensive, and motivational) classification of memes, and classifying the intensity of meme emotion. The best performances achieved were F1 (macro average) scores of 0.35, 0.51 and 0.32, respectively, for the three subtasks.
45. Depth image denoising using nuclear norm and learning graph model [PDF]
Chenggang Yan, Zhisheng Li, Yongbing Zhang, Yutao Liu, Xiangyang Ji, Yongdong Zhang
Abstract: Depth image denoising is increasingly becoming a hot research topic because depth images reflect the three-dimensional (3D) scene and can be applied in various fields of computer vision. However, the depth images obtained from depth cameras usually contain artifacts such as noise, which greatly impair the performance of depth-related applications. In this paper, considering that group-based image restoration methods are more effective at gathering the similarity among patches, a group-based nuclear norm and learning graph (GNNLG) model is proposed. For each patch, we find and group the most similar patches within a searching window. The intrinsic low-rank property of the grouped patches is exploited in our model. In addition, we study manifold learning and devise an effective optimized learning strategy to obtain the graph Laplacian matrix, which reflects the topological structure of the image, to further impose smoothing priors on the denoised depth image. To achieve high speed and fast convergence, the alternating direction method of multipliers (ADMM) is employed to solve our GNNLG model. The experimental results show that the proposed method is superior to other state-of-the-art denoising methods in terms of both subjective and objective criteria.
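Two sub-steps of such an ADMM solver are standard and easy to sketch: the nuclear-norm proximal step (singular value thresholding) and the closed-form graph-Laplacian smoothing step. This is a generic sketch of those building blocks, not the paper's full GNNLG solver.

```python
import numpy as np

def svt(X, tau):
    """Prox of the nuclear norm: shrink the singular values of X by tau."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def graph_smooth(Y, L, lam):
    """argmin_X ||X - Y||_F^2 + lam * tr(X^T L X), solved in closed form
    via (I + lam * L) X = Y, where L is the learned graph Laplacian."""
    return np.linalg.solve(np.eye(L.shape[0]) + lam * L, Y)
```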
46. Recurrent Feature Reasoning for Image Inpainting [PDF]
Jingyuan Li, Ning Wang, Lefei Zhang, Bo Du, Dacheng Tao
Abstract: Existing inpainting methods have achieved promising performance for recovering regular or small image defects. However, filling in large continuous holes remains difficult due to the lack of constraints for the hole center. In this paper, we devise a Recurrent Feature Reasoning (RFR) network which is mainly constructed by a plug-and-play Recurrent Feature Reasoning module and a Knowledge Consistent Attention (KCA) module. Analogous to how humans solve puzzles (i.e., first solve the easier parts and then use the results as additional information to solve difficult parts), the RFR module recurrently infers the hole boundaries of the convolutional feature maps and then uses them as clues for further inference. The module progressively strengthens the constraints for the hole center and the results become explicit. To capture information from distant places in the feature map for RFR, we further develop KCA and incorporate it in RFR. Empirically, we first compare the proposed RFR-Net with existing backbones, demonstrating that RFR-Net is more efficient (e.g., a 4\% SSIM improvement for the same model size). We then place the network in the context of the current state-of-the-art, where it exhibits improved performance. The corresponding source code is available at: this https URL
47. SOFA-Net: Second-Order and First-order Attention Network for Crowd Counting [PDF]
Haoran Duan, Shidong Wang, Yu Guan
Abstract: Automated crowd counting from images/videos has attracted more attention in recent years because of its wide application in smart cities. However, modelling dense crowd heads is challenging, and many existing methods become less reliable in such settings. To obtain an appropriate crowd representation, in this work we propose SOFA-Net (Second-Order and First-order Attention Network): second-order statistics are extracted to retain the selectivity of channel-wise spatial information for dense heads, while first-order statistics, which can enhance feature discrimination in head areas, are used as complementary information. Via a multi-stream architecture, the proposed second/first-order statistics are learned and transformed into attention for robust representation refinement. We evaluate our method on four public datasets and reach state-of-the-art performance on most of them. Extensive experiments are also conducted to study the components of the proposed SOFA-Net, and the results confirm the high capability of second/first-order statistics for modelling crowds in challenging scenarios. To the best of our knowledge, this is the first work to explore second/first-order statistics for crowd counting.
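A rough sketch of a covariance-based (second-order) channel attention block in the spirit of the abstract; the reduction ratio, the row-mean covariance summary, and the layer choices are assumptions of mine, not SOFA-Net's exact design.

```python
import torch
import torch.nn as nn

class SecondOrderChannelAttention(nn.Module):
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                              # x: (B, C, H, W)
        B, C, H, W = x.shape
        f = x.view(B, C, H * W)
        f = f - f.mean(dim=2, keepdim=True)
        cov = f @ f.transpose(1, 2) / (H * W - 1)      # (B, C, C) covariance
        w = self.fc(cov.mean(dim=2))                   # summarize each row
        return x * w.view(B, C, 1, 1)                  # channel re-weighting
```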
48. Online Extrinsic Camera Calibration for Temporally Consistent IPM Using Lane Boundary Observations with a Lane Width Prior [PDF]
Jeong-Kyun Lee, Young-Ki Baik, Hankyu Cho, Seungwoo Yoo
Abstract: In this paper, we propose a method for online extrinsic camera calibration, i.e., estimating pitch, yaw, roll angles and camera height from road surface in sequential driving scene images. The proposed method estimates the extrinsic camera parameters in two steps: 1) pitch and yaw angles are estimated simultaneously using a vanishing point computed from a set of lane boundary observations, and then 2) roll angle and camera height are computed by minimizing difference between lane width observations and a lane width prior. The extrinsic camera parameters are sequentially updated using extended Kalman filtering (EKF) and are finally used to generate a temporally consistent bird-eye-view (BEV) image by inverse perspective mapping (IPM). We demonstrate the superiority of the proposed method in synthetic and real-world datasets.
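Step 1 of the abstract reduces to simple geometry once the vanishing point of the lane direction is known: the vanishing point of a 3D direction is the image of that direction vector, so its offset from the principal point encodes yaw (horizontal) and pitch (vertical). A sketch under assumed conventions (x right, y down, z forward, zero roll; the signs flip with other conventions):

```python
import numpy as np

def pitch_yaw_from_vanishing_point(vp, K):
    """vp: (u, v) vanishing point of the lane boundaries, in pixels
       K:  3x3 camera intrinsic matrix
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    yaw = np.arctan2(vp[0] - cx, fx)     # lane direction left/right of axis
    pitch = np.arctan2(vp[1] - cy, fy)   # sign depends on the y convention
    return pitch, yaw
```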
49. 1-Point RANSAC-Based Method for Ground Object Pose Estimation [PDF]
Jeong-Kyun Lee, Young-Ki Baik, Hankyu Cho, Kang Kim, Duck Hoon Kim
Abstract: Solving Perspective-n-Point (PnP) problems is a traditional way of estimating object poses. Given outlier-contaminated data, a pose of an object is calculated with PnP algorithms of n = {3, 4} in the RANSAC-based scheme. However, the computational complexity considerably increases along with n and the high complexity imposes a severe strain on devices which should estimate multiple object poses in real time. In this paper, we propose an efficient method based on 1-point RANSAC for estimating a pose of an object on the ground. In the proposed method, a pose is calculated with 1-DoF parameterization by using a ground object assumption and a 2D object bounding box as an additional observation, thereby achieving the fastest performance among the RANSAC-based methods. In addition, since the method suffers from the errors of the additional information, we propose a hierarchical robust estimation method for polishing a rough pose estimate and discovering more inliers in a coarse-to-fine manner. The experiments in synthetic and real-world datasets demonstrate the superiority of the proposed method.
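The efficiency argument is easy to see in code: with a single sample per hypothesis, the RANSAC loop needs very few iterations (or can even enumerate every observation). Below is a generic 1-point RANSAC skeleton; the callback-style API is my own abstraction, not the paper's.

```python
import numpy as np

def one_point_ransac(observations, fit_one, residuals, thresh, iters=50):
    """observations: sequence of measurements (e.g. 2D box + ground prior)
       fit_one:      builds a model (here a 1-DoF pose) from ONE observation
       residuals:    array of per-observation errors under a candidate model
    """
    best_model = None
    best_inliers = np.zeros(len(observations), dtype=bool)
    for idx in np.random.choice(len(observations), size=iters):
        model = fit_one(observations[idx])
        inliers = residuals(model, observations) < thresh
        if inliers.sum() > best_inliers.sum():
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```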
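A generic 1-point RANSAC loop in the spirit of this abstract is sketched below; `solve_pose_1dof` and `residuals` are hypothetical callbacks standing in for the paper's minimal solver and its reprojection error, not APIs from the paper's code.

```python
import numpy as np

def one_point_ransac(observations, solve_pose_1dof, residuals,
                     iters=50, thresh=2.0, seed=0):
    # Each hypothesis needs only ONE sampled observation, because the
    # ground-object assumption plus the 2D bounding box reduce the pose
    # to a single unknown degree of freedom.
    rng = np.random.default_rng(seed)
    best_pose, best_count = None, -1
    for _ in range(iters):
        sample = observations[rng.integers(len(observations))]
        pose = solve_pose_1dof(sample)                 # minimal 1-DoF solver
        inliers = residuals(pose, observations) < thresh
        if inliers.sum() > best_count:
            best_pose, best_count = pose, int(inliers.sum())
    return best_pose, best_count
```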
50. I2L-MeshNet: Image-to-Lixel Prediction Network for Accurate 3D Human Pose and Mesh Estimation from a Single RGB Image [PDF] 返回目录
Gyeongsik Moon, Kyoung Mu Lee
Abstract: Most of the previous image-based 3D human pose and mesh estimation methods estimate parameters of the human mesh model from an input image. However, directly regressing the parameters from the input image is a highly non-linear mapping because it breaks the spatial relationship between pixels in the input image. In addition, it cannot model the prediction uncertainty, which can make training harder. To resolve the above issues, we propose I2L-MeshNet, an image-to-lixel (line+pixel) prediction network. The proposed I2L-MeshNet predicts the per-lixel likelihood on 1D heatmaps for each mesh vertex coordinate instead of directly regressing the parameters. Our lixel-based 1D heatmap preserves the spatial relationship in the input image and models the prediction uncertainty. We demonstrate the benefit of the image-to-lixel prediction and show that the proposed I2L-MeshNet outperforms previous methods. The code is publicly available at this https URL.
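A per-lixel likelihood on a 1D heatmap is typically turned into a continuous coordinate with a differentiable soft-argmax; the PyTorch sketch below shows that general technique, not the authors' exact implementation.

```python
import torch

def lixel_soft_argmax(heatmap_logits):
    # heatmap_logits: (batch, num_vertices, num_lixels) per-lixel scores.
    prob = torch.softmax(heatmap_logits, dim=-1)       # per-lixel likelihood
    idx = torch.arange(heatmap_logits.shape[-1],
                       dtype=prob.dtype, device=prob.device)
    # Expected lixel index: a differentiable, uncertainty-aware coordinate.
    return (prob * idx).sum(dim=-1)

coords = lixel_soft_argmax(torch.randn(2, 6890, 64))   # e.g. 6890 SMPL vertices
print(coords.shape)                                    # torch.Size([2, 6890])
```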
51. Block Shuffle: A Method for High-resolution Fast Style Transfer with Limited Memory [PDF] 返回目录
Weifeng Ma, Zhe Chen, Caoting Ji
Abstract: Fast Style Transfer is a series of Neural Style Transfer algorithms that use feed-forward neural networks to render input images. Because of the high dimension of the output layer, these networks require a large amount of memory for computation. Therefore, for high-resolution images, most mobile devices and personal computers cannot stylize them, which greatly limits the application scenarios of Fast Style Transfer. At present, the two existing solutions are purchasing more memory and using the feathering-based method, but the former incurs additional cost and the latter yields poor image quality. To solve this problem, we propose a novel image synthesis method named block shuffle, which converts a single task with high memory consumption into multiple subtasks with low memory consumption. This method can act as a plug-in for Fast Style Transfer without any modification to the network architecture. We use the most popular Fast Style Transfer repository on GitHub as the baseline. Experiments show that the quality of high-resolution images generated by our method is better than that of the feathering-based method. Although our method is an order of magnitude slower than the baseline, it can stylize high-resolution images with limited memory, which is impossible with the baseline. The code and models will be made available at this https URL.
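The core idea of converting one high-memory task into many low-memory subtasks can be sketched as plain block-wise processing; note that the actual block-shuffle method additionally reorders and blends blocks to hide seams, which this illustration omits, and `stylize` is a hypothetical callable wrapping the style network.

```python
import numpy as np

def stylize_in_blocks(image, stylize, block=512):
    # Feed the network fixed-size crops so that peak memory depends on
    # `block`, not on the full image resolution. numpy slicing clips at
    # the border, so edge blocks are simply smaller.
    h, w = image.shape[:2]
    out = np.empty_like(image)
    for y in range(0, h, block):
        for x in range(0, w, block):
            out[y:y + block, x:x + block] = stylize(
                image[y:y + block, x:x + block])
    return out
```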
52. Learning Consistency Pursued Correlation Filters for Real-Time UAV Tracking [PDF] 返回目录
Changhong Fu, Xiaoxiao Yang, Fan Li, Juntao Xu, Changjing Liu, Peng Lu
Abstract: Correlation filter (CF)-based methods have demonstrated exceptional performance in visual object tracking for unmanned aerial vehicle (UAV) applications, but suffer from the undesirable boundary effect. To solve this issue, the spatially regularized discriminative correlation filter (SRDCF) introduces spatial regularization to penalize filter coefficients, thereby significantly improving tracking performance. However, the temporal information hidden in the response maps is not considered in SRDCF, which limits the discriminative power and the robustness for accurate tracking. This work proposes a novel approach with dynamic consistency-pursued correlation filters, i.e., the CPCF tracker. Specifically, through a correlation operation between adjacent response maps, a practical consistency map is generated to represent the consistency level across frames. By minimizing the difference between the practical and the scheduled ideal consistency map, the consistency level is constrained to maintain temporal smoothness, and the rich temporal information contained in response maps is introduced. Besides, a dynamic constraint strategy is proposed to further improve the adaptability of the proposed tracker in complex situations. Comprehensive experiments are conducted on three challenging UAV benchmarks, i.e., UAV123@10FPS, UAVDT, and DTB70. Based on the experimental results, the proposed tracker favorably surpasses the other 25 state-of-the-art trackers with real-time running speed ($\sim$43FPS) on a single CPU.
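A correlation between adjacent response maps, the building block of the consistency map, can be computed efficiently in the Fourier domain; the sketch below shows that generic operation, not the CPCF tracker's exact formulation.

```python
import numpy as np

def response_consistency(resp_prev, resp_curr):
    # Circular cross-correlation of consecutive CF response maps via FFT;
    # a sharp, centered peak indicates temporally consistent responses,
    # while a flat or shifted map flags distractors or abrupt changes.
    corr = np.fft.ifft2(np.fft.fft2(resp_prev) *
                        np.conj(np.fft.fft2(resp_curr))).real
    return np.fft.fftshift(corr)
```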
53. Fully Automated Photogrammetric Data Segmentation and Object Information Extraction Approach for Creating Simulation Terrain [PDF] 返回目录
Meida Chen, Andrew Feng, Kyle McCullough, Pratusha Bhuvana Prasad, Ryan McAlinden, Lucio Soibelman, Mike Enloe
Abstract: Our previous works have demonstrated that visually realistic 3D meshes can be automatically reconstructed with low-cost, off-the-shelf unmanned aerial systems (UAS) equipped with capable cameras, and efficient photogrammetric software techniques. However, such generated data do not contain semantic information/features of objects (i.e., man-made objects, vegetation, ground, object materials, etc.) and cannot allow sophisticated user-level and system-level interaction. Considering the use case of the data in creating realistic virtual environments for training and simulations (i.e., mission planning, rehearsal, threat detection, etc.), segmenting the data and extracting object information are essential tasks. Thus, the objective of this research is to design and develop a fully automated photogrammetric data segmentation and object information extraction framework. To validate the proposed framework, the segmented data and extracted features were used to create virtual environments in the authors' previously designed simulation tool, i.e., the Aerial Terrain Line of Sight Analysis System (ATLAS). The results showed that 3D mesh trees could be replaced with geo-typical 3D tree models using the extracted individual tree locations. The extracted tree features (i.e., color, width, height) are valuable for selecting the appropriate tree species and enhancing visual quality. Furthermore, the identified ground material information can be taken into consideration for pathfinding. The shortest path can be computed considering not only the physical distance but also off-road vehicle performance capabilities on different ground surface materials.
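The material-aware shortest path described at the end can be realized with a standard Dijkstra search over a cost grid; in this sketch, `grid_cost` is a hypothetical per-cell traversal cost derived from the identified ground material, not a structure from the paper.

```python
import heapq

def material_aware_shortest_path(grid_cost, start, goal):
    # Dijkstra over a 4-connected grid; grid_cost[r][c] combines distance
    # with the vehicle's performance on the identified surface material.
    rows, cols = len(grid_cost), len(grid_cost[0])
    dist, heap = {start: 0.0}, [(0.0, start)]
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if (r, c) == goal:
            return d
        if d > dist.get((r, c), float("inf")):
            continue
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + grid_cost[nr][nc]
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    heapq.heappush(heap, (nd, (nr, nc)))
    return float("inf")

print(material_aware_shortest_path([[1, 9], [1, 1]], (0, 0), (1, 1)))  # 2.0
```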
54. Radar-based Dynamic Occupancy Grid Mapping and Object Detection [PDF] 返回目录
Christopher Diehl, Eduard Feicho, Alexander Schwambach, Thomas Dammeier, Eric Mares, Torsten Bertram
Abstract: Environment modeling utilizing sensor data fusion and object tracking is crucial for safe automated driving. In recent years, the classical occupancy grid map approach, which assumes a static environment, has been extended to dynamic occupancy grid maps, which maintain the possibility of a low-level data fusion while also estimating the position and velocity distribution of the dynamic local environment. This paper presents the further development of a previous approach. To the best of the authors' knowledge, there is no publication about dynamic occupancy grid mapping with subsequent analysis based only on radar data. Therefore, in this work, the data of multiple radar sensors are fused, and a grid-based object tracking and mapping method is applied. Subsequently, the clustering of dynamic areas provides high-level object information. For comparison, a lidar-based method is also developed. The approach is evaluated qualitatively and quantitatively with real-world data from a moving vehicle in urban environments. The evaluation illustrates the advantages of the radar-based dynamic occupancy grid map, considering different comparison metrics.
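Dynamic occupancy grids build on the classical Bayesian log-odds update sketched below; the velocity estimation that distinguishes dynamic grids (e.g., via particle filters) is omitted here, and the evidence weights are illustrative assumptions.

```python
import numpy as np

def update_occupancy(logodds, hit_mask, free_mask, l_hit=0.85, l_free=-0.4):
    # Classical Bayesian occupancy update in log-odds form: add positive
    # evidence where radar returns fall and negative evidence along the
    # traversed rays; clipping keeps cells able to change state quickly.
    logodds = logodds + l_hit * hit_mask + l_free * free_mask
    return np.clip(logodds, -10.0, 10.0)

def occupancy_probability(logodds):
    # Convert log-odds back to an occupancy probability per cell.
    return 1.0 / (1.0 + np.exp(-logodds))
```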
55. LiDAR Data Enrichment Using Deep Learning Based on High-Resolution Image: An Approach to Achieve High-Performance LiDAR SLAM Using Low-cost LiDAR [PDF] 返回目录
Jiang Yue, Weisong Wen, Jing Han, Li-Ta Hsu
Abstract: LiDAR-based SLAM algorithms have been extensively studied over the past decades to provide robust and accurate positioning for autonomous driving vehicles (ADV). Satisfactory performance can be obtained using high-grade 3D LiDAR with 64 channels, which can provide dense point clouds. Unfortunately, the high price significantly hinders its extensive commercialization in ADV. The cost-effective 3D LiDAR with 16 channels is a promising replacement. However, only limited and sparse point clouds can be provided by the 16-channel LiDAR, which cannot guarantee sufficient positioning accuracy for ADV in challenging dynamic environments. The high-resolution image from a low-cost camera can provide ample information about the surroundings. However, explicit depth information is not available from the image. Inspired by the complementariness of 3D LiDAR and camera, this paper proposes to make use of the high-resolution images from a camera to enrich the raw 3D point clouds from the low-cost 16-channel LiDAR based on a state-of-the-art deep learning algorithm. An ERFNet is first employed to segment the image with the aid of the raw sparse 3D point clouds. Meanwhile, a sparse convolutional neural network is employed to predict dense point clouds based on the raw sparse 3D point clouds. Then, the predicted dense point clouds are fused with the segmentation outputs from the ERFNet using a novel multi-layer convolutional neural network to refine the predicted 3D point clouds. Finally, the enriched point clouds are employed to perform LiDAR SLAM based on the state-of-the-art normal distribution transform (NDT). We tested our approach on the re-edited KITTI datasets: (1) the sparse 3D point clouds are significantly enriched, with a mean squared error of 1.1 m; (2) the map generated from LiDAR SLAM is denser and includes more details without significant accuracy loss.
56. From depth image to semantic scene synthesis through point cloud classification and labeling: Application to assistive systems [PDF] 返回目录
Chayma Zatout, Slimane Larabi
Abstract: The aim of this work is to provide a semantic scene synthesis from a depth image. First, the depth image is segmented and each segment is classified in the context of assistive systems using a deep learning network. Second, inspired by the Braille system and the Japanese writing system Kanji, the obtained classes are coded with semantic labels. A semantic scene is then synthesized using the labels and the extracted features. Experiments are conducted on noisy, occluded, cropped, and incomplete data, including acquired depth images of indoor scenes and datasets from the RMRC challenge. The obtained results are reported and discussed.
57. Feature Space Augmentation for Long-Tailed Data [PDF] 返回目录
Peng Chu, Xiao Bian, Shaopeng Liu, Haibin Ling
Abstract: Real-world data often follow a long-tailed distribution, as the frequency of each class is typically different. For example, a dataset can have a large number of under-represented classes and a few classes with more than sufficient data. However, a model representing the dataset is usually expected to have reasonably homogeneous performance across classes. Introducing a class-balanced loss and advanced methods for data re-sampling and augmentation is among the best practices for alleviating the data imbalance problem. However, the other part of the problem, concerning the under-represented classes, has to rely on additional knowledge to recover the missing information. In this work, we present a novel approach to address the long-tailed problem by augmenting the under-represented classes in the feature space with features learned from the classes with ample samples. In particular, we decompose the features of each class into a class-generic component and a class-specific component using class activation maps. Novel samples of under-represented classes are then generated on the fly during training by fusing the class-specific features from the under-represented classes with the class-generic features from confusing classes. Our results on different datasets, such as iNaturalist, ImageNet-LT, Places-LT, and a long-tailed version of CIFAR, show state-of-the-art performance.
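The fusion of class-specific and class-generic components can be sketched with a hard CAM threshold as below; the threshold and tensor shapes are our assumptions, not the paper's exact recipe.

```python
import torch

def fuse_features(f_under, f_confusing, cam_under, cam_confusing, tau=0.5):
    # CAMs split a feature map into a class-specific part (high activation)
    # and a class-generic part (low activation). A novel sample keeps the
    # under-represented class's specific part and borrows the confusing
    # class's generic part. The hard threshold tau is our simplification.
    specific = f_under * (cam_under > tau)
    generic = f_confusing * (cam_confusing <= tau)
    return specific + generic

f = fuse_features(torch.randn(256, 7, 7), torch.randn(256, 7, 7),
                  torch.rand(7, 7), torch.rand(7, 7))
print(f.shape)  # torch.Size([256, 7, 7])
```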
58. Augmenting Molecular Images with Vector Representations as a Featurization Technique for Drug Classification [PDF] 返回目录
Daniel de Marchi, Amarjit Budhiraja
Abstract: One of the key steps in building deep learning systems for drug classification and generation is the choice of featurization for the molecules. Previous featurization methods have included molecular images, binary strings, graphs, and SMILES strings. This paper proposes the creation of molecular images captioned with binary vectors that encode information not contained in, or not easily understood from, a molecular image alone. Specifically, we use Morgan fingerprints, which encode higher-level structural information, and MACCS keys, which encode yes-or-no questions about a molecule's properties and structure. We tested our method on the HIV dataset published by the Pande lab, which consists of 41,127 molecules labeled by whether they inhibit the HIV virus. Our final model achieved state-of-the-art ROC AUC on the HIV dataset, outperforming all other methods. Moreover, the model converged significantly faster than most other methods, requiring dramatically less computational power than unaugmented images.
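Assuming RDKit is available, the two fingerprints named in the abstract can be computed as follows; this is a sketch of the captioning featurization, not the paper's full pipeline.

```python
from rdkit import Chem
from rdkit.Chem import AllChem, MACCSkeys

def fingerprint_caption(smiles):
    # Morgan fingerprint (radius 2, 2048 bits) encodes local substructure;
    # MACCS keys encode yes/no structural questions. Concatenated, they
    # form the binary vector used to caption the molecular image.
    mol = Chem.MolFromSmiles(smiles)
    morgan = list(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048))
    maccs = list(MACCSkeys.GenMACCSKeys(mol))
    return morgan + maccs

print(sum(fingerprint_caption("CCO")))  # number of set bits for ethanol
```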
59. Forget About the LiDAR: Self-Supervised Depth Estimators with MED Probability Volumes [PDF] 返回目录
Juan Luis Gonzalez, Munchurl Kim
Abstract: Self-supervised depth estimators have recently shown results comparable to the supervised methods on the challenging single image depth estimation (SIDE) task, by exploiting the geometrical relations between target and reference views in the training data. However, previous methods usually learn forward or backward image synthesis, but not depth estimation, as they cannot effectively neglect occlusions between the target and the reference images. Previous works rely on rigid photometric assumptions or the SIDE network to infer depth and occlusions, resulting in limited performance. In contrast, we propose a method to "Forget About the LiDAR" (FAL) for the training of depth estimators, with Mirrored Exponential Disparity (MED) probability volumes, from which we obtain geometrically inspired occlusion maps with our novel Mirrored Occlusion Module (MOM). Our MOM does not impose a burden on our FAL-net. Contrary to previous methods that learn SIDE from stereo pairs by regressing disparity in the linear space, our FAL-net regresses disparity by binning it into the exponential space, which allows for better detection of distant and nearby objects. We define a two-step training strategy for our FAL-net: it is first trained for view synthesis and then fine-tuned for depth estimation with our MOM. Our FAL-net is remarkably lightweight and outperforms the previous state-of-the-art methods, with 8x fewer parameters and 3x faster inference speed on the challenging KITTI dataset. We present extensive experimental results on the KITTI, CityScapes, and Make3D datasets to verify our method's effectiveness. To the authors' best knowledge, the presented method performs best among all previous self-supervised methods to date.
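One plausible reading of "binning disparity into the exponential space" is geometrically spaced disparity levels, sketched below; the exact parameterization in FAL-net may differ.

```python
import numpy as np

def exponential_disparity_levels(d_min, d_max, n_bins):
    # Geometrically spaced levels: fine granularity at small disparities
    # (distant objects) and coarse steps at large ones (nearby objects),
    # unlike uniform binning in linear disparity space.
    t = np.arange(n_bins) / (n_bins - 1)
    return d_min * (d_max / d_min) ** t

print(exponential_disparity_levels(0.5, 128.0, 8).round(2))
```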
60. Appearance-free Tripartite Matching for Multiple Object Tracking [PDF] 返回目录
Lijun Wang, Yanting Zhu, Jue Shi, Xiaodan Fan
Abstract: Multiple Object Tracking (MOT) detects the trajectories of multiple objects in an input video, and it has become more and more popular in various research and industry areas, such as cell tracking for biomedical research and human tracking in video surveillance. We target the general MOT problem regardless of object appearance. Appearance-free tripartite matching is proposed to avoid the irregular-velocity problem of traditional bipartite matching. The tripartite matching is formulated as maximizing the likelihood of the state vectors composed of the positions and velocities of objects, and a dynamic programming algorithm is employed to solve this maximum likelihood estimate (MLE). To overcome the high computational cost induced by the vast search space of dynamic programming, we decompose the space by the number of disappearing objects and propose a reduced-space approach by truncating the decomposition. Extensive simulations have shown the superiority and efficiency of our proposed method. We also applied our method to track the motion of natural killer cells around tumor cells in cancer research.
61. Tracking in Crowd is Challenging: Analyzing Crowd based on Physical Characteristics [PDF] 返回目录
Constantinou Miti, Demetriou Zatte, Siraj Sajid Gondal
Abstract: Currently, the safety of people has become a very important problem in various places, including subway stations, universities, colleges, airports, shopping malls, and city squares. Therefore, intelligent event detection systems are increasingly and urgently required. The event detection method is developed to identify abnormal behavior intelligently, so the public can act as soon as possible to prevent unwanted activities. The problem is very challenging due to high crowd density in different areas. One of these issues is occlusion, due to which individual tracking and analysis become impossible, as shown in Fig. 1. Secondly, and more challenging, is the proper representation of individual behavior in the crowd. We propose a novel method to deal with these challenges. Considering the challenge of tracking, we partition the complete frame into smaller patches and extract a motion pattern to describe the motion in each individual patch. For this purpose, our work uses KLT corners as consolidated features to describe moving regions and tracks these features with the optical flow method. To embed motion patterns, we model the distribution of all motion information in a patch as a Gaussian distribution and use the parameters of the Gaussian model as our motion pattern descriptor.
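The per-patch Gaussian motion-pattern descriptor can be sketched directly from this formulation: fit a Gaussian to the 2D flow vectors inside a patch and keep its parameters; shapes here are illustrative.

```python
import numpy as np

def patch_motion_descriptor(flow_patch):
    # flow_patch: (H, W, 2) motion vectors inside one patch, e.g. optical
    # flow at tracked KLT corners. The descriptor is the mean and the
    # covariance of the 2D vectors, i.e. the Gaussian model's parameters.
    vecs = flow_patch.reshape(-1, 2)
    return vecs.mean(axis=0), np.cov(vecs, rowvar=False)

mu, cov = patch_motion_descriptor(np.random.randn(32, 32, 2))
print(mu.shape, cov.shape)  # (2,) (2, 2)
```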
62. Enhance CNN Robustness Against Noises for Classification of 12-Lead ECG with Variable Length [PDF] 返回目录
Linhai Ma, Liang Liang
Abstract: Electrocardiogram (ECG) is the most widely used diagnostic tool to monitor the condition of the cardiovascular system. Deep neural networks (DNNs) have been developed in many research labs for automatic interpretation of ECG signals to identify potential abnormalities in patient hearts. Studies have shown that given a sufficiently large amount of data, the classification accuracy of DNNs could reach human-expert cardiologist level. However, despite the excellent performance in classification accuracy, it has been shown that DNNs are highly vulnerable to adversarial noises, which are subtle changes in the input of a DNN that lead to a wrong class-label prediction with high confidence. Thus, it is challenging and essential to improve the robustness of DNNs against adversarial noises for ECG signal classification, a life-critical application. In this work, we designed a CNN for classification of 12-lead ECG signals with variable length, and we applied three defense methods to improve the robustness of this CNN for this classification task. The ECG data in this study are very challenging because the sample size is limited and the length of each ECG recording varies over a large range. The evaluation results show that our customized CNN achieved a satisfying F1 score and average accuracy, comparable to the top-6 entries in the CPSC2018 ECG classification challenge, and the defense methods enhanced the robustness of our CNN against adversarial noises and white noises, with a minimal reduction in accuracy on clean data.
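The abstract does not spell out the three defense methods, so the sketch below shows one standard defense of this kind, adversarial training with FGSM-style noise, in PyTorch; treat it as an illustration rather than the paper's method.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_step(model, optimizer, ecg, labels, eps=0.01):
    # ecg: (batch, 12, length) 12-lead recordings. Build a worst-case
    # perturbation within an eps-ball (FGSM), then train on it.
    ecg_adv = ecg.clone().requires_grad_(True)
    loss = F.cross_entropy(model(ecg_adv), labels)
    grad, = torch.autograd.grad(loss, ecg_adv)
    ecg_adv = (ecg + eps * grad.sign()).detach()   # FGSM perturbation
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(ecg_adv), labels)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()
```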
63. From Rain Removal to Rain Generation [PDF] 返回目录
Hong Wang, Zongsheng Yue, Qi Xie, Qian Zhao, Deyu Meng
Abstract: Single image deraining is an important yet challenging issue due to the complex and diverse rain structures in real scenes. Currently, the state-of-the-art performance on this task is achieved by deep learning (DL)-based methods that mainly benefit from abundant pre-collected paired rainy-clean samples, either manually synthesized or semi-automatically generated under human supervision. This tends to require substantial labor for data collection and, more importantly, neglects to explore the intrinsic generative mechanism of rain streaks, which should underpin the most insightful understanding of the task. To address this issue, we investigate the generative process of rainy images and construct a full Bayesian generative model for generating rains from automatically extracted latent variables that represent physical structural factors for depicting rains, like direction, scale, and thickness. To solve this model, we propose an algorithm where the posteriors of latent variables are parameterized as CNNs and all the involved parameters can be inferred under a concise variational inference framework in a data-driven manner. In particular, the rain layer is modeled as an implicit distribution, parameterized as a generator, which avoids subjective prior assumptions on rains as in traditional model-based methods. More practically, from the learned generator, rain patches can be automatically generated and utilized to simulate diverse training pairs so as to enrich and augment existing benchmark datasets. Comprehensive experiments substantiate that the proposed model has a fine capability of generating plausible samples, which not only helps significantly improve the deraining performance of current DL-based single image derainers, but also largely loosens the requirement of large pre-collected training samples for the task.
64. Cross-modal Center Loss [PDF] 返回目录
Longlong Jing, Elahe Vahdani, Jiaxing Tan, Yingli Tian
Abstract: Cross-modal retrieval aims to learn discriminative and modal-invariant features for data from different modalities. Unlike the existing methods, which usually learn from the features extracted by offline networks, in this paper we propose an approach to jointly train the components of a cross-modal retrieval framework with metadata, enabling the network to find optimal features. The proposed end-to-end framework is updated with three loss functions: 1) a novel cross-modal center loss to eliminate cross-modal discrepancy, 2) a cross-entropy loss to maximize inter-class variations, and 3) a mean-square-error loss to reduce modality variations. In particular, our proposed cross-modal center loss minimizes the distances of features from objects belonging to the same class across all modalities. Extensive experiments have been conducted on retrieval tasks across multiple modalities, including 2D images, 3D point clouds, and mesh data. The proposed framework significantly outperforms the state-of-the-art methods on the ModelNet40 dataset.
65. LPMNet: Latent Part Modification and Generation for 3D Point Clouds [PDF] 返回目录
Cihan Öngün, Alptekin Temizel
Abstract: In this paper, we focus on latent modification and generation of 3D point cloud object models with respect to their semantic parts. Different from existing methods, which use separate networks for part generation and assembly, we propose a single end-to-end autoencoder model that can handle generation and modification of both semantic parts and global shapes. The proposed method supports part exchange between 3D point cloud models and composition of new models from different parts by directly editing latent representations. This holistic approach does not need part-based training to learn part representations and does not introduce any extra loss besides the standard reconstruction loss. The experiments demonstrate the robustness of the proposed method across different object categories and varying numbers of points. The method can generate new models by integration of generative models such as GANs and VAEs, and can work with unannotated point clouds by integration of a segmentation module.
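The part-exchange idea can be illustrated with a few lines of latent editing. The sketch below assumes (purely for illustration) that the encoder produces a latent vector laid out as the concatenation of equal-sized per-part codes; swapping one slice between two shapes composes a new model that the shared decoder can then reconstruct.

    import torch

    def swap_part_latents(z_a, z_b, part_idx, num_parts):
        """Sketch of part exchange in latent space (assumed layout: the latent
        vector concatenates num_parts equal-sized part codes). Swapping one
        slice composes a new model from two shapes."""
        part_dim = z_a.size(-1) // num_parts
        lo, hi = part_idx * part_dim, (part_idx + 1) * part_dim
        z_new = z_a.clone()
        z_new[..., lo:hi] = z_b[..., lo:hi]
        return z_new  # decode with the shared decoder to get the edited point cloud

    z_chair_a, z_chair_b = torch.randn(1, 256), torch.randn(1, 256)
    z_mixed = swap_part_latents(z_chair_a, z_chair_b, part_idx=2, num_parts=4)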
66. Assisting Scene Graph Generation with Self-Supervision [PDF] 返回目录
Sandeep Inuganti, Vineeth N Balasubramanian
Abstract: Research in scene graph generation has quickly gained traction in the past few years because of its potential to help in downstream tasks like visual question answering, image captioning, etc. Many interesting approaches have been proposed to tackle this problem. Most of these works use a pre-trained object detection model as a preliminary feature extractor. Therefore, getting object bounding box proposals from the object detection model is relatively cheap. We take advantage of this ready availability of bounding box annotations produced by the pre-trained detector. We propose a set of three novel yet simple self-supervision tasks and train them as auxiliary multi-tasks alongside the main model. When we train the base model from scratch with these self-supervision tasks, we achieve state-of-the-art results in all the metrics and recall settings. We also resolve some of the confusion between two types of relationships, geometric and possessive, by training the model with the proposed self-supervision losses. We use the benchmark dataset Visual Genome to conduct our experiments and show our results.
67. Forming Local Intersections of Projections for Classifying and Searching Histopathology Images [PDF] 返回目录
Aditya Sriram, Shivam Kalra, Morteza Babaie, Brady Kieffer, Waddah Al Drobi, Shahryar Rahnamayan, Hany Kashani, Hamid R. Tizhoosh
Abstract: In this paper, we propose a novel image descriptor called Forming Local Intersections of Projections (FLIP), along with its multi-resolution version (mFLIP), for representing histopathology images. The descriptor is based on the Radon transform, wherein we apply parallel projections in small local neighborhoods of gray-level images. Using equidistant projection directions in each window, we extract unique and invariant characteristics of the neighborhood by taking the intersection of adjacent projections. Thereafter, we construct a histogram for each image, which we call the FLIP histogram. Various resolutions provide different FLIP histograms, which are then concatenated to form the mFLIP descriptor. Our experiments included training common networks from scratch and fine-tuning pre-trained networks to benchmark our proposed descriptor. Experiments are conducted on the publicly available datasets KIMIA Path24 and KIMIA Path960. For both of these datasets, the FLIP and mFLIP descriptors show promising results in all experiments. Using KIMIA Path24 data, FLIP outperformed non-fine-tuned Inception-v3 and fine-tuned VGG16, and mFLIP outperformed fine-tuned Inception-v3 in feature extraction.
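For intuition, here is a hedged sketch of a FLIP-style computation using scikit-image's Radon transform: equidistant projections are taken in small windows and adjacent projections are combined. Interpreting the "intersection" of adjacent projections as their elementwise minimum is an assumption made for this sketch, as are the window size and bin count.

    import numpy as np
    from skimage.transform import radon

    def flip_histogram(gray, window=16, n_dirs=8, bins=32):
        """FLIP-style sketch: equidistant Radon projections in small local
        windows, combined pairwise between adjacent directions (elementwise
        minimum, an assumption) and pooled into one histogram."""
        thetas = np.linspace(0.0, 180.0, n_dirs, endpoint=False)
        values = []
        h, w = gray.shape
        for y in range(0, h - window + 1, window):
            for x in range(0, w - window + 1, window):
                patch = gray[y:y + window, x:x + window].astype(float)
                sino = radon(patch, theta=thetas, circle=False)  # (rays, n_dirs)
                for k in range(n_dirs):
                    inter = np.minimum(sino[:, k], sino[:, (k + 1) % n_dirs])
                    values.append(inter)
        values = np.concatenate(values)
        hist, _ = np.histogram(values, bins=bins)
        return hist / max(hist.sum(), 1)

    hist = flip_histogram(np.random.rand(64, 64))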
68. Learning CNN filters from user-drawn image markers for coconut-tree image classification [PDF] 返回目录
Italos Estilon de Souza, Alexandre X. Falcão
Abstract: Identifying species of trees in aerial images is essential for land-use classification, plantation monitoring, and impact assessment of natural disasters. The manual identification of trees in aerial images is tedious, costly, and error-prone, so automatic classification methods are necessary. Convolutional Neural Network (CNN) models have been highly successful in image classification applications across different domains. However, CNN models usually require intensive manual annotation to create large training sets. One may conceptually divide a CNN into convolutional layers for feature extraction and fully connected layers for feature space reduction and classification. We present a method that needs a minimal set of user-selected images to train the CNN's feature extractor, reducing the number of required images to train the fully connected layers. The method learns the filters of each convolutional layer from user-drawn markers in image regions that discriminate classes, allowing better user control and understanding of the training process. It does not rely on backpropagation-based optimization, and we demonstrate its advantages on the binary classification of coconut-tree aerial images against one of the most popular CNN models.
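The sketch below illustrates one backpropagation-free way to obtain convolutional filters from user-drawn markers, in the spirit of the abstract: patches centered on marked pixels are normalized and clustered, and the k-means centroids serve as the filter bank. The patch normalization and clustering choices are assumptions, not the authors' exact procedure.

    import numpy as np
    from sklearn.cluster import KMeans

    def filters_from_markers(image, marker_coords, ksize=5, n_filters=8):
        """Backprop-free filter learning sketch: gather ksize x ksize patches
        centered on user-marked pixels and use normalized k-means centroids
        as the convolutional filter bank."""
        r = ksize // 2
        patches = []
        for (y, x) in marker_coords:
            p = image[y - r:y + r + 1, x - r:x + r + 1]
            if p.shape == (ksize, ksize):
                p = p - p.mean()                       # zero-mean patch
                patches.append(p.ravel() / (np.linalg.norm(p) + 1e-8))
        km = KMeans(n_clusters=n_filters, n_init=10).fit(np.stack(patches))
        return km.cluster_centers_.reshape(n_filters, ksize, ksize)

    img = np.random.rand(128, 128)
    marks = [(20, 30), (40, 41), (60, 70), (80, 90), (100, 15),
             (25, 95), (55, 12), (90, 88), (33, 44), (77, 66)]
    bank = filters_from_markers(img, marks)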
69. A Unified Framework for Shot Type Classification Based on Subject Centric Lens [PDF] 返回目录
Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, Dahua Lin
Abstract: Shots are key narrative elements of various videos, e.g. movies, TV series, and user-generated videos that are thriving over the Internet. The types of shots greatly influence how the underlying ideas, emotions, and messages are expressed. The technique to analyze shot types is important to the understanding of videos, which has seen increasing demand in real-world applications in this era. Classifying shot type is challenging due to the additional information required beyond the video content, such as the spatial composition of a frame and camera movement. To address these issues, we propose a learning framework Subject Guidance Network (SGNet) for shot type recognition. SGNet separates the subject and background of a shot into two streams, serving as separate guidance maps for scale and movement type classification respectively. To facilitate shot type analysis and model evaluations, we build a large-scale dataset MovieShots, which contains 46K shots from 7K movie trailers with annotations of their scale and movement types. Experiments show that our framework is able to recognize these two attributes of shot accurately, outperforming all the previous methods.
70. Online Multi-modal Person Search in Videos [PDF] 返回目录
Jiangyue Xia, Anyi Rao, Qingqiu Huang, Linning Xu, Jiangtao Wen, Dahua Lin
Abstract: The task of searching certain people in videos has seen increasing potential in real-world applications, such as video organization and editing. Most existing approaches are devised to work in an offline manner, where identities can only be inferred after an entire video is examined. This working manner precludes such methods from being applied to online services or those applications that require real-time responses. In this paper, we propose an online person search framework, which can recognize people in a video on the fly. This framework maintains a multimodal memory bank at its heart as the basis for person recognition, and updates it dynamically with a policy obtained by reinforcement learning. Our experiments on a large movie dataset show that the proposed method is effective, not only achieving remarkable improvements over online schemes but also outperforming offline methods.
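A simplified sketch of the memory-bank mechanism: an incoming feature is matched against stored identity prototypes by cosine similarity, and the matched prototype is refreshed. Note that the paper learns the update policy with reinforcement learning; the fixed EMA update and threshold here are stand-ins.

    import torch
    import torch.nn.functional as F

    class MemoryBank:
        """Identity memory bank sketch for online person search: match an
        incoming feature against stored prototypes and update the matched
        slot (fixed EMA stands in for the paper's learned RL policy)."""
        def __init__(self, momentum=0.9, threshold=0.7):
            self.protos, self.m, self.t = [], momentum, threshold

        def query_and_update(self, feat):
            feat = F.normalize(feat, dim=0)
            if self.protos:
                sims = torch.stack([torch.dot(feat, p) for p in self.protos])
                best = int(sims.argmax())
                if sims[best] >= self.t:               # known identity: refresh slot
                    self.protos[best] = F.normalize(
                        self.m * self.protos[best] + (1 - self.m) * feat, dim=0)
                    return best
            self.protos.append(feat)                   # unseen identity: new slot
            return len(self.protos) - 1

    bank = MemoryBank()
    ids = [bank.query_and_update(torch.randn(128)) for _ in range(5)]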
71. HASeparator: Hyperplane-Assisted Softmax [PDF] 返回目录
Ioannis Kansizoglou, Nicholas Santavas, Loukas Bampis, Antonios Gasteratos
Abstract: Efficient feature learning with Convolutional Neural Networks (CNNs) constitutes an increasingly imperative property, since several challenging tasks of computer vision tend to require cascade schemes and modality fusion. Feature learning aims at CNN models capable of extracting embeddings that exhibit high discrimination among the different classes, as well as intra-class compactness. In this paper, a novel approach is introduced that incorporates a separator focusing on an effective hyperplane-based segregation of the classes, instead of the common class-center separation scheme. Accordingly, an innovative separator, namely the Hyperplane-Assisted Softmax separator (HASeparator), is proposed that demonstrates superior discrimination capabilities, as evaluated on popular image classification benchmarks.
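One plausible formalization of a hyperplane-based separation objective is sketched below: for a linear classifier, the decision boundary between classes i and j is the hyperplane (W[i] - W[j])·x = 0, and the loss penalizes embeddings whose signed distance to these hyperplanes falls below a margin. This exact loss is an illustrative assumption, not the paper's HASeparator definition.

    import torch
    import torch.nn.functional as F

    def hyperplane_margin_loss(x, W, y, margin=1.0):
        """Penalize embeddings whose signed distance to each pairwise decision
        hyperplane (W[y] - W[j]) . x = 0 falls below a margin (illustrative
        formulation only)."""
        logits = x @ W.t()                            # (B, C) class scores
        own = logits.gather(1, y[:, None])            # (B, 1) true-class score
        diff = own - logits                           # (B, C): x . (W[y] - W[j])
        norms = torch.cdist(W[y], W).clamp_min(1e-8)  # (B, C): ||W[y] - W[j]||
        dist = diff / norms                           # signed distances to hyperplanes
        mask = F.one_hot(y, W.size(0)).bool()         # exclude the j == y entries
        return F.relu(margin - dist[~mask]).mean()

    x, W = torch.randn(8, 64), torch.randn(10, 64, requires_grad=True)
    loss = hyperplane_margin_loss(x, W, torch.randint(0, 10, (8,)))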
72. How Trustworthy are the Existing Performance Evaluations for Basic Vision Tasks? [PDF] 返回目录
Hamid Rezatofighi, Tran Thien Dat Nguyen, Ba-Ngu Vo, Ba-Tuong Vo, Silvio Savarese, Ian Reid
Abstract: Performance evaluation is indispensable to the advancement of machine vision, yet its consistency and rigour have not received proportionate attention. This paper examines performance evaluation criteria for basic vision tasks namely, object detection, instance-level segmentation and multi-object tracking. Specifically, we advocate the use of criteria that are (i) consistent with mathematical requirements such as the metric properties, (ii) contextually meaningful in sanity tests, and (iii) robust to hyper-parameters for reliability. We show that many widely used performance criteria do not fulfill these requirements. Moreover, we explore alternative criteria for detection, segmentation, and tracking, using metrics for sets of shapes, and assess them against these requirements.
73. Multimodal Image-to-Image Translation via Mutual Information Estimation and Maximization [PDF] 返回目录
Zhiwen Zuo, Qijiang Xu, Huiming Zhang, Zhizhong Wang, Haibo Chen, Ailin Li, Lei Zhao, Wei Xing, Dongming Lu
Abstract: In this paper, we present a novel framework that can achieve multimodal image-to-image translation by simply encouraging the statistical dependence between the latent code and the output image in conditional generative adversarial networks. In addition, by incorporating a U-net generator into our framework, our method only needs to train a one-sided translation model from the source image domain to the target image domain for both supervised and unsupervised multimodal image-to-image translation. Furthermore, our method also achieves disentanglement between the source domain content and the target domain style for free. We conduct experiments under supervised and unsupervised settings on various benchmark image-to-image translation datasets compared with the state-of-the-art methods, showing the effectiveness and simplicity of our method to achieve multimodal and high-quality results.
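As a concrete stand-in for the MI term, the sketch below implements the standard Donsker-Varadhan (MINE-style) lower bound on the mutual information between a latent code and an output feature; maximizing it encourages the statistical dependence the abstract describes. Whether the paper uses this particular estimator is not assumed.

    import torch
    import torch.nn as nn

    class MINE(nn.Module):
        """Donsker-Varadhan mutual-information lower bound (a standard choice
        for MI estimation/maximization; not assumed to be the paper's exact
        estimator)."""
        def __init__(self, code_dim, feat_dim, hidden=128):
            super().__init__()
            self.T = nn.Sequential(
                nn.Linear(code_dim + feat_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, z, f):
            joint = self.T(torch.cat([z, f], dim=1)).mean()
            f_shuf = f[torch.randperm(f.size(0))]            # break the pairing
            marg = self.T(torch.cat([z, f_shuf], dim=1)).exp().mean().log()
            return joint - marg                              # maximize this bound

    mine = MINE(code_dim=8, feat_dim=32)
    mi_lb = mine(torch.randn(16, 8), torch.randn(16, 32))
    (-mi_lb).backward()   # as a loss term: maximize MI between code and output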
74. Unravelling Small Sample Size Problems in the Deep Learning World [PDF] 返回目录
Rohit Keshari, Soumyadeep Ghosh, Saheb Chhabra, Mayank Vatsa, Richa Singh
Abstract: The growth and success of deep learning approaches can be attributed to two major factors: availability of hardware resources and availability of large numbers of training samples. For problems with large training databases, deep learning models have achieved superlative performances. However, there are a lot of small sample size (or $S^3$) problems for which it is not feasible to collect large training databases. It has been observed that deep learning models do not generalize well on $S^3$ problems, and specialized solutions are required. In this paper, we first present a review of deep learning algorithms for small sample size problems in which the algorithms are segregated according to the space in which they operate, i.e. input space, model space, and feature space. Secondly, we present a Dynamic Attention Pooling approach which focuses on extracting global information from the most discriminative sub-part of the feature map. The performance of the proposed dynamic attention pooling is analyzed with the state-of-the-art ResNet model on relatively small publicly available datasets such as SVHN, C10, C100, and TinyImageNet.
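A minimal sketch of attention-based global pooling in the spirit of the described approach: a 1x1 convolution scores each spatial location, and the softmax-weighted sum concentrates the pooled vector on the most discriminative sub-part of the feature map. The paper's Dynamic Attention Pooling may differ in its details.

    import torch
    import torch.nn as nn

    class AttentionPool2d(nn.Module):
        """Attention pooling sketch: a 1x1 conv scores every location, a
        softmax turns scores into weights, and the output is the weighted
        sum, so pooling focuses on the most discriminative sub-part."""
        def __init__(self, channels):
            super().__init__()
            self.score = nn.Conv2d(channels, 1, kernel_size=1)

        def forward(self, x):                   # x: (B, C, H, W)
            b, c, h, w = x.shape
            attn = self.score(x).view(b, 1, h * w).softmax(dim=-1)
            return (x.view(b, c, h * w) * attn).sum(dim=-1)   # (B, C)

    pool = AttentionPool2d(256)
    vec = pool(torch.randn(2, 256, 7, 7))       # replaces global average pooling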
75. Towards Lossless Binary Convolutional Neural Networks Using Piecewise Approximation [PDF] 返回目录
Baozhou Zhu, Zaid Al-Ars, Wei Pan
Abstract: Binary Convolutional Neural Networks (CNNs) can significantly reduce the number of arithmetic operations and the size of memory storage, which makes the deployment of CNNs on mobile or embedded systems more promising. However, the accuracy degradation of single and multiple binary CNNs is unacceptable for modern architectures and large-scale datasets like ImageNet. In this paper, we propose a Piecewise Approximation (PA) scheme for multiple binary CNNs which lessens accuracy loss by approximating full-precision weights and activations efficiently, and maintains the parallelism of bitwise operations to guarantee efficiency. Unlike previous approaches, the proposed PA scheme segments the full-precision weights and activations piecewise and approximates each piece with a scaling coefficient. Our implementation on ResNet with different depths on ImageNet reduces both the Top-1 and Top-5 classification accuracy gap with respect to full precision to approximately 1.0%. Benefiting from the binarization of the downsampling layer, our proposed PA-ResNet50 requires less memory and two times fewer FLOPs than single binary CNNs with 4 weight and 5 activation bases. The PA scheme can also generalize to other architectures such as DenseNet and MobileNet with approximation power similar to ResNet, which is promising for other tasks using binary convolutions. The code and pretrained models will be publicly available.
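For background, the generic multi-basis binary approximation underlying such schemes can be sketched in a few lines: the full-precision tensor is greedily approximated by a sum of scaled sign bases. The paper's Piecewise Approximation additionally segments the value range piecewise, which this sketch omits.

    import torch

    def binary_bases(w, k=4):
        """Greedy residual approximation of a full-precision tensor by k
        scaled binary bases: w ~ sum_i alpha_i * b_i with b_i in {-1, +1}.
        (Generic multi-basis scheme; PA's piecewise segmentation is omitted.)"""
        alphas, bases, r = [], [], w.clone()
        for _ in range(k):
            b = torch.sign(r)
            b[b == 0] = 1.0
            a = r.abs().mean()        # least-squares optimal scale for a sign basis
            alphas.append(a); bases.append(b)
            r = r - a * b             # shrink the residual
        return alphas, bases

    w = torch.randn(64, 64)
    alphas, bases = binary_bases(w, k=4)
    w_hat = sum(a * b for a, b in zip(alphas, bases))
    print((w - w_hat).abs().mean())   # approximation error shrinks with k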
76. NASB: Neural Architecture Search for Binary Convolutional Neural Networks [PDF] 返回目录
Baozhou Zhu, Zaid Al-Ars, Peter Hofstee
Abstract: Binary Convolutional Neural Networks (CNNs) have significantly reduced the number of arithmetic operations and the size of memory storage needed for CNNs, which makes their deployment on mobile and embedded systems more feasible. However, the CNN architecture after binarization needs to be redesigned and refined significantly due to two reasons: 1. the large accumulation error of binarization in the forward propagation, and 2. the severe gradient mismatch problem of binarization in the backward propagation. Even though substantial effort has been invested in designing architectures for single and multiple binary CNNs, it is still difficult to find an optimal architecture for binary CNNs. In this paper, we propose a strategy, named NASB, which adopts Neural Architecture Search (NAS) to find an optimal architecture for the binarization of CNNs. Due to the flexibility of this automated strategy, the obtained architecture is not only suitable for binarization but also has low overhead, achieving a better trade-off between the accuracy and computational complexity of hand-optimized binary CNNs. The implementation of the NASB strategy is evaluated on the ImageNet dataset and demonstrated as a better solution compared to existing quantized CNNs. With an insignificant overhead increase, NASB outperforms existing single and multiple binary CNNs by up to 4.0% and 1.0% Top-1 accuracy respectively, bringing them closer to the precision of their full-precision counterpart. The code and pretrained models will be publicly available.
77. Hard Negative Samples Emphasis Tracker without Anchors [PDF] 返回目录
Zhongzhou Zhang, Lei Zhang
Abstract: Trackers based on Siamese networks have shown tremendous success because of their balance between accuracy and speed. Nevertheless, with tracking scenarios becoming more and more sophisticated, most existing Siamese-based approaches ignore the problem of distinguishing the tracking target from hard negative samples in the tracking phase. The features learned by these networks lack discrimination, which significantly weakens the robustness of Siamese-based trackers and leads to suboptimal performance. To address this issue, we propose a simple yet efficient hard negative samples emphasis method, which constrains the Siamese network to learn features that are aware of hard negative samples and enhances the discrimination of the embedding features. Through a distance constraint, we shorten the distance between the exemplar vector and positive vectors while enlarging the distance between the exemplar vector and hard negative vectors. Furthermore, we explore a novel anchor-free tracking framework in a per-pixel prediction fashion, which can significantly reduce the number of hyper-parameters and simplify the tracking process by taking full advantage of the representation of the convolutional neural network. Extensive experiments on six standard benchmark datasets demonstrate that the proposed method achieves favorable results against state-of-the-art approaches.
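The distance constraint can be written down directly as a margin loss, sketched below: positive vectors are pulled toward the exemplar vector while hard negative vectors are pushed beyond a margin. The specific margin form is an assumption.

    import torch
    import torch.nn.functional as F

    def hard_negative_emphasis_loss(exemplar, positives, hard_negatives, margin=0.5):
        """Pull positive vectors toward the exemplar vector and push hard
        negative vectors beyond a margin (margin form is illustrative)."""
        d_pos = F.pairwise_distance(exemplar.expand_as(positives), positives)
        d_neg = F.pairwise_distance(exemplar.expand_as(hard_negatives), hard_negatives)
        return d_pos.mean() + F.relu(margin - d_neg).mean()

    ex = torch.randn(1, 128)
    loss = hard_negative_emphasis_loss(ex, torch.randn(4, 128), torch.randn(16, 128))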
78. Single-Shot Two-Pronged Detector with Rectified IoU Loss [PDF] 返回目录
Keyang Wang, Lei Zhang
Abstract: In CNN-based object detectors, feature pyramids are widely exploited to alleviate the problem of scale variation across object instances. These object detectors, which strengthen features via a top-down pathway and lateral connections, mainly enrich the semantic information of low-level features but ignore the enhancement of high-level features. This can lead to an imbalance between different levels of features, in particular a serious lack of detailed information in the high-level features, which makes it difficult to get accurate bounding boxes. In this paper, we introduce a novel two-pronged transductive idea to explore the relationship among different layers in both backward and forward directions, which can enrich the semantic information of low-level features and the detailed information of high-level features at the same time. Under the guidance of the two-pronged idea, we propose a Two-Pronged Network (TPNet) to achieve bidirectional transfer between high-level features and low-level features, which is useful for accurately detecting objects at different scales. Furthermore, due to the distribution imbalance between hard and easy samples in single-stage detectors, the gradient of the localization loss is always dominated by the hard examples that have poor localization accuracy. This biases the model toward the hard samples. So in our TPNet, an adaptive IoU-based localization loss, named Rectified IoU (RIoU) loss, is proposed to rectify the gradients of each kind of sample. The Rectified IoU loss increases the gradients of examples with high IoU while suppressing the gradients of examples with low IoU, which can improve the overall localization accuracy of the model. Extensive experiments demonstrate the superiority of our TPNet and the RIoU loss.
79. Hierarchical Bi-Directional Feature Perception Network for Person Re-Identification [PDF] 返回目录
Zhipu Liu, Lei Zhang, Yang Yang
Abstract: Previous Person Re-Identification (Re-ID) models aim to focus on the most discriminative region of an image, but their performance may be compromised when that region is missing due to camera viewpoint changes or occlusion. To solve this issue, we propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net) to correlate multi-level information and let the levels reinforce each other. First, the correlation maps of cross-level feature pairs are modeled via low-rank bilinear pooling. Then, based on the correlation maps, a Bi-directional Feature Perception (BFP) module is employed to enrich the attention regions of high-level features and to learn abstract and specific information in low-level features. We then propose a novel end-to-end hierarchical network which integrates multi-level augmented features and feeds the augmented low- and middle-level features to the following layers to retrain a new, powerful network. What's more, we propose a novel trainable generalized pooling, which can dynamically select any value at any location in the feature maps to be activated. Extensive experiments implemented on the mainstream evaluation datasets, including Market-1501, CUHK03 and DukeMTMC-ReID, show that our method outperforms recent SOTA Re-ID models.
80. RPT: Learning Point Set Representation for Siamese Visual Tracking [PDF] 返回目录
Ziang Ma, Linyuan Wang, Haitao Zhang, Wei Lu, Jun Yin
Abstract: While remarkable progress has been made in robust visual tracking, accurate target state estimation still remains a highly challenging problem. In this paper, we argue that this issue is closely related to the prevalent bounding box representation, which provides only a coarse spatial extent of the object. Thus an efficient visual tracking framework is proposed to accurately estimate the target state with a finer representation as a set of representative points. The point set is trained to indicate the semantically and geometrically significant positions of the target region, enabling more fine-grained localization and modeling of object appearance. We further propose a multi-level aggregation strategy to obtain detailed structure information by fusing hierarchical convolution layers. Extensive experiments on several challenging benchmarks, including OTB2015, VOT2018, VOT2019 and GOT-10k, demonstrate that our method achieves new state-of-the-art performance while running at over 20 FPS.
81. PAN: Towards Fast Action Recognition via Learning Persistence of Appearance [PDF] 返回目录
Can Zhang, Yuexian Zou, Guang Chen, Lei Gan
Abstract: Efficiently modeling dynamic motion information in videos is crucial for the action recognition task. Most state-of-the-art methods heavily rely on dense optical flow as motion representation. Although combining optical flow with RGB frames as input can achieve excellent recognition performance, the optical flow extraction is very time-consuming. This undoubtedly counts against real-time action recognition. In this paper, we shed light on fast action recognition by lifting the reliance on optical flow. Our motivation lies in the observation that small displacements of motion boundaries are the most critical ingredients for distinguishing actions, so we design a novel motion cue called Persistence of Appearance (PA). In contrast to optical flow, our PA focuses more on distilling the motion information at boundaries. It is also more efficient, as it only accumulates pixel-wise differences in feature space instead of using exhaustive patch-wise search over all possible motion vectors. Our PA is over 1000x faster (8196 fps vs. 8 fps) than conventional optical flow in terms of motion modeling speed. To further aggregate the short-term dynamics in PA into long-term dynamics, we also devise a global temporal fusion strategy called Various-timescale Aggregation Pooling (VAP) that can adaptively model long-range temporal relationships across various timescales. We finally incorporate the proposed PA and VAP to form a unified framework called Persistent Appearance Network (PAN) with strong temporal modeling ability. Extensive experiments on six challenging action recognition benchmarks verify that our PAN outperforms recent state-of-the-art methods at low FLOPs. Codes and models are available at: this https URL.
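The core of PA, as described, is an accumulation of pixel-wise feature differences rather than a motion-vector search. A hedged sketch follows; accumulating squared channel-wise differences and the normalization are assumptions for illustration.

    import torch

    def persistence_of_appearance(feats):
        """PA-style motion cue sketch: accumulate pixel-wise differences of
        consecutive frames' low-level feature maps, with no patch-wise
        motion-vector search. feats: (T, C, H, W) features of T frames."""
        diffs = feats[1:] - feats[:-1]                  # (T-1, C, H, W)
        pa = (diffs ** 2).sum(dim=1, keepdim=True)      # (T-1, 1, H, W)
        return pa / (pa.amax(dim=(2, 3), keepdim=True) + 1e-8)

    pa = persistence_of_appearance(torch.randn(8, 64, 56, 56))  # one PA map per frame pair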
82. Hard Class Rectification for Domain Adaptation [PDF] 返回目录
Zhang Yunlong, Jing Changxing, Lin Huangxing, Chen Chaoqi, Huang Yue, Ding Xinghao, Zou Yang
Abstract: Domain adaptation (DA) aims to transfer knowledge from a label-rich and related domain (source domain) to a label-scarce domain (target domain). Pseudo-labeling has recently been widely explored and used in DA. However, this line of research is still hampered by the inaccuracy of pseudo-labels. In this paper, we report an interesting observation: target samples belonging to classes with a larger domain shift are more easily misclassified than those of other classes. We call these classes hard classes; they deteriorate the performance of DA and restrict its applications. We propose a novel framework, called Hard Class Rectification Pseudo-labeling (HCRPL), to alleviate the hard class problem from two aspects. First, since it is difficult to identify which target samples belong to hard classes, we propose a simple yet effective scheme, named Adaptive Prediction Calibration (APC), to calibrate the predictions of target samples according to the difficulty degree of each class. Second, we further observe that the predictions of target samples belonging to hard classes are vulnerable to perturbations. To prevent these samples from being easily misclassified, we introduce Temporal-Ensembling (TE) and Self-Ensembling (SE) to obtain consistent predictions. The proposed method is evaluated in both unsupervised domain adaptation (UDA) and semi-supervised domain adaptation (SSDA). Experimental results on several real-world cross-domain benchmarks, including ImageCLEF, Office-31 and Office-Home, substantiate the superiority of the proposed method.
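As an illustration of the calibration step, the sketch below rescales target predictions by a per-class difficulty estimate; the specific rule (mean confidence as the difficulty proxy) is our assumption, not the paper's exact APC formula:

```python
import numpy as np

def adaptive_prediction_calibration(probs, eps=1e-8):
    """probs: (N, K) softmax outputs on target samples."""
    class_conf = probs.mean(axis=0)                   # per-class mean confidence
    weights = class_conf.mean() / (class_conf + eps)  # boost low-confidence (hard) classes
    calibrated = probs * weights
    return calibrated / calibrated.sum(axis=1, keepdims=True)
```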
83. Meta Feature Modulator for Long-tailed Recognition [PDF] 返回目录
Renzhen Wang, Kaiqin Hu, Yanwen Zhu, Jun Shu, Qian Zhao, Deyu Meng
Abstract: Deep neural networks often degrade significantly when training data suffer from class imbalance problems. Existing approaches, e.g., re-sampling and re-weighting, commonly address this issue by rearranging the label distribution of training data so that the trained networks fit well to the implicit balanced label distribution. However, most of them hinder the representative ability of the learned features due to insufficient use of intra/inter-sample information of the training data. To address this issue, we propose the meta feature modulator (MFM), a meta-learning framework to model the difference between the long-tailed training data and the balanced meta data from the perspective of representation learning. Concretely, we employ learnable hyper-parameters (dubbed modulation parameters) to adaptively scale and shift the intermediate features of classification networks, and the modulation parameters are optimized together with the classification network parameters guided by a small amount of balanced meta data. We further design a modulator network to guide the generation of the modulation parameters, and such a meta-learner can be readily adapted to train the classification network on other long-tailed datasets. Extensive experiments on benchmark vision datasets substantiate the superiority of our approach on long-tailed recognition tasks over other state-of-the-art methods.
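The modulation step itself is a learnable per-channel scale and shift of intermediate features; a minimal sketch follows (class name and shapes are illustrative, and the meta-learning loop that optimizes these parameters with balanced meta data is omitted):

```python
import torch
import torch.nn as nn

class FeatureModulator(nn.Module):
    """Learnable per-channel scale/shift applied to intermediate features."""
    def __init__(self, num_channels):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.shift = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, feat):          # feat: (B, C, H, W)
        return feat * self.scale + self.shift
```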
84. Using UNet and PSPNet to explore the reusability principle of CNN parameters [PDF] 返回目录
Wei Wang
Abstract: How to reduce the requirement on training dataset size is a hot topic in the deep learning community. One straightforward way is to reuse pre-trained parameters. Some previous work, like deep transfer learning, reuses the model parameters trained for a first task as the starting point for a second task, while semi-supervised learning trains on a combination of labeled and unlabeled data. However, the fundamental reason for the success of these methods is unclear. In this paper, the reusability of parameters in each layer of a deep convolutional neural network is experimentally quantified by using a network to perform segmentation and auto-encoder tasks. This paper shows that network parameters can be reused for two reasons: first, the network features are general; second, there is little difference between the pre-trained parameters and the ideal network parameters. Through parameter replacement and comparison, we demonstrate that reusability differs between BN (Batch Normalization) [7] layers and convolution layers, and make the following observations: (1) The running mean and running variance play a more important role than the weight and bias in BN layers. (2) The weight and bias can be reused in BN layers. (3) The network is very sensitive to the weights of convolutional layers. (4) The biases in convolution layers are not sensitive, and they can be reused directly.
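A parameter-replacement probe of the kind described can be sketched as follows (a hedged illustration: it copies BN statistics and affine parameters from a pre-trained network into an identically structured fresh one, leaving convolution weights untouched):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def reuse_bn_parameters(pretrained, fresh):
    """Copy BN running stats and affine parameters between two networks
    with identical architectures; convolution weights are left as-is so
    their sensitivity can be probed separately."""
    for m_p, m_f in zip(pretrained.modules(), fresh.modules()):
        if isinstance(m_p, nn.BatchNorm2d) and isinstance(m_f, nn.BatchNorm2d):
            m_f.running_mean.copy_(m_p.running_mean)
            m_f.running_var.copy_(m_p.running_var)
            m_f.weight.copy_(m_p.weight)
            m_f.bias.copy_(m_p.bias)
    return fresh
```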
85. Two-branch Recurrent Network for Isolating Deepfakes in Videos [PDF] 返回目录
Iacopo Masi, Aditya Killekar, Royston Marian Mascarenhas, Shenoy Pratik Gurudatt, Wael AbdAlmageed
Abstract: The current spike of hyper-realistic faces artificially generated using deepfakes calls for media forensics solutions that are tailored to video streams and work reliably with a low false alarm rate at the video level. We present a method for deepfake detection based on a two-branch network structure that isolates digitally manipulated faces by learning to amplify artifacts while suppressing the high-level face content. Unlike current methods that extract spatial frequencies as a preprocessing step, we propose a two-branch structure: one branch propagates the original information, while the other branch suppresses the face content yet amplifies multi-band frequencies using a Laplacian of Gaussian (LoG) as a bottleneck layer. To better isolate manipulated faces, we derive a novel cost function that, unlike regular classification, compresses the variability of natural faces and pushes away the unrealistic facial samples in the feature space. Our two novel components show promising results on the FaceForensics++, Celeb-DF, and Facebook's DFDC preview benchmarks, when compared to prior work. We then offer a full, detailed ablation study of our network architecture and cost function. Finally, although the bar is still high to get very remarkable figures at a very low false alarm rate, our study shows that we can achieve good video-level performance when cross-testing in terms of video-level AUC.
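A Laplacian-of-Gaussian bottleneck of the kind mentioned can be realized as a fixed depthwise convolution; the 5x5 integer kernel below is a standard LoG approximation, not necessarily the paper's exact configuration:

```python
import torch
import torch.nn.functional as F

# Standard 5x5 integer approximation of the Laplacian of Gaussian.
LOG_KERNEL = torch.tensor([[ 0,  0, -1,  0,  0],
                           [ 0, -1, -2, -1,  0],
                           [-1, -2, 16, -2, -1],
                           [ 0, -1, -2, -1,  0],
                           [ 0,  0, -1,  0,  0]], dtype=torch.float32)

def log_bottleneck(x):
    """x: (B, C, H, W); applies the LoG filter to each channel,
    suppressing smooth content while amplifying band-pass structure."""
    c = x.shape[1]
    kernel = LOG_KERNEL.repeat(c, 1, 1, 1)     # (C, 1, 5, 5), depthwise
    return F.conv2d(x, kernel, padding=2, groups=c)
```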
86. VPC-Net: Completion of 3D Vehicles from MLS Point Clouds [PDF] 返回目录
Yan Xia, Yusheng Xu, Cheng Wang, Uwe Stilla
Abstract: Vehicles are the investigation target of greatest concern as a dynamic and essential component of the road environment in urban scenarios. To monitor their behaviors and extract their geometric characteristics, accurate and instant measurement of vehicles plays a vital role in the remote sensing and computer vision fields. 3D point clouds acquired from a mobile laser scanning (MLS) system deliver 3D information of road scenes in unprecedented detail along the driving route. They have proven to be an adequate data source in the fields of intelligent transportation and autonomous driving, especially for extracting vehicles. However, 3D point clouds of vehicles acquired from MLS systems are inevitably incomplete due to object occlusion or self-occlusion. To tackle this problem, we propose a neural network that synthesizes complete, dense, and uniform point clouds for vehicles from MLS data, named Vehicle Points Completion-Net (VPC-Net). In this network, we introduce a new encoder module to extract global features from the input instance, consisting of a spatial transformer network and a point feature enhancement layer. Moreover, a new refiner module is presented to preserve the vehicle details from the inputs and refine the complete outputs with fine-grained information. Given sparse and partial point clouds of vehicles, the network can generate complete and realistic structures and keep the fine-grained details from the partial inputs. We evaluated the proposed VPC-Net in different experiments using synthetic and real-scan datasets and applied the results to 3D vehicle monitoring tasks. Quantitative and qualitative experiments demonstrate the promising performance of VPC-Net and show state-of-the-art results.
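For orientation, a PointNet-style global-feature encoder of the general kind the abstract describes (a shared per-point MLP followed by symmetric max pooling) can be sketched as below; the paper's spatial transformer and point feature enhancement layer are omitted, and all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class GlobalPointEncoder(nn.Module):
    """Shared per-point MLP + max pooling -> one global feature vector."""
    def __init__(self, feat_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Conv1d(3, 64, 1), nn.ReLU(),
                                 nn.Conv1d(64, 256, 1), nn.ReLU(),
                                 nn.Conv1d(256, feat_dim, 1))

    def forward(self, pts):                      # pts: (B, 3, N) partial cloud
        return self.mlp(pts).max(dim=2).values   # (B, feat_dim) global code
```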
87. Hybrid Score- and Rank-level Fusion for Person Identification using Face and ECG Data [PDF] 返回目录
Thomas Truong, Jonathan Graf, Svetlana Yanushkevich
Abstract: Uni-modal identification systems are vulnerable to errors in sensor data collection and are therefore more likely to misidentify subjects. For instance, relying on data solely from an RGB face camera can cause problems in poorly lit environments or if subjects do not face the camera. Other identification methods such as electrocardiograms (ECG) have issues with improper lead connections to the skin. Errors in identification are minimized through the fusion of information gathered from both of these models. This paper proposes a methodology for combining the identification results of face and ECG data using Part A of the BioVid Heat Pain Database containing synchronized RGB-video and ECG data on 87 subjects. Using 10-fold cross-validation, face identification was 98.8% accurate, while the ECG identification was 96.1% accurate. By using a fusion approach the identification accuracy improved to 99.8%. Our proposed methodology allows for identification accuracies to be significantly improved by using disparate face and ECG models that have non-overlapping modalities.
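A score-level fusion step consistent with this approach can be as simple as a weighted sum of normalized match scores; the fixed weight below is illustrative, not the paper's tuned rule:

```python
import numpy as np

def fuse_scores(face_scores, ecg_scores, w_face=0.6):
    """face_scores, ecg_scores: (n_subjects,) match scores in [0, 1].
    Returns the index of the predicted subject after weighted fusion."""
    fused = w_face * face_scores + (1.0 - w_face) * ecg_scores
    return int(np.argmax(fused))
```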
88. Informative Dropout for Robust Representation Learning: A Shape-bias Perspective [PDF] 返回目录
Baifeng Shi, Dinghuai Zhang, Qi Dai, Zhanxing Zhu, Yadong Mu, Jingdong Wang
Abstract: Convolutional Neural Networks (CNNs) are known to rely more on local texture rather than global shape when making decisions. Recent work also indicates a close relationship between CNN's texture-bias and its robustness against distribution shift, adversarial perturbation, random corruption, etc. In this work, we attempt at improving various kinds of robustness universally by alleviating CNN's texture bias. With inspiration from the human visual system, we propose a light-weight model-agnostic method, namely Informative Dropout (InfoDrop), to improve interpretability and reduce texture bias. Specifically, we discriminate texture from shape based on local self-information in an image, and adopt a Dropout-like algorithm to decorrelate the model output from the local texture. Through extensive experiments, we observe enhanced robustness under various scenarios (domain generalization, few-shot classification, image corruption, and adversarial perturbation). To the best of our knowledge, this work is one of the earliest attempts to improve different kinds of robustness in a unified model, shedding new light on the relationship between shape-bias and robustness, also on new approaches to trustworthy machine learning algorithms. Code is available at this https URL.
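A rough sketch of the InfoDrop mechanism follows: locations whose content is highly predictable from their neighborhood (low self-information, i.e., repetitive texture) are dropped with higher probability. The local-variance proxy for self-information used here is our simplification of the paper's formulation:

```python
import torch
import torch.nn.functional as F

def info_drop(x, strength=1.0, eps=1e-3):
    """x: (B, C, H, W) feature map; drops low-information (texture-like)
    locations more aggressively, with inverted-dropout rescaling."""
    mean = F.avg_pool2d(x, 3, stride=1, padding=1)
    var = F.avg_pool2d((x - mean) ** 2, 3, stride=1, padding=1)
    info = var.mean(dim=1, keepdim=True)        # low variance ~ low information
    keep_prob = torch.sigmoid(strength * info)  # keep informative locations
    mask = torch.bernoulli(keep_prob)
    return x * mask / keep_prob.clamp(min=eps)
```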
89. Learning Invariant Feature Representation to Improve Generalization across Chest X-ray Datasets [PDF] 返回目录
Sandesh Ghimire, Satyananda Kashyap, Joy T. Wu, Alexandros Karargyris, Mehdi Moradi
Abstract: Chest radiography is the most common medical image examination for screening and diagnosis in hospitals. Automatic interpretation of chest X-rays at the level of an entry-level radiologist can greatly benefit work prioritization and assist in analyzing a larger population. Subsequently, several datasets and deep learning-based solutions have been proposed to identify diseases based on chest X-ray images. However, these methods are shown to be vulnerable to shift in the source of data: a deep learning model performing well when tested on the same dataset as training data, starts to perform poorly when it is tested on a dataset from a different source. In this work, we address this challenge of generalization to a new source by forcing the network to learn a source-invariant representation. By employing an adversarial training strategy, we show that a network can be forced to learn a source-invariant representation. Through pneumonia-classification experiments on multi-source chest X-ray datasets, we show that this algorithm helps in improving classification accuracy on a new source of X-ray dataset.
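The adversarial strategy is commonly implemented with a gradient reversal layer; a minimal sketch under that assumption (the paper may use a different adversarial formulation):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; gradients are negated (and scaled) on
    the way back, so the feature extractor learns to fool a source-domain
    discriminator and thereby erases source-specific cues."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)
```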
90. Learning Bloch Simulations for MR Fingerprinting by Invertible Neural Networks [PDF] 返回目录
Fabian Balsiger, Alain Jungo, Olivier Scheidegger, Benjamin Marty, Mauricio Reyes
Abstract: Magnetic resonance fingerprinting (MRF) enables fast and multiparametric MR imaging. Despite fast acquisition, the state-of-the-art reconstruction of MRF based on dictionary matching is slow and lacks scalability. To overcome these limitations, neural network (NN) approaches estimating MR parameters from fingerprints have been proposed recently. Here, we revisit NN-based MRF reconstruction to jointly learn the forward process from MR parameters to fingerprints and the backward process from fingerprints to MR parameters by leveraging invertible neural networks (INNs). As a proof-of-concept, we perform various experiments showing the benefit of learning the forward process, i.e., the Bloch simulations, for improved MR parameter estimation. The benefit especially accentuates when MR parameter estimation is difficult due to MR physical restrictions. Therefore, INNs might be a feasible alternative to the current solely backward-based NNs for MRF reconstruction.
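The basic building block of such an invertible network is an affine coupling layer, where the same weights evaluate the forward map and its exact inverse; a minimal sketch (even feature dimension assumed, architecture details are placeholders):

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible block: half the features parameterize an affine
    transform of the other half, invertible in closed form."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))  # -> scale & shift

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        s, t = self.net(x1).chunk(2, dim=-1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        s, t = self.net(y1).chunk(2, dim=-1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=-1)
```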
91. An Explainable 3D Residual Self-Attention Deep Neural Network FOR Joint Atrophy Localization and Alzheimer's Disease Diagnosis using Structural MRI [PDF] 返回目录
Xin Zhang, Liangxiu Han, Wenyong Zhu, Liang Sun, Daoqiang Zhang
Abstract: Computer-aided early diagnosis of Alzheimer's disease (AD) and its prodromal form, mild cognitive impairment (MCI), based on structural Magnetic Resonance Imaging (sMRI) has provided a cost-effective and objective way for early prevention and treatment of disease progression, leading to improved patient care. In this work, we have proposed a novel computer-aided approach for early diagnosis of AD by introducing an explainable 3D Residual Attention Deep Neural Network (3D ResAttNet) for end-to-end learning from sMRI scans. Different from existing approaches, the novelty of our approach is three-fold: 1) A Residual Self-Attention Deep Neural Network is proposed to capture local, global and spatial information of MR images to improve diagnostic performance; 2) An explainable method using Gradient-based Localization Class Activation mapping (Grad-CAM) is introduced to improve the explainability of the proposed method; 3) This work provides a full end-to-end learning solution for automated disease diagnosis. Our proposed 3D ResAttNet method has been evaluated on a large cohort of subjects from real datasets for two challenging classification tasks (i.e., Alzheimer's disease (AD) vs. normal cohort (NC) and progressive MCI (pMCI) vs. stable MCI (sMCI)). The experimental results show that the proposed approach outperforms the state-of-the-art models with a significant performance improvement. The accuracies for the AD vs. NC and sMCI vs. pMCI tasks are 97.1% and 84.1%, respectively. The explainable mechanism in our approach is able to identify and highlight the contribution of important brain regions (hippocampus, lateral ventricle and most parts of the cortex) to transparent decisions.
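For reference, the Grad-CAM component can be sketched in a few lines for a 3D network (the hook plumbing of a real implementation is simplified away; shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def grad_cam_3d(features, class_score):
    """features: (B, C, D, H, W) last-conv activations with requires_grad;
    class_score: scalar logit computed from them. Returns (B, D, H, W) maps."""
    grads = torch.autograd.grad(class_score, features, retain_graph=True)[0]
    weights = grads.mean(dim=(2, 3, 4), keepdim=True)   # GAP over the volume
    cam = F.relu((weights * features).sum(dim=1))       # weighted activation map
    return cam / cam.amax(dim=(1, 2, 3), keepdim=True).clamp(min=1e-8)
```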
92. A model-guided deep network for limited-angle computed tomography [PDF] 返回目录
Wei Wang, Xiang-Gen Xia, Chuanjiang He, Zemin Ren, Jian Lu, Tianfu Wang, Baiying Lei
Abstract: In this paper, we first propose a variational model for limited-angle computed tomography (CT) image reconstruction and then convert the model into an end-to-end deep network. We use the penalty method to solve the model and divide it into three iterative subproblems, where the first subproblem completes the sinograms by utilizing the prior information of sinograms in the frequency domain, the second refines the CT images by using the prior information of CT images in the spatial domain, and the last merges the outputs of the first two subproblems. In each iteration, we use convolutional neural networks (CNNs) to approximate the solutions of the first two subproblems and, thus, obtain an end-to-end deep network for limited-angle CT image reconstruction. Our network tackles both the sinograms and the CT images, and can simultaneously suppress the artifacts caused by the incomplete data and recover fine structural information in the CT images. Experimental results show that our method outperforms the existing algorithms for limited-angle CT image reconstruction.
93. DQI: A Guide to Benchmark Evaluation [PDF] 返回目录
Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan, Chitta Baral
Abstract: A `state of the art' model A surpasses humans in a benchmark B, but fails on similar benchmarks C, D, and E. What does B have that the other benchmarks do not? Recent research provides the answer: spurious bias. However, developing A to solve benchmarks B through E does not guarantee that it will solve future benchmarks. To progress towards a model that `truly learns' an underlying task, we need to quantify the differences between successive benchmarks, as opposed to existing binary and black-box approaches. We propose a novel approach to solve this underexplored task of quantifying benchmark quality by debuting a data quality metric: DQI.
94. RARTS: a Relaxed Architecture Search Method [PDF] 返回目录
Fanghui Xue, Yingyong Qi, Jack Xin
Abstract: Differentiable architecture search (DARTS) is an effective method for data-driven neural network design based on solving a bilevel optimization problem. In this paper, we formulate a single-level alternative, the relaxed architecture search (RARTS) method, which utilizes training and validation datasets in architecture learning without involving the mixed second derivatives of the corresponding loss functions. Through weight/architecture variable splitting and Gauss-Seidel iterations, the core algorithm outperforms DARTS significantly in accuracy and search efficiency, as shown in both a solvable model and CIFAR-10-based architecture search. Our model continues to outperform DARTS upon transfer to ImageNet and is on par with recent variants of DARTS, even though our innovation is purely in the training algorithm.
95. Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality Assessment [PDF] 返回目录
Dingquan Li, Tingting Jiang, Ming Jiang
Abstract: Currently, most image quality assessment (IQA) models are supervised by the MAE or MSE loss with empirically slow convergence. It is well-known that normalization can facilitate fast convergence. Therefore, we explore normalization in the design of loss functions for IQA. Specifically, we first normalize the predicted quality scores and the corresponding subjective quality scores. Then, the loss is defined based on the norm of the differences between these normalized values. The resulting "Norm-in-Norm" loss encourages the IQA model to make linear predictions with respect to subjective quality scores. After training, the least squares regression is applied to determine the linear mapping from the predicted quality to the subjective quality. It is shown that the new loss is closely connected with two common IQA performance criteria (PLCC and RMSE). Through theoretical analysis, it is proved that the embedded normalization makes the gradients of the loss function more stable and more predictable, which is conducive to the faster convergence of the IQA model. Furthermore, to experimentally verify the effectiveness of the proposed loss, it is applied to solve a challenging problem: quality assessment of in-the-wild images. Experiments on two relevant datasets (KonIQ-10k and CLIVE) show that, compared to MAE or MSE loss, the new loss enables the IQA model to converge about 10 times faster and the final model achieves better performance. The proposed model also achieves state-of-the-art prediction performance on this challenging problem. For reproducible scientific research, our code is publicly available at this https URL.
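The normalization-then-norm structure stated in the abstract can be sketched as follows (the exact norms, exponents and constants in the paper may differ; this is an illustration of the idea, not the released implementation):

```python
import torch

def norm_in_norm_loss(pred, target, p=1, eps=1e-8):
    """pred, target: (N,) predicted and subjective quality scores.
    Both are centered and scaled by their L2 norm, then the loss is a
    norm of the difference of the normalized vectors."""
    def normalize(v):
        v = v - v.mean()
        return v / v.norm(p=2).clamp(min=eps)
    return (normalize(pred) - normalize(target)).norm(p=p)
```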
96. Low-Light Maritime Image Enhancement with Regularized Illumination Optimization and Deep Noise Suppression [PDF] 返回目录
Yu Guo, Yuxu Lu, Ryan Wen Liu, Meifang Yang, Kwok Tai Chui
Abstract: Maritime images captured under low-light imaging conditions easily suffer from low visibility and unexpected noise, leading to negative effects on maritime traffic supervision and management. To promote imaging performance, it is necessary to restore the important visual information from degraded low-light images. In this paper, we propose to enhance low-light images through regularized illumination optimization and deep noise suppression. In particular, a hybrid regularized variational model, which combines an L0-norm gradient sparsity prior with structure-aware regularization, is presented to refine the coarse illumination map originally estimated using Max-RGB. The adaptive gamma correction method is then introduced to adjust the refined illumination map. Based on the assumption of Retinex theory, a guided filter-based detail boosting method is introduced to optimize the reflection map. The adjusted illumination and optimized reflection maps are finally combined to generate the enhanced maritime images. To suppress the effect of unwanted noise on imaging performance, a deep learning-based blind denoising framework is further introduced to promote the visual quality of the enhanced image. In particular, this framework is composed of two sub-networks, i.e., E-Net and D-Net, adopted for noise level estimation and non-blind noise reduction, respectively. The main benefit of our image enhancement method is that it takes full advantage of the regularized illumination optimization and deep blind denoising. Comprehensive experiments have been conducted on both synthetic and realistic maritime images to compare our proposed method with several state-of-the-art imaging methods. Experimental results have illustrated its superior performance in terms of both quantitative and qualitative evaluations.
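As an example of the adaptive gamma-correction step, a common heuristic picks gamma so that the map's mean brightness moves toward mid-gray; this stands in for the paper's exact rule:

```python
import numpy as np

def adaptive_gamma(illum, eps=1e-6):
    """illum: 2-D refined illumination map with values in [0, 1]."""
    mean = float(np.clip(illum.mean(), eps, 1.0 - eps))
    gamma = np.log(0.5) / np.log(mean)   # maps mean brightness toward 0.5
    return np.clip(illum, eps, 1.0) ** gamma
```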
97. A methodology for the measurement of track geometry based on computer vision and inertial sensors [PDF] 返回目录
José L. Escalona
Abstract: This document describes the theory used for the calculation of track geometric irregularities by a Track Geometry Measuring System (TGMS) to be installed in railway vehicles. The TGMS includes a computer for data acquisition and processing, a set of sensors including an inertial measurement unit (IMU, 3D gyroscope and 3D accelerometer), two video cameras and an encoder. The main features of the proposed system are: 1. It is capable of measuring track alignment, vertical profile, cross-level, gauge, twist and rail-head profile using non-contact technology. 2. It can be installed in line railway vehicles. It is compact and low cost. Provided that the equipment sees the rail heads when the vehicle is moving, it can be installed in any body of the vehicle: at the wheelset level, above the primary suspension (bogie frame) or above the secondary suspension (car body).
98. Switching Loss for Generalized Nucleus Detection in Histopathology [PDF] 返回目录
Deepak Anand, Gaurav Patel, Yaman Dang, Amit Sethi
Abstract: The accuracy of deep learning methods for two foundational tasks in medical image analysis -- detection and segmentation -- can suffer from class imbalance. We propose a `switching loss' function that adaptively shifts the emphasis between foreground and background classes. While the existing loss functions that address this problem were motivated by the classification task, the switching loss is based on the Dice loss, which is better suited for segmentation and detection. Furthermore, to get the most out of the training samples, we adapt the loss with each mini-batch, unlike previous proposals that adapt once for the entire training set. A nucleus detector trained using the proposed loss function on a source dataset outperformed those trained using cross-entropy, Dice, or focal losses. Remarkably, without retraining on the target datasets, our pre-trained nucleus detector also outperformed existing nucleus detectors that were trained on at least some of the images from the target datasets. To establish the broad utility of the proposed loss, we also confirmed that it led to more accurate ventricle segmentation in MRI compared to the other loss functions. Our GPU-enabled pre-trained nucleus detection software is also ready to process whole slide images right out of the box and is usably fast.
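A hedged sketch of a switching loss in this spirit: a Dice-style loss whose foreground/background emphasis is switched per mini-batch according to how much foreground the batch contains (the switching rule below is our illustration, not the paper's exact criterion):

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """pred: probabilities in [0, 1]; target: binary mask of the same shape."""
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def switching_loss(pred, target, fg_thresh=0.5):
    fg_frac = target.float().mean()
    if fg_frac < fg_thresh:                    # scarce foreground: emphasize it
        return dice_loss(pred, target)
    return dice_loss(1 - pred, 1 - target)     # otherwise emphasize background
```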
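The abstract does not give the exact formula, but one plausible form of a Dice-based loss whose foreground/background emphasis is re-weighted on every mini-batch looks like the PyTorch sketch below; the specific weighting scheme is an assumption for illustration, not the authors' formulation.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss for one class; pred and target are float tensors in [0, 1]."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def switching_loss(pred_fg, target_fg):
    """Dice-based loss that shifts emphasis between foreground and background
    according to the class imbalance observed in the current mini-batch."""
    fg_frac = target_fg.float().mean()      # foreground fraction in this batch
    w = 1.0 - fg_frac                       # the rarer the foreground, the larger w
    fg_term = dice_loss(pred_fg, target_fg.float())
    bg_term = dice_loss(1.0 - pred_fg, 1.0 - target_fg.float())
    return w * fg_term + (1.0 - w) * bg_term
```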
99. Representative elementary volume via averaged scalar Minkowski functionals [PDF]
M.V. Andreeva, A.V. Kalyuzhnyuk, V.V. Krutko, N.E. Russkikh, I.A. Taimanov
Abstract: The Representative Elementary Volume (REV), the volume at which the material properties no longer vary with a change in volume, is an important quantity for making measurements or simulations that represent the whole. We discuss a geometrical method for evaluating the REV based on the quantities appearing in the Steiner formula from convex geometry. For bodies in three-space this formula gives four scalar functionals, known as the scalar Minkowski functionals. We demonstrate on certain samples that the values of these averaged functionals almost stabilize for cells whose edge lengths are greater than a certain threshold value R. From this point of view, it is therefore reasonable to consider cubes of volume R^3 as representative elementary volumes.
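A toy version of this stabilization check is sketched below for the simplest Minkowski functional, the volume fraction; the synthetic sample and the cell sizes are assumptions, and the paper averages all four functionals rather than just this one.

```python
import numpy as np

def cell_volume_fractions(img, R):
    """Average the simplest Minkowski functional (volume fraction) over
    non-overlapping cubic cells of edge length R voxels; img is a binary 3D array."""
    nz, ny, nx = (s // R for s in img.shape)
    fracs = [img[i*R:(i+1)*R, j*R:(j+1)*R, k*R:(k+1)*R].mean()
             for i in range(nz) for j in range(ny) for k in range(nx)]
    return np.mean(fracs), np.std(fracs)

rng = np.random.default_rng(0)
sample = (rng.random((120, 120, 120)) < 0.3).astype(np.uint8)  # synthetic two-phase medium
for R in (10, 20, 30, 40, 60):
    mean, std = cell_volume_fractions(sample, R)
    print(f"R={R:3d}  mean={mean:.3f}  spread={std:.4f}")  # spread shrinks once R is large enough
```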
100. Accelerating Evolutionary Construction Tree Extraction via Graph Partitioning [PDF]
Markus Friedrich, Sebastian Feld, Thomy Phan, Pierre-Alain Fayolle
Abstract: Extracting a Construction Tree from potentially noisy point clouds is an important aspect of Reverse Engineering tasks in Computer Aided Design. Solutions based on algorithmic geometry impose constraints on usable model representations (e.g. quadric surfaces only) and noise robustness. Re-formulating the problem as a combinatorial optimization problem and solving it with an Evolutionary Algorithm can mitigate some of these constraints at the cost of increased computational complexity. This paper proposes a graph-based search space partitioning scheme that is able to accelerate Evolutionary Construction Tree extraction while exploiting the parallelization capabilities of modern CPUs. The evaluation indicates a speed-up of up to a factor of $46.6$ compared to the baseline approach, while the resulting tree sizes increase by $25.2\%$ to $88.6\%$.
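The partition-then-evolve idea can be sketched as follows: build a graph over fitted primitives, split it into weakly coupled components, and run an independent evolutionary search per component in parallel. The component finder below stands in for a real graph partitioner, and `evolve_tree` is a hypothetical per-partition search routine.

```python
from concurrent.futures import ProcessPoolExecutor

def connected_components(nodes, edges):
    """Split the primitive graph into independent components; a simple stand-in
    for the paper's graph partitioner."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        comps.append(comp)
    return comps

# comps = connected_components(primitives, overlap_edges)
# with ProcessPoolExecutor() as ex:                # one evolutionary search per partition
#     subtrees = list(ex.map(evolve_tree, comps))  # evolve_tree: hypothetical EA routine
```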
101. Encoding Structure-Texture Relation with P-Net for Anomaly Detection in Retinal Images [PDF]
Kang Zhou, Yuting Xiao, Jianlong Yang, Jun Cheng, Wen Liu, Weixin Luo, Zaiwang Gu, Jiang Liu, Shenghua Gao
Abstract: Anomaly detection in retinal images refers to the identification of abnormalities caused by various retinal diseases/lesions, by leveraging only normal images in the training phase. Normal images from healthy subjects often have regular structures (e.g., the structured blood vessels in fundus images, or the structured anatomy in optical coherence tomography images). On the contrary, diseases and lesions often destroy these structures. Motivated by this, we propose to leverage the relation between image texture and structure to design a deep neural network for anomaly detection. Specifically, we first extract the structure of the retinal image, and then combine both the structure features and the last-layer features extracted from the original healthy image to reconstruct the original input healthy image. The image features provide the texture information and guarantee the uniqueness of the image recovered from the structure. Finally, we further utilize the reconstructed image to extract the structure, and measure the difference between the structures extracted from the original and the reconstructed image. On the one hand, minimizing the reconstruction difference behaves like a regularizer that guarantees the image is correctly reconstructed. On the other hand, the structure difference can also be used as a metric for normality measurement. The whole network is termed P-Net because it has a ``P'' shape. Extensive experiments on the RESC and iSee datasets validate the effectiveness of our approach for anomaly detection in retinal images. Further, our method also generalizes well to novel class discovery in retinal images and anomaly detection in real-world images.
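At test time the described pipeline boils down to a score combining the reconstruction and structure differences. A hedged sketch is given below, where `struct_net` and `recon_net` are stand-ins for the trained structure-extraction and reconstruction modules, and the squared-error form of the score is an assumption.

```python
import torch

def anomaly_score(x, struct_net, recon_net, lam=1.0):
    """Sketch of a P-Net-style test-time score: reconstruct the image from its
    extracted structure (plus image features), re-extract structure from the
    reconstruction, and score by reconstruction + structure differences."""
    s = struct_net(x)            # structure of the input image
    x_hat = recon_net(x, s)      # reconstruction guided by the structure
    s_hat = struct_net(x_hat)    # structure of the reconstruction
    return ((x - x_hat) ** 2).mean() + lam * ((s - s_hat) ** 2).mean()
```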
102. Speech Driven Talking Face Generation from a Single Image and an Emotion Condition [PDF]
Sefik Emre Eskimez, You Zhang, Zhiyao Duan
Abstract: Visual emotion expression plays an important role in audiovisual speech communication. In this work, we propose a novel approach to rendering visual emotion expression in speech-driven talking face generation. Specifically, we design an end-to-end talking face generation system that takes a speech utterance, a single face image, and a categorical emotion label as input to render a talking face video in sync with the speech and expressing the conditioned emotion. Objective evaluation on image quality, audiovisual synchronization, and visual emotion expression shows that the proposed system outperforms a state-of-the-art baseline system. Subjective evaluation of visual emotion expression and video realness also demonstrates the superiority of the proposed system. Furthermore, we conduct a pilot study on human emotion recognition of generated videos with mismatched emotions between the audio and visual modalities, and the results show that humans rely on the visual modality more significantly than the audio modality in this task.
103. Representation Learning via Cauchy Convolutional Sparse Coding [PDF]
Perla Mayo, Oktay Karakuş, Robin Holmes, Alin Achim
Abstract: In representation learning, Convolutional Sparse Coding (CSC) enables unsupervised learning of features by jointly optimising both an \(\ell_2\)-norm fidelity term and a sparsity-enforcing penalty. This work investigates using a regularisation term derived from an assumed Cauchy prior for the coefficients of the feature maps of a CSC generative model. The sparsity penalty term resulting from this prior is solved via its proximal operator, which is then applied iteratively, element-wise, on the coefficients of the feature maps to optimise the CSC cost function. The performance of the proposed Iterative Cauchy Thresholding (ICT) algorithm in reconstructing natural images is compared against the common choice of the \(\ell_1\)-norm, optimised via soft (ISTA) and hard (IHT) thresholding. ICT outperforms IHT and ISTA in most of these reconstruction experiments across various datasets, with an average PSNR of up to 11.30 and 7.04 above ISTA and IHT, respectively.
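For the Cauchy penalty \(\mu \log(\gamma^2 + x^2)\), the element-wise proximal operator reduces to selecting a real root of a cubic (set the derivative of the prox objective to zero). The sketch below follows that derivation; the parameter names and the coefficient update shown in the trailing comment are illustrative assumptions.

```python
import numpy as np

def cauchy_prox(z, gamma, mu):
    """Element-wise proximal operator of mu * log(gamma**2 + x**2).
    Setting the derivative of 0.5*(x - z)**2 + mu*log(gamma**2 + x**2) to zero
    gives the cubic x**3 - z*x**2 + (gamma**2 + 2*mu)*x - z*gamma**2 = 0;
    we pick the real root with the lowest objective value."""
    flat = z.ravel()
    out = np.empty_like(flat)
    for i, zi in enumerate(flat):
        roots = np.roots([1.0, -zi, gamma**2 + 2.0 * mu, -zi * gamma**2])
        real = roots[np.abs(roots.imag) < 1e-8].real
        obj = 0.5 * (real - zi) ** 2 + mu * np.log(gamma**2 + real**2)
        out[i] = real[np.argmin(obj)]
    return out.reshape(z.shape)

# one ICT-style update on coefficients x, with data-fidelity gradient g and step eta:
# x = cauchy_prox(x - eta * g, gamma, eta * mu)
```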
104. Visual Pattern Recognition with on On-chip Learning: towards a Fully Neuromorphic Approach [PDF]
Sandro Baumgartner, Alpha Renner, Raphaela Kreiser, Dongchen Liang, Giacomo Indiveri, Yulia Sandamirskaya
Abstract: We present a spiking neural network (SNN) for visual pattern recognition with on-chip learning on neuromorphic hardware. We show how this network can learn simple visual patterns composed of horizontal and vertical bars sensed by a Dynamic Vision Sensor, using a local spike-based plasticity rule. During recognition, the network classifies the pattern's identity while at the same time estimating its location and scale. We build on previous work that used learning with neuromorphic hardware in the loop, and demonstrate that the proposed network can operate properly with on-chip learning, demonstrating a complete neuromorphic pattern learning and recognition setup. Our results show that the network is robust against noise on the input (no accuracy drop when adding 130% noise) and against up to 20% noise in the neuron parameters.
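The defining property of a local spike-based plasticity rule is that each synapse is updated only from quantities available at that synapse (a presynaptic trace and the postsynaptic spike), with no global error signal. The trace-driven, weight-bounded form below is a generic example of such a rule, not the specific rule used on the chip.

```python
import numpy as np

def local_update(w, pre_trace, post_spike, lr=1e-3, w_max=1.0):
    """Generic local spike-based plasticity: each synapse sees only its own
    presynaptic trace and the postsynaptic spike.
    w: (n_post, n_pre) weights; pre_trace: (n_pre,); post_spike: (n_post,) in {0, 1}."""
    dw = lr * post_spike[:, None] * (pre_trace[None, :] - w)  # trace-driven, self-bounding
    return np.clip(w + dw, 0.0, w_max)
```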
105. Complex Grey Matter Structure Segmentation in Brains via Deep Learning: Example of the Claustrum [PDF]
Hongwei Li, Aurore Menegaux, Felix JB Bäuerlein, Suprosanna Shit, Benita Schmitz-Koep, Christian Sorg, Bjoern Menze, Dennis Hedderich
Abstract: Segmentation and parcellation of the brain has been widely performed on brain MRI using atlas-based methods. However, segmentation of the claustrum, a thin and sheet-like structure between the insular cortex and the putamen, has not been amenable to automated segmentation, thus limiting its investigation in larger imaging cohorts. Recently, deep-learning based approaches have been introduced for automated segmentation of brain structures, yielding great potential to overcome pre-existing limitations. In the following, we present a multi-view deep-learning based approach to segment the claustrum in T1-weighted MRI scans. We trained and evaluated the proposed method on 181 manual bilateral claustrum annotations by an expert neuroradiologist serving as the reference standard. Cross-validation experiments yielded a median volumetric similarity, robust Hausdorff distance and Dice score of 93.3%, 1.41 mm and 71.8%, respectively, which represents equal or superior segmentation performance compared to human intra-rater reliability. Leave-one-scanner-out evaluation showed good transferability of the algorithm to images from unseen scanners, albeit at slightly inferior performance. Furthermore, we found that AI-based claustrum segmentation benefits from multi-view information and requires sample sizes of around 75 MRI scans in the training set. In conclusion, the developed algorithm has large potential in independent study cohorts and can facilitate MRI-based research of the human claustrum through automated segmentation. The software and models of our method are made publicly available.
106. Dimensionality Reduction via Diffusion Map Improved with Supervised Linear Projection [PDF]
Bowen Jiang, Maohao Shen
Abstract: When performing classification tasks, raw high-dimensional features often contain redundant information and lead to increased computational complexity and overfitting. In this paper, we assume the data samples lie on a single underlying smooth manifold, and define intra-class and inter-class similarities using pairwise local kernel distances. We aim to find a linear projection that maximizes the intra-class similarities and minimizes the inter-class similarities simultaneously, so that the projected low-dimensional data has optimized pairwise distances based on the label information, which is more suitable for a Diffusion Map to perform further dimensionality reduction. Numerical experiments on several benchmark datasets show that our proposed approaches are able to extract low-dimensional discriminative features that help us achieve higher classification accuracy.
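One way to realize such a projection is through kernel-weighted within-class and between-class scatter matrices and a generalized eigenproblem, in the spirit of local Fisher discriminant analysis. The sketch below is a hedged reading of that objective, not the authors' exact formulation; the ridge term and bandwidth are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def supervised_projection(X, y, sigma=1.0, dim=10):
    """Find a linear projection that pulls same-class samples together and
    pushes different-class samples apart, weighted by local kernel similarity.
    X: (n, d) samples; y: (n,) labels; returns a (d, dim) projection matrix."""
    n, d = X.shape
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma**2))                  # pairwise local kernel weights
    S_w = np.zeros((d, d))                            # intra-class (within) scatter
    S_b = np.zeros((d, d))                            # inter-class (between) scatter
    for i in range(n):
        for j in range(n):
            outer = np.outer(X[i] - X[j], X[i] - X[j])
            if y[i] == y[j]:
                S_w += K[i, j] * outer
            else:
                S_b += K[i, j] * outer
    # directions maximizing between-class over within-class projected spread
    vals, vecs = eigh(S_b, S_w + 1e-6 * np.eye(d))
    return vecs[:, -dim:]
```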
107. Auto-weighting for Breast Cancer Classification in Multimodal Ultrasound [PDF]
Wang Jian, Miao Juzheng, Yang Xin, Li Rui, Zhou Guangquan, Huang Yuhao, Lin Zehui, Xue Wufeng, Jia Xiaohong, Zhou Jianqiao, Huang Ruobing, Ni Dong
Abstract: Breast cancer is the most common invasive cancer in women. Besides the primary B-mode ultrasound screening, sonographers have explored the inclusion of Doppler, strain and shear-wave elasticity imaging to advance the diagnosis. However, recognizing useful patterns in all types of images and weighing up the significance of each modality can elude less-experienced clinicians. In this paper, we explore, for the first time, an automatic way to combine the four types of ultrasonography to discriminate between benign and malignant breast nodules. A novel multimodal network is proposed, with promising learnability and simplicity to improve classification accuracy. The key is using a weight-sharing strategy to encourage interactions between modalities, and adopting an additional cross-modalities objective to integrate global information. In contrast to hardcoding the weights of each modality in the model, we embed them in a Reinforcement Learning framework to learn this weighting in an end-to-end manner. Thus the model is trained to seek the optimal multimodal combination without handcrafted heuristics. The proposed framework is evaluated on a dataset containing 1616 sets of multimodal images. Results showed that the model scored a high classification accuracy of 95.4%, which indicates the effectiveness of the proposed method.
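The weight-sharing idea (one encoder reused across all modalities) combined with a learned per-modality weighting can be sketched in PyTorch as below; the softmax-weighted fusion is a simplified stand-in for the RL-learned weighting described in the abstract.

```python
import torch
import torch.nn as nn

class WeightSharedFusion(nn.Module):
    """One shared encoder for all four US modalities, plus learnable
    per-modality weights standing in for the RL-learned weighting."""
    def __init__(self, encoder, n_modalities=4):
        super().__init__()
        self.encoder = encoder                        # shared across modalities
        self.logits = nn.Parameter(torch.zeros(n_modalities))

    def forward(self, images):                        # images: list of (N, C, H, W), one per modality
        feats = torch.stack([self.encoder(x) for x in images])  # (M, N, D)
        w = torch.softmax(self.logits, dim=0)         # modality weights, sum to 1
        return (w[:, None, None] * feats).sum(0)      # fused (N, D) feature
```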
108. Recent Advances and New Guidelines on Hyperspectral and Multispectral Image Fusion [PDF]
Renwei Dian, Shutao Li, Bin Sun, Anjing Guo
Abstract: Hyperspectral images (HSI) with high spectral resolution often suffer from low spatial resolution owing to the limitations of imaging sensors. Image fusion is an effective and economical way to enhance the spatial resolution of an HSI by combining it with a higher-spatial-resolution multispectral image (MSI) of the same scene. In the past years, many HSI-MSI fusion algorithms have been introduced to obtain high-resolution HSI. However, the field lacks a full-scale review of the newly proposed HSI-MSI fusion approaches. To tackle this problem, this work gives a comprehensive review and new guidelines for HSI-MSI fusion. According to their characteristics, HSI-MSI fusion methods are categorized into four classes: pan-sharpening based approaches, matrix factorization based approaches, tensor representation based approaches, and deep convolutional neural network based approaches. We give a detailed introduction, discussion, and comparison of the fusion methods in each category. Additionally, the existing challenges and possible future directions for HSI-MSI fusion are presented.
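To make the matrix-factorization category concrete, a toy baseline factors the scene as a spectral basis times abundances: the basis is learned from the low-resolution HSI, and the high-resolution abundances are estimated from the MSI through a known spectral response. The shapes, the SVD basis and the plain least-squares solve are simplifying assumptions, not any specific surveyed method.

```python
import numpy as np

def mf_fusion(hsi_lr, msi_hr, R, k=10):
    """Toy matrix-factorization HSI-MSI fusion.
    hsi_lr: (B, n_lr) low-res HSI pixels; msi_hr: (b, n_hr) high-res MSI pixels;
    R: (b, B) spectral response mapping HSI bands to MSI bands."""
    # spectral basis from an SVD of the low-resolution HSI
    U, _, _ = np.linalg.svd(hsi_lr, full_matrices=False)
    E = U[:, :k]                                          # (B, k) spectral basis
    # high-res abundances that explain the MSI through the spectral response
    A, *_ = np.linalg.lstsq(R @ E, msi_hr, rcond=None)    # (k, n_hr)
    return E @ A                                          # fused high-res HSI, (B, n_hr)
```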
109. Using PSPNet and UNet to analyze the internal parameter relationship and visualization of the convolutional neural network [PDF]
Wei Wang
Abstract: Convolutional neural networks (CNN) have achieved great success in many fields, but due to their huge number of parameters they are very difficult to study. Can we then start from the parameters themselves to explore the relationships between the internal parameters of a CNN? This paper proposes substituting convolution-layer parameters between layers with the same convolution kernel settings to explore the relationships between the internal parameters of a CNN, and proposes using CNN visualization methods to check these relationships. Using the visualization method, the forward propagation process of the CNN is visualized, giving an intuitive representation of how the CNN learns. According to the experiments, this paper concludes that: 1. the residual-layer parameters of ResNet are correlated, and some layers can be substituted for each other; 2. image segmentation is a process of first learning image texture features and then performing localization and segmentation.
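The substitution experiment can be reproduced in miniature: pick two convolutions with identical kernel configurations, swap their weights, and compare the network's outputs before and after. The layer choice below (two matching 3x3 convolutions in a ResNet-18) is an illustrative assumption, not the layers studied in the paper.

```python
import torch
import torchvision

model = torchvision.models.resnet18().eval()   # randomly initialized for the demo
conv_a = model.layer1[0].conv1                 # 64 -> 64, 3x3, stride 1
conv_b = model.layer1[1].conv1                 # identical kernel configuration
assert conv_a.weight.shape == conv_b.weight.shape

x = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    before = model(x)
    tmp = conv_a.weight.clone()                # swap the two weight tensors in place
    conv_a.weight.copy_(conv_b.weight)
    conv_b.weight.copy_(tmp)
    after = model(x)

# how strongly the substitution perturbs the output
print((before - after).abs().mean().item())
```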
110. Distributed optimization for nonrigid nano-tomography [PDF]
Viktor Nikitin, Vincent De Andrade, Azat Slyamov, Benjamin J. Gould, Yuepeng Zhang, Vandana Sampathkumar, Narayanan Kasthuri, Doga Gursoy, Francesco De Carlo
Abstract: Resolution level and reconstruction quality in nano-computed tomography (nano-CT) are in part limited by the stability of the microscopes, because the magnitude of mechanical vibrations during scanning becomes comparable to the imaging resolution, and by the limited ability of the samples to resist beam damage during data acquisition. In such cases there is no incentive to recover the sample state at different time steps, as in time-resolved reconstruction methods; instead the goal is to retrieve a single reconstruction at the highest possible spatial resolution and without any imaging artifacts. Here we propose a joint solver for imaging samples at the nanoscale with projection alignment, unwarping and regularization. Projection data consistency is regulated by dense optical flow estimated by Farneback's algorithm, leading to sharp sample reconstructions with fewer artifacts. Synthetic data tests show the robustness of the method to Poisson and low-frequency background noise. The applicability of the method is demonstrated on two large-scale nano-imaging experimental data sets.
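Farneback dense optical flow is available in OpenCV, so the unwarping step can be sketched directly: estimate the flow between a reference and a deformed projection, then remap the deformed one back. The synthetic pair of projections and all algorithm parameters below are illustrative, not the paper's settings.

```python
import cv2
import numpy as np

# synthetic pair: a projection and a shifted copy standing in for a deformed repeat scan
proj0 = np.zeros((128, 128), np.uint8)
cv2.circle(proj0, (64, 64), 30, 255, -1)
proj1 = np.roll(proj0, (2, 3), axis=(0, 1))        # "deformation" of a few pixels

flow = cv2.calcOpticalFlowFarneback(proj0, proj1, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
h, w = proj0.shape
gx, gy = np.meshgrid(np.arange(w), np.arange(h))
map_x = (gx + flow[..., 0]).astype(np.float32)      # sample proj1 where the flow points
map_y = (gy + flow[..., 1]).astype(np.float32)
unwarped = cv2.remap(proj1, map_x, map_y, cv2.INTER_LINEAR)
print(np.abs(unwarped.astype(int) - proj0.astype(int)).mean())  # residual after unwarping
```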
111. Improving the Speed and Quality of GAN by Adversarial Training [PDF]
Jiachen Zhong, Xuanqing Liu, Cho-Jui Hsieh
Abstract: Generative adversarial networks (GAN) have shown remarkable results in image generation tasks. High-fidelity class-conditional GAN methods often rely on stabilization techniques that constrain the global Lipschitz continuity. Such regularization leads to less expressive models and slower convergence; other techniques, such as large-batch training, require unconventional computing power and are not widely accessible. In this paper, we develop an efficient algorithm, namely FastGAN (Free AdverSarial Training), to improve the speed and quality of GAN training based on the adversarial training technique. We benchmark our method on CIFAR10, a subset of ImageNet, and the full ImageNet dataset. We choose strong baselines such as SNGAN and SAGAN; the results demonstrate that our training algorithm can achieve better generation quality (in terms of the Inception score and Frechet Inception distance) with less overall training time. Most notably, our training algorithm brings ImageNet training within reach of the broader public by requiring only 2-4 GPUs.
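In the spirit of "free" adversarial training, the input perturbation can be refined with gradients the discriminator update already computes, by replaying the same batch a few times. The sketch below transplants that idea to a discriminator step; it is one plausible reading of the approach, not the authors' exact algorithm, and the loss form and hyperparameters are assumptions.

```python
import torch

def free_adv_d_step(D, G, real, z, opt_D, eps=0.01, replays=3):
    """'Free' adversarial discriminator update: replay one batch a few times,
    refining the perturbation delta with the gradient that the discriminator
    update already produced (no extra backward pass for delta)."""
    delta = torch.zeros_like(real, requires_grad=True)
    fake = G(z).detach()
    for _ in range(replays):
        loss = -(torch.log(torch.sigmoid(D(real + delta))).mean()
                 + torch.log(1 - torch.sigmoid(D(fake))).mean())
        opt_D.zero_grad()
        loss.backward()                       # grads for both D's params and delta
        opt_D.step()                          # update D with the fresh gradient
        with torch.no_grad():                 # reuse delta.grad "for free"
            delta += eps * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return loss.item()
```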
112. X-Ray bone abnormalities detection using MURA dataset [PDF]
A. Solovyova, I. Solovyov
Abstract: We introduce a deep network trained on the MURA dataset from Stanford University, released in 2017. Our system is able to detect bone abnormalities on radiographs and visualise such zones. We found that our solution has accuracy comparable to the best results achieved by other development teams that used the MURA dataset; in particular, the overall Kappa score achieved by our team is about 0.942 on the wrist, 0.862 on the hand and 0.735 on the shoulder (compared to the best results available at this moment on the official web-site: 0.931, 0.851 and 0.729, respectively). However, despite the good results, there are many directions for future enhancement of the proposed technology. We see great potential in the further development of computer-aided systems (CAD) for radiographs, as a tool that will help practising specialists diagnose bone fractures as well as bone oncology cases faster and with higher accuracy.
113. Reliable Liver Fibrosis Assessment from Ultrasound using Global Hetero-Image Fusion and View-Specific Parameterization [PDF]
Bowen Li, Ke Yan, Dar-In Tai, Yuankai Huo, Le Lu, Jing Xiao, Adam P. Harrison
Abstract: Ultrasound (US) is a critical modality for diagnosing liver fibrosis. Unfortunately, assessment is very subjective, motivating automated approaches. We introduce a principled deep convolutional neural network (CNN) workflow that incorporates several innovations. First, to avoid overfitting on non-relevant image features, we force the network to focus on a clinical region of interest (ROI), encompassing the liver parenchyma and upper border. Second, we introduce global heteroimage fusion (GHIF), which allows the CNN to fuse features from any arbitrary number of images in a study, increasing its versatility and flexibility. Finally, we use 'style'-based view-specific parameterization (VSP) to tailor the CNN processing for different viewpoints of the liver, while keeping the majority of parameters the same across views. Experiments on a dataset of 610 patient studies (6979 images) demonstrate that our pipeline can contribute roughly 7% and 22% improvements in partial area under the curve and recall at 90% precision, respectively, over conventional classifiers, validating our approach to this crucial problem.
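Fusing features "from any arbitrary number of images in a study" implies a pooling operation over the image axis that is invariant to both ordering and count. The minimal sketch below (max-pooling per feature channel) is an assumption about GHIF's core mechanism, not its exact design.

```python
import torch

def hetero_image_fusion(feats):
    """Fuse per-image feature maps from a variable-size study into one
    study-level feature map; max over the image axis is one order- and
    count-invariant choice. feats: (num_images, C, H, W)."""
    return feats.max(dim=0).values

study = torch.randn(7, 256, 14, 14)   # e.g. 7 US images from one patient study
fused = hetero_image_fusion(study)    # (256, 14, 14), same shape for any image count
```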
114. Generative Adversarial Network for Radar Signal Generation [PDF]
Thomas Truong, Svetlana Yanushkevich
Abstract: A major obstacle for radar-based methods for concealed object detection on humans, and for their seamless integration into security and access control systems, is the difficulty of collecting high-quality radar signal data. Generative adversarial networks (GAN) have shown promise in data generation applications in the fields of image and audio processing. As such, this paper proposes the design of a GAN for application to radar signal generation. Data collected using the Finite-Difference Time-Domain (FDTD) method on three concealed object classes (no object, large object, and small object) were used as training data to train a GAN to generate radar signal samples for each class. The proposed GAN generated radar signal data that qualitative human observers could not distinguish from the training data.