Abstracts

1. Transforming and Projecting Images into Class-conditional Generative Networks
Minyoung Huh, Richard Zhang, Jun-Yan Zhu, Sylvain Paris, Aaron Hertzmann
Abstract: We present a method for projecting an input image into the space of a class-conditional generative neural network. The method optimizes for transformations that counteract the model biases of generative neural networks. Specifically, we demonstrate that one can solve for image translation, scale, and global color transformation during the projection optimization to address the object-center bias of a Generative Adversarial Network. This projection process poses a difficult optimization problem, and purely gradient-based optimizations fail to find good solutions. We describe a hybrid optimization strategy that finds good projections by estimating transformations and class parameters. We show the effectiveness of our method on real images and further demonstrate how the corresponding projections lead to better editability of these images.

2. How to Train Your Energy-Based Model for Regression
Fredrik K. Gustafsson, Martin Danelljan, Radu Timofte, Thomas B. Schön
Abstract: Energy-based models (EBMs) have become increasingly popular within computer vision in recent years. While they are commonly employed for generative image modeling, recent work has also applied EBMs to regression tasks, achieving state-of-the-art performance on object detection and visual tracking. Training EBMs is, however, known to be challenging. While a variety of techniques have been explored for generative modeling, the application of EBMs to regression is not a well-studied problem. How EBMs should be trained for the best possible regression performance is thus currently unclear. We therefore accept the task of providing the first detailed study of this problem. To that end, we propose a simple yet highly effective extension of noise contrastive estimation, and carefully compare its performance to six popular methods from the literature on the tasks of 1D regression and object detection. The results of this comparison suggest that our training method should be considered the go-to approach. We also apply our method to the visual tracking task, setting a new state of the art on five datasets. Notably, our tracker achieves 63.7% AUC on LaSOT and 78.7% Success on TrackingNet. Code is available at this https URL.

3. Group Equivariant Generative Adversarial Networks
Neel Dey, Antong Chen, Soheil Ghafurian
Abstract: Generative adversarial networks are the state of the art for generative modeling in vision, yet are notoriously unstable in practice. This instability is further exacerbated with limited training data. However, in the synthesis of domains such as medical or satellite imaging, it is often overlooked that the image label is invariant to global image symmetries (e.g., rotations and reflections). In this work, we improve gradient feedback between generator and discriminator using an inductive symmetry prior via group-equivariant convolutional networks. We replace convolutional layers with equivalent group-convolutional layers in both generator and discriminator, allowing for better optimization steps and increased expressive power with limited samples. In the process, we extend recent GAN developments to the group-equivariant setting. We demonstrate the utility of our methods by improving both sample fidelity and diversity in the class-conditional synthesis of a diverse set of globally-symmetric imaging modalities.
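The symmetry prior mentioned in this abstract can be illustrated with a toy example. The sketch below is not the paper's group-equivariant convolution; it only shows the simpler idea of pooling a feature over the eight rotations and reflections of a square image, which makes the pooled value invariant to those symmetries. The names `orbit`, `response`, and `invariant_response` and the weight matrix are hypothetical.

```python
def rot90(img):
    """Rotate a square image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror an image horizontally."""
    return [row[::-1] for row in img]


def orbit(img):
    """All 8 images reachable by 90-degree rotations and a horizontal flip."""
    out, cur = [], img
    for _ in range(4):
        out.extend([cur, flip(cur)])
        cur = rot90(cur)
    return out


def response(img):
    """Toy feature that is NOT symmetric: it weights the top-left pixel most."""
    weights = [[4, 3], [2, 1]]  # hypothetical filter, chosen for illustration
    return sum(w * p for w_row, p_row in zip(weights, img)
               for w, p in zip(w_row, p_row))


def invariant_response(img):
    """Pool the toy feature over the whole orbit to obtain an invariant value."""
    return max(response(g) for g in orbit(img))


img = [[1, 2], [3, 4]]
print(response(img), response(rot90(img)))                      # 20 25
print(invariant_response(img), invariant_response(rot90(img)))  # 30 30
```

The raw feature changes when the image is rotated, while the pooled feature does not, which is the property a label-preserving symmetry calls for.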

4. Ego-motion and Surrounding Vehicle State Estimation Using a Monocular Camera
Jun Hayakawa, Behzad Dariush
Abstract: Understanding ego-motion and surrounding vehicle state is essential to enable automated driving and advanced driving assistance technologies. Typical approaches to this problem fuse multiple sensors such as LiDAR, camera, and radar to recognize surrounding vehicle state, including position, velocity, and orientation. Such sensing modalities are overly complex and costly for the production of personal-use vehicles. In this paper, we propose a novel machine learning method to estimate ego-motion and surrounding vehicle state using a single monocular camera. Our approach is based on a combination of three deep neural networks that estimate the 3D vehicle bounding box, depth, and optical flow from a sequence of images. The main contribution of this paper is a new framework and algorithm that integrates these three networks in order to estimate the ego-motion and surrounding vehicle state. To realize more accurate 3D position estimation, we address ground plane correction in real time. The efficacy of the proposed method is demonstrated through experimental evaluations that compare our results to ground truth data available from other sensors, including CAN bus and LiDAR.

5. VisualEchoes: Spatial Image Representation Learning through Echolocation
Ruohan Gao, Changan Chen, Ziad Al-Halah, Carl Schissler, Kristen Grauman
Abstract: Several animal species (e.g., bats, dolphins, and whales) and even visually impaired humans have the remarkable ability to perform echolocation: a biological sonar used to perceive spatial layout and locate objects in the world. We explore the spatial cues contained in echoes and how they can benefit vision tasks that require spatial reasoning. First, we capture echo responses in photo-realistic 3D indoor scene environments. Then we propose a novel interaction-based representation learning framework that learns useful visual features via echolocation. We show that the learned image features are useful for multiple downstream vision tasks requiring spatial reasoning: monocular depth estimation, surface normal estimation, and visual navigation. Our work opens a new path for representation learning for embodied agents, where supervision comes from interacting with the physical world. Our experiments demonstrate that our image features learned from echoes are comparable to or even outperform heavily supervised pre-training methods for multiple fundamental spatial tasks.

6. MorphoCluster: Efficient Annotation of Plankton images by Clustering
Simon-Martin Schröder, Rainer Kiko, Reinhard Koch
Abstract: In this work, we present MorphoCluster, a software tool for data-driven, fast, and accurate annotation of large image data sets. The amount of marine data has already surpassed the annotation rate of human experts, and its volume and complexity will continue to increase in the coming years. Still, this data requires interpretation. MorphoCluster augments the human ability to discover patterns and perform object classification in large amounts of data by embedding unsupervised clustering in an interactive process. By aggregating similar images into clusters, our novel approach to image annotation increases consistency, multiplies the throughput of an annotator, and allows experts to adapt the granularity of their sorting scheme to the structure in the data. We sorted a set of 1.2M objects into 280 data-driven classes in 71 hours (16k objects per hour), with 90% of these classes having a precision of 0.889 or higher. This shows that MorphoCluster is at the same time fast, accurate, and consistent, provides a fine-grained and data-driven classification, and enables novelty detection. MorphoCluster is available as open-source software at this https URL.

7. Can We Learn Heuristics For Graphical Model Inference Using Reinforcement Learning?
Safa Messaoud, Maghav Kumar, Alexander G. Schwing
Abstract: Combinatorial optimization is frequently used in computer vision. For instance, in applications like semantic segmentation, human pose estimation and action recognition, programs are formulated for solving inference in Conditional Random Fields (CRFs) to produce a structured output that is consistent with visual features of the image. However, solving inference in CRFs is in general intractable, and approximation methods are computationally demanding and limited to unary, pairwise and hand-crafted forms of higher order potentials. In this paper, we show that we can learn program heuristics, i.e., policies, for solving inference in higher order CRFs for the task of semantic segmentation, using reinforcement learning. Our method solves inference tasks efficiently without imposing any constraints on the form of the potentials. We show compelling results on the Pascal VOC and MOTS datasets.

8. On the Benefits of Models with Perceptually-Aligned Gradients
Gunjan Aggarwal, Abhishek Sinha, Nupur Kumari, Mayank Singh
Abstract: Adversarially robust models have been shown to learn more robust and interpretable features than standard trained models. As shown by Tsipras et al. (2018), such robust models inherit useful interpretable properties: the gradient aligns perceptually well with images, and adding a large targeted adversarial perturbation leads to an image resembling the target class. We perform experiments to show that interpretable and perceptually aligned gradients are present even in models that do not show high robustness to adversarial attacks. Specifically, we perform adversarial training with attacks under different max-perturbation bounds. Adversarial training with a low max-perturbation bound results in models that have interpretable features with only a slight drop in performance on clean samples. In this paper, we leverage models with interpretable, perceptually aligned features and show that adversarial training with a low max-perturbation bound can improve the performance of models on zero-shot and weakly supervised localization tasks.

9. CARRADA Dataset: Camera and Automotive Radar with Range-Angle-Doppler Annotations
A. Ouaknine, A. Newson, J. Rebut, F. Tupin, P. Pérez
Abstract: High quality perception is essential for autonomous driving (AD) systems. To reach the accuracy and robustness that are required by such systems, several types of sensors must be combined. Currently, mostly cameras and laser scanners (lidar) are deployed to build a representation of the world around the vehicle. While radar sensors have been used for a long time in the automotive industry, they are still under-used for AD despite their appealing characteristics (notably, their ability to measure the relative speed of obstacles and to operate even in adverse weather conditions). To a large extent, this situation is due to the relative lack of automotive datasets with real radar signals that are both raw and annotated. In this work, we introduce CARRADA, a dataset of synchronized camera and radar recordings with range-angle-Doppler annotations. We also present a semi-automatic annotation approach, which was used to annotate the dataset, and a radar semantic segmentation baseline, which we evaluate on several metrics. Both our code and dataset will be released.

10. Automated eye disease classification method from anterior eye image using anatomical structure focused image classification technique
Masahiro Oda, Takefumi Yamaguchi, Hideki Fukuoka, Yuta Ueno, Kensaku Mori
Abstract: This paper presents an automated method for classifying anterior eye images into cases of infective or non-infective disease. Treatments for infective and non-infective diseases differ, so distinguishing them from anterior eye images is important for deciding on a treatment plan. Ophthalmologists currently distinguish them empirically, and computer-assisted quantitative classification is needed. Anterior eye images vary widely in eye position and illumination brightness, which makes the classification difficult. If we focus on the cornea, however, the positions of opacified areas differ between infective and non-infective cases. Therefore, we solve the anterior eye image classification task by using an object detection approach targeting the cornea. This approach can be described as "anatomical structure focused image classification". We use the YOLOv3 object detection method to detect corneas with infective disease and corneas with non-infective disease. The detection result is used to define the classification result of an image. In our experiments using anterior eye images, 88.3% of images were correctly classified by the proposed method.

11. Correlating Edge, Pose with Parsing
Ziwei Zhang, Chi Su, Liang Zheng, Xiaodong Xie
Abstract: According to existing studies, human body edge and pose are two beneficial factors for human parsing. The effectiveness of each of these high-level features (edge and pose) has been confirmed by concatenating them with the parsing features. Driven by these insights, this paper studies how human semantic boundaries and keypoint locations can jointly improve human parsing. Compared with the existing practice of feature concatenation, we find that uncovering the correlation among the three factors is a superior way of leveraging the pivotal contextual cues provided by edges and poses. To capture such correlations, we propose a Correlation Parsing Machine (CorrPM) that employs a heterogeneous non-local block to discover the spatial affinity among feature maps from edge, pose, and parsing. The proposed CorrPM allows us to report new state-of-the-art accuracy on three human parsing datasets. Importantly, comparative studies confirm the advantages of feature correlation over concatenation.

12. Monitoring COVID-19 social distancing with person detection and tracking via fine-tuned YOLO v3 and Deepsort techniques
Narinder Singh Punn, Sanjay Kumar Sonbhadra, Sonali Agarwal
Abstract: The rampant coronavirus disease 2019 (COVID-19) has brought a global crisis, with its deadly spread to more than 180 countries and about 3,519,901 confirmed cases and 247,630 deaths globally as of May 4, 2020. The absence of any active therapeutic agents and the lack of immunity against COVID-19 increase the vulnerability of the population. Since there are no vaccines available, social distancing is the only feasible approach to fight against this pandemic. Motivated by this notion, this article proposes a deep learning based framework for automating the task of monitoring social distancing using surveillance video. The proposed framework utilizes the YOLO v3 object detection model to separate humans from the background and the Deepsort approach to track the identified people with the help of bounding boxes and assigned IDs. The results of the YOLO v3 model are further compared with other popular state-of-the-art models, e.g., faster region-based CNN (convolutional neural network) and single shot detector (SSD), in terms of mean average precision (mAP), frames per second (FPS), and loss values defined by object classification and localization. Later, the pairwise vectorized L2 norm is computed based on the three-dimensional feature space obtained by using the centroid coordinates and dimensions of the bounding box. A violation index term is proposed to quantify the non-adoption of the social distancing protocol. From the experimental analysis, it is observed that YOLO v3 with the Deepsort tracking scheme achieved the best results, with a balanced mAP and FPS score, for monitoring social distancing in real time.
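The pairwise L2 step described in this abstract can be sketched in a few lines. The abstract does not say exactly how the third feature dimension is derived from the box dimensions or how the violation index is defined, so the choices below (box diagonal as the third feature, and the fraction of people involved in at least one close pair as the index) are assumptions made purely for illustration. The sketch also uses a plain loop rather than a vectorized computation.

```python
from itertools import combinations
from math import dist


def box_feature(box):
    """Map a box (x_min, y_min, x_max, y_max) to (centroid_x, centroid_y, diagonal).

    Using the diagonal as the third feature is an assumption of this sketch.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2
    cy = (y_min + y_max) / 2
    diagonal = dist((x_min, y_min), (x_max, y_max))
    return (cx, cy, diagonal)


def close_pairs(boxes, threshold):
    """Return index pairs whose L2 distance in feature space is below threshold."""
    feats = [box_feature(b) for b in boxes]
    return [(i, j) for i, j in combinations(range(len(feats)), 2)
            if dist(feats[i], feats[j]) < threshold]


def violation_index(boxes, threshold):
    """Hypothetical index: share of people involved in at least one close pair."""
    if not boxes:
        return 0.0
    involved = {i for pair in close_pairs(boxes, threshold) for i in pair}
    return len(involved) / len(boxes)


# Two people standing close together and one far away (pixel coordinates).
boxes = [(0, 0, 10, 20), (5, 0, 15, 20), (200, 0, 210, 20)]
print(close_pairs(boxes, threshold=50))       # [(0, 1)]
print(violation_index(boxes, threshold=50))
```

In practice the threshold would have to be calibrated to the camera geometry; the value 50 here is arbitrary.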

13. Anchors based method for fingertips position from a monocular RGB image using Deep Neural Network
Purnendu Mishra, Kishor Sarawadekar
Abstract: In virtual, augmented, and mixed reality, the use of hand gestures is becoming increasingly popular as a way to reduce the difference between the virtual and real worlds. The precise location of the fingertip is crucial for a seamless experience. Much of the research work is based on using depth information to estimate the fingertip positions. However, most of the work using RGB images for fingertip detection is limited to a single finger. The detection of multiple fingertips from a single RGB image is very challenging due to various factors. In this paper, we propose a deep neural network (DNN) based methodology to estimate the fingertip positions. We call this methodology Anchor based Fingertips Position Estimation (ABFPE); it is a two-step process. The fingertip location is estimated using regression by computing the difference in the location of a fingertip from the nearest anchor point. The proposed framework performs the best with limited dependence on hand detection results. In our experiments on the SCUT-Ego-Gesture dataset, we achieved a fingertip detection error of 2.3552 pixels on a video frame with a resolution of 640 × 480, and about 92.98% of test images have average pixel errors of five pixels.
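The anchor-offset regression target described in this abstract can be illustrated as follows. This is a minimal sketch, assuming a regular grid of anchor points with a stride of 32 pixels; the grid layout, the stride, and the function names are not taken from the paper. It shows how a fingertip position is encoded as the nearest anchor plus an offset and how that encoding is inverted.

```python
from math import dist


def make_anchors(width, height, stride):
    """Place anchor points at the centers of a regular stride x stride grid."""
    return [(x + stride / 2, y + stride / 2)
            for y in range(0, height, stride)
            for x in range(0, width, stride)]


def encode(fingertip, anchors):
    """Return (index of nearest anchor, offset from that anchor to the fingertip)."""
    idx = min(range(len(anchors)), key=lambda i: dist(anchors[i], fingertip))
    ax, ay = anchors[idx]
    return idx, (fingertip[0] - ax, fingertip[1] - ay)


def decode(idx, offset, anchors):
    """Recover the fingertip position from an anchor index and a regressed offset."""
    ax, ay = anchors[idx]
    return (ax + offset[0], ay + offset[1])


anchors = make_anchors(640, 480, stride=32)   # 20 x 15 = 300 anchors
idx, offset = encode((100.0, 50.0), anchors)
print(idx, offset)                    # 23 (-12.0, 2.0)
print(decode(idx, offset, anchors))   # (100.0, 50.0)
```

A network trained this way only has to regress small offsets, bounded by roughly half the stride, instead of absolute image coordinates.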

14. How to Train Your Dragon: Tamed Warping Network for Semantic Video Segmentation
Junyi Feng, Songyuan Li, Yifeng Chen, Fuxian Huang, Jiabao Cui, Xi Li
Abstract: Real-time semantic segmentation on high-resolution videos is challenging due to the strict requirements of speed. Recent approaches have utilized the inter-frame continuity to reduce redundant computation by warping the feature maps across adjacent frames, greatly speeding up the inference phase. However, their accuracy drops significantly owing to the imprecise motion estimation and error accumulation. In this paper, we propose to introduce a simple and effective correction stage right after the warping stage to form a framework named Tamed Warping Network (TWNet), aiming to improve the accuracy and robustness of warping-based models. The experimental results on the Cityscapes dataset show that with the correction, the accuracy (mIoU) significantly increases from 67.3% to 71.6%, and the speed edges down from 65.5 FPS to 61.8 FPS. For non-rigid categories such as "human" and "object", the improvements of IoU are even higher than 18 percentage points.
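The accuracy and speed trade-off reported in this abstract can be made concrete with a little arithmetic. The sketch below uses only the four numbers quoted in the abstract; the variable names are hypothetical.

```python
# Figures quoted in the abstract (Cityscapes): mIoU in percent, speed in FPS.
baseline = {"miou": 67.3, "fps": 65.5}    # warping only
corrected = {"miou": 71.6, "fps": 61.8}   # warping plus correction stage

# Accuracy gain in percentage points.
miou_gain = round(corrected["miou"] - baseline["miou"], 1)

# Relative slowdown in percent of the baseline frame rate.
fps_drop_pct = round(
    100 * (baseline["fps"] - corrected["fps"]) / baseline["fps"], 1)

# Extra time spent per frame, in milliseconds.
extra_ms = round(1000 / corrected["fps"] - 1000 / baseline["fps"], 2)

print(miou_gain, fps_drop_pct, extra_ms)   # 4.3 5.6 0.91
```

Under these figures, the correction stage buys 4.3 points of mIoU for a slowdown of about 5.6%, or just under one millisecond per frame.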

15. NTIRE 2020 Challenge on Image and Video Deblurring
Seungjun Nah, Sanghyun Son, Radu Timofte, Kyoung Mu Lee
Abstract: Motion blur is one of the most common degradation artifacts in dynamic scene photography. This paper reviews the NTIRE 2020 Challenge on Image and Video Deblurring. In this challenge, we present the evaluation results from 3 competition tracks as well as the proposed solutions. Track 1 aims to develop single-image deblurring methods focusing on restoration quality. In Track 2, the image deblurring methods are executed on a mobile platform to find the balance between running speed and restoration accuracy. Track 3 targets developing video deblurring methods that exploit the temporal relation between input frames. The three tracks had 163, 135, and 102 registered participants, respectively, and 9, 4, and 7 teams competed in the final testing phase. The winning methods demonstrate state-of-the-art performance on image and video deblurring tasks.

16. Visual Question Answering with Prior Class Semantics
Violetta Shevchenko, Damien Teney, Anthony Dick, Anton van den Hengel
Abstract: We present a novel mechanism to embed prior knowledge in a model for visual question answering. The open-set nature of the task is at odds with the ubiquitous approach of training of a fixed classifier. We show how to exploit additional information pertaining to the semantics of candidate answers. We extend the answer prediction process with a regression objective in a semantic space, in which we project candidate answers using prior knowledge derived from word embeddings. We perform an extensive study of learned representations with the GQA dataset, revealing that important semantic information is captured in the relations between embeddings in the answer space. Our method brings improvements in consistency and accuracy over a range of question types. Experiments with novel answers, unseen during training, indicate the method's potential for open-set prediction.

17. One-Shot Image Classification by Learning to Restore Prototypes
Wanqi Xue, Wei Wang
Abstract: One-shot image classification aims to train image classifiers on a dataset with only one image per category. This is challenging for modern deep neural networks, which typically require hundreds or thousands of images per class. In this paper, we adopt metric learning for this problem, which has been applied to few- and many-shot image classification by comparing the distance between the test image and the center of each class in the feature space. However, for one-shot learning, existing metric learning approaches perform poorly because the single training image may not be representative of the class. For example, if the image is far away from the class center in the feature space, metric-learning based algorithms are unlikely to make correct predictions for the test images because the decision boundary is shifted by this noisy image. To address this issue, we propose a simple yet effective regression model, denoted RestoreNet, which learns a class-agnostic transformation on the image feature to move the image closer to the class center in the feature space. Experiments demonstrate that RestoreNet obtains superior performance over state-of-the-art methods on a broad range of datasets. Moreover, RestoreNet can be easily combined with other methods to achieve further improvement.
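The failure mode described in this abstract, where a single unrepresentative training image shifts the decision boundary, can be reproduced with a toy nearest-class-center classifier. The sketch below does not implement RestoreNet; it only illustrates the problem that RestoreNet is meant to address. All feature values and labels are made up for the example.

```python
from math import dist


def class_center(features):
    """Mean of a list of equal-length feature vectors."""
    n = len(features)
    return tuple(sum(f[d] for f in features) / n
                 for d in range(len(features[0])))


def classify(query, centers):
    """Return the label whose center is closest to the query in L2 distance."""
    return min(centers, key=lambda label: dist(centers[label], query))


# Many-shot setting: centers are averages over several training features.
many_shot = {
    "cat": class_center([(0.0, 0.0), (2.0, 0.0), (1.0, 3.0)]),   # (1.0, 1.0)
    "dog": class_center([(5.0, 5.0), (7.0, 5.0), (6.0, 8.0)]),   # (6.0, 6.0)
}

# One-shot setting: the single "cat" image happens to be an outlier.
one_shot = {"cat": (-3.0, -3.0), "dog": (6.0, 6.0)}

query = (3.0, 3.0)
print(classify(query, many_shot))   # cat
print(classify(query, one_shot))    # dog
```

Moving the outlying prototype back toward the true class center, which is what the abstract says the learned transformation aims to do, would restore the correct prediction in this example.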

18. AIM 2019 Challenge on Video Temporal Super-Resolution: Methods and Results
Seungjun Nah, Sanghyun Son, Radu Timofte, Kyoung Mu Lee
Abstract: Videos contain motions of various types and strengths that may look unnaturally discontinuous in time when the recorded frame rate is low. This paper reviews the first AIM challenge on video temporal super-resolution (frame interpolation) with a focus on the proposed solutions and results. From low-frame-rate (15 fps) video sequences, the challenge participants are asked to submit higher-frame-rate (60 fps) video sequences by estimating temporally intermediate frames. We employ the REDS VTSR dataset, derived from diverse videos captured with a hand-held camera, for training and evaluation purposes. The competition had 62 registered participants, and a total of 8 teams competed in the final testing phase. The challenge-winning methods achieve the state of the art in video temporal super-resolution.
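Going from 15 fps to 60 fps means estimating three new frames between every pair of consecutive input frames. The sketch below computes their temporal positions and fills them with a per-pixel linear blend. The blend is only a naive baseline chosen for illustration, not one of the challenge methods, which estimate motion rather than cross-fading; frames are represented as flat lists of pixel values for simplicity.

```python
def intermediate_times(src_fps, dst_fps):
    """Fractional positions of the new frames between two consecutive source frames."""
    factor = dst_fps // src_fps   # assumes dst_fps is a multiple of src_fps
    return [k / factor for k in range(1, factor)]


def blend(frame_a, frame_b, t):
    """Naive baseline: per-pixel linear blend at position t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(frame_a, frame_b)]


times = intermediate_times(15, 60)
print(times)   # [0.25, 0.5, 0.75]

frame_a, frame_b = [0, 100], [100, 0]
for t in times:
    print(t, blend(frame_a, frame_b, t))
```

A cross-fade like this produces ghosting on moving objects, which is why the task is posed as estimating genuinely intermediate frames.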

19. Minor Privacy Protection Through Real-time Video Processing at the Edge
Meng Yuan, Seyed Yahya Nikouei, Alem Fitwi, Yu Chen, Yunxi Dong
Abstract: The collection of extensive personal information about individuals, including the minor members of a family, by closed-circuit television (CCTV) cameras raises many privacy concerns. In particular, revealing children's identities or activities may compromise their well-being. In this paper, we investigate lightweight solutions that are affordable for edge surveillance systems and that make it feasible to identify minors accurately, so that appropriate privacy-preserving measures can be applied. State-of-the-art deep learning architectures are modified and repurposed in a cascaded fashion to maximize the accuracy of our model. A pipeline extracts faces from the input frames and classifies each one as an adult or a child. Over 20,000 labeled sample points are used for classification. We explore the timing and resources needed for such a model to be used in the Edge-Fog architecture at the edge of the network, where we can achieve near real-time performance on the CPU. Quantitative experimental results show the superiority of our proposed model, with a classification accuracy of 92.1%, compared to some other face recognition based child detection approaches.

20. Multi-focus Image Fusion: A Benchmark
Xingchen Zhang
Abstract: Multi-focus image fusion (MFIF) has attracted considerable interest due to its numerous applications. While much progress has been made in recent years through efforts to develop various MFIF algorithms, some issues significantly hinder the fair and comprehensive performance comparison of MFIF methods, such as the lack of a large-scale test set and the arbitrary choice of objective evaluation metrics in the literature. To solve these issues, this paper presents a multi-focus image fusion benchmark (MFIFB), which consists of a test set of 105 image pairs, a code library of 30 MFIF algorithms, and 20 evaluation metrics. MFIFB is the first benchmark in the field of MFIF and provides the community with a platform to compare MFIF algorithms fairly and comprehensively. Extensive experiments have been conducted using the proposed MFIFB to understand the performance of these algorithms. By analyzing the experimental results, effective MFIF algorithms are identified. More importantly, some observations on the status of the MFIF field are given, which can help readers understand this field better.

21. Deep Encoder-Decoder Neural Network for Fingerprint Image Denoising and Inpainting
Weiya Fan
Abstract: Fingerprint image denoising is a very important step in fingerprint identification. To improve the denoising of fingerprint images, we design a fingerprint denoising algorithm based on a deep encoder-decoder network, in which the encoder subnet learns the fingerprint features of noisy images and the decoder subnet reconstructs the original fingerprint image from those features to achieve denoising. Dilated convolution is used in the network to enlarge the receptive field without increasing complexity and to improve the network's inference speed. In addition, feature fusion at different levels of the network is achieved through the introduction of residual learning, which further restores the detailed features of the fingerprint and improves the denoising effect. Finally, the experimental results show that the algorithm enables better recovery of edge, line, and curve features in fingerprint images, with better visual effects and a higher peak signal-to-noise ratio (PSNR) compared to other methods.
摘要：指纹图像去噪是指纹识别非常重要的一步。以提高指纹图像的去噪效果，我们设计了一个指纹去噪算法基于深编码解码器网络，编码器的子网学习指纹嘈杂images.the解码器子网的功能重建基于特征的原始指纹图像实现对降噪，同时采用了卷积扩张在网络中增加受体领域在不增加复杂性并提高了网络的推理速度。此外，在不同级别的网络的特征融合通过引入残余学习，这进一步还原指纹的细部特征和提高了去噪效果的实现。最后，实验结果表明，与其它方法相比该算法使得边缘，线，在指纹图像曲线特征的更好的恢复，具有更好的视觉效果和更高的峰值信噪比（PSNR）。
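
The abstract above reports its results in terms of peak signal-to-noise ratio (PSNR). As a reminder of how that metric is usually computed, here is a minimal sketch, assuming 8-bit pixel values (peak 255) and images given as nested lists. The function name `psnr` and its parameters are illustrative and are not taken from the paper.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Return the PSNR in dB between two equally sized greyscale images.

    Images are nested lists of pixel values; max_value is the assumed peak
    pixel value (255 for 8-bit images).
    """
    ref = [p for row in reference for p in row]
    out = [p for row in restored for p in row]
    if len(ref) != len(out) or not ref:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A denoised image that is closer to the clean reference has a smaller mean squared error and therefore a higher PSNR.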

22. Remote Sensing Image Scene Classification Meets Deep Learning: Challenges, Methods, Benchmarks, and Opportunities [PDF]
Gong Cheng, Xingxing Xie, Junwei Han, Lei Guo, Gui-Song Xia
Abstract: Remote sensing image scene classification, which aims at labeling remote sensing images with a set of semantic categories based on their contents, has broad applications in a range of fields. Propelled by the powerful feature learning capabilities of deep neural networks, remote sensing image scene classification driven by deep learning has drawn remarkable attention and achieved significant breakthroughs. However, to the best of our knowledge, a comprehensive review of recent achievements regarding deep learning for scene classification of remote sensing images is still lacking. Considering the rapid evolution of this field, this paper provides a systematic survey of deep learning methods for remote sensing image scene classification by covering more than 140 papers. To be specific, we discuss the main challenges of scene classification and survey (1) Autoencoder-based scene classification methods, (2) Convolutional Neural Network-based scene classification methods, and (3) Generative Adversarial Network-based scene classification methods. In addition, we introduce the benchmarks used for scene classification and summarize the performance of more than two dozen representative algorithms on three commonly used benchmark data sets. Finally, we discuss the promising opportunities for further research.
Summary: Remote sensing image scene classification labels remote sensing images with semantic categories according to their content and is widely applied. Deep learning has driven significant breakthroughs, yet a comprehensive review of recent work has been lacking. This survey covers more than 140 papers, discusses the main challenges, and reviews (1) autoencoder-based, (2) convolutional neural network-based, and (3) generative adversarial network-based methods. It also introduces the benchmarks, summarizes the performance of more than two dozen representative algorithms on three commonly used data sets, and discusses opportunities for further research.

23. Feature-metric Registration: A Fast Semi-supervised Approach for Robust Point Cloud Registration without Correspondences [PDF]
Xiaoshui Huang, Guofeng Mei, Jian Zhang
Abstract: We present a fast feature-metric point cloud registration framework, which enforces the optimisation of registration by minimising a feature-metric projection error without correspondences. The advantage of the feature-metric projection error is that it is robust to noise, outliers and density differences, in contrast to the geometric projection error. Besides, minimising the feature-metric projection error does not require searching for correspondences, so the optimisation is fast. The principle behind the proposed method is that the feature difference is smallest when the point clouds are well aligned. We train the proposed method with a semi-supervised or unsupervised approach, which requires limited or no registration label data. Experiments demonstrate that our method obtains higher accuracy and robustness than the state-of-the-art methods. Besides, experimental results show that the proposed method can handle significant noise and density differences, and can solve both same-source and cross-source point cloud registration.
Summary: The paper presents a fast feature-metric point cloud registration framework that optimises registration by minimising a feature-metric projection error without correspondences. Unlike the geometric projection error, this error is robust to noise, outliers and density differences, and because no correspondence search is needed the optimisation is fast. The underlying principle is that the feature difference is smallest when the point clouds are well aligned. The method is trained in a semi-supervised or unsupervised way with limited or no registration labels. Experiments show higher accuracy and robustness than state-of-the-art methods, including under significant noise and density differences and for both same-source and cross-source registration.

24. Using Artificial Intelligence to Analyze Fashion Trends [PDF]
Mengyun Shi, Van Dyk Lewis
Abstract: Analyzing fashion trends is essential in the fashion industry. Current fashion forecasting firms, such as WGSN, utilize visual information from around the world to analyze and predict fashion trends. However, analyzing fashion trends is time-consuming and extremely labor intensive, requiring individual employees' manual editing and classification. To improve the efficiency of data analysis of such image-based information and lower the cost of analyzing fashion images, this study proposes a data-driven quantitative abstracting approach using an artificial intelligence (A.I.) algorithm. Specifically, an A.I. model was trained on fashion images from a large-scale dataset under different scenarios, for example in online stores and street snapshots. This model was used to detect garments and classify clothing attributes such as textures, garment style, and details for runway photos and videos. It was found that the A.I. model can generate rich attribute descriptions of detected regions and accurately bind the garments in the images. Adoption of the A.I. algorithm demonstrated promising results and the potential to classify garment types and details automatically, which can make the process of trend forecasting more cost-effective and faster.
Summary: Analyzing fashion trends is essential in the fashion industry. Forecasting firms such as WGSN use visual information from around the world to analyze and predict trends, but the work is time-consuming and labor intensive because it relies on manual editing and classification. To improve efficiency and lower cost, this study proposes a data-driven quantitative abstracting approach using an artificial intelligence (A.I.) algorithm. An A.I. model was trained on a large-scale fashion image dataset covering scenarios such as online stores and street snapshots, and was used to detect garments and classify attributes such as textures, garment style, and details in runway photos and videos. The model generates rich attribute descriptions of detected regions and accurately binds the garments in the images, showing potential to make trend forecasting faster and more cost-effective.

25. Joint-SRVDNet: Joint Super Resolution and Vehicle Detection Network [PDF]
Moktari Mostofa, Syeda Nyma Ferdous, Benjamin S. Riggan, Nasser M. Nasrabadi
Abstract: In many domestic and military applications, aerial vehicle detection and super-resolution algorithms are frequently developed and applied independently. However, aerial vehicle detection on super-resolved images remains a challenging task due to the lack of discriminative information in the super-resolved images. To address this problem, we propose a Joint Super-Resolution and Vehicle Detection Network (Joint-SRVDNet) that tries to generate discriminative, high-resolution images of vehicles from low-resolution aerial images. First, aerial images are up-scaled by a factor of 4x using a Multi-scale Generative Adversarial Network (MsGAN), which has multiple intermediate outputs with increasing resolutions. Second, a detector is trained on super-resolved images that are upscaled by a factor of 4x using the MsGAN architecture, and finally, the detection loss is minimized jointly with the super-resolution loss to encourage the target detector to be sensitive to the subsequent super-resolution training. The network jointly learns hierarchical and discriminative features of targets and produces optimal super-resolution results. We perform both quantitative and qualitative evaluation of our proposed network on the VEDAI, xView and DOTA datasets. The experimental results show that our proposed framework achieves better visual quality than the state-of-the-art methods for aerial super-resolution with a 4x up-scaling factor and improves the accuracy of aerial vehicle detection.
Summary: In many domestic and military applications, aerial vehicle detection and super-resolution algorithms are developed and applied independently, and detection on super-resolved images remains hard because those images lack discriminative information. The paper proposes Joint-SRVDNet, which generates discriminative high-resolution vehicle images from low-resolution aerial images. Images are first up-scaled 4x by a Multi-scale Generative Adversarial Network (MsGAN) with multiple intermediate outputs of increasing resolution; a detector is then trained on the super-resolved images, and the detection loss is minimized jointly with the super-resolution loss. Quantitative and qualitative evaluation on VEDAI, xView and DOTA shows better visual quality than state-of-the-art aerial super-resolution methods at 4x and improved aerial vehicle detection accuracy.

26. Quadtree Driven Lossy Event Compression [PDF]
Srutarshi Banerjee, Zihao W. Wang, Henry H. Chopp, Oliver Cossairt, Aggelos Katsaggelos
Abstract: Event cameras are emerging bio-inspired sensors that offer salient benefits over traditional cameras. With high speed, high dynamic range, and low power consumption, event cameras have been increasingly employed to solve existing as well as novel visual and robotics tasks. Despite rapid advancement in event-based vision, event data compression is facing growing demand, yet remains elusively challenging and not effectively addressed. The major challenge is the unique data form, i.e., a stream of four-attribute events, encoding the spatial locations and the timestamp of each event, with a polarity representing the brightness increase/decrease. While events encode temporal variations at high speed, they omit rich spatial information, which is critical for image/video compression. In this paper, we perform lossy event compression (LEC) based on a quadtree (QT) segmentation map derived from an adjacent image. The QT structure provides a priority map for the 3D space-time volume, albeit in a 2D manner. LEC is performed by first quantizing the events over time, and then variably compressing the events within each QT block via Poisson Disk Sampling in 2D space for each quantized time. Our QT-LEC has flexibility in accordance with the bit-rate requirement. Experimentally, we show results with state-of-the-art coding performance. We further evaluate the performance in event-based applications such as image reconstruction and corner detection.
Summary: Event cameras are emerging bio-inspired sensors whose high speed, high dynamic range, and low power consumption make them increasingly useful for vision and robotics tasks. Demand for event data compression is growing, but the problem remains challenging because of the data form: a stream of four-attribute events encoding spatial location, timestamp, and a polarity for brightness increase or decrease. Events capture fast temporal variation but omit the rich spatial information that image and video compression relies on. The paper performs lossy event compression (LEC) using a quadtree (QT) segmentation map derived from an adjacent image, which acts as a 2D priority map for the 3D space-time volume. Events are first quantized over time and then variably compressed within each QT block by Poisson Disk Sampling in 2D for each quantized time. QT-LEC adapts to the bit-rate requirement, achieves state-of-the-art coding performance, and is further evaluated on image reconstruction and corner detection.

27. Tensor optimal transport, distance between sets of measures and tensor scaling [PDF]
Shmuel Friedland
Abstract: We study the optimal transport problem for $d>2$ discrete measures. This is a linear programming problem on $d$-tensors. It gives a way to compute a "distance" between two sets of discrete measures. We introduce an entropic regularization term, which gives rise to a scaling of tensors. We give a variation of the celebrated Sinkhorn scaling algorithm. We show that this algorithm can be viewed as a partial minimization algorithm of a strictly convex function. Under appropriate conditions the rate of convergence is geometric and we estimate the rate. Our results are generalizations of known results for the classical case of two discrete measures.
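Summary: The paper studies the optimal transport problem for $d>2$ discrete measures, a linear programming problem on $d$-tensors that gives a way to compute a "distance" between two sets of discrete measures. An entropic regularization term is introduced, which gives rise to a scaling of tensors, and a variation of the Sinkhorn scaling algorithm is given. The algorithm can be viewed as a partial minimization algorithm of a strictly convex function; under appropriate conditions its convergence is geometric, and the rate is estimated. The results generalize known results for the classical case of two discrete measures.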
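
The abstract above generalizes entropic regularization and Sinkhorn scaling from two measures to $d>2$ measures. For orientation, here is a minimal sketch of the classical two-measure (matrix) case only; it is not the paper's tensor algorithm. The function name `sinkhorn`, the regularization weight `eps`, and the fixed iteration count are assumptions made for illustration.

```python
import math


def sinkhorn(cost, mu, nu, eps=0.1, iters=500):
    """Entropic optimal transport between two discrete measures.

    cost is an n-by-m cost matrix, mu and nu are the marginals (each sums
    to 1). Returns the transport plan diag(u) K diag(v) with
    K = exp(-cost / eps).
    """
    n, m = len(mu), len(nu)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Rescale rows so the plan matches mu, then columns so it matches nu.
        u = [mu[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each pass alternately rescales rows and columns, which is the scaling step the abstract refers to; the tensor setting replaces the matrix by a $d$-tensor with one scaling vector per mode.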

28. Multi-Modality Generative Adversarial Networks with Tumor Consistency Loss for Brain MR Image Synthesis [PDF]
Bingyu Xin, Yifan Hu, Yefeng Zheng, Hongen Liao
Abstract: Magnetic Resonance (MR) images of different modalities can provide complementary information for clinical diagnosis, but whole modalities are often costly to access. Most existing methods only focus on synthesizing missing images between two modalities, which limits their robustness and efficiency when multiple modalities are missing. To address this problem, we propose a multi-modality generative adversarial network (MGAN) to synthesize three high-quality MR modalities (FLAIR, T1 and T1ce) from one MR modality T2 simultaneously. The experimental results show that the quality of the images synthesized by our proposed method is better than that of the images synthesized by the baseline model, pix2pix. Besides, for MR brain image synthesis, it is important to preserve the critical tumor information in the generated modalities, so we further introduce a multi-modality tumor consistency loss to MGAN, called TC-MGAN. We use the modalities synthesized by TC-MGAN to boost the tumor segmentation accuracy, and the results demonstrate its effectiveness.
Summary: MR images of different modalities provide complementary information for clinical diagnosis, but acquiring all modalities is often costly. Most existing methods synthesize missing images only between two modalities, which limits robustness and efficiency when several modalities are missing. The paper proposes a multi-modality generative adversarial network (MGAN) that synthesizes three high-quality MR modalities (FLAIR, T1 and T1ce) from a single T2 modality simultaneously, with better image quality than the pix2pix baseline. To preserve critical tumor information in the generated modalities, a multi-modality tumor consistency loss is added, giving TC-MGAN. Modalities synthesized by TC-MGAN are used to boost tumor segmentation accuracy, and the results demonstrate its effectiveness.

29. SAMP: Shape and Motion Priors for 4D Vehicle Reconstruction [PDF]
Francis Engelmann, Jörg Stückler, Bastian Leibe
Abstract: Inferring the pose and shape of vehicles in 3D from a movable platform remains a challenging task due to the projective sensing principle of cameras, difficult surface properties, e.g. reflections or transparency, and illumination changes between images. In this paper, we propose to use 3D shape and motion priors to regularize the estimation of the trajectory and the shape of vehicles in sequences of stereo images. We represent shapes by 3D signed distance functions and embed them in a low-dimensional manifold. Our optimization method allows for imposing a common shape across all image observations along an object track. We employ a motion model to regularize the trajectory to plausible object motions. We evaluate our method on the KITTI dataset and show state-of-the-art results in terms of shape reconstruction and pose estimation accuracy.
Summary: Inferring the 3D pose and shape of vehicles from a moving platform remains challenging because of the projective sensing principle of cameras, difficult surface properties such as reflections or transparency, and illumination changes between images. The paper uses 3D shape and motion priors to regularize the estimation of vehicle trajectory and shape in stereo image sequences. Shapes are represented by 3D signed distance functions embedded in a low-dimensional manifold, the optimization imposes a common shape across all image observations along an object track, and a motion model regularizes the trajectory to plausible object motions. Evaluation on the KITTI dataset shows state-of-the-art shape reconstruction and pose estimation accuracy.

30. BeCAPTCHA-Mouse: Synthetic Mouse Trajectories and Improved Bot Detection [PDF]
Alejandro Acien, Aythami Morales, Julian Fierrez, Ruben Vera-Rodriguez
Abstract: We first study the suitability of behavioral biometrics to distinguish between computers and humans, commonly referred to as bot detection. We then present BeCAPTCHA-Mouse, a bot detector based on neuromotor modeling of mouse dynamics that enhances traditional CAPTCHA methods. Our proposed bot detector is trained using both human and bot data generated by two new methods developed for generating realistic synthetic mouse trajectories: i) a knowledge-based method based on heuristic functions, and ii) a data-driven method based on Generative Adversarial Networks (GANs) in which a Generator synthesizes human-like trajectories from a Gaussian noise input. Experiments are conducted on a new testbed also introduced here and available on GitHub: BeCAPTCHA-Mouse Benchmark; useful for research in bot detection and other mouse-based HCI applications. Our benchmark data consists of 10,000 mouse trajectories including real data from 58 users and bot data with various levels of realism. Our experiments show that BeCAPTCHA-Mouse is able to detect bot trajectories of high realism with 93% accuracy on average using only one mouse trajectory. When our approach is fused with state-of-the-art mouse dynamic features, the bot detection accuracy increases relatively by more than 36%, proving that mouse-based bot detection is a fast, easy, and reliable tool to complement traditional CAPTCHA systems.
Summary: The paper first studies how suitable behavioral biometrics are for distinguishing computers from humans, commonly referred to as bot detection. It then presents BeCAPTCHA-Mouse, a bot detector based on neuromotor modeling of mouse dynamics that enhances traditional CAPTCHA methods. The detector is trained on human data and on bot data from two new methods for generating realistic synthetic mouse trajectories: i) a knowledge-based method using heuristic functions, and ii) a data-driven method using Generative Adversarial Networks (GANs), in which a Generator synthesizes human-like trajectories from Gaussian noise. Experiments use a new testbed released on GitHub, the BeCAPTCHA-Mouse Benchmark, with 10,000 mouse trajectories including real data from 58 users and bot data of varying realism. BeCAPTCHA-Mouse detects highly realistic bot trajectories with 93% accuracy on average from a single trajectory, and fusing it with state-of-the-art mouse dynamic features raises bot detection accuracy by more than 36% in relative terms.

31. Derivation of a Constant Velocity Motion Model for Visual Tracking [PDF]
Nathanael L. Baisa
Abstract: Motion models play a great role in visual tracking applications for predicting the possible locations of objects in the next frame. Unlike target tracking in the radar or aerospace domain, which considers only points, object tracking in computer vision involves sizes of objects. The constant velocity motion model is the most widely used motion model for visual tracking. However, there is no clear and understandable derivation involving sizes of objects, especially for new researchers joining this research field. In this document, we derive the constant velocity motion model that incorporates sizes of objects, which, we think, can help new researchers to adapt to it very quickly.
Summary: Motion models play an important role in visual tracking by predicting where objects may appear in the next frame. Unlike target tracking in the radar or aerospace domain, which considers only points, object tracking in computer vision also involves object sizes. The constant velocity motion model is the most widely used motion model for visual tracking, yet there is no clear and accessible derivation that includes object sizes, especially for researchers new to the field. This document derives the constant velocity motion model with object sizes included, to help new researchers adopt it quickly.
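
To make the idea above concrete, here is a minimal sketch of one prediction step of a constant velocity model for a bounding box. The state layout (centre position, width, height, and a velocity for each of them) and the name `predict_constant_velocity` are assumptions chosen for illustration; the abstract does not state the exact parameterization the document derives.

```python
def predict_constant_velocity(state, dt=1.0):
    """Predict the next bounding-box state under constant velocity.

    state = (x, y, w, h, vx, vy, vw, vh): box centre, width and height,
    followed by their assumed rates of change per unit time.
    """
    x, y, w, h, vx, vy, vw, vh = state
    # Position and size advance linearly; the velocities stay unchanged.
    return (x + vx * dt, y + vy * dt, w + vw * dt, h + vh * dt, vx, vy, vw, vh)
```

For example, a box at (10, 20) of size 4 by 8 moving with velocities (1, -2, 0.5, 0) is predicted at (12, 16) with size 5 by 8 after two time units. In a Kalman filter this step corresponds to multiplying the state by the transition matrix, with process noise handled separately.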

32. DroTrack: High-speed Drone-based Object Tracking Under Uncertainty [PDF]
Ali Hamdi, Flora Salim, Du Yong Kim
Abstract: We present DroTrack, a high-speed visual single-object tracking framework for drone-captured video sequences. Most of the existing object tracking methods are designed to tackle well-known challenges, such as occlusion and cluttered backgrounds. The complex motion of drones, i.e., multiple degrees of freedom in three-dimensional space, causes high uncertainty. The uncertainty problem leads to inaccurate location predictions and fuzziness in scale estimations. DroTrack solves such issues by discovering the dependency between object representation and motion geometry. We implement an effective object segmentation based on Fuzzy C Means (FCM). We incorporate the spatial information into the membership function to cluster the most discriminative segments. We then enhance the object segmentation by using a pre-trained Convolutional Neural Network (CNN) model. DroTrack also leverages the geometrical angular motion to estimate a reliable object scale. We discuss the experimental results and performance evaluation using two datasets of 51,462 drone-captured frames. The combination of the FCM segmentation and the angular scaling increased DroTrack precision by up to $9\%$ and decreased the centre location error by $162$ pixels on average. DroTrack outperforms all the high-speed trackers and achieves comparable results in comparison to deep learning trackers. DroTrack offers high frame rates of up to 1000 frames per second (fps) with the best location precision, better than a set of state-of-the-art real-time trackers.
Summary: The paper presents DroTrack, a high-speed visual single-object tracking framework for drone-captured video. Most existing trackers target well-known challenges such as occlusion and cluttered backgrounds, whereas the complex motion of drones, with multiple degrees of freedom in three-dimensional space, causes high uncertainty that leads to inaccurate location predictions and fuzzy scale estimates. DroTrack addresses this by exploiting the dependency between object representation and motion geometry. It segments the object with Fuzzy C Means (FCM), adding spatial information to the membership function to cluster the most discriminative segments, refines the segmentation with a pre-trained CNN, and uses geometrical angular motion to estimate a reliable object scale. On two datasets totalling 51,462 drone-captured frames, combining FCM segmentation with angular scaling raised precision by up to 9% and reduced the centre location error by 162 pixels on average. DroTrack outperforms all high-speed trackers, is comparable to deep learning trackers, and runs at up to 1000 frames per second (fps).
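
The abstract above builds its segmentation on Fuzzy C Means. The following is a minimal sketch of plain FCM on one-dimensional data, assuming fuzzifier `m=2` and a fixed iteration count. It deliberately omits the spatial term that DroTrack adds to the membership function, and the function name and parameters are illustrative rather than taken from the paper.

```python
def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Plain fuzzy c-means on 1D points, starting from initial centers.

    Returns the final centers and the memberships computed in the last
    iteration (one row per point, one column per cluster).
    """
    centers = list(centers)
    memberships = []
    for _ in range(iters):
        memberships = []
        for x in points:
            dists = [abs(x - c) for c in centers]
            if 0.0 in dists:
                # A point sitting on a center belongs fully to that center.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                total = sum(row)
                row = [r / total for r in row]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                row = [v / total for v in inv]
            memberships.append(row)
        # Each center becomes the membership-weighted mean of the points.
        centers = [
            sum((u[i] ** m) * x for u, x in zip(memberships, points))
            / sum(u[i] ** m for u in memberships)
            for i in range(len(centers))
        ]
    return centers, memberships
```

Unlike hard k-means, every point keeps a graded membership in every cluster, which is what allows an extra spatial weighting to be folded into the membership update.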

33. Projection Inpainting Using Partial Convolution for Metal Artifact Reduction [PDF]
Lin Yuan, Yixing Huang, Andreas Maier
Abstract: In computed tomography, due to the presence of metal implants in the patient body, reconstructed images will suffer from metal artifacts. In order to reduce metal artifacts, metals are typically removed in projection images. Therefore, the metal corrupted projection areas need to be inpainted. For deep learning inpainting methods, convolutional neural networks (CNNs) are widely used, for example, the U-Net. However, such CNNs use convolutional filter responses on both valid and corrupted pixel values, resulting in unsatisfactory image quality. In this work, partial convolution is applied for projection inpainting, which only relies on valid pixel values. The U-Net with partial convolution and conventional convolution are compared for metal artifact reduction. Our experiments demonstrate that the U-Net with partial convolution is able to inpaint the metal corrupted areas better than that with conventional convolution.
Summary: In computed tomography, metal implants in the patient's body cause metal artifacts in reconstructed images. To reduce them, metal is typically removed from the projection images, so the metal-corrupted projection areas must be inpainted. Convolutional neural networks (CNNs) such as the U-Net are widely used for deep learning inpainting, but they apply convolutional filters to both valid and corrupted pixel values, which gives unsatisfactory image quality. This work applies partial convolution, which relies only on valid pixel values, to projection inpainting. A comparison of the U-Net with partial and with conventional convolution shows that the partial-convolution variant inpaints the metal-corrupted areas better.
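
The abstract above contrasts conventional convolution with partial convolution, which uses only valid pixels. Here is a minimal single-channel sketch of one partial convolution step in the commonly used renormalized form: corrupted pixels are masked out, the response is rescaled by the fraction of valid pixels in the window, and the mask is updated. The function name, the "valid" (no padding) border handling, and the list-based image format are assumptions for illustration, not details from the paper.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """One partial convolution over a 2D image (no padding, stride 1).

    mask holds 1 for valid pixels and 0 for corrupted ones. Returns the
    output feature map and the updated mask.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    new_mask = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            valid = 0
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    if mask[r + i][c + j]:
                        valid += 1
                        acc += kernel[i][j] * image[r + i][c + j]
            if valid:
                # Rescale by window size / valid count to compensate for
                # the pixels that were masked out.
                out[r][c] = acc * (kh * kw) / valid + bias
                new_mask[r][c] = 1
    return out, new_mask
```

With an averaging kernel, a window whose centre pixel is corrupted still returns the mean of the surrounding valid pixels, whereas a conventional convolution would mix the corrupted value into the result.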

34. CoMoGCN: Coherent Motion Aware Trajectory Prediction with Graph Representation [PDF]
Yuying Chen, Congcong Liu, Bertram Shi, Ming Liu
Abstract: Forecasting human trajectories is critical for tasks such as robot crowd navigation and autonomous driving. Modeling social interactions is of great importance for accurate group-wise motion prediction. However, most existing methods do not consider information about coherence within the crowd, but rather only pairwise interactions. In this work, we propose a novel framework, coherent motion aware graph convolutional network (CoMoGCN), for trajectory prediction in crowded scenes with group constraints. First, we cluster pedestrian trajectories into groups according to motion coherence. Then, we use graph convolutional networks to aggregate crowd information efficiently. The CoMoGCN also takes advantage of variational autoencoders to capture the multimodal nature of the human trajectories by modeling the distribution. Our method achieves state-of-the-art performance on several different trajectory prediction benchmarks, and the best average performance among all benchmarks considered.
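Summary: Forecasting human trajectories is critical for tasks such as robot crowd navigation and autonomous driving, and modeling social interactions is important for accurate group-wise motion prediction. Most existing methods consider only pairwise interactions and ignore coherence within the crowd. The paper proposes the coherent motion aware graph convolutional network (CoMoGCN) for trajectory prediction in crowded scenes with group constraints. Pedestrian trajectories are first clustered into groups by motion coherence, graph convolutional networks then aggregate crowd information efficiently, and variational autoencoders model the distribution to capture the multimodal nature of human trajectories. The method achieves state-of-the-art performance on several trajectory prediction benchmarks and the best average performance across all benchmarks considered.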

35. Heterogeneous Knowledge Distillation using Information Flow Modeling [PDF]
Nikolaos Passalis, Maria Tzelepi, Anastasios Tefas
Abstract: Knowledge Distillation (KD) methods are capable of transferring the knowledge encoded in a large and complex teacher into a smaller and faster student. Early methods were usually limited to transferring the knowledge only between the last layers of the networks, while later approaches were capable of performing multi-layer KD, further increasing the accuracy of the student. However, despite their improved performance, these methods still suffer from several limitations that restrict both their efficiency and flexibility. First, existing KD methods typically ignore that neural networks go through different learning phases during the training process, which often require different types of supervision for each one. Furthermore, existing multi-layer KD methods are usually unable to effectively handle networks with significantly different architectures (heterogeneous KD). In this paper we propose a novel KD method that works by modeling the information flow through the various layers of the teacher model and then training a student model to mimic this information flow. The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process, as well as by designing and training an appropriate auxiliary teacher model that acts as a proxy model capable of "explaining" the way the teacher works to the student. The effectiveness of the proposed method is demonstrated using four image datasets and several different evaluation setups.
Summary: Knowledge Distillation (KD) transfers the knowledge encoded in a large, complex teacher into a smaller, faster student. Early methods transferred knowledge only between the last layers, while later multi-layer KD methods further improved student accuracy. These methods still have limitations: they typically ignore that neural networks go through different learning phases during training, each of which may need a different type of supervision, and they usually cannot handle networks with significantly different architectures (heterogeneous KD). The paper proposes a KD method that models the information flow through the layers of the teacher and trains the student to mimic it. It uses an appropriate supervision scheme for each training phase and an auxiliary teacher that acts as a proxy capable of "explaining" how the teacher works to the student. Effectiveness is demonstrated on four image datasets and several evaluation setups.

36. Cross-View Image Retrieval -- Ground to Aerial Image Retrieval through Deep Learning [PDF]
Numan Khurshid, Talha Hanif, Mohbat Tharani, Murtaza Taj
Abstract: Cross-modal retrieval aims to measure the content similarity between different types of data. The idea has been previously applied to visual, text, and speech data. In this paper, we present a novel cross-modal retrieval method specifically for multi-view images, called Cross-view Image Retrieval (CVIR). Our approach aims to find a feature space as well as an embedding space in which samples from street-view images are compared directly to satellite-view images (and vice-versa). For this comparison, a novel deep metric learning based solution "DeepCVIR" has been proposed. Previous cross-view image datasets are deficient in that they (1) lack class information; (2) were originally collected for the cross-view image geolocalization task with coupled images; (3) do not include any images from off-street locations. To train, compare, and evaluate the performance of cross-view image retrieval, we present a new 6-class cross-view image dataset termed CrossViewRet, which comprises images of freeway, mountain, palace, river, ship, and stadium with 700 high-resolution dual-view images for each class. Results show that the proposed DeepCVIR outperforms conventional matching approaches on the CVIR task for the given dataset and would also serve as the baseline for future research.
Summary: Cross-modal retrieval measures content similarity between different types of data and has previously been applied to visual, text, and speech data. The paper presents a cross-modal retrieval method for multi-view images called Cross-view Image Retrieval (CVIR), which seeks a feature space and an embedding space in which street-view images can be compared directly with satellite-view images and vice versa, using a deep metric learning solution named "DeepCVIR". Earlier cross-view datasets (1) lack class information, (2) were collected for cross-view geolocalization with coupled images, and (3) contain no off-street images. The authors therefore introduce CrossViewRet, a 6-class dataset (freeway, mountain, palace, river, ship, stadium) with 700 high-resolution dual-view images per class. DeepCVIR outperforms conventional matching approaches on the CVIR task for this dataset and is offered as a baseline for future research.

37. PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification Using Highly Randomized Synthetic Data [PDF]
Zheng Tang, Milind Naphade, Stan Birchfield, Jonathan Tremblay, William Hodge, Ratnesh Kumar, Shuo Wang, Xiaodong Yang
Abstract: In comparison with person re-identification (ReID), which has been widely studied in the research community, vehicle ReID has received less attention. Vehicle ReID is challenging due to 1) high intra-class variability (caused by the dependency of shape and appearance on viewpoint), and 2) small inter-class variability (caused by the similarity in shape and appearance between vehicles produced by different manufacturers). To address these challenges, we propose a Pose-Aware Multi-Task Re-Identification (PAMTRI) framework. This approach includes two innovations compared with previous methods. First, it overcomes viewpoint-dependency by explicitly reasoning about vehicle pose and shape via keypoints, heatmaps and segments from pose estimation. Second, it jointly classifies semantic vehicle attributes (colors and types) while performing ReID, through multi-task learning with the embedded pose representations. Since manually labeling images with detailed pose and attribute information is prohibitive, we create a large-scale highly randomized synthetic dataset with automatically annotated vehicle attributes for training. Extensive experiments validate the effectiveness of each proposed component, showing that PAMTRI achieves significant improvement over state-of-the-art on two mainstream vehicle ReID benchmarks: VeRi and CityFlow-ReID. Code and models are available at this https URL.
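Summary: Compared with person re-identification (ReID), which has been widely studied, vehicle ReID has received less attention. It is challenging because of 1) high intra-class variability, caused by the dependence of shape and appearance on viewpoint, and 2) small inter-class variability, caused by the similarity between vehicles from different manufacturers. The paper proposes the Pose-Aware Multi-Task Re-Identification (PAMTRI) framework, with two innovations: it overcomes viewpoint dependency by explicitly reasoning about vehicle pose and shape through keypoints, heatmaps and segments from pose estimation, and it jointly classifies semantic vehicle attributes (colors and types) while performing ReID through multi-task learning with the embedded pose representations. Because manual labeling of pose and attributes is prohibitive, a large-scale, highly randomized synthetic dataset with automatically annotated vehicle attributes is created for training. Experiments validate each component and show significant improvement over the state of the art on two mainstream benchmarks, VeRi and CityFlow-ReID. Code and models are available at this https URL.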

38. Jacks of All Trades, Masters Of None: Addressing Distributional Shift and Obtrusiveness via Transparent Patch Attacks [PDF]
Neil Fendley, Max Lennon, I-Jeng Wang, Philippe Burlina, Nathan Drenkow
Abstract: We focus on the development of effective adversarial patch attacks and -- for the first time -- jointly address the antagonistic objectives of attack success and obtrusiveness via the design of novel semi-transparent patches. This work is motivated by our pursuit of a systematic performance analysis of patch attack robustness with regard to geometric transformations. Specifically, we first elucidate a) key factors underpinning patch attack success and b) the impact of distributional shift between training and testing/deployment when cast under the Expectation over Transformation (EoT) formalism. By focusing our analysis on three principal classes of transformations (rotation, scale, and location), our findings provide quantifiable insights into the design of effective patch attacks and demonstrate that scale, among all factors, significantly impacts patch attack success. Working from these findings, we then focus on addressing how to overcome the principal limitations of scale for the deployment of attacks in real physical settings: namely the obtrusiveness of large patches. Our strategy is to turn to the novel design of irregularly-shaped, semi-transparent partial patches which we construct via a new optimization process that jointly addresses the antagonistic goals of mitigating obtrusiveness and maximizing effectiveness. Our study -- we hope -- will help encourage more focus in the community on the issues of obtrusiveness, scale, and success in patch attacks.
Summary: The paper focuses on developing effective adversarial patch attacks and, for the first time, jointly addresses the antagonistic objectives of attack success and obtrusiveness through novel semi-transparent patches. It is motivated by a systematic analysis of patch attack robustness under geometric transformations. The authors first examine a) the key factors behind patch attack success and b) the impact of distributional shift between training and testing/deployment under the Expectation over Transformation (EoT) formalism. Analysing three principal classes of transformation (rotation, scale, and location) yields quantifiable insights and shows that scale, among all factors, significantly affects attack success. To overcome the main limitation of scale in real physical settings, namely the obtrusiveness of large patches, they design irregularly shaped, semi-transparent partial patches built by a new optimization process that jointly mitigates obtrusiveness and maximizes effectiveness. The authors hope the study encourages more attention to obtrusiveness, scale, and success in patch attacks.

39. Learning from Noisy Labels with Noise Modeling Network [PDF]
Zhuolin Jiang, Jan Silovsky, Man-Hung Siu, William Hartmann, Herbert Gish, Sancar Adali
Abstract: Multi-label image classification has generated significant interest in recent years and the performance of such systems often suffers from the not so infrequent occurrence of incorrect or missing labels in the training data. In this paper, we extend the state of the art of training classifiers to jointly deal with both forms of errorful data. We accomplish this by modeling noisy and missing labels in multi-label images with a new Noise Modeling Network (NMN) that follows our convolutional neural network (CNN), integrates with it, forming an end-to-end deep learning system, which can jointly learn the noise distribution and CNN parameters. The NMN learns the distribution of noise patterns directly from the noisy data without the need for any clean training data. The NMN can model label noise that depends only on the true label or is also dependent on the image features. We show that the integrated NMN/CNN learning system consistently improves the classification performance, for different levels of label noise, on the MSR-COCO dataset and MSR-VTT dataset. We also show that noise performance improvements are obtained when multiple instance learning methods are used.
Summary: Multi-label image classification has attracted significant interest in recent years, but its performance often suffers from incorrect or missing labels in the training data, which are not uncommon. The paper extends the state of the art in classifier training to handle both kinds of errorful data jointly. It models noisy and missing labels with a new Noise Modeling Network (NMN) that follows and integrates with a convolutional neural network (CNN), forming an end-to-end deep learning system that jointly learns the noise distribution and the CNN parameters. The NMN learns the distribution of noise patterns directly from noisy data, without any clean training data, and can model label noise that depends only on the true label or also on the image features. The integrated NMN/CNN system consistently improves classification performance across label noise levels on the MSR-COCO and MSR-VTT datasets, and improvements are also obtained when multiple instance learning methods are used.

40. Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context [PDF]
Xinyi Zheng, Doug Burdick, Lucian Popa, Nancy Xin Ru Wang
Abstract: Documents are often the format of choice for knowledge sharing and preservation in business and science. Much of the critical data are captured in tables. Unfortunately, most documents are stored and distributed in PDF or scanned images, which fail to preserve table formatting. Recent vision-based deep learning approaches have been proposed to address this gap, but most still cannot achieve state-of-the-art results. We present Global Table Extractor (GTE), a vision-guided systematic framework for joint table detection and cell structured recognition, which could be built on top of any object detection model. With GTE-Table, we invent a new penalty based on the natural cell containment constraint of tables to train our table network aided by cell location predictions. GTE-Cell is a new hierarchical cell detection network that leverages table styles. Further, we design a method to automatically label table and cell structure in existing documents to cheaply create a large corpus of training and test data. We use this to create SD-tables and SEC-tables, real world and complex scientific and financial datasets with detailed table structure annotations to help train and test structure recognition. Our deep learning framework surpasses previous state-of-the-art results on the ICDAR 2013 table competition test dataset in both table detection and cell structure recognition, with a significant 6.8% improvement in the full table extraction system. We also show more than 30% improvement in cell structure recognition F1-score when compared to a vanilla RetinaNet object detection model in our out-of-domain financial dataset (SEC-Tables).

41. Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions
Arjun R Akula, Spandana Gella, Yaser Al-Onaizan, Song-Chun Zhu, Siva Reddy
Abstract: Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object, the word order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at this https URL
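
The claim that "words are enough" can be read as: a model that ignores word order still identifies the target. A minimal sketch of such a probe follows, assuming whitespace tokenization; the function names are hypothetical and this is not the authors' human-study protocol.

```python
import random
from collections import Counter


def bag_of_words(expression):
    """Order-free representation: token counts only."""
    return Counter(expression.lower().split())


def shuffle_words(expression, seed=0):
    """Return the expression with its words in a random order."""
    words = expression.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def order_insensitive(model, expression, seed=0):
    """True if the model's output is unchanged when the word order is shuffled."""
    return model(expression) == model(shuffle_words(expression, seed))
```

Any callable that maps an expression to a prediction can be passed as `model`; a bag-of-words model passes this check by construction, which is the behaviour the paper's split is designed to expose.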

42. Pseudo-healthy synthesis with pathology disentanglement and adversarial learning
Tian Xia, Agisilaos Chartsias, Sotirios A. Tsaftaris
Abstract: Pseudo-healthy synthesis is the task of creating a subject-specific 'healthy' image from a pathological one. Such images can be helpful in tasks such as anomaly detection and understanding changes induced by pathology and disease. In this paper, we present a model that is encouraged to disentangle the information of pathology from what seems to be healthy. We disentangle what appears to be healthy from where the disease is, the latter represented as a segmentation map, and the two are then recombined by a network to reconstruct the input disease image. We train our models adversarially using either paired or unpaired settings, where we pair disease images and maps when available. We evaluate the quality of pseudo-healthy images quantitatively and subjectively, with a human study, using several criteria. We show in a series of experiments, performed on the ISLES, BraTS and Cam-CAN datasets, that our method is better than several baselines and methods from the literature. We also show that, due to better training processes, we could recover deformations on surrounding tissue caused by disease. Our implementation is publicly available at https://tobeprovided.upon.acceptance

43. Quantification of MagLIF morphology using the Mallat Scattering Transformation
Michael E. Glinsky, Thomas W. Moore, William E. Lewis, Matthew R. Weis, Christopher A. Jennings, David J. Ampleford, Patrick F. Knapp, Eric C. Harding, Matthew R. Gomez, Adam J. Harvey-Thompson
Abstract: The morphology of the stagnated plasma resulting from Magnetized Liner Inertial Fusion (MagLIF) is measured by imaging the self-emission x-rays coming from the multi-keV plasma. Equivalent diagnostic response can be generated by integrated radiation-magnetohydrodynamic (rad-MHD) simulations from programs such as HYDRA and GORGON. There have been only limited quantitative ways to compare the image morphology, that is the texture, of simulations and experiments. We have developed a metric of image morphology based on the Mallat Scattering Transformation (MST), a transformation that has proved to be effective at distinguishing textures, sounds, and written characters. This metric is designed, demonstrated, and refined by classifying ensembles (i.e., classes) of synthetic stagnation images, and by regressing an ensemble of synthetic stagnation images to the morphology (i.e., model) parameters used to generate the synthetic images. We use this metric to quantitatively compare simulations to experimental images, experimental images to each other, and to estimate the morphological parameters of the experimental images with uncertainty. This coordinate space has proved very adept at doing a sophisticated relative background subtraction in the MST space. This was needed to compare the experimental self emission images to the rad-MHD simulation images.

44. The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America
Benjamin Charles Germain Lee, Jaime Mears, Eileen Jakeway, Meghan Ferriter, Chris Adams, Nathan Yarasavage, Deborah Thomas, Kate Zwaard, Daniel S. Weld
Abstract: Chronicling America is a product of the National Digital Newspaper Program, a partnership between the Library of Congress and the National Endowment for the Humanities to digitize historic newspapers. Over 16 million pages of historic American newspapers have been digitized for Chronicling America to date, complete with high-resolution images and machine-readable METS/ALTO OCR. Of considerable interest to Chronicling America users is a semantified corpus, complete with extracted visual content and headlines. To accomplish this, we introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons collected as part of the Library of Congress's Beyond Words crowdsourcing initiative and augmented with additional annotations including those of headlines and advertisements. We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content: headlines, photographs, illustrations, maps, comics, editorial cartoons, and advertisements, complete with textual content such as captions derived from the METS/ALTO OCR, as well as image embeddings for fast image similarity querying. We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus and describe the resulting Newspaper Navigator dataset, the largest dataset of extracted visual content from historic newspapers ever produced. The Newspaper Navigator dataset, finetuned visual content recognition model, and all source code are placed in the public domain for unrestricted re-use.

45. A Deep Convolutional Neural Network for COVID-19 Detection Using Chest X-Rays
Pedro R. A. S. Bassi, Romis Attux
Abstract: We present an image classifier based on CheXNet and a transfer learning stage to classify chest X-ray images according to three labels: COVID-19, viral pneumonia and normal. CheXNet is a DenseNet121 that has been trained twice, first on ImageNet and then, for classification of pneumonia and 13 other chest diseases, on a large chest X-ray database (ChestX-ray14). The proposed network reached a test accuracy of 97.8% and, for the COVID-19 class, of 98.3%. To clarify the modus operandi of the network, we used Layer-wise Relevance Propagation (LRP) to generate heat maps, indicating an analytical path for future research on diagnosis.

46. COVID-DA: Deep Domain Adaptation from Typical Pneumonia to COVID-19
Yifan Zhang, Shuaicheng Niu, Zhen Qiu, Ying Wei, Peilin Zhao, Jianhua Yao, Junzhou Huang, Qingyao Wu, Mingkui Tan
Abstract: The outbreak of novel coronavirus disease 2019 (COVID-19) has already infected millions of people and is still rapidly spreading all over the globe. Most COVID-19 patients suffer from lung infection, so one important diagnostic method is to screen chest radiography images, e.g., X-ray or CT images. However, such examinations are time-consuming and labor-intensive, leading to limited diagnostic efficiency. To solve this issue, AI-based technologies, such as deep learning, have recently been used as effective computer-aided means to improve diagnostic efficiency. However, one practical and critical difficulty is the limited availability of annotated COVID-19 data, due to prohibitive annotation costs and the urgent work of doctors fighting the pandemic. This makes the learning of deep diagnosis models very challenging. To address this, motivated by the observations that typical pneumonia shares characteristics with COVID-19 and that many pneumonia datasets are publicly available, we propose to conduct domain knowledge adaptation from typical pneumonia to COVID-19. There are two main challenges: 1) the discrepancy of data distributions between domains; 2) the task difference between the diagnosis of typical pneumonia and COVID-19. To address them, we propose a new deep domain adaptation method for COVID-19 diagnosis, namely COVID-DA. Specifically, we alleviate the domain discrepancy via feature adversarial adaptation and handle the task difference via a novel classifier separation scheme. In this way, COVID-DA is able to diagnose COVID-19 effectively with only a small number of COVID-19 annotations. Extensive experiments verify the effectiveness of COVID-DA and its great potential for real-world applications.

47. An Adaptive Enhancement Based Hybrid CNN Model for Digital Dental X-ray Positions Classification
Yaqi Wang, Lingling Sun, Yifang Zhang, Dailin Lv, Zhixing Li, Wuteng Qi
Abstract: Analysis of dental radiographs is an important part of the diagnostic process in daily clinical practice. Interpretation by an expert includes teeth detection and numbering. In this project, a novel solution based on adaptive histogram equalization and a convolutional neural network (CNN) is proposed, which automatically performs this task for dental X-rays. To improve the detection accuracy, we propose three pre-processing techniques to supplement the baseline CNN based on prior domain knowledge. First, image sharpening and median filtering are used to remove impulse noise and to enhance edges to some extent. Next, adaptive histogram equalization is used to overcome the excessive noise amplification of histogram equalization (HE). Finally, a multi-CNN hybrid model is proposed to classify six different locations of dental slices. The results showed that the accuracy and specificity on the test set exceeded 90%, and the AUC reached 0.97. In addition, four dentists were invited to manually and independently annotate the test data set, and their annotations were compared with the labels obtained by our proposed algorithm. The results show that our method can effectively identify the X-ray location of teeth.
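
Two of the pre-processing steps named in the abstract are standard operations and can be sketched in a few lines. The code below is an illustration only, assuming images are 2D lists of integer gray levels: a median filter for impulse noise and plain global histogram equalization. The adaptive variant used in the paper (typically applied per tile with contrast limiting) is not reproduced here, and the function names are hypothetical.

```python
from statistics import median


def median_filter(image, radius=1):
    """Median filter with a (2*radius+1)^2 window, replicating edge pixels."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(height):
        row = []
        for c in range(width):
            window = []
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr = min(max(r + dr, 0), height - 1)
                    cc = min(max(c + dc, 0), width - 1)
                    window.append(image[rr][cc])
            row.append(median(window))
        out.append(row)
    return out


def equalize_histogram(image, levels=256):
    """Global histogram equalization for integer pixels in [0, levels - 1]."""
    pixels = [v for row in image for v in row]
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[v] - cdf_min) * scale) for v in row] for row in image]
```

The median filter removes isolated outlier pixels while keeping edges sharper than a mean filter would, which is why it is a common choice against impulse noise.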

48. A cascade network for Detecting COVID-19 using chest x-rays
Dailin Lv, Wuteng Qi, Yunxiang Li, Lingling Sun, Yaqi Wang
Abstract: The worldwide spread of pneumonia caused by a novel coronavirus poses an unprecedented challenge to the world's medical resources and prevention and control measures. COVID-19 attacks not only the lungs, making it difficult to breathe and threatening life, but also the heart, kidneys, brain and other vital organs of the body, with possible sequelae. At present, COVID-19 is detected by reverse transcription polymerase chain reaction (RT-PCR). However, many countries are in the outbreak period of the epidemic, and medical resources are very limited. They cannot provide a sufficient number of gene sequence tests, and many patients may not be isolated and treated in time. Given this situation, we researched the analytical and diagnostic capabilities of deep learning on chest radiographs and proposed Cascade-SEMEnet, which is a cascade of SEME-ResNet50 and SEME-DenseNet169. The two networks of Cascade-SEMEnet both adopt large input sizes and SE-Structure, and use MoEx and histogram equalization to enhance the data. We first used SEME-ResNet50 to screen chest X-rays and diagnose three classes: normal, bacterial, and viral pneumonia. Then we used SEME-DenseNet169 for fine-grained classification of viral pneumonia to determine whether it is caused by COVID-19. To exclude the influence of non-pathological features on the network, we preprocessed the data with U-Net during the training of SEME-DenseNet169. The results showed that our network achieved an accuracy of 85.6% in determining the type of pneumonia infection and 97.1% in the fine-grained classification of COVID-19. We used Grad-CAM to visualize the basis of the model's judgment and to help doctors understand the chest radiograph while verifying the effectiveness.
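
The two-stage decision described in the abstract (coarse pneumonia type first, then a COVID-19 check only for viral cases) can be sketched as plain control flow. The classifiers are passed in as callables; `stage_one` and `stage_two` are placeholders for trained networks, and the label strings are assumptions made for illustration.

```python
def cascade_diagnose(image, stage_one, stage_two):
    """Two-stage decision: coarse pneumonia type, then COVID-19 vs. other viral.

    stage_one(image) -> "normal", "bacterial", or "viral"
    stage_two(image) -> True if the viral case is attributed to COVID-19
    Both callables are placeholders for trained classifiers.
    """
    coarse = stage_one(image)
    if coarse != "viral":
        return coarse  # the second stage is never consulted
    return "covid-19" if stage_two(image) else "other viral"
```

One consequence of this design is that an error in the first stage cannot be corrected by the second, so the overall result depends on both stages being accurate.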

49. Spiking Neural Networks Hardware Implementations and Challenges: a Survey
Maxence Bouvier, Alexandre Valentian, Thomas Mesquida, François Rummens, Marina Reyboz, Elisa Vianello, Edith Beigné
Abstract: Neuromorphic computing is henceforth a major research field for both academic and industrial actors. As opposed to Von Neumann machines, brain-inspired processors aim at bringing closer the memory and the computational elements to efficiently evaluate machine-learning algorithms. Recently, Spiking Neural Networks, a generation of cognitive algorithms employing computational primitives mimicking neuron and synapse operational principles, have become an important part of deep learning. They are expected to improve the computational performance and efficiency of neural networks, but are best suited for hardware able to support their temporal dynamics. In this survey, we present the state of the art of hardware implementations of spiking neural networks and the current trends in algorithm elaboration from model selection to training mechanisms. The scope of existing solutions is extensive; we thus present the general framework and study on a case-by-case basis the relevant particularities. We describe the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level and discuss their related advantages and challenges.
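
To make "computational primitives mimicking neuron and synapse operational principles" concrete, here is a minimal sketch of a discrete-time leaky integrate-and-fire (LIF) neuron. LIF is a commonly used spiking model, but the abstract does not name a specific neuron model, so the choice, the parameter names, and the reset-to-zero rule are assumptions.

```python
def simulate_lif(inputs, leak=0.9, threshold=1.0):
    """Discrete-time leaky integrate-and-fire neuron.

    Returns a list of 0/1 spikes, one per input sample.
    """
    potential = 0.0
    spikes = []
    for current in inputs:
        potential = leak * potential + current  # leak, then integrate the input
        if potential >= threshold:
            spikes.append(1)
            potential = 0.0  # reset the membrane potential after firing
        else:
            spikes.append(0)
    return spikes
```

The output is a sparse event stream rather than a dense activation, and the state carried in `potential` between steps is the temporal dynamics that the abstract says dedicated hardware is best placed to support.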

50. Stochastic Sparse Subspace Clustering
Ying Chen, Chun-Guang Li, Chong You
Abstract: State-of-the-art subspace clustering methods are based on the self-expressive model, which represents each data point as a linear combination of other data points. By enforcing such a representation to be sparse, sparse subspace clustering is guaranteed to produce a subspace-preserving data affinity in which two points are connected only if they are from the same subspace. On the other hand, data points from the same subspace may not be well-connected, leading to the issue of over-segmentation. We introduce dropout to address over-segmentation, based on randomly dropping out data points in the self-expressive model. In particular, we show that dropout is equivalent to adding a squared $\ell_2$ norm regularization on the representation coefficients, and therefore induces denser solutions. Then, we reformulate the optimization problem as a consensus problem over a set of small-scale subproblems. This leads to a scalable and flexible sparse subspace clustering approach, termed Stochastic Sparse Subspace Clustering, which can effectively handle large-scale datasets. Extensive experiments on synthetic data and real-world datasets validate the efficiency and effectiveness of our proposal.
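
The stated equivalence between dropout and squared $\ell_2$ regularization can be checked numerically on a toy problem. The sketch below assumes the usual inverted-dropout setup (each point is kept with probability $1-\delta$ and rescaled by $1/(1-\delta)$), under which the expected squared residual equals the plain residual plus $\frac{\delta}{1-\delta}\sum_i c_i^2 \|x_i\|^2$. The exact formulation in the paper may differ; the function names are hypothetical.

```python
from itertools import product


def residual_sq(x_target, points, coeffs):
    """Squared error of reconstructing x_target as sum_i coeffs[i] * points[i]."""
    dim = len(x_target)
    recon = [sum(c * p[d] for c, p in zip(coeffs, points)) for d in range(dim)]
    return sum((t - r) ** 2 for t, r in zip(x_target, recon))


def expected_dropout_loss(x_target, points, coeffs, drop_prob):
    """Exact expectation over all keep/drop masks (feasible for small inputs)."""
    keep = 1.0 - drop_prob
    total = 0.0
    for mask in product([0, 1], repeat=len(points)):
        prob = 1.0
        for m in mask:
            prob *= keep if m else drop_prob
        scaled = [c * m / keep for c, m in zip(coeffs, mask)]  # inverted dropout
        total += prob * residual_sq(x_target, points, scaled)
    return total


def regularized_loss(x_target, points, coeffs, drop_prob):
    """Residual plus the weighted squared-l2 penalty on the coefficients."""
    weight = drop_prob / (1.0 - drop_prob)
    penalty = sum(c * c * sum(v * v for v in p) for c, p in zip(coeffs, points))
    return residual_sq(x_target, points, coeffs) + weight * penalty
```

When the data points are normalized to unit length, the penalty reduces to a plain squared $\ell_2$ norm of the coefficient vector, which discourages putting all weight on a few points and so favors denser solutions.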

51. Does Visual Self-Supervision Improve Learning of Speech Representations?
Abhinav Shukla, Stavros Petridis, Maja Pantic
Abstract: Self-supervised learning has attracted plenty of recent research interest. However, most works are typically unimodal and there has been limited work that studies the interaction between audio and visual modalities for self-supervised learning. This work (1) investigates visual self-supervision via face reconstruction to guide the learning of audio representations; (2) proposes two audio-only self-supervision approaches for speech representation learning; (3) shows that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features that are more robust in noisy conditions; (4) shows that self-supervised pretraining leads to a superior weight initialization, which is especially useful to prevent overfitting and lead to faster model convergence on smaller sized datasets. We evaluate our audio representations for emotion and speech recognition, achieving state of the art performance for both problems. Our results demonstrate the potential of visual self-supervision for audio feature learning and suggest that joint visual and audio self-supervision leads to more informative speech representations.

52. A Comparative Study of Image Quality Assessment Models through Perceptual Optimization
Keyan Ding, Kede Ma, Shiqi Wang, Eero P. Simoncelli
Abstract: The performance of objective image quality assessment (IQA) models has been evaluated primarily by comparing model predictions to human judgments. Perceptual datasets (e.g., LIVE and TID2013) gathered for this purpose provide useful benchmarks for improving IQA methods, but their heavy use creates a risk of overfitting. Here, we perform a large-scale comparison of perceptual IQA models in terms of their use as objectives for the optimization of image processing algorithms. Specifically, we evaluate eleven full-reference IQA models by using them as objective functions to train deep neural networks for four low-level vision tasks: denoising, deblurring, super-resolution, and compression. Extensive subjective testing on the optimized images allows us to rank the competing models in terms of their perceptual performance, elucidate their relative advantages and disadvantages for these tasks, and propose a set of desirable properties for incorporation into future IQA models.
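
A full-reference IQA model scores a distorted image against its reference, and any such score can serve as a training objective. As a minimal illustration, the sketch below implements mean squared error and PSNR on flat pixel lists. These are the simplest full-reference measures; the abstract does not list the eleven models compared, so their inclusion here is an assumption made for illustration.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized flat pixel lists."""
    if len(reference) != len(distorted):
        raise ValueError("images must have the same number of pixels")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference, distorted, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(reference, distorted)
    if error == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / error)
```

Used as a loss, `mse` would be minimized directly, while a quality score such as `psnr` would be maximized (or negated).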

53. A Model-driven Deep Neural Network for Single Image Rain Removal
Hong Wang, Qi Xie, Qian Zhao, Deyu Meng
Abstract: Deep learning (DL) methods have achieved state-of-the-art performance in the task of single image rain removal. Most current DL architectures, however, still lack sufficient interpretability and are not fully integrated with the physical structures inside general rain streaks. To address this issue, we propose a model-driven deep neural network for the task, with fully interpretable network structures. Specifically, based on the convolutional dictionary learning mechanism for representing rain, we propose a novel single image deraining model and use the proximal gradient descent technique to design an iterative algorithm containing only simple operators for solving the model. Such a simple implementation scheme allows us to unfold it into a new deep network architecture, called rain convolutional dictionary network (RCDNet), with almost every network module corresponding one-to-one to an operation in the algorithm. By training the proposed RCDNet end to end, all the rain kernels and proximal operators can be automatically extracted, faithfully characterizing the features of both the rain and clean background layers, which naturally leads to better deraining performance, especially in real scenarios. Comprehensive experiments substantiate the superiority of the proposed network, especially its good generality to diverse testing scenarios and the good interpretability of all its modules, compared with state-of-the-art methods both visually and quantitatively. The source code is available at this https URL.
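
Proximal gradient descent alternates a gradient step on the smooth part of an objective with a proximal operator for the non-smooth part. The sketch below shows the pattern on a scalar toy problem with an $\ell_1$ penalty, where the proximal operator is soft thresholding. This is an assumption chosen for illustration: in RCDNet the proximal operators are learned, and the objective is a convolutional dictionary model, not this toy.

```python
def soft_threshold(value, amount):
    """Proximal operator of amount * |x| (shrinks value toward zero)."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0


def proximal_gradient(b, lam, step=0.5, iterations=100):
    """Minimize 0.5 * (x - b)**2 + lam * |x| with proximal gradient descent."""
    x = 0.0
    for _ in range(iterations):
        gradient = x - b  # gradient of the smooth term 0.5 * (x - b)**2
        x = soft_threshold(x - step * gradient, step * lam)
    return x
```

Each loop iteration uses only a subtraction and a thresholding, which is the sense in which such an algorithm "contains only simple operators" and can be unrolled into network layers, one layer per iteration.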

54. Mutual Information Gradient Estimation for Representation Learning
Liangjian Wen, Yiji Zhou, Lirong He, Mingyuan Zhou, Zenglin Xu
Abstract: Mutual Information (MI) plays an important role in representation learning. However, MI is unfortunately intractable in continuous and high-dimensional settings. Recent advances establish tractable and scalable MI estimators to discover useful representation. However, most of the existing methods are not capable of providing an accurate estimation of MI with low-variance when the MI is large. We argue that directly estimating the gradients of MI is more appealing for representation learning than estimating MI in itself. To this end, we propose the Mutual Information Gradient Estimator (MIGE) for representation learning based on the score estimation of implicit distributions. MIGE exhibits a tight and smooth gradient estimation of MI in the high-dimensional and large-MI settings. We expand the applications of MIGE in both unsupervised learning of deep representations based on InfoMax and the Information Bottleneck method. Experimental results have indicated significant performance improvement in learning useful representation.

55. A Little Bit More: Bitplane-Wise Bit-Depth Recovery
Abhijith Punnappurath, Michael S. Brown
Abstract: Imaging sensors digitize incoming scene light at a dynamic range of 10-12 bits (i.e., 1024-4096 tonal values). The sensor image is then processed onboard the camera and finally quantized to only 8 bits (i.e., 256 tonal values) to conform to prevailing encoding standards. There are a number of important applications, such as high-bit-depth displays and photo editing, where it is beneficial to recover the lost bit depth. Deep neural networks are effective at this bit-depth reconstruction task. Given the quantized low-bit-depth image as input, existing deep learning methods employ a single-shot approach that attempts to either (1) directly estimate the high-bit-depth image, or (2) directly estimate the residual between the high- and low-bit-depth images. In contrast, we propose a training and inference strategy that recovers the residual image bitplane-by-bitplane. Our bitplane-wise learning framework has the advantage of allowing for multiple levels of supervision during training and is able to obtain state-of-the-art results using a simple network architecture. We test our proposed method extensively on several image datasets and demonstrate an improvement from 0.5 dB to 2.3 dB PSNR over prior methods depending on the quantization level.
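
The arithmetic behind bit-depth recovery is easy to make concrete for a single pixel. The sketch below assumes quantization by simple truncation of the least significant bits (real camera pipelines apply further processing first) and shows how the residual splits into bitplanes that can be appended back one at a time. In the paper the planes are predicted by a network; here they are taken from the true value, and the function names are hypothetical.

```python
def quantize(value, src_bits=12, dst_bits=8):
    """Drop the least significant bits, as in 12-bit to 8-bit quantization."""
    return value >> (src_bits - dst_bits)


def to_bitplanes(value, bits):
    """Bits of value, most significant first."""
    return [(value >> shift) & 1 for shift in range(bits - 1, -1, -1)]


def from_bitplanes(planes):
    """Inverse of to_bitplanes."""
    value = 0
    for bit in planes:
        value = (value << 1) | bit
    return value


def recover(low_value, residual_planes):
    """Append recovered residual bitplanes below the low-bit-depth value."""
    return (low_value << len(residual_planes)) | from_bitplanes(residual_planes)
```

Recovering the planes from most to least significant means each step refines the previous estimate, which is what makes it possible to supervise every intermediate level during training.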

56. NTIRE 2020 Challenge on Perceptual Extreme Super-Resolution: Methods and Results
Kai Zhang, Shuhang Gu, Radu Timofte, Taizhang Shang, Qiuju Dai, Shengchen Zhu, Tong Yang, Yandong Guo, Younghyun Jo, Sejong Yang, Seon Joo Kim, Lin Zha, Jiande Jiang, Xinbo Gao, Wen Lu, Jing Liu, Kwangjin Yoon, Taegyun Jeon, Kazutoshi Akita, Takeru Ooba, Norimichi Ukita, Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Dongliang He, Wenhao Wu, Yukang Ding, Chao Li, Fu Li, Shilei Wen, Jianwei Li, Fuzhi Yang, Huan Yang, Jianlong Fu, Byung-Hoon Kim, JaeHyun Baek, Jong Chul Ye, Yuchen Fan, Thomas S. Huang, Junyeop Lee, Bokyeung Lee, Jungki Min, Gwantae Kim, Kanghyu Lee, Jaihyun Park, Mykola Mykhailych, Haoyu Zhong, Yukai Shi, Xiaojun Yang, Zhijing Yang, Liang Lin, Tongtong Zhao, Jinjia Peng, Huibing Wang, Zhi Jin, Jiahao Wu, Yifu Chen, Chenming Shang, Huanrong Zhang, Jeongki Min, Hrishikesh P S, Densen Puthussery
Abstract: This paper reviews the NTIRE 2020 challenge on perceptual extreme super-resolution, with a focus on proposed solutions and results. The challenge task was to super-resolve an input image with a magnification factor of 16 based on a set of prior examples of low and corresponding high resolution images. The goal is to obtain a network design capable of producing high resolution results with the best perceptual quality and similarity to the ground truth. The track had 280 registered participants, and 19 teams submitted final results. They gauge the state of the art in single image super-resolution.

57. Fusion of visible and infrared images via complex function
Ya. Ye. Khaustov, D. Ye, Ye. Ryzhov, E. Lychkovskyy, Yu. A. Nastishin
Abstract: We propose an algorithm for the fusion of partial images collected from visual and infrared cameras such that the visual and infrared images are the real and imaginary parts of a complex function. The proposed complex-function image fusion algorithm generalizes conventional image addition in the same way that the addition of complex numbers generalizes the addition of real numbers. The proposed algorithm is simple to use and undemanding in computing power. The complex form of the fused image makes it possible to form the fused image either as an amplitude image or as a phase image, which in turn can take several forms. We show theoretically that the local contrast of the fused phase images is higher than that of the partial images and of the images obtained by simple or weighted addition. Experimental image quality assessment of the fused phase images, performed using histograms and entropy, shows the higher quality of the phase images in comparison with the input partial images and with images obtained with different fusion methods reported in the literature. Keywords: digital image processing, image fusion, infrared imaging, image quality assessment
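
The core construction is a per-pixel complex number with the visible intensity as the real part and the infrared intensity as the imaginary part. A minimal sketch of the amplitude and phase images follows, assuming the two inputs are co-registered 2D lists of equal shape. The abstract mentions several forms of phase image without defining them, so only the plain argument is shown, and the function names are hypothetical.

```python
import cmath


def fuse_pixel(visible, infrared):
    """Combine two intensities into one complex value: visible + i * infrared."""
    return complex(visible, infrared)


def amplitude_image(visible, infrared):
    """Amplitude |f| of the fused complex image (2D lists of equal shape)."""
    return [[abs(fuse_pixel(v, r)) for v, r in zip(vrow, rrow)]
            for vrow, rrow in zip(visible, infrared)]


def phase_image(visible, infrared):
    """Phase arg(f), in radians, of the fused complex image."""
    return [[cmath.phase(fuse_pixel(v, r)) for v, r in zip(vrow, rrow)]
            for vrow, rrow in zip(visible, infrared)]
```

For non-negative intensities the phase lies between 0 and pi/2 and encodes the ratio of infrared to visible signal, independent of overall brightness.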

58. Lupulus: A Flexible Hardware Accelerator for Neural Networks
Andreas Toftegaard Kristensen, Robert Giterman, Alexios Balatsoukas-Stimming, Andreas Burg
Abstract: Neural networks have become indispensable for a wide range of applications, but they suffer from high computational and memory requirements, requiring optimizations from the algorithmic description of the network to the hardware implementation. Moreover, the high rate of innovation in machine learning makes it important that hardware implementations provide a high level of programmability to support current and future requirements of neural networks. In this work, we present a flexible hardware accelerator for neural networks, called Lupulus, supporting various methods for scheduling and mapping of operations onto the accelerator. Lupulus was implemented in a 28 nm FD-SOI technology and demonstrates a peak performance of 380 GOPS/GHz with latencies of 21.4 ms and 183.6 ms for the convolutional layers of AlexNet and VGG-16, respectively.

59. Explaining How Deep Neural Networks Forget by Deep Visualization
Giang Nguyen, Shuan Chen, Tae Joon Jun, Daeyoung Kim
Abstract: Explaining the behaviors of deep neural networks, usually considered as black boxes, is critical, especially now that they are being adopted across diverse aspects of human life. Taking advantage of interpretable machine learning (interpretable ML), this paper proposes a novel tool called Catastrophic Forgetting Dissector (or CFD) to explain catastrophic forgetting in continual learning settings. We also introduce a new method called Critical Freezing based on the observations of our tool. Experiments on ResNet articulate how catastrophic forgetting happens, particularly showing which components of this famous network are forgetting. Our new continual learning algorithm defeats various recent techniques by a significant margin, proving the capability of the investigation. Critical Freezing not only attacks catastrophic forgetting but also exposes explainability.

60. A Concise yet Effective model for Non-Aligned Incomplete Multi-view and Missing Multi-label Learning
Xiang Li, Songcan Chen
Abstract: In real-world applications, learning from multi-view, multi-label data inevitably confronts three challenges: missing labels, incomplete views, and non-aligned views. Existing methods mainly address the first two and commonly need multiple assumptions to tackle them, so that even state-of-the-art methods involve at least two explicit hyper-parameters in their objectives, which makes model selection quite difficult. Worse, these methods fail when dealing with the third challenge, let alone addressing all three jointly. In this paper, our goal is to meet all of them by presenting a concise yet effective model with only one hyper-parameter, under the fewest assumptions. To make our model more discriminative, we exploit not only the consensus of multiple views but also the global and local structures among multiple labels. More specifically, we introduce an indicator matrix to tackle the first two challenges in a regression manner, while aligning the same individual label and all labels of different views in a common label space to tackle the third challenge. During alignment, we characterize the global and local structures of multiple labels as high-rank and low-rank, respectively. Consequently, the regularization terms involved in the model are integrated by a single hyper-parameter. Even without view alignment, our method achieves better performance on five real datasets than state-of-the-art methods.

61. Boundary-aware Context Neural Network for Medical Image Segmentation
Ruxin Wang, Shuyuan Chen, Chaojie Ji, Jianping Fan, Ye Li
Abstract: Medical image segmentation can provide a reliable basis for further clinical analysis and disease diagnosis. The performance of medical image segmentation has been significantly advanced by convolutional neural networks (CNNs). However, most existing CNN-based methods often produce unsatisfactory segmentation masks without accurate object boundaries. This is caused by limited context information and inadequately discriminative feature maps after consecutive pooling and convolution operations. Because medical images are characterized by high intra-class variation, inter-class indistinction, and noise, extracting powerful context and aggregating discriminative features for fine-grained segmentation remain challenging. In this paper, we formulate a boundary-aware context neural network (BA-Net) for 2D medical image segmentation to capture richer context and preserve fine spatial information. BA-Net adopts an encoder-decoder architecture. In each stage of the encoder network, a pyramid edge extraction module is first proposed to obtain edge information at multiple granularities. Then we design a mini multi-task learning module for jointly learning to segment object masks and detect lesion boundaries. In particular, a new interactive attention is proposed to bridge the two tasks and achieve information complementarity between them, which effectively leverages the boundary information as a strong cue for better segmentation prediction. Finally, a cross feature fusion module selectively aggregates multi-level features from the whole encoder network. By cascading the three modules, richer context and fine-grained features of each stage are encoded. Extensive experiments on five datasets show that the proposed BA-Net outperforms state-of-the-art approaches.

62. On the Convergence Rate of Projected Gradient Descent for a Back-Projection based Objective
Tom Tirer, Raja Giryes
Abstract: Ill-posed linear inverse problems appear in many fields of imaging science and engineering, and are typically addressed by solving optimization problems, which are composed of fidelity and prior terms or constraints. Traditionally, the research has been focused on different prior models, while the least squares (LS) objective has been the common choice for the fidelity term. However, recently a few works have considered a back-projection (BP) based fidelity term as an alternative to the LS, and demonstrated excellent reconstruction results for popular inverse problems. These prior works have also empirically shown that using the BP term, rather than the LS term, requires fewer iterations of plain and accelerated proximal gradient algorithms. In the current paper, we examine the convergence rate of the BP objective for the projected gradient descent (PGD) algorithm and identify an inherent source for its faster convergence compared to the LS objective. Numerical experiments with both $\ell_1$-norm and GAN-based priors corroborate our theoretical results for PGD. We also draw the connection to the observed behavior for proximal methods.
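To make the comparison between the two fidelity terms concrete, here is a toy Python sketch of projected gradient descent on a two-variable problem. It assumes the LS term is $\frac{1}{2}\|y - Ax\|_2^2$ and the BP term is $\frac{1}{2}\|A^\dagger(y - Ax)\|_2^2$, with $A^\dagger$ the pseudoinverse; the abstract does not spell out either formula, so these forms are an assumption based on the usual definition in the back-projection literature. The diagonal, invertible operator `a`, the box constraint standing in for the prior, and the use of the known solution as a stopping test are all illustrative choices, not the paper's setup.

```python
def project_box(x, lo=0.0, hi=1.0):
    """Projection onto the box [lo, hi]^n (a stand-in for the prior constraint)."""
    return [min(max(v, lo), hi) for v in x]


def pgd(grad, x0, step, x_true, tol=1e-6, max_iter=10_000):
    """Projected gradient descent; returns (x, iterations used).

    Stopping on the distance to x_true is only possible in a toy setting
    where the solution is known.
    """
    x = list(x0)
    for k in range(1, max_iter + 1):
        g = grad(x)
        x = project_box([xi - step * gi for xi, gi in zip(x, g)])
        if max(abs(xi - ti) for xi, ti in zip(x, x_true)) < tol:
            return x, k
    return x, max_iter


# Hypothetical diagonal operator A = diag(a) with one poorly scaled coordinate.
a = [1.0, 0.1]
x_true = [0.8, 0.3]
y = [ai * xi for ai, xi in zip(a, x_true)]


def grad_ls(x):
    """Gradient of the LS term: A^T (A x - y)."""
    return [ai * (ai * xi - yi) for ai, xi, yi in zip(a, x, y)]


def grad_bp(x):
    """Gradient of the assumed BP term: A^+ (A x - y); here A^+ = A^{-1}."""
    return [(ai * xi - yi) / ai for ai, xi, yi in zip(a, x, y)]


x_ls, iters_ls = pgd(grad_ls, [0.0, 0.0], step=1.0, x_true=x_true)
x_bp, iters_bp = pgd(grad_bp, [0.0, 0.0], step=1.0, x_true=x_true)
```

In this toy case the BP step with unit step size lands on the solution immediately, while the LS iteration is slowed by the poorly scaled second coordinate. This only hints at the conditioning effect; it is not a reproduction of the paper's analysis.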

63. Deep Generative Adversarial Residual Convolutional Networks for Real-World Super-Resolution
Rao Muhammad Umer, Gian Luca Foresti, Christian Micheloni
Abstract: Most current deep learning based single image super-resolution (SISR) methods focus on designing deeper / wider models to learn the non-linear mapping between low-resolution (LR) inputs and high-resolution (HR) outputs from a large number of paired (LR/HR) training data. They usually assume that the LR image is a bicubic down-sampled version of the HR image. However, such a degradation process does not hold in real-world settings, which involve inherent sensor noise, stochastic noise, compression artifacts, and a possible mismatch between the image degradation process and the camera device. Real-world image corruptions therefore significantly reduce the performance of current SISR methods. To address these problems, we propose a deep Super-Resolution Residual Convolutional Generative Adversarial Network (SRResCGAN) that follows the real-world degradation settings by adversarially training the model with pixel-wise supervision in the HR domain from its generated LR counterpart. The proposed network exploits residual learning by minimizing an energy-based objective function with powerful image regularization and convex optimization techniques. We demonstrate in quantitative and qualitative experiments that our proposed approach generalizes robustly to real input and is easy to deploy for other down-scaling operators and mobile/embedded devices.

64. Towards Occlusion-Aware Multifocal Displays
Jen-Hao Rick Chang, Anat Levin, B. V. K. Vijaya Kumar, Aswin C. Sankaranarayanan
Abstract: The human visual system uses numerous cues for depth perception, including disparity, accommodation, motion parallax and occlusion. It is incumbent upon virtual-reality displays to satisfy these cues to provide an immersive user experience. Multifocal displays, one of the classic approaches to satisfy the accommodation cue, place virtual content at multiple focal planes, each at a different depth. However, the content on focal planes close to the eye does not occlude that farther away; this deteriorates the occlusion cue and reduces contrast at depth discontinuities due to leakage of the defocus blur. This paper enables occlusion-aware multifocal displays using a novel ConeTilt operator that provides an additional degree of freedom: tilting the light cone emitted at each pixel of the display panel. We show that, for scenes with relatively simple occlusion configurations, tilting the light cones provides the same effect as physical occlusion. We demonstrate that ConeTilt can be easily implemented by a phase-only spatial light modulator. Using a lab prototype, we show results that demonstrate the presence of occlusion cues and the increased contrast of the display at depth edges.

65. Clue: Cross-modal Coherence Modeling for Caption Generation
Malihe Alikhani, Piyush Sharma, Shengjie Li, Radu Soricut, Matthew Stone
Abstract: We use coherence relations inspired by computational models of discourse to study the information needs and goals of image captioning. Using an annotation protocol specifically devised for capturing image-caption coherence relations, we annotate 10,000 instances from publicly available image-caption pairs. We introduce a new task for learning inferences in imagery and text, coherence relation prediction, and show that these coherence annotations can be exploited to learn relation classifiers as an intermediary step, and also train coherence-aware, controllable image captioning models. The results show a dramatic improvement in the consistency and quality of the generated captions with respect to information needs specified via coherence relations.

66. Towards Deep Learning Methods for Quality Assessment of Computer-Generated Imagery
Markus Utke, Saman Zadtootaghaj, Steven Schmidt, Sebastian Möller
Abstract: Video gaming streaming services are growing rapidly due to new services such as passive video streaming, e.g. this http URL, and cloud gaming, e.g. Nvidia Geforce Now. In contrast to traditional video content, gaming content has special characteristics such as extremely high motion for some games, special motion patterns, synthetic content and repetitive content, which make state-of-the-art video and image quality metrics perform worse on this special computer-generated content. In this paper, we outline our plan to build a deep learning-based quality metric for video gaming quality assessment. In addition, we present initial results by training the network based on VMAF values as a ground truth to give some insights on how to build a metric in the future. The paper describes the method that is used to choose an appropriate Convolutional Neural Network architecture. Furthermore, we estimate the size of the required subjective quality dataset which achieves a sufficiently high performance. The results show that by taking around 5k images for training of the last six modules of Xception, we can obtain a relatively high performance metric to assess the quality of distorted video games.

67. RMM: A Recursive Mental Model for Dialog Navigation
Homero Roman Roman, Yonatan Bisk, Jesse Thomason, Asli Celikyilmaz, Jianfeng Gao
Abstract: Fluent communication requires understanding your audience. In the new collaborative task of Vision-and-Dialog Navigation, one agent must ask questions and follow instructive answers, while the other must provide those answers. We introduce the first true dialog navigation agents in the literature which generate full conversations, and introduce the Recursive Mental Model (RMM) to conduct these dialogs. RMM dramatically improves generated language questions and answers by recursively propagating reward signals to find the question expected to elicit the best answer, and the answer expected to elicit the best navigation. Additionally, we provide baselines for future work to build on when investigating the unique challenges of embodied visual agents that not only interpret instructions but also ask questions in natural language.

68. Obtaining Faithful Interpretations from Compositional Neural Networks
Sanjay Subramanian, Ben Bogin, Nitish Gupta, Tomer Wolfson, Sameer Singh, Jonathan Berant, Matt Gardner
Abstract: Neural module networks (NMNs) are a popular approach for modeling compositionality: they achieve high accuracy when applied to problems in language and vision, while reflecting the compositional structure of the problem in the network architecture. However, prior work implicitly assumed that the structure of the network modules, describing the abstract reasoning process, provides a faithful explanation of the model's reasoning; that is, that all modules perform their intended behaviour. In this work, we propose and conduct a systematic evaluation of the intermediate outputs of NMNs on NLVR2 and DROP, two datasets which require composing multiple reasoning steps. We find that the intermediate outputs differ from the expected output, illustrating that the network structure does not provide a faithful explanation of model behaviour. To remedy that, we train the model with auxiliary supervision and propose particular choices for module architecture that yield much better faithfulness, at a minimal cost to accuracy.

69. A Benchmark for Structured Procedural Knowledge Extraction from Cooking Videos
Frank F. Xu, Lei Ji, Botian Shi, Junyi Du, Graham Neubig, Yonatan Bisk, Nan Duan
Abstract: Procedural knowledge, which we define as concrete information about the sequence of actions that go into performing a particular procedure, plays an important role in understanding real-world tasks and actions. Humans often learn this knowledge from instructional text and video, and in this paper we aim to perform automatic extraction of this knowledge in a similar way. As a concrete step in this direction, we propose the new task of inferring procedures in a structured form (a data structure containing verbs and arguments) from multimodal instructional video contents and their corresponding transcripts. We first create a manually annotated, large evaluation dataset including over 350 instructional cooking videos along with over 15,000 English sentences in transcripts spanning over 89 recipes. We conduct analysis of the challenges posed by this task and dataset with experiments with unsupervised segmentation, semantic role labeling, and visual action detection based baselines. The dataset and code will be publicly available at this https URL.

70. Stochastic Neighbor Embedding of Multimodal Relational Data for Image-Text Simultaneous Visualization
Morihiro Mizutani, Akifumi Okuno, Geewook Kim, Hidetoshi Shimodaira
Abstract: Multimodal relational data analysis has become increasingly important in recent years for exploring data across different domains, such as images and their text tags obtained from social networking services (e.g., Flickr). A variety of data analysis methods have been developed for visualization; to give an example, t-Stochastic Neighbor Embedding (t-SNE) computes low-dimensional feature vectors so that their similarities preserve those of the observed data vectors. However, t-SNE is designed only for a single domain of data and not for multimodal data; this paper aims at visualizing multimodal relational data consisting of data vectors in multiple domains with relations across these vectors. By extending t-SNE, we herein propose Multimodal Relational Stochastic Neighbor Embedding (MR-SNE), which (1) first computes augmented relations, where we observe the relations across domains and compute those within each domain via the observed data vectors, and (2) jointly embeds the augmented relations into a low-dimensional space. Through visualization of the Flickr and Animals with Attributes 2 datasets, the proposed MR-SNE is compared with other graph embedding-based approaches; MR-SNE demonstrates promising performance.

71. Probing Text Models for Common Ground with Visual Representations
Gabriel Ilharco, Rowan Zellers, Ali Farhadi, Hannaneh Hajishirzi
Abstract: Vision, as a central component of human perception, plays a fundamental role in shaping natural language. To better understand how text models are connected to our visual perceptions, we propose a method for examining the similarities between neural representations extracted from words in text and objects in images. Our approach uses a lightweight probing model that learns to map language representations of concrete words to the visual domain. We find that representations from models trained on purely textual data, such as BERT, can be nontrivially mapped to those of a vision model. Such mappings generalize to object categories that were never seen by the probe during training, unlike mappings learned from permuted or random representations. Moreover, we find that the context surrounding objects in sentences greatly impacts performance. Finally, we show that humans significantly outperform all examined models, suggesting considerable room for improvement in representation learning and grounding.

72. When Ensembling Smaller Models is More Efficient than Single Large Models
Dan Kondratyuk, Mingxing Tan, Matthew Brown, Boqing Gong
Abstract: Ensembling is a simple and popular technique for boosting evaluation performance by training multiple models (e.g., with different initializations) and aggregating their predictions. This approach is commonly reserved for the largest models, as it is commonly held that increasing the model size provides a more substantial reduction in error than ensembling smaller models. However, we show results from experiments on CIFAR-10 and ImageNet that ensembles can outperform single models with both higher accuracy and requiring fewer total FLOPs to compute, even when those individual models' weights and hyperparameters are highly optimized. Furthermore, this gap in improvement widens as models become large. This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models, especially when the models approach the size of what their dataset can foster. Instead of using the common practice of tuning a single large model, one can use ensembles as a more flexible trade-off between a model's inference speed and accuracy. This also potentially eases hardware design, e.g., an easier way to parallelize the model across multiple workers for real-time or distributed inference.
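The aggregation step mentioned in the abstract can be sketched in a few lines of Python. This example averages the softmax probabilities of several models and picks the class with the highest mean; averaging probabilities is only one common aggregation rule and is an assumption here, since the abstract does not say which rule the authors use. The function names and the logit values are made up for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits):
    """Average class probabilities across models.

    Returns (mean_probabilities, predicted_class_index).
    """
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    mean_probs = [sum(p[c] for p in probs) / n_models for c in range(n_classes)]
    predicted = max(range(n_classes), key=mean_probs.__getitem__)
    return mean_probs, predicted


# Made-up logits from three small models for one input and three classes.
per_model_logits = [
    [2.0, 1.0, 0.1],   # this model prefers class 0
    [0.5, 2.5, 0.3],   # this model prefers class 1
    [0.2, 2.0, 0.1],   # this model prefers class 1
]
mean_probs, predicted = ensemble_predict(per_model_logits)
```

Because each model is evaluated independently before the final average, the per-model forward passes can run on separate workers, which matches the parallelization point made at the end of the abstract.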

73. RigNet: Neural Rigging for Articulated Characters
Zhan Xu, Yang Zhou, Evangelos Kalogerakis, Chris Landreth, Karan Singh
Abstract: We present RigNet, an end-to-end automated method for producing animation rigs from input character models. Given an input 3D model representing an articulated character, RigNet predicts a skeleton that matches the animator expectations in joint placement and topology. It also estimates surface skin weights based on the predicted skeleton. Our method is based on a deep architecture that directly operates on the mesh representation without making assumptions on shape class and structure. The architecture is trained on a large and diverse collection of rigged models, including their mesh, skeletons and corresponding skin weights. Our evaluation is three-fold: we show better results than prior art when quantitatively compared to animator rigs; qualitatively we show that our rigs can be expressively posed and animated at multiple levels of detail; and finally, we evaluate the impact of various algorithm choices on our output rigs.

74. Angle-based Search Space Shrinking for Neural Architecture Search
Yiming Hu, Yuding Liang, Zichao Guo, Ruosi Wan, Xiangyu Zhang, Yichen Wei, Qingyi Gu, Jian Sun
Abstract: In this work, we present a simple and general search space shrinking method, called Angle-Based search space Shrinking (ABS), for Neural Architecture Search (NAS). Our approach progressively simplifies the original search space by dropping unpromising candidates, and can thus make it easier for existing NAS methods to find superior architectures. In particular, we propose an angle-based metric to guide the shrinking process. We provide comprehensive evidence showing that, in a weight-sharing supernet, the proposed metric is more stable and accurate than accuracy-based and magnitude-based metrics at predicting the capability of child models. We also show that the angle-based metric converges fast while training the supernet, enabling us to obtain promising shrunk search spaces efficiently. ABS can easily be applied to most popular NAS approaches (e.g. SPOS, FairNAS, ProxylessNAS, DARTS and PDARTS). Comprehensive experiments show that ABS can dramatically enhance existing NAS approaches by providing a promising shrunk search space.

Note: the Chinese text is machine-translated.


[arXiv papers] Computer Vision and Pattern Recognition 2020-05-05
