Table of Contents
2. GAN Mask R-CNN: Instance semantic segmentation benefits from generative adversarial networks [PDF] Abstract
4. Demo Abstract: Indoor Positioning System in Visually-Degraded Environments with Millimetre-Wave Radar and Inertial Sensors [PDF] Abstract
14. A Centroid Loss for Weakly Supervised Semantic Segmentation in Quality Control and Inspection Application [PDF] Abstract
20. Where to Look and How to Describe: Fashion Image Retrieval with an Attentional Heterogeneous Bilinear Network [PDF] Abstract
22. Semi supervised segmentation and graph-based tracking of 3D nuclei in time-lapse microscopy [PDF] Abstract
29. Global Image Segmentation Process using Machine Learning algorithm & Convolution Neural Network method for Self-Driving Vehicles [PDF] Abstract
38. Empowering Knowledge Distillation via Open Set Recognition for Robust 3D Point Cloud Classification [PDF] Abstract
39. Scribble-based Weakly Supervised Deep Learning for Road Surface Extraction from Remote Sensing Images [PDF] Abstract
42. Applying convolutional neural networks to extremely sparse image datasets using an image subdivision approach [PDF] Abstract
45. Classification of Spot-welded Joints in Laser Thermography Data using Convolutional Neural Networks [PDF] Abstract
52. Persian Handwritten Digit, Character, and Words Recognition by Using Deep Learning Methods [PDF] Abstract
54. REDE: End-to-end Object 6D Pose Robust Estimation Using Differentiable Outliers Elimination [PDF] Abstract
55. Improving the generalization of network based relative pose regression: dimension reduction as a regularizer [PDF] Abstract
60. Position and Rotation Invariant Sign Language Recognition from 3D Point Cloud Data with Recurrent Neural Networks [PDF] Abstract
63. ST-GREED: Space-Time Generalized Entropic Differences for Frame Rate Dependent Video Quality Prediction [PDF] Abstract
66. Towards Scale-Invariant Graph-related Problem Solving by Iterative Homogeneous Graph Neural Networks [PDF] Abstract
68. Optimization for Medical Image Segmentation: Theory and Practice when evaluating with Dice Score or Jaccard Index [PDF] Abstract
71. Matthews Correlation Coefficient Loss for Deep Convolutional Networks: Application to Skin Lesion Segmentation [PDF] Abstract
73. What is the best data augmentation approach for brain tumor segmentation using 3D U-Net? [PDF] Abstract
75. Deep Sequential Learning for Cervical Spine Fracture Detection on Computed Tomography Imaging [PDF] Abstract
76. A Dark and Bright Channel Prior Guided Deep Network for Retinal Image Quality Assessment [PDF] Abstract
77. Geometrically Matched Multi-source Microscopic Image Synthesis Using Bidirectional Adversarial Networks [PDF] Abstract
81. SUREMap: Predicting Uncertainty in CNN-based Image Reconstruction Using Stein's Unbiased Risk Estimate [PDF] Abstract
83. Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modelling [PDF] Abstract
84. Unsupervised Super-Resolution: Creating High-Resolution Medical Images from Low-Resolution Anisotropic Examples [PDF] Abstract
Abstracts
1. GreedyFool: Distortion-Aware Sparse Adversarial Attack [PDF] Back to Contents
Xiaoyi Dong, Dongdong Chen, Jianmin Bao, Chuan Qin, Lu Yuan, Weiming Zhang, Nenghai Yu, Dong Chen
Abstract: Modern deep neural networks (DNNs) are vulnerable to adversarial samples. Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by perturbing only a few pixels. The existence of the sparse adversarial attack shows that DNNs are much more vulnerable than people believed, which is also a new aspect for analyzing DNNs. However, current sparse adversarial attack methods still have some shortcomings in both sparsity and invisibility. In this paper, we propose a novel two-stage distortion-aware greedy method dubbed "GreedyFool". Specifically, it first selects the most effective candidate positions to modify by considering both the gradient (for adversary) and the distortion map (for invisibility), then drops some less important points in the reduce stage. Experiments demonstrate that, compared with the state-of-the-art method, we only need to modify $3\times$ fewer pixels under the same sparse perturbation setting. For targeted attacks, the success rate of our method is 9.96% higher than the state-of-the-art method under the same pixel budget. Code can be found at this https URL.
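The two-stage add-then-reduce procedure lends itself to a compact illustration. Below is a minimal PyTorch sketch of the idea, assuming a per-pixel distortion map is available; the scoring rule and the (elided) reduction step are simplifications, not the authors' exact algorithm.

```python
# Minimal sketch of a two-stage distortion-aware sparse attack in the spirit
# of GreedyFool; the scoring and reduction rules here are assumptions.
import torch
import torch.nn.functional as F

def greedy_sparse_attack(model, x, label, distortion_map, budget=20, eps=0.1):
    """x: (1, C, H, W) input; distortion_map: (H, W), high = perceptually costly."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), label).backward()
    grad = x_adv.grad.detach()                       # adversarial usefulness
    # Stage 1 (greedy add): score pixels by gradient strength, discounted
    # where the distortion map says a change would be visible.
    score = grad.abs().sum(dim=1)[0] / (1.0 + distortion_map)
    top = score.flatten().topk(budget).indices       # candidate positions
    mask = torch.zeros_like(score).flatten()
    mask[top] = 1.0
    x_adv = (x + eps * grad.sign() * mask.view_as(score)).clamp(0, 1).detach()
    # Stage 2 (reduce): drop selected pixels one by one while the model is
    # still fooled, to tighten sparsity (omitted for brevity).
    return x_adv
```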
2. GAN Mask R-CNN: Instance semantic segmentation benefits from generative adversarial networks [PDF] Back to Contents
Quang H. Le, Kamal Youcef-Toumi, Dzmitry Tsetserukou, Ali Jahanian
Abstract: In designing instance segmentation ConvNets that reconstruct masks, segmentation is often taken at its literal definition -- assigning a label to every pixel -- when defining the loss functions. That is, using losses that compute the difference between pixels in the predicted (reconstructed) mask and the ground truth mask -- a template matching mechanism. However, any such instance segmentation ConvNet is a generator, so we can cast the problem of predicting masks in a GAN game framework: we can think of the ground truth mask as drawn from the true distribution, and a ConvNet like Mask R-CNN as an implicit model that infers the true distribution. Then, placing a discriminator after this generator closes the loop of the GAN concept and, more importantly, yields a loss that is trained rather than hand-designed. We show this design outperforms the baseline when tried, without extra settings, on several different domains: cellphone recycling, autonomous driving, large-scale object detection, and medical glands. Further, we observe that in general GANs yield masks with better boundaries, clutter handling, and small details.
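To make "a loss that is trained rather than hand-designed" concrete, here is a minimal sketch of pairing a mask-producing generator with a small discriminator; the discriminator architecture is an assumption, and in the paper the generator is a Mask R-CNN-style network.

```python
# Sketch of an adversarial mask loss: the discriminator judges whether a
# mask looks like a ground-truth mask, replacing pixel-wise template matching.
import torch
import torch.nn as nn

disc = nn.Sequential(
    nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(real_mask, fake_mask):
    r, f = disc(real_mask), disc(fake_mask.detach())
    return bce(r, torch.ones_like(r)) + bce(f, torch.zeros_like(f))

def generator_loss(fake_mask):
    # The segmentation net is trained so its masks look "real" to the critic.
    f = disc(fake_mask)
    return bce(f, torch.ones_like(f))
```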
3. Handgun detection using combined human pose and weapon appearance [PDF] Back to Contents
Jesus Ruiz-Santaquiteria, Alberto Velasco-Mata, Noelia Vallez, Gloria Bueno, Juan Antonio Alvarez, Oscar Deniz
Abstract: CCTV surveillance systems are essential nowadays to prevent and mitigate security threats or dangerous situations such as mass shootings or terrorist attacks, in which early detection is crucial. These solutions are manually supervised by a security operator, which has significant limitations. Novel deep learning-based methods have made it possible to develop automatic, real-time weapon detectors with promising results. However, these approaches are based on visual weapon appearance only, and no additional contextual information is exploited. For handguns, body pose may be a useful cue, especially in cases where the gun is barely visible, and also as a way to reduce false positives. In this work, a novel method is proposed to combine weapon appearance and 2D human pose information in a single architecture. First, pose keypoints are estimated to extract hand regions and generate binary pose images, which are the model inputs. Then, each input is processed with a different subnetwork to extract two feature maps. Finally, this information is combined to produce the hand-region prediction (handgun vs. no handgun). A new dataset composed of samples collected from different sources has been used to evaluate model performance under different situations. Moreover, the robustness of the model to different brightness and weapon-size conditions (simulating conditions in which appearance is degraded by low light and distance to the camera) has also been tested. The results show that the combined model substantially improves overall performance with respect to appearance alone, as used by other popular methods such as YOLOv3.
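A minimal sketch of the two-branch idea described above: one subnetwork for the cropped hand-region appearance, one for the binary pose image, fused into a handgun/no-handgun prediction. Channel sizes and the fusion scheme are assumptions.

```python
import torch
import torch.nn as nn

class AppearancePoseNet(nn.Module):
    def __init__(self):
        super().__init__()
        def branch(in_ch):                # small CNN per input modality
            return nn.Sequential(
                nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4))
        self.appearance = branch(3)       # RGB hand-region crop
        self.pose = branch(1)             # binary pose image from keypoints
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(2 * 32 * 16, 2))

    def forward(self, crop, pose_img):
        feats = torch.cat([self.appearance(crop), self.pose(pose_img)], dim=1)
        return self.head(feats)           # logits: handgun vs. no handgun
```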
4. Demo Abstract: Indoor Positioning System in Visually-Degraded Environments with Millimetre-Wave Radar and Inertial Sensors [PDF] Back to Contents
Zhuangzhuang Dai, Muhamad Risqi U. Saputra, Chris Xiaoxuan Lu, Niki Trigoni, Andrew Markham
Abstract: Positional estimation is of great importance in the public safety sector. Emergency responders such as fire fighters, medical rescue teams, and the police will all benefit from a resilient positioning system to deliver safe and effective emergency services. Unfortunately, satellite navigation (e.g., GPS) offers limited coverage in indoor environments. It is also not possible to rely on infrastructure-based solutions. To this end, wearable sensor-aided navigation techniques, such as those based on cameras and Inertial Measurement Units (IMU), have recently emerged as an accurate, infrastructure-free solution. Together with an increase in the computational capabilities of mobile devices, motion estimation can be performed in real-time. In this demonstration, we present a real-time indoor positioning system which fuses millimetre-wave (mmWave) radar and IMU data via deep sensor fusion. We employ mmWave radar rather than an RGB camera as it provides better robustness to visual degradation (e.g., smoke, darkness, etc.) while at the same time requiring lower computational resources to enable runtime computation. We implemented the sensor system on a handheld device and a mobile computer running at 10 FPS to track a user inside an apartment. Good accuracy and resilience were exhibited even in poorly illuminated scenes.
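As a rough illustration of deep radar-inertial fusion, the sketch below encodes a radar scan with a small CNN and an IMU window with an LSTM, then regresses a pose update from the concatenated features. The encoders and the output parameterization are assumptions, not the demo's implementation.

```python
import torch
import torch.nn as nn

class RadarInertialOdometry(nn.Module):
    def __init__(self, imu_dim=6, hidden=64):
        super().__init__()
        self.radar_enc = nn.Sequential(            # single-channel radar "image"
            nn.Conv2d(1, 8, 5, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.imu_enc = nn.LSTM(imu_dim, hidden, batch_first=True)
        self.head = nn.Linear(16 + hidden, 3)      # e.g., (dx, dy, dyaw) per step

    def forward(self, radar, imu_seq):
        r = self.radar_enc(radar)                  # (B, 16)
        _, (h, _) = self.imu_enc(imu_seq)          # last hidden state: (1, B, hidden)
        return self.head(torch.cat([r, h[-1]], dim=1))
```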
5. ActiveNet: A computer-vision based approach to determine lethargy [PDF] Back to Contents
Aitik Gupta, Aadit Agarwal
Abstract: The outbreak of COVID-19 has forced everyone to stay indoors, causing a significant drop in physical activeness. Our work builds on the idea of a backbone mechanism to detect levels of activeness in real time using a single monocular image of a target person. The scope generalizes to many applications, be it an interview, online classes, security surveillance, et cetera. We propose a Computer Vision based multi-stage approach, wherein the pose of a person is first detected, encoded with a novel approach, and then assessed by a classical machine learning algorithm to determine the level of activeness. An alerting system is wrapped around the approach to inhibit lethargy by sending notification alerts to the individuals involved.
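A minimal sketch of the multi-stage pipeline: keypoints from any pose estimator are normalized into a fixed-length feature vector and classified with a classical learner. The encoding below and the choice of a random forest are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def encode_pose(keypoints):
    """keypoints: (N, 2) array of (x, y) joint positions for one frame."""
    rel = keypoints - keypoints.mean(axis=0)        # translation invariance
    scale = np.linalg.norm(rel, axis=1).max() + 1e-8
    return (rel / scale).flatten()                  # rough scale normalization

# X_train: encoded poses, y_train: activeness labels (0 = lethargic, 1 = active)
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(X_train, y_train)
# level = clf.predict(encode_pose(keypoints)[None, :])
```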
6. Distributed Multi-Target Tracking in Camera Networks [PDF] Back to Contents
Sara Casao, Ana C. Murillo, Eduardo Montijano
Abstract: Most recent works on multi-target tracking with multiple cameras focus on centralized systems. In contrast, this paper presents a multi-target tracking approach implemented in a distributed camera network. The advantages of distributed systems lie in lighter communication management, greater robustness to failures, and local decision making. On the other hand, data association and information fusion are more challenging than in a centralized setup, mostly due to the lack of global and complete information. The proposed algorithm boosts the benefits of the Distributed-Consensus Kalman Filter with the support of a re-identification network and a distributed tracker manager module to facilitate consistent information. These techniques complement each other and facilitate cross-camera data association in a simple and effective manner. We evaluate the whole system with known public data sets under different conditions, demonstrating the advantages of combining all the modules. In addition, we compare our algorithm to some existing centralized tracking methods, outperforming them in terms of accuracy and bandwidth usage.
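The Distributed-Consensus Kalman Filter can be pictured as a local Kalman update per camera followed by a consensus step that pulls each node's estimate toward its neighbors'. The sketch below is a textbook simplification, not the paper's exact filter.

```python
import numpy as np

def local_kf_update(x, P, z, F, H, Q, R):
    """One predict-update cycle of a standard Kalman filter at a single node."""
    x, P = F @ x, F @ P @ F.T + Q                        # predict
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)         # Kalman gain
    return x + K @ (z - H @ x), (np.eye(len(x)) - K @ H) @ P

def consensus_step(states, neighbors, rate=0.5):
    """states: per-node estimates; neighbors[i]: indices of node i's neighbors."""
    new = []
    for i, x in enumerate(states):
        avg = np.mean([states[j] for j in neighbors[i]], axis=0)
        new.append(x + rate * (avg - x))                 # pull toward neighbors
    return new
```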
7. Face Frontalization Based on Robustly Fitting a Deformable Shape Model to 3D Landmarks [PDF] Back to Contents
Zhiqi Kang, Mostafa Sadeghi, Radu Horaud
Abstract: Face frontalization consists of synthesizing a frontally-viewed face from an arbitrarily-viewed one. The main contribution of this paper is a robust face alignment method that enables pixel-to-pixel warping. The method simultaneously estimates the rigid transformation (scale, rotation, and translation) and the non-rigid deformation between two 3D point sets: a set of 3D landmarks extracted from an arbitrarily-viewed face, and a set of 3D landmarks parameterized by a frontally-viewed deformable face model. An important merit of the proposed method is its ability to deal both with noise (small perturbations) and with outliers (large errors). We propose to model inliers and outliers with the generalized Student's t-probability distribution function -- a heavy-tailed distribution that is immune to non-Gaussian errors in the data. We describe in detail the associated expectation-maximization (EM) algorithm that alternates between the estimation of (i) the rigid parameters, (ii) the deformation parameters, and (iii) the t-distribution parameters. We also propose to use the zero-mean normalized cross-correlation, between a frontalized face and the corresponding ground-truth frontally-viewed face, to evaluate the performance of frontalization. To this end, we use a dataset that contains pairs of profile-viewed and frontally-viewed faces. This evaluation, based on direct image-to-image comparison, stands in contrast with indirect evaluation, based on analyzing the effect of frontalization on face recognition.
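The proposed evaluation metric, zero-mean normalized cross-correlation between a frontalized face and its ground-truth frontal view, is straightforward to state in code:

```python
import numpy as np

def zncc(a, b, eps=1e-8):
    """a, b: grayscale images of the same shape; returns a score in [-1, 1]."""
    a = a.astype(np.float64) - a.mean()
    b = b.astype(np.float64) - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```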
8. SCFusion: Real-time Incremental Scene Reconstruction with Semantic Completion [PDF] Back to Contents
Shun-Cheng Wu, Keisuke Tateno, Nassir Navab, Federico Tombari
Abstract: Real-time scene reconstruction from depth data inevitably suffers from occlusion, thus leading to incomplete 3D models. Partial reconstructions, in turn, limit the performance of algorithms that leverage them for applications in the context of, e.g., augmented reality, robotic navigation, and 3D mapping. Most methods address this issue by predicting the missing geometry as an offline optimization, thus being incompatible with real-time applications. We propose a framework that ameliorates this issue by performing scene reconstruction and semantic scene completion jointly in an incremental and real-time manner, based on an input sequence of depth maps. Our framework relies on a novel neural architecture designed to process occupancy maps and leverages voxel states to accurately and efficiently fuse semantic completion with the 3D global model. We evaluate the proposed approach quantitatively and qualitatively, demonstrating that our method can obtain accurate 3D semantic scene completion in real-time.
9. Fewer is More: A Deep Graph Metric Learning Perspective Using Fewer Proxies [PDF] Back to Contents
Yuehua Zhu, Muli Yang, Cheng Deng, Wei Liu
Abstract: Deep metric learning plays a key role in various machine learning tasks. Most of the previous works have been confined to sampling from a mini-batch, which cannot precisely characterize the global geometry of the embedding space. Although researchers have developed proxy- and classification-based methods to tackle the sampling issue, those methods inevitably incur a redundant computational cost. In this paper, we propose a novel Proxy-based deep Graph Metric Learning (ProxyGML) approach from the perspective of graph classification, which uses fewer proxies yet achieves better comprehensive performance. Specifically, multiple global proxies are leveraged to collectively approximate the original data points for each class. To efficiently capture local neighbor relationships, a small number of such proxies are adaptively selected to construct similarity subgraphs between these proxies and each data point. Further, we design a novel reverse label propagation algorithm, by which the neighbor relationships are adjusted according to ground-truth labels, so that a discriminative metric space can be learned during the process of subgraph classification. Extensive experiments carried out on widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate the superiority of the proposed ProxyGML over the state-of-the-art methods in terms of both effectiveness and efficiency. The source code is publicly available at this https URL.
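A minimal sketch of the proxy mechanism: a few learnable proxies per class approximate the data of that class, and each sample is connected only to its top-k most similar proxies, forming a local similarity subgraph. The subgraph classification and reverse label propagation steps are omitted, and the details below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProxyLayer(nn.Module):
    def __init__(self, n_classes, proxies_per_class, dim):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(n_classes * proxies_per_class, dim))
        self.register_buffer(
            "labels", torch.arange(n_classes).repeat_interleave(proxies_per_class))

    def forward(self, emb, k=8):
        # Cosine similarity between embeddings (B, D) and all proxies (P, D).
        sim = F.normalize(emb) @ F.normalize(self.proxies).t()
        top = sim.topk(k, dim=1)              # keep only the local neighborhood
        return top.values, self.labels[top.indices]  # edge weights + proxy labels
```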
10. Classification of Important Segments in Educational Videos using Multimodal Features [PDF] Back to Contents
Junaid Ahmed Ghauri, Sherzod Hakimov, Ralph Ewerth
Abstract: Videos are a commonly-used type of content in learning during Web search. Many e-learning platforms provide quality content, but sometimes educational videos are long and cover many topics. Humans are good at extracting important sections from videos, but it remains a significant challenge for computers. In this paper, we address the problem of assigning importance scores to video segments, that is, how much information they contain with respect to the overall topic of an educational video. We present an annotation tool and a new dataset of annotated educational videos collected from popular online learning platforms. Moreover, we propose a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features. Our experiments investigate the impact of visual and temporal information, as well as the combination of multimodal features, on importance prediction.
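A minimal sketch of a late-fusion variant of such a multimodal architecture: per-segment audio, visual, and textual features are concatenated and mapped to an importance score. The feature dimensions and fusion scheme are assumptions.

```python
import torch
import torch.nn as nn

class SegmentImportance(nn.Module):
    def __init__(self, d_audio=128, d_visual=512, d_text=768, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(d_audio + d_visual + d_text, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))              # importance logit per segment

    def forward(self, audio, visual, text):
        return self.fuse(torch.cat([audio, visual, text], dim=-1))
```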
11. SHARP 2020: The 1st Shape Recovery from Partial Textured 3D Scans Challenge Results [PDF] Back to Contents
Alexandre Saint, Anis Kacem, Kseniya Cherenkova, Konstantinos Papadopoulos, Julian Chibane, Gerard Pons-Moll, Gleb Gusev, David Fofi, Djamila Aouada, Bjorn Ottersten
Abstract: The SHApe Recovery from Partial textured 3D scans challenge, SHARP 2020, is the first edition of a challenge fostering and benchmarking methods for recovering complete textured 3D scans from raw incomplete data. SHARP 2020 is organised as a workshop in conjunction with ECCV 2020. There are two complementary challenges, the first one on 3D human scans, and the second one on generic objects. Challenge 1 is further split into two tracks, focusing, first, on large body and clothing regions, and, second, on fine body details. A novel evaluation metric is proposed to quantify jointly the shape reconstruction, the texture reconstruction and the amount of completed data. Additionally, two unique datasets of 3D scans are proposed, to provide raw ground-truth data for the benchmarks. The datasets are released to the scientific community. Moreover, an accompanying custom library of software routines is also released to the scientific community. It allows for processing 3D scans, generating partial data and performing the evaluation. Results of the competition, analysed in comparison to baselines, show the validity of the proposed evaluation metrics, and highlight the challenging aspects of the task and of the datasets. Details on the SHARP 2020 challenge can be found at this https URL.
12. Hierarchical Neural Architecture Search for Deep Stereo Matching [PDF] Back to Contents
Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Tom Drummond, Hongdong Li, Zongyuan Ge
Abstract: To reduce the human effort in neural network design, Neural Architecture Search (NAS) has been applied with remarkable success to various high-level vision tasks such as classification and semantic segmentation. The underlying idea of the NAS algorithm is straightforward, namely, to give the network the ability to choose among a set of operations (e.g., convolution with different filter sizes), so that it can find an optimal architecture that is better adapted to the problem at hand. However, so far the success of NAS has not been enjoyed by low-level geometric vision tasks such as stereo matching. This is partly due to the fact that state-of-the-art deep stereo matching networks, designed by humans, are already sheer in size. Directly applying NAS to such massive structures is computationally prohibitive based on the currently available mainstream computing resources. In this paper, we propose the first end-to-end hierarchical NAS framework for deep stereo matching by incorporating task-specific human knowledge into the neural architecture search framework. Specifically, following the gold standard pipeline for deep stereo matching (i.e., feature extraction -- feature volume construction and dense matching), we optimize the architectures of the entire pipeline jointly. Extensive experiments show that our searched network outperforms all state-of-the-art deep stereo matching architectures and ranks first in accuracy on the KITTI stereo 2012, 2015 and Middlebury benchmarks, as well as first on the SceneFlow dataset, with a substantial improvement in network size and inference speed. The code is available at this https URL.
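The machinery underneath such differentiable NAS frameworks is often a "mixed" operation that softly weights candidate operations, so the architecture choice itself is learned by gradient descent. The generic sketch below illustrates only that idea; the paper's actual hierarchical stereo search space (feature net and matching net cells) is more specific.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """Softly weighted choice among candidate ops (a DARTS-style building block)."""
    def __init__(self, ch):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.Conv2d(ch, ch, 5, padding=2),
            nn.Identity()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture weights

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))
```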
13. Activation Map Adaptation for Effective Knowledge Distillation [PDF] Back to Contents
Zhiyuan Wu, Hong Qi, Yu Jiang, Minghao Zhao, Chupeng Cui, Zongmin Yang, Xinhui Xue
Abstract: Model compression has become a recent trend due to the requirement of deploying neural networks on embedded and mobile devices. Hence, both accuracy and efficiency are of critical importance. To explore a balance between them, a knowledge distillation strategy is proposed for general visual representation learning. It utilizes our well-designed activation map adaptive module to replace some blocks of the teacher network, exploring the most appropriate supervisory features adaptively during the training process. The teacher's hidden layer output prompts the student network during training so as to transfer effective semantic information. To verify the effectiveness of our strategy, this paper applied our method to the CIFAR-10 dataset. Results demonstrate that the method can boost the accuracy of the student network by 0.6% with a 6.5% loss reduction, and significantly improve its training speed.
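A common way to realize "the teacher's hidden layer output prompting the student" is a feature-hint term added to standard logit distillation. The sketch below shows that generic combination; the paper's adaptive activation-map module is not reproduced here, and the loss weights are assumptions.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      student_feat, teacher_feat, T=4.0, alpha=0.5, beta=0.1):
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T        # soft-label matching
    hint = F.mse_loss(student_feat, teacher_feat)         # hidden-layer guidance
    return (1 - alpha) * hard + alpha * soft + beta * hint
```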
14. A Centroid Loss for Weakly Supervised Semantic Segmentation in Quality Control and Inspection Application [PDF] Back to Contents
Kai Yao, Alberto Ortiz, Francisco Bonnin-Pascual
Abstract: Process automation has enabled a level of accuracy and productivity that goes beyond human ability, and one critical area where automation is making a huge difference is the machine vision system. In this paper, a semantic segmentation solution is proposed for two scenes. One is an inspection intended for vessel corrosion detection, and the other is a detection system used to assist quality control on the surgery toolboxes prepared by the sterilization unit of a hospital. In order to reduce the time required to prepare pixel-level ground truth, this work focuses on the use of weakly supervised annotations (scribbles). Moreover, our solution integrates a clustering approach into a semantic segmentation network, thus reducing the negative effects caused by weakly supervised annotations. To evaluate the performance of our approach, two datasets were collected from the real world (vessels' structure and hospital surgery toolboxes) for both training and validation. According to the analysis results, the approach proposed in this paper produces satisfactory performance on both datasets through the use of weak annotations.
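A generic centroid-style loss under scribble supervision can be sketched as pulling the embeddings of scribble-labeled pixels toward their class centroid; the paper's exact formulation may differ.

```python
import torch

def centroid_loss(embeddings, labels, n_classes):
    """embeddings: (N, D) features of scribble-labeled pixels; labels: (N,)."""
    loss, count = 0.0, 0
    for c in range(n_classes):
        feats = embeddings[labels == c]
        if feats.size(0) < 2:
            continue
        centroid = feats.mean(dim=0, keepdim=True)
        loss = loss + ((feats - centroid) ** 2).sum(dim=1).mean()
        count += 1
    return loss / max(count, 1)
```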
15. Multi-object tracking with self-supervised associating network [PDF] Back to Contents
Tae-young Chung, Heansung Lee, Myeong Ah Cho, Suhwan Cho, Sangyoun Lee
Abstract: Multi-Object Tracking (MOT) is a task with a lot of potential for development, and there are still many problems to be solved. In the traditional tracking-by-detection paradigm, there has been a lot of work on feature-based object re-identification methods. However, this approach suffers from a lack of training data: labeling a multi-object tracking dataset requires a location and an ID for every detection in a video sequence. Since assigning consecutive IDs to each detection in every sequence is a very labor-intensive task, current multi-object tracking datasets are not sufficient to train a re-identification network. So in this paper, we propose a novel self-supervised learning method using a large number of short videos with no human labeling, and improve the tracking performance through a re-identification network trained in this self-supervised manner, solving the lack-of-training-data problem. Although the re-identification network is trained in a self-supervised manner, it achieves state-of-the-art performance of 62.0% MOTA and 62.6% IDF1 on the MOT17 test benchmark. Furthermore, the performance improves as much as if the network had been trained with a large amount of labeled data, which shows the potential of the self-supervised method.
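The self-supervision signal can be sketched as a contrastive objective over unlabeled short videos: crops of the same object from nearby frames act as positives, crops of other objects as negatives. The sampling scheme and the loss below are assumptions, not the paper's training recipe.

```python
import torch
import torch.nn.functional as F

def track_contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """anchor, positive: (D,) embeddings; negatives: (K, D) embeddings."""
    a = F.normalize(anchor, dim=0)
    pos = torch.exp(a @ F.normalize(positive, dim=0) / temperature)
    neg = torch.exp(F.normalize(negatives, dim=1) @ a / temperature).sum()
    return -torch.log(pos / (pos + neg))
```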
16. Lane detection in complex scenes based on end-to-end neural network [PDF] Back to Contents
Wenbo Liu, Fei Yan, Kuan Tang, Jiyong Zhang, Tao Deng
Abstract: Lane detection is a key problem in determining the drivable areas in unmanned driving, and the detection accuracy of lane lines plays an important role in the decision-making of vehicle driving. The scenes faced by vehicles in daily driving are relatively complex: bright light, insufficient light, and crowded vehicles bring varying degrees of difficulty to lane detection. So we combine the advantages of spatial convolution in spatial information processing and the efficiency of ERFNet in semantic segmentation, and propose an end-to-end network for lane detection in a variety of complex scenes. We design the information exchange block by combining spatial convolution and dilated convolution, which plays a great role in capturing detailed information. Finally, our network was tested on the CULane database, and its F1-measure with an IoU threshold of 0.5 reaches 71.9%.
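One plausible reading of the information exchange block is a residual mix of an ordinary 3x3 spatial convolution and a dilated convolution that enlarges the receptive field for thin, elongated lane structures; the exact dilation rate and wiring are assumptions.

```python
import torch
import torch.nn as nn

class InfoExchangeBlock(nn.Module):
    def __init__(self, ch, dilation=4):
        super().__init__()
        self.spatial = nn.Conv2d(ch, ch, 3, padding=1)
        self.dilated = nn.Conv2d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        # Residual mix of local (spatial) and long-range (dilated) context.
        return self.relu(x + self.spatial(x) + self.dilated(x))
```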
17. Residual Recurrent CRNN for End-to-End Optical Music Recognition on Monophonic Scores [PDF] Back to Contents
Aozhi Liu, Lipei Zhang, Yaqi Mei, Sitong Lian, Maokun Han, Wen Cheng, Yuyu Liu, Zifeng Cai, Zhaohua Zhu, Baoqiang Han, Jing Xiao
Abstract: Optical Music Recognition is a field that attempts to extract digital information from images of printed or handwritten music scores. One of the challenges of the Optical Music Recognition task is to transcribe the symbols in camera-captured images into digital music notation. A previous end-to-end model based on deep learning was developed as a Convolutional Recurrent Neural Network. However, it does not exploit sufficient contextual information across scales, and there is still large room for improvement. In this paper, we propose an innovative end-to-end framework that combines a Residual Recurrent Convolutional Neural Network block with a recurrent Encoder-Decoder network to map an image to the sequence of monophonic music symbols corresponding to the notations present in it. The Residual Recurrent Convolutional block improves the model's ability to enrich contextual information while the number of parameters does not increase. The experimental results were benchmarked against a publicly available dataset called CAMERA-PRIMUS. We evaluate the performance of our model on images under both ideal and non-ideal conditions. The experiments show that our approach surpasses the state-of-the-art end-to-end method based on a Convolutional Recurrent Neural Network.
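A minimal sketch of a residual recurrent convolution block, assuming the common construction in which the same convolution is unrolled for t steps (so the parameter count stays fixed) and wrapped in a residual connection; the paper's exact block may differ.

```python
import torch.nn as nn

class ResidualRecurrentConv(nn.Module):
    """Recurrent convolution unrolled for t steps with a residual
    skip, in the spirit of recurrent-residual conv blocks (e.g.
    R2U-Net); an assumption, not the paper's exact layer."""
    def __init__(self, ch, t=2):
        super().__init__()
        self.t = t
        self.conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h = self.conv(x)
        for _ in range(self.t):        # reuse the same weights, so
            h = self.conv(x + h)       # no extra parameters per step
        return x + h                   # residual connection
```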
18. Flexible Piecewise Curves Estimation for Photo Enhancement [PDF] 返回目录
Chongyi Li, Chunle Guo, Qiming Ai, Shangchen Zhou, Chen Change Loy
Abstract: This paper presents a new method, called FlexiCurve, for photo enhancement. Unlike most existing methods that perform image-to-image mapping, which requires expensive pixel-wise reconstruction, FlexiCurve takes an input image and estimates global curves to adjust the image. The adjustment curves are specially designed for performing piecewise mapping, taking nonlinear adjustment and differentiability into account. To cope with the challenging and diverse illumination properties of real-world images, FlexiCurve is formulated as a multi-task framework that produces diverse estimations and the associated confidence maps. These estimations are adaptively fused to improve local enhancement of different regions. Thanks to the image-to-curve formulation, for an image with a size of 512*512*3, FlexiCurve only needs a lightweight network (150K trainable parameters) and has a fast inference speed (83 FPS on a single NVIDIA 2080Ti GPU). The proposed method improves efficiency without compromising enhancement quality or losing details of the original image. The method is also appealing in that it is not limited to paired training data, so it can flexibly learn rich enhancement styles from unpaired data. Extensive experiments demonstrate that our method achieves state-of-the-art performance on photo enhancement quantitatively and qualitatively.
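To make the image-to-curve idea concrete, the following sketch applies a global monotonic piecewise-linear tone curve predicted per image and channel. It is a simplification: FlexiCurve's learned curves also handle nonlinear segments and are fused across multiple estimations with confidence maps.

```python
import torch

def apply_piecewise_curve(img, knots):
    """Apply a global monotonic piecewise-linear tone curve.

    img:   (B, 3, H, W) tensor with values in [0, 1]
    knots: (B, 3, K) raw per-image, per-channel segment parameters
    """
    K = knots.shape[-1]
    seg = torch.softmax(knots, dim=-1)   # positive slopes that sum to 1
    out = torch.zeros_like(img)
    for k in range(K):
        lo, hi = k / K, (k + 1) / K
        # each pixel accumulates the portion of segment k it has passed
        out = out + seg[..., k, None, None] * (img.clamp(lo, hi) - lo) * K
    return out
```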
19. Video-based Facial Expression Recognition using Graph Convolutional Networks [PDF] 返回目录
Daizong Liu, Hongting Zhang, Pan Zhou
Abstract: Facial expression recognition (FER), which aims to classify the expression present in a facial image or video, has attracted a lot of research interest in the fields of artificial intelligence and multimedia. For the video-based FER task, it is sensible to capture the dynamic expression variation among the frames to recognize facial expressions. However, existing methods directly utilize CNN-RNN or 3D CNN models to extract spatial-temporal features from different facial units, rather than concentrating on specific regions while capturing expression variation, which leads to limited FER performance. In this paper, we introduce a Graph Convolutional Network (GCN) layer into a common CNN-RNN based model for video-based FER. First, the GCN layer is utilized to learn more significant facial expression features that concentrate on certain regions, after sharing information among the extracted CNN features of the nodes. Then, an LSTM layer is applied to learn long-term dependencies among the GCN-learned features to model the variation. In addition, a weight assignment mechanism is designed to weight the output of different nodes for final classification by characterizing the expression intensity in each frame. To the best of our knowledge, this is the first time a GCN has been used for the FER task. We evaluate our method on three widely-used datasets, CK+, Oulu-CASIA and MMI, and on one challenging in-the-wild dataset, AFEW8.0; the experimental results demonstrate that our method outperforms existing methods.
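A minimal sketch of the GCN step over per-frame node features, using the standard normalized-adjacency propagation rule H' = ReLU(A_hat H W); the facial-graph adjacency and the layer sizes are assumptions. In the described pipeline, an LSTM would then consume the per-frame GCN outputs.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One propagation step H' = ReLU(A_hat H W) with the usual
    symmetric normalization A_hat = D^-1/2 (A + I) D^-1/2."""
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        a = adj + torch.eye(adj.size(0))           # add self-loops
        d = a.sum(dim=1).pow(-0.5)
        self.register_buffer("a_hat", d[:, None] * a * d[None, :])
        self.w = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, h):                          # h: (B, N, in_dim)
        return torch.relu(self.a_hat @ self.w(h))  # (B, N, out_dim)
```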
20. Where to Look and How to Describe: Fashion Image Retrieval with an Attentional Heterogeneous Bilinear Network [PDF] 返回目录
Haibo Su, Peng Wang, Lingqiao Liu, Hui Li, Zhen Li, Yanning Zhang
Abstract: Fashion products typically feature compositions of a variety of styles at different clothing parts. In order to distinguish images of different fashion products, we need to extract both appearance (i.e., "how to describe") and localization (i.e., "where to look") information, as well as their interactions. To this end, we propose a biologically inspired framework for image-based fashion product retrieval, which mimics the hypothesized two-stream visual processing system of the human brain. The proposed attentional heterogeneous bilinear network (AHBN) consists of two branches: a deep CNN branch that extracts fine-grained appearance attributes and a fully convolutional branch that extracts landmark localization information. A joint channel-wise attention mechanism is further applied to the extracted heterogeneous features to focus on important channels, followed by a compact bilinear pooling layer that models the interaction of the two streams. Our proposed framework achieves satisfactory performance on three image-based fashion product retrieval benchmarks.
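As a simplified sketch of the fusion stage, the code below applies channel-wise attention to each stream (here just a sigmoid over globally pooled activations, in place of a learned attention sub-network) and then full bilinear pooling over spatial positions; the paper uses compact bilinear pooling, which approximates this outer product at far lower dimensionality.

```python
import torch
import torch.nn.functional as F

def channel_attention_bilinear(f_app, f_loc):
    """Fuse appearance and localization streams.

    f_app: (B, C1, H, W) appearance features
    f_loc: (B, C2, H, W) localization features (same H, W assumed)
    Returns an L2-normalized (B, C1*C2) bilinear descriptor.
    """
    def attend(f):
        w = torch.sigmoid(f.mean(dim=(2, 3)))        # (B, C) channel weights
        return f * w[:, :, None, None]

    a, l = attend(f_app), attend(f_loc)
    h, w = a.shape[2:]
    # bilinear pooling: average outer product over spatial positions
    bilinear = torch.einsum("bchw,bdhw->bcd", a, l) / (h * w)
    return F.normalize(bilinear.flatten(1), dim=1)
```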
21. PSF-LO: Parameterized Semantic Features Based Lidar Odometry [PDF] 返回目录
Guibin Chen, BoSheng Wang, XiaoLiang Wang, Huanjun Deng, Bing Wang, Shuo Zhang
Abstract: Lidar odometry (LO) is a key technology in the numerous reliable and accurate localization and mapping systems of autonomous driving. State-of-the-art LO methods generally leverage geometric information to perform point cloud registration. Furthermore, obtaining point cloud semantic information, which can describe the environment more richly, helps the registration. We present a novel semantic lidar odometry method based on self-designed parameterized semantic features (PSFs) to achieve low-drift ego-motion estimation for autonomous vehicles in real time. We first use a convolutional neural network-based algorithm to obtain point-wise semantics from the input laser point cloud, and then use the semantic labels to separate the road, building, traffic-sign and pole-like points and fit them separately to obtain the corresponding PSFs. Fast PSF-based matching enables us to refine the registration of geometric features (GeFs), reducing the impact of blurred submap surfaces on the accuracy of GeFs matching. Besides, we design an efficient method to accurately recognize and remove dynamic objects while retaining static ones in the semantic point cloud, which further improves the accuracy of LO. We evaluated our method, namely PSF-LO, on the public KITTI Odometry Benchmark, where it ranked #1 among semantic lidar methods with an average translation error of 0.82% on the test dataset at the time of writing.
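The abstract does not define the parameterizations, but as an example of what fitting a pole-like PSF could look like, the sketch below fits a 3D line (centroid plus principal direction) to points labeled as poles.

```python
import numpy as np

def fit_pole(points):
    """Fit a pole-like semantic cluster with a centroid + direction,
    one plausible parameterization of a PSF (the paper defines its
    own parameterizations for roads, buildings, signs and poles).

    points: (N, 3) lidar points labeled "pole" by the segmentation net.
    Returns (centroid, unit direction) of the best-fit 3D line.
    """
    centroid = points.mean(axis=0)
    # principal direction via SVD of the centered points
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    return centroid, vt[0]
```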
22. Semi supervised segmentation and graph-based tracking of 3D nuclei in time-lapse microscopy [PDF] 返回目录
S. Shailja, Jiaxiang Jiang, B.S. Manjunath
Abstract: We propose a novel weakly supervised method to improve the boundaries of 3D segmented nuclei by utilizing an over-segmented image. This is motivated by the observation that current state-of-the-art deep learning methods do not result in accurate boundaries when the training data is weakly annotated. Towards this, a 3D U-Net is trained to get the centroids of the nuclei and is integrated with a simple linear iterative clustering (SLIC) supervoxel algorithm that provides better adherence to cluster boundaries. To track the segmented nuclei, our algorithm utilizes relative nucleus locations depicting the processes of nucleus division and apoptosis. The proposed algorithmic pipeline achieves better segmentation performance than the state-of-the-art method in the Cell Tracking Challenge (CTC) 2019 and performance comparable to state-of-the-art methods in IEEE ISBI CTC2020, while utilizing very little pixel-wise annotated data. Detailed experimental results are provided, and the source code is available on GitHub.
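A crude sketch of combining the two components, assuming skimage's SLIC for the supervoxels: each supervoxel is assigned to the nearest U-Net-predicted centroid within a distance threshold. The assignment rule and the parameters are illustrative, not the paper's.

```python
import numpy as np
from skimage.segmentation import slic  # channel_axis needs skimage >= 0.19

def refine_nuclei(volume, centroids, n_segments=5000, max_dist=10.0):
    """Snap nucleus boundaries to SLIC supervoxels.

    volume:    (Z, Y, X) intensity image
    centroids: (K, 3) nucleus centers predicted by the 3D U-Net
    Assigns each supervoxel to its nearest centroid if within
    max_dist voxels; returns a (Z, Y, X) label image.
    """
    sv = slic(volume, n_segments=n_segments, compactness=0.1,
              channel_axis=None)                      # 3D supervoxels
    labels = np.zeros_like(sv)
    for sid in np.unique(sv):
        zyx = np.argwhere(sv == sid).mean(axis=0)     # supervoxel center
        d = np.linalg.norm(centroids - zyx, axis=1)
        if d.min() <= max_dist:
            labels[sv == sid] = d.argmin() + 1        # nucleus id (1-based)
    return labels
```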
23. EDNet: Improved DispNet for Efficient Disparity Estimation [PDF] 返回目录
Songyan Zhang, Zhicheng Wang
Abstract: Given a pair of rectified images, the goal of stereo matching is to estimate the disparity. We aim at building an efficient network, so we exploit the architecture of DispNetC and propose EDNet, a network that takes advantage of both a concatenation cost volume and a correlation volume by forming a combination volume. We further propose an attention-based residual learning module that embeds a spatial attention module. Experimental results show that our model outperforms previous state-of-the-art methods on the Scene Flow and KITTI datasets while running significantly faster, demonstrating the effectiveness of our proposed method.
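To make the combination volume concrete, here is a sketch that stacks, per disparity hypothesis, the concatenated left/right features together with a single-channel dot-product correlation; the exact normalization and channel layout in EDNet may differ.

```python
import torch

def combination_volume(left, right, max_disp):
    """Build a combined cost volume from concatenation and correlation.

    left, right: (B, C, H, W) unary features
    returns:     (B, 2*C + 1, max_disp, H, W)
    """
    B, C, H, W = left.shape
    vol = left.new_zeros(B, 2 * C + 1, max_disp, H, W)
    for d in range(max_disp):
        l = left[..., d:]                  # valid region for disparity d
        r = right[..., : W - d]
        vol[:, :C, d, :, d:] = l           # concatenation part
        vol[:, C:2 * C, d, :, d:] = r
        vol[:, 2 * C, d, :, d:] = (l * r).mean(dim=1)   # correlation part
    return vol
```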
24. Robust Pre-Training by Adversarial Contrastive Learning [PDF] 返回目录
Ziyu Jiang, Tianlong Chen, Ting Chen, Zhangyang Wang
Abstract: Recent work has shown that, when integrated with adversarial training, self-supervised pre-training can lead to state-of-the-art robustness. In this work, we improve robustness-aware self-supervised pre-training by learning representations that are consistent under both data augmentations and adversarial perturbations. Our approach leverages a recent contrastive learning framework, which learns representations by maximizing feature consistency under differently augmented views. This fits particularly well with the goal of adversarial robustness, as one cause of adversarial fragility is the lack of feature invariance, i.e., small input perturbations can result in undesirable large changes in features or even predicted labels. We explore various options to formulate the contrastive task, and demonstrate that by injecting adversarial perturbations, contrastive pre-training can lead to models that are both label-efficient and robust. We empirically evaluate the proposed Adversarial Contrastive Learning (ACL) and show it can consistently outperform existing methods. For example, on the CIFAR-10 dataset, ACL outperforms the previous state-of-the-art unsupervised robust pre-training approach by 2.99% on robust accuracy and 2.14% on standard accuracy. We further demonstrate that ACL pre-training can improve semi-supervised adversarial training, even when only a few labeled examples are available. Our codes and pre-trained models have been released at: this https URL.
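A compact sketch of one adversarial-contrastive training step under assumed PGD hyper-parameters: a perturbation is crafted to maximize an NT-Xent loss between two views, and the model is then trained to minimize the loss on the perturbed pair. ACL explores several variants of this formulation; this shows only one of them.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, t=0.5):
    """Standard NT-Xent contrastive loss over two augmented views."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    n = z1.size(0)
    sim = z @ z.t() / t
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))      # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n),
                         torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def acl_step(model, x1, x2, eps=8 / 255, alpha=2 / 255, steps=3):
    """Craft a PGD perturbation that maximizes the contrastive loss,
    then return the loss on the perturbed view for minimization."""
    delta = torch.zeros_like(x1, requires_grad=True)
    for _ in range(steps):
        loss = nt_xent(model(x1 + delta), model(x2))
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_()
    x_adv = (x1 + delta.detach()).clamp(0, 1)
    return nt_xent(model(x_adv), model(x2))
```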
25. View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose [PDF] 返回目录
Ting Liu, Jennifer J. Sun, Long Zhao, Jiaping Zhao, Liangzhe Yuan, Yuxiao Wang, Liang-Chieh Chen, Florian Schroff, Hartwig Adam
Abstract: Recognition of human poses and activities is crucial for autonomous systems to interact smoothly with people. However, cameras generally capture human poses in 2D as images and videos, which can have significant appearance variations across viewpoints. To address this, we explore recognizing similarity in 3D human body poses from 2D information, which has not been well-studied in existing works. Here, we propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses. Input ambiguities of 2D poses from projection and occlusion are difficult to represent through a deterministic mapping, and therefore we use probabilistic embeddings. In order to enable our embeddings to work with partially visible input keypoints, we further investigate different keypoint occlusion augmentation strategies during training. Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views, in comparison with 3D pose estimation models. We further show that keypoint occlusion augmentation during training significantly improves retrieval performance on partial 2D input poses. Results on action recognition and video alignment demonstrate that our embeddings, without any additional training, achieves competitive performance relative to other models specifically trained for each task.
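A generic sketch of probabilistic pose embeddings: the network maps 2D keypoints to a Gaussian (mean and log-variance), and pose similarity is estimated by a Monte-Carlo expected distance between samples. Layer sizes, joint count and the distance measure are assumptions.

```python
import torch
import torch.nn as nn

class ProbabilisticEmbedding(nn.Module):
    """Map 2D keypoints to a Gaussian in embedding space."""
    def __init__(self, n_joints=17, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * n_joints, 128), nn.ReLU())
        self.mu = nn.Linear(128, dim)
        self.logvar = nn.Linear(128, dim)

    def forward(self, kpts):                 # kpts: (B, n_joints, 2)
        h = self.net(kpts.flatten(1))
        return self.mu(h), self.logvar(h)

def sample_distance(mu1, lv1, mu2, lv2, n=20):
    """Monte-Carlo expected distance between two pose distributions."""
    s1 = mu1 + torch.randn(n, *mu1.shape, device=mu1.device) * (0.5 * lv1).exp()
    s2 = mu2 + torch.randn(n, *mu2.shape, device=mu2.device) * (0.5 * lv2).exp()
    return (s1 - s2).norm(dim=-1).mean()
```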
26. Zero-Shot Learning from scratch (ZFS): leveraging local compositional representations [PDF] 返回目录
Tristan Sylvain, Linda Petrini, R Devon Hjelm
Abstract: Zero-shot classification is a generalization task where no instance from the target classes is seen during training. To allow for test-time transfer, each class is annotated with semantic information, commonly in the form of attributes or text descriptions. While classical zero-shot learning does not explicitly forbid using information from other datasets, the approaches that achieve the best absolute performance on image benchmarks rely on features extracted from encoders pretrained on Imagenet. This approach relies on hyper-optimized Imagenet-relevant parameters from the supervised classification setting, entangling important questions about the suitability of those parameters and how they were learned with more fundamental questions about representation learning and generalization. To remove these distractors, we propose a more challenging setting: Zero-Shot Learning from scratch (ZFS), which explicitly forbids the use of encoders fine-tuned on other datasets. Our analysis on this setting highlights the importance of local information, and compositional representations.
27. Structural Prior Driven Regularized Deep Learning for Sonar Image Classification [PDF] 返回目录
Isaac D. Gerg, Vishal Monga
Abstract: Deep learning has been recently shown to improve performance in the domain of synthetic aperture sonar (SAS) image classification. Given the constant resolution with range of a SAS, it is no surprise that deep learning techniques perform so well. Despite deep learning's recent success, there are still compelling open challenges in reducing the high false alarm rate and enabling success when training imagery is limited, which is a practical challenge that distinguishes the SAS classification problem from standard image classification set-ups where training imagery may be abundant. We address these challenges by exploiting prior knowledge that humans use to grasp the scene. These include unconscious elimination of the image speckle and localization of objects in the scene. We introduce a new deep learning architecture which incorporates these priors with the goal of improving automatic target recognition (ATR) from SAS imagery. Our proposal -- called SPDRDL, Structural Prior Driven Regularized Deep Learning -- incorporates the previously mentioned priors in a multi-task convolutional neural network (CNN) and requires no additional training data when compared to traditional SAS ATR methods. Two structural priors are enforced via regularization terms in the learning of the network: (1) structural similarity prior -- enhanced imagery (often through despeckling) aids human interpretation and is semantically similar to the original imagery and (2) structural scene context priors -- learned features ideally encapsulate target centering information; hence learning may be enhanced via a regularization that encourages fidelity against known ground truth target shifts (relative target position from scene center). Experiments on a challenging real-world dataset reveal that SPDRDL outperforms state-of-the-art deep learning and other competing methods for SAS image classification.
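As a hedged sketch of how the two priors could enter training as regularization terms, the loss below combines cross-entropy with a (global, non-windowed) structural-similarity term between the enhanced and original images and an MSE term on the predicted target shift; the weights, the function name and the simplified SSIM are illustrative, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def prior_regularized_loss(logits, targets, enhanced, original,
                           pred_shift, true_shift,
                           w_sim=0.1, w_shift=0.1):
    """Task loss plus two structural-prior regularizers."""
    def global_ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
        # non-windowed SSIM over whole images, for brevity
        mx, my = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
        vx, vy = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
        cov = ((x - mx[:, None, None, None]) *
               (y - my[:, None, None, None])).mean(dim=(1, 2, 3))
        return ((2 * mx * my + c1) * (2 * cov + c2) /
                ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))).mean()

    ce = F.cross_entropy(logits, targets)          # classification
    sim = 1.0 - global_ssim(enhanced, original)    # structural similarity prior
    shift = F.mse_loss(pred_shift, true_shift)     # scene-context (shift) prior
    return ce + w_sim * sim + w_shift * shift
```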
28. AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in the Wild [PDF] 返回目录
Zhe Zhang, Chunyu Wang, Weichao Qiu, Wenhu Qin, Wenjun Zeng
Abstract: Occlusion is probably the biggest challenge for human pose estimation in the wild. Typical solutions often rely on intrusive sensors such as IMUs to detect occluded joints. To make the task truly unconstrained, we present AdaFuse, an adaptive multiview fusion method, which can enhance the features in occluded views by leveraging those in visible views. The core of AdaFuse is to determine the point-point correspondence between two views which we solve effectively by exploring the sparsity of the heatmap representation. We also learn an adaptive fusion weight for each camera view to reflect its feature quality in order to reduce the chance that good features are undesirably corrupted by ``bad'' views. The fusion model is trained end-to-end with the pose estimation network, and can be directly applied to new camera configurations without additional adaptation. We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic. It outperforms the state-of-the-arts on all of them. We also create a large scale synthetic dataset Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints, as it provides occlusion labels for every joint in the images. The dataset and code are released at this https URL.
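Once per-view heatmaps have been warped into a common view (the correspondence step AdaFuse derives from heatmap sparsity), the fusion itself reduces to a weighted sum; the sketch below shows only that step, with softmax-normalized view-quality scores.

```python
import torch

def adaptive_fuse(heatmaps, view_scores):
    """Fuse per-view joint heatmaps with learned view-quality weights.

    heatmaps:    (V, J, H, W) heatmaps already warped to a common view
    view_scores: (V,) raw quality scores for each camera view
    """
    w = torch.softmax(view_scores, dim=0)                    # sum to 1
    return (w[:, None, None, None] * heatmaps).sum(dim=0)    # (J, H, W)
```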
29. Global Image Segmentation Process using Machine Learning algorithm & Convolution Neural Network method for Self- Driving Vehicles [PDF] 返回目录
Tirumalapudi Raviteja, Rajay Vedaraj .I.S
Abstract: In autonomous vehicle technology, image segmentation is a major problem in visual perception. The image segmentation process is mainly used in medical applications; here we adapt it to visual perception tasks: predicting agents in the surrounding environment, identifying road boundaries, and tracking line markings. The main objective of the paper is to segment the input images using an image segmentation process with a Convolution Neural Network method to obtain efficient visual perception results. For sampling, we assume a local city data-set; the validation process is done in a Jupyter Notebook using the Python language. We propose this image segmentation method as a step toward standardizing and further developing state-of-the-art methods for visual inspection system understanding. The experimental results achieve 73% mean IOU. Our method also achieves a 90 FPS inference speed using an NVIDIA GeForce GTX 1050 GPU.
30. Attack Agnostic Adversarial Defense via Visual Imperceptible Bound [PDF] 返回目录
Saheb Chhabra, Akshay Agarwal, Richa Singh, Mayank Vatsa
Abstract: The high susceptibility of deep learning algorithms to structured and unstructured perturbations has motivated the development of efficient adversarial defense algorithms. However, the lack of generalizability of existing defense algorithms and the high variability in the performance of attack algorithms across different databases raise several questions about the effectiveness of defense algorithms. In this research, we aim to design a defense model that is robust within a certain bound against both seen and unseen adversarial attacks. This bound is related to the visual appearance of an image, and we term it the \textit{Visual Imperceptible Bound (VIB)}. To compute this bound, we propose a novel method that uses database characteristics. The VIB is further used to measure the effectiveness of attack algorithms. The performance of the proposed defense model is evaluated on the MNIST, CIFAR-10, and Tiny ImageNet databases against multiple attacks that include C\&W ($l_2$) and DeepFool. The proposed defense model is not only able to increase robustness against several attacks but also retains or improves the classification accuracy on the original clean test set. The proposed algorithm is attack agnostic, i.e. it does not require any knowledge of the attack algorithm.
31. MixNet for Generalized Face Presentation Attack Detection [PDF] 返回目录
Nilay Sanghvi, Sushant Kumar Singh, Akshay Agarwal, Mayank Vatsa, Richa Singh
Abstract: The non-intrusive nature and high accuracy of face recognition algorithms have led to their successful deployment across multiple applications, ranging from border access to mobile unlocking and digital payments. However, their vulnerability to sophisticated and cost-effective presentation attack mediums raises essential questions regarding their reliability. In the literature, several presentation attack detection algorithms have been presented; however, they still fall far short of real-world requirements. The major problem with existing work is generalizability against multiple attacks, in both the seen and unseen settings. Algorithms that are useful for one kind of attack (such as print) perform unsatisfactorily on another type of attack (such as silicone masks). In this research, we propose a deep learning-based network termed \textit{MixNet} to detect presentation attacks in cross-database and unseen-attack settings. The proposed algorithm utilizes state-of-the-art convolutional neural network architectures and learns the feature mapping for each attack category. Experiments are performed using multiple challenging face presentation attack databases, such as the SMAD and Spoof In the Wild (SiW-M) databases. Extensive experiments and comparison with the existing state-of-the-art algorithms show the effectiveness of the proposed algorithm.
32. Generalized Iris Presentation Attack Detection Algorithm under Cross-Database Settings [PDF] 返回目录
Mehak Gupta, Vishal Singh, Akshay Agarwal, Mayank Vatsa, Richa Singh
Abstract: Presentation attacks pose major challenges to most biometric modalities. Iris recognition, which is considered one of the most accurate biometric modalities for person identification, has also been shown to be vulnerable to advanced presentation attacks such as 3D contact lenses and textured lenses. While several presentation attack detection (PAD) algorithms have been presented in the literature, a significant limitation is their generalizability against an unseen database, an unseen sensor, and a different imaging environment. To address this challenge, we propose a generalized deep learning-based PAD network, MVANet, which utilizes multiple representation layers. It is inspired by the simplicity and success of hybrid algorithms, i.e. the fusion of multiple detection networks. Computational complexity is an essential factor in training deep neural networks; therefore, to reduce the computational complexity while learning multiple feature representation layers, a fixed base model has been used. The performance of the proposed network is demonstrated on multiple databases, such as the IIITD-WVU MUIPA and IIITD-CLI databases, under cross-database training-testing settings, to assess the generalizability of the proposed algorithm.
33. Human or Machine? It Is Not What You Write, But How You Write It [PDF] 返回目录
Luis A. Leiva, Moises Diaz, Miguel A. Ferrer, Réjean Plamondon
Abstract: Online fraud often involves identity theft. Since most security measures are weak or can be spoofed, we investigate a more nuanced and less explored avenue: behavioral biometrics via handwriting movements. This kind of data can be used to verify whether a user is operating a device or a computer application, so it is important to distinguish between human and machine-generated movements reliably. For this purpose, we study handwritten symbols (isolated characters, digits, gestures, and signatures) produced by humans and machines, and compare and contrast several deep learning models. We find that if symbols are presented as static images, they can fool state-of-the-art classifiers (near 75% accuracy in the best case) but can be distinguished with remarkable accuracy if they are presented as temporal sequences (95% accuracy in the average case). We conclude that an accurate detection of fake movements has more to do with how users write, rather than what they write. Our work has implications for computerized systems that need to authenticate or verify legitimate human users, and provides an additional layer of security to keep attackers at bay.
34. Weakly-Supervised Amodal Instance Segmentation with Compositional Priors [PDF] 返回目录
Yihong Sun, Adam Kortylewski, Alan Yuille
Abstract: Amodal segmentation in biological vision refers to the perception of the entire object when only a fraction is visible. This ability to see through occluders and reason about occlusion is innate to biological vision but not adequately modeled in current machine vision approaches. A key challenge is that ground-truth supervision for amodal object segmentation is inherently difficult to obtain. In this paper, we present a neural network architecture that is capable of amodal perception when weakly supervised with standard (inmodal) bounding box annotations. Our model extends compositional convolutional neural networks (CompositionalNets), which have been shown to be robust to partial occlusion by explicitly representing objects as compositions of parts. In particular, we extend CompositionalNets by: 1) expanding the innate part-voting mechanism in CompositionalNets to perform instance segmentation; and 2) exploiting the internal representations of CompositionalNets to enable amodal completion for both the bounding box and the segmentation mask. Our extensive experiments show that our proposed model can segment amodal masks robustly, with much improved mask prediction quality compared to state-of-the-art amodal segmentation approaches.
35. Neuron Merging: Compensating for Pruned Neurons [PDF] 返回目录
Woojeong Kim, Suhyun Kim, Mincheol Park, Geonseok Jeon
Abstract: Network pruning is widely used to lighten and accelerate neural network models. Structured network pruning discards the whole neuron or filter, leading to accuracy loss. In this work, we propose a novel concept of neuron merging applicable to both fully connected layers and convolution layers, which compensates for the information loss due to the pruned neurons/filters. Neuron merging starts with decomposing the original weights into two matrices/tensors. One of them becomes the new weights for the current layer, and the other is what we name a scaling matrix, guiding the combination of neurons. If the activation function is ReLU, the scaling matrix can be absorbed into the next layer under certain conditions, compensating for the removed neurons. We also propose a data-free and inexpensive method to decompose the weights by utilizing the cosine similarity between neurons. Compared to the pruned model with the same topology, our merged model better preserves the output feature map of the original model; thus, it maintains the accuracy after pruning without fine-tuning. We demonstrate the effectiveness of our approach over network pruning for various model architectures and datasets. As an example, for VGG-16 on CIFAR-10, we achieve an accuracy of 93.16% while reducing 64% of total parameters, without any fine-tuning. The code can be found here: this https URL
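The ReLU case rests on positive homogeneity, ReLU(a*z) = a*ReLU(z) for a >= 0: a nonnegative per-neuron scale applied before the activation can be moved past it and folded into the next layer's weights. Below is a deliberately simplified one-pair version of the idea for two consecutive fully connected layers (my illustration, not the authors' exact decomposition).

import torch

def merge_most_similar(W1, W2):
    """W1: (n_out, n_in) current-layer weights; W2: (m, n_out) next-layer weights."""
    Wn = torch.nn.functional.normalize(W1, dim=1)
    sim = Wn @ Wn.t()                              # cosine similarity between neurons
    sim.fill_diagonal_(-1.0)
    i, j = divmod(int(sim.argmax()), sim.size(1))  # most similar pair (i, j)
    scale = W1[j].norm() / W1[i].norm()            # approximate w_j by scale * w_i
    keep = [k for k in range(W1.size(0)) if k != j]
    W2 = W2.clone()
    W2[:, i] += scale * W2[:, j]                   # fold removed neuron into survivor
    return W1[keep], W2[:, keep]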
36. Correspondence Learning via Linearly-invariant Embedding [PDF] 返回目录
Riccardo Marin, Marie-Julie Rakotosaona, Simone Melzi, Maks Ovsjanikov
Abstract: In this paper, we propose a fully differentiable pipeline for estimating accurate dense correspondences between 3D point clouds. The proposed pipeline is an extension and a generalization of the functional maps framework. However, instead of using the Laplace-Beltrami eigenfunctions as done in virtually all previous works in this domain, we demonstrate that learning the basis from data can both improve robustness and lead to better accuracy in challenging settings. We interpret the basis as a learned embedding into a higher-dimensional space. Following the functional map paradigm, the optimal transformation in this embedding space must be linear, and we propose a separate architecture aimed at estimating the transformation by learning optimal descriptor functions. This leads to the first end-to-end trainable functional map-based correspondence approach in which both the basis and the descriptors are learned from data. Interestingly, we also observe that learning a \emph{canonical} embedding leads to worse results, suggesting that leaving an extra linear degree of freedom to the embedding network gives it more robustness, thereby also shedding light onto the success of previous methods. Finally, we demonstrate that our approach achieves state-of-the-art results in challenging non-rigid 3D point cloud correspondence applications.
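In the functional-map setting the abstract describes, once both shapes are embedded, the optimal linear transformation has a closed-form least-squares solution, and correspondences follow by nearest neighbors in the mapped space. A sketch under assumed (n_points x k) embeddings; the actual pipeline learns the embeddings and the descriptor functions end-to-end.

import torch

def optimal_linear_map(A, B):
    """A, B: (n_points, k) learned embeddings of source and target shapes."""
    # solve min_C ||A C - B||_F in closed form
    return torch.linalg.lstsq(A, B).solution        # (k, k)

def correspondences(A, B, C):
    """Nearest neighbor in the target embedding after applying the map."""
    return torch.cdist(A @ C, B).argmin(dim=1)      # index into target points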
37. Monocular Depth Estimation via Listwise Ranking using the Plackett-Luce model [PDF] 返回目录
Julian Lienen, Eyke Hüllermeier
Abstract: In many real-world applications, the relative depth of objects in an image is crucial for scene understanding, e.g., to calculate occlusions in augmented reality scenes. Predicting depth in monocular images has recently been tackled using machine learning methods, mainly by treating the problem as a regression task. Yet, being interested in an \emph{order relation} in the first place, ranking methods suggest themselves as a natural alternative to regression, and indeed, ranking approaches leveraging pairwise comparisons as training information ("object A is closer to the camera than B") have shown promising performance on this problem. In this paper, we elaborate on the use of so-called \emph{listwise} ranking as a generalization of the pairwise approach. Listwise ranking goes beyond pairwise comparisons between objects and considers rankings of arbitrary length as training information. Our approach is based on the Plackett-Luce model, a probability distribution on rankings, which we combine with a state-of-the-art neural network architecture and a sampling strategy to reduce training complexity. An empirical evaluation on benchmark data in a "zero-shot" setting demonstrates the effectiveness of our proposal compared to existing ranking and regression methods.
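For reference, the Plackett-Luce model assigns a ranking the probability prod_i exp(s_i) / sum_{j >= i} exp(s_j), so the training loss is the negative log-likelihood of the ground-truth depth ordering. A minimal sketch of that loss; the paper's network and sampling strategy are omitted here.

import torch

def plackett_luce_nll(scores):
    """scores: (list_len,) predicted scores, sorted by ground-truth depth rank."""
    # P(ranking) = prod_i exp(s_i) / sum_{j >= i} exp(s_j)
    tail_logsumexp = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
    return (tail_logsumexp - scores).sum()          # negative log-likelihood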
38. Empowering Knowledge Distillation via Open Set Recognition for Robust 3D Point Cloud Classification [PDF] 返回目录
Ayush Bhardwaj, Sakshee Pimpale, Saurabh Kumar, Biplab Banerjee
Abstract: Real-world scenarios pose several challenges to deep learning based computer vision techniques despite their tremendous success in research. Deeper models provide better performance, but are challenging to deploy and knowledge distillation allows us to train smaller models with minimal loss in performance. The model also has to deal with open set samples from classes outside the ones it was trained on and should be able to identify them as unknown samples while classifying the known ones correctly. Finally, most existing image recognition research focuses only on using two-dimensional snapshots of the real world three-dimensional objects. In this work, we aim to bridge these three research fields, which have been developed independently until now, despite being deeply interrelated. We propose a joint Knowledge Distillation and Open Set recognition training methodology for three-dimensional object recognition. We demonstrate the effectiveness of the proposed method via various experiments on how it allows us to obtain a much smaller model, which takes a minimal hit in performance while being capable of open set recognition for 3D point cloud data.
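The knowledge-distillation part of such a joint objective is typically the standard temperature-softened KL term combined with the usual cross-entropy; a generic sketch follows. The paper's joint KD + open-set formulation is more involved, and T and alpha here are assumed hyperparameters.

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)   # rescale gradients by T^2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard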
39. Scribble-based Weakly Supervised Deep Learning for Road Surface Extraction from Remote Sensing Images [PDF] 返回目录
Yao Wei, Shunping Ji
Abstract: Road surface extraction from remote sensing images using deep learning methods has achieved good performance, but most existing methods are based on fully supervised learning, which requires a large amount of training data with laborious per-pixel annotation. In this paper, we propose a scribble-based weakly supervised road surface extraction method named ScRoadExtractor, which learns from easily accessible scribbles such as centerlines instead of densely annotated road surface ground truths. To propagate semantic information from sparse scribbles to unlabeled pixels, we introduce a road label propagation algorithm that considers both the buffer-based properties of road networks and the color and spatial information of superpixels. The proposal masks generated by the road label propagation algorithm are utilized to train a dual-branch encoder-decoder network we designed, which consists of a semantic segmentation branch and an auxiliary boundary detection branch. We perform experiments on three diverse road datasets comprised of high-resolution remote sensing satellite and aerial images from across the world. The results demonstrate that ScRoadExtractor exceeds the classic scribble-supervised segmentation method by 20% on the intersection-over-union (IoU) indicator and outperforms state-of-the-art scribble-based weakly supervised methods by at least 4%.
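The propagation step can be pictured as follows: oversegment the image into superpixels and let every superpixel touched by a scribble inherit its label as a proposal mask. This is a deliberately simplified sketch using SLIC superpixels; the paper's algorithm additionally exploits buffer-based road properties.

import numpy as np
from skimage.segmentation import slic

def propagate_scribbles(image, scribble, n_segments=2000):
    """scribble: int array, 0 = unlabeled, 1 = road scribble (e.g., centerline)."""
    segments = slic(image, n_segments=n_segments, compactness=10)
    proposal = np.zeros_like(scribble)
    for sp in np.unique(segments[scribble == 1]):   # superpixels hit by a scribble
        proposal[segments == sp] = 1                # whole superpixel becomes road
    return proposal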
40. Coherent Loss: A Generic Framework for Stable Video Segmentation [PDF] 返回目录
Mingyang Qian, Yi Fu, Xiao Tan, Yingying Li, Jinqing Qi, Huchuan Lu, Shilei Wen, Errui Ding
Abstract: Video segmentation approaches are of great importance for numerous vision tasks, especially in video manipulation for entertainment. Due to the challenges associated with acquiring high-quality per-frame segmentation annotations and large video datasets with different environments at scale, learning approaches show overall higher accuracy on test datasets but lack the strict temporal constraints needed to self-correct jittering artifacts in most practical applications. We investigate how this jittering artifact degrades the visual quality of video segmentation results and propose a metric of temporal stability to evaluate it numerically. In particular, we propose a Coherent Loss with a generic framework to enhance the performance of a neural network against jittering artifacts, combining high accuracy with high consistency. Equipped with our method, existing video object/semantic segmentation approaches achieve a significant improvement in terms of more satisfactory visual quality on a human video dataset, which we provide for further research in this field, as well as on DAVIS and Cityscape.
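As a crude proxy for the kind of temporal-stability measurement the abstract describes, one can average the change between consecutive per-frame predictions. This simplification ignores genuine scene motion; the paper's actual metric is more principled.

import torch

def temporal_instability(preds):
    """preds: (T, H, W) per-frame soft segmentation maps in [0, 1]."""
    return (preds[1:] - preds[:-1]).abs().mean()    # lower = more stable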
41. Fast and Accurate Light Field Saliency Detection through Feature Extraction [PDF] 返回目录
Sahan Hemachandra, Ranga Rodrigo, Chamira Edussooriya
Abstract: Light field saliency detection---important due to its utility in many vision tasks---still lacks speed and can improve in accuracy. Because the saliency detection problem in light fields is formulated as a segmentation task or a "memorizing" task, existing approaches consume unnecessarily large amounts of computational resources for (training and) testing, leading to execution times of several seconds. We solve this by aggressively reducing the large light-field images to a much smaller three-channel feature map appropriate for saliency detection using an RGB image saliency detector. We achieve this by introducing a novel convolutional neural network based feature extraction and encoding module. Our saliency detector takes $0.4$ s to process a light field of size $9\times9\times512\times375$ on a CPU and is significantly faster than existing systems, with better or comparable accuracy. Our work shows that extracting features from light fields through aggressive size reduction and attention results in a faster and more accurate light-field saliency detector.
42. Applying convolutional neural networks to extremely sparse image datasets using an image subdivision approach [PDF] 返回目录
Johan P. Boetker
Abstract: Purpose: The aim of this work is to demonstrate that convolutional neural networks (CNNs) can be applied to extremely sparse image libraries by subdivision of the original image datasets. Methods: Image datasets from a conventional digital camera were created, and scanning electron microscopy (SEM) measurements were obtained from the literature. The image datasets were subdivided, and CNN models were trained on parts of the subdivided datasets. Results: The CNN models were capable of analyzing extremely sparse image datasets by utilizing the proposed method of image subdivision. It was furthermore possible to provide a direct assessment of the various regions where a given API or appearance was predominant.
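The subdivision itself is straightforward tiling, so that a handful of images yields many training samples. A minimal sketch with an assumed tile size; overlap and padding strategies are left out.

import numpy as np

def subdivide(image, tile=64):
    """Split one image into non-overlapping tile x tile patches."""
    h, w = image.shape[:2]
    return [image[y:y + tile, x:x + tile]
            for y in range(0, h - tile + 1, tile)
            for x in range(0, w - tile + 1, tile)]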
43. CLRGaze: Contrastive Learning of Representations for Eye Movement Signals [PDF] 返回目录
Louise Gillian C. Bautista, Prospero C. Naval Jr
Abstract: Eye movements are rich but ambiguous biosignals that usually require a meticulous selection of features. We instead propose to learn feature representations of eye movements in a self-supervised manner. We adopt a contrastive learning approach and a set of data transformations that enable a deep neural network to discern salient and granular gaze patterns. We evaluate on six eye-tracking data sets and assess the learned features on biometric tasks. We achieve accuracies as high as 97.3%. Our work provides insights into a general representation learning method not only for eye movements but also possibly for similar biosignals.
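Contrastive pretraining of this kind usually optimizes an NT-Xent (SimCLR-style) objective over two augmented views of each signal; a generic sketch follows. CLRGaze's exact objective, encoder, and transformations may differ, and tau here is an assumed temperature.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.5):
    """z1, z2: (batch, dim) embeddings of two augmented views of the same samples."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)       # (2B, dim)
    sim = z @ z.t() / tau
    sim.fill_diagonal_(float("-inf"))                 # exclude self-similarity
    n = z.size(0)
    targets = torch.arange(n, device=z.device).roll(n // 2)  # i pairs with i +/- B
    return F.cross_entropy(sim, targets)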
44. APB2FaceV2: Real-Time Audio-Guided Multi-Face Reenactment [PDF] 返回目录
Jiangning Zhang, Xianfang Zeng, Chao Xu, Jun Chen, Yong Liu, Yunliang Jiang
Abstract: Audio-guided face reenactment aims to generate a photorealistic face whose facial expression matches the input audio. However, current methods can only reenact a specific person once the model is trained, or they need extra operations such as 3D rendering and image post-fusion to generate vivid faces. To solve the above challenge, we propose a novel \emph{R}eal-time \emph{A}udio-guided \emph{M}ulti-face reenactment approach named \emph{APB2FaceV2}, which can reenact different target faces among multiple persons given a corresponding reference face and a driving audio signal as inputs. To enable the model to be trained end-to-end and to run faster, we design a novel module named Adaptive Convolution (AdaConv) to infuse audio information into the network, and adopt a lightweight network as our backbone so that the network can run in real time on both CPU and GPU. Comparison experiments prove the superiority of our approach over existing state-of-the-art methods, and further experiments demonstrate that our method is efficient and flexible for practical applications: this https URL
45. Classification of Spot-welded Joints in Laser Thermography Data using Convolutional Neural Networks [PDF] 返回目录
Linh Kästner, Samim Ahmadi, Florian Jonietz, Mathias Ziegler, Peter Jung, Giuseppe Caire, Jens Lambrecht
Abstract: Spot welding is a crucial process step in various industries. However, classification of spot welding quality is still a tedious process due to the complexity and sensitivity of the test material, which push conventional approaches to their limits. In this paper, we propose an approach for quality inspection of spot welds using images from laser thermography data. We propose data preparation approaches based on the underlying physics of spot-welded joints heated with pulsed laser thermography, analyzing the intensity over time and deriving dedicated data filters to generate training datasets. Subsequently, we utilize convolutional neural networks to classify weld quality and compare the performance of different models against each other. We achieve competitive results in classifying the different welding quality classes compared to traditional approaches, reaching an accuracy of more than 95 percent. Finally, we explore the effect of different augmentation methods.
46. Deep Denoising For Scientific Discovery: A Case Study In Electron Microscopy [PDF] 返回目录
Sreyas Mohan, Ramon Manzorro, Joshua L. Vincent, Binh Tang, Dev Yashpal Sheth, Eero P. Simoncelli, David S. Matteson, Peter A. Crozier, Carlos Fernandez-Granda
Abstract: Denoising is a fundamental challenge in scientific imaging. Deep convolutional neural networks (CNNs) provide the current state of the art in denoising natural images, where they produce impressive results. However, their potential has barely been explored in the context of scientific imaging. Denoising CNNs are typically trained on real natural images artificially corrupted with simulated noise. In contrast, in scientific applications, noiseless ground-truth images are usually not available. To address this issue, we propose a simulation-based denoising (SBD) framework, in which CNNs are trained on simulated images. We test the framework on data obtained from transmission electron microscopy (TEM), an imaging technique with widespread applications in material science, biology, and medicine. SBD outperforms existing techniques by a wide margin on a simulated benchmark dataset, as well as on real data. Apart from the denoised images, SBD generates likelihood maps to visualize the agreement between the structure of the denoised image and the observed data. Our results reveal shortcomings of state-of-the-art denoising architectures, such as their small field-of-view: substantially increasing the field-of-view of the CNNs allows them to exploit non-local periodic patterns in the data, which is crucial at high noise levels. In addition, we analyze the generalization capability of SBD, demonstrating that the trained networks are robust to variations of imaging parameters and of the underlying signal structure. Finally, we release the first publicly available benchmark dataset of TEM images, containing 18,000 examples.
47. Video Understanding based on Human Action and Group Activity Recognition [PDF] 返回目录
Zijian Kuang, Xinran Tie
Abstract: A lot of previous work, such as video captioning, has shown promising performance in producing general video understanding. However, it is still challenging to generate a fine-grained description of human actions and their interactions using state-of-the-art video captioning techniques. The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. In this study, we propose and improve a video understanding method based on the Group Activity Recognition model by learning an Actor Relation Graph (ARG). We enhance the functionality and performance of the ARG-based model to achieve better video understanding by applying approaches such as increasing human object detection accuracy with YOLO, increasing processing speed by reducing the input image size, and applying ResNet in the CNN layer. We also introduce a visualization model that visualizes each input video frame with predicted bounding boxes on each human object and predicted "video captioning" to describe each individual's action and their collective activity.
48. Advancing Non-Contact Vital Sign Measurement using Synthetic Avatars [PDF] 返回目录
Daniel McDuff, Javier Hernandez, Erroll Wood, Xin Liu, Tadas Baltrusaitis
Abstract: Non-contact physiological measurement has the potential to provide low-cost, non-invasive health monitoring. However, machine vision approaches are often limited by the availability and diversity of annotated video datasets resulting in poor generalization to complex real-life conditions. To address these challenges, this work proposes the use of synthetic avatars that display facial blood flow changes and allow for systematic generation of samples under a wide variety of conditions. Our results show that training on both simulated and real video data can lead to performance gains under challenging conditions. We show state-of-the-art performance on three large benchmark datasets and improved robustness to skin type and motion.
49. RUArt: A Novel Text-Centered Solution for Text-Based Visual Question Answering [PDF] 返回目录
Zan-Xia Jin, Heran Wu, Chun Yang, Fang Zhou, Jingyan Qin, Lei Xiao, Xu-Cheng Yin
Abstract: Text-based visual question answering (VQA) requires reading and understanding the text in an image to correctly answer a given question. However, most current methods simply add optical character recognition (OCR) tokens extracted from the image into the VQA model without considering contextual information of the OCR tokens or mining the relationships between OCR tokens and scene objects. In this paper, we propose a novel text-centered method called RUArt (Reading, Understanding and Answering the Related Text) for text-based VQA. Taking an image and a question as input, RUArt first reads the image and obtains text and scene objects. Then, it understands the question, the OCRed text, and objects in the context of the scene, and further mines the relationships among them. Finally, it answers the related text for the given question through text semantic matching and reasoning. We evaluate our RUArt on two text-based VQA benchmarks (ST-VQA and TextVQA) and conduct extensive ablation studies to explore the reasons behind RUArt's effectiveness. Experimental results demonstrate that our method can effectively explore the contextual information of the text and mine the stable relationships between the text and objects.
50. Classifying Eye-Tracking Data Using Saliency Maps [PDF] 返回目录
Shafin Rahman, Sejuti Rahman, Omar Shahid, Md. Tahmeed Abdullah, Jubair Ahmed Sourov
Abstract: A plethora of research in the literature shows how the human eye-fixation pattern varies depending on different factors, including genetics, age, social functioning, cognitive functioning, and so on. Analysis of these variations in visual attention has already elicited two potential research avenues: 1) determining the physiological or psychological state of the subject and 2) predicting the tasks associated with the act of viewing from the recorded eye-fixation data. To this end, this paper proposes a visual saliency based novel feature extraction method for automatic and quantitative classification of eye-tracking data, which is applicable to both research directions. Instead of directly extracting features from the fixation data, this method employs several well-known computational models of visual attention to predict eye fixation locations as saliency maps. Comparing the saliency amplitudes and the similarity and dissimilarity of saliency maps with the corresponding eye-fixation maps gives an extra dimension of information, which is effectively utilized to generate discriminative features to classify the eye-tracking data. Extensive experimentation using the Saliency4ASD, Age Prediction, and Visual Perceptual Task datasets shows that our saliency-based feature can achieve superior performance, outperforming previous state-of-the-art methods by a considerable margin. Moreover, unlike existing application-specific solutions, our method demonstrates performance improvement across three distinct problems from the real-life domain: Autism Spectrum Disorder screening, toddler age prediction, and human visual perceptual task classification, providing a general paradigm that utilizes the extra information inherent in saliency maps for a more accurate classification.
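The comparison between a recorded fixation map and a model-predicted saliency map can be turned into a small feature vector with standard similarity measures; an illustrative sketch (the paper's exact feature set is not reproduced here).

import numpy as np

def saliency_features(fix_map, sal_map):
    """fix_map: recorded fixation density; sal_map: model-predicted saliency."""
    f = fix_map.ravel() / (fix_map.sum() + 1e-8)      # normalize to distributions
    s = sal_map.ravel() / (sal_map.sum() + 1e-8)
    cc = np.corrcoef(f, s)[0, 1]                      # linear correlation
    sim = np.minimum(f, s).sum()                      # histogram intersection
    kl = np.sum(f * np.log((f + 1e-8) / (s + 1e-8)))  # KL dissimilarity
    return np.array([cc, sim, kl])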
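The abstract leaves the exact feature definitions to the paper; the sketch below only illustrates the general idea of scoring a model-predicted saliency map against a recorded fixation map with standard saliency metrics (amplitude at fixated locations, histogram intersection, KL divergence, linear correlation). The function name, inputs, and the specific metric choices are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def saliency_features(saliency_map, fixation_map, eps=1e-8):
    # Both inputs: 2D arrays on the same grid; names are illustrative.
    s = saliency_map.astype(np.float64)
    f = fixation_map.astype(np.float64)
    s = (s - s.min()) / (s.max() - s.min() + eps)        # rescale to [0, 1]
    f = (f - f.min()) / (f.max() - f.min() + eps)
    s_p, f_p = s / (s.sum() + eps), f / (f.sum() + eps)  # as distributions

    amplitude = (s * (f > 0)).sum() / ((f > 0).sum() + eps)  # saliency at fixations
    sim = np.minimum(s_p, f_p).sum()                     # histogram intersection
    kl = (f_p * np.log((f_p + eps) / (s_p + eps))).sum() # dissimilarity
    cc = np.corrcoef(s.ravel(), f.ravel())[0, 1]         # linear correlation

    return np.array([amplitude, sim, kl, cc])

feats = saliency_features(np.random.rand(48, 64), np.random.rand(48, 64))
```

Concatenating such vectors across several saliency models would yield a per-recording feature vector for a downstream classifier.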
51. Discriminative feature generation for classification of imbalanced data [PDF] 返回目录
Sungho Suh, Paul Lukowicz, Yong Oh Lee
Abstract: The data imbalance problem is a frequent bottleneck in the classification performance of neural networks. In this paper, we propose a novel supervised discriminative feature generation (DFG) method for a minority-class dataset. DFG is based on the modified structure of a generative adversarial network consisting of four independent networks: a generator, a discriminator, a feature extractor, and a classifier. To augment the selected discriminative features of the minority-class data by adopting an attention mechanism, the generator for the class-imbalanced target task is trained, while the feature extractor and classifier are regularized using pre-trained features from a large source dataset. The experimental results show that the DFG generator enhances the augmentation of label-preserving and diverse features, and the classification results are significantly improved on the target task. The feature generation model can contribute greatly to the development of data augmentation methods through discriminative feature generation and supervised attention methods.
52. Persian Handwritten Digit, Character, and Words Recognition by Using Deep Learning Methods [PDF] 返回目录
Mehdi Bonyani, Simindokht Jahangard
Abstract: Digit, character, and word recognition of a particular script play a key role in the field of pattern recognition. These days, Optical Character Recognition (OCR) systems are widely used in the commercial market in various applications. In recent years, there have been intensive research studies on optical character, digit, and word recognition. However, only a limited number of works address numeral, character, and word recognition for Persian scripts. In this paper, we use deep neural networks, investigating different versions of DenseNet models as well as Xception, and compare our results with state-of-the-art methods and approaches for recognizing Persian characters, digits, and words. Two holistic Persian handwritten datasets, HODA and Sadri, are used. For comparison of our proposed deep neural networks with previously published studies, the best state-of-the-art results are considered. We use accuracy as our evaluation criterion. On the HODA dataset, we achieve 99.72% and 89.99% accuracy for digits and characters, respectively. On the Sadri dataset, we obtain accuracy rates of 99.72%, 98.32%, and 98.82% for digits, characters, and words, respectively.
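As a concrete illustration of the transfer-learning setup such a comparison implies, the following sketch fine-tunes a DenseNet-121 (via torchvision) for the ten Persian digit classes. Dataset loading for HODA/Sadri and the Xception variant are omitted, and all hyperparameters are illustrative rather than the paper's.

```python
import torch
import torch.nn as nn
from torchvision import models

# Hypothetical setup: DenseNet-121 with a fresh 10-way head for Persian digits.
model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 10)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    # images: (B, 3, H, W) tensor, labels: (B,) tensor of class indices.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```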
53. Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions [PDF] 返回目录
Radhika Dua, Sai Srinivas Kancheti, Vineeth N Balasubramanian
Abstract: Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.
54. REDE: End-to-end Object 6D Pose Robust Estimation Using Differentiable Outliers Elimination [PDF] 返回目录
Weitong Hua, Zhongxiang Zhou, Jun Wu, Yue Wang, Rong Xiong
Abstract: Object 6D pose estimation is a fundamental task in many applications. Conventional methods solve the task by detecting and matching keypoints and then estimating the pose. Recent efforts bringing deep learning into the problem mainly overcome the vulnerability of conventional methods to environmental variation caused by hand-crafted feature design. However, these methods cannot achieve end-to-end learning and good interpretability at the same time. In this paper, we propose REDE, a novel end-to-end object pose estimator using RGB-D data, which utilizes a network for keypoint regression and a differentiable geometric pose estimator for pose-error back-propagation. Besides, to achieve better robustness when outlier keypoint predictions occur, we further propose a differentiable outlier elimination method that regresses each candidate result and its confidence simultaneously. Via confidence-weighted aggregation of the multiple candidates, we can reduce the effect of outliers on the final estimate. Finally, following the conventional methods, we apply a learnable refinement process to further improve the estimate. The experimental results on three benchmark datasets show that REDE slightly outperforms the state-of-the-art approaches and is more robust to object occlusion.
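The confidence-weighted aggregation can be made differentiable in a few lines; the following is a minimal sketch assuming a softmax weighting over candidates (the paper's exact normalization may differ, and all names are illustrative).

```python
import torch

def aggregate_candidates(candidates, confidences):
    # candidates:  (N, K, 3) -- N candidate 3D predictions for each of K keypoints
    # confidences: (N, K)    -- unnormalized confidence scores
    w = torch.softmax(confidences, dim=0)              # normalize over candidates
    return (w.unsqueeze(-1) * candidates).sum(dim=0)   # (K, 3) fused keypoints

# Outlier candidates receive near-zero weight, while gradients still flow to
# both the candidate regressor and the confidence head.
fused = aggregate_candidates(torch.randn(5, 8, 3), torch.randn(5, 8))
```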
55. Improving the generalization of network based relative pose regression: dimension reduction as a regularizer [PDF] 返回目录
Xiaqing Ding, Yue Wang, Li Tang, Yanmei Jiao, Rong Xiong
Abstract: Visual localization occupies an important position in many areas such as Augmented Reality, robotics, and 3D reconstruction. State-of-the-art visual localization methods perform pose estimation using a geometry-based solver within the RANSAC framework. However, these methods require accurate pixel-level matching at high image resolution, which is hard to satisfy under significant changes in appearance, dynamics, or viewing perspective. End-to-end learning-based regression networks provide a solution that circumvents the requirement for precise pixel-level correspondences, but demonstrate poor cross-scene generalization. In this paper, we explicitly add a learnable matching layer within the network to isolate the pose regression solver from the absolute image feature values, and apply dimension regularization on both the correlation feature channel and the image scale to further improve generalization and robustness to large viewpoint change. We implement this dimension regularization strategy within a two-layer pyramid-based framework to regress the localization results from coarse to fine. In addition, depth information is fused for absolute translational scale recovery. Through experiments on real-world RGB-D datasets, we validate the effectiveness of our design in improving both generalization performance and robustness to viewpoint change, and also show the potential of regression-based visual localization networks in challenging cases that are difficult for geometry-based visual localization methods.
56. Real-time Non-line-of-Sight imaging of dynamic scenes [PDF] 返回目录
Ji Hyun Nam, Eric Brandt, Sebastian Bauer, Xiaochun Liu, Eftychios Sifakis, Andreas Velten
Abstract: Non-Line-of-Sight (NLOS) imaging aims at recovering the 3D geometry of objects that are hidden from the direct line of sight. In the past, this method has suffered from the weak available multibounce signal, limiting scene size, capture speed, and reconstruction quality. While algorithms capable of reconstructing scenes at several frames per second have been demonstrated, real-time NLOS video has only been demonstrated for retro-reflective objects, where the NLOS signal strength is enhanced by 4 orders of magnitude or more. Furthermore, it has also been noted that the signal-to-noise ratio of reconstructions in NLOS methods drops quickly with distance, and past reconstructions have therefore been limited to small scenes with depths of a few meters. Actual models of noise and resolution in the scene have been simplistic, ignoring many of the complexities of the problem. We show that SPAD (Single-Photon Avalanche Diode) array detectors with a total of just 28 pixels, combined with a specifically extended Phasor Field reconstruction algorithm, can reconstruct live real-time videos of non-retro-reflective NLOS scenes. We provide an analysis of the Signal-to-Noise Ratio (SNR) of our reconstructions and show that for our method it is possible to reconstruct the scene such that SNR, motion blur, angular resolution, and depth resolution are all independent of scene size, suggesting that reconstruction of very large scenes may be possible. In the future, the light efficiency of NLOS imaging systems can be improved further by adding more pixels to the sensor array.
57. Investigating Saturation Effects in Integrated Gradients [PDF] 返回目录
Vivek Miglani, Narine Kokhlikyan, Bilal Alsallakh, Miguel Martin, Orion Reblitz-Richardson
Abstract: Integrated Gradients has become a popular method for post-hoc model interpretability. Despite its popularity, the composition and relative impact of different regions of the integral path are not well understood. We explore these effects and find that gradients in saturated regions of this path, where model output changes minimally, contribute disproportionately to the computed attribution. We propose a variant of Integrated Gradients which primarily captures gradients in unsaturated regions and evaluate this method on ImageNet classification networks. We find that this attribution technique shows higher model faithfulness and lower sensitivity to noise compared with standard Integrated Gradients. A notebook illustrating our computations and results is available at this https URL.
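For reference, the sketch below computes standard Integrated Gradients and a hypothetical "unsaturated" variant that keeps only path points where the model output still changes by more than a threshold. The masking rule and the threshold `tau` are assumptions in the spirit of the abstract, not the paper's exact formulation; the model is assumed to map a batch of inputs to one scalar score each.

```python
import torch

def integrated_gradients(model, x, baseline, steps=50):
    # Standard IG: average gradients along the straight path baseline -> x.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)
    grads = torch.autograd.grad(model(path).sum(), path)[0]
    return (x - baseline) * grads.mean(dim=0)

def unsaturated_gradients(model, x, baseline, steps=50, tau=1e-3):
    # Hypothetical variant: drop path points where the output has saturated,
    # i.e. where consecutive scores change by less than tau.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)
    out = model(path)                                   # (steps,) scalar scores
    grads = torch.autograd.grad(out.sum(), path)[0]
    keep = out.detach().diff().abs() > tau              # (steps - 1,) mask
    keep = torch.cat([keep, keep[-1:]]).float()         # pad to (steps,)
    mask = keep.view(-1, *([1] * x.dim()))
    return (x - baseline) * (grads * mask).sum(dim=0) / mask.sum().clamp(min=1)

# Toy usage with a linear scoring model on 4-d inputs:
w = torch.randn(4)
attr = integrated_gradients(lambda z: z @ w, torch.randn(4), torch.zeros(4))
```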
58. Unsupervised Dense Shape Correspondence using Heat Kernels [PDF] 返回目录
Mehmet Aygün, Zorah Lähner, Daniel Cremers
Abstract: In this work, we propose an unsupervised method for learning dense correspondences between shapes using a recent deep functional map framework. Instead of depending on ground-truth correspondences or the computationally expensive geodesic distances, we use heat kernels. These can be computed quickly during training as the supervision signal. Moreover, we propose a curriculum learning strategy using different heat diffusion times, which provide different levels of difficulty during optimization, without any sampling mechanism or hard example mining. We present the results of our method on different benchmarks which have various challenges like partiality, topological noise, and different connectivity.
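Heat kernels are cheap once the Laplacian spectrum is available, which is presumably why they suit on-the-fly supervision. Below is a minimal numpy sketch, with a random symmetric graph Laplacian standing in for a mesh Laplace-Beltrami operator; the eigenpair truncation and diffusion times are illustrative.

```python
import numpy as np

def heat_kernel(evals, evecs, t):
    # k_t(x, y) = sum_i exp(-lambda_i * t) * phi_i(x) * phi_i(y),
    # from the k smallest Laplacian eigenpairs: evals (k,), evecs (n, k).
    return (evecs * np.exp(-evals * t)) @ evecs.T

# Illustrative stand-in for a mesh Laplacian: a random symmetric graph Laplacian.
A = np.random.rand(50, 50)
A = (A + A.T) / 2
np.fill_diagonal(A, 0)
L = np.diag(A.sum(axis=1)) - A
evals, evecs = np.linalg.eigh(L)

# Curriculum idea from the abstract: a large diffusion time t gives smooth,
# easy supervision targets; a smaller t gives sharper, harder ones.
K_easy = heat_kernel(evals[:20], evecs[:, :20], t=1.0)
K_hard = heat_kernel(evals[:20], evecs[:, :20], t=0.01)
```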
59. 3DBooSTeR: 3D Body Shape and Texture Recovery [PDF] 返回目录
Alexandre Saint, Anis Kacem, Kseniya Cherenkova, Djamila Aouada
Abstract: We propose 3DBooSTeR, a novel method to recover a textured 3D body mesh from a textured partial 3D scan. With the advent of virtual and augmented reality, there is a demand for creating realistic, high-fidelity digital 3D human representations. However, 3D scanning systems can only capture the 3D human body up to some level of defect, owing to its complexity, including occlusions between body parts, varying levels of detail, shape deformations, and the articulated skeleton. Textured 3D mesh completion is thus important to enhance 3D acquisitions. The proposed approach decouples shape and texture completion into two sequential tasks. The shape is recovered by an encoder-decoder network deforming a template body mesh. The texture is subsequently obtained by projecting the partial texture onto the template mesh before inpainting the corresponding texture map with a novel approach. The approach is validated on the 3DBodyTex.v2 dataset.
60. Position and Rotation Invariant Sign Language Recognition from 3D Point Cloud Data with Recurrent Neural Networks [PDF] 返回目录
Prasun Roy, Saumik Bhattacharya, Partha Pratim Roy, Umapada Pal
Abstract: Sign language is a gesture-based symbolic communication medium among speech- and hearing-impaired people. It also serves as a communication bridge between the non-impaired and impaired populations. Unfortunately, in most situations a non-impaired person is not well conversant in such symbolic languages, which restricts the natural information flow between these two categories of the population. Therefore, an automated translation mechanism that can seamlessly translate sign language into natural language would be greatly useful. In this paper, we attempt to recognize 30 basic Indian sign gestures. Gestures are represented as temporal sequences of 3D depth maps, each frame consisting of the 3D coordinates of 20 body joints. A recurrent neural network (RNN) is employed as the classifier. To improve the performance of the classifier, we use geometric transformations for alignment correction of the depth frames. In our experiments the model achieves 84.81% accuracy.
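A minimal classifier over such joint sequences could look like the sketch below: each frame is flattened to the 60 coordinates of the 20 joints, and the final LSTM state is mapped to the 30 gesture classes. The hidden size and any preprocessing (e.g., root-joint centering for position invariance) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SignRNN(nn.Module):
    def __init__(self, n_joints=20, hidden=128, n_classes=30):
        super().__init__()
        # Each frame: 20 joints x 3 coordinates = 60 input features.
        self.rnn = nn.LSTM(input_size=n_joints * 3, hidden_size=hidden,
                           batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                   # x: (batch, frames, 60)
        _, (h, _) = self.rnn(x)
        return self.fc(h[-1])               # logits over the 30 gestures

logits = SignRNN()(torch.randn(4, 45, 60))  # 4 sequences of 45 frames each
```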
61. Attention-Guided Network for Iris Presentation Attack Detection [PDF] 返回目录
Cunjian Chen, Arun Ross
Abstract: Convolutional Neural Networks (CNNs) are being increasingly used to address the problem of iris presentation attack detection. In this work, we propose attention-guided iris presentation attack detection (AG-PAD) to augment CNNs with attention mechanisms. Two types of attention modules are independently appended on top of the last convolutional layer of the backbone network. Specifically, the channel attention module is used to model the inter-channel relationship between features, while the position attention module is used to model inter-spatial relationship between features. An element-wise sum is employed to fuse these two attention modules. Further, a novel hierarchical attention mechanism is introduced. Experiments involving both a JHU-APL proprietary dataset and the benchmark LivDet-Iris-2017 dataset suggest that the proposed method achieves promising results. To the best of our knowledge, this is the first work that exploits the use of attention mechanisms in iris presentation attack detection.
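A stripped-down sketch of the two attention modules and their element-wise-sum fusion is given below. It omits the query/key/value projections and learnable fusion scales that such modules typically carry, so treat it as an illustration of channel-wise and position-wise affinities rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Inter-channel affinities: a (C x C) attention matrix re-weights channels.
    def forward(self, x):                                    # x: (B, C, H, W)
        b, c, h, w = x.shape
        f = x.view(b, c, h * w)                              # (B, C, HW)
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)  # (B, C, C)
        return (attn @ f).view(b, c, h, w) + x               # residual connection

class PositionAttention(nn.Module):
    # Inter-spatial affinities: an (HW x HW) matrix re-weights positions.
    def forward(self, x):
        b, c, h, w = x.shape
        f = x.view(b, c, h * w)                              # (B, C, HW)
        attn = torch.softmax(f.transpose(1, 2) @ f, dim=-1)  # (B, HW, HW)
        return (f @ attn.transpose(1, 2)).view(b, c, h, w) + x

class DualAttention(nn.Module):
    # Element-wise-sum fusion of the two modules, as in the abstract.
    def __init__(self):
        super().__init__()
        self.ca, self.pa = ChannelAttention(), PositionAttention()

    def forward(self, x):
        return self.ca(x) + self.pa(x)

out = DualAttention()(torch.randn(2, 16, 8, 8))              # (2, 16, 8, 8)
```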
62. Exemplary Natural Images Explain CNN Activations Better than Feature Visualizations [PDF] 返回目录
Judy Borowski, Roland S. Zimmermann, Judith Schepers, Robert Geirhos, Thomas S. A. Wallis, Matthias Bethge, Wieland Brendel
Abstract: Feature visualizations such as synthetic maximally activating images are a widely used explanation method to better understand the information processing of convolutional neural networks (CNNs). At the same time, there are concerns that these visualizations might not accurately represent CNNs' inner workings. Here, we measure how much extremely activating images help humans to predict CNN activations. Using a well-controlled psychophysical paradigm, we compare the informativeness of synthetic images (Olah et al., 2017) with a simple baseline visualization, namely exemplary natural images that also strongly activate a specific feature map. Given either synthetic or natural reference images, human participants choose which of two query images leads to strong positive activation. The experiment is designed to maximize participants' performance, and is the first to probe intermediate instead of final layer representations. We find that synthetic images indeed provide helpful information about feature map activations (82% accuracy; chance would be 50%). However, natural images, originally intended to be a baseline, outperform synthetic images by a wide margin (92% accuracy). Additionally, participants are faster and more confident for natural images, whereas subjective impressions about the interpretability of feature visualization are mixed. The higher informativeness of natural images holds across most layers, for both expert and lay participants as well as for hand- and randomly-picked feature visualizations. Even if only a single reference image is given, synthetic images provide less information than natural images (65% vs. 73%). In summary, popular synthetic images from feature visualizations are significantly less informative for assessing CNN activations than natural images. We argue that future visualization methods should improve over this simple baseline.
63. ST-GREED: Space-Time Generalized Entropic Differences for Frame Rate Dependent Video Quality Prediction [PDF] 返回目录
Pavan C. Madhusudana, Neil Birkbeck, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Abstract: We consider the problem of conducting frame rate dependent video quality assessment (VQA) on videos of diverse frame rates, including high frame rate (HFR) videos. More generally, we study how perceptual quality is affected by frame rate, and how frame rate and compression combine to affect perceived quality. We devise an objective VQA model called Space-Time GeneRalized Entropic Difference (GREED) which analyzes the statistics of spatial and temporal band-pass video coefficients. A generalized Gaussian distribution (GGD) is used to model band-pass responses, while entropy variations between reference and distorted videos under the GGD model are used to capture video quality variations arising from frame rate changes. The entropic differences are calculated across multiple temporal and spatial subbands, and merged using a learned regressor. We show through extensive experiments that GREED achieves state-of-the-art performance on the LIVE-YT-HFR Database when compared with existing VQA models. The features used in GREED are highly generalizable and obtain competitive performance even on standard, non-HFR VQA databases. The implementation of GREED has been made available online: this https URL
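As a toy illustration of "entropic differences" between band-pass coefficients of reference and distorted videos, the sketch below models each local patch as Gaussian (a simplification of the paper's generalized Gaussian model) and compares local differential entropies. The patch size, the Gaussian assumption, and the final pooling are all illustrative choices, not GREED's actual definition.

```python
import numpy as np

def entropic_difference(ref_band, dis_band, patch=16, eps=1e-3):
    # Differential entropy of a Gaussian: 0.5 * ln(2 * pi * e * variance).
    def local_entropies(band):
        h, w = band.shape
        ents = []
        for i in range(0, h - patch + 1, patch):
            for j in range(0, w - patch + 1, patch):
                var = band[i:i + patch, j:j + patch].var()
                ents.append(0.5 * np.log(2 * np.pi * np.e * (var + eps)))
        return np.array(ents)
    return np.abs(local_entropies(ref_band) - local_entropies(dis_band)).mean()

# Toy call on random "band-pass coefficient" maps:
score = entropic_difference(np.random.randn(64, 64), np.random.randn(64, 64))
```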
64. Deep Low-rank plus Sparse Network for Dynamic MR Imaging [PDF] 返回目录
Wenqi Huang, Ziwen Ke, Zhuo-Xu Cui, Jing Cheng, Zhilang Qiu, Sen Jia, Yanjie Zhu, Dong Liang
Abstract: In dynamic MR imaging, L+S decomposition, or equivalently robust PCA, has achieved stunning performance. However, the selection of the parameters of L+S is empirical, and the acceleration rate is limited; these are the common failings of iterative CS-MRI reconstruction methods. Many deep learning approaches have been proposed to address these issues, but few of them use the low-rank prior. In this paper, a model-based low-rank plus sparse network, dubbed L+S-Net, is proposed for dynamic MR reconstruction. In particular, we use an alternating linearized minimization method to solve the optimization problem with low-rank and sparse regularization. A learned soft singular value thresholding is introduced to ensure a clear separation of the L and S components. The iterative steps are then unrolled into a network whose regularization parameters are learnable. Experiments on retrospective and prospective cardiac cine datasets show that the proposed model outperforms state-of-the-art CS and existing deep learning methods.
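Soft singular value thresholding, the proximal operator of the nuclear norm that isolates the low-rank component, is the core building block here. A minimal sketch follows; making `tau` an `nn.Parameter` inside an unrolled network would correspond to the "learned" thresholding the abstract describes.

```python
import torch

def soft_svt(X, tau):
    # Proximal operator of the nuclear norm: shrink all singular values by tau.
    U, S, Vh = torch.linalg.svd(X, full_matrices=False)
    return U @ torch.diag_embed(torch.relu(S - tau)) @ Vh

# Illustrative use on a Casorati-style (space x time) matrix of a dynamic MR
# series; in an unrolled network, tau would be an nn.Parameter.
L = soft_svt(torch.randn(256, 32), tau=0.5)
```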
65. A ReLU Dense Layer to Improve the Performance of Neural Networks [PDF] 返回目录
Alireza M. Javid, Sandipan Das, Mikael Skoglund, Saikat Chatterjee
Abstract: We propose ReDense as a simple and low-complexity way to improve the performance of trained neural networks. We use a combination of random weights and the rectified linear unit (ReLU) activation function to add a ReLU dense (ReDense) layer to the trained neural network such that it can achieve a lower training loss. The lossless flow property (LFP) of ReLU is the key to achieving the lower training loss while keeping the generalization error small. ReDense does not suffer from the vanishing gradient problem during training due to its shallow structure. We experimentally show that ReDense can improve the training and testing performance of various neural network architectures with different optimization losses and activation functions. Finally, we test ReDense on some state-of-the-art architectures and show the performance improvement on benchmark datasets.
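One plausible reading of the construction is sketched below: freeze the trained network, append a randomly initialized ReLU dense layer followed by a fresh output layer, and train only the new layers. The layer sizes, the freezing choice, and the helper name are illustrative assumptions rather than the paper's prescription.

```python
import torch
import torch.nn as nn

def add_redense(trained_net, feat_dim, n_classes, width=256):
    # Freeze the trained network; only the appended layers are trained further.
    for p in trained_net.parameters():
        p.requires_grad = False
    redense = nn.Sequential(
        nn.Linear(feat_dim, width),   # randomly initialized ReLU dense layer
        nn.ReLU(),
        nn.Linear(width, n_classes),  # fresh output layer
    )
    return nn.Sequential(trained_net, redense)

# Illustrative use on a toy trained feature extractor with 64-d outputs:
net = add_redense(nn.Linear(32, 64), feat_dim=64, n_classes=10)
logits = net(torch.randn(8, 32))      # (8, 10)
```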
66. Towards Scale-Invariant Graph-related Problem Solving by Iterative Homogeneous Graph Neural Networks [PDF] 返回目录
Hao Tang, Zhiao Huang, Jiayuan Gu, Bao-Liang Lu, Hao Su
Abstract: Current graph neural networks (GNNs) lack generalizability with respect to scales (graph sizes, graph diameters, edge weights, etc.) when solving many graph analysis problems. Taking the perspective of synthesizing graph theory programs, we propose several extensions to address the issue. First, inspired by the dependency of the iteration number of common graph theory algorithms on graph size, we learn to terminate the message passing process in GNNs adaptively according to the computation progress. Second, inspired by the fact that many graph theory algorithms are homogeneous with respect to graph weights, we introduce homogeneous transformation layers that are universal homogeneous function approximators, to convert ordinary GNNs to be homogeneous. Experimentally, we show that our GNN can be trained on small-scale graphs but generalizes well to large-scale graphs for a number of basic graph theory problems. It also shows generalizability in applications to multi-body physical simulation and image-based navigation problems.
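The adaptive-termination idea can be pictured as iterating one weight-tied message-passing layer until a learned halting score fires, instead of fixing the depth in advance. The sketch below uses a dense adjacency matrix and a mean-pooled halting head purely for brevity; all of it is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class IterativeGNN(nn.Module):
    def __init__(self, dim=32, max_iters=64):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)   # one weight-tied message layer
        self.halt = nn.Linear(dim, 1)        # learned stopping signal
        self.max_iters = max_iters

    def forward(self, h, adj):               # h: (N, d), adj: (N, N)
        for _ in range(self.max_iters):
            agg = adj @ h                    # sum messages from neighbors
            h = torch.relu(self.msg(torch.cat([h, agg], dim=-1)))
            if torch.sigmoid(self.halt(h).mean()) > 0.5:
                break                        # computation judged converged
        return h

h_out = IterativeGNN()(torch.randn(10, 32), torch.rand(10, 10))
```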
67. Robust Disentanglement of a Few Factors at a Time [PDF] 返回目录
Benjamin Estermann, Markus Marks, Mehmet Fatih Yanik
Abstract: Disentanglement is at the forefront of unsupervised learning, as disentangled representations of data improve generalization, interpretability, and performance in downstream tasks. Current unsupervised approaches remain inapplicable for real-world datasets since they are highly variable in their performance and fail to reach the levels of disentanglement of (semi-)supervised approaches. We introduce population-based training (PBT) for improving consistency in training variational autoencoders (VAEs) and demonstrate the validity of this approach in a supervised setting (PBT-VAE). We then use Unsupervised Disentanglement Ranking (UDR) as an unsupervised heuristic to score models in our PBT-VAE training and show how models trained this way tend to consistently disentangle only a subset of the generative factors. Building on top of this observation, we introduce the recursive rPU-VAE approach. We train the model until convergence, remove the learned factors from the dataset, and repeat. In doing so, we can label subsets of the dataset with the learned factors and consecutively use these labels to train one model that fully disentangles the whole dataset. With this approach, we show striking improvement in state-of-the-art unsupervised disentanglement performance and robustness across multiple datasets and metrics.
68. Optimization for Medical Image Segmentation: Theory and Practice when evaluating with Dice Score or Jaccard Index [PDF] 返回目录
Tom Eelbode, Jeroen Bertels, Maxim Berman, Dirk Vandermeulen, Frederik Maes, Raf Bisschops, Matthew B. Blaschko
Abstract: In many medical imaging and classical computer vision tasks, the Dice score and Jaccard index are used to evaluate the segmentation performance. Despite the existence and great empirical success of metric-sensitive losses, i.e. relaxations of these metrics such as soft Dice, soft Jaccard and Lovasz-Softmax, many researchers still use per-pixel losses, such as (weighted) cross-entropy to train CNNs for segmentation. Therefore, the target metric is in many cases not directly optimized. We investigate from a theoretical perspective, the relation within the group of metric-sensitive loss functions and question the existence of an optimal weighting scheme for weighted cross-entropy to optimize the Dice score and Jaccard index at test time. We find that the Dice score and Jaccard index approximate each other relatively and absolutely, but we find no such approximation for a weighted Hamming similarity. For the Tversky loss, the approximation gets monotonically worse when deviating from the trivial weight setting where soft Tversky equals soft Dice. We verify these results empirically in an extensive validation on six medical segmentation tasks and can confirm that metric-sensitive losses are superior to cross-entropy based loss functions in case of evaluation with Dice Score or Jaccard Index. This further holds in a multi-class setting, and across different object sizes and foreground/background ratios. These results encourage a wider adoption of metric-sensitive loss functions for medical segmentation tasks where the performance measure of interest is the Dice score or Jaccard index.
69. VoteNet++: Registration Refinement for Multi-Atlas Segmentation [PDF] 返回目录
Zhipeng Ding, Marc Niethammer
Abstract: Multi-atlas segmentation (MAS) is a popular image segmentation technique for medical images. In this work, we improve the performance of MAS by correcting registration errors before label fusion. Specifically, we use a volumetric displacement field to refine registrations based on image anatomical appearance and predicted labels. We show the influence of the initial spatial alignment as well as the beneficial effect of using label information for MAS performance. Experiments demonstrate that the proposed refinement approach improves MAS performance on a 3D magnetic resonance dataset of the knee.
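To illustrate how a volumetric displacement field refines a registration, the following minimal sketch resamples a volume under a per-voxel displacement; the function name and the choice of linear interpolation are our assumptions, not the authors' implementation.

    import numpy as np
    from scipy.ndimage import map_coordinates

    def warp(volume, disp):
        # volume: (D, H, W) image; disp: (3, D, H, W) per-voxel displacements
        grid = np.indices(volume.shape).astype(np.float32)
        return map_coordinates(volume, grid + disp, order=1)  # trilinear resampling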
70. Does contextual information improve 3D U-Net based brain tumor segmentation? [PDF] 返回目录
Iulian Emil Tampu, Neda Haj-Hosseini, Anders Eklund
Abstract: Effective, robust and automatic tools for brain tumor segmentation are needed for the extraction of information useful in treatment planning. In recent years, convolutional neural networks have shown state-of-the-art performance in the identification of tumor regions in magnetic resonance (MR) images. A large portion of the current research is devoted to the development of new network architectures to improve segmentation accuracy. In this work, it is instead investigated whether the addition of contextual information in the form of white matter (WM), gray matter (GM) and cerebrospinal fluid (CSF) masks improves U-Net based brain tumor segmentation. The BraTS 2020 dataset was used to train and test a standard 3D U-Net model that, in addition to the conventional MR image modalities, used the contextual information as extra channels. For comparison, a baseline model that only used the conventional MR image modalities was also trained. Dice scores of 80.76 and 79.58 were obtained for the baseline and the contextual information models, respectively. Results show that there is no statistically significant difference when comparing Dice scores of the two models on the test dataset (p > 0.5). In conclusion, there is no improvement in segmentation performance when using contextual information as extra channels.
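The "extra channels" construction is plain channel-wise concatenation of the tissue masks with the MR modalities before the first convolution; a minimal NumPy sketch (shapes and the modality list are illustrative):

    import numpy as np

    mr = np.random.rand(4, 64, 64, 64).astype(np.float32)  # e.g. T1, T1ce, T2, FLAIR
    wm, gm, csf = (np.random.rand(3, 64, 64, 64) > 0.5).astype(np.float32)
    x = np.concatenate([mr, wm[None], gm[None], csf[None]], axis=0)  # (7, D, H, W) input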
71. Matthews Correlation Coefficient Loss for Deep Convolutional Networks: Application to Skin Lesion Segmentation [PDF] 返回目录
Kumar Abhishek, Ghassan Hamarneh
Abstract: The segmentation of skin lesions is a crucial task in clinical decision support systems for the computer aided diagnosis of skin lesions. Although deep learning based approaches have improved segmentation performance, these models are often susceptible to class imbalance in the data, particularly to the fraction of the image occupied by the background healthy skin. Despite variations of the popular Dice loss function being proposed to tackle the class imbalance problem, the Dice loss formulation does not penalize misclassifications of the background pixels. We propose a novel metric-based loss function using the Matthews correlation coefficient, a metric that has been shown to be efficient in scenarios with skewed class distributions, and use it to optimize deep segmentation models. Evaluations on three dermoscopic image datasets, the ISBI ISIC 2017 Skin Lesion Segmentation Challenge dataset, the DermoFit Image Library, and the PH2 dataset, show that models trained using the proposed loss function outperform those trained using Dice loss by 11.25%, 4.87%, and 0.76%, respectively, in the mean Jaccard index. We plan to release the code on GitHub at this https URL upon publication of this paper.
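A minimal NumPy sketch of an MCC-based loss for binary segmentation, built from soft confusion-matrix counts; the epsilon smoothing and exact formulation are our assumptions, not the authors' released code:

    import numpy as np

    def mcc_loss(probs, targets, eps=1e-6):
        # Soft confusion-matrix entries from predicted foreground probabilities.
        tp = np.sum(probs * targets)
        tn = np.sum((1.0 - probs) * (1.0 - targets))
        fp = np.sum(probs * (1.0 - targets))
        fn = np.sum((1.0 - probs) * targets)
        mcc = (tp * tn - fp * fn) / (np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) + eps)
        return 1.0 - mcc  # unlike Dice, the numerator penalizes background errors via tn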
72. On Embodied Visual Navigation in Real Environments Through Habitat [PDF] 返回目录
Marco Rosano, Antonino Furnari, Luigi Gulino, Giovanni Maria Farinella
Abstract: Visual navigation models based on deep learning can learn effective policies when trained on large amounts of visual observations through reinforcement learning. Unfortunately, collecting the required experience in the real world requires the deployment of a robotic platform, which is expensive and time-consuming. To deal with this limitation, several simulation platforms have been proposed in order to train visual navigation policies on virtual environments efficiently. Despite the advantages they offer, simulators present a limited realism in terms of appearance and physical dynamics, leading to navigation policies that do not generalize in the real world. In this paper, we propose a tool based on the Habitat simulator which exploits real world images of the environment, together with sensor and actuator noise models, to produce more realistic navigation episodes. We perform a range of experiments to assess the ability of such policies to generalize using virtual and real-world images, as well as observations transformed with unsupervised domain adaptation approaches. We also assess the impact of sensor and actuation noise on the navigation performance and investigate whether it allows learning more robust navigation policies. We show that our tool can effectively help to train and evaluate navigation policies on real-world observations without running navigation episodes in the real world.
73. What is the best data augmentation approach for brain tumor segmentation using 3D U-Net? [PDF] 返回目录
Marco Domenico Cirillo, David Abramian, Anders Eklund
Abstract: Training segmentation networks requires large annotated datasets, which in medical imaging can be hard to obtain. Despite this fact, data augmentation has in our opinion not been fully explored for brain tumor segmentation (a possible explanation is that the number of training subjects (369) is rather large in the BraTS 2020 dataset). Here we apply different types of data augmentation (flipping, rotation, scaling, brightness adjustment, elastic deformation) when training a standard 3D U-Net, and demonstrate that augmentation significantly improves performance on the validation set (125 subjects) in many cases. Our conclusion is that brightness augmentation and elastic deformation work best, and that combinations of different augmentation techniques do not provide further improvement compared to only using one augmentation technique.
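A minimal sketch of such an augmentation pipeline for one 3D volume (parameter ranges are illustrative, and elastic deformation is omitted for brevity):

    import numpy as np
    from scipy.ndimage import rotate, zoom

    def augment(volume, rng=None):
        # volume: (D, H, W) float array
        rng = rng or np.random.default_rng()
        if rng.random() < 0.5:
            volume = volume[:, :, ::-1].copy()                 # flipping
        volume = rotate(volume, rng.uniform(-10, 10),
                        axes=(1, 2), reshape=False, order=1)   # rotation
        volume = zoom(volume, rng.uniform(0.9, 1.1), order=1)  # scaling (changes shape)
        return volume * rng.uniform(0.9, 1.1)                  # brightness adjustment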
74. Robustness May Be at Odds with Fairness: An Empirical Study on Class-wise Accuracy [PDF] 返回目录
Philipp Benz, Chaoning Zhang, Adil Karjauv, In So Kweon
Abstract: Recently, convolutional neural networks (CNNs) have made significant advancements; however, they are widely known to be vulnerable to adversarial attacks. Adversarial training is the most widely used technique for improving adversarial robustness to strong white-box attacks. Prior works have been evaluating and improving the model average robustness without per-class evaluation. The average evaluation alone might provide a false sense of robustness. For example, the attacker can focus on attacking the vulnerable class, which can be dangerous, especially when the vulnerable class is a critical one, such as "human" in autonomous driving. In this preregistration submission, we propose an empirical study on the class-wise accuracy and robustness of adversarially trained models. Given that the CIFAR10 training dataset has an equal number of samples for each class, interestingly, preliminary results on it with Resnet18 show that there exists an inter-class discrepancy for accuracy and robustness on standard models; for instance, "cat" is more vulnerable than other classes. Moreover, adversarial training increases the inter-class discrepancy. Our work aims to investigate the following questions: (a) Is the phenomenon of inter-class discrepancy universal for other classification benchmark datasets on other seminal model architectures with various optimization hyper-parameters? (b) If so, what are possible explanations for the inter-class discrepancy? (c) Can the techniques proposed in the long tail classification be readily extended to adversarial training for addressing the inter-class discrepancy?
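Per-class evaluation of the kind advocated here is straightforward to add to any benchmark; a minimal NumPy sketch (the function name is ours):

    import numpy as np

    def per_class_accuracy(y_true, y_pred, num_classes):
        accs = []
        for c in range(num_classes):
            mask = y_true == c
            accs.append((y_pred[mask] == c).mean() if mask.any() else float("nan"))
        return np.array(accs)  # the max-min gap exposes inter-class discrepancy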
75. Deep Sequential Learning for Cervical Spine Fracture Detection on Computed Tomography Imaging [PDF] 返回目录
Hojjat Salehinejad, Edward Ho, Hui-Ming Lin, Priscila Crivellaro, Oleksandra Samorodova, Monica Tafur Arciniegas, Zamir Merali, Suradech Suthiphosuwan, Aditya Bharatha, Kristen Yeom, Muhammad Mamdani, Jefferson Wilson, Errol Colak
Abstract: Fractures of the cervical spine are a medical emergency and may lead to permanent paralysis and even death. Accurate diagnosis in patients with suspected fractures by computed tomography (CT) is critical to patient management. In this paper, we propose a deep convolutional neural network (DCNN) with a bidirectional long-short term memory (BLSTM) layer for the automated detection of cervical spine fractures in CT axial images. We used an annotated dataset of 3,666 CT scans (729 positive and 2,937 negative cases) to train and validate the model. The validation results show a classification accuracy of 70.92% and 79.18% on the balanced (104 positive and 104 negative cases) and imbalanced (104 positive and 419 negative cases) test datasets, respectively.
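A minimal PyTorch sketch of the overall idea: per-slice convolutional features followed by a bidirectional LSTM over the axial slice sequence. Layer sizes, pooling, and the classification head are our assumptions, not the paper's exact network.

    import torch
    import torch.nn as nn

    class CnnBlstm(nn.Module):
        def __init__(self, feat_dim=64, hidden=32):
            super().__init__()
            self.cnn = nn.Sequential(nn.Conv2d(1, feat_dim, 3, padding=1),
                                     nn.ReLU(), nn.AdaptiveAvgPool2d(1))
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, 1)  # fracture present / absent

        def forward(self, x):                          # x: (B, T, 1, H, W) axial slices
            b, t = x.shape[:2]
            f = self.cnn(x.flatten(0, 1)).flatten(1)   # (B*T, feat_dim) per-slice features
            h, _ = self.lstm(f.view(b, t, -1))         # (B, T, 2*hidden)
            return self.head(h[:, -1])                 # one logit per scan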
76. A Dark and Bright Channel Prior Guided Deep Network for Retinal Image Quality Assessment [PDF] 返回目录
Ziwen Xu, Beiji Zou, Qing Liu
Abstract: Retinal image quality assessment is an essential task in the diagnosis of retinal diseases. Recently, deep models have emerged for grading the quality of retinal images. Current state-of-the-art methods either directly transfer classification networks originally designed for natural images to the quality classification of retinal images, or introduce extra image quality priors via multiple CNN branches or independent CNNs. This paper proposes a dark and bright prior guided deep network for retinal image quality assessment, called GuidedNet. Specifically, the dark and bright channel priors are embedded into the start layer of the network to improve the discriminative ability of deep features. Experimental results on the retinal image quality dataset Eye-Quality demonstrate the effectiveness of the proposed GuidedNet.
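The dark and bright channel priors themselves are cheap to compute: a per-pixel min/max over the color channels followed by a local min/max filter. A minimal sketch follows (the patch size is illustrative, and stacking the maps as extra input channels is our reading of "embedded into the start layer"):

    import numpy as np
    from scipy.ndimage import minimum_filter, maximum_filter

    def dark_bright_priors(img, patch=15):
        # img: (H, W, 3) RGB image in [0, 1]
        dark = minimum_filter(img.min(axis=2), size=patch)    # dark channel prior
        bright = maximum_filter(img.max(axis=2), size=patch)  # bright channel prior
        return dark, bright

    img = np.random.rand(256, 256, 3)
    dark, bright = dark_bright_priors(img)
    x = np.concatenate([img, dark[..., None], bright[..., None]], axis=2)  # 5 channels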
77. Geometrically Matched Multi-source Microscopic Image Synthesis Using Bidirectional Adversarial Networks [PDF] 返回目录
Jun Zhuang, Dali Wang
Abstract: Microscopic images from different modalities can provide more complete experimental information. In practice, biological and physical limitations may prohibit the acquisition of enough microscopic images during a given observation period. Image synthesis is one promising solution. However, most existing data synthesis methods only translate images from a source domain to a target domain without strong geometric correlations. To address this issue, we propose a novel model to synthesize diversified microscopic images from multiple sources with different geometric features. The application of our model to 3D live time-lapse embryonic images of C. elegans presents favorable results. To the best of our knowledge, it is the first effort to synthesize microscopic images with strong underlying geometric correlations from multi-source domains with entirely separated spatial features.
78. Interpreting Uncertainty in Model Predictions For COVID-19 Diagnosis [PDF] 返回目录
Gayathiri Murugamoorthy, Naimul Khan
Abstract: COVID-19, due to its accelerated spread, has brought about the need to use assistive tools for faster diagnosis in addition to typical lab swab testing. Chest X-rays for COVID cases tend to show changes in the lungs, such as ground glass opacities and peripheral consolidations, which can be detected by deep neural networks. However, traditional convolutional networks use point estimates for predictions, lacking in capture of uncertainty, which makes them less reliable for adoption. There have been several works so far in predicting COVID positive cases with chest X-rays. However, not much has been explored on quantifying the uncertainty of these predictions, interpreting uncertainty, and decomposing this into model or data uncertainty. To address these needs, we develop a visualization framework to address interpretability of uncertainty and its components, with uncertainty in predictions computed with a Bayesian convolutional neural network. This framework aims to understand the contribution of individual features in the chest X-ray images to predictive uncertainty. Providing this as an assistive tool can help the radiologist understand why the model came up with a prediction and whether the regions of interest captured by the model for the specific prediction are of significance in diagnosis. We demonstrate the usefulness of the tool in chest X-ray interpretation through several test cases from a benchmark dataset.
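To reproduce the flavor of such uncertainty maps, here is a minimal PyTorch sketch using Monte Carlo dropout, a common inexpensive stand-in for the Bayesian CNN the paper actually uses (the sample count and sigmoid output are our assumptions):

    import torch

    @torch.no_grad()
    def mc_dropout_predict(model, x, samples=20):
        model.train()  # keep dropout layers active at inference time
        probs = torch.stack([torch.sigmoid(model(x)) for _ in range(samples)])
        return probs.mean(0), probs.var(0)  # prediction and its uncertainty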
79. Co-embedding of Nodes and Edges with Graph Neural Networks [PDF] 返回目录
Xiaodong Jiang, Ronghang Zhu, Pengsheng Ji, Sheng Li
Abstract: Graph, as an important data representation, is ubiquitous in many real world applications ranging from social network analysis to biology. How to correctly and effectively learn and extract information from graph is essential for a large number of machine learning tasks. Graph embedding is a way to transform and encode the data structure in high dimensional and non-Euclidean feature space to a low dimensional and structural space, which is easily exploited by other machine learning algorithms. We have witnessed a huge surge of such embedding methods, from statistical approaches to recent deep learning methods such as the graph convolutional networks (GCN). Deep learning approaches usually outperform the traditional methods in most graph learning benchmarks by building an end-to-end learning framework to optimize the loss function directly. However, most of the existing GCN methods can only perform convolution operations with node features, while ignoring the handy information in edge features, such as relations in knowledge graphs. To address this problem, we present CensNet, Convolution with Edge-Node Switching graph neural network, for learning tasks in graph-structured data with both node and edge features. CensNet is a general graph embedding framework, which embeds both nodes and edges to a latent feature space. By using line graph of the original undirected graph, the role of nodes and edges are switched, and two novel graph convolution operations are proposed for feature propagation. Experimental results on real-world academic citation networks and quantum chemistry graphs show that our approach achieves or matches the state-of-the-art performance in four graph learning tasks, including semi-supervised node classification, multi-task graph classification, graph regression, and link prediction.
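The node/edge role switch rests on the classical line-graph construction; a minimal sketch with networkx (the toy graph is illustrative):

    import networkx as nx

    g = nx.Graph([(0, 1), (1, 2), (2, 0), (2, 3)])
    lg = nx.line_graph(g)
    # Each node of lg is an edge of g; two nodes of lg are adjacent iff the
    # corresponding edges of g share an endpoint (here: 4 nodes, 5 edges).
    print(lg.number_of_nodes(), lg.number_of_edges())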
80. Self-Supervised Training For Low Dose CT Reconstruction [PDF] 返回目录
Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
Abstract: Ionizing radiation has been the biggest concern in CT imaging. To reduce the dose level without compromising the image quality, low dose CT reconstruction has been offered with the availability of compressed sensing based reconstruction methods. Recently, data-driven methods have gained attention with the rise of deep learning, the availability of high computational power, and big datasets. Deep learning based methods have also been used for the low dose CT reconstruction problem in different manners. Usually, the success of these methods depends on clean labeled data. However, recent studies showed that training can be achieved successfully without clean datasets. In this study, we defined a training scheme to use low dose sinograms as their own training targets. We applied the self-supervision principle in the projection domain, where the noise is element-wise independent, as required by these methods. Using the self-supervised training, the filtering part of the FBP method and the parameters of a denoiser neural network are optimized. We demonstrate that our method outperforms both conventional and compressed sensing based iterative reconstruction methods qualitatively and quantitatively in the reconstruction of analytic CT phantoms and real-world CT images in the low dose CT reconstruction task.
81. SUREMap: Predicting Uncertainty in CNN-based Image Reconstruction Using Stein's Unbiased Risk Estimate [PDF] 返回目录
Ruangrawee Kitichotkul, Christopher A. Metzler, Frank Ong, Gordon Wetzstein
Abstract: Convolutional neural networks (CNNs) have emerged as a powerful tool for solving computational imaging reconstruction problems. However, CNNs are generally difficult-to-understand black boxes. Accordingly, it is challenging to know when they will work and, more importantly, when they will fail. This limitation is a major barrier to their use in safety-critical applications like medical imaging: Is that blob in the reconstruction an artifact or a tumor? In this work we use Stein's unbiased risk estimate (SURE) to develop per-pixel confidence intervals, in the form of heatmaps, for compressive sensing reconstruction using the approximate message passing (AMP) framework with CNN-based denoisers. These heatmaps tell end-users how much to trust an image formed by a CNN, which could greatly improve the utility of CNNs in various computational imaging applications.
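SURE gives an unbiased estimate of a denoiser's mean squared error without access to the ground truth; a minimal NumPy sketch with the standard Monte Carlo divergence estimator (the probe distribution and step size are our choices; the paper builds spatially resolved heatmaps on top of this idea):

    import numpy as np

    def sure_mse(f, y, sigma, eps=1e-3, rng=None):
        # Unbiased MSE estimate for denoiser f under y = x + N(0, sigma^2 I).
        rng = rng or np.random.default_rng()
        b = rng.standard_normal(y.shape)
        div = np.sum(b * (f(y + eps * b) - f(y))) / eps  # Monte Carlo divergence
        n = y.size
        return np.sum((f(y) - y) ** 2) / n - sigma**2 + 2.0 * sigma**2 * div / n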
82. Gestop : Customizable Gesture Control of Computer Systems [PDF] 返回目录
Sriram Krishna, Nishant Sinha
Abstract: The established way of interfacing with most computer systems is a mouse and keyboard. Hand gestures are an intuitive and effective touchless way to interact with computer systems. However, hand gesture based systems have seen low adoption among end-users primarily due to numerous technical hurdles in detecting in-air gestures accurately. This paper presents Gestop, a framework developed to bridge this gap. The framework learns to detect gestures from demonstrations, is customizable by end-users and enables users to interact in real-time with computers having only RGB cameras, using gestures.
83. Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modelling [PDF] 返回目录
Akash Srivastava, Yamini Bansal, Yukun Ding, Cole Hurwitz, Kai Xu, Bernhard Egger, Prasanna Sattigeri, Josh Tenenbaum, David D. Cox, Dan Gutfreund
Abstract: Current autoencoder-based disentangled representation learning methods achieve disentanglement by penalizing the (aggregate) posterior to encourage statistical independence of the latent factors. This approach introduces a trade-off between disentangled representation learning and reconstruction quality, since the model does not have enough capacity to learn correlated latent variables that capture detail information present in most image data. To overcome this trade-off, we present a novel multi-stage modelling approach where the disentangled factors are first learned using a preexisting disentangled representation learning method (such as $\beta$-TCVAE); then, the low-quality reconstruction is improved with another deep generative model that is trained to model the missing correlated latent variables, adding detail information while maintaining conditioning on the previously learned disentangled factors. Taken together, our multi-stage modelling approach results in a single, coherent probabilistic model that is theoretically justified by the principle of d-separation and can be realized with a variety of model classes, including likelihood-based models such as variational autoencoders, implicit models such as generative adversarial networks, and tractable models like normalizing flows or mixtures of Gaussians. We demonstrate that our multi-stage model has much higher reconstruction quality than current state-of-the-art methods with equivalent disentanglement performance across multiple standard benchmarks.
84. Unsupervised Super-Resolution: Creating High-Resolution Medical Images from Low-Resolution Anisotropic Examples [PDF] 返回目录
Jörg Sander, Bob D. de Vos, Ivana Išgum
Abstract: Although high resolution isotropic 3D medical images are desired in clinical practice, their acquisition is not always feasible. Instead, lower resolution images are upsampled to higher resolution using conventional interpolation methods. Sophisticated learning-based super-resolution approaches are frequently unavailable in clinical settings, because such methods require training with high-resolution isotropic examples. To address this issue, we propose a learning-based super-resolution approach that can be trained using solely anisotropic images, i.e. without high-resolution ground truth data. The method exploits the latent space, generated by autoencoders trained on anisotropic images, to increase spatial resolution in low-resolution images. The method was trained and evaluated using 100 publicly available cardiac cine MR scans from the Automated Cardiac Diagnosis Challenge (ACDC). The quantitative results show that the proposed method performs better than conventional interpolation methods. Furthermore, the qualitative results indicate that especially finer cardiac structures are synthesized with high quality. The method has the potential to be applied to other anatomies and modalities and can be easily applied to any 3D anisotropic medical image dataset.
85. Context Aware 3D UNet for Brain Tumor Segmentation [PDF] 返回目录
Parvez Ahmad, Saqib Qamar, Linlin Shen, Adnan Saeed
Abstract: Deep convolutional neural networks (CNNs) achieve remarkable performance for medical image analysis. UNet is the primary source in the performance of 3D CNN architectures for medical imaging tasks, including brain tumor segmentation. The skip connection in the UNet architecture concatenates features from both encoder and decoder paths to extract multi-contextual information from image data. The multi-scaled features play an essential role in brain tumor segmentation. However, the limited use of features can degrade the performance of the UNet approach for segmentation. In this paper, we propose a modified UNet architecture for brain tumor segmentation. In the proposed architecture, we use densely connected blocks in both encoder and decoder paths to extract multi-contextual information from the concept of feature reusability. The proposed residual inception blocks (RIB) are used to extract local and global information by merging features of different kernel sizes. We validate the proposed architecture on the multimodal brain tumor segmentation challenge (BraTS) 2020 testing dataset. The Dice (DSC) scores for the whole tumor (WT), tumor core (TC), and enhancing tumor (ET) are 89.12%, 84.74%, and 79.12%, respectively. Our proposed work is among the top ten methods based on the Dice scores of the testing dataset.
86. Dynamic Adversarial Patch for Evading Object Detection Models [PDF] 返回目录
Shahar Hoory, Tzvika Shapira, Asaf Shabtai, Yuval Elovici
Abstract: Recent research shows that neural networks models used for computer vision (e.g., YOLO and Fast R-CNN) are vulnerable to adversarial evasion attacks. Most of the existing real-world adversarial attacks against object detectors use an adversarial patch which is attached to the target object (e.g., a carefully crafted sticker placed on a stop sign). This method may not be robust to changes in the camera's location relative to the target object; in addition, it may not work well when applied to nonplanar objects such as cars. In this study, we present an innovative attack method against object detectors applied in a real-world setup that addresses some of the limitations of existing attacks. Our method uses dynamic adversarial patches which are placed at multiple predetermined locations on a target object. An adversarial learning algorithm is applied in order to generate the patches used. The dynamic attack is implemented by switching between optimized patches dynamically, according to the camera's position (i.e., the object detection system's position). In order to demonstrate our attack in a real-world setup, we implemented the patches by attaching flat screens to the target object; the screens are used to present the patches and switch between them, depending on the current camera location. Thus, the attack is dynamic and adjusts itself to the situation to achieve optimal results. We evaluated our dynamic patch approach by attacking the YOLOv2 object detector with a car as the target object and succeeded in misleading it in up to 90% of the video frames when filming the car from a wide viewing angle range. We improved the attack by generating patches that consider the semantic distance between the target object and its classification. We also examined the attack's transferability among different car models and were able to mislead the detector 71% of the time.
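At its core, the dynamic attack is a lookup from the estimated camera pose to the patch optimized for that pose; a minimal sketch (the pose bins, patch contents, and function name are illustrative placeholders):

    import numpy as np

    # One optimized patch per predetermined camera-azimuth bin (placeholders here).
    patches = {"left": np.zeros((64, 64, 3)),
               "front": np.ones((64, 64, 3)),
               "right": np.full((64, 64, 3), 0.5)}

    def select_patch(azimuth_deg):
        if azimuth_deg < -30.0:
            return patches["left"]
        return patches["front"] if azimuth_deg <= 30.0 else patches["right"]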
87. Automated triage of COVID-19 from various lung abnormalities using chest CT features [PDF] [Back to contents]
Dor Amran, Maayan Frid-Adar, Nimrod Sagie, Jannette Nassar, Asher Kabakovitch, Hayit Greenspan
Abstract: The outbreak of COVID-19 has led to a global effort to decelerate the pandemic's spread. For this purpose, chest computed tomography (CT) based screening and diagnosis of suspected COVID-19 patients is utilized, either as a support for or a replacement of the reverse transcription-polymerase chain reaction (RT-PCR) test. In this paper, we propose a fully automated AI-based system that takes chest CT scans as input and triages COVID-19 cases. More specifically, we produce multiple descriptive features, including lung and infection statistics, texture, shape, and location, to train a machine-learning-based classifier that distinguishes between COVID-19 and other lung abnormalities (including community-acquired pneumonia). We evaluated our system on a dataset of 2191 CT cases and demonstrated a robust solution with 90.8% sensitivity at 85.4% specificity and 94.0% ROC-AUC. In addition, we present a detailed feature analysis and an ablation study to explore the importance of each feature.
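As an illustration of the feature-based pipeline, here is a minimal scikit-learn sketch on synthetic stand-in data; the feature list and the choice of a random forest are our assumptions, since the abstract does not name the classifier.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical per-scan descriptors: [infection volume fraction, mean lesion
# intensity, texture energy, lesion count, peripheral-location score]
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))          # stand-in for 200 CT cases
y = rng.integers(0, 2, size=200)       # 1 = COVID-19, 0 = other abnormality

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```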
88. LagNetViP: A Lagrangian Neural Network for Video Prediction [PDF] [Back to contents]
Christine Allen-Blanchette, Sushant Veer, Anirudha Majumdar, Naomi Ehrich Leonard
Abstract: The dominant paradigms for video prediction rely on opaque transition models where neither the equations of motion nor the underlying physical quantities of the system are easily inferred. The equations of motion, as defined by Newton's second law, describe the time evolution of a physical system state and can therefore be applied toward the determination of future system states. In this paper, we introduce a video prediction model where the equations of motion are explicitly constructed from learned representations of the underlying physical quantities. To achieve this, we simultaneously learn a low-dimensional state representation and system Lagrangian. The kinetic and potential energy terms of the Lagrangian are distinctly modelled and the low-dimensional equations of motion are explicitly constructed using the Euler-Lagrange equations. We demonstrate the efficacy of this approach for video prediction on image sequences rendered in modified OpenAI gym Pendulum-v0 and Acrobot environments.
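For reference, the construction described above rests on the Euler-Lagrange equations; a textbook statement (our notation, not taken from the paper) with the kinetic and potential terms modelled separately is:

```latex
% Learned Lagrangian over the low-dimensional state q, split into
% kinetic energy T and potential energy V as in the abstract:
\mathcal{L}(q, \dot{q}) = T(q, \dot{q}) - V(q)

% Euler-Lagrange equations used to construct the equations of motion:
\frac{d}{dt}\,\frac{\partial \mathcal{L}}{\partial \dot{q}}
  - \frac{\partial \mathcal{L}}{\partial q} = 0
```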
89. Non-local Meets Global: An Iterative Paradigm for Hyperspectral Image Restoration [PDF] [Back to contents]
Wei He, Quanming Yao, Chao Li, Naoto Yokoya, Qibin Zhao, Hongyan Zhang, Liangpei Zhang
Abstract: Non-local low-rank tensor approximation has been developed as a state-of-the-art method for hyperspectral image (HSI) restoration, which includes the tasks of denoising, compressed HSI reconstruction, and inpainting. Unfortunately, while its restoration performance benefits from more spectral bands, its runtime also substantially increases. In this paper, we claim that the HSI lies in a global spectral low-rank subspace, and that the spectral subspace of each full-band patch group should lie in this global low-rank subspace. This motivates us to propose a unified paradigm combining spatial and spectral properties for HSI restoration. The proposed paradigm inherits the restoration quality of non-local spatial denoising and the light computational complexity of low-rank orthogonal basis exploration. An efficient alternating minimization algorithm with rank adaptation is developed: it first solves a fidelity-term-related problem to update a latent input image, and then learns a low-dimensional orthogonal basis and the corresponding reduced image from that latent input image. Subsequently, non-local low-rank denoising is applied to refine the reduced image and the orthogonal basis iteratively. Finally, experiments on HSI denoising, compressed reconstruction, and inpainting tasks, with both simulated and real datasets, demonstrate its superiority over state-of-the-art HSI restoration methods.
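A hedged sketch of the kind of objective the abstract implies, in our own notation (the paper's exact formulation may differ): the restored HSI is factored through an orthogonal spectral basis E and a reduced image A,

```latex
% Y: degraded observation; E: orthogonal spectral basis (E^T E = I);
% A: reduced image; R: non-local low-rank regularizer on A.
\min_{E,\,A}\; \lVert Y - E A \rVert_F^2 + \lambda\, R(A)
\quad \text{s.t.} \quad E^{\top} E = I
```

with alternating minimization updating E and A in turn, and the non-local low-rank denoising applied to the reduced image A.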
90. Weakly-supervised VisualBERT: Pre-training without Parallel Images and Captions [PDF] [Back to contents]
Liunian Harold Li, Haoxuan You, Zhecan Wang, Alireza Zareian, Shih-Fu Chang, Kai-Wei Chang
Abstract: Pre-trained contextual vision-and-language (V&L) models have brought impressive performance improvements on various benchmarks. However, the paired text-image data required for pre-training are hard to collect and scale up. We investigate whether a strong V&L representation model can be learned without text-image pairs. We propose Weakly-supervised VisualBERT, whose key idea is to conduct "mask-and-predict" pre-training on language-only and image-only corpora. Additionally, we introduce object tags detected by an object recognition model as anchor points to bridge the two modalities. Evaluation on four V&L benchmarks shows that Weakly-supervised VisualBERT achieves performance similar to a model pre-trained with paired data. Moreover, pre-training on more image-only data further improves a model that already has access to aligned data, suggesting the possibility of utilizing the billions of raw images available to enhance V&L models.
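A toy sketch of the "mask-and-predict" objective on an unpaired stream; the tokenization and masking rate are generic stand-ins, not the paper's training code. The same masking applies to text-only sentences and to detected object tags prepended to image-region features.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, p=0.15):
    """Randomly replace a fraction of tokens with [MASK]; the model is then
    trained to predict the originals (targets) at the masked positions."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < p:
            masked.append(MASK)
            targets[i] = tok
        else:
            masked.append(tok)
    return masked, targets

# Works on a text-only sentence or on object tags acting as anchor points.
print(mask_tokens(["a", "dog", "chasing", "a", "ball", "dog", "ball"]))
```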
91. Modularity Improves Out-of-Domain Instruction Following [PDF] [Back to contents]
Rodolfo Corona, Daniel Fried, Coline Devin, Dan Klein, Trevor Darrell
Abstract: We propose a modular architecture for following natural language instructions that describe sequences of diverse subgoals, such as navigating to landmarks or picking up objects. Standard, non-modular architectures used in instruction following do not exploit subgoal compositionality and often struggle on out-of-distribution tasks and environments. In our approach, subgoal modules each carry out natural language instructions for a specific subgoal type. The sequence of modules to execute is chosen by learning to segment the instructions and predicting a subgoal type for each segment. When compared to standard sequence-to-sequence approaches on ALFRED, a challenging instruction-following benchmark, we find that modularization improves generalization to environments unseen in training and to novel tasks.
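To make the modular dispatch concrete, here is a toy sketch under assumed interfaces; the segmenter, the subgoal-type predictor, and the per-subgoal policies are all hypothetical stand-ins for the learned components.

```python
# Hypothetical subgoal modules: each handles one instruction type.
MODULES = {
    "navigate": lambda seg: f"nav-policy executes: {seg}",
    "pickup":   lambda seg: f"grasp-policy executes: {seg}",
}

def predict_type(segment):
    # Stand-in for a learned classifier over instruction segments.
    return "pickup" if "pick up" in segment else "navigate"

def follow(instruction):
    # Stand-in for a learned segmenter: split on 'then'.
    for segment in instruction.split(" then "):
        yield MODULES[predict_type(segment)](segment)

for step in follow("go to the kitchen then pick up the mug"):
    print(step)
```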
92. The RobotSlang Benchmark: Dialog-guided Robot Localization and Navigation [PDF] [Back to contents]
Shurjo Banerjee, Jesse Thomason, Jason J. Corso
Abstract: Autonomous robot systems for applications from search and rescue to assistive guidance should be able to engage in natural language dialog with people. To study such cooperative communication, we introduce Robot Simultaneous Localization and Mapping with Natural Language (RobotSlang), a benchmark of 169 natural language dialogs between a human Driver controlling a robot and a human Commander providing guidance towards navigation goals. In each trial, the pair first cooperates to localize the robot on a global map visible to the Commander, then the Driver follows Commander instructions to move the robot to a sequence of target objects. We introduce a Localization from Dialog History (LDH) task and a Navigation from Dialog History (NDH) task, in which a learned agent is given dialog and visual observations from the robot platform as input and must localize itself in the global map or navigate towards the next target object, respectively. RobotSlang comprises nearly 5k utterances and over 1k minutes of robot camera and control streams. We present an initial model for the NDH task, and show that an agent trained in simulation can follow the RobotSlang dialog-based navigation instructions to control a physical robot platform. Code and data are available at this https URL.
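A hypothetical interface sketch for the NDH task as described; the type names and the trivial policy are ours, not the benchmark's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    dialog_history: List[str]   # Driver/Commander utterances so far
    observation: bytes          # current robot camera frame

def ndh_policy(step: Step) -> str:
    """Stand-in agent: map dialog + observation to a discrete action."""
    last = step.dialog_history[-1].lower() if step.dialog_history else ""
    return "turn_left" if "left" in last else "move_forward"

print(ndh_policy(Step(["Commander: the mug is to your left"], b"")))
```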
93. S2cGAN: Semi-Supervised Training of Conditional GANs with Fewer Labels [PDF] [Back to contents]
Arunava Chakraborty, Rahul Ragesh, Mahir Shah, Nipun Kwatra
Abstract: Generative adversarial networks (GANs) have been remarkably successful at learning complex, high-dimensional real-world distributions and generating realistic samples. However, they provide limited control over the generation process. Conditional GANs (cGANs) provide a mechanism to control the generation process by conditioning the output on a user-defined input. Although training GANs requires only unsupervised data, training cGANs requires labelled data, which can be very expensive to obtain. We propose a framework for semi-supervised training of cGANs which utilizes sparse labels to learn the conditional mapping while, at the same time, leveraging a large amount of unsupervised data to learn the unconditional distribution. We demonstrate the effectiveness of our method on multiple datasets and different conditional tasks.
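A hedged sketch of the combined objective the abstract implies, assuming a weighted sum of an unconditional GAN loss over all data and a conditional loss over the sparsely labelled subset; the weighting and the two-discriminator design are our assumptions, not necessarily the paper's formulation.

```latex
% D_u: unconditional discriminator trained on all (mostly unlabelled) data;
% D_c: conditional discriminator trained only on the sparse labelled pairs.
\mathcal{L} = \mathcal{L}_{\mathrm{uncond}}(G, D_u)
            + \lambda\,\mathcal{L}_{\mathrm{cond}}(G, D_c)

\mathcal{L}_{\mathrm{uncond}} =
    \mathbb{E}_{x}[\log D_u(x)]
  + \mathbb{E}_{z,y}[\log(1 - D_u(G(z, y)))]

\mathcal{L}_{\mathrm{cond}} =
    \mathbb{E}_{(x,y)\sim\mathrm{labelled}}[\log D_c(x, y)]
  + \mathbb{E}_{z,\,y\sim\mathrm{labelled}}[\log(1 - D_c(G(z, y), y))]
```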