
[arXiv Papers] Computer Vision and Pattern Recognition 2020-07-10

Contents

1. A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks [PDF] Abstract
2. Novel Subtypes of Pulmonary Emphysema Based on Spatially-Informed Lung Texture Learning [PDF] Abstract
3. Improving Style-Content Disentanglement in Image-to-Image Translation [PDF] Abstract
4. ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [PDF] Abstract
5. AI Assisted Apparel Design [PDF] Abstract
6. Real-time Embedded Person Detection and Tracking for Shopping Behaviour Analysis [PDF] Abstract
7. The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization [PDF] Abstract
8. Anyone here? Smart embedded low-resolution omnidirectional video sensor to measure room occupancy [PDF] Abstract
9. Single architecture and multiple task deep neural network for altered fingerprint analysis [PDF] Abstract
10. Patient-Specific Domain Adaptation for Fast Optical Flow Based on Teacher-Student Knowledge Transfer [PDF] Abstract
11. Uncertainty Quantification in Deep Residual Neural Networks [PDF] Abstract
12. Cross-Modal Weighting Network for RGB-D Salient Object Detection [PDF] Abstract
13. PIE-NET: Parametric Inference of Point Cloud Edges [PDF] Abstract
14. A Deep Joint Sparse Non-negative Matrix Factorization Framework for Identifying the Common and Subject-specific Functional Units of Tongue Motion During Speech [PDF] Abstract
15. Statistical shape analysis of brain arterial networks (BAN) [PDF] Abstract
16. Generalized Many-Way Few-Shot Video Classification [PDF] Abstract
17. Animated GIF optimization by adaptive color local table management [PDF] Abstract
18. Pollen13K: A Large Scale Microscope Pollen Grain Image Dataset [PDF] Abstract
19. Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision [PDF] Abstract
20. How low can you go? Privacy-preserving people detection with an omni-directional camera [PDF] Abstract
21. Automated analysis of eye-tracker-based human-human interaction studies [PDF] Abstract
22. Building Robust Industrial Applicable Object Detection Models Using Transfer Learning and Single Pass Deep Learning Architectures [PDF] Abstract
23. The autonomous hidden camera crew [PDF] Abstract
24. Inertial Measurements for Motion Compensation in Weight-bearing Cone-beam CT of the Knee [PDF] Abstract
25. Maximum Entropy Regularization and Chinese Text Recognition [PDF] Abstract
26. JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image [PDF] Abstract
27. ESA-ReID: Entropy-Based Semantic Feature Alignment for Person re-ID [PDF] Abstract
28. Attention Neural Network for Trash Detection on Water Channels [PDF] Abstract
29. Monocular Vision based Crowdsourced 3D Traffic Sign Positioning with Unknown Camera Intrinsics and Distortion Coefficients [PDF] Abstract
30. VisImages: A Large-scale, High-quality Image Corpus in Visualization Publications [PDF] Abstract
31. Auxiliary Tasks Speed Up Learning PointGoal Navigation [PDF] Abstract
32. Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder Network [PDF] Abstract
33. Long-Term Residual Blending Network for Blur Invariant Single Image Blind deblurring [PDF] Abstract
34. EPI-based Oriented Relation Networks for Light Field Depth Estimation [PDF] Abstract
35. Point Set Voting for Partial Point Cloud Analysis [PDF] Abstract
36. Attention-based Residual Speech Portrait Model for Speech to Face Generation [PDF] Abstract
37. PointMask: Towards Interpretable and Bias-Resilient Point Cloud Processing [PDF] Abstract
38. Aligning Videos in Space and Time [PDF] Abstract
39. Deep Multi-task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing [PDF] Abstract
40. Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks [PDF] Abstract
41. Searching for Efficient Architecture for Instrument Segmentation in Robotic Surgery [PDF] Abstract
42. IQ-VQA: Intelligent Visual Question Answering [PDF] Abstract
43. Quaternion Capsule Networks [PDF] Abstract
44. Words as Art Materials: Generating Paintings with Sequential GANs [PDF] Abstract
45. Temporal aggregation of audio-visual modalities for emotion recognition [PDF] Abstract
46. Sensor Fusion of Camera and Cloud Digital Twin Information for Intelligent Vehicles [PDF] Abstract
47. Deep Placental Vessel Segmentation for Fetoscopic Mosaicking [PDF] Abstract
48. The UU-Net: Reversible Face De-Identification for Visual Surveillance Video Footage [PDF] Abstract
49. One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control [PDF] Abstract
50. Invertible Zero-Shot Recognition Flows [PDF] Abstract
51. Medical Instrument Detection in Ultrasound-Guided Interventions: A Review [PDF] Abstract
52. Client Adaptation improves Federated Learning with Simulated Non-IID Clients [PDF] Abstract
53. A Systematic Review on Context-Aware Recommender Systems using Deep Learning and Embeddings [PDF] Abstract
54. Modelling the Distribution of 3D Brain MRI using a 2D Slice VAE [PDF] Abstract
55. Automated Chest CT Image Segmentation of COVID-19 Lung Infection based on 3D U-Net [PDF] Abstract
56. Low Dose CT Denoising via Joint Bilateral Filtering and Intelligent Parameter Optimization [PDF] Abstract
57. JBFnet -- Low Dose CT Denoising by Trainable Joint Bilateral Filtering [PDF] Abstract
58. Brain Tumor Anomaly Detection via Latent Regularized Adversarial Network [PDF] Abstract
59. Multi-Granularity Modularized Network for Abstract Visual Reasoning [PDF] Abstract
60. Learning to Switch CNNs with Model Agnostic Meta Learning for Fine Precision Visual Servoing [PDF] Abstract
61. InfoMax-GAN: Improved Adversarial Image Generation via Information Maximization and Contrastive Learning [PDF] Abstract
62. Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model [PDF] Abstract
63. Efficient detection of adversarial images [PDF] Abstract
64. Wandering Within a World: Online Contextualized Few-Shot Learning [PDF] Abstract
65. Automatic Probe Movement Guidance for Freehand Obstetric Ultrasound [PDF] Abstract
66. Journey Towards Tiny Perceptual Super-Resolution [PDF] Abstract
67. Lightweight Image Super-Resolution with Enhanced CNN [PDF] Abstract

Abstracts

1. A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks [PDF] Back to Contents
  Unnat Jain, Luca Weihs, Eric Kolve, Ali Farhadi, Svetlana Lazebnik, Aniruddha Kembhavi, Alexander Schwing
Abstract: Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task's difficulty outpaces a single agent's abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at this https URL .

2. Novel Subtypes of Pulmonary Emphysema Based on Spatially-Informed Lung Texture Learning [PDF] Back to Contents
  Jie Yang, Elsa D. Angelini, Pallavi P. Balte, Eric A. Hoffman, John H.M. Austin, Benjamin M. Smith, R. Graham Barr, Andrew F. Laine
Abstract: Pulmonary emphysema overlaps considerably with chronic obstructive pulmonary disease (COPD), and is traditionally subcategorized into three subtypes previously identified on autopsy. Unsupervised learning of emphysema subtypes on computed tomography (CT) opens the way to new definitions of emphysema subtypes and eliminates the need of thorough manual labeling. However, CT-based emphysema subtypes have been limited to texture-based patterns without considering spatial location. In this work, we introduce a standardized spatial mapping of the lung for quantitative study of lung texture location, and propose a novel framework for combining spatial and texture information to discover spatially-informed lung texture patterns (sLTPs) that represent novel emphysema subtypes. Exploiting two cohorts of full-lung CT scans from the MESA COPD and EMCAP studies, we first show that our spatial mapping enables population-wide study of emphysema spatial location. We then evaluate the characteristics of the sLTPs discovered on MESA COPD, and show that they are reproducible, able to encode standard emphysema subtypes, and associated with physiological symptoms.

3. Improving Style-Content Disentanglement in Image-to-Image Translation [PDF] Back to Contents
  Aviv Gabbay, Yedid Hoshen
Abstract: Unsupervised image-to-image translation methods have achieved tremendous success in recent years. However, it can be easily observed that their models contain significant entanglement which often hurts the translation performance. In this work, we propose a principled approach for improving style-content disentanglement in image-to-image translation. By considering the information flow into each of the representations, we introduce an additional loss term which serves as a content-bottleneck. We show that the results of our method are significantly more disentangled than those produced by current methods, while further improving the visual quality and translation diversity.

4. ThreeDWorld: A Platform for Interactive Multi-Modal Physical Simulation [PDF] Back to Contents
  Chuang Gan, Jeremy Schwartz, Seth Alter, Martin Schrimpf, James Traer, Julian De Freitas, Jonas Kubilius, Abhishek Bhandwaldar, Nick Haber, Megumi Sano, Kuno Kim, Elias Wang, Damian Mrowca, Michael Lingelbach, Aidan Curtis, Kevin Feigelis, Daniel M. Bear, Dan Gutfreund, David Cox, James J. DiCarlo, Josh McDermott, Joshua B. Tenenbaum, Daniel L.K. Yamins
Abstract: We introduce ThreeDWorld (TDW), a platform for interactive multi-modal physical simulation. With TDW, users can simulate high-fidelity sensory data and physical interactions between mobile agents and objects in a wide variety of rich 3D environments. TDW has several unique properties: 1) realtime near photo-realistic image rendering quality; 2) a library of objects and environments with materials for high-quality rendering, and routines enabling user customization of the asset library; 3) generative procedures for efficiently building classes of new environments 4) high-fidelity audio rendering; 5) believable and realistic physical interactions for a wide variety of material types, including cloths, liquid, and deformable objects; 6) a range of "avatar" types that serve as embodiments of AI agents, with the option for user avatar customization; and 7) support for human interactions with VR devices. TDW also provides a rich API enabling multiple agents to interact within a simulation and return a range of sensor and physics data representing the state of the world. We present initial experiments enabled by the platform around emerging research directions in computer vision, machine learning, and cognitive science, including multi-modal physical scene understanding, multi-agent interactions, models that "learn like a child", and attention studies in humans and neural networks. The simulation platform will be made publicly available.

5. AI Assisted Apparel Design [PDF] Back to Contents
  Alpana Dubey, Nitish Bhardwaj, Kumar Abhinav, Suma Mani Kuriakose, Sakshi Jain, Veenu Arora
Abstract: Fashion is a fast-changing industry where designs are refreshed at large scale every season. Moreover, it faces the huge challenge of unsold inventory, as not all designs appeal to customers. This puts designers under significant pressure. Firstly, they need to create innumerable fresh designs. Secondly, they need to create designs that appeal to customers. Although we see advancements in approaches that help designers analyze consumers, such insights are often too numerous, and creating all possible designs from them is time-consuming. In this paper, we propose a system of AI assistants that supports designers throughout their design journey. The proposed system assists designers in analyzing different selling/trending attributes of apparel. We propose two design generation assistants, namely Apparel-Style-Merge and Apparel-Style-Transfer. Apparel-Style-Merge generates new designs by combining high-level components of apparel, whereas Apparel-Style-Transfer generates multiple customizations of apparel by applying different styles, colors and patterns. We compose a new dataset, named DeepAttributeStyle, with fine-grained annotation of landmarks of different apparel components such as the neck and sleeves. The proposed system is evaluated on a user group consisting of people with and without a design background. Our evaluation demonstrates that our approach generates high-quality designs that can be easily used in fabrication. Moreover, the suggested designs aid the designers' creativity.

6. Real-time Embedded Person Detection and Tracking for Shopping Behaviour Analysis [PDF] Back to Contents
  Robin Schrijvers, Steven Puttemans, Timothy Callemein, Toon Goedemé
Abstract: Shopping behaviour analysis through counting and tracking of people in shop-like environments offers valuable information for store operators and provides key insights into the store layout (e.g. frequently visited spots). Instead of using extra staff for this, automated on-premise solutions are preferred. These automated systems should be cost-effective, preferably run on lightweight embedded hardware, work in very challenging situations (e.g. handling occlusions), and preferably work in real time. We solve this challenge by implementing a real-time TensorRT-optimized YOLOv3-based pedestrian detector on a Jetson TX2 hardware platform. By combining the detector with a sparse optical flow tracker, we assign a unique ID to each customer and tackle the problem of losing partially occluded customers. Our detector-tracker based solution achieves an average precision of 81.59% at a processing speed of 10 FPS. Besides valuable statistics, heat maps of frequently visited spots are extracted and used as an overlay on the video stream.
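
The detector-tracker pairing described above can be illustrated with a short sketch. This is only a minimal interpretation, assuming OpenCV's pyramidal Lucas-Kanade tracker; `detect_pedestrians` is a hypothetical stand-in for the TensorRT-optimized YOLOv3 detector, and the nearest-neighbour ID assignment is an illustrative heuristic rather than the authors' exact association logic.

```python
# Sketch: propagate known customer positions with sparse optical flow, then
# re-associate fresh detections so partially occluded customers keep their ID.
import cv2
import numpy as np

def count_customers(frames, detect_pedestrians, max_dist=50.0):
    tracks = {}                     # id -> (x, y) last known centre
    next_id = 0
    prev_gray = None
    for frame in frames:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev_gray is not None and tracks:
            # Propagate previous centres with pyramidal Lucas-Kanade flow.
            ids = list(tracks.keys())
            p0 = np.float32([tracks[i] for i in ids]).reshape(-1, 1, 2)
            p1, st, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
            tracks = {i: tuple(p1[k, 0]) for k, i in enumerate(ids) if st[k]}
        # Re-associate new detections with propagated tracks by distance.
        for (x, y, w, h) in detect_pedestrians(frame):
            c = (x + w / 2.0, y + h / 2.0)
            best = min(tracks, default=None,
                       key=lambda i: np.hypot(tracks[i][0] - c[0],
                                              tracks[i][1] - c[1]))
            if best is not None and np.hypot(tracks[best][0] - c[0],
                                             tracks[best][1] - c[1]) < max_dist:
                tracks[best] = c    # same customer, position updated
            else:
                tracks[next_id] = c # new customer enters the scene
                next_id += 1
        prev_gray = gray
    return next_id                  # number of unique IDs observed
```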

7. The Phong Surface: Efficient 3D Model Fitting using Lifted Optimization [PDF] Back to Contents
  Jingjing Shen, Thomas J. Cashman, Qi Ye, Tim Hutton, Toby Sharp, Federica Bogo, Andrew William Fitzgibbon, Jamie Shotton
Abstract: Realtime perceptual and interaction capabilities in mixed reality require a range of 3D tracking problems to be solved at low latency on resource-constrained hardware such as head-mounted devices. Indeed, for devices such as HoloLens 2 where the CPU and GPU are left available for applications, multiple tracking subsystems are required to run on a continuous, real-time basis while sharing a single Digital Signal Processor. To solve model-fitting problems for HoloLens 2 hand tracking, where the computational budget is approximately 100 times smaller than an iPhone 7, we introduce a new surface model: the `Phong surface'. Using ideas from computer graphics, the Phong surface describes the same 3D shape as a triangulated mesh model, but with continuous surface normals which enable the use of lifting-based optimization, providing significant efficiency gains over ICP-based methods. We show that Phong surfaces retain the convergence benefits of smoother surface models, while triangle meshes do not.

8. Anyone here? Smart embedded low-resolution omnidirectional video sensor to measure room occupancy [PDF] Back to Contents
  Timothy Callemein, Kristof Van Beeck, Toon Goedemé
Abstract: In this paper, we present a room occupancy sensing solution with unique properties: (i) It is based on an omnidirectional vision camera, capturing rich scene info over a wide angle, enabling it to count the number of people in a room and even determine their positions. (ii) Although it uses a camera input, no privacy issues arise because of its extremely low image resolution, which renders people unrecognisable. (iii) The neural network inference runs entirely on a low-cost processing platform embedded in the sensor, reducing the privacy risk even further. (iv) Limited manual data annotation is needed, because of the self-training scheme we propose. Such a smart room occupancy rate sensor can be used in e.g. meeting rooms and flex-desks. Indeed, by encouraging flex-desking, the required office space can be reduced significantly. In some cases, however, a flex-desk that has been reserved remains unoccupied without an update in the reservation system. A similar problem occurs with meeting rooms, which are often under-occupied. By optimising the occupancy rate, a huge reduction in costs can be achieved. Therefore, in this paper, we develop such a system, which determines the number of people present in office flex-desks and meeting rooms. Using an omnidirectional camera mounted in the ceiling, combined with a person detector, the company can intelligently update the reservation system based on the measured occupancy. Next to the optimisation and embedded implementation of such a self-training omnidirectional people detection algorithm, in this work we propose a novel approach that combines spatial and temporal image data, improving the performance of our system on extremely low-resolution images.

9. Single architecture and multiple task deep neural network for altered fingerprint analysis [PDF] Back to Contents
  Oliver Giudice, Mattia Litrico, Sebastiano Battiato
Abstract: Fingerprints are one of the most copious pieces of evidence at a crime scene and, for this reason, are frequently used by law enforcement for identification of individuals. But fingerprints can be altered. "Altered fingerprints" refers to intentional damage of the friction ridge pattern, often inflicted by smart criminals in the hope of evading law enforcement. We use a deep neural network approach, training an Inception-v3 architecture. This paper proposes a method for detection of altered fingerprints, identification of the types of alteration, and recognition of gender, hand and fingers. We also produce activation maps that show which part of a fingerprint the neural network has focused on, in order to detect where alterations are positioned. The proposed approach achieves an accuracy of 98.21%, 98.46%, 92.52%, 97.53% and 92.18% for the classification of fakeness, alterations, gender, hand and fingers, respectively, on the SO.CO.FING. dataset.

10. Patient-Specific Domain Adaptation for Fast Optical Flow Based on Teacher-Student Knowledge Transfer [PDF] Back to Contents
  Sontje Ihler, Max-Heinrich Laves, Tobias Ortmaier
Abstract: Fast motion feedback is crucial in computer-aided surgery (CAS) on moving tissue. Image assistance in safety-critical vision applications requires dense tracking of tissue motion. This can be done using optical flow (OF). Accurate motion predictions at high processing rates lead to higher patient safety. Current deep learning OF models show the common speed vs. accuracy trade-off. To achieve high accuracy at high processing rates, we propose patient-specific fine-tuning of a fast model. This minimizes the domain gap between training and application data, while reducing the target domain to the capability of the lower-complexity, fast model. We propose to obtain training sequences pre-operatively in the operating room. We handle the missing ground truth by employing teacher-student learning. Using flow estimations from the teacher model FlowNet2, we specialize a fast student model, FlowNet2S, on the patient-specific domain. Evaluation is performed on sequences from the Hamlyn dataset. Our student model shows very good performance after fine-tuning. Tracking accuracy is comparable to the teacher model at a speed-up factor of six. Fine-tuning can be performed within minutes, making it feasible for the operating room. Our method allows the use of a real-time capable model that was previously not suited for this task, laying the path for improved patient-specific motion estimation in CAS.
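
A minimal sketch of this teacher-student transfer, assuming `teacher` and `student` are callable flow networks (FlowNet2 and FlowNet2S in the paper) and `loader` yields consecutive frame pairs from the pre-operative recordings; the loss is a standard end-point error against the teacher's pseudo ground truth, and all hyperparameters are illustrative.

```python
# Sketch: the slow teacher provides pseudo ground-truth flow on
# patient-specific sequences; the fast student is fine-tuned against it.
import torch

def fine_tune_student(student, teacher, loader, epochs=5, lr=1e-5):
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for img1, img2 in loader:             # consecutive endoscopic frames
            with torch.no_grad():
                target = teacher(img1, img2)  # pseudo ground-truth flow (B,2,H,W)
            pred = student(img1, img2)
            # End-point error: mean Euclidean distance between flow vectors.
            loss = torch.norm(pred - target, p=2, dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```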

11. Uncertainty Quantification in Deep Residual Neural Networks [PDF] Back to Contents
  Lukasz Wandzik, Raul Vicente Garcia, Jörg Krüger
Abstract: Uncertainty quantification is an important and challenging problem in deep learning. Previous methods rely on dropout layers which are not present in modern deep architectures or batch normalization which is sensitive to batch sizes. In this work, we address the problem of uncertainty quantification in deep residual networks by using a regularization technique called stochastic depth. We show that training residual networks using stochastic depth can be interpreted as a variational approximation to the intractable posterior over the weights in Bayesian neural networks. We demonstrate that by sampling from a distribution of residual networks with varying depth and shared weights, meaningful uncertainty estimates can be obtained. Moreover, compared to the original formulation of residual networks, our method produces well-calibrated softmax probabilities with only minor changes to the network's structure. We evaluate our approach on popular computer vision datasets and measure the quality of uncertainty estimates. We also test the robustness to domain shift and show that our method is able to express higher predictive uncertainty on out-of-distribution samples. Finally, we demonstrate how the proposed approach could be used to obtain uncertainty estimates in facial verification applications.
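
The idea can be sketched as follows: wrap each residual block so that it is skipped at random, keep that sampling active at test time, and aggregate several stochastic forward passes into a predictive mean and variance. This is only a minimal sketch of the mechanism the abstract describes; the survival probability and sample count are illustrative choices.

```python
# Sketch: stochastic depth as an implicit ensemble over network depths,
# sampled at test time to obtain uncertainty estimates (akin to MC dropout).
import torch
import torch.nn as nn

class StochasticBlock(nn.Module):
    def __init__(self, block, survival_p=0.8):
        super().__init__()
        self.block, self.p = block, survival_p

    def forward(self, x):
        if torch.rand(1).item() < self.p:
            return x + self.block(x)   # block survives: ordinary residual step
        return x                       # block dropped: identity shortcut only

def predict_with_uncertainty(net, x, samples=20):
    # Keep random block-dropping active and average over stochastic passes.
    probs = torch.stack([net(x).softmax(dim=-1) for _ in range(samples)])
    return probs.mean(0), probs.var(0)  # predictive mean and variance
```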

12. Cross-Modal Weighting Network for RGB-D Salient Object Detection [PDF] Back to Contents
  Gongyang Li, Zhi Liu, Linwei Ye, Yang Wang, Haibin Ling
Abstract: Depth maps contain geometric clues for assisting Salient Object Detection (SOD). In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD. Specifically, three RGB-depth interaction modules, named CMW-L, CMW-M and CMW-H, are developed to deal with low-, middle- and high-level cross-modal information fusion, respectively. These modules use Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) to allow rich cross-modal and cross-scale interactions among feature layers generated by different network blocks. To effectively train the proposed Cross-Modal Weighting Network (CMWNet), we design a composite loss function that summarizes the errors between intermediate predictions and ground truth over different scales. With all these novel components working together, CMWNet effectively fuses information from RGB and depth channels, and meanwhile explores object localization and details across scales. Thorough evaluations demonstrate that CMWNet consistently outperforms 15 state-of-the-art RGB-D SOD methods on seven popular benchmarks.
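
The abstract does not spell out the exact form of the DW and RW operations, so the following is only an assumed sketch of a depth-to-RGB weighting: depth features produce a spatial gate that re-weights the RGB features in residual form.

```python
# Assumed sketch of depth-to-RGB weighting: a sigmoid gate computed from the
# depth features modulates the RGB features (residual form preserves them).
import torch.nn as nn

class DepthToRGBWeighting(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                  nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        # Keep the original RGB signal where the depth-derived gate is weak.
        return rgb_feat * (1 + self.gate(depth_feat))
```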

13. PIE-NET: Parametric Inference of Point Cloud Edges [PDF] Back to Contents
  Xiaogang Wang, Yuelang Xu, Kai Xu, Andrea Tagliasacchi, Bin Zhou, Ali Mahdavi-Amiri, Hao Zhang
Abstract: We introduce an end-to-end learnable technique to robustly identify feature edges in 3D point cloud data. We represent these edges as a collection of parametric curves (i.e.,lines, circles, and B-splines). Accordingly, our deep neural network, coined PIE-NET, is trained for parametric inference of edges. The network relies on a "region proposal" architecture, where a first module proposes an over-complete collection of edge and corner points, and a second module ranks each proposal to decide whether it should be considered. We train and evaluate our method on the ABC dataset, a large dataset of CAD models, and compare our results to those produced by traditional (non-learning) processing pipelines, as well as a recent deep learning based edge detector (EC-NET). Our results significantly improve over the state-of-the-art from both a quantitative and qualitative standpoint.

14. A Deep Joint Sparse Non-negative Matrix Factorization Framework for Identifying the Common and Subject-specific Functional Units of Tongue Motion During Speech [PDF] Back to Contents
  Jonghye Woo, Fangxu Xing, Jerry L. Prince, Maureen Stone, Arnold Gomez, Timothy G. Reese, Van J. Wedeen, Georges El Fakhri
Abstract: Intelligible speech is produced by creating varying internal local muscle groupings---i.e., functional units---that are generated in a systematic and coordinated manner. There are two major challenges in characterizing and analyzing functional units. First, due to the complex and convoluted nature of tongue structure and function, it is of great importance to develop a method that can accurately decode complex muscle coordination patterns during speech. Second, it is challenging to keep identified functional units across subjects comparable due to their substantial variability. In this work, to address these challenges, we develop a new deep learning framework to identify common and subject-specific functional units of tongue motion during speech. Our framework hinges on joint deep graph-regularized sparse non-negative matrix factorization (NMF) using motion quantities derived from displacements by tagged Magnetic Resonance Imaging. More specifically, we transform NMF with sparse and manifold regularizations into modular architectures akin to deep neural networks by means of unfolding the Iterative Shrinkage-Thresholding Algorithm to learn interpretable building blocks and associated weighting map. We then apply spectral clustering to common and subject-specific functional units. Experiments carried out with simulated datasets show that the proposed method surpasses the comparison methods. Experiments carried out with in vivo tongue motion datasets show that the proposed method can determine the common and subject-specific functional units with increased interpretability and decreased size variability.
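
The unfolding idea can be sketched as follows: each "layer" of the network corresponds to one iterative shrinkage-thresholding (ISTA) step for the sparse, non-negative coefficients H in X ≈ WH. In the paper the building blocks and weighting map are learned end to end; here the step size and sparsity threshold are fixed for illustration.

```python
# Sketch: unrolled ISTA for sparse non-negative coding, one iteration per
# "layer" (minimizing 0.5 * ||X - W H||^2 + lam * ||H||_1 with H >= 0).
import numpy as np

def unfolded_ista(X, W, n_layers=10, lam=0.1):
    step = 1.0 / np.linalg.norm(W, 2) ** 2        # 1/L, L = ||W||_2^2
    H = np.zeros((W.shape[1], X.shape[1]))
    for _ in range(n_layers):                     # one "layer" per iteration
        grad = W.T @ (W @ H - X)                  # gradient of the data term
        H = np.maximum(H - step * (grad + lam), 0.0)  # non-negative soft-threshold
    return H
```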

15. Statistical shape analysis of brain arterial networks (BAN) [PDF] Back to Contents
  Xiaoyang Guo, Aditi Basu Bal, Tom Needham, Anuj Srivastava
Abstract: Structures of brain arterial networks (BANs), which are complex arrangements of individual arteries, their branching patterns, and inter-connectivities, play an important role in characterizing and understanding brain physiology. One would like tools for statistically analyzing the shapes of BANs, i.e. to quantify shape differences, compare populations of subjects, and study the effects of covariates on these shapes. This paper mathematically represents and statistically analyzes BAN shapes as elastic shape graphs. Each elastic shape graph is made up of nodes connected by edges, which are 3D curves with arbitrary shapes. We develop a mathematical representation, a Riemannian metric and other geometrical tools, such as computations of geodesics, means and covariances, and PCA for analyzing elastic graphs and BANs. This analysis is applied to BANs after separating them into four components: top, bottom, left, and right. This framework is then used to generate shape summaries of BANs from 92 subjects, and to study the effects of age and gender on the shapes of BAN components. We conclude that while gender effects require further investigation, age has a clear, quantifiable effect on BAN shapes. Specifically, we find an increased variance in BAN shapes as age increases.

16. Generalized Many-Way Few-Shot Video Classification [PDF] Back to Contents
  Yongqin Xian, Bruno Korbar, Matthijs Douze, Bernt Schiele, Zeynep Akata, Lorenzo Torresani
Abstract: Few-shot learning methods operate in low data regimes. The aim is to learn with few training examples per class. Although significant progress has been made in few-shot image classification, few-shot video recognition is relatively unexplored and methods based on 2D CNNs are unable to learn temporal information. In this work we thus develop a simple 3D CNN baseline, surpassing existing methods by a large margin. To circumvent the need of labeled examples, we propose to leverage weakly-labeled videos from a large dataset using tag retrieval followed by selecting the best clips with visual similarities, yielding further improvement. Our results saturate current 5-way benchmarks for few-shot video classification and therefore we propose a new challenging benchmark involving more classes and a mixture of classes with varying supervision.

17. Animated GIF optimization by adaptive color local table management [PDF] Back to Contents
  Oliver Giudice, Dario Allegra, Francesco Guarnera, Filippo Stanco, Sebastiano Battiato
Abstract: After thirty years, the GIF file format is today more popular than ever: it is a great means of communication for friends and communities on instant messengers and social networks. Despite this popularity, the original compression method used to encode GIF images has not changed a bit. On the other hand, popularity means that storage savings become an issue for hosting platforms. In this paper, a parametric optimization technique for animated GIFs is presented. The proposed technique is based on Local Color Table selection and color remapping in order to create optimized animated GIFs while preserving the original format. The technique achieves good results in terms of byte reduction with limited or no loss of perceived color quality. Tests carried out on 1000 GIF files demonstrate the effectiveness of the proposed optimization strategy.
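
A minimal sketch of per-frame local color table optimization with Pillow is shown below; quantizing each frame to its own adaptive palette keeps the GIF format intact. The 64-color budget is an arbitrary illustrative choice, not the paper's parametric selection strategy.

```python
# Sketch: re-encode an animated GIF with an adaptive local color table per
# frame, preserving the original format and frame timing.
from PIL import Image, ImageSequence

def optimize_gif(src_path, dst_path, colors=64):
    src = Image.open(src_path)
    frames = [frame.convert("RGB").quantize(colors=colors)  # local palette
              for frame in ImageSequence.Iterator(src)]
    frames[0].save(dst_path, save_all=True, append_images=frames[1:],
                   duration=src.info.get("duration", 100), loop=0)
```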

18. Pollen13K: A Large Scale Microscope Pollen Grain Image Dataset [PDF] Back to Contents
  Sebastiano Battiato, Alessandro Ortis, Francesca Trenta, Lorenzo Ascari, Mara Politi, Consolata Siniscalco
Abstract: Pollen grain classification has a remarkable role in many fields, from medicine to biology and agronomy. Indeed, automatic pollen grain classification is an important task for all related applications and areas. This work presents the first large-scale pollen grain image dataset, including more than 13 thousand objects. After an introduction to the problem of pollen grain classification and its motivations, the paper focuses on the employed data acquisition steps, which include aerobiological sampling, microscope image acquisition, object detection, segmentation and labelling. Furthermore, a baseline experimental assessment for the task of pollen classification on the built dataset is presented, together with a discussion of the achieved results.

19. Not only Look, but also Listen: Learning Multimodal Violence Detection under Weak Supervision [PDF] Back to Contents
  Peng Wu, Jing Liu, Yujia Shi, Yujia Sun, Fangtao Shao, Zhaoyang Wu, Zhiwei Yang
Abstract: Violence detection has been studied in computer vision for years. However, previous work is either superficial, e.g., classification of short clips in a single scenario, or undersupplied, e.g., a single modality, or multimodality based on hand-crafted features. To address this problem, in this work we first release a large-scale and multi-scene dataset named XD-Violence with a total duration of 217 hours, containing 4754 untrimmed videos with audio signals and weak labels. Then we propose a neural network containing three parallel branches to capture different relations among video snippets and integrate features, where the holistic branch captures long-range dependencies using a similarity prior, the localized branch captures local positional relations using a proximity prior, and the score branch dynamically captures the closeness of predicted scores. Besides, our method also includes an approximator to meet the needs of online detection. Our method outperforms other state-of-the-art methods on our released dataset and other existing benchmarks. Moreover, extensive experimental results also show the positive effect of multimodal input and of modeling relationships. The code and dataset will be released at this https URL.

20. How low can you go? Privacy-preserving people detection with an omni-directional camera [PDF] Back to Contents
  Timothy Callemein, Kristof Van Beeck, Toon Goedemé
Abstract: In this work, we use a ceiling-mounted omni-directional camera to detect people in a room. This can be used as a sensor to measure the occupancy of meeting rooms and count the number of available flex-desk working spaces. If these devices can be integrated in an embedded low-power sensor, they would form an ideal extension of automated room reservation systems in office environments. The main challenge we target here is ensuring the privacy of the people filmed. The approach we propose goes to extremely low image resolutions, such that it is impossible to recognise people or read potentially confidential documents. Therefore, we retrained a single-shot low-resolution person detection network with automatically generated ground truth. In this paper, we prove the functionality of this approach and explore how low we can go in resolution, to determine the optimal trade-off between recognition accuracy and privacy preservation. Because of the low resolution, the result is a lightweight network that can potentially be deployed on embedded hardware. Such an embedded implementation enables the development of a decentralised smart camera which only outputs the required meta-data (i.e. the number of persons in the meeting room).

21. Automated analysis of eye-tracker-based human-human interaction studies [PDF] Back to Contents
  Timothy Callemein, Kristof Van Beeck, Geert Brône, Toon Goedemé
Abstract: Mobile eye-tracking systems have been available for about a decade now and are becoming increasingly popular in different fields of application, including marketing, sociology, usability studies and linguistics. While the user-friendliness and ergonomics of the hardware are developing at a rapid pace, the software for the analysis of mobile eye-tracking data still lacks robustness and functionality in some respects. In this paper, we investigate which state-of-the-art computer vision algorithms may be used to automate the post-analysis of mobile eye-tracking data. For the case study in this paper, we focus on mobile eye-tracker recordings made during human-human face-to-face interactions. We compared two recent publicly available frameworks (YOLOv2 and OpenPose) for relating the gaze location generated by the eye-tracker to the head and hands visible in the scene camera data. In this paper we show that the use of this single-pipeline framework provides robust results, which are both more accurate and faster than previous work in the field. Moreover, our approach does not rely on manual interventions during this process.

22. Building Robust Industrial Applicable Object Detection Models Using Transfer Learning and Single Pass Deep Learning Architectures [PDF] Back to Contents
  Steven Puttemans, Timothy Callemein, Toon Goedemé
Abstract: The rising trend of deep learning in computer vision and artificial intelligence can simply not be ignored. On the most diverse tasks, from recognition and detection to segmentation, deep learning is able to obtain state-of-the-art results, reaching top-notch performance. In this paper, we explore how deep convolutional neural networks dedicated to the task of object detection can improve our industrial-oriented object detection pipelines, using state-of-the-art open-source deep learning frameworks like Darknet. By using a deep learning architecture that integrates region proposals, classification and probability estimation in a single run, we aim at obtaining real-time performance. We focus on reducing the needed amount of training data drastically by exploring transfer learning, while still maintaining a high average precision. Furthermore, we apply these algorithms to two industrially relevant applications, one being the detection of promotion boards in eye-tracking data, and the other the detection and recognition of packages of warehouse products for augmented advertisements.

23. The autonomous hidden camera crew [PDF] Back to Contents
  Timothy Callemein, Wiebe Van Ranst, Toon Goedemé
Abstract: Reality TV shows that follow people in their day-to-day lives are not a new concept. However, the traditional methods used in the industry require a lot of manual labour and need the presence of at least one physical cameraman. Because of this, the subjects tend to behave differently when they are aware of being recorded. This paper presents an approach to following people in their day-to-day lives, for long periods of time (months to years), while being as unobtrusive as possible. To do this, we use unmanned cinematographically-aware cameras hidden in people's houses. Our contribution in this paper is twofold: First, we create a system to limit the amount of recorded data by intelligently controlling a video switch matrix, in combination with a multi-channel recorder. Second, we create a virtual cameraman by controlling a PTZ camera to automatically make cinematographically pleasing shots. Throughout this paper, we worked closely with a real camera crew. This enabled us to compare the results of our system to the work of trained professionals.

24. Inertial Measurements for Motion Compensation in Weight-bearing Cone-beam CT of the Knee [PDF] Back to Contents
  Jennifer Maier, Marlies Nitschke, Jang-Hwan Choi, Garry Gold, Rebecca Fahrig, Bjoern M. Eskofier, Andreas Maier
Abstract: Involuntary motion during weight-bearing cone-beam computed tomography (CT) scans of the knee causes artifacts in the reconstructed volumes making them unusable for clinical diagnosis. Currently, image-based or marker-based methods are applied to correct for this motion, but often require long execution or preparation times. We propose to attach an inertial measurement unit (IMU) containing an accelerometer and a gyroscope to the leg of the subject in order to measure the motion during the scan and correct for it. To validate this approach, we present a simulation study using real motion measured with an optical 3D tracking system. With this motion, an XCAT numerical knee phantom is non-rigidly deformed during a simulated CT scan creating motion corrupted projections. A biomechanical model is animated with the same tracked motion in order to generate measurements of an IMU placed below the knee. In our proposed multi-stage algorithm, these signals are transformed to the global coordinate system of the CT scan and applied for motion compensation during reconstruction. Our proposed approach can effectively reduce motion artifacts in the reconstructed volumes. Compared to the motion corrupted case, the average structural similarity index and root mean squared error with respect to the no-motion case improved by 13-21% and 68-70%, respectively. These results are qualitatively and quantitatively on par with a state-of-the-art marker-based method we compared our approach to. The presented study shows the feasibility of this novel approach, and yields promising results towards a purely IMU-based motion compensation in C-arm CT.

25. Maximum Entropy Regularization and Chinese Text Recognition [PDF] Back to Contents
  Changxu Cheng, Wuheng Xu, Xiang Bai, Bin Feng, Wenyu Liu
Abstract: Chinese text recognition is more challenging than Latin text due to the large number of fine-grained Chinese characters and the great imbalance over classes, which causes a serious overfitting problem. We propose to apply Maximum Entropy Regularization to regularize the training process, which simply adds a negative entropy term to the canonical cross-entropy loss without any additional parameters or modification of the model. We theoretically give the convergence probability distribution and analyze how the regularization influences the learning process. Experiments on Chinese character recognition, Chinese text line recognition and fine-grained image classification achieve consistent improvement, proving that the regularization is beneficial to the generalization and robustness of a recognition model.
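
The regularizer amounts to a one-line change of the loss: add a scaled negative-entropy term to the cross-entropy, penalizing over-confident predictions. A minimal PyTorch sketch, with `lam` as an illustrative weight:

```python
# Sketch of Maximum Entropy Regularization: cross-entropy plus a negative
# entropy term (i.e. minus lam * H(p)), discouraging over-confident outputs.
import torch.nn.functional as F

def me_regularized_loss(logits, targets, lam=0.1):
    ce = F.cross_entropy(logits, targets)
    p = logits.softmax(dim=-1)
    entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - lam * entropy   # adding negative entropy = subtracting entropy
```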

26. JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image [PDF] Back to Contents
  Linpu Fang, Xingyan Liu, Li Liu, Hang Xu, Wenxiong Kang
Abstract: State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions, including voxel-to-voxel predictions, point-to-point regression, and pixel-wise estimations. Despite the good performance, those methods have a few issues in nature, such as the poor trade-off between accuracy and efficiency, and plain feature representation learning with local convolutions. In this paper, a novel pixel-wise prediction-based method is proposed to address the above issues. The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training. Specifically, we first propose a graph convolutional network (GCN) based joint graph reasoning module to model the complex dependencies among joints and augment the representation capability of each pixel. Then we densely estimate all pixels' offsets to joints in both image plane and depth space and calculate the joints' positions by a weighted average over all pixels' predictions, totally discarding the complex postprocessing operations. The proposed model is implemented with an efficient 2D fully convolutional network (FCN) backbone and has only about 1.4M parameters. Extensive experiments on multiple 3D hand pose estimation benchmarks demonstrate that the proposed method achieves new state-of-the-art accuracy while running very efficiently with around a speed of 110fps on a single NVIDIA 1080Ti GPU.
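
The pixel-to-offset decoding can be sketched as a weighted vote: every pixel proposes a joint position (its own coordinate plus a predicted offset), and proposals are combined by a confidence-weighted average instead of any argmax-style post-processing. A minimal 2D sketch under assumed tensor shapes; the actual JGR-P2O additionally handles the depth coordinate and the graph reasoning module:

```python
# Sketch: decode joint positions as the weighted average of per-pixel votes.
import torch

def decode_joints(offsets, weights):
    # offsets: (J, 2, H, W) predicted 2D offset from each pixel to each joint
    # weights: (J, H, W)    per-pixel confidence for each joint
    J, _, H, W = offsets.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    grid = torch.stack([xs, ys])               # (2, H, W) pixel coordinates
    votes = grid.unsqueeze(0) + offsets        # (J, 2, H, W) voted positions
    w = weights.unsqueeze(1)                   # (J, 1, H, W)
    return (votes * w).sum(dim=(2, 3)) / w.sum(dim=(2, 3)).clamp_min(1e-8)
```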

27. ESA-ReID: Entropy-Based Semantic Feature Alignment for Person re-ID [PDF] Back to Contents
  Chaoping Tu, Yin Zhao, Longjun Cai
Abstract: Person re-identification (re-ID) is a challenging task in the real world. Besides the typical application in surveillance systems, re-ID also has significant value for improving the recall rate of people identification in content video (TV or movies). However, occlusion, shot angle variations and complicated backgrounds keep it far from practical application, especially in content video. In this paper, we propose an entropy-based semantic feature alignment model, which takes advantage of the detailed information in human semantic features. Considering the uncertainty of semantic segmentation, we introduce a semantic alignment with an entropy-based mask which can reduce the negative effects of mask segmentation errors. We construct a new re-ID dataset based on content videos, with many cases of occlusion and missing body parts, which will be released in the future. Extensive studies on both existing datasets and the new dataset demonstrate the superior performance of the proposed model.
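
A sketch of how an entropy-based mask might be formed and applied is given below; the normalized-entropy weighting is an assumption for illustration, since the abstract does not specify the exact form.

```python
# Sketch: down-weight pixels where the semantic segmentation is uncertain
# (high entropy) before pooling part features for re-ID.
import torch

def entropy_mask(seg_probs):
    # seg_probs: (C, H, W) per-pixel class probabilities from the segmenter
    C = seg_probs.shape[0]
    h = -(seg_probs * seg_probs.clamp_min(1e-12).log()).sum(dim=0)
    return 1.0 - h / torch.log(torch.tensor(float(C)))  # 1 = certain pixel

def masked_part_feature(feat, part_prob, mask):
    # feat: (D, H, W) feature map; part_prob: (H, W) probability of one part
    w = part_prob * mask                     # entropy-weighted part region
    return (feat * w).sum(dim=(1, 2)) / w.sum().clamp_min(1e-8)
```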

28. Attention Neural Network for Trash Detection on Water Channels [PDF] Back to Contents
  Mohbat Tharani, Abdul Wahab Amin, Mohammad Maaz, Murtaza Taj
Abstract: Rivers and canals flowing through cities are often used illegally for dumping trash. This contaminates freshwater channels and causes blockages in sewerage, resulting in urban flooding. When this contaminated water reaches agricultural fields, it results in degradation of soil and poses critical environmental as well as economic threats. The dumped trash is often found floating on the water surface. The trash can be disfigured, partially submerged, decomposed into smaller pieces, or clumped together with other objects, which obscures its shape and creates a challenging detection problem. This paper proposes a method for the detection of visible trash floating on the water surface of canals in urban areas. We also provide a large dataset, the first of its kind, of trash in water channels that contains object-level annotations. A novel attention layer is proposed that improves the detection of smaller objects. Towards the end of this paper, we provide a detailed comparison of our method with state-of-the-art object detectors and show that our method significantly improves the detection of smaller objects. The dataset will be made publicly available.

29. Monocular Vision based Crowdsourced 3D Traffic Sign Positioning with Unknown Camera Intrinsics and Distortion Coefficients [PDF] Back to Contents
  Hemang Chawla, Matti Jukola, Elahe Arani, Bahram Zonooz
Abstract: Autonomous vehicles and driver assistance systems utilize maps of 3D semantic landmarks for improved decision making. However, scaling the mapping process as well as regularly updating such maps come with a huge cost. Crowdsourced mapping of these landmarks such as traffic sign positions provides an appealing alternative. The state-of-the-art approaches to crowdsourced mapping use ground truth camera parameters, which may not always be known or may change over time. In this work, we demonstrate an approach to computing 3D traffic sign positions without knowing the camera focal lengths, principal point, and distortion coefficients a priori. We validate our proposed approach on a public dataset of traffic signs in KITTI. Using only a monocular color camera and GPS, we achieve an average single journey relative and absolute positioning accuracy of 0.26 m and 1.38 m, respectively.

30. VisImages: A Large-scale, High-quality Image Corpus in Visualization Publications [PDF] Back to Contents
  Dazhen Deng, Yihong Wu, Xinhuan Shu, Mengye Xu, Jiang Wu, Siwei Fu, Yingcai Wu
Abstract: Images in visualization publications contain rich information, such as novel visual designs, model details, and experiment results. Constructing such an image corpus can contribute to the community in many aspects, including literature analysis from the perspective of visual representations, empirical studies on visual memorability, and machine learning research for chart detection. This study presents VisImages, a high-quality and large-scale image corpus collected from visualization publications. VisImages contain fruitful and diverse annotations for each image, including captions, types of visual representations, and bounding boxes. First, we algorithmically extract the images associated with captions and manually correct the errors. Second, to categorize visualizations in publications, we extend and iteratively refine the existing taxonomy through a multi-round pilot study. Third, guided by this taxonomy, we invite senior visualization practitioners to annotate visual representations that appear in each image. In this process, we borrow techniques such as "gold standards" and majority voting for quality control. Finally, we recruit the crowd to draw bounding boxes for visual representations in the images. The resulting corpus contains 35,096 annotated visualizations from 12,267 images with 12,057 captions in 1397 papers from VAST and InfoVis. We demonstrate the usefulness of VisImages through the following four use cases: 1) analysis of color usage in VAST and InfoVis papers across years, 2) discussion of the researcher preference on visualization types, 3) spatial distribution analysis of visualizations in visual analytic systems, and 4) training visualization detection models.
Summary: VisImages is a large-scale, high-quality image corpus collected from visualization publications, carrying rich and diverse annotations for each image: captions, types of visual representations, and bounding boxes. Images and their captions are extracted algorithmically and corrected manually; an existing taxonomy of visualizations is extended and iteratively refined through a multi-round pilot study; senior visualization practitioners annotate representation types with quality control via gold standards and majority voting; and crowd workers draw the bounding boxes. The corpus contains 35,096 annotated visualizations from 12,267 images with 12,057 captions in 1,397 VAST and InfoVis papers, and its usefulness is demonstrated through four use cases: color-usage analysis across years, researcher preferences for visualization types, spatial distribution of visualizations in visual analytics systems, and training visualization detection models.
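
Since the abstract names captions, visualization types, and bounding boxes but not a concrete schema, the following is a purely hypothetical JSON-style layout and loader for one annotated figure; all field names are illustrative assumptions, not the released VisImages format.

    # Purely hypothetical annotation layout for one figure; field names are
    # illustrative, not the released VisImages schema.
    record = {
        "paper_id": "infovis-0001",                  # illustrative identifier
        "caption": "Figure 3: Overview of the system.",
        "visualizations": [
            {"type": "bar chart", "bbox": [34, 80, 412, 260]},    # x, y, w, h
            {"type": "scatterplot", "bbox": [450, 80, 380, 260]},
        ],
    }

    def iter_boxes(rec):
        # Yield (type, bbox) pairs, e.g. for training a chart detector.
        for vis in rec["visualizations"]:
            yield vis["type"], vis["bbox"]

    for kind, box in iter_boxes(record):
        print(kind, box)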

31. Auxiliary Tasks Speed Up Learning PointGoal Navigation [PDF] [Back to contents]
  Joel Ye, Dhruv Batra, Erik Wijmans, Abhishek Das
Abstract: PointGoal Navigation is an embodied task that requires agents to navigate to a specified point in an unseen environment. Wijmans et al. showed that this task is solvable but their method is computationally prohibitive, requiring 2.5 billion frames and 180 GPU-days. In this work, we develop a method to significantly increase sample and time efficiency in learning PointNav using self-supervised auxiliary tasks (e.g. predicting the action taken between two egocentric observations, predicting the distance between two observations from a trajectory, etc.). We find that naively combining multiple auxiliary tasks improves sample efficiency, but only provides marginal gains beyond a point. To overcome this, we use attention to combine representations learnt from individual auxiliary tasks. Our best agent is 5x faster to reach the performance of the previous state-of-the-art, DD-PPO, at 40M frames, and improves on DD-PPO's performance at 40M frames by 0.16 SPL. Our code is publicly available at this https URL.
Summary: PointGoal navigation requires an agent to reach a specified point in an unseen environment; the prior state of the art is computationally prohibitive, needing 2.5 billion frames and 180 GPU-days. This work adds self-supervised auxiliary tasks, such as predicting the action taken between two egocentric observations or the distance between two observations on a trajectory, and, since naively combining auxiliary losses gives only marginal gains beyond a point, fuses the learned representations with attention. The best agent reaches DD-PPO's performance 5x faster and improves on DD-PPO at 40M frames by 0.16 SPL. Code is publicly available at this https URL.
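
To make the auxiliary-task idea concrete, below is a minimal PyTorch sketch (not the authors' code) of one such self-supervised task, inverse dynamics: predicting the action taken between two egocentric observation embeddings, with its loss added to the main RL objective under an assumed weight.

    # Minimal PyTorch sketch (not the authors' code): an inverse-dynamics
    # auxiliary head that predicts the action taken between two egocentric
    # observation embeddings; its loss is added to the main RL objective.
    import torch
    import torch.nn as nn

    class InverseDynamicsHead(nn.Module):
        def __init__(self, embed_dim=512, num_actions=4):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(2 * embed_dim, 256), nn.ReLU(),
                nn.Linear(256, num_actions))

        def forward(self, obs_t, obs_t1):
            # obs_t, obs_t1: (batch, embed_dim) observation embeddings
            return self.head(torch.cat([obs_t, obs_t1], dim=-1))

    aux_head = InverseDynamicsHead()
    obs_t, obs_t1 = torch.randn(8, 512), torch.randn(8, 512)
    actions = torch.randint(0, 4, (8,))
    aux_loss = nn.CrossEntropyLoss()(aux_head(obs_t, obs_t1), actions)
    # total_loss = rl_loss + aux_weight * aux_loss  (rl_loss from PPO etc.)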

32. Alleviating the Burden of Labeling: Sentence Generation by Attention Branch Encoder-Decoder Network [PDF] [Back to contents]
  Tadashi Ogura, Aly Magassouba, Komei Sugiura, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi, Hisashi Kawai
Abstract: Domestic service robots (DSRs) are a promising solution to the shortage of home care workers. However, one of the main limitations of DSRs is their inability to interact naturally through language. Recently, data-driven approaches have been shown to be effective for tackling this limitation; however, they often require large-scale datasets, which is costly. Based on this background, we aim to perform automatic sentence generation of fetching instructions: for example, "Bring me a green tea bottle on the table." This is particularly challenging because appropriate expressions depend on the target object, as well as its surroundings. In this paper, we propose the attention branch encoder--decoder network (ABEN), to generate sentences from visual inputs. Unlike other approaches, the ABEN has multimodal attention branches that use subword-level attention and generate sentences based on subword embeddings. In experiments, we compared the ABEN with a baseline method using four standard metrics in image captioning. Results show that the ABEN outperformed the baseline in terms of these metrics.
Summary: Domestic service robots (DSRs) are a promising answer to the shortage of home care workers, but they still cannot interact naturally through language, and data-driven approaches to this problem require costly large-scale datasets. This work automatically generates fetching instructions such as "Bring me a green tea bottle on the table," which is challenging because the appropriate expression depends on the target object and its surroundings. The proposed attention branch encoder-decoder network (ABEN) generates sentences from visual inputs using multimodal attention branches with subword-level attention and subword embeddings, and outperforms a baseline on four standard image-captioning metrics.

33. Long-Term Residual Blending Network for Blur Invariant Single Image Blind deblurring [PDF] [Back to contents]
  Sungkwon An, Hyungmin Roh, Myungjoo Kang
Abstract: We present a novel, blind, single image deblurring method that utilizes information regarding blur kernels. Our model solves the deblurring problem by dividing it into two successive tasks: (1) blur kernel estimation and (2) sharp image restoration. We first introduce a kernel estimation network that produces adaptive blur kernels based on the analysis of the blurred image. The network learns the blur pattern of the input image and trains to generate the estimation of image-specific blur kernels. Subsequently, we propose a long-term residual blending network that restores sharp images using the estimated blur kernel. To use the kernel efficiently, we propose a blending block that encodes features from both blurred images and blur kernels into a low dimensional space and then decodes them simultaneously to obtain an appropriately synthesized feature representation. We evaluate our model on REDS, GOPRO and Flickr2K datasets using various Gaussian blur kernels. Experiments show that our model can achieve excellent results on each dataset.
Summary: A blind single-image deblurring method that exploits blur kernel information by splitting the problem into two successive tasks: blur kernel estimation and sharp image restoration. A kernel estimation network analyzes the blurred input and produces image-specific adaptive blur kernels; a long-term residual blending network then restores the sharp image using the estimated kernel, with a blending block that encodes features from both the blurred image and the kernel into a low-dimensional space and decodes them simultaneously into a synthesized feature representation. The model achieves excellent results on the REDS, GOPRO, and Flickr2K datasets under various Gaussian blur kernels.

34. EPI-based Oriented Relation Networks for Light Field Depth Estimation [PDF] [Back to contents]
  Kunyuan Li, Jun Zhang, Rui Sun, Xudong Zhang, Jun Gao
Abstract: Light fields record not only the spatial information of observed scenes but also the directions of all incoming light rays. The spatial information and angular information implicitly contain geometrical characteristics such as multi-view geometry or epipolar geometry, which can be exploited to improve the performance of depth estimation. The Epipolar Plane Image (EPI), the unique 2D spatial-angular slice of the light field, contains patterns of oriented lines. The slope of these lines is associated with the disparity. Benefiting from this property of EPIs, some representative methods estimate depth maps by analyzing the disparity of each line in EPIs. However, these methods often extract the optimal slope of the lines from EPIs while ignoring the relationship between neighboring pixels, which leads to inaccurate depth map predictions. Based on the observation that the oriented lines and their neighboring pixels share a similar linear structure, we propose an end-to-end fully convolutional network (FCN) to estimate the depth value of the intersection point on the horizontal and vertical EPIs. Specifically, we present a new feature extraction module, called the Oriented Relation Module (ORM), that constructs the relationship between the line orientations. To facilitate training, we also propose a refocusing-based data augmentation method to obtain different slopes from EPIs of the same scene point. Extensive experiments verify the efficacy of learning relations and show that our approach is competitive with other state-of-the-art methods.
Summary: Epipolar plane images (EPIs), the 2D spatial-angular slices of a light field, contain oriented line patterns whose slopes encode disparity, but methods that only extract the optimal slope per line ignore relationships between neighboring pixels and produce inaccurate depth maps. Observing that oriented lines and their neighboring pixels share a similar linear structure, this work proposes an end-to-end fully convolutional network that estimates the depth of the intersection point on horizontal and vertical EPIs, built around an Oriented Relation Module (ORM) that models relationships between line orientations, plus a refocusing-based data augmentation that yields different slopes for the same scene point. Extensive experiments verify the efficacy of learning relations and show the approach is competitive with the state of the art.

35. Point Set Voting for Partial Point Cloud Analysis [PDF] [Back to contents]
  Junming Zhang, Weijia Chen, Yuping Wang, Ram Vasudevan, Matthew Johnson-Roberson
Abstract: The continual improvement of 3D sensors has driven the development of algorithms to perform point cloud analysis. In fact, techniques for point cloud classification and segmentation have in recent years achieved incredible performance driven in part by leveraging large synthetic datasets. Unfortunately these same state-of-the-art approaches perform poorly when applied to incomplete point clouds. This limitation of existing algorithms is particularly concerning since point clouds generated by 3D sensors in the real world are usually incomplete due to perspective view or occlusion by other objects. This paper proposes a general model for partial point clouds analysis wherein the latent feature encoding a complete point clouds is inferred by applying a local point set voting strategy. In particular, each local point set constructs a vote that corresponds to a distribution in the latent space, and the optimal latent feature is the one with the highest probability. This approach ensures that any subsequent point cloud analysis is robust to partial observation while simultaneously guaranteeing that the proposed model is able to output multiple possible results. This paper illustrates that this proposed method achieves state-of-the-art performance on shape classification, part segmentation and point cloud completion.
Summary: State-of-the-art point cloud classification and segmentation methods, often trained on large synthetic datasets, degrade badly on incomplete point clouds, even though real 3D-sensor data is usually incomplete due to viewpoint and occlusion by other objects. This paper proposes a general model for partial point cloud analysis in which the latent feature encoding the complete cloud is inferred by a local point set voting strategy: each local point set casts a vote corresponding to a distribution in latent space, and the optimal latent feature is the most probable one. The approach is robust to partial observation, can output multiple plausible results, and achieves state-of-the-art shape classification, part segmentation, and point cloud completion.
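
One common way to realize "the optimal latent feature is the one with the highest probability" is to fuse per-patch Gaussian votes as a product of Gaussians; the sketch below makes that assumption explicit and is not taken from the paper.

    # Assumed reading: each local point set votes a diagonal Gaussian over the
    # latent code of the complete shape; votes are fused as a product of
    # Gaussians (precision-weighted mean), i.e. the most probable latent.
    import numpy as np

    def fuse_votes(mus, logvars):
        # mus, logvars: (num_local_sets, latent_dim)
        precisions = np.exp(-logvars)              # 1 / sigma^2 per vote
        fused_var = 1.0 / precisions.sum(axis=0)
        fused_mu = fused_var * (precisions * mus).sum(axis=0)
        return fused_mu, fused_var

    mus = np.random.randn(16, 128)                 # votes from 16 local patches
    logvars = 0.1 * np.random.randn(16, 128)
    z, var = fuse_votes(mus, logvars)              # z feeds the downstream head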

36. Attention-based Residual Speech Portrait Model for Speech to Face Generation [PDF] [Back to contents]
  Jianrong Wang, Xiaosheng Hu, Li Liu, Wei Liu, Mei Yu, Tianyi Xu
Abstract: Given a speaker's speech, it is interesting to see if it is possible to generate this speaker's face. One main challenge in this task is to alleviate the natural mismatch between face and speech. To this end, in this paper, we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) by introducing the idea of the residual into a hybrid encoder-decoder architecture, where face prior features are merged with the output of the speech encoder to form the final face feature. In particular, we innovatively establish a tri-item loss function, which is a weighted linear combination of the L2-norm, L1-norm and negative cosine loss, to train our model by comparing the final face feature and the true face feature. Evaluation on the AVSpeech dataset shows that our proposed model accelerates the convergence of training, outperforms the state-of-the-art in terms of quality of the generated face, and achieves superior recognition accuracy of gender and age compared with the ground truth.
Summary: Given a speaker's voice, the Attention-based Residual Speech Portrait Model (AR-SPM) generates the speaker's face, mitigating the natural face-speech mismatch by merging face prior features with the speech encoder output in a hybrid encoder-decoder architecture with residual connections. Training uses a tri-item loss, a weighted linear combination of L2-norm, L1-norm, and negative cosine losses, comparing the generated and true face features. On the AVSpeech dataset, the model converges faster, outperforms the state of the art in face quality, and better matches ground-truth gender and age.
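
The tri-item loss is simple enough to state directly; a sketch follows, with the weights as placeholder hyperparameters (the paper's values are not given in the abstract).

    # Sketch of the tri-item loss: a weighted combination of L2, L1 and
    # negative cosine terms between generated and true face features.
    # The weights w2, w1, wc are placeholder hyperparameters.
    import torch
    import torch.nn.functional as F

    def tri_item_loss(pred, target, w2=1.0, w1=1.0, wc=1.0):
        l2 = F.mse_loss(pred, target)
        l1 = F.l1_loss(pred, target)
        neg_cos = -F.cosine_similarity(pred, target, dim=-1).mean()
        return w2 * l2 + w1 * l1 + wc * neg_cos

    loss = tri_item_loss(torch.randn(4, 256), torch.randn(4, 256))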

37. PointMask: Towards Interpretable and Bias-Resilient Point Cloud Processing [PDF] [Back to contents]
  Saeid Asgari Taghanaki, Kaveh Hassani, Pradeep Kumar Jayaraman, Amir Hosein Khasahmadi, Tonya Custis
Abstract: Deep classifiers tend to associate a few discriminative input variables with their objective function, which in turn may hurt their generalization capabilities. To address this, one can design systematic experiments and/or inspect the models via interpretability methods. In this paper, we investigate both of these strategies on deep models operating on point clouds. We propose PointMask, a model-agnostic interpretable information-bottleneck approach for attribution in point cloud models. PointMask encourages exploring the majority of variation factors in the input space while gradually converging to a general solution. More specifically, PointMask introduces a regularization term that minimizes the mutual information between the input and the latent features, which is used to mask out irrelevant variables. We show that coupling a PointMask layer with an arbitrary model can discern the points in the input space which contribute the most to the prediction score, thereby leading to interpretability. Through designed bias experiments, we also show that thanks to its gradual masking feature, our proposed method is effective in handling data bias.
Summary: Deep classifiers tend to latch onto a few discriminative input variables, which can hurt generalization. PointMask is a model-agnostic, interpretable information-bottleneck approach for attribution in point cloud models: a regularization term minimizes the mutual information between the input and the latent features, masking out irrelevant variables while the model explores most variation factors in the input space and gradually converges to a general solution. Coupling a PointMask layer with an arbitrary model reveals which input points contribute most to the prediction score, and designed bias experiments show that the gradual masking also handles data bias effectively.

38. Aligning Videos in Space and Time [PDF] [Back to contents]
  Senthil Purushwalkam, Tian Ye, Saurabh Gupta, Abhinav Gupta
Abstract: In this paper, we focus on the task of extracting visual correspondences across videos. Given a query video clip from an action class, we aim to align it with training videos in space and time. Obtaining training data for such a fine-grained alignment task is challenging and often ambiguous. Hence, we propose a novel alignment procedure that learns such correspondence in space and time via cross video cycle-consistency. During training, given a pair of videos, we compute cycles that connect patches in a given frame in the first video by matching through frames in the second video. Cycles that connect overlapping patches together are encouraged to score higher than cycles that connect non-overlapping patches. Our experiments on the Penn Action and Pouring datasets demonstrate that the proposed method can successfully learn to correspond semantically similar patches across videos, and learns representations that are sensitive to object and action states.
Summary: This paper extracts visual correspondences across videos: given a query clip from an action class, it aligns the clip with training videos in space and time. Because supervision for such fine-grained alignment is hard to obtain and often ambiguous, the method learns correspondence via cross-video cycle-consistency: cycles that connect patches in a frame of the first video by matching through frames of the second video are encouraged to score higher when they return to overlapping patches than to non-overlapping ones. On the Penn Action and Pouring datasets, the method successfully corresponds semantically similar patches across videos and learns representations sensitive to object and action states.

39. Deep Multi-task Learning for Facial Expression Recognition and Synthesis Based on Selective Feature Sharing [PDF] [Back to contents]
  Rui Zhao, Tianshan Liu, Jun Xiao, Daniel P.K. Lun, Kin-Man Lam
Abstract: Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods give limited consideration to feature selection when transferring information between different tasks, which may lead to task interference when training multi-task networks. To address this problem, we propose a novel selective feature-sharing method, and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks, while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.
Summary: Multi-task learning is effective for facial expression recognition, but most methods give limited consideration to feature selection when transferring information between tasks, which can cause task interference during training. This work proposes a selective feature-sharing method and builds a multi-task network for joint facial expression recognition and synthesis, transferring beneficial features between tasks while filtering out useless or harmful information. The synthesis task is also used to enlarge and balance the training set, further improving generalization; the method achieves state-of-the-art results on commonly used facial expression recognition benchmarks.

40. Towards Unsupervised Learning for Instrument Segmentation in Robotic Surgery with Cycle-Consistent Adversarial Networks [PDF] [Back to contents]
  Daniil Pakhomov, Wei Shen, Nassir Navab
Abstract: Surgical tool segmentation in endoscopic images is an important problem: it is a crucial step towards full instrument pose estimation and it is used for integration of pre- and intra-operative images into the endoscopic view. While many recent approaches based on convolutional neural networks have shown great results, a key barrier to progress lies in the acquisition of a large number of manually-annotated images which is necessary for an algorithm to generalize and work well in diverse surgical scenarios. Unlike the surgical image data itself, annotations are difficult to acquire and may be of variable quality. On the other hand, synthetic annotations can be automatically generated by using forward kinematic model of the robot and CAD models of tools by projecting them onto an image plane. Unfortunately, this model is very inaccurate and cannot be used for supervised learning of image segmentation models. Since generated annotations will not directly correspond to endoscopic images due to errors, we formulate the problem as an unpaired image-to-image translation where the goal is to learn the mapping between an input endoscopic image and a corresponding annotation using an adversarial model. Our approach allows to train image segmentation models without the need to acquire expensive annotations and can potentially exploit large unlabeled endoscopic image collection outside the annotated distributions of image/annotation data. We test our proposed method on Endovis 2017 challenge dataset and show that it is competitive with supervised segmentation methods.
Summary: Surgical tool segmentation in endoscopic images is a crucial step toward full instrument pose estimation and for integrating pre- and intra-operative images into the endoscopic view, but the manually annotated images that convolutional approaches require are expensive to acquire and of variable quality. Synthetic annotations can be generated automatically by projecting the robot's forward kinematic model and tool CAD models onto the image plane, yet this model is too inaccurate for supervised learning. The problem is therefore formulated as unpaired image-to-image translation: an adversarial model learns the mapping between endoscopic images and annotations, avoiding expensive annotation and potentially exploiting large unlabeled endoscopic image collections. On the EndoVis 2017 challenge dataset, the method is competitive with supervised segmentation.

41. Searching for Efficient Architecture for Instrument Segmentation in Robotic Surgery [PDF] [Back to contents]
  Daniil Pakhomov, Nassir Navab
Abstract: Segmentation of surgical instruments is an important problem in robot-assisted surgery: it is a crucial step towards full instrument pose estimation and is directly used for masking of augmented reality overlays during surgical procedures. Most applications rely on accurate real-time segmentation of high-resolution surgical images. While previous research focused primarily on methods that deliver high accuracy segmentation masks, majority of them can not be used for real-time applications due to their computational cost. In this work, we design a light-weight and highly-efficient deep residual architecture which is tuned to perform real-time inference of high-resolution images. To account for reduced accuracy of the discovered light-weight deep residual network and avoid adding any additional computational burden, we perform a differentiable search over dilation rates for residual units of our network. We test our discovered architecture on the EndoVis 2017 Robotic Instruments dataset and verify that our model is the state-of-the-art in terms of speed and accuracy tradeoff with a speed of up to 125 FPS on high resolution images.
Summary: Accurate real-time segmentation of surgical instruments is needed for masking augmented-reality overlays in robot-assisted surgery, but most high-accuracy methods are too computationally expensive for real-time use. This work designs a light-weight, highly efficient deep residual architecture tuned for real-time inference on high-resolution images, and performs a differentiable search over the dilation rates of its residual units to recover accuracy without adding computational burden. On the EndoVis 2017 Robotic Instruments dataset, the discovered architecture is state of the art in the speed-accuracy tradeoff, running at up to 125 FPS on high-resolution images.

42. IQ-VQA: Intelligent Visual Question Answering [PDF] [Back to contents]
  Vatsal Goel, Mohit Chandak, Ashish Anand, Prithwijit Guha
Abstract: Even though there has been tremendous progress in the field of Visual Question Answering, models today still tend to be inconsistent and brittle. To this end, we propose a model-independent cyclic framework which increases consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on the answer and then also learn to answer the generated implication correctly. As a part of the cyclic framework, we propose a novel implication generator which can generate implied questions from any question-answer pair. As a baseline for future works on consistency, we provide a new human annotated VQA-Implications dataset. The dataset consists of ~30k questions containing implications of 3 types - Logical Equivalence, Necessary Condition and Mutual Exclusion - made from the VQA v2.0 validation dataset. We show that our framework improves consistency of VQA models by ~15% on the rule-based dataset, ~7% on VQA-Implications dataset and robustness by ~2%, without degrading their performance. In addition, we also quantitatively show improvement in attention maps which highlights better multi-modal understanding of vision and language.
Summary: Despite great progress in Visual Question Answering, today's models remain inconsistent and brittle. This work proposes a model-independent cyclic framework: the model answers the original question, a novel implication generator produces an implied question from the question-answer pair, and the model learns to answer that implication correctly. As a consistency baseline for future work, the paper contributes a human-annotated VQA-Implications dataset of ~30k questions covering three implication types (Logical Equivalence, Necessary Condition, Mutual Exclusion) derived from the VQA v2.0 validation set. The framework improves consistency by ~15% on the rule-based dataset and ~7% on VQA-Implications, and robustness by ~2%, without degrading performance, with quantitatively improved attention maps indicating better multi-modal understanding.

43. Quaternion Capsule Networks [PDF] [Back to contents]
  Barış Özcan, Furkan Kınlı, Furkan Kıraç
Abstract: Capsules are groupings of neurons that can represent sophisticated information about a visual entity, such as pose and features. Owing to this property, Capsule Networks outperform CNNs in challenging tasks like object recognition from unseen viewpoints, which is achieved by learning the transformations between an object and its parts with the help of a high-dimensional representation of pose information. In this paper, we present Quaternion Capsules (QCN), where the pose information of capsules and their transformations are represented by quaternions. Quaternions are immune to gimbal lock, admit a straightforward regularization of the rotation representation for capsules, and require fewer parameters than matrices. The experimental results show that QCNs generalize better to novel viewpoints with fewer parameters, and also achieve on-par or better performance than state-of-the-art Capsule architectures on well-known benchmarking datasets.
Summary: Capsules group neurons to represent the pose and features of a visual entity, which lets capsule networks outperform CNNs at recognizing objects from unseen viewpoints by learning the transformations between an object and its parts. Quaternion Capsule Networks (QCN) represent capsule poses and their transformations with quaternions, which avoid gimbal lock, admit straightforward regularization of the rotation representation, and need fewer parameters than matrices. QCNs generalize better to novel viewpoints with fewer parameters and match or beat state-of-the-art capsule architectures on well-known benchmarks.
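
For background, the Hamilton product below is the standard quaternion operation that such pose transformations rely on; the example values are illustrative.

    # The Hamilton product of two quaternions (w, x, y, z); composing or
    # applying pose transformations with quaternions reduces to such products.
    import numpy as np

    def hamilton(q, p):
        w1, x1, y1, z1 = q
        w2, x2, y2, z2 = p
        return np.array([
            w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2])

    # A unit quaternion needs 4 numbers versus 9 for a rotation matrix, and
    # renormalizing is the straightforward regularization mentioned above.
    q = np.array([np.cos(np.pi/8), np.sin(np.pi/8), 0.0, 0.0])  # 45 deg about x
    p = np.array([1.0, 0.0, 0.0, 0.0])                          # identity pose
    out = hamilton(q, p)
    out /= np.linalg.norm(out)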

44. Words as Art Materials: Generating Paintings with Sequential GANs [PDF] [Back to contents]
  Azmi Can Özgen, Hazım Kemal Ekenel
Abstract: Converting text descriptions into images using Generative Adversarial Networks has become a popular research area. Visually appealing images have been generated successfully in recent years. Inspired by these studies, we investigated the generation of artistic images on a large variance dataset. This dataset includes images with variations, for example, in shape, color, and content. These variations in images provide originality which is an important factor for artistic essence. One major characteristic of our work is that we used keywords as image descriptions, instead of sentences. As the network architecture, we proposed a sequential Generative Adversarial Network model. The first stage of this sequential model processes the word vectors and creates a base image whereas the next stages focus on creating high-resolution artistic-style images without working on word vectors. To deal with the unstable nature of GANs, we proposed a mixture of techniques like Wasserstein loss, spectral normalization, and minibatch discrimination. Ultimately, we were able to generate painting images, which have a variety of styles. We evaluated our results by using the Fréchet Inception Distance score and conducted a user study with 186 participants.
Summary: This work generates artistic paintings from keyword descriptions (rather than sentences) on a high-variance dataset whose images differ widely in shape, color, and content, variation that supplies the originality essential to artistic essence. A sequential GAN first processes the word vectors to create a base image; subsequent stages produce high-resolution artistic-style images without operating on word vectors. To cope with unstable GAN training, the model combines Wasserstein loss, spectral normalization, and minibatch discrimination. The generated paintings span a variety of styles and are evaluated with the Fréchet Inception Distance and a user study with 186 participants.

45. Temporal aggregation of audio-visual modalities for emotion recognition [PDF] [Back to contents]
  Andreea Birhala, Catalin Nicolae Ristea, Anamaria Radoi, Liviu Cristian Dutu
Abstract: Emotion recognition has a pivotal role in affective computing and in human-computer interaction. The current technological developments lead to increased possibilities of collecting data about the emotional state of a person. In general, human perception regarding the emotion transmitted by a subject is based on vocal and visual information collected in the first seconds of interaction with the subject. As a consequence, the integration of verbal (i.e., speech) and non-verbal (i.e., image) information seems to be the preferred choice in most of the current approaches towards emotion recognition. In this paper, we propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality. We show that our proposed method outperforms other methods from the literature and human accuracy rating. The experiments are conducted over the open-access multimodal dataset CREMA-D.
Summary: Human perception of the emotion transmitted by a subject is based on vocal and visual information gathered in the first seconds of interaction, so most emotion recognition approaches integrate speech and image. This paper proposes a multimodal fusion technique that aggregates audio-visual modalities over a temporal window, using a different temporal offset for each modality. On the open-access multimodal dataset CREMA-D, the method outperforms literature baselines and human accuracy ratings.

46. Sensor Fusion of Camera and Cloud Digital Twin Information for Intelligent Vehicles [PDF] [Back to contents]
  Yongkang Liu, Ziran Wang, Kyungtae Han, Zhenyu Shou, Prashant Tiwari, John H. L. Hansen
Abstract: With the rapid development of intelligent vehicles and Advanced Driving Assistance Systems (ADAS), mixed levels of human driver engagement are involved in the transportation system. Visual guidance for drivers is essential in this situation to prevent potential risks. To advance the development of visual guidance systems, we introduce a novel sensor fusion methodology, integrating camera images and Digital Twin knowledge from the cloud. The target vehicle bounding box is drawn and matched by combining results of the object detector running on the ego vehicle with position information from the cloud. The best matching result, with a 79.2% accuracy under a 0.7 Intersection over Union (IoU) threshold, is obtained with the depth image serving as an additional feature source. Game-engine-based simulation results also reveal that the visual guidance system could significantly improve driving safety when cooperating with the cloud Digital Twin system.
Summary: With mixed levels of human driver engagement in ADAS-equipped traffic, visual guidance for drivers helps prevent potential risks. This work introduces a sensor fusion methodology integrating camera images with Digital Twin knowledge from the cloud: target vehicle bounding boxes from an object detector running on the ego vehicle are matched against position information from the cloud. The best matching, 79.2% accuracy at a 0.7 IoU threshold, is obtained when the depth image serves as an additional feature source; game-engine simulations further indicate that the guidance system cooperating with the cloud Digital Twin can significantly improve driving safety.
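
A minimal sketch of the matching step at the stated 0.7 IoU threshold; the greedy matcher itself is an assumption, not the authors' implementation.

    # Greedy matching of detector boxes against boxes projected from cloud
    # Digital Twin positions at the stated 0.7 IoU threshold (the matcher
    # itself is an assumption). Boxes are (x1, y1, x2, y2).
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def match(detections, twin_boxes, thresh=0.7):
        matches, used = [], set()
        for i, d in enumerate(detections):
            score, j = max(((iou(d, t), k) for k, t in enumerate(twin_boxes)
                            if k not in used), default=(0.0, None))
            if score >= thresh:
                matches.append((i, j))
                used.add(j)
        return matches

    print(match([(10, 10, 50, 50)], [(12, 11, 52, 49)]))  # [(0, 0)]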

47. Deep Placental Vessel Segmentation for Fetoscopic Mosaicking [PDF] [Back to contents]
  Sophia Bano, Francisco Vasconcelos, Luke M. Shepherd, Emmanuel Vander Poorten, Tom Vercauteren, Sebastien Ourselin, Anna L. David, Jan Deprest, Danail Stoyanov
Abstract: During fetoscopic laser photocoagulation, a treatment for twin-to-twin transfusion syndrome (TTTS), the clinician first identifies abnormal placental vascular connections and laser ablates them to regulate blood flow in both fetuses. The procedure is challenging due to the mobility of the environment, poor visibility in amniotic fluid, occasional bleeding, and limitations in the fetoscopic field-of-view and image quality. Ideally, anastomotic placental vessels would be automatically identified, segmented and registered to create expanded vessel maps to guide laser ablation, however, such methods have yet to be clinically adopted. We propose a solution utilising the U-Net architecture for performing placental vessel segmentation in fetoscopic videos. The obtained vessel probability maps provide sufficient cues for mosaicking alignment by registering consecutive vessel maps using the direct intensity-based technique. Experiments on 6 different in vivo fetoscopic videos demonstrate that the vessel intensity-based registration outperformed image intensity-based registration approaches showing better robustness in qualitative and quantitative comparison. We additionally reduce drift accumulation to negligible even for sequences with up to 400 frames and we incorporate a scheme for quantifying drift error in the absence of the ground-truth. Our paper provides a benchmark for fetoscopy placental vessel segmentation and registration by contributing the first in vivo vessel segmentation and fetoscopic videos dataset.
Summary: In fetoscopic laser photocoagulation for twin-to-twin transfusion syndrome (TTTS), clinicians identify and laser-ablate abnormal placental vascular connections despite a mobile environment, poor visibility in amniotic fluid, occasional bleeding, and a limited fetoscopic field of view. This work segments placental vessels in fetoscopic video with a U-Net; the resulting vessel probability maps provide sufficient cues for mosaicking by registering consecutive maps with a direct intensity-based technique. On six in vivo fetoscopic videos, vessel-based registration is more robust than image-intensity-based registration, drift accumulation remains negligible even over sequences of up to 400 frames, and a scheme quantifies drift error without ground truth. The paper contributes the first in vivo placental vessel segmentation and fetoscopic video dataset as a benchmark.

48. The UU-Net: Reversible Face De-Identification for Visual Surveillance Video Footage [PDF] [Back to contents]
  Hugo Proença
Abstract: We propose a reversible face de-identification method for low resolution video data, where landmark-based techniques cannot be reliably used. Our solution is able to generate a photo-realistic de-identified stream that meets the data protection regulations and can be publicly released under minimal privacy constraints. Notably, such a stream encapsulates all the information required to later reconstruct the original scene, which is useful for scenarios, such as crime investigation, where the identification of the subjects is of most importance. We describe a learning process that jointly optimizes two main components: 1) a public module, that receives the raw data and generates the de-identified stream, where the ID information is surrogated in a photo-realistic and seamless way; and 2) a private module, designed for legal/security authorities, that analyses the public stream and reconstructs the original scene, disclosing the actual IDs of all the subjects in the scene. The proposed solution is landmarks-free and uses a conditional generative adversarial network to generate synthetic faces that preserve pose, lighting, background information and even facial expressions. Also, we enable full control over the set of soft facial attributes that should be preserved between the raw and de-identified data, which broadens the range of applications for this solution. Our experiments were conducted on three different visual surveillance datasets (BIODI, MARS and P-DESTRE) and showed highly encouraging results. The source code is available at this https URL.
Summary: A reversible face de-identification method for low-resolution surveillance video, where landmark-based techniques are unreliable. It generates a photo-realistic de-identified stream that meets data protection regulations and can be publicly released, yet encapsulates all information needed to reconstruct the original scene, useful when identification matters, e.g. in crime investigation. Learning jointly optimizes a public module, which surrogates identity information seamlessly in the released stream, and a private module for legal/security authorities, which reconstructs the original scene and the subjects' actual identities. The landmark-free approach uses a conditional GAN to synthesize faces preserving pose, lighting, background, and even facial expressions, with full control over which soft facial attributes are preserved. Experiments on three surveillance datasets (BIODI, MARS, P-DESTRE) show highly encouraging results; source code is available at this https URL.

49. One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control [PDF] [Back to contents]
  Wenlong Huang, Igor Mordatch, Deepak Pathak
Abstract: Reinforcement learning is typically concerned with learning control policies tailored to a particular agent. We investigate whether there exists a single global policy that can generalize to control a wide variety of agent morphologies -- ones in which even dimensionality of state and action spaces changes. We propose to express this global policy as a collection of identical modular neural networks, dubbed as Shared Modular Policies (SMP), that correspond to each of the agent's actuators. Every module is only responsible for controlling its corresponding actuator and receives information from only its local sensors. In addition, messages are passed between modules, propagating information between distant modules. We show that a single modular policy can successfully generate locomotion behaviors for several planar agents with different skeletal structures such as monopod hoppers, quadrupeds, bipeds, and generalize to variants not seen during training -- a process that would normally require training and manual hyperparameter tuning for each morphology. We observe that a wide variety of drastically diverse locomotion styles across morphologies as well as centralized coordination emerges via message passing between decentralized modules purely from the reinforcement learning objective. Videos and code at this https URL
Summary: Rather than tailoring a control policy to one agent, this work asks whether a single global policy can control a wide variety of agent morphologies, even when the dimensionality of state and action spaces changes. The global policy is expressed as a collection of identical modular neural networks, Shared Modular Policies (SMP), one per actuator; each module controls its own actuator from only local sensors, and messages passed between modules propagate information across the body. A single modular policy generates locomotion for several planar agents with different skeletal structures (monopod hoppers, quadrupeds, bipeds) and generalizes to variants unseen during training, a process that would normally require per-morphology training and manual hyperparameter tuning. Drastically diverse locomotion styles and centralized coordination emerge from decentralized message passing purely under the reinforcement learning objective. Videos and code at this https URL.

50. Invertible Zero-Shot Recognition Flows [PDF] [Back to contents]
  Yuming Shen, Jie Qin, Lei Huang
Abstract: Deep generative models have been successfully applied to Zero-Shot Learning (ZSL) recently. However, the underlying drawbacks of GANs and VAEs (e.g., the hardness of training with ZSL-oriented regularizers and the limited generation quality) hinder the existing generative ZSL models from fully bypassing the seen-unseen bias. To tackle the above limitations, for the first time, this work incorporates a new family of generative models (i.e., flow-based models) into ZSL. The proposed Invertible Zero-shot Flow (IZF) learns factorized data embeddings (i.e., the semantic factors and the non-semantic ones) with the forward pass of an invertible flow network, while the reverse pass generates data samples. This procedure theoretically extends conventional generative flows to a factorized conditional scheme. To explicitly solve the bias problem, our model enlarges the seen-unseen distributional discrepancy based on negative sample-based distance measurement. Notably, IZF works flexibly with either a naive Bayesian classifier or a held-out trainable one for zero-shot recognition. Experiments on widely-adopted ZSL benchmarks demonstrate the significant performance gain of IZF over existing methods, in both classic and generalized settings.
Summary: Deep generative models have recently been applied to zero-shot learning (ZSL), but the drawbacks of GANs and VAEs (hard training with ZSL-oriented regularizers, limited generation quality) keep generative ZSL models from fully bypassing the seen-unseen bias. This work brings flow-based generative models to ZSL for the first time: the Invertible Zero-shot Flow (IZF) learns factorized embeddings (semantic and non-semantic factors) in the forward pass of an invertible flow network, while the reverse pass generates data samples, theoretically extending generative flows to a factorized conditional scheme. To explicitly address bias, the model enlarges the seen-unseen distributional discrepancy via negative-sample-based distance measurement, and works flexibly with either a naive Bayesian classifier or a held-out trainable one. On widely adopted ZSL benchmarks, IZF significantly outperforms existing methods in both classic and generalized settings.

51. Medical Instrument Detection in Ultrasound-Guided Interventions: A Review [PDF] [Back to contents]
  Hongxu Yang, Caifeng Shan, Alexander F. Kolen, Peter H. N. de With
Abstract: Medical instrument detection is essential for computer-assisted interventions, since it helps surgeons to find the instrument efficiently with a better interpretation, which leads to a better outcome. This article reviews medical instrument detection methods in ultrasound-guided interventions. First, we present a comprehensive review of instrument detection methodologies, which include traditional non-data-driven methods and data-driven methods. The non-data-driven methods were extensively studied prior to the era of machine learning, i.e. data-driven approaches. We discuss the main clinical applications of medical instrument detection in ultrasound, including anesthesia, biopsy, prostate brachytherapy, and cardiac catheterization, which were validated on clinical datasets. Finally, we selected several principal publications to summarize the key issues and potential research directions for the computer-assisted intervention community.
Summary: A review of medical instrument detection in ultrasound-guided interventions, which helps surgeons locate instruments efficiently and improves outcomes. It comprehensively surveys both traditional non-data-driven methods, studied extensively before the machine learning era, and data-driven methods; discusses the main clinical applications validated on clinical datasets (anesthesia, biopsy, prostate brachytherapy, cardiac catheterization); and distills key issues and potential research directions for the computer-assisted intervention community from several principal publications.

52. Client Adaptation improves Federated Learning with Simulated Non-IID Clients [PDF] [Back to contents]
  Laura Rieger, Rasmus M. Th. Høegh, Lars K. Hansen
Abstract: We present a federated learning approach for learning a client adaptable, robust model when data is non-identically and non-independently distributed (non-IID) across clients. By simulating heterogeneous clients, we show that adding learned client-specific conditioning improves model performance, and the approach is shown to work on balanced and imbalanced data set from both audio and image domains. The client adaptation is implemented by a conditional gated activation unit and is particularly beneficial when there are large differences between the data distribution for each client, a common scenario in federated learning.
Summary: A federated learning approach for learning a client-adaptable, robust model when data is non-identically and non-independently distributed (non-IID) across clients. With simulated heterogeneous clients, adding learned client-specific conditioning, implemented as a conditional gated activation unit, improves model performance on balanced and imbalanced datasets from both audio and image domains. The adaptation is particularly beneficial when per-client data distributions differ strongly, a common scenario in federated learning.
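
A minimal sketch of a conditional gated activation unit in the spirit described above, assuming each client is identified by a learned embedding; exact layer shapes and placement in the paper may differ.

    # Sketch of a conditional gated activation unit with a learned client
    # embedding entering through the gate (layer shapes are assumptions).
    import torch
    import torch.nn as nn

    class ConditionalGatedActivation(nn.Module):
        def __init__(self, features, num_clients, cond_dim=16):
            super().__init__()
            self.client_embed = nn.Embedding(num_clients, cond_dim)
            self.filter = nn.Linear(features, features)
            self.gate = nn.Linear(features, features)
            self.cond = nn.Linear(cond_dim, features)

        def forward(self, x, client_id):
            c = self.cond(self.client_embed(client_id))
            return torch.tanh(self.filter(x)) * torch.sigmoid(self.gate(x) + c)

    layer = ConditionalGatedActivation(features=64, num_clients=10)
    y = layer(torch.randn(8, 64), torch.randint(0, 10, (8,)))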

53. A Systematic Review on Context-Aware Recommender Systems using Deep Learning and Embeddings [PDF] [Back to contents]
  Igor André Pegoraro Santana, Marcos Aurelio Domingues
Abstract: Recommender Systems are tools that improve how users find relevant information in web systems, so they do not face too much information. In order to generate better recommendations, the context of information should be used in the recommendation process. Context-Aware Recommender Systems were created, accomplishing state-of-the-art results and improving traditional recommender systems. There are many approaches to build recommender systems, and two of the most prominent advances in area have been the use of Embeddings to represent the data in the recommender system, and the use of Deep Learning architectures to generate the recommendations to the user. A systematic review adopts a formal and systematic method to perform a bibliographic review, and it is used to identify and evaluate all the research in certain area of study, by analyzing the relevant research published. A systematic review was conducted to understand how the Deep Learning and Embeddings techniques are being applied to improve Context-Aware Recommender Systems. We summarized the architectures that are used to create those and the domains that they are used.
Summary: Context-aware recommender systems improve on traditional recommenders by using contextual information in the recommendation process, achieving state-of-the-art results. Two of the most prominent advances in the area are embeddings for representing recommender-system data and deep learning architectures for generating recommendations. This systematic review identifies and evaluates published research on how deep learning and embedding techniques are applied to improve context-aware recommender systems, summarizing the architectures used and the domains in which they are applied.

54. Modelling the Distribution of 3D Brain MRI using a 2D Slice VAE [PDF] [Back to contents]
  Anna Volokitin, Ertunc Erdil, Neerav Karani, Kerem Can Tezcan, Xiaoran Chen, Luc Van Gool, Ender Konukoglu
Abstract: Probabilistic modelling has been an essential tool in medical image analysis, especially for analyzing brain Magnetic Resonance Images (MRI). Recent deep learning techniques for estimating high-dimensional distributions, in particular Variational Autoencoders (VAEs), opened up new avenues for probabilistic modeling. Modelling of volumetric data has remained a challenge, however, because constraints on available computation and training data make it difficult to effectively leverage VAEs, which are well developed for 2D images. We propose a method to model the distribution of 3D MR brain volumes by combining a 2D slice VAE with a Gaussian model that captures the relationships between slices. We do so by estimating the sample mean and covariance in the latent space of the 2D model over the slice direction. This combined model lets us sample new coherent stacks of latent variables to decode into slices of a volume. We also introduce a novel evaluation method for generated volumes that quantifies how well their segmentations match those of true brain anatomy. We demonstrate that our proposed model is competitive in generating high quality volumes at high resolutions according to both traditional metrics and our proposed evaluation.
Summary: VAEs opened new avenues for probabilistic modelling of brain MRI, but volumetric data remains challenging because computation and training-data constraints make it hard to leverage VAEs, which are well developed for 2D images. This work models the distribution of 3D MR brain volumes by combining a 2D slice VAE with a Gaussian model capturing the relationships between slices: the sample mean and covariance of the 2D model's latents are estimated over the slice direction, so new coherent stacks of latent variables can be sampled and decoded into the slices of a volume. A novel evaluation quantifies how well segmentations of generated volumes match true brain anatomy; the model is competitive at generating high-quality, high-resolution volumes on both traditional metrics and the proposed one.
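
The slice-direction Gaussian is easy to sketch: fit the sample mean and covariance of stacked slice latents, then sample coherent latent stacks to decode. Shapes below are illustrative, and vae_decoder is a hypothetical decoder handle.

    # Fit a Gaussian over stacked slice latents and sample coherent stacks.
    # Shapes are illustrative; a small ridge keeps the covariance usable.
    import numpy as np

    z = np.random.randn(100, 64, 8)        # 100 volumes, 64 slices, latent 8
    flat = z.reshape(len(z), -1)           # one long latent vector per volume

    mu = flat.mean(axis=0)
    cov = np.cov(flat, rowvar=False) + 1e-6 * np.eye(flat.shape[1])

    new_flat = np.random.multivariate_normal(mu, cov, size=5)
    new_stacks = new_flat.reshape(5, 64, 8)   # 5 new coherent latent stacks
    # for s in range(64): slice_s = vae_decoder(new_stacks[:, s])  # hypothetical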

55. Automated Chest CT Image Segmentation of COVID-19 Lung Infection based on 3D U-Net [PDF] [Back to contents]
  Dominik Müller, Iñaki Soto Rey, Frank Kramer
Abstract: The coronavirus disease 2019 (COVID-19) affects billions of lives around the world and has a significant impact on public healthcare. Due to rising skepticism towards the sensitivity of RT-PCR as screening method, medical imaging like computed tomography offers great potential as alternative. For this reason, automated image segmentation is highly desired as clinical decision support for quantitative assessment and disease monitoring. However, publicly available COVID-19 imaging data is limited which leads to overfitting of traditional approaches. To address this problem, we propose an innovative automated segmentation pipeline for COVID-19 infected regions, which is able to handle small datasets by utilization as variant databases. Our method focuses on on-the-fly generation of unique and random image patches for training by performing several preprocessing methods and exploiting extensive data augmentation. For further reduction of the overfitting risk, we implemented a standard 3D U-Net architecture instead of new or computational complex neural network architectures. Through a 5-fold cross-validation on 20 CT scans of COVID-19 patients, we were able to develop a highly accurate as well as robust segmentation model for lungs and COVID-19 infected regions without overfitting on the limited data. Our method achieved Dice similarity coefficients of 0.956 for lungs and 0.761 for infection. We demonstrated that the proposed method outperforms related approaches, advances the state-of-the-art for COVID-19 segmentation and improves medical image analysis with limited data. The code and model are available under the following link: this https URL
Summary: With rising skepticism toward the sensitivity of RT-PCR as a COVID-19 screening method, CT offers great potential as an alternative, and automated segmentation is highly desired for quantitative assessment and disease monitoring; however, publicly available COVID-19 imaging data is limited, causing traditional approaches to overfit. The proposed segmentation pipeline handles small datasets by on-the-fly generation of unique, random image patches with several preprocessing methods and extensive data augmentation, and uses a standard 3D U-Net instead of a new or computationally complex architecture to further reduce overfitting risk. In 5-fold cross-validation on 20 CT scans of COVID-19 patients, it achieves Dice similarity coefficients of 0.956 for lungs and 0.761 for infection, outperforming related approaches without overfitting the limited data. Code and model are available at this https URL.
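
For reference, the Dice similarity coefficient behind the reported 0.956 (lungs) and 0.761 (infection) scores is the standard overlap metric:

    # Standard Dice similarity coefficient for binary masks.
    import numpy as np

    def dice(pred, target, eps=1e-7):
        pred, target = pred.astype(bool), target.astype(bool)
        inter = np.logical_and(pred, target).sum()
        return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

    pred = np.zeros((4, 4), dtype=np.uint8); pred[1:3, 1:3] = 1
    gt = np.zeros((4, 4), dtype=np.uint8); gt[1:4, 1:4] = 1
    print(dice(pred, gt))   # 2*4 / (4 + 9) ~ 0.615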

56. Low Dose CT Denoising via Joint Bilateral Filtering and Intelligent Parameter Optimization [PDF] [Back to contents]
  Mayank Patwari, Ralf Gutjahr, Rainer Raupach, Andreas Maier
Abstract: Denoising of clinical CT images is an active area for deep learning research. Current clinically approved methods use iterative reconstruction methods to reduce the noise in CT images. Iterative reconstruction techniques require multiple forward and backward projections, which are time-consuming and computationally expensive. Recently, deep learning methods have been successfully used to denoise CT images. However, conventional deep learning methods suffer from the 'black box' problem. They have low accountability, which is necessary for use in clinical imaging situations. In this paper, we use a Joint Bilateral Filter (JBF) to denoise our CT images. The guidance image of the JBF is estimated using a deep residual convolutional neural network (CNN). The range smoothing and spatial smoothing parameters of the JBF are tuned by a deep reinforcement learning task. Our actor first chooses a parameter, and subsequently chooses an action to tune the value of the parameter. A reward network is designed to direct the reinforcement learning task. Our denoising method demonstrates good denoising performance, while retaining structural information. Our method significantly outperforms state of the art deep neural networks. Moreover, our method has only two parameters, which makes it significantly more interpretable and reduces the 'black box' problem. We experimentally measure the impact of our intelligent parameter optimization and our reward network. Our studies show that our current setup yields the best results in terms of structural preservation.
Summary: Clinically approved CT denoising relies on iterative reconstruction, which needs multiple time-consuming, computationally expensive forward and backward projections, while conventional deep denoisers suffer from the "black box" problem and lack the accountability needed in clinical imaging. This work denoises low-dose CT with a Joint Bilateral Filter (JBF) whose guidance image is estimated by a deep residual CNN, and tunes the JBF's range- and spatial-smoothing parameters with deep reinforcement learning: an actor first chooses a parameter, then an action to adjust its value, directed by a reward network. The method denoises well while retaining structural information, significantly outperforms state-of-the-art deep networks, and, with only two parameters, is considerably more interpretable; experiments quantify the impact of the intelligent parameter optimization and reward network, with the current setup best preserving structure.
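
A compact, unoptimized reference implementation of a joint bilateral filter: spatial weights come from a Gaussian on pixel distance and range weights from a Gaussian on intensity differences in the guidance image (in the paper, the CNN's output). This is a generic JBF sketch, not the paper's code, and the reinforcement-learned parameter tuning is omitted.

    # Generic (unoptimized) joint bilateral filter: spatial weights from a
    # Gaussian on pixel distance, range weights from a Gaussian on intensity
    # differences in the guidance image. Not the paper's implementation.
    import numpy as np

    def joint_bilateral(noisy, guide, radius=3, sigma_s=2.0, sigma_r=0.1):
        H, W = noisy.shape
        n = np.pad(noisy, radius, mode="reflect")
        g = np.pad(guide, radius, mode="reflect")
        ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
        spatial = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))
        out = np.zeros_like(noisy)
        for i in range(H):
            for j in range(W):
                win_n = n[i:i + 2*radius + 1, j:j + 2*radius + 1]
                win_g = g[i:i + 2*radius + 1, j:j + 2*radius + 1]
                rng = np.exp(-(win_g - guide[i, j])**2 / (2 * sigma_r**2))
                w = spatial * rng
                out[i, j] = (w * win_n).sum() / w.sum()
        return out

    img = np.random.rand(32, 32)
    den = joint_bilateral(img, guide=img)   # guide == noisy: plain bilateral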

57. JBFnet -- Low Dose CT Denoising by Trainable Joint Bilateral Filtering [PDF] [Back to contents]
  Mayank Patwari, Ralf Gutjahr, Rainer Raupach, Andreas Maier
Abstract: Deep neural networks have shown great success in low dose CT denoising. However, most of these deep neural networks have several hundred thousand trainable parameters. This, combined with the inherent non-linearity of the neural network, makes the deep neural network difficult to understand, with low accountability. In this study we introduce JBFnet, a neural network for low dose CT denoising. The architecture of JBFnet implements iterative bilateral filtering. The filter functions of the Joint Bilateral Filter (JBF) are learned via shallow convolutional networks. The guidance image is estimated by a deep neural network. JBFnet is split into four filtering blocks, each of which performs Joint Bilateral Filtering. Each JBF block consists of 112 trainable parameters, making the noise removal process comprehensible. The Noise Map (NM) is added after filtering to preserve high level features. We train JBFnet with the data from the body scans of 10 patients, and test it on the AAPM low dose CT Grand Challenge dataset. We compare JBFnet with state-of-the-art deep learning networks. JBFnet outperforms CPCE3D, GAN and deep GFnet on the test dataset in terms of noise removal while preserving structures. We conduct several ablation studies to test the performance of our network architecture and training method. Our current setup achieves the best performance, while still maintaining behavioural accountability.
Summary: Deep CT denoisers typically have hundreds of thousands of trainable parameters, which, combined with their inherent nonlinearity, makes them hard to understand and weakly accountable. JBFnet implements iterative joint bilateral filtering as a neural network: the JBF filter functions are learned by shallow convolutional networks, the guidance image is estimated by a deep network, and the architecture stacks four filtering blocks of only 112 trainable parameters each, keeping the noise-removal process comprehensible; a noise map is added after filtering to preserve high-level features. Trained on body scans of 10 patients and tested on the AAPM Low Dose CT Grand Challenge dataset, JBFnet outperforms CPCE3D, GAN, and deep GFnet at noise removal while preserving structures, with ablation studies supporting the architecture and training choices.

58. Brain Tumor Anomaly Detection via Latent Regularized Adversarial Network [PDF] [Back to contents]
  Nan Wang, Chengwei Chen, Yuan Xie, Lizhuang Ma
Abstract: With the development of medical imaging technology, medical images have become an important basis for doctors to diagnose patients. The brain structure in the collected data is complicated; hence, doctors must spend considerable effort when diagnosing brain abnormalities. To address the imbalance of brain tumor data and the scarcity of labeled data, we propose an innovative brain tumor abnormality detection algorithm. The semi-supervised anomaly detection model is proposed, in which only healthy (normal) brain images are trained. The model captures the common pattern of the normal images in the training process and detects anomalies based on the reconstruction error of the latent space. Furthermore, the method first uses singular values to constrain the latent space and jointly optimizes the image space through multiple loss functions, which makes normal samples and abnormal samples more separable at the feature level. This paper utilizes the BraTS, HCP, MNIST, and CIFAR-10 datasets to comprehensively evaluate the effectiveness and practicability. Extensive experiments on intra- and cross-dataset tests prove that our semi-supervised method achieves better or comparable results to state-of-the-art supervised techniques.
Summary: Brain structure in medical imaging data is complicated, brain tumor data is imbalanced, and labeled data is scarce, so this work proposes a semi-supervised anomaly detection model trained only on healthy (normal) brain images. The model captures the common pattern of normal images during training and flags anomalies by the reconstruction error in latent space; singular values constrain the latent space, and multiple loss functions jointly optimize the image space, making normal and abnormal samples more separable at the feature level. Intra- and cross-dataset experiments on BraTS, HCP, MNIST, and CIFAR-10 show the method outperforms or matches state-of-the-art supervised techniques.

59. Multi-Granularity Modularized Network for Abstract Visual Reasoning [PDF] [Back to contents]
  Xiangru Tang, Haoyuan Wang, Xiang Pan, Jiyang Qi
Abstract: Abstract visual reasoning connects mental abilities to the physical world, which is a crucial factor in cognitive development. Most toddlers display sensitivity to this skill, but it is not easy for machines. To this end, we focus on the Raven Progressive Matrices Test, designed to measure cognitive reasoning. Recent work designed some black-boxes to solve it in an end-to-end fashion, but they are incredibly complicated and difficult to explain. Inspired by cognitive studies, we propose a Multi-Granularity Modularized Network (MMoN) to bridge the gap between the processing of raw sensory information and symbolic reasoning. Specifically, it learns modularized reasoning functions to model the semantic rule from the visual grounding in a neuro-symbolic and semi-supervised way. To comprehensively evaluate MMoN, our experiments are conducted on datasets of both seen and unseen reasoning rules. The result shows that MMoN is well suited for abstract visual reasoning and is also explainable on the generalization test.
Summary: Abstract visual reasoning connects mental abilities to the physical world and is a crucial factor in cognitive development; the Raven Progressive Matrices Test measures it, and recent end-to-end black-box solvers are complicated and hard to explain. Inspired by cognitive studies, the proposed Multi-Granularity Modularized Network (MMoN) bridges the gap between raw sensory processing and symbolic reasoning by learning modularized reasoning functions that model the semantic rule from the visual grounding in a neuro-symbolic, semi-supervised way. On datasets with both seen and unseen reasoning rules, MMoN is well suited for abstract visual reasoning and is explainable on the generalization test.

60. Learning to Switch CNNs with Model Agnostic Meta Learning for Fine Precision Visual Servoing [PDF] [Back to contents]
  Prem Raj, Vinay P. Namboodiri, L. Behera
Abstract: Convolutional Neural Networks (CNNs) have been successfully applied for relative camera pose estimation from labeled image-pair data, without requiring any hand-engineered features, camera intrinsic parameters or depth information. The trained CNN can be utilized for performing pose based visual servo control (PBVS). One of the ways to improve the quality of visual servo output is to improve the accuracy of the CNN for estimating the relative pose estimation. With a given state-of-the-art CNN for relative pose regression, how can we achieve an improved performance for visual servo control? In this paper, we explore switching of CNNs to improve the precision of visual servo control. The idea of switching a CNN is due to the fact that the dataset for training a relative camera pose regressor for visual servo control must contain variations in relative pose ranging from a very small scale to eventually a larger scale. We found that, training two different instances of the CNN, one for large-scale-displacements (LSD) and another for small-scale-displacements (SSD) and switching them during the visual servo execution yields better results than training a single CNN with the combined LSD+SSD data. However, it causes extra storage overhead and switching decision is taken by a manually set threshold which may not be optimal for all the scenes. To eliminate these drawbacks, we propose an efficient switching strategy based on model agnostic meta learning (MAML) algorithm. In this, a single model is trained to learn parameters which are simultaneously good for multiple tasks, namely a binary classification for switching decision, a 6DOF pose regression for LSD data and also a 6DOF pose regression for SSD data. The proposed approach performs far better than the naive approach, while storage and run-time overheads are almost negligible.
Summary: CNNs trained on labeled image pairs can regress relative camera pose for pose-based visual servo control (PBVS) without hand-engineered features, intrinsics, or depth. Because training data must span relative poses from very small to large scales, this work trains two CNN instances, one for large-scale displacements (LSD) and one for small-scale displacements (SSD), and switches between them during servo execution, which outperforms a single CNN trained on the combined LSD+SSD data; however, this costs extra storage and relies on a manually set switching threshold that may not suit all scenes. To remove these drawbacks, a single model is trained with model-agnostic meta-learning (MAML) to learn parameters that are simultaneously good for binary switching classification and for 6DOF pose regression on both LSD and SSD data. The approach performs far better than the naive one, with almost negligible storage and runtime overhead.

61. InfoMax-GAN: Improved Adversarial Image Generation via Information Maximization and Contrastive Learning [PDF] [Back to contents]
  Kwot Sin Lee, Ngoc-Trung Tran, Ngai-Man Cheung
Abstract: While Generative Adversarial Networks (GANs) are fundamental to many generative modelling applications, they suffer from numerous issues. In this work, we propose a principled framework to simultaneously address two fundamental issues in GANs: catastrophic forgetting of the discriminator and mode collapse of the generator. We achieve this by employing for GANs a contrastive learning and mutual information maximization approach, and perform extensive analyses to understand sources of improvements. Our approach significantly stabilises GAN training and improves GAN performance for image synthesis across five datasets under the same training and evaluation conditions against state-of-the-art works. Our approach is simple to implement and practical: it involves only one objective, is computationally inexpensive, and is robust across a wide range of hyperparameters without any tuning. For reproducibility, our code is available at this https URL.
Summary: GANs suffer from two fundamental issues: catastrophic forgetting in the discriminator and mode collapse in the generator. This work addresses both with a principled framework that applies contrastive learning and mutual information maximization to GANs, with extensive analyses of where the improvements come from. The approach significantly stabilizes training and improves image synthesis across five datasets under identical training and evaluation conditions against state-of-the-art works; it involves only one objective, is computationally inexpensive, and is robust across a wide range of hyperparameters without tuning. Code is available at this https URL.
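
As background, mutual-information-maximizing contrastive objectives are typically instantiated as an InfoNCE loss over matched feature pairs; the sketch below is generic and not the exact InfoMax-GAN objective.

    # Generic InfoNCE contrastive loss: matched feature pairs are pulled
    # together against in-batch negatives (not the exact paper objective).
    import torch
    import torch.nn.functional as F

    def info_nce(feat_a, feat_b, temperature=0.1):
        a = F.normalize(feat_a, dim=-1)
        b = F.normalize(feat_b, dim=-1)
        logits = a @ b.t() / temperature       # (N, N) similarity matrix
        labels = torch.arange(a.size(0))       # positives on the diagonal
        return F.cross_entropy(logits, labels)

    loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))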

62. Neural Video Coding using Multiscale Motion Compensation and Spatiotemporal Context Model [PDF] [Back to contents]
  Haojie Liu, Ming Lu, Zhan Ma, Fan Wang, Zhihuang Xie, Xun Cao, Yao Wang
Abstract: Over the past two decades, traditional block-based video coding has made remarkable progress and spawned a series of well-known standards such as MPEG-4, H.264/AVC and H.265/HEVC. On the other hand, deep neural networks (DNNs) have shown their powerful capacity for visual content understanding, feature extraction and compact representation. Some previous works have explored the learnt video coding algorithms in an end-to-end manner, which show the great potential compared with traditional methods. In this paper, we propose an end-to-end deep neural video coding framework (NVC), which uses variational autoencoders (VAEs) with joint spatial and temporal prior aggregation (PA) to exploit the correlations in intra-frame pixels, inter-frame motions and inter-frame compensation residuals, respectively. Novel features of NVC include: 1) To estimate and compensate motion over a large range of magnitudes, we propose an unsupervised multiscale motion compensation network (MS-MCN) together with a pyramid decoder in the VAE for coding motion features that generates multiscale flow fields, 2) we design a novel adaptive spatiotemporal context model for efficient entropy coding for motion information, 3) we adopt nonlocal attention modules (NLAM) at the bottlenecks of the VAEs for implicit adaptive feature extraction and activation, leveraging its high transformation capacity and unequal weighting with joint global and local information, and 4) we introduce multi-module optimization and a multi-frame training strategy to minimize the temporal error propagation among P-frames. NVC is evaluated for the low-delay causal settings and compared with H.265/HEVC, H.264/AVC and the other learnt video compression methods following the common test conditions, demonstrating consistent gains across all popular test sequences for both PSNR and MS-SSIM distortion metrics.
Summary: An end-to-end neural video coding framework (NVC) that uses variational autoencoders with joint spatial and temporal prior aggregation to exploit correlations in intra-frame pixels, inter-frame motion, and inter-frame compensation residuals. Its novel components are: (1) an unsupervised multiscale motion compensation network (MS-MCN) with a pyramid decoder in the VAE that generates multiscale flow fields; (2) an adaptive spatiotemporal context model for efficient entropy coding of motion information; (3) nonlocal attention modules (NLAM) at the VAE bottlenecks for implicit adaptive feature extraction, leveraging their high transformation capacity and unequal weighting of joint global and local information; and (4) multi-module optimization with a multi-frame training strategy that minimizes temporal error propagation among P-frames. Under common test conditions in the low-delay causal setting, NVC gains consistently over H.265/HEVC, H.264/AVC, and other learned codecs on all popular test sequences in both PSNR and MS-SSIM.

63. Efficient detection of adversarial images [PDF] [Back to contents]
  Darpan Kumar Yadav, Kartik Mundra, Rahul Modpur, Arpan Chattopadhyay, Indra Narayan Kar
Abstract: In this paper, detection of deception attack on deep neural network (DNN) based image classification in autonomous and cyber-physical systems is considered. Several studies have shown the vulnerability of DNN to malicious deception attacks. In such attacks, some or all pixel values of an image are modified by an external attacker, so that the change is almost invisible to the human eye but significant enough for a DNN-based classifier to misclassify it. This paper first proposes a novel pre-processing technique that facilitates the detection of such modified images under any DNN-based image classifier as well as the attacker model. The proposed pre-processing algorithm involves a certain combination of principal component analysis (PCA)-based decomposition of the image, and random perturbation based detection to reduce computational complexity. Next, an adaptive version of this algorithm is proposed where a random number of perturbations are chosen adaptively using a doubly-threshold policy, and the threshold values are learnt via stochastic approximation in order to minimize the expected number of perturbations subject to constraints on the false alarm and missed detection probabilities. Numerical experiments show that the proposed detection scheme outperforms a competing algorithm while achieving reasonably low computational complexity.

64. Wandering Within a World: Online Contextualized Few-Shot Learning [PDF] Back to contents
  Mengye Ren, Michael L. Iuzzolino, Michael C. Mozer, Richard S. Zemel
Abstract: We aim to bridge the gap between typical human and machine-learning environments by extending the standard framework of few-shot learning to an online, continual setting. In this setting, episodes do not have separate training and testing phases; instead, models are evaluated online while learning novel classes. As in the real world, where spatiotemporal context helps us retrieve skills learned in the past, our online few-shot learning setting also features an underlying context that changes over time. Object classes are correlated within a context, and inferring the correct context can lead to better performance. Building upon this setting, we propose a new few-shot learning dataset based on large-scale indoor imagery that mimics the visual experience of an agent wandering within a world. Furthermore, we convert popular few-shot learning approaches into online versions, and we also propose a new model named contextual prototypical memory that can make use of spatiotemporal contextual information from the recent past.
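A stripped-down sketch of the predict-then-update loop in such an online setting: class prototypes are running means of embeddings, each example is classified by nearest prototype before its label is consumed. The embedding source and the new-class rule are placeholders, and the paper's contextual prototypical memory additionally conditions on recent spatiotemporal context, which this sketch omits.

    import numpy as np

    class OnlinePrototypes:
        def __init__(self):
            self.sums, self.counts = {}, {}

        def predict(self, emb):
            # evaluated online: prediction happens before the label arrives
            if not self.sums:
                return None                    # nothing seen yet: "new class"
            protos = {c: s / self.counts[c] for c, s in self.sums.items()}
            return min(protos, key=lambda c: np.linalg.norm(emb - protos[c]))

        def update(self, emb, label):
            # running-mean prototype update after the label is revealed
            if label not in self.sums:
                self.sums[label] = np.zeros_like(emb)
                self.counts[label] = 0
            self.sums[label] += emb
            self.counts[label] += 1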

65. Automatic Probe Movement Guidance for Freehand Obstetric Ultrasound [PDF] Back to contents
  Richard Droste, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble
Abstract: We present the first system that provides real-time probe movement guidance for acquiring standard planes in routine freehand obstetric ultrasound scanning. Such a system can contribute to the worldwide deployment of obstetric ultrasound scanning by lowering the required level of operator expertise. The system employs an artificial neural network that receives the ultrasound video signal and the motion signal of an inertial measurement unit (IMU) attached to the probe, and predicts a guidance signal. The network, termed US-GuideNet, predicts either the movement towards the standard plane position (goal prediction) or the next movement that an expert sonographer would perform (action prediction). While existing models for other ultrasound applications are trained with simulations or phantoms, we train our model with real-world ultrasound video and probe motion data from 464 routine clinical scans by 17 accredited sonographers. Evaluations for three standard plane types show that the model provides a useful guidance signal, with an accuracy of 88.8% for goal prediction and 90.9% for action prediction.
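A speculative PyTorch sketch of a two-stream fusion model in the spirit of US-GuideNet; the layer sizes, IMU channel count, fusion scheme and guidance parameterization below are all assumptions, not the paper's architecture. It shows the shape of the problem: frame features and an IMU sequence are fused and mapped to a guidance output.

    import torch
    import torch.nn as nn

    class GuidanceNet(nn.Module):
        def __init__(self, guidance_dim=4):     # e.g. a rotation quaternion
            super().__init__()
            self.video = nn.Sequential(          # tiny grayscale frame encoder
                nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            # 6-channel IMU stream (3-axis accelerometer + gyroscope assumed)
            self.imu = nn.GRU(input_size=6, hidden_size=32, batch_first=True)
            self.head = nn.Linear(32 + 32, guidance_dim)

        def forward(self, frame, imu_seq):
            v = self.video(frame)                # (N, 32) frame features
            _, h = self.imu(imu_seq)             # h: (1, N, 32) final state
            return self.head(torch.cat([v, h[-1]], dim=1))

Trained against expert movements, the same head can serve goal prediction (offset to the standard plane) or action prediction (the sonographer's next move), depending on the regression target.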

66. Journey Towards Tiny Perceptual Super-Resolution [PDF] Back to contents
  Royson Lee, Łukasz Dudziak, Mohamed Abdelfattah, Stylianos I. Venieris, Hyeji Kim, Hongkai Wen, Nicholas D. Lane
Abstract: Recent works in single-image perceptual super-resolution (SR) have demonstrated unprecedented performance in generating realistic textures by means of deep convolutional networks. However, these convolutional models are excessively large and expensive, hindering their effective deployment to end devices. In this work, we propose a neural architecture search (NAS) approach that integrates NAS and generative adversarial networks (GANs) with recent advances in perceptual SR and pushes the efficiency of small perceptual SR models to facilitate on-device execution. Specifically, we search over the architectures of both the generator and the discriminator sequentially, highlighting the unique challenges and key observations of searching for an SR-optimized discriminator and comparing it with existing discriminator architectures in the literature. Our tiny perceptual SR (TPSR) models outperform SRGAN and EnhanceNet on both the full-reference perceptual metric (LPIPS) and the distortion metric (PSNR) while being up to 26.4$\times$ more memory-efficient and 33.6$\times$ more compute-efficient, respectively.
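For intuition only, here is a toy sketch of efficiency-constrained architecture search (pure random search; the paper uses a proper NAS method): sample generator configurations from a small space, score each with a hypothetical `evaluate(cfg)` that trains briefly and returns (LPIPS, FLOPs), and keep the best configuration within a compute budget. The search space and budget below are invented.

    import random

    SPACE = {
        "n_blocks": [2, 4, 6],
        "channels": [8, 16, 32],
        "upsample": ["nearest", "pixelshuffle"],
    }

    def search(evaluate, trials=20, flops_budget=5e8):
        # keep the lowest-LPIPS config that fits the efficiency constraint
        best_cfg, best_lpips = None, float("inf")
        for _ in range(trials):
            cfg = {k: random.choice(v) for k, v in SPACE.items()}
            lpips, flops = evaluate(cfg)     # hypothetical: train + measure
            if flops <= flops_budget and lpips < best_lpips:
                best_cfg, best_lpips = cfg, lpips
        return best_cfg

Searching the discriminator afterwards, with the found generator fixed, mirrors the sequential generator-then-discriminator search the abstract describes.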

67. Lightweight Image Super-Resolution with Enhanced CNN [PDF] Back to contents
  Chunwei Tian, Ruibin Zhuge, Zhihao Wu, Yong Xu, Wangmeng Zuo, Chen Chen, Chia-Wen Li
Abstract: Deep convolutional neural networks (CNNs) with strong expressive ability have achieved impressive performance on single-image super-resolution (SISR). However, their excessive amounts of convolutions and parameters usually consume high computational cost and large memory for training an SR model, which limits their application to SR on resource-constrained devices in the real world. To resolve these problems, we propose a lightweight enhanced SR CNN (LESRCNN) with three successive sub-blocks: an information extraction and enhancement block (IEEB), a reconstruction block (RB) and an information refinement block (IRB). Specifically, the IEEB extracts hierarchical low-resolution (LR) features and aggregates the obtained features step by step to increase the memory ability of the shallow layers on deep layers for SISR. To remove redundant information, a heterogeneous architecture is adopted in the IEEB. After that, the RB converts low-frequency features into high-frequency features by fusing global and local features, which complements the IEEB in tackling the long-term dependency problem. Finally, the IRB uses coarse high-frequency features from the RB to learn more accurate SR features and construct an SR image. The proposed LESRCNN can obtain a high-quality image with a single model for different scales. Extensive experiments demonstrate that the proposed LESRCNN outperforms state-of-the-art methods on SISR in terms of qualitative and quantitative evaluation.
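A rough PyTorch sketch of the block structure described above; every layer choice here is a guess rather than the LESRCNN layers. It shows step-by-step feature aggregation so shallow information reaches deep layers, fusion of aggregated (global) and last (local) features, and sub-pixel reconstruction.

    import torch
    import torch.nn as nn

    class TinySRNet(nn.Module):
        def __init__(self, scale=2, ch=32):
            super().__init__()
            self.head = nn.Conv2d(3, ch, 3, padding=1)
            # "IEEB-like": stacked convs whose outputs are all accumulated,
            # so shallow-layer features reach the deep layers
            self.body = nn.ModuleList([
                nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU())
                for _ in range(4)])
            # "RB-like": fuse aggregated (global) and last (local) features
            self.fuse = nn.Conv2d(2 * ch, ch, 1)
            # "IRB-like": refine and reconstruct via sub-pixel convolution
            self.up = nn.Sequential(
                nn.Conv2d(ch, 3 * scale ** 2, 3, padding=1),
                nn.PixelShuffle(scale))

        def forward(self, x):
            f = self.head(x)
            agg, cur = torch.zeros_like(f), f
            for block in self.body:
                cur = block(cur)
                agg = agg + cur                # step-by-step aggregation
            return self.up(self.fuse(torch.cat([agg, cur], dim=1)))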
