Abstracts

1. What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions
Kiana Ehsani, Daniel Gordon, Thomas Nguyen, Roozbeh Mottaghi, Ali Farhadi
Abstract: Learning effective representations of visual data that generalize to a variety of downstream tasks has been a long quest for computer vision. Most representation learning approaches rely solely on visual data such as images or videos. In this paper, we explore a novel approach, where we use human interaction and attention cues to investigate whether we can learn better representations compared to visual-only representations. For this study, we collect a dataset of human interactions capturing body part movements and gaze in their daily lives. Our experiments show that our self-supervised representation that encodes interaction and attention cues outperforms a visual-only state-of-the-art method MoCo (He et al., 2020), on a variety of target tasks: scene classification (semantic), action recognition (temporal), depth estimation (geometric), dynamics prediction (physics) and walkable surface estimation (affordance).

2. Towards Online Steering of Flame Spray Pyrolysis Nanoparticle Synthesis
Maksim Levental, Ryan Chard, Joseph A. Libera, Kyle Chard, Aarthi Koripelly, Jakob R. Elias, Marcus Schwarting, Ben Blaiszik, Marius Stan, Santanu Chaudhuri, Ian Foster
Abstract: Flame Spray Pyrolysis (FSP) is a manufacturing technique to mass produce engineered nanoparticles for applications in catalysis, energy materials, composites, and more. FSP instruments are highly dependent on a number of adjustable parameters, including fuel injection rate, fuel-oxygen mixtures, and temperature, which can greatly affect the quality, quantity, and properties of the yielded nanoparticles. Optimizing FSP synthesis requires monitoring, analyzing, characterizing, and modifying experimental conditions. Here, we propose a hybrid CPU-GPU Difference of Gaussians (DoG) method for characterizing the volume distribution of unburnt solution, so as to enable near-real-time optimization and steering of FSP experiments. Comparisons against standard implementations show our method to be an order of magnitude more efficient. This surrogate signal can be deployed as a component of an online end-to-end pipeline that maximizes the synthesis yield.
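
As a rough illustration of the Difference of Gaussians idea named in the abstract above, the following sketch applies it to a 1-D signal in plain Python. It is not the authors' hybrid CPU-GPU implementation; the function names, the edge-replication padding, and the 3-sigma kernel radius are all assumptions made for this example.

```python
import math

def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel (radius of about 3 sigma, an assumed choice)."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Convolve signal with kernel, replicating edge samples so the length is preserved."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # clamp index to the valid range
            acc += w * signal[j]
        out.append(acc)
    return out

def difference_of_gaussians(signal, sigma_narrow, sigma_wide):
    """Subtract a widely blurred copy from a narrowly blurred copy (band-pass response)."""
    narrow = smooth(signal, gaussian_kernel(sigma_narrow))
    wide = smooth(signal, gaussian_kernel(sigma_wide))
    return [a - b for a, b in zip(narrow, wide)]

# Hypothetical usage: a single bright sample produces the strongest response at its own index.
signal = [0.0] * 20
signal[10] = 1.0
response = difference_of_gaussians(signal, 1.0, 2.0)
peak_index = max(range(len(response)), key=lambda i: response[i])
```

Flat regions give a response near zero because both kernels are normalized, so only blob-like structure at roughly the chosen scale stands out.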

3. Deep Learning based Automated Forest Health Diagnosis from Aerial Images
Chia-Yen Chiang, Chloe Barnes, Plamen Angelov, Richard Jiang
Abstract: Global climate change has had a drastic impact on our environment. A previous study showed that pest disasters arising from global climate change may cause a tremendous number of trees to die, and that these dead trees inevitably become a factor in forest fires. An important portent of forest fire is the condition of the forests. Aerial image-based forest analysis can provide early detection of dead and living trees. In this paper, we apply a synthetic method to enlarge the imagery dataset and present a new framework for automated dead tree detection from aerial images using a re-trained Mask RCNN (Mask Region-based Convolutional Neural Network) approach with a transfer learning scheme. We apply our framework to our aerial imagery datasets and compare eight fine-tuned models. The mean average precision (mAP) score for the best of these models reaches 54%. Following the automated detection, we are able to automatically produce and count dead tree masks to label the dead trees in an image, as an indicator of forest health that could be linked to the causal analysis of environmental changes and the predictive likelihood of forest fire.

4. Difference-in-Differences: Bridging Normalization and Disentanglement in PG-GAN
Xiao Liu, Jiajie Zhang, Siting Li, Zuotong Wu, Yang Yu
Abstract: What mechanisms cause GAN's entanglement? Although developing disentangled GANs has attracted sufficient attention, it is unclear how entanglement originates in GAN transformations. In this research, we propose a difference-in-differences (DID) counterfactual framework to design experiments for analyzing the entanglement mechanism in one of the Progressive-growing GANs (PG-GAN). Our experiments clarify the mechanisms by which pixel normalization causes PG-GAN entanglement during an input-unit-ablation transformation. We discover that pixel normalization causes object entanglement by in-painting the area occupied by ablated objects. We also discover that the unit-object relation determines whether and how pixel normalization causes object entanglement. Our DID framework theoretically guarantees that the mechanisms we discover are solid, explainable and comprehensive.
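
For readers unfamiliar with the difference-in-differences estimator that the framework above builds on, here is a minimal sketch of the textbook two-group, two-period form. The abstract does not say which groups or outcome measure the authors use, so the function name, the arguments, and the numbers below are purely illustrative assumptions.

```python
from statistics import mean

def difference_in_differences(treated_pre, treated_post, control_pre, control_post):
    """Textbook DID: change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical outcome measurements (made-up numbers, not results from the paper).
effect = difference_in_differences(
    treated_pre=[1.0, 2.0, 3.0],
    treated_post=[5.0, 6.0, 7.0],
    control_pre=[1.0, 2.0, 3.0],
    control_post=[2.0, 3.0, 4.0],
)
```

Subtracting the control group's change removes whatever shift both groups share, leaving the part attributable to the treatment under the usual parallel-trends assumption.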

5. Reconstructing A Large Scale 3D Face Dataset for Deep 3D Face Identification
Cuican Yu, Zihui Zhang, Huibin Li
Abstract: Deep learning methods have brought many breakthroughs to computer vision, especially in 2D face recognition. However, the bottleneck of deep learning based 3D face recognition is that it is difficult to collect millions of 3D faces, whether for industry or academia. In view of this situation, there are many methods to generate more 3D faces from existing 3D faces through 3D face data augmentation, which are used to train deep 3D face recognition models. However, to the best of our knowledge, there is no method to generate 3D faces from 2D face images for training deep 3D face recognition models. This letter focuses on the role of reconstructed 3D facial surfaces in 3D face identification and proposes a framework of 2D-aided deep 3D face identification. In particular, we propose to reconstruct millions of 3D face scans from a large scale 2D face database (i.e., VGGFace2), using a deep learning based 3D face reconstruction method (i.e., ExpNet). Then, we adopt a two-phase training approach: in the first phase, we use millions of face images to pre-train the deep convolutional neural network (DCNN), and in the second phase, we use normal component images (NCI) of reconstructed 3D face scans to train the DCNN. Extensive experimental results illustrate that the proposed approach can greatly improve the rank-1 score of 3D face identification on the FRGC v2.0, the Bosphorus, and the BU-3DFE 3D face databases, compared to the model trained by 2D face images. Finally, our proposed approach achieves state-of-the-art rank-1 scores on the FRGC v2.0 (97.6%), Bosphorus (98.4%), and BU-3DFE (98.8%) databases. The experimental results show that the reconstructed 3D facial surfaces are useful and our 2D-aided deep 3D face identification framework is meaningful, facing the scarcity of 3D faces.

6. Volumetric Calculation of Quantization Error in 3-D Vision Systems
Eleni Bohacek, Andrew J. Coates, David R. Selviah
Abstract: This paper investigates how the inherent quantization of camera sensors introduces uncertainty in the calculated position of an observed feature during 3-D mapping. It is typically assumed that pixels and scene features are points; however, a pixel is a two-dimensional area that maps onto multiple points in the scene. This uncertainty region is a bound for quantization error in the calculated point positions. Earlier studies calculated the volume of two intersecting pixel views, approximated as a cuboid, by projecting pyramids and cones from the pixels into the scene. In this paper, we reverse this approach by generating an array of scene points and calculating which scene points are detected by which pixel in each camera. This enables us to map the uncertainty regions for every pixel correspondence for a given camera system in one calculation, without approximating the complex shapes. The dependence of the volumes of the uncertainty regions on camera baseline length, focal length, pixel size, and distance to object shows that earlier studies overestimated the quantization error by at least a factor of two. For static camera systems the method can also be used to determine volumetric scene geometry without the need to calculate disparity maps.
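
To make the dependence on baseline, focal length, pixel size, and distance concrete, the sketch below uses the standard rectified pinhole stereo relation z = f·b / (d·p) and reports the depth interval implied by an assumed disparity tolerance. This is the simple one-dimensional textbook bound, not the volumetric calculation the paper describes; the function names and the default tolerance of one pixel are assumptions.

```python
def depth_from_disparity(focal_length_m, baseline_m, pixel_size_m, disparity_px):
    """Depth of a point for rectified pinhole stereo: z = f * b / (d * p)."""
    return focal_length_m * baseline_m / (disparity_px * pixel_size_m)

def depth_interval(focal_length_m, baseline_m, pixel_size_m, disparity_px,
                   tolerance_px=1.0):
    """Return (near, far) depths if the true disparity lies within +/- tolerance_px.

    tolerance_px=1.0 is an assumed bound (half a pixel of uncertainty per camera).
    """
    if disparity_px - tolerance_px <= 0:
        raise ValueError("disparity too small: far bound would be unbounded")
    near = depth_from_disparity(focal_length_m, baseline_m, pixel_size_m,
                                disparity_px + tolerance_px)
    far = depth_from_disparity(focal_length_m, baseline_m, pixel_size_m,
                               disparity_px - tolerance_px)
    return near, far

# Hypothetical camera: 10 mm focal length, 10 cm baseline, 10 micron pixels.
nominal = depth_from_disparity(0.01, 0.1, 1e-5, 10)
near, far = depth_interval(0.01, 0.1, 1e-5, 10)
```

Because depth is inversely proportional to disparity, the interval is asymmetric and widens quickly as the object moves away from the cameras.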

7. On the surprising similarities between supervised and self-supervised models
Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Matthias Bethge, Felix A. Wichmann, Wieland Brendel
Abstract: How do humans learn to acquire a powerful, flexible and robust representation of objects? While much of this process remains unknown, it is clear that humans do not require millions of object labels. Excitingly, recent algorithmic advancements in self-supervised learning now enable convolutional neural networks (CNNs) to learn useful visual object representations without supervised labels, too. In the light of this recent breakthrough, we here compare self-supervised networks to supervised models and human behaviour. We tested models on 15 generalisation datasets for which large-scale human behavioural data is available (130K highly controlled psychophysical trials). Surprisingly, current self-supervised CNNs share four key characteristics of their supervised counterparts: (1.) relatively poor noise robustness (with the notable exception of SimCLR), (2.) non-human category-level error patterns, (3.) non-human image-level error patterns (yet high similarity to supervised model errors) and (4.) a bias towards texture. Taken together, these results suggest that the strategies learned through today's supervised and self-supervised training objectives end up being surprisingly similar, but distant from human-like behaviour. That being said, we are clearly just at the beginning of what could be called a self-supervised revolution of machine vision, and we are hopeful that future self-supervised models behave differently from supervised ones, and---perhaps---more similar to robust human object recognition.

8. Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes
Li Yuan, Yichen Zhou, Shuning Chang, Ziyuan Huang, Yunpeng Chen, Xuecheng Nie, Tao Wang, Jiashi Feng, Shuicheng Yan
Abstract: Detecting and recognizing human action in videos with crowded scenes is a challenging problem due to the complex environment and diverse events. Prior works fail to deal with this problem in two respects: (1) they make little use of scene information; (2) they lack training data for crowded and complex scenes. In this paper, we focus on improving spatio-temporal action recognition by fully utilizing scene information and collecting new data. A top-down strategy is used to overcome these limitations. Specifically, we adopt a strong human detector to locate humans in each frame. We then apply action recognition models to learn the spatio-temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet, which can improve the generalization ability of our model. In addition, scene information is extracted by a semantic segmentation model to assist the process. As a result, our method achieved an average of 26.05 wf_mAP (ranking 1st in the ACM MM grand challenge 2020: Human in Events).

9. Learning Monocular Dense Depth from Events
Javier Hidalgo-Carrió, Daniel Gehrig, Davide Scaramuzza
Abstract: Event cameras are novel sensors that output brightness changes in the form of a stream of asynchronous events instead of intensity frames. Compared to conventional image sensors, they offer significant advantages: high temporal resolution, high dynamic range, no motion blur, and much lower bandwidth. Recently, learning-based approaches have been applied to event-based data, thus unlocking their potential and making significant progress in a variety of tasks, such as monocular depth prediction. Most existing approaches use standard feed-forward architectures to generate network predictions, which do not leverage the temporal consistency present in the event stream. We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods. In particular, our method generates dense depth predictions using a monocular setup, which has not been shown previously. We pretrain our model using a new dataset containing events and depth maps recorded in the CARLA simulator. We test our method on the Multi Vehicle Stereo Event Camera Dataset (MVSEC). Quantitative experiments show up to 50% improvement in average depth error with respect to previous event-based methods.

10. In Depth Bayesian Semantic Scene Completion
David Gillsjö, Kalle Åström
Abstract: This work studies Semantic Scene Completion, which aims to predict a 3D semantic segmentation of our surroundings, even though some areas are occluded. For this we construct a Bayesian Convolutional Neural Network (BCNN), which is not only able to perform the segmentation, but also to predict model uncertainty. This is an important feature not present in standard CNNs. We show on the MNIST dataset that the Bayesian approach performs equal to or better than the standard CNN in accuracy, precision and recall when processing digits unseen in the training phase, with the added benefit of better calibrated scores and the ability to express model uncertainty. We then show results for the Semantic Scene Completion task on the SUNCG dataset, where a category is introduced at test time. In this more complex task the Bayesian approach outperforms the standard CNN, showing a better Intersection over Union score and excelling in Average Precision and separation scores.
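
The Intersection over Union score mentioned above can be sketched for a single class over flattened label arrays as follows. The function name and the choice to return None when the class is absent from both prediction and ground truth are assumptions for this example, not details taken from the paper.

```python
def intersection_over_union(predicted, target, label):
    """IoU of one class between two equally long label sequences (e.g. flattened voxels)."""
    pred_idx = {i for i, value in enumerate(predicted) if value == label}
    true_idx = {i for i, value in enumerate(target) if value == label}
    union = pred_idx | true_idx
    if not union:
        return None  # class absent from both: IoU is undefined (assumed convention)
    return len(pred_idx & true_idx) / len(union)

# Hypothetical 4-voxel example with classes 0 (empty) and 1 (occupied).
predicted = [1, 1, 0, 0]
target = [1, 0, 0, 0]
iou_occupied = intersection_over_union(predicted, target, 1)
iou_empty = intersection_over_union(predicted, target, 0)
```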

11. Automated Iterative Training of Convolutional Neural Networks for Tree Skeleton Segmentation
Keenan Granland, Rhys Newbury, David Ting, Chao Chen
Abstract: Training of convolutional neural networks for semantic segmentation requires accurate pixel-wise labeling. Depending on the application, this can require large amounts of human effort. The human-in-the-loop method reduces labeling effort but still requires human intervention for a selection of images. This paper describes a new iterative training method: Automating-the-loop. Automating-the-loop aims to replicate the human adjustment in human-in-the-loop with an automated process, thereby removing human intervention during the iterative process and drastically reducing labeling effort. Using the application of segmented apple tree detection, we compare human-in-the-loop, Self Training Loop, Filtered-Self Training Loop (semi-supervised learning) and our proposed method, automating-the-loop. These methods are used to train U-Net, a deep learning based convolutional neural network. The results are presented and analyzed on both traditional performance metrics and a new metric, Horizontal Scan. It is shown that the new method of automating-the-loop greatly reduces the labeling effort while generating a network with comparable performance to both human-in-the-loop and completely manual labeling.

12. Real-Time Face & Eye Tracking and Blink Detection using Event Cameras
Cian Ryan, Brian O Sullivan, Amr Elrasad, Joe Lemley, Paul Kielty, Christoph Posch, Etienne Perot
Abstract: Event cameras contain emerging, neuromorphic vision sensors that capture local light intensity changes at each pixel, generating a stream of asynchronous events. This way of acquiring visual information constitutes a departure from traditional frame-based cameras and offers several significant advantages: low energy consumption, high temporal resolution, high dynamic range and low latency. Driver monitoring systems (DMS) are in-cabin safety systems designed to sense and understand a driver's physical and cognitive state. Event cameras are particularly suited to DMS due to their inherent advantages. This paper proposes a novel method to simultaneously detect and track faces and eyes for driver monitoring. A unique, fully convolutional recurrent neural network architecture is presented. To train this network, a synthetic event-based dataset, called Neuromorphic HELEN, is simulated with accurate bounding box annotations. Additionally, a method to detect and analyse drivers' eye blinks is proposed, exploiting the high temporal resolution of event cameras. Blinking behaviour provides greater insight into a driver's level of fatigue or drowsiness. We show that blinks have a unique temporal signature that can be better captured by event cameras.

13. Training Data Generating Networks: Linking 3D Shapes and Few-Shot Classification
Biao Zhang, Peter Wonka
Abstract: We propose a novel 3D shape representation for 3D shape reconstruction from a single image. Rather than predicting a shape directly, we train a network to generate a training set which will be fed into another learning algorithm to define the shape. Training data generating networks establish a link between few-shot learning and 3D shape analysis. We propose a novel meta-learning framework to jointly train the data generating network and other components. We improve upon recent work on standard benchmarks for 3D shape reconstruction, but our novel shape representation has many applications.

14. SF-UDA$^{3D}$: Source-Free Unsupervised Domain Adaptation for LiDAR-Based 3D Object Detection
Cristiano Saltori, Stéphane Lathuiliére, Nicu Sebe, Elisa Ricci, Fabio Galasso
Abstract: 3D object detectors based only on LiDAR point clouds hold the state-of-the-art on modern street-view benchmarks. However, LiDAR-based detectors generalize poorly across domains due to domain shift. In the case of LiDAR, in fact, domain shift is not only due to changes in the environment and in the object appearances, as for visual data from RGB cameras, but is also related to the geometry of the point clouds (e.g., point density variations). This paper proposes SF-UDA$^{3D}$, the first Source-Free Unsupervised Domain Adaptation (SF-UDA) framework to domain-adapt the state-of-the-art PointRCNN 3D detector to target domains for which we have no annotations (unsupervised) and for which we hold neither images nor annotations of the source domain (source-free). SF-UDA$^{3D}$ is novel in both respects. Our approach is based on pseudo-annotations, reversible scale-transformations and motion coherency. SF-UDA$^{3D}$ outperforms both previous domain adaptation techniques based on feature alignment and state-of-the-art 3D object detection methods which additionally use few-shot target annotations or target annotation statistics. This is demonstrated by extensive experiments on two large-scale datasets, i.e., KITTI and nuScenes.

15. HPERL: 3D Human Pose Estimation from RGB and LiDAR
Michael Fürst, Shriya T. P. Gupta, René Schuster, Oliver Wasenmüller, Didier Stricker
Abstract: In-the-wild human pose estimation has a huge potential for various fields, ranging from animation and action recognition to intention recognition and prediction for autonomous driving. The current state-of-the-art is focused only on RGB and RGB-D approaches for predicting the 3D human pose. However, not using precise LiDAR depth information limits the performance and leads to very inaccurate absolute pose estimation. With LiDAR sensors becoming more affordable and common on robots and autonomous vehicle setups, we propose an end-to-end architecture using RGB and LiDAR to predict the absolute 3D human pose with unprecedented precision. Additionally, we introduce a weakly-supervised approach to generate 3D predictions using 2D pose annotations from PedX [1]. This allows for many new opportunities in the field of 3D human pose estimation.

16. Human Perception-based Evaluation Criterion for Ultra-high Resolution Cell Membrane Segmentation
Ruohua Shi, Wenyao Wang, Zhixuan Li, Liuyuan He, Kaiwen Sheng, Lei Ma, Kai Du, Tingting Jiang, Tiejun Huang
Abstract: Computer vision technology is widely used in biological and medical data analysis and understanding. However, there are still two major bottlenecks in the field of cell membrane segmentation, which seriously hinder further research: lack of sufficient high-quality data and lack of suitable evaluation criteria. In order to solve these two problems, this paper first proposes an Ultra-high Resolution Image Segmentation dataset for the Cell membrane, called U-RISC, the largest annotated Electron Microscopy (EM) dataset for the Cell membrane with multiple iterative annotations and uncompressed high-resolution raw data. During the analysis process of the U-RISC, we found that the current popular segmentation evaluation criteria are inconsistent with human perception. This interesting phenomenon is confirmed by a subjective experiment involving twenty people. Furthermore, to resolve this inconsistency, we propose a new evaluation criterion called Perceptual Hausdorff Distance (PHD) to measure the quality of cell membrane segmentation results. A detailed performance comparison and discussion of classic segmentation methods, along with two iterative manual annotation results, under existing evaluation criteria and PHD are given.
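
The abstract does not define Perceptual Hausdorff Distance, so the sketch below shows only the classical symmetric Hausdorff distance between two point sets, which the name suggests PHD starts from. The function names are hypothetical, and this brute-force version is meant for small sets only.

```python
import math

def directed_hausdorff(points_a, points_b):
    """Largest distance from any point in points_a to its nearest point in points_b."""
    return max(min(math.dist(a, b) for b in points_b) for a in points_a)

def hausdorff_distance(points_a, points_b):
    """Classical symmetric Hausdorff distance (not the paper's perceptual variant)."""
    return max(directed_hausdorff(points_a, points_b),
               directed_hausdorff(points_b, points_a))

# Hypothetical boundary pixels from a predicted and a reference segmentation.
predicted_boundary = [(0, 0), (1, 0)]
reference_boundary = [(0, 0), (4, 0)]
distance = hausdorff_distance(predicted_boundary, reference_boundary)
```

The classical measure is driven entirely by the single worst-matched point, which may be one reason a perception-oriented variant is worth studying.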

17. ASMFS: Adaptive-Similarity-based Multi-modality Feature Selection for Classification of Alzheimer's Disease
Yuang Shi, Chen Zu, Mei Hong, Luping Zhou, Lei Wang, Xi Wu, Jiliu Zhou, Daoqiang Zhang, Yan Wang
Abstract: With the increasing amounts of high-dimensional heterogeneous data to be processed, multi-modality feature selection has become an important research direction in medical image analysis. Traditional methods usually depict the data structure using a fixed and predefined similarity matrix for each modality separately, without considering the potential relationship structure across different modalities. In this paper, we propose a novel multi-modality feature selection method, which performs feature selection and local similarity learning simultaneously. Specifically, a similarity matrix is learned by jointly considering different imaging modalities. At the same time, feature selection is conducted by imposing a sparse $\ell_{2,1}$ norm constraint. The effectiveness of our proposed joint learning method is demonstrated by experimental results on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset, where it outperforms existing state-of-the-art multi-modality approaches.
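
The l_{2,1} norm named in the abstract is commonly defined as the sum of the Euclidean norms of a matrix's rows. The sketch below computes it under the assumption that each row of the weight matrix corresponds to one feature; the function name and the example matrix are illustrative only.

```python
import math

def l21_norm(matrix):
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(x * x for x in row)) for row in matrix)

# Hypothetical weight matrix: three features (rows), two modalities or tasks (columns).
weights = [
    [3.0, 4.0],   # row norm 5
    [0.0, 0.0],   # row norm 0: this feature is effectively discarded
    [5.0, 12.0],  # row norm 13
]
penalty = l21_norm(weights)
```

Penalizing this quantity pushes entire rows toward zero at once, which is why it is often used to select the same features jointly across columns.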

18. New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Wei Chen, Weiping Wang, Li Liu, Michael S. Lew
Abstract: The focus of this survey is on the analysis of two modalities of multimodal deep learning: image and text. Unlike classic reviews of deep learning where monomodal image classifiers such as VGG, ResNet and the Inception module are central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. In addition, we analyze two aspects of the challenge in terms of better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial in overcoming the aforementioned challenges. Finally, we include several promising directions for future research.

19. Vid-ODE: Continuous-Time Video Generation with Neural Ordinary Differential Equation
Sunghyun Park, Kangyeol Kim, Junsoo Lee, Jaegul Choo, Joonseok Lee, Sookyung Kim, Edward Choi
Abstract: Video generation models often operate under the assumption of fixed frame rates, which leads to suboptimal performance when it comes to handling flexible frame rates (e.g., increasing the frame rate of more dynamic portions of the video as well as handling missing video frames). To resolve the restricted nature of existing video generation models' ability to handle arbitrary timesteps, we propose continuous-time video generation by combining neural ODE (Vid-ODE) with pixel-level video processing techniques. Using ODE-ConvGRU as an encoder, a convolutional version of the recently proposed neural ODE, which enables us to learn continuous-time dynamics, Vid-ODE can learn the spatio-temporal dynamics of input videos of flexible frame rates. The decoder integrates the learned dynamics function to synthesize video frames at any given timestep, where the pixel-level composition technique is used to maintain the sharpness of individual frames. With extensive experiments on four real-world video datasets, we verify that the proposed Vid-ODE outperforms state-of-the-art approaches under various video generation settings, both within the trained time range (interpolation) and beyond the range (extrapolation). To the best of our knowledge, Vid-ODE is the first work successfully performing continuous-time video generation using real-world videos.
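
The idea of integrating a dynamics function so that the state can be read out at arbitrary timesteps can be illustrated with a scalar toy problem. The sketch below uses a fixed-step Euler solver and a hand-written dynamics function in place of a learned network; the solver choice, the function names, and the step size are assumptions and say nothing about the solver Vid-ODE actually uses.

```python
def integrate_to_times(dynamics, initial_state, query_times, step=1e-3):
    """Euler-integrate dh/dt = dynamics(h, t) from t = 0 and return h at each query time.

    query_times must be non-negative and sorted in ascending order.
    """
    states = []
    state, t = initial_state, 0.0
    for target in query_times:
        while t < target:
            dt = min(step, target - t)  # shorten the last step to land on the target
            state = state + dt * dynamics(state, t)
            t += dt
        states.append(state)
    return states

# Toy dynamics standing in for a learned network: exponential decay dh/dt = -h.
def decay(h, t):
    return -h

# Query times need not be evenly spaced, mirroring flexible frame rates.
states = integrate_to_times(decay, 1.0, [0.25, 0.5, 1.0])
```

Query times inside the range seen during training correspond to interpolation, while later ones correspond to extrapolation in the sense used by the abstract.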

20. How many images do I need? Understanding how sample size per class affects deep learning model performance metrics for balanced designs in autonomous wildlife monitoring
Saleh Shahinfar, Paul Meek, Greg Falzon
Abstract: Deep learning (DL) algorithms are the state of the art in automated classification of wildlife camera trap images. The challenge is that ecologists cannot know in advance how many images per species they need to collect for model training in order to achieve their desired classification accuracy. In fact, there is limited empirical evidence in the context of camera trapping to demonstrate that increasing sample size will lead to improved accuracy. In this study we explore in depth the issues of deep learning model performance for progressively increasing per-class (species) sample sizes. We also provide ecologists with an approximation formula to estimate a priori how many images per animal species they need for a certain accuracy level. This will help ecologists with optimal allocation of resources and work, and with efficient study design. In order to investigate the effect of the number of training images, seven training sets with 10, 20, 50, 150, 500, 1000 images per class were designed. Six deep learning architectures, namely ResNet-18, ResNet-50, ResNet-152, DnsNet-121, DnsNet-161, and DnsNet-201, were trained and tested on a common exclusive testing set of 250 images per class. The whole experiment was repeated on three similar datasets from Australia, Africa and North America and the results were compared. Simple regression equations for use by practitioners to approximate model performance metrics are provided. Generalized additive models (GAM) are shown to be effective in modelling DL performance metrics based on the number of training images per class, tuning scheme and dataset. Key-words: Camera Traps, Deep Learning, Ecological Informatics, Generalised Additive Models, Learning Curves, Predictive Modelling, Wildlife.
Summary: Deep learning is the state of the art for classifying wildlife camera trap images, but ecologists cannot know in advance how many images per species they need for a target accuracy. The study trains six architectures on per-class training sets of increasing size, tests them on 250 images per class, and repeats the experiment on three datasets from Australia, Africa and North America. It provides simple regression equations for practitioners and shows that generalized additive models (GAM) effectively model DL performance metrics from the number of training images per class, the tuning scheme and the dataset.

21. Anisotropic Stroke Control for Multiple Artists Style Transfer
Xuanhong Chen, Xirui Yan, Naiyuan Liu, Ting Qiu, Bingbing Ni
Abstract: Though significant progress has been made in artistic style transfer, most existing methods have difficulty preserving semantic information in a fine-grained, locally consistent manner, especially when multiple artists' styles must be transferred within one single model. To circumvent this issue, we propose a Stroke Control Multi-Artist Style Transfer framework. First, we develop a multi-condition single-generator structure which performs multi-artist style transfer. Second, we design an Anisotropic Stroke Module (ASM) which realizes the dynamic adjustment of style-stroke between the non-trivial and the trivial regions. ASM endows the network with the ability of adaptive semantic-consistency among various styles. Third, we present a novel Multi-Scale Projection Discriminator to realize texture-level conditional generation. In contrast to the single-scale conditional discriminator, our discriminator is able to capture multi-scale texture clues to effectively distinguish a wide range of artistic styles. Extensive experimental results demonstrate the feasibility and effectiveness of our approach. Our framework can transform a photograph into oil paintings of different artistic styles via only ONE single model. Furthermore, the results have a distinctive artistic style and retain the anisotropic semantic information.
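Summary: Existing artistic style transfer methods struggle to preserve semantic information in a fine-grained, locally consistent way, especially when one model must transfer the styles of multiple artists. The paper proposes a Stroke Control Multi-Artist Style Transfer framework with a multi-condition single-generator structure, an Anisotropic Stroke Module (ASM) that dynamically adjusts style strokes between non-trivial and trivial regions, and a Multi-Scale Projection Discriminator for texture-level conditional generation. A single model can turn a photograph into oil paintings of different artistic styles while retaining anisotropic semantic information.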

22. Pose And Joint-Aware Action Recognition
Anshul Shah, Shlok Mishra, Ankan Bansal, Jun-Cheng Chen, Rama Chellappa, Abhinav Shrivastava
Abstract: Most human action recognition systems typically consider static appearances and motion as independent streams of information. In this paper, we consider the evolution of human pose and propose a method to better capture interdependence among skeleton joints. Our model extracts motion information from each joint independently, reweighs the information and finally performs inter-joint reasoning. The effectiveness of pose and joint-based representations is strengthened using a geometry-aware data augmentation technique which jitters pose heatmaps while retaining the dynamics of the action. Our best model gives an absolute improvement of 8.19% on JHMDB, 4.31% on HMDB and 1.55 mAP on Charades datasets over state-of-the-art methods using pose heat-maps alone. Fusing with RGB and flow streams leads to improvement over state-of-the-art. Our model also outperforms the baseline on Mimetics, a dataset with out-of-context videos by 1.14% while using only pose heatmaps. Further, to filter out clips irrelevant for action recognition, we re-purpose our model for clip selection guided by pose information and show improved performance using fewer clips.
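Summary: Most human action recognition systems treat static appearance and motion as independent streams of information. The paper instead models the evolution of human pose: it extracts motion information from each joint independently, reweighs it, and then performs inter-joint reasoning, supported by a geometry-aware data augmentation that jitters pose heatmaps while retaining the dynamics of the action. Using pose heatmaps alone, the best model improves on state-of-the-art methods by 8.19% on JHMDB, 4.31% on HMDB and 1.55 mAP on Charades, and beats the baseline on Mimetics by 1.14%. The model is also re-purposed for pose-guided clip selection, which improves performance with fewer clips.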

23. Taking A Closer Look at Synthesis: Fine-grained Attribute Analysis for Person Re-Identification
Suncheng Xiang, Yuzhuo Fu, Guanjie You, Ting Liu
Abstract: Person re-identification (re-ID) plays an important role in applications such as public security and video surveillance. Recently, learning from synthetic data, which benefits from the popularity of synthetic data engines, has achieved remarkable performance. However, in pursuit of high accuracy, academic researchers tend to focus on training with large-scale datasets at a high cost in time and labeling expense, while neglecting to explore the potential of efficient training from millions of synthetic samples. To facilitate development in this field, we reviewed the previously developed synthetic dataset GPR and built an improved one (GPR+) with a larger number of identities and distinguished attributes. Based on it, we quantitatively analyze the influence of dataset attributes on re-ID systems. To the best of our knowledge, this is among the first attempts to explicitly dissect person re-ID from the aspect of attributes on a synthetic dataset. This research gives a deeper understanding of the fundamental problems in person re-ID, and provides useful insights for dataset building and future practical usage.
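Summary: Person re-identification (re-ID) matters for public security and video surveillance, and learning from synthetic data has recently performed well. The authors revisit the synthetic dataset GPR and build an improved version (GPR+) with more identities and distinguished attributes, then use it to quantify how dataset attributes influence re-ID systems. They describe the work as among the first attempts to explicitly dissect person re-ID by attribute on a synthetic dataset, offering insights for dataset building and practical use.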

24. Semantic Editing On Segmentation Map Via Multi-Expansion Loss
Jianfeng He, Xuchao Zhang, Shuo Lei, Shuhui Wang, Qingming Huang, Chang-Tien Lu, Bei Xiao
Abstract: Semantic editing on segmentation map has been proposed as an intermediate interface for image generation, because it provides flexible and strong assistance in various image generation tasks. This paper aims to improve quality of edited segmentation map conditioned on semantic inputs. Even though recent studies apply global and local adversarial losses extensively to generate images for higher image quality, we find that they suffer from the misalignment of the boundary area in the mask area. To address this, we propose MExGAN for semantic editing on segmentation map, which uses a novel Multi-Expansion (MEx) loss implemented by adversarial losses on MEx areas. Each MEx area has the mask area of the generation as the majority and the boundary of original context as the minority. To boost convenience and stability of MEx loss, we further propose an Approximated MEx (A-MEx) loss. Besides, in contrast to previous model that builds training data for semantic editing on segmentation map with part of the whole image, which leads to model performance degradation, MExGAN applies the whole image to build the training data. Extensive experiments on semantic editing on segmentation map and natural image inpainting show competitive results on four datasets.
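Summary: Semantic editing on segmentation maps serves as an intermediate interface for image generation, and the paper aims to improve the quality of edited segmentation maps conditioned on semantic inputs. Because global and local adversarial losses suffer from misaligned boundaries in the mask area, the authors propose MExGAN, which applies adversarial losses on Multi-Expansion (MEx) areas that consist mostly of the generated mask area plus a minority of the original context boundary, along with an Approximated MEx (A-MEx) loss for convenience and stability. MExGAN builds its training data from whole images instead of image parts, and shows competitive results on four datasets for semantic editing on segmentation maps and natural image inpainting.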

25. Physics-informed GANs for Coastal Flood Visualization
Björn Lütjens, Brandon Leshchinskiy, Christian Requena-Mesa, Farrukh Chishtie, Natalia Díaz-Rodriguez, Océane Boulais, Aaron Piña, Dava Newman, Alexander Lavin, Yarin Gal, Chedy Raïssi
Abstract: As climate change increases the intensity of natural disasters, society needs better tools for adaptation. Floods, for example, are the most frequent natural disaster, but during hurricanes the area is largely covered by clouds and emergency managers must rely on nonintuitive flood visualizations for mission planning. To assist these emergency managers, we have created a deep learning pipeline that generates visual satellite images of current and future coastal flooding. We advanced a state-of-the-art GAN called pix2pixHD, such that it produces imagery that is physically-consistent with the output of an expert-validated storm surge model (NOAA SLOSH). By evaluating the imagery relative to physics-based flood maps, we find that our proposed framework outperforms baseline models in both physical-consistency and photorealism. While this work focused on the visualization of coastal floods, we envision the creation of a global visualization of how climate change will shape our earth.
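Summary: Floods are the most frequent natural disaster, yet during hurricanes clouds largely cover the affected area and emergency managers must rely on nonintuitive flood visualizations. The authors build a deep learning pipeline that generates visual satellite images of current and future coastal flooding, extending the GAN pix2pixHD so that its imagery is physically consistent with the output of an expert-validated storm surge model (NOAA SLOSH). Evaluated against physics-based flood maps, the framework outperforms baseline models in both physical consistency and photorealism.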

26. Human Segmentation with Dynamic LiDAR Data
Tao Zhong, Wonjik Kim, Masayuki Tanaka, Masatoshi Okutomi
Abstract: Consecutive LiDAR scans compose dynamic 3D sequences, which contain more abundant information than a single frame. Similar to the development history of image and video perception, dynamic 3D sequence perception is coming into view after inspiring research on static 3D data perception. This work proposes a spatio-temporal neural network for human segmentation with dynamic LiDAR point clouds. It takes a sequence of depth images as input. It has a two-branch structure, i.e., the spatial segmentation branch and the temporal velocity estimation branch. The velocity estimation branch is designed to capture motion cues from the input sequence and then propagate them to the other branch, so that the segmentation branch segments humans according to both spatial and temporal features. These two branches are jointly learned on a generated dynamic point cloud dataset for human recognition. Our work fills a gap in dynamic point cloud perception using the spherical representation of point clouds, and achieves high accuracy. The experiments indicate that introducing temporal features benefits the segmentation of dynamic point clouds.
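Summary: Consecutive LiDAR scans form dynamic 3D sequences that carry more information than a single frame. The paper proposes a spatio-temporal neural network for human segmentation that takes a sequence of depth images as input and has two jointly learned branches: a temporal velocity estimation branch that captures motion cues and passes them to a spatial segmentation branch. Trained on a generated dynamic point cloud dataset with a spherical point cloud representation, the network achieves high accuracy, and the experiments indicate that temporal features benefit segmentation.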

27. TextMage: The Automated Bangla Caption Generator Based On Deep Learning
Abrar Hasin Kamal, Md. Asifuzzaman Jishan, Nafees Mansoor
Abstract: Neural networks and deep learning have seen an upsurge of research in the past decade due to improved results. Generating text from a given image is a crucial task that requires combining two fields, computer vision and natural language processing, in order to understand an image and represent it using natural language. However, existing works have all been done on a particular lingual domain and on the same set of data. As a result, the systems being developed perform poorly on images that belong to specific locales' geographical context. TextMage is a system that is capable of understanding visual scenes that belong to the Bangladeshi geographical context and uses its knowledge to represent what it understands in Bengali. Hence, we have trained a model on our previously developed and published dataset named BanglaLekhaImageCaptions. This dataset contains 9,154 images along with two annotations for each image. In order to assess performance, the proposed model has been implemented and evaluated.
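Summary: Generating text from an image combines computer vision and natural language processing, but existing work targets a particular language domain and the same data, so systems perform poorly on images from specific geographical contexts. TextMage understands visual scenes from the Bangladeshi geographical context and describes them in Bengali. The model is trained on the authors' previously published BanglaLekhaImageCaptions dataset, which contains 9,154 images with two annotations each, and has been implemented and evaluated.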

28. Egok360: A 360 Egocentric Kinetic Human Activity Video Dataset
Keshav Bhandari, Mario A. DeLaGarza, Ziliang Zong, Hugo Latapie, Yan Yan
Abstract: Recently, there has been a growing interest in wearable sensors, which provide new research perspectives for 360° video analysis. However, the lack of 360° datasets in the literature hinders research in this field. To bridge this gap, in this paper we propose a novel Egocentric (first-person) 360° Kinetic human activity video dataset (EgoK360). The EgoK360 dataset contains annotations of human activity with different sub-actions, e.g., the activity Ping-Pong with four sub-actions: pickup-ball, hit, bounce-ball and serve. To the best of our knowledge, EgoK360 is the first dataset in the domain of first-person activity recognition with a 360° environmental setup, which will facilitate egocentric 360° video understanding. We provide experimental results and a comprehensive analysis of variants of the two-stream network for 360° egocentric activity recognition. The EgoK360 dataset can be downloaded from this https URL.
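Summary: Interest in wearable sensors is growing and opens new perspectives for 360° video analysis, but the lack of 360° datasets hinders research. The paper proposes EgoK360, an egocentric (first-person) 360° kinetic human activity video dataset annotated with activities and their sub-actions, for example Ping-Pong with pickup-ball, hit, bounce-ball and serve. The authors describe it as the first first-person activity recognition dataset with a 360° setup, and report experiments with variants of the two-stream network.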

29. Revisiting Optical Flow Estimation in 360 Videos
Keshav Bhandari, Ziliang Zong, Yan Yan
Abstract: 360 video analysis has become a significant research topic since the appearance of high-quality and low-cost 360 wearable devices. In this paper, we propose a novel LiteFlowNet360 architecture for optical flow estimation in 360 videos. We design LiteFlowNet360 as a domain adaptation framework from the perspective video domain to the 360 video domain. We adapt it using simple kernel transformation techniques inspired by the Kernel Transformer Network (KTN) to cope with the inherent distortion in 360 videos caused by the sphere-to-plane projection. First, we apply an incremental transformation of convolution layers in the feature pyramid network and show that further transformation in the inference and regularization layers is not important, hence limiting the network growth in terms of size and computation cost. Second, we refine the network by training with augmented data in a supervised manner. We perform data augmentation by projecting the images onto a sphere and re-projecting them to a plane. Third, we train LiteFlowNet360 in a self-supervised manner using target domain 360 videos. Experimental results show promising 360 video optical flow estimation using the proposed architecture.
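Summary: The paper proposes LiteFlowNet360, an architecture for optical flow estimation in 360 videos, designed as a domain adaptation framework from perspective video to 360 video. It uses kernel transformation techniques inspired by the Kernel Transformer Network (KTN) to handle the distortion caused by the sphere-to-plane projection, transforming only the convolution layers of the feature pyramid network to limit growth in size and computation. The network is then refined with supervised training on augmented data (images projected onto a sphere and re-projected to a plane) and with self-supervised training on target-domain 360 videos.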
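
The distortion mentioned in the abstract comes from flattening a sphere onto a plane. The abstract does not name the projection, so the sketch below assumes the common equirectangular layout, in which image columns map to longitude and rows to latitude. All function names are hypothetical and are not taken from LiteFlowNet360.

```python
import math

def pixel_to_sphere(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit vector (x, y, z)."""
    lon = (u / width) * 2.0 * math.pi - math.pi    # longitude in [-pi, pi)
    lat = math.pi / 2.0 - (v / height) * math.pi   # latitude in [-pi/2, pi/2]
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return x, y, z

def sphere_to_pixel(x, y, z, width, height):
    """Inverse mapping: unit vector back to equirectangular pixel coordinates."""
    lon = math.atan2(x, z)
    lat = math.asin(max(-1.0, min(1.0, y)))        # clamp against rounding error
    u = (lon + math.pi) / (2.0 * math.pi) * width
    v = (math.pi / 2.0 - lat) / math.pi * height
    return u, v

def horizontal_stretch(v, height):
    """Factor by which a row is stretched horizontally relative to the equator.

    Assumes the row is not exactly at a pole, where the factor is unbounded.
    """
    lat = math.pi / 2.0 - (v / height) * math.pi
    return 1.0 / math.cos(lat)
```

The stretch factor is 1 at the equator and grows towards the poles, which is one way to see why a convolution kernel tuned for perspective images needs adapting on 360 frames.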

30. Why Layer-Wise Learning is Hard to Scale-up and a Possible Solution via Accelerated Downsampling
Wenchi Ma, Miao Yu, Kaidong Li, Guanghui Wang
Abstract: Layer-wise learning, as an alternative to global back-propagation, is easy to interpret, analyze, and it is memory efficient. Recent studies demonstrate that layer-wise learning can achieve state-of-the-art performance in image classification on various datasets. However, previous studies of layer-wise learning are limited to networks with simple hierarchical structures, and the performance decreases severely for deeper networks like ResNet. This paper, for the first time, reveals the fundamental reason that impedes the scale-up of layer-wise learning is due to the relatively poor separability of the feature space in shallow layers. This argument is empirically verified by controlling the intensity of the convolution operation in local layers. We discover that the poorly-separable features from shallow layers are mismatched with the strong supervision constraint throughout the entire network, making the layer-wise learning sensitive to network depth. The paper further proposes a downsampling acceleration approach to weaken the poor learning of shallow layers so as to transfer the learning emphasis to deep feature space where the separability matches better with the supervision restraint. Extensive experiments have been conducted to verify the new finding and demonstrate the advantages of the proposed downsampling acceleration in improving the performance of layer-wise learning.
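Summary: Layer-wise learning is an interpretable, memory-efficient alternative to global back-propagation, but earlier studies were limited to simple hierarchical networks and performance drops sharply on deeper networks such as ResNet. The paper attributes this to the relatively poor separability of the feature space in shallow layers, which is mismatched with the strong supervision constraint applied throughout the network, and verifies the claim by controlling the intensity of the convolution operation in local layers. It proposes a downsampling acceleration approach that shifts the learning emphasis to the deep feature space, and reports experiments showing improved layer-wise learning performance.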

31. QReLU and m-QReLU: Two novel quantum activation functions to aid medical diagnostics
L. Parisi, D. Neagu, R. Ma, F. Campean
Abstract: The ReLU activation function (AF) has been extensively applied in deep neural networks, in particular Convolutional Neural Networks (CNN), for image classification despite its unresolved dying ReLU problem, which poses challenges to reliable applications. This issue has important implications for critical applications, such as those in healthcare. Recent approaches merely propose variations of the activation function that remain within the same unresolved dying ReLU challenge. This contribution reports a different research direction by investigating the development of an innovative quantum approach to the ReLU AF that avoids the dying ReLU problem by disruptive design. The Leaky ReLU was leveraged as a baseline on which the two quantum principles of entanglement and superposition were applied to derive the proposed Quantum ReLU (QReLU) and the modified-QReLU (m-QReLU) activation functions. Both QReLU and m-QReLU are implemented and made freely available in TensorFlow and Keras. This original approach is effective and validated extensively in case studies that facilitate the detection of COVID-19 and Parkinson Disease (PD) from medical images. The two novel AFs were evaluated in a two-layered CNN against nine ReLU-based AFs on seven benchmark datasets, including images of spiral drawings taken via graphic tablets from patients with Parkinson Disease and healthy subjects, and point-of-care ultrasound images of the lungs of patients with COVID-19, those with pneumonia and healthy controls. Despite a higher computational cost, results indicated an overall higher classification accuracy, precision, recall and F1-score brought about by either quantum AF on five of the seven benchmark datasets, thus demonstrating their potential to be the new benchmark or gold standard AF in CNNs and to aid image classification tasks involved in critical applications, such as medical diagnoses of COVID-19 and PD.
Summary: ReLU is widely used in deep networks, especially CNNs for image classification, despite the unresolved dying ReLU problem, which matters for critical applications such as healthcare. Starting from Leaky ReLU, the authors apply the quantum principles of entanglement and superposition to derive two activation functions, Quantum ReLU (QReLU) and modified-QReLU (m-QReLU), and release both in TensorFlow and Keras. Evaluated in a two-layer CNN against nine ReLU-based activation functions on seven benchmark datasets, including Parkinson Disease spiral drawings and COVID-19 lung ultrasound images, the new functions gave higher overall accuracy, precision, recall and F1-score on five of the seven datasets, at a higher computational cost.
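
The dying ReLU problem can be seen directly in the gradients. The sketch below covers only plain ReLU and the Leaky ReLU baseline; it does not implement QReLU or m-QReLU, whose formulas are not given in the abstract. The slope `alpha=0.01` is a commonly used default and is an assumption here.

```python
def relu(x):
    """Standard ReLU: passes positive inputs, zeroes the rest."""
    return x if x > 0 else 0.0

def relu_grad(x):
    """Gradient of ReLU: exactly zero for non-positive inputs."""
    return 1.0 if x > 0 else 0.0

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keeps a small slope (alpha, assumed 0.01) for negative inputs."""
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: never exactly zero when alpha > 0."""
    return 1.0 if x > 0 else alpha
```

A unit whose pre-activation stays negative gets a zero gradient under ReLU and so stops updating, whereas Leaky ReLU still passes a small gradient.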

32. On the Exploration of Incremental Learning for Fine-grained Image Retrieval
Wei Chen, Yu Liu, Weiping Wang, Tinne Tuytelaars, Erwin M. Bakker, Michael Lew
Abstract: In this paper, we consider the problem of fine-grained image retrieval in an incremental setting, when new categories are added over time. On the one hand, repeatedly training the representation on the extended dataset is time-consuming. On the other hand, fine-tuning the learned representation only with the new classes leads to catastrophic forgetting. To this end, we propose an incremental learning method to mitigate retrieval performance degradation caused by the forgetting issue. Without accessing any samples of the original classes, the classifier of the original network provides soft "labels" to transfer knowledge to train the adaptive network, so as to preserve the previous capability for classification. More importantly, a regularization function based on Maximum Mean Discrepancy is devised to minimize the discrepancy of new classes features from the original network and the adaptive network, respectively. Extensive experiments on two datasets show that our method effectively mitigates the catastrophic forgetting on the original classes while achieving high performance on the new classes.
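Summary: The paper studies fine-grained image retrieval in an incremental setting where new categories arrive over time. Retraining on the extended dataset is time-consuming, while fine-tuning on the new classes alone causes catastrophic forgetting. Without accessing samples of the original classes, the proposed method uses the original network's classifier to provide soft "labels" for training the adaptive network, and adds a regularization function based on Maximum Mean Discrepancy to minimize the discrepancy between the new-class features of the original and adaptive networks. Experiments on two datasets show that it mitigates forgetting on the original classes while performing well on the new ones.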
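
Maximum Mean Discrepancy (MMD) measures how far apart two sets of feature vectors are in a kernel-induced space. The sketch below uses the standard biased estimator with a Gaussian kernel. The kernel and the bandwidth `sigma` are assumptions, since the abstract does not state which ones the paper uses.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(ps, qs):
        total = sum(gaussian_kernel(p, q, sigma) for p in ps for q in qs)
        return total / (len(ps) * len(qs))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)
```

In a regularizer of the kind the abstract describes, `xs` and `ys` would be features of the same new-class images taken from the original and the adaptive network, and the returned value would be added to the training loss.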

33. Impact of Action Unit Occurrence Patterns on Detection
Saurabh Hinduja, Shaun Canavan, Saandeep Aathreya
Abstract: Detecting action units is an important task in face analysis, especially in facial expression recognition. This is due, in part, to the idea that expressions can be decomposed into multiple action units. In this paper we investigate the impact of action unit occurrence patterns on detection of action units. To facilitate this investigation, we review state of the art literature, for AU detection, on 2 state-of-the-art face databases that are commonly used for this task, namely DISFA, and BP4D. Our findings, from this literature review, suggest that action unit occurrence patterns strongly impact evaluation metrics (e.g. F1-binary). Along with the literature review, we also conduct multi and single action unit detection, as well as propose a new approach to explicitly train deep neural networks using the occurrence patterns to boost the accuracy of action unit detection. These experiments validate that action unit patterns directly impact the evaluation metrics.
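Summary: Detecting action units (AUs) is an important task in face analysis, especially facial expression recognition, because expressions can be decomposed into multiple action units. The paper reviews state-of-the-art AU detection literature on two commonly used face databases, DISFA and BP4D, and finds that AU occurrence patterns strongly affect evaluation metrics such as F1-binary. The authors also run multi- and single-AU detection experiments and propose explicitly training deep neural networks with the occurrence patterns to boost detection accuracy.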

34. Quantifying the Extent to Which Race and Gender Features Determine Identity in Commercial Face Recognition Algorithms
John J. Howard, Yevgeniy B. Sirotin, Jerry L. Tipton, Arun R. Vemury
Abstract: Human face features can be used to determine individual identity as well as demographic information like gender and race. However, the extent to which black-box commercial face recognition algorithms (CFRAs) use gender and race features to determine identity is poorly understood despite increasing deployments by government and industry. In this study, we quantified the degree to which gender and race features influenced face recognition similarity scores between different people, i.e. non-mated scores. We ran this study using five different CFRAs and a sample of 333 diverse test subjects. As a control, we compared the behavior of these non-mated distributions to a commercial iris recognition algorithm (CIRA). Confirming prior work, all CFRAs produced higher similarity scores for people of the same gender and race, an effect known as "broad homogeneity". No such effect was observed for the CIRA. Next, we applied principal components analysis (PCA) to similarity score matrices. We show that some principal components (PCs) of CFRAs cluster people by gender and race, but the majority do not. Demographic clustering in the PCs accounted for only 10 % of the total CFRA score variance. No clustering was observed for the CIRA. This demonstrates that, although CFRAs use some gender and race features to establish identity, most features utilized by current CFRAs are unrelated to gender and race, similar to the iris texture patterns utilized by the CIRA. Finally, reconstruction of similarity score matrices using only PCs that showed no demographic clustering reduced broad homogeneity effects, but also decreased the separation between mated and non-mated scores. This suggests it's possible for CFRAs to operate on features unrelated to gender and race, albeit with somewhat lower recognition accuracy, but that this is not the current commercial practice.
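Summary: The study quantifies how much gender and race features influence non-mated similarity scores (scores between different people) in five black-box commercial face recognition algorithms (CFRAs), using 333 diverse test subjects and a commercial iris recognition algorithm (CIRA) as a control. All CFRAs gave higher similarity scores to people of the same gender and race, an effect known as "broad homogeneity", which the CIRA did not show. Principal components analysis of the score matrices found that demographic clustering accounted for only 10% of total CFRA score variance. Reconstructing scores from the principal components without demographic clustering reduced broad homogeneity but also reduced the separation between mated and non-mated scores, suggesting that CFRAs could operate on features unrelated to gender and race at somewhat lower accuracy, although this is not current commercial practice.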
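
The "broad homogeneity" effect can be checked by comparing average non-mated scores within and across demographic groups. The sketch below is a simplified illustration, not the study's analysis pipeline. The function name is hypothetical, and the scores in the usage note are made-up toy values, not results from the paper.

```python
def mean_nonmated_scores(scores, groups):
    """Average similarity of same-group and different-group pairs.

    scores: symmetric matrix (list of lists) of similarity scores between
            different people; groups: one demographic label per person.
    Assumes at least one same-group and one different-group pair exists.
    """
    same, diff = [], []
    n = len(groups)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered pair once
            bucket = same if groups[i] == groups[j] else diff
            bucket.append(scores[i][j])
    return sum(same) / len(same), sum(diff) / len(diff)

# Toy example with invented scores for four people in two groups.
toy_groups = ["A", "A", "B", "B"]
toy_scores = [
    [1.00, 0.30, 0.10, 0.12],
    [0.30, 1.00, 0.08, 0.10],
    [0.10, 0.08, 1.00, 0.28],
    [0.12, 0.10, 0.28, 1.00],
]
same_mean, diff_mean = mean_nonmated_scores(toy_scores, toy_groups)
```

A same-group mean well above the different-group mean is the pattern the abstract calls broad homogeneity.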

35. Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement
Yongqing Liang, Xin Li, Navid Jafari, Qin Chen
Abstract: We propose a new matching-based framework for semi-supervised video object segmentation (VOS). Recently, state-of-the-art VOS performance has been achieved by matching-based algorithms, in which feature banks are created to store features for region matching and classification. However, how to effectively organize information in the continuously growing feature bank remains under-explored, and this leads to inefficient design of the bank. We introduce an adaptive feature bank update scheme to dynamically absorb new features and discard obsolete features. We also design a new confidence loss and a fine-grained segmentation module to enhance the segmentation accuracy in uncertain regions. On public benchmarks, our algorithm outperforms existing state-of-the-art methods.
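Summary: The paper proposes a matching-based framework for semi-supervised video object segmentation (VOS). Because organizing information in a continuously growing feature bank remains under-explored, it introduces an adaptive feature bank update scheme that absorbs new features and discards obsolete ones, plus a confidence loss and a fine-grained segmentation module for uncertain regions. On public benchmarks it outperforms existing state-of-the-art methods.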

36. Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding
Alexander Ku, Peter Anderson, Roma Patel, Eugene Ie, Jason Baldridge
Abstract: We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets. It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities. Furthermore, each word in an instruction is time-aligned to the virtual poses of instruction creators and validators. We establish baseline scores for monolingual and multilingual settings and multitask learning when including Room-to-Room annotations. We also provide results for a model that learns from synchronized pose traces by focusing only on portions of the panorama attended to in human demonstrations. The size, scope and detail of RxR dramatically expands the frontier for research on embodied language agents in simulated, photo-realistic environments.
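Summary: Room-Across-Room (RxR) is a new Vision-and-Language Navigation (VLN) dataset that is multilingual (English, Hindi and Telugu) and larger than other VLN datasets, with more paths and instructions. It addresses known biases in paths, elicits more references to visible entities, and time-aligns each instruction word to the virtual poses of instruction creators and validators. The authors report baseline scores for monolingual, multilingual and multitask settings (including Room-to-Room annotations), and results for a model that learns from the synchronized pose traces.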

37. Convolutional Neural Network for Blur Images Detection as an Alternative for Laplacian Method
Tomasz Szandala
Abstract: With the prevalence of digital cameras, the number of digital images increases quickly, which raises the demand for non-manual image quality assessment. While there are many methods considered useful for detecting blurriness, in this paper we propose and evaluate a new method that uses a deep convolutional neural network, which can determine whether an image is blurry or not. Experimental results demonstrate the effectiveness of the proposed scheme and are compared to deterministic methods using the confusion matrix.
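Summary: As digital cameras spread, the number of digital images grows quickly, raising the demand for automated image quality assessment. The paper proposes and evaluates a deep convolutional neural network that decides whether an image is blurry. Experimental results demonstrate the effectiveness of the scheme, and it is compared with deterministic methods using the confusion matrix.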
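
For context on the deterministic baseline named in the title, the sketch below shows the widely used variance-of-Laplacian blur heuristic. The abstract does not spell out which Laplacian variant the paper compares against, so this is an assumption. The function name is hypothetical, and the input is a plain 2D list of grayscale values with at least 3 rows and 3 columns.

```python
def laplacian_variance(image):
    """Variance of the 4-neighbour Laplacian over interior pixels.

    Low values suggest few sharp edges, i.e. a likely blurry image.
    """
    height, width = len(image), len(image[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (image[y - 1][x] + image[y + 1][x]
                   + image[y][x - 1] + image[y][x + 1]
                   - 4 * image[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

An image would be flagged as blurry when this value falls below a threshold. That threshold depends on the dataset and has to be tuned, which is one motivation for a learned classifier.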

38. Latent Vector Recovery of Audio GANs
Andrew Keyes, Nicky Bayat, Vahid Reza Khazaie, Yalda Mohsenzadeh
Abstract: Advanced Generative Adversarial Networks (GANs) are remarkable in generating intelligible audio from a random latent vector. In this paper, we examine the task of recovering the latent vector of both synthesized and real audio. Previous works recovered latent vectors of given audio through an auto-encoder inspired technique that trains an encoder network either in parallel with the GAN or after the generator is trained. With our approach, we train a deep residual neural network architecture to project audio synthesized by WaveGAN into the corresponding latent space with near identical reconstruction performance. To accommodate for the lack of an original latent vector for real audio, we optimize the residual network on the perceptual loss between the real audio samples and the reconstructed audio of the predicted latent vectors. In the case of synthesized audio, the Mean Squared Error (MSE) between the ground truth and recovered latent vector is minimized as well. We further investigated the audio reconstruction performance when several gradient optimization steps are applied to the predicted latent vector. Through our deep neural network based method of training on real and synthesized audio, we are able to predict a latent vector that corresponds to a reasonable reconstruction of real audio. Even though we evaluated our method on WaveGAN, our proposed method is universal and can be applied to any other GANs.
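Summary: The paper studies recovering the latent vector of both synthesized and real audio for Generative Adversarial Networks (GANs). It trains a deep residual network to project audio synthesized by WaveGAN into the corresponding latent space with near-identical reconstruction. For real audio, which has no original latent vector, the network is optimized on the perceptual loss between real samples and the audio reconstructed from the predicted latent vectors. For synthesized audio, the mean squared error (MSE) between the ground-truth and recovered latent vectors is minimized as well. The authors also examine reconstruction quality after several gradient optimization steps on the predicted latent vector, and state that the method can be applied to other GANs.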
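
The gradient refinement step mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the paper's method. It replaces WaveGAN with a hypothetical linear generator G(z) = Wz and the perceptual loss with a plain squared error, so that the gradient can be written by hand.

```python
def generate(weights, z):
    """Toy linear 'generator': output = W z (stand-in for a real GAN)."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

def recover_latent(weights, target, steps=500, lr=0.05):
    """Gradient descent on ||G(z) - target||^2 starting from z = 0."""
    z = [0.0] * len(weights[0])
    for _ in range(steps):
        residual = [g - t for g, t in zip(generate(weights, z), target)]
        # Gradient of the squared error with respect to z is 2 * W^T * residual.
        grad = [2.0 * sum(weights[i][j] * residual[i] for i in range(len(weights)))
                for j in range(len(z))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

# Usage: recover a known latent vector from its generated output.
toy_weights = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
true_z = [0.5, -1.0]
recovered = recover_latent(toy_weights, generate(toy_weights, true_z))
```

With a real generator the gradient would come from automatic differentiation, and the starting point would be the encoder's predicted latent vector instead of zero.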

39. G-DARTS-A: Groups of Channel Parallel Sampling with Attention
Zhaowen Wang, Wei Zhang, Zhiming Wang
Abstract: Differentiable Architecture Search (DARTS) provides a baseline for gradient-based search of effective network architectures, but it is accompanied by huge computational overhead in searching and training the network architecture. Recently, many novel works have improved DARTS. In particular, Partially-Connected DARTS (PC-DARTS) proposed the partial channel sampling technique, which achieved good results. In this work, we found that the backbone provided by DARTS is prone to overfitting. To mitigate this problem, we propose an approach named Group-DARTS with Attention (G-DARTS-A), which uses multiple groups of channels for searching. Inspired by the partial sampling strategy of PC-DARTS, we use groups of channels to sample the super-network, performing a more efficient search while maintaining the relative integrity of the network information. In order to relieve the competition between channel groups and keep the channels balanced, we follow the attention mechanism of the Squeeze-and-Excitation Network. Each group of channels shares defined weights, so the groups can provide different suggestions for searching. The searched architecture is more powerful and better adapted to different deployments. Specifically, by only using the attention module on DARTS we achieved an error rate of 2.82%/16.36% on CIFAR10/100 with 0.3 GPU-days for the search process on CIFAR10. Applying our G-DARTS-A to DARTS/PC-DARTS achieves an error rate of 2.57%/2.61% on CIFAR10 with 0.5/0.4 GPU-days.
摘要：可微架构搜索（飞镖）提供了搜索基于梯度高效的网络架构的基准，但伴随着搜索和培训网络体系结构庞大的计算开销。最近，许多新颖的作品有改善飞镖。具体地，部分连接的飞镖（PC-DARTS）提出了取得良好效果的部分信道抽样技术。在这项工作中，我们发现，飞镖提供的骨干容易发生过度拟合。为了缓解这一问题，我们提出了一个名为集团飞镖有注意力（G-DARTS-A）的方法，利用渠道的多组进行搜索。通过PC-称身短缝的部分的采样策略启发，我们使用组通道以采样超级网络来执行更有效的搜索，同时保持的所述网络信息相对完整性。为了缓解频道小组及声道平衡之间的竞争，我们遵循的挤压和激励网络的注意机制。各组的权重定义通道股那里它们可以用于搜索提供不同的建议。搜索架构更强大，更好地适应不同的部署。特别是，仅使用注意模块上的飞镖，我们与0.3GPU日达到2.82％/ 16.36％的错误率上CIFAR10 / 100上CIFAR10搜索过程。应用我们的G-DARTS-A飞镖/ PC-飞镖，2.57％/ 2.61％的错误率上CIFAR10用0.5 / 0.4 GPU-天得以实现。

40. VolumeNet: A Lightweight Parallel Network for Super-Resolution of Medical Volumetric Data [PDF] Back to contents
Yinhao Li, Yutaro Iwamoto, Lanfen Lin, Rui Xu, Yen-Wei Chen
Abstract: Deep learning-based super-resolution (SR) techniques have generally achieved excellent performance in the computer vision field. Recently, it has been proven that three-dimensional (3D) SR for medical volumetric data delivers better visual results than conventional two-dimensional (2D) processing. However, deepening and widening 3D networks increases training difficulty significantly due to the large number of parameters and small number of training samples. Thus, we propose a 3D convolutional neural network (CNN) for SR of medical volumetric data called ParallelNet using parallel connections. We construct a parallel connection structure based on the group convolution and feature aggregation to build a 3D CNN that is as wide as possible with few parameters. As a result, the model thoroughly learns more feature maps with larger receptive fields. In addition, to further improve accuracy, we present an efficient version of ParallelNet (called VolumeNet), which reduces the number of parameters and deepens ParallelNet using a proposed lightweight building block module called the Queue module. Unlike most lightweight CNNs based on depthwise convolutions, the Queue module is primarily constructed using separable 2D cross-channel convolutions. As a result, the number of network parameters and computational complexity can be reduced significantly while maintaining accuracy due to full channel fusion. Experimental results demonstrate that the proposed VolumeNet significantly reduces the number of model parameters and achieves high precision results compared to state-of-the-art methods.
Summary: ParallelNet is a 3D CNN for super-resolution of medical volumetric data that uses group convolution and feature aggregation to stay wide with few parameters. VolumeNet is its lightweight, deeper variant built from the proposed Queue module, which relies on separable 2D cross-channel convolutions.
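
The parameter saving that group convolution brings can be checked with simple arithmetic. The sketch below is only an illustration: the channel counts, kernel size, and group number are assumptions chosen for the example and are not taken from the paper, and `conv3d_params` is a hypothetical helper.

```python
def conv3d_params(c_in, c_out, kernel, groups=1):
    """Weight count of a 3D convolution with a cubic kernel (bias ignored).

    With `groups` > 1, each output channel only sees c_in / groups inputs.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return c_out * (c_in // groups) * kernel ** 3


# Illustrative sizes only (not from the paper).
dense = conv3d_params(64, 64, 3)
grouped = conv3d_params(64, 64, 3, groups=4)
print(dense, grouped, dense // grouped)
```

With four groups the layer keeps the same width but needs a quarter of the weights, which is the kind of budget that lets a 3D network be widened without a matching growth in parameters.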

41. Learning Accurate Entropy Model with Global Reference for Image Compression [PDF] Back to contents
Yichen Qian, Zhiyu Tan, Xiuyu Sun, Ming Lin, Dongyang Li, Zhenhong Sun, Hao Li, Rong Jin
Abstract: In recent deep image compression neural networks, the entropy model plays a critical role in estimating the prior distribution of deep image encodings. Existing methods combine hyperprior with local context in the entropy estimation function. This greatly limits their performance due to the absence of a global vision. In this work, we propose a novel Global Reference Model for image compression to effectively leverage both the local and the global context information, leading to an enhanced compression rate. The proposed method scans decoded latents and then finds the most relevant latent to assist in estimating the distribution of the current latent. A by-product of this work is the innovation of a mean-shifting GDN module that further improves the performance. Experimental results demonstrate that the proposed model outperforms the rate-distortion performance of most of the state-of-the-art methods in the industry.
Summary: A Global Reference Model scans previously decoded latents and uses the most relevant one to help estimate the distribution of the current latent, adding global context to the hyperprior and local context. A mean-shifting GDN module further improves performance.
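
Why a more accurate entropy model improves the compression rate can be seen from the ideal code length of a symbol, which is -log2 of the probability the model assigns to it. The sketch below is a toy illustration under that standard assumption; the symbol stream and both probability tables are invented for the example and `estimated_bits` is a hypothetical helper, not the paper's model.

```python
import math


def estimated_bits(symbols, prob):
    """Ideal total code length, in bits, of `symbols` under the model `prob`."""
    return sum(-math.log2(prob[s]) for s in symbols)


# Toy latent stream and two candidate entropy models (illustrative only).
stream = [0, 0, 1, 0]
flat_model = {0: 0.5, 1: 0.5}      # ignores the statistics of the stream
sharp_model = {0: 0.75, 1: 0.25}   # matches the stream's statistics better

print(estimated_bits(stream, flat_model))
print(estimated_bits(stream, sharp_model))
```

The sharper model spends fewer bits on the same stream, which is the effect any better-conditioned prior, including one that consults a global reference, aims for.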

42. Towards truly local gradients with CLAPP: Contrastive, Local And Predictive Plasticity [PDF] Back to contents
Bernd Illing, Wulfram Gerstner, Guillaume Bellec
Abstract: Back-propagation (BP) is costly to implement in hardware and implausible as a learning rule implemented in the brain. However, BP is surprisingly successful in explaining neuronal activity patterns found along the cortical processing stream. We propose a locally implementable, unsupervised learning algorithm, CLAPP, which minimizes a simple, layer-specific loss function, and thus does not need to back-propagate error signals. The weight updates only depend on state variables of the pre- and post-synaptic neurons and a layer-wide third factor. Networks trained with CLAPP build deep hierarchical representations of images and speech.
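Summary: CLAPP is a locally implementable, unsupervised learning rule that minimizes a simple layer-specific loss, so no error signals need to be back-propagated. Weight updates depend only on pre- and post-synaptic state variables and a layer-wide third factor, and the trained networks build deep hierarchical representations of images and speech.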

43. Auxiliary Task Reweighting for Minimum-data Learning [PDF] Back to contents
Baifeng Shi, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu
Abstract: Supervised learning requires a large amount of training data, limiting its application where labeled data is scarce. To compensate for data scarcity, one possible method is to utilize auxiliary tasks to provide additional supervision for the main task. Assigning and optimizing the importance weights for different auxiliary tasks remains a crucial and largely understudied research question. In this work, we propose a method to automatically reweight auxiliary tasks in order to reduce the data requirement on the main task. Specifically, we formulate the weighted likelihood function of auxiliary tasks as a surrogate prior for the main task. By adjusting the auxiliary task weights to minimize the divergence between the surrogate prior and the true prior of the main task, we obtain a more accurate prior estimation, achieving the goal of minimizing the required amount of training data for the main task and avoiding a costly grid search. In multiple experimental settings (e.g. semi-supervised learning, multi-label classification), we demonstrate that our algorithm can effectively utilize limited labeled data of the main task with the benefit of auxiliary tasks compared with previous task reweighting methods. We also show that under extreme cases with only a few extra examples (e.g. few-shot domain adaptation), our algorithm results in significant improvement over the baseline.
Summary: Auxiliary tasks are reweighted automatically by treating their weighted likelihood as a surrogate prior for the main task and minimizing its divergence from the true prior. This reduces the labeled data needed for the main task and avoids a costly grid search over weights.

44. How Does Supernet Help in Neural Architecture Search? [PDF] Back to contents
Yuge Zhang, Quanlu Zhang, Yaming Yang
Abstract: With the success of Neural Architecture Search (NAS), weight sharing, as an approach to speed up architecture performance estimation, has received wide attention. Instead of training each architecture separately, weight sharing builds a supernet that assembles all the architectures as its submodels. However, there has been debate over whether the NAS process actually benefits from weight sharing, due to the gap between supernet optimization and the objective of NAS. To further understand the effect of weight sharing on NAS, we conduct a comprehensive analysis on five search spaces, including NAS-Bench-101, NAS-Bench-201, DARTS-CIFAR10, DARTS-PTB, and ProxylessNAS. Moreover, we take a step forward to explore the pruning-based NAS algorithms. Some of our key findings are summarized as: (i) A well-trained supernet is not necessarily a good architecture-ranking model. (ii) Supernet is good at finding relatively good (top-10%) architectures but struggles to find the best ones (top-1% or less). (iii) The effectiveness of supernet largely depends on the design of the search space itself. (iv) Compared to selecting the best architectures, supernet is more confident in pruning the worst ones. (v) It is easier to find better architectures from an effectively pruned search space with supernet training. We expect the observations and insights obtained in this work to inspire and help better NAS algorithm design.
Summary: An analysis of weight sharing across five NAS search spaces finds that a supernet ranks architectures imperfectly: it finds relatively good (top-10%) architectures but struggles to find the best (top-1% or less), and it is more reliable at pruning the worst ones.
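
One way to make "good architecture-ranking model" concrete is a rank correlation between the accuracies a supernet predicts and the accuracies obtained by training each architecture on its own. The abstract does not say which metric the authors use, so the choice of Kendall's tau below is an assumption, and all accuracy values are invented for illustration.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        direction = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical accuracies for four architectures.
supernet_estimate = [0.60, 0.62, 0.61, 0.70]
standalone_truth = [0.90, 0.93, 0.94, 0.95]
print(kendall_tau(supernet_estimate, standalone_truth))
```

A value of 1 means the supernet orders every pair of architectures correctly, and a value near 0 means its ordering carries little information about the true ranking.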

45. Manipulation-Oriented Object Perception in Clutter through Affordance Coordinate Frames [PDF] Back to contents
Xiaotong Chen, Kaizhi Zheng, Zhen Zeng, Shreshtha Basu, James Cooney, Jana Pavlasek, Odest Chadwicke Jenkins
Abstract: In order to enable robust operation in unstructured environments, robots should be able to generalize manipulation actions to novel object instances. For example, to pour and serve a drink, a robot should be able to recognize novel containers which afford the task. Most importantly, robots should be able to manipulate these novel containers to fulfill the task. To achieve this, we aim to provide robust and generalized perception of object affordances and their associated manipulation poses for reliable manipulation. In this work, we combine the notions of affordance and category-level pose, and introduce the Affordance Coordinate Frame (ACF). With ACF, we represent each object class in terms of individual affordance parts and the compatibility between them, where each part is associated with a part category-level pose for robot manipulation. In our experiments, we demonstrate that ACF outperforms state-of-the-art methods for object detection, as well as category-level pose estimation for object parts. We further demonstrate the applicability of ACF to robot manipulation tasks through experiments in a simulated environment.
Summary: The Affordance Coordinate Frame (ACF) combines affordance and category-level pose, representing each object class by its affordance parts and their compatibility. ACF outperforms state-of-the-art methods for object detection and part pose estimation and is applied to robot manipulation tasks in simulation.

46. Extracting Signals of Higgs Boson From Background Noise Using Deep Neural Networks [PDF] Back to contents
Muhammad Abbas, Asifullah Khan, Aqsa Saeed Qureshi, Muhammad Waleed Khan
Abstract: The Higgs boson is a fundamental particle, and the classification of Higgs signals is a well-known problem in high energy physics. The identification of the Higgs signal is a challenging task because the signal resembles the background signals. This study proposes a Higgs signal classification using a novel combination of random forest, auto encoder and deep auto encoder to build a robust and generalized Higgs boson prediction system to discriminate the Higgs signal from the background noise. The proposed ensemble technique is based on achieving diversity in the decision space, and the results show good discrimination power on the private leaderboard, achieving an area under the Receiver Operating Characteristic curve of 0.9 and an Approximate Median Significance score of 3.429.
Summary: An ensemble of random forest, auto encoder, and deep auto encoder separates Higgs signals from background noise. On the private leaderboard it reaches an area under the ROC curve of 0.9 and an Approximate Median Significance score of 3.429.
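
For readers unfamiliar with the Approximate Median Significance (AMS), the sketch below computes it as defined in the Higgs Boson Machine Learning Challenge, with the regularization term set to 10. That this is the exact variant behind the reported 3.429 is an assumption based on the abstract's mention of a private leaderboard; the example counts are invented.

```python
import math


def ams(s, b, b_reg=10.0):
    """Approximate Median Significance.

    s: weighted sum of true positives (signal selected as signal)
    b: weighted sum of false positives (background selected as signal)
    b_reg: regularization term that stabilizes the score for small b
    """
    if s < 0 or b < 0:
        raise ValueError("s and b must be non-negative")
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


# Invented counts: when s is small relative to b, AMS is close to s / sqrt(b + b_reg).
print(ams(10.0, 90.0))
```

The score grows when the selection keeps more signal or admits less background, so it rewards both sensitivity and purity of the selected region.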

47. A Generalizable and Accessible Approach to Machine Learning with Global Satellite Imagery [PDF] Back to contents
Esther Rolf, Jonathan Proctor, Tamma Carleton, Ian Bolliger, Vaishaal Shankar, Miyabi Ishihara, Benjamin Recht, Solomon Hsiang
Abstract: Combining satellite imagery with machine learning (SIML) has the potential to address global challenges by remotely estimating socioeconomic and environmental conditions in data-poor regions, yet the resource requirements of SIML limit its accessibility and use. We show that a single encoding of satellite imagery can generalize across diverse prediction tasks (e.g. forest cover, house price, road length). Our method achieves accuracy competitive with deep neural networks at orders of magnitude lower computational cost, scales globally, delivers label super-resolution predictions, and facilitates characterizations of uncertainty. Since image encodings are shared across tasks, they can be centrally computed and distributed to unlimited researchers, who need only fit a linear regression to their own ground truth data in order to achieve state-of-the-art SIML performance.
Summary: A single encoding of satellite imagery generalizes across diverse prediction tasks such as forest cover, house price, and road length. Because encodings are shared across tasks, researchers only need to fit a linear regression to their own ground truth data, at orders of magnitude lower computational cost than deep neural networks.

48. Input-Aware Dynamic Backdoor Attack [PDF] Back to contents
Anh Nguyen, Anh Tran
Abstract: In recent years, neural backdoor attacks have been considered a potential security threat to deep learning systems. Such systems, while achieving state-of-the-art performance on clean data, perform abnormally on inputs with predefined triggers. Current backdoor techniques, however, rely on uniform trigger patterns, which are easily detected and mitigated by current defense methods. In this work, we propose a novel backdoor attack technique in which the triggers vary from input to input. To achieve this goal, we implement an input-aware trigger generator driven by diversity loss. A novel cross-trigger test is applied to enforce trigger nonreusability, making backdoor verification impossible. Experiments show that our method is efficient in various attack scenarios as well as multiple datasets. We further demonstrate that our backdoor can bypass state-of-the-art defense methods. An analysis with a famous neural network inspector again proves the stealthiness of the proposed attack. Our code is publicly available at this https URL.
Summary: An input-aware trigger generator, trained with a diversity loss and a cross-trigger test, produces backdoor triggers that vary from input to input and cannot be reused. The attack bypasses state-of-the-art defenses that assume uniform trigger patterns.

49. The Deep Bootstrap: Good Online Learners are Good Offline Generalizers [PDF] Back to contents
Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi
Abstract: We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning. We then give empirical evidence that this gap between worlds can be small in realistic deep learning settings, in particular supervised image classification. For example, CNNs generalize better than MLPs on image distributions in the Real World, but this is "because" they optimize faster on the population loss in the Ideal World. This suggests our framework is a useful tool for understanding generalization in deep learning, and lays a foundation for future research in the area.
Summary: The Deep Bootstrap framework couples the Real World (stochastic gradient steps on the empirical loss) with an Ideal World (steps on the population loss) and decomposes test error into the Ideal World test error plus the gap between the two worlds. Empirically the gap can be small in realistic settings, which reduces offline generalization to online optimization.

50. Early-stage COVID-19 diagnosis in presence of limited posteroanterior chest X-ray images via novel Pinball-OCSVM [PDF] Back to contents
Sanjay Kumar Sonbhadra, Sonali Agarwal, P. Nagabhushan
Abstract: It is evident that infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) starts in the upper respiratory tract and, as the virus grows, can progress to the lungs and develop into pneumonia. According to the statistics, approximately 14% of the people infected with COVID-19 have severe cough and shortness of breath due to pneumonia, because as the viral infection increases, it damages the alveoli (small air sacs) and surrounding tissues. The conventional method of COVID-19 diagnosis is reverse transcription polymerase chain reaction (RT-PCR), which is less sensitive during the early stages, especially if the patient is asymptomatic, which may further lead to more severe pneumonia. To overcome this problem, this paper proposes an early diagnosis method based on one-class classification, using a novel pinball-loss-based one-class support vector machine (PB-OCSVM) on posteroanterior chest X-ray images. Recently, several automated COVID-19 diagnosis models based on various deep learning architectures have been proposed to identify pulmonary infections using publicly available chest X-rays (CXR). There, the small number of COVID-19 positive samples compared to other classes (normal, pneumonia and tuberculosis) makes unbiased learning challenging for deep learning models; this has been addressed with class balancing techniques, which, however, should be avoided in any medical diagnosis process. Inspired by this phenomenon, this article proposes a novel PB-OCSVM model that works in the presence of limited COVID-19 positive CXR samples, with the objectives of maximizing the learning efficiency while minimizing the false-positive and false-negative predictions. The proposed model outperformed recently published deep learning approaches, with accuracy, precision, specificity and sensitivity used as performance measures.
Summary: PB-OCSVM is a one-class support vector machine with a pinball loss, trained on posteroanterior chest X-ray images for early COVID-19 diagnosis when few positive samples are available. It avoids class balancing and is reported to outperform recent deep learning approaches.
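
The pinball loss differs from the hinge loss in that it also penalizes points on the "safe" side of the margin, scaled by a parameter tau. The sketch below shows the generic form used in pinball-loss SVMs; the exact formulation inside PB-OCSVM is not given in the abstract, so treating it as this generic form is an assumption, and the function names are hypothetical.

```python
def hinge_loss(u):
    """Hinge loss on the margin violation u: only violations (u > 0) cost anything."""
    return max(0.0, u)


def pinball_loss(u, tau):
    """Generic pinball loss: u when u >= 0, otherwise -tau * u (with 0 <= tau <= 1)."""
    return u if u >= 0 else -tau * u


for u in (-2.0, -0.5, 0.0, 1.5):
    print(u, hinge_loss(u), pinball_loss(u, tau=0.5))
```

Setting tau to 0 recovers the hinge loss, while a positive tau makes the solution depend on the spread of all samples rather than only on those near the boundary, which is the usual argument for its robustness to noise.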

51. Performance evaluation and application of computation based low-cost homogeneous machine learning model algorithm for image classification [PDF] Back to contents
W. H. Huang
Abstract: The image classification machine learning model was trained with the intention to predict the category of the input image. While multiple state-of-the-art ensemble model methodologies are openly available, this paper evaluates the performance of a low-cost, simple algorithm that would integrate seamlessly into modern production-grade cloud-based applications. The homogeneous models, trained with the full dataset instead of subsets of the data, contain hyper-parameters and neural layers that vary from one model to another. These models' inferences will be processed by the new algorithm, which is loosely based on conditional probability theories. The final output will be evaluated.
Summary: The paper evaluates a low-cost, simple algorithm, loosely based on conditional probability, that combines the inferences of homogeneous image classifiers trained on the full dataset with different hyper-parameters and layers.

52. Robust Keypoint Detection and Pose Estimation of Robot Manipulators with Self-Occlusions via Sim-to-Real Transfer [PDF] Back to contents
Jingpei Lu, Florian Richter, Michael Yip
Abstract: Keypoint detection is an essential building block for many robotic applications like motion capture and pose estimation. Historically, keypoints are detected using uniquely engineered markers such as checkerboards, fiducials, or markers. More recently, deep learning methods have been explored as they have the ability to detect user-defined keypoints in a marker-less manner. However, deep neural network (DNN) detectors can have an uneven performance for different manually selected keypoints along the kinematic chain. An example of this can be found on symmetric robotic tools where DNN detectors cannot solve the correspondence problem correctly. In this work, we propose a new and autonomous way to define the keypoint locations that overcomes these challenges. The approach involves finding the optimal set of keypoints on robotic manipulators for robust visual detection. Using a robotic simulator as a medium, our algorithm utilizes synthetic data for DNN training, and the proposed algorithm is used to optimize the selection of keypoints through an iterative approach. The results show that when using the optimized keypoints, the detection performance of the DNNs improved so significantly that they can even be detected in cases of self-occlusion. We further use the optimized keypoints for real robotic applications by using domain randomization to bridge the reality gap between the simulator and the physical world. The physical world experiments show how the proposed method can be applied to the wide-breadth of robotic applications that require visual feedback, such as camera-to-robot calibration, robotic tool tracking, and whole-arm pose estimation.
Summary: The method automatically selects an optimal set of keypoints on a robot manipulator, using synthetic data from a simulator to train the DNN detector and an iterative procedure to refine the keypoint choice. Domain randomization bridges the gap to real robots for tasks such as camera-to-robot calibration, tool tracking, and whole-arm pose estimation.

53. Overfitting or Underfitting? Understand Robustness Drop in Adversarial Training [PDF] Back to contents
Zichao Li, Liyuan Liu, Chengyu Dong, Jingbo Shang
Abstract: Our goal is to understand why the robustness drops after conducting adversarial training for too long. Although this phenomenon is commonly explained as overfitting, our analysis suggests that its primary cause is perturbation underfitting. We observe that after training for too long, FGSM-generated perturbations deteriorate into random noise. Intuitively, since no parameter updates are made to strengthen the perturbation generator, once this process collapses, it could be trapped in such local optima. Also, making this process more sophisticated could mostly avoid the robustness drop, which supports the view that this phenomenon is caused by underfitting instead of overfitting. In light of our analyses, we propose APART, an adaptive adversarial training framework, which parameterizes perturbation generation and progressively strengthens the perturbations. Shielding perturbations from underfitting unleashes the potential of our framework. In our experiments, APART provides comparable or even better robustness than PGD-10, with only about 1/4 of its computational cost.
Summary: The robustness drop in long adversarial training is attributed to perturbation underfitting, since FGSM-generated perturbations deteriorate into random noise. APART parameterizes and progressively strengthens perturbation generation, matching or exceeding PGD-10 robustness at about 1/4 of its computational cost.
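
As background for the FGSM-generated perturbations discussed above, the sketch below shows the standard single-step rule: move each input coordinate by epsilon in the direction of the sign of the loss gradient. It illustrates plain FGSM only, not APART; the gradient values are invented, and a real implementation would obtain them from an autodiff framework.

```python
def sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturbation(grad, epsilon):
    """FGSM perturbation: epsilon * sign(gradient of the loss w.r.t. the input)."""
    return [epsilon * sign(g) for g in grad]


def fgsm_attack(x, grad, epsilon):
    """Adversarial example obtained by adding the FGSM perturbation to x."""
    return [xi + di for xi, di in zip(x, fgsm_perturbation(grad, epsilon))]


# Invented input and gradient, for illustration only.
print(fgsm_attack([0.5, 0.5], [1.0, -1.0], epsilon=0.25))
```

Because the rule has no trainable parameters of its own, nothing in it can adapt as the model hardens, which is the gap the abstract's parameterized perturbation generator is meant to close.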

54. MAST: Multimodal Abstractive Summarization with Trimodal Hierarchical Attention [PDF] Back to contents
Aman Khullar, Udit Arora
Abstract: This paper presents MAST, a new model for Multimodal Abstractive Text Summarization that utilizes information from all three modalities -- text, audio and video -- in a multimodal video. Prior work on multimodal abstractive text summarization only utilized information from the text and video modalities. We examine the usefulness and challenges of deriving information from the audio modality and present a sequence-to-sequence trimodal hierarchical attention-based model that overcomes these challenges by letting the model pay more attention to the text modality. MAST outperforms the current state of the art model (video-text) by 2.51 points in terms of Content F1 score and 1.00 points in terms of Rouge-L score on the How2 dataset for multimodal language understanding.
Summary: MAST is a sequence-to-sequence model with trimodal hierarchical attention that summarizes multimodal videos using text, audio, and video. On the How2 dataset it beats the previous video-text state of the art by 2.51 Content F1 points and 1.00 Rouge-L points.

55. Data Valuation for Medical Imaging Using Shapley Value: Application on A Large-scale Chest X-ray Dataset [PDF] Back to contents
Siyi Tang, Amirata Ghorbani, Rikiya Yamashita, Sameer Rehman, Jared A. Dunnmon, James Zou, Daniel L. Rubin
Abstract: The reliability of machine learning models can be compromised when trained on low quality data. Many large-scale medical imaging datasets contain low quality labels extracted from sources such as medical reports. Moreover, images within a dataset may have heterogeneous quality due to artifacts and biases arising from equipment or measurement errors. Therefore, algorithms that can automatically identify low quality data are highly desired. In this study, we used data Shapley, a data valuation metric, to quantify the value of training data to the performance of a pneumonia detection algorithm in a large chest X-ray dataset. We characterized the effectiveness of data Shapley in identifying low quality versus valuable data for pneumonia detection. We found that removing training data with high Shapley values decreased the pneumonia detection performance, whereas removing data with low Shapley values improved the model performance. Furthermore, there were more mislabeled examples in low Shapley value data and more true pneumonia cases in high Shapley value data. Our results suggest that low Shapley value indicates mislabeled or poor quality images, whereas high Shapley value indicates data that are valuable for pneumonia detection. Our method can serve as a framework for using data Shapley to denoise large-scale medical imaging datasets.
Summary: Data Shapley is used to quantify the value of each training example for pneumonia detection on a large chest X-ray dataset. Removing low-value data improved performance while removing high-value data hurt it, and low Shapley values flagged mislabeled or poor-quality images.
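
The idea behind data Shapley can be sketched on a toy problem: a training point's value is its average marginal contribution to a utility (for example, validation accuracy) over all orders in which the data could be added. The code below enumerates every order exactly, which is only feasible for a handful of points; at the scale of a real dataset an approximation is required. The utility function and the four labeled points are invented stand-ins, not the paper's pneumonia detector.

```python
from itertools import permutations

# Hypothetical training points and their (normally unknown) label quality.
QUALITY = {"a": "clean", "b": "clean", "c": "clean", "d": "mislabeled"}


def toy_utility(subset):
    """Stand-in for model performance after training on `subset`."""
    clean = sum(QUALITY[p] == "clean" for p in subset)
    noisy = len(subset) - clean
    return min(1.0, 0.5 + 0.2 * clean) - 0.3 * noisy


def data_shapley(points, utility):
    """Exact Shapley values: mean marginal contribution over all orderings."""
    values = {p: 0.0 for p in points}
    orderings = list(permutations(points))
    for order in orderings:
        subset = []
        previous = utility(subset)
        for p in order:
            subset.append(p)
            current = utility(subset)
            values[p] += current - previous
            previous = current
    return {p: total / len(orderings) for p, total in values.items()}


shapley = data_shapley(list(QUALITY), toy_utility)
print(shapley)
```

In this toy setting the mislabeled point receives a negative value and the clean points positive ones, mirroring the abstract's finding that low Shapley values point at data worth removing.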

56. Maximum-Entropy Adversarial Data Augmentation for Improved Generalization and Robustness [PDF] Back to contents
Long Zhao, Ting Liu, Xi Peng, Dimitris Metaxas
Abstract: Adversarial data augmentation has shown promise for training robust deep neural networks against unforeseen data shifts or corruptions. However, it is difficult to define heuristics to generate effective fictitious target distributions containing "hard" adversarial perturbations that are largely different from the source distribution. In this paper, we propose a novel and effective regularization term for adversarial data augmentation. We theoretically derive it from the information bottleneck principle, which results in a maximum-entropy formulation. Intuitively, this regularization term encourages perturbing the underlying source distribution to enlarge predictive uncertainty of the current model, so that the generated "hard" adversarial perturbations can improve the model robustness during training. Experimental results on three standard benchmarks demonstrate that our method consistently outperforms the existing state of the art by a statistically significant margin.
Summary: A regularization term derived from the information bottleneck principle yields a maximum-entropy formulation of adversarial data augmentation. It perturbs the source distribution so as to enlarge the current model's predictive uncertainty, and it consistently outperforms prior methods on three standard benchmarks.
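
"Enlarging predictive uncertainty" can be made concrete with the Shannon entropy of the model's output distribution. The sketch below computes that quantity for two invented logit vectors; taking softmax entropy as the uncertainty measure is an assumption suggested by the "maximum-entropy" wording, and the paper's actual regularizer may be defined differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Invented logits: a confident prediction and a less certain one.
confident = softmax([5.0, 0.0, 0.0])
uncertain = softmax([1.0, 0.0, 0.0])
print(entropy(confident), entropy(uncertain))
```

A perturbation that raises this entropy pushes an example toward the region where the current model is unsure, which is what makes it a "hard" example for training.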

57. What is More Likely to Happen Next? Video-and-Language Future Event Prediction [PDF] Back to contents
Jie Lei, Licheng Yu, Tamara L. Berg, Mohit Bansal
Abstract: Given a video with aligned dialogue, people can often infer what is more likely to happen next. Making such predictions requires not only a deep understanding of the rich dynamics underlying the video and dialogue, but also a significant amount of commonsense knowledge. In this work, we explore whether AI models are able to learn to make such multimodal commonsense next-event predictions. To support research in this direction, we collect a new dataset, named Video-and-Language Event Prediction (VLEP), with 28,726 future event prediction examples (along with their rationales) from 10,234 diverse TV Show and YouTube Lifestyle Vlog video clips. In order to promote the collection of non-trivial challenging examples, we employ an adversarial human-and-model-in-the-loop data collection procedure. We also present a strong baseline incorporating information from video, dialogue, and commonsense knowledge. Experiments show that each type of information is useful for this challenging task, and that compared to the high human performance on VLEP, our model provides a good starting point but leaves large room for future work. Our dataset and code are available at: this https URL
Summary: VLEP is a new dataset of 28,726 future event prediction examples with rationales, drawn from 10,234 TV show and YouTube lifestyle vlog clips and collected with an adversarial human-and-model-in-the-loop procedure. A baseline using video, dialogue, and commonsense knowledge is a good starting point but remains well below human performance.

Note: the summaries originate from machine translation. The cover image is a word cloud of the paper titles.

