Abstracts

1. Few-Shot Adaptation of Generative Adversarial Networks
Esther Robb, Wen-Sheng Chu, Abhishek Kumar, Jia-Bin Huang
Abstract: Generative Adversarial Networks (GANs) have shown remarkable performance in image synthesis tasks, but typically require a large number of training samples to achieve high-quality synthesis. This paper proposes a simple and effective method, Few-Shot GAN (FSGAN), for adapting GANs in few-shot settings (fewer than 100 images). FSGAN repurposes component analysis techniques and learns to adapt the singular values of the pre-trained weights while freezing the corresponding singular vectors. This provides a highly expressive parameter space for adaptation while constraining changes to the pretrained weights. We validate our method in a challenging few-shot setting of 5-100 images in the target domain. We show that our method has significant visual quality gains compared with existing GAN adaptation methods. We report qualitative and quantitative results showing the effectiveness of our method. We additionally highlight a problem for few-shot synthesis in the standard quantitative metric used by data-efficient image synthesis works. Code and additional results are available at this http URL.
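The idea of adapting singular values while freezing singular vectors can be sketched on a toy 2x2 weight matrix. This is only an illustration under stated assumptions, not the FSGAN implementation: the factors `U`, `pretrained_s`, and `Vt` are assumed to be given already (in practice they would come from an SVD of a pretrained layer), the per-value `scale` stands in for whatever is learned, and all function names are hypothetical.

```python
import math

def rotation(theta):
    """2x2 rotation matrix: an orthogonal matrix standing in for singular vectors."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matmul(a, b):
    """Plain-Python matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rebuild(u, sing, vt):
    """Recompose W = U diag(sing) Vt."""
    diag = [[sing[i] if i == j else 0.0 for j in range(len(sing))]
            for i in range(len(sing))]
    return matmul(matmul(u, diag), vt)

# Frozen factors of a hypothetical pretrained weight (assumed given).
U, Vt = rotation(0.3), rotation(-1.1)
pretrained_s = [2.0, 0.5]

# The only adjustable quantity in this sketch: one scale per singular value.
scale = [1.5, 0.8]
adapted_s = [s * k for s, k in zip(pretrained_s, scale)]

W_pretrained = rebuild(U, pretrained_s, Vt)
W_adapted = rebuild(U, adapted_s, Vt)
```

Because `U` and `Vt` never change, the adapted weight has the same singular vectors as the pretrained one, and the number of adjustable values equals the number of singular values rather than the number of matrix entries.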

2. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks. When pre-trained on large amounts of data and transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB, etc.), Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
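Turning an image into a sequence of patches, the input format described above, can be sketched as follows. This is a minimal illustration assuming a single-channel image stored as a list of rows; the function name `image_to_patches` is hypothetical, and the learned linear projection, position embeddings, and the transformer itself are left out.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch tokens.

    Patches are emitted in row-major order: left to right, then top to bottom.
    """
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tokens.append([image[top + dy][left + dx]
                           for dy in range(patch) for dx in range(patch)])
    return tokens

# A 4x4 toy image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(image, 2)  # 4 tokens, each of length 4
```

With the 16x16 patches named in the title, a 224x224 image would give 14 x 14 = 196 tokens per channel layout; the toy sizes here are only chosen to keep the output easy to inspect.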

3. GAZED- Gaze-guided Cinematic Editing of Wide-Angle Monocular Video Recordings
K L Bhanu Moorthy, Moneish Kumar, Ramanathan Subramaniam, Vineet Gandhi
Abstract: We present GAZED: eye GAZe-guided EDiting for videos captured by a solitary, static, wide-angle and high-resolution camera. Eye-gaze has been effectively employed in computational applications as a cue to capture interesting scene content; we employ gaze as a proxy to select shots for inclusion in the edited video. Given the original video, scene content and user eye-gaze tracks are combined to generate an edited video comprising cinematically valid actor shots and shot transitions, yielding an aesthetic and vivid representation of the original narrative. We model cinematic video editing as an energy minimization problem over shot selection, whose constraints capture cinematographic editing conventions. Gazed scene locations primarily determine the shots constituting the edited video. Effectiveness of GAZED against multiple competing methods is demonstrated via a psychophysical study involving 12 users and 12 performance videos.
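Energy minimization over shot selection can be illustrated with a small dynamic program. The abstract does not spell out the energy terms, so this sketch assumes a simplified form: a per-frame cost for each candidate shot (hypothetically low when gaze falls inside the shot) plus a fixed penalty for every cut. The names `select_shots`, `unary`, and `cut_penalty` are illustrative only.

```python
def select_shots(unary, cut_penalty):
    """Pick one shot per frame, minimising unary costs plus a penalty per cut.

    unary[t][s] is the assumed cost of showing shot s at frame t.
    The minimum is found exactly by dynamic programming over frames.
    """
    n_shots = len(unary[0])
    cost = list(unary[0])
    back = []
    for t in range(1, len(unary)):
        new_cost, pointers = [], []
        for s in range(n_shots):
            # Cheapest predecessor: staying in s is free, switching pays cut_penalty.
            best_prev = min(range(n_shots),
                            key=lambda p: cost[p] + (0 if p == s else cut_penalty))
            step = cost[best_prev] + (0 if best_prev == s else cut_penalty)
            new_cost.append(step + unary[t][s])
            pointers.append(best_prev)
        cost = new_cost
        back.append(pointers)
    # Trace the optimal sequence backwards from the cheapest final shot.
    shot = min(range(n_shots), key=lambda s: cost[s])
    path = [shot]
    for pointers in reversed(back):
        shot = pointers[shot]
        path.append(shot)
    return path[::-1]

# Gaze favours shot 0 early and shot 1 later: one cut is worth paying for.
smooth = select_shots([[0, 2], [0, 2], [1, 0], [2, 0], [2, 0]], cut_penalty=1)
# Brief gaze flicker with an expensive cut: the edit stays on one shot.
stable = select_shots([[0, 1], [1, 0], [0, 1], [0, 1]], cut_penalty=5)
```

The cut penalty plays the role of an editing convention in this toy model: raising it suppresses rapid back-and-forth cuts even when the per-frame costs alternate.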

4. AEGIS: A real-time multimodal augmented reality computer vision based system to assist facial expression recognition for individuals with autism spectrum disorder
James Ren Hou Lee, Alexander Wong
Abstract: The ability to interpret social cues comes naturally to most people, but some of those living with Autism Spectrum Disorder (ASD) experience a deficiency in this area. This paper presents the development of a multimodal augmented reality (AR) system which combines the use of computer vision and deep convolutional neural networks (CNN) in order to assist individuals with the detection and interpretation of facial expressions in social settings. The proposed system, which we call AEGIS (Augmented-reality Expression Guided Interpretation System), is an assistive technology deployable on a variety of user devices including tablets, smartphones, video conference systems, or smartglasses, showcasing its extreme flexibility and wide range of use cases, to allow integration into daily life with ease. Given a streaming video camera source, each real-world frame is passed into AEGIS, processed for facial bounding boxes, and then fed into our novel deep convolutional time windowed neural network (TimeConvNet). We leverage both spatial and temporal information in order to provide an accurate expression prediction, which is then converted into its corresponding visualization and drawn on top of the original video frame. The system runs in real-time, requires minimal setup and is simple to use. With the use of AEGIS, we can assist individuals living with ASD to learn to better identify expressions and thus improve their social experiences.

5. Spatio-temporal Features for Generalized Detection of Deepfake Videos
Ipek Ganiyusufoglu, L. Minh Ngô, Nedko Savov, Sezer Karaoglu, Theo Gevers
Abstract: For deepfake detection, video-level detectors have not been explored as extensively as image-level detectors, which do not exploit temporal data. In this paper, we empirically show that existing approaches on image and sequence classifiers generalize poorly to new manipulation techniques. To this end, we propose spatio-temporal features, modeled by 3D CNNs, to extend the generalization capabilities to detect new sorts of deepfake videos. We show that spatial features learn distinct deepfake-method-specific attributes, while spatio-temporal features capture shared attributes between deepfake methods. We provide an in-depth analysis of how the sequential and spatio-temporal video encoders are utilizing temporal information using the DFDC dataset arXiv:2006.07397. Thus, we show that our approach captures local spatio-temporal relations and inconsistencies in the deepfake videos while existing sequence encoders are indifferent to them. Through large scale experiments conducted on the FaceForensics++ arXiv:1901.08971 and Deeper Forensics arXiv:2001.03024 datasets, we show that our approach outperforms existing methods in terms of generalization capabilities.

6. Blind Video Temporal Consistency via Deep Video Prior
Chenyang Lei, Yazhou Xing, Qifeng Chen
Abstract: Applying image processing algorithms independently to each video frame often leads to temporal inconsistency in the resulting video. To address this issue, we present a novel and general approach for blind video temporal consistency. Our method is trained directly on a pair of original and processed videos rather than on a large dataset. Unlike most previous methods that enforce temporal consistency with optical flow, we show that temporal consistency can be achieved by training a convolutional network on a video with the Deep Video Prior. Moreover, a carefully designed iteratively reweighted training strategy is proposed to address the challenging multimodal inconsistency problem. We demonstrate the effectiveness of our approach on 7 computer vision tasks on videos. Extensive quantitative and perceptual experiments show that our approach outperforms state-of-the-art methods on blind video temporal consistency. Our source codes are publicly available at this http URL.

7. Once-for-All Adversarial Training: In-Situ Tradeoff between Robustness and Accuracy for Free
Haotao Wang, Tianlong Chen, Shupeng Gui, Ting-Kuei Hu, Ji Liu, Zhangyang Wang
Abstract: Adversarial training and its many variants substantially improve deep network robustness, yet at the cost of compromising standard accuracy. Moreover, the training process is heavy and hence it becomes impractical to thoroughly explore the trade-off between accuracy and robustness. This paper asks a new question: how to quickly calibrate a trained model in-situ, to examine the achievable trade-offs between its standard and robust accuracies, without (re-)training it many times? Our proposed framework, Once-for-all Adversarial Training (OAT), is built on an innovative model-conditional training framework, with a controlling hyper-parameter as the input. The trained model could be adjusted among different standard and robust accuracies "for free" at testing time. As an important knob, we exploit dual batch normalization to separate standard and adversarial feature statistics, so that they can be learned in one model without degrading performance. We further extend OAT to a Once-for-all Adversarial Training and Slimming (OATS) framework that allows for the joint trade-off among accuracy, robustness and runtime efficiency. Experiments show that, without any re-training or ensembling, OAT/OATS achieve similar or even superior performance compared to dedicatedly trained models at various configurations. Our codes and pretrained models are available at: this https URL.

8. Castle in the Sky: Dynamic Sky Replacement and Harmonization in Videos
Zhengxia Zou
Abstract: This paper proposes a vision-based method for video sky replacement and harmonization, which can automatically generate realistic and dramatic sky backgrounds in videos with controllable styles. Different from previous sky editing methods that either focus on static photos or require inertial measurement units integrated in smartphones when shooting videos, our method is purely vision-based, without any requirements on the capturing devices, and can be well applied to either online or offline processing scenarios. Our method runs in real-time and is free of user interactions. We decompose this artistic creation process into a couple of proxy tasks including sky matting, motion estimation, and image blending. Experiments are conducted on videos diversely captured in the wild by handheld smartphones and dash cameras, and show high fidelity and good generalization of our method in both visual quality and lighting/motion dynamics. Our code and animated results are available at this https URL.
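The image blending step listed among the proxy tasks can be illustrated with standard alpha compositing. This is a generic sketch, not the paper's blending procedure: it assumes a soft sky matte with values in [0, 1] is already available from some matting step, uses grayscale pixel values for brevity, and the name `blend_sky` is hypothetical.

```python
def blend_sky(frame, sky, matte):
    """Per-pixel alpha blend: matte=1 takes the new sky, matte=0 keeps the frame."""
    return [[a * s + (1.0 - a) * f for f, s, a in zip(f_row, s_row, a_row)]
            for f_row, s_row, a_row in zip(frame, sky, matte)]

frame = [[10.0, 20.0], [30.0, 40.0]]     # grayscale stand-in for a video frame
sky = [[200.0, 200.0], [200.0, 200.0]]   # replacement sky template
matte = [[1.0, 0.5], [0.0, 0.0]]         # soft sky matte (assumed given)
blended = blend_sky(frame, sky, matte)
```

Pixels with intermediate matte values, such as the one at 0.5 here, mix both sources, which is what keeps the sky boundary from looking cut out.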

9. FasterRCNN Monitoring of Road Damages: Competition and Deployment
Hascoet Tristan, Yihao Zhang, Persch Andreas, Ryoichi Takashima, Tetsuya Takiguchi, Yasuo Ariki
Abstract: Maintaining aging infrastructure is a challenge currently faced by local and national administrators all around the world. An important prerequisite for efficient infrastructure maintenance is to continuously monitor (i.e., quantify the level of safety and reliability) the state of very large structures. Meanwhile, computer vision has made impressive strides in recent years, mainly due to successful applications of deep learning models. These advances allow the automation of vision tasks that were previously impossible to automate, offering promising possibilities to assist administrators in optimizing their infrastructure maintenance operations. In this context, the IEEE 2020 global Road Damage Detection (RDD) Challenge gives deep learning and computer vision researchers an opportunity to get involved and help accurately track pavement damages on road networks. This paper makes two contributions to that topic: in the first part, we detail our solution to the RDD Challenge. In the second part, we present our efforts in deploying our model on a local road network, explaining the proposed methodology and the challenges encountered.

10. Deep Analysis of CNN-based Spatio-temporal Representations for Action Recognition
Chun-Fu Chen, Rameswar Panda, Kandan Ramakrishnan, Rogerio Feris, John Cohn, Aude Oliva, Quanfu Fan
Abstract: In recent years, a number of approaches based on 2D CNNs and 3D CNNs have emerged for video action recognition, achieving state-of-the-art results on several large-scale benchmark datasets. In this paper, we carry out an in-depth comparative analysis to better understand the differences between these approaches and the progress made by them. To this end, we develop a unified framework for both 2D-CNN and 3D-CNN action models, which enables us to remove bells and whistles and provides a common ground for a fair comparison. We then conduct an effort towards a large-scale analysis involving over 300 action recognition models. Our comprehensive analysis reveals that a) a significant leap is made in efficiency for action recognition, but not in accuracy; b) 2D-CNN and 3D-CNN models behave similarly in terms of spatio-temporal representation abilities and transferability. Our analysis also shows that recent action models seem to be able to learn data-dependent temporality flexibly as needed. Our codes and models are available at this https URL.

11. Self-Supervised Learning of Part Mobility from Point Cloud Sequence
Yahao Shi, Xinyu Cao, Bin Zhou
Abstract: Part mobility analysis is a significant aspect required to achieve a functional understanding of 3D objects. It would be natural to obtain part mobility from the continuous part motion of 3D objects. In this study, we introduce a self-supervised method for segmenting motion parts and predicting their motion attributes from a point cloud sequence representing a dynamic object. To sufficiently utilize spatiotemporal information from the point cloud sequence, we generate trajectories by using correlations among successive frames of the sequence instead of directly processing the point clouds. We propose a novel neural network architecture called PointRNN to learn feature representations of trajectories along with their part rigid motions. We evaluate our method on various tasks including motion part segmentation, motion axis prediction and motion range estimation. The results demonstrate that our method outperforms previous techniques on both synthetic and real datasets. Moreover, our method has the ability to generalize to new and unseen objects. It is important to emphasize that it is not required to know any prior shape structure, prior shape category information, or shape orientation. To the best of our knowledge, this is the first study on deep learning to extract part mobility from point cloud sequence of a dynamic object.

12. Identification of deep breath while moving forward based on multiple body regions and graph signal analysis
Yunlu Wang, Cheng Yang, Menghan Hu, Jian Zhang, Qingli Li, Guangtao Zhai, Xiao-Ping Zhang
Abstract: This paper presents an unobtrusive solution that can automatically identify deep breath when a person is walking past the global depth camera. Existing non-contact breath assessments achieve satisfactory results under restricted conditions when the human body stays relatively still. When someone moves forward, the breath signals detected by the depth camera are hidden within signals of trunk displacement and deformation, and the signal length is short due to the short stay time, posing great challenges for establishing models. To overcome these challenges, a signal extraction and selection method based on multiple regions of interest (ROIs) is proposed to automatically obtain the breath-informative signal from depth video. Subsequently, graph signal analysis (GSA) is adopted as a spatial-temporal filter to remove the components unrelated to breath. Finally, a classifier for identifying deep breath is established based on the selected breath-informative signal. In validation experiments, the proposed approach outperforms the comparative methods with accuracy, precision, recall and F1 of 75.5%, 76.2%, 75.0% and 75.2%, respectively. This system can be extended to public places to provide timely and ubiquitous help for those who may have or are going through physical or mental trouble.
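For readers less familiar with the four metrics reported above, the sketch below shows how accuracy, precision, recall and F1 follow from binary confusion counts. The counts are made up for illustration and are not data from the paper; the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from binary confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1

# Hypothetical counts: 30 deep breaths detected, 10 missed, 10 false alarms.
metrics = classification_metrics(tp=30, fp=10, fn=10, tn=50)
```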

13. A Cluster-Matching-Based Method for Video Face Recognition
Paulo R C Mendes, Antonio J G Busson, Sérgio Colcher, Daniel Schwabe, Álan L V Guedes, Carlos Laufer
Abstract: Face recognition systems are present in many modern solutions and thousands of applications in our daily lives. However, current solutions are not easily scalable, especially when it comes to the addition of new targeted people. We propose a cluster-matching-based approach for face recognition in video. In our approach, we use unsupervised learning to cluster the faces present in both the dataset and targeted videos selected for face recognition. Moreover, we design a cluster matching heuristic to associate clusters in both sets that is also capable of identifying when a face belongs to a non-registered person. Our method has achieved a recall of 99.435% and a precision of 99.131% in the task of video face recognition. Besides performing face recognition, it can also be used to determine the video segments where each person is present.
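One way to picture cluster matching with rejection of non-registered people is nearest-centroid matching under a distance threshold. The abstract does not describe the actual heuristic, so everything here is an assumption: face embeddings are toy 2-D points, clusters are compared by centroid, and `match_clusters` and `max_dist` are hypothetical names.

```python
import math

def centroid(points):
    """Mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def match_clusters(video_clusters, registered, max_dist):
    """Map each video cluster id to the nearest registered identity.

    A cluster whose nearest registered centroid is farther than max_dist
    is mapped to None, i.e. treated as a non-registered person.
    """
    reg_centroids = {name: centroid(pts) for name, pts in registered.items()}
    result = {}
    for cid, pts in video_clusters.items():
        c = centroid(pts)
        name, dist = min(((n, math.dist(c, rc)) for n, rc in reg_centroids.items()),
                         key=lambda pair: pair[1])
        result[cid] = name if dist <= max_dist else None
    return result

registered = {"alice": [[0.0, 0.0], [0.2, 0.0]], "bob": [[5.0, 5.0], [5.0, 5.2]]}
video_clusters = {0: [[0.1, 0.1]], 1: [[5.1, 5.0]], 2: [[10.0, -10.0]]}
matches = match_clusters(video_clusters, registered, max_dist=1.0)
```

In this toy setup, adding a new target person only means adding one more entry to `registered`; nothing is retrained, which is the scalability property the abstract emphasizes.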

14. Vision-Based Layout Detection from Scientific Literature using Recurrent Convolutional Neural Networks
Huichen Yang, William H. Hsu
Abstract: We present an approach for adapting convolutional neural networks for object recognition and classification to scientific literature layout detection (SLLD), a shared subtask of several information extraction problems. Scientific publications contain multiple types of information sought by researchers in various disciplines, organized into an abstract, bibliography, and sections documenting related work, experimental methods, and results; however, there is no effective way to extract this information due to their diverse layout. In this paper, we present a novel approach to developing an end-to-end learning framework to segment and classify major regions of a scientific document. We consider scientific document layout analysis as an object detection task over digital images, without any additional text features that need to be added into the network during the training process. Our technical objective is to implement transfer learning via fine-tuning of pre-trained networks and thereby demonstrate that this deep learning architecture is suitable for tasks that lack very large document corpora for training ab initio. As part of the experimental test bed for empirical evaluation of this approach, we created a merged multi-corpus data set for scientific publication layout detection tasks. Our results show good improvement with fine-tuning of a pre-trained base network using this merged data set, compared to the baseline convolutional neural network architecture.

15. What do CNN neurons learn: Visualization & Clustering
Haoyue Dai
Abstract: In recent years convolutional neural networks (CNN) have shown striking progress in various tasks. However, despite the high performance, the training and prediction process remains a black box, and what the neurons in a CNN learn remains a mystery. In this paper, we address the problem of interpreting a CNN from the aspects of the input image's focus and preference, and the neurons' domination, activation and contribution to a concrete final prediction. Specifically, we use two techniques - visualization and clustering - to tackle the problems above. Visualization means the method of gradient descent on image pixels, and in the clustering section two algorithms are proposed to cluster over image categories and network neurons, respectively. Experiments and quantitative analyses have demonstrated the effectiveness of the two methods in explaining the question: what do neurons learn.

16. LID 2020: The Learning from Imperfect Data Challenge Results
Yunchao Wei, Shuai Zheng, Ming-Ming Cheng, Hang Zhao, Liwei Wang, Errui Ding, Yi Yang, Antonio Torralba, Ting Liu, Guolei Sun, Wenguan Wang, Luc Van Gool, Wonho Bae, Junhyug Noh, Jinhwan Seo, Gunhee Kim, Hao Zhao, Ming Lu, Anbang Yao, Yiwen Guo, Yurong Chen, Li Zhang, Chuangchuang Tan, Tao Ruan, Guanghua Gu, Shikui Wei, Yao Zhao, Mariia Dobko, Ostap Viniavskyi, Oles Dobosevych, Zhendong Wang, Zhenyuan Chen, Chen Gong, Huanqing Yan, Jun He
Abstract: Learning from imperfect data becomes an issue in many industrial applications after the research community has made profound progress in supervised learning from perfectly annotated datasets. The purpose of the Learning from Imperfect Data (LID) workshop is to inspire and facilitate the research in developing novel approaches that would harness the imperfect data and improve the data-efficiency during training. A massive amount of user-generated data is nowadays available on multiple internet services. How to leverage those and improve the machine learning models is a high impact problem. We organize the challenges in conjunction with the workshop. The goal of these challenges is to find the state-of-the-art approaches in the weakly supervised learning setting for object detection, semantic segmentation, and scene parsing. There are three tracks in the challenge, i.e., weakly supervised semantic segmentation (Track 1), weakly supervised scene parsing (Track 2), and weakly supervised object localization (Track 3). In Track 1, based on ILSVRC DET, we provide pixel-level annotations of 15K images from 200 categories for evaluation. In Track 2, we provide point-based annotations for the training set of ADE20K. In Track 3, based on ILSVRC CLS-LOC, we provide pixel-level annotations of 44,271 images for evaluation. Besides, we further introduce a new evaluation metric proposed by \cite{zhang2020rethinking}, i.e., IoU curve, to measure the quality of the generated object localization maps. This technical report summarizes the highlights from the challenge. The challenge submission server and the leaderboard will remain open to researchers who are interested. More details regarding the challenge and the benchmarks are available at this https URL.
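The IoU curve mentioned for Track 3 can be pictured as the IoU between a thresholded localization map and the pixel-level ground truth, evaluated over a sweep of thresholds. The abstract does not give the exact definition from the cited work, so this sketch rests on that assumed reading; `iou_curve` and the toy arrays are hypothetical.

```python
def iou_curve(loc_map, gt_mask, thresholds):
    """IoU between the binarised localization map and the mask, one value per threshold."""
    curve = []
    for t in thresholds:
        inter = union = 0
        for map_row, gt_row in zip(loc_map, gt_mask):
            for score, gt in zip(map_row, gt_row):
                pred = score >= t          # pixel counted as object at this threshold
                inter += int(pred and gt == 1)
                union += int(pred or gt == 1)
        curve.append(inter / union if union else 0.0)
    return curve

loc_map = [[0.9, 0.6], [0.3, 0.1]]   # toy localization scores in [0, 1]
gt_mask = [[1, 1], [0, 0]]           # toy pixel-level annotation
curve = iou_curve(loc_map, gt_mask, [0.2, 0.5, 0.8])
```

Reporting the whole curve rather than a single number avoids rewarding a method for one lucky threshold choice: here a low threshold over-covers the object and a high one under-covers it.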

17. Restoring Negative Information in Few-Shot Object Detection
Yukuan Yang, Fangyu Wei, Miaojing Shi, Guoqi Li
Abstract: Few-shot learning has recently emerged as a new challenge in the deep learning field: unlike conventional methods that train the deep neural networks (DNNs) with a large number of labeled data, it asks for the generalization of DNNs on new classes with few annotated samples. Recent advances in few-shot learning mainly focus on image classification while in this paper we focus on object detection. The initial explorations in few-shot object detection tend to simulate a classification scenario by using the positive proposals in images with respect to certain object class while discarding the negative proposals of that class. Negatives, especially hard negatives, however, are essential to the embedding space learning in few-shot object detection. In this paper, we restore the negative information in few-shot object detection by introducing a new negative- and positive-representative based metric learning framework and a new inference scheme with negative and positive representatives. We build our work on a recent few-shot pipeline RepMet with several new modules to encode negative information for both training and testing. Extensive experiments on ImageNet-LOC and PASCAL VOC show our method substantially improves the state-of-the-art few-shot object detection solutions. Our code is available at this https URL.

18. Using Conditional Generative Adversarial Networks to Reduce the Effects of Latency in Robotic Telesurgery
Neil Sachdeva, Misha Klopukh, Rachel St. Clair, William Hahn
Abstract: The introduction of surgical robots brought about advancements in surgical procedures. The applications of remote telesurgery range from building medical clinics in underprivileged areas, to placing robots abroad in military hot-spots where accessibility and diversity of medical experience may be limited. Poor wireless connectivity may result in a prolonged delay, referred to as latency, between a surgeon's input and the action a robot takes. In surgery, any micro-delay can injure a patient severely and, in some cases, result in fatality. One way to increase safety is to mitigate the effects of latency using deep learning aided computer vision. While the current surgical robots use calibrated sensors to measure the position of the arms and tools, in this work we present a purely optical approach that provides a measurement of the tool position in relation to the patient's tissues. This research aimed to produce a neural network that allowed a robot to detect its own mechanical manipulator arms. A conditional generative adversarial network (cGAN) was trained on 1107 frames of mock gastrointestinal robotic surgery data from the 2015 EndoVis Instrument Challenge and corresponding hand-drawn labels for each frame. When run on new testing data, the network generated near-perfect labels of the input images which were visually consistent with the hand-drawn labels and was able to do this in 299 milliseconds. These accurately generated labels can then be used as simplified identifiers for the robot to track its own controlled tools. These results show potential for conditional GANs as a reaction mechanism such that the robot can detect when its arms move outside the operating area within a patient. This system allows for more accurate monitoring of the position of surgical instruments in relation to the patient's tissue, increasing safety measures that are integral to successful telesurgery systems.

19. Fast and Incremental Loop Closure Detection with Deep Features and Proximity Graphs [PDF]
Shan An, Haogang Zhu, Dong Wei, Konstantinos A. Tsintotas
Abstract: In recent years, methods concerning the place recognition task have been extensively examined by the robotics community within the scope of simultaneous localization and mapping applications. In this article, an appearance-based loop closure detection pipeline is proposed, entitled "FILD++" (Fast and Incremental Loop closure Detection). When the incoming camera observation arrives, global and local visual features are extracted through two passes of a single convolutional neural network. Subsequently, a modified hierarchical-navigable small-world graph incrementally generates a visual database that represents the robot's traversed path based on global features. Given the query sensor measurement, similar locations from the trajectory are retrieved using these representations, while an image-to-image pairing is further evaluated thanks to the spatial information provided by the local features. Exhaustive experiments on several publicly-available datasets exhibit the system's high performance and low execution time compared to other contemporary state-of-the-art pipelines.

20. MLOD: Awareness of Extrinsic Perturbation in Multi-LiDAR 3D Object Detection for Autonomous Driving [PDF]
Jianhao Jiao, Peng Yun, Lei Tai, Ming Liu
Abstract: Extrinsic perturbation always exists in multiple sensors. In this paper, we focus on the extrinsic uncertainty in multi-LiDAR systems for 3D object detection. We first analyze the influence of extrinsic perturbation on geometric tasks with two basic examples. To minimize the detrimental effect of extrinsic perturbation, we propagate an uncertainty prior on each point of input point clouds, and use this information to boost an approach for 3D geometric tasks. Then we extend our findings to propose a multi-LiDAR 3D object detector called MLOD. MLOD is a two-stage network where the multi-LiDAR information is fused through various schemes in stage one, and the extrinsic perturbation is handled in stage two. We conduct extensive experiments on a real-world dataset, and demonstrate both the accuracy and robustness improvement of MLOD. The code, data and supplementary materials are available at: this https URL
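The abstract does not say how the per-point uncertainty prior is derived. As a rough illustration only, the sketch below uses a standard first-order error propagation: if a point q (already expressed in the base frame) is subject to a small rotation error with standard deviation sigma_theta (radians, isotropic) and a translation error with standard deviation sigma_t (metres, isotropic), its covariance is approximately sigma_theta^2 [q]x [q]x^T + sigma_t^2 I. The function names, the isotropic noise model and the numeric values are assumptions, not the authors' formulation.

```python
def skew(q):
    """Return the 3x3 cross-product (skew-symmetric) matrix of a 3-vector."""
    x, y, z = q
    return [[0.0, -z, y],
            [z, 0.0, -x],
            [-y, x, 0.0]]


def point_covariance(q, sigma_theta, sigma_t):
    """First-order covariance of point q under small extrinsic errors.

    Assumed model (hypothetical, for illustration):
      cov = sigma_theta^2 * [q]x [q]x^T + sigma_t^2 * I
    """
    s = skew(q)
    cov = [[0.0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            # (S S^T)[i][j] is the dot product of rows i and j of S.
            ss_t = sum(s[i][k] * s[j][k] for k in range(3))
            cov[i][j] = sigma_theta ** 2 * ss_t
            if i == j:
                cov[i][j] += sigma_t ** 2
    return cov


# A point 10 m ahead on the x-axis; noise levels are made-up values.
cov_far = point_covariance((10.0, 0.0, 0.0), sigma_theta=0.01, sigma_t=0.02)
cov_near = point_covariance((1.0, 0.0, 0.0), sigma_theta=0.01, sigma_t=0.02)
```

Under this model the rotational term grows with the squared range of the point, so distant points carry a larger prior uncertainty than nearby ones, while the variance along the ray itself comes from the translation error alone.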

21. Spatial Attention as an Interface for Image Captioning Models [PDF]
Philipp Sadler
Abstract: The internal workings of modern deep learning models often remain unclear to an external observer, although spatial attention mechanisms are involved. The idea of this work is to translate these spatial attentions into natural language to provide simpler access to the model's function. Thus, I took a neural image captioning model and measured the reactions to external modification in its spatial attention for three different interface methods: a fixation over the whole generation process, a fixation for the first time-steps and an addition to the generator's attention. The experimental results for bounding box based spatial attention vectors have shown that the captioning model reacts to method-dependent changes in up to 52.65% of the cases and, in 9.00% of the cases, includes object categories that were otherwise unmentioned. Afterwards, I established such a link to a hierarchical co-attention network for visual question answering by extraction of its word, phrase and question level spatial attentions. Here, generated captions for the word level included details of the question-answer pairs in up to 55.20% of the cases. This work indicates that spatial attention seen as an external interface for image caption generators is a useful method to access visual functions in natural language.

22. On Benchmarking Iris Recognition within a Head-mounted Display for AR/VR Application [PDF]
Fadi Boutros, Naser Damer, Kiran Raja, Raghavendra Ramachandra, Florian Kirchbuchner, Arjan Kuijper
Abstract: Augmented and virtual reality are being deployed in different fields of applications. Such applications might involve accessing or processing critical and sensitive information, which requires strict and continuous access control. Given that Head-Mounted Displays (HMD) developed for such applications commonly contain internal cameras for gaze tracking purposes, we evaluate the suitability of such a setup for verifying the users through iris recognition. In this work, we first evaluate a set of iris recognition algorithms suitable for HMD devices by investigating three well-established handcrafted feature extraction approaches, and to complement it, we also present the analysis using four deep learning models. While taking into consideration the minimalistic hardware requirements of stand-alone HMD, we employ and adapt a recently developed miniature segmentation model (EyeMMS) for segmenting the iris. Further, to account for non-ideal and non-collaborative capture of iris, we define a new iris quality metric that we term Iris Mask Ratio (IMR) to quantify the iris recognition performance. Motivated by the performance of iris recognition, we also propose the continuous authentication of users in a non-collaborative capture setting in HMD. Through experiments on the publicly available OpenEDS dataset, we show that performance with EER = 5% can be achieved using deep learning methods in a general setting, along with high accuracy for continuous user authentication.
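The equal error rate (EER) quoted above is the operating point at which the false accept rate equals the false reject rate. The following is a minimal sketch of how such a figure could be estimated from comparison scores, assuming that a higher score means a closer match; the function name and the toy scores are hypothetical and are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumes higher score = more similar. Returns (eer, threshold), where
    eer is the mean of FAR and FRR at the threshold where they are closest.
    """
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False reject: a genuine comparison scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False accept: an impostor comparison scoring at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores for illustration only (not data from the paper).
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.1, 0.2, 0.3, 0.5, 0.65]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
```

With real data the two error curves rarely cross exactly at an observed score, which is why the sketch reports the midpoint of FAR and FRR at the closest threshold rather than an exact crossing.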

23. Generative Model-Enhanced Human Motion Prediction [PDF]
Anthony Bourached, Ryan-Rhys Griffiths, Robert Gray, Ashwani Jha, Parashkev Nachev
Abstract: The task of predicting human motion is complicated by the natural heterogeneity and compositionality of actions, necessitating robustness to distributional shifts as far as out-of-distribution (OoD). Here we formulate a new OoD benchmark based on the Human3.6M and CMU motion capture datasets, and introduce a hybrid framework for hardening discriminative architectures to OoD failure by augmenting them with a generative model. When applied to current state-of-the-art discriminative models, we show that the proposed approach improves OoD robustness without sacrificing in-distribution performance, and can facilitate model interpretability. We suggest human motion predictors ought to be constructed with OoD challenges in mind, and provide an extensible general framework for hardening diverse discriminative architectures to extreme distributional shift. The code is available at this https URL.

24. A Data Set and a Convolutional Model for Iconography Classification in Paintings [PDF]
Federico Milani, Piero Fraternali
Abstract: Iconography in art is the discipline that studies the visual content of artworks to determine their motifs and themes and to characterize the way these are represented. It is a subject of active research for a variety of purposes, including the interpretation of meaning, the investigation of the origin and diffusion in time and space of representations, and the study of influences across artists and art works. With the proliferation of digital archives of art images, the possibility arises of applying Computer Vision techniques to the analysis of art images at an unprecedented scale, which may support iconography research and education. In this paper we introduce a novel paintings data set for iconography classification and present the quantitative and qualitative results of applying a Convolutional Neural Network (CNN) classifier to the recognition of the iconography of artworks. The proposed classifier achieves good performances (71.17% Precision, 70.89% Recall, 70.25% F1-Score and 72.73% Average Precision) in the task of identifying saints in Christian religious paintings, a task made difficult by the presence of classes with very similar visual features. Qualitative analysis of the results shows that the CNN focuses on the traditional iconic motifs that characterize the representation of each saint and exploits such hints to attain correct identification. The ultimate goal of our work is to enable the automatic extraction, decomposition, and comparison of iconography elements to support iconographic studies and automatic art work annotation.
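The reported F1-Score (70.25%) is lower than the harmonic mean of the reported Precision and Recall (about 71.03%), which would be consistent with metrics averaged per class, although the abstract does not state the averaging scheme. The sketch below shows macro-averaged precision, recall and F1 on a toy three-class problem; macro averaging, the function name and the labels are assumptions made for illustration.

```python
def macro_metrics(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all seen classes."""
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels standing in for iconography classes.
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "B", "B", "B", "C", "A"]
macro_p, macro_r, macro_f1 = macro_metrics(y_true, y_pred)
# Harmonic mean of the averaged precision and recall, for comparison.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)
```

In this toy case the macro F1 (the mean of per-class F1 values) comes out below the harmonic mean of macro precision and macro recall, the same pattern seen in the reported figures.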

25. BlendTorch: A Real-Time, Adaptive Domain Randomization Library [PDF]
Christoph Heindl, Lukas Brunner, Sebastian Zambal, Josef Scharinger
Abstract: Solving complex computer vision tasks by deep learning techniques relies on large amounts of (supervised) image data, typically unavailable in industrial environments. The lack of training data starts to impede the successful transfer of state-of-the-art methods in computer vision to industrial applications. We introduce BlendTorch, an adaptive Domain Randomization (DR) library, to help create infinite streams of synthetic training data. BlendTorch generates data by massively randomizing low-fidelity simulations and takes care of distributing artificial training data for model learning in real-time. We show that models trained with BlendTorch repeatedly perform better in an industrial object detection task than those trained on real or photo-realistic datasets.

26. Unsupervised Representation Learning by Invariance Propagation [PDF]
Feng Wang, Huaping Liu, Di Guo, Fuchun Sun
Abstract: Unsupervised learning methods based on contrastive learning have drawn increasing attention and achieved promising results. Most of them aim to learn representations invariant to instance-level variations, which are provided by different views of the same instance. In this paper, we propose Invariance Propagation to focus on learning representations invariant to category-level variations, which are provided by different instances from the same category. Our method recursively discovers semantically consistent samples residing in the same high-density regions in representation space. We demonstrate a hard sampling strategy to concentrate on maximizing the agreement between the anchor sample and its hard positive samples, which provide more intra-class variations to help capture more abstract invariance. As a result, with a ResNet-50 as the backbone, our method achieves 71.3% top-1 accuracy on ImageNet linear classification and 78.2% top-5 accuracy fine-tuning on only 1% labels, surpassing previous results. We also achieve state-of-the-art performance on other downstream tasks, including linear classification on Places205 and Pascal VOC, and transfer learning on small scale datasets.

27. Conversion and Implementation of State-of-the-Art Deep Learning Algorithms for the Classification of Diabetic Retinopathy [PDF]
Mihir Rao, Michelle Zhu, Tianyang Wang
Abstract: Diabetic retinopathy (DR) is a retinal microvascular condition that emerges in diabetic patients. DR will continue to be a leading cause of blindness worldwide, with a predicted 191.0 million globally diagnosed patients in 2030. Microaneurysms, hemorrhages, exudates, and cotton wool spots are common signs of DR. However, they can be small and hard for human eyes to detect. Early detection of DR is crucial for effective clinical treatment. Existing methods to classify images require much time for feature extraction and selection, and are limited in their performance. Convolutional Neural Networks (CNNs), as an emerging deep learning (DL) method, have proven their potential in image classification tasks. In this paper, comprehensive experimental studies of implementing state-of-the-art CNNs for the detection and classification of DR are conducted in order to determine the top performing classifiers for the task. Five CNN classifiers, namely Inception-V3, VGG19, VGG16, ResNet50, and InceptionResNetV2, are evaluated through experiments. They categorize medical images into five different classes based on DR severity. Data augmentation and transfer learning techniques are applied since annotated medical images are limited and imbalanced. Experimental results indicate that the ResNet50 classifier has top performance for binary classification and that the InceptionResNetV2 classifier has top performance for multi-class DR classification.

28. Tackling problems of marker-based augmented reality under water [PDF]
Jan Čejka, Fotis Liarokapis
Abstract: Underwater sites are a harsh environment for augmented reality applications. Obstacles that must be overcome include poor visibility conditions, difficult navigation, and the difficulty of handling devices under water. This chapter focuses on the problem of localizing a device under water using markers. It discusses various filters that enhance and improve images recorded under water, and their impact on marker-based tracking. It presents various combinations of 10 image improving algorithms and 4 marker detecting algorithms, and tests their performance in real situations. All solutions are designed to run in real time on mobile devices to provide a solid basis for augmented reality. Usability of this solution is evaluated at locations in the Mediterranean Sea. It is shown that image improving algorithms with carefully chosen parameters can reduce the problems with visibility under water and improve the detection of markers. The best results are obtained with marker detecting algorithms that are specifically designed for underwater environments.

29. Cross-Spectral Iris Matching Using Conditional Coupled GAN [PDF]
Moktari Mostofa, Fariborz Taherkhani, Jeremy Dawson, Nasser M. Nasrabadi
Abstract: Cross-spectral iris recognition is emerging as a promising biometric approach to authenticating the identity of individuals. However, matching iris images acquired at different spectral bands shows significant performance degradation when compared to single-band near-infrared (NIR) matching due to the spectral gap between iris images obtained in the NIR and visual-light (VIS) spectra. Although researchers have recently focused on deep-learning-based approaches to recover invariant representative features for more accurate recognition performance, the existing methods cannot achieve the expected accuracy required for commercial applications. Hence, in this paper, we propose a conditional coupled generative adversarial network (CpGAN) architecture for cross-spectral iris recognition by projecting the VIS and NIR iris images into a low-dimensional embedding domain to explore the hidden relationship between them. The conditional CpGAN framework consists of a pair of GAN-based networks, one responsible for retrieving images in the visible domain and the other responsible for retrieving images in the NIR domain. Both networks try to map the data into a common embedding subspace to ensure maximum pair-wise similarity between the feature vectors from the two iris modalities of the same subject. To prove the usefulness of our proposed approach, extensive experimental results obtained on the PolyU dataset are compared to existing state-of-the-art cross-spectral recognition methods.

30. DocStruct: A Multimodal Method to Extract Hierarchy Structure in Document for General Form Understanding [PDF]
Zilong Wang, Mingjie Zhan, Xuebo Liu, Ding Liang
Abstract: Form understanding depends on both textual contents and organizational structure. Although modern OCR performs well, it is still challenging to realize general form understanding because forms are commonly used and of various formats. The table detection and handcrafted features in previous works cannot apply to all forms because of their requirements on formats. Therefore, we concentrate on the most elementary components, the key-value pairs, and adopt multimodal methods to extract features. We consider the form structure as a tree-like or graph-like hierarchy of text fragments. The parent-child relation corresponds to the key-value pairs in forms. We utilize the state-of-the-art models and design targeted extraction modules to extract multimodal features from semantic contents, layout information, and visual images. A hybrid fusion method of concatenation and feature shifting is designed to fuse the heterogeneous features and provide an informative joint representation. We adopt an asymmetric algorithm and negative sampling in our model as well. We validate our method on two benchmarks, MedForm and FUNSD, and extensive experiments demonstrate the effectiveness of our method.

31. Learning Panoptic Segmentation from Instance Contours [PDF]
Sumanth Chennupati, Venkatraman Narayanan, Ganesh Sistu, Senthil Yogamani, Samir A Rawashdeh
Abstract: Panoptic Segmentation aims to provide an understanding of background (stuff) and instances of objects (things) at a pixel level. It combines the separate tasks of semantic segmentation (pixel-level classification) and instance segmentation to build a single unified scene understanding task. Typically, panoptic segmentation is derived by combining semantic and instance segmentation tasks that are learned separately or jointly (multi-task networks). In general, instance segmentation networks are built by adding a foreground mask estimation layer on top of object detectors or using instance clustering methods that assign a pixel to an instance center. In this work, we present a fully convolutional neural network that learns instance segmentation from semantic segmentation and instance contours (boundaries of things). Instance contours along with semantic segmentation yield a boundary-aware semantic segmentation of things. Connected component labeling on these results produces instance segmentation. We merge semantic and instance segmentation results to output panoptic segmentation. We evaluate our proposed method on the CityScapes dataset to demonstrate qualitative and quantitative performances along with several ablation studies.
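The step from a boundary-aware semantic map to instances can be sketched as follows. This is a minimal illustration that assumes binary masks given as nested lists and 4-connectivity; the function name, the mask layout and the choice to leave contour pixels unlabeled are assumptions, not the authors' implementation.

```python
from collections import deque


def instances_from_contours(thing_mask, contour_mask):
    """Label 4-connected regions of 'thing' pixels that are not contour pixels.

    Returns (labels, count): labels[r][c] is 0 for background or contour
    pixels and a positive instance id otherwise.
    """
    h, w = len(thing_mask), len(thing_mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if not thing_mask[r][c] or contour_mask[r][c] or labels[r][c]:
                continue
            # Unvisited interior pixel: start a new instance and flood-fill it.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and thing_mask[ny][nx]
                            and not contour_mask[ny][nx]
                            and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Two touching objects of the same class, separated only by a contour column.
thing = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]]
contour = [[0, 0, 1, 0, 0],
           [0, 0, 1, 0, 0]]
labels, count = instances_from_contours(thing, contour)
```

Without the contour mask the two objects would merge into a single connected component, which is why the boundary prediction is what makes instance separation possible in this scheme.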

32. DPAttack: Diffused Patch Attacks against Universal Object Detection [PDF]
Shudeng Wu, Tao Dai, Shu-Tao Xia
Abstract: Recently, deep neural networks (DNNs) have been widely and successfully used in Object Detection, e.g. Faster RCNN, YOLO, CenterNet. However, recent studies have shown that DNNs are vulnerable to adversarial attacks. Adversarial attacks against object detection can be divided into two categories, whole-pixel attacks and patch attacks. While these attacks add perturbations to a large number of pixels in images, we propose a diffused patch attack (DPAttack) to successfully fool object detectors by diffused patches of asteroid or grid shape, which only change a small number of pixels. Experiments show that our DPAttack can successfully fool most object detectors with diffused patches, and we took second place in the Alibaba Tianchi competition: Alibaba-Tsinghua Adversarial Challenge on Object Detection. Our code can be obtained from this https URL.

33. Efficient Generalized Spherical CNNs [PDF]
Oliver J. Cobb, Christopher G. R. Wallis, Augustine N. Mavor-Parker, Augustin Marignier, Matthew A. Price, Mayeul d'Avezac, Jason D. McEwen
Abstract: Many problems across computer vision and the natural sciences require the analysis of spherical data, for which representations may be learned efficiently by encoding equivariance to rotational symmetries. We present a generalized spherical CNN framework that encompasses various existing approaches and allows them to be leveraged alongside each other. The only existing non-linear spherical CNN layer that is strictly equivariant has complexity $\mathcal{O}(C^2L^5)$, where $C$ is a measure of representational capacity and $L$ the spherical harmonic bandlimit. Such a high computational cost often prohibits the use of strictly equivariant spherical CNNs. We develop two new strictly equivariant layers with reduced complexity $\mathcal{O}(CL^4)$ and $\mathcal{O}(CL^3 \log L)$, making larger, more expressive models computationally feasible. Moreover, we adopt efficient sampling theory to achieve further computational savings. We show that these developments allow the construction of more expressive hybrid models that achieve state-of-the-art accuracy and parameter efficiency on spherical benchmark problems.
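To make the complexity reduction concrete, the ratios between the existing layer's cost and the two new costs can be worked out directly from the expressions in the abstract. The numeric values below are illustrative assumptions (they do not come from the paper), and since big-O notation hides constant factors they indicate scaling only, not measured speed-ups.

```latex
\[
\frac{C^2 L^5}{C L^4} = C L,
\qquad
\frac{C^2 L^5}{C L^3 \log L} = \frac{C L^2}{\log L}.
\]
% Illustrative values only (not from the paper): C = 4, L = 128, base-2 logarithm.
\[
C L = 4 \cdot 128 = 512,
\qquad
\frac{C L^2}{\log_2 L} = \frac{4 \cdot 128^2}{7} \approx 9362.
\]
```

Both ratios grow with $C$ and $L$, so the advantage of the new layers widens as representational capacity and bandlimit increase.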

34. Learning to Sort Image Sequences via Accumulated Temporal Differences [PDF]
Gagan Kanojia, Shanmuganathan Raman
Abstract: Consider a set of n images of a scene with dynamic objects captured with a static or a handheld camera. Let the temporal order in which these images are captured be unknown. There can be n! possibilities for the temporal order in which these images could have been captured. In this work, we tackle the problem of temporally sequencing the unordered set of images of a dynamic scene captured with a hand-held camera. We propose a convolutional block which captures the spatial information through 2D convolution kernel and captures the temporal information by utilizing the differences present among the feature maps extracted from the input images. We evaluate the performance of the proposed approach on the dataset extracted from a standard action recognition dataset, UCF101. We show that the proposed approach outperforms the state-of-the-art methods by a significant margin. We show that the network generalizes well by evaluating it on a dataset extracted from the DAVIS dataset, a dataset meant for video object segmentation, when the same network was trained with a dataset extracted from UCF101, a dataset meant for action recognition.

35. Self-Supervised Shadow Removal [PDF]
Florin-Alexandru Vasluianu, Andres Romero, Luc Van Gool, Radu Timofte
Abstract: Shadow removal is an important computer vision task aiming at the detection and successful removal of the shadow produced by an occluded light source and a photo-realistic restoration of the image contents. Decades of research produced a multitude of hand-crafted restoration techniques and, more recently, learned solutions from shadowed and shadow-free training image pairs. In this work, we propose an unsupervised single image shadow removal solution via self-supervised learning by using a conditioned mask. In contrast to existing literature, we do not require paired shadowed and shadow-free images; instead we rely on self-supervision and jointly learn deep models to remove and add shadows to images. We validate our approach on the recently introduced ISTD and USR datasets. We largely improve quantitatively and qualitatively over the compared methods and set a new state-of-the-art performance in single image shadow removal.

36. Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization [PDF]
Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Junsong Yuan, Gang Hua
Abstract: Weakly-supervised Temporal Action Localization (W-TAL) aims to classify and localize all action instances in an untrimmed video under only video-level supervision. However, without frame-level annotations, it is challenging for W-TAL methods to identify false positive action proposals and generate action proposals with precise temporal boundaries. In this paper, we present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated, and used to provide frame-level supervision for improved model training and false positive action proposal elimination. Furthermore, we propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries. Experiments conducted on the THUMOS14 and ActivityNet datasets show that the proposed TSCN outperforms current state-of-the-art methods, and even achieves comparable results with some recent fully-supervised methods.

37. Face Hallucination Using Split-Attention in Split-Attention Network [PDF]
Yuanzhi Wang, Tao Lu, Yu Wang, Yanduo Zhang
Abstract: Face hallucination is a domain-specific super-resolution (SR) task that generates high-resolution (HR) facial images from one or more observed low-resolution (LR) inputs. Recently, convolutional neural networks (CNNs) have been successfully applied to face hallucination to model the complex nonlinear mapping between HR and LR images. Although the global attention mechanism built into CNNs naturally focuses on facial structure information, it always ignores the local and cross-feature structure information, resulting in limited reconstruction performance. To solve this problem, we propose a global-local split-attention mechanism and design a Split-Attention in Split-Attention (SIS) network to enable local attention across feature-map groups attaining global attention and to improve the ability of feature representations. SIS can generate and focus the local attention of the neural network on the interaction of key face structure information at the channel level, thereby improving the performance of face image reconstruction. Experimental results show that the proposed approach consistently and significantly improves the reconstruction performance for face hallucination.

38. Fingerprint Orientation Estimation: Challenges and Opportunities [PDF]
Amit Kumar Trivedi
Abstract: There is an exponential increase in portable electronic devices with biometric security mechanisms, in particular fingerprint biometrics. A person has a limited number of fingerprints, which remain unchanged throughout their lifetime; once leaked to an adversary, a fingerprint is leaked for a lifetime. So, there is a need to secure the biometric template itself. In this survey paper, we review the different security models and fingerprint template protection techniques. The research challenges in different fingerprint template protection techniques are also highlighted in the respective sections of the paper. This survey provides a comprehensive study of template protection techniques for fingerprint biometric systems and highlights the challenges and future opportunities.

39. Learning Dual Semantic Relations with Graph Attention for Image-Text Matching
Keyu Wen, Xiaodong Gu, Qingrong Cheng
Abstract: Image-text matching is a major task in cross-modal information processing. The main challenge is to learn unified visual and textual representations. Previous methods that perform well on this task focus primarily not only on the alignment between region features in images and the corresponding words in sentences, but also on the alignment between relations of regions and relational words. However, the lack of joint learning of regional features and global features causes the regional features to lose contact with the global context, leading to mismatches with non-object words that have global meanings in some sentences. To alleviate this issue, it is necessary to enhance the relations between regions and the relations between regional and global concepts, so as to obtain a more accurate visual representation that correlates better with the corresponding text. Thus, we propose a novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN), which mainly consists of two modules: the separate semantic relations module and the joint semantic relations module. DSRAN performs graph attention in both modules, for region-level relations enhancement and regional-global relations enhancement respectively, at the same time. With these two modules, different hierarchies of semantic relations are learned simultaneously, thus promoting the image-text matching process by providing more information for the final visual representation. Quantitative experiments have been performed on MS-COCO and Flickr30K, and our method outperforms previous approaches by a large margin owing to the effectiveness of the dual semantic relations learning scheme. Code is available at this https URL.

40. TLGAN: document Text Localization using Generative Adversarial Nets
Dongyoung Kim, Myungsung Kwak, Eunji Won, Sejung Shin, Jeongyeon Nam
Abstract: Text localization from a digital image is the first step of the optical character recognition task. Conventional image-processing-based text localization performs adequately for specific examples. Yet, general text localization is achieved only by recent deep-learning-based modalities. Here we present document Text Localization Generative Adversarial Nets (TLGAN), which are deep neural networks that perform text localization from digital images. TLGAN is a versatile and easy-to-train text localization model requiring a small amount of data. Trained on only ten labeled receipt images from the Robust Reading Challenge on Scanned Receipts OCR and Information Extraction (SROIE), TLGAN achieved 99.83% precision and 99.64% recall on the SROIE test data. Our TLGAN is a practical text localization solution that requires minimal effort for data labeling and model training and produces state-of-the-art performance.

41. Convolutional Autoencoders for Human Motion Infilling
Manuel Kaufmann, Emre Aksan, Jie Song, Fabrizio Pece, Remo Ziegler, Otmar Hilliges
Abstract: In this paper we propose a convolutional autoencoder to address the problem of motion infilling for 3D human motion data. Given a start and end sequence, motion infilling aims to complete the missing gap in between, such that the filled-in poses plausibly forecast the start sequence and naturally transition into the end sequence. To this end, we propose a single, end-to-end trainable convolutional autoencoder. We show that a single model can be used to create natural transitions between different types of activities. Furthermore, our method is not only able to fill in entire missing frames, but it can also be used to complete gaps where partial poses are available (e.g. from end effectors), or to clean up other forms of noise (e.g. Gaussian). Also, the model can fill in an arbitrary number of gaps that potentially vary in length. In addition, no further post-processing of the model's outputs, such as smoothing or closing discontinuities at the end of the gap, is necessary. At the heart of our approach lies the idea of casting motion infilling as an inpainting problem and training a convolutional de-noising autoencoder on image-like representations of motion sequences. At training time, blocks of columns are removed from such images and we ask the model to fill in the gaps. We demonstrate the versatility of the approach via a number of complex motion sequences and report on thorough evaluations performed to better understand the capabilities and limitations of the proposed approach.

42. F-Siamese Tracker: A Frustum-based Double Siamese Network for 3D Single Object Tracking
Hao Zou, Jinhao Cui, Xin Kong, Chujuan Zhang, Yong Liu, Feng Wen, Wanlong Li
Abstract: This paper presents F-Siamese Tracker, a novel approach for single object tracking characterized by more robust integration of 2D and 3D information to reduce redundant search space. A main challenge in 3D single object tracking is how to reduce the search space for generating appropriate 3D candidates. Instead of relying solely on 3D proposals, our method first leverages a Siamese network applied to RGB images to produce 2D region proposals, which are then extruded into 3D viewing frustums. We then perform an online accuracy validation on the 3D frustum to generate a refined point cloud search space, which can be embedded directly into the existing 3D tracking backbone. For efficiency, our approach achieves better performance with fewer candidates by reducing the search space. In addition, thanks to the online accuracy validation, our approach can still achieve high precision in occasional cases with strong occlusions or very sparse points, even when the 2D Siamese tracker loses the target. This approach allows us to set a new state of the art in 3D single object tracking by a significant margin on a sparse outdoor dataset (KITTI tracking). Moreover, experiments on 2D single object tracking show that our framework boosts 2D tracking performance as well.

43. 3D Meta-Registration: Learning to Learn Registration of 3D Point Clouds
Lingjing Wang, Yu Hao, Xiang Li, Yi Fang
Abstract: Deep learning-based point cloud registration models are often generalized from extensive training over a large volume of data to learn the ability to predict the desired geometric transformation to register 3D point clouds. In this paper, we propose a meta-learning based 3D registration model, named 3D Meta-Registration, that is capable of rapidly adapting and generalizing well to new 3D registration tasks for unseen 3D point clouds. Our 3D Meta-Registration gains a competitive advantage by training over a variety of 3D registration tasks, which leads to a model optimized for the best performance on the distribution of registration tasks, including potentially unseen tasks. Specifically, the proposed 3D Meta-Registration model consists of two modules: a 3D registration learner and a 3D registration meta-learner. During training, the 3D registration learner is trained to complete a specific registration task, which aims to determine the desired geometric transformation that aligns the source point cloud with the target one. In the meantime, the 3D registration meta-learner is trained to provide the optimal parameters to update the 3D registration learner based on the learned task distribution. After training, the 3D registration meta-learner, which is learned with optimized coverage of the distribution of 3D registration tasks, is able to dynamically update 3D registration learners with the desired parameters to rapidly adapt to new registration tasks. We tested our model on the synthesized datasets ModelNet and FlyingThings3D, as well as the real-world dataset KITTI. Experimental results demonstrate that 3D Meta-Registration achieves superior performance over previous techniques (e.g. FlowNet3D).

44. High resolution weakly supervised localization architectures for medical images
Konpat Preechakul, Sira Sriswasdi, Boonserm Kijsirikul, Ekapol Chuangsuwanich
Abstract: In medical imaging, Class-Activation Map (CAM) serves as the main explainability tool by pointing to the region of interest. Since the localization accuracy from CAM is constrained by the resolution of the model's feature map, one may expect that segmentation models, which generally have large feature maps, would produce more accurate CAMs. However, we have found that this is not the case due to task mismatch. While segmentation models are developed for datasets with pixel-level annotation, only image-level annotation is available in most medical imaging datasets. Our experiments suggest that Global Average Pooling (GAP) and Group Normalization are the main culprits that worsen the localization accuracy of CAM. To address this issue, we propose Pyramid Localization Network (PYLON), a model for high-accuracy weakly-supervised localization that achieved 0.62 average point localization accuracy on NIH's Chest X-Ray 14 dataset, compared to 0.45 for a traditional CAM model. Source code and extended results are available at this https URL.

45. An explainable deep vision system for animal classification and detection in trail-camera images with automatic post-deployment retraining
Golnaz Moallem, Don Pathirage, Joel Reznick, James Gallagher, Hamed Sari-Sarraf
Abstract: This paper introduces an automated vision system for animal detection in trail-camera images taken from a field under the administration of the Texas Parks and Wildlife Department. As traditional wildlife counting techniques are intrusive and labor intensive to conduct, trail-camera imaging is a comparatively non-intrusive method for capturing wildlife activity. However, given the large volume of images produced by trail cameras, manual analysis of the images remains time-consuming and inefficient. We implemented a two-stage deep convolutional neural network pipeline to find animal-containing images in the first stage and then process these images to detect birds in the second stage. The animal classification system classifies animal images with more than 87% sensitivity and 96% specificity. The bird detection system achieves better than 93% sensitivity, 92% specificity, and 68% average Intersection-over-Union rate. The entire pipeline processes an image in less than 0.5 seconds, as opposed to an average of 30 seconds for a human labeler. We also addressed post-deployment issues related to data drift for the animal classification system, as image features vary with seasonal changes. This system utilizes an automatic retraining algorithm to detect data drift and update the system. We introduce a novel technique for detecting drifted images and triggering the retraining procedure. Two statistical experiments are also presented to explain the prediction behavior of the animal classification system. These experiments investigate the cues that steer the system towards a particular decision. Statistical hypothesis testing demonstrates that the presence of an animal in the input image significantly contributes to the system's decisions.
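
The three metrics quoted in the abstract above (sensitivity, specificity, and Intersection-over-Union) can be illustrated with a minimal sketch. The function names and the (x1, y1, x2, y2) box convention are assumptions made for this example; the paper's own evaluation code is not shown in the source.

```python
def sensitivity(tp, fn):
    """True positive rate: share of actual positives that were detected."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: share of actual negatives that were rejected."""
    return tn / (tn + fp)


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1/7.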

46. Novel View Synthesis from only a 6-DoF Camera Pose by Two-stage Networks
Xiang Guo, Bo Li, Yuchao Dai, Tongxin Zhang, Hui Deng
Abstract: Novel view synthesis is a challenging problem in computer vision and robotics. Unlike existing works, which need reference images or 3D models of the scene to generate images under novel views, we propose a novel paradigm for this problem: we synthesize the novel view directly from only a 6-DoF camera pose. Although this setting is the most straightforward one, few works address it. Our experiments demonstrate that, with a concise CNN, we can obtain a meaningful parametric model that reconstructs the correct scenery images from the 6-DoF pose alone. To this end, we propose a two-stage learning strategy, which consists of two consecutive CNNs: GenNet and RefineNet. GenNet generates a coarse image from a camera pose. RefineNet is a generative adversarial network that refines the coarse image. In this way, we decouple the geometric relationship between mapping and texture detail rendering. Extensive experiments conducted on public datasets prove the effectiveness of our method. We believe this paradigm is of high research and application value and could be an important direction in novel view synthesis.

47. Fine-tuned Pre-trained Mask R-CNN Models for Surface Object Detection
Haruhiro Fujita, Masatoshi Itagaki, Kenta Ichikawa, Yew Kwang Hooi, Kazutaka Kawano, Ryo Yamamoto
Abstract: This study evaluates road surface object detection tasks using four Mask R-CNN models as a pre-study of surface deterioration detection for stone-made archaeological objects. The models were pre-trained and fine-tuned with COCO datasets and 15,188 segmented road surface annotation tags. The quality of the models was measured using Average Precision and Average Recall. The results indicate a substantial number of false negatives, i.e. left detections and unclassified detections. A modified confusion matrix model that avoids prioritizing IoU was tested; it shows notable true positive increases in bounding box detection, but almost no change in segmentation masks.

48. GAN based Unsupervised Segmentation: Should We Match the Exact Number of Objects
Quan Liu, Isabella M. Gaeta, Bryan A. Millis, Matthew J. Tyska, Yuankai Huo
Abstract: Unsupervised segmentation is an increasingly popular topic in biomedical image analysis. The basic idea is to approach the supervised segmentation task as an unsupervised synthesis problem, where the intensity images can be transferred to the annotation domain using cycle-consistent adversarial learning. Previous studies have shown that macro-level (global distribution level) matching of the number of objects (e.g., cells, tissues, protrusions etc.) between the two domains results in better segmentation performance. However, no prior studies have explored whether unsupervised segmentation performance would be further improved by matching the exact number of objects at the micro-level (mini-batch level). In this paper, we propose a deep learning based unsupervised segmentation method for segmenting highly overlapped and dynamic sub-cellular microvilli. With this challenging task, both micro-level and macro-level matching strategies were evaluated. To match the number of objects at the micro-level, a novel fluorescence-based micro-level matching approach is presented. In our experiments, micro-level matching did not improve segmentation performance compared with the simpler macro-level matching.

49. Task-Adaptive Feature Transformer for Few-Shot Segmentation
Jun Seo, Young-Hyun Park, Sung-Whan Yoon, Jaekyun Moon
Abstract: Few-shot learning allows machines to classify novel classes using only a few labeled samples. Recently, few-shot segmentation aiming at semantic segmentation on low sample data has also seen great interest. In this paper, we propose a learnable module for few-shot segmentation, the task-adaptive feature transformer (TAFT). TAFT linearly transforms task-specific high-level features to a set of task-agnostic features well-suited to the segmentation job. Using this task-conditioned feature transformation, the model is shown to effectively utilize the semantic information in novel classes to generate tight segmentation masks. The proposed TAFT module can be easily plugged into existing semantic segmentation algorithms to achieve few-shot segmentation capability with only a few added parameters. We combine TAFT with Deeplab V3+, a well-known segmentation architecture; experiments on the PASCAL-$5^i$ dataset confirm that this combination successfully adds few-shot learning capability to the segmentation algorithm, achieving the state-of-the-art few-shot segmentation performance in some key representative cases.

50. Efficient Scale-Permuted Backbone with Learned Resource Distribution
Xianzhi Du, Tsung-Yi Lin, Pengchong Jin, Yin Cui, Mingxing Tan, Quoc Le, Xiaodan Song
Abstract: Recently, SpineNet has demonstrated promising results on object detection and image classification over the ResNet model. However, it is unclear whether the improvement adds up when combining a scale-permuted backbone with advanced efficient operations and compound scaling. Furthermore, SpineNet is built with a uniform resource distribution over operations. While this strategy seems to be prevalent for scale-decreased models, it may not be an optimal design for scale-permuted models. In this work, we propose a simple technique to combine efficient operations and compound scaling with a previously learned scale-permuted architecture. We demonstrate that the efficiency of scale-permuted models can be further improved by learning a resource distribution over the entire network. The resulting efficient scale-permuted models outperform state-of-the-art EfficientNet-based models on object detection and achieve competitive performance on image classification and semantic segmentation. Code and models will be open-sourced soon.

51. Learning Loss for Test-Time Augmentation
Ildoo Kim, Younghoon Kim, Sungwoong Kim
Abstract: Data augmentation has been actively studied for robust neural networks. Most of the recent data augmentation methods focus on augmenting datasets during the training phase. At the testing phase, simple transformations are still widely used for test-time augmentation. This paper proposes a novel instance-level test-time augmentation that efficiently selects suitable transformations for a test input. Our proposed method involves an auxiliary module to predict the loss of each possible transformation given the input. Then, the transformations having lower predicted losses are applied to the input. The network obtains the results by averaging the prediction results of augmented inputs. Experimental results on several image classification benchmarks show that the proposed instance-aware test-time augmentation improves the model's robustness against various corruptions.
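
The selection step described in the abstract above (predict a loss per candidate transformation, apply the ones with the lowest predicted loss, average the resulting predictions) can be sketched as follows. This is only an illustration under stated assumptions: `loss_predictor`, `model`, `transforms`, and the parameter `k` are hypothetical placeholders, not the authors' interfaces.

```python
def select_transforms(predicted_losses, k):
    """Return the names of the k transformations with the lowest predicted loss.

    predicted_losses: dict mapping a transformation name to its predicted loss.
    """
    ranked = sorted(predicted_losses, key=predicted_losses.get)
    return ranked[:k]


def average_predictions(prob_vectors):
    """Element-wise mean of equally sized class-probability vectors."""
    n = len(prob_vectors)
    return [sum(column) / n for column in zip(*prob_vectors)]


def tta_predict(x, model, transforms, loss_predictor, k=2):
    """Instance-level test-time augmentation (hypothetical interfaces).

    model:          callable mapping an input to class probabilities
    transforms:     dict mapping a name to a callable transformation
    loss_predictor: callable mapping an input to {name: predicted loss}
    """
    chosen = select_transforms(loss_predictor(x), k)
    outputs = [model(transforms[name](x)) for name in chosen]
    return average_predictions(outputs)
```

In this sketch the loss predictor sees only the untransformed input, matching the abstract's description of predicting the loss of each possible transformation given the input.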

52. Learning Occupancy Function from Point Clouds for Surface Reconstruction
Meng Jia, Matthew Kyan
Abstract: Implicit function based surface reconstruction has long been studied as a way to recover 3D shapes from point clouds sampled from surfaces. Recently, Signed Distance Functions (SDFs) and Occupancy Functions have been adopted in learning-based shape reconstruction methods as implicit 3D shape representations. This paper proposes a novel method for learning occupancy functions from sparse point clouds and achieves better performance on challenging surface reconstruction tasks. Unlike previous methods, which predict point occupancy with fully-connected multi-layer networks, we adapt a point cloud deep learning architecture, Point Convolution Neural Network (PCNN), to build our learning model. Specifically, we create a sampling operator and insert it into PCNN to continuously sample the feature space at the points where occupancy states need to be predicted. This method natively captures the geometric nature of point cloud data and is invariant to point permutation. Our occupancy function learning fits easily into procedures for point cloud up-sampling and surface reconstruction. Our experiments show state-of-the-art reconstruction performance on the ShapeNet dataset and demonstrate that the method generalizes well by testing it on the McGill 3D dataset \cite{siddiqi2008retrieving}. Moreover, we find that the learned occupancy function is relatively more rotation invariant than previous shape learning methods.

53. Learning Graph-Based Priors for Generalized Zero-Shot Learning
Colin Samplawski, Jannik Wolff, Tassilo Klein, Moin Nabi
Abstract: The task of zero-shot learning (ZSL) requires correctly predicting the label of samples from classes which were unseen at training time. This is achieved by leveraging side information about class labels, such as label attributes or word embeddings. Recently, attention has shifted to the more realistic task of generalized ZSL (GZSL) where test sets consist of seen and unseen samples. Recent approaches to GZSL have shown the value of generative models, which are used to generate samples from unseen classes. In this work, we incorporate an additional source of side information in the form of a relation graph over labels. We leverage this graph in order to learn a set of prior distributions, which encourage an aligned variational autoencoder (VAE) model to learn embeddings which respect the graph structure. Using this approach we are able to achieve improved performance on the CUB and SUN benchmarks over a strong baseline.

54. QISTA-Net: DNN Architecture to Solve $\ell_q$-norm Minimization Problem and Image Compressed Sensing
Gang-Xuan Lin, Shih-Wei Hu, Chun-Shien Lu
Abstract: In this paper, we reformulate the non-convex $\ell_q$-norm minimization problem with $q\in(0,1)$ into a 2-step problem, which consists of one convex and one non-convex subproblem, and propose a novel iterative algorithm called QISTA ($\ell_q$-ISTA) to solve the $\left(\ell_q\right)$-problem. By taking advantage of deep learning in accelerating optimization algorithms, together with a speedup strategy that uses the momentum from all previous layers in the network, we propose a learning-based method, called QISTA-Net-s, to solve the sparse signal reconstruction problem. Extensive experimental comparisons demonstrate that QISTA-Net-s yields better reconstruction quality than state-of-the-art $\ell_1$-norm optimization (plus learning) algorithms even if the original sparse signal is noisy. On the other hand, based on the network architecture associated with QISTA and considering the use of convolution layers, we propose QISTA-Net-n for solving the image CS problem; its reconstruction performance still exceeds that of most state-of-the-art natural image reconstruction methods. QISTA-Net-n is designed by unfolding QISTA and adding the convolutional operator as the dictionary. This makes QISTA-Net-s interpretable. We provide complete experimental results showing that QISTA-Net-s and QISTA-Net-n deliver better reconstruction performance than the competing methods.
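
For orientation, the classical ISTA iteration for the convex $\ell_1$ problem, min 0.5*||Ax - y||^2 + lam*||x||_1, is sketched below. This is the standard baseline that QISTA's name refers to, not the $\ell_q$ algorithm proposed in the abstract above, whose update rule is not given in the source. The names `ista`, `lam`, and `step` are chosen for this example, and the step size is assumed to be at most 1/L, where L is the largest eigenvalue of A^T A.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrink every entry towards zero."""
    out = []
    for t in v:
        mag = abs(t) - tau
        out.append(0.0 if mag <= 0 else (mag if t > 0 else -mag))
    return out


def matvec(A, x):
    """Compute A x for a dense matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Compute A^T r for a dense matrix stored as a list of rows."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def ista(A, y, lam, step, n_iter=200):
    """Classical ISTA for the l1-regularized least-squares problem."""
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = rmatvec(A, residual)  # gradient of the smooth data term
        x = soft_threshold([xi - step * g for xi, g in zip(x, grad)], step * lam)
    return x
```

With A equal to the identity and a unit step, a single iteration already returns the soft-thresholded measurement, which makes the shrinkage effect easy to check by hand.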

55. Bandwidth-Adaptive Feature Sharing for Cooperative LIDAR Object Detection
Ehsan Emad Marvasti, Arash Raftari, Amir Emad Marvasti, Yaser P. Fallah
Abstract: Situational awareness, a necessity in the connected and autonomous vehicles (CAV) domain, has been the subject of a significant amount of research in recent years. The driver's safety is directly dependent on the robustness, reliability, and scalability of such systems. Cooperative mechanisms have provided a solution to improve situational awareness by utilizing high speed wireless vehicular networks. These mechanisms mitigate problems such as occlusion and sensor range limitation. However, the network capacity is a factor determining the maximum amount of information that can be shared among cooperative entities. The notion of feature sharing, proposed in our previous work, aims to address these challenges by maintaining a balance between computation and communication load. In this work, we propose a mechanism to add flexibility in adapting to communication channel capacity and a novel decentralized shared data alignment method to further improve cooperative object detection performance. The performance of the proposed framework is verified through experiments on the Volony dataset. The results confirm that our proposed framework outperforms our previous cooperative object detection method (FS-COD) in terms of average precision.

56. Voronoi Convolutional Neural Networks
Soroosh Yazdani, Andrea Tagliasacchi
Abstract: In this technical report, we investigate extending convolutional neural networks to the setting where functions are not sampled in a grid pattern. We show that by treating the samples as the average of a function within a cell, we can find a natural equivalent of most layers used in CNN. We also present an algorithm for running inference for these models exactly using standard convex geometry algorithms.

57. Performance Prediction for Convolutional Neural Networks in Edge Devices
Halima Bouzidi, Hamza Ouarnoughi, Smail Niar, Abdessamad Ait El Cadi
Abstract: Running Convolutional Neural Network (CNN) based applications on edge devices near the source of data can meet latency and privacy challenges. However, due to their reduced computing resources and energy constraints, these edge devices can hardly satisfy the processing and data storage needs of CNNs. For these platforms, choosing the CNN with the best trade-off between accuracy and execution time while respecting hardware constraints is crucial. In this paper, we present and compare five (5) widely used Machine Learning based methods for execution time prediction of CNNs on two (2) edge GPU platforms. For these 5 methods, we also explore the time needed for training them and tuning their corresponding hyperparameters. Finally, we compare the times needed to run the prediction models on different platforms. The use of these methods will greatly facilitate design space exploration by quickly providing the best CNN for a target edge GPU. Experimental results show that eXtreme Gradient Boosting (XGBoost) achieves an average prediction error below 14.73%, even for unexplored and unseen CNN model architectures. Random Forest (RF) shows comparable accuracy but needs more effort and time to be trained. The other 3 approaches (OLS, MLP and SVR) are less accurate for CNN performance estimation.
摘要：运行卷积神经网络（CNN）上的数据源能够满足时延和隐私的挑战靠近边缘设备的应用程序。然而，由于它们降低的计算资源和它们的能量约束，这些边缘设备可以很难满足CNN在处理和数据存储的需要。对于这些平台，选择CNN与准确性和执行时间之间的最佳平衡，同时尊重硬件的限制是至关重要的。在本文中，我们提出和用于在两（2）边缘GPU平台细胞神经网络的执行时间预测广泛使用的机器学习基础的方法比较五（5）。对于这些5种方法，我们还需要探索为他们的训练时间和调整其相应的超参数。最后，我们比较次，在不同的平台上运行的预测模型。这些方法的利用率会高通过对目标的边缘GPU迅速提供最佳的便利CNN设计空间探索。实验结果表明，极限梯度增压（XGBoost）即使对于未开发和看不见CNN模型的体系结构提供了一个小于14.73％的平均预测误差。随机森林（RF）表示了与准确性，但需要更多的精力和时间进行培训。其他的方法3（OLS，MLP和SVR）是用于CNN性能估计不准确。
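
Illustrative note (not from the paper): the abstract reports an "average prediction error" without defining it. One common reading is the mean absolute percentage error between measured and predicted execution times; the sketch below assumes that metric, and the timing values are made up for illustration.

```python
def mean_absolute_percentage_error(measured_ms, predicted_ms):
    """Average of |predicted - measured| / measured, expressed in percent."""
    if len(measured_ms) != len(predicted_ms) or not measured_ms:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for measured, predicted in zip(measured_ms, predicted_ms):
        total += abs(predicted - measured) / measured
    return 100.0 * total / len(measured_ms)


# Hypothetical execution times (milliseconds) of three CNNs on one edge GPU.
measured = [10.0, 20.0, 40.0]
predicted = [11.0, 18.0, 40.0]
mape = mean_absolute_percentage_error(measured, predicted)
print(f"average prediction error: {mape:.2f}%")
```

A predictor would be judged acceptable under the abstract's figure if this value stayed below 14.73 on CNNs it had never seen.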

58. Unrolling of Deep Graph Total Variation for Image Denoising
Huy Vu, Gene Cheung, Yonina C. Eldar
Abstract: While deep learning (DL) architectures like convolutional neural networks (CNNs) have enabled effective solutions in image denoising, in general their implementations overly rely on training data, lack interpretability, and require tuning of a large parameter set. In this paper, we combine classical graph signal filtering with deep feature learning into a competitive hybrid design, one that utilizes interpretable analytical low-pass graph filters and employs 80% fewer network parameters than the state-of-the-art DL denoising scheme DnCNN. Specifically, to construct a suitable similarity graph for graph spectral filtering, we first adopt a CNN to learn feature representations per pixel, and then compute feature distances to establish edge weights. Given a constructed graph, we next formulate a convex optimization problem for denoising using a graph total variation (GTV) prior. Via an $l_1$ graph Laplacian reformulation, we interpret its solution in an iterative procedure as a graph low-pass filter and derive its frequency response. For fast filter implementation, we realize this response using a Lanczos approximation. Experimental results show that in the case of statistical mismatch, our algorithm outperforms DnCNN by up to 3dB in PSNR.
摘要：虽然像卷积神经网络（细胞神经网络）深度学习（DL）架构在图像降噪已启用有效的解决方案，一般它们的实现过于依赖于训练数据，缺乏可解释性，并需要大量参数的整定。在本文中，我们结合深地物学习古典图形信号滤波成竞争混合设计---一个利用可解释的分析低通滤波器图形并且采用网络参数不是较少的80％状态的最先进的DL去噪方案DnCNN。具体地，为了构建用于图形光谱过滤合适的相似度图形，我们首先采用CNN学习每像素特征表示，然后计算特征距离以建立边缘的权重。给定一个构成的曲线图，我们接下来制定的凸优化问题，使用的图表总变化（GTV）之前去噪。经由$ $ L_1图的拉普拉斯再形成，我们解释其在迭代过程为曲线图的低通滤波器溶液和得到其频率响应。对于快速过滤器实现，我们意识到使用兰克泽斯逼近此响应。实验结果表明，在统计mistmatch的情况下，我们的算法优于DnCNN高达3dB的PSNR中。
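
Illustrative note (not from the paper): the abstract describes turning per-pixel feature distances into edge weights and then penalizing graph total variation. A minimal sketch of those two quantities follows. The Gaussian kernel, the `sigma` parameter, and the 4-pixel chain graph are assumptions made for illustration; the paper's CNN features and its solver are not reproduced.

```python
import math


def edge_weight(feat_i, feat_j, sigma=1.0):
    """Gaussian kernel on the squared feature distance (an assumed choice)."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(feat_i, feat_j))
    return math.exp(-dist_sq / (sigma ** 2))


def graph_total_variation(signal, edges, weights):
    """Sum over edges (i, j) of w_ij * |x_i - x_j|."""
    return sum(w * abs(signal[i] - signal[j]) for (i, j), w in zip(edges, weights))


# Hypothetical 4-pixel chain graph with 2-D per-pixel features.
features = [(0.0, 0.0), (0.1, 0.0), (2.0, 2.0), (2.1, 2.0)]
edges = [(0, 1), (1, 2), (2, 3)]
weights = [edge_weight(features[i], features[j]) for i, j in edges]

piecewise = [1.0, 1.0, 5.0, 5.0]  # jumps only across the weak edge (1, 2)
noisy = [1.0, 2.0, 5.0, 4.0]      # also varies across the strong edges
gtv_piecewise = graph_total_variation(piecewise, edges, weights)
gtv_noisy = graph_total_variation(noisy, edges, weights)
```

Because the edge between dissimilar pixels gets a near-zero weight, a sharp jump there costs almost nothing, while variation between similar pixels is penalized. This is why such a prior can smooth noise without blurring edges.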

59. Shedding Light on Blind Spots: Developing a Reference Architecture to Leverage Video Data for Process Mining
Wolfgang Kratsch, Fabian König, Maximilian Röglinger
Abstract: Process mining is one of the most active research streams in business process management. In recent years, numerous methods have been proposed for analyzing structured process data. Yet, in many cases, it is only the digitized parts of processes that are directly captured from process-aware information systems, and manual activities often result in blind spots. While the use of video cameras to observe these activities could help to fill this gap, a standardized approach to extracting event logs from unstructured video data remains lacking. Here, we propose a reference architecture to bridge the gap between computer vision and process mining. Various evaluation activities (i.e., competing artifact analysis, prototyping, and real-world application) ensured that the proposed reference architecture allows flexible, use-case-driven, and context-specific instantiations. Our results also show that an exemplary software prototype instantiation of the proposed reference architecture is capable of automatically extracting most of the process-relevant events from unstructured video data.
摘要：过程挖掘是在业务流程管理中最活跃的研究流之一。近年来，许多方法已经被提出了结构化分析处理数据。然而，在许多情况下，它仅是直接从过程感知的信息系统获取，和手动活动往往造成盲点对数字化的部分。虽然使用的摄像机来观察这些活动可能有助于填补这一空白，一个标准化的方法来从非结构化缺乏视频数据遗体中提取的事件日志。在这里，我们提出了一个参考架构，以弥补计算机视觉和过程挖掘之间的差距。各种评价活动（即，竞争的伪像分析，原型，和现实世界的应用程序）确保了所提出的参考体系结构允许灵活的，用例驱动，并且特定于上下文的实例化。我们的研究结果还表明，该参考架构的示范性软件原型实例是能够从非结构化的视频数据自动提取大部分过程相关的事件。

60. Neural Star Domain as Primitive Representation
Yuki Kawana, Yusuke Mukuta, Tatsuya Harada
Abstract: Reconstructing 3D objects from 2D images is a fundamental task in computer vision. Accurate structured reconstruction by parsimonious and semantic primitive representation further broadens its application. When reconstructing a target shape with multiple primitives, it is preferable that one can instantly access the union of basic properties of the shape such as collective volume and surface, treating the primitives as if they are one single shape. This becomes possible by primitive representation with unified implicit and explicit representations. However, primitive representations in current approaches do not satisfy all of the above requirements at the same time. To solve this problem, we propose a novel primitive representation named neural star domain (NSD) that learns primitive shapes in the star domain. We show that NSD is a universal approximator of the star domain and is not only parsimonious and semantic but also an implicit and explicit shape representation. We demonstrate that our approach outperforms existing methods in image reconstruction tasks, semantic capabilities, and speed and quality of sampling high-resolution meshes.
摘要：从2D图像重建3D对象是计算机视觉的基本任务。准确结构重建由简约和语义原始表示进一步拓宽了它的应用。当重构的目标形状与多个基元，最好是一个可以即时访问的形状的基本属性的并集，如集体的体积和表面，治疗的原语，就好像它们是单个形状。这通过统一的隐性和显性表述原始的表示成为可能。然而，在当前的方法原始的表示方法并不能满足所有的上述要求在同一时间。为了解决这个问题，我们提出了一个名为神经星域（NSD）一种新型的原始表示是学习基本形状在星域。我们发现，NSD是星域的通用逼近而不仅是吝啬和语义，但还有一个隐含的和明确形状表示。我们证明在图像重建任务，语义能力，速度和采样高分辨率网格的质量，我们的方法比现有的方法。
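
Illustrative note (not from the paper): a star domain is a shape in which every point can be joined to a center by a segment that stays inside the shape, so its boundary can be written as a positive radius function of direction. The 2-D sketch below shows how one radius function gives both an implicit test and an explicit boundary sampler. The hand-written `radius` function is a hypothetical stand-in for what NSD would learn with a network, and the paper works with 3-D shapes.

```python
import math


def radius(theta):
    """Hypothetical positive radius function of the direction angle."""
    return 1.0 + 0.3 * math.cos(3.0 * theta)


def is_inside(point, center=(0.0, 0.0)):
    """Implicit view: inside if the distance to the center is at most r(theta)."""
    dx, dy = point[0] - center[0], point[1] - center[1]
    return math.hypot(dx, dy) <= radius(math.atan2(dy, dx))


def boundary_points(n, center=(0.0, 0.0)):
    """Explicit view: sample n boundary points by sweeping the angle."""
    points = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        r = radius(theta)
        points.append((center[0] + r * math.cos(theta),
                       center[1] + r * math.sin(theta)))
    return points


outline = boundary_points(8)
```

The same function answers occupancy queries and produces surface samples directly, which is the combination of implicit and explicit representation the abstract refers to.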

61. Learning Black-Box Attackers with Transferable Priors and Query Feedback
Jiancheng Yang, Yangzhou Jiang, Xiaoyang Huang, Bingbing Ni, Chenglong Zhao
Abstract: This paper addresses the challenging black-box adversarial attack problem, where only classification confidence of a victim model is available. Inspired by consistency of visual saliency between different vision models, a surrogate model is expected to improve the attack performance via transferability. By combining transferability-based and query-based black-box attack, we propose a surprisingly simple baseline approach (named SimBA++) using the surrogate model, which significantly outperforms several state-of-the-art methods. Moreover, to efficiently utilize the query feedback, we update the surrogate model in a novel learning scheme, named High-Order Gradient Approximation (HOGA). By constructing a high-order gradient computation graph, we update the surrogate model to approximate the victim model in both forward and backward pass. The SimBA++ and HOGA result in Learnable Black-Box Attack (LeBA), which surpasses previous state of the art by considerable margins: the proposed LeBA significantly reduces queries, while keeping higher attack success rates close to 100% in extensive ImageNet experiments, including attacking vision benchmarks and defensive models. Code is open source at this https URL.
摘要：本文地址挑战暗箱对抗攻击的问题，其中只有一个受害者模型的分类置信度是可用的。通过不同的视觉模型之间的视觉显着性的一致性的启发，替代车型有望提高通过转让攻击性能。通过结合转移性和基于查询的黑盒攻击，我们提出了一个非常简单的基准办法使用代理模式，这显著优于国家的最先进的几种方法（名为SIMBA ++）。此外，为了有效地利用查询反馈，我们更新了一种新的学习计划，命名为高阶梯度近似（HOGA）的替代模型。通过构建一个高阶梯度计算图，我们更新替代模型来近似向前和向后传球受害者模型。辛巴++和HOGA结果可学习黑盒攻击（LEBA），它超越了艺术的可观利润的前一个状态：建议LEBA显著缩短了查询，而在广泛ImageNet实验中保持较高的进攻成功率接近100％，包括攻击视力基准和防御模型。代码是在这个HTTPS URL开源。
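
Illustrative note (not from the paper): the query-based half of the attack relies only on the victim's confidence score. The sketch below is a simplified coordinate-wise query loop in the spirit of simple black-box attacks, not the authors' SimBA++ or LeBA. The toy `victim_confidence` model and all parameter values are hypothetical, and no surrogate model is used.

```python
import math


def victim_confidence(x):
    """Toy stand-in for a black-box model: confidence of the true class."""
    return 1.0 / (1.0 + math.exp(-(x[0] - 2.0 * x[1])))


def simple_query_attack(confidence, x, epsilon, max_queries):
    """Try +/- epsilon on each coordinate and keep steps that lower confidence."""
    current = list(x)
    best = confidence(current)
    queries = 1
    for index in range(len(current)):
        for step in (epsilon, -epsilon):
            if queries >= max_queries:
                return current, best, queries
            candidate = list(current)
            candidate[index] += step
            score = confidence(candidate)
            queries += 1
            if score < best:
                current, best = candidate, score
                break  # accepted; move on to the next coordinate
    return current, best, queries


adversarial, final_confidence, used = simple_query_attack(
    victim_confidence, [1.0, 0.0], epsilon=0.5, max_queries=100
)
```

A surrogate model, as described in the abstract, would be used to choose more promising directions first and so reduce the number of queries.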

62. OCT-GAN: Single Step Shadow and Noise Removal from Optical Coherence Tomography Images of the Human Optic Nerve Head
Haris Cheong, Sripad Krishna Devalla, Thanadet Chuangsuwanich, Tin A. Tun, Xiaofei Wang, Tin Aung, Leopold Schmetterer, Martin L. Buist, Craig Boote, Alexandre H. Thiéry, Michaël J. A. Girard
Abstract: Speckle noise and retinal shadows within OCT B-scans occlude important edges, fine textures and deep tissues, preventing accurate and robust diagnosis by algorithms and clinicians. We developed a single process that successfully removed both noise and retinal shadows from unseen single-frame B-scans within 10.4 ms. Mean average gradient magnitude (AGM) for the proposed algorithm was 57.2% higher than current state-of-the-art, while mean peak signal to noise ratio (PSNR), contrast to noise ratio (CNR), and structural similarity index metric (SSIM) increased by 11.1%, 154% and 187% respectively compared to single-frame B-scans. Mean intralayer contrast (ILC) improvement for the retinal nerve fiber layer (RNFL), photoreceptor layer (PR) and retinal pigment epithelium (RPE) layers decreased from $0.362 \pm 0.133$ to $0.142 \pm 0.102$, $0.449 \pm 0.116$ to $0.0904 \pm 0.0769$, and $0.381 \pm 0.100$ to $0.0590 \pm 0.0451$ respectively. The proposed algorithm reduces the necessity for long image acquisition times, minimizes expensive hardware requirements and reduces motion artifacts in OCT images.
摘要：OCT B-扫描内斑点噪声和视网膜阴影阻塞重要边缘，细纹理和深部组织，防止由算法和临床医生精确和鲁棒的诊断。我们开发了一个单一的过程中，成功地移除了噪声和10.4ms之内看不见的单帧B扫描视网膜阴影。对于所提出的算法值平均梯度幅值（AGM）高于当前状态的最先进的高57.2％，而平均峰值信噪比（PSNR），对比度噪声比（CNR）和结构相似性指数度量（ SSIM）分别增加了11.1％，154％和187％相比单帧B扫描。对于视网膜神经纤维层（RNFL），感光层（PR）和视网膜色素上皮细胞平均数层内对比度（ILC）的改善（RPE）层从0.362 \下午0.133减少到0.142 \时0.102，0.449 \下午0.116到0.0904 \时0.0769，0.381 \ 0.100点至0.0590 \分别点0.0451。所提出的算法减少了对长的图像获取时间的必要性，最大限度地减少昂贵的硬件要求和减少在OCT图像的运动伪影。
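
Illustrative note (not from the paper): PSNR, one of the metrics reported above, is computed from the mean squared error between a reference image and a test image. The sketch below uses the standard definition on flat lists of pixel values. The pixel values and the 8-bit peak of 255 are assumptions for illustration.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


clean = [100.0, 120.0, 140.0, 160.0]
noisy = [110.0, 110.0, 150.0, 150.0]     # every pixel off by 10
denoised = [101.0, 119.0, 141.0, 159.0]  # every pixel off by 1
gain_db = psnr(clean, denoised) - psnr(clean, noisy)
```

Cutting the per-pixel error by a factor of 10 reduces the mean squared error by a factor of 100, which corresponds to a 20 dB gain.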

63. Automatic Data Augmentation for 3D Medical Image Segmentation
Ju Xu, Mengzhang Li, Zhanxing Zhu
Abstract: Data augmentation is an effective and universal technique for improving the generalization performance of deep neural networks. It can enrich the diversity of training samples, which is essential in medical image segmentation tasks because 1) the scale of medical image datasets is typically smaller, which may increase the risk of overfitting; 2) the shape and modality of different objects such as organs or tumors are unique, thus requiring customized data augmentation policies. However, most data augmentation implementations are hand-crafted and suboptimal in medical image processing. To fully exploit the potential of data augmentation, we propose an efficient algorithm to automatically search for the optimal augmentation strategies. We formulate the coupled optimization w.r.t. network weights and augmentation parameters into a differentiable form by means of stochastic relaxation. This formulation allows us to apply alternative gradient-based methods to solve it, i.e., the stochastic natural gradient method with adaptive step size. To the best of our knowledge, it is the first time that differentiable automatic data augmentation is employed in medical image segmentation tasks. Our numerical experiments demonstrate that the proposed approach significantly outperforms the existing built-in data augmentation of state-of-the-art models.
摘要：数据增强是可以改进的神经网络的泛化性能的有效和普遍的技术。它可以丰富训练样本的多样性是在医学图像分割的任务必不可少的，因为1）医学图像数据集的规模通常较小，这可能会增加过度拟合的风险; 2）不同的形状和形式的对象，例如器官或肿瘤是唯一的，因此需要定制数据增强策略。然而，大多数数据增强的实现是手工制作的和次优的医疗用图像处理。为了充分利用数据扩张的潜力，我们提出了一个高效的算法，自动寻找最佳的增强策略。我们制定耦合优化w.r.t.通过随机松弛的装置网络的权和扩充参数成微分形式。该制剂允许我们应用替代的基于梯度的方法来解决它，即随机自然梯度方法的自适应步长。据我们所知，这是第一次，微自动数据增强在医学图像分割的任务使用。我们的数值实验表明，该方法显著优于现有集结在国家的最先进的车型数据增强。

64. PlenoptiCam v1.0: A light-field imaging framework
Christopher Hahne, Amar Aggoun
Abstract: Light-field cameras play a vital role for rich 3-D information retrieval in narrow range depth sensing applications. The key obstacle in composing light-fields from exposures taken by a plenoptic camera is to computationally calibrate, re-align and rearrange four-dimensional image data. Several attempts have been made to enhance the overall image quality by tailoring pipelines dedicated to particular plenoptic cameras and improving the color consistency across viewpoints at the expense of high computational loads. The framework presented herein advances prior outcomes thanks to its cost-effective color equalization from parallax-invariant probability distribution transfers and a novel micro image scale-space analysis for generic camera calibration independent of the lens specifications. Our framework compensates for hot-pixels, resampling artifacts, and micro image grid rotations, as well as vignetting, in an innovative way to enable superior quality in sub-aperture image extraction, computational refocusing and Scheimpflug rendering with sub-sampling capabilities. Benchmark comparisons using established image metrics suggest that our proposed pipeline outperforms state-of-the-art tool chains in the majority of cases. The software described in this paper is released under an open-source license offering cross-platform compatibility, few dependencies and a lean graphical user interface to make the reproduction of results and the experimentation with plenoptic camera technology convenient for peer researchers, developers, photographers, data scientists and everyone else working in this field.
摘要：光场相机发挥在狭窄的范围内的深度感测应用丰富的3-d信息检索的重要作用。在从由全光照相机拍摄曝光构成光场的关键障碍是计算校准，重新调整和重新排列四维图像数据。多次尝试已经提出来增强通过定制专用于特定的全光相机管线和高运算量为代价改善整个视点对色彩一致性的整体图像质量。本文所提出的框架前进之前由于从视差不变概率分布转移和用于照相机校准独立的透镜规格的通用的新型微图像的尺度空间分析其成本效益的颜色均衡的结果。我们的框架补偿了热像素，重新采样伪影，微图像框格旋转正如以一种创新的方式渐晕，使在子孔径图像提取，计算重聚焦和沙伊姆弗勒呈现优越的品质与亚取样的能力。利用建立的图象量度基准的比较表明，在大多数情况下，我们提出的流水线性能优于国家的最先进的工具链。本文介绍的软件以开放源代码许可证提供跨平台的兼容性，很少依赖和一支精干的图形用户界面，使结果的再现，并与全光相机技术方便了同行的研究人员，开发人员，摄影师实验释放，数据科学家和其他人的工作在这一领域。

65. A Very Compact Embedded CNN Processor Design Based on Logarithmic Computing
Tsung-Ying Lu, Hsu-Hsun Chin, Hsin-I Wu, Ren-Song Tsay
Abstract: In this paper, we propose a very compact embedded CNN processor design based on a modified logarithmic computing method using very low bit-width representation. Our high-quality CNN processor can easily fit into edge devices. For YOLOv2, our processing circuit takes only 0.15 mm² using the TSMC 40 nm cell library. The key idea is to constrain the activation and weight values of all layers uniformly to be within the range [-1, 1] and produce a low bit-width logarithmic representation. With the uniform representations, we devise a unified, reusable CNN computing kernel and significantly reduce computing resources. The proposed approach has been extensively evaluated on many popular image classification CNN models (AlexNet, VGG16, and ResNet-18/34) and object detection models (YOLOv2). The hardware-implemented results show that our design consumes only minimal computing and storage resources, yet attains very high accuracy. The design is thoroughly verified on FPGAs, and the SoC integration is underway with promising results. With extremely efficient resource and energy usage, our design is excellent for edge computing purposes.
摘要：在本文中，我们提出了一种基于使用非常低的比特宽度表示的修正对数计算方法的非常紧凑的嵌入式CNN处理器设计。我们的高品质的CNN处理器可以很容易地装配到边缘设备。对于Yolov2，我们的处理电路只需要0.15采用TSMC 40纳米单元库2。关键思想是约束均匀地所有层的激活和权重值是[-1,1]的范围内，并产生低比特宽度对数表示。随着统一表示，制定一个统一的，可重复使用的CNN计算内核和显著减少计算资源。所提出的方法已经在许多流行的图像分类CNN模型被广泛地评价（AlexNet，VGG16，和RESNET-18/34）和物体检测模型（Yolov2）。硬件实现的结果表明，我们的设计仅消耗最少的计算和存储资源，但无所获非常高的精度。该设计在FPGA上彻底验证和SoC集成正在进行可喜的成果。凭借极其高效的资源和能源消耗，我们的设计是极好的边缘计算的目的。
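
Illustrative note (not from the paper): the abstract does not spell out its "modified" logarithmic method. The sketch below shows only the plain power-of-two idea that such designs build on: a value in [-1, 1] is stored as a sign and a small non-negative exponent, so a multiplication becomes an integer addition of exponents. The function names and the 3-bit exponent width are hypothetical.

```python
import math


def log_quantize(value, exponent_bits=3):
    """Map a value in [-1, 1] to (sign, exponent), value ~ sign * 2**(-exponent)."""
    if not -1.0 <= value <= 1.0:
        raise ValueError("value must lie in [-1, 1]")
    if value == 0.0:
        return (0, 0)  # sign 0 encodes an exact zero
    sign = 1 if value > 0 else -1
    max_exponent = 2 ** exponent_bits - 1
    exponent = min(max_exponent, max(0, round(-math.log2(abs(value)))))
    return (sign, exponent)


def log_multiply(a, b):
    """Multiply two quantized values: signs multiply, exponents add."""
    return (a[0] * b[0], a[1] + b[1])


def dequantize(q):
    """Convert (sign, exponent) back to a float."""
    sign, exponent = q
    return sign * 2.0 ** (-exponent)


activation = log_quantize(0.5)
weight = log_quantize(-0.25)
product = log_multiply(activation, weight)
```

Keeping every layer's values inside [-1, 1] means the exponent is never negative, which is what lets one narrow, uniform representation serve all layers.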

66. Disentangling Action Sequences: Discovering Correlated Samples
Jiantao Wu, Lin Wang
Abstract: Disentanglement is a highly desirable property of representations due to its similarity with human understanding and reasoning. It improves interpretability, enables the performance of downstream tasks, and enables controllable generative models. However, this domain is challenged by the abstract notion of disentanglement and by incomplete theories to support unsupervised disentanglement learning. We demonstrate that the data itself, such as the orientation of images, rather than the factors, plays a crucial role in disentanglement, and that the disentangled representations align the latent variables with the action sequences. We further introduce the concept of disentangling action sequences, which facilitates the description of the behaviours of the existing disentangling approaches. An analogy for this process is discovering the commonality between things and categorizing them. Furthermore, we analyze the inductive biases on the data and find that the latent information thresholds are correlated with the significance of the actions. For the supervised and unsupervised settings, we respectively introduce two methods to measure the thresholds. We further propose a novel framework, fractional variational autoencoder (FVAE), to disentangle the action sequences with different significance step-by-step. Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
摘要：退纠缠是表现的非常理想的属性，因为它与人类的理解和推理的相似性。这提高了可解释性，使的下游任务的性能，并且能够可控生成模型。然而，这一领域是由抽象的概念和理论不完全的挑战，以支持无监督的解开学习。我们证明数据本身，如图像的方向，起着解开并且代替因素至关重要的作用，而解缠结表示对准动作序列潜在变数。我们进一步介绍解开这便于解开现有方法的行为的描述的动作序列的概念。这个过程中的一个比喻是探索事物和分类它们之间的共性。此外，我们分析的数据归纳偏见和发现潜在信息的阈值与动作的意义有关。对于监督和无监督的设置，我们分别介绍了两种方法测量的阈值。我们进一步提出了一种新的框架，分数变自动编码器（FVAE），解开具有不同意义的一步一步的动作序列。在dSprites和3D椅子实验结果表明，FVAE提高解开的稳定性。

67. Lung Nodule Classification Using Biomarkers, Volumetric Radiomics and 3D CNNs
Kushal Mehta, Arshita Jain, Jayalakshmi Mangalagiri, Sumeet Menon, Phuong Nguyen, David R. Chapman
Abstract: We present a hybrid algorithm to estimate lung nodule malignancy that combines imaging biomarkers from Radiologist's annotation with image classification of CT scans. Our algorithm employs a 3D Convolutional Neural Network (CNN) as well as a Random Forest in order to combine CT imagery with biomarker annotation and volumetric radiomic features. We analyze and compare the performance of the algorithm using only imagery, only biomarkers, combined imagery + biomarkers, combined imagery + volumetric radiomic features and finally the combination of imagery + biomarkers + volumetric features in order to classify the suspicion level of nodule malignancy. The National Cancer Institute (NCI) Lung Image Database Consortium (LIDC) IDRI dataset is used to train and evaluate the classification task. We show that the incorporation of semi-supervised learning by means of K-Nearest-Neighbors (KNN) can increase the available training sample size of the LIDC-IDRI thereby further improving the accuracy of malignancy estimation of most of the models tested although there is no significant improvement with the use of KNN semi-supervised learning if image classification with CNNs and volumetric features are combined with descriptive biomarkers. Unexpectedly, we also show that a model using image biomarkers alone is more accurate than one that combines biomarkers with volumetric radiomics, 3D CNNs, and semi-supervised learning. We discuss the possibility that this result may be influenced by cognitive bias in LIDC-IDRI because malignancy estimates were recorded by the same radiologist panel as biomarkers, as well as future work to incorporate pathology information over a subset of study participants.
摘要：我们提出了一个混合算法来估计肺结节恶性肿瘤，结合生物标志物成像从放射科医生的注释与CT扫描的图像的分类。我们的算法采用了3D卷积神经网络（CNN）以及随机森林，以CT成像与生物标记注释和体积radiomic功能结合起来。我们分析和比较，以结节恶性的可疑级别分类的算法只使用图像，只有生物标记物，结合影像+生物标志物的表现，结合影像+体积radiomic的功能和图像+标志物+体积功能的最终组合。美国国家癌症研究所（NCI）肺影像数据库联盟（LIDC）IDRI数据集用于训练和评估分类的任务。我们证明了半监督学习的K-近邻的手段（KNN）的加入可以增加LIDC-IDRI的可用的训练样本的大小，从而进一步提高了大部分车型的恶性估计的测试精度虽然有没有与使用KNN半监督学习的显著改善如果与细胞神经网络和体积特征的图像分类相结合，与描述性的生物标志物。出乎意料的是，我们还表明，使用图像单独的生物标志物的模型比一个更准确与体积radiomics，3D细胞神经网络，以及半监督学习联合生物标志物。我们讨论这个结果可以通过LIDC-IDRI认知偏差的影响，因为恶性肿瘤估计是由同一个放射面板作为生物标志物，以及今后的工作记录了研究对象的一个子集纳入病理信息的可能性。
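
Illustrative note (not from the paper): the abstract says K-Nearest-Neighbors is used to enlarge the labeled training set but does not give the procedure. One plausible reading is majority-vote pseudo-labeling of unlabeled samples from their nearest labeled neighbors, sketched below. The two-dimensional feature vectors and the label names are made up.

```python
import math
from collections import Counter


def knn_pseudo_label(labeled, unlabeled_point, k=3):
    """Assign the majority label among the k nearest labeled samples."""
    neighbors = sorted(
        labeled, key=lambda item: math.dist(item[0], unlabeled_point)
    )[:k]
    votes = Counter(label for _features, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical (feature vector, label) pairs.
labeled = [
    ((0.0, 0.0), "benign"), ((0.0, 1.0), "benign"), ((1.0, 0.0), "benign"),
    ((5.0, 5.0), "malignant"), ((5.0, 6.0), "malignant"), ((6.0, 5.0), "malignant"),
]
label_a = knn_pseudo_label(labeled, (5.5, 5.5))
label_b = knn_pseudo_label(labeled, (0.2, 0.2))
```

Pseudo-labeled samples would then be added to the training set, which is how the available sample size can grow without new annotation.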

68. Motion Planning Combines Psychological Safety and Motion Prediction for a Sense Motive Robot
Hejing Ling, Guoliang Liu, Guohui Tian
Abstract: Human safety is the most important requirement for human-robot interaction and collaboration (HRIC); it refers not only to physical safety but also to psychological safety. Although many robots with different configurations have entered our living and working environments, human safety is still an ongoing research problem in human-robot coexistence scenarios. This paper addresses the human safety issue by covering both the physical and psychological safety aspects. First, we introduce an adaptive robot velocity control and step size adjustment method based on human facial expressions, such that the robot can adjust its movement to maintain safety when the human emotion is unusual. Second, we predict human motion by detecting sudden changes in human head pose and gaze direction, such that the robot can infer whether the human's attention is distracted, predict the human's next move, and rebuild a repulsive force to avoid potential collisions. Finally, we demonstrate our idea on a 7 DOF TIAGo robot in the 3D Gazebo environment, which shows that the robot becomes sense motive and responds to changes in human action and emotion quickly and efficiently.
摘要：人类安全是人类的机器人互动和协作（中国人权），这不仅是指人身安全最重要的需求，但也包括心理安全。尽管不同配置的许多机器人已经进入了我们的生活和工作环境，人的安全问题仍是人类与机器人共存方案正在进行的研究课题。本文通过涵盖人身安全和心理安全方面地址的人身安全问题。首先，我们根据人的面部表情，使得机器人可以调整它的运动，以保持安全性，当人的情感是不寻常的引入自适应机器人速度控制和步长大小调整方法。其次，我们通过检测预测人体运动突然改变人的头部姿势和视线方向，使得机器人可以推断出人的注意力是否不集中，预测人类的下一步行动和重建的排斥力，以避免潜在的冲突。最后，我们证明了我们在7自由度机器人蒂亚戈在3D环境凉亭，这表明机器人变得察言观色，并响应人类行为和情绪快速，有效地改变了主意。

69. On the Power of Deep but Naive Partial Label Learning
Junghoon Seo, Joon Suk Huh
Abstract: Partial label learning (PLL) is a class of weakly supervised learning where each training instance consists of a data point and a set of candidate labels containing a unique ground truth label. To tackle this problem, a majority of current state-of-the-art methods employ either label disambiguation or averaging strategies. So far, PLL methods without such techniques have been considered impractical. In this paper, we challenge this view by revealing the hidden power of the oldest and naivest PLL method when it is instantiated with deep neural networks. Specifically, we show that, with deep neural networks, the naive model can achieve competitive performance against other state-of-the-art methods, suggesting it as a strong baseline for PLL. We also address the question of how and why such a naive model works well with deep neural networks. Our empirical results indicate that deep neural networks trained on partially labeled examples generalize very well even in the over-parametrized regime and without label disambiguations or regularizations. We point out that existing learning theories on PLL are vacuous in the over-parametrized regime. Hence they cannot explain why the deep naive method works. We propose an alternative theory on how deep learning generalizes in PLL problems.
摘要：部分标记学习（PLL）是一类弱监督学习，每个训练实例包含数据和一组包含一个唯一的地面实况标签候选的标签。为了解决这个问题，大多数国家的最先进的方法，目前采用两种标签歧义或平均策略。到目前为止，没有这样的技术PLL的方法已被认为是不切实际的。在本文中，我们挑战通过揭示历史最悠久，naivest PLL方法的隐藏的力量，当它与深层神经网络实例化这一观点。具体而言，我们表明，与深层神经网络，天真的模型可以实现对其他国家的最先进的方法有竞争力的表演，这表明它作为PLL强大的基线。我们还解决这么幼稚的模型是如何以及为什么深神经网络效果很好的问题。我们的实证研究结果表明，经过训练，对部分标记的例子深层神经网络即使在过度参数化方案和无标签disambiguations或正则化概括得非常好。我们指出，在现有的PLL学习理论是在过参数化的制度空洞。因此，他们无法解释为什么深天真的方法工作。我们提出了关于如何深PLL问题学习期广义的替代理论。
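
Illustrative note (not from the paper): the abstract does not define the "naive" method. A common reading is that every candidate label is treated as if it were correct, so the per-instance loss is the negative log-likelihood averaged over the candidate set. The sketch below assumes that reading; the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def naive_partial_label_loss(logits, candidate_labels):
    """Average negative log-likelihood over all candidate labels of one instance."""
    probs = softmax(logits)
    return -sum(math.log(probs[y]) for y in candidate_labels) / len(candidate_labels)


logits = [2.0, 1.0, 0.0]
loss_partial = naive_partial_label_loss(logits, [0, 1])  # two candidates
loss_exact = naive_partial_label_loss(logits, [0])       # ordinary cross-entropy
```

With a single candidate the loss reduces to ordinary cross-entropy, and no disambiguation step is involved at any point.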

70. Defense-guided Transferable Adversarial Attacks
Zifei Zhang, Kai Qiao, Jian Chen
Abstract: Though deep neural networks perform challenging tasks excellently, they are susceptible to adversarial examples, which mislead classifiers by applying human-imperceptible perturbations to clean inputs. Under the query-free black-box scenario, adversarial examples are hard to transfer to unknown models, and the several methods proposed so far have low transferability. To settle this issue, we design a max-min framework inspired by input transformations, which are beneficial to both adversarial attack and defense. Explicitly, we decrease loss values with affine transformations as a defense in the minimum procedure, and then increase loss values with the momentum iterative algorithm as an attack in the maximum procedure. To further promote transferability, we determine transformed values with the max-min theory. Extensive experiments on ImageNet demonstrate that our defense-guided transferable attacks achieve an impressive increase in transferability. Experimentally, our best black-box attack fools normally trained models at an 85.3% attack success rate and adversarially trained models at a 40.43% attack success rate on average. Additionally, we provide elucidative insights on the improvement of transferability, and our method is expected to be a benchmark for assessing the robustness of deep models.
摘要：虽然深层神经网络很好地执行有挑战性的任务，他们很容易受到敌对exmaples，它通过在清洁的投入将人类不可感知的扰动误导分类。根据免费查询黑匣子的情况下，对抗的例子是很难转移到未知的模式，几种方法已经被提出，低转移性。为了解决这样的问题，我们设计了一个最大最小框架由输入转变，这是benificial以对抗攻击和防御都启发。明确地说，我们减少与affline转换损耗值作为最小过程中的防守，然后增加损耗值与动量迭代算法的最大程序的攻击。为进一步促进转让，我们决定改变与最大最小理论值。在Imagenet大量的实验证明，我们的防守引导转让攻击实现对转让的令人印象深刻的增长。实验上，我们最好的黑盒攻击傻瓜一般在85.3％，进攻成功率训练的模型，并在adversarially平均一个40.43％的攻击成功率分别训练的模型。此外，我们还提供转让的提高阐释的见解，我们的方法应该是评价深模型的鲁棒性的基准。

71. Malaria detection from RBC images using shallow Convolutional Neural Networks
Subrata Sarkar, Rati Sharma, Kushal Shah
Abstract: The advent of Deep Learning models like VGG-16 and Resnet-50 has considerably revolutionized the field of image classification, and by using these Convolutional Neural Networks (CNN) architectures, one can get a high classification accuracy on a wide variety of image datasets. However, these Deep Learning models have a very high computational complexity and so incur a high computational cost of running these algorithms as well as make it hard to interpret the results. In this paper, we present a shallow CNN architecture which gives the same classification accuracy as the VGG-16 and Resnet-50 models for thin blood smear RBC slide images for detection of malaria, while decreasing the computational run time by an order of magnitude. This can offer a significant advantage for commercial deployment of these algorithms, especially in poorer countries in Africa and some parts of the Indian subcontinent, where the menace of malaria is quite severe.
摘要：深学习模型像VGG-16和RESNET-50的出现大大革新了图像分类的领域，通过使用这些卷积神经网络（CNN）架构，可以在各种图像的获得了很高的分类精度数据集。然而，这些深层次的学习模型具有很高的计算复杂度，因此招致运行这些算法以及使它很难解释结果的计算成本高。在本文中，我们提出了一个浅CNN结构可以得到相同的分类精度为VGG-16和RESNET-50模型薄血涂片RBC幻灯片图像检测疟疾的，而由一个数量级减少的计算运行时间。这可以提供一个显著优势，为这些算法的商用部署，特别是在非洲穷国的国家和印度次大陆，疟疾的威胁是相当严重的一些地区。

72. SEG-MAT: 3D Shape Segmentation Using Medial Axis Transform
Cheng Lin, Lingjie Liu, Changjian Li, Leif Kobbelt, Bin Wang, Shiqing Xin, Wenping Wang
Abstract: Segmenting arbitrary 3D objects into constituent parts that are structurally meaningful is a fundamental problem encountered in a wide range of computer graphics applications. Existing methods for 3D shape segmentation suffer from complex geometry processing and heavy computation caused by using low-level features and fragmented segmentation results due to the lack of global consideration. We present an efficient method, called SEG-MAT, based on the medial axis transform (MAT) of the input shape. Specifically, with the rich geometrical and structural information encoded in the MAT, we are able to develop a simple and principled approach to effectively identify the various types of junctions between different parts of a 3D shape. Extensive evaluations and comparisons show that our method outperforms the state-of-the-art methods in terms of segmentation quality and is also one order of magnitude faster.
摘要：切段任意三维对象到在结构上有意义在广泛范围的计算机图形应用中所遇到的一个基本问题的组成部分。对于3D形状分割现有的方法从复杂的几何处理和因使用低级别的功能和零散分割结果繁重的计算遭受由于缺乏全局的考虑。我们目前的输入形状的有效方法，被称为SEG-MAT的基础上，中轴变换（MAT）。具体而言，在MAT编码的富几何和结构信息，我们能够发展一种简单而有原则的方法来有效地识别不同类型的3D形状的不同部分之间的结。广泛的评估和比较表明，我们的方法优于在分割质量方面的国家的最先进的方法，也是幅度快一个数量级。

73. DeepCSR: A 3D Deep Learning Approach for Cortical Surface Reconstruction
Rodrigo Santa Cruz, Leo Lebrat, Pierrick Bourgeat, Clinton Fookes, Jurgen Fripp, Olivier Salvado
Abstract: The study of neurodegenerative diseases relies on the reconstruction and analysis of the brain cortex from magnetic resonance imaging (MRI). Traditional frameworks for this task like FreeSurfer demand lengthy runtimes, while its accelerated variant FastSurfer still relies on a voxel-wise segmentation which is limited by its resolution to capture narrow continuous objects as cortical surfaces. Having these limitations in mind, we propose DeepCSR, a 3D deep learning framework for cortical surface reconstruction from MRI. Towards this end, we train a neural network model with hypercolumn features to predict implicit surface representations for points in a brain template space. After training, the cortical surface at a desired level of detail is obtained by evaluating surface representations at specific coordinates, and subsequently applying a topology correction algorithm and an isosurface extraction method. Thanks to the continuous nature of this approach and the efficacy of its hypercolumn features scheme, DeepCSR efficiently reconstructs cortical surfaces at high resolution capturing fine details in the cortical folding. Moreover, DeepCSR is as accurate, more precise, and faster than the widely used FreeSurfer toolbox and its deep learning powered variant FastSurfer on reconstructing cortical surfaces from MRI which should facilitate large-scale medical studies and new healthcare applications.
摘要：神经变性疾病的研究依赖于从磁共振成像（MRI）的大脑皮质的重建和分析。这个任务就像FreeSurfer传统框架的要求冗长的运行时间，而其加速变异FastSurfer仍然依赖于体素明智的分割这是由它的分辨率捕捉窄连续对象皮质表面的限制。铭记这些限制，我们提出DeepCSR，从MRI皮质表面重建3D深度学习的框架。为此，我们培养具有hypercolumn特征的神经网络模型来预测隐含面表示用于脑模板空间中的点。训练结束后，在详细的期望水平皮质表面是通过在特定坐标评估表面表示，并且随后施加的拓扑校正算法和等值面提取方法获得的。由于该方法的连续性质及其hypercolumn的功效特征方案，有效地DeepCSR以高分辨率捕获细细节在皮质折叠重建皮质表面。此外，DeepCSR是准确的，更精确，而且比目前广泛使用的FreeSurfer工具箱和重建的MRI的皮质表面这将有利于大型医疗研究和新的医疗应用中的深度学习动力的变种FastSurfer更快。

74. Rethinking pooling in graph neural networks
Diego Mesquita, Amauri H. Souza, Samuel Kaski
Abstract: Graph pooling is a central component of a myriad of graph neural network (GNN) architectures. As an inheritance from traditional CNNs, most approaches formulate graph pooling as a cluster assignment problem, extending the idea of local patches in regular grids to graphs. Despite the wide adherence to this design choice, no work has rigorously evaluated its influence on the success of GNNs. In this paper, we build upon representative GNNs and introduce variants that challenge the need for locality-preserving representations, either using randomization or clustering on the complement graph. Strikingly, our experiments demonstrate that using these variants does not result in any decrease in performance. To understand this phenomenon, we study the interplay between convolutional layers and the subsequent pooling ones. We show that the convolutions play a leading role in the learned representations. In contrast to the common belief, local pooling is not responsible for the success of GNNs on relevant and widely-used benchmarks.
摘要：图形池是图表神经网络（GNN）架构的无数的一个重要组成部分。作为从传统的细胞神经网络的继承，大多数方法制定图表池作为群集分配问题，在常规电网延伸的局部小片的想法图表。尽管普遍遵守这种设计选择，没有工作严格评估其对GNNS的成功影响。在本文中，我们建立在代表GNNS并引入挑战，需要当地保留的表示变种，或者使用随机或群集上的补图。引人注目的是，我们的实验表明，当使用这些变异不会导致任何性能下降。要理解这一现象，我们研究了卷积层和随后的汇集者之间的相互作用。我们表明，回旋起到了解到表示了主导作用。与此相反的共同信念，本地池是不负责的相关性和广泛使用的基准GNNS的成功。
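
Illustrative note (not from the paper): one of the variants mentioned above clusters on the complement graph, that is, the graph whose edges are exactly the node pairs that are not connected in the original. Pooling along those edges groups non-neighbors and so deliberately breaks locality. The helper below, whose name is hypothetical, builds the complement of a small undirected graph.

```python
def complement_graph(num_nodes, edges):
    """Return the edges missing from an undirected simple graph (no self-loops)."""
    present = {frozenset(edge) for edge in edges}
    return [
        (i, j)
        for i in range(num_nodes)
        for j in range(i + 1, num_nodes)
        if frozenset((i, j)) not in present
    ]


# Path graph 0 - 1 - 2 - 3.
path_edges = [(0, 1), (1, 2), (2, 3)]
non_local_edges = complement_graph(4, path_edges)
```

Every edge in the result joins two nodes that were not adjacent in the path, so any cluster formed along these edges ignores the original neighborhood structure.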

75. DPD-InfoGAN: Differentially Private Distributed InfoGAN
Vaikkunth Mugunthan, Vignesh Gokul, Lalana Kagal, Shlomo Dubnov
Abstract: Generative Adversarial Networks (GANs) are deep learning architectures capable of generating synthetic datasets. Despite producing high-quality synthetic images, the default GAN has no control over the kinds of images it generates. The Information Maximizing GAN (InfoGAN) is a variant of the default GAN that introduces feature-control variables that are automatically learned by the framework, hence providing greater control over the different kinds of images produced. Due to the high model complexity of InfoGAN, the generative distribution tends to be concentrated around the training data points. This is a critical problem as the models may inadvertently expose the sensitive and private information present in the dataset. To address this problem, we propose a differentially private version of InfoGAN (DP-InfoGAN). We also extend our framework to a distributed setting (DPD-InfoGAN) to allow clients to learn different attributes present in other clients' datasets in a privacy-preserving manner. In our experiments, we show that both DP-InfoGAN and DPD-InfoGAN can synthesize high-quality images with flexible control over image attributes while preserving privacy.
摘要：剖成对抗性网络（甘斯）能够产生合成的数据集的深度学习架构。尽管生产高品质的合成图像，所述默认GAN具有优于种它所产生的图像的控制。的信息最大化GAN（InfoGAN）是默认的GAN的变体，其引入了功能控制由框架自动学习型变量，因此提供在不同种产生的图像的更大的控制。由于InfoGAN的高模型的复杂性，生成性分布趋于训练数据点周围集中。这是因为该车型可能会无意中暴露敏感和私人DataSet中的信息的一个关键问题。为了解决这个问题，我们提出了InfoGAN（DP-InfoGAN）的差异私有版本。我们也我们的框架扩展到分布式环境（DPD-InfoGAN），允许客户端了解不同的属性存在于其他客户的数据集的隐私保护方式。在我们的实验中，我们发现两个DP-InfoGAN和DPD-InfoGAN可以合成具有灵活控制的高品质的图像在图像属性，同时保留隐私。

76. Deep Learning for Distinguishing Normal versus Abnormal Chest Radiographs and Generalization to Unseen Diseases
Zaid Nabulsi, Andrew Sellergren, Shahar Jamshy, Charles Lau, Eddie Santos, Atilla P. Kiraly, Wenxing Ye, Jie Yang, Sahar Kazemzadeh, Jin Yu, Raju Kalidindi, Mozziyar Etemadi, Florencia Garcia Vicente, David Melnick, Greg S. Corrado, Lily Peng, Krish Eswaran, Daniel Tse, Neeral Beladia, Yun Liu, Po-Hsuan Cameron Chen, Shravya Shetty
Abstract: Chest radiography (CXR) is the most widely used thoracic clinical imaging modality and is crucial for guiding the management of cardiothoracic conditions. The detection of specific CXR findings has been the main focus of several artificial intelligence (AI) systems. However, the wide range of possible CXR abnormalities makes it impractical to build specific systems to detect every possible condition. In this work, we developed and evaluated an AI system to classify CXRs as normal or abnormal. For development, we used a de-identified dataset of 248,445 patients from a multi-city hospital network in India. To assess generalizability, we evaluated our system using 6 international datasets from India, China, and the United States. Of these datasets, 4 focused on diseases that the AI was not trained to detect: 2 datasets with tuberculosis and 2 datasets with coronavirus disease 2019. Our results suggest that the AI system generalizes to new patient populations and abnormalities. In a simulated workflow where the AI system prioritized abnormal cases, the turnaround time for abnormal cases was reduced by 7-28%. These results represent an important step towards evaluating whether AI can be safely used to flag cases in a general setting where previously unseen abnormalities exist.

77. Class-Conditional Defense GAN Against End-to-End Speech Attacks
Mohammad Esmaeilpour, Patrick Cardinal, Alessandro Lameiras Koerich
Abstract: In this paper we propose a novel defense approach against end-to-end adversarial attacks developed to fool advanced speech-to-text systems such as DeepSpeech and Lingvo. Unlike conventional defense approaches, the proposed approach does not directly employ low-level transformations such as autoencoding a given input signal aiming at removing potential adversarial perturbation. Instead of that, we find an optimal input vector for a class conditional generative adversarial network through minimizing the relative chordal distance adjustment between a given test input and the generator network. Then, we reconstruct the 1D signal from the synthesized spectrogram and the original phase information derived from the given input signal. Hence, this reconstruction does not add any extra noise to the signal and according to our experimental results, our defense-GAN considerably outperforms conventional defense algorithms both in terms of word error rate and sentence level recognition accuracy.

78. AttendAffectNet: Self-Attention based Networks for Predicting Affective Responses from Movies
Ha Thi Phuong Thao, Balamurali B.T., Dorien Herremans, Gemma Roig
Abstract: In this work, we propose different variants of the self-attention based network for emotion prediction from movies, which we call AttendAffectNet. We take both audio and video into account and incorporate the relation among multiple modalities by applying the self-attention mechanism in a novel manner to the extracted features for emotion prediction. We compare it to the typical temporal integration of the self-attention based model, which, in our case, allows us to capture the relation of temporal representations of the movie while considering the sequential dependencies of emotion responses. We demonstrate the effectiveness of our proposed architectures on the extended COGNIMUSE dataset [1], [2] and the MediaEval 2016 Emotional Impact of Movies Task [3], which consist of movies with emotion annotations. Our results show that applying the self-attention mechanism on the different audio-visual features, rather than in the time domain, is more effective for emotion prediction. Our approach is also proven to outperform many state-of-the-art models for emotion prediction. The code to reproduce our results with the models' implementation is available at: this https URL.

79. Product Manifold Learning
Sharon Zhang, Amit Moscovich, Amit Singer
Abstract: We consider problems of dimensionality reduction and learning data representations for continuous spaces with two or more independent degrees of freedom. Such problems occur, for example, when observing shapes with several components that move independently. Mathematically, if the parameter space of each continuous independent motion is a manifold, then their combination is known as a product manifold. In this paper, we present a new paradigm for non-linear independent component analysis called manifold factorization. Our factorization algorithm is based on spectral graph methods for manifold learning and the separability of the Laplacian operator on product spaces. Recovering the factors of a manifold yields meaningful lower-dimensional representations and provides a new way to focus on particular aspects of the data space while ignoring others. We demonstrate the potential use of our method for an important and challenging problem in structural biology: mapping the motions of proteins and other large molecules using cryo-electron microscopy datasets.
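The "separability of the Laplacian operator on product spaces" mentioned in this abstract has a simple discrete analogue that can be checked numerically: for a Cartesian product of two graphs, the Laplacian is a Kronecker sum of the factor Laplacians, so its eigenvalues are all pairwise sums of the factor eigenvalues. The sketch below is a toy illustration of that fact only, using two small path graphs chosen here as an assumption; it is not the manifold factorization algorithm from the paper, and the helper names are hypothetical.

```python
import numpy as np

def path_laplacian(n):
    """Combinatorial Laplacian L = D - A of a path graph with n nodes."""
    A = np.zeros((n, n))
    idx = np.arange(n - 1)
    A[idx, idx + 1] = 1.0
    A[idx + 1, idx] = 1.0
    return np.diag(A.sum(axis=1)) - A

def product_laplacian(L1, L2):
    """Laplacian of the Cartesian product graph (Kronecker sum of L1 and L2)."""
    n1, n2 = L1.shape[0], L2.shape[0]
    return np.kron(L1, np.eye(n2)) + np.kron(np.eye(n1), L2)

# Two toy factors: paths with 4 and 3 nodes (their product is a 4 x 3 grid)
L1, L2 = path_laplacian(4), path_laplacian(3)

# Spectrum of the product graph, sorted ascending
product_eigs = np.sort(np.linalg.eigvalsh(product_laplacian(L1, L2)))

# All pairwise sums of the factor eigenvalues, sorted ascending
pairwise_sums = np.sort(
    np.add.outer(np.linalg.eigvalsh(L1), np.linalg.eigvalsh(L2)).ravel()
)

# The two spectra agree up to floating-point error
spectra_match = np.allclose(product_eigs, pairwise_sums)
```

The same additive structure is what makes it plausible to search the spectrum of a data-derived Laplacian for eigenvalues that decompose into sums, which is the intuition behind factoring a product manifold from samples.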

Note: The Chinese text is machine-translated. The cover image is a word cloud of the paper titles.


[arXiv Papers] Computer Vision and Pattern Recognition 2020-10-23

Contents

Abstracts