Contents
1. Uncertainty Sets for Image Classifiers using Conformal Prediction [PDF] abstract
2. Robust Detection of Objects under Periodic Motion with Gaussian Process Filtering [PDF] abstract
3. Multi-View Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People [PDF] abstract
4. A Survey on Deep Learning Techniques for Video Anomaly Detection [PDF] abstract
5. Score-level Multi Cue Fusion for Sign Language Recognition [PDF] abstract
6. Asymmetric Loss For Multi-Label Classification [PDF] abstract
7. CoKe: Localized Contrastive Learning for Robust Keypoint Detection [PDF] abstract
8. Localize to Classify and Classify to Localize: Mutual Guidance in Object Detection [PDF] abstract
9. Attentional Feature Fusion [PDF] abstract
10. Fast Gravitational Approach for Rigid Point Set Registration with Ordinary Differential Equations [PDF] abstract
11. Improving Interpretability for Computer-aided Diagnosis tools on Whole Slide Imaging with Multiple Instance Learning and Gradient-based Explanations [PDF] abstract
12. MARA-Net: Single Image Deraining Network with Multi-level connection and Adaptive Regional Attention [PDF] abstract
13. A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition [PDF] abstract
14. Beneficial Perturbation Network for designing general adaptive artificial intelligence systems [PDF] abstract
15. One-Shot learning based classification for segregation of plastic waste [PDF] abstract
16. MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [PDF] abstract
17. A Comparative Study of Deep Learning Loss Functions for Multi-Label Remote Sensing Image Classification [PDF] abstract
18. Weakly-supervised Salient Instance Detection [PDF] abstract
26. BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model [PDF] abstract
28. Micro-Facial Expression Recognition in Video Based on Optimal Convolutional Neural Network (MFEOCNN) Algorithm [PDF] abstract
37. Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification [PDF] abstract
38. Detecting soccer balls with reduced neural networks: a comparison of multiple architectures under constrained hardware scenarios [PDF] abstract
39. VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training [PDF] abstract
41. Fully Automated Left Atrium Segmentation from Anatomical Cine Long-axis MRI Sequences using Deep Convolutional Neural Network with Unscented Kalman Filter [PDF] abstract
42. A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection [PDF] abstract
48. Loop-box: Multi-Agent Direct SLAM Triggered by Single Loop Closure for Large-Scale Mapping [PDF] abstract
52. Mathematical derivation for Vora-Value based filter design method: Gradient and Hessian [PDF] abstract
54. MPG-Net: Multi-Prediction Guided Network for Segmentation of Retinal Layers in OCT Images [PDF] abstract
Abstracts
1. Uncertainty Sets for Image Classifiers using Conformal Prediction [PDF] back to contents
Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, Michael I. Jordan
Abstract: Convolutional image classifiers can achieve high predictive accuracy, but quantifying their uncertainty remains an unresolved challenge, hindering their deployment in consequential settings. Existing uncertainty quantification techniques, such as Platt scaling, attempt to calibrate the network's probability estimates, but they do not have formal guarantees. We present an algorithm that modifies any classifier to output a predictive set containing the true label with a user-specified probability, such as 90%. The algorithm is simple and fast like Platt scaling, but provides a formal finite-sample coverage guarantee for every model and dataset. Furthermore, our method generates much smaller predictive sets than alternative methods, since we introduce a regularizer to stabilize the small scores of unlikely classes after Platt scaling. In experiments on both Imagenet and Imagenet-V2 with a ResNet-152 and other classifiers, our scheme outperforms existing approaches, achieving exact coverage with sets that are often factors of 5 to 10 smaller.
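For intuition, here is a minimal Python sketch of plain split-conformal prediction sets (it omits the paper's regularizer; the function names and the use of the true-class softmax score as the conformity score are illustrative assumptions, not the authors' released code):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Calibrate on held-out data: choose the cutoff so that the true
    label's softmax score clears it with probability >= 1 - alpha."""
    n = len(cal_labels)
    true_scores = np.sort(cal_probs[np.arange(n), cal_labels])
    k = int(np.floor(alpha * (n + 1)))          # finite-sample correction
    return true_scores[k - 1] if k >= 1 else -np.inf

def predictive_set(probs, threshold):
    """All classes whose softmax score clears the calibrated cutoff."""
    return np.flatnonzero(probs >= threshold)
```

The regularized variant described in the abstract additionally penalizes low-ranked classes so that the sets stay small.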
2. Robust Detection of Objects under Periodic Motion with Gaussian Process Filtering [PDF] back to contents
Joris Guerin, Anne Magaly de Paula Canuto, Luiz Marcos Garcia Goncalves
Abstract: Object Detection (OD) is an important task in Computer Vision with many practical applications. For some use cases, OD must be done on videos, where the object of interest has a periodic motion. In this paper, we formalize the problem of periodic OD, which consists in improving the performance of an OD model in the specific case where the object of interest is repeating similar spatio-temporal trajectories with respect to the video frames. The proposed approach is based on training a Gaussian Process to model the periodic motion and using it to filter out the erroneous predictions of the OD model. By simulating various OD models and periodic trajectories, we demonstrate that this filtering approach, which is entirely data-driven, improves the detection performance by a large margin.
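A rough sketch of this filtering idea with scikit-learn (the periodic kernel choice, the 3-sigma rejection rule, and the use of a single coordinate are assumptions for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ExpSineSquared, WhiteKernel

def filter_periodic_detections(frames, coords, n_sigma=3.0):
    """Fit a periodic GP to one detection coordinate over frame indices
    and flag detections far from the posterior mean as erroneous."""
    kernel = ExpSineSquared() + WhiteKernel()
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(np.asarray(frames).reshape(-1, 1), coords)
    mean, std = gp.predict(np.asarray(frames).reshape(-1, 1), return_std=True)
    return np.abs(coords - mean) <= n_sigma * std   # True = keep detection
```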
3. Multi-View Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People [PDF] back to contents
Akin Caliskan, Armin Mustafa, Evren Imre, Adrian Hilton
Abstract: We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness for reconstruction of clothed people are limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly, a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allow transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single image 3D human shape estimation approaches, achieving significant improvement of reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: this https URL.
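The paper defines its own multiple-view loss; as a hedged sketch of the general shape such a loss can take (an assumed L1 consistency term on top of per-view supervision, with view alignment done upstream):

```python
import torch.nn.functional as F

def multi_view_loss(vol_a, vol_b_in_a, gt_vol, lam=1.0):
    """vol_a: occupancy probabilities predicted from view A; vol_b_in_a:
    the view-B prediction resampled into A's frame; lam is an assumed
    weighting between supervision and cross-view consistency."""
    supervised = F.binary_cross_entropy(vol_a, gt_vol)
    consistency = F.l1_loss(vol_a, vol_b_in_a)   # the two views must agree
    return supervised + lam * consistency
```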
4. A Survey on Deep Learning Techniques for Video Anomaly Detection [PDF] back to contents
Jessie James P. Suarez, Prospero C. Naval Jr
Abstract: Anomaly detection in videos is a problem that has been studied for more than a decade. This area has piqued the interest of researchers due to its wide applicability. Because of this, a wide array of approaches has been proposed over the years, ranging from statistical methods to machine learning-based methods. Numerous surveys have already been conducted on this area, but this paper focuses on providing an overview of the recent advances in the field of anomaly detection using Deep Learning. Deep Learning has been applied successfully in many fields of artificial intelligence such as computer vision, natural language processing and more. This survey, however, focuses on how Deep Learning has improved and provided more insights to the area of video anomaly detection. This paper provides a categorization of the different Deep Learning approaches with respect to their objectives. Additionally, it also discusses the commonly used datasets along with the common evaluation metrics. Afterwards, a discussion synthesizing all of the recent approaches is provided to suggest directions and possible areas for future research.
5. Score-level Multi Cue Fusion for Sign Language Recognition [PDF] back to contents
Çağrı Gökçe, Oğulcan Özdemir, Ahmet Alp Kındıroğlu, Lale Akarun
Abstract: Sign Languages are expressed through hand and upper body gestures as well as facial expressions. Therefore, Sign Language Recognition (SLR) needs to focus on all such cues. Previous work uses hand-crafted mechanisms or network aggregation to extract the different cue features, to increase SLR performance. This is slow and involves complicated architectures. We propose a more straightforward approach that focuses on training separate cue models specializing on the dominant hand, hands, face, and upper body regions. We compare the performance of 3D Convolutional Neural Network (CNN) models specializing in these regions, combine them through score-level fusion, and use the weighted alternative. Our experimental results have shown the effectiveness of mixed convolutional models. Their fusion yields up to 19% accuracy improvement over the baseline using the full upper body. Furthermore, we include a discussion for fusion settings, which can help future work on Sign Language Translation (SLT).
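Score-level fusion itself is straightforward; a sketch of the weighted variant follows (cue names and the uniform default weights are placeholders):

```python
import numpy as np

def fuse_cue_scores(cue_scores, weights=None):
    """cue_scores: dict mapping cue name (dominant hand, hands, face,
    upper body) to a per-class score vector from that cue's 3D CNN."""
    S = np.stack(list(cue_scores.values()))      # (num_cues, num_classes)
    w = np.full(len(S), 1.0 / len(S)) if weights is None else np.asarray(weights)
    fused = w @ S                                # weighted sum of cue scores
    return int(fused.argmax()), fused
```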
6. Asymmetric Loss For Multi-Label Classification [PDF] back to contents
Emanuel Ben-Baruch, Tal Ridnik, Nadav Zamir, Asaf Noy, Itamar Friedman, Matan Protter, Lihi Zelnik-Manor
Abstract: Pictures of everyday life are inherently multi-label in nature. Hence, multi-label classification is commonly used to analyze their content. In typical multi-label datasets, each picture contains only a few positive labels, and many negative ones. This positive-negative imbalance can result in under-emphasizing gradients from positive labels during training, leading to poor accuracy. In this paper, we introduce a novel asymmetric loss ("ASL") that operates differently on positive and negative samples. The loss dynamically down-weights the importance of easy negative samples, causing the optimization process to focus more on the positive samples, and also makes it possible to discard mislabeled negative samples. We demonstrate how ASL leads to a more "balanced" network, with increased average probabilities for positive samples, and show how this balanced network is translated to better mAP scores, compared to commonly used losses. Furthermore, we offer a method that can dynamically adjust the level of asymmetry throughout the training. With ASL, we reach new state-of-the-art results on three common multi-label datasets, including achieving 86.6% on MS-COCO. We also demonstrate ASL applicability for other tasks such as fine-grain single-label classification and object detection. ASL is effective, easy to implement, and does not increase the training time or complexity. Implementation is available at: this https URL.
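A compact PyTorch sketch of the loss as described (asymmetric focusing plus probability shifting); the hyper-parameter defaults below are assumptions, and the official implementation at the linked URL should be preferred:

```python
import torch

def asymmetric_loss(logits, targets, gamma_pos=0, gamma_neg=4, clip=0.05, eps=1e-8):
    """targets is a 0/1 multi-hot tensor shaped like logits. gamma_neg >
    gamma_pos down-weights easy negatives; `clip` shifts the negative
    probability so very easy (possibly mislabeled) negatives get zero loss."""
    p = torch.sigmoid(logits)
    p_neg = (p - clip).clamp(min=0)                       # probability shifting
    loss_pos = targets * (1 - p) ** gamma_pos * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg ** gamma_neg * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).sum()
```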
7. CoKe: Localized Contrastive Learning for Robust Keypoint Detection [PDF] back to contents
Yutong Bai, Angtian Wang, Adam Kortylewski, Alan Yuille
Abstract: Today's most popular approaches to keypoint detection learn a holistic representation of all keypoints. This enables them to implicitly leverage the relative spatial geometry between keypoints and thus to prevent false-positive detections due to local ambiguities. However, our experiments show that such holistic representations do not generalize well when the 3D pose of objects varies strongly, or when objects are partially occluded. In this paper, we propose CoKe, a framework for the supervised contrastive learning of distinct local feature representations for robust keypoint detection. In particular, we introduce a feature bank mechanism and update rules for keypoint and non-keypoint features which make possible to learn local keypoint detectors that are accurate and robust to local ambiguities. Our experiments show that CoKe achieves state-of-the-art results compared to approaches that jointly represent all keypoints holistically (Stacked Hourglass Networks, MSS-Net) as well as to approaches that are supervised with the detailed 3D object geometry (StarMap). Notably, CoKe performs exceptionally well when objects are partially occluded and outperforms related work on a range of diverse datasets (PASCAL3D+,ObjectNet, MPII).
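The feature-bank mechanism can be sketched as a momentum-updated prototype per keypoint (the update rule and momentum value below are common choices assumed for illustration, not taken from the paper):

```python
import torch.nn.functional as F

def update_keypoint_bank(bank, kp_ids, feats, momentum=0.9):
    """bank: (num_keypoints, dim) running feature prototypes; feats: the
    current normalized features observed for the keypoints in kp_ids."""
    bank[kp_ids] = momentum * bank[kp_ids] + (1 - momentum) * feats
    return F.normalize(bank, dim=1)   # keep prototypes on the unit sphere
```

A contrastive loss would then pull each local feature toward its own prototype and push it away from the other prototypes and from non-keypoint features.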
8. Localize to Classify and Classify to Localize: Mutual Guidance in Object Detection [PDF] back to contents
Heng Zhang, Elisa Fromont, Sébastien Lefevre, Bruno Avignon
Abstract: Most deep learning object detectors are based on the anchor mechanism and resort to the Intersection over Union (IoU) between predefined anchor boxes and ground truth boxes to evaluate the matching quality between anchors and objects. In this paper, we question this use of IoU and propose a new anchor matching criterion guided, during the training phase, by the optimization of both the localization and the classification tasks: the predictions related to one task are used to dynamically assign sample anchors and improve the model on the other task, and vice versa. Despite the simplicity of the proposed method, our experiments with different state-of-the-art deep learning architectures on PASCAL VOC and MS COCO datasets demonstrate the effectiveness and generality of our Mutual Guidance strategy.
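In spirit, the assignment looks something like the following sketch (the tensor shapes, the top-k rule, and k itself are assumptions; the paper defines its own matching criterion):

```python
import torch

def mutual_assign(cls_scores, regressed_ious, topk=9):
    """For one ground-truth object: cls_scores holds each anchor's predicted
    class confidence, regressed_ious the IoU of each anchor's *regressed* box
    with the ground truth. Each task's predictions pick positives for the other."""
    pos_for_cls = regressed_ious.topk(topk).indices   # localize to classify
    pos_for_loc = cls_scores.topk(topk).indices       # classify to localize
    return pos_for_cls, pos_for_loc
```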
9. Attentional Feature Fusion [PDF] back to contents
Yimian Dai, Fabian Gieseke, Stefan Oehmcke, Yiquan Wu, Kobus Barnard
Abstract: Feature fusion, the combination of features from different layers or branches, is an omnipresent part of modern network architectures. It is often implemented via simple operations, such as summation or concatenation, but this might not be the best choice. In this work, we propose a uniform and general scheme, namely attentional feature fusion, which is applicable for most common scenarios, including feature fusion induced by short and long skip connections as well as within Inception layers. To better fuse features of inconsistent semantics and scales, we propose a multi-scale channel attention module, which addresses issues that arise when fusing features given at different scales. We also demonstrate that the initial integration of feature maps can become a bottleneck and that this issue can be alleviated by adding another level of attention, which we refer to as iterative attentional feature fusion. With fewer layers or parameters, our models outperform state-of-the-art networks on both CIFAR-100 and ImageNet datasets, which suggests that more sophisticated attention mechanisms for feature fusion hold great potential to consistently yield better results compared to their direct counterparts. Our codes and trained models are available online.
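A sketch of the fusion scheme with a multi-scale channel attention module (layer sizes and the reduction ratio are assumptions; consult the authors' released code for the exact design):

```python
import torch.nn as nn

class AttentionalFeatureFusion(nn.Module):
    """Fuse two feature maps x and y: a local (point-wise) and a global
    (pooled) channel-attention branch produce weights that softly select
    between the inputs at every channel and position."""
    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)
        def attn_branch():
            return nn.Sequential(
                nn.Conv2d(channels, mid, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, kernel_size=1))
        self.local_attn = attn_branch()
        self.global_attn = nn.Sequential(nn.AdaptiveAvgPool2d(1), attn_branch())
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, y):
        s = x + y                                   # initial integration
        w = self.sigmoid(self.local_attn(s) + self.global_attn(s))
        return w * x + (1.0 - w) * y                # soft selection
```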
10. Fast Gravitational Approach for Rigid Point Set Registration with Ordinary Differential Equations [PDF] back to contents
Sk Aziz Ali, Kerem Kahraman, Christian Theobalt, Didier Stricker, Vladislav Golyanik
Abstract: This article introduces a new physics-based method for rigid point set alignment called Fast Gravitational Approach (FGA). In FGA, the source and target point sets are interpreted as rigid particle swarms with masses interacting in a globally multiply-linked manner while moving in a simulated gravitational force field. The optimal alignment is obtained by explicit modeling of forces acting on the particles as well as their velocities and displacements with second-order ordinary differential equations of motion. Additional alignment cues (point-based or geometric features, and other boundary conditions) can be integrated into FGA through particle masses. We propose a smooth-particle mass function for point mass initialization, which improves robustness to noise and structural discontinuities. To avoid prohibitive quadratic complexity of all-to-all point interactions, we adapt a Barnes-Hut tree for accelerated force computation and achieve quasilinear computational complexity. We show that the new method class has characteristics not found in previous alignment methods such as efficient handling of partial overlaps, inhomogeneous point sampling densities, and coping with large point clouds with reduced runtime compared to the state of the art. Experiments show that our method performs on par with or outperforms all compared competing non-deep-learning-based and general-purpose techniques (which do not assume the availability of training data and a scene prior) in resolving transformations for LiDAR data and gains state-of-the-art accuracy and speed when coping with different types of data disturbances.
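A toy, unaccelerated integration step conveys the dynamics (this O(NM) all-pairs version is what the Barnes-Hut tree replaces; unit masses, the softening constant, the damping term, and the omitted rigidity projection are all simplifying assumptions):

```python
import numpy as np

def fga_step(src, tgt, vel, dt=0.01, eps=1e-3, damping=0.5):
    """One explicit step of gravitational point-set motion: every source
    point is pulled toward all target points by a softened inverse-square
    force, with second-order (velocity-carrying) dynamics."""
    diff = tgt[None, :, :] - src[:, None, :]          # (N, M, 3) offsets
    dist = np.linalg.norm(diff, axis=-1, keepdims=True)
    force = (diff / (dist ** 3 + eps)).sum(axis=1)    # net pull per source point
    vel = damping * vel + dt * force
    return src + dt * vel, vel
```

In the rigid setting, the per-point displacements would additionally be projected onto a single rotation and translation at each step.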
11. Improving Interpretability for Computer-aided Diagnosis tools on Whole Slide Imaging with Multiple Instance Learning and Gradient-based Explanations [PDF] back to contents
Antoine Pirovano, Hippolyte Heuberger, Sylvain Berlemont, Saïd Ladjal, Isabelle Bloch
Abstract: Deep learning methods are widely used for medical applications to assist medical doctors in their daily routines. While performance reaches expert level, interpretability (highlighting how and what a trained model has learned, and why it makes a specific decision) is the next important challenge that deep learning methods need to answer to be fully integrated in the medical field. In this paper, we address the question of interpretability in the context of whole slide image (WSI) classification. We formalize the design of WSI classification architectures and propose a piece-wise interpretability approach, relying on gradient-based methods, feature visualization and multiple instance learning context. We aim at explaining how the decision is made based on tile-level scoring, how these tile scores are decided, and which features are used and relevant for the task. After training two WSI classification architectures on the Camelyon-16 WSI dataset, highlighting discriminative features learned, and validating our approach with pathologists, we propose a novel manner of computing interpretability slide-level heat-maps, based on the extracted features, that improves tile-level classification performance by more than 29% in AUC.
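A gradient-based tile heat-map can be sketched in a few lines (mil_head is an assumed callable that aggregates tile embeddings into one scalar slide score; the gradient-norm relevance measure is one common choice, not necessarily the paper's):

```python
import torch

def tile_heatmap(tile_feats, mil_head):
    """tile_feats: (num_tiles, dim) embeddings of the slide's tiles.
    Backpropagate the slide-level score to the tiles and use the gradient
    magnitude as a per-tile relevance value."""
    tile_feats = tile_feats.detach().requires_grad_(True)
    slide_score = mil_head(tile_feats)     # scalar prediction for the slide
    slide_score.backward()
    return tile_feats.grad.norm(dim=1)     # one relevance value per tile
```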
12. MARA-Net: Single Image Deraining Network with Multi-level connection and Adaptive Regional Attention [PDF] back to contents
Yeachan Park, Myeongho Jeon, Junho Lee, Myungjoo Kan
Abstract: Removing rain streaks from single images is an important problem in various computer vision tasks because rain streaks can degrade outdoor images and reduce their visibility. While recent convolutional neural network-based deraining models have succeeded in capturing rain streaks effectively, difficulties in recovering the details in rain-free images still remain. In this paper, we present a multi-level connection and adaptive regional attention network (MARA-Net) to properly restore the original background textures in rainy images. The first main idea is a multi-level connection design that repeatedly connects multi-level features of the encoder network to the decoder network. Multi-level connections encourage the decoding process to use the feature information of all levels. Channel attention is considered in multi-level connections to learn which level of features is important in the decoding process of the current level. The second main idea is a wide regional non-local block (WRNL). As rain streaks primarily exhibit a vertical distribution, we divide the grid of the image into horizontally-wide patches and apply a non-local operation to each region to explore the rich rain-free background information. Experimental results on both synthetic and real-world rainy datasets demonstrate that the proposed model significantly outperforms existing state-of-the-art models. Furthermore, the results of the joint deraining and segmentation experiment prove that our model contributes effectively to other vision tasks.
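The wide regional non-local idea can be sketched as self-attention restricted to horizontally-wide strips (the number of regions and the bare dot-product attention are assumptions; the paper's block is more elaborate):

```python
import torch

def wide_regional_nonlocal(x, num_regions=4):
    """x: (B, C, H, W) feature map. Split along the height into wide strips
    (rain streaks are mostly vertical) and run non-local attention inside
    each strip, with a residual connection."""
    outs = []
    for strip in x.chunk(num_regions, dim=2):
        q = strip.flatten(2)                                  # (B, C, n)
        attn = torch.softmax(q.transpose(1, 2) @ q, dim=-1)   # (B, n, n)
        outs.append((q @ attn.transpose(1, 2)).view_as(strip) + strip)
    return torch.cat(outs, dim=2)
```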
13. A Prototype-Based Generalized Zero-Shot Learning Framework for Hand Gesture Recognition [PDF] back to contents
Jinting Wu, Yujia Zhang, Xiaoguang Zhao
Abstract: Hand gesture recognition plays a significant role in human-computer interaction for understanding various human gestures and their intent. However, most prior works can only recognize gestures of limited labeled classes and fail to adapt to new categories. The task of Generalized Zero-Shot Learning (GZSL) for hand gesture recognition aims to address the above issue by leveraging semantic representations and detecting both seen and unseen class samples. In this paper, we propose an end-to-end prototype-based GZSL framework for hand gesture recognition which consists of two branches. The first branch is a prototype-based detector that learns gesture representations and determines whether an input sample belongs to a seen or unseen category. The second branch is a zero-shot label predictor which takes the features of unseen classes as input and outputs predictions through a learned mapping mechanism between the feature and the semantic space. We further establish a hand gesture dataset that specifically targets this GZSL task, and comprehensive experiments on this dataset demonstrate the effectiveness of our proposed approach on recognizing both seen and unseen gestures.
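The two-branch routing can be sketched as a distance test against the learned prototypes (tau, the plain nearest-prototype rule, and zsl_head are assumptions standing in for the paper's learned detector and semantic mapping):

```python
import torch

def gzsl_predict(feat, prototypes, zsl_head, tau=0.5):
    """prototypes: (num_seen_classes, dim) learned gesture prototypes;
    zsl_head: assumed callable mapping a feature to unseen-class scores."""
    d = torch.cdist(feat[None], prototypes)[0]   # distance to each prototype
    if d.min() < tau:                            # close enough: a seen class
        return "seen", int(d.argmin())
    return "unseen", int(zsl_head(feat).argmax())
```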
14. Beneficial Perturbation Network for designing general adaptive artificial intelligence systems [PDF] back to contents
Shixian Wen, Amanda Rios, Yunhao Ge, Laurent Itti
Abstract: The human brain is the gold standard of adaptive learning. It not only can learn and benefit from experience, but also can adapt to new situations. In contrast, deep neural networks only learn one sophisticated but fixed mapping from inputs to outputs. This limits their applicability to more dynamic situations, where input to output mapping may change with different contexts. A salient example is continual learning - learning new independent tasks sequentially without forgetting previous tasks. Continual learning of multiple tasks in artificial neural networks using gradient descent leads to catastrophic forgetting, whereby a previously learned mapping of an old task is erased when learning new mappings for new tasks. Here, we propose a new biologically plausible type of deep neural network with extra, out-of-network, task-dependent biasing units to accommodate these dynamic situations. This allows, for the first time, a single network to learn potentially unlimited parallel input to output mappings, and to switch on the fly between them at runtime. Biasing units are programmed by leveraging beneficial perturbations (opposite to well-known adversarial perturbations) for each task. Beneficial perturbations for a given task bias the network toward that task, essentially switching the network into a different mode to process that task. This largely eliminates catastrophic interference between tasks. Our approach is memory-efficient and parameter-efficient, can accommodate many tasks, and achieves state-of-the-art performance across different tasks and domains.
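The biasing-unit idea can be sketched as one additive, task-indexed bias per layer (a simplified stand-in for the paper's beneficial perturbations, which are computed via perturbation methods rather than learned as free parameters):

```python
import torch
import torch.nn as nn

class TaskBiasedLinear(nn.Module):
    """A shared linear layer plus one out-of-network bias vector per task;
    selecting the active task's bias at runtime switches the layer into a
    task-specific mode. In a continual-learning setting the shared weights
    would be protected while only the current task's bias is adapted."""
    def __init__(self, in_dim, out_dim, num_tasks):
        super().__init__()
        self.shared = nn.Linear(in_dim, out_dim)
        self.task_bias = nn.Parameter(torch.zeros(num_tasks, out_dim))

    def forward(self, x, task_id):
        return self.shared(x) + self.task_bias[task_id]
```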
15. One-Shot learning based classification for segregation of plastic waste [PDF] back to contents
Shivaank Agarwal, Ravindra Gudi, Paresh Saxena
Abstract: The problem of segregating recyclable waste is fairly daunting for many countries. This article presents an approach for image-based classification of plastic waste using one-shot learning techniques. The proposed approach exploits discriminative features generated via the siamese and triplet loss convolutional neural networks to help differentiate between 5 types of plastic waste based on their resin codes. The approach achieves an accuracy of 99.74% on the WaDaBa database.
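The triplet objective used for such one-shot matching is standard; a minimal sketch follows (the margin value is an assumption):

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor and positive share a resin code, negative has a different one;
    all three are embeddings from the shared CNN."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```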
16. MS-RANAS: Multi-Scale Resource-Aware Neural Architecture Search [PDF] back to contents
Cristian Cioflan, Radu Timofte
Abstract: Neural Architecture Search (NAS) has proved effective at discovering alternatives that outperform handcrafted neural networks. In this paper we analyse the benefits of NAS for image classification tasks under strict computational constraints. Our aim is to automate the design of highly efficient deep neural networks, capable of offering fast and accurate predictions, that could be deployed on a low-memory, low-power system-on-chip. The task thus becomes a three-party trade-off between accuracy, computational complexity, and memory requirements. To address this concern, we propose Multi-Scale Resource-Aware Neural Architecture Search (MS-RANAS). We employ a one-shot architecture search approach in order to obtain a reduced search cost, and we focus on an anytime prediction setting. Through the usage of multiple-scaled features and early classifiers, we achieved state-of-the-art results in terms of accuracy-speed trade-off.
17. A Comparative Study of Deep Learning Loss Functions for Multi-Label Remote Sensing Image Classification [PDF] back to contents
Hichame Yessou, Gencer Sumbul, Begüm Demir
Abstract: This paper analyzes and compares different deep learning loss functions in the framework of multi-label remote sensing (RS) image scene classification problems. We consider seven loss functions: 1) cross-entropy loss; 2) focal loss; 3) weighted cross-entropy loss; 4) Hamming loss; 5) Huber loss; 6) ranking loss; and 7) sparseMax loss. All the considered loss functions are analyzed for the first time in RS. After a theoretical analysis, an experimental analysis is carried out to compare the considered loss functions in terms of: 1) overall accuracy; 2) class imbalance awareness (for which the number of samples associated to each class significantly varies); 3) convexity and differentiability; and 4) learning efficiency (i.e., convergence speed). On the basis of our analysis, some guidelines are derived for a proper selection of a loss function in multi-label RS scene classification problems.
18. Weakly-supervised Salient Instance Detection [PDF] back to contents
Xin Tian, Ke Xu, Xin Yang, Baocai Yin, Rynson W.H. Lau
Abstract: Existing salient instance detection (SID) methods typically learn from pixel-level annotated datasets. In this paper, we present the first weakly-supervised approach to the SID problem. Although weak supervision has been considered in general saliency detection, it is mainly based on using class labels for object localization. However, it is non-trivial to use only class labels to learn instance-aware saliency information, as salient instances with high semantic affinities may not be easily separated by the labels. We note that subitizing information provides an instant judgement on the number of salient items, which naturally relates to detecting salient instances and may help separate instances of the same class while grouping different parts of the same instance. Inspired by this insight, we propose to use class and subitizing labels as weak supervision for the SID problem. We propose a novel weakly-supervised network with three branches: a Saliency Detection Branch leveraging class consistency information to locate candidate objects; a Boundary Detection Branch exploiting class discrepancy information to delineate object boundaries; and a Centroid Detection Branch using subitizing information to detect salient instance centroids. This complementary information is further fused to produce salient instance maps. We conduct extensive experiments to demonstrate that the proposed method performs favorably against carefully designed baseline methods adapted from related tasks.
19. Multi-term and Multi-task Affect Analysis in the Wild [PDF] 返回目录
Sachihiro Youoku, Junya Saito, Yuushi Toyoda, Takahisa Yamamoto, Junya Saito, Ryosuke Kawamura, Xiaoyu Mi, Kentaro Murase
Abstract: Human affect recognition is an important factor in human-computer interaction. However, the development of methods for in-the-wild data is still far from sufficient. In this paper, we introduce the affect recognition method that we submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2020 Contest. In our approach, since we consider that affective behaviors have different time-window features, we generated features and averaged labels using short-term, medium-term, and long-term time windows from video images. Then, we built affect recognition models for each time window and ensembled the models. In addition, we fused the VA and EXP models, taking into account that valence, arousal, and expression are closely related. The features were trained by gradient boosting, using the mean, standard deviation, max-range, and slope in each time window. We achieved a valence-arousal score of 0.495 and an expression score of 0.464 on the validation set.
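A small sketch of the described feature extraction, computing the mean, standard deviation, max-range, and slope over fixed-length time windows; the window lengths and helper below are illustrative assumptions, not the submission's values:

```python
# Per-window statistics over a per-frame feature series.
import numpy as np

def window_stats(series, window):
    feats = []
    for start in range(0, len(series) - window + 1, window):
        w = series[start:start + window]
        slope = np.polyfit(np.arange(window), w, 1)[0]   # linear-fit slope
        feats.append([w.mean(), w.std(), w.max() - w.min(), slope])
    return np.array(feats)

signal = np.random.rand(300)                 # e.g. one expression feature per frame
short, medium, long_ = (window_stats(signal, w) for w in (10, 30, 100))
```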
20. Where is the Model Looking At?--Concentrate and Explain the Network Attention [PDF] 返回目录
Wenjia Xu, Jiuniu Wang, Yang Wang, Guangluan Xu, Wei Dai, Yirong Wu
Abstract: Image classification models have achieved satisfactory performance on many datasets, sometimes even better than humans. However, the model attention is unclear due to the lack of interpretability. This paper investigates the fidelity and interpretability of model attention. We propose an Explainable Attribute-based Multi-task (EAT) framework to concentrate the model attention on the discriminative image area and make the attention interpretable. We introduce attribute prediction to the multi-task learning network, helping the network to concentrate attention on the foreground objects. We generate attribute-based textual explanations for the network and ground the attributes on the image to show visual explanations. The multi-modal explanation can not only improve user trust but also help to find the weaknesses of the network and dataset. Our framework can be generalized to any basic model. We perform experiments on three datasets and five basic models. Results indicate that the EAT framework can give multi-modal explanations that interpret the network decision. The performance of several recognition approaches is improved by guiding network attention.
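A schematic sketch of the multi-task idea, a shared backbone with a class head and an attribute head trained jointly; all module shapes and head sizes here are assumptions for illustration, not the EAT implementation:

```python
# Joint class + attribute objective over a shared representation.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU())
class_head = nn.Linear(256, 100)       # e.g. 100 categories (assumed)
attr_head = nn.Linear(256, 40)         # e.g. 40 binary attributes (assumed)

x = torch.randn(8, 3, 64, 64)
y_cls = torch.randint(0, 100, (8,))
y_attr = torch.randint(0, 2, (8, 40)).float()

h = backbone(x)
loss = nn.functional.cross_entropy(class_head(h), y_cls) \
     + nn.functional.binary_cross_entropy_with_logits(attr_head(h), y_attr)
```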
21. Neural Alignment for Face De-pixelization [PDF] 返回目录
Maayan Shuvi, Noa Fish, Kfir Aberman, Ariel Shamir, Daniel Cohen-Or
Abstract: We present a simple method to reconstruct a high-resolution video from a face-video, where the identity of a person is obscured by pixelization. This concealment method is popular because the viewer can still perceive a human face figure and the overall head motion. However, we show in our experiments that a fairly good approximation of the original video can be reconstructed in a way that compromises anonymity. Our system exploits the simultaneous similarity and small disparity between close-by video frames depicting a human face, and employs a spatial transformation component that learns the alignment between the pixelated frames. Each frame, supported by its aligned surrounding frames, is first encoded, then decoded to a higher resolution. Reconstruction and perceptual losses promote adherence to the ground-truth, and an adversarial loss assists in maintaining domain faithfulness. There is no need for explicit temporal coherency loss as it is maintained implicitly by the alignment of neighboring frames and reconstruction. Although simple, our framework synthesizes high-quality face reconstructions, demonstrating that given the statistical prior of a human face, multiple aligned pixelated frames contain sufficient information to reconstruct a high-quality approximation of the original signal.
22. imdpGAN: Generating Private and Specific Data with Generative Adversarial Networks [PDF] 返回目录
Saurabh Gupta, Arun Balaji Buduru, Ponnurangam Kumaraguru
Abstract: Generative Adversarial Network (GAN) and its variants have shown promising results in generating synthetic data. However, the issues with GANs are: (i) the learning happens around the training samples and the model often ends up remembering them, consequently compromising the privacy of individual samples - this becomes a major concern when GANs are applied to training data that includes personally identifiable information; (ii) because of the randomness in generated data, there is no control over the specificity of the generated samples. To address these issues, we propose imdpGAN - an information-maximizing differentially private Generative Adversarial Network. It is an end-to-end framework that simultaneously achieves privacy protection and learns latent representations. With experiments on the MNIST dataset, we show that imdpGAN preserves the privacy of the individual data point and learns latent codes to control the specificity of the generated samples. We perform binary classification on digit pairs to show the utility versus privacy trade-off. The classification accuracy decreases as we increase the privacy level in the framework. We also show experimentally that the training process of imdpGAN is stable but experiences a 10-fold time increase compared with other GAN frameworks. Finally, we extend the imdpGAN framework to the CelebA dataset to show how the privacy and learned representations can be used to control the specificity of the output.
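For intuition, a minimal sketch of the differential-privacy ingredient such frameworks build on: per-sample gradient clipping followed by Gaussian noise. The clip norm and noise scale are placeholders, and this is not imdpGAN's exact mechanism:

```python
# Per-sample gradient clipping + Gaussian noise (DP-SGD-style).
import torch

def privatize_gradients(per_sample_grads, clip_norm=1.0, noise_std=0.5):
    clipped = []
    for g in per_sample_grads:                       # one grad tensor per sample
        scale = (clip_norm / (g.norm() + 1e-12)).clamp(max=1.0)
        clipped.append(g * scale)                    # bound each sample's influence
    total = torch.stack(clipped).sum(dim=0)
    return total + noise_std * clip_norm * torch.randn_like(total)

grads = [torch.randn(10) for _ in range(4)]          # toy per-sample gradients
noisy = privatize_gradients(grads)
```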
23. SIR: Similar Image Retrieval for Product Search in E-Commerce [PDF] 返回目录
Theban Stanley, Nihar Vanjara, Yanxin Pan, Ekaterina Pirogova, Swagata Chakraborty, Abon Chaudhuri
Abstract: We present a similar image retrieval (SIR) platform that is used to quickly discover visually similar products in a catalog of millions. Given the size, diversity, and dynamism of our catalog, product search poses many challenges. It can be addressed by building supervised models to tag product images with labels representing themes and later retrieving them by label. This approach suffices for common and perennial themes like "white shirt" or "lifestyle image of TV". It does not work for new themes such as "e-cigarettes", hard-to-define ones such as "image with a promotional badge", or ones with a short relevance span such as "Halloween costumes". SIR is ideal for such cases because it allows us to search by an example, not a pre-defined theme. We describe the steps - embedding computation, encoding, and indexing - that power the approximate nearest neighbor search back-end. We also highlight two applications of SIR. The first one is related to the detection of products with various types of potentially objectionable themes. This application is run with a sense of urgency, so the typical time frame to train and bootstrap a model is not permitted. Also, these themes are often short-lived based on current trends, so spending resources to build a lasting model is not justified. The second application is a variant item detection system where SIR helps discover visual variants that are hard to find through text search. We analyze the performance of SIR in the context of these applications.
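A toy sketch of the retrieval back-end's final step: normalized embeddings queried for their top-k nearest neighbors. Brute-force cosine search stands in here for the approximate index used at catalog scale; the embedding dimension and catalog size are invented:

```python
# Top-k nearest-neighbor lookup over product embeddings.
import numpy as np

def top_k_similar(query_vec, catalog_vecs, k=5):
    q = query_vec / np.linalg.norm(query_vec)
    c = catalog_vecs / np.linalg.norm(catalog_vecs, axis=1, keepdims=True)
    sims = c @ q                          # cosine similarity to every item
    return np.argsort(-sims)[:k]          # indices of the k closest products

catalog = np.random.rand(10000, 128)      # 10k product embeddings, dim 128
print(top_k_similar(np.random.rand(128), catalog))
```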
24. TinyGAN: Distilling BigGAN for Conditional Image Generation [PDF] 返回目录
Ting-Yun Chang, Chi-Jen Lu
Abstract: Generative Adversarial Networks (GANs) have become a powerful approach for generative image modeling. However, GANs are notorious for their training instability, especially on large-scale, complex datasets. While the recent work of BigGAN has significantly improved the quality of image generation on ImageNet, it requires a huge model, making it hard to deploy on resource-constrained devices. To reduce the model size, we propose a black-box knowledge distillation framework for compressing GANs, which highlights a stable and efficient training process. Given BigGAN as the teacher network, we manage to train a much smaller student network to mimic its functionality, achieving competitive performance on Inception and FID scores with the generator having $16\times$ fewer parameters.
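A condensed sketch of black-box distillation as the abstract describes it: the teacher is queried only for input-output pairs, and a small student generator is fit to reproduce them. Both networks below are trivial stand-ins, class conditioning is omitted in the student, and the pixel-level L1 loss is one plausible choice rather than the paper's full objective:

```python
# Black-box generator distillation: student mimics teacher outputs.
import torch
import torch.nn as nn

teacher = lambda z, y: torch.tanh(torch.randn(z.size(0), 3, 32, 32))  # stand-in for BigGAN queries
student = nn.Sequential(nn.Linear(128, 3 * 32 * 32), nn.Tanh())
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(16, 128)
    y = torch.randint(0, 1000, (16,))
    target = teacher(z, y)                       # only input-output access
    fake = student(z).view(-1, 3, 32, 32)
    loss = (fake - target).abs().mean()          # pixel-level distillation loss
    opt.zero_grad(); loss.backward(); opt.step()
```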
25. An Image Processing Pipeline for Automated Packaging Structure Recognition [PDF] 返回目录
Laura Dörr, Felix Brandt, Martin Pouls, Alexander Naumann
Abstract: Dispatching and receiving logistics goods, as well as transportation itself, involve a large amount of manual effort. The transported goods, including their packaging and labeling, need to be double-checked, verified, or recognized at many supply chain network points. These processes hold automation potential, which we aim to exploit using computer vision techniques. More precisely, we propose a cognitive system for the fully automated recognition of packaging structures for standardized logistics shipments based on single RGB images. Our contribution contains descriptions of a suitable system design and its evaluation on relevant real-world data. Further, we discuss our algorithmic choices.
26. BAMSProd: A Step towards Generalizing the Adaptive Optimization Methods to Deep Binary Model [PDF] 返回目录
Junjie Liu, Dongchao Wen, Deyu Wang, Wei Tao, Tse-Wei Chen, Kinya Osa, Masami Kato
Abstract: Recent methods have significantly reduced the performance degradation of Binary Neural Networks (BNNs), but guaranteeing the effective and efficient training of BNNs is an unsolved problem. The main reason is that the estimated gradients produced by the Straight-Through Estimator (STE) mismatch the gradients of the real derivatives. In this paper, we provide an explicit convex optimization example where training BNNs with traditional adaptive optimization methods still faces the risk of non-convergence, and identify that constraining the range of the gradients is critical for optimizing the deep binary model to avoid highly suboptimal solutions. To solve the above issues, we propose the BAMSProd algorithm, with a key observation that the convergence property of optimizing a deep binary model is strongly related to the quantization errors. In brief, it employs an adaptive range constraint via an error measurement to smooth the gradient transition, while following the exponential moving strategy from AMSGrad to avoid error accumulation during optimization. The experiments verify the corollary of the theoretical convergence analysis, and further demonstrate that our optimization method can speed up convergence by about 1.2x and boost the performance of BNNs to a significant level, surpassing the specific binary optimizer by about 3.7%, even in a highly non-convex optimization problem.
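A loose sketch combining the two ingredients the abstract names, a constraint on the gradient range and an AMSGrad-style non-decreasing second moment; this is an illustrative update rule under assumed hyperparameters, not the BAMSProd algorithm itself:

```python
# Range-constrained gradient + AMSGrad-style second-moment update.
import torch

def step(param, grad, state, lr=1e-3, b1=0.9, b2=0.999, clip=1.0, eps=1e-8):
    grad = grad.clamp(-clip, clip)                 # constrain the gradient range
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    state["v_hat"] = torch.maximum(state["v_hat"], state["v"])  # AMSGrad: non-decreasing
    param -= lr * state["m"] / (state["v_hat"].sqrt() + eps)

w = torch.zeros(5)
st = {"m": torch.zeros(5), "v": torch.zeros(5), "v_hat": torch.zeros(5)}
step(w, torch.randn(5), st)
```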
27. A comparison of classical and variational autoencoders for anomaly detection [PDF] 返回目录
Fabrizio Patuzzo
Abstract: This paper analyzes and compares a classical and a variational autoencoder in the context of anomaly detection. To better understand their architecture and functioning, describe their properties and compare their performance, it explores how they address a simple problem: reconstructing a line with a slope.
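A minimal sketch of the paper's toy setup: fit an autoencoder with a 1-D bottleneck on points from a line and score anomalies by reconstruction error. The architecture and the threshold-free scoring below are assumptions:

```python
# Tiny linear autoencoder as an anomaly detector on y = 2x.
import torch
import torch.nn as nn

ae = nn.Sequential(nn.Linear(2, 1), nn.Linear(1, 2))   # 1-D bottleneck
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
x = torch.rand(256, 1)
normal = torch.cat([x, 2.0 * x], dim=1)                # points on the line y = 2x

for _ in range(500):
    loss = ((ae(normal) - normal) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

test = torch.tensor([[0.5, 1.0], [0.5, -3.0]])         # on-line vs. off-line point
scores = ((ae(test) - test) ** 2).mean(dim=1)          # higher score = more anomalous
```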
28. Micro-Facial Expression Recognition in Video Based on Optimal Convolutional Neural Network (MFEOCNN) Algorithm [PDF] 返回目录
S. D. Lalitha, K. K. Thyagharajan
Abstract: Facial expression is among the most important cues for human emotion recognition, and people use facial expressions to demonstrate their emotional states. Nevertheless, recognizing facial expressions has remained a challenging and intriguing problem in computer vision. Recognizing micro-facial expressions in video sequences is the main objective of the proposed approach. For efficient recognition, the proposed method utilizes an optimal convolutional neural network. The proposed method takes the CK+ dataset as input. At first, preprocessing is performed on the input image by means of adaptive median filtering. From the preprocessed output, the extracted features are geometric features, Histogram of Oriented Gradients features, and Local Binary Pattern features. The novelty of the proposed method is that, with the help of the Modified Lion Optimization (MLO) algorithm, the optimal features are selected from the extracted features. In a shorter computational time, this has the benefit of converging rapidly and effectively on an overall arrangement. Finally, the recognition is done by a convolutional neural network (CNN). The performance of the proposed MFEOCNN method is then analysed in terms of false measures and recognition accuracy. This kind of emotion recognition is mainly used in medicine, marketing, e-learning, entertainment, law, and monitoring. From the simulation, we find that the proposed approach achieves a maximum recognition accuracy of 99.2% with a minimum Mean Absolute Error (MAE) value. These results are compared with the existing MicroFacial Expression Based Deep-Rooted Learning (MFEDRL), Convolutional Neural Network with Lion Optimization (CNN+LO), and Convolutional Neural Network (CNN) without optimization. The simulation of the proposed method is done on the MATLAB platform.
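A sketch of the hand-crafted feature stage using scikit-image's HOG and uniform LBP; the parameter values are common defaults rather than the paper's settings, and the input is a stand-in array:

```python
# HOG + LBP descriptors from a (random stand-in) grayscale face crop.
import numpy as np
from skimage.feature import hog, local_binary_pattern

face = (np.random.rand(64, 64) * 255).astype(np.uint8)
hog_feat = hog(face, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))
lbp = local_binary_pattern(face, P=8, R=1.0, method="uniform")
lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
features = np.concatenate([hog_feat, lbp_hist])  # combined descriptor
```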
29. Knowledge Fusion Transformers for Video Action Recognition [PDF] 返回目录
Ganesh Samarth, Sheetal Ojha, Nikhil Pareek
Abstract: We introduce Knowledge Fusion Transformers for video action classification. We present a self-attention based feature enhancer to fuse action knowledge into the 3D-inception-based spatiotemporal context of the video clip intended to be classified. We show how using only one-stream networks, with little or no pretraining, can pave the way to performance close to the current state-of-the-art. Additionally, we present how different self-attention architectures used at different levels of the network can be blended in to enhance feature representation. Our architecture is trained and evaluated on the UCF-101 and Charades datasets, where it is competitive with the state of the art. It also exceeds single-stream networks with little to no pretraining by a large margin.
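For reference, a generic scaled dot-product self-attention sketch, the building block the abstract refers to; how the actual model fuses knowledge across network levels is not reproduced here:

```python
# Scaled dot-product self-attention over clip features.
import torch

def self_attention(x):                    # x: (batch, tokens, dim)
    d = x.size(-1)
    scores = x @ x.transpose(1, 2) / d ** 0.5
    weights = torch.softmax(scores, dim=-1)
    return weights @ x                    # attention-weighted feature mix

clip_feats = torch.randn(2, 16, 64)       # e.g. 16 spatiotemporal tokens per clip
fused = self_attention(clip_feats)
```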
30. Geometric Loss for Deep Multiple Sclerosis lesion Segmentation [PDF] 返回目录
Hang Zhang, Jinwei Zhang, Rongguang Wang, Qihao Zhang, Susan A. Gauthier, Pascal Spincemaille, Thanh D. Nguyen, Yi Wang
Abstract: Multiple sclerosis (MS) lesions occupy a small fraction of the brain volume and are heterogeneous with regard to shape, size, and location, which poses a great challenge for training deep learning based segmentation models. We propose a new geometric loss formula to address the data imbalance and exploit the geometric properties of MS lesions. We show that traditional region-based and boundary-aware loss functions can be associated with the formula. We further develop and instantiate two loss functions containing first- and second-order geometric information of lesion regions to enforce regularization when optimizing deep segmentation models. Experimental results on two MS lesion datasets with different scales, acquisition protocols, and resolutions demonstrate the superiority of our proposed methods compared to other state-of-the-art methods.
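An illustrative sketch of what first-order geometric information of a lesion region can look like, a soft centroid of the predicted mask penalized toward the ground-truth centroid; this is an assumed rendering, not the paper's loss formula:

```python
# Soft-centroid penalty as an example of first-order geometric information.
import torch

def centroid(mask):                            # mask: (H, W), values in [0, 1]
    h, w = mask.shape
    ys = torch.arange(h, dtype=mask.dtype).view(-1, 1)
    xs = torch.arange(w, dtype=mask.dtype).view(1, -1)
    total = mask.sum() + 1e-8
    return torch.stack([(mask * ys).sum() / total, (mask * xs).sum() / total])

pred = torch.rand(64, 64)                      # predicted probability map
gt = (torch.rand(64, 64) > 0.95).float()       # toy ground-truth lesion mask
geo_loss = ((centroid(pred) - centroid(gt)) ** 2).sum()
```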
31. SwiftFace: Real-Time Face Detection [PDF] 返回目录
Leonardo Ramos, Bernardo Morales
Abstract: Computer vision is a field of artificial intelligence that trains computers to interpret the visual world in a way similar to that of humans. Due to the rapid advancements in technology and the increasing availability of sufficiently large training datasets, topics within computer vision have experienced steep growth in the last decade. Among them, one of the most promising fields is face detection. It is used daily in a wide variety of fields, from mobile apps and augmented reality for entertainment purposes to social studies and security cameras, so designing high-performance models for face detection is crucial. On top of that, with the aforementioned growth in face detection technologies, precision and accuracy are no longer the only relevant factors: for real-time face detection, speed of detection is essential. SwiftFace is a novel deep learning model created solely to be a fast face detection model. By focusing only on detecting faces, SwiftFace performs 30% faster than current state-of-the-art face detection models. Code available at this https URL
32. MetaMix: Improved Meta-Learning with Interpolation-based Consistency Regularization [PDF] 返回目录
Yangbin Chen, Yun Ma, Tom Ko, Jianping Wang, Qing Li
Abstract: Model-Agnostic Meta-Learning (MAML) and its variants are popular few-shot classification methods. They train an initializer across a variety of sampled learning tasks (also known as episodes) such that the initialized model can adapt quickly to new tasks. However, current MAML-based algorithms have limitations in forming generalizable decision boundaries. In this paper, we propose an approach called MetaMix. It generates virtual feature-target pairs within each episode to regularize the backbone models. MetaMix can be integrated with any of the MAML-based algorithms and learn the decision boundaries generalizing better to new tasks. Experiments on the mini-ImageNet, CUB, and FC100 datasets show that MetaMix improves the performance of MAML-based algorithms and achieves state-of-the-art result when integrated with Meta-Transfer Learning.
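A small sketch of the interpolation-based regularization at the heart of such methods: convex combinations of feature-target pairs drawn within an episode. Mixing raw inputs, as below, is a simplification of wherever MetaMix actually mixes:

```python
# Mixup-style interpolation of two episode samples.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    lam = np.random.beta(alpha, alpha)           # interpolation coefficient
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x_mix, y_mix = mixup(np.random.rand(32), np.array([1.0, 0.0]),
                     np.random.rand(32), np.array([0.0, 1.0]))
```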
33. Graph-based methods for analyzing orchard tree structure using noisy point cloud data [PDF] 返回目录
Fredrik Westling, Dr James Underwood, Dr Mitch Bryson
Abstract: Digitisation of fruit trees using LiDAR enables analysis which can be used to improve growing practices and thereby yield. Sophisticated analysis requires geometric and semantic understanding of the data, including the ability to discern individual trees as well as to identify leafy and structural matter. Extraction of this information should be rapid, as should data capture, so that entire orchards can be processed, but existing methods for classification and segmentation rely on high-quality data or additional data sources like cameras. We present a method for analysis of LiDAR data, specifically for individual tree location, segmentation, and matter classification, which can operate on low-quality data captured by handheld or mobile LiDAR. Results demonstrate viability both on real data for avocado and mango trees and on virtual data with independently controlled sensor noise and tree spacing.
34. A Flow Base Bi-path Network for Cross-scene Video Crowd Understanding in Aerial View [PDF] 返回目录
Zhiyuan Zhao, Tao Han, Junyu Gao, Qi Wang, Xuelong Li
Abstract: Drone footage can be applied to dynamic traffic monitoring, object detection and tracking, and other vision tasks. The variability of the shooting location adds some intractable challenges to these missions, such as varying scale, unstable exposure, and scene migration. In this paper, we strive to tackle the above challenges and automatically understand crowds from the visual data collected by drones. First, to alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed, which extracts optical flow and frame difference information as an additional branch. Besides, to improve the model's generalization ability across different scales and times, we randomly combine a variety of data transformation methods to simulate unseen environments. To tackle the crowd density estimation problem under extremely dark environments, we introduce synthetic data generated by the game Grand Theft Auto V (GTAV). Experimental results show the effectiveness of the virtual data. Our method wins the challenge with a mean absolute error (MAE) of 12.70. Moreover, a comprehensive ablation study is conducted to explore each component's contribution.
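A sketch of the auxiliary branch inputs using OpenCV, dense Farneback optical flow and an absolute frame difference between consecutive frames; the frames are random stand-ins and the Farneback parameters are typical values, not the paper's:

```python
# Optical flow + frame difference between consecutive grayscale frames.
import cv2
import numpy as np

prev = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # frame t-1
curr = np.random.randint(0, 256, (240, 320), dtype=np.uint8)   # frame t

flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) motion field
frame_diff = cv2.absdiff(curr, prev)                           # temporal change map
```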
35. A Comprehensive Review for MRF and CRF Approaches in Pathology Image Analysis [PDF] 返回目录
Chen Li, Yixin Li, Changhao Sun, Hao Chen, Hong Zhang
Abstract: Pathology image analysis is an essential procedure for clinical diagnosis of many diseases. To boost the accuracy and objectivity of detection, an increasing number of computer-aided diagnosis (CAD) systems have been proposed. Among these methods, random field models play an indispensable role in improving analysis performance. In this review, we present a comprehensive overview of pathology image analysis based on Markov random fields (MRFs) and conditional random fields (CRFs), which are two popular random field models. Firstly, we introduce the background of the two random fields and of pathology images. Secondly, we summarize the basic mathematical knowledge of MRFs and CRFs, from modelling to optimization. Then, a thorough review of the recent research on MRF and CRF based pathology image analysis is presented. Finally, we investigate the popular methodologies in the related works and discuss method migration within the CAD field.
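For readers new to these models, a toy sketch of the Potts-style MRF energy that underlies many of the reviewed approaches: unary label costs plus a pairwise penalty on disagreeing 4-connected neighbors. The arrays are random placeholders:

```python
# Potts-model MRF energy: unary costs + pairwise smoothness.
import numpy as np

def mrf_energy(labels, unary, beta=1.0):
    # unary[i, j, k]: cost of assigning label k at pixel (i, j)
    h, w = labels.shape
    e = unary[np.arange(h)[:, None], np.arange(w)[None, :], labels].sum()
    e += beta * (labels[1:, :] != labels[:-1, :]).sum()   # vertical neighbors
    e += beta * (labels[:, 1:] != labels[:, :-1]).sum()   # horizontal neighbors
    return e

labels = np.random.randint(0, 2, (8, 8))
unary = np.random.rand(8, 8, 2)
print(mrf_energy(labels, unary))
```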
36. Deep discriminant analysis for task-dependent compact network search [PDF] 返回目录
Qing Tian, Tal Arbel, James J. Clark
Abstract: Most of today's popular deep architectures are hand-engineered for general purpose applications. However, this design procedure usually leads to massive redundant, useless, or even harmful features for specific tasks. Such unnecessarily high complexities render deep nets impractical for many real-world applications, especially those without powerful GPU support. In this paper, we attempt to derive task-dependent compact models from a deep discriminant analysis perspective. We propose an iterative and proactive approach for classification tasks which alternates between (1) a pushing step, with an objective to simultaneously maximize class separation, penalize co-variances, and push deep discriminants into alignment with a compact set of neurons, and (2) a pruning step, which discards less useful or even interfering neurons. Deconvolution is adopted to reverse 'unimportant' filters' effects and recover useful contributing sources. A simple network growing strategy based on the basic Inception module is proposed for challenging tasks requiring larger capacity than what the base net can offer. Experiments on the MNIST, CIFAR10, and ImageNet datasets demonstrate our approach's efficacy. On ImageNet, by pushing and pruning our grown Inception-88 model, we achieve better-performing models than smaller deep Inception nets grown, residual nets, and famous compact nets at similar sizes. We also show that our grown deep Inception nets (without hard-coded dimension alignment) can beat residual nets of similar complexities.
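An illustrative sketch of a discriminant-analysis-style neuron ranking, a Fisher-like ratio of between-class to within-class variation per neuron, with low scores marking pruning candidates; this is a simplified stand-in for the paper's criterion:

```python
# Fisher-style discriminant score per neuron for pruning.
import numpy as np

def fisher_scores(acts, labels):
    # acts: (samples, neurons); higher score = more discriminative neuron
    scores = []
    for j in range(acts.shape[1]):
        a = acts[:, j]
        means = [a[labels == c].mean() for c in np.unique(labels)]
        within = sum(a[labels == c].var() for c in np.unique(labels)) + 1e-8
        scores.append(np.var(means) / within)
    return np.array(scores)

acts = np.random.rand(100, 32)            # toy activations, 32 neurons
labels = np.random.randint(0, 5, 100)
prune_order = np.argsort(fisher_scores(acts, labels))   # prune lowest first
```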
37. Learn like a Pathologist: Curriculum Learning by Annotator Agreement for Histopathology Image Classification [PDF] 返回目录
Jerry Wei, Arief Suriawinata, Bing Ren, Xiaoying Liu, Mikhail Lisovsky, Louis Vaickus, Charles Brown, Michael Baker, Mustafa Nasir-Moin, Naofumi Tomita, Lorenzo Torresani, Jason Wei, Saeed Hassanpour
Abstract: Applying curriculum learning requires both a range of difficulty in data and a method for determining the difficulty of examples. In many tasks, however, satisfying these requirements can be a formidable challenge. In this paper, we contend that histopathology image classification is a compelling use case for curriculum learning. Based on the nature of histopathology images, a range of difficulty inherently exists among examples, and, since medical datasets are often labeled by multiple annotators, annotator agreement can be used as a natural proxy for the difficulty of a given example. Hence, we propose a simple curriculum learning method that trains on progressively harder images as determined by annotator agreement. We evaluate our hypothesis on the challenging and clinically-important task of colorectal polyp classification. Whereas vanilla training achieves an AUC of 83.7% for this task, a model trained with our proposed curriculum learning approach achieves an AUC of 88.2%, an improvement of 4.5%. Our work aims to inspire researchers to think more creatively and rigorously when choosing contexts for applying curriculum learning.
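Because the curriculum is driven purely by annotator agreement, the scheduling logic is short. The sketch below is a hypothetical illustration (the agreement values and the four-stage split are placeholders): agreement is the fraction of annotators choosing the majority label, and training proceeds from high-agreement (easy) to low-agreement (hard) examples.

    import numpy as np

    def curriculum_order(agreement_scores):
        # Easiest (highest agreement) first, hardest (lowest agreement) last.
        return np.argsort(-np.asarray(agreement_scores))

    def curriculum_stages(indices, n_stages=4):
        # Split the easy-to-hard ordering into progressively harder stages.
        return np.array_split(indices, n_stages)

    agreement = [1.0, 0.75, 1.0, 0.5, 0.75, 1.0, 0.5, 0.25]   # 8 toy examples
    for stage, idx in enumerate(curriculum_stages(curriculum_order(agreement))):
        # A real pipeline would extend the training pool with images[idx]
        # and continue fitting the classifier at each stage.
        print(f"stage {stage}: add examples {idx.tolist()}")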
38. Detecting soccer balls with reduced neural networks: a comparison of multiple architectures under constrained hardware scenarios [PDF] 返回目录
Douglas De Rizzo Meneghetti, Thiago Pedro Donadon Homem, Jonas Henrique Renolfi de Oliveira, Isaac Jesus da Silva, Danilo Hernani Perico, Reinaldo Augusto da Costa Bianchi
Abstract: Object detection techniques that achieve state-of-the-art detection accuracy employ convolutional neural networks, implemented to have optimal performance in graphics processing units. Some hardware systems, such as mobile robots, operate under constrained hardware situations, but still benefit from object detection capabilities. Multiple network models have been proposed, achieving comparable accuracy with reduced architectures and leaner operations. Motivated by the need to create an object detection system for a soccer team of mobile robots, this work provides a comparative study of recent proposals of neural networks targeted towards constrained hardware environments, in the specific task of soccer ball detection. We train multiple open implementations of MobileNetV2 and MobileNetV3 models with different underlying architectures, as well as YOLOv3, TinyYOLOv3, YOLOv4 and TinyYOLOv4, on an annotated image data set captured using a mobile robot. We then report their mean average precision on a test data set and their inference times in videos of different resolutions, under constrained and unconstrained hardware configurations. Results show that MobileNetV3 models have a good trade-off between mAP and inference time in constrained scenarios only, while MobileNetV2 models with high width multipliers are appropriate for server-side inference. YOLO models in their official implementations are not suitable for inference on CPUs.
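For the latency half of such a study, a measurement loop like the sketch below is typical; the model here is a stand-in lambda and the dummy frames replace real video, so the numbers are meaningless outside of illustrating the protocol (warm-up runs excluded, mean per-frame time reported).

    import time
    import numpy as np

    def mean_latency(model, frames, warmup=5):
        for f in frames[:warmup]:
            model(f)                            # warm-up, excluded from timing
        t0 = time.perf_counter()
        for f in frames[warmup:]:
            model(f)
        return (time.perf_counter() - t0) / max(1, len(frames) - warmup)

    frames = [np.zeros((480, 640, 3), dtype=np.uint8) for _ in range(25)]
    ms = mean_latency(lambda f: f.mean(), frames) * 1e3
    print(f"{ms:.2f} ms/frame")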
39. VIVO: Surpassing Human Performance in Novel Object Captioning with Visual Vocabulary Pre-Training [PDF] 返回目录
Xiaowei Hu, Xi Yin, Kevin Lin, Lijuan Wang, Lei Zhang, Jianfeng Gao, Zicheng Liu
Abstract: It is highly desirable yet challenging to generate image captions that can describe novel objects which are unseen in caption-labeled training data, a capability that is evaluated in the novel object captioning challenge (nocaps). In this challenge, no additional image-caption training data, other than COCO Captions, is allowed for model training. Thus, conventional Vision-Language Pre-training (VLP) methods cannot be applied. This paper presents VIsual VOcabulary pre-training (VIVO) that performs pre-training in the absence of caption annotations. By breaking the dependency of paired image-caption training data in VLP, VIVO can leverage large amounts of paired image-tag data to learn a visual vocabulary. This is done by pre-training a multi-layer Transformer model that learns to align image-level tags with their corresponding image region features. To address the unordered nature of image tags, VIVO uses a Hungarian matching loss with masked tag prediction to conduct pre-training. We validate the effectiveness of VIVO by fine-tuning the pre-trained model for image captioning. In addition, we perform an analysis of the visual-text alignment inferred by our model. The results show that our model can not only generate fluent image captions that describe novel objects, but also identify the locations of these objects. Our single model has achieved new state-of-the-art results on nocaps and surpassed the human CIDEr score.
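A small sketch of the Hungarian-matching idea for unordered tags, assuming a bipartite cost built from plain tag scores; the actual loss operates on Transformer outputs together with masked tag prediction, so the shapes and cost definition here are simplified stand-ins.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def hungarian_tag_loss(pred_scores, gt_onehot):
        # pred_scores: (n_queries, n_tags) logits; gt_onehot: (n_gt, n_tags).
        cost = -pred_scores @ gt_onehot.T         # score of each gt tag per query
        rows, cols = linear_sum_assignment(cost)  # optimal one-to-one matching
        return cost[rows, cols].mean(), list(zip(rows, cols))

    rng = np.random.default_rng(0)
    pred = rng.normal(size=(5, 10))   # 5 predictions over a 10-tag vocabulary
    gt = np.eye(10)[[2, 7, 9]]        # 3 unordered ground-truth tags
    loss, matching = hungarian_tag_loss(pred, gt)
    print(loss, matching)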
40. Cross-Task Representation Learning for Anatomical Landmark Detection [PDF] 返回目录
Zeyu Fu, Jianbo Jiao, Michael Suttie, J. Alison Noble
Abstract: Recently, there has been an increasing demand for automatically detecting anatomical landmarks which provide rich structural information to facilitate subsequent medical image analysis. Current methods related to this task often leverage the power of deep neural networks, while a major challenge in fine-tuning such models in medical applications arises from the insufficient number of labeled samples. To address this, we propose to regularize the knowledge transfer across source and target tasks through cross-task representation learning. The proposed method is demonstrated for extracting facial anatomical landmarks which facilitate the diagnosis of fetal alcohol syndrome. The source and target tasks in this work are face recognition and landmark detection, respectively. The main idea of the proposed method is to retain the feature representations of the source model on the target task data, and to leverage them as an additional source of supervisory signals for regularizing the target model learning, thereby improving its performance under limited training samples. Concretely, we present two approaches for the proposed representation learning by constraining either final or intermediate model features on the target model. Experimental results on a clinical face image dataset demonstrate that the proposed approach works well with little labeled data, and outperforms the other compared approaches.
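A minimal sketch of the feature-retention idea: keep a frozen copy of the source (face recognition) backbone and penalize the distance between its features and the target (landmark) backbone's features on target-task images. The tiny backbones, the MSE penalty, and the 0.1 weight are illustrative assumptions, not the paper's exact formulation.

    import torch
    import torch.nn as nn

    backbone_src = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # frozen source model
    backbone_tgt = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())  # trainable target model
    backbone_tgt.load_state_dict(backbone_src.state_dict())                 # initialize from source
    for p in backbone_src.parameters():
        p.requires_grad = False

    def regularized_loss(x, task_loss, weight=0.1):
        # Source features act as an extra supervisory signal on target data.
        return task_loss + weight * nn.functional.mse_loss(backbone_tgt(x), backbone_src(x))

    x = torch.randn(2, 3, 32, 32)
    print(regularized_loss(x, task_loss=torch.tensor(0.0)))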
41. Fully Automated Left Atrium Segmentation from Anatomical Cine Long-axis MRI Sequences using Deep Convolutional Neural Network with Unscented Kalman Filter [PDF] 返回目录
Xiaoran Zhang, Michelle Noga, David Glynn Martin, Kumaradevan Punithakumar
Abstract: This study proposes a fully automated approach for left atrial segmentation from routine cine long-axis cardiac magnetic resonance image sequences using deep convolutional neural networks and Bayesian filtering. The proposed approach consists of a classification network that automatically detects the type of long-axis sequence, and three different convolutional neural network models followed by unscented Kalman filtering (UKF) that delineate the left atrium. Instead of training and predicting all long-axis sequence types together, the proposed approach first identifies the image sequence type as one of the 2-, 3- or 4-chamber views, and then performs prediction based on neural nets trained for that particular sequence type. The datasets were acquired retrospectively, and ground truth manual segmentation was provided by an expert radiologist. In addition to neural net based classification and segmentation, another neural net is trained and utilized to select image sequences for further processing using UKF to impose temporal consistency over the cardiac cycle. A cyclic dynamic model with time-varying angular frequency is introduced in the UKF to characterize the variations in cardiac motion during image scanning. The proposed approach was trained and evaluated separately with varying amounts of training data, with images acquired from 20, 40, 60 and 80 patients. Evaluations over 1515 images, with an equal number of images from each chamber group acquired from an additional 20 patients, demonstrated that the proposed model outperformed the state-of-the-art and yielded mean Dice coefficient values of 94.1%, 93.7% and 90.1% for 2-, 3- and 4-chamber sequences, respectively, when trained with datasets from 80 patients.
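A toy sketch of the cyclic dynamic model alone: a scalar contour coordinate following simple harmonic dynamics with a time-varying angular frequency omega(t). In the full method this transition function would drive the UKF that smooths per-frame CNN segmentations; the concrete omega schedule and time step below are made up.

    import numpy as np

    def cyclic_transition(state, omega, dt):
        # state = [x, x_dot]; harmonic dynamics x_ddot = -omega^2 * x (Euler step).
        x, v = state
        return np.array([x + v * dt, v - (omega ** 2) * x * dt])

    state = np.array([1.0, 0.0])
    for k in range(5):
        omega = 2.0 + 0.1 * np.sin(k)   # hypothetical time-varying frequency
        state = cyclic_transition(state, omega, dt=0.04)
        print(state)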
42. A Ranking-based, Balanced Loss Function Unifying Classification and Localisation in Object Detection [PDF] 返回目录
Kemal Oksuz, Baris Can Cam, Emre Akbas, Sinan Kalkan
Abstract: We propose average Localization-Recall-Precision (aLRP), a unified, bounded, balanced and ranking-based loss function for both classification and localisation tasks in object detection. aLRP extends the Localization-Recall-Precision (LRP) performance metric (Oksuz et al., 2018), inspired by how Average Precision (AP) Loss extends precision to a ranking-based loss function for classification (Chen et al., 2020). aLRP has the following distinct advantages: (i) aLRP is the first ranking-based loss function for both classification and localisation tasks. (ii) Thanks to using ranking for both tasks, aLRP naturally enforces high-quality localisation for high-precision classification. (iii) aLRP provides provable balance between positives and negatives. (iv) Compared to the $\sim 6$ hyperparameters used on average in the loss functions of state-of-the-art detectors, aLRP has only one hyperparameter, which we did not tune in practice. On the COCO dataset, aLRP improves its ranking-based predecessor, AP Loss, by more than $4$ AP points and outperforms all one-stage detectors. The code is available at: this https URL.
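For intuition only, the snippet below computes a non-differentiable LRP-flavored ranking error over a batch of detection scores: for each positive, one minus the precision at its rank, averaged over positives. The actual aLRP additionally folds localisation quality into the error term and uses an error-driven gradient, so treat this purely as an illustration of ranking-based losses.

    import numpy as np

    def ranking_error(scores, labels):
        order = np.argsort(-scores)              # sort detections by confidence
        ranked = labels[order]
        errors = []
        for rank, is_pos in enumerate(ranked, start=1):
            if is_pos:
                precision = ranked[:rank].sum() / rank
                errors.append(1.0 - precision)   # negatives ranked above hurt
        return float(np.mean(errors))

    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3])
    labels = np.array([1, 0, 1, 0, 1])           # 1 = true detection
    print(ranking_error(scores, labels))         # ~0.244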
43. Deep Evolution for Facial Emotion Recognition [PDF] 返回目录
Emmanuel Dufourq, Bruce A. Bassett
Abstract: Deep facial expression recognition faces two challenges that both stem from the large number of trainable parameters: long training times and a lack of interpretability. We propose a novel method based on evolutionary algorithms that deals with both challenges by massively reducing the number of trainable parameters, whilst simultaneously retaining classification performance, and in some cases achieving superior performance. We are robustly able to reduce the number of parameters on average by 95% (e.g. from 2M to 100k parameters) with no loss in classification accuracy. The algorithm learns to choose small patches from the image, relative to the nose, which carry the most important information about emotion, and which coincide with typical human choices of important features. Our work implements a novel form of attention and shows that evolutionary algorithms are a valuable addition to machine learning in the deep learning era, both for reducing the number of parameters for facial expression recognition and for providing interpretable features that can help reduce bias.
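A hedged sketch of the evolutionary search: a genome encodes a handful of patch locations, and fitness would be the validation accuracy of a classifier fed only those patches. The fitness below is a deliberately fake stand-in (it just prefers central patches) so the loop runs end to end; everything else (population size, mutation, patch count) is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    N_PATCHES, IMG, PATCH = 4, 48, 8      # 4 patches of 8x8 on a 48x48 crop

    def random_genome():
        return rng.integers(0, IMG - PATCH, size=(N_PATCHES, 2))  # top-left corners

    def fitness(genome):
        return -np.abs(genome - IMG // 2).sum()   # placeholder for val accuracy

    def evolve(pop_size=20, generations=30):
        pop = [random_genome() for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            parents = pop[: pop_size // 2]                    # keep the fittest half
            children = [np.clip(p + rng.integers(-2, 3, size=p.shape), 0, IMG - PATCH)
                        for p in parents]                     # mutated offspring
            pop = parents + children
        return max(pop, key=fitness)

    print(evolve())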
44. Learning to Compress Videos without Computing Motion [PDF] 返回目录
Meixu Chen, Todd Goodall, Anjul Patney, Alan C. Bovik
Abstract: With the development of higher-resolution content and displays, their significant volume poses significant challenges to the goals of acquiring, transmitting, compressing and displaying high quality video content. In this paper, we propose a new deep learning video compression architecture that does not require motion estimation, which is the most expensive element of modern hybrid video compression codecs like H.264 and HEVC. Our framework exploits the regularities inherent to video motion, which we capture by using displaced frame differences as video representations to train the neural network. In addition, we propose a new space-time reconstruction network based on both an LSTM model and a UNet model, which we call LSTM-UNet. The combined network is able to efficiently capture both temporal and spatial video information, making it highly amenable for our purposes. The new video compression framework has three components: a Displacement Calculation Unit (DCU), a Displacement Compression Network (DCN), and a Frame Reconstruction Network (FRN), all of which are jointly optimized against a single perceptual loss function. The DCU obviates the need for motion estimation as in hybrid codecs, and is less expensive. In the DCN, an RNN-based network is utilized to conduct variable bit-rate encoding based on a single round of training. The LSTM-UNet is used in the FRN to learn space-time differential representations of videos. Our experimental results show that our compression model, which we call the MOtionless VIdeo Codec (MOVI-Codec), learns how to efficiently compress videos without computing motion. Our experiments show that MOVI-Codec outperforms the video coding standard H.264, and is highly competitive with, and sometimes exceeds, the performance of the modern global standard HEVC codec, as measured by MS-SSIM.
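The displaced-frame-difference representation itself is easy to sketch: subtract spatially shifted copies of the previous frame from the current frame so the network sees which small displacement best explains the motion, with no motion estimation. The shift set below is an illustrative guess, not the paper's configuration.

    import numpy as np

    def displaced_differences(prev, cur,
                              shifts=((0, 0), (0, 1), (0, -1), (1, 0), (-1, 0))):
        # Stack cur - shift(prev) over a small set of displacements.
        return np.stack([cur - np.roll(prev, s, axis=(0, 1)) for s in shifts])

    prev = np.zeros((4, 4)); prev[1, 1] = 1.0
    cur = np.zeros((4, 4)); cur[1, 2] = 1.0     # the bright pixel moved right
    reps = displaced_differences(prev, cur)
    print(reps.shape)        # (5, 4, 4); the (0, 1)-shift channel is all zeros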
45. Deep Image Reconstruction using Unregistered Measurements without Groundtruth [PDF] 返回目录
Weijie Gan, Yu Sun, Cihat Eldeniz, Jiaming Liu, Hongyu An, Ulugbek S. Kamilov
Abstract: One of the key limitations in conventional deep learning based image reconstruction is the need for registered pairs of training images containing a set of high-quality groundtruth images. This paper addresses this limitation by proposing a novel unsupervised deep registration-augmented reconstruction method (U-Dream) for training deep neural nets to reconstruct high-quality images by directly mapping pairs of unregistered and artifact-corrupted images. The ability of U-Dream to circumvent the need for accurately registered data makes it widely applicable to many biomedical image reconstruction tasks. We validate it in accelerated magnetic resonance imaging (MRI) by training an image reconstruction model directly on pairs of undersampled measurements from images that have undergone nonrigid deformations.
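One plausible reading of registration-augmented training, sketched with a crude differentiable warp: the reconstruction network's output is warped by jointly estimated registration parameters before being compared to the unregistered target, so no pre-registered pairs are needed. The single translation parameter, the toy network, and the loss below are all illustrative; the paper handles nonrigid deformations.

    import torch
    import torch.nn.functional as F

    def shift_x(img, dx):
        # Differentiable horizontal shift via grid_sample; img: (B, 1, H, W).
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                                torch.linspace(-1, 1, w), indexing="ij")
        grid = torch.stack((xs + dx, ys), dim=-1).expand(b, h, w, 2)
        return F.grid_sample(img, grid, align_corners=True)

    recon_net = torch.nn.Conv2d(1, 1, 3, padding=1)   # toy reconstruction net
    x = torch.randn(1, 1, 32, 32)                     # artifact-corrupted input
    y = torch.randn(1, 1, 32, 32)                     # unregistered target image
    dx = torch.zeros(1, requires_grad=True)           # jointly estimated registration
    loss = F.mse_loss(shift_x(recon_net(x), dx), y)
    loss.backward()                                   # trains net and warp together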
46. Tackling unsupervised multi-source domain adaptation with optimism and consistency [PDF] 返回目录
Diogo Pernes, Jaime S. Cardoso
Abstract: It has been known for a while that the problem of multi-source domain adaptation can be regarded as a single source domain adaptation task where the source domain corresponds to a mixture of the original source domains. Nonetheless, how to adjust the mixture distribution weights remains an open question. Moreover, most existing work on this topic focuses only on minimizing the error on the source domains and achieving domain-invariant representations, which is insufficient to ensure low error on the target domain. In this work, we present a novel framework that addresses both problems and beats the current state of the art by using a mildly optimistic objective function and consistency regularization on the target samples.
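A hedged sketch of the two ingredients as this abstract describes them; the "optimistic" weighting below (softmax over negative per-domain losses, favoring domains the model currently fits) is only one plausible reading, and the temperature and KL-based consistency term are illustrative.

    import torch
    import torch.nn.functional as F

    def mixture_weights(per_domain_losses, temperature=1.0):
        # Optimistically up-weight source domains with low current loss.
        return F.softmax(-torch.stack(per_domain_losses) / temperature, dim=0)

    def consistency_loss(model, x_aug1, x_aug2):
        # Predictions on two augmentations of the same target sample should agree.
        log_p1 = F.log_softmax(model(x_aug1), dim=1)
        p2 = F.softmax(model(x_aug2), dim=1).detach()
        return F.kl_div(log_p1, p2, reduction="batchmean")

    model = torch.nn.Linear(16, 3)
    w = mixture_weights([torch.tensor(0.9), torch.tensor(0.4), torch.tensor(1.5)])
    c = consistency_loss(model, torch.randn(8, 16), torch.randn(8, 16))
    print(w, c)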
47. Machine Learning for Semi-Automated Meteorite Recovery [PDF] 返回目录
Seamus Anderson, Martin Towner, Phil Bland, Christopher Haikings, William Volante, Eleanor Sansom, Hadrien Devillepoix, Patrick Shober, Benjamin Hartig, Martin Cupak, Trent Jansen-Sturgeon, Robert Howie, Gretchen Benedix, Geoff Deacon
Abstract: We present a novel methodology for recovering meteorite falls observed and constrained by fireball networks, using drones and machine learning algorithms. This approach uses images of the local terrain for a given fall site to train an artificial neural network, designed to detect meteorite candidates. We have field-tested our methodology, showing a meteorite detection rate between 75% and 97%, while also providing an efficient mechanism to eliminate false positives. Our tests at a number of locations within Western Australia also showcase the ability of this training scheme to generalize a model to learn localized terrain features. Our model-training approach was also able to correctly identify 3 meteorites in their native fall sites that had been found using traditional searching techniques. Our methodology will be used to recover meteorite falls in a wide range of locations within globe-spanning fireball networks.
48. Loop-box: Multi-Agent Direct SLAM Triggered by Single Loop Closure for Large-Scale Mapping [PDF] 返回目录
M Usman Maqbool Bhutta, Manohar Kuse, Rui Fan, Yanan Liu, Ming Liu
Abstract: In this paper, we present a multi-agent framework for real-time large-scale 3D reconstruction applications. In SLAM, researchers usually build and update a 3D map after applying non-linear pose graph optimization techniques. Moreover, many multi-agent systems prevalently use odometry information from additional sensors. These methods generally involve intensive computer vision algorithms and are tightly coupled with various sensors. We develop a generic method for the key challenging scenarios in multi-agent 3D mapping based on different camera systems. The proposed framework performs actively in terms of localizing each agent after the first loop closure between them. It is shown that the proposed system only uses monocular cameras to yield real-time multi-agent large-scale localization and 3D global mapping. Based on the initial matching, our system can calculate the optimal scale difference between multiple 3D maps and then estimate an accurate relative pose transformation for large-scale global mapping.
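The scale-and-pose step at the end is a classical similarity alignment; the sketch below uses the standard Umeyama/Procrustes estimator over matched 3D points (the paper's exact estimator may differ).

    import numpy as np

    def align_similarity(P, Q):
        # Find s, R, t minimizing ||s * R @ P + t - Q|| over matched points (3, N).
        muP, muQ = P.mean(1, keepdims=True), Q.mean(1, keepdims=True)
        X, Y = P - muP, Q - muQ
        U, D, Vt = np.linalg.svd(Y @ X.T)
        S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # guard reflections
        R = U @ S @ Vt
        s = np.trace(np.diag(D) @ S) / (X ** 2).sum()
        t = muQ - s * R @ muP
        return s, R, t

    P = np.random.default_rng(0).normal(size=(3, 50))   # points in map A
    s, R, t = align_similarity(P, 2.0 * P + 1.0)        # map B: scaled + shifted
    print(round(s, 3))                                  # ~2.0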
49. Self-grouping Convolutional Neural Networks [PDF] 返回目录
Qingbei Guo, Xiao-Jun Wu, Josef Kittler, Zhiquan Feng
Abstract: Although group convolution operators are increasingly used in deep convolutional neural networks to improve the computational efficiency and to reduce the number of parameters, most existing methods construct their group convolution architectures by a predefined partitioning of the filters of each convolutional layer into multiple regular filter groups with an equal spatial group size and data-independence, which prevents a full exploitation of their potential. To tackle this issue, we propose a novel method of designing self-grouping convolutional neural networks, called SG-CNN, in which the filters of each convolutional layer group themselves based on the similarity of their importance vectors. Concretely, for each filter, we first evaluate the importance value of its input channels to identify the importance vectors, and then group these vectors by clustering. Using the resulting data-dependent centroids, we prune the less important connections, which implicitly minimizes the accuracy loss of the pruning, thus yielding a set of diverse group convolution filters. Subsequently, we develop two fine-tuning schemes, i.e. (1) both local and global fine-tuning and (2) global-only fine-tuning, which experimentally deliver comparable results, to recover the recognition capacity of the pruned network. Comprehensive experiments carried out on the CIFAR-10/100 and ImageNet datasets demonstrate that our self-grouping convolution method adapts to various state-of-the-art CNN architectures, such as ResNet and DenseNet, and delivers superior performance in terms of compression ratio, speedup and recognition accuracy. We demonstrate the ability of SG-CNN to generalise by transfer learning, including domain adaptation and object detection, showing competitive results. Our source code is available at this https URL.
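A compact sketch of the self-grouping step under common assumptions: each filter is summarized by the importance of its input channels (L1 weight mass is one standard proxy), k-means over these importance vectors yields data-dependent groups, and low-importance connections within each group become pruning candidates. All sizes and the median threshold are illustrative.

    import numpy as np
    from sklearn.cluster import KMeans

    def group_filters(weight, n_groups=4):
        # weight: conv tensor (out_ch, in_ch, k, k) -> one group id per filter.
        importance = np.abs(weight).sum(axis=(2, 3))          # (out_ch, in_ch)
        km = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
        return km.fit_predict(importance), km.cluster_centers_

    w = np.random.default_rng(0).normal(size=(32, 16, 3, 3))
    groups, centroids = group_filters(w)
    prune_mask = centroids < np.median(centroids)             # weak connections
    print(groups.shape, int(prune_mask.sum()), "connections flagged for pruning")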
50. Learned Fine-Tuner for Incongruous Few-Shot Learning [PDF] 返回目录
Pu Zhao, Sijia Liu, Parikshit Ram, Songtao Lu, Djallel Bouneffouf, Xue Lin
Abstract: Model-agnostic meta-learning (MAML) effectively meta-learns an initialization of model parameters for few-shot learning where all learning problems share the same format of model parameters -- congruous meta-learning. We extend MAML to incongruous meta-learning where different yet related few-shot learning problems may not share any model parameters. In this setup, we propose the use of a Learned Fine Tuner (LFT) to replace hand-designed optimizers (such as SGD) for the task-specific fine-tuning. The meta-learned initialization in MAML is replaced by learned optimizers based on the learning-to-optimize (L2O) framework to meta-learn across incongruous tasks such that models fine-tuned with LFT (even from random initializations) adapt quickly to new tasks. The introduction of LFT within MAML (i) offers the capability to tackle few-shot learning tasks by meta-learning across incongruous yet related problems (e.g., classification over images of different sizes and model architectures), and (ii) can efficiently work with first-order and derivative-free few-shot learning problems. Theoretically, we quantify the difference between LFT (for MAML) and L2O. Empirically, we demonstrate the effectiveness of LFT through both synthetic and real problems and a novel application of generating universal adversarial attacks across different image sources in the few-shot learning regime.
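A minimal sketch of the learned fine-tuner idea: a tiny trainable network maps each gradient coordinate to an update, replacing the hand-designed SGD step of the inner loop; create_graph=True keeps the graph so the LFT itself can be meta-trained. The coordinate-wise MLP and its inputs are illustrative guesses at the architecture.

    import torch
    import torch.nn as nn

    class LFT(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

        def step(self, params, grads):
            # Coordinate-wise learned update: theta <- theta + f(grad).
            return [p + self.net(g.reshape(-1, 1)).reshape(p.shape)
                    for p, g in zip(params, grads)]

    lft = LFT()
    theta = [torch.randn(3, 3, requires_grad=True)]
    inner_loss = (theta[0] ** 2).sum()
    grads = torch.autograd.grad(inner_loss, theta, create_graph=True)
    theta = lft.step(theta, grads)      # one learned fine-tuning step
    print(theta[0].shape)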
51. Cranial Implant Design via Virtual Craniectomy with Shape Priors [PDF] 返回目录
Franco Matzkin, Virginia Newcombe, Ben Glocker, Enzo Ferrante
Abstract: Cranial implant design is a challenging task, whose accuracy is crucial in the context of cranioplasty procedures. This task is usually performed manually by experts using computer-assisted design software. In this work, we propose and evaluate alternative automatic deep learning models for cranial implant reconstruction from CT images. The models are trained and evaluated using the database released by the AutoImplant challenge, and compared to a baseline implemented by the organizers. We employ a simulated virtual craniectomy to train our models using complete skulls, and compare two different approaches trained with this procedure. The first one is a direct estimation method based on the UNet architecture. The second method incorporates shape priors to increase the robustness when dealing with out-of-distribution implant shapes. Our direct estimation method outperforms the baselines provided by the organizers, while the model with shape priors shows superior performance when dealing with out-of-distribution cases. Overall, our methods show promising results in the difficult task of cranial implant design.
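The virtual craniectomy itself is simple to illustrate: carve a defect out of a complete binary skull volume, yielding (defective skull, implant) training pairs. The spherical hole below is a toy; realistic defects follow surgical shapes.

    import numpy as np

    def virtual_craniectomy(skull, radius=8, seed=0):
        # skull: binary 3D volume. Returns (defective_skull, implant).
        rng = np.random.default_rng(seed)
        c = rng.integers(radius, np.array(skull.shape) - radius)
        zz, yy, xx = np.ogrid[:skull.shape[0], :skull.shape[1], :skull.shape[2]]
        hole = ((zz - c[0]) ** 2 + (yy - c[1]) ** 2 + (xx - c[2]) ** 2) <= radius ** 2
        return skull & ~hole, skull & hole   # implant = the removed bone

    skull = np.zeros((32, 32, 32), dtype=bool)
    skull[10:22, 10:22, 10:22] = True        # stand-in for a complete skull
    defective, implant = virtual_craniectomy(skull)
    print(int(defective.sum()), int(implant.sum()))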
52. Mathematical derivation for Vora-Value based filter design method: Gradient and Hessian [PDF] 返回目录
Yuteng Zhu, Graham D. Finlayson
Abstract: In this paper, we present the detailed mathematical derivation of the gradient and Hessian for Vora-Value based filter optimization. We give a full recapitulation of the steps involved in differentiating the objective function for maximizing the Vora-Value. This paper serves as supplementary material for our future paper on filter design using the Vora-Value as an optimization criterion.
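For reference, the quantity being differentiated can be evaluated numerically under the standard definition of the Vora-Value, nu(H, F) = trace(P_H P_F) / d with P_A = A (A^T A)^{-1} A^T the projector onto the column space of A; the random sensitivities below are placeholders for real spectral data.

    import numpy as np

    def projector(A):
        return A @ np.linalg.inv(A.T @ A) @ A.T

    def vora_value(H, F):
        # 1.0 iff the filter set F spans the same subspace as H.
        return np.trace(projector(H) @ projector(F)) / H.shape[1]

    rng = np.random.default_rng(0)
    H = rng.normal(size=(31, 3))   # e.g. visual sensitivities at 31 wavelengths
    F = rng.normal(size=(31, 3))   # a candidate 3-filter set
    print(vora_value(H, F))
    print(vora_value(H, H @ rng.normal(size=(3, 3))))   # same span -> 1.0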
53. Breaking the Memory Wall for AI Chip with a New Dimension [PDF] 返回目录
Eugene Tam, Shenfei Jiang, Paul Duan, Shawn Meng, Yue Pang, Cayden Huang, Yi Han, Jacke Xie, Yuanjun Cui, Jinsong Yu, Minggui Lu
Abstract: Recent advancements in deep learning have led to the widespread adoption of artificial intelligence (AI) in applications such as computer vision and natural language processing. As neural networks become deeper and larger, AI modeling demands outstrip the capabilities of conventional chip architectures. Memory bandwidth falls behind processing power. Energy consumption comes to dominate the total cost of ownership. Currently, memory capacity is insufficient to support the most advanced NLP models. In this work, we present a 3D AI chip, called Sunrise, with a near-memory computing architecture to address these three challenges. This distributed, near-memory computing architecture allows us to tear down the performance-limiting memory wall with an abundance of data bandwidth. We achieve the same level of energy efficiency on 40nm technology as competing chips on 7nm technology. By moving to similar technologies as other AI chips, we project to achieve more than ten times the energy efficiency and seven times the performance of the current state-of-the-art chips, as well as twenty times the memory capacity, compared with the best chip in each benchmark.
摘要:在深度学习的最新发展,导致广泛采用的应用,如计算机视觉和自然语言处理的人工智能(AI)的。由于神经网络变得更深和更大的,AI造型需求超过传统芯片架构的能力。内存带宽落后的处理能力。能耗来主宰了总拥有成本。目前,内存容量不足以支持目前最先进的NLP模型。在这项工作中,我们提出了一个3D AI芯片,称为日出,近内存计算架构来解决这些三大挑战。这种分布式,近内存计算的架构让我们推倒限制性能的内存墙拥有丰富的数据带宽。我们实现了能源效率对40nm工艺相同的水平上7纳米技术的竞争筹码。通过迁移到类似的技术,其他AI芯片,我们项目达到十余倍的能源效率,当前国家的最先进的芯片七倍的性能和内存容量的二十倍最好的芯片相比在每个基准。
54. MPG-Net: Multi-Prediction Guided Network for Segmentation of Retinal Layers in OCT Images [PDF] 返回目录
Zeyu Fu, Yang Sun, Xiangyu Zhang, Scott Stainton, Shaun Barney, Jeffry Hogg, William Innes, Satnam Dlay
Abstract: Optical coherence tomography (OCT) is a commonly used method for extracting high-resolution retinal information, and there is an increasing demand for automated retinal layer segmentation to facilitate retinal disease diagnosis. In this paper, we propose a novel multi-prediction guided attention network (MPG-Net) for automated retinal layer segmentation in OCT images. The proposed method consists of two major steps that strengthen the discriminative power of a U-shaped fully convolutional network (FCN) for reliable automated segmentation. First, a feature refinement module that adaptively re-weights the feature channels is used in the encoder to capture more informative features and discard information from irrelevant regions. Second, we propose a multi-prediction guided attention mechanism that provides pixel-wise semantic prediction guidance to better recover the segmentation mask at each scale. This mechanism, which transforms deep supervision into supervised attention, is able to guide feature aggregation between intermediate layers with richer semantic information. Experiments on the publicly available Duke OCT dataset confirm the effectiveness of the proposed method and show improved performance over other state-of-the-art approaches.
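The adaptive channel re-weighting in the feature refinement module can be pictured as a squeeze-and-excitation-style gate. A minimal PyTorch sketch of that general pattern (the paper's actual module may differ in detail):

```python
import torch
import torch.nn as nn

class ChannelReweight(nn.Module):
    """SE-style gate: global pool -> bottleneck MLP -> sigmoid weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze to (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                             # channels re-weighted
```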
55. Multi-focus Image Fusion for Visual Sensor Networks [PDF] 返回目录
Milad Abdollahzadeh, Touba Malekzadeh, Hadi Seyedarabi
Abstract: Image fusion in visual sensor networks (VSNs) aims to combine information from multiple images of the same scene into a single image carrying more information. Image fusion methods based on the discrete cosine transform (DCT) are less complex and more time-efficient within DCT-based image and video coding standards, which makes them well suited to VSN applications. In this paper, an efficient algorithm for the fusion of multi-focus images in the DCT domain is proposed. The sum of modified Laplacian (SML) of corresponding blocks of the source images is used as a contrast criterion, and the blocks with the larger SML value are passed to the output image. Experimental results on several images show that the proposed algorithm improves both the subjective and objective quality of the fused image relative to other DCT-based techniques.
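The selection rule is simple enough to sketch. Below is a spatial-domain NumPy approximation; the paper operates on DCT blocks, and the 8x8 window size is an assumption:

```python
import numpy as np

def block_sml(img, b=8):
    """Sum of modified Laplacian, accumulated per b-by-b block."""
    p = np.pad(img.astype(np.float64), 1, mode="edge")
    ml = (np.abs(2 * p[1:-1, 1:-1] - p[:-2, 1:-1] - p[2:, 1:-1]) +
          np.abs(2 * p[1:-1, 1:-1] - p[1:-1, :-2] - p[1:-1, 2:]))
    h, w = (s - s % b for s in img.shape)
    return ml[:h, :w].reshape(h // b, b, w // b, b).sum(axis=(1, 3))

def fuse(a, b_img, b=8):
    """Per block, keep the source image with the larger SML."""
    pick_a = np.kron(block_sml(a, b) > block_sml(b_img, b),
                     np.ones((b, b), dtype=bool))
    h, w = pick_a.shape
    return np.where(pick_a, a[:h, :w], b_img[:h, :w])
```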
56. Fully Automatic Intervertebral Disc Segmentation Using Multimodal 3D U-Net [PDF] 返回目录
Chuanbo Wang, Ye Guo, Wei Chen, Zeyun Yu
Abstract: Intervertebral discs (IVDs), the small joints lying between adjacent vertebrae, play an important role in pressure buffering and tissue protection. Fully automatic localization and segmentation of IVDs have been discussed in the literature for many years, since they are crucial to spine disease diagnosis and provide quantitative parameters for treatment. Traditionally, hand-crafted features derived from image intensities and shape priors have been used to localize and segment IVDs. With the advance of deep learning, various neural network models have achieved great success in image analysis, including the recognition of intervertebral discs. In particular, U-Net stands out among these approaches due to its outstanding performance on biomedical images with relatively small training sets. This paper proposes a novel convolutional framework based on 3D U-Net to segment IVDs from multi-modality MRI images. We first localize the centers of the intervertebral discs in each spine sample and then train the network on small volumes cropped around the localized discs. A detailed, comprehensive analysis of results obtained with various combinations of modalities is presented. Furthermore, experiments conducted with 2D and 3D U-Nets on augmented and non-augmented datasets are compared in terms of Dice coefficient and Hausdorff distance. Our method proves effective, achieving a mean segmentation Dice coefficient of 89.0% with a standard deviation of 1.4%.
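Since the Dice coefficient is the reported metric, here is its standard definition for binary masks as a quick reference (the smoothing constant is an assumption to avoid division by zero):

```python
import numpy as np

def dice(pred, target, eps=1e-7):
    """Dice = 2|A intersect B| / (|A| + |B|) for binary masks."""
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# e.g. dice(model_output > 0.5, ground_truth)
```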