
[arXiv Papers] Computer Vision and Pattern Recognition 2021-01-05

Contents

1. Transformers in Vision: A Survey [PDF] Abstract
2. Where Do Deep Fakes Look? Synthetic Face Detection via Gaze Tracking [PDF] Abstract
3. High-resolution land cover change from low-resolution labels: Simple baselines for the 2021 IEEE GRSS Data Fusion Contest [PDF] Abstract
4. Stereo Correspondence and Reconstruction of Endoscopic Data Challenge [PDF] Abstract
5. Analysis of Filter Size Effect In Deep Learning [PDF] Abstract
6. Transformer for Image Quality Assessment [PDF] Abstract
7. Anomaly Recognition from surveillance videos using 3D Convolutional Neural Networks [PDF] Abstract
8. Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming [PDF] Abstract
9. Scene Text Detection for Augmented Reality -- Character Bigram Approach to reduce False Positive Rate [PDF] Abstract
10. VIS30K: A Collection of Figures and Tables from IEEE Visualization Conference Publications [PDF] Abstract
11. HyperMorph: Amortized Hyperparameter Learning for Image Registration [PDF] Abstract
12. Local Black-box Adversarial Attacks: A Query Efficient Approach [PDF] Abstract
13. Guiding GANs: How to control non-conditional pre-trained GANs for conditional image generation [PDF] Abstract
14. Fooling Object Detectors: Adversarial Attacks by Half-Neighbor Masks [PDF] Abstract
15. Classification and Segmentation of Pulmonary Lesions in CT images using a combined VGG-XGBoost method, and an integrated Fuzzy Clustering-Level Set technique [PDF] Abstract
16. Weakly-Supervised Saliency Detection via Salient Object Subitizing [PDF] Abstract
17. Global2Local: Efficient Structure Search for Video Action Segmentation [PDF] Abstract
18. Identifying centres of interest in paintings using alignment and edge detection: Case studies on works by Luc Tuymans [PDF] Abstract
19. Low Light Image Enhancement via Global and Local Context Modeling [PDF] Abstract
20. Temporal Contrastive Graph for Self-supervised Video Representation Learning [PDF] Abstract
21. Shed Various Lights on a Low-Light Image: Multi-Level Enhancement Guided by Arbitrary References [PDF] Abstract
22. A Framework for Fast Scalable BNN Inference using Googlenet and Transfer Learning [PDF] Abstract
23. WearMask: Fast In-browser Face Mask Detection with Serverless Edge Computing for COVID-19 [PDF] Abstract
24. Automatic Defect Detection of Print Fabric Using Convolutional Neural Network [PDF] Abstract
25. An Evolution of CNN Object Classifiers on Low-Resolution Images [PDF] Abstract
26. Fake Visual Content Detection Using Two-Stream Convolutional Neural Networks [PDF] Abstract
27. Weakly Supervised Multi-Object Tracking and Segmentation [PDF] Abstract
28. Depth as Attention for Face Representation Learning [PDF] Abstract
29. News Image Steganography: A Novel Architecture Facilitates the Fake News Identification [PDF] Abstract
30. Privacy-sensitive Objects Pixelation for Live Video Streaming [PDF] Abstract
31. A Switched View of Retinex: Deep Self-Regularized Low-Light Image Enhancement [PDF] Abstract
32. Consensus-Guided Correspondence Denoising [PDF] Abstract
33. Style Normalization and Restitution for Domain Generalization and Adaptation [PDF] Abstract
34. Few-shot Image Classification: Just Use a Library of Pre-trained Feature Extractors and a Simple Classifier [PDF] Abstract
35. Six-channel Image Representation for Cross-domain Object Detection [PDF] Abstract
36. A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization [PDF] Abstract
37. VinVL: Making Visual Representations Matter in Vision-Language Models [PDF] Abstract
38. One-shot Representational Learning for Joint Biometric and Device Authentication [PDF] Abstract
39. Privacy Preserving Domain Adaptation for Semantic Segmentation of Medical Images [PDF] Abstract
40. Learning Rotation-Invariant Representations of Point Clouds Using Aligned Edge Convolutional Neural Networks [PDF] Abstract
41. Uncertainty-sensitive Activity Recognition: a Reliability Benchmark and the CARING Models [PDF] Abstract
42. Refining activation downsampling with SoftPool [PDF] Abstract
43. On the confidence of stereo matching in a deep-learning era: a quantitative evaluation [PDF] Abstract
44. Image-based Textile Decoding [PDF] Abstract
45. Video Captioning in Compressed Video [PDF] Abstract
46. Multi-Image Steganography Using Deep Neural Networks [PDF] Abstract
47. An Artificial Intelligence System for Combined Fruit Detection and Georeferencing, Using RTK-Based Perspective Projection in Drone Imagery [PDF] Abstract
48. Biologically Inspired Hexagonal Deep Learning for Hexagonal Image Generation [PDF] Abstract
49. Subtype-aware Unsupervised Domain Adaptation for Medical Diagnosis [PDF] Abstract
50. Identity-aware Facial Expression Recognition in Compressed Video [PDF] Abstract
51. Energy-constrained Self-training for Unsupervised Domain Adaptation [PDF] Abstract
52. Iranis: A Large-scale Dataset of Farsi License Plate Characters [PDF] Abstract
53. VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [PDF] Abstract
54. Adaptive Deconvolution-based stereo matching Net for Local Stereo Matching [PDF] Abstract
55. Brain Tumor Detection and Classification based on Hybrid Ensemble Classifier [PDF] Abstract
56. Improved Neural Network based Plant Diseases Identification [PDF] Abstract
57. A Hybrid MLP-SVM Model for Classification using Spatial-Spectral Features on Hyper-Spectral Images [PDF] Abstract
58. More than just an auxiliary loss: Anti-spoofing Backbone Training via Adversarial Pseudo-depth Generation [PDF] Abstract
59. CIZSL++: Creativity Inspired Generative Zero-Shot Learning [PDF] Abstract
60. Generative Max-Mahalanobis Classifiers for Image Classification, Generation and More [PDF] Abstract
61. A Multi-modal Deep Learning Model for Video Thumbnail Selection [PDF] Abstract
62. FGF-GAN: A Lightweight Generative Adversarial Network for Pansharpening via Fast Guided Filter [PDF] Abstract
63. How does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches? [PDF] Abstract
64. Fast Ensemble Learning Using Adversarially-Generated Restricted Boltzmann Machines [PDF] Abstract
65. Towards Robust Data Hiding Against (JPEG) Compression: A Pseudo-Differentiable Deep Learning Approach [PDF] Abstract
66. Single-shot fringe projection profilometry based on Deep Learning and Computer Graphics [PDF] Abstract
67. Minimizing L1 over L2 norms on the gradient [PDF] Abstract
68. DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions [PDF] Abstract
69. Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey [PDF] Abstract
70. Generalized Latency Performance Estimation for Once-For-All Neural Architecture Search [PDF] Abstract
71. CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans [PDF] Abstract
72. Phase Transitions in Recovery of Structured Signals from Corrupted Measurements [PDF] Abstract
73. RegNet: Self-Regulated Network for Image Classification [PDF] Abstract
74. Towards Annotation-free Instance Segmentation and Tracking with Adversarial Simulations [PDF] Abstract
75. RV-GAN: Retinal Vessel Segmentation from Fundus Images using Multi-scale Generative Adversarial Networks [PDF] Abstract
76. Multi-stage Deep Layer Aggregation for Brain Tumor Segmentation [PDF] Abstract
77. Combining unsupervised and supervised learning for predicting the final stroke lesion [PDF] Abstract
78. Semantics for Robotic Mapping, Perception and Interaction: A Survey [PDF] Abstract
79. CryoNuSeg: A Dataset for Nuclei Instance Segmentation of Cryosectioned H&E-Stained Histological Images [PDF] Abstract
80. Non-line-of-Sight Imaging via Neural Transient Fields [PDF] Abstract
81. Quaternion higher-order singular value decomposition and its applications in color image processing [PDF] Abstract
82. Neural Architecture Search via Combinatorial Multi-Armed Bandit [PDF] Abstract
83. Interval Type-2 Enhanced Possibilistic Fuzzy C-Means Clustering for Gene Expression Data Analysis [PDF] Abstract
84. The Bayesian Method of Tensor Networks [PDF] Abstract
85. Cutting-edge 3D Medical Image Segmentation Methods in 2020: Are Happy Families All Alike? [PDF] Abstract
86. B-SMALL: A Bayesian Neural Network approach to Sparse Model-Agnostic Meta-Learning [PDF] Abstract
87. Multi-Grid Back-Projection Networks [PDF] Abstract
88. Explainability Matters: Backdoor Attacks on Medical Imaging [PDF] Abstract
89. Binary Graph Neural Networks [PDF] Abstract

Abstracts

1. Transformers in Vision: A Survey [PDF] Back to Contents
  Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, Mubarak Shah
Abstract: Astounding results from transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. This has led to exciting progress on a number of tasks while requiring minimal inductive biases in the model design. This survey aims to provide a comprehensive overview of the transformer models in the computer vision discipline and assumes little to no prior background in the field. We start with an introduction to fundamental concepts behind the success of transformer models i.e., self-supervision and self-attention. Transformer architectures leverage self-attention mechanisms to encode long-range dependencies in the input domain which makes them highly expressive. Since they assume minimal prior knowledge about the structure of the problem, self-supervision using pretext tasks is applied to pre-train transformer models on large-scale (unlabelled) datasets. The learned representations are then fine-tuned on the downstream tasks, typically leading to excellent performance due to the generalization and expressivity of encoded features. We cover extensive applications of transformers in vision including popular recognition tasks (e.g., image classification, object detection, action recognition, and segmentation), generative modeling, multi-modal tasks (e.g., visual-question answering and visual reasoning), video processing (e.g., activity recognition, video forecasting), low-level vision (e.g., image super-resolution and colorization) and 3D analysis (e.g., point cloud classification and segmentation). We compare the respective advantages and limitations of popular techniques both in terms of architectural design and their experimental value. Finally, we provide an analysis on open research directions and possible future works.
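Since the survey's technical core is self-attention, a minimal sketch may help fix ideas. The following scaled dot-product self-attention in PyTorch is a generic illustration (the dimensions and single-head projection setup are assumptions, not taken from any surveyed model):

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, tokens, dim); w_*: (dim, dim) learned projection matrices
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)  # every token attends to every token
    return weights @ v                   # long-range dependencies in one step

d = 64
x = torch.randn(2, 16, d)  # 2 sequences of 16 tokens
out = self_attention(x, *(torch.randn(d, d) for _ in range(3)))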

2. Where Do Deep Fakes Look? Synthetic Face Detection via Gaze Tracking [PDF] Back to Contents
  Ilke Demir, Umur A. Ciftci
Abstract: Following the recent initiatives for the democratization of AI, deep fake generators have become increasingly popular and accessible, causing dystopian scenarios towards social erosion of trust. A particular domain, such as biological signals, attracted attention towards detection methods that are capable of exploiting authenticity signatures in real videos that are not yet faked by generative approaches. In this paper, we first propose several prominent eye and gaze features that deep fakes exhibit differently. Second, we compile those features into signatures and analyze and compare those of real and fake videos, formulating geometric, visual, metric, temporal, and spectral variations. Third, we generalize this formulation to deep fake detection problem by a deep neural network, to classify any video in the wild as fake or real. We evaluate our approach on several deep fake datasets, achieving 89.79% accuracy on FaceForensics++, 80.0% on Deep Fakes (in the wild), and 88.35% on CelebDF datasets. We conduct ablation studies involving different features, architectures, sequence durations, and post-processing artifacts. Our analysis concludes with 6.29% improved accuracy over complex network architectures without the proposed gaze signatures.

3. High-resolution land cover change from low-resolution labels: Simple baselines for the 2021 IEEE GRSS Data Fusion Contest [PDF] Back to Contents
  Nikolay Malkin, Caleb Robinson, Nebojsa Jojic
Abstract: We present simple algorithms for land cover change detection in the 2021 IEEE GRSS Data Fusion Contest. The task of the contest is to create high-resolution (1m / pixel) land cover change maps of a study area in Maryland, USA, given multi-resolution imagery and label data. We study several baseline models for this task and discuss directions for further research. See this https URL for the data and this https URL for an implementation of these baselines.

4. Stereo Correspondence and Reconstruction of Endoscopic Data Challenge [PDF] Back to Contents
  Max Allan, Jonathan Mcleod, Cong Cong Wang, Jean Claude Rosenthal, Ke Xue Fu, Trevor Zeffiro, Wenyao Xia, Zhu Zhanshi, Huoling Luo, Xiran Zhang, Xiaohong Li, Lalith Sharan, Tom Kurmann, Sebastian Schmid, Dimitris Psychogyios, Mahdi Azizian, Danail Stoyanov, Lena Maier-Hein, Stefanie Speidel
Abstract: The stereo correspondence and reconstruction of endoscopic data sub-challenge was organized during the Endovis challenge at MICCAI 2019 in Shenzhen, China. The task was to perform dense depth estimation using 7 training datasets and 2 test sets of structured light data captured using porcine cadavers. These were provided by a team at Intuitive Surgical. 10 teams participated in the challenge day. This paper contains 3 additional methods which were submitted after the challenge finished as well as a supplemental section from these teams on issues they found with the dataset.

5. Analysis of Filter Size Effect In Deep Learning [PDF] Back to Contents
  Yunus Camgözlü, Yakup Kutlu
Abstract: With the use of deep learning in many areas, how to improve this technology, or how to develop the structures used more effectively and in less time, is a question of interest to many people working in this field. Many studies on this subject aim to reduce the duration of the operation and the processing power required, in addition to obtaining the best result through changes to the variables, functions, and data in the models used. In this study, in a leaf classification task using the Mendeley dataset of leaf images with a fixed background, all other variables, such as the number of layers in the model, the number of iterations, and the pooling process, were kept constant, except for the filter dimensions of the convolution layers in the chosen model. Convolution layers with 3 different filter sizes were examined, arranged in 2 different structures (increasing and decreasing filter size) and applied to 3 different image sizes. The literature evaluates different uses of pooling layers, the effects of increasing or decreasing the number of layers, differences in the size of the data used, and the results of many functions used with different parameters. In the CNN-based leaf classification of the chosen dataset, the focus was on the effect of changing the filter size of the convolution layers, together with different filter combinations and different image sizes. The dataset and data augmentation methods were used to make the differences between filter sizes and image sizes more distinct. Using a fixed number of iterations, model, and dataset, the effect of different filter sizes was observed.
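To make the experimental variable concrete, the sketch below (a hypothetical reconstruction, not the study's code) builds the same small CNN with only the convolution filter size changed, which is the single factor the study varies:

import torch.nn as nn

def make_cnn(kernel_size, num_classes=10, image_size=64):
    # Identical depth, pooling, and head; only the filter size differs.
    pad = kernel_size // 2  # keep spatial size across each convolution
    return nn.Sequential(
        nn.Conv2d(3, 16, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size, padding=pad), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(32 * (image_size // 4) ** 2, num_classes),
    )

models = {k: make_cnn(k) for k in (3, 5, 7)}  # three filter sizes to compare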

6. Transformer for Image Quality Assessment [PDF] Back to Contents
  Junyong You, Jari Korhonen
Abstract: Transformer has become the new standard method in natural language processing (NLP), and it also attracts research interests in computer vision area. In this paper we investigate the application of Transformer in Image Quality (TRIQ) assessment. Following the original Transformer encoder employed in Vision Transformer (ViT), we propose an architecture of using a shallow Transformer encoder on the top of a feature map extracted by convolution neural networks (CNN). Adaptive positional embedding is employed in the Transformer encoder to handle images with arbitrary resolutions. Different settings of Transformer architectures have been investigated on publicly available image quality databases. We have found that the proposed TRIQ architecture achieves outstanding performance. The implementation of TRIQ is published on Github (this https URL).
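A minimal sketch of the described arrangement, a CNN feature map feeding a shallow Transformer encoder with a pooled regression head, might look as follows (the backbone, layer sizes, and the fixed learnable positional embedding are simplifying assumptions; the paper's adaptive positional embedding is what handles arbitrary resolutions):

import torch
import torch.nn as nn
from torchvision.models import resnet18

class QualityTransformer(nn.Module):
    def __init__(self, d_model=256, max_tokens=1024):
        super().__init__()
        backbone = resnet18()  # in practice initialized with pretrained weights
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # keep the feature map
        self.proj = nn.Conv2d(512, d_model, kernel_size=1)
        self.pos = nn.Parameter(torch.zeros(1, max_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # "shallow": two layers
        self.head = nn.Linear(d_model, 1)                          # scalar quality score

    def forward(self, img):
        f = self.proj(self.features(img))      # (B, d_model, H, W)
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W, d_model)
        tokens = tokens + self.pos[:, :tokens.shape[1]]
        return self.head(self.encoder(tokens).mean(dim=1))

score = QualityTransformer()(torch.rand(2, 3, 224, 224))  # (2, 1)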

7. Anomaly Recognition from surveillance videos using 3D Convolutional Neural Networks [PDF] Back to Contents
  R. Maqsood, U. I. Bajwa, G. Saleem, Rana H. Raza, M. W. Anwar
Abstract: Anomalous activity recognition deals with identifying the patterns and events that vary from the normal stream. In a surveillance paradigm, these events range from abuse to fighting and road accidents to snatching, etc. Due to the sparse occurrence of anomalous events, anomalous activity recognition from surveillance videos is a challenging research task. The approaches reported can be generally categorized as handcrafted and deep learning-based. Most of the reported studies address binary classification, i.e., anomaly detection from surveillance videos. But these reported approaches did not address other anomalous events, e.g., abuse, fights, road accidents, shooting, stealing, vandalism, and robbery, from surveillance videos. Therefore, this paper aims to provide an effective framework for the recognition of different real-world anomalies from videos. This study provides a simple, yet effective approach for learning spatiotemporal features using deep 3-dimensional convolutional networks (3D ConvNets) trained on the University of Central Florida (UCF) Crime video dataset. Firstly, the frame-level labels of the UCF Crime dataset are provided, and then, to extract anomalous spatiotemporal features more efficiently, a fine-tuned 3D ConvNets is proposed. The findings of the proposed study are twofold: 1) there exist specific, detectable, and quantifiable features in the UCF Crime video feed that associate with each other; 2) multiclass learning can improve the generalization competencies of the 3D ConvNets by effectively learning frame-level information of the dataset, and results can be further improved by applying spatial augmentation.
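The basic building block, a 3D convolution that mixes spatial and temporal information across frames, can be sketched generically (this is a toy Conv3d stack, not the authors' fine-tuned network; the 14 output classes assume UCF-Crime's 13 anomaly categories plus normal):

import torch
import torch.nn as nn

clip = torch.randn(2, 3, 16, 112, 112)  # (batch, channels, frames, height, width)

model = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, padding=1),  # 3x3x3 spatio-temporal filters
    nn.ReLU(),
    nn.MaxPool3d((1, 2, 2)),                     # pool space, keep time early on
    nn.Conv3d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(128, 14),
)
logits = model(clip)  # (2, 14)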

8. Personal Privacy Protection via Irrelevant Faces Tracking and Pixelation in Video Live Streaming [PDF] Back to Contents
  Jizhe Zhou, Chi-Man Pun
Abstract: To date, the privacy-protection intended pixelation tasks are still labor-intensive and yet to be studied. With the prevailing of video live streaming, establishing an online face pixelation mechanism during streaming is an urgency. In this paper, we develop a new method called Face Pixelation in Video Live Streaming (FPVLS) to generate automatic personal privacy filtering during unconstrained streaming activities. Simply applying multi-face trackers will encounter problems in target drifting, computing efficiency, and over-pixelation. Therefore, for fast and accurate pixelation of irrelevant people's faces, FPVLS is organized in a frame-to-video structure of two core stages. On individual frames, FPVLS utilizes image-based face detection and embedding networks to yield face vectors. In the raw trajectories generation stage, the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm leverages face vectors and positioned information to quickly associate the same person's faces across frames. Such frame-wise accumulated raw trajectories are likely to be intermittent and unreliable on video level. Hence, we further introduce the trajectory refinement stage that merges a proposal network with the two-sample test based on the Empirical Likelihood Ratio (ELR) statistic to refine the raw trajectories. A Gaussian filter is laid on the refined trajectories for final pixelation. On the video live streaming dataset we collected, FPVLS obtains satisfying accuracy, real-time efficiency, and contains the over-pixelation problems.

9. Scene Text Detection for Augmented Reality -- Character Bigram Approach to reduce False Positive Rate [PDF] Back to Contents
  Sagar Gubbi, Bharadwaj Amrutur
Abstract: Natural scene text detection is an important aspect of scene understanding and could be a useful tool in building engaging augmented reality applications. In this work, we address the problem of false positives in text spotting. We propose improving the performace of sliding window text spotters by looking for character pairs (bigrams) rather than single characters. An efficient convolutional neural network is designed and trained to detect bigrams. The proposed detector reduces false positive rate by 28.16% on the ICDAR 2015 dataset. We demonstrate that detecting bigrams is a computationally inexpensive way to improve sliding window text spotters.

10. VIS30K: A Collection of Figures and Tables from IEEE Visualization Conference Publications [PDF] Back to Contents
  Jian Chen, Meng Ling, Rui Li, Petra Isenberg, Tobias Isenberg, Michael Sedlmair, Torsten Möller, Robert S. Laramee, Han-Wei Shen, Katharina Wünsche, Qiru Wang
Abstract: We present the VIS30K dataset, a collection of 29,689 images that represents 30 years of figures and tables from each track of the IEEE Visualization conference series (Vis, SciVis, InfoVis, VAST). VIS30K's comprehensive coverage of the scientific literature in visualization not only reflects the progress of the field but also enables researchers to study the evolution of the state-of-the-art and to find relevant work based on graphical content. We describe the dataset and our semi-automatic collection process, which couples convolutional neural networks (CNN) with curation. Extracting figures and tables semi-automatically allows us to verify that no images are overlooked or extracted erroneously. To improve quality further, we engaged in a peer-search process for high-quality figures from early IEEE Visualization papers. With the resulting data, we also contribute VISImageNavigator (VIN, this http URL), a web-based tool that facilitates searching and exploring VIS30K by author names, paper keywords, title and abstract, and years.

11. HyperMorph: Amortized Hyperparameter Learning for Image Registration [PDF] Back to Contents
  Andrew Hoopes, Malte Hoffmann, Bruce Fischl, John Guttag, Adrian V. Dalca
Abstract: We present HyperMorph, a learning-based strategy for deformable image registration that removes the need to tune important registration hyperparameters during training. Classical registration methods solve an optimization problem to find a set of spatial correspondences between two images, while learning-based methods leverage a training dataset to learn a function that generates these correspondences. The quality of the results for both types of techniques depends greatly on the choice of hyperparameters. Unfortunately, hyperparameter tuning is time-consuming and typically involves training many separate models with various hyperparameter values, potentially leading to suboptimal results. To address this inefficiency, we introduce amortized hyperparameter learning for image registration, a novel strategy to learn the effects of hyperparameters on deformation fields. The proposed framework learns a hypernetwork that takes in an input hyperparameter and modulates a registration network to produce the optimal deformation field for that hyperparameter value. In effect, this strategy trains a single, rich model that enables rapid, fine-grained discovery of hyperparameter values from a continuous interval at test-time. We demonstrate that this approach can be used to optimize multiple hyperparameters considerably faster than existing search strategies, leading to a reduced computational and human burden and increased flexibility. We also show that this has several important benefits, including increased robustness to initialization and the ability to rapidly identify optimal hyperparameter values specific to a registration task, dataset, or even a single anatomical region - all without retraining the HyperMorph model. Our code is publicly available at this http URL.
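The central idea, a hypernetwork that maps a hyperparameter value to the weights of the main network, can be conveyed with a toy layer (a generic illustration, not the actual HyperMorph registration architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

class HyperConv(nn.Module):
    # A conv layer whose weights are produced by a tiny MLP from a hyperparameter.
    def __init__(self, in_ch=16, out_ch=16, k=3, hidden=64):
        super().__init__()
        self.shape = (out_ch, in_ch, k, k)
        n_weights = out_ch * in_ch * k * k
        self.hyper = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                   nn.Linear(hidden, n_weights))

    def forward(self, x, lam):
        w = self.hyper(lam.view(1, 1)).view(self.shape)  # lambda -> conv weights
        return F.conv2d(x, w, padding=1)

layer = HyperConv()
x = torch.randn(2, 16, 32, 32)
for lam in (0.1, 0.5, 0.9):  # at test time, sweep the hyperparameter continuously
    y = layer(x, torch.tensor(float(lam)))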

12. Local Black-box Adversarial Attacks: A Query Efficient Approach [PDF] Back to Contents
  Tao Xiang, Hangcheng Liu, Shangwei Guo, Tianwei Zhang, Xiaofeng Liao
Abstract: Adversarial attacks have threatened the application of deep neural networks in security-sensitive scenarios. Most existing black-box attacks fool the target model by interacting with it many times and producing global perturbations. However, global perturbations change the smooth and insignificant background, which not only makes the perturbation more easily be perceived but also increases the query overhead. In this paper, we propose a novel framework to perturb the discriminative areas of clean examples only within limited queries in black-box attacks. Our framework is constructed based on two types of transferability. The first one is the transferability of model interpretations. Based on this property, we identify the discriminative areas of a given clean example easily for local perturbations. The second is the transferability of adversarial examples. It helps us to produce a local pre-perturbation for improving query efficiency. After identifying the discriminative areas and pre-perturbing, we generate the final adversarial examples from the pre-perturbed example by querying the targeted model with two kinds of black-box attack techniques, i.e., gradient estimation and random search. We conduct extensive experiments to show that our framework can significantly improve the query efficiency during black-box perturbing with a high attack success rate. Experimental results show that our attacks outperform state-of-the-art black-box attacks under various system settings.
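Of the two query-based ingredients named at the end, gradient estimation is the easier to illustrate. A standard zeroth-order (finite-difference) estimator restricted to a local region might look like this, where loss_fn wraps black-box queries to the target model (a generic sketch under assumed names, not the paper's algorithm):

import torch

def estimate_gradient(loss_fn, x, mask, sigma=0.01, n_queries=20):
    # Finite-difference gradient estimate, perturbing only the masked (local) pixels.
    grad = torch.zeros_like(x)
    for _ in range(n_queries):
        u = torch.randn_like(x) * mask  # random direction confined to the local area
        delta = sigma * u
        # Each pair of model queries yields one directional-derivative estimate.
        grad += (loss_fn(x + delta) - loss_fn(x - delta)) / (2 * sigma) * u
    return grad / n_queries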

13. Guiding GANs: How to control non-conditional pre-trained GANs for conditional image generation [PDF] Back to Contents
  Manel Mateos, Alejandro González, Xavier Sevillano
Abstract: Generative Adversarial Networks (GANs) are an arrangement of two neural networks -- the generator and the discriminator -- that are jointly trained to generate artificial data, such as images, from random inputs. The quality of these generated images has recently reached such levels that can often lead both machines and humans into mistaking fake for real examples. However, the process performed by the generator of the GAN has some limitations when we want to condition the network to generate images from subcategories of a specific class. Some recent approaches tackle this conditional generation by introducing extra information prior to the training process, such as image semantic segmentation or textual descriptions. While successful, these techniques still require defining beforehand the desired subcategories and collecting large labeled image datasets representing them to train the GAN from scratch. In this paper we present a novel and alternative method for guiding generic non-conditional GANs to behave as conditional GANs. Instead of re-training the GAN, our approach adds into the mix an encoder network to generate the high-dimensional random input vectors that are fed to the generator network of a non-conditional GAN to make it generate images from a specific subcategory. In our experiments, when compared to training a conditional GAN from scratch, our guided GAN is able to generate artificial images of perceived quality comparable to that of non-conditional GANs after training the encoder on just a few hundred images, which substantially accelerates the process and enables adding new subcategories seamlessly.
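The training loop implied by the description can be sketched as follows, where G is any frozen pre-trained generator, clf is a frozen classifier scoring the target subcategory, and encoder maps a condition to a latent vector (all names and the loss choice are placeholder assumptions, not the paper's exact objective):

import torch
import torch.nn.functional as F

def train_encoder(encoder, G, clf, target_class, cond, steps=1000, lr=1e-3):
    # Learn to map a condition to a latent that makes the frozen G produce
    # images the frozen classifier assigns to target_class.
    for p in list(G.parameters()) + list(clf.parameters()):
        p.requires_grad_(False)  # the GAN itself is never re-trained
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        z = encoder(cond)        # condition -> high-dimensional latent input
        imgs = G(z)
        target = torch.full((imgs.shape[0],), target_class, dtype=torch.long)
        loss = F.cross_entropy(clf(imgs), target)
        opt.zero_grad(); loss.backward(); opt.step()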

14. Fooling Object Detectors: Adversarial Attacks by Half-Neighbor Masks [PDF] Back to Contents
  Yanghao Zhang, Fu Wang, Wenjie Ruan
Abstract: Although there are a great number of adversarial attacks on deep learning based classifiers, how to attack object detection systems has been rarely studied. In this paper, we propose a Half-Neighbor Masked Projected Gradient Descent (HNM-PGD) based attack, which can generate strong perturbation to fool different kinds of detectors under strict constraints. We also applied the proposed HNM-PGD attack in the CIKM 2020 AnalytiCup Competition, which was ranked within the top 1% on the leaderboard. We release the code at this https URL.
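A plain masked PGD loop conveys the overall shape of such an attack (this is vanilla L-infinity PGD restricted by a binary mask, not the paper's Half-Neighbor mask construction, and the cross-entropy loss stands in for a detector-specific objective):

import torch
import torch.nn.functional as F

def masked_pgd(model, x, y, mask, eps=8/255, alpha=2/255, steps=10):
    # Project perturbations onto an L-inf ball, but only inside `mask`.
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign() * mask  # ascend only inside the mask
            x_adv = x + (x_adv - x).clamp(-eps, eps)    # L-inf projection
            x_adv = x_adv.clamp(0, 1)                   # stay a valid image
    return x_adv.detach()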

15. Classification and Segmentation of Pulmonary Lesions in CT images using a combined VGG-XGBoost method, and an integrated Fuzzy Clustering-Level Set technique [PDF] Back to Contents
  Niloofar Akhavan Javan, Ali Jebreili, Babak Mozafari, Morteza Hosseinioun
Abstract: Given that lung cancer is one of the deadliest diseases, and many die from the disease every year, early detection and diagnosis of this disease are valuable, preventing cancer from growing and spreading. So if cancer is diagnosed at an early stage, the patient's life can be saved. However, current pulmonary disease diagnosis is performed by humans, which is time-consuming and requires a specialist in this field. Also, there is a high level of error in human diagnosis. Our goal is to develop a system that can detect and classify lung lesions with high accuracy and segment them in CT-scan images. In the proposed method, first, features are extracted automatically from the CT-scan image; then, the extracted features are classified by Ensemble Gradient Boosting methods. Finally, if there is a lesion in the CT-scan image, the lesion is segmented using a hybrid method based on [1], including Fuzzy Clustering and Level Set. We collected a dataset including CT-scan images of pulmonary lesions. The target community was patients in Mashhad. The collected samples were then tagged by a specialist. We used this dataset for training and testing our models. Finally, we were able to achieve an accuracy of 96% for this dataset. This system can help physicians to diagnose pulmonary lesions and prevent possible mistakes.
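The combined VGG-XGBoost classifier reduces to a frozen CNN feature extractor feeding a gradient-boosted tree model. A minimal sketch with standard torchvision and xgboost calls (the exact layers, preprocessing, and hyperparameters in the paper may differ; the data here is a random stand-in):

import numpy as np
import torch
import torchvision.models as models
from xgboost import XGBClassifier

vgg = models.vgg16()  # in practice loaded with pretrained weights
vgg.eval()
extractor = torch.nn.Sequential(vgg.features, vgg.avgpool, torch.nn.Flatten())

def extract(images):  # images: (N, 3, 224, 224) tensor of CT slices
    with torch.no_grad():
        return extractor(images).numpy()

images = torch.rand(8, 3, 224, 224)          # stand-in for preprocessed CT slices
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # stand-in lesion labels
clf = XGBClassifier(n_estimators=50)
clf.fit(extract(images), labels)
preds = clf.predict(extract(images))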

16. Weakly-Supervised Saliency Detection via Salient Object Subitizing [PDF] Back to Contents
  Xiaoyang Zheng, Xin Tan, Jie Zhou, Lizhuang Ma, Rynson W.H. Lau
Abstract: Salient object detection aims at detecting the most visually distinct objects and producing the corresponding masks. As the cost of pixel-level annotations is high, image tags are usually used as weak supervisions. However, an image tag can only be used to annotate one class of objects. In this paper, we introduce saliency subitizing as the weak supervision since it is class-agnostic. This allows the supervision to be aligned with the property of saliency detection, where the salient objects of an image could be from more than one class. To this end, we propose a model with two modules, Saliency Subitizing Module (SSM) and Saliency Updating Module (SUM). While SSM learns to generate the initial saliency masks using the subitizing information, without the need for any unsupervised methods or some random seeds, SUM helps iteratively refine the generated saliency masks. We conduct extensive experiments on five benchmark datasets. The experimental results show that our method outperforms other weakly-supervised methods and even performs comparably to some fully-supervised methods.

17. Global2Local: Efficient Structure Search for Video Action Segmentation [PDF] Back to Contents
  Shang-Hua Gao, Qi Han, Zhong-Yu Li, Pai Peng, Liang Wang, Ming-Ming Cheng
Abstract: Temporal receptive fields of models play an important role in action segmentation. Large receptive fields facilitate the long-term relations among video clips while small receptive fields help capture the local details. Existing methods construct models with hand-designed receptive fields in layers. Can we effectively search for receptive field combinations to replace hand-designed patterns? To answer this question, we propose to find better receptive field combinations through a global-to-local search scheme. Our search scheme exploits both global search to find the coarse combinations and local search to get the refined receptive field combination patterns further. The global search finds possible coarse combinations other than human-designed patterns. On top of the global search, we propose an expectation guided iterative local search scheme to refine combinations effectively. Our global-to-local search can be plugged into existing action segmentation methods to achieve state-of-the-art performance.

18. Identifying centres of interest in paintings using alignment and edge detection: Case studies on works by Luc Tuymans [PDF] Back to Contents
  Sinem Aslan, Luc Steels
Abstract: What is the creative process through which an artist goes from an original image to a painting? Can we examine this process using techniques from computer vision and pattern recognition? Here we set the first preliminary steps to algorithmically deconstruct some of the transformations that an artist applies to an original image in order to establish centres of interest, which are focal areas of a painting that carry meaning. We introduce a comparative methodology that first cuts out the minimal segment from the original image on which the painting is based, then aligns the painting with this source, investigates micro-differences to identify centres of interest and attempts to understand their role. In this paper we focus exclusively on micro-differences with respect to edges. We believe that research into where and how artists create centres of interest in paintings is valuable for curators, art historians, viewers, and art educators, and might even help artists to understand and refine their own artistic method.

19. Low Light Image Enhancement via Global and Local Context Modeling [PDF] Back to Contents
  Aditya Arora, Muhammad Haris, Syed Waqas Zamir, Munawar Hayat, Fahad Shahbaz Khan, Ling Shao, Ming-Hsuan Yang
Abstract: Images captured under low-light conditions manifest poor visibility, lack contrast and color vividness. Compared to conventional approaches, deep convolutional neural networks (CNNs) perform well in enhancing images. However, being solely reliant on confined fixed primitives to model dependencies, existing data-driven deep models do not exploit the contexts at various spatial scales to address low-light image enhancement. These contexts can be crucial towards inferring several image enhancement tasks, e.g., local and global contrast, brightness and color corrections; which requires cues from both local and global spatial extent. To this end, we introduce a context-aware deep network for low-light image enhancement. First, it features a global context module that models spatial correlations to find complementary cues over full spatial domain. Second, it introduces a dense residual block that captures local context with a relatively large receptive field. We evaluate the proposed approach using three challenging datasets: MIT-Adobe FiveK, LoL, and SID. On all these datasets, our method performs favorably against the state-of-the-arts in terms of standard image fidelity metrics. In particular, compared to the best performing method on the MIT-Adobe FiveK dataset, our algorithm improves PSNR from 23.04 dB to 24.45 dB.

20. Temporal Contrastive Graph for Self-supervised Video Representation Learning [PDF] Back to Contents
  Yang Liu, Keze Wang, Haoyuan Lan, Liang Lin
Abstract: Attempting to fully explore the fine-grained temporal structure and global-local chronological characteristics for self-supervised video representation learning, this work takes a closer look at exploiting the temporal structure of videos and further proposes a novel self-supervised method named Temporal Contrastive Graph (TCG). In contrast to existing methods that randomly shuffle the video frames or video snippets within a video, our proposed TCG is rooted in a hybrid graph contrastive learning strategy that regards the inter-snippet and intra-snippet temporal relationships as self-supervision signals for temporal representation learning. Inspired by the neuroscience studies that the human visual system is sensitive to both local and global temporal changes, our proposed TCG integrates the prior knowledge about the frame and snippet orders into temporal contrastive graph structures, i.e., the intra-/inter- snippet temporal contrastive graph modules, to well preserve the local and global temporal relationships among video frame-sets and snippets. By randomly removing edges and masking node features of the intra-snippet graphs or inter-snippet graphs, our TCG can generate different correlated graph views. Then, specific contrastive losses are designed to maximize the agreement between node embeddings in different views. To learn the global context representation and recalibrate the channel-wise features adaptively, we introduce an adaptive video snippet order prediction module, which leverages the relational knowledge among video snippets to predict the actual snippet orders. Extensive experimental results demonstrate the superiority of our TCG over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.

21. Shed Various Lights on a Low-Light Image: Multi-Level Enhancement Guided by Arbitrary References [PDF] Back to Contents
  Ya'nan Wang, Zhuqing Jiang, Chang Liu, Kai Li, Aidong Men, Haiying Wang
Abstract: It is suggested that low-light image enhancement realizes a one-to-many mapping, since the definition of NORMAL-light differs across application scenarios and users' aesthetics. However, most existing methods ignore the subjectivity of the task, and simply produce one result with fixed brightness. This paper proposes a neural network for multi-level low-light image enhancement, which is user-friendly in meeting various requirements by selecting different images as the brightness reference. Inspired by style transfer, our method decomposes an image into two low-coupling feature components in the latent space, which makes it feasible to concatenate the content components from low-light images with the luminance components from reference images. In such a way, the network learns to extract scene-invariant and brightness-specific information from a set of image pairs instead of learning brightness differences. Moreover, information other than the brightness is preserved to the greatest extent to alleviate color distortion. Extensive results show the strong capacity and superiority of our network against existing methods.

22. A Framework for Fast Scalable BNN Inference using Googlenet and Transfer Learning [PDF] Back to Contents
  Karthik E
Abstract: Efficient and accurate object detection in video and image analysis is one of the major beneficiaries of the advancement in computer vision systems with the help of deep learning. With the aid of deep learning, more powerful tools have evolved that are capable of learning high-level and deeper features and can thus overcome the existing problems in traditional architectures of object detection algorithms. The work in this thesis aims to achieve high accuracy in object detection with good real-time performance. In the area of computer vision, a lot of research goes into the detection and processing of visual information, by improving the existing algorithms. The binarized neural network has shown high performance in various vision tasks such as image classification, object detection, and semantic segmentation. The Modified National Institute of Standards and Technology database (MNIST), Canadian Institute for Advanced Research (CIFAR), and Street View House Numbers (SVHN) datasets are used, implemented with a pre-trained convolutional neural network (CNN) that is 22 layers deep. Supervised learning is used in the work, which classifies the particular dataset with the proper structure of the model. For still images, Googlenet is used to improve accuracy. The final layer of the Googlenet is replaced through transfer learning to improve the accuracy of the Googlenet. At the same time, the accuracy on moving images can be maintained by transfer learning techniques. Hardware is the main backbone for any model to obtain faster results with a large number of datasets. Here, the Nvidia Jetson Nano is used, which has a graphics processing unit (GPU) that can handle a large number of computations in the process of object detection. Results show that the accuracy of objects detected by the transfer learning method is higher when compared to the existing methods.
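Replacing the final layer of a pre-trained GoogLeNet, as described, is a short exercise in torchvision; the sketch below is a standard transfer-learning recipe, not the thesis code, and the 10-class head matches datasets such as MNIST, CIFAR-10, or SVHN:

import torch.nn as nn
from torchvision.models import googlenet

model = googlenet(weights="DEFAULT")  # ImageNet-pretrained, 22-layer backbone
for p in model.parameters():
    p.requires_grad = False           # freeze the learned feature extractor
model.fc = nn.Linear(model.fc.in_features, 10)  # new trainable head
# Only model.fc is then trained on the target dataset.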

23. WearMask: Fast In-browser Face Mask Detection with Serverless Edge Computing for COVID-19 [PDF] Back to Contents
  Zekun Wang, Pengwei Wang, Peter C. Louis, Lee E. Wheless, Yuankai Huo
Abstract: The COVID-19 epidemic has been a significant healthcare challenge in the United States. According to the Centers for Disease Control and Prevention (CDC), COVID-19 infection is transmitted predominantly by respiratory droplets generated when people breathe, talk, cough, or sneeze. Wearing a mask is the primary, effective, and convenient method of blocking 80% of all respiratory infections. Therefore, many face mask detection and monitoring systems have been developed to provide effective supervision for hospitals, airports, public transportation, sports venues, and retail locations. However, the current commercial face mask detection systems are typically bundled with specific software or hardware, impeding public accessibility. In this paper, we propose an in-browser serverless edge-computing based face mask detection solution, called Web-based efficient AI recognition of masks (WearMask), which can be deployed on any common devices (e.g., cell phones, tablets, computers) that have internet connections using web browsers, without installing any software. The serverless edge-computing design minimizes the extra hardware costs (e.g., specific devices or cloud computing servers). The contribution of the proposed method is to provide a holistic edge-computing framework of integrating (1) deep learning models (YOLO), (2) high-performance neural network inference computing framework (NCNN), and (3) a stack-based virtual machine (WebAssembly). For end-users, our web-based solution has advantages of (1) serverless edge-computing design with minimal device limitation and privacy risk, (2) installation free deployment, (3) low computing requirements, and (4) high detection speed. Our WearMask application has been launched with public access at this http URL.

24. Automatic Defect Detection of Print Fabric Using Convolutional Neural Network [PDF] Back to Contents
  Samit Chakraborty, Marguerite Moore, Lisa Parrillo-Chapman
Abstract: Automatic defect detection is a challenging task because of the variability in texture and type of fabric defects. An effective defect detection system enables manufacturers to improve the quality of processes and products. Automation across the textile manufacturing systems would reduce fabric wastage and increase profitability by saving cost and resources. There is a range of contemporary research on automatic defect detection systems using image processing and machine learning techniques. These techniques differ from each other based on the manufacturing processes and defect types. Researchers have also been able to establish real-time defect detection systems during weaving. Although there has been research on patterned fabric defect detection, those defects are related to weaving faults such as holes and warp and weft defects. But there has not been any research designed to detect defects that arise during printing, such as spots and print mismatches. This research fills this gap by developing a print fabric database and implementing a deep convolutional neural network (CNN).

25. An Evolution of CNN Object Classifiers on Low-Resolution Images [PDF] Back to Contents
  Md. Mohsin Kabir, Abu Quwsar Ohi, Md. Saifur Rahman, M. F. Mridha
Abstract: Object classification is a significant task in computer vision. It has become an effective research area as an important aspect of image processing and the building block of image localization, detection, and scene parsing. Object classification from low-quality images is difficult due to the variance of object colors, aspect ratios, and cluttered backgrounds. The field of object classification has seen remarkable advancements with the development of deep convolutional neural networks (DCNNs). Deep neural networks have been demonstrated as very powerful systems for facing the challenge of object classification from high-resolution images, but deploying such object classification networks on the embedded device remains challenging due to the high computational and memory requirements. Using high-quality images often causes high computational and memory complexity, whereas low-quality images can solve this issue. Hence, in this paper, we investigate an optimal architecture that accurately classifies low-quality images using DCNN architectures. To validate different baselines on low-quality images, we perform experiments using webcam-captured image datasets of 10 different objects. In this research work, we evaluate the proposed architecture by implementing popular CNN architectures. The experimental results validate that the MobileNet architecture delivers better results than most of the available CNN architectures for low-resolution webcam image datasets.

26. Fake Visual Content Detection Using Two-Stream Convolutional Neural Networks [PDF] Back to Contents
  Bilal Yousaf, Muhammad Usama, Waqas Sultani, Arif Mahmood, Junaid Qadir
Abstract: Rapid progress in adversarial learning has enabled the generation of realistic-looking fake visual content. To distinguish between fake and real visual content, several detection techniques have been proposed. The performance of most of these techniques however drops off significantly if the test and the training data are sampled from different distributions. This motivates efforts towards improving the generalization of fake detectors. Since current fake content generation techniques do not accurately model the frequency spectrum of the natural images, we observe that the frequency spectrum of the fake visual data contains discriminative characteristics that can be used to detect fake content. We also observe that the information captured in the frequency spectrum is different from that of the spatial domain. Using these insights, we propose to complement frequency and spatial domain features using a two-stream convolutional neural network architecture called TwoStreamNet. We demonstrate the improved generalization of the proposed two-stream network to several unseen generation architectures, datasets, and techniques. The proposed detector has demonstrated significant performance improvement compared to the current state-of-the-art fake content detectors and fusing the frequency and spatial domain streams has also improved generalization of the detector.
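The frequency-domain stream is easy to make concrete: the log-magnitude of the 2D FFT of the image can feed one branch while raw pixels feed the other. Below is a generic two-stream sketch under assumed layer sizes (not the paper's TwoStreamNet architecture):

import torch
import torch.nn as nn

def to_spectrum(img):
    # Log-magnitude of the centered 2D FFT, one spectrum per color channel.
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    return torch.log1p(f.abs())

class TwoStream(nn.Module):
    def __init__(self):
        super().__init__()
        def branch():
            return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.spatial, self.spectral = branch(), branch()
        self.head = nn.Linear(32, 2)  # real vs. fake

    def forward(self, img):
        feats = torch.cat([self.spatial(img), self.spectral(to_spectrum(img))], dim=1)
        return self.head(feats)

logits = TwoStream()(torch.rand(4, 3, 64, 64))  # (4, 2)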

27. Weakly Supervised Multi-Object Tracking and Segmentation [PDF] Back to Contents
  Idoia Ruiz, Lorenzo Porzi, Samuel Rota Bulò, Peter Kontschieder, Joan Serrat
Abstract: We introduce the problem of weakly supervised Multi-Object Tracking and Segmentation, i.e. joint weakly supervised instance segmentation and multi-object tracking, in which we do not provide any kind of mask annotation. To address it, we design a novel synergistic training strategy by taking advantage of multi-task learning, i.e. classification and tracking tasks guide the training of the unsupervised instance segmentation. For that purpose, we extract weak foreground localization information, provided by Grad-CAM heatmaps, to generate a partial ground truth to learn from. Additionally, RGB image level information is employed to refine the mask prediction at the edges of the objects. We evaluate our method on KITTI MOTS, the most representative benchmark for this task, reducing the performance gap on the MOTSP metric between the fully supervised and weakly supervised approach to just 12% and 12.7% for cars and pedestrians, respectively.
摘要:我们介绍了弱监督的多对象跟踪和分段问题,即联合的弱监督实例分割和多对象跟踪问题,其中我们不提供任何遮罩注释。 为了解决这个问题,我们通过利用多任务学习来设计一种新颖的协同训练策略,即分类和跟踪任务指导了无监督实例分割的训练。 为此,我们提取了Grad-CAM热图提供的弱前景定位信息,以生成部分可借鉴的地面实况。 另外,采用RGB图像级别信息来完善对象边缘的遮罩预测。 我们在此任务中最具代表性的基准KITTI MOTS上评估了我们的方法,从而将MOTSP度量标准在完全监督和弱监督方法之间的性能差距减小到仅针对汽车和行人分别为12%和12.7%。
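A minimal sketch of the "partial ground truth" idea described above: a Grad-CAM heatmap is thresholded into confident foreground, confident background, and an ignored uncertain region. The thresholds are assumptions, not the paper's values.

```python
# Sketch: turning a Grad-CAM heatmap into a weak mask label with an
# ignore region for uncertain pixels.
import numpy as np

def partial_mask(cam, fg_thr=0.6, bg_thr=0.2, ignore_label=255):
    """cam: HxW Grad-CAM heatmap normalized to [0, 1]."""
    mask = np.full(cam.shape, ignore_label, dtype=np.uint8)  # uncertain pixels
    mask[cam >= fg_thr] = 1   # confident foreground
    mask[cam <= bg_thr] = 0   # confident background
    return mask
```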

28. Depth as Attention for Face Representation Learning [PDF] 返回目录
  Hardik Uppal, Alireza Sepas-Moghaddam, Michael Greenspan, Ali Etemad
Abstract: Face representation learning solutions have recently achieved great success for various applications such as verification and identification. However, face recognition approaches that are based purely on RGB images rely solely on intensity information, and therefore are more sensitive to facial variations, notably pose, occlusions, and environmental changes such as illumination and background. A novel depth-guided attention mechanism is proposed for deep multi-modal face recognition using low-cost RGB-D sensors. Our novel attention mechanism directs the deep network "where to look" for visual features in the RGB image by focusing the attention of the network using depth features extracted by a Convolutional Neural Network (CNN). The depth features help the network focus on regions of the face in the RGB image that contain more prominent person-specific information. Our attention mechanism then uses this correlation to generate an attention map for RGB images from the depth features extracted by the CNN. We test our network on four public datasets, showing that the features obtained by our proposed solution yield better results on the Lock3DFace, CurtinFaces, IIIT-D RGB-D, and KaspAROV datasets, which include challenging variations in pose, occlusion, illumination, expression, and time-lapse. Our solution achieves average (increased) accuracies of 87.3% (+5.0%), 99.1% (+0.9%), 99.7% (+0.6%) and 95.3% (+0.5%) for the four datasets respectively, thereby improving the state-of-the-art. We also perform additional experiments with thermal images, instead of depth images, showing the high generalization ability of our solution when adopting other modalities for guiding the attention mechanism instead of depth information.
摘要:人脸表征学习解决方案最近在各种应用(例如验证和识别)中取得了巨大的成功。但是,仅基于RGB图像的面部识别方法仅依赖于强度信息,因此对面部变化(尤其是姿势,遮挡以及环境变化(例如照明和背景))更加敏感。提出了一种新颖的深度引导注意机制,用于使用低成本RGB-D传感器的深度多模式人脸识别。我们新颖的注意力机制通过使用卷积神经网络(CNN)提取的深度特征来集中网络的注意力,从而将RGB图像中的视觉特征引导到深度网络的“哪里看”。深度功能可帮助网络将焦点集中在RGB图像中包含更多特定于人的信息的脸部区域。然后,我们的注意力机制使用此相关性,根据CNN提取的深度特征为RGB图像生成注意力图。我们在四个公开数据集上测试了我们的网络,结果表明,我们提出的解决方案获得的功能在Lock3DFace,CurtinFaces,IIIT-D RGB-D和KaspAROV数据集上产生了更好的结果,这些数据集包括姿势,遮挡,照明,表情,和延时。我们的解决方案实现了(提高)的平均准确度,分别为87.3%(+5.0%),99.1%(+0.9%),99.7%(+ 0.6%)和95.3%(+ 0.5%)四个数据集,从而提高了技术水平。我们还使用热图像(而不是深度图像)进行了额外的实验,显示了当采用其他方式来引导注意力机制而不是深度信息时,我们的解决方案具有很高的泛化能力
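A minimal sketch of depth-guided attention as described above (an illustration, not the paper's exact network): depth-CNN features are squeezed into a spatial attention map that re-weights the RGB feature map, telling it "where to look". Channel counts are assumptions.

```python
# Sketch: depth features generate a spatial attention map applied to RGB features.
import torch
import torch.nn as nn

class DepthGuidedAttention(nn.Module):
    def __init__(self, depth_ch=64):
        super().__init__()
        self.att = nn.Sequential(nn.Conv2d(depth_ch, 1, 1), nn.Sigmoid())
    def forward(self, rgb_feat, depth_feat):
        a = self.att(depth_feat)   # (B, 1, H, W) attention map in [0, 1]
        return rgb_feat * a        # emphasize person-specific RGB regions

out = DepthGuidedAttention()(torch.randn(2, 64, 28, 28),
                             torch.randn(2, 64, 28, 28))
```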

29. News Image Steganography: A Novel Architecture Facilitates the Fake News Identification [PDF] 返回目录
  Jizhe Zhou, Chi-Man Pun, Yu Tong
Abstract: A larger portion of fake news quotes untampered images from other sources with ulterior motives rather than conducting image forgery. Such elaborate engraftments keep the inconsistency between images and text reports stealthy, thereby, palm off the spurious for the genuine. This paper proposes an architecture named News Image Steganography (NIS) to reveal the aforementioned inconsistency through image steganography based on GAN. Extractive summarization about a news image is generated based on its source texts, and a learned steganographic algorithm encodes and decodes the summarization of the image in a manner that approaches perceptual invisibility. Once an encoded image is quoted, its source summarization can be decoded and further presented as the ground truth to verify the quoting news. The pairwise encoder and decoder endow images of the capability to carry along their imperceptible summarization. Our NIS reveals the underlying inconsistency, thereby, according to our experiments and investigations, contributes to the identification accuracy of fake news that engrafts untampered images.
摘要:大部分虚假新闻引用别有用心的其他来源的未经篡改的图像,而不是进行图像伪造。如此精心的植入使图像和文本报告之间的不一致隐身了,从而摆脱了真正的伪造。本文提出了一种称为新闻图像隐写术(NIS)的体系结构,以通过基于GAN的图像隐写术揭示上述不一致之处。基于新闻图像的源文本生成关于新闻图像的提取摘要,并且学习的隐写算法以接近感知隐身的方式对图像的摘要进行编码和解码。一旦引用了编码的图像,就可以对其源摘要进行解码,并进一步将其作为地面真实性来表示,以验证引用的新闻。成对的编码器和解码器赋予图像进行其不可察觉的汇总的能力。我们的NIS揭示了潜在的不一致性,因此根据我们的实验和调查,这有助于识别植入未经篡改图像的假新闻的准确性。

30. Privacy-sensitive Objects Pixelation for Live Video Streaming [PDF] 返回目录
  Jizhe Zhou, Chi-Man Pun, Yu Tong
Abstract: With the prevailing of live video streaming, establishing an online pixelation method for privacy-sensitive objects is an urgency. Caused by the inaccurate detection of privacy-sensitive objects, simply migrating the tracking-by-detection structure into the online form will incur problems in target initialization, drifting, and over-pixelation. To cope with the inevitable but impacting detection issue, we propose a novel Privacy-sensitive Objects Pixelation (PsOP) framework for automatic personal privacy filtering during live video streaming. Leveraging pre-trained detection networks, our PsOP is extendable to any potential privacy-sensitive objects pixelation. Employing the embedding networks and the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm as the backbone, our PsOP unifies the pixelation of discriminating and indiscriminating pixelation objects through trajectories generation. In addition to the pixelation accuracy boosting, experiments on the streaming video data we built show that the proposed PsOP can significantly reduce the over-pixelation ratio in privacy-sensitive object pixelation.
摘要:随着实时视频流的普及,为隐私敏感对象建立在线像素化方法已成为当务之急。由于对隐私敏感对象的检测不准确,仅将按检测跟踪的结构迁移到在线形式将导致目标初始化,漂移和像素过度化的问题。为了解决不可避免但影响深远的检测问题,我们提出了一种新颖的隐私敏感对象像素化(PsOP)框架,用于在实时视频流期间自动进行个人隐私过滤。利用预训练的检测网络,我们的PsOP可扩展到任何潜在的隐私敏感对象像素化。利用嵌入网络和拟议的位置增量亲和传播(PIAP)聚类算法作为主干,我们的PsOP通过轨迹生成来统一区分和不区分像素的像素。除了提高像素化精度外,我们对流视频数据进行的实验表明,所提出的PsOP可以显着降低隐私敏感对象像素化中的过像素化率。

31. A Switched View of Retinex: Deep Self-Regularized Low-Light Image Enhancement [PDF] 返回目录
  Zhuqing Jiang, Haotian Li, Liangjie Liu, Aidong Men, Haiying Wang
Abstract: Self-regularized low-light image enhancement does not require any normal-light image in training, thereby freeing from the chains on paired or unpaired low-/normal-images. However, existing methods suffer color deviation and fail to generalize to various lighting conditions. This paper presents a novel self-regularized method based on Retinex, which, inspired by HSV, preserves all colors (Hue, Saturation) and only integrates Retinex theory into brightness (Value). We build a reflectance estimation network by restricting the consistency of reflectances embedded in both the original and a novel random disturbed form of the brightness of the same scene. The generated reflectance, which is assumed to be irrelevant of illumination by Retinex, is treated as enhanced brightness. Our method is efficient as a low-light image is decoupled into two subspaces, color and brightness, for better preservation and enhancement. Extensive experiments demonstrate that our method outperforms multiple state-of-the-art algorithms qualitatively and quantitatively and adapts to more lighting conditions.
摘要:自我调节的低光图像增强不需要训练中的任何正常光图像,从而摆脱了配对或未配对的低/正常图像上的链。然而,现有方法遭受颜色偏差并且不能推广到各种照明条件。本文提出了一种基于Retinex的自校正新方法,该方法受HSV启发,保留了所有颜色(色相,饱和度),并且仅将Retinex理论整合到亮度(值)中。我们通过限制嵌入在同一场景的原始亮度和新颖随机干扰形式中的反射率的一致性来构建反射率估计网络。生成的反射率(被认为与Retinex的照明无关)被视为增强的亮度。我们的方法是有效的,因为弱光图像被分解为两个子空间(颜色和亮度),以便更好地保存和增强。大量的实验表明,我们的方法在质量和数量上都优于多种最新算法,并且可以适应更多的照明条件。
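A sketch of the HSV-inspired decoupling described above (not the learned network): hue and saturation are preserved and only the value channel is enhanced, with simple gamma correction standing in for the Retinex-based brightness enhancement.

```python
# Sketch: enhance only the V (brightness) channel, preserving hue/saturation.
import cv2
import numpy as np

def enhance_v_only(bgr, gamma=0.5):
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 2] = 255.0 * (hsv[..., 2] / 255.0) ** gamma  # brighten V only
    hsv[..., 2] = np.clip(hsv[..., 2], 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
```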

32. Consensus-Guided Correspondence Denoising [PDF] 返回目录
  Chen Zhao, Yixiao Ge, Jiaqi Yang, Feng Zhu, Rui Zhao, Hongsheng Li
Abstract: Correspondence selection between two groups of feature points aims to correctly recognize the consistent matches (inliers) from the initial noisy matches. The selection is generally challenging since the initial matches are generally extremely unbalanced, where outliers can easily dominate. Moreover, random distributions of outliers lead to the limited robustness of previous works when applied to different scenarios. To address this issue, we propose to denoise correspondences with a local-to-global consensus learning framework to robustly identify correspondence. A novel "pruning" block is introduced to distill reliable candidates from initial matches according to their consensus scores estimated by dynamic graphs from local to global regions. The proposed correspondence denoising is progressively achieved by stacking multiple pruning blocks sequentially. Our method outperforms state-of-the-arts on robust line fitting, wide-baseline image matching and image localization benchmarks by noticeable margins and shows promising generalization capability on different distributions of initial matches.
摘要:两组特征点之间的对应选择旨在从初始的噪声匹配中正确识别一致的匹配(内部)。选择通常具有挑战性,因为初始匹配通常非常不平衡,因此异常值很容易占优势。此外,离群值的随机分布在应用于不同场景时会导致先前工作的鲁棒性有限。为了解决这个问题,我们建议使用局部到全局共识学习框架对通信进行降噪,以可靠地识别通信。引入了一种新颖的“修剪”块,以根据通过动态图从本地到全球区域估计的共识分数从初始匹配中提取可靠的候选人。通过顺序堆叠多个修剪块,逐步实现了建议的对应降噪功能。我们的方法在鲁棒的线条拟合,宽基线图像匹配和图像定位基准方面的最新技术具有明显的优势,并且在初始匹配的不同分布上显示出有希望的泛化能力。

33. Style Normalization and Restitution for Domain Generalization and Adaptation [PDF] 返回目录
  Xin Jin, Cuiling Lan, Wenjun Zeng, Zhibo Chen
Abstract: For many practical computer vision applications, the learned models usually have high performance on the datasets used for training but suffer from significant performance degradation when deployed in new environments, where there are usually style differences between the training images and the testing images. An effective domain generalizable model is expected to be able to learn feature representations that are both generalizable and discriminative. In this paper, we design a novel Style Normalization and Restitution module (SNR) to simultaneously ensure both high generalization and discrimination capability of the networks. In the SNR module, particularly, we filter out the style variations (e.g., illumination, color contrast) by performing Instance Normalization (IN) to obtain style normalized features, where the discrepancy among different samples and domains is reduced. However, such a process is task-ignorant and inevitably removes some task-relevant discriminative information, which could hurt the performance. To remedy this, we propose to distill task-relevant discriminative features from the residual (i.e., the difference between the original feature and the style normalized feature) and add them back to the network to ensure high discrimination. Moreover, for better disentanglement, we enforce a dual causality loss constraint in the restitution step to encourage the better separation of task-relevant and task-irrelevant features. We validate the effectiveness of our SNR on different computer vision tasks, including classification, semantic segmentation, and object detection. Experiments demonstrate that our SNR module is capable of improving the performance of networks for domain generalization (DG) and unsupervised domain adaptation (UDA) on many tasks. Code is available at this https URL.
摘要:对于许多实际的计算机视觉应用而言,学习的模型通常在用于训练的数据集上具有较高的性能,但是当部署在新环境中时(在训练图像和测试图像之间通常存在样式差异),性能会明显下降。期望有效的领域可概括模型能够学习可概括和区分的特征表示。在本文中,我们设计了一种新颖的样式归一化和复原模块(SNR),以同时确保网络的高泛化和区分能力。特别是在SNR模块中,我们通过执行实例归一化(IN)以获得样式归一化的特征来过滤出样式变化(例如,照明,色彩对比),从而减少了不同样本和域之间的差异。但是,这样的过程是任务忽略的,不可避免地会删除一些与任务相关的区分信息,这可能会损害性能。为了解决这个问题,我们建议从残差中提取与任务相关的判别特征(即原始特征与样式规范化特征之间的差异),然后将其添加回网络中以确保高度区分度。此外,为了更好地解开纠缠,我们在恢复步骤中实施了双重因果损失约束,以鼓励更好地分离与任务相关和与任务无关的特征。我们验证了SNR在不同计算机视觉任务(包括分类,语义分割和对象检测)上的有效性。实验表明,我们的SNR模块能够在许多任务上提高用于域泛化(DG)和无监督域自适应(UDA)的网络性能。可以在此https URL上找到代码。

34. Few-shot Image Classification: Just Use a Library of Pre-trained Feature Extractors and a Simple Classifier [PDF] 返回目录
  Arkabandhu Chowdhury, Mingchao Jiang, Chris Jermaine
Abstract: Recent papers have suggested that transfer learning can outperform sophisticated meta-learning methods for few-shot image classification. We take this hypothesis to its logical conclusion, and suggest the use of an ensemble of high-quality, pre-trained feature extractors for few-shot image classification. We show experimentally that a library of pre-trained feature extractors combined with a simple feed-forward network learned with an L2-regularizer can be an excellent option for solving cross-domain few-shot image classification. Our experimental results suggest that this simpler sample-efficient approach far outperforms several well-established meta-learning algorithms on a variety of few-shot tasks.
摘要:最近的论文表明,对于少样本图像分类,迁移学习可以胜过复杂的元学习方法。我们将此假设推至其逻辑结论,并建议使用高质量的预训练特征提取器集成来进行少样本图像分类。我们通过实验证明,将预训练特征提取器库与使用L2正则化器学习的简单前馈网络相结合,是解决跨域少样本图像分类的出色选择。我们的实验结果表明,这种更简单、样本高效的方法在各种少样本任务上远胜于多种成熟的元学习算法。
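A sketch of the "library of extractors + simple classifier" recipe: pooled features from frozen ImageNet backbones are concatenated and fed to an L2-regularized linear classifier. The backbone choice and episode sizes are assumptions, not the paper's configuration.

```python
# Sketch: concatenate frozen backbone features, fit a simple linear classifier.
import torch
import torch.nn as nn
from torchvision import models
from sklearn.linear_model import LogisticRegression

backbones = [
    nn.Sequential(*list(models.resnet18(weights="IMAGENET1K_V1").children())[:-1]),
    nn.Sequential(*list(models.resnet50(weights="IMAGENET1K_V1").children())[:-1]),
]
for b in backbones:
    b.eval()

@torch.no_grad()
def featurize(x):  # x: (B, 3, 224, 224) -> concatenated pooled features
    return torch.cat([b(x).flatten(1) for b in backbones], dim=1).numpy()

# One 5-way 5-shot episode with placeholder support images.
X_support = torch.randn(25, 3, 224, 224)
y_support = torch.arange(5).repeat_interleave(5).numpy()
clf = LogisticRegression(C=1.0, max_iter=1000)   # C controls L2 strength
clf.fit(featurize(X_support), y_support)
```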

35. Six-channel Image Representation for Cross-domain Object Detection [PDF] 返回目录
  Tianxiao Zhang, Wenchi Ma, Guanghui Wang
Abstract: Most deep learning models are data-driven and the excellent performance is highly dependent on the abundant and diverse datasets. However, it is very hard to obtain and label the datasets of some specific scenes or applications. If we train the detector using the data from one domain, it cannot perform well on the data from another domain due to domain shift, which is one of the big challenges of most object detection models. To address this issue, some image-to-image translation techniques are employed to generate some fake data of some specific scenes to train the models. With the advent of Generative Adversarial Networks (GANs), we could realize unsupervised image-to-image translation in both directions from a source to a target domain and from the target to the source domain. In this study, we report a new approach to making use of the generated images. We propose to concatenate the original 3-channel images and their corresponding GAN-generated fake images to form 6-channel representations of the dataset, hoping to address the domain shift problem while exploiting the success of available detection models. The idea of augmented data representation may inspire further study on object detection and other applications.
摘要:大多数深度学习模型都是数据驱动的,并且出色的性能高度依赖于丰富多样的数据集。但是,很难获得并标记某些特定场景或应用程序的数据集。如果我们使用来自一个域的数据训练检测器,则由于域移位,它不能很好地处理来自另一域的数据,这是大多数对象检测模型的一大挑战。为了解决这个问题,采用了一些图像到图像的转换技术来生成某些特定场景的虚假数据以训练模型。随着生成对抗网络(GAN)的出现,我们可以在从源域到目标域以及从目标域到源域的两个方向上实现无监督的图像到图像的转换。在这项研究中,我们报告了一种利用生成的图像的新方法。我们建议将原始的3通道图像及其对应的GAN生成的伪图像连接起来,以形成数据集的6通道表示,以期在利用现有检测模型成功的同时解决域偏移问题。增强数据表示的想法可能会激发对对象检测和其他应用程序的进一步研究。
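A sketch of the six-channel representation: the original image and its GAN-translated counterpart (assumed here to come from a CycleGAN-style model) are stacked along the channel axis, and the detector backbone's first convolution is widened to accept six channels.

```python
# Sketch: stack real + GAN-translated images into a 6-channel input.
import torch
import torch.nn as nn
from torchvision import models

real = torch.randn(1, 3, 224, 224)    # original image
fake = torch.randn(1, 3, 224, 224)    # placeholder for the GAN translation
six = torch.cat([real, fake], dim=1)  # (1, 6, 224, 224)

backbone = models.resnet18(weights=None)
backbone.conv1 = nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3, bias=False)
out = backbone(six)                   # the backbone now consumes 6 channels
```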

36. A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization [PDF] 返回目录
  Ashraful Islam, Chengjiang Long, Richard J. Radke
Abstract: Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 3.8% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset. Code can be found at: this https URL.
摘要:由于训练视频中没有地面动作的真实时空位置,因此,缺乏监督的时空动作本地化是一项具有挑战性的视觉任务。在培训期间仅通过视频级别的监督,大多数现有方法都依靠多实例学习(MIL)框架来预测视频中每个动作类别的开始和结束帧。但是,现有的基于MIL的方法的主要局限性在于仅捕获动作的最具区别性的框架,而忽略了活动的全部范围。而且,这些方法不能有效地对背景活动进行建模,这在定位前景活动中起着重要作用。在本文中,我们提出了一个名为HAM-Net的新颖框架,该框架具有混合注意力机制,其中包括时间软性,半软性和硬性注意力以解决这些问题。我们的时间软性注意力模块在分类模块中的辅助背景类的指导下,通过为每个视频片段引入“行动程度”得分来对背景活动进行建模。此外,我们的时间半软和硬注意力模块可以为每个视频片段计算两个注意力得分,从而有助于将注意力集中在动作的判别力较弱的帧上,以捕获整个动作边界。我们提出的方法在THUMOS14数据集上在IoU阈值0.5处至少具有3.8%的mAP,在ActivityNet1.2数据集上在IoU阈值0.75处具有至少1.3%的mAP,其性能优于最新的方法。可以在以下网址找到代码:https URL。

37. VinVL: Making Visual Representations Matter in Vision-Language Models [PDF] 返回目录
  Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, Jianfeng Gao
Abstract: This paper presents a detailed study of improving visual representations for vision language (VL) tasks and develops an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (Anderson et al., 2018), the new model is bigger, better-designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. Therefore, it can generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model improvement untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, Oscar (Li et al., 2020), and utilize an improved approach, VinVL, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve the performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks. We will release the new object detection model to public.
摘要:本文对改进视觉语言(VL)任务的视觉表示进行了详细研究,并开发了一种改进的对象检测模型以提供以对象为中心的图像表示。与最广泛使用的自下而上和自上而下模型(Anderson et al., 2018)相比,新模型更大,针对VL任务进行了更好的设计,并且在结合了多个公开标注对象检测数据集的更大训练语料上进行了预训练。因此,它可以生成更丰富的视觉对象和概念集合的表示。尽管先前的VL研究主要集中在改善视觉语言融合模型上,而对目标检测模型的改进未曾触及,但我们显示视觉特征在VL模型中具有重要意义。在我们的实验中,我们将新对象检测模型生成的视觉特征输入到基于Transformer的VL融合模型Oscar(Li et al., 2020)中,并利用改进的方法VinVL对VL模型进行预训练,并在各种下游VL任务上对其进行微调。我们的结果表明,新的视觉特征显著改善了所有VL任务的性能,并在七个公开基准上创造了最新的结果。我们将向公众发布新的对象检测模型。

38. One-shot Representational Learning for Joint Biometric and Device Authentication [PDF] 返回目录
  Sudipta Banerjee, Arun Ross
Abstract: In this work, we propose a method to simultaneously perform (i) biometric recognition (i.e., identify the individual), and (ii) device recognition (i.e., identify the device) from a single biometric image, say, a face image, using a one-shot schema. Such a joint recognition scheme can be useful in devices such as smartphones for enhancing security as well as privacy. We propose to automatically learn a joint representation that encapsulates both biometric-specific and sensor-specific features. We evaluate the proposed approach using iris, face and periocular images acquired using near-infrared iris sensors and smartphone cameras. Experiments conducted using 14,451 images from 15 sensors resulted in a rank-1 identification accuracy of up to 99.81% and a verification accuracy of up to 100% at a false match rate of 1%.
摘要:在这项工作中,我们提出了一种方法,该方法可以同时执行(i)生物特征识别(即,识别个人)和(ii)设备识别(即,识别出设备)从单个生物特征图像(例如人脸)中进行 图像,使用单发模式。 这种联合识别方案在诸如智能手机之类的设备中可以用于增强安全性和隐私性。 我们建议自动学习一个封装生物特征和传感器特定特征的联合表示。 我们使用近红外虹膜传感器和智能手机摄像头获取的虹膜,面部和眼周图像评估提出的方法。 使用来自15个传感器的14,451张图像进行的实验在错误匹配率为1%的情况下实现了1级识别精度高达99.81%和验证精度高达100%。

39. Privacy Preserving Domain Adaptation for Semantic Segmentation of Medical Images [PDF] 返回目录
  Serban Stan, Mohammad Rostami
Abstract: Convolutional neural networks (CNNs) have led to significant improvements in tasks involving semantic segmentation of images. CNNs are vulnerable in the area of biomedical image segmentation because of the distributional gap between source and target domains with different data modalities, which leads to domain shift. Domain shift makes data annotation in new modalities necessary because models must be retrained from scratch. Unsupervised domain adaptation (UDA) is proposed to adapt a model to new modalities using solely unlabeled target domain data. Common UDA algorithms require access to data points in the source domain, which may not be feasible in medical imaging due to privacy concerns. In this work, we develop an algorithm for UDA in a privacy-constrained setting, where the source domain data is inaccessible. Our idea is based on encoding the information from the source samples into a prototypical distribution that is used as an intermediate distribution for aligning the target domain distribution with the source domain distribution. We demonstrate the effectiveness of our algorithm by comparing it to state-of-the-art medical image semantic segmentation approaches on two medical image semantic segmentation datasets.
摘要:卷积神经网络(CNN)导致涉及图像语义分割的任务得到了显着改进。由于具有不同数据模态的两个源域和目标域之间的分布间隙会导致域移位,因此CNN在生物医学图像分割领域很脆弱。域移位使得必须以新的方式注释数据,因为必须从头开始重新训练模型。提出了无监督域适应(UDA),以仅使用未标记的目标域数据将模型适应新的模式。常见的UDA算法要求访问源域中的数据点,由于隐私问题,这在医学成像中可能不可行。在这项工作中,我们开发了一种在隐私受限的环境中无法访问源域数据的UDA算法。我们的想法基于将来自源样本的信息编码为原型分布,该分布用作将目标域分布与源域分布对齐的中间分布。通过与两个医学图像语义分割数据集上的最新医学图像语义分割方法进行比较,我们证明了该算法的有效性。

40. Learning Rotation-Invariant Representations of Point Clouds Using Aligned Edge Convolutional Neural Networks [PDF] 返回目录
  Junming Zhang, Ming-Yuan Yu, Ram Vasudevan, Matthew Johnson-Roberson
Abstract: Point cloud analysis is an area of increasing interest due to the development of 3D sensors that are able to rapidly measure the depth of scenes accurately. Unfortunately, applying deep learning techniques to perform point cloud analysis is non-trivial due to the inability of these methods to generalize to unseen rotations. To address this limitation, one usually has to augment the training data, which can lead to extra computation and require larger model complexity. This paper proposes a new neural network called the Aligned Edge Convolutional Neural Network (AECNN) that learns a feature representation of point clouds relative to Local Reference Frames (LRFs) to ensure invariance to rotation. In particular, features are learned locally and aligned with respect to the LRF of an automatically computed reference point. The proposed approach is evaluated on point cloud classification and part segmentation tasks. This paper illustrates that the proposed technique outperforms a variety of state of the art approaches (even those trained on augmented datasets) in terms of robustness to rotation without requiring any additional data augmentation.
摘要:由于能够快速准确地测量场景深度的3D传感器的发展,点云分析是一个越来越引起人们关注的领域。不幸的是,由于这些方法无法推广到看不见的旋转,因此应用深度学习技术执行点云分析并非易事。为了解决这一限制,通常必须增加训练数据,这可能导致额外的计算并需要更大的模型复杂性。本文提出了一种新的神经网络,称为对齐边缘卷积神经网络(AECNN),该神经网络学习点云相对于局部参考系(LRF)的特征表示,以确保旋转不变性。特别是,特征是在本地学习的,并且相对于自动计算的参考点的LRF对齐。对点云分类和零件分割任务进行了评估。本文说明了所提出的技术在旋转鲁棒性方面优于各种最新方法(甚至是在增强数据集上训练的方法),而无需任何其他数据增强。

41. Uncertainty-sensitive Activity Recognition: a Reliability Benchmark and the CARING Models [PDF] 返回目录
  Alina Roitberg, Monica Haurilet, Manuel Martinez, Rainer Stiefelhagen
Abstract: Beyond assigning the correct class, an activity recognition model should also be able to determine how certain it is in its predictions. We present the first study of how well the confidence values of modern action recognition architectures indeed reflect the probability of the correct outcome, and propose a learning-based approach for improving it. First, we extend two popular action recognition datasets with a reliability benchmark in the form of the expected calibration error and reliability diagrams. Since our evaluation highlights that confidence values of standard action recognition architectures do not represent the uncertainty well, we introduce a new approach which learns to transform the model output into realistic confidence estimates through an additional calibration network. The main idea of our Calibrated Action Recognition with Input Guidance (CARING) model is to learn an optimal scaling parameter depending on the video representation. We compare our model with the native action recognition networks and the temperature scaling approach - a widespread calibration method utilized in image classification. While temperature scaling alone drastically improves the reliability of the confidence values, our CARING method consistently leads to the best uncertainty estimates in all benchmark settings.
摘要:除了分配正确的类别外,活动识别模型还应该能够确定其预测中的确定性。我们目前就现代动作识别架构的置信度如何真正反映正确结果的可能性进行了首次研究,并提出了一种基于学习的方法来对其进行改进。首先,我们以预期的校准误差和可靠性图的形式扩展了两个具有可靠性基准的流行动作识别数据集。由于我们的评估突出表明标准动作识别架构的置信度值不能很好地表示不确定性,因此我们引入了一种新方法,该方法可以学习通过附加的校准网络将模型输出转换为实际的置信度估计。我们的带有输入指导的校准动作识别(CARING)模型的主要思想是根据视频表示学习最佳的缩放参数。我们将模型与本机动作识别网络和温度缩放方法(一种用于图像分类的广泛校准方法)进行了比较。虽然仅通过温度缩放就可以极大地提高置信度值的可靠性,但我们的CARING方法始终可在所有基准设置中带来最佳的不确定性估计。
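For reference, a sketch of the expected calibration error (ECE) used as the reliability benchmark above: predictions are binned by confidence, and the gap between mean confidence and accuracy is averaged, weighted by bin size. Bin count is an assumption.

```python
# Sketch: standard expected calibration error over confidence bins.
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """conf: (N,) max softmax scores; correct: (N,) 0/1 prediction hits."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (conf > lo) & (conf <= hi)
        if m.any():
            ece += m.mean() * abs(conf[m].mean() - correct[m].mean())
    return ece
```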

42. Refining activation downsampling with SoftPool [PDF] 返回目录
  Alexandros Stergiou, Ronald Poppe, Grigorios Kalliatakis
Abstract: Convolutional Neural Networks (CNNs) use pooling to decrease the size of activation maps. This process is crucial to locally achieve spatial invariance and to increase the receptive field of subsequent convolutions. Pooling operations should minimize the loss of information in the activation maps. At the same time, the computation and memory overhead should be limited. To meet these requirements, we propose SoftPool: a fast and efficient method that sums exponentially weighted activations. Compared to a range of other pooling methods, SoftPool retains more information in the downsampled activation maps. More refined downsampling leads to better classification accuracy. On ImageNet1K, for a range of popular CNN architectures, replacing the original pooling operations with SoftPool leads to consistent accuracy improvements in the order of 1-2%. We also test SoftPool on video datasets for action recognition. Again, replacing only the pooling layers consistently increases accuracy while computational load and memory remain limited. These favorable properties make SoftPool an excellent replacement for current pooling operations, including max-pool and average-pool.
摘要:卷积神经网络(CNN)使用合并来减小激活图的大小。这个过程对于局部实现空间不变性和增加后续卷积的接收场至关重要。池操作应最大程度地减少激活图中的信息丢失。同时,应限制计算和内存开销。为了满足这些要求,我们提出了SoftPool:一种快速有效的方法,可以对指数加权的激活求和。与一系列其他合并方法相比,SoftPool在下采样的激活图中保留了更多信息。更精细的下采样可导致更好的分类准确性。在ImageNet1K上,对于一系列流行的CNN架构,用SoftPool替换原始的合并操作会导致精度不断提高1-2%。我们还在视频数据集上测试SoftPool以进行动作识别。同样,仅替换池层将持续提高准确性,而计算负载和内存仍然有限。这些有利的性能使SoftPool成为当前池操作(包括最大池和平均池)的理想替代品
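A sketch of the SoftPool operation as described: activations in each pooling window are combined with softmax (exponential) weights, so stronger activations dominate without weaker ones being discarded outright. The ratio of average pools equals the ratio of sums, since the window counts cancel.

```python
# Sketch: exponentially weighted pooling, sum(e*x)/sum(e) per window.
import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    e = torch.exp(x)   # note: a production version would guard against overflow
    num = F.avg_pool2d(x * e, kernel_size, stride)
    den = F.avg_pool2d(e, kernel_size, stride)
    return num / den   # = sum(e*x) / sum(e) per window

y = soft_pool2d(torch.randn(1, 16, 32, 32))  # -> (1, 16, 16, 16)
```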

43. On the confidence of stereo matching in a deep-learning era: a quantitative evaluation [PDF] 返回目录
  Matteo Poggi, Seungryong Kim, Fabio Tosi, Sunok Kim, Filippo Aleotti, Dongbo Min, Kwanghoon Sohn, Stefano Mattoccia
Abstract: Stereo matching is one of the most popular techniques to estimate dense depth maps by finding the disparity between matching pixels on two, synchronized and rectified images. Alongside with the development of more accurate algorithms, the research community focused on finding good strategies to estimate the reliability, i.e. the confidence, of estimated disparity maps. This information proves to be a powerful cue to naively find wrong matches as well as to improve the overall effectiveness of a variety of stereo algorithms according to different strategies. In this paper, we review more than ten years of developments in the field of confidence estimation for stereo matching. We extensively discuss and evaluate existing confidence measures and their variants, from hand-crafted ones to the most recent, state-of-the-art learning based methods. We study the different behaviors of each measure when applied to a pool of different stereo algorithms and, for the first time in literature, when paired with a state-of-the-art deep stereo network. Our experiments, carried out on five different standard datasets, provide a comprehensive overview of the field, highlighting in particular both strengths and limitations of learning-based strategies.
摘要:立体匹配是通过在两个同步和校正后的图像上找到匹配像素之间的差异来估计密集深度图的最受欢迎的技术之一。除了开发更准确的算法外,研究界还致力于寻找好的方法来估计估计的视差图的可靠性,即置信度。该信息被证明是天真地找到错误匹配项以及根据不同策略提高各种立体声算法整体效果的有力提示。在本文中,我们回顾了立体声匹配置信度估计领域十多年的发展。我们广泛讨论和评估现有的置信度测度及其变体,从手工制作的测度方法到最新的基于学习的先进方法。我们研究了将每种测量方法应用于不同的立体声算法池时的不同行为,并且在文献中首次将其与最新的深度立体声网络配对使用时,我们研究了每种方法的不同行为。我们在五个不同的标准数据集上进行的实验提供了对该领域的全面概述,特别强调了基于学习的策略的优势和局限性。

44. Image-based Textile Decoding [PDF] 返回目录
  Siqiang Chen, Masahiro Toyoura, Takamasa Terada, Xiaoyang Mao, Gang Xu
Abstract: A textile fabric consists of countless parallel vertical yarns (warps) and horizontal yarns (wefts). While common looms can weave repetitive patterns, Jacquard looms can weave patterns without repetition restrictions. A pattern in which the warps and wefts cross on a grid is defined in a binary matrix. The binary matrix can define which warp and weft is on top at each grid point of the Jacquard fabric. The process can be regarded as encoding from pattern to textile. In this work, we propose a decoding method that generates a binary pattern from a textile fabric that has already been woven. We could not use a deep neural network to learn the process based solely on the training set of patterns and observed fabric images. The crossing points in the observed image were not completely located on the grid points, so it was difficult to establish a direct correspondence between the fabric images and the pattern represented by the matrix in a deep learning framework. Therefore, we propose a method that applies the deep learning framework via an intermediate representation of patterns and images. We show how to convert a pattern into an intermediate representation and how to convert the output back into a pattern, and confirm its effectiveness. In this experiment, we confirmed that 93% of the correct pattern was obtained by decoding patterns from the actual fabric images and weaving them again.
摘要:织物由无数平行的垂直纱线(经线)和水平纱线(纬线)组成。普通织机可以织造重复图案,提花织机可以织造图案,没有重复限制。在二进制矩阵中定义了经线和纬线在网格上交叉的模式。二进制矩阵可以定义在提花织物的每个网格点上哪个经纱和纬纱在顶部。该过程可以视为从图案到纺织品的编码。在这项工作中,我们提出了一种解码方法,该方法可以从已经编织的织物中生成二进制图案。我们不能使用深度神经网络仅基于训练的图案和观察到的织物图像来学习过程。观察图像中的交叉点未完全位于网格点上,因此在深度学习的框架中很难在织物图像和矩阵表示的图案之间取得直接的对应关系。因此,我们提出了一种可以通过模式和图像的中间表示来应用深度学习框架的方法。我们展示了如何将模式转换为中间表示,以及如何将输出转换为模式并确认其有效性。在该实验中,我们确认通过从实际织物图像中解码出图案并再次进行编织,可以得到93%的正确图案。

45. Video Captioning in Compressed Video [PDF] 返回目录
  Mingjian Zhu, Chenrui Duan, Changbin Yu
Abstract: Existing approaches in video captioning concentrate on exploring global frame features in uncompressed videos, while the free and critical saliency information already encoded in the compressed videos is generally neglected. We propose a video captioning method which operates directly on the stored compressed videos. To learn a discriminative visual representation for video captioning, we design a residuals-assisted encoder (RAE), which spots regions of interest in I-frames with the assistance of the residual frames. First, we obtain the spatial attention weights by extracting features of the residuals as the saliency value of each location in the I-frame, and design a spatial attention module to refine the attention weights. We further propose a temporal gate module to determine how much the attended features contribute to the caption generation, which enables the model to resist the disturbance of noisy signals in the compressed videos. Finally, a Long Short-Term Memory network is utilized to decode the visual representations into descriptions. We evaluate our method on two benchmark datasets and demonstrate the effectiveness of our approach.
摘要:视频字幕的现有方法集中于探索未压缩视频中的全局帧特征,而通常已经忽略了已压缩视频中已编码的免费信息和关键显着性信息。我们提出了一种视频字幕方法,可以直接对存储的压缩视频进行操作。为了学习视频字幕的区别性视觉表示,我们设计了残差辅助编码器(RAE),该残差辅助编码器在残差帧的帮助下在I帧中发现了感兴趣的区域。首先,我们通过提取残差特征作为I帧中每个位置的显着性值来获得空间注意权重,并设计一个空间注意模块来细化注意权重。我们进一步提出了一个时间门模块,以确定参与的功能对字幕生成的贡献,这使模型能够抵抗压缩视频中某些噪声信号的干扰。最后,长短期记忆被用来将视觉表示解码为描述。我们在两个基准数据集上评估了我们的方法,并证明了该方法的有效性。

46. Multi-Image Steganography Using Deep Neural Networks [PDF] 返回目录
  Abhishek Das, Japsimar Singh Wahi, Mansi Anand, Yugant Rana
Abstract: Steganography is the science of hiding a secret message within an ordinary public message. Over the years, steganography has been used to encode a lower resolution image into a higher resolution image by simple methods like LSB manipulation. We aim to utilize deep neural networks for the encoding and decoding of multiple secret images inside a single cover image of the same resolution.
摘要:隐秘术是一种在普通公共消息中隐藏秘密消息的科学。 多年来,隐写术已被用于通过诸如LSB操作之类的简单方法将较低分辨率的图像编码为高分辨率图像。 我们旨在利用深度神经网络对具有相同分辨率的单个封面图像中的多个秘密图像进行编码和解码。
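A sketch of the classical LSB baseline mentioned above (not the proposed network): each cover byte's least significant bit is replaced with one bit of the secret payload.

```python
# Sketch: least-significant-bit hiding and recovery on a uint8 image array.
import numpy as np

def lsb_hide(cover, secret_bits):
    """cover: uint8 array; secret_bits: flat 0/1 array, len <= cover.size."""
    flat = cover.flatten()                       # flatten() returns a copy
    bits = secret_bits.astype(np.uint8)
    flat[:len(bits)] = (flat[:len(bits)] & 0xFE) | bits
    return flat.reshape(cover.shape)

def lsb_reveal(stego, n_bits):
    return stego.flatten()[:n_bits] & 1
```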

47. An Artificial Intelligence System for Combined Fruit Detection and Georeferencing, Using RTK-Based Perspective Projection in Drone Imagery [PDF] 返回目录
  Angus Baird, Stefano Giani
Abstract: This work presents an Artificial Intelligence (AI) system, based on the Faster Region-Based Convolution Neural Network (Faster R-CNN) framework, which detects and counts apples from oblique, aerial drone imagery of giant commercial orchards. To reduce computational cost, a novel precursory stage to the network is designed to preprocess raw imagery into cropped images of individual trees. Unique geospatial identifiers are allocated to these using the perspective projection model. This employs Real-Time Kinematic (RTK) data, Digital Terrain and Surface Models (DTM and DSM), as well as internal and external camera parameters. The bulk of experiments however focus on tuning hyperparameters in the detection network itself. Apples which are on trees and apples which are on the ground are treated as separate classes. A mean Average Precision (mAP) metric, calibrated by the size of the two classes, is devised to mitigate spurious results. Anchor box design is of key interest due to the scale of the apples. As such, a k-means clustering approach, never before seen in literature for Faster R-CNN, resulted in the most significant improvements to calibrated mAP. Other experiments showed that the maximum number of box proposals should be 225; the initial learning rate of 0.001 is best applied to the adaptive RMS Prop optimiser; and ResNet 101 is the ideal base feature extractor when considering mAP and, to a lesser extent, inference time. The amalgamation of the optimal hyperparameters leads to a model with a calibrated mAP of 0.7627.
摘要:这项工作提出了一种基于快速区域卷积神经网络(Faster R-CNN)框架的人工智能(AI)系统,该系统可以从大型商业果园的倾斜,空中无人机图像中检测和计数苹果。为了减少计算成本,网络的一个新的前期阶段被设计为将原始图像预处理为单个树木的裁剪图像。使用透视投影模型将唯一的地理空间标识符分配给这些标识符。它使用实时运动(RTK)数据,数字地形和表面模型(DTM和DSM)以及内部和外部摄像机参数。但是,大量实验着重于调整检测网络本身中的超参数。树木上的苹果和地面上的苹果被视为单独的类。设计了通过两类的大小校准的平均平均精度(mAP)度量,以减轻虚假结果。由于苹果的大小,锚盒的设计非常重要。因此,k均值聚类方法在Faster R-CNN的文献中从未见过,它对校准的mAP进行了最显着的改进。其他实验表明,箱式提案的最大数量应为225; 0.001的初始学习率最好应用于自适应RMS Prop优化器; ResNet 101是考虑mAP并在较小程度上考虑推理时间时的理想基础特征提取器。最佳超参数的合并导致模型的mAP校准值为0.7627。
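A sketch of the k-means anchor design described above: cluster the (width, height) of labeled apple boxes and use the centroids as anchor shapes. Plain Euclidean k-means is shown as a simple stand-in, and the box data below is a random placeholder.

```python
# Sketch: derive detector anchor shapes by clustering ground-truth box sizes.
import numpy as np
from sklearn.cluster import KMeans

wh = np.random.rand(500, 2) * 40 + 10   # placeholder (w, h) of annotated boxes
anchors = KMeans(n_clusters=9, n_init=10).fit(wh).cluster_centers_
print(np.round(anchors, 1))             # nine anchor shapes for the detector
```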

48. Biologically Inspired Hexagonal Deep Learning for Hexagonal Image Generation [PDF] 返回目录
  Tobias Schlosser, Frederik Beuth, Danny Kowerko
Abstract: Whereas conventional state-of-the-art image processing systems of recording and output devices almost exclusively utilize square arranged methods, biological models, however, suggest an alternative, evolutionarily-based structure. Inspired by the human visual perception system, hexagonal image processing in the context of machine learning offers a number of key advantages that can benefit both researchers and users alike. The hexagonal deep learning framework Hexnet leveraged in this contribution serves therefore the generation of hexagonal images by utilizing hexagonal deep neural networks (H-DNN). As the results of our created test environment show, the proposed models can surpass current approaches of conventional image generation. While resulting in a reduction of the models' complexity in the form of trainable parameters, they furthermore allow an increase of test rates in comparison to their square counterparts.
摘要:尽管传统的记录和输出设备的先进图像处理系统几乎只使用正方形排列的方法,但是生物学模型却提出了一种替代的,基于进化的结构。 受人类视觉感知系统的启发,机器学习环境中的六边形图像处理具有许多关键优势,可以使研究人员和用户都受益。 因此,利用此贡献所利用的六边形深度学习框架Hexnet通过利用六边形深度神经网络(H-DNN)为生成六边形图像提供服务。 正如我们创建的测试环境的结果所示,所提出的模型可以超越传统图像生成的当前方法。 尽管以可训练参数的形式降低了模型的复杂性,但与正方形模型相比,它们还允许提高测试率。

49. Subtype-aware Unsupervised Domain Adaptation for Medical Diagnosis [PDF] 返回目录
  Xiaofeng Liu, Xiongchang Liu, Bo Hu, Wenxuan Ji, Fangxu Xing, Jun Lu, Jane You, C.-C. Jay Kuo, Georges El Fakhri, Jonghye Woo
Abstract: Recent advances in unsupervised domain adaptation (UDA) show that transferable prototypical learning presents a powerful means for class conditional alignment, which encourages the closeness of cross-domain class centroids. However, the cross-domain inner-class compactness and the underlying fine-grained subtype structure remained largely underexplored. In this work, we propose to adaptively carry out the fine-grained subtype-aware alignment by explicitly enforcing the class-wise separation and subtype-wise compactness with intermediate pseudo labels. Our key insight is that the unlabeled subtypes of a class can be divergent to one another with different conditional and label shifts, while inheriting the local proximity within a subtype. The cases of with or without the prior information on subtype numbers are investigated to discover the underlying subtype structure in an online fashion. The proposed subtype-aware dynamic UDA achieves promising results on medical diagnosis tasks.
摘要:无监督域自适应(UDA)的最新进展表明,可转移的原型学习为类条件条件对齐提供了强大的手段,这鼓励了跨域类质心的紧密性。但是,跨域内部类的紧致度和底层的细粒度子类型结构仍然很大程度上未被开发。在这项工作中,我们建议通过使用中间伪标记显式强制执行类分离和子类型紧实度,来自适应地执行细粒度的子类型感知对齐。我们的主要见识在于,一类未标记的子类型可以通过不同的条件和标签移位彼此分离,同时继承子类型内的局部接近性。研究有或没有子类型编号的先验信息的情况,以在线方式发现潜在的子类型结构。拟议的可识别亚型的动态UDA在医学诊断任务上取得了可喜的结果。

50. Identity-aware Facial Expression Recognition in Compressed Video [PDF] 返回目录
  Xiaofeng Liu, Linghao Jin, Xu Han, Jun Lu, Jane You, Lingsheng Kong
Abstract: This paper aims to explore a facial expression representation with inter-subject variations eliminated, in the compressed video domain. Most previous methods process the RGB images of a sequence, while the off-the-shelf and valuable expression-related muscle movement is already embedded in the compression format. In the up-to-two-orders-of-magnitude compressed domain, we can explicitly infer the expression from the residual frames and extract identity factors from the I-frame with a pre-trained face recognition network. By enforcing the marginal independence of them, the expression feature is expected to be purer for the expression and robust to identity shifts. We do not need the identity label or multiple expression samples from the same person for identity elimination. Moreover, when the apex frame is annotated in the dataset, a complementary constraint can be further added to regularize the feature-level game. In testing, only the compressed residual frames are required to achieve expression prediction. Our solution can achieve comparable or better performance than recent decoded-image-based methods on the typical FER benchmarks, with about 3× faster inference on compressed data.
摘要:本文旨在探讨在压缩视频域中对象间变异消除的面部表情表示。大多数以前的方法都处理序列的RGB图像,而现成的和有价值的表达相关的肌肉运动已经嵌入到压缩格式中。在最多两个数量级的压缩域中,我们可以从残差帧中明确推断出表达式,并可以使用预训练的人脸识别网络从I帧中提取身份因子。通过强制边缘独立于它们,可以期望表达特征对于表达而言是更纯净的,并且对于身份转移是可靠的。我们不需要同一人的身份标签或多个表达样本来消除身份。此外,当在数据集中对顶点框架进行注释时,可以进一步添加互补约束以使特征级游戏规则化。在测试中,仅需要压缩的残差帧即可实现表达预测。我们的解决方案可以比典型的FER基准上的基于最近解码图像的方法实现可比或更好的性能,并且对压缩数据的推断速度大约快3倍。

51. Energy-constrained Self-training for Unsupervised Domain Adaptation [PDF] 返回目录
  Xiaofeng Liu, Bo Hu, Xiongchang Liu, Jun Lu, Jane You, Lingsheng Kong
Abstract: Unsupervised domain adaptation (UDA) aims to transfer the knowledge on a labeled source domain distribution to perform well on an unlabeled target domain. Recently, the deep self-training involves an iterative process of predicting on the target domain and then taking the confident predictions as hard pseudo-labels for retraining. However, the pseudo-labels are usually unreliable, and easily leading to deviated solutions with propagated errors. In this paper, we resort to the energy-based model and constrain the training of the unlabeled target sample with the energy function minimization objective. It can be applied as a simple additional regularization. In this framework, it is possible to gain the benefits of the energy-based model, while retaining strong discriminative performance following a plug-and-play fashion. We deliver extensive experiments on the most popular and large scale UDA benchmarks of image classification as well as semantic segmentation to demonstrate its generality and effectiveness.
摘要:无监督域自适应(UDA)的目的是在标记的源域分布上传递知识,以在未标记的目标域上表现良好。最近,深度自我训练涉及在目标域上进行预测的迭代过程,然后将置信度预测作为重新训练的硬伪标记。但是,伪标签通常不可靠,容易导致带有传播错误的解决方案出现偏差。在本文中,我们诉诸基于能量的模型,并以能量函数最小化目标约束对未标记目标样本的训练。它可以作为简单的附加正则化应用。在此框架中,有可能获得基于能量的模型的好处,同时遵循即插即用的方式保留强大的区分性能。我们针对图像分类以及语义分割的最流行和大规模UDA基准进行了广泛的实验,以证明其通用性和有效性。

52. Iranis: A Large-scale Dataset of Farsi License Plate Characters [PDF] 返回目录
  Ali Tourani, Sajjad Soroori, Asadollah Shahbahrami, Alireza Akoushideh
Abstract: Providing huge amounts of data is a fundamental demand when dealing with Deep Neural Networks (DNNs). Employing these algorithms to solve computer vision problems resulted in the advent of various image datasets to feed the most common visual imagery deep structures, known as Convolutional Neural Networks (CNNs). In this regard, some datasets can be found that contain hundreds or even thousands of images for license plate detection and optical character recognition purposes. However, no publicly available image dataset provides such data for the recognition of Farsi characters used in car license plates. The gap has to be filled due to the numerous advantages of developing accurate deep learning-based systems for law enforcement and surveillance purposes. This paper introduces a large-scale dataset that includes images of numbers and characters used in Iranian car license plates. The dataset, named Iranis, contains more than 83,000 images of Farsi numbers and letters collected from real-world license plate images captured by various cameras. The variety of instances in terms of camera shooting angle, illumination, resolution, and contrast make the dataset a proper choice for training DNNs. Dataset images are manually annotated for object detection and image classification. Finally, and to build a baseline for Farsi character recognition, the paper provides a performance analysis using a YOLO v.3 object detector.
摘要:提供大量数据是处理深度神经网络(DNN)的基本要求。利用这些算法解决计算机视觉问题导致出现了各种图像数据集,以馈送最常见的视觉图像深层结构,即卷积神经网络(CNN)。在这方面,可以发现一些数据集,其中包含数百个甚至数千个图像,用于车牌检测和光学字符识别。但是,没有可公开获得的图像数据集提供此类数据来识别汽车牌照中使用的波斯语字符。由于为执法和监视目的开发基于精确的深度学习的系统具有众多优势,因此必须填补这一空白。本文介绍了一个大规模的数据集,其中包括伊朗汽车牌照中使用的数字和字符的图像。该数据集名为Iranis,包含超过83,000张波斯数字和字母的图像,这些图像是从各种摄像机捕获的真实车牌图像中收集的。摄像机拍摄角度,照明,分辨率和对比度方面的各种实例使数据集成为训练DNN的合适选择。手动注释数据集图像以进行对象检测和图像分类。最后,为了建立波斯语字符识别的基准,本文使用YOLO v.3对象检测器提供了性能分析。

53. VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search [PDF] 返回目录
  Xiaopeng Lu, Tiancheng Zhao, Kyusong Lee
Abstract: Text-to-image retrieval is an essential task in multi-modal information retrieval, i.e. retrieving relevant images from a large and unlabelled image dataset given textual queries. In this paper, we propose VisualSparta, a novel text-to-image retrieval model that shows substantial improvement over existing models on both accuracy and efficiency. We show that VisualSparta is capable of outperforming all previous scalable methods in MSCOCO and Flickr30K. It also shows substantial retrieving speed advantages, i.e. for an index with 1 million images, VisualSparta gets over 391x speed up compared to standard vector search. Experiments show that this speed advantage even gets bigger for larger datasets because VisualSparta can be efficiently implemented as an inverted index. To the best of our knowledge, VisualSparta is the first transformer-based text-to-image retrieval model that can achieve real-time searching for very large dataset, with significant accuracy improvement compared to previous state-of-the-art methods.
摘要:文本到图像的检索是多模式信息检索中的一项基本任务,即在给定文本查询的情况下从大型且未标记的图像数据集中检索相关图像。在本文中,我们提出了VisualSparta,这是一种新颖的文本到图像检索模型,该模型在准确性和效率上都比现有模型显着提高。我们证明VisualSparta能够胜过MSCOCO和Flickr30K中所有以前的可伸缩方法。它还显示了实质性的检索速度优势,即对于具有100万张图像的索引,与标准向量搜索相比,VisualSparta的速度提高了391倍以上。实验表明,对于较大的数据集,这种速度优势甚至更大,因为VisualSparta可以有效地实现为反向索引。据我们所知,VisualSparta是第一个基于转换器的文本到图像检索模型,可以实现对非常大的数据集的实时搜索,与以前的最新方法相比,其准确性显着提高。

54. Adaptive Deconvolution-based stereo matching Net for Local Stereo Matching [PDF] 返回目录
  Xin Ma, Zhicheng Zhang, Danfeng Wang, Yu Luo, Hui Yuan
Abstract: In deep learning-based local stereo matching methods, larger image patches usually bring better stereo matching accuracy. However, it is unrealistic to increase the size of the image patch size without restriction. Arbitrarily extending the patch size will change the local stereo matching method into the global stereo matching method, and the matching accuracy will be saturated. We simplified the existing Siamese convolutional network by reducing the number of network parameters and propose an efficient CNN based structure, namely Adaptive Deconvolution-based disparity matching Net (ADSM net) by adding deconvolution layers to learn how to enlarge the size of input feature map for the following convolution layers. Experimental results on the KITTI 2012 and 2015 datasets demonstrate that the proposed method can achieve a good trade-off between accuracy and complexity.
摘要:在基于深度学习的局部立体声匹配方法中,较大的图像块通常会带来更好的立体声匹配精度。 然而,没有限制地增加图像补丁大小的大小是不现实的。 任意扩展补丁大小会将本地立体声匹配方法更改为全局立体声匹配方法,并且匹配精度将达到饱和。 通过减少网络参数的数量,我们简化了现有的暹罗卷积网络,并提出了一种有效的基于CNN的结构,即通过添加反卷积层来学习如何扩大输入特征图的大小,从而基于自适应反卷积的视差匹配网(ADSM net)。 以下卷积层。 在KITTI 2012和2015数据集上的实验结果表明,该方法可以在准确性和复杂性之间取得良好的折衷。

55. Brain Tumor Detection and Classification based on Hybrid Ensemble Classifier [PDF] 返回目录
  Ginni Garg, Ritu Garg
Abstract: To improve patient survival and treatment outcomes, early diagnosis of brain tumors is an essential task. It is difficult to evaluate magnetic resonance imaging (MRI) images manually. Thus, there is a need for digital methods for tumor diagnosis with better accuracy. However, assessing tumor shape, volume, boundaries, detection, size, segmentation, and classification remains very challenging. In this proposed work, we propose a hybrid ensemble method using Random Forest (RF), K-Nearest Neighbour (KNN), and Decision Tree (DT) (KNN-RF-DT) based on the majority voting method. It aims to calculate the area of the tumor region and classify brain tumors as benign or malignant. First, segmentation is performed using Otsu's thresholding method. Feature extraction is done using Stationary Wavelet Transform (SWT), Principal Component Analysis (PCA), and the Gray Level Co-occurrence Matrix (GLCM), which together give thirteen features for classification. Classification is done by the hybrid ensemble classifier (KNN-RF-DT) based on majority voting. Overall, the work aims at improving performance with traditional classifiers instead of resorting to deep learning. Traditional classifiers have an advantage over deep learning algorithms because they require small datasets for training, have low computational time complexity and low cost to the users, and can be easily adopted by less skilled people. Overall, our proposed method is tested on a dataset of 2556 images, split 85:15 between training and testing respectively, and gives a good accuracy of 97.305%.
摘要:为了提高患者的生存率和治疗效果,脑肿瘤的早期诊断是一项必不可少的任务。手动评估磁共振成像(MRI)图像是一项艰巨的任务。因此,需要具有更好准确性的数字方法来进行肿瘤诊断。然而,在评估它们的形状,体积,边界,肿瘤检测,大小,分割和分类方面仍然是一项非常具有挑战性的任务。在这项拟议的工作中,我们提出了一种基于多数投票方法的使用随机森林(RF),K最近邻和决策树(DT)(KNN-RF-DT)的混合集成方法。它旨在计算肿瘤区域的面积,并将脑肿瘤分类为良性和恶性。首先,使用Otsu的Threshold方法进行细分。特征提取是通过使用固定小波变换(SWT),主成分分析(PCA)和灰度共生矩阵(GLCM)进行的,它提供了13种分类特征。通过基于多数投票方法的混合集成分类器(KNN-RF-DT)进行分类。总体而言,它旨在提高传统分类器的性能,而不是进行深度学习。传统分类器相对于深度学习算法具有优势,因为它们需要用于训练的小型数据集,并且计算时间复杂度低,用户成本低,并且容易被技术水平较低的人采用。总体而言,我们的方法是在2556个图像的数据集上进行测试的,这些图像分别在85:15的训练和测试中使用,并具有97.305%的良好准确性。
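A sketch of the KNN-RF-DT majority-voting ensemble using scikit-learn on placeholder feature vectors (the paper extracts 13 SWT/PCA/GLCM features per image; the data below is random).

```python
# Sketch: hard-voting ensemble of KNN, Random Forest, and Decision Tree.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(200, 13)             # placeholder 13-feature vectors
y = np.random.randint(0, 2, 200)        # 0 = benign, 1 = malignant

ensemble = VotingClassifier(
    estimators=[("knn", KNeighborsClassifier()),
                ("rf", RandomForestClassifier()),
                ("dt", DecisionTreeClassifier())],
    voting="hard")                      # "hard" = majority vote
ensemble.fit(X, y)
```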

56. Improved Neural Network based Plant Diseases Identification [PDF] 返回目录
  Ginni Garg, Mantosh Biswas
Abstract: The agriculture sector is essential for every country because it provides a basic income to a large number of people, as well as food, which is a fundamental requirement for survival on this planet. As time passes, significant changes have come in the present era, beginning with the Green Revolution. Due to improper knowledge of plant diseases, farmers use fertilizers in excess, which ultimately degrades the quality of food. Earlier, farmers relied on experts to determine the type of plant disease, which was expensive and time-consuming. Today, image processing is used to recognize and catalog plant diseases from the lesion region of the plant leaf, and there are different approaches for detecting plant disease from leaves, using Neural Networks (NN), Support Vector Machines (SVM), and others. In this paper, we improve the architecture of the neural network by evaluating ten different types of training algorithms and the proper choice of neurons in the hidden layer. Our proposed approach gives 98.30% accuracy on general plant leaf disease and 100% accuracy on specific plant leaf disease, based on Bayesian regularization and automation of clustering, without over-fitting on the considered plant diseases, outperforming various other implemented methods.
摘要:农业部门对于每个国家都是必不可少的,因为它也为大量人口和粮食提供了基本收入,这是在这个星球上生存的基本要求。我们看到,随着时间的流逝,当今时代发生了重大变化,这始于绿色革命。由于对植物病害的了解不足,农民使用了过量的肥料,最终使食品质量下降。早期的农民使用专家来确定植物病的类型,这种病既昂贵又费时。在当今时代,图像处理被用于通过植物叶片的病变区域来识别和分类植物病害,并且使用神经网络(NN),支持向量机(SVM)和叶片对植物病害气味的操作方法有所不同。其他。在本文中,我们通过研究十种不同类型的训练算法以及在隐藏层中正确选择神经元来改进神经网络的体系结构。我们提出的方法基于贝叶斯正则化,集群自动化,在一般植物叶病上的准确度为98.30%,在特定植物叶病上的准确度为100%,并且不会因各种其他已实施方法而过分考虑植物病害。

57. A Hybrid MLP-SVM Model for Classification using Spatial-Spectral Features on Hyper-Spectral Images [PDF] 返回目录
  Ginni Garg, Dheeraj Kumar, ArvinderPal, Yash Sonker, Ritu Garg
Abstract: There are many challenges in the classification of hyperspectral images, such as large dimensionality, scarcity of labeled data, and spatial variability of spectral signatures. In this proposed method, we build a hybrid classifier (MLP-SVM) using a multilayer perceptron (MLP) and a support vector machine (SVM), which aims to improve various classification parameters such as accuracy, precision, recall, and F-score, and to predict regions without ground truth. In the proposed method, outputs from the last hidden layer of the neural network become the input to the SVM, which finally classifies into the various desired classes. In the present study, we worked on the Indian Pines, U. Pavia and Salinas datasets with 16, 9, and 16 classes and 200, 103 and 204 reflectance bands respectively, which are provided by the AVIRIS and ROSIS sensors of the NASA Jet Propulsion Laboratory. The proposed method significantly increases the accuracy on the testing datasets to 93.22%, 96.87%, and 93.81%, compared to 86.97%, 88.58%, 88.85% and 91.61%, 96.20%, 90.68% based on the individual classifiers SVM and MLP on the Indian Pines, U. Pavia and Salinas datasets respectively.
摘要:高光谱图像的分类面临很多挑战,例如大尺寸,标记数据的稀缺性和光谱特征的空间变异性。在此提议的方法中,我们使用多层感知器(MLP)和支持向量机(SVM)制作了一种混合分类器(MLP-SVM),旨在改善各种分类参数,例如准确性,精度,召回率,f得分并进行预测没有地面真理的地区。在提出的方法中,来自神经网络的最后一个隐藏层的输出成为SVM的输入,SVM最终将其分类为各种所需的类。在本研究中,我们研究了印度松,美国帕维亚和萨利纳斯的数据集,分别具有16、9、16类和200、103和204反射带,这是由NASA喷气推进实验室的AVIRIS和ROSIS传感器提供的。基于印度松树上的单个分类器SVM和MLP,所提出的方法将测试数据集的准确性显着提高到93.22%,96.87%,93.81%,而86.97%,88.58%,88.85%和91.61%,96.20%,90.68%则显着提高了U. Pavia和Salinas数据集。
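A sketch of the hybrid MLP-SVM idea: train an MLP, then feed its last hidden-layer activations to an SVM for the final decision. Layer sizes and data are placeholders, not the paper's configuration.

```python
# Sketch: MLP hidden activations as input features for an SVM classifier.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X = np.random.rand(300, 200)            # placeholder spectral vectors (200 bands)
y = np.random.randint(0, 16, 300)       # e.g. 16 Indian Pines classes

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X, y)

def hidden_features(mlp, X):
    """Manual forward pass up to the last hidden layer (default relu)."""
    h = X
    for w, b in zip(mlp.coefs_[:-1], mlp.intercepts_[:-1]):
        h = np.maximum(h @ w + b, 0.0)
    return h

svm = SVC(kernel="rbf").fit(hidden_features(mlp, X), y)
```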

58. More than just an auxiliary loss: Anti-spoofing Backbone Training via Adversarial Pseudo-depth Generation [PDF] 返回目录
  Chang Keun Paik, Naeun Ko, Youngjoon Yoo
Abstract: In this paper, a new training pipeline is discussed to achieve significant performance on the task of anti-spoofing with RGB images. We explore and highlight the impact of using pseudo-depth to pre-train a network that will be used as the backbone of the final classifier. While the usage of pseudo-depth for anti-spoofing tasks is not a new idea on its own, previous endeavours utilize pseudo-depth simply as another medium from which to extract features for performing prediction, or as one of many auxiliary losses aiding the training of the main classifier, normalizing the importance of pseudo-depth to just another piece of semantic information. Through this work, we argue that a significant advantage in training the final classifier can be gained from a pre-trained generator learning to predict the corresponding pseudo-depth of a given facial image within a Generative Adversarial Network framework. Our experimental results indicate that our method results in a much more adaptable system that can generalize beyond intra-dataset samples to inter-dataset samples, which it has never seen before during training. Quantitatively, our method approaches the baseline performance of current state-of-the-art anti-spoofing models with 15.8x fewer parameters. Moreover, experiments showed that the introduced methodology performs well using only basic binary labels without additional semantic information, which indicates potential benefits of this work in industrial and application-based environments where the trade-off between additional labelling and resources is considered.
摘要:本文讨论了一种新的流水线训练方法,该方法可在处理RGB图像的反欺骗任务上取得显着性能。我们探索并强调了使用伪深度来预训练将用作最终分类器主干的网络的影响。虽然将伪深度用于反欺骗任务本身并不是一个新主意,但以前的尝试只是将伪深度用作另一种提取特征以进行预测的媒介,或者作为辅助训练的许多辅助损失的一部分主要分类器,将伪深度的重要性标准化为另一种语义信息。通过这项工作,我们认为在训练最终分类器方面存在显着优势,这可以通过从生成的对抗网络框架中进行预训练的生成器学习来预测给定面部图像的相应伪深度来获得。我们的实验结果表明,我们的方法产生的适应性强得多的系统可以推广到数据集内样本之外,但可以推广到数据集间样本,这在训练期间从未见过。从数量上讲,我们的方法使用的参数减少了15.8倍,从而接近了当前最先进的反欺骗模型的基线性能。此外,实验表明,所介绍的方法仅在使用基本二进制标签而没有附加语义信息的情况下才能很好地运行,这表明在考虑了附加标签和资源之间权衡的工业和基于应用的环境中这项工作的潜在优势。

59. CIZSL++: Creativity Inspired Generative Zero-Shot Learning [PDF] 返回目录
  Mohamed Elhoseiny, Kai Yi, Mohamed Elfeki
Abstract: Zero-shot learning (ZSL) aims at understanding unseen categories with no training examples from class-level descriptions. To improve the discriminative power of ZSL, we model the visual learning process of unseen categories with inspiration from the psychology of human creativity for producing novel art. First, we propose CIZSL-v1 as a creativity inspired model for generative ZSL. We relate ZSL to human creativity by observing that ZSL is about recognizing the unseen, and creativity is about creating a likable unseen. We introduce a learning signal inspired by creativity literature that explores the unseen space with hallucinated class-descriptions and encourages careful deviation of their visual feature generations from seen classes while allowing knowledge transfer from seen to unseen classes. Second, CIZSL-v2 is proposed as an improved version of CIZSL-v1 for generative zero-shot learning. CIZSL-v2 consists of an investigation of additional inductive losses for unseen classes along with a semantic guided discriminator. Empirically, we show consistently that CIZSL losses can improve generative ZSL models on the challenging task of generalized ZSL from a noisy text on CUB and NABirds datasets. We also show the advantage of our approach to Attribute-based ZSL on AwA2, aPY, and SUN datasets. We also show that CIZSL-v2 has improved performance compared to CIZSL-v1.
摘要:零镜头学习(ZSL)旨在理解不可见的类别,而没有来自班级描述的训练示例。为了提高ZSL的判别力,我们从人类创造创作心理学的灵感中模拟了看不见类别的视觉学习过程。首先,我们建议CIZSL-v1作为生成ZSL的创造力启发模型。通过观察ZSL是关于识别看不见的东西,而创造力是创建可爱的看不见的东西,我们将ZSL与人类创造力联系起来。我们引入了一个受创造力文学启发的学习信号,该学习信号用幻觉的课堂描述探索了看不见的空间,并鼓励其视觉特征世代与已观看的班级谨慎偏离,同时允许知识从已观看的班级转移到未见的班级。其次,CIZSL-v2被提出作为CIZSL-v1的改进版本,用于生成零镜头学习。 CIZSL-v2包括对看不见的类的其他归纳损失以及语义指导的鉴别器的研究。从经验上,我们一致地表明,CIZSL损失可以从CUB和NABirds数据集上的嘈杂文本中改善通用ZSL的艰巨任务,从而改善生成ZSL模型。我们还将展示在AwA2,aPY和SUN数据集上基于属性的ZSL方法的优势。我们还显示,与CIZSL-v1相比,CIZSL-v2具有更高的性能。

60. Generative Max-Mahalanobis Classifiers for Image Classification, Generation and More [PDF] 返回目录
  Xiulong Yang, Hui Ye, Yang Ye, Xiang Li, Shihao Ji
Abstract: The Joint Energy-based Model (JEM) of Grathwohl et al. (2020) shows that a standard softmax classifier can be reinterpreted as an energy-based model (EBM) for the joint distribution p(x, y); the resulting model can be optimized with energy-based training to improve calibration, robustness and out-of-distribution detection, while generating samples rivaling the quality of recent GAN-based approaches. However, the softmax classifier that JEM exploits is inherently discriminative and its latent feature space is not well formulated as probabilistic distributions, which may hinder its potential for image generation and incur training instability, as observed in the JEM work. We hypothesize that generative classifiers, such as Linear Discriminant Analysis (LDA), might be more suitable hybrid models for image generation since generative classifiers model the data generation process explicitly. This paper therefore investigates an LDA classifier for image classification and generation. In particular, the Max-Mahalanobis Classifier (MMC) (Pang et al., 2020), a special case of LDA, fits our goal very well since MMC formulates the latent feature space explicitly as the Max-Mahalanobis distribution (Pang et al., 2018). We term our algorithm Generative MMC (GMMC), and show that it can be trained discriminatively, generatively or jointly for image classification and generation. Extensive experiments on multiple datasets (CIFAR10, CIFAR100 and SVHN) show that GMMC achieves state-of-the-art discriminative and generative performances, while outperforming JEM in calibration, adversarial robustness and out-of-distribution detection by a significant margin.

61. A Multi-modal Deep Learning Model for Video Thumbnail Selection [PDF]
  Zhifeng Yu, Nanchun Shi
Abstract: The thumbnail is the face of an online video. The explosive growth of videos in both number and variety underpins the importance of a good thumbnail: it saves potential viewers time in choosing videos and can even entice them to click. A good thumbnail should be a frame that best represents the content of the video while also capturing viewers' attention. However, past techniques and models focus only on frames within a video, and we believe such a narrowed focus leaves out much useful information that is part of a video. In this paper, we expand the definition of content to include the title, description, and audio of a video and utilize the information provided by these modalities in our selection model. Specifically, our model first samples frames uniformly in time and keeps the 1,000 frames in this subset with the highest aesthetic scores, as rated by a double-column convolutional neural network, to avoid the computational burden of processing all frames in downstream tasks. The model then incorporates frame features extracted from VGG16, text features from ELECTRA, and audio features from TRILL; these models were selected because of their results on popular datasets as well as their competitive performance. After feature extraction, the time-series features (frames and audio) are fed into Transformer encoder layers that return a vector representing each modality. Each of the four features (frames, title, description, audio) passes through a context gating layer before concatenation. Finally, our model generates a vector in the latent space and selects the frame most similar to this vector in the latent space. To the best of our knowledge, we are the first to propose a multi-modal deep learning model for video thumbnail selection, and it beats the previous state-of-the-art models.
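
As an illustration of the fusion step described above, here is a minimal sketch of context gating followed by concatenation, assuming 256-dimensional modality embeddings and the gating form $y = x \odot \sigma(Wx + b)$ of Miech et al.; the dimensions and variable names are illustrative, not taken from the paper:

    import torch
    import torch.nn as nn

    class ContextGating(nn.Module):
        """Context gating: y = x * sigmoid(W x + b)."""
        def __init__(self, dim):
            super().__init__()
            self.fc = nn.Linear(dim, dim)

        def forward(self, x):
            return x * torch.sigmoid(self.fc(x))

    # One gate per modality, then concatenation, as the abstract describes.
    dim = 256
    gates = nn.ModuleList(ContextGating(dim) for _ in range(4))
    frame_v, title_v, desc_v, audio_v = (torch.randn(1, dim) for _ in range(4))
    fused = torch.cat(
        [g(v) for g, v in zip(gates, (frame_v, title_v, desc_v, audio_v))],
        dim=-1,
    )  # shape (1, 1024), projected to the latent space downstream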

62. FGF-GAN: A Lightweight Generative Adversarial Network for Pansharpening via Fast Guided Filter [PDF]
  Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Kai Sun, Lu Huang, Junmin Liu, Chunxia Zhang
Abstract: Pansharpening is a widely used image enhancement technique for remote sensing. Its principle is to fuse an input high-resolution single-channel panchromatic (PAN) image with a low-resolution multi-spectral image to obtain a high-resolution multi-spectral (HRMS) image. Existing deep learning pansharpening methods have two shortcomings. First, the features of the two input images need to be concatenated along the channel dimension to reconstruct the HRMS image, which fails to make the contribution of the PAN image prominent and also incurs high computational cost. Second, the implicit information in the features is difficult to extract through a manually designed loss function. To this end, we propose a generative adversarial network based on the fast guided filter (FGF) for pansharpening. In the generator, traditional channel concatenation is replaced by the FGF to better retain spatial information while reducing the number of parameters. Meanwhile, the fusion objects are highlighted by a spatial attention module. In addition, the latent information in the features is preserved effectively through adversarial training. Extensive experiments show that our network generates high-quality HRMS images that surpass those of existing methods, with fewer parameters.

63. How does the Combined Risk Affect the Performance of Unsupervised Domain Adaptation Approaches? [PDF]
  Li Zhong, Zhen Fang, Feng Liu, Jie Lu, Bo Yuan, Guangquan Zhang
Abstract: Unsupervised domain adaptation (UDA) aims to train a target classifier with labeled samples from the source domain and unlabeled samples from the target domain. Classical UDA learning bounds show that the target risk is upper bounded by the sum of three terms: the source risk, the distribution discrepancy, and the combined risk. Based on the assumption that the combined risk is a small fixed value, methods based on this bound train a target classifier by minimizing only estimators of the source risk and the distribution discrepancy. However, the combined risk may increase while both estimators are being minimized, which makes the target risk uncontrollable. Hence the target classifier cannot achieve ideal performance if we fail to control the combined risk. In controlling the combined risk, the key challenge stems from the unavailability of labeled samples in the target domain. To address this key challenge, we propose a method named E-MixNet. E-MixNet employs enhanced mixup, a generic vicinal distribution, on the labeled source samples and pseudo-labeled target samples to calculate a proxy of the combined risk. Experiments show that the proxy can effectively curb the increase of the combined risk when minimizing the source risk and distribution discrepancy. Furthermore, we show that if the proxy of the combined risk is added to the loss functions of four representative UDA methods, their performance is also improved.
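
A minimal sketch of the vicinal-sample construction described above, using standard mixup between labeled source samples and pseudo-labeled target samples; the exact enhanced-mixup form used by E-MixNet may differ:

    import numpy as np

    def mixup_proxy_batch(x_src, y_src, x_tgt, y_tgt_pseudo, alpha=0.2, rng=None):
        """Mix source samples with pseudo-labeled target samples; the loss of a
        classifier on the mixed batch then serves as a proxy of the combined risk."""
        rng = rng or np.random.default_rng()
        lam = rng.beta(alpha, alpha)
        x_mix = lam * x_src + (1.0 - lam) * x_tgt
        y_mix = lam * y_src + (1.0 - lam) * y_tgt_pseudo  # soft one-hot labels
        return x_mix, y_mix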

64. Fast Ensemble Learning Using Adversarially-Generated Restricted Boltzmann Machines [PDF]
  Gustavo H. de Rosa, Mateus Roder, João P. Papa
Abstract: Machine Learning has been applied to a wide range of tasks throughout the last years, ranging from image classification to autonomous driving and natural language processing. The Restricted Boltzmann Machine (RBM) has received recent attention and relies on an energy-based structure to model data probability distributions. Notwithstanding, such a technique is susceptible to adversarial manipulation, i.e., slightly or profoundly modified data. An alternative to overcome the adversarial problem lies in Generative Adversarial Networks (GANs), capable of modeling data distributions and generating adversarial data that resemble the originals. Therefore, this work proposes to artificially generate RBMs using Adversarial Learning, where pre-trained weight matrices serve as the GAN inputs. Furthermore, it proposes to sample copious amounts of matrices and combine them into ensembles, alleviating the burden of training new models. Experimental results demonstrate the suitability of the proposed approach under image reconstruction and image classification tasks, and describe how artificial-based ensembles are an alternative to pre-training vast amounts of RBMs.

65. Towards Robust Data Hiding Against (JPEG) Compression: A Pseudo-Differentiable Deep Learning Approach [PDF]
  Chaoning Zhang, Adil Karjauv, Philipp Benz, In So Kweon
Abstract: Data hiding is a widely used approach for protecting authentication and ownership. Most multimedia content, such as images and videos, is transmitted or saved in compressed form. This kind of lossy compression, such as JPEG, can destroy the hidden data, which raises the need for robust data hiding. Achieving data hiding that is robust to such compression remains an open challenge. Recently, deep learning has shown great success in data hiding, but the non-differentiability of JPEG makes it challenging to train a deep pipeline for improving robustness against lossy compression. Existing SOTA approaches replace the non-differentiable parts with differentiable modules that perform similar operations. Multiple limitations exist: (a) large engineering effort; (b) requiring white-box knowledge of the compression attack; (c) only working for simple compression like JPEG. In this work, we propose a simple yet effective approach to address all of the above limitations at once. Beyond JPEG, our approach is shown to improve robustness against various image and video lossy compression algorithms.
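
As an illustration of how a non-differentiable codec can sit inside a trainable pipeline, the sketch below uses a straight-through estimator: real JPEG in the forward pass, identity gradient in the backward pass. This is one common construction, not necessarily the paper's exact mechanism, and jpeg_round_trip is a hypothetical helper (encode and decode at a chosen quality factor):

    import torch

    class StraightThroughJPEG(torch.autograd.Function):
        """Forward: real (non-differentiable) JPEG compress + decompress.
        Backward: pass the gradient through unchanged (straight-through)."""

        @staticmethod
        def forward(ctx, x):
            return jpeg_round_trip(x)  # hypothetical non-differentiable helper

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output

    # Usage inside an encoder/decoder data-hiding pipeline:
    # attacked = StraightThroughJPEG.apply(stego_image)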

66. Single-shot fringe projection profilometry based on Deep Learning and Computer Graphics [PDF]
  Fanzhou Wang, Chenxing Wang, Qingze Guan
Abstract: In recent years, multiple works have applied deep learning to fringe projection profilometry (FPP). However, obtaining a large amount of training data from actual systems remains a tricky problem, and network design and optimization are still worth exploring. In this paper, we introduce computer graphics to build virtual FPP systems in order to generate the desired datasets conveniently and simply. We first describe in detail how to construct a virtual FPP system, and then analyze some key factors for making the virtual system closely match reality. With the aim of accurately estimating the depth image from only one fringe image, we also design a new loss function to enhance the quality of the overall and detailed information restored, and compare two representative networks, U-Net and pix2pix, in multiple aspects. Real experiments demonstrate the good accuracy and generalization of a network trained with data from our virtual systems and the designed loss, indicating the potential of our method for applications.

67. Minimizing L1 over L2 norms on the gradient [PDF]
  Chao Wang, Min Tao, Chen-Nee Chuah, James Nagy, Yifei Lou
Abstract: In this paper, we study L1/L2 minimization on the gradient for imaging applications. Several recent works have demonstrated that L1/L2 is better than the L1 norm when approximating the L0 norm to promote sparsity. Consequently, we postulate that applying L1/L2 on the gradient is better than the classic total variation (the L1 norm on the gradient) at enforcing the sparsity of the image gradient. To verify our hypothesis, we consider a constrained formulation to reveal empirical evidence on the superiority of L1/L2 over L1 when recovering piecewise constant signals from low-frequency measurements. Numerically, we design a specific splitting scheme, under which we can prove subsequential convergence for the alternating direction method of multipliers (ADMM). Experimentally, we demonstrate visible improvements of L1/L2 over L1 and other nonconvex regularizations for image recovery from low-frequency measurements and in two medical applications, MRI and CT reconstruction. All the numerical results show the efficiency of our proposed approach.
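
The constrained formulation studied in the paper takes the following form, where $A$ is the (low-frequency) measurement operator and $\boldsymbol{b}$ the data; a noisy variant would relax the equality constraint:

$$\min_{\boldsymbol{u}} \; \frac{\|\nabla \boldsymbol{u}\|_1}{\|\nabla \boldsymbol{u}\|_2} \quad \text{subject to} \quad A\boldsymbol{u} = \boldsymbol{b},$$

which replaces the classic total-variation objective $\|\nabla \boldsymbol{u}\|_1$ with a scale-invariant ratio.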

68. DSXplore: Optimizing Convolutional Neural Networks via Sliding-Channel Convolutions [PDF]
  Yuke Wang, Boyuan Feng, Yufei Ding
Abstract: As a key advancement of convolutional neural networks (CNNs), depthwise separable convolutions (DSCs) have become one of the most popular techniques for reducing the computation and parameter size of CNNs while maintaining model accuracy. They also bring profound impact in improving the applicability of compute- and memory-intensive CNNs to a broad range of applications, such as mobile devices, which are generally short of computation power and memory. However, previous research on DSCs has largely focused on compositing the limited existing DSC designs, thus missing the opportunity to explore more potential designs that can achieve better accuracy and higher computation/parameter reduction. Besides, off-the-shelf convolution implementations offer limited computing schemes, and therefore lack support for DSCs with different convolution patterns. To this end, we introduce DSXplore, the first optimized design for exploring DSCs on CNNs. Specifically, at the algorithm level, DSXplore incorporates a novel factorized kernel -- sliding-channel convolution (SCC) -- featuring input-channel overlapping to balance accuracy against the reduction of computation and memory cost. SCC also offers enormous space for design exploration by introducing adjustable kernel parameters. Further, at the implementation level, we carry out an optimized GPU implementation tailored for SCC by leveraging several key techniques, such as an input-centric backward design and channel-cyclic optimization. Intensive experiments on different datasets across mainstream CNNs show the advantages of DSXplore in balancing accuracy and computation/parameter reduction over the standard convolution and existing DSCs.
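
A minimal 1x1-kernel sketch of the sliding-channel idea, with the window size and stride as assumed parameters; DSXplore's actual SCC uses spatial kernels and adjustable kernel parameters:

    import numpy as np

    def sliding_channel_conv1x1(x, w, stride=2):
        """x: (C_in, H, W); w: (C_out, window) channel-mixing weights.
        Output channel o mixes an overlapping window of input channels,
        sitting between depthwise (window=1) and standard (window=C_in) conv."""
        c_out, window = w.shape
        c_in = x.shape[0]
        out = np.empty((c_out,) + x.shape[1:])
        for o in range(c_out):
            start = (o * stride) % c_in
            idx = [(start + j) % c_in for j in range(window)]  # wrap around channels
            out[o] = np.tensordot(w[o], x[idx], axes=(0, 0))
        return out

    y = sliding_channel_conv1x1(np.random.randn(8, 32, 32), np.random.randn(6, 4))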

69. Factor Analysis, Probabilistic Principal Component Analysis, Variational Inference, and Variational Autoencoder: Tutorial and Survey [PDF]
  Benyamin Ghojogh, Ali Ghodsi, Fakhri Karray, Mark Crowley
Abstract: This is a tutorial and survey paper on factor analysis, probabilistic Principal Component Analysis (PCA), variational inference, and the Variational Autoencoder (VAE). These methods, which are tightly related, are dimensionality reduction and generative models. They assume that every data point is generated from or caused by a low-dimensional latent factor. By learning the parameters of the distribution of the latent space, the corresponding low-dimensional factors are found for the sake of dimensionality reduction. Owing to their stochastic and generative behaviour, these models can also be used to generate new data points in the data space. In this paper, we first start with variational inference, where we derive the Evidence Lower Bound (ELBO) and Expectation Maximization (EM) for learning the parameters. Then, we introduce factor analysis, derive its joint and marginal distributions, and work out its EM steps. Probabilistic PCA is then explained as a special case of factor analysis, and its closed-form solutions are derived. Finally, the VAE is explained, where the encoder, decoder, and sampling from the latent space are introduced. Training the VAE using both EM and backpropagation is explained.
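
For reference, the Evidence Lower Bound at the heart of the tutorial is the standard bound, with encoder $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$, decoder $p_\theta(\boldsymbol{x} \mid \boldsymbol{z})$, and prior $p(\boldsymbol{z})$:

$$\log p_\theta(\boldsymbol{x}) \;\ge\; \mathbb{E}_{q_\phi(\boldsymbol{z} \mid \boldsymbol{x})}\big[\log p_\theta(\boldsymbol{x} \mid \boldsymbol{z})\big] - \mathrm{KL}\big(q_\phi(\boldsymbol{z} \mid \boldsymbol{x}) \,\|\, p(\boldsymbol{z})\big),$$

with equality when $q_\phi(\boldsymbol{z} \mid \boldsymbol{x})$ equals the true posterior; EM alternates between tightening this bound over $q$ and maximizing it over the model parameters.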

70. Generalized Latency Performance Estimation for Once-For-All Neural Architecture Search [PDF]
  Muhtadyuzzaman Syed, Arvind Akpuram Srinivasan
Abstract: Neural Architecture Search (NAS) has enabled automated machine learning by streamlining the manual development of deep neural network architectures through a search space, a search strategy, and a performance estimation strategy. To address the need for multi-platform deployment of Convolutional Neural Network (CNN) models, Once-For-All (OFA) proposed to decouple training and search, delivering a one-shot model of sub-networks constrained to various accuracy-latency tradeoffs. We find that the performance estimation strategy for OFA's search severely lacks generalizability across hardware deployment platforms, because it relies on single-hardware latency lookup tables that require a significant amount of time and manual effort to build beforehand. In this work, we demonstrate a framework for building latency predictors for neural network architectures to address the need for heterogeneous hardware support and to reduce the overhead of lookup tables altogether. We introduce two generalizability strategies: fine-tuning a base model trained on a specific hardware and NAS search space, and GPU generalization, which trains a model on GPU hardware parameters such as number of cores, RAM size, and memory bandwidth. With this, we provide a family of latency prediction models that achieve over 50% lower RMSE loss compared with ProxylessNAS. We also show that using these latency predictors matches the NAS performance of the lookup-table baseline approach, and in certain cases even exceeds it.
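
A minimal sketch of a hardware-generalizable latency predictor of the kind described, regressing latency from an architecture encoding concatenated with GPU parameters; the hardware feature set (number of cores, RAM size, memory bandwidth) follows the abstract, while the layer sizes and names are illustrative:

    import torch
    import torch.nn as nn

    class LatencyPredictor(nn.Module):
        """MLP regressor: [architecture encoding || hardware params] -> latency."""
        def __init__(self, arch_dim, hw_dim=3):  # cores, RAM size, mem bandwidth
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(arch_dim + hw_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 1),
            )

        def forward(self, arch_enc, hw_params):
            return self.net(torch.cat([arch_enc, hw_params], dim=-1))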

71. CovTANet: A Hybrid Tri-level Attention Based Network for Lesion Segmentation, Diagnosis, and Severity Prediction of COVID-19 Chest CT Scans [PDF]
  Tanvir Mahmud, Md. Jahin Alam, Sakib Chowdhury, Shams Nafisa Ali, Md Maisoon Rahman, Shaikh Anowarul Fattah, Mohammad Saquib
Abstract: Rapid and precise diagnosis of COVID-19 is one of the major challenges faced by the global community in controlling the spread of this growing pandemic. In this paper, a hybrid neural network named CovTANet is proposed to provide an end-to-end clinical diagnostic tool for early diagnosis, lesion segmentation, and severity prediction of COVID-19 from chest computed tomography (CT) scans. A multi-phase optimization strategy is introduced to address the challenge of diagnosis at a very early stage of infection: an efficient lesion segmentation network is optimized first and is later integrated into a joint optimization framework for the diagnosis and severity prediction tasks, providing feature enhancement of the infected regions. Moreover, to overcome the challenges posed by the diffused, blurred, and variably shaped edges of COVID lesions with novel and diverse characteristics, a novel segmentation network is introduced, namely the Tri-level Attention-based Segmentation Network (TA-SegNet). This network significantly reduces semantic gaps in subsequent encoding-decoding stages, with immense parallelization of multi-scale features for faster convergence, providing considerable performance improvement over traditional networks. Furthermore, a novel tri-level attention mechanism is introduced and used repeatedly throughout the network, combining channel, spatial, and pixel attention schemes for faster and more efficient generalization of the contextual information embedded in the feature map through feature re-calibration and enhancement operations. Outstanding performance is achieved on all three tasks through extensive experimentation on a large publicly available dataset containing 1110 chest CT volumes, which signifies the effectiveness of the proposed scheme at the current stage of the pandemic.

72. Phase Transitions in Recovery of Structured Signals from Corrupted Measurements [PDF]
  Zhongxing Sun, Wei Cui, Yulong Liu
Abstract: This paper is concerned with the problem of recovering a structured signal from a relatively small number of corrupted random measurements. Sharp phase transitions have been numerically observed in practice when different convex programming procedures are used to solve this problem. This paper is devoted to presenting theoretical explanations for these phenomena by employing some basic tools from Gaussian process theory. Specifically, we identify the precise locations of the phase transitions for both constrained and penalized recovery procedures. Our theoretical results show that these phase transitions are determined by some geometric measures of structure, e.g., the spherical Gaussian width of a tangent cone and the Gaussian (squared) distance to a scaled subdifferential. By utilizing the established phase transition theory, we further investigate the relationship between these two kinds of recovery procedures, which also reveals an optimal strategy (in the sense of Lagrange theory) for choosing the tradeoff parameter in the penalized recovery procedure. Numerical experiments are provided to verify our theoretical results.

73. RegNet: Self-Regulated Network for Image Classification [PDF]
  Jing Xu, Yu Pan, Xinglin Pan, Steven Hoi, Zhang Yi, Zenglin Xu
Abstract: The ResNet and its variants have achieved remarkable successes in various computer vision tasks. Despite its success in making gradients flow through building blocks, the simple shortcut connection mechanism limits the ability to re-explore new, potentially complementary features due to the additive function. To address this issue, in this paper we propose to introduce a regulator module as a memory mechanism to extract complementary features, which are further fed to the ResNet. In particular, the regulator module is composed of convolutional RNNs (e.g., convolutional LSTMs or convolutional GRUs), which are shown to be good at extracting spatio-temporal information. We name the new regulated networks RegNet. The regulator module can be easily implemented and appended to any ResNet architecture. We also apply the regulator module to improve the Squeeze-and-Excitation ResNet, to show the generalization ability of our method. Experimental results on three image classification datasets demonstrate the promising performance of the proposed architecture compared with the standard ResNet, SE-ResNet, and other state-of-the-art architectures.

74. Towards Annotation-free Instance Segmentation and Tracking with Adversarial Simulations [PDF]
  Quan Liu, Isabella M. Gaeta, Mengyang Zhao, Ruining Deng, Aadarsh Jha, Bryan A. Millis, Anita Mahadevan-Jansen, Matthew J. Tyska, Yuankai Huo
Abstract: Quantitative analysis of microscope videos often requires instance segmentation and tracking of cellular and subcellular objects. Traditional methods are composed of two stages: (1) instance object segmentation of each frame, and (2) associating objects frame by frame. Recently, pixel-embedding-based deep learning approaches provide single-stage holistic solutions that tackle instance segmentation and tracking simultaneously. However, deep learning methods require consistent annotations not only spatially (for segmentation) but also temporally (for tracking). In computer vision, annotating training data with consistent segmentation and tracking is resource intensive, and this can be even more severe in microscopy imaging owing to (1) dense objects (e.g., overlapping or touching) and (2) high dynamics (e.g., irregular motion and mitosis). To alleviate the lack of such annotations in dynamic scenes, adversarial simulations have provided successful solutions in computer vision, such as using simulated environments (e.g., computer games) to train real-world self-driving systems. In this paper, we propose an annotation-free synthetic instance segmentation and tracking (ASIST) method with adversarial simulation and single-stage pixel-embedding-based learning. The contribution is three-fold: (1) the proposed method combines adversarial simulation and single-stage pixel-embedding-based deep learning; (2) the method is assessed with both cellular (i.e., HeLa cells) and subcellular (i.e., microvilli) objects; and (3) to the best of our knowledge, this is the first study to explore annotation-free instance segmentation and tracking for microscope videos. Our ASIST method achieved promising results compared with fully supervised approaches.

75. RV-GAN: Retinal Vessel Segmentation from Fundus Images using Multi-scale Generative Adversarial Networks [PDF]
  Sharif Amit Kamran, Khondker Fariha Hossain, Alireza Tavakkoli, Stewart Lee Zuckerbrod, Kenton M. Sanders, Salah A. Baker
Abstract: Retinal vessel segmentation contributes significantly to the domain of retinal image analysis for the diagnosis of vision-threatening diseases. With existing techniques, the generated segmentation result deteriorates when thresholded at higher confidence values. To alleviate this, we propose RV-GAN, a new multi-scale generative architecture for accurate retinal vessel segmentation. Our architecture uses two generators and two multi-scale autoencoder-based discriminators for better microvessel localization and segmentation. By combining reconstruction and weighted feature-matching loss, our adversarial training scheme generates highly accurate pixel-wise segmentation of retinal vessels with threshold >= 0.5. The architecture achieves AUC of 0.9887, 0.9814, and 0.9887 on three publicly available datasets, namely DRIVE, CHASE-DB1, and STARE, respectively. Additionally, RV-GAN outperforms other architectures in two additional relevant metrics, mean IOU and SSIM.

76. Multi-stage Deep Layer Aggregation for Brain Tumor Segmentation [PDF]
  Carlos A. Silva, Adriano Pinto, Sérgio Pereira, Ana Lopes
Abstract: Gliomas are among the most aggressive and deadly brain tumors. This paper details the proposed Deep Neural Network architecture for brain tumor segmentation from Magnetic Resonance Images. The architecture consists of a cascade of three Deep Layer Aggregation neural networks, where each stage refines the response using the feature maps and probabilities of the previous stage, together with the MRI channels, as inputs. The neuroimaging data are part of the publicly available Brain Tumor Segmentation (BraTS) 2020 challenge dataset, and we evaluated our proposal on the BraTS 2020 Validation and Test sets. On the Test set, the experimental results achieved Dice scores of 0.8858, 0.8297, and 0.7900, with Hausdorff distances of 5.32 mm, 22.32 mm, and 20.44 mm for the whole tumor, tumor core, and enhancing tumor, respectively.

77. Combining unsupervised and supervised learning for predicting the final stroke lesion [PDF]
  Adriano Pinto, Sérgio Pereira, Raphael Meier, Roland Wiest, Victor Alves, Mauricio Reyes, Carlos A.Silva
Abstract: Predicting the final ischaemic stroke lesion provides crucial information regarding the volume of salvageable hypoperfused tissue, which helps physicians in the difficult decision-making process of treatment planning and intervention. Treatment selection is influenced by clinical diagnosis, which requires delineating the stroke lesion, as well as characterising cerebral blood flow dynamics using neuroimaging acquisitions. Nonetheless, predicting the final stroke lesion is an intricate task, due to the variability in lesion size, shape, location and the underlying cerebral haemodynamic processes that occur after the ischaemic stroke takes place. Moreover, since elapsed time between stroke and treatment is related to the loss of brain tissue, assessing and predicting the final stroke lesion needs to be performed in a short period of time, which makes the task even more complex. Therefore, there is a need for automatic methods that predict the final stroke lesion and support physicians in the treatment decision process. We propose a fully automatic deep learning method based on unsupervised and supervised learning to predict the final stroke lesion after 90 days. Our aim is to predict the final stroke lesion location and extent, taking into account the underlying cerebral blood flow dynamics that can influence the prediction. To achieve this, we propose a two-branch Restricted Boltzmann Machine, which provides specialized data-driven features from different sets of standard parametric Magnetic Resonance Imaging maps. These data-driven feature maps are then combined with the parametric Magnetic Resonance Imaging maps, and fed to a Convolutional and Recurrent Neural Network architecture. We evaluated our proposal on the publicly available ISLES 2017 testing dataset, reaching a Dice score of 0.38, Hausdorff Distance of 29.21 mm, and Average Symmetric Surface Distance of 5.52 mm.

78. Semantics for Robotic Mapping, Perception and Interaction: A Survey [PDF]
  Sourav Garg, Niko Sünderhauf, Feras Dayoub, Douglas Morrison, Akansel Cosgun, Gustavo Carneiro, Qi Wu, Tat-Jun Chin, Ian Reid, Stephen Gould, Peter Corke, Michael Milford
Abstract: For robots to navigate and interact more richly with the world around them, they will likely require a deeper understanding of the world in which they operate. In robotics and related research fields, the study of understanding is often referred to as semantics, which dictates what the world "means" to a robot and is strongly tied to the question of how to represent that meaning. With humans and robots increasingly operating in the same world, the prospects of human-robot interaction also bring the semantics and ontology of natural language into the picture. Driven by need, as well as by enablers like the increasing availability of training data and computational resources, semantics is a rapidly growing research area in robotics. The field has received significant attention in the research literature to date, but most reviews and surveys have focused on particular aspects of the topic: the technical research issues regarding its use in specific robotic topics like mapping or segmentation, or its relevance to one particular application domain like autonomous driving. A new treatment is therefore required, and is also timely because so much relevant research has occurred since many of the key surveys were published. This survey therefore provides an overarching snapshot of where semantics in robotics stands today. We establish a taxonomy for semantics research in or relevant to robotics, split into four broad categories of activity in which semantics are extracted, used, or both. Within these broad categories we survey dozens of major topics, including fundamentals from the computer vision field and key robotics research areas utilizing semantics, including mapping, navigation, and interaction with the world. The survey also covers key practical considerations, including enablers like increased data availability and improved computational hardware, and major application areas where...

79. CryoNuSeg: A Dataset for Nuclei Instance Segmentation of Cryosectioned H&E-Stained Histological Images [PDF]
  Amirreza Mahbod, Gerald Schaefer, Benjamin Bancher, Christine Löw, Georg Dorffner, Rupert Ecker, Isabella Ellinger
Abstract: Nuclei instance segmentation plays an important role in the analysis of Hematoxylin and Eosin (H&E)-stained images. While supervised deep learning (DL)-based approaches represent the state-of-the-art in automatic nuclei instance segmentation, annotated datasets are required to train these models. There are two main types of tissue processing protocols, namely formalin-fixed paraffin-embedded samples (FFPE) and frozen tissue samples (FS). Although FFPE-derived H&E stained tissue sections are the most widely used samples, H&E staining on frozen sections derived from FS samples is a relevant method in intra-operative surgical sessions as it can be performed fast. Due to differences in the protocols of these two types of samples, the derived images and in particular the nuclei appearance may be different in the acquired whole slide images. Analysis of FS-derived H&E stained images can be more challenging as rapid preparation, staining, and scanning of FS sections may lead to deterioration in image quality. In this paper, we introduce CryoNuSeg, the first fully annotated FS-derived cryosectioned and H&E-stained nuclei instance segmentation dataset. The dataset contains images from 10 human organs that were not exploited in other publicly available datasets, and is provided with three manual mark-ups to allow measuring intra-observer and inter-observer variability. Moreover, we investigate the effects of tissue fixation/embedding protocol (i.e., FS or FFPE) on the automatic nuclei instance segmentation performance of one of the state-of-the-art DL approaches. We also create a baseline segmentation benchmark for the dataset that can be used in future research. A step-by-step guide to generate the dataset as well as the full dataset and other detailed information are made available to fellow researchers at this https URL.

80. Non-line-of-Sight Imaging via Neural Transient Fields [PDF]
  Siyuan Shen, Zi Wang, Ping Liu, Zhengqing Pan, Ruiqian Li, Tian Gao, Shiying Li, Jingyi Yu
Abstract: We present a neural modeling framework for Non-Line-of-Sight (NLOS) imaging. Previous solutions have sought to explicitly recover the 3D geometry (e.g., as point clouds) or voxel density (e.g., within a pre-defined volume) of the hidden scene. In contrast, inspired by the recent Neural Radiance Field (NeRF) approach, we use a multi-layer perceptron (MLP) to represent the neural transient field, or NeTF. However, NeTF measures the transient over spherical wavefronts rather than the radiance along lines. We therefore formulate a spherical-volume NeTF reconstruction pipeline, applicable to both confocal and non-confocal setups. Compared with NeRF, NeTF samples a much sparser set of viewpoints (scanning spots), and the sampling is highly uneven. We thus introduce a Monte Carlo technique to improve the robustness of the reconstruction. Comprehensive experiments on synthetic and real datasets demonstrate that NeTF provides higher-quality reconstruction and preserves fine details largely missing in the state-of-the-art.

81. Quaternion higher-order singular value decomposition and its applications in color image processing [PDF]
  Jifei Miao, Kit Ian Kou
Abstract: Higher-order singular value decomposition (HOSVD) is one of the most efficient tensor decomposition techniques. It has the salient ability to represent high-dimensional data and extract features. In more recent years, the quaternion has proven to be a very suitable tool for color pixel representation as it can well preserve the cross-channel correlation of color channels. Motivated by the advantages of the HOSVD and the quaternion tool, in this paper we generalize the HOSVD to the quaternion domain and define the quaternion-based HOSVD (QHOSVD). Due to the non-commutability of quaternion multiplication, QHOSVD is not a trivial extension of the HOSVD. They have similar but different calculation procedures. The defined QHOSVD can be widely used in various visual data processing with color pixels. In this paper, we present two applications of the defined QHOSVD in color image processing: multi-focus color image fusion and color image denoising. The experimental results on the two applications respectively demonstrate the competitive performance of the proposed methods over some existing ones.

82. Neural Architecture Search via Combinatorial Multi-Armed Bandit [PDF]
  Hanxun Huang, Xingjun Ma, Sarah M. Erfani, James Bailey
Abstract: Neural Architecture Search (NAS) has gained significant popularity as an effective tool for designing high performance deep neural networks (DNNs). NAS can be performed via policy gradient, evolutionary algorithms, differentiable architecture search or tree-search methods. While significant progress has been made for both policy gradient and differentiable architecture search, tree-search methods have so far failed to achieve comparable accuracy or search efficiency. In this paper, we formulate NAS as a Combinatorial Multi-Armed Bandit (CMAB) problem (CMAB-NAS). This allows the decomposition of a large search space into smaller blocks where tree-search methods can be applied more effectively and efficiently. We further leverage a tree-based method called Nested Monte-Carlo Search to tackle the CMAB-NAS problem. On CIFAR-10, our approach discovers a cell structure that achieves a low error rate that is comparable to the state-of-the-art, using only 0.58 GPU days, which is 20 times faster than current tree-search methods. Moreover, the discovered structure transfers well to large-scale datasets such as ImageNet.

83. Interval Type-2 Enhanced Possibilistic Fuzzy C-Means Clustering for Gene Expression Data Analysis [PDF]
  Shahabeddin Sotudian, Mohammad Hossein Fazel Zarandi
Abstract: Both FCM and PCM clustering methods have been widely applied to pattern recognition and data clustering. Nevertheless, FCM is sensitive to noise and PCM occasionally generates coincident clusters. PFCM is an extension of the PCM model by combining FCM and PCM, but this method still suffers from the weaknesses of PCM and FCM. In the current paper, the weaknesses of the PFCM algorithm are corrected and the enhanced possibilistic fuzzy c-means (EPFCM) clustering algorithm is presented. EPFCM can still be sensitive to noise. Therefore, we propose an interval type-2 enhanced possibilistic fuzzy c-means (IT2EPFCM) clustering method by utilizing two fuzzifiers $(m_1, m_2)$ for fuzzy memberships and two fuzzifiers $({\theta}_1, {\theta}_2)$ for possibilistic typicalities. Our computational results show the superiority of the proposed approaches compared with several state-of-the-art techniques in the literature. Finally, the proposed methods are implemented for analyzing microarray gene expression data.
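
For context, the classical FCM updates that these possibilistic variants build on are, with fuzzifier $m > 1$, cluster centers $v_i$, and memberships $u_{ik}$ of point $x_k$:

$$u_{ik} = \left[\sum_{j=1}^{c}\left(\frac{\|x_k - v_i\|}{\|x_k - v_j\|}\right)^{2/(m-1)}\right]^{-1}, \qquad v_i = \frac{\sum_k u_{ik}^m x_k}{\sum_k u_{ik}^m};$$

PCM replaces the relative memberships with per-cluster typicalities, and, roughly, IT2EPFCM maintains interval-valued memberships induced by the fuzzifier pairs $(m_1, m_2)$ and $({\theta}_1, {\theta}_2)$.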

84. The Bayesian Method of Tensor Networks [PDF]
  Erdong Guo, David Draper
Abstract: Bayesian learning is a powerful learning framework which combines the external information of the data (background information) with the internal information (training data) in a logically consistent way for inference and prediction. By Bayes' rule, the external information (prior distribution) and the internal information (training-data likelihood) are combined coherently, and the posterior distribution and the posterior predictive (marginal) distribution obtained by Bayes' rule summarize the total information needed for inference and prediction, respectively. In this paper, we study the Bayesian framework of the Tensor Network from two perspectives. First, we introduce a prior distribution over the weights of the Tensor Network and predict the labels of new observations by the posterior predictive (marginal) distribution. Owing to the intractability of the parameter integral in the normalization-constant computation, we approximate the posterior predictive distribution by the Laplace approximation and obtain an outer-product approximation of the Hessian matrix of the posterior distribution of the Tensor Network model. Second, to estimate the parameters of the stationary mode, we propose a stable initialization trick to accelerate the inference process, by which the Tensor Network can converge to the stationary path more efficiently and stably with the gradient descent method. We verify our work on the MNIST, Phishing Website, and Breast Cancer datasets. We study the Bayesian properties of the Bayesian Tensor Network by visualizing the parameters of the model and the decision boundaries on a two-dimensional synthetic dataset. For application purposes, our work can reduce overfitting and improve the performance of the normal Tensor Network model.
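
For reference, the Laplace approximation invoked above replaces the weight posterior with a Gaussian centred at the MAP estimate,

$$p(w \mid \mathcal{D}) \approx \mathcal{N}\big(w \mid w_{\mathrm{MAP}},\, H^{-1}\big), \qquad H = -\nabla^2_w \log p(w \mid \mathcal{D})\big|_{w = w_{\mathrm{MAP}}},$$

so the posterior predictive $\int p(y^* \mid x^*, w)\, p(w \mid \mathcal{D})\, dw$ reduces to a Gaussian integral; the outer-product approximation of $H$ mentioned above avoids forming exact second derivatives.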

85. Cutting-edge 3D Medical Image Segmentation Methods in 2020: Are Happy Families All Alike? [PDF]
  Jun Ma
Abstract: Segmentation is one of the most important and popular tasks in medical image analysis, playing a critical role in disease diagnosis, surgical planning, and prognosis evaluation. On the one hand, thousands of medical image segmentation methods have been proposed over the past five years for various organs and lesions in different medical images, which makes it more and more challenging to compare different methods fairly. On the other hand, international segmentation challenges provide a transparent platform to evaluate and compare different methods fairly. In this paper, we present a comprehensive review of the top methods in ten 3D medical image segmentation challenges during 2020, covering a variety of tasks and datasets. We also identify the "happy-families" practices in the cutting-edge segmentation methods, which are useful for developing powerful segmentation approaches. Finally, we discuss open research problems that should be addressed in the future. We also maintain a list of cutting-edge segmentation methods at \url{this https URL}.

86. B-SMALL: A Bayesian Neural Network approach to Sparse Model-Agnostic Meta-Learning [PDF]
  Anish Madan, Ranjitha Prasad
Abstract: There is a growing interest in the learning-to-learn paradigm, also known as meta-learning, where models infer on new tasks using only a few training examples. Recently, meta-learning-based methods have been widely used in few-shot classification, regression, reinforcement learning, and domain adaptation. The model-agnostic meta-learning (MAML) algorithm is a well-known algorithm that obtains a model parameter initialization in the meta-training phase. In the meta-test phase, this initialization is rapidly adapted to new tasks using gradient descent. However, meta-learning models are prone to overfitting, since there are insufficient training tasks, resulting in over-parameterized models with poor generalization performance on unseen tasks. In this paper, we propose a Bayesian neural network based MAML algorithm, which we refer to as the B-SMALL algorithm. The proposed framework incorporates a sparse variational loss term alongside the loss function of MAML, using a sparsifying approximated KL divergence as a regularizer. We demonstrate the performance of B-SMALL on classification and regression tasks, and highlight that training a sparsifying BNN using MAML indeed improves the parameter footprint of the model while performing on par with, or even outperforming, the MAML approach. We also illustrate the applicability of our approach in distributed sensor networks, where sparsity and meta-learning can be beneficial.
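
For reference, the MAML bilevel update that B-SMALL augments with a sparsifying variational loss is the standard inner/outer gradient step over tasks $\mathcal{T}_i$, with inner rate $\alpha$ and outer rate $\beta$:

$$\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(f_\theta), \qquad \theta \leftarrow \theta - \beta \nabla_\theta \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}).$$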

87. Multi-Grid Back-Projection Networks [PDF]
  Pablo Navarrete Michelini, Wenbin Chen, Hanwen Liu, Dan Zhu, Xingqun Jiang
Abstract: Multi-Grid Back-Projection (MGBP) is a fully-convolutional network architecture that can learn to restore images and videos with upscaling artifacts. Using the same strategy as multigrid partial differential equation (PDE) solvers, this multiscale architecture scales computational complexity efficiently with increasing output resolutions. The basic processing block is inspired by the iterative back-projection (IBP) algorithm and constitutes a type of cross-scale residual block with feedback from low-resolution references. The architecture performs on par with state-of-the-art alternatives for regression targets that aim to recover an exact copy of a high-resolution image or video from which only a downscaled image is known. A perceptual quality target aims to create more realistic outputs by introducing artificial changes that can differ from a high-resolution original, as long as they are consistent with the low-resolution input. For this target we propose a strategy that uses noise inputs at different resolution scales to control the amount of artificial detail generated in the output. The noise input controls the amount of innovation the network uses to create artificial realistic details. The effectiveness of this strategy is shown in benchmarks, and it is explained as a particular strategy to traverse the perception-distortion plane.

88. Explainability Matters: Backdoor Attacks on Medical Imaging [PDF]
  Munachiso Nwadike, Takumi Miyawaki, Esha Sarkar, Michail Maniatakos, Farah Shamout
Abstract: Deep neural networks have been shown to be vulnerable to backdoor attacks, which can easily be introduced into the training set prior to model training. Recent work has focused on investigating backdoor attacks on natural images or toy datasets. Consequently, the exact impact of backdoors is not yet fully understood in complex real-world applications, such as medical imaging, where misdiagnosis can be very costly. In this paper, we explore the impact of backdoor attacks on a multi-label disease classification task using chest radiography, with the assumption that the attacker can manipulate the training dataset to execute the attack. Extensive evaluation of a state-of-the-art architecture demonstrates that, by introducing images with few-pixel perturbations into the training set, an attacker can execute the backdoor successfully without having to be involved in the training procedure. A simple 3$\times$3-pixel trigger can achieve up to 1.00 Area Under the Receiver Operating Characteristic (AUROC) curve on the set of infected images. On the set of clean images, the backdoored neural network can still achieve up to 0.85 AUROC, highlighting the stealthiness of the attack. As the use of deep-learning-based diagnostic systems proliferates in clinical practice, we also show how explainability is indispensable in this context, as it can identify spatially localized backdoors at inference time.
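
A minimal sketch of the poisoning step described above, assuming grayscale (N, H, W) images, a 3x3 bright-patch trigger in one corner, and label flipping to an attacker-chosen class; the poison rate and all names are illustrative:

    import numpy as np

    def poison_dataset(images, labels, target_label, rate=0.05, rng=None):
        """Stamp a 3x3 trigger onto a random subset of the training images
        and flip their labels to the attacker's target class."""
        rng = rng or np.random.default_rng()
        images, labels = images.copy(), labels.copy()
        idx = rng.choice(len(images), size=int(rate * len(images)), replace=False)
        images[idx, -3:, -3:] = images.max()  # trigger in the bottom-right corner
        labels[idx] = target_label
        return images, labels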

89. Binary Graph Neural Networks [PDF]
  Mehdi Bahri, Gaétan Bahl, Stefanos Zafeiriou
Abstract: Graph Neural Networks (GNNs) have emerged as a powerful and flexible framework for representation learning on irregular data. As they generalize the operations of classical CNNs on grids to arbitrary topologies, GNNs also inherit many of the implementation challenges of their Euclidean counterparts. Model size, memory footprint, and energy consumption are common concerns for many real-world applications. Network binarization allocates a single bit to network parameters and activations, thus dramatically reducing memory requirements (up to 32x compared to single-precision floating-point parameters) and maximizing the benefits of fast SIMD instructions on modern hardware for measurable speedups. However, despite the large body of work on binarization for classical CNNs, this area remains largely unexplored in geometric deep learning. In this paper, we present and evaluate different strategies for the binarization of graph neural networks. We show that through careful design of the models and control of the training process, binary graph neural networks can be trained at only a moderate cost in accuracy on challenging benchmarks. In particular, we present the first dynamic graph neural network in Hamming space, able to leverage efficient $k$-NN search on binary vectors to speed up the construction of the dynamic graph. We further verify that the binary models offer significant savings on embedded devices.
