
[arXiv papers] Computer Vision and Pattern Recognition 2021-01-06

Contents

1. Local Memory Attention for Fast Video Semantic Segmentation [PDF] Abstract
2. Learning from Synthetic Shadows for Shadow Detection and Removal [PDF] Abstract
3. Learning Accurate Dense Correspondences and When to Trust Them [PDF] Abstract
4. Human Activity Recognition using Wearable Sensors: Review, Challenges, Evaluation Benchmark [PDF] Abstract
5. Spatial Attention Improves Iterative 6D Object Pose Estimation [PDF] Abstract
6. Novel View Synthesis via Depth-guided Skip Connections [PDF] Abstract
7. Look Twice: A Computational Model of Return Fixations across Tasks and Species [PDF] Abstract
8. STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering [PDF] Abstract
9. Bilateral Grid Learning for Stereo Matching Network [PDF] Abstract
10. Learning the Predictability of the Future [PDF] Abstract
11. Noise Sensitivity-Based Energy Efficient and Robust Adversary Detection in Neural Networks [PDF] Abstract
12. Deep Class-Specific Affinity-Guided Convolutional Network for Multimodal Unpaired Image Segmentation [PDF] Abstract
13. Local Propagation for Few-Shot Learning [PDF] Abstract
14. Scale-Aware Network with Regional and Semantic Attentions for Crowd Counting under Cluttered Background [PDF] Abstract
15. PointCutMix: Regularization Strategy for Point Cloud Classification [PDF] Abstract
16. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection [PDF] Abstract
17. Dataset on Bi- and Multi-Nucleated Tumor Cells in Canine Cutaneous Mast Cell Tumors [PDF] Abstract
18. CycleGAN for Interpretable Online EMT Compensation [PDF] Abstract
19. Support Vector Machine and YOLO for a Mobile Food Grading System [PDF] Abstract
20. Relaxed Conditional Image Transfer for Semi-supervised Domain Adaptation [PDF] Abstract
21. VersatileGait: A Large-Scale Synthetic Gait Dataset with Fine-Grained Attributes and Complicated Scenarios [PDF] Abstract
22. Understanding the Ability of Deep Neural Networks to Count Connected Components in Images [PDF] Abstract
23. An Automatic System to Monitor the Physical Distance and Face Mask Wearing of Construction Workers in COVID-19 Pandemic [PDF] Abstract
24. Similarity Reasoning and Filtration for Image-Text Matching [PDF] Abstract
25. High Precision Medicine Bottles Vision Online Inspection System and Classification Based on Multi-Features and Ensemble Learning via Independence Test [PDF] Abstract
26. Self-supervised Visual-LiDAR Odometry with Flip Consistency [PDF] Abstract
27. Research on Fast Text Recognition Method for Financial Ticket Image [PDF] Abstract
28. CycleSegNet: Object Co-segmentation with Cycle Refinement and Region Correspondence [PDF] Abstract
29. SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection [PDF] Abstract
30. Learn by Guessing: Multi-Step Pseudo-Label Refinement for Person Re-Identification [PDF] Abstract
31. Semantic Video Segmentation for Intracytoplasmic Sperm Injection Procedures [PDF] Abstract
32. Monocular Depth Estimation for Soft Visuotactile Sensors [PDF] Abstract
33. Robust R-Peak Detection in Low-Quality Holter ECGs using 1D Convolutional Neural Network [PDF] Abstract
34. Contextual colorization and denoising for low-light ultra high resolution sequences [PDF] Abstract
35. Density Compensated Unrolled Networks for Non-Cartesian MRI Reconstruction [PDF] Abstract
36. Brain Tumor Segmentation and Survival Prediction using Automatic Hard mining in 3D CNN Architecture [PDF] Abstract
37. On the Control of Attentional Processes in Vision [PDF] Abstract
38. End-to-End Video Question-Answer Generation with Generator-Pretester Network [PDF] Abstract
39. CLOI: An Automated Benchmark Framework For Generating Geometric Digital Twins Of Industrial Facilities [PDF] Abstract
40. Reconstructing Patchy Reionization with Deep Learning [PDF] Abstract
41. Advances in Electron Microscopy with Deep Learning [PDF] Abstract

Abstracts

1. Local Memory Attention for Fast Video Semantic Segmentation [PDF] Back to contents
  Matthieu Paul, Martin Danelljan, Luc Van Gool, Radu Timofte
Abstract: We propose a novel neural network module that transforms an existing single-frame semantic segmentation model into a video semantic segmentation pipeline. In contrast to prior works, we strive towards a simple and general module that can be integrated into virtually any single-frame architecture. Our approach aggregates a rich representation of the semantic information in past frames into a memory module. Information stored in the memory is then accessed through an attention mechanism. This provides temporal appearance cues from prior frames, which are then fused with an encoding of the current frame through a second attention-based module. The segmentation decoder processes the fused representation to predict the final semantic segmentation. We integrate our approach into two popular semantic segmentation networks: ERFNet and PSPNet. We observe an improvement in segmentation performance on Cityscapes by 1.7% and 2.1% in mIoU respectively, while increasing inference time of ERFNet by only 1.5ms.
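
To make the idea concrete, here is a minimal PyTorch-style sketch of a memory read followed by attention-based fusion. All names, dimensions, and the memory-update policy are assumptions for illustration; the paper's actual module may differ.

```python
import torch
import torch.nn as nn

class LocalMemoryAttention(nn.Module):
    """Hypothetical sketch: fuse current-frame features with a memory of past frames."""
    def __init__(self, dim=128, heads=4, memory_size=4):
        super().__init__()
        self.memory_size = memory_size
        self.memory = []  # list of (H*W, B, dim) feature maps from past frames
        self.read_attn = nn.MultiheadAttention(dim, heads)   # read temporal cues from memory
        self.fuse_attn = nn.MultiheadAttention(dim, heads)   # second attention-based fusion

    def forward(self, feat):                  # feat: (B, dim, H, W) from a single-frame encoder
        b, c, h, w = feat.shape
        q = feat.flatten(2).permute(2, 0, 1)  # (H*W, B, dim) query from the current frame
        if self.memory:
            mem = torch.cat(self.memory, dim=0)       # stacked past-frame representations
            cues, _ = self.read_attn(q, mem, mem)     # temporal appearance cues
            fused, _ = self.fuse_attn(q, cues, cues)  # fuse cues with the current encoding
        else:
            fused = q
        self.memory.append(q.detach())
        self.memory = self.memory[-self.memory_size:]     # keep a bounded memory
        return fused.permute(1, 2, 0).view(b, c, h, w)    # back to (B, dim, H, W) for the decoder
```

The segmentation decoder of the host network (e.g. ERFNet or PSPNet) would then consume the fused map in place of the single-frame features.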

2. Learning from Synthetic Shadows for Shadow Detection and Removal [PDF] Back to contents
  Naoto Inoue, Toshihiko Yamasaki
Abstract: Shadow removal is an essential task in computer vision and computer graphics. Recent shadow removal approaches all train convolutional neural networks (CNN) on real paired shadow/shadow-free or shadow/shadow-free/mask image datasets. However, obtaining a large-scale, diverse, and accurate dataset has been a big challenge, and it limits the performance of the learned models on shadow images with unseen shapes/intensities. To overcome this challenge, we present SynShadow, a novel large-scale synthetic shadow/shadow-free/matte image triplets dataset and a pipeline to synthesize it. We extend a physically-grounded shadow illumination model and synthesize a shadow image given an arbitrary combination of a shadow-free image, a matte image, and shadow attenuation parameters. Owing to the diversity, quantity, and quality of SynShadow, we demonstrate that shadow removal models trained on SynShadow perform well in removing shadows with diverse shapes and intensities on some challenging benchmarks. Furthermore, we show that merely fine-tuning from a SynShadow-pre-trained model improves existing shadow detection and removal models. Codes are publicly available at this https URL.
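
A minimal sketch of matte-based shadow compositing in the spirit described above, assuming a simple per-channel attenuation applied in roughly linear RGB. The paper's physically-grounded illumination model is more elaborate; all names here are illustrative.

```python
import numpy as np

def synthesize_shadow(shadow_free, matte, w_dark, gamma=2.2):
    """Hypothetical compositing step: darken a shadow-free image where the matte is high.

    shadow_free: float array in [0, 1], shape (H, W, 3)
    matte:       float array in [0, 1], shape (H, W, 1), soft shadow mask
    w_dark:      per-channel attenuation in [0, 1], shape (3,)
    """
    lin = shadow_free ** gamma                      # work in (approximately) linear RGB
    shadowed = lin * w_dark                         # fully attenuated (shadowed) appearance
    out = matte * shadowed + (1.0 - matte) * lin    # blend by the soft matte
    return out ** (1.0 / gamma)                     # back to display gamma

# e.g. sample a random attenuation per image:
# w_dark = np.random.uniform(0.2, 0.8, size=3)
```

Sampling the attenuation parameters per triplet is what gives the dataset its diversity of shadow intensities.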

3. Learning Accurate Dense Correspondences and When to Trust Them [PDF] Back to contents
  Prune Truong, Martin Danelljan, Luc Van Gool, Radu Timofte
Abstract: Establishing dense correspondences between a pair of images is an important and general problem. However, dense flow estimation is often inaccurate in the case of large displacements or homogeneous regions. For most applications and downstream tasks, such as pose estimation, image manipulation, or 3D reconstruction, it is crucial to know when and where to trust the estimated correspondences. In this work, we aim to estimate a dense flow field relating two images, coupled with a robust pixel-wise confidence map indicating the reliability and accuracy of the prediction. We develop a flexible probabilistic approach that jointly learns the flow prediction and its uncertainty. In particular, we parametrize the predictive distribution as a constrained mixture model, ensuring better modelling of both accurate flow predictions and outliers. Moreover, we develop an architecture and training strategy tailored for robust and generalizable uncertainty prediction in the context of self-supervised training. Our approach obtains state-of-the-art results on multiple challenging geometric matching and optical flow datasets. We further validate the usefulness of our probabilistic confidence estimation for the task of pose estimation. Code and models will be released at this http URL.

4. Human Activity Recognition using Wearable Sensors: Review, Challenges, Evaluation Benchmark [PDF] Back to contents
  Reem Abdel-Salam, Rana Mostafa, Mayada Hadhood
Abstract: Recognizing human activity plays a significant role in the advancements of human-interaction applications in healthcare, personal fitness, and smart devices. Many papers presented various techniques for human activity representation that resulted in distinguishable progress. In this study, we conduct an extensive literature review on recent, top-performing techniques in human activity recognition based on wearable sensors. Due to the lack of standardized evaluation, and to assess and ensure a fair comparison between the state-of-the-art techniques, we applied a standardized evaluation benchmark on the state-of-the-art techniques using six publicly available datasets: MHealth, USCHAD, UTD-MHAD, WISDM, WHARF, and OPPORTUNITY. Also, we propose an experimental, improved approach that is a hybrid of enhanced handcrafted features and a neural network architecture, which outperformed top-performing techniques under the same standardized evaluation benchmark on the MHealth, USCHAD, and UTD-MHAD datasets.

5. Spatial Attention Improves Iterative 6D Object Pose Estimation [PDF] Back to contents
  Stefan Stevsic, Otmar Hilliges
Abstract: The task of estimating the 6D pose of an object from RGB images can be broken down into two main steps: an initial pose estimation step, followed by a refinement procedure to correctly register the object and its observation. In this paper, we propose a new method for 6D pose estimation refinement from RGB images. To achieve high accuracy of the final estimate, the observation and a rendered model need to be aligned. Our main insight is that after the initial pose estimate, it is important to pay attention to distinct spatial features of the object in order to improve the estimation accuracy during alignment. Furthermore, parts of the object that are occluded in the image should be given less weight during the alignment process. Most state-of-the-art refinement approaches do not allow for this fine-grained reasoning and can not fully leverage the structure of the problem. In contrast, we propose a novel neural network architecture built around a spatial attention mechanism that identifies and leverages information about spatial details during pose refinement. We experimentally show that this approach learns to attend to salient spatial features and learns to ignore occluded parts of the object, leading to better pose estimation across datasets. We conduct experiments on standard benchmark datasets for 6D pose estimation (LineMOD and Occlusion LineMOD) and outperform previous state-of-the-art methods.

6. Novel View Synthesis via Depth-guided Skip Connections [PDF] Back to contents
  Yuxin Hou, Arno Solin, Juho Kannala
Abstract: We introduce a principled approach for synthesizing new views of a scene given a single source image. Previous methods for novel view synthesis can be divided into image-based rendering methods (e.g. flow prediction) or pixel generation methods. Flow predictions enable the target view to re-use pixels directly, but can easily lead to distorted results. Directly regressing pixels can produce structurally consistent results but generally suffer from the lack of low-level details. In this paper, we utilize an encoder-decoder architecture to regress pixels of a target view. In order to maintain details, we couple the decoder aligned feature maps with skip connections, where the alignment is guided by predicted depth map of the target view. Our experimental results show that our method does not suffer from distortions and successfully preserves texture details with aligned skip connections.

7. Look Twice: A Computational Model of Return Fixations across Tasks and Species [PDF] Back to contents
  Mengmi Zhang, Will Xiao, Olivia Rose, Katarina Bendtz, Margaret Livingstone, Carlos Ponce, Gabriel Kreiman
Abstract: Saccadic eye movements allow animals to bring different parts of an image into high-resolution. During free viewing, inhibition of return incentivizes exploration by discouraging previously visited locations. Despite this inhibition, here we show that subjects make frequent return fixations. We systematically studied a total of 44,328 return fixations out of 217,440 fixations across different tasks, in monkeys and humans, and in static images or egocentric videos. The ubiquitous return fixations were consistent across subjects, tended to occur within short offsets, and were characterized by longer duration than non-return fixations. The locations of return fixations corresponded to image areas of higher saliency and higher similarity to the sought target during visual search tasks. We propose a biologically-inspired computational model that capitalizes on a deep convolutional neural network for object recognition to predict a sequence of fixations. Given an input image, the model computes four maps that constrain the location of the next saccade: a saliency map, a target similarity map, a saccade size map, and a memory map. The model exhibits frequent return fixations and approximates the properties of return fixations across tasks and species. The model provides initial steps towards capturing the trade-off between exploitation of informative image locations combined with exploration of novel image locations during scene viewing.
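
As a rough illustration, the four maps could be combined multiplicatively, with the memory map acting as a decaying inhibition-of-return signal that still permits return fixations once the memory fades. The combination rule and names below are assumptions, not the paper's exact model.

```python
import numpy as np

def next_fixation(saliency, target_sim, saccade_size, memory):
    """Hypothetical sketch: combine the four maps described above to pick the next saccade.

    All inputs are (H, W) maps in [0, 1]; `memory` is high at recently visited
    locations, so (1 - memory) implements a soft, forgetful inhibition of return.
    """
    priority = saliency * target_sim * saccade_size * (1.0 - memory)
    y, x = np.unravel_index(np.argmax(priority), priority.shape)
    return y, x

def decay_memory(memory, fixation, rate=0.5):
    """Decay old entries and stamp the new fixation into the memory map."""
    y, x = fixation
    memory = memory * rate
    memory[y, x] = 1.0
    return memory
```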

8. STaR: Self-supervised Tracking and Reconstruction of Rigid Objects in Motion with Neural Rendering [PDF] Back to contents
  Wentao Yuan, Zhaoyang Lv, Tanner Schmidt, Steven Lovegrove
Abstract: We present STaR, a novel method that performs Self-supervised Tracking and Reconstruction of dynamic scenes with rigid motion from multi-view RGB videos without any manual annotation. Recent work has shown that neural networks are surprisingly effective at the task of compressing many views of a scene into a learned function which maps from a viewing ray to an observed radiance value via volume rendering. Unfortunately, these methods lose all their predictive power once any object in the scene has moved. In this work, we explicitly model rigid motion of objects in the context of neural representations of radiance fields. We show that without any additional human specified supervision, we can reconstruct a dynamic scene with a single rigid object in motion by simultaneously decomposing it into its two constituent parts and encoding each with its own neural representation. We achieve this by jointly optimizing the parameters of two neural radiance fields and a set of rigid poses which align the two fields at each frame. On both synthetic and real world datasets, we demonstrate that our method can render photorealistic novel views, where novelty is measured on both spatial and temporal axes. Our factored representation furthermore enables animation of unseen object motion.

9. Bilateral Grid Learning for Stereo Matching Network [PDF] Back to contents
  Bin Xu, Yuhua Xu, Xiaoli Yang, Wei Jia, Yulan Guo
Abstract: The real-time performance of the stereo matching network is important for many applications, such as automatic driving, robot navigation and augmented reality (AR). Although significant progress has been made in stereo matching networks in recent years, it is still challenging to balance real-time performance and accuracy. In this paper, we present a novel edge-preserving cost volume upsampling module based on the slicing operation in the learned bilateral grid. The slicing layer is parameter-free, which allows us to obtain a high quality cost volume of high resolution from a low resolution cost volume under the guide of the learned guidance map efficiently. The proposed cost volume upsampling module can be seamlessly embedded into many existing stereo matching networks, such as GCNet, PSMNet, and GANet. The resulting networks are accelerated several times while maintaining comparable accuracy. Furthermore, based on this module we design a real-time network (named BGNet), which outperforms the existing published real-time deep stereo matching networks, as well as some complex networks on KITTI stereo datasets. The code of the proposed method will be available.

10. Learning the Predictability of the Future [PDF] Back to contents
  Dídac Surís, Ruoshi Liu, Carl Vondrick
Abstract: We introduce a framework for learning from unlabeled video what is predictable in the future. Instead of committing up front to features to predict, our approach learns from data which features are predictable. Based on the observation that hyperbolic geometry naturally and compactly encodes hierarchical structure, we propose a predictive model in hyperbolic space. When the model is most confident, it will predict at a concrete level of the hierarchy, but when the model is not confident, it learns to automatically select a higher level of abstraction. Experiments on two established datasets show the key role of hierarchical representations for action prediction. Although our representation is trained with unlabeled video, visualizations show that action hierarchies emerge in the representation.

11. Noise Sensitivity-Based Energy Efficient and Robust Adversary Detection in Neural Networks [PDF] Back to contents
  Rachel Sterneck, Abhishek Moitra, Priyadarshini Panda
Abstract: Neural networks have achieved remarkable performance in computer vision, however they are vulnerable to adversarial examples. Adversarial examples are inputs that have been carefully perturbed to fool classifier networks, while appearing unchanged to humans. Based on prior works on detecting adversaries, we propose a structured methodology of augmenting a deep neural network (DNN) with a detector subnetwork. We use $\textit{Adversarial Noise Sensitivity}$ (ANS), a novel metric for measuring the adversarial gradient contribution of different intermediate layers of a network. Based on the ANS value, we append a detector to the most sensitive layer. In prior works, more complex detectors were added to a DNN, increasing the inference computational cost of the model. In contrast, our structured and strategic addition of a detector to a DNN reduces the complexity of the model while making the overall network adversarially resilient. Through comprehensive white-box and black-box experiments on MNIST, CIFAR-10, and CIFAR-100, we show that our method improves state-of-the-art detector robustness against adversarial examples. Furthermore, we validate the energy efficiency of our proposed adversarial detection methodology through an extensive energy analysis on various hardware scalable CMOS accelerator platforms. We also demonstrate the effects of quantization on our detector-appended networks.
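
A hedged sketch of one plausible way to measure per-layer adversarial gradient contribution with PyTorch hooks. The paper defines ANS precisely; this normalized gradient-norm proxy and all names here are assumptions.

```python
import torch

def layerwise_gradient_sensitivity(model, loss_fn, x, y):
    """Hypothetical proxy for Adversarial Noise Sensitivity: the norm of the loss
    gradient w.r.t. each intermediate activation (the paper's exact ANS definition
    may differ). The detector would then be attached after the most sensitive layer."""
    activations = []

    def hook(_module, _inp, out):
        out.retain_grad()          # keep gradients of non-leaf activations
        activations.append(out)

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, torch.nn.Conv2d)]
    loss = loss_fn(model(x), y)
    loss.backward()
    for h in handles:
        h.remove()
    # normalize by activation size so layers of different width are comparable
    return [a.grad.norm().item() / a.numel() for a in activations]
```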

12. Deep Class-Specific Affinity-Guided Convolutional Network for Multimodal Unpaired Image Segmentation [PDF] Back to contents
  Jingkun Chen, Wenqi Li, Hongwei Li, Jianguo Zhang
Abstract: Multi-modal medical image segmentation plays an essential role in clinical diagnosis. It remains challenging as the input modalities are often not well-aligned spatially. Existing learning-based methods mainly consider sharing trainable layers across modalities and minimizing visual feature discrepancies. While the problem is often formulated as joint supervised feature learning, multiple-scale features and class-specific representation have not yet been explored. In this paper, we propose an affinity-guided fully convolutional network for multimodal image segmentation. To learn effective representations, we design class-specific affinity matrices to encode the knowledge of hierarchical feature reasoning, together with the shared convolutional layers to ensure the cross-modality generalization. Our affinity matrix does not depend on spatial alignments of the visual features and thus allows us to train with unpaired, multimodal inputs. We extensively evaluated our method on two public multimodal benchmark datasets and outperform state-of-the-art methods.

13. Local Propagation for Few-Shot Learning [PDF] Back to contents
  Yann Lifchitz, Yannis Avrithis, Sylvaine Picard
Abstract: The challenge in few-shot learning is that available data is not enough to capture the underlying distribution. To mitigate this, two emerging directions are (a) using local image representations, essentially multiplying the amount of data by a constant factor, and (b) using more unlabeled data, for instance by transductive inference, jointly on a number of queries. In this work, we bring these two ideas together, introducing \emph{local propagation}. We treat local image features as independent examples, we build a graph on them and we use it to propagate both the features themselves and the labels, known and unknown. Interestingly, since there is a number of features per image, even a single query gives rise to transductive inference. As a result, we provide a universally safe choice for few-shot inference under both non-transductive and transductive settings, improving accuracy over corresponding methods. This is in contrast to existing solutions, where one needs to choose the method depending on the quantity of available data.

14. Scale-Aware Network with Regional and Semantic Attentions for Crowd Counting under Cluttered Background [PDF] Back to contents
  Qiaosi Yi, Yunxing Liu, Aiwen Jiang, Juncheng Li, Kangfu Mei, Mingwen Wang
Abstract: Crowd counting is an important task that has shown great application value in public safety-related fields and has attracted increasing attention in recent years. In the current research, the accuracy of counting numbers and crowd density estimation are the main concerns. Although the emergence of deep learning has greatly promoted the development of this field, crowd counting under cluttered background is still a serious challenge. In order to solve this problem, we propose a Scale-Aware Crowd Counting Network (SACCN) with regional and semantic attentions. The proposed SACCN distinguishes crowd and background by applying regional and semantic self-attention mechanisms on the shallow layers and deep layers, respectively. Moreover, the asymmetric multi-scale module (AMM) is proposed to deal with the problem of scale diversity, and regional attention based dense connections and skip connections are designed to alleviate the variations on crowd scales. Extensive experimental results on multiple public benchmarks demonstrate that our proposed SACCN achieves superior performance and outperforms most state-of-the-art methods. All codes and pretrained models will be released soon.

15. PointCutMix: Regularization Strategy for Point Cloud Classification [PDF] Back to contents
  Jinlai Zhang, Lvjie Chen, Bo Ouyang, Binbin Liu, Jihong Zhu, Yujing Chen, Yanmei Meng, Danfeng Wu
Abstract: 3D point cloud analysis has received increasing attention in recent years; however, the diversity and availability of point cloud datasets are still limited. We therefore present PointCutMix, a simple but effective method for augmentation in point cloud. In our method, after finding the optimal assignment between two point clouds, we replace some points in one point cloud by their counterpart points in the other point cloud. Our strategy consistently and significantly improves the performance across various models and datasets. Surprisingly, when it is used as a defense method, it shows far superior performance to the SOTA defense algorithm. The code is available at: this https URL
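
The core augmentation is easy to sketch: compute an optimal one-to-one assignment between the two clouds, then swap a fraction of points. The sketch below uses scipy's Hungarian solver on pairwise Euclidean distances as the assignment step; the paper's exact cost and point-sampling scheme may differ.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pointcutmix(pc_a, pc_b, ratio=0.5):
    """Hypothetical sketch of the augmentation described above: align two point
    clouds with an optimal one-to-one assignment, then replace a fraction of
    points in `pc_a` by their assigned counterparts in `pc_b`.

    pc_a, pc_b: (N, 3) arrays with the same number of points N.
    """
    cost = np.linalg.norm(pc_a[:, None, :] - pc_b[None, :, :], axis=-1)  # (N, N) distances
    rows, cols = linear_sum_assignment(cost)           # optimal assignment
    n_swap = int(ratio * len(pc_a))
    swap = np.random.choice(len(rows), n_swap, replace=False)
    mixed = pc_a.copy()
    mixed[rows[swap]] = pc_b[cols[swap]]               # replace by counterpart points
    return mixed                                       # labels would be mixed in the same ratio
```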

16. WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection [PDF] Back to contents
  Bojia Zi, Minghao Chang, Jingjing Chen, Xingjun Ma, Yu-Gang Jiang
Abstract: In recent years, the abuse of a face swap technique called deepfake has raised enormous public concerns. So far, a large number of deepfake videos (known as "deepfakes") have been crafted and uploaded to the internet, calling for effective countermeasures. One promising countermeasure against deepfakes is deepfake detection. Several deepfake datasets have been released to support the training and testing of deepfake detectors, such as DeepfakeDetection and FaceForensics++. While this has greatly advanced deepfake detection, most of the real videos in these datasets are filmed with a few volunteer actors in limited scenes, and the fake videos are crafted by researchers using a few popular deepfake software packages. Detectors developed on these datasets may become less effective against real-world deepfakes on the internet. To better support detection against real-world deepfakes, in this paper, we introduce a new dataset WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. WildDeepfake is a small dataset that can be used, in addition to existing datasets, to develop and test the effectiveness of deepfake detectors against real-world deepfakes. We conduct a systematic evaluation of a set of baseline detection networks on both existing and our WildDeepfake datasets, and show that WildDeepfake is indeed a more challenging dataset, where the detection performance can decrease drastically. We also propose two (2D and 3D) Attention-based Deepfake Detection Networks (ADDNets) to leverage the attention masks on real/fake faces for improved detection. We empirically verify the effectiveness of ADDNets on both existing datasets and WildDeepfake. The dataset is available at: this https URL.

17. Dataset on Bi- and Multi-Nucleated Tumor Cells in Canine Cutaneous Mast Cell Tumors [PDF] Back to contents
  Christof A. Bertram, Taryn A. Donovan, Marco Tecilla, Florian Bartenschlager, Marco Fragoso, Frauke Wilm, Christian Marzahl, Katharina Breininger, Andreas Maier, Robert Klopfleisch, Marc Aubreville
Abstract: Tumor cells with two nuclei (binucleated cells, BiNC) or more nuclei (multinucleated cells, MuNC) indicate an increased amount of cellular genetic material which is thought to facilitate oncogenesis, tumor progression and treatment resistance. In canine cutaneous mast cell tumors (ccMCT), binucleation and multinucleation are parameters used in cytologic and histologic grading schemes (respectively) which correlate with poor patient outcome. For this study, we created the first open source data-set with 19,983 annotations of BiNC and 1,416 annotations of MuNC in 32 histological whole slide images of ccMCT. Labels were created by a pathologist and an algorithmic-aided labeling approach with expert review of each generated candidate. A state-of-the-art deep learning-based model yielded an $F_1$ score of 0.675 for BiNC and 0.623 for MuNC on 11 test whole slide images. In regions of interest ($2.37 mm^2$) extracted from these test images, 6 pathologists had an object detection performance between 0.270 - 0.526 for BiNC and 0.316 - 0.622 for MuNC, while our model achieved an $F_1$ score of 0.667 for BiNC and 0.685 for MuNC. This open dataset can facilitate development of automated image analysis for this task and may thereby help to promote standardization of this facet of histologic tumor prognostication.

18. CycleGAN for Interpretable Online EMT Compensation [PDF] Back to contents
  Henry Krumb, Dhritimaan Das, Romol Chadda, Anirban Mukhopadhyay
Abstract: Purpose: Electromagnetic Tracking (EMT) can partially replace X-ray guidance in minimally invasive procedures, reducing radiation in the OR. However, in this hybrid setting, EMT is disturbed by metallic distortion caused by the X-ray device. We plan to make hybrid navigation clinical reality to reduce radiation exposure for patients and surgeons, by compensating EMT error. Methods: Our online compensation strategy exploits cycle-consistent generative adversarial neural networks (CycleGAN). 3D positions are translated from various bedside environments to their bench equivalents. Domain-translated points are fine-tuned to reduce error in the bench domain. We evaluate our compensation approach in a phantom experiment. Results: Since the domain-translation approach maps distorted points to their lab equivalents, predictions are consistent among different C-arm environments. Error is successfully reduced in all evaluation environments. Our qualitative phantom experiment demonstrates that our approach generalizes well to an unseen C-arm environment. Conclusion: Adversarial, cycle-consistent training is an explicable, consistent and thus interpretable approach for online error compensation. Qualitative assessment of EMT error compensation gives a glimpse to the potential of our method for rotational error compensation.

19. Support Vector Machine and YOLO for a Mobile Food Grading System [PDF] Back to contents
  Lili Zhu, Petros Spachos
Abstract: Food quality and safety are of great concern to society since they are an essential guarantee not only for human health but also for social development and stability. Ensuring food quality and safety is a complex process. All food processing stages should be considered, from cultivating, harvesting and storage to preparation and consumption. Grading is one of the essential processes to control food quality. This paper proposed a mobile visual-based system to evaluate food grading. Specifically, the proposed system acquires images of bananas when they are on moving conveyors. A two-layer image processing system based on machine learning is used to grade bananas, and these two layers are allocated on edge devices and cloud servers, respectively. A Support Vector Machine (SVM) is the first layer, classifying bananas based on an extracted feature vector composed of color and texture features. Then, a You Only Look Once (YOLO) v3 model further locates the peel's defected area and determines whether the inputs belong to the mid-ripened or well-ripened class. According to experimental results, the first layer achieved an accuracy of 98.5%, the second layer an accuracy of 85.7%, and the overall accuracy is 96.4%.
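
A minimal sketch of the first (edge-side) layer, assuming simple color statistics plus a gradient-energy texture cue as the feature vector. The paper's actual features and SVM configuration may differ, and the YOLOv3 second layer is only indicated in comments.

```python
import numpy as np
from sklearn.svm import SVC

def color_texture_features(img):
    """Hypothetical feature vector: per-channel color statistics plus a crude
    texture measure (gradient energy); the paper's exact features may differ."""
    img = img.astype(np.float32) / 255.0            # img: (H, W, 3) uint8 banana crop
    color = np.concatenate([img.mean(axis=(0, 1)), img.std(axis=(0, 1))])
    gray = img.mean(axis=2)
    gy, gx = np.gradient(gray)
    texture = np.array([np.mean(gx ** 2 + gy ** 2)])
    return np.concatenate([color, texture])

# First layer (edge device): SVM over handcrafted features.
# X_train: list of banana crops, y_train: ripeness grades.
# clf = SVC(kernel="rbf").fit([color_texture_features(im) for im in X_train], y_train)
# Second layer (cloud): a YOLOv3 detector locates defected peel regions for the
# crops that the SVM routes to the finer-grained ripeness classes.
```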

20. Relaxed Conditional Image Transfer for Semi-supervised Domain Adaptation [PDF] Back to contents
  Qijun Luo, Zhili Liu, Lanqing Hong, Chongxuan Li, Kuo Yang, Liyuan Wang, Fengwei Zhou, Guilin Li, Zhenguo Li, Jun Zhu
Abstract: Semi-supervised domain adaptation (SSDA), which aims to learn models in a partially labeled target domain with the assistance of the fully labeled source domain, attracts increasing attention in recent years. To explicitly leverage the labeled data in both domains, we naturally introduce a conditional GAN framework to transfer images without changing the semantics in SSDA. However, we identify a label-domination problem in such an approach. In fact, the generator tends to overlook the input source image and only memorizes prototypes of each class, which results in unsatisfactory adaptation performance. To this end, we propose a simple yet effective Relaxed conditional GAN (Relaxed cGAN) framework. Specifically, we feed the image without its label to our generator. In this way, the generator has to infer the semantic information of input data. We formally prove that its equilibrium is desirable and empirically validate its practical convergence and effectiveness in image transfer. Additionally, we propose several techniques to make use of unlabeled data in the target domain, enhancing the model in SSDA settings. We validate our method on the well-adopted datasets: Digits, DomainNet, and Office-Home. We achieve state-of-the-art performance on DomainNet, Office-Home and most digit benchmarks in low-resource and high-resource settings.

21. VersatileGait: A Large-Scale Synthetic Gait Dataset with Fine-Grained Attributes and Complicated Scenarios [PDF] Back to contents
  Huanzhang Dou, Wenhu Zhang, Pengyi Zhang, Yuhan Zhao, Songyuan Li, Zequn Qin, Fei Wu, Lin Dong, Xi Li
Abstract: With the motivation of practical gait recognition applications, we propose to automatically create a large-scale synthetic gait dataset (called VersatileGait) by a game engine, which consists of around one million silhouette sequences of 11,000 subjects with fine-grained attributes in various complicated scenarios. Compared with existing real gait datasets with limited samples and simple scenarios, the proposed VersatileGait dataset possesses several nice properties, including huge dataset size, high sample diversity, high-quality annotations, multi-pitch angles, small domain gap with the real one, etc. Furthermore, we investigate the effectiveness of our dataset (e.g., domain transfer after pretraining). Then, we use the fine-grained attributes from VersatileGait to promote gait recognition in both accuracy and speed, and meanwhile justify the gait recognition performance under multi-pitch angle settings. Additionally, we explore a variety of potential applications for research. Extensive experiments demonstrate the value and effectiveness of the proposed VersatileGait in gait recognition along with its associated applications. We will release both VersatileGait and its corresponding data generation toolkit for further studies.

22. Understanding the Ability of Deep Neural Networks to Count Connected Components in Images [PDF] Back to contents
  Shuyue Guan, Murray Loew
Abstract: Humans can count very fast by subitizing, but slow substantially as the number of objects increases. Previous studies have shown a trained deep neural network (DNN) detector can count the number of objects in an amount of time that increases slowly with the number of objects. Such a phenomenon suggests the subitizing ability of DNNs, and unlike humans, it works equally well for large numbers. Many existing studies have successfully applied DNNs to object counting, but few studies have studied the subitizing ability of DNNs and its interpretation. In this paper, we found DNNs do not have the ability to generally count connected components. We provided experiments to support our conclusions and explanations to understand the results and phenomena of these experiments. We proposed three ML-learnable characteristics to verify learnable problems for ML models, such as DNNs, and explain why DNNs work for specific counting problems but cannot generally count connected components.
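
For reference, the ground truth for such experiments comes from a classical connected-component labeling algorithm; a hypothetical data generator might look like this.

```python
import numpy as np
from scipy import ndimage

def make_counting_sample(size=64, n_blobs=5, rng=None):
    """Hypothetical data generator for the connected-component counting task:
    draw random square blobs, then label connected components to get the true count
    (blobs that overlap merge into one component)."""
    rng = rng or np.random.default_rng()
    img = np.zeros((size, size), dtype=np.uint8)
    for _ in range(n_blobs):
        y, x = rng.integers(2, size - 2, size=2)
        img[y - 2:y + 2, x - 2:x + 2] = 1
    _, true_count = ndimage.label(img)      # classical algorithm gives the exact count
    return img, true_count                   # a DNN counter must learn this mapping
```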

23. An Automatic System to Monitor the Physical Distance and Face Mask Wearing of Construction Workers in COVID-19 Pandemic [PDF] Back to contents
  Moein Razavi, Hamed Alikhani, Vahid Janfaza, Benyamin Sadeghi, Ehsan Alikhani
Abstract: The COVID-19 pandemic has caused many shutdowns in different industries around the world. Sectors such as infrastructure construction and maintenance projects have not been suspended due to their significant effect on people's routine life. In such projects, workers work close together, which creates a high risk of infection. The World Health Organization recommends wearing a face mask and practicing physical distancing to mitigate the virus's spread. This paper developed a computer vision system to automatically detect violations of face mask wearing and physical distancing among construction workers to assure their safety on infrastructure projects during the pandemic. For the face mask detection, the paper collected and annotated 1,000 images, including different types of face mask wearing, and added them to a pre-existing face mask dataset to develop a dataset of 1,853 images. It then trained and tested multiple TensorFlow state-of-the-art object detection models on the face mask dataset and chose the Faster R-CNN Inception ResNet V2 network, which yielded an accuracy of 99.8%. For physical distance detection, the paper employed the Faster R-CNN Inception V2 to detect people. A transformation matrix was used to eliminate the camera angle's effect on the object distances in the image. The Euclidean distance used the pixels of the transformed image to compute the actual distance between people. A threshold of six feet was used to capture physical distance violations. The paper also used transfer learning for training the model. The final model was applied on four videos of road maintenance projects in Houston, TX, and effectively detected face mask wearing and physical distance. We recommend that construction owners use the proposed system to enhance construction workers' safety in the pandemic situation.
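
The distance-measurement step can be sketched as a homography from image coordinates to a ground plane followed by a Euclidean distance check. The calibration points below are made-up placeholders, not the paper's values.

```python
import cv2
import numpy as np

# Hypothetical bird's-eye calibration: four image points of a known ground
# rectangle and their top-view coordinates (units chosen so distances are in feet).
src = np.float32([[420, 720], [880, 720], [760, 420], [520, 420]])
dst = np.float32([[0, 0], [12, 0], [12, 24], [0, 24]])
M = cv2.getPerspectiveTransform(src, dst)          # the transformation matrix

def ground_distance_ft(p1, p2):
    """Map the foot points of two detected people to the ground plane,
    then take the Euclidean distance between them."""
    pts = np.float32([[p1], [p2]])                  # shape (2, 1, 2) for perspectiveTransform
    (q1,), (q2,) = cv2.perspectiveTransform(pts, M)
    return float(np.linalg.norm(q1 - q2))

# violation = ground_distance_ft(foot_a, foot_b) < 6.0   # six-foot threshold
```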

24. Similarity Reasoning and Filtration for Image-Text Matching [PDF] Back to contents
  Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu
Abstract: Image-text matching plays a critical role in bridging the vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, the vector-based similarity representations are firstly learned to characterize the local and global alignments in a more comprehensive manner, and then the Similarity Graph Reasoning (SGR) module relying on one graph convolutional neural network is introduced to infer relation-aware similarities with both the local and global alignments. The Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending on the significant and representative alignments and meanwhile casting aside the interferences of non-meaningful alignments. We demonstrate the superiority of the proposed method with achieving state-of-the-art performances on the Flickr30K and MSCOCO datasets, and the good interpretability of SGR and SAF modules with extensive qualitative experiments and analyses.

25. High Precision Medicine Bottles Vision Online Inspection System and Classification Based on Multi-Features and Ensemble Learning via Independence Test [PDF] Back to contents
  Le Ma, Xiaoyue Wu, Zhiwei Li, Nan Gao, Jie Liu, Lingfang Sun
Abstract: To address the problem of online automatic inspection of drug liquid bottles on a production line, an implantable visual inspection system is designed and an ensemble learning algorithm for detection based on multi-feature fusion is proposed. A tunnel structure is designed for the visual inspection system, which allows bottle inspection to be automated without changing the original

26. Self-supervised Visual-LiDAR Odometry with Flip Consistency [PDF] Back to contents
  Bin Li, Mu Hu, Shuling Wang, Lianghao Wang, Xiaojin Gong
Abstract: Most learning-based methods estimate ego-motion by utilizing visual sensors, which suffer from dramatic lighting variations and textureless scenarios. In this paper, we incorporate sparse but accurate depth measurements obtained from lidars to overcome the limitation of visual methods. To this end, we design a self-supervised visual-lidar odometry (Self-VLO) framework. It takes both monocular images and sparse depth maps projected from 3D lidar points as input, and produces pose and depth estimations in an end-to-end learning manner, without using any ground truth labels. To effectively fuse two modalities, we design a two-pathway encoder to extract features from visual and depth images and fuse the encoded features with those in decoders at multiple scales by our fusion module. We also adopt a siamese architecture and design an adaptively weighted flip consistency loss to facilitate the self-supervised learning of our VLO. Experiments on the KITTI odometry benchmark show that the proposed approach outperforms all self-supervised visual or lidar odometries. It also performs better than fully supervised VOs, demonstrating the power of fusion.
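
A minimal sketch of a flip-consistency term, assuming a plain L1 penalty; the paper uses an adaptively weighted version, and the function names here are illustrative.

```python
import torch

def flip_consistency_loss(depth_net, image):
    """Hypothetical sketch: depth predicted on the horizontally flipped input
    should match the flipped prediction on the original input."""
    d = depth_net(image)                              # (B, 1, H, W) depth prediction
    d_flip = depth_net(torch.flip(image, dims=[3]))   # predict on the flipped input
    return torch.mean(torch.abs(torch.flip(d, dims=[3]) - d_flip))
```

This kind of self-consistency term requires no labels, which is what makes it usable as an extra self-supervised signal alongside photometric losses.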

27. Research on Fast Text Recognition Method for Financial Ticket Image [PDF] Back to contents
  Fukang Tian, Haiyu Wu, Bo Xu
Abstract: Currently, deep learning methods have been widely applied in and thus promoted the development of different fields. In the financial accounting field, the rapid increase in the number of financial tickets dramatically increases labor costs; hence, using a deep learning method to relieve the pressure on accounting is necessary. At present, a few works have applied deep learning methods to financial ticket recognition. However, first, their approaches only cover a few types of tickets. In addition, the precision and speed of their recognition models cannot meet the requirements of practical financial accounting systems. Moreover, none of the methods provides a detailed analysis of both the types and content of tickets. Therefore, this paper first analyzes the different features of 482 kinds of financial tickets, divides all kinds of financial tickets into three categories and proposes different recognition patterns for each category. These recognition patterns can meet almost all types of financial ticket recognition needs. Second, regarding the fixed format types of financial tickets (accounting for 68.27% of the total types of tickets), we propose a simple yet efficient network named the Financial Ticket Faster Detection network (FTFDNet) based on a Faster RCNN. Furthermore, according to the characteristics of the financial ticket text, in order to obtain higher recognition accuracy, the loss function, Region Proposal Network (RPN), and Non-Maximum Suppression (NMS) are improved to make FTFDNet focus more on text. Finally, we perform a comparison with the best ticket recognition model from the ICDAR2019 invoice competition. The experimental results illustrate that FTFDNet increases the processing speed by 50% while maintaining similar precision.

28. CycleSegNet: Object Co-segmentation with Cycle Refinement and Region Correspondence [PDF] Back to contents
  Guankai Li, Chi Zhang, Guosheng Lin
Abstract: Image co-segmentation is an active computer vision task which aims to segment the common objects in a set of images. Recently, researchers design various learning-based algorithms to handle the co-segmentation task. The main difficulty in this task is how to effectively transfer information between images to infer the common object regions. In this paper, we present CycleSegNet, a novel framework for the co-segmentation task. Our network design has two key components: a region correspondence module which is the basic operation for exchanging information between local image regions, and a cycle refinement module which utilizes ConvLSTMs to progressively update image embeddings and exchange information in a cycle manner. Experiment results on four popular benchmark datasets -- PASCAL VOC dataset, MSRC dataset, Internet dataset and iCoseg dataset demonstrate that our proposed method significantly outperforms the existing networks and achieves new state-of-the-art performance.

29. SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection [PDF] Back to contents
  Keren Ye, Adriana Kovashka, Mark Sandler, Menglong Zhu, Andrew Howard, Marco Fornoni
Abstract: Deep learning based object detectors are commonly deployed on mobile devices to solve a variety of tasks. For maximum accuracy, each detector is usually trained to solve one single specific task, and comes with a completely independent set of parameters. While this guarantees high performance, it is also highly inefficient, as each model has to be separately downloaded and stored. In this paper we address the question: can task-specific detectors be trained and represented as a shared set of weights, plus a very small set of additional weights for each task? The main contributions of this paper are the following: 1) we perform the first systematic study of parameter-efficient transfer learning techniques for object detection problems; 2) we propose a technique to learn a model patch with a size that is dependent on the difficulty of the task to be learned, and validate our approach on 10 different object detection tasks. Our approach achieves similar accuracy as previously proposed approaches, while being significantly more compact.

30. Learn by Guessing: Multi-Step Pseudo-Label Refinement for Person Re-Identification [PDF] Back to contents
  Tiago de C. G. Pereira, Teofilo E. de Campos
Abstract: Unsupervised Domain Adaptation (UDA) methods for person Re-Identification (Re-ID) rely on target domain samples to model the marginal distribution of the data. To deal with the lack of target domain labels, UDA methods leverage information from labeled source samples and unlabeled target samples. A promising approach relies on the use of unsupervised learning as part of the pipeline, such as clustering methods. The quality of the clusters clearly plays a major role in methods performance, but this point has been overlooked. In this work, we propose a multi-step pseudo-label refinement method to select the best possible clusters and keep improving them so that these clusters become closer to the class divisions without knowledge of the class labels. Our refinement method includes a cluster selection strategy and a camera-based normalization method which reduces the within-domain variations caused by the use of multiple cameras in person Re-ID. This allows our method to reach state-of-the-art UDA results on DukeMTMC-Market1501 (source-target). We surpass state-of-the-art for UDA Re-ID by 3.4% on Market1501-DukeMTMC datasets, which is a more challenging adaptation setup because the target domain (DukeMTMC) has eight distinct cameras. Furthermore, the camera-based normalization method causes a significant reduction in the number of iterations required for training convergence.
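
One plausible reading of the camera-based normalization is per-camera feature standardization before clustering; the sketch below encodes that assumption and is not necessarily the paper's exact formulation.

```python
import numpy as np

def camera_normalize(features, cam_ids):
    """Hypothetical sketch of camera-based normalization: standardize features
    per camera so pseudo-label clustering is not dominated by camera style.

    features: (N, D) embeddings; cam_ids: (N,) integer camera indices.
    """
    out = features.copy()
    for cam in np.unique(cam_ids):
        idx = cam_ids == cam
        mu = out[idx].mean(axis=0)
        sigma = out[idx].std(axis=0) + 1e-6
        out[idx] = (out[idx] - mu) / sigma     # remove per-camera statistics
    return out
```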

31. Semantic Video Segmentation for Intracytoplasmic Sperm Injection Procedures [PDF] Back to contents
  Peter He, Raksha Jain, Jérôme Chambost, Céline Jacques, Cristina Hickman
Abstract: We present the first deep learning model for the analysis of intracytoplasmic sperm injection (ICSI) procedures. Using a dataset of ICSI procedure videos, we train a deep neural network to segment key objects in the videos achieving a mean IoU of 0.962, and to localize the needle tip achieving a mean pixel error of 3.793 pixels at 14 FPS on a single GPU. We further analyze the variation between the dataset's human annotators and find the model's performance to be comparable to human experts.

32. Monocular Depth Estimation for Soft Visuotactile Sensors [PDF] Back to contents
  Rares Ambrus, Vitor Guizilini, Naveen Kuppuswamy, Andrew Beaulieu, Adrien Gaidon, Alex Alspach
Abstract: Fluid-filled soft visuotactile sensors such as the Soft-bubbles alleviate key challenges for robust manipulation, as they enable reliable grasps along with the ability to obtain high-resolution sensory feedback on contact geometry and forces. Although they are simple in construction, their utility has been limited due to size constraints introduced by enclosed custom IR/depth imaging sensors to directly measure surface deformations. Towards mitigating this limitation, we investigate the application of state-of-the-art monocular depth estimation to infer dense internal (tactile) depth maps directly from the internal single small IR imaging sensor. Through real-world experiments, we show that deep networks typically used for long-range depth estimation (1-100m) can be effectively trained for precise predictions at a much shorter range (1-100mm) inside a mostly textureless deformable fluid-filled sensor. We propose a simple supervised learning process to train an object-agnostic network requiring less than 10 random poses in contact for less than 10 seconds for a small set of diverse objects (mug, wine glass, box, and fingers in our experiments). We show that our approach is sample-efficient, accurate, and generalizes across different objects and sensor configurations unseen at training time. Finally, we discuss the implications of our approach for the design of soft visuotactile sensors and grippers.
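A minimal sketch of one supervised training step under this short-range regime follows; the L1 loss and the millimetre scale constant are assumptions for illustration, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def train_step(model, ir_image, gt_depth_mm, optimizer, scale_mm=100.0):
    """One supervised step for short-range (1-100 mm) depth: normalize the
    millimetre targets into the network's output range, regress per pixel."""
    optimizer.zero_grad()
    pred = model(ir_image)                    # (B, 1, H, W) normalized depth
    loss = F.l1_loss(pred, gt_depth_mm / scale_mm)
    loss.backward()
    optimizer.step()
    return loss.item()
```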

33. Robust R-Peak Detection in Low-Quality Holter ECGs using 1D Convolutional Neural Network [PDF] Back to contents
  Muhammad Uzair Zahid, Serkan Kiranyaz, Turker Ince, Ozer Can Devecioglu, Muhammad E. H. Chowdhury, Amith Khandakar, Anas Tahir, Moncef Gabbouj
Abstract: Noise and the low quality of ECG signals acquired from Holter or wearable devices deteriorate the accuracy and robustness of R-peak detection algorithms. This paper presents a generic and robust system for R-peak detection in Holter ECG signals. While many proposed algorithms have successfully addressed the problem of ECG R-peak detection, there is still a notable gap in the performance of these detectors on such low-quality ECG records. Therefore, in this study, a novel implementation of a 1D Convolutional Neural Network (CNN) is integrated with a verification model to reduce the number of false alarms. This CNN architecture consists of an encoder block and a corresponding decoder block, followed by a sample-wise classification layer, to construct the 1D segmentation map of R-peaks from the input ECG signal. Once the proposed model has been trained, it can on its own detect R-peaks in a single-channel ECG data stream quickly and accurately, or alternatively, such a solution can be conveniently employed for real-time monitoring on a lightweight portable device. The model is tested on two open-access ECG databases: the China Physiological Signal Challenge (2020) database (CPSC-DB), with more than one million beats, and the commonly used MIT-BIH Arrhythmia Database (MIT-DB). Experimental results demonstrate that the proposed systematic approach achieves a 99.30% F1-score, 99.69% recall, and 98.91% precision on CPSC-DB, the best R-peak detection performance achieved to date. Compared to all competing methods, the proposed approach can reduce false positives and false negatives in Holter ECG signals by more than 54% and 82%, respectively. Results also demonstrate similar or better performance than most competing algorithms on MIT-DB, with a 99.83% F1-score, 99.85% recall, and 99.82% precision.
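In the spirit of the described encoder-decoder design with a sample-wise head, here is a toy-scale PyTorch sketch; all layer counts and kernel sizes are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class RPeakSegmenter(nn.Module):
    """Encoder-decoder 1D CNN producing a per-sample R-peak logit."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, 9, stride=2, padding=4, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, 16, 9, stride=2, padding=4, output_padding=1),
            nn.ReLU(),
        )
        self.head = nn.Conv1d(16, 1, kernel_size=1)  # sample-wise classification

    def forward(self, x):   # x: (batch, 1, n_samples), n_samples % 4 == 0
        return self.head(self.decoder(self.encoder(x)))

logits = RPeakSegmenter()(torch.randn(2, 1, 1024))   # -> (2, 1, 1024)
```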

34. Contextual colorization and denoising for low-light ultra high resolution sequences [PDF] Back to contents
  N. Anantrasirichai, David Bull
Abstract: Low-light image sequences generally suffer from spatio-temporal incoherent noise, flicker and blurring of moving objects. These artefacts significantly reduce visual quality and, in most cases, post-processing is needed in order to generate acceptable quality. Most state-of-the-art enhancement methods based on machine learning require ground truth data but this is not usually available for naturally captured low light sequences. We tackle these problems with an unpaired-learning method that offers simultaneous colorization and denoising. Our approach is an adaptation of the CycleGAN structure. To overcome the excessive memory limitations associated with ultra high resolution content, we propose a multiscale patch-based framework, capturing both local and contextual features. Additionally, an adaptive temporal smoothing technique is employed to remove flickering artefacts. Experimental results show that our method outperforms existing approaches in terms of subjective quality and that it is robust to variations in brightness levels and noise.
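To illustrate what adaptive temporal smoothing can look like, here is a simple sketch: an exponential moving average whose blend weight rises where the frame differs strongly from the running estimate, so moving content is smoothed less while static flicker is suppressed. All constants are assumptions; the paper's exact scheme is not reproduced here:

```python
import numpy as np

def temporal_smooth(frames, alpha_min=0.2, alpha_max=0.9):
    """Motion-adaptive EMA over a list of HxWxC uint8 frames."""
    out = [frames[0].astype(np.float32)]
    for f in frames[1:]:
        f = f.astype(np.float32)
        # Per-pixel motion proxy: difference from the running estimate.
        diff = np.abs(f - out[-1]).mean(axis=-1, keepdims=True) / 255.0
        alpha = alpha_min + (alpha_max - alpha_min) * np.clip(5.0 * diff, 0.0, 1.0)
        out.append(alpha * f + (1.0 - alpha) * out[-1])
    return [np.clip(f, 0, 255).astype(np.uint8) for f in out]
```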

35. Density Compensated Unrolled Networks for Non-Cartesian MRI Reconstruction [PDF] Back to contents
  Zaccharie Ramzi, Philippe Ciuciu, Jean-Luc Starck
Abstract: Deep neural networks have recently been thoroughly investigated as a powerful tool for MRI reconstruction. There is, however, a lack of research regarding their use in a specific setting of MRI, namely non-Cartesian acquisitions. In this work, we introduce a novel kind of deep neural network to tackle this problem, namely density compensated unrolled neural networks. We assess their efficiency on the publicly available fastMRI dataset, and perform a small ablation study. We also open source our code, in particular a Non-Uniform Fast Fourier Transform for TensorFlow.
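The core of density compensation, reweighting non-uniform k-space samples by their sampling density before the adjoint transform, can be sketched with a naive direct sum (a real pipeline would use a fast NUFFT, such as the authors' TensorFlow operator; the weighting choice below is an assumption):

```python
import numpy as np

def density_compensated_adjoint(kspace, traj, dcomp, grid=64):
    """Naive adjoint non-uniform Fourier transform with density compensation.
    `traj` holds (kx, ky) in cycles/pixel; `dcomp` holds per-sample density
    weights (e.g. proportional to |k| for radial spokes). O(N * grid^2),
    for illustration only."""
    xs = np.arange(grid) - grid // 2
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    image = np.zeros((grid, grid), dtype=complex)
    for s, (kx, ky), w in zip(kspace, traj, dcomp):
        image += w * s * np.exp(2j * np.pi * (kx * X + ky * Y))
    return image
```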

36. Brain Tumor Segmentation and Survival Prediction using Automatic Hard mining in 3D CNN Architecture [PDF] Back to contents
  Vikas Kumar Anand, Sanjeev Grampurohit, Pranav Aurangabadkar, Avinash Kori, Mahendra Khened, Raghavendra S Bhat, Ganapathy Krishnamurthi
Abstract: We utilize 3D fully convolutional neural networks (CNN) to segment gliomas and their constituents from multimodal Magnetic Resonance Images (MRI). The architecture uses dense connectivity patterns to reduce the number of weights, together with residual connections, and is initialized with weights obtained from training this model on the BraTS 2018 dataset. Hard mining is performed during training to focus on the difficult cases of the segmentation task, by raising the dice similarity coefficient (DSC) threshold used to select hard cases as the epochs increase. On the BraTS2020 validation data (n = 125), this architecture achieved tumor core, whole tumor, and active tumor dice scores of 0.744, 0.876, and 0.714, respectively. On the test dataset, the DSC of the tumor core and active tumor increases by approximately 7%. In terms of DSC, our network's performance on the BraTS 2020 test data is 0.775, 0.815, and 0.85 for enhancing tumor, tumor core, and whole tumor, respectively. The overall survival of a subject is determined using conventional machine learning on radiomics features obtained from the generated segmentation mask. Our approach achieved accuracies of 0.448 and 0.452 on the validation and test datasets, respectively.
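The epoch-dependent hard-mining rule can be sketched in a few lines; the schedule constants below are assumptions, only the rising-threshold logic follows the abstract:

```python
def hard_cases(case_dsc, epoch, base_thr=0.5, step=0.05, max_thr=0.95):
    """DSC-threshold hard mining: a case counts as 'hard' while its current
    dice score is below a threshold that rises with the epoch, so training
    keeps focusing on whatever the model still segments poorly."""
    thr = min(base_thr + step * epoch, max_thr)
    return [cid for cid, dsc in case_dsc.items() if dsc < thr]

# e.g. hard_cases({"case1": 0.91, "case2": 0.62}, epoch=3) -> ["case2"]
```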

37. On the Control of Attentional Processes in Vision [PDF] Back to contents
  John K. Tsotsos, Omar Abid, Iuliia Kotseruba, Markus D. Solbach
Abstract: The study of attentional processing in vision has a long and deep history. Recently, several papers have presented insightful perspectives into how the coordination of multiple attentional functions in the brain might occur. These begin with experimental observations and the authors propose structures, processes, and computations that might explain those observations. Here, we consider a perspective that past works have not, as a complementary approach to the experimentally-grounded ones. We approach the same problem as past authors but from the other end of the computational spectrum, from the problem nature, as Marr's Computational Level would prescribe. What problem must the brain solve when orchestrating attentional processes in order to successfully complete one of the myriad possible visuospatial tasks at which we as humans excel? The hope, of course, is for the approaches to eventually meet and thus form a complete theory, but this is likely not soon. We make the first steps towards this by addressing the necessity of attentional control, examining the breadth and computational difficulty of the visuospatial and attentional tasks seen in human behavior, and suggesting a sketch of how attentional control might arise in the brain. The key conclusions of this paper are that an executive controller is necessary for human attentional function in vision, and that there is a 'first principles' computational approach to its understanding that is complementary to the previous approaches that focus on modelling or learning from experimental observations directly.

38. End-to-End Video Question-Answer Generation with Generator-Pretester Network [PDF] Back to contents
  Hung-Ting Su, Chen-Hsi Chang, Po-Wei Shen, Yu-Siang Wang, Ya-Liang Chang, Yu-Cheng Chang, Pu-Jen Cheng, Winston H. Hsu
Abstract: We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia. Due to expensive data annotation costs, many widely used, large-scale Video QA datasets such as Video-QA, MSVD-QA and MSRVTT-QA are automatically annotated using Caption Question Generation (CapQG), which takes captions as input instead of the video itself. As captions neither fully represent a video nor are always practically available, it is crucial to generate question-answer pairs from a video via Video Question-Answer Generation (VQAG). Existing video-to-text (V2T) approaches, despite taking a video as input, generate only a question. In this work, we propose a novel model, the Generator-Pretester Network, which focuses on two components: (1) the Joint Question-Answer Generator (JQAG), which generates a question together with its corresponding answer to allow Video Question "Answering" training; (2) the Pretester (PT), which verifies a generated question by trying to answer it and checks the pretested answer against both the model's proposed answer and the ground-truth answer. We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance. Furthermore, using our generated QA pairs alone on the Video QA task, we surpass some supervised baselines. As a pre-training strategy, we outperform both CapQG and transfer learning approaches when employing semi-supervised (20%) or fully supervised learning with annotated data. These experimental results suggest novel perspectives for Video QA training.
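The pretesting idea, keeping a generated pair only when an independent answering pass agrees with the generator's proposed answer, can be sketched as a filter; exact-match agreement here stands in for the paper's learned verification:

```python
def pretest_filter(qa_pairs, answer_fn):
    """Keep a generated (question, answer) pair only when a QA model's
    answer to the question matches the generator's proposed answer.
    `answer_fn` is a hypothetical question -> answer callable."""
    return [(q, a) for q, a in qa_pairs if answer_fn(q) == a]
```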

39. CLOI: An Automated Benchmark Framework For Generating Geometric Digital Twins Of Industrial Facilities [PDF] Back to contents
  Eva Agapaki, Ioannis Brilakis
Abstract: This paper devises, implements and benchmarks a novel framework, named CLOI, that can accurately generate individual labelled point clusters of the most important shapes of existing industrial facilities with minimal manual effort in a generic point-level format. CLOI employs a combination of deep learning and geometric methods to segment the points into classes and individual instances. The current geometric digital twin generation from point cloud data in commercial software is a tedious, manual process. Experiments with our CLOI framework reveal that the method can reliably segment complex and incomplete point clouds of industrial facilities, yielding 82% class segmentation accuracy. Compared to the current state-of-practice, the proposed framework can realize estimated time-savings of 30% on average. CLOI is the first framework of its kind to have achieved geometric digital twinning for the most important objects of industrial factories. It provides the foundation for further research on the generation of semantically enriched digital twins of the built environment.
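As an illustration of combining learned semantics with a geometric step, a common pattern is to split each semantic class into instances by Euclidean clustering; DBSCAN and its eps/min_pts values (in the cloud's metric units) are assumptions, not the paper's exact method:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def split_instances(points, sem_labels, target_class, eps=0.05, min_pts=50):
    """After semantic segmentation, cluster one class's points into
    individual instances by spatial proximity."""
    cls_pts = points[sem_labels == target_class]
    inst_ids = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(cls_pts)
    return cls_pts, inst_ids        # inst_ids == -1 marks unclustered noise
```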

40. Reconstructing Patchy Reionization with Deep Learning [PDF] Back to contents
  Eric Guzman, Joel Meyers
Abstract: The precision anticipated from next-generation cosmic microwave background (CMB) surveys will create opportunities for characteristically new insights into cosmology. Secondary anisotropies of the CMB will have an increased importance in forthcoming surveys, due both to the cosmological information they encode and the role they play in obscuring our view of the primary fluctuations. Quadratic estimators have become the standard tools for reconstructing the fields that distort the primary CMB and produce secondary anisotropies. While successful for lensing reconstruction with current data, quadratic estimators will be sub-optimal for the reconstruction of lensing and other effects at the expected sensitivity of the upcoming CMB surveys. In this paper we describe a convolutional neural network, ResUNet-CMB, that is capable of the simultaneous reconstruction of two sources of secondary CMB anisotropies, gravitational lensing and patchy reionization. We show that the ResUNet-CMB network significantly outperforms the quadratic estimator at low noise levels and is not subject to the lensing-induced bias on the patchy reionization reconstruction that would be present with a straightforward application of the quadratic estimator.
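For readers unfamiliar with the term, a quadratic estimator reconstructs a distorting field from correlations between pairs of observed CMB modes; schematically, in a standard flat-sky form (textbook convention, not taken from this paper):

```latex
% Schematic quadratic estimator for a field x that distorts the CMB
% (e.g. the lensing potential or patchy-reionization optical depth);
% A(L) is a normalization and g a weight function, both specific to x.
\hat{x}(\mathbf{L}) = A(\mathbf{L}) \int \frac{\mathrm{d}^2\boldsymbol{\ell}}{(2\pi)^2}\,
    g(\boldsymbol{\ell}, \mathbf{L}-\boldsymbol{\ell})\,
    T^{\mathrm{obs}}(\boldsymbol{\ell})\, T^{\mathrm{obs}}(\mathbf{L}-\boldsymbol{\ell})
```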

41. Advances in Electron Microscopy with Deep Learning [PDF] Back to contents
  Jeffrey M. Ede
Abstract: This doctoral thesis covers some of my advances in electron microscopy with deep learning. Highlights include a comprehensive review of deep learning in electron microscopy; large new electron microscopy datasets for machine learning, dataset search engines based on variational autoencoders, and automatic data clustering by t-distributed stochastic neighbour embedding; adaptive learning rate clipping to stabilize learning; generative adversarial networks for compressed sensing with spiral, uniformly spaced and other fixed sparse scan paths; recurrent neural networks trained to piecewise adapt sparse scan paths to specimens by reinforcement learning; improving signal-to-noise; and conditional generative adversarial networks for exit wavefunction reconstruction from single transmission electron micrographs. This thesis adds to my publications by presenting their relationships, reflections, and holistic conclusions. This copy of my thesis is typeset for online dissemination to improve readability, whereas the thesis submitted to the University of Warwick in support of my application for the degree of Doctor of Philosophy in Physics will be typeset for physical printing and binding.

Note: the cover image is a word cloud generated from the paper titles.