
[arXiv Papers] Computer Vision and Pattern Recognition 2020-09-24

Contents

1. Augmented Convolutional LSTMs for Generation of High-Resolution Climate Change Projections [PDF] Abstract
2. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [PDF] Abstract
3. A Linear Transportation $\mathrm{L}^p$ Distance for Pattern Recognition [PDF] Abstract
4. Interactive Learning for Semantic Segmentation in Earth Observation [PDF] Abstract
5. A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [PDF] Abstract
6. Learning Visual Voice Activity Detection with an Automatically Annotated Dataset [PDF] Abstract
7. Label-Efficient Multi-Task Segmentation using Contrastive Learning [PDF] Abstract
8. 2D-3D Geometric Fusion Network using Multi-Neighbourhood Graph Convolution for RGB-D Indoor Scene Classification [PDF] Abstract
9. Information-Theoretic Visual Explanation for Black-Box Classifiers [PDF] Abstract
10. Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering [PDF] Abstract
11. Residual Embedding Similarity-Based Network Selection for Predicting Brain Network Evolution Trajectory from a Single Observation [PDF] Abstract
12. Multiplexed Illumination for Classifying Visually Similar Objects [PDF] Abstract
13. Differential Viewpoints for Ground Terrain Material Recognition [PDF] Abstract
14. A Sparse Sampling-based framework for Semantic Fast-Forward of First-Person Videos [PDF] Abstract
15. Robust and efficient post-processing for video object detection [PDF] Abstract
16. Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [PDF] Abstract
17. Few-shot Font Generation with Localized Style Representations and Factorization [PDF] Abstract
18. Generative Model without Prior Distribution Matching [PDF] Abstract
19. What is the Reward for Handwriting? -- Handwriting Generation by Imitation Learning [PDF] Abstract
20. MAFF-Net: Filter False Positive for 3D Vehicle Detection with Multi-modal Adaptive Feature Fusion [PDF] Abstract
21. Exploring global diverse attention via pairwise temporal relation for video summarization [PDF] Abstract
22. Scene Graph to Image Generation with Contextualized Object Layout Refinement [PDF] Abstract
23. CLASS: Cross-Level Attention and Supervision for Salient Objects Detection [PDF] Abstract
24. LoRRaL: Facial Action Unit Detection Based on Local Region Relation Learning [PDF] Abstract
25. Leveraging Local and Global Descriptors in Parallel to Search Correspondences for Visual Localization [PDF] Abstract
26. Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition [PDF] Abstract
27. A Real-time Vision Framework for Pedestrian Behavior Recognition and Intention Prediction at Intersections Using 3D Pose Estimation [PDF] Abstract
28. Angular Luminance for Material Segmentation [PDF] Abstract
29. Kernelized dense layers for facial expression recognition [PDF] Abstract
30. Efficient DWT-based fusion techniques using genetic algorithm for optimal parameter estimation [PDF] Abstract
31. Role of Orthogonality Constraints in Improving Properties of Deep Networks for Image Classification [PDF] Abstract
32. Fuzzy Simplicial Networks: A Topology-Inspired Model to Improve Task Generalization in Few-shot Learning [PDF] Abstract
33. Whole Slide Images based Cancer Survival Prediction using Attention Guided Deep Multiple Instance Learning Networks [PDF] Abstract
34. Foreseeing Brain Graph Evolution Over Time Using Deep Adversarial Network Normalizer [PDF] Abstract
35. Anisotropic 3D Multi-Stream CNN for Accurate Prostate Segmentation from Multi-Planar MRI [PDF] Abstract
36. Robustification of Segmentation Models Against Adversarial Perturbations In Medical Imaging [PDF] Abstract
37. GSR-Net: Graph Super-Resolution Network for Predicting High-Resolution from Low-Resolution Functional Brain Connectomes [PDF] Abstract
38. Automatic Breast Lesion Classification by Joint Neural Analysis of Mammography and Ultrasound [PDF] Abstract
39. Attention with Multiple Sources Knowledges for COVID-19 from CT Images [PDF] Abstract
40. Learning Non-Unique Segmentation with Reward-Penalty Dice Loss [PDF] Abstract
41. Semantics-Preserving Adversarial Training [PDF] Abstract
42. Pruning Convolutional Filters using Batch Bridgeout [PDF] Abstract
43. Improving Medical Annotation Quality to Decrease Labeling Burden Using Stratified Noisy Cross-Validation [PDF] Abstract
44. Adaptive Debanding Filter [PDF] Abstract
45. Cranial Implant Prediction using Low-Resolution 3D Shape Completion and High-Resolution 2D Refinement [PDF] Abstract
46. Age-Net: An MRI-Based Iterative Framework for Biological Age Estimation [PDF] Abstract

Abstracts

1. Augmented Convolutional LSTMs for Generation of High-Resolution Climate Change Projections [PDF] Back to Contents
  Nidhin Harilal, Udit Bhatia, Mayank Singh
Abstract: Projections of changes in extreme indices of climate variables such as temperature and precipitation are critical to assess the potential impacts of climate change on human-made and natural systems, including critical infrastructures and ecosystems. While impact assessment and adaptation planning rely on high-resolution projections (typically on the order of a few kilometers), state-of-the-art Earth System Models (ESMs) are available at spatial resolutions of a few hundred kilometers. Current solutions to obtain high-resolution projections of ESMs include downscaling approaches that consider the information at a coarse scale to make predictions at local scales. Complex and non-linear interdependence among local climate variables (e.g., temperature and precipitation) and large-scale predictors (e.g., pressure fields) motivates the use of neural network-based super-resolution architectures. In this work, we present an auxiliary-variable-informed spatio-temporal neural architecture for statistical downscaling. The current study performs daily downscaling of the precipitation variable from an ESM output at 1.15 degrees (~115 km) to 0.25 degrees (25 km) over the world's most climatically diverse country, India. We showcase significant improvements over three popular state-of-the-art baselines, with a better ability to predict extreme events. To facilitate reproducible research, we make all the code, processed datasets, and trained models available in the public domain.

2. X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers [PDF] Back to Contents
  Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
Abstract: Mirroring the success of masked language models, vision-and-language counterparts like ViLBERT, LXMERT and UNITER have achieved state-of-the-art performance on a variety of multimodal discriminative tasks like visual question answering and visual grounding. Recent work has also successfully adapted such models towards the generative task of image captioning. This begs the question: Can these models go the other way and generate images from pieces of text? Our analysis of a popular representative from this model family - LXMERT - finds that it is unable to generate rich and semantically meaningful imagery with its current training setup. We introduce X-LXMERT, an extension to LXMERT with training refinements including: discretizing visual representations, using uniform masking with a large range of masking ratios, and aligning the right pre-training datasets to the right objectives, which enables it to paint. X-LXMERT's image generation capabilities rival state-of-the-art generative models, while its question answering and captioning abilities remain comparable to LXMERT. Finally, we demonstrate the generality of these training refinements by adding image generation capabilities into UNITER to produce X-UNITER.

3. A Linear Transportation $\mathrm{L}^p$ Distance for Pattern Recognition [PDF] Back to Contents
  Oliver M. Crook, Mihai Cucuringu, Tim Hurst, Carola-Bibiane Schönlieb, Matthew Thorpe, Konstantinos C. Zygalakis
Abstract: The transportation $\mathrm{L}^p$ distance, denoted $\mathrm{TL}^p$, has been proposed as a generalisation of Wasserstein $\mathrm{W}^p$ distances motivated by the property that it can be applied directly to colour or multi-channelled images, as well as multivariate time-series without normalisation or mass constraints. These distances, as with $\mathrm{W}^p$, are powerful tools in modelling data with spatial or temporal perturbations. However, their computational cost can make them infeasible to apply to even moderate pattern recognition tasks. We propose linear versions of these distances and show that the linear $\mathrm{TL}^p$ distance significantly improves over the linear $\mathrm{W}^p$ distance on signal processing tasks, whilst being several orders of magnitude faster to compute than the $\mathrm{TL}^p$ distance.
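
For reference, the $\mathrm{TL}^p$ distance between two function-measure pairs $(\mu, f)$ and $(\nu, g)$ is usually written as follows (a standard formulation from the transportation-$\mathrm{L}^p$ literature; the paper may additionally carry a scale parameter balancing the two terms):

$$d_{\mathrm{TL}^p}\big((\mu, f), (\nu, g)\big) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} |x - y|^p + |f(x) - g(y)|^p \, \mathrm{d}\pi(x, y) \right)^{1/p},$$

where $\Pi(\mu, \nu)$ denotes the set of couplings of $\mu$ and $\nu$. The first term transports the domains while the second compares the signal values, which is why no normalisation or mass constraint on $f$ and $g$ is needed.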

4. Interactive Learning for Semantic Segmentation in Earth Observation [PDF] Back to Contents
  Gaston Lenczner, Adrien Chan-Hon-Tong, Nicola Luminari, Bertrand Le Saux, Guy Le Besnerais
Abstract: Dense pixel-wise classification maps output by deep neural networks are of extreme importance for scene understanding. However, these maps are often partially inaccurate due to a variety of possible factors. Therefore, we propose to interactively refine them within a framework named DISCA (Deep Image Segmentation with Continual Adaptation). It consists of continually adapting a neural network to a target image using an interactive learning process with sparse user annotations as ground-truth. We show through experiments on three datasets using synthesized annotations the benefits of the approach, reaching an IoU improvement of up to 4.7% for ten sampled clicks. Finally, we show that our approach can be particularly rewarding when it is faced with additional issues such as domain adaptation.

5. A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [PDF] Back to Contents
  Binjie Zhang, Yu Li, Chun Yuan, Dejing Xu, Pin Jiang, Ying Shan
Abstract: The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video. Though progress has been made continuously in this field, some issues still need to be resolved. First, most of the existing methods rely on the combination of multiple complicated modules to solve the task. Second, due to the semantic gaps between the two different modalities, aligning the information at different granularities (local and global) between the video and the language is significant, which is less addressed. Last, previous works do not consider the inevitable annotation bias due to the ambiguities of action boundaries. To address these limitations, we propose a simple two-branch Cross-Modality Attention (CMA) module with an intuitive structure design, which alternately modulates two modalities for better matching the information both locally and globally. In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias. We conduct extensive experiments to validate our method, and the results show that just with this simple model, it can outperform the state of the art on both Charades-STA and ActivityNet Captions datasets.
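
The abstract does not spell out the CMA computation; one direction of such a two-branch module can be pictured as scaled dot-product cross-attention, with queries from one modality and keys/values from the other. A minimal numpy sketch (all names hypothetical, not the paper's implementation):

import numpy as np

def cross_modal_attention(video_feats, text_feats):
    """One branch of a cross-modality attention module: video features
    (queries, shape (Tv, d)) attend to language features (keys/values,
    shape (Tt, d)); the symmetric branch swaps the two modalities."""
    d = video_feats.shape[-1]
    scores = video_feats @ text_feats.T / np.sqrt(d)           # (Tv, Tt)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax over text tokens
    return weights @ text_feats                                # language-conditioned video features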

6. Learning Visual Voice Activity Detection with an Automatically Annotated Dataset [PDF] Back to Contents
  Sylvain Guy, Stéphane Lathuilière, Pablo Mesejo, Radu Horaud
Abstract: Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not. V-VAD is useful whenever audio VAD (A-VAD) is inefficient either because the acoustic signal is difficult to analyze or because it is simply missing. We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow. Moreover, available datasets, used for learning and for testing V-VAD, lack content variability. We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD -- based on combining A-VAD with face detection and tracking. A thorough empirical evaluation shows the advantage of training the proposed deep V-VAD models with this dataset.

7. Label-Efficient Multi-Task Segmentation using Contrastive Learning [PDF] Back to Contents
  Junichiro Iwasawa, Yuichiro Hirano, Yohei Sugawara
Abstract: Obtaining annotations for 3D medical images is expensive and time-consuming, despite its importance for automating segmentation tasks. Although multi-task learning is considered an effective method for training segmentation models using small amounts of annotated data, a systematic understanding of various subtasks is still lacking. In this study, we propose a multi-task segmentation model with a contrastive learning based subtask and compare its performance with other multi-task models, varying the number of labeled data for training. We further extend our model so that it can utilize unlabeled data through the regularization branch in a semi-supervised manner. We experimentally show that our proposed method outperforms other multi-task methods including the state-of-the-art fully supervised model when the amount of annotated data is limited.
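
The abstract leaves the contrastive objective unspecified; a common instantiation for such a subtask is an InfoNCE-style loss over paired embeddings, sketched below (a hypothetical minimal numpy version, not necessarily the paper's exact formulation):

import numpy as np

def info_nce_loss(z_a, z_b, tau=0.1):
    """Contrastive loss over a batch of paired embeddings z_a, z_b of
    shape (B, D): matched pairs (the diagonal) are positives and every
    other pairing in the batch serves as a negative."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / tau                               # (B, B) scaled cosine similarities
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))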

8. 2D-3D Geometric Fusion Network using Multi-Neighbourhood Graph Convolution for RGB-D Indoor Scene Classification [PDF] Back to Contents
  Albert Mosella-Montoro, Javier Ruiz-Hidalgo
Abstract: Multi-modal fusion has been proved to help enhance the performance of scene classification tasks. This paper presents a 2D-3D fusion stage that combines 3D Geometric features with 2D Texture features obtained by 2D Convolutional Neural Networks. To get a robust 3D Geometric embedding, a network that uses two novel layers is proposed. The first layer, Multi-Neighbourhood Graph Convolution, aims to learn a more robust geometric descriptor of the scene combining two different neighbourhoods: one in the Euclidean space and the other in the Feature space. The second proposed layer, Nearest Voxel Pooling, improves the performance of the well-known Voxel Pooling. Experimental results, using NYU-Depth-v2 and SUN RGB-D datasets, show that the proposed method outperforms the current state-of-the-art in RGB-D indoor scene classification tasks.

9. Information-Theoretic Visual Explanation for Black-Box Classifiers [PDF] Back to Contents
  Jihun Yi, Eunji Kim, Siwon Kim, Sungroh Yoon
Abstract: In this work, we attempt to explain the prediction of any black-box classifier from an information-theoretic perspective. For this purpose, we propose two attribution maps: an information gain (IG) map and a point-wise mutual information (PMI) map. The IG map provides a class-independent answer to "How informative is each pixel?", and the PMI map offers a class-specific explanation by answering "How much does each pixel support a specific class?" In this manner, we propose (i) a theory-backed attribution method. The attribution (ii) provides both supporting and opposing explanations for each class and (iii) pinpoints the most decisive parts in the image, not just the relevant objects. In addition, the method (iv) offers a complementary class-independent explanation. Lastly, the algorithmic enhancement in our method (v) improves the faithfulness of the explanation in terms of a quantitative evaluation metric. We showed the five strengths of our method through various experiments on the ImageNet dataset. The code of the proposed method is available online.
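
Both maps can be written compactly once per-pixel class posteriors are available. A minimal numpy sketch, assuming p(c|x_i) and the class prior p(c) have already been estimated by probing the black-box classifier:

import numpy as np

def attribution_maps(posteriors, prior, eps=1e-12):
    """posteriors: (H, W, C) per-pixel class posteriors p(c | x_i);
    prior: (C,) class marginals p(c). Returns the class-specific PMI
    maps (H, W, C) and the class-independent IG map (H, W), computed
    as the expected PMI, i.e. KL(p(C | x_i) || p(C))."""
    pmi = np.log(posteriors + eps) - np.log(prior + eps)
    ig = np.sum(posteriors * pmi, axis=-1)
    return pmi, ig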

10. Multiple interaction learning with question-type prior knowledge for constraining answer search space in visual question answering [PDF] Back to Contents
  Tuong Do, Binh X. Nguyen, Huy Tran, Erman Tjiputra, Quang D. Tran, Thanh-Toan Do
Abstract: Different approaches have been proposed for Visual Question Answering (VQA). However, few works consider how different joint-modality methods behave with respect to question-type prior knowledge extracted from data when constraining the answer search space; this information gives a reliable cue for reasoning about answers to the questions asked about input images. In this paper, we propose a novel VQA model that utilizes question-type prior information to improve VQA by leveraging the multiple interactions between different joint-modality methods, based on their behaviors in answering questions of different types. Solid experiments on two benchmark datasets, i.e., VQA 2.0 and TDIUC, indicate that the proposed method yields the best performance compared with the most competitive approaches.

11. Residual Embedding Similarity-Based Network Selection for Predicting Brain Network Evolution Trajectory from a Single Observation [PDF] Back to Contents
  Ahmet Serkan Goktas, Alaa Bessadok, Islem Rekik
Abstract: While existing predictive frameworks are able to handle Euclidean structured data (i.e., brain images), they might fail to generalize to geometric non-Euclidean data such as brain networks. Besides, their sample selection step is rooted in using a Euclidean or learned similarity measure between vectorized training and testing brain networks. Such a sample connectomic representation might include irrelevant and redundant features that could mislead the training sample selection step. Undoubtedly, this fails to exploit and preserve the topology of the brain connectome. To overcome this major drawback, we propose Residual Embedding Similarity-Based Network selection (RESNets) for predicting brain network evolution trajectory from a single timepoint. RESNets first learns a compact geometric embedding of each training and testing sample using an adversarial connectome embedding network. This nicely reduces the high-dimensionality of brain networks while preserving their topological properties via graph convolutional networks. Next, to compute the similarity between subjects, we introduce the concept of a connectional brain template (CBT), a fixed network reference, where we further represent each training and testing network as a deviation from the reference CBT in the embedding space. As such, we select the training subjects most similar to the testing subject at baseline by comparing their learned residual embeddings with respect to the pre-defined CBT. Once the best training samples are selected at baseline, we simply average their corresponding brain networks at follow-up timepoints to predict the evolution trajectory of the testing network. Our experiments on both healthy and disordered brain networks demonstrate the success of our proposed method in comparison to ablated versions of RESNets and traditional approaches.

12. Multiplexed Illumination for Classifying Visually Similar Objects [PDF] Back to Contents
  Taihua Wang, Donald G. Dansereau
Abstract: Distinguishing visually similar objects like forged/authentic bills and healthy/unhealthy plants is beyond the capabilities of even the most sophisticated classifiers. We propose the use of multiplexed illumination to extend the range of objects that can be successfully classified. We construct a compact RGB-IR light stage that images samples under different combinations of illuminant position and colour. We then develop a methodology for selecting illumination patterns and training a classifier using the resulting imagery. We use the light stage to model and synthetically relight training samples, and propose a greedy pattern selection scheme that exploits this ability to train in simulation. We then apply the trained patterns to carry out fast classification of new objects. We demonstrate the approach on visually similar artificial and real fruit samples, showing a marked improvement compared with fixed-illuminant approaches as well as a more conventional code selection scheme. This work allows fast classification of previously indistinguishable objects, with potential applications in forgery detection, quality control in agriculture and manufacturing, and skin lesion classification.

13. Differential Viewpoints for Ground Terrain Material Recognition [PDF] Back to Contents
  Jia Xue, Hang Zhang, Ko Nishino, Kristin J. Dana
Abstract: Computational surface modeling that underlies material recognition has transitioned from reflectance modeling using in-lab controlled radiometric measurements to image-based representations based on internet-mined single-view images captured in the scene. We take a middle-ground approach for material recognition that takes advantage of both rich radiometric cues and flexible image capture. A key concept is differential angular imaging, where small angular variations in image capture enables angular-gradient features for an enhanced appearance representation that improves recognition. We build a large-scale material database, Ground Terrain in Outdoor Scenes (GTOS) database, to support ground terrain recognition for applications such as autonomous driving and robot navigation. The database consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions. We develop a novel approach for material recognition called texture-encoded angular network (TEAN) that combines deep encoding pooling of RGB information and differential angular images for angular-gradient features to fully leverage this large dataset. With this novel network architecture, we extract characteristics of materials encoded in the angular and spatial gradients of their appearance. Our results show that TEAN achieves recognition performance that surpasses single view performance and standard (non-differential/large-angle sampling) multiview performance.

14. A Sparse Sampling-based framework for Semantic Fast-Forward of First-Person Videos [PDF] Back to Contents
  Michel Melo Silva, Washington Luis Souza Ramos, Mario Fernando Montenegro Campos, Erickson Rangel Nascimento
Abstract: Technological advances in sensors have paved the way for digital cameras to become increasingly ubiquitous, which, in turn, led to the popularity of the self-recording culture. As a result, the amount of visual data on the Internet is moving in the opposite direction of the available time and patience of the users. Thus, most of the uploaded videos are doomed to be forgotten and unwatched, stashed away in some computer folder or website. In this paper, we address the problem of creating smooth fast-forward videos without losing the relevant content. We present a new adaptive frame selection formulated as a weighted minimum reconstruction problem. Using a smoothing frame transition and filling visual gaps between segments, our approach accelerates first-person videos, emphasizing the relevant segments and avoiding visual discontinuities. Experiments conducted on controlled videos and also on an unconstrained dataset of First-Person Videos (FPVs) show that, when creating fast-forward videos, our method is able to retain as much relevant information and smoothness as the state-of-the-art techniques, but in less processing time.

15. Robust and efficient post-processing for video object detection [PDF] Back to Contents
  Alberto Sabater, Luis Montesano, Ana C. Murillo
Abstract: Object recognition in video is an important task for plenty of applications, including autonomous driving perception, surveillance tasks, wearable devices or IoT networks. Object recognition using video data is more challenging than using still images due to blur, occlusions or rare object poses. Specific video detectors with high computational cost, or standard image detectors together with a fast post-processing algorithm, achieve the current state-of-the-art. This work introduces a novel post-processing pipeline that overcomes some of the limitations of previous post-processing methods by introducing a learning-based similarity evaluation between detections across frames. Our method improves the results of state-of-the-art specific video detectors, especially regarding fast moving objects, and presents low resource requirements. Applied to efficient still image detectors, such as YOLO, it provides results comparable to much more computationally intensive detectors.

16. Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [PDF] Back to Contents
  Dimche Kostadinov, Davide Scaramuzza
Abstract: Event-based cameras record an asynchronous stream of per-pixel brightness changes. As such, they have numerous advantages over standard frame-based cameras, including high temporal resolution, high dynamic range, and no motion blur. Due to the asynchronous nature, efficient learning of a compact representation for event data is challenging. Moreover, the extent to which the spatial and temporal event "information" is useful for pattern recognition tasks remains unexplored. In this paper, we focus on single-layer architectures. We analyze the performance of two general problem formulations, the direct and the inverse, for unsupervised feature learning from local event data (local volumes of events described in space-time). We identify and show the main advantages of each approach. Theoretically, we analyze guarantees for an optimal solution, the possibility of asynchronous, parallel parameter updates, and the computational complexity. We present numerical experiments for object recognition. We evaluate the solution under the direct and the inverse problem and give a comparison with the state-of-the-art methods. Our empirical results highlight the advantages of both approaches for representation learning from event data. We show improvements of up to 9% in recognition accuracy compared to state-of-the-art methods from the same class of methods.

17. Few-shot Font Generation with Localized Style Representations and Factorization [PDF] Back to Contents
  Song Park, Sanghyuk Chun, Junbum Cha, Bado Lee, Hyunjung Shim
Abstract: Automatic few-shot font generation is in high demand because manual designs are expensive and sensitive to the expertise of designers. Existing few-shot font generation methods aim to learn to disentangle the style and content element from a few reference glyphs, and mainly focus on a universal style representation for each font style. However, such an approach limits the model in representing diverse local styles, and thus makes it unsuitable for the most complicated letter systems, e.g., Chinese, whose characters consist of a varying number of components (often called "radicals") with a highly complex structure. In this paper, we propose a novel font generation method by learning localized styles, namely component-wise style representations, instead of universal styles. The proposed style representations enable us to synthesize complex local details in text designs. However, learning component-wise styles solely from reference glyphs is infeasible in the few-shot font generation scenario, when a target script has a large number of components, e.g., over 200 for Chinese. To reduce the number of reference glyphs, we simplify component-wise styles by a product of component factor and style factor, inspired by low-rank matrix factorization. Thanks to the combination of strong representation and a compact factorization strategy, our method shows remarkably better few-shot font generation results (with only 8 reference glyph images) than other state-of-the-art methods, without utilizing strong locality supervision, e.g., the location of each component, skeleton, or strokes. The source code is available at this https URL.
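
The low-rank idea can be illustrated in a few lines: instead of storing one style vector per (font, component) pair, the component-wise style is approximated by combining a per-component factor with a per-font factor. A hypothetical numpy sketch of the factorization (not the paper's network):

import numpy as np

def componentwise_styles(component_factors, style_factors):
    """component_factors: (n_components, D); style_factors: (n_fonts, D).
    Returns (n_fonts, n_components, D) component-wise style codes as an
    element-wise product, so n_fonts + n_components vectors stand in
    for n_fonts * n_components independent style representations."""
    return style_factors[:, None, :] * component_factors[None, :, :]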

18. Generative Model without Prior Distribution Matching [PDF] Back to Contents
  Cong Geng, Jia Wang, Li Chen, Zhiyong Gao
Abstract: Variational Autoencoder (VAE) and its variations are classic generative models that learn a low-dimensional latent representation satisfying some prior distribution (e.g., a Gaussian distribution). Their advantage over GANs is that they can simultaneously generate high dimensional data and learn latent representations to reconstruct the inputs. However, it has been observed that a trade-off exists between reconstruction and generation, since matching the prior distribution may destroy the geometric structure of the data manifold. To mitigate this problem, we propose to let the prior match the embedding distribution rather than forcing the latent variables to fit the prior. The embedding distribution is trained using a simple regularized autoencoder architecture which preserves the geometric structure to the maximum extent. Then an adversarial strategy is employed to achieve a latent mapping. We provide both theoretical and experimental support for the effectiveness of our method, which alleviates the contradiction between preserving the topological properties of the data manifold and matching the distribution in latent space.

19. What is the Reward for Handwriting? -- Handwriting Generation by Imitation Learning [PDF] Back to Contents
  Keisuke Kanda, Brian Kenji Iwana, Seiichi Uchida
Abstract: Analyzing the handwriting generation process is an important issue and has been tackled by various generation models, such as kinematics-based models and stochastic models. In this study, we use a reinforcement learning (RL) framework to realize handwriting generation with a careful future planning ability. In fact, the handwriting process of human beings is also supported by their future planning ability; for example, this ability is necessary to generate a closed trajectory like '0', because any shortsighted model, such as a Markovian model, cannot generate it. For the algorithm, we employ generative adversarial imitation learning (GAIL). Typical RL algorithms require the manual definition of the reward function, which is very crucial to control the generation process. In contrast, GAIL trains the reward function along with the other modules of the framework. In other words, through GAIL, we can understand the reward of the handwriting generation process from handwriting examples. Our experimental results qualitatively and quantitatively show that the learned reward catches the trends in handwriting generation and thus GAIL is well suited for the acquisition of handwriting behavior.

20. MAFF-Net: Filter False Positive for 3D Vehicle Detection with Multi-modal Adaptive Feature Fusion [PDF] Back to Contents
  Zehan Zhang, Ming Zhang, Zhidong Liang, Xian Zhao, Ming Yang, Wenming Tan, ShiLiang Pu
Abstract: 3D vehicle detection based on multi-modal fusion is an important task of many applications such as autonomous driving. Although significant progress has been made, we still observe two aspects that need further improvement: First, the specific gain that camera images can bring to 3D detection is seldom explored by previous works. Second, many fusion algorithms run slowly, which is essential for applications with high real-time requirements (autonomous driving). To this end, we propose an end-to-end trainable single-stage multi-modal feature adaptive network in this paper, which uses image information to effectively reduce false positives in 3D detection and has a fast detection speed. A multi-modal adaptive feature fusion module based on a channel attention mechanism is proposed to enable the network to adaptively use the feature of each modality. Based on the above mechanism, two fusion technologies are proposed to adapt to different usage scenarios: PointAttentionFusion is suitable for filtering simple false positives and is faster; DenseAttentionFusion is suitable for filtering more difficult false positives and has better overall performance. Experimental results on the KITTI dataset demonstrate significant improvement in filtering false positives over the approach using only point cloud data. Furthermore, the proposed method can provide competitive results and has the fastest speed compared to the published state-of-the-art multi-modal methods in the KITTI benchmark.
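
The adaptive fusion can be pictured as a squeeze-and-excitation-style gate over the concatenated modal features. A much-simplified numpy sketch (random matrices stand in for learned parameters; the paper's module may differ):

import numpy as np

rng = np.random.default_rng(0)

def channel_attention_fusion(feat_pc, feat_img, reduction=4):
    """Concatenate point-cloud and image feature maps along channels,
    squeeze with global average pooling, pass through a two-layer
    bottleneck, and reweight each channel with a sigmoid gate."""
    fused = np.concatenate([feat_pc, feat_img], axis=0)        # (C, H, W)
    c = fused.shape[0]
    w1 = rng.standard_normal((c, c // reduction)) * 0.1        # stand-ins for learned weights
    w2 = rng.standard_normal((c // reduction, c)) * 0.1
    squeezed = fused.mean(axis=(1, 2))                         # (C,) global channel statistics
    hidden = np.maximum(squeezed @ w1, 0.0)                    # ReLU bottleneck
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))               # per-channel attention gates
    return fused * gates[:, None, None]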

21. Exploring global diverse attention via pairwise temporal relation for video summarization [PDF] Back to Contents
  Ping Li, Qinghao Ye, Luming Zhang, Li Yuan, Xianghua Xu, Ling Shao
Abstract: Video summarization is an effective way to facilitate video searching and browsing. Most existing systems employ encoder-decoder based recurrent neural networks, which fail to explicitly diversify the system-generated summary frames while requiring intensive computations. In this paper, we propose an efficient convolutional neural network architecture for video SUMmarization via Global Diverse Attention, called SUM-GDA, which adapts the attention mechanism to a global perspective to consider the pairwise temporal relations of video frames. Particularly, the GDA module has two advantages: 1) it models the relations within paired frames as well as the relations among all pairs, thus capturing the global attention across all frames of one video; 2) it reflects the importance of each frame to the whole video, leading to diverse attention on these frames. Thus, SUM-GDA is beneficial for generating diverse frames to form a satisfactory video summary. Extensive experiments on three data sets, i.e., SumMe, TVSum, and VTW, have demonstrated that SUM-GDA and its extension outperform other competing state-of-the-art methods with remarkable improvements. In addition, the proposed models can be run in parallel with significantly lower computational costs, which helps deployment in highly demanding applications.

22. Scene Graph to Image Generation with Contextualized Object Layout Refinement [PDF] Back to Contents
  Maor Ivgi, Yaniv Benny, Avichai Ben-David, Jonathan Berant, Lior Wolf
Abstract: Generating high-quality images from scene graphs, that is, graphs that describe multiple entities in complex relations, is a challenging task that attracted substantial interest recently. Prior work trained such models by using supervised learning, where the goal is to produce the exact target image layout for each scene graph. It relied on predicting object locations and shapes independently and in parallel. However, scene graphs are underspecified, and thus the same scene graph often occurs with many target images in the training data. This leads to generated images with high inter-object overlap, empty areas, blurry objects, and overall compromised quality. In this work, we propose a method that alleviates these issues by generating all object layouts together and reducing the reliance on such supervision. Our model predicts layouts directly from embeddings (without predicting intermediate boxes) by gradually upsampling, refining and contextualizing object layouts. It is trained with a novel adversarial loss, that optimizes the interaction between object pairs. This improves coverage and removes overlaps, while maintaining sensible contours and respecting objects relations. We empirically show on the COCO-STUFF dataset that our proposed approach substantially improves the quality of generated layouts as well as the overall image quality. Our evaluation shows that we improve layout coverage by almost 20 points, and drop object overlap to negligible amounts. This leads to better image generation, relation fulfillment and objects quality.

23. CLASS: Cross-Level Attention and Supervision for Salient Objects Detection [PDF] Back to Contents
  Tang Lv, Bo Li
Abstract: Salient object detection (SOD) is a fundamental computer vision task. Recently, with the revival of deep neural networks, SOD has made great progress. However, there still exist two thorny issues that cannot be well addressed by existing methods: indistinguishable regions and complex structures. To address these two issues, in this paper we propose a novel deep network for accurate SOD, named CLASS. First, in order to leverage the different advantages of low-level and high-level features, we propose a novel non-local cross-level attention (CLA), which can capture long-range feature dependencies to enhance the distinction of complete salient objects. Second, a novel cross-level supervision (CLS) is designed to learn complementary context for complex structures through pixel-level, region-level and object-level supervision. Then the fine structures and boundaries of salient objects can be well restored. In experiments, with the proposed CLA and CLS, our CLASS net consistently outperforms 13 state-of-the-art methods on five datasets.

24. LoRRaL: Facial Action Unit Detection Based on Local Region Relation Learning [PDF] Back to Contents
  Ziqiang Shi, Liu Liu, Rujie Liu, Xiaoyu Mi, and Kentaro Murase
Abstract: End-to-end convolutional representation learning has been proved to be very effective in facial action unit (AU) detection. Considering the co-occurrence and mutual exclusion between facial AUs, in this paper we propose convolutional neural networks with Local Region Relation Learning (LoRRaL), which can combine latent relationships among AUs for an end-to-end approach to facial AU occurrence detection. LoRRaL 1) uses bi-directional long short-term memory (BiLSTM) to dynamically and sequentially encode local AU feature maps, 2) uses a self-attention mechanism to dynamically compute correspondences from local facial regions and to re-aggregate AU feature maps considering AU co-occurrences and mutual exclusions, and 3) uses a continuous-state modern Hopfield network to encode and map local facial features to more discriminative AU feature maps; all these networks take the facial image as input and map it to AU occurrences. Our experiments on the challenging BP4D and DISFA benchmarks, without any external data or pre-trained models, result in F1-scores of 63.5% and 61.4% respectively, which shows that our proposed networks can lead to performance improvement on the AU detection task.

25. Leveraging Local and Global Descriptors in Parallel to Search Correspondences for Visual Localization [PDF] Back to Contents
  Pengju Zhang, Yihong Wu, Bingxi Liu
Abstract: Visual localization, computing the 6DoF camera pose from a given image, has wide applications in robotics, virtual reality, augmented reality, etc. Two kinds of descriptors are important for visual localization. One is global descriptors that extract a feature from the whole image. The other is local descriptors that extract a local feature from each image patch, usually enclosing a key point. More and more visual localization methods have two stages: first performing image retrieval with global descriptors, and then making 2D-3D point correspondences with local descriptors based on the retrieval feedback. The two stages run in series for most of the methods. This simple combination has not achieved the full benefit of fusing local and global descriptors: the 3D points obtained from the retrieval feedback serve as nearest neighbor candidates of the 2D image points determined only by global descriptors. Each of the 2D image points is also called a query local feature when performing the 2D-3D point correspondences. In this paper, we propose a novel parallel search framework, which leverages the advantages of both local and global descriptors to get nearest neighbor candidates of a query local feature. Specifically, besides using deep learning based global descriptors, we also utilize local descriptors to construct random tree structures for obtaining nearest neighbor candidates of the query local feature. We propose a new probabilistic model and a new deep learning based local descriptor when constructing the random trees. A weighted Hamming regularization term to keep discriminativeness after binarization is given in the loss function for the proposed local descriptor. The loss function co-trains both real-valued and binary descriptors, whose results are integrated into the random trees.

26. Hamming OCR: A Locality Sensitive Hashing Neural Network for Scene Text Recognition [PDF] Back to Contents
  Bingcong Li, Xin Tang, Xianbiao Qi, Yihao Chen, Rong Xiao
Abstract: Recently, inspired by the Transformer, self-attention-based scene text recognition approaches have achieved outstanding performance. However, we find that the size of the model expands rapidly as the lexicon increases. Specifically, the numbers of parameters of the softmax classification layer and the output embedding layer are proportional to the vocabulary size. This hinders the development of lightweight text recognition models, especially for Chinese and multilingual applications. Thus, we propose a lightweight scene text recognition model named Hamming OCR. In this model, a novel Hamming classifier, which adopts a locality sensitive hashing (LSH) algorithm to encode each character, is proposed to replace the softmax regression, and the generated LSH code is directly employed to replace the output embedding. We also present a simplified Transformer decoder to reduce the number of parameters by removing the feed-forward network and using a cross-layer parameter sharing technique. Compared with traditional methods, the number of parameters in both the classification and embedding layers is independent of the size of the vocabulary, which significantly reduces the storage requirement without loss of accuracy. Experimental results on several datasets, including four public benchmarks and a Chinese text dataset synthesized by SynthText with more than 20,000 characters, show that Hamming OCR achieves competitive results.
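
The core replacement is easy to sketch: each character receives a short binary code from locality sensitive hashing, the network predicts bits instead of a vocabulary-sized softmax, and decoding picks the character whose code is nearest in Hamming distance. A hypothetical numpy sketch using random-hyperplane LSH (the paper's encoding may differ):

import numpy as np

rng = np.random.default_rng(0)

def lsh_codebook(char_embeddings, n_bits=64):
    """Assign each character the sign pattern of random projections of
    its embedding (random-hyperplane LSH), so that similar characters
    receive nearby binary codes. Returns (vocab_size, n_bits)."""
    planes = rng.standard_normal((char_embeddings.shape[1], n_bits))
    return (char_embeddings @ planes > 0).astype(np.uint8)

def hamming_decode(pred_bits, codebook):
    """Classify a predicted bit vector as the character whose code is
    nearest in Hamming distance; the output layer now scales with
    n_bits rather than with the vocabulary size."""
    dists = np.count_nonzero(pred_bits[None, :] != codebook, axis=1)
    return int(np.argmin(dists))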

27. A Real-time Vision Framework for Pedestrian Behavior Recognition and Intention Prediction at Intersections Using 3D Pose Estimation [PDF] Back to Contents
  Ue-Hwan Kim, Dongho Ka, Hwasoo Yeo, Jong-Hwan Kim
Abstract: Minimizing traffic accidents between vehicles and pedestrians is one of the primary research goals in intelligent transportation systems. To achieve this goal, pedestrian behavior recognition and prediction of pedestrians' crossing or not-crossing intention play a central role. Contemporary approaches do not guarantee satisfactory performance due to lack of generalization, the requirement of manual data labeling, and high computational complexity. To overcome these limitations, we propose a real-time vision framework for two tasks: pedestrian behavior recognition (100.53 FPS) and intention prediction (35.76 FPS). Our framework obtains satisfying generalization over multiple sites because of the proposed site-independent features. At the center of the feature extraction lies 3D pose estimation. The 3D pose analysis enables robust and accurate recognition of pedestrian behaviors and prediction of intentions over multiple sites. The proposed vision framework realizes 89.3% accuracy in the behavior recognition task on the TUD dataset without any training process, and 91.28% accuracy in intention prediction on our dataset, achieving new state-of-the-art performance. To contribute to the corresponding research community, we make our source code public; it is available at this https URL

28. Angular Luminance for Material Segmentation [PDF] Back to Contents
  Jia Xue, Matthew Purri, Kristin Dana
Abstract: Moving cameras provide multiple intensity measurements per pixel, yet often semantic segmentation, material recognition, and object recognition do not utilize this information. With basic alignment over several frames of a moving camera sequence, a distribution of intensities over multiple angles is obtained. It is well known from prior work that luminance histograms and the statistics of natural images provide a strong material recognition cue. We utilize per-pixel angular luminance distributions as a key feature in discriminating the material of the surface. The angle-space sampling in a multiview satellite image sequence is an unstructured sampling of the underlying reflectance function of the material. For real-world materials there is significant intra-class variation that can be managed by building an angular luminance network (AngLNet). This network combines angular reflectance cues from multiple images with spatial cues as input to fully convolutional networks for material segmentation. We demonstrate the increased performance of AngLNet over prior state-of-the-art in material segmentation from satellite imagery.

29. Kernelized dense layers for facial expression recognition [PDF] Back to Contents
  M.Amine Mahmoudi, Aladine Chetouani, Fatma Boufera, Hedi Tabia
Abstract: The fully connected layer is an essential component of Convolutional Neural Networks (CNNs), which have demonstrated their efficiency in computer vision tasks. The CNN process usually starts with convolution and pooling layers that first break down the input images into features, and then analyze them independently. The result of this process feeds into a fully connected neural network structure which drives the final classification decision. In this paper, we propose a Kernelized Dense Layer (KDL) which captures higher-order feature interactions instead of conventional linear relations. We apply this method to Facial Expression Recognition (FER) and evaluate its performance on the RAF, FER2013 and ExpW datasets. The experimental results demonstrate the benefits of such a layer and show that our model achieves competitive results with respect to state-of-the-art approaches.
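
As an illustration of replacing the linear map with a kernel evaluation, each output unit of a dense layer can score its input against its weight vector with, e.g., a polynomial kernel. A hypothetical sketch (the paper may use other kernel choices):

import numpy as np

def kernelized_dense(x, W, c=1.0, d=2):
    """Kernelized dense layer sketch: instead of the linear activation
    x @ W, each output unit j computes a polynomial kernel
    k(x, w_j) = (x . w_j + c)^d, capturing higher-order feature
    interactions. x: (B, D_in), W: (D_in, D_out)."""
    return (x @ W + c) ** d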

30. Efficient DWT-based fusion techniques using genetic algorithm for optimal parameter estimation [PDF] Back to Contents
  S. Kavitha, K. K. Thyagharajan
Abstract: Image fusion plays a vital role in medical imaging. Image fusion aims to integrate complementary as well as redundant information from multiple modalities into a single fused image without distortion or loss of information. In this research work, discrete wavelet transform (DWT)- and undecimated discrete wavelet transform (UDWT)-based fusion techniques using a genetic algorithm (GA) for optimal parameter (weight) estimation in the fusion process are implemented and analyzed with multi-modality brain images. The lack of shift invariance when performing image fusion using DWT is addressed using UDWT. The proposed fusion model uses an efficient, modified GA in DWT and UDWT for optimal parameter estimation, to improve the image quality and contrast. The complexity of the basic GA (pixel level) has been reduced in the modified GA (feature level) by limiting the search space. It is observed from our experiments that fusion using DWT and UDWT techniques with GA for optimal parameter estimation resulted in a better fused image in the aspects of retaining information and contrast without error, both in human perception as well as in evaluation using objective metrics. The contributions of this research work are: (1) reduced time and space complexity in estimating the weight values using GA for fusion; (2) the system is scalable to input images of any size with similar time complexity, owing to the feature-level GA implementation; and (3) identification of the source image that contributes more to the fused image, from the estimated weight values.
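
A single weighted DWT fusion step, with the sub-band weight as the kind of parameter a GA would tune against an image-quality fitness, can be sketched with PyWavelets (a minimal one-level sketch; the paper's fusion rule and GA encoding may differ):

import pywt  # PyWavelets

def dwt_fuse(img_a, img_b, w=0.5, wavelet='db1'):
    """Decompose both source images one level, blend the corresponding
    sub-bands with weight w (a candidate solution the GA evaluates),
    and reconstruct the fused image."""
    cA_a, (cH_a, cV_a, cD_a) = pywt.dwt2(img_a, wavelet)
    cA_b, (cH_b, cV_b, cD_b) = pywt.dwt2(img_b, wavelet)
    fused = (w * cA_a + (1 - w) * cA_b,
             (w * cH_a + (1 - w) * cH_b,
              w * cV_a + (1 - w) * cV_b,
              w * cD_a + (1 - w) * cD_b))
    return pywt.idwt2(fused, wavelet)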

31. Role of Orthogonality Constraints in Improving Properties of Deep Networks for Image Classification [PDF] Back to Contents
  Hongjun Choi, Anirudh Som, Pavan Turaga
Abstract: Standard deep learning models that employ the categorical cross-entropy loss are known to perform well at image classification tasks. However, many standard models thus obtained often exhibit issues like feature redundancy, low interpretability, and poor calibration. A body of recent work has emerged that tries to address some of these challenges by proposing new regularization functions in addition to the cross-entropy loss. In this paper, we present some surprising findings that emerge from exploring the role of simple orthogonality constraints as a means of imposing physics-motivated constraints common in imaging. We propose an Orthogonal Sphere (OS) regularizer that emerges from physics-based latent representations under simplifying assumptions. Under further simplifying assumptions, the OS constraint can be written in closed form as a simple orthonormality term and used along with the cross-entropy loss function. The findings indicate that the orthonormality loss function results in a) rich and diverse feature representations, b) robustness to feature sub-selection, c) better semantic localization in the class activation maps, and d) reduction in model calibration error. We demonstrate the effectiveness of the proposed OS regularization by providing quantitative and qualitative results on four benchmark datasets: CIFAR10, CIFAR100, SVHN, and Tiny ImageNet.
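A hedged sketch of the kind of orthonormality term described: the penalty below pushes the Gram matrix of (normalized) batch features toward the identity and is simply added to the cross-entropy loss; the paper's exact OS formulation may normalize differently.

```python
# A minimal orthonormality penalty on a batch of latent features, used
# alongside cross-entropy as the abstract describes.
import torch
import torch.nn.functional as F

def orthonormality_loss(z):
    # z: (batch, d) latent representations; encourage z z^T ≈ I
    z = F.normalize(z, dim=1)
    gram = z @ z.t()                              # (batch, batch)
    eye = torch.eye(z.size(0), device=z.device)
    return ((gram - eye) ** 2).sum() / z.size(0)  # scaled squared Frobenius norm

features = torch.randn(16, 128)
logits, labels = torch.randn(16, 10), torch.randint(0, 10, (16,))
total = F.cross_entropy(logits, labels) + 0.1 * orthonormality_loss(features)
```

The scalar weight (0.1 here) is a hypothetical trade-off hyperparameter, not a value from the paper.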

32. Fuzzy Simplicial Networks: A Topology-Inspired Model to Improve Task Generalization in Few-shot Learning [PDF] 返回目录
  Henry Kvinge, Zachary New, Nico Courts, Jung H. Lee, Lauren A. Phillips, Courtney D. Corley, Aaron Tuor, Andrew Avila, Nathan O. Hodas
Abstract: Deep learning has shown great success in settings with massive amounts of data but has struggled when data is limited. Few-shot learning algorithms, which seek to address this limitation, are designed to generalize well to new tasks with limited data. Typically, models are evaluated on unseen classes and datasets that are defined by the same fundamental task as they are trained for (e.g. category membership). One can also ask how well a model can generalize to fundamentally different tasks within a fixed dataset (for example: moving from category membership to tasks that involve detecting object orientation or quantity). To formalize this kind of shift we define a notion of "independence of tasks" and identify three new sets of labels for established computer vision datasets that test a model's ability to generalize to tasks which draw on orthogonal attributes in the data. We use these datasets to investigate the failure modes of metric-based few-shot models. Based on our findings, we introduce a new few-shot model called Fuzzy Simplicial Networks (FSN) which leverages a construction from topology to more flexibly represent each class from limited data. In particular, FSN models can not only form multiple representations for a given class but can also begin to capture the low-dimensional structure which characterizes class manifolds in the encoded space of deep networks. We show that FSN outperforms state-of-the-art models on the challenging tasks we introduce in this paper while remaining competitive on standard few-shot benchmarks.

33. Whole Slide Images based Cancer Survival Prediction using Attention Guided Deep Multiple Instance Learning Networks [PDF] 返回目录
  Jiawen Yao, Xinliang Zhu, Jitendra Jonnagaddala, Nicholas Hawkins, Junzhou Huang
Abstract: Traditional image-based survival prediction models rely on discriminative patch labeling, which makes those methods difficult to scale to large datasets. Recent studies have shown that the Multiple Instance Learning (MIL) framework is useful for histopathological images when no annotations are available for the classification task. Unlike current image-based survival models that are limited to key patches or clusters derived from Whole Slide Images (WSIs), we propose Deep Attention Multiple Instance Survival Learning (DeepAttnMISL), which introduces both a siamese MI-FCN and attention-based MIL pooling to efficiently learn imaging features from the WSI and then aggregate WSI-level information to the patient level. Attention-based aggregation is more flexible and adaptive than the aggregation techniques in recent survival models. We evaluated our methods on two large whole slide image datasets of cancer, and our results suggest that the proposed approach is more effective and suitable for large datasets and has better interpretability in locating important patterns and features that contribute to accurate cancer survival predictions. The proposed framework can also be used to assess an individual patient's risk and thus assist in delivering personalized medicine. Codes are available at this https URL.
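The attention-based MIL pooling mentioned here follows a well-known gating pattern; the sketch below shows one common formulation (attention weights from a small two-layer scorer, softmax over patches), not the paper's exact DeepAttnMISL module.

```python
# A minimal attention-based MIL pooling: patch features from one WSI form a
# bag and are aggregated into a single slide-level embedding.
import torch
import torch.nn as nn

class AttnMILPool(nn.Module):
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.V = nn.Linear(dim, hidden)   # projection before the scorer
        self.w = nn.Linear(hidden, 1)     # scalar attention score per patch

    def forward(self, h):
        # h: (num_patches, dim) features of one whole slide image
        a = torch.softmax(self.w(torch.tanh(self.V(h))), dim=0)  # (N, 1)
        return (a * h).sum(dim=0)                                # (dim,)

pool = AttnMILPool(256)
wsi_vec = pool(torch.randn(1000, 256))   # 1000 patches -> one WSI embedding
```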

34. Foreseeing Brain Graph Evolution Over Time Using Deep Adversarial Network Normalizer [PDF] 返回目录
  Zeynep Gurler, Ahmed Nebli, Islem Rekik
Abstract: Foreseeing the brain evolution as a complex highly inter-connected system, widely modeled as a graph, is crucial for mapping dynamic interactions between different anatomical regions of interest (ROIs) in health and disease. Interestingly, brain graph evolution models remain almost absent in the literature. Here we design an adversarial brain network normalizer for representing each brain network as a transformation of a fixed centered population-driven connectional template. Such graph normalization with respect to a fixed reference paves the way for reliably identifying the training samples (i.e., brain graphs) most similar to the testing sample at the baseline timepoint. The testing evolution trajectory will then be spanned by the selected training graphs and their corresponding evolution trajectories. We base our prediction framework on geometric deep learning, which naturally operates on graphs and nicely preserves their topological properties. Specifically, we propose the first graph-based Generative Adversarial Network (gGAN) that not only learns how to normalize brain graphs with respect to a fixed connectional brain template (CBT) (i.e., a brain template that selectively captures the most common features across a brain population) but also learns a high-order representation of the brain graphs, also called embeddings. We use these embeddings to compute the similarity between training and testing subjects, which allows us to pick the closest training subjects at the baseline timepoint to predict the evolution of the testing brain graph over time. A series of benchmarks against several comparison methods showed that our proposed method achieved the lowest brain disease evolution prediction error using a single baseline timepoint. Our gGAN code is available at this http URL.

35. Anisotropic 3D Multi-Stream CNN for Accurate Prostate Segmentation from Multi-Planar MRI [PDF] 返回目录
  Anneke Meyer, Grzegorz Chlebus, Marko Rak, Daniel Schindele, Martin Schostak, Bram van Ginneken, Andrea Schenk, Hans Meine, Horst K. Hahn, Andreas Schreiber, Christian Hansen
Abstract: Background and Objective: Accurate and reliable segmentation of the prostate gland in MR images can support the clinical assessment of prostate cancer, as well as the planning and monitoring of focal and loco-regional therapeutic interventions. Despite the availability of multi-planar MR scans due to standardized protocols, the majority of segmentation approaches presented in the literature consider the axial scans only. Methods: We propose an anisotropic 3D multi-stream CNN architecture, which processes additional scan directions to produce a higher-resolution isotropic prostate segmentation. We investigate two variants of our architecture, which work on two (dual-plane) and three (triple-plane) image orientations, respectively. We compare them with the standard baseline (single-plane) used in the literature, i.e., plain axial segmentation. To realize a fair comparison, we employ a hyperparameter optimization strategy to select optimal configurations for the individual approaches. Results: Training and evaluation on two datasets spanning multiple sites obtain statistically significant improvement over the plain axial segmentation ($p<0.05$ on the Dice similarity coefficient). The improvement can be observed especially at the base ($0.898$ single-plane vs. $0.906$ triple-plane) and apex ($0.888$ single-plane vs. $0.901$ dual-plane). Conclusion: This study indicates that models employing two or three scan directions are superior to plain axial segmentation. Knowledge of the precise boundaries of the prostate is crucial for the conservation of risk structures. Thus, the proposed models have the potential to improve the outcome of prostate cancer diagnosis and therapies.
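As a rough illustration of the multi-stream idea, the sketch below fuses three plane-specific encoder streams by concatenation before a shared segmentation head; the paper's anisotropic kernels, upsampling scheme, and hyperparameter-optimized configurations are not reproduced.

```python
# A minimal triple-plane sketch: axial, sagittal, and coronal stacks each get
# their own small 3D encoder, and the concatenated feature maps feed one head.
import torch
import torch.nn as nn

class TriplePlaneSeg(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        def stream():
            return nn.Sequential(nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(),
                                 nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU())
        self.axial, self.sagittal, self.coronal = stream(), stream(), stream()
        self.head = nn.Conv3d(3 * ch, 1, 1)   # binary prostate mask logits

    def forward(self, ax, sag, cor):
        # Each input: (B, 1, D, H, W), assumed resampled to a common grid
        f = torch.cat([self.axial(ax), self.sagittal(sag), self.coronal(cor)],
                      dim=1)
        return self.head(f)

net = TriplePlaneSeg()
out = net(*(torch.randn(1, 1, 32, 64, 64) for _ in range(3)))
```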

36. Robustification of Segmentation Models Against Adversarial Perturbations In Medical Imaging [PDF] 返回目录
  Hanwool Park, Amirhossein Bayat, Mohammad Sabokrou, Jan S. Kirschke, Bjoern H. Menze
Abstract: This paper presents a novel yet efficient defense framework for segmentation models against adversarial attacks in medical imaging. In contrast to defense methods against adversarial attacks on classification models, which have been widely investigated, such defense methods for segmentation models have been less explored. Our proposed method can be used with any deep learning model without revising the target model, and it is independent of the type of adversarial attack. Our framework consists of a frequency-domain converter, a detector, and a reformer. The frequency-domain converter helps the detector detect adversarial examples by operating on a frequency-domain representation of the image. The reformer helps the target model predict more precisely. Our experiments empirically show that the proposed method performs better than the existing defense method.
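One way to picture the role of the frequency-domain converter: adversarial perturbations often leave a high-frequency footprint, so a detector can score images by high-frequency energy. The sketch below implements this fixed rule as an illustration only; the paper's detector and reformer are learned components.

```python
# An illustrative fixed rule, not the paper's learned detector: score an
# image by the mean magnitude of its high-frequency FFT band and flag
# outliers relative to a threshold fit on clean data (threshold assumed).
import torch

def high_freq_score(img, cutoff=0.25):
    # img: (H, W) grayscale tensor in [0, 1]
    spec = torch.fft.fftshift(torch.fft.fft2(img))
    h, w = img.shape
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    radius = torch.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = radius > cutoff * min(h, w)        # keep only high frequencies
    return spec.abs()[mask].mean().item()

score = high_freq_score(torch.rand(224, 224))
# is_adversarial = score > threshold          # threshold fit on clean images
```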

37. GSR-Net: Graph Super-Resolution Network for Predicting High-Resolution from Low-Resolution Functional Brain Connectomes [PDF] 返回目录
  Megi Isallari, Islem Rekik
Abstract: Catchy but rigorous deep learning architectures have been tailored for image super-resolution (SR); however, these fail to generalize to non-Euclidean data such as brain connectomes. Specifically, building generative models for super-resolving a low-resolution (LR) brain connectome at a higher resolution (HR) (i.e., adding new graph nodes/edges) remains unexplored, although this would circumvent the need for costly data collection and manual labelling of anatomical brain regions (i.e., parcellation). To fill this gap, we introduce GSR-Net (Graph Super-Resolution Network), the first super-resolution framework operating on graph-structured data that generates high-resolution brain graphs from low-resolution graphs. First, we adopt a U-Net-like architecture based on graph convolution, pooling, and unpooling operations specific to non-Euclidean data. However, unlike conventional U-Nets where graph nodes represent samples and node features are mapped to a low-dimensional space (encoding and decoding node attributes or sample features), our GSR-Net operates directly on a single connectome: a fully connected graph where, conventionally, a node denotes a brain region, nodes have no features, and edge weights denote brain connectivity strength between two regions of interest (ROIs). In the absence of original node features, we initially assign identity feature vectors to each brain ROI (node) and then leverage the learned local receptive fields to learn node feature representations. Second, inspired by spectral theory, we break the symmetry of the U-Net architecture by topping it up with a graph super-resolution (GSR) layer and two graph convolutional network layers to predict a HR graph while preserving the characteristics of the LR input. Our proposed GSR-Net framework outperformed its variants for predicting high-resolution brain functional connectomes from low-resolution connectomes.

38. Automatic Breast Lesion Classification by Joint Neural Analysis of Mammography and Ultrasound [PDF] 返回目录
  Gavriel Habib, Nahum Kiryati, Miri Sklair-Levy, Anat Shalmon, Osnat Halshtok Neiman, Renata Faermann Weidenfeld, Yael Yagil, Eli Konen, Arnaldo Mayer
Abstract: Mammography and ultrasound are extensively used by radiologists as complementary modalities to achieve better performance in breast cancer diagnosis. However, existing computer-aided diagnosis (CAD) systems for the breast are generally based on a single modality. In this work, we propose a deep-learning based method for classifying breast cancer lesions from their respective mammography and ultrasound images. We present various approaches and show a consistent improvement in performance when utilizing both modalities. The proposed approach is based on a GoogleNet architecture, fine-tuned for our data in two training steps. First, a distinct neural network is trained separately for each modality, generating high-level features. Then, the aggregated features originating from each modality are used to train a multimodal network to provide the final classification. In quantitative experiments, the proposed approach achieves an AUC of 0.94, outperforming state-of-the-art models trained over a single modality. Moreover, it performs similarly to an average radiologist, surpassing two out of four radiologists participating in a reader study. The promising results suggest that the proposed method may become a valuable decision support tool for breast radiologists.
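A minimal sketch of the described two-step training, under the assumption of generic backbone features: per-modality networks are trained first, and their high-level features are then concatenated and passed to a small multimodal head that produces the final classification.

```python
# A minimal late-fusion head over per-modality features; the backbones (the
# paper uses GoogleNet) are assumed already trained and frozen.
import torch
import torch.nn as nn

class MultimodalHead(nn.Module):
    def __init__(self, feat_dim=1024, n_classes=2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, f_mammo, f_us):
        # f_mammo, f_us: high-level features from the per-modality networks
        return self.fc(torch.cat([f_mammo, f_us], dim=1))

head = MultimodalHead()
logits = head(torch.randn(4, 1024), torch.randn(4, 1024))  # benign/malignant
```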

39. Attention with Multiple Sources Knowledges for COVID-19 from CT Images [PDF] 返回目录
  Duy M. H. Nguyen, Duy M. Nguyen, Huong Vu, Binh T. Nguyen, Fabrizio Nunnari, Daniel Sonntag
Abstract: Until now, Coronavirus SARS-CoV-2 has caused more than 850,000 deaths and infected more than 27 million individuals in over 120 countries. Besides the principal polymerase chain reaction (PCR) tests, automatically identifying positive samples based on computed tomography (CT) scans can present a promising option in the early diagnosis of COVID-19. Recently, there have been increasing efforts to utilize deep networks for COVID-19 diagnosis based on CT scans. While these approaches mostly focus on introducing novel architectures, transfer learning techniques, or constructing large-scale data, we propose a novel strategy to improve the performance of several baselines by leveraging multiple useful information sources relevant to doctors' judgments. Specifically, infected regions and heat maps extracted from learned networks are integrated with the global image via an attention mechanism during the learning process. This procedure not only makes our system more robust to noise but also guides the network to focus on local lesion areas. Extensive experiments illustrate the superior performance of our approach compared to recent baselines. Furthermore, our learned network guidance presents an explainable feature to doctors, as we can understand the connection between input and output in a grey-box model.

40. Learning Non-Unique Segmentation with Reward-Penalty Dice Loss [PDF] 返回目录
  Jiabo He, Sarah Erfani, Sudanthi Wijewickrema, Stephen O'Leary, Kotagiri Ramamohanarao
Abstract: Semantic segmentation is one of the key problems in the field of computer vision, as it enables computer image understanding. However, most research and applications of semantic segmentation focus on addressing unique segmentation problems, where there is only one gold standard segmentation result for every input image. This may not be true in some problems, e.g., medical applications. We may have non-unique segmentation annotations as different surgeons may perform successful surgeries for the same patient in slightly different ways. To comprehensively learn non-unique segmentation tasks, we propose the reward-penalty Dice loss (RPDL) function as the optimization objective for deep convolutional neural networks (DCNN). RPDL is capable of helping DCNN learn non-unique segmentation by enhancing common regions and penalizing outside ones. Experimental results show that RPDL improves the performance of DCNN models by up to 18.4% compared with other loss functions on our collected surgical dataset.
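A hedged reading of the reward-penalty idea, assuming several binary annotation maps per image: reward overlap with the region all annotators agree on, and penalize predictions that fall outside every annotation. The exact RPDL weighting in the paper may differ.

```python
# A minimal reward-penalty Dice-style loss for non-unique segmentation:
# `annotations` holds K plausible gold-standard masks for the same image.
import torch

def reward_penalty_dice(pred, annotations, alpha=1.0, eps=1e-6):
    # pred: (H, W) probabilities; annotations: (K, H, W) binary masks
    common = annotations.prod(dim=0)          # intersection: all annotators agree
    union = annotations.max(dim=0).values     # any annotator marked the pixel
    inter = (pred * common).sum()
    dice = (2 * inter + eps) / (pred.sum() + common.sum() + eps)
    # Penalize predicted mass falling outside every annotation.
    penalty = (pred * (1 - union)).sum() / (pred.sum() + eps)
    return 1 - dice + alpha * penalty

masks = (torch.rand(3, 64, 64) > 0.5).float()
loss = reward_penalty_dice(torch.rand(64, 64), masks)
```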

41. Semantics-Preserving Adversarial Training [PDF] 返回目录
  Wonseok Lee, Hanbit Lee, Sang-goo Lee
Abstract: Adversarial training is a defense technique that improves the adversarial robustness of a deep neural network (DNN) by including adversarial examples in the training data. In this paper, we identify an overlooked problem of adversarial training: these adversarial examples often have different semantics than the original data, introducing unintended biases into the model. We hypothesize that such non-semantics-preserving (and consequently ambiguous) adversarial data harm the robustness of the target models. To mitigate such unintended semantic changes of adversarial examples, we propose semantics-preserving adversarial training (SPAT), which encourages perturbation on the pixels that are shared among all classes when generating adversarial examples in the training stage. Experiment results show that SPAT improves adversarial robustness and achieves state-of-the-art results on CIFAR-10 and CIFAR-100.
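To illustrate the constraint, the sketch below confines a single FGSM step to a precomputed binary mask of "shared" pixels; how SPAT actually identifies pixels shared among all classes is not reproduced here, and `shared_mask` is a hypothetical input.

```python
# A sketch of a masked FGSM step: the perturbation is zeroed outside the
# mask of pixels assumed to be shared among classes.
import torch
import torch.nn.functional as F

def masked_fgsm(model, x, y, shared_mask, eps=8 / 255):
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # Perturb only the pixels flagged as shared among classes.
    x_adv = x + eps * grad.sign() * shared_mask
    return x_adv.clamp(0, 1).detach()

# Toy usage with a linear classifier on 32x32 RGB inputs:
net = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
x_adv = masked_fgsm(net, x, y, shared_mask=torch.ones_like(x))
```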

42. Pruning Convolutional Filters using Batch Bridgeout [PDF] 返回目录
  Najeeb Khan, Ian Stavness
Abstract: State-of-the-art computer vision models are rapidly increasing in capacity, where the number of parameters far exceeds the number required to fit the training set. This results in better optimization and generalization performance. However, the huge size of contemporary models results in large inference costs and limits their use on resource-limited devices. In order to reduce inference costs, convolutional filters in trained neural networks could be pruned to reduce the run-time memory and computational requirements during inference. However, severe post-training pruning results in degraded performance if the training algorithm results in dense weight vectors. We propose the use of Batch Bridgeout, a sparsity inducing stochastic regularization scheme, to train neural networks so that they could be pruned efficiently with minimal degradation in performance. We evaluate the proposed method on common computer vision models VGGNet, ResNet, and Wide-ResNet on the CIFAR image classification task. For all the networks, experimental results show that Batch Bridgeout trained networks achieve higher accuracy across a wide range of pruning intensities compared to Dropout and weight decay regularization.
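Batch Bridgeout itself is a training-time stochastic regularizer; the sketch below only illustrates the downstream step the abstract motivates, i.e., magnitude-based pruning of convolutional filters after training, using a simple L1 ranking as an assumed criterion.

```python
# A minimal post-training filter-pruning step: rank the output filters of a
# Conv2d by L1 norm and keep the strongest fraction. (This is a generic
# criterion, not the Bridgeout regularizer from the paper.)
import torch
import torch.nn as nn

def prune_conv_filters(conv, keep_ratio=0.5):
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))   # per-filter L1
    k = max(1, int(keep_ratio * conv.out_channels))
    keep = torch.topk(norms, k).indices.sort().values
    pruned = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                       stride=conv.stride, padding=conv.padding,
                       bias=conv.bias is not None)
    pruned.weight.data = conv.weight.data[keep]
    if conv.bias is not None:
        pruned.bias.data = conv.bias.data[keep]
    return pruned

small = prune_conv_filters(nn.Conv2d(3, 64, 3, padding=1), keep_ratio=0.25)
```

Note that pruning the output filters of one layer also shrinks the input channels expected by the next layer, which a full pipeline must account for.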

43. Improving Medical Annotation Quality to Decrease Labeling Burden Using Stratified Noisy Cross-Validation [PDF] 返回目录
  Joy Hsu, Sonia Phene, Akinori Mitani, Jieying Luo, Naama Hammel, Jonathan Krause, Rory Sayres
Abstract: As machine learning has become increasingly applied to medical imaging data, noise in training labels has emerged as an important challenge. Variability in diagnosis of medical images is well established; in addition, variability in training and attention to task among medical labelers may exacerbate this issue. Methods for identifying and mitigating the impact of low quality labels have been studied, but are not well characterized in medical imaging tasks. For instance, Noisy Cross-Validation splits the training data into halves, and has been shown to identify low-quality labels in computer vision tasks; but it has not been applied to medical imaging tasks specifically. In this work we introduce Stratified Noisy Cross-Validation (SNCV), an extension of noisy cross validation. SNCV can provide estimates of confidence in model predictions by assigning a quality score to each example; stratify labels to handle class imbalance; and identify likely low-quality labels to analyze the causes. We assess performance of SNCV on diagnosis of glaucoma suspect risk from retinal fundus photographs, a clinically important yet nuanced labeling task. Using training data from a previously-published deep learning model, we compute a continuous quality score (QS) for each training example. We relabel 1,277 low-QS examples using a trained glaucoma specialist; the new labels agree with the SNCV prediction over the initial label >85% of the time, indicating that low-QS examples mostly reflect labeler errors. We then quantify the impact of training with only high-QS labels, showing that strong model performance may be obtained with many fewer examples. By applying the method to randomly sub-sampled training dataset, we show that our method can reduce labelling burden by approximately 50% while achieving model performance non-inferior to using the full dataset on multiple held-out test sets.
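A minimal sketch of the half-split mechanism behind (stratified) noisy cross-validation, assuming any sklearn-style classifier: each half is scored by a model trained on the other half, and an example's quality score is the cross-trained model's probability for its given label. The class stratification SNCV adds is approximated here with a stratified split.

```python
# A minimal quality-scoring pass: two stratified halves, each scored by a
# model trained on the other half; low scores flag likely low-quality labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

def sncv_quality_scores(X, y):
    scores = np.zeros(len(y))
    splitter = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
    for train_idx, eval_idx in splitter.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        # Quality score: predicted probability of the *given* label.
        proba = model.predict_proba(X[eval_idx])
        scores[eval_idx] = proba[np.arange(len(eval_idx)), y[eval_idx]]
    return scores

X, y = np.random.randn(200, 16), np.random.randint(0, 2, 200)
qs = sncv_quality_scores(X, y)   # relabel or drop the lowest-QS examples
```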

44. Adaptive Debanding Filter [PDF] 返回目录
  Zhengzhong Tu, Jessie Lin, Yilin Wang, Balu Adsumilli, Alan C. Bovik
Abstract: Banding artifacts, which manifest as staircase-like color bands on pictures or video frames, are a common distortion caused by compression of low-textured smooth regions. These false contours can be very noticeable even on high-quality videos, especially when displayed on high-definition screens. Yet, relatively little attention has been paid to this problem. Here we consider banding artifact removal as a visual enhancement problem, and accordingly, we solve it by applying a form of content-adaptive smoothing filtering followed by dithered quantization, as a post-processing module. The proposed debanding filter is able to adaptively smooth banded regions while preserving image edges and details, yielding perceptually enhanced gradient rendering with limited bit-depths. Experimental results show that our proposed debanding filter outperforms state-of-the-art false contour removal algorithms both visually and quantitatively.
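A toy version of the two ingredients named in the abstract, smoothing followed by dithered quantization; the real filter is content-adaptive and edge-preserving, whereas this fixed-strength Gaussian variant is only illustrative.

```python
# A minimal deband sketch: blur diffuses false contours, and a small random
# dither added before requantization makes the gradient render smoothly at a
# limited bit depth.
import numpy as np
from scipy.ndimage import gaussian_filter

def deband(img, sigma=2.0, bits=8, dither_amp=0.5):
    # img: float array with values in [0, 255]
    smooth = gaussian_filter(img, sigma=sigma)
    step = 256 / (2 ** bits)
    dither = np.random.uniform(-dither_amp, dither_amp, img.shape) * step
    return np.clip(np.round((smooth + dither) / step) * step, 0, 255)

ramp = np.tile(np.linspace(0, 255, 512), (256, 1))   # synthetic gradient
out = deband(ramp)
```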

45. Cranial Implant Prediction using Low-Resolution 3D Shape Completion and High-Resolution 2D Refinement [PDF] 返回目录
  Amirhossein Bayat, Suprosanna Shit, Adrian Kilian, Jürgen T. Liechtenstein, Jan S. Kirschke, Bjoern H. Menze
Abstract: Designing a cranial implant needs a 3D understanding of the complete skull shape. Thus, taking a 2D approach is sub-optimal, since a 2D model lacks a holistic 3D view of both the defective and healthy skulls. Further, loading whole 3D skull shapes at their original image resolution is not feasible on commonly available GPUs. To mitigate these issues, we propose a fully convolutional network composed of two subnetworks. The first subnetwork is designed to complete the shape of the downsampled defective skull. The second subnetwork upsamples the reconstructed shape slice-wise. We train the 3D and 2D networks together end-to-end, with a hierarchical loss function. Our proposed solution accurately predicts a high-resolution 3D implant in the challenge test case in terms of Dice score and the Hausdorff distance.

46. Age-Net: An MRI-Based Iterative Framework for Biological Age Estimation [PDF] 返回目录
  Karim Armanious, Sherif Abdulatif, Wenbin Shi, Shashank Salian, Thomas Küstner, Daniel Weiskopf, Tobias Hepp, Sergios Gatidis, Bin Yang
Abstract: The concept of biological age (BA), although important in clinical practice, is hard to grasp mainly due to the lack of a clearly defined reference standard. For specific applications, especially in pediatrics, medical image data are used for BA estimation in a routine clinical context. Beyond this young age group, BA estimation is restricted to whole-body assessment using non-imaging indicators such as blood biomarkers, genetic and cellular data. However, various organ systems may exhibit different aging characteristics due to lifestyle and genetic factors. Thus, a whole-body assessment of the BA does not reflect the deviations of aging behavior between organs. To this end, we propose a new imaging-based framework for organ-specific BA estimation. As a first step, we introduce a chronological age (CA) estimation framework using deep convolutional neural networks (Age-Net). We quantitatively assess the performance of this framework in comparison to existing CA estimation approaches. Furthermore, we expand upon Age-Net with a novel iterative data-cleaning algorithm to segregate atypical-aging patients (BA $\not\approx$ CA) from the given population. In this manner, we hypothesize that the remaining population should approximate the true BA behaviour. For this initial study, we apply the proposed methodology on a brain magnetic resonance image (MRI) dataset containing healthy individuals as well as Alzheimer's patients with different dementia ratings. We demonstrate the correlation between the predicted BAs and the expected cognitive deterioration in Alzheimer's patients. A statistical and visualization-based analysis has provided evidence regarding the potential and current challenges of the proposed methodology.
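A minimal sketch of the iterative data-cleaning loop, assuming hypothetical `train`/`predict` callables for an age-regression model: subjects whose predicted age deviates from their chronological age by more than a tolerance are treated as atypical agers and removed before retraining.

```python
# A sketch of iterative cleaning: refit, drop subjects with large |BA - CA|
# residuals, repeat until the kept set stabilizes. `tol` (in years) and the
# train/predict interface are assumptions for illustration.
import numpy as np

def iterative_clean(X, age, train, predict, tol=5.0, max_rounds=5):
    keep = np.ones(len(age), dtype=bool)
    for _ in range(max_rounds):
        model = train(X[keep], age[keep])
        residual = np.abs(predict(model, X) - age)
        new_keep = residual <= tol           # drop atypical-aging subjects
        if np.array_equal(new_keep, keep):
            break                            # converged: remaining set is typical
        keep = new_keep
    return model, keep

# Toy usage with a trivial "predict the mean age" model:
m, kept = iterative_clean(
    np.random.randn(100, 4), np.random.uniform(20, 80, 100),
    train=lambda X, a: a.mean(),
    predict=lambda m, X: np.full(len(X), m))
```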

Note: the cover image is a word cloud of the paper titles.