Contents
8. Domain-invariant Similarity Activation Map Metric Learning for Retrieval-based Long-term Visual Localization [PDF] Abstract
12. BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation [PDF] Abstract
14. The FaceChannel: A Fast & Furious Deep Neural Network for Facial Expression Recognition [PDF] Abstract
17. Compressing Facial Makeup Transfer Networks by Collaborative Distillation and Kernel Decomposition [PDF] Abstract
18. Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos [PDF] Abstract
20. Domain Adaptation for Outdoor Robot Traversability Estimation from RGB data with Safety-Preserving Loss [PDF] Abstract
22. SLGAN: Style- and Latent-guided Generative Adversarial Network for Desirable Makeup Transfer and Removal [PDF] Abstract
23. Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification [PDF] Abstract
32. Knowledge Guided Learning: Towards Open Domain Egocentric Action Recognition with Zero Supervision [PDF] Abstract
33. Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [PDF] Abstract
36. A New Approach for Texture based Script Identification At Block Level using Quad Tree Decomposition [PDF] Abstract
40. EfficientNet-eLite: Extremely Lightweight and Efficient CNN Models for Edge Devices by Network Candidate Search [PDF] Abstract
41. Creation and Validation of a Chest X-Ray Dataset with Eye-tracking and Report Dictation for AI Tool Development [PDF] Abstract
53. MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework [PDF] Abstract
55. Geometric Uncertainty in Patient-Specific Cardiovascular Modeling with Convolutional Dropout Networks [PDF] Abstract
Abstracts
1. FairFace Challenge at ECCV 2020: Analyzing Bias in Face Recognition [PDF] Back to Contents
Tomáš Sixta, Julio C. S. Jacques Junior, Pau Buch-Cardona, Eduard Vazquez, Sergio Escalera
Abstract: This work summarizes the 2020 ChaLearn Looking at People Fair Face Recognition and Analysis Challenge and provides a description of the top-winning solutions and an analysis of the results. The aim of the challenge was to evaluate accuracy and bias with respect to gender and skin colour of submitted algorithms on the task of 1:1 face verification in the presence of other confounding attributes. Participants were evaluated using an in-the-wild dataset based on a reannotated IJB-C, further enriched with 12.5K new images and additional labels. The dataset is not balanced, which simulates a real-world scenario where AI-based models supposed to present fair outcomes are trained and evaluated on imbalanced data. The challenge attracted 151 participants, who made more than 1.8K submissions in total. The final phase of the challenge attracted 36 active teams, 10 of which exceeded 0.999 AUC-ROC while achieving very low scores on the proposed bias metrics. Common strategies among the participants were face pre-processing, homogenization of data distributions, the use of bias-aware loss functions, and ensemble models. The analysis of the top-10 teams shows higher false positive rates (and lower false negative rates) for females with dark skin tone, as well as the potential of eyeglasses and young age to increase false positive rates.
2. Layered Neural Rendering for Retiming People in Video [PDF] Back to Contents
Erika Lu, Forrester Cole, Tali Dekel, Weidi Xie, Andrew Zisserman, David Salesin, William T. Freeman, Michael Rubinstein
Abstract: We present a method for retiming people in an ordinary, natural video---manipulating and editing the time in which different motions of individuals in the video occur. We can temporally align different motions, change the speed of certain actions (speeding up/slowing down, or entirely "freezing" people), or "erase" selected people from the video altogether. We achieve these effects computationally via a dedicated learning-based layered video representation, where each frame in the video is decomposed into separate RGBA layers, representing the appearance of different people in the video. A key property of our model is that it not only disentangles the direct motions of each person in the input video, but also correlates each person automatically with the scene changes they generate---e.g., shadows, reflections, and motion of loose clothing. The layers can be individually retimed and recombined into a new video, allowing us to achieve realistic, high-quality renderings of retiming effects for real-world videos depicting complex actions and involving multiple individuals, including dancing, trampoline jumping, or group running.
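To make the layered representation concrete, here is a minimal numpy sketch of the compositing-and-retiming step, assuming per-person RGBA layer tracks have already been predicted; the function names and the simple nearest-frame resampling are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def composite(layers):
    """Back-to-front alpha compositing of per-person RGBA layers for one
    frame. layers: list of (H, W, 4) arrays in [0, 1]; returns (H, W, 3)."""
    out = np.zeros(layers[0].shape[:2] + (3,))
    for layer in layers:
        rgb, a = layer[..., :3], layer[..., 3:4]
        out = a * rgb + (1.0 - a) * out
    return out

def retime_and_render(layer_tracks, speeds):
    """layer_tracks: per-person sequences of RGBA layers, each (T, H, W, 4);
    speeds: per-person playback factors (0 freezes a person). Nearest-frame
    resampling keeps the sketch short."""
    T = layer_tracks[0].shape[0]
    return [composite([track[min(int(t * s), T - 1)]
                       for track, s in zip(layer_tracks, speeds)])
            for t in range(T)]
```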
3. Multiple Exemplars-based Hallucination for Face Super-resolution and Editing [PDF] Back to Contents
Kaili Wang, Jose Oramas, Tinne Tuytelaars
Abstract: Given a really low-resolution input image of a face (say 16x16 or 8x8 pixels), the goal of this paper is to reconstruct a high-resolution version thereof. This, by itself, is an ill-posed problem, as the high-frequency information is missing in the low-resolution input and needs to be hallucinated, based on prior knowledge about the image content. Rather than relying on a generic face prior, in this paper, we explore the use of a set of exemplars, i.e. other high-resolution images of the same person. These guide the neural network as we condition the output on them. Multiple exemplars work better than a single one. To combine the information from multiple exemplars effectively, we introduce a pixel-wise weight generation module. Besides standard face super-resolution, our method allows us to perform subtle face editing simply by replacing the exemplars with another set with different facial features. A user study is conducted and shows that the super-resolved images can hardly be distinguished from real images on the CelebA dataset. A qualitative comparison indicates our model outperforms methods proposed in the literature on the CelebA and WebFace datasets.
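A minimal sketch of what a pixel-wise weight generation module for exemplar fusion could look like, assuming per-exemplar feature maps are available; the module name and architecture here are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class PixelwiseExemplarFusion(nn.Module):
    """Score each exemplar's feature map at every pixel, softmax the scores
    across exemplars, and blend the maps with those weights."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, feats):                        # (B, K, C, H, W)
        b, k, c, h, w = feats.shape
        s = self.score(feats.reshape(b * k, c, h, w))
        w_px = s.reshape(b, k, 1, h, w).softmax(dim=1)
        return (w_px * feats).sum(dim=1)             # (B, C, H, W)

fused = PixelwiseExemplarFusion(64)(torch.randn(2, 4, 64, 32, 32))
```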
4. GOCor: Bringing Globally Optimized Correspondence Volumes into Your Neural Network [PDF] Back to Contents
Prune Truong, Martin Danelljan, Luc Van Gool, Radu Timofte
Abstract: The feature correlation layer serves as a key neural network module in numerous computer vision problems that involve dense correspondences between image pairs. It predicts a correspondence volume by evaluating dense scalar products between feature vectors extracted from pairs of locations in two images. However, this point-to-point feature comparison is insufficient when disambiguating multiple similar regions in an image, severely affecting the performance of the end task. We propose GOCor, a fully differentiable dense matching module, acting as a direct replacement to the feature correlation layer. The correspondence volume generated by our module is the result of an internal optimization procedure that explicitly accounts for similar regions in the scene. Moreover, our approach is capable of effectively learning spatial matching priors to resolve further matching ambiguities. We analyze our GOCor module in extensive ablative experiments. When integrated into state-of-the-art networks, our approach significantly outperforms the feature correlation layer for the tasks of geometric matching, optical flow, and dense semantic matching. The code and trained models will be made available at this http URL.
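For reference, the feature correlation layer that GOCor is designed to replace can be written in a few lines; this is a generic sketch of a global correlation volume under one common layout convention.

```python
import torch

def correlation_volume(f1, f2):
    """Dense correspondence volume between two feature maps.
    f1, f2: (B, C, H, W). Output: (B, H*W, H, W), the scalar product of
    every f2 location with every f1 location (other conventions swap the
    roles of the two images)."""
    b, c, h, w = f1.shape
    corr = torch.bmm(f2.flatten(2).transpose(1, 2),   # (B, HW, C)
                     f1.flatten(2))                    # (B, C, HW)
    return corr.view(b, h * w, h, w)
```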
5. HOTA: A Higher Order Metric for Evaluating Multi-Object Tracking [PDF] Back to Contents
Jonathon Luiten, Aljosa Osep, Patrick Dendorfer, Philip Torr, Andreas Geiger, Laura Leal-Taixe, Bastian Leibe
Abstract: Multi-Object Tracking (MOT) has been notoriously difficult to evaluate. Previous metrics overemphasize the importance of either detection or association. To address this, we present a novel MOT evaluation metric, HOTA (Higher Order Tracking Accuracy), which explicitly balances the effect of performing accurate detection, association and localization into a single unified metric for comparing trackers. HOTA decomposes into a family of sub-metrics which are able to evaluate each of five basic error types separately, which enables clear analysis of tracking performance. We evaluate the effectiveness of HOTA on the MOTChallenge benchmark, and show that it is able to capture important aspects of MOT performance not previously taken into account by established metrics. Furthermore, we show HOTA scores better align with human visual evaluation of tracking performance.
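A simplified sketch of the idea at a single localisation threshold: a detection-accuracy term is combined with an association term averaged over true-positive matches, and the score is their geometric mean. This toy version ignores unmatched detections in the association term and is not the official evaluation code.

```python
import numpy as np

def hota_at_threshold(matches, num_fp, num_fn):
    """matches: one (gt_id, pred_id) pair per true-positive detection over
    the whole sequence. Returns sqrt(DetA * AssA) at this threshold; the
    full metric averages this over localisation thresholds."""
    tp = len(matches)
    det_a = tp / (tp + num_fp + num_fn)
    pair, gt, pr = {}, {}, {}
    for g, p in matches:
        pair[(g, p)] = pair.get((g, p), 0) + 1
        gt[g] = gt.get(g, 0) + 1
        pr[p] = pr.get(p, 0) + 1
    # association score of each TP: overlap of its gt and pred tracks
    ass_a = np.mean([pair[(g, p)] / (gt[g] + pr[p] - pair[(g, p)])
                     for g, p in matches])
    return float(np.sqrt(det_a * ass_a))
```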
6. TreeGAN: Incorporating Class Hierarchy into Image Generation [PDF] Back to Contents
Ruisi Zhang, Luntian Mou, Pengtao Xie
Abstract: Conditional image generation (CIG) is a widely studied problem in computer vision and machine learning. Given a class, CIG takes the name of this class as input and generates a set of images that belong to this class. In existing CIG works, for different classes, their corresponding images are generated independently, without considering the relationship among classes. In real-world applications, the classes are organized into a hierarchy and their hierarchical relationships are informative for generating high-fidelity images. In this paper, we aim to leverage the class hierarchy for conditional image generation. We propose two ways of incorporating class hierarchy: prior control and post constraint. In prior control, we first encode the class hierarchy, then feed it as a prior into the conditional generator to generate images. In post constraint, after the images are generated, we measure their consistency with the class hierarchy and use the consistency score to guide the training of the generator. Based on these two ideas, we propose a TreeGAN model which consists of three modules: (1) a class hierarchy encoder (CHE) which takes the hierarchical structure of classes and their textual names as inputs and learns an embedding for each class; the embedding captures the hierarchical relationship among classes; (2) a conditional image generator (CIG) which takes the CHE-generated embedding of a class as input and generates a set of images belonging to this class; (3) a consistency checker which performs hierarchical classification on the generated images and checks whether the generated images are compatible with the class hierarchy; the consistency score is used to guide the CIG to generate hierarchy-compatible images. Experiments on various datasets demonstrate the effectiveness of our method.
7. Evaluating Self-Supervised Pretraining Without Using Labels [PDF] Back to Contents
Colorado Reed, Sean Metzger, Aravind Srinivas, Trevor Darrell, Kurt Keutzer
Abstract: A common practice in unsupervised representation learning is to use labeled data to evaluate the learned representations - oftentimes using the labels from the "unlabeled" training dataset. This supervised evaluation is then used to guide the training process, e.g. to select augmentation policies. However, supervised evaluations may not be possible when labeled data is difficult to obtain (such as medical imaging) or ambiguous to label (such as fashion categorization). This raises the question: is it possible to evaluate unsupervised models without using labeled data? Furthermore, is it possible to use this evaluation to make decisions about the training process, such as which augmentation policies to use? In this work, we show that the simple self-supervised evaluation task of image rotation prediction is highly correlated with the supervised performance of standard visual recognition tasks and datasets (rank correlation > 0.94). We establish this correlation across hundreds of augmentation policies and training schedules and show how this evaluation criterion can be used to automatically select augmentation policies without using labels. Despite not using any labeled data, these policies perform comparably with policies determined using supervised downstream tasks. Importantly, this work explores the idea of using unsupervised evaluation criteria to help both researchers and practitioners make decisions when training without labeled data.
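A minimal sketch of the rotation-prediction probe, assuming a frozen encoder and a loader that yields image batches; the hyperparameters and names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_batch(images):
    """Build the 4-way rotation task: 0/90/180/270 degree copies + labels."""
    x = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    y = torch.arange(4).repeat_interleave(images.size(0))
    return x, y

def rotation_score(encoder, loader, feat_dim, steps=100):
    """Fit a linear rotation classifier on frozen features; its accuracy is
    the label-free proxy for downstream performance."""
    head = nn.Linear(feat_dim, 4)
    opt = torch.optim.SGD(head.parameters(), lr=0.1)
    for _, images in zip(range(steps), loader):
        x, y = rotation_batch(images)
        with torch.no_grad():
            feats = encoder(x)
        opt.zero_grad()
        F.cross_entropy(head(feats), y).backward()
        opt.step()
    x, y = rotation_batch(next(iter(loader)))     # held-out batch (sketch)
    with torch.no_grad():
        return (head(encoder(x)).argmax(1) == y).float().mean().item()
```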
8. Domain-invariant Similarity Activation Map Metric Learning for Retrieval-based Long-term Visual Localization [PDF] Back to Contents
Hanjiang Hu, Hesheng Wang, Zhe Liu, Weidong Chen
Abstract: Visual localization is a crucial component in the application of mobile robots and autonomous driving. Image retrieval is an efficient and effective technique in image-based localization methods. Due to the drastic variability of environmental conditions, e.g. illumination, seasonal and weather changes, retrieval-based visual localization is severely affected and becomes a challenging problem. In this work, a general architecture is first formulated probabilistically to extract domain-invariant features through multi-domain image translation. A novel gradient-weighted similarity activation mapping loss (Grad-SAM) is then incorporated for finer localization with high accuracy. We also propose a new adaptive triplet loss to boost the metric learning of the embedding in a self-supervised manner. The final coarse-to-fine image retrieval pipeline is implemented as the sequential combination of models without and with the Grad-SAM loss. Extensive experiments have been conducted to validate the effectiveness of the proposed approach on the CMU-Seasons dataset. The strong generalization ability of our approach is verified on the RobotCar dataset using models pre-trained on the urban part of the CMU-Seasons dataset. Our performance is on par with or even outperforms the state-of-the-art image-based localization baselines at medium or high precision, especially under challenging environments with illumination variance, vegetation and night-time images.
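The adaptive triplet loss itself is the paper's contribution; as a grounding point, here is the vanilla triplet loss on L2-normalised embeddings that such a scheme builds on.

```python
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Vanilla triplet loss on L2-normalised embeddings: pull images of the
    same place together, push different places apart."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    d_pos = (a - p).pow(2).sum(1)
    d_neg = (a - n).pow(2).sum(1)
    return F.relu(d_pos - d_neg + margin).mean()
```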
9. Relative Attribute Classification with Deep Rank SVM [PDF] Back to Contents
Sara Atito Ali Ahmed, Berrin Yanikoglu
Abstract: Relative attributes indicate the strength of a particular attribute between image pairs. We introduce a deep Siamese network with a rank SVM loss function, called Deep Rank SVM (DRSVM), in order to decide which one of a pair of images has a stronger presence of a specific attribute. The network is trained in an end-to-end fashion to jointly learn the visual features and the ranking function. We demonstrate the effectiveness of our approach against the state-of-the-art methods on four image benchmark datasets: LFW-10, PubFig, UTZap50K-lexi and UTZap50K-2. DRSVM surpasses the state of the art in terms of average accuracy across attributes on three of the four image benchmark datasets.
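A sketch of the rank-SVM hinge objective applied to the outputs of the two Siamese branches; the exact regularisation and scoring head in DRSVM may differ.

```python
import torch.nn.functional as F

def rank_svm_loss(score_a, score_b, y):
    """Hinge ranking objective for the Siamese branches: y = +1 when image A
    shows the attribute more strongly than image B, y = -1 otherwise."""
    return F.relu(1.0 - y * (score_a - score_b)).mean()

# usage with a shared scalar-output network f:
#   loss = rank_svm_loss(f(img_a), f(img_b), labels)
```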
10. Calibrating Self-supervised Monocular Depth Estimation [PDF] Back to Contents
Robert McCraith, Lukas Neumann, Andrea Vedaldi
Abstract: In recent years, many methods have demonstrated the ability of neural networks to learn depth and pose changes in a sequence of images, using only self-supervision as the training signal. Whilst the networks achieve good performance, the often overlooked detail is that, due to the inherent ambiguity of monocular vision, they predict depth up to an unknown scaling factor. The scaling factor is then typically obtained from the LiDAR ground truth at test time, which severely limits practical applications of these methods. In this paper, we show that by incorporating prior information about the camera configuration and the environment, we can remove the scale ambiguity and predict depth directly, still using the self-supervised formulation and not relying on any additional sensors.
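For context, the test-time median-scaling step that the paper aims to eliminate looks like this (a sketch with assumed array inputs):

```python
import numpy as np

def median_scale(pred_depth, lidar_depth, valid):
    """Rescale a scale-ambiguous depth prediction by the LiDAR/prediction
    median ratio over valid pixels. The paper's point is to make this
    ground-truth-dependent step unnecessary by baking camera and
    environment priors into training."""
    s = np.median(lidar_depth[valid]) / np.median(pred_depth[valid])
    return s * pred_depth
```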
11. Eating Habits Discovery in Egocentric Photo-streams [PDF] Back to Contents
Estefania Talavera, Andreea Glavan, Alina Matei, Petia Radeva
Abstract: Eating habits are learned throughout the early stages of our lives. However, it is not easy to be aware of how our food-related routine affects our healthy living. In this work, we address the unsupervised discovery of nutritional habits from egocentric photo-streams. We build a food-related behavioural pattern discovery model, which discloses nutritional routines from the activities performed throughout the days. To do so, we rely on Dynamic-Time-Warping to evaluate similarity among the collected days. Within this framework, we present a simple but robust and fast novel classification pipeline that outperforms the state-of-the-art on food-related image classification, with a weighted accuracy and F-score of 70% and 63%, respectively. Later, using the Isolation Forest method, we identify days composed of nutritional activities that do not describe the person's habits as anomalies in the daily life of the user. Furthermore, we show an application for the identification of food-related scenes when the camera wearer eats in isolation. Results have shown the good performance of the proposed model and its relevance for visualizing the nutritional habits of individuals.
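A textbook dynamic-time-warping distance, the similarity measure the paper relies on for comparing days; the `dist` function and descriptor format are placeholders.

```python
import numpy as np

def dtw_distance(a, b, dist=lambda x, y: np.linalg.norm(x - y)):
    """Classic DTW between two day sequences, each a sequence of activity
    descriptors; smaller means the two days follow a more similar routine."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = dist(a[i - 1], b[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```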
12. BSN++: Complementary Boundary Regressor with Scale-Balanced Relation Modeling for Temporal Action Proposal Generation [PDF] Back to Contents
Haisheng Su, Weihao Gan, Wei Wu, Junjie Yan, Yu Qiao
Abstract: Generating human action proposals in untrimmed videos is an important yet challenging task with wide applications. Current methods often suffer from noisy boundary locations and the inferior quality of the confidence scores used for proposal retrieval. In this paper, we present BSN++, a new framework which exploits complementary boundary regressors and relation modeling for temporal proposal generation. First, we propose a novel boundary regressor based on the complementary characteristics of both starting and ending boundary classifiers. Specifically, we utilize the U-shaped architecture with nested skip connections to capture rich contexts and introduce a bi-directional boundary matching mechanism to improve boundary precision. Second, to account for the proposal-proposal relations ignored in previous methods, we devise a proposal relation block which includes two self-attention modules, covering the position and channel aspects. Furthermore, we find that there inevitably exist data imbalance problems in the positive/negative proposals and temporal durations, which harm the model performance on tail distributions. To relieve this issue, we introduce a scale-balanced re-sampling strategy. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14, which demonstrate that BSN++ achieves state-of-the-art performance.
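As an illustration of what a scale-balanced re-sampling strategy can look like (the paper's exact scheme is not spelled out in the abstract), proposals can be drawn with probability inversely proportional to the frequency of their duration bin:

```python
import numpy as np

def scale_balanced_sample(proposals, duration_bins, n):
    """Draw n proposals so that short and long actions contribute equally
    during training; duration_bins holds a non-negative bin index per
    proposal. Illustrative only, not the paper's exact scheme."""
    bins = np.asarray(duration_bins)
    weights = 1.0 / np.bincount(bins)[bins]
    weights /= weights.sum()
    idx = np.random.choice(len(proposals), size=n, p=weights)
    return [proposals[i] for i in idx]
```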
13. ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit [PDF] Back to Contents
Zijie Ye, Haozhe Wu, Jia Jia, Yaohua Bu, Wei Chen, Fanbo Meng, Yanfeng Wang
Abstract: Dance and music are two highly correlated artistic forms. Synthesizing dance motions has attracted much attention recently. Most previous works conduct music-to-dance synthesis by directly mapping music to human skeleton keypoints. Meanwhile, human choreographers design dance motions from music in a two-stage manner: they first devise multiple choreographic action units (CAUs), each with a series of dance motions, and then arrange the CAU sequence according to the rhythm, melody and emotion of the music. Inspired by this, we systematically study such a two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. Based on the constructed dataset, we design a two-stage music-to-dance synthesis framework, ChoreoNet, to imitate the human choreography procedure. Our framework first devises a CAU prediction model to learn the mapping relationship between music and CAU sequences. Afterwards, we devise a spatial-temporal inpainting model to convert the CAU sequence into continuous dance motions. Experimental results demonstrate that the proposed ChoreoNet outperforms baseline methods (0.622 in terms of CAU BLEU score and 1.59 in terms of user study score).
14. The FaceChannel: A Fast & Furious Deep Neural Network for Facial Expression Recognition [PDF] Back to Contents
Pablo Barros, Nikhil Churamani, Alessandra Sciutti
Abstract: Current state-of-the-art models for automatic Facial Expression Recognition (FER) are based on very deep neural networks that are effective but rather expensive to train. Given the dynamic conditions of FER, this characteristic hinders such models from being used for general affect recognition. In this paper, we address this problem by formalizing the FaceChannel, a light-weight neural network that has far fewer parameters than common deep neural networks. We introduce an inhibitory layer that helps to shape the learning of facial features in the last layer of the network, thus improving performance while reducing the number of trainable parameters. To evaluate our model, we perform a series of experiments on different benchmark datasets and demonstrate how the FaceChannel achieves a comparable, if not better, performance to the current state-of-the-art in FER. Our experiments include a cross-dataset analysis to estimate how our model behaves under different affective recognition conditions. We conclude our paper with an analysis of how the FaceChannel learns and adapts the learned facial features towards the different datasets.
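A hedged sketch of an inhibitory layer in the shunting-inhibition style, where a learned inhibition map divisively modulates the excitatory response; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn

class ShuntingInhibition(nn.Module):
    """One conv provides the excitatory response; a second learns an
    inhibition map that divisively suppresses it, pixel by pixel."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.excite = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.inhibit = nn.Conv2d(in_ch, out_ch, 3, padding=1)

    def forward(self, x):
        return self.excite(x) / (1.0 + torch.relu(self.inhibit(x)))
```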
15. Multi-Stage CNN Architecture for Face Mask Detection [PDF] Back to Contents
Amit Chavda, Jason Dsouza, Sumeet Badgujar, Ankit Damani
Abstract: The end of 2019 witnessed the outbreak of Coronavirus Disease 2019 (COVID-19), which has continued to be the cause of plight for millions of lives and businesses even in 2020. As the world recovers from the pandemic and plans to return to a state of normalcy, there is a wave of anxiety among all individuals, especially those who intend to resume in-person activity. Studies have proved that wearing a face mask significantly reduces the risk of viral transmission as well as provides a sense of protection. However, it is not feasible to manually track the implementation of this policy. Technology holds the key here. We introduce a Deep Learning based system that can detect instances where face masks are not used properly. Our system consists of a dual-stage Convolutional Neural Network (CNN) architecture capable of detecting masked and unmasked faces and can be integrated with pre-installed CCTV cameras. This will help track safety violations, promote the use of face masks, and ensure a safe working environment.
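A sketch of the dual-stage inference flow, assuming a face detector returning boxes and a binary masked/unmasked classifier; both names and signatures are hypothetical stand-ins for the two trained stages.

```python
import torch

def find_unmasked_faces(frame, face_detector, mask_classifier, thresh=0.5):
    """frame: (C, H, W) image tensor. Stage 1 localises faces; stage 2
    classifies each crop as masked/unmasked. `mask_classifier` is assumed
    to output a single logit per crop."""
    violations = []
    for (x1, y1, x2, y2) in face_detector(frame):          # stage 1: boxes
        crop = frame[:, y1:y2, x1:x2].unsqueeze(0)         # add batch dim
        p_masked = torch.sigmoid(mask_classifier(crop)).item()  # stage 2
        if p_masked < thresh:
            violations.append((x1, y1, x2, y2))
    return violations
```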
16. Perceiving Traffic from Aerial Images [PDF] Back to Contents
George Adaimi, Sven Kreiss, Alexandre Alahi
Abstract: Drones or UAVs, equipped with different sensors, have been deployed in many places, especially for urban traffic monitoring or last-mile delivery. They provide the ability to control the different aspects of traffic given real-time observations, an important pillar for the future of transportation and smart cities. With the increasing use of such machines, many previous state-of-the-art object detectors, which have achieved high performance on front-facing cameras, are being used on UAV datasets. When applied to high-resolution aerial images captured from such datasets, they fail to generalize to the wide range of objects' scales. In order to address this limitation, we propose an object detection method called Butterfly Detector that is tailored to detect objects in aerial images. We extend the concept of fields and introduce butterfly fields, a type of composite field that describes the spatial information of output features as well as the scale of the detected object. To overcome occlusion and viewing-angle variations that can hinder the localization process, we employ a voting mechanism between related butterfly vectors pointing to the object center. We evaluate our Butterfly Detector on two publicly available UAV datasets (UAVDT and VisDrone2019) and show that it outperforms previous state-of-the-art methods while remaining real-time.
17. Compressing Facial Makeup Transfer Networks by Collaborative Distillation and Kernel Decomposition [PDF] Back to Contents
Bianjiang Yang, Zi Hui, Haoji Hu, Xinyi Hu, Lu Yu
Abstract: Although the facial makeup transfer network has achieved high-quality performance in generating perceptually pleasing makeup images, its capability is still restricted by the massive computation and storage of the network architecture. We address this issue by compressing facial makeup transfer networks with collaborative distillation and kernel decomposition. The main idea of collaborative distillation is underpinned by a finding that the encoder-decoder pairs construct an exclusive collaborative relationship, which is regarded as a new kind of knowledge for low-level vision tasks. For kernel decomposition, we apply the depth-wise separation of convolutional kernels to build a lightweight Convolutional Neural Network (CNN) from the original network. Extensive experiments show the effectiveness of the compression method when applied to the state-of-the-art facial makeup transfer network, BeautyGAN.
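The depth-wise separation of a convolution kernel is a standard construction and easy to show: a k x k conv over many channels becomes a per-channel spatial conv plus a 1x1 pointwise conv.

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, k=3):
    """Replace a dense k x k conv with a depthwise spatial conv followed by
    a 1x1 pointwise conv; parameters drop from roughly in*out*k*k to
    in*k*k + in*out."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),  # depthwise
        nn.Conv2d(in_ch, out_ch, 1),                               # pointwise
    )
```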
摘要:虽然脸谱传输网络已经产生感知取悦化妆图像实现高品质的性能,它的能力仍受到网络架构的海量计算和存储的限制。我们通过与合作蒸馏和内核分解压缩脸谱传送网络解决这个问题。协同蒸馏的主要思想是通过发现支撑编码器,解码器对构建一个独特的协作关系,这被认为是一种新的知识,为低级别的视觉任务。对于内核分解,我们应用卷积内核的深度方向分离从原始网络建立一个重量轻的卷积神经网络(CNN)。当施加到所述状态的最先进的脸部化妆传送网络广泛实验表明,该压缩方法的有效性 - BeautyGAN。
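Depth-wise separation of convolutional kernels, as used for the kernel decomposition above, is a standard factorization: a per-channel k x k spatial convolution followed by a 1x1 pointwise convolution that mixes channels. A minimal PyTorch sketch of the general technique (a generic building block, not the paper's exact compressed architecture):

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A k x k convolution factorized into a per-channel (depth-wise)
    spatial convolution and a 1x1 point-wise channel mixer."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 32, 64, 64)
print(DepthwiseSeparableConv(32, 64)(x).shape)  # torch.Size([1, 64, 64, 64])
```

For a 3x3 kernel with 32 input and 64 output channels, this cuts the weight count from 32*64*9 = 18432 to 32*9 + 32*64 = 2336 (ignoring biases), which is where the compression comes from.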
18. Red Carpet to Fight Club: Partially-supervised Domain Transfer for Face Recognition in Violent Videos [PDF] 返回目录
Yunus Can Bilge, Mehmet Kerim Yucel, Ramazan Gokberk Cinbis, Nazli Ikizler-Cinbis, Pinar Duygulu
Abstract: In many real-world problems, there is typically a large discrepancy between the characteristics of data used in training versus deployment. A prime example is the analysis of aggression videos: in a criminal incident, suspects typically need to be identified based on their clean portrait-like photos, instead of their prior video recordings. This results in three major challenges: a large domain discrepancy between violence videos and ID photos, the lack of video examples for most individuals, and limited training data availability. To mimic such scenarios, we formulate a realistic domain-transfer problem, where the goal is to transfer the recognition model trained on clean posed images to the target domain of violent videos, where training videos are available only for a subset of subjects. To this end, we introduce the WildestFaces dataset, tailored to study cross-domain recognition under a variety of adverse conditions. We divide the task of transferring a recognition model from the domain of clean images to violent videos into two sub-problems and tackle them using (i) stacked affine-transforms for classifier-transfer and (ii) attention-driven pooling for temporal-adaptation. We additionally formulate a self-attention based model for domain-transfer. We establish a rigorous evaluation protocol for this clean-to-violent recognition task, and present a detailed analysis of the proposed dataset and the methods. Our experiments highlight the unique challenges introduced by the WildestFaces dataset and the advantages of the proposed approach.
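Attention-driven pooling for temporal adaptation can be sketched in its generic form: score each frame's feature, normalize the scores with a softmax, and take the weighted sum so that informative frames dominate the clip representation. A hedged PyTorch sketch (the module layout is an assumption for illustration, not the paper's design):

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool T per-frame features into one clip feature using learned
    attention weights instead of uniform averaging."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar relevance score per frame

    def forward(self, feats):                        # feats: (B, T, D)
        w = torch.softmax(self.score(feats), dim=1)  # (B, T, 1) frame weights
        return (w * feats).sum(dim=1)                # (B, D) clip feature

clip = torch.randn(4, 16, 256)  # 4 clips, 16 frames, 256-d features each
print(AttentionPooling(256)(clip).shape)  # torch.Size([4, 256])
```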
19. Hierarchical brain parcellation with uncertainty [PDF] 返回目录
Mark S. Graham, Carole H. Sudre, Thomas Varsavsky, Petru-Daniel Tudosiu, Parashkev Nachev, Sebastien Ourselin, M. Jorge Cardoso
Abstract: Many atlases used for brain parcellation are hierarchically organised, progressively dividing the brain into smaller sub-regions. However, state-of-the-art parcellation methods tend to ignore this structure and treat labels as if they are 'flat'. We introduce a hierarchically-aware brain parcellation method that works by predicting the decisions at each branch in the label tree. We further show how this method can be used to model uncertainty separately for every branch in this label tree. Our method exceeds the performance of flat uncertainty methods, whilst also providing decomposed uncertainty estimates that enable us to obtain self-consistent parcellations and uncertainty maps at any level of the label hierarchy. We demonstrate a simple way these decision-specific uncertainty maps may be used to provide uncertainty-thresholded tissue maps at any level of the label tree.
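Predicting a decision at every branch of the label tree yields leaf probabilities as products of the branch decisions along each path, which is what makes the resulting parcellations self-consistent across levels. A toy NumPy sketch of this idea (the two-level anatomy tree is invented purely for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Invented 2-level hierarchy: root -> {gray matter, white matter},
# gray matter -> {cortex, deep gray}.
root_logits = np.array([1.2, 0.3])   # decision: gray vs. white
gray_logits = np.array([0.5, -0.1])  # decision within gray matter

p_root = softmax(root_logits)
p_gray = softmax(gray_logits)

# A leaf's probability is the product of branch decisions on its path,
# so scores remain consistent at every level of the hierarchy.
p_cortex = p_root[0] * p_gray[0]
p_deep = p_root[0] * p_gray[1]
p_white = p_root[1]
print(p_cortex + p_deep + p_white)  # 1.0 (up to float error)
```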
20. Domain Adaptation for Outdoor Robot Traversability Estimation from RGB data with Safety-Preserving Loss [PDF] 返回目录
Simone Palazzo, Dario C. Guastella, Luciano Cantelli, Paolo Spadaro, Francesco Rundo, Giovanni Muscato, Daniela Giordano, Concetto Spampinato
Abstract: Being able to estimate the traversability of the area surrounding a mobile robot is a fundamental task in the design of a navigation algorithm. However, the task is often complex, since it requires evaluating distances from obstacles, type and slope of terrain, and dealing with non-obvious discontinuities in detected distances due to perspective. In this paper, we present an approach based on deep learning to estimate and anticipate the traversing score of different routes in the field of view of an on-board RGB camera. The backbone of the proposed model is based on a state-of-the-art deep segmentation model, which is fine-tuned on the task of predicting route traversability. We then enhance the model's capabilities by a) addressing domain shifts through gradient-reversal unsupervised adaptation, and b) accounting for the specific safety requirements of a mobile robot, by encouraging the model to err on the safe side, i.e., penalizing errors that would cause collisions with obstacles more than those that would cause the robot to stop in advance. Experimental results show that our approach is able to satisfactorily identify traversable areas and to generalize to unseen locations.
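Gradient-reversal unsupervised adaptation relies on a well-known construction: a layer that is the identity in the forward pass but negates (and scales) the gradient in the backward pass, so a domain classifier trained on top pushes the feature extractor toward domain-invariant features. A minimal PyTorch sketch of that generic mechanism (not the paper's exact network):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity forward; backward multiplies the gradient by -lambda."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient w.r.t. lambda

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

feat = torch.randn(8, 128, requires_grad=True)
grad_reverse(feat, lam=0.5).sum().backward()
print(feat.grad[0, 0])  # tensor(-0.5000): sign flipped and scaled
```

The safety-preserving part of the loss is orthogonal to this layer: per the abstract, it weights errors asymmetrically so that mistakes that would cause collisions cost more than mistakes that merely stop the robot early.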
21. Similarity-based data mining for online domain adaptation of a sonar ATR system [PDF] 返回目录
Jean de Bodinat, Thomas Guerneve, Jose Vazquez, Marija Jegorova
Abstract: Due to the expensive nature of field data gathering, the lack of training data often limits the performance of Automatic Target Recognition (ATR) systems. This problem is often addressed with domain adaptation techniques; however, the existing methods fail to satisfy the constraints of resource- and time-limited underwater systems. We propose to address this issue via an online fine-tuning of the ATR algorithm using a novel data-selection method. Our proposed data-mining approach relies on visual similarity and outperforms the traditionally employed hard-mining methods. We present a comparative performance analysis in a wide range of simulated environments and highlight the benefits of using our method for rapid adaptation to previously unseen environments.
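The similarity-based selection idea can be sketched generically: embed the stored training samples and the newly observed scene, then fine-tune on the stored samples most similar to the scene. A NumPy sketch under that assumption (embedding sizes and names are illustrative):

```python
import numpy as np

def select_similar(candidates, query, k=5):
    """Return indices of the k candidate embeddings most
    cosine-similar to the query embedding."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

bank = np.random.randn(1000, 64)  # stored training-sample embeddings
scene = np.random.randn(64)       # embedding of the newly observed environment
idx = select_similar(bank, scene, k=32)
print(idx.shape)  # (32,): the subset used for online fine-tuning
```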
22. SLGAN: Style- and Latent-guided Generative Adversarial Network for Desirable Makeup Transfer and Removal [PDF] 返回目录
Daichi Horita, Kiyoharu Aizawa
Abstract: There are five features to consider when using generative adversarial networks to apply makeup to photos of the human face. These features include (1) facial components, (2) interactive color adjustments, (3) makeup variations, (4) robustness to poses and expressions, and (5) the use of multiple reference images. Several related works have been proposed, mainly using generative adversarial networks (GANs). Unfortunately, none of them has addressed all five features simultaneously. This paper closes the gap with an innovative style- and latent-guided GAN (SLGAN). We provide a novel perceptual makeup loss and a style-invariant decoder that can transfer makeup styles based on histogram matching to avoid the identity-shift problem. In our experiments, we show that our SLGAN is better than or comparable to state-of-the-art methods. Furthermore, we show that our proposal can interpolate facial makeup images to determine the unique features, compare existing methods, and help users find desirable makeup configurations.
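Histogram matching, the ingredient named above, remaps source pixel values so their distribution follows a reference's by aligning the two cumulative distribution functions at equal quantiles. A single-channel NumPy sketch of the generic operation (purely illustrative; the paper applies the idea inside its style-invariant decoder):

```python
import numpy as np

def match_histogram(source, reference):
    """Remap source values so their histogram matches the reference's."""
    s_vals, s_counts = np.unique(source.ravel(), return_counts=True)
    r_vals, r_counts = np.unique(reference.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_counts) / source.size
    r_cdf = np.cumsum(r_counts) / reference.size
    mapped = np.interp(s_cdf, r_cdf, r_vals)  # reference value at same quantile
    lookup = dict(zip(s_vals, mapped))
    return np.vectorize(lookup.get)(source)

src = np.random.randint(0, 256, (32, 32))    # e.g. one channel of a face region
ref = np.random.randint(100, 200, (32, 32))  # same channel of a reference makeup
out = match_histogram(src, ref)
print(out.min() >= 100, out.max() < 200)     # True True
```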
23. Hybrid-Attention Guided Network with Multiple Resolution Features for Person Re-Identification [PDF] 返回目录
Guoqing Zhang, Junchuan Yang, Yuhui Zheng, Yi Wu, Shengyong Chen
Abstract: Extracting effective and discriminative features is very important for addressing the challenging person re-identification (re-ID) task. Prevailing deep convolutional neural networks (CNNs) usually use high-level features for identifying pedestrians. However, some essential spatial information residing in low-level features, such as shape, texture and color, is lost when learning the high-level features, due to extensive padding and pooling operations in the training stage. In addition, most existing person re-ID methods are mainly based on hand-crafted bounding boxes where images are precisely aligned. This is unrealistic in practical applications, since the object detection algorithms employed often produce inaccurate bounding boxes. This will inevitably degrade the performance of existing algorithms. To address these problems, we put forward a novel person re-ID model that fuses high- and low-level embeddings to reduce the information loss caused by learning only high-level features. Then we divide the fused embedding into several parts and reconnect them to obtain the global feature and more significant local features, so as to alleviate the effect caused by inaccurate bounding boxes. In addition, we introduce spatial and channel attention mechanisms in our model, which aim to mine more discriminative features related to the target. Finally, we reconstruct the feature extractor to ensure that our model can obtain richer and more robust features. Extensive experiments demonstrate the superiority of our approach compared with existing approaches. Our code is available at this https URL.
24. DRL-FAS: A Novel Framework Based on Deep Reinforcement Learning for Face Anti-Spoofing [PDF] 返回目录
Rizhao Cai, Haoliang Li, Shiqi Wang, Changsheng Chen, Alex Chichung Kot
Abstract: Inspired by the strategy employed by human beings to determine whether a presented face example is genuine or not, i.e., to glance at the example globally first and then carefully observe the local regions to gain more discriminative information, we propose a novel framework for the face anti-spoofing problem based on a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). In particular, we model the behavior of exploring face-spoofing-related information from image sub-patches by leveraging deep reinforcement learning. We further introduce a recurrent mechanism to learn representations of local information sequentially from the explored sub-patches with an RNN. Finally, for the classification purpose, we fuse the local information with the global one, which can be learned from the original input image through a CNN. Moreover, we conduct extensive experiments, including an ablation study and visualization analysis, to evaluate our proposed framework on various public databases. The experimental results show that our method can generally achieve state-of-the-art performance among all scenarios, demonstrating its effectiveness.
25. CogTree: Cognition Tree Loss for Unbiased Scene Graph Generation [PDF] 返回目录
Jing Yu, Yuan Chai, Yue Hu, Qi Wu
Abstract: Scene graphs are semantic abstractions of images that encourage visual understanding and reasoning. However, the performance of Scene Graph Generation (SGG) is unsatisfactory when faced with biased data in real-world scenarios. Conventional debiasing research mainly studies the problem from the view of data representation, e.g., balancing the data distribution or learning unbiased models and representations, ignoring the mechanism by which humans accomplish this task. Inspired by the role of the prefrontal cortex (PFC) in hierarchical reasoning, we analyze this problem from a novel cognition perspective: learning a hierarchical cognitive structure of the highly-biased relationships and navigating that hierarchy to locate the classes, making the tail classes receive more attention in a coarse-to-fine mode. To this end, we propose a novel Cognition Tree (CogTree) loss for unbiased SGG. We first build a cognitive structure, CogTree, to organize the relationships based on the predictions of a biased SGG model. The CogTree distinguishes remarkably different relationships first and then focuses on a small portion of easily confused ones. Then, we propose a hierarchical loss specially designed for this cognitive structure, which supports coarse-to-fine distinction for the correct relationships while progressively eliminating the interference of irrelevant ones. The loss is model-independent and can be applied to various SGG models without extra supervision. The proposed CogTree loss consistently boosts the performance of several state-of-the-art models on the Visual Genome benchmark.
26. The 1st Tiny Object Detection Challenge:Methods and Results [PDF] 返回目录
Xuehui Yu, Zhenjun Han, Yuqi Gong, Nan Jan, Jian Zhao, Qixiang Ye, Jie Chen, Yuan Feng, Bin Zhang, Xiaodi Wang, Ying Xin, Jingwei Liu, Mingyuan Mao, Sheng Xu, Baochang Zhang, Shumin Han, Cheng Gao, Wei Tang, Lizuo Jin, Mingbo Hong, Yuchao Yang, Shuiwang Li, Huan Luo, Qijun Zhao, Humphrey Shi
Abstract: The 1st Tiny Object Detection (TOD) Challenge aims to encourage research in developing novel and accurate methods for tiny object detection in images which have wide views, with a current focus on tiny person detection. The TinyPerson dataset was used for the TOD Challenge and is publicly released. It has 1610 images and 72651 box-level annotations. Around 36 participating teams from the globe competed in the 1st TOD Challenge. In this paper, we provide a brief summary of the 1st TOD Challenge including brief introductions to the top three methods. The submission leaderboard will be reopened for researchers that are interested in the TOD challenge. The benchmark dataset and other information can be found at: this https URL.
27. UXNet: Searching Multi-level Feature Aggregation for 3D Medical Image Segmentation [PDF] 返回目录
Yuanfeng Ji, Ruimao Zhang, Zhen Li, Jiamin Ren, Shaoting Zhang, Ping Luo
Abstract: Aggregating multi-level feature representations plays a critical role in achieving robust volumetric medical image segmentation, which is important for auxiliary diagnosis and treatment. Unlike recent neural architecture search (NAS) methods that typically search for the optimal operators in each network layer but lack a good strategy for searching feature aggregations, this paper proposes a novel NAS method for 3D medical image segmentation, named UXNet, which searches both the scale-wise feature aggregation strategies and the block-wise operators in the encoder-decoder network. UXNet has several appealing benefits. (1) It significantly improves the flexibility of the classical UNet architecture, which only aggregates feature representations of the encoder and decoder at equivalent resolutions. (2) A continuous relaxation of UXNet is carefully designed, enabling its search scheme to be performed in an efficient, differentiable manner. (3) Extensive experiments demonstrate the effectiveness of UXNet compared with recent NAS methods for medical image segmentation. The architecture discovered by UXNet outperforms existing state-of-the-art models in terms of Dice on several public 3D medical image segmentation benchmarks, especially for boundary locations and tiny tissues. The search cost of UXNet is low, enabling a network with the best performance to be found in less than 1.5 days on two TitanXP GPUs.
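The continuous relaxation mentioned in point (2) is commonly implemented by replacing a discrete operator choice with a softmax-weighted mixture of all candidates, so the mixing weights become differentiable architecture parameters. A minimal PyTorch sketch of that generic scheme (the candidate set below is invented and much smaller than a real search space):

```python
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    """Softmax-weighted mixture of candidate ops; the architecture
    parameters alpha are learned jointly by gradient descent."""
    def __init__(self, c):
        super().__init__()
        self.ops = nn.ModuleList([
            nn.Conv3d(c, c, 3, padding=1),  # 3x3x3 convolution
            nn.Conv3d(c, c, 5, padding=2),  # 5x5x5 convolution
            nn.Identity(),                  # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = torch.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

x = torch.randn(1, 8, 16, 16, 16)  # a small 3D feature map
print(MixedOp(8)(x).shape)         # torch.Size([1, 8, 16, 16, 16])
```

After the search converges, the op with the largest alpha is typically kept and the rest are pruned, yielding a discrete architecture.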
28. Dual Semantic Fusion Network for Video Object Detection [PDF] 返回目录
Lijian Lin, Haosheng Chen, Honglun Zhang, Jun Liang, Yu Li, Ying Shan, Hanzi Wang
Abstract: Video object detection is a tough task due to the deteriorated quality of video sequences captured in complex environments. Currently, this area is dominated by a series of feature-enhancement-based methods, which distill beneficial semantic information from multiple frames and generate enhanced features by fusing the distilled information. However, the distillation and fusion operations are usually performed at either the frame level or the instance level with external guidance using additional information, such as optical flow and feature memory. In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. Moreover, we introduce a geometric similarity measure into the fusion process to alleviate the influence of information distortion caused by noise. As a result, the proposed DSFNet can generate more robust features through the multi-granularity fusion and avoid being affected by the instability of external guidance. To evaluate the proposed DSFNet, we conduct extensive experiments on the ImageNet VID dataset. Notably, the proposed dual semantic fusion network achieves, to the best of our knowledge, the best performance of 84.1% mAP among the current state-of-the-art video object detectors with ResNet-101 and 85.4% mAP with ResNeXt-101, without using any post-processing steps.
29. Robust Person Re-Identification through Contextual Mutual Boosting [PDF] 返回目录
Zhikang Wang, Lihuo He, Xinbo Gao, Jane Shen
Abstract: Person Re-Identification (Re-ID) has witnessed great advances, driven by the development of deep learning. However, modern person Re-ID is still challenged by background clutter, occlusion and large posture variation, which are common in practice. Previous methods tackle these challenges by localizing pedestrians through external cues (e.g., pose estimation, human parsing) or attention mechanisms, suffering from high computation cost and increased model complexity. In this paper, we propose the Contextual Mutual Boosting Network (CMBN). It localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference. Firstly, we construct two branches with a shared convolutional frontend to learn the foreground and background features respectively. By enabling interaction between these two branches, they mutually boost the accuracy of spatial localization. Secondly, starting from a statistical perspective, we propose the Mask Generator, which exploits the activation distribution of the transformation matrix to generate a static channel mask for the representations. The mask recalibrates the features to amplify the valuable characteristics and diminish the noise. Finally, we propose the Contextual-Detachment Strategy to optimize the two branches jointly and independently, which further enhances the localization precision. Experiments on the benchmarks demonstrate the superiority of the architecture compared to the state of the art.
30. Pooling Methods in Deep Neural Networks, a Review [PDF] 返回目录
Hossein Gholamalinezhad, Hossein Khosravi
Abstract: Nowadays, Deep Neural Networks (DNNs) are among the main tools used in various sciences. A Convolutional Neural Network is a special type of DNN consisting of several convolution layers, each followed by an activation function and a pooling layer. The pooling layer is an important layer that performs down-sampling on the feature maps coming from the previous layer and produces new feature maps with a condensed resolution. This layer drastically reduces the spatial dimension of the input. It serves two main purposes. The first is to reduce the number of parameters or weights, thus lessening the computational cost. The second is to control overfitting of the network. An ideal pooling method is expected to extract only useful information and discard irrelevant details. There are many methods for implementing the pooling operation in Deep Neural Networks. In this paper, we review some of the well-known and useful pooling methods.
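As a concrete illustration of the down-sampling the review discusses, non-overlapping 2x2 max and average pooling fit in a few lines of NumPy (a generic sketch, not tied to any particular method surveyed):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping size x size pooling over a 2D feature map;
    with size=2 each spatial dimension is halved."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))   # [[ 5.  7.] [13. 15.]]
print(pool2d(fmap, 2, "mean"))  # [[ 2.5  4.5] [10.5 12.5]]
```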
31. A Convolutional LSTM based Residual Network for Deepfake Video Detection [PDF] 返回目录
Shahroz Tariq, Sangyup Lee, Simon S. Woo
Abstract: In recent years, deep learning-based video manipulation methods have become widely accessible to the masses. With little to no effort, people can easily learn how to generate deepfake videos with only a few victim or target images. This creates a significant social problem for everyone whose photos are publicly available on the Internet, especially on social media websites. Several deep learning-based detection methods have been developed to identify these deepfakes. However, these methods lack generalizability, because they perform well only for a specific type of deepfake method. Therefore, those methods are not transferable to detecting other deepfake methods. Also, they do not take advantage of the temporal information of the video. In this paper, we address these limitations. We develop a Convolutional LSTM-based Residual Network (CLRNet), which takes a sequence of consecutive images from a video as input to learn the temporal information that helps in detecting unnatural-looking artifacts present between frames of deepfake videos. We also propose a transfer learning-based approach to generalize across different deepfake methods. Through rigorous experimentation using the FaceForensics++ dataset, we show that our method outperforms five previously proposed state-of-the-art deepfake detection methods by generalizing better at detecting different deepfake methods with the same model.
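A Convolutional LSTM replaces the fully connected gate computations of a standard LSTM with convolutions, so the recurrent state keeps its spatial layout while integrating information across consecutive frames. A minimal PyTorch sketch of a generic ConvLSTM cell (CLRNet's actual architecture adds residual connections and more, which are omitted here):

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """LSTM cell whose four gates are computed by one convolution over
    the concatenated input frame and previous hidden state."""
    def __init__(self, c_in, c_hidden, k=3):
        super().__init__()
        self.gates = nn.Conv2d(c_in + c_hidden, 4 * c_hidden, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(3, 16)
h = c = torch.zeros(1, 16, 64, 64)
for frame in torch.randn(5, 1, 3, 64, 64):  # five consecutive frames
    h, c = cell(frame, h, c)
print(h.shape)  # torch.Size([1, 16, 64, 64]): spatial layout preserved
```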
32. Knowledge Guided Learning: Towards Open Domain Egocentric Action Recognition with Zero Supervision [PDF] 返回目录
Sathyanarayanan N. Aakur, Sanjoy Kundu, Nikhil Gunti
Abstract: Advances in deep learning have enabled the development of models that have exhibited a remarkable ability to recognize and even localize actions in videos. However, they tend to experience errors when faced with scenes or examples beyond their initial training environment. Hence, they fail to adapt to new domains without significant retraining with large amounts of annotated data. Current algorithms are trained in an inductive learning environment, where they use data-driven models to learn associations between input observations and a fixed set of known classes. In this paper, we propose to overcome these limitations by moving to an open-world setting and decoupling the ideas of recognition and reasoning. Building upon the compositional representation offered by Grenander's Pattern Theory formalism, we show that attention and commonsense knowledge can be used to enable the self-supervised discovery of novel actions in egocentric videos in an open-world setting, a considerably more difficult task than zero-shot learning and (un)supervised domain adaptation tasks, where target domain data (both labeled and unlabeled) are available during training. We show that our approach can be used to infer and learn novel classes for open vocabulary classification in egocentric videos and novel object detection with zero supervision. Extensive experiments show that it performs competitively with fully supervised baselines on publicly available datasets under open-world conditions. To the best of our knowledge, this is one of the first works to address the problem of open-world action recognition in egocentric videos with zero human supervision.
33. Information Bottleneck Constrained Latent Bidirectional Embedding for Zero-Shot Learning [PDF] 返回目录
Yang Liu, Lei Zhou, Xiao Bai, Lin Gu, Tatsuya Harada, Jun Zhou
Abstract: Zero-shot learning (ZSL) aims to recognize novel classes by transferring semantic knowledge from seen classes to unseen classes. Though many ZSL methods rely on a direct mapping between the visual and the semantic space, the calibration deviation and hubness problems limit the generalization capability to unseen classes. Recently emerged generative ZSL methods generate unseen image features to transform ZSL into a supervised classification problem. However, most generative models still suffer from the seen-unseen bias problem, as only seen data is used for training. To address these issues, we propose a novel bidirectional-embedding-based generative model with a tight visual-semantic coupling constraint. We learn a unified latent space that calibrates the embedded parametric distributions of both the visual and the semantic space. Since the embedding from high-dimensional visual features comprises much non-semantic information, the alignment of visual and semantic representations in the latent space would inevitably be deviated. Therefore, we introduce an information bottleneck (IB) constraint to ZSL for the first time to preserve essential attribute information during the mapping. Specifically, we utilize uncertainty estimation and the wake-sleep procedure to alleviate noise and improve model abstraction capability. We evaluate the learned latent features on four benchmark datasets. Extensive experimental results show that our method outperforms the state-of-the-art methods in different ZSL settings on most benchmark datasets. The code will be available at this https URL.
34. Exploring Font-independent Features for Scene Text Recognition [PDF] 返回目录
Yizhi Wang, Zhouhui Lian
Abstract: Scene text recognition (STR) has been extensively studied in the last few years. Many recently proposed methods are specially designed to accommodate the arbitrary shape, layout and orientation of scene texts, but ignore that various font (or writing) styles also pose severe challenges to STR. These methods, in which font features and content features of characters are entangled, perform poorly in text recognition on scene images with texts in novel font styles. To address this problem, we explore font-independent features of scene texts via attentional generation of glyphs in a large number of font styles. Specifically, we introduce trainable font embeddings to shape the font styles of generated glyphs, with the image feature of the scene text representing only its essential patterns. The generation process is directed by the spatial attention mechanism, which effectively copes with irregular texts and generates higher-quality glyphs than existing image-to-image translation methods. Experiments conducted on several STR benchmarks demonstrate the superiority of our method compared to the state of the art.
摘要:场景文本识别(STR)已在过去几年中得到了广泛的研究。许多新近提出的方法是专门设计,以适应现场文本的任意形状,布局和方向,而忽略了不同的字体(或写)风格也对STR严峻的挑战。这些方法,在字体的功能和角色的内容特征纠结,在文本识别场景的图像表现不佳,在新的字体样式的文本。为了解决这个问题,我们在大量的字体样式通过注意力代字形的探索场景文本的字体无关的特性。具体来说,我们介绍了训练的字体的嵌入塑造字体样式产生字形,仅代表其基本模式场景文本的图像特征。的生成过程是由空间注意机构,其有效地与不规则文本科佩斯和比现有的图像到图像翻译方法生成更高质量的字形定向。在几个STR基准所进行的实验证明了我们方法的优越性与现有技术相比的状态。
35. Weakly-Supervised Online Hashing [PDF] 返回目录
Yu-Wei Zhan, Xin Luo, Yu Sun, Yongxin Wang, Zhen-Duo Chen, Xin-Shun Xu
Abstract: With the rapid development of social websites, recent years have witnessed an explosive growth of social images with user-provided tags which continuously arrive in a streaming fashion. Due to the fast query speed and low storage cost, hashing-based methods for image search have attracted increasing attention. However, existing hashing methods for social image retrieval are based on batch mode, which violates the nature of social images, i.e., social images are usually generated periodically or collected in a stream fashion. Although there exist many online image hashing methods, they either adopt unsupervised learning, which ignores the relevant tags, or are designed in a supervised manner, which needs high-quality labels. In this paper, to overcome the above limitations, we propose a new method named Weakly-supervised Online Hashing (WOH). In order to learn high-quality hash codes, WOH exploits the weak supervision by considering the semantics of tags and removing the noise. Besides, we develop a discrete online optimization algorithm for WOH, which is efficient and scalable. Extensive experiments conducted on two real-world datasets demonstrate the superiority of WOH compared with several state-of-the-art hashing baselines.
摘要:随着社交网站的迅速发展,近年来目睹的社会形象与它的流媒体方式连续到达用户提供标签的爆炸性增长。由于快速的查询速度和存储成本较低,对于图像搜索基于散列的方法已经吸引了越来越多的关注。然而,对于社会图像检索现有散列方法是基于违反社会的图像,即,社会图像通常周期性地产生或以流方式收集的性质批处理模式。尽管存在许多在线图像哈希方法,它们要么采用无监督学习它忽略了相关的标签,或者被设计在需要高品质的标签,监督的方式。在本文中,克服了上述限制,我们提出了一个名为新方法弱监督在线散列(WOH)。为了了解高品质的散列码,WOH利用考虑标签的语义和去除噪声的监管不力。此外,我们开发的WOH离散在线优化算法,这是高效和可扩展。两个真实世界的数据集进行了广泛的实验表明,与国家的最先进的几种散列基线相比WOH的优越性。
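For readers unfamiliar with hashing-based retrieval, the sketch below shows the generic core that methods such as WOH build on: projecting features to binary codes and ranking by Hamming distance. The random projection W stands in for the learned one; WOH's weak supervision and discrete online optimization are not reproduced here.

import numpy as np

def hash_codes(X, W):
    # Binary codes in {-1, +1} from a linear projection (sign of X @ W).
    return np.sign(X @ W)

def hamming_rank(query_code, db_codes):
    # For +/-1 codes, Hamming distance = (bits - dot product) / 2.
    dists = 0.5 * (db_codes.shape[1] - db_codes @ query_code)
    return np.argsort(dists)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 64))   # 100 image features, 64-D
W = rng.normal(size=(64, 32))    # random stand-in for the learned projection
codes = hash_codes(X, W)
print(hamming_rank(codes[0], codes)[:5])  # the query itself ranks first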
36. A New Approach for Texture based Script Identification At Block Level using Quad Tree Decomposition [PDF] 返回目录
Pawan Kumar Singh, Supratim Das, Ram Sarkar, Mita Nasipuri
Abstract: A considerable amount of success has been achieved in developing monolingual OCR systems for Indic scripts. But in a country like India, where a multi-script scenario is prevalent, identifying scripts beforehand becomes obligatory. In this paper, we present the significance of Gabor wavelet filters in extracting directional energy and entropy distributions for 11 official handwritten scripts, namely Bangla, Devanagari, Gujarati, Gurumukhi, Kannada, Malayalam, Oriya, Tamil, Telugu, Urdu and Roman. The experimentation is conducted at block level based on a quad-tree decomposition approach and evaluated using six different well-known classifiers. Finally, the best identification accuracy of 96.86% is achieved by a Multi-Layer Perceptron (MLP) classifier with 3-fold cross validation at level-2 decomposition. The results serve to establish the efficacy of the present approach to the classification of handwritten Indic scripts.
摘要:成功的一个相当大的数量已经在开发印度文脚本单一语言OCR系统实现。但在像印度,在那里多脚本的情况是普遍存在的国家,识别脚本事先变成强制性的。在本文中,我们提出的Gabor的意义在11个官方手写脚本,即孟加拉,梵语,古吉拉特语,Gurumukhi,卡纳达语,马拉雅拉姆语,奥里雅语,泰米尔语,泰卢固语,乌尔都语和罗马提取定向能量和熵分布小波滤波器。该实验是在基于四叉树分解方法和评价使用六个不同的公知的分类块水平进行的。最后,96.86%的最佳识别精度已经由多层感知器(MLP)分类为3倍交叉验证在2级分解来实现的。结果用于建立本方法的有效性,以手写的印度语脚本的分类
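A minimal sketch of the block-level quad-tree decomposition used above: at level 2 a block is split into 4 quadrants, each of which is split again, giving 16 sub-blocks, and one feature is computed per sub-block. Mean intensity is used here only as a stand-in for the paper's Gabor energy and entropy features.

import numpy as np

def quadrants(img):
    # Split a block into its four equal quadrants (even dims assumed).
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]

def quadtree_features(img, level=2):
    # Level-2 decomposition yields 4**2 = 16 sub-blocks, one feature each.
    blocks = [img]
    for _ in range(level):
        blocks = [q for b in blocks for q in quadrants(b)]
    return np.array([b.mean() for b in blocks])

print(quadtree_features(np.arange(64.0).reshape(8, 8)).shape)  # (16,)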
37. Handwritten Script Identification from Text Lines [PDF] 返回目录
Pawan Kumar Singh, Iman Chatterjee, Ram Sarkar, Mita Nasipuri
Abstract: In a multilingual country like India, where 12 different official scripts are in use, automatic identification of handwritten script facilitates many important applications such as automatic transcription of multilingual documents, searching for documents on the web/digital archives containing a particular script, and the selection of a script-specific Optical Character Recognition (OCR) system in a multilingual environment. In this paper, we propose a robust method for identifying scripts from handwritten documents at the text-line level. The recognition is based upon features extracted using the Chain Code Histogram (CCH) and Discrete Fourier Transform (DFT). The proposed method is evaluated on 800 handwritten text lines written in seven Indic scripts, namely Gujarati, Kannada, Malayalam, Oriya, Tamil, Telugu and Urdu, along with the Roman script, and yields an average identification rate of 95.14% using a Support Vector Machine (SVM) classifier.
摘要:在像印度这样一个多语言的国家,其中12种不同的官方文字都在使用,手写脚本的自动识别有助于许多重要的应用,如多语言文档的自动转录,搜索包含一个特定的脚本上的Web /数字档案文件,为脚本特定光学字符识别(OCR)系统在多语言环境的选择。在本文中,我们提出了对在文本行级别的标识从手写文件脚本可靠的方法。所述识别是基于使用链码直方图(CCH)和离散傅里叶变换(DFT)中提取的特征。该方法是实验上写在七个印度语脚本即,古吉拉特语,卡纳达语,马拉雅拉姆语,奥里雅语,泰米尔语,泰卢固语,乌尔都语以及罗马脚本800个手写文本行,并取得了使用支持向量机(SVM)的95.14%的平均识别率分类。
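As a sketch of the Chain Code Histogram feature mentioned above: consecutive boundary pixels of a character contour are encoded with Freeman 8-direction codes and histogrammed. The direction convention below (0 = east, counted counter-clockwise in image coordinates) is an assumption; the abstract does not fix one.

import numpy as np

# Freeman 8-direction codes for a step between adjacent boundary pixels
# (assumed convention: 0 = east, counter-clockwise in image coordinates).
DIRECTIONS = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
              (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def chain_code_histogram(contour):
    # contour: (N, 2) array of ordered (x, y) boundary pixels, 8-connected
    # and without repeated points, so consecutive deltas lie in {-1, 0, 1}.
    codes = [DIRECTIONS[tuple(np.sign(q - p))]
             for p, q in zip(contour[:-1], contour[1:])]
    hist = np.bincount(codes, minlength=8).astype(float)
    return hist / max(hist.sum(), 1.0)  # normalised 8-bin histogram

square = np.array([(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)])
print(chain_code_histogram(square))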
38. Multi-Label Activity Recognition using Activity-specific Features [PDF] 返回目录
Yanyi Zhang, Xinyu Li, Ivan Marsic
Abstract: We introduce an approach to multi-label activity recognition that extracts independent feature descriptors for each activity. Our approach first extracts a set of independent feature snippets, focused on different spatio-temporal regions of a video, that we call "observations". We then generate independent feature descriptors for each activity, which we call "activity-specific features", by combining these observations with attention, and make action predictions based on these activity-specific features. This structure can be trained end-to-end and plugged into any existing network structure for video classification. Our method outperformed state-of-the-art approaches on three multi-label activity recognition datasets. We also evaluated the method on two single-activity recognition datasets, achieving state-of-the-art performance and showing the generalizability of our approach. Furthermore, to better understand the activity-specific features that the system generates, we visualize them on the Charades dataset.
摘要:通过对每个活动独立提取特征描述符介绍的方法,以多品牌行为识别。我们的方法首先提取一组独立的特征片段的,专注于视频的不同时空的区域,我们称之为“观察”。然后生成独立的特征描述符的每个活动,我们结合与关注,并进一步制定行动预测这些观察基于这些特定活动的功能称之为“特定活动的特点”。这种结构可以被训练端至端并插入到用于视频分类任何现有的网络结构。我们跑赢国家的最先进的方法,在三个多品牌行为识别的数据集的方法。我们也评估方法和两个单行为识别的数据集来实现国家的最先进的性能,以显示我们的做法的普遍性。此外,为了更好地理解活动的具体特点,该系统产生,我们可视化的数据集中哑谜这些活动的具体特点。
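One generic way to realize per-activity descriptors from a shared set of observations is a learned attention query per class, as sketched below. The single-query design and all layer sizes are illustrative assumptions, not the paper's architecture.

import torch
import torch.nn as nn

class ActivitySpecificPool(nn.Module):
    def __init__(self, feat_dim, num_classes):
        super().__init__()
        # One learned attention query per activity (assumed design).
        self.queries = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.head = nn.Linear(feat_dim, 1)  # per-activity score

    def forward(self, obs):  # obs: (batch, num_observations, feat_dim)
        att = torch.einsum('cd,bnd->bcn', self.queries, obs).softmax(dim=-1)
        per_class = torch.einsum('bcn,bnd->bcd', att, obs)  # (B, C, D)
        return self.head(per_class).squeeze(-1)             # (B, C) logits

pool = ActivitySpecificPool(feat_dim=128, num_classes=17)
logits = pool(torch.randn(2, 10, 128))  # 2 clips, 10 observations each
print(logits.shape)                     # torch.Size([2, 17])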
39. DAER to Reject Seeds with Dual-loss Additional Error Regression [PDF] 返回目录
Stephan J. Lemmer, Jason J. Corso
Abstract: Many vision tasks require side information at inference time---a seed---to fully specify the problem. For example, an initial object segmentation is needed for video object segmentation. To date, all such work makes the tacit assumption that the seed is a good one. However, in practice, from crowd-sourcing to noisy automated seeds, this is not the case. We hence propose the novel problem of seed rejection---determining whether to reject a seed based on expected degradation relative to the gold-standard. We provide a formal definition to this problem, and focus on two challenges: distinguishing poor primary inputs from poor seeds and understanding the model's response to noisy seeds conditioned on the primary input. With these challenges in mind, we propose a novel training method and evaluation metrics for the seed rejection problem. We then validate these metrics and methods on two problems which use seeds as a source of additional information: keypoint-conditioned viewpoint estimation with crowdsourced seeds and hierarchical scene classification with automated seeds. In these experiments, we show our method reduces the required number of seeds that need to be reviewed for a target performance by up to 23% over strong baselines.
摘要:许多视觉任务都需要在推理时间---种子侧信息---充分说明这个问题。例如,需要用于视频对象分割的初始对象分割。迄今为止,所有这些工作,使隐性假设种子是一个很好的一个。然而,在实践中,从众包嘈杂自动化的种子,这不是这种情况。我们因此提出种子排斥的新颖问题---确定是否基于相对于所述金标准预期降解上拒绝的种子。我们提供两个方面的挑战正式定义了这个问题,并重点:来自贫困种子区分穷人主要输入和理解模型的嘈杂种子空调上的主输入的响应。考虑到这些挑战,我们提出了种子排斥问题的新的训练方法和评价指标。然后,我们验证在其上使用种子作为附加信息的源两个问题这些指标和方法:用众包种子和用种子自动分层场景分类关键点空调视点估计。在这些实验中,我们证明我们的方法减少了所需数量需要为超过基线强高达23%的目标业绩进行审查种子。
40. EfficientNet-eLite: Extremely Lightweight and Efficient CNN Models for Edge Devices by Network Candidate Search [PDF] 返回目录
Ching-Chen Wang, Ching-Te Chiu, Jheng-Yi Chang
Abstract: Embedding a Convolutional Neural Network (CNN) into edge devices for inference is a very challenging task because such lightweight hardware was not built to handle this heavyweight software, which carries the common overhead of modern state-of-the-art CNN models. In this paper, aiming to reduce this overhead while sacrificing as little accuracy as possible, we propose a novel Network Candidate Search (NCS), an alternative way to study the trade-off between resource usage and performance through grouping concepts and an elimination tournament. Besides, NCS can also be generalized to any neural network. In our experiments, we collect candidate CNN models from EfficientNet-B0, scaled down in varied ways through width, depth, input resolution and compound scaling down, and apply NCS to study the scaling-down trade-off. Meanwhile, a family of extremely lightweight EfficientNets is obtained, called EfficientNet-eLite. To further embrace CNN edge applications with Application-Specific Integrated Circuits (ASICs), we adjust the architectures of EfficientNet-eLite to build a more hardware-friendly version, EfficientNet-HF. Evaluated on the ImageNet dataset, both the proposed EfficientNet-eLite and EfficientNet-HF present better parameter usage and accuracy than previous state-of-the-art CNNs. In particular, the smallest member of EfficientNet-eLite is more lightweight than the best and smallest existing MnasNet, with 1.46x fewer parameters and 0.56% higher accuracy. Code is available at this https URL
摘要:嵌入卷积神经网络(CNN)将用于推断边缘设备是一个非常具有挑战性的任务,因为这样轻巧的硬件并不是生来要处理这个重量级的软件,这是与现代国家的最先进的CNN模型的共同开销。在本文中,降低与交易的准确性尽可能小的开销目标,我们提出了网络搜寻求职者的小说(NCS),另一种方法来研究权衡的资源使用情况,并通过分组概念的性能之间淘汰赛。此外,NCS还可以在任何神经网络的推广。在我们的实验中,我们收集EfficientNet-B0候选CNN模型在不同的方式,通过宽度,深度,输入分辨率和复合缩小,将NCS研究降比的权衡被缩小。同时,可以得到极其轻便EfficientNet的家庭,被称为EfficientNet精英。为了进一步拥抱与特殊应用集成电路(ASIC)的CNN边缘应用,我们调整EfficientNet精英的架构来打造更多的硬件的版本,EfficientNet-HF。评价ImageNet数据集,既提出EfficientNet精英和EfficientNet-HF现在更好参数的用法和准确性比以前启动的最先进的细胞神经网络。特别地,EfficientNet精英的最小的成员是更加轻便比最好的和最小的现有MnasNet与1.46x更少的参数和较高的0.56%的精度。代码可在此HTTPS URL
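The width/depth/resolution and compound scaling mentioned above follows the EfficientNet recipe, in which the three dimensions grow as alpha**phi, beta**phi and gamma**phi; scaling a model down, as for eLite-style candidates, corresponds to a negative phi. A sketch using the published EfficientNet base coefficients (the concrete base dimensions below are illustrative):

# EfficientNet compound scaling: depth, width and resolution grow as
# alpha**phi, beta**phi, gamma**phi. The coefficients below are the
# published EfficientNet ones; phi < 0 scales a model *down*.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def scaled_dims(base_depth, base_width_mult, base_res, phi):
    depth = max(1, round(base_depth * ALPHA ** phi))       # layers per stage
    width = base_width_mult * BETA ** phi                  # channel multiplier
    res = max(32, int(round(base_res * GAMMA ** phi)))     # input resolution
    return depth, width, res

# A hypothetical configuration two compound steps below a B0-like base:
print(scaled_dims(base_depth=16, base_width_mult=1.0, base_res=224, phi=-2))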
41. Creation and Validation of a Chest X-Ray Dataset with Eye-tracking and Report Dictation for AI Tool Development [PDF] 返回目录
Alexandros Karargyris, Satyananda Kashyap, Ismini Lourentzou, Joy Wu, Arjun Sharma, Matthew Tong, Shafiq Abedin, David Beymer, Vandana Mukherjee, Elizabeth A Krupinski, Mehdi Moradi
Abstract: We developed a rich dataset of Chest X-Ray (CXR) images to assist investigators in artificial intelligence. The data were collected using an eye-tracking system while a radiologist reviewed and reported on 1,083 CXR images. The dataset contains the following aligned data: CXR image, transcribed radiology report text, radiologist's dictation audio, and eye gaze coordinate data. We hope this dataset can contribute to various areas of research, particularly explainable and multimodal deep learning / machine learning methods. Furthermore, investigators in disease classification and localization, automated radiology report generation, and human-machine interaction can benefit from these data. We report deep learning experiments that utilize the attention maps produced by the eye gaze dataset to show the potential utility of this data.
摘要:我们开发了胸部X光(CXR)图像的丰富数据集,以协助调查人工智能。使用眼动追踪系统,同时放射科医师审核和报告1083个CXR图像的数据收集。该数据集包含以下调整后的数据:CXR图像,转录影像报告文本,放射科医生的口述音频和眼睛注视坐标数据。我们希望这个数据集可以有助于研究各领域特别是对可以解释和多深的学习/机器学习方法。此外,在疾病分类和定位,自动放射学报告生成和人机交互研究者可以受益于这些数据。我们报告的深度学习利用注意通过映射产生的眼睛注视的数据集,以显示这个数据的潜在效用试验。
42. BOP Challenge 2020 on 6D Object Localization [PDF] 返回目录
Tomas Hodan, Martin Sundermeyer, Bertram Drost, Yann Labbe, Eric Brachmann, Frank Michel, Carsten Rother, Jiri Matas
Abstract: This paper presents the evaluation methodology, datasets, and results of the BOP Challenge 2020, the third in a series of public competitions organized with the goal to capture the status quo in the field of 6D object pose estimation from an RGB-D image. In 2020, to reduce the domain gap between synthetic training and real test RGB images, the participants were provided with 350K photorealistic training images generated by BlenderProc4BOP, a new open-source and light-weight physically-based renderer (PBR) and procedural data generator. Methods based on deep neural networks have finally caught up with methods based on point pair features, which dominated previous editions of the challenge. Although the top-performing methods rely on RGB-D image channels, strong results were achieved when only RGB channels were used at both training and test time -- out of the 26 evaluated methods, the third was trained on RGB channels of PBR and real images, while the fifth was trained on PBR images only. Strong data augmentation was identified as a key component of the top-performing CosyPose method, and the photorealism of PBR images proved effective despite the augmentation. The online evaluation system stays open and is available at the project website: this http URL.
摘要:本文介绍了评价方法,数据集,以及BOP挑战2020年结果,第三个在一系列与目标组织的公开竞争,以从RGB-d图像捕捉6D对象姿态估计领域的现状。在2020年,为了减少合成训练和实际测试RGB图像之间的域间隙,参加者提供了通过BlenderProc4BOP生成350K真实感trainining图像,〜新的开源和重量轻的基于物理的渲染器(PBR)和程序数据发生器。基于深层神经网络的方法,终于赶上了基于点对功能,这是主导的挑战以前版本的方法。虽然顶执行方法依赖于RGB-d图像通道,强结果当只有RGB通道在训练和测试时间被用来实现 - 26种评价方法进行,第三种方法被训练在PBR和实际的RGB通道图像,而第五是只PBR图像训练。强数据扩张被确定为顶部执行CosyPose方法的一个关键组成部分,和PBR的图像的照片写实证明尽管增强有效。在线评估系统保持打开状态,并可在项目网站:这个HTTP URL。
43. Comparison of Spatiotemporal Networks for Learning Video Related Tasks [PDF] 返回目录
Logan Courtney, Ramavarapu Sreenivas
Abstract: Many methods for learning from video sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures. The focus typically remains on how to incorporate the temporal processing within an already stable spatial architecture. This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation. Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs. An empirical analysis indicates a complex, interdependent relationship between the spatial and temporal dimensions with design choices having a large impact on a network's ability to learn the appropriate spatiotemporal features.
摘要:用于从视频序列学习许多方法涉及在时间上处理2D CNN从各个帧或直接利用高性能的2D CNN架构内三维卷积功能。焦点通常保持关于如何已经稳定的空间架构内结合的时间的处理。这项工作与构建参数控制的常见视频相关的任务相关方面基于MNIST的视频数据集:分类,排序和速度估计。训练有素的这个数据集模型显示在根据任务和他们使用的二维卷积,3D盘旋,或卷积LSTMs的主要方法是不同的。实证分析表明,与具有对网络的学习适当的时空特征的能力有很大的影响的设计选择的空间和时间维度之间的一个复杂的,相互依存的关系。
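To make the compared design choices concrete, the sketch below gives two minimal baselines of the kinds studied: per-frame 2D convolutions aggregated by an LSTM, and a 3D convolutional network that treats time as a third spatial axis. All shapes and layer sizes are illustrative assumptions, not the paper's models.

import torch
import torch.nn as nn

class Conv2DPlusLSTM(nn.Module):
    # (a) Per-frame 2D CNN features, aggregated over time by an LSTM.
    def __init__(self, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(16, 32, batch_first=True)
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):             # x: (B, T, 1, H, W)
        b, t = x.shape[:2]
        feats = self.cnn(x.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # classify from the last time step

# (b) A 3D CNN treats time as a third spatial axis; expects (B, 1, T, H, W).
conv3d = nn.Sequential(nn.Conv3d(1, 16, 3, padding=1), nn.ReLU(),
                       nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, 10))

x = torch.randn(2, 8, 1, 28, 28)        # 2 clips of 8 frames each
print(Conv2DPlusLSTM()(x).shape)        # torch.Size([2, 10])
print(conv3d(x.transpose(1, 2)).shape)  # torch.Size([2, 10])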
44. Semantically Sensible Video Captioning (SSVC) [PDF] 返回目录
Md. Mushfiqur Rahman, Thasin Abedin, Khondokar S. S. Prottoy, Ayana Moshruba, Fazlul Hasan Siddiqui
Abstract: Video captioning, i.e. the task of generating captions from video sequences, creates a bridge between the Natural Language Processing and Computer Vision domains of computer science. Generating a semantically accurate description of a video is an arduous task. Considering the complexity of the problem, the results obtained in recent research are quite outstanding. But still there is plenty of scope for improvement. This paper addresses this scope and proposes a novel solution. Most video captioning models consist of two sequential/recurrent layers - one as a video-to-context encoder and the other as a context-to-caption decoder. This paper proposes a novel architecture, SSVC (Semantically Sensible Video Captioning), which modifies the context generation mechanism by using two novel approaches - "stacked attention" and "spatial hard pull". For evaluating the proposed architecture, along with the BLEU scoring metric for quantitative analysis, we have used a human evaluation metric for qualitative analysis, referred to as the Semantic Sensibility (SS) scoring metric. The SS score overcomes the shortcomings of common automated scoring metrics. This paper reports that the use of the aforementioned novelties improves the performance of state-of-the-art architectures.
摘要:视频字幕,即从视频序列生成字幕的任务创建计算机科学的自然语言处理和计算机视觉领域之间的桥梁。生成视频的语义准确的描述是一项艰巨的任务。考虑到问题的复杂性,在最近的研究中获得的结果是相当出色的。但仍有很多改进的余地。本文针对本范围,提出了一种新颖的解决方案。大多数视频字幕模型包括两个连续的/复发层 - 酮,为视频到上下文编码器,而另一个作为上下文到字幕解码器。本文提出了一种新颖的体系结构,SSVC(语义上明智视频字幕),其修饰通过使用两个新颖的上下文生成机构接近 - “堆叠注意”和“空间硬拉”。为了评估所提出的架构,与BLEU打分指标进行定量分析以来,我们已经使用了一个人的评价指标进行定性分析。本文指的是此人提出评价指标与语义感性(SS)得分度量。 SS得分克服常见自动评分指标的缺点。本文报道的是,使用上述新奇提高国家的最先进的架构的性能。
45. Detecting Cross-Modal Inconsistency to Defend Against Neural Fake News [PDF] 返回目录
Reuben Tan, Kate Saenko, Bryan A. Plummer
Abstract: Large-scale dissemination of disinformation online intended to mislead or deceive the general population is a major societal problem. Rapid progression in image, video, and natural language generative models has only exacerbated this situation and intensified our need for an effective defense mechanism. While existing approaches have been proposed to defend against neural fake news, they are generally constrained to the very limited setting where articles only have text and metadata such as the title and authors. In this paper, we introduce the more realistic and challenging task of defending against machine-generated news that also includes images and captions. To identify the possible weaknesses that adversaries can exploit, we create a NeuralNews dataset composed of 4 different types of generated articles as well as conduct a series of human user study experiments based on this dataset. In addition to the valuable insights gleaned from our user study experiments, we provide a relatively effective approach based on detecting visual-semantic inconsistencies, which will serve as an effective first line of defense and a useful reference for future work in defending against machine-generated disinformation.
摘要:网上旨在误导或欺骗普通人群造谣大规模传播是一个重大的社会问题。快速发展中的图片,视频和自然语言生成模型只是加剧了这一情况,并加强了我们一个有效的防御机制的需要。虽然现有的方法被提出来抵御神经假新闻,他们一般都限制在很有限的环境,让文章只能有文本和元数据,如标题和作者。在本文中,我们介绍防范机器产生的消息,还包括图像和字幕的更现实和具有挑战性的任务。要识别可能存在的弱点是对手可以利用,我们创建了4种不同类型的文章产生的组成以及进行一系列的基于此数据集人类用户研究实验NeuralNews数据集。除了从我们的用户研究实验收集到的有价值的见解,我们会根据检测视觉语义矛盾比较有效的方法,这将作为防御的第一有效行和防御为今后的工作提供有益的参考机器生成造谣。
46. Multi-Sensor Data Fusion for Cloud Removal in Global and All-Season Sentinel-2 Imagery [PDF] 返回目录
Patrick Ebel, Andrea Meraner, Michael Schmitt, Xiaoxiang Zhu
Abstract: This work has been accepted by IEEE TGRS for publication. The majority of optical observations acquired via spaceborne earth imagery are affected by clouds. While there is much prior work on reconstructing cloud-covered information, previous studies are oftentimes confined to narrowly-defined regions of interest, raising the question of whether an approach can generalize to a diverse set of observations acquired at variable cloud coverage or in different regions and seasons. We target the challenge of generalization by curating a large novel data set for training new cloud removal approaches, and evaluate on two recently proposed performance metrics of image quality and diversity. Our data set is the first publicly available to contain a global sample of co-registered radar and optical observations, cloudy as well as cloud-free. Based on the observation that cloud coverage varies widely between clear skies and absolute coverage, we propose a novel model that can deal with either extreme, and evaluate its performance on our proposed data set. Finally, we demonstrate the superiority of training models on real over synthetic data, underlining the need for a carefully curated data set of real observations. To facilitate future research, our data set is made available online.
摘要:这项工作已经由IEEE,TGR5出版被接受。大部分经由星载地球图像获取的光学观测由云的影响。虽然对重建云覆盖的信息大量的前期工作,以往的研究往往局限于关注狭义的区域,提高的做法是否可以推广到一组不同的变量云层覆盖收购意见或不同地区的问题和季节。我们的目标通过策划大量新的数据集用于训练新的云去除泛化的挑战,方法和对图像质量和多样性两个最近提出的性能指标来评估。我们的数据集是第一个公开可用的包含共同注册的雷达和光学观测,阴天的全球样本以及无云。基于云覆盖晴朗的天空和绝对的覆盖面差异很大的观察,我们建议可以应对任何极端的一种新的模型,并评估其对我们提出的数据集性能。最后,我们表现出对真正的训练模式在合成数据,强调了一个精心策划的数据集实时观测需要的优越性。为了方便今后的研究,我们的数据集,在网上提供
47. Contrastive Cross-site Learning with Redesigned Net for COVID-19 CT Classification [PDF] 返回目录
Zhao Wang, Quande Liu, Qi Dou
Abstract: The pandemic of coronavirus disease 2019 (COVID-19) has led to a global public health crisis spreading across hundreds of countries. With the continuous growth of new infections, developing automated tools for COVID-19 identification with CT images is highly desired to assist clinical diagnosis and reduce the tedious workload of image interpretation. To enlarge the datasets for developing machine learning methods, it is helpful to aggregate cases from different medical systems so as to learn robust and generalizable models. This paper proposes a novel joint learning framework to perform accurate COVID-19 identification by effectively learning from heterogeneous datasets with distribution discrepancy. We build a powerful backbone by redesigning the recently proposed COVID-Net in terms of network architecture and learning strategy to improve prediction accuracy and learning efficiency. On top of our improved backbone, we further explicitly tackle the cross-site domain shift by conducting separate feature normalization in latent space. Moreover, we propose to use a contrastive training objective to enhance the domain invariance of semantic embeddings and thereby boost the classification performance on each dataset. We develop and evaluate our method on two public large-scale COVID-19 diagnosis datasets made up of CT images. Extensive experiments show that our approach consistently improves the performance on both datasets, outperforming the original COVID-Net trained on each dataset by 12.16% and 14.23% in AUC respectively, and also exceeding existing state-of-the-art multi-site learning methods.
摘要:流行性疾病冠状2019(COVID-19)已经导致全球性的公共健康危机蔓延上百个国家。随着新的感染,开发自动化工具COVID-19识别与CT图像的持续增长是高度期望的,以协助临床诊断和减少图像判读的繁琐工作负荷。为了扩大开发的机器学习方法的数据集,它是聚集来自不同的医疗系统的情况下,学习强大和普及机型基本上是有帮助的。本文提出了一种新颖的关节学习框架由与分布差异异构数据集有效地学习执行精确COVID-19识别。我们建立在网络架构方面的重新设计最近提出COVID-Net和学习策略,以提高预测精度和学习效率的有力支柱。在我们改进骨干之上,我们进一步明确解决通过进行潜在空间单独的功能正常化的跨站点域转变。此外,我们建议使用对比培养目标,以增强语义的嵌入域不变性提升每个数据集的分类性能。我们开发并评估我们与CT图像由两个公共的大型COVID-19的诊断数据集的方法。大量的实验表明,该方法可以始终如一提高了两个数据集的演出,超过原始COVID-Net的每个数据集训练的12.16%,并分别在AUC 14.23%,也超过国家的最先进的现有的多点学习方法。
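The contrastive training objective referred to above is typically an InfoNCE-style loss over normalized embeddings, pulling matched pairs together and pushing all other pairs apart, which is one common way to encourage domain-invariant features. A generic sketch follows; the paper's exact pairing scheme and temperature are not specified here, so treat both as assumptions.

import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    # z1, z2: (batch, dim) embeddings of two views of the same samples;
    # matching rows are positives, all other rows serve as negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)       # positives on the diagonal

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())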
48. Deep Learning in Photoacoustic Tomography: Current approaches and future directions [PDF] 返回目录
Andreas Hauptmann, Ben Cox
Abstract: Biomedical photoacoustic tomography, which can provide high resolution 3D soft tissue images based on the optical absorption, has advanced to the stage at which translation from the laboratory to clinical settings is becoming possible. The need for rapid image formation and the practical restrictions on data acquisition that arise from the constraints of a clinical workflow are presenting new image reconstruction challenges. There are many classical approaches to image reconstruction, but ameliorating the effects of incomplete or imperfect data through the incorporation of accurate priors is challenging and leads to slow algorithms. Recently, the application of Deep Learning, or deep neural networks, to this problem has received a great deal of attention. This paper reviews the literature on learned image reconstruction, summarising the current trends, and explains how these new approaches fit within, and to some extent have arisen from, a framework that encompasses classical reconstruction methods. In particular, it shows how these new techniques can be understood from a Bayesian perspective, providing useful insights. The paper also provides a concise tutorial demonstration of three prototypical approaches to learned image reconstruction. The code and data sets for these demonstrations are available to researchers. It is anticipated that it is in in vivo applications - where data may be sparse, fast imaging critical and priors difficult to construct by hand - that Deep Learning will have the most impact. With this in mind, the paper concludes with some indications of possible future research directions.
摘要:生物医学光声断层扫描,它可以基于光学吸收提供高分辨率的3D软组织图像,拥有先进的以哪种翻译从实验室到临床的设置正成为可能的阶段。从临床工作流程的约束产生的需要快速图像形成和数据采集的实际限制是呈现新的图像重建的挑战。有许多经典方法的图像重建,但改善通过精确的先验的掺入不完整或不完整的数据的影响,是具有挑战性的,并导致缓慢的算法。近日,深学习,或深层神经网络的应用,这个问题已经收到了极大关注。本文回顾了上了解到图像重建的文献,总结了目前的趋势,并解释这些新方法如何适应内,并在一定程度上已经从,涵盖经典的重建方法的框架出现。特别是,它显示了如何将这些新技术可以从贝叶斯的角度理解,提供了有益的启示。本文还提供了三种典型的方法来学图像重建的简明教程演示。代码和数据集的这些示威活动是提供给研究人员。可以预见,它在体内应用 - 其中的数据可能是稀疏的,快速成像关键和先验很难用手来构建 - 深学习将有最大的影响。鉴于此,本文以可能的未来的研究方向一些迹象的结论。
49. RCNN for Region of Interest Detection in Whole Slide Images [PDF] 返回目录
A Nugaliyadde, Kok Wai Wong, Jeremy Parry, Ferdous Sohel, Hamid Laga, Upeka V. Somaratne, Chris Yeomans, Orchid Foster
Abstract: Digital pathology has attracted significant attention in recent years. Analysis of Whole Slide Images (WSIs) is challenging because they are very large, i.e., of gigapixel resolution. Identifying Regions of Interest (ROIs) is the first step for pathologists to analyse further the regions of diagnostic interest for cancer detection and other anomalies. In this paper, we investigate the use of RCNN, a deep machine learning technique, for detecting such ROIs using only a small number of labelled WSIs for training. For experimentation, we used real WSIs from a public hospital pathology service in Western Australia. We used 60 WSIs for training the RCNN model and another 12 WSIs for testing. The model was further tested on a new set of unseen WSIs. The results show that RCNN can be effectively used for ROI detection from WSIs.
摘要:数字病理吸引显著重视,近年来。整个幻灯片图像(WSIS)的分析是具有挑战性的,因为他们是非常大的,即,千兆像素分辨率。识别利息(投资回报)的区域是用于病理学家以分析癌症检测和其它异常诊断感兴趣的进一步的区域中的第一个步骤。在本文中,我们调查使用RCNN的,这是一个深机器学习技术,用于检测这种的ROI仅使用少量的用于训练标记峰会。对于实验,我们用真正的信息社会世界峰会在西澳公立医院病理服务。我们用60个WSIS用于训练RCNN模型和另外12次峰会进行测试。该模型是一套新的信息社会世界峰会看不见的进一步测试。结果表明,RCNN可以有效地用于从信息社会世界峰会ROI检测。
50. m-arcsinh: An Efficient and Reliable Function for SVM and MLP in scikit-learn [PDF] 返回目录
Luca Parisi
Abstract: This paper describes the 'm-arcsinh', a modified ('m-') version of the inverse hyperbolic sine function ('arcsinh'). Kernel and activation functions enable Machine Learning (ML)-based algorithms, such as Support Vector Machine (SVM) and Multi-Layer Perceptron (MLP), to learn from data in a supervised manner. m-arcsinh, implemented in the open source Python library 'scikit-learn', is hereby presented as an efficient and reliable kernel and activation function for SVM and MLP respectively. Improvements in reliability and speed of convergence in classification tasks on fifteen (N = 15) datasets available from scikit-learn and the University of California, Irvine (UCI) Machine Learning repository are discussed. Experimental results demonstrate the overall competitive classification performance of both SVM and MLP achieved via the proposed function. This function is compared to gold standard kernel and activation functions, demonstrating its overall competitive reliability regardless of the complexity of the classification tasks involved.
摘要:本文描述了“M-arcsinh”,修饰的(“间”)的反双曲正弦函数(“arcsinh”)的版本。内核和激活功能使机器学习(ML)为基础的算法,诸如支持向量机(SVM)和多层感知器(MLP),从数据学习在监督方式。间 - arcsinh,在开源Python库“scikit学习”中实现,在此提出作为用于分别SVM和MLP的有效和可靠的内核和激活功能。改进可靠性和速度,以收敛在分类任务的十五(N = 15)的数据集可以从scikit学习和大学加州大学欧文(UCI)机器学习资源库进行了讨论。实验结果表明,这两种SVM和MLP的整体竞争分类性能,通过所提出的功能来实现的。此功能相比,黄金标准内核和激活功能,无论所涉及的分类任务的复杂性的展示了其整体竞争可靠性。
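A sketch of plugging m-arcsinh into scikit-learn's SVC as a custom kernel via a feature-map inner product; SVC(kernel=callable) is standard scikit-learn API. The 1/3 and 1/4 scaling constants below follow our reading of the paper's definition and should be treated as an assumption.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.svm import SVC

def m_arcsinh(x):
    # arcsinh damped by a square-root term; the 1/3 and 1/4 constants
    # follow our reading of the paper's definition (an assumption here).
    return (np.arcsinh(x) / 3.0) * (np.sqrt(np.abs(x)) / 4.0)

def m_arcsinh_kernel(X, Y):
    # Gram matrix as an inner product of transformed features; any such
    # feature-map inner product is a valid positive semi-definite kernel.
    return m_arcsinh(X) @ m_arcsinh(Y).T

X, y = load_iris(return_X_y=True)
clf = SVC(kernel=m_arcsinh_kernel).fit(X, y)
print(clf.score(X, y))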
51. Deep Sinogram Completion with Image Prior for Metal Artifact Reduction in CT Images [PDF] 返回目录
Lequan Yu, Zhicheng Zhang, Xiaomeng Li, Lei Xing
Abstract: Computed tomography (CT) has been widely used for medical diagnosis, assessment, and therapy planning and guidance. In reality, CT images may be affected adversely in the presence of metallic objects, which could lead to severe metal artifacts and influence clinical diagnosis or dose calculation in radiation therapy. In this paper, we propose a generalizable framework for metal artifact reduction (MAR) by simultaneously leveraging the advantages of image domain and sinogram domain-based MAR techniques. We formulate our framework as a sinogram completion problem and train a neural network (SinoNet) to restore the metal-affected projections. To improve the continuity of the completed projections at the boundary of metal trace and thus alleviate new artifacts in the reconstructed CT images, we train another neural network (PriorNet) to generate a good prior image to guide sinogram learning, and further design a novel residual sinogram learning strategy to effectively utilize the prior image information for better sinogram completion. The two networks are jointly trained in an end-to-end fashion with a differentiable forward projection (FP) operation so that the prior image generation and deep sinogram completion procedures can benefit from each other. Finally, the artifact-reduced CT images are reconstructed using the filtered backward projection (FBP) from the completed sinogram. Extensive experiments on simulated and real artifacts data demonstrate that our method produces superior artifact-reduced results while preserving the anatomical structures and outperforms other MAR methods.
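A minimal sketch of the joint training idea follows, with stand-in modules: a fixed matrix plays the role of the differentiable forward projection (a Radon transform in the paper), and single-layer networks stand in for SinoNet and PriorNet. All shapes, names, and losses here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Stand-ins: the paper uses full CNNs; tiny linear layers keep the sketch runnable.
img_dim, sino_dim = 64, 90
A = torch.randn(sino_dim, img_dim)  # differentiable forward projection (stand-in for Radon)

prior_net = nn.Sequential(nn.Linear(sino_dim, img_dim))      # PriorNet: sinogram -> prior image
sino_net = nn.Sequential(nn.Linear(2 * sino_dim, sino_dim))  # SinoNet: completes masked sinogram

opt = torch.optim.Adam([*prior_net.parameters(), *sino_net.parameters()], lr=1e-3)

for step in range(100):
    img_gt = torch.randn(8, img_dim)                   # toy ground-truth images
    sino_gt = img_gt @ A.T                             # clean sinogram
    mask = (torch.rand_like(sino_gt) > 0.2).float()    # metal trace = masked-out bins
    sino_in = sino_gt * mask

    prior_img = prior_net(sino_in)                     # prior image from corrupted sinogram
    prior_sino = prior_img @ A.T                       # re-project the prior (differentiable FP)
    # Residual sinogram learning: predict a correction on top of the prior's projection.
    sino_out = prior_sino + sino_net(torch.cat([sino_in, prior_sino], dim=1))

    loss = ((sino_out - sino_gt) ** 2).mean() + ((prior_img - img_gt) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the forward projection is differentiable, the sinogram loss back-propagates into PriorNet as well, which is how the two stages can benefit from each other.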
52. PL-VINS: Real-Time Monocular Visual-Inertial SLAM with Point and Line [PDF] 返回目录
Qiang Fu, Jialong Wang, Hongshan Yu, Islam Ali, Feng Guo, Hong Zhang
Abstract: Leveraging line features to improve the location accuracy of point-based visual-inertial SLAM (VINS) is gaining importance, as lines provide additional constraints on scene structure regularity; however, real-time performance has not been a focus of existing line-based methods. This paper presents PL-VINS, a real-time optimization-based monocular VINS method with point and line features, developed on top of the state-of-the-art point-based VINS-Mono \cite{vins}. Current works use the LSD \cite{lsd} algorithm to extract lines; however, LSD is designed for scene shape representation rather than for the pose estimation problem, and its high computational cost makes it the bottleneck for real-time performance. In this work, a modified LSD algorithm is presented, obtained by studying hidden parameter tuning and a length rejection strategy. The modified LSD runs at least three times as fast as the original. Further, by representing a line landmark with Plücker coordinates, the line reprojection residual is modeled as a midpoint-to-line distance and then minimized by iteratively updating the minimal four-parameter orthonormal representation of the Plücker coordinates. Experiments on the public EuRoC benchmark dataset show that the location error of our method is 12-16\% lower than that of VINS-Mono at the same work frequency, on a low-power CPU @1.1 GHz without GPU parallelization. For the benefit of the community, we make the source code public: this https URL
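Written out under the usual pinhole convention, the midpoint-to-line residual the abstract describes takes roughly the following form; the exact parameterization in the paper (for instance whether both segment endpoints contribute) may differ, so treat this as a sketch.

```latex
% A 3D line in Pl\"ucker coordinates L = (n^\top, d^\top)^\top projects to an
% image line l = (l_1, l_2, l_3)^\top via the line projection matrix. For a
% detected segment with homogeneous endpoints s, e and midpoint m = (s + e)/2,
\[
  r(L) \;=\; \frac{l^{\top} m}{\sqrt{l_1^2 + l_2^2}},
\]
% i.e., the signed distance from the detected midpoint to the projected line,
% minimized over the four-parameter orthonormal representation of L.
```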
53. MSP: An FPGA-Specific Mixed-Scheme, Multi-Precision Deep Neural Network Quantization Framework [PDF] 返回目录
Sung-En Chang, Yanyu Li, Mengshu Sun, Weiwen Jiang, Runbin Shi, Xue Lin, Yanzhi Wang
Abstract: With the tremendous success of deep learning, there is an imminent need to deploy deep learning models onto edge devices. To tackle the limited computing and storage resources of edge devices, model compression techniques have been widely used to trim deep neural network (DNN) models for on-device inference execution. This paper targets the commonly used FPGA (field programmable gate array) devices as the hardware platforms for DNN edge computing. We focus on DNN quantization as the main model compression technique, since DNN quantization has been of great importance for the implementation of DNN models on hardware platforms. The novelty of this work is twofold: (i) We propose a mixed-scheme DNN quantization method that incorporates both linear and non-linear number systems for quantization, with the aim of boosting the utilization of the heterogeneous computing resources on an FPGA, i.e., LUTs (look-up tables) and DSPs (digital signal processors). Note that all the existing (single-scheme) quantization methods can only utilize one type of resource (either LUTs or DSPs) for the MAC (multiply-accumulate) operations in deep learning computations. (ii) We use a quantization method that supports multiple precisions along the intra-layer dimension, while existing quantization methods apply multi-precision quantization along the inter-layer dimension. The intra-layer multi-precision method can make the hardware configuration uniform across layers to reduce computation overhead, while preserving model accuracy as well as the inter-layer approach does.
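The two number systems the abstract contrasts can be illustrated generically: uniform fixed-point quantization keeps true multiplies (DSP-friendly), while power-of-two quantization reduces multiplies to bit shifts (LUT-friendly). The function names, bit-widths, and exponent ranges below are illustrative, not the paper's exact scheme.

```python
import numpy as np

def quantize_fixed_point(x, bits=4):
    # Linear scheme: symmetric uniform levels, multiplies map onto DSP slices.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

def quantize_pow2(x, bits=4):
    # Non-linear scheme: snap magnitudes to the nearest power of two, so a
    # multiply becomes a bit shift, implementable with LUTs and no DSP.
    sign = np.sign(x)
    mag = np.abs(x)
    exp = np.clip(np.round(np.log2(np.maximum(mag, 1e-12))),
                  -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return sign * (2.0 ** exp) * (mag > 0)

w = np.random.randn(8)
print(quantize_fixed_point(w))
print(quantize_pow2(w))
```

Mixing the two schemes within a layer (the intra-layer dimension) is what lets both resource types on the FPGA do useful MAC work at once.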
54. Surgical Video Motion Magnification with Suppression of Instrument Artefacts [PDF] 返回目录
Mirek Janatka, Hani J. Marcus, Neil L. Dorward, Danail Stoyanov
Abstract: Video motion magnification could directly highlight subsurface blood vessels in endoscopic video in order to prevent inadvertent damage and bleeding. Applying motion filters to the full surgical image is, however, sensitive to residual motion from the surgical instruments and can impede practical application due to aberration motion artefacts. By storing the temporal filter response from local spatial frequency information for a single cardiovascular cycle prior to tool introduction to the scene, a filter can be used to determine whether motion magnification should be active for a given spatial region of the surgical image. In this paper, we propose a strategy to reduce aberration due to non-physiological motion in surgical video motion magnification. We present promising results on endoscopic transnasal transsphenoidal pituitary surgery, with a quantitative comparison to recent methods using Structural Similarity (SSIM), as well as a qualitative analysis comparing spatio-temporal cross-sections of the videos and individual frames.
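The gating idea can be sketched as follows: band-pass each pixel's intensity over time during a pre-instrument baseline, then magnify only where the cardiac-band energy is high. The sampling rate, frequency band, and threshold below are assumptions, and random data stands in for real endoscopic frames.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 30.0            # frames per second (assumed)
lo, hi = 0.8, 2.0    # cardiac band in Hz, roughly 50-120 bpm (assumed)
b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")

# Baseline: T frames captured before any instrument enters the scene, shape (T, H, W).
baseline = np.random.rand(120, 64, 64)
response = filtfilt(b, a, baseline, axis=0)   # temporal band-pass per pixel
energy = (response ** 2).mean(axis=0)         # per-pixel cardiac-band energy

mask = energy > np.percentile(energy, 75)     # candidate vascular regions
# In later frames, motion magnification is applied only where mask is True,
# leaving instrument regions untouched to avoid aberration artefacts.
```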
55. Geometric Uncertainty in Patient-Specific Cardiovascular Modeling with Convolutional Dropout Networks [PDF] 返回目录
Gabriel Maher, Casey Fleeter, Daniele Schiavazzi, Alison Marsden
Abstract: We propose a novel approach to generate samples from the conditional distribution of patient-specific cardiovascular models given a clinically acquired image volume. A convolutional neural network architecture with dropout layers is first trained for vessel lumen segmentation using a regression approach, to enable Bayesian estimation of vessel lumen surfaces. This network is then integrated into a path-planning patient-specific modeling pipeline to generate families of cardiovascular models. We demonstrate our approach by quantifying the effect of geometric uncertainty on the hemodynamics of three patient-specific anatomies: an aorto-iliac bifurcation, an abdominal aortic aneurysm, and a sub-model of the left coronary arteries. A key innovation introduced in the proposed approach is the ability to learn geometric uncertainty directly from training data. The results show how geometric uncertainty produces coefficients of variation comparable to or larger than other sources of uncertainty for wall shear stress and velocity magnitude, but has limited impact on pressure. Specifically, this is true for anatomies characterized by small vessel sizes, and for local vessel lesions seen infrequently during network training.
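The sampling mechanism behind dropout networks of this kind is standard Monte Carlo dropout: keep the dropout layers stochastic at inference and aggregate repeated forward passes. The tiny network, sample count, and shapes below are placeholders for the paper's actual segmentation architecture.

```python
import torch
import torch.nn as nn

# Stand-in network: the paper uses a full lumen-segmentation CNN.
net = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(p=0.5),                 # dropout stays active while sampling
    nn.Conv2d(8, 1, 3, padding=1),
)

image = torch.randn(1, 1, 64, 64)        # toy image slice

net.train()                              # keep dropout stochastic at test time
with torch.no_grad():
    samples = torch.stack([net(image) for _ in range(30)])  # 30 MC samples

mean_pred = samples.mean(dim=0)          # expected lumen prediction
uncertainty = samples.std(dim=0)         # per-pixel geometric uncertainty
```

Each sampled segmentation can then seed one member of the family of cardiovascular models, propagating the geometric uncertainty into the hemodynamic simulations.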
56. Generative models with kernel distance in data space [PDF] 返回目录
Szymon Knop, Marcin Mazur, Przemysław Spurek, Jacek Tabor, Igor Podolak
Abstract: Generative models dealing with modeling a joint data distribution are generally either autoencoder- or GAN-based. Both have their pros and cons, generating blurry images or being unstable in training and prone to the mode-collapse phenomenon, respectively. The objective of this paper is to construct a model situated between the above architectures, one that does not inherit their main weaknesses. The proposed LCW generator (Latent Cramer-Wold generator) resembles a classical GAN in transforming Gaussian noise into the data space. Most importantly, instead of a discriminator, the LCW generator uses a kernel distance. No adversarial training is utilized, hence the name generator. It is trained in two phases. First, an autoencoder-based architecture, using kernel measures, is built to model a manifold of the data. We then propose a Latent Trick, mapping a Gaussian to the latent space, in order to obtain the final model. This results in very competitive FID values.
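As a stand-in for the paper's Cramer-Wold distance, the sketch below computes a generic kernel two-sample distance (squared MMD with a Gaussian kernel) between a real and a generated batch; minimizing such a quantity in data space is what replaces the adversarial discriminator. The kernel choice and bandwidth are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(X, Y, gamma=1.0):
    # Pairwise Gaussian kernel values k(x, y) = exp(-gamma * ||x - y||^2).
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_distance(X, Y, gamma=1.0):
    # Squared MMD: a kernel two-sample distance between two batches.
    return (gaussian_gram(X, X, gamma).mean()
            + gaussian_gram(Y, Y, gamma).mean()
            - 2.0 * gaussian_gram(X, Y, gamma).mean())

real = np.random.randn(64, 2)
fake = np.random.randn(64, 2) + 1.0
print(kernel_distance(real, fake))  # used as the generator loss, no discriminator
```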