Table of Contents
5. DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic [PDF] Abstract
6. Attr2Style: A Transfer Learning Approach for Inferring Fashion Styles via Apparel Attributes [PDF] Abstract
11. An End-to-End Attack on Text-based CAPTCHAs Based on Cycle-Consistent Generative Adversarial Network [PDF] Abstract
14. A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [PDF] Abstract
18. Cross-regional oil palm tree counting and detection via multi-level attention domain adaptation network [PDF] Abstract
20. Vehicle Trajectory Prediction in Crowded Highway Scenarios Using Bird Eye View Representations and CNNs [PDF] Abstract
27. Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network [PDF] Abstract
28. Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition in the Wild [PDF] Abstract
29. Applying Surface Normal Information in Drivable Area and Road Anomaly Detection for Ground Mobile Robots [PDF] Abstract
32. How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detection via Interpreting Residuals with Biological Signals [PDF] Abstract
36. SNE-RoadSeg: Incorporating Surface Normal Information into Semantic Segmentation for Accurate Freespace Detection [PDF] Abstract
44. A Hidden Markov Tree Model for Flood Extent Mapping in Heavily Vegetated Areas based on High Resolution Aerial Imagery and DEM: A Case Study on Hurricane Matthew Floods [PDF] Abstract
45. Orientation-Disentangled Unsupervised Representation Learning for Computational Pathology [PDF] Abstract
48. On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets [PDF] Abstract
52. Selective Particle Attention: Visual Feature-Based Attention in Deep Reinforcement Learning [PDF] Abstract
53. DRR4Covid: Learning Automated COVID-19 Infection Segmentation from Digitally Reconstructed Radiographs [PDF] Abstract
54. Multi-Dimension Fusion Network for Light Field Spatial Super-Resolution using Dynamic Filters [PDF] Abstract
55. Better Than Reference In Low Light Image Enhancement: Conditional Re-Enhancement Networks [PDF] Abstract
Abstracts
1. DRG: Dual Relation Graph for Human-Object Interaction Detection [PDF] Back to Contents
Chen Gao, Jiarui Xu, Yuliang Zou, Jia-Bin Huang
Abstract: We tackle the challenging problem of human-object interaction (HOI) detection. Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features. In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph (one human-centric and one object-centric). Our proposed dual relation graph effectively captures discriminative cues from the scene to resolve ambiguity from local predictions. Our model is conceptually simple and leads to favorable results compared to the state-of-the-art HOI detection algorithms on two large-scale benchmark datasets.
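From the description above, one can sketch a single message-passing step of such a relation graph: pairs that share a node (the same human in the human-centric graph, the same object in the object-centric one) exchange attention-weighted messages over their spatial-semantic features. The dimensions, attention form, and update rule below are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn

class RelationGraphPass(nn.Module):
    """One attention-weighted aggregation pass over the features of
    human-object pairs that share a common node (a hedged sketch)."""
    def __init__(self, dim=1024):
        super().__init__()
        self.att = nn.Linear(dim * 2, 1)       # scores relevance of neighbour pairs
        self.update = nn.Linear(dim * 2, dim)

    def forward(self, pair_feats):             # (N, dim) spatial-semantic features
        n, d = pair_feats.shape
        q = pair_feats.unsqueeze(1).expand(n, n, d)
        k = pair_feats.unsqueeze(0).expand(n, n, d)
        w = torch.softmax(self.att(torch.cat([q, k], dim=-1)).squeeze(-1), dim=-1)
        ctx = w @ pair_feats                   # contextual message for each pair
        return torch.relu(self.update(torch.cat([pair_feats, ctx], dim=-1)))
```

In the dual setup, such a pass would run twice, once grouping pairs by their human and once by their object, before fusing the result with the per-pair predictions.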
2. NAS-DIP: Learning Deep Image Prior with Neural Architecture Search [PDF] Back to Contents
Yun-Chun Chen, Chen Gao, Esther Robb, Jia-Bin Huang
Abstract: Recent work has shown that the structure of deep convolutional neural networks can be used as a structured image prior for solving various inverse image restoration tasks. Instead of using hand-designed architectures, we propose to search for neural architectures that capture stronger image priors. Building upon a generic U-Net architecture, our core contribution lies in designing new search spaces for (1) an upsampling cell and (2) a pattern of cross-scale residual connections. We search for an improved network by leveraging an existing neural architecture search algorithm (using reinforcement learning with a recurrent neural network controller). We validate the effectiveness of our method via a wide variety of applications, including image restoration, dehazing, image-to-image translation, and matrix factorization. Extensive experimental results show that our algorithm performs favorably against state-of-the-art learning-free approaches and reaches competitive performance with existing learning-based methods in some cases.
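To make the idea of an upsampling-cell search space concrete, here is a hypothetical candidate-op table of the kind a controller could sample from; the paper's actual operation set and cell structure are richer than this sketch.

```python
import torch.nn as nn

def upsample_ops(ch):
    """Hypothetical 2x upsampling candidates for a NAS controller to
    choose among (the paper's real search space differs)."""
    return {
        "bilinear": nn.Upsample(scale_factor=2, mode="bilinear",
                                align_corners=False),
        "nearest_conv": nn.Sequential(
            nn.Upsample(scale_factor=2, mode="nearest"),
            nn.Conv2d(ch, ch, 3, padding=1)),
        "transposed_conv": nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1),
        "pixel_shuffle": nn.Sequential(
            nn.Conv2d(ch, ch * 4, 3, padding=1), nn.PixelShuffle(2)),
    }

cell = upsample_ops(64)["pixel_shuffle"]   # the controller's sampled choice
```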
3. Delving into Inter-Image Invariance for Unsupervised Visual Representations [PDF] Back to Contents
Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy
Abstract: Contrastive learning has recently shown immense potential in unsupervised visual representation learning. Existing studies in this track mainly focus on intra-image invariance learning. The learning typically uses rich intra-image transformations to construct positive pairs and then maximizes agreement using a contrastive loss. The merits of inter-image invariance, conversely, remain much less explored. One major obstacle to exploit inter-image invariance is that it is unclear how to reliably construct inter-image positive pairs, and further derive effective supervision from them since there are no pair annotations available. In this work, we present a rigorous and comprehensive study on inter-image invariance learning from three main constituting components: pseudo-label maintenance, sampling strategy, and decision boundary design. Through carefully-designed comparisons and analysis, we propose a unified framework that supports the integration of unsupervised intra- and inter-image invariance learning. With all the obtained recipes, our final model, namely InterCLR, achieves state-of-the-art performance on standard benchmarks. Code and models will be available at this https URL.
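The intra-image baseline the paper builds on can be written down directly: two augmented views of each image form a positive pair, and agreement is maximized with a normalized-temperature cross-entropy (NT-Xent) contrastive loss. The sketch below shows only that baseline; InterCLR's pseudo-label maintenance and inter-image pair construction are not reproduced.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """Contrastive loss over a batch: two augmented views of the same
    image are positives, all other images in the batch are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)            # (2B, D) stacked embeddings
    sim = z @ z.t() / tau                     # temperature-scaled cosine sims
    sim.fill_diagonal_(float("-inf"))         # mask self-similarity
    b = z1.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)      # positive = the other view
```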
4. 5G Utility Pole Planner Using Google Street View and Mask R-CNN [PDF] Back to Contents
Yanyu Zhang, Osama Alshaykh
Abstract: With the advances of fifth-generation (5G) cellular network technology, many studies and works have been carried out on how to build 5G networks for smart cities. Previous research has shown that street lighting poles and smart light poles are capable of serving as 5G access points. In order to determine the positions of these points, this paper discusses a new way to identify poles based on Mask R-CNN, which extends Faster R-CNN with a parallel branch that predicts segmentation masks. The dataset contains 3,000 high-resolution images from Google Maps. To make training faster, we used a very efficient GPU implementation of the convolution operation. We achieved a training error rate of 7.86% and a test error rate of 32.03%. Finally, we used an immune algorithm to place 5G poles in smart cities.
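A minimal inference sketch with an off-the-shelf Mask R-CNN (torchvision's COCO-pretrained model) conveys the detection step; in the paper's setting the model would be fine-tuned on the annotated street-view pole images, and the file name and score threshold below are assumptions.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# COCO-pretrained Mask R-CNN; a pole class head would be fine-tuned in practice.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

img = to_tensor(Image.open("street_view.jpg").convert("RGB"))  # hypothetical file
with torch.no_grad():
    pred = model([img])[0]                  # dict with boxes, labels, scores, masks

keep = pred["scores"] > 0.7                 # confidence threshold (assumed)
boxes, masks = pred["boxes"][keep], pred["masks"][keep]
print(f"{keep.sum().item()} candidate pole detections")
```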
5. DeepSOCIAL: Social Distancing Monitoring and Infection Risk Assessment in COVID-19 Pandemic [PDF] Back to Contents
Mahdi Rezaei, Mohsen Azarmi
Abstract: Social distancing is a recommended solution by the World Health Organisation (WHO) to minimise the spread of COVID-19 in public places. The majority of governments and national health authorities have set the 2-meter physical distancing as a mandatory safety measure in shopping centres, schools and other covered areas. In this research, we develop a generic Deep Neural Network-Based model for automated people detection, tracking, and inter-people distances estimation in the crowd, using common CCTV security cameras. The proposed model includes a YOLOv4-based framework and inverse perspective mapping for accurate people detection and social distancing monitoring in challenging conditions, including people occlusion, partial visibility, and lighting variations. We also provide an online risk assessment scheme by statistical analysis of the spatio-temporal data from the moving trajectories and the rate of social distancing violations. We identify high-risk zones with the highest possibility of virus spread and infections. This may help authorities to redesign the layout of a public place or to take precaution actions to mitigate high-risk zones. The efficiency of the proposed methodology is evaluated on the Oxford Town Centre dataset, with superior performance in terms of accuracy and speed compared to three state-of-the-art methods.
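The inverse perspective mapping plus distance check can be sketched with a single homography: map detected feet pixels to metric ground coordinates, then flag pairs closer than 2 m. The calibration points and detections below are made-up values for illustration, not the paper's calibration.

```python
import cv2
import numpy as np

# Four image points of a known ground rectangle and their metric
# coordinates (an assumed 2 m x 2 m calibration square).
img_pts = np.float32([[420, 610], [880, 605], [980, 820], [330, 830]])
world_pts = np.float32([[0, 0], [2, 0], [2, 2], [0, 2]])   # metres
H = cv2.getPerspectiveTransform(img_pts, world_pts)

def ground_positions(feet_xy, H):
    """Map detected feet pixels to bird's-eye metric coordinates."""
    pts = np.float32(feet_xy).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, H).reshape(-1, 2)

feet = [(450, 700), (700, 720), (900, 810)]        # detector outputs (assumed)
pos = ground_positions(feet, H)
d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)   # pairwise distances
violations = np.argwhere(np.triu(d < 2.0, k=1))    # index pairs closer than 2 m
print(violations)
```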
6. Attr2Style: A Transfer Learning Approach for Inferring Fashion Styles via Apparel Attributes [PDF] Back to Contents
Rajdeep Hazra Banerjee, Abhinav Ravi, Ujjal Kr Dutta
Abstract: Popular fashion e-commerce platforms mostly provide details about low-level attributes of an apparel (for example, neck type, dress length, collar type, print etc) on their product detail pages. However, customers usually prefer to buy apparels based on their style information, or simply put, occasion (for example, party wear, sports wear, casual wear etc). Application of a supervised image-captioning model to generate style-based image captions is limited because obtaining ground-truth annotations in the form of style-based captions is difficult. This is because annotating style-based captions requires a certain amount of fashion domain expertise, and also adds to the costs and manual effort. On the contrary, low-level attribute based annotations are much more easily available. To address this issue, we propose a transfer-learning based image captioning model that is trained on a source dataset with sufficient attribute-based ground-truth captions, and used to predict style-based captions on a target dataset. The target dataset has only a limited amount of images with style-based ground-truth captions. The main motivation of our approach comes from the fact that most often there are correlations among the low-level attributes and the higher-level styles for an apparel. We leverage this fact and train our model in an encoder-decoder based framework using attention mechanism. In particular, the encoder of the model is first trained on the source dataset to obtain latent representations capturing the low-level attributes. The trained model is fine-tuned to generate style-based captions for the target dataset. To highlight the effectiveness of our method, we qualitatively demonstrate that the captions generated by our approach are close to the actual style information for the evaluated apparels.
7. RNN-based Pedestrian Crossing Prediction using Activity and Pose-related Features [PDF] Back to Contents
Javier Lorenzo, Ignacio Parra, Florian Wirth, Christoph Stiller, David Fernandez Llorca, Miguel Angel Sotelo
Abstract: Pedestrian crossing prediction is a crucial task for autonomous driving. Numerous studies show that an early estimation of the pedestrian's intention can decrease or even avoid a high percentage of accidents. In this paper, different variations of a deep learning system are proposed to attempt to solve this problem. The proposed models are composed of two parts: a CNN-based feature extractor and an RNN module. All the models were trained and tested on the JAAD dataset. The results obtained indicate that the choice of the feature extraction method, the inclusion of additional variables such as pedestrian gaze direction and discrete orientation, and the chosen RNN type have a significant impact on the final performance.
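A minimal version of the two-part design (CNN feature extractor feeding an RNN) might look as follows; the backbone, hidden size, and single-logit head are assumptions, and pose- or activity-related variables would be concatenated to the per-frame features in the full model.

```python
import torch
import torch.nn as nn
import torchvision

class CrossingPredictor(nn.Module):
    """Per-frame CNN features fed into a GRU that outputs P(crossing)."""
    def __init__(self, hidden=256):
        super().__init__()
        backbone = torchvision.models.resnet18(weights="DEFAULT")
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop fc head
        self.rnn = nn.GRU(512, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, clip):                   # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1)).flatten(1).view(b, t, 512)
        _, h = self.rnn(feats)                 # h: (num_layers, B, hidden)
        return torch.sigmoid(self.head(h[-1]))

model = CrossingPredictor()
print(model(torch.randn(2, 16, 3, 224, 224)).shape)   # torch.Size([2, 1])
```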
8. Each Part Matters: Local Patterns Facilitate Cross-view Geo-localization [PDF] Back to Contents
Tingyu Wang, Zhedong Zheng, Chenggang Yan, Yi Yang
Abstract: Cross-view geo-localization is to spot images of the same geographic target from different platforms, e.g., drone-view cameras and satellites. It is challenging due to the large visual appearance changes caused by extreme viewpoint variations. Existing methods usually concentrate on mining the fine-grained feature of the geographic target in the image center, but underestimate the contextual information in neighbor areas. In this work, we argue that neighbor areas can be leveraged as auxiliary information, enriching discriminative clues for geo-localization. Specifically, we introduce a simple and effective deep neural network, called Local Pattern Network (LPN), to take advantage of contextual information in an end-to-end manner. Without using extra part estimators, LPN adopts a square-ring feature partition strategy, which provides the attention according to the distance to the image center. It eases the part matching and enables the part-wise representation learning. Owing to the square-ring partition design, the proposed LPN has good scalability to rotation variations and achieves competitive results on two prevailing benchmarks, i.e., University-1652 and CVUSA. Besides, we also show the proposed LPN can be easily embedded into other frameworks to further boost performance.
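The square-ring partition can be stated precisely: bucket feature-map cells by their normalized Chebyshev distance to the center and pool each ring separately, yielding one part feature per ring. The sketch below is one straightforward reading of that strategy; the number of rings and the pooling choice are assumptions.

```python
import torch

def square_ring_pool(fmap, num_rings=4):
    """Average-pool a feature map over concentric square rings around
    the centre (a sketch of the square-ring partition strategy)."""
    b, c, h, w = fmap.shape
    ys = torch.arange(h).float() - (h - 1) / 2
    xs = torch.arange(w).float() - (w - 1) / 2
    # Chebyshev (max-norm) distance to the centre, normalised to [0, 1)
    dist = torch.max(ys.abs()[:, None] / (h / 2), xs.abs()[None, :] / (w / 2))
    ring = (dist.clamp(max=0.999) * num_rings).long()   # ring index per cell
    parts = []
    for r in range(num_rings):
        mask = (ring == r).float()
        parts.append((fmap * mask).sum((2, 3)) / mask.sum().clamp(min=1))
    return torch.stack(parts, dim=-1)                   # (B, C, num_rings)

feat = torch.randn(2, 512, 16, 16)
print(square_ring_pool(feat).shape)   # torch.Size([2, 512, 4])
```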
9. Buy Me That Look: An Approach for Recommending Similar Fashion Products [PDF] Back to Contents
Abhinav Ravi, Sandeep Repakula, Ujjal Kr Dutta, Maulik Parmar
Abstract: The recent proliferation of numerous fashion e-commerce platforms has led to a surge in online shopping of fashion products. Fashion, being the dominant aspect of online retail sales, demands efficient and effective fashion product recommendation systems that can boost revenue and improve customer experience and engagement. In this paper, we focus on the problem of similar fashion item recommendation for multiple fashion items. Given a Product Display Page for a fashion item in an online e-commerce platform, we identify the images with a full-shot look, i.e., the one with a full human model wearing the fashion item. While the majority of existing works in this domain focus on retrieving similar products corresponding to a single item present in a query, we focus on the retrieval of multiple fashion items at once. This is an important problem because while a user might have searched for a particular primary article type (e.g., men's shorts), the human model in the full-shot look image would usually be wearing secondary fashion items as well (e.g., t-shirts, shoes etc). Upon looking at the full-shot look image in the PDP, the user might also be interested in viewing similar items for the secondary article types. To address this need, we use human keypoint detection to first identify the full-shot images, from which we subsequently select the front-facing ones. An article detection and localisation module pretrained on a large dataset is then used to identify different articles in the image. The detected articles and the catalog database images are then represented in a common embedding space, for the purpose of similarity based retrieval. We make use of a triplet-based neural network to obtain the embeddings. Our embedding network, by virtue of an active-learning component, achieves further improvements in the retrieval performance.
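The embedding stage can be sketched as a standard triplet setup: a backbone maps crops of detected articles into a shared space where the anchor-positive distance is pushed below the anchor-negative distance. The embedding size and margin below are assumptions, and the paper's active-learning component is omitted.

```python
import torch
import torch.nn as nn
import torchvision

# Embedding network trained with a triplet loss: anchor and positive show
# the same product, the negative shows a different one (a minimal sketch).
backbone = torchvision.models.resnet50(weights="DEFAULT")
backbone.fc = nn.Linear(backbone.fc.in_features, 128)   # embedding head
criterion = nn.TripletMarginLoss(margin=0.2)

anchor, positive, negative = (torch.randn(8, 3, 224, 224) for _ in range(3))
loss = criterion(backbone(anchor), backbone(positive), backbone(negative))
loss.backward()
```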
10. Cross-Spectral Periocular Recognition with Conditional Adversarial Networks [PDF] Back to Contents
Kevin Hernandez-Diaz, Fernando Alonso-Fernandez, Josef Bigun
Abstract: This work addresses the challenge of comparing periocular images captured in different spectra, which is known to produce significant drops in performance in comparison to operating in the same spectrum. We propose the use of Conditional Generative Adversarial Networks, trained to convert periocular images between visible and near-infrared spectra, so that biometric verification is carried out in the same spectrum. The proposed setup allows the use of existing feature methods typically optimized to operate in a single spectrum. Recognition experiments are done using a number of off-the-shelf periocular comparators based both on hand-crafted features and CNN descriptors. Using the Hong Kong Polytechnic University Cross-Spectral Iris Images Database (PolyU) as benchmark dataset, our experiments show that cross-spectral performance is substantially improved if both images are converted to the same spectrum, in comparison to matching features extracted from images in different spectra. In addition to this, we fine-tune a CNN based on the ResNet50 architecture, obtaining a cross-spectral periocular performance of EER=1%, and GAR>99% @ FAR=1%, which is comparable to the state-of-the-art with the PolyU database.
11. An End-to-End Attack on Text-based CAPTCHAs Based on Cycle-Consistent Generative Adversarial Network [PDF] Back to Contents
Chunhui Li, Xingshu Chen, Haizhou Wang, Yu Zhang, Peiming Wang
Abstract: As a widely deployed security scheme, text-based CAPTCHAs have become more and more difficult to resist machine learning-based attacks. So far, many researchers have conducted attacking research on text-based CAPTCHAs deployed by different companies (such as Microsoft, Amazon, and Apple) and achieved certain results. However, most of these attacks have some shortcomings, such as poor portability of attack methods, the need for a series of data preprocessing steps, and reliance on large amounts of labeled CAPTCHAs. In this paper, we propose an efficient and simple end-to-end attack method based on cycle-consistent generative adversarial networks. Compared with previous studies, our method greatly reduces the cost of data labeling. In addition, this method has high portability. It can attack common text-based CAPTCHA schemes by modifying only a few configuration parameters, which makes the attack easier. Firstly, we train CAPTCHA synthesizers based on the cycle-GAN to generate some fake samples. Basic recognizers based on the convolutional recurrent neural network are trained with the fake data. Subsequently, an active transfer learning method is employed to optimize the basic recognizer utilizing tiny amounts of labeled real-world CAPTCHA samples. Our approach efficiently cracked the CAPTCHA schemes deployed by 10 popular websites, indicating that our attack is likely very general. Additionally, we analyzed the current most popular anti-recognition mechanisms. The results show that the combination of more anti-recognition mechanisms can improve the security of CAPTCHA, but the improvement is limited. Conversely, generating more complex CAPTCHAs may cost more resources and reduce the availability of CAPTCHAs.
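A basic convolutional recurrent recognizer of the kind mentioned above, trained with CTC so that no character-level alignment is needed, can be sketched as follows; the charset size, layer sizes, and input geometry are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Conv feature columns -> BiLSTM -> per-timestep class scores,
    trained with CTC loss on (possibly synthetic) CAPTCHA images."""
    def __init__(self, num_classes=37):          # 36 symbols + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.LSTM(128 * 16, 256, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):                        # x: (B, 1, 64, W)
        f = self.conv(x)                         # (B, 128, 16, W/4)
        f = f.permute(0, 3, 1, 2).flatten(2)     # (B, W/4, 128*16) column features
        out, _ = self.rnn(f)
        return self.fc(out).log_softmax(-1)      # per-timestep log-probs

model, ctc = CRNN(), nn.CTCLoss(blank=0)
logp = model(torch.randn(4, 1, 64, 128)).permute(1, 0, 2)   # (T, B, C) for CTC
```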
12. Estimating Example Difficulty using Variance of Gradients [PDF] Back to Contents
Chirag Agarwal, Sara Hooker
Abstract: In machine learning, a question of great interest is understanding what examples are challenging for a model to classify. Identifying atypical examples helps inform safe deployment of models, isolates examples that require further human inspection, and provides interpretability into model behavior. In this work, we propose Variance of Gradients (VOG) as a proxy metric for detecting outliers in the data distribution. We provide quantitative and qualitative support that VOG is a meaningful way to rank data by difficulty and to surface a tractable subset of the most challenging examples for human-in-the-loop auditing. Data points with high VOG scores are more difficult for the model to classify and over-index on examples that require memorization.
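The metric itself is simple to compute: collect the input gradient of each example at several training checkpoints and take the variance across checkpoints. Note that the paper computes gradients of the true-class pre-softmax activation; the sketch below uses the loss gradient as a simplification.

```python
import torch
import torch.nn.functional as F

def vog_scores(checkpoints, images, labels):
    """Variance of input gradients across training checkpoints, used as a
    per-example difficulty/outlier proxy (a sketch of the VOG idea)."""
    grads = []
    for model in checkpoints:                  # models saved at different epochs
        model.eval()
        x = images.clone().requires_grad_(True)
        loss = F.cross_entropy(model(x), labels)
        g, = torch.autograd.grad(loss, x)      # gradient w.r.t. input pixels
        grads.append(g)
    g = torch.stack(grads)                     # (K, B, C, H, W)
    # variance over checkpoints, averaged over pixels: one score per image
    return g.var(dim=0, unbiased=False).mean(dim=(1, 2, 3))
```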
13. End-to-End 3D Multi-Object Tracking and Trajectory Forecasting [PDF] Back to Contents
Xinshuo Weng, Ye Yuan, Kris Kitani
Abstract: 3D multi-object tracking (MOT) and trajectory forecasting are two critical components in modern 3D perception systems. We hypothesize that it is beneficial to unify both tasks under one framework to learn a shared feature representation of agent interaction. To evaluate this hypothesis, we propose a unified solution for 3D MOT and trajectory forecasting which also incorporates two additional novel computational units. First, we employ a feature interaction technique by introducing Graph Neural Networks (GNNs) to capture the way in which multiple agents interact with one another. The GNN is able to model complex hierarchical interactions, improve the discriminative feature learning for MOT association, and provide socially-aware context for trajectory forecasting. Second, we use a diversity sampling function to improve the quality and diversity of our forecasted trajectories. The learned sampling function is trained to efficiently extract a variety of outcomes from a generative trajectory distribution and helps avoid the problem of generating many duplicate trajectory samples. We show that our method achieves state-of-the-art performance on the KITTI dataset. Our project website is at this http URL.
14. A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [PDF] Back to Contents
Alejandro López-Cifuentes, Marcos Escudero-Viñolo, Jesús Bescós
Abstract: Action recognition is currently one of the top-challenging research fields in computer vision. Convolutional Neural Networks (CNNs) have significantly boosted its performance but rely on fixed-size spatio-temporal windows of analysis, reducing CNNs temporal receptive fields. Among action recognition datasets, egocentric recorded sequences have become of important relevance while entailing an additional challenge: ego-motion is unavoidably transferred to these sequences. The proposed method aims to cope with it by estimating this ego-motion or camera motion. The estimation is used to temporally partition video sequences into motion-compensated temporal chunks showing the action under stable backgrounds and allowing for a content-driven temporal sampling. A CNN trained in an end-to-end fashion is used to extract temporal features from each chunk, which are late fused. This process leads to the extraction of features from the whole temporal range of an action, increasing the temporal receptive field of the network.
15. Weakly Supervised Learning with Side Information for Noisy Labeled Images [PDF] Back to Contents
Lele Cheng, Xiangzeng Zhou, Liming Zhao, Dangwei Li, Hong Shang, Yun Zheng, Pan Pan, Yinghui Xu
Abstract: In many real-world datasets, like WebVision, the performance of DNN-based classifiers is often limited by the noisy labeled data. To tackle this problem, some image-related side information, such as captions and tags, often reveals underlying relationships across images. In this paper, we present an efficient weakly supervised learning method using a Side Information Network (SINet), which aims to effectively carry out large-scale classification with severely noisy labels. The proposed SINet consists of a visual prototype module and a noise weighting module. The visual prototype module is designed to generate a compact representation for each category by introducing the side information. The noise weighting module aims to estimate the correctness of each noisy image and produce a confidence score for image ranking during the training procedure. The proposed SINet can largely alleviate the negative impact of noisy image labels, and is beneficial for training a high performance CNN based classifier. In addition, we released a fine-grained product dataset called AliProducts, which contains more than 2.5 million noisy web images crawled from the internet by using queries generated from 50,000 fine-grained semantic classes. Extensive experiments on several popular benchmarks (i.e. WebVision, ImageNet and Clothing-1M) and our proposed AliProducts achieve state-of-the-art performance. The SINet won first place in the classification task of the WebVision Challenge 2019, and outperformed other competitors by a large margin.
16. Performance Optimization for Federated Person Re-identification via Benchmark Analysis [PDF] Back to Contents
Weiming Zhuang, Yonggang Wen, Xuesen Zhang, Xin Gan, Daiying Yin, Dongzhan Zhou, Shuai Zhang, Shuai Yi
Abstract: Federated learning is a privacy-preserving machine learning technique that learns a shared model across decentralized clients. It can alleviate privacy concerns in person re-identification, an important computer vision task. In this work, we implement federated learning to person re-identification (FedReID) and optimize its performance affected by statistical heterogeneity in the real-world scenario. We first construct a new benchmark to investigate the performance of FedReID. This benchmark consists of (1) nine datasets with different volumes sourced from different domains to simulate the heterogeneous situation in reality, (2) two federated scenarios, and (3) an enhanced federated algorithm for FedReID. The benchmark analysis shows that the client-edge-cloud architecture, represented by the federated-by-dataset scenario, has better performance than the client-server architecture in FedReID. It also reveals the bottlenecks of FedReID under the real-world scenario, including poor performance on large datasets caused by unbalanced weights in model aggregation and challenges in convergence. Then we propose two optimization methods: (1) To address the unbalanced weight problem, we propose a new method to dynamically change the weights according to the scale of model changes in clients in each training round; (2) To facilitate convergence, we adopt knowledge distillation to refine the server model with knowledge generated from client models on a public dataset. Experiment results demonstrate that our strategies can achieve much better convergence with superior performance on all datasets. We believe that our work will inspire the community to further explore the implementation of federated learning on more computer vision tasks in real-world scenarios.
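One plausible reading of optimization (1) is a FedAvg variant whose mixing weights are proportional to how far each client's model moved in the current round; the sketch below implements that reading, and the paper's exact weighting rule may differ.

```python
import copy
import torch

def aggregate_by_change(global_model, client_models):
    """Server-side aggregation where each client's weight grows with the
    scale of its model change this round (a hedged sketch)."""
    g = global_model.state_dict()
    states = [m.state_dict() for m in client_models]
    # scale of change: L2 distance between client and global parameters
    changes = [
        torch.sqrt(sum((s[k].float() - g[k].float()).pow(2).sum() for k in g)).item()
        for s in states
    ]
    total = sum(changes) or 1.0                 # guard against division by zero
    new_state = copy.deepcopy(g)
    for k in g:
        new_state[k] = sum(
            (c / total) * s[k].float() for c, s in zip(changes, states)
        ).to(g[k].dtype)
    global_model.load_state_dict(new_state)
```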
17. Making a Case for 3D Convolutions for Object Segmentation in Videos [PDF] 返回目录
Sabarinath Mahadevan, Ali Athar, Aljoša Ošep, Sebastian Hennen, Laura Leal-Taixé, Bastian Leibe
Abstract: The task of object segmentation in videos is usually accomplished by processing appearance and motion information separately using standard 2D convolutional networks, followed by a learned fusion of the two sources of information. 3D convolutional networks, on the other hand, have been applied successfully to video classification, but have not been leveraged as effectively for problems involving dense per-pixel interpretation of videos, and they lag behind their 2D counterparts in performance. In this work, we show that 3D CNNs can be effectively applied to dense video prediction tasks such as salient object segmentation. We propose a simple yet effective encoder-decoder network architecture consisting entirely of 3D convolutions that can be trained end-to-end using a standard cross-entropy loss. To this end, we leverage an efficient 3D encoder and propose a 3D decoder architecture comprising novel 3D Global Convolution layers and 3D Refinement modules. Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks while also being faster, showing that our architecture can efficiently learn expressive spatio-temporal features and produce high-quality video segmentation masks. Our code and models will be made publicly available.
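As a rough illustration of an encoder-decoder built entirely from 3D convolutions and trained with a standard cross-entropy loss, here is a toy PyTorch sketch; the layer sizes and the Tiny3DSegNet name are invented, and the paper's 3D Global Convolution and Refinement modules are not reproduced.

```python
import torch
import torch.nn as nn

class Tiny3DSegNet(nn.Module):
    """Minimal 3D-convolutional encoder-decoder for per-pixel video
    prediction; a toy stand-in for the paper's architecture."""
    def __init__(self, in_ch=3, num_classes=2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),        # downsample space only
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=(1, 2, 2), mode='trilinear',
                        align_corners=False),
            nn.Conv3d(32, num_classes, kernel_size=3, padding=1),
        )

    def forward(self, x):                   # x: (B, C, T, H, W)
        return self.decoder(self.encoder(x))

clip = torch.randn(1, 3, 8, 64, 64)         # 8-frame RGB clip
logits = Tiny3DSegNet()(clip)                # (1, 2, 8, 64, 64)
loss = nn.CrossEntropyLoss()(logits,
                             torch.zeros(1, 8, 64, 64, dtype=torch.long))
print(logits.shape, loss.item())
```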
18. Cross-regional oil palm tree counting and detection via multi-level attention domain adaptation network [PDF] 返回目录
Juepeng Zheng, Haohuan Fu, Weijia Li, Wenzhao Wu, Yi Zhao, Runmin Dong, Le Yu
Abstract: Providing an accurate evaluation of palm tree plantations over a large region can have meaningful economic and ecological impacts. However, the enormous spatial scale and the variety of geological features across regions have made this a grand challenge, with limited solutions based on manual human monitoring. Although deep learning based algorithms have demonstrated potential for forming an automated approach in recent years, the labelling effort needed to cover different features in different regions largely constrains their effectiveness on large-scale problems. In this paper, we propose a novel domain-adaptive oil palm tree detection method, a Multi-level Attention Domain Adaptation Network (MADAN), to achieve cross-regional oil palm tree counting and detection. MADAN consists of four procedures: First, we adopt a batch-instance normalization network (BIN) based feature extractor to improve the generalization ability of the model, integrating batch normalization and instance normalization. Second, we embed a multi-level attention mechanism (MLA) into the architecture to enhance transferability, including a feature-level attention and an entropy-level attention. Then we design a minimum entropy regularization (MER) to increase the confidence of the classifier predictions by assigning the entropy-level attention value to the entropy penalty. Finally, we employ a sliding-window-based prediction and an IoU-based post-processing approach to obtain the final detection results. We conducted comprehensive ablation experiments using three different satellite images of large-scale oil palm plantation areas, with six transfer tasks. MADAN improves the detection accuracy by 14.98% in terms of average F1-score compared with the baseline method (without DA), and performs 3.55%-14.49% better than existing domain adaptation methods.
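Batch-instance normalization (BIN) is a published building block, so a compact sketch is possible: a learnable per-channel gate mixes batch-normalized and instance-normalized activations. The PyTorch module below follows that recipe; the initialization and clamping details are assumptions.

```python
import torch
import torch.nn as nn

class BatchInstanceNorm2d(nn.Module):
    """Sketch of batch-instance normalization: a learnable gate mixes
    batch-normalized and instance-normalized activations per channel."""
    def __init__(self, channels):
        super().__init__()
        self.bn = nn.BatchNorm2d(channels, affine=False)
        self.inorm = nn.InstanceNorm2d(channels, affine=False)
        self.gate = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))
        self.weight = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        g = self.gate.clamp(0.0, 1.0)        # keep the mixing weight in [0, 1]
        out = g * self.bn(x) + (1.0 - g) * self.inorm(x)
        return out * self.weight + self.bias

x = torch.randn(4, 8, 32, 32)
print(BatchInstanceNorm2d(8)(x).shape)       # torch.Size([4, 8, 32, 32])
```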
19. Gesture Recognition from Skeleton Data for Intuitive Human-Machine Interaction [PDF] 返回目录
André Brás, Miguel Simão, Pedro Neto
Abstract: Human gesture recognition has assumed a central role in industrial applications such as Human-Machine Interaction. We propose an approach for the segmentation and classification of dynamic gestures based on a set of handcrafted features drawn from the skeleton data provided by the Kinect sensor. The module for gesture detection relies on a feedforward neural network that performs framewise binary classification. The method for gesture recognition applies a sliding window, which extracts information from both the spatial and temporal dimensions. We then combine windows of varying durations to obtain a multi-temporal-scale approach and an additional gain in performance. Encouraged by the recent success of Recurrent Neural Networks on time-series domains, we also propose a method for simultaneous gesture segmentation and classification based on bidirectional Long Short-Term Memory cells, which have shown the ability to learn temporal relationships on long time scales. We evaluate all the different approaches on the dataset published for the ChaLearn Looking at People Challenge 2014. The most effective method achieves a Jaccard index of 0.75, which suggests performance almost on par with that of state-of-the-art techniques. Finally, the recognized gestures are used to interact with a collaborative robot.
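A sketch of the multi-temporal-scale idea: cut a skeleton sequence into overlapping sliding windows of several durations and flatten each window into one feature vector. The window lengths and the layout of the toy sequence are assumptions.

```python
import numpy as np

def sliding_windows(frames, lengths=(10, 20, 30)):
    """Cut a (T, D) skeleton sequence into overlapping windows of several
    durations; each window is flattened into one feature vector."""
    feats = {}
    for L in lengths:
        windows = [frames[t:t + L].ravel()           # stack L frames
                   for t in range(0, len(frames) - L + 1)]
        feats[L] = (np.stack(windows) if windows
                    else np.empty((0, L * frames.shape[1])))
    return feats

seq = np.random.rand(100, 60)    # 100 frames, 20 joints x 3 coordinates
for L, f in sliding_windows(seq).items():
    print(f"window={L:2d} -> {f.shape}")
```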
20. Vehicle Trajectory Prediction in Crowded Highway Scenarios Using Bird Eye View Representations and CNNs [PDF] 返回目录
R. Izquierdo, A. Quintanar, I. Parra, D. Fernandez-Llorca, M. A. Sotelo
Abstract: This paper describes a novel approach to vehicle trajectory prediction that employs graphic representations. The vehicles are represented as Gaussian distributions in a Bird Eye View. The U-net model is then used to perform sequence-to-sequence predictions. This deep-learning-based methodology has been trained on the HighD dataset, which contains detections of vehicles in a highway scenario from aerial imagery. The problem is posed as an image-to-image regression problem, training the network to learn the underlying relations between the traffic participants. This approach generates an estimation of the future appearance of the input scene rather than trajectories or numeric positions. An extra step is conducted to extract the positions from the predicted representation with sub-pixel resolution. Different network configurations have been tested, and the prediction error up to three seconds ahead is on the order of the representation resolution. The model has been tested in highway scenarios with more than 30 vehicles simultaneously in two opposite traffic-flow streams, showing good qualitative and quantitative results.
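To illustrate the input representation, the NumPy sketch below rasterizes vehicle positions into a bird's-eye-view grid as isotropic Gaussian blobs; the grid size, spatial extent, and sigma are illustrative choices, not the paper's settings.

```python
import numpy as np

def render_bev(vehicles, size=128, extent=50.0, sigma=1.0):
    """Rasterize vehicle positions (x, y in metres) into a bird's-eye-view
    grid, each vehicle drawn as an isotropic Gaussian blob."""
    ys, xs = np.mgrid[0:size, 0:size]
    scale = size / (2 * extent)                      # metres -> pixels
    bev = np.zeros((size, size))
    for x, y in vehicles:
        px, py = (x + extent) * scale, (y + extent) * scale
        bev += np.exp(-((xs - px) ** 2 + (ys - py) ** 2)
                      / (2 * (sigma * scale) ** 2))
    return np.clip(bev, 0.0, 1.0)

frame = render_bev([(0.0, 0.0), (8.0, 3.5), (-12.0, -1.0)])
print(frame.shape, frame.max())
```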
21. Anime-to-Real Clothing: Cosplay Costume Generation via Image-to-Image Translation [PDF] 返回目录
Koya Tango, Marie Katsurai, Hayato Maki, Ryosuke Goto
Abstract: Cosplay has grown from its origins at fan conventions into a billion-dollar global dress phenomenon. To facilitate imagination and reinterpretation from animated images to real garments, this paper presents an automatic costume image generation method based on image-to-image translation. Cosplay items can be significantly diverse in their styles and shapes, and conventional methods cannot be directly applied to the wide variation in clothing images that is the focus of this study. To solve this problem, our method starts by collecting and preprocessing web images to prepare a cleaned, paired dataset of the anime and real domains. Then, we present a novel architecture for generative adversarial networks (GANs) to facilitate high-quality cosplay image generation. Our GAN consists of several effective techniques that fill the gap between the two domains and improve both the global and local consistency of generated images. Experiments demonstrate that, on two types of evaluation metrics, the proposed GAN achieves better performance than existing methods. We also show that the images generated by the proposed method are more realistic than those generated by conventional methods. Our code and pretrained model are available on the web.
22. HipaccVX: Wedding of OpenVX and DSL-based Code Generation [PDF] 返回目录
M. Akif Özkan, Burak Ok, Bo Qiao, Jürgen Teich, Frank Hannig
Abstract: Writing programs for heterogeneous platforms optimized for high performance is hard, since this requires the code to be tuned at a low level with architecture-specific optimizations that are often based on fundamentally differing programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard based on a graph-execution model. Yet the OpenVX algorithm space is constrained to a small set of vision functions. This hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing Domain-Specific Language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that are not possible to detect with OpenVX graph implementations using only the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% on our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.
23. SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation [PDF] 返回目录
Jianan Zhen, Qi Fang, Jiaming Sun, Wentao Liu, Wei Jiang, Hujun Bao, Xiaowei Zhou
Abstract: Recovering multi-person 3D poses with absolute scale from a single RGB image is a challenging problem due to the inherent depth and scale ambiguity of a single view. Addressing this ambiguity requires aggregating various cues over the entire image, such as body sizes, scene layouts, and inter-person relationships. However, most previous methods adopt a top-down scheme that first performs 2D pose detection and then regresses the 3D pose and scale for each detected person individually, ignoring global contextual cues. In this paper, we propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses based on these 2.5D representations with a depth-aware part association algorithm. Such a single-shot bottom-up scheme allows the system to better learn and reason about inter-person depth relationships, improving both 3D and 2D pose estimation. The experiments demonstrate that the proposed approach achieves state-of-the-art performance on the CMU Panoptic and MuPoTS-3D datasets and is applicable to in-the-wild videos.
24. Semantic Graph Based Place Recognition for 3D Point Clouds [PDF] 返回目录
Xin Kong, Xuemeng Yang, Guangyao Zhai, Xiangrui Zhao, Xianfang Zeng, Mengmeng Wang, Yong Liu, Wanlong Li, Feng Wen
Abstract: Due to the difficulty of generating effective descriptors that are robust to occlusion and viewpoint changes, place recognition for 3D point clouds remains an open issue. Unlike most existing methods, which focus on extracting local, global, and statistical features of raw point clouds, our method aims at the semantic level, which can be superior in terms of robustness to environmental changes. Inspired by the way humans recognize scenes, by identifying semantic objects and capturing their relations, this paper presents a novel semantic-graph-based approach for place recognition. First, we propose a novel semantic graph representation for point cloud scenes that preserves the semantic and topological information of the raw point cloud. Place recognition is thus modeled as a graph matching problem. We then design a fast and effective graph similarity network to compute the similarity. Exhaustive evaluations on the KITTI dataset show that our approach is robust to occlusion as well as viewpoint changes and outperforms state-of-the-art methods by a large margin. Our code is available at: \url{this https URL}.
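A minimal sketch of the graph construction: collapse a semantically labelled point cloud into per-class centroids (a simplification; the paper operates on object instances) and connect nearby ones. The histogram-based similarity at the end is a crude stand-in for the learned graph similarity network.

```python
import numpy as np

def build_semantic_graph(points, labels, edge_radius=5.0):
    """Collapse a labelled point cloud into a graph: one node per semantic
    class centroid, edges between centroids closer than edge_radius."""
    nodes = [(lab, points[labels == lab].mean(axis=0))
             for lab in np.unique(labels)]
    edges = {(i, j)
             for i, (_, ci) in enumerate(nodes)
             for j, (_, cj) in enumerate(nodes)
             if i < j and np.linalg.norm(ci - cj) < edge_radius}
    return nodes, edges

def label_histogram_similarity(nodes_a, nodes_b, num_classes=10):
    """Crude graph similarity: cosine similarity of node-label histograms
    (the paper instead learns this with a graph similarity network)."""
    ha = np.bincount([l for l, _ in nodes_a], minlength=num_classes).astype(float)
    hb = np.bincount([l for l, _ in nodes_b], minlength=num_classes).astype(float)
    return ha @ hb / (np.linalg.norm(ha) * np.linalg.norm(hb) + 1e-12)

pts = np.random.rand(200, 3) * 10
labs = np.random.randint(0, 4, 200)
nodes, edges = build_semantic_graph(pts, labs)
print(len(nodes), "nodes,", len(edges), "edges, self-similarity =",
      label_histogram_similarity(nodes, nodes))
```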
25. Determinantal Point Process as an alternative to NMS [PDF] 返回目录
Samik Some, Mithun Das Gupta, Vinay P. Namboodiri
Abstract: We present a determinantal point process (DPP) inspired alternative to non-maximum suppression (NMS), which has become an integral step in all state-of-the-art object detection frameworks. DPPs have been shown to encourage diversity in subset selection problems. We pose NMS as a subset selection problem and posit that directly incorporating a DPP-like framework can improve the overall performance of the object detection system. We propose an optimization problem that takes the same inputs as NMS but introduces a novel sub-modularity-based diverse subset selection functional. Our results strongly indicate that the modifications proposed in this paper can provide consistent improvements to state-of-the-art object detection pipelines.
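The DPP view of NMS can be made concrete with greedy MAP inference over a quality-times-similarity kernel; the sketch below is the textbook construction, not the paper's sub-modular functional. Detection scores play the role of quality and IoU the role of similarity, so the selection favors boxes that are both high-scoring and mutually non-overlapping.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def dpp_select(boxes, scores, k=3):
    """Greedy MAP inference for a DPP with kernel L = diag(q) S diag(q),
    where q are detection scores and S is an IoU-based similarity."""
    n = len(boxes)
    S = np.array([[iou(boxes[i], boxes[j]) for j in range(n)] for i in range(n)])
    L = np.outer(scores, scores) * (S + np.eye(n) * 1e-6)
    chosen = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in set(range(n)) - set(chosen):
            idx = chosen + [i]
            gain = np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            if gain > best_gain:
                best, best_gain = i, gain
        chosen.append(best)
    return chosen

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (21, 19, 31, 29)]
print(dpp_select(boxes, np.array([0.9, 0.85, 0.8, 0.75]), k=2))  # -> [0, 2]
```

Greedy selection is the standard approximation here, since exact DPP MAP inference is NP-hard.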
26. Fusion of Global-Local Features for Image Quality Inspection of Shipping Label [PDF] 返回目录
Sungho Suh, Paul Lukowicz, Yong Oh Lee
Abstract: Demand for automated shipping address recognition and verification has increased in order to handle a large number of packages and to save costs associated with misdelivery. A previous study proposed a deep learning system in which the shipping address is recognized and verified based on a camera image capturing the shipping address and barcode area. Because system performance depends on the input image quality, inspection of input image quality is necessary for image preprocessing. In this paper, we propose an input image quality verification method combining global and local features. Object detection and scale-invariant feature transform in different feature spaces are developed to extract global and local features from several independent convolutional neural networks. The conditions of shipping label images are classified by fully connected fusion layers with concatenated global and local features. Experimental results on real captured and generated images show that the proposed method achieves better performance than other methods. These results are expected to improve shipping address recognition and verification systems by applying different image preprocessing steps based on the classified conditions.
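The fusion step reduces to concatenating global and local feature vectors before fully connected layers. A toy PyTorch head with invented dimensions might look like this:

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Toy fusion head: concatenate a global feature vector with local
    (e.g. SIFT-derived) features and classify the label-image condition."""
    def __init__(self, global_dim=256, local_dim=128, num_classes=3):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(global_dim + local_dim, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, g_feat, l_feat):
        return self.head(torch.cat([g_feat, l_feat], dim=1))

logits = FusionClassifier()(torch.randn(4, 256), torch.randn(4, 128))
print(logits.shape)   # torch.Size([4, 3])
```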
27. Orientation-aware Vehicle Re-identification with Semantics-guided Part Attention Network [PDF] 返回目录
Tsai-Shien Chen, Chih-Ting Liu, Chih-Wei Wu, Shao-Yi Chien
Abstract: Vehicle re-identification (re-ID) focuses on matching images of the same vehicle across different cameras. It is fundamentally challenging because differences between vehicles are sometimes subtle. While several studies incorporate spatial-attention mechanisms to help vehicle re-ID, they often require expensive keypoint labels or suffer from noisy attention masks if not trained with such labels. In this work, we propose a dedicated Semantics-guided Part Attention Network (SPAN) to robustly predict part attention masks for different views of vehicles, given only image-level semantic labels during training. With the help of part attention masks, we can extract discriminative features from each part separately. We then introduce a Co-occurrence Part-attentive Distance Metric (CPDM), which places greater emphasis on co-occurring vehicle parts when evaluating the feature distance between two images. Extensive experiments validate the effectiveness of the proposed method and show that our framework outperforms state-of-the-art approaches.
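CPDM can be sketched as a visibility-weighted part distance: per-part distances count only insofar as the part is attended in both images. The formula below is a plausible reading of the abstract, not the paper's exact metric.

```python
import numpy as np

def co_occurrence_distance(feats_a, feats_b, vis_a, vis_b):
    """Sketch of a co-occurrence part-attentive distance: per-part feature
    distances are weighted by how visible the part is in *both* images."""
    w = vis_a * vis_b                                 # co-occurrence weights
    d = np.linalg.norm(feats_a - feats_b, axis=1)     # per-part L2 distance
    return (w * d).sum() / (w.sum() + 1e-12)

P, D = 3, 64                      # 3 vehicle parts (e.g. front/rear/side)
fa, fb = np.random.rand(P, D), np.random.rand(P, D)
print(co_occurrence_distance(fa, fb,
                             np.array([1.0, 0.9, 0.1]),
                             np.array([1.0, 0.2, 0.8])))
```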
28. Point Adversarial Self Mining: A Simple Method for Facial Expression Recognition in the Wild [PDF] 返回目录
Ping Liu, Yuewei Lin, Zibo Meng, Weihong Deng, Joey Tianyi Zhou, Yi Yang
Abstract: In this paper, we propose the Point Adversarial Self Mining (PASM) approach, a simple yet effective way to progressively mine knowledge from training samples, to produce training data for CNNs and improve performance and network generality in the Facial Expression Recognition (FER) task. To achieve high prediction accuracy under real-world scenarios, most existing works choose to manipulate the network architecture and design sophisticated loss terms. Although demonstrated to be effective in real scenarios, these methods require extra effort in network design. Inspired by random erasing and adversarial erasing, we propose PASM for data augmentation, simulating the data distribution in the wild. Specifically, given a sample and a pre-trained network, our approach locates the informative region in the sample via a point adversarial attack policy. The informative region is highly structured and sparse. Compared to the regions produced by random erasing, which selects regions purely at random, and by adversarial erasing, which operates on attention maps, the informative regions located by PASM are more adaptive and better aligned with previous findings: not all but only a few facial regions contribute to accurate prediction. The located informative regions are then masked out from the original samples to generate augmented images, which force the network to explore additional information from other, less informative regions. The augmented images are used to finetune the network to enhance its generality. In the refinement process, we take advantage of knowledge distillation, utilizing the pre-trained network to provide guidance and retain knowledge from old samples to train a new network with the same structural configuration.
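As a hedged stand-in for the point adversarial attack policy, the sketch below locates the most informative patch with plain input-gradient saliency and erases it; the patch size, toy model, and helper name are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_informative_region(model, images, labels, patch=8):
    """Hypothetical stand-in for PASM: find the most informative patch via
    input-gradient saliency, then erase it to create an augmented image."""
    x = images.clone().requires_grad_(True)
    F.cross_entropy(model(x), labels).backward()
    sal = x.grad.abs().sum(dim=1, keepdim=True)      # (B, 1, H, W) saliency
    pooled = F.avg_pool2d(sal, patch)                # mean saliency per patch
    idx = pooled.flatten(1).argmax(dim=1)            # strongest patch index
    out, cols = images.clone(), pooled.shape[-1]
    for b, i in enumerate(idx.tolist()):
        r, c = (i // cols) * patch, (i % cols) * patch
        out[b, :, r:r + patch, c:c + patch] = 0.0    # erase informative region
    return out

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 7))
aug = mask_informative_region(model, torch.randn(2, 3, 32, 32),
                              torch.tensor([0, 3]))
print(aug.shape)                                     # torch.Size([2, 3, 32, 32])
```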
29. Applying Surface Normal Information in Drivable Area and Road Anomaly Detection for Ground Mobile Robots [PDF] 返回目录
Hengli Wang, Rui Fan, Yuxiang Sun, Ming Liu
Abstract: The joint detection of drivable areas and road anomalies is a crucial task for ground mobile robots. In recent years, many impressive semantic segmentation networks, which can be used for pixel-level drivable area and road anomaly detection, have been developed. However, the detection accuracy still needs improvement. Therefore, we develop a novel module named the Normal Inference Module (NIM), which can generate surface normal information from dense depth images with high accuracy and efficiency. Our NIM can be deployed in existing convolutional neural networks (CNNs) to refine the segmentation performance. To evaluate the effectiveness and robustness of our NIM, we embed it in twelve state-of-the-art CNNs. The experimental results illustrate that our NIM can greatly improve the performance of the CNNs for drivable area and road anomaly detection. Furthermore, our proposed NIM-RTFNet ranks 8th on the KITTI road benchmark and exhibits a real-time inference speed.
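Surface normal inference from a dense depth map can be sketched with classical geometry: back-project pixels to 3D with pinhole intrinsics and take the cross product of local tangent vectors. NIM's actual formulation may differ, and the intrinsics below are made up.

```python
import numpy as np

def normals_from_depth(depth, fx=500.0, fy=500.0, cx=None, cy=None):
    """Estimate per-pixel surface normals from a dense depth map: back-project
    to 3D, then cross finite-difference tangent vectors (a simple stand-in
    for the paper's NIM module)."""
    h, w = depth.shape
    cx = w / 2 if cx is None else cx
    cy = h / 2 if cy is None else cy
    v, u = np.mgrid[0:h, 0:w]
    pts = np.dstack([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])
    du = np.gradient(pts, axis=1)     # tangent along image columns
    dv = np.gradient(pts, axis=0)     # tangent along image rows
    n = np.cross(du, dv)              # normal (sign depends on convention)
    n /= np.linalg.norm(n, axis=2, keepdims=True) + 1e-12
    return n

# Toy depth map of a tilted plane: normals should be roughly constant.
depth = np.tile(np.linspace(1.0, 2.0, 64), (64, 1))
n = normals_from_depth(depth)
print(n.shape, n[32, 32])
```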
30. Effective Action Recognition with Embedded Key Point Shifts [PDF] 返回目录
Haozhi Cao, Yuecong Xu, Jianfei Yang, Kezhi Mao, Jianxiong Yin, Simon See
Abstract: Temporal feature extraction is an essential technique in video-based action recognition. Key points have been utilized in skeleton-based action recognition methods, but they require costly key point annotation. In this paper, we propose a novel temporal feature extraction module, named the Key Point Shifts Embedding Module ($KPSEM$), to adaptively extract channel-wise key point shifts across video frames without key point annotation. Key points are adaptively extracted as the feature points with maximum feature values in split regions, while key point shifts are the spatial displacements of corresponding key points across frames. The key point shifts are encoded as the overall temporal features via linear embedding layers in a multi-set manner. Our method embeds key point shifts at trivial computational cost, reaching state-of-the-art performance of 82.05% on Mini-Kinetics and competitive performance on the UCF101, Something-Something-v1, and HMDB51 datasets.
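A minimal NumPy reading of key point shifts: per channel, take the argmax location of each frame's feature map as the key point and difference consecutive frames. The paper extracts key points within split regions; the whole-map argmax here is a simplification.

```python
import numpy as np

def key_point_shifts(feature_maps):
    """Per channel, take the location of the maximum activation in each
    frame as a key point and return its frame-to-frame displacement."""
    T, C, H, W = feature_maps.shape
    flat = feature_maps.reshape(T, C, H * W).argmax(axis=2)
    ys, xs = flat // W, flat % W                      # (T, C) coordinates
    pts = np.stack([xs, ys], axis=2).astype(float)    # (T, C, 2)
    return pts[1:] - pts[:-1]                         # (T-1, C, 2) shifts

fmaps = np.random.rand(8, 16, 14, 14)   # 8 frames, 16 channels
print(key_point_shifts(fmaps).shape)    # (7, 16, 2)
```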
31. Keypoint-Aligned Embeddings for Image Retrieval and Re-identification [PDF] 返回目录
Olga Moskvyak, Frederic Maire, Feras Dayoub, Mahsa Baktashmotlagh
Abstract: Learning embeddings that are invariant to the pose of the object is crucial in visual image retrieval and re-identification. Existing approaches for person, vehicle, or animal re-identification tasks suffer from high intra-class variance due to deformable shapes and different camera viewpoints. To overcome this limitation, we propose to align the image embedding with a predefined order of the keypoints. The proposed keypoint-aligned embeddings model (KAE-Net) learns part-level features via multi-task learning guided by keypoint locations. More specifically, KAE-Net extracts channels from a feature map activated by a specific keypoint through learning the auxiliary task of heatmap reconstruction for this keypoint. KAE-Net is compact, generic and conceptually simple. It achieves state-of-the-art performance on the benchmark datasets CUB-200-2011, Cars196 and VeRi-776 for retrieval and re-identification tasks.
32. How Do the Hearts of Deep Fakes Beat? Deep Fake Source Detection via Interpreting Residuals with Biological Signals [PDF] 返回目录
Umur Aybars Ciftci, Ilke Demir, Lijun Yin
Abstract: Fake portrait video generation techniques pose a new threat to society, with photorealistic deep fakes used for political propaganda, celebrity imitation, forged evidence, and other identity-related manipulations. Following these generation techniques, several detection approaches have proved useful due to their high classification accuracy. Nevertheless, almost no effort has been spent on tracking down the source of deep fakes. We propose an approach not only to separate deep fakes from real videos, but also to discover the specific generative model behind a deep fake. Some purely deep-learning-based approaches try to classify deep fakes using CNNs, where the networks actually learn the residuals of the generator. We believe that these residuals contain more information, and that we can reveal these manipulation artifacts by disentangling them with biological signals. Our key observation is that the spatiotemporal patterns in biological signals can be conceived as a representative projection of the residuals. To justify this observation, we extract PPG cells from real and fake videos and feed them to a state-of-the-art classification network for detecting the generative model of each video. Our results indicate that our approach can detect fake videos with 97.29% accuracy, and the source model with 93.39% accuracy.
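A crude proxy for the biological-signal extraction step is the mean green-channel intensity of a skin patch over time, whose spectrum peaks in the heart-rate band for real faces; deep-fake generators tend to disrupt this periodic signal. This is a generic remote-PPG sketch, not the paper's PPG-cell construction.

```python
import numpy as np

def ppg_signal(frames, box):
    """Crude PPG proxy: mean green-channel intensity of a skin region over
    time, plus its magnitude spectrum."""
    y0, y1, x0, x1 = box
    sig = frames[:, y0:y1, x0:x1, 1].mean(axis=(1, 2))   # green channel
    sig = sig - sig.mean()                                # remove DC offset
    spectrum = np.abs(np.fft.rfft(sig))                   # heart-rate band peaks
    return sig, spectrum

video = np.random.rand(150, 64, 64, 3)    # 5 s of 30 fps face crops (toy data)
sig, spec = ppg_signal(video, (16, 48, 16, 48))
print(sig.shape, spec.argmax())
```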
33. Discriminative Cross-Domain Feature Learning for Partial Domain Adaptation [PDF] 返回目录
Taotao Jing, Ming Shao, Zhengming Ding
Abstract: Partial domain adaptation aims to adapt knowledge from a larger and more diverse source domain to a smaller target domain with fewer classes, and has attracted considerable attention. Recent practice on domain adaptation extracts effective features by incorporating pseudo labels for the target domain to better fight off cross-domain distribution divergences. However, it is essential to align target data with only a small set of source data. In this paper, we develop a novel Discriminative Cross-Domain Feature Learning (DCDF) framework to iteratively optimize target labels with a cross-domain graph in a weighted scheme. Specifically, a weighted cross-domain center loss and weighted cross-domain graph propagation are proposed to couple unlabeled target data to related source samples for discriminative cross-domain feature learning, where irrelevant source centers are ignored, alleviating marginal and conditional disparities simultaneously. Experimental evaluations on several popular benchmarks demonstrate the effectiveness of our approach in facilitating recognition for the unlabeled target domain, through comparison with state-of-the-art partial domain adaptation approaches.
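One plausible form of the weighted cross-domain center loss: pull target features toward source class centers in proportion to their pseudo-label probabilities, with per-class weights that can suppress source classes absent from the partial target domain. All names and the exact weighting below are assumptions.

```python
import numpy as np

def weighted_center_loss(target_feats, soft_labels, source_centers, class_weights):
    """Sketch of a weighted cross-domain center loss: pull each unlabeled
    target feature toward source class centers according to its pseudo-label
    probabilities, with per-class weights that can zero out classes assumed
    absent from the (partial) target domain."""
    # (N, K) squared distances from every target feature to every center.
    d2 = ((target_feats[:, None, :] - source_centers[None, :, :]) ** 2).sum(-1)
    return (class_weights[None, :] * soft_labels * d2).sum() / len(target_feats)

N, K, D = 5, 4, 16
feats = np.random.rand(N, D)
probs = np.random.dirichlet(np.ones(K), size=N)    # pseudo-label distribution
centers = np.random.rand(K, D)
w = np.array([1.0, 1.0, 0.0, 0.0])                 # classes 2, 3 not in target
print(weighted_center_loss(feats, probs, centers, w))
```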
34. Generating Handwriting via Decoupled Style Descriptors [PDF]
Atsunobu Kotani, Stefanie Tellex, James Tompkin
Abstract: Representing a space of handwriting stroke styles includes the challenge of representing both the style of each character and the overall style of the human writer. Existing VRNN approaches to representing handwriting often do not distinguish between these different style components, which can reduce model capability. Instead, we introduce the Decoupled Style Descriptor (DSD) model for handwriting, which factors both character- and writer-level styles and allows our model to represent an overall greater space of styles. This approach also increases flexibility: given a few examples, we can generate handwriting in new writer styles, and also generate handwriting of new characters across writer styles. In experiments, our generated results were preferred over a state-of-the-art baseline method 88% of the time, and in a writer identification task on 20 held-out writers, our DSDs achieved 89.38% accuracy from a single sample word. Overall, DSDs allow us to improve both quality and flexibility over existing handwriting stroke generation approaches.
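The factoring idea can be sketched as a small module that conditions a writer-level style vector on a character embedding to produce a character-specific style code; the module name and dimensions here are hypothetical, and the paper's full generator is considerably more involved.

```python
import torch
import torch.nn as nn

class DecoupledStyle(nn.Module):
    """Illustrative only: derive a character-specific style code by
    conditioning a writer-level style vector on a character embedding,
    so writer and character styles factor apart."""
    def __init__(self, writer_dim=64, char_dim=32, style_dim=64):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Linear(writer_dim + char_dim, 128), nn.ReLU(),
            nn.Linear(128, style_dim),
        )

    def forward(self, writer_style, char_embedding):
        # writer_style: (b, writer_dim), the writer's global style
        # char_embedding: (b, char_dim), which character to draw
        return self.mix(torch.cat([writer_style, char_embedding], dim=-1))
```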
35. Detection of Genuine and Posed Facial Expressions of Emotion: A Review [PDF]
Shan Jia, Shuo Wang, Chuanbo Hu, Paula Webster, Xin Li
Abstract: Facial expressions of emotion play an important role in human social interactions. However, posed acting is not always the same as genuine feeling. Therefore, the credibility assessment of facial expressions, namely, the discrimination of genuine (spontaneous) expressions from posed (deliberate/volitional/deceptive) ones, is a crucial yet challenging task in facial expression understanding. Rapid progress has been made in recent years in the automatic detection of genuine and posed facial expressions. This paper presents a general review of the relevant research, including several spontaneous vs. posed (SVP) facial expression databases and various computer vision based detection methods. In addition, a variety of factors that influence the performance of SVP detection methods are discussed, along with open issues and technical challenges.
36. SNE-RoadSeg: Incorporating Surface Normal Information into Semantic Segmentation for Accurate Freespace Detection [PDF]
Rui Fan, Hengli Wang, Peide Cai, Ming Liu
Abstract: Freespace detection is an essential component of visual perception for self-driving cars. The recent efforts made in data-fusion convolutional neural networks (CNNs) have significantly improved semantic driving scene segmentation. Freespace can be hypothesized as a ground plane, on which the points have similar surface normals. Hence, in this paper, we first introduce a novel module, named surface normal estimator (SNE), which can infer surface normal information from dense depth/disparity images with high accuracy and efficiency. Furthermore, we propose a data-fusion CNN architecture, referred to as RoadSeg, which can extract and fuse features from both RGB images and the inferred surface normal information for accurate freespace detection. For research purposes, we publish a large-scale synthetic freespace detection dataset, named Ready-to-Drive (R2D) road dataset, collected under different illumination and weather conditions. The experimental results demonstrate that our proposed SNE module can benefit all the state-of-the-art CNNs for freespace detection, and our SNE-RoadSeg achieves the best overall performance among different datasets.
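The geometric intuition behind the SNE module can be sketched in a few lines of numpy: back-project pixels with the camera intrinsics and take the cross product of the local tangent vectors. This is only the textbook baseline; the paper's estimator refines it for accuracy and efficiency.

```python
import numpy as np

def normals_from_depth(depth, fx, fy, cx, cy):
    """Minimal sketch of surface-normal estimation from a dense depth map.
    depth: (H, W) metric depth; fx, fy, cx, cy: pinhole camera intrinsics."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project each pixel to a 3D point in the camera frame.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    pts = np.stack([x, y, depth], axis=-1)                   # (H, W, 3)
    # Tangent vectors from image-space gradients of the point cloud.
    du = np.gradient(pts, axis=1)
    dv = np.gradient(pts, axis=0)
    n = np.cross(du, dv)                                     # (H, W, 3)
    n /= (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)  # unit normals
    return n
```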
37. Synthetic Sample Selection via Reinforcement Learning [PDF]
Jiarong Ye, Yuan Xue, L. Rodney Long, Sameer Antani, Zhiyun Xue, Keith Cheng, Xiaolei Huang
Abstract: Synthesizing realistic medical images provides a feasible solution to the shortage of training data in deep learning based medical image recognition systems. However, the quality control of synthetic images for data augmentation purposes is under-investigated, and some of the generated images are not realistic and may contain misleading features that distort data distribution when mixed with real images. Thus, the effectiveness of those synthetic images in medical image recognition systems cannot be guaranteed when they are being added randomly without quality assurance. In this work, we propose a reinforcement learning (RL) based synthetic sample selection method that learns to choose synthetic images containing reliable and informative features. A transformer based controller is trained via proximal policy optimization (PPO) using the validation classification accuracy as the reward. The selected images are mixed with the original training data for improved training of image recognition systems. To validate our method, we take the pathology image recognition as an example and conduct extensive experiments on two histopathology image datasets. In experiments on a cervical dataset and a lymph node dataset, the image classification performance is improved by 8.1% and 2.3%, respectively, when utilizing high-quality synthetic images selected by our RL framework. Our proposed synthetic sample selection method is general and has great potential to boost the performance of various medical image recognition systems given limited annotation.
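The reward structure reduces to: apply the controller's keep/drop decisions, retrain the recognizer on the mixed set, and score the selection by validation accuracy. A schematic sketch, where every helper (`select_fn`, `train_fn`, `eval_fn`) and the dataset containers are hypothetical stand-ins rather than the paper's API:

```python
def selection_reward(select_fn, synthetic_pool, train_set, val_set, train_fn, eval_fn):
    """Schematic reward for RL-based synthetic sample selection; the paper
    trains a transformer controller with PPO on this kind of signal."""
    mask = select_fn(synthetic_pool)               # binary keep/drop decision per image
    selected = [img for img, keep in zip(synthetic_pool, mask) if keep]
    model = train_fn(list(train_set) + selected)   # retrain recognizer on the mixed set
    return eval_fn(model, val_set)                 # validation accuracy => PPO reward
```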
38. Likelihood Landscapes: A Unifying Principle Behind Many Adversarial Defenses [PDF]
Fu Lin, Rohit Mittapalli, Prithvijit Chattopadhyay, Daniel Bolya, Judy Hoffman
Abstract: Convolutional Neural Networks have been shown to be vulnerable to adversarial examples, which are known to lie in subspaces close to the normal data but are not naturally occurring and have low probability. In this work, we investigate the potential effect defense techniques have on the geometry of the likelihood landscape, i.e., the likelihood of the input images under the trained model. We first propose a way to visualize the likelihood landscape, leveraging an energy-based model interpretation of discriminative classifiers. Then we introduce a measure to quantify the flatness of the likelihood landscape. We observe that a subset of adversarial defense techniques results in a similar effect of flattening the likelihood landscape. We further explore directly regularizing towards a flat landscape for adversarial robustness.
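Under the energy-based reading of a discriminative classifier, an input's unnormalized log-likelihood can be scored by the log-sum-exp of its class logits, and local flatness can be probed by perturbing the input. A minimal sketch; the perturbation-based flatness probe is illustrative and not necessarily the paper's exact quantity.

```python
import torch

def log_likelihood_proxy(logits):
    """Unnormalized log p(x) under the energy-based interpretation of a
    classifier: log sum_y exp(f(x)[y]), up to a normalizing constant."""
    return torch.logsumexp(logits, dim=-1)

def flatness(model, x, eps=1e-2, n=8):
    """Crude flatness probe: variance of the likelihood proxy under small
    random input perturbations (lower variance => flatter landscape)."""
    vals = [log_likelihood_proxy(model(x + eps * torch.randn_like(x))) for _ in range(n)]
    return torch.stack(vals).var(dim=0)
```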
39. Transductive Information Maximization For Few-Shot Learning [PDF]
Malik Boudiaf, Ziko Imtiaz Masud, Jérôme Rony, José Dolz, Pablo Piantanida, Ismail Ben Ayed
Abstract: We introduce Transductive Information Maximization (TIM) for few-shot learning. Our method maximizes the mutual information between the query features and predictions of a few-shot task, subject to supervision constraints from the support set. Furthermore, we propose a new alternating-direction solver for our mutual-information loss, which substantially speeds up transductive-inference convergence over gradient-based optimization, while demonstrating similar accuracy. Following standard few-shot settings, our comprehensive experiments demonstrate that TIM outperforms state-of-the-art methods significantly across all datasets and networks, while using simple cross-entropy training on the base classes, without resorting to complex meta-learning schemes. It consistently brings between 2% and 5% improvement in accuracy over the best-performing methods, not only on all the well-established few-shot benchmarks, but also in more challenging scenarios with domain shifts and larger numbers of classes.
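The transductive objective combines support-set cross-entropy with a mutual-information term on the query predictions, I(X;Y) = H(Y) - H(Y|X): the marginal entropy encourages balanced class usage while the conditional entropy encourages confident predictions. A PyTorch sketch with illustrative weighting, not the paper's exact normalization:

```python
import torch
import torch.nn.functional as F

def tim_loss(support_logits, support_labels, query_probs, alpha=1.0):
    """Sketch of a transductive information-maximization objective.
    query_probs: (n, c) softmax predictions for the unlabeled query set."""
    ce = F.cross_entropy(support_logits, support_labels)        # supervision on support set
    p_mean = query_probs.mean(dim=0)                            # marginal label distribution
    h_marg = -(p_mean * torch.log(p_mean + 1e-12)).sum()        # high => diverse class usage
    h_cond = -(query_probs * torch.log(query_probs + 1e-12)).sum(dim=1).mean()  # low => confident
    return ce - alpha * (h_marg - h_cond)  # minimizing maximizes I(X;Y) = H(Y) - H(Y|X)
```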
40. Multi-Face: Self-supervised Multiview Adaptation for Robust Face Clustering in Videos [PDF]
Krishna Somandepalli, Rajat Hebbar, Shrikanth Narayanan
Abstract: Robust face clustering is a key step towards computational understanding of visual character portrayals in media. Face clustering for long-form content such as movies is challenging because of variations in appearance and the lack of large-scale labeled video resources. However, local face tracking in videos can be used to mine samples belonging to the same/different persons by examining the faces co-occurring in a video frame. In this work, we use this idea of self-supervision to harvest large amounts of weakly labeled face tracks in movies. We propose a nearest-neighbor search in the embedding space to mine hard examples from the face tracks, followed by domain adaptation using multiview shared subspace learning. Our benchmarking on movie datasets demonstrates the robustness of multiview adaptation for face verification and clustering. We hope that the large-scale data resources developed in this work can further advance automatic character labeling in videos.
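The self-supervision constraint translates directly into training pairs: detections within one face track share an identity, while two tracks co-occurring in the same frame cannot. A sketch using hypothetical containers (`tracks`: track_id -> face crops; `frames`: frame_id -> visible track_ids):

```python
def mine_pairs(tracks, frames):
    """Sketch of self-supervised pair mining from face tracks; both input
    containers are stand-ins for the paper's tracking pipeline."""
    positives, negatives = [], []
    for faces in tracks.values():        # within a track: same identity
        for i in range(len(faces)):
            for j in range(i + 1, len(faces)):
                positives.append((faces[i], faces[j]))
    for ids in frames.values():          # co-occurring tracks: different identities
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                negatives.append((ids[i], ids[j]))
    return positives, negatives
```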
41. Deep Neural Network for 3D Surface Segmentation based on Contour Tree Hierarchy [PDF]
Wenchong He, Arpan Man Sainju, Zhe Jiang, Da Yan
Abstract: Given a 3D surface defined by an elevation function on a 2D grid as well as non-spatial features observed at each pixel, the problem of surface segmentation aims to classify pixels into contiguous classes based on both non-spatial features and surface topology. The problem has important applications in hydrology, planetary science, and biochemistry but is uniquely challenging for several reasons. First, the spatial extent of class segments follows surface contours in the topological space, regardless of their spatial shapes and directions. Second, the topological structure exists in multiple spatial scales based on different surface resolutions. Existing widely successful deep learning models for image segmentation are often not applicable due to their reliance on convolution and pooling operations to learn regular structural patterns on a grid. In contrast, we propose to represent surface topological structure by a contour tree skeleton, which is a polytree capturing the evolution of surface contours at different elevation levels. We further design a graph neural network based on the contour tree hierarchy to model surface topological structure at different spatial scales. Experimental evaluations based on real-world hydrological datasets show that our model outperforms several baseline methods in classification accuracy.
42. Temporal Action Localization with Variance-Aware Networks [PDF]
Ting-Ting Xie, Christos Tzelepis, Ioannis Patras
Abstract: This work addresses the problem of temporal action localization with Variance-Aware Networks (VAN), i.e., DNNs that use second-order statistics in the input and/or the output of regression tasks. We first propose a network (VANp) that, when presented with the second-order statistics of the input, i.e., each sample has a mean and a variance, propagates the mean and the variance throughout the network to deliver outputs with second-order statistics. In this framework, both the input and the output can be interpreted as Gaussians. To do so, we derive differentiable analytic solutions, or reasonable approximations, to propagate across commonly used NN layers. To train the network, we define a differentiable loss based on the KL-divergence between the predicted Gaussian and a Gaussian around the ground-truth action borders, and use standard back-propagation. Importantly, the variance propagation in VANp does not require any additional parameters, and during testing it does not require any additional computations either. In action localization, the means and the variances of the input are computed at pooling operations, which are typically used to bring arbitrarily long videos to a vector with fixed dimensions. Second, we propose two alternative formulations that augment the first (respectively, the last) layer of a regression network with additional parameters so as to take means and variances in the input (respectively, predict them in the output). Results on the action localization problem show that the incorporation of second-order statistics improves over the baseline network, and that VANp surpasses the accuracy of virtually all other two-stage networks without involving any additional parameters.
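For an affine layer the moment propagation is exact: with input mean mu and independent per-component variance var, y = Wx + b gives E[y] = W mu + b and Var[y]_i = sum_j W_ij^2 var_j. A sketch of this affine case only; nonlinearities and pooling require the analytic solutions or approximations derived in the paper.

```python
import torch

def linear_with_variance(weight: torch.Tensor, bias: torch.Tensor,
                         mean: torch.Tensor, var: torch.Tensor):
    """Exact moment propagation through an affine layer, assuming independent
    input components with diagonal covariance (a sketch of the affine case).
    weight: (out, in), bias: (out,), mean/var: (batch, in)."""
    out_mean = mean @ weight.t() + bias   # E[Wx + b] = W mu + b
    out_var = var @ (weight ** 2).t()     # Var[Wx + b]_i = sum_j W_ij^2 var_j
    return out_mean, out_var
```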
43. HoloLens 2 Research Mode as a Tool for Computer Vision Research [PDF]
Dorin Ungureanu, Federica Bogo, Silvano Galliani, Pooja Sama, Xin Duan, Casey Meekhof, Jan Stühmer, Thomas J. Cashman, Bugra Tekin, Johannes L. Schönberger, Pawel Olszta, Marc Pollefeys
Abstract: Mixed reality headsets, such as the Microsoft HoloLens 2, are powerful sensing devices with integrated compute capabilities, which makes them an ideal platform for computer vision research. In this technical report, we present HoloLens 2 Research Mode, an API and a set of tools enabling access to the raw sensor streams. We provide an overview of the API and explain how it can be used to build mixed reality applications based on processing sensor data. We also show how to combine the Research Mode sensor data with the built-in eye and hand tracking capabilities provided by HoloLens 2. By releasing the Research Mode API and a set of open-source tools, we aim to foster further research in the fields of computer vision as well as robotics, and to encourage contributions from the research community.
44. A Hidden Markov Tree Model for Flood Extent Mapping in Heavily Vegetated Areas based on High Resolution Aerial Imagery and DEM: A Case Study on Hurricane Matthew Floods [PDF]
Zhe Jiang, Arpan Man Sainju
Abstract: Flood extent mapping plays a crucial role in disaster management and national water forecasting. In recent years, high-resolution optical imagery has become increasingly available with the deployment of numerous small satellites and drones. However, analyzing such imagery data to extract flood extent poses unique challenges due to rich noise and shadows, obstacles (e.g., tree canopies, clouds), and spectral confusion between pixel classes (flood, dry) caused by spatial heterogeneity. Existing machine learning techniques often focus on spectral and spatial features from raster images without fully incorporating the geographic terrain within classification models. In contrast, we recently proposed a novel machine learning model called geographical hidden Markov tree that integrates spectral features of pixels and topographic constraints from Digital Elevation Model (DEM) data (i.e., water flow directions) in a holistic manner. This paper evaluates the model through case studies on high-resolution aerial imagery from the National Oceanic and Atmospheric Administration (NOAA) National Geodetic Survey together with DEM. Three scenes are selected in heavily vegetated floodplains near the cities of Grimesland and Kinston in North Carolina during the Hurricane Matthew floods in 2016. Results show that the proposed hidden Markov tree model outperforms several state-of-the-art machine learning algorithms (e.g., random forests, gradient-boosted models), improving the F-score (the harmonic mean of the user's accuracy and producer's accuracy) from around 70%-80% to over 95% on our datasets.
45. Orientation-Disentangled Unsupervised Representation Learning for Computational Pathology [PDF]
Maxime W. Lafarge, Josien P.W. Pluim, Mitko Veta
Abstract: Unsupervised learning enables modeling complex images without the need for annotations. The representation learned by such models can facilitate any subsequent analysis of large image datasets. However, some generative factors that cause irrelevant variations in images can potentially get entangled in such a learned representation, risking a negative effect on any subsequent use. The orientation of imaged objects, for instance, is often arbitrary/irrelevant, so it may be desirable to learn a representation in which the orientation information is disentangled from all other factors. Here, we propose to extend the Variational Auto-Encoder framework by leveraging the group structure of rotation-equivariant convolutional networks to learn orientation-wise disentangled generative factors of histopathology images. This way, we enforce a novel partitioning of the latent space, such that oriented and isotropic components get separated. We evaluated this structured representation on a dataset that consists of tissue regions for which nuclear pleomorphism and mitotic activity were assessed by expert pathologists. We show that the trained models efficiently disentangle the inherent orientation information of single-cell images. In comparison to classical approaches, the resulting aggregated representation of sub-populations of cells produces higher performance in subsequent tasks.
46. Analysis of deep machine learning algorithms in COVID-19 disease diagnosis [PDF]
Samir S. Yadav, Mininath R. Bendre, Pratap S. Vikhe, Shivajirao M. Jadhav
Abstract: The aim of this work is to use deep neural network models for solving the problem of image recognition. These days, every human being is threatened by the harmful coronavirus disease, also called COVID-19. The spread of coronavirus affects the economies of many countries in the world. Finding COVID-19 patients early is essential to avoid the spread and harm to society. Pathological tests and computed tomography (CT) scans are helpful for the diagnosis of COVID-19. However, these tests have drawbacks, such as a large number of false positives, and they are expensive. Hence, an easy, accurate, and less expensive way to detect the harmful COVID-19 disease is required. Chest X-rays can be useful for the detection of this disease. Therefore, in this work, chest X-ray images are used for the diagnosis of suspected COVID-19 patients using modern machine learning techniques. The analysis of the results is carried out and conclusions are drawn about the effectiveness of deep machine learning algorithms in image recognition problems.
47. 3D Semantic Segmentation of Brain Tumor for Overall Survival Prediction [PDF]
Rupal Agravat, Mehul S Raval
Abstract: Glioma, a malignant brain tumor, requires immediate treatment to improve the survival of patients. The heterogeneous nature of gliomas makes segmentation difficult, especially for sub-regions like necrosis, enhancing tumor, non-enhancing tumor, and edema. Deep neural networks such as fully convolutional neural networks and ensembles of fully convolutional neural networks have been successful for glioma segmentation. This paper demonstrates the use of a 3D fully convolutional neural network with a three-layer encoder-decoder arrangement. The encoder blocks include dense modules, and the decoder blocks include convolution modules. The inputs to the network are 3D patches. The loss function combines the Dice loss and the focal loss. The validation-set Dice scores of the network are 0.74, 0.88, and 0.73 for enhancing tumor, whole tumor, and tumor core, respectively. A Random Forest Regressor uses shape, volumetric, and age features extracted from the ground truth for overall survival prediction. The regressor achieves an accuracy of 44.8% on the validation set.
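A combined Dice-plus-focal objective for a binary segmentation head might look as follows; the paper's multi-class formulation and weighting may differ.

```python
import torch
import torch.nn.functional as F

def dice_focal_loss(logits, target, gamma=2.0, eps=1e-6):
    """Sketch of a Dice + focal objective for binary segmentation.
    logits, target: same shape; target is a float tensor of 0/1 labels."""
    p = torch.sigmoid(logits)
    # Soft Dice loss: 1 - 2*intersection / (|P| + |G|)
    inter = (p * target).sum()
    dice = 1 - (2 * inter + eps) / (p.sum() + target.sum() + eps)
    # Focal loss: down-weight easy pixels by (1 - p_t)^gamma
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    focal = ((1 - p_t) ** gamma * bce).mean()
    return dice + focal
```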
48. On the Composition and Limitations of Publicly Available COVID-19 X-Ray Imaging Datasets [PDF]
Beatriz Garcia Santa Cruz, Jan Sölter, Matias Nicolas Bossa, Andreas Dominik Husch
Abstract: Machine learning based methods for diagnosis and progression prediction of COVID-19 from imaging data have gained significant attention in recent months, in particular through the use of deep learning models. In this context, hundreds of models were proposed, the majority of them trained on public datasets. Data scarcity, mismatch between the training and target populations, group imbalance, and lack of documentation are important sources of bias, hindering the applicability of these models to real-world clinical practice. Considering that datasets are an essential part of model building and evaluation, a deeper understanding of the current landscape is needed. This paper presents an overview of the currently available public COVID-19 chest X-ray datasets. Each dataset is briefly described, and potential strengths, limitations, and interactions between datasets are identified. In particular, some key properties of current datasets that could be potential sources of bias, impairing models trained on them, are pointed out. These descriptions are useful for model building on those datasets, for choosing the best dataset according to the model's goal, for taking into account the specific limitations to avoid reporting overconfident benchmark results, and for discussing their impact on the generalisation capabilities in a specific clinical setting.
49. Towards Structured Prediction in Bioinformatics with Deep Learning [PDF]
Yu Li
Abstract: Using machine learning, especially deep learning, to facilitate biological research is a fascinating research direction. However, in addition to the standard classification or regression problems, in bioinformatics, we often need to predict more complex structured targets, such as 2D images and 3D molecular structures. The above complex prediction tasks are referred to as structured prediction. Structured prediction is more complicated than the traditional classification but has much broader applications, considering that most of the original bioinformatics problems have complex output objects. Due to the properties of those structured prediction problems, such as having problem-specific constraints and dependency within the labeling space, the straightforward application of existing deep learning models can lead to unsatisfactory results. Here, we argue that the following ideas can help resolve structured prediction problems in bioinformatics. Firstly, we can combine deep learning with other classic algorithms, such as probabilistic graphical models, which model the problem structure explicitly. Secondly, we can design the problem-specific deep learning architectures or methods by considering the structured labeling space and problem constraints, either explicitly or implicitly. We demonstrate our ideas with six projects from four bioinformatics subfields, including sequencing analysis, structure prediction, function annotation, and network analysis. The structured outputs cover 1D signals, 2D images, 3D structures, hierarchical labeling, and heterogeneous networks. With the help of the above ideas, all of our methods can achieve SOTA performance on the corresponding problems. The success of these projects motivates us to extend our work towards other more challenging but important problems, such as health-care problems, which can directly benefit people's health and wellness.
50. Disentangled Representations for Domain-generalized Cardiac Segmentation [PDF]
Xiao Liu, Spyridon Thermos, Agisilaos Chartsias, Alison O'Neil, Sotirios A. Tsaftaris
Abstract: Robust cardiac image segmentation is still an open challenge due to the inability of existing methods to achieve satisfactory performance on unseen data from different domains. Since the acquisition and annotation of medical data are costly and time-consuming, recent work focuses on domain adaptation and generalization to bridge the gap between data from different populations and scanners. In this paper, we propose two data augmentation methods that focus on improving the domain adaptation and generalization abilities of state-of-the-art cardiac segmentation models. In particular, our "Resolution Augmentation" method generates more diverse data by rescaling images to different resolutions within a range spanning different scanner protocols. Subsequently, our "Factor-based Augmentation" method generates more diverse data by projecting the original samples onto disentangled latent spaces and combining the learned anatomy and modality factors from different domains. Our extensive experiments demonstrate the importance of efficient adaptation between seen and unseen domains, as well as model generalization ability, for robust cardiac image segmentation.
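"Resolution Augmentation" amounts to resampling each image to a randomly drawn resolution and back, so that batch shapes stay fixed; a sketch with an illustrative scale range (the paper's range is tied to the scanner protocols it spans):

```python
import random
import torch.nn.functional as F

def resolution_augment(image, scales=(0.5, 0.75, 1.0, 1.25, 1.5)):
    """Sketch of resolution augmentation for a (B, C, H, W) batch:
    downsample or upsample by a random factor, then resize back."""
    _, _, h, w = image.shape
    s = random.choice(scales)
    low = F.interpolate(image, scale_factor=s, mode="bilinear", align_corners=False)
    return F.interpolate(low, size=(h, w), mode="bilinear", align_corners=False)
```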
51. Detection of Retinal Blood Vessels by using Gabor filter with Entropic threshold [PDF] 返回目录
Mohamed. I. Waly, Ahmed El-Hossiny
Abstract: Diabetic retinopathy is a leading cause of visual impairment. This paper introduces an automated strategy to detect and remove the blood vessels. Locating the blood vessels is the fundamental step in the detection of diabetic retinopathy, because the blood vessels are the characteristic features of the retinal image. Locating the blood vessels can help ophthalmologists recognize the diseases earlier and faster. The blood vessels are detected and removed by applying a Gabor filter on two freely accessible retinal databases, STARE and DRIVE. The accuracy of the segmentation algorithm is assessed quantitatively by comparing the manually segmented images with the corresponding output images; vessel pixel segmentation using a Gabor filter with an entropic threshold yields better vessel detection with a lower false positive rate.
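A minimal sketch of the pipeline named here: an oriented Gabor filter bank to enhance line-like vessels, followed by an entropy-based (Kapur) threshold. The kernel parameters and the synthetic input are illustrative guesses, not the paper's settings.

```python
# Gabor filter bank + maximum-entropy (Kapur) threshold for vessel maps.
import cv2
import numpy as np

def gabor_response(gray: np.ndarray, n_orientations: int = 8) -> np.ndarray:
    """Max response over oriented Gabor kernels; vessels respond strongly."""
    resp = np.zeros_like(gray, dtype=np.float32)
    for k in range(n_orientations):
        theta = k * np.pi / n_orientations
        kern = cv2.getGaborKernel((15, 15), sigma=3.0, theta=theta,
                                  lambd=8.0, gamma=0.5, psi=0)
        resp = np.maximum(resp, cv2.filter2D(gray.astype(np.float32), -1, kern))
    return resp

def kapur_threshold(img: np.ndarray) -> int:
    """Entropic threshold: maximise foreground + background entropy."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    best_t, best_h = 0, -np.inf
    for t in range(1, 255):
        p0, p1 = p[:t].sum(), p[t:].sum()
        if p0 <= 0 or p1 <= 0:
            continue
        q0, q1 = p[:t][p[:t] > 0] / p0, p[t:][p[t:] > 0] / p1
        h = -np.sum(q0 * np.log(q0)) - np.sum(q1 * np.log(q1))
        if h > best_h:
            best_t, best_h = t, h
    return best_t

gray = (np.random.rand(128, 128) * 255).astype(np.uint8)  # stand-in for a fundus image
resp = cv2.normalize(gabor_response(gray), None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
vessels = resp > kapur_threshold(resp)
```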
52. Selective Particle Attention: Visual Feature-Based Attention in Deep Reinforcement Learning [PDF] 返回目录
Sam Blakeman, Denis Mareschal
Abstract: The human brain uses selective attention to filter perceptual input so that only the components that are useful for behaviour are processed using its limited computational resources. We focus on one particular form of visual attention known as feature-based attention, which is concerned with identifying features of the visual input that are important for the current task regardless of their spatial location. Visual feature-based attention has been proposed to improve the efficiency of Reinforcement Learning (RL) by reducing the dimensionality of state representations and guiding learning towards relevant features. Despite achieving human level performance in complex perceptual-motor tasks, Deep RL algorithms have been consistently criticised for their poor efficiency and lack of flexibility. Visual feature-based attention therefore represents one option for addressing these criticisms. Nevertheless, it is still an open question how the brain is able to learn which features to attend to during RL. To help answer this question we propose a novel algorithm, termed Selective Particle Attention (SPA), which imbues a Deep RL agent with the ability to perform selective feature-based attention. SPA learns which combinations of features to attend to based on their bottom-up saliency and how accurately they predict future reward. We evaluate SPA on a multiple choice task and a 2D video game that both involve raw pixel input and dynamic changes to the task structure. We show various benefits of SPA over approaches that naively attend to either all or random subsets of features. Our results demonstrate (1) how visual feature-based attention in Deep RL models can improve their learning efficiency and ability to deal with sudden changes in task structure and (2) that particle filters may represent a viable computational account of how visual feature-based attention occurs in the brain.
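A minimal sketch of the particle-filter intuition behind SPA: treat each particle as a binary mask over feature channels, weight particles by a mix of bottom-up saliency and how well the attended features predicted reward, then resample with mutation. This is purely illustrative; the weighting scheme, mutation rate, and all names are assumptions, not the authors' implementation.

```python
# Particle filter over feature subsets: resampling concentrates particles
# on feature combinations that are salient and predictive of reward.
import numpy as np

rng = np.random.default_rng(0)
n_particles, n_features = 64, 32
particles = rng.random((n_particles, n_features)) < 0.5   # binary masks

def step(saliency, reward_error, particles, alpha=0.5):
    """saliency: (n_features,) bottom-up scores; reward_error: (n_particles,)
    reward-prediction error when attending with each mask (lower is better)."""
    w = alpha * (particles @ saliency) - (1 - alpha) * reward_error
    w = np.exp(w - w.max()); w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)   # resample
    new = particles[idx].copy()
    flip = rng.random(new.shape) < 0.02                          # mutation
    return np.where(flip, ~new, new)

particles = step(rng.random(n_features), rng.random(n_particles), particles)
```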
53. DRR4Covid: Learning Automated COVID-19 Infection Segmentation from Digitally Reconstructed Radiographs [PDF] 返回目录
Pengyi Zhang, Yunxin Zhong, Yulin Deng, Xiaoying Tang, Xiaoqiong Li
Abstract: Automated infection measurement and COVID-19 diagnosis based on Chest X-ray (CXR) imaging is important for faster examination. We propose a novel approach, called DRR4Covid, to learn automated COVID-19 diagnosis and infection segmentation on CXRs from digitally reconstructed radiographs (DRRs). DRR4Covid comprises an infection-aware DRR generator, a classification and/or segmentation network, and a domain adaptation module. The infection-aware DRR generator is able to produce DRRs with adjustable strength of radiological signs of COVID-19 infection, and generate pixel-level infection annotations that match the DRRs precisely. The domain adaptation module is introduced to reduce the domain discrepancy between DRRs and CXRs by training networks on unlabeled real CXRs and labeled DRRs together. We provide a simple but effective implementation of DRR4Covid by using a domain adaptation module based on Maximum Mean Discrepancy (MMD), and an FCN-based network with a classification header and a segmentation header. Extensive experimental results have confirmed the efficacy of our method; specifically, quantifying the performance by accuracy, AUC and F1-score, our network, without using any annotations from CXRs, has achieved a classification score of (0.954, 0.989, 0.953) and a segmentation score of (0.957, 0.981, 0.956) on a test set with 794 normal cases and 794 positive cases. Besides, we estimate the sensitivity of X-ray images in detecting COVID-19 infection by adjusting the strength of radiological signs of COVID-19 infection in synthetic DRRs. The estimated detection limit of the proportion of infected voxels in the lungs is 19.43%, and the estimated lower bound on the contribution rate of infected voxels for significant radiological signs of COVID-19 infection is 20.0%. Our codes will be made publicly available at this https URL.
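A minimal sketch of the MMD term used for domain adaptation between DRR and CXR feature batches. The RBF kernel, its bandwidth, and the biased (V-statistic) estimator are common defaults assumed here rather than taken from the paper.

```python
# Squared Maximum Mean Discrepancy with an RBF kernel: penalising this loss
# pulls the source (DRR) and target (CXR) feature distributions together.
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """Biased (V-statistic) squared MMD between batches x:(n,d), y:(m,d)."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# toy usage with hypothetical 128-dim features
src = np.random.randn(16, 128)   # labeled DRR features
tgt = np.random.randn(16, 128)   # unlabeled real CXR features
loss_da = rbf_mmd2(src, tgt)     # add to the task loss during training
```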
54. Multi-Dimension Fusion Network for Light Field Spatial Super-Resolution using Dynamic Filters [PDF] 返回目录
Qingyan Sun, Shuo Zhang, Song Chang, Lixi Zhu, Youfang Lin
Abstract: Light field cameras have proven to be powerful tools for 3D reconstruction and virtual reality applications. However, the limited resolution of light field images brings many difficulties for further information display and extraction. In this paper, we introduce a novel learning-based framework to improve the spatial resolution of light fields. First, features from different dimensions are extracted in parallel and fused in our multi-dimension fusion architecture. These features are then used to generate dynamic filters, which extract subpixel information from micro-lens images and also implicitly consider the disparity information. Finally, high-frequency details learned in the residual branch are added to the upsampled images, and the final super-resolved light fields are obtained. Experimental results show that the proposed method uses fewer parameters but achieves better performance than other state-of-the-art methods on various datasets. Our reconstructed images also show sharp details and distinct lines in both sub-aperture images and epipolar plane images.
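A minimal sketch of how dynamic filters are applied: a network head predicts a k×k kernel for every output pixel, and each output value is the inner product of that kernel with the local input patch. The shapes, the reflect padding, and the per-kernel normalisation are illustrative assumptions.

```python
# Apply per-pixel (dynamic) filters predicted by a network to an image.
import numpy as np

def apply_dynamic_filters(img: np.ndarray, filters: np.ndarray, k: int = 3) -> np.ndarray:
    """img: (H, W); filters: (H, W, k*k), e.g. predicted by a CNN head."""
    H, W = img.shape
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.zeros_like(img, dtype=np.float32)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k].ravel()   # local neighbourhood
            out[i, j] = patch @ filters[i, j]          # pixel-specific kernel
    return out

H, W, k = 32, 32, 3
img = np.random.rand(H, W).astype(np.float32)
filters = np.random.rand(H, W, k * k).astype(np.float32)
filters /= filters.sum(-1, keepdims=True)              # normalise each kernel
sr_like = apply_dynamic_filters(img, filters, k)
```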
55. Better Than Reference In Low Light Image Enhancement: Conditional Re-Enhancement Networks [PDF] 返回目录
Yu Zhang, Xiaoguang Di, Bin Zhang, Ruihang Ji, Chunhui Wang
Abstract: Low light images suffer from severe noise, low brightness, and low contrast. In previous research, many image enhancement methods have been proposed, but few can deal with these problems simultaneously. In this paper, to solve these problems simultaneously, we propose a low light image enhancement method that can be combined with supervised learning and previous HSV (Hue, Saturation, Value) or Retinex-model-based image enhancement methods. First, we analyse the relationship between the HSV color space and the Retinex theory, and show that the V channel of the enhanced image (the V channel in HSV color space equals the maximum of the RGB channels) can well represent the contrast and brightness enhancement process. Then, a data-driven conditional re-enhancement network (denoted as CRENet) is proposed. The network takes low light images as input and the enhanced V channel as a condition; it can then re-enhance the contrast and brightness of the low light image while reducing noise and color distortion. It should be noted that during training, any paired images with different exposure times can be used, and there is no need to carefully select the supervised images, which saves considerable effort. In addition, it takes less than 20 ms to process a color image with a resolution of 400×600 on a 2080Ti GPU. Finally, comparative experiments are implemented to prove the effectiveness of the method. The results show that the proposed method can significantly improve the quality of the enhanced image, and by combining it with other image contrast enhancement methods, the final enhancement result can even surpass the reference image in contrast and brightness. (Code will be available at this https URL)
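A minimal sketch of the HSV observation the method builds on: the V (value) channel is the per-pixel maximum over R, G, and B, so any brightness enhancement can be expressed on V alone and fed to the network as a condition. The gamma enhancement and the conditioning step shown are schematic assumptions, not CRENet itself.

```python
# V channel extraction and a stand-in brightness enhancement on V.
import numpy as np

def v_channel(rgb: np.ndarray) -> np.ndarray:
    """rgb: (H, W, 3) float image in [0, 1]; returns V = max(R, G, B)."""
    return rgb.max(axis=-1)

def gamma_enhance_v(v: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """Any existing enhancement can act on V; gamma correction is one example."""
    return np.clip(v, 1e-6, 1.0) ** gamma

low = np.random.rand(8, 8, 3) * 0.2            # toy low-light image
v_enh = gamma_enhance_v(v_channel(low))        # condition fed to the network
# network_input = np.concatenate([low, v_enh[..., None]], axis=-1)  # schematic
```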
Note: the cover image is a word cloud of the paper titles.