[arXiv Papers] Computer Vision and Pattern Recognition 2020-08-12

Contents

1. Learning to See Through Obstructions with Layered Decomposition [PDF] Abstract
2. Adversarial Generative Grammars for Human Activity Prediction [PDF] Abstract
3. Hardware-Centric AutoML for Mixed-Precision Quantization [PDF] Abstract
4. A Boundary Based Out-of-Distribution Classifier for Generalized Zero-Shot Learning [PDF] Abstract
5. BREEDS: Benchmarks for Subpopulation Shift [PDF] Abstract
6. KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue [PDF] Abstract
7. GeLaTO: Generative Latent Textured Objects [PDF] Abstract
8. TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection [PDF] Abstract
9. Exposing Deep-faked Videos by Anomalous Co-motion Pattern Detection [PDF] Abstract
10. TransNet V2: An effective deep network architecture for fast shot transition detection [PDF] Abstract
11. Detecting Urban Dynamics Using Deep Siamese Convolutional Neural Networks [PDF] Abstract
12. Unified Representation Learning for Cross Model Compatibility [PDF] Abstract
13. Learning Stereo Matchability in Disparity Regression Networks [PDF] Abstract
14. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [PDF] Abstract
15. Transfer Learning for Protein Structure Classification and Function Inference at Low Resolution [PDF] Abstract
16. HydraMix-Net: A Deep Multi-task Semi-supervised Learning Approach for Cell Detection and Classification [PDF] Abstract
17. Reinforced Wasserstein Training for Severity-Aware Semantic Segmentation in Autonomous Driving [PDF] Abstract
18. Attention-based 3D Object Reconstruction from a Single Image [PDF] Abstract
19. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction [PDF] Abstract
20. A Study of Efficient Light Field Subsampling and Reconstruction Strategies [PDF] Abstract
21. PROFIT: A Novel Training Method for sub-4-bit MobileNet Models [PDF] Abstract
22. ClimAlign: Unsupervised statistical downscaling of climate variables via normalizing flows [PDF] Abstract
23. Fast and Accurate Optical Flow based Depth Map Estimation from Light Fields [PDF] Abstract
24. Learning to Cluster under Domain Shift [PDF] Abstract
25. Real-Time Sign Language Detection using Human Pose Estimation [PDF] Abstract
26. R-MNet: A Perceptual Adversarial Network for Image Inpainting [PDF] Abstract
27. Fully-Automated Packaging Structure Recognition in Logistics Environments [PDF] Abstract
28. Deep UAV Localization with Reference View Rendering [PDF] Abstract
29. Sharp Multiple Instance Learning for DeepFake Video Detection [PDF] Abstract
30. Rethinking Pseudo-LiDAR Representation [PDF] Abstract
31. Text as Neural Operator: Image Manipulation by Text Instruction [PDF] Abstract
32. TCL: an ANN-to-SNN Conversion with Trainable Clipping Layers [PDF] Abstract
33. Keypoint Autoencoders: Learning Interest Points of Semantics [PDF] Abstract
34. Key-Nets: Optical Transformation Convolutional Networks for Privacy Preserving Vision Sensors [PDF] Abstract
35. Grasping Field: Learning Implicit Representations for Human Grasps [PDF] Abstract
36. Distributed Multi-agent Video Fast-forwarding [PDF] Abstract
37. Measures of Complexity for Large Scale Image Datasets [PDF] Abstract
38. Locating Cephalometric X-Ray Landmarks with Foveated Pyramid Attention [PDF] Abstract
39. Bipartite Graph Reasoning GANs for Person Image Generation [PDF] Abstract
40. Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss [PDF] Abstract
41. Visual Imitation Made Easy [PDF] Abstract
42. A Review on Deep Learning Techniques for the Diagnosis of Novel Coronavirus (COVID-19) [PDF] Abstract
43. 3D FLAT: Feasible Learned Acquisition Trajectories for Accelerated MRI [PDF] Abstract
44. Artificial Intelligence to Assist in Exclusion of Coronary Atherosclerosis during CCTA Evaluation of Chest-Pain in the Emergency Department: Preparing an Application for Real-World Use [PDF] Abstract
45. AtrialJSQnet: A New Framework for Joint Segmentation and Quantification of Left Atrium and Scars Incorporating Spatial and Shape Information [PDF] Abstract
46. The Umbrella software suite for automated asteroid detection [PDF] Abstract
47. Implanting Synthetic Lesions for Improving Liver Lesion Segmentation in CT Exams [PDF] Abstract
48. Left Ventricular Wall Motion Estimation by Active Polynomials for Acute Myocardial Infarction Detection [PDF] Abstract
49. Multi-modal segmentation of 3D brain scans using neural networks [PDF] Abstract
50. Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms [PDF] Abstract
51. PiNet: Attention Pooling for Graph Classification [PDF] Abstract
52. Extension of JPEG XS for Two-Layer Lossless Coding [PDF] Abstract
53. Thick Cloud Removal of Remote Sensing Images Using Temporal Smoothness and Sparsity-Regularized Tensor Optimization [PDF] Abstract
54. SAFRON: Stitching Across the Frontier for Generating Colorectal Cancer Histology Images [PDF] Abstract
55. Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [PDF] Abstract
56. ARPM-net: A novel CNN-based adversarial method with Markov Random Field enhancement for prostate and organs at risk segmentation in pelvic CT images [PDF] Abstract
57. Spatio-temporal Attention Model for Tactile Texture Recognition [PDF] Abstract
58. GANDALF: Generative Adversarial Networks with Discriminator-Adaptive Loss Fine-tuning for Alzheimer's Disease Diagnosis from MRI [PDF] Abstract
59. GANBERT: Generative Adversarial Networks with Bidirectional Encoder Representations from Transformers for MRI to PET synthesis [PDF] Abstract
60. Predicting Risk of Developing Diabetic Retinopathy using Deep Learning [PDF] Abstract

Abstracts

1. Learning to See Through Obstructions with Layered Decomposition [PDF] Back to Contents
  Yu-Lun Liu, Wei-Sheng Lai, Ming-Hsuan Yang, Yung-Yu Chuang, Jia-Bin Huang
Abstract: We present a learning-based approach for removing unwanted obstructions, such as window reflections, fence occlusions or raindrops, from a short sequence of images captured by a moving camera. Our method leverages the motion differences between the background and the obstructing elements to recover both layers. Specifically, we alternate between estimating dense optical flow fields of the two layers and reconstructing each layer from the flow-warped images via a deep convolutional neural network. The learning-based layer reconstruction allows us to accommodate potential errors in the flow estimation and brittle assumptions such as brightness consistency. We show that training on synthetically generated data transfers well to real images. Our results on numerous challenging scenarios of reflection and fence removal demonstrate the effectiveness of the proposed method.

2. Adversarial Generative Grammars for Human Activity Prediction [PDF] Back to Contents
  AJ Piergiovanni, Anelia Angelova, Alexander Toshev, Michael S. Ryoo
Abstract: In this paper we propose an adversarial generative grammar model for future prediction. The objective is to learn a model that explicitly captures temporal dependencies, providing a capability to forecast multiple, distinct future activities. Our adversarial grammar is designed so that it can learn stochastic production rules from the data distribution, jointly with its latent non-terminal representations. Being able to select multiple production rules during inference leads to different predicted outcomes, thus efficiently modeling many plausible futures. The adversarial generative grammar is evaluated on the Charades, MultiTHUMOS, Human3.6M, and 50 Salads datasets and on two activity prediction tasks: future 3D human pose prediction and future activity prediction. The proposed adversarial grammar outperforms the state-of-the-art approaches, being able to predict much more accurately and further in the future, than prior work.

3. Hardware-Centric AutoML for Mixed-Precision Quantization [PDF] Back to Contents
  Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, Song Han
Abstract: Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emergent DNN hardware accelerators begin to support mixed precision (1-8 bits) to further improve the computation efficiency, which raises a great challenge to find the optimal bitwidth for each layer: it requires domain experts to explore the vast design space trading off among accuracy, latency, energy, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithm ignores the different hardware architectures and quantizes all the layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework which leverages the reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback in the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals (latency and energy) to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduced the latency by 1.4-1.95x and the energy consumption by 1.9x with negligible loss of accuracy compared with the fixed bitwidth (8 bits) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, energy, and model size) are drastically different. We interpreted the implication of different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
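
As a concrete illustration of what a per-layer bitwidth policy acts on, here is a minimal numpy sketch of uniform fake quantization; HAQ would instead learn n_bits for each layer with an RL agent driven by hardware feedback. The function name, max-abs calibration, and example bitwidths are illustrative assumptions, not the authors' code.

```python
import numpy as np

def fake_quantize(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Uniform symmetric fake quantization of a weight tensor to n_bits.

    A per-layer policy (e.g. one chosen by an RL agent as in HAQ) would
    pick n_bits independently for each layer.
    """
    qmax = 2 ** (n_bits - 1) - 1          # e.g. 7 for 4-bit signed weights
    scale = np.abs(w).max() / qmax        # simple max-abs calibration
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                      # dequantize back to float

w = np.random.randn(64, 64).astype(np.float32)
for bits in (8, 4, 2):
    err = np.abs(fake_quantize(w, bits) - w).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```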

4. A Boundary Based Out-of-Distribution Classifier for Generalized Zero-Shot Learning [PDF] Back to Contents
  Xingyu Chen, Xuguang Lan, Fuchun Sun, Nanning Zheng
Abstract: Generalized Zero-Shot Learning (GZSL) is a challenging topic that has promising prospects in many realistic scenarios. Using a gating mechanism that discriminates the unseen samples from the seen samples can decompose the GZSL problem to a conventional Zero-Shot Learning (ZSL) problem and a supervised classification problem. However, training the gate is usually challenging due to the lack of data in the unseen domain. To resolve this problem, in this paper, we propose a boundary based Out-of-Distribution (OOD) classifier which classifies the unseen and seen domains by only using seen samples for training. First, we learn a shared latent space on a unit hyper-sphere where the latent distributions of visual features and semantic attributes are aligned class-wisely. Then we find the boundary and the center of the manifold for each class. By leveraging the class centers and boundaries, the unseen samples can be separated from the seen samples. After that, we use two experts to classify the seen and unseen samples separately. We extensively validate our approach on five popular benchmark datasets including AWA1, AWA2, CUB, FLO and SUN. The experimental results show that our approach surpasses state-of-the-art approaches by a significant margin.
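
The gating idea can be sketched as a simple boundary test in the shared latent space. The helper below is a hypothetical illustration, assuming unit-normalized features and per-class cosine-distance boundaries estimated from seen training samples; it is not the paper's implementation.

```python
import numpy as np

def is_unseen(z, centers, boundaries):
    """Flag a latent code as out-of-distribution (unseen class).

    z          : (d,) latent feature, assumed L2-normalized (unit hypersphere)
    centers    : (C, d) per-class manifold centers, L2-normalized
    boundaries : (C,) per-class boundary radius in cosine distance
    """
    cos_dist = 1.0 - centers @ z            # (C,) cosine distances to centers
    # outside every seen-class boundary -> route to the ZSL expert
    return bool(np.all(cos_dist > boundaries))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 16))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
z = centers[2] + 0.05 * rng.normal(size=16)
z /= np.linalg.norm(z)
print(is_unseen(z, centers, boundaries=np.full(5, 0.3)))  # False: near a seen center
```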

5. BREEDS: Benchmarks for Subpopulation Shift [PDF] Back to Contents
  Shibani Santurkar, Dimitris Tsipras, Aleksander Madry
Abstract: We develop a methodology for assessing the robustness of models to subpopulation shift---specifically, their ability to generalize to novel data subpopulations that were not observed during training. Our approach leverages the class structure underlying existing datasets to control the data subpopulations that comprise the training and test distributions. This enables us to synthesize realistic distribution shifts whose sources can be precisely controlled and characterized, within existing large-scale datasets. Applying this methodology to the ImageNet dataset, we create a suite of subpopulation shift benchmarks of varying granularity. We then validate that the corresponding shifts are tractable by obtaining human baselines for them. Finally, we utilize these benchmarks to measure the sensitivity of standard model architectures as well as the effectiveness of off-the-shelf train-time robustness interventions. Code and data available at this https URL .

6. KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue [PDF] Back to Contents
  Xiaoze Jiang, Siyi Du, Zengchang Qin, Yajing Sun, Jing Yu
Abstract: Visual dialogue is a challenging task that needs to extract implicit information from both visual (image) and textual (dialogue history) contexts. Classical approaches pay more attention to the integration of the current question, vision knowledge and text knowledge, despising the heterogeneous semantic gaps between the cross-modal information. In the meantime, the concatenation operation has become de-facto standard to the cross-modal information fusion, which has a limited ability in information retrieval. In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model by using graph to bridge the cross-modal semantic relations between vision and text knowledge in fine granularity, as well as retrieving required knowledge via an adaptive information selection mode. Moreover, the reasoning clues for visual dialogue can be clearly drawn from intra-modal entities and inter-modal bridges. Experimental results on VisDial v1.0 and VisDial-Q datasets demonstrate that our model outperforms exiting models with state-of-the-art results.

7. GeLaTO: Generative Latent Textured Objects [PDF] Back to Contents
  Ricardo Martin-Brualla, Rohit Pandey, Sofien Bouaziz, Matthew Brown, Dan B Goldman
Abstract: Accurate modeling of 3D objects exhibiting transparency, reflections and thin structures is an extremely challenging problem. Inspired by billboards and geometric proxies used in computer graphics, this paper proposes Generative Latent Textured Objects (GeLaTO), a compact representation that combines a set of coarse shape proxies defining low frequency geometry with learned neural textures, to encode both medium and fine scale geometry as well as view-dependent appearance. To generate the proxies' textures, we learn a joint latent space allowing category-level appearance and geometry interpolation. The proxies are independently rasterized with their corresponding neural texture and composited using a U-Net, which generates an output photorealistic image including an alpha map. We demonstrate the effectiveness of our approach by reconstructing complex objects from a sparse set of views. We show results on a dataset of real images of eyeglasses frames, which are particularly challenging to reconstruct using classical methods. We also demonstrate that these coarse proxies can be handcrafted when the underlying object geometry is easy to model, like eyeglasses, or generated using a neural network for more complex categories, such as cars.

8. TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection [PDF] Back to Contents
  Fangfang Wang, Yifeng Chen, Fei Wu, Xi Li
Abstract: Arbitrary-shaped text detection is a challenging task due to the complex geometric layouts of texts such as large aspect ratios, various scales, random rotations and curve shapes. Most state-of-the-art methods solve this problem from bottom-up perspectives, seeking to model a text instance of complex geometric layouts with simple local units (e.g., local boxes or pixels) and generate detections with heuristic post-processings. In this work, we propose an arbitrary-shaped text detection method, namely TextRay, which conducts top-down contour-based geometric modeling and geometric parameter learning within a single-shot anchor-free framework. The geometric modeling is carried out under polar system with a bidirectional mapping scheme between shape space and parameter space, encoding complex geometric layouts into unified representations. For effective learning of the representations, we design a central-weighted training strategy and a content loss which builds propagation paths between geometric encodings and visual content. TextRay outputs simple polygon detections at one pass with only one NMS post-processing. Experiments on several benchmark datasets demonstrate the effectiveness of the proposed approach. The code is available at this https URL.

9. Exposing Deep-faked Videos by Anomalous Co-motion Pattern Detection [PDF] Back to Contents
  Gengxing Wang, Jiahuan Zhou, Ying Wu
Abstract: Recent deep learning based video synthesis approaches, in particular those with applications that can forge identities such as "DeepFake", have raised great security concerns. Therefore, corresponding deep forensic methods are proposed to tackle this problem. However, existing methods are either based on unexplainable deep networks, which greatly degrades the principal interpretability factor for media forensics, or rely on fragile image statistics such as noise patterns, which in real-world scenarios can easily be deteriorated by data compression. In this paper, we propose a fully-interpretable video forensic method that is designed specifically to expose deep-faked videos. To enhance generalizability on videos with various content, we model the temporal motion of multiple specific spatial locations in the videos to extract a robust and reliable representation, called the Co-Motion Pattern. Such a conjoint pattern is mined across local motion features, which are independent of the video contents, so that instance-wise variation can also be largely alleviated. More importantly, our proposed co-motion pattern possesses both superior interpretability and sufficient robustness against data compression for deep-faked videos. We conduct extensive experiments to empirically demonstrate the superiority and effectiveness of our approach, under both classification and anomaly detection evaluation settings, against the state-of-the-art deep forensic methods.

10. TransNet V2: An effective deep network architecture for fast shot transition detection [PDF] Back to Contents
  Tomáš Souček, Jakub Lokoč
Abstract: Although automatic shot transition detection approaches are already investigated for more than two decades, an effective universal human-level model was not proposed yet. Even for common shot transitions like hard cuts or simple gradual changes, the potential diversity of analyzed video contents may still lead to both false hits and false dismissals. Recently, deep learning-based approaches significantly improved the accuracy of shot transition detection using 3D convolutional architectures and artificially created training data. Nevertheless, one hundred percent accuracy is still an unreachable ideal. In this paper, we share the current version of our deep network TransNet V2 that reaches state-of-the-art performance on respected benchmarks. A trained instance of the model is provided so it can be instantly utilized by the community for a highly efficient analysis of large video archives. Furthermore, the network architecture, as well as our experience with the training process, are detailed, including simple code snippets for convenient usage of the proposed model and visualization of results.

11. Detecting Urban Dynamics Using Deep Siamese Convolutional Neural Networks [PDF] Back to Contents
  Ephrem Admasu Yekun, Petros Reda Samsom
Abstract: Change detection is a fast-growing discipline in the areas of computer vision and remote sensing. In this work, we designed and developed a variant of convolutional neural network (CNN), known as Siamese CNN, to extract features from pairs of Sentinel-2 temporal images of Mekelle city captured at different times and detect changes due to urbanization: buildings and roads. The detection capability of the proposed model was measured in terms of overall accuracy (95.8), Kappa measure (72.5), recall (76.5), precision (77.7), and F1 measure (77.1). The model has achieved a good performance in terms of most of these measures and can be used to detect changes in Mekelle and other cities undergoing urbanization at different time horizons.

12. Unified Representation Learning for Cross Model Compatibility [PDF] Back to Contents
  Chien-Yi Wang, Ya-Liang Chang, Shang-Ta Yang, Dong Chen, Shang-Hong Lai
Abstract: We propose a unified representation learning framework to address the Cross Model Compatibility (CMC) problem in the context of visual search applications. Cross compatibility between different embedding models enables the visual search systems to correctly recognize and retrieve identities without re-encoding user images, which are usually not available due to privacy concerns. While there are existing approaches to address CMC in face identification, they fail to work in a more challenging setting where the distributions of embedding models shift drastically. The proposed solution improves CMC performance by introducing a light-weight Residual Bottleneck Transformation (RBT) module and a new training scheme to optimize the embedding spaces. Extensive experiments demonstrate that our proposed solution outperforms previous approaches by a large margin for various challenging visual search scenarios of face recognition and person re-identification.

13. Learning Stereo Matchability in Disparity Regression Networks [PDF] Back to Contents
  Jingyang Zhang, Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, Long Quan
Abstract: Learning-based stereo matching has recently achieved promising results, yet still suffers difficulties in establishing reliable matches in weakly matchable regions that are textureless, non-Lambertian, or occluded. In this paper, we address this challenge by proposing a stereo matching network that considers pixel-wise matchability. Specifically, the network jointly regresses disparity and matchability maps from 3D probability volume through expectation and entropy operations. Next, a learned attenuation is applied as the robust loss function to alleviate the influence of weakly matchable pixels in the training. Finally, a matchability-aware disparity refinement is introduced to improve the depth inference in weakly matchable regions. The proposed deep stereo matchability (DSM) framework can improve the matching result or accelerate the computation while still guaranteeing the quality. Moreover, the DSM framework is portable to many recent stereo networks. Extensive experiments are conducted on Scene Flow and KITTI stereo datasets to demonstrate the effectiveness of the proposed framework over the state-of-the-art learning-based stereo methods.
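
The expectation and entropy operations the paper describes can be sketched directly on a cost volume. Below is a minimal numpy version, assuming a softmax over negated costs; the paper's exact probability construction may differ.

```python
import numpy as np

def disparity_and_matchability(cost_volume, eps=1e-8):
    """Soft-argmin disparity and entropy-based matchability from a cost volume.

    cost_volume : (D, H, W) matching costs over D disparity hypotheses.
    Returns the expected disparity (soft argmin) and per-pixel entropy;
    high entropy marks weakly matchable (textureless/occluded) pixels.
    """
    p = np.exp(-cost_volume)                       # softmax over the disparity axis
    p /= p.sum(axis=0, keepdims=True)
    d = np.arange(cost_volume.shape[0]).reshape(-1, 1, 1)
    disparity = (p * d).sum(axis=0)                # expectation over hypotheses
    entropy = -(p * np.log(p + eps)).sum(axis=0)   # matchability proxy
    return disparity, entropy

cv = np.random.rand(64, 4, 4)
disp, ent = disparity_and_matchability(cv)
print(disp.shape, ent.shape)  # (4, 4) (4, 4)
```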

14. DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [PDF] Back to Contents
  Jiangning Zhang, Chao Xu, Liang Liu, Mengmeng Wang, Xia Wu, Yong Liu, Yunliang Jiang
Abstract: This paper presents a novel end-to-end dynamic time-lapse video generation framework, named DTVNet, to generate diversified time-lapse videos from a single landscape image, which are conditioned on normalized motion vectors. The proposed DTVNet consists of two submodules: Optical Flow Encoder (OFE) and Dynamic Video Generator (DVG). The OFE maps a sequence of optical flow maps to a normalized motion vector that encodes the motion information inside the generated video. The DVG contains motion and content streams that learn from the motion vector and the single image respectively, as well as an encoder and a decoder to learn shared content features and construct video frames with corresponding motion respectively. Specifically, the motion stream introduces multiple adaptive instance normalization (AdaIN) layers to integrate multi-level motion information that are processed by linear layers. In the testing stage, videos with the same content but various motion information can be generated by different normalized motion vectors based on only one input image. We further conduct experiments on the Sky Time-lapse dataset, and the results demonstrate the superiority of our approach over the state-of-the-art methods for generating high-quality and dynamic videos, as well as the variety for generating videos with various motion information.
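
For reference, adaptive instance normalization, which the motion stream uses to inject motion information, reduces to normalizing each channel and rescaling with externally predicted statistics. A minimal numpy sketch; the mapping from the motion vector to style_mean/style_std via linear layers is omitted here.

```python
import numpy as np

def adain(content, style_mean, style_std, eps=1e-5):
    """Adaptive Instance Normalization over a (C, H, W) feature map.

    Normalizes each channel of `content`, then rescales with statistics
    predicted from the motion vector (here passed in directly).
    """
    mu = content.mean(axis=(1, 2), keepdims=True)
    sigma = content.std(axis=(1, 2), keepdims=True)
    normalized = (content - mu) / (sigma + eps)
    return normalized * style_std[:, None, None] + style_mean[:, None, None]

feat = np.random.randn(8, 16, 16)
out = adain(feat, style_mean=np.zeros(8), style_std=np.ones(8))
print(out.shape)  # (8, 16, 16)
```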

15. Transfer Learning for Protein Structure Classification and Function Inference at Low Resolution [PDF] Back to Contents
  Alexander Hudson, Shaogang Gong
Abstract: Structure determination is key to understanding protein function at a molecular level. Whilst significant advances have been made in predicting structure and function from amino acid sequence, researchers must still rely on expensive, time-consuming analytical methods to visualise detailed protein conformation. In this study, we demonstrate that it is possible to make accurate predictions of protein fold taxonomy from structures determined at low (>3 Angstroms) resolution, using a deep convolutional neural network trained on high-resolution structures (≤3 Angstroms). Thus, we provide proof of concept for high-speed, low-cost protein structure classification at low resolution. We explore the relationship between the information content of the input image and the predictive power of the model, achieving state of the art performance on homologous superfamily prediction with maps of interatomic distance. Our findings contribute further evidence that inclusion of both amino acid alpha and beta carbon geometry in these maps improves classification performance over purely alpha carbon representations, and show that side-chain information may not be necessary for fine-grained structure predictions. Finally, we confirm that high-resolution, low-resolution and NMR-determined structures inhabit a common feature space, and thus provide a theoretical basis for mapping between domains to boost resolution.

16. HydraMix-Net: A Deep Multi-task Semi-supervised Learning Approach for Cell Detection and Classification [PDF] Back to Contents
  R.M. Saad Bashir, Talha Qaiser, Shan E Ahmed Raza, Nasir M. Rajpoot
Abstract: Semi-supervised techniques have removed the barriers of large scale labelled sets by exploiting unlabelled data to improve the performance of a model. In this paper, we propose a semi-supervised deep multi-task classification and localization approach, HydraMix-Net, for the field of medical imaging, where labelling is time consuming and costly. First, pseudo labels are generated using the model's averaged predictions on an augmented set of unlabelled images. The high-entropy predictions are further sharpened to reduce the entropy and are then mixed with the labelled set for training. The model is trained in a multi-task learning manner with a noise-tolerant joint loss for classification and localization, and achieves better performance than a simple deep model when given limited data. On DLBCL data it achieves 80% accuracy, in contrast to a simple CNN achieving 70% accuracy, when given only 100 labelled examples.
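
The average-then-sharpen step for pseudo labels can be written in a few lines. This sketch assumes MixMatch-style temperature sharpening; the paper's exact sharpening function and temperature are assumptions here.

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature sharpening of an averaged class distribution (T < 1 lowers entropy)."""
    p = p ** (1.0 / T)
    return p / p.sum()

# average model predictions over K augmented copies of one unlabelled image
preds = np.array([[0.5, 0.3, 0.2],
                  [0.6, 0.2, 0.2],
                  [0.4, 0.4, 0.2]])
pseudo_label = sharpen(preds.mean(axis=0))
print(pseudo_label.round(3))   # peakier than the plain average [0.5, 0.3, 0.2]
```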

17. Reinforced Wasserstein Training for Severity-Aware Semantic Segmentation in Autonomous Driving [PDF] Back to Contents
  Xiaofeng Liu, Yimeng Zhang, Xiongchang Liu, Song Bai, Site Li, Jane You
Abstract: Semantic segmentation is important for many real-world systems, e.g., autonomous vehicles, which predict the class of each pixel. Recently, deep networks achieved significant progress w.r.t. the mean Intersection-over-Union (mIoU) with the cross-entropy loss. However, the cross-entropy loss essentially ignores the differing severity of the wrong predictions an autonomous car can make. For example, misclassifying a car as road is much more severe than recognizing it as a bus. Targeting this difficulty, we develop a Wasserstein training framework to explore the inter-class correlation by defining its ground metric as misclassification severity. The ground metric of the Wasserstein distance can be pre-defined following experience on a specific task. From the optimization perspective, we further propose to set the ground metric as an increasing function of the pre-defined ground metric. Furthermore, an adaptive learning scheme for the ground matrix is proposed, utilizing the high-fidelity CARLA simulator. Specifically, we follow a reinforcement alternative learning scheme. The experiments on both the CamVid and Cityscapes datasets evidenced the effectiveness of our Wasserstein loss. The SegNet, ENet, FCN and Deeplab networks can be adapted in a plug-in manner. We achieve significant improvements on the predefined important classes, and much longer continuous playtime in our simulator.
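
With a discrete ground metric, this style of Wasserstein training loss for one pixel reduces to the expected misclassification severity under the predicted distribution. A minimal sketch, with a hypothetical 3-class ground metric M chosen purely for illustration:

```python
import numpy as np

def severity_loss(probs, target, M):
    """Expected misclassification severity under the ground metric M.

    probs  : (C,) predicted class distribution for one pixel
    target : int, ground-truth class index
    M      : (C, C) ground metric; M[i, j] is the severity of predicting
             class j when the truth is class i (M[i, i] = 0)
    """
    return float(probs @ M[target])

M = np.array([[0.0, 1.0, 5.0],    # hypothetical: confusing class 0/1 is mild,
              [1.0, 0.0, 5.0],    # confusing either with class 2 is severe
              [5.0, 5.0, 0.0]])
print(severity_loss(np.array([0.7, 0.2, 0.1]), target=0, M=M))  # 0.2*1 + 0.1*5 = 0.7
```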

18. Attention-based 3D Object Reconstruction from a Single Image [PDF] Back to Contents
  Andrey Salvi, Nathan Gavenski, Eduardo Pooch, Felipe Tasoniero, Rodrigo Barros
Abstract: Recently, learning-based approaches for 3D reconstruction from 2D images have gained popularity due to its modern applications, e.g., 3D printers, autonomous robots, self-driving cars, virtual reality, and augmented reality. The computer vision community has applied a great effort in developing functions to reconstruct the full 3D geometry of objects and scenes. However, to extract image features, they rely on convolutional neural networks, which are ineffective in capturing long-range dependencies. In this paper, we propose to substantially improve Occupancy Networks, a state-of-the-art method for 3D object reconstruction. For such we apply the concept of self-attention within the network's encoder in order to leverage complementary input features rather than those based on local regions, helping the encoder to extract global information. With our approach, we were capable of improving the original work in 5.05% of mesh IoU, 0.83% of Normal Consistency, and more than 10X the Chamfer-L1 distance. We also perform a qualitative study that shows that our approach was able to generate much more consistent meshes, confirming its increased generalization power over the current state-of-the-art.

19. Robust Long-Term Object Tracking via Improved Discriminative Model Prediction [PDF] Back to Contents
  Seokeon Choi, Junhyun Lee, Yunsung Lee, Alexander Hauptmann
Abstract: We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline pre-trained short-term tracker is SuperDiMP which combines the bounding-box regressor of PrDiMP with the standard DiMP classifier. Our tracker RLT-DiMP improves SuperDiMP in the following three aspects: (1) Uncertainty reduction using random erasing: To make our model robust, we exploit an agreement from multiple images after erasing random small rectangular areas as a certainty. And then, we correct the tracking state of our model accordingly. (2) Random search with spatio-temporal constraints: we propose a robust random search method with a score penalty applied to prevent the problem of sudden detection at a distance. (3) Background augmentation for more discriminative feature learning: We augment various backgrounds that are not included in the search area to train a more robust model in the background clutter. In experiments on the VOT-LT2020 benchmark dataset, the proposed method achieves comparable performance to the state-of-the-art long-term trackers. The source code is available at: this https URL.

20. A Study of Efficient Light Field Subsampling and Reconstruction Strategies [PDF] Back to Contents
  Yang Chen, Martin Alain, Aljosa Smolic
Abstract: Limited angular resolution is one of the main obstacles for practical applications of light fields. Although numerous approaches have been proposed to enhance angular resolution, view selection strategies have not been well explored in this area. In this paper, we study subsampling and reconstruction strategies for light fields. First, different subsampling strategies are studied with a fixed sampling ratio, such as row-wise sampling, column-wise sampling, or their combinations. Second, several strategies are explored to reconstruct intermediate views from four regularly sampled input views. The influence of the angular density of the input is also evaluated. We evaluate these strategies on both real-world and synthetic datasets, and optimal selection strategies are devised from our results. These can be applied in future light field research such as compression, angular super-resolution, and design of camera systems.

21. PROFIT: A Novel Training Method for sub-4-bit MobileNet Models [PDF] Back to Contents
  Eunhyeok Park, Sungjoo Yoo
Abstract: 4-bit and lower precision mobile models are required due to the ever-increasing demand for better energy efficiency in mobile devices. In this work, we report that the activation instability induced by weight quantization (AIWQ) is the key obstacle to sub-4-bit quantization of mobile networks. To alleviate the AIWQ problem, we propose a novel training method called PROgressive-Freezing Iterative Training (PROFIT), which attempts to freeze layers whose weights are affected by the instability problem stronger than the other layers. We also propose a differentiable and unified quantization method (DuQ) and a negative padding idea to support asymmetric activation functions such as h-swish. We evaluate the proposed methods by quantizing MobileNet-v1, v2, and v3 on ImageNet and report that 4-bit quantization offers comparable (within 1.48 % top-1 accuracy) accuracy to full precision baseline. In the ablation study of the 3-bit quantization of MobileNet-v3, our proposed method outperforms the state-of-the-art method by a large margin, 12.86 % of top-1 accuracy.
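
The progressive-freezing idea can be sketched as ranking layers by an activation-instability metric and freezing the most affected ones first, stage by stage. The layer names and scores below are hypothetical; how PROFIT exactly measures AIWQ and schedules the freezing stages is detailed in the paper.

```python
def freezing_order(instability_per_layer):
    """Rank layers by activation instability, most affected first.

    instability_per_layer : dict layer_name -> scalar AIWQ-style metric,
    e.g. the change in activation statistics when weights are quantized.
    Training would then freeze the top-ranked layers in progressive stages.
    """
    return sorted(instability_per_layer, key=instability_per_layer.get, reverse=True)

metric = {"conv1": 0.02, "dw_conv3": 0.31, "pw_conv3": 0.12, "fc": 0.05}
print(freezing_order(metric))  # ['dw_conv3', 'pw_conv3', 'fc', 'conv1']
```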

22. ClimAlign: Unsupervised statistical downscaling of climate variables via normalizing flows [PDF] Back to Contents
  Brian Groenke, Luke Madaus, Claire Monteleoni
Abstract: Downscaling is a landmark task in climate science and meteorology in which the goal is to use coarse scale, spatio-temporal data to infer values at finer scales. Statistical downscaling aims to approximate this task using statistical patterns gleaned from an existing dataset of downscaled values, often obtained from observations or physical models. In this work, we investigate the application of deep latent variable learning to the task of statistical downscaling. We present ClimAlign, a novel method for unsupervised, generative downscaling using adaptations of recent work in normalizing flows for variational inference. We evaluate the viability of our method using several different metrics on two datasets consisting of daily temperature and precipitation values gridded at low (1 degree latitude/longitude) and high (1/4 and 1/8 degree) resolutions. We show that our method achieves comparable predictive performance to existing supervised statistical downscaling methods while simultaneously allowing for both conditional and unconditional sampling from the joint distribution over high and low resolution spatial fields. We provide publicly accessible implementations of our method, as well as the baselines used for comparison, on GitHub.

23. Fast and Accurate Optical Flow based Depth Map Estimation from Light Fields [PDF] Back to Contents
  Yang Chen, Martin Alain, Aljosa Smolic
Abstract: Depth map estimation is a crucial task in computer vision, and new approaches have recently emerged taking advantage of light fields, as this new imaging modality captures much more information about the angular direction of light rays compared to common approaches based on stereoscopic images or multi-view. In this paper, we propose a novel depth estimation method from light fields based on existing optical flow estimation methods. The optical flow estimator is applied on a sequence of images taken along an angular dimension of the light field, which produces several disparity map estimates. Considering both accuracy and efficiency, we choose the feature flow method as our optical flow estimator. Thanks to its spatio-temporal edge-aware filtering properties, the different disparity map estimates that we obtain are very consistent, which allows a fast and simple aggregation step to create a single disparity map, which can then converted into a depth map. Since the disparity map estimates are consistent, we can also create a depth map from each disparity estimate, and then aggregate the different depth maps in the 3D space to create a single dense depth map.
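
Once a consistent disparity map has been aggregated, converting it to depth uses the standard stereo relation Z = f * b / d, with the baseline being the spacing between adjacent sub-aperture views. A small sketch; the focal length and baseline values are placeholders, not values from the paper.

```python
import numpy as np

def disparity_to_depth(disparity, focal_px, baseline, eps=1e-8):
    """Convert a light-field disparity map to metric depth: Z = f * b / d.

    focal_px : focal length in pixels; baseline : spacing between adjacent
    sub-aperture views (the same relation as in a stereo rig).
    """
    return focal_px * baseline / (disparity + eps)

d = np.array([[1.0, 2.0], [4.0, 0.5]])
print(disparity_to_depth(d, focal_px=500.0, baseline=0.01))
```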

24. Learning to Cluster under Domain Shift [PDF] Back to Contents
  Willi Menapace, Stéphane Lathuilière, Elisa Ricci
Abstract: While unsupervised domain adaptation methods based on deep architectures have achieved remarkable success in many computer vision tasks, they rely on a strong assumption, i.e. labeled source data must be available. In this work we overcome this assumption and we address the problem of transferring knowledge from a source to a target domain when both source and target data have no annotations. Inspired by recent works on deep clustering, our approach leverages information from data gathered from multiple source domains to build a domain-agnostic clustering model which is then refined at inference time when target data become available. Specifically, at training time we propose to optimize a novel information-theoretic loss which, coupled with domain-alignment layers, ensures that our model learns to correctly discover semantic labels while discarding domain-specific features. Importantly, our architecture design ensures that at inference time the resulting source model can be effectively adapted to the target domain without having access to source data, thanks to feature alignment and self-supervision. We evaluate the proposed approach in a variety of settings, considering several domain adaptation benchmarks and we show that our method is able to automatically discover relevant semantic information even in presence of few target samples and yields state-of-the-art results on multiple domain adaptation benchmarks.

25. Real-Time Sign Language Detection using Human Pose Estimation [PDF] Back to Contents
  Amit Moryossef, Ioannis Tsochantaridis, Roee Aharoni, Sarah Ebling, Srini Narayanan
Abstract: We propose a lightweight real-time sign language detection model, as we identify the need for such a case in videoconferencing. We extract optical flow features based on human pose estimation and, using a linear classifier, show these features are meaningful with an accuracy of 80%, evaluated on the DGS Corpus. Using a recurrent model directly on the input, we see improvements of up to 91% accuracy, while still working under 4ms. We describe a demo application to sign language detection in the browser in order to demonstrate its usage possibility in videoconferencing applications.
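
A sketch of the kind of pose-based optical-flow feature such a model consumes: displacements of body landmarks between consecutive frames, scaled by frame rate so the feature is comparable across videos with different fps. The landmark count and normalization details are assumptions, not the authors' exact pipeline.

```python
import numpy as np

def pose_flow_feature(keypoints_t, keypoints_prev, fps=25.0):
    """Per-frame motion feature from pose landmarks, normalized by frame rate.

    keypoints_* : (K, 2) 2D body landmarks of consecutive frames.
    The norm of each landmark's displacement gives a K-dim feature that a
    linear classifier can threshold into signing / not-signing.
    """
    flow = (keypoints_t - keypoints_prev) * fps   # pixels per second
    return np.linalg.norm(flow, axis=1)           # (K,)

prev = np.random.rand(17, 2)
cur = prev + 0.01 * np.random.randn(17, 2)
print(pose_flow_feature(cur, prev).shape)  # (17,)
```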

26. R-MNet: A Perceptual Adversarial Network for Image Inpainting [PDF] Back to Contents
  Jireh Jam, Connah Kendrick, Vincent Drouard, Kevin Walker, Gee-Sern Hsu, Moi Hoon Yap
Abstract: Facial image inpainting is a problem that is widely studied, and in recent years the introduction of Generative Adversarial Networks has led to improvements in the field. Unfortunately some issues persist, in particular when blending the missing pixels with the visible ones. We address the problem by proposing a Wasserstein GAN combined with a new reverse mask operator, namely Reverse Masking Network (R-MNet), a perceptual adversarial network for image inpainting. The reverse mask operator transfers the reverse masked image to the end of the encoder-decoder network, leaving only valid pixels to be inpainted. Additionally, we propose a new loss function computed in feature space to target only valid pixels, combined with adversarial training. These then capture data distributions and generate images similar to those in the training data with achieved realism (realistic and coherent) on the output images. We evaluate our method on publicly available datasets, and compare with state-of-the-art methods. We show that our method is able to generalize to the high-resolution inpainting task, and further show more realistic outputs that are plausible to the human visual system when compared with the state-of-the-art methods.

27. Fully-Automated Packaging Structure Recognition in Logistics Environments [PDF] Back to Contents
  Laura Dörr, Felix Brandt, Martin Pouls, Alexander Naumann
Abstract: Within a logistics supply chain, a large variety of transported goods need to be handled, recognized and checked at many different network points. Often, huge manual effort is involved in recognizing or verifying packet identity or packaging structure, for instance to check the delivery for completeness. We propose a method for complete automation of packaging structure recognition: Based on a single image, one or multiple transport units are localized and, for each of these transport units, the characteristics, the total number and the arrangement of its packaging units is recognized. Our algorithm is based on deep learning models, more precisely convolutional neural networks for instance segmentation in images, as well as computer vision methods and heuristic components. We use a custom data set of realistic logistics images for training and evaluation of our method. We show that the solution is capable of correctly recognizing the packaging structure in approximately 85% of our test cases, and even more (91%) when focusing on most common package types.

28. Deep UAV Localization with Reference View Rendering [PDF] Back to Contents
  Timo Hinzmann, Roland Siegwart
Abstract: This paper presents a framework for the localization of Unmanned Aerial Vehicles (UAVs) in unstructured environments with the help of deep learning. A real-time rendering engine is introduced that generates optical and depth images given a six Degrees-of-Freedom (DoF) camera pose, camera model, geo-referenced orthoimage, and elevation map. The rendering engine is embedded into a learning-based six-DoF Inverse Compositional Lucas-Kanade (ICLK) algorithm that is able to robustly align the rendered and real-world image taken by the UAV. To learn the alignment under environmental changes, the architecture is trained using maps spanning multiple years at high resolution. The evaluation shows that the deep 6DoF-ICLK algorithm outperforms its non-trainable counterparts by a large margin. To further support the research in this field, the real-time rendering engine and accompanying datasets are released along with this publication.

29. Sharp Multiple Instance Learning for DeepFake Video Detection [PDF] Back to Contents
  Xiaodan Li, Yining Lang, Yuefeng Chen, Xiaofeng Mao, Yuan He, Shuhui Wang, Hui Xue, Quan Lu
Abstract: With the rapid development of facial manipulation techniques, face forgery has received considerable attention in multimedia and computer vision community due to security concerns. Existing methods are mostly designed for single-frame detection trained with precise image-level labels or for video-level prediction by only modeling the inter-frame inconsistency, leaving potential high risks for DeepFake attackers. In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. We address this problem by multiple instance learning framework, treating faces and input video as instances and bag respectively. A sharp MIL (S-MIL) is proposed which builds direct mapping from instance embeddings to bag prediction, rather than from instance embeddings to instance prediction and then to bag prediction in traditional MIL. Theoretical analysis proves that the gradient vanishing in traditional MIL is relieved in S-MIL. To generate instances that can accurately incorporate the partially manipulated faces, spatial-temporal encoded instance is designed to fully model the intra-frame and inter-frame inconsistency, which further helps to promote the detection performance. We also construct a new dataset FFPMS for partially attacked DeepFake video detection, which can benefit the evaluation of different methods at both frame and video levels. Experiments on FFPMS and the widely used DFDC dataset verify that S-MIL is superior to other counterparts for partially attacked DeepFake video detection. In addition, S-MIL can also be adapted to traditional DeepFake image detection tasks and achieve state-of-the-art performance on single-frame datasets.
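
The "direct mapping from instance embeddings to bag prediction" can be illustrated with a sharp pooling operator over per-face scores. The sketch below uses log-mean-exp pooling with sharpness r as a stand-in; the paper's actual S-MIL formulation differs in detail.

```python
import numpy as np

def bag_score(instance_embeddings, w, b=0.0, r=5.0):
    """Bag-level fake probability pooled directly from instance scores.

    instance_embeddings : (N, d) face features from one video (the bag).
    r controls sharpness: large r approaches max pooling, so a few
    manipulated faces can dominate the bag score (partial-attack setting).
    """
    s = instance_embeddings @ w + b              # (N,) instance logits
    m = s.max()                                  # stabilized log-mean-exp pooling
    pooled = m + np.log(np.mean(np.exp(r * (s - m)))) / r
    return 1.0 / (1.0 + np.exp(-pooled))         # sigmoid -> bag probability

rng = np.random.default_rng(0)
print(bag_score(rng.normal(size=(30, 64)), w=rng.normal(size=64)))
```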

30. Rethinking Pseudo-LiDAR Representation [PDF] Back to Contents
  Xinzhu Ma, Shinan Liu, Zhiyi Xia, Hongwen Zhang, Xingyu Zeng, Wanli Ouyang
Abstract: The recently proposed pseudo-LiDAR based 3D detectors greatly improve the benchmark of monocular/stereo 3D detection task. However, the underlying mechanism remains obscure to the research community. In this paper, we perform an in-depth investigation and observe that the efficacy of pseudo-LiDAR representation comes from the coordinate transformation, instead of data representation itself. Based on this observation, we design an image based CNN detector named Patch-Net, which is more generalized and can be instantiated as pseudo-LiDAR based 3D detectors. Moreover, the pseudo-LiDAR data in our PatchNet is organized as the image representation, which means existing 2D CNN designs can be easily utilized for extracting deep features from input data and boosting 3D detection performance. We conduct extensive experiments on the challenging KITTI dataset, where the proposed PatchNet outperforms all existing pseudo-LiDAR based counterparts. Code has been made available at: this https URL.

31. Text as Neural Operator: Image Manipulation by Text Instruction [PDF] Back to Contents
  Tianhao Zhang, Hung-Yu Tseng, Lu Jiang, Weilong Yang, Honglak Lee, Irfan Essa
Abstract: In this paper, we study a new task that allows users to edit an input image using language instructions. In this image generation task, the inputs are a reference image and a text instruction that describes desired modifications to the input image. We propose a GAN-based method to tackle this problem. The key idea is to treat language as neural operators to locally modify the image feature. To this end, our model decomposes the generation process into finding where (spatial region) and how (text operators) to apply modifications. We show that the proposed model performs favorably against recent baselines on three datasets.

32. TCL: an ANN-to-SNN Conversion with Trainable Clipping Layers [PDF] Back to Contents
  Nguyen-Dong Ho, Ik-Joon Chang
Abstract: Spiking Neural Networks (SNNs) provide significantly lower power dissipation than deep neural networks (DNNs), called analog neural networks (ANNs) in this work. Conventionally, SNNs have failed to arrive at the training accuracies of ANNs. However, several recent studies have shown that this challenge can be addressed by converting an ANN to an SNN instead of the direct training of SNNs. Nonetheless, the large latency of SNNs still limits their application, which is more problematic for large-size datasets such as ImageNet. It is challenging to overcome this problem since, in SNNs, there is a trade-off relation between accuracy and latency. In this work, we elegantly alleviate the problem by using trainable clipping layers, so called TCL. By combining the TCL with traditional data-normalization techniques, we respectively obtain 71.12% and 73.38% (on ImageNet) for VGG-16 and ResNet-34 after the ANN-to-SNN conversion with a latency constraint of 250 cycles.
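One plausible reading of a trainable clipping layer, sketched below: a ReLU whose upper bound is a learnable parameter, so the activation range used when rescaling to SNN firing rates is learned jointly with the weights. The parameterization and initialization are assumptions, not details taken from the paper.

```python
# A minimal trainable-clipping sketch: relu(x) clipped at a learnable threshold.
# Written so the threshold receives gradients from every clipped activation.
import torch
import torch.nn as nn

class TrainableClip(nn.Module):
    def __init__(self, init_clip=6.0):               # init value is an assumption
        super().__init__()
        self.clip = nn.Parameter(torch.tensor(init_clip))

    def forward(self, x):
        # Equals clamp(x, 0, clip), but keeps a gradient path to self.clip.
        return torch.clamp(x, min=0.0) - torch.clamp(x - self.clip, min=0.0)

layer = TrainableClip()
x = torch.randn(4, requires_grad=True) * 10
layer(x).sum().backward()
print(layer.clip.grad)   # nonzero whenever some activation exceeded the threshold
```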

33. Keypoint Autoencoders: Learning Interest Points of Semantics [PDF] 返回目录
  Ruoxi Shi, Zhengrong Xue, Xinyang Li
Abstract: Understanding point clouds is of great importance. Many previous methods focus on detecting salient keypoints to identify structures of point clouds. However, existing methods neglect the semantics of the points selected, leading to poor performance on downstream tasks. In this paper, we propose Keypoint Autoencoder, an unsupervised learning method for detecting keypoints. We encourage selecting sparse semantic keypoints by enforcing the reconstruction from keypoints to the original point cloud. To make sparse keypoint selection differentiable, a Soft Keypoint Proposal is adopted that calculates weighted averages among input points. A downstream task of classifying shapes with sparse keypoints is conducted to demonstrate the distinctiveness of our selected keypoints. Semantic Accuracy and Semantic Richness are proposed as metrics, and our method gives competitive or even better performance than the state of the art on both.
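The Soft Keypoint Proposal can be pictured as a softmax-weighted average over the input points, which keeps keypoint selection differentiable. In this toy sketch the per-point scores are random stand-ins for what the proposal network would predict.

```python
# Differentiable "soft" keypoints: each proposal is a softmax-weighted average
# of the input points. Shapes and the random scores are illustrative only.
import numpy as np

def soft_keypoints(points, scores):
    """points: (N, 3); scores: (K, N) -> (K, 3) soft keypoints."""
    w = np.exp(scores - scores.max(axis=1, keepdims=True))   # stable softmax
    w = w / w.sum(axis=1, keepdims=True)
    return w @ points                                        # weighted averages

cloud = np.random.randn(1024, 3)        # toy point cloud
scores = np.random.randn(8, 1024)       # stand-in for learned per-point scores
print(soft_keypoints(cloud, scores).shape)   # (8, 3): eight soft keypoints
```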

34. Key-Nets: Optical Transformation Convolutional Networks for Privacy Preserving Vision Sensors [PDF] 返回目录
  Jeffrey Byrne, Brian Decann, Scott Bloom
Abstract: Modern cameras are not designed with computer vision or machine learning as the target application. There is a need for a new class of vision sensors that are privacy preserving by design, that do not leak private information and collect only the information necessary for a target machine learning task. In this paper, we introduce key-nets, which are convolutional networks paired with a custom vision sensor which applies an optical/analog transform such that the key-net can perform exact encrypted inference on this transformed image, but the image is not interpretable by a human or any other key-net. We provide five sufficient conditions for an optical transformation suitable for a key-net, and show that generalized stochastic matrices (e.g. scale, bias and fractional pixel shuffling) satisfy these conditions. We motivate the key-net by showing that without it there is a utility/privacy tradeoff for a network fine-tuned directly on optically transformed images for face identification and object detection. Finally, we show that a key-net is equivalent to homomorphic encryption using a Hill cipher, with an upper bound on memory and runtime that scales quadratically with a user specified privacy parameter. Therefore, the key-net is the first practical, efficient and privacy preserving vision sensor based on optical homomorphic encryption.
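A toy stand-in for the keyed transform family the abstract mentions, a generalized stochastic matrix built from pixel shuffling plus per-pixel scale and bias. A real key-net realizes this optically and pairs it with a matched network; the sketch below only shows that such a transform is exactly invertible for a key holder.

```python
# Toy keyed image transform: permutation + per-pixel gain and offset.
# The random key below is purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((16, 16))

perm = rng.permutation(img.size)            # pixel shuffling (part of the "key")
scale = rng.uniform(0.5, 2.0, img.size)     # per-pixel analog gain
bias = rng.uniform(0.0, 0.1, img.size)      # per-pixel analog offset

encoded = (img.ravel()[perm] * scale + bias).reshape(img.shape)

# Only a holder of (perm, scale, bias) can invert the transform:
decoded = np.empty(img.size)
decoded[perm] = (encoded.ravel() - bias) / scale
print(np.allclose(decoded.reshape(img.shape), img))   # True
```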

35. Grasping Field: Learning Implicit Representations for Human Grasps [PDF] 返回目录
  Korrawe Karunratanakul, Jinlong Yang, Yan Zhang, Michael Black, Krikamol Muandet, Siyu Tang
Abstract: In recent years, substantial progress has been made on robotic grasping of household objects. Yet, human grasps are still difficult to synthesize realistically. There are several key reasons: (1) the human hand has many degrees of freedom (more than robotic manipulators); (2) the synthesized hand should conform naturally to the object surface; and (3) it must interact with the object in a semantically and physically plausible manner. To make progress in this direction, we draw inspiration from the recent progress on learning-based implicit representations for 3D object reconstruction. Specifically, we propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks. Our insight is that every point in a three-dimensional space can be characterized by the signed distances to the surface of the hand and the object, respectively. Consequently, the hand, the object, and the contact area can be represented by implicit surfaces in a common space, in which the proximity between the hand and the object can be modelled explicitly. We name this 3D-to-2D mapping the Grasping Field, parameterize it with a deep neural network, and learn it from data. We demonstrate that the proposed grasping field is an effective and expressive representation for human grasp generation. Specifically, our generative model is able to synthesize high-quality human grasps, given only a 3D object point cloud. The extensive experiments demonstrate that our generative model compares favorably with a strong baseline. Furthermore, based on the grasping field representation, we propose a deep network for the challenging task of 3D hand and object reconstruction from a single RGB image. Our method improves the physical plausibility of the 3D hand-object reconstruction task over baselines.
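The point-wise formulation is easy to sketch: every 3D query maps to a pair of signed distances, one to the hand surface and one to the object surface, and points close to both indicate contact. Spheres stand in for the hand and object surfaces purely for illustration.

```python
# Toy grasping-field evaluation: map 3D queries to (distance-to-hand,
# distance-to-object) pairs. Spheres are stand-ins for real implicit surfaces.
import numpy as np

def signed_distance_to_sphere(points, center, radius):
    """Stand-in implicit surface: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

queries = np.random.uniform(-1, 1, size=(1000, 3))
d_hand = signed_distance_to_sphere(queries, np.array([0.3, 0.0, 0.0]), 0.4)
d_obj = signed_distance_to_sphere(queries, np.array([-0.2, 0.0, 0.0]), 0.3)

grasping_field = np.stack([d_hand, d_obj], axis=-1)          # (N, 2)
contact = (np.abs(d_hand) < 0.02) & (np.abs(d_obj) < 0.02)   # near both surfaces
print(grasping_field.shape, int(contact.sum()))
```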

36. Distributed Multi-agent Video Fast-forwarding [PDF] 返回目录
  Shuyue Lan, Zhilu Wang, Amit K. Roy-Chowdhury, Ermin Wei, Qi Zhu
Abstract: In many intelligent systems, a network of agents collaboratively perceives the environment for better and more efficient situation awareness. As these agents often have limited resources, it could be greatly beneficial to identify the content overlapping among camera views from different agents and leverage it for reducing the processing, transmission and storage of redundant/unimportant video frames. This paper presents a consensus-based distributed multi-agent video fast-forwarding framework, named DMVF, that fast-forwards multi-view video streams collaboratively and adaptively. In our framework, each camera view is addressed by a reinforcement learning based fast-forwarding agent, which periodically chooses from multiple strategies to selectively process video frames and transmits the selected frames at adjustable paces. During every adaptation period, each agent communicates with a number of neighboring agents, evaluates the importance of the selected frames from itself and those from its neighbors, refines such evaluation together with other agents via a system-wide consensus algorithm, and uses such evaluation to decide their strategy for the next period. Compared with approaches in the literature on a real-world surveillance video dataset VideoWeb, our method significantly improves the coverage of important frames and also reduces the number of frames processed in the system.
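The consensus step can be sketched as repeated local averaging: each agent mixes its own frame-importance estimate with those of its neighbors until all agents agree. The topology, the scalar importance values, and the equal-weight averaging rule are toy assumptions, not the paper's algorithm.

```python
# Toy consensus over frame-importance estimates on a 4-agent line topology.
import numpy as np

importance = np.array([0.9, 0.2, 0.5, 0.7])          # one estimate per agent
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # who talks to whom

for _ in range(50):                                  # iterate local averaging
    updated = importance.copy()
    for i, nbrs in neighbors.items():
        updated[i] = np.mean([importance[i]] + [importance[j] for j in nbrs])
    importance = updated

print(importance)   # all agents converge to (approximately) the same value
```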

37. Measures of Complexity for Large Scale Image Datasets [PDF] 返回目录
  Ameet Annasaheb Rahane, Anbumani Subramanian
Abstract: Large scale image datasets are a growing trend in the field of machine learning. However, it is hard to quantitatively understand or specify how various datasets compare to each other - i.e., if one dataset is more complex or harder to "learn" with respect to a deep-learning based network. In this work, we build a series of relatively computationally simple methods to measure the complexity of a dataset. Furthermore, we present an approach to demonstrate visualizations of high dimensional data, in order to assist with visual comparison of datasets. We present our analysis using four datasets from the autonomous driving research community - Cityscapes, IDD, BDD and Vistas. Using entropy based metrics, we present a rank-order complexity of these datasets, which we compare with an established rank-order with respect to deep learning.
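One simple entropy-based measure in the spirit of the abstract (not necessarily the paper's exact definition) is the Shannon entropy of the grayscale intensity histogram, averaged over the dataset:

```python
# Average per-image histogram entropy as a crude dataset-complexity score.
import numpy as np

def image_entropy(img, bins=256):
    hist, _ = np.histogram(img, bins=bins, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]                          # drop empty bins to avoid log(0)
    return -(p * np.log2(p)).sum()        # bits per pixel

dataset = [np.random.randint(0, 256, (64, 64)) for _ in range(10)]  # toy images
print(np.mean([image_entropy(im) for im in dataset]))
```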

38. Locating Cephalometric X-Ray Landmarks with Foveated Pyramid Attention [PDF] 返回目录
  Logan Gilmour, Nilanjan Ray
Abstract: CNNs, initially inspired by human vision, differ in a key way: they sample uniformly, rather than with highest density in a focal point. For very large images, this makes training untenable, as the memory and computation required for activation maps scales quadratically with the side length of an image. We propose an image pyramid based approach that extracts narrow glimpses of the input image and iteratively refines them to accomplish regression tasks. To assist with high-accuracy regression, we introduce a novel intermediate representation we call 'spatialized features'. Our approach scales logarithmically with the side length, so it works with very large images. We apply our method to Cephalometric X-ray Landmark Detection and get state-of-the-art results.
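The foveated idea can be sketched as a glimpse pyramid: progressively tighter crops around a fixation point, each resized to one fixed resolution, so cost grows with the number of levels rather than quadratically with the image side length. The window sizes and the block-averaging resize are illustrative choices.

```python
# Extract a stack of fixed-size glimpses around a fixation point.
# Level sizes and the crude block-average resize are illustrative assumptions.
import numpy as np

def glimpse_pyramid(img, cy, cx, out=32, levels=(256, 128, 64, 32)):
    crops = []
    for s in levels:
        half = s // 2
        y0, x0 = max(0, cy - half), max(0, cx - half)
        crop = img[y0:y0 + s, x0:x0 + s]
        k = max(1, crop.shape[0] // out)           # block size for resizing
        crop = crop[:k * out, :k * out].reshape(out, k, -1, k).mean(axis=(1, 3))
        crops.append(crop)
    return np.stack(crops)                         # (num_levels, out, out)

xray = np.random.rand(1024, 1024)                  # toy "X-ray"
print(glimpse_pyramid(xray, cy=512, cx=512).shape) # (4, 32, 32)
```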

39. Bipartite Graph Reasoning GANs for Person Image Generation [PDF] 返回目录
  Hao Tang, Song Bai, Philip H.S. Torr, Nicu Sebe
Abstract: We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the challenging person image generation task. The proposed graph generator mainly consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed Bipartite Graph Reasoning (BGR) block aims to reason the crossing long-range relations between the source pose and the target pose in a bipartite graph, which mitigates some challenges caused by pose deformation. Moreover, we propose a new Interaction-and-Aggregation (IA) block to effectively update and enhance the feature representation capability of both person's shape and appearance in an interactive way. Experiments on two challenging and public datasets, i.e., Market-1501 and DeepFashion, show the effectiveness of the proposed BiGraphGAN in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at this https URL.

40. Unsupervised Deep Metric Learning with Transformed Attention Consistency and Contrastive Clustering Loss [PDF] 返回目录
  Yang Li, Shichao Kan, Zhihai He
Abstract: Existing approaches for unsupervised metric learning focus on exploring self-supervision information within the input image itself. We observe that, when analyzing images, human eyes often compare images against each other instead of examining images individually. In addition, they often pay attention to certain keypoints, image regions, or objects which are discriminative between image classes but highly consistent within classes. Even if the image is being transformed, the attention pattern will be consistent. Motivated by this observation, we develop a new approach to unsupervised deep metric learning where the network is learned based on self-supervision information across images instead of within one single image. To characterize the consistent pattern of human attention during image comparisons, we introduce the idea of transformed attention consistency. It assumes that visually similar images, even undergoing different image transforms, should share the same consistent visual attention map. This consistency leads to a pairwise self-supervision loss, allowing us to learn a Siamese deep neural network to encode and compare images against their transformed or matched pairs. To further enhance the inter-class discriminative power of the feature generated by this network, we adapt the concept of triplet loss from supervised metric learning to our unsupervised case and introduce the contrastive clustering loss. Our extensive experimental results on benchmark datasets demonstrate that our proposed method outperforms current state-of-the-art methods for unsupervised metric learning by a large margin.

41. Visual Imitation Made Easy [PDF] 返回目录
  Sarah Young, Dhiraj Gandhi, Shubham Tulsiani, Abhinav Gupta, Pieter Abbeel, Lerrel Pinto
Abstract: Visual imitation learning provides a framework for learning complex manipulation behaviors by leveraging human demonstrations. However, current interfaces for imitation such as kinesthetic teaching or teleoperation prohibitively restrict our ability to efficiently collect large-scale data in the wild. Obtaining such diverse demonstration data is paramount for the generalization of learned skills to novel scenarios. In this work, we present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots. We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector. To extract action information from these visual demonstrations, we use off-the-shelf Structure from Motion (SfM) techniques in addition to training a finger detection network. We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task. For both tasks, we use standard behavior cloning to learn executable policies from the previously collected offline demonstrations. To improve learning performance, we employ a variety of data augmentations and provide an extensive analysis of its effects. Finally, we demonstrate the utility of our interface by evaluating on real robotic scenarios with previously unseen objects and achieve a 87% success rate on pushing and a 62% success rate on stacking. Robot videos are available at this https URL.

42. A Review on Deep Learning Techniques for the Diagnosis of Novel Coronavirus (COVID-19) [PDF] 返回目录
  Md. Milon Islam, Fakhri Karray, Reda Alhajj, Jia Zeng
Abstract: The novel coronavirus (COVID-19) outbreak has raised a calamitous situation all over the world and has become one of the most acute and severe ailments of the past hundred years. The prevalence rate of COVID-19 is rapidly rising every day throughout the globe. Although no vaccines for this pandemic have been discovered yet, deep learning techniques have proved themselves to be a powerful tool in the arsenal used by clinicians for the automatic diagnosis of COVID-19. This paper aims to overview the recently developed systems based on deep learning techniques using different medical imaging modalities like Computer Tomography (CT) and X-ray. This review specifically discusses the systems developed for COVID-19 diagnosis using deep learning techniques and provides insights on well-known data sets used to train these networks. It also highlights the data partitioning techniques and various performance measures developed by researchers in this field. A taxonomy is drawn to categorize the recent works for proper insight. Finally, we conclude by addressing the challenges associated with the use of deep learning methods for COVID-19 detection and probable future trends in this research area. This paper is intended to provide experts (medical or otherwise) and technicians with new insights into the ways deep learning techniques are used in this regard and how they can potentially be used further in combatting the outbreak of COVID-19.

43. 3D FLAT: Feasible Learned Acquisition Trajectories for Accelerated MRI [PDF] 返回目录
  Jonathan Alush-Aben, Linor Ackerman-Schraier, Tomer Weiss, Sanketh Vedula, Ortal Senouf, Alex Bronstein
Abstract: Magnetic Resonance Imaging (MRI) has long been considered to be among the gold standards of today's diagnostic imaging. The most significant drawback of MRI is long acquisition times, prohibiting its use in standard practice for some applications. Compressed sensing (CS) proposes to subsample the k-space (the Fourier domain dual to the physical space of spatial coordinates) leading to significantly accelerated acquisition. However, the benefit of compressed sensing has not been fully exploited; most of the sampling densities obtained through CS do not produce a trajectory that obeys the stringent constraints of the MRI machine imposed in practice. Inspired by recent success of deep learning based approaches for image reconstruction and ideas from computational imaging on learning-based design of imaging systems, we introduce 3D FLAT, a novel protocol for data-driven design of 3D non-Cartesian accelerated trajectories in MRI. Our proposal leverages the entire 3D k-space to simultaneously learn a physically feasible acquisition trajectory with a reconstruction method. Experimental results, performed as a proof-of-concept, suggest that 3D FLAT achieves higher image quality for a given readout time compared to standard trajectories such as radial, stack-of-stars, or 2D learned trajectories (trajectories that evolve only in the 2D plane while fully sampling along the third dimension). Furthermore, we demonstrate evidence supporting the significant benefit of performing MRI acquisitions using non-Cartesian 3D trajectories over 2D non-Cartesian trajectories acquired slice-wise.

44. Artificial Intelligence to Assist in Exclusion of Coronary Atherosclerosis during CCTA Evaluation of Chest-Pain in the Emergency Department: Preparing an Application for Real-World Use [PDF] 返回目录
  Richard D. White, Barbaros S. Erdal, Mutlu Demirer, Vikash Gupta, Matthew T. Bigelow, Engin Dikici, Sema Candemir, Mauricio S. Galizia, Jessica L. Carpenter, Thomas P. O'Donnell, Abdul H. Halabi, Luciano M. Prevedello
Abstract: Coronary Computed Tomography Angiography (CCTA) evaluation of chest-pain patients in an Emergency Department (ED) is considered appropriate. While a negative CCTA interpretation supports direct patient discharge from an ED, labor-intensive analyses are required, with accuracy in jeopardy from distractions. We describe the development of an Artificial Intelligence (AI) algorithm and workflow for assisting interpreting physicians in CCTA screening for the absence of coronary atherosclerosis. The two-phase approach consisted of (1) Phase 1 - focused on the development and preliminary testing of an algorithm for vessel-centerline extraction classification in a balanced study population (n = 500 with 50% disease prevalence) derived by retrospective random case selection; and (2) Phase 2 - concerned with simulated-clinical Trialing of the developed algorithm on a per-case basis in a more real-world study population (n = 100 with 28% disease prevalence) from an ED chest-pain series. This allowed pre-deployment evaluation of the AI-based CCTA screening application which provides a vessel-by-vessel graphic display of algorithm inference results integrated into a clinically capable viewer. Algorithm performance evaluation used Area Under the Receiver-Operating-Characteristic Curve (AUC-ROC); confusion matrices reflected ground-truth vs AI determinations. The vessel-based algorithm demonstrated strong performance with AUC-ROC = 0.96. In both Phase 1 and Phase 2, independent of disease prevalence differences, negative predictive values at the case level were very high at 95%. The rate of completion of the algorithm workflow process (96% with inference results in 55-80 seconds) in Phase 2 depended on adequate image quality. There is potential for this AI application to assist in CCTA interpretation to help extricate atherosclerosis from chest-pain presentations.

45. AtrialJSQnet: A New Framework for Joint Segmentation and Quantification of Left Atrium and Scars Incorporating Spatial and Shape Information [PDF] 返回目录
  Lei Li, Veronika A. Zimmer, Julia A. Schnabel, Xiahai Zhuang
Abstract: Left atrial (LA) and atrial scar segmentation from late gadolinium enhanced magnetic resonance imaging (LGE MRI) is an important task in clinical practice, to guide ablation therapy and predict treatment results for atrial fibrillation (AF) patients. The automatic segmentation is however still challenging, due to the poor image quality, the various LA shapes, the thin wall, and the surrounding enhanced regions. Previous methods normally solved the two tasks independently and ignored the intrinsic spatial relationship between LA and scars. In this work, we develop a new framework, namely AtrialJSQnet, where LA segmentation, scar projection onto the LA surface, and scar quantification are performed simultaneously in an end-to-end style. We propose a mechanism of shape attention (SA) via an explicit surface projection, to utilize the inherent correlation between LA and LA scars. In specific, the SA scheme is embedded into a multi-task architecture to perform joint LA segmentation and scar quantification. Besides, a spatial encoding (SE) loss is introduced to incorporate continuous spatial information of the target, in order to reduce noisy patches in the predicted segmentation. We evaluated the proposed framework on 60 LGE MRIs from the MICCAI2018 LA challenge. Extensive experiments on a public dataset demonstrated the effect of the proposed AtrialJSQnet, which achieved competitive performance over the state-of-the-art. The relatedness between LA segmentation and scar quantification was explicitly explored and has shown significant performance improvements for both tasks. The code and results will be released publicly once the manuscript is accepted for publication via this https URL.

46. The Umbrella software suite for automated asteroid detection [PDF] 返回目录
  Malin Stanescu, Ovidiu Vaduvescu
Abstract: We present the Umbrella software suite for asteroid detection, validation, identification and reporting. The current core of Umbrella is an open-source modular library, called Umbrella2, that includes algorithms and interfaces for all steps of the processing pipeline, including a novel detection algorithm for faint trails. Building on the library, we have also implemented a detection pipeline accessible both as a desktop program (ViaNearby) and via a web server (Webrella), which we have successfully used in near real-time data reduction of a few asteroid surveys on the Wide Field Camera of the Isaac Newton Telescope. In this paper we describe the library, focusing on the interfaces and algorithms available, and we present the results obtained with the desktop version on a set of well-curated fields used by the EURONEAR project as an asteroid detection benchmark.

47. Implanting Synthetic Lesions for Improving Liver Lesion Segmentation in CT Exams [PDF] 返回目录
  Dario Augusto Borges Oliveira
Abstract: The success of supervised lesion segmentation algorithms using Computed Tomography (CT) exams depends significantly on the quantity and variability of samples available for training. While annotating such data constitutes a challenge itself, the variability of lesions in the dataset also depends on the prevalence of different types of lesions. This phenomenon adds an inherent bias to lesion segmentation algorithms that can be diminished, among different possibilities, using aggressive data augmentation methods. In this paper, we present a method for implanting realistic lesions in CT slices to provide a rich and controllable set of training samples and ultimately improving semantic segmentation network performances for delineating lesions in CT exams. Our results show that implanting synthetic lesions not only improves (up to around 12%) the segmentation performance considering different architectures but also that this improvement is consistent among different image synthesis networks. We conclude that increasing the variability of lesions synthetically in terms of size, density, shape, and position seems to improve the performance of segmentation models for liver lesion segmentation in CT slices.

48. Left Ventricular Wall Motion Estimation by Active Polynomials for Acute Myocardial Infarction Detection [PDF] 返回目录
  Serkan Kiranyaz, Aysen Degerli, Tahir Hamid, Rashid Mazhar, Rayyan Ahmed, Rayaan Abouhasera, Morteza Zabihi, Junaid Malik, Ridha Hamila, Moncef Gabbouj
Abstract: Echocardiogram (echo) is the earliest and the primary tool for identifying regional wall motion abnormalities (RWMA) in order to diagnose myocardial infarction (MI) or commonly known as heart attack. This paper proposes a novel approach, Active Polynomials, which can accurately and robustly estimate the global motion of the Left Ventricular (LV) wall from any echo in a robust and accurate way. The proposed algorithm quantifies the true wall motion occurring in LV wall segments so as to assist cardiologists diagnose early signs of an acute MI. It further enables medical experts to gain an enhanced visualization capability of echo images through color-coded segments along with their "maximum motion displacement" plots helping them to better assess wall motion and LV Ejection-Fraction (LVEF). The outputs of the method can further help echo-technicians to assess and improve the quality of the echocardiogram recording. A major contribution of this study is the first public echo database collection composed by physicians at the Hamad Medical Corporation Hospital in Qatar. The so-called HMC-QU database will serve as the benchmark for the forthcoming relevant studies. The results over the HMC-QU dataset show that the proposed approach can achieve high accuracy, sensitivity and precision in MI detection even though the echo quality is quite poor, and the temporal resolution is low.

49. Multi-modal segmentation of 3D brain scans using neural networks [PDF] 返回目录
  Jonathan Zopes, Moritz Platscher, Silvio Paganucci, Christian Federau
Abstract: Purpose: To implement a brain segmentation pipeline based on convolutional neural networks, which rapidly segments 3D volumes into 27 anatomical structures. To provide an extensive, comparative study of segmentation performance on various contrasts of magnetic resonance imaging (MRI) and computed tomography (CT) scans. Methods: Deep convolutional neural networks are trained to segment 3D MRI (MPRAGE, DWI, FLAIR) and CT scans. A large database of in total 851 MRI/CT scans is used for neural network training. Training labels are obtained on the MPRAGE contrast and coregistered to the other imaging modalities. The segmentation quality is quantified using the Dice metric for a total of 27 anatomical structures. Dropout sampling is implemented to identify corrupted input scans or low-quality segmentations. Full segmentation of 3D volumes with more than 2 million voxels is obtained in less than 1s of processing time on a graphical processing unit. Results: The best average Dice score is found on T1-weighted MPRAGE (85.3±4.6%). However, for FLAIR (80.0±7.1%), DWI (78.2±7.9%) and CT (79.1±7.9%), good-quality segmentation is feasible for most anatomical structures. Corrupted input volumes or low-quality segmentations can be detected using dropout sampling. Conclusion: The flexibility and performance of deep convolutional neural networks enables the direct, real-time segmentation of FLAIR, DWI and CT scans without requiring T1-weighted scans.
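The dropout-sampling quality check can be sketched with standard Monte-Carlo dropout: keep dropout stochastic at inference, run several forward passes, and flag cases where the passes disagree. The tiny two-layer model below is a toy, not the paper's network.

```python
# Monte-Carlo dropout as an uncertainty flag for segmentation quality control.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                      nn.Dropout2d(0.5),              # stays active below
                      nn.Conv2d(8, 2, 3, padding=1))  # toy 2-class "segmenter"

scan = torch.randn(1, 1, 32, 32)
model.train()                          # keep dropout stochastic at inference
with torch.no_grad():
    probs = torch.stack([model(scan).softmax(dim=1) for _ in range(20)])

uncertainty = probs.var(dim=0).mean()  # high variance -> flag scan for review
print(float(uncertainty))
```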

50. Surgical Mask Detection with Convolutional Neural Networks and Data Augmentations on Spectrograms [PDF] 返回目录
  Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien
Abstract: In many fields of research, labeled datasets are hard to acquire. This is where data augmentation promises to overcome the lack of training data in the context of neural network engineering and classification tasks. The idea here is to reduce model over-fitting to the feature distribution of a small under-descriptive training dataset. We try to evaluate such data augmentation techniques to gather insights in the performance boost they provide for several convolutional neural networks on mel-spectrogram representations of audio data. We show the impact of data augmentation on the binary classification task of surgical mask detection in samples of human voice (ComParE Challenge 2020). Also we consider four varying architectures to account for augmentation robustness. Results show that most of the baselines given by ComParE are outperformed.

51. PiNet: Attention Pooling for Graph Classification [PDF] 返回目录
  Peter Meltzer, Marcelo Daniel Gutierrez Mallea, Peter J. Bentley
Abstract: We propose PiNet, a generalised differentiable attention-based pooling mechanism for utilising graph convolution operations for graph level classification. We demonstrate high sample efficiency and superior performance over other graph neural networks in distinguishing isomorphic graph classes, as well as competitive results with state of the art methods on standard chemo-informatics datasets.
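A generic attention-pooling step for graph classification, in the spirit of the abstract (the exact PiNet formulation may differ): one projection scores each node, another embeds it, and the graph vector is the attention-weighted sum, which makes the readout invariant to node ordering.

```python
# Attention pooling of node features into one graph-level vector.
# Dimensions and the random parameters are illustrative only.
import numpy as np

def attention_pool(node_feats, w_score, w_embed):
    """node_feats: (N, F) -> (D,) pooled graph representation."""
    scores = node_feats @ w_score                 # (N,) unnormalized scores
    att = np.exp(scores - scores.max())
    att = att / att.sum()                         # softmax over the nodes
    return att @ (node_feats @ w_embed)           # weighted sum of embeddings

rng = np.random.default_rng(1)
nodes = rng.random((30, 16))                      # 30 nodes, 16 features each
print(attention_pool(nodes, rng.random(16), rng.random((16, 8))).shape)  # (8,)
```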

52. Extension of JPEG XS for Two-Layer Lossless Coding [PDF] 返回目录
  Hiroyuki Kobayashi, Hitoshi Kiya
Abstract: A two-layer lossless image coding method compatible with JPEG XS is proposed. JPEG XS is a new international standard for still image coding that has the characteristics of very low latency and very low complexity. However, it does not support lossless coding, although it can achieve visual lossless coding. The proposed method has a two-layer structure similar to JPEG XT, which consists of JPEG XS coding and a lossless coding method. As a result, it enables us to losslessly restore original images, while maintaining compatibility with JPEG XS.
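The two-layer principle is small enough to show end to end, with coarse quantization standing in for the lossy JPEG XS base codec: the enhancement layer stores the residual losslessly, so base plus residual reproduces the original exactly.

```python
# Two-layer lossless coding sketch: lossy base layer + lossless residual layer.
# Coarse quantization is a stand-in for the actual JPEG XS codec.
import numpy as np

img = np.random.randint(0, 256, (8, 8)).astype(np.int16)

base = (img // 16) * 16        # toy lossy base layer (JPEG XS stand-in)
residual = img - base          # enhancement layer, stored losslessly

restored = base + residual
print(np.array_equal(restored, img))   # True: exact reconstruction
```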

53. Thick Cloud Removal of Remote Sensing Images Using Temporal Smoothness and Sparsity-Regularized Tensor Optimization [PDF] 返回目录
  Chenxi Duan, Jun Pan, Rui Li
Abstract: In remote sensing images, the presence of thick cloud accompanied by cloud shadow is a high-probability event, which can affect the quality of subsequent processing and limit the scenarios of application. Hence, removing the thick cloud and cloud shadow as well as recovering the cloud-contaminated pixels is indispensable to make good use of remote sensing images. In this paper, a novel thick cloud removal method for remote sensing images based on temporal smoothness and sparsity-regularized tensor optimization (TSSTO) is proposed. The basic idea of TSSTO is that the thick cloud and cloud shadow are not only sparse but also smooth along the horizontal and vertical directions in images, while the clean images are smooth along the temporal direction between images. Therefore, the sparsity norm is used to boost the sparsity of the cloud and cloud shadow, and unidirectional total variation (UTV) regularizers are applied to ensure the unidirectional smoothness. This paper utilizes the alternating direction method of multipliers (ADMM) to solve the presented model and generate the cloud and cloud shadow element as well as the clean element. The cloud and cloud shadow element is purified to get the cloud area and cloud shadow area. Then, the clean area of the original cloud-contaminated images is replaced to the corresponding area of the clean element. Finally, the reference image is selected to reconstruct details of the cloud area and cloud shadow area using the information cloning method. A series of experiments are conducted both on simulated and real cloud-contaminated images from different sensors and with different resolutions, and the results demonstrate the potential of the proposed TSSTO method for removing cloud and cloud shadow from both qualitative and quantitative viewpoints.
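The unidirectional total variation (UTV) regularizers are simple to write down: each penalizes absolute differences along a single image direction, unlike an isotropic TV penalty. A minimal NumPy version:

```python
# Unidirectional total variation along each image axis.
import numpy as np

def utv_horizontal(x):
    return np.abs(np.diff(x, axis=1)).sum()   # smoothness along rows

def utv_vertical(x):
    return np.abs(np.diff(x, axis=0)).sum()   # smoothness along columns

cloud_component = np.random.rand(16, 16)       # toy cloud/shadow component
print(utv_horizontal(cloud_component), utv_vertical(cloud_component))
```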

54. SAFRON: Stitching Across the Frontier for Generating Colorectal Cancer Histology Images [PDF] 返回目录
  Srijay Deshpande, Fayyaz Minhas, Simon Graham, Nasir Rajpoot
Abstract: Synthetic images can be used for the development and evaluation of deep learning algorithms in the context of limited availability of annotations. In the field of computational pathology, where histology images are large and visual context is crucial, synthesis of large tissue images via generative modeling is a challenging task due to memory and computing constraints hindering the generation of large images. To address this challenge, we propose a novel framework named SAFRON to construct realistic large tissue image tiles from ground truth annotations while preserving morphological features and with minimal boundary artifacts at the seams. To this end, we train the proposed SAFRON framework based on conditional generative adversarial networks on large tissue image tiles from the Colorectal Adenocarcinoma Gland (CRAG) and DigestPath datasets. We demonstrate that our model can generate high quality and realistic image tiles of arbitrary large size after training it on relatively small image patches. We also show that training on synthetic data generated by SAFRON can significantly boost the performance of a standard algorithm for gland segmentation of colorectal cancer tissue images. Sample high resolution images generated using SAFRON are available at this https URL.

55. Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling [PDF] 返回目录
  Jiacheng Li, Siliang Tang, Juncheng Li, Jun Xiao, Fei Wu, Shiliang Pu, Yueting Zhuang
Abstract: Visual Storytelling (VIST) is the task of telling a narrative story about a certain topic according to a given photo stream. Existing studies focus on designing complex models, which rely on a huge amount of human-annotated data. However, the annotation of VIST is extremely costly, and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell a story, we propose a topic adaptive storyteller to model the ability of inter-topic generalization. In practice, we apply the gradient-based meta-learning algorithm on multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. Besides, we further propose a prototype encoding structure to model the ability of intra-topic derivation. Specifically, we encode and store the few training story texts to serve as a reference to guide the generation at inference time. Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model on the BLEU and METEOR metrics. A further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.

56. ARPM-net: A novel CNN-based adversarial method with Markov Random Field enhancement for prostate and organs at risk segmentation in pelvic CT images [PDF] 返回目录
  Zhuangzhuang Zhang, Tianyu Zhao, Hiram Gay, Weixiong Zhang, Baozhou Sun
Abstract: Purpose: The research is to develop a novel CNN-based adversarial deep learning method to improve and expedite the multi-organ semantic segmentation of CT images, and to generate accurate contours on pelvic CT images. Methods: Planning CT and structure datasets for 110 patients with intact prostate cancer were retrospectively selected and divided for 10-fold cross-validation. The proposed adversarial multi-residual multi-scale pooling Markov Random Field (MRF) enhanced network (ARPM-net) implements an adversarial training scheme. A segmentation network and a discriminator network were trained jointly, and only the segmentation network was used for prediction. The segmentation network integrates a newly designed MRF block into a variation of multi-residual U-net. The discriminator takes the product of the original CT and the prediction/ground-truth as input and classifies the input into fake/real. The segmentation network and discriminator network can be trained jointly as a whole, or the discriminator can be used for fine-tuning after the segmentation network is coarsely trained. Multi-scale pooling layers were introduced to preserve spatial resolution during pooling using less memory compared to atrous convolution layers. An adaptive loss function was proposed to enhance the training on small or low contrast organs. The accuracy of modeled contours was measured with the Dice similarity coefficient (DSC), Average Hausdorff Distance (AHD), Average Surface Hausdorff Distance (ASHD), and relative Volume Difference (VD) using clinical contours as references to the ground-truth. The proposed ARPM-net method was compared to several state-of-the-art deep learning methods.

57. Spatio-temporal Attention Model for Tactile Texture Recognition [PDF] 返回目录
  Guanqun Cao, Yi Zhou, Danushka Bollegala, Shan Luo
Abstract: Recently, tactile sensing has attracted great interest in robotics, especially for facilitating exploration of unstructured environments and effective manipulation. A detailed understanding of the surface textures via tactile sensing is essential for many of these tasks. Previous works on texture recognition using camera based tactile sensors have been limited to treating all regions in one tactile image or all samples in one tactile sequence equally, which includes much irrelevant or redundant information. In this paper, we propose a novel Spatio-Temporal Attention Model (STAM) for tactile texture recognition, which is the very first of its kind to our best knowledge. The proposed STAM pays attention to both spatial focus of each single tactile texture and the temporal correlation of a tactile sequence. In the experiments to discriminate 100 different fabric textures, the spatially and temporally selective attention has resulted in a significant improvement of the recognition accuracy, by up to 18.8%, compared to the non-attention based models. Specifically, after introducing noisy data that is collected before the contact happens, our proposed STAM can learn the salient features efficiently and the accuracy can increase by 15.23% on average compared with the CNN based baseline approach. The improved tactile texture perception can be applied to facilitate robot tasks like grasping and manipulation.

58. GANDALF: Generative Adversarial Networks with Discriminator-Adaptive Loss Fine-tuning for Alzheimer's Disease Diagnosis from MRI [PDF] 返回目录
  Hoo-Chang Shin, Alvin Ihsani, Ziyue Xu, Swetha Mandava, Sharath Turuvekere Sreenivas, Christopher Forster, Jiook Cha, Alzheimer's Disease Neuroimaging Initiative
Abstract: Positron Emission Tomography (PET) is now regarded as the gold standard for the diagnosis of Alzheimer's Disease (AD). However, PET imaging can be prohibitive in terms of cost and planning, and is also among the imaging techniques with the highest dosage of radiation. Magnetic Resonance Imaging (MRI), in contrast, is more widely available and provides more flexibility when setting the desired image resolution. Unfortunately, the diagnosis of AD using MRI is difficult due to the very subtle physiological differences between healthy and AD subjects visible on MRI. As a result, many attempts have been made to synthesize PET images from MR images using generative adversarial networks (GANs) in the interest of enabling the diagnosis of AD from MR. Existing work on PET synthesis from MRI has largely focused on Conditional GANs, where MR images are used to generate PET images and subsequently used for AD diagnosis. There is no end-to-end training goal. This paper proposes an alternative approach to the aforementioned, where AD diagnosis is incorporated in the GAN training objective to achieve the best AD classification performance. Different GAN losses are fine-tuned based on the discriminator performance, and the overall training is stabilized. The proposed network architecture and training regime show state-of-the-art performance for three- and four-class AD classification tasks.

59. GANBERT: Generative Adversarial Networks with Bidirectional Encoder Representations from Transformers for MRI to PET synthesis [PDF] 返回目录
  Hoo-Chang Shin, Alvin Ihsani, Swetha Mandava, Sharath Turuvekere Sreenivas, Christopher Forster, Jiook Cha, Alzheimer's Disease Neuroimaging Initiative
Abstract: Synthesizing medical images, such as PET, is a challenging task because the intensity range is much wider and denser than in photographs and digital renderings, and is often heavily biased toward zero. Above all, intensity values in PET have absolute significance, and are used to compute parameters that are reproducible across the population. Yet, much manual adjustment usually has to be made in pre-/post-processing when synthesizing PET images, because their intensity ranges can vary a lot, e.g., between -100 and 1000 in floating point values. To overcome these challenges, we adopt the Bidirectional Encoder Representations from Transformers (BERT) algorithm, which has had great success in natural language processing (NLP): wide-range floating point intensity values are represented as integers ranging between 0 and 10000 that resemble a dictionary of natural language vocabularies. BERT is then trained to predict a proportion of masked values in images, where its "next sentence prediction (NSP)" head acts as the GAN discriminator. Our proposed approach is able to generate PET images from MRI images over a wide intensity range, with no manual adjustments in pre-/post-processing. It is a method that can scale and is ready to deploy.
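The tokenization step the abstract describes can be sketched under the assumption of a fixed clipping range: wide-range float intensities map to integer tokens in [0, 10000], which a BERT-style model treats as a vocabulary, and the inverse map bounds the quantization error.

```python
# Map wide-range float intensities to integer tokens and back.
# The [-100, 1000] range follows the example in the abstract; the rounding
# scheme is an assumption.
import numpy as np

def intensities_to_tokens(x, lo=-100.0, hi=1000.0, vocab=10000):
    x = np.clip(x, lo, hi)                       # bound the dynamic range
    return np.round((x - lo) / (hi - lo) * vocab).astype(np.int64)

def tokens_to_intensities(t, lo=-100.0, hi=1000.0, vocab=10000):
    return t.astype(np.float64) / vocab * (hi - lo) + lo

pet = np.random.uniform(-100, 1000, size=(4, 4))       # toy PET intensities
tok = intensities_to_tokens(pet)
print(tok.min(), tok.max())                            # tokens within [0, 10000]
print(np.abs(tokens_to_intensities(tok) - pet).max())  # small round-trip error
```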

60. Predicting Risk of Developing Diabetic Retinopathy using Deep Learning [PDF] 返回目录
  Ashish Bora, Siva Balasubramanian, Boris Babenko, Sunny Virmani, Subhashini Venugopalan, Akinori Mitani, Guilherme de Oliveira Marinho, Jorge Cuadros, Paisan Ruamviboonsuk, Greg S Corrado, Lily Peng, Dale R Webster, Avinash V Varadarajan, Naama Hammel, Yun Liu, Pinal Bavishi
Abstract: Diabetic retinopathy (DR) screening is instrumental in preventing blindness, but faces a scaling challenge as the number of diabetic patients rises. Risk stratification for the development of DR may help optimize screening intervals to reduce costs while improving vision-related outcomes. We created and validated two versions of a deep learning system (DLS) to predict the development of mild-or-worse ("Mild+") DR in diabetic patients undergoing DR screening. The two versions used either three fields or a single field of color fundus photographs (CFPs) as input. The training set was derived from 575,431 eyes, of which 28,899 had a known 2-year outcome, and the remaining were used to augment the training process via multi-task learning. Validation was performed on both an internal validation set (set A; 7,976 eyes; 3,678 with known outcome) and an external validation set (set B; 4,762 eyes; 2,345 with known outcome). For predicting 2-year development of DR, the 3-field DLS had an area under the receiver operating characteristic curve (AUC) of 0.79 (95%CI, 0.78-0.81) on validation set A. On validation set B (which contained only a single field), the 1-field DLS's AUC was 0.70 (95%CI, 0.67-0.74). The DLS was prognostic even after adjusting for available risk factors (p<0.001). When added to the risk factors, the 3-field DLS improved the AUC from 0.72 (95%CI, 0.68-0.76) to 0.81 (95%CI, 0.77-0.84) in validation set A, and the 1-field DLS improved the AUC from 0.62 (95%CI, 0.58-0.66) to 0.71 (95%CI, 0.68-0.75) in validation set B. The DLSs in this study identified prognostic information for DR development from CFPs. This information is independent of, and more informative than, the available risk factors.
