Abstracts

1. Image Classification by Reinforcement Learning with Two-State Q-Learning
Abdul Mueed Hafiz, Ghulam Mohiuddin Bhat
Abstract: In this paper, a simple and efficient hybrid classifier based on deep learning and reinforcement learning is presented. Q-learning is used with two states and 'two or three' actions. Other techniques in the literature extract feature maps from Convolutional Neural Networks and use them in the Q-states along with past history. This leads to technical difficulties in those approaches, because the large dimensions of the feature maps make the number of states high. Because our technique uses only two Q-states, it is straightforward, has far fewer optimization parameters, and consequently has a simple reward function. Also, the proposed technique uses novel actions for processing images compared to other techniques in the literature. The performance of the proposed technique is compared with other recent algorithms such as ResNet50 and InceptionV3 on popular databases including ImageNet, the Cats and Dogs Dataset, and the Caltech-101 Dataset. Our approach outperforms the other techniques on all the datasets used.
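The abstract does not say what the two states, the actions, or the reward are, so the following is only a minimal sketch of the generic tabular Q-learning update on a two-state, two-action problem. The environment `toy_step`, the hyperparameters, and all names are hypothetical and are not taken from the paper.

```python
import random


def q_learning(step, n_states=2, n_actions=2, episodes=500, horizon=20,
               alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with an epsilon-greedy policy.

    `step(state, action)` must return `(next_state, reward)`.
    """
    rng = random.Random(seed)
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        state = 0  # every episode starts in state 0 (an assumption)
        for _ in range(horizon):
            if rng.random() < epsilon:
                action = rng.randrange(n_actions)  # explore
            else:
                action = max(range(n_actions), key=lambda a: q[state][a])  # exploit
            next_state, reward = step(state, action)
            # Standard Q-learning temporal-difference update.
            td_target = reward + gamma * max(q[next_state])
            q[state][action] += alpha * (td_target - q[state][action])
            state = next_state
    return q


def toy_step(state, action):
    """Hypothetical environment: action 1 is rewarded and leads to state 1."""
    if action == 1:
        return 1, 1.0
    return 0, 0.0


q_table = q_learning(toy_step)
for s, row in enumerate(q_table):
    print(f"state {s}: Q = {[round(v, 2) for v in row]}")
```

With only two states the whole Q-table holds four numbers, which is the kind of small parameter count the abstract contrasts with Q-states built from high-dimensional feature maps.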

2. A Closer Look at Local Aggregation Operators in Point Cloud Analysis
Ze Liu, Han Hu, Yue Cao, Zheng Zhang, Xin Tong
Abstract: Recent advances in network architectures for point cloud processing are mainly driven by new designs of local aggregation operators. However, the impact of these operators on network performance has not been carefully investigated, owing to the different overall network architectures and implementation details in each solution. Meanwhile, most operators are only applied in shallow architectures. In this paper, we revisit the representative local aggregation operators and study their performance using the same deep residual architecture. Our investigation reveals that, despite their different designs, all of these operators make surprisingly similar contributions to network performance under the same network input and feature numbers, and result in state-of-the-art accuracy on standard benchmarks. This finding stimulates us to rethink the necessity of sophisticated designs of local aggregation operators for point cloud processing. To this end, we propose a simple local aggregation operator without learnable weights, named Position Pooling (PosPool), which performs similarly to or slightly better than existing sophisticated operators. In particular, a simple deep residual network with PosPool layers achieves outstanding performance on all benchmarks and outperforms the previous state-of-the-art methods on the challenging PartNet datasets by a large margin (7.4 mIoU). The code is publicly available at this https URL.
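The abstract only states that PosPool has no learnable weights. One parameter-free aggregation consistent with that description is sketched below as an assumption: the feature channels are split into three groups, each group is scaled by the neighbor's relative x, y, or z offset, and the results are averaged over the neighborhood. The function name `pos_pool` and the data layout are illustrative, not the authors' code.

```python
def pos_pool(center, neighbors, features):
    """Weight-free local aggregation over one neighborhood (illustrative).

    center:    (x, y, z) of the query point
    neighbors: list of (x, y, z) neighbor coordinates
    features:  one channel list per neighbor; channel count divisible by 3
    """
    n_channels = len(features[0])
    if n_channels % 3 != 0:
        raise ValueError("channel count must be divisible by 3")
    group = n_channels // 3
    pooled = [0.0] * n_channels
    for point, feat in zip(neighbors, features):
        offset = [point[axis] - center[axis] for axis in range(3)]
        for c in range(n_channels):
            # Channel group 0 is scaled by dx, group 1 by dy, group 2 by dz.
            pooled[c] += offset[c // group] * feat[c]
    return [value / len(neighbors) for value in pooled]


pooled = pos_pool(
    center=(0.0, 0.0, 0.0),
    neighbors=[(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
    features=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
)
print(pooled)
```

The only inputs are coordinates and features, so nothing in the operator is trained; any learning happens in the layers around it.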

3. Deep Single Image Manipulation
Yael Vinker, Eliahu Horwitz, Nir Zabari, Yedid Hoshen
Abstract: Image manipulation has attracted much research over the years due to the popularity and commercial importance of the task. In recent years, deep neural network methods have been proposed for many image manipulation tasks. A major issue with deep methods is the need to train on large amounts of data from the same distribution as the target image, whereas collecting datasets encompassing the entire long tail of images is impossible. In this paper, we demonstrate that simply training a conditional adversarial generator on the single target image is sufficient for performing complex image manipulations. We find that the key to enabling single-image training is extensive augmentation of the input image, and we provide a novel augmentation method. Our network learns to map from a primitive representation of the image (e.g. edges) to the image itself. At manipulation time, our generator allows for making general image changes by modifying the primitive input representation and mapping it through the network. We extensively evaluate our method and find that it provides remarkable performance.

4. RELATE: Physically Plausible Multi-Object Scene Synthesis Using Structured Latent Spaces
Sebastien Ehrhardt, Oliver Groth, Aron Monszpart, Martin Engelcke, Ingmar Posner, Niloy Mitra, Andrea Vedaldi
Abstract: We present RELATE, a model that learns to generate physically plausible scenes and videos of multiple interacting objects. Similar to other generative approaches, RELATE is trained end-to-end on raw, unlabeled data. RELATE combines an object-centric GAN formulation with a model that explicitly accounts for correlations between individual objects. This allows the model to generate realistic scenes and videos from a physically-interpretable parameterization. Furthermore, we show that modeling the object correlation is necessary to learn to disentangle object positions and identity. We find that RELATE is also amenable to physically realistic scene editing and that it significantly outperforms prior art in object-centric scene generation on both synthetic (CLEVR, ShapeStacks) and real-world data (street traffic scenes). In addition, in contrast to state-of-the-art methods in object-centric generative modeling, RELATE also extends naturally to dynamic scenes and generates videos of high visual fidelity.

5. Curriculum Manager for Source Selection in Multi-Source Domain Adaptation
Luyu Yang, Yogesh Balaji, Ser-Nam Lim, Abhinav Shrivastava
Abstract: The performance of Multi-Source Unsupervised Domain Adaptation depends significantly on the effectiveness of transfer from labeled source domain samples. In this paper, we propose an adversarial agent that learns a dynamic curriculum for source samples, called Curriculum Manager for Source Selection (CMSS). The Curriculum Manager, an independent network module, constantly updates the curriculum during training, and iteratively learns which domains or samples are best suited for aligning to the target. The intuition behind this is to force the Curriculum Manager to constantly re-measure the transferability of latent domains over time to adversarially raise the error rate of the domain discriminator. CMSS does not require any knowledge of the domain labels, yet it outperforms other methods on four well-known benchmarks by significant margins. We also provide interpretable results that shed light on the proposed method.

6. Globally Optimal Segmentation of Mutually Interacting Surfaces using Deep Learning
Hui Xie, Zhe Pan, Leixin Zhou, Fahim A Zaman, Danny Chen, Jost B Jonas, Yaxing Wang, Xiaodong Wu
Abstract: Segmentation of multiple surfaces in medical images is a challenging problem, further complicated by the frequent presence of weak boundaries and mutual influence between adjacent objects. The traditional graph-based optimal surface segmentation method has proven its effectiveness through its ability to capture various surface priors in a uniform graph model. However, its efficacy relies heavily on handcrafted features that are used to define the surface cost for the "goodness" of a surface. Recently, deep learning (DL) has been emerging as a powerful tool for medical image segmentation thanks to its superior feature learning capability. Unfortunately, due to the scarcity of training data in medical imaging, it is nontrivial for DL networks to implicitly learn the global structure of the target surfaces, including surface interactions. In this work, we propose to parameterize the surface cost functions in the graph model and leverage DL to learn those parameters. The multiple optimal surfaces are then simultaneously detected by minimizing the total surface cost while explicitly enforcing the mutual surface interaction constraints. The optimization problem is solved by the primal-dual Interior Point Method, which can be implemented by a layer of neural networks, enabling efficient end-to-end training of the whole network. Experiments on Spectral Domain Optical Coherence Tomography (SD-OCT) retinal layer segmentation and Intravascular Ultrasound (IVUS) vessel wall segmentation demonstrated very promising results. All source code is public to facilitate further research in this direction.

7. Learning ordered pooling weights in image classification
J.I.Forcen, Miguel Pagola, Edurne Barrenechea, Humberto Bustince
Abstract: Spatial pooling is an important step in computer vision systems like Convolutional Neural Networks or the Bag-of-Words method. The spatial pooling purpose is to combine neighbouring descriptors to obtain a single descriptor for a given region (local or global). The resultant combined vector must be as discriminant as possible, in other words, must contain relevant information, while removing irrelevant and confusing details. Maximum and average are the most common aggregation functions used in the pooling step. To improve the aggregation of relevant information without degrading their discriminative power for image classification, we introduce a simple but effective scheme based on Ordered Weighted Average (OWA) aggregation operators. We present a method to learn the weights of the OWA aggregation operator in a Bag-of-Words framework and in Convolutional Neural Networks, and provide an extensive evaluation showing that OWA based pooling outperforms classical aggregation operators.
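An Ordered Weighted Average sorts the values in a region and then applies a weight to each rank, so maximum and average pooling fall out as special cases of the weight vector. The sketch below shows only the aggregation step; the weight vectors are hand-picked for illustration, and how the paper learns them is not described in the abstract.

```python
def owa_pool(values, weights):
    """Ordered Weighted Average: weights apply to ranks, not to positions."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per value")
    ordered = sorted(values, reverse=True)  # largest value gets weights[0]
    return sum(w * v for w, v in zip(weights, ordered))


region = [0.2, 0.9, 0.4, 0.5]

max_pooled = owa_pool(region, [1.0, 0.0, 0.0, 0.0])      # all weight on rank 1
avg_pooled = owa_pool(region, [0.25, 0.25, 0.25, 0.25])  # uniform weights
mixed_pooled = owa_pool(region, [0.5, 0.3, 0.2, 0.0])    # illustrative weights

print(max_pooled, avg_pooled, mixed_pooled)
```

A learned weight vector can therefore sit anywhere between these two classical operators, which is the flexibility the abstract refers to.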

8. Reinforcement Learning Based Handwritten Digit Recognition with Two-State Q-Learning
Abdul Mueed Hafiz, Ghulam Mohiuddin Bhat
Abstract: We present a simple yet efficient Hybrid Classifier based on Deep Learning and Reinforcement Learning. Q-Learning is used with two Q-states and four actions. Conventional techniques use feature maps extracted from Convolutional Neural Networks (CNNs) and include them in the Q-states along with past history. This leads to difficulties with these approaches, as the number of states is very large due to the high dimensions of the feature maps. Since our method uses only two Q-states, it is simple, has far fewer parameters to optimize, and thus also has a straightforward reward function. Also, the approach uses unexplored actions for image processing vis-a-vis other contemporary techniques. Three datasets have been used for benchmarking the approach: the MNIST Digit Image Dataset, the USPS Digit Image Dataset, and the MATLAB Digit Image Dataset. The performance of the proposed hybrid classifier has been compared with other contemporary techniques such as a well-established Reinforcement Learning technique, AlexNet, a CNN-Nearest Neighbor classifier, and a CNN-Support Vector Machine classifier. Our approach outperforms these contemporary hybrid classifiers on all three datasets used.

9. Multiclass Classification with an Ensemble of Binary Classification Deep Networks
Abdul Mueed Hafiz, Ghulam Mohiuddin Bhat
Abstract: Deep neural network classifiers have been used frequently and are efficient. In multiclass deep network classifiers, the burden of classifying samples of different classes is put on a single classifier. As shown in this paper, the classification capability of deep networks can be further increased by using an ensemble of binary classification deep networks. In the proposed approach, a single (one-versus-all) deep network binary classifier is dedicated to each category. Subsequently, binary classification deep network ensembles have been investigated. Every network in an ensemble has been trained by a one-versus-all binary training technique using the Stochastic Gradient Descent with Momentum algorithm. To classify a test sample, the sample is presented to each network in the ensemble. After softmax-layer score voting, the network with the largest score is assumed to have classified the sample. Digit image recognition has been used for experimentation. Three datasets have been used, viz. the MATLAB Digit Image Dataset, the USPS+ Digit Image Dataset, and the MNIST Digit Image Dataset. The experiments demonstrate that, given sufficient training, a Binary Classification Convolutional Neural Network (BCCNN) ensemble can outperform a conventional Multi-class Convolutional Neural Network (MCNN). In one of the experiments, after training and testing a BCCNN ensemble and an MCNN respectively on a subset of the MNIST Digit Image Dataset, the BCCNN ensemble gave a higher accuracy of 98.03%, compared to the MCNN, which gave an accuracy of 97.90%. The architecture of the BCCNNs in an ensemble has also been modified in order to increase their recognition accuracy. On a large subset of the MNIST Digit Image Dataset, the modified BCCNN ensemble gave a higher accuracy of 98.50%, whereas the MCNN gave an accuracy of 98.4875%.
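The voting step described above can be sketched as follows, assuming each one-versus-all network emits two logits, (not class k, class k), and that the positive-class softmax score is what gets compared. The logits are made-up numbers standing in for real network outputs, and the names are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(binary_logits):
    """Pick the class whose one-versus-all network is most confident.

    binary_logits[k] = (logit_not_class_k, logit_class_k) from network k.
    """
    positive_scores = [softmax(list(pair))[1] for pair in binary_logits]
    label = max(range(len(positive_scores)), key=lambda k: positive_scores[k])
    return label, positive_scores


# Made-up outputs of three binary networks for one test sample.
label, scores = ensemble_predict([(2.0, -1.0), (0.0, 0.5), (-1.0, 3.0)])
print(label, [round(s, 3) for s in scores])
```

Each network only has to separate its own class from everything else, at the cost of running one forward pass per class at test time.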

10. Weakly Supervised Segmentation with Multi-scale Adversarial Attention Gates
Gabriele Valvano, Andrea Leo, Sotirios A. Tsaftaris
Abstract: Large, fine-grained image segmentation datasets, annotated at pixel-level, are difficult to obtain, particularly in medical imaging, where annotations also require expert knowledge. Weakly-supervised learning can train models by relying on weaker forms of annotation, such as scribbles. Here, we learn to segment using scribble annotations in an adversarial game. With unpaired segmentation masks, we train a multi-scale GAN to generate realistic segmentation masks at multiple resolutions, while we use scribbles to learn the correct position in the image. Central to the model's success is a novel attention gating mechanism, which we condition with adversarial signals to act as a shape prior, resulting in better object localization at multiple scales. We evaluated our model on several medical (ACDC, LVSC, CHAOS) and non-medical (PPSS) datasets, and we report performance levels matching those achieved by models trained with fully annotated segmentation masks. We also demonstrate extensions in a variety of settings: semi-supervised learning; combining multiple scribble sources (a crowdsourcing scenario) and multi-task learning (combining scribble and mask supervision). We will release expert-made scribble annotations for the ACDC dataset, and the code used for the experiments, at this https URL.

11. JUMPS: Joints Upsampling Method for Pose Sequences
Lucas Mourot, François Le Clerc, Cédric Thébault, Pierre Hellier
Abstract: Human Pose Estimation is a low-level task useful for surveillance, human action recognition, and scene understanding at large. It also offers promising perspectives for the animation of synthetic characters. For all these applications, and especially the latter, estimating the positions of many joints is desirable for improved performance and realism. To this end, we propose a novel method called JUMPS for increasing the number of joints in 2D pose estimates and recovering occluded or missing joints. We believe this is the first attempt to address the issue. We build on a deep generative model that combines a GAN and an encoder. The GAN learns the distribution of high-resolution human pose sequences, while the encoder maps the input low-resolution sequences to its latent space. Inpainting is obtained by computing the latent representation whose decoding by the GAN generator optimally matches the joint locations at the input. Post-processing a 2D pose sequence using our method provides a richer representation of the character motion. We show experimentally that the localization accuracy of the additional joints is on average on par with the original pose estimates.

12. Automatic Page Segmentation Without Decompressing the Run-Length Compressed Text Documents
Mohammed Javed, P. Nagabhushan
Abstract: Page segmentation is considered to be the crucial stage for the automatic analysis of documents with complex layouts. This has traditionally been carried out on uncompressed documents, although most documents in real life exist in a compressed form, warranted by the requirement to make storage and transfer efficient. However, carrying out page segmentation directly on compressed documents without going through the stage of decompression is a challenging goal. This research paper sets out to demonstrate the possibility of carrying out page segmentation directly on the run-length data of a CCITT Group-3 compressed text document, which could be single- or multi-columned and might even have some text regions in inverted text color mode. Therefore, before segmenting the text document into columns, each column into paragraphs, each paragraph into text lines, each line into words, and, finally, each word into characters, the text document needs to be pre-processed. The pre-processing stage identifies the normal text regions and inverted text regions, and the inverted text regions are toggled to the normal mode. Next, to initiate column separation, a new strategy based on incremental assimilation of white-space runs in the vertical direction, together with auto-estimation of certain related parameters, is proposed. A procedure to realize column segmentation employing these extracted parameters has been devised. Subsequently, a two-level horizontal row separation process segments every column into paragraphs and, in turn, into text lines. Then, a two-level vertical column separation process completes the separation into words and characters.

13. Motion Prediction in Visual Object Tracking
Jianren Wang, Yihui He
Abstract: Visual object tracking (VOT) is an essential component for many applications, such as autonomous driving or assistive robotics. However, recent works tend to develop accurate systems based on more computationally expensive feature extractors for better instance matching. In contrast, this work addresses the importance of motion prediction in VOT. We use an off-the-shelf object detector to obtain instance bounding boxes. Then, a combination of camera motion decoupling and a Kalman filter is used for state estimation. Although our baseline system is a straightforward combination of standard methods, we obtain state-of-the-art results. Our method establishes new state-of-the-art performance on VOT (VOT-2016 and VOT-2018). Our proposed method improves the EAO on VOT-2016 from the 0.472 of prior art to 0.505, and from 0.410 to 0.431 on VOT-2018. To show the generalizability, we also test our method on video object segmentation (VOS: DAVIS-2016 and DAVIS-2017) and observe consistent improvement.
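The abstract does not give the filter's state layout or noise settings, so the following is a minimal sketch of a standard constant-velocity Kalman filter applied to one bounding-box coordinate (for example the box center x). The class name, the initial covariance, and the noise variances are assumptions, and the camera motion decoupling step is not modeled.

```python
class ConstantVelocityKalman1D:
    """Kalman filter over one box coordinate with state [position, velocity]."""

    def __init__(self, position, process_var=0.01, measurement_var=1.0):
        self.p = float(position)            # position estimate
        self.v = 0.0                        # velocity estimate
        self.P = [[1.0, 0.0], [0.0, 10.0]]  # state covariance (assumed prior)
        self.q = process_var                # process noise added per step
        self.r = measurement_var            # detector measurement noise

    def predict(self, dt=1.0):
        """Propagate with F = [[1, dt], [0, 1]]: P <- F P F^T + Q."""
        (p00, p01), (p10, p11) = self.P
        self.p += self.v * dt
        self.P = [
            [p00 + dt * (p10 + p01) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.p

    def update(self, measured_position):
        """Correct with a position-only measurement, H = [1, 0]."""
        (p00, p01), (p10, p11) = self.P
        s = p00 + self.r                    # innovation variance
        k0, k1 = p00 / s, p10 / s           # Kalman gain
        residual = measured_position - self.p
        self.p += k0 * residual
        self.v += k1 * residual
        self.P = [
            [(1.0 - k0) * p00, (1.0 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


# A box center moving 2 pixels per frame, observed without noise.
kf = ConstantVelocityKalman1D(position=0.0)
for t in range(1, 31):
    kf.predict()
    kf.update(2.0 * t)
next_center = kf.predict()  # where to look for the box in frame 31
print(round(kf.v, 3), round(next_center, 3))
```

In a tracker, the predicted position narrows the search for the matching detection in the next frame, which is where motion prediction can stand in for heavier appearance features.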

14. Estimating Blink Probability for Highlight Detection in Figure Skating Videos
Tamami Nakano, Atsuya Sakata, Akihiro Kishimoto
Abstract: Highlight detection in sports videos has a broad viewership and huge commercial potential. It is thus imperative to detect highlight scenes more suitably for human interest with high temporal accuracy. Since people instinctively suppress blinks during attention-grabbing events and synchronously generate blinks at attention break points in videos, the instantaneous blink rate can be utilized as a highly accurate temporal indicator of human interest. Therefore, in this study, we propose a novel, automatic highlight detection method based on the blink rate. The method trains a one-dimensional convolution network (1D-CNN) to assess blink rates at each video frame from the spatio-temporal pose features of figure skating videos. Experiments show that the method successfully estimates the blink rate in 94% of the video clips and predicts the temporal change in the blink rate around a jump event with high accuracy. Moreover, the method detects not only the representative athletic action, but also the distinctive artistic expression of figure skating performance as key frames. This suggests that the blink-rate-based supervised learning approach enables high-accuracy highlight detection that more closely matches human sensibility.

15. Mining and Tailings Dam Detection In Satellite Imagery Using Deep Learning
Remis Balaniuk, Olga Isupova, Steven Reece
Abstract: This work explores the combination of free cloud computing, free open-source software, and deep learning methods to analyse a real, large-scale problem: the automatic country-wide identification and classification of surface mines and mining tailings dams in Brazil. Locations of officially registered mines and dams were obtained from the Brazilian government open data resource. Multispectral Sentinel-2 satellite imagery, obtained and processed on the Google Earth Engine platform, was used to train and test deep neural networks using the TensorFlow 2 API and the Google Colab platform. Fully Convolutional Neural Networks were used in an innovative way to search for unregistered ore mines and tailings dams in large areas of the Brazilian territory. The efficacy of the approach is demonstrated by the discovery of 263 mines that do not have an official mining concession. This exploratory work highlights the potential of a set of new, freely available technologies for the construction of low-cost data science tools that have high social impact. At the same time, it discusses and seeks to suggest practical solutions for the complex and serious problem of illegal mining and the proliferation of tailings dams, which pose high risks to the population and the environment, especially in developing countries. Code is made publicly available at: this https URL.

16. Attention-Oriented Action Recognition for Real-Time Human-Robot Interaction
Ziyang Song, Ziyi Yin, Zejian Yuan, Chong Zhang, Wanchao Chi, Yonggen Ling, Shenghao Zhang
Abstract: Despite the notable progress made in action recognition tasks, not much work has been done on action recognition specifically for human-robot interaction. In this paper, we deeply explore the characteristics of the action recognition task in interaction scenarios and propose an attention-oriented multi-level network framework to meet the need for real-time interaction. Specifically, a Pre-Attention network first roughly focuses on the interactor in the scene at low resolution and then performs fine-grained pose estimation at high resolution. A second, compact CNN receives the extracted skeleton sequence as input for action recognition, utilizing attention-like mechanisms to capture local spatial-temporal patterns and global semantic information effectively. To evaluate our approach, we construct a new action dataset specifically for the recognition task in interaction scenarios. Experimental results on our dataset and the high efficiency (112 fps at 640 x 480 RGBD) on a mobile computing platform (Nvidia Jetson AGX Xavier) demonstrate the excellent applicability of our method to action recognition in real-time human-robot interaction.

17. Are there any 'object detectors' in the hidden layers of CNNs trained to identify objects or scenes?
Ella M. Gale, Nicholas Martin, Ryan Blything, Anh Nguyen, Jeffrey S. Bowers
Abstract: Various methods of measuring unit selectivity have been developed with the aim of better understanding how neural networks work. But the different measures provide divergent estimates of selectivity, and this has led to different conclusions regarding the conditions in which selective object representations are learned and the functional relevance of these representations. In an attempt to better characterize object selectivity, we undertake a comparison of various selectivity measures on a large set of units in AlexNet, including localist selectivity, precision, class-conditional mean activity selectivity (CCMAS), network dissection, the human interpretation of activation maximization (AM) images, and standard signal-detection measures. We find that the different measures provide different estimates of object selectivity, with precision and CCMAS measures providing misleadingly high estimates. Indeed, the most selective units had a poor hit-rate or a high false-alarm rate (or both) in object classification, making them poor object detectors. We fail to find any units that are even remotely as selective as the 'grandmother cell' units reported in recurrent neural networks. In order to generalize these results, we compared selectivity measures on units in VGG-16 and GoogLeNet trained on the ImageNet or Places-365 datasets that have been described as 'object detectors'. Again, we find poor hit-rates and high false-alarm rates for object classification. We conclude that signal-detection measures provide a better assessment of single-unit selectivity compared to common alternative approaches, and that deep convolutional networks for image classification do not learn object detectors in their hidden layers.
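The gap between precision and signal-detection measures can be seen in a small numeric example. The sketch below thresholds one unit's activations and computes hit rate, false-alarm rate, and precision. The activations and the threshold are made up, and precision here is the plain classification definition, which is a simplification of the selectivity measures compared in the paper.

```python
def detection_measures(activations, is_target, threshold):
    """Treat a unit as a detector that 'fires' when activation > threshold."""
    tp = fp = fn = tn = 0
    for activation, target in zip(activations, is_target):
        fired = activation > threshold
        if fired and target:
            tp += 1
        elif fired:
            fp += 1
        elif target:
            fn += 1
        else:
            tn += 1
    hit_rate = tp / (tp + fn) if tp + fn else 0.0
    false_alarm_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return hit_rate, false_alarm_rate, precision


# Made-up unit: it responds strongly to one of five target-class images.
activations = [9.0, 1.0, 0.5, 0.8, 0.2, 0.3, 0.1, 0.4, 0.6, 0.2]
is_target = [True] * 5 + [False] * 5

hit_rate, false_alarm_rate, precision = detection_measures(
    activations, is_target, threshold=5.0
)
print(hit_rate, false_alarm_rate, precision)
```

The unit never fires for the wrong class, so its precision is perfect, yet it misses four of the five target images. This is the pattern the abstract describes as a misleadingly high selectivity estimate paired with a poor hit rate.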

18. Unsupervised Landmark Learning from Unpaired Data
Yinghao Xu, Ceyuan Yang, Ziwei Liu, Bo Dai, Bolei Zhou
Abstract: Recent attempts at unsupervised landmark learning leverage synthesized image pairs that are similar in appearance but different in pose. These methods learn landmarks by encouraging consistency between the original images and the images reconstructed from swapped appearances and poses. Because synthesized image pairs are created by applying pre-defined transformations, they cannot fully reflect the real variance in both appearance and pose. In this paper, we aim to open the possibility of learning landmarks on unpaired data (i.e. unaligned image pairs) sampled from a natural image collection, so that they can differ in both appearance and pose. To this end, we propose a cross-image cycle consistency framework ($C^3$) which applies the swapping-reconstruction strategy twice to obtain the final supervision. Moreover, a cross-image flow module is further introduced to impose equivariance between estimated landmarks across images. Through comprehensive experiments, our proposed framework is shown to outperform strong baselines by a large margin. Besides quantitative results, we also provide visualization and interpretation of our learned models, which not only verify the effectiveness of the learned landmarks, but also lead to important insights that are beneficial for future research.

19. ReXNet: Diminishing Representational Bottleneck on Convolutional Neural Network
Dongyoon Han, Sangdoo Yun, Byeongho Heo, YoungJoon Yoo
Abstract: This paper addresses the representational bottleneck in a network and proposes a set of design principles that significantly improve model performance. We argue that a representational bottleneck may occur in a network built with a conventional design and degrade model performance. To investigate the representational bottleneck, we study the matrix rank of the features generated by ten thousand random networks. We further study the entire layer's channel configuration with a view to designing more accurate network architectures. Based on this investigation, we propose simple yet effective design principles to mitigate the representational bottleneck. Slight changes to baseline networks that follow these principles lead to remarkable performance improvements on ImageNet classification. Additionally, COCO object detection results and transfer learning results on several datasets further support the link between diminishing the representational bottleneck of a network and improving performance. Code and pretrained models are available at this https URL.
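
The idea of using matrix rank to expose a bottleneck can be sketched as follows. This is only an illustration under strong assumptions: the layers are purely linear with small random integer weights, whereas the paper studies real networks with nonlinearities. The helper names (`matmul`, `matrix_rank`, `random_matrix`) and the channel widths are hypothetical.

```python
from fractions import Fraction
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matrix_rank(m):
    """Exact rank via Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in m]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] != 0), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for r in range(rank + 1, len(rows)):
            factor = rows[r][col] / rows[rank][col]
            rows[r] = [x - factor * y for x, y in zip(rows[r], rows[rank])]
        rank += 1
        if rank == len(rows):
            break
    return rank


def random_matrix(n_rows, n_cols, rng):
    return [[rng.randint(-3, 3) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)
inputs = random_matrix(12, 8, rng)  # 12 samples, 8 input channels (assumed sizes)

# Same input and output width, but a 2-channel versus a 6-channel middle layer.
narrow = matmul(matmul(inputs, random_matrix(8, 2, rng)), random_matrix(2, 8, rng))
wide = matmul(matmul(inputs, random_matrix(8, 6, rng)), random_matrix(6, 8, rng))

rank_narrow = matrix_rank(narrow)  # cannot exceed 2, the middle width
rank_wide = matrix_rank(wide)      # cannot exceed 6
```

In this linear setting the rank of the output features is capped by the narrowest layer they pass through, which is one simple way to see why a narrow intermediate layer limits what later layers can represent.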

20. RGB-D-based Framework to Acquire, Visualize and Measure the Human Body for Dietetic Treatments
Andrés Fuster-Guilló, Jorge Azorín-López, Marcelo Saval-Calvo, Juan Miguel Castillo-Zaragoza, Nahuel Garcia-DUrso, Robert B Fisher
Abstract: This research aims to improve dietetic-nutritional treatment using state-of-the-art RGB-D sensors and virtual reality (VR) technology. Recent studies show that adherence to treatment can be improved using multimedia technologies. However, there are few studies using 3D data and VR technologies for this purpose. On the other hand, obtaining 3D measurements of the human body and analyzing them over time (4D) in patients undergoing dietary treatment is a challenging field. The main contribution of the work is to provide a framework to study the effect of 4D body model visualization on adherence to obesity treatment. The system can obtain a complete 3D model of a body using low-cost technology, allowing future straightforward transference with sufficient accuracy and realistic visualization, enabling the analysis of the evolution (4D) of the shape during the treatment of obesity. The 3D body models will be used for studying the effect of visualization on adherence to obesity treatment using 2D and VR devices. Moreover, we will use the acquired 3D models to obtain measurements of the body. An analysis of the accuracy of the proposed methods for obtaining measurements with both synthetic and real objects has been carried out.

21. PerceptionGAN: Real-world Image Construction from Provided Text through Perceptual Understanding
Kanish Garg, Ajeet kumar Singh, Dorien Herremans, Brejesh Lall
Abstract: Generating an image from a provided descriptive text is quite a challenging task because of the difficulty in incorporating perceptual information (object shapes, colors, and their interactions) along with providing high relevancy related to the provided text. Current methods first generate an initial low-resolution image, which typically has irregular object shapes, colors, and interaction between objects. This initial image is then improved by conditioning on the text. However, these methods mainly address the problem of using text representation efficiently in the refinement of the initially generated image, while the success of this refinement process depends heavily on the quality of the initially generated image, as pointed out in the DM-GAN paper. Hence, we propose a method to provide good initialized images by incorporating perceptual understanding in the discriminator module. We improve the perceptual information at the first stage itself, which results in significant improvement in the final generated image. In this paper, we have applied our approach to the novel StackGAN architecture. We then show that the perceptual information included in the initial image is improved while modeling image distribution at multiple stages. Finally, we generated realistic multi-colored images conditioned by text. These images have good quality along with containing improved basic perceptual information. More importantly, the proposed method can be integrated into the pipeline of other state-of-the-art text-based-image-generation models to generate initial low-resolution images. We also worked on improving the refinement process in StackGAN by augmenting the third stage of the generator-discriminator pair in the StackGAN architecture. Our experimental analysis and comparison with the state-of-the-art on a large but sparse dataset MS COCO further validate the usefulness of our proposed approach.

22. A deep primal-dual proximal network for image restoration
Mingyuan Jiu, Nelly Pustelnik
Abstract: Image restoration remains a challenging task in image processing. Numerous methods have been proposed to tackle this problem, which is often solved by minimizing a non-smooth penalized likelihood function. Although the solution is easily interpretable with theoretic guarantees, its estimation relies on an optimization process. Deep learning, which has attracted considerable research effort in image classification, offers an alternative way to perform image restoration, but its adaptation to inverse problems is still challenging. In this work, we design a deep network, named DeepPDNet, built from primal-dual proximal iterations associated with the minimization of a standard penalized likelihood with an analysis prior, allowing us to take advantage of both worlds. We reformulate a specific instance of the Condat-Vu primal-dual hybrid gradient (PDHG) algorithm as a deep network with fixed layers. Each layer corresponds to one iteration of the primal-dual algorithm. The learned parameters are the primal-dual proximal algorithm step-size and the analysis linear operator involved in the penalization. These parameters are allowed to vary from one layer to another. Two different learning strategies, "Full learning" and "Partial learning", are proposed: the first is the most efficient numerically, while the second relies on standard constraints ensuring convergence in the standard PDHG iterations. Moreover, global and local sparse analysis priors are studied to seek a better feature representation. We evaluate the proposed DeepPDNet on the MNIST and BSD68 datasets with different blur and additive Gaussian noise. Extensive results show that the proposed deep primal-dual proximal networks demonstrate excellent performance on the MNIST dataset compared to other state-of-the-art methods and better or at least comparable performance on the more complex BSD68 dataset.
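
The "one layer per primal-dual iteration" idea can be sketched on a toy problem. The following is not DeepPDNet: it assumes 1D denoising with objective 0.5*||x - z||^2 + lam*||Lx||_1, fixes the analysis operator L to a finite difference, and uses hand-picked step sizes instead of learned ones. All function names and values are hypothetical.

```python
def diff(x):
    """Analysis operator L: forward finite differences."""
    return [x[i + 1] - x[i] for i in range(len(x) - 1)]


def diff_T(y):
    """Adjoint L^T of the forward-difference operator."""
    n = len(y) + 1
    return [(y[i - 1] if i > 0 else 0.0) - (y[i] if i < n - 1 else 0.0)
            for i in range(n)]


def primal_dual_layer(x, y, z, tau, sigma, lam):
    """One Condat-Vu style iteration, viewed as a single network layer."""
    # Primal step: gradient of the data term plus the adjoint of the dual variable.
    x_new = [xi - tau * ((xi - zi) + lt) for xi, zi, lt in zip(x, z, diff_T(y))]
    # Dual step on the extrapolated point; the prox of the conjugate of
    # lam*||.||_1 is a projection onto the box [-lam, lam].
    x_bar = [2 * xn - xi for xn, xi in zip(x_new, x)]
    y_new = [max(-lam, min(lam, yi + sigma * d)) for yi, d in zip(y, diff(x_bar))]
    return x_new, y_new


def unrolled_network(z, layers, lam):
    """Apply one layer per (tau, sigma) pair; step sizes may differ per layer."""
    x, y = list(z), [0.0] * (len(z) - 1)
    for tau, sigma in layers:
        x, y = primal_dual_layer(x, y, z, tau, sigma, lam)
    return x


def total_variation(x):
    return sum(abs(d) for d in diff(x))


noisy = [0.1, -0.2, 0.15, 0.0, 1.1, 0.9, 1.2, 1.0]  # hypothetical noisy step signal
layers = [(0.5, 0.25)] * 200  # fixed steps chosen so that 1/tau - 4*sigma >= 0.5
restored = unrolled_network(noisy, layers, lam=0.3)
```

In a learned variant along the lines the abstract describes, the per-layer step sizes and the operator standing in for `diff` would become trainable parameters rather than fixed choices.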

23. The Impact of Explanations on AI Competency Prediction in VQA
Kamran Alipour, Arijit Ray, Xiao Lin, Jurgen P. Schulze, Yi Yao, Giedrius T. Burachas
Abstract: Explainability is one of the key elements for building trust in AI systems. Among numerous attempts to make AI explainable, quantifying the effect of explanations remains a challenge in conducting human-AI collaborative tasks. Aside from the ability to predict the overall behavior of AI, in many applications, users need to understand an AI agent's competency in different aspects of the task domain. In this paper, we evaluate the impact of explanations on the user's mental model of AI agent competency within the task of visual question answering (VQA). We quantify users' understanding of competency, based on the correlation between the actual system performance and user rankings. We introduce an explainable VQA system that uses spatial and object features and is powered by the BERT language model. Each group of users sees only one kind of explanation to rank the competencies of the VQA model. The proposed model is evaluated through between-subject experiments to probe explanations' impact on the user's perception of competency. The comparison between two VQA models shows BERT based explanations and the use of object features improve the user's prediction of the model's competencies.

24. ACFD: Asymmetric Cartoon Face Detector
Bin Zhang, Jian Li, Yabiao Wang, Zhipeng Cui, Yili Xia, Chengjie Wang, Jilin Li, Feiyue Huang
Abstract: Cartoon face detection is a more challenging task than human face detection because many difficult scenarios are involved. Aiming at the characteristics of cartoon faces, such as huge intra-face differences, in this paper we propose an asymmetric cartoon face detector, named ACFD. Specifically, it consists of the following modules: a novel backbone VoVNetV3 comprised of several asymmetric one-shot aggregation modules (AOSA), an asymmetric bi-directional feature pyramid network (ABi-FPN), a dynamic anchor match strategy (DAM) and the corresponding margin binary classification loss (MBC). In particular, to generate features with diverse receptive fields, multi-scale pyramid features are extracted by VoVNetV3, and then fused and enhanced simultaneously by ABi-FPN to handle faces in extreme poses and with disparate aspect ratios. Besides, DAM is used to match enough high-quality anchors for each face, and MBC provides strong discriminative power. With the effectiveness of these modules, our ACFD achieves 1st place on the detection track of the 2020 iCartoon Face Challenge under the constraints of a 200MB model size and 50ms inference time per image, and without any pretrained models.

25. Image Analysis Based on Nonnegative/Binary Matrix Factorization
Hinako Asaoka, Kazue Kudo
Abstract: Using nonnegative/binary matrix factorization (NBMF), a matrix can be decomposed into a nonnegative matrix and a binary matrix. Our analysis of facial images, based on NBMF and using the Fujitsu Digital Annealer, leads to successful image reconstruction and image classification. The NBMF algorithm converges in fewer iterations than those required for the convergence of nonnegative matrix factorization (NMF), although both techniques perform comparably in image classification.

26. Noticing Motion Patterns: Temporal CNN with a Novel Convolution Operator for Human Trajectory Prediction
Dapeng Zhao, Jean Oh
Abstract: We propose a novel way to learn, detect and extract patterns in sequential data, and successfully apply it to the problem of human trajectory prediction. Our model, Social Pattern Extraction Convolution (Social-PEC), achieves the best performance in terms of Average/Final Displacement Error when compared to existing methods. In addition, the proposed approach avoids the obscurity of the pooling layers used in previous work, presenting intuitive and explainable decision-making processes.
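
For readers unfamiliar with the two metrics named above, here is a minimal sketch of how Average and Final Displacement Error are commonly computed for a single trajectory. It assumes 2D positions given as (x, y) tuples at matching time steps; the function name and the toy trajectories are hypothetical and not taken from the paper.

```python
from math import hypot


def displacement_errors(predicted, ground_truth):
    """Return (ADE, FDE) for one predicted trajectory."""
    dists = [hypot(px - gx, py - gy)
             for (px, py), (gx, gy) in zip(predicted, ground_truth)]
    ade = sum(dists) / len(dists)  # mean Euclidean error over all time steps
    fde = dists[-1]                # Euclidean error at the final time step
    return ade, fde


# Hypothetical three-step trajectories.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)  # ade == 1.0, fde == 2.0
```

Benchmarks typically average these values over many pedestrians and scenes; this sketch covers only the per-trajectory computation.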

27. MSA-MIL: A deep residual multiple instance learning model based on multi-scale annotation for classification and visualization of glomerular spikes
Yilin Chen, Ming Li, Yongfei Wu, Xueyu Liu, Fang Hao, Daoxiang Zhou, Xiaoshuang Zhou, Chen Wang
Abstract: Membranous nephropathy (MN) is a frequent type of adult nephrotic syndrome, which has a high clinical incidence and can cause various complications. In biopsy microscope slides of membranous nephropathy, spike-like projections on the glomerular basement membrane are a prominent feature of MN. However, because a whole biopsy slide contains a large number of glomeruli, and each glomerulus includes many spike lesions, the pathological feature of the spikes is not obvious. It is thus time-consuming for doctors to diagnose glomeruli one by one, and difficult for less experienced pathologists to make a diagnosis. In this paper, we establish a visualized classification model based on multi-scale annotation multi-instance learning (MSA-MIL) to achieve glomerular classification and spike visualization. The MSA-MIL model mainly involves three parts. Firstly, U-Net is used to extract the region of the glomeruli to ensure that the features learned by the succeeding algorithm are focused inside the glomeruli themselves. Secondly, we use MIL to train an instance-level classifier combined with the MSA method to enhance the learning ability of the network by adding a location-level labeled reinforced dataset, thereby obtaining an example-level feature representation with rich semantics. Lastly, the predicted scores of each tile in the image are summarized using a sliding window method to obtain the glomerular classification and a visualization of the classification results of the spikes. The experimental results confirm that the proposed MSA-MIL model can effectively and accurately classify normal and spiked glomeruli and visualize the position of spikes in the glomerulus. Therefore, the proposed model can provide a good foundation for assisting clinical doctors in diagnosing glomerular membranous nephropathy.

28. Low-light Environment Neural Surveillance
Michael Potter, Henry Gridley, Noah Lichtenstein, Kevin Hines, John Nguyen, Jacob Walsh
Abstract: We design and implement an end-to-end system for real-time crime detection in low-light environments. Unlike Closed-Circuit Television, which performs reactively, the Low-Light Environment Neural Surveillance provides real time crime alerts. The system uses a low-light video feed processed in real-time by an optical-flow network, spatial and temporal networks, and a Support Vector Machine to identify shootings, assaults, and thefts. We create a low-light action-recognition dataset, LENS-4, which will be publicly available. An IoT infrastructure set up via Amazon Web Services interprets messages from the local board hosting the camera for action recognition and parses the results in the cloud to relay messages. The system achieves 71.5% accuracy at 20 FPS. The user interface is a mobile app which allows local authorities to receive notifications and to view a video of the crime scene. Citizens have a public app which enables law enforcement to push crime alerts based on user proximity.

29. Understanding Road Layout from Videos as a Whole
Buyu Liu, Bingbing Zhuang, Samuel Schulter, Pan Ji, Manmohan Chandraker
Abstract: In this paper, we address the problem of inferring the layout of complex road scenes from video sequences. To this end, we formulate it as a top-view road attributes prediction problem and our goal is to predict these attributes for each frame both accurately and consistently. In contrast to prior work, we exploit the following three novel aspects: leveraging camera motions in videos, including context cues and incorporating long-term video information. Specifically, we introduce a model that aims to enforce prediction consistency in videos. Our model consists of one LSTM and one Feature Transform Module (FTM). The former implicitly incorporates the consistency constraint with its hidden states, and the latter explicitly takes the camera motion into consideration when aggregating information along videos. Moreover, we propose to incorporate context information by introducing road participants, e.g. objects, into our model. When the entire video sequence is available, our model is also able to encode both local and global cues, e.g. information from both past and future frames. Experiments on two data sets show that: (1) Incorporating either global or contextual cues improves the prediction accuracy and leveraging both gives the best performance. (2) Introducing the LSTM and FTM modules improves the prediction consistency in videos. (3) The proposed method outperforms the SOTA by a large margin.

30. Query-Free Adversarial Transfer via Undertrained Surrogates
Chris Miller, Soroush Vosoughi
Abstract: Deep neural networks have been shown to be highly vulnerable to adversarial examples---minor perturbations added to a model's input which cause the model to output an incorrect prediction. This vulnerability represents both a risk for the use of deep learning models in security-conscious fields and an opportunity to improve our understanding of how deep networks generalize to unexpected inputs. In a transfer attack, the adversary builds an adversarial attack using a surrogate model, then uses that attack to fool an unseen target model. Recent work in this subfield has focused on attack generation methods which can improve transferability between models. We show that optimizing a single surrogate model is a more effective method of improving adversarial transfer, using the simple example of an undertrained surrogate. This method transfers well across varied architectures and outperforms state-of-the-art methods. To interpret the effectiveness of undertrained surrogate models, we represent adversarial transferability as a function of surrogate model loss function curvature and similarity between surrogate and target gradients and show that our approach reduces the presence of local loss maxima which hinder transferability. Our results suggest that finding good single surrogate models is a highly effective and simple method for generating transferable adversarial attacks, and that this method represents a valuable route for future study in this field.

31. TiledSoilingNet: Tile-level Soiling Detection on Automotive Surround-view Cameras Using Coverage Metric
Arindam Das, Pavel Krizek, Ganesh Sistu, Fabian Burger, Sankaralingam Madasamy, Michal Uricar, Varun Ravi Kumar, Senthil Yogamani
Abstract: Automotive cameras, particularly surround-view cameras, tend to get soiled by mud, water, snow, etc. For higher levels of autonomous driving, it is necessary to have a soiling detection algorithm which will trigger an automatic cleaning system. Localized detection of soiling in an image is necessary to control the cleaning system. It is also necessary to enable partial functionality in unsoiled areas while reducing confidence in soiled areas. Although this can be solved using a semantic segmentation task, we explore a more efficient solution targeting deployment in low power embedded system. We propose a novel method to regress the area of each soiling type within a tile directly. We refer to this as coverage. The proposed approach is better than learning the dominant class in a tile as multiple soiling types occur within a tile commonly. It also has the advantage of dealing with coarse polygon annotation, which will cause the segmentation task. The proposed soiling coverage decoder is an order of magnitude faster than an equivalent segmentation decoder. We also integrated it into an object detection and semantic segmentation multi-task model using an asynchronous back-propagation algorithm. A portion of the dataset used will be released publicly as part of our WoodScape dataset to encourage further research.
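
The coverage target described above, the area fraction of each soiling type within a tile, can be sketched as follows. This only shows how such a regression target might be derived from a per-pixel annotation; it is not the authors' decoder. The class ids (0 = clean, 1 = opaque, 2 = water), the tile size, and the function name are hypothetical.

```python
def tile_coverage(mask, tile_h, tile_w, n_classes):
    """Return, for every tile, the fraction of its pixels in each class."""
    rows, cols = len(mask), len(mask[0])
    coverage = []
    for top in range(0, rows, tile_h):
        row_out = []
        for left in range(0, cols, tile_w):
            counts = [0] * n_classes
            # Count class ids inside this tile (edge tiles may be smaller).
            for r in range(top, min(top + tile_h, rows)):
                for c in range(left, min(left + tile_w, cols)):
                    counts[mask[r][c]] += 1
            total = sum(counts)
            row_out.append([n / total for n in counts])
        coverage.append(row_out)
    return coverage


# Hypothetical 4x4 annotation split into 2x2 tiles.
mask = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 2],
]
cov = tile_coverage(mask, tile_h=2, tile_w=2, n_classes=3)
# cov[0][0] == [0.75, 0.0, 0.25]: the top-left tile is mostly clean with some water
```

A dominant-class label would report the top-left tile simply as "clean", whereas the coverage vector keeps the 25% water share, which is the distinction the abstract draws.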

32. Learning Surrogates via Deep Embedding
Yash Patel, Tomas Hodan, Jiri Matas
Abstract: This paper proposes a technique for training a neural network by minimizing a surrogate loss that approximates the target evaluation metric, which may be non-differentiable. The surrogate is learned via a deep embedding where the Euclidean distance between the prediction and the ground truth corresponds to the value of the evaluation metric. The effectiveness of the proposed technique is demonstrated in a post-tuning setup, where a trained model is tuned using the learned surrogate. Without a significant computational overhead and any bells and whistles, improvements are demonstrated on challenging and practical tasks of scene-text recognition and detection. In the recognition task, the model is tuned using a surrogate approximating the edit distance metric and achieves up to $39\%$ relative improvement in the total edit distance. In the detection task, the surrogate approximates the intersection over union metric for rotated bounding boxes and yields up to $4.25\%$ relative improvement in the $F_{1}$ score.
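
To see why a metric such as edit distance calls for a learned surrogate, it helps to look at how it is computed. The sketch below is the standard Levenshtein dynamic program plus a helper for relative improvement; it is not the paper's surrogate, and the function names and example numbers are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution, free on a match
            ))
        prev = curr
    return prev[-1]


def relative_improvement(before, after):
    """Percentage reduction of an error metric relative to its old value."""
    return 100.0 * (before - after) / before


distance = edit_distance("kitten", "sitting")  # 3
gain = relative_improvement(100, 61)           # 39.0 (made-up totals)
```

The result is an integer built from discrete `min` choices over character comparisons, so it offers no useful gradient with respect to a recognizer's outputs; this is the gap a differentiable surrogate is meant to fill.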

33. Age-Oriented Face Synthesis with Conditional Discriminator Pool and Adversarial Triplet Loss
Haoyi Wang, Victor Sanchez, Chang-Tsun Li
Abstract: The vanilla Generative Adversarial Networks (GAN) are commonly used to generate realistic images depicting aged and rejuvenated faces. However, the performance of such vanilla GANs in the age-oriented face synthesis task is often compromised by the mode collapse issue, which may result in the generation of faces with minimal variations and a poor synthesis accuracy. In addition, recent age-oriented face synthesis methods use the L1 or L2 constraint to preserve the identity information on synthesized faces, which implicitly limits the identity permanence capabilities when these constraints are associated with a trivial weighting factor. In this paper, we propose a method for the age-oriented face synthesis task that achieves a high synthesis accuracy with strong identity permanence capabilities. Specifically, to achieve a high synthesis accuracy, our method tackles the mode collapse issue with a novel Conditional Discriminator Pool (CDP), which consists of multiple discriminators, each targeting one particular age category. To achieve strong identity permanence capabilities, our method uses a novel Adversarial Triplet loss. This loss, which is based on the Triplet loss, adds a ranking operation to further pull the positive embedding towards the anchor embedding resulting in significantly reduced intra-class variances in the feature space. Through extensive experiments, we show that our proposed method outperforms state-of-the-art methods in terms of synthesis accuracy and identity permanence capabilities, qualitatively and quantitatively.
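
As background for the loss mentioned above, here is a minimal sketch of the standard Triplet loss that the proposed Adversarial Triplet loss builds on. The additional ranking operation is not specified in the abstract and is not reproduced here. The sketch assumes Euclidean (not squared) distances; the margin value and embeddings are hypothetical.

```python
from math import dist


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embedding vectors (hinge on distance gap)."""
    # Penalize the triplet unless the negative is farther from the anchor
    # than the positive by at least `margin`.
    return max(0.0, dist(anchor, positive) - dist(anchor, negative) + margin)


anchor = (0.0, 0.0)
same_identity = (0.3, 0.4)    # distance 0.5 from the anchor
far_identity = (3.0, 4.0)     # distance 5.0 from the anchor
close_identity = (0.6, 0.8)   # distance 1.0 from the anchor

easy = triplet_loss(anchor, same_identity, far_identity)                # zero loss
hard = triplet_loss(anchor, same_identity, close_identity, margin=0.7)  # about 0.2
```

Minimizing this quantity pulls embeddings of the same identity together and pushes different identities apart, which is the mechanism behind the reduced intra-class variance the abstract refers to.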

34. Self-supervised Deep Reconstruction of Mixed Strip-shredded Text Documents
Thiago M. Paixão, Rodrigo F. Berriel, Maria C. S. Boeres, Alessandro L. Koerich, Claudine Badue, Alberto F. de Souza, Thiago Oliveira-Santos
Abstract: The reconstruction of shredded documents consists of coherently arranging fragments of paper (shreds) to recover the original document(s). A great challenge in computational reconstruction is to properly evaluate the compatibility between the shreds. While traditional pixel-based approaches are not robust to real shredding, more sophisticated solutions significantly compromise time performance. The solution presented in this work extends our previous deep learning method for single-page reconstruction to a more realistic/complex scenario: the reconstruction of several mixed shredded documents at once. In our approach, the compatibility evaluation is modeled as a two-class (valid or invalid) pattern recognition problem. The model is trained in a self-supervised manner on samples extracted from simulated-shredded documents, which obviates manual annotation. Experimental results on three datasets, including a new collection of 100 strip-shredded documents produced for this work, have shown that the proposed method outperforms the competing ones on complex scenarios, achieving accuracy above 90%.

35. Weakly-Supervised Segmentation for Disease Localization in Chest X-Ray Images
Ostap Viniavskyi, Mariia Dobko, Oles Dobosevych
Abstract: Deep Convolutional Neural Networks have proven effective in solving the task of semantic segmentation. However, their efficiency heavily relies on pixel-level annotations that are expensive to obtain and often require domain expertise, especially in medical imaging. Weakly supervised semantic segmentation helps to overcome these issues and also provides explainable deep learning models. In this paper, we propose a novel approach to the semantic segmentation of medical chest X-ray images with only image-level class labels as supervision. We improve the disease localization accuracy by combining three approaches as consecutive steps. First, we generate pseudo segmentation labels of abnormal regions in the training images through a supervised classification model enhanced with a regularization procedure. The obtained activation maps are then post-processed and propagated into a second classification model, the Inter-pixel Relation Network, which improves the boundaries between different object classes. Finally, the resulting pseudo-labels are used to train a proposed fully supervised segmentation model. We analyze the robustness of the presented method and test its performance on two distinct datasets: PASCAL VOC 2012 and SIIM-ACR Pneumothorax. We achieve significant results in the segmentation on both datasets using only image-level annotations. We show that this approach is applicable to chest X-rays for detecting an anomalous volume of air in the pleural space between the lung and the chest wall. Our code has been made publicly available.

36. Virtual Testbed for Monocular Visual Navigation of Small Unmanned Aircraft Systems
Kyung Kim, Robert C. Leishman, Scott L. Nykl
Abstract: Monocular visual navigation methods have seen significant advances in the last decade, recently producing several real-time solutions for autonomously navigating small unmanned aircraft systems without relying on GPS. This is critical for military operations which may involve environments where GPS signals are degraded or denied. However, testing and comparing visual navigation algorithms remains a challenge since visual data is expensive to gather. Conducting flight tests in a virtual environment is an attractive solution prior to committing to outdoor testing. This work presents a virtual testbed for conducting simulated flight tests over real-world terrain and analyzing the real-time performance of visual navigation algorithms at 31 Hz. This tool was created to ultimately find a visual odometry algorithm appropriate for further GPS-denied navigation research on fixed-wing aircraft, even though all of the algorithms were designed for other modalities. This testbed was used to evaluate three current state-of-the-art, open-source monocular visual odometry algorithms on a fixed-wing platform: Direct Sparse Odometry, Semi-Direct Visual Odometry, and ORB-SLAM2 (with loop closures disabled).
摘要：单眼视觉导航方法已经在过去十年看到显著的进步，最近生产的自主导航小型无人驾驶飞机系统不依赖于GPS几种实时解决方案。这是可能涉及在GPS信号降级或剥夺环境的军事行动至关重要。然而，测试和比较的视觉导航算法，因为可视化数据是收集昂贵仍然是一个挑战。在虚拟环境中进行飞行试验是致力于户外测试之前，一个有吸引力的解决方案。这项工作提出超过现实世界的地形进行模拟飞行试验，并在31赫兹分析的视觉导航算法的实时性能测试平台的虚拟。该工具是为了最终找到一个视觉里程算法适合于固定翼飞机进一步的GPS导航否认研究，即使所有的算法被设计为其它方式。此试验台是用来评价三个电流状态的最先进的，开放源码的单目视觉里程计上的固定翼平台算法：直接稀疏里程计，半直接视觉里程计，并且ORB的SLAM2（与环闭合件禁用）。

37. Learning Geocentric Object Pose in Oblique Monocular Images [PDF] 返回目录
Gordon Christie, Rodrigo Rene Rai Munoz Abujder, Kevin Foster, Shea Hagstrom, Gregory D. Hager, Myron Z. Brown
Abstract: An object's geocentric pose, defined as the height above ground and orientation with respect to gravity, is a powerful representation of real-world structure for object detection, segmentation, and localization tasks using RGBD images. For close-range vision tasks, height and orientation have been derived directly from stereo-computed depth and more recently from monocular depth predicted by deep networks. For long-range vision tasks such as Earth observation, depth cannot be reliably estimated with monocular images. Inspired by recent work in monocular height above ground prediction and optical flow prediction from static images, we develop an encoding of geocentric pose to address this challenge and train a deep network to compute the representation densely, supervised by publicly available airborne lidar. We exploit these attributes to rectify oblique images and remove observed object parallax to dramatically improve the accuracy of localization and to enable accurate alignment of multiple images taken from very different oblique viewpoints. We demonstrate the value of our approach by extending two large-scale public datasets for semantic segmentation in oblique satellite images. All of our data and code are publicly available.
摘要：一个对象的地心的姿势，定义为地上和取向的高度相对于重力，是真实世界的结构的物体检测，分割和使用RGBD图像本地化任务的有力表示。对于近距离视觉任务，高度和方向已经直接从立体深度计算从深层网络预测的单眼深度来获得，以及最近。对于远距离视觉任务如地球观测，深度不能可靠单眼图像估计。在地上预测和静态图像的光流预测的单眼高度近期工作的启发，我们开发地心姿态的编码，以应对这一挑战，培养了深厚的网络计算表示密集，监督由可公开获得的机载激光雷达。我们利用这些属性来纠正倾斜图像，并删除观察对象的视差，从而大幅度提高定位的精度，并且使能非常不同的倾斜视点拍摄的多个图像的精确对准。我们证明通过在斜卫星图像语义分割延伸出的两个大型公共数据集我们的做法的价值。我们所有的数据和代码是公开的。

38. ConFoc: Content-Focus Protection Against Trojan Attacks on Neural Networks [PDF] 返回目录
Miguel Villarreal-Vasquez, Bharat Bhargava
Abstract: Deep Neural Networks (DNNs) have been applied successfully in computer vision. However, their wide adoption in image-related applications is threatened by their vulnerability to trojan attacks. These attacks insert some misbehavior at training using samples with a mark or trigger, which is exploited at inference or testing time. In this work, we analyze the composition of the features learned by DNNs at training. We identify that they, including those related to the inserted triggers, contain both content (semantic information) and style (texture information), which are recognized as a whole by DNNs at testing time. We then propose a novel defensive technique against trojan attacks, in which DNNs are taught to disregard the styles of inputs and focus on their content only to mitigate the effect of triggers during the classification. The generic applicability of the approach is demonstrated in the context of a traffic sign and a face recognition application. Each of them is exposed to a different attack with a variety of triggers. Results show that the method reduces the attack success rate significantly to values < 1% in all the tested attacks while keeping as well as improving the initial accuracy of the models when processing both benign and adversarial data.
摘要：深层神经网络（DNNs）已经在计算机视觉应用成功。然而，他们广泛采用的图像相关的应用程序是由他们易受木马攻击的威胁。在用样品培养具有标志或触发器，它在推理或测试时间利用这些攻击插入一些不当行为。在这项工作中，我们分析了在训练由DNNs学到的功能组成。我们确定他们，包括那些与插入触发器，同时包含内容（语义信息）和样式（纹理信息），这是在测试时确认为一体的DNNs。然后，我们提出了对木马的攻击，其中DNNs被教导要忽略其内容风格的输入和只注重减轻触发的分类中的作用的新防守技术。该方法的一般应用是体现在交通标志和脸部识别应用的上下文。他们每个人都暴露在不同的攻击与各种触发器。结果表明，该方法减少显著攻击成功率值<1％在所有测试的攻击，同时保持以及改善模型的初始精度处理良性和对抗性数据时。

39. Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning [PDF] 返回目录
Zhongzheng Ren, Raymond A. Yeh, Alexander G. Schwing
Abstract: Existing semi-supervised learning (SSL) algorithms use a single weight to balance the loss of labeled and unlabeled examples, i.e., all unlabeled examples are equally weighted. But not all unlabeled data are equal. In this paper we study how to use a different weight for every unlabeled example. Manual tuning of all those weights -- as done in prior work -- is no longer possible. Instead, we adjust those weights via an algorithm based on the influence function, a measure of a model's dependency on one training example. To make the approach efficient, we propose a fast and effective approximation of the influence function. We demonstrate that this technique outperforms state-of-the-art methods on semi-supervised image and language classification tasks.
摘要：现有半监督学习（SSL）算法使用一个单一的重量来平衡的标记和未标记的实例，即损失，所有未标记的实施例是相同的权重。但并不是所有未标记的数据都是平等的。在本文中，我们研究如何利用不同的权重为每个未标记的例子。在以前的工作做了 - - 所有这些权重的手动调整是不再可能。相反，我们通过基于影响函数，一个模型的一个训练例子依赖的度量的算法调整的权重。为了使该方法有效，我们提出了影响功能的快速和有效的近似。我们表明，该技术优于上半监督图像和语言分类任务的国家的最先进的方法。
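
The contrast between a single shared weight and one weight per unlabeled example can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state which unlabeled loss is used, so cross-entropy against pseudo-labels is an assumption here, and the names `cross_entropy` and `weighted_ssl_loss` are hypothetical. The influence-function-based weight update itself is not shown.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of `label` under predicted probabilities `probs`."""
    return -math.log(probs[label])


def weighted_ssl_loss(labeled, unlabeled, weights):
    """Supervised loss plus a per-example weighted unlabeled loss.

    labeled:   list of (probs, label) pairs
    unlabeled: list of (probs, pseudo_label) pairs (pseudo-labels are an assumption)
    weights:   one non-negative weight per unlabeled example
    """
    if len(weights) != len(unlabeled):
        raise ValueError("need exactly one weight per unlabeled example")
    supervised = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    unsupervised = sum(
        w * cross_entropy(p, y) for w, (p, y) in zip(weights, unlabeled)
    ) / len(unlabeled)
    return supervised + unsupervised


labeled = [([0.5, 0.5], 0)]
unlabeled = [([0.25, 0.75], 1), ([0.9, 0.1], 0)]

# Setting every weight to the same value recovers the usual single-weight objective.
single_weight = weighted_ssl_loss(labeled, unlabeled, [0.3, 0.3])
# Per-example weights let an unreliable example be switched off entirely.
per_example = weighted_ssl_loss(labeled, unlabeled, [0.0, 1.0])
```

With identical weights the function reduces to the single-weight formulation the abstract describes as existing practice, which makes the per-example case a strict generalization.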

40. Image Processing and Quality Control for Abdominal Magnetic Resonance Imaging in the UK Biobank [PDF] 返回目录
Nicolas Basty, Yi Liu, Madeleine Cule, E. Louise Thomas, Jimmy D. Bell, Brandon Whitcher
Abstract: An end-to-end image analysis pipeline is presented for the abdominal MRI protocol used in the UK Biobank on the first 38,971 participants. Emphasis is on the processing steps necessary to ensure a high level of data quality and consistency when preparing the datasets for downstream quantitative analysis, such as segmentation and parameter estimation. Quality control procedures have been incorporated to detect and, where possible, correct issues in the raw data. Detection of fat-water swaps in the Dixon series is performed by a deep learning model and corrected automatically. Bone joints are predicted using a hybrid atlas-based registration and deep learning model for the shoulders, hips and knees. Simultaneous estimation of proton density fat fraction and transverse relaxivity (R2*) is performed using both the magnitude and phase information for the single-slice multiecho series. Approximately 98.1% of the two-point Dixon acquisitions were successfully processed and passed quality control, with 99.98% of the high-resolution T1-weighted 3D volumes succeeding. Approximately 99.98% of the single-slice multiecho acquisitions covering the liver were successfully processed and passed quality control, with 97.6% of the single-slice multiecho acquisitions covering the pancreas succeeding. At least one fat-water swap was detected in 1.8% of participants. With respect to the bone joints, approximately 3.3% of participants were missing at least one knee joint and 0.8% were missing at least one shoulder joint. For the participants who received both single-slice multiecho acquisition protocols for the liver, a systematic difference between the two protocols was identified and modeled using multiple linear regression. The findings presented here will be invaluable for scientists who seek to use image-derived phenotypes from the abdominal MRI protocol.
摘要：终端到终端的图像分析管道提出了在英国生物银行使用第一38971名参与者腹部MRI协议。强调的是必要的处理步骤，以保证数据质量的高级别和一致性，以便制备用于下游定量分析的数据集，如分割和参数估计产生。质量控制程序已纳入检测，并在原始数据有可能的，正确的问题。在狄克逊系列脂肪水交换的检测是通过一个深度学习模型进行自动校正。骨关节使用基于图谱混合注册和深度学习模型的肩膀，臀部和膝盖预测。同时使用的幅度和相位信息的单切片多回波序列进行质子密度脂肪级分和横向弛豫（R2 *）的同时估计。大致两点Dixon收购98.1％被成功处理，并且通过质量控制，具有高分辨率T1加权3D体积后续的99.98％。大致单切片多回波采集覆盖肝脏的99.98％被成功处理，并且通过质量控制，与单切片多回波采集覆盖胰腺后续的97.6％。在参与者的1.8％检测至少一种脂肪 - 水交换。相对于所述骨关节，参与者的约3.3％的缺失至少一个膝关节和0.8％人失踪的至少一个肩部关节。对于谁收到单排多回波采集协议对肝脏参加这两种协议之间的系统性差异被确定并使用多元线性回归模型。这里介绍的结果将是非常宝贵的谁寻求从腹部MRI协议使用源自图像的表型的科学家。

41. Globally Optimal Surface Segmentation using Deep Learning with Learnable Smoothness Priors [PDF] 返回目录
Leixin Zhou, Xiaodong Wu
Abstract: Automated surface segmentation is important and challenging in many medical image analysis applications. Recent deep-learning-based methods have been developed for various object segmentation tasks. Most of them are classification-based approaches, e.g., U-net, which predict the probability of each voxel being target object or background. One problem with those methods is the lack of a topology guarantee for segmented objects, and post-processing is usually needed to infer the boundary surface of the object. In this paper, a novel model based on a convolutional neural network (CNN) followed by a learnable surface smoothing block is proposed to tackle the surface segmentation problem with end-to-end training. To the best of our knowledge, this is the first study to learn smoothness priors end-to-end with a CNN for direct surface segmentation with global optimality. Experiments carried out on Spectral Domain Optical Coherence Tomography (SD-OCT) retinal layer segmentation and Intravascular Ultrasound (IVUS) vessel wall segmentation demonstrated very promising results.
摘要：自动曲面细分是非常重要的，而且在许多医学图像分析应用挑战。近期深基础的学习方法已经被开发为各种对象分割任务。他们中的大多数是基于分类方法，例如U型网，其预测的对于每个体素的目标对象或背景的概率。这些方法存在的一个问题是缺乏拓扑保证的用于分割对象，并且通常需要后处理以推断所述对象的边界表面。在本文中，基于卷积神经网络（CNN）一种新颖的模型之后是可学习表面平滑块提出以解决与端至端训练表面分割问题。据我们所知，这是第一次研究学会平滑先验端至端与CNN直接曲面细分与全局最优。实验上谱域光学相干断层扫描（SD-OCT）视网膜层分段和血管内超声（IVUS）血管壁分割证明非常有希望的结果进行。

42. Spot the conversation: speaker diarisation in the wild [PDF] 返回目录
Joon Son Chung, Jaesung Huh, Arsha Nagrani, Triantafyllos Afouras, Andrew Zisserman
Abstract: The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.
摘要：本文的目标是“野生”收集的视频扬声器diarisation。我们做三个关键的贡献。首先，我们提出了YouTube视频自动视听diarisation方法。我们的方法包括使用视听手段和说话人确认使用自助登记音箱型号有源音箱的检测。其次，我们的方法整合到一个半自动的数据集制作流程，其显著减少了所需的注释视频与diarisation标签的小时数。最后，我们使用这条管道来创建一个名为VoxConverse大规模diarisation数据集，从“野生”的视频，我们将公开发布研究团体收集。我们的数据集由重叠的讲话，一个庞大而多样的扬声器游泳池和具有挑战性的背景条件。

43. Student-Teacher Curriculum Learning via Reinforcement Learning: Predicting Hospital Inpatient Admission Location [PDF] 返回目录
Rasheed el-Bouri, David Eyre, Peter Watkinson, Tingting Zhu, David Clifton
Abstract: Accurate and reliable prediction of hospital admission location is important due to resource-constraints and space availability in a clinical setting, particularly when dealing with patients who come from the emergency department. In this work we propose a student-teacher network via reinforcement learning to deal with this specific problem. A representation of the weights of the student network is treated as the state and is fed as an input to the teacher network. The teacher network's action is to select the most appropriate batch of data to train the student network on from a training set sorted according to entropy. By validating on three datasets, not only do we show that our approach outperforms state-of-the-art methods on tabular data and performs competitively on image recognition, but also that novel curricula are learned by the teacher network. We demonstrate experimentally that the teacher network can actively learn about the student network and guide it to achieve better performance than if trained alone.
摘要：入院位置的准确和可靠的预测是资源约束和空间的可用性重要，因为在临床环境中，特别是与谁来自急诊室的病人打交道时。在这项工作中，我们通过强化学习来处理这一特定问题提出了师生的网络。学生网络的权重的表示被视为状态和被馈送作为输入到教师网络。教师网络的作用是选择最合适的一批数据，以培养学生网络上根据熵的训练集来分类的。通过验证的三个数据集，我们不仅表明我们的方法比国家的最先进的方法对表格数据和竞争力的图像识别来执行，也认为新课程由老师网了解到。我们演示实验，教师网络可以积极地了解学生的网络，并引导它实现比如果单独训练的更好的性能。

44. A Brief Review of Deep Multi-task Learning and Auxiliary Task Learning [PDF] 返回目录
Partoo Vafaeikia, Khashayar Namdar, Farzad Khalvati
Abstract: Multi-task learning (MTL) optimizes several learning tasks simultaneously and leverages their shared information to improve generalization and the prediction of the model for each task. Auxiliary tasks can be added to the main task to ultimately boost the performance. In this paper, we provide a brief review on the recent deep multi-task learning (dMTL) approaches followed by methods on selecting useful auxiliary tasks that can be used in dMTL to improve the performance of the model for the main task.
摘要：多任务学习（MTL）同时优化几个学习任务，并利用他们的信息共享，提高泛化和模型为每个任务的预测。辅助任务可以被添加到的主要任务，最终提高性能。在本文中，我们提供了关于近期深多任务学习（dMTL）的简要回顾办法之后，可以在dMTL用于改善该机型为主要任务的性能上的选择有用的辅助任务的方法。

45. Evaluation of Contemporary Convolutional Neural Network Architectures for Detecting COVID-19 from Chest Radiographs [PDF] 返回目录
Nikita Albert
Abstract: Interpreting chest radiograph (a.k.a. chest x-ray) images is a necessary and crucial diagnostic tool used by medical professionals to detect and identify many diseases that may plague a patient. Although the images themselves contain a wealth of valuable information, their usefulness may be limited by how well they are interpreted, especially when the reviewing radiologist may be fatigued or when an experienced radiologist is unavailable. Research in the use of deep learning models to analyze chest radiographs yielded impressive results where, in some instances, the models outperformed practicing radiologists. Amidst the COVID-19 pandemic, researchers have explored and proposed the use of said deep models to detect COVID-19 infections from radiographs as a possible way to help ease the strain on medical resources. In this study, we train and evaluate three model architectures, proposed for chest radiograph analysis, under varying conditions, find issues that discount the impressive model performances proposed by contemporary studies on this subject, and propose methodologies to train models that yield more reliable results. Code, scripts, pre-trained models, and visualizations are available at this https URL.
摘要：解读胸片，a.ka.胸部X光，图像是使用由医疗专业人员以检测和识别许多疾病可能困扰的患者的必要和重要的诊断工具。虽然本身包含了大量有价值的信息的图像，其效用可能通过他们如何解释的限制，特别是当审查放射科医生可疲劳时或当或有经验的放射科医生是不可用的。研究采用深度学习模型来分析胸片取得了骄人的成绩，其中，在某些情况下，模型的性能优于执业放射。烟雨COVID-19大流行，研究人员已经研究并提出了使用该深模型从X光片检测COVID-19感染作为一种可能的方式帮助缓解医疗资源紧张。在这项研究中，我们培养和评价。三级模式架构，提出了胸片分析，在不同条件下，发现问题即折扣就这个问题提出的当代研究令人印象深刻的模型表演，并提出的方法来训练模型，收益率更可靠的结果。。代码，脚本，预先训练模型和可视化效果可在此HTTPS URL。

46. Scene Graph Reasoning for Visual Question Answering [PDF] 返回目录
Marcel Hildebrandt, Hang Li, Rajat Koner, Volker Tresp, Stephan Günnemann
Abstract: Visual question answering is concerned with answering free-form questions about an image. Since it requires a deep linguistic understanding of the question and the ability to associate it with various objects that are present in the image, it is an ambitious task and requires techniques from both computer vision and natural language processing. We propose a novel method that approaches the task by performing context-driven, sequential reasoning based on the objects and their semantic and spatial relationships present in the scene. As a first step, we derive a scene graph which describes the objects in the image, as well as their attributes and their mutual relationships. A reinforcement agent then learns to autonomously navigate over the extracted scene graph to generate paths, which are then the basis for deriving answers. We conduct a first experimental study on the challenging GQA dataset with manually curated scene graphs, where our method almost reaches the level of human performance.
摘要：视觉答疑涉及回答有关的图像自由形式的问题。由于它要求的问题，并把它与存在的图像中的各个对象相关联的能力的深刻理解的语言，这是一个雄心勃勃的任务，需要从两个计算机视觉和自然语言处理技术。我们建议接近通过基于存在于场景中的对象和它们的语义和空间关系上下文驱动，顺序推理任务的新方法。作为第一步，我们得出一个场景图，它描述了图像中的物体，以及它们的属性和它们之间的相互关系。甲增强剂然后学习自主导航在提取的场景图以产生路径，其然后用于导出答案的基础。我们的挑战GQA数据集手动策划的场景图，我们的方法几乎达到人的表现的水平进行第一实验研究。

47. 4D Spatio-Temporal Convolutional Networks for Object Position Estimation in OCT Volumes [PDF] 返回目录
Marcel Bengs, Nils Gessert, Alexander Schlaefer
Abstract: Tracking and localizing objects is a central problem in computer-assisted surgery. Optical coherence tomography (OCT) can be employed as an optical tracking system, due to its high spatial and temporal resolution. Recently, 3D convolutional neural networks (CNNs) have shown promising performance for pose estimation of a marker object using single volumetric OCT images. While this approach relied on spatial information only, OCT allows for a temporal stream of OCT image volumes capturing the motion of an object at high volume rates. In this work, we systematically extend 3D CNNs to 4D spatio-temporal CNNs to evaluate the impact of additional temporal information for marker object tracking. Across various architectures, our results demonstrate that using a stream of OCT volumes and employing 4D spatio-temporal convolutions leads to a 30% lower mean absolute error compared to single volume processing with 3D CNNs.
摘要：跟踪和定位对象是在计算机辅助外科手术中的中心问题。光学相干断层扫描（OCT）可以用作一个光学跟踪系统，由于它的高空间和时间分辨率。最近，三维卷积神经网络（细胞神经网络）已经显示了使用单体积的OCT图像的标记对象的姿态估计有前途的性能。虽然这种方法对空间信息只依赖，OCT允许OCT图像体积捕获对象的运动在高音量率的时间流。在这项工作中，我们系统的3D细胞神经网络扩展到四维时空细胞神经网络评估的额外时间信息标记对象跟踪的影响。在不同的架构，我们的研究结果表明，采用OCT卷流，并与三维细胞神经网络的单卷处理采用四维时空回旋导致30％的较低的平均绝对误差。

48. Spectral-Spatial Recurrent-Convolutional Networks for In-Vivo Hyperspectral Tumor Type Classification [PDF] 返回目录
Marcel Bengs, Nils Gessert, Wiebke Laffers, Dennis Eggert, Stephan Westermann, Nina A. Mueller, Andreas O. H. Gerstner, Christian Betz, Alexander Schlaefer
Abstract: Early detection of cancerous tissue is crucial for long-term patient survival. In the head and neck region, a typical diagnostic procedure is an endoscopic intervention where a medical expert manually assesses tissue using RGB camera images. While healthy and tumor regions are generally easier to distinguish, differentiating benign and malignant tumors is very challenging. This requires an invasive biopsy, followed by histological evaluation for diagnosis. Also, during tumor resection, tumor margins need to be verified by histological analysis. To avoid unnecessary tissue resection, a non-invasive, image-based diagnostic tool would be very valuable. Recently, hyperspectral imaging paired with deep learning has been proposed for this task, demonstrating promising results on ex-vivo specimens. In this work, we demonstrate the feasibility of in-vivo tumor type classification using hyperspectral imaging and deep learning. We analyze the value of using multiple hyperspectral bands compared to conventional RGB images and we study several machine learning models' ability to make use of the additional spectral information. Based on our insights, we address spectral and spatial processing using recurrent-convolutional models for effective spectral aggregating and spatial feature learning. Our best model achieves an AUC of 76.3%, significantly outperforming previous conventional and deep learning methods.
摘要：癌组织的早期检测是患者长期生存的关键。在头部和颈部区域，典型的诊断程序是内窥镜介入其中医学专家使用RGB摄像机图像手动评估组织。虽然健康和肿瘤区域通常更易于区分，鉴别良，恶性肿瘤是非常具有挑战性的。这需要一种侵入性活检，随后是诊断学评价。此外，肿瘤切除时，肿瘤边缘必须通过组织学分析进行验证。为了避免不必要的组织切除，非侵入性的，基于图像的诊断工具将是非常有价值的。近年来，随着深度学习配对高光谱成像已经提出了这个任务，表明在离体标本可喜的成果。在这项工作中，我们证明体内肿瘤类型分类的高光谱成像和深度学习的可行性。我们分析了使用多个高光谱带相对于传统的RGB图像的价值，我们研究一些机器学习模型来利用额外的频谱信息的能力。根据我们的见解，我们反复使用，卷积模型进行有效的频谱聚合和空间特征的学习解决频谱和空间处理。我们最好的模式实现了76.3％的AUC，显著超越以前的传统和深厚的学习方法。

49. A Novel DNN Training Framework via Data Sampling and Multi-Task Optimization [PDF] 返回目录
Boyu Zhang, A. K. Qin, Hong Pan, Timos Sellis
Abstract: Conventional DNN training paradigms typically rely on one training set and one validation set, obtained by partitioning an annotated dataset used for training, namely the gross training set, in a certain way. The training set is used for training the model while the validation set is used to estimate the generalization performance of the trained model as the training proceeds to avoid over-fitting. There exist two major issues in this paradigm. Firstly, the validation set can hardly guarantee an unbiased estimate of generalization performance due to potential mismatch with the test data. Secondly, training a DNN corresponds to solving a complex optimization problem, which is prone to getting trapped in inferior local optima and thus leads to undesired training results. To address these issues, we propose a novel DNN training framework. It generates multiple pairs of training and validation sets from the gross training set via random splitting, trains a DNN model of a pre-specified structure on each pair while transferring the useful knowledge (e.g., promising network parameters) obtained from one model training process to other model training processes via multi-task optimization, and outputs the trained model that has the best overall performance across the validation sets from all pairs. The knowledge transfer mechanism featured in this new framework can not only enhance training effectiveness by helping the model training process to escape from local optima but also improve generalization performance via implicit regularization imposed on one model training process by other model training processes. We implement the proposed framework, parallelize the implementation on a GPU cluster, and apply it to train several widely used DNN models. Experimental results demonstrate the superiority of the proposed framework over the conventional training paradigm.
摘要：传统的DNN训练范式通常依赖于一个训练集和一个验证组，通过分割用于培训，即总训练集的注释的数据集，以某种方式获得。训练集被用于训练模型，而验证集合被用于估计经训练的模型的泛化性能作为训练进行，以避免过拟合。存在着这种模式的两大问题。首先，验证集合可能很难保证的泛化性能的无偏估计，由于与测试数据电位不匹配。其次，培养了DNN相当于解决一个复杂的优化问题，这是容易出现越来越陷入劣势局部最优，从而导致不想要的培训效果。为了解决这些问题，我们提出了一个新颖的DNN培训框架。它产生从经由随机分裂总训练集倍数对训练和验证集，训练预先指定的结构对每对一个DNN模型，同时使有用的知识（例如，有前途的网络参数）从一个模型训练过程中获得的通过多任务优化转移到其他模型的训练过程，并输出最好的，所有训练的模型中，其中有来自所有对整个验证集的整体最佳性能。在这个新的框架功能的知识转移机制不仅能增强帮助模型训练过程从局部最优逃离，但也可以通过其他模型训练进程强加给一个模型的训练过程中隐含的正规化提高泛化能力培训效果。我们实施建议的框架，一个并行GPU集群上实施，并应用它来培养几个广泛使用DNN模型。实验结果表明，相对于传统的训练模式所提出的框架的优越性。
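
The splitting and selection steps described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `random_splits` and `select_best_model` are hypothetical names, the "model" is a toy mean predictor rather than a DNN, the mean of the per-split validation scores is assumed as the "overall" performance measure, and the knowledge transfer via multi-task optimization is deliberately left out.

```python
import random


def random_splits(data, n_pairs, val_fraction=0.2, seed=0):
    """Create `n_pairs` random (training set, validation set) pairs from one dataset."""
    rng = random.Random(seed)
    n_val = max(1, int(len(data) * val_fraction))
    pairs = []
    for _ in range(n_pairs):
        shuffled = list(data)
        rng.shuffle(shuffled)
        pairs.append((shuffled[n_val:], shuffled[:n_val]))
    return pairs


def select_best_model(data, train_fn, score_fn, n_pairs=3, seed=0):
    """Train one model per split and return the one with the best mean
    validation score over the validation sets of ALL splits."""
    pairs = random_splits(data, n_pairs, seed=seed)
    models = [train_fn(train) for train, _ in pairs]

    def overall_score(model):
        return sum(score_fn(model, val) for _, val in pairs) / len(pairs)

    return max(models, key=overall_score)


# Toy stand-ins (assumptions): the "model" is the training mean,
# the score is the negative mean squared error (higher is better).
def train_mean(train):
    return sum(train) / len(train)


def neg_mse(model, val):
    return -sum((x - model) ** 2 for x in val) / len(val)


data = [float(x) for x in range(10)]
best = select_best_model(data, train_mean, neg_mse, n_pairs=3, seed=0)
```

Scoring each candidate on every validation set, rather than only its own, is what distinguishes this selection rule from picking the best model of a single split.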

50. PGD-UNet: A Position-Guided Deformable Network for Simultaneous Segmentation of Organs and Tumors [PDF] 返回目录
Ziqiang Li, Hong Pan, Yaping Zhu, A. K. Qin
Abstract: Precise segmentation of organs and tumors plays a crucial role in clinical applications. It is a challenging task due to the irregular shapes and various sizes of organs and tumors as well as the significant class imbalance between the anatomy of interest (AOI) and the background region. In addition, in most situations tumors and normal organs overlap in medical images, but current approaches fail to delineate both tumors and organs accurately. To tackle such challenges, we propose a position-guided deformable UNet, namely PGD-UNet, which exploits the spatial deformation capabilities of deformable convolution to deal with the geometric transformation of both organs and tumors. Position information is explicitly encoded into the network to enhance the capabilities of deformation. Meanwhile, we introduce a new pooling module to preserve position information lost in the conventional max-pooling operation. Besides, due to unclear boundaries between different structures as well as the subjectivity of annotations, labels are not necessarily accurate for medical image segmentation tasks. This label noise may cause the trained network to overfit. To address this issue, we formulate a novel loss function to suppress the influence of potential label noise on the training process. Our method was evaluated on two challenging segmentation tasks and achieved very promising segmentation accuracy in both tasks.
摘要：器官和肿瘤的精确分割起在临床应用中至关重要的作用。这是一项艰巨的任务，由于不规则的形状和器官和肿瘤的各种尺寸以及利息（AOI）的解剖结构之间的显著类不平衡和背景区域。此外，在大多数情况下肿瘤和正常器官在医学图像往往重叠，但目前的方法无法准确界定双方的肿瘤和器官。为了应对这些挑战，我们提出了一个位置引导变形UNET，即PGD-UNET，它利用变形卷积的空间变形能力，以应对这两种器官和肿瘤的几何变换。位置信息明确地编入网络，提升变形的能力。同时，我们引入一个新的蓄积模块保持在传统的MAX-池操作丢失的位置信息。此外，由于不同的结构，以及注释的主观性之间的界限不清，标签不用于医学图像分割的任务不一定准确。这可能会导致训练网络的过度拟合由于标签的噪音。为了解决这个问题，我们制定了一种新的损失函数，以抑制潜在的标签噪声对训练过程的影响。我们的方法是在两个具有挑战性的任务分割评估并取得了非常有前途的两个任务分割精度。

51. MPLP: Learning a Message Passing Learning Protocol [PDF] 返回目录
Ettore Randazzo, Eyvind Niklasson, Alexander Mordvintsev
Abstract: We present a novel method for learning the weights of an artificial neural network - a Message Passing Learning Protocol (MPLP). In MPLP, we abstract every operation occurring in ANNs as an independent agent. Each agent is responsible for ingesting incoming multidimensional messages from other agents, updating its internal state, and generating multidimensional messages to be passed on to neighbouring agents. We demonstrate the viability of MPLP as opposed to traditional gradient-based approaches on simple feed-forward neural networks, and present a framework capable of generalizing to non-traditional neural network architectures. MPLP is meta-learned using end-to-end gradient-based meta-optimisation. We further discuss the observed properties of MPLP and hypothesize its applicability to various fields of deep learning.
摘要：提出一种新的方法，用于学习一个人工神经网络的权重 - 消息传递学习协议（MPLP）。在MPLP，发生在人工神经网络作为独立的代理商，我们每一个抽象的操作。每个代理负责从其它药剂摄取传入多维消息，更新其内部状态，并产生多维消息被传递到相邻的代理。我们证明MPLP的可行性，而不是简单的前馈神经网络的传统的基于梯度的方法，并提出能够推广到非传统神经网络结构的框架。 MPLP是使用元端至端基于梯度元优化教训。我们进一步讨论MPLP的观测性能，假设深学习的各个领域的适用性。

52. Iterative Bounding Box Annotation for Object Detection [PDF] 返回目录
Bishwo Adhikari, Heikki Huttunen
Abstract: Manual annotation of bounding boxes for object detection in digital images is tedious, and time and resource consuming. In this paper, we propose a semi-automatic method for efficient bounding box annotation. The method trains the object detector iteratively on small batches of labeled images and learns to propose bounding boxes for the next batch, after which the human annotator only needs to correct possible errors. We propose an experimental setup for simulating the human actions and use it for comparing different iteration strategies, such as the order in which the data is presented to the annotator. We experiment on our method with three datasets and show that it can reduce the human annotation effort significantly, saving up to 75% of total manual annotation work.
摘要：数字图像边界框物体检测的人工标注繁琐，而且耗费时间和资源。在本文中，我们提出了有效的边框标注半自动方法。该方法训练目标物检测反复上标记的图像，并学会的小批量，提出边界框下一批，在此之后，人类注释只需要纠正可能出现的错误。我们提出了一个实验装置来模拟人类的行为，并用它来比较不同的迭代策略，如在该数据呈现给标注的顺序。我们尝试在我们有三个数据集，并表明，它可以显著减少人力注释省力，节省高达总人工标注工作，75％的方法。

53. Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [PDF] 返回目录
Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, Baoxin Li
Abstract: Machine learning (ML) models are widely used in many domains including media processing and generation, computer vision, medical diagnosis, embedded systems, high-performance and scientific computing, and recommendation systems. For efficiently processing these computational- and memory-intensive applications, tensors of these over-parameterized models are compressed by leveraging sparsity, size reduction, and quantization of tensors. Unstructured sparsity and tensors with varying dimensions yield irregular-shaped computation, communication, and memory access patterns; processing them on hardware accelerators in a conventional manner does not inherently leverage acceleration opportunities. This paper provides a comprehensive survey on how to efficiently execute sparse and irregular tensor computations of ML models on hardware accelerators. In particular, it discusses additional enhancement modules in architecture design and software support; categorizes different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs; highlights further opportunities in terms of hardware/software/algorithm co-design optimizations and joint optimizations among described hardware and software enhancement modules. The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors; understanding enhancements in acceleration systems for supporting their efficient computations; analyzing trade-offs in opting for a specific type of design enhancement; understanding how to map and compile models with sparse tensors on the accelerators; understanding recent design trends for efficient accelerations and further opportunities.
摘要：机器学习（ML）模型被广泛应用于许多领域，包括媒体处理和生成，计算机视觉，医疗诊断，嵌入式系统，高性能科学计算和推荐系统。为了有效地处理这些computational-和存储器密集型的应用中，这些过度参数化模型的张量通过利用稀疏性，尺寸减小，和张量的量化压缩。非结构化稀疏性和具有不同尺寸的张量得到不规则形状的计算，通信和存储器访问模式;以常规的方式处理它们的硬件加速器本身并没有杠杆加速的机会。本文提供了有关如何有效地对硬件加速器执行ML车型的稀疏和不规则的张量计算的全面调查。特别是，它讨论了架构设计和软件支持的其他增强模块;分类不同的硬件设计和加速技术，并分析它们的硬件和执行成本方面;亮点的硬件/软件/算法协同设计优化和描述的硬件和软件增强模块之间的联合优化方面更多的机会。从本文的外卖包括：理解在加速稀疏的主要挑战，不规则形的，和量化的张量;在加速系统，用于支撑其有效计算理解增强;在选择了一个特定类型的设计增强的分析权衡;了解如何映射和编译与加速器稀疏张量模型;了解最近的设计趋势进行有效的加速和更多的机会。

54. An encoder-decoder-based method for COVID-19 lung infection segmentation [PDF] 返回目录
Omar Elharrouss, Nandhini Subramanian, Somaya Al-Maadeed
Abstract: The novelty of the COVID-19 disease and the speed of its spread have created colossal chaos and an impulse among researchers worldwide to exploit all available resources and capabilities to understand and analyze the characteristics of the coronavirus in terms of the ways it spreads and its incubation time. For that, existing medical features such as CT and X-ray images are used. For example, CT-scan images can be used for the detection of lung infection. But challenges such as image quality and infection characteristics limit the effectiveness of these features. Using artificial intelligence (AI) tools and computer vision algorithms, detection can be made more accurate, which can help to overcome these issues. This paper proposes a multi-task deep-learning-based method for lung infection segmentation using CT-scan images. Our proposed method starts by segmenting the lung regions that can be infected and then segments the infections in these regions. Also, to perform multi-class segmentation, the proposed model is trained using two-stream inputs. The multi-task learning used in this paper allows us to overcome the shortage of labeled data. Also, the multi-input stream allows the model to learn from many features, which can improve the results. To evaluate the proposed method, many features have been used. Also, the experiments show that the proposed method can segment lung infections with high performance even with a shortage of data and labeled images. In addition, compared with state-of-the-art methods, our method achieves good performance results.
摘要：COVID-19的疾病和扩散的速度的新颖性创造了一个巨大的混乱，冲击世界各地的研究人员利用所有的资源和能力在它的传播途径和病毒培养长期理解和分析的冠状病毒的特性中时间。对于这一点，使用像CT和X射线图像的现有的医疗功能。例如，CT扫描图像可以用于检测肺部感染。但是这些功能，如图像和感染特性质量的挑战limitate的这些功能的有效性。采用人工智能（AI）工具和计算机视觉算法，检测的精确度可以更加准确，可以帮助克服这些问题。本文提出了一种利用CT扫描图像肺部感染分割深学习为基础的多任务的方法。我们提出的方法，通过分割，可以感染肺部的区域开始。然后，分割这些地区的感染。此外，为了执行多级分割该模型是使用两流输入来训练。在本文中所使用的多任务学习使我们能够克服标记数据的不足。此外，多输入流允许模型做学习上的许多功能，可以改善的结果。为了评估所提出的方法，已经使用了许多功能。此外，从实验中，该方法能够段肺部感染具有高度的性能，即使与数据和标记的图像的短缺。另外，随着国家的最先进的方法相比，我们的方法取得了良好的性能结果。

55. Surface Denoising based on Normal Filtering in a Robust Statistics Framework [PDF] 返回目录
Sunil Kumar Yadav, Martin Skrodzki, Eric Zimmermann, Konrad Polthier
Abstract: During a surface acquisition process using 3D scanners, noise is inevitable and an important step in geometry processing is to remove these noise components from these surfaces (given as points-set or triangulated mesh). The noise-removal process (denoising) can be performed by filtering the surface normals first and by adjusting the vertex positions according to filtered normals afterwards. Therefore, in many available denoising algorithms, the computation of noise-free normals is a key factor. A variety of filters have been introduced for noise-removal from normals, with different focus points like robustness against outliers or large amplitude of noise. Although these filters are performing well in different aspects, a unified framework is missing to establish the relation between them and to provide a theoretical analysis beyond the performance of each method. In this paper, we introduce such a framework to establish relations between a number of widely-used nonlinear filters for face normals in mesh denoising and vertex normals in point set denoising. We cover robust statistical estimation with M-smoothers and their application to linear and non-linear normal filtering. Although these methods originate in different mathematical theories - which include diffusion-, bilateral-, and directional curvature-based algorithms - we demonstrate that all of them can be cast into a unified framework of robust statistics using robust error norms and their corresponding influence functions. This unification contributes to a better understanding of the individual methods and their relations with each other. Furthermore, the presented framework provides a platform for new techniques to combine the advantages of known filters and to compare them with available methods.
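Summary: Noise is inevitable when surfaces are acquired with 3D scanners, and removing it from point sets or triangulated meshes is an important step in geometry processing. Denoising can be done by first filtering the surface normals and then adjusting the vertex positions to the filtered normals, so computing noise-free normals is a key factor in many algorithms. Many normal filters exist, with different emphases such as robustness to outliers or to large-amplitude noise, but a unified framework that relates them and supports theoretical analysis has been missing. The paper introduces such a framework for widely used nonlinear filters on face normals (mesh denoising) and vertex normals (point set denoising), covering robust statistical estimation with M-smoothers and its application to linear and nonlinear normal filtering. Although the methods originate in different mathematical theories (diffusion-, bilateral-, and directional curvature-based algorithms), all of them can be cast into one robust-statistics framework using robust error norms and their influence functions. The framework clarifies how the methods relate to each other and offers a platform for combining the advantages of known filters and comparing new techniques with existing ones.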
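
To make the robust-statistics view of normal filtering concrete, here is a minimal sketch of one robust averaging step for a single normal. It is not the authors' algorithm: the function names, the choice of the Gaussian (Welsch) error norm, and the use of a range term only (no spatial term) are assumptions made for illustration. For that error norm, rho(x) = sigma^2 (1 - exp(-x^2 / (2 sigma^2))), the influence function is psi(x) = x exp(-x^2 / (2 sigma^2)), so the weight psi(x)/x is a Gaussian, which is the familiar range weight of a bilateral filter.

```python
import math


def gaussian_weight(x, sigma):
    """Weight psi(x)/x of the Gaussian (Welsch) error norm.

    Assumption: this error norm is picked for illustration only.
    """
    return math.exp(-(x * x) / (2.0 * sigma * sigma))


def normalize(v):
    """Scale a 3D vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def filter_normal(center, neighbors, sigma):
    """One robust averaging step for a single unit normal.

    Each neighbor normal is weighted by how close it is to the center
    normal, so outlier normals contribute almost nothing.
    Hypothetical helper; a spatial weight is deliberately omitted.
    """
    acc = [0.0, 0.0, 0.0]
    for n in [center] + list(neighbors):
        # Residual: Euclidean distance between the two unit normals.
        diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(center, n)))
        w = gaussian_weight(diff, sigma)
        for k in range(3):
            acc[k] += w * n[k]
    return normalize(acc)
```

With a small `sigma`, a neighbor normal pointing in a very different direction receives a near-zero weight, which is the outlier robustness the abstract refers to; a larger `sigma` moves the step toward a plain average.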

56. Uncertainty-Guided Efficient Interactive Refinement of Fetal Brain Segmentation from Stacks of MRI Slices [PDF] Back to contents
Guotai Wang, Michael Aertsen, Jan Deprest, Sebastien Ourselin, Tom Vercauteren, Shaoting Zhang
Abstract: Segmentation of the fetal brain from stacks of motion-corrupted fetal MRI slices is important for motion correction and high-resolution volume reconstruction. Although Convolutional Neural Networks (CNNs) have been widely used for automatic segmentation of the fetal brain, their results may still benefit from interactive refinement for challenging slices. To improve the efficiency of the interactive refinement process, we propose an Uncertainty-Guided Interactive Refinement (UGIR) framework. We first propose a grouped convolution-based CNN to obtain multiple automatic segmentation predictions with uncertainty estimation in a single forward pass, then guide the user to provide interactions only in a subset of slices with the highest uncertainty. A novel interactive level set method is also proposed to obtain a refined result given the initial segmentation and user interactions. Experimental results show that: (1) our proposed CNN obtains uncertainty estimation in real time which correlates well with mis-segmentations, (2) the proposed interactive level set is effective and efficient for refinement, (3) UGIR obtains accurate refinement results with around 30% improvement in efficiency by using uncertainty to guide user interactions. Our code is available online.
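Summary: Segmenting the fetal brain from stacks of motion-corrupted fetal MRI slices matters for motion correction and high-resolution volume reconstruction. CNNs are widely used for automatic segmentation, but their results can still benefit from interactive refinement on challenging slices. To make that refinement more efficient, the paper proposes the Uncertainty-Guided Interactive Refinement (UGIR) framework: a grouped convolution-based CNN produces multiple segmentation predictions with an uncertainty estimate in a single forward pass, and the user is guided to interact only with the subset of slices that have the highest uncertainty. A novel interactive level set method then produces the refined result from the initial segmentation and the user interactions. Experiments show that the CNN's real-time uncertainty estimate correlates well with mis-segmentations, that the interactive level set is effective and efficient for refinement, and that UGIR obtains accurate refinement results with about 30% better efficiency by using uncertainty to guide user interactions. The code is available online.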
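
The slice-selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how uncertainty is computed, so the sketch uses the mean pixel-wise variance across the multiple predictions, and the names `slice_uncertainty` and `slices_to_review` are hypothetical.

```python
from statistics import pvariance


def slice_uncertainty(predictions):
    """Mean pixel-wise variance across K predictions for one slice.

    predictions: list of K flattened foreground-probability maps.
    Assumption: variance stands in for the paper's uncertainty measure.
    """
    num_pixels = len(predictions[0])
    per_pixel = [
        pvariance([pred[i] for pred in predictions]) for i in range(num_pixels)
    ]
    return sum(per_pixel) / num_pixels


def slices_to_review(stack, top_n):
    """Return indices of the top_n most uncertain slices, most uncertain first.

    stack: list of slices, each a list of K flattened predictions.
    """
    scores = [slice_uncertainty(preds) for preds in stack]
    ranked = sorted(range(len(stack)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]
```

The user would then be asked to correct only the returned slices, rather than inspecting the whole stack.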

57. NP-PROV: Neural Processes with Position-Relevant-Only Variances [PDF] Back to contents
Xuesong Wang, Lina Yao, Xianzhi Wang, Feiping Nie
Abstract: Neural Processes (NPs) families encode distributions over functions to a latent representation, given context data, and decode posterior mean and variance at unknown locations. Since mean and variance are derived from the same latent space, they may fail on out-of-domain tasks where fluctuations in function values amplify the model uncertainty. We present a new member named Neural Processes with Position-Relevant-Only Variances (NP-PROV). NP-PROV hypothesizes that a target point close to a context point has small uncertainty, regardless of the function value at that position. The resulting approach derives mean and variance from a function-value-related space and a position-related-only latent space separately. Our evaluation on synthetic and real-world datasets reveals that NP-PROV can achieve state-of-the-art likelihood while retaining a bounded variance when drifts exist in the function value.
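Summary: The Neural Process (NP) family encodes distributions over functions into a latent representation given context data, and decodes the posterior mean and variance at unknown locations. Because mean and variance are derived from the same latent space, these models may fail on out-of-domain tasks where fluctuations in function values amplify the model uncertainty. The paper presents a new member of the family, Neural Processes with Position-Relevant-Only Variances (NP-PROV), which hypothesizes that a target point close to a context point has small uncertainty regardless of the function value at that position. The approach derives the mean from a function-value-related space and the variance from a separate position-related-only latent space. Evaluation on synthetic and real-world datasets shows that NP-PROV can achieve state-of-the-art likelihood while keeping the variance bounded when the function values drift.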

58. Rapid tissue oxygenation mapping from snapshot structured-light images with adversarial deep learning [PDF] Back to contents
Mason T. Chen, Nicholas J. Durr
Abstract: Spatial frequency domain imaging (SFDI) is a powerful technique for mapping tissue oxygen saturation over a wide field of view. However, current SFDI methods either require a sequence of several images with different illumination patterns or, in the case of single snapshot optical properties (SSOP), introduce artifacts and sacrifice accuracy. To avoid this tradeoff, we introduce OxyGAN: a data-driven, content-aware method to estimate tissue oxygenation directly from single structured light images using end-to-end generative adversarial networks. Conventional SFDI is used to obtain ground truth tissue oxygenation maps for ex vivo human esophagi, in vivo hands and feet, and an in vivo pig colon sample under 659 nm and 851 nm sinusoidal illumination. We benchmark OxyGAN by comparing to SSOP and to a two-step hybrid technique that uses a previously-developed deep learning model to predict optical properties followed by a physical model to calculate tissue oxygenation. When tested on human feet, a cross-validated OxyGAN maps tissue oxygenation with an accuracy of 96.5%. When applied to sample types not included in the training set, such as human hands and pig colon, OxyGAN achieves a 93.0% accuracy, demonstrating robustness to various tissue types. On average, OxyGAN outperforms SSOP and a hybrid model in estimating tissue oxygenation by 24.9% and 24.7%, respectively. Lastly, we optimize OxyGAN inference so that oxygenation maps are computed ~10 times faster than previous work, enabling video-rate, 25Hz imaging. Due to its rapid acquisition and processing speed, OxyGAN has the potential to enable real-time, high-fidelity tissue oxygenation mapping that may be useful for many clinical applications.
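Summary: Spatial frequency domain imaging (SFDI) is a powerful technique for mapping tissue oxygen saturation over a wide field of view, but current SFDI methods either need a sequence of several images with different illumination patterns or, in the case of single snapshot optical properties (SSOP), introduce artifacts and sacrifice accuracy. To avoid this tradeoff, the paper introduces OxyGAN, a data-driven, content-aware method that estimates tissue oxygenation directly from single structured-light images using end-to-end generative adversarial networks. Conventional SFDI provides ground-truth oxygenation maps for ex vivo human esophagi, in vivo hands and feet, and an in vivo pig colon sample under 659 nm and 851 nm sinusoidal illumination. OxyGAN is benchmarked against SSOP and against a two-step hybrid technique that uses a previously developed deep learning model to predict optical properties, followed by a physical model to calculate oxygenation. A cross-validated OxyGAN maps oxygenation with 96.5% accuracy on human feet and 93.0% accuracy on sample types outside the training set, such as human hands and pig colon. On average it outperforms SSOP and the hybrid model by 24.9% and 24.7%, respectively. With optimized inference, oxygenation maps are computed about 10 times faster than in previous work, enabling video-rate 25 Hz imaging, which gives OxyGAN the potential to enable real-time, high-fidelity oxygenation mapping for clinical applications.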

59. Deep learning-based holographic polarization microscopy [PDF] Back to contents
Tairan Liu, Kevin de Haan, Bijie Bai, Yair Rivenson, Yi Luo, Hongda Wang, David Karalli, Hongxiang Fu, Yibo Zhang, John FitzGerald, Aydogan Ozcan
Abstract: Polarized light microscopy provides high contrast to birefringent specimens and is widely used as a diagnostic tool in pathology. However, polarization microscopy systems typically operate by analyzing images collected from two or more light paths in different states of polarization, which leads to relatively complex optical designs and high system costs, or requires experienced technicians. Here, we present a deep learning-based holographic polarization microscope that is capable of obtaining quantitative birefringence retardance and orientation information of a specimen from a phase-recovered hologram, while only requiring the addition of one polarizer/analyzer pair to an existing holographic imaging system. Using a deep neural network, the reconstructed holographic images from a single state of polarization can be transformed into images equivalent to those captured using a single-shot computational polarized light microscope (SCPLM). Our analysis shows that a trained deep neural network can extract the birefringence information using both the sample-specific morphological features and the holographic amplitude and phase distribution. To demonstrate the efficacy of this method, we tested it by imaging various birefringent samples including, e.g., monosodium urate (MSU) and triamcinolone acetonide (TCA) crystals. Our method achieves similar results to SCPLM both qualitatively and quantitatively, and due to its simpler optical design and significantly larger field of view, this method has the potential to expand access to polarization microscopy and its use for medical diagnosis in resource-limited settings.
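Summary: Polarized light microscopy gives high contrast for birefringent specimens and is widely used as a diagnostic tool in pathology. Conventional systems analyze images from two or more light paths in different polarization states, which means relatively complex optics, high cost, or a need for experienced technicians. The paper presents a deep learning-based holographic polarization microscope that obtains quantitative birefringence retardance and orientation information from a phase-recovered hologram while adding only one polarizer/analyzer pair to an existing holographic imaging system. A deep neural network transforms reconstructed holographic images from a single polarization state into images equivalent to those of a single-shot computational polarized light microscope (SCPLM). The analysis shows that the trained network extracts birefringence information from both sample-specific morphological features and the holographic amplitude and phase distribution. The method was tested on birefringent samples including monosodium urate (MSU) and triamcinolone acetonide (TCA) crystals and matches SCPLM qualitatively and quantitatively. With its simpler optical design and significantly larger field of view, it has the potential to expand access to polarization microscopy for medical diagnosis in resource-limited settings.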

60. Adversarial Example Games [PDF] Back to contents
Avishek Joey Bose, Gauthier Gidel, Hugo Berrard, Andre Cianflone, Pascal Vincent, Simon Lacoste-Julien, William L. Hamilton
Abstract: The existence of adversarial examples capable of fooling trained neural network classifiers calls for a much better understanding of possible attacks, in order to guide the development of safeguards against them. This includes attack methods in the highly challenging non-interactive blackbox setting, where adversarial attacks are generated without any access, including queries, to the target model. Prior works in this setting have relied mainly on algorithmic innovations derived from empirical observations (e.g., that momentum helps), and the field currently lacks a firm theoretical basis for understanding transferability in adversarial attacks. In this work, we address this gap and lay the theoretical foundations for crafting transferable adversarial examples to entire function classes. We introduce Adversarial Example Games (AEG), a novel framework that models adversarial examples as two-player min-max games between an attack generator and a representative classifier. We prove that the saddle point of an AEG game corresponds to a generating distribution of adversarial examples against entire function classes. Training the generator only requires the ability to optimize a representative classifier from a given hypothesis class, enabling BlackBox transfer to unseen classifiers from the same class. We demonstrate the efficacy of our approach on the MNIST and CIFAR-10 datasets against both undefended and robustified models, achieving competitive performance with state-of-the-art BlackBox transfer approaches.
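Summary: Adversarial examples that fool trained neural network classifiers call for a much better understanding of possible attacks, so that safeguards can be developed against them. This includes attacks in the highly challenging non-interactive blackbox setting, where adversarial examples are generated without any access to the target model, not even queries. Prior work in this setting has relied mainly on algorithmic innovations derived from empirical observations (for example, that momentum helps), and the field lacks a firm theoretical basis for understanding transferability. The paper addresses this gap and lays theoretical foundations for crafting adversarial examples that transfer to entire function classes. It introduces Adversarial Example Games (AEG), a framework that models adversarial examples as a two-player min-max game between an attack generator and a representative classifier, and proves that the saddle point of an AEG game corresponds to a generating distribution of adversarial examples against entire function classes. Training the generator only requires the ability to optimize a representative classifier from a given hypothesis class, which enables blackbox transfer to unseen classifiers from the same class. Experiments on MNIST and CIFAR-10 against both undefended and robustified models show performance competitive with state-of-the-art blackbox transfer approaches.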

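Note: The Chinese abstracts in the original post were machine-translated. The cover image is a word cloud of the paper titles.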


[arXiv papers] Computer Vision and Pattern Recognition 2020-07-03

Contents

Abstracts