Contents
1. Learning to Learn in a Semi-Supervised Fashion [PDF] Abstract
2. Deep Active Learning in Remote Sensing for data efficient Change Detection [PDF] Abstract
3. GRAB: A Dataset of Whole-Body Human Grasping of Objects [PDF] Abstract
4. Bias-Awareness for Zero-Shot Learning the Seen and Unseen [PDF] Abstract
5. Boundary Uncertainty in a Single-Stage Temporal Action Localization Network [PDF] Abstract
6. FastSal: a Computationally Efficient Network for Visual Saliency Prediction [PDF] Abstract
7. Spatiotemporal Action Recognition in Restaurant Videos [PDF] Abstract
8. Masked Face Recognition for Secure Authentication [PDF] Abstract
9. Improving Deep Stereo Network Generalization with Geometric Priors [PDF] Abstract
10. Using the discrete radon transformation for grayscale image moments [PDF] Abstract
11. Mask-guided sample selection for Semi-Supervised Instance Segmentation [PDF] Abstract
12. On estimating gaze by self-attention augmented convolutions [PDF] Abstract
13. Label Decoupling Framework for Salient Object Detection [PDF] Abstract
14. Polarimetric SAR Image Semantic Segmentation with 3D Discrete Wavelet Transform and Markov Random Field [PDF] Abstract
15. Protect, Show, Attend and Tell: Image Captioning Model with Ownership Protection [PDF] Abstract
16. Active Class Incremental Learning for Imbalanced Datasets [PDF] Abstract
17. In-Home Daily-Life Captioning Using Radio Signals [PDF] Abstract
18. AgingMapGAN (AMGAN): High-Resolution Controllable Face Aging with Spatially-Aware Conditional GANs [PDF] Abstract
19. Think about boundary: Fusing multi-level boundary information for landmark heatmap regression [PDF] Abstract
20. Towards End-to-end Car License Plate Location and Recognition in Unconstrained Scenarios [PDF] Abstract
40. New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design [PDF] Abstract
44. Robust Pancreatic Ductal Adenocarcinoma Segmentation with Multi-Institutional Multi-Phase Partially-Annotated CT Scans [PDF] Abstract
Abstracts
1. Learning to Learn in a Semi-Supervised Fashion [PDF] Back to Contents
Yun-Chun Chen, Chao-Te Chou, Yu-Chiang Frank Wang
Abstract: To address semi-supervised learning from both labeled and unlabeled data, we present a novel meta-learning scheme. We particularly consider the setting in which labeled and unlabeled data share disjoint ground-truth label sets, as can be seen in tasks like person re-identification or image retrieval. Our learning scheme exploits the idea of leveraging information from labeled to unlabeled data. Instead of fitting the associated class-wise similarity scores as most meta-learning algorithms do, we propose to derive semantics-oriented similarity representations from labeled data, and transfer such representations to unlabeled ones. Thus, our strategy can be viewed as a self-supervised learning scheme, which can also be applied to fully supervised learning tasks for improved performance. Our experiments on various tasks and settings confirm the effectiveness of our proposed approach and its superiority over state-of-the-art methods.
2. Deep Active Learning in Remote Sensing for data efficient Change Detection [PDF] Back to Contents
Vít Růžička, Stefano D'Aronco, Jan Dirk Wegner, Konrad Schindler
Abstract: We investigate active learning in the context of deep neural network models for change detection and map updating. Active learning is a natural choice for a number of remote sensing tasks, including the detection of local surface changes: changes are on the one hand rare and on the other hand their appearance is varied and diffuse, making it hard to collect a representative training set in advance. In the active learning setting, one starts from a minimal set of training examples and progressively chooses informative samples that are annotated by a user and added to the training set. Hence, a core component of an active learning system is a mechanism to estimate model uncertainty, which is then used to pick uncertain, informative samples. We study different mechanisms to capture and quantify this uncertainty when working with deep networks, based on the variance or entropy across explicit or implicit model ensembles. We show that active learning successfully finds highly informative samples and automatically balances the training distribution, and reaches the same performance as a model supervised with a large, pre-annotated training set, with $\approx$99% fewer annotated samples.
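As a rough illustration of the uncertainty mechanisms described above, the sketch below scores a pool of unlabeled samples by the entropy or variance of predictions aggregated across an implicit ensemble (e.g., several MC-dropout forward passes), then queries the most uncertain ones. The array shapes, function names, and top-k query rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the mean prediction across ensemble members.
    probs: (ensemble, samples, classes) softmax outputs, e.g. from
    several stochastic forward passes (an implicit ensemble)."""
    mean_p = probs.mean(axis=0)
    return -(mean_p * np.log(mean_p + 1e-12)).sum(axis=1)

def predictive_variance(probs):
    """Variance of each class probability across members, averaged."""
    return probs.var(axis=0).mean(axis=1)

# Toy pool: 8 stochastic passes over 100 unlabeled samples, 2 classes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 100, 2))
probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)

# Query the 10 most uncertain samples for annotation.
query = np.argsort(predictive_entropy(probs))[-10:]
print(query)
```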
3. GRAB: A Dataset of Whole-Body Human Grasping of Objects [PDF] Back to Contents
Omid Taheri, Nima Ghorbani, Michael J. Black, Dimitrios Tzionas
Abstract: Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at this https URL.
4. Bias-Awareness for Zero-Shot Learning the Seen and Unseen [PDF] Back to Contents
William Thong, Cees G.M. Snoek
Abstract: Generalized zero-shot learning recognizes inputs from both seen and unseen classes. Yet, existing methods tend to be biased towards the classes seen during training. In this paper, we strive to mitigate this bias. We propose a bias-aware learner to map inputs to a semantic embedding space for generalized zero-shot learning. During training, the model learns to regress to real-valued class prototypes in the embedding space with temperature scaling, while a margin-based bidirectional entropy term regularizes seen and unseen probabilities. Relying on a real-valued semantic embedding space provides a versatile approach, as the model can operate on different types of semantic information for both seen and unseen classes. Experiments are carried out on four benchmarks for generalized zero-shot learning and demonstrate the benefits of the proposed bias-aware classifier, both as a stand-alone method and in combination with generated features.
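The following sketch shows one plausible reading of this training objective: temperature-scaled similarities to real-valued class prototypes, plus a margin-based entropy term over the aggregate seen/unseen probability split. The function name, the exact form of the regularizer, and all hyperparameter values are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def bias_aware_losses(embed, prototypes, labels, seen_mask, T=0.04, margin=0.3):
    """Hypothetical sketch: prototype regression with temperature scaling
    plus a margin-based entropy term over seen vs. unseen probability mass.
    labels index into the seen-class subset."""
    logits = embed @ prototypes.t() / T          # temperature-scaled scores
    ce = F.cross_entropy(logits[:, seen_mask], labels)

    probs = logits.softmax(dim=1)
    p_seen = probs[:, seen_mask].sum(dim=1)
    p_unseen = probs[:, ~seen_mask].sum(dim=1)
    # Binary entropy over the seen/unseen split, pushed above a margin so
    # probability mass is not entirely concentrated on one side.
    h = -(p_seen * p_seen.clamp_min(1e-9).log()
          + p_unseen * p_unseen.clamp_min(1e-9).log())
    return ce + F.relu(margin - h).mean()

B, D, C = 8, 64, 12                              # toy sizes (assumed)
embed = F.normalize(torch.randn(B, D), dim=1)
prototypes = F.normalize(torch.randn(C, D), dim=1)
seen = torch.zeros(C, dtype=torch.bool); seen[:8] = True
labels = torch.randint(0, 8, (B,))
print(bias_aware_losses(embed, prototypes, labels, seen))
```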
5. Boundary Uncertainty in a Single-Stage Temporal Action Localization Network [PDF] Back to Contents
Ting-Ting Xie, Christos Tzelepis, Ioannis Patras
Abstract: In this paper, we address the problem of temporal action localization with a single-stage neural network. In the proposed architecture we model the boundary predictions as uni-variate Gaussian distributions in order to model their uncertainties, which is the first in this area to the best of our knowledge. We use two uncertainty-aware boundary regression losses: first, the Kullback-Leibler divergence between the ground truth location of the boundary and the Gaussian modeling the prediction of the boundary, and second, the expectation of the $\ell_1$ loss under the same Gaussian. We show that both uncertainty modeling approaches improve the detection performance by more than $1.5\%$ in mAP@tIoU=0.5 and that the proposed simple one-stage network performs closely to more complex one- and two-stage networks.
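For concreteness, the sketch below implements the two losses in the form they commonly take: treating the ground-truth boundary as a point target, the KL term reduces (up to a constant) to the Gaussian negative log-likelihood, and the expectation of the $\ell_1$ loss under a Gaussian has a closed form (the mean of a folded normal). The exact parameterization here is an assumption; the paper's formulation may differ.

```python
import torch
from torch.distributions import Normal

def gaussian_nll(mu, sigma, g):
    """NLL of the ground-truth boundary g under the predicted Gaussian;
    with a point (Dirac) target, the KL term reduces to this up to a
    constant."""
    return 0.5 * ((g - mu) / sigma) ** 2 + torch.log(sigma)

def expected_l1(mu, sigma, g):
    """Closed-form E|x - g| for x ~ N(mu, sigma^2): folded-normal mean."""
    m = mu - g
    std_normal = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    return (sigma * (2 / torch.pi) ** 0.5 * torch.exp(-m**2 / (2 * sigma**2))
            + m * (2 * std_normal.cdf(m / sigma) - 1))

# Toy predicted boundary Gaussian and ground-truth location (illustrative).
mu, sigma, g = torch.tensor([10.0]), torch.tensor([2.0]), torch.tensor([12.5])
print(gaussian_nll(mu, sigma, g), expected_l1(mu, sigma, g))
```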
6. FastSal: a Computationally Efficient Network for Visual Saliency Prediction [PDF] Back to Contents
Feiyan Hu, Kevin McGuinness
Abstract: This paper focuses on the problem of visual saliency prediction, predicting regions of an image that tend to attract human visual attention, under a constrained computational budget. We modify and test various recent efficient convolutional neural network architectures like EfficientNet and MobileNetV2 and compare them with existing state-of-the-art saliency models such as SalGAN and DeepGaze II, both in terms of standard accuracy metrics like AUC and NSS, and in terms of computational complexity and model size. We find that MobileNetV2 makes an excellent backbone for a visual saliency model and can be effective even without a complex decoder. We also show that knowledge transfer from a more computationally expensive model like DeepGaze II can be achieved via pseudo-labelling an unlabelled dataset, and that this approach gives results on par with many state-of-the-art algorithms at a fraction of the computational cost and model size. Source code is available at this https URL.
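A minimal sketch of the pseudo-labelling transfer described above, assuming the teacher outputs per-pixel saliency probabilities; the stand-in networks, loss choice, and the `distill_step` helper are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer):
    """One pseudo-labelling step: the expensive teacher (a DeepGaze
    II-like model) labels unlabelled images, the cheap student regresses
    onto those targets."""
    with torch.no_grad():
        target = teacher(images)             # (B, 1, H, W) saliency in [0, 1]
    loss = F.binary_cross_entropy_with_logits(student(images), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy stand-ins for the student and teacher networks (assumed shapes).
student = torch.nn.Conv2d(3, 1, 3, padding=1)
teacher = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1),
                              torch.nn.Sigmoid())
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
print(distill_step(student, teacher, torch.rand(4, 3, 32, 32), opt))
```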
7. Spatiotemporal Action Recognition in Restaurant Videos [PDF] Back to Contents
Akshat Gupta, Milan Desai, Wusheng Liang, Magesh Kannan
Abstract: Spatiotemporal action recognition is the task of locating and classifying actions in videos. Our project applies this task to analyzing video footage of restaurant workers preparing food, for which potential applications include automated checkout and inventory management. Such videos are quite different from the standardized datasets that researchers are used to, as they involve small objects, rapid actions, and notoriously unbalanced data classes. We explore two approaches. The first approach involves the familiar object detector You Only Look Once (YOLO), and the second applies a recently proposed analogue for action recognition, You Only Watch Once (YOWO). In the first, we design and implement a novel recurrent modification of YOLO using convolutional LSTMs and explore the various subtleties in the training of such a network. In the second, we study the ability of YOWO's three-dimensional convolutions to capture the spatiotemporal features of our unique dataset.
8. Masked Face Recognition for Secure Authentication [PDF] Back to Contents
Aqeel Anwar, Arijit Raychowdhury
Abstract: With the recent world-wide COVID-19 pandemic, wearing face masks has become an important part of our lives. People are encouraged to cover their faces when in public areas to avoid the spread of infection. The use of these face masks has raised a serious question about the accuracy of the facial recognition systems used for tracking school/office attendance and unlocking phones. Many organizations use facial recognition as a means of authentication and have already developed the necessary datasets in-house to be able to deploy such a system. Unfortunately, masked faces are difficult to detect and recognize, thereby threatening to make the in-house datasets invalid and such facial recognition systems inoperable. This paper presents a methodology to use current facial datasets by augmenting them with tools that enable masked faces to be recognized with low false-positive rates and high overall accuracy, without requiring the user dataset to be recreated by taking new pictures for authentication. We present an open-source tool, MaskTheFace, to mask faces, effectively creating a large dataset of masked faces. The dataset generated with this tool is then used to train an effective facial recognition system with target accuracy for masked faces. We report an increase of 38% in the true positive rate for the Facenet system. We also test the accuracy of the re-trained system on a custom real-world dataset, MFR2, and report similar accuracy.
9. Improving Deep Stereo Network Generalization with Geometric Priors [PDF] Back to Contents
Jialiang Wang, Varun Jampani, Deqing Sun, Charles Loop, Stan Birchfield, Jan Kautz
Abstract: End-to-end deep learning methods have advanced stereo vision in recent years and obtained excellent results when the training and test data are similar. However, large datasets of diverse real-world scenes with dense ground truth are difficult to obtain and currently not publicly available to the research community. As a result, many algorithms rely on small real-world datasets of similar scenes or synthetic datasets, but end-to-end algorithms trained on such datasets often generalize poorly to different images that arise in real-world applications. As a step towards addressing this problem, we propose to incorporate prior knowledge of scene geometry into an end-to-end stereo network to help networks generalize better. For a given network, we explicitly add a gradient-domain smoothness prior and occlusion reasoning into the network training, while the architecture remains unchanged during inference. Experimentally, we show consistent improvements if we train on synthetic datasets and test on the Middlebury (real images) dataset. Noticeably, we improve PSM-Net accuracy on Middlebury from 5.37 MAE to 3.21 MAE without sacrificing speed.
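The sketch below shows the standard edge-aware form a gradient-domain smoothness prior usually takes: disparity gradients are penalized, down-weighted where the reference image itself has strong gradients (likely object boundaries). Whether the paper uses exactly this weighting is an assumption.

```python
import torch

def smoothness_prior(disp, image):
    """Gradient-domain smoothness: penalize disparity gradients, weighted
    by exp(-|image gradient|) so depth edges aligned with image edges are
    not over-penalized.

    disp:  (B, 1, H, W) predicted disparity
    image: (B, 3, H, W) reference image
    """
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()

disp, img = torch.rand(2, 1, 32, 32), torch.rand(2, 3, 32, 32)
print(smoothness_prior(disp, img))
```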
10. Using the discrete radon transformation for grayscale image moments [PDF] Back to Contents
William Diggin, Michael Diggin
Abstract: Image moments are weighted sums over pixel values in a given image and are used in object detection and localization. Raw image moments are derived directly from the image and are fundamental in deriving moment invariant quantities. The current general algorithm for raw image moments is computationally expensive, and the number of multiplications needed scales with the number of pixels in the image. For an image of size (N,M), it requires O(NM) multiplications. In this paper we outline an algorithm using the Discrete Radon Transformation for computing the raw image moments of a grayscale image. It reduces two-dimensional moment calculations to linear combinations of one-dimensional moment calculations. We show that the number of multiplications needed scales as O(N + M), making it faster than the most widely used algorithm for raw image moments.
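To make the O(N + M) claim concrete, here is a sketch that recovers first- and second-order raw moments from three 1D projections (the 0°, 90°, and 45° discrete Radon views): the projections themselves need only additions, and multiplications occur only in the 1D moment sums. It uses the identity Σ(x+y)²I = M20 + 2·M11 + M02 to recover the mixed moment; this illustrates the reduction rather than reproducing the paper's exact algorithm.

```python
import numpy as np

def raw_moments_via_projections(img):
    """First- and second-order raw moments of a grayscale image from 1D
    projections; multiplications happen only on O(N + M) sized arrays."""
    rows = img.sum(axis=1)            # projection onto y (additions only)
    cols = img.sum(axis=0)            # projection onto x
    y, x = np.arange(img.shape[0]), np.arange(img.shape[1])

    m00 = cols.sum()
    m10, m20 = (cols * x).sum(), (cols * x**2).sum()
    m01, m02 = (rows * y).sum(), (rows * y**2).sum()

    # 45-degree projection: bin pixels by t = x + y (additions only).
    t = (y[:, None] + x[None, :]).ravel()
    diag = np.bincount(t, weights=img.ravel())
    s2 = (diag * np.arange(diag.size) ** 2).sum()   # sums (x+y)^2 * I
    m11 = (s2 - m20 - m02) / 2.0
    return m00, m10, m01, m20, m11, m02

print(raw_moments_via_projections(np.random.rand(32, 48)))
```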
11. Mask-guided sample selection for Semi-Supervised Instance Segmentation [PDF] Back to Contents
Miriam Bellver, Amaia Salvador, Jordi Torres, Xavier Giro-i-Nieto
Abstract: Image segmentation methods are usually trained with pixel-level annotations, which require significant human effort to collect. The most common solution to address this constraint is to implement weakly-supervised pipelines trained with lower forms of supervision, such as bounding boxes or scribbles. Another option is semi-supervised methods, which leverage a large amount of unlabeled data and a limited number of strongly-labeled samples. In this second setup, samples to be strongly-annotated can be selected randomly or with an active learning mechanism that chooses the ones that will maximize the model performance. In this work, we propose a sample selection approach to decide which samples to annotate for semi-supervised instance segmentation. Our method consists of first predicting pseudo-masks for the unlabeled pool of samples, together with a score predicting the quality of each mask. This score is an estimate of the Intersection over Union (IoU) of the segment with the ground truth mask. We study which samples are better to annotate given the quality score, and show how our approach outperforms a random selection, leading to improved performance for semi-supervised instance segmentation with low annotation budgets.
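A minimal sketch of the acquisition step, assuming the quality head outputs a predicted IoU per pseudo-mask. Annotating the lowest-scoring samples is one plausible policy (the paper studies which choice actually works best), and the helper name is hypothetical.

```python
import numpy as np

def select_for_annotation(quality_scores, budget):
    """Rank unlabeled samples by the predicted IoU of their pseudo-mask
    with the (unknown) ground truth and return the `budget` samples the
    model is least confident about."""
    order = np.argsort(quality_scores)   # ascending: lowest quality first
    return order[:budget]

scores = np.random.default_rng(1).uniform(size=50)   # mock quality-head output
print(select_for_annotation(scores, budget=5))
```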
12. On estimating gaze by self-attention augmented convolutions [PDF] Back to Contents
Gabriel Lefundes, Luciano Oliveira
Abstract: Estimation of 3D gaze is highly relevant to multiple fields, including but not limited to interactive systems, specialized human-computer interfaces, and behavioral research. Although deep learning methods have recently boosted the accuracy of appearance-based gaze estimation, there is still room for improvement in the network architectures for this particular task. Therefore we propose here a novel network architecture grounded on self-attention augmented convolutions to improve the quality of the learned features during the training of a shallower residual network. The rationale is that the self-attention mechanism can help outperform deeper architectures by learning dependencies between distant regions in full-face images. This mechanism can also create better and more spatially-aware feature representations derived from the face and eye images before gaze regression. We dubbed our framework ARes-gaze, which explores our Attention-augmented ResNet (ARes-14) as twin convolutional backbones. In our experiments, results showed a decrease of the average angular error by 2.38% when compared to state-of-the-art methods on the MPIIFaceGaze data set, and a second-place result on the EyeDiap data set. It is noteworthy that our proposed framework was the only one to reach high accuracy simultaneously on both data sets.
13. Label Decoupling Framework for Salient Object Detection [PDF] Back to Contents
Jun Wei, Shuhui Wang, Zhe Wu, Chi Su, Qingming Huang, Qi Tian
Abstract: To get more accurate saliency maps, recent methods mainly focus on aggregating multi-level features from fully convolutional networks (FCN) and introducing edge information as auxiliary supervision. Though remarkable progress has been achieved, we observe that the closer a pixel is to the edge, the more difficult it is to predict, because edge pixels have a very imbalanced distribution. To address this problem, we propose a label decoupling framework (LDF) which consists of a label decoupling (LD) procedure and a feature interaction network (FIN). LD explicitly decomposes the original saliency map into a body map and a detail map, where the body map concentrates on center areas of objects and the detail map focuses on regions around edges. The detail map works better because it involves many more pixels than traditional edge supervision. Different from the saliency map, the body map discards edge pixels and only pays attention to center areas. This successfully avoids the distraction from edge pixels during training. Therefore, we employ two branches in FIN to deal with the body map and detail map respectively. Feature interaction (FI) is designed to fuse the two complementary branches to predict the saliency map, which is then used to refine the two branches again. This iterative refinement is helpful for learning better representations and more precise saliency maps. Comprehensive experiments on six benchmark datasets demonstrate that LDF outperforms state-of-the-art approaches on different evaluation metrics.
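One natural way to realize the body/detail decomposition is with a distance transform, as sketched below: pixels far from the object boundary dominate the body map, and the remainder forms the detail map. Whether LDF uses exactly this normalization is an assumption.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def decouple_label(mask):
    """Split a binary saliency mask into a body map (emphasizing pixels
    far from the boundary) and a detail map (the remaining mass near
    edges). mask: (H, W) binary ground-truth saliency map."""
    dist = distance_transform_edt(mask)        # distance to background
    body = dist / (dist.max() + 1e-8) * mask   # peaks at the object center
    detail = mask - body                       # concentrated near edges
    return body, detail

mask = np.zeros((64, 64)); mask[16:48, 16:48] = 1.0
body, detail = decouple_label(mask)
print(body.max(), detail.max())
```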
14. Polarimetric SAR Image Semantic Segmentation with 3D Discrete Wavelet Transform and Markov Random Field [PDF] Back to Contents
Haixia Bi, Lin Xu, Xiangyong Cao, Yong Xue, Zongben Xu
Abstract: Polarimetric synthetic aperture radar (PolSAR) image segmentation is currently of great importance in image processing for remote sensing applications. However, it is a challenging task due to two main reasons. Firstly, the label information is difficult to acquire due to high annotation costs. Secondly, the speckle effect embedded in the PolSAR imaging process remarkably degrades the segmentation performance. To address these two issues, we present a contextual PolSAR image semantic segmentation method in this paper. With a newly defined channelwise consistent feature set as input, the three-dimensional discrete wavelet transform (3D-DWT) technique is employed to extract discriminative multi-scale features that are robust to speckle noise. Then Markov random field (MRF) is further applied to enforce label smoothness spatially during segmentation. By simultaneously utilizing 3D-DWT features and MRF priors for the first time, contextual information is fully integrated during the segmentation to ensure accurate and smooth segmentation. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on three real benchmark PolSAR image data sets. Experimental results indicate that the proposed method achieves promising segmentation accuracy and preferable spatial consistency using a minimal number of labeled pixels.
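A sketch of the 3D-DWT feature extraction step using PyWavelets (pywt.dwtn), assuming the transform is taken jointly over the channel and spatial axes and that subband magnitudes are upsampled back to the image grid as per-pixel features; how the paper aggregates subbands across channels is an assumption here.

```python
import numpy as np
import pywt

def dwt3d_features(stack, wavelet="haar", levels=2):
    """Multi-scale 3D-DWT features for a (C, H, W) PolSAR feature stack:
    at each level, take the 8 subbands of a 3D DWT, collapse the channel
    axis, and upsample each subband magnitude back to (H, W)."""
    feats, data = [], stack
    for _ in range(levels):
        sub = pywt.dwtn(data, wavelet)           # 8 subbands: 'aaa'..'ddd'
        for band in sub.values():
            mag = np.abs(band).mean(axis=0)      # collapse channel axis
            reps = (stack.shape[1] // mag.shape[0],
                    stack.shape[2] // mag.shape[1])
            feats.append(np.kron(mag, np.ones(reps)))  # nearest upsample
        data = sub["aaa"]                        # recurse on approximation
    return np.stack(feats)                       # (8 * levels, H, W)

print(dwt3d_features(np.random.rand(8, 64, 64)).shape)
```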
15. Protect, Show, Attend and Tell: Image Captioning Model with Ownership Protection [PDF] Back to Contents
Jian Han Lim, Chee Seng Chan, Kam Woh Ng, Lixin Fan, Qiang Yang
Abstract: By and large, existing Intellectual Property Right (IPR) protection for deep neural networks typically i) focuses on the image classification task only, and ii) follows a standard digital watermarking framework that was conventionally used to protect the ownership of multimedia and video content. This paper demonstrates that the current digital watermarking framework is insufficient to protect the image captioning task, which is often regarded as one of the frontier A.I. problems. As a remedy, this paper studies and proposes two different embedding schemes in the hidden memory state of a recurrent neural network to protect the image captioning model. From both theoretical and empirical standpoints, we prove that a forged key will yield an unusable image captioning model, defeating the purpose of infringement. To the best of our knowledge, this work is the first to propose ownership protection on the image captioning task. Also, extensive experiments show that the proposed method does not compromise the original image captioning performance on all common captioning metrics on the Flickr30k and MS-COCO datasets, and at the same time it is able to withstand both removal and ambiguity attacks.
16. Active Class Incremental Learning for Imbalanced Datasets [PDF] Back to Contents
Eden Belouadah, Adrian Popescu, Umang Aggarwal, Léo Saci
Abstract: Incremental Learning (IL) allows AI systems to adapt to streamed data. Most existing algorithms make two strong hypotheses which reduce the realism of the incremental scenario: (1) new data are assumed to be readily annotated when streamed and (2) tests are run with balanced datasets while most real-life datasets are actually imbalanced. These hypotheses are discarded and the resulting challenges are tackled with a combination of active and imbalanced learning. We introduce sample acquisition functions which tackle imbalance and are compatible with IL constraints. We also consider IL as an imbalanced learning problem instead of the established usage of knowledge distillation against catastrophic forgetting. Here, imbalance effects are reduced during inference through class prediction scaling. Evaluation is done with four visual datasets and compares existing and proposed sample acquisition functions. Results indicate that the proposed contributions have a positive effect and reduce the gap between active and standard IL performance.
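A minimal sketch of inference-time class prediction scaling, assuming the scaling divides each class probability by its training-set prior so rare (e.g., newly added) classes are not systematically suppressed; the exact scaling rule used in the paper may differ.

```python
import numpy as np

def scaled_predict(probs, class_counts):
    """Inference-time class prediction scaling for imbalanced IL.

    probs:        (N, C) softmax outputs
    class_counts: (C,) number of training samples seen per class
    """
    prior = class_counts / class_counts.sum()
    rescaled = probs / prior                       # amplify rare classes
    rescaled /= rescaled.sum(axis=1, keepdims=True)
    return rescaled.argmax(axis=1)

probs = np.array([[0.6, 0.3, 0.1]])
# Class 2 wins after rescaling, despite its low raw probability.
print(scaled_predict(probs, class_counts=np.array([900, 90, 10])))
```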
17. In-Home Daily-Life Captioning Using Radio Signals [PDF] Back to Contents
Lijie Fan, Tianhong Li, Yuan Yuan, Dina Katabi
Abstract: This paper aims to caption daily life, i.e., to create a textual description of people's activities and interactions with objects in their homes. Addressing this problem requires novel methods beyond traditional video captioning, as most people would have privacy concerns about deploying cameras throughout their homes. We introduce RF-Diary, a new model for captioning daily life by analyzing the privacy-preserving radio signal in the home together with the home's floormap. RF-Diary can further observe and caption people's life through walls and occlusions and in dark settings. In designing RF-Diary, we exploit the ability of radio signals to capture people's 3D dynamics, and use the floormap to help the model learn people's interactions with objects. We also use a multi-modal feature alignment training scheme that leverages existing video-based captioning datasets to improve the performance of our radio-based captioning model. Extensive experimental results demonstrate that RF-Diary generates accurate captions under visible conditions. It also sustains its good performance in dark or occluded settings, where video-based captioning approaches fail to generate meaningful captions. For more information, please visit our project webpage: this http URL
18. AgingMapGAN (AMGAN): High-Resolution Controllable Face Aging with Spatially-Aware Conditional GANs [PDF] Back to Contents
Julien Despois, Frederic Flament, Matthieu Perrot
Abstract: Existing approaches and datasets for face aging produce results skewed towards the mean, with individual variations and expression wrinkles often invisible or overlooked in favor of global patterns such as the fattening of the face. Moreover, they offer little to no control over the way the faces are aged and are difficult to scale to large images, thus preventing their usage in many real-world applications. To address these limitations, we present an approach to change the appearance of a high-resolution image using ethnicity-specific aging information and weak spatial supervision to guide the aging process. We demonstrate the advantage of our proposed method in terms of quality and control, and show how it can be used on high-definition images while limiting the computational overhead.
19. Think about boundary: Fusing multi-level boundary information for landmark heatmap regression [PDF] 返回目录
Jinheng Xie, Jun Wan, Linlin Shen, Zhihui Lai
Abstract: Although current face alignment algorithms have achieved fairly good performance at predicting the location of facial landmarks, huge challenges remain for faces with severe occlusion, large pose variations, etc. On the contrary, the semantic location of the facial boundary is more likely to be preserved and estimated in these scenes. Therefore, we study a two-stage but end-to-end approach for exploring the relationship between the facial boundary and landmarks to get boundary-aware landmark predictions, which consists of two modules: the self-calibrated boundary estimation (SCBE) module and the boundary-aware landmark transform (BALT) module. In the SCBE module, we modify the stem layers and employ intermediate supervision to help generate high-quality facial boundary heatmaps. Boundary-aware features inherited from the SCBE module are integrated into the BALT module in a multi-scale fusion framework to better model the transformation from boundary to landmark heatmap. Experimental results conducted on challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.
摘要:尽管目前的脸比对算法已经在预测脸部显着标记的位置得到相当不错的性能,巨大的挑战仍然是严重闭塞和大姿态变化等相反,面部边界的语义位置面临的是更容易被保留并估计对这些场景。因此,我们研究了两个阶段,但终端到终端的方法探索面部边界和地标得到边界意识地标性的预测,它由两个模块之间的关系:自校准边界估计(SCBE)模块和边界感知界标变换(BALT)模块。在SCBE模块,我们修改了干层,并采用中间监督,以帮助生成高质量的面部界热图。从SCBE模块继承了边界感知功能集成到BALT模块中的多尺度融合框架,以更好的模型从边界标志性建筑热图的转换。在富有挑战性的基准数据集进行的实验结果表明,我们的方法比国家的最先进的方法在文献。
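Landmark heatmap regression targets of the kind used here are conventionally rendered as 2D Gaussians centered on each landmark. A minimal NumPy sketch (sizes and sigma are illustrative, not taken from the paper):

```python
import numpy as np

def landmark_heatmap(h, w, cx, cy, sigma=2.0):
    """Render one landmark (cx, cy) as a 2D Gaussian heatmap."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# One channel per landmark, as consumed by a heatmap-regression head.
landmarks = [(20.0, 30.0), (40.0, 12.0)]
target = np.stack([landmark_heatmap(64, 64, x, y) for x, y in landmarks])
print(target.shape)  # (2, 64, 64)
```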
20. Towards End-to-end Car License Plate Location and Recognition in Unconstrained Scenarios [PDF] 返回目录
Shuxin Qin, Sijiang Liu
Abstract: Benefiting from the rapid development of convolutional neural networks, the performance of car license plate detection and recognition has been largely improved. Nonetheless, challenges still exist, especially for real-world applications. In this paper, we present an efficient and accurate framework that solves the license plate detection and recognition tasks simultaneously. It is a lightweight and unified deep neural network that can be optimized end-to-end and works in real time. Specifically, for unconstrained scenarios, an anchor-free method is adopted to efficiently detect the bounding box and four corners of a license plate, which are used to extract and rectify the target region features. Then, a novel convolutional neural network branch is designed to further extract features of characters without segmentation. Finally, the recognition task is treated as a sequence labelling problem, which is solved directly by Connectionist Temporal Classification (CTC). Several public datasets, including images collected from different scenarios under various conditions, are chosen for evaluation. A large number of experiments indicate that the proposed method significantly outperforms the previous state-of-the-art methods in both speed and precision.
摘要:卷积神经网络的快速发展中受益,车牌检测与识别的性能得到了很大程度的提高。尽管如此,仍然挑战特别是存在现实世界的应用。在本文中,我们提出了一种有效和精确的框架,以同时解决车牌检测与识别的任务。它是一个轻量级的和统一的深层神经网络,可以优化终端到终端和实时工作。具体而言,无约束的场景,采用一个无锚的方法来有效地检测边界框和牌照,其用于提取的四个角和纠正目标区域的特征。然后,一个新颖的卷积神经网络的分支被设计成字符的进一步提取特征而不分割。最后,识别任务被视为序列标注的问题,这是由直接联结颞分类(CTC)解决。一些公共数据集包括在各种条件下从不同的场景收集的图像被选择用于评估。大量的实验表明,所提出的方法显著优于在速度和精度之前的状态的最先进的方法。
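The segmentation-free recognition branch trains with CTC, which the abstract names explicitly. A minimal sketch with PyTorch's nn.CTCLoss; the shapes and the 37-class alphabet (36 characters plus a blank at index 0) are illustrative assumptions:

```python
import torch
import torch.nn as nn

T, B, C = 24, 4, 37                       # time steps, batch, classes
logits = torch.randn(T, B, C, requires_grad=True)  # per-column char scores
log_probs = logits.log_softmax(2)

targets = torch.randint(1, C, (B, 7))     # e.g. 7-character plates, 0 = blank
input_lengths = torch.full((B,), T, dtype=torch.long)
target_lengths = torch.full((B,), 7, dtype=torch.long)

# CTC marginalizes over all alignments, so no character segmentation is needed.
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```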
21. MonStereo: When Monocular and Stereo Meet at the Tail of 3D Human Localization [PDF] 返回目录
Lorenzo Bertoni, Sven Kreiss, Taylor Mordan, Alexandre Alahi
Abstract: Monocular and stereo vision are cost-effective solutions for 3D human localization in the context of self-driving cars or social robots. However, they are usually developed independently and have their respective strengths and limitations. We propose a novel unified learning framework that leverages the strengths of both monocular and stereo cues for 3D human localization. Our method jointly (i) associates humans in left-right images, (ii) deals with occluded and distant cases in stereo settings by relying on the robustness of monocular cues, and (iii) tackles the intrinsic ambiguity of monocular perspective projection by exploiting prior knowledge of the human height distribution. We achieve state-of-the-art quantitative results for the 3D localization task on the KITTI dataset and estimate confidence intervals that account for challenging instances. We show qualitative examples for long-tail challenges such as occluded, far-away, and child instances.
摘要:单眼和立体视觉是在自动驾驶汽车或社交机器人的背景下三维人体本地化成本效益的解决方案。然而,他们通常独立开发,并有各自的优势和局限性。我们提出了一个新的统一的学习框架,利用两个单眼的三维人体本地化的优势和立体声线索。我们的方法联合(I)相关联人类在左右的图像,(ii)通过依靠单眼线索的鲁棒性,并且通过事先利用单眼透视投影(iii)的铲球固有歧义与立体声设置闭塞和远处的情况下的交易人类高度分布的知识。我们实现上占具有挑战性的情况下KITTI数据集,估计置信区间的3D定位任务的国家的最先进的定量结果。我们展示的定性示例,长尾巴的挑战,如闭塞,遥远的,和孩子的情况。
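The monocular height prior rests on pinhole geometry: a person of true height H who appears h pixels tall under focal length f lies at depth roughly z = f·H/h, and a distribution over H turns the point estimate into an interval. A small sketch under assumed camera values (all numbers are illustrative, not the paper's):

```python
import numpy as np

def depth_from_height(h_pixels, focal_px=720.0, mean_h=1.70, std_h=0.10):
    """Pinhole depth estimate z = f * H / h with a human-height prior.

    Returns the point estimate and a rough 95% interval induced by the
    height distribution (mean_h +/- 2 * std_h, in meters).
    """
    z = focal_px * mean_h / h_pixels
    lo = focal_px * (mean_h - 2 * std_h) / h_pixels
    hi = focal_px * (mean_h + 2 * std_h) / h_pixels
    return z, (lo, hi)

z, (lo, hi) = depth_from_height(h_pixels=80.0)
print(f"z ~ {z:.1f} m, interval [{lo:.1f}, {hi:.1f}] m")
```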
22. Confidence-aware Adversarial Learning for Self-supervised Semantic Matching [PDF] 返回目录
Shuaiyi Huang, Qiuyue Wang, Xuming He
Abstract: In this paper, we aim to address the challenging task of semantic matching where matching ambiguity is difficult to resolve even with learned deep features. We tackle this problem by taking into account the confidence in predictions and develop a novel refinement strategy to correct partial matching errors. Specifically, we introduce a Confidence-Aware Semantic Matching Network (CAMNet) which instantiates two key ideas of our approach. First, we propose to estimate a dense confidence map for a matching prediction through self-supervised learning. Second, based on the estimated confidence, we refine initial predictions by propagating reliable matching to the rest of locations on the image plane. In addition, we develop a new hybrid loss in which we integrate a semantic alignment loss with a confidence loss, and an adversarial loss that measures the quality of semantic correspondence. We are the first that exploit confidence during refinement to improve semantic matching accuracy and develop an end-to-end self-supervised adversarial learning procedure for the entire matching network. We evaluate our method on two public benchmarks, on which we achieve top performance over the prior state of the art. We will release our source code at this https URL.
摘要:在本文中,我们的目标是解决语义匹配的具有挑战性的任务,即当匹配不确定性是难以解决的,即使学会了深刻的特点。我们考虑到在预测的信心解决这个问题,开发了一种新的改进策略,以正确的部分匹配误差。具体来说,我们引入了信任感知的语义匹配网络(CAMNet)的实例我们的方法的两个关键的想法。首先,我们建议估计密集信心地图,通过自我监督学习匹配的预测。其次,基于估计的信心,我们完善了传播可靠匹配的图像平面上的位置,其余初步预测。此外,我们开发了一个新的混合损失中,我们整合了语义对准损失有信心的丧失,以及敌对损失的措施语义对应的质量。我们是利用精炼过程中的信心,提高语义匹配精度和制定整个匹配网络的终端到终端的自我监督敌对学习过程的第一个。我们评估我们在两个公共基准,对我们在艺术的以前的状态达到最佳性能的方法。我们将在本HTTPS URL释放我们的源代码。
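The paper's exact hybrid loss is not reproduced here; one common way to couple a predicted confidence map with a matching loss is loss attenuation, where low-confidence pixels are down-weighted and a -log(conf) term forestalls the trivial all-zero solution. A hedged sketch in that spirit (not necessarily the paper's formulation; in practice conf would be a sigmoid output of the network):

```python
import torch

def confidence_weighted_loss(pred_flow, target_flow, conf, lam=0.1):
    """Down-weight matching errors where confidence is low; the
    -log(conf) penalty keeps the network from predicting zero
    confidence everywhere."""
    err = (pred_flow - target_flow).abs().sum(dim=1)   # per-pixel L1, (B, H, W)
    return (conf * err - lam * torch.log(conf + 1e-6)).mean()

pred = torch.randn(2, 2, 32, 32, requires_grad=True)   # dense correspondence field
target = torch.randn(2, 2, 32, 32)
conf = torch.rand(2, 32, 32).clamp(0.05, 1.0)          # stand-in confidence map
loss = confidence_weighted_loss(pred, target, conf)
loss.backward()
```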
23. Two-Stream Networks for Lane-Change Prediction of Surrounding Vehicles [PDF] 返回目录
David Fernández-Llorca, Mahdi Biparva, Rubén Izquierdo-Gonzalo, John K. Tsotsos
Abstract: In highway scenarios, an alert human driver will typically anticipate early cut-in and cut-out maneuvers of surrounding vehicles using only visual cues. An automated system must anticipate these situations at an early stage too, to increase the safety and the efficiency of its performance. To deal with lane-change recognition and prediction of surrounding vehicles, we pose the problem as an action recognition/prediction problem by stacking visual cues from video cameras. Two video action recognition approaches are analyzed: two-stream convolutional networks and spatiotemporal multiplier networks. Different sizes of the regions around the vehicles are analyzed, evaluating the importance of the interaction between vehicles and the context information in the performance. In addition, different prediction horizons are evaluated. The obtained results demonstrate the potential of these methodologies to serve as robust predictors of future lane-changes of surrounding vehicles in time horizons between 1 and 2 seconds.
摘要:在高速公路场景,报警人的司机通常会预期早切入,并且仅使用视觉提示周围车辆的切口演习。自动化系统必须预见到在早期阶段,这些情况也增加了安全性和性能的效率。为了应对车道变换的认可和周围车辆的预测,我们通过摄像机叠加视觉线索带来的问题,因为一个动作识别/预测问题。两种视频行为识别方法进行了分析:两流卷积网络和时空乘数网络。车辆周围的区域不同尺寸的分析,评估车辆和在性能上的上下文信息之间的互动的重要性。此外,不同的预测层位被评估。所获得的结果证明了这些方法的潜力,以服务为周围车辆的未来车道变化鲁棒预测在时间跨度1和2秒之间。
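A two-stream network of the type analyzed here pairs a spatial stream (one RGB frame) with a temporal stream (a stack of optical-flow fields) and fuses class scores late. A minimal PyTorch sketch; the three classes (keep lane / cut-in / cut-out) and all sizes are illustrative:

```python
import torch
import torch.nn as nn

def stream(in_ch, n_classes):
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 5, stride=2), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(64, n_classes),
    )

class TwoStreamNet(nn.Module):
    """Spatial stream sees one RGB frame; temporal stream sees a stack
    of L optical-flow fields (2L channels). Late fusion averages logits."""
    def __init__(self, n_classes=3, flow_stack=10):
        super().__init__()
        self.spatial = stream(3, n_classes)
        self.temporal = stream(2 * flow_stack, n_classes)

    def forward(self, rgb, flow):
        return 0.5 * (self.spatial(rgb) + self.temporal(flow))

net = TwoStreamNet()
logits = net(torch.rand(2, 3, 112, 112), torch.rand(2, 20, 112, 112))
print(logits.shape)  # (2, 3)
```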
24. Discriminability Distillation in Group Representation Learning [PDF] 返回目录
Manyuan Zhang, Guanglu Song, Hang Zhou, Yu Liu
Abstract: Learning group representation is a commonly concerned issue in tasks where the basic unit is a group, set, or sequence. Previously, the research community has tried to tackle it by aggregating the elements in a group based on an indicator either defined by humans, such as quality and saliency, or generated by a black box, such as an attention score. This article provides a more essential and explicable view. We claim that the most significant indicator of whether the group representation benefits from one of its elements is not quality or an inexplicable score, but the element's discriminability w.r.t. the model. We explicitly design the discriminability using embedded class centroids on a proxy set. We show the discriminability knowledge has good properties: it can be distilled by a light-weight distillation network and can be generalized to the unseen target set. The whole procedure is denoted as discriminability distillation learning (DDL). The proposed DDL can be flexibly plugged into many group-based recognition tasks without influencing the original training procedures. Comprehensive experiments on various tasks have proven the effectiveness of DDL for both accuracy and efficiency. Moreover, it pushes forward the state-of-the-art results on these tasks by an impressive margin.
摘要:学习组表示是任务的共同关心的问题,其中的基本单位为一组,一组或序列。此前,研究团体试图通过聚集基于人类如质量和显着性,或由一个黑盒子产生如注意力得分要么定义的指标组中的元素来解决它。本文提供了一个更加重要和可解释性视图。我们要求保护的最显著指示器,以显示是否该组表示可以从它的元件的一个受益不是质量或莫名其妙的得分,但可辨性w.r.t.该模型。我们使用一个代理集嵌入类重心明确设计discrimiability。我们展示了discrimiability知识,具有可以由轻质蒸馏网络蒸馏可以在看不见的目标设定一概而论良好的性能。整个过程被表示为可辨性蒸馏学习(DDL)。所提出的DDL可以灵活地插入许多基于组的识别任务,而不影响原有的训练程序。在各种任务综合实验已经证明DDL的两个精度和效率的效果。此外,它推动令人印象深刻的保证金提交有关这些任务的国家的最先进的成果。
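One concrete way to realize discriminability w.r.t. embedded class centroids is to score each element by the softmax probability its embedding assigns to its own centroid, then use the scores as aggregation weights. A sketch under that assumption (the paper's exact scoring may differ):

```python
import torch
import torch.nn.functional as F

def discriminability(embeddings, labels, centroids):
    """Score each element by the softmax probability its embedding
    assigns to its own class centroid (higher = more discriminative).
    embeddings: (N, D), labels: (N,), centroids: (K, D)."""
    d2 = torch.cdist(embeddings, centroids) ** 2   # (N, K) squared distances
    probs = torch.softmax(-d2, dim=1)              # closer centroid -> higher prob
    return probs[torch.arange(len(labels)), labels]

emb = F.normalize(torch.randn(8, 16), dim=1)       # proxy-set embeddings
labels = torch.randint(0, 4, (8,))
cents = F.normalize(torch.randn(4, 16), dim=1)
scores = discriminability(emb, labels, cents)      # (8,) per-element weights
group_repr = (scores.unsqueeze(1) * emb).sum(0) / scores.sum()
```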
25. Graphical Object Detection in Document Images [PDF] 返回目录
Ranajit Saha, Ajoy Mondal, C. V. Jawahar
Abstract: Graphical elements, particularly tables and figures, contain a visual summary of the most valuable information in a document. Therefore, localization of such graphical objects in document images is the initial step towards understanding their content. In this paper, we present a novel end-to-end trainable deep learning based framework to localize graphical objects in document images, called Graphical Object Detection (GOD). Our framework is data-driven and does not require any heuristics or meta-data to locate graphical objects in document images. GOD explores the concepts of transfer learning and domain adaptation to handle the scarcity of labeled training images for the graphical object detection task. Performance analysis carried out on various public benchmark data sets (ICDAR-2013, ICDAR-POD2017, and UNLV) shows that our model yields promising results compared to state-of-the-art techniques.
摘要:图形元素:特别表和图包含的文档中包含的最有价值的信息的可视摘要。因此,在文档图像这样的图形对象的定位是初始步骤,以理解这些图形对象或文档图像的内容。在本文中,我们提出了一个新颖的端至端的可训练的深学习基础的框架中被称为图形对象检测(GOD)的文档图像本地化图形对象。我们的框架是数据驱动的,并且不需要任何启发式或者元数据的文档图像定位图形对象。神探讨转让的学习和领域适应性,以在文档图像图形对象检测任务标记的训练图像的处理稀缺的概念。性能分析在各种公共标准数据集进行:ICDAR-2013,ICDAR-POD2017,和UNLV表明,我们的模型产生有希望的结果相比,国家的最先进的技术。
26. Adaptive Context-Aware Multi-Modal Network for Depth Completion [PDF] 返回目录
Shanshan Zhao, Mingming Gong, Huan Fu, Dacheng Tao
Abstract: Depth completion aims to recover a dense depth map from sparse depth data and the corresponding single RGB image. The observed pixels provide significant guidance for the recovery of the unobserved pixels' depth. However, due to the sparsity of the depth data, the standard convolution operation, exploited by most existing methods, is not effective at modeling the observed contexts with depth values. To address this issue, we propose to adopt graph propagation to capture the observed spatial contexts. Specifically, we first construct multiple graphs at different scales from observed pixels. Since the graph structure varies from sample to sample, we then apply an attention mechanism on the propagation, which encourages the network to model the contextual information adaptively. Furthermore, considering the multi-modality of the input data, we exploit graph propagation on the two modalities respectively to extract multi-modal representations. Finally, we introduce a symmetric gated fusion strategy to exploit the extracted multi-modal features effectively. The proposed strategy preserves the original information for one modality and also absorbs complementary information from the other through learning adaptive gating weights. Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves state-of-the-art performance on two benchmarks, i.e., KITTI and NYU-v2, and at the same time has fewer parameters than the latest models. Our code is available at: this https URL.
摘要:深度完成旨在恢复来自稀疏深度数据稠密深度图和对应的单个RGB图像。所观察到的像素提供用于未观测到的像素深度的恢复显著指导。然而,由于深度数据,标准卷积运算,通过大多数的现有方法利用的稀疏性,不能有效地观察到的上下文与深度值进行建模。为了解决这个问题,我们建议采用图形传播捕捉观测到的空间环境。具体地讲,我们首先在从观察到的像素不同的尺度构造多个图。由于图结构从样品变化到样品中,我们然后应用上的传播,它鼓励网络到上下文信息自适应地建模注意机制。此外,考虑输入数据的辑阵模态,我们分别利用在两个模态的图形传播,以提取多模态表示。最后,我们介绍了对称的门融合策略有效地利用所提取的多模式功能。所提出的策略保留了一个模态的原始信息,并且还通过学习自适应门控权重吸收从其他补充信息。我们的模型,命名为自适应上下文感知的多模态网络(ACMNet),实现两个基准的国家的最先进的性能,{\它即},KITTI和NYU-V2,并在同一时间有较少的参数比最新型号。我们的代码,请访问:\ {URL这HTTPS URL}。
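The core idea of propagating from observed pixels over an attention-weighted graph can be sketched with a k-nearest-neighbor graph built on pixel coordinates, where each pixel gathers features from its nearest observed neighbors with similarity-derived attention. A simplified, assumption-laden sketch (the paper builds multi-scale graphs inside a full network):

```python
import torch

def attentive_knn_propagation(feats, coords, valid, k=4):
    """Propagate features from observed pixels over an attention-weighted
    k-NN graph. feats: (N, D) pixel features, coords: (N, 2) positions,
    valid: (N,) bool mask of pixels with observed depth."""
    obs = valid.nonzero(as_tuple=True)[0]
    d = torch.cdist(coords, coords[obs])          # (N, M) spatial distances
    knn = d.topk(k, largest=False).indices        # (N, k) nearest observed pixels
    neigh = feats[obs][knn]                       # (N, k, D) their features
    # Attention from feature similarity between each pixel and its neighbors.
    att = torch.softmax((neigh * feats.unsqueeze(1)).sum(-1), dim=1)  # (N, k)
    return (att.unsqueeze(-1) * neigh).sum(1)     # (N, D) propagated features

feats = torch.randn(100, 8)
coords = torch.rand(100, 2)
valid = torch.zeros(100, dtype=torch.bool)
valid[::5] = True                                 # 20 pixels with observed depth
out = attentive_knn_propagation(feats, coords, valid)  # (100, 8)
```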
27. CDeC-Net: Composite Deformable Cascade Network for Table Detection in Document Images [PDF] 返回目录
Madhav Agarwal, Ajoy Mondal, C. V. Jawahar
Abstract: Localizing page elements/objects such as tables, figures, equations, etc. is the primary step in extracting information from document images. We propose a novel end-to-end trainable deep network (CDeC-Net) for detecting tables present in documents. The proposed network consists of a multistage extension of Mask R-CNN with a dual backbone having deformable convolution, for detecting tables varying in scale with high detection accuracy at higher IoU thresholds. We empirically evaluate CDeC-Net on all the publicly available benchmark datasets - ICDAR-2013, ICDAR-2017, ICDAR-2019, UNLV, Marmot, PubLayNet, and TableBank - with extensive experiments. Our solution has three important properties: (i) a single trained model CDeC-Net‡ performs well across all the popular benchmark datasets; (ii) we report excellent performances across multiple, including higher, thresholds of IoU; (iii) by following the same protocol of the recent papers for each of the benchmarks, we consistently demonstrate superior quantitative performance. Our code and models will be publicly released to enable reproducibility of the results.
摘要:本地化页面元素/对象,如表,图,等式等等是在从文档图像中提取的信息的主要步骤。我们提出了在文档中检测表本新颖的端至端的可训练深网络,(CDEC-净)。所提出的网络掩码由R-CNN的多级延伸与具有用于检测表在较高IOU阈值检测精度高规模变变形的卷积双骨干。我们经验上的所有评价CDEC-网公开发布的标准数据集 - ICDAR-2013,ICDAR-2017,ICDAR-2019,UNLV,旱獭,PubLayNet和TableBank - 用大量的实验。我们的解决方案有三个重要的属性:所有流行的标准数据集(I)的单训练模型CDEC,净‡执行好; (ⅱ),我们报告跨越多个优异的性能,包括IOU的更高,阈值; (iii)通过以下的最近的论文为每个基准的相同的协议,我们一致地证明了优异的定量性能。我们的代码和模型将公开发布使结果的重复性。
28. A Critical Analysis of Patch Similarity Based Image Denoising Algorithms [PDF] 返回目录
Varuna De Silva
Abstract: Image denoising is a classical signal processing problem that has received significant interest within the image processing community during the past two decades. Most of the algorithms for image denoising have focused on the paradigm of non-local similarity, where similar image blocks in the neighborhood are collected to build a basis for reconstruction. Through rigorous experimentation, this paper reviews multiple aspects of image denoising algorithm development based on non-local similarity. Firstly, the concept of non-local similarity as a foundational quality that exists in natural images has not received adequate attention. Secondly, the image denoising algorithms that are developed are a combination of multiple building blocks, making comparison among them a tedious task. Finally, most of the work surrounding image denoising presents performance results based on the Peak Signal-to-Noise Ratio (PSNR) between a denoised image and a reference image (which is perturbed with Additive White Gaussian Noise). This paper starts with a statistical analysis of non-local similarity and its effectiveness under various noise levels, followed by a theoretical comparison of different state-of-the-art image denoising algorithms. Finally, we argue for a methodological overhaul to incorporate no-reference image quality measures and unprocessed (raw) images during performance evaluation of image denoising algorithms.
摘要:图像去噪的是,在过去的二十年中已经接收到的图像处理社区内显著感兴趣的经典信号处理问题。大部分的算法对图像进行去噪一直专注于非本地相似性,其中类似于邻里图像块,收集建立重建的基础的范例。通过严格的实验,本文评论基于非局部相似图像降噪算法开发的多个方面。首先,非本地相似的是存在于自然的图像质量基本概念还没有得到足够的重视。其次,所开发的图像降噪算法多个积木的组合,使得其中一个比较繁琐的任务。最后,大多数围绕去噪图像和参考图像(这是噪声干扰与加性高斯白)之间的基于峰值信噪比(PSNR)图像去噪呈现性能结果的工作。本文对非本地相似性和其在各种噪声水平的有效性进行了统计分析,随后的不同国家的最先进的图像去噪算法的理论比较开始。最后,我们认为对于一个方法论检修的图像去噪算法性能评估中引入无参考图像质量的措施和未处理的图像(RAW)。
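The non-local-similarity paradigm the paper critiques is exemplified by classic non-local means: every pixel is replaced by a weighted average of pixels whose surrounding patches resemble its own, and quality is reported as PSNR. A minimal (slow, educational) sketch:

```python
import numpy as np

def psnr(clean, noisy):
    mse = np.mean((clean - noisy) ** 2)
    return 10 * np.log10(1.0 / mse)   # assumes images scaled to [0, 1]

def nl_means(img, patch=3, search=5, h=0.1):
    """Minimal non-local means: each pixel becomes a weighted average of
    pixels in a search window, weighted by patch similarity."""
    p, s = patch // 2, search // 2
    pad = np.pad(img, p + s, mode="reflect")
    out = np.zeros_like(img)
    H, W = img.shape
    for y in range(H):
        for x in range(W):
            cy, cx = y + p + s, x + p + s
            ref = pad[cy - p:cy + p + 1, cx - p:cx + p + 1]
            w_sum, acc = 0.0, 0.0
            for dy in range(-s, s + 1):
                for dx in range(-s, s + 1):
                    ny, nx = cy + dy, cx + dx
                    cand = pad[ny - p:ny + p + 1, nx - p:nx + p + 1]
                    w = np.exp(-np.mean((ref - cand) ** 2) / h ** 2)
                    w_sum += w
                    acc += w * pad[ny, nx]
            out[y, x] = acc / w_sum
    return out

rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 32), (32, 1))
noisy = np.clip(clean + rng.normal(0, 0.1, clean.shape), 0, 1)
print(psnr(clean, noisy), psnr(clean, nl_means(noisy)))
```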
29. Data Science for Motion and Time Analysis with Modern Motion Sensor Data [PDF] 返回目录
Chiwoo Park, Sang Do Noh, Anuj Srivastava
Abstract: Motion-and-time analysis has been a popular research topic in operations research, especially for analyzing work performance in manufacturing and service operations. It is regaining attention as a continuous improvement tool for lean manufacturing and smart factories. This paper develops a framework for data-driven analysis of work motions and studies their correlations to work speeds or execution rates, using data collected from modern motion sensors. Past analyses largely relied on manual steps involving time-consuming stop-watching and video-taping, followed by manual data analysis. While modern sensing devices have automated the collection of motion data, the motion analytics that transform the new data into knowledge are largely underdeveloped. Unsolved technical questions include: how can motion and time information be extracted from motion sensor data, how can work motions and execution rates be statistically modeled and compared, and what are the statistical correlations of motions to the rates? In this paper, we develop a novel mathematical framework for motion and time analysis with motion sensor data, by defining new mathematical representation spaces of human motions and execution rates and by developing statistical tools on these new spaces. This methodological research is demonstrated using five use cases applied to manufacturing motion data.
摘要:运动和时间的分析一直是一个热门的研究课题在业务研究,尤其是在制造业和服务运营分析工作业绩。这是恢复关注,持续改进工具,精益制造和智能工厂。本文建立了一个框架,工作运动和学习他们的工作速度或执行率相关的数据驱动的分析,采用现代运动传感器收集的数据。在过去的分析在很大程度上依赖于涉及手动步骤耗时停止观看视频,录音,其次是人工数据分析。虽然现代传感装置有自动运动数据的收集,即变换新的数据转化为知识的运动分析在很大程度上是不发达。未解的技术问题包括:如何从运动传感器数据,如何工作,运动和执行率进行统计建模和比较,有哪些运动对利率的统计相关性被提取的运动和时间信息?在本文中,我们开发了运动与运动传感器数据实时分析一种新的数学框架,通过定义人的动作和执行率的新的数学表达的空间,并通过开发这些新的空间统计工具。这种方法学研究是利用应用于制造运动数据5用例证明。
30. Learning Target Domain Specific Classifier for Partial Domain Adaptation [PDF] 返回目录
Chuan-Xian Ren, Pengfei Ge, Peiyi Yang, Shuicheng Yan
Abstract: Unsupervised domain adaptation (UDA) aims at reducing the distribution discrepancy when transferring knowledge from a labeled source domain to an unlabeled target domain. Previous UDA methods assume that the source and target domains share an identical label space, which is unrealistic in practice since the label information of the target domain is agnostic. This paper focuses on a more realistic UDA scenario, i.e., partial domain adaptation (PDA), where the target label space is subsumed in the source label space. In the PDA scenario, the source outliers that are absent in the target domain may be wrongly matched to the target domain (technically named negative transfer), leading to performance degradation of UDA methods. This paper proposes a novel Target Domain Specific Classifier Learning-based Domain Adaptation (TSCDA) method. TSCDA presents a soft-weighted maximum mean discrepancy criterion to partially align feature distributions and alleviate negative transfer. Also, it learns a target-specific classifier for the target domain with pseudo-labels and multiple auxiliary classifiers, to further address classifier shift. A module named Peers Assisted Learning is used to minimize the prediction difference between multiple target-specific classifiers, which makes the classifiers more discriminative for the target domain. Extensive experiments conducted on three PDA benchmark datasets show that TSCDA outperforms other state-of-the-art methods by a large margin, e.g. 4% and 5.6% on average on Office-31 and Office-Home, respectively.
摘要:无监督域适配〜(UDA)的目的是从一个标记的源域知识转移时的未标记的靶标域减少的分布差异。上一页UDA方法假设在源和目标域共享相同的标签的空间,这是不现实的在实践中,因为目标域的标签信息是不可知的。本文重点研究更现实的UDA的场景,即,局部域适配(PDA),其中目标标签空间归入到源标签空间。在PDA的情况下,源的异常值是在目标域中不存在可能被错误地匹配到目标域(技术上称为负转移),导致的UDA方法性能下降。本文提出了一种新颖的基于学习的目标域的具体分类域适配(TSCDA)方法。 TSCDA呈现软称重的最大平均差异准则部分地对齐功能分布和减轻负迁移。此外,它学习用于利用伪标签和多个辅助分类器,以进一步地址分类器移目标域中的目标特异性分类器。模块命名对等体辅助学习用于最小化多个目标特异性分类之间的预测差,这使得分类器更判别为目标域。在三个PDA基准数据集进行了广泛的实验表明,TSCDA优于其他国家的最先进的方法,用一大截,例如$ 4 \ $%$和5.6 \%$平均上的Office-31和Office的主页,分别。
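A soft-weighted MMD of the kind described can be sketched as the standard kernel MMD with per-sample source weights, so that source classes unlikely to exist in the target contribute less to the alignment. The RBF kernel and the way weights are produced below are assumptions for illustration:

```python
import torch

def rbf(x, y, sigma=1.0):
    """RBF kernel matrix between two sample sets."""
    return torch.exp(-torch.cdist(x, y) ** 2 / (2 * sigma ** 2))

def weighted_mmd(xs, ws, xt, sigma=1.0):
    """Squared MMD between a weighted source sample and the target sample.
    ws down-weights source points from classes unlikely to appear in the
    target (the PDA outlier classes); ws sums to 1, target is uniform."""
    wt = torch.full((xt.size(0),), 1.0 / xt.size(0))
    k_ss = ws @ rbf(xs, xs, sigma) @ ws
    k_tt = wt @ rbf(xt, xt, sigma) @ wt
    k_st = ws @ rbf(xs, xt, sigma) @ wt
    return k_ss + k_tt - 2 * k_st

xs, xt = torch.randn(50, 16), torch.randn(40, 16) + 0.5
ws = torch.softmax(torch.randn(50), dim=0)   # stand-in for class-derived weights
print(weighted_mmd(xs, ws, xt).item())
```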
31. Image Colorization: A Survey and Dataset [PDF] 返回目录
Saeed Anwar, Muhammad Tahir, Chongyi Li, Ajmal Mian, Fahad Shahbaz Khan, Abdul Wahab Muzaffar
Abstract: Image colorization is an essential branch of image processing and computer vision for colorizing images and videos. Recently, deep learning techniques have progressed notably for image colorization. This article presents a comprehensive survey of recent state-of-the-art colorization using deep learning algorithms, describing their fundamental block architectures in terms of skip connections, inputs, etc., as well as optimizers, loss functions, training protocols, and training data. Generally, we can roughly categorize the existing colorization techniques into seven classes. Besides, we also discuss some additional essential issues, such as benchmark datasets and evaluation metrics. We also introduce a new dataset specific to colorization and perform an experimental evaluation of the publicly available methods. In the last section, we discuss the limitations, possible solutions, and future research directions of the rapidly evolving topic of deep image colorization that the community should further address. Dataset and codes for evaluation will be publicly available at this https URL
摘要:图像着色是一个重要的图像处理和计算机视觉分支上色图片和视频。近日,深学习技术特别是改进了的图像着色。本文介绍了采用深学习算法,描述了跳跃连接,输入\等方面的基本块结构的国家的最先进的近期彩色的全面调查,以及优化,损失函数,培训协议,培训数据\等一般情况下,我们可以大致归类现有的着色技术分为七类。此外,我们还提供一些附加的基本问题,比如标准数据集和评价指标。我们还引入了一个新的数据集特有的着色和执行的公开可用方法的实验评估。在最后一节中,我们讨论的局限性,可能的解决方案,而深图像着色,社会应进一步地址的快速发展课题今后的研究方向。数据集和评估代码将被公布于该HTTPS URL
32. LULC Segmentation of RGB Satellite Image Using FCN-8 [PDF] 返回目录
Abu Bakar Siddik Nayem, Anis Sarker, Ovi Paul, Amin Ali, Md. Ashraful Amin, AKM Mahbubur Rahman
Abstract: This work presents the use of a Fully Convolutional Network (FCN-8) for semantic segmentation of high-resolution RGB earth surface satellite images into land use land cover (LULC) categories. Specifically, we propose a non-overlapping grid-based approach to train a Fully Convolutional Network (FCN-8) with VGG-16 weights to segment satellite images into four classes (forest, built-up, farmland, and water). The FCN-8 semantically projects the discriminating features learned by the encoder at lower resolution onto the pixel space at higher resolution to get a dense classification. We experimented with the proposed system on the Gaofen-2 image dataset, which contains 150 images of over 60 different cities in China. For comparison, we used available ground-truth along with images segmented using a widely used commercial GIS software called eCognition. With the proposed non-overlapping grid-based approach, FCN-8 obtains significantly improved performance over the eCognition software. Our model achieves an average accuracy of 91.0% and an average Intersection over Union (IoU) of 0.84. In contrast, eCognition's average accuracy is 74.0% and its IoU is 0.60. This paper also reports a detailed analysis of errors occurring at the LULC boundary.
摘要:这项工作提出了采用全卷积网络(FCN-8)的高分辨率RGB地表SATEL-精简版图像语义分割为土地利用土地覆盖(LULC)两类。 Specically,我们提出了一个非重叠基于网格的方式,以全康沃-lutional网络(FCN-8)与VGG-16的权重段卫星IM-年龄培养成四(森林,建成,农田和水)班。该FCN-8语义项目的较低分辨率编码器在更高的分辨率得到密集CLASSI阳离子获悉到像素的空间辨别功能。我们实验所提出的系统Gaofen-2图像数据集,包含在中国超过60个迪erent个城市的150个图像。为了便于比较,我们使用的可用地面实况以及图像使用广泛使用的commeriial GIS软件称为eCogni-重刑分割。与所提出的非重叠基于网格的方法,FCN-8取得显着地提高性能,比eCognition软洁具。我们的模型达到了联盟(IOU)的0.84的91.0%,平均跨段平均准确。相比之下,eCognitions平均ACCU-情趣为74.0%和借条是0.60。本文还报告错误的详细分析,发生在LULC边界。
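The non-overlapping grid-based approach amounts to cropping the large scene into fixed-size tiles, segmenting each tile independently, and stitching the label maps back together. A sketch with a stand-in model (the tile size and toy predictor are illustrative):

```python
import numpy as np

def segment_in_tiles(image, model, tile=224):
    """Run a segmentation model over non-overlapping tiles and stitch the
    per-tile label maps back into a full-size map. Assumes H and W are
    multiples of the tile size (pad the image beforehand otherwise)."""
    H, W, _ = image.shape
    labels = np.zeros((H, W), dtype=np.int64)
    for y in range(0, H, tile):
        for x in range(0, W, tile):
            patch = image[y:y + tile, x:x + tile]
            labels[y:y + tile, x:x + tile] = model(patch)
    return labels

# Toy stand-in for an FCN-8 that predicts one of 4 LULC classes per pixel.
toy_model = lambda patch: np.full(patch.shape[:2], int(patch.mean() > 0.5))
img = np.random.rand(448, 448, 3)
print(segment_in_tiles(img, toy_model, tile=224).shape)  # (448, 448)
```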
33. Interactive Annotation of 3D Object Geometry using 2D Scribbles [PDF] 返回目录
Tianchang Shen, Jun Gao, Amlan Kar, Sanja Fidler
Abstract: Inferring detailed 3D geometry of the scene is crucial for robotics applications, simulation, and 3D content creation. However, such information is hard to obtain, and thus very few datasets support it. In this paper, we propose an interactive framework for annotating 3D object geometry from both point cloud data and RGB imagery. The key idea behind our approach is to exploit strong priors that humans have about the 3D world in order to interactively annotate complete 3D shapes. Our framework targets naive users without artistic or graphics expertise. We introduce two simple-to-use interaction modules. First, we make an automatic guess of the 3D shape and allow the user to provide feedback about large errors by drawing scribbles in desired 2D views. Next, we aim to correct minor errors, in which users drag and drop mesh vertices, assisted by a neural interactive module implemented as a Graph Convolutional Network. Experimentally, we show that only a few user interactions are needed to produce good quality 3D shapes on popular benchmarks such as ShapeNet, Pix3D and ScanNet. We implement our framework as a web service and conduct a user study, where we show that user annotated data using our method effectively facilitates real-world learning tasks. Web service: this http URL.
摘要:推断详细的3D几何形状的场景是机器人应用,仿真和3D内容制作的关键。然而,这样的信息很难获得,因而很少有数据集支持。在本文中,我们提出了从两个点云数据和RGB图像标注3D物体的几何形状的互动框架。我们的做法背后的关键思想是利用强大的先验概率,人类为了交互注释完整的3D形状有关于3D世界。我们的目标框架天真的用户无需艺术或图形的专业知识。我们介绍两个简单易用的交互模块。首先,我们做的3D形状的自动猜测,并允许用户通过所需的二维图纸的意见,以涂鸦提供有关大的误差反馈。接下来,我们的目标是正确的小错误,用户在其中拖放网格顶点,通过以图形卷积网络实现的神经交互模块协助。实验,我们表明,只需要少数的用户交互,以生产优质的3D形状上流行的基准,如ShapeNet,Pix3D和ScanNet。我们实现我们作为Web服务框架,并进行用户研究,我们用我们的方法,有效地促进了现实世界的学习任务显示,用户注释的数据。 Web服务:该HTTP URL。
34. Video Interpolation via Generalized Deformable Convolution [PDF] 返回目录
Zhihao Shi, Xiaohong Liu, Kangdi Shi, Linhui Dai, Jun Chen
Abstract: Video interpolation aims at increasing the frame rate of a given video by synthesizing intermediate frames. The existing video interpolation methods can be roughly divided into two categories: flow-based methods and kernel-based methods. The performance of flow-based methods is often jeopardized by the inaccuracy of flow map estimation due to oversimplified motion models while that of kernel-based methods tends to be constrained by the rigidity of kernel shape. To address these performance-limiting issues, a novel mechanism named generalized deformable convolution is proposed, which can effectively learn motion information in a data-driven manner and freely select sampling points in space-time. We further develop a new video interpolation method based on this mechanism. Our extensive experiments demonstrate that the new method performs favorably against the state-of-the-art, especially when dealing with complex motions.
摘要:视频内插的目的在于通过合成中间帧增加一个给定的视频的帧速率。现有的视频内插方法可以粗略地分为两类:基于流的方法和基于核的方法。的基于流的方法的性能通常通过流图估计的不准确性危及由于过于简单的运动模型,而的基于内核的方法趋向于由内核形状的刚性的约束。为了解决这些性能限制的问题,命名为建议的广义变形卷积一种新的机制,其可以有效地学习在数据驱动的方式运动信息和自由地在空间 - 时间选择的采样点。我们进一步开发了基于此机制一种新的视频插值方法。我们大量的实验表明,新方法,对国家的最先进的表现很好,尤其是在复杂的运动的时候。
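The mechanism underlying (generalized) deformable sampling can be sketched as a small convolution that predicts a 2D offset per output pixel, followed by bilinear resampling at the offset locations via F.grid_sample. This is a simplification of the paper's operator, shown only to convey the idea; all names are hypothetical:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OffsetSampler(nn.Module):
    """Predict a 2D offset per output pixel and bilinearly sample the
    input there - the basic mechanism behind deformable sampling."""
    def __init__(self, ch):
        super().__init__()
        self.offset = nn.Conv2d(ch, 2, 3, padding=1)   # (dx, dy) per pixel

    def forward(self, x):
        B, _, H, W = x.shape
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
        base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
        off = self.offset(x).permute(0, 2, 3, 1)       # (B, H, W, 2)
        grid = base + 0.1 * torch.tanh(off)            # keep offsets small
        return F.grid_sample(x, grid, align_corners=True)

x = torch.rand(1, 8, 32, 32)
y = OffsetSampler(8)(x)   # (1, 8, 32, 32), resampled at learned offsets
```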
35. Probabilistic Deep Learning for Instance Segmentation [PDF] 返回目录
Josef Lorenz Rumberger, Lisa Mais, Dagmar Kainmueller
Abstract: Probabilistic convolutional neural networks, which predict distributions of predictions instead of point estimates, led to recent advances in many areas of computer vision, from image reconstruction to semantic segmentation. Besides state-of-the-art benchmark results, these networks made it possible to quantify local uncertainties in the predictions. These were used in active learning frameworks to target the labeling efforts of specialist annotators or to assess the quality of a prediction in a safety-critical environment. However, these methods have so far been used infrequently for instance segmentation problems. We seek to close this gap by proposing a generic method to obtain model-inherent uncertainty estimates within proposal-free instance segmentation models. Furthermore, we analyze the quality of the uncertainty estimates with a metric adapted from semantic segmentation. We evaluate our method on the BBBC010 C. elegans dataset, where it yields competitive performance while also predicting uncertainty estimates that carry information about object-level inaccuracies like false splits and false merges. We perform a simulation to show the potential use of such uncertainty estimates in guided proofreading.
摘要:概率卷积神经网络,预测预报,而不是点估计的分布,导致了最近的进展计算机视觉的许多领域,从图像重建语义分割。除了艺术的基准测试结果的状态,这些网络使人们有可能在预测量化本地的不确定性。这些在主动学习框架,用来针对专家注释的贴标工作或评估预测的质量安全关键环境。然而,例如分割问题,这些方法不经常使用至今。我们寻求通过提出一个通用的方法来免费的建议,例如分割模型中获得模型固有的不确定性估计来弥补这一差距。此外,我们分析了不确定性估算与改编自语义分割度量的质量。我们评估我们在BBBC010 C. \线虫的数据集,它产生有竞争力的性能,同时还预测的不确定性估计,携带对象级的不准确性像假的拆分和合并虚假信息的方法。我们进行的模拟显示的潜在用途中引导校对这种不确定性估算。
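The paper proposes model-inherent uncertainty; a widely used stand-in for obtaining per-pixel uncertainty from an ordinary network is Monte Carlo dropout: keep dropout active at test time, run several stochastic passes, and read the per-pixel standard deviation as uncertainty. A sketch of that substitute technique (not the paper's method):

```python
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Dropout2d(0.5),                       # stays active at test time
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
)

def mc_dropout_predict(net, x, T=20):
    """T stochastic passes; mean = prediction, std = per-pixel uncertainty."""
    net.train()   # quick hack to keep dropout on; freeze batchnorm if present
    with torch.no_grad():
        samples = torch.stack([net(x) for _ in range(T)])
    return samples.mean(0), samples.std(0)

x = torch.rand(1, 1, 64, 64)
mean, unc = mc_dropout_predict(net, x)
print(unc.max().item())   # high values flag likely false splits/merges
```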
36. DiverseNet: When One Right Answer is not Enough [PDF] 返回目录
Michael Firman, Neill D. F. Campbell, Lourdes Agapito, Gabriel J. Brostow
Abstract: Many structured prediction tasks in machine vision have a collection of acceptable answers, instead of one definitive ground truth answer. Segmentation of images, for example, is subject to human labeling bias. Similarly, there are multiple possible pixel values that could plausibly complete occluded image regions. State-of-the-art supervised learning methods are typically optimized to make a single test-time prediction for each query, failing to find other modes in the output space. Existing methods that allow for sampling often sacrifice speed or accuracy. We introduce a simple method for training a neural network, which enables diverse structured predictions to be made for each test-time query. For a single input, we learn to predict a range of possible answers. We compare favorably to methods that seek diversity through an ensemble of networks. Such stochastic multiple choice learning faces mode collapse, where one or more ensemble members fail to receive any training signal. Our best-performing solution can be deployed for various tasks, and just involves small modifications to the existing single-mode architecture, loss function, and training regime. We demonstrate that our method results in quantitative improvements across three challenging tasks: 2D image completion, 3D volume estimation, and flow prediction.
摘要:在机器视觉许多结构预测任务具有可接受的答案的集合,而不是一个确切的地面真实的答案。图像的分割,例如,受到人工标记的偏差。同样,也有多种可能的像素值,可以振振有词地完成遮挡的图像区域。国家的本领域监督学习方法通常优化,以使一个单一的测试时间预测为每个查询,没有找到在输出空间中的其他模式。现有的方法是允许采样往往牺牲速度和准确性。我们介绍了训练神经网络,使每个测试时间查询,进行多样化的结构预测的简单方法。对于单输入,我们学会了预测范围可能的答案。我们相媲美,以通过网络的集合寻求多样性的方法。这种随机选择题学习面模式崩溃,其中一个或多个集合成员无法接收到任何训练信号。我们表现最好的解决方案可以部署完成各种任务,只是涉及小修改现有的单一模式架构,损失函数和培训制度。我们证明了我们在跨越三个挑战性的任务数量上的改进方法的结果:2D图像完成,3D体积估计和流预测。
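The multiple-choice-learning baseline the abstract contrasts with can be written in a few lines. The sketch below (ours, with hypothetical names) shows the winner-takes-all objective, where only the best of K hypotheses receives gradient; this is exactly the mechanism prone to the mode collapse described above. `loss_fn` is assumed to return a per-sample loss.

```python
import torch

def best_of_k_loss(preds, target, loss_fn):
    """Winner-takes-all objective from stochastic multiple choice
    learning: only the hypothesis closest to the target is trained.
    preds: (K, N, ...) stacked hypotheses; target: (N, ...);
    loss_fn must return a per-sample loss of shape (N,)."""
    losses = torch.stack([loss_fn(p, target) for p in preds])  # (K, N)
    return losses.min(dim=0).values.mean()
```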
37. Measure Anatomical Thickness from Cardiac MRI with Deep Neural Networks [PDF] 返回目录
Qiaoying Huang, Eric Z. Chen, Hanchao Yu, Yimo Guo, Terrence Chen, Dimitris Metaxas, Shanhui Sun
Abstract: Accurate estimation of shape thickness from medical images is crucial in clinical applications. For example, the thickness of the myocardium is one of the keys to cardiac disease diagnosis. While mathematical models are available to obtain accurate dense thickness estimation, they suffer from heavy computational overhead due to iterative solvers. To this end, we propose novel methods for dense thickness estimation, including a fast solver that estimates thickness from binary annular shapes and an end-to-end network that estimates thickness directly from raw cardiac images. We test the proposed models on three cardiac datasets and one synthetic dataset, achieving impressive results and generalizability on all. Thickness estimation is performed without iterative solvers or manual correction, and is 100 times faster than the mathematical model. We also analyze thickness patterns of different cardiac pathologies with a standard clinical model, and the results demonstrate the potential clinical value of our method for thickness-based cardiac disease diagnosis.
摘要:从医学图像形状厚度的准确的估计是在临床应用中是至关重要的。例如,心肌的厚度是关键的心脏疾病诊断中的一个。虽然数学模型可用来获得精确的密集厚度估计,他们遭受沉重的计算开销由于迭代求解。为此,我们提出了致密厚度估计的新方法,包括快速求解器,估计二进制环形形状的厚度和端部到终端的网络,直接的估计厚度为原始心脏images.We测试提出的模型在三个数据集的心脏和一个合成的数据集,实现对所有骄人的成绩和普遍性。无需迭代求解器或手动校正,这是比数学模型快100倍进行厚度估计。我们还分析了在不同的心脏疾病的厚度模式与标准的临床模型,结果证明我们的方法的厚度基于心脏疾病诊断的潜在的临床应用价值。
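For intuition about the quantity being regressed, a crude thickness estimate for a binary annular shape can be read off a distance transform. This toy baseline is ours and is unrelated to the paper's fast solver or network:

```python
import numpy as np
from scipy import ndimage
from skimage.morphology import skeletonize

def rough_thickness(mask):
    """Crude thickness estimate for a binary annular mask: twice the
    distance to the background, sampled along the medial axis."""
    edt = ndimage.distance_transform_edt(mask)
    skel = skeletonize(mask.astype(bool))
    return 2.0 * edt[skel]  # per-skeleton-pixel thickness, in pixels

# Example: a ring with outer radius 40 and inner radius 30.
yy, xx = np.mgrid[:128, :128]
r = np.hypot(yy - 64, xx - 64)
ring = (r > 30) & (r < 40)
print(rough_thickness(ring).mean())  # approximately 10
```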
38. GAN Slimming: All-in-One GAN Compression by A Unified Optimization Framework [PDF] 返回目录
Haotao Wang, Shupeng Gui, Haichuan Yang, Ji Liu, Zhangyang Wang
Abstract: Generative adversarial networks (GANs) have gained increasing popularity in various computer vision applications, and have recently started to be deployed on resource-constrained mobile devices. Similar to other deep models, state-of-the-art GANs suffer from high parameter complexities. That has recently motivated the exploration of compressing GANs (usually generators). Compared to the vast literature and prevailing success in compressing deep classifiers, the study of GAN compression remains in its infancy, so far leveraging individual compression techniques instead of more sophisticated combinations. We observe that due to the notorious instability of training GANs, heuristically stacking different compression techniques yields unsatisfactory results. To this end, we propose the first unified optimization framework combining multiple compression means for GAN compression, dubbed GAN Slimming (GS). GS seamlessly integrates three mainstream compression techniques: model distillation, channel pruning and quantization, together with the GAN minimax objective, into one unified optimization form that can be efficiently optimized from end to end. Without bells and whistles, GS largely outperforms existing options in compressing image-to-image translation GANs. Specifically, we apply GS to compress CartoonGAN, a state-of-the-art style transfer network, by up to 47 times, with minimal visual quality degradation. Code and pre-trained models can be found at this https URL.
摘要:创成对抗网络(甘斯)已在多种计算机视觉应用获得了日益普及,最近开始将部署到资源受限的移动设备。对于其他型号深一样,国家的最先进的甘斯高参数的复杂性受到影响。最近已经促使压缩甘斯(通常是发电机)的探索。相较于大量文献和通行的压缩深分类,甘压缩遗体处于起步阶段的研究,到目前为止,利用个人的压缩技术,而不是更复杂的组合成功。我们认为,由于观察到训练甘斯的臭名昭著的不稳定,试探性地堆叠不同的压缩技术将产生令人满意的结果。为此,我们提出了第一个统一的优化框架组合多个压缩手段GAN压缩,被称为GAN瘦身(GS)。 GS无缝集成了三个主流压缩技术:模型蒸馏,信道修剪和量化,与GAN极小目标,到一个统一的优化形式,即可以从一端至另一端被有效地优化在一起。没有花俏,GS在压缩图像到影像转换甘斯在很大程度上优于现有选项。具体而言,我们应用GS压缩CartoonGAN,一个国家的最先进式传输网络,通过高达47次,用最小的视觉质量下降。代码和预先训练模型可以在此HTTPS URL中找到。
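A schematic of how the three compression means can share one objective (our sketch, omitting the GAN minimax term; `student` and `teacher` are assumed generator modules): distillation supervises the student, an L1 penalty on BatchNorm scales enables channel pruning as in network slimming, and straight-through fake quantization can be applied to the student's weights during training.

```python
import torch
import torch.nn as nn

def fake_quantize(w, bits=8):
    """Uniform fake quantization with a straight-through gradient:
    forward pass uses the quantized value, backward pass is identity."""
    scale = w.detach().abs().max() / (2 ** (bits - 1) - 1)
    q = (w / scale).round().clamp(-(2 ** (bits - 1)),
                                  2 ** (bits - 1) - 1) * scale
    return w + (q - w).detach()

def slimming_loss(student, teacher, x, rho=1e-4):
    """Schematic joint compression objective: distill the teacher's
    output while pushing BatchNorm scale factors toward zero so that
    near-zero channels can later be pruned."""
    distill = (student(x) - teacher(x).detach()).abs().mean()
    sparsity = sum(m.weight.abs().sum() for m in student.modules()
                   if isinstance(m, nn.BatchNorm2d))
    return distill + rho * sparsity
```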
39. Efficient Blind-Spot Neural Network Architecture for Image Denoising [PDF] 返回目录
David Honzátko, Siavash A. Bigdeli, Engin Türetken, L. Andrea Dunbar
Abstract: Image denoising is an essential tool in computational photography. Standard denoising techniques, which use deep neural networks at their core, require pairs of clean and noisy images for their training. If we do not possess the clean samples, we can use blind-spot neural network architectures, which estimate the pixel value based on the neighbouring pixels only. These networks thus allow training on noisy images directly, as they avoid trivial solutions by design. Nowadays, the blind spot is mostly achieved using shifted convolutions or serialization. We propose a novel fully convolutional network architecture that uses dilations to achieve the blind-spot property. Our network improves the performance over prior work and achieves state-of-the-art results on established datasets.
摘要:图像去噪是计算摄影的必备工具。标准降噪技术,这在他们的核心使用深层神经网络,需要对清洁和图像噪点其培训。如果我们不具备清洁的样品,我们可以用盲点神经网络结构,它估计仅基于相邻像素的像素值。这些网络因此允许直接培训嘈杂的图像,他们通过设计避免琐碎的解决方案。如今,盲点是使用移动回旋或序列号大多实现。我们提出了一个新颖的全卷积网络架构,使用胀缩,实现盲点财产。我们的网络提高了对现有工作和建立数据集实现国家的最先进成果的表现。
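One way to realize a dilation-based blind spot (a sketch of the idea, not necessarily the paper's exact architecture): mask the center tap of every 3x3 convolution and use dilation 2 in all layers after the first. The first masked layer contributes an offset with at least one odd coordinate, while all later offsets are even, so the receptive field of an output pixel can never return to that pixel, provided no residual connections bypass the stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterMaskedConv2d(nn.Conv2d):
    """3x3 convolution whose center tap is zeroed out."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__(in_ch, out_ch, 3,
                         padding=dilation, dilation=dilation)
        mask = torch.ones(3, 3)
        mask[1, 1] = 0
        self.register_buffer("mask", mask)

    def forward(self, x):
        return F.conv2d(x, self.weight * self.mask, self.bias,
                        self.stride, self.padding, self.dilation)

net = nn.Sequential(
    CenterMaskedConv2d(1, 32, dilation=1),   # odd offsets only
    nn.ReLU(),
    CenterMaskedConv2d(32, 32, dilation=2),  # even, non-zero offsets
    nn.ReLU(),
    CenterMaskedConv2d(32, 32, dilation=2),
    nn.ReLU(),
    nn.Conv2d(32, 1, 1),                     # 1x1 head keeps the blind spot
)
```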
40. New Directions in Distributed Deep Learning: Bringing the Network at Forefront of IoT Design [PDF] 返回目录
Kartikeya Bhardwaj, Wei Chen, Radu Marculescu
Abstract: In this paper, we first highlight three major challenges to large-scale adoption of deep learning at the edge: (i) Hardware-constrained IoT devices, (ii) Data security and privacy in the IoT era, and (iii) Lack of network-aware deep learning algorithms for distributed inference across multiple IoT devices. We then provide a unified view targeting three research directions that naturally emerge from the above challenges: (1) Federated learning for training deep networks, (2) Data-independent deployment of learning algorithms, and (3) Communication-aware distributed inference. We believe that the above research directions need a network-centric approach to enable the edge intelligence and, therefore, fully exploit the true potential of IoT.
摘要:在本文中,我们首先强调大规模采用深度学习的三大挑战在边缘:(一)硬件受限的物联网设备,(二)数据安全和隐私在物联网时代,以及(iii)缺乏网络感知深度学习算法,跨多个物联网设备分布推断。然后,我们提供瞄准三个研究方向自然从上面的挑战不断出现一个统一的观点:(1)联合学习训练网络深,(2)学习算法的数据独立部署,以及(3)通讯感知分布推断。我们认为,上述研究方向需要以网络为中心的方法,让边缘智力,因此,充分利用物联网的真正潜力。
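The first direction, federated learning, is typically instantiated with federated averaging. A minimal sketch of one aggregation round follows (ours, assuming the per-client weights, e.g. data fractions, sum to one):

```python
import copy
import torch

def federated_average(global_model, client_states, weights):
    """One FedAvg round: weighted average of client state_dicts.
    client_states: list of state_dicts returned by local training;
    weights: one fraction per client, assumed to sum to 1."""
    avg = copy.deepcopy(client_states[0])
    for key in avg:
        avg[key] = sum(w * s[key].float()
                       for w, s in zip(weights, client_states))
    global_model.load_state_dict(avg)
    return global_model
```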
41. Variational Image Restoration Network [PDF] 返回目录
Zongsheng Yue, Hongwei Yong, Qian Zhao, Lei Zhang, Deyu Meng
Abstract: Deep neural networks (DNNs) have achieved significant success in image restoration tasks by directly learning a powerful non-linear mapping from corrupted images to their latent clean ones. However, there still exist two major limitations for these deep learning (DL)-based methods. Firstly, the noises contained in real corrupted images are very complex, usually neglected and largely under-estimated in most current methods. Secondly, existing DL methods are mostly trained on one pre-assumed degradation process for all of the training image pairs, such as the widely used bicubic downsampling assumption in the image super-resolution task, inevitably leading to poor generalization performance when the true degradation does not match the assumed one. To address these issues, we propose a unified generative model for image restoration, which elaborately configures the degradation process from the latent clean image to the observed corrupted one. Specifically, different from most current methods, a pixel-wise non-i.i.d. Gaussian distribution, which offers more flexibility, is adopted in our method to fit complex real noise. Furthermore, the method is built on the general image degradation process, making it capable of adapting to diverse degradations under one single model. In addition, we design a variational inference algorithm to learn all parameters involved in the proposed model with an explicit form of the objective loss. Specifically, beyond traditional variational methodology, two DNNs are employed to parameterize the posterior distributions, one to infer the distribution of the latent clean image, and another to infer the distribution of the image noise. Extensive experiments demonstrate the superiority of the proposed method on three classical image restoration tasks, including image denoising, image super-resolution and JPEG image deblocking.
摘要:深层神经网络(DNNs)已在图像恢复任务通过直接学习从损坏的图像,其潜在干净的强大的非线性映射取得显著的成功。然而,仍存在对这些深层次的学习(DL)为基础的方法两个主要限制。首先,包含在实际损坏的图像噪声是非常复杂的,通常被忽视并在很大程度上低估的最新方法。其次,现有的DL方法主要培训了一个预先假定降解过程为所有的训练图像对,比如在图像超分辨率任务中广泛使用的双三次采样的假设,不可避免地导致泛化性能较差,当真正的降解做的不匹配这样的假定之一。为了解决这些问题,我们提出了图像复原,其精心配置从潜干净的形象与观察到的一个损坏的降解过程统一生成模型。具体地,从最当前的方法的不同,像素明智非i.i.d。高斯分布,具有更大的灵活性是,在我们的方法,以适应复杂的真实声音被采用。此外,该方法是建立在一般的图像劣化处理,使得它能够在一个单一的模型适应多样的劣化。此外,我们设计了一个变推理算法学习参与该模型与目标损失的明确形式的所有参数。具体而言,超越了传统的变分方法中,两个DNNs采用参数化后验分布,一个推断潜清洁图像的分布,和另一个来推断图像噪声的分布。大量的实验证明了该方法的三个经典图像恢复的任务,包括图像降噪,图像超分辨率和JPEG图像分块的优势。
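The heart of the pixel-wise non-i.i.d. noise model is a per-pixel Gaussian likelihood. A schematic of its negative log-likelihood (our simplification; the KL terms of the full variational objective are omitted):

```python
import torch

def pixelwise_gaussian_nll(clean_pred, logvar_pred, observed):
    """Negative log-likelihood under a pixel-wise (non-i.i.d.) Gaussian
    noise model: every pixel has its own predicted mean and variance,
    so heavily corrupted pixels can be explained by larger variance."""
    inv_var = torch.exp(-logvar_pred)
    return 0.5 * (logvar_pred
                  + (observed - clean_pred) ** 2 * inv_var).mean()
```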
42. Channel-Directed Gradients for Optimization of Convolutional Neural Networks [PDF] 返回目录
Dong Lao, Peihao Zhu, Peter Wonka, Ganesh Sundaramoorthi
Abstract: We introduce optimization methods for convolutional neural networks that can be used to improve existing gradient-based optimization in terms of generalization error. The method requires only simple processing of existing stochastic gradients, can be used in conjunction with any optimizer, and has only a linear overhead (in the number of parameters) compared to computation of the stochastic gradient. The method works by computing the gradient of the loss function with respect to output-channel directed re-weighted L2 or Sobolev metrics, which has the effect of smoothing components of the gradient across a certain direction of the parameter tensor. We show that defining the gradients along the output channel direction leads to a performance boost, while other directions can be detrimental. We present the continuum theory of such gradients, its discretization, and application to deep networks. Experiments on benchmark datasets, several networks and baseline optimizers show that optimizers can be improved in generalization error by simply computing the stochastic gradient with respect to output-channel directed metrics.
摘要:介绍了可用于提高泛化误差方面现有的基于梯度的优化卷积神经网络优化方法。该方法仅需要现有随机梯度,可以结合任何优化器结合使用的简单的处理,并与所述随机梯度的计算仅具有一个线性的开销(以参数的数量)。该方法的工作原理是相对于输出信道定向重新加权L2或索伯列夫度量,其具有横跨参数张量以一定的方向平滑梯度的分量的影响计算所述损失函数的梯度。我们表明,限定沿输出通道方向引线梯度以提高性能,而其他方向可能是有害的。我们提出这样的梯度的连续体理论,它的离散化,并应用到深层网络。上基准数据集,若干网络和基线优化实验表明,优化器可在泛化误差通过简单地相对于输出信道定向度量计算所述随机梯度得到改善。
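To convey the flavor of the idea, a crude stand-in (not the paper's metric-induced gradient) is to smooth each weight gradient along the output-channel axis before the optimizer step. Here one explicit diffusion step with circular boundary handling approximates the re-weighting; the paper instead derives the exact gradient under the Sobolev metric.

```python
import torch

@torch.no_grad()
def smooth_grads_along_out_channels(model, lam=0.1):
    """Crude approximation of a channel-directed (Sobolev-like)
    gradient: one explicit diffusion step along dim 0 (the output
    channels) of every at-least-2D parameter's gradient."""
    for p in model.parameters():
        if p.grad is None or p.grad.dim() < 2:
            continue  # skip biases and 1-D parameters
        g = p.grad
        lap = torch.roll(g, 1, dims=0) + torch.roll(g, -1, dims=0) - 2 * g
        p.grad = g + lam * lap  # then call optimizer.step() as usual
```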
43. Exploit Camera Raw Data for Video Super-Resolution via Hidden Markov Model Inference [PDF] 返回目录
Xiaohong Liu, Kangdi Shi, Zhe Wang, Jun Chen
Abstract: To the best of our knowledge, the existing deep-learning-based Video Super-Resolution (VSR) methods exclusively make use of videos produced by the Image Signal Processor (ISP) of the camera system as inputs. Such methods are 1) inherently suboptimal due to information loss incurred by non-invertible operations in ISP, and 2) inconsistent with the real imaging pipeline where VSR in fact serves as a pre-processing unit of ISP. To address this issue, we propose a new VSR method that can directly exploit camera sensor data, accompanied by a carefully built Raw Video Dataset (RawVD) for training, validation, and testing. This method consists of a Successive Deep Inference (SDI) module and a reconstruction module, among others. The SDI module is designed according to the architectural principle suggested by a canonical decomposition result for Hidden Markov Model (HMM) inference; it estimates the target high-resolution frame by repeatedly performing pairwise feature fusion using deformable convolutions. The reconstruction module, built with elaborately designed Attention-based Residual Dense Blocks (ARDBs), serves the purpose of 1) refining the fused feature and 2) learning the color information needed to generate a spatial-specific transformation for accurate color correction. Extensive experiments demonstrate that owing to the informativeness of the camera raw data, the effectiveness of the network architecture, and the separation of super-resolution and color correction processes, the proposed method achieves superior VSR results compared to the state-of-the-art and can be adapted to any specific camera-ISP.
摘要:据我们所知,现有的深学习基于视频超分辨率(VSR)方法专门利用相机系统输入的图像信号处理器(ISP)制作的视频。此类方法是:1)本质上不理想的是由于在ISP招致不可逆的操作的信息损失,以及2)与真实成像管道,其中VSR实际上作为ISP的前处理单元不一致。为了解决这个问题,我们提出了一种新的方法,VSR可以直接利用相机传感器的数据,培训,验证和测试配以精心打造的原始视频数据集(RawVD)。该方法包括逐次深推理(SDI)模块以及重建模块,等等。该SDI模块根据由为隐马尔可夫模型(HMM)推论规范分解结果所建议的体系结构原理设计;它估计通过重复执行利用变形卷积成对特征融合的对象高分辨率帧。重建模块,以精心设计的基于注意力残余密集块(ARDBs)建成,供应的1炼稠特征和2)的学习来生成用于精确的色彩校正的特定空间变换所需的颜色信息的目的)。大量的实验证明,由于相机的原始数据,该网络体系结构的有效性,和超分辨率和色彩校正处理的所述分离的信息量相比,国家的最先进的所提出的方法实现了优异的VSR结果并且可以适用于任何特定的摄像机的ISP。
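The successive pairwise fusion performed by the SDI module can be sketched as a simple reduction over per-frame features. This is structural pseudocode in Python, not the authors' implementation; `fuse` stands in for the paper's learned deformable-convolution fusion:

```python
def successive_pairwise_fusion(features, fuse):
    """Repeatedly fuse neighbouring frame features pairwise until a
    single feature map remains, mirroring the forward-backward style
    of HMM inference over the frame sequence.
    features: list of per-frame feature tensors; fuse: a learned
    two-input module."""
    while len(features) > 1:
        features = [fuse(features[i], features[i + 1])
                    for i in range(len(features) - 1)]
    return features[0]
```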
44. Robust Pancreatic Ductal Adenocarcinoma Segmentation with Multi-Institutional Multi-Phase Partially-Annotated CT Scans [PDF] 返回目录
Ling Zhang, Yu Shi, Jiawen Yao, Yun Bian, Kai Cao, Dakai Jin, Jing Xiao, Le Lu
Abstract: Accurate and automated tumor segmentation is highly desired since it has the great potential to increase the efficiency and reproducibility of computing more complete tumor measurements and imaging biomarkers, comparing to (often partial) human measurements. This is probably the only viable means to enable the large-scale clinical oncology patient studies that utilize medical imaging. Deep learning approaches have shown robust segmentation performances for certain types of tumors, e.g., brain tumors in MRI imaging, when a training dataset with plenty of pixel-level fully-annotated tumor images is available. However, more often than not, we are facing the challenge that only (very) limited annotations are feasible to acquire, especially for hard tumors. Pancreatic ductal adenocarcinoma (PDAC) segmentation is one of the most challenging tumor segmentation tasks, yet critically important for clinical needs. Previous work on PDAC segmentation is limited to moderate amounts of annotated patient images (n < 300) from venous or venous+arterial phase CT scans. Based on a new self-learning framework, we propose to train the PDAC segmentation model using a much larger quantity of patients (n ~ 1,000), with a mix of annotated and un-annotated venous or multi-phase CT images. Pseudo annotations are generated by combining two teacher models with different PDAC segmentation specialties on unannotated images, and can be further refined by a teaching assistant model that identifies associated vessels around the pancreas. The student model is trained on both manual and pseudo annotated multi-phase images. Experimental results show that our proposed method provides an absolute improvement of 6.3% Dice score over the strong baseline of nnUNet trained on annotated images, achieving performance (Dice = 0.71) similar to the inter-observer variability between radiologists.
摘要:准确和自动的肿瘤分割是高度期望的,因为它具有很大的潜力,以增加计算更完整的肿瘤测量和成像生物标志物,比较(通常局部的)人的测量的效率和可重复性。这可能是唯一可行的手段,使利用医疗成像大规模临床肿瘤病人的研究。深学习方法已经显示了某些类型的肿瘤,例如,在MRI成像脑肿瘤健壮分割性能,当用大量的像素级完全注释的肿瘤图像的训练数据集可用。然而,超过的时候,我们正面临着只有(非常)有限的注解是为了获得可行的,尤其是对硬肿瘤的挑战。胰腺导管腺癌(PDAC)分割是最具挑战性的肿瘤分割的任务之一,但临床需求至关重要。上PDAC分割以前的工作仅限于适量的注释的患者图像的(n<300),从静脉或静脉+动脉期CT扫描。基于新的自学习框架,我们提出使用患者大得多的量来训练PDAC分割模型(n〜1,000),带注释和未注释的静脉或多相CT图像的混合。伪注释是由两个老师机型上未注释的图像不同PDAC分割的特色相结合产生的,并且可以通过助教模型进一步细化该识别相关胰腺周围血管。学生模型被训练在手动和伪注释多相图像。实验结果表明,该方法提供了6.3%骰子得分的绝对提高了nnUNet培训了注解图像的强大的基线,实现了性能(骰子=0.71)相似,放射科医师之间的观察者间的变异。
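The teacher-combination step can be sketched as follows (a schematic only: probabilities from two teachers are averaged and thresholded into pseudo masks; the vessel-aware teaching-assistant refinement is omitted, and all names are hypothetical):

```python
import torch

@torch.no_grad()
def make_pseudo_labels(teacher_a, teacher_b, unlabeled_ct, threshold=0.5):
    """Average the foreground probabilities of two teachers with
    different specialties and threshold them into a binary pseudo
    mask for student training. Assumes both teachers output logits."""
    p = 0.5 * (teacher_a(unlabeled_ct).sigmoid()
               + teacher_b(unlabeled_ct).sigmoid())
    return (p > threshold).float()
```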
45. OpenBot: Turning Smartphones into Robots [PDF] 返回目录
Matthias Müller, Vladlen Koltun
Abstract: Current robots are either expensive or make significant compromises on sensory richness, computational power, and communication capabilities. We propose to leverage smartphones to equip robots with extensive sensor suites, powerful computational abilities, state-of-the-art communication channels, and access to a thriving software ecosystem. We design a small electric vehicle that costs $50 and serves as a robot body for standard Android smartphones. We develop a software stack that allows smartphones to use this body for mobile operation and demonstrate that the system is sufficiently powerful to support advanced robotics workloads such as person following and real-time autonomous navigation in unstructured environments. Controlled experiments demonstrate that the presented approach is robust across different smartphones and robot bodies. A video of our work is available at this https URL
摘要:目前机器人要么昂贵,使感官上的丰富性,计算能力和通信能力显著妥协。我们建议利用智能手机具有广泛的传感器套件,强大的计算能力,国家的最先进的沟通渠道,并获得了蓬勃发展的软件生态系统装备的机器人。我们设计了一个小型电动车,售价$ 50和作为机器人本体为标准的Android智能手机。我们开发了一个软件堆栈,使智能手机用户使用这个身体的移动操作,并证明该系统是足够强大,以支持先进的机器人工作负载,如在非结构化环境下的人,实时自主导航。对照实验结果表明,所提出的方法是在不同的智能手机和机器人身体健壮。我们工作的一个视频可在此HTTPS URL
Note: the Chinese abstracts are machine translations. The cover image is a word cloud of the paper titles.