Abstracts

1. Instance-aware Image Colorization
Jheng-Wei Su, Hung-Kuo Chu, Jia-Bin Huang
Abstract: Image colorization is inherently an ill-posed problem with multi-modal uncertainty. Previous methods leverage deep neural networks to map input grayscale images to plausible color outputs directly. Although these learning-based methods have shown impressive performance, they usually fail on input images that contain multiple objects. The leading cause is that existing models perform learning and colorization on the entire image. In the absence of a clear figure-ground separation, these models cannot effectively locate and learn meaningful object-level semantics. In this paper, we propose a method for achieving instance-aware colorization. Our network architecture leverages an off-the-shelf object detector to obtain cropped object images and uses an instance colorization network to extract object-level features. We use a similar network to extract the full-image features and apply a fusion module to fuse object-level and image-level features to predict the final colors. Both colorization networks and fusion modules are learned from a large-scale dataset. Experimental results show that our work outperforms existing methods on different quality metrics and achieves state-of-the-art performance on image colorization.

2. Hierarchical Multi-Scale Attention for Semantic Segmentation
Andrew Tao, Karan Sapra, Bryan Catanzaro
Abstract: Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes, and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes, which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.1 IOU test).
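The difference between fixed combination rules (averaging, max pooling) and an attention-weighted combination can be illustrated with a minimal sketch. It assumes two scales and a per-pixel weight alpha in [0, 1]; the function names, the flat list-of-pixels layout, and the numbers are hypothetical and are not taken from the paper, whose exact formulation is not given in the abstract.

```python
def combine_average(preds):
    """Per-pixel mean over scales; preds is a list of equal-length score lists."""
    n = len(preds)
    return [sum(vals) / n for vals in zip(*preds)]


def combine_max(preds):
    """Per-pixel maximum over scales."""
    return [max(vals) for vals in zip(*preds)]


def combine_attention(pred_low, pred_high, alpha):
    """Per-pixel blend: alpha weights the low-scale score, (1 - alpha) the high-scale score.

    In a trained model alpha would be predicted by the network; here it is given by hand.
    """
    return [a * lo + (1.0 - a) * hi for lo, hi, a in zip(pred_low, pred_high, alpha)]


# Made-up class scores for three pixels at two scales (illustration only).
pred_low = [0.9, 0.2, 0.6]
pred_high = [0.5, 0.8, 0.6]
alpha = [1.0, 0.0, 0.5]  # trust low scale, trust high scale, trust both equally

print(combine_average([pred_low, pred_high]))
print(combine_max([pred_low, pred_high]))
print(combine_attention(pred_low, pred_high, alpha))
```

With a learned alpha, each pixel can favor whichever scale handles its failure mode better, which a fixed average cannot do.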

3. Manifold Alignment for Semantically Aligned Style Transfer
Jing Huo, Shiyin Jin, Wenbin Li, Jing Wu, Yu-Kun Lai, Yinghuan Shi, Yang Gao
Abstract: Given a content image and a style image, the goal of style transfer is to synthesize an output image by transferring the target style to the content image. Currently, most methods address the problem with global style transfer, assuming styles can be represented by global statistics, such as Gram matrices or covariance matrices. In this paper, we make a different assumption: that local semantically aligned (or similar) regions between the content and style images should share similar style patterns. Based on this assumption, content features and style features are seen as two sets of manifolds, and a manifold alignment based style transfer (MAST) method is proposed. MAST is a subspace learning method which learns a common subspace of the content and style features. In the common subspace, content and style features with larger feature similarity or the same semantic meaning are forced to be close. Orthogonality constraints are imposed on the learned projection matrices so that the mapping can be bidirectional, which allows us to project the content features into the common subspace, and then into the original style space. By using a pre-trained decoder, promising stylized images are obtained. The method is further extended to allow users to specify corresponding semantic regions between content and style images or to use semantic segmentation maps as guidance. Extensive experiments show the proposed MAST achieves appealing results in style transfer.

4. Dense Semantic 3D Map Based Long-Term Visual Localization with Hybrid Features
Tianxin Shi, Hainan Cui, Zhuo Song, Shuhan Shen
Abstract: Visual localization plays an important role in many applications. However, due to large appearance variations such as season and illumination changes, as well as weather and day-night variations, robust long-term visual localization remains a big challenge. In this paper, we present a novel visual localization method using hybrid handcrafted and learned features with a dense semantic 3D map. Hybrid features help us to make full use of their strengths in different imaging conditions, and the dense semantic map provides us with reliable and complete geometric and semantic information for constructing sufficient 2D-3D matching pairs with semantic consistency scores. In our pipeline, we retrieve and score each candidate database image through the semantic consistency between the dense model and the query image. Then the semantic consistency score is used as a soft constraint in the weighted RANSAC-based PnP pose solver. Experimental results on long-term visual localization benchmarks demonstrate the effectiveness of our method compared with state-of-the-art methods.

5. Robust Ensemble Model Training via Random Layer Sampling Against Adversarial Attack
Hakmin Lee, Hong Joo Lee, Seong Tae Kim, Yong Man Ro
Abstract: Deep neural networks have achieved substantial success in several computer vision areas, but they are vulnerable: they are often fooled by adversarial examples that are not recognized by humans. This is an important issue for security or medical applications. In this paper, we propose an ensemble model training framework with random layer sampling to improve the robustness of deep neural networks. In the proposed training framework, we generate various sampled models through the random layer sampling and update the weights of the sampled model. After the ensemble models are trained, they can hide the gradient efficiently and avoid gradient-based attacks by means of the random layer sampling method. To evaluate our proposed method, comprehensive and comparative experiments have been conducted on three datasets. Experimental results show that the proposed method improves the adversarial robustness.

6. Efficient Ensemble Model Generation for Uncertainty Estimation with Bayesian Approximation in Segmentation
Hong Joo Lee, Seong Tae Kim, Nassir Navab, Yong Man Ro
Abstract: Recent studies have shown that ensemble approaches can not only improve accuracy but also estimate model uncertainty in deep learning. However, the number of parameters grows with the number of ensemble models needed for better prediction and uncertainty estimation. To address this issue, a generic and efficient segmentation framework to construct ensemble segmentation models is devised in this paper. In the proposed method, ensemble models can be efficiently generated by using the stochastic layer selection method. The ensemble models are trained to estimate uncertainty through Bayesian approximation. Moreover, to overcome its limitation from uncertain instances, we devise a new pixel-wise uncertainty loss, which improves the predictive performance. To evaluate our method, comprehensive and comparative experiments have been conducted on two datasets. Experimental results show that the proposed method can provide useful uncertainty information by Bayesian approximation with the efficient ensemble model generation and improve the predictive performance.
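One common way to turn an ensemble into a per-pixel uncertainty estimate is to average the members' class probabilities and take the entropy of the mean. The sketch below assumes that measure; the abstract does not say which uncertainty measure the authors use, and the function names and probabilities are hypothetical.

```python
import math


def mean_prediction(member_probs):
    """Average class probabilities over ensemble members for a single pixel."""
    n = len(member_probs)
    return [sum(p) / n for p in zip(*member_probs)]


def predictive_entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Made-up two-class probabilities from three ensemble members (illustration only).
agree = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
disagree = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]

print(predictive_entropy(mean_prediction(agree)))     # low: members agree
print(predictive_entropy(mean_prediction(disagree)))  # high: members disagree
```

Pixels where the members disagree receive a higher entropy, so they can be flagged as uncertain.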

7. Revisiting Role of Autoencoders in Adversarial Settings
Byeong Cheon Kim, Jung Uk Kim, Hakmin Lee, Yong Man Ro
Abstract: To combat adversarial attacks, autoencoder structures are widely used to perform denoising, which is regarded as gradient masking. In this paper, we revisit the role of autoencoders in adversarial settings. Through comprehensive experimental results and analysis, this paper presents the inherent property of adversarial robustness in autoencoders. We also found that autoencoders may use robust features that cause inherent adversarial robustness. We believe that our discovery of the adversarial robustness of autoencoders can provide clues for future research and applications in adversarial defense.

8. A Nearest Neighbor Network to Extract Digital Terrain Models from 3D Point Clouds
Mohammed Yousefhussien, David J. Kelbe, Carl Salvaggio
Abstract: When 3D point clouds from overhead sensors are used as input to remote sensing data exploitation pipelines, a large amount of effort is devoted to data preparation. Among the multiple stages of the preprocessing chain, estimating the Digital Terrain Model (DTM) is considered to be of high importance; however, this remains a challenge, especially for raw point clouds derived from optical imagery. Current algorithms either estimate the ground points using a set of geometrical rules that require tuning multiple parameters and human interaction, or cast the problem as a binary classification machine learning task where ground and non-ground classes are found. In contrast, here we present an algorithm that directly operates on 3D point clouds and estimates the underlying DTM for the scene using an end-to-end approach without the need to classify points into ground and non-ground cover types. Our model learns neighborhood information and seamlessly integrates this with point-wise and block-wise global features. We validate our model using the ISPRS 3D Semantic Labeling Contest LiDAR data, as well as three scenes generated using dense stereo matching, representative of high-rise buildings, lower urban structures, and a dense old-city residential area. We compare our findings with two widely used software packages for DTM extraction, namely ENVI and LAStools. Our preliminary results show that the proposed method is able to achieve an overall Mean Absolute Error of 11.5% compared to 29% and 16% for ENVI and LAStools.
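For reference, Mean Absolute Error is the average of the absolute differences between predicted and reference values. The sketch below computes it on made-up terrain elevations; the abstract reports the error as a percentage but does not state how it is normalized, so no normalization is attempted here. The function name and sample values are hypothetical.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Made-up elevations in metres (illustration only, not data from the paper).
predicted_dtm = [10.0, 12.5, 11.0]
reference_dtm = [10.5, 12.0, 11.0]

print(mean_absolute_error(predicted_dtm, reference_dtm))
```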

9. Unsupervised anomaly localization using VAE and beta-VAE
Leixin Zhou, Wenxiang Deng, Xiaodong Wu
Abstract: Variational Auto-Encoders (VAEs) have shown great potential in the unsupervised learning of data distributions. A VAE trained on normal images is expected to be able to reconstruct only normal images, allowing the localization of anomalous pixels in an image by manipulating information within the VAE ELBO loss. The ELBO consists of a KL divergence loss (image-wise) and a reconstruction loss (pixel-wise). It is natural and straightforward to use the latter as the predictor. However, a local anomaly added to a normal image can usually deteriorate the whole reconstructed image, making segmentation based only on naive pixel errors inaccurate. Energy-based projection was proposed to increase the reconstruction accuracy of normal regions/pixels, and it achieved state-of-the-art localization accuracy on simple natural images. Other possible predictors are the gradients of the ELBO and of its components with respect to each pixel. Previous work claimed that the KL gradient is a robust predictor. In this paper, we argue that energy-based projection is not as useful in medical imaging as on natural images. Moreover, we observe that the robustness of the KL gradient predictor depends entirely on the setting of the VAE and the dataset. We also explored the effect of the weight of the KL loss within beta-VAE and of predictor ensembles in anomaly localization.
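The two loss components mentioned in the abstract can be written out in the standard beta-VAE form. The notation below is an assumption, since the abstract defines none: encoder q_phi(z|x), decoder p_theta(x|z), prior p(z), and KL weight beta.

```latex
\mathcal{L}_{\beta}(x)
  = \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction loss (pixel-wise)}}
  + \beta \, \underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{KL divergence loss (image-wise)}}
```

Setting beta = 1 recovers the negative ELBO of a plain VAE. The gradient-based predictors discussed in the abstract correspond to differentiating either term, or their sum, with respect to the input pixels of x.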

10. Wish You Were Here: Context-Aware Human Generation
Oran Gafni, Lior Wolf
Abstract: We present a novel method for inserting objects, specifically humans, into existing images, such that they blend in a photorealistic manner, while respecting the semantic context of the scene. Our method involves three subnetworks: the first generates the semantic map of the new person, given the pose of the other persons in the scene and an optional bounding box specification. The second network renders the pixels of the novel person and its blending mask, based on specifications in the form of multiple appearance components. A third network refines the generated face in order to match that of the target person. Our experiments present convincing high-resolution outputs in this novel and challenging application domain. In addition, the three networks are evaluated individually, demonstrating, for example, state-of-the-art results in pose transfer benchmarks.

11. A Neural Network Looks at Leonardo's(?) Salvator Mundi
Steven J. Frank, Andrea M. Frank
Abstract: We use convolutional neural networks (CNNs) to analyze authorship questions surrounding the works of Leonardo da Vinci -- in particular, Salvator Mundi, the world's most expensive painting and among the most controversial. Trained on the works of an artist under study and visually comparable works of other artists, our system can identify likely forgeries and shed light on attribution controversies. Leonardo's few extant paintings test the limits of our system and require corroborative techniques of testing and analysis.

12. Bridging the gap between Natural and Medical Images through Deep Colorization
Lia Morra, Luca Piano, Fabrizio Lamberti, Tatiana Tommasi
Abstract: Deep learning has thrived by training on large-scale datasets. However, in many applications, such as medical image diagnosis, obtaining massive amounts of data is still prohibitive due to privacy, lack of acquisition homogeneity and annotation cost. In this scenario, transfer learning from natural image collections is a standard practice that attempts to tackle shape, texture and color discrepancies all at once through pretrained model fine-tuning. In this work, we propose to disentangle those challenges and design a dedicated network module that focuses on color adaptation. We combine learning the color module from scratch with transfer learning of different classification backbones, obtaining an end-to-end, easy-to-train architecture for diagnostic image recognition on X-ray images. Extensive experiments showed that our approach is particularly efficient in cases of data scarcity and provides a new path for further transferring the learned color information across multiple medical datasets.

13. MBA-RainGAN: Multi-branch Attention Generative Adversarial Network for Mixture of Rain Removal from Single Images
Yiyang Shen, Yidan Feng, Sen Deng, Dong Liang, Jing Qin, Haoran Xie, Mingqiang Wei
Abstract: Rain severely hampers the visibility of scene objects when images are captured through glass on heavily rainy days. We observe three intriguing phenomena: 1) rain is a mixture of raindrops, rain streaks and rainy haze; 2) the depth from the camera determines the degree of object visibility, where nearby and faraway objects are visually blocked by rain streaks and rainy haze, respectively; and 3) raindrops on the glass randomly affect the object visibility of the whole image space. We are the first to consider that the overall visibility of objects is determined by the mixture of rain (MOR). However, existing solutions and established datasets lack full consideration of the MOR. In this work, we first formulate a new rain imaging model; then, we enrich the popular RainCityscapes dataset by adding raindrops, naming the result RainCityscapes++. Furthermore, we propose a multi-branch attention generative adversarial network (termed MBA-RainGAN) to fully remove the MOR. The experiment shows clear visual and numerical improvements of our approach over state-of-the-art methods on RainCityscapes++. The code and dataset will be available.

14. Region Proposals for Saliency Map Refinement for Weakly-supervised Disease Localisation and Classification
Renato Hermoza, Gabriel Maicas, Jacinto C. Nascimento, Gustavo Carneiro
Abstract: The deployment of automated systems to diagnose diseases from medical images is challenged by the requirement to localise the diagnosed diseases to justify or explain the classification decision. This requirement is hard to fulfil because most of the training sets available to develop these systems only contain global annotations, making the localisation of diseases a weakly supervised approach. The main methods designed for weakly supervised disease classification and localisation rely on saliency or attention maps that are not specifically trained for localisation, or on region proposals that can not be refined to produce accurate detections. In this paper, we introduce a new model that combines region proposal and saliency detection to overcome both limitations for weakly supervised disease classification and localisation. Using the ChestX-ray14 data set, we show that our proposed model establishes the new state-of-the-art for weakly-supervised disease diagnosis and localisation.

15. Cross-Domain Few-Shot Learning with Meta Fine-Tuning
John Cai, Sheng Mei Shen
Abstract: In this paper, we tackle the new Cross-Domain Few-Shot Learning benchmark proposed by the CVPR 2020 Challenge. To this end, we build upon state-of-the-art methods in domain adaptation and few-shot learning to create a system that can be trained to perform both tasks end-to-end. Inspired by the need to create models designed to be fine-tuned, we explore the integration of transfer-learning (fine-tuning) with meta-learning algorithms, to train a network that has specific layers that are designed to be adapted at a later fine-tuning stage. To do so, we modify the episodic training process to include a first-order MAML-based meta-learning algorithm, and use a Graph Neural Network model as the subsequent meta-learning module. We find that our proposed method helps to boost accuracy significantly, especially when coupled with data augmentation. In our final results, we combine the novel method with the baseline method in a simple ensemble, and achieve an average accuracy of 73.78% on the benchmark. This is a 6.51% improvement over existing SOTA methods that were trained solely on miniImagenet.

16. HyperSTAR: Task-Aware Hyperparameters for Deep Networks
Gaurav Mittal, Chang Liu, Nikolaos Karianakis, Victor Fragoso, Mei Chen, Yun Fu
Abstract: While deep neural networks excel in solving visual recognition tasks, they require significant effort to find hyperparameters that make them work optimally. Hyperparameter Optimization (HPO) approaches have automated the process of finding good hyperparameters but they do not adapt to a given task (task-agnostic), making them computationally inefficient. To reduce HPO time, we present HyperSTAR (System for Task Aware Hyperparameter Recommendation), a task-aware method to warm-start HPO for deep neural networks. HyperSTAR ranks and recommends hyperparameters by predicting their performance conditioned on a joint dataset-hyperparameter space. It learns a dataset (task) representation along with the performance predictor directly from raw images in an end-to-end fashion. The recommendations, when integrated with an existing HPO method, make it task-aware and significantly reduce the time to achieve optimal performance. We conduct extensive experiments on 10 publicly available large-scale image classification datasets over two different network architectures, validating that HyperSTAR evaluates 50% fewer configurations to achieve the best performance compared to existing methods. We further demonstrate that HyperSTAR makes Hyperband (HB) task-aware, achieving the optimal accuracy in just 25% of the budget required by both vanilla HB and Bayesian Optimized HB (BOHB).

17. Unsupervised segmentation via semantic-apparent feature fusion
Xi Li, Huimin Ma, Hongbing Ma, Yidong Wang
Abstract: Foreground segmentation is an essential task in the field of image understanding. Under unsupervised conditions, different images and instances always have variable expressions, which makes it difficult to achieve stable segmentation performance based on fixed rules or a single type of feature. In order to solve this problem, the research proposes an unsupervised foreground segmentation method based on semantic-apparent feature fusion (SAFF). Here, we found that semantic features respond accurately to key regions of the foreground object, while apparent features (represented by saliency and edge) provide richer detailed expression. To combine the advantages of the two types of features, an encoding method for unary region features and binary context features is established, which realizes a comprehensive description of the two types of expressions. Then, a method for adaptive parameter learning is put forward to calculate the most suitable feature weights and generate a foreground confidence score map. Furthermore, a segmentation network is used to learn common foreground features from different instances. By fusing semantic and apparent features, as well as cascading the modules of intra-image adaptive feature weight learning and inter-image common feature learning, the research achieves performance that significantly exceeds baselines on the PASCAL VOC 2012 dataset.

18. Powering One-shot Topological NAS with Stabilized Share-parameter Proxy
Ronghao Guo, Chen Lin, Chuming Li, Keyu Tian, Ming Sun, Lu Sheng, Junjie Yan
Abstract: One-shot NAS methods have attracted much interest from the research community due to their remarkable training efficiency and capacity to discover high-performance models. However, the search spaces of previous one-shot based works usually relied on hand-crafted design and lacked flexibility in the network topology. In this work, we try to enhance one-shot NAS by exploring high-performing network architectures in our large-scale Topology Augmented Search Space (i.e., over 3.4*10^10 different topological structures). Specifically, the difficulties of architecture search in such a complex space are eliminated by the proposed stabilized share-parameter proxy, which employs Stochastic Gradient Langevin Dynamics to enable fast shared parameter sampling, so as to achieve stabilized measurement of architecture performance even in a search space with complex topological structures. The proposed method, namely Stabilized Topological Neural Architecture Search (ST-NAS), achieves state-of-the-art performance under a Multiply-Adds (MAdds) constraint on ImageNet. Our lite model ST-NAS-A achieves 76.4% top-1 accuracy with only 326M MAdds. Our moderate model ST-NAS-B achieves 77.9% top-1 accuracy with just 503M MAdds. Both of our models offer superior performance in comparison to other concurrent works on one-shot NAS.

19. Few-shot Compositional Font Generation with Dual Memory
Junbum Cha, Sanghyuk Chun, Gayoung Lee, Bado Lee, Seonghyeon Kim, Hwalsuk Lee
Abstract: Generating a new font library is a very labor-intensive and time-consuming job for glyph-rich scripts. Despite the remarkable success of existing font generation methods, they have significant drawbacks; they require a large number of reference images to generate a new font set, or they fail to capture detailed styles with a few samples. In this paper, we focus on compositional scripts, a widely used letter system in the world, where each glyph can be decomposed into several components. By utilizing the compositionality of compositional scripts, we propose a novel font generation framework, named Dual Memory-augmented Font Generation Network (DM-Font), which enables us to generate a high-quality font library with only a few samples. We employ memory components and global-context awareness in the generator to take advantage of the compositionality. In the experiments on Korean-handwriting fonts and Thai-printing fonts, we observe that our method generates samples of significantly better quality with faithful stylization compared to the state-of-the-art generation methods, both quantitatively and qualitatively.

20. Panoptic Instance Segmentation on Pigs
Johannes Brünger, Maria Gentz, Imke Traulsen, Reinhard Koch
Abstract: The behavioural research of pigs can be greatly simplified if automatic recognition systems are used. Systems based on computer vision in particular have the advantage that they allow an evaluation without affecting the normal behaviour of the animals. In recent years, methods based on deep learning have been introduced and have shown pleasingly good results. Object and keypoint detectors in particular have been used to detect the individual animals. Despite good results, bounding boxes and sparse keypoints do not trace the contours of the animals, resulting in a lot of information being lost. Therefore this work follows the relatively new definition of a panoptic segmentation and aims at the pixel-accurate segmentation of the individual pigs. For this, a framework consisting of a neural network for semantic segmentation, different network heads and postprocessing methods is presented. With the resulting instance segmentation masks, further information such as the size or weight of the animals could be estimated. The method is tested on a specially created data set with 1000 hand-labeled images and achieves detection rates of around 95% (F1 Score) despite disturbances such as occlusions and dirty lenses.
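The F1 score used as the detection rate above is the harmonic mean of precision and recall. A minimal sketch follows; the detection counts are invented for illustration and are not the paper's data, and the function name is hypothetical.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from detection counts."""
    detected = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Made-up counts: 95 pigs found correctly, 5 spurious detections, 5 pigs missed.
print(f1_score(95, 5, 5))
```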

21. GroupFace: Learning Latent Groups and Constructing Group-based Representations for Face Recognition
Yonghyun Kim, Wonpyo Park, Myung-Cheol Roh, Jongju Shin
Abstract: In the field of face recognition, a model learns to distinguish millions of face images with fewer dimensional embedding features, and such vast information may not be properly encoded in the conventional model with a single branch. We propose a novel face-recognition-specialized architecture called GroupFace that utilizes multiple group-aware representations, simultaneously, to improve the quality of the embedding feature. The proposed method provides self-distributed labels that balance the number of samples belonging to each group without additional human annotations, and learns the group-aware representations that can narrow down the search space of the target identity. We prove the effectiveness of the proposed method by showing extensive ablation studies and visualizations. All the components of the proposed method can be trained in an end-to-end manner with a marginal increase of computational complexity. Finally, the proposed method achieves the state-of-the-art results with significant improvements in 1:1 face verification and 1:N face identification tasks on the following public datasets: LFW, YTF, CALFW, CPLFW, CFP, AgeDB-30, MegaFace, IJB-B and IJB-C.

22. AOWS: Adaptive and optimal network width search with latency constraints
Maxim Berman, Leonid Pishchulin, Ning Xu, Matthew B. Blaschko, Gerard Medioni
Abstract: Neural architecture search (NAS) approaches aim at automatically finding novel CNN architectures that fit computational constraints while maintaining a good performance on the target platform. We introduce a novel efficient one-shot NAS approach to optimally search for channel numbers, given latency constraints on a specific hardware. We first show that we can use a black-box approach to estimate a realistic latency model for a specific inference platform, without the need for low-level access to the inference computation. Then, we design a pairwise MRF to score any channel configuration and use dynamic programming to efficiently decode the best performing configuration, yielding an optimal solution for the network width search. Finally, we propose an adaptive channel configuration sampling scheme to gradually specialize the training phase to the target computational constraints. Experiments on ImageNet classification show that our approach can find networks fitting the resource constraints on different target platforms while improving accuracy over the state-of-the-art efficient networks.
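The decoding step mentioned in the AOWS abstract (a pairwise MRF over channel choices, decoded with dynamic programming) can be illustrated with a generic Viterbi-style minimization on a chain. This is only a sketch under stated assumptions: the function name `decode_chain` and the toy unary and pairwise costs are hypothetical, and the abstract does not specify how the paper builds its actual scores.

```python
import numpy as np

def decode_chain(unary, pairwise):
    """Minimize sum_i unary[i][c_i] + sum_i pairwise[i][c_i, c_{i+1}] on a chain.

    unary:    list of 1-D arrays, one per layer (cost of each channel option)
    pairwise: list of 2-D arrays; pairwise[i][a, b] is the cost of picking
              option a at layer i together with option b at layer i + 1
    Returns (best_total_cost, list of chosen option indices).
    """
    cost = np.asarray(unary[0], dtype=float)
    back = []  # back[i - 1][b] = best predecessor option for option b at layer i
    for i in range(1, len(unary)):
        pair = np.asarray(pairwise[i - 1], dtype=float)
        node = np.asarray(unary[i], dtype=float)
        # total[a, b]: best cost ending in option a at layer i - 1, then b at layer i
        total = cost[:, None] + pair + node[None, :]
        back.append(total.argmin(axis=0))
        cost = total.min(axis=0)
    # Backtrack from the cheapest final option.
    path = [int(cost.argmin())]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    path.reverse()
    return float(cost.min()), path

# Toy problem (hypothetical costs): 3 layers, 2 channel options per layer.
unary = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 0.5])]
pairwise = [np.array([[0.0, 1.0], [5.0, 0.0]]),
            np.array([[0.0, 4.0], [1.0, 0.0]])]
best_cost, best_path = decode_chain(unary, pairwise)
print(best_cost, best_path)
```

On a chain, this dynamic program is exact and costs time linear in the number of layers, which is what makes exhaustive enumeration of configurations unnecessary.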

23. Deep learning-based automated image segmentation for concrete petrographic analysis
Yu Song, Zilong Huang, Chuanyue Shen, Honghui Shi, David A Lange
Abstract: The standard petrography test method for measuring air voids in concrete (ASTM C457) requires a meticulous and long examination of sample phase composition under a stereomicroscope. The high expertise and specialized equipment discourage this test for routine concrete quality control. Though the task can be alleviated with the aid of color-based image segmentation, additional surface color treatment is required. Recently, deep learning algorithms using convolutional neural networks (CNN) have achieved unprecedented segmentation performance on image testing benchmarks. In this study, we investigated the feasibility of using CNN to conduct concrete segmentation without the use of color treatment. The CNN demonstrated a strong potential to process a wide range of concretes, including those not involved in model training. The experimental results showed that CNN outperforms the color-based segmentation by a considerable margin, and has comparable accuracy to human experts. Furthermore, the segmentation time is reduced to mere seconds.

24. Gender Slopes: Counterfactual Fairness for Computer Vision Models by Attribute Manipulation
Jungseock Joo, Kimmo Kärkkäinen
Abstract: Automated computer vision systems have been applied in many domains including security, law enforcement, and personal devices, but recent reports suggest that these systems may produce biased results, discriminating against people in certain demographic groups. Diagnosing and understanding the underlying true causes of model biases, however, are challenging tasks because modern computer vision systems rely on complex black-box models whose behaviors are hard to decode. We propose to use an encoder-decoder network developed for image attribute manipulation to synthesize facial images varying in the dimensions of gender and race while keeping other signals intact. We use these synthesized images to measure counterfactual fairness of commercial computer vision classifiers by examining the degree to which these classifiers are affected by gender and racial cues controlled in the images, e.g., feminine faces may elicit higher scores for the concept of nurse and lower scores for STEM-related concepts. We also report the skewed gender representations in an online search service on profession-related keywords, which may explain the origin of the biases encoded in the models.

25. Towards Streaming Image Understanding
Mengtian Li, Yu-Xiong Wang, Deva Ramanan
Abstract: Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by latency of its processing pipeline. While past work has studied the algorithmic trade-off between latency and accuracy, there has not been a clear metric to compare different methods along the Pareto optimal latency-accuracy curve. We point out a discrepancy between standard offline evaluation and real-time applications: by the time an algorithm finishes processing a particular image frame, the surrounding world has changed. To these ends, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as "streaming accuracy". The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring. More broadly, building upon this metric, we introduce a meta-benchmark that systematically converts any image understanding task into a streaming image understanding task. We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations. Our proposed solutions and their empirical analysis demonstrate a number of surprising conclusions: (1) there exists an optimal "sweet spot" that maximizes streaming accuracy along the Pareto optimal latency-accuracy curve, (2) asynchronous tracking and future forecasting naturally emerge as internal representations that enable streaming image understanding, and (3) dynamic scheduling can be used to overcome temporal aliasing, yielding the paradoxical result that latency is sometimes minimized by sitting idle and "doing nothing".
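The core idea behind evaluating "at every time instant" can be sketched as a pairing step: each ground-truth timestamp is matched with the most recent output the algorithm had actually finished by then. The sketch below shows only that pairing, under assumptions; the function name `latest_available` and the timestamps are hypothetical, and the paper's full metric is not reproduced here.

```python
import bisect

def latest_available(query_times, finish_times):
    """For each query time, return the index of the most recent output that
    finished at or before it, or None if nothing has finished yet.

    finish_times must be sorted in ascending order.
    """
    matched = []
    for t in query_times:
        k = bisect.bisect_right(finish_times, t) - 1
        matched.append(k if k >= 0 else None)
    return matched

# Hypothetical stream: frames arrive every 0.1 s, and the detector
# produces two outputs, finishing at 0.15 s and 0.35 s.
frame_times = [0.0, 0.1, 0.2, 0.3, 0.4]
finish_times = [0.15, 0.35]
matched = latest_available(frame_times, finish_times)
print(matched)
```

Frames with no finished output yet (the `None` entries) and frames scored against a stale output are exactly where latency costs accuracy in this style of evaluation.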

26. Interpretable and Accurate Fine-grained Recognition via Region Grouping
Zixuan Huang, Yin Li
Abstract: We present an interpretable deep model for fine-grained visual recognition. At the core of our method lies the integration of region-based part discovery and attribution within a deep neural network. Our model is trained using image-level object labels, and provides an interpretation of its results via the segmentation of object parts and the identification of their contributions towards classification. To facilitate the learning of object parts without direct supervision, we explore a simple prior of the occurrence of object parts. We demonstrate that this prior, when combined with our region-based part discovery and attribution, leads to an interpretable model that remains highly accurate. Our model is evaluated on major fine-grained recognition datasets, including CUB-200, CelebA and iNaturalist. Our results compare favorably to state-of-the-art methods on classification tasks, and our method outperforms previous approaches on the localization of object parts.

27. VideoForensicsHQ: Detecting High-quality Manipulated Face Videos
Gereon Fox, Wentao Liu, Hyeongwoo Kim, Hans-Peter Seidel, Mohamed Elgharib, Christian Theobalt
Abstract: New approaches to synthesize and manipulate face videos at very high quality have paved the way for new applications in computer animation, virtual and augmented reality, or face video analysis. However, there are concerns that they may be used in a malicious way, e.g. to manipulate videos of public figures, politicians or reporters, to spread false information. The research community therefore developed techniques for automated detection of modified imagery, and assembled benchmark datasets showing manipulations by state-of-the-art techniques. In this paper, we contribute to this initiative in two ways: First, we present a new audio-visual benchmark dataset. It shows some of the highest quality visual manipulations available today. Human observers find them significantly harder to identify as forged than videos from other benchmarks. Furthermore, we propose a new family of deep-learning-based fake detectors, demonstrating that existing detectors are not well-suited for detecting fakes of a quality as high as presented in our dataset. Our detectors examine spatial and temporal features. This allows them to outperform existing approaches both in terms of high detection accuracy and generalization to unseen fake generation methods and unseen identities.

28. TAO: A Large-Scale Benchmark for Tracking Any Object
Achal Dave, Tarasha Khurana, Pavel Tokmakov, Cordelia Schmid, Deva Ramanan
Abstract: For many years, multi-object tracking benchmarks have focused on a handful of categories. Motivated primarily by surveillance and self-driving applications, these datasets provide tracks for people, vehicles, and animals, ignoring the vast majority of objects in the world. By contrast, in the related field of object detection, the introduction of large-scale, diverse datasets (e.g., COCO) has fostered significant progress in developing highly robust solutions. To bridge this gap, we introduce a similarly diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high resolution videos, captured in diverse environments, which are half a minute long on average. Importantly, we adopt a bottom-up approach for discovering a large vocabulary of 833 categories, an order of magnitude more than prior tracking benchmarks. To this end, we ask annotators to label objects that move at any point in the video, and give names to them post factum. Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets. To ensure scalability of annotation, we employ a federated approach that focuses manual effort on labeling tracks for those relevant objects in a video (e.g., those that move). We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open world. In particular, we show that existing single- and multi-object trackers struggle when applied to this scenario in the wild, and that detection-based, multi-object trackers are in fact competitive with user-initialized ones. We hope that our dataset and analysis will boost further progress in the tracking community.

29. WHENet: Real-time Fine-Grained Estimation for Wide Range Head Pose
Yijun Zhou, James Gregson
Abstract: We present an end-to-end head-pose estimation network designed to predict Euler angles through the full range of head yaws from a single RGB image. Existing methods perform well for frontal views but few target head pose from all viewpoints. This has applications in autonomous driving and retail. Our network builds on multi-loss approaches with changes to loss functions and training strategies adapted to wide range estimation. Additionally, we extract ground truth labelings of anterior views from a current panoptic dataset for the first time. The resulting Wide Headpose Estimation Network (WHENet) is the first fine-grained modern method applicable to the full range of head yaws (hence wide), yet it also meets or beats state-of-the-art methods for frontal head pose estimation. Our network is compact and efficient for mobile devices and applications.

30. InfoScrub: Towards Attribute Privacy by Targeted Obfuscation
Hui-Po Wang, Tribhuvanesh Orekondy, Mario Fritz
Abstract: Personal photos of individuals, when shared online, apart from exhibiting a myriad of memorable details, also reveal a wide range of private information and potentially entail privacy risks (e.g., online harassment, tracking). To mitigate such risks, it is crucial to study techniques that allow individuals to limit the private information leaked in visual data. We tackle this problem in a novel image obfuscation framework: to maximize entropy on inferences over targeted privacy attributes, while retaining image fidelity. We approach the problem based on an encoder-decoder style architecture, with two key novelties: (a) introducing a discriminator to perform bi-directional translation simultaneously from multiple unpaired domains; (b) predicting an image interpolation which maximizes uncertainty over a target set of attributes. We find our approach generates obfuscated images faithful to the original input images, and additionally increases uncertainty by 6.2$\times$ (or up to 0.85 bits) over the non-obfuscated counterparts.
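The uncertainty that the InfoScrub abstract reports in bits is Shannon entropy over an attribute classifier's predicted distribution. The sketch below shows how such a quantity could be computed; the helper name `entropy_bits` and the example probabilities are hypothetical and are not taken from the paper.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy, in bits, of a discrete distribution.

    The input is normalized to sum to 1; zero-probability entries
    contribute nothing (0 * log 0 is treated as 0).
    """
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()
    nonzero = p[p > 0]
    return float(-(nonzero * np.log2(nonzero)).sum())

# Hypothetical predictions for one binary attribute:
confident = entropy_bits([0.99, 0.01])  # attacker is nearly certain
uncertain = entropy_bits([0.5, 0.5])    # attacker can only guess
print(confident, uncertain)
```

For a binary attribute, entropy is at most 1 bit, reached when the prediction is a coin flip; an obfuscation that pushes predictions toward that point increases uncertainty in this sense.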

31. Maplets: An Efficient Approach for Cooperative SLAM Map Building Under Communication and Computation Constraints
Kevin M. Brink, Jincheng Zhang, Andrew R. Willis, Ryan E. Sherrill, Jamie L. Godwin
Abstract: This article introduces an approach to facilitate cooperative exploration and mapping of large-scale, near-ground, underground, or indoor spaces via a novel integration framework for locally-dense agent map data. The effort targets limited Size, Weight, and Power (SWaP) agents with an emphasis on limiting required communications and redundant processing. The approach uses a unique organization of batch optimization engines to enable a highly efficient two-tier optimization structure. Tier I consists of agents that create and potentially share local maplets (local maps, limited in size) which are generated using Simultaneous Localization and Mapping (SLAM) map-building software and then marginalized to a more compact parameterization. Maplets are generated in an overlapping manner and used to estimate the transform and uncertainty between those overlapping maplets, providing accurate and compact odometry or delta-pose representation between maplets' local frames. The delta poses can be shared between agents, and in cases where maplets have salient features (for loop closures), the compact representation of the maplet can also be shared. The second optimization tier consists of a global optimizer that seeks to optimize those maplet-to-maplet transformations, including any loop closures identified. This can provide an accurate global "skeleton" of the traversed space without operating on the high-density point cloud. This compact version of the map data allows for scalable, cooperative exploration with limited communication requirements where most of the individual maplets, or low fidelity renderings, are only shared if desired.

32. Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation
Liang-Chieh Chen, Raphael Gontijo Lopes, Bowen Cheng, Maxwell D. Collins, Ekin D. Cubuk, Barret Zoph, Hartwig Adam, Jonathon Shlens
Abstract: Supervised learning in large discriminative models is a mainstay for modern computer vision. Such an approach necessitates investing in large-scale human-annotated datasets for achieving state-of-the-art results. In turn, the efficacy of supervised learning may be limited by the size of the human annotated dataset. This limitation is particularly notable for image segmentation tasks, where the expense of human annotation is especially large, yet large amounts of unlabeled data may exist. In this work, we ask if we may leverage semi-supervised learning in unlabeled video sequences to improve the performance on urban scene segmentation, simultaneously tackling semantic, instance, and panoptic segmentation. The goal of this work is to avoid the construction of sophisticated, learned architectures specific to label propagation (e.g., patch matching and optical flow). Instead, we simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data. The procedure is iterated several times. As a result, our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results on all three Cityscapes benchmarks, reaching the performance of 67.8% PQ, 42.6% AP, and 85.2% mIOU on the test set. We view this work as a notable step towards building a simple procedure to harness unlabeled video sequences to surpass state-of-the-art performance on core computer vision tasks.
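The iterative procedure in the Naive-Student abstract (predict pseudo-labels for unlabeled data, retrain on human-annotated plus pseudo-labeled data, repeat) can be sketched in a few lines. As a loudly stated assumption, a nearest-centroid classifier on 2-D points stands in for the paper's segmentation networks, and all function names and data below are hypothetical.

```python
import numpy as np

def fit_centroids(x, y, num_classes):
    """'Train' the stand-in model: one mean vector per class."""
    return np.stack([x[y == c].mean(axis=0) for c in range(num_classes)])

def predict(centroids, x):
    """Assign each point to the class with the nearest centroid."""
    dist = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dist.argmin(axis=1)

def self_train(x_lab, y_lab, x_unlab, num_classes, rounds=3):
    # Initial teacher: trained on human-annotated data only.
    centroids = fit_centroids(x_lab, y_lab, num_classes)
    for _ in range(rounds):
        # Teacher predicts pseudo-labels for the unlabeled data.
        pseudo = predict(centroids, x_unlab)
        # Student trains on labeled and pseudo-labeled data together,
        # then serves as the teacher for the next round.
        x_all = np.concatenate([x_lab, x_unlab])
        y_all = np.concatenate([y_lab, pseudo])
        centroids = fit_centroids(x_all, y_all, num_classes)
    return centroids

x_lab = np.array([[0.0, 0.0], [4.0, 4.0]])
y_lab = np.array([0, 1])
x_unlab = np.array([[0.5, 0.5], [1.0, 0.0], [3.5, 4.0], [4.0, 3.0]])
centroids = self_train(x_lab, y_lab, x_unlab, num_classes=2)
labels = predict(centroids, x_unlab).tolist()
print(labels)
```

Because the labeled examples are kept in every round, each class always retains at least one training point in this toy setup.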

33. Medical Image Generation using Generative Adversarial Networks
Nripendra Kumar Singh, Khalid Raza
Abstract: Generative adversarial networks (GANs) are an unsupervised deep learning approach in the computer vision community that has gained significant attention over the last few years for identifying the internal structure of multimodal medical imaging data. The adversarial network simultaneously generates realistic medical images and corresponding annotations, which has proven useful in many cases such as image augmentation, image registration, medical image generation, image reconstruction, and image-to-image translation. These properties have drawn the attention of researchers in the field of medical image analysis, and we are witnessing rapid adaption in many novel and traditional applications. This chapter presents state-of-the-art progress in GAN-based clinical applications in medical image generation and cross-modality synthesis. The various GAN frameworks that have gained popularity in the interpretation of medical images, such as Deep Convolutional GAN (DCGAN), Laplacian GAN (LAPGAN), pix2pix, CycleGAN, and the unsupervised image-to-image translation model (UNIT), and that continue to improve their performance by incorporating additional hybrid architectures, are discussed. Further, some recent applications of these frameworks for image reconstruction and synthesis, and future research directions in the area, are covered.

34. SymJAX: symbolic CPU/GPU/TPU programming
Randall Balestriero
Abstract: SymJAX is a symbolic programming version of JAX simplifying graph input/output/updates and providing additional functionalities for general machine learning and deep learning applications. From a user perspective, SymJAX provides an à la Theano experience with fast graph optimization/compilation and broad hardware support, along with Lasagne-like deep learning functionalities.

35. Efficient and Phase-aware Video Super-resolution for Cardiac MRI
Jhih-Yuan Lin, Yu-Cheng Chang, Winston H. Hsu
Abstract: Cardiac Magnetic Resonance Imaging (CMR) is widely used since it can illustrate the structure and function of the heart in a non-invasive and painless way. However, acquiring high-quality scans is time-consuming and costly due to hardware limitations. To this end, we propose a novel end-to-end trainable network to solve the CMR video super-resolution problem without hardware upgrades or scanning protocol modifications. We incorporate cardiac knowledge into our model to assist in utilizing the temporal information. Specifically, we formulate the cardiac knowledge as a periodic function, which is tailored to meet the cyclic characteristic of CMR. In addition, the proposed residual of residual learning scheme helps the network learn the LR-HR mapping in a progressive refinement fashion. This mechanism gives the network an adaptive capability by adjusting refinement iterations depending on the difficulty of the task. Extensive experimental results on large-scale datasets demonstrate the superiority of the proposed method compared with numerous state-of-the-art methods.

36. Omnidirectional Images as Moving Camera Videos
Xiangjie Sui, Kede Ma, Yiru Yao, Yuming Fang
Abstract: Omnidirectional images (also referred to as static 360° panoramas) impose viewing conditions much different from those of regular 2D images. A natural question arises: how do humans perceive image distortions in immersive virtual reality (VR) environments? We argue that, apart from the distorted panorama itself, three types of viewing behavior governed by VR conditions are crucial in determining its perceived quality: starting point, exploration time, and scanpath. In this paper, we propose a principled computational framework for objective quality assessment of 360° images, which embodies the threefold behavior in a delightful way. Specifically, we first transform an omnidirectional image to several video representations using the viewing behavior of different users. We then leverage the recent advances in full-reference 2D image/video quality assessment to compute the perceived quality of the panorama. We construct a set of specific quality measures within the proposed framework, and demonstrate their promise on two VR quality databases.

37. Single Image Super-Resolution via Residual Neuron Attention Networks
Wenjie Ai, Xiaoguang Tu, Shilei Cheng, Mei Xie
Abstract: Deep Convolutional Neural Networks (DCNNs) have achieved impressive performance in Single Image Super-Resolution (SISR). To further improve the performance, existing CNN-based methods generally focus on designing deeper network architectures. However, we argue that blindly increasing the network's depth is not the most sensible way. In this paper, we propose a novel end-to-end Residual Neuron Attention Network (RNAN) for more efficient and effective SISR. Structurally, our RNAN is a sequential integration of the well-designed Global Context-enhanced Residual Groups (GCRGs), which extract super-resolved features from coarse to fine. Our GCRG is designed with two novelties. Firstly, the Residual Neuron Attention (RNA) mechanism is proposed in each block of GCRG to reveal the relevance of neurons for better feature representation. Furthermore, the Global Context (GC) block is embedded into RNAN at the end of each GCRG for effectively modeling the global contextual information. Experimental results demonstrate that our RNAN achieves results comparable with state-of-the-art methods in terms of both quantitative metrics and visual quality, but with a simplified network architecture.

38. CPOT: Channel Pruning via Optimal Transport
Yucong Shen, Li Shen, Hao-Zhi Huang, Xuan Wang, Wei Liu
Abstract: Recent advances in deep neural networks (DNNs) lead to tremendously growing network parameters, making the deployments of DNNs on platforms with limited resources extremely difficult. Therefore, various pruning methods have been developed to compress the deep network architectures and accelerate the inference process. Most of the existing channel pruning methods discard the less important filters according to well-designed filter ranking criteria. However, due to the limited interpretability of deep learning models, designing an appropriate ranking criterion to distinguish redundant filters is difficult. To address such a challenging issue, we propose a new technique of Channel Pruning via Optimal Transport, dubbed CPOT. Specifically, we locate the Wasserstein barycenter for channels of each layer in the deep models, which is the mean of a set of probability distributions under the optimal transport metric. Then, we prune the redundant information located by Wasserstein barycenters. At last, we empirically demonstrate that, for classification tasks, CPOT outperforms the state-of-the-art methods on pruning ResNet-20, ResNet-32, ResNet-56, and ResNet-110. Furthermore, we show that the proposed CPOT technique is good at compressing the StarGAN models by pruning in the more difficult case of image-to-image translation tasks.
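The Wasserstein barycenter named in the CPOT abstract has a simple closed form in one special case: for 1-D distributions represented by the same number of equally weighted samples, the 2-Wasserstein barycenter is obtained by averaging the sorted samples (that is, averaging quantile functions). The sketch below illustrates only that special case; how the paper represents channels as distributions is not stated in the abstract, and the name `barycenter_1d` and the data are hypothetical.

```python
import numpy as np

def barycenter_1d(samples, weights=None):
    """2-Wasserstein barycenter of k one-dimensional empirical distributions.

    samples: array of shape (k, n); each row holds n equally weighted
             samples from one distribution
    weights: optional length-k barycentric weights (normalized internally)
    Returns the n sorted support points of the barycenter.
    """
    sorted_samples = np.sort(np.asarray(samples, dtype=float), axis=1)
    k = sorted_samples.shape[0]
    if weights is None:
        w = np.full(k, 1.0 / k)
    else:
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
    # Weighted average of matching order statistics (quantiles).
    return w @ sorted_samples

# Two hypothetical 1-D distributions with three samples each.
samples = np.array([[0.0, 1.0, 2.0],
                    [10.0, 12.0, 11.0]])
bary = barycenter_1d(samples)
print(bary)
```

Unlike a plain mixture of the two sample sets, the barycenter keeps the shape of the inputs and sits between them, which is the sense in which it acts as a "mean" of distributions.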

39. HF-UNet: Learning Hierarchically Inter-Task Relevance in Multi-Task U-Net for Accurate Prostate Segmentation
Kelei He, Chunfeng Lian, Bing Zhang, Xin Zhang, Xiaohuan Cao, Dong Nie, Yang Gao, Junfeng Zhang, Dinggang Shen
Abstract: Accurate segmentation of the prostate is a key step in external beam radiation therapy treatments. In this paper, we tackle the challenging task of prostate segmentation in CT images with a two-stage network: 1) the first stage quickly localizes the prostate, and 2) the second stage accurately segments it. To precisely segment the prostate in the second stage, we formulate prostate segmentation as a multi-task learning framework, which includes a main task to segment the prostate, and an auxiliary task to delineate the prostate boundary. Here, the second task is applied to provide additional guidance on the unclear prostate boundary in CT images. Besides, conventional multi-task deep networks typically share most of the parameters (i.e., feature representations) across all tasks, which may limit their data fitting ability, as the specificities of different tasks are inevitably ignored. By contrast, we address this with a hierarchically-fused U-Net structure, namely HF-UNet. The HF-UNet has two complementary branches for the two tasks, with a newly proposed attention-based task consistency learning block to communicate at each level between the two decoding branches. Therefore, HF-UNet is able to learn hierarchically the shared representations for different tasks, and preserve the specificities of learned representations for different tasks simultaneously. We performed extensive evaluations of the proposed method on a large planning CT image dataset, including images acquired from 339 patients. The experimental results show that HF-UNet outperforms the conventional multi-task network architectures and the state-of-the-art methods.
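Summary: Accurate segmentation of the prostate is a key step in external beam radiation therapy. In this paper, we tackle the challenging task of prostate segmentation in CT images with a two-stage network: 1) a first stage that quickly localizes the prostate, and 2) a second stage that segments it accurately. To segment the prostate precisely in the second stage, we formulate prostate segmentation as a multi-task learning framework with a main task that segments the prostate and an auxiliary task that delineates the prostate boundary. The auxiliary task provides additional guidance where the prostate boundary is unclear in CT images. Conventional multi-task deep networks typically share most of their parameters (i.e., feature representations) across all tasks, which may limit their data-fitting ability because the specificities of the individual tasks are inevitably ignored. By contrast, we address these issues with a hierarchically fused U-Net structure, HF-UNet. HF-UNet has two complementary branches for the two tasks, and a newly proposed attention-based task consistency learning block lets the two decoding branches communicate at each level. HF-UNet can therefore learn shared representations for the different tasks hierarchically while preserving the specificity of the representations learned for each task. We evaluated the proposed method extensively on a large planning CT image dataset that includes images acquired from 339 patients. The experimental results show that HF-UNet outperforms conventional multi-task network architectures and state-of-the-art methods.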

40. Adversarial Canonical Correlation Analysis
Benjamin Dutton
Abstract: Canonical Correlation Analysis (CCA) is a statistical technique used to extract common information from multiple data sources or views. It has been used in various representation learning problems, such as dimensionality reduction, word embedding, and clustering. Recent work has given CCA probabilistic footing in a deep learning context and uses a variational lower bound for the data log likelihood to estimate model parameters. Alternatively, adversarial techniques have arisen in recent years as a powerful alternative to variational Bayesian methods in autoencoders. In this work, we explore straightforward adversarial alternatives to recent work in Deep Variational CCA (VCCA and VCCA-Private) we call ACCA and ACCA-Private and show how these approaches offer a stronger and more flexible way to match the approximate posteriors coming from encoders to much larger classes of priors than the VCCA and VCCA-Private models. This allows new priors for what constitutes a good representation, such as disentangling underlying factors of variation, to be more directly pursued. We offer further analysis on the multi-level disentangling properties of VCCA-Private and ACCA-Private through the use of a newly designed dataset we call Tangled MNIST. We also design a validation criteria for these models that is theoretically grounded, task-agnostic, and works well in practice. Lastly, we fill a minor research gap by deriving an additional variational lower bound for VCCA that allows the representation to use view-specific information from both input views.
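Summary: Canonical Correlation Analysis (CCA) is a statistical technique for extracting common information from multiple data sources or views. It has been used in various representation learning problems, such as dimensionality reduction, word embedding, and clustering. Recent work has given CCA a probabilistic footing in a deep learning context and uses a variational lower bound on the data log-likelihood to estimate model parameters. Meanwhile, adversarial techniques have emerged in recent years as a powerful alternative to variational Bayesian methods in autoencoders. In this work, we explore straightforward adversarial alternatives to recent work on Deep Variational CCA (VCCA and VCCA-Private), which we call ACCA and ACCA-Private, and show how these approaches offer a stronger and more flexible way to match the approximate posteriors produced by the encoders to much larger classes of priors than the VCCA and VCCA-Private models allow. This lets new priors on what constitutes a good representation, such as disentangling underlying factors of variation, be pursued more directly. We further analyze the multi-level disentangling properties of VCCA-Private and ACCA-Private using a newly designed dataset that we call Tangled MNIST. We also design a validation criterion for these models that is theoretically grounded, task-agnostic, and works well in practice. Lastly, we fill a minor research gap by deriving an additional variational lower bound for VCCA that allows the representation to use view-specific information from both input views.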
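
For readers unfamiliar with the underlying technique, the following is a minimal sketch of classical linear CCA in NumPy. It is not the ACCA or VCCA model from the paper, only the baseline idea of extracting shared information from two views. The function name `linear_cca`, the ridge term `reg`, and the synthetic two-view data are assumptions made for illustration.

```python
import numpy as np


def linear_cca(X, Y, k=1, reg=1e-6):
    """Classical linear CCA via whitening and SVD (illustrative sketch).

    X has shape (n, p) and Y has shape (n, q); rows are paired samples.
    Returns projection matrices A (p, k), B (q, k) and the top-k
    canonical correlations.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)

    # Sample covariances; a small ridge keeps them well conditioned.
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(S):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(S)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Sxx), inv_sqrt(Syy)

    # Singular values of the whitened cross-covariance are the
    # canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Sxy @ Ky)
    A = Kx @ U[:, :k]
    B = Ky @ Vt.T[:, :k]
    return A, B, s[:k]


# Synthetic example: two 3-dimensional views sharing one latent factor z.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(500, 1)), rng.normal(size=(500, 2))])
Y = np.hstack([rng.normal(size=(500, 2)), -z + 0.1 * rng.normal(size=(500, 1))])

A, B, corrs = linear_cca(X, Y, k=1)

# Correlation of the projected views, computed directly as a sanity check.
u = (X - X.mean(axis=0)) @ A[:, 0]
v = (Y - Y.mean(axis=0)) @ B[:, 0]
empirical = np.corrcoef(u, v)[0, 1]
print(corrs[0], empirical)
```

Because the two views share the factor `z` up to small noise, the first canonical correlation should come out close to 1, and the directly computed correlation of the projections should agree with it. Deep and variational variants such as those discussed in the abstract replace the linear projections with learned encoders.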

41. A Study of Deep Learning Colon Cancer Detection in Limited Data Access Scenarios
Apostolia Tsirikoglou, Karin Stacke, Gabriel Eilertsen, Martin Lindvall, Jonas Unger
Abstract: Digitization of histopathology slides has led to several advances, from easy data sharing and collaborations to the development of digital diagnostic tools. Deep learning (DL) methods for classification and detection have shown great potential, but often require large amounts of training data that are hard to collect, and annotate. For many cancer types, the scarceness of data creates barriers for training DL models. One such scenario relates to detecting tumor metastasis in lymph node tissue, where the low ratio of tumor to non-tumor cells makes the diagnostic task hard and time-consuming. DL-based tools can allow faster diagnosis, with potentially increased quality. Unfortunately, due to the sparsity of tumor cells, annotating this type of data demands a high level of effort from pathologists. Using weak annotations from slide-level images have shown great potential, but demand access to a substantial amount of data as well. In this study, we investigate mitigation strategies for limited data access scenarios. Particularly, we address whether it is possible to exploit mutual structure between tissues to develop general techniques, wherein data from one type of cancer in a particular tissue could have diagnostic value for other cancers in other tissues. Our case is exemplified by a DL model for metastatic colon cancer detection in lymph nodes. Could such a model be trained with little or even no lymph node data? As alternative data sources, we investigate 1) tumor cells taken from the primary colon tumor tissue, and 2) cancer data from a different organ (breast), either as is or transformed to the target domain (colon) using Cycle-GANs. We show that the suggested approaches make it possible to detect cancer metastasis with no or very little lymph node data, opening up for the possibility that existing, annotated histopathology data could generalize to other domains.
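Summary: Digitization of histopathology slides has led to several advances, from easy data sharing and collaboration to the development of digital diagnostic tools. Deep learning (DL) methods for classification and detection have shown great potential, but they often require large amounts of training data that are hard to collect and annotate. For many cancer types, the scarcity of data creates barriers to training DL models. One such scenario is detecting tumor metastasis in lymph node tissue, where the low ratio of tumor to non-tumor cells makes the diagnostic task hard and time-consuming. DL-based tools can allow faster diagnosis, potentially with higher quality. Unfortunately, because tumor cells are sparse, annotating this type of data demands a high level of effort from pathologists. Using weak annotations from slide-level images has shown great potential, but it also requires access to a substantial amount of data. In this study, we investigate mitigation strategies for limited data access scenarios. In particular, we ask whether the mutual structure between tissues can be exploited to develop general techniques, so that data from one type of cancer in a particular tissue has diagnostic value for other cancers in other tissues. Our case is exemplified by a DL model for detecting metastatic colon cancer in lymph nodes. Could such a model be trained with little or even no lymph node data? As alternative data sources, we investigate 1) tumor cells taken from the primary colon tumor tissue, and 2) cancer data from a different organ (breast), either as is or transformed to the target domain (colon) using Cycle-GANs. We show that the suggested approaches make it possible to detect cancer metastasis with no or very little lymph node data, which opens up the possibility that existing annotated histopathology data could generalize to other domains.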

42. Model-Based Robust Deep Learning
Alexander Robey, Hamed Hassani, George J. Pappas
Abstract: While deep learning has resulted in major breakthroughs in many application domains, the frameworks commonly used in deep learning remain fragile to artificially-crafted and imperceptible changes in the data. In response to this fragility, adversarial training has emerged as a principled approach for enhancing the robustness of deep learning with respect to norm-bounded perturbations. However, there are other sources of fragility for deep learning that are arguably more common and less thoroughly studied. Indeed, natural variation such as lighting or weather conditions can significantly degrade the accuracy of trained neural networks, proving that such natural variation presents a significant challenge for deep learning. In this paper, we propose a paradigm shift from perturbation-based adversarial robustness toward model-based robust deep learning. Our objective is to provide general training algorithms that can be used to train deep neural networks to be robust against natural variation in data. Critical to our paradigm is first obtaining a model of natural variation which can be used to vary data over a range of natural conditions. Such models may be either known a priori or else learned from data. In the latter case, we show that deep generative models can be used to learn models of natural variation that are consistent with realistic conditions. We then exploit such models in three novel model-based robust training algorithms in order to enhance the robustness of deep learning with respect to the given model. Our extensive experiments show that across a variety of naturally-occurring conditions and across various datasets, deep neural networks trained with our model-based algorithms significantly outperform both standard deep learning algorithms as well as norm-bounded robust deep learning algorithms.
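Summary: Although deep learning has produced major breakthroughs in many application domains, the frameworks commonly used in deep learning remain fragile to artificially crafted, imperceptible changes in the data. In response to this fragility, adversarial training has emerged as a principled approach to improving the robustness of deep learning with respect to norm-bounded perturbations. However, deep learning has other sources of fragility that are arguably more common and less thoroughly studied. Natural variation such as lighting or weather conditions can significantly degrade the accuracy of trained neural networks, showing that such variation presents a significant challenge for deep learning. In this paper, we propose a paradigm shift from perturbation-based adversarial robustness toward model-based robust deep learning. Our objective is to provide general training algorithms that can train deep neural networks to be robust against natural variation in data. Critical to our paradigm is first obtaining a model of natural variation, which can be used to vary data over a range of natural conditions. Such models may be known a priori or learned from data. In the latter case, we show that deep generative models can be used to learn models of natural variation that are consistent with realistic conditions. We then exploit such models in three novel model-based robust training algorithms to enhance the robustness of deep learning with respect to the given model. Our extensive experiments show that, across a variety of naturally occurring conditions and across various datasets, deep neural networks trained with our model-based algorithms significantly outperform both standard deep learning algorithms and norm-bounded robust deep learning algorithms.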

Note: the summaries are machine translations of the abstracts.


[arXiv papers] Computer Vision and Pattern Recognition 2020-05-22

Contents

Abstracts