Abstracts

1. Learning Visual Representations for Transfer Learning by Suppressing Texture
Shlok Mishra, Anshul Shah, Ankan Bansal, Jonghyun Choi, Abhinav Shrivastava, Abhishek Sharma, David Jacobs
Abstract: Recent literature has shown that features obtained from supervised training of CNNs may over-emphasize texture rather than encoding high-level information. In self-supervised learning in particular, texture as a low-level cue may provide shortcuts that prevent the network from learning higher-level representations. To address these problems, we propose to use classic methods based on anisotropic diffusion to augment training using images with suppressed texture. This simple method helps retain important edge information and suppress texture at the same time. We empirically show that our method achieves state-of-the-art results on object detection and image classification with eight diverse datasets in either supervised or self-supervised learning tasks such as MoCoV2 and Jigsaw. Our method is particularly effective for transfer learning tasks, and we observed improved performance on five standard transfer learning datasets. The large improvements (up to 11.49%) on the Sketch-ImageNet dataset and the DTD dataset, together with additional visual analyses with saliency maps, suggest that our approach helps in learning representations that transfer better.
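To make the texture-suppression idea concrete, here is a minimal sketch of classic Perona-Malik anisotropic diffusion in plain Python. It is not the authors' implementation: the function name `perona_malik`, the exponential edge-stopping function, and the parameter values `kappa` and `step` are assumptions chosen for illustration. Small intensity differences (texture) are averaged away, while large differences (edges) get a near-zero conductance and survive.

```python
import math

def perona_malik(image, iterations=10, kappa=0.1, step=0.2):
    """Smooth a grayscale image (list of rows of floats) while keeping strong edges.

    kappa sets the gradient magnitude treated as an edge (assumed value).
    step must stay at or below 0.25 for this 4-neighbour scheme to be stable.
    """
    rows, cols = len(image), len(image[0])
    current = [row[:] for row in image]

    def conductance(gradient):
        # Close to 1 in flat or lightly textured regions, close to 0 across strong edges.
        return math.exp(-((gradient / kappa) ** 2))

    for _ in range(iterations):
        updated = [row[:] for row in current]
        for r in range(rows):
            for c in range(cols):
                center = current[r][c]
                flow = 0.0
                # Sum the edge-weighted differences to the 4 neighbours inside the image.
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    nr, nc = r + dr, c + dc
                    if 0 <= nr < rows and 0 <= nc < cols:
                        diff = current[nr][nc] - center
                        flow += conductance(diff) * diff
                updated[r][c] = center + step * flow
        current = updated
    return current
```

In an augmentation pipeline, a diffused copy of each training image would be used alongside the original; how the two are mixed is a design choice the abstract does not specify.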

2. RealHePoNet: a robust single-stage ConvNet for head pose estimation in the wild
Rafael Berral-Soler, Francisco J. Madrid-Cuevas, Rafael Muñoz-Salinas, Manuel J. Marín-Jiménez
Abstract: Human head pose estimation in images has applications in many fields such as human-computer interaction or video surveillance tasks. In this work, we address this problem, defined here as the estimation of both vertical (tilt/pitch) and horizontal (pan/yaw) angles, through the use of a single Convolutional Neural Network (ConvNet) model, trying to balance precision and inference speed in order to maximize its usability in real-world applications. Our model is trained over the combination of two datasets: 'Pointing'04' (aiming at covering a wide range of poses) and 'Annotated Facial Landmarks in the Wild' (in order to improve robustness of our model for its use on real-world images). Three different partitions of the combined dataset are defined and used for training, validation and testing purposes. As a result of this work, we have obtained a trained ConvNet model, coined RealHePoNet, that given a low-resolution grayscale input image, and without the need of using facial landmarks, is able to estimate with low error both tilt and pan angles (~4.4° average error on the test partition). Also, given its low inference time (~6 ms per head), we consider our model usable even when paired with medium-spec hardware (i.e. GTX 1060 GPU). * Code available at: this https URL * Demo video at: this https URL

3. Unsupervised Attention Based Instance Discriminative Learning for Person Re-Identification
Kshitij Nikhal, Benjamin S. Riggan
Abstract: Recent advances in person re-identification have demonstrated enhanced discriminability, especially with supervised learning or transfer learning. However, since the data requirements, including the degree of data curation, are becoming increasingly complex and laborious, there is a critical need for unsupervised methods that are robust to large intra-class variations, such as changes in perspective, illumination, articulated motion, resolution, etc. Therefore, we propose an unsupervised framework for person re-identification which is trained in an end-to-end manner without any pre-training. Our proposed framework leverages a new attention mechanism that combines group convolutions to (1) enhance spatial attention at multiple scales and (2) reduce the number of trainable parameters by 59.6%. Additionally, our framework jointly optimizes the network with agglomerative clustering and instance learning to tackle hard samples. We perform extensive analysis using the Market1501 and DukeMTMC-reID datasets to demonstrate that our method consistently outperforms the state-of-the-art methods (with and without pre-trained weights).

4. Learning unbiased registration and joint segmentation: evaluation on longitudinal diffusion MRI
Bo Li, Wiro J. Niessen, Stefan Klein, M. Arfan Ikram, Meike W. Vernooij, Esther E. Bron
Abstract: Analysis of longitudinal changes in imaging studies often involves both segmentation of structures of interest and registration of multiple timeframes. The accuracy of such analysis could benefit from a tailored framework that jointly optimizes both tasks to fully exploit the information available in the longitudinal data. Most learning-based registration algorithms, including joint optimization approaches, currently suffer from bias due to selection of a fixed reference frame and only support pairwise transformations. We here propose an analytical framework based on an unbiased learning strategy for group-wise registration that simultaneously registers images to the mean space of a group to obtain consistent segmentations. We evaluate the proposed method on longitudinal analysis of a white matter tract in a brain MRI dataset with 2-3 time-points for 3249 individuals, i.e., 8045 images in total. The reproducibility of the method is evaluated on test-retest data from 97 individuals. The results confirm that the implicit reference image is an average of the input image. In addition, the proposed framework leads to consistent segmentations and significantly lower processing bias than that of a pair-wise fixed-reference approach. This processing bias is even smaller than those obtained when translating segmentations by only one voxel, which can be attributed to subtle numerical instabilities and interpolation. Therefore, we postulate that the proposed mean-space learning strategy could be widely applied to learning-based registration tasks. In addition, this group-wise framework introduces a novel way for learning-based longitudinal studies by direct construction of an unbiased within-subject template and allowing reliable and efficient analysis of spatio-temporal imaging biomarkers.

5. Semi-supervised AU Intensity Estimation with Contrastive Learning
Enrique Sanchez, Adrian Bulat, Anestis Zaganidis, Georgios Tzimiropoulos
Abstract: This paper tackles the challenging problem of estimating the intensity of Facial Action Units with few labeled images. Contrary to previous works, our method does not require manual selection of key frames, and produces state-of-the-art results with as little as 2% of annotated frames, which are randomly chosen. To this end, we propose a semi-supervised learning approach where a spatio-temporal model combining a feature extractor and a temporal module is learned in two stages. The first stage uses datasets of unlabeled videos to learn a strong spatio-temporal representation of facial behavior dynamics based on contrastive learning. To our knowledge, we are the first to build upon this framework for modeling facial behavior in an unsupervised manner. The second stage uses another dataset of randomly chosen labeled frames to train a regressor on top of our spatio-temporal model for estimating the AU intensity. We show that although backpropagation through time is applied only with respect to the output of the network for extremely sparse and randomly chosen labeled frames, our model can be effectively trained to estimate AU intensity accurately, thanks to the unsupervised pre-training of the first stage. We experimentally validate that our method outperforms existing methods when working with as little as 2% of randomly chosen data for both DISFA and BP4D datasets, without a careful choice of labeled frames, a time-consuming task still required in previous approaches.

6. Learning Representations from Audio-Visual Spatial Alignment
Pedro Morgado, Yi Li, Nuno Vasconcelos
Abstract: We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originated from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, their training objectives disregard spatial cues naturally occurring in audio and visual signals. To learn from these spatial cues, we tasked a network to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360° video using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition, and video semantic segmentation.

7. Attention Beam: An Image Captioning Approach
Anubhav Shrimal, Tanmoy Chakraborty
Abstract: The aim of image captioning is to generate a textual description of a given image. Though seemingly an easy task for humans, it is challenging for machines, as it requires the ability to comprehend the image (computer vision) and consequently generate a human-like description for the image (natural language understanding). In recent times, encoder-decoder based architectures have achieved state-of-the-art results for image captioning. Here, we present a heuristic of beam search on top of the encoder-decoder based architecture that gives better quality captions on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
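For readers unfamiliar with the decoding step this paper builds on, the following is a minimal sketch of plain beam search, not the heuristic proposed in the paper. The callback `next_log_probs` is a hypothetical stand-in for the decoder: it is assumed to take the tokens generated so far and return a dictionary mapping each candidate next token to its log-probability.

```python
def beam_search(next_log_probs, start_token, end_token, beam_width=3, max_len=10):
    """Return (sequence, score) for the best hypothesis found with a fixed-width beam."""
    beams = [([start_token], 0.0)]  # (tokens so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for token, log_p in next_log_probs(seq).items():
                candidates.append((seq + [token], score + log_p))
        if not candidates:
            break  # the model proposed no continuation; keep the current beams
        # Keep only the `beam_width` highest-scoring extensions.
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == end_token:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if none ended
    return max(finished, key=lambda item: item[1])
```

With `beam_width=1` this reduces to greedy decoding; wider beams can recover captions whose first word is not the single most likely one.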

8. Learning a Generative Motion Model from Image Sequences based on a Latent Motion Matrix
Julian Krebs, Hervé Delingette, Nicholas Ayache, Tommaso Mansi
Abstract: We propose to learn a probabilistic motion model from a sequence of images for spatio-temporal registration. Our model encodes motion in a low-dimensional probabilistic space (the motion matrix), which enables various motion analysis tasks such as simulation and interpolation of realistic motion patterns, allowing for faster data acquisition and data augmentation. More precisely, the motion matrix allows the recovered motion to be transported from one subject to another, simulating, for example, a pathological motion in a healthy subject without the need for inter-subject registration. The method is based on a conditional latent variable model that is trained using amortized variational inference. This unsupervised generative model follows a novel multivariate Gaussian process prior and is applied within a temporal convolutional network which leads to a diffeomorphic motion model. Temporal consistency and generalizability are further improved by applying a temporal dropout training scheme. Applied to cardiac cine-MRI sequences, we show improved registration accuracy and spatio-temporally smoother deformations compared to three state-of-the-art registration algorithms. Besides, we demonstrate the model's applicability for motion analysis, simulation and super-resolution by an improved motion reconstruction from sequences with missing frames compared to linear and cubic interpolation.

9. Exploring DeshuffleGANs in Self-Supervised Generative Adversarial Networks
Gulcin Baykal, Furkan Ozcelik, Gozde Unal
Abstract: Generative Adversarial Networks (GANs) have become the most used network models for solving the problem of image generation. In recent years, self-supervised GANs have been proposed to help stabilize GAN training without the catastrophic forgetting problem and to improve the image generation quality without the need for the class labels of the data. However, the generalizability of the self-supervision tasks on different GAN architectures has not been studied before. To that end, we extensively analyze the contribution of the deshuffling task of DeshuffleGANs in the generalizability context. We assign the deshuffling task to two different GAN discriminators and study the effects of the deshuffling on both architectures. We also evaluate the performance of DeshuffleGANs on various datasets that are mostly used in GAN benchmarks: LSUN-Bedroom, LSUN-Church, and CelebA-HQ. We show that the DeshuffleGAN obtains the best FID results for LSUN datasets compared to the other self-supervised GANs. Furthermore, we compare the deshuffling with the rotation prediction that was first deployed to GAN training and demonstrate that its contribution exceeds that of the rotation prediction. Lastly, we show the contribution of the self-supervision tasks to GAN training on the loss landscape and show that the effects of the self-supervision tasks may not be cooperative with the adversarial training in some settings. Our code can be found at this https URL.

10. A spatial hue similarity measure for assessment of colourisation
Seán Mullery, Paul F. Whelan
Abstract: Automatic colourisation of grey-scale images is an ill-posed multi-modal problem. Where full-reference images exist, objective performance measures rely on pixel-difference techniques such as MSE and PSNR. These measures penalise any plausible modes other than the reference ground-truth; they often fail to adequately penalise implausible modes if they are close in pixel distance to the ground-truth; and, as these are pixel-difference methods, they cannot assess spatial coherency. We use the polar form of the a*b* channels from the CIE L*a*b* colour space to separate the multi-modal problems, which we confine to the hue channel, and the common-mode which applies to the chroma channel. We apply SSIM to the chroma channel but reformulate SSIM for the hue channel to a measure we call the Spatial Hue Similarity Measure (SHSM). This reformulation allows spatially-coherent hue channels to achieve a high score while penalising spatially-incoherent modes. This method allows qualitative and quantitative performance comparison of SOTA colourisation methods and reduces reliance on subjective human visual inspection.
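The polar form of the a*b* channels mentioned in the abstract can be sketched as follows. This shows only the standard conversion to hue and chroma, not the SHSM measure itself; the function names are illustrative, and the helper `hue_difference` is an added assumption that demonstrates why hue, being an angle, needs wrap-around handling before any similarity measure is applied.

```python
import math

def ab_to_hue_chroma(a, b):
    """Convert CIE L*a*b* chromatic coordinates to (hue in radians, chroma)."""
    chroma = math.hypot(a, b)               # radial distance from the neutral axis
    hue = math.atan2(b, a) % (2 * math.pi)  # angle wrapped into [0, 2*pi)
    return hue, chroma

def hue_difference(h1, h2):
    """Smallest angular distance between two hues, in radians (between 0 and pi)."""
    diff = abs(h1 - h2) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff)
```

Two hues just either side of the 0 radian wrap point are close on the colour wheel even though their raw values differ by almost 2*pi, which is one reason a pixel-difference metric on raw hue values would be misleading.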

11. The Aleatoric Uncertainty Estimation Using a Separate Formulation with Virtual Residuals
Takumi Kawashima, Qing Yu, Akari Asai, Daiki Ikami, Kiyoharu Aizawa
Abstract: We propose a new optimization framework for aleatoric uncertainty estimation in regression problems. Existing methods can quantify the error in the target estimation, but they tend to underestimate it. To obtain the predictive uncertainty inherent in an observation, we propose a new separable formulation for the estimation of a signal and of its uncertainty, avoiding the effect of overfitting. By decoupling target estimation and uncertainty estimation, we also control the balance between signal estimation and uncertainty estimation. We conduct three types of experiments: regression with simulation data, age estimation, and depth estimation. We demonstrate that the proposed method outperforms a state-of-the-art technique for signal and uncertainty estimation.
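As background for what estimating a signal and its uncertainty means in regression, here is a sketch of one widely used joint objective, the heteroscedastic Gaussian negative log-likelihood. This is an assumption for illustration: the abstract does not name the existing methods it compares against, and this is not the separable formulation the paper proposes. The model is assumed to output both a mean and a log-variance for each target.

```python
import math

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of `target` under N(mean, exp(log_var))."""
    squared_residual = (target - mean) ** 2
    return 0.5 * (log_var + squared_residual / math.exp(log_var) + math.log(2 * math.pi))
```

For a fixed residual, this loss is minimised when the predicted variance equals the squared residual, so the variance head is trained on the same residuals the mean head is fitting. The paper's decoupling of the two estimates targets that kind of interaction.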

12. Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery
Yong-Hao Long, Jie-Ying Wu, Bo Lu, Yue-Ming Jin, Mathias Unberath, Yun-Hui Liu, Pheng-Ann Heng, Qi Dou
Abstract: Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery. With recent advancement in robot-assisted minimally invasive surgery, rich information including surgical videos and robotic kinematics can be recorded, which provide complementary knowledge for understanding surgical gestures. However, existing methods either solely adopt uni-modal data or directly concatenate multi-modal representations, which cannot sufficiently exploit the informative correlations inherent in visual and kinematics data to boost gesture recognition accuracies. In this regard, we propose a novel approach of multimodal relational graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information through interactive message propagation in the latent feature space. Specifically, we first extract embeddings from video and kinematics sequences with temporal convolutional networks and LSTM units. Next, we identify multi-relations in these multi-modal features and model them through a hierarchical relational graph learning module. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on both suturing and knot tying tasks. Furthermore, we validated our method on in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK) platforms in two centers, with consistent promising performance achieved.

13. A Deep Temporal Fusion Framework for Scene Flow Using a Learnable Motion Model and Occlusions
René Schuster, Christian Unger, Didier Stricker
Abstract: Motion estimation is one of the core challenges in computer vision. With traditional dual-frame approaches, occlusions and out-of-view motions are a limiting factor, especially in the context of environmental perception for vehicles due to the large (ego-) motion of objects. Our work proposes a novel data-driven approach for temporal fusion of scene flow estimates in a multi-frame setup to overcome the issue of occlusion. Contrary to most previous methods, we do not rely on a constant motion model, but instead learn a generic temporal relation of motion from data. In a second step, a neural network combines bi-directional scene flow estimates from a common reference frame, yielding a refined estimate and a natural byproduct of occlusion masks. This way, our approach provides a fast multi-frame extension for a variety of scene flow estimators, which outperforms the underlying dual-frame approaches.

14. Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
Yue Wang, Jing Li, Michael R. Lyu, Irwin King
Abstract: Social media produces large amounts of content every day. To help users quickly capture what they need, keyphrase prediction is receiving growing attention. Nevertheless, most prior efforts focus on text modeling, largely ignoring the rich features embedded in the matching images. In this work, we explore the joint effects of texts and images in predicting the keyphrases for a multimedia post. To better align social media style texts and images, we propose: (1) a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions; (2) image wordings, in the form of optical characters and image attributes, to bridge the two modalities. Moreover, we design a unified framework to leverage the outputs of keyphrase classification and generation and couple their advantages. Extensive experiments on a large-scale dataset newly collected from Twitter show that our model significantly outperforms the previous state of the art based on traditional attention networks. Further analyses show that our multi-head attention is able to attend to information from various aspects and boost classification or generation in diverse scenarios.

15. CooGAN: A Memory-Efficient Framework for High-Resolution Facial Attribute Editing
Xuanhong Chen, Bingbing Ni, Naiyuan Liu, Ziang Liu, Yiliu Jiang, Loc Truong, Qi Tian
Abstract: In contrast to the great success of memory-consuming face editing methods at a low resolution, manipulating high-resolution (HR) facial images, i.e., typically larger than 768 × 768 pixels, with very limited memory is still challenging. This is due to 1) an intractably huge demand for memory and 2) inefficient multi-scale feature fusion. To address these issues, we propose a novel pixel translation framework called Cooperative GAN (CooGAN) for HR facial image editing. This framework features a local path for fine-grained local facial patch generation (i.e., patch-level HR, low memory) and a global path for global low-resolution (LR) facial structure monitoring (i.e., image-level LR, low memory), which largely reduce memory requirements. Both paths work in a cooperative manner under a local-to-global consistency objective (i.e., for smooth stitching). In addition, we propose a lighter selective transfer unit for more efficient multi-scale feature fusion, yielding higher-fidelity facial attribute manipulation. Extensive experiments on CelebA-HQ demonstrate the memory efficiency as well as the high image generation quality of the proposed framework.

16. 3D-LaneNet+: Anchor Free Lane Detection using a Semi-Local Representation
Netalee Efrat Sela, Max Bluvstein, Shaul Oron, Dan Levi, Noa Garnett, Bat El Shlomo
Abstract: 3D-LaneNet+ is a camera-based DNN method for anchor-free 3D lane detection which is able to detect 3D lanes of any arbitrary topology such as splits, merges, as well as short and perpendicular lanes. We follow the recently proposed 3D-LaneNet, and extend it to enable the detection of these previously unsupported lane topologies. Our output representation is an anchor-free, semi-local tile representation that breaks down lanes into simple lane segments whose parameters can be learnt. In addition, we learn, per lane instance, a feature embedding that reasons about the global connectivity of locally detected segments to form full 3D lanes. This combination allows 3D-LaneNet+ to avoid using lane anchors, non-maximum suppression, and lane model fitting as in the original 3D-LaneNet. We demonstrate the efficacy of 3D-LaneNet+ using both synthetic and real world data. Results show significant improvement relative to the original 3D-LaneNet that can be attributed to better generalization to complex lane topologies, curvatures and surface geometries.

17. SelfPose: 3D Egocentric Pose Estimation from a Headset Mounted Camera
Denis Tome, Thiemo Alldieck, Patrick Peluse, Gerard Pons-Moll, Lourdes Agapito, Hernan Badino, Fernando De la Torre
Abstract: We present a solution to egocentric 3D body pose estimation from monocular images captured from downward-looking fish-eye cameras installed on the rim of a head-mounted VR device. This unusual viewpoint leads to images with unique visual appearance, with severe self-occlusions and perspective distortions that result in drastic differences in resolution between lower and upper body. We propose an encoder-decoder architecture with a novel multi-branch decoder designed to account for the varying uncertainty in 2D predictions. The quantitative evaluation, on synthetic and real-world datasets, shows that our strategy leads to substantial improvements in accuracy over state-of-the-art egocentric approaches. To tackle the lack of labelled data we also introduced a large photo-realistic synthetic dataset. xR-EgoPose offers high quality renderings of people with diverse skin tones, body shapes and clothing, performing a range of actions. Our experiments show that the high variability in our new synthetic training corpus leads to good generalization to real world footage and to state-of-the-art results on real world datasets with ground truth. Moreover, an evaluation on the Human3.6M benchmark shows that the performance of our method is on par with top performing approaches on the more classic problem of 3D human pose from a third person viewpoint.

18. VEGA: Towards an End-to-End Configurable AutoML Pipeline
Bochao Wang, Hang Xu, Jiajin Zhang, Chen Chen, Xiaozhi Fang, Ning Kang, Lanqing Hong, Wei Zhang, Yong Li, Zhicheng Liu, Zhenguo Li, Weizhi Liu, Tong Zhang
Abstract: Automated Machine Learning (AutoML) is an important industrial solution for automatic discovery and deployment of machine learning models. However, designing an integrated AutoML system faces four great challenges: configurability, scalability, integrability, and platform diversity. In this work, we present VEGA, an efficient and comprehensive AutoML framework that is compatible with and optimized for multiple hardware platforms. a) The VEGA pipeline integrates various modules of AutoML, including Neural Architecture Search (NAS), Hyperparameter Optimization (HPO), Auto Data Augmentation, Model Compression, and Fully Train. b) To support a variety of search algorithms and tasks, we design a novel fine-grained search space and its description language to enable easy adaptation to different search algorithms and tasks. c) We abstract the common components of deep learning frameworks into a unified interface. VEGA can be executed with multiple back-ends and hardware platforms. Extensive benchmark experiments on multiple tasks demonstrate that VEGA can improve the existing AutoML algorithms and discover new high-performance models against SOTA methods, e.g., the searched DNet model zoo for Ascend is 10x faster than EfficientNet-B5 and 9.2x faster than RegNetX-32GF on ImageNet. VEGA is open-sourced at this https URL.

19. Wheat Crop Yield Prediction Using Deep LSTM Model
Sagarika Sharma, Sujit Rai, Narayanan C. Krishnan
Abstract: An in-season early crop yield forecast before harvest can help farmers improve production and enable various agencies to devise plans accordingly. We introduce a reliable and inexpensive method to predict crop yields from publicly available satellite imagery. The proposed method works directly on raw satellite imagery without the need to extract any hand-crafted features or perform dimensionality reduction on the images. The approach implicitly models the relevance of the different steps in the growing season and the various bands in the satellite imagery. We evaluate the proposed approach on tehsil (block) level wheat predictions across several states in India and demonstrate that it outperforms existing methods by over 50%. We also show that incorporating additional contextual information such as the location of farmlands, water bodies, and urban areas helps in improving the yield estimates.

20. Kernel Two-Dimensional Ridge Regression for Subspace Clustering
Chong Peng, Qian Zhang, Zhao Kang, Chenglizhao Chen, Qiang Cheng
Abstract: Subspace clustering methods have been widely studied recently. When the inputs are 2-dimensional (2D) data, existing subspace clustering methods usually convert them into vectors, which severely damages inherent structures and relationships from original data. In this paper, we propose a novel subspace clustering method for 2D data. It directly uses 2D data as inputs such that the learning of representations benefits from inherent structures and relationships of the data. It simultaneously seeks image projection and representation coefficients such that they mutually enhance each other and lead to powerful data representations. An efficient algorithm is developed to solve the proposed objective function with provable decreasing and convergence property. Extensive experimental results verify the effectiveness of the new method.

21. Distribution-aware Margin Calibration for Medical Image Segmentation
Zhibin Li, Litao Yu, Jian Zhang
Abstract: The Jaccard index, also known as Intersection-over-Union (IoU score), is one of the most critical evaluation metrics in medical image segmentation. However, directly optimizing the mean IoU (mIoU) score over multiple objective classes is an open problem. Although some algorithms have been proposed to optimize its surrogates, there is no guarantee provided for their generalization ability. In this paper, we present a novel data-distribution-aware margin calibration method for a better generalization of the mIoU over the whole data-distribution, underpinned by a rigid lower bound. This scheme ensures a better segmentation performance in terms of IoU scores in practice. We evaluate the effectiveness of the proposed margin calibration method on two medical image segmentation datasets, showing substantial improvements of IoU scores over other learning schemes using deep segmentation models.
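For readers unfamiliar with the metric named in this abstract, the following is a minimal sketch of how per-class IoU (the Jaccard index) and its mean over classes are usually computed from predicted and ground-truth labels. It illustrates the evaluation metric only, not the paper's margin calibration method; the function names `iou` and `mean_iou` and the toy label lists are assumptions made for illustration.

```python
def iou(pred, target, cls):
    """Jaccard index (IoU) of one class over two flat label sequences."""
    inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    # A class absent from both sequences has no defined IoU.
    return inter / union if union else float("nan")


def mean_iou(pred, target, classes):
    """Average the per-class IoU, skipping classes absent from both inputs."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if s == s]  # NaN is the only value != itself
    return sum(valid) / len(valid) if valid else float("nan")


# Hypothetical toy labels for six pixels and three classes.
pred = [0, 0, 1, 1, 1, 2]
target = [0, 1, 1, 1, 2, 2]
print(mean_iou(pred, target, classes=[0, 1, 2]))
```

Because the counts inside `iou` are not differentiable with respect to network outputs, the mIoU cannot be used directly as a training loss, which is why surrogates are optimized instead.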

22. Learning Effective Representations from Global and Local Features for Cross-View Gait Recognition
Beibei Lin, Shunli Zhang, Xin Yu, Zedong Chu, Haikun Zhang
Abstract: Gait recognition is one of the most important biometric technologies and has been applied in many fields. Recent gait recognition frameworks represent each human gait frame by descriptors extracted from either global appearances or local regions of humans. However, the representations based on global information often neglect the details of the gait frame, while local region based descriptors cannot capture the relations among neighboring regions, thus reducing their discriminativeness. In this paper, we propose a novel feature extraction and fusion framework to achieve discriminative feature representations for gait recognition. Towards this goal, we take advantage of both global visual information and local region details and develop a Global and Local Feature Extractor (GLFE). Specifically, our GLFE module is composed of multiple newly designed global and local convolutional layers (GLConv) that ensemble global and local features in a principled manner. Furthermore, we present a novel operation, namely Local Temporal Aggregation (LTA), to further preserve the spatial information by reducing the temporal resolution to obtain higher spatial resolution. With the help of our GLFE and LTA, our method significantly improves the discriminativeness of our visual features, thus improving the gait recognition performance. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art gait recognition methods on the widely used CASIA-B and OUMVLP datasets.

23. Learning Deformable Tetrahedral Meshes for 3D Reconstruction
Jun Gao, Wenzheng Chen, Tommy Xiang, Alec Jacobson, Morgan McGuire, Sanja Fidler
Abstract: 3D shape representations that accommodate learning-based 3D reconstruction are an open problem in machine learning and computer graphics. Previous work on neural 3D reconstruction demonstrated benefits, but also limitations, of point cloud, voxel, surface mesh, and implicit function representations. We introduce Deformable Tetrahedral Meshes (DefTet) as a particular parameterization that utilizes volumetric tetrahedral meshes for the reconstruction problem. Unlike existing volumetric approaches, DefTet optimizes for both vertex placement and occupancy, and is differentiable with respect to standard 3D reconstruction loss functions. It is thus simultaneously high-precision, volumetric, and amenable to learning-based neural architectures. We show that it can represent arbitrary, complex topology, is efficient in both memory and computation, and can produce high-fidelity reconstructions with a significantly smaller grid size than alternative volumetric approaches. The predicted surfaces are also inherently defined as tetrahedral meshes, and thus do not require post-processing. We demonstrate that DefTet matches or exceeds both the quality of the previous best approaches and the performance of the fastest ones. Our approach obtains high-quality tetrahedral meshes computed directly from noisy point clouds, and is the first to showcase high-quality 3D tet-mesh results using only a single image as input.

24. Developing High Quality Training Samples for Deep Learning Based Local Climate Classification in Korea
Minho Kim, Doyoung Jeong, Hyoungwoo Choi, Yongil Kim
Abstract: Two out of three people will be living in urban areas by 2050, as projected by the United Nations, emphasizing the need for sustainable urban development and monitoring. Common urban footprint data provide high-resolution city extents but lack essential information on the distribution, pattern, and characteristics. The Local Climate Zone (LCZ) offers an efficient and standardized framework that can delineate the internal structure and characteristics of urban areas. Global-scale LCZ mapping has been explored, but it is limited by low accuracy, variable labeling quality, or domain adaptation challenges. Instead, this study developed custom LCZ data to map key Korean cities using a multi-scale convolutional neural network. Results demonstrated that using novel, custom LCZ data with deep learning can generate more accurate LCZ map results than conventional community-based LCZ mapping with machine learning, as well as transfer learning from the global So2Sat dataset.

25. "You eat with your eyes first": Optimizing Yelp Image Advertising
Gaurab Banerjee, Samuel Spinner, Yasmine Mitchell
Abstract: A business's online, photographic representation can play a crucial role in its success or failure. We use Yelp's image dataset and star-based review system as a measurement of an image's effectiveness in promoting a business. After preprocessing the Yelp dataset, we use transfer learning to train a classifier which accepts Yelp images and predicts star-ratings. Additionally, we then train a GAN to qualitatively investigate the common properties of highly effective images. We achieve 90-98% accuracy in classifying simplified star ratings for various image categories and observe that images containing blue skies, open surroundings, and many windows are correlated with higher Yelp reviews.

26. In Defense of Feature Mimicking for Knowledge Distillation
Guo-Hua Wang, Yifan Ge, Jianxin Wu
Abstract: Knowledge distillation (KD) is a popular method to train efficient networks ("student") with the help of high-capacity networks ("teacher"). Traditional methods use the teacher's soft logit as extra supervision to train the student network. In this paper, we argue that it is more advantageous to make the student mimic the teacher's features in the penultimate layer. Not only can the student directly learn more effective information from the teacher feature, but feature mimicking can also be applied to teachers trained without a softmax layer. Experiments show that it can achieve higher accuracy than traditional KD. To further facilitate feature mimicking, we decompose a feature vector into the magnitude and the direction. We argue that the teacher should give more freedom to the student feature's magnitude, and let the student pay more attention to mimicking the feature direction. To meet this requirement, we propose a loss term based on locality-sensitive hashing (LSH). With the help of this new loss, our method indeed mimics feature directions more accurately, relaxes constraints on feature magnitudes, and achieves state-of-the-art distillation accuracy.
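The magnitude/direction decomposition mentioned in this abstract can be illustrated with a short sketch. It shows only the decomposition and a plain cosine comparison of directions; it is not the paper's LSH-based loss, and the helper names `decompose` and `direction_similarity` as well as the example vectors are assumptions made for illustration.

```python
import math


def decompose(feature):
    """Split a feature vector into its L2 magnitude and unit direction."""
    magnitude = math.sqrt(sum(x * x for x in feature))
    if magnitude == 0.0:
        raise ValueError("a zero vector has no direction")
    direction = [x / magnitude for x in feature]
    return magnitude, direction


def direction_similarity(a, b):
    """Cosine similarity: dot product of the two unit directions."""
    _, dir_a = decompose(a)
    _, dir_b = decompose(b)
    return sum(x * y for x, y in zip(dir_a, dir_b))


# Hypothetical penultimate-layer features: same direction, different magnitude.
teacher = [3.0, 4.0]
student = [6.0, 8.0]
print(decompose(teacher)[0], decompose(student)[0])
print(direction_similarity(teacher, student))
```

In this toy case the student feature is twice as long as the teacher feature yet points the same way, so a direction-only criterion would treat it as a perfect match while a plain L2 feature loss would not.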

27. Content-based Analysis of the Cultural Differences between TikTok and Douyin
Li Sun, Haoqi Zhang, Songyang Zhang, Jiebo Luo
Abstract: Short-form video social media shifts away from the traditional media paradigm by telling the audience a dynamic story to attract their attention. In particular, different combinations of everyday objects can be employed to represent a unique scene that is both interesting and understandable. Offered by the same company, TikTok and Douyin are popular examples of such new media that has become popular in recent years, while being tailored for different markets (e.g. the United States and China). The hypothesis that they express cultural differences together with media fashion and social idiosyncrasy is the primary target of our research. To that end, we first employ the Faster Regional Convolutional Neural Network (Faster R-CNN) pre-trained with the Microsoft Common Objects in COntext (MS-COCO) dataset to perform object detection. Based on a suite of objects detected from videos, we perform statistical analysis including label statistics, label similarity, and label-person distribution. We further use the Two-Stream Inflated 3D ConvNet (I3D) pre-trained with the Kinetics dataset to categorize and analyze human actions. By comparing the distributional results of TikTok and Douyin, we uncover a wealth of similarity and contrast between the two closely related video social media platforms along the content dimensions of object quantity, object categories, and human action categories.

28. Out-of-Distribution Detection for Automotive Perception
Julia Nitsch, Masha Itkina, Ransalu Senanayake, Juan Nieto, Max Schmidt, Roland Siegwart, Mykel J. Kochenderfer, Cesar Cadena
Abstract: Neural networks (NNs) are widely used for object recognition tasks in autonomous driving. However, NNs can fail on input data not well represented by the training dataset, known as out-of-distribution (OOD) data. A mechanism to detect OOD samples is important in safety-critical applications, such as automotive perception, in order to trigger a safe fallback mode. NNs often rely on softmax normalization for confidence estimation, which can lead to high confidences being assigned to OOD samples, thus hindering the detection of failures. This paper presents a simple but effective method for determining whether inputs are OOD. We propose an OOD detection approach that combines auxiliary training techniques with post hoc statistics. Unlike other approaches, our proposed method does not require OOD data during training, and it does not increase the computational cost during inference. The latter property is especially important in automotive applications with limited computational resources and real-time constraints. Our proposed method outperforms state-of-the-art methods on real world automotive datasets.

29. BIGPrior: Towards Decoupling Learned Prior Hallucination and Data Fidelity in Image Restoration
Majed El Helou, Sabine Süsstrunk
Abstract: Image restoration encompasses fundamental image processing tasks that have been addressed with different algorithms and deep learning methods. Classical restoration algorithms leverage a variety of priors, either implicitly or explicitly. Their priors are hand-designed and their corresponding weights are heuristically assigned. Thus, deep learning methods often produce superior restoration quality. Deep networks are, however, capable of strong and hardly-predictable hallucinations. Networks jointly and implicitly learn to be faithful to the observed data while learning an image prior, and the separation of original and hallucinated data downstream is then not possible. This limits their wide-spread adoption in restoration applications. Furthermore, it is often the hallucinated part that is victim to degradation-model overfitting. We present an approach with decoupled network-prior hallucination and data fidelity. We refer to our framework as the Bayesian Integration of a Generative Prior (BIGPrior). Our BIGPrior method is rooted in a Bayesian restoration framework, and tightly connected to classical restoration methods. In fact, our approach can be viewed as a generalization of a large family of classical restoration algorithms. We leverage a recent network inversion method to extract image prior information from a generative network. We show on image colorization, inpainting, and denoising that our framework consistently improves the prior results through good integration of data fidelity. Our method, though partly reliant on the quality of the generative network inversion, is competitive with state-of-the-art supervised and task-specific restoration methods. It also provides an additional metric that sets forth the degree of prior reliance per pixel. Indeed, the per pixel contributions of the decoupled data fidelity and prior terms are readily available in our proposed framework.

30. Faraway-Frustum: Dealing with Lidar Sparsity for 3D Object Detection using Fusion
Haolin Zhang, Dongfang Yang, Ekim Yurtsever, Keith A. Redmill, Ümit Özgüner
Abstract: Learned pointcloud representations do not generalize well with an increase in distance to the sensor. For example, at a range greater than 60 meters, the sparsity of lidar pointclouds reaches a point where even humans cannot discern object shapes from each other. However, this distance should not be considered very far for fast-moving vehicles: a vehicle can traverse 60 meters in under two seconds while moving at 70 mph. For safe and robust driving automation, acute 3D object detection at these ranges is indispensable. Against this backdrop, we introduce faraway-frustum: a novel fusion strategy for detecting faraway objects. The main strategy is to rely solely on 2D vision for recognizing the object class, as object shape does not change drastically with an increase in depth, and to use pointcloud data for object localization in 3D space for faraway objects. For closer objects, we use learned pointcloud representations instead, following the state of the art. This strategy alleviates the main shortcoming of object detection with learned pointcloud representations. Experiments on the KITTI dataset demonstrate that our method outperforms the state of the art by a considerable margin for faraway object detection in bird's-eye-view and 3D.
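The timing claim in this abstract (60 meters covered in under two seconds at 70 mph) can be checked with a short unit conversion. The sketch below is only that arithmetic; the function name `travel_time_s` is an assumption, and the conversion factor is the standard 1609.344 meters per international mile.

```python
METERS_PER_MILE = 1609.344  # international mile
SECONDS_PER_HOUR = 3600.0


def travel_time_s(distance_m, speed_mph):
    """Seconds needed to cover distance_m at a constant speed in mph."""
    speed_mps = speed_mph * METERS_PER_MILE / SECONDS_PER_HOUR
    return distance_m / speed_mps


# 70 mph is about 31.29 m/s, so 60 m takes roughly 1.92 s.
print(round(travel_time_s(60.0, 70.0), 2))
```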

31. Parameter Efficient Deep Neural Networks with Bilinear Projections
Litao Yu, Yongsheng Gao, Jun Zhou, Jian Zhang
Abstract: Recent research on deep neural networks (DNNs) has primarily focused on improving the model accuracy. Given a proper deep learning framework, it is generally possible to increase the depth or layer width to achieve a higher level of accuracy. However, the huge number of model parameters imposes more computational and memory usage overhead and leads to parameter redundancy. In this paper, we address the parameter redundancy problem in DNNs by replacing conventional full projections with bilinear projections. For a fully-connected layer with $D$ input nodes and $D$ output nodes, applying bilinear projection can reduce the model space complexity from $\mathcal{O}(D^2)$ to $\mathcal{O}(2D)$, achieving a deep model with a sub-linear layer size. However, structured projection has fewer degrees of freedom than the full projection, causing the under-fitting problem. So we simply scale up the mapping size by increasing the number of output channels, which can maintain and even boost the model accuracy. This makes it very parameter-efficient and handy to deploy such deep models on mobile systems with memory limitations. Experiments on four benchmark datasets show that applying the proposed bilinear projection to deep neural networks can achieve even higher accuracies than conventional full DNNs, while significantly reducing the model size.
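The parameter reduction from $\mathcal{O}(D^2)$ to $\mathcal{O}(2D)$ quoted in this abstract can be made concrete with a small sketch. It assumes the common form of a bilinear projection, in which a length-$D$ input is reshaped into a $\sqrt{D} \times \sqrt{D}$ matrix $X$ and mapped to $U^\top X V$ with two $\sqrt{D} \times \sqrt{D}$ weight matrices; the abstract does not spell out this form, so the reshaping, the square shape, the omission of biases, and all function names are assumptions made for illustration.

```python
import math


def full_params(d):
    """Weights in a dense D -> D layer (biases ignored)."""
    return d * d


def bilinear_params(d):
    """Weights in U and V when the input is reshaped to sqrt(D) x sqrt(D)."""
    side = math.isqrt(d)
    if side * side != d:
        raise ValueError("this sketch assumes D is a perfect square")
    return 2 * side * side  # equals 2 * D


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def bilinear_project(x, u, v):
    """Compute U^T X V for nested-list matrices."""
    return matmul(matmul(transpose(u), x), v)


print(full_params(1024), bilinear_params(1024))

# Toy 2 x 2 example: identity on the left, column swap on the right.
x = [[1.0, 2.0], [3.0, 4.0]]
u = [[1.0, 0.0], [0.0, 1.0]]
v = [[0.0, 1.0], [1.0, 0.0]]
print(bilinear_project(x, u, v))
```

For $D = 1024$ the dense layer holds 1,048,576 weights against 2,048 for the two bilinear factors, which also illustrates why the structured map has fewer degrees of freedom and may under-fit unless the output size is scaled up.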

32. Dual Attention on Pyramid Feature Maps for Image Captioning
Litao Yu, Jian Zhang, Qiang Wu
Abstract: Generating natural sentences from images is a fundamental learning task for visual-semantic understanding in multimedia. In this paper, we propose to apply dual attention on pyramid image feature maps to fully explore the visual-semantic correlations and improve the quality of generated sentences. Specifically, with the full consideration of the contextual information provided by the hidden state of the RNN controller, the pyramid attention can better localize the visually indicative and semantically consistent regions in images. On the other hand, the contextual information can help re-calibrate the importance of feature components by learning the channel-wise dependencies, to improve the discriminative power of visual features for better content description. We conducted comprehensive experiments on three well-known datasets: Flickr8K, Flickr30K and MS COCO, which achieved impressive results in generating descriptive and smooth natural sentences from images. Using either convolution visual features or more informative bottom-up attention features, our composite captioning model achieves very promising performance in a single-model mode. The proposed pyramid attention and dual attention methods are highly modular, which can be inserted into various image captioning modules to further improve the performance.

33. Unsupervised Monocular Depth Learning with Integrated Intrinsics and Spatio-Temporal Constraints
Kenny Chen, Alexandra Pogue, Brett T. Lopez, Ali-akbar Agha-mohammadi, Ankur Mehta
Abstract: Monocular depth inference has gained tremendous attention from researchers in recent years and remains a promising replacement for expensive time-of-flight sensors, but issues with scale acquisition and implementation overhead still plague these systems. To this end, this work presents an unsupervised learning framework that is able to predict at-scale depth maps and egomotion, in addition to camera intrinsics, from a sequence of monocular images via a single network. Our method incorporates both spatial and temporal geometric constraints to resolve depth and pose scale factors, which are enforced within the supervisory reconstruction loss functions at training time. Only unlabeled stereo sequences are required for training the weights of our single-network architecture, which reduces overall implementation overhead compared to previous methods. Our results demonstrate strong performance compared to the current state of the art on multiple sequences of the KITTI driving dataset.

34. Recyclable Waste Identification Using CNN Image Recognition and Gaussian Clustering
Yuheng Wang, Wen Jie Zhao, Jiahui Xu, Raymond Hong
Abstract: Waste recycling is an important way of saving energy and materials in the production process. In general, recyclable objects are mixed with unrecyclable objects, which raises a need for identification and classification. This paper proposes a convolutional neural network (CNN) model to complete both tasks. The model uses transfer learning from a pretrained ResNet-50 CNN to complete feature extraction. A subsequent fully connected layer for classification was trained on the augmented TrashNet dataset [1]. In the application, a sliding window is used for image segmentation in the pre-classification stage. In the post-classification stage, the labelled sample points are integrated with Gaussian Clustering to locate the object. The resulting model has achieved an overall detection rate of 48.4% in simulation and a final classification accuracy of 92.4%.

35. Revisiting Adaptive Convolutions for Video Frame Interpolation
Simon Niklaus, Long Mai, Oliver Wang
Abstract: Video frame interpolation, the synthesis of novel views in time, is an increasingly popular research direction with many new papers further advancing the state of the art. But as each new method comes with a host of variables that affect the interpolation quality, it can be hard to tell what is actually important for this task. In this work, we show, somewhat surprisingly, that it is possible to achieve near state-of-the-art results with an older, simpler approach, namely adaptive separable convolutions, by a subtle set of low level improvements. In doing so, we propose a number of intuitive but effective techniques to improve the frame interpolation quality, which also have potential for other related applications of adaptive convolutions such as burst image denoising, joint image filtering, or video prediction.

36. Generating Unobserved Alternatives: A Case Study through Super-Resolution and Decompression
Shichong Peng, Ke Li
Abstract: We consider problems where multiple predictions can be considered correct, but only one of them is given as supervision. This setting differs from both the regression and class-conditional generative modelling settings: in the former, there is a unique observed output for each input, which is provided as supervision; in the latter, there are many observed outputs for each input, and many are provided as supervision. Applying either regression methods or conditional generative models to the present setting often results in a model that can only make a single prediction for each input. We explore several problems that have this property and develop an approach that can generate multiple high-quality predictions given the same input. As a result, it can be used to generate high-quality outputs that are different from the observed output.

37. Classifier Pool Generation based on a Two-level Diversity Approach
Marcos Monteiro, Alceu S. Britto Jr, Jean P. Barddal, Luiz S. Oliveira, Robert Sabourin
Abstract: This paper describes a classifier pool generation method guided by the diversity estimated on the data complexity and classifier decisions. First, the behavior of complexity measures is assessed by considering several subsamples of the dataset. The complexity measures with high variability across the subsamples are selected for posterior pool adaptation, where an evolutionary algorithm optimizes diversity in both complexity and decision spaces. A robust experimental protocol with 28 datasets and 20 replications is used to evaluate the proposed method. Results show significant accuracy improvements in 69.4% of the experiments when Dynamic Classifier Selection and Dynamic Ensemble Selection methods are applied.

38. Deep Joint Transmission-Recognition for Multi-View Cameras
Ezgi Ozyilkan, Mikolaj Jankowski
Abstract: We propose joint transmission-recognition schemes for efficient inference at the wireless edge. Motivated by the surveillance applications with wireless cameras, we consider the person classification task over a wireless channel carried out by multi-view cameras operating as edge devices. We introduce deep neural network (DNN) based compression schemes which incorporate digital (separate) transmission and joint source-channel coding (JSCC) methods. We evaluate the proposed device-edge communication schemes under different channel SNRs, bandwidth and power constraints. We show that the JSCC schemes not only improve the end-to-end accuracy but also simplify the encoding process and provide graceful degradation with channel quality.

39. A Comprehensive Study of Class Incremental Learning Algorithms for Visual Tasks
Eden Belouadah, Adrian Popescu, Ioannis Kanellos
Abstract: The ability of artificial agents to increment their capabilities when confronted with new data is an open challenge in artificial intelligence. The main challenge faced in such cases is catastrophic forgetting, i.e., the tendency of neural networks to underfit past data when new data are ingested. A first group of approaches tackles catastrophic forgetting by increasing deep model capacity to accommodate new knowledge. A second type of approach fixes the deep model size and introduces a mechanism whose objective is to ensure a good compromise between stability and plasticity of the model. While the first type of algorithm has been compared thoroughly, this is not the case for methods that exploit a fixed-size model. Here, we focus on the latter, place them in a common conceptual and experimental framework and propose the following contributions: (1) define six desirable properties of incremental learning algorithms and analyze them according to these properties, (2) introduce a unified formalization of the class-incremental learning problem, (3) propose a common evaluation framework which is more thorough than existing ones in terms of number of datasets, size of datasets, size of bounded memory and number of incremental states, (4) investigate the usefulness of herding for past exemplars selection, (5) provide experimental evidence that it is possible to obtain competitive performance without the use of knowledge distillation to tackle catastrophic forgetting, and (6) facilitate reproducibility by integrating all tested methods in a common open-source repository. The main experimental finding is that none of the existing algorithms achieves the best results in all evaluated settings. Important differences arise, notably depending on whether a bounded memory of past classes is allowed.
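Summary: Enabling artificial agents to extend their capabilities when confronted with new data remains an open challenge in artificial intelligence. The main difficulty is catastrophic forgetting, i.e., the tendency of neural networks to underfit past data when new data are ingested. A first group of approaches tackles catastrophic forgetting by increasing the capacity of the deep model to accommodate new knowledge. A second group fixes the deep model size and introduces a mechanism that aims for a good compromise between the stability and plasticity of the model. While the first group has been compared thoroughly, this is not the case for methods that use a fixed-size model. Here, we focus on the latter, place them in a common conceptual and experimental framework, and make the following contributions: (1) define six desirable properties of incremental learning algorithms and analyze the algorithms according to these properties, (2) introduce a unified formalization of the class-incremental learning problem, (3) propose a common evaluation framework that is more thorough than existing ones in terms of the number of datasets, the size of datasets, the size of the bounded memory and the number of incremental states, (4) investigate the usefulness of herding for selecting past exemplars, (5) provide experimental evidence that competitive performance can be obtained without using knowledge distillation to tackle catastrophic forgetting, and (6) facilitate reproducibility by integrating all tested methods in a common open-source repository. The main experimental finding is that no existing algorithm achieves the best results in all evaluated settings. Notable differences arise depending on whether a bounded memory of past classes is allowed.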
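
Contribution (4) in the abstract above concerns herding for exemplar selection. As a rough illustration, and not the authors' implementation, the sketch below shows one common greedy form of herding: at each step it picks the sample that keeps the mean of the selected exemplars closest to the class mean. The function name `herding_selection`, the list-of-lists feature format and the toy data are assumptions made for this example.

```python
def herding_selection(features, budget):
    """Greedily pick `budget` sample indices whose running mean tracks the class mean.

    `features` is a list of equal-length feature vectors (lists of floats)
    that all belong to one class. Hypothetical helper for illustration only.
    """
    n = len(features)
    dim = len(features[0])
    # Mean feature vector of the whole class.
    class_mean = [sum(f[d] for f in features) / n for d in range(dim)]

    selected = []
    running_sum = [0.0] * dim
    for k in range(1, min(budget, n) + 1):
        best_idx, best_dist = None, float("inf")
        for i, f in enumerate(features):
            if i in selected:
                continue
            # Squared distance between the class mean and the mean of the
            # exemplar set if sample i were added as the k-th exemplar.
            dist = sum(
                ((running_sum[d] + f[d]) / k - class_mean[d]) ** 2
                for d in range(dim)
            )
            if dist < best_dist:
                best_idx, best_dist = i, dist
        selected.append(best_idx)
        running_sum = [running_sum[d] + features[best_idx][d] for d in range(dim)]
    return selected


# Toy 1-D features; the class mean is 3.25.
toy_features = [[0.0], [1.0], [2.0], [10.0]]
print(herding_selection(toy_features, budget=2))  # → [2, 1]
```

With a bounded memory, `budget` would correspond to the number of exemplars kept per past class; how that budget is set is outside the scope of this sketch.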

40. Point of Care Image Analysis for COVID-19 [PDF] Back to contents
Daniel Yaron, Daphna Keidar, Elisha Goldstein, Yair Shachar, Ayelet Blass, Oz Frank, Nir Schipper, Nogah Shabshin, Ahuva Grubstein, Dror Suhami, Naama R. Bogot, Eyal Sela, Amiel A. Dror, Mordehay Vaturi, Federico Mento, Elena Torri, Riccardo Inchingolo, Andrea Smargiassi, Gino Soldati, Tiziano Perrone, Libertario Demi, Meirav Galun, Shai Bagon, Yishai M. Elyada, Yonina C. Eldar
Abstract: Early detection of COVID-19 is key in containing the pandemic. Disease detection and evaluation based on imaging is fast and cheap and therefore plays an important role in COVID-19 handling. COVID-19 is easier to detect in chest CT, however, it is expensive, non-portable, and difficult to disinfect, making it unfit as a point-of-care (POC) modality. On the other hand, chest X-ray (CXR) and lung ultrasound (LUS) are widely used, yet, COVID-19 findings in these modalities are not always very clear. Here we train deep neural networks to significantly enhance the capability to detect, grade and monitor COVID-19 patients using CXRs and LUS. Collaborating with several hospitals in Israel we collect a large dataset of CXRs and use this dataset to train a neural network obtaining above 90% detection rate for COVID-19. In addition, in collaboration with ULTRa (Ultrasound Laboratory Trento, Italy) and hospitals in Italy we obtained POC ultrasound data with annotations of the severity of disease and trained a deep network for automatic severity grading.
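Summary: Early detection of COVID-19 is key to containing the pandemic. Imaging-based disease detection and evaluation is fast and cheap and therefore plays an important role in handling COVID-19. COVID-19 is easier to detect in chest CT, but CT is expensive, non-portable and difficult to disinfect, which makes it unsuitable as a point-of-care (POC) modality. Chest X-ray (CXR) and lung ultrasound (LUS), on the other hand, are widely used, yet COVID-19 findings in these modalities are not always clear. Here we train deep neural networks to significantly enhance the ability to detect, grade and monitor COVID-19 patients using CXR and LUS. In collaboration with several hospitals in Israel, we collected a large dataset of CXRs and used it to train a neural network that achieves a detection rate above 90% for COVID-19. In addition, in collaboration with ULTRa (Ultrasound Laboratory Trento, Italy) and hospitals in Italy, we obtained POC ultrasound data annotated with disease severity and trained a deep network for automatic severity grading.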

41. Predicting intubation support requirement of patients using Chest X-ray with Deep Representation Learning [PDF] Back to contents
Aniket Maurya
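Abstract: Recent developments in medical imaging with Deep Learning present evidence of automated diagnosis and prognosis. It can also be a complement to currently available diagnosis methods. Deep Learning can be leveraged for diagnosis, severity prediction, intubation support prediction and many similar tasks. We present prediction of intubation support requirement for patients from the Chest X-ray using Deep representation learning. We release our source code publicly at this https URL.
Summary: Recent developments in medical imaging with deep learning provide evidence for automated diagnosis and prognosis. Deep learning can also complement currently available diagnostic methods. It can be leveraged for diagnosis, severity prediction, intubation support prediction and many similar tasks. We present the prediction of intubation support requirements for patients from chest X-rays using deep representation learning. We release our source code publicly at this https URL.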

42. Solving Inverse Problems with Hybrid Deep Image Priors: the challenge of preventing overfitting [PDF] Back to contents
Zhaodong Sun
Abstract: We mainly analyze and solve the overfitting problem of deep image prior (DIP). Deep image prior can solve inverse problems such as super-resolution, inpainting and denoising. The main advantage of DIP over other deep learning approaches is that it does not need access to a large dataset. However, due to the large number of parameters of the neural network and noisy data, DIP overfits to the noise in the image as the number of iterations grows. In the thesis, we use hybrid deep image priors to avoid overfitting. The hybrid priors are to combine DIP with an explicit prior such as total variation or with an implicit prior such as a denoising algorithm. We use the alternating direction method-of-multipliers (ADMM) to incorporate the new prior and try different forms of ADMM to avoid extra computation caused by the inner loop of ADMM steps. We also study the relation between the dynamics of gradient descent, and the overfitting phenomenon. The numerical results show the hybrid priors play an important role in preventing overfitting. Besides, we try to fit the image along some directions and find this method can reduce overfitting when the noise level is large. When the noise level is small, it does not considerably reduce the overfitting problem.
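Summary: We mainly analyze and address the overfitting problem of the deep image prior (DIP). DIP can solve inverse problems such as super-resolution, inpainting and denoising. Its main advantage over other deep learning approaches is that it does not need access to a large dataset. However, because of the large number of network parameters and the noisy data, DIP overfits to the noise in the image as the number of iterations grows. In this thesis, we use hybrid deep image priors to avoid overfitting. The hybrid priors combine DIP with an explicit prior such as total variation, or with an implicit prior such as a denoising algorithm. We use the alternating direction method of multipliers (ADMM) to incorporate the new prior, and try different forms of ADMM to avoid the extra computation caused by the inner loop of the ADMM steps. We also study the relation between the dynamics of gradient descent and the overfitting phenomenon. The numerical results show that the hybrid priors play an important role in preventing overfitting. In addition, we try fitting the image along certain directions and find that this reduces overfitting when the noise level is high; when the noise level is low, it does not considerably reduce overfitting.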
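
The abstract above names total variation as an example of an explicit prior that can be combined with DIP. As a minimal sketch of what such a prior measures, and not the thesis' implementation, the code below computes the anisotropic total variation of a small 2-D image stored as a list of rows. The function name `total_variation` and the toy images are assumptions for illustration; how the term is weighted and plugged into ADMM is not shown.

```python
def total_variation(img):
    """Anisotropic total variation: sum of absolute differences between
    vertically and horizontally adjacent pixels of a 2-D image."""
    rows, cols = len(img), len(img[0])
    tv = 0.0
    for r in range(rows):
        for c in range(cols):
            if r + 1 < rows:  # difference to the pixel below
                tv += abs(img[r + 1][c] - img[r][c])
            if c + 1 < cols:  # difference to the pixel on the right
                tv += abs(img[r][c + 1] - img[r][c])
    return tv


flat = [[1.0, 1.0], [1.0, 1.0]]          # constant image
checkerboard = [[0.0, 1.0], [1.0, 0.0]]  # rapidly oscillating image
print(total_variation(flat))          # → 0.0
print(total_variation(checkerboard))  # → 4.0
```

Noise tends to raise this quantity while piecewise-smooth images keep it low, which is why penalizing it can discourage a reconstruction from fitting noise.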

43. Convolution Neural Networks for Semantic Segmentation: Application to Small Datasets of Biomedical Images [PDF] Back to contents
Vitaly Nikolaev
Abstract: This thesis studies how the segmentation results produced by convolutional neural networks (CNNs) differ from each other when applied to small biomedical datasets. We use different architectures, parameters and hyper-parameters, trying to find the better configurations for our task and to uncover underlying regularities. The two working datasets come from the biomedical area of research. We conducted many experiments with the two types of networks, and the results obtained show that some experimental conditions and network parameters are preferable to others. All test results are given in tables, and selected graphs and segmentation predictions are shown for better illustration.
Summary: This thesis studies how the segmentation results produced by convolutional neural networks (CNNs) differ when the networks are applied to small biomedical datasets. We use different architectures, parameters and hyper-parameters to find better configurations for our task and to identify underlying regularities. The two working datasets come from biomedical research. We ran many experiments with the two types of networks, and the results show that some experimental conditions and network parameters are preferable to others. All test results are given in tables, and selected graphs and segmentation predictions are shown for illustration.

44. Generalized Wasserstein Dice Score, Distributionally Robust Deep Learning, and Ranger for brain tumor segmentation: BraTS 2020 challenge [PDF] Back to contents
Lucas Fidon, Sebastien Ourselin, Tom Vercauteren
Abstract: Training a deep neural network is an optimization problem with four main ingredients: the design of the deep neural network, the per-sample loss function, the population loss function, and the optimizer. However, methods developed to compete in recent BraTS challenges tend to focus only on the design of deep neural network architectures, while paying less attention to the three other aspects. In this paper, we experimented with adopting the opposite approach. We stuck to a generic and state-of-the-art 3D U-Net architecture and experimented with a non-standard per-sample loss function, the generalized Wasserstein Dice loss, a non-standard population loss function, corresponding to distributionally robust optimization, and a non-standard optimizer, Ranger. Those variations were selected specifically for the problem of multi-class brain tumor segmentation. The generalized Wasserstein Dice loss is a per-sample loss function that allows taking advantage of the hierarchical structure of the tumor regions labeled in BraTS. Distributionally robust optimization is a generalization of empirical risk minimization that accounts for the presence of underrepresented subdomains in the training dataset. Ranger is a generalization of the widely used Adam optimizer that is more stable with small batch size and noisy labels. We found that each of those variations of the optimization of deep neural networks for brain tumor segmentation leads to improvements in terms of Dice scores and Hausdorff distances. With an ensemble of three deep neural networks trained with various optimization procedures, we achieved promising results on the validation dataset of the BraTS 2020 challenge. Our ensemble ranked fourth out of the 693 registered teams for the segmentation task of the BraTS 2020 challenge.
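Summary: Training a deep neural network is an optimization problem with four main ingredients: the design of the deep neural network, the per-sample loss function, the population loss function, and the optimizer. However, methods developed for recent BraTS challenges tend to focus only on the design of deep neural network architectures and pay less attention to the other three aspects. In this paper, we experimented with the opposite approach. We kept a generic, state-of-the-art 3D U-Net architecture and experimented with a non-standard per-sample loss function (the generalized Wasserstein Dice loss), a non-standard population loss function (corresponding to distributionally robust optimization) and a non-standard optimizer (Ranger). These variations were selected specifically for the problem of multi-class brain tumor segmentation. The generalized Wasserstein Dice loss is a per-sample loss function that exploits the hierarchical structure of the tumor regions labeled in BraTS. Distributionally robust optimization is a generalization of empirical risk minimization that accounts for underrepresented subdomains in the training dataset. Ranger is a generalization of the widely used Adam optimizer that is more stable with small batch sizes and noisy labels. We found that each of these variations in the optimization of deep neural networks for brain tumor segmentation improves Dice scores and Hausdorff distances. With an ensemble of three deep neural networks trained with different optimization procedures, we achieved promising results on the validation dataset of the BraTS 2020 challenge. Our ensemble ranked fourth out of the 693 registered teams in the segmentation task of the BraTS 2020 challenge.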
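
The Dice score mentioned in the abstract above measures the overlap between a predicted and a reference segmentation. The sketch below computes the standard binary Dice score, 2|A ∩ B| / (|A| + |B|), on flattened masks. It is not the generalized Wasserstein Dice loss used by the authors; the function name `dice_score` and the convention of returning 1.0 when both masks are empty are assumptions made for this example.

```python
def dice_score(pred, target):
    """Binary Dice score between two equally long sequences of 0/1 labels."""
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if total == 0:
        # Both masks are empty; treated here as perfect agreement (assumption).
        return 1.0
    return 2.0 * intersection / total


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 1, 0]
print(dice_score(pred_mask, true_mask))  # → 0.5
```

A score of 1.0 means the two masks coincide and 0.0 means they do not overlap at all.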

45. Recent Advances in Understanding Adversarial Robustness of Deep Neural Networks [PDF] Back to contents
Tao Bai, Jinqi Luo, Jun Zhao
Abstract: Adversarial examples are inevitable on the road of pervasive applications of deep neural networks (DNN). Imperceptible perturbations applied on natural samples can lead DNN-based classifiers to output wrong prediction with fair confidence score. It is increasingly important to obtain models with high robustness that are resistant to adversarial examples. In this paper, we survey recent advances in how to understand such intriguing property, i.e. adversarial robustness, from different perspectives. We give preliminary definitions on what adversarial attacks and robustness are. After that, we study frequently-used benchmarks and mention theoretically-proved bounds for adversarial robustness. We then provide an overview on analyzing correlations among adversarial robustness and other critical indicators of DNN models. Lastly, we introduce recent arguments on potential costs of adversarial training which have attracted wide attention from the research community.
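Summary: Adversarial examples are unavoidable on the road to pervasive applications of deep neural networks (DNNs). Imperceptible perturbations applied to natural samples can lead DNN-based classifiers to output wrong predictions with fairly high confidence scores. It is increasingly important to obtain highly robust models that resist adversarial examples. In this paper, we survey recent advances in understanding this intriguing property, adversarial robustness, from different perspectives. We give preliminary definitions of adversarial attacks and robustness. We then study frequently used benchmarks and mention theoretically proven bounds for adversarial robustness. Next, we provide an overview of analyses of the correlations between adversarial robustness and other critical indicators of DNN models. Lastly, we introduce recent arguments about the potential costs of adversarial training, which have attracted wide attention from the research community.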
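
To make the notion of an adversarial perturbation concrete, the sketch below applies a sign-gradient step, in the spirit of the fast gradient sign method, to a toy linear classifier. The survey abstract does not prescribe this attack; the model, its weights, the input and `epsilon` are illustrative assumptions. For a linear score w·x + b with label y in {-1, +1} and any loss that decreases with the margin y(w·x + b), the gradient of the loss with respect to x points along -y·w, so the perturbed input is x - epsilon·y·sign(w).

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def linear_score(w, b, x):
    """Score of a linear classifier; a positive score means class +1."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def sign_gradient_attack(w, x, y, epsilon):
    """Move every coordinate of x by at most epsilon in the direction
    that reduces the margin y * (w.x + b) of a linear classifier."""
    return [xi - epsilon * y * sign(wi) for wi, xi in zip(w, x)]


# Toy model and input (assumed values for illustration).
w, b = [1.0, -1.0], 0.0
x, y = [0.2, 0.1], 1
x_adv = sign_gradient_attack(w, x, y, epsilon=0.1)

print(linear_score(w, b, x) > 0)      # → True  (clean input is classified as +1)
print(linear_score(w, b, x_adv) > 0)  # → False (perturbed input is no longer +1)
```

Each coordinate changes by at most `epsilon`, which is the sense in which such a perturbation is small; for images and deep networks the gradient would come from backpropagation rather than a closed form.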

46. MACE: Model Agnostic Concept Extractor for Explaining Image Classification Networks [PDF] Back to contents
Ashish Kumar, Karan Sehgal, Prerna Garg, Vidhya Kamakshi, Narayanan C Krishnan
Abstract: Deep convolutional networks have been quite successful at various image classification tasks. The current methods to explain the predictions of a pre-trained model rely on gradient information, often resulting in saliency maps that focus on the foreground object as a whole. However, humans typically reason by dissecting an image and pointing out the presence of smaller concepts. The final output is often an aggregation of the presence or absence of these smaller concepts. In this work, we propose MACE: a Model Agnostic Concept Extractor, which can explain the working of a convolutional network through smaller concepts. The MACE framework dissects the feature maps generated by a convolution network for an image to extract concept based prototypical explanations. Further, it estimates the relevance of the extracted concepts to the pre-trained model's predictions, a critical aspect required for explaining the individual class predictions, missing in existing approaches. We validate our framework using VGG16 and ResNet50 CNN architectures, and on datasets like Animals With Attributes 2 (AWA2) and Places365. Our experiments demonstrate that the concepts extracted by the MACE framework increase the human interpretability of the explanations, and are faithful to the underlying pre-trained black-box model.
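Summary: Deep convolutional networks have been quite successful at various image classification tasks. Current methods for explaining the predictions of a pre-trained model rely on gradient information, which often yields saliency maps that focus on the foreground object as a whole. Humans, however, typically reason by dissecting an image and pointing out the presence of smaller concepts, and the final output is often an aggregation of the presence or absence of these concepts. In this work, we propose MACE, a Model Agnostic Concept Extractor that can explain the working of a convolutional network through smaller concepts. The MACE framework dissects the feature maps that a convolutional network generates for an image in order to extract concept-based prototypical explanations. It also estimates the relevance of the extracted concepts to the pre-trained model's predictions, a critical aspect for explaining individual class predictions that is missing in existing approaches. We validate our framework with the VGG16 and ResNet50 CNN architectures on datasets such as Animals With Attributes 2 (AWA2) and Places365. Our experiments demonstrate that the concepts extracted by the MACE framework increase the human interpretability of the explanations and are faithful to the underlying pre-trained black-box model.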

47. Self-semi-supervised Learning to Learn from NoisyLabeled Data [PDF] Back to contents
Jiacheng Wang, Yue Ma, Shuang Gao
Abstract: The remarkable success of today's deep neural networks highly depends on a massive number of correctly labeled data. However, it is rather costly to obtain high-quality human-labeled data, leading to the active research area of training models robust to noisy labels. To achieve this goal, on the one hand, many papers have been dedicated to differentiating noisy labels from clean ones to increase the generalization of DNN. On the other hand, the increasingly prevalent methods of self-semi-supervised learning have been proven to benefit the tasks when labels are incomplete. By 'semi' we regard the wrongly labeled data detected as un-labeled data; by 'self' we choose a self-supervised technique to conduct semi-supervised learning. In this project, we designed methods to more accurately differentiate clean and noisy labels and borrowed the wisdom of self-semi-supervised learning to train noisy labeled data.
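Summary: The remarkable success of today's deep neural networks depends heavily on a massive amount of correctly labeled data. However, high-quality human-labeled data is costly to obtain, which has led to the active research area of training models that are robust to noisy labels. To this end, on the one hand, many papers have been dedicated to distinguishing noisy labels from clean ones in order to improve the generalization of DNNs. On the other hand, the increasingly prevalent methods of self-semi-supervised learning have been shown to help when labels are incomplete. By "semi" we mean that data detected as wrongly labeled is treated as unlabeled data; by "self" we mean that a self-supervised technique is chosen to conduct semi-supervised learning. In this project, we designed methods to differentiate clean and noisy labels more accurately and borrowed ideas from self-semi-supervised learning to train on noisily labeled data.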

48. Patch2Self: Denoising Diffusion MRI with Self-Supervised Learning [PDF] Back to contents
Shreyas Fadnavis, Joshua Batson, Eleftherios Garyfallidis
Abstract: Diffusion-weighted magnetic resonance imaging (DWI) is the only noninvasive method for quantifying microstructure and reconstructing white-matter pathways in the living human brain. Fluctuations from multiple sources create significant additive noise in DWI data which must be suppressed before subsequent microstructure analysis. We introduce a self-supervised learning method for denoising DWI data, Patch2Self, which uses the entire volume to learn a full-rank locally linear denoiser for that volume. By taking advantage of the oversampled q-space of DWI data, Patch2Self can separate structure from noise without requiring an explicit model for either. We demonstrate the effectiveness of Patch2Self via quantitative and qualitative improvements in microstructure modeling, tracking (via fiber bundle coherency) and model estimation relative to other unsupervised methods on real and simulated data.
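Summary: Diffusion-weighted magnetic resonance imaging (DWI) is the only noninvasive method for quantifying microstructure and reconstructing white-matter pathways in the living human brain. Fluctuations from multiple sources create significant additive noise in DWI data, which must be suppressed before subsequent microstructure analysis. We introduce Patch2Self, a self-supervised learning method for denoising DWI data that uses the entire volume to learn a full-rank locally linear denoiser for that volume. By taking advantage of the oversampled q-space of DWI data, Patch2Self can separate structure from noise without requiring an explicit model for either. We demonstrate the effectiveness of Patch2Self through quantitative and qualitative improvements in microstructure modeling, tracking (via fiber bundle coherency) and model estimation relative to other unsupervised methods on real and simulated data.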

Note: the summaries following each abstract are machine translations! The cover image is a word cloud of the paper titles!

[arXiv papers] Computer Vision and Pattern Recognition 2020-11-04
