Contents
5. Real-time Tropical Cyclone Intensity Estimation by Handling Temporally Heterogeneous Satellite Data [PDF]
8. MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis [PDF]
9. Multimodal End-to-End Learning for Autonomous Steering in Adverse Road and Weather Conditions [PDF]
17. Class-Agnostic Segmentation Loss and Its Application to Salient Object Detection and Segmentation [PDF]
18. ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications [PDF]
27. Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [PDF]
36. Classification Beats Regression: Counting of Cells from Greyscale Microscopic Images based on Annotation-free Training Samples [PDF]
Abstracts
1. Generative Adversarial Networks in Human Emotion Synthesis: A Review [PDF]
Noushin Hajarolasvadi, Miguel Arjona Ramírez, Hasan Demirel
Abstract: Synthesizing realistic data samples is of great value for both academic and industrial communities. Deep generative models have become an emerging topic in various research areas like computer vision and signal processing. Affective computing, a topic of broad interest in the computer vision community, has been no exception and has benefited from generative models. In fact, affective computing has seen a rapid proliferation of generative models during the last two decades. Applications of such models include but are not limited to emotion recognition and classification, unimodal emotion synthesis, and cross-modal emotion synthesis. Accordingly, we conducted a review of recent advances in human emotion synthesis by studying available databases and the advantages and disadvantages of the generative models, along with the related training strategies, considering two principal human communication modalities, namely audio and video. In this context, facial expression synthesis, speech emotion synthesis, and audio-visual (cross-modal) emotion synthesis are reviewed extensively under different application scenarios. Finally, we discuss open research problems to push the boundaries of this research area in future work.
2. Data Agnostic Filter Gating for Efficient Deep Networks [PDF]
Xiu Su, Shan You, Tao Huang, Hongyan Xu, Fei Wang, Chen Qian, Changshui Zhang, Chang Xu
Abstract: To deploy a well-trained CNN model on low-end computation edge devices, it usually has to be compressed or pruned under a certain computation budget (e.g., FLOPs). Current filter pruning methods mainly leverage feature maps to generate importance scores for filters and prune those with smaller scores, which ignores the variance across input batches and the differences in sparse structure across filters. In this paper, we propose a data-agnostic filter pruning method that uses an auxiliary network, named the Dagger module, to induce pruning, and takes pretrained weights as input to learn the importance of each filter. In addition, to help prune filters under certain FLOPs constraints, we leverage an explicit FLOPs-aware regularization to directly push pruning toward the target FLOPs. Extensive experimental results on the CIFAR-10 and ImageNet datasets indicate our superiority over other state-of-the-art filter pruning methods. For example, our 50% FLOPs ResNet-50 achieves 76.1% Top-1 accuracy on the ImageNet dataset, surpassing many other filter pruning methods.
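The explicit FLOPs-aware regularization lends itself to a compact illustration. Below is a minimal PyTorch sketch of one plausible form of such a term, assuming soft per-filter gates and a known per-filter FLOPs cost; the quadratic penalty and all names are illustrative assumptions, not the paper's implementation.

```python
import torch

def flops_aware_regularizer(gates, flops_per_filter, target_flops, weight=0.1):
    # gates: soft (0, 1) importance scores, e.g. from a Dagger-style auxiliary module
    # flops_per_filter: FLOPs each filter contributes when kept
    expected_flops = (gates * flops_per_filter).sum()
    # Quadratic penalty on the normalized gap to the FLOPs budget (one plausible choice)
    return weight * ((expected_flops - target_flops) / target_flops) ** 2

# Toy usage: 64 filters at 1e6 FLOPs each, pruning toward a 50% budget
gates = torch.sigmoid(torch.randn(64, requires_grad=True))
reg = flops_aware_regularizer(gates, torch.full((64,), 1e6), target_flops=32e6)
reg.backward()  # gradients flow back through the gates, steering them toward the budget
```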
3. Road Damage Detection and Classification with Detectron2 and Faster R-CNN [PDF]
Vung Pham, Chau Pham, Tommy Dang
Abstract: The road is vital for many aspects of life, and road maintenance is crucial for human safety. One of the critical tasks that allows timely repair of road damage is to quickly and efficiently detect and classify it. This work details the strategies and experiments evaluated for these tasks. Specifically, we evaluate Detectron2's implementation of Faster R-CNN using different base models and configurations. We also experiment with these approaches on the Global Road Damage Detection Challenge 2020 dataset, a track in the IEEE Big Data 2020 Big Data Cup Challenge. The results show that the X101-FPN base model for Faster R-CNN with Detectron2's default configurations is efficient and general enough to be transferable to different countries in this challenge. This approach achieves F1 scores of 51.0% and 51.4% on the test1 and test2 sets of the challenge, respectively. Though the visualizations show good prediction results, the F1 scores are low. Therefore, we also evaluate the prediction results against the existing annotations and discover some discrepancies. Thus, we also suggest strategies to improve the labeling process for this dataset.
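For readers who want to reproduce this kind of setup, the sketch below shows how a Detectron2 X101-FPN Faster R-CNN is typically configured and fine-tuned. The dataset name and class count are placeholders (the challenge data would first need to be registered in Detectron2's DatasetCatalog); the paper's exact configuration may differ.

```python
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))  # X101-FPN base model
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")   # COCO-pretrained weights
cfg.DATASETS.TRAIN = ("road_damage_train",)  # hypothetical registered dataset name
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 4  # e.g. four damage categories; an assumption

trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
```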
4. Toyota Smarthome Untrimmed: Real-World Untrimmed Videos for Activity Detection [PDF]
Rui Dai, Srijan Das, Saurav Sharma, Luca Minciullo, Lorenzo Garattoni, Francois Bremond, Gianpiero Francesca
Abstract: This work aims at building a large-scale dataset with daily-living activities performed in a natural manner. Activities performed in a spontaneous manner lead to many real-world challenges that are often ignored by the vision community. These include low inter-class variance due to the presence of similar activities, high intra-class variance, low camera framing, low resolution, a long-tail distribution of activities, and occlusions. To this end, we propose the Toyota Smarthome Untrimmed (TSU) dataset, which provides spontaneous activities with rich and dense annotations to address the detection of complex activities in real-world scenarios.
5. Real-time Tropical Cyclone Intensity Estimation by Handling Temporally Heterogeneous Satellite Data [PDF]
Boyo Chen, Buo-Fu Chen, Yun-Nung Chen
Abstract: Analyzing big geophysical observational data collected by multiple advanced sensors on various satellite platforms promotes our understanding of the geophysical system. For instance, convolutional neural networks (CNNs) have achieved great success in estimating tropical cyclone (TC) intensity based on satellite data with a fixed temporal frequency (e.g., 3 h). However, to achieve more timely (under 30 min) and accurate TC intensity estimates, a deep learning model is needed that can handle temporally heterogeneous satellite observations. Specifically, infrared (IR1) and water vapor (WV) images are available every 15 minutes, while the passive microwave rain rate (PMW) is available only about every 3 hours. Meanwhile, the visible (VIS) channel is severely affected by noise and sunlight intensity, making it difficult to utilize. Therefore, we propose a novel framework that combines a generative adversarial network (GAN) with a CNN. The model utilizes all data, including VIS and PMW information, during the training phase, and eventually uses only the high-frequency IR1 and WV data to provide intensity estimates during the prediction phase. Experimental results demonstrate that the hybrid GAN-CNN framework achieves precision comparable to the state-of-the-art models, while possessing the capability of increasing the maximum estimation frequency from 3 hours to less than 15 minutes.
6. Object Hider: Adversarial Patch Attack Against Object Detectors [PDF]
Yusheng Zhao, Huanqian Yan, Xingxing Wei
Abstract: Deep neural networks have been widely used in many computer vision tasks. However, they have been proven susceptible to small, imperceptible perturbations added to the input. Inputs with elaborately designed perturbations that can fool deep learning models are called adversarial examples, and they have raised great concern about the safety of deep neural networks. Object detection algorithms are designed to locate and classify objects in images or videos, and they are the core of many computer vision tasks with great research value and wide applications. In this paper, we focus on adversarial attacks on some state-of-the-art object detection models. As a practical alternative, we use adversarial patches for the attack. We propose two adversarial patch generation algorithms: a heatmap-based algorithm and a consensus-based algorithm. The experiment results show that the proposed methods are highly effective, transferable, and generic. Additionally, we applied the proposed methods to the "Adversarial Challenge on Object Detection" competition organized by Alibaba on the Tianchi platform and placed in the top 7 out of 1701 teams. Code is available at: this https URL
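As a concrete illustration of the attack surface, the sketch below pastes an optimized patch into an image at a chosen location. In a heatmap-based scheme the location would be picked where the detector is most sensitive, but that selection logic (and the patch optimization itself) is assumed here, not taken from the paper.

```python
import torch

def apply_patch(image, patch, top, left):
    # image: (C, H, W) tensor in [0, 1]; patch: (C, ph, pw) adversarial patch
    out = image.clone()
    _, ph, pw = patch.shape
    out[:, top:top + ph, left:left + pw] = patch  # overwrite pixels with the patch
    return out

image = torch.rand(3, 416, 416)          # toy input frame
patch = torch.rand(3, 64, 64)            # stand-in for an optimized patch
attacked = apply_patch(image, patch, top=100, left=150)
```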
7. Leveraging Visual Question Answering to Improve Text-to-Image Synthesis [PDF]
Stanislav Frolov, Shailza Jolly, Jörn Hees, Andreas Dengel
Abstract: Generating images from textual descriptions has recently attracted a lot of interest. While current models can generate photo-realistic images of individual objects such as birds and human faces, synthesising images with multiple objects is still very difficult. In this paper, we propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment of generated images by leveraging the VQA 2.0 dataset. We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal. We encourage images generated from QA pairs to look realistic and additionally minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline, which indicates that T2I synthesis can successfully be improved using a standard VQA model.
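The auxiliary-learning idea can be summarized in a few lines: a standard VQA model answers questions about the generated image, and its cross-entropy loss is added to the usual T2I objective. Everything below (the model interfaces, the loss weight) is a hypothetical sketch under those assumptions, not the authors' code.

```python
import torch.nn.functional as F

def t2i_vqa_step(t2i_model, vqa_model, caption_emb, question_emb, answer_ids,
                 vqa_weight=1.0):
    fake_images = t2i_model.generate(caption_emb)         # hypothetical generator API
    t2i_loss = t2i_model.generator_loss(fake_images, caption_emb)  # usual T2I objective
    answer_logits = vqa_model(fake_images, question_emb)  # VQA model acts as a critic
    vqa_loss = F.cross_entropy(answer_logits, answer_ids) # auxiliary learning signal
    return t2i_loss + vqa_weight * vqa_loss               # joint objective
```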
8. MedMNIST Classification Decathlon: A Lightweight AutoML Benchmark for Medical Image Analysis [PDF]
Jiancheng Yang, Rui Shi, Bingbing Ni
Abstract: We present MedMNIST, a collection of 10 pre-processed medical open datasets. MedMNIST is standardized to perform classification tasks on lightweight 28x28 images, which requires no background knowledge. Covering the primary data modalities in medical image analysis, it is diverse in data scale (from 100 to 100,000) and tasks (binary/multi-class, ordinal regression, and multi-label). MedMNIST can be used for educational purposes, rapid prototyping, multi-modal machine learning, or AutoML in medical image analysis. Moreover, the MedMNIST Classification Decathlon is designed to benchmark AutoML algorithms on all 10 datasets; we have compared several baseline methods, including open-source and commercial AutoML tools. The datasets, evaluation code, and baseline methods for MedMNIST are publicly available at this https URL.
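Because every task is a classification problem over 28x28 images, even a very small network is a meaningful baseline here. The sketch below is an illustrative model of that scale in PyTorch, not one of the paper's benchmarked methods; the channel counts are arbitrary choices.

```python
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small 28x28 classifier of the kind MedMNIST enables."""
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 28 -> 14
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),           # 14 -> 7
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```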
9. Multimodal End-to-End Learning for Autonomous Steering in Adverse Road and Weather Conditions [PDF]
Jyri Maanpää, Josef Taher, Petri Manninen, Leo Pakola, Iaroslav Melekhov, Juha Hyyppä
Abstract: Autonomous driving is challenging in adverse road and weather conditions in which there might not be lane lines, the road might be covered in snow and the visibility might be poor. We extend the previous work on end-to-end learning for autonomous steering to operate in these adverse real-life conditions with multimodal data. We collected 28 hours of driving data in several road and weather conditions and trained convolutional neural networks to predict the car steering wheel angle from front-facing color camera images and lidar range and reflectance data. We compared the CNN model performances based on the different modalities and our results show that the lidar modality improves the performances of different multimodal sensor-fusion models. We also performed on-road tests with different models and they support this observation.
10. Transferable Universal Adversarial Perturbations Using Generative Models [PDF]
Atiye Sadat Hashemi, Andreas Bär, Saeed Mozaffari, Tim Fingscheidt
Abstract: Deep neural networks tend to be vulnerable to adversarial perturbations, which, when added to a natural image, can fool a respective model with high confidence. Recently, the existence of image-agnostic perturbations, also known as universal adversarial perturbations (UAPs), was discovered. However, existing UAPs still lack a sufficiently high fooling rate when applied to an unknown target model. In this paper, we propose a novel deep learning technique for generating more transferable UAPs. We utilize a perturbation generator and some given pretrained networks, so-called source models, to generate UAPs using the ImageNet dataset. Due to the similar feature representation of various model architectures in the first layer, we propose a loss formulation that focuses on the adversarial energy only in the respective first layer of the source models. This supports the transferability of our generated UAPs to any other target model. We further empirically analyze our generated UAPs and demonstrate that these perturbations generalize very well towards different target models. Surpassing the current state of the art in both fooling rate and model-transferability, we show the superiority of our proposed approach. Using our generated non-targeted UAPs, we obtain an average fooling rate of 93.36% on the source models (state of the art: 82.16%). Generating our UAPs on the deep ResNet-152, we obtain about a 12% absolute fooling rate advantage vs. cutting-edge methods on VGG-16 and VGG-19 target models.
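A rough sketch of the first-layer loss idea follows, assuming access to a source model's first convolutional layer. The tanh bounding and the squared-activation notion of "energy" are plausible choices for illustration, not the paper's exact formulation.

```python
import torch

def first_layer_energy_loss(first_conv, images, uap_logits, eps=10 / 255):
    uap = eps * torch.tanh(uap_logits)       # keep the universal perturbation bounded
    perturbed = (images + uap).clamp(0, 1)   # one shared perturbation for the whole batch
    activations = first_conv(perturbed)      # source model's first layer only
    return -activations.pow(2).mean()        # minimizing this maximizes first-layer energy
```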
11. Displacement-Invariant Matching Cost Learning for Accurate Optical Flow Estimation [PDF]
Jianyuan Wang, Yiran Zhong, Yuchao Dai, Kaihao Zhang, Pan Ji, Hongdong Li
Abstract: Learning matching costs has been shown to be critical to the success of the state-of-the-art deep stereo matching methods, in which 3D convolutions are applied on a 4D feature volume to learn a 3D cost volume. However, this mechanism has never been employed for the optical flow task. This is mainly due to the significantly increased search dimension in the case of optical flow computation, i.e., a straightforward extension would require dense 4D convolutions in order to process a 5D feature volume, which is computationally prohibitive. This paper proposes a novel solution that is able to bypass the requirement of building a 5D feature volume while still allowing the network to learn suitable matching costs from data. Our key innovation is to decouple the connection between 2D displacements and learn the matching costs at each 2D displacement hypothesis independently, i.e., displacement-invariant cost learning. Specifically, we apply the same 2D convolution-based matching net independently on each 2D displacement hypothesis to learn a 4D cost volume. Moreover, we propose a displacement-aware projection layer to scale the learned cost volume, which reconsiders the correlation between different displacement candidates and mitigates the multi-modal problem in the learned cost volume. The cost volume is then projected to optical flow estimation through a 2D soft-argmin layer. Extensive experiments show that our approach achieves state-of-the-art accuracy on various datasets, and outperforms all published optical flow methods on the Sintel benchmark.
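The final soft-argmin projection is standard enough to sketch directly. Below is a minimal one-axis version over a cost volume of shape (B, D, H, W); the paper's 2D layer operates over a two-dimensional displacement grid, and all names here are illustrative.

```python
import torch

def soft_argmin(cost_volume, displacements):
    # cost_volume: (B, D, H, W); displacements: (D,) candidate values for one axis
    prob = torch.softmax(-cost_volume, dim=1)                    # low cost -> high probability
    return (prob * displacements.view(1, -1, 1, 1)).sum(dim=1)   # expected displacement

cost = torch.randn(2, 9, 64, 64)        # toy learned cost volume
disp = torch.linspace(-4.0, 4.0, 9)     # candidate displacements along one axis
flow_u = soft_argmin(cost, disp)        # (2, 64, 64): one flow component per pixel
```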
12. Micro Stripes Analyses for Iris Presentation Attack Detection [PDF]
Meiling Fang, Naser Damer, Florian Kirchbuchner, Arjan Kuijper
Abstract: Iris recognition systems are vulnerable to presentation attacks, such as textured contact lenses or printed images. In this paper, we propose a lightweight framework to detect iris presentation attacks by extracting multiple micro-stripes from expanded, normalized iris textures. In this procedure, a standard iris segmentation is modified. For our presentation attack detection network to better model the classification problem, the segmented area is processed to provide lower-dimensional input segments and a higher number of learning samples. Our proposed Micro Stripes Analyses (MSA) solution samples the segmented areas as individual stripes. Then, a majority vote over those micro-stripes makes the final classification decision. Experiments are demonstrated on five databases, where two databases (IIITD-WVU and Notre Dame) are from the LivDet-2017 Iris competition. An in-depth experimental evaluation of this framework reveals superior performance compared with state-of-the-art algorithms. Moreover, our solution minimizes the confusion between textured (attack) and soft (bona fide) contact lens presentations.
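The stripe-and-vote pipeline can be sketched compactly, assuming a per-stripe classifier is already trained; the stripe count and slicing axis below are assumptions for illustration, not the paper's settings.

```python
import numpy as np

def msa_predict(stripe_classifier, normalized_iris, n_stripes=6):
    # Slice the expanded, normalized iris texture into horizontal micro-stripes
    stripes = np.array_split(normalized_iris, n_stripes, axis=0)
    votes = [stripe_classifier(s) for s in stripes]  # each vote: 0 = bona fide, 1 = attack
    return int(sum(votes) > len(votes) / 2)          # majority vote over micro-stripes
```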
13. Model Rubik's Cube: Twisting Resolution, Depth and Width for TinyNets [PDF]
Kai Han, Yunhe Wang, Qiulin Zhang, Wei Zhang, Chunjing Xu, Tong Zhang
Abstract: To obtain excellent deep neural architectures, a series of techniques are carefully designed in EfficientNets. The giant formula for simultaneously enlarging the resolution, depth, and width provides us a Rubik's cube for neural networks, with which we can find networks of high efficiency and excellent performance by twisting the three dimensions. This paper aims to explore the twisting rules for obtaining deep neural networks with minimum model sizes and computational costs. In contrast to network enlarging, we observe that resolution and depth are more important than width for tiny networks. Therefore, the original method, i.e., the compound scaling in EfficientNet, is no longer suitable. To this end, we summarize a tiny formula for downsizing neural architectures through a series of smaller models derived from EfficientNet-B0 under a FLOPs constraint. Experimental results on the ImageNet benchmark illustrate that our TinyNet performs much better than the smaller versions of EfficientNets obtained by inverting the giant formula. For instance, our TinyNet-E achieves 59.9% Top-1 accuracy with only 24M FLOPs, which is about 1.9% higher than that of the previous best MobileNetV3 with similar computational cost. Code will be available at this https URL, and this https URL.
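The search space is easy to state because ConvNet FLOPs scale roughly as w^2 * r^2 * d in width w, resolution r, and depth d (the relation behind EfficientNet's compound scaling). The helper below uses that relation to compare different "twists" at roughly equal cost; it illustrates the arithmetic only, not the paper's search procedure.

```python
def scaled_flops(base_flops, r=1.0, d=1.0, w=1.0):
    # FLOPs of a ConvNet scale roughly as width^2 * resolution^2 * depth
    return base_flops * (w ** 2) * (r ** 2) * d

# Two shapes at roughly the same 25% budget of a 400M-FLOPs base model:
print(scaled_flops(400e6, w=0.5))           # shrink width only        -> ~100M FLOPs
print(scaled_flops(400e6, w=0.6, r=0.83))   # trade width for resolution -> ~99M FLOPs
```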
14. Cycle-Contrast for Self-Supervised Video Representation Learning [PDF]
Quan Kong, Wenpeng Wei, Ziwei Deng, Tomoaki Yoshinaga, Tomokazu Murakami
Abstract: We present Cycle-Contrastive Learning (CCL), a novel self-supervised method for learning video representations. Motivated by the natural belonging-and-inclusion relation between a video and its frames, CCL is designed to find correspondences across frames and videos by considering contrastive representations in their respective domains. It differs from recent approaches that merely learn correspondences across frames or clips. In our method, the frame and video representations are learned from a single network based on an R3D architecture, with a shared non-linear transformation for embedding both frame and video features before the cycle-contrastive loss. We demonstrate that the video representation learned by CCL transfers well to downstream video understanding tasks, outperforming previous methods on nearest-neighbour retrieval and action recognition tasks on UCF101, HMDB51, and MMAct.
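A minimal sketch of a frame-to-video contrastive term of the kind CCL builds on: each frame embedding should be closest to the embedding of its own video (InfoNCE-style). The paper's actual cycle construction is more involved; everything below is an assumption for illustration.

```python
import torch
import torch.nn.functional as F

def frame_video_contrastive(frame_emb, video_emb, temperature=0.07):
    # frame_emb, video_emb: (N, C); row i of each comes from the same video
    f = F.normalize(frame_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = f @ v.t() / temperature        # positives sit on the diagonal
    targets = torch.arange(f.size(0))
    return F.cross_entropy(logits, targets)
```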
15. AbdomenCT-1K: Is Abdominal Organ Segmentation A Solved Problem? [PDF]
Jun Ma, Yao Zhang, Song Gu, Yichi Zhang, Cheng Zhu, Qiyuan Wang, Xin Liu, Xingle An, Cheng Ge, Shucheng Cao, Qi Zhang, Shangqing Liu, Yunpeng Wang, Yuhui Li, Congcong Wang, Jian He, Xiaoping Yang
Abstract: With the unprecedented developments in deep learning, automatic segmentation of main abdominal organs (i.e., liver, kidney, and spleen) seems to be a solved problem, as state-of-the-art (SOTA) methods have achieved results comparable to inter-observer variability on existing benchmark datasets. However, most of the existing abdominal organ segmentation benchmark datasets only contain single-center, single-phase, single-vendor, or single-disease cases; thus, it is unclear whether the excellent performance can generalize to more diverse datasets. In this paper, we present a large and diverse abdominal CT organ segmentation dataset, termed AbdomenCT-1K, with more than 1000 (1K) CT scans from 11 countries, including multi-center, multi-phase, multi-vendor, and multi-disease cases. Furthermore, we conduct a large-scale study of liver, kidney, spleen, and pancreas segmentation, and reveal the unsolved segmentation problems of the SOTA methods, such as the limited generalization ability to distinct medical centers, phases, and unseen diseases. To address these unsolved problems, we build four organ segmentation benchmarks for fully supervised, semi-supervised, weakly supervised, and continual learning, which are currently challenging and active research topics. Accordingly, we develop a simple and effective method for each benchmark, which can be used as out-of-the-box methods and strong baselines. We believe the introduction of the AbdomenCT-1K dataset will promote future in-depth research towards clinically applicable abdominal organ segmentation methods. Moreover, the datasets, codes, and trained models of the baseline methods will be publicly available at this https URL.
16. SFU-Store-Nav: A Multimodal Dataset for Indoor Human Navigation [PDF]
Zhitian Zhang, Jimin Rhim, Taher Ahmadi, Kefan Yang, Angelica Lim, Mo Chen
Abstract: This article describes a dataset collected in a set of experiments that involve human participants and a robot. The experiments were conducted in the computing science robotics lab at Simon Fraser University, Burnaby, BC, Canada, and their aim is to gather data containing common gestures, movements, and other behaviours that may indicate humans' navigational intent relevant for autonomous robot navigation. The experiment simulates a shopping scenario where human participants come in to pick up items from their shopping lists and interact with a Pepper robot that is programmed to help the human participant. We collected visual data and motion capture data from 108 human participants. The visual data contains live recordings of the experiments, and the motion capture data contains the position and orientation of the human participants in world coordinates. This dataset could be valuable for researchers in the robotics, machine learning, and computer vision communities.
17. Class-Agnostic Segmentation Loss and Its Application to Salient Object Detection and Segmentation [PDF]
Angira Sharma, Naeemullah Khan, Ganesh Sundaramoorthi, Philip Torr
Abstract: In this paper we present a novel loss function, called class-agnostic segmentation (CAS) loss. With CAS loss, the class descriptors are learned during training of the network. We do not require the label of a class to be defined a priori; rather, the CAS loss clusters regions with similar appearance together in a weakly-supervised manner. Furthermore, we show that the CAS loss function is sparse, bounded, and robust to class imbalance. We apply our CAS loss function with fully-convolutional ResNet101 and DeepLab-v3 architectures to the binary segmentation problem of salient object detection. We investigate the performance against state-of-the-art methods in two settings of low- and high-fidelity training data on seven salient object detection datasets. For low-fidelity training data (incorrect class labels), class-agnostic segmentation loss outperforms the state-of-the-art methods on salient object detection datasets by staggering margins of around 50%. For high-fidelity training data (correct class labels), class-agnostic segmentation models perform as well as the state-of-the-art approaches while beating them on most datasets. In order to show the utility of the loss function across different domains, we also test on a general segmentation dataset, where class-agnostic segmentation loss outperforms cross-entropy-based loss by huge margins on both region and edge metrics.
18. ElderSim: A Synthetic Data Generation Platform for Human Action Recognition in Eldercare Applications [PDF]
Hochul Hwang, Cheongjae Jang, Geonwoo Park, Junghyun Cho, Ig-Jae Kim
Abstract: To train deep learning models for vision-based action recognition of elders' daily activities, we need large-scale activity datasets acquired under various daily living environments and conditions. However, most public datasets used in human action recognition either differ from or have limited coverage of elders' activities in many aspects, making it challenging to recognize elders' daily activities well by only utilizing existing datasets. Recently, such limitations of available datasets have actively been compensated for by generating synthetic data from realistic simulation environments and using those data to train deep learning models. In this paper, based on these ideas, we develop ElderSim, an action simulation platform that can generate synthetic data on elders' daily activities. For 55 kinds of frequent daily activities of the elders, ElderSim generates realistic motions of synthetic characters with various adjustable data-generating options, and provides different output modalities including RGB videos and two- and three-dimensional skeleton trajectories. We then generate KIST SynADL, a large-scale synthetic dataset of elders' activities of daily living, from ElderSim and use the data in addition to real datasets to train three state-of-the-art human action recognition models. From experiments following several newly proposed scenarios that assume different real and synthetic dataset configurations for training, we observe a noticeable performance improvement from augmenting with our synthetic data. We also offer guidance and insights for the effective utilization of synthetic data to help recognize elders' daily activities.
19. MultiMix: Sparingly Supervised, Extreme Multitask Learning From Medical Images [PDF]
Ayaan Haque, Abdullah-Al-Zubaer Imran, Adam Wang, Demetri Terzopoulos
Abstract: Semi-supervised learning via learning from limited quantities of labeled data has been investigated as an alternative to fully supervised counterparts. Maximizing knowledge gains from copious unlabeled data benefits semi-supervised learning settings. Moreover, learning multiple tasks within the same model further improves model generalizability. We propose a novel multitask learning model, namely MultiMix, which jointly learns disease classification and anatomical segmentation in a sparingly supervised manner, while preserving explainability through bridge saliency between the two tasks. Our extensive experimentation with varied quantities of labeled data in the training sets justifies the effectiveness of our multitasking model for the classification of pneumonia and segmentation of lungs from chest X-ray images. Moreover, both in-domain and cross-domain evaluations across the tasks further showcase the potential of our model to adapt to challenging generalization scenarios.
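As one way to picture the joint setup the abstract describes, here is a minimal sketch of a shared encoder feeding a classification head and a segmentation decoder; the layer sizes and heads are assumptions for illustration, not the MultiMix architecture.

    import torch
    import torch.nn as nn

    class MultiTaskNet(nn.Module):
        """Minimal sketch of joint classification + segmentation with a
        shared encoder, in the spirit of MultiMix (details assumed)."""
        def __init__(self, num_classes=2):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            )
            self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                          nn.Linear(64, num_classes))
            self.seg_head = nn.Sequential(
                nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
                nn.ConvTranspose2d(32, 1, 2, stride=2),
            )

        def forward(self, x):               # x: (B, 1, H, W) chest X-ray
            z = self.encoder(x)
            return self.cls_head(z), self.seg_head(z)   # class logits, mask logits

    # Sparingly supervised training would apply the classification and
    # segmentation losses only to the samples that carry those labels.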
20. Differentiable Channel Pruning Search [PDF]
Yu Zhao, Chung-Kuei Lee
Abstract: In this paper, we propose the differentiable channel pruning search (DCPS) of convolutional neural networks. Unlike traditional channel pruning algorithms, which require users to manually set the prune ratio for each convolutional layer, DCPS searches for the optimal combination of prune ratios automatically. Inspired by the differentiable architecture search (DARTS), we draw lessons from its continuous relaxation and leverage gradient information to balance metrics and performance. However, directly applying the DARTS scheme causes a channel mismatching problem and huge memory consumption. Therefore, we introduce a novel weight sharing technique which elegantly eliminates the shape mismatching problem with negligible additional resources. We test the proposed algorithm on the image classification task, where it achieves state-of-the-art pruning results on CIFAR-10, CIFAR-100 and ImageNet. DCPS is further applied to semantic segmentation on PASCAL VOC 2012 for two purposes. The first is to demonstrate that task-specific channel pruning achieves better performance than transferring slim models, and the second is to prove the memory efficiency of DCPS, as the task demands a larger memory budget than classification. Results of the experiments validate the effectiveness and wide applicability of DCPS.
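To make the DARTS-style continuous relaxation concrete, the sketch below blends the outputs of candidate prune ratios for a single convolution with a softmax over architecture parameters; it illustrates the general idea only and does not reproduce DCPS's weight-sharing scheme.

    import torch
    import torch.nn as nn

    class SoftPrunedConv(nn.Module):
        """Sketch of a DARTS-style relaxation over candidate prune ratios
        for one convolution (the idea, not the exact DCPS scheme)."""
        def __init__(self, conv, ratios=(0.25, 0.5, 0.75, 1.0)):
            super().__init__()
            self.conv, self.ratios = conv, ratios
            self.alpha = nn.Parameter(torch.zeros(len(ratios)))  # architecture params

        def forward(self, x):
            y = self.conv(x)                           # (B, C, H, W), weights shared
            c = y.shape[1]
            w = self.alpha.softmax(0)                  # relaxed choice of ratio
            out = 0.0
            for wi, r in zip(w, self.ratios):
                mask = torch.zeros(1, c, 1, 1, device=y.device)
                mask[:, : int(c * r)] = 1.0            # keep the first r*C channels
                out = out + wi * (y * mask)            # gradient flows into alpha
            return out

After the search converges, the ratio with the largest alpha would be kept and the layer physically pruned.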
21. CompRess: Self-Supervised Learning by Compressing Representations [PDF]
Soroush Abbasi Koohpayegani, Ajinkya Tejankar, Hamed Pirsiavash
Abstract: Self-supervised learning aims to learn good representations with unlabeled data. Recent works have shown that larger models benefit more from self-supervised learning than smaller models. As a result, the gap between supervised and self-supervised learning has been greatly reduced for larger models. In this work, instead of designing a new pseudo task for self-supervised learning, we develop a model compression method to compress an already learned, deep self-supervised model (teacher) to a smaller one (student). We train the student model so that it mimics the relative similarity between the data points in the teacher's embedding space. For AlexNet, our method outperforms all previous methods including the fully supervised model on ImageNet linear evaluation (59.0% compared to 56.5%) and on nearest neighbor evaluation (50.7% compared to 41.4%). To the best of our knowledge, this is the first time a self-supervised AlexNet has outperformed a supervised one on ImageNet classification. Our code is available here: this https URL
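The "relative similarity" objective can be pictured as matching the teacher's and student's similarity distributions over a shared set of anchor embeddings; the KL form, temperature, and anchor bank below are assumptions in the spirit of the abstract rather than the paper's verified implementation.

    import torch
    import torch.nn.functional as F

    def similarity_distillation_loss(student_emb, teacher_emb,
                                     anchors_s, anchors_t, tau=0.04):
        """Sketch: the student matches the teacher's distribution of
        similarities over a bank of anchor points (details assumed)."""
        s = F.normalize(student_emb, dim=1)          # (B, Ds)
        t = F.normalize(teacher_emb, dim=1)          # (B, Dt)
        p_t = (t @ F.normalize(anchors_t, dim=1).T / tau).softmax(dim=1)
        log_p_s = (s @ F.normalize(anchors_s, dim=1).T / tau).log_softmax(dim=1)
        return F.kl_div(log_p_s, p_t, reduction='batchmean')

Note that the student and teacher may have different embedding dimensions; only the distributions over the (separately embedded) anchors are compared.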
22. Quantified Facial Temporal-Expressiveness Dynamics for Affect Analysis [PDF]
Md Taufeeq Uddin, Shaun Canavan
Abstract: The quantification of visual affect data (e.g. face images) is essential to build and monitor automated affect modeling systems efficiently. Considering this, this work proposes quantified facial Temporal-expressiveness Dynamics (TED) to quantify the expressiveness of human faces. The proposed algorithm leverages multimodal facial features by incorporating static and dynamic information to enable accurate measurements of facial expressiveness. We show that TED can be used for high-level tasks such as summarization of unstructured visual data, and expectation from and interpretation of automated affect recognition models. To evaluate the positive impact of using TED, a case study was conducted on spontaneous pain using the UNBC-McMaster spontaneous shoulder pain dataset. Experimental results show the efficacy of using TED for quantified affect analysis.
23. Deformable Convolutional LSTM for Human Body Emotion Recognition [PDF]
Peyman Tahghighi, Abbas Koochari, Masoume Jalali
Abstract: People represent their emotions in a myriad of ways. Among the most important ones are whole-body expressions, which have many applications in different fields such as human-computer interaction (HCI). One of the most important challenges in human emotion recognition is that people express the same feeling in various ways using their face and their body. Recently, many methods have tried to overcome these challenges using Deep Neural Networks (DNNs). However, most of these methods were based on images or on facial expressions only, and did not consider deformations that may happen in the images, such as scaling and rotation, which can adversely affect recognition accuracy. In this work, motivated by recent research on deformable convolutions, we incorporate the deformable behavior into the core of the convolutional long short-term memory (ConvLSTM) to improve robustness to these deformations in the image and, consequently, improve its accuracy on the emotion recognition task for videos of arbitrary length. We conducted experiments on the GEMEP dataset and achieved a state-of-the-art accuracy of 98.8% on the task of whole-body human emotion recognition on the validation set.
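One plausible realization of this idea is to replace the gate convolutions of a ConvLSTM cell with deformable convolutions, as sketched below with torchvision's DeformConv2d; the exact wiring in the paper may differ.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class DeformConvLSTMCell(nn.Module):
        """Sketch of a ConvLSTM cell whose gate convolutions are deformable
        (a plausible reading of the paper's idea; wiring is assumed)."""
        def __init__(self, in_ch, hid_ch, k=3):
            super().__init__()
            pad = k // 2
            self.offset = nn.Conv2d(in_ch + hid_ch, 2 * k * k, k, padding=pad)
            self.gates = DeformConv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=pad)

        def forward(self, x, state):
            h, c = state
            z = torch.cat([x, h], dim=1)
            g = self.gates(z, self.offset(z))        # sampled at learned offsets
            i, f, o, u = g.chunk(4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(u)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, (h, c)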
24. Stereo Frustums: A Siamese Pipeline for 3D Object Detection [PDF]
Xi Mo, Usman Sajid, Guanghui Wang
Abstract: The paper proposes a lightweight stereo frustums matching module for 3D object detection. The proposed framework takes advantage of a high-performance 2D detector and a point cloud segmentation network to regress 3D bounding boxes for autonomous driving vehicles. Instead of performing traditional stereo matching to compute disparities, the module directly takes the 2D proposals from both the left and the right views as input. Based on the epipolar constraints recovered from the well-calibrated stereo cameras, we propose four matching algorithms to search for the best match for each proposal between the stereo image pairs. Each matching pair proposes a segmentation of the scene, which is then fed into a 3D bounding box regression network. Results of extensive experiments on the KITTI dataset demonstrate that the proposed Siamese pipeline outperforms the state-of-the-art stereo-based 3D bounding box regression methods.
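Under rectified stereo, the epipolar constraint means a left-image proposal and its right-image match share the same horizontal band, and the right box sits at a smaller x (non-negative disparity). The vertical-IoU scoring rule below is just one plausible matcher; the paper proposes four matching algorithms whose details are not in the abstract.

    def match_proposals(left_boxes, right_boxes):
        """Sketch of epipolar-constrained proposal matching for rectified
        stereo; boxes are (x1, y1, x2, y2) tuples."""
        def y_iou(a, b):                     # overlap of the vertical extents
            inter = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
            union = (a[3] - a[1]) + (b[3] - b[1]) - inter
            return inter / union if union > 0 else 0.0

        matches = []
        for lb in left_boxes:
            cands = [rb for rb in right_boxes if rb[0] <= lb[0]]  # disparity >= 0
            if cands:
                matches.append((lb, max(cands, key=lambda rb: y_iou(lb, rb))))
        return matches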
25. Nested Grassmanns for Dimensionality Reduction [PDF]
Chun-Hao Yang, Baba C. Vemuri
Abstract: Grassmann manifolds have been widely used to represent the geometry of feature spaces in a variety of problems in computer vision including but not limited to face recognition, action recognition, subspace clustering and motion segmentation. For these problems, the features usually lie in a very high-dimensional Grassmann manifold, and hence an appropriate dimensionality reduction technique is called for in order to curtail the computational burden. To this end, the Principal Geodesic Analysis (PGA), a nonlinear extension of the well-known principal component analysis, is applicable as a general tool to many Riemannian manifolds. In this paper, we propose a novel dimensionality reduction framework suited for Grassmann manifolds by utilizing the geometry of the manifold. Specifically, we project points in a Grassmann manifold to an embedded lower-dimensional Grassmann manifold. A salient feature of our method is that it leads to higher expressed variance compared to PGA, which we demonstrate via synthetic and real data experiments.
26. Contour Integration using Graph-Cut and Non-Classical Receptive Field [PDF]
Seyedeh-Zahra Mousavi Kouzehkanan, Reshad Hosseini, Babak Nadjar Araabi
Abstract: Many edge and contour detection algorithms give a soft value as output, and the final binary map is commonly obtained by applying an optimal threshold. In this paper, we propose a novel method to detect image contours from the edge segments extracted by other algorithms. Our method is based on an undirected graphical model with the edge segments set as the vertices. The proposed energy functions are inspired by the surround modulation in the primary visual cortex that helps suppress texture noise. Our algorithm can improve extraction of the binary map because it considers other important factors, such as connectivity, smoothness, and length of the contour, besides the soft values. Our quantitative and qualitative experimental results show the efficacy of the proposed method.
27. Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [PDF]
Iro Laina, Ruth C. Fong, Andrea Vedaldi
Abstract: The increasing impact of black box models, and particularly of unsupervised ones, comes with an increasing interest in tools to understand and interpret them. In this paper, we consider in particular how to characterise visual groupings discovered automatically by deep neural networks, starting with state-of-the-art clustering methods. In some cases, clusters readily correspond to an existing labelled dataset. However, often they do not, yet they still maintain an "intuitive interpretability". We introduce two concepts, visual learnability and describability, that can be used to quantify the interpretability of arbitrary image groupings, including unsupervised ones. The idea is to measure (1) how well humans can learn to reproduce a grouping by measuring their ability to generalise from a small set of visual examples (learnability) and (2) whether the set of visual examples can be replaced by a succinct, textual description (describability). By assessing human annotators as classifiers, we remove the subjective quality of existing evaluation metrics. For better scalability, we finally propose a class-level captioning system to generate descriptions for visual groupings automatically and compare it to human annotators using the describability metric.
28. Perception for Autonomous Systems (PAZ) [PDF]
Octavio Arriaga, Matias Valdenegro-Toro, Mohandass Muthuraja, Sushma Devaramani, Frank Kirchner
Abstract: In this paper we introduce the Perception for Autonomous Systems (PAZ) software library. PAZ is a hierarchical perception library that allows users to manipulate multiple levels of abstraction in accordance with their requirements or skill level. More specifically, PAZ is divided into three hierarchical levels, which we refer to as pipelines, processors, and backends. These abstractions allow users to compose functions in a hierarchical modular scheme that can be applied for preprocessing, data augmentation, prediction, and postprocessing of inputs and outputs of machine learning (ML) models. PAZ uses these abstractions to build reusable training and prediction pipelines for multiple robot perception tasks such as: 2D keypoint estimation, 2D object detection, 3D keypoint discovery, 6D pose estimation, emotion classification, face recognition, instance segmentation, and attention mechanisms.
29. Forgery Blind Inspection for Detecting Manipulations of Gel Electrophoresis Images [PDF]
Hao-Chiang Shao, Ya-Jen Cheng, Meng-Yun Duh, Chia-Wen Lin
Abstract: Recently, falsified images have been found in papers involved in research misconduct. However, although there have been many image forgery detection methods, none of them was designed for molecular-biological experiment images. In this paper, we propose a fast blind inquiry method, named FBI$_{GEL}$, for checking the integrity of images obtained from two common sorts of molecular experiments, i.e., western blot (WB) and polymerase chain reaction (PCR). Based on an optimized pseudo-background capable of highlighting local residues, FBI$_{GEL}$ can reveal traceable vestiges suggesting inappropriate local modifications on WB/PCR images. Additionally, because the optimized pseudo-background is derived from a closed-form solution, FBI$_{GEL}$ is computationally efficient and thus suitable for large-scale inquiry tasks of WB/PCR image integrity. We applied FBI$_{GEL}$ on several papers questioned by the public on \textbf{PUBPEER}, and our results show that figures in those papers indeed contain doubtful unnatural patterns.
30. Attribution Preservation in Network Compression for Reliable Network Interpretation [PDF]
Geondo Park, June Yong Yang, Sung Ju Hwang, Eunho Yang
Abstract: Neural networks embedded in safety-sensitive applications such as self-driving cars and wearable health monitors rely on two important techniques: input attribution for hindsight analysis and network compression to reduce its size for edge-computing. In this paper, we show that these seemingly unrelated techniques conflict with each other as network compression deforms the produced attributions, which could lead to dire consequences for mission-critical applications. This phenomenon arises due to the fact that conventional network compression methods only preserve the predictions of the network while ignoring the quality of the attributions. To combat the attribution inconsistency problem, we present a framework that can preserve the attributions while compressing a network. By employing the Weighted Collapsed Attribution Matching regularizer, we match the attribution maps of the network being compressed to its pre-compression former self. We demonstrate the effectiveness of our algorithm both quantitatively and qualitatively on diverse compression methods.
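A bare-bones version of such a regularizer penalizes the distance between the compressed network's attribution maps and those of its uncompressed self; the weighting and collapsing steps of the paper's Weighted Collapsed Attribution Matching are deliberately not reproduced here.

    import torch
    import torch.nn.functional as F

    def attribution_matching_loss(compressed_attr, original_attr):
        """Sketch: keep the compressed model's attribution maps close to
        the pre-compression model's maps (a simplified stand-in for the
        paper's weighted, collapsed matching term)."""
        s = F.normalize(compressed_attr.flatten(1), dim=1)   # (B, H*W)
        t = F.normalize(original_attr.flatten(1), dim=1)
        return (s - t).pow(2).sum(dim=1).mean()

    # total objective (lambda is a tunable weight):
    # loss = task_loss + lam * attribution_matching_loss(attr_c, attr_o)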
31. Image Representations Learned With Unsupervised Pre-Training Contain Human-like Biases [PDF]
Ryan Steed, Aylin Caliskan
Abstract: Recent advances in machine learning leverage massive datasets of unlabeled images from the web to learn general-purpose image representations for tasks from image classification to face recognition. But do unsupervised computer vision models automatically learn implicit patterns and embed social biases that could have harmful downstream effects? For the first time, we develop a novel method for quantifying biased associations between representations of social concepts and attributes in images. We find that state-of-the-art unsupervised models trained on ImageNet, a popular benchmark image dataset curated from internet images, automatically learn racial, gender, and intersectional biases. We replicate 8 of 15 documented human biases from social psychology, from the innocuous, as with insects and flowers, to the potentially harmful, as with race and gender. For the first time in the image domain, we replicate human-like biases about skin-tone and weight. Our results also closely match three hypotheses about intersectional bias from social psychology. When compared with statistical patterns in online image datasets, our findings suggest that machine learning models can automatically learn bias from the way people are stereotypically portrayed on the web.
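The biased associations are quantified with tests adapted from word-embedding work; below is a standard WEAT-style effect size over (assumed L2-normalized) image embeddings, which conveys the flavor of such a test rather than the paper's exact procedure.

    import numpy as np

    def association_effect_size(X, Y, A, B):
        """WEAT-style effect size between two sets of target embeddings
        (X, Y) and two sets of attribute embeddings (A, B); rows are
        assumed L2-normalized, so dot products are cosine similarities."""
        def s(w):
            return (A @ w).mean() - (B @ w).mean()

        sx = np.array([s(x) for x in X])
        sy = np.array([s(y) for y in Y])
        pooled = np.concatenate([sx, sy])
        return (sx.mean() - sy.mean()) / pooled.std(ddof=1)

A large positive effect size indicates that X is more strongly associated with attribute set A (and Y with B) in the learned representation.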
32. Generative Tomography Reconstruction [PDF]
Matteo Ronchetti, Davide Bacciu
Abstract: We propose an end-to-end differentiable architecture for tomography reconstruction that directly maps a noisy sinogram into a denoised reconstruction. Compared to existing approaches, our end-to-end architecture produces more accurate reconstructions while using fewer parameters and less time. We also propose a generative model that, given a noisy sinogram, can sample realistic reconstructions. This generative model can be used as a prior inside an iterative process that, by taking the physical model into consideration, can reduce artifacts and errors in the reconstructions.
33. Medical Deep Learning -- A systematic Meta-Review [PDF]
Jan Egger, Christina Gsaxner, Antonio Pepe, Jianning Li
Abstract: Deep learning has had a remarkable impact on different scientific disciplines over the last years. This was demonstrated in numerous tasks, where deep learning algorithms were able to outperform the state-of-the-art methods, also in image processing and analysis. Moreover, deep learning delivers good results in tasks like autonomous driving, which could not have been performed automatically before. There are even applications where deep learning outperformed humans, like object recognition or games. Another field in which this development is showing huge potential is the medical domain. With the collection of large quantities of patient records and data, and a trend towards personalized treatments, there is a great need for automatic and reliable processing and analysis of this information. Patient data is not only collected in clinical centres, like hospitals, but also comes from general practitioners, healthcare smartphone apps or online websites, just to name a few. This trend resulted in new, massive research efforts during the last years. In Q2/2020, the search engine PubMed already returns over 11,000 results for the search term 'deep learning', and around 90% of these publications are from the last three years. Hence, a complete overview of the field of 'medical deep learning' is almost impossible to obtain, and getting a full overview of its medical sub-fields becomes increasingly difficult. Nevertheless, several review and survey articles about medical deep learning have been presented within the last years. They focused, in general, on specific medical scenarios, like the analysis of medical images containing specific pathologies. With these surveys as a foundation, the aim of this contribution is to provide a very first high-level, systematic meta-review of medical deep learning surveys.
34. Deep Manifold Computing and Visualization [PDF]
Stan Z. Li, Zelin Zang, Lirong Wu
Abstract: The ability to preserve the local geometry of highly nonlinear manifolds in high-dimensional spaces and properly unfold them into lower-dimensional hyperplanes is key to the success of manifold computing, nonlinear dimensionality reduction (NLDR) and visualization. This paper proposes a novel method, called elastic locally isometric smoothness (ELIS), to empower deep neural networks with such an ability. ELIS requires that a desired metric between points should be preserved across layers in order to preserve local geometry; such a smoothness constraint effectively regularizes vector-based transformations to become well-behaved local metric-preserving homeomorphisms. Moreover, ELIS requires that the smoothness should be imposed in a way that renders sufficient flexibility for tackling complicated nonlinearity and non-Euclideanity; this is achieved layer-wise via nonlinearity in both the similarity and activation functions. The ELIS method incorporates a class of suitable nonlinear similarity functions into a two-way divergence loss and uses hyperparameter continuation in finding optimal solutions. Extensive experiments, comparisons, and an ablation study demonstrate that ELIS can deliver results not only superior to UMAP and t-SNE for NLDR and visualization but also better than other leading manifold and autoencoder learning counterparts for NLDR and manifold data generation.
35. Large-Scale MIDI-based Composer Classification [PDF]
Qiuqiang Kong, Keunwoo Choi, Yuxuan Wang
Abstract: Music classification is the task of classifying a music piece into labels such as genres or composers. We propose large-scale MIDI-based composer classification systems using GiantMIDI-Piano, a transcription-based dataset. We propose to use piano rolls, onset rolls, and velocity rolls as input representations, and use deep neural networks as classifiers. To our knowledge, we are the first to investigate the composer classification problem with up to 100 composers. Using convolutional recurrent neural networks as models, our MIDI-based composer classification system achieves 10-composer and 100-composer classification accuracies of 0.648 and 0.385 (evaluated on 30-second clips) and 0.739 and 0.489 (evaluated on music pieces), respectively. Our MIDI-based composer system outperforms several audio-based baseline classification systems, indicating the effectiveness of using compact MIDI representations for composer classification.
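A minimal convolutional recurrent classifier over stacked piano, onset, and velocity rolls might look as follows; every layer size here is an assumption for illustration, not the paper's configuration.

    import torch
    import torch.nn as nn

    class CRNNComposerClassifier(nn.Module):
        """Sketch of a convolutional recurrent classifier on piano-roll
        input: 3 stacked rolls (piano/onset/velocity) over 88 pitches."""
        def __init__(self, num_composers=100):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d((1, 2)),
            )
            # 88 pitch bins pooled twice by 2 -> 22 frequency-like bins
            self.gru = nn.GRU(64 * 22, 128, batch_first=True, bidirectional=True)
            self.fc = nn.Linear(256, num_composers)

        def forward(self, rolls):                  # rolls: (B, 3, T, 88)
            z = self.conv(rolls)                   # (B, 64, T, 22)
            z = z.permute(0, 2, 1, 3).flatten(2)   # (B, T, 64*22)
            out, _ = self.gru(z)
            return self.fc(out.mean(dim=1))        # clip-level composer logits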
36. Classification Beats Regression: Counting of Cells from Greyscale Microscopic Images based on Annotation-free Training Samples [PDF]
Xin Ding, Qiong Zhang, William J. Welch
Abstract: Modern methods often formulate the counting of cells from microscopic images as a regression problem, and more or less rely on expensive, manually annotated training images (e.g., dot annotations indicating the centroids of cells, or segmentation masks identifying the contours of cells). This work proposes a supervised learning framework based on classification-oriented convolutional neural networks (CNNs) to count cells from greyscale microscopic images without using annotated training images. In this framework, we formulate the cell counting task as an image classification problem, where the cell counts are taken as class labels. This formulation has a limitation when some cell counts in the test stage do not appear in the training data. Moreover, the ordinal relation among cell counts is not utilized. To deal with these limitations, we propose a simple but effective data augmentation (DA) method to synthesize images for the unseen cell counts. We also introduce an ensemble method, which can not only moderate the influence of unseen cell counts but also utilize the ordinal information to improve the prediction accuracy. This framework outperforms many modern cell counting methods and won the data analysis competition (Case Study 1: Counting Cells From Microscopic Images, this https URL) of the 47th Annual Meeting of the Statistical Society of Canada (SSC). Our code is available at this https URL.
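One simple way to use the ordinal structure, consistent in spirit with (but not necessarily identical to) the paper's ensemble, is to read the softmax over count classes as a distribution and predict its expectation, so uncertain or unseen counts can fall between training labels:

    import torch

    def expected_count(logits, counts):
        """Sketch of an ordinal-aware prediction: expectation of the
        softmax distribution over count classes.
        logits: (B, K) class scores; counts: (K,) cell count per class."""
        probs = logits.softmax(dim=1)
        return probs @ counts.float()   # (B,) real-valued count estimates

    # e.g. counts = torch.tensor([0, 5, 10, 20]) for a 4-class counting model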
37. Scaling Laws for Autoregressive Generative Modeling [PDF] 返回目录
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, Sam McCandlish
Abstract: We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving. In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law. The optimal model size also depends on the compute budget through a power-law, with exponents that are nearly universal across all data domains. The cross-entropy loss has an information theoretic interpretation as $S($True$) + D_{\mathrm{KL}}($True$||$Model$)$, and the empirical scaling laws suggest a prediction for both the true data distribution's entropy and the KL divergence between the true and model distributions. With this interpretation, billion-parameter Transformers are nearly perfect models of the YFCC100M image distribution downsampled to an $8\times 8$ resolution, and we can forecast the model size needed to achieve any given reducible loss (i.e., $D_{\mathrm{KL}}$) in nats/image for other resolutions. We find a number of additional scaling laws in specific domains: (a) we identify a scaling relation for the mutual information between captions and images in multimodal models, and show how to answer the question "Is a picture worth a thousand words?"; (b) in the case of mathematical problem solving, we identify scaling laws for model performance when extrapolating beyond the training distribution; (c) we finetune generative image models for ImageNet classification and find smooth scaling of the classification loss and error rate, even as the generative loss levels off. Taken together, these results strengthen the case that scaling laws have important implications for neural network performance, including on downstream tasks.
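The "power-law plus constant" form lends itself to a short illustration: with $L(x) = L_\infty + (x_0/x)^\alpha$, the constant $L_\infty$ plays the role of the irreducible entropy term and the power-law part the reducible KL term. The sketch below fits this form to synthetic points with SciPy; the variable names and data are illustrative, not the paper's fits.

```python
# Sketch: fitting a power-law-plus-constant scaling law,
#   L(x) = L_inf + (x0 / x) ** alpha,
# where L_inf approximates the irreducible loss (entropy term) and the
# power-law part is the reducible loss (KL term). Synthetic data only.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, l_inf, x0, alpha):
    return l_inf + (x0 / x) ** alpha

# Synthetic (model size, loss) points standing in for measured values.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
rng = np.random.default_rng(0)
losses = scaling_law(sizes, 2.0, 1e5, 0.3) + rng.normal(0.0, 0.01, sizes.size)

params, _ = curve_fit(scaling_law, sizes, losses, p0=[1.0, 1e5, 0.5])
l_inf, x0, alpha = params
print(f"irreducible loss ~ {l_inf:.3f}, exponent ~ {alpha:.3f}")
# Reducible loss (the KL term) forecast at a given model size:
print("reducible loss at 1e9 params:", scaling_law(1e9, *params) - l_inf)
```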
38. Unsupervised Domain Adaptation for Visual Navigation [PDF] 返回目录
Shangda Li, Devendra Singh Chaplot, Yao-Hung Hubert Tsai, Yue Wu, Louis-Philippe Morency, Ruslan Salakhutdinov
Abstract: Advances in visual navigation methods have led to intelligent embodied navigation agents capable of learning meaningful representations from raw RGB images and performing a wide variety of tasks involving structural and semantic reasoning. However, most learning-based navigation policies are trained and tested in simulation environments. For these policies to be practically useful, they need to be transferred to the real world. In this paper, we propose an unsupervised domain adaptation method for visual navigation. Our method translates the images in the target domain to the source domain such that the translation is consistent with the representations learned by the navigation policy. The proposed method outperforms several baselines across two different navigation tasks in simulation. We further show that our method can be used to transfer the navigation policies learned in simulation to the real world.
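The key constraint, translation that preserves the policy's learned representation, can be sketched as a loss combining a standard adversarial term with a feature-consistency term. This is a rough sketch of the idea only: the module names below are placeholders, and the full training procedure (discriminator updates, generator architecture, etc.) is omitted.

```python
# Sketch: a generator maps target-domain images toward the source domain,
# while a penalty keeps the (frozen) navigation policy's features unchanged
# by the translation. `generator`, `discriminator`, and `policy_encoder`
# are hypothetical modules, not the paper's implementation.
import torch
import torch.nn.functional as F

def uda_losses(generator, discriminator, policy_encoder, target_imgs):
    fake_source = generator(target_imgs)

    # Adversarial term: translated images should look like the source domain.
    d_out = discriminator(fake_source)
    adv_loss = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))

    # Consistency term: translation must not change what the navigation
    # policy "sees" -- its learned representation of the observation.
    with torch.no_grad():
        feats_before = policy_encoder(target_imgs)
    feats_after = policy_encoder(fake_source)
    consistency_loss = F.mse_loss(feats_after, feats_before)

    return adv_loss + consistency_loss
```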
39. Neural Architecture Search of SPD Manifold Networks [PDF] 返回目录
Rhea Sanjay Sukthanker, Zhiwu Huang, Suryansh Kumar, Erik Goron Endsjo, Yan Wu, Luc Van Gool
Abstract: In this paper, we propose a new neural architecture search (NAS) problem for Symmetric Positive Definite (SPD) manifold networks. Unlike the conventional NAS problem, our problem requires searching for a unique computational cell called the SPD cell. This SPD cell serves as a basic building block of SPD neural architectures. An efficient solution to our problem is important to minimize the extraneous manual effort in SPD neural architecture design. To accomplish this goal, we first introduce a geometrically rich and diverse SPD neural architecture search space for efficient SPD cell design. Further, we model our new NAS problem using the supernet strategy, which casts the architecture search problem as a one-shot training process of a single supernet. Based on the supernet modeling, we apply a differentiable NAS algorithm on our relaxed continuous search space for SPD neural architecture search. Statistical evaluation of our method on drone, action, and emotion recognition tasks mostly provides better results than state-of-the-art SPD networks and NAS algorithms. Empirical results show that our algorithm excels at discovering better SPD network designs and provides models that are more than 3 times lighter than those found by state-of-the-art NAS algorithms.
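The supernet relaxation the authors build on can be illustrated with a DARTS-style mixed operation: each edge of the searched cell outputs a softmax-weighted sum over candidate operations, and the weights are trained by gradient descent alongside the network weights. The generic candidates below stand in for the paper's SPD-specific operations (which act on symmetric positive definite matrices); this is a sketch of the relaxation, not the paper's search space.

```python
# Sketch: one-shot supernet via continuous relaxation. Every edge computes a
# softmax-weighted mixture of candidate ops; the mixture weights (alpha) are
# the architecture parameters. Candidates here are generic placeholders.
import torch
import torch.nn as nn

class MixedOp(nn.Module):
    def __init__(self, candidates):
        super().__init__()
        self.ops = nn.ModuleList(candidates)
        # One architecture parameter per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))

    def forward(self, x):
        weights = self.alpha.softmax(dim=0)   # continuous relaxation
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Example: a tiny search space of three generic candidate operations.
candidates = [
    nn.Identity(),
    nn.Linear(16, 16),
    nn.Sequential(nn.Linear(16, 16), nn.Tanh()),
]
mixed = MixedOp(candidates)
out = mixed(torch.randn(4, 16))
# After search, the op with the largest alpha is kept (discretization step).
```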