Unsupervised representation learning using convolutional and stacked auto-encoders: a domain and cross-domain feature space analysis
The paper provides guidelines for unsupervised representation learning in visual domains, but its contribution is incremental because it builds on existing auto-encoder methods.
The paper tackles the problem of learning image representations without labelled data by investigating auto-encoder architectures, and finds that the resulting features are as discriminative in classification as features from pre-trained CNNs.
A feature learning task involves training models that are capable of inferring good representations (transformations of the original space) from input data alone. When working with limited or unlabelled data, and also when multiple visual domains are considered, methods that rely on large annotated datasets, such as Convolutional Neural Networks (CNNs), cannot be employed. In this paper we investigate different auto-encoder (AE) architectures, which require no labels, and explore training strategies to learn representations from images. The models are evaluated on both the reconstruction error of the images and the discriminative power of the learned feature spaces. We study the role of dense and convolutional layers, as well as the depth and capacity of the networks, since these are shown to affect both the dimensionality reduction and the ability to generalise across visual domains. In classification, AE features were as discriminative as pre-trained CNN features. Our findings can be used as guidelines for the design of unsupervised representation learning methods within and across domains.
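
The two evaluation criteria named in the abstract, reconstruction error and the discriminative power of the feature space, can be illustrated with a minimal sketch. This is not the authors' architecture, data, or classifier. It assumes a single dense tanh encoder layer with a linear decoder, synthetic two-class data in place of images, plain gradient descent, and a nearest-centroid classifier as a stand-in measure of discriminative power. All function names and hyperparameters are hypothetical.

```python
import numpy as np


def make_toy_data(n_per_class=100, dim=8, seed=0):
    """Two synthetic classes lying near a 2-D subspace (assumed stand-in for images)."""
    rng = np.random.default_rng(seed)
    mixing = rng.normal(size=(2, dim))                 # latent -> observed map
    z0 = rng.normal(size=(n_per_class, 2)) + np.array([-2.0, 0.0])
    z1 = rng.normal(size=(n_per_class, 2)) + np.array([2.0, 0.0])
    noise = 0.05 * rng.normal(size=(2 * n_per_class, dim))
    X = np.vstack([z0, z1]) @ mixing + noise
    y = np.repeat([0, 1], n_per_class)
    X = (X - X.mean(axis=0)) / X.std(axis=0)           # standardise each feature
    return X, y


def train_autoencoder(X, code_dim=2, epochs=3000, lr=0.02, seed=0):
    """Dense auto-encoder (tanh encoder, linear decoder) trained on MSE, no labels."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    p = {
        "W_enc": rng.normal(0.0, 0.1, (d, code_dim)), "b_enc": np.zeros(code_dim),
        "W_dec": rng.normal(0.0, 0.1, (code_dim, d)), "b_dec": np.zeros(d),
    }
    history = []
    for _ in range(epochs):
        H = np.tanh(X @ p["W_enc"] + p["b_enc"])       # code (learned features)
        err = H @ p["W_dec"] + p["b_dec"] - X          # reconstruction residual
        history.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / err.size                   # d(MSE)/d(reconstruction)
        g_pre = (g_out @ p["W_dec"].T) * (1.0 - H ** 2)  # back through tanh
        grads = {
            "W_dec": H.T @ g_out, "b_dec": g_out.sum(axis=0),
            "W_enc": X.T @ g_pre, "b_enc": g_pre.sum(axis=0),
        }
        for name in p:
            p[name] -= lr * grads[name]
    return p, history


def encode(X, p):
    """Map inputs to the low-dimensional feature space."""
    return np.tanh(X @ p["W_enc"] + p["b_enc"])


def reconstruction_error(X, p):
    """Mean squared error between inputs and their reconstructions."""
    X_hat = encode(X, p) @ p["W_dec"] + p["b_dec"]
    return float(np.mean((X_hat - X) ** 2))


def nearest_centroid_accuracy(F_train, y_train, F_test, y_test):
    """Assumed proxy for discriminative power: classify by closest class mean."""
    classes = np.unique(y_train)
    centroids = np.stack([F_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(F_test[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(classes[dists.argmin(axis=1)] == y_test))


X, y = make_toy_data()
order = np.random.default_rng(1).permutation(len(X))
train_idx, test_idx = order[:150], order[150:]

params, history = train_autoencoder(X[train_idx])      # labels are never used here
features_train = encode(X[train_idx], params)
features_test = encode(X[test_idx], params)

test_mse = reconstruction_error(X[test_idx], params)
accuracy = nearest_centroid_accuracy(
    features_train, y[train_idx], features_test, y[test_idx]
)
print(f"train MSE: {history[0]:.3f} -> {history[-1]:.3f}")
print(f"test MSE: {test_mse:.3f}, nearest-centroid accuracy: {accuracy:.2f}")
```

Labels enter only at the evaluation step, which mirrors the unsupervised setting described above. Varying `code_dim` in this sketch is one simple way to see how network capacity trades dimensionality reduction against reconstruction quality.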