Jochen Triesch

h-index: 15

Papers: 33

Citations: 141

Novelty: 45%

AI Score: 52

Ranked #35,554 of 201,018 authors (top 18%); #14,108 in CV (top 24%)

33 Papers

LG | Jul 27, 2022

Time to augment self-supervised visual representation learning

Arthur Aubret, Markus Ernst, Céline Teulière et al.

Biological vision systems are unparalleled in their ability to learn visual representations without supervision. In machine learning, self-supervised learning (SSL) has led to major advances in forming object representations in an unsupervised fashion. Such systems learn representations invariant to augmentation operations over images, like cropping or flipping. In contrast, biological vision systems exploit the temporal structure of the visual experience during natural interactions with objects. This gives access to "augmentations" not commonly used in SSL, like watching the same object from multiple viewpoints or against different backgrounds. Here, we systematically investigate and compare the potential benefits of such time-based augmentations during natural interactions for learning object categories. Our results show that time-based augmentations achieve large performance gains over state-of-the-art image augmentations. Specifically, our analyses reveal that: 1) 3-D object manipulations drastically improve the learning of object categories; 2) viewing objects against changing backgrounds is important for learning to discard background-related information from the latent representation. Overall, we conclude that time-based augmentations during natural interactions with objects can substantially improve self-supervised learning, narrowing the gap between artificial and biological vision systems.

CV | Feb 5, 2023

CIPER: Combining Invariant and Equivariant Representations Using Contrastive and Predictive Learning

Xia Xu, Jochen Triesch

Self-supervised representation learning (SSRL) methods have shown great success in computer vision. In recent studies, augmentation-based contrastive learning methods have been proposed for learning representations that are invariant or equivariant to pre-defined data augmentation operations. However, invariant or equivariant features favor only specific downstream tasks depending on the augmentations chosen. They may result in poor performance when the learned representation does not match task requirements. Here, we consider an active observer that can manipulate views of an object and has knowledge of the action(s) that generated each view. We introduce Contrastive Invariant and Predictive Equivariant Representation learning (CIPER). CIPER comprises both invariant and equivariant learning objectives using one shared encoder and two different output heads on top of the encoder. One output head is a projection head with a state-of-the-art contrastive objective to encourage invariance to augmentations. The other is a prediction head estimating the augmentation parameters, capturing equivariant features. Both heads are discarded after training and only the encoder is used for downstream tasks. We evaluate our method on static image tasks and time-augmented image datasets. Our results show that CIPER outperforms a baseline contrastive method on various tasks. Interestingly, CIPER encourages the formation of hierarchically structured representations where different views of an object become systematically organized in the latent representation space.
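The two-head objective described above can be illustrated with a minimal sketch. It assumes an InfoNCE-style contrastive term on the projection head and a mean-squared-error term on the prediction head, added with a weight. The function names, the `weight` parameter, and the choice of these two particular loss terms are illustrative assumptions, not the formulation used in the CIPER paper.

```python
import math

def cosine(u, v):
    # Cosine similarity; assumes both vectors are nonzero.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def info_nce(anchor, positive, negatives, temperature=0.1):
    # Invariance term: small when the anchor is closer to the positive
    # view than to any negative view.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

def mse(pred, target):
    # Equivariance term: error of the predicted augmentation/action parameters.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def two_head_loss(proj_a, proj_b, proj_negatives,
                  predicted_action, true_action, weight=1.0):
    # Hypothetical combination: contrastive loss on the projection head
    # plus a weighted prediction loss on the prediction head.
    contrastive = info_nce(proj_a, proj_b, proj_negatives)
    predictive = mse(predicted_action, true_action)
    return contrastive + weight * predictive
```

In this sketch both heads would sit on top of the same encoder output; only the encoder would be kept for downstream tasks, as the abstract states.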

LG | May 12, 2022

Embodied vision for learning object representations

Arthur Aubret, Céline Teulière, Jochen Triesch

Recent time-contrastive learning approaches manage to learn invariant object representations without supervision. This is achieved by mapping successive views of an object onto close-by internal representations. When considering this learning approach as a model of the development of human object recognition, it is important to consider what visual input a toddler would typically observe while interacting with objects. First, human vision is highly foveated, with high resolution only available in the central region of the field of view. Second, objects may be seen against a blurry background due to infants' limited depth of field. Third, during object manipulation a toddler mostly observes close objects filling a large part of the field of view due to their rather short arms. Here, we study how these effects impact the quality of visual representations learnt through time-contrastive learning. To this end, we let a visually embodied agent "play" with objects in different locations of a near photo-realistic flat. During each play session the agent views an object in multiple orientations before turning its body to view another object. The resulting sequence of views feeds a time-contrastive learning algorithm. Our results show that visual statistics mimicking those of a toddler improve object recognition accuracy in both familiar and novel environments. We argue that this effect is caused by the reduction of features extracted in the background, a neural network bias for large features in the image and a greater similarity between novel and familiar background regions. We conclude that the embodied nature of visual learning may be crucial for understanding the development of human object perception.
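To make the idea of "successive views" concrete, the following sketch builds positive pairs from a view sequence and measures how many of them show the same object. The `(object_id, orientation)` tuple layout and both function names are assumptions made for illustration; the paper's actual data pipeline may differ.

```python
def successive_pairs(sequence):
    # Pair every view with the view that directly follows it in time.
    return list(zip(sequence, sequence[1:]))

def same_object_fraction(pairs):
    # Fraction of temporal pairs in which both views show the same object.
    same = sum(1 for (obj_a, _), (obj_b, _) in pairs if obj_a == obj_b)
    return same / len(pairs)

# Hypothetical play sequence: three views of a cup, then two of a ball.
views = [("cup", 0), ("cup", 90), ("cup", 180), ("ball", 0), ("ball", 90)]
pairs = successive_pairs(views)
print(same_object_fraction(pairs))  # prints 0.75
```

Longer play sessions per object raise this fraction, which is one way to see why pulling successive views together tends to group views of the same object.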

CV | Jul 9, 2024

Self-supervised visual learning from interactions with objects

Arthur Aubret, Céline Teulière, Jochen Triesch

Self-supervised learning (SSL) has revolutionized visual representation learning, but has not achieved the robustness of human vision. A reason for this could be that SSL does not leverage all the data available to humans during learning. When learning about an object, humans often purposefully turn or move around objects and research suggests that these interactions can substantially enhance their learning. Here we explore whether such object-related actions can boost SSL. For this, we extract the actions performed to change from one ego-centric view of an object to another in four video datasets. We then introduce a new loss function to learn visual and action embeddings by aligning the performed action with the representations of two images extracted from the same clip. This permits the performed actions to structure the latent visual representation. Our experiments show that our method consistently outperforms previous methods on downstream category recognition. In our analysis, we find that the observed improvement is associated with a better viewpoint-wise alignment of different objects from the same category. Overall, our work demonstrates that embodied interactions with objects can improve SSL of object categories.

CV | Oct 18, 2022

Sequence and Circle: Exploring the Relationship Between Patches

Zhengyang Yu, Jochen Triesch

The vision transformer (ViT) has achieved state-of-the-art results in various vision tasks. It utilizes a learnable position embedding (PE) mechanism to encode the location of each image patch. However, it is presently unclear if this learnable PE is really necessary and what its benefits are. This paper explores two alternative ways of encoding the location of individual patches that exploit prior knowledge about their spatial arrangement. One is called the sequence relationship embedding (SRE), and the other is called the circle relationship embedding (CRE). The SRE treats all patches as an ordered sequence in which adjacent patches are separated by the same interval. The CRE treats the central patch as the center of a circle and measures the distance of the remaining patches from the center based on the four-neighborhood principle. Multiple concentric circles with different radii group different patches. Finally, we implemented these two relations on three classic ViTs and tested them on four popular datasets. Experiments show that SRE and CRE can replace PE to reduce the random learnable parameters while achieving the same performance. Combining SRE or CRE with PE yields better performance than using PE alone.
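A small sketch of the two patch relations may help. It assumes that the SRE assigns raster-order indices with unit spacing and that the "four neighborhoods principle" of the CRE corresponds to city-block (Manhattan) distance from the central patch. Both readings, and the function names, are assumptions based on the abstract alone.

```python
def sequence_relation(rows, cols):
    # SRE reading: patches are numbered in raster order, so neighbouring
    # patches in the sequence are always one step apart.
    return [[r * cols + c for c in range(cols)] for r in range(rows)]

def circle_relation(rows, cols):
    # CRE reading: ring index = city-block distance to the central patch.
    # Assumes odd grid sizes so that a single central patch exists.
    centre_r, centre_c = rows // 2, cols // 2
    return [[abs(r - centre_r) + abs(c - centre_c) for c in range(cols)]
            for r in range(rows)]

print(sequence_relation(3, 3))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(circle_relation(3, 3))    # [[2, 1, 2], [1, 0, 1], [2, 1, 2]]
```

Patches that share a ring index would receive the same positional code under this reading, which is how the fixed relation can stand in for a learned per-patch embedding.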

NC | Apr 30 | Code

Simulating Infant First-Person Sensorimotor Experience via Motion Retargeting from Babies to Humanoids

Francisco M. López, Hoshinori Kanazawa, Ondrej Fiala et al.

Motion retargeting from humans to human-like artificial agents is becoming increasingly important as humanoid robots grow more capable. However, most existing approaches focus only on reproducing kinematics and ignore the rich sensorimotor experience associated with human movement. In this work, we present a framework for simulating the multimodal sensorimotor experiences of infants using physical and virtual humanoids. From a single video, our method reconstructs the infant's body configuration by extracting its skeletal structure and estimating the full 3D pose from each frame. Then we map the reconstructed motion onto several developmental platforms: the physical iCub robot and the virtual simulators pyCub, EMFANT and MIMo. Replaying the retargeted motions on these embodiments produces simulated multisensory streams including proprioception (joints and muscles), touch, and vision. For the best-matching embodiment, the retargeting achieves sub-centimeter accuracy and enables a rich multimodal analysis of infant development as well as enhanced automated annotation of behaviors. This framework provides a unique window into the infant's sensorimotor experience, offering new tools for robotics, developmental science, and early detection of neurodevelopmental disorders. The code is available at https://github.com/ctu-vras/motion-retargeting/.

AI | Dec 7, 2023 | Code

MIMo: A Multi-Modal Infant Model for Studying Cognitive Development

Dominik Mattern, Pierre Schumacher, Francisco M. López et al.

Human intelligence and human consciousness emerge gradually during the process of cognitive development. Understanding this development is an essential aspect of understanding the human mind and may facilitate the construction of artificial minds with similar properties. Importantly, human cognitive development relies on embodied interactions with the physical and social environment, which is perceived via complementary sensory modalities. These interactions allow the developing mind to probe the causal structure of the world. This is in stark contrast to common machine learning approaches, e.g., for large language models, which are merely passively "digesting" large amounts of training data, but are not in control of their sensory inputs. However, computational modeling of the kind of self-determined embodied interactions that lead to human intelligence and consciousness is a formidable challenge. Here we present MIMo, an open-source multi-modal infant model for studying early cognitive development through computer simulations. MIMo's body is modeled after an 18-month-old child with detailed five-fingered hands. MIMo perceives its surroundings via binocular vision, a vestibular system, proprioception, and touch perception through a full-body virtual skin, while two different actuation models allow control of its body. We describe the design and interfaces of MIMo and provide examples illustrating its use. All code is available at https://github.com/trieschlab/MIMo.

NC | Dec 19, 2025

Re-assessing the evidence for mental rotation abilities in children using computational models

Arthur Aubret, Jochen Triesch

There is strong and diverse evidence for mental rotation (MR) abilities in adults. However, current evidence for MR in children rests on just a few behavioral paradigms adapted from the adult literature. Here, we leverage recent computational models of the development of children's object recognition abilities to re-assess the evidence for MR in children. The computational models simulate infants' acquisition of object representations during embodied interactions with objects. We consider two different object recognition strategies, different from MRs, and assess their ability to replicate results from three classical MR tasks assigned to children between the ages of 6 months and 5 years. Our results show that MR may play no role in producing the results obtained from children younger than 5 years. In fact, we find that a simple recognition strategy that reflects a pixel-wise comparison of stimuli is sufficient to model children's behavior in the most used MR task. Thus, our study reopens the debate on how and when children develop genuine MR abilities.

CV | Feb 4

Temporal Slowness in Central Vision Drives Semantic Object Learning

Timothy Schaumlöffel, Arthur Aubret, Gemma Roig et al.

Humans acquire semantic object representations from egocentric visual streams with minimal supervision. Importantly, the visual system processes with high resolution only the center of its field of view and learns similar representations for visual inputs occurring close in time. This emphasizes slowly changing information around gaze locations. This study investigates the role of central vision and slowness learning in the formation of semantic object representations from human-like visual experience. We simulate five months of human-like visual experience using the Ego4D dataset and generate gaze coordinates with a state-of-the-art gaze prediction model. Using these predictions, we extract crops that mimic central vision and train a time-contrastive Self-Supervised Learning model on them. Our results show that combining temporal slowness and central vision improves the encoding of different semantic facets of object representations. Specifically, focusing on central vision strengthens the extraction of foreground object features, while considering temporal slowness, especially during fixational eye movements, allows the model to encode broader semantic information about objects. These findings provide new insights into the mechanisms by which humans may develop semantic object representations from natural visual experience.

CV | Sep 23, 2025 | Code

SynapFlow: A Modular Framework Towards Large-Scale Analysis of Dendritic Spines

Pamela Osuna-Vargas, Altug Kamacioglu, Dominik F. Aschauer et al.

Dendritic spines are key structural components of excitatory synapses in the brain. Given that the size of dendritic spines provides a proxy for synaptic efficacy, their detection and tracking across time are important for studies of the neural basis of learning and memory. Despite their relevance, large-scale analyses of the structural dynamics of dendritic spines in 3D+time microscopy data remain challenging and labor-intensive. Here, we present a modular machine learning-based pipeline designed to automate the detection, time-tracking, and feature extraction of dendritic spines in volumes chronically recorded with two-photon microscopy. Our approach tackles the challenges posed by biological data by combining a transformer-based detection module, a depth-tracking component that integrates spatial features, a time-tracking module to associate 3D spines across time by leveraging spatial consistency, and a feature extraction unit that quantifies biologically relevant spine properties. We validate our method on open-source labeled spine data, and on two complementary annotated datasets that we publish alongside this work: one for detection and depth-tracking, and one for time-tracking, which, to the best of our knowledge, is the first data of this kind. To encourage future research, we release our data, code, and pre-trained weights at https://github.com/pamelaosuna/SynapFlow, establishing a baseline for scalable, end-to-end analysis of dendritic spine dynamics.

CV | Dec 7, 2023

Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Timothy Schaumlöffel, Arthur Aubret, Gemma Roig et al.

Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in suitable contexts for word learning like dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects that are different from the one to which the child attends. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. For this we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of ego-centric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease/increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention on object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.

CV | Apr 19, 2024

Learning Object Semantic Similarity with Self-Supervision

Arthur Aubret, Timothy Schaumlöffel, Gemma Roig et al.

Humans judge the similarity of two objects not just based on their visual appearance but also based on their semantic relatedness. However, it remains unclear how humans learn about semantic relationships between objects and categories. One important source of semantic knowledge is that semantically related objects frequently co-occur in the same context. For instance, forks and plates are perceived as similar, at least in part, because they are often experienced together in a "kitchen" or "eating" context. Here, we investigate whether a bio-inspired learning principle exploiting such co-occurrence statistics suffices to learn a semantically structured object representation de novo from raw visual or combined visual and linguistic input. To this end, we simulate temporal sequences of visual experience by binding together short video clips of real-world scenes showing objects in different contexts. A bio-inspired neural network model aligns close-in-time visual representations while also aligning visual and category label representations to simulate visuo-language alignment. Our results show that our model clusters object representations based on their context, e.g. kitchen or bedroom, in particular in high-level layers of the network, akin to humans. In contrast, lower-level layers tend to better reflect object identity or category. To achieve this, the model exploits two distinct strategies: the visuo-language alignment ensures that different objects of the same category are represented similarly, whereas the temporal alignment leverages that objects from the same context are frequently seen in succession to make their representations more similar. Overall, our work suggests temporal and visuo-language alignment as plausible computational principles for explaining the origins of certain forms of semantic knowledge in humans.

CV | Apr 11, 2024

Self-Supervised Learning of Color Constancy

Markus R. Ernst, Francisco M. López, Arthur Aubret et al.

Color constancy (CC) describes the ability of the visual system to perceive an object as having a relatively constant color despite changes in lighting conditions. While CC and its limitations have been carefully characterized in humans, it is still unclear how the visual system acquires this ability during development. Here, we present a first study showing that CC develops in a neural network trained in a self-supervised manner through an invariance learning objective. During learning, objects are presented under changing illuminations, while the network aims to map subsequent views of the same object onto close-by latent representations. This gives rise to representations that are largely invariant to the illumination conditions, offering a plausible example of how CC could emerge during human cognitive development via a form of self-supervised learning.

LG | Jan 6, 2025

Seeing the Whole in the Parts in Self-Supervised Representation Learning

Arthur Aubret, Céline Teulière, Jochen Triesch

Recent successes in self-supervised learning (SSL) model spatial co-occurrences of visual features either by masking portions of an image or by aggressively cropping it. Here, we propose a new way to model spatial co-occurrences by aligning local representations (before pooling) with a global image representation. We present CO-SSL, a family of instance discrimination methods, and show that it outperforms previous methods on several datasets, including ImageNet-1K, where it achieves 71.5% Top-1 accuracy with 100 pre-training epochs. CO-SSL is also more robust to noise corruption, internal corruption, small adversarial attacks, and large training crop sizes. Our analysis further indicates that CO-SSL learns highly redundant local representations, which offers an explanation for its robustness. Overall, our work suggests that aligning local and global representations may be a powerful principle of unsupervised category learning.

LG | Feb 21, 2025

Hierarchical Residuals Exploit Brain-Inspired Compositionality

Francisco M. López, Jochen Triesch

We present Hierarchical Residual Networks (HiResNets), deep convolutional neural networks with long-range residual connections between layers at different hierarchical levels. HiResNets draw inspiration from the organization of the mammalian brain by replicating the direct connections from subcortical areas to the entire cortical hierarchy. We show that the inclusion of hierarchical residuals in several architectures, including ResNets, results in a boost in accuracy and faster learning. A detailed analysis of our models reveals that they perform hierarchical compositionality by learning feature maps relative to the compressed representations provided by the skip connections.

CV | Jan 6, 2025

Human Gaze Boosts Object-Centered Representation Learning

Timothy Schaumlöffel, Arthur Aubret, Gemma Roig et al.

Recent self-supervised learning (SSL) models trained on human-like egocentric visual inputs substantially underperform on image recognition tasks compared to humans. These models train on raw, uniform visual inputs collected from head-mounted cameras. This is different from humans, as the anatomical structure of the retina and visual cortex relatively amplifies the central visual information, i.e. around humans' gaze location. This selective amplification in humans likely aids in forming object-centered visual representations. Here, we investigate whether focusing on central visual information boosts egocentric visual object learning. We simulate five months of egocentric visual experience using the large-scale Ego4D dataset and generate gaze locations with a human gaze prediction model. To account for the importance of central vision in humans, we crop the visual area around the gaze location. Finally, we train a time-based SSL model on these modified inputs. Our experiments demonstrate that focusing on central vision leads to better object-centered representations. Our analysis shows that the SSL model leverages the temporal dynamics of the gaze movements to build stronger visual representations. Overall, our work marks a significant step toward bio-inspired learning of visual representations.
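The gaze-centered cropping step can be sketched as a bounding-box computation. The square crop, the pixel coordinate convention, and the choice to shift the box back inside the frame near the borders are all assumptions made for this illustration; the abstract does not specify how border cases are handled.

```python
def gaze_crop_box(gaze_x, gaze_y, crop_size, width, height):
    # Returns (left, top, right, bottom) of a square crop centred on the
    # gaze point. Assumes crop_size <= width and crop_size <= height.
    half = crop_size // 2
    # Shift the box back inside the frame when the gaze is near a border
    # (one possible policy, chosen here for simplicity).
    left = min(max(gaze_x - half, 0), width - crop_size)
    top = min(max(gaze_y - half, 0), height - crop_size)
    return left, top, left + crop_size, top + crop_size

print(gaze_crop_box(100, 100, 64, 640, 480))  # (68, 68, 132, 132)
print(gaze_crop_box(10, 470, 64, 640, 480))   # (0, 416, 64, 480)
```

Applying such a box to every frame would yield the sequence of central-vision crops on which a time-based SSL model could then be trained.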

CV | Nov 4, 2024

Toddlers' Active Gaze Behavior Supports Self-Supervised Object Learning

Zhengyang Yu, Arthur Aubret, Marcel C. Raabe et al.

Toddlers learn to recognize objects from different viewpoints with almost no supervision. During this learning, they execute frequent eye and head movements that shape their visual experience. It is presently unclear if and how these behaviors contribute to toddlers' emerging object recognition abilities. To answer this question, we here combine head-mounted eye tracking during dyadic play with unsupervised machine learning. We approximate toddlers' central visual field experience by cropping image regions from a head-mounted camera centered on the current gaze location estimated via eye tracking. This visual stream feeds an unsupervised computational model of toddlers' learning, which constructs visual representations that slowly change over time. Our experiments demonstrate that toddlers' gaze strategy supports the learning of invariant object representations. Our analysis also shows that the limited size of the central visual field where acuity is high is crucial for this. Overall, our work reveals how toddlers' gaze behavior may support their development of view-invariant object recognition.

CV | Sep 19, 2025

Simulated Cortical Magnification Supports Self-Supervised Object Learning

Zhengyang Yu, Arthur Aubret, Chen Yu et al.

Recent self-supervised learning models simulate the development of semantic object representations by training on visual experience similar to that of toddlers. However, these models ignore the foveated nature of human vision with high/low resolution in the center/periphery of the visual field. Here, we investigate the role of this varying resolution in the development of object representations. We leverage two datasets of egocentric videos that capture the visual experience of humans during interactions with objects. We apply models of human foveation and cortical magnification to modify these inputs, such that the visual content becomes less distinct towards the periphery. The resulting sequences are used to train two bio-inspired self-supervised learning models that implement a time-based learning objective. Our results show that modeling aspects of foveated vision improves the quality of the learned object representations in this setting. Our analysis suggests that this improvement comes from making objects appear bigger and inducing a better trade-off between central and peripheral visual information. Overall, this work takes a step towards making models of humans' learning of visual representations more realistic and performant.

AI | Jul 20, 2025

From Kicking to Causality: Simulating Infant Agency Detection with a Robust Intrinsic Reward

Xia Xu, Jochen Triesch

While human infants robustly discover their own causal efficacy, standard reinforcement learning agents remain brittle, as their reliance on correlation-based rewards fails in noisy, ecologically valid scenarios. To address this, we introduce the Causal Action Influence Score (CAIS), a novel intrinsic reward rooted in causal inference. CAIS quantifies an action's influence by measuring the 1-Wasserstein distance between the learned distribution of sensory outcomes conditional on that action, $p(h|a)$, and the baseline outcome distribution, $p(h)$. This divergence provides a robust reward that isolates the agent's causal impact from confounding environmental noise. We test our approach in a simulated infant-mobile environment where correlation-based perceptual rewards fail completely when the mobile is subjected to external forces. In stark contrast, CAIS enables the agent to filter this noise, identify its influence, and learn the correct policy. Furthermore, the high-quality predictive model learned for CAIS allows our agent, when augmented with a surprise signal, to successfully reproduce the "extinction burst" phenomenon. We conclude that explicitly inferring causality is a crucial mechanism for developing a robust sense of agency, offering a psychologically plausible framework for more adaptive autonomous systems.
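As a rough illustration of the quantity behind CAIS, the sketch below computes the 1-Wasserstein distance between two one-dimensional empirical samples, standing in for $p(h|a)$ and $p(h)$. The paper works with learned outcome distributions; using raw scalar samples of equal size is a simplification assumed here, and the function names are hypothetical.

```python
def wasserstein_1d(samples_p, samples_q):
    # For two equally sized 1-D samples, the 1-Wasserstein distance equals
    # the mean absolute difference between the sorted values.
    if len(samples_p) != len(samples_q):
        raise ValueError("this sketch assumes equal sample counts")
    p, q = sorted(samples_p), sorted(samples_q)
    return sum(abs(a - b) for a, b in zip(p, q)) / len(p)

def action_influence(outcomes_given_action, baseline_outcomes):
    # Hypothetical stand-in for the score: how far the outcome distribution
    # under an action lies from the baseline outcome distribution.
    return wasserstein_1d(outcomes_given_action, baseline_outcomes)

# An action that shifts every outcome by 1 has influence 1.0;
# an action that leaves the distribution unchanged has influence 0.0.
print(action_influence([1.0, 2.0, 3.0], [0.0, 1.0, 2.0]))  # 1.0
print(action_influence([0.0, 1.0, 2.0], [2.0, 0.0, 1.0]))  # 0.0
```

Because the comparison is between distributions rather than individual samples, noise that affects the outcomes equally with and without the action does not by itself produce a reward in this sketch.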

LG | Dec 16, 2021

Multiple Instance Learning for Brain Tumor Detection from Magnetic Resonance Spectroscopy Data

Diyuan Lu, Gerhard Kurz, Nenad Polomac et al.

We apply deep learning (DL) to magnetic resonance spectroscopy (MRS) data for the task of brain tumor detection. Medical applications often suffer from data scarcity and corruption by noise. Both of these problems are prominent in our data set. Furthermore, a varying number of spectra are available for the different patients. We address these issues by considering the task as a multiple instance learning (MIL) problem. Specifically, we aggregate multiple spectra from the same patient into a "bag" for classification and apply data augmentation techniques. To achieve permutation invariance during the process of bagging, we propose two approaches: (1) to apply min-, max-, and average-pooling on the features of all samples in one bag and (2) to apply an attention mechanism. We tested these two approaches on multiple neural network architectures. We demonstrate that classification performance is significantly improved when training on multiple instances rather than single spectra. We propose a simple oversampling data augmentation method and show that it could further improve the performance. Finally, we demonstrate that our proposed model outperforms manual classification by neuroradiologists according to most performance metrics.
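The first bagging approach can be sketched as follows. The sketch assumes that each spectrum has already been mapped to a fixed-length feature vector and that the min-, max-, and average-pooled features are concatenated; the concatenation and the function name are assumptions, since the abstract does not say how the three pooled vectors are combined. The attention-based variant is not shown.

```python
def pool_bag(bag):
    # bag: list of per-spectrum feature vectors, all of the same length.
    # Pooling runs across the instances of the bag, feature by feature,
    # so the result does not depend on the order of the instances.
    columns = list(zip(*bag))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    means = [sum(col) / len(col) for col in columns]
    return mins + maxs + means

bag = [[1.0, 4.0], [3.0, 2.0]]
print(pool_bag(bag))                    # [1.0, 2.0, 3.0, 4.0, 2.0, 3.0]
print(pool_bag(bag[::-1]) == pool_bag(bag))  # True
```

The pooled vector has a fixed length regardless of how many spectra a patient contributes, which is what allows bags of varying size to feed a single classifier.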

CV | Apr 21, 2021

Recurrent Feedback Improves Recognition of Partially Occluded Objects

Markus Roland Ernst, Jochen Triesch, Thomas Burwick

Recurrent connectivity in the visual cortex is believed to aid object recognition for challenging conditions such as occlusion. Here we investigate if and how artificial neural networks also benefit from recurrence. We compare architectures composed of bottom-up, lateral and top-down connections and evaluate their performance using two novel stereoscopic occluded object datasets. We find that classification accuracy is significantly higher for recurrent models when compared to feedforward models of matched parametric complexity. Additionally we show that for challenging stimuli, the recurrent feedback is able to correctly revise the initial feedforward guess.

CV | Jan 29, 2021

Learning Hierarchical Integration of Foveal and Peripheral Vision for Vergence Control by Active Efficient Coding

Zhetuo Zhao, Jochen Triesch, Bertram E. Shi

The active efficient coding (AEC) framework parsimoniously explains the joint development of visual processing and eye movements, e.g., the emergence of binocular disparity selective neurons and fusional vergence, the disjunctive eye movements that align left and right eye images. Vergence can be driven by information in both the fovea and periphery, which play complementary roles. The high resolution fovea can drive precise short range movements. The lower resolution periphery supports coarser long range movements. The fovea and periphery may also contain conflicting information, e.g. due to objects at different depths. While past AEC models did integrate peripheral and foveal information, they did not explicitly take into account these characteristics. We propose here a two-level hierarchical approach that does. The bottom level generates different vergence actions from foveal and peripheral regions. The top level selects one. We demonstrate that the hierarchical approach performs better than prior approaches in realistic environments, exhibiting better alignment and less oscillation.

CV | Jan 27, 2021

Self-Calibrating Active Binocular Vision via Active Efficient Coding with Deep Autoencoders

Charles Wilmot, Bertram E. Shi, Jochen Triesch

We present a model of the self-calibration of active binocular vision comprising the simultaneous learning of visual representations, vergence, and pursuit eye movements. The model follows the principle of Active Efficient Coding (AEC), a recent extension of the classic Efficient Coding Hypothesis to active perception. In contrast to previous AEC models, the present model uses deep autoencoders to learn sensory representations. We also propose a new formulation of the intrinsic motivation signal that guides the learning of behavior. We demonstrate the performance of the model in simulations.

LGJan 27, 2021

Learning Abstract Representations through Lossy Compression of Multi-Modal Signals

Charles Wilmot, Gianluca Baldassarre, Jochen Triesch

A key competence for open-ended learning is the formation of increasingly abstract representations useful for driving complex behavior. Abstract representations ignore specific details and facilitate generalization. Here we consider the learning of abstract representations in a multi-modal setting with two or more input modalities. We treat the problem as a lossy compression problem and show that generic lossy compression of multimodal sensory input naturally extracts abstract representations that tend to strip away modality-specific details and preferentially retain information that is shared across the different modalities. Furthermore, we propose an architecture to learn abstract representations by identifying and retaining only the information that is shared across multiple modalities while discarding any modality-specific information.

RONov 27, 2020

REAL-X -- Robot open-Ended Autonomous Learning Architectures: Achieving Truly End-to-End Sensorimotor Autonomous Learning Systems

Emilio Cartoni, Davide Montella, Jochen Triesch et al.

Open-ended learning is a core research field of developmental robotics and AI aiming to build learning machines and robots that can autonomously acquire knowledge and skills incrementally, as infants and children do. The first contribution of this work is to study the challenges posed by the previously proposed benchmark 'REAL competition', which aims to foster the development of truly open-ended learning robot architectures. The competition involves a simulated camera-arm robot that: (a) in a first 'intrinsic phase' acquires sensorimotor competence by autonomously interacting with objects; (b) in a second 'extrinsic phase' is tested with tasks unknown in the intrinsic phase to measure the quality of knowledge previously acquired. This benchmark requires the solution of multiple challenges usually tackled in isolation, in particular exploration, sparse rewards, object learning, generalisation, task/goal self-generation, and autonomous skill learning. As a second contribution, we present a set of 'REAL-X' robot architectures that are able to solve different versions of the benchmark, where we progressively release initial simplifications. The architectures are based on a planning approach that dynamically increases abstraction, and intrinsic motivations to foster exploration. REAL-X achieves a good performance level in very demanding conditions. We argue that the REAL benchmark represents a valuable tool for studying open-ended learning in its hardest form.

CVJun 17, 2020

Human-Expert-Level Brain Tumor Detection Using Deep Learning with Data Distillation and Augmentation

Diyuan Lu, Nenad Polomac, Iskra Gacheva et al.

The application of Deep Learning (DL) for medical diagnosis is often hampered by two problems. First, the amount of training data may be scarce, as it is limited by the number of patients who have acquired the condition to be diagnosed. Second, the training data may be corrupted by various types of noise. Here, we study the problem of brain tumor detection from magnetic resonance spectroscopy (MRS) data, where both types of problems are prominent. To overcome these challenges, we propose a new method for training a deep neural network that distills particularly representative training examples and augments the training data by mixing these samples from one class with those from the same and other classes to create additional training samples. We demonstrate that this technique substantially improves performance, allowing our method to reach human-expert-level accuracy with just a few thousand training examples. Interestingly, the network learns to rely on features of the data that are usually ignored by human experts, suggesting new directions for future research.

SPJun 17, 2020

Staging Epileptogenesis with Deep Neural Networks

Diyuan Lu, Sebastian Bauer, Valentin Neubert et al.

Epilepsy is a common neurological disorder characterized by recurrent seizures accompanied by excessive synchronous brain activity. The process of structural and functional brain alterations leading to increased seizure susceptibility and eventually spontaneous seizures is called epileptogenesis (EPG) and can span months or even years. Detecting and monitoring the progression of EPG could allow for targeted early interventions that could slow down disease progression or even halt its development. Here, we propose an approach for staging EPG using deep neural networks and identify potential electroencephalography (EEG) biomarkers to distinguish different phases of EPG. Specifically, continuous intracranial EEG recordings were collected from a rodent model where epilepsy is induced by electrical perforant pathway stimulation (PPS). A deep neural network (DNN) is trained to distinguish EEG signals from before stimulation (baseline), shortly after the PPS, and long after the PPS but before the first spontaneous seizure (FSS). Experimental results show that our proposed method can classify EEG signals from the three phases with an average area under the curve (AUC) of 0.93, 0.89, and 0.86, respectively. To the best of our knowledge, this represents the first successful attempt to stage EPG prior to the FSS using DNNs.

LGJun 11, 2020

Towards Early Diagnosis of Epilepsy from EEG Data

Diyuan Lu, Sebastian Bauer, Valentin Neubert et al.

Epilepsy is one of the most common neurological disorders, affecting about 1% of the population at all ages. Detecting the development of epilepsy, i.e., epileptogenesis (EPG), before any seizures occur could allow for early interventions and potentially more effective treatments. Here, we investigate if modern machine learning (ML) techniques can detect EPG from intra-cranial electroencephalography (EEG) recordings prior to the occurrence of any seizures. For this we use a rodent model of epilepsy where EPG is triggered by electrical stimulation of the brain. We propose an ML framework for EPG identification, which combines a deep convolutional neural network (CNN) with a prediction aggregation method to obtain the final classification decision. Specifically, the neural network is trained to distinguish five-second segments of EEG recordings taken from either the pre-stimulation period or the post-stimulation period. Due to the gradual development of epilepsy, there is enormous overlap of the EEG patterns before and after the stimulation. Hence, a prediction aggregation process is introduced, which pools predictions over a longer period. By aggregating predictions over one hour, our approach achieves an area under the curve (AUC) of 0.99 on the EPG detection task. This demonstrates the feasibility of EPG prediction from EEG recordings.
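
The prediction aggregation step described above can be illustrated with a minimal sketch. It assumes that pooling means averaging per-segment probabilities over non-overlapping windows; the abstract does not state the pooling rule, and the function name `aggregate_predictions` and the example values are hypothetical.

```python
from statistics import mean

# One hour of five-second segments, as in the abstract's aggregation period.
SEGMENTS_PER_HOUR = 3600 // 5


def aggregate_predictions(segment_probs, window):
    """Average per-segment probabilities over non-overlapping windows.

    Assumption: mean pooling. The paper may use a different aggregation rule.
    Trailing segments that do not fill a complete window are dropped.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        mean(segment_probs[i:i + window])
        for i in range(0, len(segment_probs) - window + 1, window)
    ]


# Hypothetical noisy per-segment outputs of a classifier (not real data).
noisy = [0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.4, 0.3]
pooled = aggregate_predictions(noisy, window=4)
```

With `window=SEGMENTS_PER_HOUR`, each pooled value would summarize one hour of recording, so individual ambiguous segments have less influence on the final decision.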

CVSep 12, 2019

Recurrent Connectivity Aids Recognition of Partly Occluded Objects

Markus Roland Ernst, Jochen Triesch, Thomas Burwick

Feedforward convolutional neural networks are the prevalent model of core object recognition. For challenging conditions, such as occlusion, neuroscientists believe that the recurrent connectivity in the visual cortex aids object recognition. In this work we investigate if and how artificial neural networks can also benefit from recurrent connectivity. For this we systematically compare architectures comprised of bottom-up (B), lateral (L) and top-down (T) connections. To evaluate performance, we introduce two novel stereoscopic occluded object datasets, which bridge the gap from classifying digits to recognizing 3D objects. The task consists of recognizing one target object occluded by multiple occluder objects. We find that recurrent models perform significantly better than their feedforward counterparts, which were matched in parametric complexity. We show that for challenging stimuli, the recurrent feedback is able to correctly revise the initial feedforward guess of the network. Overall, our results suggest that both artificial and biological neural networks can exploit recurrence for improved object recognition.

CVJul 20, 2019

Recurrent Connections Aid Occluded Object Recognition by Discounting Occluders

Markus Roland Ernst, Jochen Triesch, Thomas Burwick

Recurrent connections in the visual cortex are thought to aid object recognition when part of the stimulus is occluded. Here we investigate if and how recurrent connections in artificial neural networks similarly aid object recognition. We systematically test and compare architectures comprised of bottom-up (B), lateral (L) and top-down (T) connections. Performance is evaluated on a novel stereoscopic occluded object recognition dataset. The task consists of recognizing one target digit occluded by multiple occluder digits in a pseudo-3D environment. We find that recurrent models perform significantly better than their feedforward counterparts, which were matched in parametric complexity. Furthermore, we analyze how the network's representation of the stimuli evolves over time due to recurrent connections. We show that the recurrent connections tend to move the network's representation of an occluded digit towards its un-occluded version. Our results suggest that both the brain and artificial neural networks can exploit recurrent connectivity to aid occluded object recognition.

LGMar 19, 2019

Residual Deep Convolutional Neural Network for EEG Signal Classification in Epilepsy

Diyuan Lu, Jochen Triesch

Epilepsy is the fourth most common neurological disorder, affecting about 1% of the population at all ages. As many as 60% of people with epilepsy experience focal seizures which originate in a certain brain area and are limited to part of one cerebral hemisphere. In focal epilepsy patients, a precise surgical removal of the seizure onset zone can lead to effective seizure control or even a seizure-free outcome. Thus, correct identification of the seizure onset zone is essential. For clinical evaluation purposes, electroencephalography (EEG) recordings are commonly used. However, their interpretation is usually done manually by physicians and is time-consuming and error-prone. In this work, we propose an automated epileptic signal classification method based on modern deep learning methods. In contrast to previous approaches, the network is trained directly on the EEG recordings, avoiding hand-crafted feature extraction and selection procedures. This exploits the ability of deep neural networks to automatically detect and extract relevant features that may be too complex or subtle to be noticed by humans. The proposed network structure is based on a convolutional neural network with residual connections. We demonstrate that our network produces state-of-the-art performance on two benchmark data sets, a data set from Bonn University and the Bern-Barcelona data set. We conclude that modern deep learning approaches can reach state-of-the-art performance on epileptic EEG classification and automated seizure onset zone identification tasks when trained on raw EEG data. This suggests that such approaches have potential for improving clinical practice.

NCJun 21, 2016

An active efficient coding model of the optokinetic nystagmus

Chong Zhang, Jochen Triesch, Bertram E. Shi

Optokinetic nystagmus (OKN) is an involuntary eye movement responsible for stabilizing retinal images in the presence of relative motion between an observer and the environment. Fully understanding the development of optokinetic nystagmus requires a neurally plausible computational model that accounts for both the neural development and the behavior. To date, work in this area has been limited. We propose a neurally plausible framework for the joint development of disparity and motion tuning in the visual cortex and of optokinetic and vergence eye movements. This framework models the joint emergence of both perception and behavior, and accounts for the importance of the development of normal vergence control and binocular vision in achieving normal monocular OKN (mOKN) behaviors. Because the model includes behavior, we can simulate the same perturbations as performed in past experiments, such as artificially induced strabismus. The proposed model agrees both qualitatively and quantitatively with a number of findings from the literature on both binocular vision and the optokinetic reflex. Finally, our model also makes quantitative predictions about the OKN behavior using the same methods used to characterize the OKN in the experimental literature.

CVFeb 14, 2014

Intrinsically Motivated Learning of Visual Motion Perception and Smooth Pursuit

Chong Zhang, Yu Zhao, Jochen Triesch et al.

We extend the framework of efficient coding, which has been used to model the development of sensory processing in isolation, to model the development of the perception/action cycle. Our extension combines sparse coding and reinforcement learning so that sensory processing and behavior co-develop to optimize a shared intrinsic motivational signal: the fidelity of the neural encoding of the sensory input under resource constraints. Applying this framework to a model system consisting of an active eye behaving in a time varying environment, we find that this generic principle leads to the simultaneous development of both smooth pursuit behavior and model neurons whose properties are similar to those of primary visual cortical neurons selective for different directions of visual motion. We suggest that this general principle may form the basis for a unified and integrated explanation of many perception/action loops.