Kristin Dana

CV
h-index6
31papers
2,515citations
Novelty49%
AI Score51

31 Papers

CVNov 22, 2022
ViFi-Loc: Multi-modal Pedestrian Localization using GAN with Camera-Phone Correspondences

Hansi Liu, Kristin Dana, Marco Gruteser et al.

In Smart City and Vehicle-to-Everything (V2X) systems, acquiring pedestrians' accurate locations is crucial to traffic safety. Current systems adopt cameras and wireless sensors to detect and estimate people's locations via sensor fusion. Standard fusion algorithms, however, become inapplicable when multi-modal data is not associated. For example, pedestrians are out of the camera field of view, or data from camera modality is missing. To address this challenge and produce more accurate location estimations for pedestrians, we propose a Generative Adversarial Network (GAN) architecture. During training, it learns the underlying linkage between pedestrians' camera-phone data correspondences. During inference, it generates refined position estimations based only on pedestrians' phone data that consists of GPS, IMU and FTM. Results show that our GAN produces 3D coordinates at 1 to 2 meter localization error across 5 different outdoor scenes. We further show that the proposed model supports self-learning. The generated coordinates can be associated with pedestrian's bounding box coordinates to obtain additional camera-phone data correspondences. This allows automatic data collection during inference. After fine-tuning on the expanded dataset, localization accuracy is improved by up to 26%.

CVOct 4, 2023
ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements

Bryan Bo Cao, Abrar Alali, Hansi Liu et al.

Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, vehicle to pedestrian communication and so on. In the computer vision domain, tracking is usually achieved by first detecting subjects with bounding boxes, then associating detected bounding boxes across video frames. For many IoT systems, images captured by cameras are usually sent over the network to be processed at a different site that has more powerful computing resources than edge devices. However, sending entire frames through the network causes significant bandwidth consumption that may exceed the system bandwidth constraints. To tackle this problem, we propose ViFiT, a transformer-based model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages a transformer ability of better modeling long-term time series data. ViFiT is evaluated on Vi-Fi Dataset, a large-scale multimodal dataset in 5 diverse real world scenes, including indoor and outdoor environments. To fill the gap of proper metrics of jointly capturing the system characteristics of both tracking quality and video bandwidth reduction, we propose a novel evaluation framework dubbed Minimum Required Frames (MRF) and Minimum Required Frames Ratio (MRFR). ViFiT achieves an MRFR of 0.65 that outperforms the state-of-the-art approach for cross-modal reconstruction in LSTM Encoder-Decoder architecture X-Translator of 0.98, resulting in a high frame reduction rate as 97.76%.

CVOct 11, 2022
ViFiCon: Vision and Wireless Association Via Self-Supervised Contrastive Learning

Nicholas Meegan, Hansi Liu, Bryan Cao et al.

We introduce ViFiCon, a self-supervised contrastive learning scheme which uses synchronized information across vision and wireless modalities to perform cross-modal association. Specifically, the system uses pedestrian data collected from RGB-D camera footage as well as WiFi Fine Time Measurements (FTM) from a user's smartphone device. We represent the temporal sequence by stacking multi-person depth data spatially within a banded image. Depth data from RGB-D (vision domain) is inherently linked with an observable pedestrian, but FTM data (wireless domain) is associated only to a smartphone on the network. To formulate the cross-modal association problem as self-supervised, the network learns a scene-wide synchronization of the two modalities as a pretext task, and then uses that learned representation for the downstream task of associating individual bounding boxes to specific smartphones, i.e. associating vision and wireless information. We use a pre-trained region proposal model on the camera footage and then feed the extrapolated bounding box information into a dual-branch convolutional neural network along with the FTM data. We show that compared to fully supervised SoTA models, ViFiCon achieves high performance vision-to-wireless association, finding which bounding box corresponds to which smartphone device, without hand-labeled association examples for training data.

CVDec 2, 2022
Learning a Pedestrian Social Behavior Dictionary

Faith Johnson, Kristin Dana

Understanding pedestrian behavior patterns is a key component to building autonomous agents that can navigate among humans. We seek a learned dictionary of pedestrian behavior to obtain a semantic description of pedestrian trajectories. Supervised methods for dictionary learning are impractical since pedestrian behaviors may be unknown a priori and the process of manually generating behavior labels is prohibitively time consuming. We instead utilize a novel, unsupervised framework to create a taxonomy of pedestrian behavior observed in a specific space. First, we learn a trajectory latent space that enables unsupervised clustering to create an interpretable pedestrian behavior dictionary. We show the utility of this dictionary for building pedestrian behavior maps to visualize space usage patterns and for computing the distributions of behaviors. We demonstrate a simple but effective trajectory prediction by conditioning on these behavior labels. While many trajectory analysis methods rely on RNNs or transformers, we develop a lightweight, low-parameter approach and show results comparable to SOTA on the ETH and UCY datasets.

CVDec 8, 2025Code
EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset

Ronan John, Aditya Kesari, Vincenzo DiMatteo et al.

We address the challenge of predicting human visual attention during real-world navigation by measuring and modeling egocentric pedestrian eye gaze in an outdoor campus setting. We introduce the EgoCampus dataset, which spans 25 unique outdoor paths over 6 km across a university campus with recordings from more than 80 distinct human pedestrians, resulting in a diverse set of gaze-annotated videos. The system used for collection, Meta's Project Aria glasses, integrates eye tracking, front-facing RGB cameras, inertial sensors, and GPS to provide rich data from the human perspective. Unlike many prior egocentric datasets that focus on indoor tasks or exclude eye gaze information, our work emphasizes visual attention while subjects walk in outdoor campus paths. Using this data, we develop EgoCampusNet, a novel method to predict eye gaze of navigating pedestrians as they move through outdoor environments. Our contributions provide both a new resource for studying real-world attention and a resource for future work in gaze prediction models for navigation. Dataset and code are available upon request, and will be made publicly available at a later date at https://github.com/ComputerVisionRutgers/EgoCampus .

CVFeb 22, 2024Code
A Landmark-Aware Visual Navigation Dataset

Faith Johnson, Bryan Bo Cao, Kristin Dana et al.

Map representations learned by expert demonstrations have shown promising research value. However, the field of visual navigation still faces challenges due to the lack of real-world human-navigation datasets that can support efficient, supervised, representation learning of environments. We present a Landmark-Aware Visual Navigation (LAVN) dataset to allow for supervised learning of human-centric exploration policies and map building. We collect RGBD observation and human point-click pairs as a human annotator explores virtual and real-world environments with the goal of full coverage exploration of the space. The human annotators also provide distinct landmark examples along each trajectory, which we intuit will simplify the task of map or graph building and localization. These human point-clicks serve as direct supervision for waypoint prediction when learning to explore in environments. Our dataset covers a wide spectrum of scenes, including rooms in indoor environments, as well as walkways outdoors. We release our dataset with detailed documentation at https://huggingface.co/datasets/visnavdataset/lavn (DOI: 10.57967/hf/2386) and a plan for long-term preservation.

RODec 10, 2025
YOPO-Nav: Visual Navigation using 3DGS Graphs from One-Pass Videos

Ryan Meegan, Adam D'Souza, Bryan Bo Cao et al.

Visual navigation has emerged as a practical alternative to traditional robotic navigation pipelines that rely on detailed mapping and path planning. However, constructing and maintaining 3D maps is often computationally expensive and memory-intensive. We address the problem of visual navigation when exploration videos of a large environment are available. The videos serve as a visual reference, allowing a robot to retrace the explored trajectories without relying on metric maps. Our proposed method, YOPO-Nav (You Only Pass Once), encodes an environment into a compact spatial representation composed of interconnected local 3D Gaussian Splatting (3DGS) models. During navigation, the framework aligns the robot's current visual observation with this representation and predicts actions that guide it back toward the demonstrated trajectory. YOPO-Nav employs a hierarchical design: a visual place recognition (VPR) module provides coarse localization, while the local 3DGS models refine the goal and intermediate poses to generate control actions. To evaluate our approach, we introduce the YOPO-Campus dataset, comprising 4 hours of egocentric video and robot controller inputs from over 6 km of human-teleoperated robot trajectories. We benchmark recent visual navigation methods on trajectories from YOPO-Campus using a Clearpath Jackal robot. Experimental results show YOPO-Nav provides excellent performance in image-goal navigation for real-world scenes on a physical robot. The dataset and code will be made publicly available for visual navigation and scene representation research.

ROJan 15
FeudalNav: A Simple Framework for Visual Navigation

Faith Johnson, Bryan Bo Cao, Shubham Jain et al.

Visual navigation for robotics is inspired by the human ability to navigate environments using visual cues and memory, eliminating the need for detailed maps. In unseen, unmapped, or GPS-denied settings, traditional metric map-based methods fall short, prompting a shift toward learning-based approaches with minimal exploration. In this work, we develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. A key component of the approach is a latent-space memory module organized solely by visual similarity, as a proxy for distance. This alternative to graph-based topological representations proves sufficient for navigation tasks, providing a compact, light-weight, simple-to-train navigator that can find its way to the goal in novel locations. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference. An additional contribution leverages the interpretablility of the framework for interactive navigation. We consider the question: how much direction intervention/interaction is needed to achieve success in all trials? We demonstrate that even minimal human involvement can significantly enhance overall navigation performance.

LGOct 9, 2023
Hierarchical Reinforcement Learning for Temporal Pattern Prediction

Faith Johnson, Kristin Dana

In this work, we explore the use of hierarchical reinforcement learning (HRL) for the task of temporal sequence prediction. Using a combination of deep learning and HRL, we develop a stock agent to predict temporal price sequences from historical stock price data and a vehicle agent to predict steering angles from first person, dash cam images. Our results in both domains indicate that a type of HRL, called feudal reinforcement learning, provides significant improvements to training speed and stability and prediction accuracy over standard RL. A key component to this success is the multi-resolution structure that introduces both temporal and spatial abstraction into the network hierarchy.

CVAug 31, 2023
Vision-Based Cranberry Crop Ripening Assessment

Faith Johnson, Jack Lowry, Kristin Dana et al.

Agricultural domains are being transformed by recent advances in AI and computer vision that support quantitative visual evaluation. Using drone imaging, we develop a framework for characterizing the ripening process of cranberry crops. Our method consists of drone-based time-series collection over a cranberry growing season, photometric calibration for albedo recovery from pixels, and berry segmentation with semi-supervised deep learning networks using point-click annotations. By extracting time-series berry albedo measurements, we evaluate four different varieties of cranberries and provide a quantification of their ripening rates. Such quantification has practical implications for 1) assessing real-time overheating risks for cranberry bogs; 2) large scale comparisons of progeny in crop breeding; 3) detecting disease by looking for ripening pattern outliers. This work is the first of its kind in quantitative evaluation of ripening using computer vision methods and has impact beyond cranberry crops including wine grapes, olives, blueberries, and maize.

CVFeb 19, 2024
Feudal Networks for Visual Navigation

Faith Johnson, Bryan Bo Cao, Ashwin Ashok et al.

Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is interactive exploration while building a topological graph with images at nodes that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, a significant number of training videos are needed, large graphs are utilized, and scenes are not unseen since odometry is utilized. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Key to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. Two unique modules are developed in this framework. For the high-level manager, we learn a memory proxy map in a self supervised manner to record prior observations in a learned latent space and avoid the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals imitating human waypoint selection during local navigation. This waypoint network is pre-trained using a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near SOTA performance, while providing a novel no-RL, no-graph, no-odometry, no-metric map approach to the image goal navigation task.

CVNov 15, 2024
Memory Proxy Maps for Visual Navigation

Faith Johnson, Bryan Bo Cao, Ashwin Ashok et al.

Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local navigation. For the low-level worker agent, we learn a classifier over a discrete action space that avoids local obstacles and moves the agent towards the WayNet waypoint. The resulting feudal navigation network offers a novel approach with no RL, no graph, no odometry, and no metric map; all while achieving SOTA results on the image goal navigation task.

CVDec 12, 2024
Agtech Framework for Cranberry-Ripening Analysis Using Vision Foundation Models

Faith Johnson, Ryan Meegan, Jack Lowry et al.

Agricultural domains are being transformed by recent advances in AI and computer vision that support quantitative visual evaluation. Using aerial and ground imaging over a time series, we develop a framework for characterizing the ripening process of cranberry crops, a crucial component for precision agriculture tasks such as comparing crop breeds (high-throughput phenotyping) and detecting disease. Using drone imaging, we capture images from 20 waypoints across multiple bogs, and using ground-based imaging (hand-held camera), we image same bog patch using fixed fiducial markers. Both imaging methods are repeated to gather a multi-week time series spanning the entire growing season. Aerial imaging provides multiple samples to compute a distribution of albedo values. Ground imaging enables tracking of individual berries for a detailed view of berry appearance changes. Using vision transformers (ViT) for feature detection after segmentation, we extract a high dimensional feature descriptor of berry appearance. Interpretability of appearance is critical for plant biologists and cranberry growers to support crop breeding decisions (e.g.\ comparison of berry varieties from breeding programs). For interpretability, we create a 2D manifold of cranberry appearance by using a UMAP dimensionality reduction on ViT features. This projection enables quantification of ripening paths and a useful metric of ripening rate. We demonstrate the comparison of four cranberry varieties based on our ripening assessments. This work is the first of its kind and has future impact for cranberries and for other crops including wine grapes, olives, blueberries, and maize. Aerial and ground datasets are made publicly available.

CVOct 10, 2025
Modeling Time-Lapse Trajectories to Characterize Cranberry Growth

Ronan John, Anis Chihoub, Ryan Meegan et al.

Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github. com/ronan-39/tlt/).

CVJun 18, 2021
Towards Single Stage Weakly Supervised Semantic Segmentation

Peri Akiva, Kristin Dana

The costly process of obtaining semantic segmentation labels has driven research towards weakly supervised semantic segmentation (WSSS) methods, using only image-level, point, or box labels. The lack of dense scene representation requires methods to increase complexity to obtain additional semantic information about the scene, often done through multiple stages of training and refinement. Current state-of-the-art (SOTA) models leverage image-level labels to produce class activation maps (CAMs) which go through multiple stages of refinement before they are thresholded to make pseudo-masks for supervision. The multi-stage approach is computationally expensive, and dependency on image-level labels for CAMs generation lacks generalizability to more complex scenes. In contrary, our method offers a single-stage approach generalizable to arbitrary dataset, that is trainable from scratch, without any dependency on pre-trained backbones, classification, or separate refinement tasks. We utilize point annotations to generate reliable, on-the-fly pseudo-masks through refined and filtered features. While our method requires point annotations that are only slightly more expensive than image-level annotations, we are to demonstrate SOTA performance on benchmark datasets (PascalVOC 2012), as well as significantly outperform other SOTA WSSS methods on recent real-world datasets (CRAID, CityPersons, IAD).

CVNov 8, 2020
AI on the Bog: Monitoring and Evaluating Cranberry Crop Risk

Peri Akiva, Benjamin Planche, Aditi Roy et al.

Machine vision for precision agriculture has attracted considerable research interest in recent years. The goal of this paper is to develop an end-to-end cranberry health monitoring system to enable and support real time cranberry over-heating assessment to facilitate informed decisions that may sustain the economic viability of the farm. Toward this goal, we propose two main deep learning-based modules for: 1) cranberry fruit segmentation to delineate the exact fruit regions in the cranberry field image that are exposed to sun, 2) prediction of cloud coverage conditions and sun irradiance to estimate the inner temperature of exposed cranberries. We develop drone-based field data and ground-based sky data collection systems to collect video imagery at multiple time points for use in crop health analysis. Extensive evaluation on the data set shows that it is possible to predict exposed fruit's inner temperature with high accuracy (0.02% MAPE). The sun irradiance prediction error was found to be 8.41-20.36% MAPE in the 5-20 minutes time horizon. With 62.54% mIoU for segmentation and 13.46 MAE for counting accuracies in exposed fruit identification, this system is capable of giving informed feedback to growers to take precautionary action (e.g. irrigation) in identified crop field regions with higher risk of sunburn in the near future. Though this novel system is applied for cranberry health monitoring, it represents a pioneering step forward for efficient farming and is useful in precision agriculture beyond the problem of cranberry overheating.

CVOct 11, 2020
H2O-Net: Self-Supervised Flood Segmentation via Adversarial Domain Adaptation and Label Refinement

Peri Akiva, Matthew Purri, Kristin Dana et al.

Accurate flood detection in near real time via high resolution, high latency satellite imagery is essential to prevent loss of lives by providing quick and actionable information. Instruments and sensors useful for flood detection are only available in low resolution, low latency satellites with region re-visit periods of up to 16 days, making flood alerting systems that use such satellites unreliable. This work presents H2O-Network, a self supervised deep learning method to segment floods from satellites and aerial imagery by bridging domain gap between low and high latency satellite and coarse-to-fine label refinement. H2O-Net learns to synthesize signals highly correlative with water presence as a domain adaptation step for semantic segmentation in high resolution satellite imagery. Our work also proposes a self-supervision mechanism, which does not require any hand annotation, used during training to generate high quality ground truth data. We demonstrate that H2O-Net outperforms the state-of-the-art semantic segmentation methods on satellite imagery by 10% and 12% pixel accuracy and mIoU respectively for the task of flood segmentation. We emphasize the generalizability of our model by transferring model weights trained on satellite imagery to drone imagery, a highly different sensor and domain.

CVSep 22, 2020
Angular Luminance for Material Segmentation

Jia Xue, Matthew Purri, Kristin Dana

Moving cameras provide multiple intensity measurements per pixel, yet often semantic segmentation, material recognition, and object recognition do not utilize this information. With basic alignment over several frames of a moving camera sequence, a distribution of intensities over multiple angles is obtained. It is well known from prior work that luminance histograms and the statistics of natural images provide a strong material recognition cue. We utilize per-pixel {\it angular luminance distributions} as a key feature in discriminating the material of the surface. The angle-space sampling in a multiview satellite image sequence is an unstructured sampling of the underlying reflectance function of the material. For real-world materials there is significant intra-class variation that can be managed by building a angular luminance network (AngLNet). This network combines angular reflectance cues from multiple images with spatial cues as input to fully convolutional networks for material segmentation. We demonstrate the increased performance of AngLNet over prior state-of-the-art in material segmentation from satellite imagery.

CVJun 11, 2020
Feudal Steering: Hierarchical Learning for Steering Angle Prediction

Faith Johnson, Kristin Dana

We consider the challenge of automated steering angle prediction for self driving cars using egocentric road images. In this work, we explore the use of feudal networks, used in hierarchical reinforcement learning (HRL), to devise a vehicle agent to predict steering angles from first person, dash-cam images of the Udacity driving dataset. Our method, Feudal Steering, is inspired by recent work in HRL consisting of a manager network and a worker network that operate on different temporal scales and have different goals. The manager works at a temporal scale that is relatively coarse compared to the worker and has a higher level, task-oriented goal space. Using feudal learning to divide the task into manager and worker sub-networks provides more accurate and robust prediction. Temporal abstraction in driving allows more complex primitives than the steering angle at a single time instance. Composite actions comprise a subroutine or skill that can be re-used throughout the driving sequence. The associated subroutine id is the manager network's goal, so that the manager seeks to succeed at the high level task (e.g. a sharp right turn, a slight right turn, moving straight in traffic, or moving straight unencumbered by traffic). The steering angle at a particular time instance is the worker network output which is regulated by the manager's high level task. We demonstrate state-of-the art steering angle prediction results on the Udacity dataset.

CVApr 29, 2020
Teaching Cameras to Feel: Estimating Tactile Physical Properties of Surfaces From Images

Matthew Purri, Kristin Dana

The connection between visual input and tactile sensing is critical for object manipulation tasks such as grasping and pushing. In this work, we introduce the challenging task of estimating a set of tactile physical properties from visual information. We aim to build a model that learns the complex mapping between visual information and tactile physical properties. We construct a first of its kind image-tactile dataset with over 400 multiview image sequences and the corresponding tactile properties. A total of fifteen tactile physical properties across categories including friction, compliance, adhesion, texture, and thermal conductance are measured and then estimated by our models. We develop a cross-modal framework comprised of an adversarial objective and a novel visuo-tactile joint classification loss. Additionally, we develop a neural architecture search framework capable of selecting optimal combinations of viewing angles for estimating a given physical property.

CVApr 18, 2020
Finding Berries: Segmentation and Counting of Cranberries using Point Supervision and Shape Priors

Peri Akiva, Kristin Dana, Peter Oudemans et al.

Precision agriculture has become a key factor for increasing crop yields by providing essential information to decision makers. In this work, we present a deep learning method for simultaneous segmentation and counting of cranberries to aid in yield estimation and sun exposure predictions. Notably, supervision is done using low cost center point annotations. The approach, named Triple-S Network, incorporates a three-part loss with shape priors to promote better fitting to objects of known shape typical in agricultural scenes. Our results improve overall segmentation performance by more than 6.74% and counting results by 22.91% when compared to state-of-the-art. To train and evaluate the network, we have collected the CRanberry Aerial Imagery Dataset (CRAID), the largest dataset of aerial drone imagery from cranberry fields. This dataset will be made publicly available.

CVApr 17, 2019
Material Segmentation of Multi-View Satellite Imagery

Matthew Purri, Jia Xue, Kristin Dana et al.

Material recognition methods use image context and local cues for pixel-wise classification. In many cases only a single image is available to make a material prediction. Image sequences, routinely acquired in applications such as mutliview stereo, can provide a sampling of the underlying reflectance functions that reveal pixel-level material attributes. We investigate multi-view material segmentation using two datasets generated for building material segmentation and scene material segmentation from the SpaceNet Challenge satellite image dataset. In this paper, we explore the impact of multi-angle reflectance information by introducing the \textit{reflectance residual encoding}, which captures both the multi-angle and multispectral information present in our datasets. The residuals are computed by differencing the sparse-sampled reflectance function with a dictionary of pre-defined dense-sampled reflectance functions. Our proposed reflectance residual features improves material segmentation performance when integrated into pixel-wise and semantic segmentation architectures. At test time, predictions from individual segmentations are combined through softmax fusion and refined by building segment voting. We demonstrate robust and accurate pixelwise segmentation results using the proposed material segmentation pipeline.

CVMar 29, 2018
Deep Texture Manifold for Ground Terrain Recognition

Jia Xue, Hang Zhang, Kristin Dana

We present a texture network called Deep Encoding Pooling Network (DEP) for the task of ground terrain recognition. Recognition of ground terrain is an important task in establishing robot or vehicular control parameters, as well as for localization within an outdoor environment. The architecture of DEP integrates orderless texture details and local spatial information and the performance of DEP surpasses state-of-the-art methods for this task. The GTOS database (comprised of over 30,000 images of 40 classes of ground terrain in outdoor scenes) enables supervised recognition. For evaluation under realistic conditions, we use test images that are not from the existing GTOS dataset, but are instead from hand-held mobile phone videos of similar terrain. This new evaluation dataset, GTOS-mobile, consists of 81 videos of 31 classes of ground terrain such as grass, gravel, asphalt and sand. The resultant network shows excellent performance not only for GTOS-mobile, but also for more general databases (MINC and DTD). Leveraging the discriminant features learned from this network, we build a new texture manifold called DEP-manifold. We learn a parametric distribution in feature space in a fully supervised manner, which gives the distance relationship among classes and provides a means to implicitly represent ambiguous class boundaries. The source code and database are publicly available.

CVMar 23, 2018
Context Encoding for Semantic Segmentation

Hang Zhang, Kristin Dana, Jianping Shi et al.

Recent work has made significant progress in improving spatial resolution for pixelwise labeling with Fully Convolutional Network (FCN) framework by employing Dilated/Atrous convolution, utilizing multi-scale features and refining boundaries. In this paper, we explore the impact of global contextual information in semantic segmentation by introducing the Context Encoding Module, which captures the semantic context of scenes and selectively highlights class-dependent featuremaps. The proposed Context Encoding Module significantly improves semantic segmentation results with only marginal extra computation cost over FCN. Our approach has achieved new state-of-the-art results 51.7% mIoU on PASCAL-Context, 85.9% mIoU on PASCAL VOC 2012. Our single model achieves a final score of 0.5567 on ADE20K test set, which surpass the winning entry of COCO-Place Challenge in 2017. In addition, we also explore how the Context Encoding Module can improve the feature representation of relatively shallow networks for the image classification on CIFAR-10 dataset. Our 14 layer network has achieved an error rate of 3.45%, which is comparable with state-of-the-art approaches with over 10 times more layers. The source code for the complete system are publicly available.

ROApr 24, 2017
Development of An Autonomous Bridge Deck Inspection Robotic System

Hung M. La, Nenad Gucunski, Kristin Dana et al.

The threat to safety of aging bridges has been recognized as a critical concern to the general public due to the poor condition of many bridges in the U.S. Currently, the bridge inspection is conducted manually, and it is not efficient to identify bridge condition deterioration in order to facilitate implementation of appropriate maintenance or rehabilitation procedures. In this paper, we report a new development of the autonomous mobile robotic system for bridge deck inspection and evaluation. The robot is integrated with several nondestructive evaluation (NDE) sensors and a navigation control algorithm to allow it to accurately and autonomously maneuver on the bridge deck to collect visual images and conduct NDE measurements. The developed robotic system can reduce the cost and time of the bridge deck data collection and inspection. For efficient bridge deck monitoring, the crack detection algorithm to build the deck crack map is presented in detail. The impact-echo (IE), ultrasonic surface waves (USW) and electrical resistivity (ER) data collected by the robot are analyzed to generate the delamination, concrete elastic modulus, corrosion maps of the bridge deck, respectively. The presented robotic system has been successfully deployed to inspect numerous bridges in more than ten different states in the U.S.

CVMar 20, 2017
Multi-style Generative Network for Real-time Transfer

Hang Zhang, Kristin Dana

Despite the rapid progress in style transfer, existing approaches using feed-forward generative network for multi-style or arbitrary-style transfer are usually compromised of image quality and model flexibility. We find it is fundamentally difficult to achieve comprehensive style modeling using 1-dimensional style embedding. Motivated by this, we introduce CoMatch Layer that learns to match the second order feature statistics with the target styles. With the CoMatch Layer, we build a Multi-style Generative Network (MSG-Net), which achieves real-time performance. We also employ an specific strategy of upsampled convolution which avoids checkerboard artifacts caused by fractionally-strided convolution. Our method has achieved superior image quality comparing to state-of-the-art approaches. The proposed MSG-Net as a general approach for real-time style transfer is compatible with most existing techniques including content-style interpolation, color-preserving, spatial control and brush stroke size control. MSG-Net is the first to achieve real-time brush-size control in a purely feed-forward manner for style transfer. Our implementations and pre-trained models for Torch, PyTorch and MXNet frameworks will be publicly available.

CVDec 8, 2016
Deep TEN: Texture Encoding Network

Hang Zhang, Jia Xue, Kristin Dana

We propose a Deep Texture Encoding Network (Deep-TEN) with a novel Encoding Layer integrated on top of convolutional layers, which ports the entire dictionary learning and encoding pipeline into a single model. Current methods build from distinct components, using standard encoders with separate off-the-shelf features such as SIFT descriptors or pre-trained CNN features for material recognition. Our new approach provides an end-to-end learning framework, where the inherent visual vocabularies are learned directly from the loss function. The features, dictionaries and the encoding representation for the classifier are all learned simultaneously. The representation is orderless and therefore is particularly useful for material and texture recognition. The Encoding Layer generalizes robust residual encoders such as VLAD and Fisher Vectors, and has the property of discarding domain specific information which makes the learned convolutional features easier to transfer. Additionally, joint training using multiple datasets of varied sizes and class labels is supported resulting in increased recognition performance. The experimental results show superior performance as compared to state-of-the-art methods using gold-standard databases such as MINC-2500, Flickr Material Database, KTH-TIPS-2b, and two recent databases 4D-Light-Field-Material and GTOS. The source code for the complete system are publicly available.

CVDec 7, 2016
Differential Angular Imaging for Material Recognition

Jia Xue, Hang Zhang, Kristin Dana et al.

Material recognition for real-world outdoor surfaces has become increasingly important for computer vision to support its operation "in the wild." Computational surface modeling that underlies material recognition has transitioned from reflectance modeling using in-lab controlled radiometric measurements to image-based representations based on internet-mined images of materials captured in the scene. We propose to take a middle-ground approach for material recognition that takes advantage of both rich radiometric cues and flexible image capture. We realize this by developing a framework for differential angular imaging, where small angular variations in image capture provide an enhanced appearance representation and significant recognition improvement. We build a large-scale material database, Ground Terrain in Outdoor Scenes (GTOS) database, geared towards real use for autonomous agents. The database consists of over 30,000 images covering 40 classes of outdoor ground terrain under varying weather and lighting conditions. We develop a novel approach for material recognition called a Differential Angular Imaging Network (DAIN) to fully leverage this large dataset. With this novel network architecture, we extract characteristics of materials encoded in the angular and spatial gradients of their appearance. Our results show that DAIN achieves recognition performance that surpasses single view or coarsely quantized multiview images. These results demonstrate the effectiveness of differential angular imaging as a means for flexible, in-place material recognition.

CVApr 6, 2016
Reading Between the Pixels: Photographic Steganography for Camera Display Messaging

Eric Wengrowski, Kristin Dana, Marco Gruteser et al.

We exploit human color metamers to send light-modulated messages less visible to the human eye, but recoverable by cameras. These messages are a key component to camera-display messaging, such as handheld smartphones capturing information from electronic signage. Each color pixel in the display image is modified by a particular color gradient vector. The challenge is to find the color gradient that maximizes camera response, while minimizing human response. The mismatch in human spectral and camera sensitivity curves creates an opportunity for hidden messaging. Our approach does not require knowledge of these sensitivity curves, instead we employ a data-driven method. We learn an ellipsoidal partitioning of the six-dimensional space of colors and color gradients. This partitioning creates metamer sets defined by the base color at the display pixel and the color gradient direction for message encoding. We sample from the resulting metamer sets to find color steps for each base color to embed a binary message into an arbitrary image with reduced visible artifacts. Unlike previous methods that rely on visually obtrusive intensity modulation, we embed with color so that the message is more hidden. Ordinary displays and cameras are used without the need for expensive LEDs or high speed devices. The primary contribution of this work is a framework to map the pixels in an arbitrary image to a metamer pair for steganographic photo messaging.

CVMar 25, 2016
Friction from Reflectance: Deep Reflectance Codes for Predicting Physical Surface Properties from One-Shot In-Field Reflectance

Hang Zhang, Kristin Dana, Ko Nishino

Images are the standard input for vision algorithms, but one-shot infield reflectance measurements are creating new opportunities for recognition and scene understanding. In this work, we address the question of what reflectance can reveal about materials in an efficient manner. We go beyond the question of recognition and labeling and ask the question: What intrinsic physical properties of the surface can be estimated using reflectance? We introduce a framework that enables prediction of actual friction values for surfaces using one-shot reflectance measurements. This work is a first of its kind vision-based friction estimation. We develop a novel representation for reflectance disks that capture partial BRDF measurements instantaneously. Our method of deep reflectance codes combines CNN features and fisher vector pooling with optimal binary embedding to create codes that have sufficient discriminatory power and have important properties of illumination and spatial invariance. The experimental results demonstrate that reflectance can play a new role in deciphering the underlying physical properties of real-world scenes.

CVFeb 7, 2015
Reflectance Hashing for Material Recognition

Hang Zhang, Kristin Dana, Ko Nishino

We introduce a novel method for using reflectance to identify materials. Reflectance offers a unique signature of the material but is challenging to measure and use for recognizing materials due to its high-dimensionality. In this work, one-shot reflectance is captured using a unique optical camera measuring {\it reflectance disks} where the pixel coordinates correspond to surface viewing angles. The reflectance has class-specific stucture and angular gradients computed in this reflectance space reveal the material class. These reflectance disks encode discriminative information for efficient and accurate material recognition. We introduce a framework called reflectance hashing that models the reflectance disks with dictionary learning and binary hashing. We demonstrate the effectiveness of reflectance hashing for material recognition with a number of real-world materials.