CLMay 1, 2022
Detecting COVID-19 Conspiracy Theories with Transformers and TF-IDFHaoming Guo, Tianyi Huang, Huixuan Huang et al.
The sharing of fake news and conspiracy theories on social media has wide-spread negative effects. By designing and applying different machine learning models, researchers have made progress in detecting fake news from text. However, existing research places a heavy emphasis on general, common-sense fake news, while in reality fake news often involves rapidly changing topics and domain-specific vocabulary. In this paper, we present our methods and results for three fake news detection tasks at MediaEval benchmark 2021 that specifically involve COVID-19 related topics. We experiment with a group of text-based models including Support Vector Machines, Random Forest, BERT, and RoBERTa. We find that a pre-trained transformer yields the best validation results, but a randomly initialized transformer with smart design can also be trained to reach accuracies close to that of the pre-trained transformer.
CVJan 14, 2021Code
OrigamiSet1.0: Two New Datasets for Origami Classification and Difficulty EstimationDaniel Ma, Gerald Friedland, Mario Michael Krell
Origami is becoming more and more relevant to research. However, there is no public dataset yet available and there hasn't been any research on this topic in machine learning. We constructed an origami dataset using images from the multimedia commons and other databases. It consists of two subsets: one for classification of origami images and the other for difficulty estimation. We obtained 16000 images for classification (half origami, half other objects) and 1509 for difficulty estimation with $3$ different categories (easy: 764, intermediate: 427, complex: 318). The data can be downloaded at: https://github.com/multimedia-berkeley/OriSet. Finally, we provide machine learning baselines.
MMDec 28, 2017Code
Field Studies with Multimedia Big Data: Opportunities and Challenges (Extended Version)Mario Michael Krell, Julia Bernd, Yifan Li et al.
Social multimedia users are increasingly sharing all kinds of data about the world. They do this for their own reasons, not to provide data for field studies-but the trend presents a great opportunity for scientists. The Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset comprises 99 million images and nearly 800 thousand videos from Flickr, all shared under Creative Commons licenses. To enable scientists to leverage these media records for field studies, we propose a new framework that extracts targeted subcorpora from the YFCC100M, in a format usable by researchers who are not experts in big data retrieval and processing. This paper discusses a number of examples from the literature-as well as some entirely new ideas-of natural and social science field studies that could be piloted, supplemented, replicated, or conducted using YFCC100M data. These examples illustrate the need for a general new open-source framework for Multimedia Big Data Field Studies. There is currently a gap between the separate aspects of what multimedia researchers have shown to be possible with consumer-produced big data and the follow-through of creating a comprehensive field study framework that supports scientists across other disciplines. To bridge this gap, we must meet several challenges. For example, the framework must handle unlabeled and noisily labeled data to produce a filtered dataset for a scientist-who naturally wants it to be both as large and as clean as possible. This requires an iterative approach that provides access to statistical summaries and refines the search by constructing new classifiers. The first phase of our framework is available as Multimedia Commons Search, an intuitive interface that enables complex search queries at a large scale...
LGDec 19, 2024
Bag of Tricks for Multimodal AutoML with Image, Text, and Tabular DataZhiqiang Tang, Zihan Zhong, Tong He et al.
This paper studies the best practices for automatic machine learning (AutoML). While previous AutoML efforts have predominantly focused on unimodal data, the multimodal aspect remains under-explored. Our study delves into classification and regression problems involving flexible combinations of image, text, and tabular data. We curate a benchmark comprising 22 multimodal datasets from diverse real-world applications, encompassing all 4 combinations of the 3 modalities. Across this benchmark, we scrutinize design choices related to multimodal fusion strategies, multimodal data augmentation, converting tabular data into text, cross-modal alignment, and handling missing modalities. Through extensive experimentation and analysis, we distill a collection of effective strategies and consolidate them into a unified pipeline, achieving robust performance on diverse datasets.
LGSep 5, 2025
Self-Aligned Reward: Towards Effective and Efficient ReasonersPeixuan Han, Adit Krishnan, Gerald Friedland et al.
Reinforcement learning with verifiable rewards has significantly advanced reasoning in large language models (LLMs), but such signals remain coarse, offering only binary correctness feedback. This limitation often results in inefficiencies, including overly verbose reasoning and high computational cost, while existing solutions often compromise accuracy. To address this, we introduce self-aligned reward (SAR), a self-guided signal that complements verifiable rewards to encourage both reasoning accuracy and efficiency. SAR is defined as the relative perplexity difference between an answer conditioned on the query and the standalone answer, thereby favoring responses that are concise and query-specific. Quantitative analysis reveals that SAR reliably distinguishes answer quality: concise, correct answers score higher than redundant ones, and partially correct answers score higher than entirely incorrect ones. Evaluation on 4 models across 7 benchmarks shows that integrating SAR with prevalent RL algorithms like PPO and GRPO improves accuracy by 4%, while reducing inference cost by 30%. Further analysis demonstrates that SAR achieves a Pareto-optimal trade-off between correctness and efficiency compared to reward signals based on length or self-confidence. We also show that SAR shortens responses while preserving advanced reasoning behaviors, demonstrating its ability to suppress unnecessary elaboration without losing critical reasoning. These results highlight the promise of self-aligned reward as a fine-grained complement to verifiable rewards, paving the way for more efficient and effective LLM training.
CLNov 22, 2024
PPLqa: An Unsupervised Information-Theoretic Quality Metric for Comparing Generative Large Language ModelsGerald Friedland, Xin Huang, Yueying Cui et al.
We propose PPLqa, an easy to compute, language independent, information-theoretic metric to measure the quality of responses of generative Large Language Models (LLMs) in an unsupervised way, without requiring ground truth annotations or human supervision. The method and metric enables users to rank generative language models for quality of responses, so as to make a selection of the best model for a given task. Our single metric assesses LLMs with an approach that subsumes, but is not explicitly based on, coherence and fluency (quality of writing) and relevance and consistency (appropriateness of response) to the query. PPLqa performs as well as other related metrics, and works better with long-form Q\&A. Thus, PPLqa enables bypassing the lengthy annotation process required for ground truth evaluations, and it also correlates well with human and LLM rankings.
LGAug 2, 2025
Effects of Feature Correlations on Associative Memory CapacityStefan Bielmeier, Gerald Friedland
We investigate how feature correlations influence the capacity of Dense Associative Memory (DAM), a Transformer attention-like model. Practical machine learning scenarios involve feature-correlated data and learn representations in the input space, but current capacity analyses do not account for this. We develop an empirical framework to analyze the effects of data structure on capacity dynamics. Specifically, we systematically construct datasets that vary in feature correlation and pattern separation using Hamming distance from information theory, and compute the model's corresponding storage capacity using a simple binary search algorithm. Our experiments confirm that memory capacity scales exponentially with increasing separation in the input space. Feature correlations do not alter this relationship fundamentally, but reduce capacity slightly at constant separation. This effect is amplified at higher polynomial degrees in the energy function, suggesting that Associative Memory is more limited in depicting higher-order interactions between features than patterns. Our findings bridge theoretical work and practical settings for DAM, and might inspire more data-centric methods.
LGFeb 1, 2021
Multi-modal Ensemble Models for Predicting Video MemorabilityTony Zhao, Irving Fang, Jeffrey Kim et al.
Modeling media memorability has been a consistent challenge in the field of machine learning. The Predicting Media Memorability task in MediaEval2020 is the latest benchmark among similar challenges addressing this topic. Building upon techniques developed in previous iterations of the challenge, we developed ensemble methods with the use of extracted video, image, text, and audio features. Critically, in this work we introduce and demonstrate the efficacy and high generalizability of extracted audio embeddings as a feature for the task of predicting media memorability.
LGJan 11, 2021
From Tinkering to Engineering: Measurements in Tensorflow PlaygroundHenrik Hoeiness, Axel Harstad, Gerald Friedland
In this article, we present an extension of the Tensorflow Playground, called Tensorflow Meter (short TFMeter). TFMeter is an interactive neural network architecting tool that allows the visual creation of different architectures of neural networks. In addition to its ancestor, the playground, our tool shows information-theoretic measurements while constructing, training, and testing the network. As a result, each change results in a change in at least one of the measurements, providing for a better engineering intuition of what different architectures are able to learn. The measurements are derived from various places in the literature. In this demo, we describe our web application that is available online at http://tfmeter.icsi.berkeley.edu/ and argue that in the same way that the original Playground is meant to build an intuition about neural networks, our extension educates users on available measurements, which we hope will ultimately improve experimental design and reproducibility in the field.
MMOct 19, 2020
DIME: An Online Tool for the Visual Comparison of Cross-Modal Retrieval ModelsTony Zhao, Jaeyoung Choi, Gerald Friedland
Cross-modal retrieval relies on accurate models to retrieve relevant results for queries across modalities such as image, text, and video. In this paper, we build upon previous work by tackling the difficulty of evaluating models both quantitatively and qualitatively quickly. We present DIME (Dataset, Index, Model, Embedding), a modality-agnostic tool that handles multimodal datasets, trained models, and data preprocessors to support straightforward model comparison with a web browser graphical user interface. DIME inherently supports building modality-agnostic queryable indexes and extraction of relevant feature embeddings, and thus effectively doubles as an efficient cross-modal tool to explore and search through datasets.
CVNov 26, 2019
Efficient Saliency Maps for Explainable AIT. Nathan Mundhenk, Barry Y. Chen, Gerald Friedland
We describe an explainable AI saliency map method for use with deep convolutional neural networks (CNN) that is much more efficient than popular fine-resolution gradient methods. It is also quantitatively similar or better in accuracy. Our technique works by measuring information at the end of each network scale which is then combined into a single saliency map. We describe how saliency measures can be made more efficient by exploiting Saliency Map Order Equivalence. We visualize individual scale/layer contributions by using a Layer Ordered Visualization of Information. This provides an interesting comparison of scale information contributions within the network not provided by other saliency map methods. Using our method instead of Guided Backprop, coarse-resolution class activation methods such as Grad-CAM and Grad-CAM++ seem to yield demonstrably superior results without sacrificing speed. This will make fine-resolution saliency methods feasible on resource limited platforms such as robots, cell phones, low-cost industrial devices, astronomy and satellite imagery.
CRNov 15, 2018
Cybercasing 2.0: You Get What You Pay ForJaeyoung Choi, Istemi Ekin Akkus, Serge Egelman et al.
Under U.S. law, marketing databases exist under almost no legal restrictions concerning accuracy, access, or confidentiality. We explore the possible (mis)use of these databases in a criminal context by conducting two experiments. First, we show how this data can be used for "cybercasing" by using this data to resolve the physical addresses of individuals who are likely to be on vacation. Second, we evaluate the utility of a "bride to be" mailing list augmented with data obtained by searching both Facebook and a bridal registry aggregator. We conclude that marketing data is not necessarily harmless and can represent a fruitful target for criminal misuse.
LGOct 23, 2018
One Bit Matters: Understanding Adversarial Examples as the Abuse of RedundancyJingkang Wang, Ruoxi Jia, Gerald Friedland et al.
Despite the great success achieved in machine learning (ML), adversarial examples have caused concerns with regards to its trustworthiness: A small perturbation of an input results in an arbitrary failure of an otherwise seemingly well-trained ML model. While studies are being conducted to discover the intrinsic properties of adversarial examples, such as their transferability and universality, there is insufficient theoretic analysis to help understand the phenomenon in a way that can influence the design process of ML experiments. In this paper, we deduce an information-theoretic model which explains adversarial attacks as the abuse of feature redundancies in ML algorithms. We prove that feature redundancy is a necessary condition for the existence of adversarial examples. Our model helps to explain some major questions raised in many anecdotal studies on adversarial examples. Our theory is backed up by empirical measurements of the information content of benign and adversarial examples on both image and text datasets. Our measurements show that typical adversarial examples introduce just enough redundancy to overflow the decision making of an ML model trained on corresponding benign examples. We conclude with actionable recommendations to improve the robustness of machine learners against adversarial examples.
NEOct 4, 2018
A Practical Approach to Sizing Neural NetworksGerald Friedland, Alfredo Metere, Mario Krell
Memorization is worst-case generalization. Based on MacKay's information theoretic model of supervised machine learning, this article discusses how to practically estimate the maximum size of a neural network given a training data set. First, we present four easily applicable rules to analytically determine the capacity of neural network architectures. This allows the comparison of the efficiency of different network architectures independently of a task. Second, we introduce and experimentally validate a heuristic method to estimate the neural network capacity requirement for a given dataset and labeling. This allows an estimate of the required size of a neural network for a given problem. We conclude the article with a discussion on the consequences of sizing the network wrongly, which includes both increased computation effort for training as well as reduced generalization capability.
CRAug 22, 2018
The Accuracy of the Demographic Inferences Shown on Google's Ad SettingsMichael Carl Tschantz, Serge Egelman, Jaeyoung Choi et al.
Google's Ad Settings shows the gender and age that Google has inferred about a web user. We compare the inferred values to the self-reported values of 501 survey participants. We find that Google often does not show an inference, but when it does, it is typically correct. We explore which usage characteristics, such as using privacy enhancing technologies, are associated with Google's accuracy, but found no significant results.
CVJul 10, 2018
The Helmholtz Method: Using Perceptual Compression to Reduce Machine Learning ComplexityGerald Friedland, Jingkang Wang, Ruoxi Jia et al.
This paper proposes a fundamental answer to a frequently asked question in multimedia computing and machine learning: Do artifacts from perceptual compression contribute to error in the machine learning process and if so, how much? Our approach to the problem is a reinterpretation of the Helmholtz Free Energy formula from physics to explain the relationship between content and noise when using sensors (such as cameras or microphones) to capture multimedia data. The reinterpretation allows a bit-measurement of the noise contained in images, audio, and video by combining a classifier with perceptual compression, such as JPEG or MP3. Our experiments on CIFAR-10 as well as Fraunhofer's IDMT-SMT-Audio-Effects dataset indicate that, at the right quality level, perceptual compression is actually not harmful but contributes to a significant reduction of complexity of the machine learning process. That is, our noise quantification method can be used to speed up the training of deep learning classifiers significantly while maintaining, or sometimes even improving, overall classification accuracy. Moreover, our results provide insights into the reasons for the success of deep learning.
ASOct 11, 2017
Audio Concept Classification with Hierarchical Deep Neural NetworksMirco Ravanelli, Benjamin Elizalde, Karl Ni et al.
Audio-based multimedia retrieval tasks may identify semantic information in audio streams, i.e., audio concepts (such as music, laughter, or a revving engine). Conventional Gaussian-Mixture-Models have had some success in classifying a reduced set of audio concepts. However, multi-class classification can benefit from context window analysis and the discriminating power of deeper architectures. Although deep learning has shown promise in various applications such as speech and object recognition, it has not yet met the expectations for other fields such as audio concept classification. This paper explores, for the first time, the potential of deep learning in classifying audio concepts on User-Generated Content videos. The proposed system is comprised of two cascaded neural networks in a hierarchical configuration to analyze the short- and long-term context information. Our system outperforms a GMM approach by a relative 54%, a Neural Network by 33%, and a Deep Neural Network by 12% on the TRECVID-MED database
NEAug 20, 2017
A Capacity Scaling Law for Artificial Neural NetworksGerald Friedland, Mario Krell
We derive the calculation of two critical numbers predicting the behavior of perceptron networks. First, we derive the calculation of what we call the lossless memory (LM) dimension. The LM dimension is a generalization of the Vapnik--Chervonenkis (VC) dimension that avoids structured data and therefore provides an upper bound for perfectly fitting almost any training data. Second, we derive what we call the MacKay (MK) dimension. This limit indicates a 50% chance of not being able to train a given function. Our derivations are performed by embedding a neural network into Shannon's communication model which allows to interpret the two points as capacities measured in bits. We present a proof and practical experiments that validate our upper bounds with repeatable experiments using different network configurations, diverse implementations, varying activation functions, and several learning algorithms. The bottom line is that the two capacity points scale strictly linear with the number of weights. Among other practical applications, our result allows to compare and benchmark different neural network implementations independent of a concrete learning task. Our results provide insight into the capabilities and limits of neural networks and generate valuable know how for experimental design decisions.
SDJul 15, 2016
DCAR: A Discriminative and Compact Audio Representation to Improve Event DetectionLiping Jing, Bo Liu, Jaeyoung Choi et al.
This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on Grassmannian manifolds, which we found represents the structure of audio effectively. Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus based on YFCC100M), which includes ten events. The results show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations. DCAR's advantage over i-vector, mv-vector, and GMM representations is significant for both easier and harder discrimination tasks. We discuss how these performance differences across easy and hard cases follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Furthermore, DCAR shows a particularly notable accuracy advantage on events where humans have more difficulty classifying the videos, i.e., events with lower mean annotator confidence.
MMMar 13, 2015
The YLI-MED Corpus: Characteristics, Procedures, and PlansJulia Bernd, Damian Borth, Benjamin Elizalde et al.
The YLI Multimedia Event Detection corpus is a public-domain index of videos with annotations and computed features, specialized for research in multimedia event detection (MED), i.e., automatically identifying what's happening in a video by analyzing the audio and visual content. The videos indexed in the YLI-MED corpus are a subset of the larger YLI feature corpus, which is being developed by the International Computer Science Institute and Lawrence Livermore National Laboratory based on the Yahoo Flickr Creative Commons 100 Million (YFCC100M) dataset. The videos in YLI-MED are categorized as depicting one of ten target events, or no target event, and are annotated for additional attributes like language spoken and whether the video has a musical score. The annotations also include degree of annotator agreement and average annotator confidence scores for the event categorization of each video. Version 1.0 of YLI-MED includes 1823 "positive" videos that depict the target events and 48,138 "negative" videos, as well as 177 supplementary videos that are similar to event videos but are not positive examples. Our goal in producing YLI-MED is to be as open about our data and procedures as possible. This report describes the procedures used to collect the corpus; gives detailed descriptive statistics about the corpus makeup (and how video attributes affected annotators' judgments); discusses possible biases in the corpus introduced by our procedural choices and compares it with the most similar existing dataset, TRECVID MED's HAVIC corpus; and gives an overview of our future plans for expanding the annotation effort.
MMMar 5, 2015
YFCC100M: The New Data in Multimedia ResearchBart Thomee, David A. Shamma, Gerald Friedland et al.
We present the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M), the largest public multimedia collection that has ever been released. The dataset contains a total of 100 million media objects, of which approximately 99.2 million are photos and 0.8 million are videos, all of which carry a Creative Commons license. Each media object in the dataset is represented by several pieces of metadata, e.g. Flickr identifier, owner name, camera, title, tags, geo, media source. The collection provides a comprehensive snapshot of how photos and videos were taken, described, and shared over the years, from the inception of Flickr in 2004 until early 2014. In this article we explain the rationale behind its creation, as well as the implications the dataset has for science, research, engineering, and development. We further present several new challenges in multimedia research that can now be expanded upon with our dataset.