Mengmeng Ge

CR
h-index14
14papers
296citations
Novelty42%
AI Score43

14 Papers

80.8CVApr 20
Ego-InBetween: Generating Object State Transitions in Ego-Centric Videos

Mengmeng Ge, Takashi Isobe, Xu Jia et al.

Understanding physical transformation processes is crucial for both human cognition and artificial intelligence systems, particularly from an egocentric perspective, which serves as a key bridge between humans and machines in action modeling. We define this modeling process as Egocentric Instructed Visual State Transition (EIVST), which involves generating intermediate frames that depict object transformations between initial and target states under a brief action instruction. EIVST poses two challenges for current generative models: (1) understanding the visual scenes of the initial and target states and reasoning about transformation steps from an egocentric view, and (2) generating a consistent intermediate transition that follows the given instruction while preserving object appearance across the two visual states. To address these challenges, we propose the EgoIn framework. It first infers the multi-step transition process between two given states using TransitionVLM, fine-tuned on our curated dataset to better adapt to this task and reduce hallucinated information. It then generates a sequence of frames based on transition conditions produced by the proposed Transition Conditioning module. Additionally, we introduce Object-aware Auxiliary Supervision to preserve consistent object appearance throughout the transition. Extensive experiments on human-object and robot-object interaction datasets demonstrate EgoIn's superior performance in generating semantically meaningful and visually coherent transformation sequences.

CVDec 2, 2024
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models

Xiaomin Li, Xu Jia, Qinghe Wang et al.

Existing pretrained text-to-video (T2V) models have demonstrated impressive abilities in generating realistic videos with basic motion or camera movement. However, these models exhibit significant limitations when generating intricate, human-centric motions. Current efforts primarily focus on fine-tuning models on a small set of videos containing a specific motion. They often fail to effectively decouple motion and the appearance in the limited reference videos, thereby weakening the modeling capability of motion patterns. To this end, we propose MoTrans, a customized motion transfer method enabling video generation of similar motion in new context. Specifically, we introduce a multimodal large language model (MLLM)-based recaptioner to expand the initial prompt to focus more on appearance and an appearance injection module to adapt appearance prior from video frames to the motion modeling process. These complementary multimodal representations from recaptioned prompt and video frames promote the modeling of appearance and facilitate the decoupling of appearance and motion. In addition, we devise a motion-specific embedding for further enhancing the modeling of the specific motion. Experimental results demonstrate that our method effectively learns specific motion pattern from singular or multiple reference videos, performing favorably against existing methods in customized video generation.

CVMar 24, 2025
AMD-Hummingbird: Towards an Efficient Text-to-Video Model

Takashi Isobe, He Cui, Dong Zhou et al.

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g.,iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31X speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.

CVSep 15, 2025
Layout-Conditioned Autoregressive Text-to-Image Generation via Structured Masking

Zirui Zheng, Takashi Isobe, Tong Shen et al.

While autoregressive (AR) models have demonstrated remarkable success in image generation, extending them to layout-conditioned generation remains challenging due to the sparse nature of layout conditions and the risk of feature entanglement. We present Structured Masking for AR-based Layout-to-Image (SMARLI), a novel framework for layoutto-image generation that effectively integrates spatial layout constraints into AR-based image generation. To equip AR model with layout control, a specially designed structured masking strategy is applied to attention computation to govern the interaction among the global prompt, layout, and image tokens. This design prevents mis-association between different regions and their descriptions while enabling sufficient injection of layout constraints into the generation process. To further enhance generation quality and layout accuracy, we incorporate Group Relative Policy Optimization (GRPO) based post-training scheme with specially designed layout reward functions for next-set-based AR models. Experimental results demonstrate that SMARLI is able to seamlessly integrate layout tokens with text and image tokens without compromising generation quality. It achieves superior layoutaware control while maintaining the structural simplicity and generation efficiency of AR models.

CVJan 26, 2022
Self-supervised 3D Semantic Representation Learning for Vision-and-Language Navigation

Sinan Tan, Mengmeng Ge, Di Guo et al.

In the Vision-and-Language Navigation task, the embodied agent follows linguistic instructions and navigates to a specific goal. It is important in many practical scenarios and has attracted extensive attention from both computer vision and robotics communities. However, most existing works only use RGB images but neglect the 3D semantic information of the scene. To this end, we develop a novel self-supervised training framework to encode the voxel-level 3D semantic reconstruction into a 3D semantic representation. Specifically, a region query task is designed as the pretext task, which predicts the presence or absence of objects of a particular class in a specific 3D region. Then, we construct an LSTM-based navigation model and train it with the proposed 3D semantic representations and BERT language features on vision-language pairs. Experiments show that the proposed approach achieves success rates of 68% and 66% on the validation unseen and test unseen splits of the R2R dataset respectively, which are superior to most of RGB-based methods utilizing vision-language transformers.

ROSep 16, 2021
Knowledge-based Embodied Question Answering

Sinan Tan, Mengmeng Ge, Di Guo et al.

In this paper, we propose a novel Knowledge-based Embodied Question Answering (K-EQA) task, in which the agent intelligently explores the environment to answer various questions with the knowledge. Different from explicitly specifying the target object in the question as existing EQA work, the agent can resort to external knowledge to understand more complicated question such as "Please tell me what are objects used to cut food in the room?", in which the agent must know the knowledge such as "knife is used for cutting food". To address this K-EQA problem, a novel framework based on neural program synthesis reasoning is proposed, where the joint reasoning of the external knowledge and 3D scene graph is performed to realize navigation and question answering. Especially, the 3D scene graph can provide the memory to store the visual information of visited scenes, which significantly improves the efficiency for the multi-turn question answering. Experimental results have demonstrated that the proposed framework is capable of answering more complicated and realistic questions in the embodied environment. The proposed method is also applicable to multi-agent scenarios.

CRSep 9, 2021
Automated Security Assessment for the Internet of Things

Xuanyu Duan, Mengmeng Ge, Triet H. M. Le et al.

Internet of Things (IoT) based applications face an increasing number of potential security risks, which need to be systematically assessed and addressed. Expert-based manual assessment of IoT security is a predominant approach, which is usually inefficient. To address this problem, we propose an automated security assessment framework for IoT networks. Our framework first leverages machine learning and natural language processing to analyze vulnerability descriptions for predicting vulnerability metrics. The predicted metrics are then input into a two-layered graphical security model, which consists of an attack graph at the upper layer to present the network connectivity and an attack tree for each node in the network at the bottom layer to depict the vulnerability information. This security model automatically assesses the security of the IoT network by capturing potential attack paths. We evaluate the viability of our approach using a proof-of-concept smart building system model which contains a variety of real-world IoT devices and potential vulnerabilities. Our evaluation of the proposed framework demonstrates its effectiveness in terms of automatically predicting the vulnerability metrics of new vulnerabilities with more than 90% accuracy, on average, and identifying the most vulnerable attack paths within an IoT network. The produced assessment results can serve as a guideline for cybersecurity professionals to take further actions and mitigate risks in a timely manner.

CRMay 18, 2021
Model-based Cybersecurity Analysis: Past Work and Future Directions

Simon Yusuf Enoch, Mengmeng Ge, Jin B. Hong et al.

Model-based evaluation in cybersecurity has a long history. Attack Graphs (AGs) and Attack Trees (ATs) were the earlier developed graphical security models for cybersecurity analysis. However, they have limitations (e.g., scalability problem, state-space explosion problem, etc.) and lack the ability to capture other security features (e.g., countermeasures). To address the limitations and to cope with various security features, a graphical security model named attack countermeasure tree (ACT) was developed to perform security analysis by taking into account both attacks and countermeasures. In our research, we have developed different variants of a hierarchical graphical security model to solve the complexity, dynamicity, and scalability issues involved with security models in the security analysis of systems. In this paper, we summarize and classify security models into the following; graph-based, tree-based, and hybrid security models. We discuss the development of a hierarchical attack representation model (HARM) and different variants of the HARM, its applications, and usability in a variety of domains including the Internet of Things (IoT), Cloud, Software-Defined Networking, and Moving Target Defenses. We provide the classification of the security metrics, including their discussions. Finally, we highlight existing problems and suggest future research directions in the area of graphical security models and applications. As a result of this work, a decision-maker can understand which type of HARM will suit their network or security analysis requirements.

CRJul 18, 2020
Toward a Deep Learning-Driven Intrusion Detection Approach for Internet of Things

Mengmeng Ge, Naeem Firdous Syed, Xiping Fu et al.

Internet of Things (IoT) has brought along immense benefits to our daily lives encompassing a diverse range of application domains that we regularly interact with, ranging from healthcare automation to transport and smart environments. However, due to the limitation of constrained resources and computational capabilities, IoT networks are prone to various cyber attacks. Thus, defending the IoT network against adversarial attacks is of vital importance. In this paper, we present a novel intrusion detection approach for IoT networks through the application of a deep learning technique. We adopt a cutting-edge IoT dataset comprising IoT traces and realistic attack traffic, including denial of service, distributed denial of service, reconnaissance and information theft attacks. We utilise the header field information in individual packets as generic features to capture general network behaviours, and develop a feed-forward neural networks model with embedding layers (to encode high-dimensional categorical features) for multi-class classification. The concept of transfer learning is subsequently adopted to encode high-dimensional categorical features to build a binary classifier. Results obtained through the evaluation of the proposed approach demonstrate a high classification accuracy for both binary and multi-class classifiers.

CRJul 7, 2020
Composite Metrics for Network Security Analysis

Simon Yusuf Enoch, Jin B. Hong, Mengmeng Ge et al.

Security metrics present the security level of a system or a network in both qualitative and quantitative ways. In general, security metrics are used to assess the security level of a system and to achieve security goals. There are a lot of security metrics for security analysis, but there is no systematic classification of security metrics that are based on network reachability information. To address this, we propose a systematic classification of existing security metrics based on network reachability information. Mainly, we classify the security metrics into host-based and network-based metrics. The host-based metrics are classified into metrics ``without probability" and "with probability", while the network-based metrics are classified into "path-based" and "non-path based". Finally, we present and describe an approach to develop composite security metrics and it's calculations using a Hierarchical Attack Representation Model (HARM) via an example network. Our novel classification of security metrics provides a new methodology to assess the security of a system.

CRMay 8, 2020
Proactive Defense for Internet-of-Things: Integrating Moving Target Defense with Cyberdeception

Mengmeng Ge, Jin-Hee Cho, Dong Seong Kim et al.

Resource constrained Internet-of-Things (IoT) devices are highly likely to be compromised by attackers because strong security protections may not be suitable to be deployed. This requires an alternative approach to protect vulnerable components in IoT networks. In this paper, we propose an integrated defense technique to achieve intrusion prevention by leveraging cyberdeception (i.e., a decoy system) and moving target defense (i.e., network topology shuffling). We verify the effectiveness and efficiency of our proposed technique analytically based on a graphical security model in a software defined networking (SDN)-based IoT network. We develop four strategies (i.e., fixed/random and adaptive/hybrid) to address "when" to perform network topology shuffling and three strategies (i.e., genetic algorithm/decoy attack path-based optimization/random) to address "how" to perform network topology shuffling on a decoy-populated IoT network, and analyze which strategy can best achieve a system goal such as prolonging the system lifetime, maximizing deception effectiveness, maximizing service availability, or minimizing defense cost. Our results demonstrate that a software defined IoT network running our intrusion prevention technique at the optimal parameter setting prolongs system lifetime, increases attack complexity of compromising critical nodes, and maintains superior service availability compared with a counterpart IoT network without running our intrusion prevention technique. Further, when given a single goal or a multi-objective goal (e.g., maximizing the system lifetime and service availability while minimizing the defense cost) as input, the best combination of "how" and "how" strategies is identified for executing our proposed technique under which the specified goal can be best achieved.

CRAug 1, 2019
Modeling and Analysis of Integrated Proactive Defense Mechanisms for Internet-of-Things

Mengmeng Ge, Jin-Hee Cho, Bilal Ishfaq et al.

As a solution to protect and defend a system against inside attacks, many intrusion detection systems (IDSs) have been developed to identify and react to them for protecting a system. However, the core idea of an IDS is a reactive mechanism in nature even though it detects intrusions which have already been in the system. Hence, the reactive mechanisms would be way behind and not effective for the actions taken by agile and smart attackers. Due to the inherent limitation of an IDS with the reactive nature, intrusion prevention systems (IPSs) have been developed to thwart potential attackers and/or mitigate the impact of the intrusions before they penetrate into the system. In this chapter, we introduce an integrated defense mechanism to achieve intrusion prevention in a software-defined Internet-of-Things (IoT) network by leveraging the technologies of cyberdeception (i.e., a decoy system) and moving target defense, namely MTD (i.e., network topology shuffling). In addition, we validate their effectiveness and efficiency based on the devised graphical security model (GSM)-based evaluation framework. To develop an adaptive, proactive intrusion prevention mechanism, we employed fitness functions based on the genetic algorithm in order to identify an optimal network topology where a network topology can be shuffled based on the detected level of the system vulnerability. Our simulation results show that GA-based shuffling schemes outperform random shuffling schemes in terms of the number of attack paths toward decoy targets. In addition, we observe that there exists a tradeoff between the system lifetime (i.e., mean time to security failure) and the defense cost introduced by the proposed MTD technique for fixed and adaptive shuffling schemes. That is, a fixed GA-based shuffling can achieve higher MTTSF with more cost while an adaptive GA-based shuffling obtains less MTTSF with less cost.

CRAug 1, 2019
Optimal Deployments of Defense Mechanisms for the Internet of Things

Mengmeng Ge, Jin-Hee Cho, Charles A. Kamhoua et al.

Internet of Things (IoT) devices can be exploited by the attackers as entry points to break into the IoT networks without early detection. Little work has taken hybrid approaches that combine different defense mechanisms in an optimal way to increase the security of the IoT against sophisticated attacks. In this work, we propose a novel approach to generate the strategic deployment of adaptive deception technology and the patch management solution for the IoT under a budget constraint. We use a graphical security model along with three evaluation metrics to measure the effectiveness and efficiency of the proposed defense mechanisms. We apply the multi-objective genetic algorithm (GA) to compute the {\em Pareto optimal} deployments of defense mechanisms to maximize the security and minimize the deployment cost. We present a case study to show the feasibility of the proposed approach and to provide the defenders with various ways to choose optimal deployments of defense mechanisms for the IoT. We compare the GA with the exhaustive search algorithm (ESA) in terms of the runtime complexity and performance accuracy in optimality. Our results show that the GA is much more efficient in computing a good spread of the deployments than the ESA, in proportion to the increase of the IoT devices.

CRApr 29, 2017
Evaluating Security and Availability of Multiple Redundancy Designs when Applying Security Patches

Mengmeng Ge, Huy Kang Kim, Dong Seong Kim

In most of modern enterprise systems, redundancy configuration is often considered to provide availability during the part of such systems is being patched. However, the redundancy may increase the attack surface of the system. In this paper, we model and assess the security and capacity oriented availability of multiple server redundancy designs when applying security patches to the servers. We construct (1) a graphical security model to evaluate the security under potential attacks before and after applying patches, (2) a stochastic reward net model to assess the capacity oriented availability of the system with a patch schedule. We present our approach based on case study and model-based evaluation for multiple design choices. The results show redundancy designs increase capacity oriented availability but decrease security when applying security patches. We define functions that compare values of security metrics and capacity oriented availability with the chosen upper/lower bounds to find design choices that satisfy both security and availability requirements.