CVJul 31, 2024Code
Tora: Trajectory-oriented Diffusion Transformer for Video GenerationZhenghao Zhang, Junchao Liao, Menghao Li et al.
Recent advancements in Diffusion Transformer (DiT) have demonstrated remarkable proficiency in producing high-quality video content. Nonetheless, the potential of transformer-based diffusion models for effectively generating videos with controllable motion remains an area of limited exploration. This paper introduces Tora, the first trajectory-oriented DiT framework that concurrently integrates textual, visual, and trajectory conditions, thereby enabling scalable video generation with effective motion guidance. Specifically, Tora consists of a Trajectory Extractor (TE), a Spatial-Temporal DiT, and a Motion-guidance Fuser (MGF). The TE encodes arbitrary trajectories into hierarchical spacetime motion patches with a 3D motion compression network. The MGF integrates the motion patches into the DiT blocks to generate consistent videos that accurately follow designated trajectories. Our design aligns seamlessly with DiT's scalability, allowing precise control of video content's dynamics with diverse durations, aspect ratios, and resolutions. Extensive experiments demonstrate that Tora excels in achieving high motion fidelity compared to the foundational DiT model, while also accurately simulating the complex movements of the physical world. Code is made available at https://github.com/alibaba/Tora .
CVJan 30Code
ShotFinder: Imagination-Driven Open-Domain Video Shot Retrieval via Web SearchTao Yu, Haopeng Jin, Hao Wang et al.
In recent years, large language models (LLMs) have made rapid progress in information retrieval, yet existing research has mainly focused on text or static multimodal settings. Open-domain video shot retrieval, which involves richer temporal structure and more complex semantics, still lacks systematic benchmarks and analysis. To fill this gap, we introduce ShotFinder, a benchmark that formalizes editing requirements as keyframe-oriented shot descriptions and introduces five types of controllable single-factor constraints: Temporal order, Color, Visual style, Audio, and Resolution. We curate 1,210 high-quality samples from YouTube across 20 thematic categories, using large models for generation with human verification. Based on the benchmark, we propose ShotFinder, a text-driven three-stage retrieval and localization pipeline: (1) query expansion via video imagination, (2) candidate video retrieval with a search engine, and (3) description-guided temporal localization. Experiments on multiple closed-source and open-source models reveal a significant gap to human performance, with clear imbalance across constraints: temporal localization is relatively tractable, while color and visual style remain major challenges. These results reveal that open-domain video shot retrieval is still a critical capability that multimodal large models have yet to overcome.
CVNov 21, 2023
AnimateAnything: Fine-Grained Open Domain Image Animation with Motion GuidanceZuozhuo Dai, Zhenghao Zhang, Yao Yao et al.
Image animation is a key task in computer vision which aims to generate dynamic visual content from static image. Recent image animation methods employ neural based rendering technique to generate realistic animations. Despite these advancements, achieving fine-grained and controllable image animation guided by text remains challenging, particularly for open-domain images captured in diverse real environments. In this paper, we introduce an open domain image animation method that leverages the motion prior of video diffusion model. Our approach introduces targeted motion area guidance and motion strength guidance, enabling precise control the movable area and its motion speed. This results in enhanced alignment between the animated visual elements and the prompting text, thereby facilitating a fine-grained and interactive animation generation process for intricate motion sequences. We validate the effectiveness of our method through rigorous experiments on an open-domain dataset, with the results showcasing its superior performance. Project page can be found at https://animationai.github.io/AnimateAnything.
AIJan 27Code
RPO:Reinforcement Fine-Tuning with Partial Reasoning OptimizationHongzhu Yi, Xinming Wang, Zhenghao zhang et al.
Within the domain of large language models, reinforcement fine-tuning algorithms necessitate the generation of a complete reasoning trajectory beginning from the input query, which incurs significant computational overhead during the rollout phase of training. To address this issue, we analyze the impact of different segments of the reasoning path on the correctness of the final result and, based on these insights, propose Reinforcement Fine-Tuning with Partial Reasoning Optimization (RPO), a plug-and-play reinforcement fine-tuning algorithm. Unlike traditional reinforcement fine-tuning algorithms that generate full reasoning paths, RPO trains the model by generating suffixes of the reasoning path using experience cache. During the rollout phase of training, RPO reduces token generation in this phase by approximately 95%, greatly lowering the theoretical time overhead. Compared with full-path reinforcement fine-tuning algorithms, RPO reduces the training time of the 1.5B model by 90% and the 7B model by 72%. At the same time, it can be integrated with typical algorithms such as GRPO and DAPO, enabling them to achieve training acceleration while maintaining performance comparable to the original algorithms. Our code is open-sourced at https://github.com/yhz5613813/RPO.
CVJan 20, 2023
Towards Robust Video Instance Segmentation with Temporal-Aware TransformerZhenghao Zhang, Fangtao Shao, Zuozhuo Dai et al.
Most existing transformer based video instance segmentation methods extract per frame features independently, hence it is challenging to solve the appearance deformation problem. In this paper, we observe the temporal information is important as well and we propose TAFormer to aggregate spatio-temporal features both in transformer encoder and decoder. Specifically, in transformer encoder, we propose a novel spatio-temporal joint multi-scale deformable attention module which dynamically integrates the spatial and temporal information to obtain enriched spatio-temporal features. In transformer decoder, we introduce a temporal self-attention module to enhance the frame level box queries with the temporal relation. Moreover, TAFormer adopts an instance level contrastive loss to increase the discriminability of instance query embeddings. Therefore the tracking error caused by visually similar instances can be decreased. Experimental results show that TAFormer effectively leverages the spatial and temporal information to obtain context-aware feature representation and outperforms state-of-the-art methods.
CVJul 11, 2024
MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation MapsHang Wu, Zhenghao Zhang, Siyuan Lin et al.
Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD map is expensive and challenging to scale up. Given these limitations, leveraging navigation maps has emerged as a promising low-cost alternative for localization. Current approaches based on navigation maps can achieve highly accurate localization, but their complex matching strategies lead to unacceptable inference latency that fails to meet the real-time demands. To address these limitations, we propose a novel transformer-based neural re-localization method. Inspired by image registration, our approach performs a coarse-to-fine neural feature registration between navigation map and visual bird's-eye view features. Our method significantly outperforms the current state-of-the-art OrienterNet on both the nuScenes and Argoverse datasets, which is nearly 10%/20% localization accuracy and 30/16 FPS improvement on single-view and surround-view input settings, separately. We highlight that our research presents an HD-map-free localization method for autonomous driving, offering cost-effective, reliable, and scalable performance in challenging driving environments.
CVJul 11, 2024
BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of SightHang Wu, Zhenghao Zhang, Siyuan Lin et al.
Bird's-eye-view (BEV) representation is crucial for the perception function in autonomous driving tasks. It is difficult to balance the accuracy, efficiency and range of BEV representation. The existing works are restricted to a limited perception range within 50 meters. Extending the BEV representation range can greatly benefit downstream tasks such as topology reasoning, scene understanding, and planning by offering more comprehensive information and reaction time. The Standard-Definition (SD) navigation maps can provide a lightweight representation of road structure topology, characterized by ease of acquisition and low maintenance costs. An intuitive idea is to combine the close-range visual information from onboard cameras with the beyond line-of-sight (BLOS) environmental priors from SD maps to realize expanded perceptual capabilities. In this paper, we propose BLOS-BEV, a novel BEV segmentation model that incorporates SD maps for accurate beyond line-of-sight perception, up to 200m. Our approach is applicable to common BEV architectures and can achieve excellent results by incorporating information derived from SD maps. We explore various feature fusion schemes to effectively integrate the visual BEV representations and semantic features from the SD map, aiming to leverage the complementary information from both sources optimally. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in BEV segmentation on nuScenes and Argoverse benchmark. Through multi-modal inputs, BEV segmentation is significantly enhanced at close ranges below 50m, while also demonstrating superior performance in long-range scenarios, surpassing other methods by over 20% mIoU at distances ranging from 50-200m.
CVMar 18, 2024Code
EffiVED:Efficient Video Editing via Text-instruction Diffusion ModelsZhenghao Zhang, Zuozhuo Dai, Long Qin et al.
Large-scale text-to-video models have shown remarkable abilities, but their direct application in video editing remains challenging due to limited available datasets. Current video editing methods commonly require per-video fine-tuning of diffusion models or specific inversion optimization to ensure high-fidelity edits. In this paper, we introduce EffiVED, an efficient diffusion-based model that directly supports instruction-guided video editing. To achieve this, we present two efficient workflows to gather video editing pairs, utilizing augmentation and fundamental vision-language techniques. These workflows transform vast image editing datasets and open-world videos into a high-quality dataset for training EffiVED. Experimental results reveal that EffiVED not only generates high-quality editing videos but also executes rapidly. Finally, we demonstrate that our data collection method significantly improves editing performance and can potentially tackle the scarcity of video editing data. Code can be found at https://github.com/alibaba/EffiVED.
CVApr 10Code
Tora3: Trajectory-Guided Audio-Video Generation with Physical CoherenceJunchao Liao, Zhenghao Zhang, Xiangyu Meng et al.
Audio-video (AV) generation has recently made strong progress in perceptual quality and multimodal coherence, yet generating content with plausible motion-sound relations remains challenging. Existing methods often produce object motions that are visually unstable and sounds that are only loosely aligned with salient motion or contact events, largely because they lack an explicit motion-aware structure shared by video and audio generation. We present Tora3, a trajectory-guided AV generation framework that improves physical coherence by using object trajectories as a shared kinematic prior. Rather than treating trajectories as a video-only control signal, Tora3 uses them to jointly guide visual motion and acoustic events. Specifically, we design a trajectory-aligned motion representation for video, a kinematic-audio alignment module driven by trajectory-derived second-order kinematic states, and a hybrid flow matching scheme that preserves trajectory fidelity in trajectory-conditioned regions while maintaining local coherence elsewhere. We further curate PAV, a large-scale AV dataset emphasizing motion-relevant patterns with automatically extracted motion annotations. Extensive experiments show that Tora3 improves motion realism, motion-sound synchronization, and overall AV generation quality over strong open-source baselines.
CVAug 8, 2025Code
More Is Better: A MoE-Based Emotion Recognition Framework with Human Preference AlignmentJun Xie, Yingjian Zhu, Feng Chen et al.
In this paper, we present our solution for the semi-supervised learning track (MER-SEMI) in MER2025. We propose a comprehensive framework, grounded in the principle that "more is better," to construct a robust Mixture of Experts (MoE) emotion recognition system. Our approach integrates a diverse range of input modalities as independent experts, including novel signals such as knowledge from large Vision-Language Models (VLMs) and temporal Action Unit (AU) information. To effectively utilize unlabeled data, we introduce a consensus-based pseudo-labeling strategy, generating high-quality labels from the agreement between a baseline model and Gemini, which are then used in a two-stage training paradigm. Finally, we employ a multi-expert voting ensemble combined with a rule-based re-ranking process to correct prediction bias and better align the outputs with human preferences. Evaluated on the MER2025-SEMI challenge dataset, our method achieves an F1-score of 0.8772 on the test set, ranking 2nd in the track. Our code is available at https://github.com/zhuyjan/MER2025-MRAC25.
LGNov 14, 2025
Dynamic Deep Graph Learning for Incomplete Multi-View Clustering with Masked Graph Reconstruction LossZhenghao Zhang, Jun Xie, Xingchen Chen et al.
The prevalence of real-world multi-view data makes incomplete multi-view clustering (IMVC) a crucial research. The rapid development of Graph Neural Networks (GNNs) has established them as one of the mainstream approaches for multi-view clustering. Despite significant progress in GNNs-based IMVC, some challenges remain: (1) Most methods rely on the K-Nearest Neighbors (KNN) algorithm to construct static graphs from raw data, which introduces noise and diminishes the robustness of the graph topology. (2) Existing methods typically utilize the Mean Squared Error (MSE) loss between the reconstructed graph and the sparse adjacency graph directly as the graph reconstruction loss, leading to substantial gradient noise during optimization. To address these issues, we propose a novel \textbf{D}ynamic Deep \textbf{G}raph Learning for \textbf{I}ncomplete \textbf{M}ulti-\textbf{V}iew \textbf{C}lustering with \textbf{M}asked Graph Reconstruction Loss (DGIMVCM). Firstly, we construct a missing-robust global graph from the raw data. A graph convolutional embedding layer is then designed to extract primary features and refined dynamic view-specific graph structures, leveraging the global graph for imputation of missing views. This process is complemented by graph structure contrastive learning, which identifies consistency among view-specific graph structures. Secondly, a graph self-attention encoder is introduced to extract high-level representations based on the imputed primary features and view-specific graphs, and is optimized with a masked graph reconstruction loss to mitigate gradient noise during optimization. Finally, a clustering module is constructed and optimized through a pseudo-label self-supervised training mechanism. Extensive experiments on multiple datasets validate the effectiveness and superiority of DGIMVCM.
LGMay 30, 2025Code
Federated Incomplete Multi-view Clustering with Globally Fused Graph GuidanceGuoqing Chao, Zhenghao Zhang, Lei Meng et al.
Federated multi-view clustering has been proposed to mine the valuable information within multi-view data distributed across different devices and has achieved impressive results while preserving the privacy. Despite great progress, most federated multi-view clustering methods only used global pseudo-labels to guide the downstream clustering process and failed to exploit the global information when extracting features. In addition, missing data problem in federated multi-view clustering task is less explored. To address these problems, we propose a novel Federated Incomplete Multi-view Clustering method with globally Fused Graph guidance (FIMCFG). Specifically, we designed a dual-head graph convolutional encoder at each client to extract two kinds of underlying features containing global and view-specific information. Subsequently, under the guidance of the fused graph, the two underlying features are fused into high-level features, based on which clustering is conducted under the supervision of pseudo-labeling. Finally, the high-level features are uploaded to the server to refine the graph fusion and pseudo-labeling computation. Extensive experimental results demonstrate the effectiveness and superiority of FIMCFG. Our code is publicly available at https://github.com/PaddiHunter/FIMCFG.
CVMay 22, 2023Code
UVOSAM: A Mask-free Paradigm for Unsupervised Video Object Segmentation via Segment Anything ModelZhenghao Zhang, Shengfan Zhang, Zhichao Wei et al.
The current state-of-the-art methods for unsupervised video object segmentation (UVOS) require extensive training on video datasets with mask annotations, limiting their effectiveness in handling challenging scenarios. However, the Segment Anything Model (SAM) introduces a new prompt-driven paradigm for image segmentation, offering new possibilities. In this study, we investigate SAM's potential for UVOS through different prompt strategies. We then propose UVOSAM, a mask-free paradigm for UVOS that utilizes the STD-Net tracker. STD-Net incorporates a spatial-temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features, remarkably enhancing the quality of box prompts in complex video scenes. Extensive experiments on the DAVIS2017-unsupervised and YoutubeVIS19\&21 datasets demonstrate the superior performance of UVOSAM without mask supervision compared to existing mask-supervised methods, as well as its ability to generalize to weakly-annotated video datasets. Code can be found at https://github.com/alibaba/UVOSAM.
CVDec 11, 2025
SpaceDrive: Infusing Spatial Awareness into VLM-based Autonomous DrivingPeizheng Li, Zhenghao Zhang, David Holtz et al.
End-to-end autonomous driving methods built on vision language models (VLMs) have undergone rapid development driven by their universal visual understanding and strong reasoning capabilities obtained from the large-scale pretraining. However, we find that current VLMs struggle to understand fine-grained 3D spatial relationships which is a fundamental requirement for systems interacting with the physical world. To address this issue, we propose SpaceDrive, a spatial-aware VLM-based driving framework that treats spatial information as explicit positional encodings (PEs) instead of textual digit tokens, enabling joint reasoning over semantic and spatial representations. SpaceDrive employs a universal positional encoder to all 3D coordinates derived from multi-view depth estimation, historical ego-states, and text prompts. These 3D PEs are first superimposed to augment the corresponding 2D visual tokens. Meanwhile, they serve as a task-agnostic coordinate representation, replacing the digit-wise numerical tokens as both inputs and outputs for the VLM. This mechanism enables the model to better index specific visual semantics in spatial reasoning and directly regress trajectory coordinates rather than generating digit-by-digit, thereby enhancing planning accuracy. Extensive experiments validate that SpaceDrive achieves state-of-the-art open-loop performance on the nuScenes dataset and the second-best Driving Score of 78.02 on the Bench2Drive closed-loop benchmark over existing VLM-based methods.
DLJan 30
PaperX: A Unified Framework for Multimodal Academic Presentation Generation with Scholar DAGTao Yu, Minghui Zhang, Zhiqing Cui et al.
Transforming scientific papers into multimodal presentation content is essential for research dissemination but remains labor intensive. Existing automated solutions typically treat each format as an isolated downstream task, leading to redundant processing and semantic inconsistency. We introduce PaperX, a unified framework that models academic presentation generation as a structural transformation and rendering process. Central to our approach is the Scholar DAG, an intermediate representation that decouples the paper's logical structure from its final presentation syntax. By applying adaptive graph traversal strategies, PaperX generates diverse, high quality outputs from a single source. Comprehensive evaluations demonstrate that our framework achieves the state of the art performance in content fidelity and aesthetic quality while significantly improving cost efficiency compared to specialized single task agents.
CVJul 8, 2025
Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video GenerationZhenghao Zhang, Junchao Liao, Xiangyu Meng et al.
Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through explicit mapping between motion and personalization embeddings. Tora2 is, to our best knowledge, the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation. Experimental results demonstrate that Tora2 achieves competitive performance with state-of-the-art customization methods while providing advanced motion control capabilities, which marks a critical advancement in multi-condition video generation. Project page: https://ali-videoai.github.io/Tora2_page/.
CVOct 16, 2025
Identity-GRPO: Optimizing Multi-Human Identity-preserving Video Generation via Reinforcement LearningXiangyu Meng, Zixian Zhang, Zhenghao Zhang et al.
While advanced methods like VACE and Phantom have advanced video generation for specific subjects in diverse scenarios, they struggle with multi-human identity preservation in dynamic interactions, where consistent identities across multiple characters are critical. To address this, we propose Identity-GRPO, a human feedback-driven optimization pipeline for refining multi-human identity-preserving video generation. First, we construct a video reward model trained on a large-scale preference dataset containing human-annotated and synthetic distortion data, with pairwise annotations focused on maintaining human consistency throughout the video. We then employ a GRPO variant tailored for multi-human consistency, which greatly enhances both VACE and Phantom. Through extensive ablation studies, we evaluate the impact of annotation quality and design choices on policy optimization. Experiments show that Identity-GRPO achieves up to 18.9% improvement in human consistency metrics over baseline methods, offering actionable insights for aligning reinforcement learning with personalized video generation.
CVSep 30, 2025
LaTo: Landmark-tokenized Diffusion Transformer for Fine-grained Human Face EditingZhenghao Zhang, Ziying Zhang, Junchao Liao et al.
Recent multimodal models for instruction-based face editing enable semantic manipulation but still struggle with precise attribute control and identity preservation. Structural facial representations such as landmarks are effective for intermediate supervision, yet most existing methods treat them as rigid geometric constraints, which can degrade identity when conditional landmarks deviate significantly from the source (e.g., large expression or pose changes, inaccurate landmark estimates). To address these limitations, we propose LaTo, a landmark-tokenized diffusion transformer for fine-grained, identity-preserving face editing. Our key innovations include: (1) a landmark tokenizer that directly quantizes raw landmark coordinates into discrete facial tokens, obviating the need for dense pixel-wise correspondence; (2) a location-mapping positional encoding that integrates facial and image tokens for unified processing, enabling flexible yet decoupled geometry-appearance interactions with high efficiency and strong identity preservation; and (3) a landmark predictor that leverages vision-language models to infer target landmarks from instructions and source images, whose structured chain-of-thought improves estimation accuracy and interactive control. To mitigate data scarcity, we curate HFL-150K, to our knowledge the largest benchmark for this task, containing over 150K real face pairs with fine-grained instructions. Extensive experiments show that LaTo outperforms state-of-the-art methods by 7.8% in identity preservation and 4.6% in semantic consistency. Code and dataset will be made publicly available upon acceptance.
IRNov 14, 2020
Association Rules Enhanced Knowledge Graph Attention NetworkZhenghao Zhang, Jianbin Huang, Qinglin Tan
Most existing knowledge graphs suffer from incompleteness. Embedding knowledge graphs into continuous vector spaces has recently attracted increasing interest in knowledge base completion. However, in most existing embedding methods, only fact triplets are utilized, and logical rules have not been thoroughly studied for the knowledge base completion task. To overcome the problem, we propose an association rules enhanced knowledge graph attention network (AR-KGAT). The AR-KGAT captures both entity and relation features for high-order neighborhoods of any given entity in an end-to-end manner under the graph attention network framework. The major component of AR-KGAT is an encoder of an effective neighborhood aggregator, which addresses the problems by aggregating neighbors with both association-rules-based and graph-based attention weights. Additionally, the proposed model also encapsulates the representations from multi-hop neighbors of nodes to refine their embeddings. The decoder enables AR-KGAT to be translational between entities and relations while keeping the superior link prediction performance. A logic-like inference pattern is utilized as constraints for knowledge graph embedding. Then, the global loss is minimized over both atomic and complex formulas to achieve the embedding task. In this manner, we learn embeddings compatible with triplets and rules, which are certainly more predictive for knowledge acquisition and inference. We conduct extensive experiments on two benchmark datasets: WN18RR and FB15k-237, for two knowledge graph completion tasks: the link prediction and triplet classification to evaluate the proposed AR-KGAT model. The results show that the proposed AR-KGAT model achieves significant and consistent improvements over state-of-the-art methods.
SINov 12, 2020
Multi-View Dynamic Heterogeneous Information Network EmbeddingZhenghao Zhang, Jianbin Huang, Qinglin Tan
Most existing Heterogeneous Information Network (HIN) embedding methods focus on static environments while neglecting the evolving characteristic of realworld networks. Although several dynamic embedding methods have been proposed, they are merely designed for homogeneous networks and cannot be directly applied in heterogeneous environment. To tackle above challenges, we propose a novel framework for incorporating temporal information into HIN embedding, denoted as Multi-View Dynamic HIN Embedding (MDHNE), which can efficiently preserve evolution patterns of implicit relationships from different views in updating node representations over time. We first transform HIN to a series of homogeneous networks corresponding to different views. Then our proposed MDHNE applies Recurrent Neural Network (RNN) to incorporate evolving pattern of complex network structure and semantic relationships between nodes into latent embedding spaces, and thus the node representations from multiple views can be learned and updated when HIN evolves over time. Moreover, we come up with an attention based fusion mechanism, which can automatically infer weights of latent representations corresponding to different views by minimizing the objective function specific for different mining tasks. Extensive experiments clearly demonstrate that our MDHNE model outperforms state-of-the-art baselines on three real-world dynamic datasets for different network mining tasks.
LGNov 11, 2020
Proximal Policy Optimization via Enhanced Exploration EfficiencyJunwei Zhang, Zhenghao Zhang, Shuai Han et al.
Proximal policy optimization (PPO) algorithm is a deep reinforcement learning algorithm with outstanding performance, especially in continuous control tasks. But the performance of this method is still affected by its exploration ability. For classical reinforcement learning, there are some schemes that make exploration more full and balanced with data exploitation, but they can't be applied in complex environments due to the complexity of algorithm. Based on continuous control tasks with dense reward, this paper analyzes the assumption of the original Gaussian action exploration mechanism in PPO algorithm, and clarifies the influence of exploration ability on performance. Afterward, aiming at the problem of exploration, an exploration enhancement mechanism based on uncertainty estimation is designed in this paper. Then, we apply exploration enhancement theory to PPO algorithm and propose the proximal policy optimization algorithm with intrinsic exploration module (IEM-PPO) which can be used in complex environments. In the experimental parts, we evaluate our method on multiple tasks of MuJoCo physical simulator, and compare IEM-PPO algorithm with curiosity driven exploration algorithm (ICM-PPO) and original algorithm (PPO). The experimental results demonstrate that IEM-PPO algorithm needs longer training time, but performs better in terms of sample efficiency and cumulative reward, and has stability and robustness.
CRApr 2, 2012
Time Synchronization Attack in Smart Grid-Part II: Cross Layer Detection MechanismZhenghao Zhang, Matthew Trinkle, Aleksandar D. Dimitrovski et al.
A novel time synchronization attack (TSA) on wide area monitoring systems in smart grid has been identified in the first part of this paper. A cross layer detection mechanism is proposed to combat TSA in part II of this paper. In the physical layer, we propose a GPS carrier signal noise ratio (C/No) based spoofing detection technique. In addition, a patch-monopole hybrid antenna is applied to receive GPS signal. By computing the standard deviation of the C/No difference from two GPS receivers, a priori probability of spoofing detection is fed to the upper layer, where power system state is estimated and controlled. A trustworthiness based evaluation method is applied to identify the PMU being under TSA. Both the physical layer and upper layer algorithms are integrated to detect the TSA, thus forming a cross layer mechanism. Experiment is carried out to verify the effectiveness of the proposed TSA detection algorithm.
CRApr 2, 2012
Time Synchronization Attack in Smart Grid-Part I: Impact and AnalysisZhenghao Zhang, Shuping Gong, Aleksandar D. Dimitrovski et al.
Many operations in power grids, such as fault detection and event location estimation, depend on precise timing information. In this paper, a novel Time Synchronization Attack (TSA) is proposed to attack the timing information in smart grid. Since many applications in smart grid utilize synchronous measurements and most of the measurement devices are equipped with global positioning system (GPS) for precise timing, it is highly probable to attack the measurement system by spoofing the GPS. The effectiveness of TSA is demonstrated for three applications of phasor measurement unit (PMU) in smart grid, namely transmission line fault detection, voltage stability monitoring and event locationing. The validity of TSA is demonstrated by numerical simulations.
CRJan 12, 2012
Time Stamp Attack in Smart Grid: Physical Mechanism and Damage AnalysisShuping Gong, Zhenghao Zhang, Husheng Li et al.
Many operations in power grids, such as fault detection and event location estimation, depend on precise timing information. In this paper, a novel time stamp attack (TSA) is proposed to attack the timing information in smart grid. Since many applications in smart grid utilize synchronous measurements and most of the measurement devices are equipped with global positioning system (GPS) for precise timing, it is highly probable to attack the measurement system by spoofing the GPS. The effectiveness of TSA is demonstrated for three applications of phasor measurement unit (PMU) in smart grid, namely transmission line fault detection, voltage stability monitoring and event locationing.