CLSep 21, 2023Code
AceGPT, Localizing Large Language Models in ArabicHuang Huang, Fei Yu, Jianqing Zhu et al.
This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed `AceGPT', sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Codes, data, and models are in https://github.com/FreedomIntelligence/AceGPT.
CLApr 20, 2023Code
Phoenix: Democratizing ChatGPT across LanguagesZhihong Chen, Feng Jiang, Junying Chen et al.
This paper presents our efforts to democratize ChatGPT across language. We release a large language model "Phoenix", achieving competitive performance among open-source English and Chinese models while excelling in languages with limited resources (covering both Latin and non-Latin languages). We believe this work will be beneficial to make ChatGPT more accessible, especially in countries where people cannot use ChatGPT due to restrictions from OpenAI or local goverments. Our data, code, and models are available at https://github.com/FreedomIntelligence/LLMZoo.
CLAug 17, 2023Code
CMB: A Comprehensive Medical Benchmark in ChineseXidong Wang, Guiming Hardy Chen, Dingjie Song et al.
Large Language Models (LLMs) provide a possibility to make a great breakthrough in medicine. The establishment of a standardized medical benchmark becomes a fundamental cornerstone to measure progression. However, medical environments in different regions have their local characteristics, e.g., the ubiquity and significance of traditional Chinese medicine within China. Therefore, merely translating English-based medical evaluation may result in \textit{contextual incongruities} to a local region. To solve the issue, we propose a localized medical benchmark called CMB, a Comprehensive Medical Benchmark in Chinese, designed and rooted entirely within the native Chinese linguistic and cultural framework. While traditional Chinese medicine is integral to this evaluation, it does not constitute its entirety. Using this benchmark, we have evaluated several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain. We hope this benchmark provide first-hand experience in existing LLMs for medicine and also facilitate the widespread adoption and enhancement of medical LLMs within China. Our data and code are publicly available at https://github.com/FreedomIntelligence/CMB.
CLFeb 18, 2024Code
ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language ModelsGuiming Hardy Chen, Shunian Chen, Ruifei Zhang et al.
Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset \textit{ALLaVA}, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.
CVMar 21, 2022
Interpreting Class Conditional GANs with Channel AwarenessYingqing He, Zhiyi Zhang, Jiapeng Zhu et al.
Understanding the mechanism of generative adversarial networks (GANs) helps us better use GANs for downstream applications. Existing efforts mainly target interpreting unconditional models, leaving it less explored how a conditional GAN learns to render images regarding various categories. This work fills in this gap by investigating how a class conditional generator unifies the synthesis of multiple classes. For this purpose, we dive into the widely used class-conditional batch normalization (CCBN), and observe that each feature channel is activated at varying degrees given different categorical embeddings. To describe such a phenomenon, we propose channel awareness, which quantitatively characterizes how a single channel contributes to the final synthesis. Extensive evaluations and analyses on the BigGAN model pre-trained on ImageNet reveal that only a subset of channels is primarily responsible for the generation of a particular category, similar categories (e.g., cat and dog) usually get related to some same channels, and some channels turn out to share information across all classes. For good measure, our algorithm enables several novel applications with conditional GANs. Concretely, we achieve (1) versatile image editing via simply altering a single channel and manage to (2) harmoniously hybridize two different classes. We further verify that the proposed channel awareness shows promising potential in (3) segmenting the synthesized image and (4) evaluating the category-wise synthesis performance.
LGJun 28, 2023
Reduce Computational Complexity for Convolutional Layers by Skipping ZerosZhiyi Zhang, Pengfei Zhang, Zhuopin Xu et al.
Convolutional neural networks necessitate good algorithms to reduce complexity, and sufficient utilization of parallel processors for acceleration. Within convolutional layers, there are three types of operators: convolution used in forward propagation, deconvolution and dilated-convolution utilized in backward propagation. During the execution of these operators, zeros are typically added to tensors, leading to redundant calculations and unnecessary strain on hardware. To circumvent these inefficiencies, we propose the C-K-S algorithm, accompanied by efficient GPU implementations. C-K-S trims filters to exclude zero-padding. For deconvolution and dilated-convolution, C-K-S transforms sparse tensors into dense tensors, and standardizes the local computational rules to simplify the hardware control. The experimental results demonstrate that C-K-S offers good performance in terms of speed and convergence, surpassing the capabilities of PyTorch and cuDNN in certain scenarios.
LGNov 14, 2023
Uplift Modeling based on Graph Neural Network Combined with Causal KnowledgeHaowen Wang, Xinyan Ye, Yangze Zhou et al.
Uplift modeling is a fundamental component of marketing effect modeling, which is commonly employed to evaluate the effects of treatments on outcomes. Through uplift modeling, we can identify the treatment with the greatest benefit. On the other side, we can identify clients who are likely to make favorable decisions in response to a certain treatment. In the past, uplift modeling approaches relied heavily on the difference-in-difference (DID) architecture, paired with a machine learning model as the estimation learner, while neglecting the link and confidential information between features. We proposed a framework based on graph neural networks that combine causal knowledge with an estimate of uplift value. Firstly, we presented a causal representation technique based on CATE (conditional average treatment effect) estimation and adjacency matrix structure learning. Secondly, we suggested a more scalable uplift modeling framework based on graph convolution networks for combining causal knowledge. Our findings demonstrate that this method works effectively for predicting uplift values, with small errors in typical simulated data, and its effectiveness has been verified in actual industry marketing data.
CLMay 24, 2023Code
HuatuoGPT, towards Taming Language Model to Be a DoctorHongbo Zhang, Junying Chen, Feng Jiang et al.
In this paper, we present HuatuoGPT, a large language model (LLM) for medical consultation. The core recipe of HuatuoGPT is to leverage both \textit{distilled data from ChatGPT} and \textit{real-world data from doctors} in the supervised fine-tuned stage. The responses of ChatGPT are usually detailed, well-presented and informative while it cannot perform like a doctor in many aspects, e.g. for integrative diagnosis. We argue that real-world data from doctors would be complementary to distilled data in the sense the former could tame a distilled language model to perform like doctors. To better leverage the strengths of both data, we train a reward model to align the language model with the merits that both data bring, following an RLAIF (reinforced learning from AI feedback) fashion. To evaluate and benchmark the models, we propose a comprehensive evaluation scheme (including automatic and manual metrics). Experimental results demonstrate that HuatuoGPT achieves state-of-the-art results in performing medical consultation among open-source LLMs in GPT-4 evaluation, human evaluation, and medical benchmark datasets. It is worth noting that by using additional real-world data and RLAIF, the distilled language model (i.e., HuatuoGPT) outperforms its teacher model ChatGPT in most cases. Our code, data, and models are publicly available at \url{https://github.com/FreedomIntelligence/HuatuoGPT}. The online demo is available at \url{https://www.HuatuoGPT.cn/}.
CLMay 2, 2023Code
Huatuo-26M, a Large-scale Chinese Medical QA DatasetJianquan Li, Xidong Wang, Xiangbo Wu et al.
In this paper, we release a largest ever medical Question Answering (QA) dataset with 26 million QA pairs. We benchmark many existing approaches in our dataset in terms of both retrieval and generation. Experimental results show that the existing models perform far lower than expected and the released dataset is still challenging in the pre-trained language model era. Moreover, we also experimentally show the benefit of the proposed dataset in many aspects: (i) trained models for other QA datasets in a zero-shot fashion; and (ii) as external knowledge for retrieval-augmented generation (RAG); and (iii) improving existing pre-trained language models by using the QA pairs as a pre-training corpus in continued training manner. We believe that this dataset will not only contribute to medical research but also facilitate both the patients and clinical doctors. See \url{https://github.com/FreedomIntelligence/Huatuo-26M}.
CLMar 4, 2024
Online Training of Large Language Models: Learn while chattingJuhao Liang, Ziwei Wang, Zhuoheng Ma et al.
Large Language Models(LLMs) have dramatically revolutionized the field of Natural Language Processing(NLP), offering remarkable capabilities that have garnered widespread usage. However, existing interaction paradigms between LLMs and users are constrained by either inflexibility, limitations in customization, or a lack of persistent learning. This inflexibility is particularly evident as users, especially those without programming skills, have restricted avenues to enhance or personalize the model. Existing frameworks further complicate the model training and deployment process due to their computational inefficiencies and lack of user-friendly interfaces. To overcome these challenges, this paper introduces a novel interaction paradigm-'Online Training using External Interactions'-that merges the benefits of persistent, real-time model updates with the flexibility for individual customization through external interactions such as AI agents or online/offline knowledge bases.
ROOct 11, 2025
Dejavu: Post-Deployment Learning for Embodied Agents via Experience FeedbackShaokai Wu, Yanbiao Ji, Qiuchang Li et al.
Embodied agents face a fundamental limitation: once deployed in real-world environments to perform specific tasks, they are unable to acquire new useful knowledge to enhance task performance. In this paper, we propose a general post-deployment learning framework called Dejavu, which employs an Experience Feedback Network (EFN) and augments the frozen Vision-Language-Action (VLA) policy with retrieved execution memories. EFN automatically identifies contextually successful prior action experiences and conditions action prediction on this retrieved guidance. We adopt reinforcement learning with semantic similarity rewards on EFN to ensure that the predicted actions align with past successful behaviors under current observations. During deployment, EFN continually enriches its memory with new trajectories, enabling the agent to exhibit "learning from experience" despite fixed weights. Experiments across diverse embodied tasks show that EFN significantly improves adaptability, robustness, and success rates over frozen baselines. These results highlight a promising path toward embodied agents that continually refine their behavior after deployment.
LGMay 15, 2023
Dragon-Alpha&cu32: A Java-based Tensor Computing Framework With its High-Performance CUDA LibraryZhiyi Zhang, Pengfei Zhang, Qi Wang
Java is very powerful, but in Deep Learning field, its capabilities probably has not been sufficiently exploited. Compared to the Java-based deep-learning-frameworks, the Python-based (PyTorch, TensorFlow, etc) are undoubtedly the mainstream, due to their easy-to-use, flexibility and better ecosystem. Dragon-Alpha is a Java-based Tensor Computing Framework, with easy-to-use, high-scalability and high-performance, trying to break Java's dilemma in deep learning field and make it more effective. Dragon-Alpha supports different levels of APIs, and can be used as a deep-learning-framework through its user-friendly high-level APIs. Dragon-Alpha has potential to aggregate computing-power across heterogeneous platforms and devices, based on its multi-layer architecture and Java's big-data ecosystem. Dragon-Alpha has its asynchronized APIs to improve parallelism, and highly-optimized CUDA library cu32 which adopts unique convolution\deconvolution operators for small feature maps. The experiments show that, compared to PyTorch&cuDNN, Dragon-Alpha&cu32 costs less time and memory (75.38% to 97.32%, 29.2% to 66.4%), to train some typical neural networks (AlexNet, VGG, GoogleNet, ResNet) on Cifar-10.
CVNov 18, 2021
One-Shot Generative Domain AdaptationCeyuan Yang, Yujun Shen, Zhiyi Zhang et al.
This work aims at transferring a Generative Adversarial Network (GAN) pre-trained on one image domain to a new domain referring to as few as just one target image. The main challenge is that, under limited supervision, it is extremely difficult to synthesize photo-realistic and highly diverse images, while acquiring representative characters of the target. Different from existing approaches that adopt the vanilla fine-tuning strategy, we import two lightweight modules to the generator and the discriminator respectively. Concretely, we introduce an attribute adaptor into the generator yet freeze its original parameters, through which it can reuse the prior knowledge to the most extent and hence maintain the synthesis quality and diversity. We then equip the well-learned discriminator backbone with an attribute classifier to ensure that the generator captures the appropriate characters from the reference. Furthermore, considering the poor diversity of the training data (i.e., as few as only one image), we propose to also constrain the diversity of the generative domain in the training process, alleviating the optimization difficulty. Our approach brings appealing results under various settings, substantially surpassing state-of-the-art alternatives, especially in terms of synthesis diversity. Noticeably, our method works well even with large domain gaps, and robustly converges within a few minutes for each experiment.
CRJun 8, 2021
Supporting Multiparty Signing over Named Data NetworkingZhiyi Zhang, Siqi Liu, Randy King et al.
Modern digitally controlled systems require multiparty authentication and authorization to meet the desired security requirement. This paper describes the design and development of NDN-MPS, an automated solution to support multiparty signature signing and verification for NDN-enabled applications. NDN-MPS suggests several changes and extensions to the existing NDN security solutions. First, it introduces a new type of trust schema to support signing and verification for multiple signers under complex policies such as threshold schemes. Second, it extends the NDN signature format to accommodate multisignature schemes such as BLS signature. Third, it introduces a signature collection protocol to solicit signatures securely from multiple signers. We further evaluate NDN-MPS by assessing its security properties and measuring its performance.
CRSep 20, 2020
On Certificate Management in Named Data NetworkingZhiyi Zhang, Su Yong Wong, Junxiao Shi et al.
Named Data Networking (NDN) secures network communications by requiring all data packets to be signed when produced. This requirement necessitates efficient and usable mechanisms to handle NDN certificate issuance and revocation, making these supporting mechanisms essential for NDN operations. In this paper, we first investigate and clarify core concepts related to NDN certificates and security design in general, and then present the model of NDN certificate management and its desired properties. We proceed with the design of a specific realization of NDN's certificate management, NDNCERT, evaluate it using a formal security analysis, and discuss the challenges in designing, implementing, and deploying the system, to share our experiences with other NDN security protocol development efforts.
NIJun 11, 2020
Sovereign: User-Controlled Smart HomesZhiyi Zhang, Tianyuan Yu, Xinyu Ma et al.
Recent years have witnessed the rapid deployment of smart homes; most of them are controlled by remote servers in the cloud. Such designs raise security and privacy concerns for end users. In this paper, we describe the design of Sovereign, a home IoT system framework that provides end users complete control of their home IoT systems. Sovereign lets home IoT devices and applications communicate via application-named data and secures data directly. This enables direct, secure, one-to-one and one-to-many device-to-device communication over wireless broadcast media. Sovereign utilizes semantic names to construct usable security solutions. We implement Sovereign as a publish-subscribe-based development platform together with a prototype home IoT controller. Our preliminary evaluation shows that Sovereign provides a systematic, easy-to-use solution to user-controlled, self-contained smart homes running on existing IoT hardware without imposing noticeable overhead.
CRFeb 24, 2020
EL PASSO: Privacy-preserving, Asynchronous Single Sign-OnZhiyi Zhang, Michał Król, Alberto Sonnino et al.
We introduce EL PASSO, a privacy-preserving, asynchronous Single Sign-On (SSO) system. It enables personal authentication while protecting users' privacy against both identity providers and relying parties, and allows selective attribute disclosure. EL PASSO is based on anonymous credentials, yet it supports users' accountability. Selected authorities may recover the identity of allegedly misbehaving users, and users can prove properties about their identity without revealing it in the clear. EL PASSO does not require specific secure hardware or a third party (other than existing participants in SSO). The generation and use of authentication credentials are asynchronous, allowing users to sign on when identity providers are temporarily unavailable. We evaluate EL PASSO in a distributed environment and prove its low computational cost, yielding faster sign-on operations than OIDC from a regular laptop, one-second user-perceived latency from a low-power device, and scaling to more than 50 sign-on operations per second at a relying party using a single 4-core server in the cloud.
CRJul 27, 2019
AuditShare: Sensitive Data Sharing with Reliable Leaker IdentificationZhiyi Zhang, Yu Guan, Xinyu Ma et al.
As Personally Identifiable Information (PII) data sharing among multiple parties becomes increasingly common, so does the potential for data leakage. As required by new data protection regulations and laws, when PII leakage occurs, one must be able to reliably identify the leaking sources. Existing solutions utilize watermark technologies or data object allocation strategies to differentiate the data shared with different parties to identify potential leakers. However, these solutions lose their effectiveness under several attack scenarios, e.g., a data sender may leak the data and a receiver may deny the reception of certain shared data. Worse yet, multiple receivers might collude and apply a set of operations such as intersection, complement, and union to their received datasets before leaking them, making the task of leaker identification even more difficult. In this paper, we propose AuditShare, a PII dataset sharing system with reliable leaking source identification. Firstly, taking advantage of the intrinsic properties of PII data, AuditShare allocates data objects to individual sharing parties by PII attributes. Secondly, AuditShare obliviously transfers data between the sender and each receiver and uses a Merkle Tree as an immutable record of the sharing. Thirdly, a knowledge-based identification algorithm is proposed to identify a guilty sender or colluding/non-colluding receivers. Through our evaluation, we show that: (i) With a modest amount of leaked data, AuditShare can accurately (accuracy>99.99%) and undeniably identify all the guilty parties in different cases; (ii) It only takes 0.5 second to share 100,000 data objects in AuditShare, which is practical in real-world deployment.
CRFeb 24, 2019
Expect More from the Networking: DDoS Mitigation by FITT in Named Data NetworkingZhiyi Zhang, Vishrant Vasavada, Siva Kesava Reddy Kakarla et al.
Distributed Denial of Service (DDoS) attacks have plagued the Internet for decades, but the basic defense approaches have not fundamentally changed. Rather, the size and rate of growth in attacks have actually outpaced carriers' and DDoS mitigation services' growth, calling for new solutions that can be, partially or fully, deployed imminently and exhibit effectiveness. In this paper, we examine the basic functions in Named Data Networking (NDN), a newly proposed Internet architecture, that can address the principle weaknesses in today's IP networks. We demonstrate by a new DDoS mitigation solution over NDN, Fine-grained Interest Traffic Throttling FITT, that NDN's architectural changes, even when incrementally deployed, can make DDoS attacks fundamentally more difficult to launch and less effective. FITT leverages the NDN design to enable the network to detect DDoS from victim's feedback, throttles DDoS traffic by reverse its exact paths through the network, and enforces control over the misbehaving entities at their sources. Our extensive simulation results show that FITT can throttle attack traffic with one-way time delay from the victim to the NDN gateway; upon activation, FITT effectively stop attack traffic from impacting benign flows, resulting in over 99\% of packets reaching victims being legitimate ones. We further demonstrate that service providers may implement NDN/FITT on existing CDN nodes as an incrementally deployable solution to effectuate the application level remediation at the sources, which remains unattainable in today's DDoS mitigation approaches.
CRFeb 24, 2019
DLedger: An IoT-Friendly Private Distributed Ledger System Based on DAGZhiyi Zhang, Vishrant Vasavada, Xinyu Ma et al.
With the ever growing Internet of Things (IoT) market, ledger systems are facing new challenges to efficiently store and secure enormous customer records collected by the IoT devices. The authenticity, availability, and integrity of these records are critically important for both business providers and customers. In this paper, we describe DLedger, a lightweight and resilient distributed ledger system. Instead of a single chain of blocks, DLedger builds the ledger over a directed acyclic graph (DAG), so that its operations can tolerate network partition and intermittent connectivity. Instead of compute-intensive Proof-of-Work (PoW), DLedger utilizes Proof-of-Authentication (PoA), whose light-weight operations are IoT-friendly, to achieve consensus. Furthermore, DLedger is built upon a data-centric network called Named Data Networking (NDN), which facilitates the peer-to-peer data dissemination in heterogeneous IoT networks.
SEJul 27, 2018
METTLE: a METamorphic testing approach to assessing and validating unsupervised machine LEarning systemsXiaoyuan Xie, Zhiyi Zhang, Tsong Yueh Chen et al.
Unsupervised machine learning is the training of an artificial intelligence system using information that is neither classified nor labeled, with a view to modeling the underlying structure or distribution in a dataset. Since unsupervised machine learning systems are widely used in many real-world applications, assessing the appropriateness of these systems and validating their implementations with respect to individual users' requirements and specific application scenarios$\,/\,$contexts are indisputably two important tasks. Such assessment and validation tasks, however, are fairly challenging due to the absence of a priori knowledge of the data. In view of this challenge, we develop a $\textbf{MET}$amorphic $\textbf{T}$esting approach to assessing and validating unsupervised machine $\textbf{LE}$arning systems, abbreviated as METTLE. Our approach provides a new way to unveil the (possibly latent) characteristics of various machine learning systems, by explicitly considering the specific expectations and requirements of these systems from individual users' perspectives. To support METTLE, we have further formulated 11 generic metamorphic relations (MRs), covering users' generally expected characteristics that should be possessed by machine learning systems. To demonstrate the viability and effectiveness of METTLE we have performed an experiment involving six commonly used clustering systems. Our experiment has shown that, guided by user-defined MR-based adequacy criteria, end users are able to assess, validate, and select appropriate clustering systems in accordance with their own specific needs. Our investigation has also yielded insightful understanding and interpretation of the behavior of the machine learning systems from an end-user software engineering's perspective, rather than a designer's or implementor's perspective, who normally adopts a theoretical approach.