Vyas Sekar

LG
h-index56
16papers
330citations
Novelty51%
AI Score45

16 Papers

LGJun 3, 2022
On the Privacy Properties of GAN-generated Samples

Zinan Lin, Vyas Sekar, Giulia Fanti · microsoft-research

The privacy implications of generative adversarial networks (GANs) are a topic of great interest, leading to several recent algorithms for training GANs with privacy guarantees. By drawing connections to the generalization properties of GANs, we prove that under some assumptions, GAN-generated samples inherently satisfy some (weak) privacy guarantees. First, we show that if a GAN is trained on m samples and used to generate n samples, the generated samples are (epsilon, delta)-differentially-private for (epsilon, delta) pairs where delta scales as O(n/m). We show that under some special conditions, this upper bound is tight. Next, we study the robustness of GAN-generated samples to membership inference attacks. We model membership inference as a hypothesis test in which the adversary must determine whether a given sample was drawn from the training dataset or from the underlying data distribution. We show that this adversary can achieve an area under the ROC curve that scales no better than O(m^{-1/4}).

LGMar 20, 2022
RareGAN: Generating Samples for Rare Classes

Zinan Lin, Hao Liang, Giulia Fanti et al. · microsoft-research

We study the problem of learning generative adversarial networks (GANs) for a rare class of an unlabeled dataset subject to a labeling budget. This problem is motivated from practical applications in domains including security (e.g., synthesizing packets for DNS amplification attacks), systems and networking (e.g., synthesizing workloads that trigger high resource usage), and machine learning (e.g., generating images from a rare class). Existing approaches are unsuitable, either requiring fully-labeled datasets or sacrificing the fidelity of the rare class for that of the common classes. We propose RareGAN, a novel synthesis of three key ideas: (1) extending conditional GANs to use labelled and unlabelled data for better generalization; (2) an active learning approach that requests the most useful labels; and (3) a weighted loss function to favor learning the rare class. We show that RareGAN achieves a better fidelity-diversity tradeoff on the rare class than prior work across different applications, budgets, rare class fractions, GAN losses, and architectures.

CRMar 3, 2023
Summary Statistic Privacy in Data Sharing

Zinan Lin, Shuaiqi Wang, Vyas Sekar et al. · microsoft-research

We study a setting where a data holder wishes to share data with a receiver, without revealing certain summary statistics of the data distribution (e.g., mean, standard deviation). It achieves this by passing the data through a randomization mechanism. We propose summary statistic privacy, a metric for quantifying the privacy risk of such a mechanism based on the worst-case probability of an adversary guessing the distributional secret within some threshold. Defining distortion as a worst-case Wasserstein-1 distance between the real and released data, we prove lower bounds on the tradeoff between privacy and distortion. We then propose a class of quantization mechanisms that can be adapted to different data distributions. We show that the quantization mechanism's privacy-distortion tradeoff matches our lower bounds under certain regimes, up to small constant factors. Finally, we demonstrate on real-world datasets that the proposed quantization mechanisms achieve better privacy-distortion tradeoffs than alternative privacy mechanisms.

CRJan 27, 2025Code
On the Feasibility of Using LLMs to Autonomously Execute Multi-host Network Attacks

Brian Singer, Keane Lucas, Lakshmi Adiga et al. · cmu

LLMs have shown preliminary promise in some security tasks and CTF challenges. Real cyberattacks are often multi-host network attacks, which involve executing a number of steps across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities, and using compromised hosts to exfiltrate data. To date, the extent to which LLMs can autonomously execute multi-host network attacks} is not well understood. To this end, our first contribution is MHBench, an open-source multi-host attack benchmark with 10 realistic emulated networks (from 25 to 50 hosts). We find that popular LLMs including modern reasoning models (e.g., GPT4o, Gemini 2.5 Pro, Sonnet 3.7 Thinking) with state-of-art security-relevant prompting strategies (e.g., PentestGPT, CyberSecEval3) cannot autonomously execute multi-host network attacks. To enable LLMs to autonomously execute such attacks, our second contribution is Incalmo, an high-level abstraction layer. Incalmo enables LLMs to specify high-level actions (e.g., infect a host, scan a network). Incalmo's translation layer converts these actions into lower-level primitives (e.g., commands to exploit tools) through expert agents. In 9 out of 10 networks in MHBench, LLMs using Incalmo achieve at least some of the attack goals. Even smaller LLMs (e.g., Haiku 3.5, Gemini 2 Flash) equipped with Incalmo achieve all goals in 5 of 10 environments. We also validate the key role of high-level actions in Incalmo's abstraction in enabling LLMs to autonomously execute such attacks.

AIMar 12Code
Generating Expressive and Customizable Evals for Timeseries Data Analysis Agents with AgentFuel

Aadyaa Maddi, Prakhar Naval, Deepti Mande et al.

Across many domains (e.g., IoT, observability, telecommunications, cybersecurity), there is an emerging adoption of conversational data analysis agents that enable users to "talk to your data" to extract insights. Such data analysis agents operate on timeseries data models; e.g., measurements from sensors or events monitoring user clicks and actions in product analytics. We evaluate 6 popular data analysis agents (both open-source and proprietary) on domain-specific data and query types, and find that they fail on stateful and incident-specific queries. We observe two key expressivity gaps in existing evals: domain-customized datasets and domain-specific query types. To enable practitioners in such domains to generate customized and expressive evals for such timeseries data agents, we present AgentFuel. AgentFuel helps domain experts quickly create customized evals to perform end-to-end functional tests. We show that AgentFuel's benchmarks expose key directions for improvement in existing data agent frameworks. We also present anecdotal evidence that using AgentFuel can improve agent performance (e.g., with GEPA). AgentFuel benchmarks are available at https://huggingface.co/datasets/RockfishData/TimeSeriesAgentEvals.

SYOct 15, 2025
Cyber-Resilient System Identification for Power Grid through Bayesian Integration

Shimiao Li, Guannan Qu, Bryan Hooi et al.

Power grids increasingly need real-time situational awareness under the ever-evolving cyberthreat landscape. Advances in snapshot-based system identification approaches have enabled accurately estimating states and topology from a snapshot of measurement data, under random bad data and topology errors. However, modern interactive, targeted false data can stay undetectable to these methods, and significantly compromise estimation accuracy. This work advances system identification that combines snapshot-based method with time-series model via Bayesian Integration, to advance cyber resiliency against both random and targeted false data. Using a distance-based time-series model, this work can leverage historical data of different distributions induced by changes in grid topology and other settings. The normal system behavior captured from historical data is integrated into system identification through a Bayesian treatment, to make solutions robust to targeted false data. We experiment on mixed random anomalies (bad data, topology error) and targeted false data injection attack (FDIA) to demonstrate our method's 1) cyber resilience: achieving over 70% reduction in estimation error under FDIA; 2) anomalous data identification: being able to alarm and locate anomalous data; 3) almost linear scalability: achieving comparable speed with the snapshot-based baseline, both taking <1min per time tick on the large 2,383-bus system using a laptop CPU.

LGJan 22, 2021
Pareto GAN: Extending the Representational Power of GANs to Heavy-Tailed Distributions

Todd Huster, Jeremy E. J. Cohen, Zinan Lin et al.

Generative adversarial networks (GANs) are often billed as "universal distribution learners", but precisely what distributions they can represent and learn is still an open question. Heavy-tailed distributions are prevalent in many different domains such as financial risk-assessment, physics, and epidemiology. We observe that existing GAN architectures do a poor job of matching the asymptotic behavior of heavy-tailed distributions, a problem that we show stems from their construction. Additionally, when faced with the infinite moments and large distances between outlier points that are characteristic of heavy-tailed distributions, common loss functions produce unstable or near-zero gradients. We address these problems with the Pareto GAN. A Pareto GAN leverages extreme value theory and the functional properties of neural networks to learn a distribution that matches the asymptotic behavior of the marginal distributions of the features. We identify issues with standard loss functions and propose the use of alternative metric spaces that enable stable and efficient learning. Finally, we evaluate our proposed approach on a variety of heavy-tailed datasets.

LGSep 6, 2020
Why Spectral Normalization Stabilizes GANs: Analysis and Improvements

Zinan Lin, Vyas Sekar, Giulia Fanti

Spectral normalization (SN) is a widely-used technique for improving the stability and sample quality of Generative Adversarial Networks (GANs). However, there is currently limited understanding of why SN is effective. In this work, we show that SN controls two important failure modes of GAN training: exploding and vanishing gradients. Our proofs illustrate a (perhaps unintentional) connection with the successful LeCun initialization. This connection helps to explain why the most popular implementation of SN for GANs requires no hyper-parameter tuning, whereas stricter implementations of SN have poor empirical performance out-of-the-box. Unlike LeCun initialization which only controls gradient vanishing at the beginning of training, SN preserves this property throughout training. Building on this theoretical understanding, we propose a new spectral normalization technique: Bidirectional Scaled Spectral Normalization (BSSN), which incorporates insights from later improvements to LeCun initialization: Xavier initialization and Kaiming initialization. Theoretically, we show that BSSN gives better gradient control than SN. Empirically, we demonstrate that it outperforms SN in sample quality and training stability on several benchmark datasets.

CRFeb 23, 2020
Fighting Fire with Light: A Case for Defending DDoS Attacks Using the Optical Layer

Matthew Hall, Ramakrishnan Durairajan, Vyas Sekar

The DDoS attack landscape is growing at an unprecedented pace. Inspired by the recent advances in optical networking, we make a case for optical layer-aware DDoS defense (O-LAD) in this paper. Our approach leverages the optical layer to isolate attack traffic rapidly via dynamic reconfiguration of (backup) wavelengths using ROADMs---bridging the gap between (a) evolution of the DDoS attack landscape and (b) innovations in the optical layer (e.g., reconfigurable optics). We show that the physical separation of traffic profiles allows finer-grained handling of suspicious flows and offers better performance for benign traffic in the face of an attack. We present preliminary results modeling throughput and latency for legitimate flows while scaling the strength of attacks. We also identify a number of open problems for the security, optical, and systems communities: modeling diverse DDoS attacks (e.g., fixed vs. variable rate, detectable vs. undetectable), building a full-fledged defense system with optical advancements (e.g., OpenConfig), and optical layer-aware defenses for a broader class of attacks (e.g., network reconnaissance).

LGNov 5, 2019
Enhancing the Privacy of Federated Learning with Sketching

Zaoxing Liu, Tian Li, Virginia Smith et al.

In response to growing concerns about user privacy, federated learning has emerged as a promising tool to train statistical models over networks of devices while keeping data localized. Federated learning methods run training tasks directly on user devices and do not share the raw user data with third parties. However, current methods still share model updates, which may contain private information (e.g., one's weight and height), during the training process. Existing efforts that aim to improve the privacy of federated learning make compromises in one or more of the following key areas: performance (particularly communication cost), accuracy, or privacy. To better optimize these trade-offs, we propose that \textit{sketching algorithms} have a unique advantage in that they can provide both privacy and performance benefits while maintaining accuracy. We evaluate the feasibility of sketching-based federated learning with a prototype on three representative learning models. Our initial findings show that it is possible to provide strong privacy guarantees for federated learning without sacrificing performance or accuracy. Our work highlights that there exists a fundamental connection between privacy and communication in distributed settings, and suggests important open problems surrounding the theoretical understanding, methodology, and system design of practical, private federated learning.

LGNov 3, 2019
Privacy for Free: Communication-Efficient Learning with Differential Privacy Using Sketches

Tian Li, Zaoxing Liu, Vyas Sekar et al.

Communication and privacy are two critical concerns in distributed learning. Many existing works treat these concerns separately. In this work, we argue that a natural connection exists between methods for communication reduction and privacy preservation in the context of distributed machine learning. In particular, we prove that Count Sketch, a simple method for data stream summarization, has inherent differential privacy properties. Using these derived privacy guarantees, we propose a novel sketch-based framework (DiffSketch) for distributed learning, where we compress the transmitted messages via sketches to simultaneously achieve communication efficiency and provable privacy benefits. Our evaluation demonstrates that DiffSketch can provide strong differential privacy guarantees (e.g., $\varepsilon$= 1) and reduce communication by 20-50x with only marginal decreases in accuracy. Compared to baselines that treat privacy and communication separately, DiffSketch improves absolute test accuracy by 5%-50% while offering the same privacy guarantees and communication compression.

LGSep 30, 2019
Using GANs for Sharing Networked Time Series Data: Challenges, Initial Promise, and Open Questions

Zinan Lin, Alankar Jain, Chen Wang et al.

Limited data access is a longstanding barrier to data-driven research and development in the networked systems community. In this work, we explore if and how generative adversarial networks (GANs) can be used to incentivize data sharing by enabling a generic framework for sharing synthetic datasets with minimal expert knowledge. As a specific target, our focus in this paper is on time series datasets with metadata (e.g., packet loss rate measurements with corresponding ISPs). We identify key challenges of existing GAN approaches for such workloads with respect to fidelity (e.g., long-term dependencies, complex multidimensional relationships, mode collapse) and privacy (i.e., existing guarantees are poorly understood and can sacrifice fidelity). To improve fidelity, we design a custom workflow called DoppelGANger (DG) and demonstrate that across diverse real-world datasets (e.g., bandwidth measurements, cluster requests, web sessions) and use cases (e.g., structural characterization, predictive modeling, algorithm comparison), DG achieves up to 43% better fidelity than baseline models. Although we do not resolve the privacy problem in this work, we identify fundamental challenges with both classical notions of privacy and recent advances to improve the privacy properties of GANs, and suggest a potential roadmap for addressing these challenges. By shedding light on the promise and challenges, we hope our work can rekindle the conversation on workflows for data sharing.

CRJan 4, 2019
Practical Verifiable In-network Filtering for DDoS defense

Deli Gong, Muoi Tran, Shweta Shinde et al.

In light of ever-increasing scale and sophistication of modern DDoS attacks, it is time to revisit in-network filtering or the idea of empowering DDoS victims to install in-network traffic filters in the upstream transit networks. Recent proposals show that filtering DDoS traffic at a handful of large transit networks can handle volumetric DDoS attacks effectively. However, the innetwork filtering primitive can also be misused. Transit networks can use the in-network filtering service as an excuse for any arbitrary packet drops made for their own benefit. For example, transit networks may intentionally execute filtering services poorly or unfairly to discriminate their competing neighbor ASes while claiming that they drop packets for the sake of DDoS defense. We argue that it is due to the lack of verifiable filtering - i.e., no one can check if a transit network executes the filter rules correctly as requested by the DDoS victims. To make in-network filtering a more robust defense primitive, in this paper, we propose a verifiable in-network filtering, called VIF, that exploits emerging hardware-based trusted execution environments (TEEs) and offers filtering verifiability to DDoS victims and neighbor ASes. Our proof of concept demonstrates that a VIF filter implementation on commodity servers with TEE support can handle traffic at line rate (e.g., 10 Gb/s) and execute up to 3,000 filter rules. We show that VIF can easily scale to handle larger traffic volume (e.g., 500 Gb/s) and more complex filtering operations (e.g., 150,000 filter rules) by parallelizing the TEE-based filters. As a practical deployment model, we suggest that Internet exchange points (IXPs) are the ideal candidates for the early adopters of our verifiable filters due to their central locations and flexible software-defined architecture.

CRNov 1, 2017
Vulnerabilities of Electric Vehicle Battery Packs to Cyberattacks

Shashank Sripad, Sekar Kulandaivel, Vikram Pande et al.

Electric Vehicles (EVs), like all modern vehicles, are entirely controlled by electronic devices embedded within networks that are exposed to the threat of cyberattacks. Cyber vulnerabilities are magnified with EVs due to unique risks associated with EV battery packs. Current batteries have well-known issues with specific energy, cost and fire-related safety risks. In this study, we develop a systematic framework to assess the impact of cyberattacks on EVs. While the current focus of automotive cyberattacks is on short-term physical safety, it is crucial to consider long-term cyberattacks that aim to cause financial losses through accrued impact, especially in the context of EVs. Faulty components of battery management systems such as a compromised voltage regulator could lead to cyberattacks that can overdischarge or overcharge the battery. Overdischarge could lead to failures such as internal shorts in the timescale of minutes through cyberattacks that compromise energy-intensive EV subsystems like auxiliary components. Attacks that overcharge the pack could shorten the lifetime of a new battery pack to less than a year. Further, such attacks also pose physical safety risks via the triggering of thermal (fire) events. Attacks on auxiliary components lead to battery drain, which could be up to 20% of the state-of-charge per hour. Lastly, we develop a heuristic for the stealthiness of a cyberattack to augment traditional threat models. The methodology presented here will help in building the foundational principles of electric vehicle cybersecurity: a nascent but critical topic in the coming years.

CRNov 2, 2016
Shedding Light on the Adoption of Let's Encrypt

Antonis Manousis, Roy Ragsdale, Ben Draffin et al.

Let's Encrypt is a new entrant in the Certificate Authority ecosystem that offers free and automated certificate signing. It is visionary in its commitment to Certificate Transparency. In this paper, we shed light on the adoption patterns of Let's Encrypt "in the wild" and inform the future design and deployment of this exciting development in the security landscape. We analyze acquisition patterns of certificates as well as their usage and deployment trends in the real world. To this end, we analyze data from Certificate Transparency Logs containing records of more then 18 million certificates. We also leverage other sources like Censys, Alexa's historic records, Geolocation databases, and VirusTotal. We also perform active HTTPS measurements on the domains owning Let's Encrypt certificates. Our analysis of certificate acquisition shows that (1) the impact of Let's Encrypt is particularly visible in Western Europe; (2) Let's Encrypt has the potential to democratize HTTPS adoption in countries that are recent entrants to Internet adoption; (3) there is anecdotal evidence of popular domains quitting their previously untrustworthy or expensive CAs in order to transition to Let's Encrypt; and (4) there is a "heavy tailed" behavior where a small number of domains acquire a large number of certificates. With respect to usage, we find that: (1) only 54% of domains actually use the Let's Encrypt certificates they have procured; (2) there are many non-trivial incidents of server misconfigurations; and (3) there is early evidence of use of Let's Encrypt certificates for typosquatting and for malware-laden sites.

MMAug 29, 2016
On the Efficiency and Fairness of Multiplayer HTTP-based Adaptive Video Streaming

Xiaoqi Yin, Mihovil Bartulović, Vyas Sekar et al.

User-perceived quality-of-experience (QoE) is critical in internet video delivery systems. Extensive prior work has studied the design of client-side bitrate adaptation algorithms to maximize single-player QoE. However, multiplayer QoE fairness becomes critical as the growth of video traffic makes it more likely that multiple players share a bottleneck in the network. Despite several recent proposals, there is still a series of open questions. In this paper, we bring the problem space to light from a control theory perspective by formalizing the multiplayer QoE fairness problem and addressing two key questions in the broader problem space. First, we derive the sufficient conditions of convergence to steady state QoE fairness under TCP-based bandwidth sharing scheme. Based on the insight from this analysis that in-network active bandwidth allocation is needed, we propose a non-linear MPC-based, router-assisted bandwidth allocation algorithm that regards each player as closed-loop systems. We use trace-driven simulation to show the improvement over existing approaches. We identify several research directions enabled by the control theoretic modeling and envision that control theory can play an important role on guiding real system design in adaptive video streaming.