LGSep 9, 2023Code
Reverse-Engineering Decoding Strategies Given Blackbox Access to a Language Generation SystemDaphne Ippolito, Nicholas Carlini, Katherine Lee et al.
Neural language models are increasingly deployed into APIs and websites that allow a user to pass in a prompt and receive generated text. Many of these systems do not reveal generation parameters. In this paper, we present methods to reverse-engineer the decoding method used to generate text (i.e., top-$k$ or nucleus sampling). Our ability to discover which decoding strategy was used has implications for detecting generated text. Additionally, the process of discovering the decoding strategy can reveal biases caused by selecting decoding settings which severely truncate a model's predicted distributions. We perform our attack on several families of open-source language models, as well as on production systems (e.g., ChatGPT).
LGNov 9, 2021
On minimizers and convolutional filters: theoretical connections and applications to genome analysisYun William Yu
Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. Here, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step towards building a deep learning assembler, though it is at present too slow to be practical. In total, this manuscript provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
CRNov 8, 2020
Privacy-accuracy trade-offs in noisy digital exposure notificationsAbbas Hammoud, Yun William Yu
Since the global spread of Covid-19 began to overwhelm the attempts of governments to conduct manual contact-tracing, there has been much interest in using the power of mobile phones to automate the contact-tracing process through the development of exposure notification applications. The rough idea is simple: use Bluetooth or other data-exchange technologies to record contacts between users, enable users to report positive diagnoses, and alert users who have been exposed to sick users. Of course, there are many privacy concerns associated with this idea. Much of the work in this area has been concerned with designing mechanisms for tracing contacts and alerting users that do not leak additional information about users beyond the existence of exposure events. However, although designing practical protocols is of crucial importance, it is essential to realize that notifying users about exposure events may itself leak confidential information (e.g. that a particular contact has been diagnosed). Luckily, while digital contact tracing is a relatively new task, the generic problem of privacy and data disclosure has been studied for decades. Indeed, the framework of differential privacy further permits provable query privacy by adding random noise. In this article, we translate two results from statistical privacy and social recommendation algorithms to exposure notification. We thus prove some naive bounds on the degree to which accuracy must be sacrificed if exposure notification frameworks are to be made more private through the injection of noise.
SOC-PHJun 4, 2020
Severability of mesoscale components and local time scales in dynamical networksYun William Yu, Jean-Charles Delvenne, Sophia N. Yaliraki et al.
A major goal of dynamical systems theory is the search for simplified descriptions of the dynamics of a large number of interacting states. For overwhelmingly complex dynamical systems, the derivation of a reduced description on the entire dynamics at once is computationally infeasible. Other complex systems are so expansive that despite the continual onslaught of new data only partial information is available. To address this challenge, we define and optimise for a local quality function severability for measuring the dynamical coherency of a set of states over time. The theoretical underpinnings of severability lie in our local adaptation of the Simon-Ando-Fisher time-scale separation theorem, which formalises the intuition of local wells in the Markov landscape of a dynamical process, or the separation between a microscopic and a macroscopic dynamics. Finally, we demonstrate the practical relevance of severability by applying it to examples drawn from power networks, image segmentation, social networks, metabolic networks, and word association.
CRMay 18, 2020
COVI White PaperHannah Alsdurf, Edmond Belliveau, Yoshua Bengio et al.
The SARS-CoV-2 (Covid-19) pandemic has caused significant strain on public health institutions around the world. Contact tracing is an essential tool to change the course of the Covid-19 pandemic. Manual contact tracing of Covid-19 cases has significant challenges that limit the ability of public health authorities to minimize community infections. Personalized peer-to-peer contact tracing through the use of mobile apps has the potential to shift the paradigm. Some countries have deployed centralized tracking systems, but more privacy-protecting decentralized systems offer much of the same benefit without concentrating data in the hands of a state authority or for-profit corporations. Machine learning methods can circumvent some of the limitations of standard digital tracing by incorporating many clues and their uncertainty into a more graded and precise estimation of infection risk. The estimated risk can provide early risk awareness, personalized recommendations and relevant information to the user. Finally, non-identifying risk data can inform epidemiological models trained jointly with the machine learning predictor. These models can provide statistical evidence for the importance of factors involved in disease transmission. They can also be used to monitor, evaluate and optimize health policy and (de)confinement scenarios according to medical and economic productivity indicators. However, such a strategy based on mobile apps and machine learning should proactively mitigate potential ethical and privacy risks, which could have substantial impacts on society (not only impacts on health but also impacts such as stigmatization and abuse of personal data). Here, we present an overview of the rationale, design, ethical considerations and privacy strategy of `COVI,' a Covid-19 public peer-to-peer contact tracing and risk awareness mobile application developed in Canada.
CRMar 25, 2020
Contact Tracing Mobile Apps for COVID-19: Privacy Considerations and Related Trade-offsHyunghoon Cho, Daphne Ippolito, Yun William Yu
Contact tracing is an essential tool for public health officials and local communities to fight the spread of novel diseases, such as for the COVID-19 pandemic. The Singaporean government just released a mobile phone app, TraceTogether, that is designed to assist health officials in tracking down exposures after an infected individual is identified. However, there are important privacy implications of the existence of such tracking apps. Here, we analyze some of those implications and discuss ways of ameliorating the privacy concerns without decreasing usefulness to public health. We hope in writing this document to ensure that privacy is a central feature of conversations surrounding mobile contact tracing apps and to encourage community efforts to develop alternative effective solutions with stronger privacy protection for the users. Importantly, though we discuss potential modifications, this document is not meant as a formal research paper, but instead is a response to some of the privacy characteristics of direct contact tracing apps like TraceTogether and an early-stage Request for Comments to the community. Date written: 2020-03-24 Minor correction: 2020-03-30