Scott Duke Kominers

h-index: 32

7 papers

71 citations

Novelty: 41%

AI Score: 46

Ranked #36,943 of 194,257 authors (top 19%); #8,646 in LG (top 22%)

7 Papers

6.6 | CL | Jul 8, 2022 | Code

The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Mirac Suzgun, Luke Melas-Kyriazi, Suproteem K. Sarkar et al.

Innovation is a major driver of economic and social development, and information about many kinds of innovation is embedded in semi-structured data from patents and patent applications. Although the impact and novelty of innovations expressed in patent data are difficult to measure through traditional means, ML offers a promising set of techniques for evaluating novelty, summarizing contributions, and embedding semantics. In this paper, we introduce the Harvard USPTO Patent Dataset (HUPD), a large-scale, well-structured, and multi-purpose corpus of English-language patent applications filed to the United States Patent and Trademark Office (USPTO) between 2004 and 2018. With more than 4.5 million patent documents, HUPD is two to three times larger than comparable corpora. Unlike previously proposed patent datasets in NLP, HUPD contains the inventor-submitted versions of patent applications--not the final versions of granted patents--thereby allowing us to study patentability at the time of filing using NLP methods for the first time. It is also novel in its inclusion of rich structured metadata alongside the text of patent filings: By providing each application's metadata along with all of its text fields, the dataset enables researchers to perform new sets of NLP tasks that leverage variation in structured covariates. As a case study on the types of research HUPD makes possible, we introduce a new task to the NLP community--namely, binary classification of patent decisions. We additionally show the structured metadata provided in the dataset enables us to conduct explicit studies of concept shifts for this task. Finally, we demonstrate how HUPD can be used for three additional tasks: multi-class classification of patent subject areas, language modeling, and summarization.

1.7 | NT | Jun 3

Majorization and Gaussian-Mass Maximality for Construction-A Lattices from Binary Self-Dual Codes

Scott Duke Kominers

Regev and Stephens-Davidowitz conjectured that the integer lattice maximizes Gaussian mass among integral lattices of a given rank. We prove this, including the equality case, for all unimodular Construction-A lattices arising from binary self-dual codes. The proof reduces the theta-series inequality to a sharp majorization statement for codes: if $C$ is a binary self-dual $[2k,k]$ code, then the half-weight distribution of $C$ is dominated in convex order by $\operatorname{Bin}(k,1/2)$, which is the corresponding distribution for the repetition-code model of $\mathbb{Z}^{2k}$. Indeed, after putting $C$ in systematic form $[I\mid A]$, self-duality gives $AA^T=I$ over $\mathbb{F}_2$, so for a uniformly random message $a$ the two weights $\operatorname{wt}(a)$ and $\operatorname{wt}(aA)$ have the same binomial law. The half-weight of the resulting codeword is their average, and Jensen's inequality then gives convex-order domination. Applied to the convex test functions that build the theta series, this yields a sum-of-squares formula for the Gaussian-mass gap; applied to hinge functions, it gives coefficientwise nonnegativity of the reduced gap polynomial.
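The majorization statement in this abstract can be checked by brute force on a small example. The sketch below is only an illustration, not the paper's proof: it assumes the code with generator $[I\mid A]$ and $A=J-I$ for $k=4$ (the extended Hamming $[8,4]$ code, chosen here as a convenient test case), and it tests convex-order domination through equal means plus hinge functions $(t-c)^+$ at half-integer thresholds, which suffices for distributions supported on multiples of $1/2$. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from math import comb


def is_self_dual_systematic(A):
    """Check A A^T = I over F_2, the self-duality condition for a code [I | A]."""
    k = len(A)
    return all(
        sum(A[i][t] * A[j][t] for t in range(k)) % 2 == (1 if i == j else 0)
        for i in range(k)
        for j in range(k)
    )


def half_weight_distribution(A):
    """Law of wt(codeword)/2 when the message a is uniform on F_2^k."""
    k = len(A)
    counts = {}
    for a in product((0, 1), repeat=k):
        # Parity part aA over F_2; the codeword is (a, aA).
        aA = [sum(a[i] * A[i][j] for i in range(k)) % 2 for j in range(k)]
        half = Fraction(sum(a) + sum(aA), 2)
        counts[half] = counts.get(half, 0) + 1
    return {h: Fraction(c, 2 ** k) for h, c in counts.items()}


def binomial_half(k):
    """Law of Bin(k, 1/2) as exact fractions."""
    return {Fraction(j): Fraction(comb(k, j), 2 ** k) for j in range(k + 1)}


def hinge_mean(dist, c):
    """E[(X - c)^+] for a finite distribution {value: probability}."""
    return sum(p * max(h - c, 0) for h, p in dist.items())


def dominated_in_convex_order(small, big, k):
    """Equal means plus hinge domination at every threshold j/2 in [0, k]."""
    same_mean = hinge_mean(small, 0) == hinge_mean(big, 0)
    thresholds = [Fraction(j, 2) for j in range(2 * k + 1)]
    return same_mean and all(
        hinge_mean(small, c) <= hinge_mean(big, c) for c in thresholds
    )


k = 4
# Assumed test case: A = J - I, which gives the extended Hamming [8,4] code.
A = [[0 if i == j else 1 for j in range(k)] for i in range(k)]
H = half_weight_distribution(A)
B = binomial_half(k)
print(is_self_dual_systematic(A), dominated_in_convex_order(H, B, k))
```

Exact rational arithmetic via `Fraction` avoids floating-point ties when the two hinge expectations coincide, which does happen at some thresholds in this example.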

1.6 | NT | May 24

Equality in a Reverse Minkowski Shell Bound for Integral Lattices via Spherical Designs

Scott Duke Kominers

For a full-rank integral lattice $\mathcal{L}\subset\mathbb{R}^n$, Regev and Stephens-Davidowitz proved that \[N_{=k}(\mathcal{L}):=|\{y\in\mathcal{L}:\lVert y\rVert^2=k\}|\le 2\binom{n+2k-2}{2k-1}.\] We classify the equality cases. For $n\ge2$, equality holds if and only if either $k=1$ and $\mathcal{L}\cong\mathbb{Z}^n$, or $n=8$, $k=2$, and $\mathcal{L}\cong E_8$. For $n=1$, equality holds exactly when $\mathcal{L}$ represents $k$. The proof shows that equality is rigid. Saturation of the shell bound forces the normalized norm-$k$ shell to be an antipodal tight spherical $(4k-1)$-design. The associated Delsarte--Goethals--Seidel annihilator polynomial gives an arithmetic root condition, which isolates $E_8$ at $k=2$, rules out $k=3$, and combines with the Bannai--Damerell/Bannai theorem and an elementary circle argument to exclude all remaining cases in dimension at least $2$.
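The two equality cases named in this abstract can be confirmed numerically. The sketch below is an illustrative check only: it evaluates the bound $2\binom{n+2k-2}{2k-1}$ and counts shell vectors by enumeration. The coordinate model used for $E_8$ (all-integer or all-half-integer vectors with even coordinate sum, stored in doubled coordinates) is a standard description assumed here, not something taken from the abstract, and the function names are hypothetical.

```python
from itertools import product
from math import comb, isqrt


def shell_bound(n, k):
    """Right-hand side of the shell bound: 2 * C(n + 2k - 2, 2k - 1)."""
    return 2 * comb(n + 2 * k - 2, 2 * k - 1)


def count_Zn_shell(n, k):
    """Number of y in Z^n with squared norm exactly k, by brute force."""
    r = isqrt(k)  # no coordinate can exceed sqrt(k) in absolute value
    return sum(
        1
        for y in product(range(-r, r + 1), repeat=n)
        if sum(c * c for c in y) == k
    )


def count_E8_norm2():
    """Number of squared-norm-2 vectors in E_8, using doubled coordinates.

    With z = 2y, a vector lies in E_8 when the z_i are all even or all odd
    and sum(z) is divisible by 4; squared norm 2 becomes sum(z_i^2) == 8.
    """
    total = 0
    for values in ((-2, 0, 2), (-1, 1)):  # all-even, then all-odd entries
        for z in product(values, repeat=8):
            if sum(c * c for c in z) == 8 and sum(z) % 4 == 0:
                total += 1
    return total


# Equality for Z^n at k = 1, and for E_8 at n = 8, k = 2.
zn_equal = all(count_Zn_shell(n, 1) == shell_bound(n, 1) for n in range(2, 6))
e8_count = count_E8_norm2()
print(zn_equal, e8_count, shell_bound(8, 2))
```

For contrast, `count_Zn_shell(2, 2)` falls strictly below `shell_bound(2, 2)`, consistent with the claim that equality is rare.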

2.3 | LG | Jun 13, 2020 | Code

Generalization by Recognizing Confusion

Daniel Chiu, Franklyn Wang, Scott Duke Kominers

A recently-proposed technique called self-adaptive training augments modern neural networks by allowing them to adjust training labels on the fly, to avoid overfitting to samples that may be mislabeled or otherwise non-representative. By combining the self-adaptive objective with mixup, we further improve the accuracy of self-adaptive models for image recognition; the resulting classifier obtains state-of-the-art accuracies on datasets corrupted with label noise. Robustness to label noise implies a lower generalization gap; thus, our approach also leads to improved generalizability. We find evidence that the Rademacher complexity of these algorithms is low, suggesting a new path towards provable generalization for this type of deep learning model. Last, we highlight a novel connection between difficulties accounting for rare classes and robustness under noise, as rare classes are in a sense indistinguishable from label noise. Our code can be found at https://github.com/Tuxianeer/generalizationconfusion.
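The abstract does not spell out how the self-adaptive objective and mixup are combined, so the following is only a generic sketch of the mixup step applied to soft targets, not the authors' implementation. It assumes targets are probability vectors (as they would be once labels have been adjusted during training) and uses a mixing weight drawn from a Beta(alpha, alpha) distribution; the function names and the default `alpha` are hypothetical.

```python
import random


def interpolate(u, v, lam):
    """Elementwise convex combination lam * u + (1 - lam) * v of two vectors."""
    return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]


def mixup_batch(inputs, targets, alpha=1.0, rng=None):
    """Mix each example with a randomly chosen partner from the same batch.

    inputs  : list of feature vectors (lists of floats)
    targets : list of soft-label vectors, each summing to 1
    alpha   : Beta(alpha, alpha) shape parameter (an assumed default)
    """
    rng = rng or random.Random()
    lam = rng.betavariate(alpha, alpha)  # one mixing weight for the batch
    partners = list(range(len(inputs)))
    rng.shuffle(partners)                # random pairing within the batch
    mixed_inputs = [
        interpolate(inputs[i], inputs[j], lam) for i, j in enumerate(partners)
    ]
    mixed_targets = [
        interpolate(targets[i], targets[j], lam) for i, j in enumerate(partners)
    ]
    return mixed_inputs, mixed_targets, lam


# Toy batch: three 2-dimensional inputs with soft labels over two classes.
batch_inputs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
batch_targets = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
mixed_x, mixed_y, lam = mixup_batch(
    batch_inputs, batch_targets, alpha=1.0, rng=random.Random(0)
)
```

Because each mixed target is a convex combination of two probability vectors, it remains a valid probability vector, so it can be fed to any loss that accepts soft labels.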

1.6 | LG | Dec 2, 2021

Recommending with Recommendations

Naveen Durvasula, Franklyn Wang, Scott Duke Kominers

Recommendation systems are a key modern application of machine learning, but they have the downside that they often draw upon sensitive user information in making their predictions. We show how to address this deficiency by basing a service's recommendation engine upon recommendations from other existing services, which contain no sensitive information by nature. Specifically, we introduce a contextual multi-armed bandit recommendation framework where the agent has access to recommendations for other services. In our setting, the user's (potentially sensitive) information belongs to a high-dimensional latent space, and the ideal recommendations for the source and target tasks (which are non-sensitive) are given by unknown linear transformations of the user information. So long as the tasks rely on similar segments of the user information, we can decompose the target recommendation problem into systematic components that can be derived from the source recommendations, and idiosyncratic components that are user-specific and cannot be derived from the source, but have significantly lower dimensionality. We propose an explore-then-refine approach to learning and utilizing this decomposition; then using ideas from perturbation theory and statistical concentration of measure, we prove our algorithm achieves regret comparable to a strong skyline that has full knowledge of the source and target transformations. We also consider a generalization of our algorithm to a model with many simultaneous targets and no source. Our methods obtain superior empirical results on synthetic benchmarks.

10.8 | GT | Jul 7, 2021 | Code

Deep Learning for Two-Sided Matching

Sai Srivatsa Ravindranath, Zhe Feng, Shira Li et al.

We initiate the study of deep learning for the automated design of two-sided matching mechanisms. What is of most interest is to use machine learning to understand the possibility of new tradeoffs between strategy-proofness and stability. These properties cannot be achieved simultaneously, but the efficient frontier is not understood. We introduce novel differentiable surrogates for quantifying ordinal strategy-proofness and stability and use them to train differentiable matching mechanisms that map discrete preferences to valid randomized matchings. We demonstrate that the efficient frontier characterized by these learned mechanisms is substantially better than that achievable through a convex combination of baselines of deferred acceptance (stable and strategy-proof for only one side of the market), top trading cycles (strategy-proof for one side, but not stable), and randomized serial dictatorship (strategy-proof for both sides, but not stable). This gives a new target for economic theory and opens up new possibilities for machine learning pipelines in matching market design.
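For readers unfamiliar with the deferred acceptance baseline named in this abstract, a minimal textbook version is sketched below. It is not the paper's learned mechanism or its code: it is the classical proposer-side algorithm, assuming strict and complete preference lists on both sides, together with a simple check for blocking pairs. The function names and the toy market are hypothetical.

```python
from collections import deque


def deferred_acceptance(proposer_prefs, receiver_prefs):
    """Proposer-proposing deferred acceptance; returns {proposer: receiver}."""
    # rank[r][p] = position of proposer p in receiver r's list (lower is better)
    rank = {
        r: {p: i for i, p in enumerate(prefs)}
        for r, prefs in receiver_prefs.items()
    }
    next_choice = {p: 0 for p in proposer_prefs}  # index of next proposal
    held = {}                                     # receiver -> held proposer
    free = deque(proposer_prefs)
    while free:
        p = free.popleft()
        if next_choice[p] >= len(proposer_prefs[p]):
            continue  # p has exhausted its list and stays unmatched
        r = proposer_prefs[p][next_choice[p]]
        next_choice[p] += 1
        current = held.get(r)
        if current is None:
            held[r] = p               # r tentatively accepts p
        elif rank[r][p] < rank[r][current]:
            held[r] = p               # r trades up and releases current
            free.append(current)
        else:
            free.append(p)            # r rejects p
    return {p: r for r, p in held.items()}


def is_stable(matching, proposer_prefs, receiver_prefs):
    """Return True when no proposer-receiver pair blocks the matching."""
    partner_of = {r: p for p, r in matching.items()}
    for p, prefs in proposer_prefs.items():
        for r in prefs:
            if matching.get(p) == r:
                break  # remaining receivers are worse than p's match
            current = partner_of.get(r)
            r_list = receiver_prefs[r]
            if current is None or r_list.index(p) < r_list.index(current):
                return False  # p and r both prefer each other
    return True


# Toy market with two agents per side (hypothetical data).
workers = {"a": ["x", "y"], "b": ["x", "y"]}
firms = {"x": ["b", "a"], "y": ["a", "b"]}
match = deferred_acceptance(workers, firms)
print(match, is_stable(match, workers, firms))
```

In this toy market both workers first propose to firm `x`, which keeps `b`; worker `a` then proposes to `y`, and the resulting matching has no blocking pair.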

1.2 | CY | Mar 21, 2020

Smarter Parking: Using AI to Identify Parking Inefficiencies in Vancouver

Devon Graham, Satish Kumar Sarraf, Taylor Lundy et al.

On-street parking is convenient, but has many disadvantages: on-street spots come at the expense of other road uses such as traffic lanes, transit lanes, bike lanes, or parklets; drivers looking for parking contribute substantially to traffic congestion and hence to greenhouse gas emissions; safety is reduced both because drivers looking for spots are more distracted than other road users and because people exiting parked cars pose a risk to cyclists. These social costs may not be worth paying when off-street parking lots are nearby and have surplus capacity. To see where this might be true in downtown Vancouver, we used artificial intelligence techniques to estimate the amount of time it would take drivers to park both on and off street for destinations throughout the city. For on-street parking, we developed (1) a deep-learning model of block-by-block parking availability based on data from parking meters and audits and (2) a computational simulation of drivers searching for an on-street spot. For off-street parking, we developed a computational simulation of the time it would take drivers to drive from their original destination to the nearest city-owned off-street lot and then to queue for a spot, based on traffic and lot occupancy data. Finally, in both cases we also computed the time it would take the driver to walk from their parking spot to their original destination. We compared these time estimates for destinations in each block of Vancouver's downtown core and each hour of the day. We found many areas where parking off street would actually save drivers time over searching the streets for a spot, and many more where the time cost for parking off street was small. The identification of such areas provides an opportunity for the city to repurpose valuable curbside space for community-friendly uses more in line with its transportation goals.