LG Apr 25, 2023
Controlling Posterior Collapse by an Inverse Lipschitz Constraint on the Decoder Network
Yuri Kinoshita, Kenta Oono, Kenji Fukumizu et al.
Variational autoencoders (VAEs) are among the deep generative models that have experienced enormous success over the past decades. However, in practice, they suffer from a problem called posterior collapse, which occurs when the encoder coincides, or collapses, with the prior, taking no information from the latent structure of the input data into consideration. In this work, we introduce an inverse Lipschitz neural network into the decoder and, based on this architecture, provide a new method that can control, in a simple and clear manner, the degree of posterior collapse for a wide range of VAE models, equipped with a concrete theoretical guarantee. We also illustrate the effectiveness of our method through several numerical experiments.
LG Jun 12, 2023
A Batch-to-Online Transformation under Random-Order Model
Jing Dong, Yuichi Yoshida
We introduce a transformation framework that can be utilized to develop online algorithms with low $ε$-approximate regret in the random-order model from offline approximation algorithms. We first give a general reduction theorem that transforms an offline approximation algorithm with low average sensitivity to an online algorithm with low $ε$-approximate regret. We then demonstrate that offline approximation algorithms can be transformed into a low-sensitivity version using a coreset construction method. To showcase the versatility of our approach, we apply it to various problems, including online $(k,z)$-clustering, online matrix approximation, and online regression, and successfully achieve polylogarithmic $ε$-approximate regret for each problem. Moreover, we show that in all three cases, our algorithm also enjoys low inconsistency, which may be desired in some online applications.
DS Apr 15
Lower Bounds for Testing Directed Acyclicity in the Unidirectional Bounded-Degree Model
Yuichi Yoshida
We study property testing of directed acyclicity in the unidirectional bounded-degree oracle model, where a query to a vertex reveals its outgoing neighbors. We prove that there exist absolute constants $d_0\in\mathbb{N}$ and $\varepsilon>0$ such that for every constant $d\ge d_0$, any one-sided $\varepsilon$-tester for acyclicity on $n$-vertex digraphs of maximum outdegree at most $d$ requires $\widetildeΩ(n^{2/3})$ queries. This improves the previous $\widetildeΩ(n^{5/9})$ lower bound for one-sided testing of acyclicity in the same model. We also prove that, under the same degree assumption, any two-sided $\varepsilon$-tester requires $Ω(\sqrt n)$ queries, improving the previous $Ω(n^{1/3})$ lower bound. We further prove an $Ω(n)$ lower bound for tolerant testing for some absolute constant outdegree bound $d$ by reduction from bounded-degree $3$-colorability.
DS Apr 2
Non-Signaling Locality Lower Bounds for Dominating Set
Noah Fleming, Max Hopkins, Yuichi Yoshida
Minimum dominating set is a basic local covering problem and a core task in distributed computing. Despite extensive study, in the classic LOCAL model there exist significant gaps between known algorithms and lower bounds. Chang and Li proved an $Ω(\log n)$-locality lower bound for a constant factor approximation, while Kuhn--Moscibroda--Wattenhofer gave an algorithm beating this bound beyond $\log Δ$-approximation, along with a weaker lower bound for this degree-dependent setting scaling roughly with $\min\{\log Δ/\log\log Δ,\sqrt{\log n/\log\log n}\}$. Unfortunately, this latter bound is weak for small $Δ$, and never recovers the Chang--Li bound, leaving central questions: does $O(\log Δ)$-approximation require $Ω(\log n)$ locality, and do such bounds extend beyond LOCAL? In this work, we take a major step toward answering these questions in the non-signaling model, which strictly subsumes the LOCAL, quantum-LOCAL, and bounded-dependence settings. We prove every $O(\log Δ)$-approximate non-signaling distribution for dominating set requires locality $Ω(\log n/(\log Δ\cdot \mathrm{poly}\log\log Δ))$. Further, we show for some $β\in (0,1)$, every $O(\log^β Δ)$-approximate non-signaling distribution requires locality $Ω(\log n/\log Δ)$, which combined with the KMW bound yields a degree-independent $Ω(\sqrt{\log n/\log\log n})$ quantum-LOCAL lower bound for $O(\log^β Δ)$-approximation algorithms. The proof is based on two new low-soundness sensitivity lower bounds for label cover, one via Impagliazzo--Kabanets--Wigderson-style parallel repetition with degree reduction and one from a sensitivity-preserving reworking of the Dinur--Harsha framework, together with the reductions from label cover to set cover to dominating set and the sensitivity-to-locality transfer theorem of Fleming and Yoshida.
DS May 18
Tolerant Testing for Unique Games
Yuichi Yoshida
We give tolerant testers with sublinear query complexity in the adjacency-list model for Unique Games. Prior tolerant testers required structural assumptions such as expansion or clusterability. For Unique Games, the tester distinguishes instances whose optimum fraction of violated constraints is at most $\varepsilon$ from those whose optimum is at least $ρ$, for $0<\varepsilon<ρ<1$, assuming $\varepsilon\log n\lesssimρ^4$. On instances with $n$ vertices and $m$ constraints, it uses $\widetilde O(\sqrt m\,ρ^{-13/2}+nρ^{-2}/\sqrt m)$ queries. We also give a specialized tester for bipartiteness, the $Q=2$ transposition case of Unique Games. Exploiting its signed structure, the tester achieves substantially better tolerance and query-complexity guarantees than the generic Unique Games tester. Writing $λ=ρ/(1+\log(1/ρ))$, the bipartiteness tester distinguishes graphs that can be made bipartite by deleting at most an $\varepsilon$ fraction of edges from graphs in which every bipartition has at least a $ρ$ fraction of edges with both endpoints on the same side, assuming $\varepsilon\log n\lesssimλ^2$, using $\widetilde O(\sqrt m/λ^2+n/(\sqrt m\,λ))$ queries.
ML Feb 10
From Average Sensitivity to Small-Loss Regret Bounds under Random-Order Model
Shinsaku Sakaue, Yuichi Yoshida
We study online learning in the random-order model, where the multiset of loss functions is chosen adversarially but revealed in a uniformly random order. Building on the batch-to-online conversion by Dong and Yoshida (2023), we show that if an offline algorithm admits a $(1+\varepsilon)$-approximation guarantee and the effect of $\varepsilon$ on its average sensitivity is characterized by a function $\varphi(\varepsilon)$, then an adaptive choice of $\varepsilon$ yields a small-loss regret bound of $\tilde O(\varphi^{\star}(\mathrm{OPT}_T))$, where $\varphi^{\star}$ is the concave conjugate of $\varphi$, $\mathrm{OPT}_T$ is the offline optimum over $T$ rounds, and $\tilde O$ hides polylogarithmic factors in $T$. Our method requires no regularity assumptions on loss functions, such as smoothness, and can be viewed as a generalization of the AdaGrad-style tuning applied to the approximation parameter $\varepsilon$. Our result recovers and strengthens the $(1+\varepsilon)$-approximate regret bounds of Dong and Yoshida (2023) and yields small-loss regret bounds for online $k$-means clustering, low-rank approximation, and regression. We further apply our framework to online submodular function minimization using $(1\pm\varepsilon)$-cut sparsifiers of submodular hypergraphs, obtaining a small-loss regret bound of $\tilde O(n^{3/4}(1 + \mathrm{OPT}_T^{3/4}))$, where $n$ is the ground-set size. Our approach sheds light on the power of sparsification and related techniques in establishing small-loss regret bounds in the random-order model.
DS Apr 30
Solving Hypergraph Laplacian Systems in Almost-Linear Time
Yuichi Yoshida
For a connected weighted hypergraph, we give a randomized almost-linear-time solver for the Poisson problem for the cut-based hypergraph Laplacian in the natural input size $P=\sum_{e\in E}|e|$, the sum of hyperedge sizes. For every fixed constant $C>0$, our randomized algorithm runs in $P^{1+o(1)}$ time and, with high probability over its internal randomness, returns a primal point and a dual certificate, with additive optimality gap at most $\exp(-\log^C P)$. A key step is to rewrite the Fenchel dual as a convex-flow problem on an auxiliary $O(P)$-arc graph, yielding a near-optimal dual flow. The main difficulty is primal recovery, because this flow does not by itself determine a primal potential. Our main new ingredient is a recovery theorem showing that, for primal recovery, the detailed routing of the dual flow inside each hyperedge gadget can be discarded: one nonnegative scalar per hyperedge is enough. After the necessary finite-precision rounding, these scalars define a linear-cost min-cost-flow instance on the auxiliary graph, and solving it exactly recovers a primal potential. Finally, a ground-vertex reduction from regularized objectives to the Poisson solver gives randomized almost-linear-time resolvent/proximal primitives for the same cut-based hypergraph Laplacian.
DS Mar 22
Testing Monotonicity of Real-Valued Functions on DAGs
Yuichi Yoshida
We study monotonicity testing of real-valued functions on directed acyclic graphs (DAGs) with $n$ vertices. For every constant $δ>0$, we prove a $Ω(n^{1/2-δ}/\sqrt{\varepsilon})$ lower bound against non-adaptive two-sided testers on DAGs, nearly matching the classical $O(\sqrt{n/\varepsilon})$-query upper bound. For constant $\varepsilon$, we also prove an $Ω(\sqrt n)$ lower bound for randomized adaptive one-sided testers on explicit bipartite DAGs, whereas previously only an $Ω(\log n)$ lower bound was known. A key technical ingredient in both lower bounds is positive-matching Ruzsa--Szemerédi families. On the algorithmic side, we give simple non-adaptive one-sided testers with query complexity $O(\sqrt{m\,\ell}/(\varepsilon n))$ and $O(m^{1/3}/\varepsilon^{2/3})$, where $m$ is the number of edges in the transitive reduction and $\ell$ is the number of edges in the transitive closure. For constant $\varepsilon>0$, these improve over the previous $O(\sqrt{n/\varepsilon})$ bound when $m\ell=o(n^3)$ and $m=o(n^{3/2})$, respectively.
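To make the property being tested concrete, here is a minimal sketch of an exact (non-sublinear) monotonicity check. It is not one of the testers from the abstract: it reads every edge, whereas a tester would query only a few vertices. The function names and the list-based representation of `f` are assumptions made for illustration. The check relies on the fact that `f` is monotone on a DAG exactly when `f[u] <= f[v]` holds on every edge, since the inequality then chains along any directed path.

```python
def violated_edges(edges, f):
    """Return the edges (u, v) of the DAG on which f decreases, i.e. f[u] > f[v]."""
    return [(u, v) for u, v in edges if f[u] > f[v]]


def is_monotone(edges, f):
    """f is monotone iff no edge is violated (inequalities chain along paths)."""
    return not violated_edges(edges, f)


# Hypothetical example: a path 0 -> 1 -> 2 with one decreasing step.
example_edges = [(0, 1), (1, 2)]
example_values = [0.1, 0.5, 0.3]
```

On the example above, only the edge `(1, 2)` is violated, so the function is not monotone.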
LG Feb 9
Noise Stability of Transformer Models
Themistoklis Haris, Zihan Zhang, Yuichi Yoshida
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately $35\%$ and $75\%$ respectively. Our results sculpt a new connection between signal propagation in neural networks and interpretability, with noise stability emerging as a powerful tool for understanding and improving modern Transformers.
LG Jul 16, 2025
From Generative to Episodic: Sample-Efficient Replicable Reinforcement Learning
Max Hopkins, Sihan Liu, Christopher Ye et al.
The epidemic failure of replicability across empirical science and machine learning has recently motivated the formal study of replicable learning algorithms [Impagliazzo et al. (2022)]. In batch settings where data comes from a fixed i.i.d. source (e.g., hypothesis testing, supervised learning), the design of data-efficient replicable algorithms is now more or less understood. In contrast, there remain significant gaps in our knowledge for control settings like reinforcement learning where an agent must interact directly with a shifting environment. Karbasi et al. show that with access to a generative model of an environment with $S$ states and $A$ actions (the RL 'batch setting'), replicably learning a near-optimal policy costs only $\tilde{O}(S^2A^2)$ samples. On the other hand, the best upper bound without a generative model jumps to $\tilde{O}(S^7 A^7)$ [Eaton et al. (2024)] due to the substantial difficulty of environment exploration. This gap raises a key question in the broader theory of replicability: Is replicable exploration inherently more expensive than batch learning? Is sample-efficient replicable RL even possible? In this work, we (nearly) resolve this problem (for low-horizon tabular MDPs): exploration is not a significant barrier to replicable learning! Our main result is a replicable RL algorithm on $\tilde{O}(S^2A)$ samples, bridging the gap between the generative and episodic settings. We complement this with a matching $\tilde{Ω}(S^2A)$ lower bound in the generative setting (under the common parallel sampling assumption) and an unconditional lower bound in the episodic setting of $\tilde{Ω}(S^2)$, showcasing the near-optimality of our algorithm with respect to the state space $S$.
ML Feb 12, 2024
Replicability is Asymptotically Free in Multi-armed Bandits
Junpei Komiyama, Shinji Ito, Yuichi Yoshida et al.
We consider a replicable stochastic multi-armed bandit algorithm that ensures, with high probability, that the algorithm's sequence of actions is not affected by the randomness inherent in the dataset. Replicability allows third parties to reproduce published findings and assists the original researcher in applying standard statistical tests. We observe that existing algorithms require $O(K^2/ρ^2)$ times more regret than nonreplicable algorithms, where $K$ is the number of arms and $ρ$ is the level of nonreplication. However, we demonstrate that this additional cost is unnecessary when the time horizon $T$ is sufficiently large for a given $K, ρ$, provided that the magnitude of the confidence bounds is chosen carefully. Therefore, for a large $T$, our algorithm only suffers $K^2/ρ^2$ times smaller amount of exploration than existing algorithms. To ensure the replicability of the proposed algorithms, we incorporate randomness into their decision-making processes. We propose a principled approach to limiting the probability of nonreplication. This approach elucidates the steps that existing research has implicitly followed. Furthermore, we derive the first lower bound for the two-armed replicable bandit problem, which implies the optimality of the proposed algorithms up to a $\log\log T$ factor for the two-armed case.
DS Jan 18, 2022
Sparsification of Decomposable Submodular Functions
Akbar Rafiey, Yuichi Yoshida
Submodular functions are at the core of many machine learning and data mining tasks. The underlying submodular functions for many of these tasks are decomposable, i.e., they are sums of several simple submodular functions. In many data-intensive applications, however, the number of underlying submodular functions in the original function is so large that we need a prohibitively large amount of time to process it and/or it does not even fit in the main memory. To overcome this issue, we introduce the notion of sparsification for decomposable submodular functions, whose objective is to obtain an accurate approximation of the original function that is a (weighted) sum of only a few submodular functions. Our main result is a polynomial-time randomized sparsification algorithm such that the expected number of functions used in the output is independent of the number of underlying submodular functions in the original function. We also study the effectiveness of our algorithm under various constraints such as matroid and cardinality constraints. We complement our theoretical analysis with an empirical study of the performance of our algorithm.
DS Jun 7, 2021
Local Algorithms for Estimating Effective Resistance
Pan Peng, Daniel Lopatta, Yuichi Yoshida et al.
Effective resistance is an important metric that measures the similarity of two vertices in a graph. It has found applications in graph clustering, recommendation systems and network reliability, among others. In spite of the importance of the effective resistances, we still lack efficient algorithms to exactly compute or approximate them on massive graphs. In this work, we design several \emph{local algorithms} for estimating effective resistances, which are algorithms that only read a small portion of the input while still having provable performance guarantees. To illustrate, our main algorithm approximates the effective resistance between any vertex pair $s,t$ with an arbitrarily small additive error $\varepsilon$ in time $O(\mathrm{poly}(\log n/\varepsilon))$, whenever the underlying graph has bounded mixing time. We perform an extensive empirical study on several benchmark datasets, validating the performance of our algorithms.
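For reference, the quantity being estimated can be computed exactly on a small graph by solving a Laplacian system. The sketch below is a global baseline, not one of the local algorithms from the abstract: it reads the whole graph and uses plain Gauss-Jordan elimination. The function name and the edge-list input format are assumptions made for illustration, and the graph is assumed to be unweighted and connected. It grounds vertex `t`, injects one unit of current at `s`, and returns the resulting potential at `s`.

```python
def effective_resistance(n, edges, s, t):
    """Exact effective resistance between s and t in a connected, unweighted graph."""
    if s == t:
        return 0.0
    # Build the graph Laplacian L = D - A.
    lap = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        lap[u][u] += 1.0
        lap[v][v] += 1.0
        lap[u][v] -= 1.0
        lap[v][u] -= 1.0
    # Ground t: drop its row and column, then solve L' x = e_s.
    keep = [i for i in range(n) if i != t]
    m = n - 1
    aug = [[lap[i][j] for j in keep] + [1.0 if i == s else 0.0] for i in keep]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("graph must be connected")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col and aug[r][col] != 0.0:
                factor = aug[r][col] / aug[col][col]
                for c in range(col, m + 1):
                    aug[r][c] -= factor * aug[col][c]
    potentials = [aug[i][m] / aug[i][i] for i in range(m)]
    # The potential at s (with t at potential 0) is the effective resistance.
    return potentials[keep.index(s)]
```

On a path with three vertices the two unit resistors are in series, so the end-to-end value is 2; on a triangle the direct edge is in parallel with a two-edge path, giving 2/3.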
CL Jan 25, 2021
RelWalk A Latent Variable Model Approach to Knowledge Graph Embedding
Danushka Bollegala, Huda Hakami, Yuichi Yoshida et al.
Embedding entities and relations of a knowledge graph in a low-dimensional space has shown impressive performance in predicting missing links between entities. Although progress has been made, existing methods are heuristically motivated, and theoretical understanding of such embeddings is comparatively underdeveloped. This paper extends the random walk model (Arora et al., 2016a) of word embeddings to Knowledge Graph Embeddings (KGEs) to derive a scoring function that evaluates the strength of a relation R between two entities h (head) and t (tail). Moreover, we show that marginal loss minimisation, a popular objective used in much prior work in KGE, follows naturally from the log-likelihood ratio maximisation under the probabilities estimated from the KGEs according to our theoretical relationship. We propose a learning objective motivated by the theoretical analysis to learn KGEs from a given knowledge graph. Using the derived objective, accurate KGEs are learnt from the FB15K237 and WN18RR benchmark datasets, providing empirical evidence in support of the theory.
DS Jul 15, 2020
Downsampling for Testing and Learning in Product Distributions
Nathaniel Harms, Yuichi Yoshida
We study distribution-free property testing and learning problems where the unknown probability distribution is a product distribution over $\mathbb{R}^d$. For many important classes of functions, such as intersections of halfspaces, polynomial threshold functions, convex sets, and $k$-alternating functions, the known algorithms either have complexity that depends on the support size of the distribution, or are proven to work only for specific examples of product distributions. We introduce a general method, which we call downsampling, that resolves these issues. Downsampling uses a notion of "rectilinear isoperimetry" for product distributions, which further strengthens the connection between isoperimetry, testing, and learning. Using this technique, we attain new efficient distribution-free algorithms under product distributions on $\mathbb{R}^d$: 1. A simpler proof for non-adaptive, one-sided monotonicity testing of functions $[n]^d \to \{0,1\}$, and improved sample complexity for testing monotonicity over unknown product distributions, from $O(d^7)$ [Black, Chakrabarty, & Seshadhri, SODA 2020] to $\widetilde O(d^3)$. 2. Polynomial-time agnostic learning algorithms for functions of a constant number of halfspaces, and constant-degree polynomial threshold functions. 3. An $\exp(O(d \log(dk)))$-time agnostic learning algorithm, and an $\exp(O(d \log(dk)))$-sample tolerant tester, for functions of $k$ convex sets; and a $2^{\widetilde O(d)}$ sample-based one-sided tester for convex sets. 4. An $\exp(\widetilde O(k \sqrt d))$-time agnostic learning algorithm for $k$-alternating functions, and a sample-based tolerant tester with the same complexity.
DS Jun 28, 2020
Fast and Private Submodular and $k$-Submodular Functions Maximization with Matroid Constraints
Akbar Rafiey, Yuichi Yoshida
The problem of maximizing nonnegative monotone submodular functions under a certain constraint has been intensively studied in the last decade, and a wide range of efficient approximation algorithms have been developed for this problem. Many machine learning problems, including data summarization and influence maximization, can be naturally modeled as the problem of maximizing monotone submodular functions. However, when such applications involve sensitive data about individuals, their privacy concerns should be addressed. In this paper, we study the problem of maximizing monotone submodular functions subject to matroid constraints in the framework of differential privacy. We provide a $(1-\frac{1}{\mathrm{e}})$-approximation algorithm that improves upon the previous results in terms of approximation guarantee. This is done with an almost cubic number of function evaluations in our algorithm. Moreover, we study $k$-submodularity, a natural generalization of submodularity. We give the first $\frac{1}{2}$-approximation algorithm that preserves differential privacy for maximizing monotone $k$-submodular functions subject to matroid constraints. The approximation ratio is asymptotically tight and is obtained with an almost linear number of function evaluations.
DS Jun 15, 2020
Hypergraph Clustering Based on PageRank
Yuuki Takai, Atsushi Miyauchi, Masahiro Ikeda et al.
A hypergraph is a useful combinatorial object to model ternary or higher-order relations among entities. Clustering hypergraphs is a fundamental task in network analysis. In this study, we develop two clustering algorithms based on personalized PageRank on hypergraphs. The first one is local in the sense that its goal is to find a tightly connected vertex set with a bounded volume including a specified vertex. The second one is global in the sense that its goal is to find a tightly connected vertex set. For both algorithms, we discuss theoretical guarantees on the conductance of the output vertex set. Also, we experimentally demonstrate that our clustering algorithms outperform existing methods in terms of both the solution quality and running time. To the best of our knowledge, ours are the first practical algorithms for hypergraphs with theoretical guarantees on the conductance of the output set.
DS Jun 7, 2020
Average Sensitivity of Spectral Clustering
Pan Peng, Yuichi Yoshida
Spectral clustering is one of the most popular clustering methods for finding clusters in a graph, which has found many applications in data mining. However, the input graph in those applications may have many missing edges due to error in measurement, withholding for a privacy reason, or arbitrariness in data conversion. To make reliable and efficient decisions based on spectral clustering, we assess the stability of spectral clustering against edge perturbations in the input graph using the notion of average sensitivity, which is the expected size of the symmetric difference of the output clusters before and after we randomly remove edges. We first prove that the average sensitivity of spectral clustering is proportional to $λ_2/λ_3^2$, where $λ_i$ is the $i$-th smallest eigenvalue of the (normalized) Laplacian. We also prove an analogous bound for $k$-way spectral clustering, which partitions the graph into $k$ clusters. Then, we empirically confirm our theoretical bounds by conducting experiments on synthetic and real networks. Our results suggest that spectral clustering is stable against edge perturbations when there is a cluster structure in the input graph.
LG Feb 14, 2020
Statistical Learning with Conditional Value at Risk
Tasuku Soma, Yuichi Yoshida
We propose a risk-averse statistical learning framework wherein the performance of a learning algorithm is evaluated by the conditional value-at-risk (CVaR) of losses rather than the expected loss. We devise algorithms based on stochastic gradient descent for this framework. While existing studies of CVaR optimization require direct access to the underlying distribution, our algorithms make a weaker assumption that only i.i.d.\ samples are given. For convex and Lipschitz loss functions, we show that our algorithm has $O(1/\sqrt{n})$-convergence to the optimal CVaR, where $n$ is the number of samples. For nonconvex and smooth loss functions, we show a generalization bound on CVaR. By conducting numerical experiments on various machine learning tasks, we demonstrate that our algorithms effectively minimize CVaR compared with other baseline algorithms.
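As an illustration of the objective, here is a minimal sketch of the empirical CVaR of a finite sample of losses, computed through the standard Rockafellar-Uryasev variational form. It is not the stochastic gradient algorithm from the abstract. The function name is hypothetical, and the convention that `alpha` denotes the fraction of worst-case losses being averaged is an assumption; the paper may parametrize the tail differently.

```python
def empirical_cvar(losses, alpha):
    """Empirical CVaR of a sample, with alpha the tail fraction (assumed convention).

    Uses the Rockafellar-Uryasev form
        min over tau of  tau + sum(max(l - tau, 0)) / (alpha * n).
    The objective is convex and piecewise linear in tau, so a minimiser lies at
    one of the sample values. When alpha * n is an integer, the result equals the
    mean of the alpha * n largest losses.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(losses)

    def objective(tau):
        excess = sum(max(loss - tau, 0.0) for loss in losses)
        return tau + excess / (alpha * n)

    return min(objective(tau) for tau in losses)
```

With `alpha = 1` this reduces to the ordinary mean loss, and smaller values of `alpha` put all the weight on the largest losses, which is what makes the criterion risk-averse.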
DS Feb 13, 2020
Approximability of Monotone Submodular Function Maximization under Cardinality and Matroid Constraints in the Streaming Model
Chien-Chung Huang, Naonori Kakimura, Simon Mauras et al.
Maximizing a monotone submodular function under various constraints is a classical and intensively studied problem. However, in the single-pass streaming model, where the elements arrive one by one and an algorithm can store only a small fraction of input elements, there are large gaps in our knowledge, even though several approximation algorithms have been proposed in the literature. In this work, we present the first lower bound on the approximation ratios for cardinality and matroid constraints that beat $1-\frac{1}{e}$ in the single-pass streaming model. Let $n$ be the number of elements in the stream. Then, we prove that any (randomized) streaming algorithm for a cardinality constraint with approximation ratio $\frac{2}{2+\sqrt{2}}+\varepsilon$ requires $Ω\left(\frac{n}{K^2}\right)$ space for any $\varepsilon>0$, where $K$ is the size limit of the output set. We also prove that any (randomized) streaming algorithm for a (partition) matroid constraint with approximation ratio $\frac{K}{2K-1}+\varepsilon$ requires $Ω\left(\frac{n}{K}\right)$ space for any $\varepsilon>0$, where $K$ is the rank of the given matroid. In addition, we give streaming algorithms when we only have a weak oracle with which we can only evaluate function values on feasible sets. Specifically, we show weak-oracle streaming algorithms for cardinality and matroid constraints with approximation ratios $\frac{K}{2K-1}$ and $\frac{1}{2}$, respectively, whose space complexity is exponential in $K$ but is independent of $n$. The former one exactly matches the known inapproximability result for a cardinality constraint in the weak oracle model. The latter one almost matches our lower bound of $\frac{K}{2K-1}$ for a matroid constraint, which almost settles the approximation ratio for a matroid constraint that can be obtained by a streaming algorithm whose space complexity is independent of $n$.
ML Jan 28, 2019
On Random Subsampling of Gaussian Process Regression: A Graphon-Based Analysis
Kohei Hayashi, Masaaki Imaizumi, Yuichi Yoshida
In this paper, we study random subsampling of Gaussian process regression, one of the simplest approximation baselines, from a theoretical perspective. Although subsampling discards a large part of the training data, we show provable guarantees on the accuracy of the predictive mean/variance and its generalization ability. For analysis, we consider embedding kernel matrices into graphons, which encapsulates the difference of the sample size and enables us to evaluate the approximation and generalization errors in a unified manner. The experimental results show that the subsampling approximation achieves a better trade-off regarding accuracy and runtime than the Nyström and random Fourier expansion methods.
CV Sep 13, 2018
Canonical and Compact Point Cloud Representation for Shape Classification
Kent Fujiwara, Ikuro Sato, Mitsuru Ambai et al.
We present a novel compact point cloud representation that is inherently invariant to scale, coordinate change and point permutation. The key idea is to parametrize a distance field around an individual shape into a unique, canonical, and compact vector in an unsupervised manner. We first project a distance field to a $4$D canonical space using singular value decomposition. We then train a neural network for each instance to non-linearly embed its distance field into network parameters. We employ a bias-free Extreme Learning Machine (ELM) with ReLU activation units, which has a scale-factor commutative property between layers. We demonstrate the descriptiveness of the instance-wise, shape-embedded network parameters by using them to classify shapes in $3$D datasets. Our learning-based representation requires minimal augmentation and simple neural networks, whereas previous approaches demand numerous representations to handle coordinate change and point permutation.
ML Feb 26, 2018
Guaranteed Sufficient Decrease for Stochastic Variance Reduced Gradient Optimization
Fanhua Shang, Yuanyuan Liu, Kaiwen Zhou et al.
In this paper, we propose a novel sufficient decrease technique for stochastic variance reduced gradient descent methods such as SVRG and SAGA. In order to make sufficient decrease for stochastic optimization, we design a new sufficient decrease criterion, which yields sufficient decrease versions of stochastic variance reduction algorithms such as SVRG-SD and SAGA-SD as a byproduct. We introduce a coefficient to scale current iterate and to satisfy the sufficient decrease property, which takes the decisions to shrink, expand or even move in the opposite direction, and then give two specific update rules of the coefficient for Lasso and ridge regression. Moreover, we analyze the convergence properties of our algorithms for strongly convex problems, which show that our algorithms attain linear convergence rates. We also provide the convergence guarantees of our algorithms for non-strongly convex problems. Our experimental results further verify that our algorithms achieve significantly better performance than their counterparts.
LG Feb 16, 2018
Spectral Normalization for Generative Adversarial Networks
Takeru Miyato, Toshiki Kataoka, Masanori Koyama et al.
One of the challenges in the study of generative adversarial networks is the instability of their training. In this paper, we propose a novel weight normalization technique called spectral normalization to stabilize the training of the discriminator. Our new normalization technique is computationally light and easy to incorporate into existing implementations. We tested the efficacy of spectral normalization on the CIFAR10, STL-10, and ILSVRC2012 datasets, and we experimentally confirmed that spectrally normalized GANs (SN-GANs) are capable of generating images of better or equal quality relative to the previous training stabilization techniques.
CL Sep 5, 2017
Using $k$-way Co-occurrences for Learning Word Embeddings
Danushka Bollegala, Yuichi Yoshida, Ken-ichi Kawarabayashi
Co-occurrences between two words provide useful insights into the semantics of those words. Consequently, numerous prior works on word embedding learning have used co-occurrences between two words as the training signal for learning word embeddings. However, in natural language texts it is common for multiple words to be related and to co-occur in the same context. We extend the notion of co-occurrences to cover $k(\geq\!\!2)$-way co-occurrences among a set of $k$ words. Specifically, we prove a theoretical relationship between the joint probability of $k(\geq\!\!2)$ words, and the sum of $\ell_2$ norms of their embeddings. Next, we propose a learning objective motivated by our theoretical result that utilises $k$-way co-occurrences for learning word embeddings. Our experimental results show that the derived theoretical relationship does indeed hold empirically, and despite data sparsity, for some smaller $k$ values, $k$-way embeddings perform comparably or better than $2$-way embeddings in a range of tasks.
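A minimal sketch of how $k$-way co-occurrence counts could be gathered from a tokenized corpus is shown below. The function name is hypothetical, and two choices are assumptions made for illustration rather than details taken from the abstract: the context is a whole sentence, and each unordered set of $k$ distinct words is counted once per sentence.

```python
from collections import Counter
from itertools import combinations


def k_way_cooccurrences(sentences, k):
    """Count unordered sets of k distinct words that appear in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Sort the distinct words so each set has one canonical tuple key.
        vocabulary = sorted(set(sentence))
        for group in combinations(vocabulary, k):
            counts[group] += 1
    return counts


# Hypothetical toy corpus of tokenized sentences.
toy_corpus = [["a", "b", "c"], ["a", "b"]]
pair_counts = k_way_cooccurrences(toy_corpus, 2)
triple_counts = k_way_cooccurrences(toy_corpus, 3)
```

Setting `k = 2` recovers ordinary pairwise co-occurrence counts. The number of candidate sets per sentence grows combinatorially with `k`, which is one way to see the data sparsity mentioned in the abstract.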
MLMay 31, 2017
Spectral Norm Regularization for Improving the Generalizability of Deep LearningYuichi Yoshida, Takeru Miyato
We investigate the generalizability of deep learning based on the sensitivity to input perturbation. We hypothesize that the high sensitivity to the perturbation of data degrades the performance on it. To reduce the sensitivity to perturbation, we propose a simple and effective regularization method, referred to as spectral norm regularization, which penalizes the high spectral norm of weight matrices in neural networks. We provide supportive evidence for the abovementioned hypothesis by experimentally confirming that the models trained using spectral norm regularization exhibit better generalizability than other baseline methods.
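A penalty on the spectral norms of the weight matrices can be sketched as below. The squared form with a factor of one half and the coefficient name `lam` are assumptions made for illustration; the abstract only states that high spectral norms are penalized. The exact norm is computed with `np.linalg.norm(W, 2)` here for clarity, whereas a practical implementation would more likely use a cheap approximation.

```python
import numpy as np

def spectral_norm_penalty(weights, lam):
    """Return (lam / 2) * sum of squared spectral norms of the given matrices.

    Assumption: a squared penalty; `weights` is a list of 2-D arrays.
    """
    return 0.5 * lam * sum(np.linalg.norm(W, 2) ** 2 for W in weights)
```

The returned value would be added to the training loss, so that gradient steps trade off data fit against the sensitivity of each layer to input perturbations.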
LGMar 20, 2017
Guaranteed Sufficient Decrease for Variance Reduced Stochastic Gradient DescentFanhua Shang, Yuanyuan Liu, James Cheng et al.
In this paper, we propose a novel sufficient decrease technique for variance reduced stochastic gradient descent methods such as SAG, SVRG and SAGA. To achieve sufficient decrease in stochastic optimization, we design a new sufficient decrease criterion, which yields sufficient decrease versions of variance reduction algorithms such as SVRG-SD and SAGA-SD as a byproduct. We introduce a coefficient that scales the current iterate so as to satisfy the sufficient decrease property; this coefficient decides whether to shrink, expand, or move in the opposite direction. We then give two specific update rules of the coefficient for Lasso and ridge regression. Moreover, we analyze the convergence properties of our algorithms for strongly convex problems and show that both of our algorithms attain linear convergence rates. We also provide convergence guarantees for non-strongly convex problems. Our experimental results further verify that our algorithms achieve significantly better performance than their counterparts.
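The role of a scaling coefficient can be illustrated on ridge regression. The sketch below is a simplified stand-in, not the paper's criterion: it assumes the objective $F(x) = \frac{1}{2n}\|Ax - y\|^2 + \frac{\lambda}{2}\|x\|^2$ and picks the scalar $\theta$ that minimizes $F(\theta x)$ in closed form, omitting any additional terms the actual sufficient decrease criterion may contain. The names `ridge_objective` and `best_scale` are hypothetical.

```python
import numpy as np

def ridge_objective(A, y, x, lam):
    """F(x) = ||Ax - y||^2 / (2n) + lam * ||x||^2 / 2 (assumed form)."""
    n = A.shape[0]
    r = A @ x - y
    return 0.5 / n * (r @ r) + 0.5 * lam * (x @ x)

def best_scale(A, y, x, lam):
    """Scalar theta minimizing F(theta * x) for the ridge objective above.

    Setting dF(theta * x)/dtheta = 0 gives
    theta = <Ax, y> / (||Ax||^2 + n * lam * ||x||^2).
    """
    n = A.shape[0]
    Ax = A @ x
    denom = Ax @ Ax + n * lam * (x @ x)
    if denom == 0.0:
        return 1.0  # x is zero (or in the null space with lam = 0): leave it unscaled
    return float(Ax @ y) / denom
```

A value of $\theta$ between 0 and 1 shrinks the iterate, a value above 1 expands it, and a negative value moves it in the opposite direction, which matches the three decisions described in the abstract.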
LGAug 25, 2016
Minimizing Quadratic Functions in Constant TimeKohei Hayashi, Yuichi Yoshida
A sampling-based optimization method for quadratic functions is proposed. Our method approximately solves the following $n$-dimensional quadratic minimization problem in constant time, which is independent of $n$: $z^*=\min_{\mathbf{v} \in \mathbb{R}^n}\langle\mathbf{v}, A \mathbf{v}\rangle + n\langle\mathbf{v}, \mathrm{diag}(\mathbf{d})\mathbf{v}\rangle + n\langle\mathbf{b}, \mathbf{v}\rangle$, where $A \in \mathbb{R}^{n \times n}$ is a matrix and $\mathbf{d},\mathbf{b} \in \mathbb{R}^n$ are vectors. Our theoretical analysis specifies the number of samples $k(\delta, \epsilon)$ such that the approximated solution $z$ satisfies $|z - z^*| = O(\epsilon n^2)$ with probability $1-\delta$. The empirical performance (accuracy and runtime) is positively confirmed by numerical experiments.
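The sampling idea can be sketched as follows: restrict the problem to a small set of sampled indices, solve the resulting small problem, and rescale its optimum. Several details here are assumptions for illustration and are not stated in the abstract: the rescaling factor $(n/k)^2$, the replacement of $n$ by $k$ inside the subproblem, and the restriction to the convex case (symmetric positive semidefinite $A$ and positive $\mathbf{d}$) so that the minimizer comes from a linear solve. The names `quad_min` and `sampled_quad_min` are hypothetical.

```python
import numpy as np

def quad_min(A, d, b, scale):
    """Minimum of <v, A v> + scale * <v, diag(d) v> + scale * <b, v>.

    Assumes the Hessian A + A^T + 2 * scale * diag(d) is positive definite.
    """
    H = A + A.T + 2.0 * scale * np.diag(d)
    v = np.linalg.solve(H, -scale * b)  # stationary point of the quadratic
    return float(v @ A @ v + scale * (v @ (d * v)) + scale * (b @ v))

def sampled_quad_min(A, d, b, idx):
    """Estimate z* from the subproblem induced by the index array `idx`.

    Assumption: the subproblem optimum is rescaled by (n / k)^2.
    """
    n, k = len(b), len(idx)
    A_s = A[np.ix_(idx, idx)]  # k x k submatrix on the sampled indices
    return (n / k) ** 2 * quad_min(A_s, d[idx], b[idx], k)
```

A caller would draw `idx` uniformly at random, for example with `np.random.default_rng().integers(0, n, size=k)`; the cost of the solve then depends only on `k`, not on `n`.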
CLDec 7, 2014
Learning Word Representations from Relational GraphsDanushka Bollegala, Takanori Maehara, Yuichi Yoshida et al.
Attributes of words and relations between two words are central to numerous tasks in Artificial Intelligence such as knowledge representation, similarity measurement, and analogy detection. Often when two words share one or more attributes in common, they are connected by some semantic relations. On the other hand, if there are numerous semantic relations between two words, we can expect some of the attributes of one of the words to be inherited by the other. Motivated by this close connection between attributes and relations, given a relational graph in which words are interconnected via numerous semantic relations, we propose a method to learn a latent representation for the individual words. The proposed method considers not only the co-occurrences of words, as done by existing approaches for word representation learning, but also the semantic relations in which two words co-occur. To evaluate the accuracy of the word representations learnt using the proposed method, we use the learnt word representations to solve semantic word analogy problems. Our experimental results show that it is possible to learn better word representations by using semantic relations between words.