CVOct 5, 2023Code
Kandinsky: an Improved Text-to-Image Synthesis with Image Prior and Latent DiffusionAnton Razzhigaev, Arseniy Shakhmatov, Anastasia Maltseva et al.
Text-to-image generation is a significant domain in modern computer vision and has achieved substantial improvements through the evolution of generative architectures. Among these, there are diffusion-based models that have demonstrated essential quality enhancements. These models are generally split into two categories: pixel-level and latent-level approaches. We present Kandinsky1, a novel exploration of latent diffusion architecture, combining the principles of the image prior models with latent diffusion techniques. The image prior model is trained separately to map text embeddings to image embeddings of CLIP. Another distinct feature of the proposed model is the modified MoVQ implementation, which serves as the image autoencoder component. Overall, the designed model contains 3.3B parameters. We also deployed a user-friendly demo system that supports diverse generative modes such as text-to-image generation, image fusion, text and image fusion, image variations generation, and text-guided inpainting/outpainting. Additionally, we released the source code and checkpoints for the Kandinsky models. Experimental evaluations demonstrate a FID score of 8.03 on the COCO-30K dataset, marking our model as the top open-source performer in terms of measurable image generation quality.
16.9OCMay 29
Wall-Clock Complexity for Zeroth-Order Optimization with Tunable Oracle FidelityAlexandra Suvorikova, Igor Pavlov, Artem Vasin et al.
Zeroth-order (black-box) optimization is applied when gradients are unavailable and objective evaluations rely on expensive simulations. In many such applications, the oracle fidelity is tunable: higher-accuracy queries reduce noise but incur higher computational costs. To capture this trade-off, we study an accuracy-aware wall-clock model where each query with fidelity $δ$ has a cost $c(δ)$, and we minimize the total time $T_{\mathrm{total}} = \sum_{k=1}^{N} c(δ_k)$, subject to a target accuracy constraint. We show how the choice of oracle type, noise model, and optimization scheme induces explicit wall-clock-optimal choices for the algorithmic parameters. For instance, we demonstrate that accelerated methods can be wall-clock inferior to non-accelerated schemes. Furthermore, we characterize the conditions under which a constant fidelity strategy is optimal in the Big-O sense. Our framework provides a unified methodology to translate convergence guarantees into practical fidelity and batching recommendations.
CVOct 28, 2024Code
Kandinsky 3: Text-to-Image Synthesis for Multifunctional Generative FrameworkVladimir Arkhipkin, Viacheslav Vasilev, Andrei Filatov et al.
Text-to-image (T2I) diffusion models are popular for introducing image manipulation methods, such as editing, image fusion, inpainting, etc. At the same time, image-to-video (I2V) and text-to-video (T2V) models are also built on top of T2I models. We present Kandinsky 3, a novel T2I model based on latent diffusion, achieving a high level of quality and photorealism. The key feature of the new architecture is the simplicity and efficiency of its adaptation for many types of generation tasks. We extend the base T2I model for various applications and create a multifunctional generation system that includes text-guided inpainting/outpainting, image fusion, text-image fusion, image variations generation, I2V and T2V generation. We also present a distilled version of the T2I model, evaluating inference in 4 steps of the reverse process without reducing image quality and 3 times faster than the base model. We deployed a user-friendly demo system in which all the features can be tested in the public domain. Additionally, we released the source code and checkpoints for the Kandinsky 3 and extended models. Human evaluations show that Kandinsky 3 demonstrates one of the highest quality scores among open source generation systems.
CVDec 6, 2023Code
Kandinsky 3.0 Technical ReportVladimir Arkhipkin, Andrei Filatov, Viacheslav Vasilev et al.
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
CVFeb 22, 2022
RuCLIP -- new models and experiments: a technical reportAlex Shonenkov, Andrey Kuznetsov, Denis Dimitrov et al.
In the report we propose six new implementations of ruCLIP model trained on our 240M pairs. The accuracy results are compared with original CLIP model with Ru-En translation (OPUS-MT) on 16 datasets from different domains. Our best implementations outperform CLIP + OPUS-MT solution on most of the datasets in few-show and zero-shot tasks. In the report we briefly describe the implementations and concentrate on the conducted experiments. Inference execution time comparison is also presented in the report.
CRJul 2, 2020
Gravity: a blockchain-agnostic cross-chain communication and data oracles protocolAleksei Pupyshev, Dmitry Gubanov, Elshan Dzhafarov et al.
This paper intends to propose the architecture of a blockchain-agnostic protocol designed for communication of blockchains amongst each other (i.e. cross-chain), and for blockchains with the outside world (i.e. data oracles). The expansive growth of cutting-edge technology in the blockchain industry outlines the need and opportunity for addressing oracle consensus in a manner both technologically and economically efficient as well as futureproof. Blockchain-agnosticism is inherently limited if proposing a technological solution involves adding one more architectural layer. As such, Gravity protocol is designed to be a truly blockchain-agnostic protocol. By ensuring parity through direct integration and by leveraging the stability and security of the respective interconnected ecosystems, Gravity circumvents the need for a dedicated, public blockchain and a native token. Ultimately, Gravity protocol intends to address scalability challenges by providing a solid infrastructure for the creation of gateways, cross-chain applications, and sidechains. This paper introduces and defines the concept of Oracle Consensus and its implementation in the Gravity protocol named the Pulse Consensus algorithm. The proposed consensus architecture allows Gravity to be considered a singular decentralized blockchain-agnostic oracle.