36.8CVMay 24
Injecting Image Guidance into Text-Conditioned Diffusion Models at InferenceAgata Żywot, Iason Skylitsis, Thijmen Nijdam et al.
Text-to-image diffusion models like Stable Diffusion generate high-quality images from text, but lack a way to inject visual guidance (e.g. sketches, styles) at inference without retraining. Existing methods either require computationally expensive fine-tuning or rely on style transfer techniques that risk semantic misalignment with textual prompts. We introduce Visual Concept Fusion (VCF), the first method offering dual conditioning on both an image and text prompt at inference time without any concept-specific training. VCF enables visual concept injection into Stable Diffusion by aligning CLIP image features with the text embedding space. VCF consists of three components: (1) a lightweight aligner that maps image tokens to the text embedding manifold using InfoNCE and cross-attention reconstruction losses, (2) a fusion strategy that preserves both textual and visual semantics, and (3) an optional Prompt-Noise Optimization (PNO) module for test-time refinement. Our experiments demonstrate that VCF successfully transfers visual attributes including style, composition, and color palette from reference images while maintaining prompt adherence. Quantitative results show a trade-off between text alignment (CLIP score) and visual correspondence (LPIPS), with VCF outperforming baselines in reference fidelity.
88.7MAApr 13
Can Small Agents Collaborate to Beat a Single Large Language Model?Agata Żywot, Xinyi Chen, Yifei Yuan et al.
Recent progress in language modeling has largely relied on scaling model size, yet larger models do not reliably improve performance on tasks requiring multi-step reasoning and tool use. Multi-agent collaboration offers a potential alternative, raising a key question: can well-organized systems built from smaller models outperform much larger language models? We address this question using a minimally designed multi-agent system with a single orchestrator and a small set of specialized sub-agents with restricted communication. On tool-intensive benchmarks spanning factual retrieval, multi-hop reasoning, scientific question answering, and mathematical problem solving, we conduct controlled comparisons between small multi-agent systems and large single-agent models. We find that small multi-agent systems can outperform substantially larger single-agent models, even when the latter have direct access to tools. Reasoning at the orchestrator yields the largest gains, while enabling reasoning in sub-agents provides limited or negative benefits. Overall system performance is driven primarily by orchestrator capacity rather than sub-agent capacity. These results suggest that improved agentic performance depends more on architectural orchestration than on raw model scaling.
LGJun 27, 2024
WineGraph: A Graph Representation For Food-Wine PairingZuzanna Gawrysiak, Agata Żywot, Agnieszka Ławrynowicz
We present WineGraph, an extended version of FlavorGraph, a heterogeneous graph incorporating wine data into its structure. This integration enables food-wine pairing based on taste and sommelier-defined rules. Leveraging a food dataset comprising 500,000 reviews and a wine reviews dataset with over 130,000 entries, we computed taste descriptors for both food and wine. This information was then utilised to pair food items with wine and augment FlavorGraph with additional data. The results demonstrate the potential of heterogeneous graphs to acquire supplementary information, proving beneficial for wine pairing.