Yoshiyuki Ootani

CR
4papers
Novelty41%
AI Score43

4 Papers

CVJun 4
Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

Yoshiyuki Ootani

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

CRApr 5
Context-Binding Gaps in Stateful Zero-Knowledge Proximity Proofs: Taxonomy, Separation, and Mitigation

Yoshiyuki Ootani

A zero-knowledge proximity proof certifies geometric nearness but carries no commitment to an application context. In stateful geo-content systems, where drops can share coordinates, policies evolve, and content has persistent identity, this gap can permit proof transfer between application objects unless extra operational invariants are maintained. We present a systems-security analysis of this deployment problem: a taxonomy of context-binding vulnerabilities, a formal off-circuit verification model for a transcript-adversary that holds a recorded proof but cannot obtain fresh coordinates, an assumption comparison across five binding strategy classes, and a concrete instantiation, Zairn-ZKP, that embeds drop identity, policy version, and session context as public circuit inputs. Compared with a strong off-circuit alternative based on stored-digest server checking, in-proof binding reduces operational invariants from four to two and adds no measurable proving cost relative to the sound geo-only baseline (-0.12 ms median in our setup). It also removes a correctness pitfall we identify empirically: a plausible off-circuit implementation that omits one server-side check remains vulnerable to cross-drop transfer. Measurements across six network conditions, seven venues in four countries, and an epoch-window simulation indicate that same-epoch transfer is realistic in dense urban deployments unless per-request nonces are maintained. Across five platforms and seven binding strategies, the results support a deployable methodology for reducing assumption surfaces in stateful ZK-backed verification workflows.

CRApr 5
Search-Bound Proximity Proofs: Binding Encrypted Geographic Search to Zero-Knowledge Verification

Yoshiyuki Ootani

Location-based systems that combine encrypted geographic search with zero-knowledge proximity proofs typically treat the two phases as independent. Under an honest-but-curious server, this leaves an authorization provenance gap: once session state is purged, no forensic procedure can attribute a proof to its originating search session, because the proof's public inputs encode no session-identifying information. We formalize this gap as the search-authorized proof (SAP) security notion and show via a concrete audit re-association attack that proof-external mechanisms, where authorization evidence remains outside the proof, cannot prevent forensic misattribution when the same drop parameters recur across sessions. Search-Bound Proximity Proofs (SBPP) realize the SAP requirements without modifying the ZKP circuit: session nonce, Merkle-root result-set commitment, and signed receipt are decomposed into independently auditable components, enabling property-level fault isolation in offline audit. Experiments on synthetic and real-world data (110,776 OpenStreetMap POIs) show sub-millisecond absolute overhead on a 125 ms Groth16 baseline.

CRApr 4
Graduated Trust Gating for IoT Location Verification: Trading Off Detection and Proof Escalation

Yoshiyuki Ootani

IoT location services accept client-reported GPS coordinates at face value, yet spoofing is trivial with consumer-grade tools. Existing spoofing detectors output a binary decision, forcing system designers to choose between high false-deny and high false-accept rates. We propose a graduated trust gate that computes a multi-signal integrity score and maps it to three actions: PROCEED, STEP-UP, or DENY, where STEP-UP invokes a stronger verifier such as a zero-knowledge proximity proof. A session-latch mechanism ensures that a single suspicious fix blocks the entire session, preventing post-transition score recovery. Under an idealized step-up oracle on 10,000 synthetic traces, the gate enables strict thresholds (theta_p = 0.9) that a binary gate cannot safely use: at matched false-accept rate (11%), the graduated gate maintains zero false-deny rate versus 0.05% for binary, with 5 microseconds scoring overhead. Real-device traces from an Android smartphone demonstrate the session-latch mechanism and show that a nearby mock location (~550 m) evades theta_p = 0.7 but is routed to step-up at theta_p = 0.9. Signal ablation identifies a minimal two-signal configuration (F1 = 0.84) suitable for resource-constrained scoring layers.