Amirhossein Yousefiramandi

CL
h-index: 7
6 papers
5 citations
Novelty: 44%
AI Score: 50

6 Papers

AI · May 22 · Code
When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

Amirhossein Yousefiramandi, Ciaran Cooney

We study when LLM-generated synthetic data helps low-resource multi-label patent classification, separating true synthetic value from the confound that larger augmented sets can win by volume alone. Across six open-source LLMs (3.8-12B), four real-data regimes, 64 WIPO assistive-technology labels, two generation strategies, and three classifier families, the headline BERT-for-Patents micro-F1 jump from 0.120 to 0.702 is largely volume-driven. A duplicate-to-match real-only control that resamples 165 patents to the augmented size reaches 0.678; the controlled synthetic gain is only +0.024 over this control, but +0.219 over focal-loss reweighting, the strongest non-augmentation baseline. The main finding is that fidelity metrics change meaning with scale: at extreme scarcity, MMD correlates positively with classification gain (r=+0.95), but at 1:10 the relation flips (r=-0.73; Fisher z=+6.47, p<0.001). Fixed-budget mixing finds a 20-30% real / 70-80% synthetic optimum; paraphrase scaling collapses from a 165-document seed; and shuffled mixing beats curriculum ordering, ensembling, and classifier-based filtering. Leakage controls -- label-name masking, instruction-level label removal, fine-grained evaluation, and keyword-overlap audits -- argue against label-string dependence as the main driver for BERT-for-Patents. The apparent ModernBERT collapse under label removal is traced to a Flash-Attention-2 + bf16 numerical artifact, recovering 65% of lost performance with fp32 eager attention. Finally, the same corpus that improves classification by up to +0.58 raw micro-F1 hurts a Jaccard-label-overlap retrieval proxy; even a standard-patent-only filter leaves a 26% nDCG@10 drop. Thus, synthetic patent text is task- and metric-specific, not reducible to prompt genre alone.
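The duplicate-to-match control mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `duplicate_to_match`, the seed handling, and the target size of 1650 are hypothetical (the abstract gives only the 165-patent seed set and the resulting scores).

```python
import random

def duplicate_to_match(real_docs, target_size, seed=0):
    """Grow a real-only training set to target_size by duplicating real documents.

    Every original document is kept once; the remainder is filled by
    sampling with replacement, so no synthetic text is involved.
    """
    if target_size < len(real_docs):
        raise ValueError("target_size must be at least len(real_docs)")
    rng = random.Random(seed)
    extra = [rng.choice(real_docs) for _ in range(target_size - len(real_docs))]
    resampled = list(real_docs) + extra
    rng.shuffle(resampled)
    return resampled

# 165 real patents, as in the abstract; 1650 is an assumed augmented size.
real = [f"patent_{i}" for i in range(165)]
control = duplicate_to_match(real, target_size=1650)
print(len(control), len(set(control)))  # 1650 165

# Controlled synthetic gain reported in the abstract: augmented minus control.
controlled_gain = round(0.702 - 0.678, 3)
print(controlled_gain)  # 0.024
```

Training on `control` rather than on the raw 165 documents is what separates the effect of synthetic content from the effect of a larger training set.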

IR · May 22
Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

Amirhossein Yousefiramandi, Ciaran Cooney

Which fine-tuning signals improve patent embedding models, and do gains transfer across patent landscapes? We benchmark 22 embedding models, from 22M-parameter encoders to 12B instruction-tuned LLMs, on retrieval, classification, and clustering. The study uses 113,148 WIPO assistive-technology patents, 46,069 citation-graph retrieval queries, and the public DAPFAM dataset for external validation. Our framework covers citation-based retrieval, hybrid sparse-dense fusion, multi-label classification over five datasets, unsupervised clustering, six text-section views, domain-adaptive fine-tuning of four models, jurisdiction analysis, and proprietary DWPI (Derwent World Patents Index, Clarivate) expert-written content. Results show that fine-tuning is task-dependent: single-landscape tuning can improve in-domain scores but often hurts retrieval on an external landscape, challenging the assumption that more domain data always helps. Within model families, scale usually predicts performance (Qwen3 0.6B to 4B to 8B; Llama-Nemotron 1B to 8B), but cross-family scaling is noisy: the 12B KaLM-Gemma3 ranks 8th on TAC retrieval, while Qwen3-0.6B leads ARI clustering. Title+Abstract+Claims is the most reliable text representation. Multi-view abstract-claim alignment improves retrieval by up to 7.1 percent nDCG@10, while combined fine-tuning gives the strongest classification gains (+7.1 F1). All models drop by 55-65 percent on out-of-domain queries, and hybrid sparse-dense fusion does not close this gap. BM25-dense interpolation gives modest nDCG@10 gains (+0.002 to +0.015), with larger benefits for weaker zero-shot dense models. Code and evaluation framework are publicly available.
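A BM25-dense interpolation of the kind the abstract refers to might look like the sketch below. The abstract does not say how scores are normalized or weighted, so the min-max normalization, the weight `alpha`, and the toy scores are all assumptions made for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def interpolate(bm25_scores, dense_scores, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * BM25 (normalized scores).

    Documents missing from one retriever contribute 0 for that retriever.
    Ties are broken by document id so the ranking is deterministic.
    """
    sparse, dense = min_max(bm25_scores), min_max(dense_scores)
    fused = {
        doc: alpha * dense.get(doc, 0.0) + (1 - alpha) * sparse.get(doc, 0.0)
        for doc in set(sparse) | set(dense)
    }
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Toy scores (hypothetical): raw BM25 scores and cosine similarities.
bm25 = {"p1": 12.0, "p2": 7.5, "p3": 3.0}
dense = {"p1": 0.62, "p2": 0.81, "p4": 0.55}
ranking = interpolate(bm25, dense, alpha=0.5)
print(ranking)  # ['p2', 'p1', 'p3', 'p4']
```

Normalizing before mixing matters because raw BM25 scores and cosine similarities live on different scales; without it, one retriever would dominate regardless of `alpha`.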

IT · May 13
Learning Selective Merge Policies for Deadline-Constrained Coded Caching via Deep Reinforcement Learning

Amirhossein Yousefiramandi

With coded caching, the server can use the content that users have cached to serve multiple users at once by sending a single coded multi-casting message, i.e., the merged message, thereby relieving peak network loads. However, for delay-sensitive applications such as video streaming, it becomes essential to choose which messages to merge online, given the strict deadline of each request. The difficulty is that a merge that helps form the current coded multi-casting message can harm subsequent ones. We propose a DRL-based solution that formulates deadline-constrained coded delivery as a masked discrete-action queue-state control problem and trains a graph-attention policy network via proximal policy optimization. The policy network reduces the broadcast-packet expiration ratio $ρ$ by $40.9\%$ ($0.208$ vs. $0.352$) relative to the best coded multi-casting baseline (SACM++) on the uniform-demand benchmark, while also attaining the best broadcast-efficiency score $σ$ across the Track A battery among the coded multi-casting methods. Notably, under tight deadlines selective merging outperforms aggressive merging: the policy network learns to merge at a rate of only $\approx 31.8\%$, and the same observation holds across variations within the same simulator family.
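The phrase "masked discrete-action" in the abstract refers to restricting a policy to currently valid actions. One common way to do this, shown here as a generic sketch rather than the paper's implementation, is to set the logits of invalid actions to negative infinity before the softmax. The logits, the validity mask, and the function name are hypothetical. The last lines re-derive the reported 40.9% relative reduction from the two expiration ratios.

```python
import math

def masked_softmax(logits, valid):
    """Softmax over logits where invalid actions get probability exactly 0."""
    masked = [l if ok else -math.inf for l, ok in zip(logits, valid)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("at least one action must be valid")
    exps = [math.exp(l - top) for l in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical policy logits for four candidate merge actions;
# the third action is masked out, e.g. because it would violate a deadline.
logits = [1.2, 0.3, 2.5, -0.7]
valid = [True, True, False, True]
probs = masked_softmax(logits, valid)
print(probs[2])  # 0.0

# Relative reduction in expiration ratio reported in the abstract.
baseline, learned = 0.352, 0.208
reduction_pct = round((baseline - learned) / baseline * 100, 1)
print(reduction_pct)  # 40.9
```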

LG · May 7
Cumulative-Goodness Free-Riding in Forward-Forward Networks: Real, Repairable, but Not Accuracy-Dominant

Amirhossein Yousefiramandi

Forward-Forward (FF) training allows each layer to learn from a local goodness criterion. In cumulative-goodness variants, however, later layers can inherit a task that earlier layers have already partially separated. We formalize this phenomenon as layer free-riding: under the softplus FF criterion, the class-discrimination gradient reaching block $d$ decays exponentially with the positive margin accumulated by preceding blocks. We then study three local remedies -- per-block, hardness-gated, and depth-scaled -- that recover current-layer separation measures without relying on backpropagated gradients. On CIFAR-10 and CIFAR-100, these remedies dramatically improve layer-separation statistics, with $4\times$--$45\times$ gains in deeper layers, while changing accuracy by less than one percentage point for non-degenerate training procedures. Tiny ImageNet provides a tougher cross-dataset check for our selected block-wise configuration and reveals the same qualitative gap between layer-health diagnostics and final accuracy. Calibration experiments further show that architecture and augmentation choices have a larger effect on final accuracy than the training-rule modifications studied here. Cumulative free-riding is therefore a real and repairable optimization pathology. Nonetheless, for the FF training rules, architectures, and datasets we study, it is not the dominant factor limiting achievable accuracy.
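The exponential decay claimed in the abstract can be illustrated numerically. The sketch assumes a per-sample loss of the form softplus(-m), where m is the positive margin already accumulated by preceding blocks; the abstract does not spell out the exact loss, so this form and the function names are assumptions. Under that assumption the gradient magnitude with respect to m is sigmoid(-m) = 1 / (1 + e^m), which is bounded above by e^(-m).

```python
import math

def softplus(x):
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def grad_magnitude(margin):
    """|d/dm softplus(-m)| = sigmoid(-m) = 1 / (1 + exp(m))."""
    return 1.0 / (1.0 + math.exp(margin))

# Gradient signal left for a later block as the inherited margin grows.
for m in [0.0, 2.0, 4.0, 8.0]:
    print(f"margin={m:4.1f}  loss={softplus(-m):.6f}  "
          f"|grad|={grad_magnitude(m):.6f}  bound e^-m={math.exp(-m):.6f}")
```

A block that inherits a margin of 8 sees a gradient roughly a thousand times smaller than one that starts from a margin of 0, which is the free-riding effect the abstract describes.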

CL · Dec 14, 2025
Fine-Tuning Causal LLMs for Text Classification: Embedding-Based vs. Instruction-Based Approaches

Amirhossein Yousefiramandi, Ciaran Cooney

We explore efficient strategies to fine-tune decoder-only Large Language Models (LLMs) for downstream text classification under resource constraints. Two approaches are investigated: (1) attaching a classification head to a pre-trained causal LLM and fine-tuning on the task (using the LLM's final token embedding as a sequence representation), and (2) instruction-tuning the LLM in a prompt->response format for classification. To enable single-GPU fine-tuning of models up to 8B parameters, we combine 4-bit model quantization with Low-Rank Adaptation (LoRA) for parameter-efficient training. Experiments on two datasets (a proprietary single-label dataset and the public WIPO-Alpha patent dataset for extreme multi-label classification) show that the embedding-based method significantly outperforms the instruction-tuned method in F1-score, and is competitive with, and even surpasses, fine-tuned domain-specific models (e.g. BERT) on the same tasks. These results demonstrate that directly leveraging the internal representations of causal LLMs, along with efficient fine-tuning techniques, yields strong classification performance under limited computational resources. We discuss the advantages of each approach while outlining practical guidelines and future directions for optimizing LLM fine-tuning in classification scenarios.
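Approach (1) can be sketched in plain Python without any model library. The sketch assumes right-padded inputs, so the final real token is the last position whose attention-mask value is 1. The toy hidden states, the weights, and the function names are hypothetical; in practice the hidden states would come from the LLM's last layer.

```python
def last_token_embedding(hidden_states, attention_mask):
    """Return the hidden state of the last non-padding token.

    hidden_states: list of per-token vectors, shape [seq_len][dim]
    attention_mask: list of 0/1 flags, 1 for real tokens (right padding assumed)
    """
    last = max(i for i, flag in enumerate(attention_mask) if flag == 1)
    return hidden_states[last]

def linear_head(embedding, weights, bias):
    """Classification head: one logit per class, logits = W @ embedding + b."""
    return [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]

# Toy sequence: three real tokens followed by one padding position.
hidden = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.0, 0.0]]
mask = [1, 1, 1, 0]
embedding = last_token_embedding(hidden, mask)

# Hypothetical head for three classes over a 2-dimensional embedding.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bias = [0.0, 0.0, 0.0]
logits = linear_head(embedding, weights, bias)
predicted = max(range(len(logits)), key=logits.__getitem__)
print(embedding, predicted)  # [0.5, 0.6] 2
```

The final token is used because, in a causal model, it is the only position that has attended to the whole input.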

CL · Sep 18, 2025
Patent Language Model Pretraining with ModernBERT

Amirhossein Yousefiramandi, Ciaran Cooney

Transformer-based language models such as BERT have become foundational in NLP, yet their performance degrades in specialized domains like patents, which contain long, technical, and legally structured text. Prior approaches to patent NLP have primarily relied on fine-tuning general-purpose models or domain-adapted variants pretrained with limited data. In this work, we pretrain three domain-specific masked language models for patents, using the ModernBERT architecture and a curated corpus of over 60 million patent records. Our approach incorporates architectural optimizations, including FlashAttention, rotary embeddings, and GLU feed-forward layers. We evaluate our models on four downstream patent classification tasks. Our model, ModernBERT-base-PT, outperforms the general-purpose ModernBERT baseline on three out of four datasets and achieves competitive performance with a baseline PatentBERT. Additional experiments with ModernBERT-base-VX and Mosaic-BERT-large demonstrate that scaling the model size and customizing the tokenizer further enhance performance on selected tasks. Notably, all ModernBERT variants retain substantially faster inference, over 3x that of PatentBERT, underscoring their suitability for time-sensitive applications. These results highlight the benefits of domain-specific pretraining and architectural improvements for patent-focused NLP tasks.