DC Jul 1, 2024
I've Got 99 Problems But FLOPS Ain't One
Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache et al.
Hyperscalers dominate the landscape of large network deployments, yet they rarely share data or insights about the challenges they face. In light of this supremacy, what problems can we find to solve in this space? We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications. Leveraging language model scaling laws, we discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research. We conclude that building the datacenter and training such models is technically possible, but doing so requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies for intra-datacenter communication, and high-speed scale-up networks and transports. Together these outline a rich research agenda for the networking community.
CL Jul 18, 2024
FuLG: 150B Romanian Corpus for Language Model Pretraining
Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu et al.
Research in the field of language models is rapidly evolving, with many open models being released to the public. Openly available pretraining corpora usually focus on only a handful of languages, with many others either missing completely or extremely underrepresented. In this report, we introduce FuLG, a 150-billion-token Romanian corpus extracted from CommonCrawl. We present our methodology for filtering FuLG and compare it via ablation studies against existing Romanian corpora.
CL Jan 13, 2025
LLMic: Romanian Foundation Language Model
Vlad-Andrei Bădoiu, Mihai-Valentin Dumitru, Alexandru M. Gherghescu et al.
Recent advances in Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with commercial models leading the way. While open models usually operate at a smaller scale, they maintain competitiveness through specialization and fine-tuning. However, a significant challenge persists: open models often underperform in low-resource languages due to limited representation in the training corpus. In this paper, we present LLMic, a bilingual foundation language model designed specifically for the Romanian language. We document the complete process of pretraining a foundation model for a low-resource language, including corpus construction, architecture selection, and hyper-parameter optimization. Our evaluation demonstrates that LLMic can be specialized for tasks in the target language, achieving results comparable to those of much larger open models. We show that LLMic, when fine-tuned for translation after the initial pretraining phase, outperforms existing solutions on English-to-Romanian translation tasks. This opens the path to efficient large-scale processing for the Romanian language community using the much smaller LLMic model.
SE Feb 12, 2012
A Formal Approach for the Development of Service-Oriented Applications
Lorina Negreanu, Cristian Giumale, Alexandru Agache et al.
Please cite this as "Lorina Negreanu, Cristian Giumale, Alexandru Agache, Mihnea Muraru, Matei Popovici, Ciprian Dobre, A Formal Approach for the Development of Service-Oriented Applications, in Proc. of 18th International Conference on Control Systems and Computer Science (CSCS-18), Bucharest, Romania, 2011, pp. 804-810, ISSN: 2066-4451, Politehnica Press"