LG Jun 19, 2025
Semantic Outlier Removal with Embedding Models and LLMs
Eren Akbiyik, João Almeida, Rik Melis et al.
Modern text processing pipelines demand robust methods to remove extraneous content while preserving a document's core message. Traditional approaches such as HTML boilerplate extraction or keyword filters often fail in multilingual settings and struggle with context-sensitive nuances, whereas Large Language Models (LLMs) offer improved quality at high computational cost. We introduce SORE (Semantic Outlier Removal), a cost-effective, transparent method that leverages multilingual sentence embeddings and approximate nearest-neighbor search to identify and excise unwanted text segments. By first identifying core content via metadata embedding and then flagging segments that either closely match predefined outlier groups or deviate significantly from the core, SORE achieves near-LLM extraction precision at a fraction of the cost. Experiments on HTML datasets demonstrate that SORE outperforms structural methods and yields high precision in diverse scenarios. Our system is currently deployed in production, processing millions of documents daily across multiple languages while maintaining both efficiency and accuracy. To facilitate reproducibility and further research, we release our implementation and evaluation datasets.
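The two flagging rules in the abstract (a segment is dropped if it closely matches a predefined outlier group, or if it deviates too far from the core content derived from metadata) can be illustrated with a minimal sketch. This is not the released SORE implementation. The `embed` function is a toy bag-of-words stand-in for a multilingual sentence-embedding model, the brute-force `max` stands in for approximate nearest-neighbor search, and the function names, thresholds, and example texts are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence-embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def remove_outliers(segments, metadata, outlier_examples,
                    match_threshold=0.6, core_threshold=0.1):
    """Keep segments that are neither near a known outlier nor far from the core.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    core = embed(metadata)  # core content approximated by the metadata embedding
    outlier_vectors = [embed(example) for example in outlier_examples]
    kept = []
    for segment in segments:
        vector = embed(segment)
        # Brute-force nearest outlier; a real system would use an ANN index.
        nearest = max((cosine(vector, o) for o in outlier_vectors), default=0.0)
        if nearest >= match_threshold:
            continue  # rule 1: closely matches a predefined outlier group
        if cosine(vector, core) < core_threshold:
            continue  # rule 2: deviates significantly from the core
        kept.append(segment)
    return kept


metadata = "City council approves new bike lanes downtown"
segments = [
    "The city council approved new bike lanes for the downtown area on Monday.",
    "Subscribe to our newsletter for daily updates",
    "Accept all cookies to continue",
    "Construction of the bike lanes starts in spring.",
    "Related: ten recipes for summer",
]
outlier_examples = [
    "Subscribe to our newsletter",
    "Accept cookies to continue browsing",
]

kept = remove_outliers(segments, metadata, outlier_examples)
for segment in kept:
    print(segment)
```

In this toy run, the newsletter and cookie lines are dropped by the outlier-match rule, the recipe teaser is dropped because it shares nothing with the metadata, and the two sentences about the bike lanes are kept. Because each decision comes down to a similarity score against a named reference, it is easy to inspect why a segment was removed.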
HC Sep 20, 2018
Personal Virtual Traffic Light Systems
Vanessa Martins, João Rufino, Bruno Fernandes et al.
Traffic control management at intersections, a challenging and complex field of study, aims to attain a balance between safety and efficient traffic control. Nowadays, traffic control at intersections is typically done by traffic light systems, which are not optimal and exhibit several drawbacks, e.g., poor efficiency and real-time adaptability. With the advent of Intelligent Transportation Systems (ITS), vehicles are being equipped with state-of-the-art technology, enabling cooperative decision-making which will certainly overwhelm the available traffic control systems. This solution strongly penalizes users without such capabilities, namely pedestrians, cyclists and other legacy vehicles. Therefore, in this work, a prototype based on BLE, an alternative technology to the standard vehicular communications, is presented. The proposed framework aims to integrate legacy and modern vehicular communication systems into a cohesive management system. In this framework, the movements of users at intersections are managed by a centralized controller which, through networked retransmitters deployed at intersections, broadcasts alerts and virtual light signalization orders. Users receive this information on their own smart devices, removing the need for dedicated light signalization infrastructure. Field tests, carried out with a real-world implementation, validate the correct operation of the proposed framework.