Josep Silva

8papers

44citations

Novelty19%

AI Score16

Ranked #198,266 of 201,326 authors (top 98%)#2,139 in IR (top 97%)

8 Papers

PLSep 15, 2017Code

Erlang Code Evolution Control

David Insa, Sergio Pérez, Josep Silva et al.

During the software lifecycle, a program can evolve several times for different reasons such as the optimisation of a bottle-neck, the refactoring of an obscure function, etc. These code changes often involve several functions or modules, so it can be difficult to know whether the correct behaviour of the previous releases has been preserved in the new release. Most developers rely on a previously defined test suite to check this behaviour preservation. We propose here an alternative approach to automatically obtain a test suite that specifically focusses on comparing the old and new versions of the code. Our test case generation is directed by a sophisticated combination of several already existing tools such as TypEr, CutEr, and PropEr; and other ideas such as allowing the programmer to chose an expression of interest that must preserve the behaviour, or the recording of the sequences of values to which this expression is evaluated. All the presented work has been implemented in an open-source tool that is publicly available on GitHub.

PLFeb 12, 2018

Erlang Code Evolution Control (Use Cases)

David Insa, Sergio Pérez, Josep Silva et al.

The main goal of this work is to show how SecEr can be used in different scenarios. Concretely, we demonstrate how a user can run SecEr to obtain reports about the behaviour preservation between versions as well as how a user can use SecEr to find the source of a discrepancy. The use cases presented are three: two completely different versions of the same program, an improvement in the performance of a function and a program where an error has been introduced. A complete description of the technique and the tool is available at [1] and [2].

IRJan 9, 2015

Web Template Extraction Based on Hyperlink Analysis

Julián Alarte, David Insa, Josep Silva et al.

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugin content into already formatted and prepared pagelets. For the final user templates are also useful, because they provide uniformity and a common look and feel for all webpages. However, from the point of view of crawlers and indexers, templates are an important problem, because templates usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using menus information. Our implementation and experiments demonstrate the usefulness of the technique.

IRSep 22, 2014

A Benchmark Suite for Template Detection and Content Extraction

Julián Alarte, Josep Silva

Template detection and content extraction are two of the main areas of information retrieval applied to the Web. They perform different analyses over the structure and content of webpages to extract some part of the document. However, their objective is different. While template detection identifies the template of a webpage (usually comparing with other webpages of the same website), content extraction identifies the main content of the webpage discarding the other part. Therefore, they are somehow complementary, because the main content is not part of the template. It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks because templates usually contain irrelevant information such as advertisements, menus and banners. Processing and storing this information is likely to lead to a waste of resources (storage space, bandwidth, etc.). Similarly, identifying the main content is essential for many information retrieval tasks. In this paper, we present a benchmark suite to test different approaches for template detection and content extraction. The suite is public, and it contains real heterogeneous webpages that have been labelled so that different techniques can be suitable (and automatically) compared.

IRSep 9, 2014

Automatic Detection of Webpages that Share the Same Web Template

Julián Alarte, David Insa, Josep Silva et al.

Template extraction is the process of isolating the template of a given webpage. It is widely used in several disciplines, including webpages development, content extraction, block detection, and webpages indexing. One of the main goals of template extraction is identifying a set of webpages with the same template without having to load and analyze too many webpages prior to identifying the template. This work introduces a new technique to automatically discover a reduced set of webpages in a website that implement the template. This set is computed with an hyperlink analysis that computes a very small set with a high level of confidence.

LOJul 31, 2013

Proceedings 9th International Workshop on Automated Specification and Verification of Web Systems

António Ravara, Josep Silva

This volume contains the accepted papers of the 9th International Workshop on Automated Specification and Verification of Web Systems (WWV'13), which took place in Florence, Italy, on June 6, as a satellite event of the 8th International Federated Conferences on Distributed Computing Techniques (DisCoTec 2013).

IROct 23, 2012

Using the DOM Tree for Content Extraction

Sergio López, Josep Silva, David Insa

The main information of a webpage is usually mixed between menus, advertisements, panels, and other not necessarily related information; and it is often difficult to automatically isolate this information. This is precisely the objective of content extraction, a research area of widely interest due to its many applications. Content extraction is useful not only for the final human user, but it is also frequently used as a preprocessing stage of different systems that need to extract the main content in a web document to avoid the treatment and processing of other useless information. Other interesting application where content extraction is particularly used is displaying webpages in small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of the elements in the webpage. Thanks to this information, the technique achieves a considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us a very precise information regarding the related components in a block, thus, producing very cohesive blocks.

SEOct 22, 2012

Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems

Josep Silva, Francesco Tiezzi

This volume contains the final and revised versions of the papers presented at the 8th International Workshop on Automated Specification and Verification of Web Systems (WWV 2012). The workshop was held in Stockholm, Sweden, on June 16, 2012, as part of DisCoTec 2012. WWV is a yearly workshop that aims at providing an interdisciplinary forum to facilitate the cross-fertilization and the advancement of hybrid methods that exploit concepts and tools drawn from Rule-based programming, Software engineering, Formal methods and Web-oriented research. WWV has a reputation for being a lively, friendly forum for presenting and discussing work in progress. The proceedings have been produced after the symposium to allow the authors to incorporate the feedback gathered during the event in the published papers. All papers submitted to the workshop were reviewed by at least three Program Committee members or external referees. The Program Committee held an electronic discussion leading to the acceptance of all papers for presentation at the workshop. In addition to the presentation of the contributed papers, the scientific programme included the invited talks by two outstanding speakers: Rocco De Nicola (IMT, Institute for Advanced Studies Lucca, Italy) and Josè Luiz Fiadeiro (Royal Holloway, United Kingdom).