Identifying Parallel Web Pages

Marcela Macedo Vieira; Viviane Pereira Moreira

doi:10.5753/jidm.2012.1454

Authors

Marcela Macedo Vieira No affiliation declared
Viviane Pereira Moreira No affiliation declared

DOI:

https://doi.org/10.5753/jidm.2012.1454

Keywords:

classification, parallel corpora, similarity functions

Abstract

Research on statistical machine translation and corpus-based approaches for cross-language information retrieval depends on the availability of multilingual data, particularly in the form of parallel corpora (collections of equivalent texts in two or more languages). However, the scarcity of parallel corpora limits the development of these applications. The Web is a vast repository of multilingual information, which has motivated research aimed at mining corpora from it. In this article, we present PPLocator, an approach for locating parallel Web pages. PPLocator was designed to be effective while keeping processing cost low; thus, it avoids making exhaustive pairwise comparisons to identify the candidate pairs. In addition, it tries to minimize the number of pages that need to be downloaded during the intra-site crawl. An important characteristic of our approach is that it does not rely on resources such as dictionaries, translators, or language identifiers. PPLocator demands little effort from the human expert. Experiments using real Web data from over 284K pages attest to the viability of PPLocator. The results show superiority over a baseline system in terms of both recall and precision, despite the fact that the baseline uses more resources.
