ACERPI-Block: Applying Blocking Techniques to the ACERPI Approach

Authors

  • Christian Schmitz, Universidade Federal do Rio Grande do Sul
  • Jonathan Martins, Universidade Federal do Rio Grande do Sul
  • Serigne K. Mbaye, Instituto Federal do Rio Grande do Sul
  • Edimar Manica, Instituto Federal do Rio Grande do Sul
  • Renata Galante, Universidade Federal do Rio Grande do Sul

DOI:

https://doi.org/10.5753/jidm.2022.2509

Keywords:

Entity Resolution, Blocking

Abstract

Ordinances are documents issued by federal institutions that contain, among other things, information about their staff. These documents are available in public repositories that usually offer no filtering or advanced search over document contents. This paper extends ACERPI, an approach that collects documents, extracts information, and resolves entities from institutional ordinances; ACERPI identifies the people mentioned in ordinances to help users find the documents of interest. ACERPI-Block focuses on the Entity Resolution step of the approach, developing blocking strategies that let it scale to hundreds of thousands of records. Experiments show a 93.3% reduction in the number of similarity comparisons between records compared with the solution without blocking, with no decrease in efficacy.
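
To illustrate why blocking reduces the number of similarity comparisons, the following is a minimal sketch of the general idea, not the blocking strategies developed in the paper. It assumes a hypothetical blocking key (the first character of the normalized name) and made-up sample names; only records that share a key are paired for comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first character of the normalized name."""
    return name.strip().lower()[:1]


def candidate_pairs(records):
    """Return index pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, name in enumerate(records):
        blocks[blocking_key(name)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Compare records only within a block, never across blocks.
        pairs.update(combinations(members, 2))
    return pairs


def all_pairs_count(n: int) -> int:
    """Number of comparisons without blocking: n choose 2."""
    return n * (n - 1) // 2


# Made-up sample records, for illustration only.
records = [
    "Ana Souza",
    "Ana P. Souza",
    "Bruno Lima",
    "Bruno S. Lima",
    "Carla Dias",
    "Ana Sousa",
]

pairs = candidate_pairs(records)
reduction = 1 - len(pairs) / all_pairs_count(len(records))
print(f"{len(pairs)} of {all_pairs_count(len(records))} comparisons kept "
      f"({reduction:.1%} fewer)")
```

A key this coarse trades recall for speed: two records for the same person whose names start with different characters would never be compared, which is why the choice of blocking key matters for efficacy.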

Published

2022-09-12

How to Cite

Schmitz, C., Martins, J., Mbaye, S. K., Manica, E., & Galante, R. (2022). ACERPI-Block: Applying Blocking Techniques to the ACERPI Approach. Journal of Information and Data Management, 13(2). https://doi.org/10.5753/jidm.2022.2509

Issue

Section

SBBD 2021 Full papers - Extended Papers