ACERPI-Block: Applying Blocking Techniques to the ACERPI Approach
DOI:
https://doi.org/10.5753/jidm.2022.2509
Keywords:
Entity Resolution, Blocking
Abstract
Ordinances are documents issued by federal institutions that contain, among other things, information about their staff. These documents are available in public repositories that usually offer no filtering or advanced search over document contents. This paper extends ACERPI, an approach that collects documents, extracts information, and resolves entities from institutional ordinances, identifying the people mentioned in them so that users can find the documents of interest. ACERPI-Block focuses on the Entity Resolution step of the approach and develops blocking strategies that let it scale to hundreds of thousands of records. Experiments show a 93.3% reduction in the number of record similarity comparisons relative to the solution without blocking, with no decrease in efficacy.
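As a rough illustration of why blocking reduces the number of similarity comparisons, the sketch below groups person-name records by a blocking key and only pairs records inside the same block. It is not the blocking strategy of ACERPI-Block, which the abstract does not detail: the key (lowercased first token of the name), the sample names, and all function names are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: the lowercased first token of the name."""
    tokens = name.lower().split()
    return tokens[0] if tokens else ""


def candidate_pairs(names):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, name in enumerate(names):
        blocks[blocking_key(name)].append(idx)
    pairs = []
    for members in blocks.values():
        # Only records in the same block are compared with each other.
        pairs.extend(combinations(members, 2))
    return pairs


def all_pairs_count(n: int) -> int:
    """Number of comparisons without blocking: n choose 2."""
    return n * (n - 1) // 2


# Made-up records, for illustration only.
records = [
    "Maria Silva",
    "Maria S. Silva",
    "Joao Souza",
    "Joao P. Souza",
    "Ana Costa",
]

pairs = candidate_pairs(records)
baseline = all_pairs_count(len(records))
print(f"{len(pairs)} comparisons with blocking, {baseline} without")
```

On these five records, the sketch compares 2 pairs instead of 10. A key this coarse can also separate true matches (for example, a name written with and without a leading title), so a real system has to choose keys that keep the reduction without losing efficacy.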