Generating realistic test datasets for duplicate detection at scale using historical voter data

Link:

Author:

Publisher/Organization:

OpenProceedings.org

Year of publication:

2021

Media type:

Text

Description:

Duplicate detection is an essential task in data cleaning and integration, and it has steadily gained importance, especially for researchers and practitioners who need to process and integrate large volumes of potentially unclean data on a daily basis. Evaluating the quality and performance of duplicate detection algorithms requires labeled test data that identify the duplicates they contain. Current approaches for generating test data, however, are either not scalable (and therefore limited to small datasets) or unable to generate realistic data values and errors, especially outdated values. In this paper, we propose a scheme for generating test datasets that addresses both of these issues, and we present a test dataset generated with it. Our approach relies on historical data from the North Carolina voter register, which (1) is realistic because it contains actual voter data and (2) facilitates generating realistic duplicates because current data values were collected at every election through manually filled-out applications. The generated test dataset comprises more than 120 million records with up to 90 attribute values each. To the best of our knowledge, we are the first to provide realistic test data for duplicate detection at this scale.
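
The idea of turning historical snapshots into labeled duplicates can be illustrated with a minimal sketch. It assumes that each voter carries a stable identifier across snapshots; the abstract does not say how records are linked, so this is an assumption, and the field names (`voter_id`, `name`, `address`), the sample rows, and the helper functions `build_clusters` and `duplicate_pairs` are all hypothetical. It is not the authors' generation scheme.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical snapshot rows: (snapshot_date, voter_id, name, address).
# Values are made up for illustration and are not taken from the real register.
snapshots = [
    ("2008-11-04", "V1", "Ann Smith", "12 Oak St"),
    ("2012-11-06", "V1", "Ann Smith", "98 Pine Ave"),  # older row now has an outdated address
    ("2016-11-08", "V1", "Ann Smyth", "98 Pine Ave"),  # spelling variation from manual entry
    ("2012-11-06", "V2", "Bob Jones", "5 Elm Rd"),
    ("2016-11-08", "V2", "Bob Jones", "5 Elm Rd"),     # exact repeat, dropped below
]

def build_clusters(rows):
    """Group distinct record versions by the (assumed) stable identifier."""
    clusters = defaultdict(list)
    for _date, voter_id, name, address in rows:
        record = (name, address)
        if record not in clusters[voter_id]:  # keep only versions that differ
            clusters[voter_id].append(record)
    return dict(clusters)

def duplicate_pairs(clusters):
    """Treat every pair of distinct versions within one cluster as a labeled duplicate."""
    pairs = []
    for voter_id, records in clusters.items():
        for a, b in combinations(records, 2):
            pairs.append((voter_id, a, b))
    return pairs

clusters = build_clusters(snapshots)
gold_pairs = duplicate_pairs(clusters)
```

Under these assumptions, records that differ only because a value changed or was re-entered between elections end up in the same cluster, which is what makes outdated values appear as realistic duplicates in the labeled pairs.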

License:

Source system:

Forschungsinformationssystem der UHH (Research Information System of Universität Hamburg)

Internal metadata

Source record: oai:www.edit.fis.uni-hamburg.de:publications/fa626bea-edf6-4222-bbc7-e8b612f83123