Blocking Page

The purpose of blocking is to make duplicate detection more efficient by reducing the number of record pairs that the system needs to evaluate. A duplicate is a pair of records that refer to the same distinct entity, and the blocking algorithm eliminates from evaluation those pairs of records that are unlikely to be duplicates. Configuring the blocking algorithm effectively is very important: an effective configuration results in a system that scales to millions of records, whereas an ineffective configuration results in a system with many undetected duplicates.
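To illustrate the idea, the following is a minimal sketch in Python, not OpenEMPI code. It assumes a simple exact-match blocking scheme in which two records are compared only when they share the same values for a chosen set of fields. The sample records, field names, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "given_name": "John", "family_name": "Smith", "postal_code": "30301"},
    {"id": 2, "given_name": "Jon", "family_name": "Smith", "postal_code": "30301"},
    {"id": 3, "given_name": "Mary", "family_name": "Jones", "postal_code": "10001"},
    {"id": 4, "given_name": "Maria", "family_name": "Jones", "postal_code": "10001"},
    {"id": 5, "given_name": "Ahmed", "family_name": "Khan", "postal_code": "60601"},
]


def all_pairs(recs):
    """Return every possible pair of record ids (no blocking)."""
    return set(combinations(sorted(r["id"] for r in recs), 2))


def blocked_pairs(recs, fields):
    """Return only the pairs whose records agree on all of `fields`."""
    blocks = defaultdict(list)
    for r in recs:
        key = tuple(r[f] for f in fields)  # the blocking key for this record
        blocks[key].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


print(len(all_pairs(records)))  # 10 pairs without blocking
print(sorted(blocked_pairs(records, ["family_name", "postal_code"])))  # [(1, 2), (3, 4)]
```

With five records, an exhaustive comparison evaluates 10 pairs, while blocking on family name and postal code leaves only 2. The gap widens quickly as the record count grows, since the number of possible pairs grows quadratically.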

The Blocking Page lists the blocking rounds that have been defined on your instance. Clicking the plus sign to the left of a round expands the row and shows the list of attributes included in that round. You can press the “Edit” or “Delete” icon button to the right of a blocking round entry to configure or delete the entry, respectively.

To add a new blocking round entry, click the “Add Blocking Round” button at the top right corner of the page. Pressing the button brings up the Blocking Round dialog. You can add any number of fields from the drop-down list to the blocking round. The table at the bottom of the dialog lists the fields that are currently part of the blocking round configuration. When you have finished configuring the round, press the “Save” button to add the new round to the list of blocking rounds and close the Blocking Round dialog. Review the settings, then press the “Save” button on the Blocking Configuration page to make the changes permanent.
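The sketch below shows why defining more than one blocking round can help. It is a conceptual illustration in Python and rests on two assumptions that are not stated on this page: that a round pairs records agreeing on every field in the round, and that the candidate pairs from all rounds are combined. The records, field names, and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"id": 1, "given_name": "John", "family_name": "Smith",
     "birth_date": "1980-02-01", "postal_code": "30301"},
    {"id": 2, "given_name": "John", "family_name": "Smyth",
     "birth_date": "1980-02-01", "postal_code": "30301"},
    {"id": 3, "given_name": "Mary", "family_name": "Jones",
     "birth_date": "1975-07-15", "postal_code": "10001"},
    {"id": 4, "given_name": "Maria", "family_name": "Jones",
     "birth_date": "1975-07-15", "postal_code": "10001"},
]

# Each blocking round is a list of fields (assumed representation).
rounds = [
    ["family_name", "postal_code"],
    ["given_name", "birth_date"],
]


def pairs_for_round(recs, fields):
    """Pairs of record ids that agree on every field of one round."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[tuple(r[f] for f in fields)].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def candidate_pairs(recs, blocking_rounds):
    """Union of the pairs produced by each round (assumed behavior)."""
    pairs = set()
    for fields in blocking_rounds:
        pairs |= pairs_for_round(recs, fields)
    return pairs


print(sorted(pairs_for_round(records, rounds[0])))  # [(3, 4)]
print(sorted(pairs_for_round(records, rounds[1])))  # [(1, 2)]
print(sorted(candidate_pairs(records, rounds)))     # [(1, 2), (3, 4)]
```

In this example the first round misses records 1 and 2 because the family name is spelled differently, and the second round misses records 3 and 4 because the given name differs. Together the two rounds surface both pairs.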

To optimize the performance of an instance, OpenEMPI generates and caches the blocking information. You need to regenerate the blocking information when you first define the blocking configuration for an entity and whenever you change its settings. Regeneration is also necessary if you import data using the flexible file loader with the bulk import flag, which disables the update of blocking information. To regenerate the blocking information, use the “Rebuilding Blocking Indexes” menu option from the overflow menu. This starts a background process that rebuilds the indexes. The process may take a long time on large instances with millions of records or on a slow filesystem, so we highly recommend that you run it while the system is not busy processing requests.
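The following Python sketch illustrates, in general terms, why a cached blocking index becomes stale and needs a rebuild. It is not a description of how OpenEMPI stores its indexes. The `BlockingIndex` class, its methods, and the `update_index` parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict


class BlockingIndex:
    """Hypothetical in-memory cache mapping blocking keys to record ids."""

    def __init__(self, fields):
        self.fields = fields
        self.records = {}              # record id -> record
        self.index = defaultdict(set)  # blocking key -> record ids

    def _key(self, record):
        return tuple(record[f] for f in self.fields)

    def add(self, record, update_index=True):
        """Store a record; a bulk import might skip the index update."""
        self.records[record["id"]] = record
        if update_index:
            self.index[self._key(record)].add(record["id"])

    def rebuild(self):
        """Regenerate the whole index from the stored records."""
        self.index = defaultdict(set)
        for rec_id, record in self.records.items():
            self.index[self._key(record)].add(rec_id)

    def candidates(self, record):
        """Ids of indexed records that share this record's blocking key."""
        return set(self.index.get(self._key(record), set())) - {record["id"]}


idx = BlockingIndex(["family_name"])
idx.add({"id": 1, "family_name": "Smith"})
# Simulate a bulk import that skips the index update.
idx.add({"id": 2, "family_name": "Smith"}, update_index=False)

probe = {"id": 3, "family_name": "Smith"}
before = idx.candidates(probe)  # record 2 is not in the index yet
idx.rebuild()
after = idx.candidates(probe)   # record 2 is found after the rebuild
print(sorted(before), sorted(after))  # [1] [1, 2]
```

Until the index is rebuilt, the record loaded in bulk mode is invisible to blocking, so any duplicates of it would go undetected.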

Starting with version 4.3.2 of OpenEMPI, you can override the blocking algorithm settings for the currently selected entity with those of another entity. This feature makes it easy to configure a new entity whose data model is similar to that of an existing entity by copying the configuration settings from the existing entity. To replace the configuration settings of the current entity, select the “Replace Configuration” option from the “More Options” menu.

The dialog that pops up allows you to select the entity whose configuration will be copied to the currently selected entity.