Currently, when a new record is added to the system or an existing record is updated, OpenEMPI evaluates whether the record matches any other records in the system. This is a fairly expensive operation because it involves the following steps:
Use the blocking algorithm to identify existing records that need to be compared against this record, generating a list of candidate record pairs
Invoke the matching algorithm against every candidate record pair to determine if any of them are matches or potential matches, each of which is represented as a link in the system
Persist any links or potential links that were generated by the previous step
Update the global identifier associated with the record. A new record is assigned a new global identifier, whereas the global identifier of an updated record may change depending on the links that were identified in the previous steps.
Send out update notification messages to any interested parties depending on the configuration of the system
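The five steps above can be sketched in Java as follows. This is a minimal illustration rather than OpenEMPI code: the class `MatchEvaluationSketch`, its `Rec` and `Link` types, the family-name blocking rule, and the birth-date matching rule are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: these types are hypothetical, not OpenEMPI classes.
final class MatchEvaluationSketch {

    static final class Rec {
        final long id; final String familyName; final String birthDate;
        Rec(long id, String familyName, String birthDate) {
            this.id = id; this.familyName = familyName; this.birthDate = birthDate;
        }
    }

    static final class Link {
        final long leftId; final long rightId; final boolean potential;
        Link(long leftId, long rightId, boolean potential) {
            this.leftId = leftId; this.rightId = rightId; this.potential = potential;
        }
    }

    final List<Rec> records = new ArrayList<>();          // persisted records
    final List<Link> links = new ArrayList<>();           // persisted links
    final Map<Long, Long> globalIds = new HashMap<>();    // record id -> global id
    final List<String> notifications = new ArrayList<>(); // stand-in for outbound messages
    private long nextGlobalId = 1;

    /** Persists a new record, then runs the five evaluation steps against it. */
    void add(Rec rec) {
        records.add(rec);
        // Step 1: blocking (assumed rule: same family name) yields the candidates.
        List<Rec> candidates = new ArrayList<>();
        for (Rec other : records) {
            if (other.id != rec.id && other.familyName.equalsIgnoreCase(rec.familyName)) {
                candidates.add(other);
            }
        }
        // Step 2: matching (assumed rule: equal birth date is a match,
        // anything else that survived blocking is a potential match).
        List<Link> found = new ArrayList<>();
        for (Rec other : candidates) {
            boolean sameBirthDate = other.birthDate.equals(rec.birthDate);
            found.add(new Link(rec.id, other.id, !sameBirthDate));
        }
        // Step 3: persist the links and potential links.
        links.addAll(found);
        // Step 4: reuse the global identifier of the first definite match,
        // otherwise assign a new one.
        Long globalId = null;
        for (Link link : found) {
            if (!link.potential) {
                globalId = globalIds.get(link.rightId);
                break;
            }
        }
        if (globalId == null) {
            globalId = nextGlobalId++;
        }
        globalIds.put(rec.id, globalId);
        // Step 5: notify interested parties.
        notifications.add("record " + rec.id + " -> global id " + globalId);
    }
}
```

Every step runs once per add request in this sketch, which is why the cost grows with the number of candidates that blocking returns.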
To increase the scalability of the system, the evaluation of potential matches must be able to run asynchronously with respect to the incoming request that created or updated the record in question. The original request can then be processed quickly, keeping the response time experienced by users short, while the system evaluates potential matches as resources become available. The rest of this document describes how this feature will be incorporated into the current architecture of OpenEMPI.
The asynchronous processing of potential match evaluations will be added to the system as a site-configurable option. For a particular implementation of OpenEMPI, users will be able to specify whether match evaluations are processed synchronously or asynchronously; once that option is set, all requests are processed using the selected option. Although a site can change this configuration option, it cannot process some requests synchronously and others asynchronously.
The following figure illustrates how the synchronous/asynchronous option will be incorporated into the overall processing of add and update requests. Incoming requests to add or update a record enter the flow from the top of the figure. The add or update operation is processed first and the record is persisted in the system. If the system has been configured to process match evaluations synchronously, the five steps described in the Background section of the document are processed within the same transaction, and the original request completes only when all five steps have completed. This flow is the same as the current flow of add and update requests through the system.
If the system has been configured to process add and update requests asynchronously, the next step is to determine whether a shallow matching module (referred to as fast match in the flowchart above) has been configured. The shallow matching module is a configurable, pluggable module that can quickly evaluate whether a record matches an existing record in the system. Like other components of the OpenEMPI architecture, the shallow matching module is described by an abstract interface, so alternative implementations can be provided and the most suitable one selected during the implementation phase.
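The branching described above can be sketched in Java as follows. The names `AddUpdateFlowSketch`, `ShallowMatcher`, and `MatchEvaluator` are hypothetical and do not come from OpenEMPI. The sketch also assumes that, in asynchronous mode, the record is queued for full evaluation whether or not a shallow matcher is configured, and that the shallow match result is simply returned to the caller.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Queue;

// Illustrative sketch only: all names here are hypothetical, not OpenEMPI classes.
final class AddUpdateFlowSketch {

    /** Pluggable quick check; returns the id of a matching record if one is found. */
    interface ShallowMatcher {
        Optional<Long> findMatch(long recordId);
    }

    /** Stand-in for the full five-step match evaluation. */
    interface MatchEvaluator {
        void evaluate(long recordId);
    }

    private final boolean synchronous;           // site-wide option, applies to all requests
    private final ShallowMatcher shallowMatcher; // may be null: the module is optional
    private final MatchEvaluator evaluator;
    final Queue<Long> dirtyRecords = new ArrayDeque<>(); // awaiting full evaluation
    final List<Long> persisted = new ArrayList<>();      // stand-in for the record store

    AddUpdateFlowSketch(boolean synchronous, ShallowMatcher shallowMatcher,
                        MatchEvaluator evaluator) {
        this.synchronous = synchronous;
        this.shallowMatcher = shallowMatcher;
        this.evaluator = evaluator;
    }

    /** Handles an add or update request and returns the shallow match, if any. */
    Optional<Long> addOrUpdate(long recordId) {
        persisted.add(recordId);          // the record is always persisted first
        if (synchronous) {
            evaluator.evaluate(recordId); // full evaluation inside the request
            return Optional.empty();
        }
        dirtyRecords.add(recordId);       // full evaluation is deferred
        if (shallowMatcher == null) {
            return Optional.empty();
        }
        return shallowMatcher.findMatch(recordId);
    }

    /** Called later, out of band, to run the deferred evaluations. */
    int processDirtyRecords() {
        int processed = 0;
        Long id;
        while ((id = dirtyRecords.poll()) != null) {
            evaluator.evaluate(id);
            processed++;
        }
        return processed;
    }
}
```

Because the mode is a constructor argument, a given instance handles every request the same way, mirroring the rule that a site cannot mix synchronous and asynchronous processing.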
Configuration of the asynchronous feature is done during the initial installation and configuration of an OpenEMPI instance. Synchronous versus asynchronous matching is configured on a per-entity basis, so you specify which one to use on the entity design screen. The figure below shows the Edit Entity Design screen, which you can open either by pressing the Edit Entity Model toolbar button or by selecting the Edit Entity Model option from the Edit menu. To select synchronous matching, check the synchronous option; when the option is not checked, the system performs asynchronous matching.
Beyond specifying whether to use synchronous or asynchronous matching, you can also configure the shallow matching algorithm. This configuration is optional. If you don't configure it, the transaction completes as soon as a record has been added or updated successfully. At some later point the system picks up the dirty records and performs regular matching on them, out of band with the original transaction. To enable the shallow matching algorithm, you need to install the module during deployment of the server, which involves simply adding this line to the openempi-extension-contexts.properties file:
This tells the system to load the module during startup. The standard implementation of the shallow matching algorithm is configured in a similar way to the exact algorithm. You need to specify which attributes are used to assess whether there is a match between the dirty record and other records in the system. Unlike the deterministic algorithm, though, there is no blocking involved. The current implementation of the shallow matching algorithm only compares the dirty record with other records in the system that have exactly the same identifiers. To specify which attributes are used to establish a shallow match, you need to add a section to the mpi-config.xml file listing those attributes along with the metrics and thresholds that will be used for each individual field. The following example should be able to get you started.
Don't forget to add the XML namespace definition at the top of the mpi-config.xml file.
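Separately from the configuration itself, the comparison that the shallow matching algorithm performs can be sketched in Java as follows. This is not the module's implementation: `ShallowMatchSketch`, `Rec`, `FieldRule`, and the two metrics are hypothetical, and the sketch assumes that every configured field must reach its threshold for two records to count as a shallow match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

// Illustrative sketch only: hypothetical types, not the OpenEMPI shallow matching module.
final class ShallowMatchSketch {

    static final class Rec {
        final String identifier; final Map<String, String> attributes;
        Rec(String identifier, Map<String, String> attributes) {
            this.identifier = identifier; this.attributes = attributes;
        }
    }

    /** One configured field: attribute name, comparison metric, and minimum score. */
    static final class FieldRule {
        final String attribute;
        final BiFunction<String, String, Double> metric;
        final double threshold;
        FieldRule(String attribute, BiFunction<String, String, Double> metric, double threshold) {
            this.attribute = attribute; this.metric = metric; this.threshold = threshold;
        }
    }

    /** Assumed metric: 1.0 for a case-insensitive exact match, 0.0 otherwise. */
    static double exact(String a, String b) {
        return a != null && a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }

    /** Assumed metric: share of positions, over the longer length, holding the same character. */
    static double positional(String a, String b) {
        if (a == null || b == null || a.isEmpty() || b.isEmpty()) {
            return 0.0;
        }
        int same = 0;
        for (int i = 0; i < Math.min(a.length(), b.length()); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                same++;
            }
        }
        return (double) same / Math.max(a.length(), b.length());
    }

    /** Compares the dirty record only with records that carry the same identifier. */
    static List<Rec> shallowMatches(Rec dirty, List<Rec> existing, List<FieldRule> rules) {
        List<Rec> matches = new ArrayList<>();
        for (Rec other : existing) {
            // No blocking step: the identifier equality check is the only filter.
            if (other == dirty || !other.identifier.equals(dirty.identifier)) {
                continue;
            }
            boolean allFieldsPass = true;
            for (FieldRule rule : rules) {
                double score = rule.metric.apply(
                        dirty.attributes.get(rule.attribute),
                        other.attributes.get(rule.attribute));
                if (score < rule.threshold) {
                    allFieldsPass = false;
                    break;
                }
            }
            if (allFieldsPass) {
                matches.add(other);
            }
        }
        return matches;
    }
}
```

Restricting the comparison to records with the same identifier is what keeps the check cheap enough to run inside the original request.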
The last step in the configuration of the asynchronous matching feature is enabling the processing of dirty records. If you set up your system for asynchronous matching, you must enable the process that runs periodically and processes dirty records. This is done by adding a scheduled task to the mpi-config.xml file.
<scheduled-task entity-name="person" schedule-type="schedule-at-fixed-rate" time-unit="seconds">