Export Records

The Export Records feature lets you export the data loaded onto an instance of OpenEMPI for further processing by other applications. An emerging use case is to use the de-duplicated records stored within OpenEMPI as input to data analytics workflows, and this feature was added specifically to support it. The data is exported in a binary format. The export operation generates two files: one that contains the records stored in the system for a given entity, and a second that contains the links that the system generated between those records.

To make the exported data easier to process, we chose Apache Avro, an open source data serialization system commonly used with big data technologies, rather than devising a proprietary format for the exported file. Other benefits of using Avro include:

  • its support for rich data formats, which allows us to export the records together with the associated identifiers for each record
  • its compact binary format, which makes it easy to detect files that have been corrupted during transfer, something that is not possible with comma-separated values (CSV) files
  • the availability of tooling and APIs for automatically loading data from Avro files. The schema is embedded within the file itself, so anyone who receives an Avro-formatted file can use standard tools and APIs to load the data onto their system.
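
As a rough illustration of the embedded schema, the sketch below reads the header of an Avro object container file using only the Python standard library. It relies on the layout defined by the Avro specification: the four-byte magic Obj\x01, followed by a metadata map that holds avro.schema and, optionally, avro.codec. The function names are illustrative and are not part of OpenEMPI; in practice an Avro library would normally handle this for you.

```python
import json

AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def _read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    accum = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of Avro header")
        value = byte[0]
        accum |= (value & 0x7F) << shift
        if not value & 0x80:
            break
        shift += 7
    # Undo the zigzag mapping (0, 1, 2, 3 -> 0, -1, 1, -2).
    return (accum >> 1) ^ -(accum & 1)


def _read_bytes(stream):
    """Decode one Avro bytes/string value: a long length, then the payload."""
    length = _read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of Avro header")
    return data


def read_avro_header(stream):
    """Return (schema, codec) from the header of an Avro container file."""
    if stream.read(4) != AVRO_MAGIC:
        raise ValueError("not an Avro object container file")
    metadata = {}
    while True:
        count = _read_long(stream)
        if count == 0:
            break  # a zero count terminates the metadata map
        if count < 0:
            # A negative count is followed by the block size in bytes.
            count = -count
            _read_long(stream)
        for _ in range(count):
            key = _read_bytes(stream).decode("utf-8")
            metadata[key] = _read_bytes(stream)
    schema = json.loads(metadata["avro.schema"].decode("utf-8"))
    # The codec entry is optional; the specification treats absence as "null".
    codec = metadata.get("avro.codec", b"null").decode("utf-8")
    return schema, codec
```

Calling read_avro_header on an exported file opened in binary mode returns the parsed schema as a dictionary together with the name of the compression codec.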

This feature is available starting from version 3.2.0 of OpenEMPI.

To export the records from your OpenEMPI instance, click the Export Records button in the Edit Entity Model view.

Before the records are exported, you must accept the confirmation prompt. The export may take a while to complete, depending on how many records are stored on your instance. Because the export runs in the background, you can continue working with your instance of OpenEMPI while the data is exported. The next screen confirms that the export process has started as a background job. You can check on its progress in the Job Queue window, which you open through the Job Queue entries option under the Admin menu.

Once the export process has completed, you will find two files in the file repository of your instance (by default, the directory specified by the file-repository-directory parameter in the mpi-config.xml file). The first file stores the records of your instance and follows the naming convention entity-name-timestamp.avro; the second stores the links between the records and follows the naming convention recordLinks-timestamp.avro. For convenience, the export function also stores the schema of the exported Avro file in a separate file, although that file is typically not needed since the schema is embedded in the data file itself.
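
The naming conventions above can be used to locate the exported files from a script. The following is a minimal sketch, assuming the file repository directory is readable from where the script runs. The function names are hypothetical, and the format of the timestamp is deliberately not assumed; only the file name prefixes and the .avro extension are used.

```python
from pathlib import Path

AVRO_MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def find_exports(repository_dir, entity_name):
    """Return (record_files, link_files) found in the file repository.

    Files are matched on the documented prefixes only, so this works
    whatever format the timestamp part of the name takes.
    """
    records, links = [], []
    for path in sorted(Path(repository_dir).iterdir()):
        if path.suffix != ".avro":
            continue
        if path.name.startswith("recordLinks-"):
            links.append(path)
        elif path.name.startswith(f"{entity_name}-"):
            records.append(path)
    return records, links


def looks_like_avro(path):
    """Cheap sanity check: does the file start with the Avro magic bytes?"""
    with open(path, "rb") as handle:
        return handle.read(4) == AVRO_MAGIC
```

For example, for a hypothetical entity named person, find_exports(repository_dir, "person") returns the person record files and the recordLinks files as two name-sorted lists, and looks_like_avro can be used to reject a file that was truncated to nothing or replaced during transfer.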