Lots of companies provide very sophisticated DI solutions and products.
There are some theories around DI but, as far as I can tell, most focus on the query-processing side of it, and most assume the data is owned and/or maintained by the party running the DI.
I deal with DI quite often and spend a lot of time coding solutions tailored to each situation. I haven't found any resources to help design and implement DI in a very formal and disciplined manner.
Problem
The data I deal with has a life on its own. This is also true for each data source I try to integrate with. Data changes on a daily (even hourly) basis: things are deleted, modified and created, with or without much notice or history.
The schema of the data is also not standard so each source has its own way of representing things.
There is no unique identifier for the data I deal with.
Each source (even mine) HAS duplicates. Given the volume of data involved (not on par with CDC or CMS, though) and where it comes from (mostly user-generated content, aka UGC), duplicates and other inconsistencies are significant and common.
I use DI for 2 different purposes:
Quantity
The purpose is to augment the amount of data I can have on a particular topic. For example, if I collect hotel information, I would integrate the following sources:
- one source for Europe
- another for Asia
- another for North America
- etc.
Quality
Here the purpose is different. If, again, I collect hotel information, I would integrate sources that each provide a different kind of info for each hotel.
- one source for basic hotel info
- another for hotel photos
- another for hotel videos
- etc.
For example, one source might have basic hotel info in North America, and another one might provide basic hotel info for Asia. If the sources provide info for a hotel I don't have, I could use the source data to create a new one, thus augmenting my collection of hotel data.
Goal
The goal is to figure out the rules of DI, at the data and process level, in order to be able to implement a reliable and repeatable process.
Querying is implied, in the sense that in order to query anything, we first need to integrate something (and hopefully everything).
This approach follows the Local As View (LAV) model, where all the data is "local" with its own schema.
Formally: Math
Set Theory is, I think, well suited to model and formalize DI.
I'm obviously no Math expert and this is more of a half-serious-half-fun-with-math-and-latex project than anything else.
I believe a rigorous approach might help pinpoint problems, understand the consequences of implementation changes, and identify areas where the DI process needs more attention.
The paper [1] presents a simplified model of DI. In this post I use "document" in place of "data", as it is more explicit and will lead to better definitions later on. I also define two important sets of documents: the source S and the target T. S is the source to integrate into T, T being my LAV storage.
The goal here is to model a few functions, in particular the integration function, and prove its idempotency.
Function: $aries$ (create)
This function is used to create a single new document given a document from S. For example, if I stumble upon "Hotel Ritz Carlton" in a data feed, and I don't have it in my database, aries would be used to create a new record.
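As a sketch (assuming a minimal Doc shape and an in-memory target T, both hypothetical), aries boils down to copying the source document into T:

```typescript
// Hypothetical minimal document shape; real documents carry many more fields.
interface Doc {
  name: string;
  source: string; // provenance tag, e.g. "feed-north-america" or "local"
}

// The target set T, modeled here as a plain in-memory array.
const T: Doc[] = [];

// aries: create a brand-new document in T from a source document x.
function aries(x: Doc): Doc {
  const created: Doc = { ...x }; // copy, so later feed changes don't mutate T
  T.push(created);
  return created;
}

// Stumbling upon an unknown hotel in a data feed:
aries({ name: "Hotel Ritz Carlton", source: "feed-north-america" });
```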
Function: $gemini$ (merge)
This function is used to merge two documents. For example, if a feed has an entry for "Hotel Ritz Carlton" and I also happen to have "Hotel Ritz Cartlon" in my database, gemini would be used to copy over information from the feed into my existing record.
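A sketch of such a merge, with a hypothetical Doc shape and a "target wins, source fills the gaps" policy (that policy is my assumption, not something the model prescribes):

```typescript
// Hypothetical document shape with optional enrichment fields.
interface Doc {
  name: string;
  photos?: string[];
  videos?: string[];
}

// gemini: merge a source document x into an existing target document y.
// Policy (assumed): keep y's values, and let x fill in whatever y lacks.
function gemini(x: Doc, y: Doc): Doc {
  return {
    name: y.name, // keep the target's name, even if misspelled (fix separately)
    photos: y.photos ?? x.photos,
    videos: y.videos ?? x.videos,
  };
}

// The feed entry brings photos my existing record doesn't have:
const merged = gemini(
  { name: "Hotel Ritz Carlton", photos: ["lobby.jpg"] },
  { name: "Hotel Ritz Cartlon" }
);
```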
Function: $rep$ (match)
This function defines the similarity between two documents. Two documents are similar (one represents, or matches, the other) if certain aspects of the two documents are considered similar or perfectly equal.
Some aspects of one document might differ from those of another, and yet both could still be considered similar.
For example, a document from a source S might contain information about "Hotel Ritz Carlton" and a document from my set T might contain information about "(The) Hotel Ritz Carlton". Even though the two names are not equal, they are similar, making both documents similar.
The rules to define similarity between documents will be specific to a source. Travelocity might have "Hotel Ritz", Expedia might have "Hotel Ritz Carlton", while I may have "(The) Hotel Ritz Carlton". They are all similar but the rules to get to that conclusion are different between sources.
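As an illustration, one source's rule set could be reduced to name normalization before comparison (lowercasing, dropping parentheses and the article "the"); these particular rules are assumptions made for the example, not a general recipe:

```typescript
// Illustrative normalization; in practice the rules are tuned per source.
function normalize(name: string): string {
  return name
    .toLowerCase()
    .replace(/[()]/g, "") // drop parentheses: "(the) hotel..." -> "the hotel..."
    .replace(/\bthe\b/g, "") // drop the article "the"
    .replace(/\s+/g, " ") // collapse whitespace
    .trim();
}

// rep: 1 if the two names are considered similar, 0 otherwise.
function rep(a: string, b: string): number {
  return normalize(a) === normalize(b) ? 1 : 0;
}
```

With these rules, "(The) Hotel Ritz Carlton" and "Hotel Ritz Carlton" are similar, while "Hotel Ritz" alone is not, matching the intuition above.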
Function: $capricornus$ (integrate)
This is the integration function. It basically states that there are two possible outcomes:
- creation of a new document
- merge with a similar document (when one is found)
To have a reliable and repeatable process, the integration function (and any function used therein) must be formally defined and proved to be idempotent.
With the current definition of $capricornus$, idempotency means that no new element may be created after the first integration; only merging should occur.
Algorithm
Following paper [1], the algorithm for integration would look like this:
function integrate(x : Document) : Document {
    var y : Document = match(x)
    if (y)
        return merge(x, y)
    else
        return create(x)
}
Now this is very basic but there are things that can be noted right off the bat here:
- match() is related to $rep$ and is not as straightforward as it looks. This is probably the most complicated function of all. If it returns the wrong document, some of that document's information will be wiped out when merge() is called. If it fails to return a match when it should have, a duplicate will be created when create() is called.
- merge() might be the easiest function of all. It may not be obvious, but one should make sure that merging two documents won't change the behavior of match(). Formally:
$ z = merge(x,y) \implies rep(x,z) = 1 $
- create() will actually do more than just create a new document. It will also make sure that:
$ z = create(x) \implies rep(x,z) = 1 $
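Wiring toy versions of the three functions together makes the idempotency claim testable: after the first integration of a document, repeating the integration should only merge, never grow T. Everything below (document shape, similarity rule, merge policy) is an illustrative assumption:

```typescript
interface Doc {
  name: string;
}

// Target set T, seeded with one existing record.
const T: Doc[] = [{ name: "(The) Hotel Ritz Carlton" }];

// Similarity on normalized names (illustrative rule).
const normalize = (s: string) =>
  s.toLowerCase().replace(/[()]/g, "").replace(/\bthe\b/g, "").replace(/\s+/g, " ").trim();

function match(x: Doc): Doc | null {
  return T.find((y) => normalize(y.name) === normalize(x.name)) ?? null;
}

function merge(x: Doc, y: Doc): Doc {
  return y; // trivial merge policy: keep the target document as-is
}

function create(x: Doc): Doc {
  const z: Doc = { ...x };
  T.push(z);
  return z;
}

function integrate(x: Doc): Doc {
  const y = match(x);
  return y ? merge(x, y) : create(x);
}

// First pass creates the unknown hotel; every later pass only merges.
integrate({ name: "Hotel Lutetia" }); // create: |T| goes from 1 to 2
integrate({ name: "Hotel Lutetia" }); // merge: |T| stays at 2
integrate({ name: "Hotel Ritz Carlton" }); // merge with the seeded record
```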
To be continued
The next part will dig deeper into the formal model of DI. I'll formally define documents, the matching function (match()) and introduce mapping (used in create()).
References
[1] Data Integration, Simplified, Luc Pezet, 2014
[2] What is data integration?, IBM, 2014
