When You Can’t Join, use DeepMatch
by Anusha Bharat
The Challenge. The quintessential pre-task of most data-driven analysis is that of “stitching” multiple data sources together. Traditionally, in the database language, this is achieved through “joins”. They “stitch” datasets together based on a commonality in terms of shared entries within common columns across datasets. Classical joins work well when the matches are rule based and depend on common columns.
In many modern settings, however, this does not work. Because two datasets may lack a shared column or have mismatched entries, the correct relationship is likely to be missed. This situation may arise due to many reasons including (1) a lack of centralized design across first-party and third-party datasets; (2) the datasets not adhering to a standardized format; or (3) some typo-like errors or missing values in the data which might make “stitching” difficult. Figure 1 below provides an example of “messy” data where stitching is difficult.
The Opportunity. Such a task of “stitching” multiple datasets like that shown in Figure 1 comes up routinely in many settings. This includes Financial Reconciliation (Accounts Receivable, Accounts Payable), Record Linkage, Inventory Tracking, Order Management, Auditing in Insurance, and more.
Currently this is addressed through a mix of manual matching and some ad-hoc solutions. Our Excel users spend up to 80% of their time conducting such operations. This problem begs for a solution that automates “stitching” of datasets where joins do not work.
The Solution. At Ikigai, we have developed a machine learning empowered solution for this precise challenge — the DeepMatch. At the highest level, DeepMatch takes datasets as inputs, attempts to learn the relationship between the columns of datasets and then uses “similarities” between rows to “stitch”.
DeepMatch comes in three forms. In the simplest form, it simply takes two datasets, and produces the best stitching it can, without any further input. For example, if two datasets can be joined in the sense of traditional database operation, it will be achieved. That is, now on, you do not need to worry about which join should you use: left, right, …?
For the dataset whose example is shown in Figure 1, the simplest form of DeepMatch achieves an F1-score of 0.83. F1 score indicates the predictive performance of a model.
In many settings, in addition, historical examples of stitching which indicates a “match” of the datasets might be available. In the second version of DeepMatch, stitching is learnt from such historical examples.
For the dataset whose example is shown in Figure 1, by correcting a small number of errors in the output of the simplest form of DeepMatch and then feeding it to the second version of DeepMatch, the F-1 score improves from 0.83 to 1! That is, perfect stitching.
The third, and most advanced version of DeepMatch involves Human In the Loop. To that end, any machine learning solution is not perfect. At Ikigai, we believe that exceptions are the norm in machine learning driven solutions. Therefore, after DeepMatch (both the simplest form and the form where stitching is learnt from historical examples) produces a stitching of data, it may not be accurate. Typically, one would expect few inaccuracies where DeepMatch may not be confident in its ability to stitch (or not). This can be rectified by simple human in the loop interaction through a few clicks where a human corrects inaccurate stitching. These corrections are learnt by the machine to improve the stitching in the future.
About the Author
Anusha Bharat (Backend Software Engineer)
Anusha Bharat is a Backend Software Engineer with a focus on Machine Learning at Ikigai Labs. She recently graduated from Texas A&M University with a Master’s in Computer Engineering. She is working on developing the platform’s ML/AI capabilities.