When You Can’t Join, use DeepMatch

by Anusha Bharat

The Challenge. The quintessential pre-task of most data-driven analysis is that of “stitching” multiple data sources together. Traditionally, in the database language, this is achieved through “joins”. They “stitch” datasets together based on a commonality in terms of shared entries within common columns across datasets. Classical joins work well when the matches are rule based and depend on common columns.

In many modern settings, however, this does not work. Because two datasets may lack a shared column or have mismatched entries, the correct relationship is likely to be missed. This situation may arise due to many reasons including (1) a lack of centralized design across first-party and third-party datasets; (2) the datasets not adhering to a standardized format; or (3) some typo-like errors or missing values in the data which might make “stitching” difficult. Figure 1 below provides an example of “messy” data where stitching is difficult.

Figure 1: Consider two tables as shown above: left has names such as “Jones, Angela” which is represented in the right as “Jones A”. The traditional Join operation will not be able to “stitch” such datasets. More generally, left table could have two columns with one having the entry “Jones” and other “Angela”. This will make it even more difficult.

The Opportunity. Such a task of “stitching” multiple datasets like that shown in Figure 1 comes up routinely in many settings. This includes Financial Reconciliation (Accounts Receivable, Accounts Payable), Record Linkage, Inventory Tracking, Order Management, Auditing in Insurance, and more.

Currently this is addressed through a mix of manual matching and some ad-hoc solutions. Our Excel users spend up to 80% of their time conducting such operations. This problem begs for a solution that automates “stitching” of datasets where joins do not work.

The Solution. At Ikigai, we have developed a machine learning empowered solution for this precise challenge — the DeepMatch. At the highest level, DeepMatch takes datasets as inputs, attempts to learn the relationship between the columns of datasets and then uses “similarities” between rows to “stitch”.

DeepMatch comes in three forms. In the simplest form, it simply takes two datasets, and produces the best stitching it can, without any further input. For example, if two datasets can be joined in the sense of traditional database operation, it will be achieved. That is, now on, you do not need to worry about which join should you use: left, right, …?

For the dataset whose example is shown in Figure 1, the simplest form of DeepMatch achieves an F1-score of 0.83. F1 score indicates the predictive performance of a model.

In many settings, in addition, historical examples of stitching which indicates a “match” of the datasets might be available. In the second version of DeepMatch, stitching is learnt from such historical examples.

For the dataset whose example is shown in Figure 1, by correcting a small number of errors in the output of the simplest form of DeepMatch and then feeding it to the second version of DeepMatch, the F-1 score improves from 0.83 to 1! That is, perfect stitching.

The third, and most advanced version of DeepMatch involves Human In the Loop. To that end, any machine learning solution is not perfect. At Ikigai, we believe that exceptions are the norm in machine learning driven solutions. Therefore, after DeepMatch (both the simplest form and the form where stitching is learnt from historical examples) produces a stitching of data, it may not be accurate. Typically, one would expect few inaccuracies where DeepMatch may not be confident in its ability to stitch (or not). This can be rectified by simple human in the loop interaction through a few clicks where a human corrects inaccurate stitching. These corrections are learnt by the machine to improve the stitching in the future.

Interested? If you want to automate your data stitching workflow, reconciliation or any such application and want to give it a try, book a demo with us.

About the Author

Anusha Bharat (Backend Software Engineer)

Anusha Bharat is a Backend Software Engineer with a focus on Machine Learning at Ikigai Labs. She recently graduated from Texas A&M University with a Master’s in Computer Engineering. She is working on developing the platform’s ML/AI capabilities.

--

--

--

AI-charged spreadsheets for data operators who run mission-critical biz processes using data.

Love podcasts or audiobooks? Learn on the go with our new app.

Recommended from Medium

Weapons of Math Destruction Ch. 4

Decision Science — Between Cognitive & Data Science

WHAT DOES A DATA DICTIONARY AND A REFUGEE CAMP HAVE IN COMMON?

Money Talk — In your own language

Sustainable Wind Engineering: The Stockholm Royal Seaport

Segmenting Your Customers on Many Dimensions (or Python for Wine Lovers)

Predicting Heart Disease With a Neural Network

💡How to Create a Solid Tracking Plan for Your Products

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Ikigai Labs

Ikigai Labs

AI-charged spreadsheets for data operators who run mission-critical biz processes using data.

More from Medium

2022 NHL Combine Mock Data Analysis: Part 2

Email — your friendly neighbourhood Analytics reporting platform (Part 1)

How to do Data Cleansing using SQL BigQuery and Connect Custom Query to Google Data Studio (2/4)

COVID-19 World Progress Data Dashboard