The next generation of data integration

 

Simple business questions can be surprisingly hard to answer using today's IT systems. For a large company, a question like “how many employees are tax-exempt?” may require querying hundreds of databases that use multiple data models and possibly inconsistent definitions: for example, is a “contractor” also an “employee”? Over the past five years, our team has developed a new technology for performing data-integration tasks (such as querying, combining, and evolving databases) based on category theory, a branch of mathematics that has already revolutionized several areas of computer science. Category theory gives us theoretical guidance missing from the widely used relational model of data, and we have used it to build a prototype software tool, FQL, for integrating databases more quickly and more accurately than tools that use the relational model.

 
 

Relational vs Categorical

Whenever information from different sources needs to be combined, the data structures supporting that information must first be related. This task, called data integration, is the biggest and most expensive challenge in IT today, accounting for over 40% of enterprise IT budgets.

To solve data-integration problems, database schemas must be mapped to other schemas. In category theory, schemas are categories, and a mapping between schemas is called a functor. In other words,

Solutions are functors, and may be composed.
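
To make the composition claim concrete, here is a minimal Python sketch (not FQL). It assumes a schema is just a graph of tables and foreign-key edges, and that a mapping sends each table to a table and each edge to a path of edges. The names `Schema`, `Mapping`, `check`, and `compose`, and the example schemas, are hypothetical illustrations. The path equations that a full category would carry are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    # Hypothetical, simplified schema: a graph without path equations.
    nodes: set    # table names
    edges: dict   # edge name -> (source table, target table)


@dataclass
class Mapping:
    # Hypothetical functor-like mapping between two schemas.
    src: Schema
    dst: Schema
    on_nodes: dict   # src table -> dst table
    on_edges: dict   # src edge -> list of dst edge names (a path; [] is an identity)

    def check(self):
        """Verify that every edge is sent to a path with matching endpoints."""
        for edge, (a, b) in self.src.edges.items():
            current = self.on_nodes[a]
            for step in self.on_edges[edge]:
                start, end = self.dst.edges[step]
                if start != current:
                    raise ValueError(f"path for {edge} is not connected at {step}")
                current = end
            if current != self.on_nodes[b]:
                raise ValueError(f"path for {edge} ends at the wrong table")
        return True


def compose(g, f):
    """Return the mapping 'g after f': apply f first, then g."""
    on_nodes = {n: g.on_nodes[f.on_nodes[n]] for n in f.src.nodes}
    on_edges = {
        e: [step2 for step in path for step2 in g.on_edges[step]]
        for e, path in f.on_edges.items()
    }
    return Mapping(f.src, g.dst, on_nodes, on_edges)


# Three made-up schemas.
A = Schema({"Employee", "Department"}, {"worksIn": ("Employee", "Department")})
B = Schema(
    {"Person", "Team", "Division"},
    {"memberOf": ("Person", "Team"), "partOf": ("Team", "Division")},
)
C = Schema({"Staff", "Org"}, {"unit": ("Staff", "Org")})

# f sends the single edge worksIn to the two-step path memberOf, partOf.
f = Mapping(A, B,
            {"Employee": "Person", "Department": "Division"},
            {"worksIn": ["memberOf", "partOf"]})
# g collapses Team and Division into Org, so partOf becomes an identity path.
g = Mapping(B, C,
            {"Person": "Staff", "Team": "Org", "Division": "Org"},
            {"memberOf": ["unit"], "partOf": []})

gf = compose(g, f)
```

Here `gf` is again a well-formed mapping, from `A` straight to `C`, which is the sense in which two solutions can be chained into one.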

Relational schema mapping techniques suffer from a variety of theoretical and practical issues. Categorical solutions—functors—are easier to write, easier to modify, and easier to compare.

  • schema evolution becomes trivial (an update to a schema is just a functor),
  • schemas can be defined as graphs, and migrations specified visually as graph correspondences,
  • each mapping between schemas induces three data migrations, rather than one,
  • the relational model is a special case of the categorical model, so we can always interoperate with existing solutions,
  • categorical solutions are strongly initial, but relational solutions are weakly initial,
  • and our graphical tool is lightweight (< 5MB, with examples). 
 
 

What is category theory?

Category theory, originally developed to translate theorems from one area of mathematics to another (e.g., from topology to algebra), can also be applied to translate information from one computer system to another (e.g., between database management systems). This use of category theory is called functorial data migration, and is currently being researched by David Spivak, Ryan Wisnesky, and others in the MIT Department of Mathematics.

Simple as Σ, Δ, Π 

Data moves, and category theory describes this motion using three basic data migration functors.

 

  • Σ (Sigma): UNION
  • Δ (Delta): PROJECT
  • Π (Pi): JOIN

 

Every time we migrate data across a map of schemas, we can write down that migration as some combination of Δs, Πs, and Σs. Our open-source tool, FQL, translates these data migrations into SQL and runs that SQL over JDBC whenever possible. FQL is more expressive than SQL, so when an FQL query cannot be implemented in SQL, FQL executes the query directly.
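
The three migrations can be sketched in Python for the simplest special case only. The sketch assumes schemas that have tables but no foreign keys, so a schema mapping is a plain function on table names and an instance is a set of row identifiers per table. The function names `delta`, `sigma`, and `pi` and the sample data are hypothetical, and this is not how FQL itself is implemented. In this stripped-down setting the correspondence with PROJECT, UNION, and JOIN is loose: Δ copies or drops whole tables, Σ forms a tagged union, and Π forms a cross product. The general case with foreign keys is more involved and is not shown.

```python
from itertools import product


def delta(F, inst_T):
    """Δ: pull a target instance back along F (each source table copies its image)."""
    return {s: set(inst_T[t]) for s, t in F.items()}


def sigma(F, inst_S, targets):
    """Σ: each target table becomes the tagged union of the source tables sent to it."""
    return {
        t: {(s, row) for s in F if F[s] == t for row in inst_S[s]}
        for t in targets
    }


def pi(F, inst_S, targets):
    """Π: each target table becomes the product of the source tables sent to it."""
    out = {}
    for t in targets:
        fibre = sorted(s for s in F if F[s] == t)
        out[t] = {
            tuple(zip(fibre, combo))
            for combo in product(*(inst_S[s] for s in fibre))
        }
    return out


# Hypothetical mapping: two source tables are both sent to one target table.
F = {"Contractor": "Worker", "Employee": "Worker"}
source_data = {"Contractor": {"c1"}, "Employee": {"e1", "e2"}}
target_data = {"Worker": {"w1", "w2"}}

unioned = sigma(F, source_data, {"Worker"})   # 3 tagged rows in Worker
joined = pi(F, source_data, {"Worker"})       # 2 (contractor, employee) pairs
pulled = delta(F, target_data)                # Worker rows copied to both tables
```

Tagging each row with its source table in `sigma` keeps the union disjoint, so a contractor and an employee that happen to share an identifier are not merged by accident.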

 
 

Categorical Informatics was spun out of the MIT Mathematics Department in 2015. We are supported by an SBIR grant from the National Institute of Standards and Technology and an I-Corps Teams grant from the National Science Foundation.