February 19, 2017

Record versioning with Mysql

The topic of how to handle record versioning came up recently in a number of projects. This is a topic known commonly as slowly changing dimensions. There are a number of approaches depending on your requirements; a good overview can be found here.

Considerations in design

need to persist accurate history
number of changing fields within a record
maintainability
placement of business logic and number of actors on the data store

This article aims to explore a number of approaches and how Mysql's capabilities can be used to enforce integrity or automate the strategies where possible. Generally I'm not a fan of stored procedures or complex triggers as this essentially divests business logic from the core code base/ service layer. However this is a space that I wanted to explore how far Mysql triggers could assist.

Regardless of which strategy is the best fit for the requirements at hand, all scenarios should have appropriate integration tests to ensure expected behavior over time, especially as there are multiple ways to achieve these strategies. Integration tests should cover basic scenarios such as

Inserting new records into the system
Updating existing records
- Are updates reflected correctly?
- Is history of the old record maintained?

For discussion we'll use Mysql 5.7 and model the common scenario of a supply table with 4 simple properties;

Internal primary surrogate key
Natural key
Description
Cost

Where the assumption is that the natural key is globally unique. e.g.

CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);

You can get started very simply if you have docker installed with the following;

docker run --name mysql_container --env MYSQL_ALLOW_EMPTY_PASSWORD=YES -p 3306:3306 mysql:5.7

This will download and run mysql in an isolated container and the following will allow you to connect

docker exec -it mysql_container mysql -uroot

This allows you to run and utilize Mysql in a completely isolated form without polluting your host system. It's also extremely simple to experiment between versions by simply pulling different images. See here for a complete list of offical Mysql docker images.

Type 1 - Overwrite

Essentially, using this strategy, there is only one record per supply_key and fields are updated in place with no historic values retained.
Pros

Simple to implement. Use INSERT IGNORE ... or REPLACE .... statements depending on your needs. This will overwrite any existing data with new data.

Cons

Historic trends cannot be extracted from the supplies table.
Referential integrity to the supplies table must be thought through as data can change at any point.
Data must be copied from the supplies table to an instance of an order record to preserve state at a given point in time

Type 2 - Add new row

This approach involves creating a new row for a record that has changed and delineating it from existing versions. The Wikipeadia article mentions two common approaches; an incrementing version column grouped on supply_key or a combination of start and end dates.
I favor the latter approach for 2 reasons. First, it provides temporal relevance which is useful for a wide variety of reporting and auditing reasons and second it naturally provides an easy way for determining the current record. It's a far easier query to find out which record has a NULL end date than which version is the largest in a group. For example, to get a list of current supplies with dates could be as simple as;

SELECT * FROM supplies WHERE ended_at IS NULL;

With incrementing version columns the same result could be achieved with the following more complexe statement;

SELECT *  
FROM supplies  
INNER JOIN  
  (SELECT supply_key, MAX(version) AS version 
   FROM supplies 
   GROUP BY supply_key) AS current
WHERE supplies.supply_key = current.supply_key  
  AND supplies.version = current.version

That being said, let's see how we can automate this and reduce the cognitive burden off the developer. First let's start with a few assumptions to make this simple.

Updates for a particular supply_key will be inserted in order
Only the start date is supplied and is assumed to be the end date of the preceding record version.
There is always a current record for a given supply_key with no end date. Yes, this isn't realistic as you can never remove supplies with this constraint but we're going to roll with it for the sake of discussion.

Given these requirements our new supplies table may look something like;

DROP TABLE supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key VARCHAR(10) NOT NULL,
     description VARCHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
     ended_at DATETIME,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at),
     KEY (supply_key, ended_at)
);

Of note;

The start date is mandatory, the end date is not since for a given supply_key there is always one record with a NULL ended_at date.
Our UNIQUE key constraint now covers both the natural key for the supply and the start date.

Our automation should handle the following forms;

A record UPSERT with an assumed start date equal to when the record was inserted
A record UPSERT with an explicit start date
A record UPSERT with an explicit end date

INSERT INTO supplies SET supply_key='A', description='foo', cost='1';  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=NOW();  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=DATE_SUB(NOW(), INTERVAL 1 DAY), ended_at=NOW();

Ideally we'd use something like the following trigger to handle this automatically.

DROP TRIGGER supplies_before_insert;  
DROP TRIGGER supplies_after_insert;  
delimiter |

CREATE TRIGGER supplies_before_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.ended_at=NULL;
  END;
|
CREATE TRIGGER supplies_after_insert AFTER INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    UPDATE supplies SET ended_at=NEW.started_at WHERE supply_key=NEW.supply_key AND ended_at IS NULL AND id!=NEW.id;
  END;
|

delimiter ;

Unfortunately mysql does not permit updates in triggers on the same table that you inserted to. There reasons for this are deadlocks and infinite loops. The update in the trigger will indeed cause the trigger to trip again and so on and so forth.

Our likely approach here is to push this on the application using something like the following;

START TRANSACTION;  
  INSERT INTO supplies 
          SET supply_key='B', description='bar', cost=2; 

  SELECT @end_date:=MAX(started_at) 
    FROM supplies 
   WHERE supply_key='B';

  UPDATE supplies 
     SET ended_at=@end_date 
   WHERE supply_key='B' 
     AND ended_at IS NULL 
     AND id!=last_insert_id(); 
COMMIT;

Pro's

Orders can safely reference the supplies table and remain accurate over time.
Simple reporting and trending
Efficient use of covering index to locate current records.

Cons

Your supplies table may become very large with historic data depending on the rate of change.
The business logic to maintain historical records must be observed by the client applications of the database meaning decentralized logic if not fronted by a service API.
The way this particular insert query is written coupled with the automated started_at date field could lead to unnecessary duplication. Consider a field that may toggle between multiple states over time.
The above INSERT, SELECT, UPDATE combination isn't the most efficient but robustly handles automatic started_at values as well as specified ones.

Type 3 - Add new attribute

In this approach the system only keeps track of the original & current values of selected fields and retains one record per supply. In the following example the fields description and cost are of particular interest.

DROP TABLE supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     original_description CHAR(30) DEFAULT '',
     original_cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);

We generally want two guarantees from a system with this approach;
1. Inserts automatically fill the original_* fields.
2. Updates preserve the original_* fields.

This can be done with triggers in Mysql with the following.

DROP TRIGGER supplies_insert;  
DROP TRIGGER supplies_update;  
delimiter |

CREATE TRIGGER supplies_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = NEW.description;
    SET NEW.original_cost = NEW.cost;
  END;
|

CREATE TRIGGER supplies_update BEFORE UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = OLD.original_description;
    SET NEW.original_cost = OLD.original_cost;
  END;
|

delimiter ;

With these triggers in place you can safely use standard INSERT and UPDATE statements or use the following UPSERT form negating the need to know upfront whether your application already contains a record for a particular supply_key.

INSERT INTO supplies SET  
    supply_key='B', description='bar', cost=2 
  ON DUPLICATE KEY UPDATE 
    cost=VALUES(cost), description=VALUES(description);

Pro's

Simple to implement in either Mysql or code space
Guarantees at the database level can be made to preserve the original_* fields regardless of the client interacting with the database.

Cons

Limited use in terms of accurate record keeping. A de-normalized copy of the supplies data at the point of use must be made for accurate record keeping.
Trend reporting is impossible. Only start and current costs are tracked. Trending on de-normalized orders is not recommended since there is no guarantee that an order was made when a supply was a particular cost.

Type 4 - Add history table

Aside from strategy Type 2, which retains history in the same table, the other common approach to this is to seperate historic records from current records in seperate tables;

The following creates 2 tables, a supplies table and a supplies_archive table based on the structure of the current supplies table. The current supplies table still needs to know when the current record became relevant and so we need the started_at date. In the supplies_archive we also need an ended_at date.

DROP TABLE IF EXISTS supplies;  
DROP TABLE IF EXISTS supplies_archive;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                                ON UPDATE CURRENT_TIMESTAMP,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at)
);
CREATE TABLE supplies_archive LIKE supplies;  
ALTER TABLE supplies_archive  
  ADD COLUMN ended_at DATETIME NOT NULL AFTER started_at;

Here we can set up a trigger to automatically create a new record in the supplies_archive table.

DROP TRIGGER supplies_after_update;  
delimiter |

CREATE TRIGGER supplies_after_update AFTER UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    IF NEW.cost != OLD.cost 
       OR NEW.description != OLD.description THEN
      INSERT INTO supplies_archive 
      SELECT NULL, NEW.supply_key, OLD.description, OLD.cost, OLD.started_at, NEW.started_at ;
    END IF;
  END;
|

delimiter ;

Note that this does not protect against UPDATES that explicitly set the started_at data and nothing else, which breaks our desired behavior.

As of MySQL 5.5, you can use the SIGNAL syntax to throw an exception to assist in refining TRIGGER behavior:

SIGNAL sqlstate '45000' SET message_text = 'My Error Message';

State 45000 is a generic state representing "unhandled user-defined exception".

In any approach to a complex problem with many entry points there are workarounds. The above solution is far from robust but is limited by the capabilities of Mysql triggers. The goal here is to provide as much consistency as possible at the database level which remains the lowest common denominator between modes of interaction, whether they are multiple clients or developers with direct access to the database. Of course this statement is highly dependent on design and deployment environment. For instance if you have a storage API in front of the database and ban any other method of interaction your design evaluation changes significantly.

If you have a dedicated storage API I would recommend taking the archive logic and encoding it simply in code space, forgoing all of the limitations of triggers and the bifurcation of business logic across application and database code spaces. By encoding business logic at the application level the code is also significantly more portable assuming the use of an ORM.

If you have multiple clients with direct access to the database, triggers are a useful tool to protect data contract expectations but do have limitations that are sometimes hard to work around. Dependency on triggers and stores procedures also reduces portability and behavior transparency.

This article is not meant to be categorical by any stretch. As will all things relating to software design your mileage will vary on your particular needs and requirements.

Stuart Ingram

Home

Record versioning with Mysql

Type 1 - Overwrite

Type 2 - Add new row

Type 3 - Add new attribute

Type 4 - Add history table