February 19, 2017 · Mysql

Record versioning with Mysql

The topic of how to handle record versioning came up recently in a number of projects. This is a topic known commonly as slowly changing dimensions. There are a number of approaches depending on your requirements; a good overview can be found here.

Considerations in design

This article aims to explore a number of approaches and how Mysql's capabilities can be used to enforce integrity or automate the strategies where possible. Generally I'm not a fan of stored procedures or complex triggers as this essentially divests business logic from the core code base/ service layer. However this is a space that I wanted to explore how far Mysql triggers could assist.

Regardless of which strategy is the best fit for the requirements at hand, all scenarios should have appropriate integration tests to ensure expected behavior over time, especially as there are multiple ways to achieve these strategies. Integration tests should cover basic scenarios such as

For discussion we'll use Mysql 5.7 and model the common scenario of a supply table with 4 simple properties;

Where the assumption is that the natural key is globally unique. e.g.

CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);

You can get started very simply if you have docker installed with the following;

docker run --name mysql_container --env MYSQL_ALLOW_EMPTY_PASSWORD=YES -p 3306:3306 mysql:5.7  

This will download and run mysql in an isolated container and the following will allow you to connect

docker exec -it mysql_container mysql -uroot  

This allows you to run and utilize Mysql in a completely isolated form without polluting your host system. It's also extremely simple to experiment between versions by simply pulling different images. See here for a complete list of offical Mysql docker images.

Type 1 - Overwrite

Essentially, using this strategy, there is only one record per supply_key and fields are updated in place with no historic values retained.
Pros

Cons

Type 2 - Add new row

This approach involves creating a new row for a record that has changed and delineating it from existing versions. The Wikipeadia article mentions two common approaches; an incrementing version column grouped on supply_key or a combination of start and end dates.
I favor the latter approach for 2 reasons. First, it provides temporal relevance which is useful for a wide variety of reporting and auditing reasons and second it naturally provides an easy way for determining the current record. It's a far easier query to find out which record has a NULL end date than which version is the largest in a group. For example, to get a list of current supplies with dates could be as simple as;

SELECT * FROM supplies WHERE ended_at IS NULL;  

With incrementing version columns the same result could be achieved with the following more complexe statement;

SELECT *  
FROM supplies  
INNER JOIN  
  (SELECT supply_key, MAX(version) AS version 
   FROM supplies 
   GROUP BY supply_key) AS current
WHERE supplies.supply_key = current.supply_key  
  AND supplies.version = current.version

That being said, let's see how we can automate this and reduce the cognitive burden off the developer. First let's start with a few assumptions to make this simple.

Given these requirements our new supplies table may look something like;

DROP TABLE supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key VARCHAR(10) NOT NULL,
     description VARCHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
     ended_at DATETIME,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at),
     KEY (supply_key, ended_at)
);

Of note;

Our automation should handle the following forms;

INSERT INTO supplies SET supply_key='A', description='foo', cost='1';  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=NOW();  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=DATE_SUB(NOW(), INTERVAL 1 DAY), ended_at=NOW();  

Ideally we'd use something like the following trigger to handle this automatically.

DROP TRIGGER supplies_before_insert;  
DROP TRIGGER supplies_after_insert;  
delimiter |

CREATE TRIGGER supplies_before_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.ended_at=NULL;
  END;
|
CREATE TRIGGER supplies_after_insert AFTER INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    UPDATE supplies SET ended_at=NEW.started_at WHERE supply_key=NEW.supply_key AND ended_at IS NULL AND id!=NEW.id;
  END;
|

delimiter ;  

Unfortunately mysql does not permit updates in triggers on the same table that you inserted to. There reasons for this are deadlocks and infinite loops. The update in the trigger will indeed cause the trigger to trip again and so on and so forth.

Our likely approach here is to push this on the application using something like the following;

START TRANSACTION;  
  INSERT INTO supplies 
          SET supply_key='B', description='bar', cost=2; 

  SELECT @end_date:=MAX(started_at) 
    FROM supplies 
   WHERE supply_key='B';

  UPDATE supplies 
     SET ended_at=@end_date 
   WHERE supply_key='B' 
     AND ended_at IS NULL 
     AND id!=last_insert_id(); 
COMMIT;  

Pro's

Cons

Type 3 - Add new attribute

In this approach the system only keeps track of the original & current values of selected fields and retains one record per supply. In the following example the fields description and cost are of particular interest.

DROP TABLE supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     original_description CHAR(30) DEFAULT '',
     original_cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);

We generally want two guarantees from a system with this approach;
1. Inserts automatically fill the original_* fields.
2. Updates preserve the original_* fields.

This can be done with triggers in Mysql with the following.

DROP TRIGGER supplies_insert;  
DROP TRIGGER supplies_update;  
delimiter |

CREATE TRIGGER supplies_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = NEW.description;
    SET NEW.original_cost = NEW.cost;
  END;
|

CREATE TRIGGER supplies_update BEFORE UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = OLD.original_description;
    SET NEW.original_cost = OLD.original_cost;
  END;
|

delimiter ;  

With these triggers in place you can safely use standard INSERT and UPDATE statements or use the following UPSERT form negating the need to know upfront whether your application already contains a record for a particular supply_key.

INSERT INTO supplies SET  
    supply_key='B', description='bar', cost=2 
  ON DUPLICATE KEY UPDATE 
    cost=VALUES(cost), description=VALUES(description);

Pro's

Cons

Type 4 - Add history table

Aside from strategy Type 2, which retains history in the same table, the other common approach to this is to seperate historic records from current records in seperate tables;

The following creates 2 tables, a supplies table and a supplies_archive table based on the structure of the current supplies table. The current supplies table still needs to know when the current record became relevant and so we need the started_at date. In the supplies_archive we also need an ended_at date.

DROP TABLE IF EXISTS supplies;  
DROP TABLE IF EXISTS supplies_archive;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                                ON UPDATE CURRENT_TIMESTAMP,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at)
);
CREATE TABLE supplies_archive LIKE supplies;  
ALTER TABLE supplies_archive  
  ADD COLUMN ended_at DATETIME NOT NULL AFTER started_at;

Here we can set up a trigger to automatically create a new record in the supplies_archive table.

DROP TRIGGER supplies_after_update;  
delimiter |

CREATE TRIGGER supplies_after_update AFTER UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    IF NEW.cost != OLD.cost 
       OR NEW.description != OLD.description THEN
      INSERT INTO supplies_archive 
      SELECT NULL, NEW.supply_key, OLD.description, OLD.cost, OLD.started_at, NEW.started_at ;
    END IF;
  END;
|

delimiter ;  

Note that this does not protect against UPDATES that explicitly set the started_at data and nothing else, which breaks our desired behavior.

As of MySQL 5.5, you can use the SIGNAL syntax to throw an exception to assist in refining TRIGGER behavior:

SIGNAL sqlstate '45000' SET message_text = 'My Error Message';  

State 45000 is a generic state representing "unhandled user-defined exception".

In any approach to a complex problem with many entry points there are workarounds. The above solution is far from robust but is limited by the capabilities of Mysql triggers. The goal here is to provide as much consistency as possible at the database level which remains the lowest common denominator between modes of interaction, whether they are multiple clients or developers with direct access to the database. Of course this statement is highly dependent on design and deployment environment. For instance if you have a storage API in front of the database and ban any other method of interaction your design evaluation changes significantly.

If you have a dedicated storage API I would recommend taking the archive logic and encoding it simply in code space, forgoing all of the limitations of triggers and the bifurcation of business logic across application and database code spaces. By encoding business logic at the application level the code is also significantly more portable assuming the use of an ORM.

If you have multiple clients with direct access to the database, triggers are a useful tool to protect data contract expectations but do have limitations that are sometimes hard to work around. Dependency on triggers and stores procedures also reduces portability and behavior transparency.

This article is not meant to be categorical by any stretch. As will all things relating to software design your mileage will vary on your particular needs and requirements.