<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Stuart Ingram]]></title><description><![CDATA[Husband, father, Brit, software architect/engineer, nerd enthusiast/herder, occasional wood worker, doing the best I can]]></description><link>http://stuartingram.com:80/</link><generator>Ghost 0.8</generator><lastBuildDate>Mon, 30 Dec 2024 12:57:15 GMT</lastBuildDate><atom:link href="http://stuartingram.com:80/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Record versioning with Mysql]]></title><description><![CDATA[<p>The topic of how to handle record versioning came up recently in a number of projects.  This is a topic known commonly as <strong>slowly changing dimensions</strong>.  There are a number of approaches depending on your requirements; a good overview can be found <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">here</a>.</p>

<p>Considerations in design</p>

<ul>
<li>need to persist accurate</li></ul>]]></description><link>http://stuartingram.com:80/2017/02/19/record-versioning-with-mysql-2/</link><guid isPermaLink="false">b525882c-3e7c-4554-be97-5e6d165c1b5a</guid><category><![CDATA[Mysql]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Sun, 19 Feb 2017 15:57:30 GMT</pubDate><content:encoded><![CDATA[<p>The topic of how to handle record versioning came up recently in a number of projects.  This is a topic known commonly as <strong>slowly changing dimensions</strong>.  There are a number of approaches depending on your requirements; a good overview can be found <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">here</a>.</p>

<p>Considerations in design</p>

<ul>
<li>need to persist accurate history</li>
<li>number of changing fields within a record</li>
<li>maintainability</li>
<li>placement of business logic and number of actors on the data store</li>
</ul>

<p>This article aims to explore a number of approaches and how <a href="https://www.mysql.com/">Mysql</a>'s capabilities can be used to enforce integrity or automate the strategies where possible.  Generally I'm not a fan of stored procedures or complex triggers, as they essentially divest business logic from the core code base/service layer.  However, this is an area where I wanted to explore how far Mysql triggers could assist.</p>

<p>Regardless of which strategy is the best fit for the requirements at hand, all scenarios should have appropriate integration tests to ensure expected behavior over time, especially as there are multiple ways to achieve these strategies.  Integration tests should cover basic scenarios such as</p>

<ul>
<li>Inserting new records into the system</li>
<li>Updating existing records
<ul><li>Are updates reflected correctly?</li>
<li>Is history of the old record maintained?</li></ul></li>
</ul>

<p>For discussion we'll use Mysql 5.7 and model the common scenario of a supply table with 4 simple properties;</p>

<ul>
<li>Internal primary surrogate key</li>
<li>Natural key</li>
<li>Description</li>
<li>Cost</li>
</ul>

<p>Where the assumption is that the natural key is globally unique.  e.g.  </p>

<pre><code class="language-sql">CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);
</code></pre>

<p>You can get started very simply if you have <a href="https://www.docker.com/">docker</a> installed with the following;</p>

<pre><code>docker run --name mysql_container --env MYSQL_ALLOW_EMPTY_PASSWORD=YES -p 3306:3306 mysql:5.7  
</code></pre>

<p>This will download and run mysql in an isolated container and the following will allow you to connect  </p>

<pre><code>docker exec -it mysql_container mysql -uroot  
</code></pre>

<p>This allows you to run and utilize Mysql in a completely isolated form without polluting your host system.  It's also extremely simple to experiment between versions by pulling different images.  See <a href="https://hub.docker.com/_/mysql/">here</a> for a complete list of official Mysql docker images.  </p>

<h3 id="type1overwrite">Type 1 - Overwrite</h3>

<p>Essentially, using this strategy, there is only one record per <code>supply_key</code> and fields are updated in place with no historic values retained. <br>
Pros</p>

<ul>
<li>Simple to implement.  Use <code>INSERT IGNORE ...</code> or <code>REPLACE ....</code> statements depending on your needs.  This will overwrite any existing data with new data.</li>
</ul>

<p>Cons</p>

<ul>
<li>Historic trends cannot be extracted from the supplies table.</li>
<li>Referential integrity to the supplies table must be thought through as data can change at any point.</li>
<li>Data must be copied from the supplies table to an instance of an order record to preserve state at a given point in time.</li>
</ul>
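<p>For the supplies table above, a Type 1 overwrite can be a single idempotent statement.  A minimal sketch using <code>INSERT ... ON DUPLICATE KEY UPDATE</code>, which updates in place when the <code>supply_key</code> unique constraint is hit;</p>

<pre><code class="language-sql">-- A second run with new values simply overwrites the row for 'A'
INSERT INTO supplies (supply_key, description, cost)  
VALUES ('A', 'foo', 1)  
ON DUPLICATE KEY UPDATE  
  description = VALUES(description),
  cost = VALUES(cost);
</code></pre>

<p>Unlike <code>REPLACE</code>, which deletes and re-inserts the row (churning the surrogate <code>id</code> and firing any <code>DELETE</code> triggers), this form preserves the existing primary key.</p>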

<h3 id="type2addnewrow">Type 2 - Add new row</h3>

<p>This approach involves creating a new row for a record that has changed and delineating it from existing versions.  The <a href="https://en.wikipedia.org/wiki/Slowly_changing_dimension">Wikipedia</a> article mentions two common approaches; an incrementing <code>version</code> column grouped on <code>supply_key</code> or a combination of <code>start</code> and <code>end</code> dates. <br>
I favor the latter approach for two reasons.  First, it provides temporal relevance, which is useful for a wide variety of reporting and auditing purposes; second, it naturally provides an easy way to determine the current record.  It's a far easier query to find which record has a <code>NULL</code> <code>end</code> date than which <code>version</code> is the largest in a group.  For example, getting a list of current supplies with dates could be as simple as;  </p>

<pre><code class="language-sql">SELECT * FROM supplies WHERE ended_at IS NULL;  
</code></pre>

<p>With incrementing version columns the same result could be achieved with the following more complex statement;  </p>

<pre><code class="language-sql">SELECT *  
FROM supplies  
INNER JOIN  
  (SELECT supply_key, MAX(version) AS version 
   FROM supplies 
   GROUP BY supply_key) AS current
  ON supplies.supply_key = current.supply_key  
 AND supplies.version = current.version
</code></pre>

<p>That being said, let's see how we can automate this and lift the cognitive burden off the developer.  First let's start with a few assumptions to keep this simple.</p>

<ul>
<li>Updates for a particular <code>supply_key</code> will be inserted in order</li>
<li>Only the <code>start</code> date is supplied and is assumed to be the <code>end</code> date of the preceding record version.</li>
<li>There is always a current record for a given <code>supply_key</code> with no end date.  Yes, this isn't realistic as you can never remove supplies with this constraint but we're going to roll with it for the sake of discussion.</li>
</ul>

<p>Given these requirements our new supplies table may look something like;</p>

<pre><code class="language-sql">DROP TABLE IF EXISTS supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key VARCHAR(10) NOT NULL,
     description VARCHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
     ended_at DATETIME,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at),
     KEY (supply_key, ended_at)
);
</code></pre>

<p>Of note;</p>

<ul>
<li>The start date is mandatory, the end date is not since for a given <code>supply_key</code> there is always one record with a NULL <code>ended_at</code> date.</li>
<li>Our <code>UNIQUE</code> key constraint now covers both the natural key for the supply and the start date.</li>
</ul>

<p>Our automation should handle the following forms;</p>

<ul>
<li>A record <a href="https://en.wikipedia.org/wiki/Merge_(SQL)">UPSERT</a> with an assumed start date equal to when the record was inserted</li>
<li>A record UPSERT with an explicit start date</li>
<li>A record UPSERT with an explicit end date</li>
</ul>

<pre><code class="language-sql">INSERT INTO supplies SET supply_key='A', description='foo', cost='1';  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=NOW();  
INSERT INTO supplies SET supply_key='A', description='foo', cost='1', started_at=DATE_SUB(NOW(), INTERVAL 1 DAY), ended_at=NOW();  
</code></pre>

<p>Ideally we'd use something like the following trigger to handle this automatically.</p>

<pre><code class="language-sql">DROP TRIGGER IF EXISTS supplies_before_insert;  
DROP TRIGGER IF EXISTS supplies_after_insert;  
delimiter |

CREATE TRIGGER supplies_before_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.ended_at=NULL;
  END;
|
CREATE TRIGGER supplies_after_insert AFTER INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    UPDATE supplies SET ended_at=NEW.started_at WHERE supply_key=NEW.supply_key AND ended_at IS NULL AND id!=NEW.id;
  END;
|

delimiter ;  
</code></pre>

<p><strong>Unfortunately Mysql does not permit a trigger to update the same table it was fired from.</strong>  The reasons for this are deadlocks and infinite loops: the update in the trigger would indeed cause the trigger to fire again, and so on and so forth.</p>

<p>Our likely approach here is to push this onto the application using something like the following;  </p>

<pre><code class="language-sql">START TRANSACTION;  
  INSERT INTO supplies 
          SET supply_key='B', description='bar', cost=2; 

  SELECT @end_date:=MAX(started_at) 
    FROM supplies 
   WHERE supply_key='B';

  UPDATE supplies 
     SET ended_at=@end_date 
   WHERE supply_key='B' 
     AND ended_at IS NULL 
     AND id!=last_insert_id(); 
COMMIT;  
</code></pre>
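<p>With start and end dates maintained, a point-in-time lookup becomes a simple range check.  A sketch (the timestamp is illustrative);</p>

<pre><code class="language-sql">-- State of supply 'B' as it was at a given moment
SELECT *  
  FROM supplies  
 WHERE supply_key = 'B'  
   AND started_at &lt;= '2017-01-01 00:00:00'  
   AND (ended_at IS NULL OR ended_at &gt; '2017-01-01 00:00:00');
</code></pre>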

<p>Pros</p>

<ul>
<li>Orders can safely reference the supplies table and remain accurate over time.</li>
<li>Simple reporting and trending</li>
<li>Efficient use of covering index to locate current records.</li>
</ul>

<p>Cons</p>

<ul>
<li>Your supplies table may become very large with historic data depending on the rate of change.  </li>
<li>The business logic to maintain historical records must be observed by the client applications of the database meaning decentralized logic if not fronted by a service API.</li>
<li>The way this particular insert query is written coupled with the automated <code>started_at</code> date field could lead to unnecessary duplication.  Consider a field that may toggle between multiple states over time.</li>
<li>The above INSERT, SELECT, UPDATE combination isn't the most efficient but robustly handles automatic <code>started_at</code> values as well as specified ones.</li>
</ul>

<h3 id="type3addnewattribute">Type 3 - Add new attribute</h3>

<p>In this approach the system only keeps track of the original &amp; current values of selected fields and retains one record per supply.  In the following example the fields <code>description</code> and <code>cost</code> are of particular interest.</p>

<pre><code class="language-sql">DROP TABLE IF EXISTS supplies;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     original_description CHAR(30) DEFAULT '',
     original_cost INT DEFAULT 0,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key)
);
</code></pre>

<p>We generally want two guarantees from a system with this approach; <br>
1. Inserts automatically fill the <code>original_*</code> fields. <br>
2. Updates preserve the <code>original_*</code> fields.</p>

<p>This can be done with triggers in Mysql with the following.</p>

<pre><code class="language-sql">DROP TRIGGER IF EXISTS supplies_insert;  
DROP TRIGGER IF EXISTS supplies_update;  
delimiter |

CREATE TRIGGER supplies_insert BEFORE INSERT ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = NEW.description;
    SET NEW.original_cost = NEW.cost;
  END;
|

CREATE TRIGGER supplies_update BEFORE UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    SET NEW.original_description = OLD.original_description;
    SET NEW.original_cost = OLD.original_cost;
  END;
|

delimiter ;  
</code></pre>

<p>With these triggers in place you can safely use standard <code>INSERT</code> and <code>UPDATE</code> statements or use the following <code>UPSERT</code> form negating the need to know upfront whether your application already contains a record for a particular <code>supply_key</code>.</p>

<pre><code class="language-sql">INSERT INTO supplies SET  
    supply_key='B', description='bar', cost=2 
  ON DUPLICATE KEY UPDATE 
    cost=VALUES(cost), description=VALUES(description);
</code></pre>
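<p>With these triggers in place, a quick sanity check (a sketch) shows the <code>original_*</code> fields surviving an update;</p>

<pre><code class="language-sql">INSERT INTO supplies SET supply_key='C', description='baz', cost=5;  
UPDATE supplies SET cost=7 WHERE supply_key='C';  
-- cost is now 7 while original_cost remains 5
SELECT cost, original_cost FROM supplies WHERE supply_key='C';
</code></pre>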

<p>Pros</p>

<ul>
<li>Simple to implement in either Mysql or code space</li>
<li>Guarantees at the database level can be made to preserve the <code>original_*</code> fields regardless of the client interacting with the database.</li>
</ul>

<p>Cons</p>

<ul>
<li>Limited use in terms of accurate record keeping.  A de-normalized copy of the supplies data at the point of use must be made for accurate record keeping.</li>
<li>Trend reporting is impossible.  Only original and current values are tracked.  Trending on de-normalized orders is not recommended since there is no guarantee that an order was made when a supply was at a particular cost.</li>
</ul>

<h3 id="type4addhistorytable">Type 4 - Add history table</h3>

<p>Aside from strategy <a href="http://stuartingram.com:80/2017/02/19/record-versioning-with-mysql-2/#type2addnewrow">Type 2</a>, which retains history in the same table, the other common approach is to separate historic records from current records in separate tables.</p>

<p>The following creates 2 tables, a <code>supplies</code> table and a <code>supplies_archive</code> table based on the structure of the current supplies table.  The current <code>supplies</code> table still needs to know when the current record became relevant and so we need the <code>started_at</code> date.  In the <code>supplies_archive</code> we also need an <code>ended_at</code> date.</p>

<pre><code class="language-sql">DROP TABLE IF EXISTS supplies;  
DROP TABLE IF EXISTS supplies_archive;  
CREATE TABLE supplies (  
     id INT NOT NULL AUTO_INCREMENT,
     supply_key CHAR(10) NOT NULL,
     description CHAR(30) NOT NULL,
     cost INT DEFAULT 0,
     started_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP
                                ON UPDATE CURRENT_TIMESTAMP,
     PRIMARY KEY (id),
     UNIQUE KEY (supply_key, started_at)
);
CREATE TABLE supplies_archive LIKE supplies;  
ALTER TABLE supplies_archive  
  ADD COLUMN ended_at DATETIME NOT NULL AFTER started_at;
</code></pre>

<p>Here we can set up a trigger to automatically create a new record in the <code>supplies_archive</code> table.</p>

<pre><code class="language-sql">DROP TRIGGER IF EXISTS supplies_after_update;  
delimiter |

CREATE TRIGGER supplies_after_update AFTER UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    IF NEW.cost != OLD.cost 
       OR NEW.description != OLD.description THEN
      INSERT INTO supplies_archive 
      SELECT NULL, NEW.supply_key, OLD.description, OLD.cost, OLD.started_at, NEW.started_at ;
    END IF;
  END;
|

delimiter ;  
</code></pre>

<p>Note that this does not protect against <code>UPDATE</code> statements that explicitly set the <code>started_at</code> date and nothing else, which breaks our desired behavior. </p>

<p>As of MySQL 5.5, you can use the SIGNAL syntax to throw an exception to assist in refining TRIGGER behavior:</p>

<pre><code class="language-sql">SIGNAL sqlstate '45000' SET message_text = 'My Error Message';  
</code></pre>

<p>State 45000 is a generic state representing "unhandled user-defined exception".</p>
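<p>For example, a <code>BEFORE UPDATE</code> trigger (a sketch, assuming the Type 4 tables above) could reject updates that move <code>started_at</code> without changing any tracked field;</p>

<pre><code class="language-sql">delimiter |

CREATE TRIGGER supplies_before_update BEFORE UPDATE ON supplies  
  FOR EACH ROW
  BEGIN
    -- No archive row would be written for this change, so refuse it
    IF NEW.started_at != OLD.started_at
       AND NEW.cost = OLD.cost
       AND NEW.description = OLD.description THEN
      SIGNAL SQLSTATE '45000'
        SET MESSAGE_TEXT = 'started_at may not be changed on its own';
    END IF;
  END;
|

delimiter ;
</code></pre>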

<p>In any approach to a complex problem with many entry points there are workarounds.  The above solution is far from robust but is limited by the capabilities of Mysql triggers. The goal here is to provide as much consistency as possible at the database level which remains the lowest common denominator between modes of interaction, whether they are multiple clients or developers with direct access to the database.  Of course this statement is highly dependent on design and deployment environment.  For instance if you have a storage API in front of the database and ban any other method of interaction your design evaluation changes significantly.</p>

<p>If you have a dedicated storage API I would recommend taking the archive logic and encoding it simply in code space, sidestepping the limitations of triggers and the bifurcation of business logic across application and database code spaces.  By encoding business logic at the application level the code is also significantly more portable, assuming the use of an <a href="https://en.wikipedia.org/wiki/Object-relational_mapping">ORM</a>.</p>

<p>If you have multiple clients with direct access to the database, triggers are a useful tool to protect data contract expectations but do have limitations that are sometimes hard to work around.  Dependency on triggers and stored procedures also reduces portability and behavior transparency.</p>

<p>This article is not meant to be definitive by any stretch.  As with all things relating to software design, your mileage will vary depending on your particular needs and requirements.</p>]]></content:encoded></item><item><title><![CDATA[Joy and pain with @Scheduled and @RefreshScope in SpringBoot]]></title><description><![CDATA[<p>| <strong>TLDR;</strong> <code>@Scheduled</code> and <code>@RefreshScope</code> are powerful tools but do not work together out of the box, causing dangerous inconsistencies.  Find out how to get them to play nicely, plus more advanced scheduling opportunities.</p>

<p>So I'm a fan of the <a href="https://projects.spring.io/spring-boot/">SpringBoot</a> framework and its adoption of the <a href="http://rubyonrails.org/">Rails</a> "convention over configuration"</p>]]></description><link>http://stuartingram.com:80/2016/11/07/joy-and-pain-with-schedule-and-refreshscope-in-springboot-2/</link><guid isPermaLink="false">433d3588-ff0a-4a25-bec1-9f09ffa645fe</guid><category><![CDATA[springboot]]></category><category><![CDATA[scheduling]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Mon, 07 Nov 2016 21:39:39 GMT</pubDate><content:encoded><![CDATA[<p>| <strong>TLDR;</strong> <code>@Scheduled</code> and <code>@RefreshScope</code> are powerful tools but do not work together out of the box, causing dangerous inconsistencies.  Find out how to get them to play nicely, plus more advanced scheduling opportunities.</p>

<p>So I'm a fan of the <a href="https://projects.spring.io/spring-boot/">SpringBoot</a> framework and its adoption of the <a href="http://rubyonrails.org/">Rails</a> "convention over configuration" philosophy, as well as its powerful and simple annotations.  All of which make a developer's life more productive and allow the team to focus on delivering functional value to end users.</p>

<p>One of those annotations is <code>@Scheduled</code> which can be applied to any bean method.  Only two pieces of code are required;  </p>

<p>1. the enabling of scheduling in the application, typically achieved with something like the following (note the use of <code>@EnableScheduling</code>):  </p>

<pre><code class="language-java">@SpringBootApplication
@EnableScheduling
public class ExampleApplication {  
  public static void main(String[] args) {
    SpringApplication.run(ExampleApplication.class, args);
  }
}
</code></pre>

<p>2. And the actual method you want repeated.  For instance;  </p>

<pre><code class="language-java">@Slf4j
@Configuration
public class Followers {

  @Value("${follower.count:10}")
  private int followers;

  @Scheduled(fixedRate = 4000)
  public void outputFollowers() {
    log.info("===========&gt; Followers " + followers);
  }
}
</code></pre>

<p>In this example a count of followers is autowired from the <a href="http://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html">spring property hierarchy</a> and written to the log every 4 seconds as per the <code>@Scheduled</code> declaration.  If you're wondering where the <code>log</code> object is defined, this is part of the magic of <a href="https://projectlombok.org/">Project Lombok</a>, which abstracts away a lot of standard Java boilerplate and is provided by the <code>@Slf4j</code> annotation on the class.</p>

<p><code>@Scheduled</code> is a very powerful and simple annotation.  Common forms of it include;</p>

<ul>
<li><strong>Fixed rate</strong> - Repeat every x milliseconds. <code>@Scheduled(fixedRate=1000)</code></li>
<li><strong>Fixed delay</strong> - Repeat every x milliseconds between previous completion and next start. <code>@Scheduled(fixedDelay=1000)</code></li>
<li><strong>Crontab</strong> - Defines a cadence using the same expressions as *nix crontab definitions.  <code>@Scheduled(cron="0 0 * * * *")</code></li>
</ul>

<p>A great introduction to this topic can be found <a href="http://howtodoinjava.com/spring/spring-core/4-ways-to-schedule-tasks-in-spring-3-scheduled-example/">here</a>.</p>

<p>So at this point in the article we change tack over to automatic property updating.  As previously mentioned, SpringBoot has an extensive <a href="http://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html">property hierarchy</a> but no native way to refresh properties should they change.  Enter <a href="https://cloud.spring.io/spring-cloud-config/">Spring Cloud Config</a>.  The main focus of this project is to establish a centralized repository of configuration for a distributed set of applications and have those applications update automatically if properties change.  A great introduction to externalized properties and centralized property management with Spring Cloud Config can be found <a href="https://spring.io/blog/2015/01/13/configuring-it-all-out-or-12-factor-app-style-configuration-with-spring">here</a>.</p>

<p>Spring Cloud Config automatically provides a JMX interface and an HTTP endpoint (<code>/refresh</code>) to refresh all properties in the application in classes marked with the <code>@RefreshScope</code> annotation.  Meaning if the external property source changes, all you have to do is hit <code>/refresh</code> on your application and the configuration changes are automatically pulled in.</p>

<p>And for the most part this works nicely and seamlessly as you would expect... except it doesn't work with the original <code>@Scheduled</code> code example at the start of the article.  In fact, everything but <code>@Scheduled</code> annotated code will refresh, causing system inconsistencies within an application.  The problem here is that the scheduled method (<code>outputFollowers()</code>) has a dependency on a property injected by the Spring framework, and even when refreshed the property change is not propagated down into the scheduled code.  A discussion on this can be found in <a href="https://github.com/spring-cloud/spring-cloud-commons/issues/97">common Spring Cloud issues</a>.</p>

<p>The solution, rather than relying on the magic of the <code>@Scheduled</code> annotation, is to specify the scheduling configuration manually.  For example;</p>

<pre><code class="language-java">@SpringBootApplication
public class ExampleApplication {  
  public static void main(String[] args) {
    SpringApplication.run(ExampleApplication.class, args);
  }
}
</code></pre>

<pre><code class="language-java">@Slf4j
@RefreshScope
@Configuration
public class Followers {

  @Value("${follower.count:10}")
  private int followers;

  public void outputFollowers() {
    log.info("===========&gt; Followers " + followers);
  }
}
</code></pre>

<pre><code class="language-java">@Configuration
@EnableScheduling
public class SchedulingConfiguration implements SchedulingConfigurer {

  @Autowired
  Followers followers;

  @Override
  public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
    taskRegistrar.addTriggerTask(
        () -&gt; followers.outputFollowers(),
        triggerContext -&gt; {
          Instant nextTriggerTime = Instant.now().plus(4, SECONDS);
          return Date.from(nextTriggerTime);
        });
  }
}
</code></pre>

<p>Tackling the problem this way now yields an application that refreshes properties throughout on demand and consistently.</p>

<p>While initially this is somewhat of a pain, it is first and foremost a solution, and it also enables more complex scheduling.</p>

<p>For instance you could build a trigger context that dynamically calculates the next time to run, potentially throttling an action per time period. See <a href="http://stackoverflow.com/questions/14630539/scheduling-a-job-with-spring-programmatically-with-fixedrate-set-dynamically">here</a> for an example.</p>

<p>Scheduled code is limited to a single thread by default, meaning there is an assumption that the previous call should be finished before the next execution.  When this assumption is incorrect, an execution thread pool is necessary, which again can be manually configured such as;</p>

<pre><code class="language-java"> @Configuration
 @EnableScheduling
 public class AppConfig implements SchedulingConfigurer {

     @Override
     public void configureTasks(ScheduledTaskRegistrar taskRegistrar) {
         taskRegistrar.setScheduler(taskExecutor());
     }

     @Bean(destroyMethod="shutdown")
     public Executor taskExecutor() {
         return Executors.newScheduledThreadPool(100);
     }
 }
</code></pre>

<p>Further reading can be found on the <a href="http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/scheduling/annotation/EnableScheduling.html">@EnableScheduling</a> java docs.</p>

<p>You can of course do far more than just the simple examples above but this should be enough to firstly resolve any problems between the SpringBoot <code>@Scheduled</code> annotation and the live configuration updates you can attain by incorporating the <code>@RefreshScope</code> annotation from the Spring Cloud Config project.</p>

<p>Enjoy.</p>]]></content:encoded></item><item><title><![CDATA[Testing rake tasks efficiently in JRuby]]></title><description><![CDATA[<p>After writing <a href="https://github.com/singram/cucumber_characteristics">cucumber_characteristics</a> to profile where a large <a href="http://jruby.org/">JRuby</a> cucumber integration suite was taking its time, it soon became apparent that 30% of the time was wrapped up in testing rake tasks.</p>

<p>This is particularly challenging in JRuby as the usual approach to testing rake tasks is to execute</p>]]></description><link>http://stuartingram.com:80/2016/11/05/testing-rake-tasks-effeciently-in-jruby/</link><guid isPermaLink="false">97f16e33-41c2-436a-b3d4-27b4d175041b</guid><category><![CDATA[integration testing]]></category><category><![CDATA[jruby]]></category><category><![CDATA[rake]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Sat, 05 Nov 2016 18:22:06 GMT</pubDate><content:encoded><![CDATA[<p>After writing <a href="https://github.com/singram/cucumber_characteristics">cucumber_characteristics</a> to profile where a large <a href="http://jruby.org/">JRuby</a> cucumber integration suite was taking its time, it soon became apparent that 30% of the time was wrapped up in testing rake tasks.</p>

<p>This is particularly challenging in JRuby as the usual approach to testing rake tasks is to execute them in a new shell process, capturing the output and testing any mutative effects.  Something like;  </p>

<pre><code class="language-ruby">When /^I run the rake task$/ do  
  @output = `rake some_task`
end  
</code></pre>

<p>This approach is problematic specifically when using JRuby.  The above invocation needs to start up a completely new JVM per test, taking several seconds each time.  In the test suite I was working with, this alone accounted for <strong>45 minutes</strong> on an average developer machine.</p>

<p>Clearly a significant problem, even with Continuous Integration.</p>

<p>To tackle this I wrote <a href="https://github.com/singram/cucumber_rake_runner">cucumber_rake_runner</a> which executes a rake task in the same JVM process as the integration test, negating the cost of spinning up a new JVM process per rake test.  The original invocation simply becomes.</p>

<pre><code class="language-ruby">When /^I run the rake task$/ do  
  @output = run_rake_task('some_task')
end  
</code></pre>

<p>The <code>@output</code> captures <code>STDOUT</code>, <code>STDERR</code> and timing information.</p>

<p>This was immensely useful to the project and team I was working with, reducing the time to run the full integration suite by over 30%. Hopefully this will be useful to you as well.</p>]]></content:encoded></item><item><title><![CDATA[Profiling ruby cucumber integration tests]]></title><description><![CDATA[<p>I've been working on a large JRuby project for a number of years now with over a thousand integration tests.  With so many tests it's important to know where time is being spent to see if there are optimizations that can be made to improve the overall performance of the</p>]]></description><link>http://stuartingram.com:80/2016/10/27/profiling-ruby-cucumber-integration-tests/</link><guid isPermaLink="false">14f4c113-377c-48b3-8759-bfbf82bfa7d8</guid><category><![CDATA[Ruby]]></category><category><![CDATA[integration testing]]></category><category><![CDATA[cucumber]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Thu, 27 Oct 2016 18:26:38 GMT</pubDate><content:encoded><![CDATA[<p>I've been working on a large JRuby project for a number of years now with over a thousand integration tests.  With so many tests it's important to know where time is being spent to see if there are optimizations that can be made to improve the overall performance of the test suite.  Locating highly utilized steps and dead code is also a normal part of code curation.</p>

<p>With this in mind, I wrote the <a href="https://github.com/singram/cucumber_characteristics">cucumber_characteristics</a> gem.  The gem requires very little configuration; essentially it is a formatter that should drop into your existing setup transparently, generating HTML and/or JSON reports.  Installation and usage instructions can be found on github <a href="https://github.com/singram/cucumber_characteristics">here</a>.</p>

<p>For each cucumber step definition executed the following is reported;</p>

<ul>
<li>Location of definition &amp; regex</li>
<li>Step usage location and number of times executed (background/outline etc)</li>
<li>Counts for success/failure/pending/etc</li>
<li>Total time taken in test run along with average, fastest, slowest times per step</li>
</ul>

<p>For each feature test, the following is reported;</p>

<ul>
<li>Location and time taken to run feature</li>
<li>Result and number of steps run</li>
<li>Breakdown of feature by individual example run if a Scenario Outline.</li>
</ul>

<p>There is also added support to list out all unused steps in a cucumber test run to aid step curation.   Be aware that if you are only running a specific test set, for example via a tag, you will see a larger number of reported unused steps that are not ‘true’ unused steps.</p>

<p>The gem supports ruby 1.9+ and cucumber 1.x &amp; 2.x</p>

<p>Hope this is useful.  Please get in touch if there are further enhancements that would be useful or, better yet, submit a pull request.</p>]]></content:encoded></item><item><title><![CDATA[Mysql conditional INSERTS]]></title><description><![CDATA[<p>Every now and again it's useful to have slightly more control over a MySQL insert than simply making it idempotent via the <code>IGNORE</code> keyword.  For example;  </p>

<pre><code class="language-sql">INSERT IGNORE INTO foo (id, column_bar) values (1, 'aaa'),  (2, 'bbb');  
</code></pre>

<p>The <code>IGNORE</code> keyword will simply skip over any primary or unique key constraint</p>]]></description><link>http://stuartingram.com:80/2016/10/06/mysql-conditional-inserts/</link><guid isPermaLink="false">ea881e54-4c37-49e8-a880-44664e548042</guid><category><![CDATA[Mysql]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Thu, 06 Oct 2016 02:09:10 GMT</pubDate><content:encoded><![CDATA[<p>Every now and again it's useful to have slightly more control over a MySQL insert than simply making it idempotent via the <code>IGNORE</code> keyword.  For example:  </p>

<pre><code class="language-sql">INSERT IGNORE INTO foo (id, column_bar) values (1, 'aaa'),  (2, 'bbb');  
</code></pre>

<p>The <code>IGNORE</code> keyword will simply skip over any primary or unique key constraint violations, essentially making the above statement idempotent assuming a primary key on <code>id</code>.</p>

<p>However, let us suppose we have a data set without a primary key, or, more precisely, that the data we want to insert has more complex conditional requirements.   Unfortunately MySQL's <code>INSERT</code> statement does not directly allow for greater selectivity, but the <code>SELECT</code> statement does, allowing us to take advantage of the <code>INSERT...SELECT...</code> form.</p>

<pre><code class="language-sql">CREATE TEMPORARY TABLE tmp_users LIKE users;  
INSERT INTO tmp_users VALUES ( ....... *default user list*)

INSERT INTO users SELECT * FROM tmp_users WHERE .... &lt;conditional logic here&gt;

-- Optional as temporary tables only exist for duration of session.
DROP TABLE tmp_users;  
</code></pre>

<p>Admittedly this is a contrived example, where there would likely be a <code>UNIQUE</code> key on <code>username</code> that could be taken advantage of in the <code>INSERT IGNORE...</code> statement format.  However, it illustrates how more complex logic can be wrapped around an <code>INSERT</code> statement when needed, without any supporting code or stored procedures.</p>
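
<p>To make the pattern concrete, here is a hypothetical sketch; the <code>users</code> table, its columns and the condition are invented for illustration:</p>

<pre><code class="language-sql">CREATE TEMPORARY TABLE tmp_users LIKE users;  
INSERT INTO tmp_users (id, username) VALUES (1, 'admin'), (2, 'guest');

-- Only insert default users whose username is not already taken
INSERT INTO users
  SELECT t.* FROM tmp_users t
  WHERE NOT EXISTS (SELECT 1 FROM users u WHERE u.username = t.username);

DROP TABLE tmp_users;  
</code></pre>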

<p>See <a href="https://dev.mysql.com/doc/refman/5.7/en/insert.html">here</a> for full documentation on the MySQL <code>INSERT</code> statement.</p>]]></content:encoded></item><item><title><![CDATA[Simple SpringBoot profiles]]></title><description><![CDATA[<blockquote>
  <p><strong>TLDR</strong> SpringBoot profiles: what they are and how to use them, with a simple Flyway example.</p>
</blockquote>

<p>Many frameworks have the concept of scoping application settings together around the concept of environments, examples being dev, test, stage &amp; production.  Largely what you scope around is irrelevant but these examples are the most</p>]]></description><link>http://stuartingram.com:80/2016/10/04/simple-springboot-profiles/</link><guid isPermaLink="false">b19fffe6-630a-40df-85e8-33fcc86c9250</guid><category><![CDATA[springboot]]></category><category><![CDATA[gradle]]></category><category><![CDATA[flyway]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Tue, 04 Oct 2016 17:31:39 GMT</pubDate><content:encoded><![CDATA[<blockquote>
  <p><strong>TLDR</strong> SpringBoot profiles: what they are and how to use them, with a simple Flyway example.</p>
</blockquote>

<p>Many frameworks have the concept of scoping application settings together around the concept of environments, examples being dev, test, stage &amp; production.  Largely what you scope around is irrelevant, but these are the most common examples.</p>

<p>So what do I mean and how is this useful?  Well, for local development you probably want the local database credentials in your application properties, and you may also have threads turned down or certain services disabled.  Whatever is most appropriate to facilitate and accelerate local development for you and your team.  Clearly the application settings you run against in production will be different from performance, debugging and security perspectives, and should thus be managed separately.</p>

<p>The <a href="https://12factor.net">12 Factor Application</a> manifesto also offers a great read on configuration management from a different perspective, which is well worth your time; see <a href="https://12factor.net/config">here</a>.</p>

<p>SpringBoot supports this concept in the form of <em><a href="http://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-profiles.html">profiles</a></em> which are also analogous to the Ruby on Rails runtime environments (see <a href="http://guides.rubyonrails.org/configuring.html#creating-rails-environments">here</a>).</p>

<p><strong>How do I use profiles?</strong></p>

<p>Very simply, a SpringBoot application's default properties are specified in <code>src/main/resources/application.properties</code>.  Profile specific properties can be specified in the same file or in a separate file of the following format: <code>src/main/resources/application-&lt;profilename&gt;.properties</code>, such as <code>src/main/resources/application-dev.properties</code>.  Profile properties override the default properties in much the same way CSS does.  Profile properties do not need to specify all properties, only the ones you wish to change from the default <code>application.properties</code> set.</p>

<p>Assuming you have <a href="http://docs.spring.io/spring-boot/docs/current/reference/htmlsingle/#production-ready">Actuator</a> in your class path (if not, why not?!?), any <code>info.*</code> properties are exposed through the <code>/info</code> endpoint.  This makes it especially useful for exposing build &amp; release information as well as the profile under which the application is running.</p>

<p>For instance in <code>application.properties</code> you may have  </p>

<pre><code class="language-properties">info.profile=default  
spring.jackson.serialization.write-dates-as-timestamps=false  
management.context-path=/actuator  
</code></pre>

<p>and in <code>application-dev.properties</code> you may have  </p>

<pre><code class="language-properties">info.profile=dev  
</code></pre>

<p>Meaning that when running your application with the <code>dev</code> profile enabled, the <code>/actuator/info</code> endpoint will yield something like  </p>

<pre><code class="language-json">{
  "profile": "dev"
}
</code></pre>

<p><strong>So how do you pass in the desired profile to your SpringBoot application?</strong>  Very simply:</p>

<pre><code>$ SPRING_PROFILES_ACTIVE=dev gradle bootRun
</code></pre>

<p>Just like any SpringBoot property, there's a hierarchy down which it searches (see <a href="http://docs.spring.io/spring-boot/docs/current/reference/html/boot-features-external-config.html">here</a>), so this doesn't have to be an environment variable (useful for a containerization strategy) but could also be a property in your default <code>application.properties</code> file or specified some other way.</p>
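
<p>For instance, the same selection can be made with the standard <code>spring.profiles.active</code> property instead of the environment variable:</p>

<pre><code class="language-properties"># Equivalent to SPRING_PROFILES_ACTIVE=dev
spring.profiles.active=dev  
</code></pre>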

<p><strong>Flyway example</strong></p>

<p>Recently a problem arose at work where default development data needed to be automatically loaded into all local development environments but only local development environments.  Using <a href="https://flywaydb.org/">Flyway</a> to manage database migrations and data assets this became trivial to implement with the help of profiles.</p>

<p>Schema migrations were located in the default <code>resources/db/migration</code> location and development environment specific migrations/data assets were located in <code>resources/db/dev</code>.</p>

<p>With the database files in place all that was needed was a <code>application-dev.properties</code> file with  </p>

<pre><code class="language-properties">info.environment=dev  
flyway.locations=classpath:db/migration,classpath:db/dev  
</code></pre>

<p>This did two things:</p>

<ul>
<li>published the runtime profile to the <code>/info</code> endpoint provided by Actuator</li>
<li>overrode the default locations flyway examines for migrations and callback files to include both the standard schema migrations as well as any <code>dev</code> environment specific files.</li>
</ul>

<p>One item of note is that in this particular case, the development need was for default data.  With this in mind, while the data file(s) could be versioned using the standard versioned Flyway <a href="https://flywaydb.org/documentation/migration/sql">naming schema</a>, this requires some consideration to make sure that the versions of the dev data assets and schema migrations do not clash.  Flyway also supports callbacks, which are the perfect solution to this problem (see <a href="https://flywaydb.org/documentation/callbacks">here</a>); in particular the <code>afterMigrate</code> hook.  Be sure to make your migration idempotent, as it will run every time on startup regardless of the number of migrations executed.</p>
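
<p>For example, a callback file such as <code>db/dev/afterMigrate.sql</code> could seed default data idempotently; the table and values here are hypothetical:</p>

<pre><code class="language-sql">-- afterMigrate.sql runs after every successful migration, so it must be idempotent
INSERT IGNORE INTO users (id, username) VALUES (1, 'dev-user');  
</code></pre>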

<p>Simple when you know how, but sometimes the documentation isn't that transparent.  Hope this helps.  A full working example can be found on github <a href="https://github.com/singram/spring-boot-profiles">here</a>.</p>]]></content:encoded></item><item><title><![CDATA[Spring-boot Schema based multi tenancy]]></title><description><![CDATA[<blockquote>
  <p><strong>TLDR;</strong> This article will explain multi tenancy, focusing in on the SCHEMA strategy and how to implement it in two simple steps using Spring Boot and Hibernate.</p>
</blockquote>

<p>Multi-tenancy is the sharing of process and infrastructure across multiple customers or tenants efficiently.  The alternative to this is having a siloed application</p>]]></description><link>http://stuartingram.com:80/2016/10/02/spring-boot-schema-based-multi-tenancy/</link><guid isPermaLink="false">b268eb40-ec7f-4410-8fad-56b5bdf77076</guid><category><![CDATA[springboot]]></category><category><![CDATA[hibernate]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Sun, 02 Oct 2016 13:06:26 GMT</pubDate><content:encoded><![CDATA[<blockquote>
  <p><strong>TLDR;</strong> This article will explain multi tenancy, focusing in on the SCHEMA strategy and how to implement it in two simple steps using Spring Boot and Hibernate.</p>
</blockquote>

<p>Multi-tenancy is the sharing of process and infrastructure across multiple customers or tenants efficiently.  The alternative to this is having a siloed application environment per customer.  This brings its own problems, such as:</p>

<ul>
<li>Linearly scaling infrastructure costs (assuming equal customers).</li>
<li>Inefficient use of infrastructure.</li>
<li>Divergent infrastructure &amp; configuration without strict infrastructure automation and change management.</li>
<li>High overheads to keep multiple environments up to date and in sync.</li>
<li>Opens the door to solution forks under business pressure which incurs huge technical debt and operational overhead as teams now must support multiple solution versions.</li>
</ul>

<p>Ofttimes multi tenancy offers the easiest way to scale customer growth while minimizing infrastructure and operational costs.</p>

<p>There are a few principal approaches to multi tenancy:  </p>

<h4 id="discriminatorstrategy">Discriminator Strategy</h4>

<p>The discriminator pattern works on a single database service and single schema for all tenants. Constituent tenants are discriminated by a specific strategy such as a <code>tenant_id</code> field embedded in tables containing tenant specific data.  Beyond the pros/cons below, this strategy is a non-starter for use cases which legally require 'air-space' between tenants. <br>
<strong>Pros</strong></p>

<ul>
<li>Single database and schema instance to manage</li>
<li>Single schema to backup</li>
<li>Single schema to archive, upgrade etc.</li>
<li>Simple reporting across tenants (e.g. <code>SELECT .... GROUP BY tenant_id</code>)</li>
<li>Single database service account to manage per application.</li>
<li>Single database instance to tune and maintain.</li>
</ul>

<p><strong>Cons</strong></p>

<ul>
<li>Tenant data is interwoven meaning backup &amp; restore is an all or nothing proposition.</li>
<li>Care needs to be taken with every database interaction that the data returned is appropriately scoped.</li>
<li>If your database goes down, all your customers go down, therefore necessitating a high availability strategy, which is generally a good idea but essential in this strategy.</li>
<li>If a table becomes corrupted it becomes corrupted for all users.</li>
<li>If a tenant leaves, it can be tricky to extract and archive the information
<ul><li>If that tenant comes back it can be trickier to reinsert the data and easier to integrate from scratch.</li>
<li>While storage is cheap, performance is not, and an inactive tenant in a single schema will take up database buffer pool resources simply by its existence, through indices alone.</li></ul></li>
<li>Because it is likely that a single service account will be used to access the schema, and all tenants reside in that schema, it can be challenging to trace database load to specific tenant usage.</li>
<li>As a single database service is serving all tenants, performance is subject to "noisy neighbors".</li>
</ul>
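
<p>As a sketch of the scoping concern above, every tenant-specific query must carry the discriminator; the table and columns are hypothetical:</p>

<pre><code class="language-sql">-- Omitting the tenant_id predicate would silently leak other tenants' data
SELECT id, username FROM users WHERE tenant_id = 42;  
</code></pre>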

<p>Scaling can be problematic depending on the underlying storage technology chosen, due to the monolithic nature of the schema.  If a traditional <a href="https://en.wikipedia.org/wiki/Relational_database_management_system">RDBMS</a> is chosen, replicas can be employed for read scaling and a sharding strategy for write scaling.  If using an RDBMS, this particular strategy lends itself well to use cases where historic data can be archived, leaving just hot data in the primary database system.  These considerations change if using a storage service such as <a href="https://aws.amazon.com/rds/aurora/">AWS Aurora</a> or <a href="https://www.mongodb.com/">MongoDB</a>, where read/write scaling is handled transparently as part of the storage service layer and is not a concern of the application itself.  In addition, schema upgrades can be challenging given the volume of potential data and the fact that all customers are affected simultaneously.  Even with a backing technology supporting 'online schema updates', the application may have to consider supporting multiple data schema versions until the schema update is complete.</p>

<h4 id="schemastrategy">Schema Strategy</h4>

<p>The schema strategy employs a single database server, like the <code>DISCRIMINATOR</code> strategy, but specifies a schema instance per tenant, meaning that each tenant has complete isolation at the data layer from other tenants. <br>
<strong>Pros</strong></p>

<ul>
<li>Tenant data is robustly isolated from other tenant data
<ul><li>This in turn makes for simpler, more robust application development.  However, the application must be tenant aware and capable of switching tenants reliably.</li>
<li>Schema &amp; table corruption affects only a single tenant </li>
<li>Ad-hoc queries are automatically scoped to a single tenant.</li></ul></li>
<li>Granular backups can be taken and restored with ease &amp; in parallel.</li>
<li>Tenants can be migrated to and from different environments easily.</li>
<li>Instrumentation is available on a per schema basis allowing the attribution of load and bottlenecks to specific tenant generated load.</li>
<li>Single database service account to manage per application.</li>
<li>Single database instance to tune and maintain.</li>
</ul>

<p><strong>Cons</strong></p>

<ul>
<li>As a single database service is serving all tenants, performance is subject to noisy neighbors similar to the <code>DISCRIMINATOR</code> strategy.  However it is trivial to move problem customers onto dedicated databases should the need arise.</li>
<li>If your database goes down, all your customers go down, again necessitating a good failover strategy.</li>
<li>Tooling needs to be built to handle schema updates, backups and restores of the tenant schemas within an environment.</li>
<li>Reporting across tenants requires additional tooling.</li>
<li>De-normalization of common reference tables may be necessary or a 'common/admin' schema employed and shared by all tenants.  This in itself can assist in some of the maintenance tooling mentioned.</li>
</ul>

<h4 id="databasestrategy">Database Strategy</h4>

<p>The database strategy takes the <code>SCHEMA</code> strategy one step further whereby each tenant has a separate schema instance on a separate database.</p>

<p><strong>Pros</strong></p>

<ul>
<li>Tenant data is robustly isolated from other tenant data
<ul><li>This in turn makes for simpler, more robust application development.  </li>
<li>Schema &amp; table corruption affects only a single tenant</li></ul></li>
<li>Granular backups can be taken and restored with ease &amp; in parallel.</li>
<li>Tenants can be migrated to and from environments easily.</li>
<li>Instrumentation is available on a per schema basis allowing the attribution of load and bottlenecks to specific tenant generated load.</li>
<li>"Noisy neighbor" problems are eliminated at the database layer.</li>
</ul>

<p><strong>Cons</strong></p>

<ul>
<li>Multiple databases instances to tune and maintain.</li>
<li>Additional infrastructure cost of the multiple database instances.</li>
<li>A connection pool per tenant per application is now required (assuming the application layer is multi tenant) which may require additional tuning when considering the number of application instances you need to scale to and the overhead each connection incurs on your storage service.</li>
<li>Multiple database service accounts to manage per application.
<ul><li>This assumes that an application will switch between tenants and therefore needs connection credentials to all databases, making this strategy equal, from a security standpoint, to a single service account.</li></ul></li>
<li>If a database goes down, only a single tenant is affected.</li>
<li>Tooling needs to be built to handle schema updates, backups and restores of the entire environment.</li>
<li>Reporting across tenants requires additional tooling.
<ul><li>This may be complicated by the multiple service accounts to connect with each database.</li></ul></li>
</ul>

<h4 id="concludingstrategythoughts">Concluding strategy thoughts</h4>

<p>The pros/cons of each strategy are entirely subjective to the use-case under consideration.  From a general standpoint I personally favor the <code>SCHEMA</code> approach, having seen it work successfully in production many times.  I also believe it strikes the right balance between pragmatic pros &amp; cons, as well as offering architectural escape routes should performance and scaling problems arise.</p>

<p>Further reading can be found <a href="https://docs.jboss.org/hibernate/orm/5.2/userguide/html_single/Hibernate_User_Guide.html#multitenacy">here</a> on the website of <a href="http://hibernate.org/">Hibernate</a>, the default ORM for <a href="https://projects.spring.io/spring-boot/">SpringBoot</a> applications.</p>

<h3 id="implementingtheschemastrategy">Implementing the SCHEMA strategy</h3>

<p>So now we've taken a quick high-level tour of the main multi tenant strategies, let's run through what it takes to add one to a typical Spring Boot application.  Here we'll be employing the <code>SCHEMA</code> strategy.  It's actually surprising how trivial and flexible it is.</p>

<p>As a quick side note, while the <code>SCHEMA</code> &amp; <code>DATABASE</code> strategies are supported as of Hibernate 4.1, support for the <code>DISCRIMINATOR</code> pattern was introduced in 5.x (see <a href="https://docs.jboss.org/hibernate/orm/4.2/devguide/en-US/html/ch16.html#d5e4780">here</a> for more details).</p>

<h4 id="step1tenantawareness">Step 1. Tenant awareness</h4>

<p>So first things first.  For an application to be multi tenant, it must have a way to detect and store the correct tenant for the transaction it is serving.</p>

<p>For the purposes of this entry we will assume a simple tenant naming schema where the name of the <code>tenant id</code> matches the name of the tenant schema in the database.  We will also assume we are starting from a simple SpringBoot MVC CRUD application with <a href="https://en.wikipedia.org/wiki/Representational_state_transfer">RESTful</a> API.  A basic example can be found on the SpringBoot guide page <a href="http://spring.io/guides/gs/rest-service/">here</a> or you can look at the full working example documented here on <a href="https://github.com/singram/spring-boot-multitenant/">github</a>. <br>
The following will serve as our tenant storage interface, storing the tenant as data against the current thread (see <a href="http://stackoverflow.com/questions/817856/when-and-how-should-i-use-a-threadlocal-variable">here</a> for more information on <code>ThreadLocal</code> usage).</p>

<pre><code class="language-java">public class TenantContext {

  final public static String DEFAULT_TENANT = "test";

  private static ThreadLocal&lt;String&gt; currentTenant = new ThreadLocal&lt;String&gt;()
  {
    @Override
    protected String initialValue() {
      return DEFAULT_TENANT;
    }
  };

  public static void setCurrentTenant(String tenant) {
    currentTenant.set(tenant);
  }

  public static String getCurrentTenant() {
    return currentTenant.get();
  }

  public static void clear() {
    currentTenant.remove();
  }
}
</code></pre>

<p>One thing to note here is the <code>DEFAULT_TENANT</code>.  This is necessary from a Spring framework point of view to initialize the connection pool to the database, and Hibernate (see later) will complain on initial startup of the application if this is null while a multi-tenant strategy is in place.  This can be implemented much more cleanly in Java 8+ than in the code sample above.  The <code>DEFAULT_TENANT</code> could be a real tenant, but if that makes you uneasy you could use a demo/empty tenant, or your architecture may have the concept of a shared 'master' database for centralized tenant and shared dictionary management.</p>
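
<p>As a sketch of that Java 8+ cleanup, <code>ThreadLocal.withInitial</code> replaces the anonymous subclass; the class name here is hypothetical:</p>

<pre><code class="language-java">public class TenantContextJava8 {

  public static final String DEFAULT_TENANT = "test";

  // withInitial supplies the default lazily, replacing the initialValue() override
  private static final ThreadLocal&lt;String&gt; currentTenant =
      ThreadLocal.withInitial(() -&gt; DEFAULT_TENANT);

  public static void setCurrentTenant(String tenant) {
    currentTenant.set(tenant);
  }

  public static String getCurrentTenant() {
    return currentTenant.get();
  }

  public static void clear() {
    currentTenant.remove();
  }
}
</code></pre>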

<p><strong>But how does this get set?</strong>  Our tenant could be passed in the header, subdomain (e.g. <a href="http://tenantid.myapp.com/">http://tenantid.myapp.com/</a>....), URI (e.g. <a href="http://myapp.com/tenant_id/">http://myapp.com/tenant_id/</a>....), a cookie, or ideally as part of the authentication strategy, such as a property in a <a href="https://jwt.io/">JWT</a>.</p>

<p>For this example we will use a simple HTTP header property (<code>X-TenantID</code>).  You should absolutely <strong>not</strong> use this strategy in any production application under any circumstances; this approach is purely to simplify the concepts.</p>

<p>Regardless of the vehicle for the tenant data, it is desirable to have the multi tenant mechanics isolated away from, and as invisible to, the main application as much as possible.  For instance, no tenant specific business logic should ever be visible in the controllers.  To this end, the HTTP <a href="http://docs.spring.io/spring/docs/current/javadoc-api/org/springframework/web/servlet/handler/HandlerInterceptorAdapter.html">HandlerInterceptorAdapter</a> class is perfect for this and requires two additions to our application; the interceptor itself and the configuration to hook the interceptor in.</p>

<pre><code class="language-java">@Component
public class TenantInterceptor extends HandlerInterceptorAdapter {

  private static final String TENANT_HEADER = "X-TenantID";

  @Override
  public boolean preHandle(HttpServletRequest req, HttpServletResponse res, Object handler)
      throws Exception {

    String tenant = req.getHeader(TENANT_HEADER);
    boolean tenantSet = false;

    if(StringUtils.isEmpty(tenant)) {
      res.setStatus(HttpServletResponse.SC_BAD_REQUEST);
      res.setContentType(MediaType.APPLICATION_JSON_VALUE);
      res.getWriter().write("{\"error\": \"No tenant supplied\"}");
      res.getWriter().flush();
    } else {
      TenantContext.setCurrentTenant(tenant);
      tenantSet = true;
    }

    return tenantSet;
  }

  @Override
  public void postHandle(
      HttpServletRequest request, HttpServletResponse response, Object handler, ModelAndView modelAndView)
          throws Exception {
    TenantContext.clear();
  }
}
</code></pre>

<p>In the interceptor above, note the logic to return an appropriate response code and message body if a tenant is missing.  This logic becomes unnecessary if the <code>tenant</code> is part of the authentication schema and securely transmitted in a JWT, for instance, which by definition is generated by a trusted entity.</p>

<p>And finally the configuration to wire the interceptor in;</p>

<pre><code class="language-java">@Configuration
public class WebMvcConfig extends WebMvcConfigurerAdapter {

  @Autowired
  HandlerInterceptor tenantInterceptor;

  @Override
  public void addInterceptors(InterceptorRegistry registry) {
    registry.addInterceptor(tenantInterceptor);
  }
}
</code></pre>

<p>It's interesting to note that interceptors can be applied to specific URL path patterns, which opens up the possibility of different tenant strategies for different parts of the application.  For instance, everything under <code>/admin</code> could be handled by a different tenant interceptor which could force the tenant id to <code>ADMIN</code> and use a schema dedicated to centralized management of all the tenants in the system.</p>

<p>At this point, you can test your progress with <code>curl</code> and a simple endpoint responding to <code>GET</code>.</p>

<p>Without an <code>X-TenantID</code> header  </p>

<pre><code>$ curl -v localhost:8080/person/1 | jq .
*   Trying 127.0.0.1...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to localhost (127.0.0.1) port 8080 (#0)
&gt; GET /person/1 HTTP/1.1
&gt; Host: localhost:8080
&gt; User-Agent: curl/7.45.0
&gt; Accept: */*
&gt; 
&lt; HTTP/1.1 400  
&lt; X-Application-Context: application  
&lt; Content-Type: application/json;charset=ISO-8859-1  
&lt; Transfer-Encoding: chunked  
&lt; Date: Thu, 29 Sep 2016 15:04:36 GMT  
&lt; Connection: close  
&lt;  
{ [37 bytes data]
100    31    0    31    0     0   1880      0 --:--:-- --:--:-- --:--:--  2066  
* Closing connection 0
{
  "error": "No tenant supplied"
}
</code></pre>

<p><code>X-TenantID</code> doesn't do anything at this point; we are simply detecting and storing the desired tenant context.  So with any <code>X-TenantID</code> header you should see the following</p>

<pre><code>$ curl -v -H "X-TenantID:test" localhost:8080/person/1 | jq .
*   Trying 127.0.0.1...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0* Connected to localhost (127.0.0.1) port 8080 (#0)
&gt; GET /person/1 HTTP/1.1
&gt; Host: localhost:8080
&gt; User-Agent: curl/7.45.0
&gt; Accept: */*
&gt; X-TenantID:test
&gt; 
&lt; HTTP/1.1 200  
&lt; X-Application-Context: application  
&lt; Content-Type: application/hal+json;charset=UTF-8  
&lt; Transfer-Encoding: chunked  
&lt; Date: Thu, 29 Sep 2016 15:04:30 GMT  
&lt;  
{ [244 bytes data]
100   238    0   238    0     0   5803      0 --:--:-- --:--:-- --:--:--  5950  
* Connection #0 to host localhost left intact
{
  "_links": {
    "self": {
      "href": "http://localhost:8080/person/1"
    }
  },
  "lastName": "Baggins",
  "firstName": "Frodo",
  "updatedAt": "2016-09-25T23:01:10.000+0000",
  "createdAt": "2016-09-25T23:01:10.000+0000"
}
</code></pre>

<h4 id="step2hibernateschemachanging">Step 2. Hibernate schema changing</h4>

<p>So now we have the tenant context, we need to change the schema transparently and reliably.  Remember, we do not want to burden developers with the concern of interacting with the correct context to the detriment of business logic and feature simplicity and scope.  This is a great example of Aspect Oriented Programming (<a href="http://docs.spring.io/spring/docs/current/spring-framework-reference/html/aop.html">AOP</a>).  To this end, as mentioned previously, Hibernate natively supports <code>SCHEMA</code> based multi tenancy and requires three main components.</p>

<ul>
<li><strong>CurrentTenantIdentifierResolver</strong> - Class responsible for resolving the correct tenant</li>
<li><strong>MultiTenantConnectionProvider</strong> - Class responsible for providing and closing tenant connections</li>
<li><strong>Configuration</strong> - Wiring up Hibernate correctly</li>
</ul>

<p>The <code>CurrentTenantIdentifierResolver</code> is remarkably straightforward and is essentially, in this case, a proxy to our <code>TenantContext</code> class.  This would be an appropriate place to handle any transformations necessary between the <code>tenant id</code> and the database schema name for the tenant.  In this example there is a one to one match between the tenant id and schema name, so no transformation is necessary, but that would most likely not be true in a real production app.  Often a naming convention to clearly identify tenant schemas will be useful in a growing production application.  </p>

<pre><code class="language-java">@Component
public class CurrentTenantIdentifierResolverImpl implements CurrentTenantIdentifierResolver {

  @Override
  public String resolveCurrentTenantIdentifier() {
    return TenantContext.getCurrentTenant();
  }

  @Override
  public boolean validateExistingCurrentSessions() {
    return true;
  }
}
</code></pre>
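
<p>If a naming convention were adopted, the transformation would live in <code>resolveCurrentTenantIdentifier</code>; a hypothetical prefix scheme might look like:</p>

<pre><code class="language-java">public class SchemaNames {

  // Hypothetical convention: tenant id "Acme" maps to schema "tenant_acme"
  public static String schemaFor(String tenantId) {
    return "tenant_" + tenantId.toLowerCase();
  }
}
</code></pre>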

<p>The <code>MultiTenantConnectionProvider</code> is again remarkably simple.  Here we are using <a href="https://www.mysql.com/">Mysql</a> as the backing store and the standard <code>USE database;</code> SQL statement to change schemas, which is very cheap from a database cost/performance standpoint.  Errors, such as the tenant database not existing, are propagated up the stack in this example.  </p>

<pre><code class="language-java">@Component
public class MultiTenantConnectionProviderImpl implements MultiTenantConnectionProvider {  
  private static final long serialVersionUID = 6246085840652870138L;

  @Autowired
  private DataSource dataSource;

  @Override
  public Connection getAnyConnection() throws SQLException {
    return dataSource.getConnection();
  }

  @Override
  public void releaseAnyConnection(Connection connection) throws SQLException {
    connection.close();
  }

  @Override
  public Connection getConnection(String tenantIdentifier) throws SQLException {
    final Connection connection = getAnyConnection();
    try {
      connection.createStatement().execute( "USE " + tenantIdentifier );
    }
    catch ( SQLException e ) {
      throw new HibernateException(
          "Could not alter JDBC connection to specified schema [" + tenantIdentifier + "]",
          e
          );
    }
    return connection;
  }

  @Override
  public void releaseConnection(String tenantIdentifier, Connection connection) throws SQLException {
    try {
      connection.createStatement().execute( "USE " + TenantContext.DEFAULT_TENANT );
    }
    catch ( SQLException e ) {
      throw new HibernateException(
          "Could not alter JDBC connection to specified schema [" + tenantIdentifier + "]",
          e
          );
    }
    connection.close();
  }

  @SuppressWarnings("rawtypes")
  @Override
  public boolean isUnwrappableAs(Class unwrapType) {
    return false;
  }

  @Override
  public &lt;T&gt; T unwrap(Class&lt;T&gt; unwrapType) {
    return null;
  }

  @Override
  public boolean supportsAggressiveRelease() {
    return true;
  }

}
</code></pre>

<p>And finally the configuration class to wire Hibernate correctly.  </p>

<pre><code class="language-java">@Configuration
public class HibernateConfig {

  @Autowired
  private JpaProperties jpaProperties;

  @Bean
  public JpaVendorAdapter jpaVendorAdapter() {
    return new HibernateJpaVendorAdapter();
  }

  @Bean
  public LocalContainerEntityManagerFactoryBean entityManagerFactory(DataSource dataSource,
      MultiTenantConnectionProvider multiTenantConnectionProviderImpl,
      CurrentTenantIdentifierResolver currentTenantIdentifierResolverImpl) {
    Map&lt;String, Object&gt; properties = new HashMap&lt;&gt;();
    properties.putAll(jpaProperties.getHibernateProperties(dataSource));
    properties.put(Environment.MULTI_TENANT, MultiTenancyStrategy.SCHEMA);
    properties.put(Environment.MULTI_TENANT_CONNECTION_PROVIDER, multiTenantConnectionProviderImpl);
    properties.put(Environment.MULTI_TENANT_IDENTIFIER_RESOLVER, currentTenantIdentifierResolverImpl);

    LocalContainerEntityManagerFactoryBean em = new LocalContainerEntityManagerFactoryBean();
    em.setDataSource(dataSource);
    em.setPackagesToScan("com.srai");
    em.setJpaVendorAdapter(jpaVendorAdapter());
    em.setJpaPropertyMap(properties);
    return em;
  }
}
</code></pre>

<p>Of particular note, you will see the multi-tenant strategy set to <code>SCHEMA</code>, and our <code>multiTenantConnectionProviderImpl</code> and <code>currentTenantIdentifierResolverImpl</code> classes supplied to the configuration to satisfy that strategy's requirements.  You will also note that we are using the default Hibernate <code>jpaProperties</code> that Spring Boot uses.  This is important to get things like the default naming strategy, which transparently converts snake case in database schemas to camel case in the Java entities (see <a href="http://stackoverflow.com/questions/25283198/spring-boot-jpa-column-name-annotation-ignored/25293929#25293929">here</a>).</p>

<p><strong>And that's really all there is to it.</strong>  When you look at how little code is required and how neatly it is abstracted away from your business logic, it is hard to imagine a cleaner or simpler implementation for Hibernate &amp; Spring to provide.</p>

<p>A full implementation of the code samples above can be found on GitHub (<a href="https://github.com/singram/spring-boot-multitenant">https://github.com/singram/spring-boot-multitenant</a>).</p>

<p>I hope you found this useful.</p>

<p>If you want to read further around the topic and differing approaches, the following articles may be of interest and were of great use in the development of the code and this article.</p>

<ul>
<li><a href="http://anakiou.blogspot.com/2015/08/multi-tenant-application-with-spring.html">http://anakiou.blogspot.com/2015/08/multi-tenant-application-with-spring.html</a></li>
<li><a href="http://fizzylogic.nl/2016/01/24/Make-your-Spring-boot-application-multi-tenant-aware-in-2-steps/">http://fizzylogic.nl/2016/01/24/Make-your-Spring-boot-application-multi-tenant-aware-in-2-steps/</a></li>
<li><a href="http://www.greggbolinger.com/tenant-per-schema-with-spring-boot/">http://www.greggbolinger.com/tenant-per-schema-with-spring-boot/</a></li>
<li><a href="http://jannatconsulting.com/blog/?p=41">http://jannatconsulting.com/blog/?p=41</a></li>
<li><a href="http://stackoverflow.com/questions/29928404/internationalization-by-subdomain-in-spring-boot">http://stackoverflow.com/questions/29928404/internationalization-by-subdomain-in-spring-boot</a></li>
<li><a href="https://dzone.com/articles/stateless-session-multi-tenant">https://dzone.com/articles/stateless-session-multi-tenant</a></li>
<li><a href="http://publicstaticmain.blogspot.com/2016/05/multitenancy-with-spring-boot.html">http://publicstaticmain.blogspot.com/2016/05/multitenancy-with-spring-boot.html</a></li>
</ul>]]></content:encoded></item><item><title><![CDATA[Separating Unit from Integration tests in Java using Gradle]]></title><description><![CDATA[<p>Having spent some significant time in the Ruby community and finding a new found appreciation for clean unit and integration tests, it often befuddles me why there isn't such a clean separation of test responsibility and scope in other languages.</p>

<p>Java has learned a lot from other test frameworks over</p>]]></description><link>http://stuartingram.com:80/2016/09/15/separating-unit-from-integration-tests-in-java-using-gradle/</link><guid isPermaLink="false">d760dc12-5456-4307-bd33-b345933a8c71</guid><category><![CDATA[java]]></category><category><![CDATA[gradle]]></category><category><![CDATA[unit testing]]></category><category><![CDATA[integration testing]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Thu, 15 Sep 2016 02:07:32 GMT</pubDate><content:encoded><![CDATA[<p>Having spent some significant time in the Ruby community and finding a new found appreciation for clean unit and integration tests, it often befuddles me why there isn't such a clean separation of test responsibility and scope in other languages.</p>

<p>Java has learned a lot from other test frameworks over the last decade.  The venerable <a href="http://junit.org/">JUnit</a> test framework has matured significantly, and with the addition of <a href="https://en.wikipedia.org/wiki/Fluent_interface">fluent libraries</a> such as <a href="https://github.com/rest-assured/rest-assured">RestAssured</a>, simple mocking frameworks such as <a href="http://mockito.org/">Mockito</a>, and expressive matching capabilities such as <a href="http://hamcrest.org/JavaHamcrest/">Hamcrest</a>, to name a few, it's now possible to write tests with the readability and focused intent touted by many other languages.</p>

<p>So, having established that Java has some really great testing libraries, that still doesn't address how to use them, and this is part of the problem.  There is no simple way to separate unit tests from integration tests in Java, so why should there be a clean understanding of what a unit or integration test is?  Oftentimes I see purported unit tests in JUnit that are truly integration tests, requiring a database and a full application stack to be stood up, with no true unit test in sight.</p>

<p>Very simply stated:</p>

<ul>
<li>If your unit tests require a database, you're doing it wrong.</li>
<li>If your unit tests require an external service or dependency, you're doing it wrong.</li>
<li>If your unit tests start the Spring framework or an application container, you're doing it wrong.</li>
<li>If your unit tests spend a lot of time setting up preconditions in other classes, you're doing it wrong.</li>
<li>If your integration tests are not using publicly exposed interfaces, you're doing it wrong.</li>
<li>If your integration tests are stubbing or mocking parts of the system, you're doing it wrong.  (Stubbing external services, however, makes sense.)</li>
<li>If your integration tests are not hitting a running application, you're doing it wrong.</li>
</ul>
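
<p>One cheap way to keep the first few rules honest is a guard in the build that fails when a unit test pulls in a forbidden dependency.  The following is only a sketch; the package patterns and directory layout are assumptions to adjust for your own tree.</p>

<pre><code class="language-bash"># Sketch: fail if any unit test imports Spring or JDBC.
# The package patterns and directory argument are assumptions.
check_unit_tests() {
  ! grep -rlE 'import (org\.springframework|java\.sql)' "$1"
}
</code></pre>

<p>Wired in before the <code>test</code> task, <code>check_unit_tests src/test/java</code> fails fast when someone smuggles an application context or a database into the unit suite.</p>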

<p>The design, purpose and intent of unit and integration tests is the subject of a much larger discussion and outside the scope of this post.</p>

<p>So back to the problem at hand.  Having a desire to separate fast-running unit tests from integration tests, I struggled for an answer until I came across the <a href="https://github.com/unbroken-dome/gradle-testsets-plugin">gradle-testsets-plugin</a> and these posts from <a href="https://www.petrikainulainen.net/">Petri Kainulainen</a>: <a href="https://www.petrikainulainen.net/programming/gradle/getting-started-with-gradle-integration-testing/">here</a> and <a href="https://www.petrikainulainen.net/programming/gradle/getting-started-with-gradle-integration-testing-with-the-testsets-plugin/">here</a>.</p>

<p>I would strongly recommend reading both posts, but for brevity, here are the main mechanics and some further tips beyond.</p>

<p><strong>Step 1.</strong></p>

<p>Include <code>jcenter</code> as a source for your build script dependencies and pull in the <a href="https://github.com/unbroken-dome/gradle-testsets-plugin">gradle-testsets-plugin</a> dependency</p>

<pre><code class="language-groovy">buildscript {  
  repositories {
    jcenter()
  }
  dependencies {
    classpath 'org.unbroken-dome.gradle-plugins:gradle-testsets-plugin:1.0.2'
  }
}
</code></pre>

<p><strong>Step 2.</strong></p>

<p>Apply the plugin to the build.  Be sure to activate this after the <code>java</code> plugin and before any plugins which may build off the gradle tasks automatically created by the plugin.  </p>

<pre><code class="language-groovy">apply plugin: 'org.unbroken-dome.test-sets'  
</code></pre>

<p><strong>Step 3.</strong></p>

<p>Create the new test set definition and configuration.  Here we want to add an integration test suite but this could be any category of tests you wish to scope together.  </p>

<pre><code class="language-groovy">testSets {  
  integrationTest
}
</code></pre>

<p>Ensure that the <code>check</code> step executes the new test definition and that the new <code>integrationTest</code> step runs after the normal <code>test</code> (unit) step.  </p>

<pre><code class="language-groovy">check.dependsOn integrationTest  
integrationTest.mustRunAfter test  
</code></pre>

<p>Ensure that integration tests are always run, regardless of whether they passed on previous runs  </p>

<pre><code class="language-groovy">project.integrationTest {  
  outputs.upToDateWhen { false }
}
</code></pre>

<p>Finally, ensure that the output for tasks of type <code>Test</code> is namespaced appropriately so reports are separated for the <code>test</code> (unit) and <code>integrationTest</code> tasks  </p>

<pre><code class="language-groovy">tasks.withType(Test) {  
  reports.html.destination = file("${reporting.baseDir}/${name}")
}
</code></pre>

<p><strong>Step 4.</strong></p>

<p>Test compile dependencies should be reviewed and the new <code>integrationTestCompile</code> dependencies declared appropriately <br>
<em>e.g.</em></p>

<pre><code class="language-groovy">testCompile("junit:junit")  
integrationTestCompile("org.springframework.boot:spring-boot-starter-test",  
                       "com.jayway.restassured:json-path:2.8.0",
                       "com.jayway.restassured:rest-assured:2.8.0",
                       "com.jayway.restassured:spring-mock-mvc:2.8.0",
                       "com.jayway.restassured:xml-path:2.8.0")
</code></pre>

<p><strong>Step 5.</strong></p>

<p>Restructure your test file layout.  Your directory structure should look something like the following.  </p>

<pre><code>src/  
  main/
    java/...
    resources/...
  integrationTest/
    java/...
    resources/...
  test/
    java/...
    resources/...
</code></pre>
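
<p>Assuming the standard layout above, the new source set directories can be created in one go:</p>

<pre><code class="language-bash"># Create the integrationTest source set alongside the existing test tree.
mkdir -p src/integrationTest/java src/integrationTest/resources
</code></pre>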

<p>At this point you should be able to run <code>gradle clean build</code> and see your separate <code>test</code> and <code>integrationTest</code> related tasks execute.</p>

<p><strong>Real time test reporting</strong></p>

<p>To see a visual report of test execution and outcome as it happens in the console, add the following  </p>

<pre><code class="language-groovy">test {  
  afterTest { desc, result -&gt;
    println "Executing test [${desc.className}].${desc.name} with result: ${result.resultType}"
    }
}
integrationTest {  
  afterTest { desc, result -&gt;
    println "Executing test [${desc.className}].${desc.name} with result: ${result.resultType}"
    }
}
</code></pre>

<p><strong>Test Coverage</strong></p>

<p>I use <a href="http://www.eclemma.org/jacoco/">Jacoco</a> for test coverage with the help of the <a href="https://docs.gradle.org/current/userguide/jacoco_plugin.html">Jacoco gradle plugin</a>.  While it would be ideal to have separate coverage reports for the integration and unit test suites, I was unable to find a simple method to generate them independently.  However, you can combine the coverage from both suites with the following:  </p>

<pre><code class="language-groovy">apply plugin: 'jacoco'  
.....
jacoco {  
    toolVersion = "0.7.5.201505241946"
}

jacocoTestReport {  
    reports {
        xml.enabled false
        csv.enabled false
        html{
            enabled true
            destination "${buildDir}/reports/jacoco"
        }
    }
    executionData(test, integrationTest)
}

tasks.build.dependsOn(jacocoTestReport)
</code></pre>

<p>I hope this post has proved useful.  The separation of test types has many benefits, including:</p>

<ul>
<li>forcing developers to think about test types &amp; purpose</li>
<li>enforcing unit test conventions.  If you need anything beyond Java or are firing up an application server, it's not a unit test.</li>
<li>separating fail fast unit tests from potentially costly integration tests</li>
<li>allowing finer control over CI builds and the development process.</li>
</ul>]]></content:encoded></item><item><title><![CDATA[Kubernetes local-up-cluster - Heapster Metrics]]></title><description><![CDATA[<p>As it turns out I still didn't quite have the <a href="http://stuartingram.com/2016/09/02/kubernetes-local-up-cluster-addons-in-ubuntu/">local kubernetes</a> setup right.  The documentation around running some of the standard services with local kubernetes is lacking.  There again, it is primarily geared towards kubernetes development and light weight local testing so getting Heapster up and running is a</p>]]></description><link>http://stuartingram.com:80/2016/09/08/local-kubernetes-heapster-metrics/</link><guid isPermaLink="false">e2d2f81c-336d-46a7-b303-f0836b1a374a</guid><category><![CDATA[kubernetes]]></category><category><![CDATA[Install]]></category><category><![CDATA[docker]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Thu, 08 Sep 2016 19:53:51 GMT</pubDate><content:encoded><![CDATA[<p>As it turns out I still didn't quite have the <a href="http://stuartingram.com/2016/09/02/kubernetes-local-up-cluster-addons-in-ubuntu/">local kubernetes</a> setup right.  The documentation around running some of the standard services with local kubernetes is lacking.  There again, it is primarily geared towards kubernetes development and light weight local testing so getting Heapster up and running is a little outside of the wheelhouse so to speak for the targeted audience.</p>

<p>Assuming you have followed my <a href="http://stuartingram.com/2016/09/02/kubernetes-local-up-cluster-addons-in-ubuntu/">previous steps</a> to get a local kubernetes cluster up and functional, you can get Heapster and Grafana running out of the box with the following</p>

<pre><code class="language-bash">kubectl create -f https://raw.githubusercontent.com/kubernetes/heapster/master/deploy/kube-config/influxdb/heapster-controller.yaml  
kubectl create -f https://raw.githubusercontent.com/kubernetes/heapster/master/deploy/kube-config/influxdb/influxdb-grafana-controller.yaml  
kubectl create -f https://raw.githubusercontent.com/kubernetes/heapster/master/deploy/kube-config/influxdb/influxdb-service.yaml  
kubectl create -f https://raw.githubusercontent.com/kubernetes/heapster/master/deploy/kube-config/influxdb/grafana-service.yaml  
kubectl create -f https://raw.githubusercontent.com/kubernetes/heapster/master/deploy/kube-config/influxdb/heapster-service.yaml  
</code></pre>

<p>Running <code>kubectl cluster-info</code> should yield a URL to the Grafana front end, which you can open up in a browser to view various stats at the node and pod level.  Pretty nice!</p>

<p><strong>But it's empty, right?  There is no data!</strong></p>

<p>Finding the Heapster pod (<code>kubectl get po --all-namespaces=true</code>) and displaying the logs (<code>kubectl logs heapster-0sbna --namespace=kube-system</code>) should yield something like  </p>

<pre><code class="language-bash">E0907 18:47:05.041415       1 kubelet.go:230] error while getting containers from Kubelet: failed to get all container stats from Kubelet URL "http://127.0.0.1:10255/stats/container/": Post http://127.0.0.1:10255/stats/container/: dial tcp 127.0.0.1:10255: getsockopt: connection refused  
</code></pre>

<p>If you run <code>curl http://127.0.0.1:10255/stats/container/</code> from your local host you should see stats returned just fine.</p>

<p><strong>So what's going on?</strong></p>

<p>Well, Heapster has got the list of nodes from Kubernetes and is now trying to pull stats from the kubelet process on each node (each kubelet has a built-in cAdvisor collecting stats on its node).  In this case there's only one node, and it's known to kubernetes as 127.0.0.1.  And there's the problem.  The Heapster container is trying to reach the node at 127.0.0.1, which is the Heapster container itself, and of course it finds no kubelet process to interrogate inside that container.</p>

<p>127.0.0.1 is normally the IP address assigned to the "loopback" or local-only interface. This is a "fake" network adapter that can only communicate within the same host. It's often used when you want a network-capable application to only serve clients on the same host.</p>

<p><strong>So how do we solve this?</strong></p>

<p>As it turns out, two things need to happen: <br>
1. We need to reference the kubelet worker node (our host machine running kubernetes) by something other than the loopback network address of 127.0.0.1 <br>
2. The kubelet process needs to accept traffic from the new network interface/address </p>

<p>Changing the hostname by which the kubelet is referenced is pretty simple.  You can take more elaborate approaches, but setting this to your <code>eth0</code> IP worked fine for me (<code>ifconfig eth0</code>).  The downside is that you need an eth0 interface, and its address is subject to DHCP, so your mileage may vary as to how convenient this is. <br>
<code>export HOSTNAME_OVERRIDE=10.0.2.15</code></p>

<p>Getting the kubelet process to accept traffic from any network interface is just as simple. <br>
<code>export KUBELET_HOST=0.0.0.0</code></p>

<p>So all together the following will start a local kubernetes instance with DNS and the ability for containers to reach and interact with the kubelet process</p>

<pre><code class="language-bash">export KUBERNETES_PROVIDER=local  
export API_HOST=`ifconfig docker0 | grep "inet addr" | awk -F'[: ]+' '{ print $4 }'`  
export KUBE_ENABLE_CLUSTER_DNS=true  
export KUBELET_HOST=0.0.0.0  
export HOSTNAME_OVERRIDE=`ifconfig eth0 | grep "inet addr" | awk -F'[: ]+' '{ print $4 }'`  
hack/local-up-cluster.sh  
</code></pre>
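
<p>The <code>ifconfig | grep | awk</code> pipelines above simply pull the IPv4 address out of the interface description.  You can sanity check the extraction against a captured sample of <code>ifconfig</code> output (the addresses here are made up):</p>

<pre><code class="language-bash"># Verify the awk extraction against a sample of `ifconfig` output.
sample='eth0      Link encap:Ethernet  HWaddr 08:00:27:aa:bb:cc
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0'
echo "$sample" | grep "inet addr" | awk -F'[: ]+' '{ print $4 }'
# prints 10.0.2.15
</code></pre>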

<p>You will of course need to reload all the replication controller and service definitions for Heapster.  But after doing this and waiting a minute or two you should see data accumulate in the graphs.  Data points are recorded every 60 seconds, so give the system time to prove it's working.  You can also check the Heapster pod logs for errors while you wait to verify everything is working.</p>

<p>As an added bonus if you are running the <a href="https://github.com/kubernetes/dashboard">Kubernetes dashboard</a> (see <a href="http://stuartingram.com/2016/09/02/kubernetes-local-up-cluster-addons-in-ubuntu/">here</a> for instructions) you will also get statistics from Heapster fed through to that automatically.  </p>

<p><strong>Awesome sauce!</strong></p>]]></content:encoded></item><item><title><![CDATA[Kubernetes local-up-cluster - dns fixes on Ubuntu]]></title><description><![CDATA[<p>So as it turns out I didn't get too far beyond the <a href="http://stuartingram.com/2016/08/31/installing-kubernetes-on-ubuntu-14-04/">local kubernetes install</a> without running into some issues.  The first being the lack of DNS (I wanted to run the amazing <a href="https://github.com/kubernetes/dashboard">dashboard UI</a>) and then port forwarding to access pod functionality directly.</p>

<p><strong>Ubuntu prerequisites</strong></p>

<p>As it turns out</p>]]></description><link>http://stuartingram.com:80/2016/09/02/kubernetes-local-up-cluster-addons-in-ubuntu/</link><guid isPermaLink="false">7fdfc5f5-0bd1-4549-8b42-99b946d3a99c</guid><category><![CDATA[Ubuntu]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[dns]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Fri, 02 Sep 2016 21:47:08 GMT</pubDate><content:encoded><![CDATA[<p>So as it turns out I didn't get too far beyond the <a href="http://stuartingram.com/2016/08/31/installing-kubernetes-on-ubuntu-14-04/">local kubernetes install</a> without running into some issues.  The first being the lack of DNS (I wanted to run the amazing <a href="https://github.com/kubernetes/dashboard">dashboard UI</a>) and then port forwarding to access pod functionality directly.</p>

<p><strong>Ubuntu prerequisites</strong></p>

<p>As it turns out there are a number of Ubuntu 14.04 specific hurdles to overcome before kubernetes will work happily.</p>

<p>First of all <code>dnsmasq</code> needs to be disabled, so comment out the <code>dns=dnsmasq</code> line in <code>/etc/NetworkManager/NetworkManager.conf</code> and restart networking services via the following  </p>

<pre><code class="language-bash">sudo nano /etc/NetworkManager/NetworkManager.conf  
sudo restart network-manager  
</code></pre>
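
<p>If you prefer a non-interactive edit, a small <code>sed</code> wrapper can do the commenting for you.  This is a sketch; it assumes the stock <code>dns=dnsmasq</code> line is present and keeps a <code>.bak</code> backup of the file:</p>

<pre><code class="language-bash"># Comment out the dns=dnsmasq line in the given config file (keeps a .bak copy).
disable_dnsmasq() {
  sed -i.bak 's/^dns=dnsmasq/#dns=dnsmasq/' "$1"
}
</code></pre>

<p>Run it from a root shell against <code>/etc/NetworkManager/NetworkManager.conf</code>, then restart network-manager as above.</p>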

<p>Find out more about <code>dnsmasq</code> and ubuntu <a href="https://help.ubuntu.com/community/Dnsmasq">here</a></p>

<p>Next the tools <code>socat</code> and <code>nsenter</code> are required for kubernetes port forwarding. <br>
To install <code>socat</code> run  </p>

<pre><code class="language-bash">sudo apt-get install socat  
</code></pre>

<p>Installing <code>nsenter</code> is slightly more work due to the lack of 14.04 support, but not much, thanks to the work of <a href="http://jpetazzo.github.io">Jérôme Petazzoni</a>.  </p>

<pre><code class="language-bash">docker run --rm jpetazzo/nsenter cat /nsenter &gt; /tmp/nsenter &amp;&amp; chmod +x /tmp/nsenter  
sudo cp /tmp/nsenter /usr/local/bin  
</code></pre>

<p>Check out the repo <a href="https://github.com/jpetazzo/nsenter">here</a> or this <a href="https://gist.github.com/mbn18/0d6ff5cb217c36419661">gist</a> if you want to go step by step</p>

<p>You can find out more about <code>socat</code> <a href="http://www.dest-unreach.org/socat/doc/README">here</a> and <code>nsenter</code> <a href="http://man7.org/linux/man-pages/man1/nsenter.1.html">here</a>.</p>

<p><strong>Back to kubernetes</strong></p>

<p>After these steps it's hopefully smooth sailing.  So let's start kubernetes with DNS on by default by running the following</p>

<pre><code class="language-bash">export KUBERNETES_PROVIDER=local  
export API_HOST=`ifconfig docker0 | grep "inet addr" | awk -F'[: ]+' '{ print $4 }'`  
export KUBE_ENABLE_CLUSTER_DNS=true  
hack/local-up-cluster.sh  
</code></pre>

<p>Instructions for validating your DNS setup can be found <a href="https://github.com/kubernetes/kubernetes/blob/master/build/kube-dns/README.md">here</a></p>

<p>Let's add the dashboard  </p>

<pre><code class="language-bash">kubectl create -f https://rawgit.com/kubernetes/dashboard/master/src/deploy/kubernetes-dashboard.yaml  
</code></pre>

<p>This can be accessed via  </p>

<pre><code class="language-bash">firefox http://172.17.0.1/ui  
</code></pre>

<p>From this you can view and manage most of the things you can via the <code>kubectl</code> CLI.</p>

<p><strong>Kind of a gotcha, but not really</strong></p>

<p>One thing to note is that when you terminate the kubernetes process all the docker containers remain running (see <code>docker ps</code>).  This at first caused concern, but remember that kubernetes is designed so that the containers it manages are not themselves dependent on kubernetes to function.  If the scheduler dies, only scheduling is affected; the containers keep running.  This philosophy is consistent throughout the kubernetes system, so it makes sense that shutting it down does not remove all running containers.  A few properties to note: <br>
1. If kubernetes is subsequently spun up, it will reconcile the state of the system with the desired state, as you would expect. <br>
2. Other docker containers can be spun up and down locally, independent of those managed by kubernetes. <br>
These properties are possible because kubernetes applies docker labels to the containers it manages.</p>]]></content:encoded></item><item><title><![CDATA[Dangling docker volumes]]></title><description><![CDATA[<p>As anyone who works with docker knows, images and containers accumulate rapidly.</p>

<p>All containers can be cleared down with  </p>

<pre><code class="language-bash">docker rm $(docker ps -a -q)  
</code></pre>

<p>And likewise, all Images with  </p>

<pre><code class="language-bash">docker rmi -f $(docker images -q)  
</code></pre>

<p>What I wasn’t aware of was the dangling volume issue.  While I had</p>]]></description><link>http://stuartingram.com:80/2016/09/01/dangling-docker-volumes/</link><guid isPermaLink="false">4ee25862-d269-414f-ba5d-10c0f860d7c1</guid><category><![CDATA[docker]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Thu, 01 Sep 2016 16:45:07 GMT</pubDate><content:encoded><![CDATA[<p>As anyone who works with docker knows, images and containers accumulate rapidly.</p>

<p>All containers can be cleared down with  </p>

<pre><code class="language-bash">docker rm $(docker ps -a -q)  
</code></pre>

<p>And likewise, all Images with  </p>

<pre><code class="language-bash">docker rmi -f $(docker images -q)  
</code></pre>

<p>What I wasn’t aware of was the dangling volume issue.  While I had no images or containers left after the above, I did, however, have 20GB taken up in dangling volumes, which I was oblivious to until I wondered where all my disk space had disappeared to.</p>

<p>You can check for dangling volumes independent of containers with  </p>

<pre><code class="language-bash">docker volume ls -f dangling=true  
</code></pre>

<p>And remove them with  </p>

<pre><code class="language-bash">docker volume rm $(docker volume ls -qf dangling=true)  
</code></pre>

<p>Alternatively, you can remove a volume together with its associated container by adding the <code>-v</code> flag (e.g. <code>docker rm -v container_name</code>), if you remember to add the flag every time. <br>
I would suggest incorporating these into your team's purge scripts/procedures for a better cleanup.</p>
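
<p>Putting the pieces together, a combined purge might look like the sketch below.  It is destructive (it removes all local containers, images and dangling volumes), so use with care:</p>

<pre><code class="language-bash"># Remove all containers, then all images, then any dangling volumes.
# Destructive!  The || true keeps it going when a list is empty.
docker_purge() {
  docker rm $(docker ps -a -q) || true
  docker rmi -f $(docker images -q) || true
  docker volume rm $(docker volume ls -qf dangling=true) || true
}
</code></pre>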

<p>More information can be found here (recommended reading)</p>

<ul>
<li><a href="http://serverfault.com/questions/683910/removing-docker-data-volumes">http://serverfault.com/questions/683910/removing-docker-data-volumes</a></li>
<li><a href="http://container42.com/2014/11/03/docker-indepth-volumes/">http://container42.com/2014/11/03/docker-indepth-volumes/</a></li>
</ul>]]></content:encoded></item><item><title><![CDATA[Installing Kubernetes on Ubuntu 14.04]]></title><description><![CDATA[<p>I typically run my linux environment via VirtualBox on a Windows host for mainly corporate reasons.  <a href="https://github.com/kubernetes/minikube">MiniKube</a> is the new recommended way to get up and running with Kubernetes for local development, however this requires a host system capable of running a vm and at this time VirtualBox does not</p>]]></description><link>http://stuartingram.com:80/2016/08/31/installing-kubernetes-on-ubuntu-14-04/</link><guid isPermaLink="false">fcea39c2-7a8c-4e87-bde1-1284315dc378</guid><category><![CDATA[docker]]></category><category><![CDATA[kubernetes]]></category><category><![CDATA[Ubuntu]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Wed, 31 Aug 2016 13:24:29 GMT</pubDate><content:encoded><![CDATA[<p>I typically run my linux environment via VirtualBox on a Windows host for mainly corporate reasons.  <a href="https://github.com/kubernetes/minikube">MiniKube</a> is the new recommended way to get up and running with Kubernetes for local development, however this requires a host system capable of running a vm and at this time VirtualBox does not support 64bit nested VM's.  With that in mind here are the steps I took to install kubernetes locally, mostly taken from <a href="https://github.com/kubernetes/kubernetes/blob/release-1.3/docs/devel/running-locally.md">this</a> guide.</p>

<p><strong>Install Docker</strong></p>

<pre><code class="language-bash">apt-get install apparmor lxc cgroup-lite  
wget -qO- https://get.docker.com/ | sh  
sudo usermod -aG docker YourUserNameHere  
sudo service docker restart  
</code></pre>

<p><strong>Install OpenSSL</strong></p>

<pre><code class="language-bash">sudo apt-get install openssl  
</code></pre>

<p><strong>Install etcd</strong></p>

<pre><code class="language-bash">curl -L https://github.com/coreos/etcd/releases/download/v3.0.6/etcd-v3.0.6-linux-amd64.tar.gz -o etcd-v3.0.6-linux-amd64.tar.gz  
tar xzvf etcd-v3.0.6-linux-amd64.tar.gz &amp;&amp; cd etcd-v3.0.6-linux-amd64  
sudo mv etcd /usr/local/bin  
etcd --version  
</code></pre>

<p>Original install instructions <a href="https://github.com/coreos/etcd/releases">here</a></p>

<p><strong>Install Go 1.6+</strong></p>

<p>Remember to remove any previous version installed.</p>

<pre><code class="language-bash">wget https://storage.googleapis.com/golang/go1.7.linux-amd64.tar.gz  
tar xzf go1.7.linux-amd64.tar.gz  
export GOPATH="/home/singram/personal"  
export GOROOT="/home/singram/go/"  
export PATH=$PATH:$GOROOT/bin:$GOPATH/bin

go get -u github.com/jteeuwen/go-bindata/go-bindata  
</code></pre>

<p>Full instructions can be found <a href="https://golang.org/doc/install">here</a></p>

<p><strong>Install Kubernetes</strong></p>

<pre><code class="language-bash">mkdir -p $GOPATH/src  
cd $GOPATH/src  
git clone --depth=1 https://github.com/kubernetes/kubernetes.git  
</code></pre>

<p><strong>Build and Run kubernetes</strong></p>

<pre><code class="language-bash">hack/local-up-cluster.sh  
</code></pre>

<p>Beware: you will most likely be prompted for your root password towards the end of the build process.  If you let this time out, your system will have a number of processes running which are somewhat annoying to clean up.  If this happens, restarting the system proved the simplest method to reset and retry this step.</p>

<p>If successful, you should have a kubernetes system up and running.</p>

<p><strong>Configure Kubectl</strong></p>

<p>From the previous step you should see some output similar to the commands below.  Open up a new shell and execute the following to set up your <code>~/.kube/config</code>  </p>

<pre><code class="language-bash">export KUBERNETES_PROVIDER=local  
cluster/kubectl.sh config set-cluster local --server=http://127.0.0.1:8080 --insecure-skip-tls-verify=true  
cluster/kubectl.sh config set-context local --cluster=local  
cluster/kubectl.sh config use-context local  
cluster/kubectl.sh  
</code></pre>

<p>From this point on you have a working kubernetes system.  You can either use <code>cluster/kubectl.sh</code> or simply install <code>kubectl</code> separately on your system.  The config file in your home directory is the important part; it is what both versions of kubectl will key off.</p>
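
<p>For reference, the resulting <code>~/.kube/config</code> should look roughly like the following (a sketch of what those commands generate, with fields abridged):</p>

<pre><code class="language-yaml">apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    server: http://127.0.0.1:8080
    insecure-skip-tls-verify: true
contexts:
- name: local
  context:
    cluster: local
current-context: local
</code></pre>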

<p>Check out your kubernetes cluster nodes (there'll only be one)  </p>

<pre><code class="language-bash">kubectl get no  
kubectl describe no 127.0.0.1  
</code></pre>

<p>What about your pods  </p>

<pre><code class="language-bash">kubectl get pods  
</code></pre>

<p>And now you should have a fully working locally hosted kubernetes cluster of one.  Superb!</p>]]></content:encoded></item><item><title><![CDATA[Links that caught my eye]]></title><description><![CDATA[<p>First part of a recurring series.</p>

<ul>
<li><a href="http://varianceexplained.org/r/trump-tweets/">Great NLP analysis of tweets from Mr Trump</a></li>
<li><a href="https://peteris.rocks/blog/openstreetmap-city-blocks-as-geojson-polygons/">Interpreting OpenStreet maps to blocks</a></li>
<li><a href="http://mewo2.com/notes/terrain/">Generating fantasy maps</a></li>
<li><a href="https://blog.hartleybrody.com/scrape-amazon/">19 lessons learned scraping Amazon</a></li>
<li><a href="http://roy.red/slitscan-.html">Recreating the Doctor Who Time Tunnel in GLSL</a></li>
<li><a href="http://jvns.ca/blog/2016/08/10/how-does-gdb-work/">How does GDB work?</a>
<ul><li><a href="http://www.brendangregg.com/blog/2016-08-09/gdb-example-ncurses.html">GDB Example</a></li></ul></li>
<li><a href="https://blog.codeship.com/level-up-your-security-in-rails/">Rails security tips</a></li>
<li><a href="https://boxfuse.com/blog/go-aws">Deploy 7 MB Go VMs effortlessly</a></li></ul>]]></description><link>http://stuartingram.com:80/2016/08/15/interesting-links-2/</link><guid isPermaLink="false">82815aa2-b322-455b-a690-7a9908424ed8</guid><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Mon, 15 Aug 2016 17:31:00 GMT</pubDate><content:encoded><![CDATA[<p>First part of a recurring series.</p>

<ul>
<li><a href="http://varianceexplained.org/r/trump-tweets/">Great NLP analysis of tweets from Mr Trump</a></li>
<li><a href="https://peteris.rocks/blog/openstreetmap-city-blocks-as-geojson-polygons/">Interpreting OpenStreetMap into blocks</a></li>
<li><a href="http://mewo2.com/notes/terrain/">Generating fantasy maps</a></li>
<li><a href="https://blog.hartleybrody.com/scrape-amazon/">19 lessons learned scraping Amazon</a></li>
<li><a href="http://roy.red/slitscan-.html">Recreating the Doctor Who Time Tunnel in GLSL</a></li>
<li><a href="http://jvns.ca/blog/2016/08/10/how-does-gdb-work/">How does GDB work?</a>
<ul><li><a href="http://www.brendangregg.com/blog/2016-08-09/gdb-example-ncurses.html">GDB Example</a></li></ul></li>
<li><a href="https://blog.codeship.com/level-up-your-security-in-rails/">Rails security tips</a></li>
<li><a href="https://boxfuse.com/blog/go-aws">Deploy 7 MB Go VMs effortlessly to AWS</a></li>
<li><a href="http://flink.apache.org/news/2016/08/08/release-1.1.0.html">Apache Flink 1.1.0 released</a></li>
</ul>]]></content:encoded></item><item><title><![CDATA[Packaging a git tag]]></title><description><![CDATA[<p>So the other day I was presented with the following requirements. </p>

<p>From a git repository, retrieve a historical tag and its commit history to deliver to a client.  No other branches should be presented to the client, nor any work committed after the tag. </p>

<p>This actually proved to be a little</p>]]></description><link>http://stuartingram.com:80/2013/08/24/packaging_a_git_tag/</link><guid isPermaLink="false">5909decd-9274-4f37-a127-72604519b907</guid><category><![CDATA[Git]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Sat, 24 Aug 2013 22:40:00 GMT</pubDate><content:encoded><![CDATA[<p>So the other day I was presented with the following requirements. </p>

<p>From a git repository, retrieve a historical tag and its commit history to deliver to a client.  No other branches should be presented to the client, nor any work committed after the tag. </p>

<p>This actually proved to be a little tricky, and I'm certain I'm missing some git wizardry, but here's what I did. </p>

<p>Clone the repository (foo) to work on locally  </p>

<pre><code class="language-bash">git clone myname@github:foo  
</code></pre>

<p>Check out the tag and create a branch from it:  </p>

<pre><code class="language-bash">cd foo  
git checkout mytag_1.0.0  
git checkout -b mytag_1.0.0_snapshot  
</code></pre>

<p>Remove all other local branches and clean up the repository:  </p>

<pre><code class="language-bash">git branch -D master  
git gc  
</code></pre>

<p>At this point you should have a local repository with a single local branch representing the tag you want and a number of references to remote branches.  This can be verified with  </p>

<pre><code class="language-bash">git branch -a  
</code></pre>

<p>Now clone your local repository again  </p>

<pre><code class="language-bash">cd ..  
git clone foo foo_final  
</code></pre>

<p>The foo_final repository should contain nothing but the branch representing the tag at this point. <br>
Zip it up, throw it on a flash drive and deliver as appropriate. </p>
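<p>As an aside, newer versions of Git can collapse most of these steps into one: cloning with <code>--branch</code> pointing at a tag plus <code>--single-branch</code> fetches only the history reachable from that tag. A self-contained sketch using a throwaway repository (the repo and tag names here are made up for illustration):</p>

<pre><code class="language-bash"># build a throwaway repo with a tagged commit and a later commit
git init -q demo && cd demo
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "first"
git tag mytag_1.0.0
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "after the tag"
cd ..

# clone only the tag's history; the later commit never comes across
git clone -q --branch mytag_1.0.0 --single-branch demo demo_final
git -C demo_final log --oneline
</code></pre>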

<p>Now I make no claims that this is the best way to do this.  In fact, I'm certain there's a better way, but this is what I ended up doing.</p>]]></content:encoded></item><item><title><![CDATA[Reloading FactoryGirl definitions in a Rails 3.2 console]]></title><description><![CDATA[<p><strong>Problem</strong> </p>

<p>You've developed a rich set of class definitions using FactoryGirl and find them useful while developing and testing in rails console.  The problem is that when you reload! your classes the FactoryGirl definitions are not reloaded causing confusion and errors.  On top of this if, in your application, you</p>]]></description><link>http://stuartingram.com:80/2012/10/14/reloading_factory_girl_definitions_in_a_rails_3_2_console/</link><guid isPermaLink="false">0848430c-5629-4ff7-b89b-364513b5635d</guid><category><![CDATA[Ruby]]></category><category><![CDATA[Factorygirl]]></category><category><![CDATA[Rails]]></category><dc:creator><![CDATA[Stuart Ingram]]></dc:creator><pubDate>Sun, 14 Oct 2012 19:40:00 GMT</pubDate><content:encoded><![CDATA[<p><strong>Problem</strong> </p>

<p>You've developed a rich set of class definitions using FactoryGirl and find them useful while developing and testing in the Rails console.  The problem is that when you <code>reload!</code> your classes, the FactoryGirl definitions are not reloaded, causing confusion and errors.  On top of this, if your application initializes class variables at boot, these are lost as well, adding undue weight to a simple class refresh. </p>

<p>After much searching online, <a href="http://wondible.com/2011/12/30/rails-autoloading-cleaning-up-the-mess/">this article</a> provided useful answers and is well worth a read. </p>

<p><strong>Solution</strong> </p>

<p>Please note that this solution has only been tested in Rails 3.2 with FactoryGirl 4.0.0 </p>

<p>In <code>environments/development.rb</code></p>

<pre><code class="language-ruby">MyApplication.configure do
  # ...
  ActionDispatch::Reloader.to_prepare do
    # On the first boot the factories load normally; on subsequent
    # reloads they must be refreshed explicitly
    FactoryGirl.reload unless FactoryGirl.factories.entries.empty?
    SomeClass.reinitialize unless SomeClass.initialized?
  end
end
</code></pre>
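<p>With that in place, a typical console round-trip looks like this (a sketch; <code>:user</code> stands in for one of your own factories):</p>

<pre><code class="language-ruby"># rails console (development)
FactoryGirl.create(:user)  # built from the definitions loaded at boot
# ... edit the factory file on disk ...
reload!                    # to_prepare fires and FactoryGirl.reload picks up the edit
FactoryGirl.create(:user)  # now reflects the updated definition
</code></pre>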

<p>Reference: <a href="http://wondible.com/2011/12/30/rails-autoloading-cleaning-up-the-mess/">Rails autoloading: cleaning up the mess</a></p>]]></content:encoded></item></channel></rss>