David Hearnden

Functional completeness and local hermeticity

Today, I’m testing new features of Canva’s dynamic flag system. Next to me, a fellow engineer is working on the pipeline that updates design images in response to design edits. These are separate features, and both reach broadly across the components that make up Canva. As we iterate, each of us is running an isolated, functionally complete Canva universe of between 16 and 32 separate components (depending on how you count). We can exercise Canva’s full suite of features: creating and publishing designs, searching images, purchasing and downloading prints, browsing the social graph, interacting with designs in the stream, and so on. We’re not doing this using a vast network of distributed machines; we’re doing this entirely within the confines of our laptops, without even needing a network connection.

Canva’s production environment is quite different. It is distributed, and depends on a long list of services, including AWS (S3, CloudFront, ELB, SQS, SNS, SES, SWF), ZooKeeper, Cassandra, MongoDB, MySQL, Solr, and Redis. While many of these services can be deployed on a developer’s workstation (what we call a “local” environment), many cannot, particularly those from AWS.

Maintaining a functionally complete yet airtight local development environment is something we consider to be of critical importance to our engineering health, since it directly impacts the ease and speed of developing new features as well as diagnosing problems.

Functional completeness means that anything that can be done on www.canva.com can be done in this local environment.

Local hermeticity means that the scope of dependencies and side-effects does not extend beyond a single machine. This post describes, with practical examples, how we achieve functional completeness and local hermeticity in our development environments in a way that is transparent to application logic.

Configuration flavors

We maintain a handful of environment configurations that we call flavors, including local for local development, and prod for production. A flavor name is passed to a component at runtime, either as an environment variable or command argument, and our application launchers use that name for dynamic selection of flavor-specific configuration resources, using filename conventions. For example, flavor-specific property configurations are defined in files named <component>.<flavor>.properties. Flavor names never appear as literals in code. This keeps the set of environments open, and we can introduce ad-hoc flavors, such as loadtest and unittest, without code changes, simply by dropping in suites of suitably-named configuration resources for the relevant components.

As a concrete example, we allocate worker threads in our import server with an import.worker.threads property.

@Named
public final class ImportServer implements ImportService {
  @Inject
  public ImportServer(@Value("${import.worker.threads}") int workerThreads, ...) {
    ...
  }
}

The environment-specific value of this property is controlled by the following properties files:

import.local.properties:

import.worker.threads=2

import.prod.properties:

import.worker.threads=16

Nothing sophisticated is happening here – controlling configuration in properties is pretty standard practice – but it is the mechanism on which we build hermeticity that is transparent to application logic.
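
To make the filename convention concrete, here is a minimal sketch of how a launcher might resolve flavor-specific properties. It is not our actual launcher: the class FlavorProperties, its method names, and the idea of a single configuration directory are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical helper: resolves <component>.<flavor>.properties by filename convention. */
final class FlavorProperties {
  private FlavorProperties() {}

  /**
   * Loads <component>.<flavor>.properties from configDir. The flavor is only
   * ever a runtime value, so no flavor name appears as a literal here.
   */
  static Properties load(Path configDir, String component, String flavor) {
    Path file = configDir.resolve(component + "." + flavor + ".properties");
    Properties props = new Properties();
    try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
      props.load(in);
    } catch (IOException e) {
      throw new UncheckedIOException("Cannot load " + file, e);
    }
    return props;
  }

  /** Reads a required integer property, failing loudly if it is missing. */
  static int requiredInt(Properties props, String key) {
    String value = props.getProperty(key);
    if (value == null) {
      throw new IllegalStateException("Missing property: " + key);
    }
    return Integer.parseInt(value.trim());
  }
}
```

A launcher could read the flavor from an environment variable or a command argument and pass it straight to load; adding a new flavor then only requires dropping a new file next to the existing ones.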

Most of the services we rely on are popular open-source tools, and functional instances can be installed locally with apt-get or brew. Usually, some minor additional configuration is required to lower resource consumption, allowing several such services to co-exist happily on a single machine. Completeness and hermeticity for these services is then a simple matter of using flavor-specific configuration to control addressing. For example, we configure ZooKeeper hosts as follows:

zk.local.properties:

zk.host=localhost:2181
...

zk.staging.properties:

zk.host=10.0.32.55,10.0.33.55:2181
...

zk.prod.properties:

zk.host=10.0.32.4,10.0.33.4,10.0.34.4,10.0.35.4:2181
...

Dependency injection with fakes

For the remaining services that are not locally deployable, particularly services from AWS, we achieve functional completeness using abstracted interfaces and dependency-injected fakes.

Our application code never refers to AWS services directly, but instead references our own minimal interfaces that define only the parts and modes of those services that are required. For each of these interfaces, we’ve written multiple implementations, including one that is a simple pass-through to the AWS SDK, and another that uses an implementation strategy suitable for local, hermetic development. The abstract interfaces for those services, which we end up re-using across many of our components, only include the subset of AWS functionality that we use, keeping the fake implementations simple. The implementation to use in a given environment is named using a flavor-specific property, and bound at runtime using dependency injection.

For example, the Canva component that imports images requires a message-queue service (QueueClient) and a file storage service (BlobStore). In our staging and production environments, we bind those interfaces to implementations that forward to SQS and S3 respectively, but in our local development and CI environments, we bind them to implementations that work locally. The local implementations are discussed further in the next section.

[Figure: UML diagram showing interfaces and implementations. Caption: Abstracting AWS from our import pipeline]

aws.local.properties:

blobstore.impl=FileBlobStore
queue.impl=FileQueueClient
...

aws.prod.properties:

blobstore.impl=S3BlobStore
queue.impl=SQSQueueClient
...

Using implementations of an essential service that differ between local development and production is a risk, since it results in code that rarely gets exercised outside production. We mitigate that risk by modelling the intermediate interfaces so that the binding to AWS is as trivial as possible. Nevertheless, all complex services have quirks lurking somewhere, and when they are discovered, we replicate those quirks in the local implementations, to make them as functionally authentic as we can. This is one of the trade-offs that we’ve accepted in exchange for hermeticity.
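
The property-driven binding described in this section can be sketched in plain Java as follows. This is an illustration only: a hand-rolled registry stands in for the dependency injection container, and the put/get signatures, InMemoryBlobStore, and BlobStoreBinder are hypothetical names rather than our real classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Supplier;

/** Minimal blob interface: only the operations the application actually needs. */
interface BlobStore {
  void put(String bucket, String key, byte[] data);
  byte[] get(String bucket, String key);
}

/** Stand-in for a local fake; a file-based one would write to disk instead. */
final class InMemoryBlobStore implements BlobStore {
  private final Map<String, byte[]> blobs = new HashMap<>();

  @Override
  public void put(String bucket, String key, byte[] data) {
    blobs.put(bucket + "/" + key, data.clone());
  }

  @Override
  public byte[] get(String bucket, String key) {
    byte[] data = blobs.get(bucket + "/" + key);
    return data == null ? null : data.clone();
  }
}

/** Hypothetical registry mapping the value of blobstore.impl to a factory. */
final class BlobStoreBinder {
  private final Map<String, Supplier<BlobStore>> factories = new HashMap<>();

  void register(String implName, Supplier<BlobStore> factory) {
    factories.put(implName, factory);
  }

  /** Picks the implementation named by the flavor-specific property. */
  BlobStore bind(Properties props) {
    String implName = props.getProperty("blobstore.impl");
    Supplier<BlobStore> factory = factories.get(implName);
    if (factory == null) {
      throw new IllegalStateException("Unknown blobstore.impl: " + implName);
    }
    return factory.get();
  }
}
```

Application code depends only on BlobStore, so switching between the fake and the AWS-backed implementation is purely a matter of which properties file is loaded.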

Faking services with the filesystem

There are several implementation strategies that are viable for local fakes of services like AWS. Embedded, in-memory implementations are typically quick to implement, and are great for narrow-scope testing, but they only work for components running in the same process/JVM. A logical progression is then to encapsulate that in-memory implementation in a dedicated server, to define a client/server protocol, and to implement fakes as clients of that server; another option is to research and to emulate an existing protocol. In order to be readily available, that server can be installed as a daemon, or developers can start/stop it continuously as they iterate. fake-s3, fake-sqs, and fake-sns are projects that are pursuing this direction. Persisting state across restarts is a subsequent challenge, as is ensuring that state can be conveniently inspected and manipulated out of band; i.e., is “hackable”.

An alternative direction is to leverage an existing service that is always available in development environments: a POSIX file system. It works seamlessly across processes, its state is easy to inspect and manipulate out of band, and it is naturally persistent across restarts. Emulating services on top of the file system inherits these features for free, and you can often write a functional fake in a few hours rather than a few weeks.

With this strategy, we’ve written Java and Python fakes for blob storage, message queues, event notifications, emails, and a workflow engine. The implementations are kept simple: they are not intended to be performant or scalable, but they are completely functional, and have proved to be more than sufficient for local development.

The following sections give a high-level summary of the strategies used by fakes that we substitute for services from AWS during local development. All these fakes store state in a configurable location on the filesystem, typically /var/canva. Paths referenced in examples are relative to that location. To give a sense of scale, the initial implementation of each of these fakes took no more than a day to complete, and each is roughly a few hundred LOC.

Blobs (S3)

The file-based blob service stores blobs in the obvious manner:

s3/<bucket>/<key>

Basic S3 operations, like get and put, map trivially to file operations, and are straightforward to implement. More complex operations, like paginated or prefixed listing, can be implemented simply if some inefficiencies can be tolerated. For example, we implement paginated listing by loading the full result set of all matching files upfront, and paginating in memory.
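
As a rough illustration of that mapping, the sketch below stores blobs under <root>/<bucket>/<key> and paginates a prefixed listing in memory. It is a simplified stand-in rather than our implementation: FileBlobStoreSketch is a hypothetical name, and the numeric offset is an assumption that replaces S3’s marker-based pagination.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical file-backed blob store: one file per blob at <root>/<bucket>/<key>. */
final class FileBlobStoreSketch {
  private final Path root;  // e.g. the s3/ directory

  FileBlobStoreSketch(Path root) {
    this.root = root;
  }

  void put(String bucket, String key, byte[] data) {
    Path file = root.resolve(bucket).resolve(key);
    try {
      Files.createDirectories(file.getParent());  // keys may contain slashes
      Files.write(file, data);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  byte[] get(String bucket, String key) {
    try {
      return Files.readAllBytes(root.resolve(bucket).resolve(key));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Loads every matching key up front, sorts them, then slices out one page. */
  List<String> list(String bucket, String prefix, int offset, int pageSize) {
    Path bucketDir = root.resolve(bucket);
    List<String> keys;
    try (Stream<Path> files = Files.walk(bucketDir)) {
      keys = files.filter(Files::isRegularFile)
          .map(p -> bucketDir.relativize(p).toString())
          .filter(k -> k.startsWith(prefix))
          .sorted()
          .collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    int from = Math.min(offset, keys.size());
    int to = Math.min(from + pageSize, keys.size());
    return new ArrayList<>(keys.subList(from, to));
  }
}
```

Walking the whole bucket for every page is wasteful, but for the handful of files in a local environment it keeps the code short and obviously correct.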

We emulate versioned buckets by marking them with a top-level s3/<bucket>/.versioned file, and appending a version tag to the filename, with a scheme of s3/<bucket>/<key>.<version>.

In order to generate URLs that function in a browser, both for downloads and uploads, we run a ~250 LOC Node.js HTTP server. That server replicates, to the degree that we require, S3’s path-encoding behavior, CORS mechanisms, and access control policies.

Example files:

s3/static.canva.com/images/icon_arrow_down.png
s3/static.canva.com/images/icon_arrow_down_hover.png
s3/static.canva.com/images/icon_arrow_down_on.png
s3/static.canva.com/images/icon_arrow_up.png
s3/static.canva.com/images/icon_arrow_up_hover.png
...

Message queues (SQS)

The file-based message client stores each queue in a single directory, with the following structure:

sqs/<queue-name>/.lock/
sqs/<queue-name>/conf
sqs/<queue-name>/messages
  • .lock is used to establish a cross-process mutex lock, by leveraging the atomicity of mkdir in a POSIX filesystem. All read and write operations on the queue are performed in critical sections scoped by possession of that lock.
  • messages contains the queue messages, one message per line, including the requeue count, visibility deadline, receipt id, and message contents.
  • conf contains the queue configuration; specifically, its redrive policy.

Some queue operations (push) can be implemented efficiently by appending to the messages file, but others (pull/delete) are implemented by preparing an entirely new file, then renaming it to messages. This is another inefficiency concession that turns out to be perfectly acceptable for local development, and keeps the implementation simple. For example, pushing messages to a queue, with an optional delivery delay, is implemented as follows:

  private void lock(File lock) throws InterruptedException {
    while (!lock.mkdir()) {
      Thread.sleep(50);
    }
  }

  private void unlock(File lock) {
    lock.delete();
  }
  
  @Override
  public void push(String queueUrl, Integer delaySeconds, String... messages)
      throws InterruptedException, IOException {
    String queue = fromUrl(queueUrl);
    File messagesFile = getMessagesFile(queue);
    File lock = getLockFile(queue);
    // Delayed messages become visible in the future; all others are visible immediately.
    long visibleFrom = (delaySeconds != null)
        ? now() + TimeUnit.SECONDS.toMillis(delaySeconds)
        : 0L;

    lock(lock);
    try (PrintWriter pw = new PrintWriter(new FileWriter(messagesFile, true))) {  // append
      for (String message : messages) {
        pw.println(Record.create(visibleFrom, message));
      }
    } finally {
      unlock(lock);
    }
  }

Continuing the example of our image import pipeline, here is a snapshot of the state of a local image import queue, contained in sqs/import/messages. The first two lines are in-flight messages on their first attempt; the remaining lines beginning with 0:0:: indicate queued and available messages (no prior attempts, visible from time 0, and no receipt id).

1:1424923314560:fe758b7b-6907-4131-ad1f-a43b17226a81:{"media":"MABJs6beuEg",...}
1:1424923315074:4fc9cedf-b8c2-4463-abfb-343b80833b74:{"media":"MABJs3yZEDI",...}
0:0::{"media":"MABJsxUmBps",...}
0:0::{"media":"MABJs6ktg-M",...}
0:0::{"media":"MABJsy0ef-c",...}
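
To illustrate the rewrite-and-rename approach used for pull, here is a minimal sketch that works on the line format shown above. It is not our implementation: FileQueueSketch is a hypothetical class, the field order (attempt count, visibility deadline, receipt id, contents) is inferred from the snapshot, and the current time is passed in explicitly to keep the example deterministic.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical pull operation over a <attempts>:<visibleFrom>:<receipt>:<body> file. */
final class FileQueueSketch {
  private final Path dir;           // e.g. sqs/<queue-name>
  private final long visibilityMs;  // how long a pulled message stays hidden

  FileQueueSketch(Path dir, long visibilityMs) {
    this.dir = dir;
    this.visibilityMs = visibilityMs;
  }

  /** Returns the body of the first visible message, or null if none is available. */
  String pull(long now) throws InterruptedException {
    Path lock = dir.resolve(".lock");
    Path messages = dir.resolve("messages");
    // mkdir is atomic, so whoever manages to create the directory holds the lock.
    while (!lock.toFile().mkdir()) {
      Thread.sleep(50);
    }
    try {
      List<String> lines = Files.readAllLines(messages, StandardCharsets.UTF_8);
      List<String> rewritten = new ArrayList<>(lines.size());
      String pulled = null;
      for (String line : lines) {
        // Split into at most four fields so colons inside the body survive.
        String[] f = line.split(":", 4);
        if (pulled == null && Long.parseLong(f[1]) <= now) {
          pulled = f[3];
          int attempts = Integer.parseInt(f[0]) + 1;
          line = attempts + ":" + (now + visibilityMs) + ":" + UUID.randomUUID() + ":" + f[3];
        }
        rewritten.add(line);
      }
      if (pulled != null) {
        // Prepare an entirely new file, then rename it over the old one.
        // On POSIX file systems the rename replaces the target atomically.
        Path tmp = dir.resolve("messages.tmp");
        Files.write(tmp, rewritten, StandardCharsets.UTF_8);
        Files.move(tmp, messages, StandardCopyOption.ATOMIC_MOVE);
      }
      return pulled;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    } finally {
      lock.toFile().delete();
    }
  }
}
```

Deleting a message by receipt id would follow the same shape: copy every line except the matching one into the temporary file, then rename.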

Notifications (SNS)

In our use of SNS, the only subscribers to topics are SQS queues. Since this is the only behavior we need to replicate in the file-system client, the implementation is trivial. The file-based notification service encodes each topic as a directory, containing a single queues file that lists the names of the subscribed queues.

sns/<topic>/queues

Publishing a message to a topic is done as follows:

  private final QueueClient queue;

  @Override
  public void publish(String topic, String message) {
    try {
      for (String queueName : Files.readLines(queuesFile(topic), Charsets.UTF_8)) {
        queue.push(queue.getQueueUrl(queueName), message);
      }
    } catch (IOException e) {
      throw Throwables.propagate(e);
    }
  }

Workflows (SWF)

SWF is one of the lesser known services provided by AWS. In the Design Marketplace, designers can submit their content for inclusion in the Canva library as layouts. This submission flow involves capturing the design state, preparing rendered images, a review process, final publishing, and indexing. Some of these tasks are synchronous, some are asynchronous, some are automatic, some are manual. We use SWF to connect these distributed tasks together into a coherent flow.

The file-based workflow engine is the most complex fake we use. Before unfolding workflow state into multiple files and directories, we started with an in-memory implementation of a workflow engine where all state was externalized into serializable classes:

class WorkflowExecution {
  String id;
  WorkflowType type;
  List<HistoryEvent> history;
  String queue;
  Status state;
  Date dateOpened;
  Date dateClosed;
  CloseStatus closeStatus;
}

class ActivityExecution {
  WorkflowExecution workflow;
  String id;
  ActivityType type;
  String input;
  String queue;
  long scheduledEventId;
  long startedEventId;
}

class State {
  /** Workflows indexed by their run id. This map grows continuously. */
  Map<String, WorkflowExecution> workflows = new HashMap<>();
  /** Open workflows, indexed by workflow id. */
  Map<String, WorkflowExecution> openWorkflows = new HashMap<>();
  /** Parent/child workflows: row=parent, column=child, cell=childWFInitiatedEventId */
  Table<String, String, String> openParentChildWorkflows;
  Map<String, Queue<WorkflowExecution>> decisionQueues;
  Map<String, Queue<ActivityExecution>> activityQueues;
  Map<String, WorkflowExecution> activeDecisions;
  Map<String, ActivityExecution> activeActivities;
  /** For id generation. */
  int tokenCounter;
}

Using that state to implement the subset of SWF operations that we use then turns out to be relatively straightforward. From that in-memory implementation, the file-based implementation follows a similar strategy to the file-based queue: the state of a domain is stored in a single file, as JSON, protected by a lock.

swf/<domain>/.lock/
swf/<domain>/engine.json

All workflow operations are implemented by deserializing the full domain state into memory, then using the in-memory engine to perform the operation, then serializing the full state back to disk, all within the scope of holding a lock directory. For the scale of work this engine has to handle during local development, this brute-force method has never warranted further optimization.

public class FileWorkflowEngine implements WorkflowEngine {
  ...
  @Override
  public ActivityTask getActivityTask(String activityQueue) throws InterruptedException {
    long startTime = clock.currentTimeMillis();
    do {
      lock();  // same as in the file-based queue
      try {
        State state = loadState();
        WorkflowEngine delegate = new InMemoryWorkflowEngine(state);
        ActivityTask result = delegate.getActivityTask(activityQueue);
        if (result != null) {
          // Persist the updated state only when a task was actually handed out.
          saveState(state);
          return result;
        }
      } finally {
        unlock();
      }
      // Nothing available yet: the lock is released, so wait and poll again.
      Thread.sleep(POLL_WAIT_MS);
    } while (clock.currentTimeMillis() - startTime < POLL_TIMEOUT_MS);
    return null;
  }
  ...
}

… and the rest

Using the same patterns as above, we configure our local environments to use functional in-memory or file-system fakes for several other services, such as emailing, billing, analytics, and content distribution.

Where to from here?

It only takes the few simple strategies outlined above – clean separation of configuration, local installations for services that are available, and dependency-injected fakes for those that are not – to achieve basic functional completeness and local hermeticity in a way that is transparent to application logic. We hope the examples above give you a sense of how easily you can apply these principles in practice.

As the complexity of our production architecture increases, we’re looking towards more sophisticated techniques to maintain a hermetic development environment but with increased parity with our production deployment. In an upcoming post, Josh Graham will reveal what we’re doing with virtualization and containers in order to achieve this goal, and in another, Brendan Humphreys will talk about adaptive rate limiting.
