
There are many challenges inherent in creating an API. From the organization of the resources, to the interaction model (CQRS and eventual consistency? Purely synchronous?), all the way through to the specific mechanisms for dealing with large collections of data (paging? filtering?), there is a lot of complexity involved in putting a service together.

One such area of complexity is binary data, and the handling thereof. Maybe it's images associated with some entity, or arbitrary files, or maybe something else entirely, but the common thread is that the content is delivered differently to a normal payload (like JSON or XML).

I mean, you're unlikely to serve an image to a user as a base64 encoded chunk in the middle of a JSON document.

Or you might, I don’t know your business. Maybe it seemed like a really good idea at the time.

Regardless, if you need to deal with binary data (upload or download, doesn't really matter), it's worth thinking about the ramifications with regard to API throughput at the very least.

Use All The Resources

It really all comes down to resource utilization and limits.

Ideally, in an API, you want to minimise the time that every request takes, mostly so that you can maintain throughput. Every single request consumes resources of some sort, and if a particular type of request is consuming more than its fair share, the whole system can suffer.

Serving or receiving binary data is definitely one of those needy types of requests, as the payloads tend to be larger (files, images and so on), so whatever resources are involved in dealing with the request are consumed for longer.

You can alleviate this sort of thing by making use of good asynchronous practices, ceding control over the appropriate resources so that they can be used to deal with other incoming requests. Ideally you never want to waste time waiting for slow things, like HTTP requests and IO, but there is always going to be an upper limit on the amount you can optimize by using this approach.

A different issue can arise when you consider the details of how you're dealing with the binary content itself. Ideally, you want to deal with it as a stream, regardless of whether it's coming in or going out. This sort of thing can be quite difficult to actually accomplish though, unless you really know what you're doing and completely understand the entire pipeline of whatever platform you're using.

It's far more common to accidentally read the entire body of the binary data into memory all at once, consuming far more resources than are strictly necessary compared to pushing bytes through in a continuous stream. I've had this happen silently and unexpectedly when hosting web applications in IIS, and it can be quite a challenging situation to engineer around.
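
To make the contrast concrete, here is a minimal, framework-agnostic sketch (the streams are just placeholders for the incoming request body and its final destination) of the difference between buffering a payload and streaming it through:

using System.IO;

public static class UploadHandling
{
    // Buffered: the entire payload is pulled into memory before anything is written.
    // Fine for small payloads, painful for a large photo multiplied by many requests.
    public static void CopyBuffered(Stream body, Stream destination)
    {
        using (var buffer = new MemoryStream())
        {
            body.CopyTo(buffer);            // the whole payload now lives in memory
            buffer.Position = 0;
            buffer.CopyTo(destination);
        }
    }

    // Streamed: bytes are pushed through in small chunks, so memory usage stays
    // roughly constant regardless of payload size.
    public static void CopyStreamed(Stream body, Stream destination)
    {
        body.CopyTo(destination, 81920);    // 80KB chunks, nothing held onto
    }
}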

Of course if you’re only dealing with small amounts of binary data, or small numbers of requests, you’re probably going to be fine.

Assuming every relevant stakeholder understands those limitations, that is, and no one sells the capability to easily upload and download thousands and thousands of photos taken from a modern camera phone.

But it's not like that's happened to me before at all.

Separation Of Concerns

If you can, a better model to invoke is one in which you don’t deal with the binary data directly at all, and instead make use of services that are specifically built for that purpose.

Why re-invent the wheel, only to do it worse and with more limitations?

A good example of this is to still have the API handle the entities representing the binary data, but when it comes to dealing with the hard bit (the payload), provide instructions on how to do that.

That is, instead of hitting GET /thing/{id}/images/{id}, and getting the binary content in the response payload, have that endpoint return a normal payload (JSON or similar) that contains a link for where to download the binary content from.
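
The response from that endpoint might look something like this (the shape and field names are purely illustrative; the important part is the time-limited link):

{
    "Id" : "12345",
    "FileName" : "photo.jpg",
    "ContentType" : "image/jpeg",
    "Download" : "https://some-bucket.s3.amazonaws.com/images/12345?signature=..."
}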

In more concrete terms, this sort of thing is relatively easily accomplished using Amazon Simple Storage Service (S3). You can have all of your binary content sitting securely in an S3 bucket somewhere, and create pre-signed links for both uploading and downloading with very little effort. No binary data will ever have to pass directly through your API, so assuming you can deal with the relatively simple requests to provide directions, you’re free and clear.
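
A rough sketch of what that looks like with the AWS SDK for .NET (the bucket, key and expiry values here are illustrative):

using System;
using Amazon.S3;
using Amazon.S3.Model;

public class BinaryContentLinks
{
    private readonly AmazonS3Client _client;

    public BinaryContentLinks(AmazonS3Client client)
    {
        _client = client;
    }

    // Generates a time-limited download link for content already sitting in S3.
    // In practice the bucket and key would come from the entity the API describes.
    public string CreateDownloadLink(string bucket, string key)
    {
        return _client.GetPreSignedURL(new GetPreSignedUrlRequest
        {
            BucketName = bucket,
            Key = key,
            Verb = HttpVerb.GET,
            Expires = DateTime.UtcNow.AddMinutes(15)
        });
    }

    // The same mechanism works in reverse: the client PUTs the binary content
    // directly to S3 using the link, and it never passes through the API.
    public string CreateUploadLink(string bucket, string key)
    {
        return _client.GetPreSignedURL(new GetPreSignedUrlRequest
        {
            BucketName = bucket,
            Key = key,
            Verb = HttpVerb.PUT,
            Expires = DateTime.UtcNow.AddMinutes(15)
        });
    }
}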

It's not like S3 is going to have difficulties dealing with your data.

It's somewhat beyond my area of expertise, but I assume you can accomplish the same sort of thing with equivalent services, like Azure Blobs.

Hell, you can probably even get it working with a simple Nginx box serving and storing content using its local disk, but honestly, that seems like a pretty terrible idea from a reliability point of view.

Conclusion

As with anything in software, it all comes down to the specific situation that you find yourself in.

Maybe you know for certain that the binary data you deal with is small and will remain that way. Fine, just serve it from the API. Why complicate a system if you don't have to?

But if the system grows beyond your initial assumptions (and if it's popular, it will), then you'll almost certainly have problems scaling if you're constantly handling binary data yourself, and you'll have to pay a price to resolve the situation in the long run.

Easily the hardest part of being a software engineer is trying to predict which direction the wind will blow, and adjust accordingly.

Really, it's something of a fool's errand, and the best idea is to be able to react quickly in the face of changing circumstances, but that's even harder.


Nancy is a pretty great, lightweight, low-ceremony API framework for C#/.NET. We've used it for basically every API that we've created since I started working here, with the exception of our Auth API, which uses Web API because of a dependency on the ASP.NET Identity management packages.

In all the time that I’ve been using Nancy, I’ve encountered very few issues that weren’t just my own stupid fault. Sure, we’ve customised pieces of its pipeline (like the bootstrapper, because we use Ninject for dependency injection) but all in all its default behaviour is pretty reliable.

In fact, I can only think of one piece of default behaviour that wasn’t great, and that was the way that errors were handled during initialization when running inside IIS.

The Setup

When you use Nancy through IIS/ASP.NET, you add a reference to a handler class in your web.config file, which tells IIS how it should forward incoming requests to your code.

After installing the Nancy.Hosting.AspNet package, your web.config will contain some stuff that looks like this:

<?xml version="1.0" encoding="utf-8"?>
<configuration>
    <system.web>
        <compilation debug="true" targetFramework="4.5" />
        <httpRuntime targetFramework="4.5" requestPathInvalidCharacters="&lt;,&gt;,%,&amp;,?"/>
        <httpHandlers>
            <add verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </httpHandlers>
    </system.web>

    <system.webServer>
        <validation validateIntegratedModeConfiguration="false" />
        <httpErrors existingResponse="PassThrough" />
        <handlers>
            <add name="Nancy" verb="*" type="Nancy.Hosting.Aspnet.NancyHttpRequestHandler" path="*" />
        </handlers>
    </system.webServer>
</configuration>

There are two sections here (system.web and system.webServer) because they target different versions of IIS; if I remember correctly, the system.web/httpHandlers entry is used by older versions of IIS (and the classic pipeline mode), while the system.webServer/handlers entry is what IIS 7+ uses in integrated mode.

What the configuration means is that for every incoming request (verb="*", path="*" in the XML), IIS will simply forward the request to the NancyHttpRequestHandler, where it will do Nancy things.

This worked fine for us until we had an issue with our bootstrapper initialization. Specifically, our bootstrapper was throwing exceptions during creation (because it was trying to connect to a database which wasn’t available yet or something) and when that happened, it would stop the webservice from ever starting. In fact, it would be non-functional until we restarted the IIS application pool.

The root cause here was in the NancyHttpRequestHandler and the way it interacted with IIS. Basically, IIS would create one of these classes, which would trigger its static constructor. If that encountered an error, then the whole thing would be broken, never to recover.

The fix was relatively simple. Create a custom request handler (based off the default one) that had some logic in it to lazy load the Nancy bootstrapper/engine with appropriate error handling. The end result was that each request that failed to initialize would fail as expected (with a 500 or something), but the first one to succeed would cache the result for any subsequent requests.
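
In rough terms, the handler ended up shaped something like the sketch below (simplified, with illustrative names; the translation between the HttpContext and a Nancy request is elided, because the stock Nancy.Hosting.Aspnet handler already does that part):

using System.Web;
using Nancy;
using Nancy.Bootstrapper;

public class TolerantNancyHttpRequestHandler : IHttpHandler
{
    private static readonly object Gate = new object();
    private static INancyEngine _engine;

    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        // Throws (failing just this request) if initialisation fails, instead of
        // poisoning the whole process via a static constructor failure.
        var engine = GetOrCreateEngine();
        DispatchToNancy(engine, context);
    }

    private static INancyEngine GetOrCreateEngine()
    {
        if (_engine != null) return _engine;

        lock (Gate)
        {
            if (_engine != null) return _engine;

            var bootstrapper = NancyBootstrapperLocator.Bootstrapper;
            bootstrapper.Initialise();          // the call that was throwing transiently
            _engine = bootstrapper.GetEngine(); // only cached once it succeeds

            return _engine;
        }
    }

    private static void DispatchToNancy(INancyEngine engine, HttpContext context)
    {
        // Elided: wrap the HttpContext, hand the request to the engine and write
        // the Nancy response back out, exactly as the default handler does.
    }
}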

With that solution in place our webservices became a little bit more reliable and tolerant of transitory errors during startup.

The Symptom

So, in summary, Nancy good, we made a small customisation to make startup more reliable.

Obviously the story doesn’t end there.

Not too long ago we encountered a bug in one of our APIs where it was returning errors, but there were no associated error logs from the application in our ELK stack.

We could see the errors in our ELB logs (purely as HTTP response codes), but we didn’t get any application level logs showing what the root issue was (i.e. the exception). There were other error logs (some transient problems occurring during service startup), so we assumed that there was some problem with our application logging where it wasn’t correctly reporting on errors that occurred as a result of web requests.

Reproducing the bug locally, the log output showed the requests being logged correctly, and at the correct level.

It was all very mysterious.

The Source

Our first instinct was that the data ingress into the ELK stack was failing. I’d recently been working on the stack, so I naturally assumed that it was my fault, but when we investigated, we discovered that the ELK stack was working exactly as expected. There were no errors indicating that an event had been received and then rejected for some reason (Elasticsearch field mapping conflicts are the most common).

Digging deeper, we checked the local log files on the staging webservice machines and discovered that the log events were missing altogether, having never been written in the first place. Even when we caused an error on purpose, nothing was being logged into the local log file.

The answer lay in the custom request handler we implemented. It had allowed the service to remain functional (where previously it would have crashed and burned), but had an unfortunate side effect.

The sequence of actions looked like this:

  1. IIS receives a request
  2. IIS forwards request to custom request handler
  3. Custom request handler initialises itself, creating a bootstrapper
  4. Bootstrapper creates logger configuration, including a custom Sink which opens a write stream to a file (named based on date)
  5. An error occurs (can’t initialize DB because the DNS record does not exist yet, or something equally transient)
  6. Bootstrapper initialization fails. Sink is disposable with a finalizer, so it will be cleaned up eventually, just not straightaway
  7. Request handler initialization fails, failing the request
  8. Another request is received
  9. IIS does the same thing
  10. New bootstrapper is created
  11. New Sink created, pointing at same file (which is already locked)
  12. Initialization succeeds, request returns successfully
  13. Sink now throws an error every time it is logged to, because of the file lock
  14. Serilog detects the repeated errors coming from the Sink, so it stops logging to it to preserve application health
  15. Original Sink disposes of itself and releases file lock
  16. Second Sink starts functioning again, but Serilog has stopped logging to it, so nothing happens

The interesting thing here is that the Sink does not lock the file in its constructor, because its core behaviour is to roll files based on both date and size; every time an event is written, it dynamically determines where it should write to. This meant the second Sink was created successfully, but could not write any events.

Serilog, being the good logging framework that it is, was catching those errors and stopping them from breaking the application. Unfortunately, because that Sink was the only place where we had visible output, Serilog had nowhere to report that it was experiencing errors. During the investigation we actually enabled the Serilog self log, and it was reporting all sorts of useful things, and was critical in actually diagnosing the problem.
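
If you ever find yourself in a similar situation, turning the self log on is a one-liner; something like the following, depending on the Serilog version (older versions expose a SelfLog.Out property instead of Enable):

// Somewhere early in application startup, route Serilog's own diagnostic output
// somewhere you can actually see it.
Serilog.Debugging.SelfLog.Enable(msg => System.Diagnostics.Trace.WriteLine(msg));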

Basically, we had a misbehaving sink and Serilog was protecting us from ourselves.

The Solution

We fixed the problem by moving the initialization of the logging earlier, enforcing that it only happens once per process by using a lazily evaluated static property on the request handler, which was how some of the Nancy properties were already being handled.
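
The shape of the fix is simple enough to sketch (class and member names here are illustrative, and the actual logger configuration, including the custom Sink, is elided):

using System;
using Serilog;

// Lives on the custom request handler in the real code; shown standalone here.
public static class HandlerLogging
{
    // Lazy<T> (thread-safe by default) guarantees the factory runs at most once
    // per process, no matter how many bootstrappers get created, so only one
    // Sink ever holds the log file open.
    private static readonly Lazy<ILogger> LazyLogger = new Lazy<ILogger>(CreateLogger);

    public static ILogger Logger
    {
        get { return LazyLogger.Value; }
    }

    private static ILogger CreateLogger()
    {
        // The real configuration wires up the custom rolling file Sink; an empty
        // configuration stands in for it here.
        return new LoggerConfiguration().CreateLogger();
    }
}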

This fixed the problem, but looking back I think there were probably a few other ways in which we could have tackled it that would have been better:

  • We could have associated some sort of unique ID with the logger (via the bootstrapper) guaranteeing no file conflicts
  • We could have changed the Sink to handle errors that occur while accessing its desired log file, by generating and logging to a different file
  • We could have handled the fact that the Sink was disposable, ensuring its lifetime was tied to the bootstrapper as expected

I’ll probably implement at least the second option at some point in the future, just to make the Sink more robust in the face of unexpected circumstances.

Conclusion

The interesting thing about this whole problem (with the custom Sink and file locking) was that we had actually anticipated that it would be a problem when we initially implemented the Sink. IIS has a tendency to run two applications in parallel whenever it does a recycle, so we knew that there would be periods when two processes might be trying to write to the same location, and we implemented a process ID based prefix on every file as a result. Unfortunately, that approach is remarkably ineffective when everything is happening within the same process.

The hardest part in this whole adventure was trying to determine the cause of a problem when logging was the problem.

Once you get out of the development environment, and into deployed software, logging is pretty much all you have. When that’s gone, diagnosing problems becomes exponentially more difficult.

Like picking up broken glass in the dark.

Sure, it's possible, but you'll probably get cut.


The way we structure our services for deployment should be familiar to anyone who’s worked with AWS before. An ELB (Elastic Load Balancer) containing one or more EC2 (Elastic Compute Cloud) instances, each of which has our service code deployed on it by Octopus Deploy. The EC2 instances are maintained by an ASG (Auto Scaling Group) which uses a Launch Configuration that describes how to create and initialize an instance. Nothing too fancy.

An ELB has logic in it to direct incoming traffic to its instance members, in order to balance the load across multiple resources. In order to do this, it needs to know whether or not a member instance is healthy, and it accomplishes this by using a health check, configured when the ELB is created.

Health checks can be very simple (does a member instance respond to a TCP packet over a certain port), or a little bit more complex (hit some API endpoint on a member instance via HTTP and look for non-200 response codes), but conceptually they are meant to provide an indication that the ELB is okay to continue to direct traffic to this member instance. If a certain number of health checks fail over a period of time (configurable, but something like "more than 5 within 5 minutes"), the ELB will mark a member instance as unhealthy and stop directing traffic to it. It will continue to execute the configured health check in the background (in case the instance comes back online because the problem was transitory), but can also be configured (when used in conjunction with an ASG) to terminate the instance and replace it with one that isn't terrible.
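
For reference, this is the kind of thing being configured; a rough sketch using the AWS SDK for .NET and a classic ELB, with illustrative names and thresholds:

using Amazon.ElasticLoadBalancing;
using Amazon.ElasticLoadBalancing.Model;

public static class HealthCheckSetup
{
    // Configures a classic ELB to probe an HTTP endpoint on each member instance.
    public static void Configure(AmazonElasticLoadBalancingClient client)
    {
        client.ConfigureHealthCheck(new ConfigureHealthCheckRequest
        {
            LoadBalancerName = "my-service-elb",
            HealthCheck = new HealthCheck
            {
                Target = "HTTP:80/status",   // non-200 responses count as failures
                Interval = 30,               // seconds between checks
                Timeout = 5,                 // seconds before a single check is considered failed
                UnhealthyThreshold = 5,      // consecutive failures before the instance is marked unhealthy
                HealthyThreshold = 2         // consecutive successes before it is marked healthy again
            }
        });
    }
}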

Up until recently, our services have consistently used a /status endpoint for the purposes of the ELB health check.

It's a pretty simple endpoint, checking some basic things like "can I connect to the database" and "can I connect to S3", returning a 503 if anything is wrong.

When we started using an external website monitoring service called Pingdom, it made sense to just use the /status endpoint as well.

Then we complicated things.

Only I May Do Things

Earlier this year I posted about how we do environment migrations. The contents of that post are still accurate (even though I am becoming less and less satisfied with the approach as time goes on), but we’ve improved the part about “making the environment unavailable” before starting the migration.

Originally, we did this by just putting all of the instances in the auto scaling group into standby mode (and shutting down any manually created databases, like ones hosted on EC2 instances), with the intent that this would be sufficient to stop customer traffic while we did the migration. When we introduced the concept of the statistics endpoint, and started comparing data before and after the migration, this was no longer acceptable (because there was no way for us to hit the service ourselves when it was unavailable to everybody).

What we really wanted to do was put the service into some sort of mode that let us interact with it from an administrative point of view, but no-one else. That way we could execute the statistics endpoint (and anything else we needed access to), but all external traffic would be turned away with a 503 Service Unavailable (and a message saying something about maintenance).

This complicated the /status endpoint though. If we made it return a 503 when maintenance mode was on, the instances would eventually be removed from the ELB, which would make the service inaccessible to us. If we didn’t make it return a 503, the external monitoring in Pingdom would falsely report that everything was working just fine, when in reality normal users would be completely unable to use the service.

The conclusion that we came to was that we made a mistake using /status for two very different purposes (ELB membership and external service availability), even though they looked very similar at a glance.

The solution?

Split the /status endpoint into two new endpoints, /health for the ELB and /test for the external monitoring service.

Healthy Services

The /health endpoint is basically just the /status endpoint with a new name. Its response payload looks like this:

{
    "BuildVersion" : "1.0.16209.8792",
    "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
    "DatabaseConnectivity" : true,
    "MaintenanceMode" : false
}

There's not really much more to say about it, other than its purpose now being clearer and more focused. It's all about making sure the particular member instance is healthy and should continue to receive traffic. We can probably extend it to do other health related things (like available disk space, available memory, and so on), but it's good enough for now.
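
In Nancy terms, the endpoint amounts to something like the following sketch (the individual checks are elided and the names are illustrative; the build values are just the ones from the example payload above):

using Nancy;

public class HealthModule : NancyModule
{
    public HealthModule()
    {
        Get["/health"] = _ =>
        {
            var databaseOk = CheckDatabaseConnectivity();   // elided
            var maintenance = IsMaintenanceModeOn();        // elided

            var model = new
            {
                BuildVersion = "1.0.16209.8792",
                BuildTimestamp = "2016-07-27T04:53:04+00:00",
                DatabaseConnectivity = databaseOk,
                MaintenanceMode = maintenance
            };

            // Maintenance mode deliberately does NOT fail the health check; only genuine
            // problems with the instance itself should pull it out of the ELB.
            var status = databaseOk ? HttpStatusCode.OK : HttpStatusCode.ServiceUnavailable;

            return Negotiate.WithModel(model).WithStatusCode(status);
        };
    }

    private static bool CheckDatabaseConnectivity() { return true; }
    private static bool IsMaintenanceModeOn() { return false; }
}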

Testing Is My Bag Baby

The /test endpoint though, is a different beast.

At the very least, the /test endpoint needs to return the information present in the /health endpoint. If /health is returning a 503, /test should too. But it needs to do more.

The initial idea was to simply check whether or not we were in maintenance mode and have the /test endpoint return a 503 when it was on, with that being the only differentiator between /health and /test. That's not quite good enough though: conceptually we want the /test endpoint to act more like a smoke test of common user interactions, and just checking the maintenance mode flag doesn't give us enough faith that the service itself is working from a user's point of view.

So we added some tests that get executed when the endpoint is called, checking commonly executed features.

The first service we implemented the /test endpoint for was our auth service, so the tests included things like "can I generate a token using a known user" and "can I query the statistics endpoint", but eventually we'll probably extend it to include other common user operations like "add a customer" and "list customer databases".

Anyway, the payload for the /test endpoint response ended up looking like this:

{
    "Health" : {
        "BuildVersion" : "1.0.16209.8792",
        "BuildTimestamp" : "2016-07-27T04:53:04+00:00",
        "DatabaseConnectivity" : true,
        "MaintenanceMode" : false
    },
    "Tests" : [{
            "Name" : "QueryApiStatistics",
            "Description" : "A test which returns the api statistics.",
            "Endpoint" : "/admin/statistics",
            "Success" : true,
            "Error" : null
        }, {
            "Name" : "GenerateClientToken",
            "Description" : "A test to check that a client token can be generated.",
            "Endpoint" : "/auth/token",
            "Success" : true,
            "Error" : null
        }
    ]
}

I think it turned out quite informative in the end, and it definitely meets the need of detecting when the service is actually available for normal usage, as opposed to the very specific case of detecting maintenance mode.

The only trick here was that the tests needed to hit the API via HTTP (i.e. localhost), rather than just shortcutting through the code directly. The reason for this is that otherwise each test would not be an accurate representation of actual usage, and would give false results if we added something in the Nancy pipeline that modified the response before it got to the code that the test was executing.

Subtle, but important.
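
For illustration, one of those tests ends up shaped something like this (the base address, class names and result structure are illustrative, and the real thing also attaches the appropriate credentials):

using System;
using System.Net.Http;

public class QueryApiStatisticsTest
{
    public string Name { get { return "QueryApiStatistics"; } }
    public string Endpoint { get { return "/admin/statistics"; } }

    // Goes back in through the front door (localhost over HTTP) rather than calling
    // the statistics code directly, so the whole Nancy pipeline gets exercised.
    public TestResult Execute()
    {
        try
        {
            using (var client = new HttpClient { BaseAddress = new Uri("http://localhost") })
            {
                var response = client.GetAsync(Endpoint).Result;
                return new TestResult
                {
                    Success = response.IsSuccessStatusCode,
                    Error = response.IsSuccessStatusCode ? null : response.StatusCode.ToString()
                };
            }
        }
        catch (Exception ex)
        {
            return new TestResult { Success = false, Error = ex.Message };
        }
    }
}

public class TestResult
{
    public bool Success { get; set; }
    public string Error { get; set; }
}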

Conclusion

After going to the effort of clearly identifying the two purposes that the /status endpoint was fulfilling, it was pretty clear that we would be better served by having two endpoints rather than just one. It was obvious in retrospect, but it wasn’t until we actually encountered a situation that required them to be different that we really noticed.

A great side effect of this was that we realised we needed to have some sort of smoke tests for our external monitoring to access on an ongoing basis, to give us a good indication of whether or not everything was going well from a user's point of view.

Now our external monitoring actually stands a chance of tracking whether or not users are having issues.

Which makes me happy inside, because that’s the sort of thing I want to know about.


With our relatively new desire to create quality APIs facilitating external integration with our products, we've decided to start using Apiary.

In actuality, we’ve been using Apiary for a while now, but not very well. Sure we used it to document what the API looked like, but we didn’t take advantage of any of its more interesting features. That changed recently when I did a deep dive into the tool in order to clean up our existing documentation before showing it to a bunch of prospective third party integrators.

Apiary is an excellent tool for assisting in API development. It not only allows for the clear documentation of your API using markdown and multi-markdown, but it also acts as a mock API instance. This of course allows you to design the API the way you want it to be, and then evaluate that design in a very lightweight fashion, before committing to fully implement it. Cheap prototypes make for lots of validation during the design phase, which in turn makes for a better design, which generally leads to a better end-result.

Even once you’ve settled on a design, and fully implemented it, you can use the blueprint to call into your real API, using the documented parameters and attributes. This allows for all sorts of useful testing and validation from an entirely different point of view (and is especially useful for those users unfamiliar with other mechanisms for calling an API).

It's not all amazing though (even though it is pretty good). There are a few features that are lacking, which makes me sad as a software developer, and I'll go into those later on. In addition to a few areas that are lacking, it has a bit of a learning curve, and it can be difficult to know exactly what will work and what won't. There are quite a few good examples and tutorials available, but the Apiary feature set also seems to change quite rapidly, so it can be hard to determine exactly what you should be using.

This post is mostly going to contain an explanation of how we currently structure our API documentation, along with a summary of the useful things I've learned over the past week or so. Really though, it's for me to fortify this information in my own head, and to document it in an easily shareable location.

The Breakdown

I like to make sure that API documentation speaks for itself. Other than access credentials, you should be able to point someone at your documentation and they should be able to use it without needing anything else from you. It's not that I don't like talking to people, and there are definitely always going to be questions, but I find that "doesn't need to come back to you" is a pretty good measure of documentation quality.

Of course, this implies that your documentation should do more than just outline your endpoints and how to talk to them. Don’t get me wrong, it needs to do that as well (in as much detail as you can muster), but it also needs to give high level information about the API (where are the instances, what is a summary of the endpoints available) as well as information about the concepts and context in play.

As a result, I lean towards the following headers in Apiary: Overview, Concepts and Reference.

Overview

The overview section gives a brief outline of the purpose of the API, and lists known instances (actual URLs) along with their purpose. This is a good place to differentiate your environments (CI, Staging, Production) and anything useful to know about them (CI is destroyed every night for example, so data will not be persisted).

It also lists a summary of the endpoints available, along with verbs that can be used on them. This is really just to help users understand the full breadth of the API. Ideally each one of these endpoints links to its actual endpoint specification in the reference section.

Concepts

The concepts section is where you go into more detail about things that are useful for the user to know. This is where you can outline the domain model in play, or describe anything that you think the user would be unfamiliar with because it requires specialised knowledge. You can easily put Terms of Reference here, but you should focus on not only explaining unfamiliar words, but also on communicating the purpose and reasoning behind the API.

This is also a good place to talk about authentication, assuming you have some.

Reference

This is your normal API reference. Detail the endpoints/resources and what verbs are available on them. Explain required parameters (and the ways in which they must be supplied) along with expected outputs (both successful and failure). I include in this section a common sub-section, detailing any commonality between the API endpoints (like all successful results look like this, or any request can potentially return a 400 if you mess up the inputs, etc).

For Apiary in particular, make sure that your examples actually execute against a real server, as well as act appropriately as a mock. This is one of the main benefits of doing this sort of API documentation in Apiary (rather than just a Github or equivalent repository using markdown), so make the most of it.

Helpful Tips

I found that using Data Structures (via the + Attributes syntax) was by far the best way of describing request/response structures, both in terms of presentation in the resulting document and in terms of reusability between multiple elements.

Specifying a section for data structures is easy enough. Before you do your first API endpoint reference in the document (with the ## Customer [/customers] syntax), you can simply specify a data structures heading as follows:

# Data Structures
## Fancy Object (object)
+ Something: Sample Value (string, required) - Comment describing the purpose of the field
+ AnotherThing: `different sample escaped` (string, required)
+ IsAwesome: true (bool, optional)

## Simple Thing (object)
+ Special: yes (string, required)

## Derivation (Fancy Object)
+ Magical: (Simple Thing, required)

You can then reference those data structures using the + Attributes (Data Structure Name) syntax in the API reference later on.

Notice that you can model inheritance and composition in a very nice way, which can make for some very reusable components, and that there is room for comments and descriptions to do away with any ambiguity that isn’t already dealt with via the name and example value.
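
For example, referencing those structures from an endpoint (the Customer resource here is purely illustrative) ends up looking something like this:

## Customer [/customers]

### Create a Customer [POST]

+ Request (application/json)
    + Attributes (Fancy Object)

+ Response 201 (application/json)
    + Attributes (Derivation)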

It Could Be Better

It's not all great (even though it is pretty good).

I had a lot of issues with common headers in my API Blueprint, like a common header for identifying the source of a request, or even the requirement of an Authorization header. This was mostly just me not wanting to repeat myself; I didn't want to have to specify the exact same header for every single endpoint in the documentation just to keep the examples executable. Even the ability to specify variables to be substituted into the blueprint in a later place would have been better. I really don't like having the same information stated in multiple locations. It makes maintenance a gigantic pain.

Another rough patch is that there seem to be a number of different ways to accomplish the same sort of thing (reduced duplication when dealing with request/response templates). You can use models (which can define headers, bodies and schemas) or you can use data structures (which are less JSON specific and more general, but can’t do headers) or you can manually specify the information yourself in every section. It can be difficult to determine the best tool to use.

I assume that this is a result of the service changing quite rapidly and that it will likely settle down as it gets more mature.

Finally, because of the nature of the markdown, it can be quite hard to determine if you have actually done what you wanted to do, in terms of the Apiary specific data structures and specifications. If you mistype or leave out some required syntax, it's probably still valid markdown, so your documentation still works, but it also doesn't quite do the right thing. There is some syntax and semantic checking available in the online editor, which is awesome, but it can be a bit flaky at times itself. Still, I imagine this will improve greatly as time goes on.

Conclusion

As a step up from normal API documentation (either in a Wiki or using some sort of Markdown), Apiary offers a number of very useful features. By far the most useful is the ability for the specification to actually create an executable service, as well as pass requests through to your real service. This can really improve the turnaround time for validating your designs and is a major boon for API development.

At a conceptual level, I'm still disappointed that we have to document our APIs, rather than somehow exposing that information from the API itself. My ideal situation would be if that sort of ability was included as part of the source of the API. I still hate the fact that the two can very easily get out of sync, and no matter how hard you try, the only source of truth is the actual executable code. It's the same problem as comments in your code. If you can't make the code express itself well enough, you have to resort to writing around it.

If I have to document though, I want the documentation to be verifiable, and Apiary definitely provides that.