Full disclosure, most of the Elastalert related work was actually done by a colleague of mine, I’m just writing about it because I thought it was interesting.
Unfortunately, this post brings me to the end of all the Elastalert goodness, at least for now.
Like I said right at the start (and embedded in the post titles), we’re finally paying attention to the wealth of information inside our ELK stack. Well, we aren’t really paying attention to everything right now, but when we notice something or even realize ahead of time that “it would be good if we got told when this happens” we actually have somewhere to put that logic.
I’ll call that a victory.
Anyway, to bring it all full circle:
- The first post was a general and gentle introduction to Elastalert, rules and why we even bothered with this whole process
- The second post explained how we put together the underlying infrastructure in AWS to support our Elastalert environments
- The third post outlined how we actually manage the configuration for Elastalert (i.e. the rules and notifications) and the deployment thereof
- Finally, this post is basically just a capstone, drawing everything together and finishing it all off
To be honest, when you look at what we’ve done for Elastalert from a distance, it looks suspiciously similar to the ELK stack (specifically the Elasticsearch segment).
I don’t necessarily think that’s a bad thing though. Honestly, I think we’ve just found a pattern that works for us, so rather than reinventing the wheel each time, we just roll with it.
Consistency is a quality all on its own.
Rule The World
Its actually been almost a couple of months now since we put this all together, and people are slowly starting to incorporate different rules to notify us when interesting things happen.
A good example of this sort of thing is with one of our new features.
As a general rule of thumb, we try our best to include dedicated business intelligence events into the software for whatever features we develop, including major checkpoints like starting, finishing and failure. One of our recent features also raised a “configured” event, which indicated when a customer had put in the specific configuration necessary for the feature to be enabled (it was a third party integration, so required an externally provided API key to function).
We added a rule to detect when this relatively rare event occurred, and now we get a notification whenever someone configures the new feature. This sort of thing is useful when you still have a relatively small number of people coming online (so you can keep tabs on them and follow through to see if they are experiencing any issues), but we’ll probably turn it off one usage picks up so we’re not constantly being spammed.
Recently a customer came online with the new feature, but never followed up with actual usage beyond the initial configuration, so we were able to flag this with the relevant parties (like their account manager) and investigate why that was happening and how we could help.
Without Elastalert, we never would have known, even though the information was actually available for all to see.
Breaking All The Rules
Of course, no series of blog posts would be complete without noting down some potential ways in which we could improve the thing we literally just finished putting together.
I mean, we could barely call ourselves engineers if we weren’t already engineering a better version in our heads before the paint had even dried on the first one.
There are two areas that I think could use improvement, but neither of them are particularly simple:
- The architecture that we put together is high availability, even though it is self healing. There is only one Elastalert instance and we don’t really have particularly good protection against that instance being “alive” according to AWS but not actually evaluating rules. We should probably put some more effort into detecting issues with Elastalert so that the AWS Auto Scaling Group self healing can kick in at the appropriate times. I don’t think we can really do anything about side-by-side redundancy though, as Elastalert isn’t really designed to be a distributed alerting system. Two copies would probably just raise two alerts which would get annoying quickly.
- There is no real concept of an alert getting worse over time, like there is with some other alerting platforms. Pingdom is a good example of this, though its alerts are a lot simpler (pretty much just up/down). If a website is down, different actions get triggered based on the length of the downtime. We use this sort of approach to first send a note to Hipchat, then to email, then to SMS some relevant parties in a natural progression. Elastalert really only seems to have on/off, as opposed to a schedule of notifications. You could probably accomplish the same thing by having multiple similar rules with different criteria, but that sounds like a massive pain to manage moving forward. This is something that will probably have to be done at the Elastalert level, and I doubt it would be a trivial change, so I’m not going to hold my breath.
Having said that, the value that Elastalert provides in its current state is still astronomically higher that having nothing, so who am I to complain?
When all is said and done, I’m pretty happy that we finally have the capability to alert of our ELK stack.
I mean, its not like the data was going to waste before we had that capability, it just feels better knowing that we don’t always have to be watching in order to find out when interesting things happen.
I know I don’t have time to watch the ELK stack all day, and I doubt anyone else does.
Thought it is awfully pretty to look at.