Checkless & Logz.io - Free & Easy uptime history

19 March 2019 on checkless, serverless, elasticsearch

Since the very first version of Checkless, I've wanted to keep it simple and cheap (preferably free). So to begin with, the only two notifiers were Slack and Email, which meant there wasn't much in the way of reports or check history. This was fine, as it still told me when things were wrong within a reasonable time frame. But it would be nice to:

Have check result history
Have some nice visualisations
Have more options for notifications

Whilst maintaining the key design goals (Simple & Free/Cheap).

Given I was already using AWS Lambda, my mind jumped first to DynamoDB as a storage mechanism. It's cheap, but it's not terribly simple (at least not to me, yet), and this particular use case isn't well documented (multiple keys/dimensions, almost time series data). Being so focused on AWS-based technologies, I upsettingly neglected a technology I was very familiar with: The Elastic Stack.

Cheap, Elasticsearch?

Elasticsearch isn't generally associated with being cheap; its memory requirements are quite demanding for a start. Elastic Cloud has helped a bit - you can get a single node for ~£15/mo - but this all started because I didn't want to pay Pingdom a tenner a month. Given I didn't already have an Elasticsearch instance, that didn't seem like a particularly appealing option.

However, in one of my recent roles I encountered Logz.io, a hosted Elastic Stack platform, with a community option, that's free! You don't get a huge amount of retention - 3 days, but it fulfils the primary design goals: Cheap & Simple, and it's Elasticsearch/Kibana, which meant I knew it well.

Getting the data in

Logz.io (like Logstash) supports sending data in via an HTTP(S) endpoint. This is pretty much exactly how the Slack notifier works, except it's not even necessary to format the data: we just send the check result as JSON in the request body.
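As a rough illustration of that request, here's a minimal Python sketch. It isn't taken from Checkless itself: the function names, the placeholder URL and the fields in the example result are all assumptions, and the real payload may look different.

```python
import json
import urllib.request


def build_webhook_request(webhook_url, check_result):
    """Build a POST request carrying the check result as a JSON body."""
    body = json.dumps(check_result).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_check_result(webhook_url, check_result, timeout=5):
    """Send the request and return the HTTP status code."""
    request = build_webhook_request(webhook_url, check_result)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Hypothetical check result; the real Checkless fields may differ.
example_result = {
    "url": "https://example.com",
    "region": "us-east-1",
    "success": True,
    "timeToFirstByteMs": 120,
}

# Placeholder URL; use the shipper endpoint from your own account.
example_request = build_webhook_request(
    "https://listener.example.invalid/?token=PLACEHOLDER", example_result
)
```

Building the request separately from sending it keeps the payload easy to inspect without touching the network.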

I created a new version of Checkless (v2.1.0) that can send check results to a webhook, and of Checkless-CLI (v1.11.1) that can configure it. This allows the data to be shipped to logz.io (and potentially a wide range of other targets as well). Using checkless-cli, it's just a matter of adding the notifier to the checkless.yml file:

notifications:
  - webhook:
      webhookUrl: '${env:CHECKLESS_LOGZIO_WEBHOOK_PATH}'

And the correct serverless config will be generated to always send the result to your logz.io endpoint.

env:CHECKLESS_LOGZIO_WEBHOOK_PATH is a CircleCI environment variable containing the logz.io log shipper HTTPS endpoint. Once you've signed up to Logz.io you can go to the Log Shipping tab -> Libraries -> Bulk HTTP/S to get the endpoint for your account.

As soon as this is deployed you should start to see events in your Discover tab:

Visualising

With logz.io you get Kibana, which gives you huge flexibility to create visualisations and dashboards. Once the data's in, it's really easy to create overview dashboards of all checks, or of specific websites:

This basic dashboard shows Success:Failure ratio as well as the median time to first byte. In this case it's monitoring this website, which was recently moved to Netlify. This already tells me two things:

The US-East-1 region has occasional connectivity issues to Netlify's CDN
The Netlify CDN has much better routing for US than London, 3x better Time to First Byte

So I'm going to trial switching the probe to the US-East-2 region to see if it has better reliability, as well as investigate Netlify's CDN options to see if I can improve the UK-based TTFB.
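For anyone wanting to sanity check the dashboard numbers outside Kibana, the two metrics are simple to compute per region. This is only a sketch with made-up sample data, and the field names (region, success, ttfb_ms) are assumptions rather than the actual Checkless schema.

```python
from statistics import median


def summarise(results):
    """Group results by region; count outcomes and take the median TTFB."""
    by_region = {}
    for result in results:
        by_region.setdefault(result["region"], []).append(result)

    summary = {}
    for region, items in by_region.items():
        # Only successful checks are assumed to carry a TTFB value.
        ttfbs = [r["ttfb_ms"] for r in items if r["success"]]
        summary[region] = {
            "success": len(ttfbs),
            "failure": len(items) - len(ttfbs),
            "median_ttfb_ms": median(ttfbs) if ttfbs else None,
        }
    return summary


# Made-up sample data, purely for illustration.
sample = [
    {"region": "us-east-1", "success": True, "ttfb_ms": 40},
    {"region": "us-east-1", "success": True, "ttfb_ms": 50},
    {"region": "us-east-1", "success": False, "ttfb_ms": None},
    {"region": "us-east-1", "success": True, "ttfb_ms": 60},
    {"region": "eu-west-2", "success": True, "ttfb_ms": 120},
    {"region": "eu-west-2", "success": True, "ttfb_ms": 150},
    {"region": "eu-west-2", "success": True, "ttfb_ms": 180},
]
report = summarise(sample)
```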

It'd be nice to have parameterised dashboards (so you don't have to save a dashboard per check), as you can with Grafana, but given the overhead is slight and I don't have to maintain my own Elasticsearch server, I can cope!

Note: I've submitted my example checkless dashboard to Logz.io's contributions, I'll update this blog post with a link when it's live!

Alerting

Checkless's alerts to email/Slack work solely on individual events. If there's a failure, then alert. It's impossible to alert on, say, 3 failures in a set period without something storing state. Thankfully, with Logz.io we can alert on a set number of failures within a given period.
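To show why state is needed, here's a minimal sketch of a "N failures within a period" rule. It's an illustration of the idea only, not how Logz.io implements its alerts; the class name and parameters are hypothetical, and it assumes results arrive in timestamp order.

```python
from collections import deque
from datetime import datetime, timedelta


class FailureWindow:
    """Remember recent failure times and flag when too many pile up."""

    def __init__(self, threshold, period):
        self.threshold = threshold
        self.period = period
        self.failures = deque()  # the state a single event can't carry

    def record(self, timestamp, success):
        """Record one check result; return True when an alert should fire."""
        if not success:
            self.failures.append(timestamp)
        # Forget failures that have fallen out of the window.
        cutoff = timestamp - self.period
        while self.failures and self.failures[0] < cutoff:
            self.failures.popleft()
        return len(self.failures) >= self.threshold


# Alert on 3 failures within 5 minutes, with one check per minute.
window = FailureWindow(threshold=3, period=timedelta(minutes=5))
start = datetime(2019, 3, 19, 12, 0)
outcomes = [False, True, False, False]
alerts = [
    window.record(start + timedelta(minutes=i), ok)
    for i, ok in enumerate(outcomes)
]
# alerts is [False, False, False, True]
```

A stateless notifier only ever sees the current result, which is why this kind of rule has to live wherever the history is stored.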

There's still potential for more configurable alerts (number of failures in a row, number of probes in a failing state), but it's a definite advancement. Like the pure AWS implementation, the alerts can be sent via email, Slack, or anything that can be triggered by webhook.

In addition to alerting on failures, I've also configured a Slack alert to warn me when there have been no check results for a set period (e.g. 10 minutes), which tells me when I've outright broken Checkless. When I first started Checkless, I enabled Slack output for successes as well as failures so that I could see it working, and be sure that the sites were up rather than Checkless simply not running. With this alert in place, I can be confident Checkless is working, and disable the rather verbose Slack success output. If I don't have any Checkless events in logz.io in 10 minutes, I know I'll get an alert.
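The "no results" rule boils down to comparing the newest event's timestamp with the current time. A minimal sketch of that check follows; the function name and defaults are assumptions, and in practice the alert is configured in Logz.io rather than written by hand.

```python
from datetime import datetime, timedelta


def is_silent(last_event_at, now, max_silence=timedelta(minutes=10)):
    """Return True when no check result has arrived within max_silence."""
    if last_event_at is None:
        # Nothing has ever been received, so treat that as silence too.
        return True
    return now - last_event_at > max_silence


now = datetime(2019, 3, 19, 12, 30)
recent = is_silent(datetime(2019, 3, 19, 12, 25), now)  # 5 minutes ago
stale = is_silent(datetime(2019, 3, 19, 12, 15), now)   # 15 minutes ago
```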

Conclusion

Logz.io's community tier is great for the Open Source Community - it's not something a lot of their competitors offer, and it's already given me some history and analytics for my site checks, which immediately allowed me to spot some trends I hadn't seen when it was just a stream of Slack messages.

I still plan on looking into DynamoDB as a storage backend in the future, both because I think offering multiple options is a good thing for the project and because some other alerting/reporting use cases could be satisfied by a storage backend like it. But for now, combining Logz.io with Checkless has been really easy, and a massive benefit.