Building an Extensible Health Check

Alt Title: Cool things you can do with OSGi

Introduction

So one thing that many organizations like to stand up in their Liferay environments is a "health check".  The goal is to provide a simple URL that monitoring systems can invoke to verify servers are functioning correctly.  The monitoring systems will review the time it takes to render the health check page and examine the contents to compare against known, expected results.  Should the page take too long to render or fail to return the expected result, the monitoring system will begin to alert operations staff.

The goal here is to allow operations to be proactive in resolving outage situations rather than being reactive when a client or supervisor calls in to see what is wrong with the site.

Now I'm not going to deliver a complete working health check system here (sorry in advance if you're disappointed).

What I am going to do is use this as an excuse to show how you can leverage some OSGi stuff to build out Liferay things that you really couldn't have easily done before.

Basically I'm going to build out an extensible health check system which exposes a simple URL and generates a simple HTML table listing the health check sensors and their status indicators, the words GREEN, YELLOW and RED.  In case it isn't clear, GREEN is healthy, YELLOW means there are non-fatal issues, and RED means something is drastically wrong.

Extensible is the key word in the previous paragraph.  I don't want the piece rendering the HTML to have to know about all of the registered sensors.  As a developer, I want to be able to create new sensors as new systems are integrated into Liferay, etc.  I don't want to have to know up front about every possible sensor I'm ever going to create and deploy; I'll worry about adding new sensors as the need arises.

Defining The Sensor

So our health check system is going to be composed of various sensors.  Our plan here is to follow the Unix philosophy of creating small, concise sensors that are each great at taking an individual sensor reading, rather than one really big, complicated sensor.

So to do this we're going to need to define our sensor interface:

public interface Sensor {
  public static final String STATUS_GREEN = "GREEN";
  public static final String STATUS_RED = "RED";
  public static final String STATUS_YELLOW = "YELLOW";

  /**
   * getRunSortOrder: Returns the order that the sensor should run.  Lower numbers
   * run before higher numbers.  When two sensors have the same run sort order, they
   * are subsequently ordered by name.
   * @return int The run sort order, lower numbers run before higher numbers.
   */
  public int getRunSortOrder();

  /**
   * getName: Returns the name of the sensor.  The name is also displayed in the HTML
   * for the health check report, so using human-readable names is recommended.
   * @return String The sensor display name.
   */
  public String getName();

  /**
   * getStatus: This is the meat of the sensor, this method is called to actually take
   * a sensor reading and return one of the status codes listed above.
   * @return String The sensor status.
   */
  public String getStatus();
}

Pretty simple, huh?  We accommodate the sorting of the sensors for running so we can have control over the test order, we support providing a display name for the HTML output, and we also provide the method for actually getting the sensor status.
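You can see the ordering rule on its own in the small plain-Java sketch below, which needs no OSGi runtime. It repeats the Sensor interface without the Javadoc and adds a hypothetical `StaticSensor` class that exists only to exercise the rule. The comparator sorts by getRunSortOrder() first and then breaks ties by getName(), which matches what the interface Javadoc describes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Copy of the Sensor interface (Javadoc omitted) so this sketch compiles on its own.
interface Sensor {
  String STATUS_GREEN = "GREEN";
  String STATUS_RED = "RED";
  String STATUS_YELLOW = "YELLOW";

  int getRunSortOrder();
  String getName();
  String getStatus();
}

// Hypothetical fixed-value sensor, used only to demonstrate the ordering.
final class StaticSensor implements Sensor {
  private final int order;
  private final String name;
  private final String status;

  StaticSensor(int order, String name, String status) {
    this.order = order;
    this.name = name;
    this.status = status;
  }

  @Override public int getRunSortOrder() { return order; }
  @Override public String getName() { return name; }
  @Override public String getStatus() { return status; }
}

final class SensorOrdering {
  // Lower run sort order runs first; equal orders fall back to name order.
  static final Comparator<Sensor> RUN_ORDER =
      Comparator.comparingInt(Sensor::getRunSortOrder)
          .thenComparing(Sensor::getName);

  // Returns the sensor names in the order the sensors would run.
  static List<String> sortedNames(List<Sensor> sensors) {
    List<Sensor> copy = new ArrayList<>(sensors);
    copy.sort(RUN_ORDER);
    List<String> names = new ArrayList<>();
    for (Sensor sensor : copy) {
      names.add(sensor.getName());
    }
    return names;
  }
}
```

`comparingInt` compares the run orders with `Integer.compare` semantics. A comparator built on subtraction can overflow when the values are extreme, and this one cannot.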

That's all we need to get our extensible healthcheck system started.  Now that we have the sensor interface, let's build some real sensors.

Building Sensors

Obviously we are going to be writing classes that implement the Sensor interface.  The fun part for us is that we're going to take advantage of OSGi for all of our sensor registration, bundling, etc.

So the first option we have with the sensors is whether to combine them in one module or build them as separate modules.  The truth is we really don't care.  You can stick with one module or separate modules.  You could mix things up and create multiple modules that each have multiple sensors.  You can include your sensor for your portlet directly in that module to keep it close to what the sensor is testing.  It's entirely up to you.

Our only requirements are that our modules depend on the Healthcheck API module and that each sensor implements the Sensor interface and registers itself as a Sensor service using the @Component annotation (service = Sensor.class).

So for our first sensor, let's look at JVM memory.  Our sensor is going to look at the percentage of memory used; we'll return GREEN if 60% or less is used, YELLOW if 61-80%, and RED if 81% or more is used.  We'll create this guy as a separate module, too.

Our memory sensor class is:

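import org.osgi.service.component.annotations.Component;

// Sensor comes from the Healthcheck API module; import it from that module's package.

@Component(immediate = true, service = Sensor.class)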
public class MemorySensor implements Sensor {
  public static final String NAME = "JVM Memory";

  @Override
  public int getRunSortOrder() {
    // This can run at any time, it's not dependent on others.
    return 5;
  }

  @Override
  public String getName() {
    return NAME;
  }

  @Override
  public String getStatus() {
    // need the percent used
    int pct = getPercentUsed();
    
    // if we are 60% or less, we are green.
    if (pct <= 60) {
      return STATUS_GREEN;
    }
    // if we are 61-80%, we are yellow
    if (pct <= 80) {
      return STATUS_YELLOW;
    }
    
    // if we are above 80%, we are red.
    return STATUS_RED;
  }

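  protected double getTotalMemory() {
    // totalMemory() is the heap the JVM has committed so far, which can grow
    // up to Runtime.getRuntime().maxMemory(); the percentages computed here are
    // relative to the committed heap, not to the maximum heap.
    double mem = Runtime.getRuntime().totalMemory();

    return mem;
  }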

  protected double getFreeMemory() {
    double mem = Runtime.getRuntime().freeMemory();

    return mem;
  }

  protected double getUsedMemory() {
    return getTotalMemory() - getFreeMemory();
  }

  protected int getPercentUsed() {
    double used = getUsedMemory();
    double pct = (used / getTotalMemory()) * 100.0;

    return (int) Math.round(pct);
  }
  
  protected int getPercentAvailable() {
    double pct = (getFreeMemory() / getTotalMemory()) * 100.0;

    return (int) Math.round(pct);
  }
}

Not very fancy.  There are obvious enhancements we could pursue with this.  We could add a configuration instance so we could define the memory thresholds in the control panel rather than using hard coded values.  We could refine the measurement to account for GC.  Whatever.  The point is we have a sensor which is responsible for getting the status and returning the status string.
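Here is a plain-Java sketch of how the two enhancements might look, written so it runs outside Liferay. The thresholds are constructor parameters. In a real module they could come from a configuration interface, but that wiring is an assumption and is not something the project ships. The percentage is measured against `Runtime.maxMemory()`, which is the ceiling the heap may grow to, rather than against `totalMemory()`, which is only the heap committed so far. `MemoryThresholds` and its methods are hypothetical names.

```java
// Hypothetical helper: configurable thresholds plus a max-heap based reading.
final class MemoryThresholds {
  private final int yellowAbovePct; // used % above this is at least YELLOW
  private final int redAbovePct;    // used % above this is RED

  MemoryThresholds(int yellowAbovePct, int redAbovePct) {
    if (yellowAbovePct > redAbovePct) {
      throw new IllegalArgumentException("yellow threshold must not exceed red threshold");
    }
    this.yellowAbovePct = yellowAbovePct;
    this.redAbovePct = redAbovePct;
  }

  // Maps a used-memory percentage to GREEN, YELLOW or RED.
  String classify(int pctUsed) {
    if (pctUsed <= yellowAbovePct) {
      return "GREEN";
    }
    if (pctUsed <= redAbovePct) {
      return "YELLOW";
    }
    return "RED";
  }

  // Percent of the maximum heap currently in use. If the JVM reports no
  // limit, maxMemory() is Long.MAX_VALUE and this rounds to 0.
  static int percentOfMaxHeapUsed() {
    Runtime runtime = Runtime.getRuntime();
    double used = runtime.totalMemory() - runtime.freeMemory();
    return (int) Math.round(used / runtime.maxMemory() * 100.0);
  }
}
```

With `new MemoryThresholds(60, 80)` the sensor behaves like the hard-coded version, and calling `classify(MemoryThresholds.percentOfMaxHeapUsed())` gives a reading that no longer jumps around as the JVM grows or shrinks its committed heap.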

Now imagine what you can do with these sensors... You can add a sensor for accessing your database(s).  You can check that LDAP is reachable.  If you use external web services, you could call them to ensure they are reachable (even better, if they have a health check facility of their own, your health check can incorporate theirs).
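As a rough illustration of a reachability sensor, the plain-Java sketch below only checks whether a TCP connection to a host and port can be opened within a timeout. An LDAP server on port 389 would be one example target. `ReachabilityCheck` is a hypothetical name. An open socket is a much weaker signal than a real LDAP bind or database query. In an actual module you would put this logic behind the Sensor interface in a @Component.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical reachability check; a real sensor would implement Sensor
// and be registered with @Component(service = Sensor.class).
final class ReachabilityCheck {
  private final String host;
  private final int port;
  private final int connectTimeoutMillis;
  private final long slowMillis;

  ReachabilityCheck(String host, int port, int connectTimeoutMillis, long slowMillis) {
    this.host = host;
    this.port = port;
    this.connectTimeoutMillis = connectTimeoutMillis;
    this.slowMillis = slowMillis;
  }

  // GREEN if a TCP connection opens quickly, YELLOW if it opens but slowly,
  // RED if it cannot be opened within the timeout (refused, unreachable, ...).
  String getStatus() {
    long start = System.nanoTime();
    try (Socket socket = new Socket()) {
      socket.connect(new InetSocketAddress(host, port), connectTimeoutMillis);
    } catch (IOException e) {
      return "RED";
    }
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    return elapsedMillis > slowMillis ? "YELLOW" : "GREEN";
  }
}
```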

Your sensor options are limited only by what you are capable of creating.

I'd recommend keeping the sensors simple and fast; you don't want a long-running sensor chewing up time and CPU just to get some idea of server health.
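One way to enforce "simple and fast" from the manager's side is to give each sensor a time budget. The plain-Java sketch below uses a hypothetical `TimedSensorRunner` and lets a `Supplier<String>` stand in for a Sensor's getStatus() so that it runs without the API module. It reports RED when a reading throws or fails to finish within the budget.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Hypothetical runner that bounds how long a single sensor reading may take.
final class TimedSensorRunner implements AutoCloseable {
  // Daemon threads, so a stuck sensor cannot keep the JVM alive.
  private final ExecutorService executor = Executors.newCachedThreadPool(runnable -> {
    Thread thread = new Thread(runnable, "health-sensor");
    thread.setDaemon(true);
    return thread;
  });

  String readStatus(Supplier<String> reading, long timeoutMillis) {
    Callable<String> task = reading::get;
    Future<String> future = executor.submit(task);
    try {
      return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // interrupt the slow sensor
      return "RED";
    } catch (ExecutionException e) {
      return "RED"; // the sensor threw an exception
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return "RED";
    }
  }

  @Override
  public void close() {
    executor.shutdownNow();
  }
}
```

Cancelling with interruption only helps if the sensor's code responds to interrupts. A sensor blocked in non-interruptible I/O keeps its thread until that I/O returns, which is why the threads are daemons.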

Building The Sensor Manager

The sensor manager is another key part of our extensible healthcheck system.

The sensor manager is going to use a ServiceTracker so it always knows which sensors are available and gracefully handles Sensor components being added and removed at runtime.  Here's the SensorManager:

import com.liferay.portal.kernel.log.Log;
import com.liferay.portal.kernel.log.LogFactoryUtil;

import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assumes Apache Commons Lang 3; commons-lang 2.x has the same classes under org.apache.commons.lang.time.
import org.apache.commons.lang3.time.DurationFormatUtils;
import org.apache.commons.lang3.time.StopWatch;

import org.osgi.framework.BundleContext;
import org.osgi.service.component.annotations.Activate;
import org.osgi.service.component.annotations.Component;
import org.osgi.service.component.annotations.Deactivate;

// SortingServiceTracker is a helper class from the healthcheck project, not a standard OSGi class.

@Component(immediate = true, service = SensorManager.class)
public class SensorManager {

  /**
   * getHealthStatus: Returns the map of current health statuses.
   * @return Map map of statuses in run order, key is the sensor name and value is the sensor status.
   */
  public Map<String,String> getHealthStatus() {
    StopWatch totalWatch = null;

    // time the total health check
    if (_log.isDebugEnabled()) {
      totalWatch = new StopWatch();

      totalWatch.start();
    }

    // grab the sorted list of sensors from our service tracker
    List<Sensor> sensors = _serviceTracker.getSortedServices();

    // a LinkedHashMap keeps the results in sensor run order for the report
    Map<String,String> statuses = new LinkedHashMap<>();

    // if we have at least one sensor
    if ((sensors != null) && (! sensors.isEmpty())) {
      String status;
      StopWatch sensorWatch = null;

      // create a stopwatch to time the sensors
      if (_log.isDebugEnabled()) {
        sensorWatch = new StopWatch();
      }

      // for each registered sensor
      for (Sensor sensor : sensors) {
        // reset the stopwatch for the run
        if (sensorWatch != null) {
          sensorWatch.reset();
          sensorWatch.start();
        }

        // get the status from the sensor
        status = sensor.getStatus();

        // add the sensor and status to the map
        statuses.put(sensor.getName(), status);

        // report sensor run time
        if (sensorWatch != null) {
          sensorWatch.stop();

          _log.debug("Sensor [" + sensor.getName() + "] run time: " + DurationFormatUtils.formatDurationWords(sensorWatch.getTime(), true, true));
        }
      }
    }

    // report health check run time
    if (totalWatch != null) {
      totalWatch.stop();

      _log.debug("Health check run time: " + DurationFormatUtils.formatDurationWords(totalWatch.getTime(), true, true));
    }

    // return the status map
    return statuses;
  }

  @Activate
  protected void activate(BundleContext bundleContext, Map<String, Object> properties) {

    // if we have a current service tracker (likely not), let's close it.
    if (_serviceTracker != null) {
      _serviceTracker.close();
    }

    // create a new sorting service tracker.
    _serviceTracker = new SortingServiceTracker<>(bundleContext, Sensor.class.getName(), new Comparator<Sensor>() {

      @Override
      public int compare(Sensor o1, Sensor o2) {
        // compare method to sort primarily on run order and secondarily on name.
        if ((o1 == null) && (o2 == null)) return 0;
        if (o1 == null) return -1;
        if (o2 == null) return 1;

        if (o1.getRunSortOrder() != o2.getRunSortOrder()) {
          // Integer.compare avoids the overflow a plain subtraction can hit
          return Integer.compare(o1.getRunSortOrder(), o2.getRunSortOrder());
        }

        return o1.getName().compareTo(o2.getName());
      }
    });

    // a ServiceTracker tracks nothing until it is opened (assumes SortingServiceTracker
    // extends org.osgi.util.tracker.ServiceTracker, where a repeated open() is a no-op)
    _serviceTracker.open();
  }

  @Deactivate
  protected void deactivate() {
    if (_serviceTracker != null) {
      _serviceTracker.close();
    }
  }

  private SortingServiceTracker<Sensor> _serviceTracker;
  private static final Log _log = LogFactoryUtil.getLog(SensorManager.class);
}

The SensorManager holds the ServiceTracker instance, uses it to retrieve the list of registered Sensor services, and asks each sensor in turn for its status.  The getHealthStatus() method hides all of the implementation details and simply returns the map of sensor names to statuses.
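To make the moving parts concrete without an OSGi container, here is a plain-Java stand-in. `SensorRegistry` and `FixedSensor` are hypothetical classes. `add`/`remove` play the role of the tracker's addingService/removedService callbacks. getHealthStatus() walks a sorted snapshot into a LinkedHashMap so the report keeps run order, and it reports RED for a sensor that throws. This sketch illustrates the idea and is not the project's SortingServiceTracker.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CopyOnWriteArrayList;

interface Sensor {
  int getRunSortOrder();
  String getName();
  String getStatus();
}

// Hypothetical fixed-value sensor for the demo.
final class FixedSensor implements Sensor {
  private final int order;
  private final String name;
  private final String status;

  FixedSensor(int order, String name, String status) {
    this.order = order;
    this.name = name;
    this.status = status;
  }

  @Override public int getRunSortOrder() { return order; }
  @Override public String getName() { return name; }
  @Override public String getStatus() { return status; }
}

final class SensorRegistry {
  private static final Comparator<Sensor> RUN_ORDER =
      Comparator.comparingInt(Sensor::getRunSortOrder).thenComparing(Sensor::getName);

  // Copy-on-write: sensors may come and go while a health check is running.
  private final List<Sensor> sensors = new CopyOnWriteArrayList<>();

  void add(Sensor sensor) { sensors.add(sensor); }       // like addingService
  void remove(Sensor sensor) { sensors.remove(sensor); } // like removedService

  Map<String, String> getHealthStatus() {
    List<Sensor> snapshot = new ArrayList<>(sensors);
    snapshot.sort(RUN_ORDER);
    Map<String, String> statuses = new LinkedHashMap<>(); // keeps run order
    for (Sensor sensor : snapshot) {
      String status;
      try {
        status = sensor.getStatus();
      } catch (RuntimeException e) {
        status = "RED"; // one failing sensor should not break the whole report
      }
      statuses.put(sensor.getName(), status);
    }
    return statuses;
  }
}
```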

Conclusion

Yep, that's right, this is the conclusion.  That's really all there is to see here.

I mean, there is more.  You need a portlet to serve up the health status on demand (a serve resource request works fine here), and displaying the health status in the portlet view lets admins see the health whenever they log into the portal.  You can also add a servlet so external monitoring systems can hit your status page at /o/healthcheck/status (my checked-in project supports this).
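The portlet and servlet live in the project rather than in this post, but the rendering step is small enough to sketch. The sketch assumes the map returned by getHealthStatus() as input. The hypothetical `HealthReportHtml.render` method is plain Java with no Liferay APIs. It builds the simple two-column table of sensor names and GREEN/YELLOW/RED words described in the introduction, and it escapes the text because sensor names can come from any deployed module.

```java
import java.util.Map;

// Hypothetical renderer for the health check status table.
final class HealthReportHtml {
  static String render(Map<String, String> statuses) {
    StringBuilder html = new StringBuilder();
    html.append("<table>\n<tr><th>Sensor</th><th>Status</th></tr>\n");
    // iterate in map order, so a LinkedHashMap keeps sensor run order
    for (Map.Entry<String, String> entry : statuses.entrySet()) {
      html.append("<tr><td>")
          .append(escape(entry.getKey()))
          .append("</td><td>")
          .append(escape(entry.getValue()))
          .append("</td></tr>\n");
    }
    html.append("</table>\n");
    return html.toString();
  }

  // Minimal HTML escaping for text content ('&' must be replaced first).
  private static String escape(String text) {
    if (text == null) {
      return "";
    }
    return text.replace("&", "&amp;")
        .replace("<", "&lt;")
        .replace(">", "&gt;")
        .replace("\"", "&quot;");
  }
}
```

The status words appear verbatim in the table cells, so a monitoring system can alert whenever the response body contains RED (or YELLOW).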

But yeah, that's not really important with respect to showing cool OSGi stuff.

Ideally this becomes a platform for you to build out an expandable health check system in your own environment.  Pull down the project, start writing your own Sensor implementations and check out the results.

If you build some cool sensors you want to share, send me a PR and I'll add them to the project.

In fact, let's consider this to be like a community project.  If you use it and find issues, feel free to submit PRs with fixes.  If you build some Sensors, submit a PR with them.  If you come up with a cool enhancement, send a PR.  I'll do some minimal verification and merge everything in.

Here's the github project link to get you started: https://github.com/dnebing/healthcheck

Alt Conclusion

Just like there's an alternate title, there's an alternate conclusion.

The alternate conclusion here is that there are some really cool things you can do when you embrace OSGi in Liferay, much the way Liferay itself has embraced OSGi.

OSGi offers a way to build expandable systems that are very decoupled.  If you need this kind of expansion, focus on separating your API from your implementations, then use a ServiceTracker to access all available instances.

Liferay uses this kind of thing extensively.  The product menu is extensible this way, the My Account pages are extensible this way, heck, even the LiferayMVC portlet implementations using the MVCActionCommand and MVCResourceCommand interfaces rely on the power of OSGi to handle the dynamic services.

LiferayMVC is actually an interesting example; instead of managing a service tracker list, it manages a service tracker map where the key is the MVC command name.  The LiferayMVC portlet uses the incoming MVC command to look up the matching service instance and passes control to it for processing.  This makes the portlet more extensible, because anyone can add a new command or override an existing one (using service ranking) without touching the original portlet module at all.
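The map-plus-ranking idea is easy to mimic in plain Java. The sketch below is not Liferay's implementation, which relies on OSGi service trackers and the `service.ranking` service property. `CommandRegistry`, its methods, and the command name used in the usage example are all hypothetical. Handlers are registered under a command name with a ranking, and a lookup returns the highest-ranked handler. Registering a higher-ranked handler therefore overrides the original without touching it, and unregistering the override brings the original back.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Function;

// Hypothetical command registry keyed by command name.
final class CommandRegistry {
  // One registered handler plus its ranking.
  private static final class Registration {
    final int ranking;
    final Function<String, String> handler;

    Registration(int ranking, Function<String, String> handler) {
      this.ranking = ranking;
      this.handler = handler;
    }
  }

  private final Map<String, List<Registration>> byCommand = new ConcurrentHashMap<>();

  void register(String command, int ranking, Function<String, String> handler) {
    byCommand.computeIfAbsent(command, key -> new CopyOnWriteArrayList<>())
        .add(new Registration(ranking, handler));
  }

  void unregister(String command, Function<String, String> handler) {
    List<Registration> registrations = byCommand.get(command);
    if (registrations != null) {
      registrations.removeIf(registration -> registration.handler == handler);
    }
  }

  // Highest ranking wins; on a tie the earliest registration is kept.
  Optional<Function<String, String>> lookup(String command) {
    return byCommand.getOrDefault(command, List.of()).stream()
        .max(Comparator.comparingInt(registration -> registration.ranking))
        .map(registration -> registration.handler);
  }
}
```

A portlet built this way only calls `lookup(command)` and hands the request to whatever comes back, so it never needs to know which module supplied the handler.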

Where can you find examples of things you can do leveraging OSGi concepts?  The Liferay source, of course.  Liferay eats its own dog food, and they do a lot more with OSGi than I've ever needed to.  If you have an idea for something that would benefit from an OSGi implementation but need an example of how to do it, find something in the Liferay source with a similar implementation and see how they did it.

Comments
Nice. I've been using the dropwizard metrics library lately as a base for a Liferay health check framework, but only on 6.2 so far. It will be interesting to see how easily that converts to DXP but I like that it provides a framework for gauges, timers, health, etc. so you don't have to start from scratch.
http://metrics.dropwizard.io/3.1.0/
Cool. As a library it would likely fit well into an OSGi framework since it has clear separation of interface from implementation. With OSGi, though, you wouldn't need to manage registries directly since OSGi will support dynamic "registration" when bundles are started.

Admittedly this blog was more about the fun you can have with OSGi; I wanted to demonstrate all that fun in a real-world scenario!