Guardian of the Functions: Keeping an Eye on your Galaxy of AWS Step Functions with Custom Metrics on CloudWatch


Managing multiple AWS Step Functions can quickly turn into a complex task, especially when each function forms a crucial link in a broader process. For instance, consider a data processing system where numerous files are uploaded, analyzed, and then relocated. Each step of this process could be orchestrated by its own Step Function, executing a variety of tasks in sequence.

For a team monitoring this process, an error in any of these functions could disrupt the entire sequence and halt the processing of subsequent files. Therefore, having a clear, real-time understanding of the status of each Step Function's latest execution is not just a nice-to-have—it's essential.

Now, imagine a scenario where your team is handling not just one, but dozens or even hundreds of such sequences—each represented by an AWS Step Function. Manually monitoring the status of each function's latest execution becomes an incredibly time-consuming task, and the risk of missing a crucial error increases.

This is where our Guardian comes into play 🧑‍🚒

Our goal is to create an intuitive dashboard that offers an at-a-glance overview of the status of each Step Function. Think of it as a traffic light system: green for successful executions, red for failures. At any moment, a quick look at this dashboard will tell us if all our functions are operating correctly or if there's a hitch in our sequence that needs our immediate attention.

In this blog post, we will outline how to use Terraform and AWS CloudWatch to achieve this. Terraform will help us set up and manage our infrastructure, while AWS CloudWatch will provide the platform for our monitoring dashboard. With these tools at our disposal, we'll turn the daunting task of overseeing a multitude of AWS Step Functions into a manageable, even effortless process.

The Missing Piece in AWS

When dealing with AWS Step Functions, one might assume that AWS would offer a native metric in CloudWatch for monitoring the status of the most recent execution of a function. After all, AWS provides a plethora of such metrics out of the box for many of its services.

Unfortunately, this isn't the case for Step Functions. While AWS does offer metrics like the total number of executions, succeeded executions, failed executions, and throttled executions, these are all aggregate metrics. They provide a broad view of a function's performance but do not offer insight into the status of each function's latest execution.

This lack of granularity can be a significant hurdle when monitoring a large number of Step Functions, especially when the status of the most recent execution is the key metric we're interested in.

So how can we fill this gap? The solution is to create our own custom metric, and in the next section, we'll dive into how we can use AWS Lambda and CloudWatch to do just that.

Creating a Custom Metric with AWS Lambda

Since AWS doesn't offer a native metric for the status of the latest execution of a Step Function, we need to create this metric ourselves. To do this, we'll use AWS Lambda, a service that lets you run your code without provisioning or managing servers.

The idea is straightforward: we'll create a Lambda function that periodically checks the status of the latest execution of each of our Step Functions and then publishes this information as a custom metric to CloudWatch.

Configuring IAM permissions

The first thing we need to do is ensure our Lambda function has the necessary permissions both to read the status of our Step Functions and to publish custom metrics to CloudWatch. To do this, we can create an IAM role with the following policy (or see how I use Terraform to create it in the next section):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "states:ListStateMachines",
                    "states:ListExecutions",
                    "cloudwatch:PutMetricData"
                ],
                "Resource": "*"
            }
        ]
    }

This policy allows our function to list all state machines (i.e., Step Functions), list their executions, and put metric data into CloudWatch.

Creating the Lambda function

With our permissions configured, we can now create our Lambda function. Here's a high-level overview of what our function will do:

  1. List all Step Functions in our account using the ListStateMachines API.

  2. For each Step Function, fetch the most recent execution using the ListExecutions API.

  3. Map the status of each execution to a numerical value with a function status_to_number.

  4. Publish these numerical statuses as custom metrics to CloudWatch using the PutMetricData API.

Here's an example of what the Python code for this Lambda function might look like:

import boto3

def lambda_handler(event, context):
    # Initialize clients
    sf_client = boto3.client('stepfunctions')
    cw_client = boto3.client('cloudwatch')

    # List all state machines
    state_machines = sf_client.list_state_machines()['stateMachines']

    # Loop through all state machines
    for sm in state_machines:
        # Get the latest execution (ListExecutions returns the most recent first)
        executions = sf_client.list_executions(
            stateMachineArn=sm['stateMachineArn'],
            maxResults=1
        )['executions']
        if not executions:
            continue  # Skip state machines that have never been executed
        status = executions[0]['status']

        # Map status to a numerical value
        status_value = status_to_number(status)

        # Publish custom metric to CloudWatch
        cw_client.put_metric_data(
            Namespace='StepFunctions',
            MetricData=[
                {
                    # The suffix must match the metric name used by the CloudWatch alarm
                    'MetricName': f"{sm['name']}_Status",
                    'Value': status_value
                }
            ]
        )

def status_to_number(status):
    mapping = {
        'RUNNING': 1,
        'SUCCEEDED': 2,
        'FAILED': 3,
        'TIMED_OUT': 4,
        'ABORTED': 5
    }
    return mapping.get(status, 0)  # return 0 if status is not in the mapping

This way, each state is represented by a unique numerical value, providing more granular information about the status of your Step Function executions.
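For example, when reading a raw metric value back off a graph, the mapping can be inverted. This small helper is purely illustrative (it is not part of the Lambda itself):

```python
# Mirrors the status_to_number mapping above; 'UNKNOWN' covers the fallback value 0
STATUS_TO_NUMBER = {
    'RUNNING': 1,
    'SUCCEEDED': 2,
    'FAILED': 3,
    'TIMED_OUT': 4,
    'ABORTED': 5,
}
NUMBER_TO_STATUS = {v: k for k, v in STATUS_TO_NUMBER.items()}

def number_to_status(value):
    """Recover the execution status name from a published metric value."""
    return NUMBER_TO_STATUS.get(value, 'UNKNOWN')

print(number_to_status(3))  # FAILED
```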

Scheduling the Lambda function

The final piece of the puzzle is to ensure our Lambda function runs periodically to keep our custom metrics up-to-date. To provide the most recent status of our Step Functions, we will schedule our Lambda function to run at regular intervals using Amazon EventBridge.

How do we ensure that our CloudWatch alarm reflects only the most recent state of the Step Function, not past states? This is a valid concern, as CloudWatch alarms often aggregate data over a certain time period, potentially mixing up the statuses of different Step Function executions.

This is where choosing the right statistic for our CloudWatch alarm comes into play. We will use the 'Maximum' statistic with a period of 1 hour for our alarm. This ensures that the alarm state always reflects the highest (i.e., most severe) status reported by the Lambda function in the past hour.

Why 'Maximum' and why a period of 1 hour? The 'Maximum' statistic ensures that if there's any failed execution (which we mapped to a higher value), it is the value taken into account. The 1-hour period is shorter than the interval at which our Lambda function runs (3 hours), so each evaluation period of the alarm contains at most one data point and is guaranteed to reflect only the most recent execution status.

Remember, the right frequency and period may depend on your use case, and you may need to adjust these values to fit your specific needs.
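As a quick sanity check on this reasoning, here is a toy sketch in plain Python (no AWS calls; the `alarm_state` helper is hypothetical) of how a 'Maximum'-statistic alarm resolves the data points in one period:

```python
# Status codes as published by the Lambda (see status_to_number)
STATUS = {'RUNNING': 1, 'SUCCEEDED': 2, 'FAILED': 3, 'TIMED_OUT': 4, 'ABORTED': 5}

def alarm_state(datapoints, threshold=3):
    """Mimic a CloudWatch alarm using the Maximum statistic:
    it fires if the largest value in the period meets the threshold."""
    if not datapoints:
        return 'INSUFFICIENT_DATA'  # no metric published in this period
    return 'ALARM' if max(datapoints) >= threshold else 'OK'

# With a 1-hour period and a 3-hour schedule, each period holds at most one point
print(alarm_state([STATUS['SUCCEEDED']]))  # OK
print(alarm_state([STATUS['FAILED']]))     # ALARM
print(alarm_state([]))                     # INSUFFICIENT_DATA
```

Because any value at or above the failure code trips the alarm, TIMED_OUT and ABORTED executions are caught as well.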

Step-by-Step: Creating our Monitoring Tool with Terraform

You have many AWS Step Functions to monitor and you're probably thinking, 'Surely, I don't have to set all of this up manually, right?' Fear not, because that's where Terraform comes in. By leveraging Infrastructure as Code, we can automate the process of creating our monitoring dashboard, saving time and ensuring consistent configuration. Let's dive into how we can use Terraform to solve our monitoring problem without having to resort to endless manual setup.

Terraforming the Custom Metrics Lambda function

Let's first create the Lambda function using the Terraform AWS Lambda Module. The AWS Lambda function will be responsible for checking the status of each Step Function and then pushing the corresponding status value to CloudWatch.

Below is a possible Terraform configuration that creates the Lambda function:

module "lambda_step_function_status" {
  source                   = "terraform-aws-modules/lambda/aws"
  version                  = "4.16.0"
  function_name            = "${local.project}-${var.env}-step-function-status-check"
  handler                  = "step_function_status.lambda_handler"
  runtime                  = "python3.8"
  memory_size              = 256
  timeout                  = 30
  architectures            = ["x86_64"]
  publish                  = true
  source_path              = "${path.module}/../source/lambda/"
  artifacts_dir            = "${path.root}/.terraform/lambda-builds/"
  attach_policy_statements = true
  policy_statements = {
    step_functions = {
      effect    = "Allow",
      actions   = ["states:ListStateMachines", "states:ListExecutions"],
      resources = ["*"]
    },
    cloudwatch = {
      effect    = "Allow",
      actions   = ["cloudwatch:PutMetricData"],
      resources = ["*"]
    }
  }
}

This Terraform configuration creates a new AWS Lambda function called step-function-status-check. The function is configured with Python 3.8 as the runtime environment and the handler is set to step_function_status.lambda_handler.

The source_path parameter is used to specify the location of the Python script, which contains the logic for checking the status of Step Functions and pushing the results to CloudWatch. The AWS Lambda function is granted permissions to list state machines (i.e., Step Functions) and their executions, and to put metric data to CloudWatch.

We can then use the outputs of this Lambda function in our subsequent steps to set up the CloudWatch alarms and dashboard.

Terraforming a trigger event for Lambda in EventBridge

Now, let's use an EventBridge module to schedule our Lambda function:

module "step_function_status_cron_event" {
  source  = "terraform-aws-modules/eventbridge/aws"
  version = "1.17.2"

  create_bus = false
  bus_name   = "default"

  rules = {
    step_function_status_cron = {
      description         = "Trigger to Step Function Status Check Lambda"
      schedule_expression = "cron(0 */3 * * ? *)" # Every 3 hours
    }
  }

  targets = {
    step_function_status_cron = [
      {
        name  = "lambda_step_function_status_check_cron"
        arn   = module.lambda_step_function_status.lambda_function_arn
        input = jsonencode({ "trigger" : "cron" })
      }
    ]
  }

  create_role = false
}

In the above code, we're creating an EventBridge rule that triggers our Lambda function every 3 hours. The schedule expression, cron(0 */3 * * ? *), translates to "at minute 0 past every 3rd hour."
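As a quick check of that expression, the hours it matches are simply those divisible by 3 (all times in UTC):

```python
# Hours (UTC) at which cron(0 */3 * * ? *) fires: every hour divisible by 3, at minute 0
trigger_hours = [hour for hour in range(24) if hour % 3 == 0]
print(trigger_hours)  # [0, 3, 6, 9, 12, 15, 18, 21]
```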

The target of this rule is the Lambda function we created earlier (module.lambda_step_function_status). When the rule fires, EventBridge invokes the function; the input is optional and can be used to pass specific event data to it.

We set create_role to false assuming that an existing IAM role will be used that has the necessary permissions for these resources. If you need to create a new role, you can change this to true and ensure the role has the appropriate permissions.

Terraforming the Step Functions

For this example, I'll assume we have a list of step function names stored in a Terraform variable. This list will be used to generate each step function and its corresponding alarm. Here's a simplified example, using an AWS Step Functions module:

variable "step_functions" {
  description = "A list of step function names"
  type        = list(string)
  default     = ["step1", "step2", "step3"]
}

module "step_functions" {
  source  = "terraform-aws-modules/step-functions/aws"
  version = "2.7.3"

  for_each = toset(var.step_functions)

  name       = "${each.value}-step-function"
  definition = file("${path.module}/definitions/${each.value}.json")

  logging_configuration = {
    include_execution_data = true
    level                  = "ALL"
  }

  cloudwatch_log_group_name              = "/aws/stepfunctions/${each.value}-step-function"
  cloudwatch_log_group_retention_in_days = 90
}

In this example, we're using the for_each construct in Terraform to iterate over the list of step function names and create a step function for each. The definition for each step function is assumed to be stored in a separate JSON file in the definitions directory.

The output of this module is a map of step function resources, indexed by their name.

Please remember to replace the placeholders with your actual step function definitions and settings. This is a simplified example, and in a real-world scenario you would probably need to customize this further to match your actual infrastructure and business needs.
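If other parts of your configuration need the generated resources, you can expose them from this map. A sketch, assuming the module provides a state_machine_arn output:

```hcl
# Expose the ARN of each generated Step Function, keyed by its short name
# (e.g. "step1"). Assumes the module exposes a `state_machine_arn` output.
output "step_function_arns" {
  value = { for name, sf in module.step_functions : name => sf.state_machine_arn }
}
```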

Terraforming Cloudwatch Alarms based on the Custom Metrics

Let's proceed to the CloudWatch metric and alarm setup. For this, we will use the aws_cloudwatch_metric_alarm resource, which will create an alarm for each of our Step Functions. We will use the Maximum statistic of our custom metric and set a threshold, so that an alarm is triggered if the Step Function fails:

resource "aws_cloudwatch_metric_alarm" "step_function_alarm" {
  for_each = module.step_functions

  alarm_name          = "StepFunctionStatusAlarm-${each.key}"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "${each.key}-step-function_Status" # Must match the metric name the Lambda publishes
  namespace           = "StepFunctions"
  period              = "3600" # Should be shorter than the interval at which the Lambda publishes the metric
  statistic           = "Maximum"
  threshold           = 3 # The status code for FAILED
  alarm_description   = "This metric checks the status of Step Function ${each.key}"
  alarm_actions       = [] # Add any actions you want to be triggered when the alarm goes off
  treat_missing_data  = "missing"
}

This will create an alarm for each of the Step Functions, and trigger it if the status of the last execution is higher than the number corresponding to 'SUCCEEDED'.

Please be aware that the period of the alarm should be shorter than the interval at which the Lambda function publishes the metric (3 hours in our setup). This ensures that each evaluation period contains at most one data point, so the alarm always reflects only the last reported execution status.

Terraforming the Guardian dashboard

Finally, let's create our dashboard to keep an eye on all our Step Functions. We will use the aws_cloudwatch_dashboard resource to do this:

resource "aws_cloudwatch_dashboard" "step_function_dashboard" {
  dashboard_name = "StepFunctionStatusDashboard"

  dashboard_body = jsonencode({
    widgets = [
      {
        "type" : "alarm",
        "x" : 0,
        "y" : 0,
        "width" : 24,
        "height" : 6,
        "properties" : {
          "title" : "Step Functions Last Execution Status",
          "alarms" : [for alarm in aws_cloudwatch_metric_alarm.step_function_alarm : alarm.arn]
        }
      }
    ]
  })
}

This will create a CloudWatch Dashboard with a single widget that shows the status of all our Step Function alarms. This way, you can quickly glance at the dashboard and see if there are any issues with your Step Functions.


In conclusion, we've crafted an efficient, automated system to monitor the status of numerous AWS Step Functions, and visualized this data in an easily digestible dashboard. This solution not only saves you time by avoiding manual checks but also provides a real-time representation of the health of your processes.

This system is flexible, customizable, and can be adapted to monitor different types of Step Functions or to include multiple alarms per function. We've utilized AWS services and Terraform to ensure it can keep up with dynamic cloud environments and be easily adjustable to meet your specific needs.

By keeping an eye on our galaxy of Step Functions, we emphasize the importance of reliable, automated monitoring in today's complex IT landscapes. The goal of this solution is to enhance operational efficiency, aid in troubleshooting, and ensure the smooth running of your business processes. Keep coding and monitoring smart!