Monitoring

Monitoring is an important part of operating any production system. To provide live insight into how traffic is flowing through the system, Synqly Embedded exports metrics in the Prometheus format, an industry standard for tracking operational data over time. Observability platforms such as New Relic, Datadog, Grafana, and Logz.io all natively support ingesting Prometheus metrics.

Metrics Collection

Synqly Embedded exports metrics on the v1/metrics API endpoint. To ingest the metrics into an observability platform, configure the platform's scraper to pull metrics from the embedded deployment's v1/metrics endpoint.

As an example, when running a New Relic Helm chart to scrape metrics from an entire Kubernetes cluster, the only configuration needed is a set of Prometheus annotations on the embedded pod.

Name:             embedded-78b977b899-tcmzc
Namespace:        synqly-embedded
Priority:         0
Service Account:  default
....
Annotations:      prometheus.io/path: v1/metrics
                  prometheus.io/scrape: true

When deploying embedded via the Synqly Embedded Helm Chart, these annotations can be added by setting the following value in values.yaml:

...
# Configuration that will be applied to every pod
pods:
  ...
  # true - adds "prometheus.io/scrape": true annotation to all Synqly pods
  prometheusScrape: true

Note that if your metrics ingestion tool is configured to pull metrics only from a specific namespace or deployment, its configuration may need to be updated to include embedded.

Key Metrics

Request Durations

The http_durations_ms metric tracks how long calls take to complete.

http_durations_ms supports the following dimension labels:

method: The HTTP method of the incoming request
path: The API endpoint of the incoming request
code: The HTTP response code returned by embedded
quantile: The quantile bucket that the given value represents

As an example:

http_durations_ms{code="204",method="POST",path="/v1/siem",quantile="0.99"} 1

This point-in-time value shows that, for POST calls to the v1/siem endpoint that returned a 204, the 0.99 quantile of request duration is 1ms. In other words, 99% of such calls completed in 1ms or less.
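To make the label structure concrete, the following minimal Python sketch parses a single sample line like the one above and restates it in words. The helper names parse_sample and describe are illustrative and not part of Synqly Embedded, and the regular expressions handle only simple lines like this one, not the full Prometheus exposition format.

```python
import re

# Simplified patterns (an assumption of this sketch): a metric name,
# an optional {label="value",...} block, then a numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split one sample line into (metric name, labels dict, float value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def describe(line):
    """Restate an http_durations_ms quantile sample as a sentence."""
    _, labels, value = parse_sample(line)
    percent = float(labels["quantile"]) * 100
    return (
        f"{percent:g}% of {labels['method']} {labels['path']} calls returning "
        f"{labels['code']} completed in {value:g} ms or less"
    )


sample = 'http_durations_ms{code="204",method="POST",path="/v1/siem",quantile="0.99"} 1'
print(describe(sample))
```

For anything beyond a quick check, a maintained Prometheus client library or the observability platform's own parser is a safer choice than hand-written regular expressions.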

For more information on Prometheus Quantiles, please refer to Histograms and Summaries.

The http_durations_ms metrics can be useful for tracking the performance of calls made to embedded.

embedded also tracks _sum and _count metrics for every code, method, and path combination.

http_durations_ms_sum: The sum of all request durations for the given label set since the last pod restart.
http_durations_ms_count: The total number of requests for the given label set since the last pod restart.

Both of these metrics support the following labels:

method: The HTTP method of the incoming request
path: The API endpoint of the incoming request
code: The HTTP response code returned by embedded

For example:

http_durations_ms_sum{code="200",method="POST",path="/v1/integrations"} 103776
http_durations_ms_count{code="200",method="POST",path="/v1/integrations"} 147

These point in time values show that there have been 147 POST calls made to v1/integrations that resulted in a 200 response code since the last embedded restart. The http_durations_ms_sum metric shows that those 147 calls took a total of 103776ms combined.

http_durations_ms_sum and http_durations_ms_count can be combined to track average call duration. Dividing http_durations_ms_sum by http_durations_ms_count gives the average request duration for the given label set since embedded last restarted. Applying a rate function to each metric before dividing gives the average over a recent time window instead.
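The arithmetic behind both averages can be sketched in a few lines of Python. The first scrape uses the values from the example above; the second scrape (110000 and 160) is a hypothetical value invented for illustration, and the function names are not part of any Synqly or Prometheus API.

```python
def average_ms(duration_sum, count):
    """Average request duration since the last restart, or None if no calls yet."""
    return duration_sum / count if count else None


def windowed_average_ms(earlier, later):
    """Average duration of the calls completed between two scrapes.

    Each scrape is a (sum, count) tuple for the same label set.
    Returns None when there were no new calls or the counters went
    backwards, which happens after an embedded restart.
    """
    delta_sum = later[0] - earlier[0]
    delta_count = later[1] - earlier[1]
    if delta_count <= 0:
        return None
    return delta_sum / delta_count


first_scrape = (103776, 147)   # values from the example above
second_scrape = (110000, 160)  # hypothetical later scrape

print(round(average_ms(*first_scrape), 1))                          # lifetime average
print(round(windowed_average_ms(first_scrape, second_scrape), 1))   # average between scrapes
```

With the example values, the lifetime average is roughly 706ms per call. The windowed average reflects only the calls made between the two scrapes, so it reacts faster to a recent change in latency.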

Provider Counts

The provider_count metric represents the number of calls made by a given Synqly Organization to a target Provider since embedded last restarted.

provider_count supports the following labels:

organization: A Synqly Organization in the target embedded instance
type: A Provider

For example:

provider_count{organization="sandbox-embedded-e2e",type="defender"} 97
provider_count{organization="sandbox-embedded-e2e",type="elasticsearch"} 159
provider_count{organization="sandbox-embedded-e2e",type="entra_id"} 39

These point in time values show how many calls have been made by the sandbox-embedded-e2e Synqly Organization to the target Provider type since embedded last restarted.

When combined with a rate function in your observability tool of choice, provider_count can be used to track Provider usage over time across all the Organizations within your embedded instance.
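As a rough illustration of what a rate function does with a counter such as provider_count, the Python sketch below turns two scrapes into calls per second. The second scrape value and the 60 second interval are hypothetical, and the reset handling is a simplified assumption modeled on how counter-aware rate functions typically treat a restart.

```python
def per_second_rate(earlier_value, later_value, elapsed_seconds):
    """Approximate per-second rate of a counter between two scrapes."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    increase = later_value - earlier_value
    if increase < 0:
        # The counter went backwards, so embedded restarted between scrapes.
        # Everything counted since the restart is treated as new.
        increase = later_value
    return increase / elapsed_seconds


# 97 is the defender value from the example above; 127 is a hypothetical
# value scraped 60 seconds later.
print(per_second_rate(97, 127, 60))   # 0.5 calls per second

# Hypothetical restart: the counter drops from 159 to 12.
print(per_second_rate(159, 12, 60))   # 0.2 calls per second
```

Because the counter resets on restart, comparing raw values across a restart is misleading; working with rates avoids that problem.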

Kubernetes Pod Metrics

When running Synqly Embedded via the Synqly Embedded Helm Chart, the embedded Kubernetes Pod metrics provide operational insights into the resource utilization of the Pod.

Kubernetes Pod metrics should be automatically ingested by the Kubernetes data scraper of any major observability platform. For more information on the metrics and what they represent, please refer to Kubernetes Metric Reference.

The following metrics can be used to track the Memory and CPU usage of the embedded Kubernetes Pod.

container_memory_working_set_bytes: Represents the amount of memory in use by the embedded container. Although there are multiple metrics for tracking memory pools, container_memory_working_set_bytes is the most useful as it represents memory that cannot be safely evicted. If this metric exceeds the Pod's memory request, the kubelet may evict the Pod when the node runs low on memory. If it reaches the container's memory limit, the container is terminated with an OOM error.
container_cpu_usage_seconds_total: Represents the cumulative CPU time consumed by the container in core-seconds. This metric can be combined with a rate function to track the per-second CPU usage, expressed in cores. If the per-second CPU usage approaches the CPU limit of the embedded Pod, the Pod could experience increased request latency due to CPU throttling. Usage above the CPU request can also be squeezed when the node is under CPU contention.
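The two metrics above can be compared against the Pod's configured resources with simple arithmetic, sketched below in Python. All sample values (the counter readings, the 0.5 core and 512 MiB settings) are hypothetical, and the function names are illustrative only.

```python
def cpu_cores_used(earlier_seconds, later_seconds, elapsed_seconds):
    """Average number of CPU cores used between two scrapes of
    container_cpu_usage_seconds_total."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return (later_seconds - earlier_seconds) / elapsed_seconds


def utilization(used, configured):
    """Fraction of a configured request or limit currently in use."""
    return used / configured


# Hypothetical: the CPU counter advanced by 15 core-seconds in 60 seconds.
cores = cpu_cores_used(1200.0, 1215.0, 60)
print(cores)                     # 0.25 cores
print(utilization(cores, 0.5))   # 0.5, half of a hypothetical 0.5 core setting

# Hypothetical: a 384 MiB working set against a 512 MiB memory setting.
print(utilization(384 * 1024**2, 512 * 1024**2))  # 0.75
```

Alerting when either fraction stays close to 1 for a sustained period is a common starting point, though the right threshold depends on the workload.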

The following metrics are also available through cAdvisor, a tool that is integrated with the kubelet binary and exposes additional container metrics. For more information on cAdvisor metrics, please refer to cAdvisor Prometheus Metrics.

container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total: Represents the fraction of CPU periods in which the embedded container was throttled. If this ratio rises above 5-10%, it is a good indicator that the Pod needs a higher CPU request or limit. To adjust the CPU resources, modify the following values in values.yaml: