# Monitoring

Monitoring is an important part of operating any production system. To enable live insight into how traffic flows through the system, Synqly Embedded exports [Prometheus](https://prometheus.io/docs/concepts/data_model/)-formatted metrics. Prometheus metrics are an industry-standard format for tracking operational data over time. Observability platforms such as NewRelic, DataDog, Grafana, and Logz.io all natively support ingesting Prometheus metrics.

## Metrics Collection

Synqly Embedded exports metrics on the `v1/metrics` API endpoint. To ingest these metrics into an observability platform, configure the platform's scraper to pull metrics from the `embedded` deployment's `v1/metrics` endpoint.

As an example, when running a NewRelic Helm chart that scrapes metrics from an entire Kubernetes cluster, the only configuration needed is a set of Prometheus annotations on the `embedded` pod:

```
Name:             embedded-78b977b899-tcmzc
Namespace:        synqly-embedded
Priority:         0
Service Account:  default
....
Annotations:      prometheus.io/path: v1/metrics
                  prometheus.io/scrape: true
```

When deploying `embedded` via the Synqly Embedded Helm Chart, these annotations can be added by setting the following value in `values.yaml`:

```yaml
...
# Configuration that will be applied to every pod
pods:
  ...
  # true - adds "prometheus.io/scrape": true annotation to all Synqly pods
  prometheusScrape: true
```

Please note that if your metrics ingestion tool is configured to pull metrics only from specific namespaces or deployments, it may need to be updated to include `embedded`.

## Key Metrics

### Request Durations

The `http_durations_ms` metric tracks how long calls take to complete. `http_durations_ms` supports the following dimension labels:

- `method`: The HTTP method of the incoming request
- `path`: The API endpoint of the incoming request
- `code`: The HTTP response code returned by `embedded`
- `quantile`: The quantile bucket that the given value represents

As an example:

```
http_durations_ms{code="204",method="POST",path="/v1/siem",quantile="0.99"} 1
```

This point-in-time value shows that, at the 0.99 quantile, POST calls to the `v1/siem` endpoint that returned a `204` took 1ms to complete. In other words, 99% of such calls completed in 1ms or less. For more information on Prometheus quantiles, please refer to [Histograms and Summaries](https://prometheus.io/docs/practices/histograms/#quantiles).

The `http_durations_ms` metrics can be useful for tracking the performance of calls made to `embedded`. `embedded` also tracks `_sum` and `_count` metrics for every `code`, `method`, and `path` combination:

- `http_durations_ms_sum`: The sum of all request durations for the given label set since the last pod restart
- `http_durations_ms_count`: The total number of requests for the given label set since the last pod restart

Both of these metrics support the following labels:

- `method`: The HTTP method of the incoming request
- `path`: The API endpoint of the incoming request
- `code`: The HTTP response code returned by `embedded`

For example:

```
http_durations_ms_sum{code="200",method="POST",path="/v1/integrations"} 103776
http_durations_ms_count{code="200",method="POST",path="/v1/integrations"} 147
```

These point-in-time values show that there have been 147 POST calls to `v1/integrations` that resulted in a `200` response code since the last `embedded` restart, and that those 147 calls took a total of 103776ms combined.

`http_durations_ms_sum` and `http_durations_ms_count` can be useful in combination with a rate function to track average call duration over time. For example, `http_durations_ms_sum / http_durations_ms_count` gives the average request duration for the given label set since `embedded` last restarted.
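As a rough PromQL sketch of that rolling average (the 5-minute window is an arbitrary assumption; tune it to your scrape interval and observability tool):

```
# Average request duration in ms over the last 5 minutes,
# per code/method/path combination
rate(http_durations_ms_sum[5m]) / rate(http_durations_ms_count[5m])
```

Because the division matches on identical label sets, this yields one average-duration series per `code`, `method`, and `path` combination.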
### Provider Counts

The `provider_count` metric represents the number of calls made by a given Synqly Organization to a target Provider since `embedded` last restarted. `provider_count` supports the following labels:

- `organization`: A Synqly Organization in the target `embedded` instance
- `type`: A Provider

For example:

```
provider_count{organization="sandbox-embedded-e2e",type="defender"} 97
provider_count{organization="sandbox-embedded-e2e",type="elasticsearch"} 159
provider_count{organization="sandbox-embedded-e2e",type="entra_id"} 39
```

These point-in-time values show how many calls the `sandbox-embedded-e2e` Synqly Organization has made to each target Provider type since `embedded` last restarted. When combined with a rate function in your observability tool of choice, `provider_count` can be used to track Provider usage over time across all of the Organizations within your `embedded` instance.

### Kubernetes Pod Metrics

When running Synqly Embedded via the Synqly Embedded Helm Chart, the `embedded` Kubernetes Pod metrics provide operational insights into the resource utilization of the Pod. Kubernetes Pod metrics should be automatically ingested by the Kubernetes data scraper of any major observability platform. For more information on these metrics and what they represent, please refer to the [Kubernetes Metric Reference](https://kubernetes.io/docs/reference/instrumentation/metrics/).

The following metrics can be used to track the memory and CPU usage of the `embedded` Kubernetes Pod:

- `container_memory_working_set_bytes`: Represents the amount of memory in use by the `embedded` container. Although there are multiple metrics for tracking memory pools, `container_memory_working_set_bytes` is the most useful because it represents memory that cannot be safely evicted. If this metric exceeds the Pod's memory request, the Pod may be evicted when the node comes under memory pressure; if it reaches the Pod's memory limit, the container will be OOM-killed.
- `container_cpu_usage_seconds_total`: Represents the cumulative CPU time consumed by the container in core-seconds. This metric can be combined with a rate function to track per-second CPU usage. If the per-second CPU usage approaches the `embedded` Pod's CPU limit, the Pod could experience increased request latency due to CPU throttling.

The following metrics are also available through cAdvisor, a tool integrated into the `kubelet` binary that exposes additional container metrics. For more information on cAdvisor metrics, please refer to [cAdvisor Prometheus Metrics](https://github.com/openshift/google-cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics).

- `container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total`: Represents the percentage of CPU periods in which the `embedded` container was throttled. If this percentage rises above 5-10%, it is a good indicator that the Pod needs either a higher CPU request or limit. To modify the CPU resources, adjust the following values in `values.yaml`:

```yaml
embedded:
  ...
  # Resource allocations for the `embedded` pod(s)
  resources:
    requests:
      cpu: "0.5"
      ....
    limits:
      cpu: "1"
      ....
```
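To chart the throttling percentage described above, a PromQL expression along the following lines can be used. This is a sketch only; the `container="embedded"` label selector and the 5-minute window are assumptions and should be adjusted to match your deployment:

```
# Fraction of CPU periods in which the embedded container was throttled
# over the last 5 minutes (multiply by 100 for a percentage)
rate(container_cpu_cfs_throttled_periods_total{container="embedded"}[5m])
  / rate(container_cpu_cfs_periods_total{container="embedded"}[5m])
```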