# Monitoring

Monitoring is an important part of operating any production system. To enable live insight into how traffic flows through the system, Synqly Embedded exports [Prometheus](https://prometheus.io/docs/concepts/data_model/)-formatted metrics. Prometheus metrics are an industry-standard format for tracking operational data over time. Observability platforms such as NewRelic, DataDog, Grafana, and Logz.io all natively support ingesting Prometheus metrics.

## Metrics Collection

Synqly Embedded exports metrics on the `v1/metrics` API endpoint. To ingest these metrics into an observability platform, configure the platform's scraper to pull metrics from the `embedded` deployment's `v1/metrics` endpoint.

As an example, when running a NewRelic Helm chart that scrapes metrics from an entire Kubernetes cluster, the only configuration needed is a set of Prometheus annotations on the `embedded` pod:

```
Name:             embedded-78b977b899-tcmzc
Namespace:        synqly-embedded
Priority:         0
Service Account:  default
....
Annotations:      prometheus.io/path: v1/metrics
                  prometheus.io/scrape: true
```

When deploying `embedded` via the Synqly Embedded Helm Chart, these annotations can be added by setting the following value in `values.yaml`:

```yaml
...
# Configuration that will be applied to every pod
pods:
  ...
  # true - adds "prometheus.io/scrape": true annotation to all Synqly pods
  prometheusScrape: true
```

Please note that if your metrics ingestion tool is configured to pull metrics only from specific namespaces or deployments, it may need to be updated to include `embedded`.

## Key Metrics

### Request Durations

The `http_durations_ms` metric tracks how long calls take to complete. `http_durations_ms` supports the following dimension labels:

- `method`: The HTTP method of the incoming request
- `path`: The API endpoint of the incoming request
- `code`: The HTTP response code returned by `embedded`
- `quantile`: The quantile bucket that the given value represents

As an example:

```
http_durations_ms{code="204",method="POST",path="/v1/siem",quantile="0.99"} 1
```

This point-in-time value shows that, at the 0.99 quantile, POST calls to the `v1/siem` endpoint that returned a `204` took 1ms to complete. In other words, 99% of such calls completed in 1ms or less. For more information on Prometheus quantiles, please refer to [Histograms and Summaries](https://prometheus.io/docs/practices/histograms/#quantiles).

The `http_durations_ms` metrics can be useful for tracking the performance of calls made to `embedded`. `embedded` also tracks `_sum` and `_count` metrics for every `code`, `method`, and `path` combination:

- `http_durations_ms_sum`: The sum of all request durations for the given label set since the last pod restart
- `http_durations_ms_count`: The total number of requests for the given label set since the last pod restart

Both of these metrics support the following labels:

- `method`: The HTTP method of the incoming request
- `path`: The API endpoint of the incoming request
- `code`: The HTTP response code returned by `embedded`

For example:

```
http_durations_ms_sum{code="200",method="POST",path="/v1/integrations"} 103776
http_durations_ms_count{code="200",method="POST",path="/v1/integrations"} 147
```

These point-in-time values show that there have been 147 POST calls to `v1/integrations` that resulted in a `200` response code since the last `embedded` restart, and that those 147 calls took a total of 103776ms combined.

`http_durations_ms_sum` and `http_durations_ms_count` can be useful in combination with a rate function to track average call duration over time. For example, `http_durations_ms_sum / http_durations_ms_count` gives the average request duration for the given label set since `embedded` last restarted.
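As a rough PromQL sketch of that rolling average (the 5-minute window is an arbitrary assumption; tune it to your scrape interval and observability tool):

```
# Average request duration in ms over the last 5 minutes,
# per code/method/path combination
rate(http_durations_ms_sum[5m]) / rate(http_durations_ms_count[5m])
```

Because the division matches on identical label sets, this yields one average-duration series per `code`, `method`, and `path` combination.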
### Provider Counts

The `provider_count` metric represents the number of calls made by a given Synqly Organization to a target Provider since `embedded` last restarted. `provider_count` supports the following labels:

- `organization`: A Synqly Organization in the target `embedded` instance
- `type`: A Provider

For example:

```
provider_count{organization="sandbox-embedded-e2e",type="defender"} 97
provider_count{organization="sandbox-embedded-e2e",type="elasticsearch"} 159
provider_count{organization="sandbox-embedded-e2e",type="entra_id"} 39
```

These point-in-time values show how many calls the `sandbox-embedded-e2e` Synqly Organization has made to each target Provider type since `embedded` last restarted. When combined with a rate function in your observability tool of choice, `provider_count` can be used to track Provider usage over time across all of the Organizations within your `embedded` instance.

### Kubernetes Pod Metrics

When running Synqly Embedded via the Synqly Embedded Helm Chart, the `embedded` Kubernetes Pod metrics provide operational insights into the resource utilization of the Pod. Kubernetes Pod metrics should be automatically ingested by the Kubernetes data scraper of any major observability platform. For more information on these metrics and what they represent, please refer to the [Kubernetes Metric Reference](https://kubernetes.io/docs/reference/instrumentation/metrics/).

The following metrics can be used to track the memory and CPU usage of the `embedded` Kubernetes Pod:

- `container_memory_working_set_bytes`: Represents the amount of memory in use by the `embedded` container. Although there are multiple metrics for tracking memory pools, `container_memory_working_set_bytes` is the most useful because it represents memory that cannot be safely evicted. If this metric exceeds the Pod's memory request, the Pod may be evicted when the node comes under memory pressure; if it reaches the Pod's memory limit, the container will be OOM-killed.
- `container_cpu_usage_seconds_total`: Represents the cumulative CPU time consumed by the container in core-seconds. This metric can be combined with a rate function to track per-second CPU usage. If the per-second CPU usage approaches the `embedded` Pod's CPU limit, the Pod could experience increased request latency due to CPU throttling.

The following metrics are also available through cAdvisor, a tool integrated into the `kubelet` binary that exposes additional container metrics. For more information on cAdvisor metrics, please refer to [cAdvisor Prometheus Metrics](https://github.com/openshift/google-cadvisor/blob/master/docs/storage/prometheus.md#prometheus-container-metrics).

- `container_cpu_cfs_throttled_periods_total / container_cpu_cfs_periods_total`: Represents the percentage of CPU periods in which the `embedded` container was throttled. If this percentage rises above 5-10%, it is a good indicator that the Pod needs either a higher CPU request or limit. To modify the CPU resources, adjust the following values in `values.yaml`:

```yaml
embedded:
  ...
  # Resource allocations for the `embedded` pod(s)
  resources:
    requests:
      cpu: "0.5"
      ....
    limits:
      cpu: "1"
      ....
```
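To chart the throttling percentage described above, a PromQL expression along the following lines can be used. This is a sketch only; the `container="embedded"` label selector and the 5-minute window are assumptions and should be adjusted to match your deployment:

```
# Fraction of CPU periods in which the embedded container was throttled
# over the last 5 minutes (multiply by 100 for a percentage)
rate(container_cpu_cfs_throttled_periods_total{container="embedded"}[5m])
  / rate(container_cpu_cfs_periods_total{container="embedded"}[5m])
```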