
Monitor your EKS without dying in the effort!

· 9 min read
María García
Junior Platform Engineer

Monitoring a Kubernetes cluster is essential to know what is going on and to detect problems in time. There are multiple tools available for this purpose, but in this implementation, we'll use Metrics Server, Alloy, Loki, Kube Prometheus Stack, and Alertmanager.

This setup will be deployed on an EKS cluster, with all configurations managed through Terraform and Helm charts.

Infrastructure Overview

The monitoring stack consists of the following components:

  • Metrics Server: For basic resource metrics collection
  • Alloy: To collect and forward metrics and logs
  • Loki: For log storage and querying
  • Prometheus Operator: To automatically manage Prometheus and Alertmanager configurations in Kubernetes
  • Grafana: For visualisation of metrics and logs (integrated with Prometheus and Loki)
  • Alertmanager: For handling alerts and notifications

All components will be deployed using Helm charts with configurations stored in our platform repository.
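
As an illustration, such a repository could be organised with one chart directory per component. The layout below is just an assumption of how it might look, not a requirement:

monitoring/
├── metrics-server/
│   ├── Chart.yaml
│   └── values.yaml
├── alloy/
│   ├── Chart.yaml
│   ├── values.yaml
│   ├── templates/config.yaml
│   └── config/            # default.alloy and loki.alloy
├── loki/
│   ├── Chart.yaml
│   └── values.yaml
└── kube-prometheus-stack/
    ├── Chart.yaml
    └── values.yaml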

Metrics Server

Metrics Server is a tool used in Kubernetes that collects basic resource usage metrics (such as CPU and memory) from the nodes and pods in the cluster. To implement it, we will create a Chart.yaml and a values.yaml similar to these:

Chart.yaml:

apiVersion: v2
name: metrics-server
description: metrics-server helm chart
type: application
version: 3.12.1 # Change to the current version
dependencies:
  - name: metrics-server
    version: 3.12.1 # Change to the current version
    repository: https://kubernetes-sigs.github.io/metrics-server

In this case, the values.yaml file can be left empty (although the file itself still needs to exist). Then, run helm dependency update, which will create the necessary Chart.lock and .tgz files.
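
Once the dependencies are resolved, the chart can be installed and verified; a minimal sketch (release name, namespace, and chart path are just examples, adjust them to your setup):

helm upgrade --install metrics-server ./metrics-server -n monitoring --create-namespace
# Should return CPU/memory usage per node once metrics-server is running
kubectl top nodes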

Alloy

Alloy is designed to collect and send metrics, logs, etc., to different destinations, such as Grafana, Loki, or Prometheus. It is the successor to Grafana Agent, so its configuration is similar (but not identical).

Unlike Grafana Agent, which used .river files for configuration, Alloy uses .alloy.

In this case, we will use Alloy to send logs to Loki. To do this, we start by creating the Chart.yaml and values.yaml. These are just examples; you should adjust them to your needs:

Chart.yaml:

apiVersion: v2
name: alloy
description: A Helm chart for Alloy (Promtail)
version: 0.12.3 # Change to the current version
dependencies:
  - name: alloy
    version: 0.12.3 # Change to the current version
    repository: https://grafana.github.io/helm-charts

values.yaml:

alloy:

  ingress:
    enabled: false
    annotations:
      kubernetes.io/external-dns.create: "true"
    ingressClassName: "external"
    faroPort: 80
    hosts:
      - alloy.example.com # Change to your domain

  alloy:
    configMap:
      create: false
      name: alloy

After this, run helm dependency update to create both the Chart.lock and the .tgz file. The next step is to create a template called config.yaml. This template instructs Alloy to load its configuration from the files that will later be placed in a folder named config:

apiVersion: v1
kind: ConfigMap
metadata:
  name: alloy
data:
  config.alloy: |
    {{- range $file, $content := .Files.Glob "config/*" }}
    // {{ $file }}
{{ $content | toString | indent 4 }}
    {{- end }}

For reference, the content rendered into config.alloy is plain Alloy configuration, for example a file-based log source writing to Loki:

loki.source.file "logs" {
  targets = [
    {__path__ = "/var/log/*.log"},
  ]
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc.cluster.local/loki/api/v1/push"

    basic_auth {
      username = "loki"
      password = "loki_password"
    }
  }
}

Now create the config folder mentioned earlier. Inside it, we will store two files—one containing the general configuration and another with the specific configuration for Loki.

default.alloy:

logging {
  level  = "info"
  format = "logfmt"
}

discovery.kubernetes "pods" {
  role = "pod"
}

discovery.kubernetes "services" {
  role = "service"
}

discovery.kubernetes "nodes" {
  role = "node"
}

discovery.relabel "pods_k8s_labels" {
  targets = discovery.kubernetes.pods.targets

  rule {
    action      = "labelmap"
    regex       = "__meta_kubernetes_namespace$"
    replacement = "namespace"
  }
}

discovery.relabel "services_k8s_labels" {
  targets = discovery.kubernetes.services.targets

  rule {
    action = "labelmap"
    regex  = "__meta_kubernetes_(.+)"
  }
}

loki.alloy:

loki.source.kubernetes "pods" {
  targets    = discovery.relabel.pods_k8s_labels.output
  forward_to = [loki.write.loki.receiver]
}

loki.write "loki" {
  endpoint {
    url = "http://loki-gateway.monitoring.svc:80/loki/api/v1/push"
    basic_auth {
      username = "loki"
      password = "loki_password"
    }
  }
}
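
Once both files are in place, the Alloy chart can be deployed like any other; the commands below are a rough sketch (release name, namespace, and the label selector are assumptions based on standard chart conventions):

helm upgrade --install alloy ./alloy -n monitoring
# Check that the Alloy pods start and pick up the rendered config.alloy
kubectl -n monitoring get pods -l app.kubernetes.io/name=alloy
kubectl -n monitoring logs -l app.kubernetes.io/name=alloy --tail=50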

Loki

Loki is a log storage and query system that can be integrated with Grafana. To get started, it is necessary to create a bucket where all the collected logs will be stored:

loki.tf:

module "loki_oidc_role" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
version = "~> 5.0"
role_name = "loki-oidc-role"
oidc_providers = {
oidc_provider = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["monitoring:loki"]
}
}
}

resource "aws_iam_policy" "loki-policy" {
name = "loki"
path = "/"
description = "Loki IAM Policy to have access to S3 buckets"

policy = jsonencode({
Version = "2012-10-17"
Statement = [
{
"Sid" : "lokiPermissions",
"Effect" : "Allow",
"Action" : [
"s3:PutObject",
"s3:GetObject",
"s3:ListBucket",
"s3:DeleteObject",
"s3:GetObjectTagging",
"s3:PutObjectTagging"
],
"Resource" : [
"arn:aws:s3:::${local.customer}-loki-chunks/*",
"arn:aws:s3:::${local.customer}-loki-chunks"
]
}
]
})
}

resource "aws_iam_role_policy_attachment" "loki-attach" {
role = module.loki_oidc_role.iam_role_name
policy_arn = aws_iam_policy.loki-policy.arn
}

resource "aws_s3_bucket" "loki_chunks" {
bucket = "${local.customer}-loki-chunks"
force_destroy = true
}
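
Assuming this file lives alongside the rest of your EKS Terraform configuration (local.customer and module.eks come from that existing code), the usual Terraform workflow applies; a rough sketch:

terraform init
terraform plan -out=loki.tfplan
terraform apply loki.tfplan
# Confirm the bucket exists before deploying Loki
aws s3 ls | grep loki-chunks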

Once the bucket is created, the next step is to generate the Chart.yaml and values.yaml files.

Chart.yaml:

apiVersion: v2
name: loki
description: loki-distributed helm chart
type: application
version: 0.80.2 # Change to the current version
dependencies:
  - name: loki-distributed
    version: 0.80.2 # Change to the current version
    repository: https://grafana.github.io/helm-charts

values.yaml:

loki-distributed:
  nameOverride: loki
  loki:
    structuredConfig:
      auth_enabled: false # Enable with X-Scope-OrgID header
      compactor:
        shared_store: s3
        compaction_interval: 10m
        retention_enabled: true
        retention_delete_delay: 1m
        retention_delete_worker_count: 150
        delete_request_cancel_period: 10m

      limits_config:
        retention_period: 1y

      schema_config:
        configs:
          - from: 2020-09-07
            store: boltdb-shipper
            object_store: s3
            schema: v12
            index:
              prefix: loki_index_
              period: 24h

      storage_config:
        filesystem: null
        boltdb_shipper:
          build_per_tenant_index: true
          shared_store: s3
        aws:
          region: us-east-1 # Change to your region
          bucketnames: example-loki-chunks
          insecure: false
          s3forcepathstyle: false

  serviceAccount:
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::examplenumberaccount:role/loki-oidc-role # Change to your AWS account

  gateway:
    basicAuth:
      enabled: true
      username: loki
      password: loki_password

  compactor:
    enabled: true

After this, as with the previous tools, we will run helm dependency update to create the Chart.lock and .tgz file.
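
To deploy the chart and quickly check that the gateway answers with the basic auth credentials from the values file, a sketch along these lines should work (chart path, service, namespace, and credentials are assumptions based on the examples above; adjust them to your setup):

helm upgrade --install loki ./loki -n monitoring
kubectl -n monitoring get pods | grep loki
# Query the labels API through the gateway to confirm Loki is reachable
kubectl -n monitoring port-forward svc/loki-gateway 8080:80 &
curl -u loki:loki_password "http://localhost:8080/loki/api/v1/labels"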

Kube Prometheus Stack

Kube Prometheus Stack is a monitoring toolset designed for Kubernetes, which automatically deploys components such as Prometheus, Alertmanager, and Grafana in your cluster. This solution, packaged as a Helm Chart, provides everything necessary to collect metrics, configure alerts, and visualize data, with predefined dashboards for Kubernetes, nodes, and applications. Here is an example of how its files would look:

Chart.yaml:

apiVersion: v2
description: kube-prometheus-stack collects Kubernetes manifests, Grafana dashboards, and Prometheus rules combined with documentation and scripts to provide easy-to-operate end-to-end Kubernetes cluster monitoring with Prometheus using the Prometheus Operator.
icon: https://raw.githubusercontent.com/prometheus/prometheus.github.io/master/assets/prometheus_logo-cb55bb5c346.png
type: application

name: kube-prometheus-stack
version: 69.8.2 # Change to the current version
appVersion: v0.78.2
kubeVersion: ">=1.19.0-0"
home: https://github.com/prometheus-operator/kube-prometheus

dependencies:
  - name: kube-prometheus-stack
    version: "69.8.2" # Change to the current version
    repository: https://prometheus-community.github.io/helm-charts

values.yaml:

kube-prometheus-stack:
  # Remove some rules we cannot scrape
  defaultRules:
    rules:
      etcd: false
      kubeScheduler: false
    disabled:
      TargetDown: true
      KubePodNotReady: true
      KubeContainerWaiting: true
      KubeHpaMaxedOut: true
      KubeDeploymentReplicasMismatch: true
      NodeSystemSaturation: true

  alertmanager:
    ingress:
      enabled: true
      ingressClassName: "external"
      annotations:
        nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
        kubernetes.io/external-dns.create: "true"
      pathType: ImplementationSpecific
      hosts:
        - alertmanager.example.com # Change to your domain
      paths:
        - /
    config:
      global:
        resolve_timeout: 5m
      route:
        group_by: ["alertname", "severity", "job"]
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
        receiver: blackhole
        routes:
          - receiver: "blackhole"
            matchers:
              - alertname = InfoInhibitor
            group_wait: 0s
            group_interval: 1m
            repeat_interval: 30s
      receivers:
        # Just an empty receiver
        - name: "blackhole"
    alertmanagerSpec:
      alertmanagerConfigSelector:
        matchLabels:
          release: kube-prometheus-stack
      alertmanagerConfigNamespaceSelector: {}

      nodeSelector:
        topology.kubernetes.io/zone: us-east-1c # Change to your zone

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1c # Change to your zone

  grafana:
    enabled: true

    nodeSelector:
      topology.kubernetes.io/zone: us-east-1c # Change to your zone

    dashboards:
      default:
        node-exporter:
          gnetId: 1860
          revision: 32
        nodejs:
          gnetId: 11159
          revision: 1
          datasource: Prometheus

    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: topology.kubernetes.io/zone
                  operator: In
                  values:
                    - us-east-1c # Change to your zone

    sidecar:
      dashboards:
        label: grafana_dashboard
        labelValue: ""
        folderAnnotation: grafana-folder
        annotations:
          grafana-folder: "/tmp/dashboards/Kube-Prometheus-Stack"
        provider:
          # Disallow updating provisioned dashboards from the UI
          allowUiUpdates: false
          foldersFromFilesStructure: true
      datasources:
        uid: prometheus
        alertmanager:
          uid: alertmanager

    additionalDataSources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki-gateway.monitoring.svc
        user: loki
        secureJsonData:
          password: loki_password

    ingress:
      enabled: true
      ingressClassName: "external"
      annotations:
        nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
        kubernetes.io/external-dns.create: "true"
      hosts:
        - grafana.example.com # Change to your domain

  # Remove some scrapings we cannot perform
  kubeControllerManager:
    enabled: false
  kubeEtcd:
    enabled: false
  kubeScheduler:
    enabled: false
  kubeProxy:
    enabled: false

  prometheus:
    enabled: true
    ingress:
      enabled: true
      ingressClassName: "external"
      annotations:
        nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
        kubernetes.io/external-dns.create: "true"
      pathType: ImplementationSpecific
      hosts:
        - prometheus.example.com # Change to your domain
      paths:
        - /

Finally, run helm dependency update to create the charts.
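
With the dependencies in place, the stack can be installed and the Grafana admin password retrieved for a first login; this is only a sketch, and the secret name assumes the Helm release is called kube-prometheus-stack:

helm upgrade --install kube-prometheus-stack ./kube-prometheus-stack -n monitoring
# Retrieve the auto-generated Grafana admin password
kubectl -n monitoring get secret kube-prometheus-stack-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo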

Alertmanager

Alertmanager is a tool used to manage and route alerts or notifications to different platforms, such as email, messaging systems, or, in this case, Discord channels.

To add this functionality, we first need to create a webhook for the Discord channel where you want to receive these alerts. To do this, go to the "edit this channel" settings of the chosen channel and then navigate to integrations.


Once in the integrations window, go to webhooks, where you can see all existing webhooks or create a new one. To use this webhook in Alertmanager, copy its URL.


After copying the webhook URL, create a new secret in your cluster to store it, ensuring that the webhook URL is not publicly exposed.

kubectl -n monitoring create secret generic discord-webhook \
  --from-literal=url='https://discord.com/api/webhooks/...' # replace with your webhook URL

You can verify that your secret has been created correctly with the following command:

kubectl get secrets -n monitoring

Next, navigate to the values.yaml file of the kube-prometheus-stack (the one we created earlier). In this file, locate the Alertmanager section and add the following configurations:

  # Alertmanager configuration with secure webhook
  alertmanager:
    ingress:
      enabled: true
      ingressClassName: "external"
      annotations:
        nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
        kubernetes.io/external-dns.create: "true"
      pathType: ImplementationSpecific
      hosts:
        - alertmanager.example.com
      paths:
        - /
    config:
      global:
        resolve_timeout: 5m
      route:
        group_by: ['alertname']
        group_wait: 30s
        group_interval: 5m
        repeat_interval: 12h
        receiver: 'discord' ## The receiver, in this case, is Discord.
        routes:
          - match: ## Route for this receiver
              severity: warning
            receiver: 'discord'
            continue: false
          - match:
              alertname: InfoInhibitor
            receiver: 'blackhole'
            group_wait: 0s
            group_interval: 1m
            repeat_interval: 30s
      receivers:
        - name: 'blackhole'
        - name: 'discord' ## Specification of this receiver (webhook)
          discord_configs:
            - webhook_url:
                secretKeyRef: ## Using the previously created secret
                  name: discord-webhook
                  key: url
    alertmanagerSpec:
      alertmanagerConfigSelector:
        matchLabels:
          release: kube-prometheus-stack
      alertmanagerConfigNamespaceSelector: {}
      nodeSelector:
        topology.kubernetes.io/zone: eu-west-1c
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - eu-west-1c

With all this in place, the monitoring stack is connected to a Discord channel, so you are notified as soon as something happens, improving the visibility and reliability of your cluster.
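
To confirm that the Discord route actually fires, you can push a synthetic alert straight into Alertmanager; this is just a test sketch, and the service name assumes the default naming of a release called kube-prometheus-stack:

kubectl -n monitoring port-forward svc/kube-prometheus-stack-alertmanager 9093:9093 &
curl -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "DiscordTest", "severity": "warning"}}]'
# A DiscordTest notification should appear in the configured channel shortly after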
