Kubernetes clusters have different origin stories. Different tools with different configurations produce clusters that use different ports, file locations, and more. When I built out this training, the Datadog Agent could not see etcd.
Run k get pods. If you just deployed the agent, it can take a few seconds to start, so k get pods -w is helpful to watch the different pods come up.
Next, run k exec <full name of pod> -- agent status. This prints a list of the checks the agent autodiscovered. etcd wasn't in this output, but sometimes it can take a few more seconds to show up.
Run k get pods -n kube-system. This cluster was deployed using Kops. Kops uses etcd-manager, and in this cluster there are two etcd pods; we want etcd-manager-main. Then run k describe pod <name of etcd pod>.
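With two etcd pods in kube-system, a quick grep helps pick out the one we want. A sketch (the pod names below are hypothetical; a saved sample stands in for the live kubectl output, and yours will differ):

```shell
# Simulated output of: k get pods -n kube-system
# (hypothetical pod names; real names come from your cluster)
cat > /tmp/pods.txt <<'EOF'
NAME                                 READY   STATUS    RESTARTS   AGE
etcd-manager-events-ip-172-20-0-10   1/1     Running   0          12m
etcd-manager-main-ip-172-20-0-10     1/1     Running   0          12m
kube-apiserver-ip-172-20-0-10        1/1     Running   0          12m
EOF

# Pick out the pod we care about; against a real cluster:
#   k get pods -n kube-system | grep etcd-manager-main
grep '^etcd-manager-main' /tmp/pods.txt | awk '{print $1}'
```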
Open the values.yaml file for our Helm chart and update datadog.confd as follows:
confd:
  etcd.yaml: |-
    ad_identifiers:
      - etcd-manager
    instances:
      - prometheus_url: https://%%host%%:4001/metrics
Adding ad_identifiers tells Datadog to treat this as an autodiscovery template: it will replace %%host%% and %%port%% with the appropriate values for the matching container. There is a set of rules to remember about %%port%%, though; it's not always what you would expect. Read more about it here: https://docs.datadoghq.com/agent/faq/template_variables/
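For example, the same template written with %%port%% instead of a hard-coded port would look like the sketch below; whether it resolves to 4001 depends on those rules, especially when the container exposes several ports, which is why the port is pinned explicitly in this lab (confirm the exact behavior against the linked docs):

```yaml
confd:
  etcd.yaml: |-
    ad_identifiers:
      - etcd-manager
    instances:
      # %%port%% is resolved per Datadog's template-variable rules;
      # with multiple exposed ports it may not pick the one you expect.
      - prometheus_url: https://%%host%%:%%port%%/metrics
```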
Run the Helm upgrade command for our agent, then run the agent status command again. This time the etcd check failed with connection refused: we needed to specify the SSL certificates.
I researched online to figure out the right certs to use and ended up with:
  etcd.yaml: |-
    ad_identifiers:
      - etcd-manager
    instances:
      - prometheus_url: https://%%host%%:4001/metrics
        ssl_verify: false
        use_preview: true
        ssl_cert: /keys/etcd-clients-ca.crt
        ssl_private_key: /keys/etcd-clients-ca.key
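Note that ssl_cert and ssl_private_key are paths inside the agent container, so the host's etcd certificates have to be mounted there. A sketch of what that can look like in the Datadog Helm chart's values.yaml, assuming the Kops certificate location on the control-plane node and a /keys mount path (the volume name and paths are assumptions; adjust to your cluster):

```yaml
agents:
  volumes:
    - name: etcd-certs
      hostPath:
        path: /etc/kubernetes/pki/etcd-manager-main
  volumeMounts:
    - name: etcd-certs
      mountPath: /keys
      readOnly: true
```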
After running the Helm upgrade again, the agent was collecting metrics from etcd as expected.
Another way to verify that this is the right URL and port is to run curl from the node running the pod; in this case, the pod runs on the control-plane node. SSH into the node. This node was built by Kops from an Ubuntu AMI, so ssh ubuntu@<ipaddress of master>. To find the IP address, either look at your EC2 instances in the AWS console, or run k get nodes and then describe the node to find its external IP address.
Then run curl https://localhost:4001/metrics -k --cert /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.crt --key /etc/kubernetes/pki/etcd-manager-main/etcd-clients-ca.key on that host, and you get a stream of the current metrics.
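The response is in the Prometheus exposition format. To spot-check a single metric instead of scrolling the whole stream, pipe the output through grep; the sketch below uses a saved sample (hypothetical metric values) in place of the live curl output:

```shell
# Stand-in for the curl output (hypothetical metric values):
cat > /tmp/etcd-metrics.txt <<'EOF'
# HELP etcd_server_has_leader Whether or not a leader exists.
# TYPE etcd_server_has_leader gauge
etcd_server_has_leader 1
etcd_server_proposals_committed_total 42000
EOF

# Against the real endpoint: curl ... | grep '^etcd_server_has_leader'
grep '^etcd_server_has_leader' /tmp/etcd-metrics.txt
```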
This lab relied on kubectl, but you have also seen the great tools available in Datadog to do the same thing. Instead of running kubectl describe, navigate to the Containers page and click on Deployments or Pods. I find this faster, easier, and more comprehensive than using kubectl directly.
A great resource for learning troubleshooting workflows is the kubectl cheat sheet: https://kubernetes.io/docs/reference/kubectl/cheatsheet/