3.2.3 Create a Dashboard

  1. From the Dashboard menu, you can also create new dashboards. There are now three choices: New Dashboard, New Timeboard, and New Screenboard. We have always had the last two, but we are merging the best features of each into the Dashboard. Enter a name for your dashboard and click the New Dashboard button.
  2. Scroll through the list of available widgets and then drag and drop the Timeseries onto the canvas.
  3. Under Graph your data, click Metric to see the other types of data that you can graph. Click the metric name, which defaults to system.cpu.user, to see the list of all available metrics. There are thousands of them.
  4. Take a look at some of the other options available on this page. You can also combine multiple metrics using the Advanced… link.
  5. Click the Overview tab at the top. This is a nice way to preview some of the functions available, though there are many more on the main Edit tab when you click the plus button to the right of any metric.
  6. In the previous section we saw the Correlations view. When you click the Correlations tab here, you can customize how correlations are found.
  7. That’s a quick look at creating a dashboard. When you start working with Kubernetes, you will probably start with the dashboards provided and then gradually customize them and build your own. We have a complete course on Monitoring the Kubernetes Platform available on the Datadog Learning Center. Here are some of the key metrics you should consider keeping an eye on:

etcd metrics and their descriptions:

  etcd.server.proposals.committed.total, etcd.server.proposals.applied.total, etcd.server.proposals.failed.total, etcd.server.proposals.pending
    Proposals occur when configuration changes are sent from the leader of the cluster to the other nodes in the cluster. Most should be successful, but errors are important to keep an eye on.

  etcd.disk.wal.fsync.duration.seconds, etcd.disk.backend.commit.duration.seconds
    etcd members persist each proposal to disk (a write-ahead log fsync followed by a backend commit), so disk performance is going to affect proposals.

  etcd.debugging.mvcc.db.total.size.in_bytes
    Database size has a finite maximum, and you need to make sure it always stays below that level.

  etcd.grpc.server.msg.received.total, etcd.grpc.server.msg.sent.total, etcd.network.client.grpc.received.bytes.total, etcd.network.client.grpc.sent.bytes.total
    Network performance is going to affect the notifications of a proposal’s success, so it’s important to watch that too.

apiserver metrics and their descriptions:

  kube_apiserver.rest_client_requests_total, kube_apiserver.rest_client_requests_total.count, kube_apiserver.rest_client_request_latency_seconds.sum, kube_apiserver.authenticated_user_requests, kube_apiserver.rest_client_request_latency_seconds.count, kube_apiserver.apiserver_request_count, kube_apiserver.apiserver_request_total, kube_apiserver.authenticated_user_requests.count, kube_apiserver.current_inflight_requests, kube_apiserver.apiserver_request_count.count, kube_apiserver.apiserver_request_total.count
    The apiserver is, at its heart, a webserver, so you need to monitor it like you would a webserver.

  docker.container.open_fds, the various docker.mem metrics, and kubernetes.cpu
    Again, it’s a webserver, and these are some other metrics you would monitor for a webserver.

Controller manager and scheduler metrics and their descriptions:

  kube_controller_manager.nodes.count, kube_controller_manager.nodes.unhealthy
    Making sure all the nodes are available and healthy is a good first step.

  kube_controller_manager.queue.depth, kube_controller_manager.queue.retries
    The controller manager and scheduler work from a queue, so it is important to make sure the queue depth isn’t getting too big. A large queue can point to other issues.

  kube_controller_manager.client.http.requests
    The controller also performs like a webserver.

CoreDNS metrics and their descriptions:

  coredns.request_count
    Shows how many requests are coming into the DNS server.

  coredns.cache_hits_count
    Divide this by the request count to see the cache hit rate. A low hit rate may indicate that you should raise the TTL value.

  coredns.request_duration.seconds.sum, coredns.request_duration.seconds.count
    Understanding how long requests take to resolve is important.

  coredns.response_code_count
    When CoreDNS encounters an error, an RCODE is generated. This shows how many of each error are occurring.
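
The two CoreDNS ratios described above (cache hits divided by requests, and the duration sum divided by the duration count) can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names and the sample values are hypothetical, and in practice you would more likely build the same ratio as a formula on a dashboard graph.

```python
def cache_hit_rate(cache_hits: float, requests: float) -> float:
    """Fraction of DNS requests answered from cache.

    Inputs stand in for coredns.cache_hits_count and coredns.request_count
    over the same time window. Returns 0.0 when there were no requests.
    """
    if requests <= 0:
        return 0.0
    return cache_hits / requests


def average_request_duration(duration_sum: float, duration_count: float) -> float:
    """Mean seconds per request.

    Inputs stand in for coredns.request_duration.seconds.sum and
    coredns.request_duration.seconds.count over the same time window.
    """
    if duration_count <= 0:
        return 0.0
    return duration_sum / duration_count


if __name__ == "__main__":
    # Hypothetical sample values, not real measurements.
    hits, total = 450.0, 600.0
    print(f"cache hit rate: {cache_hit_rate(hits, total):.2f}")
    print(f"avg duration (s): {average_request_duration(3.0, total):.4f}")
```

Both helpers guard against an empty window so that a quiet DNS server does not produce a division error.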

Again, the course on Monitoring the Kubernetes Platform at https://learn.datadoghq.com goes into a lot of detail on this topic if you want to learn more.
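
Returning to the dashboard steps at the start of this section: the same dashboard (a name plus one Timeseries widget graphing system.cpu.user) can also be described as data rather than built by clicking. The sketch below only assembles and prints a dictionary; it makes no API calls. The field names (title, layout_type, widgets, definition, requests, q) are modeled on Datadog's dashboard JSON as an assumption, so check them against the current API reference before relying on them. The helper names and the dashboard title are hypothetical.

```python
import json


def timeseries_widget(metric: str, scope: str = "*", aggregator: str = "avg") -> dict:
    """Build a Timeseries widget definition for a single metric.

    The query string follows the "aggregator:metric{scope}" form,
    for example "avg:system.cpu.user{*}".
    """
    query = f"{aggregator}:{metric}{{{scope}}}"
    return {
        "definition": {
            "type": "timeseries",
            "title": metric,
            "requests": [{"q": query}],
        }
    }


def build_dashboard(title: str, widgets: list) -> dict:
    """Assemble a dashboard payload (field names are an assumption, see above)."""
    return {
        "title": title,
        "layout_type": "ordered",
        "widgets": list(widgets),
    }


if __name__ == "__main__":
    # Hypothetical dashboard name; system.cpu.user is the default metric from step 3.
    dashboard = build_dashboard(
        "My first dashboard",
        [timeseries_widget("system.cpu.user")],
    )
    print(json.dumps(dashboard, indent=2))
```

Keeping a definition like this in version control makes it easier to review changes as you customize the provided Kubernetes dashboards and build your own.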