Grafana for (not exactly) Dummies
Grafana is a great tool for monitoring and alerting here at GBO. It can be used in many different ways, but here we will walk through three scenarios that cover the basics of navigating Grafana.
Topic 1: User Investigating Humidity over a Maintenance Period
Say there is an issue with a receiver swap during a maintenance period on March 4th. The user might want to investigate what could have caused the issue, so that it doesn't happen again! We are going to work under the assumption that the user wants to investigate the humidity on that day, to see if water could have collected.
What will the user do? Well first they will go to Grafana of course! Below we will follow the steps the user will take to diagnose and document their issue. Please follow along!
First, they can navigate to the Grafana page:
here is the link
Then, they navigate to the 'Weather' dashboard and set the time range to March 3rd through March 4th to see the weather leading up to the receiver swap. See the video below for these steps.
1timerange_thurs-fri.mp4
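The time range can also be carried in the dashboard URL, since Grafana reads `from` and `to` query parameters given in epoch milliseconds. The sketch below builds such a link for March 3rd to March 4th. The host name, the dashboard path, the year (2022), and the use of UTC are all assumptions made for illustration; substitute the values from the real Weather dashboard URL.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical values: replace with the real Grafana host and dashboard path.
BASE_URL = "https://grafana.example.org"
DASHBOARD_PATH = "/d/weather/weather"


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def dashboard_url(start: datetime, end: datetime) -> str:
    """Build a dashboard URL whose time picker is preset to [start, end]."""
    query = urlencode({"from": to_epoch_ms(start), "to": to_epoch_ms(end)})
    return f"{BASE_URL}{DASHBOARD_PATH}?{query}"


# Assumed year (2022) and time zone (UTC); adjust to the period of interest.
start = datetime(2022, 3, 3, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 3, 4, 23, 59, 59, tzinfo=timezone.utc)
print(dashboard_url(start, end))
```

A link built this way opens the dashboard with the time picker already set, which can be handy when pointing a colleague at the same window.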
Now that the user has all the data for the Weather manager in the correct time range, they can take a look at it. Unfortunately, they see far more data than they are concerned with. They can reduce the displayed data by selecting only the 'instances' they care about. In this case that's 'Humidity'. See the video below.
2selecthumidity.mp4
Now that the user has selected only humidity, they can see that there was interesting data in that time frame. As they scroll down the rest of the page they see there is a humidity-specific panel. They decide to focus specifically on that.
3veiwonlyhumidity.mp4
Now that the user has exactly what they need, they want to document that data. To do this they take a snapshot and save the created link. They can then forward this link to relevant management and assure them that it was not their fault that there was an issue with the receiver swap, but mother nature's!
4snapshothumidity.mp4
Topic 2: User Sees Slowdowns on Data Machine and Wants to Know Why
Here a user is running data reductions on the data machine 'thales'. While their program is running they notice that the reduction is becoming sluggish, and they want to know why.
First they can check on the general health of the machine by going to the Node_Exporter dashboard and selecting the 'instance' thales. Here they can check many facets of the machine, including the CPU performance.
5thalesinnode.mp4
Since that seems fine, they can go on to check the GPU behavior on thales by navigating to the Nvidia_GPU dashboard and again selecting the thales 'instance'. Here they can see that the GPU is not engaged, which means they are not overloading the system. Again, good news.
6checkthalesgpu.mp4
Now that they are convinced the machine is not in bad condition, they can use the 'htop+' dashboard to monitor thales. Here they can run their program on the machine and track its behavior via Grafana. When the user runs the program they can take note of the pegged cores and understand what is causing the slowdown.
The user can also find their core and isolate only that data in the graph below.
7checkthaleshtop.mp4
Topic 3: User Wants to Check the Health of GPUs on All Data Machines
Here we imagine a user who wants to check the health of all the GPUs on the data machines in order to see if any are currently in use.
To do this we can navigate to the Nvidia_GPU dashboard and select only the relevant data machine GPUs.
0selectdatamachinesOnGPU.mp4
This way we can select multiple instances to isolate only the data we are interested in and compare them to each other. Here we can see that only leibniz is in use, so it would be a bad idea to start to manipulate any GPU hardware on that machine.
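The same multi-instance selection can be expressed in a link, because Grafana accepts dashboard variables as `var-<name>` query parameters and a multi-value variable is passed by repeating the parameter. The sketch below is a minimal illustration under assumptions: the host, the dashboard path, and the variable name `instance` are hypothetical, and the real variable values on this dashboard may differ (for example, they may include a port).

```python
from urllib.parse import urlencode

# Hypothetical values: replace with the real Grafana host and dashboard path.
BASE_URL = "https://grafana.example.org"
DASHBOARD_PATH = "/d/nvidia-gpu/nvidia-gpu"
# Assumed name of the dashboard variable behind the 'instance' dropdown.
VARIABLE_NAME = "instance"


def instances_url(instances):
    """Build a dashboard URL that preselects the given instances."""
    # One var-<name>=<value> pair per instance selects several values at once.
    params = [(f"var-{VARIABLE_NAME}", name) for name in instances]
    return f"{BASE_URL}{DASHBOARD_PATH}?{urlencode(params)}"


print(instances_url(["thales", "leibniz"]))
```

Passing a single name, such as `instances_url(["thales"])`, gives the one-machine view used in Topic 2.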
How to save the Data
You have two options to save data.
1. Save the plot itself:
4snapshothumidity.mp4
2. Save the data to a CSV file:
How to save data in CSV.mp4
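Once the data is in a CSV file it can be inspected outside Grafana. The sketch below finds the highest reading in an exported series using only the Python standard library. The column names `Time` and `Humidity` and the sample rows are invented for illustration; a real export may use different headers, timestamp formats, or values with units attached, so check the first lines of the file before relying on it.

```python
import csv
import io

# Invented sample standing in for an exported file. Real exports may use
# different column names, timestamp formats, and units.
SAMPLE_CSV = """Time,Humidity
2022-03-03 00:00:00,61.5
2022-03-03 12:00:00,88.0
2022-03-04 00:00:00,93.2
"""


def peak_reading(csv_file, time_column="Time", value_column="Humidity"):
    """Return (time, value) for the row with the largest value, or None."""
    best = None
    for row in csv.DictReader(csv_file):
        raw = row[value_column].strip()
        if not raw:
            continue  # skip gaps in the series
        value = float(raw)
        if best is None or value > best[1]:
            best = (row[time_column], value)
    return best


print(peak_reading(io.StringIO(SAMPLE_CSV)))
```

To use it on a real export, open the file with `open("humidity.csv", newline="")` (a hypothetical file name) and pass the file object in place of the `io.StringIO` sample.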
Final Thoughts
Hopefully this guide was helpful to you as a user of Grafana. Please feel free to reach out to kpurcell@nrao.edu with any suggestions for more example videos, new additions to Grafana itself, or other opinions or comments.
Thanks for using Grafana!
--
KathlynPurcell - 2022-03-09