Grafana for (not exactly) Dummies

Grafana is a great tool for monitoring and alerting here at GBO. There are many different ways this tool can be used, but here we will discuss 3 scenarios to cover the basics of navigating Grafana.

Topic 1: User Investigating Humidity over a Maintenance Period

Say there is an issue with a receiver swap during a maintenance period on March 4th. The user might want to investigate what could have caused the issue, so that it doesn't happen again! We are going to work under the assumption that the user wants to investigate the humidity on that day, to see if water could have collected.

What will the user do? Well, first they will go to Grafana, of course! Below we follow the steps the user takes to diagnose and document their issue. Please follow along!

First, they can navigate to the Grafana page: here is the link

Then, they navigate to the 'Weather' dashboard and set the time range to March 3rd through March 4th to see the weather leading up to the receiver swap. See the video below for these steps.

1timerange_thurs-fri.mp4
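
The same time range can also be written directly into a dashboard URL, which can be handy when sharing a view with someone else. Grafana reads the `from` and `to` query parameters as epoch milliseconds. The sketch below is only an illustration under stated assumptions: the base URL and dashboard path are hypothetical placeholders, the year 2022 is assumed from the date of this page, and the times are taken to be UTC.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real Weather dashboard URL.
DASHBOARD_URL = "https://grafana.example.org/d/weather-uid/weather"


def to_epoch_ms(moment: datetime) -> int:
    """Convert a timezone-aware datetime to epoch milliseconds."""
    return int(moment.timestamp() * 1000)


def time_range_url(base_url: str, start: datetime, end: datetime) -> str:
    """Append Grafana's `from`/`to` time-range parameters to a dashboard URL."""
    query = urlencode({"from": to_epoch_ms(start), "to": to_epoch_ms(end)})
    return f"{base_url}?{query}"


# Assumed range: start of March 3rd to end of March 4th, 2022, in UTC.
start = datetime(2022, 3, 3, 0, 0, tzinfo=timezone.utc)
end = datetime(2022, 3, 4, 23, 59, 59, tzinfo=timezone.utc)
url = time_range_url(DASHBOARD_URL, start, end)
print(url)
```

If the dashboard is normally viewed in local time, adjust the two datetimes accordingly before building the link.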

Now that the user has the Weather manager data in the correct time range, they can look through it. Unfortunately, they see much more data than they are concerned with. They can reduce the displayed data by selecting only the 'instances' they care about. In this case that's 'Humidity'. See the video below.

2selecthumidity.mp4

Now that the user has selected only humidity, they can see that there was interesting data in that time frame. As they scroll through the rest of the page, they see there is a humidity-specific panel. They decide to focus specifically on that.

3veiwonlyhumidity.mp4

Since the user has exactly what they need, they want to document that data. To do this they take a snapshot and save the created link. They can then forward this link to relevant management and assure them that the issue with the receiver swap was not their fault, but mother nature's!

4snapshothumidity.mp4

Topic 2: User Sees Slowdowns on Data Machine and Wants to Know Why

Here we have a situation where there is a user who is running data reductions on the data machine 'thales'. While they are running their program they begin to see that the reduction is becoming sluggish and they want to know what happened.

First they can check on the general health of the machine by going to the Node_Exporter dashboard and selecting the 'instance' thales. Here they can check many facets of the machine including the CPU performance.

5thalesinnode.mp4

Since that seems fine they can go on to check the GPU behavior on thales by navigating to the Nvidia_GPU dashboard and again selecting the thales 'instance'. Here they can see that the GPU is not engaged, which means they are not overloading the systems. Again, good news.

6checkthalesgpu.mp4

Now that they are convinced the machine is not in a bad condition, they can use the dashboard 'htop+' to monitor thales. Here they can run their program on the machine and track the behavior via Grafana. When the user runs the program they can take note of the pegged cores and understand what is causing the slowdown.

The user can also find their core and isolate only that data in the graph below.

7checkthaleshtop.mp4
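
For readers who want to make "pegged" concrete, here is a small sketch of how one might flag pegged cores from per-core utilization readings, for example values read off the htop+ graph or taken from an export. Everything in it is an assumption for illustration: the 95% cutoff, the core names, and the sample numbers are made up and are not Grafana settings or real thales data.

```python
from statistics import mean

PEGGED_THRESHOLD = 95.0  # percent; an assumed cutoff, not a Grafana setting


def pegged_cores(
    samples: dict[str, list[float]], threshold: float = PEGGED_THRESHOLD
) -> list[str]:
    """Return the names of cores whose mean utilization meets the threshold."""
    return sorted(
        core
        for core, values in samples.items()
        if values and mean(values) >= threshold
    )


# Made-up utilization percentages for illustration only.
example = {
    "cpu0": [99.0, 100.0, 98.5],
    "cpu1": [12.0, 8.0, 15.5],
    "cpu2": [97.0, 96.0, 99.0],
}
print(pegged_cores(example))
```

Averaging over several readings, rather than checking a single one, avoids flagging a core that only spiked briefly.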

Topic 3: User wants to check the health of GPUs on all Data Machines

Here we imagine a user who wants to check the health of all the GPUs on the data machines in order to see if any are currently in use.

To do this we can navigate to the Nvidia GPU page and select only the relevant data machine GPUs.

0selectdatamachinesOnGPU.mp4

This way we can select multiple instances to isolate only the data we are interested in and compare them to each other. Here we can see that only leibniz is in use, so it would be a bad idea to start to manipulate any GPU hardware on that machine.
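
A multi-instance selection like this can also be captured in a link. Grafana reads dashboard variables from `var-<name>` query parameters, and repeating the parameter selects several values at once. The sketch below assumes the variable behind the 'instance' dropdown is literally named `instance`, and it uses a hypothetical placeholder URL; check the address bar after making the selection by hand to confirm the real names.

```python
from urllib.parse import urlencode

# Hypothetical placeholder; substitute the real Nvidia GPU dashboard URL.
DASHBOARD_URL = "https://grafana.example.org/d/gpu-uid/nvidia-gpu"


def instances_url(
    base_url: str, instances: list[str], variable: str = "instance"
) -> str:
    """Build a dashboard URL that preselects several values of one variable."""
    # Grafana reads template variables from `var-<name>` parameters;
    # repeating the parameter selects multiple values.
    pairs = [(f"var-{variable}", name) for name in instances]
    return f"{base_url}?{urlencode(pairs)}"


url = instances_url(DASHBOARD_URL, ["thales", "leibniz"])
print(url)
```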

How to save the Data

You have two options to save data.

1. Save the plot itself: 4snapshothumidity.mp4

2. Save the data to a CSV file: How to save data in CSV.mp4
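
Once the data is in a CSV file, it can be inspected outside of Grafana. As a rough sketch, the snippet below finds the highest reading in an export using only the Python standard library. The column names `Time` and `Humidity` and the sample rows are assumptions made up for illustration; the header of a real export depends on the panel, so check it first.

```python
import csv
import io


def peak_reading(
    csv_text: str, time_column: str, value_column: str
) -> tuple[str, float]:
    """Return (timestamp, value) for the largest numeric reading in the CSV."""
    best = None
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            value = float(row[value_column])
        except (TypeError, ValueError):
            continue  # skip blank or non-numeric cells
        if best is None or value > best[1]:
            best = (row[time_column], value)
    if best is None:
        raise ValueError("no numeric readings found")
    return best


# Made-up sample in an assumed layout; check the header of the real export.
sample = """Time,Humidity
2022-03-03 22:00:00,71.5
2022-03-04 02:00:00,93.2
2022-03-04 06:00:00,
2022-03-04 10:00:00,88.0
"""
print(peak_reading(sample, "Time", "Humidity"))
```

For a file on disk, read it with `open(path, newline="").read()` and pass the text in. Cells that carry units (for example `93.2 %`) would be skipped by this sketch, so exporting raw values may be easier.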

Final Thoughts

Hopefully this guide was helpful to you as a user of Grafana. Please feel free to reach out to kpurcell@nrao.edu with any suggestions on more example videos, new additions to Grafana itself, or other opinions or comments.

Thanks for using Grafana!

-- KathlynPurcell - 2022-03-09
Topic attachments
Attachment                     Size    Date               Who
00allGPU.mp4                   2 MB    2022-03-09 16:21   KathlynPurcell
0selectdatamachinesOnGPU.mp4   1 MB    2022-03-09 16:21   KathlynPurcell
1timerange_thurs-fri.mp4       1 MB    2022-03-09 16:16   KathlynPurcell
2selecthumidity.mp4            871 K   2022-03-09 16:21   KathlynPurcell
3veiwonlyhumidity.mp4          1 MB    2022-03-09 16:21   KathlynPurcell
4snapshothumidity.mp4          661 K   2022-03-09 16:21   KathlynPurcell
5thalesinnode.mp4              1 MB    2022-03-09 16:21   KathlynPurcell
6checkthalesgpu.mp4            1 MB    2022-03-09 16:21   KathlynPurcell
7checkthaleshtop.mp4           3 MB    2022-03-09 16:21   KathlynPurcell
How to save data in CSV.mp4    2 MB    2022-05-05 21:28   KathlynPurcell
Topic revision: r3 - 2022-05-12, KathlynPurcell