Lab 4: Logic Service improvements¶
Goals¶
In the following lab we will be working towards the following goals:
- Deploy and manage Grafana in our namespace
- Add instrumentation to the Logic Service and visualize it in Grafana
- Add code coverage tracking, code coverage visualization and test reports to the pipeline
- Improve your code coverage to at least 60% (if needed)
- Add project badges to your GitLab project
- Optimize your pipeline to limit the `latest` builds and deployments
- Increase robustness of your Logic Service by saving and restoring state
For full details on the deliverables of this lab, see the Practicalities section.
Monitoring¶
Application container technology and orchestration platforms like Kubernetes are revolutionizing app development, bringing previously unimagined flexibility and efficiency to the development process. However, with these technologies come new challenges. It can prove difficult to debug and troubleshoot the many different microservices that are deployed. Even noticing that something is going wrong is harder in a cloud environment, because there are many places to look. To this end, we are going to set up monitoring.
Monitoring is a verb; something we perform against our applications and systems to determine their state, from basic fitness tests and up/down status checks to more proactive performance health checks. We monitor applications to detect problems and anomalies. As troubleshooters, we use it to find the root cause of problems and gain insights into capacity requirements and performance trends over time.
During application development, monitoring can be used to correlate coding practices to performance outcomes, or to compare and validate cloud patterns and models.
Google wrote a very influential book called Site Reliability Engineering (SRE), in which they state:
Your monitoring system should address two questions: what's broken, and why? The "what's broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. "What" versus "why" is one of the most important distinctions in writing good monitoring with maximum signal and minimum noise.
Having monitoring data is key to achieving observability into the inner workings of your system. Collecting data is (mostly) cheap, but not having that data when you need it can be very expensive.
Prometheus¶
Overview¶
Prometheus is a tried and tested, fully open-source monitoring solution, inspired by Google's Borgmon (cf. Google SRE). It was initially developed at SoundCloud and later donated to the Cloud Native Computing Foundation (CNCF), which also houses Kubernetes, Helm, CoreDNS, etcd and others.
Central to Prometheus is of course the storage of monitoring data, which is time series data. Prometheus stores purely numeric time series and is focused on machine and service oriented monitoring. It is not an event logging system. Prometheus pulls or scrapes its metrics from different endpoints or instances. Instances which perform the same task are aggregated inside a job.
There are also client libraries to instrument your applications. These help you to expose metrics to an HTTP endpoint. There are official Go, Java, Scala, Python and Ruby libraries. Due to its open-source and community-driven nature there are also a number of third party libraries available for C++, C#, Bash, Lua, etc.
Prometheus also has a concept called exporters. An exporter can run on your node or next to a service and expose some metrics. No need to change the application or service, the exporter runs independently. Some exporters that are available:
- Official: Node/system, InfluxDB, JMX, HAProxy, etc.
- Third party: Kafka, RabbitMQ, MongoDB, Jenkins, Nginx, etc.
Some software even exposes Prometheus metrics by itself, some examples: etcd, Grafana, Kubernetes, Telegraf.
You can find an extensive list of available exporters and Prometheus related projects here.
If you want to learn more, go to the official Prometheus.io website, there you will find a brief but more elaborate overview.
DevOps Prometheus¶
For our DevOps cluster, Prometheus has already been installed and configured so your team will not have to set up a personal instance of it. Instead you will have to query the central Prometheus server in order to achieve Observability of your Logic Service.
Metrics types and PromQL¶
In order to query Prometheus, we need to use the functional Prometheus Query Language or PromQL. PromQL lets the user select and aggregate time series data in real time.
This functional style of querying is very user-friendly and readable when it comes to selecting and manipulating time series data. Compare, for example, a PromQL query with the equivalent in an SQL-like language such as the one used by InfluxDB:

cpu_load_short > 0.9

SELECT * FROM "cpu_load_short" WHERE "value" > 0.9
We can't explain the use of PromQL any better than the people of Prometheus themselves, so please take your time to read through the following two pages of documentation:
- https://prometheus.io/docs/prometheus/latest/querying/basics/
- https://prometheus.io/docs/prometheus/latest/querying/examples/
We will provide you with a hands-on walkthrough of setting up your first PromQL queries when we have installed Grafana.
You will notice that these documentation pages sometimes mention different metric types. Prometheus exposes four metric types:
- Gauge: A gauge is for tracking current tallies, or things that can naturally go up or down, like memory usage, queue lengths, in-flight requests, or current CPU usage.
- Counter: A counter is for tracking cumulative totals over a number of events or quantities, like the total number of HTTP requests or the total number of seconds spent handling requests. A counter never decreases, except when the process that exposes it restarts, in which case it gets reset to 0.
- Histogram: A histogram is used to track the distribution of a set of observed values (like request latencies) across a set of buckets. It also tracks the total number of observed values, as well as the cumulative sum of the observed values.
- Summary: A summary is used to track the distribution of a set of observed values (like request latencies) as a set of quantiles / percentiles. It also tracks the total number of observed values, as well as the cumulative sum of the observed values.
More info on these metric types as well as client code examples can be found here.
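To give a feel for how these types are used in queries, here are a few illustrative PromQL snippets. The container metrics reappear later in this lab; the histogram metric name is an assumption used purely as an example:

# A gauge can be graphed directly
container_memory_working_set_bytes{namespace="devops-team<number>"}

# A counter is usually wrapped in rate() to obtain a per-second value
rate(container_cpu_usage_seconds_total{namespace="devops-team<number>"}[5m])

# A histogram can be turned into a quantile estimate over its buckets
histogram_quantile(0.95, rate(http_server_requests_seconds_bucket[5m]))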
Grafana¶
Grafana is an open-source dashboarding, analytics, and monitoring platform that is tailored for connecting with a variety of sources like Elasticsearch, InfluxDB, Graphite, Prometheus, AWS CloudWatch, and many others.
Grafana invests heavily in these data source integrations with almost every other observability tool out there. It allows you to use one product for metrics, another for logging, and a third for tracing, and bring it all together in the Grafana UI.
In our case, we are going to focus on visualizing the metrics data from Prometheus, to gain insight in the operation of our Logic Service.
If you are interested in learning more about Grafana after this lab, maybe to set it up on your home server, they have well-written and extensive tutorials on various topics here.
Installing Grafana¶
We are going to deploy Grafana in our team namespace and we are going to use helm to do this.
Helm is a package manager for Kubernetes: it is the easiest way to find, share, and use software built for Kubernetes. Helm streamlines installing and managing Kubernetes applications. Think of it like apt/scoop/homebrew for Kubernetes.
Helm uses a packaging format called charts. A chart is a collection of files that describe a related set of Kubernetes resources. A single chart might be used to deploy something simple, like a memcached pod, or something complex, like a full web app stack with HTTP servers, databases, caches, and so on.
Install Helm on your local machine by following the instructions on the Helm website.
To find the Grafana Helm chart go to ArtifactHub, a web-based application that enables finding, installing, and
publishing Kubernetes packages. You can discover various applications here, either as Helm charts, Kustomize packages or
Operators. Search for the official Grafana chart and open it up (the one owned by the Grafana
organization).
ArtifactHub provides a nice and user-friendly view on the source code of the chart, which is hosted on a Git repository (you can always navigate to that through Chart Source link in the right side bar). The chart homepage shows the readme, commonly this houses some getting started commands, changelog and a full configuration reference: table of all possible values that can be set. ArtifactHub also provides dialogs for Templates, Default Values and Install.
If you open up Templates you will see that this chart deploys quite a lot of different resources. That is the beauty of using a tool like Helm to install and manage Kubernetes applications. Instead of manipulating all these resources separately and having to keep track of them manually, everything is packaged into a release. Helm makes it easy to test out new third party applications on your cloud environment, because when you are done testing you can easily `helm uninstall` the release and you are left with a clean cluster.
To get started, follow the Get Repo Info instruction in the readme to add the Grafana repository to your local list of repos.
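At the time of writing, the readme boils down to the following two commands; double-check them against the current readme in case the repository URL has changed:

helm repo add grafana https://grafana.github.io/helm-charts
helm repo update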
To configure our installation of the Grafana chart we can either use `--set` parameters when installing the chart, or preferably in this case we can create a values file to override the chart's defaults.

Navigate to your project's root folder and make a new subfolder `monitoring`. In this folder we are going to create a new file and call it `grafana-values.yaml`.

This file will hold all the values we want to override in the Grafana chart. When we refer to values, we mean the configuration values that can be seen either in the Configuration section of the chart's readme, or in the Default Values view on ArtifactHub.
Admin password¶
First of all, we need to set a password for our admin account. If we do not set it, Grafana will auto-generate one and we will be able to retrieve it by decoding a secret on the Kubernetes cluster. However, every time we upgrade our release, Grafana would generate a new secret, sometimes resetting the admin password. That is why we will override it using our own secret. The Grafana chart allows you to configure your admin credentials through a secret, via `admin.existingSecret` and its sibling values.
NOTE: it is very important to use a strong password; we are going to expose Grafana on a public URL and we do not want trespassers querying our Prometheus server.
Add a secret to store your admin username and password
Create a secret called `grafana-admin` that holds your admin username and password. Grafana can then be directed to refer to this secret instead of the `adminPassword` variable.
Read through the K8s Secret Docs. There you'll find information on the kinds of secrets and how to create one. In this case, `kubectl create secret` in combination with the `--from-literal` argument is an easy way to start (TIP: `kubectl create secret --help`).
Use `admin-user` and `admin-password` as keys in the secret. The values can be anything you like, but make sure they are strong and secure. Use a random password generator or a password manager to generate a strong password.
On Unix this one-liner outputs a random 32 character password:
< /dev/urandom tr -dc _A-Z-a-z-0-9 | head -c${1:-32};echo;
When you have created your own secret with your admin username and password, you can configure Grafana to use it by setting the necessary values in your `grafana-values.yaml`.
Check the readme and Default Values on ArtifactHub to find out more.
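As a sketch of what this could look like: the secret name and keys follow the task above, while the `admin.existingSecret`, `admin.userKey` and `admin.passwordKey` values are our reading of the chart's readme, so verify them there:

# Create the secret holding your own admin credentials (pick a strong password!)
kubectl create secret generic grafana-admin \
  --from-literal=admin-user=<your-admin-username> \
  --from-literal=admin-password=<your-strong-password>

And in grafana-values.yaml:

admin:
  existingSecret: grafana-admin
  userKey: admin-user
  passwordKey: admin-password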
If you are unable to set up this secret properly and install Grafana, you can use `--set adminPassword=<strong-password>` as part of your Helm install command. This will set the password statically and you will be able to proceed with the rest of the assignment.
Do note that you will be scored on the usage of secrets in this assignment!
Ingress Rules¶
In order to easily reach our Grafana UI, we are going to serve it on a path of our public domain https://devops-proxy.atlantis.ugent.be.
To achieve this, we have to add an Ingress to Grafana. If you search for the keyword ingress in the Values dialog, you will find a bunch of variables that we can set to configure it.
We are going to serve Grafana on a path, namely `/grafana/devops-team<number>`, making our dashboard accessible at https://devops-proxy.atlantis.ugent.be/grafana/devops-team<number>.
The readme of the chart has an example of how to add an Ingress with a path (Example ingress with path, usable with grafana >6.3); use that example and change it appropriately! This is how we work with third party charts: read the readme for instructions and adapt it to your situation.
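As an indication, the resulting values could look roughly like the sketch below. The exact keys (`grafana.ini`, `ingress.*`) and any required annotations must be taken from the readme example itself; the hostname and team number here are placeholders to adapt:

grafana.ini:
  server:
    domain: devops-proxy.atlantis.ugent.be
    root_url: "%(protocol)s://%(domain)s/grafana/devops-team<number>"
    serve_from_sub_path: true
ingress:
  enabled: true
  hosts:
    - devops-proxy.atlantis.ugent.be
  path: /grafana/devops-team<number>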
Disabling RBAC and PSP¶
The Grafana chart deploys some Role Based Access Control (RBAC) objects and Pod Security Policies by default. We won't be needing these resources, so add the following to disable these options. Not disabling these will throw errors on installation, because the devops-user accounts linked to your kubeconfig are not allowed to create RBAC objects and Pod Security Policies.
rbac:
  create: false
  pspEnabled: false
serviceAccount:
  create: false
First Grafana release¶
Before moving on to the actual installation, let's perform a dry-run to make sure everything is in order. A dry-run is a Helm option that simulates the installation of a chart: it renders the templates and prints out the resources that would be created. This is a good way to check whether your values file is correct and whether the chart is going to be installed as you expect.
helm install grafana grafana/grafana -f grafana-values.yaml --dry-run
If you get a print out of all resources and no errors, you are good to go. Open up a second terminal: here we will watch all kubernetes related events in our namespace:
kubectl get events -w
Now install the Grafana helm chart:
helm install grafana grafana/grafana -f grafana-values.yaml
You will see that Kubernetes creates deployments, services, configmaps and other resources in the events. The `kubectl get events` instruction is nice to use while learning the ropes of Kubernetes, because it can give you lots of insight into the moving parts.
When we install the chart, the helm command gives us a printout of the Helm chart's `NOTES.txt` file. In this file chart owners can specify some guidelines and next steps for users.

Here they guide you through retrieving the admin password (referring to a secret called `grafana-admin` which was created before we installed the chart and specified in the `grafana-values.yaml` file), and provide some extra info. This info is generated and can be different for each release because it is based on the values we set in our `grafana-values.yaml` file. Sadly it often contains some errors: in the example below they claim the outside URL is http://devops-proxy.atlantis.ugent.be while it is actually http://devops-proxy.atlantis.ugent.be/grafana/devops-team0. And while it does tell you how to retrieve the admin password, it claims that you can log in with the `admin` username (the default), while we have set up our own admin username in a secret.
NAME: grafana
LAST DEPLOYED: Thu Nov 28 11:17:38 2024
NAMESPACE: devops-team0
STATUS: deployed
REVISION: 1
NOTES:
1. Get your 'admin' user password by running:
kubectl get secret --namespace devops-team0 grafana-admin -o jsonpath="{.data.admin-password}" | base64 --decode ; echo
2. The Grafana server can be accessed via port 80 on the following DNS name from within your cluster:
grafana.devops-team0.svc.cluster.local
If you bind grafana to 80, please update values in values.yaml and reinstall:
securityContext:
runAsUser: 0
runAsGroup: 0
fsGroup: 0
command:
- "setcap"
- "'cap_net_bind_service=+ep'"
- "/usr/sbin/grafana-server &&"
- "sh"
- "/run.sh"
Details refer to https://grafana.com/docs/installation/configuration/#http-port.
Or grafana would always crash.
From outside the cluster, the server URL(s) are:
http://devops-proxy.atlantis.ugent.be
3. Login with the password from step 1 and the username: admin
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
Remember the command to retrieve your admin password and decode it. If you ever forget the admin password, or if you used a really strong password as you should, you can run that command to retrieve it and copy it.
Now inspect your namespace using kubectl, finding answers to the following questions:
- Are there any pods running?
- Has a service been deployed?
- Is there an ingress resource?
When all is clear, and your Grafana pods are running, you can visit https://devops-proxy.atlantis.ugent.be/grafana/devops-team<your-team-number> in a browser and log in with the admin credentials. Once logged in, you will see the Home screen of your Grafana installation.
Adding persistence to Grafana¶
The Grafana Chart developers warned us about something in their NOTES.txt file:
#################################################################################
###### WARNING: Persistence is disabled!!! You will lose your data when #####
###### the Grafana pod is terminated. #####
#################################################################################
Before we start making dashboards we have to add some persistence to Grafana. By default, deployed Kubernetes applications have ephemeral storage, meaning that when the pod gets restarted, it is an entirely new entity and has lost all its data (save for data fed from Secrets or ConfigMaps).
So it is possible to store certain things to disk inside the container, but these will not be persisted and will not survive any form of restart. Therefore we are going to mount a volume into our Grafana application which can survive restarts. The Helm chart for Grafana already has values for us to fill out.
If we go back to the Grafana Values page and search for the keyword `persistence`, we find a number of variables to fill out. We will only be needing the following:
persistence:
  enabled: true
  size: 1Gi
  storageClassName: k8s-stud-storage
So we enable persistence, request a volume with a size of 1Gi and tell Kubernetes to use the storage class `k8s-stud-storage`. Add this YAML snippet to your existing `grafana-values.yaml`.
These storage classes enable something called Dynamic Volume Provisioning.
Dynamic volume provisioning allows storage volumes to be created on-demand. Without dynamic provisioning, cluster administrators have to manually make calls to their cloud or storage provider to create new storage volumes, and then create PersistentVolume objects to represent them in Kubernetes. The dynamic provisioning feature eliminates the need for cluster administrators to pre-provision storage. Instead, it automatically provisions storage when it is requested by users.
Before upgrading our Grafana Helm release, go ahead and open up a second terminal; in this terminal we will again watch all Kubernetes-related events in our namespace:
kubectl get events -w
Now, from your terminal in the monitoring work directory, upgrade your grafana Helm release with your updated `grafana-values.yaml` file.
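Analogous to the install command used earlier, the upgrade boils down to:

helm upgrade grafana grafana/grafana -f grafana-values.yaml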
As you see in the events log, a new pod gets created for Grafana: by enabling persistence, its configuration has changed and it now has a volume to attach.
The chart will have created a Persistent Volume Claim (look for it using `kubectl get pvc`) which holds the configuration of the volume: storage class, size, read/write properties, etc. This PVC then gets picked up by our storage provisioner (to which we linked by defining the storage class name), who provisions the volume. This volume is then represented by a Persistent Volume (`kubectl get pv` will show you not only volumes in your namespace, but all volumes across the cluster).
While all this storage provisioning goes on, our new pod is Pending, waiting for the volume to become available. When it becomes available, the images get pulled and the new pod gets created. When the Grafana pod is Ready, the old pod gets killed.
Notice we now have upgraded Grafana with no downtime!
Adding Users¶
It is best not to use the Admin account for normal operation of Grafana; instead we use our own personal user accounts. These still need to be added and/or invited by the Admin, so that is what we are going to do next.
Log in using the admin credentials. In the left navigation bar, go to Administration>Users, then to Organization Users.
You'll see one user already: the admin. Now click on Invite to start inviting team members. Add their email in the first field and leave Name open. What permissions you give to each member is entirely up to you: Viewers can only view dashboards, Editors can create and edit dashboards, and Admins can change settings of the Server and the Organization. You can change user permissions later as well.
Since we don't have a mailing server set up with Grafana, we can't actually send the invitation email, so deselect the Send invite email toggle.
When you Submit a user, you can navigate to Pending Invites and click Copy Invite to retrieve the invite link. Now send that invite link to the appropriate team member, or open it yourself.
When you follow your invite link, you can set username, email, name and password yourself.
Adding a Data Source¶
If we want to make dashboards, we are going to need data. On the Home screen of your Grafana you can see a big button Add your first data source, click it or navigate to Connections>Add new connection.
We are going to link to the Prometheus server that has already been set up in the `monitoring` namespace. Select Prometheus from the list of Data Sources.
Enter the following URL to point your Grafana instance to the already deployed Prometheus server:
http://prometheus-operated.monitoring:9090
Leave everything else on its default. Click Save & Test at the bottom of your screen. If all went well you should get a green prompt that your Data Source is working.
Exploring the data¶
The Prometheus instance we are running collects a lot of data regarding the resource usage of our containers, as well as multiple Kubernetes-related metrics and events such as pod restarts, ingress creations, etc.
If you want to explore the data and build queries, the best place to go is the Explore screen on Grafana (compass in the navigation bar).
Tip
Grafana defaults to a Builder view to construct your query. This tool is very handy to start building your own queries and even get feedback on what each piece of the query does when enabling the Explain toggle at the top.
The raw query gets shown by default and can also be toggled if wanted.
The following introduction however uses the Code view to construct our query. Follow along with the Code view first and then you can move back to the Builder view.
For instance, let's say we are interested in CPU. When you type in `cpu` in the query bar at the top, it will autocomplete and show a list of matching suggestions. Hovering over any of these suggestions will give you the type of metric and a description.

If we select `container_cpu_usage_seconds_total`, you will see that we can actually select an even more specific metric if we want to.
These prefixes and suffixes indicate that these metrics were created by special recording rules within Prometheus. These rules are often much easier to work with because they have already filtered out certain series and labels we aren't interested in.

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate

The prefix `node_namespace_pod_container` indicates that these metrics have those four labels available; we can use them to filter our query. The suffix `sum_irate` indicates that the sum and irate functions have already been applied to these metrics.
When we click Run Query we will get a lot of metrics, one for each container in the cluster, across all namespaces. We are only interested in our own namespace of course!
Warning
While we encourage trying out PromQL and Grafana, please take the following in mind:
- You may query data from other namespaces for academic purposes, to get the hang of the PromQL query language and basic monitoring. It is however not allowed to scrape and process metrics from the namespaces of other teams with the intent to extract Operational information from their logic.
- When experimenting with queries, set the range to fifteen minutes, one hour max. When you have filtered your query enough so that it doesn't return millions of irrelevant fields, you can expand this range to the required value for observability. Repeated queries over large time ranges, on series with lots of fields, will negatively impact the performance of both your Grafana service and the Prometheus server. Prometheus queries can be traced back to their origin.
Now we will add our first selector to our PromQL query. First, let us narrow our query down to all pods in our namespace. Open curly brackets at the end of your selector and notice that Grafana prompts you with the available label options.
Go ahead and select only the metrics from your own namespace. You will see that two series have been selected: one for our Logic Service container and one for Grafana. Now add another selector to the list so that you only query the CPU usage of your Logic Service. We will use this query in the next step, so copy it or keep it handy.
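For reference, the final query could look something like the line below; the label values are assumptions, so adapt the namespace to your team and the container filter to how your pod is actually labeled:

node_namespace_pod_container:container_cpu_usage_seconds_total:sum_irate{namespace="devops-team<number>", container="logic-service"}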
Tip
You can create this same query using the builder; it will show a blue box with the hint `expand rules()`. The metric we are querying is what's called a Recording rule: a Prometheus query that is continually run in the background by Prometheus, the results of which are then recorded as another metric. Go ahead and expand the rule; you will see the raw query, and enabling the Explain toggle will also give you some info about it.
You can keep using the builder or the code view for your own queries; the choice is up to you. The Explain toggle is a handy tool to learn PromQL hands-on.
Creating a dashboard¶
When we visit our Home screen again, we can see we have completed adding our first Data Source. Next to that button, Grafana is prompting us to add our first dashboard.
Either click the big button on your Home screen or navigate to Dashboards>New Dashboard.
Now you have a completely empty dashboard. We will get you going by helping you create simple panels to visualize CPU and Memory usage of our logic-service.
Click on Add Visualisation, copy your previous CPU query into the Metrics section and give your panel an appropriate name. You'll see that your series has a very long name in the legend at the bottom of the graph. By default it will show the metric name plus all its labels. We can actually use these labels to override the legend: setting `{{pod}}` as the legend format will change the legend name to only the pod name.
Info
If you use the code view you might see a warning to apply a `rate()` function, which you can safely ignore. The recording rule has already applied the rate function, but since the metadata of the metric still says it is a COUNTER type, Grafana keeps showing the warning.
Hit Save in the top right corner; you will get a dialog prompting you to give your dashboard a name and optionally put it into a folder. Give it a name and Save. You can always go to Dashboard settings to change the name later.
Every time you save your dashboard, you will also be asked to give an optional description. Grafana dashboards are versioned, a bit like Git, allowing you to describe changes as you go and revert some changes when needed.
Next we will add a graph to show us our memory usage. Click on Add Panel in the top bar then Add New Panel.
When you go to the Explorer again and type in `memory`, you can see there are a lot of options. You might think that memory utilization of our service is easily tracked with `container_memory_usage_bytes`; however, this metric also includes cached items (think filesystem cache) that can be evicted under memory pressure. The better metric is `container_memory_working_set_bytes`.
This metric also has a recording rule, similar to our previous cpu metric! Use that metric to construct your memory usage query. Apply correct label filters to only show the memory usage of your Logic Service!
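A possible starting point, assuming the same recording-rule naming convention as the CPU metric and the same placeholder labels as before (verify the exact metric name in the Explore autocomplete):

node_namespace_pod_container:container_memory_working_set_bytes{namespace="devops-team<number>", container="logic-service"}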
Now, your legend is probably showing that you are using about X Mil or K, which is not very readable of course. In the right-side panel, we can navigate to the Standard Options tab and change our graph's unit. Alternatively, you can use the Search field at the top of the right-side panel and search for "unit". Select the `Data / bytes(IEC)` unit.
You can also do the same for CPU usage, changing the unit to `Time / seconds`.
When you are happy with your graph panel, hit apply. You now see two panels, you can drag and drop these panels into any position you prefer and resize them.
Grafana can automatically refresh the data and keep things up to date: in the top right corner you can click the refresh button manually or select one of the auto-refresh options from the dropdown menu. Don't forget to Save your dashboard; you will also get the option to save your current time range as a dashboard default (this includes any auto-refresh config).
You should end up with a dashboard that looks something like this:
Instrumenting our logic service¶
In this section we will show you how you can create custom metrics for your application that can be picked up by Prometheus. Application-specific metrics can be an invaluable tool for gaining insight into your application, especially if you can correlate them with generic metrics (such as CPU and memory) to spot potential issues.
Quarkus provides a plugin that integrates the Micrometer metrics library to collect runtime, extension and application metrics and expose them as a Prometheus (OpenMetrics) endpoint. See the Quarkus documentation for more information.
Start by adding a new Maven dependency to the POM of your project:
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-micrometer-registry-prometheus</artifactId>
</dependency>
After refreshing your dependencies and starting the logic-service, you should see that Micrometer already exposes a number of Quarkus framework metrics at http://localhost:8080/q/metrics.
First instrumentation: move execution time¶
Micrometer has support for the various metric types we mentioned before (Counter, Timer, Gauge) and extended documentation is provided on their website. As an example, we will show you how you can add instrumentation for monitoring the execution time of your unit logic, using a Timer.
First, inject the Micrometer `MeterRegistry` into your `FactionLogicImpl`, so you can start recording metrics:
@Inject
MeterRegistry registry;
Next, wrap the implementation of `nextUnitMove` using `Timer.record`, for example:
@Override
public UnitMove nextUnitMove(UnitMoveInput input) {
    return registry.timer("move_execution_time")
            .record(() -> switch (input.unit().type()) {
                case PIONEER -> pioneerLogic(input);
                case SOLDIER -> soldierLogic(input);
                case WORKER -> workerLogic(input);
                case CLERIC -> moveFactory.unitIdle(); // TODO: extend with your own logic!
                case MINER -> moveFactory.unitIdle(); // TODO: extend with your own logic!
            });
}
This records the execution time of the unit move logic and makes sure this data can be exposed towards Prometheus. However, we also want to differentiate between the execution times of our different units' logic. Instead of creating different Timers we can add a label!
registry.timer("move_execution_time", "unit", input.unit().type().name())
Restart the logic-service and visit http://localhost:8080/q/metrics again, making sure the `devops-runner` is active to run a local game. The metric you've added should then be visible in the response, e.g.:
# HELP move_execution_time_seconds_max
# TYPE move_execution_time_seconds_max gauge
move_execution_time_seconds_max{unit="SOLDIER",} 2.466E-4
move_execution_time_seconds_max{unit="WORKER",} 2.026E-4
move_execution_time_seconds_max{unit="PIONEER",} 2.197E-4
# HELP move_execution_time_seconds
# TYPE move_execution_time_seconds summary
move_execution_time_seconds_count{unit="SOLDIER",} 264.0
move_execution_time_seconds_sum{unit="SOLDIER",} 0.0095972
move_execution_time_seconds_count{unit="WORKER",} 656.0
move_execution_time_seconds_sum{unit="WORKER",} 0.0381238
move_execution_time_seconds_count{unit="PIONEER",} 4274.0
move_execution_time_seconds_sum{unit="PIONEER",} 0.0929189
Changing the metrics endpoint¶
Right now, the metrics are exposed on the main HTTP server, at port 8080. This is not ideal, as we want to keep the main HTTP server as lightweight as possible and reserve it solely for the game logic. Therefore, we will change the metrics endpoint to a different port.
To do this, all we need to do is enable the management interface of Quarkus by setting `quarkus.management.enabled=true` in our `application.properties` file. This will expose a new HTTP server on port 9000, which will serve the metrics at `/q/metrics`.
Test this locally like before by visiting http://localhost:9000/q/metrics, making sure the `devops-runner` is active to run a local game. The metrics should now be visible on port 9000.
Deployment¶
From a code perspective we are now finished. However, the addition of a new port implies that the Kubernetes deployment should be updated as well! Otherwise, Kubernetes will not expose the new port and Prometheus will not be able to reach your service for pulling the instrumentation.
Update your templates by inserting the following in your `deployment.yaml` file, as an additional entry for the `ports` attribute of the `logic-service` container definition:
- name: metrics
  containerPort: 9000
  protocol: TCP
The Prometheus server will automatically scrape the `metrics` port of any pod with an `app.kubernetes.io/name: logic-service` label in any namespace. This is set up through a PodMonitor object. This resource is already deployed and active on our end; we include it here to illustrate how it works:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: devops-logic-combo-monitor
  namespace: monitoring
spec:
  namespaceSelector:
    any: true
  podMetricsEndpoints:
    - interval: 15s
      path: /q/metrics
      port: metrics
  selector:
    matchExpressions:
      - { key: app, operator: In, values: [logic-deployment, logic-service, logic] }
Verify that your pod has this label and the correct port name using `kubectl`, in case you can't query your custom metrics after deploying your changes.
Warning
As always when working with YAML: double-check your indentations!
Custom metrics dashboard¶
Visualize `move_execution_time`
Add two graphs to your Grafana dashboard:
- Unit move execution time: This graph should show the average time it takes for each unit type to execute its move logic. Each line in the graph should represent a different unit type, with the series labeled by the unit type. The y-axis should display time in seconds.
- Unit move requests per second: This graph should show the average number of move requests per second for each unit type. Each line in the graph should represent a different unit type, with the series labeled by the unit type. The y-axis should display requests per second.
Ensure that the series are continuous and not disrupted by pod restarts by using appropriate aggregations.
On the dashboard image above, the logic service is being restarted every 3 minutes, but the series are still continuous. The gaps in the graphs are due to the fact that the logic service is not being called because it has been defeated.
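One possible way to build these two panels, assuming the metric names shown in the /q/metrics output above and a 5-minute rate window; aggregating with sum by (unit) drops the pod label, which keeps the series continuous across pod restarts:

# Average unit move execution time (seconds), one series per unit type
sum by (unit) (rate(move_execution_time_seconds_sum[5m]))
  / sum by (unit) (rate(move_execution_time_seconds_count[5m]))

# Unit move requests per second, one series per unit type
sum by (unit) (rate(move_execution_time_seconds_count[5m]))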
Create custom instrumentation metrics
Implement at least an additional three custom metrics for your logic service, so four custom metrics including
move_execution_time
. These metrics can be anything you like, but should provide insight into the performance of
your logic service and faction.
More info on creating your own metrics can be found here in the Quarkus Micrometer docs.
If several metrics could be represented by one metric with different labels, we will consider these as one metric for the purpose of this assignment. For example, if you have metrics `worker_move_execution_time`, `pioneer_move_execution_time`, `soldier_move_execution_time`, etc., these would be scored as only one metric.
Visualize these metrics on one or several Grafana dashboards. Also document the metrics (name, labels, meaning, etc.), with links to the respective dashboards, in a markdown file in the `monitoring` folder: `monitoring/instrumentation.md`. Include a link to this file in your project Report.
Experiment with the Query Builder and read through the Prometheus documentation to form your queries. To get started on visualizing the `move_execution_time` metric, we refer to the Prometheus docs: Count and sum of observations.
General tips:
- DO NOT create different metrics that measure the same thing but for another resource; use labels!
- Make use of aggregators such as `sum()`, `avg()`, etc. Keep in mind restarts of your logic service, which will reset your metrics or cause disruptions in your graphs, making them unreadable (see Chaos Engineering). A pod restart often results in a new series being created, as the pod ID is part of the label set. This is fine for metrics like CPU / memory because these relate to the pod itself, but for metrics that relate to the service as a whole, you want to make sure that the series are continuous.
- Make sure your dashboards are readable, both over small time ranges and big ones. Tweak your dashboards and set up proper legends, units, scales, etc. A dashboard should provide the necessary information at a glance and not require extensive inspection.
- A useful tool for getting to know the PromQL language as you construct queries and explore data is the Grafana Explorer and Query Builder! Also go through the examples in the Prometheus-client README; these will teach you how to use the different metric types.
Improving testing¶
As we keep expanding our game logic, it is important to keep up with our testing. Therefore, we are going to add two things to our build pipeline which will help with our test visibility and most importantly: our motivation to keep writing tests.
Code Coverage Reports¶
Writing unit tests is not hard to do; the hardest part is getting into the habit of writing them. To this end, Code Coverage reports can help motivate us.
Code coverage is a metric that can help you understand how much of your source code is tested. It's a very useful metric that can help you assess the quality of your test suite. Code coverage tools use several criteria to determine which lines of code were exercised or not during the execution of your test suite.
To get these Code Coverage reports, all we need to do is add a dependency:
<dependency>
    <groupId>io.quarkus</groupId>
    <artifactId>quarkus-jacoco</artifactId>
    <scope>test</scope>
</dependency>
Now when you run `mvn test` and look at the `target` folder, you will see that it has a new `jacoco-report` folder.
When you open `index.html` in a browser you will be able to view and browse the report.
You can drill down from the package into the individual classes. Browsing into FactionLogicImpl gives you an overview of each element. You can then inspect these elements which gives you color coded information about your Code Coverage.
JaCoCo reports help you visually analyze code coverage by using diamonds with colors for branches and background colors for lines:
- Red diamond means that no branches have been exercised during the test phase.
- Yellow diamond shows that the code is partially covered – some branches have not been exercised.
- Green diamond means that all branches have been exercised during the test.
The same color code applies to the background color, but for line coverage.
A "branch" is one of the possible execution paths the code can take after a decision statement—e.g., an if statement—gets evaluated.
JaCoCo mainly provides three important metrics:
- Line coverage reflects the amount of code that has been exercised based on the number of Java byte code instructions called by the tests.
- Branch coverage shows the percent of exercised branches in the code – typically related to if/else and switch statements.
- Cyclomatic complexity (Cxty) reflects the complexity of code by giving the number of paths needed to cover all the possible paths in a code through linear combination.
Code Coverage parsing¶
GitLab has integrations built in to visualize the Code Coverage score we are now generating using JaCoCo. We can get our coverage score in the job details, as well as view the history of our code coverage score; see this particular section in the GitLab Testing docs.
All this Coverage Parsing does is use a regular expression to parse the job output and extract a hit. Sadly, our `mvn test` does not output the total Code Coverage score to standard out. We can however extract it from the generated `jacoco.csv` file.
Open up your `target/jacoco-report/jacoco.csv`; there you will see output similar to this:
GROUP,PACKAGE,CLASS,INSTRUCTION_MISSED,INSTRUCTION_COVERED,BRANCH_MISSED,BRANCH_COVERED,LINE_MISSED,LINE_COVERED,COMPLEXITY_MISSED,COMPLEXITY_COVERED,METHOD_MISSED,METHOD_COVERED
quarkus-application,be.ugent.devops.services.logic,Main,6,0,0,0,3,0,2,0,2,0
quarkus-application,be.ugent.devops.services.logic.http,MovesResource,13,0,0,0,3,0,3,0,3,0
quarkus-application,be.ugent.devops.services.logic.http,AuthCheckerFilter,18,0,4,0,4,0,4,0,2,0
quarkus-application,be.ugent.devops.services.logic.http,RemoteLogAppender,47,57,5,1,11,9,4,3,1,3
quarkus-application,be.ugent.devops.services.logic.api,BaseMoveInput,0,12,0,0,0,1,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,UnitType,0,33,0,0,0,6,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,BaseMoveType,0,39,0,0,0,7,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,UnitMoveInput,18,0,0,0,1,0,1,0,1,0
quarkus-application,be.ugent.devops.services.logic.api,BonusType,0,27,0,0,0,5,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,Location,17,43,0,0,1,14,1,7,1,7
quarkus-application,be.ugent.devops.services.logic.api,UnitMove,12,0,0,0,1,0,1,0,1,0
quarkus-application,be.ugent.devops.services.logic.api,UnitMoveType,0,87,0,0,0,3,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,MoveFactory,136,12,0,0,18,2,18,2,18,2
quarkus-application,be.ugent.devops.services.logic.api,BaseMove,0,15,0,0,0,1,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,Faction,0,39,0,0,0,1,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,BuildSlotState,9,0,0,0,1,0,1,0,1,0
quarkus-application,be.ugent.devops.services.logic.api,Coordinate,27,41,5,5,2,10,7,4,2,4
quarkus-application,be.ugent.devops.services.logic.api,GameContext,0,24,0,0,0,1,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.api,Unit,0,21,0,0,0,1,0,1,0,1
quarkus-application,be.ugent.devops.services.logic.impl,FactionLogicImpl,428,85,72,4,68,12,56,6,16,6
You can open up this `.csv` file in a spreadsheet program to make it more readable (or use a CSV extension in VS Code). These `.csv` files are very easy to parse.
Extract code coverage score
Parse the `jacoco.csv` file using bash commands (or another scripting language of your choice) to output the total Code Coverage score (a percentage). You can limit the coverage score to the ratio of instructions covered over total instructions.
E.g.:
16.1972 % covered
Add this command/script to the `test-unit` job of your CI file so each test run outputs its coverage.
TIP: you can use awk. Test your command locally before adding it to your CI file.
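One possible approach, using awk on the CSV columns shown above (column 4 = INSTRUCTION_MISSED, column 5 = INSTRUCTION_COVERED); treat this as a starting point and verify the output locally:

awk -F',' 'NR>1 { missed+=$4; covered+=$5 } END { printf "%.4f %% covered\n", 100*covered/(missed+covered) }' target/jacoco-report/jacoco.csv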
Then construct a regular expression to capture this output and add it to your `maven-test` job using the `coverage` keyword.
You can build and test your regex using a regex tester like Regex101.
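Assuming your script prints a line in the format shown above, your existing test job would gain a `coverage` entry roughly like this (the regex must match your exact output format):

maven-test:
  # ... existing job definition ...
  coverage: '/\d+\.\d+ % covered/'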
When you have added your `coverage` regex to your `maven-test` job, you can push your changes to GitLab. If you then navigate to that job on GitLab, you will see your score in the Job Details.
If you have an open merge request, you will also see your score in the merge request widget.
You can even start tracking your code coverage history via Analyze>Repository Analytics now.
Visualizing code coverage in MR diffs¶
Coverage visualization broken in GitLab instance
The GitLab instance used in this course has a bug that prevents the coverage visualization from working correctly. The steps below will not work as intended. You can still follow the steps to set up the visualization, but the coverage will not be shown in the diffs.
We do expect to see the JaCoCo report be uploaded as a report artifact! Make sure you have the following artifacts in your `maven-unit-test` job:
To further improve our visibility into our code coverage, we can add some visualization to the diffs of our merge requests, using the output of jacoco to enrich the GitLab view.
TASK
Check official docs on Test Coverage Visualization and add the necessary steps to your CI file to enable this feature.
When you have set it up right, you will be able to see your code coverage in the diffs of your merge requests. The orange highlighting shows lines that aren't covered yet. The green highlighting shows lines that are covered.
Run pipeline on merge requests
Make sure to run your pipeline on merge requests to see the coverage in the diffs. If the associated job doesn't run, the artifact isn't present and thus coverage won't be shown. See Pipeline Optimizations for more information.
Using this and the coverage history enabled previously, you can easily check the contributions of team members and see if they are adding tests or not, pinpointing code lines that are not covered yet.
Test reports in pipeline view¶
A final thing we can easily add is a test overview in our Pipeline view. When you go to CI/CD>Pipelines and check the details of your latest pipeline, you will notice there are tabs, one of them being Tests.
You can find instructions on the GitLab Unit Test Reports page. In our case, we are going to add the test reports to the unit-test job (View Unit test reports on GitLab).
Info
Test reports are generated by Surefire. Take a look at the target folder of your project, compare with the artifacts needed for the test report in the GitLab docs and adapt the code example!
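As a sketch: for a Maven project using Surefire, the GitLab docs use an artifacts definition along these lines (the job name refers to your existing test job; check the docs and your target folder for the exact paths that apply to your project):

unit-test:
  # ... existing job definition ...
  artifacts:
    when: always
    reports:
      junit:
        - target/surefire-reports/TEST-*.xml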
When you have set it up right, you will be able to see all your tests in the pipeline view.
Improving test coverage¶
TASK
Add tests to improve your Code Coverage, aim for at least 60%. Use your Code Coverage report to get insight into what can be improved.
Do note: 100% code coverage does not necessarily reflect effective testing, as it only reflects the amount of code exercised during tests; it says nothing about test accuracy or use-case completeness. To this end, there are other tools that can help, like Mutation Testing, an example being PiTest, which also has a Maven plugin. Implementing this is beyond the scope (and allotted time) of this course.
Badges¶
By now you must have noticed how broad a topic DevOps really is. There is a lot at play in setting up a good automated CI/CD pipeline. However, once in place, the benefits are well worth the effort. It is clear that insights into your pipeline and project at a quick glance are very valuable. One way to enable this is badges.
Badges are a unified way to present condensed pieces of information about your projects. They can offer quick insight into the current status of the project and can convey important statistics that you want to keep track of. The possibilities are endless. Badges consist of a small image and a URL that the image points to once clicked. Examples for badges can be the pipeline status, test coverage, or ways to contact the project maintainers.
Badges have their use for both public and private projects:
- Private projects: a quick and easy way for the development team to see how the project and pipeline are doing from a technical viewpoint (how are we doing on test coverage, how much code have we written, when was our latest deployment, etc.).
- Public projects: they can act as a poster board for your public repository, showing visitors how the project is doing (how many downloads, latest release version, where to contact the developers, where to file an issue, etc.).
Create Coverage and Pipeline status badge
Research how to add badges to your project and add the coverage and pipeline status badge to your project. You can also add badges for other things if you want to.
Let these badges link to relevant pages in your project.
For the coverage badge to work, the coverage extraction must be setup correctly, see Code Coverage parsing for more information.
Pipeline optimizations¶
As some have noticed, with our pipeline as it is currently set up, changes to any branch will result in a new `latest` image version and a deployment of our Kubernetes resources.
This is far from ideal, as branches often contain experimental code that is not ready for deployment. Furthermore, as multiple people are working on separate branches, their deployments will overwrite, and sometimes even break each other.
To avoid this we need to dynamically trigger certain jobs. In GitLab we can use the `rules` keyword to define rules that trigger jobs. These rules can be very simple or very complex, depending on your needs.
GitLab rules¶
Each rule has two attributes that can be set:
- `when` allows you to designate whether the job should for example be excluded (`when: never`) or only run when previous stages ended successfully (`when: on_success`). Other options are `when: manual`, `when: always` and `when: delayed`.
- `allow_failure` can be set to true, to keep the pipeline going when a job fails or has to block. A job will block when, for example, `when: manual` is set: the job needs manual approval and this will block the following jobs.
For each rule, three clauses are available to decide whether a job is triggered or not: `if` evaluates an if statement (checking either environment variables or user-defined variables), `changes` checks for changes in a list of files (wildcards enabled) and `exists` checks the presence of a list of files. These clauses can be combined into more complex statements, as demonstrated in the example below.
rules:
  - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
    when: manual
    allow_failure: true
  - if: '$CI_COMMIT_BRANCH == "master"'
    changes:
      - Dockerfile
    when: on_success
  - if: '$CI_PIPELINE_SOURCE == "schedule"'
    when: never
Disclaimer: the above snippet makes no sense and is merely here to illustrate how the rules work and are evaluated.
The first rule will trigger a manual job, requiring a developer to approve the job through GitLab, if the source of the pipeline is a merge request. Failure is explicitly allowed so the following jobs and stages aren't blocked by this job.
The second rule will trigger for changes to the Dockerfile (on the root level of the repository) on the `master` branch. When previous stages fail, this job will be skipped because of `when: on_success`, which dictates to only trigger the rule upon successful completion of previous stages (this is also the default).
The third and final rule will exclude this job from being triggered by scheduled pipelines, through `when: never`.
For full documentation on the `rules` keyword, see the official GitLab CI/CD docs; it has extensive examples to get you started. For a list of all default environment variables to check via the `if` statement, visit this page.
Warning
Before the introduction of the `rules` keyword into GitLab CI/CD, `only|except` was used to define when jobs should be created or not. The `rules` syntax is an improved and more powerful solution, and `only|except` has been deprecated.
Avoiding unnecessary deployments¶
The simplest strategy to solve our problem is to limit deployment and creation of the latest tag to the `main` branch only. Controlling when jobs and pipelines get run is done through the GitLab `rules` keyword.
Set up rules
Set up your pipeline with `rules` so that it at least does the following:
- Only push the latest tag when the pipeline is run on the `main` branch.
- Do not deploy the latest build to the cluster, unless the pipeline is run on the `main` branch.
- Make sure that your full pipeline runs when committing a tag to the repository, such as `Lab 4`.
You can add rules that expand on this strategy, to e.g. deploy certain feature branches through a manual trigger or limit building of code to when it has actually changed, if wanted. Start with the required rules and expand if needed. Include your overall strategy in your report and make sure that your final pipeline functions as expected.
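A minimal sketch of what such rules could look like on a deploy job (the job name and branch name are assumptions; adapt them to your own pipeline):

deploy:
  # ... existing job definition ...
  rules:
    # run only for the main branch or for Git tags (e.g. "Lab 4"); skip everywhere else
    - if: '$CI_COMMIT_BRANCH == "main"'
    - if: '$CI_COMMIT_TAG'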
Service robustness¶
Chaos Engineering¶
Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production. It brings about a paradigm shift in how DevOps teams deal with unexpected system outages and failures. Conventional Software Engineering teaches us that it is good practice to deal with unreliability in our code, especially when performing I/O (disk or network operations), but this is never enforced. There are many stories of large systems fatally crashing because engineers, for example, never imagined two services failing in quick succession.
In Chaos Engineering, failure is built into the system, by introducing a component that can kill any service at any time. This changes your psyche as a Developer or DevOps engineer: failure is no longer an abstract concept that will probably never happen anyway. Instead, it becomes a certainty, something you have to deal with on a daily basis and keep in the back of your head for any subsequent update you perform on the codebase.
Context
Chaos Engineering was pioneered at Netflix as part of a new strategy for dealing with the rapidly increasing popularity of the streaming service, which caused significant technical challenges with regards to scalability and reliability.
For the DevOps Game, we implemented a lite version of Chaos Engineering. There is a component in place that can kill `logic-service` pods, but it does not operate in a completely random way. To keep things fair, the component targets teams in a uniform way by cycling through a shuffled list of the participating teams. The `logic-service` pod for each team will be killed a fixed number of times during each game session.
Remember: Kubernetes is declarative in nature and uses a desired state, meaning that if you specify that a deployment should have one pod (using the `spec.replicas` attribute), Kubernetes tries to make sure that there is always one pod running. As a result, the pod for your `logic-service` will automatically restart each time it is killed by our Chaos component, so you don't have to worry about that.
However, the operation of your logic could be impacted after a restart. Especially if you rely on building up a model of the game world in memory for guiding your decisions. In the next section, we will discuss how you can persist and recover important parts of your state.
Saving & Restoring state¶
GameState¶
You can extend your Logic Service with functionality to periodically save your in-memory state, with the goal of being able to restore this state when your service is restarted.
An easy way to save your state is by using the Jackson serialization library. Using Jackson, you can convert any POJO (Plain Old Java Object) into a JSON string, which can be written to a file.
Note
We recommend encapsulating all your game state into a new Java class. This class should contain nothing but your game state properties as private fields (with public getters and setters) and a generated hashCode and equals function. This helps to create a straightforward flow for saving and restoring your state.
Example of a simple GameState class:
package be.ugent.devops.services.logic.persistence;

import be.ugent.devops.services.logic.api.Location;

import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

public class GameState {

    private String gameId;
    private Set<Location> resources = new HashSet<>();
    private Set<Location> enemyBases = new HashSet<>();

    public String getGameId() {
        return gameId;
    }

    public void setGameId(String gameId) {
        this.gameId = gameId;
    }

    public Set<Location> getResources() {
        return resources;
    }

    public void setResources(Set<Location> resources) {
        this.resources = resources;
    }

    public Set<Location> getEnemyBases() {
        return enemyBases;
    }

    public void setEnemyBases(Set<Location> enemyBases) {
        this.enemyBases = enemyBases;
    }

    public void reset(String gameId) {
        this.gameId = gameId;
        resources.clear();
        enemyBases.clear();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        GameState gameState = (GameState) o;
        return Objects.equals(gameId, gameState.gameId)
                && Objects.equals(resources, gameState.resources)
                && Objects.equals(enemyBases, gameState.enemyBases);
    }

    @Override
    public int hashCode() {
        return Objects.hash(gameId, resources, enemyBases);
    }
}
It is important that `GameState` has a reset method, which is able to clear all the state attributes, e.g. when a new game is started!
You could embed an instance of this class in your `FactionLogicImpl` to keep track of resource locations or enemy bases, so you can use this information for making informed decisions in controlling your units.
To make an instance of `GameState` available in `FactionLogicImpl`, you can rely on the Quarkus CDI framework by providing a producer method that creates an instance by reading the JSON file containing the previously written state:
```java
package be.ugent.devops.services.logic.persistence.impl;

import be.ugent.devops.services.logic.persistence.GameState;
import be.ugent.devops.services.logic.persistence.GameStateStore;
import io.quarkus.logging.Log;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.inject.Produces;
import jakarta.inject.Inject;
import jakarta.inject.Singleton;

@ApplicationScoped // Bean-defining annotation so the class is discovered by CDI
public class GameStateInitializer {

    @Inject
    GameStateStore gameStateStore;

    @Produces  // Declare as producer method
    @Singleton // Make sure only one instance is created
    public GameState initialize() {
        Log.info("Fetching initial game-state from store.");
        return gameStateStore.read();
    }
}
```
This `GameStateInitializer` relies on an implementation of `GameStateStore`, whose interface is declared as follows:
```java
package be.ugent.devops.services.logic.persistence;

public interface GameStateStore {
    void write(GameState gameState);

    GameState read();
}
```
Implement GameStateStore
Provide an implementation for this interface. `write` will have to encode the `GameState` to JSON and save it to a file; `read` will have to read the JSON file and decode it into a `GameState` object. Check the `io.vertx.core.json.Json` class, which is bundled with the Quarkus framework, for help with encoding and decoding JSON. For reading and writing files, you can use the `java.nio.file` package. A minimal sketch follows the pointers below.
Some further pointers:
- Annotate the class with `@ApplicationScoped`, so it can be injected in `GameStateInitializer` (and your `FactionLogicImpl`, see next section).
- Make the path to the folder that is used for storage configurable (see the Quarkus guide on configuring your application). Name this configuration property `game.state.path`.
- The actual filename can be static, e.g. `gamestate.json`.
- Prevent unnecessary file writes: when the `GameState` has not been modified, skip the write operation. You can implement this by storing the hash code of the last written `GameState` in a variable and comparing this value at the start of the write method.
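The following is a minimal sketch of what such an implementation could look like, following the pointers above. It is not a reference solution: the class name `FileGameStateStore`, the use of MicroProfile's `@ConfigProperty`, and the error handling are illustrative choices you should adapt to your own code.

```java
package be.ugent.devops.services.logic.persistence.impl;

import be.ugent.devops.services.logic.persistence.GameState;
import be.ugent.devops.services.logic.persistence.GameStateStore;
import io.quarkus.logging.Log;
import io.vertx.core.json.Json;
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.config.inject.ConfigProperty;

import java.nio.file.Files;
import java.nio.file.Path;

@ApplicationScoped
public class FileGameStateStore implements GameStateStore {

    @ConfigProperty(name = "game.state.path")
    String statePath; // Folder used for storage, supplied via configuration

    private int lastWrittenHashCode;

    @Override
    public void write(GameState gameState) {
        // Skip the write when the state has not changed since the last write.
        if (gameState.hashCode() == lastWrittenHashCode) {
            return;
        }
        try {
            Files.writeString(stateFile(), Json.encode(gameState));
            lastWrittenHashCode = gameState.hashCode();
        } catch (Exception e) {
            Log.warn("Could not persist game-state!", e);
        }
    }

    @Override
    public GameState read() {
        try {
            if (Files.exists(stateFile())) {
                return Json.decodeValue(Files.readString(stateFile()), GameState.class);
            }
        } catch (Exception e) {
            Log.warn("Could not read persisted game-state, starting from scratch.", e);
        }
        // No (readable) state file yet: start with an empty state.
        return new GameState();
    }

    private Path stateFile() {
        return Path.of(statePath, "gamestate.json");
    }
}
```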
Integration in FactionLogicImpl¶
You can now integrate a persistent `GameState` by replacing the following code block in the `nextBaseMove` method. Make sure to inject the necessary fields so they are properly initialized (a sketch of these fields is given below):
```java
if (!input.context().gameId().equals(currentGameId)) {
    currentGameId = input.context().gameId();
    Log.infof("Start running game with id %s...", currentGameId);
}
```
With the following code:
```java
if (!input.context().gameId().equals(gameState.getGameId())) {
    gameState.reset(input.context().gameId());
    Log.infof("Start running game with id %s...", gameState.getGameId());
}
// Trigger write state once per turn
gameStateStore.write(gameState);
```
This will make sure that GameState is reset when your logic detects that a new game has been started and that the latest GameState is persisted once every turn.
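For completeness, a sketch of how the fields used above could be injected into your existing `FactionLogicImpl` (only the new fields are shown; the rest of the class is omitted):

```java
import be.ugent.devops.services.logic.persistence.GameState;
import be.ugent.devops.services.logic.persistence.GameStateStore;
import jakarta.inject.Inject;

// Excerpt of your existing FactionLogicImpl: only the newly injected fields are shown.
public class FactionLogicImpl {

    @Inject
    GameState gameState;

    @Inject
    GameStateStore gameStateStore;

    // ... your existing move logic (e.g. nextBaseMove) uses these fields ...
}
```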
Warning
Remember that our Game server will cancel `baseMove` or `unitMove` requests after a timeout of one second. Take this into account when calling your save state method and be mindful of the amount of data you are writing to disk. Monitor your move operations' response times using Prometheus instrumentation to gain insight into how much headroom you have left!
Use GameState in your logic
Integrate the `GameState` in your logic by replacing any fields and references in `FactionLogicImpl` that were used to track game-specific data with new attributes in the `GameState` class.
Important: do not forget to update `equals()` and `hashCode()` when modifying attributes.
Handle controlled shutdown¶
The Quarkus framework supports hooks that enable specifying what code should be executed when the application receives a SIGTERM (termination signal), e.g. when Ctrl + C is pressed in the shell running the application. This allows you to implement controlled shutdown behaviour. The following code sample shows an example of such a hook:
```java
import io.quarkus.runtime.ShutdownEvent;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import jakarta.inject.Inject;

@ApplicationScoped
public class ShutdownHook {

    public void onStop(@Observes ShutdownEvent event) {
        // Execute shutdown behaviour here...
    }
}
```
Handle termination
Add a shutdown hook that saves the `GameState` to file before allowing the application to terminate. A minimal sketch is shown below.
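A possible approach, assuming the `GameState` and `GameStateStore` beans from the previous sections are available for injection (adapt package and class names to your own project):

```java
package be.ugent.devops.services.logic.persistence.impl;

import be.ugent.devops.services.logic.persistence.GameState;
import be.ugent.devops.services.logic.persistence.GameStateStore;
import io.quarkus.logging.Log;
import io.quarkus.runtime.ShutdownEvent;
import jakarta.enterprise.context.ApplicationScoped;
import jakarta.enterprise.event.Observes;
import jakarta.inject.Inject;

@ApplicationScoped
public class ShutdownHook {

    @Inject
    GameState gameState;

    @Inject
    GameStateStore gameStateStore;

    public void onStop(@Observes ShutdownEvent event) {
        // Persist the latest game-state before the application terminates.
        Log.info("Saving game-state before shutdown...");
        gameStateStore.write(gameState);
    }
}
```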
Kubernetes Persistent Volumes¶
In the section on Grafana persistence, we've briefly touched on the concept of Kubernetes Persistent Volumes and Persistent Volume Claims.
Using the same principles, you can make sure that the game state file written by your Logic Service is still there whenever your service restarts.
First, you need to add a PVC resource file `volumeClaim.yaml` to your `k8s` folder, with the following contents:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: logic-service-pvc
  labels:
    app: logic
spec:
  storageClassName: k8s-stud-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Mi
```
This file describes the PersistentVolumeClaim, declaring that it should use the `k8s-stud-storage` storage class and reserve 100 mebibytes (which should be more than sufficient to store your state).
You will need to configure your deployment and Logic Service container to use this volume. Check out the official docs to find out how to use claims as volumes.
Set up volume provisioning
Add the necessary config to your `deployment.yaml` to start using the PersistentVolumeClaim.
See https://kubernetes.io/docs/concepts/storage/persistent-volumes/#claims-as-volumes.
Make sure to update your Logic Service configuration: the path used for the volume mount should be the same as the path set for the variable `game.state.path` (or `GAME_STATE_PATH` when supplying a value using environment variables). An illustrative excerpt is shown below.
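For illustration, a hedged sketch of the relevant parts of `deployment.yaml` (the container name `logic-service`, the volume name `game-state`, and the `/data` mount path are assumptions; adapt them to your own manifest):

```yaml
# Illustrative excerpt of deployment.yaml: mount the claim into the Logic Service container.
spec:
  template:
    spec:
      containers:
        - name: logic-service             # assumed container name
          env:
            - name: GAME_STATE_PATH       # must match the mountPath below
              value: /data
          volumeMounts:
            - name: game-state            # refers to the volume declared below
              mountPath: /data
      volumes:
        - name: game-state
          persistentVolumeClaim:
            claimName: logic-service-pvc  # the PVC created above
```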
Questions¶
- Why do we add the `when: always` attribute to upload test artifacts?
- Does your code coverage score get parsed correctly when the `unit-test` job fails? If not, can you think of a way to extract it in case of test failure?
- Why does your logic-service require a Persistent Volume to be able to restore the game-state?
Practicalities¶
This lab must be completed before Sunday, 15 December at 23:59.
Warning
All support outside the physical Lab session is handled via the GitLab Issue tracker! Do not expect support outside of the normal office hours.
Checklist¶
- Deploy Grafana using Helm, adding the used configuration to `monitoring/grafana-values.yaml`
- Explore available Prometheus metrics related to your Logic Service on Grafana
- Create a Grafana dashboard to show CPU and Memory usage of only your Logic Service
- Create at least 4 custom metrics (`move_execution_time` and 3 others) and instrument your code
- Visualize `move_execution_time` as requested in the Custom metrics dashboard section
- Visualize your other custom metrics in one or several Grafana Dashboards
- Document custom metrics and dashboard usage in `monitoring/instrumentation.md`
- Add Code Coverage reports to your unit test job
- Parse code coverage with GitLab so it shows up in MRs, job details and coverage history
- Add Unit test reports to your pipeline overview
- Improve your code coverage score to at least 60%
- Add pipeline status and code coverage badges to your project
- Set up rules to avoid unnecessary deployments: push `latest` only on the `main` branch, trigger deploy only on the `main` branch, trigger the full pipeline on pushing of the tag `Lab 4` (or equivalent)
(or equivalent) - Implement GameStateStore to load and save GameState to a JSON file
- Use GameState in your logic to keep track and persist game state
- Add a shutdown hook to your logic service to save the game state before termination
- Add a tag1
Lab 4
to the commit you consider the final result of this lab session. - Create an issue
Lab 4 Report
, labeled withReport
&Lab 4
, and add a concise report containing:- Link to the pipeline run for your
Lab 4
tag - Add a changelog for your Faction Logic implementation: what was added, fixed, removed. Link to the corresponding issues
- Answer the questions posed in the section above.
- Link to the pipeline run for your
[^1]: You can tag on GitLab: use the tab "Repository" (left menu), click "Tags", then "New Tag". You can also tag using git directly: https://git-scm.com/docs/git-tag.