General

CORE CONCEPTS

ARCHITECTURE

K8S architecture

OBJECTS

Understanding Kubernetes Objects

Kubernetes Objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:

  • What containerized applications are running (and on which nodes)
  • The resources available to those applications
  • The policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance

A Kubernetes object is a “record of intent”: once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re effectively telling the Kubernetes system what you want your cluster’s workload to look like; this is your cluster’s desired state. To work with Kubernetes objects (whether to create, modify, or delete them) you’ll need to use the Kubernetes API. When you use the kubectl command-line interface, for example, the CLI makes the necessary Kubernetes API calls for you. You can also use the Kubernetes API directly in your own programs using one of the client libraries.

Object Spec and Status

Every Kubernetes object includes two nested object fields that govern the object’s configuration: the object spec and the object status. The spec, which you must provide, describes your desired state for the object: the characteristics that you want the object to have. The status describes the actual state of the object, and is supplied and updated by the Kubernetes system. At any given time, the Kubernetes Control Plane actively manages an object’s actual state to match the desired state you supplied.

An example of a Kubernetes object:

apiVersion: apps/v1          # [REQUIRED] which API group and version of the Kubernetes API you're using to create the object
kind: Deployment             # [REQUIRED] what kind of object you want to create (Pod, Service, Deployment, Job, CronJob, ...)
metadata:                    # [REQUIRED] data that helps uniquely identify the object, including a `name` string, `UID`, and optional `namespace`
  name: nginx-deployment     # name of the object
spec:                        # [REQUIRED] desired state of the object; the spec format depends on the object kind
  selector:
    matchLabels:
      app: nginx
  replicas: 2                # tells deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9   # used image
        ports:
        - containerPort: 80

NAMES

All objects in the Kubernetes REST API are unambiguously identified by a Name and a UID. For non-unique user-provided attributes, Kubernetes provides labels and annotations. A Name is a client-provided string that refers to an object in a resource URL, such as /api/v1/pods/some-name. Only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name.

UIDS

A Kubernetes system-generated string used to uniquely identify objects. Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is intended to distinguish between historical occurrences of similar entities.
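To see both identifiers on a live object, you can read them from its metadata (a usage sketch against a running cluster; nginx-pod is a placeholder name):

```shell
kubectl get pod nginx-pod -o jsonpath='{.metadata.name}{" "}{.metadata.uid}{"\n"}'
```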

NAMESPACES

Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces. Although namespaces allow you to isolate objects into distinct groups, which lets you operate only on the objects belonging to a specific namespace, they don’t provide any kind of isolation of running objects. In other words, pods from different namespaces can communicate. To prevent this, you have to configure inter-namespace network isolation via network policies.

When to Use Multiple Namespaces

Namespaces are intended for use in environments with many users spread across multiple teams or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another, and each Kubernetes resource can only be in one namespace. Namespaces are a way to divide cluster resources between multiple users (via resource quota). In future versions of Kubernetes, objects in the same namespace will have the same access control policies by default. It is not necessary to use multiple namespaces just to separate slightly different resources, such as different versions of the same software: use labels to distinguish resources within the same namespace.
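As a sketch of that name scoping (the team namespaces here are hypothetical), the same pod name can coexist in two namespaces:

```shell
kubectl create namespace team-a
kubectl create namespace team-b
kubectl run nginx --image=nginx -n team-a   # "nginx" must be unique within team-a...
kubectl run nginx --image=nginx -n team-b   # ...but can be reused in team-b
kubectl get pods -n team-a                  # lists only team-a's pods
```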

Default namespaces

Kubernetes starts with three initial namespaces:

Namespaces:
default - The default namespace for objects with no other namespace
kube-system - The namespace for objects created by the Kubernetes system
kube-public - This namespace is created automatically and is readable by all users (including those not authenticated). It is mostly reserved for cluster usage, in case some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.

Namespaces and DNS

When you create a Service, it creates a corresponding DNS entry. This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means that if a container just uses <service-name>, it will resolve to the service which is local to a namespace. This is useful for using the same configuration across multiple namespaces such as Development, Staging and Production. If you want to reach across namespaces, you need to use the fully qualified domain name (FQDN).
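For example, an app container can use a short name for a Service in its own namespace and an FQDN to reach another namespace (a container-spec fragment; the service and namespace names are hypothetical):

```yaml
env:
- name: CACHE_HOST
  value: cache                          # short name: resolves within the Pod's own namespace
- name: DB_HOST
  value: db.staging.svc.cluster.local   # FQDN: reaches the "db" Service in the "staging" namespace
```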

LABELS AND SELECTORS

Labels are key/value pairs that are attached to objects (they can be attached to any Kubernetes object), such as pods. Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined, and each key must be unique for a given object. Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying information should be recorded using annotations.

Syntax and character set

Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).

Label selectors

Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s). Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes. The API currently supports two types of selectors: equality-based and set-based. A label selector can be made of multiple requirements which are comma-separated. In the case of multiple requirements, all must be satisfied so the comma separator acts as a logical AND (&&) operator. The semantics of empty or non-specified selectors are dependent on the context, and API types that use selectors should document the validity and meaning of them.

Equality-based requirement

Equality- or inequality-based requirements allow filtering by label keys and values. Matching objects must satisfy all of the specified label constraints, though they may have additional labels as well. Three kinds of operators are admitted: =, ==, !=. The first two represent equality (and are simply synonyms), while the latter represents inequality. For example:

environment = production
tier != frontend

Set-based requirement

Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators are supported: in, notin and exists (only the key identifier). For example:

environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
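Both kinds of selectors can be passed to kubectl via -l/--selector; a usage sketch against a running cluster (the label values are illustrative):

```shell
kubectl get pods -l environment=production,tier!=frontend   # equality-based, comma = AND
kubectl get pods -l 'environment in (production, qa)'       # set-based
kubectl get pods -l 'partition,!app'                        # key exists / key absent
```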

Service and ReplicationController

The set of pods that a service targets is defined with a label selector. Similarly, the set of pods that a ReplicationController should manage is also defined with a label selector. Label selectors for both objects are defined in JSON or YAML files using maps, and only equality-based requirement selectors are supported:

selector:
  component: redis

this selector is equivalent to component=redis or component in (redis)

Resources that support set-based requirements

Newer resources, such as Job, Deployment, ReplicaSet, and DaemonSet, support set-based requirements as well.

selector:
  matchLabels:
    component: redis
  matchExpressions:
    - {key: tier, operator: In, values: [cache]}
    - {key: environment, operator: NotIn, values: [dev]}

matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions whose key field is “key”, whose operator is “In”, and whose values array contains only “value”. matchExpressions is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both matchLabels and matchExpressions, are ANDed together; they must all be satisfied in order to match.

Selecting sets of nodes

One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule. See the documentation on node selection for more information.
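As a minimal sketch of that use case (disktype=ssd is a common example label, not something your nodes necessarily carry), a Pod can be constrained with spec.nodeSelector:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  nodeSelector:
    disktype: ssd        # only nodes labeled disktype=ssd are scheduling candidates
  containers:
  - name: nginx
    image: nginx
```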

ANNOTATIONS

You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata. You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.

Syntax

Annotations are key/value pairs. Valid annotation keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).

apiVersion: v1
kind: Pod
metadata:
  name: annotations-demo
  annotations:
    imageregistry: "https://hub.docker.com/"
spec:
  containers:
  - name: nginx
    image: nginx:1.7.9
    ports:
    - containerPort: 80

print the annotations that would be added (dry run)

k annotate OBJECT OBJECT_NAME KEY=VALUE --dry-run=client -o json | jq '.metadata.annotations'

Field Selectors

Field selectors let you select Kubernetes resources based on the value of one or more resource fields. Here are some example field selector queries:

  • metadata.name=my-service
  • metadata.namespace!=default
  • status.phase=Pending

Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name and metadata.namespace fields. Using unsupported field selectors produces an error.
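Field selectors are passed to kubectl via --field-selector, and several can be chained with commas (ANDed); a usage sketch against a running cluster:

```shell
kubectl get pods --field-selector status.phase=Running
kubectl get pods --all-namespaces --field-selector metadata.namespace!=default,status.phase=Pending
```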

Shared labels and annotations share a common prefix: app.kubernetes.io. Labels without a prefix are private to users. The shared prefix ensures that shared labels do not interfere with custom user labels.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: wordpress-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm

Networking model

The Kubernetes networking model is a set of standards that define how networking between Pods behaves. There are a variety of different implementations of this model, including the Calico networking plugin, which has been used throughout this course. The Kubernetes network model defines how Pods communicate with each other, regardless of which Node they are running on.

Each pod has its own unique IP address within the cluster. Any Pod can reach any other Pod using that Pod’s IP address. This creates a virtual network that allows Pods to easily communicate with each other, regardless of which node they are on.

CNI

To make it easier to connect containers into a network, a project called the Container Network Interface (CNI) was started.

CNI plugins overview

CNI plugins are a type of Kubernetes network plugin. These plugins provide network connectivity between Pods according to the standard set by the Kubernetes network model.

Understanding K8s DNS

The K8s virtual network uses DNS to allow Pods to locate other Pods and Services using domain names instead of IP addresses.

Pod Domain Names

All Pods in our kubeadm cluster are automatically given a domain name of the following form:

pod-ip-address.namespace-name.pod.cluster.local

where the pod-ip-address segment is the Pod's IP with dots replaced by dashes; for 192.168.10.100 it is 192-168-10-100.
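Assuming a busybox Pod exists and the target Pod lives in the default namespace (both are placeholders here), the form above can be checked with nslookup:

```shell
kubectl exec busybox -- nslookup 192-168-10-100.default.pod.cluster.local
```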

KUBERNETES COMPONENTS

Kubernetes components

MASTER

kube-apiserver

Component on the master that exposes the Kubernetes API; it is the front end for the Kubernetes control plane. It is designed to scale horizontally, that is, it scales by deploying more instances. It provides a CRUD interface for querying and modifying the cluster state over a RESTful API, and it stores that state in etcd. It also performs validation of request objects and handles optimistic locking, so changes to an object are never overridden by other clients in the event of concurrent updates.

When a client talks to the API server, its requests go through authentication, authorization, admission, and validation (only for create, delete, and update, not for read) before being stored in etcd. After validation, the API server returns a response to the client.

The API server doesn’t tell controllers what to do. All it does is enable those controllers and other components to observe changes to deployed resources. A control plane component can request to be notified when a resource is created, modified, or deleted. This enables the component to perform whatever task it needs in response to a change of the cluster metadata. Clients watch for changes by opening an HTTP connection to the API server. Every time an object is updated, the server sends the new version of the object to all connected clients watching the object. The watch mechanism is also used by the Scheduler.

etcd

Consistent and highly available key-value store used as Kubernetes’ backing store for all cluster data. Only the API server communicates with etcd; all other components read and write data to etcd indirectly through the API server. Each key in etcd is either a directory, which contains other keys, or a regular key with a corresponding value. etcd v3 doesn’t support directories, but because the key format remains the same (keys can include slashes), you can still think of them as being grouped into directories. Kubernetes stores all its data in etcd under /registry.

etcd uses the RAFT consensus algorithm, which ensures that at any given moment each node’s state is either what the majority (or quorum) of the nodes agrees is the current state, or one of the previously agreed-upon states. The consensus algorithm requires a majority for the cluster to progress to the next state: to transition from the previous state to the new one, more than half of the nodes have to take part in the state change.

listing etcd data; records are stored under namespaced keys like /registry/pods/<namespace>

k -n kube-system exec etcd-k8s-control -it -- etcdctl --endpoints=https://172.31.30.119:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key get / --keys-only=true --prefix=true

kube-scheduler

Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on. All it does is wait for newly created pods through the API server’s watch mechanism and assign a node to each new pod that doesn’t already have a node set. The Scheduler doesn’t instruct the selected node (or the Kubelet running on that node) to run the pod. All the Scheduler does is update the pod definition through the API server. The API server then notifies the Kubelet (again, via the watch mechanism) that the pod has been scheduled. As soon as the Kubelet on the target node sees the pod has been scheduled to its node, it creates and runs the pod’s containers.

The Scheduler selects a node for the pod in a two-step operation:

  • Filtering
  • Scoring

The filtering step finds the set of Nodes where it’s feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod’s specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn’t (yet) schedulable. In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules. Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random. Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.
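The PodFitsResources filter works against the requests declared in the pod spec; a minimal sketch of what the scheduler evaluates (the values are illustrative):

```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:          # filtering: nodes without this much unreserved CPU/memory are excluded
        cpu: 500m
        memory: 256Mi
      limits:            # limits cap runtime usage but are not used for the filtering step
        cpu: "1"
        memory: 512Mi
```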

kube-controller-manager

Component on the master that runs controllers. Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.

Holds info about the CIDR range for pods in --cluster-cidr and the CIDR range for services in --service-cluster-ip-range (although the latter setting is present on the kube-apiserver as well).

These controllers include:

  • Replication Manager - responsible for ReplicationController resources. It works in an infinite loop, where in each iteration the controller finds the number of pods matching its pod selector and compares it to the desired replica count. When too few pod instances are running, it runs additional instances: it creates new Pod manifests, posts them to the API server, and lets the Scheduler and the Kubelet do their job of scheduling and running the pods. Thus, it performs its work by manipulating Pod API objects through the API server. This is how all controllers operate.
  • Deployment Controller - performs a rollout of a new version each time a Deployment object is modified (if the modification should affect the deployed pods). It does this by creating a ReplicaSet and then appropriately scaling both the old and the new ReplicaSet based on the strategy specified in the Deployment, until all the old pods have been replaced with new ones. It doesn’t create any pods directly.
  • StatefulSet Controller - similar to the ReplicaSet controller and other related controllers; creates, manages, and deletes Pods according to the spec of a StatefulSet resource. It also instantiates and manages a PVC for each Pod instance.
  • Node Controller - responsible for noticing and responding when nodes go down.
  • Replication Controller - responsible for maintaining the correct number of pods for every replication controller object in the system.
  • Endpoints Controller - populates the Endpoints object (that is, joins Services & Pods). It watches both Services and Pods. When Services are added or updated, or Pods are added, updated, or deleted, it selects the Pods matching the Service’s pod selector and adds their IPs and ports to the Endpoints resource.
  • Namespace Controller - when a namespace resource is deleted, all the resources in that namespace must also be deleted.
  • Service Account & Token Controllers - create default accounts and API access tokens for new namespaces.
  • PersistentVolume Controller - once a user creates a PVC, Kubernetes must find an appropriate PV and bind it to the claim. Kubernetes looks for the best match for the claim by selecting the smallest PV with an access mode matching the one requested in the claim and a declared capacity above the capacity requested in the claim.
  • Others

There’s a controller for almost every resource you can create. Resources are descriptions of what should be running in the cluster, whereas the controllers are the active Kubernetes components that perform actual work as a result of the deployed resources. After a controller updates a resource in the API server, the Kubelet and kube-proxy perform their work, such as spinning up a pod’s containers and attaching network storage to them, or, in the case of services, setting up the actual load balancing across pods.

This component is also responsible for signing certificates for the whole Kubernetes cluster.

kube-cloud-controller-manager

The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider’s API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.

WORKER

kubelet

An agent that runs on each node in the cluster. It makes sure that containers are running in a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn’t manage containers that were not created by Kubernetes. Its initial job is to register the node it’s running on by creating a Node resource in the API server. Then it needs to continuously monitor the API server for Pods that have been scheduled to the node, and start the pod’s containers. It does this by telling the configured container runtime (which is Docker, rkt, or something else) to run a container from a specific container image. The Kubelet then constantly monitors running containers and reports their status, events, and resource consumption to the API server. The Kubelet is also the component that runs the container liveness probes, restarting containers when the probes fail. Lastly, it terminates containers when their Pod is deleted from the API server and notifies the server that the pod has terminated. Although the Kubelet talks to the Kubernetes API server and gets the pod manifests from there, it can also run pods based on pod manifest files in a specific local directory.

kube-proxy

kube-proxy is a network proxy that runs on each node in the cluster. It is configured as a DaemonSet whose manifest is not stored on disk; it is kept in a ConfigMap in Kubernetes. It enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding. kube-proxy is responsible for request forwarding: it allows TCP and UDP stream forwarding or round-robin TCP and UDP forwarding across a set of backends. Besides watching the API server for changes to Services, kube-proxy also watches for changes to Endpoints objects.

Container Runtime

The container runtime is the software that is responsible for running containers. Kubernetes supports several container runtimes: Docker, containerd, CRI-O, rktlet, and any implementation of the Kubernetes CRI (Container Runtime Interface).

NAMESPACES

In Kubernetes, almost every object lives in some namespace. Kubernetes supports multiple virtual clusters backed by the same physical cluster; these virtual clusters are called namespaces. For small environments with a small number of users, multiple namespaces are not needed. Kubernetes starts with the same three initial namespaces described earlier: default, kube-system, and kube-public.

Not all objects are inside namespaces

# In a namespace
kubectl api-resources --namespaced=true

# Not in a namespace
kubectl api-resources --namespaced=false

Change default namespace to NAMESPACE_NAME

k config set-context --current --namespace=NAMESPACE_NAME

PODS

Pods are the smallest and most basic building block of the Kubernetes model. A pod consists of one or more containers, storage resources, and a unique IP address in the Kubernetes network. It will always run on the same worker node and in the same Linux namespace(s). In order to run containers, Kubernetes schedules pods to run on servers in the cluster. When a pod is scheduled, the server will run the containers that are part of the pod.

The IP address of the pod belongs to a network namespace; every container inside the pod shares that single network namespace. All containers in a pod share Linux namespaces like network, UTS (hostname), and IPC (and PID, though not by default), except the filesystem. The filesystem is by default isolated per container, but it can be shared between containers via volumes. All containers in the same pod share the same cgroup limits.

In summary, pods are logical hosts and behave much like physical hosts or VMs in the non-container world. Processes running in the same pod are like processes running on the same physical or virtual machine, except that each process is encapsulated in a container.

create pod

k run nginx-pod --image=nginx --env="ENV=dev" --port=80 --dry-run=client -o yaml -l env=dev

will be created as

apiVersion: v1                  # api version
kind: Pod                       # type of object
metadata:                       # metadata object
  creationTimestamp: null
  labels:
    env: dev
  name: nginx-pod               # name of the pod
spec:                           # spec object
  containers:
  - env:
    - name: ENV
      value: dev
    image: nginx                # used image
    name: nginx-pod             # container name (same as the pod name when created with kubectl run)
    ports:
    - containerPort: 80         # can be omitted as this is purely informational
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Inter-pod communication

Every pod gets its own IP address which is routable on the pod network, which means that every pod can communicate with any other pod.

Intra-pod communication

Containers inside the same pod communicate via the localhost interface and the relevant port.

Pod Lifecycle

One pod gets scheduled to one node. You define it in a manifest file (e.g. YAML), then you submit that manifest to the API server and the pod gets scheduled to a node. Once it’s scheduled to a node, it goes into the Pending state while the node downloads images and fires up the containers. Importantly, it stays in this Pending state until all containers are up and ready. Once that’s done, it goes into the Running state. Then, once it has finished everything it was created to do, it gets shut down and the state changes to Succeeded. If it can’t start, for whatever reason, it can remain in the Pending state, or eventually go to the Failed state. Scheduling is all-or-nothing: there is no case where a pod is partially deployed.

Pod phases

Pod phases:
Pending - The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running - The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
Succeeded - All containers in the Pod have terminated in success, and will not be restarted.
Failed - All containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.
Unknown - For some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.
Container states:
Waiting - If a container is not in either the Running or Terminated state, it is Waiting. A container in the Waiting state is still running the operations it requires in order to complete start up: for example, pulling the container image from a container image registry, or applying Secret data. When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason field summarizing why the container is in that state.
Running - The Running status indicates that a container is executing without issues. If there was a postStart hook configured, it has already executed and finished. When you use kubectl to query a Pod with a container that is Running, you also see information about when the container entered the Running state.
Terminated - A container in the Terminated state began execution and then either ran to completion or failed for some reason. When you use kubectl to query a Pod with a container that is Terminated, you see a reason, an exit code, and the start and finish time for that container’s period of execution. If a container has a preStop hook configured, this hook runs before the container enters the Terminated state.
Pod Conditions:
PodScheduled - the Pod has been scheduled to a node.
ContainersReady - all containers in the Pod are ready.
Initialized - all init containers have completed successfully.
Ready - the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.
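Phase and conditions are both visible in the Pod's status; a usage sketch against a running cluster (nginx-pod is a placeholder name):

```shell
kubectl get pod nginx-pod -o jsonpath='{.status.phase}{"\n"}'
kubectl get pod nginx-pod -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```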

Infrastructure container

When a pod is running, you can observe an additional container as part of it, with the command PAUSE. This pause container is the container that holds all the containers of a pod together. The pause container is an infrastructure container whose sole purpose is to hold all these namespaces. All other user-defined containers of the pod then use the namespaces of the pod infrastructure container. Actual application containers may die and get restarted. When such a container starts up again, it needs to become part of the same Linux namespaces as before. The infrastructure container makes this possible since its lifecycle is tied to that of the pod: the container runs from the time the pod is scheduled until the pod is deleted. If the infrastructure container is killed in the meantime, the Kubelet recreates it and all the pod’s containers.

Creating a pod

kubectl run busybox --image=busybox --restart=Never -- sleep 1000

Multicontainer pod

A pod with more than one container is a multi-container pod. In a multi-container Pod, the containers share resources such as network and storage. They can interact with one another, working together to provide functionality.

Keep containers in separate Pods unless they need to share resources.

Static pod

A Pod that is managed directly by the kubelet on a node, not by the K8s API server. Static pods can run even if there is no K8s API server present. The kubelet automatically creates Pods (only pods, nothing else) from YAML manifest files located in the manifest path on the node (/etc/kubernetes/manifests/).

The kubelet creates a mirror Pod for each static Pod. Mirror Pods allow you to see the status of the static Pod via the K8s API, but you cannot change or manage them via the API.
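As a sketch, a static pod is just a regular pod manifest placed in the kubelet's manifest directory (the file name and image here are illustrative):

```yaml
# saved as /etc/kubernetes/manifests/static-web.yaml on the node
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
    ports:
    - containerPort: 80
```

The kubelet picks the file up on its own; the mirror pod then shows up in the API as static-web-<NODE_NAME>.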

Cross-Container interaction

Containers sharing the same Pod can interact with one another using shared resources.

Network:
Containers share the same networking namespace and can communicate with one another on any port, even if the port is not exposed to the cluster.
Storage:
Containers can use volumes to share data in a Pod.
Example Use Case
You have an application that is hard-coded to write log output to a file on disk. You add a secondary container to the Pod (sometimes called a sidecar) that reads the log file from a shared volume and prints it to the console, so the log output appears in the container log. Other sidecar use cases: the main container is a web server, and additional (sidecar) containers periodically download files, rotate logs, process data, …

sidecar example

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: sidecar-pod
  name: sidecar-pod
spec:
  containers:
  - command:
    - sh
    - -c
    - while true;do echo logs data > /output/output.log;sleep 5;done
    image: busybox
    name: sidecar-pod
    resources: {}
    volumeMounts:
    - name: sharedvol
      mountPath: /output
  - name: sidecar
    image: busybox
    command:
    - sh
    - -c
    - tail -f /input/output.log
    volumeMounts:
    - name: sharedvol
      mountPath: /input
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: sharedvol
    emptyDir: {}
status: {}

Init containers

Init containers are containers that run once during the startup process of a pod. A pod can have any number of init containers, and they will each run once (in order) to completion. You can use init containers to perform a variety of startup tasks. They can contain and use software and setup scripts that are not needed by your main containers. They are often useful in keeping your main containers lighter and more secure by offloading startup tasks to a separate container.

Use cases:

  • Cause a pod to wait for another K8s resource to be created before finishing startup
  • Perform sensitive startup steps securely outside of app containers
  • Populate data into a shared volume at startup
  • Communicate with another service at startup

an example

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: init-pod
  name: init-pod
spec:
  containers:
  - image: nginx
    name: init-pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  initContainers:
  - name: delay
    image: busybox
    command:
    - sh
    - -c
    - sleep 30
  - name: delay2
    image: busybox
    command:
    - sh
    - -c
    - echo "this is 2nd init container"
status: {}

Port forwarding

When you want to talk to a specific pod without going through a service you can use port-forward. It can be used for pods, deployments, and services. In the background it uses the socat utility.

k port-forward <pod|deployment|service>/<NAME> --address=localhost LOCAL_PORT:REMOTE_PORT

Scheduling

Scheduling is just the process of assigning pods to Kubernetes nodes so that kubelets can run them. Whenever we create a pod, something has to determine which node to run that pod on, and that is the scheduling process. The Scheduler is the control plane component that handles scheduling.

Scheduling process: the Kubernetes scheduler selects a suitable Node for each Pod. It takes into account:

  • Resource requests vs available node resources
  • Various configuration options that affect scheduling, such as node labels
nodeSelector

You can configure a nodeSelector for your Pods to limit which Node(s) the Pod can be scheduled on. Node selectors use node labels to filter suitable nodes.

adding label to the node

k label nodes <NODE_NAME> KEY=VALUE

removing label from node

k label nodes <NODE_NAME> KEY-

list all pods with env label

k get pods -l env

list all pods without env label

k get pods -l '!env'
or something like:
k get pods -l 'creation_method!=manual' to select pods with the creation_method label set to any value other than manual
k get pods -l 'env in (prod,dev)' to select pods with the env label set to prod or dev
k get pods -l 'env notin (prod,devel)' to select pods with the env label set to any value other than prod or devel
k get pods -l 'env=debug,creation_method=manual' to select all pods with both labels

define nodeSelector for pod

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: busybox
spec:
  containers:
  - image: nginx
    name: nodeselector-pod
    resources: {}
  nodeSelector:
    KEY: VALUE
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

You can bypass scheduling and assign a Pod to a specific Node by name using nodeName.

apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01   # assign pod to `kube-01` node

Overriding the command and arguments in Kubernetes

In Kubernetes, when specifying a container, you can choose to override both ENTRYPOINT (the executable that's executed inside the container) and CMD (the arguments passed to the executable). To do that, you set the properties command (for ENTRYPOINT) and args (for CMD) in the container specification.

an example

kind: Pod
spec:
  containers:
  - image: some/image
    command: ["/bin/command"]        # docker ENTRYPOINT definition
    args: ["arg1","arg2","arg3"]     # docker CMD definition

The command and args fields can’t be updated after the pod is created

Setting environment variables for a container

The list of environment variables can’t be updated after the pod is created

an example

spec:
  containers:
  - name: some/name
    image: some/image
    env:
    - name: some_name
      value: "some_value"

Having values effectively hardcoded in the pod definition means you need separate pod definitions for your production and your development pods. To reuse the same pod definition in multiple environments, it makes sense to decouple the configuration from the pod descriptor. In other words, you should use a ConfigMap.
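A minimal sketch of that decoupling, assuming a ConfigMap named app-config with a key log_level (both names are hypothetical):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  log_level: "debug"
---
apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
  - name: main
    image: busybox
    command: ["sh", "-c", "echo LOG_LEVEL=$LOG_LEVEL; sleep 3600"]
    env:
    - name: LOG_LEVEL
      valueFrom:
        configMapKeyRef:       # value is taken from the ConfigMap instead of being hardcoded
          name: app-config
          key: log_level
```

Per environment you then only swap the ConfigMap, not the pod definition.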

Restart policies

Applies to all containers inside a pod. K8s can automatically restart containers when they fail. Restart policies allow you to customize this behavior by defining when you want a pod's containers to be automatically restarted. There are three possible values for a pod's restart policy in k8s:

Restart policies:
Always - the default (not applicable for Jobs). Use this policy for applications that should always be running.
OnFailure - will restart containers only if the container process exits with an error code or the container is determined to be unhealthy by a liveness probe. If the pod phase is Succeeded (shown as Completed), the pod will not be restarted. Use this policy for applications that need to run successfully and then stop.
Never - will cause the pod's containers to never be restarted, even if the container exits or a liveness probe fails. Use this for applications that should run once and never be automatically restarted.
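For example, a one-off task that should be retried only when it fails (image and command are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: onfailure-pod
spec:
  restartPolicy: OnFailure        # restart only when the container exits with a non-zero code
  containers:
  - name: task
    image: busybox
    command: ["sh", "-c", "echo doing work; exit 0"]   # exits 0, so the pod reaches Succeeded and is not restarted
```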

Security

pod.spec.containers.imagePullPolicy - Always is recommended so that images cached on the nodes are not reused without checking the registry
pod.spec.imagePullSecrets - secret used to log in to a private registry

imagePullSecrets can also be attached to a ServiceAccount in the namespace.
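A sketch of both options, assuming a docker-registry secret named regcred already exists in the namespace (registry and image names are hypothetical):

```yaml
# referenced directly in the pod spec
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  imagePullSecrets:
  - name: regcred                            # secret of type kubernetes.io/dockerconfigjson
  containers:
  - name: main
    image: registry.example.com/myapp:1.0    # private image
    imagePullPolicy: Always
---
# or attached to a ServiceAccount, so every pod using that SA inherits the secret
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
imagePullSecrets:
- name: regcred
```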

Security Context

Defining security contexts allows you to lock down your containers, so that only certain processes can do certain things. This ensures the stability of your containers and allows you to grant control or take it away.

pod.spec.containers.securityContext - various options such as [‘runAsUser’, ‘runAsNonRoot’, ‘privileged’, ‘capabilities’, ‘fsGroup’, ‘readOnlyRootFilesystem’]

  • runAsUser - specify the UID the container's process will run as
  • runAsNonRoot - the container will only be allowed to run as a non-root user
  • privileged - to get full access to the node's kernel
  • capabilities - finer-grained control of what is allowed or denied in the container. Defining capabilities is a better way than granting full privileges with privileged: true
  • readOnlyRootFilesystem - prevents writing to the root filesystem; writing to mounted volumes is still allowed

Several of these options can also be set at the pod level (through the pod.spec.securityContext property). They serve as a default for all the pod's containers but can be overridden at the container level. The pod-level security context also allows you to set additional properties.

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: alpine
  name: alpine
spec:
  containers:
  - args:
    - /bin/sleep
    - "9999"
    image: alpine
    name: main
    resources: {}
    securityContext:
      readOnlyRootFilesystem: true
      runAsUser: 405                   # run as guest
      runAsNonRoot: true               # the container may only run as a non-root user; it will not start if the image is set up to run as root
      privileged: true                 # allow access to node's kernel
      capabilities:
        add:                           # capabilities can be added or dropped 
        - SYS_TIME                     # allow the container to change the node's date and time. Linux capabilities are usually prefixed with `CAP_` but in pod.spec you must leave out the prefix.
        drop:
        - CHOWN                        # deny this container the ability to change file ownership
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}

fsGroup and supplementalGroups

When you use the runAsUser property and have multiple containers in the pod that share volumes, you must use the fsGroup and supplementalGroups properties in pod.spec.securityContext. In such a setup, the volume is owned by the fsGroup ID, and files under that volume are owned by the runAsUser ID and the fsGroup ID. Files created by runAsUser in other locations (not in the volume) have their owner and group set to runAsUser and 0.

spec:
  securityContext:
    fsGroup: 555                                 # the fsGroup and supplementalGroups are defined in the security context at pod level
    supplementalGroups: [666,777]
  containers:
  - command: ["/bin/sleep", "99999"]
    image: alpine
    name: first
    resources: {}
    securityContext:
      runAsUser: 1111                            # the first container runs as user ID 1111
    volumeMounts:
    - name: shared-volume                        # both containers use the same volume
      readOnly: false
      mountPath: /volume
  - name: second
    image: alpine
    command: ["/bin/sleep", "99999"]
    securityContext:
      runAsUser: 2222                            # the second container runs as user ID 2222
    volumeMounts:
    - name: shared-volume
      readOnly: false
      mountPath: /volume
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  volumes:
  - name: shared-volume
    emptyDir: {}

HostNetwork

Certain pods (usually system pods) need to operate in the host's default namespaces, allowing them to see and manipulate node-level resources and devices. For example, a pod may need to use the node's network adapters instead of its own virtual network adapters. This can be achieved by setting the hostNetwork property in the pod spec to true. When the Kubernetes Control Plane components are deployed as pods (such as when you deploy your cluster with kubeadm), you will find that those pods use the hostNetwork option, effectively making them behave as if they weren't running inside a pod.

hostPort, hostPID and hostIPC

A related feature allows pods to bind to a port in the node's default namespace, but still have their own network namespace. This is done by using the hostPort property in one of the container's ports defined in the spec.containers.ports field. Similarly, hostPID and hostIPC are pod-level properties (set directly under spec) that make the pod use the host's PID and IPC namespaces.

spec:
  hostPID: true             # the pod uses the host's PID namespace (pod-level property)
  hostIPC: true             # the pod also uses the host's IPC namespace (pod-level property)
  containers:
  - image: edesibe/kubia
    name: kubia
    ports:
    - containerPort: 8080   # the container can be reached on port 8080 of the pod's IP
      hostPort: 9000        # it can also be reached on port 9000 of the node it's deployed on

Summary

  1. Pods are the smallest unit of scheduling in k8s
  2. They have one or more containers inside
  3. They are scheduled on nodes ( minions )
  4. They are defined in manifest files

list all pods

kubectl get pods -o <output> --sort-by <JSONPATH> --selector <selector> --field-selector=metadata.name=<POD_NAME>

where

  • -o: set output format
  • --sort-by: sort output using a JSONpath expression
  • --selector: filter results by label
  • --show-labels=true: list all labels
  • --field-selector: filter based on one or more resource fields

get all events for pod/busybox

k get events --field-selector=involvedObject.kind=Pod,involvedObject.name=busybox --sort-by .metadata.creationTimestamp

describe pod status

kubectl describe pods ${pod_name}

check pod phase for pod

k get pod -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.status.phase}{"\n"}{end}'
k get pod <POD_NAME> -o jsonpath='podName:{@.metadata.name} podPhase:{@.status.phase} containerStatus:{@.status.containerStatuses[*].state..reason}{"\n"}'

create new resource

k create -f <file>

create if not exists or update resource

k apply -f <file>

When you use the k apply command (not the case for k create/run), Kubernetes adds an annotation with the content of the last applied configuration, stored as resource.metadata.annotations.kubectl.kubernetes.io/last-applied-configuration. This field is used for comparison with the next versions of the object.

run command inside container

k exec -it <pod name> -c <container name> -- <command>

delete pod

kubectl delete pod ${pod_name}

delete all pods in current namespace

k delete pod --all

delete all resources in current namespace (pod,deployment,svc)

k delete all --all

Communication

Communication between pods on the same node goes through the node's local bridge; communication between pods on different nodes is handled by the CNI plugin. CNI is responsible for:

  • allowing communication between pods
  • IP management
  • encapsulating packets
  • mapping in userspace
Troubleshooting
`nsenter -t PID -n ip a` - execute `ip a` inside the pod's network namespace
`lsns -l -p PID` - list all namespaces for PID
`nsenter -t PID -a` - enter all of the pod's namespaces
/proc/{PID}/root - root filesystem of the pod
/proc/{PID}/mountinfo - what is mounted in the pod
ctr - client for containerd
nerdctl - Docker-compatible CLI for containerd
crictl - client for CRI

Manifests

Here is an example of manifest for pod as pod.yml

# APIVersion defines the versioned schema of this representation of an object.
apiVersion: v1
# Kind is a string value representing the REST resource this object represents ( Pod,Service,ReplicationController,Namespace,Node,... )
kind: Pod
# Standard object's metadata
metadata:
  name: busybox-sleep
  labels:
    zone: prod
    version: v1
# Specification of the desired behavior of the pod.
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"

SERVICE

An abstract way to expose an application running on a set of Pods as a network service. No need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives pods their own IP addresses and a single DNS name for a set of pods, and can load-balance across them. The set of Pods targeted by a Service is usually determined by a selector (label).

Behind the scenes, the routing logic is accomplished by kube-proxy configuring iptables rules. Kube-proxy reads the spec from the service manifest and then creates iptables rules to satisfy the desired state (i.e. the service spec). Most of the time the iptables statistic module in random mode is used for load balancing. When a service is created in the API server, the virtual IP address is assigned to it immediately. Soon afterward, the API server notifies all kube-proxy agents running on the worker nodes that a new Service has been created. Then, each kube-proxy makes that service addressable on the node it's running on. It does this by setting up a few iptables rules, which make sure each packet destined for the service IP/port pair is intercepted and its destination address modified, so the packet is redirected to one of the pods backing the service.

The service IP range is defined by the kube-apiserver flag --service-cluster-ip-range ipNet (default: 10.0.0.0/24).

Don’t forget that in each container, Kubernetes automatically exposes environment variables for each service in the same namespace. These environment variables are auto-injected configuration.

Headless

A headless service is created by setting clusterIP: None; it doesn't allocate a cluster IP from the cluster IP range. Normally, when you perform a DNS lookup for a service, the DNS server returns a single IP: the service's cluster IP. For a headless service it replies with the IPs of the pods behind it instead of a single service IP. In addition, no iptables rules are added for a headless svc.

check iptables rules for headless svc

sudo iptables -S -t nat | grep <headless-svc-name>

When you want the DNS lookup mechanism to find unready pods as well, you need to add svc.spec.publishNotReadyAddresses: true to the headless service manifest. By default only ready pods get resolved.
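A headless service sketch including that flag (service and label names follow the kubia examples in this section):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None                  # headless: DNS returns the pod IPs directly
  publishNotReadyAddresses: true   # also resolve pods that are not yet Ready
  selector:
    app: kubia
  ports:
  - port: 80
    targetPort: 8080
```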

The DNS name of a service is in the format <servicename>.<namespace>.svc.cluster-domain.example, while a pod can be reached via <pod-ip-with-dashes-instead-of-dots>.<namespace>.pod.cluster-domain.example. The default cluster domain is cluster.local. A Service's fully qualified domain name can be used to reach the service from within any Namespace in the cluster. However, Pods within the same Namespace can also simply use the service name instead of the FQDN. Pods from different namespaces need to specify <servicename>.<namespace> if they want to reach a service in a different namespace. The reason for this behavior is that K8s adds the following records by default to each pod:

bash-5.1# cat /etc/resolv.conf
search <namespace>.svc.cluster.local svc.cluster.local cluster.local

which can resolve all services in the same namespace, but for different namespaces you need to add the subdomain as well.

example of service.yaml

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-nodeport
  name: kubia-nodeport
spec:
  sessionAffinity: ClientIP        # if you want all requests made by a certain client to be redirected to the same pod every time (the default is None)
  ports:
  - name: http                     # when creating a service with multiple ports, you must specify a name for each port
    nodePort: 30080                # port on node ( which holds pods ).Used for NodePort and LoadBalancer services
    port: 80                       # port where service is listening
    protocol: TCP
    targetPort: 8080               # port on the pod/container. It must match the port the container listens on; it is the port used for traffic from the service to the target
  selector:
    app: kubia                     # selector for the pods; all pods with app=kubia will be part of this service
  type: NodePort
status:
  loadBalancer: {}
cloud_user@k8s-control:~$ k get svc kubia-nodeport
NAME             TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubia-nodeport   NodePort   10.104.167.121   <none>        80:30080/TCP   7m42s

So the kubia-nodeport service is accessible via:

  • 10.104.167.121:80
  • <ANY_NODE_IP>:30080

A label selector is also used to map pods to a service. The same method is used for RC, RS, Deployment and DS.

Named ports

You can give a name to each pod’s port and refer to it by name in the service spec.

pod manifest

apiVersion: v1
kind: Pod
spec:
  containers:
  - name: kubia
    ports:
    - name: http
      containerPort: 8080   # Container's port 8080 is called http
    - name: https
      containerPort: 8443   # Port 8443 is called https

svc manifest

apiVersion: v1
kind: Service
spec:
  ports:
  - name: http
    port: 80                # Port 80 is mapped to the container's port called http
    targetPort: http
  - name: https
    port: 443               # Port 443 is mapped to the container's port,whose name is https
    targetPort: https

When a pod is started, Kubernetes initializes a set of environment variables pointing to each service that exists at that moment. If you create the service before creating the client pods, processes in those pods can get the IP address and port of the service by inspecting their environment variables. Each service's IP and port are thus populated inside the pod's containers as <SVC_NAME>_SERVICE_HOST and <SVC_NAME>_SERVICE_PORT.

Endpoints

Endpoints are the backend entities to which Services route traffic. For a Service that routes traffic to multiple Pods, each Pod will have an endpoint associated with the Service. Pods are included as endpoints of a service if their labels match the service's pod selector. An Endpoints resource is a list of IP addresses and ports exposing a service. When a client connects to a service, the service proxy selects one of those IP and port pairs and redirects the incoming connection to the server listening at that location.

Services without selectors

If you create a service without a pod selector, Kubernetes won't even create the Endpoints resource (after all, without a selector, it can't know which pods to include in the service). It's up to you to create the Endpoints resource to specify the list of endpoints for the service.

Services most commonly abstract access to Kubernetes Pods, but they can also abstract other kinds of backends. For example:

  • You want to have an external database cluster in production, but in your test environment you use your own databases.
  • You want to point your Service to a Service in a different Namespace or on another cluster.
  • You are migrating a workload to Kubernetes. Whilst evaluating the approach, you run only a proportion of your backends in Kubernetes.

In any of these scenarios you can define a Service without a Pod selector. For example:

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376

Because this Service has no selector, the corresponding Endpoint object is not created automatically. You can manually map the Service to the network address and port where it’s running, by adding an Endpoint object manually:

apiVersion: v1
kind: Endpoints
metadata:
  name: my-service
subsets:
  - addresses:
      - ip: 192.0.2.42
    ports:
      - port: 9376

Publishing Services (ServiceTypes)

For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, that’s outside of your cluster. Kubernetes ServiceTypes allow you to specify what kind of Service you want. The default is ClusterIP.

Type values and their behaviors are:

Values:
ClusterIP - Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default ServiceType. Behind the scenes, iptables rules (statistic module, random mode) are set up on each node.
NodePort - Exposes the Service on each Node's IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You'll be able to contact the NodePort Service from outside the cluster by requesting <NodeIP>:<NodePort>. Behind the scenes the iptables rules are the same as for ClusterIP, plus additional rules for the defined port. The default NodePort range is 30000-32767.
LoadBalancer - A NodePort service with an additional infrastructure-provided load balancer. Exposes the Service externally using a cloud provider's load balancer. The NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created. On AWS, the LB redirects traffic to the NodePort service via the defined nodePort.
ExternalName - Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com) by returning a CNAME record with its value. No proxying of any kind is set up. No cluster IP is assigned, as this object is implemented solely at the DNS level; a simple CNAME DNS record is created for the service.
ClusterIP service

apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia

NodePort service

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-nodeport
  name: kubia-nodeport
spec:
  ports:
  - name: 80-8080
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: kubia
  type: NodePort
status:
  loadBalancer: {}

LoadBalancer service

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-loadbalancer
  name: kubia-loadbalancer
spec:
  ports:
  - name: kubia-http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: kubia
  type: LoadBalancer
status:
  loadBalancer: {}

ExternalName service

apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-external
  name: kubia-external
spec:
  externalName: kubia.example.com
  type: ExternalName
status:
  loadBalancer: {}

Headless service

apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia

You can also use Ingress to expose your Service. Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource, as it can expose multiple services under the same IP address. It operates at the HTTP level (network layer 7) and can thus offer more features than layer 4 services can.

Each service inside the cluster can be found via FQDN <servicename>.<namespace>.svc.cluster.local.

When an external client connects to a service through the node port (this also includes cases when it goes through the load balancer first), the randomly chosen pod may or may not be running on the same node that received the connection. Reaching a pod on a different node requires an additional network hop, which may not always be desirable. You can prevent this additional hop by configuring the service to redirect external traffic only to pods running on the node that received the connection via svc.spec.externalTrafficPolicy: Local.
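A NodePort service sketch with that setting (names follow the kubia examples in this section):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: kubia-local
spec:
  type: NodePort
  externalTrafficPolicy: Local   # route external traffic only to pods on the receiving node; also preserves the client source IP
  selector:
    app: kubia
  ports:
  - port: 80
    targetPort: 8080
```

Note that with Local, nodes without a matching pod drop the traffic, so load may be spread unevenly across pods.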

INGRESS

An Ingress is a Kubernetes object that manages external access to Services in the cluster. An Ingress is capable of providing more functionality than a simple NodePort Service, such as SSL termination, advanced load balancing, name-based virtual hosting, cookie-based session affinity, …

Each LoadBalancer service requires its own load balancer with its own public IP address, whereas an Ingress only requires one, even when providing access to dozens of services. When a client sends an HTTP request to the Ingress, the host and path in the request determine which service the request is forwarded to.

Ingress Controllers

Ingress objects actually do nothing by themselves. In order for Ingresses to do anything, you must install one or more Ingress controllers. Unlike other types of controllers, which run as part of the kube-controller-manager binary, Ingress controllers are not started automatically with a cluster. Ingresses define a set of routing rules. A routing rule's properties determine which requests it applies to. Each rule has a set of paths, each with a backend. Requests matching a path will be routed to its associated backend. An Ingress controller runs a reverse proxy server (like nginx) and keeps it configured according to the Ingress, Service, and Endpoints resources defined in the cluster. The controller thus needs to observe those resources (again, through the watch mechanism) and change the proxy server's config every time one of them changes. Although the Ingress resource's definition points to a Service, Ingress controllers forward traffic to the service's pods directly instead of going through the service IP.

Kubernetes nginx ingress controller on AWS

Here, the ingress controller is defined as a service (ingress-nginx-controller) and a deployment (ingress-nginx-controller) which uses nginx as a proxy. When a user creates an Ingress resource, the controller configures the deployment's pods by populating the related nginx.conf. The ingress-nginx-controller service is created as a LoadBalancer service inside k8s, backed by a CLB on AWS. This is the main entry point for external requests into AWS and then into the K8s cluster. Every external request targets this CLB and then enters the k8s cluster via the LoadBalancer service. From there, the nginx pods route traffic to the related services based on the Ingress rules.

Ingress as example

❯ k get ing,svc,pod -o wide
NAME                              CLASS   HOSTS                    ADDRESS                                                                  PORTS   AGE
ingress.networking.k8s.io/kubia   nginx   kubia.dectech-labs.com   ab01eafaaf542430fb1db2cdfd7b2439-426635961.us-west-1.elb.amazonaws.com   80      130m

NAME                 TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE   SELECTOR
service/kubernetes   ClusterIP   100.64.0.1       <none>        443/TCP   21h   <none>
service/kubia        ClusterIP   100.64.236.166   <none>        80/TCP    21h   app=kubia

NAME                 READY   STATUS    RESTARTS        AGE   IP                NODE                  NOMINATED NODE   READINESS GATES
pod/kubia-rc-cfj7t   1/1     Running   1 (3h32m ago)   21h   100.110.199.143   i-09fd9fa86a2e8b006   <none>           <none>
pod/kubia-rc-krjzt   1/1     Running   1 (3h32m ago)   21h   100.126.23.208    i-07a449d2951b3988b   <none>           <none>
❯ k -n ingress-nginx get svc,pod -o wide
NAME                                         TYPE           CLUSTER-IP       EXTERNAL-IP                                                              PORT(S)                      AGE   SELECTOR
service/ingress-nginx-controller             LoadBalancer   100.67.113.134   ab01eafaaf542430fb1db2cdfd7b2439-426635961.us-west-1.elb.amazonaws.com   80:31463/TCP,443:30329/TCP   21h   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-admission   ClusterIP      100.71.76.121    <none>                                                                   443/TCP                      21h   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                           READY   STATUS      RESTARTS        AGE    IP                NODE                  NOMINATED NODE   READINESS GATES
pod/ingress-nginx-admission-create-nm2fk       0/1     Completed   0               21h    100.126.23.201    i-07a449d2951b3988b   <none>           <none>
pod/ingress-nginx-admission-patch-f7x8g        0/1     Completed   0               21h    100.110.199.137   i-09fd9fa86a2e8b006   <none>           <none>
pod/ingress-nginx-controller-b4fcbcc8f-g5nl5   1/1     Running     0               128m   100.126.23.209    i-07a449d2951b3988b   <none>           <none>
pod/ingress-nginx-controller-b4fcbcc8f-hwkpn   1/1     Running     1 (3h33m ago)   21h    100.110.199.141   i-09fd9fa86a2e8b006   <none>           <none>

Ingress picture

When a client sends a request to kubia.dectech-labs.com it first performs a DNS lookup for that FQDN. The DNS server replies with the IP of the Ingress controller. The client then sends an HTTP request to the Ingress controller and specifies kubia.dectech-labs.com in the Host header. From that header, the controller determines which service the client is trying to access, looks up the pod IPs through the Endpoints objects associated with the service, and forwards the client's request to one of the pods. Thus, the Ingress controller doesn't forward the request to the service. It only uses the service to select a pod. Most, if not all, controllers work like this.

ingress manifest example

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /      # rewrite incoming url to /
    nginx.ingress.kubernetes.io/ssl-redirect: "false"  # allow access to ingress via http as well
spec:
  ingressClassName: nginx             # defined by an IngressClass k8s resource
  rules:
  - http:
      paths:
      - path: /somepath               # one defined path
        pathType: Prefix
        backend:
          service:                    
            name: my-service          # which service and port will be used; cannot use a Service from a different namespace
            port:
              number: 80
      - path: /somepath2              # some other path
        pathType: Prefix
        backend:
          service:                    
            name: my-service2         # which service will be used and port
            port:
              number: 80

You can define multiple paths for the same host, or even multiple hosts.

The Ingress rules must reside in the same namespace as the apps (and Services) they configure.

multiple hosts in ingress

spec:
  ingressClassName: nginx
  rules:
  - host: kubia.dectech-labs.com
    http:
      paths:
      - backend:
          service:
            name: kubia
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: kubia1.dectech-labs.com
    http:
      paths:
      - backend:
          service:
            name: kubia
            port:
              number: 80
        path: /
        pathType: Prefix

ingress example with a service with a named port

apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: Myapp
  ports:
    - name: web # named port
      protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /somepath
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              name: web # specifying port from service as a name instead of a port

create an ingress

k create ingress my-ingress --class=nginx --rule="/somepath*=my-deployment:80" --dry-run=client -o yaml

TLS

When a client opens a TLS connection to an Ingress controller, the controller terminates the TLS connection. The communication between the client and the controller is encrypted, whereas the communication between the controller and the backend pod isn't. To enable the controller to do that, you need to attach a certificate and a private key to the Ingress. The two need to be stored in a Kubernetes Secret, which is then referenced in the Ingress manifest.

Creating a self-signed cert and attaching it to the ingress

# generate a self-signed certificate and private key
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout kubia.dectech-labs.com.key -out kubia.dectech-labs.com.cert -subj "/CN=kubia.dectech-labs.com"

# add crt and key as secret
k create secret tls kubia.dectech-labs.com-tls --cert=kubia.dectech-labs.com.cert --key=kubia.dectech-labs.com.key

# update ingress manifest to match

spec:
  tls:
  - hosts:
    - kubia.dectech-labs.com               # TLS connections will be accepted for this domain
    secretName: kubia.dectech-labs.com-tls # private key and cert are obtained from this secret (created above)
  rules:
  - host: kubia.dectech-labs.com           

ingress + tls cert

https://docs.bitnami.com/tutorials/secure-kubernetes-services-with-ingress-tls-letsencrypt/

REPLICATIONCONTROLLER

A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.

If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this reason, you should use a ReplicationController even if your application requires only a single pod.

A ReplicationController is similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises multiple pods across multiple nodes. ReplicationController is often abbreviated to “rc” or “rcs” in discussion, and as a shortcut in kubectl commands. A simple use case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use case is to run several identical replicas of a replicated service, such as web servers.

Most important parts in rc definition are:

  • rc.spec.selector - key/value pairs used to match pods based on labels; can be omitted if labels are defined under rc.spec.template.metadata.labels
  • rc.spec.template.metadata.labels - should be the same as the rc.spec.selector key/value pairs
  • rc.spec.replicas - number of replicas
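
Since rc.spec.replicas is just a field on the object, it can also be changed imperatively; a quick sketch (rc name is illustrative):

k scale rc my-rc --replicas=5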

Pods created by an RC aren't tied to the RC in any way. At any moment, an RC manages the pods that match its label selector. By changing a pod's labels, it can be removed from or added to the scope of an RC. Although a pod isn't tied to an RC, the pod does reference it in the metadata.ownerReferences field, which you can use to easily find which RC a pod belongs to. If you want to remove a pod from an RC, you need to change or remove its selector label; adding new labels has no effect. Changing an RC's pod template has no effect on existing pods. To propagate a modification of the RC's template, you have to replace the old pods.
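
For example, removing a pod from the RC's scope by overwriting its selector label, and inspecting a pod's owner, might look like this (pod name and label value are illustrative):

k label pod my-rc-xvxbs app=debug --overwrite                              # the RC sees one pod fewer and creates a replacement
k get pod my-rc-xvxbs -o jsonpath='{.metadata.ownerReferences[0].name}'    # which controller owns this pod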

delete rc and leave pod

k delete rc my-rc --cascade=orphan

list all rc

kubectl get rc -o wide

describe rc status

kubectl describe rc ${rc_name}

RC manifest

Here is an example of an rc manifest, rc.yml

apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 4
  selector:                                # this section is optional; if omitted, the rc will use the rc.spec.template.metadata.labels definition
    app: nginx                             # the rc manages pods matching this key/value pair
  template:
    # same definition as with pod.yml
    metadata:
      name: nginx
      labels:
        app: nginx                         # must be the same as rc.spec.selector as it is managed by rc
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80

RC and RS difference

$ git diff rc.yaml rs.yaml
-apiVersion: v1
-kind: ReplicationController
+apiVersion: apps/v1
+kind: ReplicaSet
 metadata:
-  name: kubia-rc
+  name: kubia-rs
 spec:
   replicas: 3
   selector:
-    app: kubia
+    matchLabels:                  # RS equivalent for RC settings
+      app: kubia
   template:
     metadata:
       name: kubia

RC and RS do the same thing: both ensure the desired number of replicas of a pod. The only difference is that with an rc you can only define selectors via equality-based requirements:

selector:
  component: redis

while with rs you can use set-based requirements selector as well:

selector:
  matchLabels:
    component: redis
  matchExpressions:
    - {key: tier, operator: In, values: [cache]}
    - {key: environment, operator: NotIn, values: [dev]}

DAEMONSETS

Automatically runs a copy of a Pod on each node. DaemonSets will run a copy of the Pod on new nodes as they are added to the cluster. A DS deploys pods to all nodes in the cluster, unless you specify that the pods should only run on a subset of the nodes. This is done by specifying the nodeSelector property in the pod template, which is part of the DS definition (the same applies to RS and RC). A DS will deploy pods even to unschedulable nodes, because the unschedulable attribute is only used by the Scheduler, whereas pods managed by a DS bypass the Scheduler completely.

DaemonSets and Scheduling

DaemonSets respect normal scheduling rules around node labels, taints, and tolerations. If a pod would not normally be scheduled on a node, a DaemonSet will not create a copy of the Pod on that node.

an example of daemonset

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: my-daemonset
spec:
  selector:
    matchLabels:
      app: my-daemonset
  template:
    metadata:
      labels:
        app: my-daemonset
    spec:
      containers:
      - name: nginx
        image: nginx
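
To restrict the DaemonSet to a subset of nodes, add a nodeSelector to its pod template. A minimal sketch, assuming the target nodes carry a hypothetical disk=ssd label:

  template:
    metadata:
      labels:
        app: my-daemonset
    spec:
      nodeSelector:                # only nodes labeled disk=ssd get a copy of the pod
        disk: ssd
      containers:
      - name: nginx
        image: nginx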

JOBS

This resource type allows you to run a pod whose container isn't restarted when the process running inside finishes successfully. Once it does, the pod is considered complete. In the event of a node failure, the pods on that node that are managed by a Job will be rescheduled to other nodes, the same way ReplicaSet pods are. In the event of a failure of the process itself (when the process returns an error exit code), the Job can be configured to either restart the container or not. Jobs are useful for ad hoc tasks where it's crucial that the task finishes properly. You could run the task in an unmanaged pod and wait for it to finish, but in the event of a node failing or the pod being evicted from the node while performing its task, you'd need to manually recreate it. Doing this manually doesn't make sense - especially if the job takes hours to complete.

example of a job

apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job
spec:
  activeDeadlineSeconds: 110      # if the pod runs longer than that, the system will try to terminate it and will mark the Job as failed
  backoffLimit: 5                 # how many times a Job can be retried before it is marked as failed, default 6
  completions: 5                  # if you need a job to run more than once
  parallelism: 2                  # up to 2 pods can run in parallel
  ttlSecondsAfterFinished: 5      # limits the lifetime of a Job that has finished execution (either Complete or Failed). If this field is set, the Job is eligible to be automatically deleted ttlSecondsAfterFinished seconds after it finishes
  template:
    metadata:
      labels:
        app: batch-job
    spec:
      restartPolicy: OnFailure    # jobs can't use the default restart policy, which is Always
      containers:
      - name: main
        image: luksa/batch-job

When a Job's pod completes its processing, it is not deleted, which allows you to examine its logs. The pod is deleted when you delete it or the Job that created it. Jobs may be configured to create more than one pod instance and run them in parallel or sequentially. This is done by setting the completions and parallelism properties in the Job spec.
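
A Job can also be generated imperatively; a sketch (image and command are illustrative):

k create job my-job --image=busybox --dry-run=client -o yaml -- sh -c 'echo done'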

CRONJOB

A cron job in Kubernetes is configured by creating a CronJob resource. The schedule for running the job is specified in the cron format. At the configured time, K8s will create a Job resource according to the Job template configured in the CronJob object. When the Job resource is created, one or more pod replicas will be created and started according to the Job's pod template.

example of cronjob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: batch-cronjob
spec:
  schedule: "0,15,30,45 * * * *"             # it will run every 15 minutes
  startingDeadlineSeconds: 5                 # the Job must start at the latest this many seconds after the scheduled time; otherwise the run counts as missed
  jobTemplate:                               # the template for the Job resources that will be created by this CronJob
    spec:
      completions: 5
      parallelism: 2
      template:
        metadata:
          labels:
            app: batch-cronjob
        spec:
          restartPolicy: OnFailure
          containers:
          - name: main
            image: luksa/batch-job

Job resources will be created from the CronJob resource at approximately the scheduled time. The Job then creates the pod.
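
The equivalent imperative command for a CronJob (image, schedule, and command are illustrative):

k create cronjob my-cron --image=busybox --schedule="*/15 * * * *" --dry-run=client -o yaml -- sh -c 'date'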

STORAGE

Storage volumes aren't top-level resources like pods; they are instead defined as part of a pod and share the same lifecycle as the pod. This means a volume is created when the pod is started and destroyed when the pod is deleted. Because of this, a volume's content persists across container restarts. After a container is restarted, the new container can see all the files that were written to the volume by the previous container. If a pod contains multiple containers, the volume can be used by all of them at once.

Volumes are defined in the pod's manifest - much like containers. A volume is available to all containers in the pod, but it must be mounted in each container that needs to access it. In each container, you can mount the volume in any location of its filesystem. It's not enough to define a volume in the pod; you also need to define a volumeMount inside the container's spec if you want the container to be able to access it.

The container file system is ephemeral.Files on the container’s file system exist only as long as the container exists.If a container is deleted or re-created in K8s,data stored on the container file system is lost. Volumes allow you to store data outside the container file system while allowing the container to access the data at runtime. Persistent Volumes are a slightly more advanced form of Volume.They allow you to treat storage as an abstract resource and consume it using your Pods.

Volume Types

Both Volumes and Persistent Volumes have a volume type. The volume type determines how the storage is actually handled. Various volume types support storage methods such as:

  • emptyDir - a simple empty directory used for storing transient data
  • hostPath - used for mounting directories from the worker node's filesystem into the pod. Use hostPath volumes only if you need to read or write system files on the node. Never use them to persist data across pods.
  • gitRepo (deprecated in recent Kubernetes versions) - a volume initialized by checking out the contents of a git repo. It is basically an emptyDir volume that gets populated by cloning a Git repo and checking out a specific revision when the pod is starting up (but before its containers are created)
  • nfs - an NFS share mounted into the pod
  • gcePersistentDisk, awsElasticBlockStore, azureDisk - used for mounting cloud provider-specific storage
  • configMap, secret - special types of volumes used to expose certain Kubernetes resources
  • persistentVolumeClaim - a way to use pre- or dynamically provisioned persistent storage

emptyDir volume example

spec:
  containers:
  - image: luksa/fortune
    name: html-generator
    resources: {}
    volumeMounts:
    - name: html
      mountPath: /var/htdocs
  - image: nginx:alpine
    name: web-server
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
    ports:
    - containerPort: 80
      protocol: TCP
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: html
    emptyDir:                              # by default `emptyDir: {}`; the volume is created on the worker node's disk
      medium: Memory                       # create as a tmpfs filesystem (in memory instead of on disk)

gitRepo volume example

spec:
  containers:
  - image: nginx:alpine
    name: gitrepo-volume-pod
    ports:
    - containerPort: 80
    resources: {}
    volumeMounts:
    - name: html
      mountPath: /usr/share/nginx/html
      readOnly: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: html
    gitRepo:
      revision: master                                                    # branch name
      repository: https://github.com/edesibe/kubia-website-example.git    # repo which will be cloned
      directory: .                                                        # clone into the root of the volume

git-sync sidecar example

spec:
  containers:
  - image: openweb/git-sync:0.0.1                                   # container which will sync git repo
    name: git-sync
    env:
    - name: GIT_SYNC_REPO
      value: https://github.com/edesibe/kubia-website-example.git
    - name: GIT_SYNC_DEST
      value: /tmp/git
    - name: GIT_SYNC_BRANCH
      value: master
    - name: GIT_SYNC_REV
      value: FETCH_HEAD
    - name: GIT_SYNC_WAIT
      value: "10"
    volumeMounts:
    - name: shared                                                  # syncing folder is declared as shared volume
      mountPath: /tmp/git
  - image: nginx:alpine
    name: gitrepo-volume-pod
    ports:
    - containerPort: 80
    resources: {}
    volumeMounts:
    - name: shared                                                  # nginx is using shared volume as root 
      mountPath: /usr/share/nginx/html
      readOnly: true
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: shared
    emptyDir: {}

awsElasticBlockStore volume example

spec:
  containers:
  - image: mongo
    name: mongodb
    volumeMounts:
    - name: mongodb-data                   # name of the volume, same as pod.spec.volumes.name
      mountPath: /data/db                  # the path where mongodb stores its data
    ports:
    - containerPort: 27017
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: mongodb-data
    awsElasticBlockStore:                  # volume type and related volumeID which was created manually
      fsType: ext4
      volumeID: "vol-03af4682f86e4a74f"

Volumes and volumeMounts

Regular Volumes can be set up relatively easily within a Pod/container specification. You can use volumeMounts to mount the same volume into multiple containers within the same Pod. This is a powerful way to have multiple containers interact with one another. For example, you could create a secondary sidecar container that processes or transforms output from another container.

PV and PVC manual

create a pod with volume

k run volume-pod --image=busybox --overrides='{ "apiVersion": "v1", "spec": { "containers":[{"name":"volume-pod","image":"busybox","command":["sh","-c","sleep 3600"],"volumeMounts":[{"name":"my-volume","mountPath":"/output"}]}],"volumes":[{"name":"my-volume","hostPath":{"path":"/data"}}]}}'

which will be represented as

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: volume-pod
  name: volume-pod
spec:
  containers:
  - command:
    - sh
    - -c
    - sleep 3600
    image: busybox
    name: volume-pod
    resources: {}
    volumeMounts:             # in the container spec, these reference the volumes in the Pod spec and provide a mountPath (the location on the file system where the container process will access the volume data)
    - mountPath: /output
      name: my-volume
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:                    # in the pod spec, these specify the storage volumes available to the pod; they specify the volume type and other metadata
  - hostPath:
      path: /data
    name: my-volume
status: {}

PERSISTENT VOLUME

Ideally, a developer deploying their apps on Kubernetes should never have to know what kind of storage technology is used underneath, the same way they don't have to know what type of physical servers are being used to run their pods. Infrastructure-related dealings should be the sole domain of the cluster administrator. When a developer needs a certain amount of persistent storage for their application, they can request it from K8s, the same way they can request CPU, memory, and other resources when creating a pod.

To enable apps to request storage in a K8s cluster without having to deal with infrastructure specifics, two new resources were introduced: PersistentVolume and PersistentVolumeClaim.

A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. A PersistentVolume is non-namespaced; it is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system. A PV can be created in two ways: static (manually) or dynamic (via a storage class). A PersistentVolume uses a set of attributes to describe the underlying storage resources (such as a disk or cloud storage location) which will be used to store data.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: redis-pv
spec:
  storageClassName: localdisk            # associates this PV with a StorageClass; if omitted, the PV has no class and can only bind to PVCs that request no particular class
  capacity:
    storage: 1Gi                         # define capacity in M,G ...
  persistentVolumeReclaimPolicy: Retain  # determines how the storage resource is reused after the PersistentVolume's associated PersistentVolumeClaim is deleted. This setting can be updated via `k patch pv <pv_name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'`
  accessModes:
    - ReadWriteOnce                      # can be mounted on only one node with RW
    - ReadOnlyMany                       # can be mounted on multiple nodes but as Read only
  hostPath:
    path: "/mnt/data"

The accessModes defined on a PVC must match at least one of the accessModes of the PV.

Storage Classes

A Storage Class allows K8s administrators to specify the types of storage services they offer on their platform. Instead of creating PVs, one can deploy a PV provisioner and define one or more SC objects to let users choose what type of PV they want. Users can refer to the SC in the PVC, and the provisioner will take it into account when provisioning the persistent storage (the PV will be created automatically). In addition, for cloud-based clusters, cloud volumes will be created. The default storage class is what's used to dynamically provision a PV if the PVC doesn't explicitly say which storage class to use.
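
A cluster admin can mark an existing class as the default by setting an annotation on it; a sketch (the class name is illustrative):

k patch storageclass localdisk -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
k get storageclass          # the default class is flagged as "(default)" in the output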

StorageClass describes the parameters for a class of storage for which PersistentVolumes can be dynamically provisioned. StorageClasses are non-namespaced; the name of the storage class according to etcd is in ObjectMeta.Name. They are declarative, while a PV is imperative. When creating an SC, you can choose one of the following reclaim policies:

Reclaim policies:
Retain - keeps all data. This requires an administrator to manually clean up the data and prepare the storage resource for reuse; in other words, recreate the PV (delete and create it)
Recycle - obsolete. Automatically deletes all data in the underlying storage resource, allowing the PersistentVolume to be reused.
Delete - (default) deletes the underlying storage resource automatically (only works for cloud storage resources)

yaml for creating localdisk storageclass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: localdisk
provisioner: kubernetes.io/no-provisioner         # the volume plugin to use for provisioning the PV
allowVolumeExpansion: true                        # set to false by default; determines whether the StorageClass supports resizing volumes after they are created
volumeBindingMode: WaitForFirstConsumer           # `WaitForFirstConsumer` delays binding until a pod uses the PVC; `Immediate` binds the PVC to a PV as soon as possible

deleting pv

kubectl delete pv <pv_name> --grace-period=0 --force

delete pod without waiting for the Kubelet to confirm that the pod is no longer running

k delete pod kubia-0 --force --grace-period 0

And then deleting the finalizer using:

kubectl patch pv <pv_name> -p '{"metadata": {"finalizers": null}}'

Local volumes (provisioner: kubernetes.io/no-provisioner) don't support dynamic provisioning. One can use https://github.com/rancher/local-path-provisioner, which supports dynamic provisioning for local volumes.

PERSISTENT VOLUME CLAIM

A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and memory); claims can request specific sizes and access modes (e.g., they can be mounted once read/write or many times read-only). When a PersistentVolumeClaim is created, it will look for a PersistentVolume that is able to meet the requested criteria. If it finds one, it will automatically be bound to the PersistentVolume. If no matching PersistentVolume is found, the PVC stays in the Pending state, as it cannot be bound to any PersistentVolume. A PVC-to-PV binding is a one-to-one mapping.
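
You can observe the binding state from both sides (resource names are illustrative):

k get pvc my-pvc        # STATUS shows Pending until a matching PV is found, then Bound
k get pv                # the CLAIM column shows which PVC each PV is bound to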

AccessModes options:
RWO (ReadWriteOnce) - one node can mount the volume for read/write
RWX (ReadWriteMany) - multiple nodes can mount the volume for read and write
ROX (ReadOnlyMany) - multiple nodes can mount the volume for read only

WARNING: These modes apply to the NODE, not the POD.

pvc example

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-pvc
  namespace: dev                     # a PVC is namespaced; it must be in the same namespace as the pod that uses it
spec:
  storageClassName: localdisk        # use storageClassName: "" if you want to bind the PVC to a pre-provisioned PV with no class
  volumeName: my-pv                  # if you want to explicitly bind the PVC to a PV. This field is optional; if omitted, the PVC will bind to any matching PV
  accessModes:
    - ReadWriteOnce
  resources:
    requests:                        # it seems that only requests are evaluated as the matching criteria for binding to a PV
      storage: 100Mi                 # you can expand PersistentVolumeClaims without interrupting applications that use them: simply edit the spec.resources.requests.storage attribute of an existing PersistentVolumeClaim, increasing the value.
                                     # however, the StorageClass must support resizing volumes and must have allowVolumeExpansion set to true.

pod usage of pv and pvc

apiVersion: v1
kind: Pod
metadata:
  name: redispod
spec:
  volumes:
    - name: redis-data
      persistentVolumeClaim:
        claimName: my-pvc            # PersistentVolumeClaims can be mounted into a Pod's containers just like any other volume
  containers:
    - name: redisdb
      image: redis
      ports:
        - containerPort: 6379
          name: "redis"
          protocol: TCP
      volumeMounts:
        - mountPath: /data
          name: redis-data           # if the PersistentVolumeClaim is bound to a PersistentVolume, the containers will use the underlying PersistentVolume storage

extending pvc

k patch pvc my-pvc --patch '{"spec":{"resources":{"requests": {"storage": "200Mi"}}}}'

PV and PVC dynamic

To summarize, the best way to attach persistent storage to a pod is to create only the PVC (with an explicitly specified storageClassName if necessary) and the pod (which refers to the PVC by name). Everything else is taken care of by the dynamic PersistentVolume provisioner.
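
A minimal sketch of that approach, assuming the cluster has a StorageClass named standard backed by a working dynamic provisioner (the class and PVC names are illustrative):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  storageClassName: standard     # hypothetical class; the provisioner creates a matching PV automatically
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

A pod then references data-pvc under pod.spec.volumes.persistentVolumeClaim.claimName, as in the redispod example above.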

NETWORKPOLICIES

Network policies allow you to specify which pods can talk to other pods. This helps when securing communication between pods, allowing you to define ingress and egress rules. A NetworkPolicy applies to pods that match its label selector and specifies either which sources can access the matched pods or which destinations can be accessed from the matched pods. You can even choose a CIDR block range to apply the network policy to.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: test-network-policy
  namespace: default
spec:
  podSelector:                        # empty podSelector (`podSelector: {}`) matches all pods in the same namespace
    matchLabels:
      role: db                        # applies to all pods with label role=db
  policyTypes:
  - Ingress                           # this policy has both ingress and egress rules. If no policyTypes are specified, Ingress is always set, and Egress is set only if the NetworkPolicy has any egress rules.
  - Egress
  ingress:                            # Ingress must be defined in netpol.spec.policyTypes
  - from:
    - ipBlock:                        # this ingress rule only allows traffic from clients in the 172.17.0.0/16 IP block except 172.17.1.0/24
        cidr: 172.17.0.0/16
        except:
        - 172.17.1.0/24
    - namespaceSelector:              # allow traffic from namespaces labeled project=myproject
        matchLabels:
          project: myproject
    - podSelector:
        matchLabels:
          role: frontend              # allows incoming connections only from pods with the role=frontend label
    ports:                            # connections are allowed only on port 6379
    - protocol: TCP
      port: 6379
  egress:                             # limits the pod's outbound traffic
  - to:
    - ipBlock:
        cidr: 10.0.0.0/24
    ports:
    - protocol: TCP
      port: 5978

Solutions that support Network Policies: kube-router, Calico, Romana, Weave Net. Solutions that DON'T support Network Policies: Flannel.

Client pods usually connect to server pods through a Service instead of directly to the pod, but that doesn't change anything: the NetworkPolicy is enforced when connecting through a Service as well.

In a multi-tenant Kubernetes cluster,tenants usually can’t add labels (or annotations) to their namespaces themselves.If they could,they’d be able to circumvent the namespaceSelector-based ingress rules.

  • An empty selector will match everything. For example spec.podSelector: {} will apply the policy to all pods in the current namespace.
  • Selectors can only select Pods that are in the same namespace as the NetworkPolicy. E.g. the podSelector of an ingress rule can only select pods in the namespace the NetworkPolicy is deployed to.
  • If no NetworkPolicies targets a pod, all traffic to and from the pod is allowed. In other words all traffic are allowed until a policy is applied.
  • There are no deny rules in NetworkPolicies. NetworkPolicies are deny by default, allow explicitly. It’s the same as saying “If you’re not on the list you can’t get in.”
  • If a NetworkPolicy matches a pod but has a null rule, all traffic is blocked. An example of this is a “deny all traffic” policy:
spec:
  podSelector: {}
  ingress: []              # empty rule list means block all traffic

or

spec:
  podSelector: {}
  policyTypes:             # no ingress rules means block all traffic
  - Ingress
  • Rules are chained together. NetworkPolicies are additive. If multiple NetworkPolicies select a pod, their union is evaluated and applied to that pod. If there is at least one NetworkPolicy with a rule allowing the traffic, the traffic will be routed to the pod regardless of other policies blocking it.
  • ALWAYS check k describe netpol NETWORK_POLICY_NAME for outcome

Using Networking Policies

A K8s NetworkPolicy is an object that allows you to control the flow of network communication to and from Pods. This allows you to build a more secure cluster network by keeping Pods isolated from traffic they do not need. Network policies are implemented by the network plugin (via iptables for Calico). To use network policies, you must be using a networking solution that supports NetworkPolicy.

Pod Selector

podSelector - determines to which Pods in the namespace the NetworkPolicy applies. The podSelector can select Pods using Pod labels. If this field is set to {}, the policy is applied to all pods in the namespace.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: my-network-policy
spec:
  podSelector:
    matchLabels:
      role: db

By default,Pods are considered non-isolated and completely open to all communication.If any NetworkPolicy selects a Pod,the Pod is considered isolated and will only be open to traffic allowed by NetworkPolicies.
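
Conversely, one empty ingress rule re-opens an isolated pod to all incoming traffic; a minimal sketch:

spec:
  podSelector: {}
  ingress:
  - {}                     # a single empty rule matches all sources: allow all incoming traffic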

PolicyTypes

A NetworkPolicy can apply to Ingress (incoming network traffic coming into the Pod), Egress (outgoing network traffic leaving the Pod), or both.

From and to Selectors

Options:
from selector - selects ingress (incoming) traffic that will be allowed
to selector - selects egress (outgoing) traffic that will be allowed
spec:
  ingress:
    - from:
      ...
  egress:
    - to:
      ...

We can use the following rules:
podSelector - selects Pods to allow traffic from/to
namespaceSelector - selects namespaces based on defined labels to allow traffic from/to
ipBlock - selects an IP range to allow traffic from/to

podSelector example

spec:
  ingress:
    - from:
      - podSelector:
          matchLabels:
            app: db

namespaceSelector example

spec:
  ingress:
    - from:
      - namespaceSelector:
          matchLabels:
            app: db

ipBlock example

spec:
  ingress:
    - from:
      - ipBlock:
          cidr: 172.17.0.0/16

Ports

ports - specifies one or more ports on which traffic will be allowed. Note that inside a rule, ports is a sibling of from/to, not nested under it.

port example

spec:
  ingress:
    - ports:                 # a rule with only ports allows traffic from anywhere on these ports
      - protocol: TCP
        port: 80

Traffic is only allowed if it matches both an allowed port and one of the from/to rules.
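Putting the pieces together, a single rule can combine from selectors with ports; a sketch (the labels, CIDR, and port are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-db-clients
spec:
  podSelector:
    matchLabels:
      role: db
  policyTypes:
  - Ingress
  ingress:
  - from:                       # traffic must match one of these sources...
    - podSelector:
        matchLabels:
          app: backend
    - ipBlock:
        cidr: 172.17.0.0/16
    ports:                      # ...AND one of these ports
    - protocol: TCP
      port: 5432
```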

REPLICASET

ReplicaSet ensures that a specified number of pod replicas are running at any given time. The difference between a RS and a RC is that the RS has more expressive options for label selectors.

rs example

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: kubia-rs
spec:
  replicas: 3                     # mandatory,set number of replications
  selector:                       # mandatory field, used to filter which pods will be managed by this rs
    matchLabels:
      app: kubia
  template:
    metadata:
      name: kubia
      labels:                     # defining labels which will be used for pods,it should be same as rs.spec.selector field
        app: kubia
    spec:
      containers:
      - name: kubia
        image: edesibe/kubia
        ports:
        - containerPort: 8080

or

...
 selector:
    matchExpressions:             # instead of matchLabels one can define key,operator,values expressions
    - key: app
      operator: In                # can be In,NotIn,Exists,DoesNotExist
      values:
        - kubia
...

If you specify multiple expressions, all those expressions must evaluate to true for the selector to match a pod. If you specify both matchLabels and matchExpressions, all the labels must match and all the expressions must evaluate to true for the pod to match the selector.
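For instance, both selector styles can be combined in one ReplicaSet; a sketch (the env key and its values are made up for illustration):

```yaml
spec:
  selector:
    matchLabels:
      app: kubia                # pod must carry app=kubia...
    matchExpressions:
    - key: env
      operator: In              # ...AND have env set to one of these values
      values:
      - prod
      - staging
```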

STATEFULSET

Pod replicas managed by a ReplicaSet or ReplicationController are much like cattle. Because they're mostly stateless, they can be replaced with a completely new pod replica at any time. Stateful pods require a different approach. When a stateful pod instance dies (or the node it's running on fails), the pod instance needs to be resurrected on another node, but the new instance needs to get the same name, network identity, and state as the one it's replacing. This is what happens when the pods are managed through a StatefulSet. It also allows you to easily scale the number of pets up and down. A StatefulSet has a desired replica count field that determines how many pets you want running at that time. Pods are created from a pod template specified as part of the StatefulSet, but they aren't exact replicas of each other. Each can have its own set of volumes (storage), which differentiates it from its peers. Pet pods also have a predictable (and stable) identity instead of each new pod instance getting a completely random one. Each pod created by a StatefulSet is assigned an ordinal index (zero-based), which is then used to derive the pod's name and hostname, and to attach stable storage to the pod.

A StatefulSet requires you to create a governing headless Service that's used to provide the actual network identity to each pod. Through this Service, each pod gets its own DNS entry, so its peers and possibly other clients in the cluster can address the pod by its hostname. For example, if the governing Service belongs to the default namespace and is called foo, and one of the pods is called a-0, you can reach the pod through its fully qualified domain name, which is a-0.foo.default.svc.cluster.local. Additionally, you can use DNS to look up all the StatefulSet's pods' names by looking up SRV records for the foo.default.svc.cluster.local domain.

dig -t SRV foo.default.svc.cluster.local

Any ClusterIP and Headless service has A,AAAA,SRV and PTR DNS records.

Scaling the StatefulSet creates a new pod instance with the next unused ordinal index. Scaling down a StatefulSet always removes the instances with the highest ordinal index first. Also, StatefulSets never permit scale-down operations while any of the instances are unhealthy.

The StatefulSet has to create the PersistentVolumeClaims as well, the same way it's creating the pods. For this reason, a StatefulSet can also have one or more volume claim templates, which enable it to stamp out PersistentVolumeClaims along with each pod instance (Pod A-0 -> PVC A-0). The PersistentVolumes for the claims can either be provisioned up-front by an administrator or just in time through dynamic provisioning of PersistentVolumes.

Scaling up a StatefulSet by one creates two or more API objects (the pod and one or more PersistentVolumeClaims referenced by the pod). Scaling down deletes only the pod, leaving the claims alone (because they're needed by the replacement pod).

sts example including headless service

---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app: kubia
  name: kubia
spec:
  serviceName: "kubia"
  replicas: 3
  selector:
    matchLabels:
      app: kubia
  template:
    metadata:
      labels:                            # pods created by the StatefulSet will have the app=kubia
        app: kubia
    spec:
      containers:
      - image: edesibe/kubia-pet
        name: kubia-pet
        ports:
        - containerPort: 8080
          name: http
        resources: {}
        volumeMounts:
        - name: data                     # the container inside the pod will mount the pvc volume at this path
          mountPath: /var/data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:                                # the PersistentVolumeClaims will be created from this template
      storageClassName: "fast"
      resources:
        requests:
          storage: 1Mi
      accessModes:
      - ReadWriteOnce
---
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia
  name: kubia                            # name of the service
spec:
  clusterIP: None                        # the StatefulSet's governing service must be headless
  ports:
  - name: http
    port: 80
  selector:
    app: kubia                           # all pods with the app=kubia label belong to this service
  type: ClusterIP
status:
  loadBalancer: {}

When you create a StatefulSet object, it creates the pods one by one; the next pod is created only after the previous one is up and ready. The PersistentVolumeClaim template is used to create the PersistentVolumeClaim and the volume inside the pod, which refers to the created PersistentVolumeClaim. The names of the generated PersistentVolumeClaims are composed of the sts.spec.volumeClaimTemplates.metadata.name and the name of each pod.
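For the StatefulSet above (volume claim template named data, StatefulSet kubia with 3 replicas), the generated claims would be named:

```
data-kubia-0
data-kubia-1
data-kubia-2
```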

DNS

You can fetch StatefulSet Pod IPs via DNS SRV queries. The response contains the SRV and A records of the related pods.

fetching IP addresses for all pods from StatefulSet

k run --rm -it --restart=Never dnsutils --image=edesibe/dnsutils -- dig +noall +additional +answer SRV kubia.default.svc.cluster.local
kubia.default.svc.cluster.local. 7 IN   SRV     0 20 80 kubia-1.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN   SRV     0 20 80 kubia-0.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN   SRV     0 20 80 kubia-2.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN   SRV     0 20 80 kubia-3.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN   SRV     0 20 80 kubia-4.kubia.default.svc.cluster.local.
kubia-1.kubia.default.svc.cluster.local. 7 IN A 100.100.177.139
kubia-4.kubia.default.svc.cluster.local. 7 IN A 100.100.177.144
kubia-0.kubia.default.svc.cluster.local. 7 IN A 100.100.177.140
kubia-2.kubia.default.svc.cluster.local. 7 IN A 100.116.201.202
kubia-3.kubia.default.svc.cluster.local. 7 IN A 100.116.201.203
7 IN SRV 0 20 80 --> <TTL> IN SRV <PRIORITY> <WEIGHT> <PORT>

The order of returned SRV records is random,because they all have the same priority.

Starting from Kubernetes 1.7, StatefulSets support rolling updates the same way Deployments and DaemonSets do. Check the sts.spec.updateStrategy field via k explain.

DEPLOYMENT

A K8s object that defines a desired state for a ReplicaSet (a set of replica Pods). The Deployment Controller seeks to maintain the desired state by creating, deleting, and replacing Pods with new configurations.

A deployment’s desired state includes:
replicas - the number of replica Pods the Deployment will seek to maintain
selector - a label selector used to identify the replica Pods managed by the Deployment
template - a template Pod definition used to create replica Pods
Use cases - There are many use cases for Deployments, such as:
Easily scale an application up or down by changing the number of replicas
Perform rolling updates to deploy a new software version
Roll back to a previous software version

Set pod.spec.containers.imagePullPolicy to Always if the pod should always fetch the image. Be aware that the default imagePullPolicy depends on the image tag. If a container refers to the latest tag (either explicitly or by not specifying the tag at all), imagePullPolicy defaults to Always, but if the container refers to any other tag, the policy defaults to IfNotPresent.

The Deployment will create a RS (ReplicaSet) object which holds the deployment specs. A ReplicaSet ensures that a specified number of pod replicas are running at any given time. The format of a pod name is <DEPLOYMENT>-<PODTEMPLATEHASH>-<SOMESTRING>. The ReplicaSet's name also contains the hash value of its pod template. A Deployment creates ReplicaSets - one for each version of the pod template. Using the hash value of the pod template like this allows the Deployment to always use the same (possibly existing) ReplicaSet for a given version of the pod template.

Deployment strategies

How the new state should be achieved is governed by the deployment strategy configured on the Deployment itself. The default strategy is to perform a rolling update (the strategy is called RollingUpdate). The alternative is the Recreate strategy, which deletes all the old pods at once and then creates new ones; old pods are deleted before the new ones are created.

Use Recreate strategy when your application doesn’t support running multiple versions in parallel and requires the old version to be stopped completely before the new one is started.
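A minimal sketch of selecting the Recreate strategy in a Deployment spec:

```yaml
spec:
  strategy:
    type: Recreate        # delete all old pods first, then create the new ones
```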

The RollingUpdate strategy removes old pods one by one while adding new ones at the same time, keeping the application available throughout the whole process and ensuring there's no drop in its capacity to handle requests. This is the default strategy. The upper and lower limits for the number of pods above or below the desired replica count are configurable.

You should use the RollingUpdate strategy only when your app can handle running both the old and new version at the same time.

Creation

k create deployment NAME --replicas 3 --image=IMAGE --dry-run=client -o yaml

example of a deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: kubia
  name: kubia
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kubia
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: kubia
    spec:
      containers:
      - image: edesibe/kubia:v1
        name: kubia
        resources: {}
status: {}

Scaling - refers to dedicating more (or fewer) resources to an application in order to meet changing needs. Best practice is to use --current-replicas=<current_number_of_replicas> so the scale is applied only if the current count matches, guarding against races with the autoscaler. As the deployment controls the replicaset, scaling the ReplicaSet object directly via k scale rs <ReplicaSetName> will be reverted by the Deployment.

k scale deployment NAME --replicas=N --current-replicas=X

Updating can be done in several ways:

  • updating yaml ( replace or apply ) file which is describing deployment specification, easiest
  • k set image deployment/NAME CONTAINER_NAME=NEWIMAGE - updating image of the deployment
  • k patch deployment NAME -p '{"spec":{"minReadySeconds":10}}' - updating some spec values
  • k set resources -f nginx.yaml --dry-run=client -o yaml --limits=cpu=100m --local - adding resource limits to local yaml
  • k set env deployment/NAME KEY=VALUE - updating deployment env(it will recreate pods)
  • k set resources deployment/NAME --requests=memory=10Mi,cpu=10m --limits=memory=20Mi,cpu=20m - updating deployment resources(it will recreate pods)

The minReadySeconds property specifies how long a newly created pod should be ready before the pod is treated as available. Until the pod is available, the rollout process will not continue (because of the maxUnavailable property). A pod is ready when the readiness probes of all its containers return a success. If a new pod isn't functioning properly and its readiness probe starts failing before minReadySeconds have passed, the rollout of the new version will effectively be blocked.
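A sketch of setting minReadySeconds on a Deployment (the 10s value is arbitrary):

```yaml
spec:
  minReadySeconds: 10      # a new pod must stay ready for 10s before it counts as available
  replicas: 3
  selector:
    matchLabels:
      app: kubia
```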

By default, if the rollout can't make any progress for 10 minutes, it's considered failed. If you use the k describe deployment command, you'll see it display a ProgressDeadlineExceeded condition. A failed rollout can only be aborted via k rollout undo deployment NAME.

If the pod template in the Deployment references a ConfigMap (or a Secret), modifying the ConfigMap will not trigger an update. One way to trigger an update when you need to modify an app's config is to create a new ConfigMap and modify the pod template so it references the new ConfigMap.

The events that occurred below the Deployment's surface during the update are: an additional ReplicaSet was created and slowly scaled up, while the previous ReplicaSet was scaled down to zero. All new pods are now managed by the new ReplicaSet.

Managing rolling updates with deployments:
Rolling updates - Allow you to make changes to a Deployment's Pods at a controlled rate, gradually replacing old Pods with new Pods. This allows you to update your Pods without incurring downtime.
Rollback - If an update to a deployment causes a problem, you can roll back the deployment to a previous working state. By default Kubernetes stores the last 10 ReplicaSets and lets you roll back to any of them (spec.revisionHistoryLimit in the deployment definition).
  • k rollout status deployment NAME - checking deployment status
  • k rollout history deployment NAME [--revision=<NUMBER>] - listing revision history. You can get details of a specific revision via --revision=<NUMBER>
  • k rollout undo deployment NAME [--to-revision=N] - reverting to the latest or a specific Nth revision. It can also be used during the rollout process to abort the rollout. Rolling back a rollout is possible because Deployments keep a revision history. The history is stored in the underlying ReplicaSets. When a rollout completes, the old ReplicaSet isn't deleted, and this enables rolling back to any revision, not only the previous one.
  • k rollout pause deployment NAME - pausing the update (canary - a technique for minimizing the risk of rolling out a bad version of an application). When you pause the deployment you can only resume it (undoing is not possible).
  • k rollout resume deployment NAME - resuming the update

A proper way of performing a canary release is by using two different Deployments (stable and canary) and scaling them appropriately. When the canary deployment is tested and verified, one can update the stable deployment via `k set image deployment stable NAME=IMAGE`.


The length of the revision history is limited by the revisionHistoryLimit property on the Deployment resource. It defaults to 10, so older ReplicaSets are deleted automatically. Two properties affect how many pods are replaced at once during a Deployment's rolling update.

  • maxSurge - The maximum number of pods that can be scheduled above the desired number of pods. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%)
  • maxUnavailable - The maximum number of pods that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%)
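Both properties live under the strategy field; a sketch with illustrative values:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1            # at most 1 pod above the desired replica count
      maxUnavailable: 0      # never drop below the desired replica count
```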

PROBES

K8s provides a number of features that allow you to build robust solutions, such as the ability to automatically restart unhealthy containers. To make the most of these features, k8s needs to be able to accurately determine the status of your applications. This means actively monitoring container health.

Kubernetes uses liveness probes to know when to restart a container, and readiness probes to know when a container is ready to receive requests, e.g. when it is up and running. A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services: when a Pod is not ready, it is removed from Service load balancers. Unlike liveness probes, if a container fails the readiness check, it won't be killed or restarted. Liveness probes keep pods healthy by killing off unhealthy containers and replacing them with new, healthy ones, whereas readiness probes make sure that only pods that are ready to serve requests receive them.

Readiness probes

Readiness probes are used to determine when a container is ready to accept requests. When you have a service backed by multiple container endpoints, user traffic will not be sent to a particular pod until its containers have all passed the readiness checks defined by their readiness probes. Use readiness probes to prevent user traffic from being sent to pods that are still in the process of starting up. When a container is started, Kubernetes can be configured to wait for a configurable amount of time to pass before performing the first readiness check. After that, it invokes the probe periodically (default 10s) and acts based on the result. If a pod reports that it's not ready, it's removed from the service; if the pod then becomes ready again, it's re-added. This is mostly necessary during container startup, but it's also useful after the container has been running for a while.

Readiness probe can use:

  • httpGet - path and port are used
  • exec - command is used
  • tcpSocket - port can be specified

Always use readinessProbe for production apps

For pods running in production, you should always define a readiness probe. Without one, pods become service endpoints almost immediately.

readiness example

spec:
  containers:
  - name: cache-server
    image: cache-server/latest
    readinessProbe:
      httpGet:
        path: /readiness
        port: 8888
      initialDelaySeconds: 300
      periodSeconds: 30

Liveness Probes

Liveness probes allow you to automatically determine whether or not a container application is in a healthy state. By default, k8s will only consider a container to be down if the container process stops. Liveness probes allow you to customize this detection mechanism and make it more sophisticated. Use them together with restartPolicy: Always so the pod's containers are restarted. Always remember to set initialDelaySeconds to account for your app's startup time.

Liveness probe can use:

  • httpGet - path and port are used
  • exec - command is used
  • tcpSocket - port can be specified
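A tcpSocket probe only checks that the port accepts a TCP connection; a sketch (the image and port are illustrative):

```yaml
spec:
  containers:
  - name: db
    image: mysql
    livenessProbe:
      tcpSocket:
        port: 3306          # probe succeeds if a TCP connection can be opened
      initialDelaySeconds: 15
      periodSeconds: 10
```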

liveness probe with exec

spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5

Exit code

Exit code 137 signals that the process was killed by an external signal (exit code is 128+9, SIGKILL). Likewise, exit code 143 corresponds to 128+15 (SIGTERM).

Always use livenessProbe for production apps

For pods running in production, you should always define a liveness probe. Without one, Kubernetes has no way of knowing whether your app is still alive or not.

Keep probes light

Liveness probes shouldn’t use too many computational resources, as they count against the container’s CPU time quota.

liveness probe with httpGet

spec:
  containers:
  - image: nginx
    name: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5

Startup Probes

Startup probes are very similar to liveness probes. However, while liveness probes run constantly on a schedule, startup probes run at container startup and stop running once they succeed. They are used to determine when the application has successfully started up. Startup probes are especially useful for legacy applications that can have long startup times.

startup probe

spec:
  containers:
  - name: startup
    image: nginx
    startupProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10

HORIZONTALPODAUTOSCALER

Horizontal pod autoscaling is the automatic scaling of the number of pod replicas managed by a controller. It’s performed by the horizontal controller, which is enabled and configured by creating a HorizontalPodAutoscaler (HPA) resource. The controller periodically checks pod metrics, calculates the number of replicas required to meet the target metric value configured in the HorizontalPodAutoscaler resource, and adjusts the replicas field on the target resource (Deployment, ReplicaSet, ReplicationController, or StatefulSet). The metrics server is needed for this.

The autoscaling process can be split into three steps:

  • obtain metrics of all the pods managed by the scaled resource object - the HPA queries the metrics server for this
  • calculate the number of pods required to bring the metrics to (or close to) the specified target value. For example, for 3 pods with current CPU utilization P1=60%, P2=90%, P3=50% and a target CPU utilization of 50%, you will need 4 replicas ((60+90+50)/50=4).
  • update the replicas field of the scaled resource

As far as the Autoscaler is concerned, only the pod’s guaranteed CPU amount (the CPU requests) is important when determining the CPU utilization of a pod. The Autoscaler compares the pod’s actual CPU consumption to its CPU requests, which means the pods you’re autoscaling need to have CPU requests set (either directly or indirectly through a LimitRange object) for the Autoscaler to determine the CPU utilization percentage.
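So a pod targeted by a CPU-based HPA should declare CPU requests; a sketch (the request value is illustrative):

```yaml
spec:
  containers:
  - name: kubia
    image: edesibe/kubia
    resources:
      requests:
        cpu: 100m        # the HPA computes utilization relative to this request
```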

# create hpa for a deployment
kubectl autoscale deployment <MY_DEPLOYMENT> --cpu-percent=30 --min=1 --max=10
# create a service out of deployment
k expose deployment <MY_DEPLOYMENT> --port=80 --target-port=80 --name=<MY_SERVICE>

# run load-generator pod
k run --rm -ti load-generator --image=busybox --restart=Never -- /bin/sh
$ while true; do wget -q -O- http://<MY_SERVICE>.default.svc.cluster.local; done

hpa example

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  creationTimestamp: null
  name: kubia                           # each hpa has a name(it doesn't need to match the name of the deployment as in this case)
spec:
  maxReplicas: 5                        # min and max replicas you specified
  minReplicas: 1
  scaleTargetRef:                       # the target resources which this autoscaler will act upon
    apiVersion: apps/v1
    kind: Deployment
    name: kubia
  metrics:
  - type: Resource
    resource:                           # you'd like the autoscaler to adjust the numnber of pods so they each utilize 30% of requested CPU
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30
status:
  currentReplicas: 0                    # the current status of the autoscaler
  desiredReplicas: 0

The HPA has a limit on how soon a subsequent autoscale operation can occur after the previous one. Currently, a scale-up will occur only if no rescaling event occurred in the last three minutes. A scale-down event is performed even less frequently - every five minutes.

CONFIGMAP

The whole point of an app’s configuration is to keep the config options that vary between environments, or change frequently, separate from the application’s source code.

A ConfigMap holds configuration data for pods to consume. Its name must be a valid DNS subdomain (lowercase alphanumeric characters, '-' and '.'). Regardless of whether you are using a ConfigMap to store configuration data or not, you can configure your apps by:

  • passing command-line arguments to containers
  • setting custom environment variables for each container
  • mounting configuration files into containers through a special type of volume

You can create a ConfigMap from literals or from files on disk.

creating cm from literal

k create configmap fortune-config --from-literal=sleep-interval=25 --dry-run=client -o yaml

will be created as

apiVersion: v1
data:
  sleep-interval: "25"             # created from literal
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: fortune-config

mutli cm example

cloud_user@k8s-control:~$ k create cm test-cm --from-literal firstName=mile --from-literal lastName=kitic --dry-run=client -o yaml
apiVersion: v1
data:
  firstName: mile
  lastName: kitic
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm

create a cm from file.

cloud_user@k8s-control:~$ k create cm test-cm --from-file readme --dry-run=client -o yaml
apiVersion: v1
data:
  readme: |                         # key is omitted so filename is used as key
    yeah
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm

cloud_user@k8s-control:~$ k create cm test-cm --from-file key=readme --dry-run=client -o yaml
apiVersion: v1
data:
  key: |                            # key is provided and used instead filename
    yeah
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm

ConfigMap objects can be created from files in a directory as well.

k create cm my-config --from-file=/path/to/dir
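Each file in the directory becomes one entry, keyed by its filename. Assuming the directory contains foo.conf and bar.conf (hypothetical names and contents), the result would look roughly like:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-config
data:
  foo.conf: |        # filename used as the key
    some config
  bar.conf: |
    other config
```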

cm as environment variables

spec:
  containers:
  - image: busybox:1.28.4
    name: app-container
    command: ['sh', '-c', "echo $(MY_VAR1) && sleep 3600"]
    env:
    - name: MY_VAR1
      valueFrom:
        configMapKeyRef:
          name: appconfig
          key: key1
          optional: true                   # this key is optional,container will start if cm doesn't exist
    - name: MY_VAR2
      value: "mile kitic"
    - name: MY_VAR3
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
          apiVersion: v1

cm as volume

spec:
  containers:
  - image: busybox
    name: busybox
    command: ["sh","-c","sleep 1d"]
    resources: {}
    volumeMounts:
    - name: config
      mountPath: /etc/someconfig.conf    # folder will be created inside container at this location with the files from configMap config
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: config
    configMap:
      name: config

this will be presented on the container as:

$ k exec -it busybox -- ls -l /etc/
total 36
-rw-rw-r--    1 root     root           306 Nov 16 17:08 group
-rw-r--r--    1 root     root             8 Dec  5 14:01 hostname
-rw-r--r--    1 root     root           201 Dec  5 14:01 hosts
-rw-r--r--    1 root     root           118 Oct 29 12:12 localtime
drwxr-xr-x    6 root     root          4096 Nov 17 19:58 network
-rw-r--r--    1 root     root           340 Nov 16 17:08 passwd
-rw-r--r--    1 root     root           127 Dec  5 14:01 resolv.conf
-rw-------    1 root     root           136 Nov 17 19:58 shadow
drwxrwxrwx    3 root     root          4096 Dec  5 14:01 someconfig.conf
$ k exec -it busybox -- ls -l /etc/someconfig.conf
total 0
lrwxrwxrwx    1 root     root            15 Dec  5 14:01 boo.conf -> ..data/boo.conf
lrwxrwxrwx    1 root     root            15 Dec  5 14:01 foo.conf -> ..data/foo.conf

cm as env variables with envFrom

spec:
  containers:
  - image: some-image
    envFrom:                                # using envFrom instead env
    - prefix: CONFIG_                       # all environment variables will be prefixed with CONFIG_
      configMapRef:
        name: some-cm                       # referencing some-cm as CM

cm value as a container argument

spec:
  containers:
  - image: edesibe/fortune:args         # image which expects the interval as an argument
    name: fortune-pod
    resources: {}
    args: ["$(INTERVAL)"]               # referencing an environment variable in the argument
    env:
    - name: INTERVAL                    # setting environment variable INTERVAL
      valueFrom:
        configMapKeyRef:                # using CM instead of a literal key/value
          name: fortune-config          # name of the CM
          key: sleep-interval           # use the value from this key for INTERVAL

A configMap volume will expose each entry of the CM as a file.

spec:
  containers:
  - image: nginx
    name: web-server
    volumeMounts:
    - name: config
      mountPath: /etc/nginx/conf.d             # mounting configMap volume at this location
      readOnly: true
  ...
   volumes:
   - name: config
     configMap:
       name: fortune-config                    # the volume refers to fortune-config CM

If you need just some parts of a configMap in a volume, you can select individual entries. When specifying individual entries, you need to set the filename for each individual entry, along with the entry’s key.

volumes:
- name: config
  configMap:
    name: fortune-config
    items:                             # selecting which entries to include in the volume by listing them
    - key: my-nginx-config.conf        # you want the entry under this key included
      path: gzip.conf                  # the entry's value should be stored in this file

mounting individual cm entries as files with subPath

spec:
  containers:
  - image: busybox
    name: busybox
    command: ["sh","-c","sleep 1d"]
    resources: {}
    volumeMounts:
    - name: config
      mountPath: /etc/foo/foo.conf            # mounting into a file, not a directory
      subPath: foo.conf                       # instead of mounting the whole volume, you're only mounting foo.conf from configMap config
    - name: config
      mountPath: /etc/boo/boo.conf
      subPath: boo.conf
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: config
    configMap:
      name: config  

so it will be mounted in the container as:

$ k exec -it busybox -- ls -l /etc/foo /etc/boo
/etc/boo:
total 4
-rw-r--r--    1 root     root            27 Dec  5 14:07 boo.conf

/etc/foo:
total 4
-rw-r--r--    1 root     root            31 Dec  5 14:07 foo.conf

When a pod references a ConfigMap which doesn’t exist, K8s schedules the pod normally and tries to run its containers. The container referencing the non-existing ConfigMap will fail to start (unless configMapKeyRef.optional: true is configured), but the other containers in the pod will start normally. If you then create the missing CM, the failed container is started without requiring you to recreate the pod.

By default, the permissions on all files in a configMap volume are set to 644. You can change this by setting the defaultMode property in the volume spec.
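A sketch of tightening the permissions (0400 is an arbitrary example mode):

```yaml
volumes:
- name: config
  configMap:
    name: fortune-config
    defaultMode: 0400      # files become readable only by the owner
```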

A drawback of using environment variables or command-line arguments as a configuration source is the inability to update them while the process is running. Using a CM and exposing it through a volume brings the ability to update the configuration without having to recreate the pod or even restart the container. When you update a ConfigMap, the files in all the volumes referencing it are updated. All files are updated at once, as k8s achieves this by using symbolic links.

When a ConfigMap currently consumed in a volume is updated, projected keys are eventually updated as well.ConfigMaps consumed as environment variables are not updated automatically and require a pod restart.A container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates.

SECRETS

Secrets are similar to ConfigMaps but are designed to store sensitive data, such as passwords or API keys, more securely. They can be used the same way as a ConfigMap. You can:

  • Pass secret entries to the container as environment variables
  • Expose secret entries as files in a volume

A Secret’s entries can contain binary values, not only plain text. Base64 encoding allows you to include binary data in YAML or JSON, which are both plain-text formats. The maximum size of a Secret is limited to 1MB.
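
Since Secret manifests carry base64-encoded values in their data section, you can prepare and verify entries with the base64 tool alone (the value mile is just an example):

```shell
# encode a value for a Secret manifest's data section
echo -n 'mile' | base64          # prints: bWlsZQ==

# decode it back, as Kubernetes does when exposing the Secret to a container
echo -n 'bWlsZQ==' | base64 -d   # prints: mile
```

Note the -n flag: without it, a trailing newline would be encoded into the value.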

When you expose the Secret to a container through a secret volume, the value of the Secret entry is decoded and written to the file in its actual form (regardless of whether it's plain text or binary). The same is also true when exposing the Secret entry through an environment variable. In both cases, the app doesn't need to decode it, but can read the file's contents or look up the environment variable value and use it directly.

Kubernetes helps keep your Secrets safe by making sure each Secret is only distributed to the nodes that run the pods that need access to it. Also, on the nodes themselves, Secrets are always stored in memory and never written to physical storage. On the master node (more precisely, in etcd), Secrets used to be stored in unencrypted form, which meant the master node had to be secured to keep the sensitive data in Secrets safe. Since Kubernetes 1.7, etcd can store Secrets in encrypted form (encryption at rest), making the system much more secure; however, because this is not enabled by default, one may still find that etcd stores Secrets in unencrypted form.

fetch a namespaced Secret from etcd

k -n kube-system exec -it <etcd-control> -- etcdctl get --endpoints=https://<endpoint>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key /registry/secrets/${NAMESPACE}/${SECRET}

To use encrypted secrets you need to use Sealed Secrets or Encryption at rest

By default, the default-token Secret is mounted into every container, but you can disable that per pod by setting pod.spec.automountServiceAccountToken: false.

The secret volume uses an in-memory filesystem (tmpfs) for the Secret files. You can see this if you list the mounts in the container. Because tmpfs is used, the sensitive data stored in the Secret is never written to disk, where it could be compromised.

create simple secret

k create secret generic mysecret --from-literal=username=mile --from-literal=password=kitic
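
The command above produces a Secret roughly equivalent to this manifest (the data values are the base64 encodings of mile and kitic):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: mysecret
type: Opaque
data:
  username: bWlsZQ==       # base64 of "mile"
  password: a2l0aWM=       # base64 of "kitic"
```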

create secret from file

k create secret generic ssh-key-secret --from-file=ssh-privatekey=<absolute_path>/.ssh/id_rsa --from-file=ssh-publickey=<absolute_path>/.ssh/id_rsa.pub

secret as env

spec:
  containers:
  - name: mycontainer
    image: redis
    env:
      - name: SECRET_USERNAME
        valueFrom:
          secretKeyRef:
            name: mysecret
            key: username
            optional: false

secret as volume

spec:
  volumes:
  - name: secret-volume
    secret:
      secretName: ssh-key-secret
  containers:
  - name: ssh-test-container
    image: mySshImage
    volumeMounts:
    - name: secret-volume
      readOnly: true
      mountPath: "/etc/secret-volume"

If you have multiple pods which need to fetch images from a private registry, you can add the secret to a service account.

spec:
  imagePullSecrets:
  - name: mydockerhubsecret                    # docker-registry secret type
  containers:
  - image: username/private:tag
    name: some-name                            # container names must be valid DNS-1123 labels (no underscores)

Always expose Secrets through Secret volumes, not as environment variables.

NODES

# scheduling disabled
k cordon NODE

# alternatively set the node to unschedulable via patch
k patch nodes NODE -p '{"spec":{"unschedulable":true}}'

# scheduling enabled; pods evicted earlier will not automatically come back to the uncordoned NODE
k uncordon NODE

# cordon + evict pods to other nodes (with options for ignoring volumes defined as emptyDir and for DaemonSet pods)
k drain NODE [--delete-emptydir-data] [--ignore-daemonsets] [--force]

SCHEDULING

Before the scheduler allocates a pod to a node, it performs several checks, such as:

  • does the node have the required hardware resources
  • is the node running out of resources
  • does the pod request a specific node ( nodeSelector )
  • does the node have a matching label ( selector spec )
  • if the pod requests a port, is it available
  • if the pod requests a volume, can it be mounted
  • does the pod tolerate the taints of the node ( taints and tolerations mapping )
  • does the pod specify affinity ( node or pod affinity )

Tolerations and Taints

Tolerations allow pods to tolerate taints. Node capacity can be viewed via kubectl describe node, where one should check:

  • capacity - overall capacity of the node
  • allocatable - how much can be allocated to pods
TAINTS

Taints are used to keep pods away from certain nodes.

taint example

# taint one node
k taint node NODE KEY=VALUE:EFFECT

# remove taint
k taint node NODE KEY=VALUE:EFFECT-

# taint nodes via label
k taint node -l key=value KEY=VALUE:EFFECT

# add a taint without a value, e.g. `KEY=:NoSchedule` or `KEY=:NoExecute`
k taint node NODE KEY=:EFFECT

Pods with tolerations MAY be scheduled to tainted nodes (e.g. the master) if their tolerations match the node's taints. Pods with no tolerations can only be scheduled to nodes without taints. If a node has a taint assigned, NO pod will be scheduled onto it unless the pod has a matching toleration.

If a node has no taints, pods can be assigned to it without any toleration.

The default value for operator is Equal. A toleration matches a taint if the keys and effects are the same on node and pod and:

  • the operator is Exists (in which case no value should be specified) or
  • the operator is Equal and the values are equal

There are two special cases:

  • An empty key with operator Exists matches all keys, values and effects, which means this will tolerate everything.
  • An empty effect matches all effects with the given key.
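
The first special case (an empty key with operator Exists) can be written as:

```yaml
tolerations:
- operator: Exists          # no key, no value: tolerates every taint
```
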

Possible Effect values:

  • NoExecute - affects pods already running on the node: any already-running pods that do not tolerate the taint are evicted. Currently enforced by the NodeController.
  • NoSchedule - do not allow new pods to schedule onto the node unless they tolerate the taint; already-running pods (and pods submitted to the kubelet without going through the scheduler) continue to run. Enforced by the scheduler.
  • PreferNoSchedule - like NoSchedule, but the scheduler tries not to schedule new pods onto the node rather than prohibiting it entirely. Enforced by the scheduler.
tolerations:
- effect: [NoSchedule,PreferNoSchedule,NoExecute]
  key: KEY
  operator: [Exists,Equal]
  value: VALUE
  tolerationSeconds: X                           # how long k8s waits before evicting the pod from a node with a matching NoExecute taint (e.g. when the node becomes unready or unreachable)
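
For example (node name and key are made up), after tainting a node with k taint node node1 gpu=true:NoSchedule, only pods carrying a matching toleration can be scheduled onto it:

```yaml
tolerations:
- key: gpu
  operator: Equal
  value: "true"             # values are strings, so true must be quoted
  effect: NoSchedule
```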

Affinity

Node Affinity allows you to tell Kubernetes to schedule pods only to specific subsets of nodes. It selects nodes based on their labels, the same way node selectors do. Before affinity is configured on pods, the related nodes need to be labeled.

kubectl label nodes NODE KEY=VALUE [--overwrite]

Then affinity can be used to specify preferences or hard requirements when scheduling pods. Options for selecting are:

  • nodeSelector - for the pod to be eligible to run on a node, the node must have each of the indicated key/value pairs as labels
  • nodeAffinity - also based on node labels, but with more expressive matching options
  • podAffinity - based on pod labels; used for co-location, e.g. scheduling pods on the same node, zone or rack
  • podAntiAffinity - based on pod labels but with the opposite effect of podAffinity: the scheduler never chooses nodes where pods matching the podAntiAffinity's label selector are running

The affinity feature consists of two types of affinity:

  • Node affinity functions like the nodeSelector field but is more expressive and allows you to specify soft rules.
  • Inter-pod affinity/anti-affinity allows you to constrain Pods against labels on other Pods.

hard requirements - forcing the pods to run on specific nodes by nodeAffinity spec

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:         # node must have matching labels; doesn't affect already-running pods
        nodeSelectorTerms:
        - matchExpressions:
          - key: KEY
            operator: In
            values:
            - VALUE

preference - instructing pods to run on preferred nodes via nodeAffinity spec

spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:        # node should have matching labels; doesn't affect already-running pods
      - weight: 80                                            # prefer the pod to be scheduled to a node with these labels; this is your most important preference
        preference:
          matchExpressions:
          - key: KEY1
            operator: In
            values:
            - VALUE1
      - weight: 20                                            # you also prefer that your pods be scheduled based on some other key/value pair
        preference:
          matchExpressions:
          - key: KEY2
            operator: In
            values:
            - VALUE2

hard requirements - specifying pod allocation via podAffinity

spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname                 # the pods of this deployment must be placed on the same node as the pods matching the selector; topologyKey is required for requiredDuringSchedulingIgnoredDuringExecution
        labelSelector:
          matchLabels:
            app: backend

preference

spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:      # preferred instead of required
      - weight: 80                                          # weight and podAffinityTerm are specified
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: backend

pod anti-affinity example

template:
  metadata:
    creationTimestamp: null
    labels:
      app: frontend                                        # the frontend pods have the app=frontend label
  spec:
    affinity:
      podAntiAffinity:                                     
        requiredDuringSchedulingIgnoredDuringExecution:    # defining hard requirements for pod anti-affinity
        - topologyKey: kubernetes.io/hostname              # with hostname, ensures pods aren't deployed to the same node; use a zone/region or custom key for wider scopes
          labelSelector:                                   # a frontend pod must not be scheduled to the same node as a pod with app=frontend label
            matchLabels:
              app: frontend

The topologyKey option is used differently based on the use case:

  • bonding to node - kubernetes.io/hostname
  • bonding to region - topology.kubernetes.io/region
  • bonding to az - topology.kubernetes.io/zone

You can add your own topologyKey, such as rack, but you will need to label your nodes accordingly. For example, if you have 20 nodes you could label the first 10 with rack=rack1 and the second 10 with rack=rack2. Then, in the podAffinity spec, you would set topologyKey: rack.
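
The rack example can be sketched in a podAffinity spec like this (the label key rack and the app=backend selector are hypothetical):

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: rack             # nodes labeled rack=rack1 / rack=rack2 form the topology domains
        labelSelector:
          matchLabels:
            app: backend              # co-locate with backend pods within the same rack
```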

Limits

Resource requests allow you to define the amount of resources (such as CPU or memory) you expect a container to use. The Kubernetes scheduler uses resource requests to avoid scheduling pods on nodes that do not have enough available resources.

Containers are allowed to use more (or less) than the requested resources. Resource requests only affect scheduling. By specifying resource requests, you're specifying the minimum amount of resources your pod needs. This information is what the scheduler uses when scheduling the pod to a node. Each node has a certain amount of CPU and memory it can allocate to pods. When scheduling a pod, the scheduler will only consider nodes with enough unallocated resources to meet the pod's resource requirements. If the amount of unallocated CPU or memory is less than what the pod requests, Kubernetes will not schedule the pod to that node, because the node can't provide the minimum amount required by the pod.

The scheduler first filters out the nodes the pod can't fit on, then prioritizes the remaining nodes per the configured prioritization functions. Among others, two prioritization functions rank nodes based on the amount of resources requested: LeastRequestedPriority and MostRequestedPriority. The first prefers nodes with fewer requested resources (a greater amount of unallocated resources), whereas the second is the exact opposite - it prefers nodes that have the most requested resources (a smaller amount of unallocated CPU and memory). Both consider the amount of requested resources, not the amount of resources actually consumed. The scheduler is configured to use only one of these functions. Because the scheduler needs to know how much CPU and memory each node has, the kubelet reports this data to the API server, making it available through the Node resource.

list node overall capacity and allocatable capacity

> k get nodes -o jsonpath='{range .items[*]}NAME:{.metadata.name}{"\t"}CAPACITY:{.status.capacity.cpu}{"\t"}ALLOCATABLE:{.status.allocatable.cpu}{"\n"}{end}' -l gpu=true
NAME:k8s-worker1        CAPACITY:2      ALLOCATABLE:2
NAME:k8s-worker2        CAPACITY:2      ALLOCATABLE:2

The output shows two sets of amounts related to the available resources on the nodes: the node's capacity and allocatable resources. The capacity represents the total resources of a node, which may not all be available to pods. The scheduler bases its decisions only on the allocatable resource amounts.

Both CPU and memory requests are treated the same way by the scheduler, but in contrast to memory requests, a pod's CPU requests also play a role elsewhere - while the pod is running. CPU requests don't only affect scheduling - they also determine how the remaining (unused) CPU time is distributed between pods. Running pods share unused CPU in the same ratio as their CPU requests. For example, with two pods on a node with 2 CPUs, where the 1st pod requests 1 CPU and the 2nd pod requests 200 millicores, unused CPU is split in a 5:1 ratio (the 1st pod gets 5/6 and the 2nd pod 1/6 of the unused CPU time). But if one container wants to use as much CPU as it can while the other is sitting idle at a given moment, the first container will be allowed to use the whole CPU time. After all, it makes sense to use all the available CPU if no one else is using it. As soon as the second container needs CPU time, it will get it and the first container will be throttled back.

CPU is a compressible resource, which means the amount used by a container can be throttled without adversely affecting the process running in the container. Memory is obviously different - it's incompressible. Once a process is given a chunk of memory, that memory can't be taken away from it until it's released by the process itself. That's why you need to limit the maximum amount of memory a container can be given.

Unlike resource requests, resource limits aren't constrained by the node's allocatable resource amounts. The sum of all limits of all the pods on a node is allowed to exceed 100% of the node's capacity. This has an important consequence - when 100% of the node's resources are used up, certain containers will need to be killed.

Even though you set a limit on how much memory is available to a container, the container will not be aware of this limit, because the container sees the memory of the whole node. Also, containers see all the node's CPUs, regardless of the CPU limits configured for the container. All the CPU limit does is constrain the amount of CPU time the container can use.

Containers never get killed for trying to use too much CPU (they only get throttled), but they are killed if they try to use too much memory (with an OOMKilled status).

Pod spec can have hard and soft limits via:

  • limits - max amount of compute resources allowed. When a CPU limit is set for a container, the process isn't given more CPU time than the configured limit. With memory, when a process tries to allocate memory over its limit, the process is killed (the container is said to be OOMKilled)
  • requests - min amount of compute resources needed. If not set explicitly, requests default to the limits (if they exist)

Resource limits provide a way for you to limit the amount of resources your containers can use.The container runtime is responsible for enforcing these limits, and different container runtimes do this differently.

Some runtimes will enforce these limits by terminating container processes that attempt to use more than the allowed amount of resources

  • cpu - defined in millicores (1/1000 of one CPU). If your container needs 2 full cores to run, you would put the value "2000m". If your container only needs 1/4 of a core, you would put the value "250m".
  • memory - defined in bytes (commonly with suffixes such as Mi or Gi).

limit and requests example

spec:
  containers:
  - image: nginx
    name: nginx-pod
    resources:                                   # you're specifying resource requests and limits for the nginx-pod container
      limits:
        cpu: "250m"                              # the container will be allowed to use at most 250 millicores (1/4 of a single CPU core's time)
        memory: "128Mi"                          # the container will be allowed to use up to 128 mebibytes of memory
      requests:
        cpu: "125m"                              # the container requests 125 millicores (1/8 of a single CPU core's time)
        memory: "64Mi"                           # the container requests 64 mebibytes of memory
QoS classes

Kubernetes categorizes pods into three Quality of Service classes based on the combination of resource requests and limits of the pod's containers. These are the classes:

  • BestEffort (the lowest priority) - pods that don't have any requests or limits set at all
  • Burstable - pods where the containers' limits and requests don't match, or pods with resource requests specified but without limits
  • Guaranteed (the highest) - pods whose containers' requests (for all containers in the pod) are equal to the limits for all resources
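
For instance, a container spec that falls into the Guaranteed class because requests equal limits for both resources (names are examples):

```yaml
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: "250m"
        memory: "128Mi"
      limits:
        cpu: "250m"          # requests == limits for every resource in every container
        memory: "128Mi"      # => QoS class Guaranteed
```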

When the system is overcommitted, the QoS classes determine which container gets killed first so the freed resources can be given to higher-priority pods. First in line to get killed are pods in the BestEffort class, followed by Burstable pods, and finally Guaranteed pods, which only get killed if system processes need memory.

When two single-container pods exist, both in the Burstable class, the system will kill the one using a greater percentage of its requested memory.

LimitRanges

Instead of having to set resource limits for every container, one can create a LimitRange resource. It allows you to specify (for each namespace) not only the minimum and maximum limits you can set on a container for each resource, but also the default resource requests for containers that don't specify requests explicitly. LimitRange resources are used by the LimitRanger Admission Control plugin. When a pod manifest is posted to the API server, the LimitRanger plugin validates the pod spec. If validation fails, the manifest is rejected immediately. Because of this, a great use-case for LimitRange objects is to prevent users from creating pods that are bigger than any node in the cluster. Without such a LimitRange, the API server will gladly accept the pod, but then never schedule it.

The limits specified in a LimitRange resource apply to each individual pod/container or other kind of object created in the same namespace as the LimitRange object. They don't limit the total amount of resources available across all the pods in the namespace (that is specified through ResourceQuota objects).

apiVersion: v1
kind: LimitRange
metadata:
  name: example
spec:
  limits:
  - type: Pod                                # specifies the limits for a pod as a whole
    min:                                     # minimum CPU and memory all the pod's containers can request in total
      cpu: 50m
      memory: 5Mi
    max:
      cpu: 1                                 # maximum CPU and memory all the pod's containers can request(and limit)
      memory: 1Gi
  - type: Container                          # the container limits are specified below this line
    defaultRequest:                          # default requests for CPU and memory that will be applied to containers that don't specify them explicitly
      cpu: 100m
      memory: 10Mi
    default:                                 # default limits for containers that don't specify them
      cpu: 200m
      memory: 100Mi
    min:                                     # minimum and maximum requests/limits that a container can have
      cpu: 50m
      memory: 5Mi
    max:
      cpu: 1
      memory: 1Gi
    maxLimitRequestRatio:                    # maximum ratio between the limit and request for each resource
      cpu: 4                                 # a container's CPU limit will not be allowed to be more than 4 times greater than its CPU request. A container requesting 200m will not be accepted if its CPU limit is set to 801m or higher
      memory: 10
  - type: PersistentVolumeClaim             # a LimitRange can also set the minimum and maximum amount of storage a PVC can request
    min:
      storage: 1Gi
    max:
      storage: 10Gi

ResourceQuota

LimitRanges only apply to individual pods, but cluster admins also need a way to limit the total amount of resources in a namespace. The ResourceQuota Admission Control plugin checks whether the pod being created would cause the configured ResourceQuota to be exceeded. Because resource quotas are enforced at pod creation time, a ResourceQuota object only affects pods created after the ResourceQuota object is created - creating it has no effect on existing pods.

A ResourceQuota limits the amount of computational resources the pods, and the amount of storage the PersistentVolumeClaims, in a namespace can consume. It can also limit the number of pods, claims, and other API objects users are allowed to create inside the namespace. A ResourceQuota object applies to the namespace it's created in, like a LimitRange, but it applies to all the pods' resource requests and limits in total, and not to each individual pod or container separately.

When a quota for a specific resource (CPU or memory) is configured (requests or limits), pods need to have the request or limit (respectively) set for that same resource; otherwise the API server will not accept the pod. That's why having a LimitRange with defaults for those resources can make life a bit easier for people creating pods.

Quotas can also be limited to a set of quota scopes:

  • BestEffort - whether the quota applies to pods with the BestEffort QoS class. Can only limit the number of pods
  • NotBestEffort - whether the quota applies to pods with one of the other two classes (Burstable or Guaranteed). Can limit the number of pods and CPU/memory requests/limits
  • Terminating - pods that have activeDeadlineSeconds set. Can limit the number of pods and CPU/memory requests/limits
  • NotTerminating - pods that don't have activeDeadlineSeconds set. Can limit the number of pods and CPU/memory requests/limits

When creating a ResourceQuota,you can specify the scopes that it applies to.A pod must match all the specified scopes for the quota to apply to it.Also,what a quota can limit depends on the quota’s scope.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: besteffort-notterminating-pods
spec:
  scopes:                                   # this quota only applies to pods that have the BestEffort QoS and don't have an active deadline set
  - BestEffort                              # if the quota was targeting NotBestEffort pods you could also specify requests/{cpu,memory} and limits/{cpu,memory}
  - NotTerminating
  hard:
    pods: 4                                 # only four such pods can exist

apiVersion: v1
kind: ResourceQuota
metadata:
  name: cpu-and-mem
spec:
  hard:
    requests.cpu: 400m
    requests.memory: 200Mi
    limits.cpu: 600m
    limits.memory: 500Mi

apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage
spec:
  hard:
    requests.storage: 500Gi                                         # the amount of storage claimable overall
    ssd.storageclass.storage.k8s.io/requests.storage: 300Gi         # the amount of claimable storage in StorageClass named ssd
    standard.storageclass.storage.k8s.io/requests.storage: 1Ti      # the amount of claimable storage in StorageClass named standard

apiVersion: v1
kind: ResourceQuota
metadata:
  name: objects
spec:
  hard:
    pods: 10                                                         # only 10 pods,5 RC,10 secrets,10 CM,and 4 PVC can be created in namespace
    replicationcontrollers: 5
    secrets: 10
    configmaps: 10
    persistentvolumeclaims: 4
    services: 5                                                      # 5 SVC overall can be created,of which at most one can be a LoadBalancer SVC and at most 2 can be NodePort SVCs
    services.loadbalancers: 1
    services.nodeports: 2
    ssd.storageclass.storage.k8s.io/persistentvolumeclaims: 2        # only 2 PVCs can claim storage with the ssd StorageClass

GENERAL COMMANDS

print all api resources

kubectl api-resources

print doc about some api resource

kubectl explain ${api-resource}

like

k explain pod.spec.containers.resources

RBAC (Role Based Access Control)

The Kubernetes API server can be configured to use an authorization plugin to check whether an action is allowed to be performed by the user requesting it. REST clients send GET, POST, PUT, DELETE and other types of HTTP requests to specific URL paths, which represent specific REST resources. The verbs (get, create, update) map to the HTTP methods (GET, POST, PUT) performed by the client. An authorization plugin such as RBAC, which runs inside the API server, determines whether a client is allowed to perform the requested verb on the requested resource.

HTTP method   Verb for single resource        Verb for collection
GET, HEAD     get (and watch for watching)    list (and watch)
POST          create                          n/a
PUT           update                          n/a
PATCH         patch                           n/a
DELETE        delete                          deletecollection

Besides applying security permissions to whole resource types, RBAC rules can also apply to specific instances of a resource (for example, a Service called myservice). Also, permissions can be set for non-resource URL paths, because not every path the API server exposes maps to a resource (such as the /api path itself or the server health information at /healthz). Regular Roles can't grant access to those non-resource URLs, but a ClusterRole can.

The RBAC authorization plugin uses user roles as the key factor in determining whether the user may perform an action or not. A subject (which may be a human, a ServiceAccount, or a group of users or ServiceAccounts) is associated with one or more roles, and each role is allowed to perform certain verbs on certain resources.

Roles and ClusterRoles are Kubernetes objects that define a set of permissions. These permissions determine what users can do in the cluster. A Role defines permissions within a particular namespace, and a ClusterRole defines cluster-wide permissions not specific to a single namespace.

RoleBinding and ClusterRoleBinding are objects that connect Roles and ClusterRoles to subjects (humans, ServiceAccounts, groups).

Roles define what can be done, while bindings define who can do it.

create a role

> k create role service-reader -n foo --resource=services --verb=get,list --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  creationTimestamp: null
  name: service-reader
  namespace: foo                     # roles are namespaced. They only allow access to resources in the same namespace the Role is in
rules: 
- apiGroups: [""]                    # services are resources in the `core` apiGroup which has no name - hence the "".For `named` apiGroup one needs to specify the path
  resources: ["services"]            # this rule pertains to services (plural name must be used).You could use `resourceNames` field for specific service name
  resourceNames: ["mile-svc"]        # optionally restrict the rule to individual resources by name
  verbs: ["get","list"]              # getting individual Services (by name) and listing all of them is allowed

create a rolebinding for user and serviceaccount

> k create rolebinding test --role=service-reader --serviceaccount=foo:default -n foo --user=mile --group=folker --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  creationTimestamp: null
  name: test
  namespace: foo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: service-reader                             # this RoleBinding references the service-reader Role
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: mile
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: folker
- kind: ServiceAccount                             # And binds it to the default ServiceAccount in the foo namespace
  name: default
  namespace: foo

A RoleBinding always references a single Role (as evident from the roleRef property), but can bind the Role to multiple subjects (one or more ServiceAccounts and any number of users or groups).

Although you can create a RoleBinding that references a ClusterRole when you want to enable access to namespaced resources, you can't use the same approach for cluster-level (non-namespaced) resources. To grant access to cluster-level resources, you must always use a ClusterRoleBinding.

create a clusterrole

>  k create clusterrole pv-reader --verb=get,list --resource=persistentvolumes --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  creationTimestamp: null
  name: pv-reader
rules:                              # can be applied to nonResourceURLs as well
- apiGroups:                            
  - ""                              # will be populated based on used resorices
  resources:
  - persistentvolumes
  verbs:
  - get
  - list

bind a ClusterRole and a ServiceAccount via a ClusterRoleBinding

> k create clusterrolebinding pv-test --clusterrole=pv-reader --serviceaccount=foo:default --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  creationTimestamp: null
  name: pv-test
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pv-reader
subjects:
- kind: ServiceAccount
  name: default
  namespace: foo

A ClusterRole has several uses. You can use a ClusterRole to:

  • define permissions on namespaced resources and be granted within individual namespace(s)
  • define permissions on namespaced resources and be granted across all namespaces
  • define permissions on cluster-scoped resources

The most important roles are the view,edit,admin, and cluster-admin ClusterRoles.

  • view - allows reading most resources in a namespace, except for Roles, RoleBindings, and Secrets
  • edit - allows modifying resources in a namespace, including both reading and modifying Secrets. Cannot view or modify Roles and RoleBindings
  • admin - allows complete control of the resources in a namespace (except ResourceQuotas and the Namespace resource itself). The main difference between the edit and admin ClusterRoles is the ability to view and modify Roles and RoleBindings in the namespace
  • cluster-admin - complete control of the Kubernetes cluster

By default, the default ServiceAccount in a namespace has no permissions other than those of an unauthenticated user (the system:discovery ClusterRole and its associated binding allow anyone to make GET requests on a few non-resource URLs).

SERVICEACCOUNT

In k8s, a service account is an account used by container processes within Pods to authenticate with the k8s API. Every pod is associated with a ServiceAccount, which represents the identity of the app running in the pod. The token file holds the ServiceAccount's authentication token. When an app uses this token to connect to the API server, the authentication plugin authenticates the ServiceAccount and passes the ServiceAccount's username back to the API server core. ServiceAccount usernames are formatted like this: system:serviceaccount:{namespace}:{service account name}. The API server passes this username to the configured authorization plugins, which determine whether the action the app is trying to perform is allowed for that ServiceAccount.

The authentication tokens used by ServiceAccounts are JSON Web Tokens (JWTs)
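A JWT consists of three dot-separated, base64-encoded segments (header.payload.signature), so the claims can be inspected with standard tools. A sketch using a synthetic token (real ServiceAccount tokens are base64url-encoded and signed by the API server):

```shell
# Build a synthetic token payload for illustration only; a real token is issued by the API server.
PAYLOAD=$(printf '{"iss":"kubernetes/serviceaccount","sub":"system:serviceaccount:default:foo"}' | base64 | tr -d '\n')
TOKEN="header.${PAYLOAD}.signature"

# Decode the second segment to see the ServiceAccount identity carried in the claims:
echo "${TOKEN}" | cut -d. -f2 | base64 -d
```

On a real pod the token sits in /var/run/secrets/kubernetes.io/serviceaccount/token and its claims can be decoded the same way.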

ServiceAccounts are nothing more than a way for an application running inside a pod to authenticate itself with the API server. As already mentioned, applications do that by passing the ServiceAccount's token in the request.

A pod's ServiceAccount must be set when creating the pod. It can't be changed later.

A default ServiceAccount is automatically created for each namespace (that's the one your pods have used all along). One can assign a ServiceAccount to a pod by specifying the account's name in the pod manifest. If you don't assign it explicitly, the pod will use the default ServiceAccount in the namespace.
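For example, assigning a non-default ServiceAccount is just a matter of setting spec.serviceAccountName in the pod manifest (names here are illustrative; the ServiceAccount must already exist in the pod's namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: curl-custom-sa                # illustrative name
spec:
  serviceAccountName: foo             # must exist in the same namespace when the pod is created
  containers:
  - name: main
    image: busybox
    command: ["sleep", "9999"]
```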

You can manage access control for service accounts, just like for any other user, using RBAC objects. A RoleBinding or ClusterRoleBinding binds a role to subjects. Subjects can be groups, users, or ServiceAccounts.

create service account

k create sa test --dry-run=client -o yaml -n default

Since version 1.24 (with changes phased in from 1.22), k8s no longer automatically creates a Secret containing a token for a ServiceAccount. One must request a token manually:

k create sa SERVICEACCOUNTNAME         # creates the sa (without an auto-generated token Secret); one can use this sa in a pod
k create token SERVICEACCOUNTNAME      # requests a time-bound token for the related sa

describe options in sa

> k describe sa foo
Name:                foo
Namespace:           default
Labels:              <none>
Annotations:         <none>
Image pull secrets:  <none>        # these will be added automatically to all pods using this ServiceAccount. Defined in `sa.imagePullSecrets`; they are not mounted in pods, only used by the kubelet when it needs to fetch images from a private registry.
Mountable secrets:   <none>        # pods using this SA can only mount these Secrets if mountable Secrets are enforced. To enforce this, the SA must be annotated with kubernetes.io/enforce-mountable-secrets: "true"
Tokens:              <none>        # authentication token(s). The first one is mounted inside the container
Events:              <none>

example of binding a Role to a ServiceAccount via a RoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  labels:
    k8s-app: metrics-server
  name: metrics-server-auth-reader
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
  name: metrics-server
  namespace: kube-system

OTHER

metadata.ownerReferences - can be used to find which resource an object belongs to (POD->RS, RS->DEPLOYMENT)

LABELS

label format; use --overwrite if you want to update an existing label

k label RESOURCE RESOURCE_NAME KEY=VALUE [--overwrite]

get custom output

k get po -o custom-columns=POD:metadata.name,NODE:spec.nodeName --sort-by spec.nodeName

API deprecations; kubectl convert must be installed first

k convert -f FILENAME --output-version <new-api>

Cluster Autoscaler

The Cluster Autoscaler takes care of automatically provisioning additional nodes when it notices a pod that can't be scheduled to existing nodes because of a lack of resources on those nodes. It also de-provisions nodes when they're underutilized for longer periods of time. A new node will be provisioned if, after a new pod is created, the Scheduler can't schedule it to any of the existing nodes. The Cluster Autoscaler looks out for such pods and asks the cloud provider to start up an additional node.

PodDisruptionBudget

Certain services require that a minimum number of pods always keeps running; this is especially true for quorum-based clustered applications. For this reason, Kubernetes provides a way of specifying the minimum number of pods that need to keep running while performing these types of operations. A PodDisruptionBudget contains only a pod label selector and a number specifying the minimum number of pods that must always be available, or the maximum number of pods that can be unavailable.

pdb example

> k create pdb kubia-pdb --selector app=kubia --min-available 3 --dry-run=client -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: null
  name: kubia-pdb
spec:
  minAvailable: 3                  # how many pods should always be available; can be defined as a %
  selector:                        # the label selector that determines which pods this budget applies to
    matchLabels:
      app: kubia
status:
  currentHealthy: 0
  desiredHealthy: 0
  disruptionsAllowed: 0
  expectedPods: 0

As long as the PodDisruptionBudget exists, both the Cluster Autoscaler and the k drain command will adhere to it and will never evict a pod with the app=kubia label if that would bring the number of such pods below three.

Downward API

It allows you to pass metadata about the pod and its environment through environment variables or files (in a downwardAPI volume). It's a way of having environment variables or files populated with values from the pod's specification or status. It allows you to pass the following information to your containers:

  • the pod’s name
  • the pod’s IP address
  • the namespace the pod belongs to
  • the name of the node the pod is running on
  • the name of the service account the pod is running under
  • the CPU and memory requests and limits for each container
  • the pod’s labels and annotations

Most items in the list can be passed to containers either through environment variables or through a downwardAPI volume, but labels and annotations can only be exposed through the volume.

downward API pod example

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:                                               # labels and annotations will be exposed via downwardAPI volume
    foo: bar
    run: downward
  annotations:
    key1: value1
    key2: |
      multi
      line
      value      
  name: downward
spec:
  containers:
  - command:
    - sleep
    - "999999"
    image: busybox
    name: main
    resources:
      requests:
        cpu: 15m
        memory: 100Ki
      limits:
        cpu: 100m
        memory: 4Mi
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.name                     # instead of specifying an absolute value you're referencing the metadata.name field from the pod manifest
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef:
          fieldPath: metadata.namespace
    - name: POD_IP
      valueFrom:
        fieldRef:
          fieldPath: status.podIP
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          fieldPath: spec.nodeName
    - name: SERVICE_ACCOUNT
      valueFrom:
        fieldRef:
          fieldPath: spec.serviceAccountName
    - name: CONTAINER_CPU_REQUEST_MILLICORES
      valueFrom:
        resourceFieldRef:
          resource: requests.cpu                       # a container's CPU and memory requests and limits are referenced by using resourceFieldRef instead of fieldRef
          divisor: 1m                                  # for resource fields, you define a divisor to get the value in the unit you need
    - name: CONTAINER_MEMORY_LIMIT_KIBIBYTES
      valueFrom:
        resourceFieldRef:
          resource: limits.memory
          divisor: 1Ki
    volumeMounts:
    - name: downward
      mountPath: /etc/downward
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: downward
    downwardAPI:
      items:
      - path: "podName"
        fieldRef:
          fieldPath: metadata.name
      - path: "podNamespace"
        fieldRef:
          fieldPath: metadata.namespace
      - path: "labels"                                 # the pod's labels will be written to the /etc/downward/labels file
        fieldRef:
          fieldPath: metadata.labels
      - path: "annotations"                            # the pod's annotations will be written to the /etc/downward/annotations file
        fieldRef:
          fieldPath: metadata.annotations
      - path: "containerCpuRequestMilliCores"
        resourceFieldRef:
          containerName: main                          # a container name must be specified because volumes are defined at the pod level, not the container level
          resource: requests.cpu
          divisor: 1m
      - path: "containerMemoryLimitBytes"
        resourceFieldRef:
          containerName: main
          resource: limits.memory
          divisor: 1
status: {}
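The divisor fields in the manifest above determine the unit of the exposed value; the arithmetic can be checked by hand:

```shell
# requests.cpu = 15m with divisor 1m -> exposed value in millicores
echo $(( 15 / 1 ))            # 15
# limits.memory = 4Mi with divisor 1Ki -> exposed value in kibibytes
echo $(( 4 * 1024 ))          # 4096
# limits.memory = 4Mi with divisor 1 -> exposed value in bytes
echo $(( 4 * 1024 * 1024 ))   # 4194304
```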

The Downward API is fairly limited; if you need more, you will need to obtain it from the Kubernetes API server directly.

Accessing the API server

The kubectl proxy command runs a proxy server that accepts HTTP connections on your local machine and proxies them to the API server while taking care of authentication, so you don't need to pass the authentication token in every request. It also makes sure you're talking to the actual API server and not a man in the middle (by verifying the server's certificate on each request). As soon as it starts up (via k proxy), the proxy starts accepting connections on local port 8001.

from client
k proxy &
# check api/v1 response 
curl localhost:8001/api/v1
curl localhost:8001/apis

# fetch specific job in dev namespace
curl localhost:8001/apis/batch/v1/namespaces/dev/jobs/<jobName>

# fetch specific pod in web namespace
curl localhost:8001/api/v1/namespaces/web/pods/<podName>

# call a service via api
curl localhost:8001/api/v1/namespaces/default/services/<serviceName>/proxy/

# call a pod via api
curl localhost:8001/api/v1/namespaces/default/pods/<podName>/proxy/
inside pod
export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer ${TOKEN}" https://kubernetes

or via an ambassador (proxy) container

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: curl-with-ambassador
  name: curl-with-ambassador
spec:
  containers:
  - command:
    - sleep
    - "9999"
    image: edesibe/curl
    name: main
    resources: {}
  - name: proxy
    image: edesibe/kubectl-proxy
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}

Authentication

The API server can be configured with one or more authentication plugins (and the same is true for authorization plugins). When a request is received by the API server, it goes through the list of authentication plugins so they can each examine the request and try to determine who's sending it. The first plugin that can extract that information from the request returns the username, user ID, and the groups the client belongs to back to the API server core. All Kubernetes clusters have two categories of users: service accounts managed by Kubernetes, and normal users. It is assumed that a cluster-independent service manages normal users in the following ways:

  • an administrator distributing private keys
  • a user store like Keystone or Google Accounts
  • a file with a list of usernames and passwords

In this regard, Kubernetes does not have objects which represent normal user accounts. Normal users cannot be added to a cluster through an API call. Even though a normal user cannot be added via an API call, any user that presents a valid certificate signed by the cluster's certificate authority (CA) is considered authenticated. In this configuration, Kubernetes determines the username from the common name field in the 'subject' of the cert (e.g., "/CN=mile"). From there, the role-based access control (RBAC) sub-system would determine whether the user is authorized to perform a specific operation on a resource. For more details, refer to the normal users topic in the certificate signing request documentation. In contrast, service accounts are users managed by the Kubernetes API. They are bound to specific namespaces, and created automatically by the API server or manually through API calls. Service accounts are tied to a set of credentials stored as Secrets, which are mounted into pods, allowing in-cluster processes to talk to the Kubernetes API. API requests are tied to either a normal user or a service account, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.

Users

An authentication plugin returns the username and group(s) of the authenticated user. Kubernetes doesn't store that information anywhere; it only uses it to verify whether the user is authorized to perform an action. Kubernetes distinguishes between two kinds of clients connecting to the API server:

  • actual humans (users) - users are meant to be managed by an external system (such as SSO). No resource represents user accounts, which means you can't create, update, or delete users through the API server.
  • Pods (more specifically, applications running inside them) - these use service accounts, which are created and stored in the cluster as ServiceAccount resources.

Groups

Both human users and ServiceAccounts can belong to one or more groups. The authentication plugin returns groups along with the username and user ID. Groups are used to grant permissions to several users at once, instead of having to grant them to individual users. Groups returned by the plugin are nothing but strings representing arbitrary group names, but built-in groups have special meaning:

  • The system:unauthenticated group is used for requests where none of the authentication plugins could authenticate the client
  • The system:authenticated group is automatically assigned to a user who was authenticated successfully
  • The system:serviceaccounts group encompasses all ServiceAccounts in the system
  • The system:serviceaccounts:<namespace> includes all ServiceAccounts in a specific namespace

A single ServiceAccount in Kubernetes is referenced as system:serviceaccount:<namespace>:<ServiceAccountName>

API GROUPS

Resources inside k8s are grouped into two kinds of API groups: core and named.

  • core - core system objects like svc, cm, pod, secrets, etc.
  • named - all new features will be added here

> k get --raw='/api/v1' | jq -r '.resources[].name'
bindings
componentstatuses
configmaps
endpoints
events
limitranges
namespaces
namespaces/finalize
namespaces/status
nodes
nodes/proxy
nodes/status
persistentvolumeclaims
persistentvolumeclaims/status
persistentvolumes
persistentvolumes/status
pods
pods/attach
pods/binding
pods/ephemeralcontainers
pods/eviction
pods/exec
pods/log
pods/portforward
pods/proxy
pods/status
podtemplates
replicationcontrollers
replicationcontrollers/scale
replicationcontrollers/status
resourcequotas
resourcequotas/status
secrets
serviceaccounts
serviceaccounts/token
services
services/proxy
services/status
> k get --raw='/apis' | jq -r '.groups[].name'
apiregistration.k8s.io
apps
events.k8s.io
authentication.k8s.io
authorization.k8s.io
autoscaling
batch
certificates.k8s.io
networking.k8s.io
policy
rbac.authorization.k8s.io
storage.k8s.io
admissionregistration.k8s.io
apiextensions.k8s.io
scheduling.k8s.io
coordination.k8s.io
node.k8s.io
discovery.k8s.io
flowcontrol.apiserver.k8s.io
crd.projectcalico.org
metrics.k8s.io

getting resources from /apis/apps API group

> k get --raw='/apis/apps/v1' | jq -r '.resources[].name'
controllerrevisions
daemonsets
daemonsets/status
deployments
deployments/scale
deployments/status
replicasets
replicasets/scale
replicasets/status
statefulsets
statefulsets/scale
statefulsets/status

Each resource has certain verbs, which are used to manipulate the resource, such as: get, list, create, patch, update, delete, …

Authorization

In K8s we have several authorization modes:

  • Node - used by the kubelet to talk to the API server
  • ABAC - attribute-based access control, driven by an external policy file
  • RBAC - role-based access control
  • Webhook - an HTTP callback to an external authorization service
  • AlwaysAllow - default if no mode is specified
  • AlwaysDeny - denies every request

Configuration is done on the API server with --authorization-mode=Node,RBAC. If nothing is defined, AlwaysAllow is used. When you have multiple modes configured, your request is authorized using each one in the order it is specified. Every time a module denies the request, it goes on to the next one in the chain, and as soon as a module approves the request, no more checks are done and the user is granted permission.

Admission plugins

An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. Admission controllers may be validating, mutating, or both. Mutating controllers may modify the objects related to the requests they admit; validating controllers may not. Admission controllers limit requests that create, delete, or modify objects. Admission controllers can also block custom verbs, such as a request to connect to a Pod via an API server proxy. Admission controllers do not (and cannot) block requests to read (get, watch or list) objects.

enabling

kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger ...

disabling

kube-apiserver --disable-admission-plugins=PodNodeSelector,AlwaysDeny ...

Checking auth and authz

The following commands work only for admins

k get --raw='/api/v1/pods' --as=system:serviceaccount:<namespace>:<ServiceAccountName>

or

k auth can-i get '/api/v1/pods' --as=system:serviceaccount:<namespace>:<ServiceAccountName>

check all permissions using Impersonate-User set to system:serviceaccount:<namespace>:<ServiceAccountName> and optionally Impersonate-Group set to system:serviceaccounts and system:serviceaccounts:<namespace>

k auth can-i --list --as=system:serviceaccount:<namespace>:<ServiceAccountName> [--as-group=system:serviceaccounts] [--as-group=system:serviceaccounts:<namespace>]

Linux namespaces and cgroups

The Linux namespaces used by containers are:

  • mount(mnt)
  • pid
  • network(net)
  • ipc (System V, POSIX message queues)
  • UTS (hostname and NIS domain service)
  • user
  • cgroup (cpu,memory and network resource allocation)

Allowing a user to connect to k8s via a client cert

https://medium.com/better-programming/k8s-tips-give-access-to-your-clusterwith-a-client-certificate-dfb3b71a76fe

Generate a CSR

openssl genrsa -out myuser.key 2048
openssl req -new -key myuser.key -subj "/CN=myuser" -out myuser.csr   # the CN becomes the k8s username

create a CertificateSigningRequest

apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: myuser
spec:
  request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ1ZqQ0NBVDRDQVFBd0VURVBNQTBHQTFVRUF3d0dZVzVuWld4aE1JSUJJakFOQmdrcWhraUc5dzBCQVFFRgpBQU9DQVE4QU1JSUJDZ0tDQVFFQTByczhJTHRHdTYxakx2dHhWTTJSVlRWMDNHWlJTWWw0dWluVWo4RElaWjBOCnR2MUZtRVFSd3VoaUZsOFEzcWl0Qm0wMUFSMkNJVXBGd2ZzSjZ4MXF3ckJzVkhZbGlBNVhwRVpZM3ExcGswSDQKM3Z3aGJlK1o2MVNrVHF5SVBYUUwrTWM5T1Nsbm0xb0R2N0NtSkZNMUlMRVI3QTVGZnZKOEdFRjJ6dHBoaUlFMwpub1dtdHNZb3JuT2wzc2lHQ2ZGZzR4Zmd4eW8ybmlneFNVekl1bXNnVm9PM2ttT0x1RVF6cXpkakJ3TFJXbWlECklmMXBMWnoyalVnald4UkhCM1gyWnVVV1d1T09PZnpXM01LaE8ybHEvZi9DdS8wYk83c0x0MCt3U2ZMSU91TFcKcW90blZtRmxMMytqTy82WDNDKzBERHk5aUtwbXJjVDBnWGZLemE1dHJRSURBUUFCb0FBd0RRWUpLb1pJaHZjTgpBUUVMQlFBRGdnRUJBR05WdmVIOGR4ZzNvK21VeVRkbmFjVmQ1N24zSkExdnZEU1JWREkyQTZ1eXN3ZFp1L1BVCkkwZXpZWFV0RVNnSk1IRmQycVVNMjNuNVJsSXJ3R0xuUXFISUh5VStWWHhsdnZsRnpNOVpEWllSTmU3QlJvYXgKQVlEdUI5STZXT3FYbkFvczFqRmxNUG5NbFpqdU5kSGxpT1BjTU1oNndLaTZzZFhpVStHYTJ2RUVLY01jSVUyRgpvU2djUWdMYTk0aEpacGk3ZnNMdm1OQUxoT045UHdNMGM1dVJVejV4T0dGMUtCbWRSeEgvbUNOS2JKYjFRQm1HCkkwYitEUEdaTktXTU0xMzhIQXdoV0tkNjVoVHdYOWl4V3ZHMkh4TG1WQzg0L1BHT0tWQW9FNkpsYWFHdTlQVmkKdjlOSjVaZlZrcXdCd0hKbzZXdk9xVlA3SVFjZmg3d0drWm89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo=   # base64 encoded myuser.csr generated as `cat myuser.csr | base64 | tr -d "\n"`
  signerName: kubernetes.io/kube-apiserver-client
  expirationSeconds: 86400  # one day
  usages:
  - client auth

approve or deny csr

k certificate approve myuser          # or: k certificate deny myuser
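After approval, the signed certificate can be pulled from the CSR object's status and wired into a kubeconfig; a sketch (the cluster name here is an assumption, adjust it to your kubeconfig):

```shell
# extract the issued certificate from the approved CSR
k get csr myuser -o jsonpath='{.status.certificate}' | base64 -d > myuser.crt
# add the credentials and a context for the new user (cluster name is illustrative)
k config set-credentials myuser --client-key=myuser.key --client-certificate=myuser.crt --embed-certs=true
k config set-context myuser --cluster=kubernetes --user=myuser
```

Remember that RBAC permissions must still be granted to the user (e.g. via a RoleBinding) before it can do anything useful.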

Monitoring

In order to view metrics about the resources pods and containers are using, we need an add-on to collect and provide the data. One such add-on is the Kubernetes Metrics Server. Cluster monitoring is done via the "metrics server". The Kubernetes Metrics Server collects resource metrics from the kubelets in your cluster and exposes those metrics through the Kubernetes API, using an APIService to add new kinds of resources that represent metric readings.

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# - Modify and add "- --kubelet-insecure-tls" in deployment.spec.template.spec.containers.args
k -n kube-system edit deployment metrics-server
# monitoring nodes which shows current CPU and memory usage
k top node
# monitoring pods (optionally per container), sorted by cpu or memory, filtered by label
k top pod --sort-by=[cpu|memory] --selector <LABEL> [--containers]

Monitoring applications is done via "liveness" and "readiness" probes. Cluster logs can be found in /var/log/containers. Application logs can be observed via "k logs [svc|pod|deployment] --container CONTAINER_NAME --previous --selector LABELS".

CustomResourceDefinition

Custom resources are extensions of the Kubernetes API. This page discusses when to add a custom resource to your Kubernetes cluster and when to use a standalone service. It describes the two methods for adding custom resources and how to choose between them.

---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: internals.datasets.kodekloud.com     
spec:
  # group name to use for REST API: /apis/<group>/<version>
  group: datasets.kodekloud.com               
  # list of versions supported by this CustomResourceDefinition
  versions:
    - name: v1                               # version of the crd
      # Each version can be enabled/disabled by Served flag.
      served: true                           
      # One and only one version must be marked as the storage version.
      storage: true                          
      schema:                                # list of objects supported by CRD
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                internalLoad:
                  type: string
                range:
                  type: integer
                percentage:
                  type: string
  # either Namespaced or Cluster
  scope: Namespaced                          
  names:                                     # naming and optional an alias 
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: internals
    # singular name to be used as an alias on the CLI and for display
    singular: internal
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: Internal
    # shortNames allow shorter string to match your resource on the CLI
    shortNames:
    - int

TLS certs

On every pod the following volume is mounted, with info about the namespace, the CA cert, and the token for API server communication.

/var/run/secrets/kubernetes.io/serviceaccount/

On control plane we have TLS server certs for components:

  • apiserver
  • etcdserver
  • kubelet

and client certs:

  • kube-scheduler
  • kube-controller-manager
  • kube-proxy
  • apiserver-etcd-client
  • apiserver-kubelet-client
  • admin
  • kubelet-client

Most of the certs can be found in the /etc/kubernetes/pki folder, except for the kubelet, which stores its certs in /var/lib/kubelet/pki, and kube-proxy, which uses a ServiceAccount to access the apiserver.

Client and Server Certificates

Microservices

Microservices are small, independent services that work together to form a whole application. Many applications are designed with a monolithic architecture, meaning that all parts of the application are combined in one large executable. A microservices architecture breaks the application up into several small services.

Useful projects

Projects:
velero - backup solution
rakkess - RBAC auditing
audit2rbac - RBAC auditing
MetalLB - on-premises load balancer
reloader - reloads pods on a change to a ConfigMap or Secret

How to debug traffic in k8s

https://community.pivotal.io/s/article/How-to-get-tcpdump-for-containers-inside-Kubernetes-pods?language=en_US

How to search a JSON object

k get RESOURCE RESOURCE-NAME -o json | jq -c paths | grep KEY
k get nodes -o json | jq -c 'paths|[.[]|tostring]|join(".")' | grep -i osImage
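The same jq expression works on any JSON document, so the trick can be tried without a cluster:

```shell
# list the dotted path of every key in a JSON document and grep for the one you need
echo '{"status":{"nodeInfo":{"osImage":"Ubuntu 22.04"}}}' \
  | jq -c 'paths|[.[]|tostring]|join(".")' | grep -i osimage
```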

How to patch a deployment manifest

patch-file.json

{
   "spec": {
      "template": {
         "spec": {
            "containers": [
               {
                  "name": "patch-demo-ctr-2",
                  "image": "redis"
               }
            ]
         }
      }
   }
}

The following commands are equivalent:

kubectl patch deployment patch-demo --patch-file patch-file.json
kubectl patch deployment patch-demo --patch '{"spec": {"template": {"spec": {"containers": [{"name": "patch-demo-ctr-2","image": "redis"}]}}}}'

The differences between Docker, containerd, CRI-O and runc

https://www.tutorialworks.com/difference-docker-containerd-runc-crio-oci/

REFERENCES