
Kubernetes Objects are persistent entities in the Kubernetes system. Kubernetes uses these entities to represent the state of your cluster. Specifically, they can describe:
what containerized applications are running (and on which nodes)
the resources available to those applications
the policies around how those applications behave, such as restart policies, upgrades, and fault-tolerance
A Kubernetes object is a “record of intent”–once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you’re effectively telling the Kubernetes system what you want your cluster’s workload to look like; this is your cluster’s desired state.
To work with Kubernetes objects–whether to create, modify, or delete them–you’ll need to use the Kubernetes API. When you use the kubectl command-line interface, for example, the CLI makes the necessary Kubernetes API calls for you. You can also use the Kubernetes API directly in your own programs using one of the Client Libraries.
Every Kubernetes object includes two nested object fields that govern the object’s configuration: the object spec and the object status. The spec, which you must provide, describes your desired state for the object–the characteristics that you want the object to have. The status describes the actual state of the object, and is supplied and updated by the Kubernetes system. At any given time, the Kubernetes Control Plane actively manages an object’s actual state to match the desired state you supplied.
apiVersion: apps/v1 # [REQUIRED] which API group and version of the Kubernetes API you're using to create the object
kind: Deployment # [REQUIRED] what kind of object you want to create (Pod, Service, CronJob, Job, ...)
metadata: # [REQUIRED] data that helps uniquely identify the object, including a `name` string, `UID`, and optional `namespace`
  name: nginx-deployment # name of the object
spec: # [REQUIRED] specification of the object's desired state; the format depends on the object
  selector:
    matchLabels:
      app: nginx
  replicas: 2 # tells the Deployment to run 2 pods matching the template
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9 # image to use
        ports:
        - containerPort: 80
All objects in the Kubernetes REST API are unambiguously identified by a Name and a UID. For non-unique user-provided attributes, Kubernetes provides labels and annotations.
Name - a client-provided string that refers to an object in a resource URL, such as /api/v1/pods/some-name. Only one object of a given kind can have a given name at a time. However, if you delete the object, you can make a new object with the same name.
UID - a Kubernetes system-generated string that uniquely identifies objects. Every object created over the whole lifetime of a Kubernetes cluster has a distinct UID. It is intended to distinguish between historical occurrences of similar entities.
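As a rough illustration (assuming a pod named nginx-pod already exists in the current namespace), you can read both identifiers back from the API server:

```shell
# the name was supplied by the client; the UID was generated by the cluster
kubectl get pod nginx-pod -o jsonpath='{.metadata.name}{"\n"}{.metadata.uid}{"\n"}'
```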
Kubernetes supports multiple virtual clusters backed by the same physical cluster. These virtual clusters are called namespaces. Although namespaces allow you to isolate objects into distinct groups, so that you can operate only on those belonging to a specific namespace, they don't provide any isolation of running objects. In other words, pods from different namespaces can communicate. To prevent this, you should configure inter-namespace network isolation via network policies.
Namespaces are intended for use in environments with many users spread across multiple teams or projects. For clusters with a few to tens of users, you should not need to create or think about namespaces at all. Start using namespaces when you need the features they provide.

Namespaces provide a scope for names. Names of resources need to be unique within a namespace, but not across namespaces. Namespaces cannot be nested inside one another, and each Kubernetes resource can only be in one namespace. Namespaces are a way to divide cluster resources between multiple users (via resource quota). In future versions of Kubernetes, objects in the same namespace will have the same access control policies by default.

It is not necessary to use multiple namespaces just to separate slightly different resources, such as different versions of the same software: use labels to distinguish resources within the same namespace.
Kubernetes starts with three initial namespaces:
default - The default namespace for objects with no other namespace
kube-system - The namespace for objects created by the Kubernetes system
kube-public - This namespace is created automatically and is readable by all users (including those not authenticated). It is mostly reserved for cluster usage, in case some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.

When you create a Service, it creates a corresponding DNS entry. This entry is of the form <service-name>.<namespace-name>.svc.cluster.local, which means that if a container just uses <service-name>, it will resolve to the service which is local to the namespace. This is useful for using the same configuration across multiple namespaces such as Development, Staging and Production. If you want to reach across namespaces, you need to use the fully qualified domain name (FQDN).
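A small sketch of the DNS behavior, assuming a Service named db in a hypothetical staging namespace:

```shell
kubectl create namespace staging
# from a pod inside the staging namespace, the short name resolves locally:
#   curl http://db
# from a pod in any other namespace, use the FQDN:
#   curl http://db.staging.svc.cluster.local
```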
Labels are key/value pairs that are attached to objects, such as pods (labels can be attached to any Kubernetes object). Labels are intended to be used to specify identifying attributes of objects that are meaningful and relevant to users, but do not directly imply semantics to the core system. Labels can be used to organize and to select subsets of objects. Labels can be attached to objects at creation time and subsequently added and modified at any time. Each object can have a set of key/value labels defined, and each key must be unique for a given object. Labels allow for efficient queries and watches and are ideal for use in UIs and CLIs. Non-identifying information should be recorded using annotations.
Labels are key/value pairs. Valid label keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/)
Unlike names and UIDs, labels do not provide uniqueness. In general, we expect many objects to carry the same label(s). Via a label selector, the client/user can identify a set of objects. The label selector is the core grouping primitive in Kubernetes. The API currently supports two types of selectors: equality-based and set-based. A label selector can be made of multiple requirements which are comma-separated. In the case of multiple requirements, all must be satisfied so the comma separator acts as a logical AND (&&) operator. The semantics of empty or non-specified selectors are dependent on the context, and API types that use selectors should document the validity and meaning of them.
Equality- or inequality-based requirements allow filtering by label keys and values. Matching objects must satisfy all of the specified label constraints, though they may have additional labels as well. Three kinds of operators are admitted: =, ==, !=. The first two represent equality (and are simply synonyms), while the latter represents inequality. For example:
environment = production
tier != frontend
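These requirements can be passed to kubectl with the -l flag; a sketch reusing the labels above:

```shell
# pods labeled environment=production whose tier label is anything but frontend
kubectl get pods -l environment=production,tier!=frontend
```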
Set-based label requirements allow filtering keys according to a set of values. Three kinds of operators are supported: in, notin and exists (only the key identifier). For example:
environment in (production, qa)
tier notin (frontend, backend)
partition
!partition
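The same selectors work on the command line (quote them so the shell doesn't interpret the parentheses or the !):

```shell
# pods whose environment is production or qa and which carry a partition label (any value)
kubectl get pods -l 'environment in (production,qa),partition'
# pods without any partition label
kubectl get pods -l '!partition'
```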
The set of pods that a service targets is defined with a label selector. Similarly, the population of pods that a ReplicationController should manage is also defined with a label selector. Label selectors for both objects are defined in json or yaml files using maps, and only equality-based requirement selectors are supported:
selector:
  component: redis
this selector (respectively in json or yaml format) is equivalent to component=redis or component in (redis)
Newer resources, such as Job, Deployment, Replica Set, and Daemon Set, support set-based requirements as well.
selector:
  matchLabels:
    component: redis
  matchExpressions:
  - {key: tier, operator: In, values: [cache]}
  - {key: environment, operator: NotIn, values: [dev]}
matchLabels is a map of {key,value} pairs. A single {key,value} in the matchLabels map is equivalent to an element of matchExpressions, whose key field is “key”, the operator is “In”, and the values array contains only “value”. matchExpressions is a list of pod selector requirements. Valid operators include In, NotIn, Exists, and DoesNotExist. The values set must be non-empty in the case of In and NotIn. All of the requirements, from both matchLabels and matchExpressions are ANDed together – they must all be satisfied in order to match.
One use case for selecting over labels is to constrain the set of nodes onto which a pod can schedule. See the documentation on node selection for more information.
You can use Kubernetes annotations to attach arbitrary non-identifying metadata to objects. Clients such as tools and libraries can retrieve this metadata. You can use either labels or annotations to attach metadata to Kubernetes objects. Labels can be used to select objects and to find collections of objects that satisfy certain conditions. In contrast, annotations are not used to identify and select objects. The metadata in an annotation can be small or large, structured or unstructured, and can include characters not permitted by labels.
Annotations are key/value pairs. Valid annotation keys have two segments: an optional prefix and name, separated by a slash (/). The name segment is required and must be 63 characters or less, beginning and ending with an alphanumeric character ([a-z0-9A-Z]) with dashes (-), underscores (_), dots (.), and alphanumerics between. The prefix is optional. If specified, the prefix must be a DNS subdomain: a series of DNS labels separated by dots (.), not longer than 253 characters in total, followed by a slash (/).
apiVersion: v1
kind: Pod
metadata:
  name: annotations-demo
  annotations:
    imageregistry: "https://hub.docker.com/"
spec:
  containers:
  - name: nginx
    image: nginx:1.7.9
    ports:
    - containerPort: 80
print the annotations that would be added
k annotate OBJECT OBJECT_NAME KEY=VALUE --dry-run=client -o jsonpath='{.metadata.annotations}' | jq
Field selectors let you select Kubernetes resources based on the value of one or more resource fields. Here are some example field selector queries:
metadata.name=my-service
metadata.namespace!=default
status.phase=Pending
Supported field selectors vary by Kubernetes resource type. All resource types support the metadata.name and metadata.namespace fields. Using unsupported field selectors produces an error.
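As a sketch, field selectors are passed via --field-selector and can be chained with commas:

```shell
# pending pods outside the default namespace
kubectl get pods --all-namespaces --field-selector=status.phase=Pending,metadata.namespace!=default
```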
Shared labels and annotations share a common prefix: app.kubernetes.io. Labels without a prefix are private to users. The shared prefix ensures that shared labels do not interfere with custom user labels.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  labels:
    app.kubernetes.io/name: mysql
    app.kubernetes.io/instance: wordpress-abcxzy
    app.kubernetes.io/version: "5.7.21"
    app.kubernetes.io/component: database
    app.kubernetes.io/part-of: wordpress
    app.kubernetes.io/managed-by: helm
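Because these labels are shared, tooling and ad-hoc queries can rely on them; for example, using the labels from the manifest above:

```shell
# everything of these kinds that is part of the wordpress application
kubectl get deployments,statefulsets,services -l app.kubernetes.io/part-of=wordpress
```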
The Kubernetes networking model is a set of standards that define how networking between Pods behaves. There are a variety of implementations of this model, including the Calico networking plugin, which has been used throughout this course.
The Kubernetes network model defines how Pods communicate with each other, regardless of which Node they are running on.
Each pod has its own unique IP address within the cluster. Any Pod can reach any other Pod using that Pod's IP address. This creates a virtual network that allows Pods to easily communicate with each other, regardless of which node they are on.
To make it easier to connect containers into a network, a project called the Container Network Interface (CNI) was started.
CNI plugins are a type of Kubernetes network plugin. These plugins provide network connectivity between Pods according to the standard set by the Kubernetes network model.
The Kubernetes virtual network uses DNS to allow Pods to locate other Pods and Services using domain names instead of IP addresses.
All Pods in our kubeadm cluster are automatically given a domain name of the following form:
pod-ip-address.namespace-name.pod.cluster.local
where the pod-ip-address for 192.168.10.100 becomes 192-168-10-100.
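For example (assuming the pod lives in the default namespace), from inside any other pod in the cluster:

```shell
nslookup 192-168-10-100.default.pod.cluster.local
```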

Component on the master that exposes the Kubernetes API. It is the front end for the Kubernetes control plane. It is designed to scale horizontally; that is, it scales by deploying more instances. It provides a CRUD interface for querying and modifying the cluster state over a RESTful API, and it stores that state in etcd. It also performs validation of request objects and handles optimistic locking, so changes to an object are never overridden by other clients in the event of concurrent updates.

When a client talks to the API server, its requests go through authentication, authorization, admission and validation (only for create, delete and update, not for read) before being stored in etcd. After validation, the API server returns a response to the client.

The API server doesn't tell controllers what to do. All it does is enable those controllers and other components to observe changes to deployed resources. A control plane component can request to be notified when a resource is created, modified, or deleted. This enables the component to perform whatever task it needs in response to a change of the cluster metadata. Clients watch for changes by opening an HTTP connection to the API server. Every time an object is updated, the server sends the new version of the object to all connected clients watching the object. The watch mechanism is also used by the Scheduler.
Consistent and highly available key-value store used as Kubernetes' backing store for all cluster data. Only the API server communicates with etcd; all other components read and write data to etcd indirectly through the API server. Each key in etcd is either a directory, which contains other keys, or a regular key with a corresponding value. etcd v3 doesn't support directories, but because the key format remains the same (keys can include slashes), you can still think of them as being grouped into directories. Kubernetes stores all its data in etcd under /registry. etcd uses the Raft consensus algorithm, which ensures that at any given moment each node's state is either what the majority (quorum) of the nodes agrees is the current state, or one of the previously agreed-upon states. The consensus algorithm requires a majority for the cluster to progress to the next state: for a transition from the previous state to the new one, more than half of the nodes must take part in the state change.
listing etcd data; records are stored under per-namespace keys like /registry/pods/<namespace>
k -n kube-system exec etcd-k8s-control -it -- etcdctl --endpoints=https://172.31.30.119:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key get / --keys-only=true --prefix=true
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on. All it does is wait for newly created pods through the API server's watch mechanism and assign a node to each new pod that doesn't already have the node set. The Scheduler doesn't instruct the selected node (or the Kubelet running on that node) to run the pod. All the Scheduler does is update the pod definition through the API server. The API server then notifies the Kubelet (again, via the watch mechanism) that the pod has been scheduled. As soon as the Kubelet on the target node sees the pod has been scheduled to its node, it creates and runs the pod's containers.
Scheduler selects a node for the pod in a 2-step operation:
The filtering step finds the set of Nodes where it’s feasible to schedule the Pod. For example, the PodFitsResources filter checks whether a candidate Node has enough available resource to meet a Pod’s specific resource requests. After this step, the node list contains any suitable Nodes; often, there will be more than one. If the list is empty, that Pod isn’t (yet) schedulable.
In the scoring step, the scheduler ranks the remaining nodes to choose the most suitable Pod placement. The scheduler assigns a score to each Node that survived filtering, basing this score on the active scoring rules.
Finally, kube-scheduler assigns the Pod to the Node with the highest ranking. If there is more than one node with equal scores, kube-scheduler selects one of these at random.
Factors taken into account for scheduling decisions include individual and collective resource requirements, hardware/software/policy constraints, affinity and anti-affinity specifications, data locality, inter-workload interference and deadlines.
Component on the master that runs controllers. Logically, each controller is a separate process, but to reduce complexity, they are all compiled into a single binary and run in a single process.
Holds info about the CIDR range for pods in --cluster-cidr and the CIDR range for services in --service-cluster-ip-range (although the latter setting is present in kube-apiserver as well).
These controllers include:
Replication Manager - responsible for ReplicationController resources. It works in an infinite loop where, in each iteration, the controller finds the number of pods matching its pod selector and compares that number to the desired replica count. When too few pod instances are running, it runs additional instances: it creates new Pod manifests, posts them to the API server, and lets the Scheduler and the Kubelet do their job of scheduling and running the pods. Thus, it performs its work by manipulating Pod API objects through the API server. This is how all controllers operate.
Deployment Controller - performs a rollout of a new version each time a Deployment object is modified (if the modification should affect the deployed pods). It does this by creating a ReplicaSet and then appropriately scaling both the old and the new ReplicaSet based on the strategy specified in the Deployment, until all the old pods have been replaced with new ones. It doesn't create any pods directly.
StatefulSet Controller - similar to the ReplicaSet controller and other related controllers; it creates, manages, and deletes Pods according to the spec of a StatefulSet resource. It also instantiates and manages a PVC for each Pod instance.
Node Controller - responsible for noticing and responding when nodes go down.
Replication Controller - responsible for maintaining the correct number of pods for every replication controller object in the system.
Endpoints Controller - populates the Endpoints object (that is, joins Services & Pods). It watches both Services and Pods. When Services are added or updated, or Pods are added, updated, or deleted, it selects Pods matching the Service's pod selector and adds their IPs and ports to the Endpoints resource.
Namespace Controller - when a namespace resource is deleted, all the resources in that namespace must also be deleted.
Service Account & Token Controllers - create default accounts and API access tokens for new namespaces.
PersistentVolume Controller - once a user creates a PVC, Kubernetes must find an appropriate PV and bind it to the claim. Kubernetes looks for the best match for the claim by selecting the smallest PV with an access mode matching the one requested in the claim and a declared capacity above the capacity requested in the claim.
Others - there's a controller for almost every resource you can create.

Resources are descriptions of what should be running in the cluster, whereas the controllers are the active Kubernetes components that perform actual work as a result of the deployed resources. After a controller updates a resource in the API server, the Kubelet and kube-proxy perform their work, such as spinning up a pod's containers and attaching network storage to them, or in the case of services, setting up the actual load balancing across pods.
This component is also responsible for signing certificates for the whole k8s cluster.
The cloud-controller-manager is a Kubernetes control plane component that embeds cloud-specific control logic. The cloud controller manager lets you link your cluster into your cloud provider’s API, and separates out the components that interact with that cloud platform from components that only interact with your cluster.
An agent that runs on each node in the cluster. It makes sure that containers are running in a pod. The kubelet takes a set of PodSpecs that are provided through various mechanisms and ensures that the containers described in those PodSpecs are running and healthy. The kubelet doesn't manage containers which were not created by Kubernetes. Its initial job is to register the node it's running on by creating a Node resource in the API server. Then it needs to continuously monitor the API server for Pods that have been scheduled to the node, and start the pod's containers. It does this by telling the configured container runtime (which is Docker, rkt, or something else) to run a container from a specific container image. The Kubelet then constantly monitors running containers and reports their status, events and resource consumption to the API server. The Kubelet is also the component that runs the container liveness probes, restarting containers when the probes fail. Lastly, it terminates containers when their Pod is deleted from the API server and notifies the server that the pod has terminated. Although the Kubelet talks to the Kubernetes API server and gets the pod manifests from there, it can also run pods based on pod manifest files in a specific local directory.
kube-proxy is a network proxy that runs on each node in the cluster. It is configured as a DaemonSet whose manifest is not stored on disk; it is kept in a ConfigMap in Kubernetes. It enables the Kubernetes service abstraction by maintaining network rules on the host and performing connection forwarding. kube-proxy is responsible for request forwarding. kube-proxy allows TCP and UDP stream forwarding or round-robin TCP and UDP forwarding across a set of backends. Besides watching the API server for changes to Services, kube-proxy also watches for changes to Endpoints objects.
The container runtime is the software that is responsible for running containers. Kubernetes supports several container runtimes: Docker, containerd, cri-o, rktlet and any implementation of the Kubernetes CRI (Container Runtime Interface)
In Kubernetes almost every object is in some namespace. Kubernetes supports multiple virtual clusters backed by the same physical cluster; these virtual clusters are called namespaces. For a small environment with a small number of users, multiple namespaces are not needed. Kubernetes starts with three initial namespaces:
default - The default namespace for objects with no other namespace
kube-system - The namespace for objects created by the Kubernetes system
kube-public - This namespace is created automatically and is readable by all users (including those not authenticated). It is mostly reserved for cluster usage, in case some resources should be visible and readable publicly throughout the whole cluster. The public aspect of this namespace is only a convention, not a requirement.
Not all objects are inside namespaces:
# In a namespace
kubectl api-resources --namespaced=true
# Not in a namespace
kubectl api-resources --namespaced=false
Change default namespace to NAMESPACE_NAME
k config set-context --current --namespace=NAMESPACE_NAME
Pods are the smallest and most basic building block of the Kubernetes model. A pod consists of one or more containers, storage resources, and a unique IP address in the Kubernetes network. It will always run on the same worker node and in the same Linux namespace(s). In order to run containers, Kubernetes schedules pods to run on servers in the cluster. When a pod is scheduled, the server will run the containers that are part of the pod.
The IP address of the pod belongs to a network namespace. Every container inside the pod shares that single network namespace. All containers in a pod share Linux namespaces like network, UTS (hostname), IPC and PID (not yet), except the filesystem. The filesystem is isolated per container by default, but it can be shared between containers via volumes. All containers in the same pod share the same cgroup limits.
In summary, pods are logical hosts and behave much like physical hosts or VMs in the non-container world. Processes running in the same pod are like processes running on the same physical or virtual machine, except that each process is encapsulated in a container.
create pod
k run nginx-pod --image=nginx --env="ENV=dev" --port=80 --dry-run=client -o yaml -l env=dev
will be created as
apiVersion: v1 # API version
kind: Pod # type of object
metadata: # metadata object
  creationTimestamp: null
  labels:
    env: dev
  name: nginx-pod # name of the pod
spec: # spec object
  containers:
  - env:
    - name: ENV
      value: dev
    image: nginx # image to use
    name: nginx-pod # container name, same as metadata.name unless it is a multi-container pod
    ports:
    - containerPort: 80 # can be omitted as this is purely informational
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
Every pod gets its own IP address which is routable on the pod network, which means that every pod can communicate with any other pod.
Containers inside a pod communicate via the localhost interface and the relevant port.
One pod gets scheduled to one node. You define it in a manifest file (e.g. YAML), then you submit that manifest to the API server and the pod gets scheduled to a node. Once it's scheduled to a node, it goes into the Pending state while the node downloads images and fires up the containers. Importantly, it stays in the Pending state until all containers are up and ready. Once that's done, it goes into the Running state. Then, once it has finished everything it was created to do, it gets shut down and the state changes to Succeeded. If it can't start, for whatever reason, it can remain in the Pending state, or eventually go to the Failed state, which hopefully won't happen too often. There will be no case where some pods were deployed and some not.

Pending - the Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network.
Running - the Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting.
Succeeded - all containers in the Pod have terminated in success, and will not be restarted.
Failed - all containers in the Pod have terminated, and at least one container has terminated in failure. That is, the container either exited with non-zero status or was terminated by the system.
Unknown - for some reason the state of the Pod could not be obtained. This phase typically occurs due to an error in communicating with the node where the Pod should be running.

Container states:
Waiting - if a container is not in either the Running or Terminated state, it is Waiting. A container in the Waiting state is still running the operations it requires in order to complete start up: for example, pulling the container image from a container image registry, or applying Secret data. When you use kubectl to query a Pod with a container that is Waiting, you also see a Reason field to summarize why the container is in that state.
Running - the Running status indicates that a container is executing without issues. If there was a postStart hook configured, it has already executed and finished. When you use kubectl to query a Pod with a container that is Running, you also see information about when the container entered the Running state.
Terminated - a container in the Terminated state began execution and then either ran to completion or failed for some reason. When you use kubectl to query a Pod with a container that is Terminated, you see a reason, an exit code, and the start and finish time for that container's period of execution. If a container has a preStop hook configured, this hook runs before the container enters the Terminated state.

Pod conditions:
PodScheduled - the Pod has been scheduled to a node.
ContainersReady - all containers in the Pod are ready.
Initialized - all init containers have completed successfully.
Ready - the Pod is able to serve requests and should be added to the load balancing pools of all matching Services.

When a pod is running you can observe an additional container as part of it, with the command PAUSE. This pause container is the container that holds all the containers of a pod together. The pause container is an infrastructure container whose sole purpose is to hold all these namespaces. All other user-defined containers of the pod then use the namespaces of the pod infrastructure container. Actual application containers may die and get restarted. When such a container starts up again, it needs to become part of the same Linux namespaces as before. The infrastructure container makes this possible since its lifecycle is tied to that of the pod: the container runs from the time the pod is scheduled until the pod is deleted. If the infrastructure container is killed in the meantime, the Kubelet recreates it and all the pod's containers.
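You can inspect the phase and conditions directly; a sketch assuming a pod named nginx-pod exists:

```shell
kubectl get pod nginx-pod -o jsonpath='{.status.phase}{"\n"}'
# one TYPE=STATUS line per condition (PodScheduled, Initialized, ContainersReady, Ready)
kubectl get pod nginx-pod -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
```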
kubectl run busybox --image=busybox --restart=Never -- sleep 1000
A pod with more than one container is a multi-container pod. In a multi-container Pod, the containers share resources such as network and storage. They can interact with one another, working together to provide functionality.
Keep containers in separate Pods unless they need to share resources.
A Pod that is managed directly by the kubelet on a node, not by the K8s API server. Static Pods can run even if there is no K8s API server present.
The kubelet automatically creates Pods (only Pods, nothing else) from YAML manifest files located in the manifest path on the node (/etc/kubernetes/manifests/).
The kubelet will create a mirror Pod for each static Pod. Mirror Pods allow you to see the status of the static Pod via the K8s API, but you cannot change or manage them via the API.
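A minimal sketch of creating a static pod on a node (the file name static-web.yaml is illustrative; run this on the node itself):

```shell
# the kubelet watches this directory and starts the pod automatically
cat <<EOF | sudo tee /etc/kubernetes/manifests/static-web.yaml
apiVersion: v1
kind: Pod
metadata:
  name: static-web
spec:
  containers:
  - name: web
    image: nginx
EOF
# the mirror Pod shows up in the API with the node name appended, e.g. static-web-<node-name>
kubectl get pods
```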
Containers sharing the same Pod can interact with one another using shared resources.
One example is an additional container (sidecar) that reads the log file from a shared volume and prints it to the console so the log output appears in the container log.
Sidecar use cases include a main container that is a web server, with additional (sidecar) containers used to periodically download files, rotate logs, process data, and so on.
sidecar example
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: sidecar-pod
  name: sidecar-pod
spec:
  containers:
  - command:
    - sh
    - -c
    - while true;do echo logs data > /output/output.log;sleep 5;done
    image: busybox
    name: sidecar-pod
    resources: {}
    volumeMounts:
    - name: sharedvol
      mountPath: /output
  - name: sidecar
    image: busybox
    command:
    - sh
    - -c
    - tail -f /input/output.log
    volumeMounts:
    - name: sharedvol
      mountPath: /input
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: sharedvol
    emptyDir: {}
status: {}
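Assuming the manifest above is saved as sidecar-pod.yaml (the file name is illustrative), you can verify that the sidecar is relaying the log output:

```shell
kubectl apply -f sidecar-pod.yaml
# -c selects the sidecar container; it should keep printing "logs data"
kubectl logs -f sidecar-pod -c sidecar
```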
Init containers are containers that run once during the startup process of a pod. A pod can have any number of init containers, and they will each run once (in order) to completion. You can use init containers to perform a variety of startup tasks. They can contain and use software and setup scripts that are not needed by your main containers. They are often useful in keeping your main containers lighter and more secure by offloading startup tasks to a separate container.
Use cases include startup tasks that your main containers don't need, such as waiting for another service to become available or performing one-time setup.
an example
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: init-pod
  name: init-pod
spec:
  containers:
  - image: nginx
    name: init-pod
    resources: {}
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  initContainers:
  - name: delay
    image: busybox
    command:
    - sh
    - -c
    - sleep 30
  - name: delay2
    image: busybox
    command:
    - sh
    - -c
    - echo "this is 2nd init container"
status: {}
When you want to talk to a specific pod without going through a service, you can use port-forward. It works with pods, deployments, and services. In the background it uses the socat utility.
k port-forward <pod|deployment|service>/<NAME> --address=localhost LOCAL_PORT:REMOTE_PORT
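For example, to reach port 80 of a pod on local port 8080 (the pod name here is hypothetical):
k port-forward pod/my-nginx 8080:80
# in another terminal the pod is now reachable locally
curl http://localhost:8080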
Scheduling is just the process of assigning pods to Kubernetes nodes so that kubelets can run them. Whenever we create a pod, something has to determine which node to run that pod on, and that is the scheduling process.
Scheduler - control plane component that handles scheduling.
Scheduling process
The Kubernetes scheduler selects a suitable Node for each Pod. It takes into account:
You can configure a nodeSelector for your Pods to limit which Node(s) the Pod can be scheduled on.
Node selectors use node labels to filter suitable nodes.
adding label to the node
k label nodes <NODE_NAME> KEY=VALUE
removing label from node
k label nodes <NODE_NAME> KEY-
list all pods with the env label
k get pods -l env
list all pods without the env label
k get pods -l '!env'
k get pod -l 'creation_method!=manual' - select pods with the creation_method label set to any value other than manual
k get pod -l 'env in (prod,dev)' - select pods with the env label set to prod or dev
k get pod -l 'env notin (prod,devel)' - select pods with the env label set to any value other than prod or devel
k get pod -l 'env=debug,creation_method=manual' - select all pods with both labels
define nodeSelector for pod
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: busybox
spec:
  containers:
  - image: nginx
    name: nodeselector-pod
    resources: {}
  nodeSelector:
    KEY: VALUE
  dnsPolicy: ClusterFirst
  restartPolicy: Always
status: {}
You can bypass scheduling and assign a Pod to a specific Node by name using nodeName.
apiVersion: v1
kind: Pod
metadata:
  name: nginx
spec:
  containers:
  - name: nginx
    image: nginx
  nodeName: kube-01 # assign pod to `kube-01` node
In Kubernetes, when specifying a container, you can choose to override both ENTRYPOINT (the executable that's executed inside the container) and CMD (the arguments passed to the executable). To do that, you set the properties command (for ENTRYPOINT) and args (for CMD) in the container specification.
an example
kind: Pod
spec:
  containers:
  - image: some/image
    command: ["/bin/command"] # docker ENTRYPOINT definition
    args: ["arg1","arg2","arg3"] # docker CMD definition
The command and args fields can’t be updated after the pod is created
The list of environment variables can’t be updated after the pod is created
an example
spec:
  containers:
  - name: some-name # container names cannot contain slashes
    image: some/image
    env:
    - name: some_name
      value: "some_value"
Having values effectively hardcoded in the pod definition means you need separate pod definitions for your production and your development pods. To reuse the same pod definition in multiple environments, it makes sense to decouple the configuration from the pod descriptor. In other words, you should use a ConfigMap.
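A minimal sketch of that decoupling (names and values are illustrative): the env value comes from a ConfigMap instead of being hardcoded in the pod definition.
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config # hypothetical name
data:
  some_name: "some_value"
---
apiVersion: v1
kind: Pod
metadata:
  name: configured-pod
spec:
  containers:
  - name: main
    image: some/image
    env:
    - name: some_name
      valueFrom:
        configMapKeyRef: # pull the value from the ConfigMap instead of hardcoding it
          name: app-config
          key: some_name
You can then keep one ConfigMap per environment while reusing the same pod definition.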
Applies to all containers inside the pod. K8s can automatically restart containers when they fail. Restart policies allow you to customize this behavior by defining when you want a pod's containers to be automatically restarted. There are three possible values for a pod's restart policy in k8s:
Always - default (not applicable for a Job). Use this policy for applications that should always be running.
OnFailure - restarts containers only if the container process exits with an error code or the container is determined to be unhealthy by a liveness probe. If the pod phase is Completed (Succeeded), the pod will not be restarted. Use this policy for applications that need to run successfully and then stop.
Never - the pod's containers will never be restarted, even if the container exits or a liveness probe fails. Use this for applications that should run once and never be automatically restarted.
pod.spec.containers.imagePullPolicy - Always is recommended in order to not store images on the nodes
pod.spec.imagePullSecrets - secret for logging in to a private registry
imagePullSecrets can also be attached to a ServiceAccount in the namespace.
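A sketch of referencing imagePullSecrets from a pod spec (the registry and secret names are hypothetical; the secret itself would be created separately, e.g. with k create secret docker-registry):
apiVersion: v1
kind: Pod
metadata:
  name: private-image-pod
spec:
  containers:
  - name: main
    image: registry.example.com/private/app # image in a private registry
    imagePullPolicy: Always
  imagePullSecrets:
  - name: regcred # secret holding the registry credentials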
Defining security contexts allows you to lock down your containers so that only certain processes can do certain things. This ensures the stability of your containers and lets you grant control or take it away.
pod.spec.containers.securityContext - various options such as ['runAsUser', 'runAsNonRoot', 'privileged', 'add: SYS_TIME', 'fsGroup', 'readOnlyRootFilesystem']
runAsUser - specify the user (as a UID) that will be used in the container
runAsNonRoot - the container will only be allowed to run as a non-root user
privileged - get full access to the node's kernel
capabilities - finer control over what is allowed or denied in the container. Defining capabilities is a better approach than giving full privileges with privileged: true
readOnlyRootFilesystem - prevents writing to the root filesystem; writing to mounted volumes is still allowed
Several of these options can also be set at the pod level (through the pod.spec.securityContext property). They serve as a default for all the pod's containers but can be overridden at the container level. The pod-level security context also allows you to set additional properties.
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: alpine
  name: alpine
spec:
  containers:
  - args:
    - /bin/sleep
    - "9999"
    image: alpine
    name: main
    resources: {}
    securityContext:
      readOnlyRootFilesystem: true
      runAsUser: 405 # run as the guest user
      runAsNonRoot: true # the container will run as a non-root user, or not run at all if USER is not set in the Dockerfile
      privileged: true # allow access to the node's kernel
      capabilities:
        add: # capabilities can be added or dropped
        - SYS_TIME # allow the container to change the node's date and time. Linux capabilities are usually prefixed with `CAP_` but in pod.spec you must leave out the prefix.
        drop:
        - CHOWN # deny this container the ability to change file ownership
  dnsPolicy: ClusterFirst
  restartPolicy: Never
status: {}
When you use the runAsUser property and have multiple containers in the pod that share volumes, you must use the fsGroup and supplementalGroups properties in pod.spec.securityContext. In such a setup, volumes are owned by the fsGroup ID, and files created under those volumes are owned by the runAsUser ID and the fsGroup ID. Files created by runAsUser in other locations (not in the volume) have their owner and group set to runAsUser and 0.
spec:
  securityContext:
    fsGroup: 555 # fsGroup and supplementalGroups are defined in the security context at the pod level
    supplementalGroups: [666,777]
  containers:
  - command: ["/bin/sleep", "99999"]
    image: alpine
    name: first
    resources: {}
    securityContext:
      runAsUser: 1111 # the first container runs as user ID 1111
    volumeMounts:
    - name: shared-volume # both containers use the same volume
      readOnly: false
      mountPath: /volume
  - name: second
    image: alpine
    command: ["/bin/sleep", "99999"]
    securityContext:
      runAsUser: 2222 # the second container runs as user ID 2222
    volumeMounts:
    - name: shared-volume
      readOnly: false
      mountPath: /volume
  dnsPolicy: ClusterFirst
  restartPolicy: Never
  volumes:
  - name: shared-volume
    emptyDir: {}
Certain pods (usually system pods) need to operate in the host's default namespaces, allowing them to see and manipulate node-level resources and devices. For example, a pod may need to use the node's network adapters instead of its own virtual network adapters. This can be achieved by setting the hostNetwork property in the pod spec to true. When the Kubernetes Control Plane components are deployed as pods (such as when you deploy your cluster with kubeadm), you will find that those pods use the hostNetwork option, effectively making them behave as if they weren't running inside a pod.
A related feature allows pods to bind to a port in the node's default namespace while still having their own network namespace. This is done by using the hostPort property in one of the container's ports defined in the spec.containers.ports field.
spec:
  hostPID: true # the pod uses the host's PID namespace
  hostIPC: true # the pod also uses the host's IPC namespace
  containers:
  - image: edesibe/kubia
    name: kubia
    ports:
    - containerPort: 8080 # the container can be reached on port 8080 of the pod's IP
      hostPort: 9000 # it can also be reached on port 9000 of the node it's deployed on
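For completeness, a minimal sketch of the hostNetwork option mentioned above (the pod name is illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: host-network-pod
spec:
  hostNetwork: true # the pod uses the node's network namespace instead of its own
  containers:
  - name: main
    image: nginx # listens directly on the node's port 80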
list all pods
kubectl get pods -o <output> --sort-by <JSONPATH> --selector <selector> --field-selector=metadata.name=podname
where
-o: set output format
--sort-by: sort output using a JSONPath expression
--selector: filter results by label
--show-labels=true: list all labels
--field-selector: filter based on one or more resource fields
get all events for pod/busybox
k get events --field-selector=involvedObject.kind=Pod,involvedObject.name=busybox --sort-by .metadata.creationTimestamp
describe pod status
kubectl describe pods ${pod_name}
check pod phase for pod
k get pod -o jsonpath='{range .items[*]}{@.metadata.name}{" "}{@.status.phase}{"\n"}{end}'
k get pod <POD_NAME> -o jsonpath='podName:{@.metadata.name} podPhase:{@.status.phase} containerStatus:{@.status.containerStatuses[*].state..reason}{"\n"}'
create new resource
k create -f <file>
create if not exists or update resource
k apply -f <file>
When you use the k apply command (not the case for k create/run), Kubernetes adds an annotation with the content of the last applied configuration, stored as resource.metadata.annotations.kubectl.kubernetes.io/last-applied-configuration. This field is used for comparison with the next versions of the object.
run command inside container
k exec -it <pod name> -c <container name> -- <command>
delete pod
kubectl delete pod ${pod_name}
delete all pods in current namespace
k delete pod --all
delete all resources in current namespace (pod,deployment,svc)
k delete all --all
Communication between pods on the same node stays local to the node; communication between pods on different nodes goes through the CNI. The CNI is responsible for:
nsenter -t PID -n ip a - execute `ip a` inside the pod's network namespace
nsenter -t PID -a - enter all of the pod's namespaces
/proc/{PID}/root - root filesystem of the pod
/proc/{PID}/mountinfo - what is mounted in the pod
ctr - client for containerd
nerdctl - Docker-compatible CLI for containerd
crictl - client for CRI
Here is an example of a manifest for a pod as pod.yml
# apiVersion defines the versioned schema of this representation of an object.
apiVersion: v1
# Kind is a string value representing the REST resource this object represents (Pod, Service, ReplicationController, Namespace, Node, ...)
kind: Pod
# Standard object's metadata
metadata:
  name: busybox-sleep
  labels:
    zone: prod
    version: v1
# Specification of the desired behavior of the pod.
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "1000000"
An abstract way to expose an application running on a set of Pods as a network service.
No need to modify your application to use an unfamiliar service discovery mechanism. Kubernetes gives pods their own IP addresses and a single DNS name for a set of pods, and can load-balance across them. The set of Pods targeted by a Service is usually determined by a (label) selector.
Behind the scenes the routing logic is accomplished by kube-proxy configuring iptables rules. Kube-proxy reads the spec from the service manifest and then creates iptables rules to satisfy the desired state (i.e. the service spec). Most of the time the iptables statistic module with random mode is used. Load balancing with iptables.
When a service is created in the API server, the virtual IP address is assigned to it immediately. Soon afterward, the API server notifies all kube-proxy agents running on the worker nodes that a new Service has been created. Then, each kube-proxy makes that service addressable on the node it's running on. It does this by setting up a few iptables rules, which make sure each packet destined for the service IP/port pair is intercepted and its destination address modified, so the packet is redirected to one of the pods backing the service.
The Service IP range is defined by the kube-apiserver --service-cluster-ip-range flag (default: 10.0.0.0/24).
Don't forget that in each container, Kubernetes automatically exposes environment variables for each service in the same namespace. These environment variables are auto-injected configuration.
A headless service uses clusterIP: None, so it doesn't allocate a cluster IP from the cluster IP range. When you perform a DNS lookup for a regular service, the DNS server returns a single IP - the service's cluster IP. For a headless service, the lookup instead returns the IPs of all the pods backing it. In addition, no iptables rules are added for a headless svc.
check iptables rules for headless svc
sudo iptables -S -t nat | grep <headless-svc-name>
When you want the DNS lookup mechanism to also return unready pods, add svc.spec.publishNotReadyAddresses: true to the headless service manifest. By default only ready pods are resolved.
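A sketch of a headless service that also resolves unready pods (the service name is illustrative):
apiVersion: v1
kind: Service
metadata:
  name: kubia-headless-all
spec:
  clusterIP: None
  publishNotReadyAddresses: true # DNS also returns the addresses of pods that are not ready
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia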
The DNS name of a service has the format <servicename>.<namespace>.svc.cluster-domain.example, while a pod can be reached via <pod-ip-with-dashes-instead-of-dots>.<namespace>.pod.cluster-domain.example. The default cluster domain is cluster.local. A Service's fully qualified domain name can be used to reach the service from within any Namespace in the cluster. However, Pods within the same Namespace can simply use the service name instead of the FQDN. Pods need to specify <servicename>.<namespace> if they want to reach a service in a different namespace.
The reason for this behavior is that K8s adds the following records by default to each pod.
bash-5.1# cat /etc/resolv.conf
search <namespace>.svc.cluster.local svc.cluster.local cluster.local
which can resolve all services in the same namespace; for services in different namespaces you need to add the subdomain as well.
example of service.yaml
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-nodeport
  name: kubia-nodeport
spec:
  sessionAffinity: ClientIP # if you want all requests made by a certain client to be redirected to the same pod every time (default is None)
  ports:
  - name: http # when creating a service with multiple ports, you must specify a name for each port
    nodePort: 30080 # port on the node (which holds the pods). Used for NodePort and LoadBalancer services
    port: 80 # port where the service is listening
    protocol: TCP
    targetPort: 8080 # port on the pod/container; it must match the port the container listens on. It is the port used for traffic from the service to the target
  selector:
    app: kubia # selector for the pods. All pods with app=kubia will be part of this service
  type: NodePort
status:
  loadBalancer: {}
cloud_user@k8s-control:~$ k get svc kubia-nodeport
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubia-nodeport NodePort 10.104.167.121 <none> 80:30080/TCP 7m42s
So the kubia-nodeport service is accessible via:
10.104.167.121:80
<ANY_NODE_IP>:30080
A label selector is used to map pods to a service. The same method is used for RC, RS, Deployment, and DS.
You can give a name to each pod’s port and refer to it by name in the service spec.
pod manifest
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: kubia
    ports:
    - name: http
      containerPort: 8080 # container's port 8080 is called http
    - name: https
      containerPort: 8443 # port 8443 is called https
svc manifest
apiVersion: v1
kind: Service
spec:
  ports:
  - name: http
    port: 80 # port 80 is mapped to the container's port called http
    targetPort: http
  - name: https
    port: 443 # port 443 is mapped to the container's port whose name is https
    targetPort: https
When a pod is started, Kubernetes initializes a set of environment variables pointing to each service that exists at that moment. If you create the service before creating the client pods, processes in those pods can get the IP address and port of the service by inspecting their environment variables. Each service thus has env variables populated inside the pod's containers as <SVC_NAME>_SERVICE_HOST and <SVC_NAME>_SERVICE_PORT for the IP and port used by the service.
Endpoints are the backend entities to which Services route traffic. For a Service that routes traffic to multiple Pods, each Pod has an endpoint associated with the Service. Pods are included as endpoints of a service if their labels match the service's pod selector. An Endpoints resource is a list of IP addresses and ports exposing a service. When a client connects to a service, the service proxy selects one of those IP and port pairs and redirects the incoming connection to the server listening at that location.
If you create a service without a pod selector, Kubernetes won't even create the Endpoints resource (after all, without a selector, it can't know which pods to include in the service). It's up to you to create the Endpoints resource to specify the list of endpoints for the service.
Services most commonly abstract access to Kubernetes Pods, but they can also abstract other kinds of backends. For example:
In any of these scenarios you can define a Service without a Pod selector. For example:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  ports:
  - protocol: TCP
    port: 80
    targetPort: 9376
Because this Service has no selector, the corresponding Endpoints object is not created automatically. You can manually map the Service to the network address and port where it's running by adding an Endpoints object yourself:
apiVersion: v1
kind: Endpoints
metadata:
  name: my-service
subsets:
- addresses:
  - ip: 192.0.2.42
  ports:
  - port: 9376
For some parts of your application (for example, frontends) you may want to expose a Service onto an external IP address, that’s outside of your cluster. Kubernetes ServiceTypes allow you to specify what kind of Service you want. The default is ClusterIP.
Type values and their behaviors are:
ClusterIP - Exposes the Service on a cluster-internal IP. Choosing this value makes the Service only reachable from within the cluster. This is the default ServiceType. Behind the scenes, iptables rules are used with statistic mode random on each Node that has pods.
NodePort - Exposes the Service on each Node's IP at a static port (the NodePort). A ClusterIP Service, to which the NodePort Service routes, is automatically created. You'll be able to contact the NodePort Service from outside the cluster by requesting <NodeIP>:<NodePort>. Behind the scenes the iptables rules are the same as for ClusterIP, plus additional rules for the defined port. The default range for NodePort is 30000-32767.
LoadBalancer - A NodePort service with an additional infrastructure-provided load balancer. Exposes the Service externally using a cloud provider's load balancer. NodePort and ClusterIP Services, to which the external load balancer routes, are automatically created. On AWS, the LB redirects traffic to the NodePort service in k8s via the defined nodePort.
ExternalName - Maps the Service to the contents of the externalName field (e.g. foo.bar.example.com) by returning a CNAME record with its value. No proxying of any kind is set up. No cluster IP is assigned, as this object is implemented solely at the DNS level; a simple CNAME DNS record is created for the service.
apiVersion: v1
kind: Service
metadata:
  name: kubia
spec:
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-nodeport
  name: kubia-nodeport
spec:
  ports:
  - name: 80-8080
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: kubia
  type: NodePort
status:
  loadBalancer: {}
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-loadbalancer
  name: kubia-loadbalancer
spec:
  ports:
  - name: kubia-http
    port: 80
    protocol: TCP
    targetPort: 8080
  selector:
    app: kubia
  type: LoadBalancer
status:
  loadBalancer: {}
apiVersion: v1
kind: Service
metadata:
  creationTimestamp: null
  labels:
    app: kubia-external
  name: kubia-external
spec:
  externalName: kubia.example.com
  type: ExternalName
status:
  loadBalancer: {}
apiVersion: v1
kind: Service
metadata:
  name: kubia-headless
spec:
  clusterIP: None
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia
You can also use Ingress to expose your Service. Ingress is not a Service type, but it acts as the entry point for your cluster. It lets you consolidate your routing rules into a single resource, as it can expose multiple services under the same IP address. It operates at the HTTP level (network layer 7) and can thus offer more features than layer 4 services can.
Each service inside the cluster can be found via FQDN <servicename>.<namespace>.svc.cluster.local.
When an external client connects to a service through the node port (this also includes cases when it goes through the load balancer first), the randomly chosen pod may or may not be running on the same node that received the connection. An additional network hop is required to reach the pod, which may not always be desirable. You can prevent this additional hop by configuring the service to redirect external traffic only to pods running on the node that received the connection, via svc.spec.externalTrafficPolicy: Local.
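A sketch of such a service (the name is illustrative):
apiVersion: v1
kind: Service
metadata:
  name: kubia-local
spec:
  type: NodePort
  externalTrafficPolicy: Local # route external traffic only to pods on the node that received it
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: kubia
Note that with the Local policy, a node without a matching pod will not forward the connection, so the external load balancer needs to health-check the nodes.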
An Ingress is a Kubernetes object that manages external access to Services in the cluster. An Ingress is capable of providing more functionality than a simple NodePort Service, such as SSL termination, advanced load balancing, name-based virtual hosting, cookie-based session affinity, and more.
Each LoadBalancer service requires its own load balancer with its own public IP address, whereas an Ingress only requires one, even when providing access to dozens of services. When a client sends an HTTP request to the Ingress, the host and path in the request determine which service the request is forwarded to.
Ingress objects actually do nothing by themselves. In order for Ingresses to do anything, you must install one or more Ingress controllers. Unlike other types of controllers, which run as part of the kube-controller-manager binary, Ingress controllers are not started automatically with a cluster. Ingresses define a set of routing rules. A routing rule's properties determine which requests it applies to. Each rule has a set of paths, each with a backend. Requests matching a path are routed to its associated backend.
An Ingress controller runs a reverse proxy server (like nginx) and keeps it configured according to the Ingress, Service, and Endpoints resources defined in the cluster. The controller thus needs to observe those resources (again, through the watch mechanism) and change the proxy server's config every time one of them changes. Although the Ingress resource's definition points to a Service, Ingress controllers forward traffic to the service's pods directly instead of going through the service IP.
Here, the ingress controller is defined as a service (ingress-nginx-controller) and a deployment (ingress-nginx-controller) that uses nginx as the proxy. When a user creates an Ingress resource, the deployment's pods are reconfigured by populating the related nginx.conf. The ingress-nginx-controller Service is created as a LoadBalancer service inside k8s and as a CLB on AWS. This is the main entry point for external requests to AWS and then the k8s cluster. Every external request targets this CLB and then enters the k8s cluster via the LoadBalancer service. The nginx pods then route traffic to the related services based on the Ingress rules.
Ingress as example
❯ k get ing,svc,pod -o wide
NAME CLASS HOSTS ADDRESS PORTS AGE
ingress.networking.k8s.io/kubia nginx kubia.dectech-labs.com ab01eafaaf542430fb1db2cdfd7b2439-426635961.us-west-1.elb.amazonaws.com 80 130m
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/kubernetes ClusterIP 100.64.0.1 <none> 443/TCP 21h <none>
service/kubia ClusterIP 100.64.236.166 <none> 80/TCP 21h app=kubia
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/kubia-rc-cfj7t 1/1 Running 1 (3h32m ago) 21h 100.110.199.143 i-09fd9fa86a2e8b006 <none> <none>
pod/kubia-rc-krjzt 1/1 Running 1 (3h32m ago) 21h 100.126.23.208 i-07a449d2951b3988b <none> <none>
❯ k -n ingress-nginx get svc,pod -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/ingress-nginx-controller LoadBalancer 100.67.113.134 ab01eafaaf542430fb1db2cdfd7b2439-426635961.us-west-1.elb.amazonaws.com 80:31463/TCP,443:30329/TCP 21h app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-admission ClusterIP 100.71.76.121 <none> 443/TCP 21h app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/ingress-nginx-admission-create-nm2fk 0/1 Completed 0 21h 100.126.23.201 i-07a449d2951b3988b <none> <none>
pod/ingress-nginx-admission-patch-f7x8g 0/1 Completed 0 21h 100.110.199.137 i-09fd9fa86a2e8b006 <none> <none>
pod/ingress-nginx-controller-b4fcbcc8f-g5nl5 1/1 Running 0 128m 100.126.23.209 i-07a449d2951b3988b <none> <none>
pod/ingress-nginx-controller-b4fcbcc8f-hwkpn 1/1 Running 1 (3h33m ago) 21h 100.110.199.141 i-09fd9fa86a2e8b006 <none> <none>

When a client sends a request to kubia.dectech-labs.com, it first performs a DNS lookup for that FQDN. The DNS response contains the IP of the Ingress controller. The client then sends an HTTP request to the Ingress controller with kubia.dectech-labs.com in the Host header. From that header, the controller determines which service the client is trying to access, looks up the pod IPs through the Endpoints objects associated with the service, and forwards the client's request to one of the pods. Thus, the Ingress controller doesn't forward the request to the service; it only uses it to select a pod. Most, if not all, controllers work like this.
ingress manifest example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: / # rewrite the incoming URL to /
    nginx.ingress.kubernetes.io/ssl-redirect: "false" # allow access to the ingress via plain http as well
spec:
  ingressClassName: nginx # defined by an IngressClass k8s resource
  rules:
  - http:
      paths:
      - path: /somepath # one defined path
        pathType: Prefix
        backend:
          service:
            name: my-service # which service (and port) will be used. You cannot use a svc from a different namespace
            port:
              number: 80
      - path: /somepath2 # some other path
        pathType: Prefix
        backend:
          service:
            name: my-service2 # which service (and port) will be used
            port:
              number: 80
You can define multiple paths for the same host, or even multiple hosts.
The Ingress rules must reside in the same namespace as the app they configure.
multiple hosts in ingress
spec:
  ingressClassName: nginx
  rules:
  - host: kubia.dectech-labs.com
    http:
      paths:
      - backend:
          service:
            name: kubia
            port:
              number: 80
        path: /
        pathType: Prefix
  - host: kubia1.dectech-labs.com
    http:
      paths:
      - backend:
          service:
            name: kubia
            port:
              number: 80
        path: /
        pathType: Prefix
ingress example with a service with a named port
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: Myapp
  ports:
  - name: web # named port
    protocol: TCP
    port: 80
    targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
spec:
  ingressClassName: nginx
  rules:
  - http:
      paths:
      - path: /somepath
        pathType: Prefix
        backend:
          service:
            name: my-service
            port:
              name: web # referencing the service port by name instead of number
create an ingress
k create ingress my-ingress --class=nginx --rule="/somepath*=my-deployment:80" --dry-run=client -o yaml
When a client opens a TLS connection to an Ingress controller, the controller terminates the TLS connection. The communication between the client and the controller is encrypted, whereas the communication between the controller and the backend pod isn't. To enable the controller to do this, you need to attach a certificate and a private key to the Ingress. The two need to be stored in a Kubernetes secret, which is then referenced in the Ingress manifest.
Creating a self-signed cert and attaching it to the ingress:
# generate a self-signed cert and key
openssl req -x509 -newkey rsa:4096 -sha256 -days 3650 -nodes \
  -keyout kubia.dectech-labs.com.key -out kubia.dectech-labs.com.cert -subj "/CN=kubia.dectech-labs.com"
# add the cert and key as a secret
k create secret tls kubia.dectech-labs.com-tls --cert=kubia.dectech-labs.com.cert --key=kubia.dectech-labs.com.key
# update the ingress manifest to match
spec:
  tls:
  - hosts:
    - kubia.dectech-labs.com # TLS connections will be accepted for this domain
    secretName: kubia.dectech-labs.com-tls # the private key and cert are obtained from this secret
  rules:
  - host: kubia.dectech-labs.com
https://docs.bitnami.com/tutorials/secure-kubernetes-services-with-ingress-tls-letsencrypt/
A ReplicationController ensures that a specified number of pod replicas are running at any one time. In other words, a ReplicationController makes sure that a pod or a homogeneous set of pods is always up and available.
If there are too many pods, the ReplicationController terminates the extra pods. If there are too few, the ReplicationController starts more pods. Unlike manually created pods, the pods maintained by a ReplicationController are automatically replaced if they fail, are deleted, or are terminated. For example, your pods are re-created on a node after disruptive maintenance such as a kernel upgrade. For this reason, you should use a ReplicationController even if your application requires only a single pod. A ReplicationController is similar to a process supervisor, but instead of supervising individual processes on a single node, the ReplicationController supervises multiple pods across multiple nodes. ReplicationController is often abbreviated to “rc” or “rcs” in discussion, and as a shortcut in kubectl commands. A simple case is to create one ReplicationController object to reliably run one instance of a Pod indefinitely. A more complex use case is to run several identical replicas of a replicated service, such as web servers.
The most important parts in an rc definition are:
rc.spec.selector - k/v pairs used to match pods based on labels. Can be omitted if labels are defined under rc.spec.template.metadata.labels
rc.spec.template.metadata.labels - should be the same as the rc.spec.selector k/v
rc.spec.replicas - number of replicas
Pods created by an RC aren't tied to the RC in any way. At any moment, an RC manages the pods that match its label selector. By changing a pod's labels, it can be removed from or added to the scope of an RC. Although a pod isn't tied to the RC, the pod does reference it in the metadata.ownerReferences field, which you can use to easily find which RC a pod belongs to. If you want to remove a pod from an RC, you need to change or remove its selector label; adding new labels has no effect. Changing an RC's pod template has no effect on existing pods. To propagate an RC modification, you have to replace the old pods.
delete rc and leave pod
k delete rc my-rc --cascade=orphan
list all rc
kubectl get rc -o wide
describe rc status
kubectl describe rc ${rc_name}
Here is an example for rc manifest as rc.yml
apiVersion: v1
kind: ReplicationController
metadata:
  name: nginx
spec:
  replicas: 4
  selector: # this section is optional; if omitted the rc will use the rc.spec.template.metadata.labels definition
    app: nginx # based on this k/v the rc operates on pods
  template:
    # same definition as with pod.yml
    metadata:
      name: nginx
      labels:
        app: nginx # must be the same as rc.spec.selector as it is managed by the rc
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
$ git diff rc.yaml rs.yaml
-apiVersion: v1
-kind: ReplicationController
+apiVersion: apps/v1
+kind: ReplicaSet
 metadata:
-  name: kubia-rc
+  name: kubia-rs
 spec:
   replicas: 3
   selector:
-    app: kubia
+    matchLabels: # RS equivalent of the RC setting
+      app: kubia
   template:
     metadata:
       name: kubia
An RC and an RS actually do the same thing: ensure the desired number of replicas of the needed pod. The only difference is that with an rc you can only set selectors via equality-based requirement selectors:
selector:
component: redis
while with an rs you can use set-based requirement selectors as well:
selector:
matchLabels:
component: redis
matchExpressions:
- {key: tier, operator: In, values: [cache]}
- {key: environment, operator: NotIn, values: [dev]}
Automatically runs a copy of a Pod on each node. DaemonSets will run a copy of the Pod on new nodes as they are added to the cluster. A DS deploys pods to all nodes in the cluster, unless you specify that the pods should only run on a subset of all the nodes. This is done by specifying the nodeSelector property in the pod template, which is part of the DS definition (same for RS and RC). A DS will deploy pods even to unschedulable nodes, because the unschedulable attribute is only used by the Scheduler, whereas pods managed by a DS bypass the Scheduler completely.
DaemonSets respect normal scheduling rules around node labels, taints, and tolerations. If a pod would not normally be scheduled on a node, a DaemonSet will not create a copy of the Pod on that Node.
an example of daemonset
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: my-daemonset
spec:
selector:
matchLabels:
app: my-daemonset
template:
metadata:
labels:
app: my-daemonset
spec:
containers:
- name: nginx
image: nginx
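The nodeSelector mentioned above can be added to the DaemonSet's pod template to restrict its pods to a subset of nodes. A minimal sketch, assuming a hypothetical disk=ssd node label:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: ssd-monitor
spec:
  selector:
    matchLabels:
      app: ssd-monitor
  template:
    metadata:
      labels:
        app: ssd-monitor
    spec:
      nodeSelector:
        disk: ssd   # hypothetical label; only nodes labeled disk=ssd get a copy of the pod
      containers:
      - name: main
        image: nginx
```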
This resource type allows you to run a pod whose container isn't restarted when the process running inside finishes successfully. Once it does, the pod is considered complete. In the event of a node failure, the pods on that node that are managed by a Job will be rescheduled to other nodes the way ReplicaSet pods are. In the event of a failure of the process itself (when the process returns an error exit code), the Job can be configured to either restart the container or not. Jobs are useful for ad hoc tasks where it's crucial that the task finishes properly. You could run the task in an unmanaged pod and wait for it to finish, but in the event of a node failing or the pod being evicted from the node while it is performing its task, you'd need to manually recreate it. Doing this manually doesn't make sense - especially if the job takes hours to complete.
example of a job
apiVersion: batch/v1
kind: Job
metadata:
name: batch-job
spec:
  activeDeadlineSeconds: 110 # if the pod runs longer than this, the system will try to terminate it and will mark the Job as failed
  backoffLimit: 5 # how many times a Job can be retried before it is marked as failed, default 6
completions: 5 # if you need a job to run more than once
  parallelism: 2 # up to 2 pods can run in parallel
ttlSecondsAfterFinished: 5 # limits the lifetime of a Job that has finished execution (either Complete or Failed). If this field is set, ttlSecondsAfterFinished after the Job finishes, it is eligible to be automatically deleted
template:
metadata:
labels:
app: batch-job
spec:
      restartPolicy: OnFailure # jobs can't use the default restart policy, which is Always
containers:
- name: main
image: luksa/batch-job
When a Job's pod completes its processing, it is not deleted, which allows you to examine its logs. The pod is deleted when you delete it or the Job that created it.
Jobs may be configured to create more than one pod instance and run them in parallel or sequentially. This is done by setting the completions and the parallelism properties in the Job spec.
A cron job in Kubernetes is configured by creating a CronJob resource. The schedule for running the job is specified in the cron format. At the configured time, K8s will create a Job resource according to the Job template configured in the CronJob object. When the Job resource is created, one or more pod replicas will be created and started according to the Job's pod template.
example of cronjob
apiVersion: batch/v1
kind: CronJob
metadata:
name: batch-cronjob
spec:
schedule: "0,15,30,45 * * * *" # it will run every 15min
  startingDeadlineSeconds: 5 # the Job must start at most 5 seconds past the scheduled time, otherwise the run is marked as failed
jobTemplate: # the template for the Job resources that will be created by this CronJob
spec:
completions: 5
parallelism: 2
template:
metadata:
labels:
app: batch-cronjob
spec:
restartPolicy: OnFailure
containers:
- name: main
image: luksa/batch-job
Job resources will be created from the CronJob resource at approximately the scheduled time. The Job then creates the pod.
Storage volumes aren't top-level resources like pods, but are instead defined as a part of a pod and share the same lifecycle as the pod. This means a volume is created when the pod is started and destroyed when the pod is deleted. Because of this, a volume's content persists across container restarts. After a container is restarted, the new container can see all the files that were written to the volume by the previous container. If a pod contains multiple containers, the volume can be used by all of them at once.
Volumes are defined in the pod's manifest - much like containers. A volume is available to all containers in the pod, but it must be mounted in each container that needs to access it. In each container, you can mount the volume in any location of its filesystem. It's not enough to define a volume in the pod; you also need to define a volumeMount inside the container's spec if you want the container to be able to access it.
The container file system is ephemeral. Files on the container's file system exist only as long as the container exists. If a container is deleted or re-created in K8s, data stored on the container file system is lost.
Volumes allow you to store data outside the container file system while allowing the container to access the data at runtime.
Persistent Volumes are a slightly more advanced form of Volume.They allow you to treat storage as an abstract resource and consume it using your Pods.
Volumes and Persistent Volumes each have a volume type. The volume type determines how the storage is actually handled.
Various volume types support storage methods such as:
emptyDir - a simple empty directory used for storing transient data
hostPath - used for mounting directories from the worker node's filesystem into the pod. Use hostPath volumes only if you need to read or write system files on the node. Never use them to persist data across pods.
gitRepo - a volume initialized by checking out the contents of a git repo. It is basically an emptyDir volume that gets populated by cloning a Git repo and checking out a specific revision when the pod is starting up (but before its containers are created)
nfs - an NFS share mounted into the pod
gcePersistentDisk, awsElasticBlockStore, azureDisk - used for mounting cloud provider-specific storage
configMap, secret - special types of volumes used to expose certain Kubernetes resources
persistentVolumeClaim - a way to use pre- or dynamically provisioned persistent storage
spec:
containers:
- image: luksa/fortune
name: html-generator
resources: {}
volumeMounts:
- name: html
mountPath: /var/htdocs
- image: nginx:alpine
name: web-server
volumeMounts:
- name: html
mountPath: /usr/share/nginx/html
readOnly: true
ports:
- containerPort: 80
protocol: TCP
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: html
emptyDir: # by default `emptyDir: {}` the volume will be created on worker node
      medium: Memory # create as a tmpfs filesystem (in memory instead of on disk)
spec:
containers:
- image: nginx:alpine
name: gitrepo-volume-pod
ports:
- containerPort: 80
resources: {}
volumeMounts:
- name: html
mountPath: /usr/share/nginx/html
readOnly: true
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: html
gitRepo:
revision: master # branch name
repository: https://github.com/edesibe/kubia-website-example.git # repo which will be fetched
directory: . # it will hold the content of repo
spec:
containers:
- image: openweb/git-sync:0.0.1 # container which will sync git repo
name: git-sync
env:
- name: GIT_SYNC_REPO
value: https://github.com/edesibe/kubia-website-example.git
- name: GIT_SYNC_DEST
value: /tmp/git
- name: GIT_SYNC_BRANCH
value: master
- name: GIT_SYNC_REV
value: FETCH_HEAD
- name: GIT_SYNC_WAIT
value: "10"
volumeMounts:
- name: shared # syncing folder is declared as shared volume
mountPath: /tmp/git
- image: nginx:alpine
name: gitrepo-volume-pod
ports:
- containerPort: 80
resources: {}
volumeMounts:
- name: shared # nginx is using shared volume as root
mountPath: /usr/share/nginx/html
readOnly: true
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: shared
emptyDir: {}
spec:
containers:
- image: mongo
name: mongodb
volumeMounts:
- name: mongodb-data # name of the volume, same as pod.spec.volumes.name
mountPath: /data/db # the path where mongodb stores its data
ports:
- containerPort: 27017
resources: {}
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: mongodb-data
awsElasticBlockStore: # volume type and related volumeID which was created manually
fsType: ext4
volumeID: "vol-03af4682f86e4a74f"
Regular Volumes can be set up relatively easily within a Pod/container specification. You can use volumeMounts to mount the same volume to multiple containers within the same Pod. This is a powerful way to have multiple containers interact with one another. For example, you could create a secondary sidecar container that processes or transforms output from another container.

create a pod with volume
k run volume-pod --image=busybox --overrides='{ "apiVersion": "v1", "spec": { "containers":[{"name":"volume-pod","image":"busybox","command":["sh","-c","sleep 3600"],"volumeMounts":[{"name":"my-volume","mountPath":"/output"}]}],"volumes":[{"name":"my-volume","hostPath":{"path":"/data"}}]}}'
which will be represented as
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: volume-pod
name: volume-pod
spec:
containers:
- command:
- sh
- -c
- sleep 3600
image: busybox
name: volume-pod
resources: {}
volumeMounts: # in the container spec, these reference the volumes in the Pod spec and provide a mountPath (the location on the file system where the container process will access the volume data)
- mountPath: /output
name: my-volume
dnsPolicy: ClusterFirst
restartPolicy: Always
  volumes: # in the pod spec, these specify the storage volumes available to the pod; they specify the volume type and other metadata
- hostPath:
path: /data
name: my-volume
status: {}
Ideally, a developer deploying their apps on Kubernetes should never have to know what kind of storage technology is used underneath, the same way they don't have to know what type of physical servers are being used to run their pods. Infrastructure-related dealings should be the sole domain of the cluster administrator. When a developer needs a certain amount of persistent storage for their application, they can request it from k8s, the same way they can request CPU, memory, and other resources when creating a pod.
To enable apps to request storage in a k8s cluster without having to deal with infrastructure specifics, two new resources were introduced: PersistentVolume and PersistentVolumeClaim.
A PersistentVolume (PV) is a piece of storage in the cluster that has been provisioned by an administrator or dynamically provisioned using Storage Classes. PersistentVolume is non-namespaced. It is a resource in the cluster just like a node is a cluster resource. PVs are volume plugins like Volumes, but have a lifecycle independent of any individual Pod that uses the PV. This API object captures the details of the implementation of the storage, be that NFS, iSCSI, or a cloud-provider-specific storage system.
A PV can be created in two ways: static (manually) or dynamic (storage class). A PersistentVolume uses a set of attributes to describe the underlying storage resources (such as disk or cloud storage location) which will be used to store data.
apiVersion: v1
kind: PersistentVolume
metadata:
name: redis-pv
spec:
  storageClassName: localdisk # if omitted, the PV has no class and can only be bound to PVCs that request no particular class
capacity:
storage: 1Gi # define capacity in M,G ...
  persistentVolumeReclaimPolicy: Retain # determines how the storage resources can be reused when the PersistentVolume's associated PersistentVolumeClaims are deleted. This setting can be updated via `k patch pv <pv_name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'`
accessModes:
- ReadWriteOnce # can be mounted on only one node with RW
- ReadOnlyMany # can be mounted on multiple nodes but as Read only
hostPath:
path: "/mnt/data"
The accessModes requested by a PVC must be a subset of the accessModes offered by the PV it binds to.
Storage Class allows K8s administrators to specify the types of storage services they offer on their platform. Instead of creating PVs, one can deploy a PV provisioner and define one or more SC objects to let users choose what type of PV they want. The users can refer to the SC in the PVC and the provisioner will take it into account when provisioning the persistent storage (the PV will be created automatically). In addition, for cloud-based clusters, cloud volumes will be created. The default storage class is what's used to dynamically provision a PV if the PVC doesn't explicitly say which storage class to use.
StorageClass describes the parameters for a class of storage for which PersistentVolumes can be dynamically provisioned. StorageClasses are non-namespaced; the name of the storage class according to etcd is in ObjectMeta.Name. They are declarative while a PV is imperative. When creating an SC you can set reclaim policies:
retain - keeps all data. This requires an administrator to manually clean up the data and prepare the storage resource for reuse, in other words recreate the PV (delete and create it)
recycle - obsolete. Automatically deletes all data in the underlying storage resource, allowing the PersistentVolume to be reused
delete - (default) deletes the underlying storage resource automatically (only works for cloud storage resources)
yaml for creating the localdisk storageclass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: localdisk
provisioner: kubernetes.io/no-provisioner # the volume plugin to use for provisioning the PV
allowVolumeExpansion: true # set to false by default. The allowVolumeExpansion property of a StorageClass determines whether or not the StorageClass supports resizing volumes after they are created.
volumeBindingMode: WaitForFirstConsumer # `WaitForFirstConsumer` delays binding until a pod that uses the PVC is scheduled; `Immediate` binds the PVC to a PV right away
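Since the default storage class mentioned above decides which class a PVC gets when it names none, it's worth knowing how a class becomes the default. A sketch, using the well-known annotation:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: localdisk
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # PVCs without a storageClassName will use this class
provisioner: kubernetes.io/no-provisioner
```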
deleting pv
kubectl delete pv <pv_name> --grace-period=0 --force
delete pod without waiting for the Kubelet to confirm that the pod is no longer running
k delete pod kubia-0 --force --grace-period 0
And then deleting the finalizer using:
kubectl patch pv <pv_name> -p '{"metadata": {"finalizers": null}}'
Local volumes(provisioner: kubernetes.io/no-provisioner) don’t support dynamic provisioning.One can use https://github.com/rancher/local-path-provisioner which supports dynamic provisioning for local volumes.
A PersistentVolumeClaim (PVC) is a request for storage by a user. It is similar to a Pod. Pods consume node resources and PVCs consume PV resources. Pods can request specific levels of resources (CPU and memory); claims can request specific sizes and access modes (e.g., they can be mounted once read/write or many times read-only). When a PersistentVolumeClaim is created, it will look for a PersistentVolume that is able to meet the requested criteria. If it finds one, it will automatically be bound to the PersistentVolume. If it doesn't find a matching PersistentVolume, the PVC state will be Pending as it cannot be bound to any PersistentVolume. A PVC-to-PV binding is a one-to-one mapping.
RWO - one node can mount the volume for read/write
RWX - multiple nodes can mount the volume for read and write
ROX - multiple nodes can mount the volume for read only
WARNING: these modes apply per NODE, not per POD.
pvc example
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: my-pvc
  namespace: dev # the PVC must be created in the same namespace as the pod that uses it
spec:
  storageClassName: localdisk # use storageClassName: "" if you want to bind the PVC to a specific PV without involving a storage class
  volumeName: my-pv # if you want to explicitly bind the PVC to a PV. This field is optional; if omitted, the PVC will try to bind to any matching PV
accessModes:
- ReadWriteOnce
resources:
requests: # It seems that only requests are evaluated for the matching criteria for bounding to the PV
      storage: 100Mi # You can expand PersistentVolumeClaims without interrupting applications that are using them. Simply edit the spec.resources.requests.storage attribute of an existing PersistentVolumeClaim, increasing the value.
# However, the StorageClass must support resizing volumes and must have allowVolumeExpansion set to true.
pod usage of pv and pvc
apiVersion: v1
kind: Pod
metadata:
name: redispod
spec:
volumes:
- name: redis-data
persistentVolumeClaim:
claimName: my-pvc # PersistentVolumeClaims can be mounted to a Pod's containers just like any other volume
containers:
- name: redisdb
image: redis
ports:
- containerPort: 6379
name: "redis"
protocol: TCP
volumeMounts:
- mountPath: /data
      name: redis-data # if the PersistentVolumeClaim is bound to a PersistentVolume, the containers will use the underlying PersistentVolume storage
extending pvc
k patch pvc my-pvc --patch '{"spec":{"resources":{"requests": {"storage": "200Mi"}}}}'

To summarize, the best way to attach persistent storage to a pod is to only create the PVC (with an explicitly specified storageClassName if necessary) and the pod (which refers to the PVC by name). Everything else is taken care of by the dynamic PersistentVolume provisioner.
Network policies allow you to specify which pods can talk to other pods. This helps when securing communication between pods, allowing you to define ingress and egress rules. A NetworkPolicy applies to pods that match its label selector and specifies either which sources can access the matched pods or which destinations can be accessed from the matched pods. You can even choose a CIDR block range to apply the network policy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: test-network-policy
namespace: default
spec:
podSelector: # empty podSelector (`podSelector: {}`) matches all pods in the same namespace
matchLabels:
role: db # applies to all pods with label role=db
policyTypes:
  - Ingress # this policy applies to both ingress and egress. If no policyTypes are specified on a NetworkPolicy, then by default Ingress will always be set and Egress will be set only if the NetworkPolicy has any egress rules.
- Egress
ingress: # Ingress must be defined in netpol.spec.policyTypes
- from:
- ipBlock: # this ingress rule only allows traffic from clients in the 172.17.0.0/16 IP block except 172.17.1.0/24
cidr: 172.17.0.0/16
except:
- 172.17.1.0/24
    - namespaceSelector: # allows traffic from namespaces which have the label project=myproject
matchLabels:
project: myproject
- podSelector:
matchLabels:
          role: frontend # allows incoming connections only from pods with the role=frontend label
    ports: # only connections to port 6379 are allowed
- protocol: TCP
port: 6379
  egress: # limits the pod's outbound traffic
- to:
- ipBlock:
cidr: 10.0.0.0/24
ports:
- protocol: TCP
port: 5978
Solutions that support Network Policies: kube-router, Calico, Romana, Weave Net. A solution that DOESN'T support Network Policies: Flannel.
Client pods usually connect to server pods through a Service instead of directly to the pod, but that doesn't change anything. The NetworkPolicy is enforced when connecting through a Service as well.
In a multi-tenant Kubernetes cluster, tenants usually can't add labels (or annotations) to their namespaces themselves. If they could, they'd be able to circumvent the namespaceSelector-based ingress rules.
spec.podSelector: {} will apply the policy to all pods in the current namespace. The podSelector of an ingress rule can only select pods in the same namespace the NetworkPolicy is deployed to.
spec:
  podSelector: {}
  ingress: [] # an empty rule list means all ingress traffic is blocked
or
spec:
podSelector: {}
policyTypes: # no ingress rules means block all traffic
- Ingress
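The counterpart of the block-all policies above is a single empty rule, which matches everything. A sketch:

```yaml
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - {} # one empty rule allows all ingress traffic
```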
k describe netpol NETWORK_POLICY_NAME for the outcome
A K8s NetworkPolicy is an object that allows you to control the flow of network communication to and from Pods. This allows you to build a more secure cluster network by keeping Pods isolated from traffic they do not need. Network policies are implemented by the network plugin (via iptables for Calico). To use network policies, you must be using a networking solution which supports NetworkPolicy.
podSelector - determines to which Pods in the namespace the NetworkPolicy applies. The podSelector can select Pods using Pod labels. If this field is set to {}, the policy applies to all pods in the related namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: my-network-policy
spec:
podSelector:
matchLabels:
role: db
By default, Pods are considered non-isolated and completely open to all communication. If any NetworkPolicy selects a Pod, the Pod is considered isolated and will only be open to traffic allowed by NetworkPolicies.
A NetworkPolicy can apply to Ingress (incoming network traffic coming into the Pod), Egress (outgoing network traffic leaving the Pod), or both.
from selector - selects ingress (incoming) traffic that will be allowed
to selector - selects egress (outgoing) traffic that will be allowed
spec:
ingress:
- from:
...
egress:
- to:
...
podSelector - selects Pods to allow traffic from/to
namespaceSelector - selects namespaces based on defined labels to allow traffic from/to
ipBlock - selects an IP range to allow traffic from/to
podSelector example
spec:
ingress:
- from:
- podSelector:
matchLabels:
app: db
namespaceSelector example
spec:
ingress:
- from:
- namespaceSelector:
matchLabels:
app: db
ipBlock example
spec:
ingress:
- from:
- ipBlock:
cidr: 172.17.0.0/16
Port - specifies one or more ports that will allow traffic.
port example
spec:
ingress:
  - ports:
    - protocol: TCP
      port: 80
Traffic is only allowed if it matches both an allowed port and one of the from/to rules.
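The AND semantics can be sketched in one rule; traffic must come from a matching peer and target a listed port (the app=db and app=api labels are hypothetical):

```yaml
spec:
  podSelector:
    matchLabels:
      app: db          # hypothetical label on the protected pods
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: api     # hypothetical label on the allowed clients
    ports:
    - protocol: TCP
      port: 80         # only app=api pods AND only on TCP/80
```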
ReplicaSet ensures that a specified number of pod replicas are running at any given time. The difference between an RS and an RC is that the RS has more expressive options for label selectors.
rs example
apiVersion: apps/v1
kind: ReplicaSet
metadata:
name: kubia-rs
spec:
  replicas: 3 # mandatory, sets the number of replicas
  selector: # mandatory field, used to filter which pods will be managed by this rs
matchLabels:
app: kubia
template:
metadata:
name: kubia
labels: # defining labels which will be used for pods,it should be same as rs.spec.selector field
app: kubia
spec:
containers:
- name: kubia
image: edesibe/kubia
ports:
- containerPort: 8080
or
...
selector:
    matchExpressions: # instead of matchLabels, one can define key, operator, values expressions
    - key: app
      operator: In # can be In, NotIn, Exists, DoesNotExist
values:
- kubia
...
If you specify multiple expressions, all those expressions must evaluate to true for the selector to match a pod. If you specify both matchLabels and matchExpressions, all the labels must match and all the expressions must evaluate to true for the pod to match the selector.
Pod replicas managed by a ReplicaSet or ReplicationController are much like cattle. Because they're mostly stateless, they can be replaced with a completely new pod replica at any time. Stateful pods require a different approach. When a stateful pod instance dies (or the node it's running on fails), the pod instance needs to be resurrected on another node, but the new instance needs to get the same name, network identity, and state as the one it's replacing. This is what happens when the pods are managed through a StatefulSet. It also allows you to easily scale the number of pets up and down. A StatefulSet has a desired replica count field that determines how many pets you want running at that time. Pods are created from a pod template specified as part of the StatefulSet, but they aren't exact replicas of each other. Each can have its own set of volumes (storage), which differentiates it from its peers. Pet pods also have a predictable (and stable) identity instead of each new pod instance getting a completely random one. Each pod created by a StatefulSet is assigned an ordinal index (zero-based), which is then used to derive the pod's name and hostname, and to attach stable storage to the pod.
A StatefulSet requires you to create a governing headless Service that's used to provide the actual network identity to each pod. Through this Service, each pod gets its own DNS entry, so its peers and possibly other clients in the cluster can address the pod by its hostname. For example, if the governing Service belongs to the default namespace and is called foo, and one of the pods is called a-0, you can reach the pod through its fully qualified domain name, which is a-0.foo.default.svc.cluster.local. Additionally, you can use DNS to look up all the StatefulSet's pods' names by looking up SRV records for the foo.default.svc.cluster.local domain.
dig -t SRV foo.default.svc.cluster.local
Any ClusterIP and Headless service has A,AAAA,SRV and PTR DNS records.
Scaling the StatefulSet creates a new pod instance with the next unused ordinal index. Scaling down a StatefulSet always removes the instances with the highest ordinal index first. StatefulSets also never permit scale-down operations while any of the instances are unhealthy.
The StatefulSet has to create the PersistentVolumeClaims as well, the same way it creates the pods. For this reason, a StatefulSet can also have one or more volume claim templates, which enable it to stamp out PersistentVolumeClaims along with each pod instance (pod A-0 -> PVC A-0). The PersistentVolumes for the claims can either be provisioned up-front by an administrator or just in time through dynamic provisioning of PersistentVolumes.
Scaling up a StatefulSet by one creates two or more API objects (the pod and one or more PersistentVolumeClaims referenced by the pod). Scaling down deletes only the pod, leaving the claims alone (because they're needed for a new pod).
sts example including headless service
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
labels:
app: kubia
name: kubia
spec:
serviceName: "kubia"
replicas: 3
selector:
matchLabels:
app: kubia
template:
metadata:
labels: # pods created by the StatefulSet will have the app=kubia
app: kubia
spec:
containers:
- image: edesibe/kubia-pet
name: kubia-pet
ports:
- containerPort: 8080
name: http
resources: {}
volumeMounts:
- name: data # the container inside the pod will mount the pvc volume at this path
mountPath: /var/data
volumeClaimTemplates:
- metadata:
name: data
spec: # the PersistentVolumeClaims will be created from this template
storageClassName: "fast"
resources:
requests:
storage: 1Mi
accessModes:
- ReadWriteOnce
---
apiVersion: v1
kind: Service
metadata:
creationTimestamp: null
labels:
app: kubia
name: kubia # name of the service
spec:
clusterIP: None # the StatefulSet's governing service must be headless
ports:
- name: http
port: 80
selector:
app: kubia # all pods with the app=kubia label belong to this service
type: ClusterIP
status:
loadBalancer: {}
When you create a StatefulSet object, it creates the pods one by one; the next pod is created only after the previous one is up and ready. The PersistentVolumeClaim template is used to create the PersistentVolumeClaim and the volume inside the pod, which refers to the created PersistentVolumeClaim. The names of the generated PersistentVolumeClaims are composed of the sts.spec.volumeClaimTemplates.metadata.name and the name of each pod.
You can fetch StatefulSet Pod IPs via DNS SRV queries, which show the SRV and A records of the related pods. The response is in the format shown below.
fetching IP addresses for all pods from StatefulSet
k run --rm -it --restart=Never dnsutils --image=edesibe/dnsutils -- dig +noall +additional +answer SRV kubia.default.svc.cluster.local
kubia.default.svc.cluster.local. 7 IN SRV 0 20 80 kubia-1.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN SRV 0 20 80 kubia-0.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN SRV 0 20 80 kubia-2.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN SRV 0 20 80 kubia-3.kubia.default.svc.cluster.local.
kubia.default.svc.cluster.local. 7 IN SRV 0 20 80 kubia-4.kubia.default.svc.cluster.local.
kubia-1.kubia.default.svc.cluster.local. 7 IN A 100.100.177.139
kubia-4.kubia.default.svc.cluster.local. 7 IN A 100.100.177.144
kubia-0.kubia.default.svc.cluster.local. 7 IN A 100.100.177.140
kubia-2.kubia.default.svc.cluster.local. 7 IN A 100.116.201.202
kubia-3.kubia.default.svc.cluster.local. 7 IN A 100.116.201.203
7 IN SRV 0 20 80 --> <TTL> IN SRV <PRIORITY> <WEIGHT> <PORT>
The order of returned SRV records is random,because they all have the same priority.
Starting from Kubernetes 1.7, StatefulSets support rolling updates the same way Deployments and DaemonSets do. Check the sts.spec.updateStrategy field via k explain.
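A sketch of the sts.spec.updateStrategy field; with a partition, only pods whose ordinal index is greater than or equal to the partition value are updated:

```yaml
spec:
  updateStrategy:
    type: RollingUpdate # the default; the alternative is OnDelete
    rollingUpdate:
      partition: 2      # only pods with ordinal >= 2 are updated (canary-style rollout)
```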
A K8s object that defines a desired state for a ReplicaSet (a set of replica Pods). The Deployment Controller seeks to maintain the desired state by creating, deleting, and replacing Pods with new configurations.
replicas - the number of replica Pods the Deployment will seek to maintain
selector - a label selector used to identify the replica Pods managed by the Deployment
template - a template Pod definition used to create replica Pods
Always set pod.spec.containers.imagePullPolicy to Always if you want the pod to always fetch the image. Be aware that the default imagePullPolicy depends on the image tag. If a container refers to the latest tag (either explicitly or by not specifying the tag at all), imagePullPolicy defaults to Always, but if the container refers to any other tag, the policy defaults to IfNotPresent.
The actual deployment will create an RS (ReplicaSet) object which will hold the deployment specs. A ReplicaSet ensures that a specified number of pod replicas are running at any given time. The format of a pod name is <DEPLOYMENT>-<PODTEMPLATEHASH>-<SOMESTRING>. The ReplicaSet's name also contains the hash value of its pod template. A Deployment creates ReplicaSets - one for each version of the pod template. Using the hash value of the pod template like this allows the Deployment to always use the same (possibly existing) ReplicaSet for a given version of the pod template.
How the new state should be achieved is governed by the deployment strategy configured on the Deployment itself. The default strategy is to perform a rolling update (the strategy is called RollingUpdate). The alternative is the Recreate strategy, which deletes all the old pods at once and then creates new ones; old pods will be deleted before the new ones are created.
Use Recreate strategy when your application doesn’t support running multiple versions in parallel and requires the old version to be stopped completely before the new one is started.
The RollingUpdate strategy removes old pods one by one while adding new ones at the same time, keeping the application available throughout the whole process and ensuring there's no drop in its capacity to handle requests. This is the default strategy. The upper and lower limits for the number of pods above or below the desired replica count are configurable.
You should use the RollingUpdate strategy only when your app can handle running both the old and the new version at the same time.
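The configurable upper and lower limits mentioned above live under deploy.spec.strategy. A sketch:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1       # at most 1 pod above the desired replica count during the rollout
      maxUnavailable: 0 # never drop below the desired replica count (can also be a percentage, e.g. 25%)
```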
Creation
k create deployment NAME --replicas 3 --image=IMAGE --dry-run=client -o yaml
example of a deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  creationTimestamp: null
  labels:
    app: kubia
  name: kubia
spec:
  replicas: 3
  selector:
    matchLabels:
      app: kubia
  strategy: {}
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: kubia
    spec:
      containers:
      - image: edesibe/kubia:v1
        name: kubia
        resources: {}
status: {}
Scaling refers to dedicating more (or fewer) resources to an application in order to meet changing needs. A best practice is to use --current-replicas=<current_number_of_replicas> to guard against accidental conflicts with the autoscaler. Because the Deployment controls the ReplicaSet, you cannot effectively scale the ReplicaSet object via k scale rs <ReplicaSetName> - the Deployment will revert the change.
k scale deployment NAME --replicas=N --current-replicas=X
Updating can be done in several ways:
k set image deployment/NAME CONTAINER_NAME=NEWIMAGE - updating the image of the deployment
k patch deployment NAME -p '{"spec":{"minReadySeconds":10}}' - updating some spec values
k set resources -f nginx.yaml --dry-run=client -o yaml --limits=cpu=100m --local - adding resource limits to a local yaml
k set env deployment/NAME KEY=VALUE - updating deployment env (it will recreate pods)
k set resources deployment/NAME --requests=memory=10Mi,cpu=10m --limits=memory=20Mi,cpu=20m - updating deployment resources (it will recreate pods)

The minReadySeconds property specifies how long a newly created pod should be ready before the pod is treated as available. Until the pod is available, the rollout process will not continue (because of the maxUnavailable property). A pod is ready when the readiness probes of all its containers return success. If a new pod isn't functioning properly and its readiness probe starts failing before minReadySeconds have passed, the rollout of the new version will effectively be blocked.
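minReadySeconds is set directly on the Deployment spec; a minimal sketch:

```yaml
spec:
  minReadySeconds: 10  # a new pod must stay ready for 10s before it counts as available
  replicas: 3
```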
By default, if the rollout can't make any progress in 10 minutes, it's considered failed. If you use the k describe deployment command, you'll see it display a ProgressDeadlineExceeded condition. A failed rollout can only be aborted via k rollout undo deployment NAME.
If the pod template in the Deployment references a ConfigMap (or a Secret), modifying the ConfigMap will not trigger an update. One way to trigger an update when you need to modify an app's config is to create a new ConfigMap and modify the pod template so it references the new ConfigMap.
The events that occurred below the Deployment's surface during the update are: an additional ReplicaSet was created and then scaled up slowly, while the previous ReplicaSet was scaled down to zero. All new pods are now managed by the new ReplicaSet.
Rolling updates - allow you to make changes to a Deployment's Pods at a controlled rate, gradually replacing old Pods with new Pods. This allows you to update your Pods without incurring downtime.
Rollback - if an update to a deployment causes a problem, you can roll back the deployment to a previous working state. By default Kubernetes stores the last 10 ReplicaSets and lets you roll back to any of them (spec.revisionHistoryLimit in the deployment definition).
k rollout status deployment NAME - checking deployment status
k rollout history deployment NAME [--revision=<NUMBER>] - listing revision history. You can get details of a specific revision via --revision=<NUMBER>
k rollout undo deployment NAME [--to-revision=N] - reverting to the latest or a specific Nth revision. It can also be used during the rollout process to abort the rollout. Rolling back a rollout is possible because Deployments keep a revision history. The history is stored in the underlying ReplicaSets. When a rollout completes, the old ReplicaSet isn't deleted, and this enables rolling back to any revision, not only the previous one.
k rollout pause deployment NAME - pausing the update; canary is a technique for minimizing the risk of rolling out a bad version of an application. When you pause the deployment you can only resume it (undoing is not possible).
k rollout resume deployment NAME - resuming the update

A proper way of performing a canary release is by using two different Deployments (stable and canary) and scaling them appropriately. When the canary deployment is tested and verified, one can update the stable deployment via `k set image deployment stable NAME=IMAGE`.

The length of the revision history is limited by the revisionHistoryLimit property on the Deployment resource. It defaults to 10, so older ReplicaSets are deleted automatically.
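If you want a shorter history, the property sits at the top level of the Deployment spec; a minimal sketch:

```yaml
spec:
  revisionHistoryLimit: 3  # keep only the 3 most recent old ReplicaSets for rollback
```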
Two properties affect how many pods are replaced at once during a Deployment’s rolling update.
maxSurge - The maximum number of pods that can be scheduled above the desired number of pods. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%)
maxUnavailable - The maximum number of pods that can be unavailable during the update. Value can be an absolute number (ex: 5) or a percentage of desired pods (ex: 10%)

K8s provides a number of features that allow you to build robust solutions, such as the ability to automatically restart unhealthy containers. To make the most of these features, k8s needs to be able to accurately determine the status of your applications. This means actively monitoring container health.
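The two rolling-update properties above are nested under the Deployment's strategy field; a minimal sketch:

```yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most 1 extra pod above the desired count
      maxUnavailable: 0  # never drop below the desired count during the update
```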
Kubernetes uses liveness probes to know when to restart a container.
Kubernetes uses readiness probes to know when a container is ready to receive requests, i.e. when it is up and running.
A Pod is considered ready when all of its containers are ready. One use of this signal is to control which Pods are used as backends for Services. When a Pod is not ready, it is removed from Service load balancers.
Unlike liveness probes, if a container fails the readiness check, it won't be killed or restarted. Liveness probes keep pods healthy by killing off unhealthy containers and replacing them with new, healthy ones, whereas readiness probes make sure that only pods that are ready to serve requests receive them.
Readiness probes are used to determine when a container is ready to accept requests. When you have a service backed by multiple container endpoints, user traffic will not be sent to a particular pod until its containers have all passed the readiness checks defined by their readiness probes.
Use readiness probes to prevent user traffic from being sent to pods that are still in the process of starting up. When a container is started, Kubernetes can be configured to wait for a configurable amount of time to pass before performing the first readiness check. After that, it invokes the probe periodically (default 10s) and acts based on the result of the readiness probe. If a pod reports that it's not ready, it's removed from the service. If the pod then becomes ready again, it's re-added. This is mostly relevant during container startup, but it's also useful after the container has been running for a while.
Readiness probes can use an exec command, an HTTP GET request, or a TCP socket check.
Always use readinessProbe for production apps
For pods running in production, you should always define a readiness probe. Without one, pods become service endpoints almost immediately.
readiness example
spec:
  containers:
  - name: cache-server
    image: cache-server/latest
    readinessProbe:
      httpGet:
        path: /readiness
        port: 8888
      initialDelaySeconds: 300
      periodSeconds: 30
Liveness probes allow you to automatically determine whether a container application is in a healthy state. By default, k8s will only consider a container to be down if the container process stops. Liveness probes allow you to customize this detection mechanism and make it more sophisticated. Use them with restartPolicy: Always so that the pod's containers get restarted. Always remember to set initialDelaySeconds to account for your app's startup time.
Liveness probes can use an exec command, an HTTP GET request, or a TCP socket check.
liveness probe with exec
spec:
  containers:
  - name: liveness
    image: k8s.gcr.io/busybox
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
Exit code
Exit code 137 signals that the process was killed by an external signal (the exit code is 128+9, SIGKILL). Likewise, exit code 143 corresponds to 128+15 (SIGTERM).
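The 128 + signal arithmetic can be reproduced in any POSIX shell, without Kubernetes; a minimal sketch:

```shell
# a process killed by a signal exits with status 128 + signal number
sh -c 'kill -9 $$'  || echo "exit=$?"   # SIGKILL: prints exit=137 (128+9)
sh -c 'kill -15 $$' || echo "exit=$?"   # SIGTERM: prints exit=143 (128+15)
```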
Always use livenessProbe for production apps
For pods running in production,you should always define a liveness probe.Without one,Kubernetes has no way of knowing whether your app is still alive or not.
Keep probes light
Liveness probes shouldn’t use too many computational resources, as they count against the container’s CPU time quota.
liveness probe with httpGet
spec:
  containers:
  - image: nginx
    name: nginx
    livenessProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 5
Startup probes are very similar to liveness probes. However, while liveness probes run constantly on a schedule, startup probes run at container startup and stop running once they succeed. They are used to determine when the application has successfully started up. Startup probes are especially useful for legacy applications that can have long startup times.
startup probe
spec:
  containers:
  - name: startup
    image: nginx
    startupProbe:
      httpGet:
        path: /
        port: 80
      initialDelaySeconds: 30
      periodSeconds: 10
Horizontal pod autoscaling is the automatic scaling of the number of pod replicas managed by a controller. It's performed by the horizontal controller, which is enabled and configured by creating a HorizontalPodAutoscaler (HPA) resource. The controller periodically checks pod metrics, calculates the number of replicas required to meet the target metric value configured in the HorizontalPodAutoscaler resource, and adjusts the replicas field on the target resource (Deployment, ReplicaSet, ReplicationController, or StatefulSet). The Metrics Server is needed for this.
The autoscaling process can be split into three steps: obtaining the metrics of all the pods managed by the scaled resource object, calculating the target number of replicas from those metrics, and updating the replicas field of the scaled resource.
As far as the Autoscaler is concerned, only the pod's guaranteed CPU amount (the CPU requests) is important when determining the CPU utilization of a pod. The Autoscaler compares the pod's actual CPU consumption to its CPU requests, which means the pods you're autoscaling need to have CPU requests set (either directly or indirectly through a LimitRange object) for the Autoscaler to determine the CPU utilization percentage.
# create hpa for a deployment
kubectl autoscale deployment <MY_DEPLOYMENT> --cpu-percent=30 --min=1 --max=10
# create a service out of deployment
k expose deployment <MY_DEPLOYMENT> --port=80 --target-port=80 --name=<MY_SERVICE>
# run load-generator pod
k run --rm -ti load-generator --image=busybox /bin/sh
$ while true; do wget -q -O- http://<MY_SERVICE>.default.svc.cluster.local; done
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  creationTimestamp: null
  name: kubia # each hpa has a name (it doesn't need to match the name of the deployment as in this case)
spec:
  maxReplicas: 5 # min and max replicas you specified
  minReplicas: 1
  scaleTargetRef: # the target resource which this autoscaler will act upon
    apiVersion: apps/v1
    kind: Deployment
    name: kubia
  metrics:
  - type: Resource
    resource: # you'd like the autoscaler to adjust the number of pods so they each utilize 30% of requested CPU
      name: cpu
      target:
        type: Utilization
        averageUtilization: 30
status:
  currentReplicas: 0 # the current status of the autoscaler
  desiredReplicas: 0
The HPA has a limit on how soon a subsequent autoscale operation can occur after the previous one. Currently, a scale-up will occur only if no rescaling event occurred in the last three minutes. A scale-down event is performed even less frequently - every five minutes.
The whole point of an app's configuration is to keep the config options that vary between environments, or change frequently, separate from the application's source code.
A ConfigMap holds configuration data for pods to consume. Its name must be a valid DNS subdomain (lowercase alphanumerics, '-' and '.'; no underscores). Regardless of whether you are using a ConfigMap to store configuration data or not, you can configure your apps by: passing command-line arguments to containers, setting custom environment variables for each container, or mounting configuration files into containers through volumes.
You can create a ConfigMap from literals or from files on disk.
creating cm from literal
k create configmap fortune-config --from-literal=sleep-interval=25 --dry-run=client -o yaml
will be created as
apiVersion: v1
data:
  sleep-interval: "25" # created from literal
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: fortune-config
multi-key cm example
cloud_user@k8s-control:~$ k create cm test-cm --from-literal firstName=mile --from-literal lastName=kitic --dry-run=client -o yaml
apiVersion: v1
data:
  firstName: mile
  lastName: kitic
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm
create a cm from file.
cloud_user@k8s-control:~$ k create cm test-cm --from-file readme --dry-run=client -o yaml
apiVersion: v1
data:
  readme: | # key is omitted so the filename is used as the key
    yeah
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm
cloud_user@k8s-control:~$ k create cm test-cm --from-file key=readme --dry-run=client -o yaml
apiVersion: v1
data:
  key: | # key is provided and used instead of the filename
    yeah
kind: ConfigMap
metadata:
  creationTimestamp: null
  name: test-cm
ConfigMap objects can be created from files in a directory as well.
k create cm my-config --from-file=/path/to/dir
cm as environment variables
spec:
  containers:
  - image: busybox:1.28.4
    name: app-container
    command: ['sh', '-c', "echo $(MY_VAR1) && sleep 3600"]
    env:
    - name: MY_VAR1
      valueFrom:
        configMapKeyRef:
          name: appconfig
          key: key1
          optional: true # this key is optional, the container will start even if the cm doesn't exist
    - name: MY_VAR2
      value: "mile kitic"
    - name: MY_VAR3
      valueFrom:
        fieldRef:
          fieldPath: status.hostIP
cm as volume
spec:
  containers:
  - image: busybox
    name: busybox
    command: ["sh","-c","sleep 1d"]
    resources: {}
    volumeMounts:
    - name: config
      mountPath: /etc/someconfig.conf # a folder will be created inside the container at this location with the files from the configMap "config"
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: config
    configMap:
      name: config
this will be presented on the container as:
$ k exec -it busybox -- ls -l /etc/
total 36
-rw-rw-r-- 1 root root 306 Nov 16 17:08 group
-rw-r--r-- 1 root root 8 Dec 5 14:01 hostname
-rw-r--r-- 1 root root 201 Dec 5 14:01 hosts
-rw-r--r-- 1 root root 118 Oct 29 12:12 localtime
drwxr-xr-x 6 root root 4096 Nov 17 19:58 network
-rw-r--r-- 1 root root 340 Nov 16 17:08 passwd
-rw-r--r-- 1 root root 127 Dec 5 14:01 resolv.conf
-rw------- 1 root root 136 Nov 17 19:58 shadow
drwxrwxrwx 3 root root 4096 Dec 5 14:01 someconfig.conf
$ k exec -it busybox -- ls -l /etc/someconfig.conf
total 0
lrwxrwxrwx 1 root root 15 Dec 5 14:01 boo.conf -> ..data/boo.conf
lrwxrwxrwx 1 root root 15 Dec 5 14:01 foo.conf -> ..data/foo.conf
spec:
  containers:
  - image: some-image
    envFrom: # using envFrom instead of env
    - prefix: CONFIG_ # all environment variables will be prefixed with CONFIG_
      configMapRef:
        name: some-cm # referencing some-cm as the CM
spec:
  containers:
  - image: edesibe/fortune:args # image which expects the interval from arguments
    name: fortune-pod
    resources: {}
    args: ["$(INTERVAL)"] # referencing an environment variable in the argument
    env:
    - name: INTERVAL # setting environment variable INTERVAL
      valueFrom:
        configMapKeyRef: # using a CM entry instead of a literal value
          name: fortune-config # name of the CM
          key: sleep-interval # use the value from this key for INTERVAL
A configMap volume will expose each entry of the CM as a file.
spec:
  containers:
  - image: nginx
    name: web-server
    volumeMounts:
    - name: config
      mountPath: /etc/nginx/conf.d # mounting the configMap volume at this location
      readOnly: true
  ...
  volumes:
  - name: config
    configMap:
      name: fortune-config # the volume refers to the fortune-config CM
If you need only some of the ConfigMap's entries in the volume, list them individually. When specifying individual entries, you need to set the filename for each individual entry, along with the entry's key.
volumes:
- name: config
  configMap:
    name: fortune-config
    items: # selecting which entries to include in the volume by listing them
    - key: my-nginx-config.conf # you want the entry under this key included
      path: gzip.conf # the entry's value should be stored in this file
spec:
  containers:
  - image: busybox
    name: busybox
    command: ["sh","-c","sleep 1d"]
    resources: {}
    volumeMounts:
    - name: config
      mountPath: /etc/foo/foo.conf # mounting into a file, not a directory
      subPath: foo.conf # instead of mounting the whole volume, you're only mounting the foo.conf entry from the configMap "config"
    - name: config
      mountPath: /etc/boo/boo.conf
      subPath: boo.conf
  dnsPolicy: ClusterFirst
  restartPolicy: Always
  volumes:
  - name: config
    configMap:
      name: config
so it will be mounted in the container as:
$ k exec -it busybox -- ls -l /etc/foo /etc/boo
/etc/boo:
total 4
-rw-r--r-- 1 root root 27 Dec 5 14:07 boo.conf
/etc/foo:
total 4
-rw-r--r-- 1 root root 31 Dec 5 14:07 foo.conf
When you reference a ConfigMap that doesn't exist in a pod, K8s schedules the pod normally and tries to run its containers. The container referencing the non-existing ConfigMap will fail to start (unless configMapKeyRef.optional: true is configured), but the other containers in the pod will start normally. If you then create the missing CM, the failed container is started without requiring you to recreate the pod.
By default, the permissions on all files in a configMap volume are set to 644. You can change this by setting the defaultMode property in the volume spec.
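A sketch of defaultMode on a configMap volume (the fortune-config name is reused from the examples above):

```yaml
volumes:
- name: config
  configMap:
    name: fortune-config
    defaultMode: 0400  # owner read-only, instead of the default 644
```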
The drawback of using environment variables or command-line arguments as a configuration source is the inability to update them while the process is running. Using a CM and exposing it through a volume brings the ability to update the configuration without having to recreate the pod or even restart the container. When you update a ConfigMap, the files in all the volumes referencing it are updated. All files are updated at once, as k8s achieves this by using symbolic links.
When a ConfigMap currently consumed in a volume is updated, projected keys are eventually updated as well.ConfigMaps consumed as environment variables are not updated automatically and require a pod restart.A container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates.
Secrets are similar to ConfigMaps but are designed to store sensitive data, such as passwords or API keys, more securely. They can be used the same way as a ConfigMap. You can: pass Secret entries to containers as environment variables, or expose Secret entries as files in a volume.
A Secret’s entries can contain binary values, not just plain text. Base64 encoding allows you to include binary data in YAML or JSON, which are both plain-text formats. The maximum size of a Secret is limited to 1MB.
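Note that base64 is only an encoding, not encryption - anyone can decode a Secret's values. A minimal sketch (the literal S3cr3t! is just an example value):

```shell
# round-trip a value the way Secret manifests store it
printf '%s' 'S3cr3t!' | base64            # prints UzNjcjN0IQ==
printf '%s' 'UzNjcjN0IQ==' | base64 -d    # prints S3cr3t!
```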
When you expose the Secret to a container through a secret volume, the value of the Secret entry is decoded and written to the file in its actual form (regardless of whether it's plain text or binary). The same is true when exposing the Secret entry through an environment variable. In both cases, the app doesn't need to decode it; it can read the file's contents or look up the environment variable value and use it directly.
Kubernetes helps keep your Secrets safe by making sure each Secret is only distributed to the nodes that run the pods that need access to the Secret. Also, on the nodes themselves, Secrets are always stored in memory and never written to physical storage. On the master node itself (more precisely, in etcd), Secrets used to be stored in unencrypted form, which meant the master node needed to be secured to keep the sensitive data stored in Secrets secure. Since Kubernetes version 1.7, etcd can store Secrets in encrypted form, making the system much more secure - but encryption at rest is not enabled by default, so one may still find that etcd stores Secrets in unencrypted (only base64-obfuscated) form.
fetch namespaced secret from etcd
k -n kube-system exec -it <etcd-control> -- etcdctl get --endpoints=https://<endpoint>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt --key=/etc/kubernetes/pki/etcd/server.key /registry/secrets/${NAMESPACE}/${SECRET}
To use encrypted secrets you need to use Sealed Secrets or Encryption at rest
By default, the default-token Secret is mounted into every container, but you can disable that in each pod by setting pod.spec.automountServiceAccountToken: false.
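The opt-out is a single boolean on the pod spec; a minimal sketch:

```yaml
spec:
  automountServiceAccountToken: false  # don't mount the default-token Secret
  containers:
  - name: app
    image: nginx
```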
The secret volume uses an in-memory filesystem(tmpfs) for the Secret files.You can see this if you list mounts in the container.Because tmpfs is used,the sensitive data stored in the Secret is never written to disk,where it could be compromised.
create simple secret
k create secret generic mysecret --from-literal=username=mile --from-literal=password=kitic
create secret from file
k create secret generic ssh-key-secret --from-file=ssh-privatekey=<absolute_path>/.ssh/id_rsa --from-file=ssh-publickey=<absolute_path>/.ssh/id_rsa.pub
secret as env
spec:
  containers:
  - name: mycontainer
    image: redis
    env:
    - name: SECRET_USERNAME
      valueFrom:
        secretKeyRef:
          name: mysecret
          key: username
          optional: false
secret as volume
spec:
  volumes:
  - name: secret-volume
    secret:
      secretName: ssh-key-secret
  containers:
  - name: ssh-test-container
    image: mySshImage
    volumeMounts:
    - name: secret-volume
      readOnly: true
      mountPath: "/etc/secret-volume"
If you have multiple pods which need to fetch images from a private registry, you can add the secrets to the service account.
spec:
  imagePullSecrets:
  - name: mydockerhubsecret # docker-registry secret type
  containers:
  - image: username/private:tag
    name: some_name
Always use Secret volumes for exposing Secrets, not environment variables.
# scheduling disabled
k cordon NODE
# alternatively, mark the node unschedulable via patch
k patch nodes NODE -p '{"spec":{"unschedulable":true}}'
# scheduling enabled, moved pods will not come back to the uncordoned NODE
k uncordon NODE
# cordon + evict pods to other nodes (with options for ignoring volumes defined as emptyDir and daemonset pods)
k drain NODE [--delete-emptydir-data] [--ignore-daemonsets] [--force]
Before the scheduler component allocates a pod to a node, it performs several checks, such as: whether the node has enough allocatable resources to satisfy the pod's requests, whether the node's labels match the pod's nodeSelector/affinity rules, and whether the pod tolerates the node's taints.
Tolerations allow pods to tolerate Taints. A node's capacity can be viewed via kubectl describe node, where one should check:
capacity - overall capacity of the node
allocatable - how much can be allocated to pods

Taints are used to keep pods away from certain nodes.
taint example
# taint one node
k taint node NODE KEY=VALUE:EFFECT
# remove taint
k taint node NODE KEY=VALUE:EFFECT-
# taint nodes via label
k taint node -l key=value KEY=VALUE:EFFECT
# add a taint without a value, e.g. `KEY=:NoSchedule` or `KEY=:NoExecute`
k taint node NODE KEY=:EFFECT
Pods with tolerations MAY be scheduled to tainted nodes (e.g. the master) if their tolerations match the node's taints. Pods with no tolerations can only be scheduled to nodes without taints. If a node has a taint assigned, NO pod will be scheduled onto it unless the pod has a matching toleration.
If a node has no taints, any pod can be assigned to it.
The default value for operator is Equal. A toleration matches a taint if the keys and effects are the same on node and pod and:
operator is Exists (in which case no value should be specified), or
operator is Equal and the values are equal

There are two special cases:
An empty key with operator Exists matches all keys, values and effects, which means the pod tolerates everything.
An empty effect matches all effects with key key1.

Effect values:
"NoExecute" - Affects pods already running on the node: evicts any already-running pods that do not tolerate the taint. Currently enforced by the NodeController.
"NoSchedule" - Do not allow new pods to schedule onto the node unless they tolerate the taint, but allow all pods submitted to the Kubelet without going through the scheduler to start, and allow all already-running pods to continue running. Enforced by the scheduler.
"PreferNoSchedule" - Like NoSchedule, but the scheduler tries not to schedule new pods onto the node, rather than prohibiting new pods from scheduling onto the node entirely. Enforced by the scheduler.

tolerations:
- effect: [NoSchedule,PreferNoSchedule,NoExecute]
  key: KEY
  operator: [Exists,Equal]
  value: VALUE
  tolerationSeconds: X # how long k8s should wait before rescheduling a pod to another node if the node the pod is running on becomes unready or unreachable
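As a concrete case, kubeadm clusters taint control-plane nodes with NoSchedule; a pod that should run there anyway would carry a toleration like this (a sketch; the key name is the one used by recent Kubernetes versions):

```yaml
tolerations:
- key: node-role.kubernetes.io/control-plane
  operator: Exists   # no value specified; matches the taint regardless of its value
  effect: NoSchedule
```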
Node affinity allows you to tell Kubernetes to schedule pods only to specific subsets of nodes. It selects nodes based on their labels, the same way node selectors do. Before affinity is configured on pods, the related nodes need to be labeled.
kubectl label nodes NODE KEY=VALUE [--overwrite]
Affinity can then be used to specify preferences or hard requirements when scheduling pods, i.e. which nodes are preferred (or required) for certain pods.
Options for selecting are:
nodeSelector - for the pod to be eligible to run on a node, the node must have each of the indicated key/value pairs as labels
nodeAffinity - also based on node labels, but with a wider range of expressions. k8s will try to achieve this
podAffinity - based on pod labels. It is used for co-location, e.g. scheduling pods on the same node, cluster, or rack
podAntiAffinity - based on pod labels but with the opposite effect of podAffinity: the scheduler never chooses nodes where pods matching the podAntiAffinity's label selector are running
The affinity feature consists of two types of affinity:
hard requirements - forcing the pods to run on specific nodes by nodeAffinity spec
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # node must have the labels; doesn't affect current pods
        nodeSelectorTerms:
        - matchExpressions:
          - key: KEY
            operator: In
            values:
            - VALUE
preference - instructing pods to run on preferred nodes via the nodeAffinity spec
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # node should have the labels; doesn't affect current pods
      - weight: 80 # prefer the pod to be scheduled to a node with these labels; this is your most important preference
        preference:
          matchExpressions:
          - key: KEY1
            operator: In
            values:
            - VALUE1
      - weight: 20 # you also prefer that your pods be scheduled based on some other key/value pair
        preference:
          matchExpressions:
          - key: KEY2
            operator: In
            values:
            - VALUE2
hard requirements - specifying pod allocation via podAffinity
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: kubernetes.io/hostname # these pods must be deployed on the same node as the pods that match the selector
        labelSelector:
          matchLabels:
            app: backend
preference
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution: # preferred instead of required
      - weight: 80 # weight and podAffinityTerm are specified
        podAffinityTerm:
          topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: backend
template:
  metadata:
    creationTimestamp: null
    labels:
      app: frontend # the frontend pods have the app=frontend label
  spec:
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution: # defining hard requirements for pod anti-affinity
        - topologyKey: kubernetes.io/hostname # ensures pods aren't deployed to the same node (or rack, zone, region, or any custom scope)
          labelSelector: # a frontend pod must not be scheduled to the same node as a pod with the app=frontend label
            matchLabels:
              app: frontend
The topologyKey option is used differently based on the use case:
kubernetes.io/hostname
topology.kubernetes.io/region
topology.kubernetes.io/zone
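A sketch of podAffinity with a custom topologyKey (this assumes nodes carry a hypothetical rack label, e.g. rack=rack1):

```yaml
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - topologyKey: rack  # custom scope: co-locate within the same rack
        labelSelector:
          matchLabels:
            app: backend
```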
You can add your own topologyKey, such as rack, but you will need to label your nodes accordingly. For example, if you have 20 nodes, you could label the first 10 with rack=rack1 and the second 10 with rack=rack2. Then, in the podAffinity spec, you would set topologyKey: rack.

Resource requests allow you to define the amount of resources (such as CPU or memory) you expect a container to use. The Kubernetes scheduler uses resource requests to avoid scheduling pods on nodes that do not have enough available resources.
Containers are allowed to use more (or less) than the requested resources. Resource requests only affect scheduling. By specifying resource requests, you're specifying the minimum amount of resources your pod needs. This information is what the Scheduler uses when scheduling the pod to a node. Each node has a certain amount of CPU and memory it can allocate to pods. When scheduling a pod, the Scheduler will only consider nodes with enough unallocated resources to meet the pod's resource requirements. If the amount of unallocated CPU or memory is less than what the pod requests, Kubernetes will not schedule the pod to that node, because the node can't provide the minimum amount required by the pod.

The Scheduler first filters the list of nodes to exclude those that the pod can't fit on, and then prioritizes the remaining nodes per the configured prioritization functions. Among others, two prioritization functions rank nodes based on the amount of resources requested: LeastRequestedPriority and MostRequestedPriority. The first one prefers nodes with fewer requested resources (with a greater amount of unallocated resources), whereas the second one is the exact opposite - it prefers nodes that have the most requested resources (a smaller amount of unallocated CPU and memory). Both consider the amount of requested resources, not the amount of resources actually consumed. The Scheduler is configured to use only one of those functions. Because the Scheduler needs to know how much CPU and memory each node has, the Kubelet reports this data to the API server, making it available through the Node resource.
list node overall capacity and allocatable capacity
> k get nodes -o jsonpath='{range .items[*]}NAME:{.metadata.name}{"\t"}CAPACITY:{.status.capacity.cpu}{"\t"}ALLOCATABLE:{.status.allocatable.cpu}{"\n"}{end}' -l gpu=true
NAME:k8s-worker1 CAPACITY:2 ALLOCATABLE:2
NAME:k8s-worker2 CAPACITY:2 ALLOCATABLE:2
The output shows two sets of amounts related to the available resources on the nodes: the node's capacity and its allocatable resources. The capacity represents the total resources of a node, which may not all be available to pods. The Scheduler bases its decisions only on the allocatable resource amounts.
Both CPU and memory requests are treated the same way by the Scheduler, but in contrast to memory requests, a pod's CPU requests also play a role elsewhere - while the pod is running. CPU requests don't only affect scheduling - they also determine how the remaining (unused) CPU time is distributed between pods. Running pods share unused CPU in the same ratio as their CPU requests. For example, with two pods on a node with 2 CPUs, where the 1st pod requests 1 CPU and the 2nd pod requests 200 millicores, the unused CPU is split in a 5:1 ratio (5/6 vs 1/6 of the unused CPU time). But if one container wants to use as much CPU as it can while the other one is sitting idle at a given moment, the 1st container will be allowed to use the whole CPU time. After all, it makes sense to use all the available CPU if no one else is using it. As soon as the 2nd container needs CPU time, it will get it, and the 1st container will be throttled back.
CPU is a compressible resource, which means the amount used by a container can be throttled without affecting the process running in the container in an adverse way. Memory is obviously different - it's incompressible. Once a process is given a chunk of memory, that memory can't be taken away from it until it's released by the process itself. That's why you need to limit the maximum amount of memory a container can be given.

Unlike resource requests, resource limits aren't constrained by the node's allocatable resource amounts. The sum of all limits of all the pods on a node is allowed to exceed 100% of the node's capacity. This has an important consequence - when 100% of the node's resources are used up, certain containers will need to be killed.

Even though you set a limit on how much memory is available to a container, the container will not be aware of this limit, because the container sees the memory of the whole node. Also, containers will see all the node's CPUs, regardless of the CPU limits configured for the container. All the CPU limit does is constrain the amount of CPU time the container can use.

Containers never get killed for trying to use too much CPU (they're throttled instead), but they are killed if they try to use too much memory (with an OOMKilled status).
limit - max amount of compute resources allowed. When a CPU limit is set for a container, the process isn't given more CPU time than the configured limit. With memory, when a process tries to allocate memory over its limit, the process is killed (it's said the container is OOMKilled)
requests - min amount of compute resources needed. If not set explicitly, defaults to the limits (if they exist)

Resource limits provide a way for you to limit the amount of resources your containers can use. The container runtime is responsible for enforcing these limits, and different container runtimes do this differently.
Some runtimes will enforce these limits by terminating container processes that attempt to use more than the allowed amount of resources
cpu - defined in millicores (1/1000 of one CPU). If your container needs 2 full cores to run you would put the value "2000m". If your container only needs 1/4 of a core you would put a value of "250m".
memory - defined in bytes.
limit and requests example
spec:
containers:
- image: nginx
name: nginx-pod
resources: # you're specifying resource requests and limits for the nginx-pod container
limits:
cpu: "250m" # the container will be allowed to use at most 250 millicores (that is 1/4 of a single CPU core's time)
memory: "128Mi" # the container will be allowed to use up to 128 mebibytes of memory
requests:
cpu: "125m" # the container requests 125 millicores (that is 1/8 of a single CPU core's time)
memory: "64Mi" # the container requests 64 mebibytes of memory
Kubernetes categorizes pods into three Quality of Service classes based on the combination of resource requests and limits for the pod's containers. Here are the classes:
- BestEffort (the lowest priority) - pods that don't have any requests or limits set at all
- Burstable - pods where a container's limits and requests don't match, or pods with resource requests specified but without limits
- Guaranteed (the highest) - pods whose containers (all containers in the pod) have requests equal to the limits for all resources

When the system is overcommitted, the QoS classes determine which container gets killed first so the freed resources can be given to higher priority pods. First in line to get killed are pods in the BestEffort class, followed by Burstable pods, and finally Guaranteed pods, which only get killed if system processes need memory.
When two single-container pods exist, both in the Burstable class, the system will kill the one using a higher percentage of its requested memory than the other.
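A minimal sketch of a pod that lands in the Guaranteed QoS class (the pod name is illustrative): every container sets requests equal to limits for both CPU and memory. Setting only the limits has the same effect, since requests default to the limits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-pod       # illustrative name
spec:
  containers:
  - name: main
    image: busybox
    command: ["sleep", "999999"]
    resources:
      requests:              # requests == limits for all resources -> Guaranteed
        cpu: "100m"
        memory: "64Mi"
      limits:
        cpu: "100m"
        memory: "64Mi"
```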
Instead of having to set resource limits for every container, one can create a LimitRange resource. It allows you to specify (for each namespace) not only the minimum and maximum limits you can set on a container for each resource, but also the default resource requests for containers that don't specify requests explicitly. LimitRange resources are used by the LimitRanger Admission Control plugin. When a pod manifest is posted to the API server, the LimitRanger plugin validates the pod spec. If validation fails, the manifest is rejected immediately. Because of this, a great use-case for LimitRange objects is to prevent users from creating pods that are bigger than any node in the cluster. Without such a LimitRange, the API server would gladly accept the pod, but then never schedule it. The limits specified in a LimitRange resource apply to each individual pod/container or other kind of object created in the same namespace as the LimitRange object. They don't limit the total amount of resources available across all the pods in the namespace (that is specified through ResourceQuota objects).
apiVersion: v1
kind: LimitRange
metadata:
name: example
spec:
limits:
- type: Pod # specifies the limits for a pod as a whole
min: # minimum CPU and memory all the pod's containers can request in total
cpu: 50m
memory: 5Mi
max:
cpu: 1 # maximum CPU and memory all the pod's containers can request(and limit)
memory: 1Gi
- type: Container # the container limits are specified below this line
defaultRequest: # default requests for CPU and memory that will be applied to containers that don't specify them explicitly
cpu: 100m
memory: 10Mi
default: # default limits for containers that don't specify them
cpu: 200m
memory: 100Mi
min: # minimum and maximum requests/limits that a container can have
cpu: 50m
memory: 5Mi
max:
cpu: 1
memory: 1Gi
maxLimitRequestRatio: # maximum ratio between the limit and request for each resource
cpu: 4 # a container's CPU limit will not be allowed to be more than 4 times greater than its CPU request. A container requesting 200m will not be accepted if its CPU limit is set to 801m or higher
memory: 10
- type: PersistentVolumeClaim # a LimitRange can also set the minimum and maximum amount of storage a PVC can request
min:
storage: 1Gi
max:
storage: 10Gi
LimitRanges only apply to individual pods, but cluster admins also need a way to limit the total amount of resources in a namespace. The ResourceQuota Admission Control plugin checks whether the pod being created would cause the configured ResourceQuota to be exceeded. Because resource quotas are enforced at pod creation time, a ResourceQuota object only affects pods created after the ResourceQuota object is created - creating it has no effect on existing pods. A ResourceQuota limits the amount of computational resources the pods and the amount of storage the PersistentVolumeClaims in a namespace can consume. It can also limit the number of pods, claims, and other API objects users are allowed to create inside the namespace. A ResourceQuota object applies to the namespace it's created in, like a LimitRange, but it applies to all the pods' resource requests and limits in total and not to each individual pod or container separately.
When a quota for a specific resource (CPU or memory) is configured (requests or limits), pods need to have the request or limit (respectively) set for that same resource; otherwise the API server will not accept the pod. That's why having a LimitRange with defaults for those resources can make life a bit easier for people creating pods.
Quotas can also be limited to a set of quota scopes:
- BestEffort - whether the quota applies to pods with the BestEffort QoS class. Can only limit the number of pods
- NotBestEffort - whether the quota applies to pods with one of the other two classes (Burstable or Guaranteed). Can limit the number of pods and CPU/memory requests/limits
- Terminating - pods that have activeDeadlineSeconds set. Can limit the number of pods and CPU/memory requests/limits
- NotTerminating - pods that don't have activeDeadlineSeconds set. Can limit the number of pods and CPU/memory requests/limits

When creating a ResourceQuota, you can specify the scopes that it applies to. A pod must match all the specified scopes for the quota to apply to it. Also, what a quota can limit depends on the quota's scope.
apiVersion: v1
kind: ResourceQuota
metadata:
name: besteffort-notterminating-pods
spec:
scopes: # this quota only applies to pods that have the BestEffort QoS and don't have an active deadline set
- BestEffort # if the quota was targeting NotBestEffort pods you could also specify requests/{cpu,memory} and limits/{cpu,memory}
- NotTerminating
hard:
pods: 4 # only four such pods can exist
apiVersion: v1
kind: ResourceQuota
metadata:
name: cpu-and-mem
spec:
hard:
requests.cpu: 400m
requests.memory: 200Mi
limits.cpu: 600m
limits.memory: 500Mi
apiVersion: v1
kind: ResourceQuota
metadata:
name: storage
spec:
hard:
requests.storage: 500Gi # the amount of storage claimable overall
ssd.storageclass.storage.k8s.io/requests.storage: 300Gi # the amount of claimable storage in StorageClass named ssd
standard.storageclass.storage.k8s.io/requests.storage: 1Ti # the amount of claimable storage in StorageClass named standard
apiVersion: v1
kind: ResourceQuota
metadata:
name: objects
spec:
hard:
pods: 10 # only 10 pods, 5 RCs, 10 secrets, 10 CMs, and 4 PVCs can be created in the namespace
replicationcontrollers: 5
secrets: 10
configmaps: 10
persistentvolumeclaims: 4
services: 5 # 5 SVC overall can be created,of which at most one can be a LoadBalancer SVC and at most 2 can be NodePort SVCs
services.loadbalancers: 1
services.nodeports: 2
ssd.storageclass.storage.k8s.io/persistentvolumeclaims: 2 # only 2 PVCs can claim storage with the ssd StorageClass
print all api resources
kubectl api-resources -h
print doc about some api resource
kubectl explain ${api-resource}
like
k explain pod.spec.containers.resources
The Kubernetes API server can be configured to use an authorization plugin to check whether an action is allowed to be performed by the user requesting the action. REST clients send GET, POST, PUT, DELETE and other types of HTTP requests to specific URL paths, which represent specific REST resources. The verbs in those examples (get, create, update) map to HTTP methods (GET, POST, PUT) performed by the client. An authorization plugin such as RBAC, which runs inside the API server, determines whether a client is allowed to perform the requested verb on the requested resource or not.
| HTTP method | Verb for single resource | Verb for collection |
|---|---|---|
| GET,HEAD | get (and watch for watching) | list (and watch) |
| POST | create | n/a |
| PUT | update | n/a |
| PATCH | patch | n/a |
| DELETE | delete | deletecollection |
Besides applying security permissions to whole resource types, RBAC rules can also apply to specific instances of a resource (for example, a Service called myservice). Also, permissions can be set for non-resource URL paths, because not every path the API server exposes maps to a resource (such as the /api path itself or the server health information at /healthz). Regular Roles can't grant access to non-resource URLs, but ClusterRoles can.
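A sketch of a ClusterRole granting access to non-resource URLs (the role name is illustrative); such rules use nonResourceURLs instead of resources:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: healthz-reader                           # illustrative name
rules:
- nonResourceURLs: ["/healthz", "/healthz/*"]    # non-resource paths; only a ClusterRole can grant these
  verbs: ["get"]
```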
The RBAC authorization plugin uses user roles as the key factor in determining whether the user may perform the action or not. A subject (which may be a human, a ServiceAccount, or a group of users or ServiceAccounts) is associated with one or more roles, and each role is allowed to perform certain verbs on certain resources.
Roles and ClusterRoles are Kubernetes objects that define a set of permissions. These permissions determine what users can do in the cluster.
A Role defines permissions within a particular namespace, and a ClusterRole defines cluster-wide permissions not specific to a single namespace.
RoleBinding and ClusterRoleBinding are objects that connect Roles and ClusterRoles to subjects (human users, ServiceAccounts, groups).
Roles define what can be done, while bindings define who can do it.
create a role
> k create role service-reader -n foo --resource=services --verb=get,list --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
creationTimestamp: null
name: service-reader
namespace: foo # roles are namespaced. They only allow access to resources in the same namespace the Role is in
rules:
- apiGroups: [""] # services are resources in the `core` apiGroup, which has no name - hence the "". For a `named` apiGroup one needs to specify the group name
resources: ["services"] # this rule pertains to services (the plural name must be used)
resourceNames: ["mile-svc"] # optionally restrict the rule to individual services by name
verbs: ["get","list"] # getting individual Services (by name) and listing all of them is allowed
create a rolebinding for user and serviceaccount
> k create rolebinding test --role=service-reader --serviceaccount=foo:default -n foo --user=mile --group=folker --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
creationTimestamp: null
name: test
namespace: foo
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: service-reader # this RoleBinding references the service-reader Role
subjects:
- apiGroup: rbac.authorization.k8s.io
kind: User
name: mile
- apiGroup: rbac.authorization.k8s.io
kind: Group
name: folker
- kind: ServiceAccount # And binds it to the default ServiceAccount in the foo namespace
name: default
namespace: foo
A RoleBinding always references a single Role (as evident from the roleRef property), but can bind the Role to multiple subjects (one or more ServiceAccounts and any number of users or groups).
Although you can create a RoleBinding and have it reference a ClusterRole when you want to enable access to namespaced resources, you can't use the same approach for cluster-level (non-namespaced) resources. To grant access to cluster-level resources, you must always use a ClusterRoleBinding.
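A sketch of the first case: a RoleBinding in namespace foo referencing the built-in view ClusterRole, which grants read access to namespaced resources in foo only (the binding name is illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: view-in-foo               # illustrative name
  namespace: foo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole               # a RoleBinding may reference a ClusterRole
  name: view
subjects:
- kind: ServiceAccount
  name: default
  namespace: foo                  # access applies only inside the binding's namespace
```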
create a clusterrole
> k create clusterrole pv-reader --verb=get,list --resource=persistentvolumes --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
creationTimestamp: null
name: pv-reader
rules: # can be applied to nonResourceURLs as well
- apiGroups:
- "" # will be populated based on used resorices
resources:
- persistentvolumes
verbs:
- get
- list
bind a ClusterRole and ServiceAccount via a ClusterRoleBinding
k create clusterrolebinding pv-test --clusterrole=pv-reader --serviceaccount=foo:default --dry-run=client -o yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
creationTimestamp: null
name: pv-test
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: pv-reader
subjects:
- kind: ServiceAccount
name: default
namespace: foo
ClusterRoles have several uses. You can use a ClusterRole to:
The most important default roles are the view, edit, admin, and cluster-admin ClusterRoles.
- view - allows reading most resources in a namespace, except for Roles, RoleBindings, and Secrets
- edit - allows modifying resources in a namespace, and also both reading and modifying Secrets. Cannot view or modify Roles and RoleBindings
- admin - allows complete control of the resources in a namespace (except ResourceQuotas and the Namespace resource itself). The main difference between the edit and admin ClusterRoles is the ability to view and modify Roles and RoleBindings in the namespace
- cluster-admin - complete control of the Kubernetes cluster

By default, the default ServiceAccount in a namespace has no permissions other than those of an unauthenticated user (the system:discovery ClusterRole and associated binding allow anyone to make GET requests on a few non-resource URLs).
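To see exactly what these default roles allow, you can inspect them directly; the output is cluster-dependent, so only the commands are sketched here:

```shell
k describe clusterrole view   # lists the resources and verbs the view role grants
k get clusterrole edit -o yaml
```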
In k8s, a service account is an account used by container processes within Pods to authenticate with the k8s API.
Every pod is associated with a ServiceAccount, which represents the identity of the app running in the pod. The token file holds the ServiceAccount's authentication token. When an app uses this token to connect to the API server, the authentication plugin authenticates the ServiceAccount and passes the ServiceAccount's username back to the API server core. ServiceAccount usernames are formatted like this: system:serviceaccount:{namespace}:{service account name}. The API server passes this username to the configured authorization plugins, which determine whether the action the app is trying to perform is allowed for the ServiceAccount.
The authentication tokens used in ServiceAccounts are JWT tokens
ServiceAccounts are nothing more than a way for an application running inside a pod to authenticate itself with the API server. As already mentioned, applications do that by passing the ServiceAccount's token in the request.
A pod's ServiceAccount must be set when creating the pod. It can't be changed later.
A default ServiceAccount is automatically created for each namespace (that's the one your pods have used all along). One can assign a ServiceAccount to a pod by specifying the account's name in the pod manifest. If you don't assign it explicitly, the pod will use the default ServiceAccount in the namespace.
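A sketch of assigning a ServiceAccount in the pod manifest (the pod and SA names are illustrative; the SA must already exist in the pod's namespace):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sa-pod                 # illustrative name
spec:
  serviceAccountName: foo      # cannot be changed after the pod is created
  containers:
  - name: main
    image: busybox
    command: ["sleep", "999999"]
```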
You can manage access control for service accounts, just like for any other user, using RBAC objects. A RoleBinding or ClusterRoleBinding binds a role to subjects. Subjects can be groups, users or ServiceAccounts.
create service account
k create sa test --dry-run=client -o yaml -n default
From version 1.24, Kubernetes no longer automatically creates a Secret containing a token for a ServiceAccount. One must create a token manually:
k create sa SERVICEACCOUNTNAME # creates the sa (without an auto-generated token Secret); the sa can then be used on a pod
k create token SERVICEACCOUNTNAME # creates a token for the related sa
describe options in sa
> k describe sa foo
Name: foo
Namespace: default
Labels: <none>
Annotations: <none>
Image pull secrets: <none> # these will be added automatically to all pods using this ServiceAccount. This is defined in `sa.imagePullSecrets` and they will not be mounted on pods. They are only used by the kubelet when it needs to fetch images from a private registry.
Mountable secrets: <none> # pods using this SA can only mount these Secrets if mountable Secrets are enforced. To enforce this, the SA must be annotated with kubernetes.io/enforce-mountable-secrets: "true"
Tokens: <none> # authentication token(s). The first one is mounted inside the container
Events: <none>
example for rolebinding a role to service account
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
labels:
k8s-app: metrics-server
name: metrics-server-auth-reader
namespace: kube-system
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: extension-apiserver-authentication-reader
subjects:
- kind: ServiceAccount
name: metrics-server
namespace: kube-system
metadata.ownerReferences - can be used to find which resource an object belongs to (POD->RS, RS->DEPLOYMENT)
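What ownerReferences looks like on a pod created by a Deployment's ReplicaSet (all names and the UID below are illustrative):

```yaml
metadata:
  name: nginx-deployment-7fb96c846b-xk2p9       # illustrative pod name
  ownerReferences:
  - apiVersion: apps/v1
    kind: ReplicaSet                            # this pod belongs to a ReplicaSet...
    name: nginx-deployment-7fb96c846b           # ...which in turn is owned by the Deployment
    uid: d9607e19-f88f-11e6-a518-42010a800195   # illustrative UID
    controller: true
    blockOwnerDeletion: true
```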
LABELS
label format; use --overwrite if you want to update an existing label
k label resource resource_NAME KEY=VALUE [--overwrite]
get custom output
k get po -o custom-columns=POD:metadata.name,NODE:spec.nodeName --sort-by spec.nodeName
api deprecations, kubectl convert must be installed first
k convert -f FILENAME --output-version <new-api>
The Cluster Autoscaler takes care of automatically provisioning additional nodes when it notices a pod that can't be scheduled to existing nodes because of a lack of resources on those nodes. It also de-provisions nodes when they're underutilized for longer periods of time. A new node will be provisioned if, after a new pod is created, the Scheduler can't schedule it to any of the existing nodes. The Cluster Autoscaler looks out for such pods and asks the cloud provider to start up an additional node.
Certain services require that a minimum number of pods always keeps running: this is especially true for quorum-based clustered applications. For this reason, Kubernetes provides a way of specifying the minimum number of pods that need to keep running during these types of operations: the PodDisruptionBudget (PDB). It contains only a pod label selector and a number specifying the minimum number of pods that must always be available, or the maximum number of pods that can be unavailable.
pdb example
> k create pdb kubia-pdb --selector app=kubia --min-available 3 --dry-run=client -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
creationTimestamp: null
name: kubia-pdb
spec:
minAvailable: 3 # how many pods should always be available; can also be defined as a percentage
selector: # the label selector that determines which pods this budget applies to
matchLabels:
app: kubia
status:
currentHealthy: 0
desiredHealthy: 0
disruptionsAllowed: 0
expectedPods: 0
As long as the PDB exists, both the Cluster Autoscaler and the k drain command will adhere to it and will never evict a pod with the app=kubia label if that would bring the number of such pods below three.
The Downward API allows you to pass metadata about the pod and its environment through environment variables or files (in a downwardAPI volume). It's a way of having environment variables or files populated with values from the pod's specification or status. It allows you to pass the following information to your containers:
Most items in the list can be passed to containers either through environment variables or through a downwardAPI volume, but labels and annotations can only be exposed through the volume.
downward API pod example
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels: # labels and annotations will be exposed via downwardAPI volume
foo: bar
run: downward
annotations:
key1: value1
key2: |
multi
line
value
name: downward
spec:
containers:
- command:
- sleep
- "999999"
image: busybox
name: main
resources:
requests:
cpu: 15m
memory: 100Ki
limits:
cpu: 100m
memory: 4Mi
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name # instead of specifying an absolute value you're referencing the metadata.name field from the pod manifest
- name: POD_NAMESPACE
valueFrom:
fieldRef:
fieldPath: metadata.namespace
- name: POD_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: SERVICE_ACCOUNT
valueFrom:
fieldRef:
fieldPath: spec.serviceAccountName
- name: CONTAINER_CPU_REQUEST_MILLICORES
valueFrom:
resourceFieldRef:
resource: requests.cpu # a container's CPU and memory requests and limits are referenced by using resourceFieldRef instead of fieldRef
divisor: 1m # for resource fields, you define a divisor to get the value in the unit you need
- name: CONTAINER_MEMORY_LIMIT_KIBIBYTES
valueFrom:
resourceFieldRef:
resource: limits.memory
divisor: 1Ki
volumeMounts:
- name: downward
mountPath: /etc/downward
dnsPolicy: ClusterFirst
restartPolicy: Always
volumes:
- name: downward
downwardAPI:
items:
- path: "podName"
fieldRef:
fieldPath: metadata.name
- path: "podNamespace"
fieldRef:
fieldPath: metadata.namespace
- path: "labels" # the pod's labels will be written to the /etc/downward/labels file
fieldRef:
fieldPath: metadata.labels
- path: "annotations" # the pod's annotations will be written to the /etc/downward/annotations file
fieldRef:
fieldPath: metadata.annotations
- path: "containerCpuRequestMilliCores"
resourceFieldRef:
containerName: main # the container name must be specified because volumes are defined at the pod level, not the container level
resource: requests.cpu
divisor: 1m
- path: "containerMemoryLimitBytes"
resourceFieldRef:
containerName: main
resource: limits.memory
divisor: 1
status: {}
The Downward API is fairly limited; if you need more, you will need to obtain it from the Kubernetes API server directly.
The kubectl proxy command runs a proxy server that accepts HTTP connections on your local machine and proxies them to the API server while taking care of authentication, so you don't need to pass the authentication token in every request. It also makes sure you're talking to the actual API server and not a man in the middle (by verifying the server's certificate on each request). As soon as it starts up (via k proxy), the proxy starts accepting connections on local port 8001.
k proxy &
# check api/v1 response
curl localhost:8001/api/v1
curl localhost:8001/apis
# fetch specific job in dev namespace
curl localhost:8001/apis/batch/v1/namespaces/dev/jobs/<jobName>
# fetch specific pod in web namespace
curl localhost:8001/api/v1/namespaces/web/pods/<podName>
# call a service via api
curl localhost:8001/api/v1/namespaces/default/services/<serviceName>/proxy/
# call a pod via api
curl localhost:8001/api/v1/namespaces/default/pods/<podName>/proxy/
# from inside a pod, use the mounted ServiceAccount CA cert and token instead
export CURL_CA_BUNDLE=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
export TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl -H "Authorization: Bearer ${TOKEN}" https://kubernetes
or via proxy container
apiVersion: v1
kind: Pod
metadata:
creationTimestamp: null
labels:
run: curl-with-ambassador
name: curl-with-ambassador
spec:
containers:
- command:
- sleep
- "9999"
image: edesibe/curl
name: main
resources: {}
- name: proxy
image: edesibe/kubectl-proxy
dnsPolicy: ClusterFirst
restartPolicy: Always
status: {}
The API server can be configured with one or more authentication plugins (and the same is true for authorization plugins). When a request is received by the API server, it goes through the list of authentication plugins, so they can each examine the request and try to determine who's sending it. The first plugin that can extract that information from the request returns the username, user ID, and the groups the client belongs to back to the API server core. All Kubernetes clusters have two categories of users: service accounts managed by Kubernetes, and normal users. It is assumed that a cluster-independent service manages normal users in the following ways:
In this regard, Kubernetes does not have objects which represent normal user accounts. Normal users cannot be added to a cluster through an API call. Even though a normal user cannot be added via an API call, any user that presents a valid certificate signed by the cluster's certificate authority (CA) is considered authenticated. In this configuration, Kubernetes determines the username from the common name field in the 'subject' of the cert (e.g., "/CN=mile"). From there, the role based access control (RBAC) sub-system would determine whether the user is authorized to perform a specific operation on a resource. For more details, refer to the normal users topic in the certificate request documentation. In contrast, service accounts are users managed by the Kubernetes API. They are bound to specific namespaces, and created automatically by the API server or manually through API calls. Service accounts are tied to a set of credentials stored as Secrets, which are mounted into pods, allowing in-cluster processes to talk to the Kubernetes API. API requests are tied to either a normal user or a service account, or are treated as anonymous requests. This means every process inside or outside the cluster, from a human user typing kubectl on a workstation, to kubelets on nodes, to members of the control plane, must authenticate when making requests to the API server, or be treated as an anonymous user.
An authentication plugin returns the username and group(s) of the authenticated user. Kubernetes doesn't store that information anywhere; it uses it to verify whether the user is authorized to perform an action or not. Kubernetes distinguishes between two kinds of clients connecting to the API server:
Both human users and ServiceAccounts can belong to one or more groups. The authentication plugin returns groups along with the username and user ID. Groups are used to grant permissions to several users at once, instead of having to grant them to individual users. The groups returned by the plugin are nothing but strings representing arbitrary group names, but built-in groups have special meaning:
- system:unauthenticated is used for requests where none of the authentication plugins could authenticate the client
- system:authenticated is automatically assigned to a user who was authenticated successfully
- system:serviceaccounts encompasses all ServiceAccounts in the system
- system:serviceaccounts:<namespace> includes all ServiceAccounts in a specific namespace

An individual ServiceAccount is referenced as system:serviceaccount:<namespace>:<ServiceAccountName>.
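These built-in groups can be used as binding subjects like any other group. For example, a sketch of a RoleBinding that grants the service-reader Role (defined earlier) to every ServiceAccount in namespace foo (the binding name is illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-services-all-sas        # illustrative name
  namespace: foo
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: service-reader
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:serviceaccounts:foo   # all ServiceAccounts in namespace foo
```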
API groups fall into two categories, core and named:
- core - core system objects like svc, cm, pod, secrets, etc.
- named - all new features will be added here

> k get --raw='/api/v1' | jq -r '.resources[].name'
bindings
componentstatuses
configmaps
endpoints
events
limitranges
namespaces
namespaces/finalize
namespaces/status
nodes
nodes/proxy
nodes/status
persistentvolumeclaims
persistentvolumeclaims/status
persistentvolumes
persistentvolumes/status
pods
pods/attach
pods/binding
pods/ephemeralcontainers
pods/eviction
pods/exec
pods/log
pods/portforward
pods/proxy
pods/status
podtemplates
replicationcontrollers
replicationcontrollers/scale
replicationcontrollers/status
resourcequotas
resourcequotas/status
secrets
serviceaccounts
serviceaccounts/token
services
services/proxy
services/status
> k get --raw='/apis' | jq -r '.groups[].name'
apiregistration.k8s.io
apps
events.k8s.io
authentication.k8s.io
authorization.k8s.io
autoscaling
batch
certificates.k8s.io
networking.k8s.io
policy
rbac.authorization.k8s.io
storage.k8s.io
admissionregistration.k8s.io
apiextensions.k8s.io
scheduling.k8s.io
coordination.k8s.io
node.k8s.io
discovery.k8s.io
flowcontrol.apiserver.k8s.io
crd.projectcalico.org
metrics.k8s.io
getting resources from /apis/apps API group
> k get --raw='/apis/apps/v1' | jq -r '.resources[].name'
controllerrevisions
daemonsets
daemonsets/status
deployments
deployments/scale
deployments/status
replicasets
replicasets/scale
replicasets/status
statefulsets
statefulsets/scale
statefulsets/status
Each resource has certain verbs which are used to manipulate it, such as: get, list, create, patch, update, delete...
In K8s we have several authorization modes:
Configuration is done on the API server with --authorization-mode=Node,RBAC. If nothing is defined, AlwaysAllow is used. When you have multiple modes configured, a request is authorized using each one in the order it is specified. Every time a module denies the request, it goes to the next one in the chain; as soon as a module approves the request, no more checks are done and the user is granted permission.
An admission controller is a piece of code that intercepts requests to the Kubernetes API server prior to persistence of the object, but after the request is authenticated and authorized. Admission controllers may be validating, mutating, or both. Mutating controllers may modify related objects to the requests they admit; validating controllers may not. Admission controllers limit requests to create, delete, modify objects. Admission controllers can also block custom verbs, such as a request connect to a Pod via an API server proxy. Admission controllers do not (and cannot) block requests to read (get, watch or list) objects.
enabling
kube-apiserver --enable-admission-plugins=NamespaceLifecycle,LimitRanger ...
disabling
kube-apiserver --disable-admission-plugins=PodNodeSelector,AlwaysDeny ...
The following commands work only for admins
k get --raw='/api/v1/pods' --as=system:serviceaccount:<namespace>:<ServiceAccountName>
or
k auth can-i get '/api/v1/pods' --as=system:serviceaccount:<namespace>:<ServiceAccountName>
check all perms using the Impersonate-User header set to system:serviceaccount:<namespace>:<ServiceAccountName> and optionally Impersonate-Group set to system:serviceaccounts and system:serviceaccounts:<namespace>
k auth can-i --list --as=system:serviceaccount:<namespace>:<ServiceAccountName> [--as-group=system:serviceaccounts] [--as-group=system:serviceaccounts:<namespace>]
The list of namespaces used in containers is:
Generate a csr
openssl genrsa -out myuser.key 2048
openssl req -new -key myuser.key -out myuser.csr
create a CertificateSigningRequest
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
name: myuser
spec:
request: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURSBSRVFVRVNULS0tLS0KTUlJQ1ZqQ0NBVDRDQVFBd0VURVBNQTBHQTFVRUF3d0dZVzVuWld4aE1JSUJJakFOQmdrcWhraUc5dzBCQVFFRgpBQU9DQVE4QU1JSUJDZ0tDQVFFQTByczhJTHRHdTYxakx2dHhWTTJSVlRWMDNHWlJTWWw0dWluVWo4RElaWjBOCnR2MUZtRVFSd3VoaUZsOFEzcWl0Qm0wMUFSMkNJVXBGd2ZzSjZ4MXF3ckJzVkhZbGlBNVhwRVpZM3ExcGswSDQKM3Z3aGJlK1o2MVNrVHF5SVBYUUwrTWM5T1Nsbm0xb0R2N0NtSkZNMUlMRVI3QTVGZnZKOEdFRjJ6dHBoaUlFMwpub1dtdHNZb3JuT2wzc2lHQ2ZGZzR4Zmd4eW8ybmlneFNVekl1bXNnVm9PM2ttT0x1RVF6cXpkakJ3TFJXbWlECklmMXBMWnoyalVnald4UkhCM1gyWnVVV1d1T09PZnpXM01LaE8ybHEvZi9DdS8wYk83c0x0MCt3U2ZMSU91TFcKcW90blZtRmxMMytqTy82WDNDKzBERHk5aUtwbXJjVDBnWGZLemE1dHJRSURBUUFCb0FBd0RRWUpLb1pJaHZjTgpBUUVMQlFBRGdnRUJBR05WdmVIOGR4ZzNvK21VeVRkbmFjVmQ1N24zSkExdnZEU1JWREkyQTZ1eXN3ZFp1L1BVCkkwZXpZWFV0RVNnSk1IRmQycVVNMjNuNVJsSXJ3R0xuUXFISUh5VStWWHhsdnZsRnpNOVpEWllSTmU3QlJvYXgKQVlEdUI5STZXT3FYbkFvczFqRmxNUG5NbFpqdU5kSGxpT1BjTU1oNndLaTZzZFhpVStHYTJ2RUVLY01jSVUyRgpvU2djUWdMYTk0aEpacGk3ZnNMdm1OQUxoT045UHdNMGM1dVJVejV4T0dGMUtCbWRSeEgvbUNOS2JKYjFRQm1HCkkwYitEUEdaTktXTU0xMzhIQXdoV0tkNjVoVHdYOWl4V3ZHMkh4TG1WQzg0L1BHT0tWQW9FNkpsYWFHdTlQVmkKdjlOSjVaZlZrcXdCd0hKbzZXdk9xVlA3SVFjZmg3d0drWm89Ci0tLS0tRU5EIENFUlRJRklDQVRFIFJFUVVFU1QtLS0tLQo= # base64 encoded myuser.csr generated as `cat myuser.csr | base64 | tr -d "\n"`
signerName: kubernetes.io/kube-apiserver-client
expirationSeconds: 86400 # one day
usages:
- client auth
approve or deny csr
k certificate approve/deny myuser
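Once approved, the signed certificate can be pulled out of the CSR object's status and wired into kubeconfig; a sketch, with user and file names following the example above (the context/cluster names are illustrative):

```shell
# extract the issued certificate (stored base64-encoded in the CSR status)
k get csr myuser -o jsonpath='{.status.certificate}' | base64 -d > myuser.crt
# add the credentials and a context for the new user
k config set-credentials myuser --client-key=myuser.key --client-certificate=myuser.crt --embed-certs=true
k config set-context myuser --cluster=kubernetes --user=myuser
```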
In order to view metrics about the resources pods and containers are using, we need an add-on to collect and provide the data. One such add-on is the Kubernetes Metrics Server.
Cluster monitoring is done via the "metrics server". The Kubernetes Metrics Server collects resource metrics from the kubelets in your cluster, and exposes those metrics through the Kubernetes API, using an APIService to add new kinds of resource that represent metric readings.
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# - Modify and add "- --kubelet-insecure-tls" in deployment.spec.template.spec.containers.args
k -n kube-system edit deployment metrics-server
# monitor nodes: shows current CPU and memory usage
k top node
# monitor pods (optionally per container), sorted by cpu or memory, filtered by label
k top pod --sort-by [cpu|memory] --selector <LABEL> [--containers]
Monitoring applications is done via liveness and readiness probes. Cluster logs can be found in /var/log/containers. Application logs can be observed via `k logs [svc,pod,deployment] --container CONTAINER_NAME --previous --selector LABELS`.
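A sketch of both probe types on a pod (pod name, paths and ports are illustrative, not from this cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: probe-demo # hypothetical pod name
spec:
  containers:
  - name: app
    image: nginx:1.7.9
    livenessProbe: # container is restarted when this probe fails
      httpGet:
        path: /healthz # assumed health endpoint
        port: 80
      initialDelaySeconds: 5
      periodSeconds: 10
    readinessProbe: # pod is removed from service endpoints while this probe fails
      tcpSocket:
        port: 80
      periodSeconds: 5
```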
Custom resources are extensions of the Kubernetes API. This page discusses when to add a custom resource to your Kubernetes cluster and when to use a standalone service. It describes the two methods for adding custom resources and how to choose between them.
---
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: internals.datasets.kodekloud.com
spec:
  # group name to use for REST API: /apis/<group>/<version>
  group: datasets.kodekloud.com
  # list of versions supported by this CustomResourceDefinition
  versions:
  - name: v1 # version of the crd
    # Each version can be enabled/disabled by the served flag.
    served: true
    # One and only one version must be marked as the storage version.
    storage: true
    schema: # validation schema for objects of this CRD
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              internalLoad:
                type: string
              range:
                type: integer
              percentage:
                type: string
  # either Namespaced or Cluster
  scope: Namespaced
  names: # naming and optionally an alias
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: internals
    # singular name to be used as an alias on the CLI and for display
    singular: internal
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: Internal
    # shortNames allow a shorter string to match your resource on the CLI
    shortNames:
    - int
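With that CRD applied, a matching custom resource could look like this (the object name and field values are made up for illustration):

```yaml
apiVersion: datasets.kodekloud.com/v1
kind: Internal
metadata:
  name: internal-sample # hypothetical name
  namespace: default # the CRD scope is Namespaced
spec:
  internalLoad: "high"
  range: 80
  percentage: "50"
```

It can then be listed like any built-in resource: `k get internals`, or via the short name, `k get int`.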
On every pod the following volume is mounted, containing the namespace, the CA certificate, and a token for API server communication:
/var/run/secrets/kubernetes.io/serviceaccount/
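Those mounted files can be used to talk to the apiserver from inside a pod. A sketch (only runnable inside a pod; `kubernetes.default.svc` is the in-cluster apiserver address):

```shell
SA=/var/run/secrets/kubernetes.io/serviceaccount
TOKEN=$(cat ${SA}/token)
NAMESPACE=$(cat ${SA}/namespace)

# authenticate with the mounted token, verify the server with the mounted CA cert
curl --cacert ${SA}/ca.crt \
     --header "Authorization: Bearer ${TOKEN}" \
     https://kubernetes.default.svc/api/v1/namespaces/${NAMESPACE}/pods
```

What the request is allowed to do depends on the RBAC bindings of the pod's ServiceAccount.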
On the control plane we have TLS server certs for components (kube-apiserver, etcd, kubelet)
and client certs (admin, kube-scheduler, kube-controller-manager, the apiserver as a client of etcd and of the kubelets, and each kubelet as a client of the apiserver).
Most of the certs can be found in the /etc/kubernetes/pki folder, except for the kubelet, which stores its certs in /var/lib/kubelet/pki, and kube-proxy, which uses a ServiceAccount to access the apiserver.

Microservices are small, independent services that work together to form a whole application.
Many applications are designed with a monolithic architecture, meaning that all parts of the application are combined in one large executable.
A microservices architecture breaks the application up into several small services.
velero - backup solution
rakkess - rbac auditing
audit2rbac - rbac auditing
metallb - on-premise load balancer
reloader - reload pods on a change to a configmap or secret
k get RESOURCE RESOURCE-NAME -o json | jq -c paths | grep KEY
k get nodes -o json | jq -c 'paths|[.[]|tostring]|join(".")' | grep -i osImage
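To see what that jq filter does without a cluster, here is a self-contained sketch on a dummy node-like document (assumes `jq` is installed; the sample JSON is made up):

```shell
# a fake, minimal node object standing in for `k get nodes -o json`
echo '{"status":{"nodeInfo":{"osImage":"Ubuntu 22.04"}}}' > node.json

# list every path as a dotted string, then grep for the key of interest
jq -c 'paths | [.[]|tostring] | join(".")' node.json | grep -i osImage
# → "status.nodeInfo.osImage"
```

This is handy for discovering the jsonpath to a field whose exact location you do not remember.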
patch-file.json
{
  "spec": {
    "template": {
      "spec": {
        "containers": [
          {
            "name": "patch-demo-ctr-2",
            "image": "redis"
          }
        ]
      }
    }
  }
}
The following commands are equivalent:
kubectl patch deployment patch-demo --patch-file patch-file.json
kubectl patch deployment patch-demo --patch '{"spec": {"template": {"spec": {"containers": [{"name": "patch-demo-ctr-2","image": "redis"}]}}}}'
https://www.tutorialworks.com/difference-docker-containerd-runc-crio-oci/