# Secure virtualised workloads

* **workload**: any job/payload that needs to be executed on infrastructure
    * straight on OS
    * in a VM
    * in a container

## Virtual machines

![VM architecture](./img/ch09/vm_diagram.png)

* physical hardware
    * CPU, memory, chipset, I/O...
    * resources often underutilized
    * no isolation
* hardware-level abstraction
    * virtual hardware
    * encapsulate all OS and application state
* virtualization software
    * hypervisor/VMM
    * extra level of indirection to decouple hardware and OS
    * strong isolation between VMs
    * improves utilization
* secure multiplexing
    * isolation on hardware level
    * failure of one VM does not affect others
* entire VM is a file
    * easy to snapshot, clone, move, distribute
* create once, run anywhere (well we try)
* types
    * **type 1**: hypervisor runs on bare metal (no host OS) (VMware ESXi,
      Microsoft Hyper-V, KVM...)
    * **type 2**: hypervisor runs on host OS (VirtualBox, VMware Workstation...)
        * relies on host OS to manage calls to hardware
        * adds latency
        * vulnerabilities in host OS can be exploited to reach the VMs
        * aimed towards developers

![Type 1 virtualisation](./img/ch09/type_1_hypervisor.png){width=50%} \ ![Type 2 virtualisation](./img/ch09/type_2_hypervisor.png){width=50%}

## Containers

* virtualization on OS level
* much more lightweight -> more dense utilization
* share same host OS / kernel
* advantages
    * much faster startup
    * easier to manage
    * more containers per host than VMs
* no hardware isolation: shared kernel means weaker security boundary than VMs
* the future
    * blur the line between containers and VMs
    * **Kata Containers**: lightweight VM per container (better security)
    * **Microsoft Hyper-V**: sometimes wraps containers in lightweight VM
* Linux Security Modules (LSM)
    * hostile processes can break out of container (badly configured
      namespaces, kernel exploits...)
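    * LSM defines mandatory access control (e.g. AppArmor, SELinux)
    * policy lists what each process is allowed to do (files, capabilities,
      network...)
    * defined by sysadmin
    * niche syscalls can additionally be blocked (seccomp filter) so they
      cannot be exploited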
* types
    * **OS-level containerization**: spawn containers straight on host OS + kernel
        * isolation using kernel functionality (namespaces, cgroups...)
        * no need for full guest OS
        * no hardware extensions
        * attackers could escape container and compromise host
        * Docker
    * **micro-VM**: containers in lightweight VMs on host
        * utilizes hardware-enforced isolation
        * containers do not share kernel
        * safer
        * slower startup, worse performance
    * **unikernel**: application compiled together with tailored kernel
        * monitor application to learn which syscalls it uses
        * once known, construct minimal kernel and fixed-purpose image
        * no user space, only kernel space
        * much smaller attack surface (kernel only contains what's necessary)
        * runs straight on hypervisor or bare metal
        * small footprint, quick to start
    * **sandboxing**: container in sandbox running copy of host kernel
        * syscalls translated to host kernel
        * good isolation
        * slow
        * not all syscalls supported (yet)
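
The "monitor which syscalls are used, then allow only those" idea (used when tailoring a unikernel, and when restricting a container) can be sketched as a toy model. This is only an illustration: real enforcement happens in the kernel, not in Python. The trace format (plain `strace`-style lines) and the names `build_allowlist` and `check_syscall` are assumptions made for this sketch.

```python
import re

# Matches the syscall name at the start of a plain strace-style line,
# e.g. 'read(3, "", 4096) = 0'. Assumed format, no PID prefixes.
SYSCALL_RE = re.compile(r"^([a-z_][a-z0-9_]*)\(")


def build_allowlist(trace_lines):
    """Collect the set of syscall names observed in a trace."""
    allowed = set()
    for line in trace_lines:
        match = SYSCALL_RE.match(line.strip())
        if match:  # skips signal and exit lines such as '+++ exited with 0 +++'
            allowed.add(match.group(1))
    return allowed


def check_syscall(name, allowlist):
    """Toy policy decision: anything not observed during profiling is denied."""
    return "allow" if name in allowlist else "deny"


if __name__ == "__main__":
    trace = [
        'openat(AT_FDCWD, "/etc/hosts", O_RDONLY) = 3',
        'read(3, "127.0.0.1 localhost", 4096) = 19',
        "close(3) = 0",
        "+++ exited with 0 +++",
    ]
    profile = build_allowlist(trace)
    print(sorted(profile))
    print(check_syscall("read", profile), check_syscall("ptrace", profile))
```

The trade-off is the usual one for allowlists: a profile that misses a rarely used code path will deny a legitimate call later.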
        

![Container layout](./img/ch09/container.png){width=50%} \ ![Micro-VM layout](./img/ch09/micro_vm.png){width=50%}

![Unikernel layout](./img/ch09/unikernel.png){width=50%} \ ![Sandbox layout](./img/ch09/sandbox.png){width=50%}

## Linux kernel isolation support

* <https://linuxcontainers.org/>
* built into Linux kernel
* LXC (Linux Containers)
    * OS-level virtualization for running containers on Linux host
    * low-level, difficult to use
* LXD (Linux Container Hypervisor)
    * built on top of LXC
    * Canonical development
    * focus on containerising entire operating systems, not individual applications

### Cgroups

* control groups
* Linux feature to separate processes into groups
    * resource limiting e.g. CPU shares
    * prioritization e.g. CPU pinning
    * device access
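
A minimal sketch of how such limits are expressed, assuming the cgroup v2 interface, where a group is a directory and limits are plain files (`cpu.weight` is the v2 counterpart of CPU shares, `cpu.max` a hard quota, `cpuset.cpus` the pinning). The helper names and the `demo` group are hypothetical. On a real host the root would be `/sys/fs/cgroup`, writing needs root privileges, and the controllers must be enabled in the parent group. The demo writes to a temporary directory instead, so it only shows the file layout.

```python
import tempfile
from pathlib import Path
from typing import Optional


def configure_cgroup(root: Path, name: str, cpu_weight: int = 100,
                     cpu_quota_us: Optional[int] = None,
                     period_us: int = 100_000,
                     cpus: Optional[str] = None) -> Path:
    """Create a group directory and write cgroup-v2-style limit files."""
    group = root / name
    group.mkdir(parents=True, exist_ok=True)
    # Relative share of CPU time under contention (v2 range is 1..10000).
    (group / "cpu.weight").write_text(f"{cpu_weight}\n")
    # Hard limit: at most <quota> microseconds of CPU per <period>.
    quota = "max" if cpu_quota_us is None else str(cpu_quota_us)
    (group / "cpu.max").write_text(f"{quota} {period_us}\n")
    if cpus is not None:
        # Pin the group to specific CPUs, e.g. "0-1".
        (group / "cpuset.cpus").write_text(f"{cpus}\n")
    return group


def add_process(group: Path, pid: int) -> None:
    """Move a process into the group by writing its PID to cgroup.procs."""
    (group / "cgroup.procs").write_text(f"{pid}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        g = configure_cgroup(Path(tmp), "demo", cpu_weight=50,
                             cpu_quota_us=20_000, cpus="0-1")
        add_process(g, 1234)
        for f in sorted(g.iterdir()):
            print(f.name, "=", f.read_text().strip())
```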

### Namespaces

* provide isolated view of global resources for a group of processes
    * only see other processes in the same namespace
    * only see allowed devices, users, file system...
    * 2 PIDs: global one and one within namespace
    * own root file system (copy of host root)

## WebAssembly

* W3C standard for portable high-performance applications
* binary code
    * compiled for a virtual CPU (stack machine)
    * runs in a runtime (browser or standalone)
* portable compilation target
* near-native performance
* WebAssembly System Interface (WASI): OS-level functionality + integrated
  security
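
To make "binary code for a virtual CPU" concrete, here is a toy interpreter for a tiny subset of the WebAssembly instruction set. It is a sketch, not a runtime: it only checks the 8-byte module header (`\0asm` plus version 1) and executes a bare function body with four opcodes (`local.get`, `i32.add`, `i32.mul`, `end`). It also assumes local indices fit in one byte; real modules encode them as LEB128 and wrap bodies in sections.

```python
WASM_MAGIC = b"\x00asm"
WASM_VERSION = (1).to_bytes(4, "little")
MASK32 = 0xFFFFFFFF

# Body of a function (a, b) -> a + b:
#   local.get 0, local.get 1, i32.add, end
ADD_BODY = bytes([0x20, 0x00, 0x20, 0x01, 0x6A, 0x0B])


def has_wasm_header(blob):
    """True if the blob starts with the WebAssembly magic and version 1."""
    return blob[:4] == WASM_MAGIC and blob[4:8] == WASM_VERSION


def run_body(code, params):
    """Execute a function body on a value stack, as a stack machine would."""
    stack = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        pc += 1
        if op == 0x20:  # local.get <index>: push a parameter
            stack.append(params[code[pc]] & MASK32)
            pc += 1
        elif op == 0x6A:  # i32.add: pop two, push sum (wraps at 32 bits)
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK32)
        elif op == 0x6C:  # i32.mul: pop two, push product (wraps at 32 bits)
            b, a = stack.pop(), stack.pop()
            stack.append((a * b) & MASK32)
        elif op == 0x0B:  # end: result is the top of the stack
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode 0x{op:02x}")
    raise ValueError("function body has no end opcode")


if __name__ == "__main__":
    print(has_wasm_header(WASM_MAGIC + WASM_VERSION))
    print(run_body(ADD_BODY, [2, 40]))
```

The same bytes give the same result on any host that implements the instruction set, which is what makes the format a portable compilation target.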

## Trusted execution environment

* confidential computing: protect data in use
    * at-rest data: data on storage, just encrypt it
    * in-transit data: use encryption
    * in-use data: needs to be decrypted before it can be used in application
    * TEE looks to address data in use security concern
* protect *guest* from untrustworthy *host*
    * confidentiality: unauthorized entities cannot view data used in TEE, data
      is encrypted in-memory
    * integrity: prevent tampering (checksums)
    * provable origin: hardware-signed evidence of origin and current state so
      client can verify and decide to trust code running in TEE
* AMD Secure Encrypted Virtualization (SEV, SEV-ES)
* Intel Software Guard Extensions (SGX)
* Intel Trust Domain Extensions (TDX)
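
The "provable origin" property is usually realised through remote attestation. The sketch below is a simplified model, not any vendor's actual protocol: the loaded code is hashed into a measurement, and the measurement is signed together with a client-chosen nonce. An HMAC with a shared key stands in for the hardware signature here. That is an assumption made to keep the example in the standard library: real TEEs sign with a hardware-held private key and the client verifies via a certificate chain. All function names are hypothetical.

```python
import hashlib
import hmac
import os


def measure(code: bytes) -> bytes:
    """Measurement: a hash of the code loaded into the TEE."""
    return hashlib.sha256(code).digest()


def make_report(hw_key: bytes, code: bytes, nonce: bytes) -> dict:
    """'Hardware' side: sign measurement plus nonce (HMAC as a stand-in)."""
    measurement = measure(code)
    signature = hmac.new(hw_key, measurement + nonce, hashlib.sha256).digest()
    return {"measurement": measurement, "nonce": nonce, "signature": signature}


def verify_report(hw_key: bytes, report: dict,
                  expected_measurement: bytes, nonce: bytes) -> bool:
    """Client side: check signature, freshness (nonce) and code identity."""
    expected_sig = hmac.new(
        hw_key, report["measurement"] + report["nonce"], hashlib.sha256
    ).digest()
    return (
        hmac.compare_digest(expected_sig, report["signature"])
        and hmac.compare_digest(report["nonce"], nonce)
        and hmac.compare_digest(report["measurement"], expected_measurement)
    )


if __name__ == "__main__":
    key = os.urandom(32)           # stand-in for the hardware secret
    trusted_code = b"print('hello from the enclave')"
    nonce = os.urandom(16)         # fresh per request, prevents replay
    report = make_report(key, trusted_code, nonce)
    print(verify_report(key, report, measure(trusted_code), nonce))
    tampered = make_report(key, trusted_code + b"#backdoor", nonce)
    print(verify_report(key, tampered, measure(trusted_code), nonce))
```

The client only releases secrets to the guest after the check succeeds, which is how it decides whether to trust code running on an untrusted host.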

![Container architecture](./img/ch09/container_diagram.png)