# Secure virtualised workloads

* **workload**: any job/payload that needs to be executed on infrastructure
    * straight on OS
    * in a VM
    * in a container

## Virtual machines

![VM architecture](./img/ch09/vm_diagram.png)

* physical hardware
    * CPU, memory, chipset, I/O...
    * resources often underutilized
    * no isolation
* hardware-level abstraction
    * virtual hardware
    * encapsulate all OS and application state
* virtualization software
    * hypervisor/VMM
    * extra level of indirection to decouple hardware and OS
    * strong isolation between VMs
    * improves utilization
* secure multiplexing
    * isolation on hardware level
    * failure of one VM does not affect others
* entire VM is a file
    * easy to snapshot, clone, move, distribute
    * create once, run anywhere (well we try)
* types
    * **type 1**: hypervisor runs on bare metal (no host OS) (VMware ESXi, Microsoft Hyper-V, KVM...)
    * **type 2**: hypervisor runs on host OS (VirtualBox, VMware Workstation...)
        * relies on host OS to manage calls to hardware
        * adds latency
        * inherits the security risks of the host OS
        * aimed towards developers

![Type 1 virtualisation](./img/ch09/type_1_hypervisor.png){width=50%} \ ![Type 2 virtualisation](./img/ch09/type_2_hypervisor.png){width=50%}

## Containers

* virtualization on OS level
* much more lightweight -> more dense utilization
* share same host OS / kernel
* advantages
    * much faster startup
    * easier to manage
    * more containers per host than VMs
* no hardware isolation, so security issues
* the future
    * blur the line between containers and VMs
    * **Kata Containers**: lightweight VM per container (better security)
    * **Microsoft Hyper-V**: sometimes wraps containers in lightweight VM
* Linux Security Modules (LSM)
    * hostile processes can break out of container (badly configured namespaces, kernel exploits...)
    * LSM defines mandatory access control
    * lists allowed capabilities (syscalls) per process
    * defined by sysadmin
    * prevents niche syscalls from being exploited
* types
    * **OS-level containerization**: spawn containers straight on host OS + kernel
        * isolation using kernel functionality (namespaces, cgroups...)
        * no need for full guest OS
        * no hardware extensions
        * attackers could escape container and compromise host
        * Docker
    * **micro-VM**: containers in lightweight VMs on host
        * utilizes hardware-enforced isolation
        * containers do not share kernel
        * safer
        * slower startup, worse performance
    * **unikernel**: application compiled together with tailored kernel
        * monitor the application to find out which syscalls it uses
        * once known, construct a minimal tailored kernel and fixed-purpose image
        * no user space, only kernel space
        * much smaller attack surface (kernel only contains what's necessary)
        * runs straight on hypervisor or bare metal
        * small footprint, quick to start
    * **sandboxing**: container in sandbox running copy of host kernel
        * syscalls translated to host kernel
        * good isolation
        * slow
        * not all syscalls supported (yet)

![Container layout](./img/ch09/container.png){width=50%} \ ![Micro-VM layout](./img/ch09/micro_vm.png){width=50%}

![Unikernel layout](./img/ch09/unikernel.png){width=50%} \ ![Sandbox layout](./img/ch09/sandbox.png){width=50%}

## Linux kernel isolation support

* <https://linuxcontainers.org/>
* built into Linux kernel
* LXC (Linux Containers)
    * OS-level virtualization for running containers on Linux host
    * low-level, difficult to use
* LXD (Linux Container Hypervisor)
    * built on top of LXC
    * Canonical development
    * focus on containerising entire operating systems, not individual applications

### Cgroups

* control groups
* Linux feature to separate processes into groups
    * resource limiting e.g. cpu shares
    * prioritization e.g. cpu pinning
    * device access

### Namespaces

* provide isolated view of global resources for a group of processes
    * only see other processes in the same namespace
    * only see allowed devices, users, file system...
    * 2 PIDs: global one and one within namespace
    * own root file system (copy of host root)

## WebAssembly

* W3C standard for portable high-performance applications
* binary code
    * compiled to virtual CPU
    * runs in runtime
* portable compilation target
* near-native performance
* WebAssembly System Interface (WASI): OS-level functionality + integrated security

## Trusted execution environment

* confidential computing: protect data in use
    * at-rest data: data on storage, just encrypt it
    * in-transit data: use encryption
    * in-use data: needs to be decrypted before it can be used in application
* TEE looks to address the data-in-use security concern
    * protect *guest* from untrustworthy *host*
    * confidentiality: unauthorized entities cannot view data used in TEE, data is encrypted in-memory
    * integrity: prevent tampering (checksums)
    * provable origin: hardware-signed evidence of origin and current state so client can verify and decide to trust code running in TEE
* AMD Secure Encrypted Virtualization (SEV, SEV-ES)
* Intel Software Guard Extensions (SGX)
* Intel Trust Domain Extensions (TDX)

![Container architecture](./img/ch09/container_diagram.png)
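
The "provable origin" point of a TEE can be illustrated with a toy attestation check. This is only a minimal sketch and not how SEV, SGX or TDX actually work: real TEEs sign their report with a hardware-protected asymmetric key, and the client validates it against a vendor certificate chain. Here an HMAC with a shared key stands in for the hardware signature, and every name (`HARDWARE_KEY`, `measure`, `make_report`, `verify_report`) is hypothetical.

```python
import hashlib
import hmac

# Assumption: stand-in for a key that only the TEE hardware holds.
# Real TEEs use an asymmetric key plus a vendor certificate chain.
HARDWARE_KEY = b"toy-hardware-key"


def measure(code: bytes) -> str:
    """Hash the code loaded in the TEE (its 'measurement')."""
    return hashlib.sha256(code).hexdigest()


def make_report(code: bytes, nonce: bytes) -> dict:
    """Toy hardware side: return the measurement, the nonce and a signature over both."""
    measurement = measure(code)
    signature = hmac.new(
        HARDWARE_KEY, measurement.encode() + nonce, hashlib.sha256
    ).hexdigest()
    return {"measurement": measurement, "nonce": nonce, "signature": signature}


def verify_report(report: dict, expected_measurement: str, nonce: bytes) -> bool:
    """Client side: check the signature, the freshness nonce and the expected code hash."""
    expected_sig = hmac.new(
        HARDWARE_KEY, report["measurement"].encode() + nonce, hashlib.sha256
    ).hexdigest()
    return (
        hmac.compare_digest(report["signature"], expected_sig)
        and report["nonce"] == nonce
        and hmac.compare_digest(report["measurement"], expected_measurement)
    )


if __name__ == "__main__":
    trusted_code = b"print('hello from the enclave')"
    nonce = b"client-chosen-nonce"

    report = make_report(trusted_code, nonce)
    print(verify_report(report, measure(trusted_code), nonce))    # True

    tampered = make_report(b"malicious code", nonce)
    print(verify_report(tampered, measure(trusted_code), nonce))  # False
```

The client only decides to trust the TEE when all three checks pass: the report was signed by the hardware, it answers the client's own nonce (so it is not a replay), and the measured code is the code the client expects.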