[Paper review, notes for myself] CATalyst: Defeating Last-Level Cache Side Channel Attacks in Cloud Computing

2020. 7. 21. 16:34 | Computer Architecture/Cache security

Authors: Fangfei Liu (1), Qian Ge (2,3), Yuval Yarom (2,4), Frank McKeen (5), Carlos Rozas (5), Gernot Heiser (2,3), Ruby B. Lee (1)
(1) Department of Electrical Engineering, Princeton University, email: {fangfeil,rblee}@princeton.edu
(2) NICTA, email: {qian.ge,gernot}@nicta.com.au
(3) UNSW Australia
(4) School of Computer Science, The University of Adelaide, email: yval@cs.adelaide.edu.au
(5) Intel Labs, email: {frank.mckeen,carlos.v.rozas}@intel.com

 

[Abstract]

The last-level cache (LLC) is shared by all cores in a processor package.

Since most cloud servers use Intel processors, we show how Intel Cache Allocation Technology (CAT) can be used to provide a system-level protection mechanism to defend against side channel attacks on the shared LLC.

CAT is a way-based hardware cache-partitioning mechanism for enforcing quality-of-service with respect to LLC occupancy.

 

 

[1. Introduction]

Multiple VMs may execute on the same physical machine and share all its underlying hardware resources.

The last-level cache (LLC) is one of the most dangerous shared resources, since it is shared by all of the cores in a processor package yet allows fine-grained, high-bandwidth, low-noise cross-core attacks.

Recently, Intel introduced a performance optimization feature called Cache Allocation Technology (CAT).

CAT supports associating cache ways with up to four Classes of Service (COSs), such that the replacement of a cache line is allowed only if the code initiating the replacement executes within the corresponding COS.

 

We propose to use CAT to defend against LLC side channel attacks.

A direct approach is to use CAT to provide partial or complete isolation of COSs, partitioning the LLC.

 

In this work, we propose an alternative approach, called CATalyst, for using CAT as a defense for LLC side channel attacks.

On the one hand, we use CAT as a coarse-grained mechanism to partition the LLC into secure and non-secure partitions, where the non-secure partition can be freely used by any application. On the other hand, the secure partition is loaded with cache-pinned secure pages.

The non-secure partition remains hardware-managed, while the secure partition is software-managed, leveraging CAT to enforce the cache-pinning of secure pages.

 

[2. BACKGROUND]

[2.1 Virtualization]

Virtualization supports running multiple isolated environments, called virtual machines (VMs).

A virtual machine monitor (VMM) manages the underlying hardware resources: it not only allocates those resources to the VMs but also provides the isolation boundaries among VMs.

 

There are two levels of address-space virtualization in virtualized environments. The first maps the virtual addresses of a process to a guest’s notion of physical addresses, which are emulated by the VMM. The second maps guest physical addresses to host physical addresses.

 

Without specific hardware support, the VMM can maintain a shadow page table for each process to store the translation from the virtual address to the host physical address.

 

With hardware support for nested page tables, this two-level translation is performed in hardware.

For example, Intel introduced the extended page table (EPT) mechanism, which features a second page table that directly stores, for each guest VM, the translation from guest physical addresses to host physical addresses.

 

guest page table + EPT: virtual address --> guest physical address --> host physical address

 

The VMM also provides services through hypercalls, which are not part of the architecture's instruction set and which the OS in a VM can invoke.

[2.2 Cache side channel attacks]

In this work we are interested in timing differences due to sharing the LLC, which allow mounting LLC-based side channel attacks.

Image from the paper.

Hence, LLC-based attacks can be mounted even when the victim and the attacker are not executing on the same core.

Furthermore, as co-resident VMs share the LLC, LLC-based attacks can operate cross-VM.

 

[2.3 Page coloring]

Page coloring is a software mechanism that partitions the cache by cache sets at page-size granularity.

Memory pages are given different colors, and only memory blocks in pages with the same color can be mapped into the same cache set.
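
To make this concrete, here is a minimal sketch (mine, not the paper's) of how a page color can be computed, assuming a classic non-sliced cache with 64-byte lines, 8192 sets, and 4 KB pages; the geometry parameters are illustrative.

```c
/* Minimal sketch (not from the paper): computing a page's color for a
 * classic non-sliced cache. Assumed geometry: 64-byte lines (6 bits),
 * 8192 sets (13 bits), 4 KB pages (12 bits). The color is the part of
 * the set index that lies above the page offset. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BITS 6
#define SET_BITS  13
#define PAGE_BITS 12

static unsigned page_color(uint64_t paddr)
{
    unsigned color_bits = LINE_BITS + SET_BITS - PAGE_BITS;  /* 7 bits */
    return (unsigned)(paddr >> PAGE_BITS) & ((1u << color_bits) - 1);
}

int main(void)
{
    /* Pages with different colors can never contend for a cache set. */
    printf("color(0x12345000) = %u\n", page_color(0x12345000ULL));
    printf("color(0x12346000) = %u\n", page_color(0x12346000ULL));
    return 0;
}
```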

 

[2.4 The Intel sliced LLC]

In order to implement page coloring as a defense, the system designer must ensure that pages of different colors do not map to the same cache set. This is complicated on modern Intel processors, where the LLC is divided into per-core slices and an undocumented hash function maps addresses to slices.

 

[3. LEVERAGING CAT TO DEFEAT LLC ATTACKS]

[3.1 The Intel Cache Allocation Technology]

CAT is a hardware-based mechanism first introduced in 2014 on Intel Haswell server processors.

CAT supports configurable partitioning of the LLC cache ways.

Unlike page coloring, CAT supports superpages and is unaffected by the LLC's hash indexing algorithm.

CAT was designed to enforce quality of service (QoS) over cache footprint, by guaranteeing cache capacity to high-priority applications.

On current Haswell servers, CAT supports up to four classes of service (COS), with a bitmap defining the LLC ways a COS can access. The bit mask can be configured in software by writing to one of four model-specific registers (MSRs), one for each COS. Intel imposes two constraints on configuring a COS: the bit mask must contain at least two ways, and the allocated ways must be contiguous. CAT associates a COS with each logical processor, so all applications running on a logical processor share the same COS. Importantly, CAT only restricts the ways that can be used to serve an LLC miss; applications can still hit on cache lines cached in ways that are not assigned to them.
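
As a rough illustration of this MSR interface, the sketch below programs two COS bitmasks and binds a logical processor to COS1, using Linux's raw /dev/cpu/N/msr device. The MSR numbers (IA32_L3_QOS_MASK_0 at 0xC90, IA32_PQR_ASSOC at 0xC8F) follow the Intel SDM, but the 20-way mask values are just an example; in practice one would use the resctrl filesystem instead.

```c
/* Sketch: programming CAT through Linux's raw MSR interface
 * (requires root and the msr kernel module). MSR numbers are from
 * the Intel SDM: IA32_L3_QOS_MASK_n starts at 0xC90 and
 * IA32_PQR_ASSOC is 0xC8F. Mask values below assume a 20-way LLC. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_PQR_ASSOC 0xC8F
#define IA32_L3_MASK_0 0xC90          /* COSn mask lives at 0xC90 + n */

static int wrmsr_on_cpu(int cpu, uint32_t msr, uint64_t val)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;
    int ok = (pwrite(fd, &val, sizeof val, msr) == sizeof val) ? 0 : -1;
    close(fd);
    return ok;
}

int main(void)
{
    /* COS0 (non-secure): ways 2..19; COS1 (secure): ways 0..1.
     * CAT requires each mask to be contiguous with >= 2 ways. */
    wrmsr_on_cpu(0, IA32_L3_MASK_0,     0xFFFFC);
    wrmsr_on_cpu(0, IA32_L3_MASK_0 + 1, 0x00003);

    /* Attach logical CPU 0 to COS1; the COS id sits in bits 63:32. */
    return wrmsr_on_cpu(0, IA32_PQR_ASSOC, (uint64_t)1 << 32);
}
```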

[3.2 Challenges]

CAT could be trivially used to completely partition the cache by assigning disjoint COSs to different VMs (Figure 2 below).

Image from the paper.

However, such a partitioning is too coarse-grained: the hardware supports only four COSs, and four security domains are insufficient for realistic cloud workloads.

Furthermore, the system cannot dynamically balance the cache allocation according to the needs of the VMs, which might harm the performance of VMs with large working sets, especially when most VMs have small working sets.

 

[3.3 Threat model and design goals]

We target the co-residence threat in public IaaS clouds.

We assume that each VM is a trust domain, i.e., cloud customers do not trust each other.

The attacker owns a VM and has complete control of both the OS and the applications in that VM.

The victim is a co-resident VM that shares the machine with the attacker VM.


Our goal is to ensure that running a VM on a public cloud platform is as secure as running it on a private machine.

 

We focus on LLC-based cross-VM attacks.

Cloud providers have the option not to schedule two VMs from different cloud customers on the same core, but VMs still share the LLC even when they are scheduled on different cores.

Our goal is to provide a system-level mechanism that offers strong security guarantees to cloud providers and their customers.

 

[3.4 Our solution: CATalyst]

The key idea of CATalyst is that we use CAT not only as a cache partitioning mechanism but also as a pseudo-locking mechanism which pins certain page frames in the LLC.

These cache-pinned page frames are called secure pages; security-sensitive code and data can be protected by remapping them to the secure page frames.

Image from the paper.

1. Use the CAT hardware to create two partitions: a secure partition and a non-secure partition, where the secure partition can be much smaller than the non-secure partition.

CAT creates: secure partition + non-secure partition

The secure partition stores cache-pinned secure pages.

The non-secure partition remains hardware-managed, and thus transparent to the software.

 

The secure partition, allocated at VMM boot time, becomes a software-managed cache.

In this way, our solution converts the LLC into a hybrid between a hardware-managed cache (the non-secure partition) and a software-managed cache (the secure partition), providing finer-granularity partitions and locking of secure pages.

Although the effective LLC capacity for applications that do not use secure pages is reduced to the size of the non-secure partition, the performance degradation is small, owing to the large size (tens of MB) of the LLC in modern processors, and to the relatively small size of the secure partition, which can be as small as two out of many (e.g., 20) LLC ways.

 

Secure pages are assigned to VMs upon request at VM launch time.

We can, therefore, support many more mutually-distrusting VMs than the number of COSs.

The VMM guarantees that a secure page is not shared by multiple VMs.

Yet, within a VM, assigned secure pages are fully managed by the guest OS and can be time-shared by all the applications in the VM.

 

Security guarantees: CATalyst provides two security guarantees:

1) No eviction of secure pages by malicious code.

We guarantee that once security-sensitive code or data is mapped to secure pages, they are kept in the LLC.

That is, malicious code cannot cause eviction through cache line contention.

Accesses to security-sensitive code or data always hit in the LLC, thus achieving constant-time accesses, which do not leak information to an attacker observing the shared LLC cache state.

We do not require explicit software preloading of security-sensitive code or data to achieve access time guarantees (due to cache hits).

Instead, by remapping the security-sensitive code or data to the secure pages, we implicitly perform preloading.

 

2) No overlapping of secure pages between different active VMs (security domains).

By preventing sharing of secure pages between VMs, we guarantee that the FLUSH+RELOAD attack cannot operate on security-sensitive code or data.

This also prevents a malicious VM from flushing the secure pages belonging to another VM.

To avoid DoS attacks by malicious VMs over-requesting secure pages, the VMM limits the maximum number of secure pages that a VM can request, for example, to no more than 8 pages.

 

Scalability:

The number of cache partitions that our system can support is essentially the number of pages in the secure partition.

 

Assume that there are N_slice LLC slices, the size of each way in an LLC slice is S bytes, and the number of ways reserved for the secure partition is k. In a typical Intel processor, N_slice is the number of cores in the processor and S is fixed at 128 KB, i.e., 32 page frames. Therefore, the secure partition contains 32 × k × N_slice page frames. Due to the restrictions of CAT allocation, we have k ≥ 2. Hence, the smallest possible secure partition has 64 page frames per processor core.
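
A quick sanity check of this arithmetic (my own toy calculation, with an assumed 8-core package):

```c
/* Worked instance of the formula above: pages = 32 * k * N_slice,
 * with an assumed 8-core package (N_slice = 8) and the minimum
 * k = 2 ways reserved for the secure partition. */
#include <stdio.h>

int main(void)
{
    int n_slice = 8;                         /* assumption: 8 cores/slices */
    int k = 2;                               /* CAT minimum: 2 ways        */
    int pages_per_way = (128 * 1024) / 4096; /* S = 128 KB -> 32 frames    */
    int secure_pages = pages_per_way * k * n_slice;

    printf("secure partition: %d pages (%d KB)\n",
           secure_pages, secure_pages * 4);  /* 512 pages = 2048 KB */
    return 0;
}
```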

 

STEP 1. We first disable local interrupts to ensure that the preloading routine is not interrupted.

STEP 2. Executed while the logical processor is still set to COS0, this step ensures that the preloading routine's code is cached in the non-secure partition and that the corresponding TLB entry is loaded.

Because preloading is done at boot time, when only the boot core is running, we can guarantee that, once cached in the non-secure partition, the routine's code is not evicted from the cache or the TLB while preloading a page.

 

 

Before loading P into the secure partition, we use the clflush instruction to evict any cached data of P from the non-secure partition (Step 5).
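
Putting the steps together, a condensed sketch of the preloading idea might look like the following. This is not the paper's code: set_cos() and the irq helpers are hypothetical stand-ins for VMM-internal primitives, and the routine would run in VMM (kernel) context.

```c
/* Condensed sketch of the preloading idea, as it might look inside the
 * VMM (kernel context). Not the paper's code: set_cos() and the irq
 * helpers are hypothetical stand-ins, and the routine itself is assumed
 * to already be cached in the non-secure partition (Step 2). */
#include <stdint.h>

#define PAGE_SIZE 4096
#define LINE_SIZE 64

extern void set_cos(int cos);              /* hypothetical: writes IA32_PQR_ASSOC */
extern unsigned long local_irq_save(void); /* hypothetical irq helpers */
extern void local_irq_restore(unsigned long flags);

static void preload_secure_page(volatile uint8_t *p)
{
    unsigned long flags = local_irq_save();   /* Step 1: no interruptions */

    /* Step 5: evict any cached copy of the page from the non-secure ways. */
    for (int off = 0; off < PAGE_SIZE; off += LINE_SIZE)
        __asm__ volatile("clflush (%0)" :: "r"(p + off) : "memory");
    __asm__ volatile("mfence" ::: "memory");

    /* Switch to the secure COS and touch every line: each miss now
     * fills into the secure ways, where CAT keeps it pinned. */
    set_cos(1);
    for (int off = 0; off < PAGE_SIZE; off += LINE_SIZE)
        (void)p[off];
    set_cos(0);                                /* back to the default COS */

    local_irq_restore(flags);
}
```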

 

4.1.3 Preloading secure pages without self-conflicts

Loading the secure pages without self-conflicts is very hard to achieve perfectly when the core count is not a power of two.

To avoid LLC conflicts during the loading of secure pages, the VMM needs to allocate a chunk of physically contiguous memory that is slightly larger than the secure partition size M.

For each page frame, the VMM needs to determine whether adding this page frame into the secure partition will cause the eviction of page frames that are already in the secure partition. There are two approaches to achieve this:

1) Reverse engineer the hash function.

Since the VMM knows the physical address of a page frame, it is straightforward for the VMM to determine which LLC slice the page frame maps to, provided the VMM knows the hash function.

2) The VMM can use the PRIME+PROBE technique iteratively.

Assume the VMM has already selected a set of page frames, F, without LLC cache line conflicts. To determine whether adding a new candidate page frame will cause conflicts, the VMM can first access a cache line L in the candidate page frame, and then access all the cache lines with the same page offset as L in the page frames in F.

The VMM then measures the time taken to access L again. If the access latency is small, the VMM can add the candidate page frame to the set F; otherwise, adding this candidate page would cause self-conflicts, since accessing L again resulted in an LLC miss.
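
A sketch of this conflict test, under two assumptions of mine that the paper does not spell out: the frames are mapped into the tester's address space, and a hit/miss cycle threshold has been calibrated for the machine.

```c
/* Sketch of the iterative PRIME+PROBE conflict test described above. */
#include <stddef.h>
#include <stdint.h>

#define THRESHOLD 100  /* cycles; machine-specific, must be calibrated */

static inline uint64_t rdtscp(void)
{
    uint32_t lo, hi, aux;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    return ((uint64_t)hi << 32) | lo;
}

/* Returns 1 if adding 'candidate' to the accepted set F would evict
 * a line already loaded by the frames in F. */
static int conflicts(volatile uint8_t *candidate,
                     volatile uint8_t *F[], size_t nf, size_t offset)
{
    (void)candidate[offset];              /* load probe line L          */
    for (size_t i = 0; i < nf; i++)
        (void)F[i][offset];               /* access same offset in F    */

    __asm__ volatile("mfence" ::: "memory");
    uint64_t t0 = rdtscp();
    (void)candidate[offset];              /* time the re-access of L    */
    uint64_t t1 = rdtscp();

    return (t1 - t0) > THRESHOLD;         /* slow => L was evicted      */
}
```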

 

We use the performance monitoring unit (PMU) to verify that the secure pages are indeed pinned in the cache.

The PMU allows software to access the hardware performance counters.

Specifically, the VMM first allocates a large buffer (e.g., the same size as the LLC) and cleans the LLC by traversing the large buffer. Then the VMM uses the PMU to measure the number of LLC misses when randomly traversing a secure page.

To count the number of LLC misses, we directly program one of the performance monitor counters (e.g., PMC0) so that it is reset to zero before we start to access the secure page; after traversing the secure page, we read this counter to obtain the number of LLC misses.
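
The paper programs PMC0 directly from inside the VMM. As a user-level stand-in, the same measurement can be approximated with Linux's perf_event_open counting hardware cache misses, as in this sketch:

```c
/* User-level stand-in for the check above: count hardware cache misses
 * while traversing a page, via perf_event_open. */
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    static uint8_t page[4096];
    struct perf_event_attr pe;

    memset(&pe, 0, sizeof pe);
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof pe;
    pe.config = PERF_COUNT_HW_CACHE_MISSES;   /* LLC misses */
    pe.disabled = 1;
    pe.exclude_kernel = 1;

    int fd = (int)syscall(SYS_perf_event_open, &pe, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);       /* reset counter to zero */
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (int off = 0; off < 4096; off += 64)  /* traverse the page     */
        (void)*(volatile uint8_t *)&page[off];
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    long long misses = 0;
    read(fd, &misses, sizeof misses);         /* ~0 if the page is pinned */
    printf("LLC misses: %lld\n", misses);
    close(fd);
    return 0;
}
```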

 

4.1.4 Other issues

Although our preloading routine is carefully designed to provide strong security guarantees, there are still several complexities to handle in a real commodity system, in order to prevent the pinned page frames from being evicted out of the LLC.

 

Special instructions that explicitly flush the caches:

These instructions do not enforce CAT, so executing one of them would destroy our security guarantee.

In particular, the invd and wbinvd instructions may flush the whole per-core L1 and L2 caches (and the corresponding content in the LLC). Since these instructions cause VM exits, the VMM can intercept them and prevent their execution.

The clflush instruction is potentially more dangerous. However, since clflush operates on data identified by guest virtual addresses, it cannot affect another VM's secure pages, as we guarantee that VMs never share secure pages.

 

Once the secure pages are reclaimed at VM termination, the VMM needs to reload these pages to ensure they are still pinned in the secure partition.

We allow the VMM itself to execute these special instructions, but ensure that where they target cache lines in the secure partition, the VMM immediately afterwards reloads the flushed cache lines.

In order to maintain our security guarantees we stop all other logical processors while the VMM is reloading the flushed pages. Compared to the cost of cache flushes and VM cleanup operations, the extra overhead is small.

 

 

Direct Data I/O (DDIO):

DDIO is a new feature found in Intel Xeon E5 and E7 series processors.

It may flush the LLC silently without the VMM being aware.

Fortunately, the Intel processor allows software to configure, via a bit mask, which cache ways DDIO is allowed to access. Therefore, we can configure the bit mask so that it does not overlap the cache ways allocated to the secure partition.

 

Cache coherence in multi-socket servers:

So far, we have assumed a scenario with a single multicore processor package.

In that case, the LLC is the point of coherence, so the intra-CMP (chip multiprocessor) coherence protocol does not impact the contents of the LLC.

In particular, it is safe to keep dirty data in the LLC without writing it back to memory.

However, public cloud providers often use multi-socket servers, i.e., servers that have multiple multi-core packages.

The inter-CMP cache coherence protocol may destroy our security guarantee.

To solve this, we must set the CPU affinity of the VM’s Virtual CPUs (VCPUs) to the cores of a single processor package.

 

4.2 VM launch time and termination time

At VM launch time, the VMM allocates secure pages to the guest VM as requested, and exposes these pages to the guest physical address space.

When the kernel module is loaded into the kernel space, it allocates a chunk of contiguous guest page frames, and then asks the VMM to map these page frames to the secure pages.
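
Purely as an illustration of this guest-side flow, a kernel module might look roughly like the sketch below. The hypercall number and calling convention are invented for the sketch; the paper only says that such a mapping request to the VMM exists.

```c
/* Guest-side sketch only: the hypercall id and its arguments are
 * invented to illustrate the flow; the paper states only that the
 * kernel module allocates contiguous guest frames and asks the VMM
 * to remap them onto secure pages. */
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/kvm_para.h>
#include <asm/io.h>

#define HC_MAP_SECURE_PAGES 0x1000  /* hypothetical hypercall id */

static struct page *secure_buf;

static int __init catalyst_guest_init(void)
{
    /* Allocate 8 contiguous guest page frames (order 3). */
    secure_buf = alloc_pages(GFP_KERNEL, 3);
    if (!secure_buf)
        return -ENOMEM;

    /* Ask the VMM to back these guest frames with cache-pinned
     * secure pages (the EPT remapping happens on the VMM side). */
    kvm_hypercall2(HC_MAP_SECURE_PAGES, page_to_phys(secure_buf), 8);
    return 0;
}

static void __exit catalyst_guest_exit(void)
{
    __free_pages(secure_buf, 3);
}

module_init(catalyst_guest_init);
module_exit(catalyst_guest_exit);
MODULE_LICENSE("GPL");
```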

 

4.3 VM run time

The guest VM’s kernel manages the secure pages by multiplexing them to all the processes.

There are two alternative ways to implement a time-shared secure-page management mechanism.

 

One option is a truly time-shared implementation, i.e., virtualizing the secure pages for user processes so that two processes can simultaneously map to the same secure page as long as they do not access it concurrently.

The other option partitions the secure pages among VCPUs: a process running on one VCPU can use all the secure pages owned by that VCPU.

 

 

[5. SECURITY EVALUATION]

To evaluate the security, we select GnuPG, a free implementation of the OpenPGP standard, as the victim process.

The modular exponentiation in GnuPG version 1.4.13 uses a simple Square-and-Multiply algorithm that is known to be vulnerable to side channel attacks.

[Algorithm 1 from the paper: Square-and-Multiply modular exponentiation]

As described in Algorithm 1, the algorithm scans the exponent bits (the private key) from left to right, performing a square operation for each bit.

For every “1” bit, it performs an extra multiply operation.

Consequently, the time interval between two consecutive squares reveals the private key: if the exponent bit is “1” there will be a multiply operation between the two square operations, resulting in a longer elapsed time between square operations than when the exponent bit is “0”.
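
For reference, a toy version of the leaky square-and-multiply loop (64-bit modular arithmetic instead of GnuPG's bignums) makes the key-dependent multiply obvious:

```c
/* Toy square-and-multiply with the structure of Algorithm 1. The
 * multiply runs only for '1' exponent bits, so the time between
 * consecutive square operations leaks the key bit by bit. */
#include <stdint.h>

uint64_t mod_exp(uint64_t base, uint64_t exp, uint64_t mod)
{
    uint64_t r = 1;
    for (int i = 63; i >= 0; i--) {   /* scan exponent bits left to right */
        r = (r * r) % mod;            /* square: executed for every bit   */
        if ((exp >> i) & 1)
            r = (r * base) % mod;     /* multiply: only when the bit is 1 */
    }
    return r;
}
```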

 

Image from the paper.

 

The attack first selects cache sets that potentially conflict with instructions in the square routine, i.e., they have the same page offset and hence will potentially map into the same cache set.

The attack scans these selected cache sets, one by one, for 10,000 time slots, where each time slot is 5000 cycles.

Figure 8 only shows cache set activity for some cache sets for 500 time slots.

A dark dot represents activity in that time slot, whereas no dot represents inactivity in that time slot.

For most cache sets, we only see random activity patterns. But for set 78, we see a clear string of activity (usually 3-4 consecutive dots for one square operation) followed by an inactivity period.

This is a temporal access pattern that can indicate key bits: for this machine, the inactive period is 3-4 time slots for a “0” and 10-11 time slots for a “1” in the key.

Hence, a common activity pattern for set 78 is 3 time slots with activity followed by 4 time slots without activity, which is the matched activity pattern that we count in Figure 9.

 

Image from this thesis paper.

 

We modified GnuPG version 1.4.13 to use CATalyst.

Figure 9 shows the results of searching for the targeted cache traces, by counting the number of matched activity patterns described above, before and after we locked the secure page in the LLC.

Figure 9(a): For GnuPG without protection, an attacker can easily identify the targeted cache set by its expected cache activity patterns.

However, when CATalyst is engaged, as in Figure 9(b), it is impossible to find such targeted cache sets.

 

[6. PERFORMANCE EVALUATION]

 

[8. CONCLUSIONS]

We propose CATalyst, a lightweight system mechanism for the cloud provider and cloud customers to protect security-sensitive code and data against LLC-based side channel attacks, by leveraging the CAT on Intel processors.

CATalyst builds on existing commodity hardware, and can be easily deployed.

It bridges the gap between protecting security-sensitive applications and existing performance-oriented cache-partitioning mechanisms.

CATalyst uses CAT to partition the LLC, forming a hybrid between a conventional hardware-managed cache and a small software-managed cache.

We pin a pool of page frames permanently in the cache, managed by the VMM and the guest OS.

Users can load and lock security-sensitive code and data by mapping them to the cache-pinned pages.

Our evaluation shows that CATalyst not only effectively mitigates the LLC attacks but also introduces very small performance degradation.