## CryptoEmu: An Instruction Set Emulator for Computation Over Ciphers
Xiaoyang Gong xgong35@wisc.edu Dan Negrut negrut@wisc.edu
## Abstract
Fully homomorphic encryption (FHE) allows computations over encrypted data, making privacy-preserving cloud computing a reality. Users can send their encrypted sensitive data to a cloud server, have encrypted results returned, and decrypt them locally, without worrying about data breaches.
This project report presents a homomorphic instruction set emulator, CryptoEmu, that enables fully homomorphic computation over encrypted data. The software-based instruction set emulator is built upon an open-source, state-of-the-art homomorphic encryption library that supports gate-level homomorphic evaluation. The instruction set architecture supports multiple instructions drawn from a subset of the ARMv8 instruction set architecture. The instruction set emulator uses parallel computing techniques to emulate every functional unit with minimum latency. This project report includes details on design considerations, instruction set emulator architecture, and datapath and control unit implementation. We evaluated and demonstrated the instruction set emulator's performance and scalability on a 48-core workstation. CryptoEmu shows a significant speedup in homomorphic computation performance compared with HELib, a state-of-the-art homomorphic encryption library.
## Index Terms
Fully Homomorphic Encryption, Parallel Computing, Homomorphic Instruction Set, Homomorphic Processor, Computer Architecture.
## I. INTRODUCTION
In the conventional cloud service model, users share data with their service provider (cloud) to outsource computations. The cloud receives encrypted data and decrypts it with the cloud's private key or a private key shared between the user and the cloud. Thus, the service provider has access to user data, which might contain sensitive information like health records, bank statements, or trade secrets. Privacy concerns have grown along with the wide adoption of cloud services. In 2019, over 164.68 million sensitive records were exposed in the United States [1].
In the worst-case scenario, the cloud service provider cannot be trusted, and user data is inherently unsafe if it is in plaintext. Even if the service provider is honest, cloud services are prone to falling victim to cybercrime: security loopholes or sophisticated social engineering attacks expose user privacy in the cloud, and a successful attack usually results in a massive user data leak. One way to eliminate this type of risk is to allow the cloud to operate on encrypted user data without decrypting it. Fully homomorphic encryption (FHE) is a special encryption scheme that allows arbitrary computation over encrypted data without knowledge of the private key. An FHE-enabled cloud service model is shown in Fig. 1. In this example, the user wants to compute the sum of 1, 3, and 5 in the cloud. The user first encrypts the data with FHE, then sends the ciphertexts (shown in Fig. 1 as bubbles with blurry text) to the cloud. When the cloud receives the encrypted data, it homomorphically adds the values together to form an encrypted sum and returns it to the user. The user decrypts the encrypted sum with the secret key, and the result in cleartext is 9, the sum of 1, 3, and 5. In the entire process, the cloud has no knowledge of the user's input or output data. Therefore, user data is safe from an insecure cloud or any attack targeting the cloud service provider.
Fig. 1: Homomorphic Encryption
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Diagram: Homomorphic Sum Operation
### Overview
This diagram illustrates a simplified process of performing a homomorphic sum operation, moving data from a "Local" environment to a "Cloud" environment and back. It demonstrates encryption, transmission, computation in the cloud, and decryption.
### Components/Axes
The diagram is divided into three main sections:
1. **Local (Left):** Represents the initial data location. Contains three data points (1, 3, 5) and their encrypted counterparts.
2. **Cloud (Center):** Represents the cloud environment where the homomorphic sum operation is performed.
3. **Local (Right):** Represents the final data location after receiving and decrypting the result. Contains the received encrypted sum and the decrypted result.
Labels present:
* "Local" (appears twice, top-left and top-right)
* "Cloud" (center)
* "Enc" (Encryption)
* "Send"
* "Receive"
* "Dec" (Decryption)
* "Homomorphic SUM operation"
Data points: 1, 3, 5, 9
### Detailed Analysis or Content Details
The diagram shows the following flow:
1. **Local (Left):** Three data points are initially present: 1, 3, and 5.
2. **Encryption:** The data points 1, 3, and 5 are individually encrypted, labeled "Enc".
3. **Transmission:** The encrypted data is sent to the "Cloud". Labeled "Send".
4. **Cloud:** The "Homomorphic SUM operation" is performed on the encrypted data. The result is an encrypted sum.
5. **Reception:** The encrypted sum is received back at the "Local" environment. Labeled "Receive".
6. **Decryption:** The encrypted sum is decrypted, labeled "Dec", resulting in the final value of 9.
### Key Observations
The diagram demonstrates that the sum of the initial data points (1 + 3 + 5 = 9) is correctly computed in the cloud without decrypting the data. This highlights the core principle of homomorphic encryption – performing operations on encrypted data.
### Interpretation
This diagram illustrates a fundamental concept in privacy-preserving computation. Homomorphic encryption allows computations to be performed on encrypted data without revealing the underlying plaintext. This is crucial for scenarios where data privacy is paramount, such as cloud computing, secure data analysis, and machine learning. The diagram simplifies the process, but it effectively conveys the core idea: data can be processed in the cloud without compromising its confidentiality. The fact that the sum is 9 is a verification that the homomorphic operation worked correctly. The diagram does not provide details about the encryption scheme used or the computational complexity of the homomorphic sum operation. It is a conceptual illustration rather than a detailed technical specification.
</details>
Blurry text in the figure denotes encrypted data.
Over the years, the research community has developed various encryption schemes that enable computation over ciphers. TFHE [2] is an open-source FHE library that allows fully homomorphic evaluation of arbitrary Boolean circuits. The TFHE library supports FHE operations on an unlimited number of logic gates. Using the FHE logic gates provided by TFHE, users can build an application-specific FHE circuit to perform arbitrary computations over encrypted data. While the TFHE library delivers good gate-by-gate FHE evaluation speed and memory usage [3], a rigid logic circuit has reusability and scalability issues for general-purpose computing. Moreover, evaluating a logic circuit in software is slow: because bitwise FHE operations on ciphers are about one billion times slower than the same operations on plaintext, computation time ramps up as the circuit becomes complex.
Herein, we propose a solution that embraces a different approach, drawing on a homomorphic instruction set emulator called CryptoEmu. CryptoEmu supports multiple FHE instructions (ADD, SUB, DIV, etc.). When CryptoEmu decodes an instruction, it invokes a pre-built function, referred to as a functional unit, to perform an FHE operation on the input ciphertext. All functional units are built upon FHE gates from the TFHE library and are accelerated using parallel computing techniques. During execution, the functional units fully utilize a multi-core processor to achieve optimal speedup. A user simply rewrites the FHE assembly code for different applications while relying on the optimized functional units.
This report is organized as follows. Section 2 provides a primer on homomorphic encryption and summarizes related work. Section 3 introduces TFHE, an open-source library for fully homomorphic encryption that provides the building blocks for CryptoEmu. Section 4 describes CryptoEmu's general architecture. Sections 5 and 6 provide detailed instruction set emulator implementations and give benchmark results on Euler, a CPU/GPU supercomputer. Section 7 analyzes CryptoEmu's scalability and vulnerability and compares CryptoEmu with a popular FHE software library, HELib [4]. Conclusions and future directions of investigation/development are provided in Section 8.
## II. BACKGROUND
Homomorphic Encryption. Homomorphic encryption (HE) is an encryption scheme that supports computation on encrypted data and generates an encrypted output. When the encrypted output is decrypted, its value equals the result of applying the equivalent computation to unencrypted data. HE is formally defined as follows: let Enc() be an HE encryption function, Dec() an HE decryption function, f() a function, g() a homomorphic equivalent of f(), and a and b input data in plaintext. The following equation holds:
$$f(a, b) = \mathrm{Dec}\left(g(\mathrm{Enc}(a), \mathrm{Enc}(b))\right).$$
An HE scheme is a partially homomorphic encryption (PHE) scheme if g() supports only addition or only multiplication. It is a somewhat homomorphic encryption (SWHE) scheme if g() may be applied to encrypted data only a limited number of times. It is a fully homomorphic encryption (FHE) scheme if any g() can be applied an unlimited number of times over encrypted data [5].
The first FHE scheme was proposed by Gentry [6]. In HE schemes, the plaintext is encrypted with Gaussian noise. The noise grows after every homomorphic evaluation until it becomes too large for the encryption scheme to work; this is why SWHE allows only a limited number of homomorphic evaluations. Gentry introduced a novel technique called 'bootstrapping', by which a ciphertext is homomorphically decrypted and re-encrypted with the secret key to reduce the Gaussian noise [6], [7]. Building off [6], [8] improved bootstrapping to speed up homomorphic evaluations. The TFHE library, based on [3] and [9], is one of the FHE schemes with a fast bootstrapping procedure.
Related work. This project proposes a software-based, multiple-instruction ISA emulator that supports fully homomorphic, general-purpose computation. Several general-purpose HE computer architecture implementations exist in both software and hardware. HELib [10] is an FHE software library that implements the Brakerski-Gentry-Vaikuntanathan (BGV) homomorphic encryption scheme [11]. HELib supports HE arithmetic such as addition, subtraction, and multiplication, as well as data movement operations; it can be treated as an assembly language for general-purpose HE computing. Cryptoleq [12] is a software-based one-instruction set computer (OISC) emulator for general-purpose HE computing. Cryptoleq uses the Paillier partially homomorphic scheme [13] and supports the Turing-complete SUBLEQ instruction. HEROIC [14] is another OISC architecture, implemented on an FPGA and based on the Paillier partially homomorphic scheme. Cryptoblaze [15] is a multiple-instruction computer based on non-deterministic Paillier encryption that supports partially homomorphic computation; it is implemented on an FPGA.
## III. TFHE LIBRARY
TFHE [2] is an FHE C/C++ software library that implements fast gate-by-gate bootstrapping. The idea of TFHE is straightforward: if one can homomorphically evaluate a universal logic gate, and then homomorphically evaluate the next gate using the previous gate's output as input, one can homomorphically evaluate arbitrary Boolean functions, essentially allowing arbitrary FHE computations on encrypted binary data. Figure 2 demonstrates a minimal FHE gate-level library: the NAND gate. Bootstrapped NAND gates are used to construct an FHE XOR gate; similarly, any FHE logic circuit can be constructed from a combination of bootstrapped NAND gates.
Fig. 2: Use of bootstrapped NAND gate to form arbitrary FHE logic circuit. Blurry text in the figure denotes encrypted data.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Digital Logic Diagram: Combinational Logic Circuit
### Overview
The image depicts a digital logic circuit diagram composed of AND and OR gates. The circuit takes two inputs, labeled 'A' and 'B', and produces a single output, labeled 'Q'. The diagram illustrates the interconnection of these logic gates to perform a specific boolean function.
### Components/Axes
The diagram consists of the following components:
* **Inputs:** A and B, represented by circles.
* **Logic Gates:**
* Two AND gates.
* Two OR gates.
* One final AND gate.
* **Output:** Q, represented by a circle.
* **Lines:** Representing the flow of signals between components.
### Detailed Analysis or Content Details
The circuit can be described as follows:
1. Input A is fed directly into one OR gate and also into one AND gate.
2. Input B is fed directly into one OR gate and also into the same AND gate as input A.
3. The output of the AND gate (A AND B) is fed into the final AND gate.
4. The output of the first OR gate (A OR B) is fed into the final AND gate.
5. The output of the second OR gate (A OR B) is also fed into the final AND gate.
6. The output of the final AND gate is the output Q.
Therefore, the output Q can be expressed as:
Q = (A AND B) AND ((A OR B) OR (A OR B))
Simplifying the expression:
Q = (A AND B) AND (A OR B)
This is equivalent to the AND operation.
### Key Observations
The circuit effectively implements an AND gate. The intermediate OR gates and the final AND gate are redundant, as the output is solely determined by the AND operation of inputs A and B.
### Interpretation
The diagram demonstrates a potentially inefficient implementation of a simple AND gate. While functionally correct, the circuit utilizes more gates than necessary. This could be a pedagogical example illustrating the equivalence of different logic gate configurations or a demonstration of how logic simplification can lead to more efficient circuit designs. The redundancy suggests a possible design flaw or a deliberate attempt to showcase the versatility of logic gate combinations. The circuit's behavior is entirely determined by the simultaneous truth of inputs A and B, mirroring the fundamental operation of an AND gate.
</details>
TFHE API. The TFHE library contains a comprehensive gate-bootstrapping API for the FHE scheme [2], including secret-keyset and cloud-keyset generation, encryption/decryption with the secret keyset, and FHE evaluation of a binary gate netlist with the cloud keyset. The TFHE API's performance was evaluated on a single core of an Intel Xeon E5-2650 v3 @ 2.30 GHz CPU, running CentOS Linux release 8.2.2004 with 128 GB of memory. Table I shows benchmark results for the TFHE APIs that are critical to CryptoEmu's performance. TFHE gate-bootstrapping parameter setup and secret-keyset and cloud-keyset generation are not included in the table.
TABLE I: TFHE API Benchmark
| API | Category | Bootstrapped? | Latency (ms) |
|-----------------|------------------------|-----------------|----------------|
| Encrypt | Encrypt decrypt | N/A | 0.0343745 |
| Decrypt | Encrypt decrypt | N/A | 0.000319556 |
| CONSTANT | Homomorphic operations | No | 0.00433995 |
| NOT | Homomorphic operations | No | 0.000679717 |
| COPY | Homomorphic operations | No | 0.000624117 |
| NAND | Homomorphic operations | Yes | 25.5738 |
| OR | Homomorphic operations | Yes | 25.618 |
| AND | Homomorphic operations | Yes | 25.6176 |
| XOR | Homomorphic operations | Yes | 25.6526 |
| XNOR | Homomorphic operations | Yes | 25.795 |
| NOR | Homomorphic operations | Yes | 25.6265 |
| ANDNY | Homomorphic operations | Yes | 25.6982 |
| ANDYN | Homomorphic operations | Yes | 25.684 |
| ORNY | Homomorphic operations | Yes | 25.7787 |
| ORYN | Homomorphic operations | Yes | 25.6957 |
| MUX | Homomorphic operations | Yes | 49.2645 |
| CreateBitCipher | Ciphertexts | N/A | 0.001725 |
| DeleteBitCipher | Ciphertexts | N/A | 0.002228 |
| ReadBitFromFile | Ciphertexts | N/A | 0.0175304 |
| WriteBitToFile | Ciphertexts | N/A | 0.00960664 |
In Table I, operations outside the 'Homomorphic operations' category are all relatively fast. Within that category, latency is around 25 ms, with the exceptions of MUX, which takes around 50 ms, and CONSTANT, NOT, and COPY, which are comparatively fast. The difference in speed comes from gate bootstrapping: unary gates like CONSTANT, NOT, and COPY do not need to be bootstrapped, binary gates are bootstrapped once, and MUX is bootstrapped twice. The bootstrapping procedure is manifestly the most computationally expensive operation in TFHE. This overhead is alleviated in CryptoEmu via parallel computing, as detailed below.
## IV. CRYPTOEMU ARCHITECTURE OVERVIEW
CryptoEmu is a C/C++ utility that emulates the behavior of fully homomorphic encryption (FHE) instructions. The instruction set that CryptoEmu supports is a subset of the ARMv8 A32 instructions, used for fully homomorphic computation over encrypted data. Figure 3 shows the abstraction layers for an FHE application: an FHE application that performs computation over encrypted data is compiled into FHE assembly that the instruction set emulator supports. The instruction set emulator coordinates control units and functional units to decode and execute the FHE assembly and returns the final results. The design and implementation of CryptoEmu are anchored by two assumptions:
Assumption 1. The instruction set emulator runs on a high-performance multi-core machine.
Assumption 2. The cloud service provider is honest. However, the cloud is subject to cyber-attacks on the user's data.
In §VII-C we discuss modifications to CryptoEmu's implementation for the case when Assumption 2 does not hold.
Cloud service model. Figure 4 shows the cloud service model. The instruction set emulator does what an actual hardware asset for encrypted execution would do: it reads an HE instruction from an unencrypted memory space; i.e., it fetches the
Fig. 3: Abstract Layers
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Diagram: FHE Stack
### Overview
The image depicts a layered diagram representing a Fully Homomorphic Encryption (FHE) stack. It illustrates the different levels of abstraction involved in building an FHE application, from the underlying hardware emulation to the final application layer. The diagram is structured as a pyramid, with the base representing the lowest level and the apex representing the highest level.
### Components/Axes
The diagram consists of four rectangular blocks stacked vertically, each representing a different layer:
1. **Instruction Set Emulator:** Located at the bottom (base) of the pyramid.
2. **FHE Functional Units:** Situated above the Instruction Set Emulator.
3. **FHE Assembly:** Positioned above the FHE Functional Units.
4. **FHE Application:** Located at the top (apex) of the pyramid.
There are no axes or legends present in this diagram.
### Detailed Analysis or Content Details
The diagram shows a hierarchical relationship between the four layers. Each layer builds upon the one below it. The layers are visually distinguished by their color:
* **Instruction Set Emulator:** Light yellow.
* **FHE Functional Units:** Light blue.
* **FHE Assembly:** Light yellow.
* **FHE Application:** Light blue.
The text within each block is centered and clearly legible. The blocks are arranged in a pyramid shape, suggesting a dependency structure where higher layers rely on the functionality provided by lower layers.
### Key Observations
The diagram emphasizes the layered architecture of an FHE system. It highlights the progression from low-level hardware emulation to high-level application development. The alternating colors of the layers (yellow and blue) may be intended to visually separate different types of components or functionalities.
### Interpretation
This diagram illustrates the software stack required to implement FHE. The Instruction Set Emulator provides the foundational hardware abstraction. The FHE Functional Units build upon this to provide basic FHE operations. The FHE Assembly layer combines these operations into more complex routines. Finally, the FHE Application layer utilizes these routines to create a fully functional application that leverages the benefits of FHE.
The pyramid shape suggests that the complexity increases as you move up the stack. The lower layers are more fundamental and provide the building blocks for the higher layers. The diagram implies that developing FHE applications requires expertise in multiple areas, including hardware emulation, FHE primitives, and software engineering. The diagram does not provide any quantitative data or specific details about the implementation of each layer. It is a conceptual representation of the FHE stack.
</details>
instruction that needs to be executed. The instruction set emulator also reads and writes HE data in an encrypted memory space, to process the user's data and return encrypted results to that memory space. The user, or any device that owns the user's secret key, communicates with the cloud through an encrypted channel. The user provides all encrypted data to the cloud and can send unencrypted HE instructions to the cloud through a secure channel. The user is also responsible for resolving branch directions for the cloud, based on the encrypted branch taken/not-taken result provided by the cloud.
Fig. 4: Cloud service model
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Diagram: System Architecture - HE Instruction Flow
### Overview
The image depicts a system architecture diagram illustrating the flow of HE (Homomorphic Encryption) instructions and data between a "User" (local) and components residing in the "Cloud". The diagram shows a central "Emulator" component facilitating communication between the User and two cloud-based elements: "HE Instruction" and "HE Data".
### Components/Axes
The diagram consists of four rectangular components:
* **User:** Located on the right side, labeled "User" and designated as "Local".
* **HE Instruction:** Located in the top-left corner of the "Cloud" area, labeled "HE Instruction".
* **Emulator:** Positioned centrally within the "Cloud" area, labeled "Emulator".
* **HE Data:** Located in the bottom-center of the "Cloud" area, labeled "HE Data".
The diagram also includes directional arrows indicating the flow of information:
* An arrow originates from "HE Instruction" and points to "Emulator".
* An arrow originates from "Emulator" and points to "User".
* A bidirectional arrow connects "Emulator" and "HE Data".
* The entire diagram is enclosed within a rectangular border labeled "Cloud" in the top-right corner.
### Detailed Analysis or Content Details
The diagram illustrates a data flow:
1. "HE Instruction" sends information to the "Emulator".
2. The "Emulator" processes the instruction and communicates with the "User".
3. The "Emulator" also exchanges data with "HE Data".
There are no numerical values or scales present in the diagram. The diagram is purely conceptual, representing the relationships and flow of information between the components.
### Key Observations
The "Emulator" acts as a central hub, mediating communication between the "User", "HE Instruction", and "HE Data". The bidirectional arrow between the "Emulator" and "HE Data" suggests a continuous or iterative exchange of information. The "User" is explicitly identified as being "Local", implying a distinction from the cloud-based components.
### Interpretation
This diagram represents a system for processing HE instructions. The "User" initiates a request or provides input, which is then processed by the "Emulator" based on "HE Instruction". The "Emulator" retrieves or updates "HE Data" as part of this process. The cloud-based architecture suggests scalability and accessibility. The separation of "HE Instruction" and "HE Data" implies a modular design, where instructions and data are managed independently. The diagram highlights a client-server model, where the "User" acts as the client and the "Emulator", "HE Instruction", and "HE Data" collectively form the server-side components. The system is designed to handle homomorphic computation tasks remotely, leveraging cloud resources for processing and data storage.
</details>
## A. Data Processing
In practice, the HE instructions and HE data can be text files or, if sufficient memory is available, arrays of data bits stored in buffers. CryptoEmu employs a load-store architecture. All computations occur on virtual registers (vRegs), where a vReg is an array of 32 encrypted data bits. The total number of vRegs is configurable, depending on the memory available on the machine; it is the compiler's responsibility to recycle vRegs and generate proper read/write addresses. A snippet of possible machine instructions is as follows:
```
LOAD  R1 READ_ADDR1
LOAD  R2 READ_ADDR2
ADD   R0 R1, R2
STORE R0 WRITE_ADDR
```
Above, to perform a homomorphic addition, CryptoEmu fetches the first LOAD instruction from the instruction memory. Because the instruction itself is in cleartext, CryptoEmu decodes it, loads a piece of encrypted data from HE data memory indexed by READ_ADDR1, and copies the encrypted data into vReg R1. Then, CryptoEmu increments its program counter by 4 bytes, reads the next LOAD instruction, and loads encrypted data from HE data memory into vReg R2. After the two operands are ready, CryptoEmu invokes a 32-bit adder and passes R1 and R2 to it. The adder returns encrypted data in R0. Finally, CryptoEmu invokes the STORE operation and writes the R0 data into the HE data memory location pointed to by WRITE_ADDR. Under Assumption 2, the honest cloud could infer some user information from program execution because HE instructions are in cleartext. However, all user data stays encrypted and protected from malicious attackers. Vulnerabilities are discussed in §VII-C.
## B. Branch and Control Flow
CryptoEmu can perform a homomorphic comparison and generate the N (negative), Z (zero), C (unsigned overflow), and V (signed overflow) conditional flags. Based on the conditional flags, a branch instruction changes the value of the program counter and therefore the program flow. Because branches are homomorphically evaluated on encrypted conditional flags, the branch direction is also encrypted. To solve this problem, CryptoEmu employs the client-server communication model from Cryptoblaze [15]. Through a secure communication channel, the cloud server sends an encrypted branch decision to a machine (client) that owns the user's private key. The client deciphers the encrypted branch decision and sends it back encrypted with the server's public key. The cloud server then decrypts the branch decision, and CryptoEmu proceeds in the resolved branch direction. Under Assumption 2, the honest cloud will neither use branch-decision queries and binary search to crack the user's encrypted data, nor infer user information from the user. §VII-C discusses the scenario in which Assumption 2 does not hold.
## V. DATA PROCESSING UNITS
Data processing units are subroutines that manipulate encrypted data, including homomorphic binary arithmetic, homomorphic bitwise operations, and data movement. Under Assumption 1, data processing units are implemented with OpenMP [16] and are designed for parallel computing. If the data processing units exhaust all available cores, the remaining operations are serialized. We benchmarked the performance of the data processing units with 16-bit and 32-bit vReg sizes. Benchmarks were run on a computing node of Euler. The computing node has 2 NUMA nodes; each NUMA node has two sockets, and each socket holds a 12-core Intel Xeon E5-2650 v3 @ 2.30 GHz CPU. The 48-core computing node runs CentOS Linux release 8.2.2004 with 128 GB of memory.
## A. Load/Store Unit
CryptoEmu employs a load/store architecture. A LOAD instruction reads data from data memory; a STORE instruction writes data to data memory. The TFHE library [2] provides the API for load and store operations on FHE data. If data memory is presented as a file, CryptoEmu invokes the specific LD/ST subroutine, moves the file pointer to the right LD/ST address, and calls the appropriate file IO API, i.e.,
```
import_gate_bootstrapping_ciphertext_fromFile()
or
export_gate_bootstrapping_ciphertext_toFile()
```
Preferably, if the machine has available memory, the entire data file is loaded into a buffer, as this approach significantly improves LD/ST performance. Table II shows LD/ST latency for 16-bit and 32-bit data. LD/ST on a buffer is significantly faster than LD/ST on a file. The speedup is even greater when the data file is large, because LD/ST on a file must use the fseek() function to reach the address specified by HE instructions.
TABLE II: LD/ST latencies.
| | 16-bit (ms) | 32-bit (ms) |
|----------------|---------------|---------------|
| Load (file) | 0.027029 | 0.0554521 |
| Store (file) | 0.0127804 | 0.0276899 |
| Load (buffer) | 0.0043463 | 0.00778488 |
| Store (buffer) | 0.0043381 | 0.0077692 |
```
#pragma omp parallel sections num_threads(2)
{
#pragma omp section
{
bootsAND(&g_o[0], &a_i[0], &b_i[0], bk);
}
#pragma omp section
{
bootsXOR(&p_o[0], &a_i[0], &b_i[0], bk);
}
}
```
Fig. 5: Parallel optimization for bitwise (g,p) calculation, get_gp()
## B. Adder
CryptoEmu supports a configurable adder unit of variable width; for the ISA that CryptoEmu supports, adders are either 16-bit or 32-bit. Under the assumption that CryptoEmu runs on a host that supports multi-threading, the adder unit is implemented as a parallel prefix adder [17]. The parallel prefix adder has a three-stage structure: pre-calculation of generate and propagate bits; carry propagation; and sum computation. Each stage can be divided into sub-stages that leverage a multi-core processor. Herein, we use the OpenMP [16] library for parallel computing.
Stage 1: Propagate and generate calculation. Let $a$ and $b$ be the operands to the adder, and let $a[i]$ and $b[i]$ be the $i$-th bits of $a$ and $b$. In carry-lookahead logic, bit $i$ generates a carry if $a[i]$ AND $b[i]$ is 1 and propagates a carry if $a[i]$ XOR $b[i]$ is 1. This calculation requires an FHE AND gate and an FHE XOR gate; see §III and Fig. 2 for gate bootstrapping. An OpenMP parallel region is created to handle two parallel sections: as shown in Fig. 5, CryptoEmu spawns two threads to execute the two OpenMP sections in parallel.
For a 16-bit adder, the get_gp() calculation is applied to every bit. This process is parallelizable: as shown in Fig. 6, CryptoEmu spawns 16 parallel sections [16], one per bit. Inside each parallel section, the code starts another parallel region that uses two threads. Because of nested parallelism, 32 threads in total are required to calculate every generate and propagate bit concurrently. If there is an insufficient number of cores, parallel sections are serialized, which only affects the efficiency of the emulator.
Stage 2: Carry propagation. Let $G_i$ be the carry signal at the $i$-th bit, $P_i$ the accumulated propagate bit at the $i$-th bit, and $g_i$ and $p_i$ the outputs of the propagate and generate calculation. We define the operator $\odot$ such that
$$(g_x, p_x) \odot (g_y, p_y) = (g_x + p_x \cdot g_y,\; p_x \cdot p_y)\,.$$
The carry signal $G_i$ and accumulated propagation $P_i$ can be recursively defined as
$$(G_i, P_i) = (g_i, p_i) \odot (G_{i-1}, P_{i-1}),\quad \text{where } (G_0, P_0) = (g_0, p_0)\,.$$
The above recursive formula is equivalent to
$$( G _ { i \colon j } , P _ { i \colon j } ) = ( G _ { i \colon n } , P _ { i \colon n } ) \odot ( G _ { m \colon j } , P _ { m \colon j } ) , \quad \text { where } i \geq n , \; m \geq j , \; \text { and } n \leq m + 1 \, .$$
Therefore, carry propagation reduces to a parallel scan problem. In CryptoEmu, we define a routine, get_carry(), to perform the $\odot$ operation. As shown in Fig. 7, the $\odot$ operation requires two FHE AND gates and one FHE OR gate. CryptoEmu spawns two threads to perform the operation in parallel.
For a 16-bit adder, 4 levels are needed to compute the carry out of the most significant bit. As shown in Fig. 8, every two arrows that share an arrowhead represent one $\odot$ operation. Operations at the same level can be executed in parallel. For a 16-bit adder, the maximum number of concurrent $\odot$ operations is 15, at level 1. Because of the nested parallelism within the $\odot$ operation, the maximum number of threads required is 30. With a sufficient number of cores, the parallel scan reduces carry propagation time from 16 $\odot$-operation latencies down to 4.
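The scan itself can be verified on plaintext (g, p) pairs. Below is a minimal Kogge-Stone-style sketch of the carry scan; carry_op and carry_scan are our hypothetical names, and plain booleans stand in for ciphertext bits. In CryptoEmu, each carry_op call would be a get_carry() evaluation costing two FHE AND gates and one FHE OR gate:

```cpp
#include <array>
#include <cassert>
#include <utility>

using GP = std::pair<bool, bool>;  // (generate, propagate)

// The carry operator from the text: (gx,px) ⊙ (gy,py) = (gx + px·gy, px·py),
// where x covers the higher-order bit range.
GP carry_op(GP x, GP y) {
    return { x.first || (x.second && y.first), x.second && y.second };
}

// Inclusive scan over 16 (g,p) pairs: after log2(16) = 4 levels, out[i]
// holds (G_i, P_i), the carry out of bit i. Operations at the same level are
// independent, which is what CryptoEmu parallelizes with OpenMP.
std::array<GP, 16> carry_scan(std::array<GP, 16> gp) {
    for (int d = 1; d < 16; d <<= 1) {          // levels: d = 1, 2, 4, 8
        std::array<GP, 16> next = gp;
        #pragma omp parallel for
        for (int i = d; i < 16; ++i)
            next[i] = carry_op(gp[i], gp[i - d]);
        gp = next;
    }
    return gp;
}
```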
Stage 3: Sum calculation. The last stage of the parallel prefix adder is sum calculation. Let $s_i$ be the sum bit at the $i$-th bit, $p_i$ the propagate bit at the $i$-th bit, and $G_{i-1}$ the carry into bit $i$, i.e., the carry out of bit $i-1$. Then
$$s _ { i } = p _ { i } \ \mathrm { X O R } \ G _ { i - 1 } , \qquad s _ { 0 } = p _ { 0 } \, .$$
One FHE XOR gate is needed to calculate each 1-bit sum; for a 16-bit adder, 16 FHE XOR gates are needed. All FHE XOR evaluations are independent and can therefore be executed in parallel. In total, 16 threads are required for the best parallel optimization of the sum calculation stage.
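Putting the three stages together, the whole parallel prefix adder can be sketched on plaintext words. This is our illustration only — ppa_add16 is a hypothetical name, and the serial recurrence in stage 2 stands in for the 4-level parallel scan, which computes the same $(G_i, P_i)$ values:

```cpp
#include <cassert>
#include <cstdint>

// Plaintext sketch of CryptoEmu's three-stage parallel prefix adder; in the
// real unit every boolean operation below is one bootstrapped FHE gate.
uint16_t ppa_add16(uint16_t a, uint16_t b) {
    bool g[16], p[16], G[16];
    // Stage 1: per-bit generate/propagate (16 independent get_gp() calls).
    for (int i = 0; i < 16; ++i) {
        bool ai = (a >> i) & 1, bi = (b >> i) & 1;
        g[i] = ai && bi;   // generate
        p[i] = ai != bi;   // propagate
    }
    // Stage 2: carry recurrence G_i = g_i OR (p_i AND G_{i-1}); the parallel
    // scan produces identical values in log2(16) = 4 levels.
    G[0] = g[0];
    for (int i = 1; i < 16; ++i)
        G[i] = g[i] || (p[i] && G[i - 1]);
    // Stage 3: s_i = p_i XOR carry-in, where bit 0 has no carry in.
    uint16_t s = 0;
    for (int i = 0; i < 16; ++i) {
        bool carry_in = (i > 0) && G[i - 1];
        if (p[i] != carry_in) s |= (uint16_t)1 << i;
    }
    return s;  // wraps mod 2^16, matching a 16-bit ALU
}
```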
```cpp
#pragma omp parallel sections num_threads(N)
{
#pragma omp section
{
get_gp(&g[0], &p[0], &a[0], &b[0], bk);
}
#pragma omp section
{
get_gp(&g[1], &p[1], &a[1], &b[1], bk);
}
...
#pragma omp section
{
get_gp(&g[14], &p[14], &a[14], &b[14], bk);
}
#pragma omp section
{
get_gp(&g[15], &p[15], &a[15], &b[15], bk);
}
}
```
Fig. 7: Parallel optimization for bitwise carry calculation, get_carry()
a) Benchmark: the 16-bit adder.: Table III shows benchmarking results for a 16-bit adder unit executed on the target machine described earlier in the document. When parallelized, the 1-bit get_gp() shown in Fig. 5 has the latency of one FHE gate, around 25 ms, as shown in Table I. Ideally, with sufficient cores available and no overhead from parallel optimization, the (g,p) calculation would run 16 get_gp() calls concurrently, for a total latency of 25 ms. In reality, the 16-bit (g,p) calculation uses 32 threads and takes 51.39 ms to complete due to parallel computing overhead.
For the carry propagation calculation, the 1-bit get_carry() shown in Fig. 7 has the latency of two FHE gates, around 50 ms, when parallelized. In an ideal scenario, each level of carry propagation would run its get_carry() calls in parallel, for a total latency of around 50 ms. In reality, the 16-bit carry propagation uses 30 threads at level 1, which takes 93.42 ms; 28 threads at level 2, which takes 93.58 ms; 24 threads at level 3, which takes 80.34 ms; and 16 threads at level 4, which takes 70.85 ms.
Fig. 8: Parallel scan for carry signals
TABLE III: 16-bit adder latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| (g,p) calculation | 51.3939 |
| Carry propagation (Level 1) | 93.4178 |
| Carry propagation (Level 2) | 93.5273 |
| Carry propagation (Level 3) | 80.342 |
| Carry propagation (Level 4) | 70.8481 |
| Sum calculation | 34.2846 |
| Total latency, including overhead | 528.482 |
For the sum calculation, the 1-bit sum uses one FHE XOR gate, with a latency of around 25 ms. Ideally, if CryptoEmu ran all 16 XOR gates in parallel without parallel computing overhead, the latency of the 16-bit sum calculation would be around 25 ms. In reality, due to OpenMP overhead, the 16-bit sum calculation uses 16 threads and takes 34.28 ms to complete.
In total, the 16-bit adder's latency is 486.66 ms. This result includes the latency of all stages, plus overheads such as variable declaration, memory allocation, and temporary variable manipulation.
b) Benchmark: the 32-bit adder.: Table IV shows benchmarking results for a 32-bit adder unit executed on the target machine described earlier in the document. Note that get_gp() and get_carry() have the same performance as in the 16-bit adder. With sufficient cores available and no OpenMP overhead, the (g,p) calculation would run 32 get_gp() calls concurrently for the 32-bit adder, at a total latency of 25 ms. In reality, the 32-bit (g,p) calculation uses 32 threads and takes 94.92 ms to complete.
TABLE IV: 32-bit adder latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| (g,p) calculation | 94.9246 |
| Carry propagation (Level 1) | 147.451 |
| Carry propagation (Level 2) | 133.389 |
| Carry propagation (Level 3) | 127.331 |
| Carry propagation (Level 4) | 112.268 |
| Carry propagation (Level 5) | 91.5781 |
| Sum calculation | 49.0098 |
| Total latency, including overhead | 941.12 |
For the carry propagation calculation, if sufficient cores are available, each level should run its get_carry() operations in parallel, for a total latency of around 50 ms per level in the absence of parallel computing overhead. Level 1 of the 32-bit carry propagation uses 62 threads. Because the platform on which CryptoEmu is tested has only 48 cores, level 1 is partially serialized and takes 147.45 ms to complete. Level 2 uses 60 threads and, similarly serialized, takes 133.39 ms. Level 3 uses 56 threads, is serialized, and takes 127.33 ms to complete. Level 4 uses 48 threads, so it is possible to run every get_carry() in parallel on our 48-core workstation; it takes 112.27 ms. Level 5 uses 32 threads, executes all get_carry() operations concurrently, and takes 91.58 ms to complete.
For the sum calculation, the 32-bit adder spawns 32 threads to perform the FHE XOR operations in parallel if sufficient cores are available, so the ideal latency of the 32-bit sum calculation would be around 25 ms. In reality, the 32-bit sum calculation uses 32 threads and takes 49 ms to complete.
In total, the 32-bit adder's latency is 941.12 ms. This result includes the latency of all stages, plus overheads such as variable declaration, memory allocation, and temporary variable manipulation.
## C. Subtractor
The subtractor unit supports a variable ALU size; CryptoEmu supports subtractors with 16-bit or 32-bit operands. Let a be the minuend, b the subtrahend, and diff the difference. The formula for two's complement subtraction is:
$$a + \mathrm { N O T } ( b ) + 1 = \mathit { d i f f } \, .$$
As shown in Fig. 9, CryptoEmu reuses the adder units of §V-B to perform homomorphic subtractions. On the critical path, extra homomorphic NOT gates are used to compute the subtrahend's complement: for a subtractor with N-bit operands, N homomorphic NOT operations are applied to the subtrahend. Although all FHE NOT gates could be evaluated in parallel, we showed in §III that FHE NOT gates do not need bootstrapping and are relatively fast (around 0.0007 ms per gate) compared to bootstrapped FHE gates. Parallel execution is therefore unnecessary; instead, the homomorphic NOT operation is implemented as an iterative for-loop, as shown below:
```cpp
for (int i = 0; i < N; ++i)
    bootsNOT(&b_neg[i], &b[i], bk);
```
In addition to the homomorphic NOT operation on the subtrahend, the carry out of bit 0 needs an extra OR-gate evaluation because the carry in is 1. The adder of §V-B is therefore extended to take a carry-in bit. When sufficient cores are available on the machine, the subtractor unit adds only the negation and the carry-bit calculation to the adder unit's critical path.
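On plaintext words, the subtractor's structure reduces to the following sketch (sub16 is our hypothetical name; each bit flip in the loop corresponds to one bootsNOT() call, and the final addition corresponds to the carry-in-extended adder of §V-B):

```cpp
#include <cassert>
#include <cstdint>

// Plaintext sketch of the subtractor: diff = a + NOT(b) + 1 (two's
// complement). In CryptoEmu the per-bit negation is a loop of cheap,
// non-bootstrapped bootsNOT() calls, and the addition reuses the adder unit
// extended with a carry in of 1.
uint16_t sub16(uint16_t a, uint16_t b) {
    uint16_t b_neg = 0;
    for (int i = 0; i < 16; ++i)            // bitwise NOT, one gate per bit
        if (!((b >> i) & 1)) b_neg |= (uint16_t)1 << i;
    // Adder with carry in = 1; in the FHE version, bit 0's carry out needs
    // an extra OR gate because the carry in is nonzero.
    return (uint16_t)(a + b_neg + 1);
}
```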
Fig. 9: Subtractor architecture. Blurry text in the figure denotes encrypted data.
a) Benchmark.: Tables V and VI report benchmark results for the 16-bit and 32-bit subtractors on the target machine. Negation of the subtrahend takes a trivial amount of time to complete. The homomorphic addition is the most time-consuming operation in the subtractor unit; it is slightly slower than the homomorphic additions in §V-B because the adder needs extra bootstrapped FHE gates to process the carry in and calculate the carry out of sum bit 0.
TABLE V: 16-bit subtractor latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| Negation | 0.0341106 |
| Add with carry | 680.59 |
| Total latency, including overhead | 715.015 |
## D. Shifter
CryptoEmu supports three types of shifters: logic left shift (LLS), logic right shift (LRS), and arithmetic right shift (ARS). Each shifter type has two modes: immediate and register mode. In immediate mode, the shift amount is in cleartext; for example, an immediate-mode instruction can shift the encrypted data in R0 left by 1 bit and assign the shifted value back to R0.
This kind of instruction is usually issued by the cloud to process user data. The shift-immediate implementation is trivial: the shifter calls the bootsCOPY() API to move each ciphertext bit in the shift direction by the specified amount, and the vacated LSBs or MSBs are assigned an encrypted constant via the bootsCONSTANT() API call. Because neither bootsCOPY() nor bootsCONSTANT() needs to be bootstrapped, they are fast operations, see Table III. Therefore, an iterative loop is used for shifting; parallel optimization is unnecessary.
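A plaintext analogue of shift-immediate looks as follows (lls_imm is our hypothetical name; each copied bool stands for one bootsCOPY() call, and the zero fill stands for bootsCONSTANT(0)):

```cpp
#include <cassert>
#include <vector>

// Plaintext sketch of the immediate-mode left shifter: the shift amount is
// cleartext, so the shifter only relocates ciphertext bits and fills the
// vacated low positions with an encrypted constant 0. Here each "ciphertext
// bit" is a plain bool, stored LSB-first.
std::vector<bool> lls_imm(const std::vector<bool>& in, int amt) {
    std::vector<bool> out(in.size(), false);      // bootsCONSTANT(0) fill
    for (std::size_t i = (std::size_t)amt; i < in.size(); ++i)
        out[i] = in[i - amt];                     // one bootsCOPY per bit
    return out;
}
```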
In register mode, the shift amount is encrypted data stored in an input register. For example, the following instruction shifts the encrypted data in R0 left by the value stored in R1 and assigns the shifted value to R0.
```
LLS R0 R0 R1
```
Because the shift amount stored in R1 is encrypted, the shifter cannot simply move all encrypted bits left or right by a known amount. The shifter is therefore implemented as a barrel shifter, with parallel computing enabled.
a) Logic left shift.: Figure 10 shows the architecture of the 16-bit LLS. In the figure, numbers in the bubbles denote encrypted data in the shift register and the shift amount register; numbers in the diamonds denote an encrypted constant value generated by the bootsCONSTANT() API. The 16-bit LLS has four stages. In each stage, based on the encrypted shift amount, FHE MUXes homomorphically select encrypted bits from the shift register; at the end, the LLS outputs the encrypted shifted data. FHE MUX is an elementary logic unit provided by the TFHE library [2]. FHE MUX needs to be bootstrapped twice, and its latency is around 50 ms, so it is reasonable to spawn multiple threads to execute all MUX selects in parallel within each stage. For the 16-bit LLS, each stage uses 16 threads to perform the homomorphic MUX selects, as shown in the following code:
```cpp
#pragma omp parallel sections num_threads(16)
{
    #pragma omp section
    {
        bootsMUX(&out[0], &amt[0], &zero[0], &in[0], bk);
    }
    #pragma omp section
    {
        bootsMUX(&out[1], &amt[0], &in[0], &in[1], bk);
    }
    ...
    #pragma omp section
    {
        bootsMUX(&out[14], &amt[0], &in[13], &in[14], bk);
    }
    #pragma omp section
    {
        bootsMUX(&out[15], &amt[0], &in[14], &in[15], bk);
    }
}
```
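The complete four-stage barrel shifter can be sketched on plaintext words. In this illustrative version (barrel_lls16 is our hypothetical name), the per-stage select stands in for the 16 parallel bootsMUX() evaluations; note that the sketch conditionally shifts by 1, 2, 4, and 8 based on the individual bits of amt, mirroring the stage structure used when the real amount is encrypted:

```cpp
#include <cassert>
#include <cstdint>

// Plaintext sketch of the register-mode barrel shifter: amt is a 4-bit
// shift amount, and each of the 4 stages selects between the unshifted and
// the stage-shifted word using one MUX per bit (bootsMUX in the FHE unit).
uint16_t barrel_lls16(uint16_t in, uint16_t amt) {
    for (int s = 0; s < 4; ++s) {
        uint16_t shifted = (uint16_t)(in << (1 << s));  // shift by 1,2,4,8
        uint16_t sel = (amt >> s) & 1;     // amt bit s controls this stage
        in = sel ? shifted : in;           // 16 parallel MUX selects
    }
    return in;
}
```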
TABLE VI: 32-bit subtractor latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| Negation | 0.0347466 |
| Add with carry | 1058.65 |
| Total latency, including overhead | 1115.25 |
Fig. 10: LLS architecture
In a parallel implementation with zero overhead, each stage would have the latency of one FHE MUX, around 50 ms; in an ideal scenario, the four stages would therefore have a total latency of around 200 ms.
b) Benchmark: Logic left shift.: Table VII shows the benchmark results for the 16-bit LLS on the target platform. Each stage spawns 16 threads to run all FHE MUXes in parallel, with a per-stage latency of 50-90 ms, between one and two FHE MUX latencies, due to parallel computing overhead. In total, it takes around 290 ms to carry out a homomorphic LLS operation on 16-bit encrypted data.
TABLE VII: 16-bit LLS latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| Mux select (Stage 1) | 87.3295 |
| Mux select (Stage 2) | 82.2656 |
| Mux select (Stage 3) | 76.6871 |
| Mux select (Stage 4) | 55.5396 |
| Total latency, including overhead | 287.92 |
Table VIII shows the benchmark result for the 32-bit LLS. Each stage spawns 32 threads and takes around 90-100ms to complete. In total, it takes around 450-500ms for a homomorphic LLS operation on 32-bit encrypted data.
TABLE VIII: 32-bit LLS latency
| Operation | Latency (ms) |
|-----------------------------------|----------------|
| Mux select (Stage 1) | 101.92 |
| Mux select (Stage 2) | 89.7695 |
| Mux select (Stage 3) | 98.8634 |
| Mux select (Stage 4) | 91.3939 |
| Mux select (Stage 5) | 91.8491 |
| Total latency, including overhead | 474.739 |
c) Logic right shift/Arithmetic right shift.: The LRS has an architecture similar to that of the LLS. Figure 11 shows the architecture of the 16-bit LRS; compared to Fig. 10, the only difference between the LLS and LRS is the bit order of the input and output registers. To reuse the LRS architecture for ARS, one simply passes the MSB of the shift register as the shift-in value, shown as the numbers in the diamonds in Fig. 11. The LRS and ARS implementations are similar to that of the LLS: for the 16-bit LRS/ARS, CryptoEmu spawns 16 parallel threads to perform the homomorphic MUX selects at each stage.
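The LRS/ARS difference can likewise be sketched on plaintext words (barrel_rs16 is our hypothetical name): the only change relative to the left shifter is the fill bit, which is constant 0 for LRS and the operand's MSB for ARS:

```cpp
#include <cassert>
#include <cstdint>

// Plaintext sketch of the right shifters: LRS and ARS share the barrel
// structure; the shift-in value is 0 for LRS and the MSB for ARS. Each
// per-stage select stands in for 16 parallel bootsMUX() evaluations.
uint16_t barrel_rs16(uint16_t in, uint16_t amt, bool arithmetic) {
    bool fill = arithmetic && ((in >> 15) & 1);   // ARS replicates the MSB
    for (int s = 0; s < 4; ++s) {
        int k = 1 << s;                           // stage shifts by 1,2,4,8
        uint16_t shifted = (uint16_t)(in >> k);
        if (fill)                                 // sign-extend the top k bits
            shifted |= (uint16_t)(0xFFFF << (16 - k));
        in = ((amt >> s) & 1) ? shifted : in;     // per-bit MUX selects
    }
    return in;
}
```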
Table IX shows the benchmark results for the 16-bit LRS and ARS, which have similar performance. At each stage, the LRS/ARS uses 16 threads, and each stage takes 50-90 ms to complete; single-stage latency is between one and two FHE MUX latencies due to parallel computing overhead. In total, the 16-bit LRS/ARS latency is around 290-300 ms.
TABLE IX: 16-bit LRS, ARS latency
| Operation | LRS Latency (ms) | ARS Latency (ms) |
|-----------------------------------|--------------------|--------------------|
| Mux select (Stage 1) | 88.1408 | 87.7689 |
| Mux select (Stage 2) | 85.5154 | 79.6416 |
| Mux select (Stage 3) | 75.0295 | 75.2938 |
| Mux select (Stage 4) | 55.9639 | 54.9246 |
| Total latency, including overhead | 290.517 | 296.124 |
Table X shows the benchmark results for the 32-bit LRS and ARS. The 32-bit LRS/ARS has five stages; each stage creates 32 parallel threads to evaluate the FHE MUXes and takes 90-100 ms to complete, a single-stage latency of around two FHE MUX latencies. In total, the 32-bit LRS/ARS takes around 470-500 ms to complete a homomorphic shift operation on 32-bit encrypted data.
TABLE X: 32-bit LRS, ARS latency
| Operation | LRS Latency (ms) | ARS Latency (ms) |
|-----------------------------------|--------------------|--------------------|
| Mux select (Level 1) | 104.33 | 106.132 |
| Mux select (Level 2) | 104.75 | 95.3209 |
| Mux select (Level 3) | 90.2377 | 90.1364 |
| Mux select (Level 4) | 89.7183 | 90.5383 |
| Mux select (Level 5) | 90.6315 | 94.4654 |
| Total latency, including overhead | 472.896 | 491.787 |
Fig. 11: LRS architecture
## E. Multiplier
Design consideration. Binary multiplication can be treated as a summation of partial products [18]; see Fig. 12. Therefore, the adder units described in §V-B can be reused for partial-sum computation. Summing all partial products is a sum reduction and can therefore be parallelized.
Fig. 12: Binary multiplication
However, the best parallel optimization cannot be achieved on our 48-core computing node. For a 16-bit multiplier, the product is a 32-bit encrypted value, so a 32-bit adder is required to carry out the homomorphic addition. Each 32-bit adder has a peak usage of 64 threads: 31 threads, each with a nested parallel (g, p) calculation that uses two threads. Thus, on the 48-core server used here, the 32-bit adder has to be partially serialized. For the 16-bit multiplier's parallel sum reduction, at most eight summations occur in parallel, and each summation uses a 32-bit adder, for a peak usage of 512 threads. For a 32-bit multiplier, the maximum thread usage is 2048 threads. Because homomorphic multiplication is computationally demanding, the server used does not have sufficient resources to run all operations in parallel, and homomorphic multiplication therefore shows suboptimal performance on it.
Based on the design considerations above, CryptoEmu implements a carry-save multiplier [19] that supports variable ALU width. The carry-save multiplier uses the adder described in §V-B to sum partial products in series.
- 1) Unsigned multiplication: Figure 13 shows the multiplier's architecture. For a 16-bit multiplier, A and B are 16-bit operands stored in vRegs, and P is an intermediate 16-bit vReg that stores partial products. The adder is a 16-bit adder with a carry-in bit; see §V-B.
On startup, vReg P is initialized to an encrypted 0 using the TFHE library's bootsCONSTANT() API. We then enter an iterative loop: all bits of vReg A are homomorphically ANDed with the LSB of vReg B, and the result is used as one operand of the 16-bit adder. The data in vReg P is passed to the adder as the second operand, and the adder performs a homomorphic addition that also produces an encrypted carry-out bit. Next, we right-shift the carry out, vReg P, and vReg B, which ends one loop iteration. The loop repeats 16 times, and the final 32-bit product is stored in vReg P and vReg B. The pseudo-code below shows the 16-bit binary multiplication algorithm.
```
P = 0
for 16 times:
    (P, carry out) = P + (B[0] ? A : 0)
    [carry out, P, B] = [carry out, P, B] >> 1
return [P, B]
```
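At the plaintext level, the loop above behaves as follows. This is a sketch under the assumption that the 32-bit product is returned packed as [P, B]; the names are illustrative, and each AND and addition here stands in for a parallel homomorphic gate evaluation in CryptoEmu.

```c
#include <assert.h>
#include <stdint.h>

/* Plaintext sketch of the carry-save (shift-add) multiply loop: AND A with
   B's LSB, add into P, then right-shift the combined [carry, P, B] register
   so B gradually receives the low half of the product. */
static uint32_t mul16(uint16_t a, uint16_t b) {
    uint16_t p = 0;                                 /* vReg P = (encrypted) 0 */
    for (int i = 0; i < 16; i++) {
        uint16_t addend = (b & 1u) ? a : 0;         /* A AND LSB(B)           */
        uint32_t sum = (uint32_t)p + addend;        /* 16-bit homomorphic add */
        uint16_t carry = (uint16_t)(sum >> 16);     /* encrypted carry-out    */
        p = (uint16_t)sum;
        /* right shift [carry, P, B] by one bit */
        b = (uint16_t)((b >> 1) | (p << 15));       /* B's MSB <- P's LSB     */
        p = (uint16_t)((p >> 1) | (carry << 15));   /* P's MSB <- carry       */
    }
    return ((uint32_t)p << 16) | b;                 /* product lives in [P, B] */
}
```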
For implementation, an N-bit multiplier uses N threads to evaluate all the FHE AND gates concurrently. The adder is already parallel-optimized; the rest of the multiplication subroutine executes sequentially. The multiplier is therefore a computationally expensive unit in CryptoEmu.
- a) Benchmark: Unsigned multiplication.: Table XI shows benchmarks for 16-bit and 32-bit unsigned multiplication. A single pass of partial-product summation takes around 715ms and 925ms, respectively. Total latency is roughly N times the single iteration's latency because the summations run in sequence.
- 2) Signed multiplication: Signed multiplication is implemented using the carry-save multiplier in Figure 13, with slight modifications. For N-bit signed multiplication, partial products need to be sign-extended to 2N bits. Figure 14 shows the partial-product summation for 4-bit signed multiplication. This algorithm requires a 2N-bit adder and a 2N-bit subtractor for
TABLE XI: Unsigned multiplication latency
| | 16-bit multiplier (ms) | 32-bit multiplier (ms) |
|-----------------------------------|--------------------------|--------------------------|
| Single iteration | 715.812 | 926.79 |
| Total latency, including overhead | 11316.8 | 36929.2 |
Fig. 13: Carry save multiplier
N-bit signed multiplication. The algorithm can be further simplified with a 'magic number' [19]; Figure 15 shows the simplified signed multiplication. Based on this algorithm, the unsigned carry-save multiplier is modified to support signed multiplication.
Fig. 14: Signed binary multiplication
Fig. 15: Simplified signed binary multiplication
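The sign-extension scheme described above can be cross-checked at the plaintext level. The sketch below is not CryptoEmu's gate-level implementation; it only illustrates the arithmetic fact the scheme relies on, namely that in two's complement the MSB row of partial products carries negative weight and must be subtracted.

```c
#include <assert.h>
#include <stdint.h>

/* Plaintext sketch of signed multiplication via weighted partial-product
   rows: bit 15 of B has weight -2^15 in two's complement, so its row is
   subtracted, which is why a 2N-bit adder and subtractor are needed. */
static int32_t smul16(int16_t a, int16_t b) {
    int32_t acc = 0;
    for (int i = 0; i < 16; i++) {
        if (((uint16_t)b >> i) & 1u) {
            int32_t weight = (i == 15) ? -32768 : (1 << i);
            acc += (int32_t)a * weight;   /* add (or subtract) this row */
        }
    }
    return acc;
}
```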
a) Benchmark: Signed multiplication.: The unsigned multiplier was modified for signed multiplication based on the simplified algorithm outlined above. Table XII shows benchmark results for 16-bit and 32-bit signed multiplication. A single pass of partial-product summation takes roughly 745ms and 1155ms, respectively. The total latency for signed multiplication is around N times the single iteration's latency.
TABLE XII: Signed multiplication latency
| | 16-bit multiplier (ms) | 32-bit multiplier (ms) |
|-----------------------------------|--------------------------|--------------------------|
| Single iteration | 742.8 | 1154.77 |
| Total latency, including overhead | 13120.6 | 34826 |
## F. Divider
CryptoEmu supports unsigned division over encrypted data. Figure 16 shows the flow chart for a non-restoring division algorithm for unsigned integers [20].
The division algorithm is based on sequential addition/subtraction. In every iteration, the MSB of vReg A decides whether the divider performs an addition or a subtraction. This function is implemented by building on the adder in §V-B: the MSB of vReg A is passed as the SUB bit into the adder, where the second operand b is homomorphically XORed with the SUB bit and then added to the first operand A along with the SUB bit. If the FHE XOR gates are evaluated in parallel and parallel computing overhead is ignored, the adder with add/sub select is about one FHE XOR gate slower than the regular adder. Note that the integer divide instruction does not care about the value of the remainder, because the result is rounded down (floored) to the closest integer; remainder calculation is therefore skipped to save computation time.
The unsigned division algorithm is sequential; within each iteration, a parallel-optimized adder subroutine is invoked. Like multiplication, unsigned division is computationally expensive. Pseudocode for the 16-bit unsigned division is shown below.
```
Non-restoring Division:
A = 0
D = Divisor
Q = Dividend
for 16 times:
    [A, Q] = [A, Q] << 1
    if A < 0:
        A = A + D
    else:
        A = A - D
    if A < 0:
        Q[0] = 0
    else:
        Q[0] = 1
```
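The loop above can be sketched in plaintext as follows. The variable names follow the pseudocode; the sign of the accumulator stands in for the encrypted MSB test, and the remainder correction is skipped, as noted, so only the quotient is returned.

```c
#include <assert.h>
#include <stdint.h>

/* Plaintext sketch of the non-restoring division loop. The sign of A plays
   the role of the encrypted MSB that steers the homomorphic add/sub. */
static uint16_t div16(uint16_t dividend, uint16_t divisor) {
    int32_t a = 0;            /* accumulator vReg A (sign = MSB) */
    uint16_t q = dividend;    /* vReg Q, receives quotient bits  */
    for (int i = 0; i < 16; i++) {
        a = 2 * a + (q >> 15);        /* [A, Q] << 1              */
        q = (uint16_t)(q << 1);
        if (a < 0) a += divisor;      /* MSB set: add divisor     */
        else       a -= divisor;      /* MSB clear: subtract      */
        if (a >= 0) q |= 1u;          /* Q[0] = 1 iff A >= 0      */
    }
    return q;                         /* remainder correction skipped */
}
```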
a) Benchmark.: Table XIII provides benchmark results for the 16-bit and 32-bit unsigned integer divisions. A single iteration is slightly slower than a homomorphic addition because each iteration uses an adder-subtractor unit. The division algorithm is sequential, and the total latency for N-bit division is around N times a single iteration's latency.
TABLE XIII: Unsigned division latency
| | 16-bit divider (ms) | 32-bit divider (ms) |
|-----------------------------------|-----------------------|-----------------------|
| Single iteration | 769.063 | 1308.67 |
| Total latency, including overhead | 11659.5 | 38256.5 |
## G. Bitwise operations and data movement
CryptoEmu supports several homomorphic bitwise operations and data movement instructions of configurable data size; the instruction set supports 16-bit and 32-bit variants of both.
The collection of bitwise operations includes homomorphic NOT, AND, OR, XOR, and ORN, in both immediate mode and register mode. All homomorphic bitwise operations except bitwise NOT are implemented with parallel computing optimizations: for an N-bit bitwise operation, CryptoEmu spawns N threads to carry out all FHE gate evaluations in parallel. Bitwise NOT is the exception because the FHE NOT gate does not need to be bootstrapped and is relatively fast, so parallelization cannot be justified given its overhead. The following code shows a generic way to implement N-bit parallel bitwise operations using the TFHE and OpenMP libraries.
```
#pragma omp parallel sections num_threads(16)
{
    #pragma omp section
    {
        bootsAND(&out[0], &a[0], &b[0], bk);
    }
    #pragma omp section
    {
        bootsAND(&out[1], &a[1], &b[1], bk);
    }
    ...
    #pragma omp section
    {
        bootsAND(&out[N-2], &a[N-2], &b[N-2], bk);
    }
    #pragma omp section
    {
        bootsAND(&out[N-1], &a[N-1], &b[N-1], bk);
    }
}
```
Fig. 16: Binary division algorithm
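The per-bit sections pattern above can be written more compactly with an OpenMP parallel for. In the self-contained sketch below, a plaintext stand-in gate replaces bootsAND() so the code runs without TFHE; the pragma degrades to a serial loop when OpenMP is disabled.

```c
#include <assert.h>

#define NBITS 16

/* Plaintext stand-in for TFHE's bootsAND(); in CryptoEmu each call is a
   bootstrapped homomorphic gate evaluation on ciphertext bits. */
static void gate_and(int *out, const int *a, const int *b) { *out = *a & *b; }

/* One thread per bit position, equivalent to the omp sections version. */
static void bitwise_and(int out[NBITS], const int a[NBITS], const int b[NBITS]) {
    #pragma omp parallel for num_threads(NBITS)
    for (int i = 0; i < NBITS; i++)
        gate_and(&out[i], &a[i], &b[i]);
}
```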
Data movement instructions support homomorphic operations such as bit-field clear, bit-field insert, bit-order reverse, and byte-order reverse. Because data movement uses the TFHE library's non-bootstrapped APIs such as bootsCOPY() and bootsCONSTANT(), these instructions are not implemented via OpenMP parallel sections.
a) Benchmark.: Table XIV provides benchmark results for the 16-bit and 32-bit bitwise operation and data movement instructions. When parallel-optimized, an N-bit bitwise operation spawns N threads; for operations built from bootstrapped FHE gates, latency is between one and two FHE gate latencies when all threads run in parallel. The N-bit bitwise NOT and the data movement instructions are implemented sequentially, and their latency is proportional to N times the corresponding TFHE API's latency.
TABLE XIV: Bitwise operation and data movement latency
| Operation | Category | Parallel? | 16-bit latency (ms) | 32-bit latency (ms) |
|-------------|--------------------|-------------|-----------------------|-----------------------|
| AND(imm) | Bitwise operations | Yes | 26.9195 | 27.1288 |
| AND(reg) | Bitwise operations | Yes | 47.7228 | 55.4597 |
| OR(imm) | Bitwise operations | Yes | 28.1923 | 27.8774 |
| OR(reg) | Bitwise operations | Yes | 47.6171 | 50.089 |
| XOR(imm) | Bitwise operations | Yes | 29.2375 | 27.9389 |
| XOR(reg) | Bitwise operations | Yes | 47.1986 | 50.3683 |
| ORN(imm) | Bitwise operations | Yes | 27.8467 | 27.776 |
| ORN(reg) | Bitwise operations | Yes | 47.6422 | 57.0639 |
| NOT(imm) | Bitwise operations | No | 0.0072794 | 0.0116968 |
| NOT(reg) | Bitwise operations | No | 0.006089 | 0.01232 |
| BFC | Data movement | No | 0.0028094 | 0.0026892 |
| BFI | Data movement | No | 0.003278 | 0.0033974 |
| RBIT | Data movement | No | 0.0104582 | 0.0216222 |
| REV | Data movement | No | 0.0097926 | 0.0187618 |
## VI. CONTROL UNITS
A control unit decides and/or changes the value of the program counter (PC), which in CryptoEmu is a cleartext integer. CryptoEmu supports conditional execution that generates encrypted conditional flags as defined in the ARM ISA. Based on the conditional flags, the branch unit decides which branch direction to take: the PC is set to a cleartext value that points to an HE instruction address when the branch is taken, or increased by 4 when it is not.
## A. Conditional Flags
The conditional unit handles the NZCV flags defined in the ARM architecture. The N (negative) flag is set if an instruction's result is negative. The Z (zero) flag is set if an instruction's result is zero. The C (carry) flag is set if an instruction results in an unsigned overflow. The V (overflow) flag is set if an instruction results in a signed overflow. NZCV values are stored as encrypted data in a vReg, and an instruction can be conditionally executed to update them. The following code shows HE assembly for a while loop; note that all data in this example is encrypted.
```
C++:
int i = 42;
while(i != 0)
    i--;

HE assembly:
MOV R0 R0 42
Loop_label:
SUBS R0 R0 1
B_NE Loop_label
```
After the SUBS instruction executes, the NZCV vReg is updated from the result in R0. Based on the value of the Z flag, the program counter is either set to SUBS' instruction address (branch taken) or increased by 4 (branch not taken).
Because the homomorphic evaluation of a conditional flag is computationally expensive, an ordinary data processing unit does not have a mechanism to update conditional flags. Instead, instructions like ADDS, SUBS, and MULS are treated as micro-ops and completed in two steps: homomorphic data processing and homomorphic conditional flag calculation.
The N flag calculation is straightforward: CryptoEmu takes the MSB of the result vReg and assigns it to the NZCV vReg. For the C flag, CryptoEmu takes the carry-out from the preceding unsigned computation and assigns it to the NZCV vReg. For the Z flag, the conditional unit performs an OR reduction on the input vReg and assigns the negated result to the NZCV vReg; the OR reduction can be parallelized. For a 16-bit conditional unit, the OR reduction has four stages and a maximum thread usage of eight; for a 32-bit conditional unit, it has five stages and a maximum thread usage of 16. Finally, for the V flag, we take the sign bits (MSBs) of the two operands and the result of a signed computation, denoted $a$, $b$, and $s$, respectively. Overflow is then evaluated as
$$o v = \left ( \bar { a } \cdot \bar { b } \cdot s \right ) + \left ( a \cdot b \cdot \bar { s } \right ) .$$
Note that the conditional unit calculates the overflow in parallel by evaluating $(\bar{a} \cdot \bar{b} \cdot s)$ and $(a \cdot b \cdot \bar{s})$ concurrently.
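At the plaintext level, the four flag calculations reduce to the following sketch. The packing of the flags into a single value is illustrative only, not CryptoEmu's vReg layout, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Plaintext sketch of NZCV calculation for a 16-bit result. a_msb, b_msb
   are the operand sign bits; flags are packed N,Z,C,V from bit 3 down
   (illustrative packing only). */
static unsigned nzcv16(uint16_t result, int carry_out, int a_msb, int b_msb) {
    int s = (result >> 15) & 1;
    int n = s;                                 /* N: MSB of result          */
    int z = (result == 0);                     /* Z: negated OR-reduction   */
    int c = carry_out & 1;                     /* C: adder carry-out        */
    int v = (!a_msb && !b_msb && s)            /* V: ov formula above       */
          || (a_msb && b_msb && !s);
    return (unsigned)((n << 3) | (z << 2) | (c << 1) | v);
}
```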
a) Benchmark.: Table XV shows benchmark results for the 16-bit and 32-bit conditional units. The N and C flag calculations are fast because they involve no bootstrapped FHE gates. The Z and V flags are calculated at the same time; for conditional flag calculation on an N-bit result, the maximum thread usage is N/2 + 2 threads. The Z and V flag latency dominates the total latency.
TABLE XV: Conditional flag latency
| | 16-bit (ms) | 32-bit (ms) |
|-----------------------------------|---------------|---------------|
| N flag | 0.0214161 | 0.0204952 |
| C flag | 0.0007251 | 0.0002539 |
| Z flag and V flag | 115.811 | 168.728 |
| Total latency, including overhead | 115.901 | 169.843 |
## B. Branch
CryptoEmu has a vReg virtual register reserved for the program counter (PC). The PC stores a cleartext value that points at an HE instruction address. CryptoEmu loads an HE instruction from the PC, decodes it, invokes the data processing unit to execute it, and repeats. Normally, as in the ARM A32 ISA, the next instruction address is the current PC plus 4. However, branch instructions can modify the PC value based on conditional flags and thereby decide the next instruction address CryptoEmu fetches from. For example, in the following code the branch to ADDRESS_LABEL is not taken, because the SUBS instruction sets the Z flag and the B_NE (branch if not equal) condition fails.
```
MOV R0 R0 1
SUBS R0 R0 1
B_NE ADDRESS_LABEL
```
§VI-A discussed conditional flag calculation. The NZCV flags are stored in a vReg as encrypted data. The cloud needs to know the value of the NZCV flags to decide which branch direction to take; however, the cloud has no knowledge of the NZCV value unless it is decrypted by the user. CryptoEmu adopts a server-client communication model from CryptoBlaze [15]. As demonstrated in Fig. 17, the cloud (server) first homomorphically evaluates an HE instruction and updates the NZCV vReg. Next, the cloud sends the encrypted NZCV value and the branch condition to the client. The user decrypts the NZCV value and sends a branch taken/not-taken decision back to the cloud through a secure channel. Once the cloud learns the branch decision, the PC is either updated to PC+4 or to the branch address.
Fig. 17: Server-client model for branch resolution. Blurry text in the figure denotes encrypted data.
## VII. RESULTS
This section reports on (i) the scalability with respect to bit count when the core count is fixed; (ii) the scalability with respect to core count when the data size is fixed; (iii) CryptoEmu's potential vulnerabilities; and (iv) a comparison against the state of the art.
## A. Scalability with respect to bit count
Table I shows the latency of the TFHE library API. Non-bootstrapped gates such as CONSTANT, NOT, and COPY are three to four orders of magnitude faster than bootstrapped gates and are therefore negligible in CryptoEmu's performance model. Functional units that contain no bootstrapped gates are excluded from the scalability analysis.
The latency of a single bootstrapped gate is denoted by G . In the rest of this subsection, we analyze the scalability of the functional unit subroutines with respect to the data size N. Two scenarios are considered: single-core mode, and multi-core mode with enough cores to perform all intended parallel computation concurrently. CryptoEmu uses all available cores to perform homomorphic computation on encrypted data at multi-core-mode latency; once all cores are in use, the remaining work is serialized at single-core-mode latency.
a) Adder.: An adder has three stages: generate-and-propagate calculation, carry calculation, and sum calculation. Let the number of available cores be one (single-core mode); then the latency of each stage is:
Generate and propagate calculation : Computing generate and propagate for one bit requires two bootstrapped FHE gate evaluations. With N-bit operands, the total latency is 2 · G · N .
Carry calculation : Three bootstrapped FHE gate evaluations are required to calculate a carry-out bit. With N-bit operands, the total latency is 3 · G · N .
Sum calculation : One bootstrapped XOR gate is needed to calculate one sum bit. With N-bit operands, the total latency is G · N .
Total latency : The latency of the adder on a single core is 6 · G · N . Time complexity is O ( N ) .
Table XVI summarizes the discussion above for the N-bit adder with one core available.
TABLE XVI: Adder scalability, single core
| Operation | Latency | Time complexity |
|-------------------|-----------|-------------------|
| (g,p) calculation | 2 GN | O ( N ) |
| Carry calculation | 3 GN | O ( N ) |
| Sum calculation | GN | O ( N ) |
| Total latency | 6 GN | O ( N ) |
Assume a sufficient but fixed number of processors is available. Then the latencies become:
Generate and propagate : The generate and propagate calculations for each bit can be executed in parallel, with a latency of one bootstrapped FHE gate. Given N-bit operands, the N calculations are carried out concurrently, so the latency of the (g,p) calculation is G .
Carry calculation : The carry-out computation for one bit has a latency of two bootstrapped FHE gates. For N-bit operands, the carry calculation is divided into log ( N ) stages, and the operations within a stage are executed in parallel. Therefore, the latency of the carry calculation is 2 · G · log ( N ) .
Sum calculation : Each sum bit needs one bootstrapped XOR gate. For N-bit operands, the N computations are executed in parallel, so the total latency of the sum calculation is G .
Total latency : The total latency for an adder with a sufficient yet fixed number of cores is 2 · G · log ( N ) + 2 · G . Time complexity is O ( log ( N )) .
Table XVII summarizes the scalability analysis for the N-bit adder with a sufficient yet fixed number of cores available.
TABLE XVII: Adder scalability, multiple core
| Operation | Latency | Time complexity |
|-------------------|---------------------|-------------------|
| (g,p) calculation | G | O (1) |
| Carry calculation | 2 G · log ( N ) | O ( log ( N )) |
| Sum calculation | G | O (1) |
| Total latency | 2 G · log ( N )+2 G | O ( log ( N )) |
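The closed-form latencies in Tables XVI and XVII can be collected into a short model. The sketch below is illustrative only: the function name `adder_latency` and its arguments are our own, not part of CryptoEmu's API; `gate_latency` stands for the unit bootstrapped-gate cost G.

```python
import math

def adder_latency(n_bits, gate_latency, cores="single"):
    """Modeled latency of the N-bit adder, per the stage counts above."""
    g = gate_latency
    if cores == "single":
        gp = 2 * g * n_bits      # generate/propagate: 2 gates per bit
        carry = 3 * g * n_bits   # carry: 3 gates per bit
        s = g * n_bits           # sum: 1 XOR gate per bit
        return gp + carry + s    # 6*G*N
    # multi-core: per-bit work collapses to G each, carry takes log2(N) stages
    return g + 2 * g * math.log2(n_bits) + g  # 2*G*log(N) + 2*G
```

For N = 16 and a unit gate cost, this predicts a latency of 96 · G on one core versus 10 · G with sufficient cores.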
b) Subtractor.: A subtractor homomorphically negates the subtrahend, then homomorphically adds the subtrahend's complement to the minuend using an adder with a carry-in of one.
Negation uses the TFHE library's non-bootstrapped NOT gate, whose latency is negligible. Adding the carry-in to the adder uses two extra bootstrapped gates. Therefore, the total latency of the subtractor is Latency ( Adder ) + 2 · G .
Table XVIII summarizes key scalability aspects for the N-bit subtractor with single core and with sufficient yet fixed number of cores available.
TABLE XVIII: Subtractor scalability
| # of cores | Latency | Time complexity |
|--------------|---------------------|-------------------|
| Single core | 6 GN + 2 G | O ( N ) |
| Multi-core | 2 G log ( N ) + 4 G | O ( log ( N )) |
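The subtraction scheme above can be illustrated on plaintext bits. This sketch is ours and merely stands in for the homomorphic version: the NOT and the carry-in of one play the roles described in the text, and bit vectors are LSB-first.

```python
def subtract(a_bits, b_bits):
    """a - b computed as a + NOT(b) + 1 (two's complement), modulo 2**N."""
    carry = 1                # carry-in of one supplies the "+1"
    out = []
    for a, b in zip(a_bits, b_bits):
        nb = 1 - b           # negation: a cheap non-bootstrapped NOT gate
        s = a + nb + carry   # one full-adder step
        out.append(s & 1)
        carry = s >> 1
    return out

# 9 - 4 over 4 bits: [1,0,0,1] - [0,0,1,0] -> [1,0,1,0], i.e. 5
```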
Shifter. For shifts by a cleartext immediate, the operation is essentially homomorphic data movement using the TFHE library's COPY and CONSTANT APIs; its latency is therefore negligible compared with the rest of the functional units in CryptoEmu.
Shifting by an encrypted vReg is implemented as a barrel shifter. For an N-bit vReg shift, the operation has log ( N ) stages. In each stage, a MUX controlled by one bit of the shift-amount vReg selects which input bits move on to the next stage. The TFHE MUX has a latency of around 2 · G .
When a single core is used, each stage evaluates N MUXes, so the latency of one stage is 2 · G · N ; with log ( N ) stages, the total shifter latency on a single core is 2 · G · N · log ( N ) .
When sufficient cores are given, the MUXes of a stage are evaluated in parallel, for a stage latency of 2 · G ; with log ( N ) stages, the total shifter latency with sufficient cores is 2 · G · log ( N ) .
Table XIX summarizes key scalability aspects for the shifter with respect to bit count N, in single core and multi-core mode.
TABLE XIX: Shifter scalability
| # of cores | Latency | Time complexity |
|--------------|------------------|-------------------|
| Single core | 2 GN log ( N ) | O ( N log ( N )) |
| Multi-core | 2 G log ( N ) | O ( log ( N )) |
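The stage structure of the barrel shifter can be sketched over plaintext bits. This is our illustration, not CryptoEmu's code: `mux` stands in for the TFHE MUX, and bit vectors are LSB-first.

```python
def mux(sel, a, b):
    """2-to-1 MUX: returns b when sel is 1, else a (one TFHE MUX, ~2*G)."""
    return b if sel else a

def barrel_shift_left(bits, amount_bits):
    """log2(N) stages; stage k shifts by 2**k iff bit k of the amount is set."""
    n = len(bits)
    for stage, sel in enumerate(amount_bits):
        shift = 1 << stage
        # For each output bit, a MUX picks the shifted vs. unshifted input.
        bits = [mux(sel, bits[i], bits[i - shift] if i >= shift else 0)
                for i in range(n)]
    return bits
```

Shifting the 8-bit value 3 left by 2 (amount bits [0, 1, 0]) yields 12; each of the log2(N) stages evaluates N MUXes, matching the 2 · G · N · log(N) single-core and 2 · G · log(N) multi-core latencies above.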
c) Multiplier.: The multiplier is implemented with an iterative multiplication algorithm. A multiplier with N-bit operands requires N iterations.
N-bit unsigned multiplication performs an N-bit FHE AND evaluation and invokes an N-bit adder in every iteration. When running unsigned multiplication on a single core, the sequential latency of the N-bit FHE AND is N · G and the sequential latency of the N-bit adder is 6 · G · N , so the latency of one iteration is 7 · G · N . The total latency for N-bit unsigned multiplication is 7 · G · N² .
When a sufficient number of cores is provided to run the subroutines in parallel, the parallel latency of the N-bit FHE AND is G and that of the N-bit adder is 2 · G · log ( N ) + 2 G . One iteration takes 2 · G · log ( N ) + 3 G to complete. The total latency for N-bit unsigned multiplication is 2 · G · N · log ( N ) + 3 · G · N .
Signed multiplication has the same per-iteration latency as unsigned multiplication. In addition, N-bit signed multiplication involves an N-bit addition that adds an offset to the upper half of the 2N-bit product; this operation has a sequential latency of 6 · G · N and a parallel latency of 2 · G · log ( N ) + 3 G . The total latency for signed multiplication is therefore 7 · G · N² + 6 · G · N on a single core, and 2 · G · N · log ( N ) + 3 · G · N + 2 · G · log ( N ) + 3 G if a sufficient number of cores is available.
Table XX summarizes key scalability aspects (latency and time complexity) for the unsigned and signed multiplication with respect to bit count N, in single core and multi-core mode.
TABLE XX: Multiplier scalability
| unsigned? | # of cores | Latency | Time complexity |
|-------------|--------------|---------------------------------------------|-------------------|
| Yes | Single core | 7 GN² | O ( N² ) |
| Yes | Multi-core | 2 GN log ( N ) + 3 GN | O ( N log ( N )) |
| No | Single core | 7 GN² + 6 GN | O ( N² ) |
| No | Multi-core | 2 GN log ( N ) + 3 GN + 2 G log ( N ) + 3 G | O ( N log ( N )) |
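A plaintext sketch of the iterative (shift-and-add) algorithm clarifies the per-iteration cost. The code is ours: each plaintext bit operation stands in for one bootstrapped gate, so one loop iteration mirrors the N AND gates plus one N-bit addition analyzed above.

```python
def iterative_multiply(a_bits, b_bits):
    """Unsigned shift-and-add multiply over LSB-first bit vectors."""
    n = len(a_bits)
    product = [0] * (2 * n)
    for i, b in enumerate(b_bits):                 # N iterations
        partial = [b & a for a in a_bits]          # N AND gates
        carry = 0
        for j, p in enumerate(partial):            # N-bit add at offset i
            s = product[i + j] + p + carry
            product[i + j] = s & 1
            carry = s >> 1
        k = i + n
        while carry:                               # propagate final carry
            s = product[k] + carry
            product[k] = s & 1
            carry = s >> 1
            k += 1
    return product                                 # 2N-bit product
```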
d) Divider.: The divider is implemented with the iterative non-restoring division algorithm. A divider with N-bit operands requires N iterations. Every iteration involves negligible non-bootstrapped gates and an adder-subtractor unit. The adder-subtractor selectively inverts the second operand based on the value of the SUB bit; the subroutine then invokes an adder that combines the first operand, the processed second operand, and the SUB bit to form the result.
For computation on a single core, selective inversion of an N-bit operand uses N FHE XOR gates and therefore has a latency of G · N . The adder with carry-in has a latency of 6 · G · N + 2 · G . The latency per iteration is 7 · G · N + 2 · G , and the total single-core latency is 7 · G · N² + 2 · G · N .
If enough cores are available, the N-bit selective inversion can be executed in parallel, resulting in a latency of G . The adder with carry-in has, in multi-core mode, a latency of 2 · G · log ( N ) + 4 · G . The latency per iteration is 2 · G · log ( N ) + 5 · G , and the total latency is 2 · G · N · log ( N ) + 5 · G · N .
Table XXI shows latency and time complexity results for the unsigned division with respect to bit count N, in single core and multi-core modes.
TABLE XXI: Divider scalability
| # of cores | Latency | Time complexity |
|--------------|-------------------------|-------------------|
| Single core | 7 GN² + 2 GN | O ( N² ) |
| Multi-core | 2 GN log ( N ) + 5 GN | O ( N log ( N )) |
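The iteration can be sketched on plaintext integers. This is our illustration of the standard non-restoring scheme: the sign test on the partial remainder plays the role of the SUB bit that drives the adder-subtractor.

```python
def nonrestoring_divide(dividend, divisor, n_bits):
    """Unsigned non-restoring division: N add-or-subtract iterations."""
    r = 0                                          # partial remainder
    q = 0                                          # quotient
    for i in range(n_bits - 1, -1, -1):            # N iterations
        r = (r << 1) | ((dividend >> i) & 1)       # bring down next bit
        if r >= 0:
            r -= divisor                           # SUB = 1: subtract
        else:
            r += divisor                           # SUB = 0: add back
        q = (q << 1) | (1 if r >= 0 else 0)        # quotient bit
    if r < 0:                                      # final correction step
        r += divisor
    return q, r
```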
e) Bitwise operations.: A bitwise operation evaluates each pair of input bits with one FHE gate, so the latency per bit is G . For an N-bit bitwise operation in single-core mode, the total latency is G · N . In multi-core mode, all FHE gate evaluations proceed in parallel and the total latency is G .
Table XXII shows latency and time complexity data for bitwise operations with respect to bit count N, in single-core and multi-core mode.
TABLE XXII: Bitwise ops scalability
| # of cores | Latency | Time complexity |
|--------------|-----------|-------------------|
| Single core | GN | O ( N ) |
| Multi-core | G | O (1) |
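The bitwise case is the simplest to parallelize. A sketch with a thread pool (our code, not CryptoEmu's scheduler) shows the structure: one independent gate evaluation per bit pair, so latency is G · N with one worker and G with N or more workers.

```python
from concurrent.futures import ThreadPoolExecutor

def bitwise_op(gate, a_bits, b_bits, workers):
    """Apply one two-input gate to every bit pair, using `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(gate, a_bits, b_bits))
```

For example, `bitwise_op(lambda a, b: a & b, [1, 1, 0, 0], [1, 0, 1, 0], workers=4)` returns `[1, 0, 0, 0]`.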
f) Conditional Flags.: The conditional flags calculation involves four sub-calculations: the N, Z, C, and V flags. The N and C flag calculations are trivial. The Z flag calculation contains an OR-reduction. In single-core mode, the N-bit Z flag calculation has a latency of G · N . In multi-core mode, the OR-reduction is executed in parallel: the N-bit reduction is broken into log ( N ) stages, and within each stage the FHE OR gates are evaluated in parallel, for a latency of G · log ( N ) .
The V flag calculation uses five gates regardless of the size of the input vReg. In single-core mode, the V flag calculation has a latency of 5 · G . In multi-core mode, because the V flag calculation can be partially parallelized, the latency is 3 · G . Therefore, the total latency is G · N + 5 · G in single-core mode and G · log ( N ) + 3 · G in multi-core mode.
Table XXIII shows latency and time complexity data for conditional flag calculation with respect to the bit count N, in single core and multi-core mode.
TABLE XXIII: Conditional flags scalability
| # of cores | Latency | Time complexity |
|--------------|-------------------|-------------------|
| Single core | GN + 5 G | O ( N ) |
| Multi-core | G log ( N ) + 3 G | O ( log ( N )) |
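The Z flag's OR-reduction can be sketched as a tree (our code; each tree level corresponds to one parallel stage of FHE OR gates, giving the log(N) depth above):

```python
def z_flag(bits):
    """Z = 1 iff every result bit is 0, via a log2(N)-depth OR tree."""
    level = list(bits)
    while len(level) > 1:
        # ORs within one level are independent and can run in parallel
        nxt = [a | b for a, b in zip(level[0::2], level[1::2])]
        if len(level) % 2:           # odd leftover passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return 1 - level[0]              # final NOT (non-bootstrapped, negligible)
```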
## B. Scalability with respect to core count
CryptoEmu's performance is dictated by the extent to which it can rely on parallel computing. In the ideal scenario, the processor has enough cores for any collection of tasks that can be performed in parallel to be processed as such. Sometimes part of an instruction's execution takes place in parallel while the rest is sequential because all CPU cores are exhausted. In the worst-case scenario, the processor has only one core and all computation is serialized.
Assume that the functional units are 16-bit. Peak thread usage is then 32 threads, with each thread performing one FHE gate evaluation. On the server used in this study, which has 48 cores distributed over four CPUs, we varied the number of cores available to the emulation from 1 to 48. This amounts to a strong scaling analysis. We created scenarios where (i) the instruction set can draw on one core only, so the computation is sequential; (ii) the instruction set has insufficient cores and parallel computation is mixed with sequential computation; and (iii) the instruction set can count on sufficient cores to do all emulation in parallel.
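The mixed regime in scenario (ii) can be modeled by batching: a layer of W independent gate evaluations on c cores takes ceil(W/c) rounds. The model below uses our own naming and ignores scheduling overhead.

```python
import math

def layer_latency(width, cores, gate_latency):
    """Latency of one layer of `width` independent bootstrapped-gate
    evaluations when only `cores` cores are available."""
    return math.ceil(width / cores) * gate_latency
```

With the 16-bit units' peak width of 32, a layer costs 32 · G on one core, 2 · G on 16 cores, and G once 32 or more cores are available; adding cores beyond the peak width buys nothing, which matches the plateaus seen in Figs. 18-23.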
a) Adder.: Figure 18 shows the 16-bit adder's compute time as a function of core count. The adder's latency decreases as the core count increases from 1 to 32; it reaches its optimal speed at around 32 cores and slows slightly when the core count exceeds 32. The adder's maximum speedup from parallel computing is around 7.78x.
Fig. 18: Time vs Core count
<details>
<summary>Image 20 Details</summary>

### Visual Description
Bar chart, "Adder: Time vs Core count". X-axis: core count, 1 to 47; Y-axis: time (ms), 0 to 5000. Execution time drops steeply from roughly 4500 ms at 1 core to about 1000 ms at 7 cores, then declines gradually with diminishing returns, leveling off at a few hundred milliseconds once the core count approaches 32.
</details>
b) Subtractor.: Figure 19 shows the 16-bit subtractor's compute time as a function of core count. The 16-bit subtractor uses at most 32 cores in parallel. Its latency decreases as the core count increases from 1 to 32, reaching maximum performance at around 32 cores, where all parallel optimizations are enabled. Once the core count saturates the subtractor's parallelism, its latency is around 700 ms. The subtractor's maximum speedup from parallel computing is about 8.49x.
Fig. 19: Time vs Core count
<details>
<summary>Image 21 Details</summary>

### Visual Description
Bar chart, "Subtractor: Time vs Core count". X-axis: core count, 1 to 47; Y-axis: time (ms), 0 to 6000. Execution time falls steeply from roughly 4900 ms at 1 core to about 900 ms at 11 cores, then levels off at around 700 ms at high core counts.
</details>
c) Shifter.: Figure 20 shows the 16-bit shifter's compute time as a function of core count. A 16-bit shifter uses at most 16 cores. Its latency decreases as the core count increases from 1 to 16, where performance peaks; for higher core counts it stays stable at around 300 ms. The maximum speedup yielded by parallel computing is around 10.94x.
Fig. 20: Time vs Core count
<details>
<summary>Image 22 Details</summary>

### Visual Description
Bar chart, "Shifter: Time vs Core count". X-axis: core count, 1 to 47; Y-axis: time (ms), 0 to 3500. Execution time falls steeply from roughly 3100 ms at 1 core, with most of the gain realized by 16 cores; beyond that it stays roughly flat at around 300 ms.
</details>
d) Multiplier.: Figure 21 shows the 16-bit multiplier's compute time as a function of core count. A 16-bit multiplier requires 32 cores to enable all parallel computing optimizations. Both unsigned and signed multiplication latencies decrease as the core count increases from 1 to 32; the multiplier reaches its top speed at around 32 cores, beyond which latency stays around 11,000-12,000 ms. Unsigned multiplication has a maximum parallel computing speedup of 8.3x; signed multiplication, 8.4x.
Fig. 21: Time vs Core count
<details>
<summary>Image 23 Details</summary>

### Visual Description
Grouped bar chart, "Multiplier: Time vs Core count". X-axis: core count, 1 to 47; Y-axis: time (ms), 0 to 100,000. Two series: unsigned multiplication (orange) and signed multiplication (gray). Both start near 90,000 ms at 1 core, drop steeply through about 9 cores, and level off at roughly 11,000-12,000 ms once the core count exceeds 32.
</details>
e) Divider.: Figure 22 shows the 16-bit divider's compute time as a function of core count. A 16-bit divider requires 32 cores to achieve its best parallel computing performance. The divider's latency decreases as the core count increases from 1 to 32; from 32 cores on, it stays at around 12,000 ms. The 16-bit divider has a maximum parallel computing speedup of 8.43x, reached when it can draw on 32 cores.
Fig. 22: Time vs Core count
<details>
<summary>Image 24 Details</summary>

### Visual Description
Bar chart, "Divider: Time vs Core count". X-axis: core count, 1 to 47; Y-axis: time (ms), 0 to 100,000. Execution time drops steeply from roughly 89,000 ms at 1 core through about 7 cores, then declines gradually, leveling off at around 12,000 ms from 32 cores on.
</details>
f) Bitwise operations.: Figure 23 shows a 16-bit bitwise operation's compute time with respect to the core count. A 16-bit bitwise operation requires 16 cores for the best parallel computing performance: the graph shows it reaches top speed at a core count of 16, and at 16 cores and beyond the latency stays around 50ms. The 16-bit bitwise operation has a maximum parallel computing speedup of 8.63x.
Fig. 23: Time vs Core count
<details>
<summary>Image 25 Details</summary>

### Visual Description
## Bar Chart: BitOps Time vs Core Count
### Overview
The image presents a bar chart illustrating the relationship between "Core count" and "Time (ms)" for a process labeled "BitOps". The chart displays the execution time of BitOps as the number of cores used increases.
### Components/Axes
* **Title:** "BitOps: Time vs Core count" - positioned at the top-center of the chart.
* **X-axis:** "Core count" - ranging from 1 to 47, with increments of 2.
* **Y-axis:** "Time (ms)" - ranging from 0 to 450, with increments of 50.
* **Data Series:** A single series of blue bars representing the time taken for BitOps at each core count.
### Detailed Analysis
The chart shows a steep decline in execution time as the core count increases, followed by a plateau:
* **Core Count 1:** ≈400 ms; **3:** ≈190 ms; **5:** ≈120 ms; **7:** ≈100 ms
* **Core Counts 9-21:** time falls gradually from ≈90 ms to ≈45 ms
* **Core Counts 23-47:** time plateaus at ≈40 ms
### Key Observations
* The most significant performance gains are achieved with a small number of cores (1-7).
* Adding more than 17 cores yields diminishing returns in terms of execution time reduction.
* The execution time plateaus at approximately 40 ms after 17 cores.
### Interpretation
BitOps is highly parallelizable up to roughly one core per output bit; beyond that point, scheduling overhead and the fixed latency of each bootstrapped gate dominate, so additional cores no longer reduce the time. The plateau at ≈40 ms marks the point of diminishing returns.
</details>
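The per-bit parallelism behind the 16-core saturation of the bitwise unit can be sketched as follows. This is an illustrative plaintext model, not the CryptoEmu source: `he_and` stands in for a (slow) bootstrapped TFHE AND gate, and real ciphertexts are elided.

```python
# Each output bit depends on exactly one input-bit pair, so all 16 gate
# evaluations are independent and can run on separate cores.

from concurrent.futures import ThreadPoolExecutor

def he_and(a_bit, b_bit):
    return a_bit & b_bit  # stands in for one bootstrapped gate (~tens of ms)

def bitwise_and_16(a_bits, b_bits, workers=16):
    """Evaluate all 16 AND gates concurrently, one task per bit."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(he_and, a_bits, b_bits))

a = [1, 0, 1, 1] * 4
b = [1, 1, 0, 1] * 4
print(bitwise_and_16(a, b))  # -> [1, 0, 0, 1] repeated four times
```

Because there are exactly 16 independent gates, a 17th core has nothing to do, which matches the measured saturation at 16 cores.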
- g) Conditional Flags.: Figure 24 shows the 16-bit conditional flag unit's compute time with respect to the core count. A 16-bit conditional flag unit runs a maximum of 18 threads in parallel. On the chart, from core count 15 onward the latency of the conditional flag unit hovers around 120-130ms. The maximum parallel computing speedup is 4.7x.
Fig. 24: Time vs Core count
<details>
<summary>Image 26 Details</summary>

### Visual Description
## Bar Chart: Time vs Core Count
### Overview
The image presents a bar chart illustrating the relationship between "Core Count" and "Time (ms)". The chart displays the time taken for a process to complete as the number of cores used increases. The chart is titled "Condition: Time vs Core count".
### Components/Axes
* **Title:** Condition: Time vs Core count (centered at the top)
* **X-axis:** Core Count, ranging from 1 to 47, with increments of 2.
* **Y-axis:** Time (ms), ranging from 0 to 600, with increments of 100.
* **Data Series:** A single series of blue bars representing the time taken for each core count.
### Detailed Analysis
The chart shows a steep decrease in time taken as the core count increases from 1 to approximately 7. After this point, the decrease in time becomes much less pronounced, and the time levels off, fluctuating around 100-150ms.
Approximate values, reading left to right:
* Core Count 1: ~480 ms; 3: ~270 ms; 5: ~200 ms; 7: ~170 ms
* Core Counts 9-29: time fluctuates around 130-150 ms
* Core Counts 31-47: time settles at ~110 ms
### Key Observations
* The most significant performance gain is achieved when increasing the core count from 1 to 7.
* Beyond 7 cores, the performance improvement diminishes significantly.
* The time taken appears to stabilize around 110-150ms for core counts greater than 20.
* There is some minor fluctuation in the time taken for higher core counts, but no clear trend.
### Interpretation
Returns diminish beyond roughly 7 cores: coordination overhead and the unit's sequential critical path limit further gains, a classic Amdahl's-law pattern. The stabilization around 110-150 ms at high core counts indicates the unit is no longer limited by the number of available cores.
</details>
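The plateaus in these charts follow Amdahl's law: if a fraction p of the work parallelizes, the speedup on n cores is 1/((1-p)+p/n), capped at 1/(1-p) no matter how many cores are added. A quick sanity check with illustrative parallel fractions (assumed values, not measurements from the paper):

```python
# Amdahl's-law sanity check for the speedup plateaus observed above.
# The parallel fractions below are illustrative assumptions.

def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work parallelizes."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.8, 16))       # -> 4.0 on 16 cores
print(round(1.0 / (1.0 - 0.79), 2))  # asymptotic cap for p = 0.79 -> 4.76
# A cap near 4.76x would be consistent with the conditional flag unit's
# measured maximum speedup of 4.7x.
```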
## C. Vulnerability
There are two sources of vulnerability in the current CryptoEmu implementation: ( i ) the unencrypted HE instruction memory; and ( ii ) the branch resolution process.
Because the HE instruction memory is in cleartext, the cloud knows exactly what assembly code is being executed. Assumption 2 holds the cloud to be honest, so instructions visible to the cloud do not, by themselves, pose a security issue. However, the cloud's execution pipeline could be compromised by attackers: if an attacker gains control over the instruction set emulator, or performs a side-channel attack, user data can be breached.
If Assumption 2 is lifted, the unencrypted HE instruction memory becomes a security loophole. Note that we still assume the cloud will not mount a denial-of-service attack on the user. For example, if the user sends an encrypted number such as an age, credit score, or salary, the cloud can use binary search, via the conditional unit implemented in CryptoEmu, to actively query the value of the encrypted data and eventually recover the cleartext. The user can counter by sending HE instructions obfuscated with anti-disassembly techniques, making such attacks harder to mount. However, this approach does not eliminate an active attack from the cloud.
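The binary-search attack described above can be sketched in a few lines. The oracle below models the cloud abusing the conditional unit to learn whether the secret exceeds a chosen plaintext probe; all names are illustrative, not part of the CryptoEmu API.

```python
# A malicious cloud with a comparison oracle recovers a 16-bit encrypted
# value in at most 16 queries via binary search.

def recover_secret(is_greater, lo=0, hi=2**16 - 1):
    """Recover a secret in [lo, hi] using O(log n) oracle queries."""
    while lo < hi:
        mid = (lo + hi) // 2
        if is_greater(mid):   # oracle: secret > mid ?
            lo = mid + 1
        else:
            hi = mid
    return lo

secret = 41235  # the user's value, unknown to the attacker
oracle = lambda probe: secret > probe  # models abuse of the conditional unit
assert recover_secret(oracle) == secret
```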
The cloud is able to actively probe an encrypted value because it can abuse the branch resolution process. The problem can be solved by having the user assume control over branch resolution [15]. The user requests the encrypted branch condition from the cloud, decrypts and evaluates it locally, sends the result back, and informs the cloud which instruction to fetch next. At the cost of some extra data transmission, the user keeps the branch outcome hidden from the cloud and retains full control over branch resolution.
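The user-side step of this countermeasure can be sketched as below. The names (`decrypt`, the program-counter values) are illustrative stand-ins, not the CryptoEmu API; only the user holds the secret key.

```python
# User-controlled branch resolution: the user decrypts the branch condition
# locally, so the cloud learns only which address to fetch next, never the
# underlying data.

def resolve_branch(encrypted_flag, decrypt, taken_pc, fallthrough_pc):
    """User side: decrypt the flag, reply with the address to fetch next."""
    flag = decrypt(encrypted_flag)  # only the user can do this
    return taken_pc if flag else fallthrough_pc

# Toy round trip with an identity "decryption" standing in for FHE decrypt:
next_pc = resolve_branch(1, lambda ct: ct, taken_pc=0x40, fallthrough_pc=0x44)
print(hex(next_pc))  # -> 0x40
```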
## D. Comparison against the state of the art
HELib [4] is one of the most widely adopted FHE software libraries. HELib implements the Brakerski-Gentry-Vaikuntanathan (BGV) scheme and supports basic homomorphic operations [21]. HELib has native support for binary negation, addition/subtraction, multiplication, binary condition, shift by immediate, bitwise operations, and binary comparison. CryptoEmu supports all these operations and a few more: left/right shift by an encrypted amount, unsigned binary division, and NZCV condition flag computation. This section compares the performance of the operations supported by both HELib and CryptoEmu.
TABLE XXIV: HELib vs CryptoEmu
| Operation | HELib | CryptoEmu (single core) | CryptoEmu (multi-core) | Speedup |
|---------------------------|------------|---------------------------|--------------------------|-----------|
| Addition | 10484.1ms | 4400.67ms | 566.149ms | 18.518x |
| Subtraction | 10962.2ms | 5088.57ms | 599.396ms | 18.289x |
| Multiplication (unsigned) | 69988.9ms | 86389.8ms | 10396.1ms | 6.732x |
| Multiplication (signed) | 81707.2ms | 92408.4ms | 10985.4ms | 7.438x |
| LLS (immediate) | 0.534724ms | 0.0040335ms | 0.0040335ms | 132.571x |
| Bitwise XOR | 1.52444ms | 416.013ms | 47.1986ms | -30.9613x |
| Bitwise OR | 771.12ms | 416.146ms | 47.6171ms | 16.194x |
| Bitwise AND | 756.641ms | 411.014ms | 47.6001ms | 15.90x |
| Bitwise NOT | 1.8975ms | 0.012508ms | 0.012508ms | 151.703x |
| Comparison | 4706.07ms | 519.757ms | 110.416ms | 42.62x |
As reported in Table XXIV, for addition and subtraction, CryptoEmu's single-core performance is about twice as fast as HELib's. With sufficient cores, CryptoEmu yields a maximum 18x speedup compared to HELib. For both signed and unsigned multiplication, CryptoEmu in single-core mode is slightly slower than HELib, but in multi-core mode it yields a maximum 7x speedup.
HELib supports logical left shift (LLS) by an immediate. This operation is not computationally intensive compared to addition/subtraction, so LLS by immediate is implemented on CryptoEmu without parallel computing optimizations, and its single-core latency equals its multi-core latency. Thanks to the excellent support provided by the TFHE library, LLS by immediate on CryptoEmu is about 132x faster than on HELib.
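The reason a shift by a plaintext immediate is nearly free in a bit-sliced FHE representation can be shown in a short sketch: it is pure reindexing of ciphertexts, with fresh encryptions of zero filling the vacated low bits. Here `enc_zero` is a stand-in for an encrypted zero; no homomorphic gate is evaluated.

```python
# Logical left shift of an LSB-first ciphertext bit-vector by a plaintext
# immediate: only list reindexing, no gate evaluations.

def lls_imm(ct_bits, shamt, enc_zero=0):
    """Shift ct_bits (LSB first) left by shamt positions."""
    n = len(ct_bits)
    return [enc_zero] * min(shamt, n) + ct_bits[: max(n - shamt, 0)]

print(lls_imm([1, 1, 0, 0], 1))  # 0b0011 << 1 == 0b0110 -> [0, 1, 1, 0]
```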
For bitwise operations, HELib's bitwise XOR is in fact 31x faster than CryptoEmu's. This is expected: in HELib's BGV scheme with plaintext modulus 2, XOR corresponds to an inexpensive ciphertext addition, whereas CryptoEmu evaluates a bootstrapped gate per bit. For bitwise OR and bitwise AND, CryptoEmu's single-core speed already beats HELib's, and with maximum cores enabled CryptoEmu yields a 16x speedup on both operations compared to HELib.
Bitwise NOT is inexpensive on both CryptoEmu and HELib. Because it is implemented without parallel computing techniques, its single-core and multi-core performance are identical on CryptoEmu. On this operation, CryptoEmu is 151x faster than HELib.
For comparison/conditional operations, we benchmarked HELib's CompareTwo() function against CryptoEmu's conditional unit. HELib's compare routine compares two encrypted values and returns the greater/lesser number along with the comparison result; CryptoEmu compares two encrypted values and returns the N, Z, C, and V flags. Because the Z-flag computation lies on the conditional unit's critical path, it is fair to compare the latencies of these two functionally equivalent operations, each on its unit's critical path. CryptoEmu's comparison is much faster than HELib's in single-core mode, and with multiple cores enabled it yields a maximum speedup of 42x.
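For reference, the flag semantics the conditional unit produces for a 16-bit compare (computed as a - b) can be sketched in plaintext; the homomorphic circuit evaluates the same logic gate by gate. This mirrors the description above, not the actual CryptoEmu source.

```python
# NZCV flags for a 16-bit compare, ARM-style: N = sign of result,
# Z = result is zero (an AND-tree over all bits, hence the critical path),
# C = no borrow on the unsigned subtract, V = signed overflow.

def nzcv(a, b, bits=16):
    mask = (1 << bits) - 1
    diff = (a - b) & mask
    n = (diff >> (bits - 1)) & 1                         # negative
    z = int(diff == 0)                                   # zero
    c = int(a >= b)                                      # carry (no borrow)
    v = int((((a ^ b) & (a ^ diff)) >> (bits - 1)) & 1)  # signed overflow
    return n, z, c, v

print(nzcv(5, 7))  # -> (1, 0, 0, 0): negative result, borrow taken
print(nzcv(7, 7))  # -> (0, 1, 1, 0): equal operands set Z and C
```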
## VIII. CONCLUSION AND FUTURE WORK
CryptoEmu successfully supports multiple homomorphic instructions on a multi-core host. With functional units built upon the TFHE bootstrapped gate library, CryptoEmu is reconfigurable and reusable for general-purpose computing. Through parallel algorithms that significantly reduce time complexity, CryptoEmu scales well and achieves parallel speedups from 4.7x to 10.94x on its functional units. Compared to HELib, CryptoEmu achieves up to 18x speedup on addition/subtraction, 7x on multiplication, 16x on bitwise AND/OR, 42x on comparison, 132x on left shift by immediate, and 151x on bitwise NOT.
CryptoEmu's scalability can be further improved. The current design supports a single instruction set emulator process running on multiple cores, so the maximum core count is bounded by the number of cores on one CPU. With more cores available, CryptoEmu could exploit additional parallelism through pipelining, multiple instruction issue, and dynamic instruction scheduling. Another avenue for further speedup is the use of AVX instructions.
## REFERENCES
- [1] J. Clement, 'Cyber crime: number of breaches and records exposed 2005-2020.' https://www.statista.com/statistics/273550, 2020. Accessed on 2020-Dec-01.
- [2] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, 'TFHE: Fast fully homomorphic encryption library,' August 2016, Accessed on 2020-Dec-01. https://tfhe.github.io/tfhe.
- [3] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachene, 'Faster fully homomorphic encryption: Bootstrapping in less than 0.1 seconds,' in international conference on the theory and application of cryptology and information security , pp. 3-33, Springer, 2016.
- [4] S. Halevi and V. Shoup, 'An implementation of homomorphic encryption,' 2013, Accessed on 2020-Dec-17. https://github.com/shaih/HElib.
- [5] A. Acar, H. Aksu, A. S. Uluagac, and M. Conti, 'A survey on homomorphic encryption schemes: Theory and implementation,' ACM Computing Surveys (CSUR) , vol. 51, no. 4, pp. 1-35, 2018.
- [6] C. Gentry, 'Fully homomorphic encryption using ideal lattices,' in Proceedings of the forty-first annual ACM symposium on Theory of computing , pp. 169-178, 2009.
- [7] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, 'TFHE: fast fully homomorphic encryption over the torus,' Journal of Cryptology, vol. 33, no. 1, pp. 34-91, 2020.
- [8] L. Ducas and D. Micciancio, 'Fhew: bootstrapping homomorphic encryption in less than a second,' in Annual International Conference on the Theory and Applications of Cryptographic Techniques , pp. 617-640, Springer, 2015.
- [9] I. Chillotti, N. Gama, M. Georgieva, and M. Izabachène, 'Faster packed homomorphic operations and efficient circuit bootstrapping for TFHE,' in International Conference on the Theory and Application of Cryptology and Information Security, pp. 377-408, Springer, 2017.
- [10] S. Halevi and V. Shoup, 'Design and implementation of a homomorphic-encryption library,' IBM Research (Manuscript) , vol. 6, pp. 12-15, 2013.
- [11] Z. Brakerski, C. Gentry, and V. Vaikuntanathan, '(leveled) fully homomorphic encryption without bootstrapping,' ACM Transactions on Computation Theory (TOCT) , vol. 6, no. 3, pp. 1-36, 2014.
- [12] O. Mazonka, N. G. Tsoutsos, and M. Maniatakos, 'Cryptoleq: A heterogeneous abstract machine for encrypted and unencrypted computation,' IEEE Transactions on Information Forensics and Security , vol. 11, no. 9, pp. 2123-2138, 2016.
- [13] P. Paillier, 'Public-key cryptosystems based on composite degree residuosity classes,' in International conference on the theory and applications of cryptographic techniques , pp. 223-238, Springer, 1999.
- [14] N. G. Tsoutsos and M. Maniatakos, 'Heroic: homomorphically encrypted one instruction computer,' in 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE) , pp. 1-6, IEEE, 2014.
- [15] F. Irena, D. Murphy, and S. Parameswaran, 'Cryptoblaze: A partially homomorphic processor with multiple instructions and non-deterministic encryption support,' in 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC) , pp. 702-708, IEEE, 2018.
- [16] OpenMP, 'Specification Standard 5.1.' Available online at http://openmp.org/wp/, 2020.
- [17] K. Vitoroulis, 'Parallel prefix adders,' Concordia university , 2006.
- [18] A. D. Booth, 'A signed binary multiplication technique,' The Quarterly Journal of Mechanics and Applied Mathematics , vol. 4, no. 2, pp. 236-240, 1951.
- [19] C. Terman, '6.004 computation structures.' MIT OpenCourseWare, https://ocw.mit.edu, Spring 2017.
- [20] geeksforgeeks.org, 'Non-restoring division for unsigned integer.' https://www.geeksforgeeks.org/non-restoring-division-unsigned-integer, 2018, Accessed on 2020-Dec-09.
- [21] S. Halevi and V. Shoup, 'Algorithms in helib,' in Annual Cryptology Conference , pp. 554-571, Springer, 2014.
<details>
<summary>Image 27 Details</summary>

Black-and-white head-and-shoulders portrait photograph (author photo).
</details>
<details>
<summary>Image 28 Details</summary>

Black-and-white head-and-shoulders portrait photograph (author photo).
</details>
Xiaoyang Gong is a graduate student in the Department of Electrical and Computer Engineering at the University of Wisconsin-Madison. Xiaoyang received his MS in Electrical Engineering in 2020 and BS in Computer Engineering in 2019 from the University of Wisconsin-Madison. He worked as an intern on Arm Ltd.'s CPU team in 2019. His interests are in computer architecture and processor design.
Dan Negrut is a Mead Witter Foundation Professor in the Department of Mechanical Engineering at the University of Wisconsin-Madison. He has courtesy appointments in the Department of Computer Sciences and the Department of Electrical and Computer Engineering. Dan received his Ph.D. in Mechanical Engineering in 1998 from the University of Iowa under the supervision of Professor Edward J. Haug. He spent six years working for Mechanical Dynamics, Inc., a software company in Ann Arbor, Michigan. In 2004 he served as an Adjunct Assistant Professor in the Department of Mathematics at the University of Michigan, Ann Arbor. He spent 2005 as a Visiting Scientist at Argonne National Laboratory in the Mathematics and Computer Science Division. He joined the University of Wisconsin-Madison in 2005. His interests are in Computational Science and he leads the Simulation-Based Engineering Lab. The lab's projects focus on high performance computing, computational dynamics, artificial intelligence, terramechanics, autonomous vehicles, robotics, and fluid-solid interaction problems. Dan received the National Science Foundation CAREER Award in 2009. Since 2010 he has been an NVIDIA CUDA Fellow.