HIL Emulator Enhancement of Piccolo SSD Controller

DOI : 10.17577/IJERTV11IS070113


Rahul Pinny

Department of Electronics and Communication R V College of Engineering

Bengaluru, India

Dr. Shilpa D R Associate Professor

Department of Electronics and Communication R V College of Engineering Bengaluru, India

Abstract: Memory devices are evolving to store large amounts of data in smaller devices and to provide faster random access and data transfer at lower power consumption. Solid State Drives (SSDs) achieve this with NAND Flash memory, without the electromechanical parts used in Hard Disk Drives (HDDs). The interface between the host and the device becomes a bottleneck in providing low-latency, high-throughput, scalable performance. To overcome this problem, Non-Volatile Memory Express (NVMe) is used over the Peripheral Component Interconnect Express (PCIe) interface. To obtain fault-free SSD hardware, the controller is first designed and developed through emulation. This paper deals with the emulation of the Host Interface Layer (HIL), the front end of Samsung's Piccolo SSD, based on the NVMe specification. Various test cases are developed to evaluate the functionality of the emulated device using UTWiz (Samsung's integrated tool for Visual Studio), code coverage is obtained with Bullseye, and UML diagrams are developed with Microsoft Visio.

Keywords: SSD; NVM Express; PCI Express; Floating gate MOSFET; Emulation; HIL.


    A physical device capable of storing data or information that an intelligent device processes to produce results is called a memory device. Memory devices have evolved since 1956: from punch cards, magnetic drums and magnetic cores that stored a few bytes of data, through Hard Disk Drives (HDDs) and Compact Discs (CDs), to micro SD cards, USB flash drives and cloud storage holding multiple terabytes of data [1]. These devices evolved to store larger amounts of data for longer periods at lower cost, in smaller sizes and with lower power consumption. Storage devices are classified as volatile or non-volatile based on their data retention capability, and as primary or secondary memory based on where they are used. A Solid State Drive (SSD) is a non-volatile secondary storage device that replaces the conventional HDD. An HDD stores data on magnetic platters that are rotated by a spindle motor and accessed by moving read/write heads. These moving mechanical parts make the HDD vulnerable to crashes and damage, and cause noise and higher power consumption. The SSD overcomes these problems by providing the functionality of an HDD without any moving mechanical parts [3]. It uses electronic flash memory, which is widely used in consumer electronics because it offers large storage capacity at low cost, faster data access, lower power consumption, less vibration and a high number of I/O operations per second.

    Flash memory is built by adding a floating gate between the control gate and the substrate of a conventional Metal Oxide Semiconductor Field Effect Transistor (MOSFET); the charge stored on the floating gate holds the data, which can be electrically erased and reprogrammed. The way the terminals of floating-gate (FG) MOSFETs are connected led to the development of NAND and NOR flash memory [2]. In NAND flash memory, cells are connected in series, with the source and drain of each FG MOSFET shared with adjacent cells, which gives fast program and erase cycles (erase is performed in units of blocks). This configuration performs only one read or write operation at a time and is widely used in file storage and sequential-data applications that need to store large amounts of data. In NOR flash memory, the source and drain terminals of each FG MOSFET are connected to ground and to the bit line respectively, which provides random access and byte programming. This configuration supports simultaneous read and write operations and is widely used in random-access applications. SSDs are connected to a host device, such as a computer, through an interface such as Advanced Technology Attachment (Serial/Parallel ATA), Non-Volatile Memory Express (NVMe) or Serial Attached SCSI (SAS). Recent SSDs use the NVMe interface to exploit internal parallelism with low latency, high throughput and high scalability, giving better performance for operations on stored data. NVMe supports the Host Memory Buffer (HMB) feature, which lets the SSD use a portion of host memory, accessed over the Peripheral Component Interconnect Express (PCIe) interface, to store data or address-mapping information, reducing the need for additional on-board DRAM [3].

    SSDs are emulated before the actual hardware is developed, because hardware development takes much longer than software development and emulation helps detect and correct errors early. The SSD firmware architecture is divided into a front end, the Host Interface Layer (HIL), and a back end consisting of the Flash Translation Layer (FTL) and the Flash Interface Layer (FIL). The HIL runs on the Host core and distributes the workload from the user to the device; the FTL runs on the Flash core and maintains the logical-to-physical flash translation table; and the FIL runs on the NAND core and manages the NAND flash memory.
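The FTL's logical-to-physical table can be illustrated with a minimal page-mapping sketch. Because a NAND page cannot be rewritten until its whole block is erased, the sketch writes every update to a fresh page and marks the old copy stale. The class `PageMappingFTL` and its methods are hypothetical; garbage collection, wear leveling and persistence of the table are deliberately left out, and this is not the Piccolo FTL.

```python
from __future__ import annotations


class PageMappingFTL:
    """Hypothetical page-level FTL: logical page numbers (LPNs) map to physical
    page numbers (PPNs), and every write goes to a fresh, already-erased page."""

    def __init__(self, num_physical_pages: int) -> None:
        self.l2p: dict[int, int] = {}               # logical-to-physical table
        self.valid = [False] * num_physical_pages   # True = page holds live data
        self.next_free = 0                          # next erased page to program

    def write(self, lpn: int) -> int:
        """Program the LPN out of place and return the PPN it now maps to."""
        if self.next_free >= len(self.valid):
            raise RuntimeError("no erased pages left (garbage collection not modelled)")
        old_ppn = self.l2p.get(lpn)
        if old_ppn is not None:
            self.valid[old_ppn] = False             # the previous copy becomes stale
        ppn = self.next_free
        self.next_free += 1
        self.valid[ppn] = True
        self.l2p[lpn] = ppn
        return ppn

    def read(self, lpn: int) -> int | None:
        """Return the PPN holding the LPN, or None if it was never written."""
        return self.l2p.get(lpn)
```

For example, writing LPN 7 twice maps it first to PPN 0 and then to PPN 1, leaving PPN 0 invalid for a later garbage-collection pass to reclaim.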

    This paper deals with the HIL emulation of Samsung's Piccolo SSD CTRL, which was developed from the existing Pascal SSD CTRL by modifying the format of the Inter Processor Communication (IPC), the memory mapping of code and data segments in the Tightly Coupled Memory (TCM), the Data Buffer SRAM that holds user data, and the Job Distributor (JD) handled by the Cache Manager core. The next section presents the background required for developing the SSD CTRL. Section III describes the emulation of the hardware NVMe IPs responsible for communication between the host and the memory device over the PCIe interface. Section IV develops various test cases to evaluate the HIL functionality of the SSD CTRL after the required modifications are implemented. Section V concludes the work with the results obtained by testing the developed HIL firmware.



    1. Flash Memory

      Flash cells can store n bits of data per cell by dividing the cell's threshold voltage range into 2^n levels [4]. The write operation is performed in units of a page (parallel cells controlled by a common word line) by applying a high voltage to the control gate, which moves electrons from the substrate to the floating gate (programming). The erase operation is performed in units of a block (a matrix of cells connected by word lines and bit lines) by applying a high voltage to the substrate, which moves electrons from the floating gate back to the substrate. The read operation selects a particular page and detects whether current flows between the source and drain terminals when a voltage is applied to the gate.
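The relationship between bits per cell and threshold-voltage levels, and the different granularity of program and erase, can be sketched with a toy model. The function `threshold_levels` and the class `NandBlock` are hypothetical illustrations: the model tracks only whether each page is erased or programmed and ignores voltages, ECC and timing.

```python
from __future__ import annotations


def threshold_levels(bits_per_cell: int) -> int:
    """Number of threshold-voltage levels needed to store n bits per cell (2**n)."""
    return 2 ** bits_per_cell


class NandBlock:
    """Toy NAND block: program and read are page-granular, erase is block-granular."""

    def __init__(self, pages_per_block: int, page_size: int) -> None:
        self.page_size = page_size
        self.pages: list[bytes | None] = [None] * pages_per_block  # None = erased

    def program(self, page: int, data: bytes) -> None:
        if len(data) != self.page_size:
            raise ValueError("program operates on a whole page")
        if self.pages[page] is not None:
            raise ValueError("page must be erased before it is programmed again")
        self.pages[page] = data

    def read(self, page: int) -> bytes | None:
        return self.pages[page]

    def erase(self) -> None:
        # Erase clears every page in the block at once.
        self.pages = [None] * len(self.pages)
```

With this function, 1 to 4 bits per cell give 2, 4, 8 and 16 levels, the SLC, MLC, TLC and QLC cases.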

    2. NVMe Over PCIe Interface

      PCIe is used as the interface because it provides high performance at low cost with effective power management. It supports numerous outstanding requests and out-of-order processing over a full-duplex channel. It offers scalable link widths (up to x16) and scalable link speeds with a small number of pins in a small area, thereby reducing cost [6]. To make better use of the PCIe interface, the standardized NVMe specification is used. NVMe provides a standardized, scalable host controller interface with a streamlined and efficient command set that requires no translation overhead. Communication between the host and the non-volatile memory takes place through a standardized register interface defined by NVMe [7]. Key attributes of the NVMe interface are:

      • Support for up to 65,535 I/O queues, each capable of holding up to 64K outstanding commands.

      • Each queue is assigned a priority, and queues are selected according to an arbitration scheme.

      • Support for multiple namespaces, MSI/MSI-X and interrupt aggregation.

      • Robust error reporting and handling.

      • Support for multiple I/O paths and namespace sharing.
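Link-width scaling can be made concrete with the usual per-direction estimate: lanes × transfer rate × encoding efficiency ÷ 8 bits per byte. The sketch uses nominal rates (2.5 and 5.0 GT/s per lane with 8b/10b encoding for PCIe 1.x and 2.x, 8.0 GT/s with 128b/130b for PCIe 3.0) and ignores packet and protocol overheads, so real throughput is lower. The function name `pcie_bandwidth_mb_s` is illustrative.

```python
from fractions import Fraction

# Nominal per-lane transfer rate (GT/s) and line-code efficiency per PCIe generation.
PCIE_GENERATIONS = {
    1: (Fraction(5, 2), Fraction(8, 10)),     # 2.5 GT/s, 8b/10b
    2: (Fraction(5), Fraction(8, 10)),        # 5.0 GT/s, 8b/10b
    3: (Fraction(8), Fraction(128, 130)),     # 8.0 GT/s, 128b/130b
}


def pcie_bandwidth_mb_s(generation: int, lanes: int) -> float:
    """Approximate raw bandwidth per direction in MB/s (10**6 bytes per second)."""
    if lanes not in (1, 2, 4, 8, 12, 16, 32):
        raise ValueError("unsupported link width")
    rate_gt_s, efficiency = PCIE_GENERATIONS[generation]
    bits_per_s = lanes * rate_gt_s * efficiency * 10**9
    return float(bits_per_s / 8 / 10**6)
```

Under these assumptions a PCIe 2.x x1 link gives 500 MB/s per direction and an x16 link gives 8000 MB/s, which is the scaling the x16 width provides.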


    The architecture of the HIL, the front end of the SSD CTRL, is divided into the hardware NVMe IPs, the Buffer Manager, the PCIe interface, the NVMe driver and other supporting functions (Fig. 1). The HIL handles the distribution of the workload between the host and the device. The emulation of the hardware NVMe IPs is highlighted in this section and is implemented in Microsoft Visual Studio (x64).

    Fig. 1. HIL Block Diagram

    1. SQ/CQ & NSR Update Logic

      Commands issued by the host are stored in a Submission Queue (SQ) for processing, and their completion status is posted to a Completion Queue (CQ). The queues are divided into Admin queues, which handle administrative commands, and I/O queues, which handle I/O commands such as Read, Write and Erase. Multiple SQs may share a single CQ. When the host places a command in an SQ, or consumes a status entry from a CQ, it writes the corresponding SQ Tail Doorbell or CQ Head Doorbell to notify the CTRL, which tracks the queues through the SQ Availability bitmap and the CQ Non-Empty bitmap. Pin-based interrupts or Message Signaled Interrupts (MSI/MSI-X) notify the host of posted completions, including errors encountered while processing commands. The NVMe Specific Registers (NSR) are mapped to memory space; the host shall not issue locked accesses to them and shall access them in their native width or with aligned 32-bit accesses, and any violation results in undefined behaviour [7]. Accesses spanning multiple registers are not supported, and reads of reserved registers return 0h.
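The doorbell protocol can be sketched as a pair of circular buffers. This is a simplified model assuming one SQ paired with one CQ; the classes `SubmissionQueue` and `CompletionQueue` are hypothetical, and in real hardware the doorbells are memory-mapped register writes rather than method calls. The phase tag follows the NVMe convention of inverting each time the CQ wraps, which lets the host tell new entries from stale ones.

```python
class SubmissionQueue:
    """Host-side view of an SQ ring; the controller consumes from `head`."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.entries = [None] * depth
        self.head = 0   # advanced by the controller as it fetches commands
        self.tail = 0   # advanced by the host; the value written to the SQ Tail Doorbell

    def is_full(self) -> bool:
        # One slot stays empty so that "full" and "empty" can be distinguished.
        return (self.tail + 1) % self.depth == self.head

    def submit(self, command) -> int:
        """Host side: enqueue a command and return the new tail (doorbell value)."""
        if self.is_full():
            raise RuntimeError("SQ full")
        self.entries[self.tail] = command
        self.tail = (self.tail + 1) % self.depth
        return self.tail

    def fetch(self):
        """Controller side: consume one command, or return None if the SQ is empty."""
        if self.head == self.tail:
            return None
        command = self.entries[self.head]
        self.head = (self.head + 1) % self.depth
        return command


class CompletionQueue:
    """CQ ring with a phase tag that the controller inverts on every wrap."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.entries = [None] * depth   # None stands in for a never-written slot
        self.tail = 0                   # controller write position
        self.head = 0                   # host read position, reported via the CQ Head Doorbell
        self.ctrl_phase = 1             # phase tag written into new entries
        self.host_phase = 1             # phase tag the host expects next

    def post(self, status) -> None:
        """Controller side: append a completion entry carrying the current phase tag."""
        if (self.tail + 1) % self.depth == self.head:
            raise RuntimeError("CQ full")
        self.entries[self.tail] = (status, self.ctrl_phase)
        self.tail = (self.tail + 1) % self.depth
        if self.tail == 0:
            self.ctrl_phase ^= 1

    def reap(self):
        """Host side: return the next new status, or None if no new entry is present."""
        entry = self.entries[self.head]
        if entry is None or entry[1] != self.host_phase:
            return None
        self.head = (self.head + 1) % self.depth
        if self.head == 0:
            self.host_phase ^= 1
        return entry[0]
```

After the CQ wraps, the old entry left in slot 0 still carries the previous phase tag, so `reap` ignores it instead of returning it a second time.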

    2. Arbitration & Command Fetcher

      The Piccolo SSD CTRL supports 17 queues (1 Admin queue and 16 I/O queues) and 2 CTRLs, and a CTRL is selected using the Round Robin (RR) technique. Each queue is assigned a priority, with the Admin queue having a higher priority than the I/O queues. The I/O queues are divided into Urgent I/O queues and weighted I/O queues (High, Medium and Low), with each weighted class assigned a specific weight. Weighted Round Robin (WRR) arbitration is applied among the weighted queues, and RR arbitration is applied among the remaining queues together with the queue selected by WRR [7]. After arbitration, the Command Fetcher fetches commands from the selected host queue into an internal buffer, the Fetched SQ (FSQ); the fetch starts on a trigger from the Arbitration block, and the number of commands fetched is governed by the Arbitration Burst count. After completing the fetch, the Command Fetcher responds with a Ready signal and updates the related registers and fields.
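The arbitration step can be sketched as follows, assuming the priority model of the NVMe specification: the Admin queue is served first, then Urgent queues, then the High, Medium and Low classes by weighted round robin, with plain round robin inside each class. The class `Arbiter`, the weights and the burst value are illustrative assumptions, not Piccolo's configuration.

```python
from __future__ import annotations
from collections import deque


class Arbiter:
    """Illustrative NVMe-style arbiter: strict priority for Admin, then Urgent,
    then weighted round robin (WRR) across High/Medium/Low; RR within a class."""

    CLASSES = ("admin", "urgent", "high", "medium", "low")

    def __init__(self, weights: dict[str, int], burst: int) -> None:
        self.weights = weights                   # e.g. {"high": 4, "medium": 2, "low": 1}
        self.burst = burst                       # Arbitration Burst: max commands per grant
        self.queues = {c: deque() for c in self.CLASSES}
        self.credits = dict(weights)             # remaining WRR credits per class
        self.pending: dict[int, int] = {}        # qid -> commands waiting in that SQ

    def add_queue(self, qid: int, cls: str) -> None:
        self.queues[cls].append(qid)
        self.pending[qid] = 0

    def doorbell(self, qid: int, count: int) -> None:
        self.pending[qid] += count

    def _pick_rr(self, cls: str) -> int | None:
        """Round robin within one class: return the next queue with work, if any."""
        ring = self.queues[cls]
        for _ in range(len(ring)):
            qid = ring[0]
            ring.rotate(-1)
            if self.pending[qid]:
                return qid
        return None

    def _grant(self, qid: int) -> tuple[int, int]:
        n = min(self.burst, self.pending[qid])
        self.pending[qid] -= n
        return qid, n

    def arbitrate(self) -> tuple[int, int] | None:
        """Return (qid, number of commands to fetch), or None if all SQs are empty."""
        for cls in ("admin", "urgent"):          # strict-priority classes
            qid = self._pick_rr(cls)
            if qid is not None:
                return self._grant(qid)
        for _ in range(2):                       # second pass runs after a credit refill
            for cls in ("high", "medium", "low"):
                if self.credits[cls] > 0:
                    qid = self._pick_rr(cls)
                    if qid is not None:
                        self.credits[cls] -= 1
                        return self._grant(qid)
            self.credits = dict(self.weights)    # no eligible credits left: refill
        return None
```

Refilling credits only when no eligible class has credits left keeps the sketch short; a hardware arbiter may interleave the weighted classes differently.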

    3. Command Parser

      The number of commands in the FSQ is the difference between the Producer Index (PI) and the Consumer Index (CI). When space is available in the Physical Region Page (PRP) queue (a PRP being a pointer to a fixed-size physical memory page), commands are taken from the FSQ. The TAG Manager allocates a Tag ID to each command and records it in the TAG Table. Before the PRP Fetch Logic writes the command to the PRP queue, conflicts involving the Tag ID and the Namespace ID are checked by the Trusted Computing Group (TCG) Checker and the Namespace Look Up Table (NS LUT). The Logical Block Overlap Checker (LOC) looks for overlaps in the Logical Block Addresses (LBAs) of different Namespaces. Filtering rules (Admin, I/O or Perf command; disabling the Auto path or LOC, etc.) are then applied to decide whether the Auto path is disabled, and the Normalized command is prepared according to the requirements of the NVMe IPs.
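Two of these checks can be sketched directly: the pending-command count from PI and CI (taken modulo the buffer depth so that it stays correct after PI wraps past CI), and an LBA range overlap test of the kind the LOC performs. All function names are hypothetical, and `find_overlap` assumes, for illustration only, that a new command is compared against in-flight commands tagged with the same namespace ID; the exact LOC policy is not given in the text.

```python
from __future__ import annotations


def pending_commands(pi: int, ci: int, depth: int) -> int:
    """Entries between consumer and producer index in a ring of `depth` slots."""
    return (pi - ci) % depth


def lba_ranges_overlap(slba_a: int, nlb_a: int, slba_b: int, nlb_b: int) -> bool:
    """True if [slba_a, slba_a + nlb_a) intersects [slba_b, slba_b + nlb_b).
    Here nlb is the actual block count, not the 0's based NVMe field."""
    return slba_a < slba_b + nlb_b and slba_b < slba_a + nlb_a


def find_overlap(new: tuple[int, int, int],
                 in_flight: list[tuple[int, int, int]]) -> tuple[int, int, int] | None:
    """Return the first in-flight (nsid, slba, nlb) that overlaps `new`, else None."""
    nsid, slba, nlb = new
    for other in in_flight:
        if other[0] == nsid and lba_ranges_overlap(slba, nlb, other[1], other[2]):
            return other
    return None
```

For instance, with a depth of 8, PI = 1 and CI = 6 still yield 3 pending commands because the subtraction is taken modulo the depth.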

    4. Command Dispatcher

      The normalized command from the PRP queue is decoded as an Admin or I/O command. For an Admin command, a trigger is generated to obtain an index for HCore use; for an I/O command, a trigger is generated to obtain a Tag TR index for HCore. The command, updated with the HCPU ID, TR Index and Tag ID, is placed in the Command Buffer Pool. For a Read command, the Auto path is enabled, the Tag is inserted into the TR Processor and then into the TR Auto/Non-Auto Device to Host (D2H) queue, and the corresponding PI is updated. A Write command follows the same process but uses a Host to Device (H2D) queue. The LBA2LPN mapper is triggered by the Command Dispatcher, the TR Processor and the Prefetch engines, which supply an LBA and NS ID to obtain the corresponding Logical Page Number (LPN). The Destination LPN (DLPN) is calculated from the Namespace, which is configured with LPN ranges of different granularity. The Sequence Detector/Prefetch works in Semi Auto and Full Auto modes. The Sequence Detector uses the Start LBA and the Number of Logical Blocks to check whether the commands for a particular NS form a Read or Write sequence, and performs the corresponding operations. When prefetch is on, it triggers the TCG check and the LBA2LPN mapper, and the data for the destination LPN is loaded into the Cache Manager.
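The LBA-to-LPN translation and the sequence check can be sketched as follows. The sketch assumes a 512-byte logical block, a 4 KiB logical page and a hypothetical per-namespace base LPN; the real Piccolo namespace table and its granularity options are not described in the text, and `Namespace`, `lba_to_lpn_range` and `SequenceDetector` are illustrative names.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass
class Namespace:
    base_lpn: int          # hypothetical first LPN assigned to this namespace
    lba_size: int = 512    # bytes per logical block (assumed)
    lpn_size: int = 4096   # bytes per logical page (assumed)


def lba_to_lpn_range(ns: Namespace, slba: int, nlb: int) -> tuple[int, int]:
    """Return (first LPN, number of LPNs) touched by nlb blocks starting at slba."""
    lbas_per_lpn = ns.lpn_size // ns.lba_size
    first = slba // lbas_per_lpn
    last = (slba + nlb - 1) // lbas_per_lpn
    return ns.base_lpn + first, last - first + 1


class SequenceDetector:
    """Flags a command as sequential if it starts exactly where the previous
    command of the same namespace and direction ended."""

    def __init__(self) -> None:
        self.next_lba: dict[tuple[int, str], int] = {}

    def observe(self, nsid: int, op: str, slba: int, nlb: int) -> bool:
        key = (nsid, op)
        sequential = self.next_lba.get(key) == slba
        self.next_lba[key] = slba + nlb
        return sequential
```

A command that straddles a page boundary, such as 4 blocks starting at LBA 6 with 8 blocks per page, touches two LPNs, which is why the mapper returns a range rather than a single page.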

    5. TR Processor and TQ Preparation

      The TR Processor performs hardware automation of the commands released by the command filter. It allocates a DMA Descriptor, a buffer (in the H2D case) and a Transfer Request (TR) ID. It splits TRs according to the TR split size defined in the SFR and stores the LPN for the next TR split in the TR Descriptor. It tracks LPN boundaries through the LBA2LPN mapper to obtain the DLPN, which is then stored in the DMA Descriptor. During H2D and D2H operations, the TR Processor generates various sequence flags for other blocks. The TR Processor sends each Transfer Request to the Transfer Queue (TQ) Preparation module, whose TQ engine prepares one or more TQs, the units on which the DMA Manager operates. TQs are split based on the DMA Descriptor and the PRP size. During a Read operation, Pre-TQs are prepared while the DMA Done indication from the TR Processor has not yet been received.
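The two splitting steps can be illustrated with byte ranges: a transfer is first cut into TRs of at most the TR split size, and each TR is then cut into TQs that do not cross a host memory page. The sketch assumes a 4 KiB host page (the PRP page size is host-configured in NVMe) and a caller-chosen split size; the actual SFR value and TQ rules in Piccolo may differ, and the function names are hypothetical.

```python
from __future__ import annotations


def split_on_boundary(addr: int, length: int, boundary: int) -> list[tuple[int, int]]:
    """Split [addr, addr + length) so that no chunk crosses a multiple of `boundary`."""
    chunks = []
    end = addr + length
    while addr < end:
        next_boundary = (addr // boundary + 1) * boundary
        chunk_end = min(end, next_boundary)
        chunks.append((addr, chunk_end - addr))
        addr = chunk_end
    return chunks


def split_transfer(host_addr: int, length: int, tr_split_size: int,
                   prp_page_size: int = 4096):
    """Split a transfer into TRs of at most tr_split_size bytes, then split each
    TR into TQs that stay within one host memory (PRP) page.
    Returns [((offset, tr_length), [(host_addr, tq_length), ...]), ...]."""
    result = []
    pos = 0
    while pos < length:
        tr_len = min(tr_split_size, length - pos)
        tqs = split_on_boundary(host_addr + pos, tr_len, prp_page_size)
        result.append(((pos, tr_len), tqs))
        pos += tr_len
    return result
```

A host buffer that does not start on a page boundary produces a short first and last TQ in each TR, which is why TQ splitting depends on the PRP page size as well as on the TR size.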

    6. DMA Manager & Completion Status

    The DMA Manager fetches the TQ entries and, on receiving DMA Done, writes them to the Success or Pending CQ (PCQ). It is responsible for data transfer in both directions (H2D/D2H) and, on completion, releases the NVMe IP resources held by the hardware or the HCPU, signalling the TRDDIQ and DDDIQ. It interacts with the Cache Manager through IHWrite/ID2H_HDMA_Done during H2D/D2H data transfers and communicates with the Gaudi SQ/CQ. The Completion Status module performs Round Robin arbitration among the PCQs and posts the result to the corresponding CQ in the host, generating an interrupt when an error is detected. Entries are routed to the Delayed PCQ based on the Tag timer, whose value is greater than the minimum command latency. Commands are posted to the host CQ once the number of entries in the Delayed PCQ crosses a threshold. When CQ posting is enabled, a request is sent to the Interrupt Generator module and the LOC/TAG index is cleared.
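Completion posting can be sketched as a round-robin scan over the PCQs, together with a delayed queue that is flushed to the host CQ once its entry count crosses a threshold. The class `CompletionStatus`, the threshold rule and the one-interrupt-per-batch choice are assumptions made for illustration; they model the behaviour described above, not Piccolo's registers.

```python
from __future__ import annotations
from collections import deque


class CompletionStatus:
    """Hypothetical completion-status block: RR across PCQs plus a delayed PCQ."""

    def __init__(self, num_pcqs: int, delay_threshold: int) -> None:
        self.pcqs = [deque() for _ in range(num_pcqs)]
        self.delayed: deque = deque()
        self.delay_threshold = delay_threshold
        self.rr_next = 0        # PCQ examined first in the next round
        self.host_cq: list = [] # stand-in for the host completion queue
        self.interrupts = 0

    def post_one(self) -> bool:
        """Move one entry from the next non-empty PCQ (round robin) to the host CQ."""
        for i in range(len(self.pcqs)):
            idx = (self.rr_next + i) % len(self.pcqs)
            if self.pcqs[idx]:
                self.host_cq.append(self.pcqs[idx].popleft())
                self.rr_next = (idx + 1) % len(self.pcqs)
                self.interrupts += 1
                return True
        return False

    def delay(self, entry) -> None:
        """Hold an entry; flush the whole delayed PCQ once the threshold is crossed."""
        self.delayed.append(entry)
        if len(self.delayed) > self.delay_threshold:
            self.host_cq.extend(self.delayed)
            self.delayed.clear()
            self.interrupts += 1  # one interrupt for the whole batch
```

Batching delayed entries trades a little latency for fewer interrupts, similar in spirit to NVMe interrupt aggregation.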


    The functionality of the emulated HIL is evaluated with various Admin and NVM commands, as specified by the NVMe Command Set Specification, using UTWiz (Samsung's integrated tool for Visual Studio). Each command is submitted to a specific SQ for processing, and its completion status is received through the CQ associated with that SQ. Several field definitions in SQ and CQ entries are common to the Admin and NVM command sets. An SQ entry is 64 bytes, whereas a CQ entry is at least 16 bytes [8]. DWord 0 of a CQ entry and DWords 10 – 15 of an SQ entry are command specific.
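The common 64-byte SQ entry and 16-byte CQ entry layouts can be packed and parsed with Python's standard `struct` module. The offsets follow the NVMe common command format (CDW0 holding the opcode and Command Identifier, then NSID, reserved DWords 2 – 3, Metadata Pointer, PRP Entries 1 and 2, and CDW10 – 15); SGL descriptors and fused operations are left out, and the helper names are hypothetical.

```python
import struct

SQE_FORMAT = "<IIQQQQ6I"  # CDW0, NSID, CDW2-3 (reserved), MPTR, PRP1, PRP2, CDW10-15 = 64 bytes
CQE_FORMAT = "<IIHHHH"    # DW0, DW1, SQ Head, SQ ID, Command ID, Phase + Status = 16 bytes


def build_sqe(opcode, cid, nsid=0, mptr=0, prp1=0, prp2=0, cdw10_15=(0,) * 6):
    """Pack a submission queue entry; FUSE = 0 and PSDT = 0 (PRPs are used)."""
    cdw0 = (opcode & 0xFF) | ((cid & 0xFFFF) << 16)
    return struct.pack(SQE_FORMAT, cdw0, nsid, 0, mptr, prp1, prp2, *cdw10_15)


def parse_cqe(raw):
    """Unpack a completion queue entry into its main fields."""
    dw0, _dw1, sq_head, sq_id, cid, status_field = struct.unpack(CQE_FORMAT, raw)
    return {
        "dw0": dw0,                   # command specific
        "sq_head": sq_head,
        "sq_id": sq_id,
        "cid": cid,
        "phase": status_field & 1,    # phase tag (bit 16 of DWord 3)
        "status": status_field >> 1,  # status field (bits 31:17 of DWord 3)
    }
```

Because the SQ Head and SQ Identifier fields come back in every CQ entry, the host can free SQ slots without a separate register read.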

    1. Admin Command Set

      These commands handle the administration of the CTRL. In the Admin command set, DWords 14 – 15 of the SQ entry are specific to I/O commands, and Admin commands are not impacted by the state of the I/O queues [8].

      • Delete I/O Queue: This command deletes an I/O SQ or CQ, identified in the DWord 10 field. After deletion, the host may release the PRP list describing the queue entries. Commands submitted before the Delete I/O Queue command are completed, explicitly or implicitly, and return an appropriate status to the host. Once an SQ has been deleted successfully, no status is posted to its associated CQ for commands submitted afterwards. An I/O CQ can be deleted only after all of its associated I/O SQs have been deleted; otherwise, the command fails with the status Invalid Queue Deletion. The CQ entry posted to the Admin CQ reports failures with status values such as Invalid Queue Identifier and Invalid Queue Deletion.

      • Abort: The Abort command requests that a previously submitted command be aborted, whether it is still waiting in the queue or being processed; the target command may already have completed by the time the Abort command is processed. The target command is identified in the DWord 10 field. If a large number of commands need to be aborted, the host should instead delete and recreate the I/O SQ. The Abort Command Limit specifies the maximum number of Abort commands that may be outstanding simultaneously. DWord 0 of the Abort command's CQ entry indicates whether the specified command was aborted. The completion of the Abort command is posted to the Admin CQ, and the completion of the aborted command is posted to its own Admin or I/O CQ.
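The deletion-ordering rule and the Abort command encoding can be captured in a small model. The class `QueueTable` and the helper `abort_cdw10` are hypothetical; the status names are those used in the NVMe specification, and the Abort CDW10 layout (SQ Identifier in bits 15:0, Command Identifier in bits 31:16) follows that specification.

```python
from __future__ import annotations


class QueueTable:
    """Hypothetical admin-side bookkeeping for I/O queue creation and deletion."""

    def __init__(self) -> None:
        self.cqs: set[int] = set()
        self.sq_to_cq: dict[int, int] = {}

    def create_cq(self, qid: int) -> None:
        self.cqs.add(qid)

    def create_sq(self, qid: int, cqid: int) -> None:
        self.sq_to_cq[qid] = cqid

    def delete_sq(self, qid: int) -> str:
        if self.sq_to_cq.pop(qid, None) is None:
            return "Invalid Queue Identifier"
        return "Successful Completion"

    def delete_cq(self, qid: int) -> str:
        if qid not in self.cqs:
            return "Invalid Queue Identifier"
        if qid in self.sq_to_cq.values():
            return "Invalid Queue Deletion"  # associated SQs must be deleted first
        self.cqs.remove(qid)
        return "Successful Completion"


def abort_cdw10(sqid: int, cid: int) -> int:
    """Abort command CDW10: SQ Identifier in bits 15:0, Command Identifier in bits 31:16."""
    return (sqid & 0xFFFF) | ((cid & 0xFFFF) << 16)
```

Deleting CQ 1 while SQ 1 still points at it fails with Invalid Queue Deletion; deleting SQ 1 first and then CQ 1 succeeds, matching the ordering rule described above.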

    2. NVM Command Set

      These commands handle the I/O operations of the CTRL. The host submits them only after the CTRL has indicated readiness through the Controller Status register (CSTS.RDY) and the I/O SQ and its associated CQ have been created. When the Limited Retry (LR) field is set, the CTRL applies limited retry effort during error recovery, which can be bounded by a time limit [8].

      • Write: The Write command writes data, and metadata if applicable, to the specified logical blocks of the NVM CTRL. DWords 10 – 15 hold the starting LBA, the number of logical blocks and related command-specific information. If data is transferred using PRPs, the Metadata Pointer and PRP Entries 1 and 2 are used; if it is transferred using SGLs, the SGL Metadata Pointer and SGL entry are used. The host can specify data protection through the Protection Information field. On success or failure, the CTRL posts a CQ entry to the respective I/O CQ, indicating errors to the host such as an attempted write to a read-only range, invalid protection information or conflicting attributes.

      • Read: The Read command reads data, and metadata if applicable, from the specified logical blocks of the NVM CTRL. DWords 10 – 15 hold the starting LBA, the number of logical blocks and related command-specific information. If data is transferred using PRPs, the Metadata Pointer and PRP Entries 1 and 2 are used; if it is transferred using SGLs, the SGL Metadata Pointer and SGL entry are used. The host can specify data protection through the Protection Information field. On success or failure, the CTRL posts a CQ entry to the respective I/O CQ, indicating errors to the host such as invalid protection information or conflicting attributes.
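For Read and Write, the starting LBA occupies CDW10 (low 32 bits) and CDW11 (high 32 bits), and CDW12 carries the Number of Logical Blocks as a 0's based value in bits 15:0, with Limited Retry in bit 31. The sketch below encodes those fields and counts how many memory pages, and therefore PRP entries, a host buffer spans; when more than two are needed, PRP Entry 2 points to a PRP list. It assumes a 512-byte logical block and a 4 KiB memory page and ignores metadata and protection information; the function names are hypothetical.

```python
def rw_cdw10_12(slba: int, nlb: int, limited_retry: bool = False) -> tuple:
    """Encode SLBA and NLB for a Read/Write command; `nlb` is the real block count."""
    if not 1 <= nlb <= 0x10000:
        raise ValueError("NLB must be between 1 and 65536 blocks")
    cdw10 = slba & 0xFFFFFFFF
    cdw11 = (slba >> 32) & 0xFFFFFFFF
    cdw12 = (nlb - 1) | (int(limited_retry) << 31)  # NLB is 0's based; LR is bit 31
    return cdw10, cdw11, cdw12


def transfer_bytes(nlb: int, lba_size: int = 512) -> int:
    """Data length implied by NLB for the assumed logical block size."""
    return nlb * lba_size


def pages_spanned(buffer_addr: int, length: int, page_size: int = 4096) -> int:
    """Memory pages (and hence PRP entries) covered by a host buffer."""
    first = buffer_addr // page_size
    last = (buffer_addr + length - 1) // page_size
    return last - first + 1
```

For example, an 8-block transfer is 4096 bytes; a page-aligned buffer needs only PRP Entry 1, while the same buffer starting 512 bytes into a page spans two pages and also uses PRP Entry 2.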

    The PCI and PCIe registers that establish the interconnection between the host and the device include the Power Management feature. After a controller reset completes (CSTS.RDY cleared), the host initializes the Admin queue attributes and base addresses to the specified values and then enables the CTRL by setting CC.EN; the CTRL indicates that it is ready by setting CSTS.RDY. Only one interrupt is used until the number of I/O queues has been determined, after which the host creates the I/O SQs and CQs, with their attributes and base addresses, as required.
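The initialization order can be sketched against a simulated register file. `FakeController` is a hypothetical stand-in for the NVMe register space (CC, CSTS, AQA, ASQ, ACQ) in which CSTS.RDY follows CC.EN immediately; real drivers and firmware use memory-mapped accesses, poll CSTS.RDY with a timeout, and also program other CC fields (such as queue entry sizes and memory page size) that are omitted here.

```python
class FakeController:
    """Hypothetical register model: CSTS.RDY (bit 0) mirrors CC.EN (bit 0) at once."""

    def __init__(self) -> None:
        self.regs = {"CC": 0, "CSTS": 0, "AQA": 0, "ASQ": 0, "ACQ": 0}

    def write(self, name: str, value: int) -> None:
        self.regs[name] = value
        if name == "CC":
            self.regs["CSTS"] = value & 0x1

    def read(self, name: str) -> int:
        return self.regs[name]


def init_controller(ctrl, asq_base: int, acq_base: int, admin_depth: int) -> None:
    # 1. Disable the controller and wait for CSTS.RDY = 0 (reset complete).
    ctrl.write("CC", ctrl.read("CC") & ~0x1)
    while ctrl.read("CSTS") & 0x1:
        pass
    # 2. Program the Admin queue sizes (0's based) and base addresses.
    size = admin_depth - 1
    ctrl.write("AQA", (size << 16) | size)  # ACQS in bits 27:16, ASQS in bits 11:0
    ctrl.write("ASQ", asq_base)
    ctrl.write("ACQ", acq_base)
    # 3. Enable the controller and wait for CSTS.RDY = 1.
    ctrl.write("CC", ctrl.read("CC") | 0x1)
    while not ctrl.read("CSTS") & 0x1:
        pass
    # 4. I/O CQs and SQs are then created with Admin commands (not modelled here).
```

Only after step 3 may the host submit Admin commands, including the ones that create the I/O queues.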

    The Admin command test cases completed successfully, as shown in Fig. 2. The Delete I/O Queue test first uses the Get Features command to check the number of SQs and CQs allocated for processing previously submitted commands, and then creates a new SQ and CQ for a given Queue Identifier (QID). When the Delete I/O Queue command is submitted for that QID, the queues are deleted successfully. To exercise the Abort command, a buffer with a specific LBA and number of logical blocks is created and a write is performed by submitting the Write NVM command. The write stores the data in the CM; when the Abort command is then sent, the data is prevented from being written to the NCore, indicating a successful abort operation.


    The developed test cases are evaluated on the emulated HIL containing the NVMe IPs. The test cases run on the HIL firmware with the support of dummy FCore, HCore and CMCore, which handle the data and metadata information needed to support the HIL functionality. The memory addresses of the Data Buffer, Metadata Buffer and Host Memory Buffer controller are modified as required, as given in Table 1.

    Memory Type                | Piccolo Q
                               | Base Address | Size (in Bytes) | Base Address | Size (in Bytes)
    Buffer DBS                 |              |                 |              |
    Buffer MBS                 |              |                 |              |
    Buffer HMBC Monitoring SFR |              |                 |              |
    Buffer HMBC                |              |                 |              |

    Fig. 2. Admin Command (Delete I/O Queue)

    The NVM command test cases completed successfully, as shown in Fig. 3. A buffer with a specific LBA and number of logical blocks is created, and a write is performed by submitting the Write NVM command to the I/O SQ. The data is written into the CM, and the status of the operation is reported to the respective CQ. The data is then flushed to the NCore by submitting the Flush NVM command to the I/O SQ. The data stored on the NCore is read back with the Read NVM command into a buffer created in the same way as for the write operation. The success or failure status is posted to the respective CQ, and the buffer created initially is released.
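The write, flush and read-back sequence of this test can be summarized with a toy device model in which writes land in a cache (standing in for the CM) and Flush commits them to a backing store (standing in for the NCore). This is an illustrative model only, not the UTWiz test code; `ToyDevice` and `write_flush_read` are hypothetical names, and unwritten blocks simply read back as zeros.

```python
from __future__ import annotations


class ToyDevice:
    """Hypothetical emulated device: a write cache (CM stand-in) in front of
    a backing store (NCore stand-in), both keyed by LBA."""

    def __init__(self, lba_size: int = 512) -> None:
        self.lba_size = lba_size
        self.cache: dict[int, bytes] = {}
        self.nand: dict[int, bytes] = {}

    def write(self, slba: int, data: bytes) -> str:
        if len(data) % self.lba_size:
            raise ValueError("data must be a whole number of logical blocks")
        for i in range(len(data) // self.lba_size):
            self.cache[slba + i] = data[i * self.lba_size:(i + 1) * self.lba_size]
        return "Successful Completion"

    def flush(self) -> str:
        self.nand.update(self.cache)  # commit cached blocks to the backing store
        self.cache.clear()
        return "Successful Completion"

    def read(self, slba: int, nlb: int) -> bytes:
        blank = bytes(self.lba_size)  # unwritten blocks read back as zeros here
        return b"".join(self.cache.get(lba, self.nand.get(lba, blank))
                        for lba in range(slba, slba + nlb))


def write_flush_read(dev: ToyDevice, slba: int, nlb: int) -> bool:
    """Write a pattern, flush it, read it back and compare with what was written."""
    pattern = bytes((slba + i) % 256 for i in range(nlb * dev.lba_size))
    if dev.write(slba, pattern) != "Successful Completion":
        return False
    if dev.flush() != "Successful Completion":
        return False
    return dev.read(slba, nlb) == pattern
```

Comparing the read-back buffer against the written pattern is what turns the three commands into a pass/fail check.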

    The evaluation of the test cases starts by establishing the interconnection between the host and the device through the PCI and PCIe registers of the PCIe interface.

    Fig. 3. NVM Command Set (Write Command)


The Host Interface Layer, the front end of the Piccolo SSD CTRL, is emulated in Microsoft Visual Studio, and the test cases are developed in UTWiz based on the NVMe Command Set Specification. The Data Buffer, Metadata Buffer and HMBC memory modifications are carried out for the Piccolo SSD. A total of 21 test cases successfully validate the emulation of the Piccolo SSD's HIL. The developed test cases further assist in evaluating the communication from the HIL to the FTL and the FIL. Further test cases may be developed from the NVMe Command Set to extend the evaluation and improve the emulated device.


[1] Sivashankar and S. Ramasamy, "Design and implementation of non-volatile memory express," in 2014 International Conference on Recent Trends in Information Technology, 2014, pp. 1-6, doi: 10.1109/ICRTIT.2014.6996190.

[2] J. Meena, S. Sze, U. Chand and T.-Y. Tseng, "Overview of emerging non-volatile memory technologies," Nanoscale Research Letters, vol. 9, pp. 1-33, 2014, doi: 10.1186/1556-276X-9-526.

[3] K. Kim, E. Lee and T. Kim, "HMB-SSD: Framework for efficient exploiting of the Host Memory Buffer in the NVMe SSD," IEEE Access, vol. 7, pp. 150403-150411, 2019, doi: 10.1109/ACCESS.2019.2947350.

[4] Cactus Technologies, "Solid State Drives 101." [Online]. Available: https://www.cactus-tech.com/resources/blog/details/solid-state-drives-101/

[5] Micron, "Design and Use Considerations for NAND Flash Memory." [Online]. Available: https://in.micron.com/

[6] PCI-SIG, "PCI Express Specification, Revision 2.1." [Online]. Available: https://pcisig.com/

[7] NVM Express, "NVM Express, Revision 1.3." [Online]. Available: http://nvmexpress.org/wp-content/uploads/NVM_Express_Revision_1.3.pdf

[8] NVM Express, "NVM Command Set Specification." [Online]. Available: https://nvmexpress.org/developers/nvme-command-set-specifications/