Rack-level scheduling for reducing the long tail latency using high performance SSDs

    Publication number: US20200225999A1

    Publication date: 2020-07-16

    Application number: US16828649

    Application date: 2020-03-24

    Abstract: A method for migrating a workload includes: receiving workloads generated from a plurality of applications running in a plurality of server nodes of a rack system; monitoring latency requirements for the workloads and detecting a violation of the latency requirement for a workload; collecting system utilization information of the rack system; calculating rewards for migrating the workload to other server nodes in the rack system; determining a target server node among the plurality of server nodes that maximizes the reward; and performing migration of the workload to the target server node.
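The target-selection step in this abstract can be illustrated with a minimal sketch. The abstract does not say how the reward is calculated, so the formula below (latency headroom minus load minus a fixed migration cost) and all names (`NodeStats`, `migration_reward`, `pick_target`) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class NodeStats:
    """Hypothetical per-node utilization snapshot collected from the rack."""
    name: str
    cpu_util: float             # fraction in [0, 1]
    ssd_util: float             # fraction in [0, 1]
    expected_latency_ms: float  # latency the workload would likely see there


def migration_reward(node, latency_target_ms, migration_cost=0.1):
    # Assumed reward: relative latency headroom, penalized by the busiest
    # resource on the node and by a flat cost for moving the workload.
    headroom = (latency_target_ms - node.expected_latency_ms) / latency_target_ms
    load = max(node.cpu_util, node.ssd_util)
    return headroom - load - migration_cost


def pick_target(nodes, current_name, latency_target_ms):
    """Return the node with the highest reward, or None if no move pays off."""
    candidates = [n for n in nodes if n.name != current_name]
    if not candidates:
        return None
    best = max(candidates, key=lambda n: migration_reward(n, latency_target_ms))
    if migration_reward(best, latency_target_ms) <= 0:
        return None  # staying put is at least as good as any migration
    return best
```

A scheduler built this way would call `pick_target` only after a latency violation has been detected, then migrate the workload to the returned node.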

    Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

    Publication number: US11100193B2

    Publication date: 2021-08-24

    Application number: US16388860

    Application date: 2019-04-18

    Abstract: A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
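The lookup-based multiplication described in this abstract can be sketched in software. This is only an analogy for the idea, not the circuit: the 4-bit operand width, the table layout, and the names `LUT`, `lut_dot`, and `lut_gemm` are assumptions. One operand selects the row, the other selects the column, and the inner loop uses only lookups and additions.

```python
BITS = 4               # assumed operand width; the abstract does not give one
SIZE = 1 << BITS

# Products are precomputed once. In the described hardware the table already
# resides in the memory bank; the multiply here only stands in for that step.
LUT = [[r * c for c in range(SIZE)] for r in range(SIZE)]


def lut_dot(a, b):
    """Dot product of two vectors of BITS-bit unsigned integers."""
    acc = 0
    for x, y in zip(a, b):
        acc += LUT[x][y]   # row address from a, column address from b
    return acc


def lut_gemm(A, B):
    """Matrix product C = A @ B using only table lookups and additions."""
    columns = list(zip(*B))
    return [[lut_dot(row, col) for col in columns] for row in A]
```

For example, `lut_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields `[[19, 22], [43, 50]]`.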

    Method and apparatus for enabling larger memory capacity than physical memory size

    Publication number: US10678704B2

    Publication date: 2020-06-09

    Application number: US15476757

    Application date: 2017-03-31

    Abstract: A method of retrieving data stored in a memory associated with a dedupe module is provided. The method includes: identifying a logical address of the data; identifying a physical line ID (PLID) of the data in accordance with the logical address by looking up at least a portion of the logical address in a translation table; locating a respective physical line, the respective physical line corresponding to the PLID; and retrieving the data from the respective physical line, the retrieving including copying a respective hash cylinder to the read cache, the respective hash cylinder including: a respective hash bucket, the respective hash bucket including the respective physical line; and a respective reference counter bucket, the respective reference counter bucket including a respective reference counter associated with the respective physical line.
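The read path in this abstract can be sketched as follows. The encoding of a physical line ID as a `(cylinder index, way)` pair and the names `HashCylinder` and `DedupeMemory` are assumptions for illustration; the abstract does not specify them.

```python
from dataclasses import dataclass


@dataclass
class HashCylinder:
    hash_bucket: list          # physical lines holding the deduplicated data
    ref_counter_bucket: list   # one reference counter per physical line


class DedupeMemory:
    def __init__(self, cylinders, translation_table):
        self.cylinders = cylinders
        # logical address -> assumed PLID encoding (cylinder index, way)
        self.translation_table = translation_table
        self.read_cache = {}

    def read(self, logical_addr):
        # 1. Translate the logical address to a physical line ID.
        cyl_idx, way = self.translation_table[logical_addr]
        # 2. Copy the whole hash cylinder (lines plus counters) to the cache.
        if cyl_idx not in self.read_cache:
            self.read_cache[cyl_idx] = self.cylinders[cyl_idx]
        # 3. Return the requested physical line from the cached cylinder.
        return self.read_cache[cyl_idx].hash_bucket[way]
```

Two logical addresses that hold identical data map to the same physical line, so the second read is served from the cylinder already in the read cache.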

    Dedupe DRAM system algorithm architecture

    Publication number: US09966152B2

    Publication date: 2018-05-08

    Application number: US15162512

    Application date: 2016-05-23

    CPC classification number: G11C29/808 G06F12/0802 G11C29/74

    Abstract: A deduplication memory module, which is configured to internally perform memory deduplication, includes a hash table memory for storing multiple blocks of data in a hash table array including hash tables, each of the hash tables including physical buckets and a plurality of virtual buckets each including some of the physical buckets, each of the physical buckets including ways, an address lookup table memory (ALUTM) including a plurality of pointers indicating a location of each of the stored blocks of data in a corresponding one of the physical buckets, and a buffer memory for storing unique blocks of data not stored in the hash table memory when the hash table array is full, a processor, and memory, wherein the memory has stored thereon instructions that, when executed by the processor, cause the memory module to exchange data with an external system.
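A write into such a structure might look like the sketch below. Several points are assumptions rather than details from the abstract: a virtual bucket is modeled as a home physical bucket plus its next neighbors, CRC32 stands in for the hash function, and a block spills to the buffer as soon as its virtual bucket is full (the abstract only says the buffer is used when the hash table array is full). The class and method names are hypothetical.

```python
import zlib


class DedupeTable:
    def __init__(self, n_buckets=4, n_ways=2, vb_span=2):
        self.buckets = [[None] * n_ways for _ in range(n_buckets)]
        self.vb_span = vb_span   # physical buckets per virtual bucket (assumed)
        self.overflow = []       # buffer memory for blocks that do not fit
        self.alut = {}           # address lookup table: address -> pointer

    def _virtual_bucket(self, home):
        n = len(self.buckets)
        return [(home + i) % n for i in range(self.vb_span)]

    def write(self, addr, block):
        home = zlib.crc32(block) % len(self.buckets)
        vb = self._virtual_bucket(home)
        # 1. Duplicate already stored: just point the address at it.
        for b in vb:
            for w, stored in enumerate(self.buckets[b]):
                if stored == block:
                    self.alut[addr] = ("table", b, w)
                    return self.alut[addr]
        # 2. Otherwise take the first free way in the virtual bucket.
        for b in vb:
            for w, stored in enumerate(self.buckets[b]):
                if stored is None:
                    self.buckets[b][w] = block
                    self.alut[addr] = ("table", b, w)
                    return self.alut[addr]
        # 3. No room: keep one copy of the block in the overflow buffer.
        if block not in self.overflow:
            self.overflow.append(block)
        self.alut[addr] = ("buffer", self.overflow.index(block), 0)
        return self.alut[addr]
```

Writing the same block to two addresses returns the same pointer both times, which is the deduplication effect the abstract describes.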

    Smart in-module refresh for DRAM

    Publication number: US09761296B2

    Publication date: 2017-09-12

    Application number: US15299445

    Application date: 2016-10-20

    Abstract: A memory (1205) is disclosed. The memory (1205) can include a stack of Dynamic Random Access Memory (DRAM) cores (1210, 1215, 1220, 1225) in a three-dimensional stacked memory architecture (1230). Each of the DRAM cores (1210, 1215, 1220, 1225) can include a plurality of banks (205-1, 205-2, 205-3, 205-4) to store data. The memory (1205) can also include a logic layer (1235), which can include an interface (1305) to connect the memory (1205) with a processor (120). The logic layer (1235) can also include a refresh engine (115) that can be used to refresh one of the plurality of banks (205-1, 205-2, 205-3, 205-4) and a Smart Refresh Component (305) that can advise the refresh engine (115) which bank to refresh using an out-of-order per-bank refresh. The Smart Refresh Component (305) can use a logic (415) to identify a farthest bank in the pending transactions in the transaction queue (430) at the time of refresh.
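One reading of the "farthest bank" selection is sketched below: refresh the bank whose next pending access lies deepest in the transaction queue, so the refresh is least likely to stall an imminent request. The queue representation (a list of bank indices in service order) and the function name `pick_refresh_bank` are assumptions.

```python
def pick_refresh_bank(banks, transaction_queue):
    """Return the bank whose next pending access is farthest in the queue.

    transaction_queue lists the target bank of each pending transaction in
    the order they will be served. A bank with no pending transaction counts
    as farther away than any queued one.
    """
    def next_use(bank):
        try:
            return transaction_queue.index(bank)
        except ValueError:
            return len(transaction_queue)

    # max() returns the first of several equally distant banks.
    return max(banks, key=next_use)
```

With banks `[0, 1, 2, 3]` and queue `[2, 0, 3, 1]`, bank 1 is chosen because its first access is last in the queue.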

    Dataflow accelerator architecture for general matrix-matrix multiplication and tensor computation in deep learning

    Publication number: US12164593B2

    Publication date: 2024-12-10

    Application number: US17374988

    Application date: 2021-07-13

    Abstract: A general matrix-matrix multiplication (GEMM) dataflow accelerator circuit is disclosed that includes a smart 3D stacking DRAM architecture. The accelerator circuit includes a memory bank, a peripheral lookup table stored in the memory bank, and a first vector buffer to store a first vector that is used as a row address into the lookup table. The circuit includes a second vector buffer to store a second vector that is used as a column address into the lookup table, and lookup table buffers to receive and store lookup table entries from the lookup table. The circuit further includes adders to sum the first product and a second product, and an output buffer to store the sum. The lookup table buffers determine a product of the first vector and the second vector without performing a multiply operation. The embodiments include a hierarchical lookup architecture to reduce latency. Accumulation results are propagated in a systolic manner.
