-
Publication No.: US12216579B2
Publication date: 2025-02-04
Application No.: US17134254
Application date: 2020-12-25
Applicant: Intel Corporation
Inventor: Carl J. Beckmann, Samantika S. Sury, Christopher J. Hughes, Lingxiang Xiang, Rahul Agrawal
IPC: G06F12/0811, G06F12/0817, G06F12/084, G06F12/0862
Abstract: Disclosed embodiments relate to atomic memory operations. In one example, an apparatus includes multiple processor cores, a cache hierarchy, a local execution unit, a remote execution unit, and an adaptive remote atomic operation unit. The cache hierarchy includes a local cache at a first level and a shared cache at a second level. The local execution unit is to perform an atomic operation at the first level if the local cache is storing a cache line including data for the atomic operation. The remote execution unit is to perform the atomic operation at the second level. The adaptive remote atomic operation unit is to determine whether to perform the atomic operation at the first level or at the second level and whether to copy the cache line from the shared cache to the local cache.
-
Publication No.: US12210446B2
Publication date: 2025-01-28
Application No.: US18284265
Application date: 2021-06-21
Applicant: Intel Corporation
Inventor: Zhe Wang, Lingxiang Xiang, Christopher J. Hughes
IPC: G06F12/00, G06F9/30, G06F12/02, G06F12/0811
Abstract: An embodiment of an integrated circuit may comprise circuitry communicatively coupled to two or more sub-non-uniform memory access clusters (SNCs) to allocate a specified memory space in the two or more SNCs in accordance with an SNC memory allocation policy indicated from a request to initialize the specified memory space. An embodiment of an apparatus may comprise decode circuitry to decode a single instruction, the single instruction to include a field for an opcode, and execution circuitry to execute the decoded instruction according to the opcode to provide an indicated SNC memory allocation policy (e.g., an SNC policy hint). Other embodiments are disclosed and claimed.
-
Publication No.: US20240354107A1
Publication date: 2024-10-24
Application No.: US18754447
Application date: 2024-06-26
Applicant: Intel Corporation
Inventor: Frank Hady, Christopher J. Hughes, Scott Peterson
CPC classification number: G06F9/30047, G06F9/321, G06F9/3836
Abstract: In one example, a processor includes: at least one core to execute instructions; and at least one cache memory coupled to the at least one core, the at least one cache memory to store data, at least some of the data being a copy of data stored in a memory. The at least one core is to determine whether to conditionally offload a sequence of instructions for execution on a compute circuit associated with the memory, based at least in part on whether one or more first data is present in the at least one cache memory, the one or more first data for use during execution of the sequence of instructions. Other embodiments are described and claimed.
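The offload decision described in this abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the claimed hardware: the function names, the 64-byte line size, and the policy of offloading only when none of the needed data is cached are all hypothetical choices made for illustration. The abstract says only that the decision depends "at least in part" on cache presence.

```python
# Hypothetical model: a cache is represented as a set of cache-line numbers.
LINE_SIZE = 64  # assumed cache-line size in bytes


def should_offload(data_addresses, cached_lines, line_size=LINE_SIZE):
    """Return True when none of the needed data is present in the cache.

    Assumed policy: if any needed line is already cached, running on the
    core is preferred; otherwise the sequence is offloaded to the compute
    circuit associated with the memory.
    """
    needed_lines = {addr // line_size for addr in data_addresses}
    return needed_lines.isdisjoint(cached_lines)


def choose_execution_site(data_addresses, cached_lines):
    """Name the place where the instruction sequence would run."""
    if should_offload(data_addresses, cached_lines):
        return "near-memory compute circuit"
    return "core"
```

For example, with lines 0 and 2 cached, a sequence touching addresses 64 and 200 (lines 1 and 3) would be offloaded, while one touching address 0 would stay on the core.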
-
Publication No.: US12106104B2
Publication date: 2024-10-01
Application No.: US17133328
Application date: 2020-12-23
Applicant: Intel Corporation
Inventor: Zhe Wang, Alaa R. Alameldeen, Christopher J. Hughes
IPC: G06F9/30, G06F12/0862, H03M7/30
CPC classification number: G06F9/30047, G06F9/30145, G06F12/0862, H03M7/30, G06F2212/602
Abstract: A processor that includes compression instructions to compress multiple adjacent data blocks of uncompressed read-only data stored in memory into one compressed read-only data block and store the compressed read-only data block in multiple adjacent blocks in the memory is provided. During execution of an application to operate on the read-only data, one of the multiple adjacent blocks storing the compressed read-only block is read from memory, stored in a prefetch buffer and decompressed in the memory controller. In response to a subsequent request during execution of the application for an adjacent data block in the compressed read-only data block, the uncompressed adjacent block is read directly from the prefetch buffer.
-
Publication No.: US11972230B2
Publication date: 2024-04-30
Application No.: US16914318
Application date: 2020-06-27
Applicant: Intel Corporation
Inventor: Menachem Adelman, Robert Valentine, Barukh Ziv, Amit Gradstein, Simon Rubanovich, Zeev Sperber, Mark J. Charney, Christopher J. Hughes, Alexander F. Heinecke, Evangelos Georganas, Binh Pham
CPC classification number: G06F7/78, G06F9/3001, G06F9/3016, G06F17/16
Abstract: Embodiments for a matrix transpose and multiply operation are disclosed. In an embodiment, a processor includes a decoder and execution circuitry. The decoder is to decode an instruction having a format including an opcode field to specify an opcode, a first destination operand field to specify a destination matrix location, a first source operand field to specify a first source matrix location, and a second source operand field to specify a second source matrix location. The execution circuitry is to, in response to the decoded instruction, transpose the first source matrix to generate a transposed first source matrix, perform a matrix multiplication using the transposed first source matrix and the second source matrix to generate a result, and store the result in a destination matrix location.
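The arithmetic performed by such an instruction can be sketched in software as destination = transpose(source1) x source2. The following is a minimal illustration using plain Python lists; the function names are hypothetical and the code models only the mathematical result, not the decoder, operand encoding, or execution circuitry.

```python
def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose_and_multiply(src1, src2):
    """Transpose the first source, then multiply it by the second source."""
    return matmul(transpose(src1), src2)
```

With a 3x2 first source and a 3x2 second source, the transposed first source is 2x3, so the result is a 2x2 matrix.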
-
Publication No.: US11892952B2
Publication date: 2024-02-06
Application No.: US17867673
Application date: 2022-07-18
Applicant: Intel Corporation
Inventor: Christopher J. Hughes
IPC: G06F12/0877, G06F9/30, G06F12/0862, G06F12/0811, G06F15/80, G06F12/0897
CPC classification number: G06F12/0877, G06F9/30, G06F9/30036, G06F12/0811, G06F12/0862, G06F12/0897, G06F15/8069, G06F2212/1016, G06F2212/1024, G06F2212/27, G06F2212/283, G06F2212/6028
Abstract: A processor of an aspect includes a plurality of packed data registers, and a decode unit to decode a no-locality hint vector memory access instruction. The no-locality hint vector memory access instruction is to indicate a packed data register of the plurality of packed data registers that is to have source packed memory indices. The source packed memory indices are to have a plurality of memory indices. The no-locality hint vector memory access instruction is to provide a no-locality hint to the processor for data elements that are to be accessed with the memory indices. The processor also includes an execution unit coupled with the decode unit and the plurality of packed data registers. The execution unit, in response to the no-locality hint vector memory access instruction, is to access the data elements at memory locations that are based on the memory indices.
-
Publication No.: US11847185B2
Publication date: 2023-12-19
Application No.: US17485055
Application date: 2021-09-24
Applicant: Intel Corporation
Inventor: Dan Baum, Chen Koren, Elmoustapha Ould-Ahmed-Vall, Michael Espig, Christopher J. Hughes, Raanan Sade, Robert Valentine, Mark J. Charney, Alexander F. Heinecke
CPC classification number: G06F17/16, G06F9/3001, G06F9/3016, G06F9/30101, G06F9/3802
Abstract: Disclosed embodiments relate to accelerating multiplication of sparse matrices. In one example, a processor is to fetch and decode an instruction having fields to specify locations of first, second, and third matrices, and an opcode indicating the processor is to multiply and accumulate matching non-zero (NZ) elements of the first and second matrices with corresponding elements of the third matrix, and to execute the decoded instruction as per the opcode to generate NZ bitmasks for the first and second matrices, broadcast up to two NZ elements at a time from each row of the first matrix and each column of the second matrix to a processing engine (PE) grid, each PE to multiply and accumulate matching NZ elements of the first and second matrices with corresponding elements of the third matrix. Each PE is further to store an NZ element for use in subsequent multiplications.
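The idea of matching non-zero elements through bitmasks can be illustrated in software. The sketch below is a sequential model under stated assumptions: the function names are hypothetical, and the PE grid, the two-elements-at-a-time broadcast, and the per-PE element storage from the abstract are not modeled. It shows only how ANDing a row bitmask with a column bitmask selects the products that contribute to the accumulation.

```python
def nz_bitmask(values):
    """Build an integer bitmask with bit k set when values[k] is non-zero."""
    mask = 0
    for position, value in enumerate(values):
        if value != 0:
            mask |= 1 << position
    return mask


def sparse_multiply_accumulate(a, b, c):
    """Accumulate a x b into c in place, skipping products with a zero factor."""
    inner = len(b)
    columns = [[b[k][j] for k in range(inner)] for j in range(len(b[0]))]
    row_masks = [nz_bitmask(row) for row in a]
    column_masks = [nz_bitmask(column) for column in columns]
    for i, row in enumerate(a):
        for j, column in enumerate(columns):
            # Bits set in both masks mark positions where both factors are NZ.
            matching = row_masks[i] & column_masks[j]
            k = 0
            while matching:
                if matching & 1:
                    c[i][j] += row[k] * column[k]
                matching >>= 1
                k += 1
    return c
```

Because a product with a zero factor contributes nothing, the result equals a dense multiply-accumulate while performing only the matching multiplications.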
-
Publication No.: US11675590B2
Publication date: 2023-06-13
Application No.: US17865849
Application date: 2022-07-15
Applicant: Intel Corporation
Inventor: Raanan Sade, Robert Valentine, Bret Toll, Christopher J. Hughes, Alexander F. Heinecke, Elmoustapha Ould-Ahmed-Vall, Mark J. Charney
IPC: G06F12/128, G06T1/00, G06F9/30
CPC classification number: G06F9/30167, G06F9/30101, G06F9/30149
Abstract: Disclosed embodiments relate to systems and methods for performing instructions to transform matrices into a row-interleaved format. In one example, a processor includes fetch and decode circuitry to fetch and decode an instruction having fields to specify an opcode and locations of source and destination matrices, wherein the opcode indicates that the processor is to transform the specified source matrix into the specified destination matrix having the row-interleaved format; and execution circuitry to respond to the decoded instruction by transforming the specified source matrix into the specified RowInt-formatted destination matrix by interleaving J elements of each J-element sub-column of the specified source matrix in either row-major or column-major order into a K-wide submatrix of the specified destination matrix, the K-wide submatrix having K columns and enough rows to hold the J elements.
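One plausible reading of the row-interleaved layout can be sketched in software. The example below assumes the simplest case, K equal to J, so that each J-element sub-column of the source fits in a single K-wide row segment of the destination. That choice, the function name, and the row-major placement are assumptions made for illustration; the abstract allows other values of K and column-major placement, which are not modeled here.

```python
def row_interleave(src, j):
    """Interleave groups of j source rows into single destination rows.

    Assumes K == J: for each group of j consecutive rows, the j elements of
    every column (one sub-column) are placed next to each other, so that
    dest[r][c * j + t] == src[r * j + t][c].
    """
    rows, cols = len(src), len(src[0])
    if rows % j != 0:
        raise ValueError("row count must be a multiple of j")
    dest = []
    for base in range(0, rows, j):
        out_row = []
        for c in range(cols):
            # Copy one j-element sub-column into j adjacent destination slots.
            for r in range(base, base + j):
                out_row.append(src[r][c])
        dest.append(out_row)
    return dest
```

For a 4x3 source with J = 2, the destination has 2 rows of 6 elements, with each pair of vertically adjacent source elements stored side by side.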
-
Publication No.: US11513957B2
Publication date: 2022-11-29
Application No.: US17027248
Application date: 2020-09-21
Applicant: Intel Corporation
Inventor: Ren Wang, Andrew J. Herdrich, Yen-cheng Liu, Herbert H. Hum, Jong Soo Park, Christopher J. Hughes, Namakkal N. Venkatesan, Adrian C. Moga, Aamer Jaleel, Zeshan A. Chishti, Mesut A. Ergin, Jr-shian Tsai, Alexander W. Min, Tsung-yuan C. Tai, Christian Maciocco, Rajesh Sankaran
IPC: G06F12/0842, G06F12/0893, G06F12/109, G06F12/0813, G06F12/0831, G06F9/455
Abstract: Methods and apparatus implementing hardware/software co-optimization to improve performance and energy efficiency for inter-VM communication for NFVs and other producer-consumer workloads. The apparatus include multi-core processors with multi-level cache hierarchies including an L1 and L2 cache for each core and a shared last-level cache (LLC). One or more machine-level instructions are provided for proactively demoting cachelines from lower cache levels to higher cache levels, including demoting cachelines from L1/L2 caches to an LLC. Techniques are also provided for implementing hardware/software co-optimization in multi-socket NUMA architecture systems, wherein cachelines may be selectively demoted and pushed to an LLC in a remote socket. In addition, techniques are disclosed for implementing early snooping in multi-socket systems to reduce latency when accessing cachelines on remote sockets.
-
Publication No.: US20220237123A1
Publication date: 2022-07-28
Application No.: US17712632
Application date: 2022-04-04
Applicant: Intel Corporation
Inventor: Jason W. Brandt, Robert S. Chappell, Jesus Corbal, Edward T. Grochowski, Stephen H. Gunther, Buford M. Guy, Thomas R. Huff, Christopher J. Hughes, Elmoustapha Ould-Ahmed-Vall, Ronak Singhal, Seyed Yahya Sotoudeh, Bret L. Toll, Lihu Rappoport, David B. Papworth, James D. Allen
IPC: G06F12/0831, G06F12/1027, G06F12/1009, G06F9/30
Abstract: Embodiments of a processor architecture are disclosed. In an embodiment, a processor includes a decoder, an execution unit, a coherent cache, and an interconnect. The decoder is to decode an instruction to zero a cache line. The execution unit is to issue a write command to initiate a cache line sized write of zeros. The coherent cache is to receive the write command, to determine whether there is a hit in the coherent cache and whether a cache coherency protocol state of the hit cache line is a modified state or an exclusive state, to configure a cache line to indicate all zeros, and to issue the write command toward the interconnect. The interconnect is to, responsive to receipt of the write command, issue a snoop to each of a plurality of other coherent caches for which it must be determined if there is a hit.