-
Publication Number: US20250094237A1
Publication Date: 2025-03-20
Application Number: US18470772
Application Date: 2023-09-20
Applicant: Microsoft Technology Licensing, LLC
Inventor: Wenbin MENG , Hemant KUMAR , Rakesh KELKAR , Karthik RAMAN , Sanjay RAMANUJAN , Kevin Joseph RIEHM , Theodore Dragov TODOROV
IPC: G06F9/50
Abstract: A system provides capacity-based load balancing across multiple model endpoints of a cloud-based artificial intelligence (AI) model. The system includes a consumption determination engine executable to determine a net resource consumption for processing tasks in a workload generated by a client application for input to the trained machine learning model. The system also includes a load balancer that determines a distribution of available resource capacity in a shared resource pool comprising compute resources at each of the multiple model endpoints. The load balancer allocates parallelizable tasks of the workload among the compute resources at the multiple model endpoints based on the net resource consumption of the tasks and on the distribution of available resource capacity in the shared resource pool.
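The allocation step described above can be sketched as a greedy assignment: each task's net consumption is matched against the endpoint with the most spare capacity. This is a minimal illustration, not the patented method; the task costs, endpoint names, and heaviest-first ordering are all assumptions.

```python
# Hypothetical sketch: capacity-based allocation of parallelizable tasks
# across model endpoints sharing one resource pool.

def allocate_tasks(tasks, endpoints):
    """Greedily assign each task to the endpoint with the most spare capacity.

    tasks:      list of (task_id, net_consumption) pairs
    endpoints:  dict mapping endpoint name -> available capacity
    Returns a dict mapping endpoint name -> list of assigned task ids.
    """
    available = dict(endpoints)
    assignment = {name: [] for name in endpoints}
    # Place the heaviest tasks first so large jobs land where capacity exists.
    for task_id, cost in sorted(tasks, key=lambda t: t[1], reverse=True):
        # Pick the endpoint with the largest remaining capacity.
        target = max(available, key=available.get)
        assignment[target].append(task_id)
        available[target] -= cost
    return assignment

plan = allocate_tasks(
    tasks=[("t1", 4), ("t2", 2), ("t3", 2)],
    endpoints={"east": 5, "west": 4},
)
```

Here the 4-unit task consumes most of the first endpoint's capacity, so the two smaller tasks spill to the second endpoint, keeping utilization balanced across the pool.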
-
Publication Number: US20250165304A1
Publication Date: 2025-05-22
Application Number: US18517762
Application Date: 2023-11-22
Applicant: Microsoft Technology Licensing, LLC
Inventor: Anthony Christopher KARLOFF , Grace Marie BRAMLEY-SIMMONS , Fahd Ahmad KAMAL , Hemant KUMAR , Wenbin MENG
IPC: G06F9/50
Abstract: Systems and methods are disclosed herein for providing fair allocation of resources in a multi-tenant environment. Systems and methods are configured for identifying a plurality of tenants participating in the multi-tenant environment. For each tenant of the plurality of tenants, systems determine a tenant status as a donating tenant, a fairly borrowing tenant, or an unfairly borrowing tenant and apply a different borrowing algorithm to each tenant of the plurality of tenants based on a corresponding tenant status determined for each tenant. Different borrowing algorithms are configured to determine different resource borrowing limits from a common pool of resources for each tenant.
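The per-status borrowing logic can be sketched as a classifier plus a limit function. The status names follow the abstract, but the thresholds (usage within 2x quota counts as "fair") and the exact limit formulas are illustrative assumptions, since the abstract does not specify them.

```python
# Hypothetical sketch: per-tenant borrowing limits based on tenant status.

def classify_tenant(usage, quota):
    """Label a tenant relative to its quota (thresholds are assumptions)."""
    if usage <= quota:
        return "donating"          # leaves unused quota in the common pool
    elif usage <= 2 * quota:
        return "fairly_borrowing"  # borrows within a tolerated margin
    return "unfairly_borrowing"    # borrows beyond the tolerated margin

def borrowing_limit(status, quota, pool_free):
    """Apply a different borrowing algorithm per status (formulas assumed)."""
    if status == "donating":
        return quota + pool_free       # may reclaim its own quota plus slack
    if status == "fairly_borrowing":
        return quota + pool_free // 2  # capped share of the common pool
    return quota                       # unfair borrowers get no extra headroom

status = classify_tenant(usage=30, quota=20)
limit = borrowing_limit(status, quota=20, pool_free=10)
```

A tenant using 30 units against a 20-unit quota is classified as fairly borrowing and receives a limit of its quota plus half the free pool, while an unfairly borrowing tenant would be held to its base quota.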
-
Publication Number: US20250094240A1
Publication Date: 2025-03-20
Application Number: US18470795
Application Date: 2023-09-20
Applicant: Microsoft Technology Licensing, LLC
Inventor: Wenbin MENG , Hemant KUMAR , Rakesh KELKAR , Karthik RAMAN , Sanjay RAMANUJAN , Kevin Joseph RIEHM , Theodore Dragov TODOROV
IPC: G06F9/50
Abstract: A disclosed method facilitates an increase in utilization with respect to a resource quota allocated to a tenant from a shared resource pool. The method includes transmitting a lease request to a quota service on behalf of the tenant, where the lease request identifies a processing task and specifies a quantity of cloud-based resources requested from the shared resource pool for execution of the processing task. The method further provides for determining, based on a feedback signal received from the quota service, whether grant of the lease request would cause the tenant to exceed a resource quota allocated to the tenant, and dynamically decreasing parallelism of active tasks being processed by the cloud-based resources on behalf of the tenant in response to determining that grant of the lease request would cause the tenant to exceed the resource quota.
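The lease-and-backoff loop described above can be sketched with a toy quota service whose feedback signal is a simple would-exceed check. The halving backoff and the class names are assumptions for illustration, not the claimed implementation.

```python
# Hypothetical sketch: a quota-aware lease request that decreases
# parallelism when the quota service signals the quota would be exceeded.

class QuotaService:
    """Toy stand-in for the quota service's feedback signal."""
    def __init__(self, quota):
        self.quota = quota
        self.in_use = 0

    def would_exceed(self, requested):
        return self.in_use + requested > self.quota

    def grant(self, requested):
        self.in_use += requested

def request_lease(service, requested, parallelism):
    """Return the (possibly reduced) parallelism after one lease attempt."""
    if service.would_exceed(requested):
        # Feedback says granting would exceed the quota: back off parallelism
        # so active tasks release resources before retrying.
        return max(1, parallelism // 2)
    service.grant(requested)
    return parallelism

svc = QuotaService(quota=10)
p = request_lease(svc, requested=6, parallelism=8)  # granted; parallelism kept
p = request_lease(svc, requested=6, parallelism=p)  # denied; parallelism halved
```

Backing off parallelism rather than rejecting work outright lets the tenant drive utilization close to its quota without overshooting it.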
-
Publication Number: US20250094233A1
Publication Date: 2025-03-20
Application Number: US18470827
Application Date: 2023-09-20
Applicant: Microsoft Technology Licensing, LLC
Inventor: Wenbin MENG , Hemant KUMAR , Rakesh KELKAR , Karthik RAMAN , Sanjay RAMANUJAN , Kevin Joseph RIEHM , Theodore Dragov TODOROV
IPC: G06F9/50
Abstract: A disclosed method reduces memory consumption of a trained sequential model. The method includes receiving, from a client application, an initial processing request identifying an input sequence to be processed by the trained sequential model and an initial value for an output size parameter specifying a requested size of output from the trained sequential model. The method further includes sequentially transmitting, to the trained sequential model, multiple partial processing requests based on the initial processing request that each specify a fraction of the initial value as the output size parameter and receiving a sequence of output responses from the trained sequential model generated in response to processing the multiple partial processing requests. The method further provides for returning, to the client application, a final merged response that includes the sequence of output responses.
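The partial-request flow can be sketched as splitting one request's output size across several sequential calls and concatenating the responses. The stand-in model, the even split, and the function names are assumptions; the real method's fraction selection is not specified in the abstract.

```python
# Hypothetical sketch: issuing partial requests with a fractional output
# size parameter and merging the responses into one final result.

def fake_model(prompt, max_output):
    """Stand-in for the sequential model: emits up to max_output tokens."""
    return [f"tok{i}" for i in range(max_output)]

def process_in_parts(prompt, output_size, parts):
    """Issue `parts` sequential partial requests, each asking for a fraction
    of the requested output size, then merge the responses in order."""
    per_part = output_size // parts  # fraction of the initial output size
    merged = []
    for _ in range(parts):
        # Each partial request only reserves memory for per_part tokens.
        merged.extend(fake_model(prompt, per_part))
    return merged

result = process_in_parts("summarize this", output_size=8, parts=4)
```

Because each partial request specifies only a fraction of the initial output size, the model never has to reserve memory for the full response at once, while the client still receives a single merged result.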