Elastic training of machine learning models via re-partitioning based on feedback from the training algorithm
Abstract:
Parallel training of a machine learning model on a computerized system may be provided, wherein computing tasks can be assigned to multiple workers of the system. A method may include accessing training data and starting a parallel training of the machine learning model based on the accessed training data, such that the training is distributed across a first number K of workers, K>1. Responsive to detecting a change in a temporal evolution of a quantity indicative of a convergence rate of the parallel training (e.g., where said change reflects a deterioration of the convergence rate), the parallel training is scaled in, such that it is subsequently distributed across a second number K′ of workers, where K>K′≥1. Related computerized systems and computer program products may be provided.
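The scale-in behavior described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method: the abstract does not specify which quantity is monitored, how a change is detected, or how K′ is chosen. The sketch assumes the monitored quantity is the relative per-epoch decrease in training loss, that a "deterioration" means the latest rate falling below a fixed fraction of the previous one, and that K′ is obtained by halving K. The names `improvement_rate`, `elastic_train`, `run_epoch`, and `threshold` are hypothetical.

```python
from typing import Callable, List, Tuple


def improvement_rate(losses: List[float]) -> float:
    """Relative loss decrease over the last step.

    Assumed proxy for the 'quantity indicative of a convergence rate';
    the abstract does not name a specific quantity.
    """
    prev, curr = losses[-2], losses[-1]
    return (prev - curr) / abs(prev) if prev != 0 else 0.0


def elastic_train(
    run_epoch: Callable[[int], float],
    k_initial: int,
    num_epochs: int,
    threshold: float = 0.5,
) -> Tuple[List[int], List[float]]:
    """Run training epochs and scale in when the convergence rate deteriorates.

    run_epoch(k) is a hypothetical callback that trains for one epoch on
    k workers and returns the resulting loss. Returns the number of workers
    used in each epoch and the loss observed after each epoch.
    """
    if k_initial < 2:
        raise ValueError("parallel training needs K > 1 workers to start")

    k = k_initial
    losses: List[float] = []
    rates: List[float] = []
    schedule: List[int] = []

    for _ in range(num_epochs):
        schedule.append(k)
        losses.append(run_epoch(k))
        if len(losses) >= 2:
            rates.append(improvement_rate(losses))

        # Assumed detection rule: the latest rate fell below a fraction
        # (threshold) of the previous rate.
        deteriorated = len(rates) >= 2 and rates[-1] < threshold * rates[-2]
        if deteriorated and k > 1:
            # Assumed scale-in policy: halve the workers, so K > K' >= 1.
            k = max(1, k // 2)
            # Restart monitoring after the data has been re-partitioned.
            rates.clear()

    return schedule, losses
```

In a real system, `run_epoch` would also re-partition the training data across the remaining workers whenever `k` changes; here that step is left to the callback.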