System and method for balancing computation with communication in parallel learning
Abstract:
A machine learning method includes installing a plurality of model replicas for training on a plurality of computer learning nodes; receiving training data at a each model replica and updating parameters for the model replica after trailing; sending the parameters to other model replicas with a communication batch size; evaluating received parameters from other model replicas; and dynamically adjusting the communication batch size to balance computation and communication overhead and ensuring convergence even with a mismatch in processing abilities on different computer learning nodes.
Information query
Patent Agency Ranking
0/0