MultiGPU-ComputeGradient-Pitfall
Issue
Environment: TensorFlow 1.2

The problem: when training with multi-GPU data parallelism, each GPU device needs to be fed its own slice of the batch. I handled that with the following code.
import tensorflow as tf

devices = [0, 1, 2]
X = tf.placeholder(dtype=tf.float32, shape=[None, 224, 224, 3])
Y = tf.placeholder(dtype=tf.int32, shape=[None])
# Split the batch along axis 0 into one equal slice per GPU
input_tensors = tf.split(X, len(devices), 0)
label_tensors = tf.split(Y, len(devices), 0)
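One caveat worth noting (my addition, not from the original post): with shape=[None, ...] the batch dimension is unknown at graph-construction time, so tf.split can only succeed at run time if the fed batch size is a multiple of len(devices). A minimal feed sketch, with illustrative numbers:

import numpy as np

# A batch of 96 splits evenly across 3 GPUs (32 examples each)
feed = {
    X: np.zeros((96, 224, 224, 3), dtype=np.float32),
    Y: np.zeros((96,), dtype=np.int32),
}
# sess.run(train_op, feed_dict=feed)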
Next comes the part of the code that triggered the error:
grad_list = []
# Note: MomentumOptimizer requires a momentum argument; 0.9 here is a typical choice
opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
for i in range(len(devices)):
    device = devices[i]
    with tf.device('/gpu:%d' % device):
        with tf.name_scope('%d' % device) as scope:
            logits = MyModel(input_tensors[i])
            ...
            loss = ...
            grad = opt.compute_gradients(loss)
            grad_list.append(grad)
            ...
            tf.get_variable_scope().reuse_variables()
...
avg_grad = Average_gradients(grad_list)
opt.apply_gradients(avg_grad)
Error:
File "train.py", line 79, in train
train_op = opt.apply_gradients(avggrads,global_step=global_step)
File "/home/hui89.liu/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/optimizer.py", line 446, in apply_gradients
ValueError: Variable blockr01/conv/W/Momentum/ does not exist, or was not created with tf.get_variable(). Did you mean to set reuse=None in VarScope?
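The root cause: tf.get_variable_scope().reuse_variables() switches the current (here: root) variable scope into reuse mode, and that flag sticks for the rest of graph construction. When apply_gradients later calls tf.get_variable to create the optimizer's Momentum slot variables, reuse=True means each variable must already exist, hence the ValueError. A minimal sketch of the mechanism (variable names here are illustrative):

import tensorflow as tf

with tf.Graph().as_default():
    tf.get_variable('W', shape=[1])            # created normally
    tf.get_variable_scope().reuse_variables()  # root scope: reuse is now True
    tf.get_variable('W')                       # fine: reuses the existing variable
    try:
        tf.get_variable('momentum_slot', shape=[1])  # brand-new name under reuse=True
    except ValueError as e:
        print(e)  # "Variable momentum_slot does not exist, or was not created with tf.get_variable()..."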
The same problem was answered in GitHub issue 6220 (see issue 6220 for the full discussion).

Since that issue was filed fairly early, most users there were on TensorFlow 0.12.0, whereas my version is 1.2.0. Below is the official fix, together with my final version:
with tf.variable_scope(tf.get_variable_scope(), reuse=False):
    grad_list = []
    opt = tf.train.MomentumOptimizer(learning_rate=0.001, momentum=0.9)
    for i in range(len(devices)):
        device = devices[i]
        with tf.device('/gpu:%d' % device):
            with tf.name_scope('%d' % device) as scope:
                logits = MyModel(input_tensors[i])
                ...
                loss = ...
                grad = opt.compute_gradients(loss)
                grad_list.append(grad)
                ...
                tf.get_variable_scope().reuse_variables()
...
avg_grad = Average_gradients(grad_list)
with tf.variable_scope(tf.get_variable_scope(), reuse=False):  # my own addition
    opt.apply_gradients(avg_grad)
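For completeness: the original post never shows Average_gradients, which is presumably the usual tower-gradient averaging helper in the style of the TensorFlow CIFAR-10 multi-GPU tutorial. Below is a hedged sketch of such a helper, not the author's actual code:

def Average_gradients(tower_grads):
    """Average a list of (gradient, variable) lists, one list per GPU tower."""
    average_grads = []
    # zip(*tower_grads) groups the (grad, var) pairs belonging to the same variable
    for grad_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(grads, axis=0), axis=0)
        # The towers share variables, so any tower's variable handle works
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads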
Conclusion
Given the differences between versions and the gaps in the official documentation, experimenting with TensorFlow means wearing out a lot of shoe leather. Calling tf.get_variable_scope().reuse_variables() lets variables be reused across devices. When hitting this problem and exploring it, the thing to be sure of is that reuse=False at the point where MomentumOptimizer is asked to create its variables. Perhaps in 0.12.0, inside the with tf.variable_scope(tf.get_variable_scope(), reuse=False): block, reuse stayed False unless explicitly changed; in 1.2.0, however, the setting carries its history with it, so you must wrap the call in another such with block to force reuse back to False.
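To illustrate the scoping behavior described above (my own demonstration of TF 1.x variable_scope semantics, not taken from the issue): re-entering the root scope through its VariableScope object pushes a copy of it, so a reuse_variables() call inside the with block no longer leaks out once the block exits.

import tensorflow as tf

with tf.Graph().as_default():
    with tf.variable_scope(tf.get_variable_scope()):
        tf.get_variable_scope().reuse_variables()
        print(tf.get_variable_scope().reuse)  # True inside the block
    print(tf.get_variable_scope().reuse)      # False again after the block exits

This is why the official wrapper keeps the sticky reuse flag away from apply_gradients.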