[Repost] A Poem-Writing Robot Implemented in TensorFlow

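The training data consists of about 30,000 Tang poems, which can be found in the dataset folder of the GitHub repository. Each line of the file holds one poem in the format "title:verses".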


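We first split the title from the content at the ":" separator, then clean the data by filtering out poor training samples: poems that contain special symbols, or that are too short or too long, are discarded. Finally, we add a start marker and an end marker around each poem so that the LSTM knows where a poem begins and ends. Square brackets are used for this.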

 

    poems = []
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:  # every line is a poem
            title, poem = line.strip().split(":", 1)  # split title and poem
            poem = poem.replace(' ', '')
            # drop poems that contain special symbols
            if '_' in poem or '《' in poem or '[' in poem or '(' in poem:
                continue
            # drop poems that are too short or too long
            if len(poem) < 10 or len(poem) > 128:
                continue
            poem = '[' + poem + ']'  # add start and end markers
            poems.append(poem)
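To make the filtering rules easier to try out in isolation, here is a minimal sketch that applies them to a few in-memory lines. The function name `clean_poem_line`, its parameters, and the sample lines are hypothetical and only serve as an illustration; they are not part of the original repository.

```python
def clean_poem_line(line, min_len=10, max_len=128):
    """Return '[poem]' for a usable line, or None if the line is filtered out."""
    parts = line.strip().split(":", 1)
    if len(parts) != 2:  # no "title:poem" separator on this line
        return None
    poem = parts[1].replace(" ", "")
    if any(ch in poem for ch in "_《[("):  # special symbols
        return None
    if not (min_len <= len(poem) <= max_len):  # too short or too long
        return None
    return "[" + poem + "]"


# Hypothetical sample lines: one good poem, one too short, one with a symbol.
sample_lines = [
    "静夜思:床前明月光,疑是地上霜。举头望明月,低头思故乡。",
    "短诗:太短了。",
    "坏样本:春眠不觉晓_处处闻啼鸟,夜来风雨声。",
]
cleaned = [p for p in map(clean_poem_line, sample_lines) if p is not None]
print(cleaned)
```

Only the first sample survives; the other two are rejected by the length check and the symbol check respectively.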

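Next, count how many times each character occurs and delete the rare characters. In the code below, that means every character that appears fewer than two times.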

 

    # count how often each character occurs
    allWords = {}
    for poem in poems:
        for word in poem:
            if word not in allWords:
                allWords[word] = 1
            else:
                allWords[word] += 1
    # erase characters that are not common
    erase = []
    for key in allWords:
        if allWords[key] < 2:
            erase.append(key)
    for key in erase:
        del allWords[key]
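The same counting and pruning step can be written more compactly with `collections.Counter`. This is only a sketch on toy data; the two sample strings and the name `min_count` are assumptions for illustration.

```python
from collections import Counter

# Toy data standing in for the cleaned poems (hypothetical).
poems = ["[春风春雨]", "[春花秋月]"]

# Count every character across all poems.
counts = Counter(ch for poem in poems for ch in poem)

# Keep only characters that occur at least min_count times.
min_count = 2
common = {ch: n for ch, n in counts.items() if n >= min_count}
print(common)
```

Characters that occur only once ("风", "雨", "花", "秋", "月") are dropped, while the brackets and "春" remain.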

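Sort the characters by frequency and build a character-to-ID mapping. Why sort? After sorting, the ID reflects to some degree how often the character occurs, and this relationship makes it easier for the model to learn patterns than an arbitrary mapping would.

A space character is also added to the vocabulary. Poems differ in length and have to be padded with spaces, so an ID is reserved for the space. Finally, each poem is converted into a vector of character IDs.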

 

    # sort characters by descending frequency
    wordPairs = sorted(allWords.items(), key=lambda x: -x[1])
    words, _ = zip(*wordPairs)
    words += (" ",)  # reserve an ID for the padding space
    wordToID = dict(zip(words, range(len(words))))  # character to ID
    # characters missing from the vocabulary map to the extra ID len(words)
    wordTOIDFun = lambda A: wordToID.get(A, len(words))
    poemsVector = [[wordTOIDFun(word) for word in poem] for poem in poems]  # poem to vector
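A small worked example may help to see which IDs come out of this mapping. The frequency table below is made up, and the names `vocab`, `char_to_id`, `unknown_id`, and `encode` are hypothetical stand-ins for the variables in the listing above.

```python
# Toy frequency table (hypothetical values).
counts = {"春": 3, "[": 2, "]": 2}

# Sort by descending frequency; ties keep their original order.
ordered = sorted(counts.items(), key=lambda kv: -kv[1])
vocab = [ch for ch, _ in ordered] + [" "]  # the padding space gets the last regular ID
char_to_id = {ch: i for i, ch in enumerate(vocab)}
unknown_id = len(vocab)  # every out-of-vocabulary character shares this ID


def encode(poem):
    """Convert a poem string into a list of character IDs."""
    return [char_to_id.get(ch, unknown_id) for ch in poem]


encoded = encode("[春风]")
print(encoded)  # "风" is not in the vocabulary, so it maps to unknown_id
```

Because the unknown ID equals `len(vocab)`, any embedding table that consumes these IDs needs at least `len(vocab) + 1` rows.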

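Next, build the training batches. Within each batch, every poem is padded with spaces until it is as long as the longest poem in that batch. Since all the padding is spaces, the model can learn a simple rule: a space is always followed by another space. X and Y denote the input and the output. The output is the input shifted by one position, so that for each character the model sees, the expected output is the next character.

Be sure to use np.copy here; this one cost me a lot of time! Without a copy, the second array would share memory with the first, and shifting it to build Y would overwrite X as well.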

 

    import numpy as np

    # pad every poem to the maximum length within its batch
    batchNum = (len(poemsVector) - 1) // batchSize
    X = []
    Y = []
    # create batches
    for i in range(batchNum):
        batch = poemsVector[i * batchSize: (i + 1) * batchSize]
        maxLength = max([len(vector) for vector in batch])
        temp = np.full((batchSize, maxLength), wordTOIDFun(" "), np.int32)
        for j in range(batchSize):
            temp[j, :len(batch[j])] = batch[j]
        X.append(temp)
        temp2 = np.copy(temp)  # must be a real copy, not an alias
        temp2[:, :-1] = temp[:, 1:]  # shift left by one: target is the next character
        Y.append(temp2)
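The np.copy pitfall is easy to reproduce on a tiny array. The sketch below uses made-up IDs and a hypothetical `pad_id` of 0; it contrasts the correct shifted copy with what happens when the "copy" is just a second name for the same array.

```python
import numpy as np

pad_id = 0  # hypothetical padding ID
x = np.array([[5, 6, 7, pad_id],
              [8, 9, pad_id, pad_id]], dtype=np.int32)

# Correct: y is an independent copy, shifted left by one position.
y = np.copy(x)
y[:, :-1] = x[:, 1:]

# Pitfall: alias is the same object as alias_src, so the shift destroys the input.
alias_src = x.copy()
alias = alias_src
alias[:, :-1] = alias_src[:, 1:]

print(x)          # unchanged input
print(y)          # targets: next character at every position
print(alias_src)  # overwritten, now identical to y
```

With the alias, input and target end up identical, so the model would be trained to predict the character it has just been given.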

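Building the model

Build an LSTM model followed by a softmax layer, whose output is the probability of each character. Here you can simply copy a standard LSTM template and adjust the parameters.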

 

    def buildModel(wordNum, gtX):
        """Build the LSTM model; hidden_units and layers are hyperparameters defined elsewhere."""
        with tf.variable_scope("embedding"):  # embedding
            embedding = tf.get_variable("embedding", [wordNum, hidden_units], dtype=tf.float32)
            inputbatch = tf.nn.embedding_lookup(embedding, gtX)

        # create one cell object per layer; [basicCell] * layers would reuse a single object
        cells = [tf.contrib.rnn.BasicLSTMCell(hidden_units, state_is_tuple=True) for _ in range(layers)]
        stackCell = tf.contrib.rnn.MultiRNNCell(cells)
        initState = stackCell.zero_state(np.shape(gtX)[0], tf.float32)
        outputs, finalState = tf.nn.dynamic_rnn(stackCell, inputbatch, initial_state=initState)
        outputs = tf.reshape(outputs, [-1, hidden_units])

        with tf.variable_scope("softmax"):
            w = tf.get_variable("w", [hidden_units, wordNum])
            b = tf.get_variable("b", [wordNum])
            logits = tf.matmul(outputs, w) + b

        probs = tf.nn.softmax(logits)
        return logits, probs, stackCell, initState, finalState
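To illustrate what the final softmax layer does to the logits, here is a small NumPy sketch. It is not the TensorFlow implementation, only the standard numerically stable formulation, and the example logits are made up.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax; subtracting the row maximum avoids overflow in exp."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Two hypothetical time steps over a three-character vocabulary.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.0, 0.0, 0.0]])
probs = softmax(logits)
print(probs)
```

Each row sums to 1, the largest logit receives the largest probability, and equal logits yield a uniform distribution.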

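Training the model

First define the input and output placeholders and build the model, then set up the loss function, the learning rate, and the other training parameters.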

 

    gtX = tf.placeholder(tf.int32, shape=[batchSize, None])  # input
    gtY = tf.placeholder(tf.int32, shape=[batchSize, None])  # output
    logits, probs, _, _, _ = buildModel(wordNum, gtX)
    targets = tf.reshape(gtY, [-1])
    # loss
    loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
        [logits], [targets],
        [tf.ones_like(targets, dtype=tf.float32)], wordNum)
    cost = tf.reduce_mean(loss)
    tvars = tf.trainable_variables()
    # clip gradients so that their global norm does not exceed 5
    grads, _ = tf.clip_by_global_norm(tf.gradients(cost, tvars), 5)
    learningRate = learningRateBase
    optimizer = tf.train.AdamOptimizer(learningRate)
    trainOP = optimizer.apply_gradients(zip(grads, tvars))
    globalStep = 0
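Gradient clipping by global norm rescales all gradients together whenever their combined norm exceeds a threshold. The NumPy sketch below reproduces that rule on made-up gradients; it is a stand-in for illustration, not TensorFlow's own code.

```python
import numpy as np


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm


# Hypothetical gradients for two variables; their global norm is 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
clipped_norm = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
print(norm, clipped_norm)
```

All gradients shrink by the same factor, so the update direction is preserved and only its length is limited. Gradients whose global norm is already below the threshold are left unchanged.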

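Then start training. The script first looks for a checkpoint: if one is found it is restored, otherwise training starts from scratch. The data is then read in batch by batch, the learning rate is meant to decay gradually, and the model is saved every 1000 steps.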

 

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        if reload:
            checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
            # if a checkpoint exists, restore it
            if checkPoint and checkPoint.model_checkpoint_path:
                saver.restore(sess, checkPoint.model_checkpoint_path)
                print("restored %s" % checkPoint.model_checkpoint_path)
            else:
                print("no checkpoint found!")

        for epoch in range(epochNum):
            if globalStep % learningRateDecreaseStep == 0:  # learning rate decays by epoch
                # NOTE: the optimizer was built with a Python float, so this
                # reassignment alone does not change the rate Adam actually uses
                learningRate = learningRateBase * (0.95 ** epoch)
            epochSteps = len(X)  # number of batches per epoch
            for step, (x, y) in enumerate(zip(X, Y)):
                globalStep = epoch * epochSteps + step
                _, loss = sess.run([trainOP, cost], feed_dict={gtX: x, gtY: y})
                print("epoch: %d steps:%d/%d loss:%.3f" % (epoch, step, epochSteps, loss))
                if globalStep % 1000 == 0:
                    print("save model")
                    saver.save(sess, checkpointsPath + "/poem", global_step=epoch)
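The decay rule used in the loop, the base rate multiplied by 0.95 raised to the epoch number, can be tabulated on its own. The base value of 0.001 below is an assumption chosen for illustration; the original text does not state the actual value.

```python
learning_rate_base = 0.001  # hypothetical base rate


def decayed_learning_rate(epoch, base=learning_rate_base, decay=0.95):
    """Exponential decay: the rate shrinks by a factor of `decay` per epoch."""
    return base * decay ** epoch


# Learning rate at epochs 0, 5, 10, 15 and 20.
schedule = [decayed_learning_rate(e) for e in range(0, 21, 5)]
for epoch, rate in zip(range(0, 21, 5), schedule):
    print(epoch, rate)
```

Under this rule the rate falls to roughly a third of its starting value after 20 epochs.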

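Generating poems automatically

Before generating poems, we need a helper function that maps the output probabilities to a character. To avoid producing the same poem every time, some randomness is required. Instead of always choosing the most probable character, we map the probabilities onto an interval and sample uniformly from it. A character with a higher probability owns a larger sub-interval and is therefore more likely to be sampled, but Pang Hu (the robot) still has a small chance of picking a different character. Because every character is drawn this way, the poems differ from run to run.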

 

    def probsToWord(weights, words):
        """Sample a character in proportion to its output probability."""
        t = np.cumsum(weights)  # prefix sum
        s = np.sum(weights)
        coff = np.random.rand()  # scalar drawn uniformly from [0, 1)
        index = int(np.searchsorted(t, coff * s))  # a wider interval is more likely to be hit
        return words[index]
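One way to check that this interval-based sampling behaves as described is to draw many samples and compare the empirical frequencies with the weights. The sketch below uses a seeded NumPy generator and made-up weights; `sample_index` is a hypothetical variant of the function above that returns the index instead of the character.

```python
import numpy as np


def sample_index(weights, rng):
    """Pick an index with probability proportional to its weight."""
    cumulative = np.cumsum(weights)               # right edges of the sub-intervals
    threshold = rng.random() * cumulative[-1]     # uniform point on the whole interval
    return int(np.searchsorted(cumulative, threshold))


rng = np.random.default_rng(0)
weights = np.array([0.7, 0.2, 0.1])  # hypothetical output probabilities
draws = [sample_index(weights, rng) for _ in range(10000)]
freq = np.bincount(draws, minlength=len(weights)) / len(draws)
print(freq)
```

The observed frequencies should come out close to the weights: the most probable index dominates, yet the others still appear from time to time.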

Now we can start writing poems. As before, first build the model, define the relevant parameters, and load the checkpoint.

 

    gtX = tf.placeholder(tf.int32, shape=[1, None])  # input
    logits, probs, stackCell, initState, finalState = buildModel(wordNum, gtX)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        saver = tf.train.Saver()
        checkPoint = tf.train.get_checkpoint_state(checkpointsPath)
        # if a checkpoint exists, restore it
        if checkPoint and checkPoint.model_checkpoint_path:
            saver.restore(sess, checkPoint.model_checkpoint_path)
            print("restored %s" % checkPoint.model_checkpoint_path)
        else:
            print("no checkpoint found!")
            exit(0)

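Generate generateNum poems. Each poem starts with a left square bracket and ends when a right square bracket or a space is produced. At every step, the output probabilities are converted into a character with probsToWord.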

 

    poems = []
    for i in range(generateNum):
        state = sess.run(stackCell.zero_state(1, tf.float32))
        x = np.array([[wordToID['[']]])  # start with the start marker
        probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
        word = probsToWord(probs1, words)
        poem = ''
        while word != ']' and word != ' ':
            poem += word
            if word == '。':
                poem += '\n'
            # feed the sampled character back in, carrying the LSTM state forward
            x = np.array([[wordToID[word]]])
            probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
            word = probsToWord(probs2, words)
        print(poem)
        poems.append(poem)
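The control flow of this generation loop can be tried without TensorFlow by replacing the LSTM with a fixed table of next-character probabilities. Everything in the sketch below is hypothetical: the five-character vocabulary, the `transition` table, and the `max_len` guard, which the original loop does not have.

```python
import numpy as np

vocab = ["[", "]", "春", "风", "。"]
char_to_id = {ch: i for i, ch in enumerate(vocab)}

# Hypothetical next-character probabilities, one row per current character.
transition = np.array([
    [0.0, 0.0, 1.0, 0.0, 0.0],  # "[" is followed by "春"
    [0.0, 1.0, 0.0, 0.0, 0.0],  # "]" is followed by "]"
    [0.0, 0.0, 0.0, 1.0, 0.0],  # "春" is followed by "风"
    [0.0, 0.0, 0.0, 0.0, 1.0],  # "风" is followed by "。"
    [0.0, 1.0, 0.0, 0.0, 0.0],  # "。" is followed by "]"
])


def generate(rng, max_len=50):
    """Sample characters from the start marker until the end marker appears."""
    poem = ""
    current = "["
    for _ in range(max_len):  # guard against a model that never emits "]"
        probs = transition[char_to_id[current]]
        current = vocab[rng.choice(len(vocab), p=probs)]
        if current == "]":
            break
        poem += current
    return poem


poem = generate(np.random.default_rng(0))
print(poem)
```

With this deterministic table the output is always the same short line. A trained model supplies a real distribution at each step, which is where the variety comes from. A length cap like `max_len` can be a useful safeguard when an undertrained model rarely produces the end marker.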

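You can also write acrostic poems. Building the model and loading the checkpoint work exactly as before. In the generation step, whenever a punctuation mark is reached, the next input character is set by hand to the required character. Note that after a punctuation mark the model's own output is not used, so the state has to be rolled forward by feeding the punctuation mark through the model, which skips the generation of that character.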

 

    flag = 1
    endSign = {-1: ",", 1: "。"}  # alternate between comma and full stop
    poem = ''
    state = sess.run(stackCell.zero_state(1, tf.float32))
    x = np.array([[wordToID['[']]])
    probs1, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
    for c in characters:
        word = c  # force the line to start with the given character
        flag = -flag
        while word != ']' and word != ',' and word != '。' and word != ' ':
            poem += word
            x = np.array([[wordToID[word]]])
            probs2, state = sess.run([probs, finalState], feed_dict={gtX: x, initState: state})
            word = probsToWord(probs2, words)

        poem += endSign[flag]
        # to keep the context, the state must be updated with the punctuation mark
        if endSign[flag] == '。':
            probs2, state = sess.run([probs, finalState],
                                     feed_dict={gtX: np.array([[wordToID["。"]]]), initState: state})
            poem += '\n'
        else:
            probs2, state = sess.run([probs, finalState],
                                     feed_dict={gtX: np.array([[wordToID[","]]]), initState: state})

    print(characters)
    print(poem)

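After about 20 epochs of training on a GPU, the results are already quite good!

Code: https://github.com/hjptriplebee/Chinese_poem_generator (forks and stars are welcome)

A picture-to-poem robot, MC Pang Hu 2.0, will probably follow.

After all this talk, Pang Hu is probably getting angry!

Reposted from: https://www.cnblogs.com/wukefenggao/p/9353218.html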
