A Systems Engineer Who Wants to Kick a Ball

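A systems engineer who loves kicking a ball around, and who studies whenever there is spare time in order to free up more time for kicking a ball.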

Trying gensim's doc2vec from python3 on CentOS

Environment

VMware Player(CentOS6)
python3.5

Steps

I'll follow this article by satomacoto:
satomacoto: Adding methods to doc2vec for finding similar labels and words


I'll use gensim's doc2vec.

Install the libraries

[root@localhost ~]# pip3.5 install scipy
[root@localhost ~]# pip3.5 install gensim

Edit the installed gensim sources

[root@localhost doc2vec]# vi /usr/local/python/lib/python3.5/site-packages/gensim/models/doc2vec.py
[root@localhost doc2vec]# vi /usr/local/python/lib/python3.5/site-packages/gensim/models/word2vec.py

Overwrite both files with copies of satomacoto's versions:
https://github.com/satomacoto/gensim/blob/develop/gensim/models/doc2vec.py
https://github.com/satomacoto/gensim/blob/develop/gensim/models/word2vec.py

I got them from this archive:
https://github.com/satomacoto/gensim/archive/doc2vec-mostSimilarWordsAndLabels.zip

Run it with code like this

import gensim

sentences = [
    ['human', 'interface', 'computer'], #0
    ['survey', 'user', 'computer', 'system', 'response', 'time'], #1
    ['eps', 'user', 'interface', 'system'], #2
    ['system', 'human', 'system', 'eps'], #3
    ['user', 'response', 'time'], #4
    ['trees'], #5
    ['graph', 'trees'], #6
    ['graph', 'minors', 'trees'], #7
    ['graph', 'minors', 'survey'] #8
]


labeledSentences = gensim.models.doc2vec.LabeledListSentence(sentences)
model = gensim.models.doc2vec.Doc2Vec(labeledSentences, min_count=0)

# Get the list of labels
print("-------------------------------------")
print (model.labels)

# Show the documents most similar to a given document
print("-------------------------------------")
print( model.most_similar_labels('SENT_0') )
#print( model.most_similar_labels('SENT_doc1'))

# Show the words most similar to a given word
print("-------------------------------------")
print( model.most_similar_words('human'))
#print( model.most_similar_words('SENT_doc1'))

# Add and subtract several documents, then show the most similar documents
print("-------------------------------------")
print( model.most_similar_labels(positive=['SENT_1', 'SENT_2'], negative=['SENT_3'], topn=5))

# Add and subtract several documents, then show the most similar words
print("-------------------------------------")
print( model.most_similar_words(positive=['SENT_1', 'SENT_2'], negative=['SENT_3'], topn=5))
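
The labels queried above ('SENT_0', 'SENT_1', ...) appear to follow each sentence's position in the list, which is why the sentences carry the comments #0 to #8. The following is a minimal sketch of that labeling scheme in plain Python. `LabeledDoc` and `label_sentences` are hypothetical names made up for illustration; they are not part of gensim or of satomacoto's fork.

```python
from collections import namedtuple

# Hypothetical stand-in for a labeled sentence (not the real gensim class).
LabeledDoc = namedtuple("LabeledDoc", ["words", "labels"])


def label_sentences(token_lists, prefix="SENT_"):
    """Attach one positional label such as 'SENT_0' to each token list."""
    return [
        LabeledDoc(words=list(words), labels=["%s%d" % (prefix, index)])
        for index, words in enumerate(token_lists)
    ]


docs = label_sentences([
    ["human", "interface", "computer"],
    ["trees"],
    ["graph", "trees"],
])
for doc in docs:
    # Each document is later addressed by its label, e.g. 'SENT_0'.
    print(doc.labels[0], doc.words)
```

Because the label depends only on position, dropping or merging a list entry shifts every later label.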

It ran

[root@localhost doc2vec]# python3.5 doc2vec_test.py 
/usr/local/python/lib/python3.5/site-packages/gensim/models/word2vec.py:406: UserWarning: C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.
  warnings.warn("C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.")
-------------------------------------
{'SENT_6', 'SENT_2', 'SENT_3', 'SENT_0', 'SENT_8', 'SENT_4', 'SENT_7', 'SENT_5', 'SENT_1'}
-------------------------------------
[('SENT_4', 0.08399386703968048), ('SENT_8', 0.037913162261247635), ('SENT_3', 0.005189023911952972), ('SENT_6', 0.0002727136015892029), ('SENT_1', -0.011086158454418182), ('SENT_7', -0.03499444201588631), ('SENT_5', -0.09205912053585052), ('SENT_2', -0.14769448339939117)]
-------------------------------------
[('minors', 0.15428760647773743), ('time', 0.059314027428627014), ('user', 0.05201921612024307), ('eps', 0.047635387629270554), ('survey', 0.04624335467815399), ('computer', 0.04234649986028671), ('trees', 0.038701750338077545), ('interface', 0.03709424287080765), ('graph', 0.027377430349588394), ('system', 0.019210100173950195)]
-------------------------------------
[('SENT_7', 0.08965814858675003), ('SENT_8', 0.06112371012568474), ('SENT_4', 0.028300214558839798), ('SENT_5', -0.037105970084667206), ('SENT_6', -0.047132231295108795)]
-------------------------------------
[('system', 0.07341058552265167), ('time', 0.059994593262672424), ('graph', 0.042242251336574554), ('survey', 0.028702296316623688), ('human', -0.020748157054185867)]
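
To make the scores above easier to read: upstream gensim's most_similar ranks candidates by cosine similarity against a query built from the positive vectors minus the negative ones, and the fork's methods presumably do the same. Below is a rough sketch of that idea on toy 2-D vectors. The `cosine`, `combine`, and `most_similar` helpers and the vectors are invented for illustration, and the sketch skips the per-vector normalization gensim applies, so it is a simplification rather than the library's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combine(positive, negative):
    """Sum the positive vectors and subtract the negative ones."""
    total = [0.0] * len(positive[0])
    for vec in positive:
        total = [t + x for t, x in zip(total, vec)]
    for vec in negative:
        total = [t - x for t, x in zip(total, vec)]
    return total


def most_similar(vectors, positive, negative=(), topn=5):
    """Rank every key that is not part of the query by cosine similarity."""
    query_keys = set(positive) | set(negative)
    query = combine([vectors[k] for k in positive],
                    [vectors[k] for k in negative])
    scores = [(key, cosine(query, vec))
              for key, vec in vectors.items() if key not in query_keys]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:topn]


# Toy vectors, made up for this example.
toy = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0], "D": [-1.0, 0.0]}
print(most_similar(toy, positive=["A", "B"]))
print(most_similar(toy, positive=["A"], negative=["D"]))
```

As in the output above, the queried labels themselves are left out of the ranking. With only nine tiny sentences the learned vectors are close to random, which would explain why every score here sits near zero.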

A warning came up, though

/usr/local/python/lib/python3.5/site-packages/gensim/models/word2vec.py:406: UserWarning: C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.
  warnings.warn("C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.")
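
The warning concerns gensim's own compiled training routines. Upstream gensim records whether they loaded in a module-level `FAST_VERSION` value in `gensim.models.word2vec`, where -1 means the slow pure-Python fallback is in use. The following is a small sketch for reading that value. It assumes the overwritten fork file still defines `FAST_VERSION`, which is not confirmed here, and `fast_version` is a hypothetical helper name.

```python
import importlib


def fast_version(module_name="gensim.models.word2vec"):
    """Return the module's FAST_VERSION, or None if it cannot be determined."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None  # the module is not importable in this interpreter
    # Assumption: the module exposes FAST_VERSION as upstream gensim does.
    return getattr(module, "FAST_VERSION", None)


status = fast_version()
if status is None:
    print("could not determine FAST_VERSION")
elif status < 0:
    print("pure-Python training path in use (slow)")
else:
    print("compiled extension loaded, FAST_VERSION =", status)
```

Checking this value before and after each reinstall shows whether a change actually affected the extension, without waiting for a training run.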

I'll try following this Q&A:
https://teratail.com/questions/14400

[root@localhost doc2vec]# pip3.5 freeze | grep scipy
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
scipy==0.18.0

[root@localhost doc2vec]# pip3.5 freeze | grep gensim 
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
gensim==0.13.2
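
The same versions can also be read from inside Python, which confirms that the python3.5 interpreter imports the packages pip3.5 reports. This is a minimal sketch; `installed_version` is a hypothetical helper, and it relies on the packages exposing a `__version__` attribute, as scipy and gensim do.

```python
import importlib


def installed_version(package_name):
    """Return a package's __version__, 'unknown' if it has none, or None if missing."""
    try:
        module = importlib.import_module(package_name)
    except ImportError:
        return None  # not installed for this interpreter
    return getattr(module, "__version__", "unknown")


for name in ("scipy", "gensim"):
    print(name, installed_version(name))
```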

Reinstall scipy with a pinned version

[root@localhost doc2vec]# pip3.5 install --no-cache-dir scipy==0.15.1

      File "scipy/linalg/setup.py", line 18, in configuration
        raise NotFoundError('no lapack/blas resources found')
    numpy.distutils.system_info.NotFoundError: no lapack/blas resources found
    
    ----------------------------------------
  Rolling back uninstall of scipy
Command "/usr/local/python/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-i9p4bki3/scipy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-15nnf0lm-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-i9p4bki3/scipy/
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

pip complained that it was out of date, so I upgraded pip

[root@localhost doc2vec]# pip3.5 install --upgrade pip

Reinstalling scipy fails again

[root@localhost doc2vec]# pip3.5 install --no-cache-dir scipy==0.15.1

      File "scipy/linalg/setup.py", line 18, in configuration
        raise NotFoundError('no lapack/blas resources found')
    numpy.distutils.system_info.NotFoundError: no lapack/blas resources found
    
    ----------------------------------------
  Rolling back uninstall of scipy
Command "/usr/local/python/bin/python3.5 -u -c "import setuptools, tokenize;__file__='/tmp/pip-build-dfyj3acv/scipy/setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record /tmp/pip-vqi9ewwj-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-build-dfyj3acv/scipy/

Install the dependency libraries

[root@localhost doc2vec]# yum install atlas-devel lapack-devel blas-devel
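
The "no lapack/blas resources found" error means the scipy build could not locate those libraries. As a quick sanity check after the yum install, the sketch below asks the system's library lookup whether they are visible. `find_math_libraries` is a hypothetical helper. Note that `ctypes.util.find_library` looks for shared libraries only, so it does not prove that the headers from the -devel packages are present, and the scipy build uses its own search logic.

```python
from ctypes.util import find_library


def find_math_libraries(names=("blas", "lapack", "atlas")):
    """Map each library name to what the system lookup finds, or None."""
    return {name: find_library(name) for name in names}


for name, location in sorted(find_math_libraries().items()):
    print(name, "->", location or "not found")
```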


Third time lucky...

[root@localhost doc2vec]# pip3.5 install --no-cache-dir scipy==0.15.1
Collecting scipy==0.15.1
  Downloading scipy-0.15.1.tar.gz (11.4MB)
    100% |████████████████████████████████| 11.4MB 3.7MB/s                                                                                                                            
Installing collected packages: scipy
  Found existing installation: scipy 0.18.0
    Uninstalling scipy-0.18.0:
      Successfully uninstalled scipy-0.18.0
  Running setup.py install for scipy ... done
Successfully installed scipy-0.15.1

It installed!
It took about 10 minutes.

Now back to doc2vec.
The warning appears again...

[root@localhost doc2vec]# python3.5 doc2vec_test.py 
/usr/local/python/lib/python3.5/site-packages/gensim/models/word2vec.py:406: UserWarning: C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.
  warnings.warn("C extension compilation failed, training will be slow. Install a C compiler and reinstall gensim for fast training.")

Well, doc2vec runs, so I'll call that good enough for now...