Python crawler: scrape the reviews of a given movie and generate a word cloud

Kkcrush · Posted on 2018-8-3 14:28

Last edited by Kkcrush on 2018-8-3 14:31

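Scraping Douban movie reviews and generating a word cloud
Goal: scrape the short reviews of a movie on Douban and turn them into a word cloud.
Step 1: Scrape the reviews
The code is below; see the comments for details.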


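import requests
from bs4 import BeautifulSoup as bs
from requests.exceptions import RequestException


def getCommentsById(movieId, pageNum):
    """Return the short reviews on one page (20 per page) of a Douban movie."""
    eachCommentList = []
    # Page numbers start at 1; anything lower yields an empty list
    if pageNum > 0:
        start = (pageNum - 1) * 20
    else:
        return eachCommentList
    # URL of the Douban review page (movieId must be a string)
    requrl = (
        "https://movie.douban.com/subject/"
        + movieId
        + "/comments"
        + "?"
        + "start="
        + str(start)
        + "&limit=20"
    )
    try:
        # Fetch the page
        resp = requests.get(requrl)
        # Decode the response body as UTF-8
        html_data = resp.content.decode("utf-8")
        # Parse the result with BeautifulSoup, using the built-in html.parser
        soup = bs(html_data, "html.parser")
        # Each review sits in a <div class="comment">
        comment_div_list = soup.find_all("div", class_="comment")
        # Collect the text of every non-empty <span class="short">
        for item in comment_div_list:
            short = item.find(name="span", attrs={"class": "short"})
            if short is not None and short.string is not None:
                eachCommentList.append(short.string)
    # Report request failures instead of crashing
    except RequestException as e:
        print("Request failed: %s" % e)
    # Return the list of reviews
    return eachCommentList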
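
The paging arithmetic in the function above can be checked without touching the network. The following is only a sketch using the standard library: `build_comments_url` is a hypothetical helper that does not appear in the original post, the movie ID `"1234567"` is a placeholder, and the 20-reviews-per-page layout is the same assumption the post itself makes.

```python
from urllib.parse import urlencode

PAGE_SIZE = 20  # reviews per page, as assumed in the post


def build_comments_url(movie_id, page_num):
    """Hypothetical helper: build the review URL for a 1-based page number."""
    if page_num < 1:
        raise ValueError("page_num must be >= 1")
    # Page 1 starts at offset 0, page 2 at offset 20, and so on
    start = (page_num - 1) * PAGE_SIZE
    query = urlencode({"start": start, "limit": PAGE_SIZE})
    return "https://movie.douban.com/subject/%s/comments?%s" % (movie_id, query)


if __name__ == "__main__":
    # "1234567" is a placeholder ID, not a real movie
    for page in (1, 2, 3):
        print(build_comments_url("1234567", page))
```

Building the query string with `urlencode` instead of string concatenation keeps the escaping correct if more parameters are added later.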

Step 2: Convert the scraped reviews into a single string
The code is below; see the comments for details.


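# movieId (the Douban subject ID, as a string) is assumed to be defined earlier
commentList = []
for i in range(10):
    num = i + 1
    # Fetch one page of reviews with the function above
    commentList_temp = getCommentsById(movieId=movieId, pageNum=num)
    commentList.append(commentList_temp)
# Join everything in the list of pages into one string
comments = ""
for k in range(len(commentList)):
    # print(commentList[k])
    # Strip leading and trailing whitespace from each review
    for m in range(len(commentList[k])):
        comments = comments + str(commentList[k][m]).strip()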
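
The nested loops above flatten a list of pages into one string. As a rough alternative sketch (not the original author's code), the same result can be had with `itertools.chain` and `str.join`; `join_comments` and the sample data are made up for illustration.

```python
from itertools import chain


def join_comments(pages):
    """Flatten a list of per-page review lists into one stripped string."""
    return "".join(str(c).strip() for c in chain.from_iterable(pages))


if __name__ == "__main__":
    # Made-up sample: three pages, one of them empty
    sample_pages = [["  很好看 ", "一般"], [], ["值得二刷\n"]]
    print(join_comments(sample_pages))
```

`str.join` builds the result in one pass, which avoids repeatedly copying a growing string when there are many reviews.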

Step 3: Strip punctuation with a regular expression
The code is below; see the comments for details.


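import re

# Strip punctuation with a regular expression
filtrate = re.compile(r"[^\u4E00-\u9FA5]")  # match every non-Chinese character
filtered_str = filtrate.sub(r"", comments)  # delete the matches, keeping only Chinese text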
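
To see what this filter actually removes, here is a small self-contained sketch run on a made-up sample string. The character class is the one from the post; `keep_chinese` is a hypothetical wrapper name. Note that the filter drops Latin letters and digits as well as punctuation.

```python
import re

# Everything outside U+4E00..U+9FA5 (common Chinese characters) is removed
NON_CHINESE = re.compile(r"[^\u4E00-\u9FA5]")


def keep_chinese(text):
    """Return text with every non-Chinese character deleted."""
    return NON_CHINESE.sub("", text)


if __name__ == "__main__":
    # Made-up sample mixing Chinese, English, digits and punctuation
    print(keep_chinese("这部电影, really 好看！2018"))
```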

Step 4: Segment the Chinese text with jieba
The code is below; see the comments for details.


import jieba
import pandas as pd

# Segment the Chinese text into words with jieba
segment = jieba.lcut(filtered_str)
words_df = pd.DataFrame({"segment": segment})

Step 5: Remove stop words
The code is below; see the comments for details.


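# Remove stop words.
# In information retrieval, stop words are characters or words that are
# filtered out automatically before or after processing natural-language
# data (text), to save storage space and make searching more efficient.
stopwords = pd.read_csv(
    "./stopwords.txt",
    index_col=False,
    quoting=3,  # csv.QUOTE_NONE: treat quote characters as plain text
    sep="\t",  # tab separator, so each line is read as one stop word
    names=["stopword"],
    encoding="utf-8",
)
words_df = words_df[~words_df.segment.isin(stopwords.stopword)]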
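
The filtering idea can also be shown without pandas. This is a minimal sketch under stated assumptions, not the post's method: it assumes a stop-word file with one word per line, and `load_stopwords`, `remove_stopwords` and the sample tokens are all invented for illustration.

```python
def load_stopwords(lines):
    """Build a stop-word set from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not stop words, preserving order."""
    return [t for t in tokens if t not in stopwords]


if __name__ == "__main__":
    # In-memory stand-in for the lines of a stop-word file
    stopwords = load_stopwords(["的\n", "了\n", "\n", "是\n"])
    tokens = ["电影", "的", "节奏", "是", "太", "慢", "了"]
    print(remove_stopwords(tokens, stopwords))
```

With a real file, the lines could come from `open("./stopwords.txt", encoding="utf-8")`. A set makes each membership check constant time, which matters once the token list gets long.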

Step 6: Count word frequencies
The code is below; see the comments for details.


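# Count word frequencies; the column name "计数" means "count".
# size() is used because recent pandas versions reject a renaming dict in agg().
words_stat = words_df.groupby(by=["segment"]).size().reset_index(name="计数")
words_stat = words_stat.sort_values(by=["计数"], ascending=False)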
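
For comparison, the same count-and-sort step can be sketched with the standard library's `collections.Counter` instead of pandas. This is a swapped-in technique, not the original author's approach; `count_words` and the sample tokens are hypothetical.

```python
from collections import Counter


def count_words(tokens, top_n=1000):
    """Return (word, count) pairs for the top_n most frequent words."""
    return Counter(tokens).most_common(top_n)


if __name__ == "__main__":
    # Made-up tokens standing in for the segmented reviews
    tokens = ["好看", "剧情", "好看", "演技", "剧情", "好看"]
    print(count_words(tokens))
```

Wrapping the result in `dict(...)` gives a word-to-frequency mapping of the same shape that the word-cloud step consumes.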

Step 7: Display the result as a word cloud
The code is below; see the comments for details.


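import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Display the result as a word cloud
wordcloud = WordCloud(
    font_path="./simkai.ttf",  # the font must contain Chinese glyphs
    background_color="white",
    max_font_size=80,
    width=1000,
    height=860,
    margin=2,
)
# Map each of the 1000 most frequent words to its count
word_frequence = {x[0]: x[1] for x in words_stat.head(1000).values}

wordcloud = wordcloud.fit_words(word_frequence)
plt.imshow(wordcloud)
plt.axis("off")
plt.show(block=False)
# movieName is assumed to be defined earlier; it names the output image
img_name = "./" + movieName + ".jpg"
wordcloud.to_file(img_name)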

This is my blog; you are welcome to follow it:
https://vwin.github.io/

煦涵 · Posted on 2018-8-3 14:55

Great work, OP. Thumbs up.

上帝也幽默 · Posted on 2018-8-3 15:20

Bump, bump, bump. zsbd

顾小城 · Posted on 2018-8-3 16:24

Great work, OP. Thumbs up.

IrvingYang · Posted on 2018-8-3 16:33

Thanks for sharing. Thumbs up.

波多野结文 · Posted on 2018-8-4 08:26

Great work, OP. Thumbs up.

小黑LLB · Posted on 2019-2-9 19:16

Giving the OP a like to show some support.

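Kkcrush · Posted on 2019-2-14 10:36

小黑LLB posted on 2019-2-9 19:16
Giving the OP a like to show some support.

Thanks, hahaha.

chz888 · Posted on 2019-10-4 18:47

Could the OP help me write an auto-commenting script? It is for a video site. I want to collect every video page and post comments on them automatically.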
