Scraping MV videos from a website with Python

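18382747915 · posted on 2018-9-9 11:28

Last edited by 18382747915 on 2018-9-9 11:47

This is my first post, so please bear with me if I've broken any rules.
I only recently started learning web scraping, so this is a modest first attempt.

Site being scraped: http://www.170mv.com/mlmv
Source code: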
import re
import urllib.request

import requests

# Request header that imitates a browser
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:62.0) Gecko/20100101 Firefox/62.0'}


def fetch(url):
    """Send a GET request and return the body decoded as UTF-8."""
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def pa(url):
    html = fetch(url)
    # Extract (id, title) pairs from the MV list page
    lianjie = re.findall('<a class="clip-link" data-id="(.*?)" title="(.*?)" >', html, re.S)
    for a, (mv_id, _title) in enumerate(lianjie, start=1):
        page = fetch("http://www.170mv.com/mlmv/%s.html" % mv_id)
        # The detail page embeds the video address; capture the part before ".mp4"
        files = re.findall(r'http://www\.170mv\.com/tool/jiexi/ajax/pid/%s/(.*?)\.mp4' % mv_id, page, re.S)
        name = re.findall('<h1 class="entry-title">(.*?)</h1>', page, re.S)
        if not files or not name:
            print("Skipping MV %s: page layout not recognized" % mv_id)
            continue
        video_url = 'http://www.170mv.com/tool/jiexi/ajax/pid/%s/%s.mp4' % (mv_id, files[0])
        print("Downloading MV #%s" % a)
        data = requests.get(video_url).content
        # The E:\mp4 folder must already exist
        with open('E:\\mp4\\{}.mp4'.format(name[0]), 'wb') as f:
            f.write(data)
        print("Download finished")


pa("http://www.170mv.com/mlmv")
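
One thing that can trip up the script above: the file name comes straight from the page's <h1> title, and Windows rejects names containing characters such as / : ? or ". A minimal sketch of a clean-up helper follows; the name safe_filename and the "untitled" fallback are made up for illustration and are not part of the original post.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(title, default="untitled"):
    """Replace characters that are invalid in Windows file names with "_"."""
    # Windows also rejects names that end in a space or a dot, so trim those.
    cleaned = _INVALID.sub("_", title).strip(" .")
    return cleaned or default


# Hypothetical title, only to show the call.
print(safe_filename('A/B: "live"?'))
```

In the script this would be used as 'E:\\mp4\\{}.mp4'.format(safe_filename(name[0])).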

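凌乱的思绪 · posted on 2018-9-10 18:09

18382747915 posted on 2018-9-9 22:30
Contact details can't be posted on this site, so how would I look at it? Send me the site's link and I'll scrape it for you.

Hello. It's the BT download lookup feature of 爱客影院. The earlier code stopped working; I modified it, but it still doesn't work.
I want to scrape the video download addresses from www.btdx8.com, but I can't get them, and I've never learned Python.
Could you take a look for me?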

[PHP]

<?php

error_reporting(0);
include "./inc/aik.config.php";
if (!$_GET["bt"]) {
	$tiao = $aik["pcdomain"];
?>
<meta http-equiv=refresh content='0; url=<?php echo $tiao;?>'>
<?php
}
?>
<!DOCTYPE HTML>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=0, minimum-scale=1.0, maximum-scale=1.0">
<meta name="apple-mobile-web-app-capable" content="yes">
<meta name="apple-mobile-web-app-status-bar-style" content="black">
<meta http-equiv="cache-control" content="no-siteapp">
<title><?php echo $_GET["bt"];?> - Search results - <?php echo $aik["title"];?></title>
<link rel='stylesheet' id='main-css'  href='css/style.css' type='text/css' media='all' />
<link rel='stylesheet' id='main-css'  href='css/seacher.css' type='text/css' media='all' />
<script type='text/javascript' src='http://apps.bdimg.com/libs/jquery/2.0.0/jquery.min.js?ver=0.5'></script>
<meta name="keywords" content="<?php echo $_GET["bt"];?> search results">
<meta name="description" content="<?php echo $aik["title"];?> - <?php echo $_GET["bt"];?> search results">
<!--[if lt IE 9]><script src="js/html5.js"></script><![endif]-->
<style>.dw-box {
    background: #FFEA97;
    border: 1px solid #E1B400;
    color: #3A87AD;
    border-radius: 5px;
	padding:5px 0px;
}</style>
</head>
<?php
include "header.php";
?>
<base target='_blank'>
<section class="container">
<div class="am-container main" style="padding:0">       
<?php
$sourl = "https://www.btdx8.com/?s=" . $_GET["bt"];
$info = file_get_contents($sourl);
$vname = "#<a href=\"https://www.btdx8.com/torrent/(.*?).html\" class=\"zoom\" rel=\"bookmark\" target=\"_blank\" title=\"(.*?)\">#";
$vimg = "#<img src=\"(.*?)\" alt=\"(.*?)\" />                                                      <div class=\"zoomOverlay\"></div>#";
$array = array();
preg_match_all($vname, $info, $namearr); // [1] = page slug after /torrent/, [2] = title
preg_match_all($vimg, $info, $imgarr);   // poster images (not used below)
$zname = $namearr[2];
$fname = $namearr[1];
?>
<strong class="single-strong">BT download links for related resources</strong><ul class="mvul">
<?php
foreach ($fname as $key => $video) {
	$url = "https://www.btdx8.com/torrent/" . $fname[$key] . ".html"; // same path that $vname matched
	$ch = curl_init($url);
	curl_setopt($ch, CURLOPT_HEADER, 0);
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	curl_setopt($ch, CURLOPT_USERAGENT, "Dalvik/1.6.0 (Linux; U; Android 4.1.4; DROID RAZR HD Build/9.8.1Q-62_VQW_MR-2)");
	curl_setopt($ch, CURLOPT_REFERER, "-");
	$response = curl_exec($ch);
	curl_close($ch);
	if ($x = strpos($response, "<div id=\"zdownload")) {
		$response = substr($response, $x);
	}
	if ($x = strpos($response, "</div>")) {
		$response = substr($response, 0, $x);
	}
	$response = str_replace("<p>", "", $response);
	$response = str_replace("</p>", "", $response);
	$response = str_replace("rel=\"external nofollow\" >", "rel=\"external nofollow\" >Click to download: ", $response);
	$response = str_replace("rel=\"noopener\" target=\"_blank\">", "rel=\"noopener\" target=\"_blank\">Click to download: ", $response);
?>

<li>Title: <?php echo $zname[$key];?>
</li><?php echo $response;?>
<?php
}
?>
</ul>
</div>
<div style="clear: both;"></div>
</section>
<?php
include "footer.php";
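
Since the question was how to do this in Python, here is a rough sketch of the same two steps as the PHP above: pull (slug, title) pairs out of the search page, then cut the download block out of a detail page. The patterns are copied from the PHP; the function names and the sample markup are invented for illustration, and whether the live site still matches these patterns is an assumption.

```python
import re

# Same anchor pattern as $vname in the PHP above.
RESULT_LINK = re.compile(
    r'<a href="https://www\.btdx8\.com/torrent/(.*?)\.html" class="zoom" '
    r'rel="bookmark" target="_blank" title="(.*?)">'
)


def parse_search_results(html):
    """Return a list of (detail_url, title) pairs found in a search page."""
    return [
        ("https://www.btdx8.com/torrent/%s.html" % slug, title)
        for slug, title in RESULT_LINK.findall(html)
    ]


def extract_download_links(html):
    """Return href values inside the <div id="zdownload..."> block, if present."""
    start = html.find('<div id="zdownload')
    if start == -1:
        return []
    end = html.find("</div>", start)
    block = html[start:] if end == -1 else html[start:end]
    return re.findall(r'href="(.*?)"', block)


# Hypothetical sample markup, shaped only to match the pattern above.
sample = (
    '<a href="https://www.btdx8.com/torrent/example_2018.html" class="zoom" '
    'rel="bookmark" target="_blank" title="Example Movie">'
)
print(parse_search_results(sample))
```

Fetching the pages themselves could reuse the urllib or requests calls from the first post; only the parsing is shown here so that it can be tried on saved HTML first.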

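天黑我隐身 · posted on 2018-9-9 13:12

Since you're already using the third-party requests library, you might as well stop hand-writing regular expressions; BeautifulSoup is well worth having.
Also, a good way to download files with requests is the Response object's built-in iter_content method, so that a large file is not read into memory all at once before being written to disk.
For example:
response = requests.get(url, stream=True)  # stream=True defers fetching the body
with open(filename, 'wb') as f:
    for chunk in response.iter_content(chunk_size=8192):  # the default chunk_size is 1 byte
        f.write(chunk)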
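
To make the chunked approach reusable, it could be wrapped in a small helper along these lines. This is only a sketch: save_streamed is a made-up name, and it assumes it is handed a requests Response that was opened with stream=True. It is written against the iter_content method alone, so it can be tried without touching the network.

```python
def save_streamed(response, filename, chunk_size=8192):
    """Write a streamed response to disk chunk by chunk; return bytes written."""
    written = 0
    with open(filename, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Typical use with requests (third-party, so it is not imported here):
#     import requests
#     with requests.get(url, stream=True, timeout=30) as response:
#         response.raise_for_status()
#         save_streamed(response, filename)
```

Only one chunk is held in memory at a time, which is the point of the suggestion above; the returned byte count is a convenience for progress messages.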

13319937326 · posted on 2018-9-9 11:42

Could you release a finished build so we can try it out?

推两把 · posted on 2018-9-9 11:44

Notice: the author has been banned or deleted, so the content is hidden automatically.

18382747915 · posted on 2018-9-9 11:45

13319937326 posted on 2018-9-9 11:42
Could you release a finished build so we can try it out?

Just copy and paste it and it works.

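18382747915 · posted on 2018-9-9 11:49

13319937326 posted on 2018-9-9 11:42
Could you release a finished build so we can try it out?

Just copy and paste it. The "finished build" is simply my code: put it in a txt file, change the extension to .py, and double-click it.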

zjjyl · posted on 2018-9-9 11:52

Thanks for sharing, OP.

zangfong · posted on 2018-9-9 11:58

Thanks for sharing, OP~

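13319937326 · posted on 2018-9-9 12:09

18382747915 posted on 2018-9-9 11:49
Just copy and paste it. The "finished build" is simply my code: put it in a txt file, change the extension to .py, and double-click it.

Is this a batch file?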

18382747915 · posted on 2018-9-9 12:22

13319937326 posted on 2018-9-9 12:09
Is this a batch file?

Download PyCharm and run it in there.

zhengxing · posted on 2018-9-9 12:58

Thanks for sharing, OP. I'm a beginner too.
