This post was last edited by double07 on 2021-5-17 12:12
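I ran into two problems using concurrent.futures:
1. When I download the images from a link in a single thread, each image is named 1/2/3/4/5 and saved in order. After switching to multiple threads, the file names come out scrambled (2/48/75/35/80). How can the images under each link still be named and saved in order when using multiple threads?
2. The multithreaded downloads are incomplete: a link has 5 images, but for some links only 4 or fewer are downloaded. Am I using it incorrectly in the code?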
import os,re,time,chardet,threading,requests
from lxml import etree
from concurrent.futures import ThreadPoolExecutor
p = 0
num = 0
pa = 0
n = 0
curPage = 2
data_list = []
def download_picture(u):  # download images
    global pa, p, n
    html_detail = gethtml_detail(u)
    html = etree.HTML(html_detail)
    try:
        picsrc_list = re.findall(r'data-ks-lazyload="(//[^\s]*)"', html_detail)
        folder_name = html.xpath('//*[@class="pm-main clearfix"]/h1/text()')[1].strip()
    except Exception as e:
        print(str(e), "no images")
        folder_name = html.xpath('//*[@class="pm-main clearfix"]/h1/text()')[1].strip()
        os.makedirs('./Pictures and videos/' + folder_name, exist_ok=True)
        pa += 1
        return
    path = 'C:/Users/Administrator/Desktop/Python/AL-SF/Pictures and videos/'
    if not os.path.exists(path + folder_name + '/' + str(n) + '.jpg'):
        time_start = time.time()
        # makedirs() takes two arguments here: the directory to create, and
        # exist_ok=True so that nothing happens if it already exists
        os.makedirs('./Pictures and videos/' + folder_name, exist_ok=True)
        n = 0
        for picsrc in picsrc_list:
            n = n + 1
            pic_src = 'https:' + picsrc
            picsrc_resp = requests.get(pic_src)
            # os.path.join(): the first argument is the folder, the second is the
            # file name to save (remember the extension). Images, video, and audio
            # are generally written in 'wb' mode
            with open(os.path.join('./Pictures and videos/' + folder_name, str(n) + '.jpg'), 'wb') as f:
                f.write(picsrc_resp.content)
            time.sleep(1)
        time_end = time.time()
        print('Page %s, listing %s: property images saved! ==== took %.1f s' % (p, pa + 1, (time_end - time_start)))
        pa += 1
    else:
        print('duplicate images', 'Page %s, listing %s: property images saved!' % (p, pa + 1))
        pa += 1
    return
def download_video(u):  # download videos
    ...  # omitted
def run_AL():
    global p
    for i in url_list:
        html = gethtml(i)
        llist = parse_url(html)  # get the detail-page links from the first page
        with ThreadPoolExecutor(40) as t:
            for u in llist:
                t.submit(download_picture, u)
if __name__ == '__main__':
    run_AL()
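A likely cause of the scrambled names: `n` is declared `global` in download_picture, so all 40 worker threads reset and increment the same counter while they run. That would explain numbers such as 48 or 75 appearing in a folder, and it could also explain missing files when two images end up with the same number and one overwrites the other. Below is a minimal sketch of one way around it, using a counter that is local to each call via `enumerate()`. The names `save_listing`, `fake_fetch`, and `demo` and the `example.invalid` URLs are hypothetical; `fake_fetch` stands in for `requests.get(...).content` so the sketch runs without network access.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Hypothetical stand-in for requests.get(url).content."""
    return url.encode("utf-8")


def save_listing(base_dir, folder_name, picsrc_list, fetch=fake_fetch):
    """Save one listing's images as 1.jpg, 2.jpg, ... in page order."""
    folder = os.path.join(base_dir, folder_name)
    os.makedirs(folder, exist_ok=True)
    # enumerate() gives every call its own local counter, so calls running
    # in other threads cannot change this listing's numbering.
    for index, picsrc in enumerate(picsrc_list, start=1):
        with open(os.path.join(folder, "%d.jpg" % index), "wb") as f:
            f.write(fetch("https:" + picsrc))
    return len(picsrc_list)


def demo():
    """Download 8 made-up listings of 5 images each on 4 threads."""
    listings = {
        "listing-%d" % k: ["//example.invalid/%d/%d.jpg" % (k, i) for i in range(5)]
        for k in range(8)
    }
    with tempfile.TemporaryDirectory() as base_dir:
        with ThreadPoolExecutor(max_workers=4) as pool:
            for name, srcs in listings.items():
                pool.submit(save_listing, base_dir, name, srcs)
        # Leaving the executor's with-block waits for all submitted tasks.
        return {
            name: sorted(os.listdir(os.path.join(base_dir, name)))
            for name in listings
        }


if __name__ == "__main__":
    for name, files in sorted(demo().items()):
        print(name, files)
```

The same idea applies to `pa` and `p`: any counter that several threads update should either be made local and returned from the function, or be protected with a `threading.Lock`.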
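For the incomplete downloads, one thing worth checking is whether some tasks fail without any visible error. `t.submit(...)` returns a `Future`, and an exception raised inside the worker is stored on that future; if the future is discarded, as in run_AL, the error is never shown. The sketch below keeps the futures and calls `result()` on each one so failures are reported. `flaky_task` and `run_all` are hypothetical names, and the failure is simulated rather than taken from the original site.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def flaky_task(u):
    """Hypothetical worker: fails for one input, like a request that errors out."""
    if u == "detail-3":
        raise RuntimeError("simulated download failure for " + u)
    return 5  # pretend five images were saved


def run_all(urls, worker=flaky_task, max_workers=4):
    """Run worker(u) for every url and separate successes from failures."""
    saved, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which future belongs to which url.
        future_to_url = {pool.submit(worker, u): u for u in urls}
        for future in as_completed(future_to_url):
            u = future_to_url[future]
            try:
                # result() re-raises any exception raised inside the worker.
                saved[u] = future.result()
            except Exception as exc:
                failed[u] = exc
    return saved, failed


if __name__ == "__main__":
    saved, failed = run_all(["detail-%d" % i for i in range(1, 6)])
    print("saved:", sorted(saved))
    for u, exc in failed.items():
        print("failed:", u, repr(exc))
```

In the real downloader it may also help to pass `timeout=` to `requests.get` and call `raise_for_status()` on the response, so that a stalled or rejected request shows up as an exception here rather than as a missing or empty file.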