Building an automated YouTube downloader in Python: the code (original)

q7537227
Published 2021-1-12 21:27

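Contents
Project repository
Approach
Workflow
1. POST
  i. Build the headers for the POST
  ii. Build the form parameters
  iii. Send the POST request with requests
  iv. Wrap it in a function
2. Call the decryption function
  i. Analysis
  ii. Extract the JavaScript
  iii. Pick the first decryption function
  iv. Run it with execjs
    1. this (the window variable) does not exist
    2. alert does not exist
  v. Put the code together
3. Parse the decrypted result
  i. Extract the key JSON
  ii. Parse the JSON
  iii. Get the download URL
4. Full code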
 

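In accordance with savefrom's terms,
this example and tutorial are for learning and discussion only; all rights belong to savefrom.net.
The final code plus comments comes to roughly 100 lines. The code on GitHub is authoritative (bugs may be fixed there); this article only walks through it in detail.
Project repository
GitHub repository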

Approach
Building an automated YouTube downloader in Python: the approach

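Workflow
1. POST
Following the first step of the approach article, we first need a POST request to fetch the encrypted JavaScript payload. I use the third-party requests library for this; for background on web scraping, see my earlier articles.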

i. Build the headers for the POST
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie value below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}

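The cookie value may need to change; ideally, copy the headers your own browser sends. What each header means is outside the scope of this article and is easy to look up with a search engine.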
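Since the cookie is the part most likely to need replacing, it can help to paste the raw Cookie header from the browser and split it into name/value pairs. The following is a minimal sketch; parse_cookie_header is a hypothetical helper name, not part of the original project.

```python
def parse_cookie_header(raw: str) -> dict:
    """Turn 'a=1; b=2' (as copied from the browser's request headers) into a dict."""
    cookies = {}
    for part in raw.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty fragments and fragments without '='
            cookies[name] = value
    return cookies


# shortened sample taken from the cookie string shown above
raw = "lang=en; country=CN; x-requested-with=; sfHelperDist=72"
cookies = parse_cookie_header(raw)
print(cookies)
# {'lang': 'en', 'country': 'CN', 'x-requested-with': '', 'sfHelperDist': '72'}
```

A dict like this can be passed to requests through its cookies= argument instead of hand-editing the "cookie" header string; whether the site accepts a given set of cookies still has to be checked against a real request.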

ii. Build the form parameters
    # set the form parameters (copied from the request shown in Chrome)
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}

The sf_url field is the URL of the YouTube video we want to download; the other parameters stay the same.
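Because only sf_url varies, the parameter dict can be produced by a small function. This is a sketch only: build_payload is a hypothetical name, and the URL check is an extra precaution that the original code does not perform.

```python
from urllib.parse import urlparse


def build_payload(video_url: str) -> dict:
    """Return the form fields from the article, with sf_url set to video_url."""
    parsed = urlparse(video_url)
    # assumption: reject anything that is not an absolute http(s) URL
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {video_url!r}")
    return {
        "sf_url": video_url,
        "sf_submit": "",
        "new": "1",
        "lang": "en",
        "app": "",
        "country": "cn",
        "os": "Windows",
        "browser": "Chrome",
    }


payload = build_payload("https://www.youtube.com/watch?v=YPvtz1lHRiw")
print(payload["sf_url"])  # https://www.youtube.com/watch?v=YPvtz1lHRiw
```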

iii. Send the POST request with requests
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()

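Note that the parameters are passed as data=kv, so they are sent form-encoded in the request body, matching the content-type header above.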
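To see roughly what a form-encoded body looks like, the standard library can encode the same kind of dict. This sketch only illustrates the encoding; it does not call requests, and the exact bytes requests puts on the wire may differ in detail.

```python
from urllib.parse import urlencode

# shortened version of the kv dict above
kv = {
    "sf_url": "https://www.youtube.com/watch?v=YPvtz1lHRiw",
    "sf_submit": "",
    "new": "1",
}
body = urlencode(kv)
print(body)
# sf_url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DYPvtz1lHRiw&sf_submit=&new=1
```

With params=kv the same pairs would instead be appended to the URL as a query string, which is not what the browser request in this article does.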

iv. Wrap it in a function
import requests

def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie value below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the form parameters (copied from the request shown in Chrome)
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text

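2. Call the decryption function
i. Analysis
The hard part is executing JavaScript from Python. Solutions available online include PyV8 and others; this article uses execjs. As the approach article shows, the last few lines of the JavaScript contain the decryption call, so we only need to run the whole script once in execjs and then evaluate the decryption function on its own.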

ii. Extract the JavaScript
    # target (YouTube video) URL
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the inline script (the encrypted information is stored in the JavaScript part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]

A regular expression would also work here; I am not yet fluent with regex, so I simply used split.
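For comparison, here is a sketch of the regex variant. It assumes the same opening tag text as the split version; extract_script and the sample HTML are made up for illustration.

```python
import re

# re.S lets "." match newlines, since the script body spans many lines
SCRIPT_RE = re.compile(r'<script type="text/javascript">(.*?)</script>', re.S)


def extract_script(html: str) -> str:
    """Return the body of the first inline text/javascript script block."""
    match = SCRIPT_RE.search(html)
    if match is None:
        raise ValueError("no inline text/javascript <script> block found")
    return match.group(1)


sample = '<html><script type="text/javascript">\nvar a = 1;\n</script><script>x</script></html>'
print(repr(extract_script(sample)))  # '\nvar a = 1;\n'
```

Unlike the split version, which raises an IndexError when the tag is missing, this variant fails with an explicit message.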

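iii. Pick the first decryption function
If you fetch the result for several different videos, you will notice that the decryption function is different every time, but it always sits at the same line position.

    # split into lines (this helps locate the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[len(reA) - 3].split(";")[0] + ";"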

So name holds our decryption statement (not the best variable name, admittedly).
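The line selection can be tried on a small stand-in script. The JavaScript below is invented purely to show the indexing; the real response looks different and its identifiers change on every request.

```python
def decrypt_statement(js: str) -> str:
    """Return the first statement on the third-from-last line of js."""
    lines = js.split("\n")
    if len(lines) < 3:
        raise ValueError("script has fewer than three lines")
    # lines[-3] is the same element as lines[len(lines) - 3]
    return lines[-3].split(";")[0] + ";"


# hypothetical script: the assignment sits on the third-from-last line
sample_js = "\n".join([
    "(function(){",
    "var _a1=_b2(_c3);window._d4=_a1;",
    "})();",
    "",
])
print(decrypt_statement(sample_js))  # var _a1=_b2(_c3);
```

If the site ever changes the number of trailing lines, this fixed offset is the first thing that would need adjusting.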

iv. Run it with execjs
    # use execjs to compile the JavaScript code
    ct = execjs.compile(reo)
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

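Taking only the part after = and dropping the semicolon means we evaluate just the function call without the assignment; of course, running the assignment plus decryption first and then reading the value would also work.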
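One detail of split("=")[1] is that it keeps only the text between the first and second =, so a call whose arguments contain = would be cut short. A sketch of a variant that splits once, with hypothetical statements as input:

```python
def call_expression(statement: str) -> str:
    """Return everything after the first '=', without the trailing semicolon."""
    _, sep, rhs = statement.partition("=")
    if not sep:
        raise ValueError(f"no assignment in {statement!r}")
    return rhs.strip().rstrip(";")


simple = "var _a1=_b2(_c3);"
tricky = "var _a1=_b2(_c3,'x=y');"  # made-up statement with '=' inside an argument

print(call_expression(simple))                # _b2(_c3)
print(call_expression(tricky))                # _b2(_c3,'x=y')
print(tricky.split("=")[1].replace(";", ""))  # _b2(_c3,'x
```

Whether the real decryption call ever contains = is not something this article establishes, so treat this as a defensive option rather than a required fix.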
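However, this immediately raises an error (if only it were that simple).

1. this (the window variable) does not exist
If I remember correctly, the error mentions this or $b. I tried removing every this, and also wrapping everything in a class (so that this would refer to that class), but neither worked. I then found that the npm package jsdom can simulate the window variable inside execjs (there is probably a better way), so we need to install npm and jsdom and rewrite the code above: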

    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\xxx\AppData\Roaming\npm\node_modules')

Here:

cwd is the output of npm root -g, that is, the path of npm's global modules directory
addition is the prelude that simulates window
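Rather than hard-coding the path, the output of npm root -g could be read at run time. This is a sketch under assumptions: global_node_modules is a hypothetical helper, and it has not been checked against every npm installation (on Windows npm is a .cmd wrapper, which is why the command is resolved with shutil.which first).

```python
import shutil
import subprocess
import sys


def global_node_modules(cmd=("npm", "root", "-g")):
    """Run cmd and return its trimmed stdout, or None if the program is not found."""
    exe = shutil.which(cmd[0])
    if exe is None:
        return None
    done = subprocess.run(
        [exe, *cmd[1:]],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout to str
        check=True,               # raise if the command exits with an error
    )
    return done.stdout.strip()


# stand-in command so the helper can be demonstrated without npm installed
demo = global_node_modules((sys.executable, "-c", "print('/fake/node_modules')"))
print(demo)  # /fake/node_modules
```

The returned string would then be passed as the cwd argument of execjs.compile in place of the literal path.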
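But then the next error appears:
2. alert does not exist
This error occurs because calling alert under execjs is meaningless: there is no browser to show the dialog. In addition, alert is normally defined on window, and we have supplied our own window. So we override alert at the start of the code (in effect defining a no-op alert).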

    # override alert: the script calls it in one place, and showing a dialog is meaningless under execjs;
    # without the override, the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
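Note that str.replace patches every occurrence of "(function(){", not just the first. The sketch below shows the difference on an invented two-function script; whether patching only the first occurrence is enough for the real script is an open question here.

```python
# made-up script with two immediately invoked functions
sample = "(function(){alert('x');})();(function(){return 1;})();"
stub = "(function(){\nthis.alert=function(){};"

patched_all = sample.replace("(function(){", stub)       # every occurrence
patched_first = sample.replace("(function(){", stub, 1)  # first occurrence only

print(patched_all.count("this.alert=function(){};"))    # 2
print(patched_first.count("this.alert=function(){};"))  # 1
```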

v. Put the code together
    # target (YouTube video) URL
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the inline script (the encrypted information is stored in the JavaScript part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override alert: the script calls it in one place, and showing a dialog is meaningless under execjs;
    # without the override, the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps locate the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # prepend the jsdom setup because the script needs window (there may be a solution without jsdom, but I do not know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))

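3. Parse the decrypted result
i. Extract the key JSON
After running the code above, the decrypted result is stored in text. As the approach article shows, what really matters is the JSON passed to window.parent.sf.videoResult.show(), so we use a regular expression to extract that JSON.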

    # get the result as a JSON string (raw string avoids invalid-escape warnings; group(1) is the text inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)

ii. Parse the JSON
Python has many libraries that can parse JSON; here I use the standard json module (remember to import it).

# use `json` to load json
    j = json.loads(result)
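The two steps above (capture with a regex, then parse with json) can be tried on a stand-in string. The sample below is invented: only the show(...);; wrapper and the nested "url" key follow the article, and the real decrypted text is much longer.

```python
import json
import re


def extract_result(decrypted: str) -> dict:
    """Capture the argument of show(...);; and parse it as JSON."""
    match = re.search(r"show\((.*?)\);;", decrypted, re.I | re.M)
    if match is None:
        raise ValueError("show(...);; not found in the decrypted text")
    return json.loads(match.group(1))


# hypothetical decrypted text; example.invalid is a placeholder host
sample = (
    'window.parent.sf.videoResult.show({"url": [{"url": '
    '"https://example.invalid/v.mp4", "quality": "360"}]});;'
)
j = extract_result(sample)
print(j["url"][0]["quality"])  # 360
```

Checking for a missing match before calling group avoids the AttributeError that the one-line version raises when the page layout changes.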

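iii. Get the download URL
This is the last step. From the approach article and a JSON formatter, we can see that j["url"][num]["url"] is the download link, where num selects the video format we want (resolution and file type).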

    # select the video format; in this case num=1 means the video is
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"]
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -
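Since the position of each format in the list may vary between videos, selecting by field values instead of a fixed num is one option. In this sketch the field names "quality", "type" and "url" come from the comments above, while pick_format and all sample values are hypothetical.

```python
def pick_format(result: dict, quality: str, file_type: str):
    """Return the first download URL matching quality and type, or None."""
    for entry in result.get("url", []):
        same_quality = str(entry.get("quality")) == quality
        same_type = str(entry.get("type", "")).lower() == file_type.lower()
        if same_quality and same_type:
            return entry["url"]
    return None


# invented data in the shape described by the article
j = {
    "url": [
        {"url": "https://example.invalid/720.mp4", "quality": "720", "type": "mp4"},
        {"url": "https://example.invalid/360.mp4", "quality": "360", "type": "mp4"},
    ]
}
print(pick_format(j, "360", "MP4"))  # https://example.invalid/360.mp4
print(pick_format(j, "1080", "mp4"))  # None
```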

4. Full code
# -*- coding: utf-8 -*-
# @Time: 2021/1/10
# @Author: Eritque arcus
# @File: Youtube.py
# @License: MIT
# @Environment:
#           - windows 10
#           - python 3.6.2
# @Dependencies:
#           - jsdom from npm (also works on Windows)
#           - requests, execjs (PyPI package PyExecJS), re, json in Python
import requests
import execjs
import re
import json


def gethtml(url):
    # set the headers, otherwise the website will not return any information
    # you may need to change the cookie value below
    headers = {
        "cache-Control": "no-cache",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,"
                  "*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "accept-encoding": "gzip, deflate, br",
        "accept-language": "zh-CN,zh;q=0.9,en;q=0.8",
        "content-type": "application/x-www-form-urlencoded",
        "cookie": "lang=en; country=CN; uid=fd94a82a406a8dd4; sfHelperDist=72; reference=14; "
                  "clickads-e2=90; poropellerAdsPush-e=63; promoBlock=64; helperWidget=92; "
                  "helperBanner=42; framelessHdConverter=68; inpagePush2=68; popupInOutput=9; "
                  "_ga=GA1.2.799702638.1610248969; _gid=GA1.2.628904587.1610248969; "
                  "PHPSESSID=030393eb0776d20d0975f99b523a70d4; x-requested-with=; "
                  "PHPSESSUD=islilfjn5alth33j9j8glj9776; _gat_helperWidget=1; _gat_inpagePush2=1",
        "origin": "https://en.savefrom.net",
        "pragma": "no-cache",
        "referer": "https://en.savefrom.net/1-youtube-video-downloader-4/",
        "sec-ch-ua": "\"Google Chrome\";v=\"87\", \"Not;A Brand\";v=\"99\",\"Chromium\";v=\"87\"",
        "sec-ch-ua-mobile": "?0",
        "sec-fetch-dest": "iframe",
        "sec-fetch-mode": "navigate",
        "sec-fetch-site": "same-origin",
        "sec-fetch-user": "?1",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/87.0.4280.88 Safari/537.36"}
    # set the form parameters (copied from the request shown in Chrome)
    kv = {"sf_url": url,
          "sf_submit": "",
          "new": "1",
          "lang": "en",
          "app": "",
          "country": "cn",
          "os": "Windows",
          "browser": "Chrome"}
    # do the POST request
    r = requests.post(url="https://en.savefrom.net/savefrom.php", headers=headers,
                      data=kv)
    r.raise_for_status()
    # get the result
    return r.text


if __name__ == '__main__':
    # target (YouTube video) URL
    url = "https://www.youtube.com/watch?v=YPvtz1lHRiw"
    # get the target text
    reo = gethtml(url)
    # strip everything before and after the inline script (the encrypted information is stored in the JavaScript part)
    reo = reo.split("<script type=\"text/javascript\">")[1].split("</script>")[0]
    # override alert: the script calls it in one place, and showing a dialog is meaningless under execjs;
    # without the override, the code raises an error
    reo = reo.replace("(function(){", "(function(){\nthis.alert=function(){};")
    # split into lines (this helps locate the decryption call in the last few lines)
    reA = reo.split("\n")
    # get the decryption statement
    name = reA[len(reA) - 3].split(";")[0] + ";"
    # prepend the jsdom setup because the script needs window (there may be a solution without jsdom, but I do not know one)
    addition = """
    const jsdom = require("jsdom");
    const { JSDOM } = jsdom;
    const dom = new JSDOM(`<!DOCTYPE html><p>Hello world</p>`);
    window = dom.window;
    document = window.document;
    XMLHttpRequest = window.XMLHttpRequest;
    """
    # use execjs to execute the js code, and the cwd is the result of `npm root -g`(the path of npm in your computer)
    ct = execjs.compile(addition + reo, cwd=r'C:\Users\19308\AppData\Roaming\npm\node_modules')
    # do the decryption
    text = ct.eval(name.split("=")[1].replace(";", ""))
    # get the result as a JSON string (raw string avoids invalid-escape warnings; group(1) is the text inside show(...))
    result = re.search(r'show\((.*?)\);;', text, re.I | re.M).group(1)
    # use `json` to load json
    j = json.loads(result)
    # select the video format; in this case num=1 means the video is
    # - 360p, known from j["url"][num]["quality"]
    # - MP4, known from j["url"][num]["type"]
    # - audio, known from j["url"][num]["audio"]
    num = 1
    downurl = j["url"][num]["url"]
    # do some download
    # thanks :)
    # - EOF -


102 lines in total
Development environment
# @Environment:
#           - windows 10
#           - python 3.6.2

Dependencies
# @Dependencies:
#           - jsdom from npm (also works on Windows)
#           - requests, execjs (PyPI package PyExecJS), re, json in Python

-end-
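For web scraping
Copyright notice: this is an original article by the author, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.
Author of this article: https://www.cnblogs.com/Eritque-arcus/ or https://blog.csdn.net/qq_40832960
Author: Eritque-arcus
Source: https://www.cnblogs.com/Eritque-arcus/
Copyright of this article is shared by the author and cnblogs (博客园). Reposting is welcome, but the link to the original must be given and this notice kept; otherwise the right to pursue legal liability is reserved.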

© Copyright belongs to the author. If you repost, cite the source; otherwise legal liability will be pursued.