Setting Up a Dynamic Proxy in Selenium by Loading an Extension

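When writing a crawler, you sometimes need to drive a real browser to scrape the data. In Python, the most widely used library for controlling a browser is Selenium, which wraps the automation of the mainstream browsers. This post records how to set a dynamic proxy for Chrome.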

0x01 Setting a proxy without a username and password

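from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
# Replace ip:port with the address of the proxy server.
chromeOptions.add_argument('--proxy-server=http://ip:port')
# Current Selenium releases take the options object via `options=`;
# the older `chrome_options=` keyword has been removed.
driver = webdriver.Chrome(options=chromeOptions)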

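This approach uses the add_argument method to add the proxy to the browser's configuration, and it is the most common way to set a proxy. It does not cover proxies that require a username and password.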
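Since the proxy is just a command-line switch, picking a different proxy per browser session comes down to building a different string. The following is a minimal sketch under that assumption; the helper names `build_proxy_argument` and `make_driver` are hypothetical (they are not part of Selenium), and `127.0.0.1:8080` is a placeholder address.

```python
def build_proxy_argument(host, port, scheme="http"):
    """Return the Chrome switch for a proxy that needs no credentials."""
    return "--proxy-server={}://{}:{}".format(scheme, host, port)


def make_driver(proxy_argument):
    """Start Chrome with the given proxy switch (requires Selenium and a driver)."""
    # Imported lazily so build_proxy_argument works without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(proxy_argument)
    return webdriver.Chrome(options=options)


if __name__ == "__main__":
    # Placeholder address for illustration only.
    print(build_proxy_argument("127.0.0.1", 8080))
```

Calling `make_driver(build_proxy_argument(host, port))` once per proxy would give each browser session its own proxy; the switch is read when Chrome starts, so changing the proxy this way means starting a new browser.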

0x02 Setting a proxy with a username and password

import string
import zipfile

from selenium import webdriver


def create_proxyauth_extension(proxy_host, proxy_port,
                               proxy_username, proxy_password,
                               scheme='http', plugin_path=None):
    """Proxy Auth Extension

    args:
        proxy_host (str): domain or ip address, ie proxy.domain.com
        proxy_port (int): port
        proxy_username (str): auth username
        proxy_password (str): auth password
    kwargs:
        scheme (str): proxy scheme, default http
        plugin_path (str): absolute path of the extension

    return str -> plugin_path
    """
    if plugin_path is None:
        # Default output location; the directory must already exist.
        plugin_path = 'd:/webdriver/vimm_chrome_proxyauth_plugin.zip'

    # Manifest V2 extension. Newer Chrome releases are phasing out V2,
    # so this may need porting on recent browser versions.
    manifest_json = """
    {
        "version": "1.0.0",
        "manifest_version": 2,
        "name": "Chrome Proxy",
        "permissions": [
            "proxy",
            "tabs",
            "unlimitedStorage",
            "storage",
            "<all_urls>",
            "webRequest",
            "webRequestBlocking"
        ],
        "background": {
            "scripts": ["background.js"]
        },
        "minimum_chrome_version":"22.0.0"
    }
    """

    # The background script sets a fixed proxy and answers the proxy's
    # authentication challenge with the supplied credentials.
    background_js = string.Template(
    """
    var config = {
            mode: "fixed_servers",
            rules: {
              singleProxy: {
                scheme: "${scheme}",
                host: "${host}",
                port: parseInt(${port})
              },
              bypassList: ["foobar.com"]
            }
          };

    chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

    function callbackFn(details) {
        return {
            authCredentials: {
                username: "${username}",
                password: "${password}"
            }
        };
    }

    chrome.webRequest.onAuthRequired.addListener(
                callbackFn,
                {urls: ["<all_urls>"]},
                ['blocking']
    );
    """
    ).substitute(
        host=proxy_host,
        port=proxy_port,
        username=proxy_username,
        password=proxy_password,
        scheme=scheme,
    )

    # Package both files into a zip that Chrome can load as an extension.
    with zipfile.ZipFile(plugin_path, 'w') as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)

    return plugin_path


proxyauth_plugin_path = create_proxyauth_extension(
    proxy_host="proxy.crawlera.com",
    proxy_port=8010,
    proxy_username="fea687a8b2d448d5a5925ef1dca2ebe9",
    proxy_password=""
)

co = webdriver.ChromeOptions()
co.add_argument("--start-maximized")
co.add_extension(proxyauth_plugin_path)

# Current Selenium releases take the options object via `options=`.
driver = webdriver.Chrome(options=co)
driver.get("http://www.amazon.com/")

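The code above generates an extension and installs it in Chrome. Because the proxy address, port, username, password, and scheme are passed in as arguments, the proxy can be chosen dynamically at run time, and the browser then accesses the target site through it.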

Extension source code: https://github.com/RobinDev/Selenium-Chrome-HTTP-Private-Proxy
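One possible variation on the extension generator, sketched here rather than taken from the linked repository: it writes the zip to a temporary file instead of a fixed `d:/webdriver/` path, and it passes every value through `json.dumps` so that a username or password containing quotes or backslashes cannot break the generated JavaScript. The function name `write_proxy_extension` and the `directory` parameter are hypothetical, and `proxy.example.com` in the usage note is a placeholder.

```python
import json
import os
import tempfile
import zipfile

# Same manifest fields as the extension in the article (Manifest V2).
MANIFEST = {
    "version": "1.0.0",
    "manifest_version": 2,
    "name": "Chrome Proxy",
    "permissions": ["proxy", "tabs", "unlimitedStorage", "storage",
                    "<all_urls>", "webRequest", "webRequestBlocking"],
    "background": {"scripts": ["background.js"]},
    "minimum_chrome_version": "22.0.0",
}

# Placeholders are filled with JSON literals, which are valid JavaScript.
BACKGROUND_JS = """
var config = {
    mode: "fixed_servers",
    rules: {
        singleProxy: {scheme: %(scheme)s, host: %(host)s, port: %(port)s},
        bypassList: ["foobar.com"]
    }
};
chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});
chrome.webRequest.onAuthRequired.addListener(
    function(details) {
        return {authCredentials: {username: %(username)s, password: %(password)s}};
    },
    {urls: ["<all_urls>"]},
    ["blocking"]
);
"""


def write_proxy_extension(host, port, username, password,
                          scheme="http", directory=None):
    """Write the proxy-auth extension to a temporary zip and return its path."""
    values = {
        "scheme": json.dumps(scheme),
        "host": json.dumps(host),
        "port": json.dumps(int(port)),
        "username": json.dumps(username),
        "password": json.dumps(password),
    }
    # mkstemp creates a uniquely named file, so parallel crawlers do not collide.
    fd, path = tempfile.mkstemp(suffix=".zip", dir=directory)
    os.close(fd)
    with zipfile.ZipFile(path, "w") as zp:
        zp.writestr("manifest.json", json.dumps(MANIFEST, indent=2))
        zp.writestr("background.js", BACKGROUND_JS % values)
    return path
```

The returned path can be passed to `ChromeOptions.add_extension` exactly as in the article, for example `co.add_extension(write_proxy_extension("proxy.example.com", 8010, "user", "secret"))`. The caller is responsible for deleting the temporary file once the browser has started.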