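When writing a crawler, you sometimes need to drive a real browser to scrape the data. In Python, the most widely used library for browser automation is Selenium.
It wraps the operations of the mainstream browsers. This article records how to give Chrome a dynamic proxy.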
0x01 Setting a proxy without a username and password
    from selenium import webdriver

    chromeOptions = webdriver.ChromeOptions()
    # Replace ip:port with the address of your proxy server.
    chromeOptions.add_argument('--proxy-server=http://ip:port')
    # `options=` replaces the deprecated `chrome_options=` keyword.
    driver = webdriver.Chrome(options=chromeOptions)
This approach uses the add_argument method to add the proxy to the browser configuration. It is also the most common way to set a proxy.
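
Since the goal is a proxy that can change between runs, the flag can be built from a host and port at start-up. Below is a minimal sketch, assuming a hypothetical pool of proxy addresses. `PROXY_POOL`, `proxy_argument`, `next_proxy_argument`, and `new_driver` are illustrative names, not part of Selenium. Chrome reads `--proxy-server` only at launch, so each new proxy needs a new browser instance.

```python
from itertools import cycle

# Hypothetical proxy pool; replace with addresses from your own provider.
PROXY_POOL = [("10.0.0.1", 8080), ("10.0.0.2", 3128)]
_pool = cycle(PROXY_POOL)


def proxy_argument(host, port, scheme="http"):
    """Build the string passed to ChromeOptions.add_argument()."""
    return "--proxy-server={}://{}:{}".format(scheme, host, int(port))


def next_proxy_argument():
    """Return the flag for the next proxy in the pool (round-robin)."""
    host, port = next(_pool)
    return proxy_argument(host, port)


def new_driver():
    """Start a Chrome instance that uses the next proxy in the pool."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument(next_proxy_argument())
    return webdriver.Chrome(options=options)
```

Calling `new_driver()` repeatedly would then cycle through the pool, one proxy per browser session.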
0x02 Setting a proxy with a username and password
    import string
    import zipfile

    from selenium import webdriver


    def create_proxyauth_extension(proxy_host, proxy_port,
                                   proxy_username, proxy_password,
                                   scheme='http', plugin_path=None):
        """Proxy Auth Extension

        args:
            proxy_host (str): domain or ip address, e.g. proxy.domain.com
            proxy_port (int): port
            proxy_username (str): auth username
            proxy_password (str): auth password
        kwargs:
            scheme (str): proxy scheme, default http
            plugin_path (str): absolute path of the extension

        return str -> plugin_path
        """
        if plugin_path is None:
            plugin_path = 'd:/webdriver/vimm_chrome_proxyauth_plugin.zip'

        # Manifest V2 extension; recent Chrome releases are phasing out
        # Manifest V2, so this may need porting on newer browsers.
        manifest_json = """
        {
            "version": "1.0.0",
            "manifest_version": 2,
            "name": "Chrome Proxy",
            "permissions": [
                "proxy",
                "tabs",
                "unlimitedStorage",
                "storage",
                "<all_urls>",
                "webRequest",
                "webRequestBlocking"
            ],
            "background": {
                "scripts": ["background.js"]
            },
            "minimum_chrome_version":"22.0.0"
        }
        """

        # background.js: route all traffic through the proxy and answer
        # its authentication challenge with the given credentials.
        background_js = string.Template(
            """
            var config = {
                mode: "fixed_servers",
                rules: {
                    singleProxy: {
                        scheme: "${scheme}",
                        host: "${host}",
                        port: parseInt(${port})
                    },
                    bypassList: ["foobar.com"]
                }
            };

            chrome.proxy.settings.set({value: config, scope: "regular"}, function() {});

            function callbackFn(details) {
                return {
                    authCredentials: {
                        username: "${username}",
                        password: "${password}"
                    }
                };
            }

            chrome.webRequest.onAuthRequired.addListener(
                callbackFn,
                {urls: ["<all_urls>"]},
                ['blocking']
            );
            """
        ).substitute(
            host=proxy_host,
            port=proxy_port,
            username=proxy_username,
            password=proxy_password,
            scheme=scheme,
        )

        # Package both files as a zip archive that Chrome can load.
        with zipfile.ZipFile(plugin_path, 'w') as zp:
            zp.writestr("manifest.json", manifest_json)
            zp.writestr("background.js", background_js)

        return plugin_path


    proxyauth_plugin_path = create_proxyauth_extension(
        proxy_host="proxy.crawlera.com",
        proxy_port=8010,
        proxy_username="fea687a8b2d448d5a5925ef1dca2ebe9",
        proxy_password=""
    )

    co = webdriver.ChromeOptions()
    co.add_argument("--start-maximized")
    co.add_extension(proxyauth_plugin_path)

    # `options=` replaces the deprecated `chrome_options=` keyword.
    driver = webdriver.Chrome(options=co)
    driver.get("http://www.amazon.com/")
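The code above generates an extension and installs it in Chrome. By passing in the proxy address, port, username, password, and scheme, the proxy is configured dynamically and then used to visit the target site.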
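
One caveat with the template approach: a username or password that contains a double quote or a backslash would produce invalid JavaScript in background.js. Below is a minimal sketch of one way around this, not the original author's method. `render_auth_listener` and `write_extension` are hypothetical helper names, and `json.dumps` is used only to emit each value as a quoted, escaped string literal. The archive goes to a temporary directory instead of a fixed drive path.

```python
import json
import os
import tempfile
import zipfile


def render_auth_listener(username, password):
    """Return the onAuthRequired part of background.js with escaped values."""
    # json.dumps yields a quoted literal that is also valid JavaScript.
    return (
        "chrome.webRequest.onAuthRequired.addListener(\n"
        "    function (details) {\n"
        "        return {authCredentials: {username: %s, password: %s}};\n"
        "    },\n"
        '    {urls: ["<all_urls>"]},\n'
        "    ['blocking']\n"
        ");\n"
    ) % (json.dumps(username), json.dumps(password))


def write_extension(manifest_json, background_js, directory=None):
    """Zip the two extension files and return the archive path."""
    directory = directory or tempfile.mkdtemp()
    plugin_path = os.path.join(directory, "proxyauth_plugin.zip")
    with zipfile.ZipFile(plugin_path, "w") as zp:
        zp.writestr("manifest.json", manifest_json)
        zp.writestr("background.js", background_js)
    return plugin_path
```

The returned path can be passed to `add_extension` in the same way as `proxyauth_plugin_path` above.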
Extension source code: https://github.com/RobinDev/Selenium-Chrome-HTTP-Private-Proxy