基于无头浏览器的chatgpt爬虫方案
整体流程
- 初始化并启动无头浏览器,关于无头浏览器可用方案见无头浏览器过cloudflare的可行方案
- 通过无头浏览器打开https://chat.openai.com/ ,等待网页加载完毕后,如果网页源码中存在window._cf_chl_opt,则说明触发了cloudflare,需要进入cloudflare验证流程,否则直接进入5的登录流程
- 等待网页检查,页面加载完成后,点击id为ctp-checkbox-label的人机验证框,等待后续网页加载。
- 如果加载完成后仍旧存在人机验证框,则说明可能验证失败,无法通过cloudflare的人机验证;如果页面成功跳转到了登录chatgpt的登录页面,则说明通过了cloudlfare的检查,可以进入登录流程
- 寻找页面中class_name为btn-primary、文字为Log in的按钮,点击进入输入邮箱页面
- 找到元素id为username的邮箱输入框,输入邮箱,然后找到class_name是_button-login-id的按钮,点击,等待页面加载后继续输入密码
- 找到元素id为password的密码输入框,输入密码,然后找到class_name是_button-login-password的按钮,点击,等待页面加载
- 如果账号和密码都输入正确,此时chatgpt已经成功登陆了
- 读取浏览器日志,chatgpt会在登录过程中请求api/auth/session接口来获取session_token和access_token,读取日志中的这条请求,获取返回数据中的access_token值
- 使用上面获取的access_token,在已经打开的浏览器chatgpt页面中执行脚本,请求消息发送接口,向chatgpt发送消息,并等待回复
无头浏览器过cloudflare的可行方案
以下方案均是基于python,不清楚其他语言是否有相似方案
0、不可行方案:selenium
url:https://github.com/SeleniumHQ/selenium
直接通过selenium是不可行的。cloudflare会对原始的selenium浏览器检测。只用selenium打开cloudflare验证页面会在点击验证那一步无限循环,不能到达正确页面。
但是有基于selenium的可行方案。
1、undetected_chromedriver
url:https://github.com/ultrafunkamsterdam/undetected-chromedriver
该方案是基于selenium实现的可行方案,可以和selenium的各种api一起使用,可以顺利通过cloudflare的验证。但是只支持chrome浏览器。
2、DrissionPage
url:https://github.com/g1879/DrissionPage
该方案从3.0开始就不依赖selenium了,据作者描述,底层由作者自己实现,没有webdrive特征,不会被cloudfalre拦截,可以顺利通过cloudfalre验证。同样只支持chrome浏览器。
session接口
这个接口用于获取session_token和access_token。其中session_token会直接更新在cookie中,而access_token则需要放在请求接口的header中。
该接口不用请求,登录成功后,在浏览器日志中寻找请求记录取值即可
URL: https://chat.openai.com/api/auth/session
方法:GET
header:不需要请求,不需要用到header
结果解析:
{
"user": {
"id": "user-zz9OzztZZZ5P33FRjWhyRl4V",
"name": "[email protected]",
"email": "[email protected]",
"image": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
"picture": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
"mfa": false,
"groups": [],
"intercom_hash": "99db78f5c3a99cda99a99999d5cde3cb9999ebc2b47fd033eb5ecf817191c999"
},
"expires": "2023-05-19T13:58:30.012Z",
"accessToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ik1UaEVOVUpHTkQQQQQQQQQQQQQQpkQ05UZzVNRFUxUlRVd1FVSkRNRU13UmtGRVFrRXpSZyJ9.eyJodHRwczovL2FwaS5vcGVuYWkuY29tL3Byb2ZpbGUiOnsiZW1haWwiOiJ3YW5nb25jQHFxLmNvbSIsImVtYWlsX3ZlcmlmaWVkIjp0cnVlfSwiaHR0cHM6Ly9hcGkub3BlbmFpLmNvbS9hdXRoIjp7InVzZXJfaWQiOiJ1c2VyLWZvOU9tbXRBR041UDMzRlJqV2h5Uzz0ViJ9LCJpc3MiOiJodHRwczovL2F1dGgwLm9wZW5haS5jb20vIiwic3ViIjoiYXV0aDB8NjM4ZGNjZjQzYTI0YzE0ZGM5NTBlODlmIiwiYXVkIjpbImh0dHBzOi8vYXBpLm9wZW5haS5jb20vdjEiLCJodHRwczovL29wZW5haS5vcGVuYWkuYXV0aDBhcHAuY29tL3VzZXJzzzZvIl0sImlhdCI6MTY4MDc3MzgxMiwiZXhwIjoxNjgxOTgzNDEyLCJhenAiOiJUZEpJY2JlMTZXb1RIdE45NW55eXdoNUU0eU9vNkl0RyIsInNjb3BlIjoib3BlbmlkIHByb2ZpbGUgZW1haWwgbW9kZWwucmVhZCBtb2RlbC5yZXF1ZXN0IG9yZ2FuaXphdGlvbi5yZWFkIG9mZzzzzzVfYWNjZXNzIn0.Jp0t2YC2U7pBlk-eL1YhdCObaq9SP-fsXv-1IHrW7vjsBLIvR58CUCqikeRTyYG5zz5GFKHxbPyVkcRE7g4OklUqyJH3OXHRmFtAv-bbbOx1bSabmTe9cuk0zzzMkLv7EPUAhJGUWzzzlMSJJI93_6U8vXb0rxEtOcHECIVzZdXmeBsW8nd9mSXZPLANIhp9FxAgIUrZw_ZvHVl7491-5pLBemNg-wHy5MAmFamL51Lp5VvUuQBDWUTcE13DyNHJQgTQg92sD2u7BbbmXVQ0BxV4b3CnGipEgT1OEkq2SfZiz7aOdqAZGwoxbXBopSZ4Tx9VYI2Zz-6f7GAIx8YdQ"
}
在上述结构中,access_token位于accessToken字段中,直接获取即可
在使用时,需要放在header的authorization字段中,并且需要在前面加上前缀Bearer
session_token字段在返回报文的header中的set-cookie中,会自动更新到浏览器cookie中,不用解析。
消息发送接口
URL:https://chat.openai.com/backend-api/conversation
方法:POST
header:
只添加下面两个字段即可,其余字段浏览器会自行添加
authorization:Bearer + access_token (中间需要空格)
Content-type:application/json
接口参数:
{
"action": "next",
"conversation_id": "e9ed4b3c-e1b3-41fa-a2f0-ef8e6afcf790", // 会话id,给定父消息所在的会话id;如果父消息id是随机生成的,则不用添加该字段,会自动创建一个新的会话
"messages": [
{
"id": "6054e5aa-0a28-46d8-8f11-ecf9b2e209d1", // 当前消息id,随机生成一个uuid
"author": {
"role": "user"
},
"content": {
"content_type": "text",
"parts": [
"hello" // 消息内容,不用分段,直接写即可
]
}
}
],
"parent_message_id": "9450f0bb-4067-a454-7c85374d5b19", // 父消息id,即想要回复哪条消息。如果是创建新绘画则随机生成一个uuid即可
"model": "text-davinci-002-render-sha", // 模型名称,免费账户和付费账户的chatgpt默认免费的都是这个模型
"timezone_offset_min": -480,
"variant_purpose": "none"
}
在上述结构中,
$.messages[0].content.parts[0]是消息内容,即需要发给chatgpt的话语,不用分段,直接完整给定即可
$.model是模型名称,其中text-davinci-002-render-sha是免费账户和付费账户的chatgpt默认模型,只不过付费账户比免费账户响应更快。而对于付费用户来说,还存在两个模型:text-davinci-002-render-paid是付费用户的老gpt模型,即默认模型更慢的版本、gpt-4也就是GPT-4,目前最先进的chatgpt模型,仅供付费账户限量使用
$.messages[0].id是当前消息的id,随机生成一个uuid即可
$.conversation_id是会话id,给定父消息所在的会话id;如果父消息id是随机生成的,则不用添加该字段,会自动创建一个新的会话
$.parent_message_id是父消息id,如果需要回复某一条消息(也就是接着某一条消息发送),需要给定其消息id,此时必须指定其对话id,在这种情况下,如果当前消息已经存在过回复,则会基于新的回复重新生成回答(也就是相当于网页中的重新生成或修改某一条问题的功能);如果需要创建新会话,则随机生成一个uuid即可
结果解析:
该接口的结果可以通过流式响应获取。流式响应时,每次的响应都是一个json,逐渐返回chatgpt回复的话语。但是使用浏览器脚本的方式请求进行流式响应比较麻烦,所以也可以进行常规请求响应。
对于非流式请求来说,接口会一次性返回所有数据。数据中由\n\n来分割每次的流式请求的返回结果,最后由data: [DONE]做结尾,如下所示
data: {"message": {"id": "0d305da1-17b1-496f-9cf5-763280c504fe", "author": {"role": "system", "name": null, "metadata": {}}, "create_time": 1681979732.600759, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": true, "weight": 1.0, "metadata": {}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "a6501981-7075-4f76-85c4-ee701182f36b", "author": {"role": "user", "name": null, "metadata": {}}, "create_time": 1681979732.602395, "update_time": null, "content": {"content_type": "text", "parts": ["How do I make an HTTP request in Javascript?"]}, "end_turn": null, "weight": 1.0, "metadata": {"timestamp_": "absolute", "message_type": null}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
......
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n .then(response => response.json())\n .then(data => console.log(data))\n .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n console.log(xhr.responseText);\n } else {\n console.error('Error:', xhr.status);\n }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n .then(response => response.json())\n .then(data => console.log(data))\n .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n console.log(xhr.responseText);\n } else {\n console.error('Error:', xhr.status);\n }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n .then(response => response.json())\n .then(data => console.log(data))\n .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n console.log(xhr.responseText);\n } else {\n console.error('Error:', xhr.status);\n }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": true, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}
data: [DONE]
每一部分数据都是由data: {}组成,每一部分都是页面中相应的一部分消息。所以在把每一段内容分隔开后,除去最后的data: [DONE]。倒数第二段就包含了完整的回复内容,所以只需要解析倒数第二段即可。
其中包含完整回复的一段json结构如下
{
"message": {
"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867",
"author": {
"role": "assistant",
"name": null,
"metadata": {}
},
"create_time": 1681979733.721916,
"update_time": null,
"content": {
"content_type": "text",
"parts": [
"You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n .then(response => response.json())\n .then(data => console.log(data))\n .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n console.log(xhr.responseText);\n } else {\n console.error('Error:', xhr.status);\n }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."
]
},
"end_turn": true,
"weight": 1.0,
"metadata": {
"message_type": "next",
"model_slug": "text-davinci-002-render-sha",
"finish_details": {
"type": "stop",
"stop": "<|im_end|>"
}
},
"recipient": "all"
},
"conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03",
"error": null
}
在上述结构中包含了回复的消息以及相关的信息,其中有用的字段如下:
$.message.content.parts[0]:回复的消息内容,为markdown格式
$.message.id: 当前消息的id,后续回复当前消息继续对话需要该值
$.conversation_id: 当前对话的id,继续对话需要给定当前字段(但是如果是新会话该值其实是我们生成的)
接口的调用方式
接口不能直接在代码中请求,会触发cloudfalre的ssl指纹校验,最终会返回403,无法得到数据
所以需要通过执行浏览器脚本的方式模拟当前网站发送请求,并解析数据。
示例脚本如下:
var xhr = new XMLHttpRequest();
xhr.open("POST", "path/url/test", false);
xhr.setRequestHeader('Authorization', 'Bearer xxxxxxxxxxxxxxxx');
xhr.setRequestHeader('Content-type', 'application/json');
xhr.send('{"action":"next","messages":["qweqweqwe"]}');
var re = {"responseText":xhr.responseText,"responseHeaders":xhr.getAllResponseHeaders(),"status":xhr.status};
return JSON.stringify(re);
上述脚本用post方法在当前页面请求了path/url/test路径,并且在header中添加了Authorization和Content-type字段,最后发送了一个json字符串。
最后把返回数据做简单的封装,返回给python脚本。
需要注意的是在不手动指定Content-type的情况下,浏览器会自行匹配一个Content-type,但是有些时候Content-type不正确会导致请求失败报错,所以手动指定Content-type稳妥一些
注意事项
- 如果短时间内连续对话(即不停的向服务端请求接口),则一般不会触发403;如果对话时间稍微边长,则有可能触发403,此时需要刷新页面重新进行cloudfalre校验流程
- 同一个账号在多个浏览器环境下登录或在不同ip下同时登陆或切换ip等行为均有可能触发403
- 同一个ip不能请求过于频繁(如登录多个账号),否则会返回429。(同时发送多条消息也会触发429)
