基于无头浏览器的chatgpt爬虫方案

基于无头浏览器的chatgpt爬虫方案

整体流程

  1. 初始化并启动无头浏览器,关于无头浏览器可用方案见无头浏览器过cloudflare的可行方案
  2. 通过无头浏览器打开https://chat.openai.com/ ,等待网页加载完毕后,如果网页源码中存在window._cf_chl_opt,则说明触发了cloudflare,需要进入cloudflare验证流程,否则直接进入5的登录流程
  3. 等待网页检查,页面加载完成后,点击id为ctp-checkbox-label的人机验证框,等待后续网页加载。
  4. 如果加载完成后仍旧存在人机验证框,则说明可能验证失败,无法通过cloudflare的人机验证;如果页面成功跳转到了登录chatgpt的登录页面,则说明通过了cloudlfare的检查,可以进入登录流程
  5. 寻找页面中class_name为btn-primary、文字为Log in的按钮,点击进入输入邮箱页面
  6. 找到元素id为username的邮箱输入框,输入邮箱,然后找到class_name是_button-login-id的按钮,点击,等待页面加载后继续输入密码
  7. 找到元素id为password的密码输入框,输入密码,然后找到class_name是_button-login-password的按钮,点击,等待页面加载
  8. 如果账号和密码都输入正确,此时chatgpt已经成功登陆了
  9. 读取浏览器日志,chatgpt会在登录过程中请求api/auth/session接口来获取session_tokenaccess_token,读取日志中的这条请求,获取返回数据中的access_token
  10. 使用上面获取的access_token,在已经打开的浏览器chatgpt页面中执行脚本,请求消息发送接口,向chatgpt发送消息,并等待回复

无头浏览器过cloudflare的可行方案

以下方案均是基于python,不清楚其他语言是否有相似方案

0、不可行方案:selenium

url:https://github.com/SeleniumHQ/selenium

直接通过selenium是不可行的。cloudflare会对原始的selenium浏览器检测。只用selenium打开cloudflare验证页面会在点击验证那一步无限循环,不能到达正确页面。

但是有基于selenium的可行方案。

1、undetected_chromedriver

url:https://github.com/ultrafunkamsterdam/undetected-chromedriver

该方案是基于selenium实现的可行方案,可以和selenium的各种api一起使用,可以顺利通过cloudflare的验证。但是只支持chrome浏览器

2、DrissionPage

url:https://github.com/g1879/DrissionPage

该方案从3.0开始就不依赖selenium了,据作者描述,底层由作者自己实现,没有webdrive特征,不会被cloudfalre拦截,可以顺利通过cloudfalre验证。同样只支持chrome浏览器

session接口

这个接口用于获取session_token和access_token。其中session_token会直接更新在cookie中,而access_token则需要放在请求接口的header中。

该接口不用请求,登录成功后,在浏览器日志中寻找请求记录取值即可

URL: https://chat.openai.com/api/auth/session

方法:GET

header:不需要请求,不需要用到header

结果解析

{
    "user": {
        "id": "user-zz9OzztZZZ5P33FRjWhyRl4V",
        "name": "[email protected]",
        "email": "[email protected]",
        "image": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
        "picture": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
        "mfa": false,
        "groups": [],
        "intercom_hash": "99db78f5c3a99cda99a99999d5cde3cb9999ebc2b47fd033eb5ecf817191c999"
    },
    "expires": "2023-05-19T13:58:30.012Z",
    "accessToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ik1UaEVOVUpHTkQQQQQQQQQQQQQQpkQ05UZzVNRFUxUlRVd1FVSkRNRU13UmtGRVFrRXpSZyJ9.eyJodHRwczovL2FwaS5vcGVuYWkuY29tL3Byb2ZpbGUiOnsiZW1haWwiOiJ3YW5nb25jQHFxLmNvbSIsImVtYWlsX3ZlcmlmaWVkIjp0cnVlfSwiaHR0cHM6Ly9hcGkub3BlbmFpLmNvbS9hdXRoIjp7InVzZXJfaWQiOiJ1c2VyLWZvOU9tbXRBR041UDMzRlJqV2h5Uzz0ViJ9LCJpc3MiOiJodHRwczovL2F1dGgwLm9wZW5haS5jb20vIiwic3ViIjoiYXV0aDB8NjM4ZGNjZjQzYTI0YzE0ZGM5NTBlODlmIiwiYXVkIjpbImh0dHBzOi8vYXBpLm9wZW5haS5jb20vdjEiLCJodHRwczovL29wZW5haS5vcGVuYWkuYXV0aDBhcHAuY29tL3VzZXJzzzZvIl0sImlhdCI6MTY4MDc3MzgxMiwiZXhwIjoxNjgxOTgzNDEyLCJhenAiOiJUZEpJY2JlMTZXb1RIdE45NW55eXdoNUU0eU9vNkl0RyIsInNjb3BlIjoib3BlbmlkIHByb2ZpbGUgZW1haWwgbW9kZWwucmVhZCBtb2RlbC5yZXF1ZXN0IG9yZ2FuaXphdGlvbi5yZWFkIG9mZzzzzzVfYWNjZXNzIn0.Jp0t2YC2U7pBlk-eL1YhdCObaq9SP-fsXv-1IHrW7vjsBLIvR58CUCqikeRTyYG5zz5GFKHxbPyVkcRE7g4OklUqyJH3OXHRmFtAv-bbbOx1bSabmTe9cuk0zzzMkLv7EPUAhJGUWzzzlMSJJI93_6U8vXb0rxEtOcHECIVzZdXmeBsW8nd9mSXZPLANIhp9FxAgIUrZw_ZvHVl7491-5pLBemNg-wHy5MAmFamL51Lp5VvUuQBDWUTcE13DyNHJQgTQg92sD2u7BbbmXVQ0BxV4b3CnGipEgT1OEkq2SfZiz7aOdqAZGwoxbXBopSZ4Tx9VYI2Zz-6f7GAIx8YdQ"
}

在上述结构中,access_token位于accessToken字段中,直接获取即可
在使用时,需要放在header的authorization字段中,并且需要在前面加上前缀Bearer

session_token字段在返回报文的header中的set-cookie中,会自动更新到浏览器cookie中,不用解析。

消息发送接口

URL:https://chat.openai.com/backend-api/conversation

方法:POST

header

只添加下面两个字段即可,其余字段浏览器会自行添加

authorization:Bearer + access_token (中间需要空格)
Content-type:application/json

接口参数


{
    "action": "next",
    "conversation_id": "e9ed4b3c-e1b3-41fa-a2f0-ef8e6afcf790", // 会话id,给定父消息所在的会话id;如果父消息id是随机生成的,则不用添加该字段,会自动创建一个新的会话
    "messages": [
        {
            "id": "6054e5aa-0a28-46d8-8f11-ecf9b2e209d1", // 当前消息id,随机生成一个uuid
            "author": {
                "role": "user"
            },
            "content": {
                "content_type": "text",
                "parts": [
                    "hello" // 消息内容,不用分段,直接写即可
                ]
            }
        }
    ],
        "parent_message_id": "9450f0bb-4067-a454-7c85374d5b19", // 父消息id,即想要回复哪条消息。如果是创建新绘画则随机生成一个uuid即可

  "model": "text-davinci-002-render-sha", // 模型名称,免费账户和付费账户的chatgpt默认免费的都是这个模型
    "timezone_offset_min": -480,
    "variant_purpose": "none"
}

在上述结构中,
$.messages[0].content.parts[0]是消息内容,即需要发给chatgpt的话语,不用分段,直接完整给定即可
$.model是模型名称,其中text-davinci-002-render-sha是免费账户和付费账户的chatgpt默认模型,只不过付费账户比免费账户响应更快。而对于付费用户来说,还存在两个模型:text-davinci-002-render-paid是付费用户的老gpt模型,即默认模型更慢的版本、gpt-4也就是GPT-4,目前最先进的chatgpt模型,仅供付费账户限量使用
$.messages[0].id是当前消息的id,随机生成一个uuid即可
$.conversation_id是会话id,给定父消息所在的会话id;如果父消息id是随机生成的,则不用添加该字段,会自动创建一个新的会话
$.parent_message_id是父消息id,如果需要回复某一条消息(也就是接着某一条消息发送),需要给定其消息id,此时必须指定其对话id,在这种情况下,如果当前消息已经存在过回复,则会基于新的回复重新生成回答(也就是相当于网页中的重新生成或修改某一条问题的功能);如果需要创建新会话,则随机生成一个uuid即可

结果解析

该接口的结果可以通过流式响应获取。流式响应时,每次的响应都是一个json,逐渐返回chatgpt回复的话语。但是使用浏览器脚本的方式请求进行流式响应比较麻烦,所以也可以进行常规请求响应。

对于非流式请求来说,接口会一次性返回所有数据。数据中由\n\n来分割每次的流式请求的返回结果,最后由data: [DONE]做结尾,如下所示

data: {"message": {"id": "0d305da1-17b1-496f-9cf5-763280c504fe", "author": {"role": "system", "name": null, "metadata": {}}, "create_time": 1681979732.600759, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": true, "weight": 1.0, "metadata": {}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "a6501981-7075-4f76-85c4-ee701182f36b", "author": {"role": "user", "name": null, "metadata": {}}, "create_time": 1681979732.602395, "update_time": null, "content": {"content_type": "text", "parts": ["How do I make an HTTP request in Javascript?"]}, "end_turn": null, "weight": 1.0, "metadata": {"timestamp_": "absolute", "message_type": null}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

......

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": true, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: [DONE]

每一部分数据都是由data: {}组成,每一部分都是页面中相应的一部分消息。所以在把每一段内容分隔开后,除去最后的data: [DONE]。倒数第二段就包含了完整的回复内容,所以只需要解析倒数第二段即可。

其中包含完整回复的一段json结构如下

{
    "message": {
        "id": "98f47b88-cc58-4d68-ae52-08f5d3efa867",
        "author": {
            "role": "assistant",
            "name": null,
            "metadata": {}
        },
        "create_time": 1681979733.721916,
        "update_time": null,
        "content": {
            "content_type": "text",
            "parts": [
                "You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."
            ]
        },
        "end_turn": true,
        "weight": 1.0,
        "metadata": {
            "message_type": "next",
            "model_slug": "text-davinci-002-render-sha",
            "finish_details": {
                "type": "stop",
                "stop": "<|im_end|>"
            }
        },
        "recipient": "all"
    },
    "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03",
    "error": null
}

在上述结构中包含了回复的消息以及相关的信息,其中有用的字段如下:

$.message.content.parts[0]:回复的消息内容,为markdown格式
$.message.id: 当前消息的id,后续回复当前消息继续对话需要该值
$.conversation_id: 当前对话的id,继续对话需要给定当前字段(但是如果是新会话该值其实是我们生成的)

接口的调用方式

接口不能直接在代码中请求,会触发cloudfalre的ssl指纹校验,最终会返回403,无法得到数据
所以需要通过执行浏览器脚本的方式模拟当前网站发送请求,并解析数据。

示例脚本如下:

var xhr = new XMLHttpRequest();    
xhr.open("POST", "path/url/test", false);    
xhr.setRequestHeader('Authorization', 'Bearer xxxxxxxxxxxxxxxx');
xhr.setRequestHeader('Content-type', 'application/json');
xhr.send('{"action":"next","messages":["qweqweqwe"]}');
var re = {"responseText":xhr.responseText,"responseHeaders":xhr.getAllResponseHeaders(),"status":xhr.status};
return JSON.stringify(re);

上述脚本用post方法在当前页面请求了path/url/test路径,并且在header中添加了Authorization和Content-type字段,最后发送了一个json字符串。
最后把返回数据做简单的封装,返回给python脚本。
需要注意的是在不手动指定Content-type的情况下,浏览器会自行匹配一个Content-type,但是有些时候Content-type不正确会导致请求失败报错,所以手动指定Content-type稳妥一些

注意事项

  1. 如果短时间内连续对话(即不停的向服务端请求接口),则一般不会触发403;如果对话时间稍微边长,则有可能触发403,此时需要刷新页面重新进行cloudfalre校验流程
  2. 同一个账号在多个浏览器环境下登录或在不同ip下同时登陆或切换ip等行为均有可能触发403
  3. 同一个ip不能请求过于频繁(如登录多个账号),否则会返回429。(同时发送多条消息也会触发429)
暂无评论

发送评论 编辑评论


				
|´・ω・)ノ
ヾ(≧∇≦*)ゝ
(☆ω☆)
(╯‵□′)╯︵┴─┴
 ̄﹃ ̄
(/ω\)
∠( ᐛ 」∠)_
(๑•̀ㅁ•́ฅ)
→_→
୧(๑•̀⌄•́๑)૭
٩(ˊᗜˋ*)و
(ノ°ο°)ノ
(´இ皿இ`)
⌇●﹏●⌇
(ฅ´ω`ฅ)
(╯°A°)╯︵○○○
φ( ̄∇ ̄o)
ヾ(´・ ・`。)ノ"
( ง ᵒ̌皿ᵒ̌)ง⁼³₌₃
(ó﹏ò。)
Σ(っ °Д °;)っ
( ,,´・ω・)ノ"(´っω・`。)
╮(╯▽╰)╭
o(*////▽////*)q
>﹏<
( ๑´•ω•) "(ㆆᴗㆆ)
😂
😀
😅
😊
🙂
🙃
😌
😍
😘
😜
😝
😏
😒
🙄
😳
😡
😔
😫
😱
😭
💩
👻
🙌
🖕
👍
👫
👬
👭
🌚
🌝
🙈
💊
😶
🙏
🍦
🍉
😣
Source: github.com/k4yt3x/flowerhd
颜文字
Emoji
小恐龙
花!
上一篇
下一篇