基于无头浏览器的chatgpt爬虫方案

整体流程

初始化并启动无头浏览器，关于无头浏览器可用方案见无头浏览器过cloudflare的可行方案
通过无头浏览器打开https://chat.openai.com/ ，等待网页加载完毕后，如果网页源码中存在window._cf_chl_opt，则说明触发了cloudflare，需要进入cloudflare验证流程，否则直接进入5的登录流程
等待网页检查，页面加载完成后，点击id为ctp-checkbox-label的人机验证框，等待后续网页加载。
如果加载完成后仍旧存在人机验证框，则说明可能验证失败，无法通过cloudflare的人机验证；如果页面成功跳转到了登录chatgpt的登录页面，则说明通过了cloudlfare的检查，可以进入登录流程
寻找页面中class_name为btn-primary、文字为Log in的按钮，点击进入输入邮箱页面
找到元素id为username的邮箱输入框，输入邮箱，然后找到class_name是_button-login-id的按钮，点击，等待页面加载后继续输入密码
找到元素id为password的密码输入框，输入密码，然后找到class_name是_button-login-password的按钮，点击，等待页面加载
如果账号和密码都输入正确，此时chatgpt已经成功登陆了
读取浏览器日志，chatgpt会在登录过程中请求api/auth/session接口来获取session_token和access_token，读取日志中的这条请求，获取返回数据中的access_token值
使用上面获取的access_token，在已经打开的浏览器chatgpt页面中执行脚本，请求消息发送接口，向chatgpt发送消息，并等待回复

无头浏览器过cloudflare的可行方案

以下方案均是基于python，不清楚其他语言是否有相似方案

0、不可行方案：selenium

url：https://github.com/SeleniumHQ/selenium

直接通过selenium是不可行的。cloudflare会对原始的selenium浏览器检测。只用selenium打开cloudflare验证页面会在点击验证那一步无限循环，不能到达正确页面。

但是有基于selenium的可行方案。

1、undetected_chromedriver

url:https://github.com/ultrafunkamsterdam/undetected-chromedriver

该方案是基于selenium实现的可行方案，可以和selenium的各种api一起使用，可以顺利通过cloudflare的验证。但是只支持chrome浏览器。

2、DrissionPage

url:https://github.com/g1879/DrissionPage

该方案从3.0开始就不依赖selenium了，据作者描述，底层由作者自己实现，没有webdrive特征，不会被cloudfalre拦截，可以顺利通过cloudfalre验证。同样只支持chrome浏览器。

session接口

这个接口用于获取session_token和access_token。其中session_token会直接更新在cookie中，而access_token则需要放在请求接口的header中。

该接口不用请求，登录成功后，在浏览器日志中寻找请求记录取值即可

URL: https://chat.openai.com/api/auth/session

方法：GET

header：不需要请求，不需要用到header

结果解析：

{
    "user": {
        "id": "user-zz9OzztZZZ5P33FRjWhyRl4V",
        "name": "[email protected]",
        "email": "[email protected]",
        "image": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
        "picture": "https://s.gravatar.com/avatar/99f9ca0e99e62fac1f3f25a17d28f999?s=480&r=pg&d=https%3A%2F%2Fcdn.auth0.com%2Favatars%2Fwa.png",
        "mfa": false,
        "groups": [],
        "intercom_hash": "99db78f5c3a99cda99a99999d5cde3cb9999ebc2b47fd033eb5ecf817191c999"
    },
    "expires": "2023-05-19T13:58:30.012Z",
    "accessToken": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCIsImtpZCI6Ik1UaEVOVUpHTkQQQQQQQQQQQQQQpkQ05UZzVNRFUxUlRVd1FVSkRNRU13UmtGRVFrRXpSZyJ9.eyJodHRwczovL2FwaS5vcGVuYWkuY29tL3Byb2ZpbGUiOnsiZW1haWwiOiJ3YW5nb25jQHFxLmNvbSIsImVtYWlsX3ZlcmlmaWVkIjp0cnVlfSwiaHR0cHM6Ly9hcGkub3BlbmFpLmNvbS9hdXRoIjp7InVzZXJfaWQiOiJ1c2VyLWZvOU9tbXRBR041UDMzRlJqV2h5Uzz0ViJ9LCJpc3MiOiJodHRwczovL2F1dGgwLm9wZW5haS5jb20vIiwic3ViIjoiYXV0aDB8NjM4ZGNjZjQzYTI0YzE0ZGM5NTBlODlmIiwiYXVkIjpbImh0dHBzOi8vYXBpLm9wZW5haS5jb20vdjEiLCJodHRwczovL29wZW5haS5vcGVuYWkuYXV0aDBhcHAuY29tL3VzZXJzzzZvIl0sImlhdCI6MTY4MDc3MzgxMiwiZXhwIjoxNjgxOTgzNDEyLCJhenAiOiJUZEpJY2JlMTZXb1RIdE45NW55eXdoNUU0eU9vNkl0RyIsInNjb3BlIjoib3BlbmlkIHByb2ZpbGUgZW1haWwgbW9kZWwucmVhZCBtb2RlbC5yZXF1ZXN0IG9yZ2FuaXphdGlvbi5yZWFkIG9mZzzzzzVfYWNjZXNzIn0.Jp0t2YC2U7pBlk-eL1YhdCObaq9SP-fsXv-1IHrW7vjsBLIvR58CUCqikeRTyYG5zz5GFKHxbPyVkcRE7g4OklUqyJH3OXHRmFtAv-bbbOx1bSabmTe9cuk0zzzMkLv7EPUAhJGUWzzzlMSJJI93_6U8vXb0rxEtOcHECIVzZdXmeBsW8nd9mSXZPLANIhp9FxAgIUrZw_ZvHVl7491-5pLBemNg-wHy5MAmFamL51Lp5VvUuQBDWUTcE13DyNHJQgTQg92sD2u7BbbmXVQ0BxV4b3CnGipEgT1OEkq2SfZiz7aOdqAZGwoxbXBopSZ4Tx9VYI2Zz-6f7GAIx8YdQ"
}

在上述结构中，access_token位于accessToken字段中，直接获取即可
在使用时，需要放在header的authorization字段中，并且需要在前面加上前缀Bearer

session_token字段在返回报文的header中的set-cookie中，会自动更新到浏览器cookie中，不用解析。

消息发送接口

URL:https://chat.openai.com/backend-api/conversation

方法：POST

header：

只添加下面两个字段即可，其余字段浏览器会自行添加

authorization：Bearer + access_token （中间需要空格）
Content-type：application/json

接口参数：


{
    "action": "next",
    "conversation_id": "e9ed4b3c-e1b3-41fa-a2f0-ef8e6afcf790", // 会话id，给定父消息所在的会话id；如果父消息id是随机生成的，则不用添加该字段，会自动创建一个新的会话
    "messages": [
        {
            "id": "6054e5aa-0a28-46d8-8f11-ecf9b2e209d1", // 当前消息id，随机生成一个uuid
            "author": {
                "role": "user"
            },
            "content": {
                "content_type": "text",
                "parts": [
                    "hello" // 消息内容，不用分段，直接写即可
                ]
            }
        }
    ],
        "parent_message_id": "9450f0bb-4067-a454-7c85374d5b19", // 父消息id，即想要回复哪条消息。如果是创建新绘画则随机生成一个uuid即可

  "model": "text-davinci-002-render-sha", // 模型名称，免费账户和付费账户的chatgpt默认免费的都是这个模型
    "timezone_offset_min": -480,
    "variant_purpose": "none"
}

在上述结构中，
$.messages[0].content.parts[0]是消息内容，即需要发给chatgpt的话语，不用分段，直接完整给定即可
$.model是模型名称，其中text-davinci-002-render-sha是免费账户和付费账户的chatgpt默认模型，只不过付费账户比免费账户响应更快。而对于付费用户来说，还存在两个模型：text-davinci-002-render-paid是付费用户的老gpt模型，即默认模型更慢的版本、gpt-4也就是GPT-4，目前最先进的chatgpt模型，仅供付费账户限量使用
$.messages[0].id是当前消息的id，随机生成一个uuid即可
$.conversation_id是会话id，给定父消息所在的会话id；如果父消息id是随机生成的，则不用添加该字段，会自动创建一个新的会话
$.parent_message_id是父消息id，如果需要回复某一条消息（也就是接着某一条消息发送），需要给定其消息id，此时必须指定其对话id，在这种情况下，如果当前消息已经存在过回复，则会基于新的回复重新生成回答（也就是相当于网页中的重新生成或修改某一条问题的功能）；如果需要创建新会话，则随机生成一个uuid即可

结果解析：

该接口的结果可以通过流式响应获取。流式响应时，每次的响应都是一个json，逐渐返回chatgpt回复的话语。但是使用浏览器脚本的方式请求进行流式响应比较麻烦，所以也可以进行常规请求响应。

对于非流式请求来说，接口会一次性返回所有数据。数据中由\n\n来分割每次的流式请求的返回结果，最后由data: [DONE]做结尾,如下所示

data: {"message": {"id": "0d305da1-17b1-496f-9cf5-763280c504fe", "author": {"role": "system", "name": null, "metadata": {}}, "create_time": 1681979732.600759, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": true, "weight": 1.0, "metadata": {}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "a6501981-7075-4f76-85c4-ee701182f36b", "author": {"role": "user", "name": null, "metadata": {}}, "create_time": 1681979732.602395, "update_time": null, "content": {"content_type": "text", "parts": ["How do I make an HTTP request in Javascript?"]}, "end_turn": null, "weight": 1.0, "metadata": {"timestamp_": "absolute", "message_type": null}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": [""]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha"}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

......

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console"]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": null, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: {"message": {"id": "98f47b88-cc58-4d68-ae52-08f5d3efa867", "author": {"role": "assistant", "name": null, "metadata": {}}, "create_time": 1681979733.721916, "update_time": null, "content": {"content_type": "text", "parts": ["You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."]}, "end_turn": true, "weight": 1.0, "metadata": {"message_type": "next", "model_slug": "text-davinci-002-render-sha", "finish_details": {"type": "stop", "stop": "<|im_end|>"}}, "recipient": "all"}, "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03", "error": null}

data: [DONE]

每一部分数据都是由data: {}组成，每一部分都是页面中相应的一部分消息。所以在把每一段内容分隔开后，除去最后的data: [DONE]。倒数第二段就包含了完整的回复内容，所以只需要解析倒数第二段即可。

其中包含完整回复的一段json结构如下

{
    "message": {
        "id": "98f47b88-cc58-4d68-ae52-08f5d3efa867",
        "author": {
            "role": "assistant",
            "name": null,
            "metadata": {}
        },
        "create_time": 1681979733.721916,
        "update_time": null,
        "content": {
            "content_type": "text",
            "parts": [
                "You can make an HTTP request in Javascript using the built-in `fetch()` function or the older `XMLHttpRequest` object.\n\nUsing `fetch()`, you can make an HTTP request to a server and receive a response using promises. Here's an example:\n\n```\nfetch('https://example.com/api/data')\n  .then(response => response.json())\n  .then(data => console.log(data))\n  .catch(error => console.error(error));\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the parsed JSON data to the console. If there is an error, it is also logged to the console.\n\nUsing `XMLHttpRequest`, you can make HTTP requests to a server and receive a response using callbacks. Here's an example:\n\n```\nconst xhr = new XMLHttpRequest();\nxhr.open('GET', 'https://example.com/api/data');\nxhr.onreadystatechange = function() {\n  if (xhr.readyState === XMLHttpRequest.DONE && xhr.status === 200) {\n    console.log(xhr.responseText);\n  } else {\n    console.error('Error:', xhr.status);\n  }\n};\nxhr.send();\n```\n\nThis code sends an HTTP GET request to the URL `https://example.com/api/data`, receives a response, and logs the response text to the console if the response is successful. If there is an error, the status code is logged to the console."
            ]
        },
        "end_turn": true,
        "weight": 1.0,
        "metadata": {
            "message_type": "next",
            "model_slug": "text-davinci-002-render-sha",
            "finish_details": {
                "type": "stop",
                "stop": "<|im_end|>"
            }
        },
        "recipient": "all"
    },
    "conversation_id": "3f5ad3cc-f00b-4845-af1d-325923cf7d03",
    "error": null
}

在上述结构中包含了回复的消息以及相关的信息，其中有用的字段如下:

$.message.content.parts[0]：回复的消息内容，为markdown格式
$.message.id: 当前消息的id，后续回复当前消息继续对话需要该值
$.conversation_id: 当前对话的id，继续对话需要给定当前字段（但是如果是新会话该值其实是我们生成的）

接口的调用方式

接口不能直接在代码中请求，会触发cloudfalre的ssl指纹校验，最终会返回403，无法得到数据
所以需要通过执行浏览器脚本的方式模拟当前网站发送请求，并解析数据。

示例脚本如下：

var xhr = new XMLHttpRequest();    
xhr.open("POST", "path/url/test", false);    
xhr.setRequestHeader('Authorization', 'Bearer xxxxxxxxxxxxxxxx');
xhr.setRequestHeader('Content-type', 'application/json');
xhr.send('{"action":"next","messages":["qweqweqwe"]}');
var re = {"responseText":xhr.responseText,"responseHeaders":xhr.getAllResponseHeaders(),"status":xhr.status};
return JSON.stringify(re);

上述脚本用post方法在当前页面请求了path/url/test路径，并且在header中添加了Authorization和Content-type字段，最后发送了一个json字符串。
最后把返回数据做简单的封装，返回给python脚本。
需要注意的是在不手动指定Content-type的情况下，浏览器会自行匹配一个Content-type，但是有些时候Content-type不正确会导致请求失败报错，所以手动指定Content-type稳妥一些

注意事项

如果短时间内连续对话（即不停的向服务端请求接口），则一般不会触发403；如果对话时间稍微边长，则有可能触发403，此时需要刷新页面重新进行cloudfalre校验流程
同一个账号在多个浏览器环境下登录或在不同ip下同时登陆或切换ip等行为均有可能触发403
同一个ip不能请求过于频繁（如登录多个账号），否则会返回429。（同时发送多条消息也会触发429）