Crawler


request

URI: A system for identifying pieces of information on the network.
HTTP Methods: HTTP/1.1 defines 8 core methods for requesting a URI: `OPTIONS`, `GET`, `HEAD`, `POST`, `PUT`, `DELETE`, `TRACE`, `CONNECT`. These notes focus on the most commonly used one: `GET`.
HTTP Headers: The headers are additional data sent by the user agent to give more context about the transaction going on between the client and the server. Some of them will help the server reply in the most appropriate way.
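
To make the three terms above concrete, here is a minimal sketch that builds a `GET` request with custom headers using `requests` and inspects it without sending anything. The URL and the `User-Agent` string are placeholders chosen for illustration, not values from these notes.

```python
import requests

# Placeholder values, assumed for illustration only.
url = "https://example.com/"
headers = {
    "User-Agent": "my-crawler/0.1 (learning project)",
    "Accept": "text/html",
}

# Build the request and prepare it; nothing is sent over the network here.
prepared = requests.Request("GET", url, headers=headers).prepare()

print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")

# To actually send it, one would typically write:
# response = requests.get(url, headers=headers, timeout=5)
```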

Building a dictionary from a string: use the json module. Reference: "Python 如何将字符串转为字典", VincentZhu, 博客园 (cnblogs.com)
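
A small sketch of the string-to-dictionary conversion, assuming the string holds header-like data (the sample strings are made up). `json.loads` needs valid JSON with double-quoted keys; for a string written as a Python literal with single quotes, `ast.literal_eval` is one safe alternative.

```python
import ast
import json

# Valid JSON: keys and string values use double quotes.
raw_json = '{"User-Agent": "my-crawler/0.1", "Accept": "text/html"}'
headers = json.loads(raw_json)
print(headers["Accept"])

# A Python-literal string with single quotes is not valid JSON,
# but ast.literal_eval can parse it without executing arbitrary code.
raw_literal = "{'page': 1, 'lang': 'en'}"
params = ast.literal_eval(raw_literal)
print(params["page"])
```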
Exceptions raised by requests: learn how to handle them with try-except.
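
One possible shape for this, as a sketch rather than a definitive pattern: wrap the call in try-except and catch `requests.exceptions.RequestException`, the base class of the exceptions that `requests` raises. The helper name `fetch` and the 5-second timeout are assumptions for illustration.

```python
import requests


def fetch(url, timeout=5.0):
    """Return the response body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception as well.
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers invalid URLs, connection errors, timeouts, and HTTP errors.
        print(f"Request failed: {exc}")
        return None
    return response.text


# A string without a scheme is rejected before any network access happens.
print(fetch("missing-scheme.example"))
```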
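How to check whether the input is a valid URL
- Ideas: check whether it contains com, cn, or www; verify in advance that the site can be reached; match with a regular expression (though it is unclear how to handle URLs that contain neither com nor www)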
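
Another option, sketched here under the assumption that only `http` and `https` URLs are of interest: parse the input with `urllib.parse.urlparse` and require a scheme and a host. This avoids depending on com or www being present. The function name `looks_like_url` is made up for this example, and the check is purely syntactic; whether the site is reachable can only be confirmed by sending a request.

```python
from urllib.parse import urlparse


def looks_like_url(text):
    """Syntactic check: http(s) scheme plus a non-empty host."""
    try:
        parts = urlparse(text.strip())
    except ValueError:
        # urlparse can raise on malformed input such as a broken IPv6 host.
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


for candidate in ["https://example.org/page", "http://localhost:8000",
                  "www.example.com", "not a url"]:
    print(candidate, looks_like_url(candidate))
```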
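The `text` output of a requests response (a property, not a method) is formatted inconsistently: some pages come back neatly laid out with line breaks, others as one long line. The text is whatever the server sent, so minified pages arrive as a single line.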

The BeautifulSoup library
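
A minimal sketch of what BeautifulSoup can do with a page that arrives as a single line, assuming the `bs4` package is installed. The HTML string is made up; in a crawler it would come from the `text` of a response.

```python
from bs4 import BeautifulSoup

# Made-up single-line HTML, standing in for a downloaded page.
html = ('<html><head><title>Example</title></head>'
        '<body><a href="/a">A</a><a href="/b">B</a></body></html>')

# "html.parser" is the parser from the standard library.
soup = BeautifulSoup(html, "html.parser")

title = soup.title.string
links = [a["href"] for a in soup.find_all("a")]

print(title)
print(links)
# prettify() re-indents the markup, one tag per line.
print(soup.prettify())
```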

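Ways to disguise your IP address:
1. Use a proxy server: the proxy forwards your requests to the target site, so the site sees the proxy's IP address instead of yours. You can buy access to a proxy server or use a free one.
2. Use the Tor network: Tor is an anonymity network that hides your IP address so you can browse anonymously. You can use it by downloading Tor Browser.
3. Modify the hosts file: you can edit the hosts file by hand to map the target site's domain name to a different IP address. Note that this only changes how your own machine resolves the name; it does not hide your IP address from the site, and mapping the domain to a non-existent address just makes the site unreachable.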
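
For the proxy approach, a rough sketch with `requests`: a `Session` accepts a `proxies` mapping, and requests sent through that session are routed via the proxy. The proxy address below is a hypothetical placeholder, not a working server.

```python
import requests

# Hypothetical proxy address; replace with a proxy you are allowed to use.
PROXY_URL = "http://127.0.0.1:8080"

session = requests.Session()
# Route both plain and TLS traffic through the same proxy.
session.proxies.update({"http": PROXY_URL, "https": PROXY_URL})

print(session.proxies)

# Requests made through this session would then go via the proxy, e.g.:
# response = session.get("https://example.com/", timeout=5)
```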