Web Scraping in a Google Chrome Extension (JavaScript + Chrome APIs)

Question

What are the best options for scraping a web page that is not currently open in a tab, from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.

The important thing is to mask the scraping so that it behaves like a normal web request, with no indications of AJAX or XMLHttpRequest such as the X-Requested-With: XMLHttpRequest or Origin headers.

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.

Are there any hooks in any WebKit/Chrome-specific APIs that can be used to make a normal web request and get the results for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing standing in the way of a solution, disregard the bonus points.

Recommended answer

Attempt to use XHR2 with responseType = "document", and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch (https://gist.github.com/1129031). See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
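The approach described here could be sketched roughly as follows. This is a minimal illustration, not the code from the linked gists: getPageContent takes its name from the question's placeholder, mimeFromContentType is a hypothetical helper, and the feature check (reading responseType back after assigning it) is a simpler stand-in for the blob-based detection mentioned above. It assumes the code runs in an extension page where XMLHttpRequest and DOMParser exist and the extension holds host permissions for the target URL.

```javascript
// MIME types that DOMParser.parseFromString accepts.
const PARSEABLE_TYPES = [
  "text/html",
  "text/xml",
  "application/xml",
  "application/xhtml+xml",
  "image/svg+xml",
];

// Hypothetical helper: DOMParser wants a bare MIME type, so drop parameters
// such as "; charset=utf-8" and fall back to text/html for anything else.
function mimeFromContentType(contentType) {
  const mime = (contentType || "").split(";")[0].trim().toLowerCase();
  return PARSEABLE_TYPES.includes(mime) ? mime : "text/html";
}

// Resolves with a Document for the given URL.
function getPageContent(url) {
  return new Promise((resolve, reject) => {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    // Assumption: engines without "document" support ignore this assignment,
    // leaving responseType at its default empty string.
    xhr.responseType = "document";
    const nativeDocument = xhr.responseType === "document";

    xhr.onload = () => {
      if (nativeDocument) {
        // response is null when the body could not be parsed as a document.
        if (xhr.response) {
          resolve(xhr.response);
        } else {
          reject(new Error("No document in response: " + url));
        }
        return;
      }
      // Fallback path: parse the raw text ourselves.
      const mime = mimeFromContentType(xhr.getResponseHeader("Content-Type"));
      resolve(new DOMParser().parseFromString(xhr.responseText, mime));
    };
    xhr.onerror = () => reject(new Error("Request failed: " + url));
    xhr.send();
  });
}
```

The resolved Document can then be queried directly, for example getPageContent(url).then(doc => doc.querySelectorAll('.item')), which plays the role of $(pageContent).find('.item') in the question. If a string is needed instead, doc.documentElement.outerHTML gives one.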

Use the Chrome WebRequest API to hide headers such as X-Requested-With.
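A minimal sketch of what that might look like in a background page, assuming a Manifest V2 extension that declares the "webRequest" and "webRequestBlocking" permissions plus host permissions for the target site. The stripHeaders helper, the HEADERS_TO_STRIP list, and the https://example.com/* pattern are illustrative placeholders, not part of the original answer. Newer Chrome versions may additionally require "extraHeaders" in the last argument before the Origin header can be modified.

```javascript
// Header names to drop; HTTP header names are case-insensitive,
// so everything is compared in lower case.
const HEADERS_TO_STRIP = ["x-requested-with", "origin"];

// Hypothetical helper: returns a copy of the header list without the
// named headers. Each header is an object of the form { name, value }.
function stripHeaders(requestHeaders, namesToStrip = HEADERS_TO_STRIP) {
  const blocked = new Set(namesToStrip.map((n) => n.toLowerCase()));
  return requestHeaders.filter((h) => !blocked.has(h.name.toLowerCase()));
}

// Register the listener only when the blocking webRequest API is present
// (assumed here: a Manifest V2 background page).
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    (details) => ({ requestHeaders: stripHeaders(details.requestHeaders) }),
    {
      urls: ["https://example.com/*"], // placeholder for the site being scraped
      types: ["xmlhttprequest"],       // only touch XHR traffic
    },
    ["blocking", "requestHeaders"]
  );
}
```

Note that the browser does not add X-Requested-With on its own; libraries such as jQuery set it, so a plain XMLHttpRequest may not carry that header in the first place.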
