Problem description
What are the best options for scraping a web page that is not currently open in a tab, from within a Google Chrome extension, using JavaScript and whatever other technologies are available? Other JavaScript libraries are also acceptable.
The important thing is to mask the scraping so that it behaves like a normal web request, with no indications of AJAX or XMLHttpRequest such as the X-Requested-With: XMLHttpRequest or Origin headers.
The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most probably as a string.
Are there any hooks in any WebKit- or Chrome-specific APIs that can be used to make a normal web request and get the result for manipulation?
var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections
Bonus points for making this work from a local file on disk, for initial debugging. But if that is the only thing standing in the way of a solution, disregard the bonus points.
Recommended answer
Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
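The approach above could be sketched roughly as follows. This is only an illustration, not the code from the gist: the names loadDocument and deps are hypothetical, and support for responseType = "document" is detected here by checking whether the assignment sticks (an assumption that differs from the gist's response === null check). The Content-Type header is trimmed to its bare MIME type because DOMParser does not accept parameters such as charset.

```javascript
// Sketch: load a URL as a DOM document.
// loadDocument and deps are hypothetical names, not part of any Chrome API.
function loadDocument(url, onDone, onError, deps) {
  // deps allows injecting replacements (e.g. for testing); defaults are browser globals.
  deps = deps || {};
  var XHR = deps.XMLHttpRequest || XMLHttpRequest;

  var xhr = new XHR();
  xhr.open("GET", url, true);

  // Assumption: an unsupported responseType value is ignored on assignment,
  // so reading the property back tells us whether "document" is supported.
  xhr.responseType = "document";
  var nativeDocument = xhr.responseType === "document";

  xhr.onload = function () {
    if (nativeDocument) {
      // In document mode, response is null when the body could not be parsed.
      if (xhr.response) {
        onDone(xhr.response);
      } else {
        onError(new Error("Response could not be parsed as a document"));
      }
      return;
    }
    // Fallback: parse the text ourselves with DOMParser.
    var header = xhr.getResponseHeader("Content-Type") || "text/html";
    var mimeType = header.split(";")[0].trim(); // drop "; charset=..." parameters
    var Parser = deps.DOMParser || DOMParser;
    onDone(new Parser().parseFromString(xhr.responseText, mimeType));
  };

  xhr.onerror = function () {
    onError(new Error("Network error while loading " + url));
  };

  xhr.send();
}
```

The resulting document can then be passed to jQuery in the same way as in the question, for example $(doc).find('.item'). Cross-origin requests from an extension also require matching host permissions in the manifest.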
Use the Chrome WebRequest API to hide X-Requested-With and similar headers.
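A minimal sketch of the header-hiding step, assuming a Manifest V2-style extension with the "webRequest" and "webRequestBlocking" permissions plus host permissions for the target site. The names HIDDEN_HEADERS and stripHeaders and the https://example.com/* URL pattern are placeholders for illustration only.

```javascript
// Hypothetical list of header names to hide (compared in lower case).
var HIDDEN_HEADERS = ["x-requested-with", "origin"];

// Pure helper: returns a copy of requestHeaders without the hidden ones.
function stripHeaders(requestHeaders, hiddenNames) {
  return requestHeaders.filter(function (header) {
    return hiddenNames.indexOf(header.name.toLowerCase()) === -1;
  });
}

// Register the listener only where the extension API is actually available.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      // Returning requestHeaders replaces the headers that will be sent.
      return {
        requestHeaders: stripHeaders(details.requestHeaders, HIDDEN_HEADERS)
      };
    },
    { urls: ["https://example.com/*"] }, // placeholder URL pattern
    ["blocking", "requestHeaders"]
  );
}
```

Newer Chrome versions may additionally require "extraHeaders" in the last argument before headers such as Origin can be seen or removed, and blocking webRequest listeners are restricted under Manifest V3, so treat this as a starting point to check against the current documentation.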