Question
I'm writing a very basic Dataflow pipeline using the Python SDK v0.5.5. The pipeline uses a BigQuerySource with a query passed in, which queries BigQuery tables in datasets that reside in the EU.
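For reference, here is a minimal sketch of the pipeline; the project, dataset, table and field names are placeholders, and the exact import/option paths may differ slightly across 0.5.x releases:

import apache_beam as beam

# Hypothetical project; the EU-located dataset in the query is the relevant part.
p = beam.Pipeline(argv=['--project', 'XXXXX'])

(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(
     query='SELECT some_field FROM [XXXXX:my_eu_dataset.my_table]'))
 | 'process' >> beam.Map(lambda row: row))  # identity step, just to consume the rows

p.run()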
When executing the pipeline I'm getting the following error (project name anonymized):
HttpError: HttpError accessing <https://www.googleapis.com/bigquery/v2/projects/XXXXX/queries/93bbbecbc470470cb1bbb9c22bd83e9d?alt=json&maxResults=10000>: response: <{'status': '400', 'content-length': '292', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'transfer-encoding': 'chunked', 'expires': 'Thu, 09 Feb 2017 10:28:04 GMT', 'vary': 'Origin, X-Origin', 'server': 'GSE', '-content-encoding': 'gzip', 'cache-control': 'private, max-age=0', 'date': 'Thu, 09 Feb 2017 10:28:04 GMT', 'x-frame-options': 'SAMEORIGIN', 'alt-svc': 'quic=":443"; ma=2592000; v="35,34"', 'content-type': 'application/json; charset=UTF-8'}>, content <{
"error": {
"errors": [
{
"domain": "global",
"reason": "invalid",
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
],
"code": 400,
"message": "Cannot read and write in different locations: source: EU, destination: US"
}
}
The error also occurs when specifying a project, dataset and table name instead of a query. However, there's no error when selecting data from the publicly available datasets (which reside in the US, like shakespeare). I also have jobs running v0.4.4 of the SDK that don't hit this error.
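For completeness, the table-based variant I also tried looks roughly like this (project, dataset and table names are again placeholders):

rows = p | 'read_table' >> beam.io.Read(beam.io.BigQuerySource(
    project='XXXXX', dataset='my_eu_dataset', table='my_table'))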
The difference between these versions is the creation of a temp dataset, as is shown by the warning at pipeline startup:
WARNING:root:Dataset does not exist so we will create it
I've briefly looked at the different versions of the SDK, and the difference seems to be around this temp dataset. It looks like the current version creates a temp dataset by default with its location set to the US (taken from master):
- dataset creation
- default dataset location
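A standalone illustration of that default (using the separate google-cloud-bigquery client rather than the SDK's own code; names are hypothetical): a dataset created without an explicit location ends up in the US, so reading EU tables into it fails with exactly the error above.

from google.cloud import bigquery

client = bigquery.Client(project='XXXXX')
dataset = bigquery.Dataset('XXXXX.temp_dataset_for_reads')  # hypothetical temp dataset
# No dataset.location is set, so BigQuery defaults the dataset to the US multi-region.
# A query whose source tables are in the EU and whose destination table lives here
# then fails with "Cannot read and write in different locations".
# Setting dataset.location = 'EU' before creation would avoid the mismatch.
client.create_dataset(dataset)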
I haven't found a way to disable the creation of these temp datasets. Am I overlooking something, or is this indeed not working anymore when selecting data from EU datasets?
Answer
Thanks for reporting this issue. I assume you are using DirectRunner. We changed the implementation of BigQuery read transform for DirectRunner to create a temporary dataset (for SDK versions 0.5.1 and later) to support large datasets. Seems like we are not setting the region correctly here. We'll look into fixing this.
This issue should not occur if you use the DataflowRunner, which creates temporary datasets in the correct region.
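As a rough sketch (bucket, job and dataset names are placeholders; in the 0.5.x SDK the runner may still be called BlockingDataflowPipelineRunner rather than DataflowRunner), running the same pipeline on the Dataflow service would look something like this:

import apache_beam as beam

p = beam.Pipeline(argv=[
    '--project', 'XXXXX',
    '--runner', 'DataflowRunner',                 # BlockingDataflowPipelineRunner on older SDKs
    '--staging_location', 'gs://my-bucket/staging',
    '--temp_location', 'gs://my-bucket/temp',
    '--job_name', 'bq-eu-read',
])

(p
 | 'read' >> beam.io.Read(beam.io.BigQuerySource(
     query='SELECT some_field FROM [XXXXX:my_eu_dataset.my_table]'))
 | 'process' >> beam.Map(lambda row: row))

p.run()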