Problem Description
My requirement is to
- Move data from Oracle to HDFS
- Process the data on HDFS
- Move processed data to Teradata.
It is also required to do this entire processing every 15 minutes. The volume of source data may be close to 50 GB, and the processed data may be about the same size.
After searching a lot on the internet, I found the following:
- ORAOOP to move data from Oracle to HDFS (have the code within a shell script and schedule it to run at the required interval).
- Do large-scale processing with custom MapReduce, Hive, or Pig.
- SQOOP with the Teradata Connector to move data from HDFS to Teradata (again, have a shell script with the code and then schedule it).
Is this the right approach in the first place, and is it feasible within the required time period (please note that this is not a daily batch or the like)?
Other options that I found are the following:
- STORM (for real-time data processing). But I am not able to find an Oracle spout or a Teradata bolt out of the box.
- Any open source ETL tools like Talend or Pentaho.
Please share your thoughts on these options as well and any other possibilities.
Looks like you have several questions, so let's try to break it down.
Importing into HDFS
It seems you are looking for Sqoop. Sqoop is a tool that lets you easily transfer data into and out of HDFS, and it can connect natively to various databases, including Oracle. Sqoop is compatible with the Oracle JDBC thin driver. Here is how you would transfer from Oracle to HDFS:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --target-dir /path/to/dir
For more information: here and here. Note that you can also import directly into a Hive table with Sqoop, which can be convenient for your analysis.
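As a minimal sketch of that variant (same placeholder connection string, credentials, and source table as above; my_hive_table is a hypothetical target table name), the Hive-backed import could look like this:
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy --table tbl --hive-import --hive-table my_hive_table
With --hive-import, Sqoop loads the data into a Hive table after copying it to HDFS, so the processing step below can query it right away.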
Processing
As you noted, since your data is initially relational, it is a good idea to use Hive for your analysis, since you might be more familiar with SQL-like syntax. Pig is closer to pure relational algebra and its syntax is NOT SQL-like; it is mostly a matter of preference, but both approaches should work fine.
Since you can import data into Hive directly with Sqoop, your data should be directly ready to be processed after it is imported.
In Hive you could run your query and tell it to write the results to HDFS:
hive -e "insert overwrite directory '/path/to/output' select * from mytable ..."
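To make that a bit more concrete for a 15-minute batch (the table and column names below are made up for illustration), the query would typically aggregate or filter the freshly imported rows and leave the result where Sqoop can pick it up:
hive -e "insert overwrite directory '/path/to/output' select customer_id, sum(amount) from my_hive_table group by customer_id"
Keep in mind that Hive writes such directories with its default field delimiter, ^A (\001), so the export step either has to be told about that delimiter or the query has to produce the delimiter the export expects.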
Exporting into Teradata
Cloudera released a Teradata connector for Sqoop last year, as described here, so you should take a look; this looks like exactly what you want. Here is how you would do it:
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx --table MY_DATA --export-dir /path/to/hive/output
The whole thing is definitely doable in whatever time period you want. In the end, what will matter is the size of your cluster: if you want it quick, scale your cluster up as needed. The good thing with Hive and Sqoop is that the processing will be distributed across your cluster, so you have total control over the schedule.
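To tie this back to the 15-minute requirement: one simple way to wire the three steps together is a shell script driven by cron, roughly like the sketch below. All paths, connection strings, and table names are placeholders carried over from the commands above, and in practice you would want to import only the rows added since the last run (for example with Sqoop's --incremental, --check-column, and --last-value options) rather than the full 50 GB every cycle.
#!/bin/bash
# oracle_to_teradata.sh -- hypothetical 15-minute batch; all names below are placeholders
set -e
# 1. Pull the latest data from Oracle straight into a Hive table
sqoop import --connect jdbc:oracle:thin:@myhost:1521/db --username xxx --password yyy \
  --table tbl --hive-import --hive-table my_hive_table --hive-overwrite
# 2. Process it with Hive and write the result to an HDFS directory
hive -e "insert overwrite directory '/path/to/output' select customer_id, sum(amount) from my_hive_table group by customer_id"
# 3. Push the processed result to Teradata; \001 is Hive's default output delimiter
sqoop export --connect jdbc:teradata://localhost/DATABASE=MY_BASE --username sqooptest --password xxxxx \
  --table MY_DATA --export-dir /path/to/output --input-fields-terminated-by '\001'
A crontab entry along these lines would then run it at the required interval:
*/15 * * * * /path/to/oracle_to_teradata.sh >> /var/log/oracle_to_teradata.log 2>&1
Whether a full import-process-export cycle over 50 GB fits in 15 minutes depends on your cluster and on the links to Oracle and Teradata, so measure one cycle end to end before committing to the schedule.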
這篇關于將數據從 oracle 移動到 HDFS,處理并從 HDFS 移動到 Teradata的文章就介紹到這了,希望我們推薦的答案對大家有所幫助,也希望大家多多支持html5模板網!