Question
I'm working with a program that outputs a database of results. I have hundreds of these databases, all identical in structure, and I'd like to combine them into one big database. I'm mostly interested in one table from each database. I don't work with databases/SQL very much, but skipping the CSV output would simplify other steps in the process.
Previously I did this by exporting a CSV from each database and using these steps to combine all the CSVs:
library(DBI)
library(RSQLite)
library(dplyr)
csv_locs <- list.files(newdir, recursive = TRUE, pattern = "\\.csv$", full.names = TRUE)
pic_dat <- do.call("rbind", lapply(csv_locs,
  FUN = function(files) data.table::fread(files, data.table = FALSE)))
How can I do this with SQL-type database tables?
I'm basically pulling out the first table, then joining on the rest with a loop:
db_locs <- list.files(directory, recursive = TRUE, pattern = "\\.ddb$", full.names = TRUE)
# first table
con1 <- DBI::dbConnect(RSQLite::SQLite(), db_locs[1])
start <- tbl(con1, "DataTable")
# open connection to location[i], get table, union, disconnect; repeat.
for (i in 2:length(db_locs)) {
  con <- DBI::dbConnect(RSQLite::SQLite(), db_locs[i])
  y <- tbl(con, "DataTable")
  start <- union(start, y, copy = TRUE)
  dbDisconnect(con)
}
This is exceptionally slow! Well, to be fair, it's large data, and the CSV route is also slow.
I honestly think I wrote the slowest possible way to do this :) I could not get the do.call/lapply option to work here, but maybe I'm missing something.
Answer
這看起來類似于迭代rbind
幀",因為每次你這樣做union
,它會將整個表復制到一個新對象中(未經證實,但這是我的直覺).這可能對少數人有效,但擴展性很差.我建議您將所有表收集到一個列表中,并在最后調用 data.table::rbindlist
一次,然后插入到一個表中.
This looks similar to "iterative rbind
ing of frames", in that each time you do this union
, it will copy the entire table into a new object (unconfirmed, but that's my gut feeling). This might work well for a few but scales very poorly. I suggest you collect all tables in a list and call data.table::rbindlist
once at the end, then insert into a table.
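The difference is easy to see with a toy sketch (`frames` here is a hypothetical stand-in for your per-database tables, not data from the question):

```r
# 100 small frames standing in for the tables read from each database
frames <- replicate(100, data.frame(x = rnorm(10)), simplify = FALSE)

# Iterative: each pass copies everything accumulated so far -> O(n^2) copying
out <- frames[[1]]
for (f in frames[-1]) out <- rbind(out, f)

# Collect-then-bind: one allocation at the end
out2 <- data.table::rbindlist(frames)
```

Both produce the same 1000-row result; only the second stays fast as the number of frames grows.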
Without your data, I'll contrive a situation. And because I'm not entirely certain whether you have just one table per sqlite3 file, I'll add two tables per database. If you only have one, the solution simplifies easily.
for (i in 1:3) {
  con <- DBI::dbConnect(RSQLite::SQLite(), sprintf("mtcars_%d.sqlite3", i))
  DBI::dbWriteTable(con, "mt1", mtcars[1:3, 1:3])
  DBI::dbWriteTable(con, "mt2", mtcars[4:5, 4:7])
  DBI::dbDisconnect(con)
}
(lof <- list.files(pattern = "*.sqlite3", full.names = TRUE))
# [1] "./mtcars_1.sqlite3" "./mtcars_2.sqlite3" "./mtcars_3.sqlite3"
現在我將遍歷它們并讀取表格的內容
Now I'll iterate over each them and read the contents of a table
allframes <- lapply(lof, function(fn) {
  con <- DBI::dbConnect(RSQLite::SQLite(), fn)
  # NULL if the table is absent from this file
  mt1 <- tryCatch(DBI::dbReadTable(con, "mt1"), error = function(e) NULL)
  mt2 <- tryCatch(DBI::dbReadTable(con, "mt2"), error = function(e) NULL)
  DBI::dbDisconnect(con)
  list(mt1 = mt1, mt2 = mt2)
})
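As noted above, if each file holds only the one table of interest, the reader collapses to a single dbReadTable per connection. A sketch, reusing `lof` and the `mt1` table from the contrived example:

```r
# One table per file: each list element is a data.frame, ready for rbindlist
allframes <- lapply(lof, function(fn) {
  con <- DBI::dbConnect(RSQLite::SQLite(), fn)
  on.exit(DBI::dbDisconnect(con), add = TRUE)  # disconnect even on error
  DBI::dbReadTable(con, "mt1")
})
```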
allframes
# [[1]]
# [[1]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
# [[1]]$mt2
# hp drat wt qsec
# 1 110 3.08 3.215 19.44
# 2 175 3.15 3.440 17.02
# [[2]]
# [[2]]$mt1
# mpg cyl disp
# 1 21.0 6 160
# 2 21.0 6 160
# 3 22.8 4 108
### ... repeated
From here, just combine them in R and write to a new database. While you can use do.call(rbind, ...) or dplyr::bind_rows, you already mentioned data.table, so I'll stick with that:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all.sqlite3")
DBI::dbWriteTable(con, "mt1", data.table::rbindlist(lapply(allframes, `[[`, 1)))
DBI::dbWriteTable(con, "mt2", data.table::rbindlist(lapply(allframes, `[[`, 2)))
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
DBI::dbDisconnect(con)
In the event that you can't load them all into R at one time, append them to the table as you go:
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all2.sqlite3")
for (fn in lof) {
  con2 <- DBI::dbConnect(RSQLite::SQLite(), fn)
  mt1 <- tryCatch(DBI::dbReadTable(con2, "mt1"), error = function(e) NULL)
  if (!is.null(mt1)) DBI::dbWriteTable(con, "mt1", mt1, append = TRUE)
  mt2 <- tryCatch(DBI::dbReadTable(con2, "mt2"), error = function(e) NULL)
  if (!is.null(mt2)) DBI::dbWriteTable(con, "mt2", mt2, append = TRUE)
  DBI::dbDisconnect(con2)
}
DBI::dbGetQuery(con, "select count(*) as n from mt1")
# n
# 1 9
This doesn't suffer from the iterative slowdown you're experiencing.
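As an aside (not part of the original answer), SQLite can also do the append entirely in SQL via ATTACH, so the rows never pass through R. A sketch, again assuming the `mt1` table and the `lof` vector from above:

```r
con <- DBI::dbConnect(RSQLite::SQLite(), "mtcars_all3.sqlite3")
for (fn in lof) {
  # Attach the source file under the alias "src"
  DBI::dbExecute(con, sprintf("ATTACH DATABASE '%s' AS src", fn))
  # Create the target table with matching columns on the first pass only
  DBI::dbExecute(con, "CREATE TABLE IF NOT EXISTS mt1 AS SELECT * FROM src.mt1 WHERE 0")
  # Copy the rows inside SQLite
  DBI::dbExecute(con, "INSERT INTO mt1 SELECT * FROM src.mt1")
  DBI::dbExecute(con, "DETACH DATABASE src")
}
DBI::dbDisconnect(con)
```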