Problem description
We generate 20,000,000 text files every year, averaging about 250 KB each (35 KB zipped).
We must keep these files in some kind of archive for 10 years. There is no need to search inside the text files, but we must be able to find a single text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing them in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
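To make the proposed layout concrete, here is a minimal runnable sketch of the idea: a few indexed metadata columns next to one blob column holding the compressed file. It rests on several assumptions that are not in the question: sqlite3 stands in for SQL Server (its BLOB type plays the role of varbinary(MAX)), zlib stands in for zipping, and the table, index, and helper names are hypothetical.

```python
import sqlite3
import zlib

# Assumption: sqlite3 is only a stand-in for SQL Server so the sketch runs anywhere;
# BLOB plays the role of varbinary(MAX). Table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE archive (
        id           INTEGER PRIMARY KEY,
        productname  TEXT NOT NULL,
        creationdate TEXT NOT NULL,
        zipped_data  BLOB NOT NULL
    )
    """
)
# Index only the searchable metadata, never the blob itself.
conn.execute("CREATE INDEX ix_archive_meta ON archive (productname, creationdate)")


def store(productname, creationdate, text):
    """Compress one text file and insert it together with its metadata."""
    compressed = zlib.compress(text.encode("utf-8"))
    cur = conn.execute(
        "INSERT INTO archive (productname, creationdate, zipped_data) VALUES (?, ?, ?)",
        (productname, creationdate, compressed),
    )
    return cur.lastrowid


def find(productname, creationdate):
    """Look files up by metadata only, then decompress the matching blobs."""
    rows = conn.execute(
        "SELECT zipped_data FROM archive WHERE productname = ? AND creationdate = ?",
        (productname, creationdate),
    ).fetchall()
    return [zlib.decompress(blob).decode("utf-8") for (blob,) in rows]


store("widget", "2011-03-01", "line of generated text\n" * 1000)
print(len(find("widget", "2011-03-01")), "file(s) found")
```

The point of the sketch is that lookups touch only the metadata columns; the blob is read and decompressed only for rows that match.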
The database will grow huge over the years: 5-10 TB. So I think we need to partition the data, for example by keeping one database per year.
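As a rough check on that size estimate, the figures given above can be multiplied out. This is a back-of-the-envelope sketch only: it assumes decimal units (1 TB = 10^9 KB) and counts blob data alone, ignoring metadata rows, indexes, and storage overhead.

```python
# Figures taken from the question; decimal units are an assumption.
FILES_PER_YEAR = 20_000_000
ZIPPED_KB_PER_FILE = 35
RAW_KB_PER_FILE = 250
YEARS = 10

KB_PER_TB = 1000 ** 3  # 1 TB = 10^9 KB in decimal units


def total_tb(kb_per_file, years=YEARS):
    """Blob volume in TB for the given per-file size and retention period."""
    return FILES_PER_YEAR * kb_per_file * years / KB_PER_TB


zipped_tb = total_tb(ZIPPED_KB_PER_FILE)
raw_tb = total_tb(RAW_KB_PER_FILE)
per_year_tb = total_tb(ZIPPED_KB_PER_FILE, years=1)

print(f"zipped: {per_year_tb} TB per year, {zipped_tb} TB over {YEARS} years")
print(f"unzipped: {raw_tb} TB over {YEARS} years")
```

Zipped, this comes to about 0.7 TB per year and 7 TB over ten years, which sits inside the 5-10 TB range mentioned; unzipped it would be about 50 TB.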
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs larger than 1 MB?
Any other suggestions on how to manage such data volumes?
Recommended answer
FILESTREAM is definitely better suited to larger blobs (750 KB-1 MB and up), because for small files the overhead of opening the external file starts to hurt read and write performance compared with varbinary(MAX) blob storage. If that is not much of an issue (i.e. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable), then it's definitely an option.
I would probably suggest keeping the files directly in a varbinary(MAX) column if you can guarantee they won't get much larger, but have the table's blob data stored in a separate filegroup using the TEXTIMAGE_ON option, which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so that the actual storage of blobs can be split over multiple filegroups, either using partitions or via some multiple-table scheme, so you can scale out to different disks if necessary in the future.
Keeping the blobs directly associated with the SQL metadata, either via FILESTREAM or direct varbinary(MAX) storage, has many advantages over dealing with filesystem/SQL inconsistencies, including but not limited to ease of backup and other management operations.