Problem description
I'm hooked on using Python and NetworkX for analyzing graphs, and as I learn more I want to use more and more data (guess I'm becoming a data junkie :-). Eventually I think my NetworkX graph (which is stored as a dict of dicts) will exceed the memory on my system. I know I can probably just add more memory, but I was wondering whether there is a way to integrate NetworkX with HBase or a similar solution instead?
I looked around and couldn't really find anything, and I also couldn't find anything about using a simple MySQL backend.
Is this possible? Does anything exist to allow for connectivity to some kind of persistent storage?
Thanks!
Update: I remember seeing this subject in 'Social Network Analysis for Startups'. The author talks about other storage methods (including HBase, S3, etc.) but does not show how to do this or whether it's possible.
Recommended answer
There are two general types of containers for storing graphs:
true graph databases: e.g., Neo4J, agamemnon, GraphDB, and AllegroGraph. These not only store a graph but also understand what a graph is, so you can query the database directly, e.g., how many nodes lie on the shortest path between node X and node Y?
static graph containers: Twitter's MySQL-adapted FlockDB is the best-known example here. These DBs can store and retrieve graphs just fine, but to query the graph itself you first have to retrieve it from the DB and then use a library (e.g., Python's excellent NetworkX) to run the query.
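To make the second category concrete, here is a minimal sketch of "retrieve first, then query in the client". It assumes the graph has already been pulled out of storage as a dict of dicts; the function name `shortest_path_length` and the toy graph are illustrative only (in practice NetworkX would do this for you).

```python
from collections import deque

def shortest_path_length(adj, source, target):
    """Breadth-first search over a dict of dicts; returns the hop count or None."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor, connected in adj[node].items():
            if connected and neighbor not in seen:
                if neighbor == target:
                    return dist + 1
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None  # target is unreachable from source

# Toy graph standing in for data fetched from the store: X - A - B - Y
adj = {
    "X": {"A": 1},
    "A": {"X": 1, "B": 1},
    "B": {"A": 1, "Y": 1},
    "Y": {"B": 1},
}
hops = shortest_path_length(adj, "X", "Y")
print(hops - 1)  # nodes strictly between X and Y: 2 (A and B)
```

A true graph database would answer the same question on the server side, without shipping the whole graph to the client.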
The redis-based graph container I discuss below is in the second category, though apparently redis is also well suited to containers in the first category, as evidenced by redis-graph, a remarkably small Python package for implementing a graph database in redis.
redis will work beautifully here.
Redis is a heavy-duty, durable data store suitable for production use, yet it's also simple enough to use for command-line analysis.
Redis differs from other databases in that it has multiple data structure types; the one I would recommend here is the hash data type. This redis data structure lets you very closely mimic a "list of dictionaries", a conventional schema for storing graphs, in which each item in the list is a dictionary of edges keyed to the node from which those edges originate.
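As a rough illustration of that schema, the sketch below builds one edge dictionary per node from an edge list, using plain Python only. The helper name `edges_to_rows` and the sample edges are assumptions made for this example; each inner dictionary is what would become the field/value mapping of one redis hash, keyed by the node name.

```python
def edges_to_rows(nodes, edges):
    """Build one {neighbor: 0 or 1} dict per node from an undirected edge list."""
    rows = {u: {v: 0 for v in nodes} for u in nodes}
    for u, v in edges:
        rows[u][v] = 1  # edge u -> v
        rows[v][u] = 1  # undirected, so mirror it
    return rows

nodes = ["n1", "n2", "n3", "n4"]
edges = [("n1", "n2"), ("n1", "n3"), ("n2", "n4")]
rows = edges_to_rows(nodes, edges)
print(rows["n2"])  # {'n1': 1, 'n2': 0, 'n3': 0, 'n4': 1}
```

Storing a full row per node is simple but grows quadratically with the node count; for sparse graphs it may be preferable to store only the fields whose value is 1.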
You first need to install redis and the Python client. The DeGizmo Blog has an excellent "up-and-running" tutorial which includes a step-by-step guide to installing both.
Once redis and its Python client are installed, start a redis server like so:
cd to the directory in which you installed redis (/usr/local/bin on 'nix if you installed via make install); next
type redis-server at the shell prompt, then press Enter
you should now see the server log tailing in your shell window
>>> import numpy as NP
>>> import networkx as NX
>>> # start a redis client & connect to the server:
>>> from redis import StrictRedis as redis
>>> r1 = redis(db=1, host="localhost", port=6379)
In the snippet below, I store a four-node graph; each line calls hmset on the redis client and stores one node together with the edges connected to that node ("0" => no edge, "1" => edge). (In practice, of course, you would abstract these repetitive calls into a function; here I'm showing each call because it's likely easier to understand that way.)
>>> r1.hmset("n1", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
True
>>> r1.hmset("n2", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
True
>>> r1.hmset("n3", {"n1": 1, "n2": 0, "n3": 0, "n4": 1})
True
>>> r1.hmset("n4", {"n1": 0, "n2": 1, "n3": 1, "n4": 1})
True
>>> # retrieve the edges for a given node:
>>> r1.hgetall("n2")
{'n1': '1', 'n2': '0', 'n3': '0', 'n4': '1'}
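The repetitive calls above could be wrapped in a small helper along these lines. This is only a sketch: `store_graph` is a hypothetical name, and `FakeHashClient` is a made-up in-memory stand-in (not part of redis-py) so that the example runs without a server. With a real client you would pass `r1` instead. Note that recent redis-py releases deprecate `hmset` in favor of `hset(name, mapping=...)`.

```python
class FakeHashClient:
    """Hypothetical in-memory stand-in for a redis client, for illustration only."""

    def __init__(self):
        self.store = {}

    def hmset(self, name, mapping):
        # redis stores hash values as strings, so mimic that here
        row = self.store.setdefault(name, {})
        row.update({field: str(value) for field, value in mapping.items()})
        return True

    def hgetall(self, name):
        return dict(self.store.get(name, {}))


def store_graph(client, rows):
    """Write each node's edge dict as one hash; return the number of nodes written."""
    for node, edge_row in rows.items():
        client.hmset(node, edge_row)
    return len(rows)


# The same four-node graph as in the calls above
rows = {
    "n1": {"n1": 0, "n2": 1, "n3": 1, "n4": 1},
    "n2": {"n1": 1, "n2": 0, "n3": 0, "n4": 1},
    "n3": {"n1": 1, "n2": 0, "n3": 0, "n4": 1},
    "n4": {"n1": 0, "n2": 1, "n3": 1, "n4": 1},
}
client = FakeHashClient()
written = store_graph(client, rows)
print(written, client.hgetall("n2"))
```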
Now that the graph is persisted, retrieve it from the redis DB as a NetworkX graph.
There are many ways to do this; below I do it in two steps:
extract the data from the redis database into an adjacency matrix, implemented as a 2D NumPy array; then
convert that directly to a NetworkX graph using a NetworkX built-in function:
Reduced to code, these two steps are:
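>>> # build the adjacency matrix, one row per node (keys sorted so row order is stable):
>>> AM = NP.array([list(map(int, r1.hgetall(node).values())) for node in sorted(r1.keys("*"))])
>>> # now convert this adjacency matrix back to a networkx graph
>>> # (NetworkX 3.0 removed from_numpy_matrix; use NX.from_numpy_array(AM) there):
>>> G = NX.from_numpy_matrix(AM)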
>>> # verify that G in fact holds the original graph:
>>> type(G)
<class 'networkx.classes.graph.Graph'>
>>> G.nodes()
[0, 1, 2, 3]
>>> G.edges()
[(0, 1), (0, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
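The one-liner above relies on the order in which redis returns keys and hash fields. A slightly more defensive sketch, shown here in plain Python, fixes one node order and uses it for both rows and columns. The function name `adjacency_from_hashes` is hypothetical, and the input is assumed to be a dict mapping each node to the decoded string dict that `hgetall` returned for it.

```python
def adjacency_from_hashes(hashes):
    """Return (node_order, matrix) with rows and columns in the same sorted order."""
    order = sorted(hashes)
    matrix = [[int(hashes[u].get(v, 0)) for v in order] for u in order]
    return order, matrix


# Values as they come back from redis: strings, not ints
hashes = {
    "n1": {"n1": "0", "n2": "1", "n3": "1", "n4": "1"},
    "n2": {"n1": "1", "n2": "0", "n3": "0", "n4": "1"},
    "n3": {"n1": "1", "n2": "0", "n3": "0", "n4": "1"},
    "n4": {"n1": "0", "n2": "1", "n3": "1", "n4": "1"},
}
order, matrix = adjacency_from_hashes(hashes)
print(order)
for row in matrix:
    print(row)
```

The resulting nested list can be passed to `NP.array` and then to the NetworkX conversion function, and `order` tells you which node name each integer index in `G.nodes()` corresponds to.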
When you end a redis session, you can shut down the server from the client like so:
>>> r1.shutdown()
redis saves to disk just before it shuts down so this is a good way to ensure all writes were persisted.
So where is the redis DB? It is stored in the default location under the default file name: dump.rdb, in the directory from which the server was started (the stock redis.conf sets dir to ./).
To change this, edit the redis.conf file (included with the redis source distribution); go to the line starting with:
# The filename where to dump the DB
dbfilename dump.rdb
Change dump.rdb to anything you wish, but leave the .rdb extension in place.
Next, to change the file path, find this line in redis.conf:
# Note that you must specify a directory here, not a file name
The line below that is the directory location for the redis database. Edit it so that it gives the location you want. Save your revisions and rename this file, but keep the .conf extension. You can store this config file anywhere you wish; just provide its full path and name on the same line when you start a redis server.
So the next time you start a redis server, you must do it like so (from the shell prompt):
$> cd /usr/local/bin # or the directory in which you installed redis
$> redis-server /path/to/redis.conf
Finally, the Python Package Index lists a package specifically for implementing a graph database in redis. The package is called redis-graph; I have not used it.