Problem description
My aim is to create a hashmap with a String as the key, and the entry values as a HashSet of Strings.
Output
The output currently looks like this:
Hudson+(surname)=[Q2720681], Hudson,+Quebec=[Q141445], Hudson+(given+name)=[Q5928530], Hudson,+Colorado=[Q2272323], Hudson,+Illinois=[Q2672022], Hudson,+Indiana=[Q2710584], Hudson,+Ontario=[Q5928505], Hudson,+Buenos+Aires+Province=[Q10298710], Hudson,+Florida=[Q768903]]
According to my idea, it should look like this:
[Hudson+(surname)=[Q2720681,Q141445,Q5928530,Q2272323,Q2672022]]
The purpose is to store a particular name in Wikidata and then all of the Q values associated with its disambiguation. For example, this is the page for "Bush".
I want Bush to be the key, and then for all of the different points of departure, all of the different ways that Bush could be associated with a terminal page of Wikidata, I want to store the corresponding "Q value", or unique alphanumeric identifier.
What I'm actually doing is trying to scrape the different names (values) from the Wikipedia disambiguation page and then look up the unique alphanumeric identifier associated with each value in Wikidata.
For example, with Bush we have:
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
The corresponding Q values are:
George H. W. Bush (Q23505)
George W. Bush (Q207)
Jeb Bush (Q221997)
Bush family (Q2743830)
Bush (Q1484464)
My idea is that the data structure should be laid out as follows:
Key: Bush
Entry set: Q23505, Q207, Q221997, Q2743830, Q1484464
But the code I have now doesn't do that.
It creates a separate entry for each name and Q value, i.e.
Key: Jeb Bush
Entry set: Q221997
Key: George W. Bush
Entry set: Q207
and so on.
The full code in all its glory can be seen on my GitHub page, but I'll summarize it below also.
This is what I'm using to add values to my data structure:
// add Q values to their arrayList in the hash map at the index of the appropriate entity
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
This is how I fetch the content:
while ((line_by_line = wiki_data_pagecontent.readLine()) != null)
{
    // if we can determine it's a disambig page we need to send it off to get all
    // the possible senses in which it can be used.
    Pattern disambig_pattern = Pattern.compile("<div class=\"wikibase-entitytermsview-heading-description \">Wikipedia disambiguation page</div>");
    Matcher disambig_indicator = disambig_pattern.matcher(line_by_line);
    if (disambig_indicator.matches())
    {
        //off to get the different usages
        Wikipedia_Disambig_Fetcher.all_possibilities( variable_entity );
    }
    else
    {
        //get the Q value off the page by matching
        Pattern q_page_pattern = Pattern.compile("<!-- wikibase-toolbar --><span class=\"wikibase-toolbar-container\"><span class=\"wikibase-toolbar-item " +
            "wikibase-toolbar \">\\[<span class=\"wikibase-toolbar-item wikibase-toolbar-button wikibase-toolbar-button-edit\"><a " +
            "href=\"/wiki/Special:SetSiteLink/(.*?)\">edit</a></span>\\]</span></span>");
        Matcher match_Q_component = q_page_pattern.matcher(line_by_line);
        if ( match_Q_component.matches() )
        {
            String Q = match_Q_component.group(1);
            // 'Q' should be appended to an array, since each entity can hold multiple
            // Q values on that basis of disambig
            put_to_hash( variable_entity, Q );
        }
    }
}
And this is how I deal with a disambiguation page:
public static void all_possibilities( String variable_entity ) throws Exception
{
    System.out.println("this is a disambig page");
    //if it's a disambig page we know we can go right to the wikipedia
    //get its normal wiki disambig page
    Document docx = Jsoup.connect( "https://en.wikipedia.org/wiki/" + variable_entity ).get();
    //this can handle the less structured ones.
    Elements linx = docx.select( "p:contains(" + variable_entity + ") ~ ul a:eq(0)" );
    for (Element linq : linx)
    {
        System.out.println(linq.text());
        String linq_nospace = linq.text().replace(' ', '+');
        Wikidata_Q_Reader.getQ( linq_nospace );
    }
}
I was thinking maybe I could pass the Key value around, but I really don't know. I'm kind of stuck. Maybe someone can see how I can implement this functionality.
Recommended answer
I'm not clear from your question what isn't working, or if you're seeing actual errors. But, while your basic data structure idea (HashMap of String to Set<String>) is sound, there's a bug in the "add" function.
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key))
    {
        return q_valMap.put(key, new HashSet<String>() );
    }
    HashSet<String> list = q_valMap.get(key);
    list.add(value);
    return q_valMap.put(key, list);
}
In the case where a key is seen for the first time (if (!q_valMap.containsKey(key))), it vivifies a new HashSet for that key, but it doesn't add value to it before returning. (And the returned value is the old value for that key, so it'll be null.) So you're going to lose the first Q-value seen for every key.
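To make the lost value visible, here is a minimal sketch that reproduces the original logic in isolation. The class name BuggyPutDemo and the method name buggyPut are made up for illustration; the body mirrors put_to_hash from the question.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

class BuggyPutDemo {
    // Stand-in for the question's q_valMap.
    static final Map<String, HashSet<String>> qValMap = new HashMap<>();

    // Same logic as the original put_to_hash: on the first call for a key
    // an empty set is stored and the method returns without adding the value.
    static HashSet<String> buggyPut(String key, String value) {
        if (!qValMap.containsKey(key)) {
            // Map.put returns the previous mapping, which is null here.
            return qValMap.put(key, new HashSet<String>());
        }
        HashSet<String> list = qValMap.get(key);
        list.add(value);
        return qValMap.put(key, list);
    }

    public static void main(String[] args) {
        HashSet<String> first = buggyPut("Bush", "Q23505"); // value is dropped
        buggyPut("Bush", "Q207");                           // only this one is kept
        System.out.println(first);               // prints null
        System.out.println(qValMap.get("Bush")); // prints [Q207]
    }
}
```

Q23505 never reaches the set, which matches the description above.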
For multi-layered data structures like this, I usually special-case just the vivification of the intermediate structure, and then do the adding and return in a single code path. I think this would fix it. (I'm also going to call it valSet because it's a set and not a list. And there's no need to re-add the set to the map each time; it's a reference type and gets added the first time you encounter that key.)
public static HashSet<String> put_to_hash(String key, String value)
{
    if (!q_valMap.containsKey(key)) {
        q_valMap.put(key, new HashSet<String>());
    }
    HashSet<String> valSet = q_valMap.get(key);
    valSet.add(value);
    return valSet;
}
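On Java 8 or later, the same "vivify, then add in one code path" pattern can probably be written more compactly with Map.computeIfAbsent. The following is only a sketch under that assumption; the class QValueIndex and its method names are hypothetical and not part of the original code.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class QValueIndex {
    private final Map<String, Set<String>> qValMap = new HashMap<>();

    // computeIfAbsent creates and stores the set only when the key is new,
    // then returns the set mapped to the key either way.
    Set<String> put(String key, String value) {
        Set<String> valSet = qValMap.computeIfAbsent(key, k -> new HashSet<>());
        valSet.add(value);
        return valSet;
    }

    // Returns the set for a key, or an empty set if the key is unknown.
    Set<String> get(String key) {
        Set<String> valSet = qValMap.get(key);
        return valSet == null ? Collections.<String>emptySet() : valSet;
    }

    public static void main(String[] args) {
        QValueIndex index = new QValueIndex();
        index.put("Bush", "Q23505");
        index.put("Bush", "Q207");
        index.put("Bush", "Q207"); // duplicate, ignored by the set
        System.out.println(index.get("Bush").size()); // prints 2
    }
}
```

Because the values live in a set, adding the same Q value twice leaves the entry unchanged.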
Also be aware that the Set you return is a reference to the live Set for that key, so you need to be careful about modifying it in callers, and if you're doing multithreading you're going to have concurrent access issues.
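One possible way to guard against both concerns, sketched here with hypothetical names (SafeQValueIndex, put, get), is to keep the sets in a ConcurrentHashMap and hand callers a read-only view. This is an assumption about what the calling code needs, not something the original program does.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class SafeQValueIndex {
    private final Map<String, Set<String>> qValMap = new ConcurrentHashMap<>();

    // The map and the per-key sets are both safe for concurrent use.
    void put(String key, String value) {
        qValMap.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(value);
    }

    // Callers get a read-only view, so they cannot modify the live set.
    Set<String> get(String key) {
        Set<String> valSet = qValMap.get(key);
        return valSet == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(valSet);
    }

    public static void main(String[] args) {
        SafeQValueIndex index = new SafeQValueIndex();
        index.put("Bush", "Q207");
        try {
            index.get("Bush").add("Q1"); // rejected by the read-only view
        } catch (UnsupportedOperationException e) {
            System.out.println("callers cannot modify the stored set");
        }
    }
}
```

The view is still backed by the live set, so later additions through put remain visible to anyone holding it; copy the set instead if a snapshot is needed.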
Or just use a Guava Multimap so you don't have to worry about writing the implementation yourself.
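On the question's idea of passing the key around: the posted all_possibilities calls getQ with each linked page title, so each title presumably becomes its own key. A sketch of threading the original entity through as the key might look like the following. Everything here is hypothetical: the two-argument getQ, the list parameter on allPossibilities, and the fakeWikidata map standing in for the real page scrape.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class KeyPassingSketch {
    static final Map<String, Set<String>> qValMap = new HashMap<>();

    // Hypothetical stand-in for the Wikidata lookup; the real code scrapes a page.
    static final Map<String, String> fakeWikidata = new HashMap<>();
    static {
        fakeWikidata.put("George+W.+Bush", "Q207");
        fakeWikidata.put("Jeb+Bush", "Q221997");
    }

    // Hypothetical two-argument variant: 'key' is the original entity,
    // 'page' is the disambiguated title whose Q value is looked up.
    static void getQ(String key, String page) {
        String q = fakeWikidata.get(page);
        if (q != null) {
            qValMap.computeIfAbsent(key, k -> new HashSet<>()).add(q);
        }
    }

    // Every linked page is stored under the same original key.
    static void allPossibilities(String key, List<String> linkedTitles) {
        for (String title : linkedTitles) {
            getQ(key, title.replace(' ', '+'));
        }
    }

    public static void main(String[] args) {
        allPossibilities("Bush", Arrays.asList("George W. Bush", "Jeb Bush"));
        System.out.println(qValMap.keySet());           // prints [Bush]
        System.out.println(qValMap.get("Bush").size()); // prints 2
    }
}
```

With the key carried separately from the page title, all Q values from the disambiguation page end up in a single entry.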