Problem description
I would like to split a text into single words using PHP. Do you have any idea how to achieve this?
My approach:
function tokenizer($text) {
    $text = trim(strtolower($text));
    $punctuation = '/[^a-z0-9äöüß-]/';
    $result = preg_split($punctuation, $text, -1, PREG_SPLIT_NO_EMPTY);
    for ($i = 0; $i < count($result); $i++) {
        $result[$i] = trim($result[$i]);
    }
    return $result; // contains the single words
}
$text = 'This is an example text, it contains commas and full-stops. Exclamation marks, too! Question marks? All punctuation marks you know.';
print_r(tokenizer($text));
Is this a good approach? Do you have any idea for improvement?
Thanks in advance!
Recommended answer
Use the class \p{P}, which matches any Unicode punctuation character, combined with the \s whitespace class.
$result = preg_split('/((^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$))/', $text, -1, PREG_SPLIT_NO_EMPTY);
This splits on any run of one or more whitespace characters and also absorbs any punctuation immediately surrounding it. It additionally matches punctuation at the very beginning or end of the string. As a result, it distinguishes cases such as "don't", where the apostrophe sits inside the word and is kept, from "he said 'ouch!'", where the quotes and exclamation mark border whitespace or the end of the string and are dropped.
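To see the behaviour on a concrete input, here is a minimal sketch that wraps the same pattern in a function. The name split_words is hypothetical, and the u modifier is an addition not present in the answer: it is assumed here so that PCRE treats both pattern and input as UTF-8, which matters for words such as "Grüße".

```php
<?php
// Hypothetical helper (not from the original answer): wraps the
// punctuation-aware preg_split call in a reusable function.
function split_words(string $text): array
{
    // Three alternatives: punctuation at the start of the string,
    // whitespace with any adjacent punctuation, punctuation at the end.
    // The trailing "u" modifier is an assumption added for UTF-8 input.
    $pattern = '/(^\p{P}+)|(\p{P}*\s+\p{P}*)|(\p{P}+$)/u';

    $words = preg_split($pattern, $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split returns false on failure (for example invalid UTF-8).
    return $words === false ? [] : $words;
}

print_r(split_words("He said 'ouch!' and I don't care."));
print_r(split_words("Grüße, Welt!"));
```

The apostrophe in "don't" survives because it touches neither whitespace nor a string boundary, while the quotes around 'ouch!' are consumed together with the neighbouring spaces. Unlike the tokenizer in the question, this sketch does not lowercase the input; that step can be added separately if needed.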