Problem description
I'm currently working on a problem where certain characters need to be cleaned from strings that exist in a table. Normally I'd do a simple UPDATE with a replace, but in this case there are 32 different characters that need to be removed.
I've done some looking around and I can't find any great solutions for quickly cleaning strings that already exist in a table.
Things I've looked into:
Doing a series of nested replaces
This solution is doable, but for 32 different replaces it would require either some ugly code or hacky dynamic SQL to build a huge series of replaces.
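For illustration, here is a minimal sketch of the nested form, using only the four sample pairs from the #BadChar table further down rather than all 32. The variable @DirtyString is a hypothetical stand-in for the column being cleaned, and the binary collation is an assumption made so that matching is case sensitive.

```sql
-- Hypothetical stand-in for a value from the table being cleaned.
DECLARE @DirtyString nvarchar(20) = N'AB-String-BA';

-- One REPLACE per bad string; the innermost call runs first.
SELECT REPLACE(
           REPLACE(
               REPLACE(
                   REPLACE(@DirtyString COLLATE Latin1_General_BIN, N'A', N'^'),
                   N'B', N'}'),
               N's', N'5'),
           N'-', N' ') AS CleanString;
```

Each additional bad string adds one more level of nesting, which is why 32 of them gets unwieldy.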
PATINDEX and while loops
As seen in this answer, it is possible to mimic a kind of regex replace, but I'm working with a lot of data, so I'm hesitant to trust even the improved solution to run in a reasonable amount of time when the data volume is large.
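A rough sketch of the general PATINDEX/STUFF loop idea, not necessarily the code from the linked answer. It removes characters rather than replacing them, and the variable names and the three-character pattern are assumptions for illustration only.

```sql
-- Hypothetical input value and character class; a leading '-' inside [] is a literal hyphen.
DECLARE @s nvarchar(4000) = N'AB-String-BA';
DECLARE @pattern nvarchar(100) = N'%[-AB]%';

-- Position of the first unwanted character (0 when none is left).
DECLARE @pos int = PATINDEX(@pattern, @s COLLATE Latin1_General_BIN);

WHILE @pos > 0
BEGIN
    -- Delete the single character at @pos, then search again.
    SET @s = STUFF(@s, @pos, 1, N'');
    SET @pos = PATINDEX(@pattern, @s COLLATE Latin1_General_BIN);
END;

SELECT @s AS CleanString;
```

The loop runs once per unwanted character per string, which is the source of my concern about large data volumes.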
Recursive CTE
I tried a CTE approach to this problem, but it didn't run terribly fast once the number of rows got large.
For reference:
CREATE TABLE #BadChar(
    id int IDENTITY(1,1),
    badString nvarchar(10),
    replaceString nvarchar(10)
);

INSERT INTO #BadChar(badString, replaceString) SELECT 'A', '^';
INSERT INTO #BadChar(badString, replaceString) SELECT 'B', '}';
INSERT INTO #BadChar(badString, replaceString) SELECT 's', '5';
INSERT INTO #BadChar(badString, replaceString) SELECT '-', ' ';

CREATE TABLE #CleanMe(
    clean_id int IDENTITY(1,1),
    DirtyString nvarchar(20)
);

DECLARE @i int;
SET @i = 0;
WHILE @i < 100000 BEGIN
    INSERT INTO #CleanMe(DirtyString) SELECT 'AAAAA';
    INSERT INTO #CleanMe(DirtyString) SELECT 'BBBBB';
    INSERT INTO #CleanMe(DirtyString) SELECT 'AB-String-BA';
    SET @i = @i + 1;
END;

WITH FixedString (Step, String, cid) AS (
    SELECT 1 AS Step, REPLACE(DirtyString, badString, replaceString), clean_id
    FROM #BadChar, #CleanMe
    WHERE id = 1
    UNION ALL
    SELECT Step + 1, REPLACE(String, badString, replaceString), cid
    FROM FixedString AS T1
    JOIN #BadChar AS T2 ON T1.Step + 1 = T2.id
    JOIN #CleanMe AS T3 ON T1.cid = T3.clean_id
)
SELECT String FROM FixedString WHERE Step = (SELECT MAX(Step) FROM FixedString);

DROP TABLE #BadChar;
DROP TABLE #CleanMe;
Use a CLR
It seems like this is a common solution many people use, but the environment I'm in doesn't make this a very easy one to embark on.
Are there any other ways to go about this that I've overlooked? Or any improvements on the methods I've already looked into?
Recommended answer
Leveraging the idea from Alan Burstein's solution, you could do something like this, if you wanted to hard code the bad/replace strings. This would work for bad/replace strings longer than a single character as well.
CREATE FUNCTION [dbo].[CleanStringV1]
(
    @String nvarchar(4000)
)
RETURNS nvarchar(4000) WITH SCHEMABINDING AS
BEGIN
    -- @String is reassigned once per row of the VALUES list,
    -- so each bad/replace pair is applied to the running result.
    SELECT @String = REPLACE
    (
        @String COLLATE Latin1_General_BIN,
        badString,
        replaceString
    )
    FROM
    (VALUES
        ('A', '^')
      , ('B', '}')
      , ('s', '5')
      , ('-', ' ')
    ) t(badString, replaceString);

    RETURN @String;
END;
Or, if you have a table containing the bad/replace strings, then
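CREATE FUNCTION [dbo].[CleanStringV2]
(
    @String nvarchar(4000)
)
RETURNS nvarchar(4000) AS
BEGIN
    -- Same idea as V1, but the pairs come from a permanent BadChar table
    -- (a function cannot read the #BadChar temp table from the question).
    SELECT @String = REPLACE
    (
        @String COLLATE Latin1_General_BIN,
        badString,
        replaceString
    )
    FROM BadChar;

    RETURN @String;
END;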
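Once one of these functions exists, it could be applied with an ordinary set-based UPDATE. This is only a sketch: it assumes dbo.CleanStringV1 has been created as above and that the #CleanMe table from the question is still around.

```sql
-- Assumes dbo.CleanStringV1 (defined above) and #CleanMe (from the question) already exist.
UPDATE #CleanMe
SET DirtyString = dbo.CleanStringV1(DirtyString);

-- Spot-check a few rows afterwards.
SELECT TOP (10) clean_id, DirtyString
FROM #CleanMe
ORDER BY clean_id;
```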
These are case sensitive. You can remove the COLLATE bit if you want case-insensitive matching. I did a few small tests, and these were not much slower than nested REPLACE. The first one, with the hardcoded strings, was the faster of the two and was nearly as fast as nested REPLACE.
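To see what the COLLATE clause changes, here is a small self-contained comparison. Latin1_General_CI_AS is used here only as an example of a case-insensitive collation; with COLLATE removed, the behaviour would follow whatever collation the input already has.

```sql
SELECT
    -- Binary collation: lowercase 's' does not match the capital 'S'.
    REPLACE(N'String' COLLATE Latin1_General_BIN,   N's', N'5') AS BinaryResult,
    -- Case-insensitive collation: 's' matches 'S'.
    REPLACE(N'String' COLLATE Latin1_General_CI_AS, N's', N'5') AS CaseInsensitiveResult;
-- BinaryResult is 'String'; CaseInsensitiveResult is '5tring'
```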