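Preface
Anyone who has written a web crawler knows that the hyperlinks it collects usually need to be normalized to absolute paths before they can be used. This article presents a regular expression for processing the collected links. Without further ado, let's look at the details.
A crawler will typically come across links like the following: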
<!-- Empty hyperlink -->
<a href=""></a>
<!-- Whitespace -->
<a href=" " rel="external nofollow"> </a>
<!-- a tag with other attributes -->
<a href="index.html" rel="external nofollow" alt="hyperlink"> index.html </a>
<a href="/" rel="external nofollow" target="_blank"> / target="_blank" </a>
<a target="_blank" href="/" rel="external nofollow" alt="hyperlink"> target="_blank" / alt="hyperlink" </a>
<a target="_blank" title="hyperlink" href="/" rel="external nofollow" alt="hyperlink"> target="_blank" title="hyperlink" / alt="hyperlink" </a>
<!-- Root directory -->
<a href="/" rel="external nofollow"> / </a>
<a href="a" rel="external nofollow"> a </a>
<!-- With query parameters -->
<a href="/index.html?id=1" rel="external nofollow"> /index.html?id=1 </a>
<a href="?id=2" rel="external nofollow"> ?id=2 </a>
<!-- // -->
<a rel="external nofollow"> //index.html </a>
<a rel="external nofollow"> //www.mafutian.net </a>
<!-- Internal link -->
<a rel="external nofollow"> http://www.hole_1.com/index.html </a>
<!-- External links -->
<a rel="external nofollow"> http://www.mafutian.net </a>
<a rel="external nofollow"> http://www.numberer.net </a>
<!-- Links to image and text files -->
<a href="1.jpg" rel="external nofollow"> 1.jpg </a>
<a href="1.jpeg" rel="external nofollow"> 1.jpeg </a>
<a href="1.gif" rel="external nofollow"> 1.gif </a>
<a href="1.png" rel="external nofollow"> 1.png </a>
<a href="1.txt" rel="external nofollow"> 1.txt </a>
<!-- Ordinary links -->
<a href="index.html" rel="external nofollow"> index.html </a>
<a href="index.html" rel="external nofollow"> index.html </a>
<a href="./index.html" rel="external nofollow"> ./index.html </a>
<a href="../index.html" rel="external nofollow"> ../index.html </a>
<a href=".../" rel="external nofollow"> .../ </a>
<a href="..." rel="external nofollow"> ... </a>
<!-- Not links, but contain a colon -->
<a href="javascript:void(0)" rel="external nofollow"> javascript:void(0) </a>
<a href="a:b" rel="external nofollow"> a:b </a>
<a href="/a#a:b" rel="external nofollow"> /a#a:b </a>
<a href="mailto:'mafutian@126.com'" rel="external nofollow"> mailto:'mafutian@126.com' </a>
<a href="/tencent://message/?uin=335134463" rel="external nofollow"> /tencent://message/?uin=335134463 </a>
<!-- Relative paths -->
<a href="." rel="external nofollow"> . </a>
<a href=".." rel="external nofollow"> .. </a>
<a href="../" rel="external nofollow"> ../ </a>
<a href="/a/b/.." rel="external nofollow"> /a/b/.. </a>
<a href="/a" rel="external nofollow"> /a </a>
<a href="./b" rel="external nofollow"> ./b </a>
<a href="./././././././././b" rel="external nofollow"> ./././././././././b </a> <!-- equivalent to ./b -->
<a href="../c" rel="external nofollow"> ../c </a>
<a href="../../d" rel="external nofollow"> ../../d </a>
<a href="../a/../b/c/../d" rel="external nofollow"> ../a/../b/c/../d </a>
<a href="./../e" rel="external nofollow"> ./../e </a>
<a rel="external nofollow"> http://www.hole_1.org/./../e </a>
<a href="./.././f" rel="external nofollow"> ./.././f </a>
<a rel="external nofollow"> http://www.hole_1.org/../a/.../../b/c/../d/.. </a>
<!-- With a port number -->
<a href=":8081/index.html" rel="external nofollow"> :8081/index.html </a>
<a rel="external nofollow"> :80/index.html </a>
<a rel="external nofollow"> http://www.mafutian.net:8081/index.html </a>
<a rel="external nofollow"> http://www.mafutian.net:8082/index.html </a>
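One way to pull the raw href values out of markup like the samples above is a regular expression applied to the opening a tag. The following is only a minimal sketch in Python, not the article's own expression; the names HREF_RE and extract_hrefs are hypothetical. It assumes quoted attribute values, skips anchors that have no href attribute, and returns empty or non-link values (such as javascript:) unchanged so they can be filtered in a later step.

```python
import re

# Hypothetical pattern: an opening <a ...> tag that carries a quoted href.
# "<a\s+" requires whitespace after the tag name, "(?:[^>]*?\s)?" skips any
# attributes that come before href without leaving the tag, and the
# backreference \1 makes the closing quote match the opening one.
HREF_RE = re.compile(
    r'<a\s+(?:[^>]*?\s)?href\s*=\s*(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)


def extract_hrefs(html):
    """Return the raw href value of every <a> tag that has one, in order."""
    return [match.group(2) for match in HREF_RE.finditer(html)]


if __name__ == "__main__":
    sample = (
        '<a href=""></a>\n'
        '<a target="_blank" href="/" rel="external nofollow"> / </a>\n'
        '<a rel="external nofollow"> //index.html </a>\n'
        '<a href="../c" rel="external nofollow"> ../c </a>\n'
    )
    for href in extract_hrefs(sample):
        print(repr(href))
```

A regular expression is adequate for flat samples like these; for arbitrary real-world pages a proper HTML parser is likely the more robust choice.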
The first step of the processing is to convert every link to an absolute path:
http:// ... / ../ ../
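As a rough illustration of this step, the sketch below resolves a raw href against the URL of the page it was found on. It deliberately swaps the article's regex-based approach for Python's standard urllib.parse.urljoin, which handles ./ and ../ segments, root-relative paths, and //host references. The function name to_absolute and the base URL http://www.example.com/a/b/index.html are assumptions made for the example, and so is the rule that only http and https results are kept.

```python
from urllib.parse import urljoin, urlsplit


def to_absolute(base_url, href):
    """Resolve href against base_url.

    Returns None for values that are not navigable web links: empty or
    whitespace-only hrefs, and anything whose resolved scheme is not
    http or https (javascript:, mailto:, and similar).
    """
    href = href.strip()
    if not href:
        return None
    absolute = urljoin(base_url, href)
    # Assumption for this sketch: only http(s) URLs are worth crawling.
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute


if __name__ == "__main__":
    base = "http://www.example.com/a/b/index.html"  # hypothetical page URL
    for raw in ["", "/", "./b", "../c", "?id=2", "javascript:void(0)"]:
        print(repr(raw), "->", to_absolute(base, raw))
```

Keeping the scheme check after the join means scheme-relative values such as //www.mafutian.net inherit the scheme of the base page before they are tested.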