Problem description
I'm working on a multithreaded C++ application that is corrupting the heap. The usual tools to locate this corruption seem to be inapplicable. Old builds (18 months old) of the source code exhibit the same behaviour as the most recent release, so this has been around for a long time and just wasn't noticed; on the downside, source deltas can't be used to identify when the bug was introduced - there are a lot of code changes in the repository.
The trigger for the crashing behaviour is generating throughput in this system: socket transfer of data which is munged into an internal representation. I have a set of test data that will periodically cause the app to throw an exception (various places, various causes, including heap alloc failing, thus: heap corruption).
The behaviour seems related to CPU power or memory bandwidth; the more of each the machine has, the easier it is to crash. Disabling a hyper-threading core or a dual-core core reduces the rate of (but does not eliminate) corruption. This suggests a timing-related issue.
Now here's the rub:
- When it's run under a lightweight debug environment (say Visual Studio 98, AKA MSVC6) the heap corruption is reasonably easy to reproduce: ten or fifteen minutes pass before something fails horrendously and throws exceptions, such as a failed alloc.
- When running under a sophisticated debug environment (Rational Purify, VS2008/MSVC9 or even Microsoft Application Verifier) the system becomes memory-speed bound and doesn't crash (memory-bound: CPU is not getting above 50%, disk light is not on, the program's going as fast as it can, box consuming 1.3G of 2G of RAM).
So, I've got a choice between being able to reproduce the problem (but not identify the cause) or being able to identify the cause of a problem I can't reproduce.
My current best guesses as to where to go next are:
- Get an insanely grunty box (to replace the current dev box: 2Gb RAM in an E6550 Core2 Duo); this will make it possible to repro the crash-causing misbehaviour when running under a powerful debug environment; or
- Rewrite operators new and delete to use VirtualAlloc and VirtualProtect to mark memory as read-only as soon as it's done with. Run under MSVC6 and have the OS catch the bad guy who's writing to freed memory. Yes, this is a sign of desperation: who the hell rewrites new and delete?! I wonder if this is going to make it as slow as under Purify et al.
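The second idea can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the code the poster ended up with: the `guarded` namespace and its function names are hypothetical, the size is passed explicitly to `release` (a real operator delete replacement would have to record it somewhere), and the non-Windows branch using `mmap`/`mprotect` is there only so the sketch can be compiled elsewhere. Every allocation gets its own page(s), and freeing marks them read-only instead of returning them, so a later write through a stale pointer faults at the offending instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#else
#include <sys/mman.h>
#include <unistd.h>
#endif

// Hypothetical helper names, introduced for illustration only.
namespace guarded {

// Size of one virtual memory page on this machine.
inline std::size_t page_size() {
#ifdef _WIN32
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return static_cast<std::size_t>(si.dwPageSize);
#else
    return static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
#endif
}

// Round n up to a whole number of pages (at least one page).
inline std::size_t round_up(std::size_t n) {
    const std::size_t p = page_size();
    return ((n ? n : 1) + p - 1) / p * p;
}

// Give each allocation its own page(s); returns nullptr on failure.
inline void* allocate(std::size_t n) {
    const std::size_t bytes = round_up(n);
#ifdef _WIN32
    return VirtualAlloc(nullptr, bytes, MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
#else
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
#endif
}

// "Free" by making the pages read-only rather than returning them, so any
// later write to this block raises an access violation immediately.
// PAGE_NOACCESS / PROT_NONE would also catch stale reads.
inline bool release(void* p, std::size_t n) {
    assert(p != nullptr);
    const std::size_t bytes = round_up(n);
#ifdef _WIN32
    DWORD old_protect = 0;
    return VirtualProtect(p, bytes, PAGE_READONLY, &old_protect) != 0;
#else
    return mprotect(p, bytes, PROT_READ) == 0;
#endif
}

}  // namespace guarded
```

The obvious cost is that every allocation consumes at least a full page and nothing is ever handed back, so a long run on a 2G box may exhaust address space before the corruption shows up; whether it is as slow as Purify would have to be measured.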
And, no: Shipping with Purify instrumentation built in is not an option.
A colleague just walked past and asked "Stack Overflow? Are we getting stack overflows now?!?"
And now, the question: How do I locate the heap corruptor?
Update: balancing new[] and delete[] seems to have gotten a long way towards solving the problem. Instead of 15 minutes, the app now goes about two hours before crashing. Not there yet. Any further suggestions? The heap corruption persists.
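For readers unsure what "balancing" means here, the following is a small illustration rather than code from the application. `Packet` is a hypothetical type whose only job is to make constructor and destructor calls countable. The broken form is left in a comment because pairing new[] with scalar delete is undefined behaviour. The `std::unique_ptr` variant assumes a C++11 compiler, which VS98 is not; `std::vector` is available in both.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical type used only to make destructor calls observable.
struct Packet {
    static int live;                       // currently constructed Packets
    Packet() { ++live; }
    Packet(const Packet&) { ++live; }
    ~Packet() { assert(live > 0); --live; }
};
int Packet::live = 0;

// Mismatched form (do not do this): new[] paired with scalar delete is
// undefined behaviour and is a classic source of heap corruption.
//
//     Packet* p = new Packet[n];
//     delete p;        // should be delete[] p;

// Matched by hand: new[] is paired with delete[].
void matched(std::size_t n) {
    Packet* p = new Packet[n];
    delete[] p;
}

// Matched by construction: the owning object picks the right form.
void owned(std::size_t n) {
    std::unique_ptr<Packet[]> a(new Packet[n]);  // releases with delete[]
    std::vector<Packet> b(n);                    // manages its own storage
}
```

After either function returns, `Packet::live` is back to its previous value, which is the property a mismatched delete cannot guarantee.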
Update: a release build under Visual Studio 2008 seems dramatically better; current suspicion rests on the STL implementation that ships with VS98.
- Reproduce the problem. Dr Watson will produce a dump that might be helpful in further analysis.
I'll take a note of that, but I'm concerned that Dr Watson will only be tripped up after the fact, not when the heap is getting stomped on.
Another try might be using WinDebug as a debugging tool, which is quite powerful while also being lightweight.
Got that going at the moment, again: not much help until something goes wrong. I want to catch the vandal in the act.
Maybe these tools will allow you at least to narrow the problem to a certain component.
I don't hold much hope, but desperate times call for...
And are you sure that all the components of the project have correct runtime library settings (C/C++ tab, Code Generation category in VS 6.0 project settings)?
No I'm not, and I'll spend a couple of hours tomorrow going through the workspace (58 projects in it) and checking they're all compiling and linking with the appropriate flags.
Update: This took 30 seconds. Select all projects in the Settings dialog, unselect until you find the project(s) that don't have the right settings (they all had the right settings).
My first choice would be a dedicated heap tool such as pageheap.exe.
Rewriting new and delete might be useful, but that doesn't catch the allocs committed by lower-level code. If this is what you want, better to Detour the low-level alloc APIs using Microsoft Detours.
Also run sanity checks such as: verify your run-time libraries match (release vs. debug, multi-threaded vs. single-threaded, dll vs. static lib), look for bad deletes (e.g., delete where delete [] should have been used), and make sure you're not mixing and matching your allocs.
Also try selectively turning off threads and see when/if the problem goes away.
What does the call stack etc look like at the time of the first exception?