O(N^2) in CreateProcess
It showed that 99% of the CPU time in the main unit_tests process was inside CreateProcess, and 98.4% of the samples were in a single function. In one trace that I grabbed I found that more than 95% of the samples in my test process were in just seventeen instructions in MiCopyToCfgBitMap, which is tough to do without an O(n^2) algorithm:
My first attempt at investigating was to grab the sample counts and addresses from the ETW trace, grab the disassembly of MiCopyToCfgBitMap from livekd, write a script to merge them, and then analyze the annotated disassembly. That gave me the following CFG entry counts:
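For readers who want to try the merge step themselves, here is a minimal sketch of what such a script might look like. It is not the script I used. It assumes the sampled instruction addresses have already been exported from the trace as integers, and that each disassembly line starts with a kd-style address (optionally containing a backtick) followed by the instruction text. The names `annotate` and `parse_address` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: disassembly lines look like "<address> <bytes> <mnemonic ...>",
# where the address is 8 hex digits or 8`8 hex digits (kd-style 64-bit form).
ADDR_RE = re.compile(r"^\s*([0-9a-fA-F]{8}(?:`?[0-9a-fA-F]{8})?)\s+(.*)$")


def parse_address(text):
    """Convert a kd-style address such as fffff802`00001000 to an int."""
    return int(text.replace("`", ""), 16)


def annotate(disassembly_lines, sample_addresses):
    """Pair each instruction line with its sample count and share of all samples.

    Lines that do not start with an address (labels, blank lines) are skipped.
    Returns a list of (hits, share, line) tuples in disassembly order.
    """
    counts = Counter(sample_addresses)
    total = sum(counts.values())
    annotated = []
    for line in disassembly_lines:
        match = ADDR_RE.match(line)
        if not match:
            continue
        address = parse_address(match.group(1))
        hits = counts.get(address, 0)
        share = hits / total if total else 0.0
        annotated.append((hits, share, line.strip()))
    return annotated


def format_report(annotated):
    """Render the annotated disassembly as aligned text, one instruction per line."""
    return "\n".join(
        f"{hits:8d} {share:7.2%}  {line}" for hits, share, line in annotated
    )
```

Sorting the result by `hits` is a quick way to see whether a handful of instructions dominate, which is the pattern that points at a tight inner loop.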
I then compiled seventeen different variants (using /MP for parallel compilation), using this command to verify how many CFG entries I was getting:
Finally I measured CreateProcess time of each version with a simple test harness.
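A harness along these lines is enough to reproduce that kind of measurement. This is a rough sketch rather than my actual harness: it times launching a process and waiting for it to exit, so it includes process startup and shutdown on top of CreateProcess itself, and the helper names `time_process_creation` and `scaling_exponent` are made up for illustration. The second helper fits a straight line to log(time) against log(size), so a slope near 2 would be consistent with quadratic behavior.

```python
import math
import subprocess
import sys
import time


def time_process_creation(command, runs=5):
    """Return the best wall-clock time (seconds) to launch `command` and wait for exit.

    Taking the minimum over several runs reduces noise from caching and scheduling.
    """
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            command,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        best = min(best, time.perf_counter() - start)
    return best


def scaling_exponent(points):
    """Least-squares slope of log(time) versus log(size).

    `points` is a list of (size, seconds) pairs with positive values and at
    least two distinct sizes. A slope near 1 suggests linear cost, near 2
    suggests quadratic cost.
    """
    xs = [math.log(size) for size, _ in points]
    ys = [math.log(seconds) for _, seconds in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


if __name__ == "__main__":
    # Placeholder command: launches the Python interpreter and exits at once.
    # Substitute the path of each test binary being compared.
    seconds = time_process_creation([sys.executable, "-c", "pass"], runs=3)
    print(f"best launch time: {seconds * 1000:.1f} ms")
```

To use it for this investigation you would time each variant, pair the result with that variant's CFG entry count, and pass the pairs to `scaling_exponent`.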