mismatch vectorization#4495
Merged
StephanTLavavej merged 37 commits intomicrosoft:mainfrom Mar 28, 2024
Merged
Conversation
Contributor
Author
|
Thanks @frederick-vs-ja for #4138, especially for the coverage |
Alcaro
reviewed
Mar 21, 2024
GH 4146 originally started this convention.
This comment was marked as resolved.
This comment was marked as resolved.
Member
|
/azp run STL-ASan-CI |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
This comment was marked as resolved.
also lets test test instead of explaining it
This comment was marked as resolved.
This comment was marked as resolved.
stl/src/vector_algorithms.cpp
Outdated
Comment on lines
2142
to
2149
| } else if (_Use_sse2()) { | ||
| const size_t _Count_bytes_sse = (_Count * sizeof(_Ty)) & ~size_t{0xF}; | ||
|
|
||
| for (; _Result != _Count_bytes_sse; _Result += 0x10) { | ||
| const __m128i _Elem1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First1_ch + _Result)); | ||
| const __m128i _Elem2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(_First2_ch + _Result)); | ||
| const auto _Bingo = | ||
| static_cast<unsigned int>(_mm_movemask_epi8(_Traits::_Cmp_sse(_Elem1, _Elem2))) ^ 0xFFFF; |
Member
There was a problem hiding this comment.
🐞 Bug: This has _Use_sse2() as the guard for _Traits::_Cmp_sse(), which potentially needs more:
STL/stl/src/vector_algorithms.cpp
Lines 1830 to 1836 in 8e2d724
I'll fix this occurrence, but this has been repeatedly hazardous, so I'll also file a followup issue.
StephanTLavavej
approved these changes
Mar 27, 2024
Member
|
I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed. |
Member
|
Thanks for improving the performance of this useful algorithm! 🚀 🏃 🐆 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Not only
mismatch, also useful in futurelexicographical_compareTail masking
A novel for
vector_algorithms.cppthing here is also using AVX2 mask for tail and not yielding to SSE2. This allowsUnfortunately, SSE2 doesn't have usable mask instructions, and AVX2 has 4-byte granularity as its finest mask, so still have at most 3 byte tails with AVX2 and up to 15 byte tails with SSE2.
I think other vector algorithms can benefit from this approach too.
If ASan hisses at this, this would be ASan bug, as AVX2 masked loads/stores are well documented and empirically proved to skip reading/writing fairly: no data races created, no exceptions on inaccessible memory, etc.
Integer indexing
Other vector algorithms use pointer increments instead of index. This allows instructions with memory operand avoid complex indexing, whereas keeping everything else in loops the same. The compiler may perform this optimization itself, but may not, so pointer increment is useful.
With two arrays algorithms, where both arrays are processed in the same direction, the situation is ambiguous. Integer indexing requires one increment, whereas pointer indexing takes two increments, so integer indexing saves an instruction. The compiler may perform an optimization to make one pointer relative to the other, and use that pointer instead of using index, to make some of accesses simple-indexed. Surprising only MSVC does this optimization, clang and gcc do not. This optimization cannot be done manually, as it is UB in C++, and is very squirrelly, so only the compiler can do this.
As the situation on index vs pointer is ambiguous here, and the difference is expected to be small, I just used the simpler approach, that is using index.
Benchmark
Before:
After: