PERF: Handle utf16 strings using simdutf and std::u16string #526

Merged
gargsaumya merged 25 commits into microsoft:main from ffelixg:simdutf on May 14, 2026

@ffelixg
Contributor

@ffelixg ffelixg commented Apr 15, 2026

Work Item / Issue Reference

GitHub Issue: #514


Summary

  • Use the external dependency simdutf to decode UTF-16 in C++
  • Use pybind11's support for std::u16string to decode UTF-16 when interacting with Python
  • Remove SQLWCHARToWString, WideToUTF8, Utf8ToWString and WStringToSQLWCHAR; refactor call sites

Copilot AI review requested due to automatic review settings April 15, 2026 23:50
@ffelixg
Contributor Author

ffelixg commented Apr 15, 2026

I have been evaluating the external library simdutf as a high-performance replacement for UTF-16 -> UTF-8 conversions, i.e. the functions SQLWCHARToWString, WideToUTF8 and Utf8ToWString. Rather than using it only for the Arrow fetch path, I have tried to make the switch at every location where one of these three functions is used, since the call sites follow similar patterns. I didn't use simdutf in every case: another way to eliminate these function calls was to use std::u16string instead of std::wstring when passing strings to/from Python. I think this avoids the whole issue where wchar_t is 32 bits wide on some OSes but SQLWCHAR is always 16 bits.

This brings Arrow performance on Linux for nvarchars in line with what it should be.

If the std::u16string type works as well as I hope it does (will have to see what CI says about macOS), there are some more spots where it could be used, for example to replace WStringToSQLWCHAR.
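
To make the width point above concrete, here is a minimal sketch (not code from this PR) that compares the byte footprint of the same text held in std::u16string and in std::wstring. The helper names utf16ByteLength and wideByteLength are hypothetical and exist only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Bytes occupied by the text as 16-bit code units (terminator excluded).
// The code unit width does not depend on the platform's wchar_t.
inline std::size_t utf16ByteLength(const std::u16string& s) {
    return s.size() * sizeof(char16_t);
}

// Bytes occupied by the same kind of text in a std::wstring.
// sizeof(wchar_t) is platform-dependent, so this value is too.
inline std::size_t wideByteLength(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}
```

On platforms where wchar_t is 4 bytes (typically Linux and macOS), wideByteLength(L"abc") is twice utf16ByteLength(u"abc"), which is why a std::wstring buffer cannot be handed to an API expecting 16-bit units without a conversion step.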

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a higher-performance and more platform-consistent UTF-16LE → UTF-8 conversion path in the pybind ODBC layer by adopting simdutf and shifting several wide-string interfaces from std::wstring to std::u16string (to avoid wchar_t width differences across OSes).

Changes:

  • Add simdutf (via find_package or FetchContent) and use it for UTF-16LE → UTF-8 conversions in diagnostics and data fetch paths.
  • Replace a number of std::wstring usages with std::u16string for connection strings, queries, and SQL_C_WCHAR parameter buffers.
  • Remove legacy SQLWCHAR→wstring conversion utilities that are no longer used.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| mssql_python/pybind/CMakeLists.txt | Adds simdutf dependency resolution and links it into the ddbc_bindings module. |
| mssql_python/pybind/ddbc_bindings.h | Introduces UTF-16 helpers (dupeSqlWCharAsUtf16Le, utf16LeToUtf8Alloc) and changes ErrorInfo to store UTF-8 std::string. |
| mssql_python/pybind/ddbc_bindings.cpp | Switches parameter binding, diagnostics, query execution, and fetch conversions to UTF-16 + simdutf. |
| mssql_python/pybind/connection/connection.h | Updates connection APIs/state to store connection strings as std::u16string. |
| mssql_python/pybind/connection/connection.cpp | Uses UTF-16 connection string/query handling and returns UTF-8 error messages directly. |
| mssql_python/pybind/connection/connection_pool.h | Updates pooling key types and APIs to use std::u16string. |
| mssql_python/pybind/connection/connection_pool.cpp | Implements pooling with std::u16string connection string keys. |
| mssql_python/pybind/unix_utils.h | Removes the SQLWCHARToWString declaration. |
| mssql_python/pybind/unix_utils.cpp | Removes the SQLWCHARToWString implementation. |
Comments suppressed due to low confidence (1)

mssql_python/pybind/ddbc_bindings.cpp:583

  • In the SQL_C_WCHAR non-DAE path, the bound buffer is a std::u16string but the bind call uses data() + SQL_NTS and sets bufferLength to size() * sizeof(SQLWCHAR). For SQL_NTS, BufferLength should include the null terminator, and the pointer should be a null-terminated buffer (prefer c_str()). As written, drivers that validate BufferLength may treat this as truncated or read past the provided length.

Suggested fix: use sqlwcharBuffer->c_str() and set bufferLength to (sqlwcharBuffer->size() + 1) * sizeof(SQLWCHAR), or alternatively set *strLenOrIndPtr to the explicit byte length (excluding terminator) and keep BufferLength consistent.

                    std::u16string* sqlwcharBuffer = AllocateParamBuffer<std::u16string>(
                        paramBuffers, param.cast<std::u16string>());
                    LOG("BindParameters: param[%d] SQL_C_WCHAR - String "
                        "length=%zu characters, buffer=%zu bytes",
                        paramIndex, sqlwcharBuffer->size(), sqlwcharBuffer->size() * sizeof(SQLWCHAR));
                    dataPtr = sqlwcharBuffer->data();
                    bufferLength = sqlwcharBuffer->size() * sizeof(SQLWCHAR);
                    strLenOrIndPtr = AllocateParamBuffer<SQLLEN>(paramBuffers);
                    *strLenOrIndPtr = SQL_NTS;
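
The length arithmetic in the suggestion above can be sketched without any ODBC headers. This is only an illustration under assumptions: WCharBufferInfo and describeWCharBuffer are hypothetical names, and char16_t stands in for SQLWCHAR, which this PR treats as a 16-bit unit.

```cpp
#include <cstddef>
#include <string>

// Hypothetical summary of what a bind call would need to know
// about a UTF-16 parameter buffer.
struct WCharBufferInfo {
    const char16_t* ptr;            // null-terminated buffer
    std::size_t dataLengthBytes;    // text only, terminator excluded
    std::size_t bufferLengthBytes;  // text plus one terminating code unit
};

inline WCharBufferInfo describeWCharBuffer(const std::u16string& s) {
    WCharBufferInfo info;
    info.ptr = s.c_str();  // c_str() always points at a null-terminated array
    info.dataLengthBytes = s.size() * sizeof(char16_t);
    info.bufferLengthBytes = (s.size() + 1) * sizeof(char16_t);
    return info;
}
```

With this split, a caller can either pass the terminator-inclusive bufferLengthBytes alongside SQL_NTS, or pass dataLengthBytes as the explicit length, which are the two options the comment describes.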

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed
@github-actions

github-actions Bot commented Apr 16, 2026

📊 Code Coverage Report

🔥 Diff Coverage

94%


🎯 Overall Coverage

25%


📈 Total Lines Covered: 6961 out of 27068
📁 Project: mssql-python


Diff Coverage

Diff: main...HEAD, staged and unstaged changes

  • mssql_python/pybind/connection/connection.cpp (100%)
  • mssql_python/pybind/ddbc_bindings.cpp (94.9%): Missing lines 1576,1655,1672,3896,3905,4252,4354
  • mssql_python/pybind/utf_utils.h (92.0%): Missing lines 15-16

Summary

  • Total: 169 lines
  • Missing: 9 lines
  • Coverage: 94%

mssql_python/pybind/ddbc_bindings.cpp

Lines 1572-1580

  1572     LOG("SQLCheckError: Checking ODBC errors - handleType=%d, retcode=%d", handleType, retcode);
  1573     ErrorInfo errorInfo;
  1574     if (retcode == SQL_INVALID_HANDLE) {
  1575         LOG("SQLCheckError: SQL_INVALID_HANDLE detected - handle is invalid");
! 1576         errorInfo.ddbcErrorMsg = "Invalid handle!";
  1577         return errorInfo;
  1578     }
  1579     assert(handle != 0);
  1580     SQLHANDLE rawHandle = handle->get();

Lines 1651-1659

  1651     return records;
  1652 }
  1653 
  1654 // Wrap SQLExecDirect
! 1655 SQLRETURN SQLExecDirect_wrap(SqlHandlePtr StatementHandle, const std::u16string& Query) {
  1656     LOG("SQLExecDirect: Executing query directly - statement_handle=%p, "
  1657         "query_length=%zu chars",
  1658         (void*)StatementHandle->get(), Query.length());
  1659     if (!SQLExecDirect_ptr) {

Lines 1668-1676

  1668         SQLSetStmtAttr_ptr(StatementHandle->get(), SQL_ATTR_CONCURRENCY,
  1669                            (SQLPOINTER)SQL_CONCUR_READ_ONLY, 0);
  1670     }
  1671 
! 1672     SQLWCHAR* queryPtr = reinterpretU16stringAsSqlWChar(Query);
  1673     SQLRETURN ret;
  1674     {
  1675         // Release the GIL during the blocking ODBC call so that other Python
  1676         // threads (e.g. asyncio event loop, heartbeat threads) can run while

Lines 3892-3900

  3892                                      sizeof(DateTimeOffset) * fetchSize,
  3893                                      buffers.indicators[col - 1].data());
  3894                 break;
  3895             default:
! 3896                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3897                 std::ostringstream errorString;
  3898                 errorString << "Unsupported data type for column - " << columnName.c_str()
  3899                             << ", Type - " << dataType << ", column ID - " << col;
  3900                 LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 3901-3909

  3901                 ThrowStdException(errorString.str());
  3902                 break;
  3903         }
  3904         if (!SQL_SUCCEEDED(ret)) {
! 3905             std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  3906             std::ostringstream errorString;
  3907             errorString << "Failed to bind column - " << columnName.c_str() << ", Type - "
  3908                         << dataType << ", column ID - " << col;
  3909             LOG("SQLBindColums: %s", errorString.str().c_str());

Lines 4248-4256

  4248                     break;
  4249                 }
  4250                 default: {
  4251                     const auto& columnMeta = columnNames[col - 1].cast<py::dict>();
! 4252                     std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4253                     std::ostringstream errorString;
  4254                     errorString << "Unsupported data type for column - " << columnName.c_str()
  4255                                 << ", Type - " << dataType << ", column ID - " << col;
  4256                     LOG("FetchBatchData: %s", errorString.str().c_str());

Lines 4350-4358

  4350             case SQL_SS_TIMESTAMPOFFSET:
  4351                 rowSize += sizeof(DateTimeOffset);
  4352                 break;
  4353             default:
! 4354                 std::string columnName = columnMeta["ColumnName"].cast<std::string>();
  4355                 std::ostringstream errorString;
  4356                 errorString << "Unsupported data type for column - " << columnName.c_str()
  4357                             << ", Type - " << dataType << ", column ID - " << col;
  4358                 LOG("calculateRowSize: %s", errorString.str().c_str());

mssql_python/pybind/utf_utils.h

  11 #include <string>
  12 
  13 inline std::string utf16LeToUtf8Alloc(const std::u16string& utf16) {
  14     if (utf16.empty()) {
! 15         return {};
! 16     }
  17 
  18     std::string utf8(utf16.size() * 3, '\0');
  19     size_t n = simdutf::convert_utf16le_to_utf8_with_replacement(
  20         utf16.data(), utf16.size(), utf8.data());
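
The utf16.size() * 3 allocation in the excerpt above is a worst-case bound: a code unit in the Basic Multilingual Plane encodes to at most 3 UTF-8 bytes, a surrogate pair (2 code units) encodes to 4 bytes, and an unpaired surrogate replaced by U+FFFD encodes to 3 bytes. The scalar sketch below is not the simdutf implementation and not code from this PR. It is a plain C++ stand-in (hypothetical name utf16ToUtf8Scalar) that works on native-endian code units, so the bound can be checked by hand.

```cpp
#include <cstddef>
#include <string>

// Scalar UTF-16 -> UTF-8 conversion; unpaired surrogates become U+FFFD.
inline std::string utf16ToUtf8Scalar(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // at most 3 output bytes per input code unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {  // high surrogate
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // consume the low surrogate as well
            } else {
                cp = 0xFFFD;  // high surrogate without a partner
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Because the output never exceeds three bytes per input code unit, a buffer sized this way can be filled in one pass and then shrunk to the number of bytes actually written.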


📋 Files Needing Attention

📉 Files with overall lowest coverage
mssql_python.pybind.build._deps.simdutf-src.src.haswell.implementation.cpp: 0.4%
mssql_python.pybind.build._deps.simdutf-src.src.implementation.cpp: 6.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.implementation.h: 10.4%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.scalar.utf16_to_utf8.utf16_to_utf8.h: 25.3%
mssql_python.pybind.logger_bridge.cpp: 59.2%
mssql_python.pybind.ddbc_bindings.h: 59.7%
mssql_python.pybind.build._deps.simdutf-src.include.simdutf.internal.isadetection.h: 65.3%
mssql_python.row.py: 70.5%
mssql_python.pybind.logger_bridge.hpp: 70.8%
mssql_python.pybind.ddbc_bindings.cpp: 74.2%

@ffelixg
Contributor Author

ffelixg commented Apr 16, 2026

The only issue seems to be that some Linux CI containers don't have git. I'm trying to fetch simdutf via URL instead.

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ffelixg ffelixg changed the title from "PERF: Use simdutf for decoding utf16" to "PERF: Handle utf16 strings using simdutf and std::u16string" Apr 19, 2026
@ffelixg
Contributor Author

ffelixg commented Apr 19, 2026

@gargsaumya The second CI error was due to the Ubuntu image using an older version of CMake. Should be fine now, I hope.

I have also replaced the remaining std::wstring occurrences with std::u16string and eliminated helper functions as well as platform-specific code accordingly. Let me know if you are happy with this holistic update to UTF-16 string handling or if you would prefer to keep it contained to the Arrow fetch path.

@gargsaumya
Contributor

Thanks for the update @ffelixg. I actually prefer this and it LGTM overall. But I’ll still take a closer look at the changes and get back to you.
I’ve approved the workflow for now so we can run the pipeline and get a passing test run.

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Comment thread mssql_python/pybind/ddbc_bindings.h Fixed
@gargsaumya gargsaumya changed the title from "PERF: Handle utf16 strings using simdutf and std::u16string" to "FIX: [PERF] Handle utf16 strings using simdutf and std::u16string" Apr 20, 2026
@ffelixg
Contributor Author

ffelixg commented Apr 20, 2026

I think that this latest CI failure was due to a flaky test. The same test / platform passed in a past CI run and I don't think my changes affected that test. Could you rerun that one @gargsaumya?

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@gargsaumya gargsaumya left a comment

Really appreciate the effort here, this is a very meaningful contribution @ffelixg. I've reviewed the PR and left 2 inline comments, please take a look!

Comment thread mssql_python/pybind/ddbc_bindings.h Outdated
Comment thread mssql_python/pybind/CMakeLists.txt
Comment thread mssql_python/pybind/utf_utils.h Dismissed
Collaborator

@bewithgaurav bewithgaurav left a comment

one real bug (diag message overflow, stack memory leaking into exception text on long error messages, repro inline), one accidental README typo, one perf nit on the new helper. only the diag bug actually blocks, the other two can ride along or land separately.
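
The inline thread with the repro is not reproduced on this page, so the following is only a guess at the class of bug being described, not the fix that landed. ODBC's SQLGetDiagRec reports the total number of characters available for the message, which can be larger than the buffer the caller supplied. Constructing a string from the reported length would then read past the buffer. A minimal sketch of clamping that length, with the hypothetical helper name takeDiagText and char16_t standing in for SQLWCHAR:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Copy a diagnostic message out of a fixed-size buffer.
// capacityUnits: size of the buffer in code units, terminator included.
// reportedUnits: length reported by the driver, which may exceed the buffer.
inline std::u16string takeDiagText(const char16_t* buffer,
                                   std::size_t capacityUnits,
                                   long reportedUnits) {
    if (buffer == nullptr || capacityUnits == 0 || reportedUnits <= 0) {
        return std::u16string();
    }
    // Never read more than the buffer can hold (one unit is the terminator).
    const std::size_t usable =
        std::min(static_cast<std::size_t>(reportedUnits), capacityUnits - 1);
    return std::u16string(buffer, usable);
}
```

A caller that needs the full message can instead use the reported length to allocate a larger buffer and call the diagnostic function again.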

Comment thread mssql_python/pybind/ddbc_bindings.cpp Outdated
Comment thread mssql_python/pybind/README.md Outdated
Comment thread mssql_python/pybind/utf_utils.h
@bewithgaurav bewithgaurav changed the title from "FIX: [PERF] Handle utf16 strings using simdutf and std::u16string" to "PERF: Handle utf16 strings using simdutf and std::u16string" May 12, 2026
@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

gargsaumya added a commit that referenced this pull request May 13, 2026
Login failures and other connection-time errors raised by the C++ pybind layer were surfacing as plain RuntimeError instead of a mssql_python exception, making it impossible for callers to catch them via the DB-API 2.0 exception hierarchy.

Connection::checkError() now embeds the SQLSTATE code in the thrown message (SQLSTATE:XXXXX:<msg>) so the Python layer can map it to the correct exception class via sqlstate_to_exception() -- consistent with how cursor-level errors are already handled in helpers.py.

The new _raise_connection_error() helper is applied to all four connection operations that go through checkError(): connect, commit, rollback, and set_autocommit.

Note: when PR #526 (simdutf) merges, the two WideToUTF8() calls in connection.cpp::checkError() will need updating to utf16LeToUtf8Alloc().

Fixes #532
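
The SQLSTATE:XXXXX:<msg> convention described in this commit message can be sketched as a pair of helpers. This is an illustration only: embedSqlState, parseSqlState and ParsedError are hypothetical names, the real parsing happens in the Python layer, and the 5-character code length is taken from the XXXXX placeholder above.

```cpp
#include <cstddef>
#include <string>

// Result of trying to split "SQLSTATE:XXXXX:<msg>" back into its parts.
struct ParsedError {
    bool hasSqlState;
    std::string sqlState;
    std::string message;
};

// Build the prefixed message that the thrown exception would carry.
inline std::string embedSqlState(const std::string& sqlState, const std::string& message) {
    return "SQLSTATE:" + sqlState + ":" + message;
}

// Recover the code and message; text without the prefix is returned unchanged.
inline ParsedError parseSqlState(const std::string& text) {
    const std::string prefix = "SQLSTATE:";
    const std::size_t codeLen = 5;
    if (text.size() >= prefix.size() + codeLen + 1 &&
        text.compare(0, prefix.size(), prefix) == 0 &&
        text[prefix.size() + codeLen] == ':') {
        return {true, text.substr(prefix.size(), codeLen),
                text.substr(prefix.size() + codeLen + 1)};
    }
    return {false, "", text};
}
```

Keeping the code at a fixed width means the message itself may contain colons without confusing the parser.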
@bewithgaurav
Collaborator

@ffelixg - thanks for your responses - requesting you to please resolve conflicts and we'll do a last set of review

@ffelixg
Contributor Author

ffelixg commented May 13, 2026

@bewithgaurav Done, I also noticed that a couple utf16LeToUtf8Alloc calls could be replaced by std::u16string + pybind11 while I was doing the conflict resolution.

@gargsaumya
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@bewithgaurav bewithgaurav left a comment

lgtm, great stuff - thanks @ffelixg !

@gargsaumya gargsaumya merged commit 9ae2638 into microsoft:main May 14, 2026
27 checks passed
gargsaumya added a commit that referenced this pull request May 14, 2026
gargsaumya added a commit that referenced this pull request May 14, 2026
PR #526 changed ErrorInfo to return std::string directly instead of
std::wstring, and removed WideToUTF8() function. Updated checkError()
to use direct assignment since err.sqlState and err.ddbcErrorMsg are
now already std::string.

This fixes compilation errors after rebase on main.
@gargsaumya gargsaumya mentioned this pull request May 15, 2026
gargsaumya added a commit that referenced this pull request May 19, 2026
Replace manual LCOV_EXCL_LINE markers with cleaner built-in lcov filtering.
This approach uses lcov's native exclusion mechanism and is more maintainable.

Changes:
- Add eng/scripts/join_logs_for_coverage.py to join multi-line LOG calls during coverage builds
- Modify build.sh to temporarily join LOG statements in codecov mode with automatic restore
- Use lcov --rc lcov_excl_line='\bLOG[A-Z_]*\s*\(' to exclude LOG macros from coverage
- Add llvm-cov ignore pattern for build/_deps/ (vendored simdutf sources from PR #526)
- Add lcov --remove for build/_deps/ as defense-in-depth (from PR #579)
- Update .gitignore to exclude local development scripts

Benefits:
- No source code clutter (600+ markers not needed)
- Catches all LOG variants (LOG_ERROR, LOG_WARNING, etc.)
- Excludes vendored third-party dependencies from coverage metrics
- Cleaner, more maintainable approach using lcov native features
- Source files remain unchanged in repository

Addresses review feedback from @bewithgaurav on PR #556
Includes changes from PR #579 to fix simdutf coverage pollution
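
The exclusion pattern quoted in this commit message can be tried out in isolation. The sketch below uses std::regex with its default ECMAScript grammar rather than lcov's own regex engine, so treat it as an approximation. The helper name isExcludedLogLine is hypothetical.

```cpp
#include <regex>
#include <string>

// True when a source line contains a LOG-style macro call such as
// LOG(...), LOG_ERROR(...) or LOG_WARNING (...).
inline bool isExcludedLogLine(const std::string& line) {
    static const std::regex pattern(R"(\bLOG[A-Z_]*\s*\()");
    return std::regex_search(line, pattern);
}
```

The pattern only matches the line that contains the macro name and opening parenthesis. A continuation line of a multi-line LOG call, such as one holding only a format-string fragment, does not match, which is presumably why the build joins multi-line LOG calls before coverage is measured.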
@gargsaumya gargsaumya mentioned this pull request May 20, 2026
gargsaumya added a commit that referenced this pull request May 20, 2026
### Work Item / Issue Reference

> [AB#45159](https://sqlclientdrivers.visualstudio.com/mssql-python/_sprints/taskboard/mssql-python%20Team/mssql-python/Rubidium/May%202026?workitem=45159)

-------------------------------------------------------------------
### Summary

**Enhancements**
- #548 — manylinux_2_28 build targets for RHEL 8 / glibc 2.28
- #542 — macOS universal2 wheel for Python 3.10
- #526 — UTF-16 string handling via simdutf
- #528 — Optimized execute() hot path
- #567 — Azure Linux installation docs

**Bug Fixes**
- #562 — Login failures now raise mssql_python exception instead of
RuntimeError
- #568 — GIL released during blocking SQLSetConnectAttr calls
- #541 — GIL released during blocking ODBC statement/fetch/transaction
calls
- #560 — executemany RuntimeError when decimals change signs
- #495 — Inconsistent CP1252 VARCHAR retrieval Windows vs Linux
- #559 — BulkCopy empty string in NVARCHAR(MAX)/VARCHAR(MAX) (via
mssql_py_core 0.1.4)
7 participants