Skip to content

FIX: Request SQL_CHAR as SQL_C_WCHAR in arrow fetch path#575

Open
ffelixg wants to merge 5 commits into
microsoft:mainfrom
ffelixg:arrow_char_to_wchar
Open

FIX: Request SQL_CHAR as SQL_C_WCHAR in arrow fetch path#575
ffelixg wants to merge 5 commits into
microsoft:mainfrom
ffelixg:arrow_char_to_wchar

Conversation

@ffelixg
Copy link
Copy Markdown
Contributor

@ffelixg ffelixg commented May 13, 2026

Work Item / Issue Reference

AB#44922

GitHub Issue: #553


Summary

Due to #495, we can now request SQL_CHAR data as SQL_C_WCHAR, i.e. utf16le strings. Doing this for the arrow path ensures that arrow methods always return correct data no matter the encoding settings / locale / operating system. There does not seem to be any significant negative performance impact.

Copilot AI review requested due to automatic review settings May 13, 2026 11:08
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the Arrow fetch path in the C++ pybind layer to always request SQL_CHAR/SQL_VARCHAR data as SQL_C_WCHAR (UTF-16) so Arrow results are correct regardless of server/client codepage, locale, or platform—addressing the VARCHAR non-ASCII decoding issues reported in #553.

Changes:

  • Switch Arrow batch binding/fetching for SQL_CHAR/SQL_VARCHAR from SQL_C_CHAR to SQL_C_WCHAR to avoid codepage-dependent decoding.
  • Remove the narrow-char copy path for SQL_CHAR/SQL_VARCHAR in Arrow batch production and route through the existing wide-char → UTF-8 conversion logic.
  • Add an Arrow regression test covering Unicode round-tripping through a UTF-8-collated VARCHAR column.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
mssql_python/pybind/ddbc_bindings.cpp Changes Arrow batch binding and fetch handling so VARCHAR is requested as SQL_C_WCHAR, ensuring consistent Unicode correctness.
tests/test_004_cursor_arrow.py Adds a regression test to validate Arrow output for Unicode data stored in VARCHAR with UTF-8 collation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/test_004_cursor_arrow.py Outdated
@gargsaumya
Copy link
Copy Markdown
Contributor

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

}
break;
}
case SQL_LONGVARCHAR:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment explaining these now fall through to WCHAR handling because SQL_CHAR is bound as SQL_C_WCHAR at line 4784.

arrowColumnProducer->varVal[idxRowArrow + 1] = start + dataLen;
break;
}
case SQL_LONGVARCHAR:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similarly here.

assert tbl.column(0).to_pylist() == expected
finally:
cursor.execute(f"drop table if exists {table}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix applies to SQL_CHAR, SQL_VARCHAR, and SQL_LONGVARCHAR, but only VARCHAR is tested. Can we add a test for CHAR (fixed-length) type.

cursor.execute(f"drop table if exists {table}")


def test_arrow_varchar_utf8_collation_cp1252(cursor: mssql_python.Cursor):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test uses SQL_Latin1_General_CP1_CI_AS (CP1252), NOT a UTF-8 collation. The current name is slightly misleading, would it be better to name it as test_arrow_varchar_cp1252_collation_unicode?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants