Skip to content

Fix isprintable() and fix Unicode 3.2#7947

Draft
joshuamegnauth54 wants to merge 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable
Draft

Fix isprintable() and fix Unicode 3.2#7947
joshuamegnauth54 wants to merge 1 commit into
RustPython:mainfrom
joshuamegnauth54:fix-isprintable

Conversation

@joshuamegnauth54
Copy link
Copy Markdown
Contributor

Python bundles an old version of Unicode for compatibility. RustPython tries to mimic supporting that old version by checking the version of individual chars. This is a problem for a few reasons. The first is that the age check adds an additional hit per each char lookup in Unicode data. The check is outdated because the unic-ucd-age crate is several versions behind the current Unicode version. The check rejects valid chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode 16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0 database.

Unfortunately, there isn't a crate that can help us here. icu4x targets modern Unicode versions. Writing a data provider for icu4x for Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the Unicode 3.2.0 data myself but to skip icu4x (mostly) to instead write small lookup tables.

As of this commit, Unicode names is still wrong but I'll try to figure that out soon.

Python bundles an old version of Unicode for compatibility. RustPython
tries to mimic supporting that old version by checking the version of
individual chars. This is a problem for a few reasons. The first is that
the age check adds an additional hit per each char lookup in Unicode
data. The check is outdated because the `unic-ucd-age` crate is several
versions behind the current Unicode version. The check rejects valid
chars because of the version differences.

The check is subtly wrong because it returns properties for Unicode
16.0.0 for Unicode 3.2.0 while checking against a Unicode 10.0.0
database.

Unfortunately, there isn't a crate that can help us here. `icu4x`
targets modern Unicode versions. Writing a data provider for `icu4x` for
Unicode 3.2.0 is a lot of work for a legacy path. I opted to parse the
Unicode 3.2.0 data myself but to skip `icu4x` (mostly) to instead write
small lookup tables.

As of this commit, Unicode names is still wrong but I'll try to figure
that out soon.
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 21, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 265b2e54-1058-4e73-975b-8c1cd79b6efd

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 6)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

[x] lib: cpython/Lib/pyclbr.py
[ ] test: cpython/Lib/test/test_pyclbr.py (TODO: 2)

dependencies:

  • pyclbr

dependent tests: (1 tests)

  • pyclbr: test_pyclbr

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

| GeneralCategory::Unassigned
)
c == '\u{0020}' || printable.contains(GeneralCategory::for_char(c))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

@ShaharNaveh ShaharNaveh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TYSM!

Overall I really like this change.

The only thing rhat might be worth is to a README.md under crates/stdlib/unicode/ucd32/README.md with links for the soure of each file so we can upsate them in the future if needed

}

/// Supported Unicode version based on icu4x.
pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume icu4x doesn't expose this:/

Comment thread Lib/test/test_str.py
self.assertTrue('\U0001F46F'.isprintable())
self.assertFalse('\U000E0020'.isprintable())

@unittest.expectedFailure # TODO: RUSTPYTHON
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants