Releases: jawah/charset_normalizer
Version 3.0.0rc1
This is the last pre-release. If everything goes well, I will publish the stable tag.
3.0.0rc1 (2022-10-18)
Added
- Extend the capability of explain=True: when cp_isolation contains at most two entries (and at least one), the mess detector results are logged in detail
- Support for alternative language frequency set in charset_normalizer.assets.FREQUENCIES
- Add parameter language_threshold in from_bytes, from_path and from_fp to adjust the minimum expected coherence ratio
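As a hedged illustration of the new parameter, the sketch below forwards language_threshold to from_bytes together with cp_isolation and explain. The helper name detect_with_threshold and the sample text are assumptions made for this example; it also assumes charset_normalizer 3.0 or later is installed.

```python
from typing import List, Optional


def detect_with_threshold(
    payload: bytes,
    language_threshold: float,
    cp_isolation: Optional[List[str]] = None,
    explain: bool = False,
):
    """Hypothetical wrapper: return the best match, or None when nothing fits."""
    # Imported lazily so this sketch can be loaded where the package is absent.
    from charset_normalizer import from_bytes

    matches = from_bytes(
        payload,
        cp_isolation=cp_isolation,  # optionally restrict the candidate code pages
        explain=explain,            # log the detection steps while the call runs
        language_threshold=language_threshold,  # minimum expected coherence ratio
    )
    return matches.best()


if __name__ == "__main__":
    sample = "Ceci est un exemple de texte accentué, écrit en français.".encode("cp1252")
    best = detect_with_threshold(sample, language_threshold=0.2)
    print(best.encoding if best is not None else "no match")
```

Raising the threshold makes the language check stricter, so fewer candidates survive; lowering it is more permissive.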
Changed
- Build with static metadata using 'build' frontend
- Make language detection stricter
Fixed
- CLI with option --normalize failed when using a full path for files
- TooManyAccentuatedPlugin induced false positives in the mess detection when too few alpha characters had been fed to it
Removed
- Coherence detector no longer returns 'Simple English'; it returns 'English' instead
- Coherence detector no longer returns 'Classical Chinese'; it returns 'Chinese' instead
Version 3.0.0b2
3.0.0b2 (2022-08-21)
Added
- normalizer --version now specifies whether the current version provides the extra speedup (meaning a mypyc-compiled whl)
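One way to tell a compiled build from a pure-Python one is to look at the file a module was loaded from. The sketch below is an assumption about how such a check could be written with the standard library, not the CLI's actual implementation; is_compiled_extension is a hypothetical helper.

```python
import importlib.machinery
import json
from types import ModuleType


def is_compiled_extension(module: ModuleType) -> bool:
    """Return True when the module was loaded from a native extension file."""
    path = getattr(module, "__file__", None)
    if path is None:
        # Built-in or frozen modules carry no file path; this sketch simply
        # reports them as "not an extension file".
        return False
    return any(
        path.endswith(suffix)
        for suffix in importlib.machinery.EXTENSION_SUFFIXES
    )


# The stdlib json package is loaded from a plain .py source file.
print(is_compiled_extension(json))
```

The same function could be pointed at any imported module to see whether the interpreter picked up a compiled wheel or the source fallback.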
Removed
- Breaking: Methods first() and best() from CharsetMatch
- UTF-7 will no longer appear as "detected" without a recognized SIG/mark (it is unreliable and conflicts with ASCII)
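The UTF-7 entry can be illustrated with the standard library alone: plain ASCII bytes decode without error as UTF-7, so a successful decode proves nothing unless a signature is present. The helper has_utf7_signature and the UTF7_MARKS tuple are assumptions for this sketch, not the library's code.

```python
# Commonly documented UTF-7 byte-order marks: "+/v" followed by 8, 9, + or /.
UTF7_MARKS = (b"+/v8", b"+/v9", b"+/v+", b"+/v/")


def has_utf7_signature(payload: bytes) -> bool:
    """Hypothetical check: does the payload start with a UTF-7 mark?"""
    return payload.startswith(UTF7_MARKS)


ascii_only = b"Hello, world"

# Plain ASCII decodes identically under both codecs, so decoding success
# alone cannot tell UTF-7 apart from ASCII.
assert ascii_only.decode("utf-7") == ascii_only.decode("ascii")

# Without a mark, a detector has no positive evidence for UTF-7.
assert not has_utf7_signature(ascii_only)
assert has_utf7_signature(b"+/v8-Hello, world")
```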
Fixed
- Sphinx warnings when generating the documentation
Version 2.1.1
2.1.1 (2022-08-19)
Deprecated
- Function normalize scheduled for removal in 3.0
Changed
- Removed useless call to decode in fn is_unprintable (#206)
Fixed
- Third-party library (i18n xgettext) crashing not recognizing utf_8 (PEP 263) with underscore from @aleksandernovikov (#204)
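For context, Python itself treats utf_8 and utf-8 as the same codec, which is why the underscore spelling works in a PEP 263 coding line even though an external tool may only know the hyphenated form. A small sketch using codecs.lookup shows the equivalence; canonical_codec_name is a hypothetical helper and says nothing about how the fix was actually made.

```python
import codecs


def canonical_codec_name(label: str) -> str:
    """Return the name Python registers for a codec label or alias."""
    return codecs.lookup(label).name


# All three spellings resolve to the same registered codec.
for label in ("utf_8", "utf-8", "UTF8"):
    print(label, "->", canonical_codec_name(label))
```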
Version 3.0.0b1
3.0.0b1 (2022-08-15)
Changed
- Optional: Module md.py can be compiled using Mypyc to provide an extra speedup, up to 4x faster than v2.1
Removed
- Breaking: Class aliases CharsetDetector, CharsetDoctor, CharsetNormalizerMatch and CharsetNormalizerMatches
- Breaking: Top-level function normalize
- Breaking: Properties chaos_secondary_pass, coherence_non_latin and w_counter from CharsetMatch
- Support for the backport unicodedata2
Version 2.1.0
2.1.0 (2022-06-19)
Added
- Output the Unicode table version when running the CLI with --version (PR #194)
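The Unicode table version bundled with an interpreter is exposed by the standard library. The sketch below assumes that is the value of interest here; unicode_table_version is a hypothetical helper name.

```python
import unicodedata


def unicode_table_version() -> str:
    """Return the version of the Unicode database shipped with this Python."""
    return unicodedata.unidata_version


print(f"Unicode {unicode_table_version()}")
```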
Changed
- Re-use decoded buffer for single byte character sets from @nijel (PR #175)
- Fixing some performance bottlenecks from @deedy5 (PR #183)
Fixed
- Work around a potential bug in CPython where Zero Width No-Break Space, located in Arabic Presentation Forms-B (Unicode 1.1), is not acknowledged as a space (PR #175)
- CLI default threshold aligned with the API threshold from @oleksandr-kuzmenko (PR #181)
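The Zero Width No-Break Space behaviour mentioned above can be observed directly with the standard library. The is_space_like helper is a hypothetical example of one possible workaround, not the library's own code.

```python
import unicodedata

ZWNBSP = "\ufeff"  # last code point of the Arabic Presentation Forms-B block

print(unicodedata.name(ZWNBSP))      # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.category(ZWNBSP))  # Cf (format character, not a separator)
print(ZWNBSP.isspace())              # False


def is_space_like(character: str) -> bool:
    """Hypothetical helper: treat U+FEFF as a space in addition to str.isspace()."""
    return character.isspace() or character == ZWNBSP
```

Because str.isspace() reports False for this character, code that classifies characters needs an explicit case for it if it should count as whitespace.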
Removed
- Support for Python 3.5 (PR #192)
Deprecated
- Use of backport unicodedata from unicodedata2 as Python is quickly catching up, scheduled for removal in 3.0 (PR #194)
Version 2.0.12
Version 2.0.11
Version 2.0.10
Version 2.0.9
Version 2.0.8
Changed
- Improvement over Vietnamese detection (PR #126)
- MD improvement on trailing data and long foreign (non-pure Latin) data (PR #124)
- Efficiency improvements in cd/alphabet_languages from @adbar (PR #122)
- Call sum() without an intermediary list following PEP 289 recommendations from @adbar (PR #129)
- Code style as refactored by Sourcery-AI (PR #131)
- Minor adjustment on the MD around European words (PR #133)
- Remove and replace SRTs from assets / tests (PR #139)
- Initialize the library logger with a NullHandler by default from @nmaynes (PR #135)
- Setting kwarg explain to True will provisionally add (bounded to the function's lifespan) a specific stream handler (PR #135)
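The PEP 289 entry in the list above refers to passing a generator expression to sum() instead of building a temporary list first. A minimal sketch with made-up data (the ratios and threshold values are assumptions for illustration):

```python
ratios = [0.12, 0.5, 0.83, 0.07]
threshold = 0.1

# Builds a full intermediate list before summing.
with_list = sum([1 for ratio in ratios if ratio >= threshold])

# Feeds values to sum() one at a time, with no intermediate list.
with_generator = sum(1 for ratio in ratios if ratio >= threshold)

assert with_list == with_generator == 3
```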
Fixed
- Fix large (misleading) sequence giving UnicodeDecodeError (PR #137)
- Avoid using too insignificant chunk (PR #137)
Added
- Add and expose function set_logging_handler to configure a specific StreamHandler from @nmaynes (PR #135)
- Add CHANGELOG.md entries, format is based on Keep a Changelog (PR #141)
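The logging entries in this release (a NullHandler by default, a temporary stream handler while explain is enabled, and a helper to attach a StreamHandler) follow a pattern that can be sketched with the standard library alone. The logger name demo_library and the function run_with_explain are assumptions for this sketch and do not reproduce the library's code.

```python
import io
import logging

LOGGER_NAME = "demo_library"  # hypothetical logger name for this sketch

logger = logging.getLogger(LOGGER_NAME)
# A library stays silent by default; the application decides what to show.
logger.addHandler(logging.NullHandler())


def run_with_explain(stream) -> None:
    """Attach a stream handler only for the duration of this call."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s | %(message)s"))
    previous_level = logger.level
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    try:
        logger.debug("detailed trace of the detection steps")
    finally:
        # Restore the logger so nothing leaks past the function's lifespan.
        logger.removeHandler(handler)
        logger.setLevel(previous_level)


buffer = io.StringIO()
run_with_explain(buffer)
logger.debug("emitted after the call, so it is not captured")
print(buffer.getvalue(), end="")
```

The try/finally block is what bounds the handler to the call: even if the work inside raises, the logger returns to its previous configuration.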