Opened 3 months ago
Last modified 5 weeks ago
#143 assigned defect
Encoding-related failure with Python 3.9
Reported by: | schwa | Owned by: | schwa |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Library | Version: | 4.0.0 |
Keywords: | Python 3.9, encoding, latin1, UTF-8 | Cc: |
Description
When running the tests with tox for Python 3.9, I get
______________ TestOther.test_listdir_with_non_ascii_byte_string _______________ test/test_real_ftp.py:895: in test_listdir_with_non_ascii_byte_string names = host.listdir(path) ftputil/host.py:906: in listdir items = self._stat._listdir(path) ftputil/stat.py:825: in _listdir return self.__call_with_parser_retry(self._real_listdir, path) ftputil/stat.py:801: in __call_with_parser_retry result = method(*args, **kwargs) ftputil/stat.py:681: in _real_listdir raise ftputil.error.PermanentError( E ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used E Debugging info: ftputil 4.0.0, Python 3.9.0 (linux) _____________ TestOther.test_listdir_with_non_ascii_unicode_string _____________ test/test_real_ftp.py:908: in test_listdir_with_non_ascii_unicode_string names = host.listdir(path) ftputil/host.py:906: in listdir items = self._stat._listdir(path) ftputil/stat.py:825: in _listdir return self.__call_with_parser_retry(self._real_listdir, path) ftputil/stat.py:801: in __call_with_parser_retry result = method(*args, **kwargs) ftputil/stat.py:681: in _real_listdir raise ftputil.error.PermanentError( E ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used E Debugging info: ftputil 4.0.0, Python 3.9.0 (linux) ______________ TestOther.test_path_with_non_latin1_unicode_string ______________ test/test_real_ftp.py:922: in test_path_with_non_latin1_unicode_string self.host.mkdir(path) E Failed: DID NOT RAISE <class 'UnicodeEncodeError'> =========================== short test summary info ============================ FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_byte_string FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_unicode_string FAILED test/test_real_ftp.py::TestOther::test_path_with_non_latin1_unicode_string ================== 3 failed, 182 passed in 185.52s (0:03:05) =================== ERROR: InvocationError for command /home/build/ftputil/.tox/py39/bin/python -m pytest test (exited with code 1) ___________________________________ summary ____________________________________ py36: commands succeeded py37: commands succeeded py38: commands succeeded ERROR: py39: commands failed
I assume this is related to https://docs.python.org/3/whatsnew/3.9.html#changes-in-the-python-api :
The encoding parameter has been added to the classes ftplib.FTP and ftplib.FTP_TLS as a keyword-only parameter, and the default encoding is changed from Latin-1 to UTF-8 to follow RFC 2640.
Change History (6)
comment:1 Changed 3 months ago by
comment:2 Changed 3 months ago by
Version number change
Semver.org explains that the API should (primarily) be defined by the software documentation. Now, the ftputil documentation states the following:
- Paths created with ftputil should be the same as paths created with ftplib when using the same (unicode) string as input.
- Unicode string arguments will be encoded to latin-1 when sent to the FTP server.
- The subsection on session factories says that
ftplib.FTP
is the default factory.
Now that Python 3.9's ftplib.FTP
uses UTF-8 encoded paths by default, it's impossible to fulfill all of these three statements; we always have some contradiction. We also can't "ignore" Python 3.9 in ftputil 4.0.0 because ftputil 4.0.0 was released with the tag "Python :: 3", not with individual Python versions. So even if the adoption of Python 3.9 is low, clients may already use ftputil 4.0.0 with Python 3.9. (Although they won't notice any problems as long as they stick to ASCII-only paths.)
So, taking semver into account, this unfortunately is a backward-incompatible change and would require a new major version number, i. e. ftputil 5.0.0.
Planned code changes and new behavior
Contrary to what I wrote in my first comment, I think it would cause more harm than good to use latin-1 as default encoding also for Python 3.9. I guess, at least if explaining/knowing the encoding change in Python 3.9's ftplib module, most developers would assume that ftputil also uses 3.9 ftplib's path encoding.
Therefore, I'm considering the following code/behavior changes:
- Change
ftputil.session.session_factory
to support a new argument as described in the previous comment. But name the argumentpath_encoding
instead ofencoding
to make it clear that it has nothing to do with file contents. ftplib.FTP
will stay the default session factory. Hence in Python 3.8 and earlier, ftputil will use latin-1 path encoding and in Python 3.9 ftputil will use UTF-8 encoding. This is the behavior with ftputil 4.0.0, as long as you pass unicode strings (str
) to the methods that take a path.- At the moment, i. e. in version 4.0.0, ftputil uses an encoding
ftputil.tool.LOSSLESS_ENCODING
, which is set to latin-1. ftputil does not useftplib.FTP.encoding
(which is latin-1 in Python 3.8 and earlier) because theencoding
attribute was never documented.LOSSLESS_ENCODING
is used inftputil.tool.as_str
, which is used to convertbytes
paths tostr
paths to be passed to ftplib. In other words, there are two different places which rely on the encoding, in ftplib and in thebytes
tostr
decoding in ftputil. To (hopefully) make sure that both places use the same encoding, I plan the following internal changes:- In
FTPHost._make_session
, inspect theencoding
attribute of the created session instance, then remember the encoding in theFTPHost
object. - If the created session doesn't have a
session
attribute, fall back to UTF-8 and log a warning. This leaves a small chance where the actual encoding used in the session instance is different from the encoding that ftputil assumes forbytes
tostr
path conversions. If that happens, users need to wrap their session factory in one that sets theencoding
attribute. - Pass this encoding to
tool.as_str
instead of using the hardcodedLOSSLESS
encoding.
- In
Variations
- Add a
path_encoding
argument with defaultNone
to theFTPHost
constructor instead of checking the session'sencoding
attribute. This would be a bit cleaner. However, I don't think hardly anyone will use a session factory that doesn't inherit fromftplib.FTP
and hence is missing theencoding
attribute.
comment:3 Changed 3 months ago by
Milestone: | 4.1.0 |
---|
Deleted milestone 4.1.0 since this ticket will most likely cause a new ftputil version 5.0.0.
comment:4 Changed 5 weeks ago by
The implemented behavior now is:
ftputil.session.session_factory
got a new keyword argumentencoding
(notpath_encoding
as initially suggested). If not specified, this uses the encoding from thebase_class
argument. If specified, use this encoding for paths. This works with session base classes that inherit fromftplib.FTP
andftplib.FTP_TLS
, both with Python 3.8 (and earlier) and Python 3.9. About the argument name:path_encoding
would be clearer, but it may be confusing as the term used inftplib
is justencoding
.
- When a session instance is created, ftputil will take the path encoding from the session's
encoding
attribute. Again, this will work forftplib.FTP
andftplib.FTP_TLS
in Python 3.8 (and earlier) and Python 3.9. If a session instance doesn't have theencoding
attribute, ftputil will fall back to the default "latin-1" encoding. However, since I'm not sure whether this is a reasonable behavior, I may want to change it without releasing another backward-incompatible version. Therefore, the documentation states that in the case of a missingencoding
attribute the behavior is undefined. Note that the encoding isn't used if you pass paths asstr
.
comment:5 follow-up: 6 Changed 5 weeks ago by
If the session instances from the factory don't have an encoding
attribute, other behaviors could be:
- Raise an exception once we get a session instance (i. e. in
_make_session
) withoutencoding
attribute. - Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the
encoding
attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if nobytes
paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCIIbytes
path.
comment:6 Changed 5 weeks ago by
Replying to schwa:
- Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the
encoding
attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if nobytes
paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCIIbytes
path.
Changed behavior: If a session instance doesn't have an encoding
attribute, fail early by raising a NoEncodingError
. Originally this was called MissingEncodingError
, but NoEncodingError
is shorter and hopefully as clear.
As I'm quite serious about Semantic versioning, which ftputil officially follows since ftputil version 4.0.0., I don't want to break backward compatibility in the ftputil 4.x series.
This is my plan:
ftplib.FTP
directly, use something likeftputil.session.session_factory
, add a keyword argumentencoding
, with a default ofNone
. The semantics will be similar touse_passive_mode
:None
means don't passencoding
to thebase_class
constructor, i. e. use thebase_class
default. Otherwise the encoding must be a unicode string which will be passed to thebase_class
constructor.