~sschwarzer/ftputil#143: 
Encoding-related failure with Python 3.9

When running the tests with tox for Python 3.9, I get

______________ TestOther.test_listdir_with_non_ascii_byte_string _______________
test/test_real_ftp.py:895: in test_listdir_with_non_ascii_byte_string
    names = host.listdir(path)
ftputil/host.py:906: in listdir
    items = self._stat._listdir(path)
ftputil/stat.py:825: in _listdir
    return self.__call_with_parser_retry(self._real_listdir, path)
ftputil/stat.py:801: in __call_with_parser_retry
    result = method(*args, **kwargs)
ftputil/stat.py:681: in _real_listdir
    raise ftputil.error.PermanentError(
E   ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used
E   Debugging info: ftputil 4.0.0, Python 3.9.0 (linux)
_____________ TestOther.test_listdir_with_non_ascii_unicode_string _____________
test/test_real_ftp.py:908: in test_listdir_with_non_ascii_unicode_string
    names = host.listdir(path)
ftputil/host.py:906: in listdir
    items = self._stat._listdir(path)
ftputil/stat.py:825: in _listdir
    return self.__call_with_parser_retry(self._real_listdir, path)
ftputil/stat.py:801: in __call_with_parser_retry
    result = method(*args, **kwargs)
ftputil/stat.py:681: in _real_listdir
    raise ftputil.error.PermanentError(
E   ftputil.error.PermanentError: 550 /äbc: no such directory or wrong directory parser used
E   Debugging info: ftputil 4.0.0, Python 3.9.0 (linux)
______________ TestOther.test_path_with_non_latin1_unicode_string ______________
test/test_real_ftp.py:922: in test_path_with_non_latin1_unicode_string
    self.host.mkdir(path)
E   Failed: DID NOT RAISE <class 'UnicodeEncodeError'>
=========================== short test summary info ============================
FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_byte_string
FAILED test/test_real_ftp.py::TestOther::test_listdir_with_non_ascii_unicode_string
FAILED test/test_real_ftp.py::TestOther::test_path_with_non_latin1_unicode_string
================== 3 failed, 182 passed in 185.52s (0:03:05) ===================
ERROR: InvocationError for command /home/build/ftputil/.tox/py39/bin/python -m pytest test (exited with code 1)
___________________________________ summary ____________________________________
  py36: commands succeeded
  py37: commands succeeded
  py38: commands succeeded
ERROR:   py39: commands failed

I assume this is related to ​https://docs.python.org/3/whatsnew/3.9.html#changes-in-the-python-api :

The encoding parameter has been added to the classes ftplib.FTP and ftplib.FTP_TLS as a keyword-only parameter, and the default encoding is changed from Latin-1 to UTF-8 to follow RFC 2640.

Status
RESOLVED FIXED
Submitter
schwa (unverified)
Assigned to
No-one
Submitted
3 years ago
Updated
3 years ago
Labels
bug library

schwa (unverified) 3 years ago · edit

As I'm quite serious about ​Semantic versioning, which ftputil officially follows since ftputil version 4.0.0., I don't want to break backward compatibility in the ftputil 4.x series.

This is my plan:

  • For the default session factory, instead of using ftplib.FTP directly, use something like

    class DefaultFTP(ftplib.FTP):
        encoding = "latin-1"
    

This change unfortunately has the effect that for Python 3.9, paths with non-latin-1 characters will be incompatible between ftplib.py and ftputil. On the other hand, setting the default encoding to "UTF-8" would be incompatible between ftplib.py and ftputil for Python 3.6, 3.7 and 3.8, which ftputil supports, too. Using a default of "UTF-8" or using the implicit default for the Python version would also break backward compatibility between ftputil 4.0.0 and the next version, which I don't want.

Users who want the Python 3.9 default behavior can pass its ftplib.FTP or ftplib.FTP_TLS as the session_factory argument to FTPHost.

When Python 3.8 reaches end of life, we can release a new ftputil 5.0.0, which uses UTF-8 as default encoding like Python 3.9's ftplib.py. Of course, we could reach version 5.0.0 sooner, but in my opinion it would be silly to switch to ftputil 5.0.0 to just change the session factory default encoding.

  • In ftputil.session.session_factory, add a keyword argument encoding, with a default of None. The semantics will be similar to use_passive_mode: None means don't pass encoding to the base_class constructor, i. e. use the base_class default. Otherwise the encoding must be a unicode string which will be passed to the base_class constructor.

This change makes it relatively easy to adapt the wanted default encoding if client code already uses a factory made with ftputil.session.session_factory.

  • Document the changes, with a link to this ticket to explain the background.

schwa (unverified) 3 years ago · edit

Version number change

​Semver.org explains that the API should (primarily) be defined by the software documentation. Now, the ​ftputil documentation states the following:

  • Paths created with ftputil should be the same as paths created with ftplib when using the same (unicode) string as input.
  • Unicode string arguments will be encoded to latin-1 when sent to the FTP server.
  • The subsection on ​session factories says that ftplib.FTP is the default factory.

Now that Python 3.9's ftplib.FTP uses UTF-8 encoded paths by default, it's impossible to fulfill all of these three statements; we always have some contradiction. We also can't "ignore" Python 3.9 in ftputil 4.0.0 because ftputil 4.0.0 was released with the tag "Python :: 3", not with individual Python versions. So even if the adoption of Python 3.9 is low, clients may already use ftputil 4.0.0 with Python 3.9. (Although they won't notice any problems as long as they stick to ASCII-only paths.)

So, taking semver into account, this unfortunately is a backward-incompatible change and would require a new major version number, i. e. ftputil 5.0.0.

Planned code changes and new behavior

Contrary to what I wrote in my first comment, I think it would cause more harm than good to use latin-1 as default encoding also for Python 3.9. I guess, at least if explaining/knowing the encoding change in Python 3.9's ftplib module, most developers would assume that ftputil also uses 3.9 ftplib's path encoding.

Therefore, I'm considering the following code/behavior changes:

  • Change ftputil.session.session_factory to support a new argument as described in the previous comment. But name the argument path_encoding instead of encoding to make it clear that it has nothing to do with file contents.
  • ftplib.FTP will stay the default session factory. Hence in Python 3.8 and earlier, ftputil will use latin-1 path encoding and in Python 3.9 ftputil will use UTF-8 encoding. This is the behavior with ftputil 4.0.0, as long as you pass unicode strings (str) to the methods that take a path.
  • At the moment, i. e. in version 4.0.0, ftputil uses an encoding ftputil.tool.LOSSLESS_ENCODING, which is set to latin-1. ftputil does not use ftplib.FTP.encoding (which is latin-1 in Python 3.8 and earlier) because the encoding attribute was never documented. LOSSLESS_ENCODING is used in ftputil.tool.as_str, which is used to convert bytes paths to str paths to be passed to ftplib. In other words, there are two different places which rely on the encoding, in ftplib and in the bytes to str decoding in ftputil. To (hopefully) make sure that both places use the same encoding, I plan the following internal changes:
    • In FTPHost._make_session, inspect the encoding attribute of the created session instance, then remember the encoding in the FTPHost object.
    • If the created session doesn't have a session attribute, fall back to UTF-8 and log a warning. This leaves a small chance where the actual encoding used in the session instance is different from the encoding that ftputil assumes for bytes to str path conversions. If that happens, users need to wrap their session factory in one that sets the encoding attribute.
    • Pass this encoding to tool.as_str instead of using the hardcoded LOSSLESS encoding.

Variations

  • Add a path_encoding argument with default None to the FTPHost constructor instead of checking the session's encoding attribute. This would be a bit cleaner. However, I don't think hardly anyone will use a session factory that doesn't inherit from ftplib.FTP and hence is missing the encoding attribute.

schwa (unverified) 3 years ago · edit

Deleted milestone 4.1.0 since this ticket will most likely cause a new ftputil version 5.0.0.

schwa (unverified) 3 years ago · edit

The implemented behavior now is:

  • ftputil.session.session_factory got a new keyword argument encoding (not path_encoding as initially suggested). If not specified, this uses the encoding from the base_class argument. If specified, use this encoding for paths. This works with session base classes that inherit from ftplib.FTP and ftplib.FTP_TLS, both with Python 3.8 (and earlier) and Python 3.9. About the argument name: path_encoding would be clearer, but it may be confusing as the term used in ftplib is just encoding.

  • When a session instance is created, ftputil will take the path encoding from the session's encoding attribute. Again, this will work for ftplib.FTP and ftplib.FTP_TLS in Python 3.8 (and earlier) and Python 3.9. If a session instance doesn't have the encoding attribute, ftputil will fall back to the default "latin-1" encoding. However, since I'm not sure whether this is a reasonable behavior, I may want to change it without releasing another backward-incompatible version. Therefore, the documentation states that in the case of a missing encoding attribute the behavior is undefined. Note that the encoding isn't used if you pass paths as str.

schwa (unverified) 3 years ago · edit

If the session instances from the factory don't have an encoding attribute, other behaviors could be:

  • Raise an exception once we get a session instance (i. e. in _make_session) without encoding attribute.
  • Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the encoding attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if no bytes paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCII bytes path.

schwa (unverified) 3 years ago · edit

Replying to schwa:

  • Use a different default encoding. I think UTF-8 would not be a good idea because in other places the default encoding is latin-1 and having yet another encoding would be very confusing. However, I can imagine using ASCII as the default encoding if the encoding attribute is missing. With this behavior, we can at least still process ASCII paths. Also, if no bytes paths are used, the encoding isn't used (by ftputil) anyway. Then again, maybe it's better to raise an exception than working most of the time (as long as we have only ASCII paths), but fail at some later point in production because we encounter a non-ASCII bytes path.

Changed behavior: If a session instance doesn't have an encoding attribute, fail early by raising a NoEncodingError. Originally this was called MissingEncodingError, but NoEncodingError is shorter and hopefully as clear.

Register here or Log in to comment, or comment via email.