Changeset 2043:d268915b0b9f


Timestamp:
Feb 1, 2021, 3:19:14 PM
Author:
Stefan Schwarzer <sschwarzer@…>
Branch:
default
Message:
Rewrite section of directory/file paths

ticket: 143
File:
1 edited

  • doc/ftputil.txt

    r1979 r2043  
    249249   `FTPHost.open`_.
    250250
    251 First off: If your directory and file names (both as arguments and on
    252 the server) contain only ISO 8859-1 (latin-1) characters, you can use
    253 such names in the form of ``bytes`` or ``str`` objects. However, you
    254 can't mix different string types (``bytes`` and ``str``) in one call
    255 (for example in ``FTPHost.path.join``).
    256 
    257 If you have directory or file names with characters that aren't in
    258 latin-1, it's recommended to use ``bytes`` objects. In that case,
    259 returned paths will be ``bytes`` objects, too.
    260 
    261 Read on for details.
    262 
    263 .. note::
    264 
    265    The approach described below may look awkward and in a way it is.
    266    The intention of ``ftputil`` is to behave like the local file
    267    system APIs of Python 3 as far as it makes sense. Moreover, the
    268    taken approach makes sure that directory and file names that were
    269    used with Python 3's native ``ftplib`` module will be compatible
    270    with ``ftputil`` and vice versa. Otherwise you may be able to use a
    271    file name with ``ftputil``, but get an exception when trying to
    272    read the same file with Python 3's ``ftplib`` module.
    273 
    274 Methods that take paths of directories and/or files can take either
    275 ``bytes`` or ``str`` objects, or `PathLike`_ objects that can be
    276 converted to ``bytes`` or ``str``.
     251Generally, paths can be ``str`` or ``bytes`` objects (or `PathLike`_
     252objects wrapping ``str`` or ``bytes``). However, you can't mix
     253different string types (``bytes`` and ``str``) in one call (for
     254example in ``FTPHost.path.join``). If a method gets a string argument
     255(or a string argument wrapped in a PathLike_ object) and returns one
     256or more strings, these strings will have the same string type
     257(``bytes`` or ``str``) as the argument(s). Mixing different string
     258types in one call (for example in ``FTPHost.path.join``) isn't allowed
     259and will cause a ``TypeError``. These rules are the same as for local
     260file system operations.
    277261
    278262.. _PathLike: https://docs.python.org/3/library/os.html#os.PathLike
    279263
    280 If a method gets a string argument (or a string argument wrapped in a
    281 PathLike_ object) and returns one or more strings, these strings will
    282 have the same string type (``bytes`` or ``str``) as the argument(s).
    283 Mixing different string types in one call (for example in
    284 ``FTPHost.path.join``) isn't allowed and will cause a ``TypeError``.
    285 These rules are the same as for local file system operations in Python 3.
    286 
    287 ``bytes`` objects for directory and file names will be sent to the
    288 server as-is. On the other hand, ``str`` objects will be encoded to
    289 ``bytes`` objects, assuming latin-1 encoding. This implies that such
    290 ``str`` objects must only contain code points 0-255 for the latin-1
    291 character set. Using any other characters will result in a
    292 ``UnicodeEncodeError`` exception.
    293 
    294 If you have directory or file names as ``str`` objects with
    295 non-latin-1 characters, encode the strings to ``bytes`` yourself,
    296 using the encoding you know the server uses for its file system.
    297 Decode received paths with the same encoding. Encapsulate these
    298 conversions as far as you can. Otherwise, you'd have to adapt
    299 potentially a lot of code if the server encoding changes.
    300 
    301 If you *don't* know the encoding on the server side, it's probably the
    302 best to only use ``bytes`` for directory and file names. That said, as
    303 soon as you *show* the names to a user, you -- or the library you use
    304 for displaying the names -- has to guess an encoding.
    305 
    306 If you can decide about paths yourself, it's generally safest to use
    307 only ASCII characters in FTP paths.
     264Although you can pass paths as ``str`` or ``bytes``, the former is
     265recommended. See below for the reason.
     266
     267*If* you have directory or file names with non-ASCII characters, you
     268need to be aware of the encoding the `session factory`_ (e. g.
     269``ftplib.FTP``) uses. This needs to be the same encoding that the FTP
     270server uses for the paths.
     271
     272The following diagram shows string conversions on the way from your
     273code to the remote FTP server. The opposite way works analogously, so
     274encoding steps in the diagram become decoding steps and decoding steps
     275in the diagram become encoding steps.
     276
     277Both "branching points" in the upper and lower parts of the diagram are
     278independent, so depending on how you pass paths to ftputil and which
     279file system API the FTP server uses, there are four possible
     280combinations.
     281
     282::
     283
     284     +-----------+       +-----------+
     285     | Your code |       | Your code |
     286     +-----------+       +-----------+
     287          |                    |
     288          |  str               |  bytes
     289          v                    v
     290    +-------------+     +-------------+  decode with encoding of session,
     291    | ftputil API |     | ftputil API |  e. g. `ftplib.FTP` instance
     292    +-------------+     +-------------+
     293            \               /
     294             \     str     /
     295              v           v
     296            +---------------+  encode with encoding
     297            |  ftplib API   |  specified in `FTP` instance
     298            +---------------+
     299                    |
     300                    |  bytes
     301                    v
     302             +-------------+
     303             | socket API  |
     304             +-------------+
     305                /       \
     306               /         \                 local / client
     307    - - - - - / - - - - - \ - - - - - - - - - - - - - - - - - - - - - -
     308             /             \              remote / server
     309            /     bytes     \
     310           v                 v
     311    +------------+      +------------+  decode with encoding from
     312    | FTP server |      | FTP server |  FTP server configuration
     313    +------------+      +------------+
     314          |                   |
     315          |  bytes            |  str
     316          v                   v
     317   +-------------+      +-------------+
     318   | remote file |      | remote file |
     319   | system API  |      | system API  |
     320   +-------------+      +-------------+
     321           \                 /
     322            \      bytes    /
     323             v             v
     324          +-------------------+
     325          |    file system    |
     326          +-------------------+
     327
     328As you can see at the top of the diagram, if you use ``str`` objects
     329(regular unicode strings), there's one fewer decoding step, and so one
     330fewer source of problems. If you use ``bytes`` objects for paths,
     331ftputil tries to get the encoding for the FTP server from the
     332``encoding`` attribute of the session instance (say, an instance of
     333``ftplib.FTP``). If no ``encoding`` attribute is present, the behavior
     334is undefined.
     335
     336All encoding/decoding steps must use the same encoding, the encoding
     337the server uses (at the bottom of the diagram). If the server uses the
     338bytes from the socket directly, i. e. without an encoding step, you
     339have to use the file system encoding.
     340
     341Up to and including Python 3.8, the encoding implicitly assumed by
     342the ``ftplib`` module was latin-1, so using ``bytes`` was the safest
     343strategy. However, Python 3.9 made the encoding
     344configurable via an ``ftplib.FTP`` constructor argument ``encoding``,
     345*which defaults to UTF-8*.
     346
     347If you don't pass a `session factory`_ to the ``ftputil.FTPHost``
     348constructor, ftputil will use latin-1 encoding for the paths. This is
     349the same value as in earlier ftputil versions in combination with
     350Python 3.8 and earlier.
     351
     352Summary:
     353
     354- If possible, use only ASCII characters in paths.
     355- If possible, pass paths to ftputil as ``str``, not ``bytes``.
     356- If you use a custom session factory and ``bytes`` paths, the session
     357  instances created by the factory must have an ``encoding`` attribute
     358  with the name of the path encoding to use. If your session instances
     359  don't have an ``encoding`` attribute, ftputil behavior is undefined.
    308360
    309361
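The conversion chain in the diagram above can be imitated with plain Python string methods. The following is only a minimal sketch: the helper ``to_wire_bytes`` and the file name are hypothetical and not part of ftputil or ``ftplib``. It assumes that a ``bytes`` path is first decoded with the session encoding and that the resulting ``str`` is then encoded with the same encoding before it reaches the socket.

```python
def to_wire_bytes(path, session_encoding):
    """Hypothetical helper imitating the upper half of the diagram."""
    if isinstance(path, bytes):
        # ftputil step (assumed): decode `bytes` with the session encoding.
        path = path.decode(session_encoding)
    # ftplib step (assumed): encode `str` with the session encoding.
    return path.encode(session_encoding)


name = "M\u00e4rz.txt"  # contains one non-ASCII character

# Client and server agree on UTF-8: the server recovers the original name.
wire = to_wire_bytes(name, "utf-8")
recovered = wire.decode("utf-8")

# Client uses UTF-8, but the server decodes with latin-1: the name is garbled.
garbled = wire.decode("latin-1")

# A `bytes` path passes through unchanged if it is valid in the session encoding.
same_bytes = to_wire_bytes(b"M\xe4rz.txt", "latin-1")

# Characters outside latin-1 can't be encoded with a latin-1 session.
try:
    to_wire_bytes("\u65e5\u672c.txt", "latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True
```

With ASCII-only names, all of these encodings produce the same bytes, which is why the summary recommends ASCII paths where possible.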
     
    341393body of the ``with`` statement, the instance is closed as well.
    342394Exceptions will be propagated (as with ``try ... finally``).
     395
     396.. _`session factory`:
    343397
    344398Session factories
     
    400454                    use_passive_mode=None,
    401455                    encrypt_data_channel=True,
     456                    encoding=None,
    402457                    debug_level=None)
    403458
     
    420475  parameter is ignored.
    421476
     477- ``encoding`` can be a string to set the encoding of directory and
     478  file paths on the remote server. (This has nothing to do with the
     479  encoding of file contents!) If you pass a string and your base class
     480  is neither ``ftplib.FTP`` nor ``ftplib.FTP_TLS``, the heuristic used
     481  in ``session_factory`` may not work reliably. Therefore, if in
     482  doubt, let ``encoding`` be ``None`` and define your ``base_class``
     483  so that it sets the encoding you want.
     484
     485  Note: In Python 3.9, the default path encoding for ``ftplib.FTP``
     486  and ``ftplib.FTP_TLS`` changed from "latin-1" to "utf-8".
     487  Hence, if you don't pass an ``encoding`` to ``session_factory``,
     488  you'll get different path encodings for Python 3.8 and earlier vs.
     489  Python 3.9 and later.
     490
     491  If you're sure that you always use only ASCII characters in your
     492  remote paths, you don't need to worry about the path encoding and
     493  don't need to use the ``encoding`` argument.
     494
    422495- ``debug_level`` sets the debug level for FTP session instances. The
    423496  semantics is defined by the base class. For example, a debug level
     
    439512                           port=31,
    440513                           encrypt_data_channel=True,
     514                           encoding="UTF-8",
    441515                           debug_level=2)
    442516
     
    446520
    447521to create and use a session factory derived from ``ftplib.FTP_TLS``
    448 that connects on command channel 31, will encrypt the data channel and
    449 print output for debug level 2.
     522that connects on command channel 31, will encrypt the data channel,
     523use the UTF-8 encoding for remote paths and print output for debug
     524level 2.
    450525
    451526Note: Generally, you can achieve everything you can do with
    452527``ftputil.session.session_factory`` with an explicit session factory
    453 as described at the start of this section. However, the class
    454 ``M2Crypto.ftpslib.FTP_TLS`` has a limitation so that you can't use
    455 it with ftputil out of the box. The function ``session_factory``
    456 contains a workaround for this limitation. For details refer to `this
    457 ticket`_.
    458 
    459 .. _`this ticket`: https://ftputil.sschwarzer.net/trac/ticket/78
     528as described at the start of this section.
    460529
    461530Hidden files and directories
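As the changed text says, anything ``session_factory`` does can also be done with an explicit session class. The following minimal sketch shows one way such a class might pin the path encoding so that it doesn't depend on the Python version. The class name ``Latin1Session`` and the optional constructor arguments are assumptions made for illustration; the sketch relies only on the standard ``ftplib`` module.

```python
import ftplib


class Latin1Session(ftplib.FTP):
    """Hypothetical session class with an explicit path encoding."""

    def __init__(self, host="", userid="", password="", port=21):
        # Don't pass `encoding` to the base class constructor, since this
        # keyword argument only exists in Python 3.9 and later.
        super().__init__()
        # Set the attribute directly. This works for old and new Python
        # versions and is the attribute the text says ftputil reads.
        self.encoding = "latin-1"
        # Only connect if a host was given, so the class can be
        # instantiated without network access.
        if host:
            self.connect(host, port)
            self.login(userid, password)


# Without a host, no connection is made. The encoding is set regardless.
session = Latin1Session()
print(session.encoding)
```

A class like this could presumably be passed as the ``session_factory`` argument of ``ftputil.FTPHost``, together with the host, user and password arguments that the class expects.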