bpo-25433: Streamline whitespace definitions #14753

Closed
wants to merge 2 commits into from
12 changes: 12 additions & 0 deletions Doc/glossary.rst
@@ -1144,6 +1144,18 @@ Glossary
A computer defined entirely in software. Python's virtual machine
executes the :term:`bytecode` emitted by the bytecode compiler.

whitespace
Characters that represent horizontal or vertical space.
In ASCII context, Python recognizes these characters as whitespace:

"\t\n\v\f\r " (tab, newline, vertical tab, form feed, carriage return, space)

In Unicode context, whitespace characters are those
characters defined in the Unicode character database as "Other" or "Separator"
and those with bidirectional property being one of "WS", "B", or "S".

This is used, for example, to split or strip strings.
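
As a rough illustration of the two contexts described in this glossary entry, the sketch below compares `str.isspace()` (Unicode context) with `bytes.isspace()` (ASCII context) for a few code points. The sample characters and the `classify` helper are chosen here for illustration and are not part of the patch.

```python
# Characters picked for illustration only; they do not come from the patch.
samples = {
    "space": " ",
    "tab": "\t",
    "file separator (U+001C)": "\x1c",
    "no-break space (U+00A0)": "\xa0",
    "zero width space (U+200B)": "\u200b",
}


def classify(ch):
    """Return (unicode_ws, ascii_ws) for a single character (hypothetical helper)."""
    unicode_ws = ch.isspace()
    # Only code points below 128 can be encoded as ASCII at all.
    ascii_ws = ord(ch) < 128 and ch.encode("ascii").isspace()
    return unicode_ws, ascii_ws


results = {name: classify(ch) for name, ch in samples.items()}
for name, (unicode_ws, ascii_ws) in results.items():
    print(f"{name:28} str: {unicode_ws!s:5} bytes: {ascii_ws}")
```

The file separator and the no-break space count as whitespace only in the Unicode context, while the zero width space counts in neither.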

Zen of Python
Listing of Python design principles and philosophies that are helpful in
understanding and using the language. The listing can be found by typing
46 changes: 21 additions & 25 deletions Doc/library/stdtypes.rst
@@ -594,7 +594,7 @@ debugging, and in numerical work.

Class method to return the float represented by a hexadecimal
string *s*. The string *s* may have leading and trailing
whitespace.
:term:`whitespace`.
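
A minimal sketch of the behaviour this sentence documents: leading and trailing whitespace around the hexadecimal string is accepted. The input value is an arbitrary example, not taken from the patch.

```python
# "0x1.8p1" is 1.5 * 2**1; the surrounding whitespace is illustrative.
padded = float.fromhex("  0x1.8p1\n")
plain = float.fromhex("0x1.8p1")
print(padded, plain)
```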


Note that :meth:`float.hex` is an instance method, while
@@ -1407,10 +1407,10 @@ written in a variety of ways:
* Double quotes: ``"allows embedded 'single' quotes"``.
* Triple quoted: ``'''Three single quotes'''``, ``"""Three double quotes"""``

Triple quoted strings may span multiple lines - all associated whitespace will
Triple quoted strings may span multiple lines - all associated :term:`whitespace` will
be included in the string literal.

String literals that are part of a single expression and have only whitespace
String literals that are part of a single expression and have only :term:`whitespace`
between them will be implicitly converted to a single string literal. That
is, ``("spam " "eggs") == "spam eggs"``.

@@ -1762,10 +1762,8 @@ expression support in the :mod:`re` module).

.. method:: str.isspace()

Return true if there are only whitespace characters in the string and there is
at least one character, false otherwise. Whitespace characters are those
characters defined in the Unicode character database as "Other" or "Separator"
and those with bidirectional property being one of "WS", "B", or "S".
Return true if there are only :term:`whitespace` characters in the string and there is
at least one character, false otherwise.

.. method:: str.istitle()

@@ -1808,7 +1806,7 @@ expression support in the :mod:`re` module).

Return a copy of the string with leading characters removed. The *chars*
argument is a string specifying the set of characters to be removed. If omitted
or ``None``, the *chars* argument defaults to removing whitespace. The *chars*
or ``None``, the *chars* argument defaults to removing :term:`whitespace`. The *chars*
argument is not a prefix; rather, all combinations of its values are stripped::

>>> ' spacious '.lstrip()
@@ -1879,7 +1877,7 @@ expression support in the :mod:`re` module).

Return a list of the words in the string, using *sep* as the delimiter string.
If *maxsplit* is given, at most *maxsplit* splits are done, the *rightmost*
ones. If *sep* is not specified or ``None``, any whitespace string is a
ones. If *sep* is not specified or ``None``, any :term:`whitespace` string is a
separator. Except for splitting from the right, :meth:`rsplit` behaves like
:meth:`split` which is described in detail below.

@@ -1888,7 +1886,7 @@ expression support in the :mod:`re` module).

Return a copy of the string with trailing characters removed. The *chars*
argument is a string specifying the set of characters to be removed. If omitted
or ``None``, the *chars* argument defaults to removing whitespace. The *chars*
or ``None``, the *chars* argument defaults to removing :term:`whitespace`. The *chars*
argument is not a suffix; rather, all combinations of its values are stripped::

>>> ' spacious '.rstrip()
@@ -1921,7 +1919,7 @@ expression support in the :mod:`re` module).
['1', '2', '', '3', '']

If *sep* is not specified or is ``None``, a different splitting algorithm is
applied: runs of consecutive whitespace are regarded as a single separator,
applied: runs of consecutive :term:`whitespace` are regarded as a single separator,
and the result will contain no empty strings at the start or end if the
string has leading or trailing whitespace. Consequently, splitting an empty
string or a string consisting of just whitespace with a ``None`` separator
@@ -2015,7 +2013,7 @@ expression support in the :mod:`re` module).

Return a copy of the string with the leading and trailing characters removed.
The *chars* argument is a string specifying the set of characters to be removed.
If omitted or ``None``, the *chars* argument defaults to removing whitespace.
If omitted or ``None``, the *chars* argument defaults to removing :term:`whitespace`.
The *chars* argument is not a prefix or suffix; rather, all combinations of its
values are stripped::

@@ -2391,7 +2389,7 @@ data and are closely related to string objects in a variety of other ways.

This :class:`bytes` class method returns a bytes object, decoding the
given string object. The string must contain two hexadecimal digits per
byte, with ASCII whitespace being ignored.
byte, with ASCII :term:`whitespace` being ignored.

>>> bytes.fromhex('2Ef0 F1f2 ')
b'.\xf0\xf1\xf2'
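
To make "ASCII whitespace being ignored" concrete, here is a small sketch. It assumes Python 3.7 or later, where `bytes.fromhex()` skips all ASCII whitespace between bytes rather than only spaces; the input strings are illustrative.

```python
# Assumes Python 3.7+, where tabs and newlines are skipped as well as spaces.
compact = bytes.fromhex("2ef0f1f2")
spaced = bytes.fromhex("2E f0\tF1\nf2 ")  # whitespace only between byte pairs
print(compact, spaced)
```

Whitespace is placed between complete two-digit pairs here, matching the documented "two hexadecimal digits per byte" requirement.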
@@ -2485,7 +2483,7 @@ objects.

This :class:`bytearray` class method returns bytearray object, decoding
the given string object. The string must contain two hexadecimal digits
per byte, with ASCII whitespace being ignored.
per byte, with ASCII :term:`whitespace` being ignored.

>>> bytearray.fromhex('2Ef0 F1f2 ')
bytearray(b'.\xf0\xf1\xf2')
@@ -2812,7 +2810,7 @@ produce new objects.
*chars* argument is a binary sequence specifying the set of byte values to
be removed - the name refers to the fact this method is usually used with
ASCII characters. If omitted or ``None``, the *chars* argument defaults
to removing ASCII whitespace. The *chars* argument is not a prefix;
to removing ASCII :term:`whitespace`. The *chars* argument is not a prefix;
rather, all combinations of its values are stripped::

>>> b' spacious '.lstrip()
@@ -2849,7 +2847,7 @@ produce new objects.
Split the binary sequence into subsequences of the same type, using *sep*
as the delimiter string. If *maxsplit* is given, at most *maxsplit* splits
are done, the *rightmost* ones. If *sep* is not specified or ``None``,
any subsequence consisting solely of ASCII whitespace is a separator.
any subsequence consisting solely of ASCII :term:`whitespace` is a separator.
Except for splitting from the right, :meth:`rsplit` behaves like
:meth:`split` which is described in detail below.

@@ -2861,7 +2859,7 @@ produce new objects.
*chars* argument is a binary sequence specifying the set of byte values to
be removed - the name refers to the fact this method is usually used with
ASCII characters. If omitted or ``None``, the *chars* argument defaults to
removing ASCII whitespace. The *chars* argument is not a suffix; rather,
removing ASCII :term:`whitespace`. The *chars* argument is not a suffix; rather,
all combinations of its values are stripped::

>>> b' spacious '.rstrip()
@@ -2906,11 +2904,11 @@ produce new objects.
[b'1', b'2', b'', b'3', b'']

If *sep* is not specified or is ``None``, a different splitting algorithm
is applied: runs of consecutive ASCII whitespace are regarded as a single
is applied: runs of consecutive ASCII :term:`whitespace` are regarded as a single
separator, and the result will contain no empty strings at the start or
end if the sequence has leading or trailing whitespace. Consequently,
end if the sequence has leading or trailing :term:`whitespace`. Consequently,
splitting an empty sequence or a sequence consisting solely of ASCII
whitespace without a specified separator returns ``[]``.
:term:`whitespace` without a specified separator returns ``[]``.
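
The difference between ASCII and Unicode whitespace in the separator-less split can be sketched as follows. The sample inputs are assumptions made for illustration; they do not come from the patch.

```python
# Sample inputs are illustrative assumptions, not taken from the patch.
text = "1\xa02 \t 3"           # contains U+00A0 NO-BREAK SPACE
data = text.encode("latin-1")  # the same code points as single bytes

str_parts = text.split()       # Unicode whitespace rules apply
bytes_parts = data.split()     # only ASCII whitespace separates fields

# Empty or whitespace-only sequences yield an empty list.
empty_cases = [b"".split(), b" \t\n".split(), bytearray(b"  ").split()]

print(str_parts, bytes_parts, empty_cases)
```

The byte 0xA0 is not ASCII whitespace, so it stays inside the first field of the bytes result, while the same code point separates fields in the `str` result.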

For example::

@@ -2930,7 +2928,7 @@ produce new objects.
removed. The *chars* argument is a binary sequence specifying the set of
byte values to be removed - the name refers to the fact this method is
usually used with ASCII characters. If omitted or ``None``, the *chars*
argument defaults to removing ASCII whitespace. The *chars* argument is
argument defaults to removing ASCII :term:`whitespace`. The *chars* argument is
not a prefix or suffix; rather, all combinations of its values are
stripped::

@@ -3073,10 +3071,8 @@ place, and instead produce new objects.
.. method:: bytes.isspace()
bytearray.isspace()

Return true if all bytes in the sequence are ASCII whitespace and the
sequence is not empty, false otherwise. ASCII whitespace characters are
those byte values in the sequence ``b' \t\n\r\x0b\f'`` (space, tab, newline,
carriage return, vertical tab, form feed).
Return true if all bytes in the sequence are ASCII :term:`whitespace` and the
sequence is not empty, false otherwise.
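
Since the explicit list of byte values moves to the glossary in this change, a quick way to recover the set that `bytes.isspace()` accepts is to probe all 256 byte values. This is only a sketch; the variable names are arbitrary.

```python
# Collect every single byte value that isspace() treats as whitespace.
ascii_ws = bytes(b for b in range(256) if bytes([b]).isspace())
print(ascii_ws)

# An empty sequence is never considered whitespace.
empty_is_space = b"".isspace()
```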


.. method:: bytes.istitle()
161 changes: 74 additions & 87 deletions Objects/bytearrayobject.c
@@ -1780,26 +1780,78 @@ bytearray_remove_impl(PyByteArrayObject *self, int value)
Py_RETURN_NONE;
}

/* XXX These two helpers could be optimized if argsize == 1 */
#define LEFTSTRIP 0
#define RIGHTSTRIP 1
#define BOTHSTRIP 2

static Py_ssize_t
lstrip_helper(const char *myptr, Py_ssize_t mysize,
const void *argptr, Py_ssize_t argsize)
static PyObject *
do_xstrip(PyByteArrayObject *self, int striptype, PyObject *sepobj)
{
Py_ssize_t i = 0;
while (i < mysize && memchr(argptr, (unsigned char) myptr[i], argsize))
i++;
return i;
Py_buffer vsep;
char *s = PyByteArray_AS_STRING(self);
Py_ssize_t len = PyByteArray_GET_SIZE(self);
char *sep;
Py_ssize_t seplen;
Py_ssize_t i, j;

if (PyObject_GetBuffer(sepobj, &vsep, PyBUF_SIMPLE) != 0)
return NULL;
sep = vsep.buf;
seplen = vsep.len;

i = 0;
if (striptype != RIGHTSTRIP) {
while (i < len && memchr(sep, Py_CHARMASK(s[i]), seplen)) {
i++;
}
}

j = len;
if (striptype != LEFTSTRIP) {
do {
j--;
} while (j >= i && memchr(sep, Py_CHARMASK(s[j]), seplen));
j++;
}

PyBuffer_Release(&vsep);

return PyByteArray_FromStringAndSize(s+i, j-i);
}

static Py_ssize_t
rstrip_helper(const char *myptr, Py_ssize_t mysize,
const void *argptr, Py_ssize_t argsize)

static PyObject *
do_strip(PyByteArrayObject *self, int striptype)
{
char *s = PyByteArray_AS_STRING(self);
Py_ssize_t len = PyByteArray_GET_SIZE(self), i, j;

i = 0;
if (striptype != RIGHTSTRIP) {
while (i < len && Py_ISSPACE(s[i])) {
i++;
}
}

j = len;
if (striptype != LEFTSTRIP) {
do {
j--;
} while (j >= i && Py_ISSPACE(s[j]));
j++;
}

return PyByteArray_FromStringAndSize(s+i, j-i);
}


static PyObject *
do_argstrip(PyByteArrayObject *self, int striptype, PyObject *bytes)
{
Py_ssize_t i = mysize - 1;
while (i >= 0 && memchr(argptr, (unsigned char) myptr[i], argsize))
i--;
return i + 1;
if (bytes != NULL && bytes != Py_None) {
return do_xstrip(self, striptype, bytes);
}
return do_strip(self, striptype);
}
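
The two-index scan used by `do_xstrip` and `do_strip` can be modelled in Python to check the edge cases (empty input, all-whitespace input). This is a sketch for reasoning about the C code, not the implementation itself; `strip_model` and `ASCII_WHITESPACE` are hypothetical names introduced here.

```python
LEFTSTRIP, RIGHTSTRIP, BOTHSTRIP = 0, 1, 2
ASCII_WHITESPACE = b"\t\n\v\f\r "  # assumed equivalent of Py_ISSPACE


def strip_model(data, striptype, chars=None):
    """Hypothetical Python model of do_argstrip for a bytearray."""
    sep = ASCII_WHITESPACE if chars is None else bytes(chars)
    n = len(data)

    # Advance i past leading bytes that belong to sep.
    i = 0
    if striptype != RIGHTSTRIP:
        while i < n and data[i] in sep:
            i += 1

    # Move j back past trailing bytes in sep, never crossing i.
    j = n
    if striptype != LEFTSTRIP:
        j -= 1
        while j >= i and data[j] in sep:
            j -= 1
        j += 1

    return bytearray(data[i:j])


print(strip_model(bytearray(b"  spacious  "), BOTHSTRIP))
```

Because the right-hand scan stops at `i`, an input made up entirely of stripped bytes yields an empty result without the index going negative in the slice.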

/*[clinic input]
@@ -1815,33 +1867,9 @@ If the argument is omitted or None, strip leading and trailing ASCII whitespace.

static PyObject *
bytearray_strip_impl(PyByteArrayObject *self, PyObject *bytes)
/*[clinic end generated code: output=760412661a34ad5a input=ef7bb59b09c21d62]*/
/*[clinic end generated code: output=c7c228d3bd104a1b input=8a354640e4e0b3ef]*/
{
Py_ssize_t left, right, mysize, byteslen;
char *myptr;
const char *bytesptr;
Py_buffer vbytes;

if (bytes == Py_None) {
bytesptr = "\t\n\r\f\v ";
byteslen = 6;
}
else {
if (PyObject_GetBuffer(bytes, &vbytes, PyBUF_SIMPLE) != 0)
return NULL;
bytesptr = (const char *) vbytes.buf;
byteslen = vbytes.len;
}
myptr = PyByteArray_AS_STRING(self);
mysize = Py_SIZE(self);
left = lstrip_helper(myptr, mysize, bytesptr, byteslen);
if (left == mysize)
right = left;
else
right = rstrip_helper(myptr, mysize, bytesptr, byteslen);
if (bytes != Py_None)
PyBuffer_Release(&vbytes);
return PyByteArray_FromStringAndSize(myptr + left, right - left);
return do_argstrip(self, BOTHSTRIP, bytes);
}

/*[clinic input]
@@ -1852,35 +1880,14 @@ bytearray.lstrip

Strip leading bytes contained in the argument.

If the argument is omitted or None, strip leading ASCII whitespace.
If the argument is omitted or None, strip leading ASCII whitespace.
[clinic start generated code]*/

static PyObject *
bytearray_lstrip_impl(PyByteArrayObject *self, PyObject *bytes)
/*[clinic end generated code: output=d005c9d0ab909e66 input=80843f975dd7c480]*/
/*[clinic end generated code: output=28602e586f524e82 input=9baff4398c3f6857]*/
{
Py_ssize_t left, right, mysize, byteslen;
char *myptr;
const char *bytesptr;
Py_buffer vbytes;

if (bytes == Py_None) {
bytesptr = "\t\n\r\f\v ";
byteslen = 6;
}
else {
if (PyObject_GetBuffer(bytes, &vbytes, PyBUF_SIMPLE) != 0)
return NULL;
bytesptr = (const char *) vbytes.buf;
byteslen = vbytes.len;
}
myptr = PyByteArray_AS_STRING(self);
mysize = Py_SIZE(self);
left = lstrip_helper(myptr, mysize, bytesptr, byteslen);
right = mysize;
if (bytes != Py_None)
PyBuffer_Release(&vbytes);
return PyByteArray_FromStringAndSize(myptr + left, right - left);
return do_argstrip(self, LEFTSTRIP, bytes);
}

/*[clinic input]
@@ -1896,29 +1903,9 @@ If the argument is omitted or None, strip trailing ASCII whitespace.

static PyObject *
bytearray_rstrip_impl(PyByteArrayObject *self, PyObject *bytes)
/*[clinic end generated code: output=030e2fbd2f7276bd input=e728b994954cfd91]*/
/*[clinic end generated code: output=547e3815c95447da input=b78af445c727e32b]*/
{
Py_ssize_t right, mysize, byteslen;
char *myptr;
const char *bytesptr;
Py_buffer vbytes;

if (bytes == Py_None) {
bytesptr = "\t\n\r\f\v ";
byteslen = 6;
}
else {
if (PyObject_GetBuffer(bytes, &vbytes, PyBUF_SIMPLE) != 0)
return NULL;
bytesptr = (const char *) vbytes.buf;
byteslen = vbytes.len;
}
myptr = PyByteArray_AS_STRING(self);
mysize = Py_SIZE(self);
right = rstrip_helper(myptr, mysize, bytesptr, byteslen);
if (bytes != Py_None)
PyBuffer_Release(&vbytes);
return PyByteArray_FromStringAndSize(myptr, right);
return do_argstrip(self, RIGHTSTRIP, bytes);
}

/*[clinic input]