Semi-generic data loading utility class

Question

The following implements a relatively generic and modular data loading utility. It loads images from a local storage given a lookup table. The data loader is parametrized via dependency injection to exhibit different behaviours. One behaviour is to derive variants for a loaded image.

I have a use case where we are still in an experimental phase and thus during the experimental development so far needed to support the option to swap out components for our algorithm, and we will need this option going forward when continuing experimentation. This requirement resulted in a design that broadly employs separation of concerns. This design arose incrementally during development after several refactorings to facilitate supporting the aforementioned modularity required during experimentation. Meaning, I did not front-load abstraction in expectation of needing it.

Another dev said he does not understand the code at all and would prefer a simpler design. As I said, the design arose due to necessities. Is this design really that opaque? To me, this code looks like run-of-the-mill OOP code with a few small closures thrown in, applying separation of concern principles / SOLID principles.

I would be interested in an independent evaluation of how complex this design really is. I am concerned with the high-level design, not the details. Hence, I ommitted some irrelevant implementation details.

(Don't mind the import order. I had to manually modify these when postig, to obscure some internal stuff)

file 1 (imported by file 2) (obviously, these have descriptive names in reality)

"""
This module contains logic for loading reference data. It provides a reference data loader class whose behavior can be
controlled by dependency injection. Default implementations of dependencies are provided in this module as well.
"""

from __future__ import annotations

import abc
import importlib
from contextlib import contextmanager
from functools import lru_cache
from operator import itemgetter
from pathlib import Path
from typing import Any, TypeVar, Callable, Generic, Mapping, BinaryIO, Generator, Optional

import cv2
import numpy as np
from frozendict import frozendict
from funcy import lflatten
from .models import Item

from .exceptions import MissingReferenceData, DataLoadFailure
from .derivation import derive_variants
from .types import MatLike


T = TypeVar("T")


def make_id_based_reference_data_loader(
    data_map: list[dict],
    data_root: Path | str,
    data_loader: Optional[DataLoader] = None,
) -> ReferenceDataLoader:
    return ReferenceDataLoader(
        reference_selector=ReferenceSelector(data_records=data_map, select_by="id"),
        path_indexer=PathIndexer(root=data_root, path_extractor=index_by_product),
        data_loader=data_loader or ImageToNumpyLoader(),
    )


class ReferenceDataLoader(Generic[T]):
    def __init__(self, reference_selector: ReferenceSelector, path_indexer: PathIndexer, data_loader: DataLoader):
        """Loads reference data for items.

        Args:
            reference_selector: Selects the reference descriptor for a given item. A reference
                descriptor is a metadata record that specifies the reference data to load for a given item.
            path_indexer: Indexes the path to the reference data from the reference descriptor.
            data_loader: loads the data from the indexed path.
        """
        self._reference_descriptor_selector = reference_selector
        self._path_indexer = path_indexer
        self._data_loader = data_loader

    def select_and_load_reference_data(self, item: Item) -> list[T]:
        """Selects and loads the reference data for a given item.


        Raises:
            MissingReferenceData: If no reference data is found for the given item.
        """
        reference_descriptor = self._reference_descriptor_selector.select_descriptor(item)
        reference_data = self._load_reference_data(reference_descriptor)

        return reference_data

    @lru_cache
    def _load_reference_data(self, reference_descriptor: frozendict) -> list[T]:
        paths_to_reference_data = self._path_indexer.index_path(reference_descriptor)
        if not paths_to_reference_data:
            raise MissingReferenceData(f"No reference data found for reference descriptor: {reference_descriptor}")
        reference_data = self._data_loader.load(*paths_to_reference_data)
        return reference_data


class ReferenceSelector:
    def __init__(self, data_records: list[dict], select_by: str):
        """Selects a reference descriptor for a given item.

        Args:
            data_records: A list of reference descriptors.
            select_by: The key in the reference descriptor that is used to select the reference descriptor for a given
                item.
        """
        self._select_by = select_by
        self._data_map = self._data_records_to_data_map(data_records)

    def select_descriptor(self, item: Item) -> frozendict:
        """Selects the reference descriptor for a given item.

        Args:
            item: The item for which to select the reference descriptor.

        Returns:
            The reference descriptor for the given item.

        Raises:
            MissingReferenceData: If no reference descriptor is found for the given item.
        """
        selection_value = getattr(item, self._select_by)
        try:
            return self._data_map[selection_value]
        except KeyError:
            raise MissingReferenceData(
                f"No reference descriptor found for selection value '{selection_value}' and item '{item}'."
            )

    def _data_records_to_data_map(self, data_records: list[dict]) -> dict[Any, frozendict]:
        """Optimizes the data records for lookup by the selection key. Groups the records by the selection key
        and mapping the selection key to the first item in each group (each group is assumed to have only one element).

        Raises:
            AmbiguousReferenceDataSelector: If any selection key in the reference data records is not unique.
        """

        <omitted because irrelevant>


class PathIndexer:
    def __init__(self, root: str | Path, path_extractor: Callable[[Mapping], list[Path]]):
        """Indexes the path to the reference data from a reference descriptor record.

        Args:
            root: The root path to the files to access.
            path_extractor: A function that extracts the paths to the files from the reference descriptor relative to
                the root path.
        """

        self._root = Path(root)
        self._path_extractor = path_extractor

    def index_path(self, reference_descriptor: Mapping) -> list[Path]:
        file_paths = [self._root / p for p in self._path_extractor(reference_descriptor)]
        return file_paths


def index_by_product(reference_descriptor: Mapping) -> list[Path]:
    brand, filenames = itemgetter("product", "files")(reference_descriptor)
    root = Path(brand)
    access_paths = [root / f for f in filenames]
    return access_paths


class DataLoader(abc.ABC, Generic[T]):
    def load(self, path: Path, *paths: Path) -> list[T]:
        data_nested = self._load_multiple([path, *paths]) if paths else [self._load_one(path)]
        # Loading for an individual path returns either a single object or a list of objects, hence flatten
        data_flat = lflatten(data_nested)
        return data_flat

    def _load_one(self, path: Path) -> T:
        path = normalize_path(path)
        with _resolve_path(path) as fp:
            try:
                return self._load(fp)
            except Exception as err:
                raise DataLoadFailure(f"Failed to load reference data from path: {path}") from err

    def _load_multiple(self, paths: list[Path]) -> list[T]:
        return [self._load_one(path) for path in paths]

    @abc.abstractmethod
    def _load(self, fp: BinaryIO) -> T:
        pass


@contextmanager
def _resolve_path(path: Path) -> Generator[BinaryIO, None, None]:
    root = str(path.parent).replace("/", ".")
    leaf = path.name
    with importlib.resources.open_binary(root, leaf) as fp:
        yield fp


def normalize_path(path: Path) -> Path:
    root = str(path.parent).replace("/", ".")
    leaf = path.name
    path = Path(root.replace(".", "/")) / leaf
    return path


class ImageToNumpyLoader(DataLoader):
    def _load(self, binary_stream: BinaryIO) -> MatLike:
        image = _read_image_bytes(binary_stream)
        return image


class DerivativeImageToNumpyLoader(DataLoader):
    def _load(self, binary_stream: BinaryIO) -> list[MatLike]:
        image = _read_image_bytes(binary_stream)
        variants = list(derive_variants(image))
        return variants


def _read_image_bytes(binary_stream: BinaryIO) -> MatLike:
    data = binary_stream.read()
    np_arr = np.frombuffer(data, np.uint8)
    image = cv2.imdecode(np_arr, cv2.IMREAD_GRAYSCALE)
    return image

file 2

"""
This module contains functions for selecting the reference image for the algorithm.
"""

import json
import logging
from functools import partial
from importlib.resources import open_binary
from typing import Callable, Any, Optional

from funcy import lmapcat
from .models import Item

from .<file 1> import (
    make_id_based_reference_data_loader,
    DataLoader,
    DerivativeImageToNumpyLoader,
    ImageToNumpyLoader,
)
from .derivation import derive_variants
from .types import MatLike

logger = logging.getLogger(__name__)


def _build_reference_image_loader(data_loader: Optional[DataLoader] = None) -> Callable[[Item], list[MatLike]]:
    def _load_reference_images(item: Item) -> list[MatLike]:
        """Loads the reference image for the given item.

        Raises:
            MissingReferenceData: If no reference data is found for the given item.
        """
        # Might raise MissingReferenceData
        return reference_data_loader.select_and_load_reference_data(item)

    data_root = f"{__package__}.reference_data"

    with open_binary(data_root, "reference_data_records.json") as fp:
        reference_data_map = json.load(fp)

    reference_data_loader = make_id_based_reference_data_loader(
        data_map=reference_data_map, data_root=data_root, data_loader=data_loader
    )

    return _load_reference_images


def build_target_insensitive_deriving_reference_image_loader() -> Callable[[Item, Optional[Any]], list[MatLike]]:
    """Builds a data loader that applies derivation to reference images internally.

    Since the data loader interface does not support passing target images, this applies derivation internally,
    without considering the target images. The advantage is, that this allows the data loader to cache the derived
    images for a specific item ID.
    """
    _load_reference_images = _build_reference_image_loader(data_loader=DerivativeImageToNumpyLoader())

    def load_reference_images(item: Item, _: Optional[Any] = None) -> list[MatLike]:
        return _load_reference_images(item)

    return load_reference_images


def build_target_sensitive_deriving_reference_image_loader() -> Callable[[Item, MatLike], list[MatLike]]:
    """Builds a data loader that applies derivation to reference images externally.

    Since the data loader interface does not support passing target images, this applies derivation externally while
    considering the target images. The disadvantage is, that this does not allow the data loader to cache the derived
    images for a specific item ID, since the derivation happens outside the caching scope of the data loader, and the
    derived images will be different for every target image anyway, so that the item ID is not a valid cache key in this
    case.
    """
    _load_reference_images = _build_reference_image_loader(data_loader=ImageToNumpyLoader())

    def load_reference_images_and_derive_variants(item: Item, target_image: MatLike) -> list[MatLike]:
        return lmapcat(
            partial(derive_variants, target_image=target_image),
            _load_reference_images(item),
        )

    return load_reference_images_and_derive_variants


load_reference_images = build_target_insensitive_deriving_reference_image_loader()

Specifying what the code you shared is intended to do will be helpful for reviewers. Also, it's worth reflecting on it in the question title (currently, it's too broad). — Alexander Ivanchenko, Commented May 12, 2024 at 17:36
@AlexanderIvanchenko Thank you, you are right. Explaining it too much will however make the answers biased, since I want to know if the code is understandable with minimal context. I thus added a minimal context explanation. — lo tolmencre, Commented May 12, 2024 at 17:41
Stack Overflow likes minimal context. Code Review is different. Here, more context is better. — Mast, Commented May 12, 2024 at 17:57

J_H · Accepted Answer · 2024-05-13 15:58:20Z

single responsibility

The review context only says "file 1", but hopefully we're using an instructive name like data_loader.py. I do thank you for the docstring. It mentions

provides a reference data loader class
Default implementations of dependencies are provided in this module as well.

It can be perfectly OK to include several class definitions in a single module. But they should be closely related, as reflected in the module's name and its docstring. Here, I only have the docstring, and I worry that this module is doing Two things. Consider breaking out "default implementations" into a separate module.

Down in "file 2", kudos for being very clear on what belongs within that module, and clear on what kitchen_sink() definitions should not be added to it.

interpreter version

from __future__ import annotations

This line raises more questions than it answers. It makes me think we're targetting ancient interpreters. Similarly for from typing import Optional, rather than e.g. int | None. Please describe the set of supported cPython interpreters in an English sentence, or better, with a statement like assert sys.version_info >= (3, 12).

isort

I don't understand how your imports are organized. The blank line after .models is jarring. Maybe we used isort a week ago, but then there's been some edit drift since then?

I really do find the three sections helpful, so I will have a heads up about pypi packages that I'm perhaps not yet familiar with, such as funcy.

Also, please format with black. I suspect that perhaps you did so a week or two back, and there's been some editing since then.

conventional error names

from .exceptions import MissingReferenceData, DataLoadFailure

There's nothing bad about that pair of identifiers. They are diagnostic and they let me know what's going on. But consider tacking on an ...Error suffix.

vague annotation

def make_id_based_reference_data_loader(
    data_map: list[dict],

I don't know what that means. Are you routinely linting with mypy? I think it wants list[dict[Any, Any]] there. It would be more helpful to write e.g. list[dict[str, str]] or whatever is intended.

    data_root: Path | str,
    ...
        path_indexer = PathIndexer(root=data_root, ...),

I feel that's an unfortunate call. I would much rather see ... = PathIndexer(root=Path(data_root), ...) so the called code is solving a simpler problem. Plus, "path" is right there in the class name, it really does feel like we should be passing in a Path.

I do like the or defaulting for the data loader param.

There's no docstring here, and I feel that's kind of OK, we can push that responsibility down to the loader class.

class docstring

class ReferenceDataLoader(Generic[T]):
    def __init__(self, reference_selector: ReferenceSelector, path_indexer: PathIndexer, data_loader: DataLoader):
        """Loads reference data for items.

        Args:
            reference_selector: Selects the reference descriptor for a given item. A reference
                descriptor is a metadata record that specifies the reference data to load for a given item.
            path_indexer: Indexes the path to the reference data from the reference descriptor.
            data_loader: loads the data from the indexed path.
        """

This is in many ways some good code, but I have some remarks.

Give us a docstring for the ReferenceDataLoader class, please. It's an important concept, and worth defining. I want to know about its promises and invariants. Certainly the constructor goes partway into fleshing that out, thank you.

We have a defined term of art, "reference data", which perhaps is transparently obvious at your workplace, but which I find is rather on the vague side. Having read this far, I would have a tough time writing a unit or integration test for this code, and likely would be unable to say whether some returned result is "correct" or "incorrect". We're speaking in abstract generalities at this point. It's not like reading signature + docstring for sqrt(), where I could immediately take a stab at predicting correctness.

What is crystal clear is: "Hands off! My stuff is _private!", good.

In the select_and_load_reference_data() docstring, maybe the "Raises" remark is good? It feels a little javaesque to me. We all understand that Bad Things could happen at any moment, e.g. ZeroDivisionError or KeyboardInterrupt. Does caller really need to always worry about this possibility, and would it even have some viable MissingReferenceData recovery strategy? Are we actually intending to offer advice to child class implementors? Pointing out that child class is expected to signal a common error in a conventional way?

memory footprint

I confess that I write an unadorned @lru_cache decorator all the time.

    @lru_cache
    def _load_reference_data( ... ) -> list[T]:

IDK but I imagine that at least some reference datasets will be "big". It's unclear how appropriate the default maxsize of 128 is, here. Consider reducing the value, or at least discussing the engineering tradeoffs in a brief remark.

Kudos, I am duly impressed that you have a substantial codebase passing in a frozendict. I assume you're routinely passing mypy lint checks. If not passing, maybe switch to dict type. Oh, wait, light finally dawned -- you need a hashable param for the LRU magic to work, got it.

Anyway, the MissingReferenceData seems to be helpfully diagnostic. I appreciate that you threw the descriptor param into the message.

pythonic naming

class ReferenceSelector:
    def __init__(self, data_records: list[dict], select_by: str):

This is lovely. But maybe rename that last param to key? For parallel structure with e.g. sorted(). And as noted above, perhaps we could improve upon list[dict[Any, Any]]. Maybe there's an app-specific type we should define for that.

Or IDK, maybe dict[Hashable, Any]? I usually have something more specific in mind when writing such annotations.

unimplemented

    def _data_records_to_data_map( … ) -> dict[Any, frozendict]:
        …
        ...

Those first two … are my editorial ellipses for "code elided". The last one is literally the ... ellipsis singleton object, which we compute and immediately ignore.

This can be an OK way to define a base class, where "do nothing" is an acceptable default behavior.

Here, it feels like we should probably raise to tell caller that no implementation was provided for a mandatory private method.

pre-condition

This is perfectly clear:

class PathIndexer:
    def __init__(self, ..., path_extractor: Callable[[Mapping], list[Path]]):
        """ ....
            path_extractor: A function that extracts ...
        """
        self._root = Path(root)
        self._path_extractor = path_extractor

The first assignment will helpfully blow up if we pass in something nonsensical like 42.

The second assignment always succeeds. Consider tacking on an assert. When engineers write code it's a sure thing we will sometimes write bugs, but we strive to make them shallow bugs which are immediately detected and easily diagnosed / fixed. Helpful IDE's will tend to turn foo into foo() at the call site, so detecting that common error is worthwhile.

singular vs plural

    def index_path( ... ) -> list[Path]:

I don't understand that identifier.

The implementation and the signature both make it pretty clear that we intend to return plural "paths".

Also, in index_by_product() the whole {product, brand, root} renaming is a bit jarring. I'm sure you have some filesystem layout for which that makes sense. But the documentation + review context did not describe it. I'm not even sure if root is an absolute or relative filespec. Feel free to spell out the details a bit more.

I am willing to believe that each of the access_paths exist in the filesystem. Asserting that they .exist() would improve my confidence in the code, and would make bugs shallower if it turns out that e.g. caller arranged for files to be absolute instead of relative.

one or more

class DataLoader( ... ):
    def load(self, path: Path, *paths: Path) -> list[T]:

This is the first instance of "code that just seems kind of terrible" that I have seen so far in this submission.

Why the {path, paths} distinction? Why does the caller care? Why does this library code care, defining "_one" vs "_multiple"?

Accept paths: list[Path] | Path. If it turns out we don't have a list, just turn it into [Path(paths)] as a convenience to the caller. And then we can process however many paths arrived, in a uniform way.

BTW, thank you for wrapping original error within DataLoadFailure so we don't lose it.

The _resolve_path context manager is very nice. But normalize_path()? Less so. It would benefit from >>> docstring example(s).

type stability

These trouble me.

class ImageToNumpyLoader(DataLoader):
    def _load(self, binary_stream: BinaryIO) -> MatLike:
        ...
class DerivativeImageToNumpyLoader(DataLoader):
    def _load(self, binary_stream: BinaryIO) -> list[MatLike]:

We get back a single matrix, or a container of them? Make up your mind already! Prefer to always return a list of them. Even if we're returning zero variants.

test suite

does not understand the code

design arose due to necessities.

Is this design really that opaque?

This code is less transparent than it could be. It uses domain specific jargon, such as "item" and "reference image", in ways which the codebase does not explain. I am willing to believe that calling code exists which exercises each of the various code paths.

The readability of this codebase would be significantly enhanced if it included instructive {unit, integration} tests showing how to exercise each of those paths.

Thank you for the excellent and detailed review, even on things I was not even expecting input on. There are integration tests that involve this code. I take it then that you do not regard this design as unreasonably opaque and unjustified for the necessities (which you have to take my word for, I suppose). At least it seems understandable to you, in terms of how it works mechanically. — lo tolmencre, Commented May 13, 2024 at 11:24
from __future__ import annotations isn't targeting ancient interpreters. No version of Python has made the feature mandatory yet. You have to include the line to support the auto conversion of annotations to forward references which allows referencing undefined names. Which is needed for "def make_id_based_reference_data_loader(..., data_loader: Optional[DataLoader] = None,". Please read the docs. — Peilonrayz, Commented May 13, 2024 at 16:11

Stack Exchange Network

Semi-generic data loading utility class

1 Answer 1

single responsibility

interpreter version

isort

conventional error names

vague annotation

class docstring

memory footprint

pythonic naming

unimplemented

pre-condition

singular vs plural

one or more

type stability

test suite

Your Answer

Hot Network Questions

Semi-generic data loading utility class

1 Answer 1

single responsibility

interpreter version

isort

conventional error names

vague annotation

class docstring

memory footprint

pythonic naming

unimplemented

pre-condition

singular vs plural

one or more

type stability

test suite

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions