API Reference

PagedList Class

Disk-backed list implementation for large-scale data processing.

PagedList is a memory-efficient alternative to Python lists for datasets that exceed available system memory. It splits data into chunks and persists them to disk automatically, while keeping a list-like interface for data processing workflows.

Features:
  • Transparent disk-backed storage with automatic memory management

  • Configurable chunk size (chunk_size) and storage location (disk_path)

  • Multithreaded data transformations (map, serialize)

  • Error handling and resource cleanup (auto_cleanup, cleanup_chunks)

  • Type-safe operations with comprehensive validation

class paged_list.paged_list.PagedList(chunk_size: int = 50000, disk_path: str = 'data', auto_cleanup: bool = True)

Bases: object

A disk-backed list-like object that stores most of its data on disk.

When the list gets too large, data is chunked into .pkl files in the disk_path directory (data/ by default). When retrieving slices, only the relevant chunks are loaded into memory.
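The chunking behaviour described above can be illustrated with a small standard-library sketch. This is not the PagedList implementation: the class TinyChunkedList, the method get_range, and the chunk_N.pkl file naming are hypothetical, chosen only to show how full chunks might be pickled to disk and how a range read can touch only the chunks it overlaps.

```python
import os
import pickle
import tempfile
from typing import Any, Dict, List

Record = Dict[str, Any]


class TinyChunkedList:
    """Illustrative stand-in only; NOT the PagedList implementation."""

    def __init__(self, chunk_size: int, disk_path: str) -> None:
        self.chunk_size = chunk_size
        self.disk_path = disk_path
        self.chunk_count = 0              # number of full chunks written to disk
        self._buffer: List[Record] = []   # items not yet flushed
        os.makedirs(disk_path, exist_ok=True)

    def _chunk_file(self, i: int) -> str:
        # Hypothetical naming scheme for chunk files.
        return os.path.join(self.disk_path, f"chunk_{i}.pkl")

    def append(self, item: Record) -> None:
        self._buffer.append(item)
        if len(self._buffer) >= self.chunk_size:
            # Buffer is full: pickle it as one chunk and start a new buffer.
            with open(self._chunk_file(self.chunk_count), "wb") as f:
                pickle.dump(self._buffer, f)
            self.chunk_count += 1
            self._buffer = []

    def __len__(self) -> int:
        return self.chunk_count * self.chunk_size + len(self._buffer)

    def get_range(self, start: int, stop: int) -> List[Record]:
        """Return items[start:stop] (start >= 0), reading only overlapping chunks."""
        stop = min(stop, len(self))
        result: List[Record] = []
        if start >= stop:
            return result
        first = start // self.chunk_size
        last = (stop - 1) // self.chunk_size
        for i in range(first, last + 1):
            if i < self.chunk_count:
                with open(self._chunk_file(i), "rb") as f:
                    chunk = pickle.load(f)
            else:
                chunk = self._buffer      # the tail still lives in memory
            base = i * self.chunk_size    # global index of chunk[0]
            result.extend(chunk[max(start - base, 0): stop - base])
        return result


with tempfile.TemporaryDirectory() as tmp:
    items = TinyChunkedList(chunk_size=3, disk_path=tmp)
    for n in range(8):
        items.append({"id": n})
    window = items.get_range(2, 7)
    files_on_disk = sorted(os.listdir(tmp))
```

With eight items and a chunk size of three, two full chunks are on disk and the last two items remain in the in-memory buffer; the range read spans both places.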

append(data: Dict[str, Any]) → None

Appends an item to the list, flushing to disk if needed.

cleanup_chunks() → None

Deletes all stored chunk files from disk.

clear() → None

Remove all items from the list.

combine_chunks() → List[Dict[str, Any]]

Loads and combines all chunks into a single list (if needed).

copy() → PagedList

Return a shallow copy of the list.

count(value: Dict[str, Any]) → int

Return number of occurrences of value.

extend(iterable: Iterable[Dict[str, Any]]) → None

Extends the list by adding multiple items at once.

property in_memory_count: int

Return the number of items currently in memory.

index(value: Dict[str, Any], start: int = 0, stop: int | None = None) → int

Return index of first occurrence of value.

insert(index: int, value: Dict[str, Any]) → None

Insert an item at a specific position.

property is_empty: bool

Return True if the list is empty.

map(func: Callable[[Dict[str, Any]], Dict[str, Any]], max_workers: int | None = None) → None

Applies the provided function to every record, processing the list chunk by chunk with multiple threads.

Parameters:
  • func (callable) – Function to apply to each record in the chunk.

  • max_workers (int, optional) – The maximum number of threads to use. Defaults to None, which means the number of threads will be determined by the system.

Returns:

None. The records are modified in place.
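A chunk-wise, multithreaded map of this kind can be sketched with the standard library. The code below is an illustration under stated assumptions, not the library's code: the helpers map_chunk_file and map_chunks, the chunk_N.pkl file names, and the add_total function are all hypothetical. It assumes each chunk is a pickled list of dictionaries and that one thread handles one chunk file at a time.

```python
import os
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


def map_chunk_file(path: str, func: Callable[[Record], Record]) -> None:
    """Load one chunk, apply func to each record, and write the chunk back."""
    with open(path, "rb") as f:
        records = pickle.load(f)
    records = [func(r) for r in records]
    with open(path, "wb") as f:
        pickle.dump(records, f)


def map_chunks(
    paths: List[str],
    func: Callable[[Record], Record],
    max_workers: Optional[int] = None,
) -> None:
    """Process every chunk file with a thread pool (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes the results so worker exceptions are raised here.
        list(pool.map(lambda p: map_chunk_file(p, func), paths))


def add_total(record: Record) -> Record:
    # Example transformation: return a new record with a derived field.
    return {**record, "total": record["price"] * record["qty"]}


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for n in range(3):
        path = os.path.join(tmp, f"chunk_{n}.pkl")
        with open(path, "wb") as f:
            pickle.dump([{"price": n + 1, "qty": 2}], f)
        paths.append(path)

    map_chunks(paths, add_total, max_workers=2)

    totals = []
    for path in paths:
        with open(path, "rb") as f:
            totals.extend(r["total"] for r in pickle.load(f))
```

Because each worker rewrites only its own chunk file, the threads in this sketch never write to the same file. Whether PagedList partitions work the same way is not stated in this reference.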

pop(index: int = -1) → Dict[str, Any]

Remove and return item at index (default last).

remove(value: Dict[str, Any]) → None

Remove first occurrence of value.

reverse() → None

Reverse the list in place.

Warning: This operation loads all data into memory and may be slow for large lists.

serialize(max_workers: int | None = None) → None

Serializes the list of dictionaries by converting certain types to JSON strings and updates the underlying files using multiple threads.
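This reference does not list which types are converted. As a rough sketch of the idea, the hypothetical helper serialize_record below assumes that nested dictionaries and lists are the values turned into JSON strings, and that all other values are left unchanged. It shows the per-record transformation only, not the file updates or threading.

```python
import json
from typing import Any, Dict


def serialize_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of record with nested dicts and lists as JSON strings.

    Hypothetical helper: the set of converted types is an assumption.
    """
    out: Dict[str, Any] = {}
    for key, value in record.items():
        if isinstance(value, (dict, list)):
            out[key] = json.dumps(value)  # nested structure becomes a string
        else:
            out[key] = value              # scalars pass through unchanged
    return out


record = {"id": 7, "tags": ["a", "b"], "meta": {"ok": True}}
flat = serialize_record(record)
restored_tags = json.loads(flat["tags"])
```

Flattening nested values this way yields records whose values are all scalars or strings, and json.loads recovers the original structure when needed.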

sort(*, key: Callable[[Dict[str, Any]], Any] | None = None, reverse: bool = False) → None

Sort the list in place.

Warning: This operation loads all data into memory and may be slow for large lists.

Note: A key function is required when sorting dictionaries, because dictionaries do not support ordering comparisons.
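The key requirement follows from plain Python behaviour, shown here with the built-in sorted on an ordinary list rather than on a PagedList. The sample records and the field name "id" are illustrative.

```python
from operator import itemgetter

records = [
    {"id": 3, "name": "c"},
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
]

# Dictionaries define no ordering, so sorting them without a key fails.
try:
    sorted(records)
    failed_without_key = False
except TypeError:
    failed_without_key = True

# A key function extracts a comparable value from each record.
by_id = sorted(records, key=itemgetter("id"))
by_id_desc = sorted(records, key=itemgetter("id"), reverse=True)
```

Given the signature above, a key such as itemgetter("id") or a lambda would presumably be passed to PagedList.sort through its keyword-only key argument in the same way.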

property total_chunks: int

Return the total number of chunks.

Main Module

Command-line interface and demonstration utilities for paged-list.

This module provides entry points for running demonstrations and examples of the disk-backed list implementation.

paged_list.main.demo() → None

Run a demonstration of the PagedList functionality.

paged_list.main.main() → None

Main entry point for the package.