dataset_hub._core.provider.DataFrameProvider

class dataset_hub._core.provider.dataframe_provider.DataFrameProvider(config)[source]

Bases: Provider[DataFrame]

Provider that loads a dataset from a source (URL or file) and returns it as a pandas DataFrame.

Regardless of the underlying file format, the output is always returned as:

{"data": pandas.DataFrame}

Supported formats depend on the implementation of read_dataframe.

Parameters:

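config (Dict[str, Any]) – Provider configuration; DataFrameProviderConfig describes the expected fields (source, read_kwargs).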

ConfigClass

alias of DataFrameProviderConfig

load()[source]

Fetch and load the dataset specified in the configuration.

Returns:

The loaded pandas DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If the file cannot be read or the format is unsupported.
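
The load contract above (read the configured source, raise ValueError on failure) can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the dataset_hub implementation: the class name DataFrameProviderSketch, the "file" value for the source type, and the CSV-only handling are all hypothetical.

```python
import os
import tempfile
from typing import Any, Dict

import pandas as pd


class DataFrameProviderSketch:
    """Hypothetical stand-in for DataFrameProvider (not the real class)."""

    def __init__(self, config: Dict[str, Any]) -> None:
        self.source = config["source"]
        self.read_kwargs = config.get("read_kwargs", {})

    def load(self) -> pd.DataFrame:
        fmt = self.source.get("format")
        if fmt != "csv":  # assumption: this sketch only handles CSV
            raise ValueError(f"Unsupported format: {fmt!r}")
        try:
            return pd.read_csv(self.source["url"], **self.read_kwargs)
        except (OSError, pd.errors.ParserError) as exc:
            # Surface read failures as ValueError, as the docs describe.
            raise ValueError(f"Could not read {self.source['url']!r}") from exc


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("x,y\n1,2\n")

    # Successful load from a local file.
    provider = DataFrameProviderSketch(
        {"source": {"type": "file", "url": path, "format": "csv"}}
    )
    df = provider.load()

    # A missing file is reported as ValueError.
    missing = DataFrameProviderSketch(
        {"source": {"type": "file", "url": os.path.join(tmp, "nope.csv"), "format": "csv"}}
    )
    try:
        missing.load()
    except ValueError as exc:
        error_message = str(exc)
```

Callers that cannot guarantee the source is reachable would typically wrap load() in a try/except ValueError block, as the last lines do.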

read_dataframe(path_or_url, format, read_kwargs)[source]

Universal function to read a DataFrame from various file formats.

Parameters:
  • path_or_url (str) – Local file path or URL to the data.

  • format (str) – Data format ('csv', 'parquet', 'excel', 'json').

  • read_kwargs (dict, optional) – Additional parameters to pass to the corresponding pandas reader function.

Returns:

Loaded DataFrame.

Return type:

pd.DataFrame

Raises:

ValueError – If the specified format is not supported.
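
One plausible way to implement this kind of dispatch is a mapping from format name to pandas reader. The sketch below is an assumption about the general shape, not the library's code: read_dataframe_sketch and _READERS are hypothetical names, and the real function may select readers differently.

```python
import io
from typing import Any, Callable, Dict, Optional

import pandas as pd

# Hypothetical mapping from the documented format names to pandas readers.
_READERS: Dict[str, Callable[..., pd.DataFrame]] = {
    "csv": pd.read_csv,
    "parquet": pd.read_parquet,
    "excel": pd.read_excel,
    "json": pd.read_json,
}


def read_dataframe_sketch(
    path_or_url: Any,
    format: str,
    read_kwargs: Optional[Dict[str, Any]] = None,
) -> pd.DataFrame:
    """Pick a reader by format and forward read_kwargs to it."""
    try:
        reader = _READERS[format]
    except KeyError:
        raise ValueError(f"Unsupported format: {format!r}") from None
    return reader(path_or_url, **(read_kwargs or {}))


# An in-memory buffer stands in for a file path or URL here.
csv_buffer = io.StringIO("a;b\n1;2\n3;4\n")
df = read_dataframe_sketch(csv_buffer, "csv", {"sep": ";"})
```

Passing read_kwargs through unchanged keeps the wrapper thin: any option the underlying pandas reader accepts (here sep) works without extra plumbing.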

class dataset_hub._core.provider.dataframe_provider.DataFrameProviderConfig(source, read_kwargs=<factory>)[source]

Bases: ProviderConfig

Configuration schema for DataFrameProvider.

Parameters:
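  • source (Dict[str, Any]) – Source configuration with type, url, and format.

  • read_kwargs (Dict[str, Any]) – Optional keyword arguments forwarded to the corresponding pandas reader.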

source

Source configuration with type, url, and format.

Type:

Dict[str, Any] | SourceConfig

read_kwargs

Optional keyword arguments forwarded directly to the corresponding pandas reader.

Type:

Dict[str, Any]
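
The signature DataFrameProviderConfig(source, read_kwargs=<factory>) suggests a dataclass-style schema whose read_kwargs default comes from a factory. A minimal sketch of that shape follows, assuming a plain dataclass; DataFrameProviderConfigSketch is a hypothetical name, the "url" value for the source type is illustrative, and the real class derives from ProviderConfig and may validate its fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class DataFrameProviderConfigSketch:
    """Hypothetical mirror of the documented fields."""

    # Source description; the docs name the keys type, url, and format.
    source: Dict[str, Any]
    # A factory default gives every instance its own empty dict.
    read_kwargs: Dict[str, Any] = field(default_factory=dict)


config = DataFrameProviderConfigSketch(
    source={
        "type": "url",  # illustrative value, not confirmed by the docs
        "url": "https://example.com/data.csv",
        "format": "csv",
    },
)

# Mutating one instance's read_kwargs does not leak into another.
other = DataFrameProviderConfigSketch(source={})
other.read_kwargs["sep"] = ";"
```

Using a factory rather than a shared literal default avoids the usual mutable-default pitfall, which is presumably why the signature shows <factory>.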