Problem saving model data (NetCDF) with pandas' integer type #1718

Open
jsnyde0 opened this issue May 27, 2025 · 3 comments

Comments

@jsnyde0
Contributor

jsnyde0 commented May 27, 2025

I get a TypeError when saving my model with mmm.save("model.nc"). The failure occurs when my data uses Int64Dtype, pandas' nullable integer extension type.

What I'm doing:

I'm building a Marketing Mix Model (MMM). In my model, the main thing I'm trying to predict (my y variable) is the number of units sold. These are naturally whole numbers (integers). Pandas seems to use Int64Dtype for this column.

When ArviZ tries to save the InferenceData (which includes my sales data), it gives a TypeError if that Int64Dtype is present. It looks like the saving process doesn't quite know how to handle this specific pandas integer type.
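To confirm which columns carry a pandas extension dtype before the data reaches the model, a quick check like the one below may help. This is a minimal sketch using only pandas; the helper name `extension_dtype_columns` and the example `spend` column are hypothetical and not part of the original report.

```python
import pandas as pd


def extension_dtype_columns(df: pd.DataFrame) -> dict:
    """Map column name -> dtype name for columns backed by a pandas extension dtype.

    Hypothetical helper for illustration; it only inspects dtypes and changes nothing.
    """
    return {
        name: str(dtype)
        for name, dtype in df.dtypes.items()
        if pd.api.types.is_extension_array_dtype(dtype)
    }


# Example frame: one nullable-integer column and one plain float column.
example_df = pd.DataFrame(
    {
        "units_sold": pd.Series([100, 150, 20, 200, 120], dtype="Int64"),
        "spend": [1.5, 2.0, 0.5, 3.0, 1.0],
    }
)
flagged = extension_dtype_columns(example_df)
print(flagged)
```

Any column reported here is a candidate for conversion to a plain NumPy dtype before saving.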

Here's a simple code example that shows the problem:

import arviz as az
import numpy as np
import pandas as pd
import xarray as xr

def run_simplified_reproducible_example():
    print(f"Running with: pandas {pd.__version__}, xarray {xr.__version__}, arviz {az.__version__}, numpy {np.__version__}")

    # 1. Simulate sales data (whole numbers stored in pandas' nullable Int64 dtype)
    sales_data_with_na = pd.Series(
        [100, 150, 20, 200, 120], dtype="Int64", name="units_sold"
    )

    # 2. Put it into an xarray.DataArray
    sales_data_array = xr.DataArray(
        sales_data_with_na,
        dims=["time_period"],
        coords={"time_period": np.arange(len(sales_data_with_na))},
        name="units_sold_observed",
    )

    # 3. Create an arviz.InferenceData object (like what my MMM produces)
    model_dataset = xr.Dataset({sales_data_array.name: sales_data_array})
    inference_data_to_save = az.InferenceData(observed_data=model_dataset)

    # 4. Try to save it (this is where the error usually happens)
    output_filename = "test_sales_model_save.nc"
    print(f"\nTrying to save to '{output_filename}'...")
    try:
        inference_data_to_save.to_netcdf(output_filename)
        print(f"Saved '{output_filename}' successfully (This is UNEXPECTED if the issue exists).")
    except TypeError as e:
        print(f"\n--- EXPECTED TypeError ---")
        print(f"Oops, couldn't save '{output_filename}'. Error: {e}")
        print("This is the TypeError I'm seeing due to the Int64Dtype.")
        print(f"--- END OF TypeError ---")
    except Exception as e:
        print(f"\n--- Some Other Error ---")
        print(f"A different error happened: {e}")
        print(f"--- END OF Other Error ---")

if __name__ == "__main__":
    run_simplified_reproducible_example()

My own versions:

  • pandas version: 2.2.3
  • xarray version: 2025.4.0
  • arviz version: 0.21.0
  • numpy version: 2.2.6

Solution

I'm not sure whether this should be solved here, by converting data types before saving (which is what I'm doing currently), or moved over to ArviZ.

@williambdean
Contributor

Yeah, I would raise this with the ArviZ team.

@williambdean
Contributor

Are you able to convert your data to floats?
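One way to do that conversion upstream, before the data is handed to the model, is sketched below. It assumes the target lives in a pandas Series; the helper name `to_numpy_backed` is hypothetical. Nullable integers with no missing values become plain int64, and anything with missing values becomes float64 with NaN in place of pd.NA.

```python
import numpy as np
import pandas as pd


def to_numpy_backed(series: pd.Series) -> pd.Series:
    """Return `series` with a plain NumPy dtype if it uses a nullable integer dtype.

    Hypothetical helper for illustration. Series with other dtypes are returned unchanged.
    """
    dtype = series.dtype
    is_nullable_int = pd.api.types.is_extension_array_dtype(
        dtype
    ) and pd.api.types.is_integer_dtype(dtype)
    if not is_nullable_int:
        return series
    if series.isna().any():
        # pd.NA cannot be stored in a NumPy integer array, so fall back to float with NaN.
        return series.astype("float64")
    return series.astype("int64")


complete = pd.Series([100, 150, 20, 200, 120], dtype="Int64", name="units_sold")
with_gap = pd.Series([100, None, 20], dtype="Int64", name="units_sold")

complete_converted = to_numpy_backed(complete)
with_gap_converted = to_numpy_backed(with_gap)
print(complete_converted.dtype, with_gap_converted.dtype)
```

Converting at the DataFrame stage avoids patching the InferenceData groups after fitting, at the cost of losing the integer type whenever a value is missing.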

@jsnyde0
Contributor Author

jsnyde0 commented May 28, 2025

Yeah, this helped:

import pandas as pd
import xarray as xr

def convert_int64_to_float(dataset, var_name='y'):
    """Directly converts var_name in dataset if it's Int64Dtype."""
    if dataset is not None and var_name in dataset.data_vars:
        data_array = dataset[var_name]
        if isinstance(data_array.dtype, pd.Int64Dtype):
            print(f"  Targeted: Converting '{var_name}' in group to float64.")
            # Simplified conversion for Series-like data in DataArray
            converted_values = pd.Series(data_array.data.ravel()).astype(float).to_numpy().reshape(data_array.shape)

            new_da = xr.DataArray(
                converted_values,
                coords=data_array.coords,
                dims=data_array.dims,
                name=data_array.name,
                attrs=data_array.attrs
            )
            return dataset.assign({var_name: new_da})
    return dataset # Return original if no conversion needed or var not found

if mmm.idata is not None:
    print("Applying Int64Dtype conversion for 'y' variable...")

    # Target 'fit_data'
    if hasattr(mmm.idata, 'fit_data'):
        mmm.idata.fit_data = convert_int64_to_float(mmm.idata.fit_data, 'y')

    # Target 'observed_data'
    if hasattr(mmm.idata, 'observed_data'):
        mmm.idata.observed_data = convert_int64_to_float(mmm.idata.observed_data, 'y')

    print("Minimal targeted conversion finished.")
else:
    print("mmm.idata is None, skipping conversion.")


2 participants