Customizing the Fitting Core

This guide explains how to replace or modify the curve fitting engine in RegressionLab. By default, RegressionLab uses SciPy’s curve_fit function, but you may want to use a different optimization library for specific needs.

Overview

RegressionLab’s fitting architecture is modular, making it relatively easy to swap out the optimization backend. This guide covers:

Understanding the current architecture
Why you might want to change it
How to replace SciPy with another library
Alternative optimization libraries
Performance considerations

Current Architecture

How Fitting Works Now

RegressionLab’s fitting pipeline:

User Interface (Tkinter/Streamlit)
        ↓
Fitting Wrapper Functions (ajlineal, ajsin, etc.)
        ↓
Generic Fit Utility (generic_fit)
        ↓
SciPy curve_fit
        ↓
Results (parameters, uncertainties, fitted curve)

Key File: `fitting_utils.py`

The core fitting logic is in src/fitting/fitting_utils.py:

def generic_fit(func, x, y, uy=None, p0=None):
    """
    Generic fitting function using scipy.optimize.curve_fit.
    
    Args:
        func: Mathematical function to fit
        x: Independent variable data
        y: Dependent variable data
        uy: Uncertainties in y (optional)
        p0: Initial parameter guess (optional)
        
    Returns:
        params: Fitted parameters
        y_fitted: Fitted y values
        r_squared: Coefficient of determination
    """
    from scipy.optimize import curve_fit
    
    # Prepare weights from uncertainties
    sigma = uy if uy is not None else None
    
    # Perform optimization
    popt, pcov = curve_fit(
        func, x, y,
        p0=p0,
        sigma=sigma,
        absolute_sigma=True if sigma is not None else False
    )
    
    # Calculate fitted values
    y_fitted = func(x, *popt)
    
    # Calculate R² and RMSE (residuals computed once and reused)
    residuals_sq = (y - y_fitted) ** 2
    ss_res = np.sum(residuals_sq)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    rmse = float(np.sqrt(np.mean(residuals_sq)))

    return popt, y_fitted, r_squared

This function is the single point of control for the optimization backend. Equation lookups (get_equation_format_for_function, get_equation_param_names_for_function) use O(1) reverse dicts built at module load, so there is no performance cost when resolving equations by function name.

Why Change the Fitting Core?

Reasons to Use a Different Library

Performance: Some libraries are faster for specific problem types
Robustness: Better handling of difficult optimization problems
Features: Access to different optimization algorithms (global optimization, constraints, etc.)
Licensing: Some libraries have different licenses
Custom needs: Specialized algorithms for your domain

Popular Alternatives

Library	Pros	Cons
lmfit	High-level API, parameter constraints, many models	Additional dependency
scipy.optimize.least_squares	More options than curve_fit, robust loss functions	Lower-level API
scipy.optimize.minimize	General optimization, many algorithms	Not specialized for curve fitting
PyGMO	Global optimization, parallel evaluation	Complex, heavy dependency
NLopt	Fast, many algorithms, constraints	C library binding, harder to install
TensorFlow/PyTorch	GPU acceleration, automatic differentiation	Overkill for simple fits, steep learning curve

How to Replace the Fitting Core

Method 1: Modify `generic_fit` Function

This is the recommended approach as it requires minimal changes to the codebase.

Example: Using lmfit

Step 1: Install lmfit

pip install lmfit

Add to requirements.txt:

lmfit>=1.2.0

Step 2: Create Alternative Generic Fit

Create a new function in fitting_utils.py:

def generic_fit_lmfit(func, x, y, uy=None, p0=None):
    """
    Generic fitting using lmfit library.
    
    lmfit provides more control over parameter bounds, constraints,
    and fitting statistics compared to scipy.curve_fit.
    
    Args:
        func: Mathematical function to fit
        x: Independent variable data
        y: Dependent variable data
        uy: Uncertainties in y (optional, used as weights)
        p0: Initial parameter guess (optional)
        
    Returns:
        params: Fitted parameters (as ndarray)
        y_fitted: Fitted y values
        r_squared: Coefficient of determination
    """
    from lmfit import Model
    import inspect
    
    # Create lmfit Model from function
    model = Model(func, independent_vars=['x'])
    
    # Get parameter names from function signature
    sig = inspect.signature(func)
    param_names = [p for p in sig.parameters.keys() if p != 'x' and p != 't']
    
    # Create parameters object
    params = model.make_params()
    
    # Set initial values if provided
    if p0 is not None:
        for name, value in zip(param_names, p0):
            params[name].value = value
    
    # Perform fit
    if uy is not None:
        # Weighted fit
        result = model.fit(y, params, x=x, weights=1.0/uy)
    else:
        # Unweighted fit
        result = model.fit(y, params, x=x)
    
    # Extract results
    popt = np.array([result.params[name].value for name in param_names])
    y_fitted = result.best_fit
    r_squared = 1 - result.residual.var() / np.var(y)
    
    return popt, y_fitted, r_squared

Step 3: Switch to New Implementation

Option A - Replace original:

# In fitting_utils.py, rename functions:
def generic_fit_scipy(func, x, y, uy=None, p0=None):
    # ... original scipy implementation ...
    pass

# Make generic_fit use lmfit
generic_fit = generic_fit_lmfit

Option B - Use configuration flag:

# In config/constants.py or config/env.py
USE_LMFIT = True  # Set to False to use SciPy

# In fitting_utils.py
def generic_fit(func, x, y, uy=None, p0=None):
    from config import USE_LMFIT
    
    if USE_LMFIT:
        return generic_fit_lmfit(func, x, y, uy, p0)
    else:
        return generic_fit_scipy(func, x, y, uy, p0)

Example: Using scipy.optimize.least_squares

For more control and robust loss functions:

def generic_fit_least_squares(func, x, y, uy=None, p0=None, loss='linear'):
    """
    Generic fitting using scipy.optimize.least_squares.
    
    Provides more robust loss functions and better convergence for
    difficult problems.
    
    Args:
        func: Mathematical function to fit
        x: Independent variable data
        y: Dependent variable data
        uy: Uncertainties in y (used as weights)
        p0: Initial parameter guess (required!)
        loss: Loss function ('linear', 'soft_l1', 'huber', 'cauchy', 'arctan')
        
    Returns:
        params: Fitted parameters
        y_fitted: Fitted y values
        r_squared: Coefficient of determination
    """
    from scipy.optimize import least_squares
    import inspect
    
    # Validate inputs
    if p0 is None:
        raise ValueError("least_squares requires initial parameter guess (p0)")
    
    # Get number of parameters
    sig = inspect.signature(func)
    n_params = len(sig.parameters) - 1  # Subtract independent variable
    
    # Define residual function
    def residuals(params):
        y_model = func(x, *params)
        if uy is not None:
            # Weighted residuals
            return (y - y_model) / uy
        else:
            return y - y_model
    
    # Perform optimization
    result = least_squares(
        residuals,
        p0,
        loss=loss,  # Robust loss function
        f_scale=1.0,  # Scale parameter for loss
        max_nfev=10000  # Maximum function evaluations
    )
    
    # Extract results
    popt = result.x
    y_fitted = func(x, *popt)
    
    # Calculate R²
    ss_res = np.sum((y - y_fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    
    return popt, y_fitted, r_squared

Robust Loss Functions:

'linear': Standard least squares (default, same as curve_fit)
'soft_l1': Robust to outliers, smooth transition
'huber': Robust to outliers, less aggressive
'cauchy': Very robust to outliers, but can converge slowly
'arctan': Similar to Cauchy

Method 2: Create Custom Fitting Wrapper

For specialized needs, create domain-specific fitting functions:

def generic_fit_global(func, x, y, uy=None, bounds=None):
    """
    Global optimization for multimodal functions.
    
    Uses differential evolution to find global minimum,
    useful when curve_fit gets stuck in local minima.
    
    Args:
        func: Mathematical function to fit
        x: Independent variable data
        y: Dependent variable data
        uy: Uncertainties (optional)
        bounds: Parameter bounds as list of (min, max) tuples (required!)
        
    Returns:
        params: Fitted parameters
        y_fitted: Fitted y values
        r_squared: R² value
    """
    from scipy.optimize import differential_evolution
    
    if bounds is None:
        raise ValueError("Global optimization requires parameter bounds")
    
    # Define objective function (minimize sum of squared residuals)
    def objective(params):
        y_model = func(x, *params)
        if uy is not None:
            residuals = (y - y_model) / uy
        else:
            residuals = y - y_model
        return np.sum(residuals ** 2)
    
    # Run global optimization
    result = differential_evolution(
        objective,
        bounds,
        seed=42,  # Reproducibility
        maxiter=1000,
        atol=1e-7,
        tol=0.01
    )
    
    # Extract results
    popt = result.x
    y_fitted = func(x, *popt)
    
    # Calculate R²
    ss_res = np.sum((y - y_fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    
    return popt, y_fitted, r_squared

Method 3: Bayesian Fitting with MCMC

For uncertainty quantification with complex models:

def generic_fit_mcmc(func, x, y, uy=None, p0=None):
    """
    Bayesian fitting using MCMC (requires emcee library).
    
    Provides full posterior distributions for parameters,
    not just point estimates.
    
    Args:
        func: Mathematical function to fit
        x: Independent variable data
        y: Dependent variable data
        uy: Uncertainties in y (required!)
        p0: Initial parameter guess
        
    Returns:
        params: Median parameter values
        y_fitted: Fitted y values
        r_squared: R² value
    """
    try:
        import emcee
    except ImportError:
        raise ImportError("MCMC fitting requires emcee: pip install emcee")
    
    if uy is None:
        uy = np.ones_like(y) * 0.1 * np.std(y)
    
    n_params = len(p0)
    
    # Define log-likelihood
    def log_likelihood(params):
        y_model = func(x, *params)
        return -0.5 * np.sum(((y - y_model) / uy) ** 2)
    
    # Define log-prior (uniform priors)
    def log_prior(params):
        # Simple bounds check
        if np.all(np.isfinite(params)):
            return 0.0
        return -np.inf
    
    # Define log-probability
    def log_probability(params):
        lp = log_prior(params)
        if not np.isfinite(lp):
            return -np.inf
        return lp + log_likelihood(params)
    
    # Initialize walkers
    n_walkers = max(32, 2 * n_params)
    pos = p0 + 1e-4 * np.random.randn(n_walkers, n_params)
    
    # Run MCMC
    sampler = emcee.EnsembleSampler(n_walkers, n_params, log_probability)
    sampler.run_mcmc(pos, 5000, progress=False)
    
    # Get results (discard burn-in)
    samples = sampler.get_chain(discard=1000, flat=True)
    popt = np.median(samples, axis=0)
    
    # Calculate fitted values and R²
    y_fitted = func(x, *popt)
    ss_res = np.sum((y - y_fitted) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    r_squared = 1 - (ss_res / ss_tot)
    
    return popt, y_fitted, r_squared

Advanced: Parameter Constraints and Bounds

Adding Bounds to Existing Functions

Modify individual fitting functions to include bounds:

def ajlineal_con_n_bounded(data, x_name: str, y_name: str):
    """Linear fit with slope constrained to be positive."""
    from scipy.optimize import curve_fit
    
    x = data[x_name].values
    y = data[y_name].values
    uy = data[f'u{y_name}'].values if f'u{y_name}' in data.columns else None
    
    # Constrain: m > 0, n unrestricted
    bounds = ([0, -np.inf], [np.inf, np.inf])
    
    popt, _ = curve_fit(
        func_lineal_con_n,
        x, y,
        sigma=uy,
        bounds=bounds,
        absolute_sigma=True if uy is not None else False
    )
    
    # ... rest of function ...

Creating a Configurable Bounds System

Add bounds to function metadata:

# In config/constants.py
EQUATION_BOUNDS = {
    'exponential_decay': ([0, 0], [np.inf, np.inf]),  # a > 0, b > 0
    'sine_function': ([-np.inf, 0], [np.inf, np.inf]),  # b > 0
    # Add bounds for other equations
}

# In fitting_utils.py
def generic_fit_with_bounds(func, x, y, uy=None, p0=None, equation_name=None):
    """Generic fit with optional bounds from configuration."""
    from scipy.optimize import curve_fit
    from config import EQUATION_BOUNDS
    
    # Get bounds if available
    bounds = EQUATION_BOUNDS.get(equation_name, (-np.inf, np.inf))
    
    sigma = uy if uy is not None else None
    
    popt, _ = curve_fit(
        func, x, y,
        p0=p0,
        sigma=sigma,
        bounds=bounds,
        absolute_sigma=True if sigma is not None else False
    )
    
    y_fitted = func(x, *popt)
    r_squared = 1 - (np.sum((y - y_fitted)**2) / np.sum((y - np.mean(y))**2))
    
    return popt, y_fitted, r_squared

Performance Considerations

Benchmarking Different Backends

Create a benchmark script:

# benchmark_fitting.py
import time
import numpy as np
import pandas as pd
from fitting.fitting_utils import generic_fit, generic_fit_lmfit

# Generate test data
x = np.linspace(0, 10, 1000)
y = 2.5 * x + 1.3 + np.random.normal(0, 0.5, 1000)

def func_linear(t, m, n):
    return m * t + n

# Benchmark scipy
start = time.time()
for _ in range(100):
    params, y_fit, r2 = generic_fit(func_linear, x, y)
scipy_time = time.time() - start

# Benchmark lmfit
start = time.time()
for _ in range(100):
    params, y_fit, r2 = generic_fit_lmfit(func_linear, x, y)
lmfit_time = time.time() - start

print(f"SciPy:  {scipy_time:.3f} seconds (100 fits)")
print(f"lmfit:  {lmfit_time:.3f} seconds (100 fits)")
print(f"Speedup: {lmfit_time/scipy_time:.2f}x")

Optimization Tips

Vectorize operations: Use NumPy arrays, not loops
Good initial guesses: Reduces iterations
Appropriate tolerances: Don’t over-optimize
Warm starts: Reuse previous fits as initial guesses

Example with warm starts:

def batch_fit_with_warm_start(files, equation_type):
    """Fit multiple files using previous result as initial guess."""
    p0 = None  # Start with no guess
    
    results = []
    for file in files:
        data = load_data(file)
        x, y = data['x'].values, data['y'].values
        
        params, y_fit, r2 = generic_fit(func, x, y, p0=p0)
        
        # Use current result as next initial guess
        p0 = params
        
        results.append((params, r2))
    
    return results

Handling Uncertainty Propagation

Using uncertainties Package

For automatic error propagation:

def generic_fit_with_uncertainties(func, x, y, uy=None, p0=None):
    """
    Fit and return parameter uncertainties using uncertainties package.
    
    Returns parameters as ufloat objects with uncertainty.
    """
    try:
        from uncertainties import ufloat, correlated_values
    except ImportError:
        raise ImportError("Install uncertainties: pip install uncertainties")
    
    from scipy.optimize import curve_fit
    
    # Perform fit
    popt, pcov = curve_fit(func, x, y, p0=p0, sigma=uy, absolute_sigma=True if uy is not None else False)
    
    # Create uncertain parameters
    params_with_uncertainty = correlated_values(popt, pcov)
    
    # Calculate fitted values
    y_fitted = func(x, *popt)
    r_squared = 1 - (np.sum((y - y_fitted)**2) / np.sum((y - np.mean(y))**2))
    
    return params_with_uncertainty, y_fitted, r_squared

# Example usage
params_unc, y_fit, r2 = generic_fit_with_uncertainties(func_linear, x, y, uy)
m, n = params_unc
print(f"Slope: {m}")  # Prints: 2.50+/-0.03
print(f"Intercept: {n}")  # Prints: 1.30+/-0.15

Configuration File Approach

Create a comprehensive configuration system:

# config/constants.py or new config/fitting_config.py
FITTING_CONFIG = {
    'backend': 'scipy',  # Options: 'scipy', 'lmfit', 'least_squares'
    'scipy_options': {
        'method': 'lm',  # Levenberg-Marquardt
        'max_nfev': 10000,
    },
    'least_squares_options': {
        'loss': 'soft_l1',  # Robust loss
        'f_scale': 1.0,
    },
    'lmfit_options': {
        'method': 'leastsq',
    },
}

# fitting_utils.py
def generic_fit(func, x, y, uy=None, p0=None):
    """Generic fit that uses configured backend."""
    from config import FITTING_CONFIG
    
    backend = FITTING_CONFIG['backend']
    
    if backend == 'scipy':
        return generic_fit_scipy(func, x, y, uy, p0)
    elif backend == 'lmfit':
        return generic_fit_lmfit(func, x, y, uy, p0)
    elif backend == 'least_squares':
        return generic_fit_least_squares(func, x, y, uy, p0)
    else:
        raise ValueError(f"Unknown fitting backend: {backend}")

Users can then switch backends by editing the config package (e.g. config/constants.py) without modifying code.

Testing New Backends

Always test thoroughly when changing the fitting core:

# tests/test_fitting_backends.py
import pytest
import numpy as np
from fitting.fitting_utils import generic_fit, generic_fit_lmfit

def test_backends_agree():
    """Test that different backends give similar results."""
    # Generate perfect linear data
    x = np.linspace(0, 10, 50)
    y = 2.0 * x + 3.0
    
    def func(t, a, b):
        return a * t + b
    
    # Fit with scipy
    params_scipy, _, _ = generic_fit(func, x, y)
    
    # Fit with lmfit
    params_lmfit, _, _ = generic_fit_lmfit(func, x, y)
    
    # Check agreement
    np.testing.assert_array_almost_equal(params_scipy, params_lmfit, decimal=6)

def test_with_noise():
    """Test fitting with noisy data."""
    x = np.linspace(0, 10, 100)
    y = 2.5 * x + 1.0 + np.random.normal(0, 0.5, 100)
    
    def func(t, a, b):
        return a * t + b
    
    params, y_fit, r2 = generic_fit(func, x, y)
    
    # Check reasonable results
    assert 2.0 < params[0] < 3.0  # Slope near 2.5
    assert 0.5 < params[1] < 1.5  # Intercept near 1.0
    assert r2 > 0.95  # Good fit

Summary

Approach	Difficulty	Flexibility	When to Use
Modify `generic_fit`	Easy	Low	Simple backend swap
Add backend option	Medium	Medium	Support multiple backends
Custom fitting functions	Medium	High	Specialized needs per equation
Configuration system	Hard	Very High	Production deployments

Recommendations:

Start simple: Modify generic_fit directly
For production: Use configuration system
For research: Create specialized fitting functions
Always test: Verify results match expectations

Next Steps

Test thoroughly: Use known data to validate new backend
Benchmark: Compare performance with original
Document: Explain why you chose a different backend
Contribute: Share your improvements with the community!

For more technical details, see the API Documentation.