Testing Standards

Overview

This document establishes comprehensive testing standards for the Godot Stat Math project, derived from extensive data-driven conversion work and testing best practices. The standards ensure scipy-validated accuracy, maintainability, and consistency across the test suite, and should be treated as living documentation, updated as new patterns emerge and the project evolves.

1. Standards & Philosophy

Core Principles

1. The Dual-Pillar Testing Strategy

Our testing philosophy rests on two equally important pillars: Scipy validation for numerical accuracy and mathematical property testing for theoretical correctness. It is not a choice between them; both are required to ensure robust and reliable functions.

Pillar 1: Data-Driven Scipy Validation (Numerical Accuracy)

  • Purpose: To verify that our functions produce numerically precise results that match the industry-standard Scipy library. This ensures our code is correct in practice.

  • Method: All Scipy validation tests must be data-driven.

  • Single Source of Truth: All expected values must come from Scipy-generated data stored in /addons/godot-stat-math/tables/.

  • Eliminate Magic Numbers: No hardcoded expected values in test assertions (except for mathematical constants and identities, which belong in property tests).

  • Traceability: All generated table data must include Scipy function call documentation to ensure scientific accuracy is traceable.

  • Example: Validating normal_cdf(1.96) returns 0.9750021… by looking up the value in a data table generated by scipy.stats.norm.cdf(1.96).

  • Location: These tests belong exclusively in scipy_validation_test.gd files.

Pillar 2: Mathematical Property Testing (Theoretical Correctness)

  • Purpose: To verify that our implementations correctly follow fundamental mathematical laws, relationships, and identities. This ensures our code is correct in theory.

  • Method: Tests assert known truths, such as boundary conditions, symmetries, or relationships between different functions. This is the appropriate place for well-known constants (e.g., erf(0) = 0).

  • Example: Verifying that gamma_pdf(x, 1.0, scale) is equivalent to exponential_pdf(x, 1.0/scale), or that CDF(PPF(p)) == p (see the sketch below).

  • Location: These tests belong exclusively in mathematical_property_test.gd files.
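
For illustration, a minimal sketch of the CDF/PPF round-trip property follows. The StatMath.PpfFunctions.normal_ppf name is illustrative and may differ from the actual API:

func test_normal_cdf_ppf_round_trip() -> void:
     var p: float = 0.7
     var x: float = StatMath.PpfFunctions.normal_ppf(p, 0.0, 1.0)  # illustrative function name
     var result: float = StatMath.CdfFunctions.normal_cdf(x, 0.0, 1.0)
     assert_float(result).is_equal_approx(p, StatMath.CDF_PPF_CONSISTENCY_TOLERANCE)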

This dual approach guarantees that our statistical library is not only numerically accurate according to scientific standards but also structurally sound based on mathematical principles.

2. Test Organization and Structure

  • Semantic Test Names: Name tests after what they validate, not implementation details

  • Logical Grouping: Organize tests into clear sections with descriptive comments

  • Type Safety: Always use static types and typed assertions (assert_float, assert_int, etc.). Use HelperFunctions.convert_to_float_array() to convert untyped arrays (see the sketch after this list).

  • No Redundancy: Eliminate duplicate tests and merge similar functionality
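
For example, a minimal sketch of typed data handling, assuming convert_to_float_array() is exposed via StatMath.HelperFunctions:

func test_median_with_untyped_input() -> void:
     var raw_data: Array = [1, 3, 2, 5, 4]  # untyped data, e.g. parsed from JSON
     var data: Array[float] = StatMath.HelperFunctions.convert_to_float_array(raw_data)
     assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)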

Data-Driven Standards

Required Table Data Structure

# Standard test data pattern with scipy documentation
const VALUES: Dictionary = {
     "function_name": [  # Generated using: scipy.stats.function_name(params)
             {
                     "params": [param1, param2, ...],
                     "expected": scipy_calculated_result,
                     "description": "Human readable test case description"
             }
     ]
}

Scipy Version Control

All test data files generated by generate_test_data.py include version information for scipy and numpy in their headers:

# Generated with: scipy 1.11.3, numpy 1.24.3

This provides traceability for test data generation and ensures reproducibility when debugging edge cases.

Acceptable Hardcoded Values

✅ KEEP HARDCODED - Mathematical constants and well-known limits:

  • Zero values: erf(0.0) = 0.0, erfc(0.0) = 1.0

  • Asymptotic limits: erf(10.0) ≈ 1.0, erf(-10.0) ≈ -1.0

  • Mathematical constants: sqrt(PI), factorial identities like 0! = 1

  • Boundary conditions: uniform_cdf(x) = 0.0 when x < a, = 1.0 when x > b

  • Identity relationships: gamma(1.0) = 1.0

❌ MUST CONVERT - Scipy-calculated precision values:

  • Complex probability calculations: 0.9750021, 0.8646647, 0.59399415

  • Non-obvious mathematical results requiring computation

  • Values that appear as “magic numbers” without clear mathematical justification
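
For illustration, the contrast looks like this (the ERF_TEST_DATA table name and index are hypothetical):

# ✅ Mathematical identity - hardcoded, belongs in mathematical_property_test.gd
func test_erf_at_zero() -> void:
     assert_float(StatMath.ErrorFunctions.erf(0.0)).is_equal_approx(0.0, StatMath.BOUNDARY_TOLERANCE)

# ❌ Scipy-calculated value - must be looked up from a data table in scipy_validation_test.gd
func test_erf_at_one() -> void:
     var case: Dictionary = ERF_TEST_DATA.VALUES["erf"][1]  # erf(1.0) -> 0.84270079
     assert_float(StatMath.ErrorFunctions.erf(case["params"][0])).is_equal_approx(case["expected"], StatMath.ERF_APPROX_TOLERANCE)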

Exception: Simple, Illustrative Test Data

It is acceptable to hardcode simple data directly within a single test function if its primary purpose is to illustrate a specific behavior, data handling characteristic, or edge case, rather than to validate a complex mathematical result against a scientific standard.

Criteria for “Simple Test Data” Exception:

  • No computed expected values: The data is not the result of scientific/mathematical calculations

  • Self-evident from context: Data structure and purpose are immediately clear (e.g., [1, 2, 3, 4, 5] for median testing)

  • Single test usage: Data is not reused across multiple tests

  • Illustrative purpose: Primary goal is demonstrating behavior, not validating mathematical accuracy

Examples of acceptable simple data:

# ✅ Acceptable - simple array for median behavior
func test_median_with_odd_count() -> void:
     var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
     assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)

# ✅ Acceptable - edge case demonstration
func test_empty_array_handling() -> void:
     var empty_data: Array[float] = []
     assert_that(is_nan(StatMath.BasicStats.mean(empty_data))).is_true()

Move to tables when:

  • Data is needed for more than one test

  • Expected values represent calculated results

  • Data represents complex scenarios requiring scipy validation

This approach maintains test readability for simple cases while enforcing data-driven standards for mathematical validation.

Data Generation Requirements

  • Scipy Call Documentation: Every function must include the exact scipy call used

  • Format: "function_name": [  # Generated using: stats.norm.cdf(x, mu, sigma)

  • Comprehensive Coverage: Include all test scenarios needed by actual tests

  • Optimized Structure: Design for test consumption, not generation convenience

Tolerance Constants Usage

Tolerance Decision Tree

Use this decision tree to eliminate tolerance selection paralysis:

├── Testing mathematical identity or boundary condition?
│   └── YES → StatMath.BOUNDARY_TOLERANCE (1.0e-10)
│
├── Testing against scipy validation data?
│   ├── Basic statistical functions (mean, variance, etc.)
│   │   └── StatMath.FLOAT_TOLERANCE (1.0e-7)
│   ├── Approximation algorithms (erf, gamma, etc.)
│   │   └── StatMath.ERF_APPROX_TOLERANCE (1.0e-5)
│   ├── Inverse functions (ppf, quantiles)
│   │   └── StatMath.INVERSE_FUNCTION_TOLERANCE
│   └── Probability density/mass functions
│       └── StatMath.PROBABILITY_TOLERANCE (1.0e-6)
│
├── Testing numerical integration or iterative methods?
│   └── StatMath.NUMERICAL_INTEGRATION_TOLERANCE (5.0e-3)
│
└── High-precision mathematical operations?
    └── StatMath.HIGH_PRECISION_TOLERANCE (1.0e-9)

Standard Tolerances

  • StatMath.FLOAT_TOLERANCE = 1.0e-7 - General floating-point comparisons

  • StatMath.HIGH_PRECISION_TOLERANCE = 1.0e-9 - High-precision operations

  • StatMath.BOUNDARY_TOLERANCE = 1.0e-10 - Boundary conditions

Specialized Tolerances

  • StatMath.ERF_APPROX_TOLERANCE = 1.0e-5 - Error function approximations

  • StatMath.PROBABILITY_TOLERANCE = 1.0e-6 - Probability values

  • StatMath.NUMERICAL_TOLERANCE = 1.0e-5 - Numerical methods

  • StatMath.NUMERICAL_INTEGRATION_TOLERANCE = 5.0e-3 - PDF integration

  • StatMath.INVERSE_FUNCTION_TOLERANCE - PPF and inverse functions

  • StatMath.CDF_PPF_CONSISTENCY_TOLERANCE - Round-trip consistency tests

Tolerance Selection Guidelines

  • Use the decision tree above to eliminate guesswork

  • Consider numerical method precision when selecting tolerances

  • Document why specific tolerances are chosen for edge cases

  • When in doubt, start with the most restrictive tolerance and adjust upward if tests fail
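
Applied to a concrete case, the tree resolves quickly. The sketch below assumes a hypothetical PPF_TEST_DATA table and StatMath.PpfFunctions.normal_ppf function; an inverse function validated against scipy data takes StatMath.INVERSE_FUNCTION_TOLERANCE:

func test_normal_ppf_scipy_validation() -> void:
     var case: Dictionary = PPF_TEST_DATA.VALUES["normal_ppf"][0]
     var result: float = StatMath.PpfFunctions.normal_ppf(case["params"][0], case["params"][1], case["params"][2])
     assert_float(result).is_equal_approx(case["expected"], StatMath.INVERSE_FUNCTION_TOLERANCE)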

2. Implementation & Organization

Folder Structure

tests/core/
├── [module_name]/
│   ├── scipy_validation_test.gd     # SCIPY VALIDATION TESTS - DATA-DRIVEN
│   ├── mathematical_property_test.gd # MATHEMATICAL PROPERTY TESTS
│   └── parameter_validation_test.gd  # PARAMETER VALIDATION TESTS

Implementation Guidelines

File Naming Convention

  • scipy_validation_test.gd - Contains all data-driven tests that validate against SciPy reference values

  • mathematical_property_test.gd - Contains tests for mathematical properties, relationships, and theoretical behavior

  • parameter_validation_test.gd - Contains all input validation and error handling tests

Test Implementation Patterns

Types of Testing Approaches

1. Data-Driven Testing (Primary Approach)

Uses precomputed test cases generated with scipy and stored in /addons/godot-stat-math/tables/:

func test_normal_cdf_scipy_validation() -> void:
     var test_data: Array = CDF_TEST_DATA.VALUES["normal_cdf"]
     var case: Dictionary = test_data[0]
     var result: float = StatMath.CdfFunctions.normal_cdf(case["params"][0], case["params"][1], case["params"][2])
     assert_float(result).is_equal_approx(case["expected"], StatMath.FLOAT_TOLERANCE)

2. Example-Based Testing (Simple Cases Only)

Tests specific input → output pairs using hardcoded values. Only used for the simple cases outlined in the “Simple, Illustrative Test Data” exception above:

func test_median_with_odd_count() -> void:
     var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
     assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)

3. Property-Based Testing (Future Enhancement)

Tests mathematical relationships that should always hold using randomly generated inputs:

func test_cdf_range_property() -> void:
     var rng: RandomNumberGenerator = RandomNumberGenerator.new()
     rng.seed = 12345

     for i in range(100):
             var x: float = rng.randf_range(-10.0, 10.0)
             var mu: float = rng.randf_range(-3.0, 3.0)
             var sigma: float = rng.randf_range(0.1, 2.0)

             var result: float = StatMath.CdfFunctions.normal_cdf(x, mu, sigma)
             assert_float(result).is_between(0.0, 1.0)

Note

Our current test suite has excellent mathematical property testing with fixed examples but lacks true property-based testing with random parameter generation. This enhancement will be undertaken at a later date. See: Issue #10 for details.

Standard Test Function Pattern

func test_function_name_descriptive_case() -> void:
     var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
     var case: Dictionary = test_data[index]  # function_name(params) -> expected
     var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
     assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)

Parametrized Test Pattern

func test_function_comprehensive_validation() -> void:
     var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
     for case in test_data:
             var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
             assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)

Error Testing Pattern

func test_function_invalid_parameter() -> void:
     var test_call: Callable = func():
             StatMath.Module.function_name(invalid_param)

     # Test error logging
     await assert_error(test_call).is_push_error("Expected exact error message")

     # Test sentinel return value
     var result = StatMath.Module.function_name(invalid_param)
     assert_that(is_nan(result)).is_true()  # or assert_that(result).is_null()

File Organization Standards

Data Table Standards

There are two types of data tables in /addons/godot-stat-math/tables/, and they must follow these strict conventions:

1. Generated by generate_test_data.py:

# res://addons/godot-stat-math/tables/module_test_data.gd

## Data table containing test values generated by scipy
##
## Generated with: scipy 1.11.3, numpy 1.24.3

const VALUES: Dictionary = {
     "function_name": [  # Generated using: scipy.stats.function_name(params)
             {
                     "params": [param1, param2],
                     "expected": scipy_result,
                     "description": "Descriptive test case"
             }
     ]
}

2. Downloaded from authoritative sources:

# res://addons/godot-stat-math/tables/sobol_data.gd

# Joe-Kuo Sobol sequence direction numbers
# Source: https://web.maths.unsw.edu.au/~fkuo/sobol/
# Supports up to 1024 dimensions

# Direction numbers indexed by dimension [0=unused, 1=unused, 2=[1], 3=[1,3], ...]
const DIRECTION_NUMBERS: Array[Array] = [
     [],             # Dimension 0 (unused)
     [],             # Dimension 1 (unused)
     [1],            # Dimension 2
     [1, 3],         # Dimension 3
     # ... continues through dimension 1023
]

Key Requirements:

  • Data tables are plain scripts without class_name declarations

  • Include scipy version information for generated tables in header comments

  • Include the source URL for downloaded tables

  • Use const VALUES: Dictionary for the main data structure

  • Include brief documentation explaining the table’s purpose

Documentation Standards

Test Documentation

  • Include clear descriptions of what each test validates

  • Document mathematical relationships being tested

  • Explain tolerance choices for edge cases

  • Reference scipy functions used for expected values

Code Comments

# Mathematical constant - well-known limit
assert_float(StatMath.ErrorFunctions.erf(10.0)).is_equal_approx(1.0, StatMath.FLOAT_TOLERANCE)

# Scipy-validated precision value
var case: Dictionary = test_data[1]  # erf(1.0) -> 0.84270079

Scipy Call Documentation

const VALUES: Dictionary = {
     "normal_cdf": [  # Generated using: stats.norm.cdf(x, loc=mu, scale=sigma)
             {
                     "params": [1.96, 0.0, 1.0],
                     "expected": 0.97500210485177963,
                     "description": "Standard normal 97.5th percentile"
             }
     ]
}

Error Handling Standards

Error Testing Requirements

  • Every error condition must have a dedicated test

  • Test both error logging AND return values

  • Error messages must match exactly between implementation and test

  • Use sentinel values (NAN, null) for invalid operations

Error Message Format

if not (valid_condition):
    push_error("Clear description of what went wrong. Received: %s" % actual_value)
    return NAN

3. Infrastructure & Process

CI/CD Integration and Platform Consistency

Development Workflow

Core Branches

  • develop: The main integration branch for new features and bugfixes.

  • release: The staging branch for preparing and deploying new releases.

All new code enters the develop branch via Pull Request. The full workflow to release is outlined below:

PR to develop (verify-develop.yaml)

  • Trigger: Opening or updating a PR targeting the develop branch.

  • Goal: Quickly verify that the changes are structurally sound and don’t break core functionality.

  • Actions:

    1. Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.

    2. Run Core Tests: Executes a subset of the test suite (/tests/core and stat_math_test.gd). This provides a fast feedback loop by focusing on critical unit and integration tests, excluding slower performance tests.

    3. Upload Results: Test reports are uploaded as artifacts for inspection.

Merge PR into develop (build-develop.yaml)

  • Trigger: Merging a PR into the develop branch.

  • Goal: To create a continuously updated, testable development build of the addon.

  • Actions:

    1. Trigger AWS Runner: Spins up the AWS test runner. Because the performance test suites run in this workflow, the hardware must be identical between runs.

    2. Run Full Test Suite: Ensures the integrity of the develop branch by running all tests.

    3. Commit Test Results: The GdUnit report is committed to reports/ and the performance snapshots to addons/godot-stat-math/tests/performance/results.

    4. Create Dev Build: Packages the addons/godot-stat-math directory (including the addons/godot-stat-math/tests folder) into a zip file named godot-stat-math-<version>-dev.zip.

    5. Upload Artifact: The development build zip is uploaded as a GitHub artifact. This allows developers and testers to easily download and try out the latest dev build without waiting for an official release.

    6. Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.

PR to release (verify-release.yaml)

  • Trigger: Opening or updating a PR targeting the release branch.

  • Goal: Perform a comprehensive, final check before a new version is released.

  • Actions:

    1. Check Version Bump: Ensures the project version in project.godot has been bumped according to semantic versioning.

    2. Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.

    3. Trigger AWS Runner: Spins up the AWS test runner, since we are also running performance tests.

    4. Run Full Test Suite: Executes the entire test suite, including core tests, integration tests, and performance tests (/tests). This is a thorough validation to catch any regressions.

    5. Upload Results: Full test reports are uploaded as artifacts, but not committed to the repo.

    6. Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.

Merge PR into release (build-release.yaml)

  • Trigger: Merging a PR into the release branch.

  • Goal: To automatically build and publish a new release on GitHub. (verify-release tests must pass).

  • Actions:

    1. Extract Release Notes: Pulls the relevant release notes for the new version from CHANGELOG.md.

    2. Create Release Build: Packages the addon into a clean, versioned release zip file.

      • The standard release does not include tests and is named godot-stat-math-<version>.zip

      • Setting the environment variable RELEASE_DEBUG_DIST = true additionally builds a development version that includes all tests, named godot-stat-math-<version>-debug.zip

    3. Build and Publish Docs: Builds the documentation and publishes it to the gh-pages branch.

Key Benefits:

  • Consistent Hardware: Identical AWS instances eliminate platform-specific variations

  • Test Result History: All test reports committed to branches for historical tracking

  • Performance Baselines: Automated regression detection with snapshot comparisons

  • Isolated Testing: Fresh, identical environments for every test run

Other Workflows:

  • Build and Publish Docs (build-and-publish-docs.yaml): Builds the documentation and publishes it to the gh-pages branch.

  • Run Tests (run-tests.yaml): Runs the test suite on a self-hosted runner.

  • Run Performance Tests (run-performance-tests.yaml): Runs the performance tests on a self-hosted runner.

This infrastructure allows us to focus on mathematical accuracy without worrying about hardware-specific floating-point differences that would otherwise require platform-specific tolerance adjustments.

Testing Workflow

1. Data Generation Process

  1. Identify missing or inadequate test data in generate_test_data.py

  2. Add scipy function calls with proper parameter ranges

  3. Include scipy call documentation in comments

  4. Generate comprehensive test data covering edge cases

  5. Validate generated data through spot checks

2. Test Implementation Process

  1. Review existing test patterns for consistency

  2. Implement data-driven tests using table lookups

  3. Eliminate hardcoded expected values (except mathematical constants)

  4. Use appropriate tolerance constants

  5. Test both success and error conditions

3. Quality Assurance Process

  1. Run full test suite to ensure no regressions

  2. Verify all tests use data-driven patterns

  3. Check that scipy documentation is present

  4. Confirm elimination of anti-patterns

  5. Validate test coverage and edge cases

4. Performance Testing

Performance Test Infrastructure

Only the build-develop.yaml and verify-release.yaml workflows run performance tests, and both execute on AWS spot instances. This guarantees consistent hardware (t3.medium), and these workflows should be the only place where performance baselines are generated.

Baseline Generation:

  • Automatic from successful runs: Baselines are automatically updated from up to the 50 most recent successful test runs (pass_*.json files)

  • Median-based for robustness: Uses median execution time rather than mean to be resistant to outliers and ensure stable baselines

  • Statistical metadata included: Each baseline test includes sample size, coefficient of variation, and dynamic threshold calculations

Threshold Determination:

  • 95th percentile approach: Thresholds are set so only 5% of historical runs would be considered “slower” - uses percentile-based statistical analysis rather than fixed percentages

  • Volatility-aware scaling: Different minimum thresholds based on function stability - very stable functions (CV < 5%) get 12% minimum, high volatility functions get 25% minimum

  • Fast function compensation: Functions under 0.5ms get higher thresholds (15% minimum) because small absolute variations create large percentage changes

  • Sample size confidence adjustment: Smaller sample sizes get 10-20% higher thresholds due to statistical uncertainty; mature baselines (25+ samples) get 30% higher tolerance

  • Safety buffers and bounds: All thresholds get a 10% safety buffer and are clamped between 8% minimum and 25% maximum to prevent both false positives and missing real regressions
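
As a rough illustration of the logic described above, the following GDScript sketch combines a median baseline, a 95th-percentile threshold, the 10% safety buffer, and the 8-25% clamp. It is a hypothetical simplification: the real CI tooling also applies the volatility-aware minimums and sample-size adjustments listed above, and its function names differ.

# Hypothetical sketch - not the actual CI implementation.
static func compute_threshold_percent(samples_ms: Array[float]) -> float:
     samples_ms.sort()  # sorts in place; pass a copy if the original order matters
     var median_ms: float = samples_ms[samples_ms.size() >> 1]  # baseline = median of recent runs
     var p95_ms: float = samples_ms[int(floor(0.95 * float(samples_ms.size() - 1)))]
     var threshold: float = (p95_ms - median_ms) / median_ms * 100.0  # only ~5% of historical runs exceed this
     threshold *= 1.10  # 10% safety buffer
     return clampf(threshold, 8.0, 25.0)  # clamp to the documented 8-25% bounds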

Known Regression Handling:

  • Tests with known performance issues can be marked as regression cases

  • Marked tests will not execute until new performance baselines are established

  • This prevents false failures while allowing continued development

Baseline Regeneration:

  • Major algorithmic changes may require baseline regeneration across all functions

  • Run run-tests.yaml multiple times to establish new performance baselines

  • The system automatically commits updated baseline data to the repository

This approach maintains performance monitoring while allowing flexibility during development cycles.

Performance Considerations

Test Execution Efficiency

  • Use parallel tool execution for information gathering

  • Minimize sequential operations in test discovery

  • Cache table data appropriately for repeated access (see the sketch after this list)

  • Avoid redundant calculations in parametrized tests
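
A minimal sketch of table-data caching, assuming a hypothetical table script path; GdUnit4's before() hook runs once per suite:

const CDF_TEST_DATA = preload("res://addons/godot-stat-math/tables/cdf_test_data.gd")  # illustrative path

var _normal_cdf_cases: Array

func before() -> void:
     # Cache the slice of table data that multiple tests in this suite reuse.
     _normal_cdf_cases = CDF_TEST_DATA.VALUES["normal_cdf"]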

Memory Management

  • Use auto_free() and queue_free() for node cleanup (a sketch follows this list)

  • Monitor for orphaned nodes in complex tests

  • Clean up temporary resources after test completion
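
For example, a minimal sketch using GdUnit4's auto_free() so temporary nodes are released when the test completes:

func test_with_temporary_node() -> void:
     var temp_node: Node = auto_free(Node.new())  # freed automatically after the test
     assert_that(temp_node).is_not_null()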

Test Data Governance Process

Minimal Governance Philosophy

Our data governance approach is intentionally minimal since we are dealing with mathematical constants and relationships that should not change across scipy versions. Mathematical functions like norm.cdf(1.96, 0.0, 1.0) represent universal mathematical truths, not implementation-specific behaviors.

Key Principles:

  • Test data represents mathematical constants, not software behavior

  • Scipy version changes should not affect mathematical correctness

  • Data regeneration is only needed when adding new functions or test cases

  • Version tracking provides debugging context, not compatibility requirements

This approach differs from typical software testing where external API changes require extensive data migration and compatibility testing.