Testing Standards
Overview
This document establishes comprehensive testing standards for the Godot Stat Math project, derived from extensive data-driven conversion work and testing best practices. The standards ensure scipy-validated accuracy, maintainability, and consistency across the test suite, and should be treated as living documentation, updated as new patterns emerge and the project evolves.
1. Standards & Philosophy
Core Principles
1. The Dual-Pillar Testing Strategy
Our testing philosophy rests on two equally important pillars: Scipy validation for numerical accuracy and mathematical property testing for theoretical correctness. It is not a choice between them; both are required to ensure robust and reliable functions.
Pillar 1: Data-Driven Scipy Validation (Numerical Accuracy)
- Purpose: To verify that our functions produce numerically precise results that match the industry-standard Scipy library. This ensures our code is correct in practice.
- Method: All Scipy validation tests must be data-driven.
  - Single Source of Truth: All expected values must come from Scipy-generated data stored in /addons/godot-stat-math/tables/.
  - Eliminate Magic Numbers: No hardcoded expected values in test assertions (except for mathematical constants and identities, which belong in property tests).
  - Traceability: All generated table data must include Scipy function call documentation so that scientific accuracy is traceable.
- Example: Validating that normal_cdf(1.96) returns 0.9750021… by looking up the value in a data table generated by scipy.stats.norm.cdf(1.96).
- Location: These tests belong exclusively in scipy_validation_test.gd files.
Pillar 2: Mathematical Property Testing (Theoretical Correctness)
- Purpose: To verify that our implementations correctly follow fundamental mathematical laws, relationships, and identities. This ensures our code is correct in theory.
- Method: Tests assert known truths, such as boundary conditions, symmetries, or relationships between different functions. This is the appropriate place for well-known constants (e.g., erf(0) = 0).
- Example: Verifying that gamma_pdf(x, 1.0, scale) is equivalent to exponential_pdf(x, 1.0/scale), or that CDF(PPF(p)) == p.
- Location: These tests belong exclusively in mathematical_property_test.gd files.
This dual approach guarantees that our statistical library is not only numerically accurate according to scientific standards but also structurally sound based on mathematical principles.
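As a concrete illustration of Pillar 2, here is a minimal property-test sketch. It reuses the StatMath module and tolerance names found elsewhere in this document, but the StatMath.PpfFunctions.normal_ppf location is an assumption and may differ in the actual codebase.

# Sketch of a mathematical_property_test.gd entry (module paths assumed).
func test_normal_cdf_ppf_round_trip() -> void:
    # Property: CDF(PPF(p)) == p for any probability p in (0, 1).
    var probabilities: Array[float] = [0.05, 0.25, 0.5, 0.75, 0.95]
    for p: float in probabilities:
        var x: float = StatMath.PpfFunctions.normal_ppf(p, 0.0, 1.0)
        var round_trip: float = StatMath.CdfFunctions.normal_cdf(x, 0.0, 1.0)
        assert_float(round_trip).is_equal_approx(p, StatMath.CDF_PPF_CONSISTENCY_TOLERANCE)


func test_erf_known_constants() -> void:
    # Well-known identities may stay hardcoded in property tests.
    assert_float(StatMath.ErrorFunctions.erf(0.0)).is_equal_approx(0.0, StatMath.BOUNDARY_TOLERANCE)
    assert_float(StatMath.ErrorFunctions.erfc(0.0)).is_equal_approx(1.0, StatMath.BOUNDARY_TOLERANCE)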
2. Test Organization and Structure
- Semantic Test Names: Name tests after what they validate, not implementation details
- Logical Grouping: Organize tests into clear sections with descriptive comments
- Type Safety: Always use static types and typed assertions (assert_float, assert_int, etc.). Use HelperFunctions.convert_to_float_array() to convert untyped arrays (see the sketch after this list).
- No Redundancy: Eliminate duplicate tests and merge similar functionality
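A short sketch of the type-safety convention, assuming the HelperFunctions.convert_to_float_array() helper referenced above; the untyped input array and its expected mean are purely illustrative:

func test_mean_with_untyped_source_data() -> void:
    # Hypothetical untyped payload, e.g. values parsed from JSON.
    var raw_samples: Array = [1, 2.5, 3, 4.5]
    var samples: Array[float] = HelperFunctions.convert_to_float_array(raw_samples)
    var result: float = StatMath.BasicStats.mean(samples)
    assert_float(result).is_equal_approx(2.75, StatMath.FLOAT_TOLERANCE)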
Data-Driven Standards
Required Table Data Structure
# Standard test data pattern with scipy documentation
const VALUES: Dictionary = {
    "function_name": [  # Generated using: scipy.stats.function_name(params)
        {
            "params": [param1, param2, ...],
            "expected": scipy_calculated_result,
            "description": "Human readable test case description"
        }
    ]
}
Scipy Version Control
All test data files generated by generate_test_data.py include version information for scipy and numpy in their headers:

# Generated with: scipy 1.11.3, numpy 1.24.3

This provides traceability for test data generation and ensures reproducibility when debugging edge cases.
Acceptable Hardcoded Values
✅ KEEP HARDCODED - Mathematical constants and well-known limits:
- Zero values: erf(0.0) = 0.0, erfc(0.0) = 1.0
- Asymptotic limits: erf(10.0) ≈ 1.0, erf(-10.0) ≈ -1.0
- Mathematical constants: sqrt(PI), factorial identities like 0! = 1
- Boundary conditions: uniform_cdf(x) = 0.0 when x < a, = 1.0 when x > b
- Identity relationships: gamma(1.0) = 1.0
❌ MUST CONVERT - Scipy-calculated precision values:
- Complex probability calculations: 0.9750021, 0.8646647, 0.59399415
- Non-obvious mathematical results requiring computation
- Values that appear as “magic numbers” without clear mathematical justification (a before/after sketch of such a conversion follows this list)
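For instance, a hedged before/after sketch of the conversion. The CDF_TEST_DATA table constant matches the pattern used later in this document, but the case index and comment are illustrative:

# ❌ Before: scipy-calculated value hardcoded in the assertion
func test_normal_cdf_hardcoded() -> void:
    var result: float = StatMath.CdfFunctions.normal_cdf(1.96, 0.0, 1.0)
    assert_float(result).is_equal_approx(0.9750021, StatMath.FLOAT_TOLERANCE)


# ✅ After: expected value looked up from the generated table
func test_normal_cdf_scipy_validation() -> void:
    var case: Dictionary = CDF_TEST_DATA.VALUES["normal_cdf"][0]  # normal_cdf(1.96, 0.0, 1.0)
    var result: float = StatMath.CdfFunctions.normal_cdf(case["params"][0], case["params"][1], case["params"][2])
    assert_float(result).is_equal_approx(case["expected"], StatMath.FLOAT_TOLERANCE)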
Exception: Simple, Illustrative Test Data
It is acceptable to hardcode simple data directly within a single test function if its primary purpose is to illustrate a specific behavior, data handling characteristic, or edge case, rather than to validate a complex mathematical result against a scientific standard.
Criteria for “Simple Test Data” Exception:
- No computed expected values: The data is not the result of scientific/mathematical calculations
- Self-evident from context: The data structure and purpose are immediately clear (e.g., [1, 2, 3, 4, 5] for median testing)
- Single test usage: The data is not reused across multiple tests
- Illustrative purpose: The primary goal is demonstrating behavior, not validating mathematical accuracy
Examples of acceptable simple data:

# ✅ Acceptable - simple array for median behavior
func test_median_with_odd_count() -> void:
    var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
    assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)


# ✅ Acceptable - edge case demonstration
func test_empty_array_handling() -> void:
    var empty_data: Array[float] = []
    assert_that(is_nan(StatMath.BasicStats.mean(empty_data))).is_true()
Move to tables when:
- Data is needed for more than one test
- Expected values represent calculated results
- Data represents complex scenarios requiring scipy validation

This approach maintains test readability for simple cases while enforcing data-driven standards for mathematical validation. A sketch of a table-backed version of the median example above follows.
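For example, if the median data above became shared across several tests, it could be promoted to a small table file. The file name and the generation comment below are hypothetical:

# res://addons/godot-stat-math/tables/basic_stats_test_data.gd (hypothetical)
const VALUES: Dictionary = {
    "median": [  # Generated using: numpy.median(data)
        {
            "params": [[1.0, 3.0, 2.0, 5.0, 4.0]],
            "expected": 3.0,
            "description": "Median of five unsorted values"
        }
    ]
}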
Data Generation Requirements
- Scipy Call Documentation: Every function must include the exact scipy call used
  Format: "function_name": [ # Generated using: stats.norm.cdf(x, mu, sigma)
- Comprehensive Coverage: Include all test scenarios needed by actual tests
- Optimized Structure: Design for test consumption, not generation convenience
Tolerance Constants Usage
Tolerance Decision Tree
Use this decision tree to eliminate tolerance selection paralysis:
├── Testing mathematical identity or boundary condition?
│ └── YES → StatMath.BOUNDARY_TOLERANCE (1.0e-10)
│
├── Testing against scipy validation data?
│ ├── Basic statistical functions (mean, variance, etc.)
│ │ └── StatMath.FLOAT_TOLERANCE (1.0e-7)
│ ├── Approximation algorithms (erf, gamma, etc.)
│ │ └── StatMath.ERF_APPROX_TOLERANCE (1.0e-5)
│ ├── Inverse functions (ppf, quantiles)
│ │ └── StatMath.INVERSE_FUNCTION_TOLERANCE
│ └── Probability density/mass functions
│ └── StatMath.PROBABILITY_TOLERANCE (1.0e-6)
│
├── Testing numerical integration or iterative methods?
│ └── StatMath.NUMERICAL_INTEGRATION_TOLERANCE (5.0e-3)
│
└── High-precision mathematical operations?
└── StatMath.HIGH_PRECISION_TOLERANCE (1.0e-9)
Standard Tolerances
- StatMath.FLOAT_TOLERANCE = 1.0e-7 - General floating-point comparisons
- StatMath.HIGH_PRECISION_TOLERANCE = 1.0e-9 - High-precision operations
- StatMath.BOUNDARY_TOLERANCE = 1.0e-10 - Boundary conditions
Specialized Tolerances
- StatMath.ERF_APPROX_TOLERANCE = 1.0e-5 - Error function approximations
- StatMath.PROBABILITY_TOLERANCE = 1.0e-6 - Probability values
- StatMath.NUMERICAL_TOLERANCE = 1.0e-5 - Numerical methods
- StatMath.NUMERICAL_INTEGRATION_TOLERANCE = 5.0e-3 - PDF integration
- StatMath.INVERSE_FUNCTION_TOLERANCE - PPF and inverse functions
- StatMath.CDF_PPF_CONSISTENCY_TOLERANCE - Round-trip consistency tests
Tolerance Selection Guidelines
- Use the decision tree above to eliminate guesswork
- Consider the precision of the underlying numerical method when selecting tolerances
- Document why specific tolerances are chosen for edge cases
- When in doubt, start with the most restrictive tolerance and adjust upward only if tests fail (a short sketch of tolerance selection in practice follows)
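A short sketch of the decision tree in practice. The uniform_cdf signature (x, a, b) and the ERF_TEST_DATA table constant are assumptions used only for illustration:

func test_tolerance_selection_examples() -> void:
    # Boundary condition -> BOUNDARY_TOLERANCE
    assert_float(StatMath.CdfFunctions.uniform_cdf(-1.0, 0.0, 1.0)).is_equal_approx(0.0, StatMath.BOUNDARY_TOLERANCE)

    # Scipy-validated approximation algorithm (erf) -> ERF_APPROX_TOLERANCE
    var case: Dictionary = ERF_TEST_DATA.VALUES["erf"][0]
    assert_float(StatMath.ErrorFunctions.erf(case["params"][0])).is_equal_approx(case["expected"], StatMath.ERF_APPROX_TOLERANCE)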
2. Implementation & Organization
Folder Structure
tests/core/
├── [module_name]/
│   ├── scipy_validation_test.gd        # SCIPY VALIDATION TESTS - DATA-DRIVEN
│   ├── mathematical_property_test.gd   # MATHEMATICAL PROPERTY TESTS
│   └── parameter_validation_test.gd    # PARAMETER VALIDATION TESTS
Implementation Guidelines
File Naming Convention
- scipy_validation_test.gd - Contains all data-driven tests that validate against SciPy reference values
- mathematical_property_test.gd - Contains tests for mathematical properties, relationships, and theoretical behavior
- parameter_validation_test.gd - Contains all input validation and error handling tests
Test Implementation Patterns
Types of Testing Approaches
1. Data-Driven Testing (Primary Approach)
Uses precomputed test cases generated by scipy and stored in /addons/godot-stat-math/tables/:

func test_normal_cdf_scipy_validation() -> void:
    var test_data: Array = CDF_TEST_DATA.VALUES["normal_cdf"]
    var case: Dictionary = test_data[0]
    var result: float = StatMath.CdfFunctions.normal_cdf(case["params"][0], case["params"][1], case["params"][2])
    assert_float(result).is_equal_approx(case["expected"], StatMath.FLOAT_TOLERANCE)
2. Example-Based Testing (Simple Cases Only)
Tests specific input → output pairs using hardcoded values. Only used for the simple cases outlined in the “Simple, Illustrative Test Data” exception above:
func test_median_with_odd_count() -> void:
    var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
    assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)
3. Property-Based Testing (Future Enhancement)
Tests mathematical relationships that should always hold using randomly generated inputs:
func test_cdf_range_property() -> void:
    var rng: RandomNumberGenerator = RandomNumberGenerator.new()
    rng.seed = 12345
    for i in range(100):
        var x: float = rng.randf_range(-10.0, 10.0)
        var mu: float = rng.randf_range(-3.0, 3.0)
        var sigma: float = rng.randf_range(0.1, 2.0)
        var result: float = StatMath.CdfFunctions.normal_cdf(x, mu, sigma)
        assert_float(result).is_between(0.0, 1.0)
Note
Our current test suite has excellent mathematical property testing with fixed examples but lacks true property-based testing with random parameter generation. This enhancement will be undertaken at a later date. See: Issue #10 for details.
Standard Test Function Pattern
func test_function_name_descriptive_case() -> void:
    var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
    var case: Dictionary = test_data[index]  # function_name(params) -> expected
    var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
    assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)
Parametrized Test Pattern
func test_function_comprehensive_validation() -> void:
    var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
    for case in test_data:
        var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
        assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)
Error Testing Pattern
func test_function_invalid_parameter() -> void:
    var test_call: Callable = func() -> void:
        StatMath.Module.function_name(invalid_param)

    # Test error logging
    await assert_error(test_call).is_push_error("Expected exact error message")

    # Test sentinel return value
    var result = StatMath.Module.function_name(invalid_param)
    assert_that(is_nan(result)).is_true()  # or assert_that(result).is_null()
File Organization Standards
Data Table Standards
There are two types of data table in /addons/godot-stat-math/tables/, and they must follow these strict conventions:

1. Generated by generate_test_data.py:
# res://addons/godot-stat-math/tables/module_test_data.gd
## Data table containing test values generated by scipy
##
## Generated with: scipy 1.11.3, numpy 1.24.3
const VALUES: Dictionary = {
    "function_name": [  # Generated using: scipy.stats.function_name(params)
        {
            "params": [param1, param2],
            "expected": scipy_result,
            "description": "Descriptive test case"
        }
    ]
}
Documentation Standards
Test Documentation
- Include clear descriptions of what each test validates
- Document the mathematical relationships being tested
- Explain tolerance choices for edge cases
- Reference the scipy functions used for expected values
Code Comments
# Mathematical constant - well-known limit
assert_float(StatMath.ErrorFunctions.erf(10.0)).is_equal_approx(1.0, StatMath.FLOAT_TOLERANCE)

# Scipy-validated precision value
var case: Dictionary = test_data[1]  # erf(1.0) -> 0.84270079
Scipy Call Documentation
const VALUES: Dictionary = {
    "normal_cdf": [  # Generated using: stats.norm.cdf(x, loc=mu, scale=sigma)
        {
            "params": [1.96, 0.0, 1.0],
            "expected": 0.97500210485177963,
            "description": "Standard normal 97.5th percentile"
        }
    ]
}
Error Handling Standards
Error Testing Requirements
- Every error condition must have a dedicated test
- Test both error logging AND return values
- Error messages must match exactly between implementation and test
- Use sentinel values (NAN, null) for invalid operations
Error Message Format
if not (valid_condition):
    push_error("Clear description of what went wrong. Received: %s" % actual_value)
    return NAN
3. Infrastructure & Process
CI/CD Integration and Platform Consistency
Development Workflow
Core Branches
- develop: The main integration branch for new features and bugfixes.
- release: The staging branch for preparing and deploying new releases.

All new code enters the develop branch via Pull Request. The full workflow to release is outlined below:
PR to develop (verify-develop.yaml)
- Trigger: Opening or updating a PR targeting the develop branch.
- Goal: Quickly verify that the changes are structurally sound and don’t break core functionality.
- Actions:
  - Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.
  - Run Core Tests: Executes a subset of the test suite (/tests/core and stat_math_test.gd). This provides a fast feedback loop by focusing on critical unit and integration tests, excluding slower performance tests.
  - Upload Results: Test reports are uploaded as artifacts for inspection.
Merge PR into develop (build-develop.yaml)
- Trigger: Merging a PR into the develop branch.
- Goal: To create a continuously updated, testable development build of the addon.
- Actions:
  - Trigger AWS Runner: Spins up the AWS test runner. (The performance test suites run here, so the hardware needs to be identical across runs.)
  - Run Full Test Suite: Ensures the integrity of the develop branch by running all tests.
  - Commit Test Results: The GDUnit report goes in reports/, performance snapshots in addons/godot-stat-math/tests/performance/results.
  - Create Dev Build: Packages the addons/godot-stat-math directory (including the addons/godot-stat-math/tests folder) into a zip file named godot-stat-math-<version>-dev.zip.
  - Upload Artifact: The development build zip is uploaded as a GitHub artifact. This allows developers and testers to easily download and try out the latest dev build without waiting for an official release.
  - Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.
PR to release (verify-release.yaml)
- Trigger: Opening or updating a PR targeting the release branch.
- Goal: Perform a comprehensive, final check before a new version is released.
- Actions:
  - Check Version Bump: Ensures the project version in project.godot has been semantically versioned upwards.
  - Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.
  - Trigger AWS Runner: Spins up the AWS test runner, since performance tests are also run here.
  - Run Full Test Suite: Executes the entire test suite, including core tests, integration tests, and performance tests (/tests). This is a thorough validation to catch any regressions.
  - Upload Results: Full test reports are uploaded as artifacts, but not committed to the repo.
  - Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.
Merge PR into release (build-release.yaml)
- Trigger: Merging a PR into the release branch.
- Goal: To automatically build and publish a new release on GitHub. (The verify-release tests must pass first.)
- Actions:
  - Extract Release Notes: Pulls the relevant release notes for the new version from CHANGELOG.md.
  - Create Release Build: Packages the addon into a clean, versioned release zip file.
    - The normal release version does not include tests and is named godot-stat-math-<version>.zip.
    - Setting the environment variable RELEASE_DEBUG_DIST = true will also build the development version, which includes all tests and is named godot-stat-math-<version>-debug.zip.
  - Build and Publish Docs: Builds the documentation and publishes it to the gh-pages branch.
Key Benefits:
- Consistent Hardware: Identical AWS instances eliminate platform-specific variations
- Test Result History: All test reports are committed to branches for historical tracking
- Performance Baselines: Automated regression detection with snapshot comparisons
- Isolated Testing: Fresh, identical environments for every test run
Other Workflows:
- Build and Publish Docs (build-and-publish-docs.yaml): Builds the documentation and publishes it to the gh-pages branch.
- Run Tests (run-tests.yaml): Runs the test suite on a self-hosted runner.
- Run Performance Tests (run-performance-tests.yaml): Runs the performance tests on a self-hosted runner.
This infrastructure allows us to focus on mathematical accuracy without worrying about hardware-specific floating-point differences that would otherwise require platform-specific tolerance adjustments.
Testing Workflow
1. Data Generation Process
- Identify missing or inadequate test data in generate_test_data.py
- Add scipy function calls with proper parameter ranges
- Include scipy call documentation in comments
- Generate comprehensive test data covering edge cases
- Validate generated data through spot checks
2. Test Implementation Process
- Review existing test patterns for consistency
- Implement data-driven tests using table lookups
- Eliminate hardcoded expected values (except mathematical constants)
- Use appropriate tolerance constants
- Test both success and error conditions
3. Quality Assurance Process
- Run the full test suite to ensure no regressions
- Verify all tests use data-driven patterns
- Check that scipy documentation is present
- Confirm elimination of anti-patterns
- Validate test coverage and edge cases
4. Performance Testing
Performance Test Infrastructure
Only the build-develop.yaml and verify-release.yaml workflows run performance tests, and they execute on AWS spot instances. This guarantees consistent hardware (t3.medium) and should be the only place that performance baselines are generated.
Baseline Generation:
- Automatic from successful runs: Baselines are automatically updated from up to the 50 most recent successful test runs (pass_*.json files)
- Median-based for robustness: Uses median execution time rather than the mean, making baselines resistant to outliers and more stable
- Statistical metadata included: Each baseline test includes sample size, coefficient of variation, and dynamic threshold calculations
Threshold Determination:
- 95th percentile approach: Thresholds are set so that only 5% of historical runs would be considered “slower” - percentile-based statistical analysis rather than fixed percentages
- Volatility-aware scaling: Different minimum thresholds based on function stability - very stable functions (CV < 5%) get a 12% minimum, high-volatility functions get a 25% minimum
- Fast function compensation: Functions under 0.5 ms get higher thresholds (15% minimum) because small absolute variations create large percentage changes
- Sample size confidence adjustment: Smaller sample sizes get 10-20% higher thresholds due to statistical uncertainty, while mature baselines (25+ samples) get 30% higher tolerance
- Safety buffers and bounds: All thresholds get a 10% safety buffer and are clamped between 8% minimum and 25% maximum to prevent both false positives and missed regressions (a simplified sketch of this logic follows)
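A simplified GDScript sketch of the threshold logic, to make the statistics concrete. This is not the actual performance tooling; the function name, the omission of the sample-size adjustment, and the exact arithmetic are assumptions:

# Illustrative only - a condensed version of the baseline/threshold rules above.
static func compute_regression_threshold(run_times_ms: Array[float]) -> float:
    if run_times_ms.is_empty():
        return 0.25  # no history yet: fall back to the most permissive bound

    var sorted_times: Array[float] = run_times_ms.duplicate()
    sorted_times.sort()

    # Median-based baseline (robust to outliers; upper median for even sample sizes).
    var median_ms: float = sorted_times[int(sorted_times.size() / 2.0)]

    # 95th percentile: only ~5% of historical runs exceeded this value.
    var p95_index: int = int(ceil(0.95 * sorted_times.size())) - 1
    var threshold_pct: float = (sorted_times[p95_index] - median_ms) / median_ms

    # Volatility-aware minimum: stable functions (low CV) get a tighter floor.
    var mean_ms: float = 0.0
    for t: float in sorted_times:
        mean_ms += t
    mean_ms /= sorted_times.size()
    var variance: float = 0.0
    for t: float in sorted_times:
        variance += pow(t - mean_ms, 2)
    var cv: float = sqrt(variance / sorted_times.size()) / mean_ms
    var minimum_pct: float = 0.12 if cv < 0.05 else 0.25

    # Fast functions (< 0.5 ms) get at least a 15% floor.
    if median_ms < 0.5:
        minimum_pct = maxf(minimum_pct, 0.15)

    # 10% safety buffer, clamped to the 8%-25% band.
    threshold_pct = maxf(threshold_pct, minimum_pct) * 1.10
    return clampf(threshold_pct, 0.08, 0.25)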
Known Regression Handling:
- Tests with known performance issues can be marked as regression cases
- Marked tests will not execute until new performance baselines are established
- This prevents false failures while allowing continued development
Baseline Regeneration:
- Major algorithmic changes may require baseline regeneration across all functions
- Run run-tests.yaml multiple times to establish new performance baselines
- The system automatically commits updated baseline data to the repository
This approach maintains performance monitoring while allowing flexibility during development cycles.
Performance Considerations
Test Execution Efficiency
- Use parallel tool execution for information gathering
- Minimize sequential operations in test discovery
- Cache table data appropriately for repeated access
- Avoid redundant calculations in parametrized tests
Memory Management
- Use auto_free() and queue_free() for node cleanup (see the sketch after this list)
- Monitor for orphaned nodes in complex tests
- Clean up temporary resources after test completion
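A minimal GdUnit4-style cleanup sketch; the node usage is arbitrary and only demonstrates the two cleanup calls named above:

func test_with_temporary_helper_node() -> void:
    # auto_free() registers the instance for automatic release when the test ends.
    var helper: Node = auto_free(Node.new())
    assert_that(helper).is_not_null()


func test_with_node_added_to_scene_tree() -> void:
    var child: Node = Node.new()
    add_child(child)
    assert_that(child.is_inside_tree()).is_true()
    # queue_free() defers deletion, preventing orphaned nodes after the test.
    child.queue_free()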
Test Data Governance Process
Minimal Governance Philosophy
Our data governance approach is intentionally minimal, since we are dealing with mathematical constants and relationships that should not change across scipy versions. Mathematical functions like norm.cdf(1.96, 0.0, 1.0) represent universal mathematical truths, not implementation-specific behaviors.
Key Principles:
- Test data represents mathematical constants, not software behavior
- Scipy version changes should not affect mathematical correctness
- Data regeneration is only needed when adding new functions or test cases
- Version tracking provides debugging context, not compatibility requirements
This approach differs from typical software testing where external API changes require extensive data migration and compatibility testing.