Testing Standards
Overview
This document establishes comprehensive testing standards for the Godot Stat Math project, derived from extensive data-driven conversion work and testing best practices. The standards ensure scipy-validated accuracy, maintainability, and consistency across the test suite, and should be treated as living documentation, updated as new patterns emerge and the project evolves.
1. Standards & Philosophy
Core Principles
1. The Dual-Pillar Testing Strategy
Our testing philosophy rests on two equally important pillars: Scipy validation for numerical accuracy and mathematical property testing for theoretical correctness. It is not a choice between them; both are required to ensure robust and reliable functions.
Pillar 1: Data-Driven Scipy Validation (Numerical Accuracy)
- Purpose: To verify that our functions produce numerically precise results that match the industry-standard Scipy library. This ensures our code is correct in practice.
- Method: All Scipy validation tests must be data-driven.
  - Single Source of Truth: All expected values must come from Scipy-generated data stored in /addons/godot-stat-math/tables/.
  - Eliminate Magic Numbers: No hardcoded expected values in test assertions (except for mathematical constants and identities, which belong in property tests).
  - Traceability: All generated table data must include Scipy function call documentation so that scientific accuracy is traceable.
- Example: Validating that normal_cdf(1.96) returns 0.9750021… by looking up the value in a data table generated by scipy.stats.norm.cdf(1.96).
- Location: These tests belong exclusively in scipy_validation_test.gd files.
Pillar 2: Mathematical Property Testing (Theoretical Correctness)
- Purpose: To verify that our implementations correctly follow fundamental mathematical laws, relationships, and identities. This ensures our code is correct in theory.
- Method: Tests assert known truths, such as boundary conditions, symmetries, or relationships between different functions. This is the appropriate place for well-known constants (e.g., erf(0) = 0).
- Example: Verifying that gamma_pdf(x, 1.0, scale) is equivalent to exponential_pdf(x, 1.0/scale), or that CDF(PPF(p)) == p.
- Location: These tests belong exclusively in mathematical_property_test.gd files.
This dual approach guarantees that our statistical library is not only numerically accurate according to scientific standards but also structurally sound based on mathematical principles.
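As a concrete illustration of Pillar 2, here is a minimal property-test sketch. It reuses the StatMath module and tolerance names found elsewhere in this document, but the StatMath.PpfFunctions.normal_ppf location is an assumption and may differ in the actual codebase.

# Sketch of a mathematical_property_test.gd entry (module paths assumed).
func test_normal_cdf_ppf_round_trip() -> void:
    # Property: CDF(PPF(p)) == p for any probability p in (0, 1).
    var probabilities: Array[float] = [0.05, 0.25, 0.5, 0.75, 0.95]
    for p: float in probabilities:
        var x: float = StatMath.PpfFunctions.normal_ppf(p, 0.0, 1.0)
        var round_trip: float = StatMath.CdfFunctions.normal_cdf(x, 0.0, 1.0)
        assert_float(round_trip).is_equal_approx(p, StatMath.CDF_PPF_CONSISTENCY_TOLERANCE)


func test_erf_known_constants() -> void:
    # Well-known identities may stay hardcoded in property tests.
    assert_float(StatMath.ErrorFunctions.erf(0.0)).is_equal_approx(0.0, StatMath.BOUNDARY_TOLERANCE)
    assert_float(StatMath.ErrorFunctions.erfc(0.0)).is_equal_approx(1.0, StatMath.BOUNDARY_TOLERANCE)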
2. Test Organization and Structure
- Semantic Test Names: Name tests after what they validate, not implementation details
- Logical Grouping: Organize tests into clear sections with descriptive comments
- Type Safety: Always use static types and typed assertions (assert_float, assert_int, etc.). Use HelperFunctions.convert_to_float_array() to convert untyped arrays (see the sketch after this list).
- No Redundancy: Eliminate duplicate tests and merge similar functionality
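A short sketch of the type-safety convention, assuming the HelperFunctions.convert_to_float_array() helper referenced above; the untyped input array and its expected mean are purely illustrative:

func test_mean_with_untyped_source_data() -> void:
    # Hypothetical untyped payload, e.g. values parsed from JSON.
    var raw_samples: Array = [1, 2.5, 3, 4.5]
    var samples: Array[float] = HelperFunctions.convert_to_float_array(raw_samples)
    var result: float = StatMath.BasicStats.mean(samples)
    assert_float(result).is_equal_approx(2.75, StatMath.FLOAT_TOLERANCE)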
Data-Driven Standards
Required Table Data Structure
# Standard test data pattern with scipy documentation
const VALUES: Dictionary = {
    "function_name": [  # Generated using: scipy.stats.function_name(params)
        {
            "params": [param1, param2, ...],
            "expected": scipy_calculated_result,
            "description": "Human readable test case description"
        }
    ]
}
Scipy Version Control
All test data files generated by generate_test_data.py include version information for scipy and numpy in their headers:

# Generated with: scipy 1.11.3, numpy 1.24.3

This provides traceability for test data generation and ensures reproducibility when debugging edge cases.
Acceptable Hardcoded Values
✅ KEEP HARDCODED - Mathematical constants and well-known limits:
- Zero values: erf(0.0) = 0.0, erfc(0.0) = 1.0
- Asymptotic limits: erf(10.0) ≈ 1.0, erf(-10.0) ≈ -1.0
- Mathematical constants: sqrt(PI), factorial identities like 0! = 1
- Boundary conditions: uniform_cdf(x) = 0.0 when x < a, = 1.0 when x > b
- Identity relationships: gamma(1.0) = 1.0
❌ MUST CONVERT - Scipy-calculated precision values:
- Complex probability calculations: 0.9750021, 0.8646647, 0.59399415
- Non-obvious mathematical results requiring computation
- Values that appear as “magic numbers” without clear mathematical justification (a before/after sketch of such a conversion follows this list)
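For instance, a hedged before/after sketch of the conversion. The CDF_TEST_DATA table constant matches the pattern used later in this document, but the case index and comment are illustrative:

# ❌ Before: scipy-calculated value hardcoded in the assertion
func test_normal_cdf_hardcoded() -> void:
    var result: float = StatMath.CdfFunctions.normal_cdf(1.96, 0.0, 1.0)
    assert_float(result).is_equal_approx(0.9750021, StatMath.FLOAT_TOLERANCE)


# ✅ After: expected value looked up from the generated table
func test_normal_cdf_scipy_validation() -> void:
    var case: Dictionary = CDF_TEST_DATA.VALUES["normal_cdf"][0]  # normal_cdf(1.96, 0.0, 1.0)
    var result: float = StatMath.CdfFunctions.normal_cdf(case["params"][0], case["params"][1], case["params"][2])
    assert_float(result).is_equal_approx(case["expected"], StatMath.FLOAT_TOLERANCE)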
Exception: Simple, Illustrative Test Data
It is acceptable to hardcode simple data directly within a single test function if its primary purpose is to illustrate a specific behavior, data handling characteristic, or edge case, rather than to validate a complex mathematical result against a scientific standard.
Criteria for “Simple Test Data” Exception:
- No computed expected values: The data is not the result of scientific/mathematical calculations
- Self-evident from context: The data structure and purpose are immediately clear (e.g., [1, 2, 3, 4, 5] for median testing)
- Single test usage: The data is not reused across multiple tests
- Illustrative purpose: The primary goal is demonstrating behavior, not validating mathematical accuracy
Examples of acceptable simple data:

# ✅ Acceptable - simple array for median behavior
func test_median_with_odd_count() -> void:
    var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
    assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)


# ✅ Acceptable - edge case demonstration
func test_empty_array_handling() -> void:
    var empty_data: Array[float] = []
    assert_that(is_nan(StatMath.BasicStats.mean(empty_data))).is_true()
Move to tables when:
- Data is needed for more than one test
- Expected values represent calculated results
- Data represents complex scenarios requiring scipy validation

This approach maintains test readability for simple cases while enforcing data-driven standards for mathematical validation. A sketch of a table-backed version of the median example above follows.
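For example, if the median data above became shared across several tests, it could be promoted to a small table file. The file name and the generation comment below are hypothetical:

# res://addons/godot-stat-math/tables/basic_stats_test_data.gd (hypothetical)
const VALUES: Dictionary = {
    "median": [  # Generated using: numpy.median(data)
        {
            "params": [[1.0, 3.0, 2.0, 5.0, 4.0]],
            "expected": 3.0,
            "description": "Median of five unsorted values"
        }
    ]
}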
Data Generation Requirements
- Scipy Call Documentation: Every function must include the exact scipy call used
  Format: "function_name": [ # Generated using: stats.norm.cdf(x, mu, sigma)
- Comprehensive Coverage: Include all test scenarios needed by actual tests
- Optimized Structure: Design for test consumption, not generation convenience
Tolerance Constants Usage
Tolerance Decision Tree
Use this decision tree to eliminate tolerance selection paralysis:
├── Testing mathematical identity or boundary condition?
│ └── YES → StatMath.BOUNDARY_TOLERANCE (1.0e-10)
│
├── Testing against scipy validation data?
│ ├── Basic statistical functions (mean, variance, etc.)
│ │ └── StatMath.FLOAT_TOLERANCE (1.0e-7)
│ ├── Approximation algorithms (erf, gamma, etc.)
│ │ └── StatMath.ERF_APPROX_TOLERANCE (1.0e-5)
│ ├── Inverse functions (ppf, quantiles)
│ │ └── StatMath.INVERSE_FUNCTION_TOLERANCE
│ └── Probability density/mass functions
│ └── StatMath.PROBABILITY_TOLERANCE (1.0e-6)
│
├── Testing numerical integration or iterative methods?
│ └── StatMath.NUMERICAL_INTEGRATION_TOLERANCE (5.0e-3)
│
└── High-precision mathematical operations?
└── StatMath.HIGH_PRECISION_TOLERANCE (1.0e-9)
Standard Tolerances
- StatMath.FLOAT_TOLERANCE = 1.0e-7 - General floating-point comparisons
- StatMath.HIGH_PRECISION_TOLERANCE = 1.0e-9 - High-precision operations
- StatMath.BOUNDARY_TOLERANCE = 1.0e-10 - Boundary conditions
Specialized Tolerances
- StatMath.ERF_APPROX_TOLERANCE = 1.0e-5 - Error function approximations
- StatMath.PROBABILITY_TOLERANCE = 1.0e-6 - Probability values
- StatMath.NUMERICAL_TOLERANCE = 1.0e-5 - Numerical methods
- StatMath.NUMERICAL_INTEGRATION_TOLERANCE = 5.0e-3 - PDF integration
- StatMath.INVERSE_FUNCTION_TOLERANCE - PPF and inverse functions
- StatMath.CDF_PPF_CONSISTENCY_TOLERANCE - Round-trip consistency tests
Tolerance Selection Guidelines
- Use the decision tree above to eliminate guesswork
- Consider the precision of the underlying numerical method when selecting tolerances
- Document why specific tolerances are chosen for edge cases
- When in doubt, start with the most restrictive tolerance and adjust upward only if tests fail (a short sketch of tolerance selection in practice follows)
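A short sketch of the decision tree in practice. The uniform_cdf signature (x, a, b) and the ERF_TEST_DATA table constant are assumptions used only for illustration:

func test_tolerance_selection_examples() -> void:
    # Boundary condition -> BOUNDARY_TOLERANCE
    assert_float(StatMath.CdfFunctions.uniform_cdf(-1.0, 0.0, 1.0)).is_equal_approx(0.0, StatMath.BOUNDARY_TOLERANCE)

    # Scipy-validated approximation algorithm (erf) -> ERF_APPROX_TOLERANCE
    var case: Dictionary = ERF_TEST_DATA.VALUES["erf"][0]
    assert_float(StatMath.ErrorFunctions.erf(case["params"][0])).is_equal_approx(case["expected"], StatMath.ERF_APPROX_TOLERANCE)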
2. Implementation & Organization
Folder Structure
tests/core/
├── [module_name]/
│   ├── scipy_validation_test.gd        # SCIPY VALIDATION TESTS - DATA-DRIVEN
│   ├── mathematical_property_test.gd   # MATHEMATICAL PROPERTY TESTS
│   └── parameter_validation_test.gd    # PARAMETER VALIDATION TESTS
Implementation Guidelines
File Naming Convention
- scipy_validation_test.gd - Contains all data-driven tests that validate against SciPy reference values
- mathematical_property_test.gd - Contains tests for mathematical properties, relationships, and theoretical behavior
- parameter_validation_test.gd - Contains all input validation and error handling tests
Test Implementation Patterns
Types of Testing Approaches
1. Data-Driven Testing (Primary Approach)
Uses precomputed test cases generated by scipy and stored in /addons/godot-stat-math/tables/:

func test_normal_cdf_scipy_validation() -> void:
    var test_data: Array = CDF_TEST_DATA.VALUES["normal_cdf"]
    var case: Dictionary = test_data[0]
    var result: float = StatMath.CdfFunctions.normal_cdf(case["params"][0], case["params"][1], case["params"][2])
    assert_float(result).is_equal_approx(case["expected"], StatMath.FLOAT_TOLERANCE)
2. Example-Based Testing (Simple Cases Only)
Tests specific input → output pairs using hardcoded values. Only used for the simple cases outlined in the “Simple, Illustrative Test Data” exception above:
func test_median_with_odd_count() -> void:
    var data: Array[float] = [1.0, 3.0, 2.0, 5.0, 4.0]
    assert_float(StatMath.BasicStats.median(data)).is_equal_approx(3.0, StatMath.FLOAT_TOLERANCE)
3. Property-Based Testing (Future Enhancement)
Tests mathematical relationships that should always hold using randomly generated inputs:
func test_cdf_range_property() -> void:
    var rng: RandomNumberGenerator = RandomNumberGenerator.new()
    rng.seed = 12345
    for i in range(100):
        var x: float = rng.randf_range(-10.0, 10.0)
        var mu: float = rng.randf_range(-3.0, 3.0)
        var sigma: float = rng.randf_range(0.1, 2.0)
        var result: float = StatMath.CdfFunctions.normal_cdf(x, mu, sigma)
        assert_float(result).is_between(0.0, 1.0)
Note
Our current test suite has excellent mathematical property testing with fixed examples but lacks true property-based testing with random parameter generation. This enhancement will be undertaken at a later date. See: Issue #10 for details.
Standard Test Function Pattern
func test_function_name_descriptive_case() -> void:
    var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
    var case: Dictionary = test_data[index]  # function_name(params) -> expected
    var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
    assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)
Parametrized Test Pattern
func test_function_comprehensive_validation() -> void:
    var test_data: Array = TEST_DATA_TABLE.VALUES["function_name"]
    for case in test_data:
        var result: float = StatMath.Module.function_name(case["params"][0], case["params"][1])
        assert_float(result).is_equal_approx(case["expected"], StatMath.APPROPRIATE_TOLERANCE)
Error Testing Pattern
func test_function_invalid_parameter() -> void:
    var test_call: Callable = func() -> void:
        StatMath.Module.function_name(invalid_param)

    # Test error logging
    await assert_error(test_call).is_push_error("Expected exact error message")

    # Test sentinel return value
    var result = StatMath.Module.function_name(invalid_param)
    assert_that(is_nan(result)).is_true()  # or assert_that(result).is_null()
File Organization Standards
Data Table Standards
There are two types of data table in /addons/godot-stat-math/tables/, and they must follow these strict conventions:

1. Generated by generate_test_data.py:
# res://addons/godot-stat-math/tables/module_test_data.gd
## Data table containing test values generated by scipy
##
## Generated with: scipy 1.11.3, numpy 1.24.3
const VALUES: Dictionary = {
    "function_name": [  # Generated using: scipy.stats.function_name(params)
        {
            "params": [param1, param2],
            "expected": scipy_result,
            "description": "Descriptive test case"
        }
    ]
}
Documentation Standards
Test Documentation
- Include clear descriptions of what each test validates
- Document the mathematical relationships being tested
- Explain tolerance choices for edge cases
- Reference the scipy functions used for expected values
Code Comments
# Mathematical constant - well-known limit
assert_float(StatMath.ErrorFunctions.erf(10.0)).is_equal_approx(1.0, StatMath.FLOAT_TOLERANCE)

# Scipy-validated precision value
var case: Dictionary = test_data[1]  # erf(1.0) -> 0.84270079
Scipy Call Documentation
const VALUES: Dictionary = {
    "normal_cdf": [  # Generated using: stats.norm.cdf(x, loc=mu, scale=sigma)
        {
            "params": [1.96, 0.0, 1.0],
            "expected": 0.97500210485177963,
            "description": "Standard normal 97.5th percentile"
        }
    ]
}
Error Handling Standards
Error Testing Requirements
- Every error condition must have a dedicated test
- Test both error logging AND return values
- Error messages must match exactly between implementation and test
- Use sentinel values (NAN, null) for invalid operations
Error Message Format
if not (valid_condition):
    push_error("Clear description of what went wrong. Received: %s" % actual_value)
    return NAN
3. Infrastructure & Process
CI/CD Integration and Platform Consistency
Development Workflow
Core Branches
- develop: The main integration branch for new features and bugfixes.
- release: The staging branch for preparing and deploying new releases.

All new code enters the develop branch via Pull Request. The full workflow to release is outlined below:
PR to develop (verify-develop.yaml)
- Trigger: Opening or updating a PR targeting the develop branch.
- Goal: Quickly verify that the changes are structurally sound and don’t break core functionality.
- Actions:
  - Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.
  - Run Core Tests: Executes a subset of the test suite (/tests/core and stat_math_test.gd). This provides a fast feedback loop by focusing on critical unit and integration tests, excluding slower performance tests.
  - Upload Results: Test reports are uploaded as artifacts for inspection.
Merge PR into develop (build-develop.yaml)
- Trigger: Merging a PR into the develop branch.
- Goal: To create a continuously updated, testable development build of the addon.
- Actions:
  - Trigger AWS Runner: Spins up the AWS test runner. (The performance test suites run here, so the hardware needs to be identical across runs.)
  - Run Full Test Suite: Ensures the integrity of the develop branch by running all tests.
  - Commit Test Results: The GDUnit report goes in reports/, performance snapshots in addons/godot-stat-math/tests/performance/results.
  - Create Dev Build: Packages the addons/godot-stat-math directory (including the addons/godot-stat-math/tests folder) into a zip file named godot-stat-math-<version>-dev.zip.
  - Upload Artifact: The development build zip is uploaded as a GitHub artifact. This allows developers and testers to easily download and try out the latest dev build without waiting for an official release.
  - Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.
PR to release (verify-release.yaml)
- Trigger: Opening or updating a PR targeting the release branch.
- Goal: Perform a comprehensive, final check before a new version is released.
- Actions:
  - Check Version Bump: Ensures the project version in project.godot has been semantically versioned upwards.
  - Validate Structure: Checks that essential addon files like plugin.cfg are present and correctly configured.
  - Trigger AWS Runner: Spins up the AWS test runner, since performance tests are also run here.
  - Run Full Test Suite: Executes the entire test suite, including core tests, integration tests, and performance tests (/tests). This is a thorough validation to catch any regressions.
  - Upload Results: Full test reports are uploaded as artifacts, but not committed to the repo.
  - Shutdown AWS Runner: Attempts to shut down the runner via the AWS CLI.
Merge PR into release (build-release.yaml)
- Trigger: Merging a PR into the release branch.
- Goal: To automatically build and publish a new release on GitHub. (The verify-release tests must pass first.)
- Actions:
  - Extract Release Notes: Pulls the relevant release notes for the new version from CHANGELOG.md.
  - Create Release Build: Packages the addon into a clean, versioned release zip file.
    - The normal release version does not include tests and is named godot-stat-math-<version>.zip.
    - Setting the environment variable RELEASE_DEBUG_DIST = true will also build the development version, which includes all tests and is named godot-stat-math-<version>-debug.zip.
  - Build and Publish Docs: Builds the documentation and publishes it to the gh-pages branch.
Key Benefits:
- Consistent Hardware: Identical AWS instances eliminate platform-specific variations
- Test Result History: All test reports are committed to branches for historical tracking
- Performance Baselines: Automated regression detection with snapshot comparisons
- Isolated Testing: Fresh, identical environments for every test run
Other Workflows:
- Build and Publish Docs (build-and-publish-docs.yaml): Builds the documentation and publishes it to the gh-pages branch.
- Run Tests (run-tests.yaml): Runs the test suite on a self-hosted runner.
- Run Performance Tests (run-performance-tests.yaml): Runs the performance tests on a self-hosted runner.
This infrastructure allows us to focus on mathematical accuracy without worrying about hardware-specific floating-point differences that would otherwise require platform-specific tolerance adjustments.
Testing Workflow
1. Data Generation Process
- Identify missing or inadequate test data in generate_test_data.py
- Add scipy function calls with proper parameter ranges
- Include scipy call documentation in comments
- Generate comprehensive test data covering edge cases
- Validate generated data through spot checks
2. Test Implementation Process
- Review existing test patterns for consistency
- Implement data-driven tests using table lookups
- Eliminate hardcoded expected values (except mathematical constants)
- Use appropriate tolerance constants
- Test both success and error conditions
3. Quality Assurance Process
- Run the full test suite to ensure no regressions
- Verify all tests use data-driven patterns
- Check that scipy documentation is present
- Confirm elimination of anti-patterns
- Validate test coverage and edge cases
4. Performance Testing
Performance Test Infrastructure
Only the build-develop.yaml and verify-release.yaml workflows run performance tests, and they execute on AWS spot instances. This guarantees consistent hardware (t3.medium) and should be the only place that performance baselines are generated.
Baseline Generation:
- Automatic from successful runs: Baselines are automatically updated from up to the 50 most recent successful test runs (pass_*.json files)
- Median-based for robustness: Uses median execution time rather than the mean, making baselines resistant to outliers and more stable
- Statistical metadata included: Each baseline test includes sample size, coefficient of variation, and dynamic threshold calculations
Threshold Determination:
- 95th percentile approach: Thresholds are set so that only 5% of historical runs would be considered “slower” - percentile-based statistical analysis rather than fixed percentages
- Volatility-aware scaling: Different minimum thresholds based on function stability - very stable functions (CV < 5%) get a 12% minimum, high-volatility functions get a 25% minimum
- Fast function compensation: Functions under 0.5 ms get higher thresholds (15% minimum) because small absolute variations create large percentage changes
- Sample size confidence adjustment: Smaller sample sizes get 10-20% higher thresholds due to statistical uncertainty, while mature baselines (25+ samples) get 30% higher tolerance
- Safety buffers and bounds: All thresholds get a 10% safety buffer and are clamped between 8% minimum and 25% maximum to prevent both false positives and missed regressions (a simplified sketch of this logic follows)
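A simplified GDScript sketch of the threshold logic, to make the statistics concrete. This is not the actual performance tooling; the function name, the omission of the sample-size adjustment, and the exact arithmetic are assumptions:

# Illustrative only - a condensed version of the baseline/threshold rules above.
static func compute_regression_threshold(run_times_ms: Array[float]) -> float:
    if run_times_ms.is_empty():
        return 0.25  # no history yet: fall back to the most permissive bound

    var sorted_times: Array[float] = run_times_ms.duplicate()
    sorted_times.sort()

    # Median-based baseline (robust to outliers; upper median for even sample sizes).
    var median_ms: float = sorted_times[int(sorted_times.size() / 2.0)]

    # 95th percentile: only ~5% of historical runs exceeded this value.
    var p95_index: int = int(ceil(0.95 * sorted_times.size())) - 1
    var threshold_pct: float = (sorted_times[p95_index] - median_ms) / median_ms

    # Volatility-aware minimum: stable functions (low CV) get a tighter floor.
    var mean_ms: float = 0.0
    for t: float in sorted_times:
        mean_ms += t
    mean_ms /= sorted_times.size()
    var variance: float = 0.0
    for t: float in sorted_times:
        variance += pow(t - mean_ms, 2)
    var cv: float = sqrt(variance / sorted_times.size()) / mean_ms
    var minimum_pct: float = 0.12 if cv < 0.05 else 0.25

    # Fast functions (< 0.5 ms) get at least a 15% floor.
    if median_ms < 0.5:
        minimum_pct = maxf(minimum_pct, 0.15)

    # 10% safety buffer, clamped to the 8%-25% band.
    threshold_pct = maxf(threshold_pct, minimum_pct) * 1.10
    return clampf(threshold_pct, 0.08, 0.25)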
Known Regression Handling:
- Tests with known performance issues can be marked as regression cases
- Marked tests will not execute until new performance baselines are established
- This prevents false failures while allowing continued development
Baseline Regeneration:
- Major algorithmic changes may require baseline regeneration across all functions
- Run run-tests.yaml multiple times to establish new performance baselines
- The system automatically commits updated baseline data to the repository
This approach maintains performance monitoring while allowing flexibility during development cycles.
Performance Considerations
Test Execution Efficiency
- Use parallel tool execution for information gathering
- Minimize sequential operations in test discovery
- Cache table data appropriately for repeated access
- Avoid redundant calculations in parametrized tests
Memory Management
- Use auto_free() and queue_free() for node cleanup (see the sketch after this list)
- Monitor for orphaned nodes in complex tests
- Clean up temporary resources after test completion
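A minimal GdUnit4-style cleanup sketch; the node usage is arbitrary and only demonstrates the two cleanup calls named above:

func test_with_temporary_helper_node() -> void:
    # auto_free() registers the instance for automatic release when the test ends.
    var helper: Node = auto_free(Node.new())
    assert_that(helper).is_not_null()


func test_with_node_added_to_scene_tree() -> void:
    var child: Node = Node.new()
    add_child(child)
    assert_that(child.is_inside_tree()).is_true()
    # queue_free() defers deletion, preventing orphaned nodes after the test.
    child.queue_free()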
Test Data Governance Process
Minimal Governance Philosophy
Our data governance approach is intentionally minimal, since we are dealing with mathematical constants and relationships that should not change across scipy versions. Mathematical functions like norm.cdf(1.96, 0.0, 1.0) represent universal mathematical truths, not implementation-specific behaviors.
Key Principles:
- Test data represents mathematical constants, not software behavior
- Scipy version changes should not affect mathematical correctness
- Data regeneration is only needed when adding new functions or test cases
- Version tracking provides debugging context, not compatibility requirements
This approach differs from typical software testing where external API changes require extensive data migration and compatibility testing.