ChemFST
ChemFST is a high-performance chemical name search library using Finite State Transducers (FSTs) to provide efficient searches of systematic and trivial names of chemical compounds in milliseconds. It's particularly useful for autocomplete features and searching through large chemical compound databases.
Features
- Memory-efficient indexing using Finite State Transducers
- Extremely fast prefix-based searches (autocomplete)
- Case-insensitive substring searches
- Memory-mapped file access for optimal performance
- Simple API with just a few functions
Setup
Prerequisites
- Rust 1.56.0 or higher
- Cargo (comes with Rust)
Installation
Add this to your Cargo.toml:
[dependencies]
chemfst = "0.1.0"
Using the Library
Basic Usage
use chemfst::{build_fst_set, load_fst_set, prefix_search, substring_search};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Step 1: Create an index from a list of chemical names (one term per line)
    // Note: The .fst file is generated and not distributed with the package
    // The repository includes a sample data/chemical_names.txt with 32+ chemical names
    let input_path = "data/chemical_names.txt";
    let fst_path = "data/chemical_names.fst";
    build_fst_set(input_path, fst_path)?;

    // Step 2: Load the index into memory
    let set = load_fst_set(fst_path)?;

    // Step 3: Perform searches
    // Prefix search (autocomplete)
    let prefix_results = prefix_search(&set, "acet", 10); // Find up to 10 terms starting with "acet"

    // Substring search
    let substring_results = substring_search(&set, "enz", 10)?; // Find up to 10 terms containing "enz"

    Ok(())
}
API Reference
Functions
build_fst_set(input_path: &str, fst_path: &str) -> Result<(), Box<dyn Error>>
Creates an FST set from a list of chemical names in a text file. The resulting .fst file is generated and not distributed with the package.
- input_path: Path to a text file with one chemical name per line
- fst_path: Path where the FST index will be saved
load_fst_set(fst_path: &str) -> Result<Set<Mmap>, Box<dyn Error>>
Loads a previously created FST set from disk using memory mapping.
- fst_path: Path to the FST index file
- Returns: A memory-mapped FST Set
prefix_search(set: &Set<Mmap>, prefix: &str, max_results: usize) -> Vec<String>
Performs a prefix-based search (autocomplete).
- set: The FST Set to search through
- prefix: The prefix to search for
- max_results: Maximum number of results to return
- Returns: A vector of matching chemical names
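To make the contract of a prefix search concrete, here is a minimal sketch that uses only the standard library. It is not ChemFST's implementation: the hypothetical `prefix_search_sketch` works on a sorted slice instead of an FST, and only mirrors the documented behaviour (names starting with `prefix`, capped at `max_results`).

```rust
/// Hypothetical stand-in for a prefix search over a sorted list of names.
/// Assumes `sorted_names` is in ascending (byte-wise) order.
fn prefix_search_sketch(sorted_names: &[String], prefix: &str, max_results: usize) -> Vec<String> {
    // Binary search for the first name that is not smaller than the prefix.
    let start = sorted_names.partition_point(|name| name.as_str() < prefix);
    sorted_names[start..]
        .iter()
        // Names sharing a prefix are contiguous, so stop at the first mismatch.
        .take_while(|name| name.starts_with(prefix))
        .take(max_results)
        .cloned()
        .collect()
}
```

Because the input is sorted, every name sharing a prefix sits in one contiguous run, which is the same property that lets an ordered index answer autocomplete queries without scanning the whole dataset.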
substring_search(set: &Set<Mmap>, substring: &str, max_results: usize) -> Result<Vec<String>, Box<dyn Error>>
Performs a case-insensitive substring search.
- set: The FST Set to search through
- substring: The substring to search for
- max_results: Maximum number of results to return
- Returns: A vector of matching chemical names
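The following sketch illustrates what "case-insensitive substring search" means in practice. It is a plain linear scan written for illustration only; how ChemFST actually implements `substring_search` is not described here, and `substring_search_sketch` is a hypothetical name.

```rust
/// Hypothetical illustration of a case-insensitive substring match.
/// Both the query and each candidate are lower-cased before comparing.
fn substring_search_sketch(names: &[String], substring: &str, max_results: usize) -> Vec<String> {
    let needle = substring.to_lowercase();
    names
        .iter()
        .filter(|name| name.to_lowercase().contains(&needle))
        .take(max_results)
        .cloned()
        .collect()
}
```

Unlike a prefix query, a substring query cannot jump straight to one contiguous run of keys, which is why it is the more expensive of the two operations in this sketch.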
Development
Project Structure
- src/lib.rs - Core library functionality
- src/main.rs - Example binary that demonstrates the library
- tests/ - Integration tests
Setting Up Development Environment
- Clone the repository:
  git clone <repository_url>
  cd chemfst
- Build the project:
  cargo build
- Run the example:
  cargo run
Running Tests
Run all tests:
cargo test
Adding New Tests
Add new integration tests to the tests/fst_search_tests.rs file or create additional test files in the tests directory.
Continuous Integration
The project uses GitHub Actions for continuous integration and testing across multiple platforms and Python versions.
GitHub Workflows
Rust CI (rust.yml)
- Platforms: Ubuntu, macOS, Windows
- Rust versions: stable, beta
- Features: Build, test, clippy linting, format checking, code coverage
Python CI (python.yml)
- Platforms: Ubuntu, macOS, Windows
- Python versions: 3.11, 3.12, 3.13
- Features:
  - Automated FST file generation from test data
  - Cross-platform testing
  - Example execution validation
  - Code coverage reporting
Local Validation
Before pushing changes, validate the workflow locally:
# Run the validation script
python scripts/validate_workflow.py
This script:
- Creates test data files
- Builds the Python package
- Runs all tests
- Validates examples work correctly
FST File Generation in CI
The workflows automatically create test data files since FST files are not distributed with the package. Each platform creates the required data/chemical_names.txt with sample chemical names for testing.
Contributing
Contributions are welcome! Here's how you can contribute:
- Fork the repository
- Create a new branch (git checkout -b feature/your-feature-name)
- Make your changes
- Run the tests (cargo test)
- Commit your changes (git commit -m 'Add some feature')
- Push to the branch (git push origin feature/your-feature-name)
- Open a Pull Request
Performance Considerations
- FST sets are immutable. If your chemical database changes, you'll need to rebuild the index.
- For large chemical databases, consider building the index as an offline process.
- Memory-mapped files provide excellent performance but require care when the underlying file changes.
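One common way to handle the "rebuild, then replace" cycle is to build the new index under a temporary name and rename it into place once it is complete. The sketch below assumes this pattern suits your deployment; `rebuild_then_swap` and the `.fst.tmp` suffix are hypothetical, and the `build` closure stands in for whatever writes the index (for example a call to `build_fst_set`).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper: build the index at a temporary path, then rename it
/// over the final path so readers never observe a half-written file.
fn rebuild_then_swap<F>(final_path: &Path, build: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    // e.g. data/chemical_names.fst -> data/chemical_names.fst.tmp
    let tmp_path = final_path.with_extension("fst.tmp");
    build(&tmp_path)?;
    // Replaces the destination if it already exists.
    fs::rename(&tmp_path, final_path)
}
```

Processes that already mapped the old file keep working with the old contents until they reload. On some platforms, Windows in particular, replacing a file that is still mapped can fail, so it may be necessary to drop the loaded set before swapping.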
License
Credits
This project uses the following key dependencies:
FST File Setup Guide
Overview
ChemFST uses Finite State Transducer (FST) files for high-performance chemical name searching. FST files (.fst) are not distributed with the package and must be generated from source text files.
Why FST Files Are Not Distributed
- Generated Content: FST files are compiled indexes created from source data
- Size Considerations: FST files can be large depending on the dataset
- Customization: Users typically want to use their own chemical name datasets
- License Clarity: Source text files may have different licensing than generated indexes
Required Files
Source File: data/chemical_names.txt
- Required: Must exist to generate FST files
- Format: One chemical name per line, UTF-8 encoded
- Included: The repository includes a sample file with 32+ chemical names
- Example content (from included file):
  acetaminophen
  acetylsalicylic acid
  acetic acid
  acetone
  acetonitrile
  benzene
  benzoic acid
  ...
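When preparing your own dataset, it can help to clean the source text before indexing. FST builders in general expect keys in lexicographic order without duplicates; whether `build_fst_set` already sorts and deduplicates its input is not stated in this document, so treat the following as optional pre-processing. `normalize_names` is a hypothetical helper, not part of the ChemFST API.

```rust
/// Hypothetical pre-processing step: trim whitespace, drop blank lines,
/// then sort and deduplicate so the output is one unique name per line.
fn normalize_names(raw: &str) -> Vec<String> {
    let mut names: Vec<String> = raw
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect();
    names.sort(); // byte-wise lexicographic order
    names.dedup(); // duplicates are adjacent after sorting
    names
}
```

Joining the result with `"\n"` and writing it back to disk yields a source file in the "one chemical name per line" format described above.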
Generated File: data/chemical_names.fst
- Generated: Created by ChemFST from the source text file
- Not tracked: Ignored by git (see .gitignore)
- Not distributed: Must be built locally
Building FST Files
Python API
from chemfst import build_fst
# Build FST index from text file
build_fst("data/chemical_names.txt", "data/chemical_names.fst")
Rust API
use chemfst::build_fst_set;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Build FST index from text file
    build_fst_set("data/chemical_names.txt", "data/chemical_names.fst")?;
    Ok(())
}
Automated Testing
The test suite automatically handles FST file generation:
- Source File Check: Tests verify data/chemical_names.txt exists
- Auto-Generation: FST files are built automatically during testing if missing
- Session Scope: FST generation is cached across test sessions for efficiency
- Cleanup: Generated files are properly ignored by version control
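The "build if missing" behaviour can be approximated outside the test suite with a small staleness check. This is a sketch under the assumption that file modification times are a good enough signal; `needs_rebuild` is a hypothetical helper and is not how the test suite itself is implemented.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical check: rebuild when the index is missing or older than
/// the source text file it was generated from.
fn needs_rebuild(source: &Path, index: &Path) -> io::Result<bool> {
    if !index.exists() {
        return Ok(true);
    }
    let source_time = fs::metadata(source)?.modified()?;
    let index_time = fs::metadata(index)?.modified()?;
    Ok(index_time < source_time)
}
```

A caller would run the index build only when this returns `true`, which keeps repeated runs cheap while still picking up edits to the source file.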
Git Configuration
The repository is configured to ignore FST files:
# Generated FST files (not distributed with package)
data/chemical_names.fst
*.fst
Troubleshooting
Missing Source File
Error: Chemical names text file not found at data/chemical_names.txt
Solution: The repository includes a sample data/chemical_names.txt file. If missing, create the source file with chemical names (one per line) or restore it from version control.
Permission Issues
Error: Cannot write FST file
Solution: Ensure write permissions for the data directory
Empty Results
Error: Search returns no results
Solution: Verify source file contains the expected chemical names
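Two of the problems above (a missing source file and an unexpectedly empty one) can be caught before building with a quick pre-flight check. The helper below is a hypothetical sketch using only the standard library; it is not part of ChemFST.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical pre-flight check: returns the number of non-blank lines.
/// Fails if the file is missing, unreadable, or not valid UTF-8.
fn count_names(source: &Path) -> io::Result<usize> {
    let text = fs::read_to_string(source)?;
    Ok(text.lines().filter(|line| !line.trim().is_empty()).count())
}
```

A count of zero, or an error, points at the source file rather than at the search code, which narrows down where to look when a query returns nothing.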
Best Practices
- Use Existing Data: The repository includes a curated chemical_names.txt file with sample data
- Backup Source Files: Keep chemical_names.txt in version control (already included)
- Ignore Generated Files: Never commit .fst files to version control
- Document Data Sources: Include attribution for chemical name datasets
- Test Locally: Always test FST generation before deployment
- Automation: Include FST building in deployment scripts
- Customize Data: Replace the sample data with your own chemical names as needed
Performance Notes
- Build Time: FST generation is fast (typically milliseconds for small datasets)
- Memory Usage: FST files are memory-mapped for efficient loading
- Search Speed: FST searches are optimized for sub-millisecond response times
- Preloading: Use the preload functionality for even better performance
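Timings depend heavily on dataset size and hardware, so it is worth measuring on your own data. The generic helper below is a sketch for doing that; `time_search` is a hypothetical name, and the closure can wrap any call, such as a prefix or substring query.

```rust
use std::time::{Duration, Instant};

/// Hypothetical helper: runs a closure once and returns its result
/// together with the wall-clock time it took.
fn time_search<T>(search: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = search();
    (result, start.elapsed())
}
```

A single run is noisy; repeating the measurement and looking at the spread gives a more reliable picture than one number.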
Examples
See python/examples/demo.py for a comprehensive example that:
- Builds FST from source file
- Loads the generated FST
- Demonstrates search functionality
- Shows performance characteristics
Contributing to ChemFST
Thank you for considering contributing to ChemFST! This document provides guidelines and instructions to help you get started.
Code of Conduct
By participating in this project, you agree to abide by our Code of Conduct. Please read it before contributing.
How to Contribute
Reporting Issues
- Check if the issue has already been reported.
- Use the issue template when creating a new issue.
- Include a clear title and description.
- Add steps to reproduce the issue and expected behavior.
- Include version information (Rust version, OS, etc.).
Submitting Changes
- Fork the Repository
  - Create your own fork of the repository.
- Create a Branch
  - Create a branch for your feature or bugfix:
    git checkout -b feature/your-feature-name
- Make Your Changes
  - Follow the coding style and guidelines.
  - Write tests for new features.
  - Keep commits focused and with clear messages.
- Run Tests
  - Ensure all tests pass:
    cargo test
  - Run the linter:
    cargo clippy
  - Format your code:
    cargo fmt
- Submit a Pull Request
  - Push your changes to your fork:
    git push origin feature/your-feature-name
  - Open a pull request against the main branch.
  - Describe your changes and reference any related issues.
Development Guidelines
Code Style
- Follow the Rust official style guide.
- Use cargo fmt before committing.
- Address all clippy warnings.
Testing
- Write tests for new functionality.
- Ensure existing tests pass with your changes.
- Integration tests go in the tests/ directory.
- Unit tests go within the module they're testing.
Documentation
- Keep documentation up-to-date.
- Document all public APIs with doc comments.
- Include examples where appropriate.
Commit Messages
- Use clear and meaningful commit messages.
- Start with a short summary line (50 chars or less).
- Optionally, follow with a blank line and a more detailed explanation.
Release Process
- Version numbers follow Semantic Versioning.
- Update CHANGELOG.md with notable changes.
- Create a git tag for the new version.
- Publish the crate using cargo publish.
Getting Help
If you need help with contributing, feel free to:
- Open an issue with questions.
- Reach out to maintainers directly.
Thank you for contributing to ChemFST!
Contributor Covenant Code of Conduct
Our Pledge
We pledge to make participation in our project a harassment-free experience for everyone, and we pledge to act in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.
Our Standards
Examples of positive behavior:
- Using welcoming language
- Respecting different viewpoints
- Accepting constructive criticism
- Focusing on community benefit
- Showing empathy
Examples of unacceptable behavior:
- Sexualized language or imagery
- Trolling, insults, and personal attacks
- Harassment
- Publishing others' private information
- Other unprofessional conduct
Enforcement
Project maintainers will enforce these standards. Violations may be reported to project maintainers, who will investigate all complaints.
Attribution
This Code of Conduct is adapted from the Contributor Covenant, version 2.0.
Changelog
All notable changes to the ChemFST project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
[0.1.0] - 2023-12-01
Added
- Initial project structure with lib.rs and main.rs
- Core FST (Finite State Transducer) functionality for chemical compound search
- Prefix search implementation for chemical name autocomplete
- Substring search implementation for chemical name lookup
- Memory-mapped file support for efficient loading of large chemical databases
- Integration tests
- Basic documentation
CI/CD Documentation for ChemFST
Overview
ChemFST uses GitHub Actions for continuous integration and deployment, ensuring code quality and compatibility across multiple platforms and Python versions.
Workflow Architecture
1. Rust CI Workflow (rust.yml)
Triggers: Push to trunk, Pull Requests to trunk
Matrix Strategy:
- Operating Systems: Ubuntu, macOS, Windows
- Rust Versions: stable, beta
Jobs:
- Build & Test: Compiles and tests Rust code
- Linting: Runs clippy for code quality
- Formatting: Checks code formatting with rustfmt
- Coverage: Generates code coverage reports using tarpaulin
2. Python CI Workflow (python.yml)
Triggers: Push to trunk, Pull Requests to trunk
Matrix Strategy:
- Operating Systems: Ubuntu, macOS, Windows
- Python Versions: 3.11, 3.12, 3.13
Jobs:
- Test: Builds Python bindings and runs pytest
- Coverage: Generates Python code coverage
Workflow Details
Python Workflow Steps
- Environment Setup
  - Checkout code
  - Install Python and Rust toolchains
  - Cache Rust dependencies for faster builds
- Dependency Installation
  - Install maturin for Python-Rust bindings
  - Install pytest for testing
- Test Data Verification
  - Verify data/chemical_names.txt exists in repository
  - Use existing chemical names from the tracked file
- Package Building
  - Use maturin to build Python bindings from Rust code (wheel format)
  - Install built wheel with pip (avoids virtual environment requirement)
- Testing
  - Run pytest suite with verbose output
  - Execute example scripts to validate functionality
- Coverage (Ubuntu only)
  - Generate coverage reports
  - Upload to Codecov
Key Features
Cross-Platform Compatibility
- Tests run on Linux, macOS, and Windows
- Platform-specific commands for file creation
- Handles path differences across operating systems
FST File Handling
- Uses existing data/chemical_names.txt from repository
- FST files are generated during CI, not stored in repository
- Consistent test data from the tracked source file
Caching Strategy
- Rust dependencies cached by Cargo.lock hash
- Reduces build times for subsequent runs
- Platform-specific cache keys
Test Data Source
The workflows use the chemical names from the tracked data/chemical_names.txt file in the repository. This ensures:
- Consistency: All platforms use identical test data
- Single Source of Truth: Chemical names are maintained in one location
- Easy Updates: Modify the file to change test data across all workflows
- Version Control: Test data changes are tracked in git history
The file contains a curated list of chemical compounds used for testing all functionality including prefix search, substring search, and performance benchmarks.
Local Development
Pre-commit Validation
Use the validation script before pushing changes:
python scripts/validate_workflow.py
This script replicates the CI environment locally:
- Creates test data
- Builds the package
- Runs tests
- Validates examples
Manual Testing
# Test Rust components
cargo test
# Build and install Python package
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
pip install dist/*.whl
# Test Python components
pytest python/tests/ -v
# Test examples
python python/examples/demo.py
Coverage Reporting
Rust Coverage
- Uses cargo-tarpaulin for coverage analysis
- Generates XML reports for Codecov
- Runs only on Ubuntu for efficiency
Python Coverage
- Uses pytest-cov for coverage analysis
- Covers Python bindings and test code
- Uploads to Codecov with platform identification
Troubleshooting
Common Issues
FST File Errors
- Cause: Missing or invalid test data
- Solution: Verify data/chemical_names.txt exists in repository and has content
Build Failures
- Cause: Rust toolchain issues or maturin virtual environment errors
- Solution: Check Rust installation and dependencies; use maturin build + pip install instead of maturin develop
Test Failures
- Cause: Platform-specific path or command issues
- Solution: Review platform-specific workflow steps
Debugging Strategies
- Check Workflow Logs: Review detailed logs in GitHub Actions
- Local Reproduction: Use validation script to reproduce issues
- Platform Testing: Test on specific platforms if issues are OS-specific
- Dependency Versions: Verify Python and Rust version compatibility
Security Considerations
Secrets Management
- CODECOV_TOKEN: Used for coverage uploads
- Stored as GitHub repository secrets
- Access controlled through GitHub permissions
Dependency Security
- Automated dependency updates through Dependabot
- Regular security audits of Rust and Python dependencies
- Pinned action versions for reproducibility
Maintenance
Updating Dependencies
- Monitor for new Python versions and add to matrix
- Update Rust toolchain versions as needed
- Keep GitHub Actions up to date
Performance Optimization
- Monitor build times and optimize caching
- Consider parallel job execution
- Profile test execution times
Future Enhancements
Planned Improvements
- Add Windows-specific testing for path handling
- Implement benchmark regression testing
- Add documentation generation and deployment
- Consider adding nightly Rust builds for early issue detection
Monitoring
- Track build success rates across platforms
- Monitor test execution times
- Coverage trend analysis
Windows Compatibility Guide
Overview
ChemFST is fully compatible with Windows environments, including GitHub Actions runners. This document outlines the Windows-specific considerations and implementations.
GitHub Actions Windows Support
Matrix Strategy
The Python CI workflow includes windows-latest in the test matrix:
matrix:
  os: [ubuntu-latest, macos-latest, windows-latest]
  python-version: ["3.11", "3.12", "3.13"]
Platform-Specific Implementations
Cache Paths
Unix/Linux/macOS:
path: |
  ~/.cargo/registry
  ~/.cargo/git
  target/
  chemfst-py/target/
Windows:
path: |
  C:\Users\runneradmin\.cargo\registry
  C:\Users\runneradmin\.cargo\git
  target\
  chemfst-py\target\
File Verification Commands
Unix (Bash):
if [ ! -f "data/chemical_names.txt" ]; then
  echo "Error: data/chemical_names.txt not found"
  exit 1
fi
echo "✅ Found existing data/chemical_names.txt"
head -5 data/chemical_names.txt
echo "... ($(wc -l < data/chemical_names.txt) total lines)"
Windows (PowerShell):
if (!(Test-Path "data\chemical_names.txt")) {
  Write-Host "Error: data\chemical_names.txt not found"
  exit 1
}
Write-Host "✅ Found existing data\chemical_names.txt"
Get-Content "data\chemical_names.txt" -Head 5
$lineCount = (Get-Content "data\chemical_names.txt" | Measure-Object -Line).Lines
Write-Host "... ($lineCount total lines)"
Local Windows Development
Prerequisites
- Rust Toolchain: Install via rustup.rs
- Python 3.11+: Install from python.org
- Git: Install from git-scm.com
- Visual Studio Build Tools: Required for Rust compilation
Setup Commands
# Clone repository
git clone <repository-url>
cd ChemFST
# Install Python dependencies
python -m pip install --upgrade pip maturin pytest
# Build Python package
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
python -m pip install dist/*.whl
# Run tests
python -m pytest python/tests/ -v
# Run examples
python python/examples/demo.py
PowerShell Setup
# Alternative setup using PowerShell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip maturin pytest
# Build and install package
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
python -m pip install dist/*.whl
Path Handling
File Separators
- Windows: Uses backslash \ as path separator
- Cross-platform: Python's pathlib.Path handles this automatically
- Workflow: Uses forward slashes / in cross-platform commands
Example Implementation
import platform
from pathlib import Path

# Works on all platforms
data_file = Path("data") / "chemical_names.txt"
fst_file = Path("data") / "chemical_names.fst"

# Platform-specific string representation
if platform.system() == "Windows":
    print(f"Windows path: {data_file}")  # data\chemical_names.txt
else:
    print(f"Unix path: {data_file}")  # data/chemical_names.txt
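On the Rust side, the same separator-agnostic approach is available through `std::path`. The sketch below assumes the file names used throughout this guide; `data_paths` is a hypothetical helper, not part of the ChemFST API.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: derive the source and index paths from a data
/// directory without hard-coding a path separator.
fn data_paths(data_dir: &Path) -> (PathBuf, PathBuf) {
    let source = data_dir.join("chemical_names.txt");
    // Same file stem, different extension.
    let index = source.with_extension("fst");
    (source, index)
}
```

`Path::join` inserts the separator appropriate for the platform the program is compiled for, so the same code yields a backslash-separated path on Windows and a slash-separated one elsewhere.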
Build Considerations
Rust Compilation
- MSVC: Primary toolchain for Windows builds
- GNU: Alternative toolchain (less common)
- Dependencies: May require Visual Studio Build Tools
Python Extension Modules
- ABI: Windows uses different ABI than Unix systems
- File Extensions: .pyd files on Windows vs .so on Unix
- Maturin: Handles cross-platform building automatically
Testing on Windows
Local Validation
# Run the validation script
python scripts/validate_workflow.py
Expected output on Windows:
Operating System: Windows 10
✅ Windows PowerShell commands tested successfully
✅ Workflow should work on windows-latest
GitHub Actions Testing
The workflow automatically tests on windows-latest with:
- Windows Server 2022
- PowerShell 5.1 and PowerShell Core
- MSVC build tools
- Python 3.11, 3.12, and 3.13
Common Windows Issues
Build Failures
Issue: Missing Visual Studio Build Tools
error: Microsoft Visual C++ 14.0 is required
Solution: Install Visual Studio Build Tools or Visual Studio Community
Issue: Maturin virtual environment error
Couldn't find a virtualenv or conda environment
Solution: Use maturin build + pip install dist/*.whl instead of maturin develop
Path Issues
Issue: Path separator conflicts
FileNotFoundError: [Errno 2] No such file or directory: 'data/chemical_names.txt'
Solution: Use pathlib.Path for cross-platform compatibility
PowerShell Execution Policy
Issue: Script execution disabled
execution of scripts is disabled on this system
Solution:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
Long Path Support
Issue: Path length limitations (260 characters)
Solution: Enable long path support in Windows 10/11:
Computer Configuration > Administrative Templates > System > Filesystem > Enable Win32 long paths
Performance Considerations
File System Performance
- NTFS: Good performance for FST file operations
- Windows Defender: May impact build times (consider exclusions)
- Antivirus: Can slow down file operations
Memory Mapping
- Windows: Full support for memory-mapped files
- Performance: Comparable to Unix systems for FST operations
- Large Files: Windows handles large FST files efficiently
Best Practices
Development Environment
- Use Windows Subsystem for Linux (WSL) for Unix-like experience
- Consider PowerShell Core for better cross-platform scripting
- Use Windows Terminal for improved command-line experience
CI/CD Integration
- Test locally on Windows before pushing
- Monitor Windows-specific build times
- Use platform-specific caching strategies
- Handle path separators consistently
Deployment
- Test Windows packages thoroughly
- Consider Windows-specific packaging requirements
- Document Windows-specific installation steps
- Provide PowerShell scripts for automation
Troubleshooting
Debug Commands
# Check Python installation
python --version
where python
# Check Rust installation
rustc --version
cargo --version
# Check file exists
Test-Path "data\chemical_names.txt"
# Show file content
Get-Content "data\chemical_names.txt" -Head 10
# Check build tools
where cl.exe
Log Analysis
- GitHub Actions logs are available for 30 days
- Windows logs may include additional system information
- PowerShell errors include detailed stack traces
Future Enhancements
Planned Improvements
- Windows-specific performance optimizations
- Native Windows installer packages
- PowerShell module for ChemFST
- Windows-specific documentation
Compatibility Targets
- Windows 10 (version 1903+)
- Windows 11
- Windows Server 2019+
- Windows Server 2022