ChemFST

ChemFST is a high-performance chemical name search library using Finite State Transducers (FSTs) to provide efficient searches of systematic and trivial names of chemical compounds in milliseconds. It's particularly useful for autocomplete features and searching through large chemical compound databases.

Features

  • Memory-efficient indexing using Finite State Transducers
  • Extremely fast prefix-based searches (autocomplete)
  • Case-insensitive substring searches
  • Memory-mapped file access for optimal performance
  • Simple API with just a few functions

Setup

Prerequisites

  • Rust 1.56.0 or higher
  • Cargo (comes with Rust)

Installation

Add this to your Cargo.toml:

[dependencies]
chemfst = "0.1.0"

Using the Library

Basic Usage

use chemfst::{build_fst_set, load_fst_set, prefix_search, substring_search};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Step 1: Create an index from a list of chemical names (one term per line)
    // Note: The .fst file is generated and not distributed with the package
    // The repository includes a sample data/chemical_names.txt with 32+ chemical names
    let input_path = "data/chemical_names.txt";
    let fst_path = "data/chemical_names.fst";
    build_fst_set(input_path, fst_path)?;

    // Step 2: Load the index into memory
    let set = load_fst_set(fst_path)?;

    // Step 3: Perform searches

    // Prefix search (autocomplete)
    let prefix_results = prefix_search(&set, "acet", 10); // Find up to 10 terms starting with "acet"

    // Substring search
    let substring_results = substring_search(&set, "enz", 10)?; // Find up to 10 terms containing "enz"

    Ok(())
}

API Reference

Functions

build_fst_set(input_path: &str, fst_path: &str) -> Result<(), Box<dyn Error>>

Creates an FST set from a list of chemical names in a text file. The resulting .fst file is generated and not distributed with the package.

  • input_path: Path to a text file with one chemical name per line
  • fst_path: Path where the FST index will be saved

load_fst_set(fst_path: &str) -> Result<Set<Mmap>, Box<dyn Error>>

Loads a previously created FST set from disk using memory mapping.

  • fst_path: Path to the FST index file
  • Returns: A memory-mapped FST Set
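To illustrate what memory mapping means in practice, the following Python sketch maps a file read-only and searches its bytes through the mapping instead of reading the whole file into memory first. It uses only the standard mmap module. The helper name mapped_contains is hypothetical, and this is not how ChemFST's Rust implementation is written.

```python
import mmap
import os
import tempfile


def mapped_contains(path: str, needle: bytes) -> bool:
    """Return True if `needle` occurs in the file at `path`, using a read-only memory map."""
    with open(path, "rb") as handle:
        # Length 0 maps the whole file; mapping an empty file raises ValueError.
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view.find(needle) != -1


# Demo on a throwaway file; the names are sample data only.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "names.txt")
    with open(sample, "wb") as out:
        out.write(b"acetone\nbenzene\n")
    print(mapped_contains(sample, b"benz"))  # True
    print(mapped_contains(sample, b"xyl"))   # False
```

The mapping is closed before the temporary directory is removed, which matters on Windows, where a mapped file cannot be deleted.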

prefix_search(set: &Set<Mmap>, prefix: &str, max_results: usize) -> Vec<String>

Performs a prefix-based search (autocomplete).

  • set: The FST Set to search through
  • prefix: The prefix to search for
  • max_results: Maximum number of results to return
  • Returns: A vector of matching chemical names
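The behavior described above can be mimicked on a plain sorted list, which may help when writing tests or reasoning about expected results. This is a minimal sketch of the semantics only, assuming case-sensitive matching and lexicographic result order; prefix_search_sketch is a hypothetical helper, not part of ChemFST, and it does not use an FST.

```python
from bisect import bisect_left


def prefix_search_sketch(sorted_names: list[str], prefix: str, max_results: int) -> list[str]:
    """Return up to `max_results` names that start with `prefix`.

    `sorted_names` must already be sorted, so all matches form one contiguous run.
    """
    results = []
    start = bisect_left(sorted_names, prefix)  # index of the first name >= prefix
    for name in sorted_names[start:]:
        if not name.startswith(prefix) or len(results) >= max_results:
            break
        results.append(name)
    return results


names = sorted(["acetaminophen", "acetic acid", "acetone", "acetonitrile", "benzene", "benzoic acid"])
print(prefix_search_sketch(names, "acet", 3))  # ['acetaminophen', 'acetic acid', 'acetone']
```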

substring_search(set: &Set<Mmap>, substring: &str, max_results: usize) -> Result<Vec<String>, Box<dyn Error>>

Performs a case-insensitive substring search.

  • set: The FST Set to search through
  • substring: The substring to search for
  • max_results: Maximum number of results to return
  • Returns: A Result containing a vector of matching chemical names, or an error if the search fails
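As a reference for the expected semantics, a case-insensitive substring search with a result cap can be sketched as a linear scan. The function substring_search_sketch is hypothetical and deliberately naive; it says nothing about how ChemFST implements the search internally.

```python
def substring_search_sketch(names: list[str], substring: str, max_results: int) -> list[str]:
    """Return up to `max_results` names containing `substring`, ignoring case."""
    needle = substring.casefold()
    results = []
    for name in names:
        if len(results) >= max_results:
            break
        if needle in name.casefold():
            results.append(name)  # keep the original spelling of the name
    return results


sample = ["Benzene", "benzoic acid", "acetone", "Toluene"]
print(substring_search_sketch(sample, "ENZ", 10))  # ['Benzene', 'benzoic acid']
```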

Development

Project Structure

  • src/lib.rs - Core library functionality
  • src/main.rs - Example binary that demonstrates the library
  • tests/ - Integration tests

Setting Up Development Environment

  1. Clone the repository:

    git clone <repository_url>
    cd chemfst
    
  2. Build the project:

    cargo build
    
  3. Run the example:

    cargo run
    

Running Tests

Run all tests:

cargo test

Adding New Tests

Add new integration tests to the tests/fst_search_tests.rs file or create additional test files in the tests directory.

Continuous Integration

The project uses GitHub Actions for continuous integration and testing across multiple platforms and Python versions.

GitHub Workflows

Rust CI (rust.yml)

  • Platforms: Ubuntu, macOS, Windows
  • Rust versions: stable, beta
  • Features: Build, test, clippy linting, format checking, code coverage

Python CI (python.yml)

  • Platforms: Ubuntu, macOS, Windows
  • Python versions: 3.11, 3.12, 3.13
  • Features:
    • Automated FST file generation from test data
    • Cross-platform testing
    • Example execution validation
    • Code coverage reporting

Local Validation

Before pushing changes, validate the workflow locally:

# Run the validation script
python scripts/validate_workflow.py

This script:

  • Creates test data files
  • Builds the Python package
  • Runs all tests
  • Validates examples work correctly

FST File Generation in CI

FST files are not distributed with the package, so the workflows generate them during CI. Each platform builds the index from the sample data/chemical_names.txt tracked in the repository.

Contributing

Contributions are welcome! Here's how you can contribute:

  1. Fork the repository
  2. Create a new branch (git checkout -b feature/your-feature-name)
  3. Make your changes
  4. Run the tests (cargo test)
  5. Commit your changes (git commit -m 'Add some feature')
  6. Push to the branch (git push origin feature/your-feature-name)
  7. Open a Pull Request

Performance Considerations

  • FST sets are immutable. If your chemical database changes, you'll need to rebuild the index.
  • For large chemical databases, consider building the index as an offline process.
  • Memory-mapped files provide excellent performance but require care when the underlying file changes.
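One way to handle rebuilds without exposing readers to a half-written index is to build into a temporary file and then swap it into place. The sketch below assumes a build callable that takes (input_path, output_path), matching the argument order of the build functions in this documentation; rebuild_index_atomically is a hypothetical helper.

```python
import os
import tempfile
from typing import Callable


def rebuild_index_atomically(
    source_path: str,
    index_path: str,
    build: Callable[[str, str], None],
) -> None:
    """Build a fresh index next to `index_path`, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(index_path))
    # Create the temporary file in the same directory so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(suffix=".fst.tmp", dir=directory)
    os.close(fd)
    try:
        build(source_path, tmp_path)
        # Atomic on POSIX; on Windows this may fail while the old file is still open or mapped.
        os.replace(tmp_path, index_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Processes that already mapped the old index keep seeing the old data until they load the file again, so a rebuild still needs some signal for them to reload.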

License

MIT License

Credits

This project uses the following key dependencies:

  • fst - Finite State Transducer implementation
  • memmap2 - Memory mapping functionality

FST File Setup Guide

Overview

ChemFST uses Finite State Transducer (FST) files for high-performance chemical name searching. FST files (.fst) are not distributed with the package and must be generated from source text files.

Why FST Files Are Not Distributed

  1. Generated Content: FST files are compiled indexes created from source data
  2. Size Considerations: FST files can be large depending on the dataset
  3. Customization: Users typically want to use their own chemical name datasets
  4. License Clarity: Source text files may have different licensing than generated indexes

Required Files

Source File: data/chemical_names.txt

  • Required: Must exist to generate FST files
  • Format: One chemical name per line, UTF-8 encoded
  • Included: The repository includes a sample file with 32+ chemical names
  • Example content (from included file):
    acetaminophen
    acetylsalicylic acid
    acetic acid
    acetone
    acetonitrile
    benzene
    benzoic acid
    ...
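If you supply your own source file, it can help to normalize it first. The underlying fst crate expects keys in lexicographic byte order when a set is built; whether ChemFST's build functions sort and deduplicate the input themselves is not stated here, so the sketch below simply does it up front, which is harmless either way. normalize_names is a hypothetical helper.

```python
def normalize_names(lines: list[str]) -> list[str]:
    """Strip whitespace, drop blank lines and duplicates, and sort by UTF-8 bytes."""
    unique = {line.strip() for line in lines if line.strip()}
    return sorted(unique, key=lambda name: name.encode("utf-8"))


raw = ["benzene\n", "  acetone \n", "\n", "acetone\n", "acetic acid\n"]
print(normalize_names(raw))  # ['acetic acid', 'acetone', 'benzene']
```

Write the result back with one name per line and UTF-8 encoding to match the format described above.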
    

Generated File: data/chemical_names.fst

  • Generated: Created by ChemFST from the source text file
  • Not tracked: Ignored by git (see .gitignore)
  • Not distributed: Must be built locally

Building FST Files

Python API

from chemfst import build_fst

# Build FST index from text file
build_fst("data/chemical_names.txt", "data/chemical_names.fst")

Rust API

use chemfst::build_fst_set;
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    // Build FST index from text file
    build_fst_set("data/chemical_names.txt", "data/chemical_names.fst")?;
    Ok(())
}

Automated Testing

The test suite automatically handles FST file generation:

  1. Source File Check: Tests verify data/chemical_names.txt exists
  2. Auto-Generation: FST files are built automatically during testing if missing
  3. Session Scope: FST generation runs once per test session and the result is reused by all tests for efficiency
  4. Cleanup: Generated files are properly ignored by version control
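The check-then-build steps above can be sketched as a small helper. This is an illustration under assumptions, not the project's actual test code: ensure_index is a hypothetical name, build is any callable taking (input_path, output_path), and the staleness check on modification times goes slightly beyond the "if missing" rule described above.

```python
import os
from typing import Callable


def ensure_index(source_path: str, index_path: str, build: Callable[[str, str], None]) -> bool:
    """Build the index if it is missing or older than its source. Returns True if a build ran."""
    if not os.path.isfile(source_path):
        raise FileNotFoundError(f"Chemical names text file not found at {source_path}")
    fresh = (
        os.path.isfile(index_path)
        and os.path.getmtime(index_path) >= os.path.getmtime(source_path)
    )
    if fresh:
        return False
    build(source_path, index_path)
    return True
```

In a pytest suite, a helper like this would typically be called from a session-scoped fixture so the build happens at most once per run.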

Git Configuration

The repository is configured to ignore FST files:

# Generated FST files (not distributed with package)
data/chemical_names.fst
*.fst

Troubleshooting

Missing Source File

Error: Chemical names text file not found at data/chemical_names.txt

Solution: The repository includes a sample data/chemical_names.txt file. If missing, create the source file with chemical names (one per line) or restore it from version control.

Permission Issues

Error: Cannot write FST file

Solution: Ensure write permissions for the data directory

Empty Results

Error: Search returns no results

Solution: Verify source file contains the expected chemical names
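The three checks above can be combined into one preflight function that reports every problem it finds before a build is attempted. A minimal sketch using only the standard library; diagnose is a hypothetical helper and its messages are illustrative, not ChemFST's own error text.

```python
import os


def diagnose(source_path: str, index_path: str) -> list[str]:
    """Return human-readable problems that could explain the errors above."""
    problems = []
    if not os.path.isfile(source_path):
        problems.append(f"source file not found: {source_path}")
    else:
        with open(source_path, encoding="utf-8") as handle:
            if not any(line.strip() for line in handle):
                problems.append(f"source file contains no names: {source_path}")
    out_dir = os.path.dirname(os.path.abspath(index_path))
    if not os.path.isdir(out_dir):
        problems.append(f"output directory does not exist: {out_dir}")
    elif not os.access(out_dir, os.W_OK):
        problems.append(f"output directory is not writable: {out_dir}")
    return problems
```

An empty list means none of these causes apply. Note that os.access is only an approximation of write permission on Windows.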

Best Practices

  1. Use Existing Data: The repository includes a curated chemical_names.txt file with sample data
  2. Backup Source Files: Keep chemical_names.txt in version control (already included)
  3. Ignore Generated Files: Never commit .fst files to version control
  4. Document Data Sources: Include attribution for chemical name datasets
  5. Test Locally: Always test FST generation before deployment
  6. Automation: Include FST building in deployment scripts
  7. Customize Data: Replace the sample data with your own chemical names as needed

Performance Notes

  • Build Time: FST generation is fast (typically milliseconds for small datasets)
  • Memory Usage: FST files are memory-mapped for efficient loading
  • Search Speed: FST searches are optimized for sub-millisecond response times
  • Preloading: Use the preload functionality for even better performance

Examples

See python/examples/demo.py for a comprehensive example that:

  1. Builds FST from source file
  2. Loads the generated FST
  3. Demonstrates search functionality
  4. Shows performance characteristics

Contributing to ChemFST

Thank you for considering contributing to ChemFST! This document provides guidelines and instructions to help you get started.

Code of Conduct

By participating in this project, you agree to abide by our Code of Conduct. Please read it before contributing.

How to Contribute

Reporting Issues

  • Check if the issue has already been reported.
  • Use the issue template when creating a new issue.
  • Include a clear title and description.
  • Add steps to reproduce the issue and expected behavior.
  • Include version information (Rust version, OS, etc.).

Submitting Changes

  1. Fork the Repository

    • Create your own fork of the repository.
  2. Create a Branch

    • Create a branch for your feature or bugfix: git checkout -b feature/your-feature-name
  3. Make Your Changes

    • Follow the coding style and guidelines.
    • Write tests for new features.
    • Keep commits focused and with clear messages.
  4. Run Tests

    • Ensure all tests pass: cargo test
    • Run the linter: cargo clippy
    • Format your code: cargo fmt
  5. Submit a Pull Request

    • Push your changes to your fork: git push origin feature/your-feature-name
    • Open a pull request against the main branch.
    • Describe your changes and reference any related issues.

Development Guidelines

Code Style

  • Follow the Rust official style guide.
  • Use cargo fmt before committing.
  • Address all clippy warnings.

Testing

  • Write tests for new functionality.
  • Ensure existing tests pass with your changes.
  • Integration tests go in the tests/ directory.
  • Unit tests go within the module they're testing.

Documentation

  • Keep documentation up-to-date.
  • Document all public APIs with doc comments.
  • Include examples where appropriate.

Commit Messages

  • Use clear and meaningful commit messages.
  • Start with a short summary line (50 chars or less).
  • Optionally, follow with a blank line and a more detailed explanation.

Release Process

  1. Version numbers follow Semantic Versioning.
  2. Update CHANGELOG.md with notable changes.
  3. Create a git tag for the new version.
  4. Publish the crate using cargo publish.

Getting Help

If you need help with contributing, feel free to:

  • Open an issue with questions.
  • Reach out to maintainers directly.

Thank you for contributing to ChemFST!

Contributor Covenant Code of Conduct

Our Pledge

We pledge to make participation in our project a harassment-free experience for everyone, and we pledge to act in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of positive behavior:

  • Using welcoming language
  • Respecting different viewpoints
  • Accepting constructive criticism
  • Focusing on community benefit
  • Showing empathy

Examples of unacceptable behavior:

  • Sexualized language or imagery
  • Trolling, insults, and personal attacks
  • Harassment
  • Publishing others' private information
  • Other unprofessional conduct

Enforcement

Project maintainers will enforce these standards. Violations may be reported to project maintainers, who will investigate all complaints.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0.

Changelog

All notable changes to the ChemFST project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[0.1.0] - 2023-12-01

Added

  • Initial project structure with lib.rs and main.rs
  • Core FST (Finite State Transducer) functionality for chemical compound search
  • Prefix search implementation for chemical name autocomplete
  • Substring search implementation for chemical name lookup
  • Memory-mapped file support for efficient loading of large chemical databases
  • Integration tests
  • Basic documentation

CI/CD Documentation for ChemFST

Overview

ChemFST uses GitHub Actions for continuous integration and deployment, ensuring code quality and compatibility across multiple platforms and Python versions.

Workflow Architecture

1. Rust CI Workflow (rust.yml)

Triggers: Push to trunk, Pull Requests to trunk

Matrix Strategy:

  • Operating Systems: Ubuntu, macOS, Windows
  • Rust Versions: stable, beta

Jobs:

  • Build & Test: Compiles and tests Rust code
  • Linting: Runs clippy for code quality
  • Formatting: Checks code formatting with rustfmt
  • Coverage: Generates code coverage reports using tarpaulin

2. Python CI Workflow (python.yml)

Triggers: Push to trunk, Pull Requests to trunk

Matrix Strategy:

  • Operating Systems: Ubuntu, macOS, Windows
  • Python Versions: 3.11, 3.12, 3.13

Jobs:

  • Test: Builds Python bindings and runs pytest
  • Coverage: Generates Python code coverage

Workflow Details

Python Workflow Steps

  1. Environment Setup

    • Checkout code
    • Install Python and Rust toolchains
    • Cache Rust dependencies for faster builds
  2. Dependency Installation

    • Install maturin for Python-Rust bindings
    • Install pytest for testing
  3. Test Data Verification

    • Verify data/chemical_names.txt exists in repository
    • Use existing chemical names from the tracked file
  4. Package Building

    • Use maturin to build Python bindings from Rust code (wheel format)
    • Install built wheel with pip (avoids virtual environment requirement)
  5. Testing

    • Run pytest suite with verbose output
    • Execute example scripts to validate functionality
  6. Coverage (Ubuntu only)

    • Generate coverage reports
    • Upload to Codecov

Key Features

Cross-Platform Compatibility

  • Tests run on Linux, macOS, and Windows
  • Platform-specific commands for file verification
  • Handles path differences across operating systems

FST File Handling

  • Uses existing data/chemical_names.txt from repository
  • FST files are generated during CI, not stored in repository
  • Consistent test data from the tracked source file

Caching Strategy

  • Rust dependencies cached by Cargo.lock hash
  • Reduces build times for subsequent runs
  • Platform-specific cache keys

Test Data Source

The workflows use the chemical names from the tracked data/chemical_names.txt file in the repository. This ensures:

  • Consistency: All platforms use identical test data
  • Single Source of Truth: Chemical names are maintained in one location
  • Easy Updates: Modify the file to change test data across all workflows
  • Version Control: Test data changes are tracked in git history

The file contains a curated list of chemical compounds used for testing all functionality including prefix search, substring search, and performance benchmarks.

Local Development

Pre-commit Validation

Use the validation script before pushing changes:

python scripts/validate_workflow.py

This script replicates the CI environment locally:

  • Creates test data
  • Builds the package
  • Runs tests
  • Validates examples

Manual Testing

# Test Rust components
cargo test

# Build and install Python package
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
pip install dist/*.whl

# Test Python components
pytest python/tests/ -v

# Test examples
python python/examples/demo.py

Coverage Reporting

Rust Coverage

  • Uses cargo-tarpaulin for coverage analysis
  • Generates XML reports for Codecov
  • Runs only on Ubuntu for efficiency

Python Coverage

  • Uses pytest-cov for coverage analysis
  • Covers Python bindings and test code
  • Uploads to Codecov with platform identification

Troubleshooting

Common Issues

FST File Errors

  • Cause: Missing or invalid test data
  • Solution: Verify data/chemical_names.txt exists in repository and has content

Build Failures

  • Cause: Rust toolchain issues or maturin virtual environment errors
  • Solution: Check Rust installation and dependencies, use maturin build + pip install instead of maturin develop

Test Failures

  • Cause: Platform-specific path or command issues
  • Solution: Review platform-specific workflow steps

Debugging Strategies

  1. Check Workflow Logs: Review detailed logs in GitHub Actions
  2. Local Reproduction: Use validation script to reproduce issues
  3. Platform Testing: Test on specific platforms if issues are OS-specific
  4. Dependency Versions: Verify Python and Rust version compatibility

Security Considerations

Secrets Management

  • CODECOV_TOKEN: Used for coverage uploads
  • Stored as GitHub repository secrets
  • Access controlled through GitHub permissions

Dependency Security

  • Automated dependency updates through Dependabot
  • Regular security audits of Rust and Python dependencies
  • Pinned action versions for reproducibility

Maintenance

Updating Dependencies

  • Monitor for new Python versions and add to matrix
  • Update Rust toolchain versions as needed
  • Keep GitHub Actions up to date

Performance Optimization

  • Monitor build times and optimize caching
  • Consider parallel job execution
  • Profile test execution times

Future Enhancements

Planned Improvements

  • Add Windows-specific testing for path handling
  • Implement benchmark regression testing
  • Add documentation generation and deployment
  • Consider adding nightly Rust builds for early issue detection

Monitoring

  • Track build success rates across platforms
  • Monitor test execution times
  • Coverage trend analysis

Windows Compatibility Guide

Overview

ChemFST is fully compatible with Windows environments, including GitHub Actions runners. This document outlines the Windows-specific considerations and implementations.

GitHub Actions Windows Support

Matrix Strategy

The Python CI workflow includes windows-latest in the test matrix:

matrix:
  os: [ubuntu-latest, macos-latest, windows-latest]
  python-version: ["3.11", "3.12", "3.13"]
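GitHub Actions expands a matrix into one job per combination of its values. The following sketch enumerates the combinations for the matrix above, which can be handy when estimating CI load; the variable names are illustrative.

```python
from itertools import product

operating_systems = ["ubuntu-latest", "macos-latest", "windows-latest"]
python_versions = ["3.11", "3.12", "3.13"]

# One job per (os, python-version) pair.
jobs = [
    {"os": os_name, "python-version": version}
    for os_name, version in product(operating_systems, python_versions)
]
print(len(jobs))  # 9
```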

Platform-Specific Implementations

Cache Paths

Unix/Linux/macOS:

path: |
  ~/.cargo/registry
  ~/.cargo/git
  target/
  chemfst-py/target/

Windows:

path: |
  C:\Users\runneradmin\.cargo\registry
  C:\Users\runneradmin\.cargo\git
  target\
  chemfst-py\target\

File Verification Commands

Unix (Bash):

if [ ! -f "data/chemical_names.txt" ]; then
  echo "Error: data/chemical_names.txt not found"
  exit 1
fi
echo "✅ Found existing data/chemical_names.txt"
head -5 data/chemical_names.txt
echo "... ($(wc -l < data/chemical_names.txt) total lines)"

Windows (PowerShell):

if (!(Test-Path "data\chemical_names.txt")) {
  Write-Host "Error: data\chemical_names.txt not found"
  exit 1
}
Write-Host "✅ Found existing data\chemical_names.txt"
Get-Content "data\chemical_names.txt" -Head 5
$lineCount = (Get-Content "data\chemical_names.txt" | Measure-Object -Line).Lines
Write-Host "... ($lineCount total lines)"
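For local use, the same verification can be written once in Python so it behaves identically on every platform. This is a sketch rather than part of the workflow; verify_source is a hypothetical helper.

```python
from pathlib import Path


def verify_source(path: Path, preview: int = 5) -> int:
    """Print the first `preview` lines of `path` and return the total line count."""
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found")
    lines = path.read_text(encoding="utf-8").splitlines()
    print(f"Found existing {path}")
    for line in lines[:preview]:
        print(line)
    print(f"... ({len(lines)} total lines)")
    return len(lines)
```

Calling verify_source(Path("data") / "chemical_names.txt") from the repository root mirrors the shell snippets above. Unlike wc -l, splitlines also counts a final line that lacks a trailing newline.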

Local Windows Development

Prerequisites

  1. Rust Toolchain: Install via rustup.rs
  2. Python 3.11+: Install from python.org
  3. Git: Install from git-scm.com
  4. Visual Studio Build Tools: Required for Rust compilation

Setup Commands

# Clone repository
git clone <repository-url>
cd ChemFST

# Install Python dependencies
python -m pip install --upgrade pip maturin pytest

# Build Python package
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
python -m pip install dist/*.whl

# Run tests
python -m pytest python/tests/ -v

# Run examples
python python/examples/demo.py

PowerShell Setup

# Alternative setup using PowerShell
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
python -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip maturin pytest

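# Build and install package
# (PowerShell does not expand *.whl for native commands, so resolve the wheel path first)
maturin build --manifest-path chemfst-py/Cargo.toml --out dist
python -m pip install (Get-ChildItem dist\*.whl).FullName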

Path Handling

File Separators

  • Windows: Uses backslash \ as path separator
  • Cross-platform: Python's pathlib.Path handles this automatically
  • Workflow: Uses forward slashes / in cross-platform commands

Example Implementation

import platform
from pathlib import Path

# Works on all platforms
data_file = Path("data") / "chemical_names.txt"
fst_file = Path("data") / "chemical_names.fst"

# Platform-specific string representation
if platform.system() == "Windows":
    print(f"Windows path: {data_file}")  # data\chemical_names.txt
else:
    print(f"Unix path: {data_file}")     # data/chemical_names.txt
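To see both string forms on any machine, pathlib's pure path classes can be used; they only manipulate path text and never touch the file system. A small sketch:

```python
from pathlib import PurePosixPath, PureWindowsPath

relative = ("data", "chemical_names.txt")
windows_path = PureWindowsPath(*relative)
posix_path = PurePosixPath(*relative)

print(windows_path)             # data\chemical_names.txt
print(posix_path)               # data/chemical_names.txt
print(windows_path.as_posix())  # data/chemical_names.txt
```

as_posix() is useful when a path must be written with forward slashes regardless of the platform, for example in configuration files.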

Build Considerations

Rust Compilation

  • MSVC: Primary toolchain for Windows builds
  • GNU: Alternative toolchain (less common)
  • Dependencies: May require Visual Studio Build Tools

Python Extension Modules

  • ABI: Windows uses different ABI than Unix systems
  • File Extensions: .pyd files on Windows vs .so on Unix
  • Maturin: Handles cross-platform building automatically

Testing on Windows

Local Validation

# Run the validation script
python scripts/validate_workflow.py

Expected output on Windows:

Operating System: Windows 10
✅ Windows PowerShell commands tested successfully
✅ Workflow should work on windows-latest

GitHub Actions Testing

The workflow automatically tests on windows-latest with:

  • Windows Server 2022
  • PowerShell 5.1 and PowerShell Core
  • MSVC build tools
  • Python 3.11, 3.12, and 3.13

Common Windows Issues

Build Failures

Issue: Missing Visual Studio Build Tools

error: Microsoft Visual C++ 14.0 is required

Solution: Install Visual Studio Build Tools or Visual Studio Community

Issue: Maturin virtual environment error

Couldn't find a virtualenv or conda environment

Solution: Use maturin build + pip install dist/*.whl instead of maturin develop
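Wildcard expansion differs between shells, so a pattern such as dist/*.whl may reach pip unexpanded on Windows. One shell-independent option is to resolve the wheel files in Python and pass explicit paths to pip. A minimal sketch; pip_install_command is a hypothetical helper that only builds the command and does not run it.

```python
import glob
import os
import sys


def pip_install_command(dist_dir: str = "dist") -> list[str]:
    """Build the pip command that installs every wheel found in `dist_dir`."""
    wheels = sorted(glob.glob(os.path.join(dist_dir, "*.whl")))
    if not wheels:
        raise FileNotFoundError(f"no .whl files found in {dist_dir}")
    # sys.executable ensures pip runs under the same interpreter as this script.
    return [sys.executable, "-m", "pip", "install", *wheels]
```

Pass the returned list to subprocess.run(command, check=True) after maturin build has populated the dist directory.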

Path Issues

Issue: Path separator conflicts

FileNotFoundError: [Errno 2] No such file or directory: 'data/chemical_names.txt'

Solution: Use pathlib.Path for cross-platform compatibility

PowerShell Execution Policy

Issue: Script execution disabled

execution of scripts is disabled on this system

Solution:

Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser

Long Path Support

Issue: Path length limitations (260 characters)

Solution: Enable long path support in Windows 10/11:

Computer Configuration > Administrative Templates > System > Filesystem > Enable Win32 long paths

Performance Considerations

File System Performance

  • NTFS: Good performance for FST file operations
  • Windows Defender: May impact build times (consider exclusions)
  • Antivirus: Can slow down file operations

Memory Mapping

  • Windows: Full support for memory-mapped files
  • Performance: Comparable to Unix systems for FST operations
  • Large Files: Windows handles large FST files efficiently

Best Practices

Development Environment

  1. Use Windows Subsystem for Linux (WSL) for Unix-like experience
  2. Consider PowerShell Core for better cross-platform scripting
  3. Use Windows Terminal for improved command-line experience

CI/CD Integration

  1. Test locally on Windows before pushing
  2. Monitor Windows-specific build times
  3. Use platform-specific caching strategies
  4. Handle path separators consistently

Deployment

  1. Test Windows packages thoroughly
  2. Consider Windows-specific packaging requirements
  3. Document Windows-specific installation steps
  4. Provide PowerShell scripts for automation

Troubleshooting

Debug Commands

# Check Python installation
python --version
where python

# Check Rust installation
rustc --version
cargo --version

# Check file exists
Test-Path "data\chemical_names.txt"

# Show file content
Get-Content "data\chemical_names.txt" -Head 10

# Check build tools
where cl.exe

Log Analysis

  • GitHub Actions logs are retained for a limited period (90 days by default; the retention period is configurable)
  • Windows logs may include additional system information
  • PowerShell errors include detailed stack traces

Future Enhancements

Planned Improvements

  • Windows-specific performance optimizations
  • Native Windows installer packages
  • PowerShell module for ChemFST
  • Windows-specific documentation

Compatibility Targets

  • Windows 10 (version 1903+)
  • Windows 11
  • Windows Server 2019+
  • Windows Server 2022