Sourcery Integration for Automated Code Review
Objective: Integrate Sourcery GitHub Action for automated code review and fixes.
Description: Set up Sourcery to automatically review Python code changes, suggest improvements, and optionally apply fixes to maintain high code quality standards across the Smart RAG project.
Dependencies: GitHub repository with Actions enabled, SOURCERY_TOKEN secret configured
Details:
- Configure Sourcery GitHub Action workflow
- Set up triggers for push and pull request events
- Configure Python-specific rules and checks
- Enable targeted directory scanning
- Implement security and performance checks
Status: Done
Test Strategy:
# Test workflow syntax
yamllint .github/workflows/sourcery.yml
yamllint .sourcery.yaml
# Push a test branch, then open a pull request from it to trigger Sourcery
git checkout -b test/sourcery-integration
echo "print('test')" > test_sourcery.py
git add test_sourcery.py
git commit -m "Test Sourcery integration"
git push origin test/sourcery-integration
Sourcery Integration Architecture
flowchart TD
subgraph "GitHub Events"
PR[Pull Request]
PUSH[Push to Branch]
end
subgraph "Sourcery Workflow"
TRIGGER[Workflow Triggered]
CHECKOUT[Checkout Code]
PYTHON[Setup Python 3.11]
REVIEW[Run Sourcery Review]
FIX[Apply Fixes]
COMMENT[Comment on PR]
REPORT[Generate Report]
end
subgraph "Code Analysis"
QUALITY[Code Quality Check]
SECURITY[Security Scan]
PERF[Performance Analysis]
ML[ML-Specific Checks]
end
PR --> TRIGGER
PUSH --> TRIGGER
TRIGGER --> CHECKOUT
CHECKOUT --> PYTHON
PYTHON --> REVIEW
REVIEW --> QUALITY
REVIEW --> SECURITY
REVIEW --> PERF
REVIEW --> ML
QUALITY --> FIX
FIX --> COMMENT
COMMENT --> REPORT
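To make the ordering implied by the flowchart explicit, the sketch below copies its edges into a dictionary and derives one valid execution order with the standard library's `graphlib`. The names `EDGES` and `execution_order` are illustrative only; this models the diagram, not the way GitHub Actions schedules steps.

```python
from graphlib import TopologicalSorter

# Edges copied from the flowchart: each key points to the nodes that follow it.
EDGES = {
    "PR": ["TRIGGER"],
    "PUSH": ["TRIGGER"],
    "TRIGGER": ["CHECKOUT"],
    "CHECKOUT": ["PYTHON"],
    "PYTHON": ["REVIEW"],
    "REVIEW": ["QUALITY", "SECURITY", "PERF", "ML"],
    "QUALITY": ["FIX"],
    "FIX": ["COMMENT"],
    "COMMENT": ["REPORT"],
}


def execution_order(edges: dict) -> list:
    """Return one ordering in which every node comes after its predecessors."""
    sorter = TopologicalSorter()
    for source, targets in edges.items():
        for target in targets:
            # add(node, *predecessors): the target depends on the source.
            sorter.add(target, source)
    return list(sorter.static_order())


order = execution_order(EDGES)
print(" -> ".join(order))
```

As drawn, only the QUALITY node feeds the FIX step; SECURITY, PERF, and ML are leaves with no outgoing edge.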
Configuration Files
1. GitHub Actions Workflow (.github/workflows/sourcery.yml)
The workflow is configured to:
- Trigger on pushes to main, develop, feature/, and hotfix/ branches
- Trigger on pull requests (opened, synchronized, reopened)
- Use Python 3.11 to match the project version
- Enable automatic fixes with fix: 'true'
- Request review from PR authors
- Include inline configuration for Python-specific settings
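The trigger rules above can be sketched as a small predicate, which is handy when checking why a branch did or did not start the workflow. This is a minimal illustration, not the workflow file itself: the function names are hypothetical, and because the exact glob spelling used in the workflow is not shown here, plain prefix matching is assumed for the feature/ and hotfix/ branches.

```python
# Branch names and prefixes come from the trigger list above. The exact glob
# spelling in the workflow file is not shown, so prefix matching is an assumption.
PUSH_BRANCHES = {"main", "develop"}
PUSH_PREFIXES = ("feature/", "hotfix/")

# GitHub's activity-type names for the pull request events listed above
# ("synchronize" is how GitHub spells the "synchronized" case).
PR_ACTIONS = {"opened", "synchronize", "reopened"}


def push_triggers(branch: str) -> bool:
    """True if a push to `branch` would start the workflow."""
    return branch in PUSH_BRANCHES or branch.startswith(PUSH_PREFIXES)


def pull_request_triggers(action: str) -> bool:
    """True if a pull request activity of this type would start the workflow."""
    return action in PR_ACTIONS


for name in ("main", "feature/reranker", "test/sourcery-integration"):
    print(name, push_triggers(name))
```

Under these assumptions, pushing a branch such as test/sourcery-integration would not start the workflow on its own; opening a pull request from it would.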
Key features:
- Automatic PR comments with review summary
- Artifact upload for detailed reports
- Custom rules for RAG system patterns
- Security checks for common vulnerabilities
- Performance suggestions for async code and caching
2. Project Configuration (.sourcery.yaml)
Detailed configuration includes:
Code Quality Settings
- Minimum confidence threshold: 0.8
- Maximum complexity: 10
- Maximum method length: 50 lines
- Minimum quality score: 7.5
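To show how these four thresholds might be applied together, here is a hedged sketch. The threshold values are the ones listed above; the `QualityThresholds` class, the `metrics` dictionary keys, and both functions are hypothetical and do not reflect Sourcery's internal data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualityThresholds:
    """Threshold values copied from the settings listed above."""
    min_confidence: float = 0.8
    max_complexity: int = 10
    max_method_length: int = 50
    min_quality_score: float = 7.5


def keep_suggestion(confidence: float, limits: QualityThresholds = QualityThresholds()) -> bool:
    """Keep only suggestions at or above the confidence threshold."""
    return confidence >= limits.min_confidence


def violations(metrics: dict, limits: QualityThresholds = QualityThresholds()) -> list[str]:
    """Return one message per threshold that the given metrics break.

    `metrics` uses hypothetical keys: complexity, method_length, quality_score.
    """
    problems = []
    if metrics["complexity"] > limits.max_complexity:
        problems.append(f"complexity {metrics['complexity']} exceeds {limits.max_complexity}")
    if metrics["method_length"] > limits.max_method_length:
        problems.append(f"method length {metrics['method_length']} exceeds {limits.max_method_length}")
    if metrics["quality_score"] < limits.min_quality_score:
        problems.append(f"quality score {metrics['quality_score']} is below {limits.min_quality_score}")
    return problems


print(violations({"complexity": 12, "method_length": 40, "quality_score": 8.0}))
```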
Path Configuration
include:
- "backend/**/*.py"
- "src/**/*.py"
- "ml/**/*.py"
- "scripts/**/*.py"
exclude:
- "**/migrations/**"
- "**/tests/**"
- "**/__pycache__/**"
- "**/venv/**"
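The combined effect of the include and exclude lists can be checked with a short script. The sketch below assumes gitignore-style glob semantics, where `**/` spans zero or more directories and `*` stays within one path segment; Sourcery's own matching may differ in edge cases. `glob_to_regex` and `is_scanned` are illustrative names.

```python
import re

# Patterns copied from the path configuration above.
INCLUDE = ["backend/**/*.py", "src/**/*.py", "ml/**/*.py", "scripts/**/*.py"]
EXCLUDE = ["**/migrations/**", "**/tests/**", "**/__pycache__/**", "**/venv/**"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob into a regex (assumed gitignore-like semantics)."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:.*/)?")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            parts.append(r".*")  # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append(r"[^/]")  # one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_scanned(path: str) -> bool:
    """A path is scanned if some include pattern matches and no exclude pattern does."""
    included = any(glob_to_regex(p).fullmatch(path) for p in INCLUDE)
    excluded = any(glob_to_regex(p).fullmatch(path) for p in EXCLUDE)
    return included and not excluded


for candidate in ("src/graphrag/rag_system/main.py", "backend/tests/test_api.py"):
    print(candidate, is_scanned(candidate))
```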
Custom Rules
- Structured Logging: Convert f-strings to structured logging
- Async Context Managers: Use async with for resource management
- Vector DB Optimizations: Batch operations for vector insertions
- Security Patterns: Detect hardcoded secrets and SQL injection risks
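As an illustration of the code shape two of these rules push toward, the sketch below combines batched vector insertion with structured logging. It is an assumption-laden example: `store` is a plain list standing in for a vector database client, and `batched`, `insert_vectors`, and the logger name are hypothetical.

```python
import logging
from itertools import islice
from typing import Iterable, Iterator

logger = logging.getLogger("rag.indexer")  # hypothetical logger name


def batched(items: Iterable, size: int) -> Iterator[list]:
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


def insert_vectors(store: list, vectors: Iterable, batch_size: int = 100) -> int:
    """Insert vectors batch by batch instead of one call per vector.

    `store` is a plain list standing in for a vector database client.
    Returns the number of insert calls made.
    """
    calls = 0
    for batch in batched(vectors, batch_size):
        store.extend(batch)  # one simulated round trip per batch
        calls += 1
        # Structured logging: a constant message plus fields, not an f-string.
        logger.info(
            "vector batch inserted",
            extra={"batch_index": calls, "batch_len": len(batch)},
        )
    return calls


store: list = []
print(insert_vectors(store, range(250), batch_size=100), len(store))
```

Keeping the log message constant and passing values as fields lets a log aggregator group and filter events without parsing free text.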
ML/RAG-Specific Checks
- Data leakage detection
- Model versioning validation
- Embedding consistency checks
- Context window limit validation
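Three of these checks are simple enough to sketch as plain functions. The code below is a rough illustration under stated assumptions: all function names are hypothetical, data leakage is reduced to overlapping IDs between two splits, and token counting uses whitespace splitting as a stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Sequence


def mismatched_embeddings(embeddings: Sequence[Sequence[float]], expected_dim: int) -> list[int]:
    """Return the indices of embeddings whose length differs from expected_dim."""
    return [i for i, vector in enumerate(embeddings) if len(vector) != expected_dim]


def leaked_ids(train_ids: Iterable[str], eval_ids: Iterable[str]) -> set[str]:
    """Return IDs that appear in both splits (a simple form of data leakage)."""
    return set(train_ids) & set(eval_ids)


def whitespace_tokens(text: str) -> int:
    """Stand-in token counter; a real system would use the model's tokenizer."""
    return len(text.split())


def fit_context(
    chunks: Sequence[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = whitespace_tokens,
) -> tuple[list[str], int]:
    """Keep retrieved chunks in order until the next one would exceed the budget."""
    kept: list[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break
        kept.append(chunk)
        used += cost
    return kept, used


print(mismatched_embeddings([[0.1, 0.2, 0.3], [0.4, 0.5]], expected_dim=3))
print(fit_context(["a b c", "d e", "f g h i"], max_tokens=5))
```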
Usage Guidelines
For Developers
- Before Committing: Run Sourcery locally
  sourcery review --diff "git diff"
- In Pull Requests:
  - Wait for Sourcery to complete its analysis
  - Review suggested changes in the PR comments
  - Accept or reject automatic fixes
  - Address any security or quality issues
- Configuration Updates:
  - Modify .sourcery.yaml for project-wide settings
  - Update workflow file for CI/CD changes
  - Add custom rules for project-specific patterns
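For developers who want the local review to run from a script or git hook, the sketch below wraps the command given in the guidelines. It only assembles and runs that command; the function names are hypothetical, and the script assumes a `sourcery` executable is installed and already authenticated.

```python
import shlex
import shutil
import subprocess


def build_review_command(diff_command: str = "git diff") -> list[str]:
    """Build the argument list for the local review command from the guidelines."""
    return ["sourcery", "review", "--diff", diff_command]


def run_local_review() -> int:
    """Run the review if `sourcery` is on PATH and return its exit code."""
    command = build_review_command()
    if shutil.which(command[0]) is None:
        # Assumption: a missing CLI should not block the commit.
        print("sourcery not found on PATH; skipping review")
        return 0
    print("Running:", shlex.join(command))
    return subprocess.run(command, check=False).returncode


print(shlex.join(build_review_command()))
```

Calling `run_local_review()` from a hook script and exiting with its return value would let the hook fail whenever the CLI itself exits with a non-zero code.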
For Maintainers
- Monitor Code Quality:
  - Check Sourcery reports in PR artifacts
  - Track quality metrics over time
  - Adjust thresholds based on project needs
- Security Oversight:
  - Review security findings regularly
  - Update security patterns as needed
  - Ensure secrets are never committed
- Performance Optimization:
  - Act on async improvement suggestions
  - Implement caching where recommended
  - Monitor batch operation opportunities
Best Practices
- Code Quality
  - Maintain quality score above 7.5
  - Keep methods under 50 lines
  - Use type hints for all public functions
- Security
  - Never hardcode API keys or secrets
  - Use parameterized queries for database operations
  - Follow OWASP guidelines for web security
- Performance
  - Prefer async/await for I/O operations
  - Use batch operations for database and vector store
  - Implement caching for expensive computations
- RAG-Specific
  - Validate embedding dimensions
  - Check retrieval context limits
  - Monitor vector store performance
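The security and performance practices above can be illustrated with the standard library alone. This is a minimal sketch, not project code: the `documents` table, `find_documents`, and `normalize_query` are hypothetical, and an in-memory SQLite database stands in for the project's real data store.

```python
import sqlite3
from functools import lru_cache


def find_documents(conn: sqlite3.Connection, source: str) -> list[tuple]:
    """Look up documents by source using a bound parameter, never string formatting."""
    cursor = conn.execute(
        "SELECT id, title FROM documents WHERE source = ?", (source,)
    )
    return cursor.fetchall()


@lru_cache(maxsize=1024)
def normalize_query(text: str) -> str:
    """Cache a pure computation; the normalization here stands in for costlier work."""
    return " ".join(text.lower().split())


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, source TEXT)")
# executemany inserts the rows as one batch operation.
conn.executemany(
    "INSERT INTO documents (title, source) VALUES (?, ?)",
    [("Intro", "wiki"), ("Guide", "docs")],
)

# An injection attempt is treated as a literal value and matches nothing.
print(find_documents(conn, "wiki' OR '1'='1"))  # []
print(find_documents(conn, "wiki"))             # [(1, 'Intro')]
```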
Troubleshooting
Common Issues
- Workflow Not Triggering
  - Verify branch names match workflow triggers
  - Check if SOURCERY_TOKEN secret is set
  - Ensure GitHub Actions are enabled
- False Positives
  - Add specific patterns to ignore list
  - Adjust confidence thresholds
  - Use inline comments to suppress warnings
- Performance Impact
  - Reduce parallel jobs if needed
  - Exclude large generated files
  - Enable incremental analysis
Debug Commands
# List the jobs act detects (a workflow that cannot be parsed is reported here)
act -l
# Test workflow locally
act -j sourcery-review
# Check Sourcery configuration
sourcery review --check-config
# Run Sourcery on specific files
sourcery review src/graphrag/rag_system/main.py
Integration Benefits
- Automated Code Review: Consistent code quality checks on every change
- Security Scanning: Early detection of vulnerabilities
- Performance Insights: Suggestions for optimization opportunities
- RAG-Specific Validation: Custom rules for ML/AI patterns
- Team Productivity: Reduced manual review time
Future Enhancements
- Custom Plugins: Develop RAG-specific Sourcery plugins
- Metrics Dashboard: Integrate with monitoring tools
- Auto-merge: Enable for high-confidence fixes
- Learning Mode: Train on project-specific patterns
- IDE Integration: Setup for VS Code and PyCharm