feat: implement incremental indexing and remove proactive rate limit

Major performance improvements for CI builds:

1. **Removed proactive rate limit threshold**
   - No longer waits at 500 remaining requests
   - Uses full 5000 request quota before forced wait
   - Maximizes work per rate limit cycle

2. **Implemented incremental indexing**
   - Checks if repository already exists in database
   - Compares last_commit (pushedAt) to detect changes
   - Only fetches README for new or updated repositories
   - Skips README fetch for unchanged repos (major time savings)

3. **Increased timeout to GitHub maximum**
   - Job timeout: 180m → 360m (6 hours, the per-job limit on GitHub-hosted runners)
   - Script timeout: 170m → 350m
   - Allows full first-run indexing to complete
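
The change-detection step in item 2 can be sketched roughly as follows. This is a minimal illustration, not the code in lib/indexer.js: the function names `needsReadmeFetch` and `planIndexing`, the `fullName` field, and the `Map`-based lookup are all assumptions made for the example. Only `last_commit` and `pushedAt` come from the description above.

```javascript
// Hypothetical helper: decide whether a repository's README must be re-fetched.
// `stored` is the row already in the database (undefined for a new repo);
// `remote` is the repository metadata just returned by the GitHub API.
function needsReadmeFetch(stored, remote) {
  if (!stored) return true;             // new repository: always index
  if (!stored.last_commit) return true; // no recorded timestamp: fetch to be safe
  // Compare as timestamps so that formatting differences do not matter.
  const storedMs = new Date(stored.last_commit).getTime();
  const remoteMs = new Date(remote.pushedAt).getTime();
  return storedMs !== remoteMs;         // any difference counts as a change
}

// Hypothetical helper: split a batch into repos to fetch and repos to skip.
function planIndexing(remoteRepos, storedByName) {
  const toFetch = [];
  const skipped = [];
  for (const repo of remoteRepos) {
    const stored = storedByName.get(repo.fullName);
    if (needsReadmeFetch(stored, repo)) {
      toFetch.push(repo.fullName);
    } else {
      skipped.push(repo.fullName);
    }
  }
  return { toFetch, skipped };
}
```

An unparseable timestamp yields `NaN`, which never equals itself, so such a repository falls into `toFetch` rather than being silently skipped.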

Impact on performance:

**First run (empty database):**
- Same as before: ~25,000 repos need full indexing
- Expected to take most of the 360-minute budget, but should finish within it

**Subsequent runs (incremental):**
- Only fetches READMEs for changed repos (~5-10% typically)
- Dramatically faster: estimated 30-60 minutes instead of 360
- Makes daily automated builds sustainable
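
As a rough back-of-the-envelope check of the numbers above, the sketch below counts README fetches and the number of 5000-request quota cycles they would need. It assumes one request per README and ignores every other API call, so it gives a lower bound rather than a runtime prediction. The function name `readmeFetchEstimate` is made up for this example.

```javascript
// Figures taken from the commit message above.
const TOTAL_REPOS = 25000;
const QUOTA_PER_CYCLE = 5000;

// Hypothetical helper: estimate README fetches for a given changed fraction
// and the minimum number of rate-limit cycles needed to perform them.
function readmeFetchEstimate(totalRepos, changedFraction, quota) {
  const fetches = Math.round(totalRepos * changedFraction);
  return { fetches, quotaCycles: Math.ceil(fetches / quota) };
}

const firstRun = readmeFetchEstimate(TOTAL_REPOS, 1.0, QUOTA_PER_CYCLE);
const lowChurn = readmeFetchEstimate(TOTAL_REPOS, 0.05, QUOTA_PER_CYCLE);
const highChurn = readmeFetchEstimate(TOTAL_REPOS, 0.10, QUOTA_PER_CYCLE);
console.log({ firstRun, lowChurn, highChurn });
```

Under these assumptions a first run needs 25,000 fetches spread over at least five quota cycles, while an incremental run needs 1,250 to 2,500 fetches and fits inside a single cycle.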

Files changed:
- lib/github-api.js: Removed proactive rate limit check
- lib/indexer.js: Added incremental indexing logic
- .github/workflows/build-database.yml: Increased timeout to 360m
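
The reactive rate-limit policy from change 1 might look something like the sketch below. It is an illustration only, not the contents of lib/github-api.js: the function name `rateLimitDelayMs` is hypothetical, and it assumes the client can read GitHub's `x-ratelimit-remaining` and `x-ratelimit-reset` response headers (the latter is a UTC epoch time in seconds) from a plain object with lowercase keys.

```javascript
// Hypothetical helper: how many milliseconds to wait before the next request.
// Reactive policy: wait only once the quota is fully exhausted, instead of
// pausing early at some threshold such as 500 remaining requests.
function rateLimitDelayMs(headers, nowMs = Date.now()) {
  const remaining = Number(headers['x-ratelimit-remaining']);
  const resetSec = Number(headers['x-ratelimit-reset']);
  // Missing or malformed headers: do not block, let the request proceed.
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSec)) return 0;
  if (remaining > 0) return 0; // quota left: keep working
  // Quota exhausted: wait until the reset time (never a negative delay).
  return Math.max(0, resetSec * 1000 - nowMs);
}
```

A caller would typically `await` a timer for the returned delay before issuing the next request. A small safety margin on top of the reset time is a common addition but is left out here.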

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
valknarness
2025-10-28 09:57:02 +01:00
parent 279cc2fa25
commit 98ddac97e8
3 changed files with 20 additions and 28 deletions


@@ -22,7 +22,7 @@ permissions:
 jobs:
 build-database:
 runs-on: ubuntu-latest
-timeout-minutes: 180 # 3 hours max
+timeout-minutes: 360 # 6 hours (GitHub Actions maximum for free tier)
 steps:
 - name: Checkout repository
@@ -60,8 +60,8 @@ jobs:
 INDEX_MODE="${{ github.event.inputs.index_mode || 'full' }}"
 echo "Index mode: $INDEX_MODE"
-# Build the index in non-interactive mode (170m timeout, job timeout is 180m)
-timeout 170m node -e "
+# Build the index in non-interactive mode (350m timeout, job timeout is 360m)
+timeout 350m node -e "
 const db = require('./lib/database');
 const dbOps = require('./lib/db-operations');
 const indexer = require('./lib/indexer');
@@ -96,7 +96,7 @@ jobs:
 " || {
 EXIT_CODE=$?
 if [ $EXIT_CODE -eq 124 ]; then
-echo "❌ Index building timed out after 170 minutes"
+echo "❌ Index building timed out after 350 minutes"
 echo "This may indicate rate limiting issues or too many lists to index"
 fi
 exit $EXIT_CODE