14 Commits

Author SHA1 Message Date
valknarness
99cf83330c feat: make incremental indexing configurable via CLI and workflow
Added full control over incremental indexing behavior:

**CLI Changes:**
- Added `--incremental` flag (default: true)
- Added `--full` flag to disable incremental mode
- Updated interactive prompt to ask about incremental mode

**Function Changes:**
- Updated buildIndex(force, mode, incremental) signature
- Added incremental parameter with default value true
- Conditional logic: if incremental=true, skip unchanged repos; else re-index all
- Added console logging to show incremental mode status

**Workflow Changes:**
- Added `incremental` input (boolean, default: true)
- Passes incremental setting to buildIndex via environment variable
- Defaults to true for scheduled (cron) runs

**Usage Examples:**
```bash
# CLI - incremental mode (default)
./awesome index

# CLI - force full re-index
./awesome index --full

# CLI - explicit incremental
./awesome index --incremental

# Workflow - incremental (default)
gh workflow run build-database.yml

# Workflow - full re-index
gh workflow run build-database.yml -f incremental=false
```

This makes incremental indexing opt-out instead of hardcoded, giving users full control over indexing behavior.
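
As a rough illustration of the precedence described above, the flag handling could be sketched as follows. The helper name `resolveIncremental` and the environment variable name `INCREMENTAL` are assumptions made for this example; the commit only states that the workflow passes the setting through an environment variable.

```javascript
// Hypothetical helper, not the project's actual code.
// Precedence assumed here: --full, then --incremental, then the
// environment variable set by the workflow, then the default (true).
function resolveIncremental(argv, env = {}) {
  if (argv.includes('--full')) return false;
  if (argv.includes('--incremental')) return true;
  // Assumed variable name; workflow inputs arrive as strings.
  if (env.INCREMENTAL !== undefined) return env.INCREMENTAL !== 'false';
  return true;
}
```

With this sketch, `resolveIncremental(['index'])` keeps the default, while `resolveIncremental(['index', '--full'])` turns incremental mode off.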

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 09:59:47 +01:00
valknarness
98ddac97e8 feat: implement incremental indexing and remove proactive rate limit
Major performance improvements for CI builds:

1. **Removed proactive rate limit threshold**
   - No longer waits at 500 remaining requests
   - Uses full 5000 request quota before forced wait
   - Maximizes work per rate limit cycle

2. **Implemented incremental indexing**
   - Checks if repository already exists in database
   - Compares last_commit (pushedAt) to detect changes
   - Only fetches README for new or updated repositories
   - Skips README fetch for unchanged repos (major time savings)

3. **Increased timeout to GitHub maximum**
   - Job timeout: 180m → 360m (6 hours, GitHub free tier max)
   - Script timeout: 170m → 350m
   - Allows full first-run indexing to complete

Impact on performance:

**First run (empty database):**
- Same as before: ~25,000 repos need full indexing
- Expected to use most of the 360-minute budget but should complete

**Subsequent runs (incremental):**
- Only fetches READMEs for changed repos (~5-10% typically)
- Dramatically faster: estimated 30-60 minutes instead of 360
- Makes daily automated builds sustainable
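
The change detection from item 2 above could look roughly like the sketch below. The function names, the `fullName` key, and the shape of the stored row are assumptions for illustration; only the `last_commit` versus `pushedAt` comparison comes from this commit.

```javascript
// Hypothetical sketch of the incremental check, not the code in lib/indexer.js.
// `existing` is the stored database row (or undefined for a new repository),
// `repo` is the metadata returned by the API.
function needsReadmeFetch(existing, repo) {
  if (!existing) return true;                     // new repository
  return existing.last_commit !== repo.pushedAt;  // updated since last index
}

// Split a batch into repositories to fetch and repositories to skip.
// `existingByName` is assumed to be a Map keyed by repo.fullName.
function planIndexing(repos, existingByName) {
  const toFetch = [];
  const skipped = [];
  for (const repo of repos) {
    const existing = existingByName.get(repo.fullName);
    (needsReadmeFetch(existing, repo) ? toFetch : skipped).push(repo);
  }
  return { toFetch, skipped };
}
```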

Files changed:
- lib/github-api.js: Removed proactive rate limit check
- lib/indexer.js: Added incremental indexing logic
- .github/workflows/build-database.yml: Increased timeout to 360m

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 09:57:02 +01:00
valknarness
279cc2fa25 fix: bypass rate limiting for raw.githubusercontent.com requests
CRITICAL FIX: raw.githubusercontent.com does NOT count against GitHub
API rate limits, but the code was treating all requests the same way.

Problem:
- README fetches (~25,000) were going through rateLimitedRequest()
- Added artificial delays, proactive checks, and unnecessary waits
- Build took ~7 hours instead of ~2-3 hours
- Only getRepoInfo() API calls actually count against rate limits

Solution:
1. Created fetchRawContent() function for direct raw content fetches
2. Updated getReadme() to use fetchRawContent()
3. Updated getAwesomeListsIndex() to use fetchRawContent()
4. Reduced workflow timeout: 330m → 180m (3 hours)

Impact:
- Build time: ~7 hours → ~2-3 hours (60% reduction)
- Only ~25K API calls (getRepoInfo) count against 5000/hour limit
- ~25K README fetches are now unrestricted via raw.githubusercontent.com
- Will complete well within GitHub Actions 6-hour free tier limit
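
A minimal sketch of what a direct raw-content fetch might look like, assuming Node 18+ where `fetch` is global. The signature, the 404 handling, the `rawReadmeUrl` helper, and the default branch and file name are assumptions for illustration; the commit names `fetchRawContent()` but does not show its body.

```javascript
// Hypothetical sketch, not the actual lib/github-api.js implementation.
// Fetches a raw file directly, with no rate-limit wrapper around the call.
// `fetchImpl` is injectable so the function can be exercised without network access.
async function fetchRawContent(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  if (response.status === 404) return null; // assumed: missing file is not an error
  if (!response.ok) {
    throw new Error(`Raw fetch failed: ${response.status} ${url}`);
  }
  return response.text();
}

// Build a raw.githubusercontent.com URL. The default branch and
// file name are assumptions; real repositories vary.
function rawReadmeUrl(owner, repo, branch = 'main', file = 'README.md') {
  return `https://raw.githubusercontent.com/${owner}/${repo}/${branch}/${file}`;
}
```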

Files changed:
- lib/github-api.js: Add fetchRawContent(), update getReadme() and
  getAwesomeListsIndex() to use it
- .github/workflows/build-database.yml: Reduce timeout to 180 minutes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-28 06:04:14 +01:00
valknarness
326186f4a8 fix: optimize CI rate limit strategy for batch efficiency
Changes for CI mode (process.env.CI === 'true'):
- Remove grace period (was 10min) to enable continuous monitoring
- Increase check frequency from 1% to 10% to catch low rate limits early
- Raise proactive threshold from 200 to 500 requests
- Increase resume threshold from 100 to 1000 requests

This prevents wasting time on small batches (e.g. 184 requests = 2min
work + 13min wait) by ensuring we work in larger 1000-5000 request
batches for better time efficiency within the 170-minute timeout.

Local mode unchanged: maintains user-friendly behavior with fewer
interruptions.
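
The pause and resume thresholds form a simple hysteresis, which could be sketched as below. The function names and the exact boundary comparisons are assumptions. The CI values (500 and 1000) and the local resume value (4500) come from the commit messages; the local pause value of 200 is inferred from "Local mode unchanged" and is not confirmed.

```javascript
// Hypothetical sketch of the threshold selection, not the project's code.
function rateLimitThresholds(isCI) {
  return isCI
    ? { pauseBelow: 500, resumeAt: 1000 }   // CI values from this commit
    : { pauseBelow: 200, resumeAt: 4500 };  // local: 200 is an inferred value
}

// Given the current state and the remaining request quota,
// decide whether indexing should be running or paused.
function nextState(state, remaining, isCI) {
  const { pauseBelow, resumeAt } = rateLimitThresholds(isCI);
  if (state === 'running' && remaining < pauseBelow) return 'paused';
  if (state === 'paused' && remaining >= resumeAt) return 'running';
  return state;
}
```

Keeping the resume threshold well above the pause threshold is what prevents the small-batch pattern described above: once paused, work only restarts when at least 1000 requests are available in CI.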

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-27 20:52:41 +01:00
valknarness
958192899d fix: use adaptive rate limit threshold for CI vs local
Use MIN_REMAINING_TO_CONTINUE = 100 in CI environments to allow
incremental progress within the 170-minute timeout constraint, while
maintaining 4500 locally for better user experience with fewer
interruptions during indexing.

This fixes the timeout issue where waiting for nearly full rate limit
reset (4500/5000) required ~58 minutes per cycle, causing builds to
exceed the 170-minute timeout after just 3 cycles.
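
The arithmetic behind the timeout can be checked with a few lines. The helper name is made up for this example; the 58-minute and 170-minute figures are the ones quoted above.

```javascript
// Hypothetical helper: how many full wait cycles fit inside the timeout,
// counting wait time only (indexing work would reduce this further).
function waitCyclesThatFit(timeoutMin, waitPerCycleMin) {
  return Math.floor(timeoutMin / waitPerCycleMin);
}

const cyclesThatFit = waitCyclesThatFit(170, 58); // two full waits fit
const thirdCycleTotalMin = 3 * 58;                // 174 minutes, over the 170-minute limit
```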

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-27 17:56:55 +01:00
valknarness
946513fbbb fix: suspend
2025-10-27 03:53:52 +01:00
valknarness
a136b929b0 fix: suspend
2025-10-27 02:36:46 +01:00
valknarness
c0d3ffd328 fix: CI indexing
2025-10-26 22:04:46 +01:00
valknarness
9eb428dd63 fix: cli no user input
2025-10-26 19:12:49 +01:00
valknarness
509795ab82 Fix workflow database initialization and error handling
- Initialize database inside indexer process to ensure connection exists
- Configure GitHub token in same process as indexer
- Make indexer throw errors instead of returning early for CI failure detection
- Remove duplicate token configuration step
- Pass GITHUB_TOKEN as environment variable to build step
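
To illustrate why throwing matters for CI, here is a minimal sketch under assumed names (`requireToken` and `runInCi` are hypothetical and not taken from the repository). An early `return` would let the process exit with status 0, so the workflow step would appear to succeed; a thrown error can be turned into a non-zero exit code.

```javascript
// Hypothetical sketch, not the project's code.
// Fail loudly when the token is missing instead of returning early.
function requireToken(env) {
  const token = env.GITHUB_TOKEN;
  if (!token) throw new Error('GITHUB_TOKEN is not set; cannot index');
  return token;
}

// Run an async task and map success or failure to an exit code.
async function runInCi(task) {
  try {
    await task();
    return 0;
  } catch (err) {
    console.error(`Indexing failed: ${err.message}`);
    return 1;
  }
}
```

A caller could then set `process.exitCode` from the resolved value of `runInCi(...)`, so a failed build marks the workflow step as failed.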

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 14:05:21 +01:00
valknarness
6518cd8e76 Add getIndexStats function to db-operations
This function is required by the GitHub Actions workflow for
gathering database statistics after the build completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-26 14:03:11 +01:00
valknarness
10910d8537 fix: github workflow
2025-10-26 14:00:45 +01:00
valknarness
4cdcc62e15 feat: github workflow
2025-10-26 13:48:23 +01:00
valknarness
700c73bcbf a new start
2025-10-25 15:52:06 +02:00