html2rss is a Ruby gem that generates RSS 2.0 feeds from websites by scraping HTML or JSON content with CSS selectors or auto-detection.
This gem is the core of the html2rss-web application.
| Resource | Description | Link |
|---|---|---|
| Documentation & Feed Directory | Complete guides, tutorials, and browse 100+ pre-built feeds | html2rss.github.io |
| Community Discussions | Get help, share ideas, and connect with other users | GitHub Discussions |
| Project Board | Track development progress and upcoming features | View Project Board |
| Support Development | Help fund ongoing development and maintenance | Sponsor on GitHub |
Quick Start Options:
- New to RSS? → Start with the web application
- Ruby Developer? → Check out the Ruby gem documentation
- Need a specific feed? → Browse the feed directory
- Want to contribute? → See our contributing guide

Key features:
- CSS Selector Support - Extract content using familiar CSS selectors
- Auto-Detection - Automatically detect content using Schema.org, JSON state, and semantic HTML
- Multiple Request Strategies - Faraday for static sites, Browserless for JS-heavy sites
- Post-Processing - Template rendering, HTML sanitization, time parsing, and more
- Comprehensive Testing - 95%+ test coverage with RSpec
- Full Documentation - YARD documentation and comprehensive guides

For installation and usage instructions, please visit the project website.
You can develop html2rss directly in your browser using GitHub Codespaces. The Codespace comes pre-configured with Ruby 3.4, all dependencies, and VS Code extensions.
The full documentation for the html2rss gem is available on the project website.
Please see the contributing guide for details on how to contribute.

Core components:
- Config - Loads and validates configuration (YAML/hash)
- RequestService - Fetches pages using Faraday or Browserless
- Selectors - Extracts content via CSS selectors with extractors/post-processors
- AutoSource - Auto-detects content using Schema.org, JSON state blobs, semantic HTML, and structural patterns
- RssBuilder - Assembles Article objects and renders RSS 2.0
Config -> Request -> Extraction -> Processing -> Building -> Output
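
The flow above can be pictured as a chain of small functions, each handing its result to the next. The following is a minimal sketch under loud assumptions: every constant and lambda name is hypothetical, the HTTP fetch is replaced by a canned string, and none of it reflects the actual html2rss classes or API. It only illustrates how the six stages relate.

```ruby
require "cgi"
require "uri"

# Hypothetical stand-ins for the pipeline stages; NOT html2rss APIs.
Item = Struct.new(:title, :link)

# Config: validate the settings hash.
LOAD_CONFIG = lambda do |config|
  raise ArgumentError, "url is required" unless config[:url]

  config
end

# Request: a canned body replaces the real HTTP fetch in this sketch.
REQUEST = lambda do |config|
  config.merge(body: "First post|/posts/1\nSecond post|/posts/2")
end

# Extraction: turn each "title|path" line of the body into an Item.
EXTRACT = lambda do |ctx|
  items = ctx[:body].lines.map { |line| Item.new(*line.chomp.split("|", 2)) }
  ctx.merge(items: items)
end

# Processing: post-process extracted values (here: make links absolute).
PROCESS = lambda do |ctx|
  ctx[:items].each { |item| item.link = URI.join(ctx[:url], item.link).to_s }
  ctx
end

# Building: render a deliberately minimal RSS 2.0 string.
BUILD = lambda do |ctx|
  items = ctx[:items].map do |item|
    "<item><title>#{CGI.escapeHTML(item.title)}</title>" \
      "<link>#{CGI.escapeHTML(item.link)}</link></item>"
  end
  %(<rss version="2.0"><channel>#{items.join}</channel></rss>)
end

# Proc#>> composes left to right: Config -> Request -> ... -> Building.
PIPELINE = LOAD_CONFIG >> REQUEST >> EXTRACT >> PROCESS >> BUILD

# Output: print the rendered feed.
puts PIPELINE.call({ url: "https://example.com" })
```

Keeping each stage a plain function makes it easy to swap one out, for example a different request strategy, without touching the others.
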
The Browserless request strategy can execute additional page interactions before the HTML is captured. Configure these options in
your feed under the request.browserless.preload key:

    request:
      browserless:
        preload:
          wait_for_network_idle:
            timeout_ms: 5000
          click_selectors:
            - selector: '.load-more'
              max_clicks: 3
              delay_ms: 250
              wait_for_network_idle:
                timeout_ms: 4000
          scroll_down:
            iterations: 5
            delay_ms: 200
            wait_for_network_idle:
              timeout_ms: 3000

- wait_for_network_idle: Waits for the network to become idle before and after preload actions. If no timeout_ms is provided, the default of 5000 ms is used. Browserless exposes this as a timeout wait, so html2rss simply pauses the page for the configured milliseconds to let pending requests finish.
- click_selectors: Repeatedly clicks matching elements (e.g. "Load more") until the element disappears or max_clicks is reached. Provide per-click wait_for_network_idle blocks to avoid racing requests and to stay within Browserless rate limits.
- scroll_down: Scrolls to the bottom of the page. The loop stops early once the document height stops increasing. Combine with wait_for_network_idle or delay_ms to give JavaScript time to append new content.

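
To make the stopping rules concrete, here is a rough sketch of the two loops run against a fake in-memory page. It assumes a lot: FakePage, click_until_gone, scroll_until_stable, and wait_for_idle are hypothetical names invented for this example, and the order of the delay and the idle wait inside each iteration is a guess. It is not the html2rss or Browserless implementation.

```ruby
# Default taken from the description above; the constant name is hypothetical.
DEFAULT_IDLE_TIMEOUT_MS = 5000

# Fake page: the button survives a fixed number of clicks and the document
# grows for a fixed number of scrolls. Waiting only accumulates a counter.
class FakePage
  attr_reader :height, :waited_ms

  def initialize(load_more_clicks:, growth_steps:)
    @remaining_clicks = load_more_clicks
    @growth_steps = growth_steps
    @height = 1000
    @waited_ms = 0
  end

  def present?(_selector)
    @remaining_clicks.positive?
  end

  def click(_selector)
    @remaining_clicks -= 1
  end

  def scroll_to_bottom
    return unless @growth_steps.positive?

    @growth_steps -= 1
    @height += 500
  end

  def wait(milliseconds)
    @waited_ms += milliseconds
  end
end

# Pauses for the configured timeout, or the default when none is given.
def wait_for_idle(page, options)
  return if options.nil?

  page.wait(options.fetch("timeout_ms", DEFAULT_IDLE_TIMEOUT_MS))
end

# Clicks until the element disappears or max_clicks is reached.
def click_until_gone(page, step)
  clicks = 0
  while clicks < step.fetch("max_clicks") && page.present?(step.fetch("selector"))
    page.click(step.fetch("selector"))
    clicks += 1
    page.wait(step["delay_ms"]) if step["delay_ms"]
    wait_for_idle(page, step["wait_for_network_idle"])
  end
  clicks
end

# Scrolls up to `iterations` times; stops once the height stops increasing.
def scroll_until_stable(page, step)
  scrolls = 0
  step.fetch("iterations").times do
    height_before = page.height
    page.scroll_to_bottom
    scrolls += 1
    page.wait(step["delay_ms"]) if step["delay_ms"]
    wait_for_idle(page, step["wait_for_network_idle"])
    break if page.height <= height_before
  end
  scrolls
end

page = FakePage.new(load_more_clicks: 2, growth_steps: 2)
click_step = { "selector" => ".load-more", "max_clicks" => 3, "delay_ms" => 250,
               "wait_for_network_idle" => { "timeout_ms" => 4000 } }
puts click_until_gone(page, click_step) # the button disappears before max_clicks
```

In this sketch the scroll loop always performs one extra scroll after the last growth, because it can only detect that the height stopped increasing by scrolling once more.
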
Each step increases overall runtime. Browserless sessions have execution limits, so favour conservative values for max_clicks,
iterations, and timeouts to prevent premature session termination.
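
One way to choose conservative values is to estimate the worst-case time a preload block can spend waiting. The helper below is a back-of-the-envelope sketch, not part of html2rss: worst_case_preload_ms and idle_ms are hypothetical names, the top-level wait is counted once, and the time spent on the clicks and scrolls themselves is ignored.

```ruby
require "yaml"

# Default idle timeout from the option description; constant name is hypothetical.
IDLE_TIMEOUT_DEFAULT_MS = 5000

# Idle wait for one step: 0 when the key is absent, the default when the
# key is present without a timeout_ms.
def idle_ms(step)
  return 0 unless step.key?("wait_for_network_idle")

  (step["wait_for_network_idle"] || {}).fetch("timeout_ms", IDLE_TIMEOUT_DEFAULT_MS)
end

# Upper bound on waiting time, assuming every click and scroll iteration runs.
def worst_case_preload_ms(preload)
  total = idle_ms(preload) # assumption: the top-level wait is counted once

  preload.fetch("click_selectors", []).each do |click|
    total += click.fetch("max_clicks", 1) * (click.fetch("delay_ms", 0) + idle_ms(click))
  end

  scroll = preload["scroll_down"]
  if scroll
    total += scroll.fetch("iterations", 1) * (scroll.fetch("delay_ms", 0) + idle_ms(scroll))
  end

  total
end

PRELOAD_EXAMPLE = YAML.safe_load(<<~YAML)
  wait_for_network_idle:
    timeout_ms: 5000
  click_selectors:
    - selector: '.load-more'
      max_clicks: 3
      delay_ms: 250
      wait_for_network_idle:
        timeout_ms: 4000
  scroll_down:
    iterations: 5
    delay_ms: 200
    wait_for_network_idle:
      timeout_ms: 3000
YAML

# 5000 + 3 * (250 + 4000) + 5 * (200 + 3000) = 33750
puts worst_case_preload_ms(PRELOAD_EXAMPLE)
```

For the example configuration this gives roughly 34 seconds of waiting in the worst case, which is worth comparing against whatever session limit your Browserless deployment enforces.
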

Testing and development tooling:
- RSpec for comprehensive testing
- 95%+ code coverage with SimpleCov
- VCR for HTTP interaction testing
- RuboCop for code style enforcement
- Reek for code smell detection
- Ruby LSP for IntelliSense and language features
- Debug for modern debugging and exploration
- YARD for documentation generation
- GitHub Actions for CI/CD
This project is licensed under the MIT License - see the LICENSE file for details.
If you find html2rss useful, please consider sponsoring the project.

