GitHub - html2rss/html2rss at codex/extend-puppetcommander-with-scroll-and-click-steps

html2rss is a Ruby gem that generates RSS 2.0 feeds from websites by scraping HTML or JSON content with CSS selectors or auto-detection.

This gem is the core of the html2rss-web application.

🌐 Community & Resources

Resource	Description	Link
📚 Documentation & Feed Directory	Complete guides, tutorials, and browse 100+ pre-built feeds	html2rss.github.io
💬 Community Discussions	Get help, share ideas, and connect with other users	GitHub Discussions
📋 Project Board	Track development progress and upcoming features	View Project Board
💖 Support Development	Help fund ongoing development and maintenance	Sponsor on GitHub

Quick Start Options:

New to RSS? → Start with the web application
Ruby Developer? → Check out the Ruby gem documentation
Need a specific feed? → Browse the feed directory
Want to contribute? → See our contributing guide

✨ Features

🎯 CSS Selector Support - Extract content using familiar CSS selectors
🤖 Auto-Detection - Automatically detect content using Schema.org, JSON state, and semantic HTML
🔄 Multiple Request Strategies - Faraday for static sites, Browserless for JS-heavy sites
🛠️ Post-Processing - Template rendering, HTML sanitization, time parsing, and more
🧪 Comprehensive Testing - 95%+ test coverage with RSpec
📚 Full Documentation - YARD documentation and comprehensive guides

🚀 Quick Start

For installation and usage instructions, please visit the project website.

💻 Try in Browser

You can develop html2rss directly in your browser using GitHub Codespaces.

The Codespace comes pre-configured with Ruby 3.4, all dependencies, and VS Code extensions ready to go!

📚 Documentation

The full documentation for the html2rss gem is available on the project website.

🤝 Contributing

Please see the contributing guide for details on how to contribute.

🏗️ Architecture

Core Components

Config - Loads and validates configuration (YAML/hash)
RequestService - Fetches pages using Faraday or Browserless
Selectors - Extracts content via CSS selectors with extractors/post-processors
AutoSource - Auto-detects content using Schema.org, JSON state blobs, semantic HTML, and structural patterns
RssBuilder - Assembles Article objects and renders RSS 2.0

Data Flow

Config -> Request -> Extraction -> Processing -> Building -> Output
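
The flow above can be pictured as a chain of small stages, each handing its result to the next. The following is a toy sketch in plain Ruby and not html2rss's actual implementation: every method name is hypothetical, the "request" returns canned HTML instead of performing HTTP, and a regular expression stands in for CSS selector extraction so the example needs no dependencies.

```ruby
# Toy illustration of Config -> Request -> Extraction -> Processing -> Building -> Output.
# All names are hypothetical; none of this is html2rss's API.

# Request: a stand-in that returns canned HTML instead of fetching config[:url].
def fetch_page(config)
  html = <<~HTML
    <article><h2>  First post </h2></article>
    <article><h2>Second post</h2></article>
  HTML
  { config: config, html: html }
end

# Extraction: a regex stands in for CSS selectors.
def extract_items(ctx)
  titles = ctx[:html].scan(%r{<h2>(.*?)</h2>}).flatten
  ctx.merge(items: titles.map { |title| { title: title } })
end

# Processing: post-process each extracted value (here: trim whitespace).
def process_items(ctx)
  ctx.merge(items: ctx[:items].map { |item| item.merge(title: item[:title].strip) })
end

# Building: assemble a minimal RSS 2.0 document as a string (no XML escaping).
def build_rss(ctx)
  config = ctx[:config]
  items = ctx[:items].map { |item| "    <item><title>#{item[:title]}</title></item>" }
  <<~XML
    <rss version="2.0">
      <channel>
        <title>#{config[:title]}</title>
        <link>#{config[:url]}</link>
        <description>#{config[:description]}</description>
    #{items.join("\n")}
      </channel>
    </rss>
  XML
end

# Compose the stages left to right, mirroring the arrow diagram.
def build_feed(config)
  pipeline = method(:fetch_page) >> method(:extract_items) >>
             method(:process_items) >> method(:build_rss)
  pipeline.call(config)
end

puts build_feed({ url: 'https://example.com/news', title: 'Example News',
                  description: 'Toy feed' })
```

Keeping each stage a function of one context hash makes it easy to swap a single stage (for example, a different request strategy) without touching the others.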

🌐 Browserless Strategy Configuration

The Browserless request strategy can execute additional page interactions before the HTML is captured. Configure these options in your feed under the request.browserless.preload key:

request:
  browserless:
    preload:
      wait_for_network_idle:
        timeout_ms: 5000
      click_selectors:
        - selector: '.load-more'
          max_clicks: 3
          delay_ms: 250
          wait_for_network_idle:
            timeout_ms: 4000
      scroll_down:
        iterations: 5
        delay_ms: 200
        wait_for_network_idle:
          timeout_ms: 3000

wait_for_network_idle – Waits for the network to become idle before and after preload actions. If no timeout_ms is provided the default of 5000 ms is used. Browserless exposes this as a timeout wait, so html2rss simply pauses the page for the configured milliseconds to let pending requests finish.
click_selectors – Repeatedly clicks matching elements (e.g. “Load more”) until the element disappears or max_clicks is reached. Provide per-click wait_for_network_idle blocks to avoid racing requests and to stay within Browserless rate limits.
scroll_down – Scrolls to the bottom of the page. The loop stops early once the document height stops increasing. Combine with wait_for_network_idle or delay_ms to give JavaScript time to append new content.
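
The stopping rules for click_selectors and scroll_down can be sketched against an in-memory stand-in for a page. This is only an illustration of the loop logic described above: FakePage, click_until_gone, and scroll_until_stable are hypothetical names, and the real strategy drives a browser through Browserless rather than a Ruby object.

```ruby
# Hypothetical in-memory page used only to illustrate the stopping rules.
class FakePage
  attr_reader :height

  def initialize(load_more_batches:, scroll_batches:)
    @load_more_batches = load_more_batches # clicks left before the button disappears
    @scroll_batches = scroll_batches       # scrolls that still append content
    @height = 1000
  end

  def element?(_selector)
    @load_more_batches.positive?
  end

  def click(_selector)
    @load_more_batches -= 1
    @height += 500
  end

  def scroll_to_bottom
    return unless @scroll_batches.positive?

    @scroll_batches -= 1
    @height += 800
  end
end

# Click until the element is gone or max_clicks is reached; returns the click count.
def click_until_gone(page, selector:, max_clicks:, delay_ms: 0)
  clicks = 0
  while clicks < max_clicks && page.element?(selector)
    page.click(selector)
    clicks += 1
    sleep(delay_ms / 1000.0) # give the page time to react
  end
  clicks
end

# Scroll up to `iterations` times, stopping early once the height stops growing.
# Returns the number of scrolls performed.
def scroll_until_stable(page, iterations:, delay_ms: 0)
  performed = 0
  iterations.times do
    previous_height = page.height
    page.scroll_to_bottom
    performed += 1
    sleep(delay_ms / 1000.0)
    break if page.height <= previous_height
  end
  performed
end

page = FakePage.new(load_more_batches: 2, scroll_batches: 3)
puts click_until_gone(page, selector: '.load-more', max_clicks: 3) # button is gone after 2 clicks
puts scroll_until_stable(page, iterations: 5)                      # 4th scroll adds nothing, so the loop stops
```

Both loops are bounded twice, by a hard cap and by an observed condition, so a page that never stops loading cannot keep the session open indefinitely.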

Each step increases overall runtime. Browserless sessions have execution limits, so favour conservative values for max_clicks, iterations, and timeouts to prevent premature session termination.
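
To judge whether a configuration is conservative, a rough worst-case estimate of the added runtime can help. The sketch below sums the configured waits for the example configuration shown earlier. It rests on assumptions that are not confirmed against html2rss's implementation: the top-level wait runs twice (before and after the actions), every click and scroll iteration runs to its maximum, each wait takes its full timeout, and a missing max_clicks or iterations counts as 1. The method names are hypothetical.

```ruby
require 'yaml'

DEFAULT_TIMEOUT_MS = 5000 # default stated above for wait_for_network_idle

# Timeout of a step's wait_for_network_idle block, or 0 when the block is absent.
def idle_timeout_ms(step)
  return 0 unless step&.key?('wait_for_network_idle')

  (step['wait_for_network_idle'] || {}).fetch('timeout_ms', DEFAULT_TIMEOUT_MS)
end

# Upper bound in milliseconds under the assumptions named in the lead-in.
def worst_case_preload_ms(preload)
  total = 2 * idle_timeout_ms(preload) # assumed: once before and once after the actions

  Array(preload['click_selectors']).each do |click|
    per_click = click.fetch('delay_ms', 0) + idle_timeout_ms(click)
    total += click.fetch('max_clicks', 1) * per_click # default of 1 is an assumption
  end

  scroll = preload['scroll_down']
  if scroll
    per_scroll = scroll.fetch('delay_ms', 0) + idle_timeout_ms(scroll)
    total += scroll.fetch('iterations', 1) * per_scroll # default of 1 is an assumption
  end

  total
end

config = YAML.safe_load(<<~YAML)
  request:
    browserless:
      preload:
        wait_for_network_idle:
          timeout_ms: 5000
        click_selectors:
          - selector: '.load-more'
            max_clicks: 3
            delay_ms: 250
            wait_for_network_idle:
              timeout_ms: 4000
        scroll_down:
          iterations: 5
          delay_ms: 200
          wait_for_network_idle:
            timeout_ms: 3000
YAML

preload = config.dig('request', 'browserless', 'preload')
puts worst_case_preload_ms(preload) # 2*5000 + 3*(250+4000) + 5*(200+3000) = 38750
```

Comparing this figure with the execution limit of your Browserless plan gives a quick sanity check before deploying a feed.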

🧪 Testing

RSpec for comprehensive testing
95%+ code coverage with SimpleCov
VCR for HTTP interaction testing
RuboCop for code style enforcement
Reek for code smell detection

🔧 Development Tools

Ruby LSP for IntelliSense and language features
Debug for modern debugging and exploration
YARD for documentation generation
GitHub Actions for CI/CD

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

💖 Sponsoring

If you find html2rss useful, please consider sponsoring the project.
