Topic: #crawler

Loot, blog posts, and adjacent themes connected to this topic. Follow the tag to keep it in your orbit.


Anansi: Self-Healing Web Scraper with MCP Server

#web scraping · #mcp · #python · #crawler · #ai agents · #data extraction · #automation
A Python crawler for unstable or JavaScript-heavy sites, with selector healing, structured-data extraction, adaptive rate limiting, and an MCP server for agent-driven crawling. Use it only for authorized scraping.

Anansi is a Python web scraping toolkit designed for sites that change often or need browser rendering. It combines adaptive parsing, structured-data extraction, incremental crawling, proxy support, and an MCP server so an LLM or agent workflow can drive fetch, extract, crawl, pause, resume, export, and metrics actions.

Why it is useful

- Self-healing selectors: stores selector confidence and attempts fallback strategies when a layout changes.
- Structured extraction first: pulls JSON-LD, Open Graph, and Microdata before relying on brittle CSS selectors.
- Browser upgrade path: can switch from HTTP fetching to Playwright rendering for JavaScript-heavy pages.
- Crawler durability: includes an async crawler, a SQLite-backed queue, incremental recrawls, ETag/Last-Modified handling, and resumable jobs.
- Agent-ready interface: ships with an MCP server so compatible LLM tools can operate crawls through tool calls.

Best fit

Use Anansi when you need a resilient research or data-extraction crawler for websites you are allowed to access, especially where pages change structure or require JavaScript rendering. It is most relevant for developers building data pipelines, monitoring workflows, competitive research dashboards, or agentic browsing systems.

Quick evaluation checklist

- Confirm the target website permits your intended crawling use case.
- Start with structured data extraction before writing custom selectors.
- Enable browser rendering only where plain HTTP fetching is insufficient.
- Keep adaptive rate limiting active and respect Retry-After responses.
- Use the MCP server when you want an agent to orchestrate crawl tasks instead of manually scripting every step.
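The "self-healing selectors" idea can be sketched without knowing Anansi's internals: keep several candidate extractors per field, try them in confidence order, and adjust scores as they succeed or fail, so a selector that breaks after a redesign is demoted and a working fallback rises to the top. Everything below (class name, weights, scoring rules) is illustrative, not the library's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class HealingExtractor:
    """Tries candidate extractors in confidence order and adapts.

    Each candidate is a function that takes raw HTML and returns the
    extracted value, or None when its selector no longer matches.
    Scores move up on success and down on failure, so failing
    candidates sink below working fallbacks over time.
    """
    candidates: dict = field(default_factory=dict)   # name -> extractor fn
    confidence: dict = field(default_factory=dict)   # name -> score in [0, 1]

    def register(self, name, fn, score=1.0):
        self.candidates[name] = fn
        self.confidence[name] = score

    def extract(self, html):
        # Most trusted candidate first.
        for name in sorted(self.candidates, key=lambda n: -self.confidence[n]):
            value = self.candidates[name](html)
            if value is not None:
                self.confidence[name] = min(1.0, self.confidence[name] + 0.1)
                return value
            # Penalize a miss so fallbacks get tried earlier next time.
            self.confidence[name] = max(0.0, self.confidence[name] - 0.25)
        return None
```

A real implementation would persist the scores (Anansi's listing mentions stored selector confidence) and generate fallbacks automatically; the sketch only shows the try-score-fallback loop.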
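The "structured extraction first" bullet is a good default with any toolkit: JSON-LD embedded in a page usually survives redesigns far better than CSS selectors. A minimal stdlib-only sketch of the idea (class and function names are mine, not Anansi's):

```python
import json
from html.parser import HTMLParser

class JsonLdCollector(HTMLParser):
    """Collects the text inside <script type="application/ld+json"> blocks."""
    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self.blocks.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld:
            self.blocks[-1] += data

def extract_structured(html):
    """Return parsed JSON-LD objects; an empty list tells the caller it
    must fall back to CSS selectors (or Open Graph / Microdata)."""
    parser = JsonLdCollector()
    parser.feed(html)
    out = []
    for block in parser.blocks:
        try:
            out.append(json.loads(block))
        except json.JSONDecodeError:
            pass  # malformed block: skip rather than fail the page
    return out
```

Only when this returns nothing does it make sense to reach for per-site selectors, which is the ordering the listing describes.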
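On "keep adaptive rate limiting active and respect Retry-After": a common shape for this is additive-decrease on success and multiplicative backoff on 429/503, never undercutting an explicit Retry-After header. A sketch with illustrative defaults (not Anansi's actual parameters):

```python
import time

class AdaptiveLimiter:
    """Politeness delay that speeds up slowly and backs off hard.

    record() adjusts the delay from response status codes; wait() sleeps
    before the next request. On 429/503 the delay doubles, but never
    below a server-supplied Retry-After value.
    """
    def __init__(self, base_delay=1.0, min_delay=0.25, max_delay=60.0):
        self.delay = base_delay
        self.min_delay = min_delay
        self.max_delay = max_delay

    def wait(self, sleep=time.sleep):
        sleep(self.delay)

    def record(self, status, retry_after=None):
        if status in (429, 503):
            # Honor Retry-After when present; otherwise double the delay.
            self.delay = min(self.max_delay,
                             max(self.delay * 2, float(retry_after or 0)))
        elif 200 <= status < 300:
            # Gently recover toward the floor after successes.
            self.delay = max(self.min_delay, self.delay * 0.9)
```

The caller would parse Retry-After from the response (it can be seconds or an HTTP date; only the seconds form is handled here) and call `record()` after every fetch.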
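The "incremental recrawls, ETag/Last-Modified handling" bullet boils down to conditional GETs: store each URL's validators, send them back as If-None-Match / If-Modified-Since, and treat a 304 as "reuse the stored copy". A sketch where an in-memory dict stands in for Anansi's SQLite-backed store:

```python
class RevisitCache:
    """Stores HTTP validators per URL and builds conditional headers."""
    def __init__(self):
        self._validators = {}  # url -> {"etag": ..., "last_modified": ...}

    def request_headers(self, url):
        """Headers to attach to the next fetch of this URL."""
        v = self._validators.get(url, {})
        headers = {}
        if v.get("etag"):
            headers["If-None-Match"] = v["etag"]
        if v.get("last_modified"):
            headers["If-Modified-Since"] = v["last_modified"]
        return headers

    def update(self, url, status, response_headers):
        """Record fresh validators on 200; return True on 304 (unchanged)."""
        if status == 200:
            self._validators[url] = {
                "etag": response_headers.get("ETag"),
                "last_modified": response_headers.get("Last-Modified"),
            }
        return status == 304
```

On a recrawl, every 304 saves a download and a parse, which is what makes incremental crawling cheap on sites that change slowly.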
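Finally, the agent-facing surface: the listing names fetch, extract, crawl, pause, resume, export, and metrics as actions an agent can drive through MCP tool calls. A real MCP server exposes these as tools over the protocol; the toy dispatcher below only shows the dispatch shape (tool name in, structured result out) and is in no way Anansi's implementation.

```python
class CrawlControl:
    """Toy controller with a few of the actions the listing names."""
    def __init__(self):
        self.paused = False
        self.pages_fetched = 0

    def pause(self):
        self.paused = True
        return {"paused": True}

    def resume(self):
        self.paused = False
        return {"paused": False}

    def metrics(self):
        return {"pages_fetched": self.pages_fetched, "paused": self.paused}

def dispatch(controller, tool, **args):
    """Route a tool call by name; unknown tools return an error payload
    instead of raising, which is friendlier for an agent loop."""
    handler = getattr(controller, tool, None)
    if not callable(handler):
        return {"error": f"unknown tool: {tool}"}
    return handler(**args)
```

The point of an MCP server over this kind of surface is that the agent never scripts the crawler directly: it issues named tool calls and reads structured results, so pause/resume/metrics become first-class operations.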
Source notes

The GitHub repository describes Anansi as a self-healing web scraper with selector repair, browser rendering fallback, Chrome-like TLS fingerprinting, Pydantic validation, incremental crawling, and an MCP server. The project is written primarily in Python and is licensed under Apache-2.0.
Free. Posted by @ZachasADMIN.
Related reads

No blog posts for #crawler yet

There is no published article with this tag right now. Browse the blog for adjacent themes or follow the tag for future updates.