This repository contains documentation-related scrapers for seekseek.org.
Currently, the tools used for scraping are not documented, so while contributions are welcome, it will probably take a bit of work to figure out how to write a scraper :) This will change soon(tm). In the meantime, there may be issues on the issue tracker that need your help.
By submitting a contribution, you agree to license it under the WTFPL/CC0 like the rest of the codebase, which effectively means making it public domain and free for anyone to use for any purpose.
Scraper development guidelines
- Store dense information. Avoid storing things like raw HTML, which mostly contains repetitive/template content. Storing structured data (e.g. parsed JSON) is ideal, but things like HTML snippets with high information density are okay too.
- Store original information. Don't try to parse meaning directly out of the scraped data, other than for discovering new items! Data normalization is a lossy process and should happen in a dedicated normalization task; that way we don't need to rescrape the entire source just because of a small change in the data normalization code.
- Store maximum information. There's no need to selectively pick out bits of information to store; if it's easy to extract more data than you are strictly looking for (e.g. the data is presented in JSON format), then please do so and just store it in the results! This allows for extracting more information from it later, when building other or new search engines. An example of this is how some scrapers store technical properties of components, even though what we're currently looking for is just datasheets.
- Scrape politely. Try not to make more requests than absolutely necessary. Prefer sitemaps over pagination. If you need to paginate through something, and it uses numeric page offsets rather than item IDs, try to make each page as large as possible - high numeric offsets are hard on a database, so a few huge requests are better than many small ones. Prefer JSON/XML/CSV/etc. over scraping HTML; rendering pages can be very resource-intensive on the server. Don't use text-based search APIs unless absolutely necessary.
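
As a rough illustration of the guidelines above, here is a minimal Python sketch. It is not the scraping tooling used in this repository (that is still undocumented); `fetch_page`, `scrape_all`, `normalize`, and the `name`/`datasheet` field names are all hypothetical and only serve to show the idea. The scrape step pages through a source with few, large requests and yields every item exactly as received, while normalization lives in a separate function that can be re-run on stored data without rescraping.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical signature: fetch_page(offset, limit) returns a list of raw
# items (e.g. parsed JSON objects) from the source being scraped.
FetchPage = Callable[[int, int], List[Dict]]


def scrape_all(fetch_page: FetchPage, page_size: int = 1000) -> Iterator[Dict]:
    """Yield every raw item, preferring a few large pages over many small ones."""
    offset = 0
    while True:
        items = fetch_page(offset, page_size)
        # Yield items untouched: no field selection, no cleanup, no parsing
        # of meaning. That belongs in the normalization step below.
        yield from items
        if len(items) < page_size:
            break  # a short (or empty) page means the source is exhausted
        offset += page_size


def normalize(raw: Dict) -> Dict:
    """Separate, re-runnable normalization step that works on stored raw items.

    The field names here are assumptions made up for this example.
    """
    return {
        "name": str(raw.get("name", "")).strip(),
        "datasheet_url": raw.get("datasheet"),
    }
```

With a source of 2500 items, a page size of 1000 needs 3 requests, whereas a page size of 100 needs 26 (25 full pages plus one empty page to detect the end). Because the raw items are stored as-is, a change to `normalize` only requires re-running it over the stored results.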