Fix license, add some docs

master
Sven Slootweg 1 month ago
parent
commit
cc7458e410
2 changed files with 14 additions and 1 deletion
  1. README.md (+13, -0)
  2. package.json (+1, -1)

README.md

@@ -0,0 +1,13 @@
This repository contains documentation-related scrapers for seekseek.org.
## Contributions
Currently, the tools used for scraping are not documented. So while contributions are welcome, it will probably be a bit of work to figure out how to write a scraper :) This will change soon(tm). There may be [issues that need your help](https://git.cryto.net/seekseek/scrape-documentation/issues?q=&type=all&sort=&state=open&labels=18&milestone=0&assignee=0) on the issue tracker, though.
By submitting a contribution, you agree to license it under the WTFPL/CC0 like the rest of the codebase, which effectively means making it public domain and free for anyone to use for any purpose.
## Scraper development guidelines
1. __Store dense information.__ Avoid storing things like raw HTML, which mostly contains repetitive/template content. Storing structured data (e.g. parsed JSON) is ideal, but things like HTML snippets with high information density are okay too.
2. __Store original information.__ Don't try to parse meaning directly out of the scraped data, other than for discovering new items! Data normalization is a lossy process and should happen in a dedicated normalization task; that way we don't need to rescrape the entire source just because of a small change in the data normalization code.
3. __Store maximum information.__ There's no need to selectively pick out bits of information to store; if it's easy to extract more data than you are strictly looking for (e.g. the data is presented in JSON format), then please do so and just store it in the results! This allows for extracting more information from it later, when building other or new search engines. An example of this is how some scrapers store technical properties of components, even though what we're currently looking for is just datasheets.
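
The three guidelines above could look roughly like the sketch below. This is only an illustration, not the actual seekseek scraping API (which is not documented here): the function names `scrapeItem` and `normalizeItem`, the example payload, and all of its field names are made up for the example. The point is the split: the scraping step stores the full parsed payload and only interprets it to discover new items, while the lossy normalization lives in a separate step that can be re-run against stored data.

```javascript
"use strict";

// Hypothetical response body; every field name here is invented for illustration.
const exampleResponseBody = JSON.stringify({
    partNumber: "ABC123",
    datasheetUrl: "https://example.com/abc123.pdf",
    properties: { voltage: "3.3 V", package: "SOT-23" },
    related: ["ABC124", "ABC125"]
});

// Scraping step (hypothetical): parse the response into structured data and
// store all of it, instead of picking out only the datasheet URL.
function scrapeItem(responseBody) {
    const parsed = JSON.parse(responseBody);

    return {
        // Dense, original, maximum information: the full parsed payload, unmodified
        data: parsed,
        // The only interpretation done while scraping is discovering new items
        discoveredItems: Array.isArray(parsed.related) ? parsed.related : []
    };
}

// Normalization step (hypothetical): lossy by nature, so it is kept separate.
// Changing this function does not require rescraping the source.
function normalizeItem(storedData) {
    return {
        id: storedData.partNumber.toLowerCase(),
        datasheet: storedData.datasheetUrl
    };
}

const scraped = scrapeItem(exampleResponseBody);
const normalized = normalizeItem(scraped.data);

console.log(scraped.discoveredItems);
console.log(normalized);
```

Because `scraped.data` still contains the `properties` object, a later search engine could index those technical properties without scraping the source again.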

package.json

@@ -4,7 +4,7 @@
"main": "index.js",
"repository": "git@git.cryto.net:seekseek/scrape-documentation.git",
"author": "Sven Slootweg <admin@cryto.net>",
- "license": "MIT",
+ "license": "WTFPL OR CC0-1.0",
"dependencies": {
"bhttp": "^1.2.8",
"bluebird": "^3.7.2",

