forked from seekseek/ui
Add initial technology page
parent
a45d67b6ff
commit
5326778971
"use strict";

const React = require("react");

const Layout = require("./_layout");

module.exports = function Technology() {
return (
<Layout>
<div className="staticContent">
<h1>The technology</h1>
<p>So... what makes SeekSeek tick? Let&#x27;s get the boring bits out of the way first:</p>
<ul>
<li>The whole thing is written in JavaScript, end-to-end, including the scraper.</li>
<li>Both the scraping server and the search frontend server run on NixOS.</li>
<li>PostgreSQL is used as the database, both for the scraper and the search frontends (there&#x27;s only one frontend at the time of writing).</li>
<li>The search frontends use React for rendering the UI; server-side where possible, browser-side where necessary.</li>
<li>Server-side rendering is done with a fork of <code>express-react-views</code>.</li>
<li><em>Most</em> scraping tasks use bhttp as the HTTP client, and cheerio (a &#x27;headless&#x27; implementation of the jQuery API) for data extraction.</li>
</ul>
<p>None of that is really very interesting, but people always ask about it. Let&#x27;s move on to the interesting bits!</p>
<h2 id="the-goal">The goal</h2>
<p>Before we can talk about the technology, we need to talk about what the technology was built <em>for</em>. SeekSeek is <a href="http://cryto.net/~joepie91/manifesto.html">radical software</a>. From the ground up, it was designed to be FOSS, collaborative and community-driven, non-commercial, ad-free, and to improve the world - in the case of SeekSeek specifically, to improve on the poor state of keyword-only searches by providing highly specialized search engines instead!</p>
<p>But... that introduces some unusual requirements:</p>
<ul>
<li><strong>It needs to be resource-conservative:</strong> While it doesn&#x27;t need to be <em>perfectly</em> optimized, it shouldn&#x27;t require absurd amounts of RAM or CPU power either. It should be possible to run <em>the whole thing</em> on a desktop or a cheap server - the usual refrain of &quot;extra servers are cheaper than extra developers&quot;, a very popular one in startups, does not apply here.</li>
<li><strong>It needs to be easy to spin up for development:</strong> The entire codebase needs to be self-contained as much as reasonably possible, requiring not much more than an <code>npm install</code> to get everything in place. No weirdly complex build stacks, no assumptions about how the developer&#x27;s system is laid out, and things need to be debuggable by someone who has never touched it before. It needs to be possible for <em>anybody</em> to hack on it, not just a bunch of core developers.</li>
<li><strong>It needs to be easy to deploy and maintain:</strong> It needs to work with commodity software on standard operating systems, including in constrained environments like containers and VPSes. No weird kernel settings, no complex network setup requirements. It needs to Just Work, and to <em>keep</em> working with very little maintenance. Upgrades need to be seamless.</li>
<li><strong>It needs to be flexible:</strong> Time is still a valuable resource in a collaborative project - unlike a company, we can&#x27;t assume that someone will be able to spend a working day restructuring the entire codebase. Likewise, fundamental restructuring causes coordination issues across the community, because a FOSS community is not a centralized entity with a manager who decides what happens. That means that the core (extensible) architecture needs to be right <em>from the start</em>, and able to adapt to changing circumstances, more so because scraping is involved.</li>
<li><strong>It needs to be accessible:</strong> It should be possible for <em>any</em> developer to build and contribute to scrapers; not just specialized developers who have spent half their life working on this sort of thing. That means that the API needs to be simple, and there needs to be space for someone to use the tools they are comfortable with.</li>
</ul>
<p>At the time of writing, there&#x27;s only a datasheet search engine. However, the long-term goal is for SeekSeek to become a large <em>collection</em> of specialized search engines - each one with a tailor-made UI that&#x27;s ideal for the thing being searched through. So all of the above needs to be satisfied not just for a datasheet search engine, but for a <em>potentially unlimited</em> series of search engines, many of which are not even on the roadmap yet!</p>
<p>And well, the very short version is that <em>none</em> of the existing options that I&#x27;ve evaluated even came <em>close</em> to meeting these requirements. Existing scraping stacks, job queues, and so on tend to very much be designed for corporate environments with tight control over who works on what. That wasn&#x27;t an option here. So let&#x27;s talk about what we ended up with instead!</p>
<h2 id="the-scraping-server">The scraping server</h2>
<p>The core component in SeekSeek is the &#x27;scraping server&#x27; - an experimental project called <a href="https://git.cryto.net/joepie91/srap">srap</a> that was built specifically for SeekSeek; though also designed to be more generically useful. You can think of srap as <strong>a persistent job queue that&#x27;s optimized for scraping</strong>.</p>
<p>So what does that mean? The basic idea behind srap is that you have a big pile of &quot;items&quot; - each item isn&#x27;t much more than a unique identifier and some &#x27;initial data&#x27; to represent the work to be done. Each item can have zero or more &#x27;tags&#x27; assigned, which are just short strings. Crucially, none of these items <em>do</em> anything yet - they&#x27;re really just a mapping from an identifier to some arbitrarily-shaped JSON.</p>
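<p>As a rough illustration, an item as described above might be sketched like this (note: this mirrors the conceptual description only, not srap&#x27;s actual internal record format; the identifier and data fields are invented):</p>

```javascript
// A conceptual sketch of an srap "item": a unique identifier, some
// arbitrarily-shaped JSON, and zero or more short string tags.
// (Hypothetical field names; srap's real storage format is not shown here.)
const item = {
	id: "lcsc:product:C12345",               // unique identifier (made up)
	data: { partNumber: "RC0805FR-0710KL" }, // the 'initial data'
	tags: ["lcsc:product"]                   // zero or more tags
};

// Tags are dynamic: any scraping task could add or remove them.
function addTag(targetItem, tag) {
	if (!targetItem.tags.includes(tag)) {
		targetItem.tags.push(tag);
	}
	return targetItem;
}
```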
<p>The real work starts with the <strong>scraper configuration</strong>. Even though it&#x27;s called a &#x27;configuration&#x27;, it&#x27;s really more of a <em>codebase</em> - you can find the configuration that SeekSeek uses <a href="https://git.cryto.net/seekseek/scraper-config">here</a>. You&#x27;ll notice that it <a href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/index.js">defines a number of tasks and seed items</a>. The seed items are simply inserted automatically if they don&#x27;t exist yet, and define the &#x27;starting point&#x27; for the scraper.</p>
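<p>To make the shape of such a configuration more concrete, here is a loose sketch (note: the option and property names below are invented for illustration, and srap&#x27;s real configuration API may differ; the task and tag names come from the LCSC example on this page):</p>

```javascript
// Hypothetical sketch of a scraper configuration: seed items, tag mappings,
// and tasks. Illustrative structure only, not srap's actual API.
const config = {
	seedItems: [
		// Inserted automatically if they don't exist yet; the starting point.
		{ id: "lcsc:home", tags: ["lcsc:home"], data: {} }
	],
	tagMappings: {
		// Which task(s) should run for items carrying each tag.
		"lcsc:home": ["lcsc:findCategories"],
		"lcsc:category": ["lcsc:scrapeCategory"]
	},
	tasks: {
		"lcsc:findCategories": {
			ttl: 30 * 24 * 60 * 60 * 1000, // revisit after ~30 days (made-up unit)
			run: async ({ createItem }) => {
				// ...fetch the category list, then for each category found:
				// createItem({ id: "lcsc:category:<slug>", tags: ["lcsc:category"] });
			}
		}
	}
};
```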
<p>The tasks, however, define what the scraper <em>does</em>. Every task represents one specific operation in the scraping process; typically, there will be multiple tasks per source: one to find product categories, one to extract products from a category listing, one to extract data from a product page, and so on. Each of these tasks has its own concurrency settings, as well as a TTL (Time-To-Live) that defines how long it takes before the scraper should revisit it.</p>
<p>Finally, what wires it all together are the <em>tag mappings</em>. These define what tasks should be executed for what tags - or more accurately, for all the items that are tagged <em>with</em> those tags. Tags associated with items are dynamic: they can be added or removed by any scraping task. This provides a <em>huge</em> amount of flexibility, because any task can essentially queue any <em>other</em> task, just by giving an item the right tag. The scraping server then makes sure that it lands at the right spot in the queue at the right time - the task itself doesn&#x27;t need to care about any of that.</p>
<p>Here&#x27;s a practical example, from the datasheet search tasks:</p>
<ul>
<li>The initial seed item for LCSC is tagged as <code>lcsc:home</code>.</li>
<li>The <code>lcsc:home</code> tag is defined to trigger the <code>lcsc:findCategories</code> task.</li>
<li>The <code>lcsc:findCategories</code> task fetches a list of categories from the source, and creates an item tagged as <code>lcsc:category</code> for each.</li>
<li>The <code>lcsc:category</code> tag is then defined to trigger the <code>lcsc:scrapeCategory</code> task.</li>
<li>The <code>lcsc:scrapeCategory</code> task (more or less) fetches all the products for a given category, and creates items tagged as <code>lcsc:product</code>. Importantly, because the LCSC category listings <em>already</em> include the product data we need, these items are immediately created with their full data - there&#x27;s no separate &#x27;scrape product page&#x27; task!</li>
<li>The <code>lcsc:product</code> tag is then defined to trigger the <code>lcsc:normalizeProduct</code> task.</li>
<li>The <code>lcsc:normalizeProduct</code> task then converts the scraped data to a standardized representation, which is stored with a <code>result:datasheet</code> tag. The scraping flows for <em>other</em> data sources <em>also</em> produce <code>result:datasheet</code> items - these are the items that ultimately end up in the search frontend!</li>
</ul>
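<p>The last step of that flow - normalization - might be sketched like this (note: the real <code>lcsc:normalizeProduct</code> implementation lives in the seekseek/scraper-config repository and will differ; the field names here are invented):</p>

```javascript
// Hypothetical sketch of a normalization step: convert source-specific
// scraped data into a standardized representation shared by all data
// sources. (Invented field names; not the actual scraper-config code.)
function normalizeProduct(scraped) {
	return {
		partNumber: scraped.productCode,
		manufacturer: scraped.brandName,
		datasheetUrl: scraped.datasheetPdfUrl
	};
}
```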
<p>One thing that&#x27;s not mentioned above is that <code>lcsc:scrapeCategory</code> doesn&#x27;t <em>actually</em> scrape all of the items for a category - it just scrapes a specific page of them! The initial <code>lcsc:findCategories</code> task would have created as many such &#x27;page tasks&#x27; as there are pages to scrape, based on the number of items a category is said to have.</p>
<p>More interesting, though, is that the scraping flow doesn&#x27;t <em>have</em> to be this unidirectional - if the total number of pages could only be learned from scraping the first page, it would have been entirely possible for the <code>lcsc:scrapeCategory</code> task to create <em>additional</em> <code>lcsc:category</code> items! The tag-based system makes recursive discovery like this a breeze, and because everything is keyed by a unique identifier and persistent, loops are automatically prevented.</p>
<p>You&#x27;ll probably have noticed that none of the above mentions HTTP requests. That&#x27;s because srap doesn&#x27;t care - it has no idea what HTTP even is! All of the actual scraping <em>logic</em> is completely defined by the configuration - and that&#x27;s what makes it a codebase. <a href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/lib/lcsc/task/scrape-category.js">This</a> is the scraping logic for extracting products from an LCSC category, for example. This is also why each page is its own item; that allows srap to rate-limit requests despite having absolutely no hooks into the HTTP library being used, by virtue of limiting each task to 1 HTTP request.</p>
<p>There are more features in srap, like deliberately invalidating past scraping results, item merges, and &#x27;out of band&#x27; task result storage, but these are the basic concepts that make the whole thing work. As you can see, it&#x27;s highly flexible, unopinionated, and easy to collaboratively maintain a scraper configuration for - every task functions more or less independently.</p>
<h2 id="the-datasheet-search-frontend">The datasheet search frontend</h2>
<p>If you&#x27;ve used <a href="https://seekseek.org/datasheets">the datasheet search</a>, you&#x27;ve probably noticed that it&#x27;s <em>really</em> fast; it almost feels like it&#x27;s all local. But no, your search queries really <em>are</em> going to a server. So how can it be that fast?</p>
<p>It turns out to be surprisingly simple: by default, the search is a <em>prefix search</em> only. That means that it will only search for items that <em>start with</em> the query you entered. This is usually what you want when you search for part numbers, and it <em>also</em> has some very interesting performance implications - because a prefix search can be done entirely on an index!</p>
<p>There&#x27;s actually very little magic here - the PostgreSQL database that runs behind the frontend simply has a (normalized) index on the column for the part number, and the server is doing a <code>LIKE &#x27;yourquery%&#x27;</code> query against it. That&#x27;s it! This generally yields a search result in under 2 milliseconds, i.e. nearly instantly. All it has to do is an index lookup, and those are <em>fast</em>.</p>
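<p>For illustration, such a query might be issued like this (note: the table and column names below are invented, and the real SeekSeek schema and server code will differ; escaping the <code>LIKE</code> wildcards matters so that user input is matched literally):</p>

```javascript
// Sketch of an indexed prefix search (hypothetical table/column names).
// Escape LIKE wildcards (% and _) and backslashes so the user's query is
// treated as a literal prefix, then append % to match anything after it.
function toPrefixPattern(query) {
	return query.replace(/([\\%_])/g, "\\$1") + "%";
}

// With a PostgreSQL client such as `pg` (not shown here), the query would
// then look roughly like:
//   db.query(
//     "SELECT part_number, datasheet_url FROM parts WHERE part_number LIKE $1 LIMIT 50",
//     [toPrefixPattern(userQuery)]
//   );
```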
<p>On the browser side, things aren&#x27;t much more complicated. Every time the query changes, it makes a new search request to the server, cancelling the old one if one was still in progress. When it gets results, it renders them on the screen. That&#x27;s it. There are no trackers on the site, no weird custom input boxes, nothing else to slow it down. The result is a search that feels local :)</p>
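<p>The cancellation pattern described above might look something like this in the browser (note: the endpoint name and response shape are invented; the actual seekseek/ui code will differ):</p>

```javascript
// Sketch of "cancel the previous in-flight request" using AbortController.
// (Hypothetical /search endpoint; fetchImpl is injectable for testing.)
function createSearcher(fetchImpl = fetch) {
	let currentController = null;
	return async function search(query) {
		// Cancel the previous request, if one is still in progress.
		if (currentController !== null) {
			currentController.abort();
		}
		currentController = new AbortController();
		const response = await fetchImpl(
			"/search?q=" + encodeURIComponent(query),
			{ signal: currentController.signal }
		);
		return response.json();
	};
}
```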
<h2 id="the-source-code">The source code</h2>
<p>Right now, the source code for all of these things lives across three repositories:</p>
<ul>
<li><a href="https://git.cryto.net/joepie91/srap">joepie91/srap</a> - the scraping server.</li>
<li><a href="https://git.cryto.net/seekseek/scraper-config">seekseek/scraper-config</a> - the configuration and scraping logic that&#x27;s used for SeekSeek.</li>
<li><a href="https://git.cryto.net/seekseek/ui">seekseek/ui</a> - the search frontend (including search server!) for SeekSeek.</li>
</ul>
<p>At the time of writing, documentation is still pretty lacking across these repositories, and the code in the srap and UI repositories in particular is pretty rough! This will be improved upon quite soon, as SeekSeek becomes more polished.</p>
<h2 id="final-words">Final words</h2>
<p>Of course, there are many more details that I haven&#x27;t covered in this post, but hopefully this gives you an idea of how SeekSeek is put together, and why!</p>
<p>Has this post made you interested in working on SeekSeek, or maybe your own custom srap-based project? <a href="https://matrix.to/#/#seekseek:pixie.town?via=pixie.town&amp;via=matrix.org&amp;via=librepush.net">Drop by in the chat!</a> We&#x27;d be happy to give you pointers :)</p>
</div>
</Layout>
);
};