Add initial technology page

master
Sven Slootweg 2 years ago
parent a45d67b6ff
commit 5326778971

@@ -110,4 +110,8 @@ app.get("/contact", (req, res) => {
res.render("contact");
});
app.get("/technology", (req, res) => {
res.render("technology");
});
module.exports = app;

@@ -31,6 +31,10 @@ module.exports = function Layout({ children }) {
Come chat with us!
</a>
<span className="linkSpacer"></span>
<a href="/technology" className="chat">
Technology
</a>
<span className="linkSpacer"></span>
<a href="/contact" className="chat">
Contact/Abuse
</a>

@@ -0,0 +1,170 @@
"use strict";
const React = require("react");
const Layout = require("./_layout");
module.exports = function Technology() {
return (
<Layout>
<div className="staticContent">
<h1>The technology</h1>
<p>So... what makes SeekSeek tick? Let&#39;s get the boring bits out of the way first:</p>
<ul>
<li>The whole thing is written in Javascript, end-to-end, including the scraper.</li>
<li>Both the scraping server and the search frontend server run on NixOS.</li>
<li>PostgreSQL is used as the database, both for the scraper and the search frontends (there&#39;s only one
frontend at the time of writing).</li>
<li>The search frontends use React for rendering the UI; server-side where possible, browser-side where
necessary.</li>
<li>Server-side rendering is done with a fork of <code>express-react-views</code>.</li>
<li><em>Most</em> scraping tasks use bhttp as the HTTP client, and cheerio (a &#39;headless&#39; implementation
of the jQuery API) for data extraction.</li>
</ul>
<p>None of that is really very interesting, but people always ask about it. Let&#39;s move on to the interesting
bits!</p>
<h2 id="the-goal">The goal</h2>
<p>Before we can talk about the technology, we need to talk about what the technology was built <em>for</em>.
SeekSeek is <a href="http://cryto.net/~joepie91/manifesto.html">radical software</a>. From the ground up, it was
designed to be FOSS, collaborative and community-driven, non-commercial, ad-free, and to improve the world - in
the case of SeekSeek specifically, to improve on the poor state of keyword-only searches by providing highly
specialized search engines instead!</p>
<p>But... that introduces some unusual requirements:</p>
<ul>
<li><strong>It needs to be resource-conservative:</strong> While it doesn&#39;t need to be <em>perfectly</em> optimized, it shouldn&#39;t require absurd amounts of RAM or CPU power either. It should be possible to run
<em> the whole thing</em> on a desktop or a cheap server - the usual refrain of &quot;extra servers are
cheaper than extra developers&quot;, a very popular one in startups, does not apply here.</li>
<li><strong>It needs to be easy to spin up for development:</strong> The entire codebase needs to be
self-contained as much as reasonably possible, requiring not much more than an <code>npm install</code> to
get everything in place. No weirdly complex build stacks, no assumptions about how the developer&#39;s
system is laid out, and things need to be debuggable by someone who has never touched it before. It needs to
be possible for <em>anybody</em> to hack on it, not just a bunch of core developers.</li>
<li><strong>It needs to be easy to deploy and maintain:</strong> It needs to work with commodity software on
standard operating systems, including in constrained environments like containers and VPSes. No weird kernel
settings, no complex network setup requirements. It needs to Just Work, and to <em>keep</em> working with
very little maintenance. Upgrades need to be seamless.</li>
<li><strong>It needs to be flexible:</strong> Time is still a valuable resource in a collaborative project -
unlike a company, we can&#39;t assume that someone will be able to spend a working day restructuring the
entire codebase. Likewise, fundamental restructuring causes coordination issues across the community,
because a FOSS community is not a centralized entity with a manager who decides what happens. That means
that the core (extensible) architecture needs to be right <em>from the start</em>, and able to adapt to
changing circumstances, more so because scraping is involved.</li>
<li><strong>It needs to be accessible:</strong> It should be possible for <em>any</em> developer to build and
contribute to scrapers; not just specialized developers who have spent half their life working on this sort
of thing. That means that the API needs to be simple, and there needs to be space for someone to use the
tools they are comfortable with.</li>
</ul>
<p>At the time of writing, there&#39;s only a datasheet search engine. However, the long-term goal is for SeekSeek
to become a large <em>collection</em> of specialized search engines - each one with a tailor-made UI that&#39;s
ideal for the thing being searched through. So all of the above needs to be satisfied not just for a datasheet
search engine, but for a <em>potentially unlimited</em> series of search engines, many of which are not even on
the roadmap yet!</p>
<p>And well, the very short version is that <em>none</em> of the existing options that I&#39;ve evaluated even came
<em> close</em> to meeting these requirements. Existing scraping stacks, job queues, and so on tend to very much
be designed for corporate environments with tight control over who works on what. That wasn&#39;t an option
here. So let&#39;s talk about what we ended up with instead!</p>
<h2 id="the-scraping-server">The scraping server</h2>
<p>The core component in SeekSeek is the &#39;scraping server&#39; - an experimental project called <a
href="https://git.cryto.net/joepie91/srap">srap</a> that was built specifically for SeekSeek; though also
designed to be more generically useful. You can think of srap as <strong>a persistent job queue that&#39;s
optimized for scraping</strong>.</p>
<p>So what does that mean? The basic idea behind srap is that you have a big pile of &quot;items&quot; - each item
isn&#39;t much more than a unique identifier and some &#39;initial data&#39; to represent the work to be done.
Each item can have zero or more &#39;tags&#39; assigned, which are just short strings. Crucially, none of these
items <em>do</em> anything yet - they&#39;re really just a mapping from an identifier to some arbitrarily-shaped
JSON.</p>
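<p>To make that a bit more concrete, here&#39;s roughly what such an item looks like. The field names below are
purely illustrative - they&#39;re not srap&#39;s actual storage format.</p>
<pre><code>{`// purely illustrative - not srap's real schema
const item = {
    id: "lcsc:category:11",      // the unique identifier
    tags: ["lcsc:category"],     // zero or more short string tags
    data: { categoryId: 11 }     // some arbitrarily-shaped 'initial data'
};`}</code></pre>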
<p>The real work starts with the <strong>scraper configuration</strong>. Even though it&#39;s called a
&#39;configuration&#39;, it&#39;s really more of a <em>codebase</em> - you can find the configuration that
SeekSeek uses <a href="https://git.cryto.net/seekseek/scraper-config">here</a>. You&#39;ll notice that it <a
href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/index.js">defines a number of tasks
and seed items</a>. The seed items are simply inserted automatically if they don&#39;t exist yet, and define
the &#39;starting point&#39; for the scraper.</p>
<p>The tasks, however, define what the scraper <em>does</em>. Every task represents one specific operation in the
scraping process; typically, there will be multiple tasks per source. One to find product categories, one to
extract products from a category listing, one to extract data from a product page, and so on. Each of these
tasks has its own concurrency settings, as well as a TTL (Time-To-Live) that defines how long the scraper
should wait before revisiting it.</p>
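<p>To give an idea of the shape of such a task definition, here&#39;s a rough sketch. The option names
(<code>parallel</code>, <code>ttl</code>, <code>run</code>) are made up for illustration - the real format is
whatever srap expects, and you can see it in the configuration repository linked above.</p>
<pre><code>{`// a rough sketch only, not the actual srap configuration format
const tasks = {
    "lcsc:scrapeCategory": {
        parallel: 2,                      // how many of these tasks may run at once
        ttl: 14 * 24 * 60 * 60 * 1000,    // revisit after two weeks (assumed unit)
        run: require("./lib/lcsc/task/scrape-category")
    }
};`}</code></pre>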
<p>Finally, what wires it all together are the <em>tag mappings</em>. These define what tasks should be executed for
what tags - or more accurately, for all the items that are tagged <em>with</em> those tags. Tags associated with
items are dynamic; they can be added or removed by any scraping task. This provides a <em>huge</em> amount of
flexibility, because any task can essentially queue any <em>other</em> task, just by giving an item the right
tag. The scraping server then makes sure that it lands at the right spot in the queue at the right time - the
task itself doesn&#39;t need to care about any of that.</p>
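<p>Sketched out, a set of tag mappings might look something like this - again with made-up names and format,
purely to convey the idea:</p>
<pre><code>{`// illustrative only: "items with this tag get these tasks"
const tagMappings = {
    "somesource:home": [ "somesource:findCategories" ],
    "somesource:category": [ "somesource:scrapeCategory" ],
    "somesource:product": [ "somesource:normalizeProduct" ]
};`}</code></pre>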
<p>Here&#39;s a practical example, from the datasheet search tasks:</p>
<ul>
<li>The initial seed item for LCSC is tagged as <code>lcsc:home</code>.</li>
<li>The <code>lcsc:home</code> tag is defined to trigger the <code>lcsc:findCategories</code> task.</li>
<li>The <code>lcsc:findCategories</code> task fetches a list of categories from the source, and creates an item
tagged as <code>lcsc:category</code> for each.</li>
<li>The <code>lcsc:category</code> tag is then defined to trigger the <code>lcsc:scrapeCategory</code> task.
</li>
<li>The <code>lcsc:scrapeCategory</code> task (more or less) fetches all the products for a given category, and
creates items tagged as <code>lcsc:product</code>. Importantly, because the LCSC category listings
<em> already</em> include the product data we need, these items are immediately created with their full data
- there&#39;s no separate &#39;scrape product page&#39; task!</li>
<li>The <code>lcsc:product</code> tag is then defined to trigger the <code>lcsc:normalizeProduct</code> task.
</li>
<li>The <code>lcsc:normalizeProduct</code> task then converts the scraped data to a standardized representation,
which is stored with a <code>result:datasheet</code> tag. The scraping flows for <em>other</em> data sources
<em> also</em> produce <code>result:datasheet</code> items - these are the items that ultimately end up in
the search frontend!</li>
</ul>
<p>One thing that&#39;s not mentioned above is that <code>lcsc:scrapeCategory</code> doesn&#39;t <em>actually </em>
scrape all of the items for a category - it just scrapes a specific page of them! The initial
<code>lcsc:findCategories</code> task would have created as many such &#39;page tasks&#39; as there are pages
to scrape, based on the number of items a category is said to have.</p>
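<p>The page-splitting idea boils down to something like the sketch below - though the real task code looks
different, and <code>createItem</code> is just a stand-in for whatever srap&#39;s actual task API provides:</p>
<pre><code>{`// hypothetical sketch of the page-splitting idea; createItem is a stand-in,
// not srap's actual task API
const PAGE_SIZE = 25; // assumed page size

function createCategoryPageItems(category, createItem) {
    let pageCount = Math.ceil(category.productCount / PAGE_SIZE);

    for (let page = 1; page <= pageCount; page++) {
        createItem({
            id: "lcsc:category:" + category.id + ":page:" + page,
            tags: [ "lcsc:category" ],
            data: { categoryId: category.id, page: page }
        });
    }
}`}</code></pre>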
<p>More interesting, though, is that the scraping flow doesn&#39;t <em>have</em> to be this unidirectional - if the
total number of pages could only be learned from scraping the first page, it would have been entirely possible
for the <code>lcsc:scrapeCategory</code> task to create <em>additional</em> <code>lcsc:category</code> items!
The tag-based system makes recursive discovery like this a breeze, and because everything is keyed by a unique
identifier and persistent, loops are automatically prevented.</p>
<p>You&#39;ll probably have noticed that none of the above mentions HTTP requests. That&#39;s because srap
doesn&#39;t care - it has no idea what HTTP even is! All of the actual scraping <em>logic</em> is completely
defined by the configuration - and that&#39;s what makes it a codebase. <a
href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/lib/lcsc/task/scrape-category.js">This</a> is the scraping logic for extracting products from an LCSC category, for example. This is also why each page is
its own item; that allows srap to rate-limit requests despite having absolutely no hooks into the HTTP library
being used, by virtue of limiting each task to 1 HTTP request.</p>
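<p>To give a feel for what such scraping logic looks like, here&#39;s a heavily simplified sketch of a
category-scraping task for an imaginary HTML-based source. The task signature and the <code>createItem</code>
helper are assumptions for the sake of illustration - the real LCSC logic linked above looks different:</p>
<pre><code>{`"use strict";

// a heavily simplified sketch, not the real scrape-category.js; the task
// signature and the createItem helper are assumptions for illustration
const bhttp = require("bhttp");
const cheerio = require("cheerio");

module.exports = async function scrapeCategoryPage({ data, createItem }) {
    // one HTTP request per task, so that srap can rate-limit tasks
    // without knowing anything about HTTP
    let response = await bhttp.get(
        "https://parts.example.com/category/" + data.categoryId + "?page=" + data.page
    );

    let $ = cheerio.load(String(response.body));

    $(".product-row").each((_index, element) => {
        let row = $(element);

        createItem({
            id: "examplesource:product:" + row.attr("data-product-id"),
            tags: [ "examplesource:product" ],
            data: {
                partNumber: row.find(".part-number").text().trim(),
                datasheetUrl: row.find(".datasheet-link").attr("href")
            }
        });
    });
};`}</code></pre>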
<p>There are more features in srap, like deliberately invalidating past scraping results, item merges, and &#39;out
of band&#39; task result storage, but these are the basic concepts that make the whole thing work. As you can
see, it&#39;s highly flexible, unopinionated, and easy to collaboratively maintain a scraper configuration for -
every task functions more or less independently.</p>
<h2 id="the-datasheet-search-frontend">The datasheet search frontend</h2>
<p>If you&#39;ve used <a href="https://seekseek.org/datasheets">the datasheet search</a>, you&#39;ve probably
noticed that it&#39;s <em>really</em> fast, it almost feels like it&#39;s all local. But no, your search queries
really <em>are</em> going to a server. So how can it be that fast?</p>
<p>It turns out to be surprisingly simple: by default, the search is a <em>prefix search</em> only. That means that
it will only search for items that <em>start with</em> the query you entered. This is usually what you want when you
search for part numbers, and it <em>also</em> has some very interesting performance implications - because a
prefix search can be done entirely on an index!</p>
<p>There&#39;s actually very little magic here - the PostgreSQL database that runs behind the frontend simply has a
(normalized) index on the column for the part number, and the server is doing a <code>LIKE
&#39;yourquery%&#39;</code> query against it. That&#39;s it! This generally yields a search result in under
2 milliseconds, i.e. nearly instantly. All it has to do is an index lookup, and those are <em>fast</em>.</p>
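<p>In other words, the server-side part of a search boils down to something like the sketch below - assuming
node-postgres here, and with made-up table, column and index names; <code>normalize</code> is a stand-in for
whatever normalization is applied to part numbers:</p>
<pre><code>{`// a rough sketch of the idea, assuming node-postgres ("pg"); the actual table,
// column and index names are different
// index: CREATE INDEX ... ON datasheets (part_number_normalized text_pattern_ops);

function normalize(partNumber) {
    // stand-in normalization; the real one may differ
    return partNumber.toLowerCase().replace(/[^a-z0-9]/g, "");
}

async function searchDatasheets(database, searchQuery) {
    let result = await database.query(
        "SELECT * FROM datasheets WHERE part_number_normalized LIKE $1 LIMIT 50",
        [ normalize(searchQuery) + "%" ]
    );

    return result.rows;
}`}</code></pre>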
<p>On the browser side, things aren&#39;t much more complicated. Every time the query changes, it makes a new search
request to the server, cancelling the old one if one was still in progress. When it gets results, it renders
them on the screen. That&#39;s it. There are no trackers on the site, no weird custom input boxes, nothing else
to slow it down. The result is a search that feels local :)</p>
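<p>In simplified form, that pattern looks something like the sketch below - the endpoint and the
<code>renderResults</code> function are made up for illustration, and the real UI does its rendering through
React instead:</p>
<pre><code>{`// a simplified sketch of the pattern, not the actual UI code; the endpoint
// and renderResults are made up for illustration
let currentController = null;

async function runSearch(query) {
    if (currentController != null) {
        currentController.abort(); // cancel the previous in-flight request, if any
    }

    currentController = new AbortController();

    try {
        let response = await fetch("/search?q=" + encodeURIComponent(query), {
            signal: currentController.signal
        });

        renderResults(await response.json());
    } catch (error) {
        if (error.name !== "AbortError") {
            throw error;
        }
    }
}

function renderResults(results) {
    // stand-in for the real (React-based) rendering
    document.querySelector("#results").textContent = JSON.stringify(results);
}`}</code></pre>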
<h2 id="the-source-code">The source code</h2>
<p>Right now, the source code for all of these things lives across three repositories:</p>
<ul>
<li><a href="https://git.cryto.net/joepie91/srap">joepie91/srap</a> - the scraping server.</li>
<li><a href="https://git.cryto.net/seekseek/scraper-config">seekseek/scraper-config</a> - the configuration and
scraping logic that&#39;s used for SeekSeek.</li>
<li><a href="https://git.cryto.net/seekseek/ui">seekseek/ui</a> - the search frontend (including search server!)
for SeekSeek.</li>
</ul>
<p>At the time of writing, documentation is still pretty lacking across these repositories, and the code in the srap
and UI repositories in particular is pretty rough! This will be improved upon quite soon, as SeekSeek becomes
more polished.</p>
<h2 id="final-words">Final words</h2>
<p>Of course, there are many more details that I haven&#39;t covered in this post, but hopefully this gives you an idea of how SeekSeek is put together, and why!</p>
<p>Has this post made you interested in working on SeekSeek, or maybe your own custom srap-based project? <a
href="https://matrix.to/#/#seekseek:pixie.town?via=pixie.town&amp;via=matrix.org&amp;via=librepush.net">Drop
by in the chat!</a> We&#39;d be happy to give you pointers :)</p>
</div>
</Layout>
);
};