Add initial technology page

master
Sven Slootweg 2 years ago
parent a45d67b6ff
commit 5326778971

@@ -110,4 +110,8 @@ app.get("/contact", (req, res) => {
res.render("contact");
});
app.get("/technology", (req, res) => {
res.render("technology");
});
module.exports = app;

@@ -31,6 +31,10 @@ module.exports = function Layout({ children }) {
Come chat with us!
</a>
<span className="linkSpacer"></span>
<a href="/technology" className="chat">
Technology
</a>
<span className="linkSpacer"></span>
<a href="/contact" className="chat">
Contact/Abuse
</a>

@@ -0,0 +1,170 @@
"use strict";
const React = require("react");
const Layout = require("./_layout");
module.exports = function Technology() {
return (
<Layout>
<div className="staticContent">
<h1>The technology</h1>
<p>So... what makes SeekSeek tick? Let&#39;s get the boring bits out of the way first:</p>
<ul>
<li>The whole thing is written in Javascript, end-to-end, including the scraper.</li>
<li>Both the scraping server and the search frontend server run on NixOS.</li>
<li>PostgreSQL is used as the database, both for the scraper and the search frontends (there&#39;s only one
frontend at the time of writing).</li>
<li>The search frontends use React for rendering the UI; server-side where possible, browser-side where
necessary.</li>
<li>Server-side rendering is done with a fork of <code>express-react-views</code>.</li>
<li><em>Most</em> scraping tasks use bhttp as the HTTP client, and cheerio (a &#39;headless&#39; implementation
of the jQuery API) for data extraction.</li>
</ul>
<p>None of that is really very interesting, but people always ask about it. Let&#39;s move on to the interesting
bits!</p>
<h2 id="the-goal">The goal</h2>
<p>Before we can talk about the technology, we need to talk about what the technology was built <em>for</em>.
SeekSeek is <a href="http://cryto.net/~joepie91/manifesto.html">radical software</a>. From the ground up, it was
designed to be FOSS, collaborative and community-driven, non-commercial, ad-free, and to improve the world - in
the case of SeekSeek specifically, to improve on the poor state of keyword-only searches by providing highly
specialized search engines instead!</p>
<p>But... that introduces some unusual requirements:</p>
<ul>
<li><strong>It needs to be resource-conservative:</strong> While it doesn&#39;t need to be <em>perfectly</em> optimized, it shouldn&#39;t require absurd amounts of RAM or CPU power either. It should be possible to run
<em> the whole thing</em> on a desktop or a cheap server - the usual refrain of &quot;extra servers are
cheaper than extra developers&quot;, a very popular one in startups, does not apply here.</li>
<li><strong>It needs to be easy to spin up for development:</strong> The entire codebase needs to be
self-contained as much as reasonably possible, requiring not much more than an <code>npm install</code> to
get everything in place. No weirdly complex build stacks, no assumptions about how the developer&#39;s
system is laid out, and things need to be debuggable by someone who has never touched it before. It needs to
be possible for <em>anybody</em> to hack on it, not just a bunch of core developers.</li>
<li><strong>It needs to be easy to deploy and maintain:</strong> It needs to work with commodity software on
standard operating systems, including in constrained environments like containers and VPSes. No weird kernel
settings, no complex network setup requirements. It needs to Just Work, and to <em>keep</em> working with
very little maintenance. Upgrades need to be seamless.</li>
<li><strong>It needs to be flexible:</strong> Time is still a valuable resource in a collaborative project -
unlike a company, we can&#39;t assume that someone will be able to spend a working day restructuring the
entire codebase. Likewise, fundamental restructuring causes coordination issues across the community,
because a FOSS community is not a centralized entity with a manager who decides what happens. That means
that the core (extensible) architecture needs to be right <em>from the start</em>, and able to adapt to
changing circumstances, more so because scraping is involved.</li>
<li><strong>It needs to be accessible:</strong> It should be possible for <em>any</em> developer to build and
contribute to scrapers; not just specialized developers who have spent half their life working on this sort
of thing. That means that the API needs to be simple, and there needs to be space for someone to use the
tools they are comfortable with.</li>
</ul>
<p>At the time of writing, there&#39;s only a datasheet search engine. However, the long-term goal is for SeekSeek
to become a large <em>collection</em> of specialized search engines - each one with a tailor-made UI that&#39;s
ideal for the thing being searched through. So all of the above needs to be satisfied not just for a datasheet
search engine, but for a <em>potentially unlimited</em> series of search engines, many of which are not even on
the roadmap yet!</p>
<p>And well, the very short version is that <em>none</em> of the existing options that I&#39;ve evaluated even came
<em> close</em> to meeting these requirements. Existing scraping stacks, job queues, and so on tend to very much
be designed for corporate environments with tight control over who works on what. That wasn&#39;t an option
here. So let&#39;s talk about what we ended up with instead!</p>
<h2 id="the-scraping-server">The scraping server</h2>
<p>The core component in SeekSeek is the &#39;scraping server&#39; - an experimental project called <a
href="https://git.cryto.net/joepie91/srap">srap</a> that was built specifically for SeekSeek; though also
designed to be more generically useful. You can think of srap as <strong>a persistent job queue that&#39;s
optimized for scraping</strong>.</p>
<p>So what does that mean? The basic idea behind srap is that you have a big pile of &quot;items&quot; - each item
isn&#39;t much more than a unique identifier and some &#39;initial data&#39; to represent the work to be done.
Each item can have zero or more &#39;tags&#39; assigned, which are just short strings. Crucially, none of these
items <em>do</em> anything yet - they&#39;re really just a mapping from an identifier to some arbitrarily-shaped
JSON.</p>
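<p>To make that a bit more concrete, here&#39;s roughly what such an item looks like. The field names below are
purely illustrative - they&#39;re not srap&#39;s actual storage format.</p>
<pre><code>{`// purely illustrative - not srap's real schema
const item = {
    id: "lcsc:category:11",      // the unique identifier
    tags: ["lcsc:category"],     // zero or more short string tags
    data: { categoryId: 11 }     // some arbitrarily-shaped 'initial data'
};`}</code></pre>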
<p>The real work starts with the <strong>scraper configuration</strong>. Even though it&#39;s called a
&#39;configuration&#39;, it&#39;s really more of a <em>codebase</em> - you can find the configuration that
SeekSeek uses <a href="https://git.cryto.net/seekseek/scraper-config">here</a>. You&#39;ll notice that it <a
href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/index.js">defines a number of tasks
and seed items</a>. The seed items are simply inserted automatically if they don&#39;t exist yet, and define
the &#39;starting point&#39; for the scraper.</p>
<p>The tasks, however, define what the scraper <em>does</em>. Every task represents one specific operation in the
scraping process; typically, there will be multiple tasks per source. One to find product categories, one to
extract products from a category listing, one to extract data from a product page, and so on. Each of these
tasks has its own concurrency settings, as well as a TTL (Time-To-Live) that defines how long the scraper
should wait before revisiting it.</p>
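<p>To give an idea of the shape of such a task definition, here&#39;s a rough sketch. The option names
(<code>parallel</code>, <code>ttl</code>, <code>run</code>) are made up for illustration - the real format is
whatever srap expects, and you can see it in the configuration repository linked above.</p>
<pre><code>{`// a rough sketch only, not the actual srap configuration format
const tasks = {
    "lcsc:scrapeCategory": {
        parallel: 2,                      // how many of these tasks may run at once
        ttl: 14 * 24 * 60 * 60 * 1000,    // revisit after two weeks (assumed unit)
        run: require("./lib/lcsc/task/scrape-category")
    }
};`}</code></pre>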
<p>Finally, what wires it all together are the <em>tag mappings</em>. These define what tasks should be executed for
what tags - or more accurately, for all the items that are tagged <em>with</em> those tags. Tags associated with
items are dynamic; they can be added or removed by any scraping task. This provides a <em>huge</em> amount of
flexibility, because any task can essentially queue any <em>other</em> task, just by giving an item the right
tag. The scraping server then makes sure that it lands at the right spot in the queue at the right time - the
task itself doesn&#39;t need to care about any of that.</p>
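<p>Sketched out, a set of tag mappings might look something like this - again with made-up names and format,
purely to convey the idea:</p>
<pre><code>{`// illustrative only: "items with this tag get these tasks"
const tagMappings = {
    "somesource:home": [ "somesource:findCategories" ],
    "somesource:category": [ "somesource:scrapeCategory" ],
    "somesource:product": [ "somesource:normalizeProduct" ]
};`}</code></pre>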
<p>Here&#39;s a practical example, from the datasheet search tasks:</p>
<ul>
<li>The initial seed item for LCSC is tagged as <code>lcsc:home</code>.</li>
<li>The <code>lcsc:home</code> tag is defined to trigger the <code>lcsc:findCategories</code> task.</li>
<li>The <code>lcsc:findCategories</code> task fetches a list of categories from the source, and creates an item
tagged as <code>lcsc:category</code> for each.</li>
<li>The <code>lcsc:category</code> tag is then defined to trigger the <code>lcsc:scrapeCategory</code> task.
</li>
<li>The <code>lcsc:scrapeCategory</code> task (more or less) fetches all the products for a given category, and
creates items tagged as <code>lcsc:product</code>. Importantly, because the LCSC category listings
<em> already</em> include the product data we need, these items are immediately created with their full data
- there&#39;s no separate &#39;scrape product page&#39; task!</li>
<li>The <code>lcsc:product</code> tag is then defined to trigger the <code>lcsc:normalizeProduct</code> task.
</li>
<li>The <code>lcsc:normalizeProduct</code> task then converts the scraped data to a standardized representation,
which is stored with a <code>result:datasheet</code> tag. The scraping flows for <em>other</em> data sources
<em> also</em> produce <code>result:datasheet</code> items - these are the items that ultimately end up in
the search frontend!</li>
</ul>
<p>One thing that&#39;s not mentioned above is that <code>lcsc:scrapeCategory</code> doesn&#39;t <em>actually </em>
scrape all of the items for a category - it just scrapes a specific page of them! The initial
<code>lcsc:findCategories</code> task would have created as many such &#39;page tasks&#39; as there are pages
to scrape, based on the number of items a category is said to have.</p>
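<p>The page-splitting idea boils down to something like the sketch below - though the real task code looks
different, and <code>createItem</code> is just a stand-in for whatever srap&#39;s actual task API provides:</p>
<pre><code>{`// hypothetical sketch of the page-splitting idea; createItem is a stand-in,
// not srap's actual task API
const PAGE_SIZE = 25; // assumed page size

function createCategoryPageItems(category, createItem) {
    let pageCount = Math.ceil(category.productCount / PAGE_SIZE);

    for (let page = 1; page <= pageCount; page++) {
        createItem({
            id: "lcsc:category:" + category.id + ":page:" + page,
            tags: [ "lcsc:category" ],
            data: { categoryId: category.id, page: page }
        });
    }
}`}</code></pre>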
<p>More interesting, though, is that the scraping flow doesn&#39;t <em>have</em> to be this unidirectional - if the
total number of pages could only be learned from scraping the first page, it would have been entirely possible
for the <code>lcsc:scrapeCategory</code> task to create <em>additional</em> <code>lcsc:category</code> items!
The tag-based system makes recursive discovery like this a breeze, and because everything is keyed by a unique
identifier and persistent, loops are automatically prevented.</p>
<p>You&#39;ll probably have noticed that none of the above mentions HTTP requests. That&#39;s because srap
doesn&#39;t care - it has no idea what HTTP even is! All of the actual scraping <em>logic</em> is completely
defined by the configuration - and that&#39;s what makes it a codebase. <a
href="https://git.cryto.net/seekseek/scraper-config/src/branch/master/lib/lcsc/task/scrape-category.js">This</a> is the scraping logic for extracting products from an LCSC category, for example. This is also why each page is
its own item; that allows srap to rate-limit requests despite having absolutely no hooks into the HTTP library
being used, by virtue of limiting each task to 1 HTTP request.</p>
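<p>To give a feel for what such scraping logic looks like, here&#39;s a heavily simplified sketch of a
category-scraping task for an imaginary HTML-based source. The task signature and the <code>createItem</code>
helper are assumptions for the sake of illustration - the real LCSC logic linked above looks different:</p>
<pre><code>{`"use strict";

// a heavily simplified sketch, not the real scrape-category.js; the task
// signature and the createItem helper are assumptions for illustration
const bhttp = require("bhttp");
const cheerio = require("cheerio");

module.exports = async function scrapeCategoryPage({ data, createItem }) {
    // one HTTP request per task, so that srap can rate-limit tasks
    // without knowing anything about HTTP
    let response = await bhttp.get(
        "https://parts.example.com/category/" + data.categoryId + "?page=" + data.page
    );

    let $ = cheerio.load(String(response.body));

    $(".product-row").each((_index, element) => {
        let row = $(element);

        createItem({
            id: "examplesource:product:" + row.attr("data-product-id"),
            tags: [ "examplesource:product" ],
            data: {
                partNumber: row.find(".part-number").text().trim(),
                datasheetUrl: row.find(".datasheet-link").attr("href")
            }
        });
    });
};`}</code></pre>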
<p>There are more features in srap, like deliberately invalidating past scraping results, item merges, and &#39;out
of band&#39; task result storage, but these are the basic concepts that make the whole thing work. As you can
see, it&#39;s highly flexible, unopinionated, and easy to collaboratively maintain a scraper configuration for -
every task functions more or less independently.</p>
<h2 id="the-datasheet-search-frontend">The datasheet search frontend</h2>
<p>If you&#39;ve used <a href="https://seekseek.org/datasheets">the datasheet search</a>, you&#39;ve probably
noticed that it&#39;s <em>really</em> fast, it almost feels like it&#39;s all local. But no, your search queries
really <em>are</em> going to a server. So how can it be that fast?</p>
<p>It turns out to be surprisingly simple: by default, the search is a <em>prefix search</em> only. That means that
it will only search for items that <em>start with</em> the query you entered. This is usually what you want when you
search for part numbers, and it <em>also</em> has some very interesting performance implications - because a
prefix search can be done entirely on an index!</p>
<p>There&#39;s actually very little magic here - the PostgreSQL database that runs behind the frontend simply has a
(normalized) index on the column for the part number, and the server is doing a <code>LIKE
&#39;yourquery%&#39;</code> query against it. That&#39;s it! This generally yields a search result in under
2 milliseconds, i.e. nearly instantly. All it has to do is an index lookup, and those are <em>fast</em>.</p>
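<p>In other words, the server-side part of a search boils down to something like the sketch below - assuming
node-postgres here, and with made-up table, column and index names; <code>normalize</code> is a stand-in for
whatever normalization is applied to part numbers:</p>
<pre><code>{`// a rough sketch of the idea, assuming node-postgres ("pg"); the actual table,
// column and index names are different
// index: CREATE INDEX ... ON datasheets (part_number_normalized text_pattern_ops);

function normalize(partNumber) {
    // stand-in normalization; the real one may differ
    return partNumber.toLowerCase().replace(/[^a-z0-9]/g, "");
}

async function searchDatasheets(database, searchQuery) {
    let result = await database.query(
        "SELECT * FROM datasheets WHERE part_number_normalized LIKE $1 LIMIT 50",
        [ normalize(searchQuery) + "%" ]
    );

    return result.rows;
}`}</code></pre>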
<p>On the browser side, things aren&#39;t much more complicated. Every time the query changes, it makes a new search
request to the server, cancelling the old one if one was still in progress. When it gets results, it renders
them on the screen. That&#39;s it. There are no trackers on the site, no weird custom input boxes, nothing else
to slow it down. The result is a search that feels local :)</p>
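<p>In simplified form, that pattern looks something like the sketch below - the endpoint and the
<code>renderResults</code> function are made up for illustration, and the real UI does its rendering through
React instead:</p>
<pre><code>{`// a simplified sketch of the pattern, not the actual UI code; the endpoint
// and renderResults are made up for illustration
let currentController = null;

async function runSearch(query) {
    if (currentController != null) {
        currentController.abort(); // cancel the previous in-flight request, if any
    }

    currentController = new AbortController();

    try {
        let response = await fetch("/search?q=" + encodeURIComponent(query), {
            signal: currentController.signal
        });

        renderResults(await response.json());
    } catch (error) {
        if (error.name !== "AbortError") {
            throw error;
        }
    }
}

function renderResults(results) {
    // stand-in for the real (React-based) rendering
    document.querySelector("#results").textContent = JSON.stringify(results);
}`}</code></pre>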
<h2 id="the-source-code">The source code</h2>
<p>Right now, the source code for all of these things lives across three repositories:</p>
<ul>
<li><a href="https://git.cryto.net/joepie91/srap">joepie91/srap</a> - the scraping server.</li>
<li><a href="https://git.cryto.net/seekseek/scraper-config">seekseek/scraper-config</a> - the configuration and
scraping logic that&#39;s used for SeekSeek.</li>
<li><a href="https://git.cryto.net/seekseek/ui">seekseek/ui</a> - the search frontend (including search server!)
for SeekSeek.</li>
</ul>
<p>At the time of writing, documentation is still pretty lacking across these repositories, and the code in the srap
and UI repositories in particular is pretty rough! This will be improved upon quite soon, as SeekSeek becomes
more polished.</p>
<h2 id="final-words">Final words</h2>
<p>Of course, there are many more details that I haven&#39;t covered in this post, but hopefully this gives you an idea of how SeekSeek is put together, and why!</p>
<p>Has this post made you interested in working on SeekSeek, or maybe your own custom srap-based project? <a
href="https://matrix.to/#/#seekseek:pixie.town?via=pixie.town&amp;via=matrix.org&amp;via=librepush.net">Drop
by in the chat!</a> We&#39;d be happy to give you pointers :)</p>
</div>
</Layout>
);
};