diff --git a/src/app.js b/src/app.js
index ac928c6..de68127 100644
--- a/src/app.js
+++ b/src/app.js
@@ -110,4 +110,8 @@ app.get("/contact", (req, res) => {
 	res.render("contact");
 });
 
+app.get("/technology", (req, res) => {
+	res.render("technology");
+});
+
 module.exports = app;
diff --git a/src/views/_layout.jsx b/src/views/_layout.jsx
index 7070189..ff6b8bf 100644
--- a/src/views/_layout.jsx
+++ b/src/views/_layout.jsx
@@ -31,6 +31,10 @@ module.exports = function Layout({ children }) {
 					Come chat with us!
 				</a>
+				<a href="/technology">
+					Technology
+				</a>
+
 				<a href="/contact">
 					Contact/Abuse
diff --git a/src/views/technology.jsx b/src/views/technology.jsx
new file mode 100644
index 0000000..3314843
--- /dev/null
+++ b/src/views/technology.jsx
@@ -0,0 +1,170 @@
+"use strict";
+
+const React = require("react");
+
+const Layout = require("./_layout");
+
+module.exports = function Technology() {
+	return (
+		<Layout>
+			<h1>The technology</h1>
+

So... what makes SeekSeek tick? Let's get the boring bits out of the way first:

+			<ul>…</ul>
+

None of that is really very interesting, but people always ask about it. Let's move on to the interesting + bits!

+

The goal

+

Before we can talk about the technology, we need to talk about what the technology was built for. + SeekSeek is radical software. From the ground up, it was + designed to be FOSS, collaborative and community-driven, non-commercial, ad-free, and to improve the world - in + the case of SeekSeek specifically, to improve on the poor state of keyword-only searches by providing highly + specialized search engines instead!

+

But... that introduces some unusual requirements:

+			<ul>…</ul>
+

At the time of writing, there's only a datasheet search engine. However, the long-term goal is for SeekSeek + to become a large collection of specialized search engines - each one with a tailor-made UI that's + ideal for the thing being searched through. So all of the above needs to be satisfied not just for a datasheet + search engine, but for a potentially unlimited series of search engines, many of which are not even on + the roadmap yet!

+

And well, the very short version is that none of the existing options that I've evaluated even came + close to meeting these requirements. Existing scraping stacks, job queues, and so on tend to very much + be designed for corporate environments with tight control over who works on what. That wasn't an option + here. So let's talk about what we ended up with instead!

+

The scraping server

+

The core component in SeekSeek is the 'scraping server' - an experimental project called srap that was built specifically for SeekSeek; though also + designed to be more generically useful. You can think of srap as a persistent job queue that's + optimized for scraping.

+

So what does that mean? The basic idea behind srap is that you have a big pile of "items" - each item + isn't much more than a unique identifier and some 'initial data' to represent the work to be done. + Each item can have zero or more 'tags' assigned, which are just short strings. Crucially, none of these + items do anything yet - they're really just a mapping from an identifier to some arbitrarily-shaped + JSON.

+
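To make that concrete, an item could look something like the sketch below. The field names here are hypothetical - srap's actual storage format may differ:

```js
// A hypothetical srap-style item: an identifier, some initial data,
// and tags. The item does nothing by itself - tasks act on it later.
const item = {
	id: "lcsc:product:C123456",     // unique identifier (illustrative)
	data: { productID: "C123456" }, // arbitrarily-shaped JSON payload
	tags: [ "lcsc:product" ]        // short strings that tasks match on
};
```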

The real work starts with the scraper configuration. Even though it's called a + 'configuration', it's really more of a codebase - you can find the configuration that + SeekSeek uses here. You'll notice that it defines a number of tasks + and seed items. The seed items are simply inserted automatically if they don't exist yet, and define + the 'starting point' for the scraper.

+
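A seed item, then, might be as simple as this - again a hypothetical shape, where a tag mapping (shown further down) would tie its tag to a discovery task:

```js
// Hypothetical seed item: inserted automatically if it doesn't exist
// yet. Its tag is what causes a discovery task to run against it.
const seedItem = {
	id: "lcsc:home",        // illustrative identifier
	data: {},               // nothing to know yet - this is the root
	tags: [ "lcsc:home" ]   // illustrative tag, mapped to a task
};
```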

The tasks, however, define what the scraper does. Every task represents one specific operation in the + scraping process; typically, there will be multiple tasks per source. One to find product categories, one to + extract products from a category listing, one to extract data from a product page, and so on. Each of these + tasks has its own concurrency settings, as well as a TTL (Time-To-Live) that defines after how long the scraper + should revisit it.

+
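A task definition could plausibly look like the following sketch - the key names (ttl, parallelTasks) are illustrative rather than srap's real schema:

```js
// A hypothetical task definition: one operation in the scraping
// process, with its own concurrency settings and revisit TTL.
module.exports = {
	name: "lcsc:scrapeProduct",    // illustrative task name
	ttl: 14 * 24 * 60 * 60 * 1000, // revisit items after ~2 weeks
	parallelTasks: 1,              // how many runs may happen at once
	run: async function (item) {
		// ... fetch and parse a single product page here ...
		return { /* scraped data */ };
	}
};
```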

Finally, what wires it all together are the tag mappings. These define what tasks should be executed for + what tags - or more accurately, for all the items that are tagged with those tags. Tags associated with + items are dynamic, they can be added or removed by any scraping task. This provides a huge amount of + flexibility, because any task can essentially queue any other task, just by giving an item the right + tag. The scraping server then makes sure that it lands at the right spot in the queue at the right time - the + task itself doesn't need to care about any of that.

+
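The idea behind tag mappings can be sketched as "for every item carrying tag X, run task Y". The shape below and the non-lcsc task names are illustrative, not the real configuration format:

```js
// Hypothetical tag mappings: items tagged "lcsc:category" get the
// lcsc:scrapeCategory task run against them, and so on. Any task can
// add one of these tags to an item, thereby queueing another task.
const tagMappings = {
	"lcsc:home": [ "lcsc:findCategories" ],
	"lcsc:category": [ "lcsc:scrapeCategory" ],
	"lcsc:product": [ "lcsc:scrapeProduct" ],  // illustrative
	"datasheet": [ "datasheet:fetchMetadata" ] // illustrative
};
```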

Here's a practical example, from the datasheet search tasks:

+			<ul>…</ul>
+

One thing that's not mentioned above is that lcsc:scrapeCategory doesn't actually + scrape all of the items for a category - it just scrapes a specific page of them! The initial + lcsc:findCategories task would have created as many of such 'page tasks' as there are pages + to scrape, based on the amount of items a category is said to have.

+
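That fan-out might look roughly like this, assuming hypothetical createItem and fetchCategoryList helpers (srap's real API may differ):

```js
const ITEMS_PER_PAGE = 25; // illustrative page size

// Hypothetical version of lcsc:findCategories: one 'page item' is
// created per page, based on the reported item count of the category.
async function findCategories({ createItem }) {
	const categories = await fetchCategoryList(); // assumed scraping helper
	for (const category of categories) {
		const pageCount = Math.ceil(category.itemCount / ITEMS_PER_PAGE);
		for (let page = 1; page <= pageCount; page++) {
			await createItem({
				id: `lcsc:category:${category.id}:page:${page}`,
				data: { categoryID: category.id, page: page },
				tags: [ "lcsc:category" ]
			});
		}
	}
}
```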

More interesting, though, is that the scraping flow doesn't have to be this unidirectional - if the + total amount of pages could only be learned from scraping the first page, it would have been entirely possible + for the lcsc:scrapeCategory task to create additional lcsc:category items! + The tag-based system makes recursive discovery like this a breeze, and because everything is keyed by a unique + identifier and persistent, loops are automatically prevented.

+
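The reverse direction could be sketched like this, with page 1 discovering and queueing the remaining pages itself; again a hypothetical sketch, not srap's actual API:

```js
// Hypothetical recursive variant: page 1 learns the total page count
// and creates the remaining page items. Because item IDs are unique
// and persistent, re-creating an existing item is a no-op, so this
// can't loop forever.
async function scrapeCategory({ item, createItem }) {
	const { totalPages } = await scrapePage(item.data); // assumed helper
	if (item.data.page === 1) {
		for (let page = 2; page <= totalPages; page++) {
			await createItem({
				id: `lcsc:category:${item.data.categoryID}:page:${page}`,
				data: { ...item.data, page: page },
				tags: [ "lcsc:category" ]
			});
		}
	}
}
```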

You'll probably have noticed that none of the above mentions HTTP requests. That's because srap + doesn't care - it has no idea what HTTP even is! All of the actual scraping logic is completely + defined by the configuration - and that's what makes it a codebase. This is the scraping logic for extracting products from an LCSC category, for example. This is also why each page is + its own item; that allows srap to rate-limit requests despite having absolutely no hooks into the HTTP library + being used, by virtue of limiting each task to 1 HTTP request.

+
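In other words, a task is just an async function that happens to perform a single HTTP request - something like the sketch below, where the URL and parsing helper are assumptions:

```js
// Because each run of this task performs exactly one HTTP request,
// srap can rate-limit traffic purely by scheduling task runs - it
// never needs hooks into the HTTP library itself.
async function scrapeCategory({ item }) {
	const url = `https://example.com/category/${item.data.categoryID}?page=${item.data.page}`;
	const response = await fetch(url);  // any HTTP client would work here
	const html = await response.text();
	return parseProductListing(html);   // assumed parsing helper
}
```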

There are more features in srap, like deliberately invalidating past scraping results, item merges, and 'out + of band' task result storage, but these are the basic concepts that make the whole thing work. As you can + see, it's highly flexible, unopinionated, and easy to collaboratively maintain a scraper configuration for - + every task functions more or less independently.

+

The datasheet search frontend

+

If you've used the datasheet search, you've probably + noticed that it's really fast, it almost feels like it's all local. But no, your search queries + really are going to a server. So how can it be that fast?

+

It turns out to be surprisingly simple: by default, the search is a prefix search only. That means that + it will only search for items that start with the query you entered. This is usually what you want when you + search for part numbers, and it also has some very interesting performance implications - because a + prefix search can be done entirely on an index!

+

There's actually very little magic here - the PostgreSQL database that runs behind the frontend simply has a + (normalized) index on the column for the part number, and the server is doing a LIKE + 'yourquery%' query against it. That's it! This generally yields a search result in under + 2 milliseconds, ie. nearly instantly. All it has to do is an index lookup, and those are fast.

+
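A minimal sketch of such a query using the pg client, with a hypothetical parts table and column; one detail worth knowing is that on a non-C locale, the index needs the text_pattern_ops operator class for LIKE to be able to use it:

```js
const { Pool } = require("pg");

const pool = new Pool(); // connection settings come from the environment

async function searchParts(query) {
	// Escape LIKE wildcards in the user's input, then append % so the
	// query matches everything starting with the (normalized) prefix;
	// the $1 placeholder keeps it safe from SQL injection.
	const prefix = query.toLowerCase().replace(/[\\%_]/g, "\\$&") + "%";
	const result = await pool.query(
		"SELECT * FROM parts WHERE part_number_normalized LIKE $1 LIMIT 50",
		[ prefix ]
	);
	return result.rows;
}
```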

On the browser side, things aren't much more complicated. Every time the query changes, it makes a new search + request to the server, cancelling the old one if one was still in progress. When it gets results, it renders + them on the screen. That's it. There are no trackers on the site, no weird custom input boxes, nothing else + to slow it down. The result is a search that feels local :)

+
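That behaviour maps directly onto the standard AbortController pattern in the browser; a sketch, with the /search endpoint and renderResults function as assumptions:

```js
let currentController = null;

async function search(query) {
	// Cancel the previous request, if one is still in flight.
	if (currentController !== null) {
		currentController.abort();
	}
	currentController = new AbortController();
	try {
		const response = await fetch(`/search?q=${encodeURIComponent(query)}`, {
			signal: currentController.signal
		});
		renderResults(await response.json()); // assumed rendering function
	} catch (error) {
		if (error.name !== "AbortError") { throw error; } // aborts are expected
	}
}
```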

The source code

+

Right now, the source code for all of these things lives across three repositories:

+			<ul>…</ul>
+

At the time of writing, documentation is still pretty lacking across these repositories, and the code in the srap + and UI repositories in particular is pretty rough! This will be improved upon quite soon, as SeekSeek becomes + more polished.

+ +

Final words

+

Of course, there are many more details that I haven't covered in this post, but hopefully this gives you an idea of how SeekSeek is put together, and why!

+

Has this post made you interested in working on SeekSeek, or maybe your own custom srap-based project? Drop + by in the chat! We'd be happy to give you pointers :)

+ +
+		</Layout>
+	);
+};