Automatically migrated from Gitolite
Sven Slootweg 96c450d084 Release 0.0.1 10 years ago
lib API changes and documentation, preparing for npm release 10 years ago
.gitignore Further tweaks and publishing script 10 years ago
README.md Documentation tweaks. 10 years ago
gulpfile.js Initial commit 10 years ago
index.coffee API changes and documentation, preparing for npm release 10 years ago
package.json Further tweaks and publishing script 10 years ago
publish.sh Further tweaks and publishing script 10 years ago
sample.cdx Initial commit 10 years ago
test.coffee API changes and documentation, preparing for npm release 10 years ago

README.md

cdx

A simple streaming CDX file parser. It parses CDX files (in particular, those that index WARC files) in the format specified by the Internet Archive. All items in the CDX field legend are supported, plus the S field (compressedRecordSize).

Scope and development status

cdx currently only reads (compliant) CDX streams. Writing CDX streams will likely be added in the future, but is not supported yet. Error handling is nearly non-existent: you are expected to provide a compliant CDX stream.

Installation

npm install --save cdx

Usage

cdx is a streaming parser. It takes a CDX byte stream as input (regardless of the source), and outputs an object stream of CDXRecord objects with the named attributes set to the corresponding values from the CDX stream. Additionally, a plain object containing these attributes is available as the data attribute, for easy (JSON) serialization.

The signature is automatically parsed from the first line of the CDX data. Specifying a custom signature is not currently supported. If a field is not specified (that is, it is either not listed in the signature or its value is -), it will simply not be set on the CDXRecord.

var cdx = require("cdx"),
	fs = require("fs");

fs.createReadStream("sample.cdx")
	.pipe(cdx())
	.pipe(...);
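
For intuition about the signature handling described above, here is a rough standalone sketch. It is not cdx's actual implementation: parseSignature, parseRecord and the FIELD_NAMES table are hypothetical names, the table covers only the letters of the common 11-field signature, and the sample record is made up. It assumes the first character of the signature line is the field delimiter.

```javascript
// Hypothetical subset of the CDX legend. The property names follow the
// "Fields" list in this README; the table itself is an assumption.
const FIELD_NAMES = {
  N: "massagedURL",
  b: "date",
  a: "originalURL",
  m: "originalMimeType",
  s: "responseCode",
  k: "newStyleChecksum",
  r: "originalRedirect",
  M: "metaTags",
  S: "compressedRecordSize",
  V: "compressedARCFileOffset",
  g: "fileName"
};

// The first character of the signature line is the delimiter, followed by
// the literal "CDX" and then one letter per column.
function parseSignature(headerLine) {
  const delimiter = headerLine[0];
  const parts = headerLine.slice(1).split(delimiter);
  if (parts[0] !== "CDX") {
    throw new Error("Not a CDX signature line");
  }
  return { delimiter, letters: parts.slice(1) };
}

// Map one data line onto named attributes, skipping unknown letters and
// fields whose value is "-" (meaning "not specified").
function parseRecord(line, signature) {
  const values = line.split(signature.delimiter);
  const record = {};
  signature.letters.forEach((letter, index) => {
    const name = FIELD_NAMES[letter];
    const value = values[index];
    if (name !== undefined && value !== undefined && value !== "-") {
      record[name] = value;
    }
  });
  return record;
}

// Made-up sample data, for illustration only.
const signature = parseSignature(" CDX N b a m s k r M S V g");
const record = parseRecord(
  "example.com/ 20140101000000 http://example.com/ text/html 200 ABC - - 1234 5678 example.warc.gz",
  signature
);
console.log(record);
```

In this sketch, originalRedirect and metaTags are absent from the resulting object because their values in the sample line are -.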

An example that parses the sample CDX file (sample.cdx), 'picks out' the serializable data, and then outputs it to stdout as serialized JSON, can be found in test.coffee (you'll need to install devDependencies first to actually run that file, though).
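
The step of picking out the data attribute and serializing it could look roughly like the following. This is a sketch, not the contents of the repository's example file. toJSONLine and createSerializer are hypothetical helpers, and the Readable.from(...) source is a made-up stand-in for the object stream that cdx() emits, so that the snippet runs without the package installed.

```javascript
const { Readable, Transform } = require("stream");

// Serialize the plain `data` object of a record as one JSON line.
function toJSONLine(record) {
  return JSON.stringify(record.data) + "\n";
}

// Transform stream: accepts record objects, emits JSON text.
function createSerializer() {
  return new Transform({
    writableObjectMode: true,
    transform(record, _encoding, callback) {
      callback(null, toJSONLine(record));
    }
  });
}

// Stand-in for the output of cdx(); real records would come from
// fs.createReadStream("sample.cdx").pipe(cdx()).
const fakeRecords = Readable.from([
  { data: { originalURL: "http://example.com/", responseCode: "200" } },
  { data: { originalURL: "http://example.com/about", responseCode: "404" } }
]);

fakeRecords.pipe(createSerializer()).pipe(process.stdout);
```

With the real parser, the same serializer would take the place of the `...` in the pipeline above.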

Fields

All fields are self-explanatory, hopefully. These are just adapted from the legend provided by the Internet Archive, so I really have no idea what most of these do.

  • compressedRecordSize (for .warc.gz, this is the gzipped size of the record)
  • compressedDATFileOffset
  • compressedARCFileOffset (for .warc.gz, this is the gzipped starting position of the record; combine it with compressedRecordSize to get the ending position)
  • uncompressedDATFileOffset
  • uncompressedARCFileOffset
  • ARCDocumentLength
  • oldStyleChecksum
  • newStyleChecksum
  • canonicalizedUrl
  • canonicalizedFrame
  • canonicalizedHost
  • canonicalizedImage
  • canonicalizedJumpPoint
  • canonicalizedLink
  • canonicalizedPath
  • canonicalizedRedirect
  • canonicalizedHrefURL
  • canonicalizedSrcURL
  • canonicalizedScriptURL
  • originalMimeType (for .warc.gz, this is the original mimetype of the document as specified by the origin webserver)
  • originalURL (for .warc.gz, this is the original URL that the document was retrieved from)
  • originalFrame
  • originalHost
  • originalImage
  • originalJumpPoint
  • originalLink
  • originalPath
  • originalRedirect
  • originalHrefURL
  • originalSrcURL
  • originalScriptURL
  • date (for .warc.gz, this is the retrieval date of the record)
  • IP
  • fileName (for .warc.gz, this is the path to the WARC file that this record lives in - may not be useful, as it may refer to a path on a different filesystem)
  • port
  • responseCode (for .warc.gz, this is the HTTP status code encountered when retrieving the document)
  • title
  • metaTags
  • massagedURL
  • languageString
  • uniqueness
  • newsGroup
  • rulespaceCategory
  • multiColumnLanguageDescription
  • someWeirdFBISWhatsChangedKindaThing (don't ask...)
  • comment
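
As an illustration of combining compressedARCFileOffset with compressedRecordSize, the sketch below computes the byte range of one record and reads just that slice of a .warc.gz file. recordByteRange and openRecord are hypothetical helpers, not part of cdx. The sketch assumes that field values arrive as numeric strings and that each record is stored as its own gzip member, so the slice can be decompressed by itself.

```javascript
const fs = require("fs");
const zlib = require("zlib");

// Compute the inclusive byte range of a record inside the .warc.gz file.
function recordByteRange(record) {
  const start = Number(record.compressedARCFileOffset);
  const size = Number(record.compressedRecordSize);
  if (!Number.isInteger(start) || !Number.isInteger(size) || start < 0 || size <= 0) {
    throw new RangeError("Record has no usable offset/size");
  }
  // fs.createReadStream treats both `start` and `end` as inclusive.
  return { start: start, end: start + size - 1 };
}

// Open a stream of the decompressed record. `warcPath` is supplied by the
// caller, since the fileName field may refer to another filesystem.
function openRecord(warcPath, record) {
  return fs
    .createReadStream(warcPath, recordByteRange(record))
    .pipe(zlib.createGunzip());
}

console.log(recordByteRange({ compressedARCFileOffset: "5678", compressedRecordSize: "1234" }));
```

A caller would then use something like openRecord("/path/to/file.warc.gz", record).pipe(process.stdout), with a path that exists locally.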

Contributing

Contributions welcome! Please file bugs on GitHub, and target pull requests at the develop branch. Thank you!