From 69034f2ac50a634beecc01f84b13a70754de84d7 Mon Sep 17 00:00:00 2001
From: Sven Slootweg
Date: Thu, 29 May 2014 21:52:41 +0200
Subject: [PATCH] Add an explicit exception for example.com, as it is not handled correctly by IANA. Fix multi-response processing, adding a `never_cut` argument. Documentation updates and version bump to 2.2. Fixes #17.

---
 README.md | 4 ++++
 doc/usage.html | 4 ++--
 doc/usage.zpy | 9 ++++++---
 pythonwhois/net.py | 19 +++++++++++++++----
 setup.py | 2 +-
 test/data/example.com | 25 +++++++++++++++++++++++++
 test/target_default/example.com | 1 +
 test/target_normalized/example.com | 1 +
 8 files changed, 55 insertions(+), 10 deletions(-)
 create mode 100644 test/data/example.com
 create mode 100644 test/target_default/example.com
 create mode 100644 test/target_normalized/example.com

diff --git a/README.md b/README.md
index 1fe676d..ff14206 100644
--- a/README.md
+++ b/README.md
@@ -37,6 +37,10 @@ The manual (including install instructions) can be found in the doc/ directory.
 * Will detect and warn about any changes in parsed data compared to previous runs
 * Guarantees that previously working WHOIS parsing doesn't unintentionally break when changing code
 
+## Important update notes
+
+*2.2.0 and up*: The internal workings of `get_whois_raw` have been changed, to better facilitate parsing of WHOIS data from registries that may return multiple partial matches for a query, such as `whois.verisign-grs.com`. This change means that, by default, `get_whois_raw` will now strip out the part of such a response that does not pertain directly to the requested domain. If your application requires an unmodified raw WHOIS response and is calling `get_whois_raw` directly, you should use the new `never_cut` parameter to keep pythonwhois from doing this post-processing. As this is a potentially breaking behaviour change, the minor version has been bumped.
+
 ## It doesn't work!
 
 * It doesn't work at all? 
diff --git a/doc/usage.html b/doc/usage.html
index 13889a2..870ea7e 100644
--- a/doc/usage.html
+++ b/doc/usage.html
@@ -180,7 +180,7 @@
-

Using pythonwhois

This is a quick usage guide; pythonwhois is pretty simple.

Table of contents

Normalization

Before you start, it's important to understand the normalization functionality in pythonwhois. Since some WHOIS servers return data in all-uppercase or all-lowercase, and some registrants simply use the incorrect case themselves, reading WHOIS data can be a bit unpleasant.
pythonwhois attempts to solve this problem by optionally 'normalizing' WHOIS data. Depending on the kind of field, the parser will try to create a readable and consistent version of the value. The pwhois command-line utility uses normalization by default; when using the Python module it's disabled by default.
Normalization isn't perfect, and you shouldn't rely on it for technical purposes. It's intended for increasing human readability only. If you work with a lot of WHOIS data, it's recommended to turn off normalization or do your own post-processing.
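As a rough illustration of what per-field normalization means, the toy function below reproduces two differences visible between this patch's `test/target_default/example.com` and `test/target_normalized/example.com` files: nameservers are lowercased and registrar names are capitalized word by word. The function name `normalize_value` and its rules are assumptions made for this sketch; it is not the parser that pythonwhois actually uses.

```python
def normalize_value(field, value):
    """Toy per-field normalizer (hypothetical, not pythonwhois's own code)."""
    if field == "nameservers":
        # Hostnames are case-insensitive, so lowercase is a readable default.
        return value.lower()
    if field == "registrar":
        # Capitalize each space-separated word; text after a hyphen stays lowercase.
        return " ".join(word.capitalize() for word in value.split(" "))
    # Any other field is returned unchanged in this sketch.
    return value


print(normalize_value("nameservers", "A.IANA-SERVERS.NET"))
print(normalize_value("registrar", "RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY"))
```

The second call yields `Reserved-internet Assigned Numbers Authority`, which shows why normalized values are meant for human reading rather than exact matching.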

From the commandline

pwhois [--raw] [--json] [-f PATH] DOMAIN
pwhois is the WHOIS tool that is included with pythonwhois. It's really just a script that you can run from your terminal, and that gives you nicely formatted WHOIS output. Normalization is turned on in pwhois by default, so it will try to make the output more readable (by fixing capitalization and such).
Example: Using pwhois
Code:
sh$ pwhois cryto.net
Output:
Status            : clientTransferProhibited
+		

+[...]
There are several optional arguments that you can pass to pwhois to make it behave differently.
--raw
When you use this flag, pwhois will not attempt to parse the WHOIS data; it'll just follow redirects and output the raw data, delimited by double dashes (--).
--json
This flag will make pwhois output JSON instead of human-readable output. While not recommended, you can use this if you need parsed data in a non-Python application.
-f PATH
This will make pwhois read and parse WHOIS data from a specified file, instead of actually contacting a WHOIS server. Useful if you get your WHOIS data elsewhere.
Important: Note that when using -f PATH, pwhois will still expect a domain to be specified! What you enter here doesn't really matter; you can also just specify a single dot . for the domain.
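If you do consume JSON from another program, the date fields need converting back from strings. The sketch below assumes that `pwhois --json` emits the same shape as the `test/target_default/example.com` file in this patch, with dates as `YYYY-MM-DDTHH:MM:SS` strings; that is an assumption, and the helper name `load_pwhois_json` is made up for the example.

```python
import json
from datetime import datetime

# Trimmed sample, shaped like test/target_default/example.com in this patch.
SAMPLE_JSON = """
{"status": ["clientDeleteProhibited", "clientTransferProhibited", "clientUpdateProhibited"],
 "updated_date": ["2013-08-14T00:00:00"],
 "contacts": {"admin": null, "tech": null, "registrant": null, "billing": null},
 "nameservers": ["A.IANA-SERVERS.NET", "B.IANA-SERVERS.NET"],
 "expiration_date": ["2014-08-13T00:00:00"],
 "creation_date": ["1992-01-01T00:00:00"]}
"""

DATE_KEYS = ("creation_date", "expiration_date", "updated_date")


def load_pwhois_json(text):
    """Parse JSON text and turn the date strings back into datetime objects."""
    data = json.loads(text)
    for key in DATE_KEYS:
        if key in data:  # every key except 'contacts' may be absent
            data[key] = [datetime.strptime(value, "%Y-%m-%dT%H:%M:%S")
                         for value in data[key]]
    return data


parsed = load_pwhois_json(SAMPLE_JSON)
print(parsed["expiration_date"][0].year)
```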

From your Python application

To start using pythonwhois, use import pythonwhois.
pythonwhois.get_whois(domain[, normalized=[]])
Retrieves and parses WHOIS data for a specified domain. Raises pythonwhois.shared.WhoisException if no root server for the TLD could be found.

Arguments

domain
The domain to WHOIS.
normalized
Optional. What data to normalize. By default, no data will be normalized. You can specify either a list of keys to normalize (see also the result reference below), an empty list (to turn off normalization), or True (to turn on normalization for all supported fields).

Returns

A nested structured object, consisting of dicts and lists. The only key that is always present is contacts, but the keys inside the dict that it contains may not be.
id
The Domain ID.
status
A list of current statuses of the domain at the registrar. May contain any string value.
creation_date
A list of datetime.datetime objects representing the creation date(s) of the domain.
expiration_date
A list of datetime.datetime objects representing the expiration date(s) of the domain.
updated_date
A list of datetime.datetime objects representing the update date(s) of the domain. Note that what an 'update date' entails differs between WHOIS servers. For some, it means the last renewal date. For others, it means the last registrant info update. For yet others, it means the last update of their WHOIS database as a whole. This key is unlikely to be useful, unless you're trying to plot WHOIS data changes over time.
registrar
A list of registrar names. May contain any string value.
whois_server
A list of WHOIS servers referred to. This is unlikely to be a useful list.
nameservers
A list of nameservers for the domain, as indicated by the WHOIS server.
emails
A list of e-mail addresses for the domain. This list does not include e-mail addresses from registrant data, only e-mail addresses from other places in the WHOIS data such as abuse report instructions.
contacts
A dict containing contacts for the domain, each also a dict. Fields for these contacts are listed further down. If a specific type of contact was not listed for the domain, the key for it will still exist, but it will contain None.
registrant
The registrant or domain holder.
tech
The technical contact for the domain. May be either the registrar, or a party related to the registrant.
admin
The administrative contact for the domain.
billing
The billing contact for the domain.
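Because most keys are optional and most values are lists, code that reads a result has to check before indexing. The following is a minimal sketch of that pattern; `summarize_whois` is a hypothetical helper, and the sample dict is a trimmed, hand-written stand-in for what `pythonwhois.get_whois` returns (values taken from the example.com test data in this patch).

```python
from datetime import datetime


def summarize_whois(data):
    """Build a short list of summary lines from a parsed WHOIS result."""
    lines = []
    registrar = data.get("registrar", [])
    if registrar:
        lines.append("Registrar: " + registrar[0])
    expires = data.get("expiration_date", [])
    if expires:
        lines.append("Expires: " + expires[0].strftime("%Y-%m-%d"))
    # Contact keys are documented as present, with None when the contact is not listed.
    for role, contact in sorted(data.get("contacts", {}).items()):
        if contact is None:
            lines.append("%s contact: not listed" % role)
        else:
            lines.append("%s contact: %s" % (role, contact.get("name", "(no name)")))
    return lines


sample = {
    "registrar": ["RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY"],
    "expiration_date": [datetime(2014, 8, 13)],
    "contacts": {"admin": None, "tech": None, "registrant": None, "billing": None},
}
for line in summarize_whois(sample):
    print(line)
```

With the library installed and network access available, the same helper could be fed `pythonwhois.get_whois("cryto.net")` instead of the sample dict.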

Contact fields

These are the fields that any contact dict may contain. If certain information for a contact was not found, the corresponding key will be absent.
Important: Note that any of these fields may consist of multiple lines, although the address field is the only one that is likely to consist of multiple lines.
handle
The NIC handle for the contact.
name
The full name of the contact.
organization
The organization or company that the contact belongs to.
street
The street address of the contact (or organization).
postalcode
The postal code of the contact (or organization). This may or may not include a country prefix.
city
The city of the contact (or organization).
state
The state, province, or region of the contact (or organization). The actual values for this field vary widely.
country
The country of the contact (or organization).
email
The e-mail address of the contact (or organization).
phone
The phone number of the contact (or organization), including extension where applicable.
fax
The fax number of the contact (or organization), including extension where applicable.
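A contact dict can therefore be missing any field, and any field can span several lines. One way to print such a dict is sketched below; `format_contact` and the sample contact are invented for illustration and are not part of pythonwhois.

```python
CONTACT_FIELDS = ("handle", "name", "organization", "street", "postalcode",
                  "city", "state", "country", "email", "phone", "fax")


def format_contact(contact):
    """Render a contact dict as aligned text, skipping absent fields."""
    if contact is None:
        return "(not listed)"
    out = []
    for field in CONTACT_FIELDS:
        if field not in contact:
            continue  # absent keys mean the information was not found
        value_lines = contact[field].splitlines() or [""]
        out.append("%-12s: %s" % (field, value_lines[0]))
        # Indent continuation lines so they line up under the first value line.
        for extra in value_lines[1:]:
            out.append(" " * 14 + extra)
    return "\n".join(out)


# Hypothetical contact with a two-line street value.
contact = {"name": "Jane Doe", "street": "1 Example Street\nSuite 2", "country": "NL"}
print(format_contact(contact))
```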

When you need more control...
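The `never_cut` parameter that this patch adds to `pythonwhois.net.get_whois_raw` decides whether a multi-record response is trimmed to the record for the requested domain. The sketch below imitates that decision on a made-up response. The `Domain Name:` match is taken from the `net.py` hunk in this patch; splitting records on blank lines, the use of `re.escape`, and the names `select_record` and `pick_response` are assumptions of this sketch, since the hunk does not show how the library iterates over records.

```python
import re

# Hypothetical multi-record response: header, a partial match, the exact match, footer.
SAMPLE = (
    "Whois Server Version 2.0\n"
    "\n"
    "   Domain Name: EXAMPLE.COM.EVIL.TEST\n"
    "   Registrar: SOME REGISTRAR\n"
    "\n"
    "   Domain Name: EXAMPLE.COM\n"
    "   Registrar: RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY\n"
    "\n"
    "Footer text\n"
)


def select_record(response, domain):
    """Return the record whose 'Domain Name' line matches the domain exactly."""
    pattern = "Domain Name: %s\n" % re.escape(domain.upper())
    for record in response.split("\n\n"):  # assumption: blank lines separate records
        if re.search(pattern, record):
            return record
    return response  # nothing matched, so keep the response unchanged


def pick_response(response, domain, never_cut=False):
    """Keep the raw response when never_cut is True, otherwise trim it."""
    return response if never_cut else select_record(response, domain)


print(pick_response(SAMPLE, "example.com"))
```

Callers that need the header, footer, and alternative matches would pass `never_cut=True` and get the response back untouched.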

diff --git a/doc/usage.zpy b/doc/usage.zpy
index 6e671d0..cb78f18 100644
--- a/doc/usage.zpy
+++ b/doc/usage.zpy
@@ -123,7 +123,7 @@ To start using pythonwhois, use `import pythonwhois`.
 
         These are the fields that any contact dict may contain. If certain information for a contact was not found, the corresponding key will be absent.
 
-        ! Note that any of these fields may consist of multiple lines, although the `address` field is the only one that is **likely** to consist of multiple lines.
+        ! Note that any of these fields **may** consist of multiple lines, although the `address` field is the only one that is **likely** to consist of multiple lines.
 
         handle::
             The NIC handle for the contact.
@@ -160,7 +160,7 @@ To start using pythonwhois, use `import pythonwhois`.
 
 ## When you need more control...
 
-^ pythonwhois.net.get_whois_raw(**domain**[, **server=""**, **rfc3490=True**])
+^ pythonwhois.net.get_whois_raw(**domain**[, **server=""**, **rfc3490=True**, **never_cut=False**])
 
     Retrieves the raw WHOIS data for the specified domain, and returns it as a list of responses (one element for each WHOIS server queried). This method will keep following redirects, until it ends up at the right server (and all responses it picks up in the meantime, will be included). Raises `pythonwhois.shared.WhoisException` if no root server for the TLD could be found.
 
@@ -171,7 +171,10 @@ To start using pythonwhois, use `import pythonwhois`.
         **Optional.** The WHOIS server to query. When not specified, it will default to the appropriate WHOIS server for the TLD.
 
     rfc3490::
-        **Optional.** If set to `True` a given domain will be encoded through the **toASCII** method as documented in RFC3490 before its submission to the whois service. If the domain isn't given in unicode, the method will handle the decoding by itself.
+        **Optional.** If set to `True`, a given domain will be encoded through the `toASCII` method as documented in {http://www.ietf.org/rfc/rfc3490.txt}(RFC3490) before its submission to the WHOIS service. If the domain isn't supplied in unicode, the method will handle the decoding by itself.
+
+    never_cut::
+        **Optional.** If set to `True`, pythonwhois will never strip out data from the raw WHOIS responses, **even** if that data relates to a partial match, rather than the requested domain.
 
 ^ pythonwhois.net.get_root_server(**domain**)
 
diff --git a/pythonwhois/net.py b/pythonwhois/net.py
index b0dd315..982162e 100644
--- a/pythonwhois/net.py
+++ b/pythonwhois/net.py
@@ -2,11 +2,13 @@ import socket, re
 from codecs import encode, decode
 from . import shared
 
-def get_whois_raw(domain, server="", previous=[], rfc3490=True):
+def get_whois_raw(domain, server="", previous=[], rfc3490=True, never_cut=False):
     # Sometimes IANA simply won't give us the right root WHOIS server
     exceptions = {
         ".ac.uk": "whois.ja.net",
-        ".ps": "whois.pnina.ps"
+        ".ps": "whois.pnina.ps",
+        # The following is a bit hacky, but IANA won't return the right answer for example.com because it's a direct registration.
+        "example.com": "whois.verisign-grs.com"
     }
 
     if rfc3490:
@@ -33,7 +35,14 @@ def get_whois_raw(domain, server="", previous=[], rfc3490=True):
     else:
         request_domain = domain
     response = whois_request(request_domain, target_server)
-    new_list = [response] + previous
+    if never_cut:
+        # If the caller has requested to 'never cut' responses, he will get the original response from the server (this is
+        # useful for callers that are only interested in the raw data). Otherwise, if the target is verisign-grs, we will
+        # select the data relevant to the requested domain, and discard the rest, so that in a multiple-option response the
+        # parsing code will only touch the information relevant to the requested domain. The side-effect of this is that
+        # when `never_cut` is set to False, any verisign-grs responses in the raw data will be missing header, footer, and
+        # alternative domain options (this is handled a few lines below, after the verisign-grs processing).
+        new_list = [response] + previous
     if target_server == "whois.verisign-grs.com":
         # VeriSign is a little... special. As it may return multiple full records and there's no way to do an exact query,
         # we need to actually find the correct record in the list.
@@ -41,6 +50,8 @@ def get_whois_raw(domain, server="", previous=[], rfc3490=True):
             if re.search("Domain Name: %s\n" % domain.upper(), record):
                 response = record
                 break
+    if never_cut == False:
+        new_list = [response] + previous
     for line in [x.strip() for x in response.splitlines()]:
         match = re.match("(refer|whois server|referral url|whois server|registrar whois):\s*([^\s]+\.[^\s]+)", line, re.IGNORECASE)
         if match is not None:
@@ -57,7 +68,7 @@ def get_root_server(domain):
         if match is None:
             continue
         return match.group(1)
-    raise shared.WhoisException("No root WHOIS server found for TLD.")
+    raise shared.WhoisException("No root WHOIS server found for domain.")
 
 def whois_request(domain, server, port=43):
     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
diff --git a/setup.py b/setup.py
index 3d0ba01..df4e8c1 100644
--- a/setup.py
+++ b/setup.py
@@ -1,7 +1,7 @@
 from setuptools import setup
 
 setup(name='pythonwhois',
-      version='2.1.4',
+      version='2.2.0',
       description='Module for retrieving and parsing the WHOIS data for a domain. Supports most domains. No dependencies.',
       author='Sven Slootweg',
       author_email='pythonwhois@cryto.net',
diff --git a/test/data/example.com b/test/data/example.com
new file mode 100644
index 0000000..c002029
--- /dev/null
+++ b/test/data/example.com
@@ -0,0 +1,25 @@
+% IANA WHOIS server
+% for more information on IANA, visit http://www.iana.org
+% This query returned 1 object
+
+domain: EXAMPLE.COM
+
+organisation: Internet Assigned Numbers Authority
+
+created: 1992-01-01
+source: IANA
+
+
+--
+ Domain Name: EXAMPLE.COM
+ Registrar: RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY
+ Whois Server: whois.iana.org
+ Referral URL: http://res-dom.iana.org
+ Name Server: A.IANA-SERVERS.NET
+ Name Server: B.IANA-SERVERS.NET
+ Status: clientDeleteProhibited
+ Status: clientTransferProhibited
+ Status: clientUpdateProhibited
+ Updated Date: 14-aug-2013
+ Creation Date: 14-aug-1995
+ Expiration Date: 13-aug-2014
diff --git a/test/target_default/example.com b/test/target_default/example.com
new file mode 100644
index 0000000..9e70f3d
--- /dev/null
+++ b/test/target_default/example.com
@@ -0,0 +1 @@
+{"status": ["clientDeleteProhibited", "clientTransferProhibited", "clientUpdateProhibited"], "updated_date": ["2013-08-14T00:00:00"], "contacts": {"admin": null, "tech": null, "registrant": null, "billing": null}, "nameservers": ["A.IANA-SERVERS.NET", "B.IANA-SERVERS.NET"], "expiration_date": ["2014-08-13T00:00:00"], "creation_date": ["1992-01-01T00:00:00"], "raw": ["% IANA WHOIS server\n% for more information on IANA, visit http://www.iana.org\n% This query returned 1 object\n\ndomain: EXAMPLE.COM\n\norganisation: Internet Assigned Numbers Authority\n\ncreated: 1992-01-01\nsource: IANA\n\n", " Domain Name: EXAMPLE.COM\n Registrar: RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY\n Whois Server: whois.iana.org\n Referral URL: http://res-dom.iana.org\n Name Server: A.IANA-SERVERS.NET\n Name Server: B.IANA-SERVERS.NET\n Status: clientDeleteProhibited\n Status: clientTransferProhibited\n Status: clientUpdateProhibited\n Updated Date: 14-aug-2013\n Creation Date: 14-aug-1995\n Expiration Date: 13-aug-2014\n"], "whois_server": ["whois.iana.org"], "registrar": ["RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY"]}
\ No newline at end of file
diff --git a/test/target_normalized/example.com b/test/target_normalized/example.com
new file mode 100644
index 0000000..a28ac98
--- /dev/null
+++ b/test/target_normalized/example.com
@@ -0,0 +1 @@
+{"status": ["clientDeleteProhibited", "clientTransferProhibited", "clientUpdateProhibited"], "updated_date": ["2013-08-14T00:00:00"], "contacts": {"admin": null, "tech": null, "registrant": null, "billing": null}, "nameservers": ["a.iana-servers.net", "b.iana-servers.net"], "expiration_date": ["2014-08-13T00:00:00"], "creation_date": ["1992-01-01T00:00:00"], "raw": ["% IANA WHOIS server\n% for more information on IANA, visit http://www.iana.org\n% This query returned 1 object\n\ndomain: EXAMPLE.COM\n\norganisation: Internet Assigned Numbers Authority\n\ncreated: 1992-01-01\nsource: IANA\n\n", " Domain Name: EXAMPLE.COM\n Registrar: RESERVED-INTERNET ASSIGNED NUMBERS AUTHORITY\n Whois Server: whois.iana.org\n Referral URL: http://res-dom.iana.org\n Name Server: A.IANA-SERVERS.NET\n Name Server: B.IANA-SERVERS.NET\n Status: clientDeleteProhibited\n Status: clientTransferProhibited\n Status: clientUpdateProhibited\n Updated Date: 14-aug-2013\n Creation Date: 14-aug-1995\n Expiration Date: 13-aug-2014\n"], "whois_server": ["whois.iana.org"], "registrar": ["Reserved-internet Assigned Numbers Authority"]}
\ No newline at end of file