"""Helper classes for tests.""" import copy import functools import unittest from unittest import TestCase from bs4 import BeautifulSoup from bs4.element import ( CharsetMetaAttributeValue, Comment, ContentMetaAttributeValue, Doctype, SoupStrainer, ) from bs4.builder import HTMLParserTreeBuilder default_builder = HTMLParserTreeBuilder class SoupTest(unittest.TestCase): @property def default_builder(self): return default_builder() def soup(self, markup, **kwargs): """Build a Beautiful Soup object from markup.""" builder = kwargs.pop('builder', self.default_builder) return BeautifulSoup(markup, builder=builder, **kwargs) def document_for(self, markup): """Turn an HTML fragment into a document. The details depend on the builder. """ return self.default_builder.test_fragment_to_document(markup) def assertSoupEquals(self, to_parse, compare_parsed_to=None): builder = self.default_builder obj = BeautifulSoup(to_parse, builder=builder) if compare_parsed_to is None: compare_parsed_to = to_parse self.assertEqual(obj.decode(), self.document_for(compare_parsed_to)) class HTMLTreeBuilderSmokeTest(object): """A basic test of a treebuilder's competence. Any HTML treebuilder, present or future, should be able to pass these tests. With invalid markup, there's room for interpretation, and different parsers can handle it differently. But with the markup in these tests, there's not much room for interpretation. """ def assertDoctypeHandled(self, doctype_fragment): """Assert that a given doctype string is handled correctly.""" doctype_str, soup = self._document_with_doctype(doctype_fragment) # Make sure a Doctype object was created. doctype = soup.contents[0] self.assertEqual(doctype.__class__, Doctype) self.assertEqual(doctype, doctype_fragment) self.assertEqual(str(soup)[:len(doctype_str)], doctype_str) # Make sure that the doctype was correctly associated with the # parse tree and that the rest of the document parsed. self.assertEqual(soup.p.contents[0], 'foo') def _document_with_doctype(self, doctype_fragment): """Generate and parse a document with the given doctype.""" doctype = '' % doctype_fragment markup = doctype + '\n
foo
' soup = self.soup(markup) return doctype, soup def test_normal_doctypes(self): """Make sure normal, everyday HTML doctypes are handled correctly.""" self.assertDoctypeHandled("html") self.assertDoctypeHandled( 'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"') def test_public_doctype_with_url(self): doctype = 'html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"' self.assertDoctypeHandled(doctype) def test_system_doctype(self): self.assertDoctypeHandled('foo SYSTEM "http://www.example.com/"') def test_namespaced_system_doctype(self): # We can handle a namespaced doctype with a system ID. self.assertDoctypeHandled('xsl:stylesheet SYSTEM "htmlent.dtd"') def test_namespaced_public_doctype(self): # Test a namespaced doctype with a public id. self.assertDoctypeHandled('xsl:stylesheet PUBLIC "htmlent.dtd"') def test_real_xhtml_document(self): """A real XHTML document should come out more or less the same as it went in.""" markup = b"""tag is never designated as an empty-element tag. Even if the markup shows it as an empty-element tag, it shouldn't be presented that way. """ soup = self.soup("
") self.assertFalse(soup.p.is_empty_element) self.assertEqual(str(soup.p), "") def test_unclosed_tags_get_closed(self): """A tag that's not closed by the end of the document should be closed. This applies to all tags except empty-element tags. """ self.assertSoupEquals("", "
") self.assertSoupEquals("", "") self.assertSoupEquals("foobaz
" self.assertSoupEquals(markup) soup = self.soup(markup) comment = soup.find(text="foobar") self.assertEqual(comment.__class__, Comment) # The comment is properly integrated into the tree. foo = soup.find(text="foo") self.assertEqual(comment, foo.next_element) baz = soup.find(text="baz") self.assertEquals(comment, baz.previous_element) def test_preserved_whitespace_in_pre_and_textarea(self): """Whitespace must be preserved inand