steven 2025-08-11 22:23:30 +02:00
commit 72a26edcff
22092 changed files with 2101903 additions and 0 deletions

lib/sd/manual/README.md Normal file

@ -0,0 +1,72 @@
This folder contains the source files for http://simplehtmldom.sourceforge.net/,
the project page for PHP Simple HTML DOM Parser.
Source files are written in Markdown: https://en.wikipedia.org/wiki/Markdown
Site data is generated by MkDocs, a lightweight static site generator for project
documentation: https://www.mkdocs.org/
# Folder structure
`custom_theme` : Contains customizations to the theme provided by MkDocs.
`docs` : Contains the source files for the project page (the actual pages).
`site` : Contains the output files for the project page when built with MkDocs.
`extra.css` : Customizations to the styles provided by MkDocs.
`mkdocs.yml` : The configuration file that is used by MkDocs to generate pages.
# Adding new pages
Place new files in `docs`. Use subfolders (as few levels as possible) to
separate categories.
Files added to the manual will **not** appear on the project page automatically.
All pages need to be specified in the _mkdocs.yml_ file under "nav:". Simply add
the relative path to the new file where appropriate.
Note: Files are not added automatically because they are sorted by name if not
specified manually. Since readability is a key factor for manuals, the files must
be listed in an order that is clear to users.
# Setting up MkDocs
The installation instructions for MkDocs are provided on their homepage:
https://www.mkdocs.org/#installation
MkDocs automatically builds the project based on the _mkdocs.yml_ file. Find the
specification for this file at https://www.mkdocs.org/user-guide/configuration/.
# Building project pages
The build process depends on your installation of MkDocs. Typically MkDocs is
made available via the command line.
## Step 1 - Check your version of MkDocs
To check your version of MkDocs run this command:
`mkdocs --version` or
`python3 -m mkdocs --version`
The command should report `version 1.0.4` or higher. If it doesn't, install the
latest version using `pip install mkdocs` or `python3 -m pip install mkdocs`. If
you don't have pip installed, install it via your package manager or follow the
instructions at https://pip.pypa.io/en/stable/installing/
## Step 2 - View the project locally
MkDocs allows you to view the project files in a browser on your local machine:
`mkdocs serve` or
`python3 -m mkdocs serve`
If the process is successful you can access the site at http://127.0.0.1:8000.
## Step 3 - Build the project
If you are satisfied with the results of the project, build the final project
with this command:
`mkdocs build` or
`python3 -m mkdocs build`
Find the output files in the `site` folder.

@ -0,0 +1,7 @@
{% extends "base.html" %}
{% block footer %}
{% include "footer.html" %}
<hr>
<a class="logo" href="https://sourceforge.net/p/simplehtmldom/"><img alt="Download PHP Simple HTML DOM Parser" src="https://sourceforge.net/sflogo.php?type=16&group_id=218559" ></a>
{% endblock %}

@ -0,0 +1,68 @@
---
title: API Reference
---
# Parsing documents
The parser accepts documents in the form of URLs, files and strings. The document
must be accessible for reading and cannot exceed [`MAX_FILE_SIZE`](constants.md#max_file_size).
Name | Description
---- | -----------
`str_get_html( string $content ) : object` | Creates a DOM object from string.
`file_get_html( string $filename ) : object` | Creates a DOM object from file or URL.
# DOM methods & properties
Name | Description
---- | -----------
`__construct( [string $filename] ) : void` | Constructor. If the filename parameter is set, the contents are loaded automatically, either from text or from a file/URL.
`plaintext : string` | Returns the contents extracted from HTML.
`clear() : void` | Clean up memory.
`load( string $content ) : void` | Load contents from string.
`save( [string $filename] ) : string` | Dumps the internal DOM tree back into a string. If `$filename` is set, the resulting string is also saved to that file.
`load_file( string $filename ) : void` | Load contents from a file or a URL.
`set_callback( string $function_name ) : void` | Set a callback function.
`find( string $selector [, int $index] ) : mixed` | Finds elements by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
# Element methods & properties
Name | Description
---- | -----------
`[attribute] : string` | Read or write element's attribute value.
`tag : string` | Read or write the tag name of element.
`outertext : string` | Read or write the outer HTML text of element.
`innertext : string` | Read or write the inner HTML text of element.
`plaintext : string` | Read or write the plain text of element.
`find( string $selector [, int $index] ) : mixed` | Finds children by CSS selector. Returns the Nth element object if index is set, otherwise returns an array of objects.
# DOM traversing
Name | Description
---- | -----------
`$e->children( [int $index] ) : mixed` | Returns the Nth child object if index is set, otherwise returns an array of children.
`$e->parent() : element` | Returns the parent of element.
`$e->first_child() : element` | Returns the first child of element, or null if not found.
`$e->last_child() : element` | Returns the last child of element, or null if not found.
`$e->next_sibling() : element` | Returns the next sibling of element, or null if not found.
`$e->prev_sibling() : element` | Returns the previous sibling of element, or null if not found.
# Camel naming conventions
Method | Mapping
------ | -------
`$e->getAllAttributes()` | `$e->attr`
`$e->getAttribute( $name )` | `$e->attribute`
`$e->setAttribute( $name, $value)` | `$e->attribute = $value`
`$e->hasAttribute( $name )` | `isset($e->attribute)`
`$e->removeAttribute ( $name )` | `$e->attribute = null`
`$e->getElementById ( $id )` | `$e->find ( "#$id", 0 )`
`$e->getElementsById ( $id [,$index] )` | `$e->find ( "#$id" [, int $index] )`
`$e->getElementByTagName ($name )` | `$e->find ( $name, 0 )`
`$e->getElementsByTagName ( $name [, $index] )` | `$e->find ( $name [, int $index] )`
`$e->parentNode ()` | `$e->parent ()`
`$e->childNodes ( [$index] )` | `$e->children ( [int $index] )`
`$e->firstChild ()` | `$e->first_child ()`
`$e->lastChild ()` | `$e->last_child ()`
`$e->nextSibling ()` | `$e->next_sibling ()`
`$e->previousSibling ()` | `$e->prev_sibling ()`

@ -0,0 +1,33 @@
---
title: Constants
---
# Constants
Constants define how the parser treats documents. They can be defined before
loading the parser to globally replace the default values.
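The override mechanism can be pictured with a minimal sketch. It assumes the parser only defines a constant when it has not been defined yet; the two guard lines below stand in for loading the parser file and are an assumption, not a copy of its source.

```php
<?php
// Application code: set a custom limit before the parser is loaded.
define('MAX_FILE_SIZE', 1000000);

// Stand-in for loading the parser (assumption): defaults are applied
// only to constants that have not been defined yet.
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
defined('DEFAULT_BR_TEXT') || define('DEFAULT_BR_TEXT', "\r\n");

// The custom value wins; untouched constants keep their defaults.
var_dump(MAX_FILE_SIZE);
```

Defining the constant after the parser has been loaded would be too late under this assumption, because the default would already be in place.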
## DEFAULT_TARGET_CHARSET
Defines the default target charset for text returned by the parser.
Default: `'UTF-8'`
## DEFAULT_BR_TEXT
Defines the default text to return for `<br>` elements.
Default: `"\r\n"`
## DEFAULT_SPAN_TEXT
Defines the default text to return for `<span>` elements.
Default: `' '`
## MAX_FILE_SIZE
Defines the maximum number of bytes the parser can load into memory. This limit
only applies to the source file or string.
Default: `600000`

@ -0,0 +1,100 @@
---
title: Definitions
---
# Definitions
The definitions below are an essential part of the parser.
## Node Types
The type of a node is determined during parsing and represented by one of the elements in the list below.
| Type | Description
| ---- | -----------
| `HDOM_TYPE_ELEMENT` | Start tag (e.g. `<html>`)
| `HDOM_TYPE_COMMENT` | HTML comment (e.g. `<!-- Hello, World! -->`)
| `HDOM_TYPE_TEXT` | Plain text (e.g. `Hello, World!`)
| `HDOM_TYPE_ENDTAG` | End tag (e.g. `</html>`)
| `HDOM_TYPE_ROOT` | Root element. There is always exactly one root element in the DOM.
| `HDOM_TYPE_UNKNOWN` | Unknown type (e.g. CDATA, DOCTYPE)
### Example
```html
<!DOCTYPE html><html><!-- Hello, World! --></html>Hello, World!
```
_Note_: `HDOM_TYPE_ROOT` always exists regardless of the actual document structure.
| HTML | Node Type
| ---- | ---------
| | `HDOM_TYPE_ROOT`
| `<!DOCTYPE html>` | `HDOM_TYPE_UNKNOWN`
| `<html>` | `HDOM_TYPE_ELEMENT`
| `<!-- Hello, World! -->` | `HDOM_TYPE_COMMENT`
| `</html>` | `HDOM_TYPE_ENDTAG`
| `Hello, World!` | `HDOM_TYPE_TEXT`
## Quote Types
Identifies the quoting type on attribute values.
| Type | Description
| ---- | -----------
| `HDOM_QUOTE_DOUBLE` | Double quotes (`""`)
| `HDOM_QUOTE_SINGLE` | Single quotes (`''`)
| `HDOM_QUOTE_NO` | Not quoted (flag)
_Note_: Attributes with no values (flags) are stored as `HDOM_QUOTE_NO`.
### Example
```html
<p class="paragraph" id='info1' hidden>Hello, World!</p>
```
| Attribute | Description
| --------- | -----------
| `class="paragraph"` | `HDOM_QUOTE_DOUBLE`
| `id='info1'` | `HDOM_QUOTE_SINGLE`
| `hidden` | `HDOM_QUOTE_NO`
## Node Info Types
Each node stores additional information (metadata) that is identified by the elements below.
| Type | Description
| ---- | -----------
| `HDOM_INFO_BEGIN` | Cursor position for the start tag of a node.
| `HDOM_INFO_END` | Cursor position for the end tag of a node. A value of zero indicates a node with no end tag (missing closing tag).
| `HDOM_INFO_QUOTE` | Quote type for attribute values. The value must be one of the [Quote Types](#quote-types).
| `HDOM_INFO_SPACE` | Array of whitespace around attributes (see [Attribute Whitespace](#attribute-whitespace)).
| `HDOM_INFO_TEXT` | Non-HTML text in tags (e.g. comments, doctype).
| `HDOM_INFO_INNER` | Inner text of a node.
| `HDOM_INFO_OUTER` | Outer text of a node.
| `HDOM_INFO_ENDSPACE` | Whitespace at the end of a tag before the closing bracket.
## Attribute Whitespace
Whitespace around attributes is stored in the form of an array with three elements:
| Element | Description
| ------- | -----------
| `0` | Whitespace before the attribute name.
| `1` | Whitespace between attribute name and the equal sign.
| `2` | Whitespace between the equal sign and the attribute value.
### Example
```html
<p class="paragraph" id = 'info1'hidden>Hello, World!</p>
```
_Note_: Whitespace before attribute names is not displayed in the browser. It is, however, part of the attributes.
| Attribute | Description
| --------- | -----------
| ` class="paragraph"` | `[0] => ' ', [1] => '', [2] => ''`
| ` id = 'info1'` | `[0] => ' ', [1] => ' ', [2] => ' '`
| `hidden` | `[0] => '', [1] => '', [2] => ''`
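The three-element layout can be reproduced with a short, self-contained sketch. The function name and the regular expression are hypothetical and only illustrate the table above; they are not taken from the parser.

```php
<?php
/**
 * Illustrative only: collects the whitespace around each attribute of a
 * start tag into three-element arrays, keyed by attribute name.
 */
function attribute_whitespace_sketch(string $tag): array
{
    // Strip "<tagname" from the front and ">" (or "/>") from the end.
    $attrs = preg_replace('/^<[^\s\/>]+|\/?>$/', '', $tag);

    // Groups: 1 = space before name, 2 = name, 3 = space before "=",
    // 4 = space after "=", 5 = value (quoted or unquoted).
    $pattern = '/(\s*)([^\s=\/>\'"]+)(?:(\s*)=(\s*)("[^"]*"|\'[^\']*\'|[^\s>]+))?/';
    preg_match_all($pattern, $attrs, $matches, PREG_SET_ORDER);

    $result = array();
    foreach ($matches as $m) {
        $result[$m[2]] = array(
            $m[1],        // [0] whitespace before the attribute name
            $m[3] ?? '',  // [1] whitespace between name and equal sign
            $m[4] ?? '',  // [2] whitespace between equal sign and value
        );
    }
    return $result;
}

print_r(attribute_whitespace_sketch("<p class=\"paragraph\" id = 'info1'hidden>"));
```

For the example tag, the sketch yields the same three arrays as the table, including the empty entries for the `hidden` flag.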

@ -0,0 +1,25 @@
---
title: file_get_html
---
# file_get_html
```php
file_get_html ( string $url [, bool $use_include_path = false [, resource $context = null [, int $offset = 0 [, int $maxLen = -1 [, bool $lowercase = true [, bool $forceTagsClosed = true [, string $target_charset = DEFAULT_TARGET_CHARSET [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT ]]]]]]]]]] )
```
Parses the provided file and returns the DOM object.
| Parameter | Description
| --------- | -----------
| `url` | Name or URL of the file to read.
| `use_include_path` | See [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php#refsect1-function.file-get-contents-parameters)
| `context` | See [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php#refsect1-function.file-get-contents-parameters)
| `offset` | See [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php#refsect1-function.file-get-contents-parameters)
| `maxLen` | See [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php#refsect1-function.file-get-contents-parameters)
| `lowercase` | Forces lowercase matching of tags if enabled. This is very useful when loading documents with mixed naming conventions.
| `forceTagsClosed` | Obsolete. This parameter is no longer used by the parser.
| `target_charset` | Defines the target charset when returning text from the document.
| `stripRN` | If enabled, removes newlines before parsing the document.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.

@ -0,0 +1,20 @@
# __construct
```php
__construct ( [ string $str = null [, bool $lowercase = true [, bool $forceTagsClosed = true [, string $target_charset = DEFAULT_TARGET_CHARSET [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT [, int $options = 0 ]]]]]]]]) : object
```
Creates a new `simple_html_dom` object.
| Parameter | Description
| --------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `forceTagsClosed` | Tags inside block tags are forcefully closed if the closing tag was omitted.
| `target_charset` | Defines the target charset for text returned by the parser.
| `stripRN` | Newline characters are replaced by whitespace if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.
| `options` | Additional options for the parser. Currently supports `HDOM_SMARTY_AS_TEXT` to remove [Smarty](https://www.smarty.net/) scripts.
Returns the object.

@ -0,0 +1,7 @@
# __destruct
```php
__destruct ()
```
Destroys the current object and clears memory.

@ -0,0 +1,17 @@
# __get
```php
__get ( string $name ) : mixed
```
See [magic methods](http://php.net/manual/en/language.oop5.overloading.php#object.get)
Supports the following names:
| Name | Description
| ---- | -----------
| `outertext` | Returns the outer text of the root element.
| `innertext` | Returns the inner text of the root element.
| `plaintext` | Returns the plain text of the root element.
| `charset` | Returns the charset for the document.
| `target_charset` | Returns the target charset for the document.

@ -0,0 +1,7 @@
# __toString
```php
__toString () : string
```
Returns the inner text of the root element of the DOM.

@ -0,0 +1,13 @@
# as_text_node (protected)
```php
as_text_node ( string $tag ) : bool
```
Adds a tag as text node.
| Parameter | Description
| --------- | -----------
| `tag` | The element's tag name.
Returns true on success.

@ -0,0 +1,11 @@
# childNodes
```php
childNodes ( [ int $idx = -1 ] ) : mixed
```
Returns children of the root element.
| Parameter | Description
| --------- | -----------
| `idx` | Index of the child element to return.

@ -0,0 +1,7 @@
# clear
```php
clear ()
```
Cleans up memory to prevent [PHP 5 circular references memory leak](https://bugs.php.net/bug.php?id=33595).

@ -0,0 +1,13 @@
# copy_skip (protected)
```php
copy_skip ( string $chars ) : string
```
Skips characters starting at the current parsing position in the document. Sets the parsing position to the first character not in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to skip.
Returns the skipped characters.
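The behaviour described above maps naturally onto PHP's `strspn`. The following is a minimal sketch under that assumption; the class `CursorSketch` is a hypothetical stand-in for the parser state and only models the document string and the current position.

```php
<?php
// Hypothetical stand-in for the parser state; only $doc and $pos matter here.
final class CursorSketch
{
    public $doc;
    public $pos = 0;

    public function __construct(string $doc)
    {
        $this->doc = $doc;
    }

    // Consumes leading characters that appear in $chars and returns them.
    public function copySkip(string $chars): string
    {
        // Length of the run made up only of characters from $chars.
        $len = strspn($this->doc, $chars, $this->pos);
        $skipped = substr($this->doc, $this->pos, $len);
        $this->pos += $len;
        return $skipped;
    }
}

$cursor = new CursorSketch("  \t<p>");
$skipped = $cursor->copySkip(" \t\r\n");
```

After the call, `$skipped` holds the two spaces and the tab, and `$cursor->pos` points at the `<` character.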

@ -0,0 +1,13 @@
# copy_until (protected)
```php
copy_until ( string $chars ) : string
```
Copies all characters starting at the current parsing position in the document. Sets the parsing position to the first character that matches any of the characters in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to stop copying at.
Returns the copied characters.
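A minimal sketch of this behaviour, assuming it can be expressed with PHP's `strcspn`. The helper name is hypothetical, and `$pos` is passed by reference to play the role of the parser's current position.

```php
<?php
// Hypothetical helper: copies characters from $doc, starting at $pos, up to
// (but not including) the first character that appears in $chars.
function copy_until_sketch(string $doc, int &$pos, string $chars): string
{
    // Length of the run that contains none of the stop characters.
    $len = strcspn($doc, $chars, $pos);
    $copied = substr($doc, $pos, $len);
    $pos += $len;
    return $copied;
}

$doc = 'class="a">text';
$pos = 0;
$name = copy_until_sketch($doc, $pos, "=/> \t\r\n");
```

Here `$name` is `class` and `$pos` ends up at the equal sign, which is the first stop character in the input.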

@ -0,0 +1,13 @@
# copy_until_char (protected)
```php
copy_until_char ( string $char ) : string
```
Copies all characters starting at the current parsing position in the document. Sets the parsing position to the first character that matches the provided character.
| Parameter | Description
| --------- | -----------
| `char` | A character to stop copying at.
Returns the copied characters.

@ -0,0 +1,14 @@
# createElement
```php
createElement ( string $name [, string $value = null ] ) : object
```
Creates a new element.
| Parameter | Description
| --------- | -----------
| `name` | Name of the element
| `value` | Value of the element
Returns the element.

@ -0,0 +1,9 @@
# createTextNode
```php
createTextNode ( string $value ) : object
```
Creates a new text element.
Returns the element.

@ -0,0 +1,13 @@
# dump
```php
dump ( [ bool $show_attr = true ] ) : string
```
Dumps the entire DOM into a string. Useful for debugging purposes.
| Parameter | Description
| --------- | -----------
| `show_attr` | Attributes are included in the dump when enabled.
Returns the DOM tree as string.

@ -0,0 +1,15 @@
# find
```php
find ( string $selector [, int $idx = null [, bool $lowercase = false ]] ) : mixed
```
Finds elements in the DOM.
| Parameter | Description
| --------- | -----------
| `selector` | A [CSS style selector](/manual/selectors).
| `idx` | Index of the element to return.
| `lowercase` | Matches tag names case-insensitively when enabled.
Returns an array of matches or a single element if `idx` is defined.

@ -0,0 +1,7 @@
# firstChild
```php
firstChild () : object
```
Returns the first child of the root element.

@ -0,0 +1,13 @@
# getElementById
```php
getElementById ( string $id ) : object
```
Searches for an element by ID.
| Parameter | Description
| --------- | -----------
| `id` | ID of the element to find.
Returns the element or null if no match was found.

@ -0,0 +1,13 @@
# getElementByTagName
```php
getElementByTagName ( string $name ) : object
```
Searches for an element by tag name.
| Parameter | Description
| --------- | -----------
| `name` | Tag name of the element to find.
Returns the element or null if no match was found.

@ -0,0 +1,14 @@
# getElementsById
```php
getElementsById ( string $id [, int $idx = null ] ) : object
```
Searches for elements by ID.
| Parameter | Description
| --------- | -----------
| `id` | ID of the element to find.
| `idx` | Returns the element at the specified index if defined.
Returns the element(s) or null if no match was found.

@ -0,0 +1,14 @@
# getElementsByTagName
```php
getElementsByTagName ( string $name [, int $idx = -1 ] ) : object
```
Searches for elements by tag name.
| Parameter | Description
| --------- | -----------
| `name` | Tag name of the element to find.
| `idx` | Returns the element at the specified index.
Returns the element(s) or null if no match was found.

@ -0,0 +1,7 @@
# lastChild
```php
lastChild () : object
```
Returns the last child of the root element.

@ -0,0 +1,12 @@
# link_nodes (protected)
```php
link_nodes ( object &$node, bool $is_child )
```
Links the provided node to the DOM tree.
| Parameter | Description
| --------- | -----------
| `node` | The node to link to the DOM tree.
| `is_child` | If active, makes the node a sibling of the current node (child of parent).

@ -0,0 +1,18 @@
# load
```php
load ( string $str [, bool $lowercase = true [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT [, int $options = 0 ]]]]]) : object
```
Loads the provided HTML document string.
| Parameter | Description
| --------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `stripRN` | Newline characters are replaced by whitespace if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.
| `options` | Additional options for the parser. Currently supports `HDOM_SMARTY_AS_TEXT` to remove [Smarty](https://www.smarty.net/) scripts.
Returns the object.

@ -0,0 +1,7 @@
# loadFile
```php
loadFile (...)
```
This function is a wrapper for [`load_file`](#load_file)

@ -0,0 +1,9 @@
# load_file
```php
load_file (...) : object
```
Loads an HTML document from a file. Supports the arguments of [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php).
Returns the object.

@ -0,0 +1,7 @@
# parse (protected)
```php
parse ()
```
Parses the document. This function is called after the document was loaded into `$this->doc`.

@ -0,0 +1,13 @@
# parse_attr (protected)
```php
parse_attr ( object $node, string $name, array &$space )
```
Parses a single attribute starting at the current parsing position in the document.
| Parameter | Description
| --------- | -----------
| `node` | The current element (node).
| `name` | The attribute name.
| `space` | An array of whitespace surrounding the current attribute (see [Attribute Whitespace](../definitions/#attribute-whitespace)).

@ -0,0 +1,15 @@
# parse_charset (protected)
```php
parse_charset ()
```
Parses the charset.
If the callback function `get_last_retrieve_url_contents_content_type` exists, it is assumed to return the content type header for the current document as string.
Uses the charset from the metadata of the page if defined.
If none of the previous conditions are met, the charset is determined by `mb_detect_encoding` if multi-byte support is active.
If multi-byte support is not active the charset is assumed to be `'UTF-8'`.
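The fallback order above can be sketched as a single function. This is a hypothetical illustration: the function name, its parameters (the content type header is passed in directly instead of being fetched through a callback) and the regular expressions are assumptions, not the parser's code.

```php
<?php
/**
 * Hypothetical sketch of the fallback order: header, then <meta>,
 * then detection, then a fixed default.
 */
function detect_charset_sketch(?string $contentType, string $html): string
{
    // 1. Charset from a Content-Type header, if one was provided.
    if ($contentType !== null && preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return $m[1];
    }
    // 2. Charset from the document's own <meta> element.
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $html, $m)) {
        return $m[1];
    }
    // 3. Guess with mb_detect_encoding when multi-byte support is available.
    if (function_exists('mb_detect_encoding')) {
        $guess = mb_detect_encoding($html, array('UTF-8', 'ISO-8859-1'), true);
        if ($guess !== false) {
            return $guess;
        }
    }
    // 4. Last resort.
    return 'UTF-8';
}

echo detect_charset_sketch('text/html; charset=iso-8859-1', '<p>x</p>'), PHP_EOL;
echo detect_charset_sketch(null, '<meta charset="windows-1252"><p>x</p>'), PHP_EOL;
```

The header takes precedence over the document metadata in this sketch, so a page that declares one charset in `<meta>` but is served with another is treated according to the header.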

@ -0,0 +1,14 @@
# prepare (protected)
```php
prepare ( string $str [, bool $lowercase = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT ]]] )
```
Initializes the DOM object.
| Parameter | Description
| ---------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.

@ -0,0 +1,9 @@
# read_tag (protected)
```php
read_tag () : bool
```
Reads a single tag starting at the current parsing position in the document. The tag is automatically added to the DOM.
Returns true if a tag was found.

@ -0,0 +1,7 @@
# remove_callback
```php
remove_callback ()
```
Removes the callback set by [`set_callback`](#set_callback).

@ -0,0 +1,14 @@
# remove_noise (protected)
```php
remove_noise ( string $pattern [, bool $remove_tag = false] )
```
Replaces noise in the document (e.g. scripts) with placeholders and adds the removed contents to `$this->noise`.
_Note_: Noise is replaced with placeholders so that the original contents can be restored later. Placeholders take the form of `'___noise___1000'`, where the number is increased by one for each removed piece of noise.
| Parameter | Description
| --------- | -----------
| `pattern` | A regular expression that matches the noise to remove.
| `remove_tag` | Removes the entire match when enabled or submatches when disabled.
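The placeholder scheme can be illustrated with a self-contained sketch. Both function names are hypothetical, the noise list is passed in explicitly instead of living in `$this->noise`, and only the case where the entire match is replaced is modelled.

```php
<?php
// Hypothetical sketch: replaces every match of $pattern in $doc with a
// numbered placeholder and records the original text in $noise.
function remove_noise_sketch(string $doc, string $pattern, array &$noise): string
{
    return preg_replace_callback($pattern, function ($m) use (&$noise) {
        // Placeholders start at 1000 and increase by one per removed match.
        $key = '___noise___' . (1000 + count($noise));
        $noise[$key] = $m[0];
        return $key;
    }, $doc);
}

// Puts the recorded originals back in place of their placeholders.
function restore_noise_sketch(string $text, array $noise): string
{
    return strtr($text, $noise);
}

$noise = array();
$doc = '<p>Hi</p><script>alert(1)</script><p>Bye</p><script>x()</script>';
$clean = remove_noise_sketch($doc, '/<script\b[^>]*>.*?<\/script>/is', $noise);

echo $clean, PHP_EOL;
echo restore_noise_sketch($clean, $noise), PHP_EOL;
```

Because each placeholder is unique, restoring is a plain key-to-value substitution and yields the original document again.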

@ -0,0 +1,13 @@
# restore_noise (protected)
```php
restore_noise ( string $text ) : string
```
Restores noise in the provided string by replacing noise placeholders with their original contents.
| Parameter | Description
| --------- | -----------
| `text` | A string (potentially) containing noise placeholders.
Returns the string with original contents restored or the original string if it doesn't contain noise placeholders.

@ -0,0 +1,13 @@
# save
```php
save ( [ string $filepath = '' ] ) : string
```
Returns the current DOM as a string and optionally writes it to a file.
| Parameter | Description
| --------- | -----------
| `filepath` | Writes to file if the provided file path is not empty.
Returns the document string.

@ -0,0 +1,13 @@
# search_noise (protected)
```php
search_noise ( string $text ) : string
```
Finds a single noise element by its placeholder text.
| Parameter | Description
| --------- | -----------
| `text` | The noise placeholder to find.
Returns the original contents for the placeholder.

@ -0,0 +1,12 @@
# set_callback
```php
set_callback ( string $function_name )
```
Sets the callback function which is called on each element of the DOM when building outertext.
The function must accept a single parameter of type `simple_html_dom_node`.
| Parameter | Description
| --------- | -----------
| `function_name` | Name of the function.

@ -0,0 +1,40 @@
---
title: simple_html_dom
---
# simple_html_dom
Represents the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) in memory. Provides functions to parse documents and access individual elements (see [`simple_html_dom_node`](../simple_html_dom_node/simple_html_dom_node.md)).
# Public Properties
| Property | Description
| -------- | -----------
| `root` | Root node of the document.
| `nodes` | List of top-level nodes in the document.
| `callback` | Callback function that is called for each element in the DOM when generating outertext.
| `lowercase` | If enabled, all tag names are converted to lowercase when parsing documents.
| `original_size` | Original document size in bytes.
| `size` | Current document size in bytes.
| `_charset` | Charset of the original document.
| `_target_charset` | Target charset for the current document.
| `default_span_text` | Text to return for `<span>` elements.
# Protected Properties
| Property | Description
| -------- | -----------
| `pos` | Current parsing position within `doc`.
| `doc` | The original document.
| `char` | Character at position `pos` in `doc`.
| `cursor` | Current element cursor in the document.
| `parent` | Parent element node.
| `noise` | Noise from the original document (e.g. scripts, comments).
| `token_blank` | Tokens that are considered whitespace in HTML.
| `token_equal` | Tokens to identify the equal sign for attributes, stopping either at the closing tag ("/", e.g. `<html />`) or the end of an opening tag (">", e.g. `<html>`).
| `token_slash` | Tokens to identify the end of a tag name. A tag name ends either at the ending slash ("/", e.g. `<html/>`) or at whitespace (`"\s\r\n\t"`).
| `token_attr` | Tokens to identify the end of an attribute.
| `default_br_text` | Text to return for `<br>` elements.
| `self_closing_tags` | A list of tag names where the closing tag is omitted.
| `block_tags` | A list of tag names where remaining unclosed tags are forcibly closed.
| `optional_closing_tags` | A list of tag names where the closing tag can be omitted.

@ -0,0 +1,12 @@
# skip (protected)
```php
skip ( string $chars )
```
Skips characters starting at the current parsing position in the document. Sets the parsing position to the first character not in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to skip.

@ -0,0 +1,11 @@
# __construct
```php
__construct ( [ object $dom ] ) : object
```
| Parameter | Description
| --------- | -----------
| `dom` | An object of type [`simple_html_dom`](api/simple_html_dom/).
Constructs a new object of type `simple_html_dom_node`, assigns `$dom` as its DOM object and adds itself to the list of nodes in `$dom`.

@ -0,0 +1,7 @@
# __destruct
```php
__destruct ( )
```
Destructs the current object and frees memory.

@ -0,0 +1,22 @@
# __get
```php
__get ( string $name ) : mixed
```
| Parameter | Description
| --------- | -----------
| `name` | `outertext`, `innertext`, `plaintext`, `xmltext` or attribute name.
See [magic methods](http://php.net/manual/en/language.oop5.overloading.php#object.get)
If the provided name is a valid attribute name, returns the attribute value. Otherwise returns a value according to the table below.
| Name | Description
| ---- | -----------
| `outertext` | Returns the outer text of the current node.
| `innertext` | Returns the inner text of the current node.
| `plaintext` | Returns the plain text of the current node.
| `xmltext` | Returns the xml representation for the inner text of the current node as a CDATA section.
Returns nothing if the provided name is neither a valid attribute name, nor a valid parameter name.

@ -0,0 +1,19 @@
# __isset
```php
__isset ( string $name ) : bool
```
| Parameter | Description
| --------- | -----------
| `name` | `outertext`, `innertext`, `plaintext` or attribute name.
See [magic methods](http://php.net/manual/en/language.oop5.overloading.php#object.isset)
Returns true if the provided name is a valid attribute name or any of the values in the table below. False otherwise.
| Name | Description
| ---- | -----------
| `outertext` | Returns the outer text of the current node.
| `innertext` | Returns the inner text of the current node.
| `plaintext` | Returns the plain text of the current node.

@ -0,0 +1,18 @@
# __set
```php
__set ( string $name, mixed $value )
```
| Parameter | Description
| --------- | -----------
| `name` | `outertext`, `innertext` or attribute name.
| `value` | Value to set.
See [magic methods](http://php.net/manual/en/language.oop5.overloading.php#object.set)
Sets the outer text of the current node to `$value` if `$name` is `outertext`.
Sets the inner text of the current node to `$value` if `$name` is `innertext`.
Otherwise, adds or updates an attribute with name `$name` and value `$value` to the current node.
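The dispatch described above can be sketched with PHP's `__set` magic method. `NodeSketch` and its property names are hypothetical and only mirror the three cases; this is not the node class itself.

```php
<?php
// Hypothetical stand-in for a node; property names are illustrative.
class NodeSketch
{
    public $outer = null;
    public $inner = null;
    public $attr = array();

    // Called by PHP whenever an undeclared property is written.
    public function __set($name, $value)
    {
        switch ($name) {
            case 'outertext':
                $this->outer = $value;
                break;
            case 'innertext':
                $this->inner = $value;
                break;
            default:
                // Any other name is treated as an attribute.
                $this->attr[$name] = $value;
        }
    }
}

$node = new NodeSketch();
$node->innertext = '<b>Hello</b>';
$node->href = 'https://example.com/';
$node->href = 'https://example.org/'; // updates the existing attribute
```

Since `href` is never declared as a real property, the second assignment also goes through `__set` and overwrites the stored attribute value.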

@ -0,0 +1,7 @@
# __toString
```php
__toString ( ) : string
```
Returns the outer text of the current node.

@ -0,0 +1,7 @@
# __unset
```php
__unset ( string $name )
```
Removes the attribute with name `$name` from the current node if it exists.

@ -0,0 +1,23 @@
# addClass
```php
addClass ( mixed $class )
```
| Parameter | Description
| --------- | -----------
| `class` | Specifies one or more class names to be added.
Adds one or more class names to the current node.
**Remarks**
* To add more than one class, separate the class names with spaces or provide them as an array.
**Examples**
```php
$node->addClass('hidden');
$node->addClass('article important');
$node->addClass(array('article', 'new'));
```

@ -0,0 +1,13 @@
# appendChild
```php
appendChild ( object $node ) : object
```
| Parameter | Description
| --------- | -----------
| `node` | An object of type [`simple_html_dom_node`](../simple_html_dom_node/)
Makes the current node the parent of the node provided to this function.
Returns the provided node.

@ -0,0 +1,15 @@
# childNodes
```php
childNodes ( [ int $idx = -1 ] ) : mixed
```
| Parameter | Description
| --------- | -----------
| `idx` | Index of the node to return or `-1` to return all nodes.
Returns all or one specific child node from the current node.
## Remarks
This function is a wrapper for [`children`](../children/)

@ -0,0 +1,11 @@
# children
```php
children ( [ int $idx = -1 ] ) : mixed
```
| Parameter | Description
| --------- | -----------
| `idx` | Index of the node to return or `-1` to return all nodes.
Returns all or one specific child node from the current node.

@ -0,0 +1,7 @@
# clear
```php
clear ( )
```
Sets all properties in the current node, which contain objects, to null.

@ -0,0 +1,13 @@
# convert_text
```php
convert_text ( string $text ) : string
```
| Parameter | Description
| --------- | -----------
| `text` | Text to convert.
Assumes that the provided text is encoded in the configured source character set (see [`sourceCharset`](../simple_html_dom_node/)) and converts it to the specified target character set (see [`targetCharset`](../simple_html_dom_node/)).
Returns the converted text.
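A minimal sketch of such a conversion, assuming it can be done with `iconv` or, failing that, `mb_convert_encoding`. The helper name and the explicit charset parameters are hypothetical; the real method reads both charsets from the node's document.

```php
<?php
// Hypothetical helper: converts $text from the $source to the $target charset.
function convert_text_sketch(string $text, string $source, string $target): string
{
    // Nothing to do when both charsets are the same.
    if (strcasecmp($source, $target) === 0) {
        return $text;
    }
    if (function_exists('iconv')) {
        $converted = iconv($source, $target, $text);
        // iconv returns false on failure; fall back to the input in that case.
        return $converted === false ? $text : $converted;
    }
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($text, $target, $source);
    }
    // No conversion facility available: return the text unchanged.
    return $text;
}

$latin1 = "caf\xE9"; // "café" encoded as ISO-8859-1
$utf8 = convert_text_sketch($latin1, 'ISO-8859-1', 'UTF-8');
```

The single byte `0xE9` becomes the two-byte UTF-8 sequence `0xC3 0xA9`, so the converted string is one byte longer than the input.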

@ -0,0 +1,12 @@
# dump
```php
dump ( [ bool $show_attr = false [, int $depth = 0 ]] )
```
| Parameter | Description
| --------- | -----------
| `show_attr` | Attribute names are included in the output if enabled.
| `depth` | Depth of the current element
Dumps information about the current node and all child nodes recursively.

@ -0,0 +1,11 @@
# dump_node
```php
dump_node ( [ bool $echo = true ] ) : mixed
```
| Parameter | Description
| --------- | -----------
| `echo` | Echoes the dump details directly if enabled.
Dumps information about the current document node. Returns a string if `$echo` is set to false, null otherwise.

@ -0,0 +1,44 @@
# find
```php
find (
string $selector
[, int $idx = null ]
[, bool $lowercase = false ]
) : mixed
```
| Parameter | Description
| --------- | -----------
| `selector` | [CSS](https://www.w3.org/TR/selectors/) selector.
| `idx` | Index of element to return.
| `lowercase` | Matches tag names case-insensitively (lowercase) if enabled.
Finds one or more nodes in the current document, using CSS selectors.
* Returns null if no match was found.
* Returns an array of [`simple_html_dom_node`](../simple_html_dom_node/) if `$idx` is null.
* Returns an object of type [`simple_html_dom_node`](../simple_html_dom_node/) if `$idx` is anything __but__ null.
## Supported Selectors
| Selector | Description
| --------- | -----------
| `*` | [Universal selector](https://www.w3.org/TR/selectors/#the-universal-selector)
| `E` | [Type (tag name) selector](https://www.w3.org/TR/selectors/#type-selectors)
| `E#id` | [ID selector](https://www.w3.org/TR/selectors/#id-selectors)
| `E.class` | [Class selector](https://www.w3.org/TR/selectors/#class-html)
| `E[attr]` | [Attribute selector](https://www.w3.org/TR/selectors/#attribute-selectors)
| `E[attr="value"]` | [Attribute selector](https://www.w3.org/TR/selectors/#attribute-selectors)
| `E[attr="value"] i` | [Case-sensitivity](https://www.w3.org/TR/selectors/#attribute-case)
| `E[attr="value"] s` | [Case-sensitivity](https://www.w3.org/TR/selectors/#attribute-case)
| `E[attr~="value"]` | [Attribute selector](https://www.w3.org/TR/selectors/#attribute-selectors)
| `E[attr^="value"]` | [Substring matching attribute selector](https://www.w3.org/TR/selectors/#attribute-substrings)
| `E[attr$="value"]` | [Substring matching attribute selector](https://www.w3.org/TR/selectors/#attribute-substrings)
| `E[attr*="value"]` | [Substring matching attribute selector](https://www.w3.org/TR/selectors/#attribute-substrings)
| `E[attr|="value"]` | [Attribute selector](https://www.w3.org/TR/selectors/#attribute-selectors)
| `E F` | [Descendant combinator](https://www.w3.org/TR/selectors/#descendant-combinators)
| `E > F` | [Child combinator](https://www.w3.org/TR/selectors/#child-combinators)
| `E + F` | [Next-sibling combinator](https://www.w3.org/TR/selectors/#adjacent-sibling-combinators)
| `E ~ F` | [Subsequent-sibling combinator](https://www.w3.org/TR/selectors/#general-sibling-combinators)
| `E, F` | [Selector list](https://www.w3.org/TR/selectors/#selector-list)
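
**Example**

A short sketch of how the `$idx` argument changes the return type, using a few of the selectors listed above. The include path and the sample markup are assumptions made for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html(
    '<div id="main">'
    . '<a class="ext" href="https://example.com">x</a>'
    . '<a href="/local">y</a>'
    . '</div>'
);

// No index: an array of matching nodes.
$links = $html->find('#main a');

// With an index: a single node, or null if there is no such match.
$first = $html->find('a', 0);
$missing = $html->find('table', 0);

// Substring-matching attribute selector (value quoted).
$external = $html->find('a[href^="http"]');

foreach ($links as $link) {
    echo $link->getAttribute('href'), "\n";
}
```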

View file

@ -0,0 +1,11 @@
# find_ancestor_tag
```php
find_ancestor_tag ( string $tag ) : object
```
| Parameter | Description
| --------- | -----------
| `tag` | Tag name of the element to find.
Returns the first ancestor node that matches the specified tag name, or null if no match was found.
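**Example**

A minimal sketch that walks up from a nested element to its enclosing table. The include path and the sample markup are assumptions made for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html('<table><tr><td><b>x</b></td></tr></table>');
$bold = $html->find('b', 0);

// Walk up the tree until a <table> element is found.
$table = $bold->find_ancestor_tag('table');

// No <form> encloses the element, so nothing is found.
$form = $bold->find_ancestor_tag('form');

echo $table->tag, "\n";
```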

View file

@ -0,0 +1,7 @@
# firstChild
```php
firstChild ( ) : mixed
```
This function is a wrapper for [`first_child`](../first_child/).

View file

@ -0,0 +1,7 @@
# first_child
```php
first_child ( ) : mixed
```
Returns the first child node of the current node or null if the current node has no child nodes.

View file

@ -0,0 +1,7 @@
# getAllAttributes
```php
getAllAttributes ( ) : array
```
Returns all attributes for the current node.

View file

@ -0,0 +1,11 @@
# getAttribute
```php
getAttribute ( string $name ) : mixed
```
| Parameter | Description
| --------- | -----------
| `name` | Attribute name.
Returns the value for the attribute `$name`.

View file

@ -0,0 +1,11 @@
# getElementById
```php
getElementById ( string $id ) : object
```
| Parameter | Description
| --------- | -----------
| `id` | Element id.
Returns the first element with the specified id.

View file

@ -0,0 +1,11 @@
# getElementByTagName
```php
getElementByTagName ( string $name ) : object
```
| Parameter | Description
| --------- | -----------
| `name` | Tag name.
Returns the first element with the specified tag name.

View file

@ -0,0 +1,12 @@
# getElementsById
```php
getElementsById ( string $id [, int $idx = null] ) : mixed
```
| Parameter | Description
| --------- | -----------
| `id` | Element id.
| `idx` | Index of element to return.
Returns all elements with the specified id if `$idx` is null, or a specific one if `$idx` is a valid index.

View file

@ -0,0 +1,12 @@
# getElementsByTagName
```php
getElementsByTagName ( string $name [, int $idx = null ] ) : mixed
```
| Parameter | Description
| --------- | -----------
| `name` | Tag name.
| `idx` | Index of the element to return.
Returns all elements with the specified tag name if `$idx` is null, or a specific one if `$idx` is a valid index.
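**Example**

A brief sketch of the DOM-style lookup methods called on a node. The include path and the sample markup are assumptions made for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html('<body><div id="main"><p>one</p><p>two</p></div></body>');
$body = $html->find('body', 0);

// Single element by id.
$main = $body->getElementById('main');

// All <p> elements (no index), then one specific element (index 1).
$paragraphs = $main->getElementsByTagName('p');
$second = $main->getElementsByTagName('p', 1);

// First <p> element only.
$first = $main->getElementByTagName('p');

echo count($paragraphs), ' ', $first->innertext(), ' ', $second->innertext(), "\n";
```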

View file

@ -0,0 +1,9 @@
# get_display_size
```php
get_display_size ( ) : mixed
```
Returns false if the current node is not an image.
Returns an associative array of two elements - `height` and `width` - that represent the display size of the image.

View file

@ -0,0 +1,11 @@
# hasAttribute
```php
hasAttribute ( string $name ) : bool
```
| Parameter | Description
| --------- | -----------
| `name` | Name of the attribute.
Returns true if the current node has an attribute with the specified name.

View file

@ -0,0 +1,7 @@
# hasChildNodes
```php
hasChildNodes ( ) : bool
```
This is a wrapper function for [`has_child`](../has_child/).

View file

@ -0,0 +1,17 @@
# hasClass
```php
hasClass ( string $class ) : bool
```
| Parameter | Description
| --------- | -----------
| `class` | Specifies the class name to search for.
Returns true if the current node has the specified class name.
**Examples**
```php
$node->hasClass('article');
```

View file

@ -0,0 +1,7 @@
# has_child
```php
has_child ( ) : bool
```
Returns true if the current node has one or more child nodes.

View file

@ -0,0 +1,7 @@
# innertext
```php
innertext ( ) : string
```
Returns the inner text (everything inside the opening and closing tags) of the current node.

View file

@ -0,0 +1,11 @@
# is_utf8 (static)
```php
is_utf8 ( string $str ) : bool
```
| Parameter | Description
| --------- | -----------
| `str` | String to test.
Returns true if the provided string is a valid UTF-8 string.

View file

@ -0,0 +1,7 @@
# lastChild
```php
lastChild ( ) : object
```
This is a wrapper for [`last_child`](../last_child/).

View file

@ -0,0 +1,7 @@
# last_child
```php
last_child ( ) : object
```
Returns the last child of the current node or null if the current node has no child elements.

View file

@ -0,0 +1,7 @@
# makeup
```php
makeup ( ) : string
```
Returns the HTML representation of the current node.

View file

@ -0,0 +1,19 @@
# match (protected)
```php
match (
string $exp
, string $pattern
, string $value
, string $case_sensitivity
) : bool
```
| Parameter | Description
| --------- | -----------
| `exp` | Expression
| `pattern` | Pattern
| `value` | Value
| `case_sensitivity` | Case sensitivity
Matches a single attribute value against the specified attribute selector. See also [`find`](../find/).

View file

@ -0,0 +1,7 @@
# nextSibling
```php
nextSibling ( ) : object
```
This is a wrapper for [`next_sibling`](../next_sibling/).

View file

@ -0,0 +1,7 @@
# next_sibling
```php
next_sibling ( ) : object
```
Returns the next sibling of the current node or null if the current node has no next sibling.

View file

@ -0,0 +1,7 @@
# nodeName
```php
nodeName ( ) : string
```
Returns the name of the current node (tag name).

View file

@ -0,0 +1,7 @@
# outertext
```php
outertext ( ) : string
```
Returns the outer text (everything including the opening and closing tags) of the current node.

View file

@ -0,0 +1,12 @@
# parent
```php
parent ( [ object $parent = null ] ) : object
```
| Parameter | Description
| --------- | -----------
| `parent` | The parent node
* Returns the parent node of the current node if `$parent` is null.
* Sets the parent node of the current node if `$parent` is not null. In this case the current node is automatically added to the list of nodes in the parent node.

View file

@ -0,0 +1,7 @@
# parentNode
```php
parentNode () : object
```
Returns the parent of the current node.

View file

@ -0,0 +1,11 @@
# parse_selector (protected)
```php
parse_selector ( string $selector_string ) : array
```
| Parameter | Description
| --------- | -----------
| `selector_string` | The selector string
Parses a CSS selector into an internal format for further use. See also [`find`](../find/).

View file

@ -0,0 +1,7 @@
# prevSibling
```php
prevSibling ( ) : object
```
This is a wrapper for [`prev_sibling`](../prev_sibling/).

View file

@ -0,0 +1,7 @@
# prev_sibling
```php
prev_sibling ( ) : object
```
Returns the previous sibling of the current node, or null if the current node has no previous sibling.
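**Example**

A minimal sketch of moving between children and siblings. The include path and the sample markup are assumptions made for illustration. The markup has no whitespace between tags, so only element nodes are involved.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html('<ul><li>one</li><li>two</li><li>three</li></ul>');
$list = $html->find('ul', 0);

$first = $list->first_child();
$second = $first->next_sibling();
$last = $list->last_child();

// The first child has no previous sibling.
$beforeFirst = $first->prev_sibling();

// Stepping back from the last child lands on the second one.
$beforeLast = $last->prev_sibling();

echo $second->innertext(), "\n";
```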

View file

@ -0,0 +1,39 @@
# remove
```php
remove ( )
```
Removes the current node recursively from the DOM.
Does nothing if the node has no parent (root node).
**Example**
```php
$html = str_get_html(<<<EOD
<html>
<body>
<table>
<tr><th>Title</th></tr>
<tr><td>Row 1</td></tr>
</table>
</body>
</html>
EOD
);
$table = $html->find('table', 0);
$table->remove();
echo $html;
/**
* Returns
*
* <html> <body> </body> </html>
*/
```
**Remarks**
* Whitespace immediately **before** the removed node will remain in the DOM.

View file

@ -0,0 +1,11 @@
# removeAttribute
```php
removeAttribute ( string $name )
```
| Parameter | Description
| --------- | -----------
| `name` | Name of the attribute to remove.
Removes the attribute with the specified name from the current node.

View file

@ -0,0 +1,43 @@
# removeChild
```php
removeChild ( object $node )
```
| Parameter | Description
| --------- | -----------
| `node` | Node to remove from current element, must be a child of the current element.
Removes the node recursively from the DOM.
Does nothing if the provided node is not a child of the current node.
**Example**
```php
$html = str_get_html(<<<EOD
<html>
<body>
<table>
<tr><th>Title</th></tr>
<tr><td>Row 1</td></tr>
</table>
</body>
</html>
EOD
);
$body = $html->find('body', 0);
$body->removeChild($body->find('table', 0));
echo $html;
/**
* Returns
*
* <html> <body> </body> </html>
*/
```
**Remarks**
* Whitespace immediately **before** the removed node will remain in the DOM.

View file

@ -0,0 +1,25 @@
# removeClass
```php
removeClass ( [ mixed $class = null ] )
```
| Parameter | Description
| --------- | -----------
| `class` | Specifies one or more class names to be removed.
Removes one or more class names from the current node.
**Remarks**
* To remove more than one class, separate the class names with spaces or provide them as an array.
* If no parameter is specified, this method will remove all class names from the current node.
**Examples**
```php
$node->removeClass('hidden');
$node->removeClass('article important');
$node->removeClass(array('article', 'new'));
$node->removeClass();
```

View file

@ -0,0 +1,20 @@
# save
```php
save ( [ string $filepath = '' ] ) : string
```
Writes the current node to file.
| Parameter | Description
| --------- | -----------
| `filepath` | Writes to file if the provided file path is not empty.
Returns the document string.
**Examples**
```php
$string = $node->save();
$string = $node->save($file);
```

View file

@ -0,0 +1,19 @@
# seek (protected)
```php
seek (
string $selector
, array &$ret
, string $parent_cmd
[, bool $lowercase = false ]
)
```
| Parameter | Description
| --------- | -----------
| `selector` | The current selector.
| `ret` | Previous return value (starting point).
| `parent_cmd` | The combinator used before the current selector.
| `lowercase` | Matches tag names case-insensitively (lowercase) if enabled.
Starts by searching for child elements of `$ret` that match the specified selector. Adds matching elements to `$ret` (for the next iteration).

View file

@ -0,0 +1,12 @@
# setAttribute
```php
setAttribute ( string $name, string $value )
```
| Parameter | Description
| --------- | -----------
| `name` | Attribute name
| `value` | Attribute value
Adds or sets an attribute in the current node to the specified value.
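**Example**

A short sketch combining `setAttribute` with the related `hasAttribute`, `getAttribute`, and `removeAttribute` methods. The include path and the sample markup are assumptions made for illustration.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html('<img src="a.png">');
$img = $html->find('img', 0);

// The attribute does not exist yet.
$before = $img->hasAttribute('alt');

// Add the attribute, then read it back.
$img->setAttribute('alt', 'Logo');
$after = $img->hasAttribute('alt');
$value = $img->getAttribute('alt');

echo $value, "\n";

// Remove it again once it is no longer needed.
$img->removeAttribute('alt');
```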

View file

@ -0,0 +1,30 @@
---
title: simple_html_dom_node
---
# simple_html_dom_node
Represents a single node in the DOM tree (see [`simple_html_dom`](../../simple_html_dom/simple_html_dom/)).
# Public Properties
| Property | Description
| -------- | -----------
| `_` | Node meta data (i.e. type of node).
| `attr` | List of attributes.
| `children` | List of child nodes.
| `nodes` | List of nodes.
| `nodetype` | Node type.
| `parent` | Parent node object.
| `tag` | Node's tag name.
| `tag_start` | Start position of the tag name in the original document.
# Protected Properties
None.
# Private Properties
| Property | Description
| -------- | -----------
| `dom` | The DOM object (see [`simple_html_dom`](../../simple_html_dom/simple_html_dom/)).

View file

@ -0,0 +1,7 @@
# text
```php
text ( ) : string
```
Returns the (HTML) text representation for the current node recursively.

View file

@ -0,0 +1,7 @@
# xmltext
```php
xmltext ( ) : string
```
Returns the XML representation for the inner text of the current node as a CDATA section.

View file

@ -0,0 +1,21 @@
---
title: str_get_html
---
# str_get_html
```php
str_get_html ( string $str [, bool $lowercase = true [, bool $forceTagsClosed = true [, string $target_charset = DEFAULT_TARGET_CHARSET [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT ]]]]]] )
```
Parses the provided string and returns the DOM object.
| Parameter | Description
| --------- | -----------
| `str` | The HTML document string.
| `lowercase` | Forces lowercase matching of tags if enabled. This is very useful when loading documents with mixed naming conventions.
| `forceTagsClosed` | Obsolete. This parameter is no longer used by the parser.
| `target_charset` | Defines the target charset when returning text from the document.
| `stripRN` | If enabled, removes newlines before parsing the document.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.
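
**Example**

A minimal sketch of parsing a string with the default arguments. The include path and the sample markup are assumptions made for illustration. With `lowercase` left enabled, the upper-case tags in the input can be matched with a lower-case selector.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

$html = str_get_html('<DIV><P>Hello</P></DIV>');

// Guard against a failed parse before using the result.
if ($html === false) {
    exit("Could not parse the document\n");
}

$paragraph = $html->find('p', 0);
echo $paragraph->innertext(), "\n";
```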

60
lib/sd/manual/docs/faq.md Normal file
View file

@ -0,0 +1,60 @@
# FAQ
## Problem with finding
Q: The element is not found in a case like this: `$html->find('div[style=padding: 0px 2px;] span[class=rf]');`
A: If an attribute value in a selector contains blanks, quote it:
$html->find('div[style="padding: 0px 2px;"] span[class=rf]');
## Problem with hosting
Q: On my local server everything works fine, but when I put it on my external server it doesn't work.
A: The "file_get_html" function is a wrapper around the "file_get_contents" function, so "allow_url_fopen" must be set to TRUE in "php.ini" to allow accessing files via HTTP or FTP. However, some hosting vendors disable PHP's "allow_url_fopen" flag for security reasons. PHP also provides excellent support for the "curl" library, which can do the same job: use curl to get the page, then call "str_get_html" to create the DOM object.
Example:
$curl = curl_init();
curl_setopt($curl, CURLOPT_URL, 'http://????????');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
$str = curl_exec($curl);
curl_close($curl);
$html= str_get_html($str);
...
## Behind a proxy
Q: My server is behind a proxy and I can't use file_get_contents because it returns an unauthorized error.
A: Thanks to Shaggy for providing the solution:
// Define a context for HTTP.
$context = array
(
'http' => array
(
'proxy' => 'addresseproxy:portproxy', // This needs to be the server and the port of the NTLM Authentication Proxy Server.
'request_fulluri' => true,
),
);
$context = stream_context_create($context);
$html= file_get_html('http://www.php.net', false, $context);
...
## Memory leak
Q: This script is leaking memory seriously. After it has finished running, it does not clean up the DOM object properly from memory.
A: Due to the circular-reference memory leak in PHP 5, you must call $html->clear() on each DOM object you create to free its memory if you call file_get_html() more than once.
Example:
$html = file_get_html(...);
// do something...
$html->clear();
unset($html);
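The same pattern applies when several documents are parsed in a loop. The following is a sketch only: the include path and the sample documents are assumptions, and `str_get_html` stands in for `file_get_html` so that the example needs no network access. Copy the values you need into plain strings before clearing the DOM object.

```php
<?php
// Assumption: the library file is reachable under this name.
require_once 'simple_html_dom.php';

// Hypothetical sample documents used for illustration.
$documents = array('<p>first</p>', '<p>second</p>');
$texts = array();

foreach ($documents as $markup) {
    $html = str_get_html($markup);
    if ($html === false) {
        continue;
    }

    // Copy out plain strings before the DOM object is cleared.
    $texts[] = $html->find('p', 0)->innertext();

    // Break the circular references and drop the object.
    $html->clear();
    unset($html);
}

echo implode(', ', $texts), "\n";
```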

Some files were not shown because too many files have changed in this diff Show more