steven 2025-08-11 22:23:30 +02:00
commit 72a26edcff
22092 changed files with 2101903 additions and 0 deletions

View file

@ -0,0 +1,20 @@
# __construct
```php
__construct ( [ string $str = null [, bool $lowercase = true [, bool $forceTagsClosed = true [, string $target_charset = DEFAULT_TARGET_CHARSET [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT [, int $options = 0 ]]]]]]]]) : object
```
Creates a new `simple_html_dom` object.
| Parameter | Description
| --------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `forceTagsClosed` | Tags inside block tags are forcefully closed if the closing tag was omitted.
| `target_charset` | Defines the target charset for text returned by the parser.
| `stripRN` | Newline characters are replaced by whitespace if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.
| `options` | Additional options for the parser. Currently supports the constant `HDOM_SMARTY_AS_TEXT` to remove [Smarty](https://www.smarty.net/) scripts.
Returns the object.

View file

@ -0,0 +1,7 @@
# __destruct
```php
__destruct ()
```
Destroys the current object and clears memory.

View file

@ -0,0 +1,17 @@
# __get
```php
__get ( string $name ) : mixed
```
See [magic methods](http://php.net/manual/en/language.oop5.overloading.php#object.get).
Supports the following names:
| Name | Description
| ---- | -----------
| `outertext` | Returns the outer text of the root element.
| `innertext` | Returns the inner text of the root element.
| `plaintext` | Returns the plain text of the root element.
| `charset` | Returns the charset for the document.
| `target_charset` | Returns the target charset for the document.

View file

@ -0,0 +1,7 @@
# __toString
```php
__toString () : string
```
Returns the inner text of the root element of the DOM.

View file

@ -0,0 +1,13 @@
# as_text_node (protected)
```php
as_text_node ( string $tag ) : bool
```
Adds a tag as text node.
| Parameter | Description
| --------- | -----------
| `tag` | The element's tag name.
Returns true on success.

View file

@ -0,0 +1,11 @@
# childNodes
```php
childNodes ( [ int $idx = -1 ] ) : mixed
```
Returns children of the root element.
| Parameter | Description
| --------- | -----------
| `idx` | Index of the child element to return.

View file

@ -0,0 +1,7 @@
# clear
```php
clear ()
```
Cleans up memory to prevent [PHP 5 circular references memory leak](https://bugs.php.net/bug.php?id=33595).

View file

@ -0,0 +1,13 @@
# copy_skip (protected)
```php
copy_skip ( string $chars ) : string
```
Skips characters starting at the current parsing position in the document. Sets the parsing position to the first character not in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to skip.
Returns the skipped characters.
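
The skipping behavior described above can be sketched as a standalone function. This is only an illustration, not the library's code: the name `skip_chars` and the explicit `$pos` reference parameter are assumptions, whereas the protected method works on the object's internal `doc` and `pos` properties.

```php
<?php
// Hypothetical standalone helper; not part of simple_html_dom.
function skip_chars(string $doc, int &$pos, string $chars): string
{
    // strspn() counts how many characters from $pos onward are in $chars.
    $len = strspn($doc, $chars, $pos);
    $skipped = substr($doc, $pos, $len);
    $pos += $len; // now at the first character that is not in $chars
    return $skipped;
}

$doc = "  \t<p>text</p>";
$pos = 0;
$skipped = skip_chars($doc, $pos, " \t\r\n");
// $skipped holds the three leading whitespace characters; $doc[$pos] is '<'.
```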

View file

@ -0,0 +1,13 @@
# copy_until (protected)
```php
copy_until ( string $chars ) : string
```
Copies all characters starting at the current parsing position in the document. Sets the parsing position to the first character that matches any of the characters in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to stop copying at.
Returns the copied characters.
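
A minimal sketch of the copy-until behavior, assuming a standalone function instead of the protected method. The name `copy_until_any` and the `$pos` reference parameter are hypothetical; the library keeps the document and position in its own properties.

```php
<?php
// Hypothetical standalone helper; not part of simple_html_dom.
function copy_until_any(string $doc, int &$pos, string $chars): string
{
    // strcspn() counts the characters from $pos that are NOT in $chars.
    $len = strcspn($doc, $chars, $pos);
    $copied = substr($doc, $pos, $len);
    $pos += $len; // at the first stop character, or at the end of $doc
    return $copied;
}

$doc = 'href="index.html">';
$pos = 0;
$name = copy_until_any($doc, $pos, ' =/>');
// $name is 'href' and $doc[$pos] is '='.
```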

View file

@ -0,0 +1,13 @@
# copy_until_char (protected)
```php
copy_until_char ( string $char ) : string
```
Copies all characters starting at the current parsing position in the document. Sets the parsing position to the first character that matches the provided character.
| Parameter | Description
| --------- | -----------
| `char` | A character to stop copying at.
Returns the copied characters.

View file

@ -0,0 +1,14 @@
# createElement
```php
createElement ( string $name [, string $value = null ] ) : object
```
Creates a new element.
| Parameter | Description
| --------- | -----------
| `name` | Name of the element.
| `value` | Value of the element.
Returns the element.

View file

@ -0,0 +1,9 @@
# createTextNode
```php
createTextNode ( string $value ) : object
```
Creates a new text element.
Returns the element.

View file

@ -0,0 +1,13 @@
# dump
```php
dump ( [ bool $show_attr = true ] ) : string
```
Dumps the entire DOM into a string. Useful for debugging purposes.
| Parameter | Description
| --------- | -----------
| `show_attr` | Attributes are included in the dump when enabled.
Returns the DOM tree as a string.

View file

@ -0,0 +1,15 @@
# find
```php
find ( string $selector [, int $idx = null [, bool $lowercase = false ]] ) : mixed
```
Finds elements in the DOM.
| Parameter | Description
| --------- | -----------
| `selector` | A [CSS style selector](/manual/selectors).
| `idx` | Index of the element to return.
| `lowercase` | Matches tag names case-insensitively when enabled.
Returns an array of matches or a single element if `idx` is defined.
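
A short usage sketch of the two return shapes. It assumes the library file is available as `simple_html_dom.php` in the include path; the sample markup and variable names are only illustrative.

```php
<?php
// Assumption: the library file can be loaded from this path.
require_once 'simple_html_dom.php';

$dom = new simple_html_dom();
$dom->load('<ul><li>one</li><li>two</li></ul>');

$items = $dom->find('li');    // no idx: an array of all matching elements
$first = $dom->find('li', 0); // idx given: a single element

foreach ($items as $li) {
    echo $li->plaintext, PHP_EOL;
}
```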

View file

@ -0,0 +1,7 @@
# firstChild
```php
firstChild () : object
```
Returns the first child of the root element.

View file

@ -0,0 +1,13 @@
# getElementById
```php
getElementById ( string $id ) : object
```
Searches for an element by ID.
| Parameter | Description
| --------- | -----------
| `id` | ID of the element to find.
Returns the element or null if no match was found.

View file

@ -0,0 +1,13 @@
# getElementByTagName
```php
getElementByTagName ( string $name ) : object
```
Searches for an element by tag name.
| Parameter | Description
| --------- | -----------
| `name` | Tag name of the element to find.
Returns the element or null if no match was found.

View file

@ -0,0 +1,14 @@
# getElementsById
```php
getElementsById ( string $id [, int $idx = null ] ) : object
```
Searches for elements by ID.
| Parameter | Description
| --------- | -----------
| `id` | ID of the element to find.
| `idx` | Returns the element at the specified index if defined.
Returns the element(s) or null if no match was found.

View file

@ -0,0 +1,14 @@
# getElementsByTagName
```php
getElementsByTagName ( string $name [, int $idx = -1 ] ) : object
```
Searches for elements by tag name.
| Parameter | Description
| --------- | -----------
| `name` | Tag name of the element to find.
| `idx` | Returns the element at the specified index.
Returns the element(s) or null if no match was found.

View file

@ -0,0 +1,7 @@
# lastChild
```php
lastChild () : object
```
Returns the last child of the root element.

View file

@ -0,0 +1,12 @@
# link_nodes (protected)
```php
link_nodes ( object &$node, bool $is_child )
```
Links the provided node to the DOM tree.
| Parameter | Description
| --------- | -----------
| `node` | The node to link to the DOM tree.
| `is_child` | If active, makes the node a sibling of the current node (child of parent).

View file

@ -0,0 +1,18 @@
# load
```php
load ( string $str [, bool $lowercase = true [, bool $stripRN = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT [, int $options = 0 ]]]]]) : object
```
Loads the provided HTML document string.
| Parameter | Description
| --------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `stripRN` | Newline characters are replaced by whitespace if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.
| `options` | Additional options for the parser. Currently supports the constant `HDOM_SMARTY_AS_TEXT` to remove [Smarty](https://www.smarty.net/) scripts.
Returns the object.

View file

@ -0,0 +1,7 @@
# loadFile
```php
loadFile (...)
```
This function is a wrapper for [`load_file`](#load_file).

View file

@ -0,0 +1,9 @@
# load_file
```php
load_file (...) : object
```
Loads an HTML document from a file. Supports the arguments of [`file_get_contents`](http://php.net/manual/en/function.file-get-contents.php).
Returns the object.

View file

@ -0,0 +1,7 @@
# parse (protected)
```php
parse ()
```
Parses the document. This function is called after the document was loaded into `$this->doc`.

View file

@ -0,0 +1,13 @@
# parse_attr (protected)
```php
parse_attr ( object $node, string $name, array &$space )
```
Parses a single attribute starting at the current parsing position in the document.
| Parameter | Description
| --------- | -----------
| `node` | The current element (node).
| `name` | The attribute name.
| `space` | An array of whitespace surrounding the current attribute (see [Attribute Whitespace](../definitions/#attribute-whitespace)).

View file

@ -0,0 +1,15 @@
# parse_charset (protected)
```php
parse_charset ()
```
Parses the charset.
If the callback function `get_last_retrieve_url_contents_content_type` exists, it is assumed to return the content type header for the current document as a string.
Uses the charset from the metadata of the page if defined.
If none of the previous conditions are met, the charset is determined by `mb_detect_encoding` if multi-byte support is active.
If multi-byte support is not active the charset is assumed to be `'UTF-8'`.
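
The fallback order above can be sketched as follows. This is a simplified illustration, not the library's implementation: the function name `guess_charset`, the `$contentType` parameter (standing in for the callback's return value), the regular expressions, and the candidate encoding list are all assumptions.

```php
<?php
// Hypothetical standalone helper; not part of simple_html_dom.
function guess_charset(string $doc, ?string $contentType = null): string
{
    // 1. Content type header, e.g. "text/html; charset=ISO-8859-1".
    if ($contentType !== null
        && preg_match('/charset=([\w-]+)/i', $contentType, $m)) {
        return strtoupper($m[1]);
    }
    // 2. Charset declared in a <meta> element of the page.
    if (preg_match('/<meta[^>]+charset=["\']?([\w-]+)/i', $doc, $m)) {
        return strtoupper($m[1]);
    }
    // 3. Detection via the mbstring extension, if it is loaded.
    if (function_exists('mb_detect_encoding')) {
        $detected = mb_detect_encoding($doc, ['UTF-8', 'ISO-8859-1'], true);
        if ($detected !== false) {
            return $detected;
        }
    }
    // 4. Last resort when nothing else applies.
    return 'UTF-8';
}

$fromHeader = guess_charset('<p>hi</p>', 'text/html; charset=iso-8859-1');
$fromMeta = guess_charset('<meta charset="windows-1252"><p>hi</p>');
```

Here the header wins for `$fromHeader`, and the `<meta>` declaration is used for `$fromMeta` because no header is supplied.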

View file

@ -0,0 +1,14 @@
# prepare (protected)
```php
prepare ( string $str [, bool $lowercase = true [, string $defaultBRText = DEFAULT_BR_TEXT [, string $defaultSpanText = DEFAULT_SPAN_TEXT ]]] )
```
Initializes the DOM object.
| Parameter | Description
| ---------- | -----------
| `str` | The HTML document string.
| `lowercase` | Tag names are parsed in lowercase letters if enabled.
| `defaultBRText` | Defines the default text to return for `<br>` elements.
| `defaultSpanText` | Defines the default text to return for `<span>` elements.

View file

@ -0,0 +1,9 @@
# read_tag (protected)
```php
read_tag () : bool
```
Reads a single tag starting at the current parsing position in the document. The tag is automatically added to the DOM.
Returns true if a tag was found.

View file

@ -0,0 +1,7 @@
# remove_callback
```php
remove_callback ()
```
Removes the callback set by [`set_callback`](#set_callback).

View file

@ -0,0 +1,14 @@
# remove_noise (protected)
```php
remove_noise ( string $pattern [, bool $remove_tag = false] )
```
Replaces noise in the document (e.g. scripts) with placeholders and adds the removed contents to `$this->noise`.
_Note_: Noise is replaced with placeholders so that the original contents can be restored later. Placeholders take the form of `'___noise___1000'`, where the number is increased by one for each piece of noise removed.
| Parameter | Description
| --------- | -----------
| `pattern` | A regular expression that matches the noise to remove.
| `remove_tag` | Removes the entire match when enabled or submatches when disabled.
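
The placeholder mechanism can be sketched with a standalone function. This is an assumption-laden illustration, not the library's code: `replace_noise` and the `$noise` reference parameter are hypothetical, and the sketch only covers the case where the entire match is replaced (the `remove_tag` enabled behavior).

```php
<?php
// Hypothetical standalone helper; not part of simple_html_dom.
function replace_noise(string $doc, string $pattern, array &$noise): string
{
    return preg_replace_callback(
        $pattern,
        function (array $m) use (&$noise): string {
            // Numbering starts at 1000 and grows with each stored item.
            $key = '___noise___' . (count($noise) + 1000);
            $noise[$key] = $m[0]; // keep the original text for later
            return $key;
        },
        $doc
    );
}

$noise = [];
$doc = '<p>a</p><script>x()</script><p>b</p><script>y()</script>';
$clean = replace_noise($doc, '/<script[^>]*>.*?<\/script>/is', $noise);
// $clean is '<p>a</p>___noise___1000<p>b</p>___noise___1001'
```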

View file

@ -0,0 +1,13 @@
# restore_noise (protected)
```php
restore_noise ( string $text ) : string
```
Restores noise in the provided string by replacing noise placeholders with their original contents.
| Parameter | Description
| --------- | -----------
| `text` | A string (potentially) containing noise placeholders.
Returns the string with original contents restored or the original string if it doesn't contain noise placeholders.
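
A minimal sketch of the restore step, assuming the removed contents are kept in an array keyed by placeholder. The function name `restore_placeholders` and the `$noise` parameter are hypothetical; the protected method reads from `$this->noise` instead.

```php
<?php
// Hypothetical standalone helper; not part of simple_html_dom.
function restore_placeholders(string $text, array $noise): string
{
    // Early exit when there is nothing to restore.
    if ($noise === [] || strpos($text, '___noise___') === false) {
        return $text;
    }
    // strtr() swaps every known placeholder key for its stored value.
    return strtr($text, $noise);
}

$noise = ['___noise___1000' => '<script>x()</script>'];
$restored = restore_placeholders('<p>a</p>___noise___1000', $noise);
// $restored is '<p>a</p><script>x()</script>'
```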

View file

@ -0,0 +1,13 @@
# save
```php
save ( [ string $filepath = '' ] ) : string
```
Returns the current DOM as a string and writes it to a file if a file path is provided.
| Parameter | Description
| --------- | -----------
| `filepath` | Writes to file if the provided file path is not empty.
Returns the document string.

View file

@ -0,0 +1,13 @@
# search_noise (protected)
```php
search_noise ( string $text ) : string
```
Finds a single noise element by providing the noise placeholder text.
| Parameter | Description
| --------- | -----------
| `text` | The noise placeholder to find.
Returns the original contents for the placeholder.

View file

@ -0,0 +1,12 @@
# set_callback
```php
set_callback ( string $function_name )
```
Sets the callback function which is called on each element of the DOM when building outertext.
The function must accept a single parameter of type `simple_html_dom_node`.
| Parameter | Description
| --------- | -----------
| `function_name` | Name of the function.
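
A usage sketch, assuming the library file is available as `simple_html_dom.php` in the include path. The callback name `strip_bold` and the sample markup are hypothetical.

```php
<?php
// Assumption: the library file can be loaded from this path.
require_once 'simple_html_dom.php';

// Hypothetical callback: blanks out every <b> element.
function strip_bold($element)
{
    if ($element->tag === 'b') {
        $element->outertext = '';
    }
}

$dom = new simple_html_dom();
$dom->load('<div>Hello <b>World</b></div>');
$dom->set_callback('strip_bold');

// The callback runs for each element while the output is built.
$output = $dom->save();
```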

View file

@ -0,0 +1,40 @@
---
title: simple_html_dom
---
# simple_html_dom
Represents the [DOM](https://en.wikipedia.org/wiki/Document_Object_Model) in memory. Provides functions to parse documents and access individual elements (see [`simple_html_dom_node`](../simple_html_dom_node/simple_html_dom_node.md)).
# Public Properties
| Property | Description
| -------- | -----------
| `root` | Root node of the document.
| `nodes` | List of top-level nodes in the document.
| `callback` | Callback function that is called for each element in the DOM when generating outertext.
| `lowercase` | If enabled, all tag names are converted to lowercase when parsing documents.
| `original_size` | Original document size in bytes.
| `size` | Current document size in bytes.
| `_charset` | Charset of the original document.
| `_target_charset` | Target charset for the current document.
| `default_span_text` | Text to return for `<span>` elements.
# Protected Properties
| Property | Description
| -------- | -----------
| `pos` | Current parsing position within `doc`.
| `doc` | The original document.
| `char` | Character at position `pos` in `doc`.
| `cursor` | Current element cursor in the document.
| `parent` | Parent element node.
| `noise` | Noise from the original document (e.g. scripts and comments).
| `token_blank` | Tokens that are considered whitespace in HTML.
| `token_equal` | Tokens to identify the equal sign for attributes, stopping either at the closing tag ("/", e.g. `<html />`) or the end of an opening tag (">", e.g. `<html>`).
| `token_slash` | Tokens to identify the end of a tag name. A tag name ends either at the closing slash ("/", e.g. `<html/>`) or at whitespace (`"\s\r\n\t"`).
| `token_attr` | Tokens to identify the end of an attribute.
| `default_br_text` | Text to return for `<br>` elements.
| `self_closing_tags` | A list of tag names where the closing tag is omitted.
| `block_tags` | A list of tag names where remaining unclosed tags are forcibly closed.
| `optional_closing_tags` | A list of tag names where the closing tag can be omitted.

View file

@ -0,0 +1,12 @@
# skip (protected)
```php
skip ( string $chars )
```
Skips characters starting at the current parsing position in the document. Sets the parsing position to the first character not in the provided list of characters.
| Parameter | Description
| --------- | -----------
| `chars` | A list of characters to skip.