# Content extraction

Content extraction is how BeyondWords turns source HTML into the **article** used to generate [audio](/docs-and-guides/content/audio) and [video](/docs-and-guides/content/video). It applies when ingesting content through [Magic Embed](/docs-and-guides/integrations/magic-embed), the [RSS Feed Importer](/docs-and-guides/integrations/rss-feed-importer), the [WordPress](/docs-and-guides/integrations/publishing-platforms/wordpress) and [Ghost](/docs-and-guides/integrations/publishing-platforms/ghost) plugins, the [API](/docs-and-guides/integrations/api-overview), or URL imports.

<Info>
  Extracted text becomes the **article** within the content item. You can edit the article in the [Editor](/docs-and-guides/tools/editor), but changes there switch the item to `manual_segment` processing—see [Processing type and filters](#processing-type-and-filters) below.
</Info>

<Tip>
  For precise control over title, author, publish date, and other metadata in your HTML, use [data attributes](/docs-and-guides/integrations/data-attributes) alongside extraction settings and filters.
</Tip>

## How it works

1. BeyondWords receives HTML—fetched from a live URL, imported from a feed, or sent directly via an integration.
2. **Content filters** (if configured) remove or retain whole HTML elements in the raw markup.
3. An **extraction mode** determines how editorial content is identified from the filtered HTML.
4. The result becomes the article used for segmentation and audio/video generation.

Filters run on HTML **before** segments are created. They operate on elements, not on individual words inside a paragraph.

## Extraction modes

Configure the extraction mode in **Settings → Extraction**.

| Mode                                         | How it works                                                  | Best for                           |
| -------------------------------------------- | ------------------------------------------------------------- | ---------------------------------- |
| [Automatic](#automatic-extraction) (default) | AI identifies editorial content; filters fine-tune the result | Most new projects                  |
| [Manual](#manual-extraction)                 | Filters alone define what is included or excluded             | Predictable, markup-driven control |
| [Legacy](#legacy-extraction)                 | Rule-based heuristics plus filters                            | Deprecated—migrate to Automatic    |

### Automatic extraction

Automatic extraction uses AI to identify and extract editorial content from your page while ignoring elements that should not be used for audio or video generation. Configured [content filters](#filters) are applied to the HTML before AI extraction runs. In most cases, **Exclude** filters are the right tool—see [Extraction mode and filters](#extraction-mode-and-filters).

<Warning>
  Although we have safeguards in place to improve extraction accuracy, AI-based extraction can occasionally introduce inaccuracies. Use **Exclude** filters to remove recurring non-editorial blocks (newsletter sign-ups, related-article widgets, etc.).
</Warning>

### Manual extraction

Manual extraction relies on [content filters](#filters) to define which parts of a page are included or excluded. This gives more predictable results than automatic extraction, but requires careful filter configuration—especially [default filters](#default-filters) and any **Include** rules. See [Extraction mode and filters](#extraction-mode-and-filters).

### Legacy extraction

<Warning>
  Legacy extraction is **deprecated** and scheduled for removal. It remains available for existing projects but should not be used for new implementations. Migrate to [Automatic](#automatic-extraction) extraction.
</Warning>

Legacy extraction uses rule-based heuristics and [content filters](#filters) to identify the main editorial content on a page. This mode exists for backwards compatibility with older projects that relied on the previous extraction pipeline.

## Extraction settings

### Static IP

If your site requires IP allowlisting for BeyondWords to fetch article pages, enable static IP:

1. Go to **Settings → Extraction** in your project dashboard
2. Switch **Static IP** on
3. Allowlist the displayed IP addresses in your firewall, CDN, or server configuration

<Info>
  Page-fetch requests are sent from `20.234.8.180` with `User-Agent: BeyondWords Importer`. This applies to URL fetching (Magic Embed, RSS page extraction, etc.)—not to [webhook](/docs-and-guides/integrations/webhooks) delivery.
</Info>
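
If you enforce the allowlist in application code rather than at the firewall or CDN, the check can be as small as the sketch below. The IP address and `User-Agent` value are the ones quoted above. The function name and the exact-match comparison on the header are assumptions for illustration, and the addresses displayed in your dashboard take precedence over the value hard-coded here.

```python
import ipaddress

# Values quoted in this section; confirm them against the addresses
# displayed in Settings → Extraction before relying on them.
BEYONDWORDS_IPS = {ipaddress.ip_address("20.234.8.180")}
BEYONDWORDS_USER_AGENT = "BeyondWords Importer"


def is_beyondwords_fetch(remote_ip: str, user_agent: str) -> bool:
    """Return True when a request looks like a BeyondWords page fetch."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        # Malformed address: never treat it as allowlisted.
        return False
    # Assumption: the header value is compared exactly.
    return ip in BEYONDWORDS_IPS and user_agent.strip() == BEYONDWORDS_USER_AGENT


print(is_beyondwords_fetch("20.234.8.180", "BeyondWords Importer"))  # True
print(is_beyondwords_fetch("203.0.113.7", "BeyondWords Importer"))   # False
```

The `User-Agent` header is easy to spoof, so treat the IP comparison as the real gate and the header as a secondary signal.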

## Filters

Content filters control which parts of your source HTML are kept or removed before extraction and segmentation. Use them to:

* **Exclude** recurring elements that should not be read aloud—newsletter sign-ups, related-article blocks, social embeds, footnotes
* **Include** only specific containers when extraction picks up more than you need (use sparingly—see [Include vs Exclude](#include-vs-exclude))

### When filters apply

Dashboard content filters run on raw HTML **before** extraction and segmentation. Whether they run depends on how content reaches BeyondWords and its processing type:

| Source                                                                                                                                               | Filters applied?                                                       |
| ---------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------- |
| [Magic Embed](/docs-and-guides/integrations/magic-embed) (live page fetch)                                                                           | Yes—before AI or rule-based extraction                                 |
| [RSS Feed Importer](/docs-and-guides/integrations/rss-feed-importer) (page fetch enabled)                                                            | Yes—on fetched article HTML                                            |
| [API](/docs-and-guides/integrations/api-overview) / [WordPress](/docs-and-guides/integrations/publishing-platforms/wordpress) REST API (`body` HTML) | Yes—when the content item uses `auto_segment`                          |
| [Ghost](/docs-and-guides/integrations/publishing-platforms/ghost) plugin                                                                             | Yes—on HTML sent to BeyondWords, when `auto_segment`                   |
| Dashboard [Editor](/docs-and-guides/tools/editor)                                                                                                    | No—saving in the Editor sets `manual_segment`; filters do not re-apply |

You can also manage filters via the [Content filters API](/api-reference/content-filters/list) and extraction settings via the [Content extraction settings API](/api-reference/content-extraction-settings/show).

<Warning>
  Content filters only apply to articles with `auto_segment` processing. If an article was created or last edited in the Editor, its processing type is `manual_segment` and filters will not apply.

  To restore filter processing, send a new API update with `type` set to `auto_segment`. Alternatively, leave the article as `manual_segment` and make the changes by hand in the Editor instead of relying on filters.
</Warning>

For the [WordPress](/docs-and-guides/integrations/publishing-platforms/wordpress) REST API integration, dashboard filters apply to the HTML in the request `body`. WordPress users can also preprocess that HTML before it is sent using the PHP [`beyondwords_content_params`](/docs-and-guides/integrations/publishing-platforms/wordpress#beyondwords_content_params) hook—that is a WordPress plugin feature, not a dashboard content filter.

Filters take effect on the **next** extraction or regeneration after you save them—not on content that was already generated. See [When do filters take effect?](#when-do-filters-take-effect) in the FAQs.

### Filter scope

Filters can be scoped to **All projects** (organization-wide) or **This project only**. At runtime, **both** apply:

* Organization-wide filters (`All projects`) run for every project in your account
* Project-specific filters (`This project only`) run in addition to organization-wide filters

Review filters at both levels when troubleshooting—a rule configured under **All projects** affects every project even if a single project's filter list looks correct.
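
When troubleshooting, it can help to picture the filters that run for a project as both scopes combined. The sketch below models that idea only. The `Filter` record and its field names are hypothetical and do not mirror the dashboard or API schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    rule: str   # "exclude" or "include"
    kind: str   # e.g. "type", "class", "id"
    value: str
    scope: str  # "All projects" or "This project only"


def effective_filters(org_wide, project_only):
    """Both scopes apply at runtime, so a project sees the two lists combined."""
    return list(org_wide) + list(project_only)


org_wide = [Filter("exclude", "class", "content", "All projects")]
project_only = [Filter("exclude", "type", "aside", "This project only")]

# Listing the combined set shows rules that the project's own list would hide.
for f in effective_filters(org_wide, project_only):
    print(f"{f.scope}: {f.rule} {f.kind} {f.value}")
```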

### Default filters

New BeyondWords accounts are created with preset **Type** filters. These are normal dashboard filters that are active by default (you do not opt in), and they run in the same way as any filter you add yourself (**Exclude** first, then **Include**).

**Exclude** (element types removed from the HTML):

`aside`, `figcaption`, `footer`, `form`, `iframe`, `nav`, `noscript`

**Include** (only these element types and their ancestors/descendants are kept—everything else is stripped):

`head`, `p`, `div`, `li`, `h1`, `h2`, `h3`, `h4`, `h5`, `h6`, `blockquote`, `table`, `img`

These defaults explain why some markup disappears even before you add custom filters, and why **Manual** extraction mode is sensitive to **Include** rules. You can edit or delete default filters in **Settings → Extraction → Filters** like any other filter.

### Extraction mode and filters

| Extraction mode         | Role of filters                              | Recommended approach                                                                                                                                                                              |
| ----------------------- | -------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Automatic** (default) | Fine-tune HTML before AI extraction          | Add **Exclude** filters for recurring non-editorial blocks. AI handles most editorial identification.                                                                                             |
| **Manual**              | Filters are the primary extraction mechanism | Understand [default filters](#default-filters) and any **Include** rules—they define what survives. Use **Exclude** for specific removals; use **Include** only when you need a narrow allowlist. |
| **Legacy** (deprecated) | Applied before rule-based heuristics         | Migrate to **Automatic**. Use **Exclude** filters as you would with Automatic mode.                                                                                                               |

### Include vs Exclude

| Rule        | Effect                                                                                                                                                                                        |
| ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **Exclude** | Remove matching elements from the HTML. This is the most common choice—remove a sidebar, footnote block, or newsletter paragraph.                                                             |
| **Include** | Keep only matching elements (plus their ancestors and descendants). Everything else is removed. Use only when you want to extract from a specific container and discard the rest of the page. |

Most support cases are solved with **Exclude** filters. **Include** filters are aggressive—a single overly broad Include rule can strip most of the article. This is especially true in **Manual** extraction mode and on accounts with [default Include filters](#default-filters).

The **Text** filter uses the same matching rules whether you choose Include or Exclude—only the rule type changes. In practice, Text filters are almost always **Exclude** (e.g. remove a link whose direct text is `Subscribe`). **Include** + Text is rare and keeps only elements whose direct text matches, which is usually too narrow to be useful on its own.

### How filters work

* Filters match **HTML elements** and remove them wholesale—not individual words or phrases inside an element.
* Conditions combined on a single filter (click **+** in the dashboard) use **AND** logic: all conditions must match the same element.
* Multiple filters of the same rule type are combined with **OR** logic—if any filter matches, the rule applies.
* **Exclude** filters run first and remove matching elements (and everything inside them).
* **Include** filters run after excludes and strip everything that does not match (and is not an ancestor or descendant of a match).
* Filters with invalid XPath expressions are skipped silently, so double-check the syntax of any [XPath](#xpath-element_xpath) filter.

For **URL-based extraction** ([Magic Embed](/docs-and-guides/integrations/magic-embed), RSS page fetch), BeyondWords also applies built-in **Exclude** rules before your dashboard filters: `script`, `style`, HTML comments, and elements with class `beyondwords-player`. These cannot be disabled in the dashboard.
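
The ordering and AND/OR rules in this section can be sketched with Python's standard library. This is an illustrative model under stated assumptions, not the BeyondWords implementation: filters are represented as plain dictionaries (a hypothetical shape), only Type, Class, and ID conditions are modelled, and the input must be well-formed markup because `xml.etree.ElementTree` is an XML parser.

```python
import xml.etree.ElementTree as ET


def matches(el, conditions):
    """AND logic: every condition in one filter must match the same element."""
    checks = {
        "type": lambda v: el.tag == v,
        "class": lambda v: v in el.get("class", ""),  # substring match
        "id": lambda v: el.get("id") == v,            # exact match
    }
    return all(checks[kind](value) for kind, value in conditions.items())


def any_match(el, filters):
    """OR logic: filters of the same rule type are alternatives."""
    return any(matches(el, f) for f in filters)


def apply_exclude(root, filters):
    """Remove each matching element together with everything inside it."""
    for parent in list(root.iter()):
        for child in list(parent):
            if any_match(child, filters):
                parent.remove(child)


def apply_include(root, filters):
    """Keep matches, their descendants, and their ancestors; drop the rest."""
    def keep(el):
        if any_match(el, filters):
            return True  # descendants of a match are left untouched
        kept_any = False
        for child in list(el):
            if keep(child):
                kept_any = True  # el is an ancestor of a match
            else:
                el.remove(child)
        return kept_any
    keep(root)


root = ET.fromstring(
    "<body>"
    "<nav><p>Menu</p></nav>"
    "<div class='article'><h1>Title</h1><p>Body</p><aside>Ad</aside></div>"
    "<span>Stray</span>"
    "</body>"
)
apply_exclude(root, [{"type": "nav"}, {"type": "aside"}])  # runs first
apply_include(root, [{"type": "h1"}, {"type": "p"}])       # runs second
print(ET.tostring(root, encoding="unicode"))
```

Because Exclude runs first, the `<p>Menu</p>` inside `<nav>` is gone before the Include rule for `p` has a chance to keep it. The `<span>` is stripped by Include because it neither matches nor contains a match.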

### Create a filter

<Steps>
  <Step title="Start a new filter">
    Go to **Settings → Extraction → Filters** in your project dashboard.

    Click **+ Filter**.
  </Step>

  <Step title="Select the filter type">
    Choose **Type**, **Class**, **Data**, **ID**, **XPath**, or **Text**. See [filter types](#filter-types) below.
  </Step>

  <Step title="Enter the filter criteria">
    Enter the value for your chosen type—for example, `h2` for a Type filter, or `newsletter` for a Class filter.

    <Warning>
      Provide only the name or identifier—no prefix characters like `<`, `.`, or `#`. The exception is **XPath**, where you enter a full XPath expression.
    </Warning>
  </Step>

  <Step title="Add additional conditions (optional)">
    Click **+** to add another condition. Combined conditions use **AND** logic—the element must match all conditions.

    For example, combine **Type** `p` and **Text** `Sponsored` to exclude only paragraphs whose direct text contains "Sponsored", rather than every element on the page containing that word.
  </Step>

  <Step title="Select Include or Exclude">
    Choose whether to **Include** or **Exclude** matching elements. In most cases, choose **Exclude**.
  </Step>

  <Step title="Set the scope and save">
    Choose **All projects** or **This project only**, then click **Save changes**. See [Filter scope](#filter-scope) for how organization-wide and project-specific filters combine.

    Filters do not change content that was already generated, so regenerate affected content items for the filter to take effect.
  </Step>
</Steps>

### Filter types

| Type                          | Matches on                         | Common use                                            |
| ----------------------------- | ---------------------------------- | ----------------------------------------------------- |
| [Type](#type-element_type)    | HTML tag name                      | Remove `sup` references, `aside` blocks, etc.         |
| [Class](#class-element_class) | Substring in the `class` attribute | Sidebars, footnotes, embed containers                 |
| [ID](#id-element_id)          | Exact `id` attribute               | Unique advert or widget blocks                        |
| [Data](#data-element_data)    | Presence of a `data-*` attribute   | CMS markers like `data-exclude`                       |
| [Text](#text-element_text)    | Direct text content (substring)    | Simple text matches in a single element               |
| [XPath](#xpath-element_xpath) | Full XPath expression              | Complex markup, fragmented text, link-based targeting |

#### Type (`element_type`)

Matches elements by HTML tag name (e.g. `p`, `h2`, `blockquote`, `sup`).

**How matching works**

* Matches the tag name exactly—enter `h2`, not `<h2>`.
* When combined with other conditions via **+**, all conditions must match the same element.
* When no Type is set, the filter defaults to all elements (`*`).

**Examples**

* **Exclude** Type `sup`—removes inline superscript reference numbers (e.g. ¹, ², \[1])
* **Exclude** Type `a` + **Text** `Subscribe`—removes a subscribe link, keeps the surrounding paragraph

#### Class (`element_class`)

Matches elements whose `class` attribute **contains** the value you enter (substring match, not an exact class token).

**How matching works**

* Uses substring matching—`nav` also matches `navbar`, `navigation`, and `main-nav`.
* For `<div class="main navbar">`, entering `main navbar` can match that element.
* Each class name requires its own filter. Entering multiple names in one filter (e.g. `nonedit, collection-embed`) does not match either class.
* Do not include a leading dot—enter `sidebar`, not `.sidebar`. A leading dot causes the filter to fail silently.

**Examples**

* **Exclude** Class `footnotes`—removes a footnote block
* **Exclude** Class `related-articles`—removes a related-content widget

**Common mistakes**

* Using a broad class name that appears on editorial content—e.g. a class meaning "body text" in another language applied to every paragraph—can remove the entire article. Inspect your HTML before excluding by class.
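
The substring behaviour described in this section can be checked with a one-line model before you save a filter. The function name is hypothetical, and the sketch assumes the filter value is compared against the raw `class` attribute string, as the examples above suggest.

```python
def class_filter_matches(class_attr: str, value: str) -> bool:
    """Substring match against the whole class attribute, not per class token."""
    return value in class_attr


print(class_filter_matches("navbar", "nav"))                  # True: substring
print(class_filter_matches("main navbar", "main navbar"))     # True: whole attribute
print(class_filter_matches("nonedit collection-embed",
                           "nonedit, collection-embed"))      # False: comma is literal
print(class_filter_matches("sidebar", ".sidebar"))            # False: leading dot
```

Running your candidate value against the class attributes found on editorial paragraphs is a quick way to spot a filter that is broader than intended.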

#### ID (`element_id`)

Matches elements with an exact `id` attribute value.

**How matching works**

* Exact match—`@id='your-id'`.
* Do not include a `#` prefix—enter `newsletter-signup`, not `#newsletter-signup`.
* Best for unique, stable containers on your pages.

**Examples**

* **Exclude** ID `advert-banner`—removes a specific advert container

#### Data (`element_data`)

Matches elements that have a specific `data-*` attribute.

**How matching works**

* Enter the attribute name **without** the `data-` prefix—e.g. enter `exclude` to match elements with a `data-exclude` attribute.
* Matches attribute **presence**, not a specific attribute value.

**Examples**

* **Exclude** Data `exclude`—removes elements marked `<div data-exclude>` in your CMS markup

See [data attributes](/docs-and-guides/integrations/data-attributes) for marking content in your HTML.

#### Text (`element_text`)

Matches elements based on the **direct text** they contain.

<Warning>
  This filter removes whole HTML elements—not individual words or phrases. If an element matches, the element and everything inside it is removed. There is no way to strip a specific word or sentence from a paragraph while keeping the rest.
</Warning>

**How matching works**

* Matches when an element's **direct text** contains the value you enter. Text inside child tags (e.g. `<strong>`, `<em>`, `<a>`) is not part of the parent's direct text.
* Matching is **case-sensitive** and matches **substrings**—`more` matches "more", "Read more", and "moreover".
* When an element matches, the **whole element and its contents** are removed.
* Used alone, the filter applies to **every element** on the page. Combine with [Type](#type-element_type), [Class](#class-element_class), or [ID](#id-element_id) via **+** to narrow the match.

**Examples**

* **Exclude** Type `a` + Text `Subscribe`—removes a subscribe link, keeps the surrounding paragraph
* **Exclude** Type `p` + Text `Sponsored`—removes paragraphs whose direct text contains "Sponsored"

**What this filter can't do**

* Remove a word or sentence from inside a paragraph while keeping the rest of the paragraph
* Match text split across child tags (e.g. `<em>Sub</em><em>scribe</em>`) or formatted inline (e.g., `<p>Hello <strong>world</strong></p>`—filtering for `world` on the `<p>` will not match because `world` is inside `<strong>`)
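
The direct-text rule can be sketched as follows. This is a model under an assumption: "direct text" is taken to mean the text nodes owned by the element itself, each checked separately, which may differ in edge cases from the actual matching. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def direct_text_nodes(el):
    """Text nodes owned by the element itself, not by its child tags."""
    nodes = [el.text or ""]
    nodes += [child.tail or "" for child in el]
    return nodes


def text_filter_matches(el, value):
    # Case-sensitive substring match against each direct text node.
    return any(value in node for node in direct_text_nodes(el))


plain = ET.fromstring("<p>Hello world</p>")
nested = ET.fromstring("<p>Hello <strong>world</strong></p>")
split = ET.fromstring("<p><em>Sub</em><em>scribe</em> today</p>")

print(text_filter_matches(plain, "world"))       # True
print(text_filter_matches(nested, "world"))      # False: text is inside <strong>
print(text_filter_matches(nested[0], "world"))   # True: the <strong> itself matches
print(text_filter_matches(split, "Subscribe"))   # False: split across <em> tags
print(text_filter_matches(plain, "World"))       # False: case-sensitive
```

The third call shows why a Text filter used on its own is risky: the nested `<strong>` matches even though its parent paragraph does not.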

For fragmented or heavily formatted paragraphs, use [XPath](#xpath-element_xpath) to target the wrapping element instead—for example, a newsletter paragraph identified by a link:

```text
//p[.//a[contains(@href, 'newsletter')]]
```

#### XPath (`element_xpath`)

Matches elements using a full [XPath](https://www.w3.org/TR/1999/REC-xpath-19991116/) expression. When XPath is set, other filter fields are ignored.

**How matching works**

* Enter a complete XPath expression—e.g. `//*[@role='dialog']` or `//aside[contains(@class, 'sidebar')]`.
* Provides the most precise control for complex document structures.
* Best escape hatch when [Text](#text-element_text) filters cannot match fragmented markup you do not control (common with third-party CMS and RSS-sourced HTML).

**Examples**

* **Exclude** `//*[@role='dialog']`—removes dialog/modal elements
* **Exclude** `//p[.//a[contains(@href, 'suscripcion-newsletter')]]`—removes a newsletter sign-up paragraph identified by its subscribe link, regardless of how text is split across `<em>` and `<strong>` tags inside
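
To preview which paragraphs a link-based expression would select, you can reproduce its logic offline. Python's built-in `xml.etree.ElementTree` supports only a subset of XPath and has no `contains()` function, so the sketch below spells out the predicate by hand. The function name and sample markup are hypothetical, and the input must be well-formed.

```python
import xml.etree.ElementTree as ET


def paragraphs_with_link(root, href_fragment):
    """Hand-written equivalent of //p[.//a[contains(@href, href_fragment)]]."""
    return [
        p for p in root.iter("p")
        if any(href_fragment in a.get("href", "") for a in p.iter("a"))
    ]


doc = ET.fromstring(
    "<article>"
    "<p>Opening paragraph.</p>"
    "<p><em>Sign</em> <strong>up</strong> to our "
    "<a href='/newsletter/join'>list</a>.</p>"
    "<p>See <a href='/about'>about us</a>.</p>"
    "</article>"
)

matched = paragraphs_with_link(doc, "newsletter")
for p in matched:
    # itertext() joins text from the paragraph and all of its child tags.
    print("".join(p.itertext()))
```

Only the sign-up paragraph is selected, however its visible text is split across inline tags, because the match depends on the link's `href` rather than on the text.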

### Processing type and filters

When sending content through the API, the `type` field determines whether [content filters](#filters) are applied.

| Action                                                       | Filters applied? |
| ------------------------------------------------------------ | ---------------- |
| Article has `auto_segment` and you click **Regenerate**      | Yes              |
| Article has `auto_segment` and you send new HTML via the API | Yes              |
| Article has `manual_segment` and you click **Regenerate**    | No               |

Articles created or last saved via the dashboard Editor use `manual_segment` by default. Set `type` to `auto_segment` when sending content through the API if you want filters to apply.

For more detail, see [Processing types](/api-reference/content/processing-types) in the API reference.
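
As a minimal sketch, a request payload that asks for filters to run might be built like this. The `type` value comes from this page. Sending the HTML in a field named `body` is an assumption based on the integrations table above, and the endpoint, authentication, and any other required fields are deliberately omitted: take those from the API reference.

```python
import json


def build_content_update(html_body: str) -> str:
    """Serialise an update that requests auto_segment processing."""
    payload = {
        "type": "auto_segment",  # processing type that allows filters to apply
        "body": html_body,       # assumed field name for the article HTML
    }
    return json.dumps(payload)


print(build_content_update("<p>Updated article text.</p>"))
```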

## FAQs

<AccordionGroup>
  <Accordion title="Why is unwanted content being extracted?">
    Common causes:

    * **Automatic extraction** included the content—add an **Exclude** filter targeting the relevant HTML element and regenerate.
    * **No Exclude filter** matches the element—inspect the page HTML, choose the right [filter type](#filter-types), and regenerate. For formatted paragraphs, Text alone may not work—try Class, ID, or XPath.
    * An **Include** filter is too broad—review Include filters; they strip everything outside the match.
  </Accordion>

  <Accordion title="Why is some content not being extracted?">
    Common causes:

    * **Automatic extraction** excluded the content—add an **Include** filter targeting the relevant container and regenerate.
    * An **Exclude** filter is too broad—e.g. a Class filter using a substring that matches editorial elements. Review and narrow the filter.
    * The article uses `manual_segment`—filters no longer apply. Regenerate via API with `type: auto_segment`, or edit in the Editor.
  </Accordion>

  <Accordion title="My Text filter doesn't match—why?">
    The Text filter only checks an element's **direct text**, not text nested in child tags. This often affects real article markup:

    * `<p>Hello world</p>`—Text `world` on Type `p` matches
    * `<p>Hello <strong>world</strong></p>`—Text `world` on Type `p` does **not** match
    * Text split across multiple `<em>` tags will not match as a single phrase on the parent `<p>`

    **Fix:** use [XPath](#xpath-element_xpath) to target the parent element—e.g. by a distinctive `href`, `id`, or class on a child link rather than the paragraph text.
  </Accordion>

  <Accordion title="How do I remove a newsletter sign-up paragraph?">
    If the paragraph contains a distinctive subscribe link, use an **Exclude** XPath filter:

    ```text
    //p[.//a[contains(@href, 'newsletter')]]
    ```

    Replace `newsletter` with a distinctive part of the subscribe URL. Avoid excluding by a generic class name (e.g. a class meaning "body" applied to all paragraphs)—that can remove the entire article.

    If you only need to remove the link and keep the paragraph, use **Exclude** Type `a` + Text matching the link text (e.g. `Subscribe`).
  </Accordion>

  <Accordion title="When do filters take effect?">
    Filters apply on the next extraction or regeneration after you save them. Existing content is not updated automatically—click **Regenerate** on affected items, or re-import/re-publish through your integration.

    Filters do not apply to content last saved in the Editor (`manual_segment`). See [When filters apply](#when-filters-apply) for the full breakdown by integration.
  </Accordion>

  <Accordion title="Why does content disappear even though I haven't added filters?">
    New accounts include [default Type filters](#default-filters)—a set of **Include** and **Exclude** rules that run automatically. The **Include** presets keep only common editorial element types (`p`, `h1`–`h6`, `div`, etc.) and strip everything else. Check **Settings → Extraction → Filters** for organization-wide (**All projects**) rules as well as project-specific ones.
  </Accordion>
</AccordionGroup>

## Getting help

If you encounter issues or have questions, [contact support](/docs-and-guides/support/get-support). Include the article URL or HTML snippet and the filters you have configured.
