Content extraction

Content extraction is how BeyondWords turns source HTML into the article used to generate audio and video. It applies when ingesting content through Magic Embed, the RSS Feed Importer, the WordPress and Ghost plugins, the API, or URL imports.

Extracted text becomes the article within the content item. You can edit the article in the Editor, but changes there switch the item to manual_segment processing—see Processing type and filters below.

For precise control over title, author, publish date, and other metadata in your HTML, use data attributes alongside extraction settings and filters.

How it works

BeyondWords receives HTML—fetched from a live URL, imported from a feed, or sent directly via an integration.
Content filters (if configured) remove or retain whole HTML elements in the raw markup.
An extraction mode determines how editorial content is identified from the filtered HTML.
The result becomes the article used for segmentation and audio/video generation.

Filters run on HTML before segments are created. They operate on elements, not on individual words inside a paragraph.

Extraction modes

Configure the extraction mode in Settings → Extraction.

Mode	How it works	Best for
Automatic (default)	AI identifies editorial content; filters fine-tune the result	Most new projects
Manual	Filters alone define what is included or excluded	Predictable, markup-driven control
Legacy	Rule-based heuristics plus filters	Deprecated—migrate to Automatic

Automatic extraction

Automatic extraction uses AI to identify and extract editorial content from your page while ignoring elements that should not be used for audio or video generation. Configured content filters are applied to the HTML before AI extraction runs. In most cases, Exclude filters are the right tool—see Extraction mode and filters.

Although we have safeguards in place to improve extraction accuracy, there is a small risk that AI-based extraction may introduce unintended inaccuracies. Use Exclude filters to remove recurring non-editorial blocks (newsletter sign-ups, related-article widgets, etc.).

Manual extraction

Manual extraction relies on content filters to define which parts of a page are included or excluded. This gives more predictable results than automatic extraction, but requires careful filter configuration—especially default filters and any Include rules. See Extraction mode and filters.

Legacy extraction

Legacy extraction is deprecated and scheduled for removal. It remains available for existing projects but should not be used for new implementations. Migrate to Automatic extraction.

Legacy extraction uses rule-based heuristics and content filters to identify the main editorial content on a page. This mode exists for backwards compatibility with older projects that relied on the previous extraction pipeline.

Extraction settings

Static IP

If your site requires IP allowlisting for BeyondWords to fetch article pages, enable static IP:

Go to Settings → Extraction in your project dashboard
Switch Static IP on
Allowlist the displayed IP addresses in your firewall, CDN, or server configuration

Page-fetch requests are sent from 20.234.8.180 with User-Agent: BeyondWords Importer. This applies to URL fetching (Magic Embed, RSS page extraction, etc.)—not to webhook delivery.
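If you write your own allowlist rule, the check can be sketched as follows. This is a minimal illustration, not BeyondWords code: the helper name is_beyondwords_fetch is hypothetical, and the IP address and User-Agent string are the values quoted above, which you should confirm against the addresses displayed in your dashboard.

```python
import ipaddress

# Values quoted in this section; verify them in Settings -> Extraction.
BEYONDWORDS_FETCH_IPS = {ipaddress.ip_address("20.234.8.180")}
BEYONDWORDS_USER_AGENT = "BeyondWords Importer"


def is_beyondwords_fetch(remote_addr: str, user_agent: str) -> bool:
    """Return True when a request looks like a BeyondWords page fetch.

    Hypothetical helper: the source IP must be allowlisted and the
    User-Agent header must contain the importer name.
    """
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        return False  # malformed address: never allow
    return addr in BEYONDWORDS_FETCH_IPS and BEYONDWORDS_USER_AGENT in user_agent


print(is_beyondwords_fetch("20.234.8.180", "BeyondWords Importer"))
print(is_beyondwords_fetch("203.0.113.9", "BeyondWords Importer"))
```

Checking the IP as well as the User-Agent matters because a User-Agent header alone is trivial to spoof.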

Filters

Content filters control which parts of your source HTML are kept or removed before extraction and segmentation. Use them to:

Exclude recurring elements that should not be read aloud—newsletter sign-ups, related-article blocks, social embeds, footnotes
Include only specific containers when extraction picks up more than you need (use sparingly—see Include vs Exclude)

When filters apply

Dashboard content filters run on raw HTML before extraction and segmentation. Whether they run depends on how content reaches BeyondWords and its processing type:

Source	Filters applied?
Magic Embed (live page fetch)	Yes—before AI or rule-based extraction
RSS Feed Importer (page fetch enabled)	Yes—on fetched article HTML
API / WordPress REST API (`body` HTML)	Yes—when the content item uses `auto_segment`
Ghost plugin	Yes—on HTML sent to BeyondWords, when `auto_segment`
Dashboard Editor	No—saving in the Editor sets `manual_segment`; filters do not re-apply

You can also manage filters via the Content filters API and extraction settings via the Content extraction settings API.

Content filters only apply to articles with auto_segment processing. If an article was created or last edited in the Editor, its processing type is manual_segment and filters will not apply. To restore filter processing, send a new API update with type set to auto_segment. Alternatively, make the changes manually in the Editor instead of relying on filters.

For the WordPress REST API integration, dashboard filters apply to the HTML in the request body. WordPress users can also preprocess that HTML before it is sent using the PHP beyondwords_content_params hook. That is a WordPress plugin feature, not a dashboard content filter.

Filters take effect on the next extraction or regeneration after you save them, not on content that was already generated. See When do filters take effect? in the FAQs.

Filter scope

Filters can be scoped to All projects (organization-wide) or This project only. At runtime, both apply:

Organization-wide filters (All projects) run for every project in your account
Project-specific filters (This project only) run in addition to organization-wide filters

Review filters at both levels when troubleshooting—a rule configured under All projects affects every project even if a single project’s filter list looks correct.
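When troubleshooting, it can help to think of the filters that run for a project as one combined list. The sketch below models that under stated assumptions: the Filter record and its field names are hypothetical and do not mirror the Content filters API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    rule: str   # "exclude" or "include"
    kind: str   # e.g. "class", "type", "xpath"
    value: str
    scope: str  # "all_projects" or "project"


def effective_filters(org_filters, project_filters):
    """Filters that run for one project: organization-wide plus project-specific."""
    return list(org_filters) + list(project_filters)


org = [Filter("exclude", "class", "newsletter", "all_projects")]
project = [Filter("exclude", "type", "sup", "project")]

for f in effective_filters(org, project):
    print(f.scope, f.rule, f.kind, f.value)
```

The project's own list contains one filter here, yet two filters run. That is the situation described above, where a project's filter list looks correct but an All projects rule is still removing content.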

Default filters

New BeyondWords accounts are created with preset Type filters. These are normal dashboard filters that are active by default, and they run in the same way as any filter you add yourself (Exclude first, then Include).

Exclude (element types removed from the HTML): aside, figcaption, footer, form, iframe, nav, noscript

Include (only these element types and their ancestors/descendants are kept; everything else is stripped): head, p, div, li, h1, h2, h3, h4, h5, h6, blockquote, table, img

These defaults explain why some markup disappears even before you add custom filters, and why Manual extraction mode is sensitive to Include rules. You can edit or delete default filters in Settings → Extraction → Filters like any other filter.

Extraction mode and filters

Extraction mode	Role of filters	Recommended approach
Automatic (default)	Fine-tune HTML before AI extraction	Add Exclude filters for recurring non-editorial blocks. AI handles most editorial identification.
Manual	Filters are the primary extraction mechanism	Understand default filters and any Include rules—they define what survives. Use Exclude for specific removals; use Include only when you need a narrow allowlist.
Legacy (deprecated)	Applied before rule-based heuristics	Migrate to Automatic. Use Exclude filters as you would with Automatic mode.

Include vs Exclude

Rule	Effect
Exclude	Remove matching elements from the HTML. This is the most common choice—remove a sidebar, footnote block, or newsletter paragraph.
Include	Keep only matching elements (plus their ancestors and descendants). Everything else is removed. Use only when you want to extract from a specific container and discard the rest of the page.

Most support cases are solved with Exclude filters. Include filters are aggressive: a single overly broad Include rule can strip most of the article. This is especially true in Manual extraction mode and on accounts with default Include filters.

The Text filter uses the same matching rules whether you choose Include or Exclude; only the rule type changes. In practice, Text filters are almost always Exclude (e.g. remove a link whose direct text is Subscribe). Include + Text is rare and keeps only elements whose direct text matches, which is usually too narrow to be useful on its own.

How filters work

Filters match HTML elements and remove them wholesale—not individual words or phrases inside an element.
Combined conditions on a single filter use AND logic (click + in the dashboard). All conditions must match the same element.
Multiple filters of the same rule type are combined with OR logic—if any filter matches, the rule applies.
Exclude filters run first and remove matching elements (and everything inside them).
Include filters run after excludes and strip everything that does not match (and is not an ancestor or descendant of a match).
Filters with invalid XPath expressions are skipped silently. Double-check syntax, especially for XPath filters.
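The ordering rules above can be sketched in a few lines. This is a simplified model, not the BeyondWords implementation: it handles only Type-style (tag name) filters, uses Python's standard xml.etree.ElementTree on a well-formed snippet, and the function names are hypothetical. Passing a set of tag names stands in for several filters of the same rule type combined with OR.

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove child and its contents, keeping the text that followed it."""
    index = list(parent).index(child)
    if child.tail:
        if index == 0:
            parent.text = (parent.text or "") + child.tail
        else:
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
    parent.remove(child)


def apply_filters(root, exclude_tags, include_tags):
    """Apply tag-name filters in place: Exclude first, then Include."""
    # Exclude: drop every matching element together with its subtree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in exclude_tags:
                remove_keep_tail(parent, child)

    if not include_tags:
        return root

    # Include: keep matches plus their ancestors and descendants.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    keep = {root}
    for element in root.iter():
        if element.tag in include_tags:
            keep.update(element.iter())   # the match and its descendants
            node = element
            while node in parent_of:      # walk up through its ancestors
                node = parent_of[node]
                keep.add(node)
    for parent in list(root.iter()):
        for child in list(parent):
            if child not in keep:
                remove_keep_tail(parent, child)
    return root


html = (
    "<body>"
    "<nav><p>Home</p></nav>"
    "<article><h1>Title</h1><p>First paragraph.</p><span>Share</span></article>"
    "<footer><p>Copyright</p></footer>"
    "</body>"
)
root = ET.fromstring(html)
apply_filters(root, exclude_tags={"nav", "footer"}, include_tags={"h1", "p"})
print(ET.tostring(root, encoding="unicode"))
```

In this sample the paragraph inside nav disappears even though p is on the Include list, because the Exclude pass has already removed nav and everything inside it. The span is dropped in the Include pass because it is neither a match nor an ancestor or descendant of one.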

For URL-based extraction (Magic Embed, RSS page fetch), BeyondWords also applies built-in Exclude rules before your dashboard filters: script, style, HTML comments, and elements with class beyondwords-player. These cannot be disabled in the dashboard.

Create a filter

Start a new filter

Go to Settings → Extraction → Filters in your project dashboard. Click + Filter.

Select the filter type

Choose Type, Class, Data, ID, XPath, or Text. See filter types below.

Enter the filter criteria

Enter the value for your chosen type—for example, h2 for a Type filter, or newsletter for a Class filter.

Provide only the name or identifier—no prefix characters like <, ., or #. The exception is XPath, where you enter a full XPath expression.

Add additional conditions (optional)

Click + to add another condition. Combined conditions use AND logic: the element must match all conditions. For example, combine Type p and Text Sponsored to exclude only paragraphs whose direct text contains “Sponsored”, rather than every element on the page containing that word.

Select Include or Exclude

Choose whether to Include or Exclude matching elements. In most cases, choose Exclude.

Set the scope and save

Choose All projects or This project only, then click Save changes. See Filter scope for how organization-wide and project-specific filters combine. Regenerate affected content items for the filter to take effect.

Filter types

Type	Matches on	Common use
Type	HTML tag name	Remove `sup` references, `aside` blocks, etc.
Class	Substring in the `class` attribute	Sidebars, footnotes, embed containers
ID	Exact `id` attribute	Unique advert or widget blocks
Data	Presence of a `data-*` attribute	CMS markers like `data-exclude`
Text	Direct text content (substring)	Simple text matches in a single element
XPath	Full XPath expression	Complex markup, fragmented text, link-based targeting

Type (`element_type`)

Matches elements by HTML tag name (e.g. p, h2, blockquote, sup).

How matching works

Matches the tag name exactly—enter h2, not <h2>.
When combined with other conditions via +, all conditions must match the same element.
When no Type is set, the filter defaults to all elements (*).

Examples

Exclude Type sup—removes inline superscript reference numbers (e.g. ¹, ², [1])
Exclude Type a + Text Subscribe—removes a subscribe link, keeps the surrounding paragraph

Class (`element_class`)

Matches elements whose class attribute contains the value you enter (substring match, not an exact class token).

How matching works

Uses substring matching—nav also matches navbar, navigation, and main-nav.
For <div class="main navbar">, entering main navbar can match that element.
Each class name requires its own filter. Entering multiple names in one filter (e.g. nonedit, collection-embed) does not match either class.
Do not include a leading dot—enter sidebar, not .sidebar. A leading dot causes the filter to fail silently.
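The matching rules above behave like a plain substring test on the raw class attribute. The helper below is a hypothetical model of that behavior for checking a value before you save it; it is an assumption about the observable result, not the actual matcher.

```python
def class_filter_matches(class_attr: str, value: str) -> bool:
    """Substring test against the whole class attribute, not per class token."""
    return value in class_attr


# A short value also matches longer class names that contain it.
print(class_filter_matches("navbar", "nav"))                # True
print(class_filter_matches("main-nav", "nav"))              # True
# The full attribute string can match as one substring.
print(class_filter_matches("main navbar", "main navbar"))   # True
# A comma-separated list is not a substring of the attribute.
print(class_filter_matches("nonedit collection-embed",
                           "nonedit, collection-embed"))    # False
# A leading dot never appears in the attribute, so nothing matches.
print(class_filter_matches("sidebar", ".sidebar"))          # False
```

Running a candidate value against the class attributes of your editorial paragraphs is a quick way to spot a filter that is broader than intended.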

Examples

Exclude Class footnotes—removes a footnote block
Exclude Class related-articles—removes a related-content widget

Common mistakes

Using a broad class name that appears on editorial content—e.g. a class meaning “body text” in another language applied to every paragraph—can remove the entire article. Inspect your HTML before excluding by class.

ID (`element_id`)

Matches elements with an exact id attribute value.

How matching works

Exact match—@id='your-id'.
Do not include a # prefix—enter newsletter-signup, not #newsletter-signup.
Best for unique, stable containers on your pages.

Examples

Exclude ID advert-banner—removes a specific advert container

Data (`element_data`)

Matches elements that have a specific data-* attribute.

How matching works

Enter the attribute name without the data- prefix—e.g. enter exclude to match elements with a data-exclude attribute.
Matches attribute presence, not a specific attribute value.

Examples

Exclude Data exclude—removes elements marked <div data-exclude> in your CMS markup

See data attributes for marking content in your HTML.

Text (`element_text`)

Matches elements based on the direct text they contain.

This filter removes whole HTML elements—not individual words or phrases. If an element matches, the element and everything inside it is removed. There is no way to strip a specific word or sentence from a paragraph while keeping the rest.

How matching works

Matches when an element’s direct text contains the value you enter. Text inside child tags (e.g. inline formatting tags or <a>) is not part of the parent’s direct text.
Matching is case-sensitive and matches substrings—more matches “more”, “Read more”, and “moreover”.
When an element matches, the whole element and its contents are removed.
Used alone, the filter applies to every element on the page. Combine with Type, Class, or ID via + to narrow the match.
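The idea of direct text can be illustrated with Python's standard xml.etree.ElementTree. This sketch assumes that direct text means the text nodes that are immediate children of the element; the function names are hypothetical, and the optional tag argument stands in for a Type condition combined with AND.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Text belonging to the element itself, skipping text inside child tags."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts)


def text_filter_matches(element, value, tag=None):
    """Case-sensitive substring match on direct text, with an optional Type condition."""
    if tag is not None and element.tag != tag:
        return False
    return value in direct_text(element)


plain = ET.fromstring("<p>Hello world</p>")
nested = ET.fromstring("<p>Hello <strong>world</strong></p>")

print(text_filter_matches(plain, "world", tag="p"))   # True
print(text_filter_matches(nested, "world", tag="p"))  # False: "world" is inside <strong>
print(text_filter_matches(plain, "World", tag="p"))   # False: matching is case-sensitive
print(text_filter_matches(plain, "world", tag="a"))   # False: the Type condition fails
```

The second case is the one that most often surprises people: the paragraph visibly contains the word, but the word belongs to a child element rather than to the paragraph itself.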

Examples

Exclude Type a + Text Subscribe—removes a subscribe link, keeps the surrounding paragraph
Exclude Type p + Text Sponsored—removes paragraphs whose direct text contains “Sponsored”

What this filter can’t do

Remove a word or sentence from inside a paragraph while keeping the rest of the paragraph
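Match text split across child tags (e.g. a Subscribe label whose letters are divided between nested tags) or formatted inline (e.g. when world in Hello world is wrapped in an inline tag, filtering for world on the parent will not match because world is inside the child tag)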

For fragmented or heavily formatted paragraphs, use XPath to target the wrapping element instead—for example, a newsletter paragraph identified by a link:

//p[.//a[contains(@href, 'newsletter')]]

XPath (`element_xpath`)

Matches elements using a full XPath expression. When XPath is set, other filter fields are ignored.

How matching works

Enter a complete XPath expression—e.g. //*[@role='dialog'] or //aside[contains(@class, 'sidebar')].
Provides the most precise control for complex document structures.
Best escape hatch when Text filters cannot match fragmented markup you do not control (common with third-party CMS and RSS-sourced HTML).

Examples

Exclude //*[@role='dialog']—removes dialog/modal elements
Exclude //p[.//a[contains(@href, 'suscripcion-newsletter')]]: removes a newsletter sign-up paragraph identified by its subscribe link, regardless of how text is split across inline tags inside
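To see what the link-based expression selects, you can try the same logic locally on a sample of your markup. Python's standard xml.etree.ElementTree does not support the XPath contains() function, so the sketch below emulates //p[.//a[contains(@href, ...)]] in plain Python; the function name and sample HTML are hypothetical.

```python
import xml.etree.ElementTree as ET


def paragraphs_linking_to(root, href_fragment):
    """Emulate //p[.//a[contains(@href, href_fragment)]] without a full XPath engine."""
    return [
        p for p in root.iter("p")
        if any(href_fragment in a.get("href", "") for a in p.iter("a"))
    ]


html = (
    "<div>"
    "<p>Article text.</p>"
    '<p><strong>Sign</strong> up to '
    '<a href="https://example.com/newsletter/join"><em>our</em> list</a></p>'
    "</div>"
)
root = ET.fromstring(html)

# Map each element to its parent so matches can be removed at any depth.
parent_of = {child: parent for parent in root.iter() for child in parent}
for p in paragraphs_linking_to(root, "newsletter"):
    parent_of[p].remove(p)

print(ET.tostring(root, encoding="unicode"))
```

The sign-up paragraph is matched through its link even though its visible text is split across several inline tags, which is the case a Text filter cannot handle.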

Processing type and filters

When sending content through the API, the type field determines whether content filters are applied.

Action	Filters applied?
Article has `auto_segment` and you click Regenerate	Yes
Article has `auto_segment` and you send new HTML via the API	Yes
Article has `manual_segment` and you click Regenerate	No

Articles created or last saved via the dashboard Editor use manual_segment by default. Set type to auto_segment when sending content through the API if you want filters to apply. For more detail, see Processing types in the API reference.
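As a rough sketch of the rule above, the snippet below builds a JSON request body that asks for filter processing. Only the type and body field names come from this page; the helper names are hypothetical, and the endpoint, authentication, and any other required fields are deliberately left out, so check the API reference before sending anything.

```python
import json

FILTERED_TYPE = "auto_segment"


def filters_will_apply(processing_type: str) -> bool:
    """Content filters run only for auto_segment content items."""
    return processing_type == FILTERED_TYPE


def build_update_payload(html_body: str) -> str:
    """JSON body for an update that should pass through content filters."""
    return json.dumps({"type": FILTERED_TYPE, "body": html_body})


payload = build_update_payload("<p>Hello world</p>")
print(payload)
print(filters_will_apply("manual_segment"))
```

Setting the type explicitly on every update avoids depending on whatever processing type the item was left with after an Editor save.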

FAQs

Why is unwanted content being extracted?

Common causes:

Automatic extraction included the content—add an Exclude filter targeting the relevant HTML element and regenerate.
No Exclude filter matches the element—inspect the page HTML, choose the right filter type, and regenerate. For formatted paragraphs, Text alone may not work—try Class, ID, or XPath.
An Include filter is too broad—review Include filters; they strip everything outside the match.

Why is some content not being extracted?

Common causes:

Automatic extraction excluded the content—add an Include filter targeting the relevant container and regenerate.
An Exclude filter is too broad—e.g. a Class filter using a substring that matches editorial elements. Review and narrow the filter.
The article uses manual_segment—filters no longer apply. Regenerate via API with type: auto_segment, or edit in the Editor.

My Text filter doesn't match—why?

The Text filter only checks an element’s direct text, not text nested in child tags. This often affects real article markup:

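<p>Hello world</p>: Text world on Type p matches
<p>Hello <strong>world</strong></p>: Text world on Type p does not match, because world sits inside a child tag (<strong> is used here as an illustration)
Text split across multiple child tags will not match as a single phrase on the parent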

Fix: use XPath to target the parent element—e.g. by a distinctive href, id, or class on a child link rather than the paragraph text.

If the paragraph contains a distinctive subscribe link, use an Exclude XPath filter:

//p[.//a[contains(@href, 'newsletter')]]

Replace newsletter with a distinctive part of the subscribe URL. Avoid excluding by a generic class name (e.g. a class meaning “body” applied to all paragraphs), because that can remove the entire article. If you only need to remove the link and keep the paragraph, use Exclude Type a + Text matching the link text (e.g. Subscribe).

When do filters take effect?

Filters apply on the next extraction or regeneration after you save them. Existing content is not updated automatically: click Regenerate on affected items, or re-import/re-publish through your integration. Filters do not apply to content last saved in the Editor (manual_segment). See When filters apply for the full breakdown by integration.

Why does content disappear even though I haven't added filters?

New accounts include default Type filters—a set of Include and Exclude rules that run automatically. The Include presets keep only common editorial element types (p, h1–h6, div, etc.) and strip everything else. Check Settings → Extraction → Filters for organization-wide (All projects) rules as well as project-specific ones.

Getting help

If you encounter issues or have questions, contact support. Include the article URL or HTML snippet and the filters you have configured.
