Content extraction is how BeyondWords turns source HTML into the article used to generate audio and video. It applies when ingesting content through Magic Embed, the RSS Feed Importer, the WordPress and Ghost plugins, the API, or URL imports.
Extracted text becomes the article within the content item. You can edit the article in the Editor, but changes there switch the item to manual_segment processing—see Processing type and filters below.
For precise control over title, author, publish date, and other metadata in your HTML, use data attributes alongside extraction settings and filters.

How it works

  1. BeyondWords receives HTML—fetched from a live URL, imported from a feed, or sent directly via an integration.
  2. Content filters (if configured) remove or retain whole HTML elements in the raw markup.
  3. An extraction mode determines how editorial content is identified from the filtered HTML.
  4. The result becomes the article used for segmentation and audio/video generation.
Filters run on HTML before segments are created. They operate on elements, not on individual words inside a paragraph.

Extraction modes

Configure the extraction mode in Settings → Extraction.
| Mode | How it works | Best for |
| --- | --- | --- |
| Automatic (default) | AI identifies editorial content; filters fine-tune the result | Most new projects |
| Manual | Filters alone define what is included or excluded | Predictable, markup-driven control |
| Legacy | Rule-based heuristics plus filters | Deprecated; migrate to Automatic |

Automatic extraction

Automatic extraction uses AI to identify and extract editorial content from your page while ignoring elements that should not be used for audio or video generation. Configured content filters are applied to the HTML before AI extraction runs. In most cases, Exclude filters are the right tool—see Extraction mode and filters.
Although we have safeguards in place to improve extraction accuracy, there is a small risk that AI-based extraction may introduce unintended inaccuracies. Use Exclude filters to remove recurring non-editorial blocks (newsletter sign-ups, related-article widgets, etc.).

Manual extraction

Manual extraction relies on content filters to define which parts of a page are included or excluded. This gives more predictable results than automatic extraction, but requires careful filter configuration—especially default filters and any Include rules. See Extraction mode and filters.

Legacy extraction

Legacy extraction is deprecated and scheduled for removal. It remains available for existing projects but should not be used for new implementations. Migrate to Automatic extraction.
Legacy extraction uses rule-based heuristics and content filters to identify the main editorial content on a page. This mode exists for backwards compatibility with older projects that relied on the previous extraction pipeline.

Extraction settings

Static IP

If your site requires IP allowlisting for BeyondWords to fetch article pages, enable static IP:
  1. Go to Settings → Extraction in your project dashboard
  2. Switch Static IP on
  3. Allowlist the displayed IP addresses in your firewall, CDN, or server configuration
Page-fetch requests are sent from 20.234.8.180 with User-Agent: BeyondWords Importer. This applies to URL fetching (Magic Embed, RSS page extraction, etc.)—not to webhook delivery.
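
As a rough illustration of the allowlisting step, the sketch below checks whether an incoming request looks like a BeyondWords page fetch. The IP address and User-Agent string are the ones quoted above; the function name is hypothetical, and matching the User-Agent as a substring is an assumption. Always allowlist the addresses shown in your own dashboard rather than hard-coding this value.

```python
import ipaddress

# Values quoted in the text above; confirm them in Settings → Extraction.
BEYONDWORDS_IPS = {ipaddress.ip_address("20.234.8.180")}
BEYONDWORDS_USER_AGENT = "BeyondWords Importer"


def is_beyondwords_fetch(remote_ip: str, user_agent: str) -> bool:
    """Return True when a request looks like a BeyondWords page fetch."""
    try:
        ip = ipaddress.ip_address(remote_ip)
    except ValueError:
        return False  # not a valid IP address at all
    # Assumption: the User-Agent header contains the documented string.
    return ip in BEYONDWORDS_IPS and BEYONDWORDS_USER_AGENT in user_agent


print(is_beyondwords_fetch("20.234.8.180", "BeyondWords Importer"))  # True
print(is_beyondwords_fetch("203.0.113.7", "BeyondWords Importer"))   # False
```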

Filters

Content filters control which parts of your source HTML are kept or removed before extraction and segmentation. Use them to:
  • Exclude recurring elements that should not be read aloud—newsletter sign-ups, related-article blocks, social embeds, footnotes
  • Include only specific containers when extraction picks up more than you need (use sparingly—see Include vs Exclude)

When filters apply

Dashboard content filters run on raw HTML before extraction and segmentation. Whether they run depends on how content reaches BeyondWords and its processing type:
| Source | Filters applied? |
| --- | --- |
| Magic Embed (live page fetch) | Yes, before AI or rule-based extraction |
| RSS Feed Importer (page fetch enabled) | Yes, on fetched article HTML |
| API / WordPress REST API (body HTML) | Yes, when the content item uses auto_segment |
| Ghost plugin | Yes, on HTML sent to BeyondWords, when the content item uses auto_segment |
| Dashboard Editor | No. Saving in the Editor sets manual_segment; filters do not re-apply |
You can also manage filters via the Content filters API and extraction settings via the Content extraction settings API.
Content filters only apply to articles with auto_segment processing. If an article was created or last edited in the Editor, its processing type is manual_segment and filters will not apply.
To restore filter processing, send a new API update with type set to auto_segment. Alternatively, make the changes manually in the Editor instead of relying on filters.
For the WordPress REST API integration, dashboard filters apply to the HTML in the request body. WordPress users can also preprocess that HTML before it is sent using the PHP beyondwords_content_params hook. That hook is a WordPress plugin feature, not a dashboard content filter.
Filters take effect on the next extraction or regeneration after you save them, not on content that was already generated. See When do filters take effect? in the FAQs.

Filter scope

Filters can be scoped to All projects (organization-wide) or This project only. At runtime, both apply:
  • Organization-wide filters (All projects) run for every project in your account
  • Project-specific filters (This project only) run in addition to organization-wide filters
Review filters at both levels when troubleshooting—a rule configured under All projects affects every project even if a single project’s filter list looks correct.
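
When troubleshooting, it can help to think of the effective filter list as the union of both scopes. The sketch below is only a mental model: the FilterRule record and effective_filters function are hypothetical names, not part of any BeyondWords API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterRule:
    rule: str   # "exclude" or "include"
    kind: str   # filter type, e.g. "type" or "class"
    value: str


def effective_filters(org_wide, project_only):
    """Organization-wide filters always run; project filters are added on top."""
    return list(org_wide) + list(project_only)


# The project's own list looks harmless...
project_only = [FilterRule("exclude", "class", "footnotes")]
# ...but an All projects rule still applies to it.
org_wide = [FilterRule("include", "type", "p")]

for f in effective_filters(org_wide, project_only):
    print(f.rule, f.kind, f.value)
```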

Default filters

New BeyondWords accounts are created with preset Type filters. These are normal dashboard filters, not optional, and they run in the same way as any filter you add yourself (Exclude first, then Include).
  • Exclude (element types removed from the HTML): aside, figcaption, footer, form, iframe, nav, noscript
  • Include (only these element types and their ancestors/descendants are kept; everything else is stripped): head, p, div, li, h1, h2, h3, h4, h5, h6, blockquote, table, img
These defaults explain why some markup disappears even before you add custom filters, and why Manual extraction mode is sensitive to Include rules. You can edit or delete default filters in Settings → Extraction → Filters like any other filter.

Extraction mode and filters

| Extraction mode | Role of filters | Recommended approach |
| --- | --- | --- |
| Automatic (default) | Fine-tune HTML before AI extraction | Add Exclude filters for recurring non-editorial blocks. AI handles most editorial identification. |
| Manual | Filters are the primary extraction mechanism | Understand default filters and any Include rules, because they define what survives. Use Exclude for specific removals; use Include only when you need a narrow allowlist. |
| Legacy (deprecated) | Applied before rule-based heuristics | Migrate to Automatic. Use Exclude filters as you would with Automatic mode. |

Include vs Exclude

| Rule | Effect |
| --- | --- |
| Exclude | Remove matching elements from the HTML. This is the most common choice: remove a sidebar, footnote block, or newsletter paragraph. |
| Include | Keep only matching elements (plus their ancestors and descendants). Everything else is removed. Use only when you want to extract from a specific container and discard the rest of the page. |
Most support cases are solved with Exclude filters. Include filters are aggressive: a single overly broad Include rule can strip most of the article. This is especially true in Manual extraction mode and on accounts with default Include filters.
The Text filter uses the same matching rules whether you choose Include or Exclude; only the rule type changes. In practice, Text filters are almost always Exclude (e.g. remove a link whose direct text is Subscribe). Include + Text is rare and keeps only elements whose direct text matches, which is usually too narrow to be useful on its own.

How filters work

  • Filters match HTML elements and remove them wholesale—not individual words or phrases inside an element.
  • Combined conditions on a single filter use AND logic (click + in the dashboard). All conditions must match the same element.
  • Multiple filters of the same rule type are combined with OR logic—if any filter matches, the rule applies.
  • Exclude filters run first and remove matching elements (and everything inside them).
  • Include filters run after excludes and strip everything that does not match (and is not an ancestor or descendant of a match).
  • Filters with invalid XPath expressions are skipped silently, with no error shown, so double-check the syntax of every XPath filter.
For URL-based extraction (Magic Embed, RSS page fetch), BeyondWords also applies built-in Exclude rules before your dashboard filters: script, style, HTML comments, and elements with class beyondwords-player. These cannot be disabled in the dashboard.
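
The ordering rules above (Exclude first, then Include with ancestors and descendants preserved) can be sketched with Python's standard library. This is a simplified model for Type filters only, run on well-formed markup; apply_filters is a hypothetical name and the real pipeline may differ in detail.

```python
import xml.etree.ElementTree as ET


def apply_filters(root, exclude_tags, include_tags):
    """Apply Type-style filters in the documented order: Exclude, then Include."""
    # Exclude pass: drop matching elements and everything inside them.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in exclude_tags:
                parent.remove(child)
    if not include_tags:
        return root
    # Include pass: keep matches, their descendants, and their ancestors.
    parents = {child: parent for parent in root.iter() for child in parent}
    keep = set()
    for el in root.iter():
        if el.tag in include_tags:
            keep.update(el.iter())       # the match and its descendants
            node = el
            while node in parents:       # walk up through the ancestors
                node = parents[node]
                keep.add(node)
    for parent in list(root.iter()):
        for child in list(parent):
            if child not in keep:
                parent.remove(child)
    return root


html = "<body><nav><p>Menu</p></nav><article><p>Story</p><span>Share</span></article></body>"
result = apply_filters(ET.fromstring(html), {"nav"}, {"p"})
print(ET.tostring(result, encoding="unicode"))
# <body><article><p>Story</p></article></body>
```

Note that the paragraph inside nav is removed even though p is on the Include list, because the Exclude pass has already discarded its container.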

Create a filter

  1. Start a new filter. Go to Settings → Extraction → Filters in your project dashboard, then click + Filter.
  2. Select the filter type. Choose Type, Class, Data, ID, XPath, or Text. See filter types below.
  3. Enter the filter criteria. Enter the value for your chosen type, for example h2 for a Type filter or newsletter for a Class filter. Provide only the name or identifier, with no prefix characters like <, ., or #. The exception is XPath, where you enter a full XPath expression.
  4. Add additional conditions (optional). Click + to add another condition. Combined conditions use AND logic: the element must match all conditions. For example, combine Type p and Text Sponsored to exclude only paragraphs whose direct text contains “Sponsored”, rather than every element on the page containing that word.
  5. Select Include or Exclude. Choose whether to Include or Exclude matching elements. In most cases, choose Exclude.
  6. Set the scope and save. Choose All projects or This project only, then click Save changes. See Filter scope for how organization-wide and project-specific filters combine. Regenerate affected content items for the filter to take effect.
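
The AND logic within one filter and the OR logic across filters of the same rule type can be modelled in a few lines. The element records and helper names below are hypothetical and exist only to illustrate the logic.

```python
def matches_filter(element, conditions):
    """One filter: every condition must match the same element (AND)."""
    return all(condition(element) for condition in conditions)


def matches_any_filter(element, filters):
    """Several filters of the same rule type: one match is enough (OR)."""
    return any(matches_filter(element, conditions) for conditions in filters)


def is_paragraph(el):
    return el["tag"] == "p"


def mentions_sponsored(el):
    return "Sponsored" in el["text"]


def is_aside(el):
    return el["tag"] == "aside"


exclude_filters = [
    [is_paragraph, mentions_sponsored],  # Type p AND Text Sponsored
    [is_aside],                          # a second, independent filter
]

# Hypothetical element records: tag name plus direct text.
sponsored_paragraph = {"tag": "p", "text": "Sponsored by Acme"}
sponsored_heading = {"tag": "h2", "text": "Sponsored"}

print(matches_any_filter(sponsored_paragraph, exclude_filters))  # True
print(matches_any_filter(sponsored_heading, exclude_filters))    # False
```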

Filter types

| Type | Matches on | Common use |
| --- | --- | --- |
| Type | HTML tag name | Remove sup references, aside blocks, etc. |
| Class | Substring in the class attribute | Sidebars, footnotes, embed containers |
| ID | Exact id attribute | Unique advert or widget blocks |
| Data | Presence of a data-* attribute | CMS markers like data-exclude |
| Text | Direct text content (substring) | Simple text matches in a single element |
| XPath | Full XPath expression | Complex markup, fragmented text, link-based targeting |

Type (element_type)

Matches elements by HTML tag name (e.g. p, h2, blockquote, sup).
How matching works
  • Matches the tag name exactly—enter h2, not <h2>.
  • When combined with other conditions via +, all conditions must match the same element.
  • When no Type is set, the filter defaults to all elements (*).
Examples
  • Exclude Type sup—removes inline superscript reference numbers (e.g. ¹, ², [1])
  • Exclude Type a + Text Subscribe—removes a subscribe link, keeps the surrounding paragraph

Class (element_class)

Matches elements whose class attribute contains the value you enter (substring match, not an exact class token).
How matching works
  • Uses substring matching—nav also matches navbar, navigation, and main-nav.
  • For <div class="main navbar">, entering main navbar can match that element.
  • Each class name requires its own filter. Entering multiple names in one filter (e.g. nonedit, collection-embed) does not match either class.
  • Do not include a leading dot—enter sidebar, not .sidebar. A leading dot causes the filter to fail silently.
Examples
  • Exclude Class footnotes—removes a footnote block
  • Exclude Class related-articles—removes a related-content widget
Common mistakes
  • Using a broad class name that appears on editorial content—e.g. a class meaning “body text” in another language applied to every paragraph—can remove the entire article. Inspect your HTML before excluding by class.
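
Because Class filters match substrings of the whole class attribute, it is worth previewing what a value would hit before saving an Exclude rule. The sketch below models that matching with plain string containment; the helper names are hypothetical, and case-sensitive comparison is an assumption.

```python
def class_filter_matches(class_attr: str, value: str) -> bool:
    """Substring test against the whole class attribute string."""
    return value in class_attr


def preview_matches(class_attrs, value):
    """List the class attributes a value would hit, to spot over-broad filters."""
    return [attr for attr in class_attrs if class_filter_matches(attr, value)]


page_classes = ["navbar", "main-nav", "article-body", "nonedit collection-embed"]

print(preview_matches(page_classes, "nav"))
# ['navbar', 'main-nav']
print(preview_matches(page_classes, "nonedit, collection-embed"))
# [] (two names in one filter match neither class)
print(preview_matches(page_classes, ".navbar"))
# [] (a leading dot never appears in the attribute)
```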

ID (element_id)

Matches elements with an exact id attribute value.
How matching works
  • Exact match—@id='your-id'.
  • Do not include a # prefix—enter newsletter-signup, not #newsletter-signup.
  • Best for unique, stable containers on your pages.
Examples
  • Exclude ID advert-banner—removes a specific advert container

Data (element_data)

Matches elements that have a specific data-* attribute.
How matching works
  • Enter the attribute name without the data- prefix—e.g. enter exclude to match elements with a data-exclude attribute.
  • Matches attribute presence, not a specific attribute value.
Examples
  • Exclude Data exclude—removes elements marked <div data-exclude> in your CMS markup
See data attributes for marking content in your HTML.
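
Presence-based matching has one consequence worth noting: an element is matched whatever value the attribute holds. The sketch below models this on well-formed markup with the standard library; data_filter_matches is a hypothetical helper, not a BeyondWords function.

```python
import xml.etree.ElementTree as ET


def data_filter_matches(element, name):
    """True when the element carries data-<name>, whatever its value."""
    return f"data-{name}" in element.attrib


snippet = ET.fromstring(
    "<article>"
    '<div data-exclude="">Advert</div>'
    '<div data-exclude="false">Still matched</div>'
    "<p>Story</p>"
    "</article>"
)

matched = [el.text for el in snippet.iter() if data_filter_matches(el, "exclude")]
print(matched)  # ['Advert', 'Still matched']
```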

Text (element_text)

Matches elements based on the direct text they contain.
This filter removes whole HTML elements—not individual words or phrases. If an element matches, the element and everything inside it is removed. There is no way to strip a specific word or sentence from a paragraph while keeping the rest.
How matching works
  • Matches when an element’s direct text contains the value you enter. Text inside child tags (e.g. <strong>, <em>, <a>) is not part of the parent’s direct text.
  • Matching is case-sensitive and matches substrings: more matches “more”, “Read more”, and “moreover”.
  • When an element matches, the whole element and its contents are removed.
  • Used alone, the filter applies to every element on the page. Combine with Type, Class, or ID via + to narrow the match.
Examples
  • Exclude Type a + Text Subscribe—removes a subscribe link, keeps the surrounding paragraph
  • Exclude Type p + Text Sponsored—removes paragraphs whose direct text contains “Sponsored”
What this filter can’t do
  • Remove a word or sentence from inside a paragraph while keeping the rest of the paragraph
  • Match text split across child tags (e.g. <em>Sub</em><em>scribe</em>) or formatted inline (e.g., <p>Hello <strong>world</strong></p>—filtering for world on the <p> will not match because world is inside <strong>)
For fragmented or heavily formatted paragraphs, use XPath to target the wrapping element instead—for example, a newsletter paragraph identified by a link:
//p[.//a[contains(@href, 'newsletter')]]
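
To see why formatted paragraphs slip past Text filters, the sketch below computes an element's direct text and applies a case-sensitive substring test. It assumes that direct text means every text node that is an immediate child of the element; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Text that belongs to the element itself, excluding text inside child tags."""
    # element.text is the text before the first child; each child's tail is
    # the text that follows that child but still sits inside the parent.
    return (element.text or "") + "".join(child.tail or "" for child in element)


def text_filter_matches(element, value):
    """Case-sensitive substring match on direct text only."""
    return value in direct_text(element)


plain = ET.fromstring("<p>Hello world</p>")
formatted = ET.fromstring("<p>Hello <strong>world</strong></p>")

print(text_filter_matches(plain, "world"))      # True
print(text_filter_matches(formatted, "world"))  # False: "world" is inside <strong>
print(text_filter_matches(plain, "World"))      # False: matching is case-sensitive
```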

XPath (element_xpath)

Matches elements using a full XPath expression. When XPath is set, other filter fields are ignored.
How matching works
  • Enter a complete XPath expression—e.g. //*[@role='dialog'] or //aside[contains(@class, 'sidebar')].
  • Provides the most precise control for complex document structures.
  • Best escape hatch when Text filters cannot match fragmented markup you do not control (common with third-party CMS and RSS-sourced HTML).
Examples
  • Exclude //*[@role='dialog']—removes dialog/modal elements
  • Exclude //p[.//a[contains(@href, 'suscripcion-newsletter')]]—removes a newsletter sign-up paragraph identified by its subscribe link, regardless of how text is split across <em> and <strong> tags inside
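
Before saving an XPath filter, you can check what it would select against a sample of your markup. The sketch below uses Python's standard library, which supports only a subset of XPath: the attribute test runs as a real path expression, while the contains(@href, ...) rule is reproduced in plain Python because ElementTree has no contains() function. The sample markup and helper name are hypothetical.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<main>"
    "<section role='dialog'>Cookie notice</section>"
    "<p><em>Sub</em><em>scribe</em> to our <a href='/newsletter/signup'>list</a></p>"
    "<p>Story text</p>"
    "</main>"
)

# Equivalent of //*[@role='dialog'] within ElementTree's XPath subset.
dialogs = page.findall(".//*[@role='dialog']")


def paragraphs_linking_to(root, href_part):
    """Plain-Python equivalent of //p[.//a[contains(@href, href_part)]]."""
    return [
        p for p in root.iter("p")
        if any(href_part in a.get("href", "") for a in p.iter("a"))
    ]


newsletter_paragraphs = paragraphs_linking_to(page, "newsletter")

print([el.tag for el in dialogs])   # ['section']
print(len(newsletter_paragraphs))   # 1
```

The newsletter paragraph is found through its link even though its visible text is split across two em tags, which is the case Text filters cannot handle.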

Processing type and filters

When sending content through the API, the type field determines whether content filters are applied.
| Action | Filters applied? |
| --- | --- |
| Article has auto_segment and you click Regenerate | Yes |
| Article has auto_segment and you send new HTML via the API | Yes |
| Article has manual_segment and you click Regenerate | No |
Articles created or last saved via the dashboard Editor use manual_segment by default. Set type to auto_segment when sending content through the API if you want filters to apply. For more detail, see Processing types in the API reference.
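
As a minimal sketch, a client could make sure every update it sends asks for auto_segment processing. Only the type field and its auto_segment value come from the text above; the helper name and the body field in the example are assumptions, so check the API reference for the actual request shape.

```python
import json


def with_filters_enabled(payload: dict) -> dict:
    """Return a copy of an update payload that requests auto_segment processing."""
    return {**payload, "type": "auto_segment"}


# "body" is a hypothetical field name used for illustration only.
original = {"body": "<p>Story</p>", "type": "manual_segment"}
update = with_filters_enabled(original)

print(json.dumps(update))
# {"body": "<p>Story</p>", "type": "auto_segment"}
```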

FAQs

Common causes of unwanted content appearing in the article:
  • Automatic extraction included the content—add an Exclude filter targeting the relevant HTML element and regenerate.
  • No Exclude filter matches the element—inspect the page HTML, choose the right filter type, and regenerate. For formatted paragraphs, Text alone may not work—try Class, ID, or XPath.
  • An Include filter is too broad—review Include filters; they strip everything outside the match.
Common causes of expected content missing from the article:
  • Automatic extraction excluded the content—add an Include filter targeting the relevant container and regenerate.
  • An Exclude filter is too broad—e.g. a Class filter using a substring that matches editorial elements. Review and narrow the filter.
  • The article uses manual_segment—filters no longer apply. Regenerate via API with type: auto_segment, or edit in the Editor.
The Text filter only checks an element’s direct text, not text nested in child tags. This often affects real article markup:
  • <p>Hello world</p>—Text world on Type p matches
  • <p>Hello <strong>world</strong></p>—Text world on Type p does not match
  • Text split across multiple <em> tags will not match as a single phrase on the parent <p>
Fix: use XPath to target the parent element—e.g. by a distinctive href, id, or class on a child link rather than the paragraph text.
If the paragraph contains a distinctive subscribe link, use an Exclude XPath filter:
//p[.//a[contains(@href, 'newsletter')]]
Replace newsletter with a distinctive part of the subscribe URL. Avoid excluding by a generic class name (e.g. a class meaning “body” applied to all paragraphs), because that can remove the entire article.
If you only need to remove the link and keep the paragraph, use Exclude Type a + Text matching the link text (e.g. Subscribe).
When do filters take effect?
Filters apply on the next extraction or regeneration after you save them. Existing content is not updated automatically: click Regenerate on affected items, or re-import/re-publish through your integration.
Filters do not apply to content last saved in the Editor (manual_segment). See When filters apply for the full breakdown by integration.
New accounts include default Type filters: a set of Include and Exclude rules that run automatically. The Include presets keep only common editorial element types (p, h1–h6, div, etc.) and strip everything else. Check Settings → Extraction → Filters for organization-wide (All projects) rules as well as project-specific ones.

Getting help

If you encounter issues or have questions, contact support. Include the article URL or HTML snippet and the filters you have configured.