Extracted text becomes the article within the content item. You can edit the article in the Editor, but changes there switch the item to manual_segment processing (see Processing type and filters below).
How it works
- BeyondWords receives HTML—fetched from a live URL, imported from a feed, or sent directly via an integration.
- Content filters (if configured) remove or retain whole HTML elements in the raw markup.
- An extraction mode determines how editorial content is identified from the filtered HTML.
- The result becomes the article used for segmentation and audio/video generation.
Extraction modes
Configure the extraction mode in Settings → Extraction.
Automatic extraction
Automatic extraction uses AI to identify and extract editorial content from your page while ignoring elements that should not be used for audio or video generation. Configured content filters are applied to the HTML before AI extraction runs. In most cases, Exclude filters are the right tool (see Extraction mode and filters).
Manual extraction
Manual extraction relies on content filters to define which parts of a page are included or excluded. This gives more predictable results than automatic extraction, but requires careful filter configuration, especially for default filters and any Include rules. See Extraction mode and filters.
Legacy extraction
Legacy extraction uses rule-based heuristics and content filters to identify the main editorial content on a page. This mode exists for backwards compatibility with older projects that relied on the previous extraction pipeline.
Extraction settings
Static IP
If your site requires IP allowlisting for BeyondWords to fetch article pages, enable static IP:
- Go to Settings → Extraction in your project dashboard
- Switch Static IP on
- Allowlist the displayed IP addresses in your firewall, CDN, or server configuration
Page-fetch requests are sent from 20.234.8.180 with User-Agent: BeyondWords Importer. This applies to URL fetching (Magic Embed, RSS page extraction, etc.), not to webhook delivery.
Filters
Content filters control which parts of your source HTML are kept or removed before extraction and segmentation. Use them to:
- Exclude recurring elements that should not be read aloud: newsletter sign-ups, related-article blocks, social embeds, footnotes
- Include only specific containers when extraction picks up more than you need (use sparingly; see Include vs Exclude)
When filters apply
Dashboard content filters run on raw HTML before extraction and segmentation. Whether they run depends on how content reaches BeyondWords and its processing type:
| Source | Filters applied? |
|---|---|
| Magic Embed (live page fetch) | Yes, before AI or rule-based extraction |
| RSS Feed Importer (page fetch enabled) | Yes, on fetched article HTML |
| API / WordPress REST API (body HTML) | Yes, when the content item uses auto_segment |
| Ghost plugin | Yes, on HTML sent to BeyondWords, when auto_segment |
| Dashboard Editor | No. Saving in the Editor sets manual_segment; filters do not re-apply |
WordPress users can also preprocess the body HTML before it is sent, using the PHP beyondwords_content_params hook. That is a WordPress plugin feature, not a dashboard content filter.
Filters take effect on the next extraction or regeneration after you save them—not on content that was already generated. See When do filters take effect? in the FAQs.
Filter scope
Filters can be scoped to All projects (organization-wide) or This project only. At runtime, both apply:
- Organization-wide filters (All projects) run for every project in your account
- Project-specific filters (This project only) run in addition to organization-wide filters
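The way the two scopes combine can be sketched in a few lines. This is only an illustration: the Filter record, its field names, and the scope labels are hypothetical and are not part of any BeyondWords API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Filter:
    rule: str   # "exclude" or "include"
    kind: str   # e.g. "type", "class", "xpath"
    value: str
    scope: str  # "all_projects" or a project identifier (hypothetical labels)


def effective_filters(filters, project_id):
    """Return the filters that run for one project: organization-wide plus its own."""
    return [f for f in filters if f.scope in ("all_projects", project_id)]


filters = [
    Filter("exclude", "type", "aside", "all_projects"),
    Filter("exclude", "class", "newsletter", "news-site"),
    Filter("exclude", "class", "promo", "podcast-site"),
]

active = effective_filters(filters, "news-site")
print([f.value for f in active])
```

A filter scoped to another project (promo here) never runs, while the organization-wide rule applies everywhere.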
Default filters
New BeyondWords accounts are created with preset Type filters. These are normal dashboard filters (not optional) and they run in the same way as any filter you add yourself (Exclude first, then Include).
Exclude (element types removed from the HTML):
aside, figcaption, footer, form, iframe, nav, noscript
Include (only these element types and their ancestors/descendants are kept; everything else is stripped):
head, p, div, li, h1, h2, h3, h4, h5, h6, blockquote, table, img
These defaults explain why some markup disappears even before you add custom filters, and why Manual extraction mode is sensitive to Include rules. You can edit or delete default filters in Settings → Extraction → Filters like any other filter.
Extraction mode and filters
| Extraction mode | Role of filters | Recommended approach |
|---|---|---|
| Automatic (default) | Fine-tune HTML before AI extraction | Add Exclude filters for recurring non-editorial blocks. AI handles most editorial identification. |
| Manual | Filters are the primary extraction mechanism | Understand default filters and any Include rules—they define what survives. Use Exclude for specific removals; use Include only when you need a narrow allowlist. |
| Legacy (deprecated) | Applied before rule-based heuristics | Migrate to Automatic. Use Exclude filters as you would with Automatic mode. |
Include vs Exclude
| Rule | Effect |
|---|---|
| Exclude | Remove matching elements from the HTML. This is the most common choice—remove a sidebar, footnote block, or newsletter paragraph. |
| Include | Keep only matching elements (plus their ancestors and descendants). Everything else is removed. Use only when you want to extract from a specific container and discard the rest of the page. |
Include + Text is rare: it keeps only elements whose direct text matches, which is usually too narrow to be useful on its own.
How filters work
- Filters match HTML elements and remove them wholesale—not individual words or phrases inside an element.
- Combined conditions on a single filter use AND logic (click + in the dashboard). All conditions must match the same element.
- Multiple filters of the same rule type are combined with OR logic—if any filter matches, the rule applies.
- Exclude filters run first and remove matching elements (and everything inside them).
- Include filters run after excludes and strip everything that does not match (and is not an ancestor or descendant of a match).
- Filters with invalid XPath expressions are skipped silently. Double-check syntax, especially for XPath filters.
Some elements are always removed regardless of your filters: script, style, HTML comments, and elements with class beyondwords-player. These built-in exclusions cannot be disabled in the dashboard.
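The order described above (Exclude first, then Include with ancestors and descendants preserved) can be sketched as follows. This is a rough model, not the BeyondWords implementation: it handles Type filters only, uses Python's xml.etree.ElementTree (so the markup must be well-formed), and the function names are made up for the example.

```python
import xml.etree.ElementTree as ET


def _remove(parent, child):
    """Remove child but keep the text that follows it (ElementTree stores it in child.tail)."""
    tail = child.tail or ""
    idx = list(parent).index(child)
    if idx > 0:
        prev = parent[idx - 1]
        prev.tail = (prev.tail or "") + tail
    else:
        parent.text = (parent.text or "") + tail
    parent.remove(child)


def _prune(elem, include_tags, inside_match=False):
    """Keep elem if it matches, sits inside a match, or contains a match."""
    matched = inside_match or elem.tag in include_tags
    kept_child = False
    for child in list(elem):
        if _prune(child, include_tags, matched):
            kept_child = True
        else:
            _remove(elem, child)
    return matched or kept_child


def apply_type_filters(root, exclude_tags, include_tags):
    # Step 1: Exclude removes matching elements and everything inside them.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag in exclude_tags:
                _remove(parent, child)
    # Step 2: Include strips whatever is not a match, an ancestor, or a descendant.
    if include_tags:
        _prune(root, include_tags)
    return root


html = (
    "<body>"
    "<nav><p>Home</p></nav>"
    "<article><h1>Title</h1><p>Body text</p><span>Share</span></article>"
    "<footer><p>Copyright</p></footer>"
    "</body>"
)
root = ET.fromstring(html)
apply_type_filters(root, exclude_tags={"nav", "footer"}, include_tags={"h1", "p"})
result = ET.tostring(root, encoding="unicode")
print(result)
```

The p inside nav is gone even though p is on the Include list, because the Exclude pass already removed its container. The span disappears in the Include pass, while article survives only as the ancestor of matching elements.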
Create a filter
Select the filter type
Choose Type, Class, Data, ID, XPath, or Text. See filter types below.
Enter the filter criteria
Enter the value for your chosen type, for example h2 for a Type filter or newsletter for a Class filter.
Add additional conditions (optional)
Click + to add another condition. Combined conditions use AND logic: the element must match all conditions. For example, combine Type p and Text Sponsored to exclude only paragraphs whose direct text contains “Sponsored”, rather than every element on the page containing that word.
Select Include or Exclude
Choose whether to Include or Exclude matching elements. In most cases, choose Exclude.
Set the scope and save
Choose All projects or This project only, then click Save changes. See Filter scope for how organization-wide and project-specific filters combine.
Regenerate affected content items for the filter to take effect.
Filter types
| Type | Matches on | Common use |
|---|---|---|
| Type | HTML tag name | Remove sup references, aside blocks, etc. |
| Class | Substring in the class attribute | Sidebars, footnotes, embed containers |
| ID | Exact id attribute | Unique advert or widget blocks |
| Data | Presence of a data-* attribute | CMS markers like data-exclude |
| Text | Direct text content (substring) | Simple text matches in a single element |
| XPath | Full XPath expression | Complex markup, fragmented text, link-based targeting |
Type (element_type)
Matches elements by HTML tag name (e.g. p, h2, blockquote, sup).
How matching works
- Matches the tag name exactly: enter h2, not <h2>.
- When combined with other conditions via +, all conditions must match the same element.
- When no Type is set, the filter defaults to all elements (*).
Examples
- Exclude Type sup: removes inline superscript reference numbers (e.g. ¹, ², [1])
- Exclude Type a + Text Subscribe: removes a subscribe link, keeps the surrounding paragraph
Class (element_class)
Matches elements whose class attribute contains the value you enter (substring match, not an exact class token).
How matching works
- Uses substring matching: nav also matches navbar, navigation, and main-nav.
- For <div class="main navbar">, entering main navbar can match that element.
- Each class name requires its own filter. Entering multiple names in one filter (e.g. nonedit, collection-embed) does not match either class.
- Do not include a leading dot: enter sidebar, not .sidebar. A leading dot causes the filter to fail silently.
Examples
- Exclude Class footnotes: removes a footnote block
- Exclude Class related-articles: removes a related-content widget
Caution: a broad class name that also appears on editorial content (for example, a class meaning “body text” in another language, applied to every paragraph) can remove the entire article. Inspect your HTML before excluding by class.
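Before saving a Class filter, it can help to check how widely a value would match. The sketch below models the substring rule described above as a plain Python substring test over each class attribute; the function names are hypothetical, and case-sensitive matching is an assumption.

```python
def class_filter_matches(class_attr: str, value: str) -> bool:
    """Substring test against the whole class attribute (assumed case-sensitive)."""
    return value in class_attr


def preview_matches(class_attrs, value):
    """List the class attributes that a Class filter value would hit."""
    return [c for c in class_attrs if class_filter_matches(c, value)]


# Class attributes collected from a page (sample data for illustration).
page_classes = ["main navbar", "navigation", "article-body", "body-text", "sidebar"]

print(preview_matches(page_classes, "nav"))       # broader than a single class token
print(preview_matches(page_classes, "body"))      # would also hit editorial containers
print(preview_matches(page_classes, ".sidebar"))  # leading dot: nothing matches
print(preview_matches(["nonedit", "collection-embed"], "nonedit, collection-embed"))
```

If the preview for a candidate value includes containers that hold article text, a narrower value or an ID or XPath filter is likely the safer choice.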
ID (element_id)
Matches elements with an exact id attribute value.
How matching works
- Exact match: @id='your-id'.
- Do not include a # prefix: enter newsletter-signup, not #newsletter-signup.
- Best for unique, stable containers on your pages.
Examples
- Exclude ID advert-banner: removes a specific advert container
Data (element_data)
Matches elements that have a specific data-* attribute.
How matching works
- Enter the attribute name without the data- prefix, e.g. enter exclude to match elements with a data-exclude attribute.
- Matches attribute presence, not a specific attribute value.
Examples
- Exclude Data exclude: removes elements marked <div data-exclude> in your CMS markup
Text (element_text)
Matches elements based on the direct text they contain.
How matching works
- Matches when an element’s direct text contains the value you enter. Text inside child tags (e.g. <strong>, <em>, <a>) is not part of the parent’s direct text.
- Matching is case-sensitive and matches substrings: more matches “more”, “Read more”, and “moreover”.
- When an element matches, the whole element and its contents are removed.
- Used alone, the filter applies to every element on the page. Combine with Type, Class, or ID via + to narrow the match.
Examples
- Exclude Type a + Text Subscribe: removes a subscribe link, keeps the surrounding paragraph
- Exclude Type p + Text Sponsored: removes paragraphs whose direct text contains “Sponsored”
Text filters cannot:
- Remove a word or sentence from inside a paragraph while keeping the rest of the paragraph
- Match text split across child tags (e.g. <em>Sub</em><em>scribe</em>) or formatted inline (e.g. <p>Hello <strong>world</strong></p>: filtering for world on the <p> will not match because world is inside <strong>)
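The notion of direct text can be made concrete with a small sketch. It assumes direct text means every text node that is an immediate child of the element (in ElementTree terms, the element's own text plus the tails of its children); whether BeyondWords checks all such nodes or only the first is not stated here, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET


def direct_text(elem):
    """Text belonging to elem itself: its leading text plus the tails of its children."""
    parts = [elem.text or ""]
    parts.extend(child.tail or "" for child in elem)
    return "".join(parts)


def text_filter_matches(elem, value):
    """Case-sensitive substring test against the element's direct text only."""
    return value in direct_text(elem)


plain = ET.fromstring("<p>Hello world</p>")
nested = ET.fromstring("<p>Hello <strong>world</strong></p>")
split = ET.fromstring("<p><em>Sub</em><em>scribe</em> now</p>")

print(text_filter_matches(plain, "world"))                  # text sits directly in <p>
print(text_filter_matches(nested, "world"))                 # text sits in <strong>, not <p>
print(text_filter_matches(nested.find("strong"), "world"))  # matches on the child instead
print(text_filter_matches(split, "Subscribe"))              # phrase is split across <em> tags
```

In the nested and split cases the paragraph's own direct text is only "Hello " and " now", which is why a Text condition on the p element finds nothing.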
XPath (element_xpath)
Matches elements using a full XPath expression. When XPath is set, other filter fields are ignored.
How matching works
- Enter a complete XPath expression, e.g. //*[@role='dialog'] or //aside[contains(@class, 'sidebar')].
- Provides the most precise control for complex document structures.
- Best escape hatch when Text filters cannot match fragmented markup you do not control (common with third-party CMS and RSS-sourced HTML).
Examples
- Exclude //*[@role='dialog']: removes dialog/modal elements
- Exclude //p[.//a[contains(@href, 'suscripcion-newsletter')]]: removes a newsletter sign-up paragraph identified by its subscribe link, regardless of how text is split across <em> and <strong> tags inside
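When a combination of Type, Class, ID, Data, and Text conditions is not precise enough, one way to start is from an XPath expression with roughly the same meaning and then refine it. The helper below is a hypothetical translation based only on the matching rules described on this page (AND across conditions, * when no Type is set, substring match for Class, exact match for ID, presence for Data, direct text for Text). It is not how BeyondWords evaluates filters internally.

```python
def _quote(value):
    """Wrap a value as an XPath 1.0 string literal."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    raise ValueError("value contains both quote styles; build the XPath by hand")


def to_xpath(type_=None, class_=None, id_=None, data=None, text=None):
    """Build an XPath whose predicates must all hold for the same element."""
    tag = type_ or "*"  # no Type condition means any element
    predicates = []
    if class_:
        predicates.append(f"contains(@class, {_quote(class_)})")
    if id_:
        predicates.append(f"@id={_quote(id_)}")
    if data:
        predicates.append(f"@data-{data}")  # attribute presence only
    if text:
        predicates.append(f"text()[contains(., {_quote(text)})]")  # direct text nodes
    return f"//{tag}" + "".join(f"[{p}]" for p in predicates)


print(to_xpath(type_="aside", class_="sidebar"))
print(to_xpath(id_="advert-banner"))
print(to_xpath(data="exclude"))
print(to_xpath(type_="p", text="Sponsored"))
```

From such a starting point, the expression can be adjusted by hand, for instance to test a descendant link's href as in the newsletter example above.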
Processing type and filters
When sending content through the API, the type field determines whether content filters are applied.
| Action | Filters applied? |
|---|---|
| Article has auto_segment and you click Regenerate | Yes |
| Article has auto_segment and you send new HTML via the API | Yes |
| Article has manual_segment and you click Regenerate | No |
Content can be created as manual_segment by default. Set type to auto_segment when sending content through the API if you want filters to apply.
For more detail, see Processing types in the API reference.
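As a rough illustration, a request body that asks for filters to run might be assembled like this. Only the body and type fields are taken from this page; the title field and the helper name are assumptions, and the endpoint and authentication are deliberately left out, so check the API reference before relying on any of it.

```python
import json


def build_content_payload(title, body_html, apply_filters=True):
    """Assemble a request body; the title field name is an assumption."""
    return {
        "title": title,
        "body": body_html,
        # auto_segment lets dashboard content filters run; manual_segment skips them.
        "type": "auto_segment" if apply_filters else "manual_segment",
    }


payload = build_content_payload(
    "Example article",
    "<article><h1>Example article</h1><p>Body text</p></article>",
)
print(json.dumps(payload, indent=2))
```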
FAQs
Why is unwanted content being extracted?
Common causes:
- Automatic extraction included the content—add an Exclude filter targeting the relevant HTML element and regenerate.
- No Exclude filter matches the element—inspect the page HTML, choose the right filter type, and regenerate. For formatted paragraphs, Text alone may not work—try Class, ID, or XPath.
- An Include filter is too broad—review Include filters; they strip everything outside the match.
Why is some content not being extracted?
Common causes:
- Automatic extraction excluded the content—add an Include filter targeting the relevant container and regenerate.
- An Exclude filter is too broad—e.g. a Class filter using a substring that matches editorial elements. Review and narrow the filter.
- The article uses manual_segment, so filters no longer apply. Regenerate via API with type: auto_segment, or edit in the Editor.
My Text filter doesn't match—why?
The Text filter only checks an element’s direct text, not text nested in child tags. This often affects real article markup:
- <p>Hello world</p>: Text world on Type p matches
- <p>Hello <strong>world</strong></p>: Text world on Type p does not match
- Text split across multiple <em> tags will not match as a single phrase on the parent <p>
In these cases, target the href, id, or class on a child link rather than the paragraph text.
How do I remove a newsletter sign-up paragraph?
Exclude the paragraph with an XPath filter keyed to its sign-up link rather than its text, for example //p[.//a[contains(@href, 'suscripcion-newsletter')]] from the XPath examples. Replace the href fragment with one that appears in your own sign-up link.
When do filters take effect?
Filters apply on the next extraction or regeneration after you save them. Existing content is not updated automatically: click Regenerate on affected items, or re-import/re-publish through your integration.
Filters do not apply to content last saved in the Editor (manual_segment). See When filters apply for the full breakdown by integration.
Why does content disappear even though I haven't added filters?
New accounts include default Type filters: a set of Include and Exclude rules that run automatically. The Include presets keep only common editorial element types (p, h1–h6, div, etc.) and strip everything else. Check Settings → Extraction → Filters for organization-wide (All projects) rules as well as project-specific ones.