Overview

Content extraction is the process BeyondWords uses to pull article text into our platform. This is required for features like Magic Embed, extraction-enabled RSS feeds, and the URL importer.

Extraction mode

Automatic

In this mode, article content is automatically identified and structured using AI. The model recognises key elements on your webpage and outputs a clean, well-formatted article for audio generation. Any content filters you’ve configured are also applied during this process.
Recommended for all new projects setting up content extraction.

Manual

In this mode, article content is extracted using only the content filters you configure. This gives you full control over exactly which parts of the article are ingested.

Legacy

Content is extracted using a combination of content filters and rule-based heuristics. Unlike Automatic mode, it relies on predefined conditions to locate content. This approach works well if the structure of your site is consistent, but it is less flexible than Automatic mode.
This mode is recommended only for customers with existing projects already set up and working with this method.

Request headers

For paywalled or protected content, you may need to provide authentication headers to grant our servers access to your content.
  • Add a Header Name and Header Value.
  • Click + to add multiple headers if needed.
  • Ensure the headers grant full access to your content.
Requests will be made with User-Agent: BeyondWords Importer
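
On your own server, the check for these headers might look roughly like the sketch below. It is an illustration only, not BeyondWords code: the header name X-Extraction-Token, the secret value, and the function name are hypothetical placeholders for whatever Header Name and Header Value you configure, and the sketch assumes the User-Agent string contains "BeyondWords Importer" as stated above.

```python
import hmac

# Hypothetical values: substitute the Header Name and Header Value
# you entered in the request headers settings.
AUTH_HEADER = "X-Extraction-Token"
AUTH_SECRET = "replace-with-your-secret"

# User-Agent string quoted in the documentation above.
IMPORTER_USER_AGENT = "BeyondWords Importer"


def is_extraction_request(headers: dict) -> bool:
    """Return True if the headers look like an authorised extraction request."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    token = normalised.get(AUTH_HEADER.lower(), "")
    user_agent = normalised.get("user-agent", "")
    # compare_digest avoids leaking the secret through timing differences.
    token_ok = hmac.compare_digest(token.encode(), AUTH_SECRET.encode())
    return token_ok and IMPORTER_USER_AGENT in user_agent
```

A request that passes this check could then be served the full article rather than the paywalled preview. The User-Agent alone is easy to spoof, so the sketch treats it only as a secondary signal alongside the secret header.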

Static IP

If your website requires IP allowlisting, you may need to enable this option to grant our servers access to your content.
  • Enable Static IP.
  • Ensure your server allows full access to your content.
Requests will be sent from 20.234.8.180 or 176.34.249.78
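
If your allowlist is enforced in application code rather than at a firewall, the check could be sketched as follows. This is a minimal example under assumptions: the two addresses are the ones listed above, while the function name and the idea of passing in the remote address as a string are hypothetical and depend on your own stack.

```python
import ipaddress

# Static IP addresses listed in the documentation above.
ALLOWED_IPS = {
    ipaddress.ip_address("20.234.8.180"),
    ipaddress.ip_address("176.34.249.78"),
}


def is_allowlisted(remote_addr: str) -> bool:
    """Return True if remote_addr is one of the static extraction IPs."""
    try:
        return ipaddress.ip_address(remote_addr) in ALLOWED_IPS
    except ValueError:
        # Anything that is not a valid IP address is rejected.
        return False
```

If your site sits behind a proxy or CDN, the address your application sees may be the proxy's, so the value passed in would need to come from whichever forwarded-address mechanism your infrastructure trusts.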

Javascript enabled

Enable this option if your website is a single-page application (SPA) built with frameworks such as React, Vue.js, or Angular. When enabled, the extractor waits briefly for the page to finish loading and for network activity to stop before processing the HTML.