
feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio #390


Merged
merged 16 commits into from
Feb 13, 2025
95 changes: 53 additions & 42 deletions docs/02_guides/05_scrapy.mdx
@@ -7,90 +7,101 @@ import CodeBlock from '@theme/CodeBlock';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

import UnderscoreMainExample from '!!raw-loader!./code/scrapy_src/__main__.py';
import MainExample from '!!raw-loader!./code/scrapy_src/main.py';
import ItemsExample from '!!raw-loader!./code/scrapy_src/items.py';
import SettingsExample from '!!raw-loader!./code/scrapy_src/settings.py';
import TitleSpiderExample from '!!raw-loader!./code/scrapy_src/spiders/title.py';
import UnderscoreMainExample from '!!raw-loader!./code/scrapy_project/src/__main__.py';
import MainExample from '!!raw-loader!./code/scrapy_project/src/main.py';
import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';

[Scrapy](https://scrapy.org/) is an open-source web scraping framework written in Python. It provides a complete set of tools for web scraping, including the ability to define how to extract data from websites, handle pagination and navigation.
[Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform).

:::tip
## Integrating Scrapy with the Apify platform

Our CLI now supports transforming Scrapy projects into Apify Actors with a single command! Check out the [Scrapy migration guide](https://docs.apify.com/cli/docs/integrating-scrapy) for more information.
The Apify SDK provides an Apify-Scrapy integration. The main challenge is combining two asynchronous frameworks that use different event loop implementations. Scrapy uses [Twisted](https://twisted.org/) for asynchronous execution, while the Apify SDK is based on [asyncio](https://docs.python.org/3/library/asyncio.html). The key step is to install Twisted's `asyncioreactor`, which runs the Twisted reactor on top of the asyncio event loop. This allows both Twisted and asyncio to run on a single event loop, enabling a Scrapy spider to run as an Apify Actor with minimal modifications.

:::
<CodeBlock className="language-python" title="__main__.py: The Actor entry point">
{UnderscoreMainExample}
</CodeBlock>

Some of the key features of Scrapy for web scraping include:
In this setup, `apify.scrapy.initialize_logging` configures an Apify log formatter and reconfigures loggers to ensure consistent logging across Scrapy, the Apify SDK, and other libraries. The `apify.scrapy.run_scrapy_actor` function bridges asyncio coroutines with Twisted's reactor, so that the Actor's main coroutine, which runs the Scrapy spider, can be executed.

- **Request and response handling** - Scrapy provides an easy-to-use interface for making HTTP requests and handling responses,
allowing you to navigate through web pages and extract data.
- **Robust Spider framework** - Scrapy has a spider framework that allows you to define how to scrape data from websites,
including how to follow links, how to handle pagination, and how to parse the data.
- **Built-in data extraction** - Scrapy includes built-in support for data extraction using XPath and CSS selectors,
allowing you to easily extract data from HTML and XML documents.
- **Integration with other tool** - Scrapy can be integrated with other Python tools like BeautifulSoup and Selenium for more advanced scraping tasks.
Make sure the `SCRAPY_SETTINGS_MODULE` environment variable is set to the path of the Scrapy settings module. This variable is also used by the `Actor` class to detect that the project is a Scrapy project, triggering additional actions.
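
As a minimal sketch of the point above, the snippet below sets the variable and then tests for its presence. The `is_scrapy_project` helper is hypothetical and only illustrates the idea; how the `Actor` class actually performs its detection is not shown in this guide. The module path `src.settings` is taken from the example project layout.

```python
import os

# 'src.settings' mirrors the example project layout; adjust it to your own package.
# setdefault keeps any value that was already provided by the environment.
os.environ.setdefault('SCRAPY_SETTINGS_MODULE', 'src.settings')


def is_scrapy_project() -> bool:
    """Hypothetical check: treat the process as a Scrapy project if the variable is set."""
    return bool(os.environ.get('SCRAPY_SETTINGS_MODULE'))


print(is_scrapy_project())
```

Setting the variable before any Scrapy settings are loaded matters, because the settings module is resolved from it at load time.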

## Using Scrapy template
<CodeBlock className="language-python" title="main.py: The Actor main coroutine">
{MainExample}
</CodeBlock>

The fastest way to start using Scrapy in Apify Actors is by leveraging the [Scrapy Actor template](https://apify.com/templates/categories/python). This template provides a pre-configured structure and setup necessary to integrate Scrapy into your Actors seamlessly. It includes: setting up the Scrapy settings, `asyncio` reactor, Actor logger, and item pipeline as necessary to make Scrapy spiders run in Actors and save their outputs in Apify datasets.
Within the Actor's main coroutine, the Actor's input is processed as usual. The function `apify.scrapy.apply_apify_settings` is then used to configure Scrapy settings with Apify-specific components before the spider is executed. The key components and other helper functions are described in the next section.
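
The main coroutine awaits the crawl through `scrapy.utils.defer.deferred_to_future`, which exposes a callback-based Twisted `Deferred` as an awaitable. The sketch below shows that general idea using only the standard library. It is not Twisted's or Scrapy's implementation: `MiniDeferred` and `to_future` are hypothetical names introduced for illustration.

```python
import asyncio
from typing import Any, Callable, List


class MiniDeferred:
    """Hypothetical stand-in for a callback-based result (not Twisted's Deferred)."""

    def __init__(self) -> None:
        self._callbacks: List[Callable[[Any], None]] = []

    def add_callback(self, fn: Callable[[Any], None]) -> None:
        self._callbacks.append(fn)

    def fire(self, result: Any) -> None:
        # Deliver the result to every registered callback.
        for fn in self._callbacks:
            fn(result)


def to_future(deferred: MiniDeferred) -> 'asyncio.Future[Any]':
    """Expose a callback-based result as an awaitable asyncio future."""
    future = asyncio.get_running_loop().create_future()
    deferred.add_callback(future.set_result)
    return future


async def demo() -> str:
    deferred = MiniDeferred()
    # Fire the callback later on the same loop, as a reactor sharing the loop would.
    asyncio.get_running_loop().call_later(0.01, deferred.fire, 'crawl finished')
    return await to_future(deferred)


result = asyncio.run(demo())
```

Because the callback and the coroutine run on one event loop, no threads or nested loops are needed to hand the result over.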

## Manual setup
## Key integration components
Contributor

It would be awesome if we could have a nice diagram here 🙂 In the ideal case, it would show how the asyncio event loop and twisted reactor interact. We can definitely postpone that though.

Contributor Author

The point is that Asyncio and Twisted share the same event loop. I'm not familiar with the details of how Asyncio and Twisted exactly interact together. Other than that, we have a separate thread with an Asyncio event loop for the Scheduler's synchronous calls.

Contributor

We gave it two weeks and made it work, but that doesn't mean we understand the sorcery enough to draw diagrams 😅 But I think it's something like...

diagram

(It's black on transparent SVG, so you probably won't see it well in dark mode)

Contributor

I don't know if I chose the best diagram for this, maybe https://swimlanes.io/ would be better


If you prefer not to use the template, you will need to manually configure several components to integrate Scrapy with the Apify SDK.
The Apify SDK provides several custom components to support integration with the Apify platform:

### Event loop & reactor
- [`apify.scrapy.ApifyScheduler`](https://docs.apify.com/sdk/python/reference/class/ApifyScheduler) - Replaces Scrapy's default [scheduler](https://docs.scrapy.org/en/latest/topics/scheduler.html) with one that uses Apify's [request queue](https://docs.apify.com/platform/storage/request-queue) for storing requests. It manages enqueuing, dequeuing, and maintaining the state and priority of requests.
- [`apify.scrapy.ActorDatasetPushPipeline`](https://docs.apify.com/sdk/python/reference/class/ActorDatasetPushPipeline) - A Scrapy [item pipeline](https://docs.scrapy.org/en/latest/topics/item-pipeline.html) that pushes scraped items to Apify's [dataset](https://docs.apify.com/platform/storage/dataset). When enabled, every item produced by the spider is sent to the dataset.
- [`apify.scrapy.ApifyHttpProxyMiddleware`](https://docs.apify.com/sdk/python/reference/class/ApifyHttpProxyMiddleware) - A Scrapy [middleware](https://docs.scrapy.org/en/latest/topics/downloader-middleware.html) that manages proxy configurations. This middleware replaces Scrapy's default `HttpProxyMiddleware` to facilitate the use of Apify's proxy service.
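
Scrapy calls its scheduler synchronously, while Apify storages expose asynchronous methods. According to the review discussion on this change, the integration serves those synchronous calls from a separate thread that runs its own asyncio event loop. The sketch below illustrates that general pattern with the standard library only. `BackgroundLoop` and `fetch_next_request` are hypothetical names and are not part of the Apify SDK.

```python
import asyncio
import threading
from typing import Any, Coroutine, Dict


class BackgroundLoop:
    """Hypothetical helper that runs an asyncio event loop in a daemon thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run(self, coro: Coroutine[Any, Any, Any], timeout: float = 10.0) -> Any:
        """Block the calling synchronous code until the coroutine finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        # Stop the loop from its own thread, then wait for the thread to exit.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_next_request() -> Dict[str, str]:
    """Stand-in for an asynchronous request queue call."""
    await asyncio.sleep(0.01)
    return {'url': 'https://example.com'}


background = BackgroundLoop()
next_request = background.run(fetch_next_request())
background.close()
```

The blocking `future.result()` call happens on the caller's thread, so the background loop stays free to run the coroutine.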

The Apify SDK is built on Python's asynchronous [`asyncio`](https://docs.python.org/3/library/asyncio.html) library, whereas Scrapy uses [`twisted`](https://twisted.org/) for its asynchronous operations. To make these two frameworks work together, you need to:
Additional helper functions in the [`apify.scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) subpackage include:

- Set the [`AsyncioSelectorReactor`](https://docs.scrapy.org/en/latest/topics/asyncio.html#installing-the-asyncio-reactor) in Scrapy's project settings: This reactor is `twisted`'s implementation of the `asyncio` event loop, enabling compatibility between the two libraries.
- Install [`nest_asyncio`](https://pypi.org/project/nest-asyncio/): The `nest_asyncio` package allows the asyncio event loop to run within an already running loop, which is essential for integration with the Apify SDK.
- `apply_apify_settings` - Applies Apify-specific components to Scrapy settings.
- `to_apify_request` and `to_scrapy_request` - Convert between Apify and Scrapy request objects.
- `initialize_logging` - Configures logging for the Actor environment.
- `run_scrapy_actor` - Bridges asyncio and Twisted event loops.
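
To make the effect of `apply_apify_settings` more concrete, here is a rough sketch of overlaying the components listed above onto a plain settings dictionary. This is not the SDK's implementation: `overlay_apify_components` is a hypothetical function, and the priority numbers and the Apify component paths used as keys are assumptions for illustration. `SCHEDULER`, `ITEM_PIPELINES`, and `DOWNLOADER_MIDDLEWARES` are standard Scrapy setting names, and setting a middleware to `None` is Scrapy's way of disabling it.

```python
from copy import deepcopy
from typing import Any, Dict


def overlay_apify_components(project_settings: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical overlay of Apify components on top of project settings."""
    settings = deepcopy(project_settings)

    # Replace the default scheduler with the request-queue-backed one.
    settings['SCHEDULER'] = 'apify.scrapy.ApifyScheduler'

    # Push every scraped item to the dataset (priority value is illustrative).
    pipelines = settings.setdefault('ITEM_PIPELINES', {})
    pipelines['apify.scrapy.ActorDatasetPushPipeline'] = 1000

    # Disable Scrapy's default proxy middleware and add the Apify one
    # (priority value is illustrative).
    middlewares = settings.setdefault('DOWNLOADER_MIDDLEWARES', {})
    middlewares['scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware'] = None
    middlewares['apify.scrapy.ApifyHttpProxyMiddleware'] = 750

    return settings


project = {'BOT_NAME': 'titlebot', 'ITEM_PIPELINES': {'src.pipelines.Clean': 100}}
merged = overlay_apify_components(project)
```

The project settings are copied rather than mutated, so the original dictionary stays available for comparison.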

By making these adjustments, you can ensure collaboration between `twisted`-based Scrapy and the `asyncio`-based Apify SDK.
## Create a new Apify-Scrapy project

### Other components
The simplest way to start using Scrapy in Apify Actors is to use the [Scrapy Actor template](https://apify.com/templates/python-scrapy). The template provides a pre-configured project structure and setup that includes all necessary components to run Scrapy spiders as Actors and store their output in Apify datasets. If you prefer manual setup, refer to the example Actor section below for configuration details.

We also prepared other Scrapy components to work with Apify SDK, they are available in the [`apify/scrapy`](https://github.com/apify/apify-sdk-python/tree/master/src/apify/scrapy) sub-package. These components include:
## Wrapping an existing Scrapy project

- `ApifyScheduler`: A Scrapy scheduler that uses the Apify Request Queue to manage requests.
- `ApifyHttpProxyMiddleware`: A Scrapy middleware for working with Apify proxies.
- `ActorDatasetPushPipeline`: A Scrapy item pipeline that pushes scraped items into the Apify dataset.
The Apify CLI supports converting an existing Scrapy project into an Apify Actor with a single command. The CLI expects the project to follow the standard Scrapy layout (including a `scrapy.cfg` file in the project root). During the wrapping process, the CLI:

The module contains other helper functions, like `apply_apify_settings` for applying these components to Scrapy settings, and `to_apify_request` and `to_scrapy_request` for converting between Apify and Scrapy request objects.
- Creates the necessary files and directories for an Apify Actor.
- Installs the Apify SDK and required dependencies.
- Updates Scrapy settings to include Apify-specific components.

For further details, see the [Scrapy migration guide](https://docs.apify.com/cli/docs/integrating-scrapy).

## Example Actor

Here is an example of a Scrapy Actor that scrapes the titles of web pages and enqueues all links found on each page. This example is identical to the one provided in the Apify Actor templates.
The following example demonstrates a Scrapy Actor that scrapes page titles and enqueues links found on each page. This example aligns with the structure provided in the Apify Actor templates.

<Tabs>
<TabItem value="__main__.py" label="__main__.py">
<CodeBlock className="language-python">
{UnderscoreMainExample}
</CodeBlock>
</TabItem>
<TabItem value="main.py" label="main.py" default>
<TabItem value="main.py" label="main.py">
<CodeBlock className="language-python">
{MainExample}
</CodeBlock>
</TabItem>
<TabItem value="items.py" label="items.py" default>
<TabItem value="settings.py" label="settings.py">
<CodeBlock className="language-python">
{ItemsExample}
{SettingsExample}
</CodeBlock>
</TabItem>
<TabItem value="settings.py" label="settings.py" default>
<TabItem value="items.py" label="items.py">
<CodeBlock className="language-python">
{SettingsExample}
{ItemsExample}
</CodeBlock>
</TabItem>
<TabItem value="spiders/title.py" label="spiders/title.py" default>
<TabItem value="spiders/title.py" label="spiders/title.py">
<CodeBlock className="language-python">
{TitleSpiderExample}
{SpidersExample}
</CodeBlock>
</TabItem>
</Tabs>
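
The link handling in the spider's `parse` method can be sketched in isolation. The snippet below resolves each `href` against the page URL and keeps only HTTP(S) links, mirroring the `urljoin` and `startswith` calls in the spider. `resolve_links` and the sample URLs are hypothetical and exist only for illustration.

```python
from typing import List
from urllib.parse import urljoin


def resolve_links(page_url: str, hrefs: List[str]) -> List[str]:
    """Resolve hrefs against the page URL and keep only http(s) links."""
    resolved = []
    for href in hrefs:
        link_url = urljoin(page_url, href)
        if link_url.startswith(('http://', 'https://')):
            resolved.append(link_url)
    return resolved


links = resolve_links(
    'https://apify.com/store',
    ['/pricing', 'about', 'mailto:hello@example.com', 'https://docs.apify.com/'],
)
print(links)
# ['https://apify.com/pricing', 'https://apify.com/about', 'https://docs.apify.com/']
```

Filtering on the scheme drops links such as `mailto:` that Scrapy cannot download.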

## Conclusion

In this guide you learned how to use Scrapy in Apify Actors. You can now start building your own web scraping projects
using Scrapy, the Apify SDK and host them on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
In this guide, you learned how to use Scrapy in Apify Actors. You can now build your own web scraping projects with Scrapy and the Apify SDK and host them on the Apify platform. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Apify CLI: Integrating Scrapy projects](https://docs.apify.com/cli/docs/integrating-scrapy)
- [Apify: Run Scrapy spiders on Apify](https://apify.com/run-scrapy-in-cloud)
- [Apify templates: Python Actor Scrapy template](https://apify.com/templates/python-scrapy)
- [Apify store: Scrapy Books Example Actor](https://apify.com/vdusek/scrapy-books-example)
- [Scrapy: Official documentation](https://docs.scrapy.org/)
21 changes: 21 additions & 0 deletions docs/02_guides/code/scrapy_project/src/__main__.py
@@ -0,0 +1,21 @@
from __future__ import annotations

from twisted.internet import asyncioreactor

# Install Twisted's asyncio reactor before importing any other Twisted or Scrapy components.
asyncioreactor.install() # type: ignore[no-untyped-call]

import os

from apify.scrapy import initialize_logging, run_scrapy_actor

# Import your main Actor coroutine here.
from .main import main

# Ensure the location of the Scrapy settings module is defined.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'


if __name__ == '__main__':
    initialize_logging()
    run_scrapy_actor(main())
10 changes: 10 additions & 0 deletions docs/02_guides/code/scrapy_project/src/items.py
@@ -0,0 +1,10 @@
from __future__ import annotations

from scrapy import Field, Item


class TitleItem(Item):
    """Represents a title item scraped from a web page."""

    url = Field()
    title = Field()
32 changes: 32 additions & 0 deletions docs/02_guides/code/scrapy_project/src/main.py
@@ -0,0 +1,32 @@
from __future__ import annotations

from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future

from apify import Actor
from apify.scrapy import apply_apify_settings

# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        # Retrieve and process Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = [url['url'] for url in actor_input.get('startUrls', [])]
        allowed_domains = actor_input.get('allowedDomains')
        proxy_config = actor_input.get('proxyConfiguration')

        # Apply Apify settings, which will override the Scrapy project settings.
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Create CrawlerRunner and execute the Scrapy spider.
        crawler_runner = CrawlerRunner(settings)
        crawl_deferred = crawler_runner.crawl(
            Spider,
            start_urls=start_urls,
            allowed_domains=allowed_domains,
        )
        await deferred_to_future(crawl_deferred)
8 changes: 8 additions & 0 deletions docs/02_guides/code/scrapy_project/src/settings.py
@@ -0,0 +1,8 @@
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
TELNETCONSOLE_ENABLED = False
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
1 change: 1 addition & 0 deletions docs/02_guides/code/scrapy_project/src/spiders/__init__.py
@@ -0,0 +1 @@
from .title import TitleSpider
Original file line number Diff line number Diff line change
@@ -1,8 +1,6 @@
# ruff: noqa: TID252, RUF012

from __future__ import annotations

from typing import TYPE_CHECKING
from typing import TYPE_CHECKING, Any
from urllib.parse import urljoin

from scrapy import Request, Spider
@@ -16,28 +14,44 @@


class TitleSpider(Spider):
    """Scrapes title pages and enqueues all links found on the page."""

    name = 'title_spider'
    """A spider that scrapes web pages to extract titles and discover new links.

    # The `start_urls` specified in this class will be merged with the `start_urls` value from your Actor input
    # when the project is executed using Apify.
    start_urls = ['https://apify.com/']
    This spider retrieves the content of the <title> element from each page and queues
    any valid hyperlinks for further crawling.
    """

    # Scrape only the pages within the Apify domain.
    allowed_domains = ['apify.com']
    name = 'title_spider'

    # Limit the number of pages to scrape.
    custom_settings = {'CLOSESPIDER_PAGECOUNT': 10}

    def __init__(
        self,
        start_urls: list[str],
        allowed_domains: list[str],
        *args: Any,
        **kwargs: Any,
    ) -> None:
        """A default constructor.

        Args:
            start_urls: URLs to start the scraping from.
            allowed_domains: Domains that the scraper is allowed to crawl.
            *args: Additional positional arguments.
            **kwargs: Additional keyword arguments.
        """
        super().__init__(*args, **kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains

    def parse(self, response: Response) -> Generator[TitleItem | Request, None, None]:
        """Parse the web page response.

        Args:
            response: The web page response.

        Yields:
            Yields scraped TitleItem and Requests for links.
            Yields scraped `TitleItem` and new `Request` objects for links.
        """
        self.logger.info('TitleSpider is parsing %s...', response)

@@ -46,7 +60,7 @@ def parse(self, response: Response) -> Generator[TitleItem | Request, None, None
        title = response.css('title::text').extract_first()
        yield TitleItem(url=url, title=title)

        # Extract all links from the page, create Requests out of them, and yield them
        # Extract all links from the page, create `Request` objects out of them, and yield them.
        for link_href in response.css('a::attr("href")'):
            link_url = urljoin(response.url, link_href.get())
            if link_url.startswith(('http://', 'https://')):