feat: Unify Apify and Scrapy to use single event loop & remove nest-asyncio
#390
Merged
16 commits
8d40e40 fix: fix Scrapy integration (vdusek)
8cc38b4 comment out docs check CI step (vdusek)
c1eacba Scrapy integration test is working (vdusek)
bd50787 rm scrapy condition for sys exit (vdusek)
0c0fbcc Polishment (vdusek)
357cc27 revert non-intentionally changes in proxy conf (vdusek)
e7f0aeb Update the Scrapy guide (vdusek)
1321f82 add async thread helper class (vdusek)
f1a7fd7 address feedback (vdusek)
f99a9eb polishment (vdusek)
7f2242d do not run sys.exit for scrapy (vdusek)
57086f5 Fix RQ stuck in infinite loop due to ID mismatch (vdusek)
8646c06 Address Honza's feedback (vdusek)
4a9665a utilize SCRAPY_SETTINGS_MODULE env var (vdusek)
f1e8bd5 allow ajax crawl middleware (vdusek)
413a56a mention SCRAPY_SETTINGS_MODULE env var (vdusek)
File renamed without changes.
@@ -0,0 +1,21 @@
from __future__ import annotations

from twisted.internet import asyncioreactor

# Install Twisted's asyncio reactor before importing any other Twisted or Scrapy components.
asyncioreactor.install()  # type: ignore[no-untyped-call]

import os

from apify.scrapy import initialize_logging, run_scrapy_actor

# Import your main Actor coroutine here.
from .main import main

# Ensure the location to the Scrapy settings module is defined.
os.environ['SCRAPY_SETTINGS_MODULE'] = 'src.settings'


if __name__ == '__main__':
    initialize_logging()
    run_scrapy_actor(main())
@@ -0,0 +1,10 @@
from __future__ import annotations

from scrapy import Field, Item


class TitleItem(Item):
    """Represents a title item scraped from a web page."""

    url = Field()
    title = Field()
@@ -0,0 +1,32 @@
from __future__ import annotations

from scrapy.crawler import CrawlerRunner
from scrapy.utils.defer import deferred_to_future

from apify import Actor
from apify.scrapy import apply_apify_settings

# Import your Scrapy spider here.
from .spiders import TitleSpider as Spider


async def main() -> None:
    """Apify Actor main coroutine for executing the Scrapy spider."""
    async with Actor:
        # Retrieve and process Actor input.
        actor_input = await Actor.get_input() or {}
        start_urls = [url['url'] for url in actor_input.get('startUrls', [])]
        allowed_domains = actor_input.get('allowedDomains')
        proxy_config = actor_input.get('proxyConfiguration')

        # Apply Apify settings, which will override the Scrapy project settings.
        settings = apply_apify_settings(proxy_config=proxy_config)

        # Create CrawlerRunner and execute the Scrapy spider.
        crawler_runner = CrawlerRunner(settings)
        crawl_deferred = crawler_runner.crawl(
            Spider,
            start_urls=start_urls,
            allowed_domains=allowed_domains,
        )
        await deferred_to_future(crawl_deferred)
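The input handling in `main()` can be tried in isolation. The sketch below is only an illustration: `extract_crawl_args` is a hypothetical helper that is not part of this PR, and the sample input shape (a `startUrls` list of objects with a `url` key) is an assumption inferred from the key names used above.

```python
from __future__ import annotations

from typing import Any


def extract_crawl_args(actor_input: dict[str, Any] | None) -> dict[str, Any]:
    """Hypothetical helper mirroring the input handling in main()."""
    # Fall back to an empty dict when the Actor was started without input.
    actor_input = actor_input or {}
    # Each entry of 'startUrls' is assumed to be an object with a 'url' key.
    start_urls = [entry['url'] for entry in actor_input.get('startUrls', [])]
    return {
        'start_urls': start_urls,
        # These stay None when absent from the input.
        'allowed_domains': actor_input.get('allowedDomains'),
        'proxy_config': actor_input.get('proxyConfiguration'),
    }


# Assumed example input, for illustration only.
sample_input = {
    'startUrls': [{'url': 'https://example.com'}, {'url': 'https://example.org'}],
    'allowedDomains': ['example.com', 'example.org'],
}
crawl_args = extract_crawl_args(sample_input)
```

With this sample, `crawl_args['start_urls']` holds the two plain URL strings and `crawl_args['proxy_config']` is `None`, since the sample omits `proxyConfiguration`.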
File renamed without changes.
@@ -0,0 +1,8 @@
BOT_NAME = 'titlebot'
DEPTH_LIMIT = 1
LOG_LEVEL = 'INFO'
NEWSPIDER_MODULE = 'src.spiders'
ROBOTSTXT_OBEY = True
SPIDER_MODULES = ['src.spiders']
TELNETCONSOLE_ENABLED = False
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
@@ -0,0 +1 @@
from .title import TitleSpider
File renamed without changes.
It would be awesome if we could have a nice diagram here 🙂 In the ideal case, it would show how the asyncio event loop and twisted reactor interact. We can definitely postpone that though.
The point is that asyncio and Twisted share the same event loop. I'm not familiar with the details of exactly how asyncio and Twisted interact. Other than that, we have a separate thread with its own asyncio event loop for the Scheduler's synchronous calls.
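The "separate thread with an asyncio event loop" idea can be sketched with the standard library alone. This is a minimal illustration and not the helper class added in this PR: `AsyncThread`, `fetch_next_request`, and `next_request_sync` are hypothetical names, and the real Scheduler integration may differ.

```python
import asyncio
import threading
from typing import Any, Coroutine


class AsyncThread:
    """Hypothetical helper: owns an asyncio event loop running in a background thread."""

    def __init__(self) -> None:
        self._loop = asyncio.new_event_loop()
        # The loop runs until close() asks it to stop.
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def run_coro(self, coro: Coroutine[Any, Any, Any], timeout: float = 60.0) -> Any:
        """Submit a coroutine to the background loop and block until it finishes."""
        future = asyncio.run_coroutine_threadsafe(coro, self._loop)
        return future.result(timeout=timeout)

    def close(self) -> None:
        """Stop the background loop and wait for its thread to exit."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def fetch_next_request() -> str:
    """Stand-in for an async storage call (assumed for illustration)."""
    await asyncio.sleep(0.01)
    return 'https://example.com'


def next_request_sync(async_thread: AsyncThread) -> str:
    """Synchronous entry point, similar in spirit to a scheduler method."""
    return async_thread.run_coro(fetch_next_request())
```

A caller would create one `AsyncThread`, call `next_request_sync(async_thread)` from synchronous code, and call `close()` on shutdown. The synchronous call blocks only the calling thread while the coroutine runs on the loop in the background thread.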
We gave it two weeks and made it work, but that doesn't mean we understand the sorcery enough to draw diagrams 😅 But I think it's something like...
(It's black on transparent SVG, so you probably won't see it well in dark mode)
I don't know if I chose the best diagram type for this; maybe https://swimlanes.io/ would be better.