Commit 3571797

MikeMeliz (Aaron Bishop) authored and committed

Merge pull request MikeMeliz#15 from the-siegfried/14-implement-yara-keyword-search

14 implement yara keyword search

2 parents 22b0b0e + 32d7b7a

File tree

7 files changed: +349 −28 lines

.gitignore

Lines changed: 160 additions & 0 deletions
New file (160 additions, 0 deletions):

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
.idea/
```

README.md

Lines changed: 27 additions & 0 deletions
````diff
@@ -66,6 +66,7 @@ arg | Long | Description
 -e |--extract| Extract page's code to terminal or file. (Default: Terminal)
 -i |--input filename| Input file with URL(s) (separated by line)
 -o |--output [filename]| Output page(s) to file(s) (for one page)
+-y |--yara | Perform yara keyword search (0 = search entire html object. 1 = search only text).
 **Crawl**: | |
 -c |--crawl| Crawl website (Default output on /links.txt)
 -d |--cdepth| Set depth of crawl's travel (Default: 1)
@@ -98,6 +99,14 @@ $ python torcrawl.py -u http://www.github.com | grep 'google-analytics'
 <meta name="google-analytics" content="UA-*******-*">
 ```
 
+Extract to file and find only the line with google-analytics using yara:
+```shell
+$ python torcrawl.py -v -w -u https://github.com -e -y 0
+...
+```
+**_Note:_** update res/keyword.yar to search for other keywords.
+Use ```-y 0``` for raw html searching and ```-y 1``` for text search only.
+
 Extract a set of webpages (imported from file) to terminal:
 
 ```shell
@@ -156,6 +165,24 @@ $ python torcrawl.py -u http://www.github.com/ -c -e | grep '</html>'
 ...
 ```
 
+### As Both + Keyword Search:
+You can crawl a page, perform a keyword search and extract the webpages that match the findings into a folder with a single command:
+
+```shell
+$ python torcrawl.py -v -u http://www.github.com/ -c -d 2 -p 5 -e -y 0
+## TOR is ready!
+## URL: http://www.github.com/
+## Your IP: *.*.*.*
+## Crawler Started from http://www.github.com with step 1 and wait 5
+## Step 1 completed with: 11 results
+## File created on /script/path/FolderName/index.htm
+## File created on /script/path/FolderName/projects.html
+## ...
+```
+
+***Note:*** *Update res/keyword.yar to search for other keywords.
+Use ```-y 0``` for raw html searching and ```-y 1``` for text search only.*
+
 ## Demo:
 ![peek 2018-12-08 16-11](https://user-images.githubusercontent.com/9204902/49687660-f72f8280-fb0e-11e8-981e-1bbeeac398cc.gif)
````
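The yara matching code itself lives in files not shown in this chunk, so the following is only a minimal stand-in that illustrates the documented difference between `-y 0` (search the raw HTML) and `-y 1` (search visible text only). It uses plain substring matching in place of a compiled rule from `res/keyword.yar`; the names `KEYWORD` and `keyword_match` are hypothetical, not the project's API.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the keyword a res/keyword.yar rule would match.
KEYWORD = b"google-analytics"

class _TextExtractor(HTMLParser):
    """Collect only text nodes, mimicking a '-y 1' style search."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def keyword_match(html: bytes, mode: int) -> bool:
    """mode 0: search the raw HTML bytes; mode 1: search visible text only."""
    if mode == 0:
        return KEYWORD in html
    parser = _TextExtractor()
    parser.feed(html.decode("utf-8", errors="replace"))
    return KEYWORD.decode() in "".join(parser.chunks)

page = b'<meta name="google-analytics" content="UA-1"><p>hello</p>'
print(keyword_match(page, 0))  # True: the keyword appears in the raw markup
print(keyword_match(page, 1))  # False: it never appears as visible text
```

This mirrors why the README's grep example finds the `<meta>` tag only in mode 0: attribute values are part of the markup but not of the rendered text.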
161188

modules/crawler.py

Lines changed: 4 additions & 2 deletions
```diff
@@ -136,7 +136,8 @@ def crawl(self):
                 continue
 
             ver_link = self.canonical(link)
-            lst.add(ver_link)
+            if ver_link is not None:
+                lst.add(ver_link)
 
         # For each <area> tag.
         for link in soup.findAll('area'):
@@ -146,7 +147,8 @@ def crawl(self):
                 continue
 
             ver_link = self.canonical(link)
-            lst.add(ver_link)
+            if ver_link is not None:
+                lst.add(ver_link)
 
         # TODO: For images
         # TODO: For scripts
```
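The guard added in both hunks matters because `canonical()` can return `None` for links that cannot be resolved to a crawlable URL; without the check, `None` ends up in the link set and pollutes later crawling steps. A minimal sketch of the pattern (the standalone `canonical` function here is a simplified stand-in for the crawler's method, not its actual implementation):

```python
from urllib.parse import urljoin, urlparse

def canonical(base, href):
    """Resolve a link against the base URL; return None when it cannot
    be turned into a crawlable http(s) URL (simplified stand-in)."""
    if href is None:
        return None
    url = urljoin(base, href)
    return url if urlparse(url).scheme in ("http", "https") else None

links = set()
for href in ["/about", "mailto:x@y.z", None, "docs/intro"]:
    ver_link = canonical("http://www.github.com/", href)
    if ver_link is not None:   # the guard this commit adds
        links.add(ver_link)

print(sorted(links))
# ['http://www.github.com/about', 'http://www.github.com/docs/intro']
```

Without the `is not None` check, the `mailto:` and missing-href cases would insert `None` into the set.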
