Commit 7759e9d

Improve Ultralytics website download (#96)
Co-authored-by: UltralyticsAssistant <[email protected]>
1 parent cfe081e commit 7759e9d

1 file changed: 25 additions, 5 deletions

.github/workflows/links.yml

Lines changed: 25 additions & 5 deletions
@@ -64,17 +64,37 @@ jobs:
             './**/*.html'
 
       - name: Download Ultralytics Website
-        # WARNING: Do not remove deprecated language directories from --exclude-directories list
         if: matrix.branch == 'main'
         run: |
+          # Download sitemap.xml
+          wget -O sitemap.xml https://www.ultralytics.com/sitemap.xml
+
+          # Parse URLs using a combination of tr, sed, and grep
+          tr '\n' ' ' < sitemap.xml | \
+            sed 's/<loc>/\n<loc>/g' | \
+            grep -oP '(?<=<loc>).*?(?=</loc>)' | \
+            sed 's/^[[:space:]]*//;s/[[:space:]]*$//' > urls.txt
+
+          # Count total URLs to be downloaded
+          total_urls=$(wc -l < urls.txt)
+          echo "Total URLs to be downloaded: $total_urls"
+
+          # Download all URLs in parallel
           mkdir ultralytics_website
           wget -P ultralytics_website \
-            --recursive \
-            --no-parent \
             --adjust-extension \
             --reject "*.jpg*,*.jpeg*,*.png*,*.gif*,*.webp*,*.svg*,*.txt" \
-            --exclude-directories="/zh/,/ko/,/ja/,/ru/,/de/,/fr/,/es/,/pt/,/tr/,/vi/,/ar/,/it/,/nl/,/hi/" \
-            https://www.ultralytics.com/ || true
+            --input-file=urls.txt \
+            --no-clobber \
+            --no-parent \
+            --wait=0.001 \
+            --random-wait \
+            --tries=3 \
+            --no-verbose
+
+          # Count successfully downloaded files
+          downloaded_files=$(find ultralytics_website -type f | wc -l)
+          echo "Total pages downloaded: $downloaded_files"
 
 
       - name: Run Broken Link Checks on Ultralytics Website
         if: matrix.branch == 'main'
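
The sitemap-parsing pipeline added in this commit can be tried locally without touching the network. The sketch below runs the same `tr | sed | grep | sed` chain against a small hand-written sitemap. The sample file, the `example.com` URLs, and the `workdir` variable are assumptions made for illustration and are not part of the commit. It also assumes GNU grep (for `-P`) and GNU sed (for `\n` in the replacement).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample sitemap; the example.com URLs are placeholders.
workdir=$(mktemp -d)
cat > "$workdir/sitemap.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url>
    <loc>
      https://example.com/about
    </loc>
  </url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>
EOF

# Same pipeline as the workflow step:
# 1. tr joins the file into one line, so a <loc> split across lines still matches
# 2. sed starts a new line before every <loc>, giving one entry per line
# 3. grep -oP prints only the text between <loc> and </loc>
# 4. the final sed trims leading and trailing whitespace
tr '\n' ' ' < "$workdir/sitemap.xml" | \
  sed 's/<loc>/\n<loc>/g' | \
  grep -oP '(?<=<loc>).*?(?=</loc>)' | \
  sed 's/^[[:space:]]*//;s/[[:space:]]*$//' > "$workdir/urls.txt"

# Count and show the extracted URLs
total_urls=$(wc -l < "$workdir/urls.txt")
echo "Total URLs to be downloaded: $total_urls"
cat "$workdir/urls.txt"
```

The second entry in the sample spreads its `<loc>` element over three lines. A line-by-line `grep` would miss it, which is why the pipeline first flattens the file with `tr`.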
