pip install "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"Or, if you are using uv:
uv add "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"To archive a set of URLs:
```python
from archive_scraper import archive_urls

urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archive_urls(urls)
```

After all archive processes are done, the URLs can then be scraped using:
```python
from archive_scraper import scrape_urls

scrape_results = scrape_urls(urls)
```

The result of `scrape_urls` is a list with one element per URL in the input list, each of the form `(success, archived_url, content, error)`.
You can also do both the archiving and scraping in one go using:
```python
from archive_scraper import archive_and_scrape_urls

urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archiving_results, scrape_results = archive_and_scrape_urls(urls)
```

The same operations are available from the command line:

```bash
# Archiving
archive-scraper archive https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/
# Scraping and saving to a JSON file
archive-scraper scrape -o out.json https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/
# Scraping and saving to a diskcache
archive-scraper scrape -o out --output-as-diskcache https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/
```

Put your list of URLs in `.txt` files in the `scraping_sandbox` folder, one URL per line. Then run the cells in `scraping_sandbox.ipynb` to archive and scrape them. The results will be saved to `out.json`.
The URLs will be scraped in the order of the filenames (alphabetically sorted) and, within each file, in the order they appear. Prepend numbers to the filenames (e.g. `01_`, `02_`) to control the scraping order.
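To illustrate that ordering, here is a minimal sketch (not part of the package) of how the URL list could be assembled in plain Python:

```python
from pathlib import Path

# Read every .txt file in the sandbox folder in alphabetical order,
# collecting non-empty lines (URLs) in the order they appear in each file.
urls = []
for txt_file in sorted(Path("scraping_sandbox").glob("*.txt")):
    urls.extend(
        line.strip() for line in txt_file.read_text().splitlines() if line.strip()
    )
```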
- Install uv.
- Install direnv to automatically load the project virtual environment when entering it.
  - Mac: `brew install direnv`
  - Linux: `curl -sfL https://direnv.net/install.sh | bash`
Run the following:
```bash
# In the root of the repo folder
uv sync --all-extras  # Installs the virtual environment at './.venv'
direnv allow          # Allows the automatic running of the script './.envrc'
nbl install-hooks     # Installs git hooks that ensure notebooks are added properly
```

You are now set up to develop the codebase.
Further instructions:
- To export notebooks, run `nbl export`.
- To clean notebooks, run `nbl clean`.
- To see other available commands, run `nbl` on its own.
- To add a new dependency, run `uv add package-name`. See the uv documentation for more details.
- You need to `git add` all 'twinned' notebooks for the commit to be validated by the git hook. For example, if you add `nbs/my-nb.ipynb`, you must also add `pts/my-nb.pct.py`.
- To render the documentation, run `nbl render-docs`. To preview it, run `nbl preview-docs`.
- To upgrade all dependencies, run `uv sync --upgrade --all-extras`.