archive-scraper

Scrape archive.ph.

Installation

pip install "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"

Or, if you are using uv:

uv add "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"

Usage (as library)

To archive a set of URLs:

from archive_scraper import archive_urls
urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archive_urls(urls)

Once all the archive processes have completed, the URLs can be scraped using:

from archive_scraper import scrape_urls

scrape_results = scrape_urls(urls)

The result of scrape_urls is a list with one element per input URL, each a tuple of the form (success, archived_url, content, error).
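
For example, a minimal sketch of iterating over the results (the interpretation of the fields beyond their names is an assumption):

for (success, archived_url, content, error), url in zip(scrape_results, urls):
    if success:
        print(f"{url} -> {archived_url} ({len(content)} characters)")
    else:
        print(f"{url} failed: {error}")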

You can also do both the archiving and scraping in one go using:

from archive_scraper import archive_and_scrape_urls
urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archiving_results, scrape_results = archive_and_scrape_urls(urls)
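
If you want to persist the scraped content from Python (mirroring the CLI's JSON output below), you can serialise the results yourself. A sketch, assuming the tuple fields are all JSON-serialisable:

import json

with open("out.json", "w") as f:
    json.dump(
        [
            # one record per input URL; assumes content and error are
            # strings (or None) rather than arbitrary objects
            {"url": url, "success": success, "archived_url": archived_url,
             "content": content, "error": error}
            for url, (success, archived_url, content, error) in zip(urls, scrape_results)
        ],
        f,
        indent=2,
    )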

Usage (CLI)

# Archiving
archive-scraper archive https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/

# Scraping and saving to a JSON file
archive-scraper scrape -o out.json https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/

# Scraping and saving to a diskcache
archive-scraper scrape -o out --output-as-diskcache https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/
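
The diskcache output can be read back with the diskcache Python package. A sketch that simply inspects whatever was written (the key layout used by archive-scraper is an assumption, so this does not rely on it):

from diskcache import Cache

cache = Cache("out")  # the directory written by --output-as-diskcache
for key in cache.iterkeys():
    print(key, repr(cache[key])[:80])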

Using the scraping_sandbox

Put your list of URLs in .txt files in the scraping_sandbox folder, one URL per line. Then run the cells in scraping_sandbox.ipynb to archive and scrape them. The results will be saved to out.json.

The URLs will be scraped in the order of the filenames (alphabetically sorted) and, within each file, in the order they appear. Prefix the filenames with numbers (e.g. 01_, 02_, etc.) to control the scraping order.
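
For example, with two hypothetical files:

scraping_sandbox/01_news.txt     # scraped first
scraping_sandbox/02_reports.txt  # scraped second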

Development install instructions

Prerequisites

  • Install uv.
  • Install direnv to automatically load the project virtual environment when entering the repo folder.
    • Mac: brew install direnv
    • Linux: curl -sfL https://direnv.net/install.sh | bash

Setting up the environment

Run the following:

# In the root of the repo folder
uv sync --all-extras # Installs the virtual environment at './.venv'
direnv allow # Allows the automatic running of the script './.envrc'
nbl install-hooks # Installs git hooks that ensure that notebooks are added properly

You are now set up to develop the codebase.

Further instructions:

  • To export notebooks run nbl export.
  • To clean notebooks run nbl clean.
  • To see other available commands, run nbl on its own.
  • To add a new dependency, run uv add package-name. See the uv documentation for more details.
  • You need to git add all 'twinned' notebooks for the commit to be validated by the git hook. For example, if you add nbs/my-nb.ipynb, you must also add pts/my-nb.pct.py.
  • To render the documentation, run nbl render-docs. To preview it, run nbl preview-docs.
  • To upgrade all dependencies, run uv sync --upgrade --all-extras.
