archive-scraper

Scrape archive.ph.

Installation

pip install "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"

Or, if you are using uv:

uv add "git+ssh://git@github.com/Autonomy-Data-Unit/archive-scraper.git"

Usage (as library)

To archive a set of URLs:

from archive_scraper import archive_urls
urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archive_urls(urls)

Once all the archive processes have completed, the URLs can be scraped using:

from archive_scraper import scrape_urls

scrape_results = scrape_urls(urls)

The result of scrape_urls is a list with one element per input URL, each a tuple of the form (success, archived_url, content, error).
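
For example, a minimal sketch of iterating over the results (the interpretation of the fields beyond their names is an assumption):

for (success, archived_url, content, error), url in zip(scrape_results, urls):
    if success:
        print(f"{url} -> {archived_url} ({len(content)} characters)")
    else:
        print(f"{url} failed: {error}")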

You can also do both the archiving and scraping in one go using:

from archive_scraper import archive_and_scrape_urls
urls = [
    "https://adu.autonomy.work/posts/2024_08_28_heritage/",
    "https://pypi.org/project/beautifulsoup4/",
]
archiving_results, scrape_results = archive_and_scrape_urls(urls)
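
If you want to persist the scraped content from Python (mirroring the CLI's JSON output below), you can serialise the results yourself. A sketch, assuming the tuple fields are all JSON-serialisable:

import json

with open("out.json", "w") as f:
    json.dump(
        [
            # one record per input URL; assumes content and error are
            # strings (or None) rather than arbitrary objects
            {"url": url, "success": success, "archived_url": archived_url,
             "content": content, "error": error}
            for url, (success, archived_url, content, error) in zip(urls, scrape_results)
        ],
        f,
        indent=2,
    )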

Usage (CLI)

# Archiving
archive-scraper archive https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/

# Scraping and saving to a JSON file
archive-scraper scrape -o out.json https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/

# Scraping and saving to a diskcache
archive-scraper scrape -o out --output-as-diskcache https://adu.autonomy.work/posts/2024_08_28_heritage/ https://pypi.org/project/beautifulsoup4/
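
The diskcache output can be read back with the diskcache Python package. A sketch that simply inspects whatever was written (the key layout used by archive-scraper is an assumption, so this does not rely on it):

from diskcache import Cache

cache = Cache("out")  # the directory written by --output-as-diskcache
for key in cache.iterkeys():
    print(key, repr(cache[key])[:80])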

Using the scraping_sandbox

Put your list of URLs in .txt files in the scraping_sandbox folder, one URL per line. Then run the cells in scraping_sandbox.ipynb to archive and scrape them. The results will be saved to out.json.

The URLs will be scraped in the order of the filenames (alphabetically sorted) and, within each file, in the order they appear. Prefix the filenames with numbers (e.g. 01_, 02_, etc.) to control the scraping order.
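
For example, with two hypothetical files:

scraping_sandbox/01_news.txt     # scraped first
scraping_sandbox/02_reports.txt  # scraped second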

Development install instructions

Prerequisites

  • Install uv.
  • Install direnv to automatically load the project virtual environment when entering the repo folder.
    • Mac: brew install direnv
    • Linux: curl -sfL https://direnv.net/install.sh | bash

Setting up the environment

Run the following:

# In the root of the repo folder
uv sync --all-extras # Installs the virtual environment at './.venv'
direnv allow # Allows the automatic running of the script './.envrc'
nbl install-hooks # Installs git hooks that ensure that notebooks are added properly

You are now set up to develop the codebase.

Further instructions:

  • To export notebooks run nbl export.
  • To clean notebooks run nbl clean.
  • To see other available commands, run nbl on its own.
  • To add a new dependency, run uv add package-name. See the uv documentation for more details.
  • You need to git add all 'twinned' notebooks for the commit to be validated by the git hook. For example, if you add nbs/my-nb.ipynb, you must also add pts/my-nb.pct.py.
  • To render the documentation, run nbl render-docs. To preview it, run nbl preview-docs.
  • To upgrade all dependencies, run uv sync --upgrade --all-extras.
