
ScrapeServ: a self-hosted API that generates website screenshots from a URL

ScrapeServ: Simple URL to screenshots server

You run the API as a web server on your machine, you send it a URL, and you get back the website data as a file plus screenshots of the site. Simple as.

This project was made to support Abbey, an AI platform. Its author is Gordon Kamer. Please leave a star if you like the project!

Some highlights:

  * Higher quality scraping than many alternatives, at the cost of being more resource intensive.
  * Websites are scraped using Playwright, which launches a Firefox browser context for each job.

Setup

You should have Docker and docker compose installed.

Easy (using pre-built image)

A pre-built image called usaiinc/scraper is available. You can use it with docker compose by creating a file called docker-compose.yml and putting the following inside it:

services:
  scraper:
    image: usaiinc/scraper:latest
    ports:
      - 5006:5006
    # volumes:
    #   - ./.env:/app/.env

Then start it by running docker compose up in the same directory as your file. See the Usage section below on how to interact with the server!

Customizable (build from source)

Another option is to clone the repo and build the image yourself, which is also quite easy! This lets you modify server settings like memory usage, the maximum length of the queue, and other default configurations.

  1. Clone this repo
  2. Run docker compose up (a docker-compose.yml file is provided for your use)

…and the service will be available at http://localhost:5006. See the Usage section below for details on how to interact with it.
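Once the container is up, a quick way to confirm the server is reachable is to hit the root path, which (per the API Reference below) simply returns status 200 and some text. A minimal Python check, assuming the default port 5006:

import requests

# The root path only confirms the server is running (see the API Reference below).
resp = requests.get("http://localhost:5006/")
print(resp.status_code, resp.text)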

Usage

From Your App

Look in the client folder for a full reference client implementation in Python. Just send an HTTP request and process the response according to the API reference below.
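As a rough sketch of the request itself (this is not the bundled reference client), a scrape is just a JSON POST with the target URL, assuming the server is running locally on port 5006:

import requests

# Ask the server to scrape a page; the JSON body carries the target URL.
resp = requests.post(
    "http://localhost:5006/scrape",
    json={"url": "https://goodreason.ai"},
    timeout=120,  # scraping can take a while on heavy pages
)
resp.raise_for_status()
print(resp.headers["Content-Type"])  # multipart/mixed, described in the API Reference

A fuller sketch that also saves the returned files appears at the end of the API Reference section below.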

From the Command Line on Mac/Linux

You can use cURL and ripmime to interact with the API from the command line. Ripmime processes the multipart/mixed HTTP response and puts the downloaded files into a folder. Install ripmime using brew install ripmime on Mac or apt-get install ripmime on Linux. Then, paste this into your terminal:

curl -i -s -X POST "http://localhost:5006/scrape" \
    -H "Content-Type: application/json" \
    -d '{"url": "https://goodreason.ai"}' \
    | ripmime -i - -d outfolder --formdata --no-nameless

…replacing the URL and output folder name appropriately.

API Reference

Path /: The root path returns status 200, plus some text to let you know the server’s running if you visit the server in a web browser.

Path /scrape: Accepts a JSON formatted POST request and returns a multipart/mixed response including the resource file, screenshots, and request header information.

JSON formatted arguments:

You can provide the desired output image format as an Accept header MIME type. If no Accept header is provided (or if the Accept header is */* or image/*), the screenshots are returned by default as JPEGs. The following values are supported:

Every response from /scrape will be either:

Refer to the client for a full reference implementation, which shows you how to call the API and save the files it sends back. You can also save the returned files from the command line.
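As one hedged sketch of that workflow (the output folder and fallback part names are illustrative, not prescribed by the API), the standard library's email parser can split the multipart/mixed body into its parts; the Accept header here just makes the default JPEG format explicit:

import os
import requests
from email.parser import BytesParser
from email.policy import default

resp = requests.post(
    "http://localhost:5006/scrape",
    json={"url": "https://goodreason.ai"},
    headers={"Accept": "image/jpeg"},  # JPEG is already the default screenshot format
    timeout=120,
)
resp.raise_for_status()

# Re-attach the Content-Type header so the stdlib parser can walk the multipart body.
raw = b"Content-Type: " + resp.headers["Content-Type"].encode() + b"\r\n\r\n" + resp.content
msg = BytesParser(policy=default).parsebytes(raw)

os.makedirs("outfolder", exist_ok=True)  # same folder name as the ripmime example above
for i, part in enumerate(msg.iter_parts()):
    name = part.get_filename() or f"part-{i}"
    data = part.get_payload(decode=True) or b""
    with open(os.path.join("outfolder", name), "wb") as f:
        f.write(data)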

Security Considerations

Navigating to untrusted websites is a serious security issue. Risks are somewhat mitigated in the following ways:

You may take additional precautions depending on your needs, like:

If you’d like to make sure that this API is up to your security standards, please examine the code and open issues! It’s not a big repo.

API Keys

If your scrape server is publicly accessible over the internet, you should set an API key using a .env file inside the /scraper folder (same level as app.py).

You can set as many API keys as you’d like; allowed API keys are those that start with SCRAPER_API_KEY. For example, here is a .env file that has three available keys:

SCRAPER_API_KEY=should-be-secret
SCRAPER_API_KEY_OTHER=can-also-be-used
SCRAPER_API_KEY_3=works-too

API keys are sent to the service using the Authorization Bearer scheme.
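For example, using the first key from the .env example above (the value is just a placeholder), a Python caller would attach it like this:

import requests

headers = {"Authorization": "Bearer should-be-secret"}  # placeholder key from the .env example
resp = requests.post(
    "http://localhost:5006/scrape",
    json={"url": "https://goodreason.ai"},
    headers=headers,
    timeout=120,
)
resp.raise_for_status()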

Other Configuration

You can control memory limits and other variables at the top of scraper/worker.py (provided you’re building from source). Here are the defaults:

MEM_LIMIT_MB = 4_000  # 4 GB memory threshold for child scraping process
MAX_CONCURRENT_TASKS = 3
DEFAULT_SCREENSHOTS = 5  # The max number of screenshots if the user doesn't set a max
MAX_SCREENSHOTS = 10  # User cannot set max_screenshots above this value
DEFAULT_WAIT = 1000  # Value for wait if a user doesn't set one (ms)
MAX_WAIT = 5000  # A user cannot ask for more than this long of a wait (ms)
SCREENSHOT_QUALITY = 85  # Argument to PIL image save
DEFAULT_BROWSER_DIM = [1280, 2000]  # Width x height in pixels, if a user doesn't set browser dimensions
MAX_BROWSER_DIM = [2400, 4000]  # Maximum width and height a user can set
MIN_BROWSER_DIM = [100, 100]  # Minimum width and height a user can set
USER_AGENT = "Mozilla/5.0 (compatible; Abbey/1.0; +https://github.com/US-Artificial-Intelligence/scraper)"
