scrapegraph-py 1.46.0


pip install scrapegraph-py

  Latest version

Released: Jan 26, 2026


Meta
Author: Marco Vinciguerra, Lorenzo Padoan
Requires Python: >=3.10, <4.0

Classifiers

Intended Audience
  • Developers

Operating System
  • OS Independent

Programming Language
  • Python :: 3

Topic
  • Software Development :: Libraries :: Python Modules

๐ŸŒ ScrapeGraph Python SDK



Official Python SDK for the ScrapeGraph API - Smart web scraping powered by AI.

📦 Installation

Basic Installation

pip install scrapegraph-py

This installs the core SDK with minimal dependencies; it is fully functional without any optional extras.

Optional Dependencies

For specific use cases, you can install optional extras:

HTML Validation (required when using website_html parameter):

pip install scrapegraph-py[html]

Langchain Integration (for using with Langchain/Langgraph):

pip install scrapegraph-py[langchain]

All Optional Dependencies:

pip install scrapegraph-py[html,langchain]
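
Because the extras only matter for specific features, it can help to check at runtime whether an extra's packages are importable before using the corresponding parameter. A minimal standard-library sketch; the mapping of extras to module names (e.g. the [html] extra providing beautifulsoup4, importable as bs4) is an assumption:

```python
from importlib.util import find_spec

def extra_available(*modules: str) -> bool:
    """Return True if every module backing an optional extra is importable."""
    return all(find_spec(m) is not None for m in modules)

# Assumed module names for the extras described above.
HTML_EXTRA_OK = extra_available("bs4")
LANGCHAIN_EXTRA_OK = extra_available("langchain_core")
```

This lets you fail fast with a clear message instead of hitting an ImportError deep inside a request.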

🚀 Features

  • 🤖 AI-powered web scraping and search
  • 🕷️ Smart crawling with both AI extraction and markdown conversion modes
  • 💰 Cost-effective markdown conversion (80% savings vs AI mode)
  • 🔄 Both sync and async clients
  • 📊 Structured output with Pydantic schemas
  • 🔍 Detailed logging
  • ⚡ Automatic retries
  • 🔐 Secure authentication

🎯 Quick Start

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

Note: You can set the SGAI_API_KEY environment variable and initialize the client without arguments: client = Client()

📚 Available Endpoints

🤖 SmartScraper

Extract structured data from any webpage or HTML content using AI.

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Using a URL
response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the main heading and description"
)

# Or using HTML content
# Note: Using website_html requires the [html] extra: pip install scrapegraph-py[html]
html_content = """
<html>
    <body>
        <h1>Company Name</h1>
        <p>We are a technology company focused on AI solutions.</p>
    </body>
</html>
"""

response = client.smartscraper(
    website_html=html_content,
    user_prompt="Extract the company description"
)

print(response)

Output Schema (Optional)

from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

class WebsiteData(BaseModel):
    title: str = Field(description="The page title")
    description: str = Field(description="The meta description")

response = client.smartscraper(
    website_url="https://example.com",
    user_prompt="Extract the title and description",
    output_schema=WebsiteData
)

🍪 Cookies Support

Use cookies for authentication and session management:

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define cookies for authentication
cookies = {
    "session_id": "abc123def456",
    "auth_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9...",
    "user_preferences": "dark_mode,usd"
}

response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user profile information",
    cookies=cookies
)
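
If you already have the cookies as a raw Cookie header (for example, copied from browser dev tools), the standard library can convert it into the {name: value} dict shape shown above. This helper is illustrative, not part of the SDK:

```python
from http.cookies import SimpleCookie

def cookie_header_to_dict(header: str) -> dict[str, str]:
    """Parse a raw 'Cookie:' header value into the {name: value} dict
    accepted by the cookies parameter (illustrative helper)."""
    jar = SimpleCookie()
    jar.load(header)
    return {name: morsel.value for name, morsel in jar.items()}

cookies = cookie_header_to_dict("session_id=abc123def456; user_preferences=dark_mode")
```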

Common Use Cases:

  • E-commerce sites: User authentication, shopping cart persistence
  • Social media: Session management, user preferences
  • Banking/Financial: Secure authentication, transaction history
  • News sites: User preferences, subscription content
  • API endpoints: Authentication tokens, API keys

🔄 Advanced Features

Infinite Scrolling:

response = client.smartscraper(
    website_url="https://example.com/feed",
    user_prompt="Extract all posts from the feed",
    cookies=cookies,
    number_of_scrolls=10  # Scroll 10 times to load more content
)

Pagination:

response = client.smartscraper(
    website_url="https://example.com/products",
    user_prompt="Extract all product information",
    cookies=cookies,
    total_pages=5  # Scrape 5 pages
)

Combined with Cookies:

response = client.smartscraper(
    website_url="https://example.com/dashboard",
    user_prompt="Extract user data from all pages",
    cookies=cookies,
    number_of_scrolls=5,
    total_pages=3
)

🔍 SearchScraper

Perform AI-powered web searches with structured results and reference URLs.

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?"
)

print(f"Answer: {response['result']}")
print(f"Sources: {response['reference_urls']}")

Output Schema (Optional)

from pydantic import BaseModel, Field
from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

class PythonVersionInfo(BaseModel):
    version: str = Field(description="The latest Python version number")
    release_date: str = Field(description="When this version was released")
    major_features: list[str] = Field(description="List of main features")

response = client.searchscraper(
    user_prompt="What is the latest version of Python and its main features?",
    output_schema=PythonVersionInfo
)

📝 Markdownify

Convert any webpage into clean, formatted markdown.

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.markdownify(
    website_url="https://example.com"
)

print(response)

🕷️ Crawler

Intelligently crawl and extract data from multiple pages with support for both AI extraction and markdown conversion modes.

AI Extraction Mode (Default)

Extract structured data from multiple pages using AI:

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

# Define the data schema for extraction
schema = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "founders": {
            "type": "array",
            "items": {"type": "string"}
        },
        "description": {"type": "string"}
    }
}

response = client.crawl(
    url="https://scrapegraphai.com",
    prompt="extract the company information and founders",
    data_schema=schema,
    depth=2,
    max_pages=5,
    same_domain_only=True
)

# Poll for results (crawl is asynchronous)
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)
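
Because crawl jobs run server-side, the single get_crawl call above may return before the job has finished; polling in a loop with a delay is the usual pattern. A sketch, assuming the payload exposes a status field with in-progress values like "pending" and "processing" (the exact names may differ):

```python
import time

def wait_for_crawl(client, crawl_id: str, timeout: float = 120.0, interval: float = 5.0):
    """Poll get_crawl until the job leaves the in-progress states or the
    timeout elapses. The status values checked here are assumptions."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = client.get_crawl(crawl_id)
        if result.get("status") not in ("pending", "processing"):
            return result
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} did not finish within {timeout}s")
```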

Markdown Conversion Mode (Cost-Effective)

Convert pages to clean markdown without AI processing (80% cheaper):

from scrapegraph_py import Client

client = Client(api_key="your-api-key-here")

response = client.crawl(
    url="https://scrapegraphai.com",
    extraction_mode=False,  # Markdown conversion mode
    depth=2,
    max_pages=5,
    same_domain_only=True,
    sitemap=True  # Use sitemap for better page discovery
)

# Poll for results
crawl_id = response.get("crawl_id")
result = client.get_crawl(crawl_id)

# Access markdown content
for page in result["result"]["pages"]:
    print(f"URL: {page['url']}")
    print(f"Markdown: {page['markdown']}")
    print(f"Metadata: {page['metadata']}")

🔧 Crawl Parameters

  • url (required): Starting URL for the crawl
  • extraction_mode (default: True):
    • True = AI extraction mode (requires prompt and data_schema)
    • False = Markdown conversion mode (no AI, 80% cheaper)
  • prompt (required for AI mode): AI prompt to guide data extraction
  • data_schema (required for AI mode): JSON schema defining extracted data structure
  • depth (default: 2): Maximum crawl depth (1-10)
  • max_pages (default: 2): Maximum pages to crawl (1-100)
  • same_domain_only (default: True): Only crawl pages from the same domain
  • sitemap (default: False): Use sitemap.xml for better page discovery and more comprehensive crawling
  • cache_website (default: True): Cache website content
  • batch_size (optional): Batch size for processing pages (1-10)

Cost Comparison:

  • AI Extraction Mode: ~10 credits per page
  • Markdown Conversion Mode: ~2 credits per page (80% savings!)
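
The per-page figures above are approximate, but they make the trade-off easy to estimate before starting a crawl. A back-of-the-envelope helper using those numbers:

```python
# Approximate per-page credit costs from the comparison above.
AI_CREDITS_PER_PAGE = 10
MARKDOWN_CREDITS_PER_PAGE = 2

def estimated_credits(pages: int, extraction_mode: bool = True) -> int:
    """Rough credit estimate for a crawl of the given size."""
    rate = AI_CREDITS_PER_PAGE if extraction_mode else MARKDOWN_CREDITS_PER_PAGE
    return pages * rate
```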

Sitemap Benefits:

  • Better page discovery using sitemap.xml
  • More comprehensive website coverage
  • Efficient crawling of structured websites
  • Perfect for e-commerce, news sites, and content-heavy websites

⚡ Async Support

All endpoints support async operations:

import asyncio
from scrapegraph_py import AsyncClient

async def main():
    async with AsyncClient() as client:
        response = await client.smartscraper(
            website_url="https://example.com",
            user_prompt="Extract the main content"
        )
        print(response)

asyncio.run(main())
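
The async client also makes it straightforward to fan out several requests concurrently with asyncio.gather. A sketch with a stand-in coroutine in place of the real API call:

```python
import asyncio

async def scrape(url: str) -> dict:
    # Stand-in for client.smartscraper(...); replace with the real call.
    await asyncio.sleep(0)
    return {"url": url, "result": "..."}

async def scrape_many(urls: list[str]) -> list[dict]:
    """Run one request per URL concurrently and collect the responses in order."""
    return await asyncio.gather(*(scrape(u) for u in urls))

results = asyncio.run(scrape_many(["https://example.com", "https://example.org"]))
```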

📖 Documentation

For detailed documentation, visit docs.scrapegraphai.com

🛠️ Development

For information about setting up the development environment and contributing to the project, see our Contributing Guide.

💬 Support & Feedback

  • 📧 Email: support@scrapegraphai.com
  • 💻 GitHub Issues: Create an issue
  • 🌟 Feature Requests: Request a feature
  • ⭐ API Feedback: You can also submit feedback programmatically using the feedback endpoint:
    from scrapegraph_py import Client
    
    client = Client(api_key="your-api-key-here")
    
    client.submit_feedback(
        request_id="your-request-id",
        rating=5,
        feedback_text="Great results!"
    )
    

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Made with ❤️ by ScrapeGraph AI


Dependencies

  • aiohttp (>=3.10)
  • beautifulsoup4 (>=4.12.3)
  • pydantic (>=2.10.2)
  • python-dotenv (>=1.0.1)
  • requests (>=2.32.3)
  • toonify (>=1.0.0)