Megha Panicker
---
title: Autonomous Data Analyst
emoji: 💻
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Multi-Source Agentic RAG

A local AI agent with an intelligent Router that directs traffic between:

  • Neon PostgreSQL – structured/numerical data (sales, counts, aggregations)
  • ChromaDB – policies, documentation (company_policies collection)

Both paths converge at a synthesis node that produces natural language answers.

Architecture

  • Router: LLM analyzes intent β†’ routes to sql or vector_store
  • SQL path: explore β†’ generate β†’ HITL approval β†’ execute β†’ self-correct β†’ visualize
  • Vector path: retrieve from ChromaDB company_policies
  • Synthesis: generate_answer produces a clear natural language response
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Router    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
                           β”‚
           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
           β–Ό               β”‚               β–Ό
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”        β”‚        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚  SQL path   β”‚        β”‚        β”‚ Vector path β”‚
    β”‚ (Neon)      β”‚        β”‚        β”‚ (ChromaDB)  β”‚
    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜        β”‚        β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
           β”‚               β”‚               β”‚
           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                           β–Ό
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  generate_  β”‚
                    β”‚   answer    β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
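The routing decision at the top of the diagram can be sketched in plain Python. The real Router asks the LLM to classify intent; the keyword heuristic below only mirrors the shape of the two-way decision (the function name and hint set are illustrative, not the `src/agent.py` API):

```python
# Illustrative routing sketch: the real Router uses an LLM to classify
# intent; this keyword heuristic only mirrors the two-way decision.

STRUCTURED_HINTS = {"total", "sum", "count", "average", "sales", "revenue"}

def route(question: str) -> str:
    """Return 'sql' for structured/numerical questions, else 'vector_store'."""
    q = question.lower()
    if any(hint in q for hint in STRUCTURED_HINTS):
        return "sql"
    return "vector_store"

print(route("What are total sales by category?"))  # sql
print(route("What is the expense policy?"))        # vector_store
```

Whatever the router returns, both branches end at the same synthesis node, so the rest of the graph never needs to know which datasource answered.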

Free hosting (Hugging Face Spaces)

You can host the Gradio app for free on Hugging Face Spaces. In the cloud the app uses Groq (free LLM API) and Hugging Face embeddings, so no Ollama is required.

See DEPLOY.md for step-by-step instructions (Groq API key, Space secrets, and Neon connection string).

Setup

1. Create a Neon database

  1. Sign up at neon.tech and create a project.
  2. Copy the connection string from the dashboard (Connection string β†’ URI).

2. Configure connection

export NEON_DATABASE_URL='postgresql://user:password@ep-xxx.region.aws.neon.tech/neondb?sslmode=require'

Or copy .env.example to .env and set NEON_DATABASE_URL there.
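A startup check along these lines keeps a missing connection string from surfacing later as an opaque database error (the helper name is illustrative, not the app's actual API):

```python
import os

def get_neon_url() -> str:
    """Read the Neon connection string from the environment.

    Failing fast here is clearer than a later, opaque
    psycopg2 connection error.
    """
    url = os.environ.get("NEON_DATABASE_URL")
    if not url:
        raise RuntimeError(
            "NEON_DATABASE_URL is not set; export it or add it to .env"
        )
    return url
```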

3. Install Python dependencies

cd "AGENTIC _AI"
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

4. Seed the database (once)

python3 seed_neon.py

This creates 7 tables with different kinds of data:

Table               Kind          Description
regions             Lookup        Geographic (name, country)
product_categories  Lookup        Category names
customers           Entity        Name, email, region_id, created_at
products            Catalog       Name, category_id, base_price
sales               Transactions  customer_id, product_id, quantity, amount, sale_date
returns             Events        sale_id, amount, reason, return_date
promotions          Time-bound    code, discount_pct, start_date, end_date

Then load ~100k rows of synthetic data (with edge cases):

python3 generate_synthetic_sales.py
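The kind of query the agent generates against this schema can be sketched with an in-memory SQLite copy of two of the tables. SQLite here is only for illustration; the app runs against Neon PostgreSQL, and the sample rows are made up:

```python
import sqlite3

# In-memory stand-in for part of the Neon schema (2 of the 7 tables).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_categories (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT,
                           category_id INTEGER, base_price REAL);
    CREATE TABLE sales (id INTEGER PRIMARY KEY, product_id INTEGER,
                        quantity INTEGER, amount REAL, sale_date TEXT);
    INSERT INTO product_categories VALUES (1, 'Electronics'), (2, 'Books');
    INSERT INTO products VALUES (1, 'Laptop', 1, 900.0), (2, 'Novel', 2, 12.0);
    INSERT INTO sales VALUES (1, 1, 1, 900.0, '2024-01-05'),
                             (2, 2, 3, 36.0, '2024-01-06');
""")

# The kind of aggregation the agent generates for "total sales by category".
rows = conn.execute("""
    SELECT pc.name AS category, SUM(s.amount) AS total_sales
    FROM sales s
    JOIN products p ON p.id = s.product_id
    JOIN product_categories pc ON pc.id = p.category_id
    GROUP BY pc.name
    ORDER BY total_sales DESC
""").fetchall()
print(rows)  # [('Electronics', 900.0), ('Books', 36.0)]
```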

5. Install Ollama and pull models

ollama pull deepseek-coder-v2  # for routing + SQL + synthesis
ollama pull nomic-embed-text   # for ChromaDB embeddings
# Run ollama serve if the Ollama server is not already running

6. Seed the vector store (once)

python3 seed_vector_store.py

Usage

CLI

# SQL path (structured data)
python3 main.py "What are total sales by category?"
python3 main.py "Show sales by region"

# Vector path (policies/docs)
python3 main.py "What is the expense policy?"
python3 main.py "What is the remote work policy?"

Streamlit UI

streamlit run app.py
  • Chat interface with st.chat_message and st.chat_input
  • Agent Reasoning expander shows datasource, SQL, errors
  • st.dataframe() for SQL results, st.image() for charts
  • HITL: high-impact queries pause with Approve / Reject buttons
  • Sidebar: Neon connection string, Clear Chat

Gradio UI

python3 app_gradio.py
  • Opens at http://localhost:7860
  • Chat interface with config (Neon URL, Ollama model) in an accordion
  • Sample-question buttons; results table and chart below chat when applicable
  • HITL: Approve / Reject buttons appear when a high-impact SQL query is proposed

The agent will:

  1. Explore the PostgreSQL schema
  2. Generate a SQL query
  3. Human-in-the-Loop (HITL): If the query is high-impact (UPDATE, DELETE, INSERT, DROP, etc.), the program pauses and prompts you to approve or reject before execution
  4. Execute it (and retry with fixes if it fails)
  5. Infer the best chart type and save to data/outputs/analysis_chart.png
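The HITL gate in step 3 amounts to checking whether the generated SQL contains a mutating keyword. A minimal sketch of that check (the real gate lives inside the agent; the function name is illustrative):

```python
import re

HIGH_IMPACT = {"UPDATE", "DELETE", "INSERT", "DROP", "ALTER", "TRUNCATE"}

def needs_approval(sql: str) -> bool:
    """True if the query mutates data and should pause for human approval."""
    words = re.findall(r"[A-Za-z]+", sql.upper())
    return any(word in HIGH_IMPACT for word in words)

print(needs_approval("SELECT * FROM sales"))           # False
print(needs_approval("DELETE FROM sales WHERE id=1"))  # True
```

Scanning every word, rather than just the first, also catches mutating statements hidden inside CTEs or multi-statement strings.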

Project Structure

AGENTIC _AI/
β”œβ”€β”€ main.py              # Entry point (uses NEON_DATABASE_URL)
β”œβ”€β”€ seed_neon.py         # One-time: create tables and seed data on Neon
β”œβ”€β”€ .env.example         # Example env vars (NEON_DATABASE_URL)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ data/
β”‚   └── outputs/         # Saved charts
└── src/
    β”œβ”€β”€ agent.py         # LangGraph agent + self-correction loop
    β”œβ”€β”€ models.py        # Pydantic schemas
    β”œβ”€β”€ state.py         # Agent state
    β”œβ”€β”€ tools.py         # Schema explorer, SQL executor
    └── visualizer.py    # Pandas/Matplotlib/Seaborn pipeline

Concepts Covered

  • Local LLM orchestration with Ollama
  • Stateful agent loops with LangGraph
  • Self-correction when SQL fails
  • Structured outputs with Pydantic
  • Function-style tools for schema and execution
  • Data lifecycle from raw SQL to visualization