---
title: Autonomous Data Analyst
emoji: 💻
colorFrom: purple
colorTo: gray
sdk: gradio
sdk_version: 6.5.1
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Multi-Source Agentic RAG
A local AI agent with an intelligent **Router** that directs traffic between:
- **Neon PostgreSQL** → structured/numerical data (sales, counts, aggregations)
- **ChromaDB** → policies, documentation (`company_policies` collection)
Both paths converge at a **synthesis node** that produces natural language answers.
## Architecture
- **Router**: LLM analyzes intent → routes to `sql` or `vector_store`
- **SQL path**: explore → generate → HITL approval → execute → self-correct → visualize
- **Vector path**: retrieve from ChromaDB `company_policies`
- **Synthesis**: `generate_answer` produces a clear natural language response
```
        ┌─────────────┐
        │   Router    │
        └──────┬──────┘
               │
       ┌───────┴───────┐
       ▼               ▼
┌─────────────┐ ┌─────────────┐
│  SQL path   │ │ Vector path │
│   (Neon)    │ │  (ChromaDB) │
└──────┬──────┘ └──────┬──────┘
       │               │
       └───────┬───────┘
               ▼
        ┌─────────────┐
        │  generate_  │
        │   answer    │
        └─────────────┘
```
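The routing decision can be sketched in plain Python. This is illustrative only: the real router prompts the LLM through LangGraph, so `classify` below is a stand-in for that call, and `keyword_stub` is a hypothetical placeholder classifier.

```python
# Sketch of the routing decision. `classify` stands in for the LLM call
# the real router makes; anything it returns outside the two known
# datasources falls back to the vector store.
from typing import Callable

def route_question(question: str, classify: Callable[[str], str]) -> str:
    """Return 'sql' or 'vector_store' for a user question."""
    datasource = classify(question).strip().lower()
    if datasource not in ("sql", "vector_store"):
        datasource = "vector_store"
    return datasource

# Stand-in classifier: a real agent would prompt the LLM instead.
def keyword_stub(question: str) -> str:
    numeric_hints = ("total", "sales", "count", "average", "by region")
    return "sql" if any(h in question.lower() for h in numeric_hints) else "vector_store"

print(route_question("What are total sales by category?", keyword_stub))  # sql
print(route_question("What is the expense policy?", keyword_stub))        # vector_store
```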
## Free hosting (Hugging Face Spaces)
You can host the Gradio app for free on [Hugging Face Spaces](https://huggingface.co/spaces). The app will use **Groq** (free LLM API) and **HuggingFace embeddings** in the cloud, so no Ollama is required.
See **[DEPLOY.md](DEPLOY.md)** for step-by-step instructions (Groq API key, Space secrets, and Neon connection string).
## Setup
### 1. Create a Neon database
1. Sign up at [neon.tech](https://neon.tech) and create a project.
2. Copy the connection string from the dashboard (Connection string → URI).
### 2. Configure connection
```bash
export NEON_DATABASE_URL='postgresql://user:password@ep-xxx.region.aws.neon.tech/neondb?sslmode=require'
```
Or copy `.env.example` to `.env` and set `NEON_DATABASE_URL` there.
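In code, reading the connection string might look like the sketch below; the helper name `get_database_url` is illustrative, not the project's actual function.

```python
# Minimal sketch of reading the Neon connection string from the environment.
# If you use a .env file, load it first (e.g. with python-dotenv) so the
# variable is present in os.environ before this runs.
import os

def get_database_url() -> str:
    url = os.environ.get("NEON_DATABASE_URL", "")
    if not url:
        raise RuntimeError("NEON_DATABASE_URL is not set; export it or add it to .env")
    return url
```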
### 3. Install Python dependencies
```bash
cd "AGENTIC _AI"
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```
### 4. Seed the database (once)
```bash
python3 seed_neon.py
```
This creates **7 tables** with different kinds of data:
| Table | Kind | Description |
|-------|------|-------------|
| **regions** | Lookup | Geographic (name, country) |
| **product_categories** | Lookup | Category names |
| **customers** | Entity | Name, email, region_id, created_at |
| **products** | Catalog | Name, category_id, base_price |
| **sales** | Transactions | customer_id, product_id, quantity, amount, sale_date |
| **returns** | Events | sale_id, amount, reason, return_date |
| **promotions** | Time-bound | code, discount_pct, start_date, end_date |
Then load ~100k rows of synthetic data (with edge cases):
```bash
python3 generate_synthetic_sales.py
```
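After seeding, a quick row-count check can confirm the load succeeded. This is a sketch, not part of the repo: the table list comes from the table above, and the live query section is gated behind an (assumed) `RUN_NEON_CHECK` flag so the helper stays runnable without a database.

```python
# Illustrative post-seed sanity check: one COUNT(*) per seeded table.
import os

TABLES = ["regions", "product_categories", "customers", "products",
          "sales", "returns", "promotions"]

def count_queries(tables):
    """Build a COUNT(*) statement per table name."""
    return {t: f"SELECT COUNT(*) FROM {t};" for t in tables}

if os.environ.get("RUN_NEON_CHECK"):  # only hit Neon when explicitly asked
    import psycopg2  # assumed available via requirements.txt
    with psycopg2.connect(os.environ["NEON_DATABASE_URL"]) as conn:
        with conn.cursor() as cur:
            for table, query in count_queries(TABLES).items():
                cur.execute(query)
                print(table, cur.fetchone()[0])
```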
### 5. Seed the vector store (once)
```bash
ollama pull nomic-embed-text # embedding model
python3 seed_vector_store.py
```
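On the vector path, retrieved policy chunks are joined into a context block for the LLM. The sketch below is illustrative: `format_context` and the `chroma_db` path are assumptions, and the live ChromaDB query is gated behind a flag so the helper runs without a seeded store.

```python
# Sketch of the vector path: turn retrieved chunks from the
# `company_policies` collection into a single prompt context block.
import os

def format_context(documents: list[str]) -> str:
    """Join retrieved chunks, separated so the LLM can tell them apart."""
    return "\n\n---\n\n".join(doc.strip() for doc in documents)

if os.environ.get("RUN_CHROMA_QUERY"):  # only query when a store exists
    import chromadb
    client = chromadb.PersistentClient(path="chroma_db")  # path is an assumption
    collection = client.get_collection("company_policies")
    results = collection.query(query_texts=["expense policy"], n_results=3)
    print(format_context(results["documents"][0]))
```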
### 6. Install Ollama and pull models
```bash
ollama pull deepseek-coder-v2 # for routing + SQL + synthesis
ollama pull nomic-embed-text # for ChromaDB embeddings
# Start the server if it isn't already running: ollama serve
```
## Usage
### CLI
```bash
# SQL path (structured data)
python3 main.py "What are total sales by category?"
python3 main.py "Show sales by region"
# Vector path (policies/docs)
python3 main.py "What is the expense policy?"
python3 main.py "What is the remote work policy?"
```
### Streamlit UI
```bash
streamlit run app.py
```
- Chat interface with `st.chat_message` and `st.chat_input`
- **Agent Reasoning** expander shows datasource, SQL, errors
- `st.dataframe()` for SQL results, `st.image()` for charts
- **HITL**: high-impact queries pause with Approve / Reject buttons
- Sidebar: Neon connection string, Clear Chat
### Gradio UI
```bash
python3 app_gradio.py
```
- Opens at **http://localhost:7860**
- Chat interface with config (Neon URL, Ollama model) in an accordion
- Sample-question buttons; results table and chart below chat when applicable
- **HITL**: Approve / Reject buttons appear when a high-impact SQL query is proposed
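The Approve/Reject gate hinges on detecting high-impact SQL. A minimal sketch of such a check, assuming the gate keys off write/DDL keywords (the README names UPDATE, DELETE, INSERT, DROP, etc.; `is_high_impact` and the extra keywords are illustrative):

```python
# Sketch of the HITL gate: flag statements that modify data or schema.
import re

HIGH_IMPACT = ("UPDATE", "DELETE", "INSERT", "DROP", "ALTER", "TRUNCATE")

def is_high_impact(sql: str) -> bool:
    """True when the statement's first keyword is a write/DDL operation."""
    first_word = re.match(r"\s*(\w+)", sql)
    return bool(first_word) and first_word.group(1).upper() in HIGH_IMPACT

print(is_high_impact("SELECT SUM(amount) FROM sales"))     # False
print(is_high_impact("DELETE FROM returns WHERE id = 1"))  # True
```

A first-keyword check is deliberately simple; a production gate would also catch writes hidden inside CTEs or semicolon-chained statements.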
The agent will:
1. Explore the PostgreSQL schema
2. Generate a SQL query
3. **Human-in-the-Loop (HITL):** If the query is high-impact (UPDATE, DELETE, INSERT, DROP, etc.), the program pauses and prompts you to approve or reject before execution
4. Execute it (and retry with fixes if it fails)
5. Infer the best chart type and save to `data/outputs/analysis_chart.png`
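Step 4's retry behavior can be sketched as a small loop. This is a simplified stand-in for the LangGraph self-correction cycle: `execute` and `fix` below are stubs for the real SQL executor and the LLM repair step.

```python
# Sketch of the self-correction loop: execute, and on failure feed the
# error back to a fixer (the LLM, in the real agent) for a corrected query.
from typing import Callable

def run_with_retries(sql: str,
                     execute: Callable[[str], list],
                     fix: Callable[[str, str], str],
                     max_retries: int = 3) -> list:
    for attempt in range(max_retries):
        try:
            return execute(sql)
        except Exception as err:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sql = fix(sql, str(err))  # correction informed by the error text
    raise RuntimeError("unreachable")

# Stubbed demo: the first attempt fails on a bad table name, the "fix"
# repairs it, and the second attempt succeeds.
def demo_execute(sql):
    if "salez" in sql:
        raise ValueError('relation "salez" does not exist')
    return [("Books", 1234.5)]

rows = run_with_retries("SELECT * FROM salez",
                        demo_execute,
                        lambda q, err: q.replace("salez", "sales"))
print(rows)  # [('Books', 1234.5)]
```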
## Project Structure
```
AGENTIC _AI/
βββ main.py # Entry point (uses NEON_DATABASE_URL)
βββ seed_neon.py # One-time: create tables and seed data on Neon
βββ .env.example # Example env vars (NEON_DATABASE_URL)
βββ requirements.txt
βββ data/
β βββ outputs/ # Saved charts
βββ src/
βββ agent.py # LangGraph agent + self-correction loop
βββ models.py # Pydantic schemas
βββ state.py # Agent state
βββ tools.py # Schema explorer, SQL executor
βββ visualizer.py # Pandas/Matplotlib/Seaborn pipeline
```
## Concepts Covered
- **Local LLM orchestration** with Ollama
- **Stateful agent loops** with LangGraph
- **Self-correction** when SQL fails
- **Structured outputs** with Pydantic
- **Function-style tools** for schema and execution
- **Data lifecycle** from raw SQL to visualization
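As an example of the structured-outputs idea, the agent can ask the LLM to emit JSON matching a Pydantic schema. The model below is a sketch with illustrative field names, not the exact schemas in `src/models.py`.

```python
# Sketch of a structured output the agent might request from the LLM.
# Pydantic validates the parsed JSON and gives typed attribute access.
from pydantic import BaseModel

class QueryPlan(BaseModel):
    datasource: str        # "sql" or "vector_store"
    sql: str = ""          # populated only on the SQL path
    chart_type: str = "bar"

plan = QueryPlan(datasource="sql",
                 sql="SELECT region, SUM(amount) FROM sales GROUP BY region")
print(plan.datasource)  # sql
```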