Data Streaming
Access and train on sensitive data without downloading it. Data streaming keeps data under its owner's governance controls while remaining fully usable for analysis and training.
The problem with data transfers
Traditionally, AI teams need to download data to use it. This creates immediate problems:
- Data leakage risk: once downloaded, controls are lost
- Regulatory issues: data crossing borders triggers compliance problems
- Governance gaps: no way to enforce policies after transfer
- Legal exposure: a clear chain of liability back to the AI team
Xase enables data use without downloads through governed streaming.
How data streaming works
1. Create Access Session
Start with policy-approved access:
```python
import xase

client = xase.Client(api_key="sk_...")

# Get governed access to data
session = client.access(
    dataset="patient-records-2025",
    purpose="model-training",
    duration="30d"
)
```

2. Stream Data
Stream batches for training with full governance:
```python
# Stream batches for training
for batch in session.stream(batch_size=32):
    # Train on data without downloading
    model.train_on_batch(batch)

# All usage is automatically tracked
# All policy constraints are enforced
# All evidence is automatically generated
```

3. Access Specific Records
Access specific entries when needed:
```python
# Get specific patient record
patient = session.get("patient_45678")

# Apply transformation with tracking
processed = session.transform(
    data=patient,
    function=anonymize_fields,
    metadata={"purpose": "privacy protection"}
)
```

4. Filtering and Queries
Apply filters without downloading all data:
```python
# Stream with filters
filtered_data = session.stream(
    filter={
        "age": {"$gte": 18},
        "diagnosis": {"$in": ["diabetes", "hypertension"]}
    },
    batch_size=32
)

for batch in filtered_data:
    model.train_on_batch(batch)
```

Key features
No Data Downloads
Data never leaves the governed environment. All processing happens through the streaming interface.
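The batching side of a streaming interface can be sketched in plain Python. The generator below illustrates the mechanics only (it is not the Xase SDK): batches are yielded lazily, so the consumer never holds a full copy of the dataset.

```python
from typing import Iterable, Iterator, List

def stream_batches(records: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield fixed-size batches lazily; only one batch is ever
    materialized on the consumer side at a time."""
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Consumers iterate batch by batch instead of downloading everything:
batches = list(stream_batches(({"id": i} for i in range(70)), batch_size=32))
print([len(b) for b in batches])  # [32, 32, 6]
```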
Runtime Policy Enforcement
Policies are continuously enforced during streaming. Revoked access stops streams immediately.
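One way a revocation might surface to a consumer is as an exception raised mid-stream. The stand-in class and `AccessRevokedError` below are hypothetical, for illustration only; the point is that the policy is re-checked before every batch, so a revoked session delivers nothing further.

```python
class AccessRevokedError(Exception):
    """Hypothetical error: raised when the data owner revokes the session."""

class GovernedStream:
    """Toy stand-in for a governed session, not the Xase SDK: the policy
    flag is re-checked before each batch is released."""
    def __init__(self, batches):
        self.batches = batches
        self.policy_active = True

    def stream(self):
        for batch in self.batches:
            if not self.policy_active:
                raise AccessRevokedError("session revoked by data owner")
            yield batch

stream = GovernedStream([[1, 2], [3, 4], [5, 6]])
consumed = []
try:
    for batch in stream.stream():
        consumed.append(batch)
        stream.policy_active = False  # simulate a mid-stream revocation
except AccessRevokedError:
    pass  # training stops; nothing after the revocation is delivered

print(consumed)  # [[1, 2]]
```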
Automatic Tracking
Every record access and operation is logged with identity, timestamp, and purpose.
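A minimal sketch of what such a log entry could contain. The `AccessEvent` shape is an assumption for illustration, not the actual Xase audit format:

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AccessEvent:
    """Illustrative audit record: who touched what, when, and why."""
    identity: str
    record_id: str
    operation: str
    purpose: str
    timestamp: str

def log_access(identity: str, record_id: str, operation: str, purpose: str) -> AccessEvent:
    # Timestamps are captured in UTC at the moment of access
    return AccessEvent(
        identity=identity,
        record_id=record_id,
        operation=operation,
        purpose=purpose,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )

event = log_access("ml-team@example.com", "patient_45678", "read", "model-training")
print(asdict(event))
```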
Server-Side Filtering
Apply filters server-side to reduce bandwidth and process only relevant data.
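The `$gte`/`$in` filter syntax used above can be read as a predicate over each record. The toy evaluator below shows those semantics client-side for illustration only; in practice Xase evaluates filters server-side, which is the whole point.

```python
def matches(record: dict, filter_: dict) -> bool:
    """Evaluate a Mongo-style filter (supporting $gte and $in) against one record."""
    for field, conditions in filter_.items():
        value = record.get(field)
        for op, operand in conditions.items():
            if op == "$gte" and not (value is not None and value >= operand):
                return False
            if op == "$in" and value not in operand:
                return False
    return True

records = [
    {"age": 45, "diagnosis": "diabetes"},
    {"age": 17, "diagnosis": "diabetes"},
    {"age": 60, "diagnosis": "asthma"},
]
f = {"age": {"$gte": 18}, "diagnosis": {"$in": ["diabetes", "hypertension"]}}
print([r for r in records if matches(r, f)])  # [{'age': 45, 'diagnosis': 'diabetes'}]
```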
Advanced usage
Streaming Aggregations
Compute aggregations without downloading raw data:
```python
# Get aggregated statistics
stats = session.aggregate(
    pipeline=[
        {"$match": {"age": {"$gte": 30}}},
        {"$group": {
            "_id": "$diagnosis",
            "count": {"$sum": 1},
            "avg_age": {"$avg": "$age"}
        }}
    ]
)

for result in stats:
    print(f"Diagnosis: {result['_id']}")
    print(f"Count: {result['count']}")
    print(f"Avg age: {result['avg_age']}")
```

Streaming with Transforms
Apply transformations during streaming:
```python
# Stream with transformation
for batch in session.stream(
    batch_size=32,
    transform=lambda data: normalize(data)
):
    model.train_on_batch(batch)
```