ai – DAILY NEWS

TECH & AI

Is anyone using AWS CodePipeline for the complete CI/CD pipeline?

jackminion Jul 5, 2026 0

TECH & AI

I Ditched Vector Search for My Coding Agent’s Memory. FTS5 Won.

jackminion Jul 4, 2026 0

Every “give your agent memory” tutorial I’ve read reaches for the same stack: chunk your docs, embed them, throw the vectors in a database, do cosine similarity at query time. So when I needed my coding agent to search through indexed tool output, git logs, and fetched docs without dumping raw text into the model’s context window, I assumed I’d be standing up a vector store too.

I didn’t. I used SQLite’s FTS5 full-text search instead, and for this specific job it’s not a compromise — it’s the better tool.

What the problem actually was

The tool I built (context-mode, for routing large command output and API responses out of the model’s context) needs to answer queries like:

“failing tests”
“HTTP 500 errors”
“async route handlers”

against arbitrary shell output, JSON responses, and fetched web pages, indexed once and searched as many times as a session needs. The naive version just dumps everything into context and lets the model read it. That works until the output is 50 KB of test logs and you’ve burned half your context window to find the three lines you actually needed.

Why vectors are the wrong default here, not just an alternative

Vector search is built to answer “what’s semantically similar to this.” That’s the right tool when you’re searching prose — support tickets, documentation, chat transcripts — where the same idea gets expressed in different words and you need “how do I reset my password” to match a doc titled “Account Recovery Steps.”

Coding-agent queries mostly aren’t that. “HTTP 500 errors” isn’t a fuzzy semantic concept I want approximated — it’s closer to a literal grep with better ranking. The content being searched is also structured and keyword-dense: stack traces, log lines, JSON keys, error codes. Embedding a stack trace and comparing cosine similarity throws away the thing that actually matters (the literal exception name, the literal line number) in favor of a vector representation that’s better at “these two paragraphs are about similar topics” than “this line contains the string ECONNREFUSED.”

FTS5 is built for exactly this: tokenized, indexed, ranked full-text search over exact and near-exact term matches, with BM25-style relevance scoring out of the box.

What it actually looks like

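No embedding model, no vector database, no network round-trip to compute embeddings. It’s all in Python’s standard library, provided the SQLite build behind the sqlite3 module was compiled with FTS5, which is typical: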

import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS docs
    USING fts5(source, content)
""")

def index(source: str, content: str):
    # Store one chunk of output under the tool or URL that produced it.
    conn.execute("INSERT INTO docs (source, content) VALUES (?, ?)", (source, content))
    conn.commit()

def search(query: str, limit: int = 5):
    # snippet() excerpts column 1 (content); rank orders by BM25, best match first.
    rows = conn.execute("""
        SELECT source, snippet(docs, 1, '[', ']', '…', 20), rank
        FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?
    """, (query, limit)).fetchall()
    return rows

That’s the whole engine. snippet() gives you a short excerpt around the match with the matched terms marked, for free. rank gives you BM25 ordering for free. Querying “HTTP 500 errors” against a batch of indexed test output returns the rows that contain all three terms, ranked by term frequency and rarity: not the semantically nearest paragraph, the actually relevant one. One caveat: the default tokenizer does no stemming, so “errors” matches “error” only if the table is created with tokenize='porter'.
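One practical wrinkle worth sketching: the MATCH argument is parsed as FTS5's own query language, so a raw agent query containing dots, colons, or hyphens can raise a syntax error instead of returning rows. A minimal workaround, assuming the same docs schema as above and an SQLite build with FTS5; the helper name to_fts5_query and the sample log line are hypothetical:

```python
import sqlite3


def to_fts5_query(text: str) -> str:
    """Quote each whitespace-separated term so FTS5 treats it as a literal string.

    Inside an FTS5 string, a double quote is escaped by doubling it.
    Quoted terms separated by spaces are combined with an implicit AND.
    """
    terms = text.split()
    return " ".join('"' + term.replace('"', '""') + '"' for term in terms)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(source, content)")
conn.execute(
    "INSERT INTO docs (source, content) VALUES (?, ?)",
    ("app.log", "Error: connect ECONNREFUSED 127.0.0.1:5432"),  # invented log line
)

raw_query = "ECONNREFUSED 127.0.0.1:5432"

# Passed through unquoted, the dots and colon are not valid FTS5 query syntax.
try:
    conn.execute("SELECT source FROM docs WHERE docs MATCH ?", (raw_query,)).fetchall()
    raw_query_failed = False
except sqlite3.OperationalError:
    raw_query_failed = True

# Quoted, each term is tokenized the same way the indexed content was.
safe_rows = conn.execute(
    "SELECT source FROM docs WHERE docs MATCH ?", (to_fts5_query(raw_query),)
).fetchall()
```

The trade-off is that quoting everything also disables FTS5 operators such as OR and NEAR, so this fits best when queries come from an agent rather than from someone who knows the query syntax.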

Where this would fall over — and why it doesn’t here

FTS5 is a bad choice if your queries genuinely need semantic matching: “find the doc about resetting my password” needs to match “Account Recovery,” and no amount of tokenization gets you there without embeddings. If I were building search over a knowledge base of prose documentation with inconsistent terminology, I’d reach for vectors, possibly hybrid (BM25 for recall, vectors for semantic re-ranking).

But an agent’s own tool output, error logs, and fetched API responses are dense with the literal terms you’re going to search for, because you (or the agent) wrote the query with those terms in mind. “Failing tests” as a query is going to co-occur with FAIL, AssertionError, test names — words that are actually in the log. The semantic gap that justifies embeddings mostly doesn’t exist in this domain.
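A small sketch of that co-occurrence claim, with one condition made explicit: FTS5's default tokenizer does not stem, so “failing” lines up with FAIL only when the table opts into the porter tokenizer. The table names and the log line are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Same schema twice: one with the default tokenizer, one with Porter stemming.
conn.execute("CREATE VIRTUAL TABLE plain_docs USING fts5(source, content)")
conn.execute(
    "CREATE VIRTUAL TABLE stemmed_docs "
    "USING fts5(source, content, tokenize='porter unicode61')"
)

# Hypothetical test-runner output, not taken from a real log.
log_line = "FAIL test_login AssertionError: expected 200 got 500"
for table in ("plain_docs", "stemmed_docs"):
    conn.execute(
        f"INSERT INTO {table} (source, content) VALUES (?, ?)",
        ("pytest", log_line),
    )


def search(table: str, query: str):
    # Table names cannot be bound as parameters; `table` is one of the two
    # fixed names above, never user input.
    sql = f"SELECT source FROM {table} WHERE {table} MATCH ? ORDER BY rank"
    return conn.execute(sql, (query,)).fetchall()


plain_hits = search("plain_docs", "failing tests")      # exact tokens only
stemmed_hits = search("stemmed_docs", "failing tests")  # terms are stemmed first
```

With stemming on, the query and the log reduce to the same stems (fail, test), which is the kind of vocabulary overlap this domain tends to have. It is still keyword matching, so a query like “broken build” would not find this line.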

The generalizable lesson

“Add semantic search” has become a reflex the same way “add a cache” or “add a queue” is — reached for because it’s the default answer to “how do I search this,” not because the problem demands it. Vector infra costs you an embedding model, a vector database or extension, and a slower indexing step, in exchange for a capability — semantic similarity — that keyword-dense, structured content usually doesn’t need.

Before reaching for embeddings on your next “agent needs to search X” problem, ask what the query and the content actually look like. If both are keyword-dense and share the same vocabulary (logs, code, JSON, stack traces), full-text search with BM25 ranking will usually match or beat vectors on relevance and cost you a fraction of the infrastructure. Save the vector database for the day your content is actually prose with vocabulary mismatch; most agent tooling isn’t there yet.

TECH & AI

Stop Leaking Medical Data! Build a Privacy-First Skin Cancer Classifier with Federated Learning & PySyft 🩺🛡️

jackminion Jul 4, 2026 0

Data is the new oil, but in healthcare, data is more like plutonium—extremely valuable but incredibly dangerous if handled incorrectly. If you are building AI for medical use cases, you’ve likely hit the “Data Silo” wall. Hospitals can’t just ZIP up patient records and DM them to you because of GDPR, HIPAA, and basic human ethics.

So, how do we train a high-performing Skin Lesion Classification model without ever actually seeing the raw medical images? Welcome to the world of Federated Learning (FL) and Privacy-Preserving AI. In this guide, we’ll explore how to use PySyft and PyTorch to train models on decentralized data while keeping sensitive information exactly where it belongs: with the patient.

We will focus on Federated Learning, Differential Privacy, and Secure Multi-Party Computation (SMPC) to build a robust, privacy-first pipeline.

The Architecture: Move the Code, Not the Data

In traditional Machine Learning, we bring data to the model. In Federated Learning, we flip the script: we bring the model to the data.

graph TD
    subgraph "Central Server (Aggregator)"
        A(Global Model v1.0) -->|Distribute Weights| B{Encrypted Aggregator}
        B -->|Updated Global Model| A
    end

    subgraph "Hospital A (Edge Node)"
        C(Local Data: Skin Images) --> D(Local Training)
        D -->|Trained Gradients| B
    end

    subgraph "Hospital B (Edge Node)"
        E(Local Data: Skin Images) --> F(Local Training)
        F -->|Trained Gradients| B
    end

    style A fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#bbf,stroke:#333
    style E fill:#bbf,stroke:#333

As shown in the flow above, the raw images never leave the hospitals. Only the “learnings” (gradients/weights) are sent back to the central server.
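The aggregation step in that flow is commonly done with federated averaging (FedAvg): the server takes a weighted mean of every parameter, where each site counts in proportion to the number of samples it trained on. A framework-free sketch, assuming each hospital reports its weights as a flat dict of floats with identical keys; the function name and all numbers are illustrative and not part of PySyft's API:

```python
from typing import Dict, List, Tuple

Weights = Dict[str, float]


def federated_average(updates: List[Tuple[Weights, int]]) -> Weights:
    """Weighted mean of client weights; each client counts by its sample size.

    Assumes every client reports the same parameter names.
    """
    total_samples = sum(n for _, n in updates)
    if total_samples == 0:
        raise ValueError("no samples reported by any client")

    averaged: Weights = {}
    for weights, n_samples in updates:
        for name, value in weights.items():
            averaged[name] = averaged.get(name, 0.0) + value * n_samples / total_samples
    return averaged


# Illustrative numbers: hospital A trained on 300 images, hospital B on 100.
hospital_a = ({"w": 1.0, "b": 0.0}, 300)
hospital_b = ({"w": 3.0, "b": 4.0}, 100)

global_weights = federated_average([hospital_a, hospital_b])
```

Weighting by sample count keeps a small clinic from pulling the global model as hard as a large hospital; an unweighted mean is the special case where every site reports the same count.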

Prerequisites

Before we dive into the code, ensure you have the following stack ready:

PyTorch: The backbone for our neural networks.

PySyft: The secret sauce for federated and private learning.

Differential Privacy (Opacus): To mitigate “membership inference attacks.”

Step 1: Setting Up Virtual Workers

In a real-world scenario, these would be physical servers in different hospitals. For this tutorial, we will simulate two hospitals (Alice and Bob) using PySyft’s virtual workers.

import torch
import syft as sy

# Note: TorchHook and VirtualWorker belong to the legacy PySyft 0.2.x API;
# later PySyft releases removed them.
# Hook PyTorch so tensors gain .send() / .get() for remote execution
hook = sy.TorchHook(torch)

# Create two remote 'hospitals'
hospital_alice = sy.VirtualWorker(hook, id="alice")
hospital_bob = sy.VirtualWorker(hook, id="bob")

print(f"Nodes initialized: {hospital_alice.id}, {hospital_bob.id} 🏥")

Step 2: Distributing the Dataset

Imagine we have a dataset of skin lesion images (like the HAM10000 dataset). We split it and “send” it to our hospitals. In reality, the data would already exist there; we are simply gaining pointers to it.

# Simulated skin lesion data (Features = Pixels, Targets = Cancer Type)
data = torch.tensor([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]], requires_grad=True)
# Float targets so they can be subtracted from the model's float output
target = torch.tensor([[0.0], [0.0], [1.0], [1.0]])

# Distribute data to hospitals
# In a real app, data stays local; here we simulate the 'silo'
data_alice = data[0:2].send(hospital_alice)
target_alice = target[0:2].send(hospital_alice)

data_bob = data[2:4].send(hospital_bob)
target_bob = target[2:4].send(hospital_bob)

datasets = [(data_alice, target_alice), (data_bob, target_bob)]

Step 3: The Federated Training Loop

Now for the magic. We define a simple linear model (a stand-in for the CNN you would use on real images) and send it to the remote locations for training.

from torch import nn, optim

# A toy linear model standing in for a skin lesion classifier
model = nn.Linear(2, 1)

def train(epochs=5):
    optimizer = optim.SGD(model.parameters(), lr=0.1)

    for epoch in range(epochs):
        for data, target in datasets:
            # 1. Send model to the hospital node
            model.send(data.location)

            # 2. Normal Training Step
            optimizer.zero_grad()
            output = model(data)
            loss = ((output - target)**2).sum()
            loss.backward()
            optimizer.step()

            # 3. Get the updated model back (The data stays behind!)
            model.get()

            print(f"Epoch {epoch} complete at {data.location.id}. Loss: {loss.get().item():.4f}")

train()

Step 4: Adding Differential Privacy (DP)

Even if we don’t see the data, a clever attacker could theoretically reverse-engineer the gradients to see what the training images looked like. To prevent this, we add Differential Privacy. This injects controlled “noise” into the gradients.
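As a rough sketch of what that noise injection looks like: one common recipe, DP-SGD, clips each example's gradient to a maximum L2 norm and then adds Gaussian noise scaled to that norm before averaging. The dependency-free version below shows only that step; the function names and parameter values are assumptions for illustration, and a real pipeline would lean on a library such as Opacus, which also tracks the privacy budget:

```python
import math
import random
from typing import List

Gradient = List[float]


def clip_gradient(grad: Gradient, max_norm: float) -> Gradient:
    """Scale the gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(
    per_example_grads: List[Gradient],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> Gradient:
    """Clip each example's gradient, sum, add Gaussian noise, then average."""
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(clipped) for v in noisy]


rng = random.Random(0)  # seeded only to make the demo repeatable
grads = [[3.0, 4.0], [0.3, 0.4]]  # L2 norms 5.0 and 0.5
update = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single image can move the model, and the noise hides what remains of that contribution. A larger noise_multiplier means stronger privacy and usually lower accuracy.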

Pro-Tip: If you’re looking for production-grade patterns on how to implement Differential Privacy at scale or want to explore hardware-level security like TEEs (Trusted Execution Environments), I highly recommend checking out the advanced research articles over at WellAlly Tech Blog. They cover the intersection of AI and privacy in much greater depth! 🥑

The Result: Privacy is a Feature, Not a Bug

By the end of this process, you have a model that has learned the features of skin cancer from multiple sources without the raw patient images ever leaving the hospitals that hold them.

Why this matters:

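Compliance: Keeping raw data on-site supports GDPR/HIPAA compliance (Privacy by Design), but it does not guarantee it on its own; model updates can still leak information, and legal review still applies.
Data Diversity: You can train on data from a hospital in New York and a clinic in London simultaneously, which can produce a more generalized and less biased model.
Security: Even if your central server is breached, the attacker finds no raw patient data, only model weights (which is why the gradient protections from Step 4 still matter).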
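The “Encrypted Aggregator” in the architecture diagram points at a stronger guarantee: with secure aggregation, the server side only ever learns the sum of the updates. One SMPC building block for this is additive secret sharing, sketched here with integers modulo a prime. The modulus, function names, and values are illustrative assumptions; real protocols operate on fixed-point encodings of the weights and handle dropouts and collusion:

```python
import secrets
from typing import List

PRIME = 2**61 - 1  # illustrative modulus; any sufficiently large prime works


def make_shares(value: int, n_parties: int) -> List[int]:
    """Split value into n random shares that sum to value modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares: List[int]) -> int:
    """Recombine shares; any strict subset on its own looks uniformly random."""
    return sum(shares) % PRIME


# Two hospitals each hold a private integer (e.g. a quantized weight update).
alice_update, bob_update = 42, 58

alice_shares = make_shares(alice_update, 2)
bob_shares = make_shares(bob_update, 2)

# Each aggregation server adds the shares it received; neither sees a raw update.
server_1_sum = (alice_shares[0] + bob_shares[0]) % PRIME
server_2_sum = (alice_shares[1] + bob_shares[1]) % PRIME

total = reconstruct([server_1_sum, server_2_sum])
```

Only the combined total is ever reconstructed, so a breach of a single aggregation server exposes random-looking shares instead of any one hospital's update.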

Conclusion 🚀

Federated Learning is transforming how we think about sensitive data. We no longer need to choose between AI Innovation and User Privacy. With tools like PySyft and PyTorch, the “Privacy-First” approach is becoming a practical option for real projects.

Are you ready to build the future of secure AI? If you enjoyed this “Learning in Public” session, drop a comment below! What’s your biggest challenge with medical data? Let’s discuss! 👇
