<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://www.vchalyi.com/feed.xml" rel="self" type="application/atom+xml"/><link href="https://www.vchalyi.com/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-25T02:26:14+00:00</updated><id>https://www.vchalyi.com/feed.xml</id><title type="html">Viktor Chalyi - VP of Engineering | Director of Engineering | Engineering Manager</title><subtitle>VP of Engineering · Director of Engineering · Engineering Manager in NYC · 14+ years in software development and technology management roles · AI, Telecom &amp; Fintech · Driving innovation through strategic engineering growth &amp; hands-on expertise.</subtitle><entry><title type="html">Claude Code Token Limit: How to Stretch Your Daily Budget</title><link href="https://www.vchalyi.com/blog/2026/claude-code-token-limit/" rel="alternate" type="text/html" title="Claude Code Token Limit: How to Stretch Your Daily Budget"/><published>2026-04-24T09:00:00+00:00</published><updated>2026-04-24T09:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/claude-code-token-limit</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/claude-code-token-limit/"><![CDATA[<p>Every Claude Code Pro session starts with a quiet tax: CLAUDE.md loads, MCP servers initialize, skills register. Before you type a single message, roughly 10,000 tokens are gone. Plan a feature, revise the spec, iterate on the approach. You’re already at 40% of your 5-hour budget. Start implementing, debug what broke, verify it works, and the limit hits. You wait. The momentum is gone.</p> <hr/> <h2 id="why-claude-code-token-limits-break-your-flow">Why Claude Code Token Limits Break Your Flow</h2> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2026-04-24-claude-code-token-limit-flow.svg" sizes="95vw"/> <img src="/assets/img/blog/2026-04-24-claude-code-token-limit-flow.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Diagram showing three tools intercepting token consumption at different points in a Claude Code session" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The token ceiling is not just a billing constraint. It is a pacing problem. Sessions have a natural shape: orient, plan, build, verify. That arc fits inside a 5-hour window only if token spend is efficient. Most sessions are not efficient, not because of waste in the obvious sense, but because of structure. Every <code class="language-plaintext highlighter-rouge">git status</code> dumps verbose output into the context. Every explanation Claude gives is written for a patient reader rather than someone who already knows the domain. Every file that was read once stays in context whether it matters anymore or not. The result is a session that burns through budget on overhead instead of work.</p> <p>Three tools attack this from different angles. <a href="https://github.com/rtk-ai/rtk">RTK</a> compresses what goes into context. <a href="https://github.com/juliusbrussee/caveman">Caveman</a> trims what comes out of the model. <a href="https://github.com/getagentseal/codeburn">CodeBurn</a> shows where the remainder goes so you know what to fix next. None of them require changes to how you work. 
Install them once and they run in the background.</p> <hr/> <h2 id="rtk-compress-what-goes-into-context">RTK: Compress What Goes Into Context</h2> <p>Command output is one of the largest and most overlooked sources of token consumption in a Claude Code session. A <code class="language-plaintext highlighter-rouge">git log</code> with a hundred entries, a <code class="language-plaintext highlighter-rouge">docker ps</code> with a dozen containers, an <code class="language-plaintext highlighter-rouge">npm install</code> with its full dependency tree: all of it lands in context verbatim unless something intercepts it first. RTK is that interceptor.</p> <p>RTK is a single Rust binary that acts as a proxy for common shell commands. It supports 100+ commands across git, npm, cargo, docker, and other ecosystems. The interception is transparent: a hook rewrites <code class="language-plaintext highlighter-rouge">git status</code> to <code class="language-plaintext highlighter-rouge">rtk git status</code> automatically, so nothing in your workflow changes. What changes is the output: filtered, grouped, deduplicated, and truncated to what Claude actually needs to make a decision.</p> <p>The numbers are concrete. In a typical 30-minute coding session, RTK reduced token consumption from approximately 118,000 tokens to 23,900, an 80% reduction on command output alone. Across a full development session, <a href="https://www.vchalyi.com/blog/2026/claude-code-best-practices/">Claude Code best practices</a> point to bash output as a primary driver of context bloat. RTK addresses that directly.</p> <p><strong>Install:</strong></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Homebrew</span>
brew <span class="nb">install </span>rtk-ai/tap/rtk

<span class="c"># Or curl</span>
curl <span class="nt">-sSL</span> https://raw.githubusercontent.com/rtk-ai/rtk/main/install.sh | bash
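
<span class="c"># after a few sessions, print the savings RTK has measured (command described below)</span>
rtk gain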
</code></pre></div></div> <p>After installation, add the hook to your Claude Code configuration or run <code class="language-plaintext highlighter-rouge">rtk gain</code> to verify savings from your sessions.</p> <hr/> <h2 id="caveman-make-claude-stop-over-explaining">Caveman: Make Claude Stop Over-Explaining</h2> <p>RTK handles the input side. Caveman handles the output side. By default, Claude writes responses for a general audience: full sentences, examples, context, summaries. For a developer in the middle of a session who already knows the codebase and just asked a specific question, most of that text is noise. Caveman replaces it with signal.</p> <p>The plugin enforces brevity at the model level. Activate it with <code class="language-plaintext highlighter-rouge">/caveman</code> and responses shift to terse fragments. Enough information, stripped of everything else. A React re-render explanation that normally takes 540 tokens comes back in 70. An auth middleware fix that would fill a screen arrives in two lines. Across a benchmark of 10 typical development tasks, Caveman delivered an average of 65% output token reduction with no loss in technical accuracy.</p> <p>Three intensity levels let you match verbosity to context. Lite mode keeps grammar intact and reads as professional terseness. Full mode uses fragments and drops articles. Ultra mode compresses to telegraphic abbreviations, useful for repetitive operations like reviewing a long list of small changes. A <code class="language-plaintext highlighter-rouge">/caveman-compress</code> command also runs on your CLAUDE.md and memory files, shrinking input context by roughly 46%.</p> <p><strong>Install:</strong></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>claude plugin marketplace add JuliusBrussee/caveman <span class="o">&amp;&amp;</span> claude plugin <span class="nb">install </span>caveman@caveman
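<span class="c"># then, inside a session: /caveman to activate, "normal mode" to deactivate,</span>
<span class="c"># /caveman lite|full|ultra to set intensity, /caveman-compress for CLAUDE.md</span>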
</code></pre></div></div> <p>Activate with <code class="language-plaintext highlighter-rouge">/caveman</code>, deactivate with “normal mode”. Toggle modes with <code class="language-plaintext highlighter-rouge">/caveman lite</code>, <code class="language-plaintext highlighter-rouge">/caveman full</code>, or <code class="language-plaintext highlighter-rouge">/caveman ultra</code>.</p> <hr/> <h2 id="codeburn-see-where-your-tokens-actually-go">CodeBurn: See Where Your Tokens Actually Go</h2> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2026-04-24-claude-code-token-limit-codeburn.svg" sizes="95vw"/> <img src="/assets/img/blog/2026-04-24-claude-code-token-limit-codeburn.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Conceptual CodeBurn terminal dashboard showing token costs broken down by project, model, and task category" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>RTK and Caveman reduce consumption. CodeBurn tells you what is left and where it is going. It reads session data directly from disk, no proxy, no API key, no instrumentation required, and renders a terminal dashboard with spending broken down by project, model, task category, tool, shell command, and MCP server.</p> <p>The most useful feature is <code class="language-plaintext highlighter-rouge">codeburn optimize</code>. It scans your recent sessions and flags specific waste patterns: files that were read multiple times without being edited, bash commands with uncapped output, MCP servers that were loaded but never called, context files that have grown beyond useful size. These are not general recommendations. They are findings from your actual sessions. One review of a typical week’s usage will surface at least two or three concrete changes that cut measurable budget.</p> <p>The model comparison tool is worth running before committing to a model for a long project. It puts two models side by side across one-shot success rate, retry frequency, cost per call, cache hit rate, and per-category performance. Session limits feel different when you know that one model resolves a debugging task in one attempt while another averages three.</p> <p><strong>Install:</strong></p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>npm <span class="nb">install</span> <span class="nt">-g</span> codeburn
<span class="c"># or run without installing</span>
npx codeburn
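
<span class="c"># key commands (detailed below)</span>
codeburn report    <span class="c"># 7-day dashboard</span>
codeburn today     <span class="c"># current spend</span>
codeburn optimize  <span class="c"># flag waste patterns</span>
codeburn compare   <span class="c"># side-by-side model analysis</span>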
</code></pre></div></div> <p>Key commands: <code class="language-plaintext highlighter-rouge">codeburn report</code> for a 7-day dashboard, <code class="language-plaintext highlighter-rouge">codeburn today</code> for current spend, <code class="language-plaintext highlighter-rouge">codeburn optimize</code> for waste patterns, <code class="language-plaintext highlighter-rouge">codeburn compare</code> for model analysis.</p> <hr/> <h2 id="two-habits-that-cost-nothing">Two Habits That Cost Nothing</h2> <p>Tools compress and filter, but two simple habits do more to prevent token drain than any proxy. First, do not rely on autocompact. Claude Code compacts the context automatically when it approaches the limit, but by then the context is already bloated and the compression summary loses fidelity. Compact manually at 50 or 60% with <code class="language-plaintext highlighter-rouge">/compact</code> instead. The summary captures the session while it is still sharp, and you get a clean working context without hitting the wall. Second, start each new feature in a fresh context window. The context from the previous session contains file reads, diffs, tool call outputs, and back-and-forth that are irrelevant to the new task. Continuing an old session in the wrong direction costs far more than the ~10,000 token overhead of starting fresh. One feature per context window is a discipline that compounds across every session.</p> <hr/> <p>None of these solve the token limit. They change what the limit means. RTK cuts bash output, Caveman cuts response verbosity, CodeBurn surfaces what is left to fix, and two free habits keep the context clean throughout. The same 5-hour budget covers substantially more work, and the limit stops being the thing that ends your sessions.</p>]]></content><author><name></name></author><category term="ai"/><category term="claude-code"/><category term="token-optimization"/><category term="ai"/><category term="developer-tools"/><category term="rtk"/><summary type="html"><![CDATA[Three tools that cut Claude Code token consumption, plus two session habits worth building: compact early, and start each feature in a clean context window.]]></summary></entry><entry><title type="html">Knockpy and crt.sh: Finding Subdomains Your Org Forgot</title><link href="https://www.vchalyi.com/blog/2026/knockpy-subdomain-discovery/" rel="alternate" type="text/html" title="Knockpy and crt.sh: Finding Subdomains Your Org Forgot"/><published>2026-04-20T10:00:00+00:00</published><updated>2026-04-20T10:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/knockpy-subdomain-discovery</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/knockpy-subdomain-discovery/"><![CDATA[<p>Most engineering orgs cannot list every subdomain they own. Knockpy and crt.sh close that gap in an afternoon, and explain why leaked dev environments and forgotten staging hosts stay a standing risk.</p> <hr/> <h2 id="what-knockpy-does-and-the-legal-line">What Knockpy Does, and the Legal Line</h2> <p>Knockpy, maintained at <a href="https://github.com/guelfoweb/knockpy">guelfoweb/knockpy</a>, is a Python tool that enumerates subdomains for a given domain. Version 9 ships with two complementary scan modes, a wildcard detector, and a local database that stores every run. Install is straightforward:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/guelfoweb/knockpy.git
<span class="nb">cd </span>knockpy
python3 <span class="nt">-m</span> venv .venv <span class="o">&amp;&amp;</span> <span class="nb">.</span> .venv/bin/activate
pip <span class="nb">install</span> <span class="nb">.</span>
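
<span class="c"># confirm the install and list available flags</span>
knockpy <span class="nt">--help</span>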
</code></pre></div></div> <p>Before any commands, a disclaimer.</p> <blockquote> <p><strong>Use this only on domains you own or have written authorization to test.</strong> Subdomain enumeration hits third-party services and in active mode sends DNS traffic at the target. Running it against infrastructure you do not own can violate computer misuse laws in most jurisdictions. This post is educational.</p> </blockquote> <hr/> <h2 id="three-modes-recon-bruteforce-wildcard">Three Modes: Recon, Bruteforce, Wildcard</h2> <p>Three commands cover most real workflows.</p> <p><strong>Passive recon.</strong> No packets touch the target. Knockpy queries public data sources and prints the resulting subdomains.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">-d</span> example.com <span class="nt">--recon</span>
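<span class="c"># passive only: pulls from public sources (crt.sh, VirusTotal, Shodan, RapidDNS)</span>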
</code></pre></div></div> <p><strong>Bruteforce.</strong> An active scan that resolves a wordlist of common subdomain names against the target’s DNS servers. A default wordlist ships with knockpy, and you can override it with <code class="language-plaintext highlighter-rouge">--wordlist</code>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">-d</span> example.com <span class="nt">--bruteforce</span>
</code></pre></div></div> <p><strong>Combined recon plus bruteforce</strong> is the typical day-to-day run. Passive sources find the obvious hosts, bruteforce finds the unglamorous ones like <code class="language-plaintext highlighter-rouge">jenkins</code>, <code class="language-plaintext highlighter-rouge">grafana</code>, and <code class="language-plaintext highlighter-rouge">staging-old</code>.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">-d</span> example.com <span class="nt">--recon</span> <span class="nt">--bruteforce</span>
</code></pre></div></div> <p><strong>Wildcard detection.</strong> Some domains resolve every possible subdomain to the same IP, which makes bruteforce results useless. The <code class="language-plaintext highlighter-rouge">--wildcard</code> flag tests this and exits.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">-d</span> example.com <span class="nt">--wildcard</span>
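<span class="c"># if random names resolve, the domain uses wildcard DNS and bruteforce output needs filtering</span>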
</code></pre></div></div> <p>You can tune concurrency, DNS resolver, and timeout at runtime:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">-d</span> example.com <span class="nt">--bruteforce</span> <span class="nt">--wordlist</span> ./custom.txt <span class="nt">--threads</span> 100 <span class="nt">--dns</span> 1.1.1.1
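<span class="c"># --timeout (default 3 s per lookup) can be raised for slow resolvers</span>
knockpy <span class="nt">-d</span> example.com <span class="nt">--bruteforce</span> <span class="nt">--timeout</span> 5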
</code></pre></div></div> <hr/> <h2 id="inside-knockpy-passive-vs-active-scans">Inside Knockpy: Passive vs Active Scans</h2> <p>The passive path never touches the target. Knockpy queries third-party services that already index slices of the public internet. Each source catches different hosts, which is why a real recon run pulls from all of them at once.</p> <p><strong>crt.sh (Certificate Transparency logs).</strong> CT is a cross-vendor, append-only log where every certificate issued by a publicly trusted CA (Let’s Encrypt, DigiCert, Sectigo, Google Trust Services) is recorded. Every hostname in a certificate’s Subject Alternative Names field lands here within minutes of issuance, and modern browsers refuse certs that skip CT logging, so the coverage is close to complete for HTTPS. No API key. This is the strongest signal for production-facing web hosts.</p> <p><strong>VirusTotal.</strong> Maintains one of the largest passive DNS datasets in the industry, built up over years from the URLs, emails, and files users submit for scanning. When someone uploaded an attachment that referenced <code class="language-plaintext highlighter-rouge">jenkins-old.company.com</code>, that hostname got recorded, even if the host never had a public certificate. Free API key required, set via <code class="language-plaintext highlighter-rouge">API_KEY_VIRUSTOTAL</code>, with a 4 requests/minute cap on the public tier.</p> <p><strong>Shodan.</strong> Scans the entire IPv4 space continuously and fingerprints every reachable service: banners, TLS certs, protocol responses. Knockpy asks Shodan which hostnames it has observed on the target. Catches hosts answering on non-web ports (SSH, IMAP, RDP, custom TCP services) that a cert-only search will miss. Needs <code class="language-plaintext highlighter-rouge">API_KEY_SHODAN</code>.</p> <p><strong>RapidDNS.</strong> A free passive DNS aggregator, no API key, queried by scraping its result pages. Useful as a zero-setup fallback, and it occasionally surfaces subdomains the other sources miss because its collection pipeline is different.</p> <p>Sources are pluggable. Configuration lives in <code class="language-plaintext highlighter-rouge">~/.knockpy/recon_services.json</code>, and adding a new source is writing a small parser. To preview which sources are responding before a real run, <code class="language-plaintext highlighter-rouge">knockpy -d example.com --recon --test</code> exercises each one and prints the status.</p> <p>The active path is a parallel DNS bruteforce. Knockpy spawns up to <code class="language-plaintext highlighter-rouge">--threads</code> (default 250) concurrent resolvers and queries every entry in the wordlist against the target’s authoritative nameservers, using <code class="language-plaintext highlighter-rouge">--timeout</code> (default 3 seconds) per lookup. Subdomains that resolve are kept, the rest are dropped.</p> <p>The wildcard check runs before bruteforce. Knockpy generates random strings that almost certainly do not exist as subdomains and tries to resolve them. 
If they come back with an IP, the domain uses wildcard DNS and the bruteforce output needs to be filtered or treated carefully.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/2026-04-20-knockpy-subdomain-discovery-fig1.svg" sizes="95vw"/> <img src="/assets/img/blog/2026-04-20-knockpy-subdomain-discovery-fig1.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Knockpy execution flow: Target Domain splits into Passive Recon and Active Bruteforce, sources merge and dedupe, pass through Wildcard Filter, and persist to SQLite Report DB" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <hr/> <h2 id="reports-replay-and-html-export">Reports, Replay, and HTML Export</h2> <p>Every knockpy run is persisted in a SQLite database under <code class="language-plaintext highlighter-rouge">~/.knockpy/</code>. You list, replay, and export past runs through the <code class="language-plaintext highlighter-rouge">--report</code> flag.</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockpy <span class="nt">--report</span> list        <span class="c"># every past run</span>
knockpy <span class="nt">--report</span> latest      <span class="c"># last run, printed</span>
knockpy <span class="nt">--report</span> &lt;ID&gt;        <span class="c"># specific run</span>
</code></pre></div></div> <p>HTML export is supported through the same <code class="language-plaintext highlighter-rouge">--report</code> flag (check <code class="language-plaintext highlighter-rouge">knockpy --help</code> for the exact subcommand on your version). An HTML report is the artifact you attach to a ticket, share with stakeholders who do not live in a terminal, or drop into an audit trail. The more valuable trick is diffing reports week over week: newly appearing subdomains are where shadow IT shows up first.</p> <hr/> <h2 id="crtsh-and-the-attack-surface-inventory-problem">crt.sh and the Attack Surface Inventory Problem</h2> <p>A good reconnaissance run often starts with crt.sh before touching knockpy at all. Certificate Transparency is a browser-enforced log where every publicly trusted TLS certificate is recorded. When a team in your org spins up <code class="language-plaintext highlighter-rouge">new-staging.internal.company.com</code> and fetches a Let’s Encrypt certificate, that hostname becomes searchable in crt.sh within hours. Anyone can query it with no credentials, and the results give you a solid starting list to feed into knockpy:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>https://crt.sh/?q=%25.company.com
</code></pre></div></div> <hr/> <p>You cannot defend what you cannot see. Subdomain discovery is the cheapest first step, and it is one of the few security exercises where the tooling is free and the value compounds every week.</p>]]></content><author><name></name></author><category term="security"/><category term="cybersecurity"/><category term="pentesting"/><category term="subdomain-enumeration"/><category term="attack-surface"/><category term="recon"/><summary type="html"><![CDATA[How to use knockpy and crt.sh to enumerate subdomains, detect wildcards, and build an attack surface inventory for large engineering orgs.]]></summary></entry><entry><title type="html">Engineering Manager Playbook as a Living LLM Wiki</title><link href="https://www.vchalyi.com/blog/2026/engineering-manager-playbook-llm-wiki/" rel="alternate" type="text/html" title="Engineering Manager Playbook as a Living LLM Wiki"/><published>2026-04-16T09:00:00+00:00</published><updated>2026-04-16T09:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/engineering-manager-playbook-llm-wiki</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/engineering-manager-playbook-llm-wiki/"><![CDATA[<p>Onboarding a new engineering manager fails the same way every time: the playbook is out of date before the new hire finishes their second week. Rebuilding it as a living LLM wiki fixes that.</p> <hr/> <h2 id="why-an-engineering-manager-playbook-exists">Why an Engineering Manager Playbook Exists</h2> <p>Onboarding engineering managers without a written playbook is expensive. They spend the first month asking the same questions every previous hire already asked. Tribal knowledge lives in Slack DMs, old wiki pages, and the heads of whichever senior engineers happen to be free that afternoon. A written engineering manager playbook compresses months of that into a few days of guided reading. Across several hires, the document has been doing real work. It answers the standard questions about how teams are structured, who owns what, how decisions get made, and where to find the parts of the system that matter. The hard part was never whether to have a playbook. It was how to keep it honest.</p> <hr/> <h2 id="what-goes-into-an-engineering-manager-playbook">What Goes Into an Engineering Manager Playbook</h2> <p>A useful playbook covers the structural layer of the job, not the personal-style layer. The topics worth writing down are the ones that change hands when a role changes hands:</p> <ul> <li>Engineering metrics: DORA measurements, cycle time, delivery rate, tech-debt ratio.</li> <li>Performance reviews: cadence, competency model, how feedback is aggregated, how recognition works.</li> <li>Goals framework: how business and technical goals are defined, who owns which, how RACI is applied.</li> <li>Incident management: severity definitions, alerting, on-call, the incident response loop.</li> <li>Observability: how metrics, traces, and logs are split across the stack.</li> <li>The 30/60/90 onboarding plan for the role itself.</li> <li>Team topology, tools catalog, roles, meeting cadence, and the Slack channel map.</li> </ul> <p>None of this is revolutionary. What makes it valuable is that it is written down in one place, cross-linked, and correct on the day the new manager reads it. That last condition is where static documents lose.</p> <hr/> <h2 id="why-engineering-manager-playbooks-go-stale">Why Engineering Manager Playbooks Go Stale</h2> <p>A hand-maintained playbook rots for structural reasons, not from laziness. 
Every time a team is renamed, a tool is replaced, a process is revised, or a channel is retired, somebody has to remember to update the doc. Nobody does, consistently. Ingesting a new source means opening the file, finding the right section, editing it, checking the cross-references, and hoping no other page now contradicts the change. The friction is high enough that updates get skipped. After a quarter or two, the playbook describes an organization that no longer exists, and new hires quietly learn to stop trusting it.</p> <p>The underlying problem is that a playbook has been treated as a document. It should be treated as an index over raw material.</p> <hr/> <h2 id="karpathys-llm-wiki-pattern">Karpathy’s LLM Wiki Pattern</h2> <p>The turning point came from Andrej Karpathy’s <a href="https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f">LLM wiki gist</a>. The pattern is simple and load-bearing. Raw notes, articles, meeting transcripts, and ad-hoc documents go into a <code class="language-plaintext highlighter-rouge">raw/</code> folder. An LLM ingests them and produces a structured wiki with three kinds of pages: concepts, entities, and source summaries. Every ingest updates an index file and appends to a log. Pages cross-link each other using plain markdown. When a new source contradicts an existing claim, the LLM flags the contradiction rather than silently overwriting.</p> <p>No retrieval-augmented generation is needed. The index is the routing layer. When a question comes in, the model reads the index, pulls the relevant pages, and answers with citations to the pages it used. When new material arrives, the same index tells it what already exists and where to merge. The wiki compounds instead of bloating. It lints itself for contradictions, orphan pages, and missing cross-references on request.</p> <p>Applying this to the engineering manager playbook turned a brittle document into a living one. Raw notes go in, the wiki absorbs them, and the next reader gets current information instead of a frozen snapshot.</p> <pre><code class="language-mermaid">flowchart LR
    A["Raw notes&lt;br/&gt;(articles, meeting&lt;br/&gt;transcripts, PDFs)"] --&gt;|ingest| B(("Claude Code"))
    B --&gt; C["Concept pages&lt;br/&gt;(processes, methods)"]
    B --&gt; D["Entity pages&lt;br/&gt;(teams, tools, roles)"]
    B --&gt; E["Source summaries"]
    C --&gt; F[["index.md&lt;br/&gt;+ activity log"]]
    D --&gt; F
    E --&gt; F
    F --&gt;|query| G["Cited answer"]
    F -.-&gt;|next ingest| B
    classDef hub fill:#cc785c,stroke:#cc785c,color:#0d1117,font-weight:700
    classDef page fill:#1c2128,stroke:#30363d,color:#e6edf3
    classDef raw fill:#21262d,stroke:#30363d,color:#8b949e
    class B hub
    class C,D,E,F page
    class A,G raw
</code></pre> <hr/> <h2 id="claude-code-as-the-interface-obsidian-as-the-map">Claude Code as the Interface, Obsidian as the Map</h2> <p>The day-to-day interface is Claude Code. Open the wiki repository in it and the whole thing behaves like a person who has read every page. Ask about the goals framework and it cites the relevant page. Paste a meeting note and say “ingest this” and it rewrites the three or four pages that actually need to change. Run a lint pass and it returns a checklist of contradictions and gaps. That is the part that is difficult to explain to anyone who has not tried it: the wiki stops feeling like documentation and starts feeling like a colleague with perfect recall of their own notes. Claude Code is remarkably good at this kind of work, and it is the first tool where a wiki has actually felt maintained instead of merely stored.</p> <p>Obsidian sits alongside as a reading surface. It lacks the live interaction of Claude Code, but it is an excellent IDE for a markdown knowledge base. The graph view exposes link structure at a glance, backlinks make navigation instant, and keyboard-driven browsing is fast. Claude Code is how the wiki is maintained and queried. Obsidian is how it is read and explored.</p> <hr/> <p>Documentation that regenerates itself is not a gimmick. It is the only kind that survives contact with a fast-moving engineering organization.</p>]]></content><author><name></name></author><category term="engineering-leadership"/><category term="engineering-leadership"/><category term="engineering-management"/><category term="llm"/><category term="claude-code"/><category term="onboarding"/><category term="knowledge-management"/><summary type="html"><![CDATA[How an engineering manager playbook stopped rotting once it became a living LLM-maintained wiki curated by Claude Code and inspired by Karpathy.]]></summary></entry><entry><title type="html">Cloudflare Pages: Deploy a Site for $10 a Year</title><link href="https://www.vchalyi.com/blog/2026/cloudflare-pages-deploy-site-for-10-dollars-a-year/" rel="alternate" type="text/html" title="Cloudflare Pages: Deploy a Site for $10 a Year"/><published>2026-04-14T09:00:00+00:00</published><updated>2026-04-14T09:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/cloudflare-pages-deploy-site-for-10-dollars-a-year</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/cloudflare-pages-deploy-site-for-10-dollars-a-year/"><![CDATA[<p>Deploying <a href="https://meetingscost.com">meetingscost.com</a> cost me $10 for the year — that was the domain, and everything else (hosting, CDN, CI/CD, SSL, email routing) came free through Cloudflare.</p> <hr/> <h2 id="where-to-host-a-static-site-for-free-in-2026">Where to Host a Static Site for Free in 2026</h2> <p>The common options are GitHub Pages, Netlify free tier, and Vercel hobby plan. All three work for basic static sites. GitHub Pages is the simplest but has limited build flexibility beyond Jekyll. Netlify and Vercel both auto-deploy from GitHub and offer 100 GB/month bandwidth on their free tiers, but each has build minute caps and locks some useful features behind paid plans. Cloudflare Pages sits in the same category with a few meaningful differences: it runs on Cloudflare’s global edge network across 330+ cities, has no bandwidth limits on the free tier, and allows 500 builds per month with no compute time ceiling. 
For a static site or a Workers-based project, that is more headroom than most side projects will consume.</p> <hr/> <h2 id="why-cloudflare-pages-is-a-strong-free-hosting-choice">Why Cloudflare Pages Is a Strong Free Hosting Choice</h2> <p>The GitHub integration works without configuration. Connect a repo, set an optional build command, and every push to main triggers a deploy. For a plain HTML/CSS/JS site, there is no build command — just point Cloudflare at the directory containing your files. The <code class="language-plaintext highlighter-rouge">wrangler.jsonc</code> config for a static site is four lines:</p> <div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
  </span><span class="nl">"name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"my-site"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"compatibility_date"</span><span class="p">:</span><span class="w"> </span><span class="s2">"2026-04-14"</span><span class="p">,</span><span class="w">
  </span><span class="nl">"assets"</span><span class="p">:</span><span class="w"> </span><span class="p">{</span><span class="w">
    </span><span class="nl">"directory"</span><span class="p">:</span><span class="w"> </span><span class="s2">"./"</span><span class="w">
  </span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div> <p>Push that to GitHub, connect the repo in the Cloudflare Pages dashboard, and the site is live on a <code class="language-plaintext highlighter-rouge">*.pages.dev</code> subdomain in under a minute. No YAML pipeline files to write, no Docker images to configure. For anyone already using GitHub, this is zero additional tooling overhead.</p> <p>The edge delivery matters for user experience. Cloudflare Pages serves assets from the nearest of those 330+ locations globally. For a simple marketing site or a calculator tool, this means sub-100ms load times for most users without any CDN configuration or cache-warming.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig1-480.webp 480w,/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig1-800.webp 800w,/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig1-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig1.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Cloudflare Workers and Pages dashboard showing the meeting-cost project connected to the kvantatech/simple-sites GitHub repository" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <hr/> <h2 id="cloudflare-as-your-domain-registrar">Cloudflare as Your Domain Registrar</h2> <p>Most domain registrars sell at cost and recover margin through renewal price increases, add-ons, and aggressive upsells at checkout. Cloudflare Registrar sells domains at ICANN wholesale price with no markup. A <code class="language-plaintext highlighter-rouge">.com</code> domain runs about $10.44 per year, and that price stays flat at renewal. WHOIS privacy is included by default — no separate fee.</p> <p>The operational benefit is having DNS, hosting, and the domain in one dashboard. Connecting a custom domain to a Pages project takes two steps: add the domain in the Pages project settings, update your nameservers to point to Cloudflare. After that, DNS propagation and SSL provisioning happen automatically. No manual A records to wire up, no waiting for certificate issuance.</p> <hr/> <h2 id="free-features-that-make-a-difference">Free Features That Make a Difference</h2> <p>Two Cloudflare features that look small but are genuinely useful in practice:</p> <p><strong>Redirect rules.</strong> Setting up a redirect from <code class="language-plaintext highlighter-rouge">www.yourdomain.com</code> to the apex domain (or the reverse) is a single rule in the Cloudflare dashboard. No nginx config, no serverless function, no extra DNS entries. The rule propagates globally in seconds.</p> <p><strong>Email Routing.</strong> Registering a domain through Cloudflare includes email routing at no cost. You can create a custom address like <code class="language-plaintext highlighter-rouge">contact@yourdomain.com</code> that forwards to any personal inbox. This is useful for side projects that need a professional contact point without paying for Google Workspace or similar. When I set up a Google Play developer account for an LLC, a custom domain email was required as the public business contact. Cloudflare Email Routing handled that with a few clicks and no additional cost. 
I documented that full process in <a href="https://www.vchalyi.com/blog/2026/how-to-register-google-play-developer-account-for-llc/">How to Register a Google Play Developer Account for Your LLC</a>.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig2-480.webp 480w,/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig2-800.webp 800w,/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig2-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/cloudflare-pages-deploy-site-for-10-dollars-a-year-fig2.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Cloudflare Email Routing rules showing about@meetingscost.com forwarding to a personal email address" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <hr/> <h2 id="what-else-cloudflare-gives-you-on-the-free-tier">What Else Cloudflare Gives You on the Free Tier</h2> <p>A few more capabilities worth knowing before reaching for a paid alternative:</p> <ul> <li><strong>Web Analytics.</strong> Privacy-first, cookie-free, no GDPR banner required. Shows page views, referrers, and top countries. Accurate enough for a side project without any third-party tracking scripts.</li> <li><strong>DDoS protection.</strong> Always on at L3/L4, no configuration required. Your site gets it by default.</li> <li><strong>SSL/TLS.</strong> Auto-provisioned and auto-renewed. No Certbot, no Let’s Encrypt setup, no renewal reminders.</li> <li><strong>Firewall rules.</strong> The free tier includes custom firewall rules — enough to block specific countries, rate-limit aggressive bots, or challenge suspicious traffic patterns.</li> <li><strong>R2 object storage.</strong> 10 GB free with zero egress fees. If a project needs to serve user-uploaded content or large assets, R2 is cheaper than S3 for anything with significant read traffic, since you pay only for storage and writes, not downloads.</li> </ul> <hr/> <p>For a side project or micro-site, the total annual cost is the domain. The hosting, the CDN, the CI/CD pipeline, the SSL certificate, and the email address all run on Cloudflare’s free tier without modification or workarounds.</p>]]></content><author><name></name></author><category term="platform"/><category term="cloudflare"/><category term="cloudflare-pages"/><category term="web-hosting"/><category term="static-site"/><category term="devops"/><summary type="html"><![CDATA[Cloudflare Pages gives you free hosting with GitHub CI/CD. Add a .com domain for $10/year and get email forwarding, analytics, and DDoS protection for free.]]></summary></entry><entry><title type="html">Capacitor WebView Cache: Why New Builds Show Old Assets</title><link href="https://www.vchalyi.com/blog/2026/capacitor-webview-cache-stale-assets/" rel="alternate" type="text/html" title="Capacitor WebView Cache: Why New Builds Show Old Assets"/><published>2026-04-11T09:00:00+00:00</published><updated>2026-04-11T09:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/capacitor-webview-cache-stale-assets</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/capacitor-webview-cache-stale-assets/"><![CDATA[<p>A Capacitor WebView cache bug in our runner game kept shipping old JavaScript to players after every update, even though the new APK installed cleanly. 
Two stacked cache layers had to be torn out before a fresh build actually reached the screen.</p> <hr/> <h2 id="how-capacitor-wraps-a-web-game-in-a-native-shell">How Capacitor Wraps a Web Game in a Native Shell</h2> <p>Capacitor is Ionic’s successor to Cordova: a thin native runtime that hosts your HTML, CSS, and JavaScript inside a platform WebView and exposes native APIs through a JavaScript bridge. On Android, your <code class="language-plaintext highlighter-rouge">www/</code> folder is bundled straight into the APK and served by the system WebView, which is Chromium on any modern device. On iOS, the same bundle runs inside WKWebView. One codebase, two native shells, and near-native input latency for a canvas-based game like ours, a mobile runner called Road Rage that my friend started building and I joined a few weeks in.</p> <p>The architecture looks like this:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/capacitor-webview-cache-stale-assets-architecture.svg" sizes="95vw"/> <img src="/assets/img/blog/capacitor-webview-cache-stale-assets-architecture.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Capacitor architecture: Android APK or iOS IPA hosts a Native Activity or ViewController, which runs the Capacitor Bridge, which in turn drives the system WebView serving the www bundle and also routes calls to native plugins" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The WebView is the whole runtime. Everything the player sees is HTML rendered inside that container, and every native capability reaches the game through the bridge. Which is exactly why the WebView’s caching behavior became load-bearing for us.</p> <hr/> <h2 id="the-bug-new-build-installs-old-game-loads">The Bug: New Build Installs, Old Game Loads</h2> <p>We ship internal builds through Firebase App Distribution. The flow is normal: bump the version, run <code class="language-plaintext highlighter-rouge">npx cap sync</code>, assemble the APK, upload, testers tap Update. The APK installs fine, the version label on the main menu shows the new number, then the game boots into a visibly older UI. The worst symptom was mixed-version state: a new <code class="language-plaintext highlighter-rouge">index.html</code> loading against stale <code class="language-plaintext highlighter-rouge">game.js</code> and <code class="language-plaintext highlighter-rouge">styles.css</code> from the previous install. On a Pixel 5 this silently broke the ×2 coins button because the event handler it needed lived in the new JS bundle, but the markup was rendering against the old one.</p> <p>“Clear app storage” fixed it every time, which was the giveaway. The bytes inside the APK were correct. Something between the APK and the screen was holding onto the previous build’s files.</p> <hr/> <h2 id="two-cache-layers-between-your-code-and-the-player">Two Cache Layers Between Your Code and the Player</h2> <p>A Capacitor Android app can cache assets in two independent places, and both have to be right for an update to stick:</p> <ul> <li><strong>Service Worker cache.</strong> A legacy PWA service worker (<code class="language-plaintext highlighter-rouge">sw.js</code>) from the early web prototype was still registered, using a cache-first strategy keyed on a hardcoded cache name. 
Because the cache name never changed, every boot read <code class="language-plaintext highlighter-rouge">index.html</code> and friends from the service worker’s Cache Storage instead of from the APK. New builds were invisible to the app until someone wiped storage.</li> <li><strong>Android WebView HTTP cache.</strong> Even after removing the service worker, Android’s system WebView keeps its own disk-backed HTTP cache for files it has loaded before. That cache is not flushed when the APK is upgraded, so assets that matched the previous install’s URLs kept serving from WebView storage in preference to the fresh copies packaged inside the new APK.</li> </ul> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/capacitor-webview-cache-stale-assets-layers.svg" sizes="95vw"/> <img src="/assets/img/blog/capacitor-webview-cache-stale-assets-layers.svg" class="img-fluid rounded z-depth-1" width="100%" height="auto" alt="Two cache layers stacked between a Capacitor APK and the player: service worker cache and Android WebView HTTP cache" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The two layers produce the same external symptom, which is why the first fix looked complete and wasn’t. You end up debugging the wrong layer twice.</p> <hr/> <h2 id="the-fix-remove-one-cache-disable-the-other">The Fix: Remove One Cache, Disable the Other</h2> <p>The first commit ripped out the service worker and the PWA manifest entirely. Capacitor already serves <code class="language-plaintext highlighter-rouge">www/</code> directly from packaged assets, so a service worker sitting on top of that was redundant and strictly harmful. Six files deleted, one cache layer gone, problem apparently solved. It wasn’t.</p> <p>The second commit reached into <code class="language-plaintext highlighter-rouge">MainActivity.java</code> and did two things on every startup:</p> <div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nc">WebView</span> <span class="n">webView</span> <span class="o">=</span> <span class="k">this</span><span class="o">.</span><span class="na">bridge</span><span class="o">.</span><span class="na">getWebView</span><span class="o">();</span>
<span class="k">if</span> <span class="o">(</span><span class="n">webView</span> <span class="o">!=</span> <span class="kc">null</span><span class="o">)</span> <span class="o">{</span>
    <span class="n">webView</span><span class="o">.</span><span class="na">clearCache</span><span class="o">(</span><span class="kc">true</span><span class="o">);</span>
    <span class="n">webView</span><span class="o">.</span><span class="na">getSettings</span><span class="o">().</span><span class="na">setCacheMode</span><span class="o">(</span><span class="nc">WebSettings</span><span class="o">.</span><span class="na">LOAD_NO_CACHE</span><span class="o">);</span>
<span class="o">}</span>
</code></pre></div></div> <p><code class="language-plaintext highlighter-rouge">clearCache(true)</code> flushes any HTTP cache left over from the previous install, and <code class="language-plaintext highlighter-rouge">LOAD_NO_CACHE</code> tells the WebView to skip its disk cache on subsequent loads. There is no performance penalty, because Capacitor reads <code class="language-plaintext highlighter-rouge">www/</code> straight from the APK’s packaged assets, not over HTTP. The moment this landed, Firebase App Distribution updates started reaching players cleanly and the ×2 coins button came back to life.</p> <hr/> <p>Cross-platform hybrid stacks like Capacitor and Cordova are built on a compromise: one web codebase, two native hosts. That compromise is mostly invisible, until a caching layer you forgot about starts serving yesterday’s build. The rule we now enforce in this codebase is simple: on a native host, the WebView must never cache code it reads from packaged assets.</p>]]></content><author><name></name></author><category term="mobile-development"/><category term="capacitor"/><category term="webview"/><category term="android-development"/><category term="mobile-games"/><category term="cross-platform"/><category term="hybrid-apps"/><summary type="html"><![CDATA[How two hidden cache layers in a Capacitor Android app served stale HTML, CSS, and JS after every update, and the WebView cache fix that solved it.]]></summary></entry><entry><title type="html">How to Register a Google Play Developer Account for Your LLC: A Step-by-Step Guide</title><link href="https://www.vchalyi.com/blog/2026/how-to-register-google-play-developer-account-for-llc/" rel="alternate" type="text/html" title="How to Register a Google Play Developer Account for Your LLC: A Step-by-Step Guide"/><published>2026-04-08T16:00:00+00:00</published><updated>2026-04-08T16:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/how-to-register-google-play-developer-account-for-llc</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/how-to-register-google-play-developer-account-for-llc/"><![CDATA[<p>Registering a Google Play Developer account for an LLC is not as straightforward as you might expect. Unlike a personal developer account, an organization account requires a DUNS number, a company website, a public email, and a public phone number. Some of these are not intuitive to obtain, and the process has a few surprises along the way.</p> <p>The good news: with the right preparation, you can get through the entire process in about <strong>4 days</strong>. 
Here is exactly how I did it.</p> <h2 id="what-you-need-before-you-start">What You Need Before You Start</h2> <p>Before diving into the steps, here is the full list of what Google requires for an organization account:</p> <ul> <li>A <strong>DUNS number</strong> for your LLC</li> <li>A <strong>company website</strong> verified through Google Search Console</li> <li>A <strong>public contact email</strong> on a custom domain</li> <li>A <strong>public phone number</strong> for your developer page</li> <li>A <strong>Google account</strong> to use as the developer account</li> <li><strong>$25</strong> for the one-time registration fee</li> </ul> <p>I recommend reading through all the steps first so you can kick off parallel tasks (like requesting your DUNS number while setting up your website).</p> <h2 id="step-1-get-your-duns-number">Step 1: Get Your DUNS Number</h2> <p>A DUNS (Data Universal Numbering System) number is a unique nine-digit identifier issued by Dun &amp; Bradstreet. Google requires it to verify your organization’s identity.</p> <h3 id="how-to-apply">How to apply</h3> <p>Go to the <a href="https://www.dnb.com/duns/get-a-duns.html">Dun &amp; Bradstreet website</a> and request a DUNS number for your LLC. The official timeline says it can take <strong>up to 30 days</strong>, but in my case it took only <strong>2 days</strong> to receive the number via email.</p> <h3 id="the-catch-nobody-warns-you-about">The catch nobody warns you about</h3> <p>Here is where it gets interesting. I received my DUNS number by email and immediately tried to use it when setting up my Google Play Developer account. Google could not find my organization by the DUNS number. After contacting support, they explained that since the number was brand new, it had not yet propagated through their systems. They asked me to <strong>wait up to 48 hours</strong>.</p> <p>So the natural question is: why send me the number if I cannot use it yet? It would have been much more helpful to simply delay the email until the number is actually active. But that is how it works, so plan for an extra day of waiting.</p> <h2 id="step-2-set-up-a-company-website">Step 2: Set Up a Company Website</h2> <p>Google requires your organization to have a website, and you will need to verify ownership through <strong>Google Search Console</strong>. This is used to confirm that the website belongs to your LLC.</p> <p>Thanks to AI tools, building a simple company website is surprisingly fast. You can put together a clean, professional-looking site in about an hour. I used <strong>Cloudflare Pages</strong> for hosting, which is completely free. Just push your site to a Git repository, connect it to Cloudflare Pages, and it deploys automatically.</p> <h3 id="website-verification">Website verification</h3> <p>Once your site is live:</p> <ol> <li>Go to <a href="https://search.google.com/search-console">Google Search Console</a></li> <li>Add your domain as a property</li> <li>Follow the verification steps (usually adding a DNS TXT record)</li> </ol> <p>If you are already using Cloudflare for DNS, adding the verification record takes less than a minute.</p> <h2 id="step-3-set-up-a-business-email">Step 3: Set Up a Business Email</h2> <p>Google Play requires a contact email address that will be publicly displayed on your developer page. 
Using a personal Gmail address does not look professional for a business, so you will want an email on your company domain (e.g., <code class="language-plaintext highlighter-rouge">contact@yourcompany.com</code>).</p> <p><strong>Cloudflare Email Routing</strong> makes this completely free. Here is how it works:</p> <ol> <li>Go to your domain in the Cloudflare dashboard</li> <li>Navigate to <strong>Email &gt; Email Routing</strong></li> <li>Set up a routing rule to forward emails from your custom domain to your personal Gmail</li> </ol> <p>That is it. Emails sent to <code class="language-plaintext highlighter-rouge">contact@yourcompany.com</code> will arrive in your Gmail inbox. No need to pay for Google Workspace or any other email hosting service.</p> <p>Kudos to Cloudflare here. Between domain registration, DNS, email routing, and website hosting, they offer an impressive amount for free.</p> <h2 id="step-4-get-a-developer-phone-number">Step 4: Get a Developer Phone Number</h2> <p>Google requires a phone number that will be <strong>publicly visible</strong> on your Google Play developer page. This is a legitimate privacy concern. You probably do not want your personal cell phone number exposed to every user who visits your app listing.</p> <p>The solution: <strong>Google Voice</strong>. If you already have a phone number, you can get a Google Voice number for free. It gives you a separate number that forwards calls and texts to your real phone, keeping your personal number private.</p> <h3 id="setting-up-google-voice">Setting up Google Voice</h3> <ol> <li>Go to <a href="https://voice.google.com">voice.google.com</a></li> <li>Choose a phone number (you can pick your area code)</li> <li>Link it to your existing phone number</li> <li>Use this number as your developer contact number</li> </ol> <p>The whole setup takes about 10 minutes.</p> <h2 id="step-5-complete-the-google-play-console-registration">Step 5: Complete the Google Play Console Registration</h2> <p>With all the prerequisites in place, you can now finish the registration:</p> <ol> <li>Go to <a href="https://play.google.com/console/signup">Google Play Console</a></li> <li>Sign in with the Google account you want to use as the developer account</li> <li>Select <strong>Organization</strong> as the account type</li> <li>Enter your organization details: <ul> <li>Legal business name (must match your LLC registration)</li> <li>DUNS number</li> <li>Business address</li> <li>Contact information</li> </ul> </li> <li>Pay the <strong>$25 one-time registration fee</strong></li> <li>Verify your <strong>website</strong> through Google Search Console (if not done already)</li> <li>Provide your <strong>contact email</strong> and <strong>phone number</strong></li> <li>Complete <strong>identity verification</strong> (Google may request additional documents)</li> <li>Accept the <strong>Google Play Developer Distribution Agreement</strong></li> </ol> <p>After submitting, Google reviews your application. Approval can take a few days, but in many cases it is processed within 24-48 hours.</p> <h2 id="timeline-breakdown">Timeline Breakdown</h2> <p>Here is how long the entire process took in my experience:</p> <table> <thead> <tr> <th>Day</th> <th>Task</th> <th>Details</th> </tr> </thead> <tbody> <tr> <td>Day 1-2</td> <td>DUNS number</td> <td>Applied and received the number via email</td> </tr> <tr> <td>Day 3</td> <td>DUNS propagation + parallel setup</td> <td>Waited for the DUNS number to become findable. 
Used this time to set up Google Voice, build the company website, and configure email routing</td> </tr> <tr> <td>Day 4</td> <td>Registration</td> <td>Completed the Google Play Console setup</td> </tr> </tbody> </table> <p><strong>Total: approximately 4 days</strong> from start to a submitted application.</p> <p>The biggest time sink is the DUNS number. If you are planning to publish an app, request your DUNS number first and work on everything else while you wait.</p> <h2 id="final-thoughts">Final Thoughts</h2> <p>The process of registering a Google Play Developer account for an LLC is more involved than it needs to be. The DUNS number requirement adds days of waiting, and the public phone number requirement raises privacy concerns that Google does not address.</p> <p>That said, with tools like <strong>Cloudflare</strong> (free domain, hosting, and email routing) and <strong>Google Voice</strong> (free private phone number), you can get through the process without spending anything beyond the $25 registration fee. Start with the DUNS number, set up everything else in parallel, and you will be ready to publish your first app in under a week.</p>]]></content><author><name></name></author><category term="how-to"/><category term="google-play"/><category term="android"/><category term="llc"/><category term="duns-number"/><category term="app-publishing"/><category term="mobile-development"/><summary type="html"><![CDATA[A practical guide to setting up a Google Play Developer account for an LLC, covering DUNS numbers, website setup, business email, and privacy tips. Based on real experience.]]></summary></entry><entry><title type="html">Claude Code Best Practices: How I Use AI to Build Faster and Smarter</title><link href="https://www.vchalyi.com/blog/2026/claude-code-best-practices/" rel="alternate" type="text/html" title="Claude Code Best Practices: How I Use AI to Build Faster and Smarter"/><published>2026-03-20T09:00:00+00:00</published><updated>2026-03-20T09:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2026/claude-code-best-practices</id><content type="html" xml:base="https://www.vchalyi.com/blog/2026/claude-code-best-practices/"><![CDATA[<p>I’ve been using Claude Code daily. Here’s what actually moved the needle:</p> <hr/> <h2 id="1-never-touch-files-manually-teach-the-ai-instead">1. Never Touch Files Manually, Teach the AI Instead</h2> <p>When I discover how something works, I ask Claude to save it to memory. Next session, it already knows my architecture, conventions, and edge cases. Every manual edit is a missed opportunity to build knowledge that compounds across sessions.</p> <hr/> <h2 id="2-create-skills-for-repetitive-workflows">2. Create Skills for Repetitive Workflows</h2> <p>Lint, test, build, commit, push, open a PR - one command. No context-switching, no remembering flags. And when something fails, the AI handles it intelligently instead of just bailing.</p> <hr/> <h2 id="3-start-every-feature-with-a-written-prd">3. Start Every Feature with a Written PRD</h2> <p>Before any code, I switch to planning mode. Claude explores the codebase, designs the approach, writes a PRD. I review, adjust, then execute. Features land cleaner, rework drops, and I have a folder of dated docs capturing every architectural decision.</p> <hr/> <h2 id="4-be-selective-with-mcp-servers">4. Be Selective with MCP Servers</h2> <p>Every MCP server you add registers its tools into the context window. 
Too many servers pollute the context and exhaust it much faster, leaving less room for the actual work. I keep only the servers I use regularly and disable the rest. Lean context = better focus and longer productive sessions.</p> <hr/> <h2 id="5-enforce-tdd">5. Enforce TDD</h2> <p>In my <code class="language-plaintext highlighter-rouge">CLAUDE.md</code>, I instruct Claude Code to always start with tests first for every new feature: write the tests, confirm they fail, implement the feature, confirm the tests go green. The rule is absolute: <strong>never fix tests just to make them pass</strong>. This keeps the test suite honest and forces real solutions instead of workarounds.</p> <hr/> <p>Treat AI as a long-term collaborator, not a one-shot autocomplete. Build memory. Build automation. Build process. The developers who will thrive aren’t the ones who prompt the hardest — they’re the ones who build systems around their AI tools that compound over time.</p>]]></content><author><name></name></author><category term="ai"/><category term="llm"/><category term="claude-code"/><category term="developer-productivity"/><category term="anthropic"/><category term="coding-with-ai"/><summary type="html"><![CDATA[Practical Claude Code best practices that compound over time — from persistent memory and custom skills to PRDs, TDD workflows, and MCP server hygiene.]]></summary></entry><entry><title type="html">Run Hugging Face LLMs Free on Google Colab</title><link href="https://www.vchalyi.com/blog/2025/google-colab-and-huggingface/" rel="alternate" type="text/html" title="Run Hugging Face LLMs Free on Google Colab"/><published>2025-11-11T17:00:00+00:00</published><updated>2025-11-11T17:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2025/google-colab-and-huggingface</id><content type="html" xml:base="https://www.vchalyi.com/blog/2025/google-colab-and-huggingface/"><![CDATA[<h2 id="running-llms-from-hugging-face-hub-for-free-on-google-colab">Running LLMs from Hugging Face Hub for Free on Google Colab</h2> <p>If you’ve ever wanted to experiment with large language models but lacked the hardware, here’s the good news: you can run Hugging Face models directly in Google Colab, taking advantage of free T4 GPUs.</p> <p>Here’s the setup in a nutshell:</p> <ol> <li>Open a Colab notebook and select GPU (T4).</li> <li>Obtain a token from Hugging Face Hub (for accessing models).</li> <li>Use the transformers library to load a model via the Hugging Face Hub.</li> <li>Run inference locally in Colab. No paid API or hosting required!</li> </ol> <h2 id="hugging-face-can-feel-overwhelming-at-first">Hugging Face Can Feel Overwhelming at First</h2> <p>If you’re new to Hugging Face, it’s easy to get lost in its rich ecosystem: transformers, datasets, inference, and more. Each library plays a different role, and understanding how they connect can take a bit of time.</p> <p>A key distinction many newcomers miss is how and where your model actually runs, and that’s where the difference between pipeline and InferenceClient becomes important:</p> <ul> <li>pipeline. Downloads the model weights and runs the model locally (on your Colab T4 or your own GPU). Great for learning, experimentation, and custom workflows. <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">transformers</span> <span class="kn">import</span> <span class="n">pipeline</span>
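# Two Colab-specific assumptions before this runs: switch the runtime to a T4 GPU
# (Runtime / Change runtime type), and gated repos such as Mistral's may first
# need a one-time login with your Hub token:
#   from huggingface_hub import login; login(token="hf_...")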
<span class="n">pipe</span> <span class="o">=</span> <span class="nf">pipeline</span><span class="p">(</span><span class="sh">"</span><span class="s">text-generation</span><span class="sh">"</span><span class="p">,</span> <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">mistralai/Mistral-7B-Instruct-v0.2</span><span class="sh">"</span><span class="p">)</span>
<span class="n">result</span> <span class="o">=</span> <span class="nf">pipe</span><span class="p">(</span><span class="sh">"</span><span class="s">Explain quantum computing in simple terms:</span><span class="sh">"</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">result</span><span class="p">)</span>
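# A sketch of a leaner load, assuming the default fp32 weights are too large for
# the free T4's 16 GB of VRAM; half precision plus automatic device placement is
# the usual fix (torch ships with Colab, accelerate may need a pip install):
#   import torch
#   pipe = pipeline("text-generation",
#                   model="mistralai/Mistral-7B-Instruct-v0.2",
#                   torch_dtype=torch.float16,
#                   device_map="auto")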
</code></pre></div> </div> </li> <li>InferenceClient. Sends your request to the Hugging Face Inference API, where the model runs remotely on one of their AI infrastructure providers. You don’t need to manage hardware. The compute is handled entirely by Hugging Face and their partners.</li> </ul> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">huggingface_hub</span> <span class="kn">import</span> <span class="n">InferenceClient</span>
<span class="kn">import</span> <span class="n">os</span>  <span class="c1"># needed for os.getenv below</span>
<span class="kn">from</span> <span class="n">dotenv</span> <span class="kn">import</span> <span class="n">load_dotenv</span>
<span class="nf">load_dotenv</span><span class="p">()</span>
<span class="n">hf_token</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="nf">getenv</span><span class="p">(</span><span class="sh">"</span><span class="s">HF_TOKEN</span><span class="sh">"</span><span class="p">)</span>
    
<span class="n">client</span> <span class="o">=</span> <span class="nc">InferenceClient</span><span class="p">(</span><span class="n">token</span><span class="o">=</span><span class="n">hf_token</span><span class="p">)</span>
<span class="n">resp</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="nf">text_generation</span><span class="p">(</span>
    <span class="n">prompt</span><span class="o">=</span><span class="sh">'</span><span class="s">Tell me a math joke</span><span class="sh">'</span><span class="p">,</span> 
    <span class="n">model</span><span class="o">=</span><span class="sh">"</span><span class="s">meta-llama/Llama-3.1-8B-Instruct</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">max_new_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>  <span class="c1"># Generate up to 100 new tokens
</span>    <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span>     <span class="c1"># Add some randomness
</span>    <span class="n">do_sample</span><span class="o">=</span><span class="bp">True</span>       <span class="c1"># Enable sampling for more creative responses
</span><span class="p">)</span>
<span class="nf">print</span><span class="p">(</span><span class="n">resp</span><span class="p">)</span>
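# The same client also speaks the OpenAI-style chat API that newer instruct
# models expect; a minimal sketch using huggingface_hub's chat_completion:
chat = client.chat_completion(
    messages=[{"role": "user", "content": "Tell me a math joke"}],
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=100,
)
print(chat.choices[0].message.content)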
</code></pre></div></div>]]></content><author><name></name></author><category term="how-to"/><category term="llm"/><category term="huggingface"/><category term="google-colab"/><category term="ai"/><category term="ml"/><category term="machine-learning"/><summary type="html"><![CDATA[Run large language models (LLMs) from Hugging Face Hub for free using Google Colab's T4 GPUs. It covers setup, authentication, and explains the key differences between pipeline and InferenceClient for local vs remote model execution.]]></summary></entry><entry><title type="html">LLM Performance on Mac: Native vs Docker Ollama Benchmark</title><link href="https://www.vchalyi.com/blog/2025/ollama-performance-benchmark-macos/" rel="alternate" type="text/html" title="LLM Performance on Mac: Native vs Docker Ollama Benchmark"/><published>2025-06-25T17:00:00+00:00</published><updated>2025-06-25T17:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2025/ollama-performance-benchmark-macos</id><content type="html" xml:base="https://www.vchalyi.com/blog/2025/ollama-performance-benchmark-macos/"><![CDATA[<h2 id="llm-runs-slow-in-docker-on-macos">LLMs run slowly in Docker on macOS</h2> <p>I have started generating a daily RSS digest using Matcha and summarizing it with Ollama. You can find more details in <a href="/blog/2025/summarize-rss-feed-with-ollama/">my previous article</a>. However, when there are many articles to summarize, the process becomes quite slow. For example, once I had to wait almost one hour to get the daily digest. This made me think about how to make it faster. My first guess was that the GPU is not being used at all. I found <a href="https://github.com/ollama/ollama/issues/3849#issuecomment-2075359242">evidence</a> for this in the Ollama GitHub repository:</p> <blockquote> <p>When you run Ollama as a native Mac application on M1 (or newer) hardware, we run the LLM on the GPU.</p> <p>Docker Desktop on Mac, does NOT expose the Apple GPU to the container runtime, it only exposes an ARM CPU (or virtual x86 CPU via Rosetta emulation) so when you run Ollama inside that container, it is running purely on CPU, not utilizing your GPU hardware.</p> <p>On PC’s NVIDIA and AMD have support for GPU pass-through into containers, so it is possible for ollama in a container to access the GPU, but this is not possible on Apple hardware.</p> </blockquote> <p>So let’s install Ollama natively on macOS and run a benchmark to compare the results.</p> <h2 id="run-llm-natively-on-macos">Run an LLM natively on macOS</h2> <ul> <li>Install ollama <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>brew <span class="nb">install </span>ollama
</code></pre></div> </div> </li> <li>Start ollama <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ollama serve
</code></pre></div> </div> </li> <li>Make sure ollama is up and running <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:11434
</code></pre></div> </div> </li> <li>Pull a model (e.g. granite3.3 by IBM) <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ollama pull granite3.3:latest
</code></pre></div> </div> </li> <li>Run a model <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ollama run granite3.3:latest
</code></pre></div> </div> </li> <li>Adding the flag <code class="language-plaintext highlighter-rouge">--verbose</code> gives you helpful information about the performance <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ollama run granite3.3:latest <span class="nt">--verbose</span>
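# --verbose appends timing stats after each response; the ones to watch are
# total duration, load duration, prompt eval rate, and eval rate (tokens/s)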
</code></pre></div> </div> </li> </ul> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/20250626/ollama-granite-local-run-480.webp 480w,/assets/img/blog/20250626/ollama-granite-local-run-800.webp 800w,/assets/img/blog/20250626/ollama-granite-local-run-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/20250626/ollama-granite-local-run.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" data-zoomable="" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <h2 id="run-a-benchmark">Run a benchmark</h2> <p>We live in remarkable times. Whenever I face a challenge, I first search online, as there is a high chance that someone else has already encountered and solved a similar problem. I was curious if there was a benchmarking tool for LLMs that I could use, and I discovered <a href="https://llm.aidatatools.com/">llm.aidatatools.com</a>.</p> <p>Install:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>llm-benchmark
</code></pre></div></div> <p>Setup:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Total memory size : 32.00 GB
cpu_info: Apple M1 Pro
gpu_info: Apple M1 Pro
os_version: macOS 14.6 <span class="o">(</span>23G80<span class="o">)</span>
ollama_version: 0.9.1
</code></pre></div></div> <p>It can automatically pick and pull models based on the RAM available on your machine, but you can also create a config file with the models you’d like to test:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">file_name</span><span class="pi">:</span> <span class="s2">"</span><span class="s">custombenchmarkmodels.yml"</span>
<span class="na">version</span><span class="pi">:</span> <span class="s">2.0.custom</span>
<span class="na">models</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="na">model</span><span class="pi">:</span> <span class="s2">"</span><span class="s">granite3.3:8b"</span>
  <span class="pi">-</span> <span class="na">model</span><span class="pi">:</span> <span class="s2">"</span><span class="s">phi4:14b"</span>
  <span class="pi">-</span> <span class="na">model</span><span class="pi">:</span> <span class="s2">"</span><span class="s">deepseek-r1:14b"</span>
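  # Assumption: model names must match Ollama tags exactly (compare with the
  # output of `ollama list`), and every model listed should fit in available RAM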
</code></pre></div></div> <p>Now you can run the benchmark with the models of your choice:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>llm_benchmark run <span class="nt">--custombenchmark</span><span class="o">=</span>path/to/custombenchmarkmodels.yml
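# Prints an eval rate (tokens/s) for each prompt plus a per-model average, as shown below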
</code></pre></div></div> <p>Here is a sample of what a benchmark looks like:</p> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>model_name =    granite3.3:8b
prompt = Summarize the key differences between classical and operant conditioning in psychology.
eval rate:            24.49 tokens/s
prompt = Translate the following English paragraph into Chinese and elaborate more -&gt; Artificial intelligence is transforming various industries by enhancing efficiency and enabling new capabilities.
eval rate:            24.93 tokens/s
prompt = What are the main causes of the American Civil War?
eval rate:            24.00 tokens/s
prompt = How does photosynthesis contribute to the carbon cycle?
eval rate:            24.14 tokens/s
prompt = Develop a python function that solves the following problem, sudoku game.
eval rate:            23.93 tokens/s
--------------------
Average of eval rate:  24.298  tokens/s
</code></pre></div></div> <p>During the native run, GPU utilization was consistently close to 100%, confirming that Ollama was able to leverage the Apple M1 Pro’s GPU for accelerated inference.</p> <p>The same benchmark was performed with Ollama running inside a Docker container, where GPU usage was not detected. As expected, the evaluation rates were significantly lower, and the models relied solely on CPU resources, resulting in much slower inference times.</p> <h2 id="benchmark-results">Benchmark results</h2> <p>The results of benchmarking Ollama running natively and in a Docker container:</p> <table> <thead> <tr> <th style="text-align: left">Model</th> <th style="text-align: center">Avg. Eval Rate (tokens/s)</th> <th style="text-align: center">GPU Utilization</th> <th style="text-align: left">Notes</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">granite3.3:8b</td> <td style="text-align: center">24.3</td> <td style="text-align: center">~100%</td> <td style="text-align: left">Native, Apple M1 Pro</td> </tr> <tr> <td style="text-align: left">phi4:14b</td> <td style="text-align: center">14.5</td> <td style="text-align: center">~100%</td> <td style="text-align: left">Native, Apple M1 Pro</td> </tr> <tr> <td style="text-align: left">deepseek-r1:14b</td> <td style="text-align: center">13.7</td> <td style="text-align: center">~100%</td> <td style="text-align: left">Native, Apple M1 Pro</td> </tr> <tr> <td style="text-align: left">granite3.3:8b</td> <td style="text-align: center">4.3</td> <td style="text-align: center">0%</td> <td style="text-align: left">Docker, Apple M1 Pro</td> </tr> <tr> <td style="text-align: left">phi4:14b</td> <td style="text-align: center">2.3</td> <td style="text-align: center">0%</td> <td style="text-align: left">Docker, Apple M1 Pro</td> </tr> <tr> <td style="text-align: left">deepseek-r1:14b</td> <td style="text-align: center">2.4</td> <td style="text-align: center">0%</td> <td style="text-align: left">Docker, Apple M1 Pro</td> </tr> </tbody> </table> <p>The takeaway from the table is clear: running Ollama natively on a Mac with Apple Silicon delivers 5–6 times faster LLM inference than Docker, thanks to full GPU utilization, while the Docker runs are limited to the CPU and are significantly slower.</p>]]></content><author><name></name></author><category term="how-to"/><category term="llm"/><category term="ollama"/><category term="ai"/><category term="benchmark"/><category term="gpu"/><category term="apple"/><category term="docker"/><summary type="html"><![CDATA[Discover how to speed up large language model (LLM) inference on Mac by running Ollama natively to leverage Apple Silicon GPU acceleration.
This guide compares native and Docker performance, provides step-by-step setup instructions, and shares real benchmark results to help you optimize your AI workflows on macOS.]]></summary></entry><entry><title type="html">Summarize RSS Feeds with Local LLMs: Ollama, Open-WebUI, and Matcha Guide</title><link href="https://www.vchalyi.com/blog/2025/summarize-rss-feed-with-ollama/" rel="alternate" type="text/html" title="Summarize RSS Feeds with Local LLMs: Ollama, Open-WebUI, and Matcha Guide"/><published>2025-06-19T17:00:00+00:00</published><updated>2025-06-19T17:00:00+00:00</updated><id>https://www.vchalyi.com/blog/2025/summarize-rss-feed-with-ollama</id><content type="html" xml:base="https://www.vchalyi.com/blog/2025/summarize-rss-feed-with-ollama/"><![CDATA[<h2 id="fear-of-missing-out-fomo">Fear of missing out (FOMO)</h2> <div class="row justify-content-sm-center"> <div class="col-sm-8 mt-3 mt-md-0"> Today, we get a lot of news and updates all the time. It is easy to feel worried about missing something important. There is so much information that it can be hard to know what really matters. <br/><br/> Striking a balance between staying updated and avoiding information overload is essential, especially in the fast-moving IT industry. The urge to constantly monitor news can quickly drain your productivity and impact your well-being. This article shows how you can leverage large language models (LLMs) to automatically summarize RSS feeds, helping you keep up with key developments efficiently — so you can focus on what truly matters. <br/><br/> I am pretty sure there are already services that can summarize your RSS feeds for a fee, but fortunately, you can run LLMs locally to save money and gain experience with the tooling at the same time. <br/><br/> </div> <div class="col-sm-4 mt-3 mt-md-0"> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/fear-of-missing-out-480.webp 480w,/assets/img/blog/fear-of-missing-out-800.webp 800w,/assets/img/blog/fear-of-missing-out-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/fear-of-missing-out.jpg" class="img-fluid rounded z-depth-1" width="100%" height="auto" title="Fear of missing out" loading="lazy" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> </div> </div> <h2 id="running-llm-locally">Running LLM locally</h2> <p>There are two popular tools to run large language models (LLMs) locally:</p> <ul> <li><a href="https://ollama.com/">Ollama</a></li> <li><a href="https://localai.io/">Local-AI</a></li> </ul> <h3 id="adjust-resources-in-docker-environment">Adjust resources in the Docker environment</h3> <p>If you run these tools in a Docker container, you should increase the CPU and memory limits in your Docker environment. LLMs are resource-intensive, requiring significant memory and CPU power. By default, Colima (a lightweight container runtime commonly used instead of Docker Desktop on macOS) allocates 2 CPUs and 2 GB of memory. To adjust these settings:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>colima start <span class="nt">--cpu</span> 6 <span class="nt">--memory</span> 10
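# Note: the flags take effect when the VM starts; if colima is already
# running, stop it first with: colima stop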
</code></pre></div></div> <p>Current resource allocation can be checked:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>colima list
</code></pre></div></div> <p>Local-AI can be used as an <a href="https://localai.io/basics/getting_started/#running-localai-with-all-in-one-aio-images">all-in-one Docker image</a> with a pre-configured set of models (text-to-speech, speech-to-text, image generation, etc.) and it has a nice UI. However, after installing it, chat completion with deepseek-r1 did not work for me. I could not find a solution in the official repository issues, so I decided to use Ollama instead.</p> <h3 id="ollama--open-webui">Ollama &amp; open-webui</h3> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/ollama-in-docker-480.webp 480w,/assets/img/blog/ollama-in-docker-800.webp 800w,/assets/img/blog/ollama-in-docker-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/ollama-in-docker.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>Ollama can also run in a Docker container, but it doesn’t come with a UI. Fortunately, <a href="https://github.com/open-webui/open-webui">Open-WebUI</a> provides a user-friendly web interface for interacting with your locally running LLMs. It makes it easy to chat with models, manage conversations, and access advanced features without needing to use the command line.</p> <p>I asked GitHub Copilot to create a docker-compose file for me, and with a few tweaks I got a working docker-compose.yaml:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">services</span><span class="pi">:</span>
  <span class="na">ollama</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">ollama/ollama:latest</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">ollama</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">11434:11434"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">ollama_data:/root/.ollama</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">OLLAMA_MODELS=/root/.ollama/models</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>
  <span class="na">open-webui</span><span class="pi">:</span>
    <span class="na">image</span><span class="pi">:</span> <span class="s">ghcr.io/open-webui/open-webui:main</span>
    <span class="na">container_name</span><span class="pi">:</span> <span class="s">open-webui</span>
    <span class="na">ports</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s2">"</span><span class="s">3000:8080"</span>
    <span class="na">volumes</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">ollama_data:/app/backend/data</span>
    <span class="na">environment</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">OLLAMA_API_BASE_URL=http://ollama:11434</span>
      <span class="pi">-</span> <span class="s">WEBUI_AUTH=False</span>
    <span class="na">depends_on</span><span class="pi">:</span>
      <span class="pi">-</span> <span class="s">ollama</span>
    <span class="na">restart</span><span class="pi">:</span> <span class="s">unless-stopped</span>

<span class="na">volumes</span><span class="pi">:</span>
  <span class="na">ollama_data</span><span class="pi">:</span>
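# Bring the stack up in the background from the folder holding this file:
#   docker compose up -d
# Note that open-webui reuses the ollama_data volume above; a separate volume
# for /app/backend/data is a common alternative layout.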
</code></pre></div></div> <p>Ollama supports many models. See the full list <a href="https://ollama.com/library">here</a>. The official Ollama Docker container does not include any models by default. To install a model, connect to the running container and pull it manually (an example with granite by IBM):</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>docker <span class="nb">exec</span> <span class="nt">-it</span> ollama ollama pull granite3.3:latest
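# List the models already available inside the container:
docker exec -it ollama ollama list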
</code></pre></div></div> <p>The attached volume keeps your models, so you do not need to reinstall them after restarting. Verify that Ollama is up and running:</p> <div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl http://localhost:11434/
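# Expect a plain "Ollama is running" reply; the JSON API lives under /api, e.g.:
curl http://localhost:11434/api/tags   # lists locally installed models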
</code></pre></div></div> <p>Open-WebUI will be available at http://localhost:3000/:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/open-webui-480.webp 480w,/assets/img/blog/open-webui-800.webp 800w,/assets/img/blog/open-webui-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/open-webui.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>One of the key advantages of Ollama is its REST API, which is compatible with the OpenAI API. This compatibility allows you to seamlessly integrate Ollama with applications and tools designed for OpenAI, making it a flexible choice for local LLM deployments.</p> <h2 id="summmarize-rss-feeds">Summarize RSS feeds</h2> <h3 id="matcha">Matcha</h3> <p>When I was looking for a free service to summarize RSS feeds, I came across <a href="https://github.com/piqoni/matcha">Matcha</a>. It’s an app written in Go, which makes it quite easy to fork and extend if you need to.</p> <p>This is how the author of the app describes it:</p> <blockquote> <p>Matcha is a daily digest generator for your RSS feeds and interested topics/keywords. By using any markdown file viewer (such as Obsidian) or directly from terminal (-t option), you can read your RSS articles whenever you want at your pace, thus avoiding FOMO throughout the day.</p> </blockquote> <p>Once you’ve downloaded the corresponding binary from <a href="https://github.com/piqoni/matcha/releases">the release page</a>, you can run this single binary to generate the default config.yml file:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">markdown_dir_path</span><span class="pi">:</span>
<span class="na">feeds</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">http://hnrss.org/best </span><span class="m">10</span>
  <span class="pi">-</span> <span class="s">https://waitbutwhy.com/feed</span>
  <span class="pi">-</span> <span class="s">http://tonsky.me/blog/atom.xml</span>
  <span class="pi">-</span> <span class="s">http://www.joelonsoftware.com/rss.xml</span>
  <span class="pi">-</span> <span class="s">https://www.youtube.com/feeds/videos.xml?channel_id=UCHnyfMqiRRG1u-2MsSQLbXA</span>
<span class="na">google_news_keywords</span><span class="pi">:</span> <span class="s">George Hotz,ChatGPT,Copenhagen</span>
<span class="na">instapaper</span><span class="pi">:</span> <span class="kc">true</span>
<span class="na">weather_latitude</span><span class="pi">:</span> <span class="m">37.77</span>
<span class="na">weather_longitude</span><span class="pi">:</span> <span class="m">122.41</span>
<span class="na">terminal_mode</span><span class="pi">:</span> <span class="kc">false</span>
<span class="na">opml_file_path</span><span class="pi">:</span>
<span class="na">markdown_file_prefix</span><span class="pi">:</span>
<span class="na">markdown_file_suffix</span><span class="pi">:</span>
<span class="na">reading_time</span><span class="pi">:</span> <span class="kc">false</span>
<span class="na">sunrise_sunset</span><span class="pi">:</span> <span class="kc">false</span>
<span class="na">openai_api_key</span><span class="pi">:</span>
<span class="na">openai_base_url</span><span class="pi">:</span>
<span class="na">openai_model</span><span class="pi">:</span>
<span class="na">summary_feeds</span><span class="pi">:</span>
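# Assumption from the hnrss.org example above: a trailing number after a feed
# URL caps how many items Matcha pulls from that feed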
</code></pre></div></div> <p>You can specify your favorite RSS feeds and Google News keywords of interest, then run the Matcha binary again. It will generate a well-formatted markdown file that you can open with any markdown reader. The author recommends Obsidian, which is popular for its local-first approach and support for plain <code class="language-plaintext highlighter-rouge">.md</code> files. Personally, I use Notion for note-taking, but I have considered Obsidian in the past. While Obsidian’s paid subscription for cross-device sync wasn’t appealing to me at the time, its use of standard markdown files makes it an excellent choice for reading and organizing these summaries.</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/matcha-rss-feeds-480.webp 480w,/assets/img/blog/matcha-rss-feeds-800.webp 800w,/assets/img/blog/matcha-rss-feeds-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/matcha-rss-feeds.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>The most interesting part is the <code class="language-plaintext highlighter-rouge">summary_feeds</code> section, where you can specify which feeds you want to have summarized by your LLM. Now, let’s bring everything together and configure Matcha to use your locally running Ollama model. Here’s the relevant part of the config file:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">openai_api_key</span><span class="pi">:</span>
<span class="na">openai_base_url</span><span class="pi">:</span> <span class="s">http://localhost:11434/v1</span>
<span class="na">openai_model</span><span class="pi">:</span> <span class="s">granite3.3:latest</span>
<span class="na">summary_feeds</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">https://www.lennysnewsletter.com/feed</span>
    <span class="pi">-</span> <span class="s">https://newsletter.systemdesign.one/feed</span>
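# openai_base_url points at Ollama's OpenAI-compatible /v1 endpoint; Ollama
# itself ignores the API key, so openai_api_key stays empty here (assumption:
# Matcha tolerates an empty key; set a dummy value if it complains)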
</code></pre></div></div> <p>Run the binary again. The output includes links to the original articles and a summary for each of them:</p> <figure> <picture> <source class="responsive-img-srcset" srcset="/assets/img/blog/matcha-summary-480.webp 480w,/assets/img/blog/matcha-summary-800.webp 800w,/assets/img/blog/matcha-summary-1400.webp 1400w," type="image/webp" sizes="95vw"/> <img src="/assets/img/blog/matcha-summary.png" class="img-fluid rounded z-depth-1" width="100%" height="auto" loading="eager" onerror="this.onerror=null; $('.responsive-img-srcset').remove();"/> </picture> </figure> <p>With this setup, you can now run Matcha daily to automatically generate a digest of your favorite RSS feeds. The summaries are created using your local LLM, ensuring privacy and cost savings. This workflow helps you stay informed without being overwhelmed by information overload. Get yourself some coffee ☕️ and enjoy!</p> <p>There is room for improvement:</p> <ul> <li>Point Matcha’s markdown output at a synced folder (e.g. Dropbox or Google Drive) so the digest is available on all your devices.</li> <li>To avoid excessive costs (especially when using OpenAI), the author limits each article’s text to 5,000 characters before submitting it for summarization—so not the entire article is used.</li> <li>Automate Matcha runs (e.g. with a cron job) so the daily digest of your RSS feeds and their summaries generates itself.</li> </ul>]]></content><author><name></name></author><category term="how-to"/><category term="llm"/><category term="ollama"/><category term="ai"/><category term="open-webui"/><category term="rss"/><category term="matcha"/><summary type="html"><![CDATA[Learn how to automatically summarize RSS feeds using local large language models (LLMs) with Ollama, Open-WebUI, and Matcha. This step-by-step guide covers running LLMs in Docker, integrating with OpenAI-compatible APIs, and generating daily markdown digests for efficient news consumption—no subscription required.]]></summary></entry></feed>