Wyatt Weaver

PDF Condensor came from a very specific workflow problem: insurance documents can be enormous, but the tools people want to use with them often have hard upload limits. The source is available at UwUGreed/PDF-Condensor.

The first tempting solution is to extract the whole PDF into text or Markdown. That is useful for some workflows, but it is the wrong move when the document needs to stay visually intact. Policies, endorsements, schedules, forms, and scanned pages often matter as pages, not just as raw text.

So I built a smaller tool with a narrower job: keep the PDF as a PDF, preserve page order, and split it into upload-sized chunks.

The Product Shape

The app is a Streamlit interface with two modes:

Split by maximum file size.
Split by fixed page count.

The default size target is 29 MB because that gives a little room under a 30 MB upload cap. If the original file already fits, the app tells the user not to split it. If one page is too large to fit under the configured cap, it fails loudly instead of pretending the output is safe.
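As a rough illustration of those two guardrails, here is a minimal sketch. The names `limit_in_bytes`, `needs_split`, `ensure_page_fits`, and `PageTooLargeError` are hypothetical, and treating 1 MB as 1,048,576 bytes is an assumption; the actual app may define the limit differently.

```python
MB = 1024 * 1024          # assumption: binary megabytes
DEFAULT_LIMIT_MB = 29     # one MB of headroom under a 30 MB upload cap


class PageTooLargeError(ValueError):
    """Hypothetical error type for a single page that cannot fit."""


def limit_in_bytes(limit_mb: float = DEFAULT_LIMIT_MB) -> int:
    """Convert the user-facing MB setting into a byte limit."""
    return int(limit_mb * MB)


def needs_split(file_size_bytes: int, limit_bytes: int) -> bool:
    """A file that already fits under the limit should be left alone."""
    return file_size_bytes > limit_bytes


def ensure_page_fits(page_number: int, page_size_bytes: int, limit_bytes: int) -> None:
    """Fail loudly when one page on its own is larger than the limit."""
    if page_size_bytes > limit_bytes:
        raise PageTooLargeError(
            f"Page {page_number} is {page_size_bytes} bytes, "
            f"which exceeds the {limit_bytes}-byte limit."
        )
```

Raising an exception rather than emitting an oversized chunk keeps the failure visible: the user never receives a file that looks fine but will be rejected at upload time.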

That sounds simple, but the simplicity is the point. The tool does not OCR, compress, rewrite, summarize, or alter the content. It just prepares the document for the next step.

How It Works

The core is pypdf. For size-based splitting, the app tests page ranges and uses a binary-search-style loop to find the largest consecutive chunk that fits under the selected byte limit.

That packs each chunk as close to the limit as the page boundaries allow, without forcing the user to guess page counts manually. A 400-page file might split into uneven ranges if some pages are image-heavy, and that is exactly what you want when the real constraint is file size rather than page count.
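The size-based search can be sketched roughly as follows. This is not the app's actual code: `largest_fitting_end`, `plan_chunks`, and `make_pdf_measure` are hypothetical names, and the sketch assumes that adding pages to a range never makes the written file smaller. The search takes a `measure(start, end)` callback so the logic can be tried without a real PDF; `make_pdf_measure` shows one way to back it with pypdf by writing each candidate range to memory.

```python
import io
from typing import Callable, List, Tuple

Measure = Callable[[int, int], int]  # (start, end_exclusive) -> size in bytes


def largest_fitting_end(start: int, total_pages: int, limit_bytes: int,
                        measure: Measure) -> int:
    """Return the largest end (exclusive) such that pages [start, end) fit."""
    if measure(start, start + 1) > limit_bytes:
        raise ValueError(f"Page {start + 1} alone exceeds the limit.")
    lo, hi = start + 1, total_pages  # invariant: [start, lo) is known to fit
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so the loop always makes progress
        if measure(start, mid) <= limit_bytes:
            lo = mid
        else:
            hi = mid - 1
    return lo


def plan_chunks(total_pages: int, limit_bytes: int,
                measure: Measure) -> List[Tuple[int, int]]:
    """Greedily take the largest fitting range, then continue after it."""
    chunks = []
    start = 0
    while start < total_pages:
        end = largest_fitting_end(start, total_pages, limit_bytes, measure)
        chunks.append((start, end))
        start = end
    return chunks


def make_pdf_measure(path: str) -> Tuple[Measure, int]:
    """Build a measure() backed by pypdf, plus the document's page count."""
    # Imported here so the search above can be exercised without pypdf installed.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(path)

    def measure(start: int, end: int) -> int:
        writer = PdfWriter()
        for index in range(start, end):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        return buffer.getbuffer().nbytes

    return measure, len(reader.pages)
```

With made-up page sizes of 10, 10, 50, 10, and 10 bytes and a 60-byte limit, `plan_chunks` returns the zero-based ranges `(0, 2)`, `(2, 4)`, and `(4, 5)`: the heavy third page forces an early cut, which is the uneven splitting described above.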

The output filenames include page ranges, which makes it easier to keep the document sequence straight:

policy-pages-0001-0148.pdf
policy-pages-0149-0296.pdf
policy-pages-0297-0410.pdf
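A naming scheme like the one above could be produced with a small helper. The function name `chunk_filename` and the fixed four-digit padding are assumptions inferred from the example filenames, not taken from the app's source.

```python
from pathlib import Path


def chunk_filename(original_name: str, first_page: int, last_page: int,
                   width: int = 4) -> str:
    """Build an output name from 1-based, inclusive page numbers."""
    stem = Path(original_name).stem  # "policy.pdf" -> "policy"
    return f"{stem}-pages-{first_page:0{width}d}-{last_page:0{width}d}.pdf"


print(chunk_filename("policy.pdf", 1, 148))    # policy-pages-0001-0148.pdf
print(chunk_filename("policy.pdf", 149, 296))  # policy-pages-0149-0296.pdf
```

Zero-padding the page numbers means the files sort correctly in any file browser, so the sequence survives even after the chunks are downloaded separately.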

Design Decisions

The most important decision was what not to do.

I did not want the app to become a general-purpose document AI tool. That would have made it slower, riskier, and harder to trust. The use case was operational: take a large file, make it uploadable, and avoid changing the document's meaning.

That led to a few clear boundaries:

Process files locally.
Avoid third-party document APIs.
Keep original pages intact.
Return individual PDFs instead of bundling everything into a zip.
Make the output page ranges obvious.

Why It Matters

This is not the flashiest project from the internship, but it is one of the cleanest examples of a useful internal tool. It removes friction from a real workflow without asking the user to learn a new system or trust a complicated automation.

Good tools do not always need to be huge. Sometimes the best build is the one that does one annoying job cleanly and then gets out of the way.