Programming • Jan 23, 2026 • Cliff

Posix In Python Learning Series (Part 3)

Tags: Python, Development, linux, posix, Testing, CLI, grep, Debugging, Bug-Fix, Tutorial

Fixing Line Number Tracking in a Multi-File grep Implementation

The Bug

While implementing a grep command-line tool in Python, I encountered a subtle but significant bug in line number tracking. The tool needed to search through multiple files and optionally display line numbers with the -n flag, similar to GNU grep. However, the line numbers were only accurate for the first file; subsequent files would show incorrect line numbers, offset by the cumulative line count of all previous files.

The Problem

Here's the original buggy code:

line_number = 0

try:
    for line, source in file_input_handler(targets):
        line_number += 1
        if regex.search(line):
            # ... output the match with line_number

The issue is straightforward once you see it: line_number is initialized once before processing begins and then increments continuously through all files.

Example of the Bug

Suppose we're searching through two files:

file1.txt (3 lines):

apple
banana
cherry

file2.txt (3 lines):

dog
elephant
fox

If we run grep -n "e" file1.txt file2.txt, we'd expect:

file1.txt:1:apple
file1.txt:3:cherry
file2.txt:2:elephant

But instead we got:

file1.txt:1:apple
file1.txt:3:cherry
file2.txt:5:elephant  ← Wrong! Should be line 2

The line number for file2.txt is offset by 3, the number of lines in file1.txt. (Note that fox is not a match, since it contains no "e".)
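To make the offset easy to reproduce, here is a minimal self-contained sketch. The real file_input_handler() is not shown in this post, so the version below is a hypothetical in-memory stand-in that yields (line, source) pairs from a dict, and buggy_grep is an illustrative name rather than the tool's actual function.

```python
import re


# Hypothetical in-memory stand-in for the real generator, which is not shown here.
def file_input_handler(files):
    """Yield (line, source) pairs from a {name: [lines]} mapping, file after file."""
    for name, lines in files.items():
        for line in lines:
            yield line, name


def buggy_grep(pattern, files):
    """Collect 'source:line_number:line' strings using the single global counter."""
    regex = re.compile(pattern)
    results = []
    line_number = 0  # set once, never reset between files
    for line, source in file_input_handler(files):
        line_number += 1
        if regex.search(line):
            results.append(f"{source}:{line_number}:{line}")
    return results


FILES = {
    "file1.txt": ["apple", "banana", "cherry"],
    "file2.txt": ["dog", "elephant", "fox"],
}
buggy_output = buggy_grep("e", FILES)
print("\n".join(buggy_output))
```

With this data, elephant is reported at line 5 instead of line 2, because the counter has already reached 3 by the end of file1.txt.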

Why Not Just Use enumerate()?

A common first instinct might be: "Why not just use Python's enumerate() function?" After all, it's designed for tracking indices in loops:

for idx, (line, source) in enumerate(file_input_handler(targets), start=1):
    # Use idx as line_number

The problem is that enumerate() gives us the index within the entire iteration, not per file. The file_input_handler() generator yields lines from all files sequentially in one continuous stream:

(line1_from_file1, "file1.txt")
(line2_from_file1, "file1.txt")
(line3_from_file1, "file1.txt")
(line1_from_file2, "file2.txt")  ← enumerate would say this is index 4!
(line2_from_file2, "file2.txt")  ← enumerate would say this is index 5!

So enumerate(), applied to the combined stream, would reproduce exactly the bug we started with, because it has no way of knowing when we've moved to a new file.
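A quick check makes this concrete. The list below is a hypothetical stream shaped like the generator's output, not data from the real tool; the point is only that the first line of the second file gets index 4 rather than 1.

```python
# Hypothetical stream of (line, source) pairs, shaped like the generator's output.
stream = [
    ("apple", "file1.txt"),
    ("banana", "file1.txt"),
    ("cherry", "file1.txt"),
    ("dog", "file2.txt"),
    ("elephant", "file2.txt"),
]

# enumerate() numbers the whole stream; it never looks at `source`.
numbered = [(idx, source) for idx, (line, source) in enumerate(stream, start=1)]

# Index assigned to the first line that comes from file2.txt.
first_of_file2 = next(idx for idx, source in numbered if source == "file2.txt")
print(first_of_file2)  # 4, not 1
```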

The Solution

The fix requires tracking which file we're currently processing and resetting the line counter when we encounter a new file:

line_number = 0
current_source = None

try:
    for line, source in file_input_handler(targets):
        # Reset line number when we move to a new file
        if source != current_source:
            current_source = source
            line_number = 0

        line_number += 1
        if regex.search(line):
            # ... output the match with line_number

Now each time source changes (indicating we've moved to a new file), we:

1. Update current_source to track our new location
2. Reset line_number to 0
3. Then increment it to 1 for the first line of the new file
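For readers who want to run the fix in isolation, here is a sketch under the same assumptions as before: file_input_handler() is replaced by a hypothetical in-memory stand-in, and grep_with_line_numbers is an illustrative name, not the tool's real entry point. The loop body follows the reset logic described above.

```python
import re


# Hypothetical in-memory stand-in for the real generator, which is not shown here.
def file_input_handler(files):
    """Yield (line, source) pairs from a {name: [lines]} mapping, file after file."""
    for name, lines in files.items():
        for line in lines:
            yield line, name


def grep_with_line_numbers(pattern, files):
    """Collect 'source:line_number:line' strings, restarting the count per source."""
    regex = re.compile(pattern)
    results = []
    line_number = 0
    current_source = None
    for line, source in file_input_handler(files):
        # Reset the counter whenever the source changes.
        if source != current_source:
            current_source = source
            line_number = 0
        line_number += 1
        if regex.search(line):
            results.append(f"{source}:{line_number}:{line}")
    return results


FILES = {
    "file1.txt": ["apple", "banana", "cherry"],
    "file2.txt": ["dog", "elephant", "fox"],
}
fixed_output = grep_with_line_numbers("e", FILES)
print("\n".join(fixed_output))
```

One caveat worth keeping in mind, assuming the generator yields the same source value for repeated arguments: if the same path is passed twice in a row, the name never changes, so a comparison on source alone would not reset the counter between the two passes.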

Testing the Fix

To ensure this bug doesn't resurface, we need a test that specifically verifies line numbers reset for each file. Here's the test I wrote:

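from pathlib import Path

import pytest

# main, RETURN_CODES, and write_lines come from the project and its test helpers.

def test_line_numbers_reset_for_each_file(tmp_path: Path, capsys: pytest.CaptureFixture[str]):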
    """Test that line numbers reset to 1 for each new file when using -n flag."""
    # Create two files with matches at different positions
    file1 = write_lines(tmp_path, "file1.txt", ["alpha", "needle", "beta", "needle"])
    file2 = write_lines(tmp_path, "file2.txt", ["gamma", "needle", "delta"])

    code = main(["needle", "-H", "-n", str(file1), str(file2)])
    out = capsys.readouterr().out.strip().splitlines()

    assert code == RETURN_CODES["SUCCESS"]
    assert len(out) == 3

    # file1.txt should have matches at lines 2 and 4
    assert f"{file1}:2:needle" in out
    assert f"{file1}:4:needle" in out

    # file2.txt should have a match at line 2 (not line 6!)
    # This is the critical assertion - if line numbers don't reset,
    # this would be line 6 (4 lines from file1 + 2 lines into file2)
    assert f"{file2}:2:needle" in out

Why This Test Works

This test is specifically designed to catch the boundary condition:

Multiple files: We need at least two files to expose the bug
Strategic match positions: The match in file2 is at line 2, which would incorrectly report as line 6 with the bug (4 lines from file1 + 2 lines into file2)
Both -H and -n flags: We need -H (with-filename) to distinguish outputs and -n (line-number) to verify the counters
Explicit assertions: Each expected output line is asserted as an exact string, and the comment above the last assertion documents what the buggy code would have produced

The beauty of this test is that it would fail with the buggy code (expecting "file2.txt:2:needle" but getting "file2.txt:6:needle") and pass with the fixed code.
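The test relies on a write_lines helper that isn't shown in this post. A plausible version, inferred only from how the test calls it (directory, file name, list of lines, returning the path), might look like the sketch below; the real helper may differ, for example in how it handles the trailing newline.

```python
import tempfile
from pathlib import Path


def write_lines(directory: Path, name: str, lines: list[str]) -> Path:
    """Hypothetical helper: write each item as a newline-terminated line, return the path."""
    path = directory / name
    path.write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")
    return path


# Usage outside pytest, with a temporary directory standing in for tmp_path.
with tempfile.TemporaryDirectory() as tmp:
    sample = write_lines(Path(tmp), "file1.txt", ["alpha", "needle"])
    content = sample.read_text(encoding="utf-8")

print(repr(content))
```

Returning the path keeps the test readable: the same value is passed to main() and used to build the expected output strings.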

Testing Strategies for Boundary Conditions

When testing tools that process multiple inputs, always consider:

Single input: Does it work for one file? (baseline)
Multiple inputs: Does state reset between inputs?
Empty inputs: What happens with zero-length files?
Mixed sources: stdin and files, if supported
Edge positions: Matches at first/last line of each file
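Most of this checklist can be turned into a small table of cases. The sketch below is illustrative only: numbered_matches and as_stream are hypothetical helpers that apply the reset-on-new-source logic to in-memory data, so the cases run without touching the file system. Mixed stdin and file input is left out because it depends on how the real tool names stdin.

```python
import re


def numbered_matches(pattern, stream):
    """Hypothetical helper: yield (source, line_number) for each matching line."""
    regex = re.compile(pattern)
    line_number, current_source = 0, None
    for line, source in stream:
        if source != current_source:
            current_source, line_number = source, 0
        line_number += 1
        if regex.search(line):
            yield source, line_number


def as_stream(files):
    """Flatten {name: [lines]} into (line, source) pairs; empty files yield nothing."""
    return [(line, name) for name, lines in files.items() for line in lines]


# Each case: description -> (input files, expected (source, line_number) pairs).
CASES = {
    "single input": ({"a": ["x", "needle"]}, [("a", 2)]),
    "state resets between inputs": (
        {"a": ["x", "needle"], "b": ["needle"]},
        [("a", 2), ("b", 1)],
    ),
    "empty file in the middle": (
        {"a": ["needle"], "b": [], "c": ["x", "needle"]},
        [("a", 1), ("c", 2)],
    ),
    "matches on first and last lines": (
        {"a": ["needle", "x", "needle"], "b": ["needle", "needle"]},
        [("a", 1), ("a", 3), ("b", 1), ("b", 2)],
    ),
}

failures = [
    name
    for name, (files, expected) in CASES.items()
    if list(numbered_matches("needle", as_stream(files))) != expected
]
print(failures)
```

In a pytest suite the same table would map naturally onto pytest.mark.parametrize, with one test ID per case.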

This bug is a perfect example of why unit tests should go beyond the "happy path" and exercise boundary conditions where state transitions occur.

Key Takeaways

State management matters: When processing multiple sources in a single loop, you need explicit state tracking to know when boundaries are crossed.
Built-in tools have limitations: enumerate() is excellent for tracking position in a single sequence, but it doesn't understand semantic boundaries like "we're now in a different file."
Context is crucial: The line number only makes sense within the context of a specific file. When that context changes, the counter must reset.
Test with multiple inputs: This bug wouldn't have been caught by only testing with a single file or stdin. Multi-file testing revealed the issue immediately.

This type of bug is common in tools that process multiple inputs sequentially, whether files, streams, or data batches. The solution pattern (track current context, reset counters on context change) applies broadly beyond just line counting in grep implementations.
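As one way to generalize the pattern, the sketch below swaps the manual counter for itertools.groupby, which groups consecutive items sharing a key, and then runs enumerate() inside each group. This is an alternative formulation, not the approach used in the fix above, and number_within_context and the batch data are made up for illustration.

```python
from itertools import groupby


def number_within_context(items, key):
    """Yield (position, item), restarting position at 1 whenever key(item) changes."""
    for _, group in groupby(items, key=key):
        # enumerate() is safe here because it only ever sees one context at a time.
        for position, item in enumerate(group, start=1):
            yield position, item


# Hypothetical records tagged with the batch they came from.
batches = [
    ("batch-a", 10),
    ("batch-a", 11),
    ("batch-b", 20),
    ("batch-b", 21),
    ("batch-b", 22),
]
positions = [pos for pos, _ in number_within_context(batches, key=lambda rec: rec[0])]
print(positions)  # [1, 2, 1, 2, 3]
```

Like the source comparison in the fix, groupby only detects a boundary when the key actually changes between consecutive items.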

