meabed/nutch2-index-html — Lab

nutch2-index-html is an older search/indexing project, but it captures a thread that still shows up in my work: make the pipeline explicit, keep the transformation close to the crawler, and do not lose useful source data before search needs it.

The job

Apache Nutch crawls pages, then the indexing layer decides what survives into search. This plugin keeps HTML content available for indexing in Nutch 2.x setups that need it.

That sounds small because the useful work is small. Crawlers, parsers, and indexers all have a habit of throwing away context too early. Later, search quality or analysis asks for that context back, and the team has to decide whether to recrawl, rebuild, or accept worse results.

Keeping HTML available at the right boundary is a defensive pipeline decision. It preserves source material until the search layer has enough information to decide what matters.

flowchart LR
    A[crawl page] --> B[parse content]
    B --> C[preserve HTML]
    C --> D[indexing layer]
    D --> E[search decisions]
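The same flow can be sketched in a few lines of Python. This is not the plugin's code (the plugin targets Nutch 2.x) and it uses no Nutch API; the names `ParsedPage`, `parse`, `to_index_document`, and the `raw_html` key are hypothetical, chosen only to show the source markup travelling next to the extracted text until the indexing side decides what to do with it.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects non-empty text nodes; tags and attributes are discarded."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


@dataclass
class ParsedPage:
    url: str
    text: str      # what a text-only parser would hand to the indexer
    raw_html: str  # the source markup, carried forward unchanged


def parse(url: str, html: str) -> ParsedPage:
    """Parse step: extract text, but keep the original HTML alongside it."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return ParsedPage(url=url, text=" ".join(extractor.chunks), raw_html=html)


def to_index_document(page: ParsedPage, keep_html: bool = True) -> dict[str, str]:
    """Indexing step: this is the boundary where the HTML is kept or dropped."""
    doc = {"url": page.url, "content": page.text}
    if keep_html:
        doc["raw_html"] = page.raw_html
    return doc
```

The point of the `keep_html` flag is that dropping the markup becomes a visible choice at the indexing boundary instead of a side effect of parsing.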

Why I keep it listed

It is not fashionable. It is a small, concrete example of solving the problem in front of the system: crawler, parser, indexer, search. The shape still matters in newer data pipelines.

The plugin adds a field for raw HTML content so the indexing side can decide whether that source is useful. That is an old Apache Nutch detail, but the product lesson is current: do not let a storage or indexing shortcut accidentally decide the future of search quality.

The operational edge

The install path is explicit. Copy the plugin, wire it into the Nutch build, enable it in the plugin configuration, add the field to the Solr schema, and run the crawler. That explicitness is why the repo is still interesting. Every step names the boundary where data can disappear.
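Two of those boundaries, the plugin configuration and the Solr schema, can be checked before a crawl starts. The sketch below is a minimal pre-flight check, not part of the repo. It assumes the usual `plugin.includes` property in `nutch-site.xml` and `<field name="...">` entries in the Solr schema. The plugin id `index-html` and the field name `rawcontent` are placeholders for illustration, and Python's `re` stands in for the Java regex matching Nutch applies to plugin ids, which behaves the same for simple alternations like the one shown.

```python
import re
import xml.etree.ElementTree as ET

PLUGIN_ID = "index-html"    # placeholder plugin id, an assumption
HTML_FIELD = "rawcontent"   # placeholder Solr field name, an assumption


def plugin_enabled(nutch_site_xml: str, plugin_id: str) -> bool:
    """True if the plugin.includes regex in nutch-site.xml matches plugin_id."""
    root = ET.fromstring(nutch_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name", default="").strip() == "plugin.includes":
            pattern = prop.findtext("value", default="").strip()
            return re.fullmatch(pattern, plugin_id) is not None
    return False


def schema_has_field(schema_xml: str, field_name: str) -> bool:
    """True if the Solr schema declares a <field> with the given name."""
    root = ET.fromstring(schema_xml)
    return any(f.get("name") == field_name for f in root.iter("field"))


# Usage with inline sample documents (contents are illustrative only).
site = """<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-(html|tika)|index-(basic|html)</value>
  </property>
</configuration>"""

schema = """<schema name="nutch" version="1.5">
  <fields>
    <field name="content" type="text" stored="false" indexed="true"/>
    <field name="rawcontent" type="text" stored="true" indexed="true"/>
  </fields>
</schema>"""

if not plugin_enabled(site, PLUGIN_ID):
    raise SystemExit("plugin is not enabled in plugin.includes")
if not schema_has_field(schema, HTML_FIELD):
    raise SystemExit("Solr schema has no field for the raw HTML")
```

If either check fails, the HTML would be dropped silently at that step, which is exactly the kind of loss the install instructions make visible.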

Modern data systems hide more of that behind managed services, but the responsibility is the same: know what is preserved, what is transformed, and what the next system can still inspect.

The engineering habit

Data pipelines should make loss explicit. If you discard source information, do it because the product no longer needs it, not because a transformation step happened to be convenient.
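One way to make loss explicit is to have every step declare the fields it intends to drop, and to fail when it drops anything else. The helper below is a hypothetical sketch of that idea (`apply_step` and its parameters are names I made up for illustration), not something taken from the plugin.

```python
from typing import Callable


def apply_step(
    record: dict,
    transform: Callable[[dict], dict],
    declared_drops: frozenset = frozenset(),
) -> dict:
    """Run one pipeline step and reject any field loss it did not declare."""
    result = transform(dict(record))  # pass a copy so the input stays intact
    lost = set(record) - set(result)
    undeclared = lost - declared_drops
    if undeclared:
        raise ValueError(f"step dropped undeclared fields: {sorted(undeclared)}")
    return result


def text_only(rec: dict) -> dict:
    """Example transform that keeps only the url and the extracted text."""
    return {"url": rec["url"], "content": rec["content"]}


page = {"url": "http://example.com/", "content": "Hello", "raw_html": "<p>Hello</p>"}

# Declared loss: the step says out loud that raw_html does not survive.
slim = apply_step(page, text_only, declared_drops=frozenset({"raw_html"}))
```

Calling `apply_step(page, text_only)` without the declaration raises a `ValueError`, so a convenient transformation cannot quietly decide what the next system gets to see.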

That principle shows up in modern AI work too. Keep raw inputs, model versions, generated outputs, and evaluation traces long enough to debug behavior. The tool names change. The pipeline discipline does not.
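For the AI case, the equivalent of the raw HTML field is a trace record written next to every generated output. The sketch below is one possible shape; the class name `RunTrace`, its fields, and the sample values are all assumptions for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class RunTrace:
    raw_input: str          # the untransformed input, as received
    model_version: str      # identifier of the model that produced the output
    generated_output: str   # the output exactly as generated
    evaluation: dict = field(default_factory=dict)  # scores or notes, if any


def to_log_line(trace: RunTrace) -> str:
    """Serialize a trace as one JSON line, suitable for an append-only log."""
    return json.dumps(asdict(trace), sort_keys=True)


trace = RunTrace(
    raw_input="<p>Hello</p>",
    model_version="model-v1",
    generated_output="Hello",
    evaluation={"exact_match": True},
)
line = to_log_line(trace)
```

Each line can be parsed back with `json.loads`, which is usually enough to answer the question of what went in and which version produced what came out.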
