# nutch2-index-html

> A small Nutch plugin about preserving source data until the search layer is ready to decide.

`nutch2-index-html` is an older search/indexing project, but it captures a thread that still shows up in my work: make the pipeline explicit, keep the transformation close to the crawler, and do not lose useful source data before search needs it.

## The job

Apache Nutch crawls pages, then the indexing layer decides what survives into search. This plugin keeps each page's raw HTML available for indexing in Nutch 2.x setups that need it.

That sounds small because the useful work is small. Crawlers, parsers, and indexers all have a habit of throwing away context too early. Later, search quality or analysis asks for that context back, and the team has to decide whether to recrawl, rebuild, or accept worse results.

Keeping HTML available at the right boundary is a defensive pipeline decision. It preserves source material until the search layer has enough information to decide what matters.

```mermaid
flowchart LR
  A[crawl page] --> B[parse content]
  B --> C[preserve HTML]
  C --> D[indexing layer]
  D --> E[search decisions]
```

<Tradeoff title="Preserving source costs storage, losing it costs options">
  Raw source data is not free, but throwing it away too early makes every future quality fix depend
  on a recrawl or a compromise.
</Tradeoff>

## Why I keep it listed

It is not fashionable. It is a small, concrete example of solving the problem the system actually has, stage by stage: crawler, parser, indexer, search. That shape still matters in newer data pipelines.

The plugin adds a field for raw HTML content so the indexing side can decide whether that source is
useful. That is an old Apache Nutch detail, but the product lesson is current: do not let a storage
or indexing shortcut accidentally decide the future of search quality.
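To make the idea concrete, here is a minimal sketch of an indexing step that carries the raw page source forward as one extra field. It is not the plugin's code and it does not use the Nutch API: `IndexDoc`, `RawHtmlFieldStep`, and the field name `rawcontent` are hypothetical stand-ins, and UTF-8 decoding is an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for an index document; not the Nutch API. */
final class IndexDoc {
    private final Map<String, String> fields = new LinkedHashMap<>();

    void add(String name, String value) { fields.put(name, value); }
    String get(String name) { return fields.get(name); }
    boolean has(String name) { return fields.containsKey(name); }
}

/** Sketch of an indexing step that preserves the fetched source as a field. */
final class RawHtmlFieldStep {
    // Assumed field name; it has to match whatever the search schema declares.
    static final String RAW_FIELD = "rawcontent";

    /** Adds the fetched bytes as a string field and leaves parsed fields untouched. */
    static IndexDoc apply(IndexDoc doc, byte[] fetchedBytes) {
        if (fetchedBytes == null || fetchedBytes.length == 0) {
            return doc; // nothing fetched: omit the field instead of storing an empty one
        }
        // Assumes UTF-8; a real crawler would use the charset detected at fetch time.
        doc.add(RAW_FIELD, new String(fetchedBytes, StandardCharsets.UTF_8));
        return doc;
    }
}
```

The step only adds. Whether the raw field is searched, stored, or ignored stays a decision for the indexing side.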

## The operational edge

The install path is explicit. Copy the plugin, wire it into the Nutch build, enable it in the
plugin configuration, add the field to the Solr schema, and run the crawler. That explicitness is
why the repo is still interesting. Every step names the boundary where data can disappear.

Modern data systems hide more of that behind managed services, but the responsibility is the same:
know what is preserved, what is transformed, and what the next system can still inspect.

## The engineering habit

Data pipelines should make loss explicit. If you discard source information, do it because the product no longer needs it, not because a transformation step happened to be convenient.
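One way to read "make loss explicit" in code is a stage that can only drop fields it was told to drop, and that records each drop. This is a sketch under my own assumptions, not something from the plugin: `ExplicitDropStage` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical pipeline stage that must name every field it discards. */
final class ExplicitDropStage {
    private final Set<String> declaredDrops;
    private final List<String> dropLog = new ArrayList<>();

    ExplicitDropStage(Set<String> declaredDrops) {
        this.declaredDrops = Set.copyOf(declaredDrops);
    }

    /** Returns a copy of the record without the declared fields, logging each removal. */
    Map<String, String> apply(Map<String, String> record) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : record.entrySet()) {
            if (declaredDrops.contains(e.getKey())) {
                dropLog.add(e.getKey()); // the loss is recorded, not silent
            } else {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Names of the fields dropped so far, in the order they were seen. */
    List<String> dropLog() { return List.copyOf(dropLog); }
}
```

Anything not on the declared list passes through unchanged, so discarding a field takes a visible decision in the stage's configuration.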

That principle shows up in modern AI work too. Keep raw inputs, model versions, generated outputs, and evaluation traces long enough to debug behavior. The tool names change. The pipeline discipline does not.
