πŸ€– fibre in the ground 🐎

14 Nov 2025 β€’ 12 min read

could the word/pixel guessing bubble leave a productive residue?

Abridged excerpt from a Q&A with Cory Doctorow, in which he investigates potential uses for language/vision models if hyperscaling is over, hyperlocal optimised models are widespread, and GPUs are cheap:

Enshittification With Whitney Betran and Ed Zitron at the Seattle Public Library

Our story begins around 35 minutes in…

Let me advance a theory of the less bad and more bad bubble. Some bubbles have productive residues and some don't.
Enron left nothing behind…
Now Worldcom, which was a grotesque fraud, some of you will remember? They raised billions of dollars claiming that they had orders for fibre. They dug up the streets all over the world. They put fibre in the ground. They didn't have the orders for the fibre. They stole billions of dollars from everyday investors. The CEO died in prison.
But there was still all that fibre in the ground. So I've got two gigabit symmetrical fibre at home in Burbank because AT&T bought some old dark fibre from Worldcom because fibre lasts forever? It's just glass. Once it's there, it is a productive residue.
So what kind of bubbles are we living through? Well, crypto is not gonna leave behind anything. Crypto is gonna leave behind shitty Austrian economics and worse JPEGs.
AI is actually gonna leave behind some stuff.
So if you wanna think about like a post AI bubble world and I just got edits from my editor. I wrote a book over the summer called The Reverse Centaur's Guide to Life After AI. And if you wanna think about a post AI world, imagine what you would do if GPUs were 10 cents on the dollar. If there were a lot of skilled applied statisticians looking for work. And if you had a bunch of open source models that had barely been optimised and had a lot of room at the bottom?
I'll give you an example. I was writing an essay and I couldn't remember where I'd heard a quote, a quote I'd heard in a podcast; I couldn't remember which podcast it was. So I downloaded Whisper, which is an open source model, to my laptop, which doesn't have a GPU, a little commodity laptop, threw 30 hours of podcasts that I'd recently listened to at it, and I got a full transcription in an hour. My fan didn't even turn on.
Yeah, so I know tonnes of people who use this and the title of the book, Reverse Centaur, refers to this idea from automation theory, where a centaur is someone who gets to use machines to assist them, a human head on a machine body? And so, you know, you riding a bicycle, you using a compiler.
A reverse centaur is a machine head on a human body. It's someone who's been conscripted to be a peripheral for a machine?
I've got a very treatable form of cancer, but I'm paying a lot of attention to stories about cancer and, you know, open source models or AI models that can sometimes see solid mass tumors that radiologists miss. And if what we said was, we at the Kaiser Oncology Department are going to invest in a service that is going to sometimes ask our radiologist to take a second look to see if they miss something, such that instead of doing 100 x-rays a day, they're gonna do 98? Then I would say, as someone with cancer, that sounds interesting to me.
I don't think anyone is pitching any oncology ward in the world on that. I think the pitch is fire 90% of your oncologists, fire 90% of your radiologists, have the remainder babysit AI, have them be the accountability sinks and moral crumple zones for a machine that is processing this stuff at a speed that no human could possibly account for, have them put their name at the bottom of it, and have them absorb the blame for your cost-cutting measures.
When I hear people talk about AI, I hear programmers talk about AI doing things that are useful. So there's a non-profit called the Human Rights Data Analysis Group:

https://hrdag.org
It's run by some really brilliant mathematicians, statisticians. They started off doing statistical extrapolations of war crimes for human rights tribunals, mostly in The Hague, and talking about the aspects of war crimes that were not visible, but could be statistically inferred from adjacent data.
They did a project with Innocence Project New Orleans, where they used LLMs to identify the linguistic correlates of arrest reports that produced exonerations, and they used that to analyse a lot more arrest reports than they could otherwise, and they put that at the top of a funnel, where lawyers and paralegals were able to accelerate their exoneration work. That's a new thing on this earth:

https://wclawr.org/index.php/wclr/article/view/112
It's very cool, and I'm like, okay, well if these guys can accelerate that work with cheap hardware that today is out of reach, if they can figure out how to use open source models but make them more efficient because you've got all these skilled applied statisticians who are no longer caught up in the bubble, then I think we could see some useful things after the bubble.
That's my argument for this is fibre in the ground and not shitty monkey JPEGs.

Modelling the speech

Cory mentioned using an open source model, Whisper, to model speech audio as text; curious, I used faster-whisper to transcribe the above section of the exchange before reading, checking, and abridging it by hand:

https://pypi.org/project/faster-whisper/

Below is the Python script used to model the text, written by Mistral's Codestral. As with Cory, no GPU or fan was required (Whisper has been optimised). The editorial was by me: reading the modelled text, checking the references, tidying the spelling, and bridging over the interjections of Ed Zitron who, despite making a decent foil, seemed hell-bent on the dystopian endgame and less interested in what a salvage operation might look like.

# online
> python main.py https://foo.bar/speech.mp3 text.txt
# offline
> python main.py speech.mp3 text.txt

# main.py
# Transcribe a spoken audio file (local path or http(s) URL) to a text file.
import argparse

import requests
from faster_whisper import WhisperModel


def main():
    # "medium" model, quantised to int8 so it runs acceptably on a CPU-only laptop
    model_size = "medium"
    model = WhisperModel(model_size, device="cpu", compute_type="int8")

    parser = argparse.ArgumentParser(
        description="Process an input spoken audio file (arg 1) from http(s) url or filepath and transcribe to file (arg 2)"
    )
    parser.add_argument("input_file", help="Input file path or http(s) URL")
    parser.add_argument("output_file", help="Output file path")
    args = parser.parse_args()
    audio_file = read_file(args.input_file)

    print(f"transcribing {args.input_file} to {args.output_file}")

    segments, info = model.transcribe(audio_file, beam_size=5)

    with open(args.output_file, "w", encoding="utf-8") as f:
        for segment in segments:
            line = f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}\n"
            print(line.strip())  # print to console
            f.write(line)  # write to file
    print(f"\nTranscription saved to '{args.output_file}'")


def read_file(file_path):
    # Remote audio: stream the response and hand back the underlying file-like object
    if file_path.startswith("http://") or file_path.startswith("https://"):
        response = requests.get(file_path, stream=True)
        response.raise_for_status()
        return response.raw
    # Local audio: return an open binary file handle
    return open(file_path, "rb")


if __name__ == "__main__":
    main()
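
The device="cpu" and compute_type="int8" arguments are doing the laptop-friendly work here: faster-whisper quantises the model weights to 8-bit integers, trading a sliver of accuracy for a model that fits comfortably in memory and transcribes at a reasonable pace without a GPU.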

Β§

The Innocence Project

The paper referenced by Cory, "Innocence Discovery Lab - Harnessing Large Language Models to Surface Data Buried in Wrongful Conviction Case Documents", describes a method using language models "to transform unstructured documents from case documents into a structured, accessible format." This is a practice known as Information Extraction (IE).

The paper starts out by demonstrating the limitations of regex in extracting information due to its rule-based approach: high on precision, low on recall, and unable to capture relational connections between entities in text. In the first of a series of code extracts, an example regex used to find passages containing named investigators is listed:

# Listing 1: Regular Expression Pattern
pattern = re.compile(
    r"(detective|sergeant|lieutenant|captain|corporal|deputy|"
    r"investigator|criminalist|technician|det\.|sgt\.|lt\.|cpt\.|cpl\.|dty\.|tech\.|dr\.)"
    r"\s+([A-Z][A-Za-z]*(\s[A-Z][A-Za-z]*)?)",
    re.IGNORECASE,
)
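
For a feel of what that pattern does and doesn't catch, here is a minimal sketch of my own (the sample sentences and output are mine, not the paper's), assuming Listing 1's pattern and the re import are in scope:

# A title keyword followed by a capitalised name is caught
sample = "Detective John Smith interviewed the suspect; Criminalist Ana Lee processed the scene."
for match in pattern.finditer(sample):
    print(match.group(1), "->", match.group(2))
# Detective -> John Smith
# Criminalist -> Ana Lee

# A mention with no title keyword is missed entirely: the recall problem
print(pattern.findall("The arresting officer, later promoted, testified at trial."))
# []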

The paper goes on to describe a multi-stage method for using LLMs to surface contextual similarity and semantic connections between documents in a database, illustrated with code fragments reproduced beneath:

Hypothetical Document Embeddings (HyDE)
Transform raw text into a structured, searchable format […] searches leveraging these embeddings focus on contextual similarity and semantic connections between documents, surpassing traditional keyword-based search methods in depth and relevance.
# Listing 2.0: Hypothetical Document Embeddings Query
PROMPT_TEMPLATE_HYDE = PromptTemplate(
    input_variables=["question"], template="""You're an AI assistant
    specializing in criminal justice research. Your main focus is on
    identifying the names and providing detailed context of mention for each
    law enforcement personnel. This includes police officers, detectives,
    deputies, lieutenants, sergeants, captains, technicians, coroners,
    investigators, patrolmen, and criminalists, as described in court
    transcripts and police reports. Question: {question} Responses:"""
)
# Listing 2.1: Hypothetical Document Embeddings Implementation
def generate_hypothetical_embeddings():
    llm = OpenAI()
    prompt = PROMPT_TEMPLATE_HYDE
    llm_chain = LLMChain(llm=llm, prompt=prompt)
    base_embeddings = OpenAIEmbeddings()
    embeddings = HypotheticalDocumentEmbedder(
        llm_chain=llm_chain, base_embeddings=base_embeddings)
    return embeddings
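
To make the behaviour concrete, here is a minimal usage sketch of my own (not from the paper), assuming Listing 2.1's definitions are in scope and an OpenAI API key is configured:

# Build the HyDE embedder from Listing 2.1, then embed an illustrative query
embeddings = generate_hypothetical_embeddings()

# Under HyDE, the LLM first drafts a hypothetical answer document for the query;
# it is that draft, rather than the bare keywords, which gets embedded
query_vector = embeddings.embed_query(
    "Which officers handled the physical evidence in this case?"
)
print(len(query_vector))  # dimensionality of the embedding vector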

Creating the vector database
For segmentation, we use LangChain's RecursiveCharacterTextSplitter, which divides the document into word chunks. The chunk size and overlap are chosen to ensure that each segment is comprehensive enough to maintain context while being sufficiently small for efficient processing. Post-segmentation, these chunks are transformed into high-dimensional vectors using the hypothetical document's embedding scheme.
The concluding step involves the FAISS.from_documents function, which compiles these vectors into an indexed database. This database enables efficient and context-sensitive searches, allowing for the quick identification of documents that share content similarities with the hypothetical document.
# Listing 3: Storing the Document in a Vector Database
def process_single_document(file_path, embeddings):
    logger.info(f"Processing document: {file_path}")
    loader = JSONLoader(file_path)
    text = loader.load()
    logger.info(f"Text loaded from document: {file_path}")
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                                   chunk_overlap=250)
    docs = text_splitter.split_documents(text)
    db = FAISS.from_documents(docs, embeddings)
    return db
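
Putting the two listings together, indexing and probing a single document might look like the following sketch; the wiring, file path, and query text are mine, not the paper's:

# Hypothetical wiring of Listings 2.1 and 3
embeddings = generate_hypothetical_embeddings()
db = process_single_document("case_documents/police_report_0001.json", embeddings)

# The FAISS index can then be queried for contextually similar chunks
for doc, score in db.similarity_search_with_score("officers present at the arrest", k=5):
    print(round(score, 3), doc.page_content[:80])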

Creating a prompt output format
The model then extracts information relevant to the query and structures the output according to the specifications in the prompt template.
# Listing 4.0: Template for Model
PROMPT_TEMPLATE_MODEL = PromptTemplate(
    input_variables=["question", "docs"],
    template="""
As an AI assistant, my role is to meticulously analyze criminal justice documents and
extract information about law enforcement personnel.
Query: {question}
Documents: {docs}
The response will contain:
1) The name of a police officer.
Please prefix the name with "Officer Name: ".
For example, "Officer Name: John Smith".
2) If available, provide an in-depth description of the context of their mention.
If the context induces ambiguity regarding the individual's role in law enforcement,
note this.
Please prefix this information with "Officer Context: ".
3) Review the context to discern the role of the officer. For example, Lead Detective.
Please prefix this information with "Officer Role: "
For example, "Officer Role: Lead Detective"
The full response should follow the format below, with no prefixes such as 1., 2., 3., a.,
b., c.:
Officer Name: John Smith
Officer Context: Mentioned as officer at the scene of the incident.
Officer Role: Patrol Officer
Officer Name:
Officer Context:
Officer Role:
Additional guidelines:
Only derive responses from factual information found within the police reports.
""",)

Initial Query Processing
The extraction phase begins when a user sends a query to the vector database. Once the query is received, the database conducts a search within its embedding space, identifying and retrieving text chunks that best match the query's contextual and semantic criteria. This retrieval process is carried out using the db.similarity_search_with_score method, which selects the top 'k' relevant chunks based on their high similarity to the query.
Sorting of Retrieved Chunks
After their retrieval, the chunks are sorted [to ensure relevant chunks are] appropriately organized within the model’s context window […] After sorting, the chunks are concatenated into a single string […] reducing unnecessary tokens.
# Listing 4.1: Function for Generating Responses
def get_response_from_query(db, query):
    # Set up the parameters
    prompt = PROMPT_TEMPLATE_MODEL
    roles = ROLE_TEMPLATE
    temperature = 1
    k = 20
    # Perform the similarity search
    doc_list = db.similarity_search_with_score(query, k=k)
    # Sort documents by relevance scores as suggested in https://arxiv.org/abs/2307.03172
    docs = sorted(doc_list, key=lambda x: x[1], reverse=True)
    third = len(docs) // 3
    highest_third = docs[:third]
    middle_third = docs[third:2*third]
    lowest_third = docs[2*third:]
    highest_third = sorted(highest_third, key=lambda x: x[1], reverse=True)
    middle_third = sorted(middle_third, key=lambda x: x[1], reverse=True)
    lowest_third = sorted(lowest_third, key=lambda x: x[1], reverse=True)
    sorted_docs = highest_third + lowest_third + middle_third
    # Join documents into one string for processing
    docs_page_content = " ".join([d[0].page_content for d in sorted_docs])
        
Model Initialisation and Response Generation
The processing begins with the instantiation of an OpenAI model and the LLMChain class. This setup allows the chain to process the combined document content along with the original query. Following this, the LLMChain executes its run method, using the inputs of prompt, query, and document content to generate a structured and detailed response.

    # (continuation of get_response_from_query)
    # Create an instance of the OpenAI model
    llm = ChatOpenAI(model_name="gpt-4")
    # Create an instance of the LLMChain
    chain = LLMChain(llm=llm, prompt=prompt)
    # Run the LLMChain and print the response
    response = chain.run(question=query, docs=docs_page_content,
                         temperature=temperature)
    print(response)
    return response
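
Strung together with the earlier listings, generating the structured response for one document is then a single call; this sketch and its query text are my own, not the paper's:

# db built by process_single_document, embeddings by generate_hypothetical_embeddings
answer = get_response_from_query(
    db, "Identify each law enforcement officer mentioned and the context of the mention."
)
print(answer)  # blocks of "Officer Name: ... / Officer Context: ... / Officer Role: ..."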

The researchers additionally fine-tuned a cheaper model to stand in for the more expensive one, and outline a method to de-duplicate different mentions of the same case workers in the database. Ultimately, they end up with a pipeline that can process investigations and extract a table with the following columns (a rough parsing sketch follows the list):

  1. investigator name (de-duplicated)
  2. investigator role (de-duplicated)
  3. investigator involvement in a case (compiled)
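
As an illustration of how rows for that table might be pulled out of the "Officer Name / Officer Context / Officer Role" format that Listing 4.0 demands, here is a minimal parsing sketch; the function and field names are mine, and the paper's de-duplication step would happen downstream of this:

# Hypothetical parser for the structured output of get_response_from_query
def parse_officer_records(response: str) -> list[dict]:
    records, current = [], {}
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("Officer Name:"):
            if current:  # each new name starts a new table row
                records.append(current)
            current = {"name": line.removeprefix("Officer Name:").strip()}
        elif line.startswith("Officer Context:"):
            current["context"] = line.removeprefix("Officer Context:").strip()
        elif line.startswith("Officer Role:"):
            current["role"] = line.removeprefix("Officer Role:").strip()
    if current:
        records.append(current)
    return records

for row in parse_officer_records(answer):  # answer from the sketch above
    print(row.get("name"), "|", row.get("role"), "|", row.get("context"))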

Overall then, it looks like they are trying to connect wrongful conviction cases by looking for patterns of investigator involvement across a collection of exoneration documents, possibly helping to surface further potential exonerations.