Open Source Extraction Service

Earlier this month we announced our most recent OSS use-case accelerant: a service for extracting structured data from unstructured sources, such as text and PDF documents. Today we are exposing a hosted version of the service with a simple front end. The application is free to use, but is not intended for production workloads or sensitive data. The intent is to showcase what is possible in this category in 2024, and to help developers get a running start with their own applications.

Why now?

Structured data extraction has emerged as a valuable use case for large language models, which can reason through the ambiguities of unstructured text to coerce information into a desired schema. Model providers increasingly support long context windows and function calling, both key features for data extraction. And we have recently improved LangChain’s support for data extraction, allowing developers to easily work with a variety of file types, schema formats, models, few-shot examples, and extraction methods (e.g., tool calling, JSON mode, or parsing). Hosting a reference application lets users experiment with the latest tools for their own use cases and connect what they see to the underlying OSS implementation.
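
For example, LangChain can bind a schema directly to a chat model that supports tool calling. A minimal sketch (the model choice and the Person schema here are illustrative, not part of the service):

from langchain_openai import ChatOpenAI
from pydantic import BaseModel, Field

class Person(BaseModel):
    """Information about a person."""

    name: str = Field(..., description="The person's name.")
    age: int = Field(..., description="The person's age, in years.")

# Bind the schema to the model; extraction happens via tool calling.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
structured_llm = llm.with_structured_output(Person)

structured_llm.invoke("Alice turned 30 last week.")
# -> Person(name='Alice', age=30)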

Features

  • Support for PDF, HTML, and text;
  • Defining and persisting extractors with their own schema and custom instructions;
  • Adding few-shot examples for in-context learning;
  • Sharing extractors among users;
  • Swapping LLM models;
  • A LangServe endpoint for the core extraction logic, allowing it to be plugged into your own LangChain workflows;
  • A frontend that lets you define extraction schemas in natural language, share them with other users, and test them on text or files (no support for few-shot examples yet).

Walkthrough

Let’s walk through an example, extracting financial data from a public company earnings call. Here we use the prepared remarks from Uber’s Q4 2023 earnings call, which Uber investor relations makes available online.

Most public companies host earnings calls, providing their management opportunities to discuss past financial results and future plans. Natural language transcripts of these calls may contain useful information, but often this information must first be extracted from the document and arranged into a structured form so that it can be analyzed or compared across time periods and other companies.

Let’s first grab the PDF:

import requests

pdf_url = "https://s23.q4cdn.com/407969754/files/doc_earnings/2023/q4/transcript/Uber-Q4-23-Prepared-Remarks.pdf"

# Get PDF bytes
pdf_response = requests.get(pdf_url)
assert pdf_response.status_code == 200
pdf_bytes = pdf_response.content

Next, we will generate a unique identifier for ourselves. The application does not manage user accounts or implement real authentication; access to extractors, few-shot examples, and other artifacts is controlled via this ID. Treat it as a secret, and don’t lose it!

from uuid import uuid4

user_id = str(uuid4())
headers = {"x-key": user_id}

Next, we specify the schema of what we intend to extract: here, a record of financial data. We allow the LLM to infer various attributes, such as the time period for the record. We use Pydantic for readability, but ultimately the service relies on JSON schema.

from pydantic import BaseModel, Field

class FinancialData(BaseModel):
    name: str = Field(..., description="Name of the financial figure, such as revenue.")
    value: float = Field(..., description="Nominal earnings in local currency.")
    scale: str = Field(..., description="Scale of figure, such as MM, B, or percent.")
    period_start: str = Field(..., description="The start of the time period in ISO format.")
    period_duration: int = Field(..., description="Duration of period, in months")
    evidence: str = Field(..., description="Verbatim sentence of text where figure was found.")

Note that we include an evidence attribute, which provides context for the predictions and supports downstream verification of the results.
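
Because the service ultimately consumes JSON schema, it can be instructive to inspect what Pydantic generates from the class above. The field descriptions are carried into the schema, where they serve as instructions to the LLM:

import json

# Pydantic renders the class, including field descriptions, as JSON schema.
print(json.dumps(FinancialData.schema(), indent=2))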

Once we've defined our schema, we create an extractor by posting it to the application:

url = "https://extract-server-f34kggfazq-uc.a.run.app"

data = {
    "user_id": user_id,
    "description": "Financial revenues and other figures.",
    "schema": FinancialData.schema(),
    "instruction": (
        "Extract standard financial figures, specifically earnings and "
        "revenue figures. Only extract historical facts, not estimates or guidance."
    )
}

response = requests.post(f"{url}/extractors", json=data, headers=headers)
extractor = response.json()

We’ve posted the extractor, which is now persisted and accessible via its unique ID. Let’s try it on our PDF:

result = requests.post(
    f"{url}/extract",
    data={"extractor_id": extractor["uuid"], "model_name": "gpt-3.5-turbo"},
    files={"file": pdf_bytes},
    headers=headers,
)

result.json()

And we get back:

{'data': [{'name': 'Adjusted EBITDA',
   'scale': 'million',
   'value': 1300,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'GAAP operating income',
   'scale': 'million',
   'value': 652,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'Revenue',
   'scale': 'billion',
   'value': 9.9,
   'evidence': 'We grew our revenue by 13% YoY on a constant-currency basis to $9.9 billion.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'Adjusted EBITDA',
   'scale': '$',
   'value': 1.26,
   'evidence': 'We expect Adjusted EBITDA of $1.26 billion to $1.34 billion.',
   'period_start': '2023-01-01',
   'period_duration': 12},
  {'name': 'Adjusted EBITDA',
   'scale': '$',
   'value': 1.34,
   'evidence': 'We expect Adjusted EBITDA of $1.26 billion to $1.34 billion.',
   'period_start': '2023-01-01',
   'period_duration': 12}]}

Note that the formatting of the result deviates in some ways from the descriptions in the schema. For example, the model output “million” and “billion” instead of “MM” or “B” as instructed. Different models will struggle with this to varying degrees. Before reaching for a larger model, a judicious choice of few-shot examples is often an efficient way to clarify our intent. Let’s add one to our extractor:

examples = [
    {
        "text": "In 2022, Revenue was $1 million and EBIT was $2M.",
        "output": [
            FinancialData(
                name="revenue",
                value=1,
                scale="MM",
                period_start="2022-01-01",
                period_duration=12,
                evidence="In 2022, Revenue was $1 million and EBIT was $2M.",
            ).dict(),
            FinancialData(
                name="ebit",
                value=2,
                scale="MM",
                period_start="2022-01-01",
                period_duration=12,
                evidence="In 2022, Revenue was $1 million and EBIT was $2M.",
            ).dict()
        ],
    },
]

responses = []
for example in examples:
    create_request = {
        "extractor_id": extractor["uuid"],
        "content": example["text"],
        "output": example['output'],
    }
    response = requests.post(f"{url}/examples", json=create_request, headers=headers)
    responses.append(response)

Here we add a single example that contains two records, with updated casing for the “name” field and formatting of the “scale” field. Re-running the extraction on the document, we recover the intended formatting:

result = requests.post(
    f"{url}/extract",
    data={"extractor_id": extractor["uuid"], "model_name": "gpt-3.5-turbo"},
    files={"file": pdf_bytes},
    headers=headers,
)

result.json()

And we get:

{'data': [{'name': 'adjusted ebitda',
   'scale': 'MM',
   'value': 1300.0,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'gaap operating income',
   'scale': 'MM',
   'value': 652.0,
   'evidence': 'These strong top-line trends, combined with continued rigor on costs, translated to $1.3 billion in Adjusted EBITDA and $652 million in GAAP operating income.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'gross bookings',
   'scale': 'B',
   'value': 37.6,
   'evidence': 'Gross Bookings growth accelerated to 21% YoY on a constant-currency basis (23% excluding Freight), as we generated Gross Bookings of $37.6 billion.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'revenue',
   'scale': 'B',
   'value': 9.9,
   'evidence': 'We grew our revenue by 13% YoY on a constant-currency basis to $9.9 billion.',
   'period_start': '2023-10-01',
   'period_duration': 3},
  {'name': 'adjusted ebitda',
   'scale': 'B',
   'value': 1.3,
   'evidence': 'We expect Adjusted EBITDA of $1.26 billion to $1.34 billion.',
   'period_start': '2023-01-01',
   'period_duration': 12}],
 'content_too_long': False}
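
As noted above, the evidence field supports verification of extracted records against the source document. Here is a minimal sketch of such a check, assuming pypdf is installed for text extraction; we normalize whitespace, since PDF extraction can break lines differently than the original:

import io
import re

from pypdf import PdfReader

def normalize(text: str) -> str:
    # Collapse whitespace runs so line breaks don't foil substring checks.
    return re.sub(r"\s+", " ", text).strip()

# Extract and normalize the full text of the source PDF.
reader = PdfReader(io.BytesIO(pdf_bytes))
source_text = normalize(" ".join(page.extract_text() or "" for page in reader.pages))

# Flag any record whose evidence does not appear verbatim in the source.
for record in result.json()["data"]:
    if normalize(record["evidence"]) not in source_text:
        print(f"Unverified: {record['name']}")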

LangServe client

A final note: because we’ve hosted the core extraction logic with LangServe, we can access it via the RemoteRunnable interface and plug it into larger chains and agent workflows.

The runnable can be invoked in the usual way:

from langserve import RemoteRunnable

runnable = RemoteRunnable(f"{url}/extract_text/")
response = runnable.invoke(
    {
        "text": "Our 2023 revenue was $100.",
        "schema": FinancialData.schema(),
    }
)
print(response)

And we get:

{'data': [{'name': 'revenue',
   'value': 100,
   'scale': '$',
   'period_start': '2023-01-01',
   'period_duration': 12,
   'evidence': 'Our 2023 revenue was $100.'}]}
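
Because RemoteRunnable implements the standard Runnable interface, we also get batching (and async variants) for free. For example, with illustrative inputs:

batch_results = runnable.batch(
    [
        {"text": "Our 2023 revenue was $100.", "schema": FinancialData.schema()},
        {"text": "Our 2022 costs were $50.", "schema": FinancialData.schema()},
    ]
)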

We can also incorporate the runnable into a retrieval scenario. Below, instead of extracting directly from the input, we index some documents and treat the input as a search query. Searching for “rev” will retrieve a document containing revenue figures and restrict the extraction to it:

from operator import itemgetter

from langchain_community.vectorstores import FAISS
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings


doc_contents = ["Our 2023 revenue was $100", "Our Q1 profit was $10 in 2023."]
vectorstore = FAISS.from_texts(doc_contents, embedding=OpenAIEmbeddings())
retriever = vectorstore.as_retriever()

larger_runnable = (
    {
        "text": itemgetter("text") | retriever | (lambda docs: docs[0].page_content),  # fetch content of top doc,
        "schema": itemgetter("schema"),
    }
    | runnable
)
larger_runnable.invoke({"text": "rev", "schema": FinancialData.schema()})

Which yields:

{'data': [{'name': 'revenue',
   'value': 100,
   'scale': '$',
   'period_start': '2023-01-01',
   'period_duration': 12,
   'evidence': 'Our 2023 revenue was $100'}]}

We are excited to see what extraction workflows you build, and welcome both feedback on and contributions to LangChain’s extraction capabilities!