One of the core value props of LangChain is the ability to combine Large Language Models with your own text data. There are multiple (four!) different methods of doing so, and many different applications this can power.
A step that sits upstream of using text data is getting your data into text form in the first place. This can be rather tricky due to the multitude of different formats that exist out there.
Enter... unstructured.io.
Unstructured is a company with a mission of transforming natural language data from raw to machine ready. One of the main ways they do this is with an open source Python package. This package has support for MANY different types of file extensions: .txt, .docx, .pptx, .jpg, .png, .eml, .html, and .pdf documents.
After playing around with Unstructured, we realized that by integrating with it we could easily start to build out first-class support for loading documents of all types into a format that LangChain could work with. So we created the Document Loaders module, a large part of which is powered by Unstructured.
There are currently two loaders that are powered by Unstructured. Both seem rather simple, but are quite powerful.
The first is the UnstructuredFileLoader. This has a simple interface (you just pass it a file path), but under the hood Unstructured applies a lot of smart logic to infer which file type it is (PDF, PowerPoint, image, etc.) and extract the text.
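To make the "infer the type, then extract" idea concrete, here is a minimal sketch using only the Python standard library. It is not Unstructured's implementation, which handles far more formats and is considerably more robust. The names extract_text, read_plain, read_html, and HANDLERS are hypothetical and exist only for illustration, and this version guesses the type from the file name alone.

```python
import mimetypes
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Keep only non-empty text, stripped of surrounding whitespace.
        if data.strip():
            self.chunks.append(data.strip())


def read_plain(path):
    """Extractor for plain-text files: return the contents unchanged."""
    return Path(path).read_text(encoding="utf-8")


def read_html(path):
    """Extractor for HTML files: return the text nodes, one per line."""
    parser = _TextOnly()
    parser.feed(Path(path).read_text(encoding="utf-8"))
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical dispatch table: detected MIME type -> extraction function.
HANDLERS = {"text/plain": read_plain, "text/html": read_html}


def extract_text(path):
    """Guess the file type from its name, then hand off to the matching extractor."""
    mime, _ = mimetypes.guess_type(str(path))
    handler = HANDLERS.get(mime)
    if handler is None:
        raise ValueError(f"no extractor registered for {path} (detected: {mime})")
    return handler(path)
```

The caller only ever sees extract_text(path). Supporting another format means registering one more entry in the dispatch table, which is the same shape of interface the loader exposes: one file path in, text out.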
The second is the DirectoryLoader. Again, this has a pretty simple interface: it takes only a path to a directory and an optional glob pattern to match files against. But under the hood it is looping over all matching files and using the above UnstructuredFileLoader to load each one. This makes it possible to load files of all types in a single call.
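The looping behavior described above can be sketched in a few lines of standard-library Python. This is an illustration of the pattern, not LangChain's actual code: load_directory and its load_file parameter are hypothetical names, and the default per-file loader here simply reads UTF-8 text where the real loader would delegate to Unstructured.

```python
from pathlib import Path


def load_directory(path, glob="**/*", load_file=None):
    """Load every file under `path` whose relative path matches `glob`."""
    if load_file is None:
        # Stand-in for a per-file loader: read the file as UTF-8 text.
        load_file = lambda p: p.read_text(encoding="utf-8")

    documents = []
    # "**/*" walks the directory recursively; sorting keeps the order stable.
    for file_path in sorted(Path(path).glob(glob)):
        if file_path.is_file():  # skip subdirectories matched by the pattern
            documents.append(
                {"source": str(file_path), "text": load_file(file_path)}
            )
    return documents
```

Calling load_directory("my_docs", glob="**/*.txt") would return one record per matching file, each carrying the file's path alongside its extracted text. Swapping in a smarter load_file is what turns this from a text-file reader into a loader for files of all types.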
We're incredibly excited to have made this integration with Unstructured. With their focus on transforming raw data into clean text, the integration makes it incredibly easy to combine language models with your data, no matter what form it is in.