RAG Pipeline

The RAG pipeline is the main service of the AI search. At the moment, however, the generation step is not used: the pipeline only performs retrieval, finding the datasets relevant to a given query.
The RAG pipeline code is divided into two parts: training (training.py) and inference (backend.py). The training part gathers data from the OpenML API, preprocesses it, and stores it in a vector database; the inference part uses that stored data to answer queries.

Training

All the modules are in backend/modules. To understand or modify a behavior, refer to the documentation of the corresponding module:
config.json : JSON with the main config used for training and inference - documentation
results_gen.py : Code for creating the output and running parts of the other modules during inference - documentation
general_utils.py : Code for device configuration (GPU/CPU/MPS) - documentation
metadata_utils.py : Getting/formatting/loading metadata from OpenML - documentation
rag_llm.py : LangChain code for the RAG pipeline - documentation
utils.py : Just imports all the utility files
vector_store_utils.py : Code for loading data into the vector store - documentation
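
To give a rough idea of what "device configuration (GPU/CPU/MPS)" usually involves, here is a minimal sketch. It is not the code from general_utils.py: the function name `pick_device` is hypothetical, and it assumes PyTorch is the underlying framework, falling back to the CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what is available.

    Hypothetical helper for illustration only; the project's own logic
    lives in general_utils.py and may differ.
    """
    try:
        import torch  # assumption: PyTorch is the deep learning backend
    except ImportError:
        # Without PyTorch there is no accelerator to select.
        return "cpu"

    # Prefer an NVIDIA GPU when CUDA is usable.
    if torch.cuda.is_available():
        return "cuda"

    # Apple Silicon GPUs are exposed through the MPS backend,
    # which older PyTorch builds do not ship.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"
```

Calling `pick_device()` returns a device string that can be passed to an embedding model. Checking CUDA first, then MPS, then CPU is a common preference order, not a requirement.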

The inference component (backend.py) runs the RAG pipeline and returns a JSON with the ids of the OpenML datasets that match the query.
You can start it by running: cd backend && uvicorn backend:app --host 0.0.0.0 --port 8000 &
curl example: curl http://0.0.0.0:8000/dataset/find%20me%20a%20mushroom%20dataset
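
The same request can be made from Python. The sketch below uses only the standard library and assumes the server is running at the address shown in the curl example. The function names `build_dataset_url` and `find_datasets` are hypothetical. The exact shape of the returned JSON is not specified here, so the response is returned as parsed without further interpretation.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: same host and port as in the uvicorn command above.
DEFAULT_BASE_URL = "http://0.0.0.0:8000"


def build_dataset_url(query: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """Percent-encode the query and append it to the /dataset/ route."""
    # safe="" encodes every reserved character; spaces become %20,
    # matching the curl example.
    return f"{base_url}/dataset/{quote(query, safe='')}"


def find_datasets(query: str, base_url: str = DEFAULT_BASE_URL, timeout: float = 30.0):
    """Send the query to the running backend and return the parsed JSON."""
    with urlopen(build_dataset_url(query, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `find_datasets("find me a mushroom dataset")` requests the same URL as the curl command. If 0.0.0.0 is not reachable as a client address on your platform, passing a different `base_url` such as `http://localhost:8000` may help.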