RAG Pipeline

  • The RAG pipeline is the main service of the AI search. At the moment, the Generation step is not used: the pipeline only performs retrieval, finding the datasets relevant to a given query.
  • The RAG pipeline code is divided into two parts: training (training.py) and inference (backend.py). Training gathers data from the OpenML API, preprocesses it, and stores it in a vector database. Inference runs the pipeline against that stored data to answer queries.
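To make the two-phase split concrete, here is a minimal, self-contained sketch of a retrieval-only flow. It is not the project's code: the bag-of-words "embedding", the in-memory index, and the names embed, build_index, and retrieve are all hypothetical stand-ins for the real embedding model and vector database.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(datasets: dict) -> dict:
    """'Training' phase: embed every dataset description once and store it."""
    return {dataset_id: embed(desc) for dataset_id, desc in datasets.items()}


def retrieve(index: dict, query: str, top_k: int = 2) -> list:
    """'Inference' phase: rank stored datasets against the query, return ids only."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda i: cosine(query_vec, index[i]), reverse=True)
    return ranked[:top_k]


# Hypothetical descriptions, not real OpenML metadata.
toy_datasets = {
    1: "mushroom edibility classification dataset",
    2: "house prices regression dataset",
    3: "handwritten digits image classification",
}
index = build_index(toy_datasets)
top = retrieve(index, "find me a mushroom dataset", top_k=1)
print(top)
```

Note that no text is generated at any point: the result is just a ranked list of dataset ids, which mirrors the retrieval-only behavior described above.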

Training

  • All the modules are in backend/modules. To understand or modify a module's behavior, see the documentation linked for that module below.
  • config.json : JSON with the main config used for training and inference - documentation
  • results_gen.py : Code for creating the output and running parts of the other modules during inference - documentation
  • general_utils.py : Code for device configuration (gpu/cpu/mps) - documentation
  • metadata_utils.py : Getting/formatting/loading metadata from OpenML - documentation
  • rag_llm.py : LangChain code for the RAG pipeline - documentation
  • utils.py : Imports all the utility files
  • vector_store_utils.py : Code for loading data into the vector store - documentation
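As an illustration of the kind of device configuration general_utils.py is described as handling, the following is a sketch of a common PyTorch pattern. The actual contents of general_utils.py are not shown here, so the function names pick_device and detect_device are hypothetical, and the use of PyTorch is an assumption.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Prefer an NVIDIA GPU, then Apple's MPS backend, then fall back to CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe the local machine; returns 'cpu' if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

Keeping the preference order in a pure function (pick_device) separates it from hardware probing, so the ordering can be checked on any machine.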

Inference

  • This component runs the RAG pipeline. It returns a JSON response with the IDs of the OpenML datasets that match the query.
  • Start it in the background (the trailing & detaches it from the shell) by running cd backend && uvicorn backend:app --host 0.0.0.0 --port 8000 &
  • Curl example, with the query URL-encoded into the path : curl http://0.0.0.0:8000/dataset/find%20me%20a%20mushroom%20dataset
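The same request can be made from Python. The sketch below uses only the standard library and assumes the server from the previous step is running on port 8000. The helper names build_query_url and find_datasets are hypothetical, and because the exact shape of the returned JSON is not documented here, the parsed response is returned as-is.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

DEFAULT_BASE_URL = "http://0.0.0.0:8000"  # host and port from the uvicorn command


def build_query_url(query: str, base_url: str = DEFAULT_BASE_URL) -> str:
    """URL-encode the free-text query and place it in the /dataset/ path."""
    return f"{base_url.rstrip('/')}/dataset/{quote(query, safe='')}"


def find_datasets(query: str, base_url: str = DEFAULT_BASE_URL, timeout: float = 30.0):
    """Send the query to the running backend and return the parsed JSON body."""
    with urlopen(build_query_url(query, base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_query_url("find me a mushroom dataset")
print(url)
# With the server running, the query itself would be:
# result = find_datasets("find me a mushroom dataset")
```

On some platforms 0.0.0.0 is not accepted as a client-side address; if the connection fails, passing base_url="http://127.0.0.1:8000" is a reasonable thing to try.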