The RAG pipeline is the main service of the AI search. At the moment, though, the Generation part is not used; the pipeline only retrieves datasets relevant to the given query.
The RAG pipeline code is divided into two parts: training (training.py) and inference (backend.py). The first gathers data from the OpenML API, preprocesses it, and stores it in a vector database; the second handles inference, querying that database to find datasets matching a user query.
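To make the training/inference split concrete, here is a minimal sketch of a retrieval-only pipeline. It is not the code in training.py or backend.py: the function names (`embed`, `build_store`, `retrieve`), the toy bag-of-words embedding, the in-memory list standing in for the vector database, and the sample dataset descriptions are all assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding' (hypothetical stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_store(datasets):
    """'Training' stage: embed every description and keep it in a store.

    A plain list stands in for the vector database here.
    """
    return [(name, embed(description)) for name, description in datasets.items()]


def retrieve(store, query, k=2):
    """'Inference' stage: rank stored datasets by similarity to the query."""
    query_vec = embed(query)
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]


# Made-up descriptions, used only to exercise the sketch.
datasets = {
    "iris": "flower measurements for species classification",
    "mnist": "handwritten digit images for classification",
    "housing": "house prices regression features",
}

store = build_store(datasets)            # done once, ahead of time
top = retrieve(store, "handwritten digit images", k=1)
print(top)
```

The point of the split is that the expensive work (fetching and embedding metadata) happens once in the first stage, while the second stage only embeds the query and does a similarity lookup.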
All the modules are in backend/modules. To understand or modify a behavior, refer to the documentation of the corresponding module:
config.json : JSON with the main config used for training and inference - documentation
results_gen.py : Code for creating the output and running parts of the other modules during inference - documentation
general_utils.py : Code for device configuration (gpu/cpu/mps) - documentation
metadata_utils.py : Getting/formatting/loading metadata from OpenML - documentation
rag_llm.py : Langchain code for the RAG pipeline - documentation
utils.py : Imports all the utility files
vector_store_utils.py : Code for loading data into the vector store - documentation
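As an illustration of the kind of device configuration described for general_utils.py, the following is a minimal sketch of choosing between gpu, mps, and cpu. The function names `pick_device` and `detect_device` and the priority order (CUDA, then MPS, then CPU) are assumptions, not the module's actual API; check its documentation for the real behavior.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device name given what is available.

    Assumed priority for this sketch: CUDA GPU, then Apple MPS, then CPU.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU without it."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose the MPS backend at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_ok)


print(detect_device())
```

Keeping the priority logic in a pure function separate from the PyTorch queries makes it easy to test on machines without a GPU.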