Load the Chroma DB and get retrieval results for a given query

  • How would you load the Chroma DB and get retrieval results for a given query?
from __future__ import annotations
from langchain.globals import set_llm_cache
from langchain_community.cache import SQLiteCache
import os
import sys
import chromadb
from backend.modules.utils import *
from backend.modules.rag_llm import *
from backend.modules.results_gen import *

config = load_config_and_device("../../../backend/config.json")
config["persist_dir"] = "../../data/doc_examples/chroma_db/"
config["data_dir"] = "../../data/doc_examples/"
config["type_of_data"] = "dataset"
config["training"] = False
config["testing_flag"] = True  # demo setting; set to False when training
config["test_subset"] = True  # demo setting; set to False when training
# load the persistent database using ChromaDB
client = chromadb.PersistentClient(path=config["persist_dir"])
print(config)
[INFO] Finding device.
[INFO] Device found: mps
{'rqa_prompt_template': 'This database is a list of metadata. Use the following pieces of context to find the relevant document. Answer only from the context given using the {question} given. If you do not know the answer, say you do not know. {context}', 'llm_prompt_template': 'The following is a set of documents {docs}. Based on these docs, please summarize the content concisely. Also give a list of main concepts found in the documents. Do not add any new information. Helpful Answer: ', 'num_return_documents': 30, 'embedding_model': 'BAAI/bge-large-en-v1.5', 'llm_model': 'llama3', 'num_documents_for_llm': 30, 'data_dir': '../../data/doc_examples/', 'persist_dir': '../../data/doc_examples/chroma_db/', 'testing_flag': True, 'ignore_downloading_data': False, 'test_subset': True, 'data_download_n_jobs': 20, 'training': False, 'temperature': 0.95, 'top_p': 0.95, 'search_type': 'similarity', 'reranking': False, 'long_context_reorder': False, 'structure_query': False, 'use_chroma_for_saving_metadata': False, 'device': 'mps', 'type_of_data': 'dataset'}
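
The paths in the config above are relative, so they depend on the directory the notebook is run from. Depending on the chromadb version, pointing a persistent client at a directory that does not exist may create a fresh, empty store instead of raising, so a mistyped path can go unnoticed until retrieval returns nothing. The following is a minimal sketch of a guard that could run before the client is created. `resolve_persist_dir` is a hypothetical helper, not part of `backend.modules` or chromadb.

```python
import tempfile
from pathlib import Path
from typing import Optional


def resolve_persist_dir(persist_dir: str, base: Optional[Path] = None) -> Path:
    """Resolve a (possibly relative) persist directory and fail loudly if it is missing.

    Hypothetical helper for illustration only.
    """
    base = base or Path.cwd()
    path = (base / persist_dir).resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"Chroma persist directory not found: {path}")
    return path


# Demo with a throwaway directory standing in for the real persist_dir.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "chroma_db").mkdir()
    found = resolve_persist_dir("chroma_db", base=Path(tmp))
    demo_name = found.name
    print(demo_name)
```

In the notebook this would be called as `resolve_persist_dir(config["persist_dir"])` before `chromadb.PersistentClient(...)`.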

# Setup llm chain, initialize the retriever and llm, and setup Retrieval QA
qa_dataset_handler = QASetup(
    config=config,
    data_type=config["type_of_data"],
    client=client,
)

qa_dataset, _ = qa_dataset_handler.setup_vector_db_and_qa()
[INFO] Loading metadata from file.
[INFO] Loading model...
[INFO] Model loaded.
[INFO] Subsetting the data.
[INFO] Generating unique documents. Total documents: 500
Number of unique documents: 0 vs Total documents: 500
No new documents to add.
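
The log above reports 0 unique documents out of 500, which suggests that every document was already present in the persisted store, so nothing was re-embedded. The actual check lives in `backend.modules` and is not shown here. As a rough sketch of the idea, assuming documents are identified by an ID string, the filtering could look like this (`select_new_documents` is a hypothetical name):

```python
def select_new_documents(candidates, existing_ids):
    """Return (id, text) pairs whose id is not yet stored.

    Hypothetical helper: also drops duplicates that occur within the batch itself.
    """
    seen = set(existing_ids)
    fresh = []
    for doc_id, text in candidates:
        if doc_id in seen:
            continue  # already in the store, or repeated earlier in this batch
        seen.add(doc_id)
        fresh.append((doc_id, text))
    return fresh


# Toy data: IDs already persisted versus a new batch to ingest.
existing = {"24", "120", "294"}
batch = [
    ("24", "mushroom"),
    ("120", "BNG(mushroom)"),
    ("999", "new dataset"),
    ("999", "new dataset"),
]
new_docs = select_new_documents(batch, existing)
print(f"Number of unique documents: {len(new_docs)} vs Total documents: {len(batch)}")
```

When the returned list is empty, the ingestion step can be skipped, which matches the "No new documents to add." message.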

# get the llm chain and set the cache
llm_chain_handler = LLMChainCreator(config=config, local=True)
llm_chain_handler.enable_cache()
llm_chain = llm_chain_handler.get_llm_chain()
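
The imports at the top of the notebook bring in `set_llm_cache` and `SQLiteCache`, so `enable_cache()` presumably registers a SQLite-backed cache for LLM calls (an assumption; the method body is not shown here). The sketch below illustrates the general idea with the standard library only: responses are stored by prompt, and a repeated prompt is answered from the database instead of calling the model again. `PromptCache`, `cached_call`, and `fake_llm` are illustrative names and not LangChain APIs.

```python
import sqlite3


class PromptCache:
    """Tiny prompt -> response cache backed by SQLite (illustrative only)."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache (prompt TEXT PRIMARY KEY, response TEXT)"
        )

    def lookup(self, prompt):
        row = self.conn.execute(
            "SELECT response FROM cache WHERE prompt = ?", (prompt,)
        ).fetchone()
        return row[0] if row else None

    def update(self, prompt, response):
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?)", (prompt, response)
        )
        self.conn.commit()


def cached_call(cache, llm, prompt):
    """Return the cached response if present, otherwise call the model and store it."""
    hit = cache.lookup(prompt)
    if hit is not None:
        return hit
    response = llm(prompt)
    cache.update(prompt, response)
    return response


calls = []


def fake_llm(prompt):
    # Stand-in for a real model call; records how often it is invoked.
    calls.append(prompt)
    return f"summary of: {prompt}"


cache = PromptCache()
first = cached_call(cache, fake_llm, "mushroom docs")
second = cached_call(cache, fake_llm, "mushroom docs")
print(len(calls))
```

A side effect of caching by prompt is that a repeated prompt returns the stored answer even when the sampling temperature is high.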

Retrieve documents only (no LLM call)

query = "give me datasets about mushrooms"
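
The config printed earlier sets `search_type` to `similarity`, meaning the query is embedded and compared against the stored document embeddings. The toy sketch below shows that ranking step with made-up 3-dimensional vectors. Cosine similarity is used here as one common choice; the distance metric the store actually uses is not shown in this section, and `cosine` and `top_k` are hypothetical helpers.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k):
    """Return the names of the k stored vectors most similar to the query."""
    ranked = sorted(
        store.items(), key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    return [name for name, _ in ranked[:k]]


# Made-up embeddings; real ones come from the configured embedding model.
store = {
    "mushroom": [0.9, 0.1, 0.0],
    "BNG(mushroom)": [0.8, 0.2, 0.1],
    "satellite_image": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
nearest = top_k(query_vec, store, k=2)
print(nearest)
```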
res = qa_dataset.invoke(input=query, top_k=5)[:10]
res
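
The cell above returns full `Document` objects, and their repr (reproduced after this sketch) is long because every hit carries the complete dataset metadata. The same dataset can also appear more than once, since several chunks of one description may match. A minimal sketch for condensing such a list to one row per dataset is shown here. `summarize_hits` is a hypothetical helper, and the small `Document` dataclass is only a stand-in with the same two attributes (`page_content`, `metadata`) so that the example runs without LangChain installed.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_hits(docs, preview_chars=60):
    """Keep the first hit per dataset ID and return compact rows."""
    seen = set()
    rows = []
    for doc in docs:
        did = doc.metadata.get("did")
        if did in seen:
            continue  # a later chunk of a dataset that was already listed
        seen.add(did)
        rows.append(
            {
                "did": did,
                "name": doc.metadata.get("name"),
                "preview": doc.page_content[:preview_chars],
            }
        )
    return rows


# Sample hits mirroring the IDs and names in the output of this cell.
sample = [
    Document("This dataset describes mushrooms in terms of their physical characteristics.",
             {"did": 24, "name": "mushroom"}),
    Document("did - 24, name - mushroom, version - 1",
             {"did": 24, "name": "mushroom"}),
    Document("Data Set Information:",
             {"did": 294, "name": "satellite_image"}),
]
rows = summarize_hits(sample)
for row in rows:
    print(row["did"], row["name"])
```

In the notebook, `summarize_hits(res)` would give the condensed view. The unabridged output of `res` follows.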
[Document(metadata={'MajorityClassSize': 4208.0, 'MaxNominalAttDistinctValues': 12.0, 'MinorityClassSize': 3916.0, 'NumberOfClasses': 2.0, 'NumberOfFeatures': 23.0, 'NumberOfInstances': 8124.0, 'NumberOfInstancesWithMissingValues': 2480.0, 'NumberOfMissingValues': 2480.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 23.0, 'Unnamed: 0': 19, 'description': "**Author**: [Jeff Schlimmer](Jeffrey.Schlimmer@a.gp.cs.cmu.edu)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/mushroom) - 1981     \n**Please cite**:  The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n\n### Description\n\nThis dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.\n\n### Source\n```\n(a) Origin: \nMushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n(b) Donor: \nJeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)\n```\n\n### Dataset description\n\nThis dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.\n\n### Attributes Information\n```\n1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s \n2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s \n3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y \n4. bruises?: bruises=t,no=f \n5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s \n6. 
gill-attachment: attached=a,descending=d,free=f,notched=n \n7. gill-spacing: close=c,crowded=w,distant=d \n8. gill-size: broad=b,narrow=n \n9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y \n10. stalk-shape: enlarging=e,tapering=t \n11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? \n12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s \n13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s \n14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n16. veil-type: partial=p,universal=u \n17. veil-color: brown=n,orange=o,white=w,yellow=y \n18. ring-number: none=n,one=o,two=t \n19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z \n20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y \n21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y \n22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d\n```\n\n### Relevant papers\n\nSchlimmer,J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral disseration, Department of Information and Computer Science, University of California, Irvine. \n\nIba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann. \n\nDuch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 
25-30, [Web Link] \n\nDuch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997.", 'did': 24, 'features': '0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 : [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 - habitat (nominal)], 22 : [22 - class (nominal)],', 'format': 'ARFF', 'name': 'mushroom', 'qualities': 'AutoCorrelation : 0.726332635725717, CfsSubsetEval_DecisionStumpAUC : 0.9910519616800724, CfsSubsetEval_DecisionStumpErrRate : 0.013047759724273756, CfsSubsetEval_DecisionStumpKappa : 0.9738461616958994, CfsSubsetEval_NaiveBayesAUC : 0.9910519616800724, CfsSubsetEval_NaiveBayesErrRate : 0.013047759724273756, CfsSubsetEval_NaiveBayesKappa : 0.9738461616958994, CfsSubsetEval_kNN1NAUC : 0.9910519616800724, CfsSubsetEval_kNN1NErrRate : 0.013047759724273756, CfsSubsetEval_kNN1NKappa : 0.9738461616958994, ClassEntropy : 0.9990678968724604, DecisionStumpAUC : 0.8894935275772204, DecisionStumpErrRate : 0.11324470704086657, DecisionStumpKappa : 0.77457574608175, Dimensionality : 0.002831117676021664, EquivalentNumberOfAtts : 5.0393135801657, J48.00001.AUC : 1.0, J48.00001.ErrRate : 0.0, J48.00001.Kappa : 
1.0, J48.0001.AUC : 1.0, J48.0001.ErrRate : 0.0, J48.0001.Kappa : 1.0, J48.001.AUC : 1.0, J48.001.ErrRate : 0.0, J48.001.Kappa : 1.0, MajorityClassPercentage : 51.7971442639094, MajorityClassSize : 4208.0, MaxAttributeEntropy : 3.030432883772633, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 0.906074977384, MaxNominalAttDistinctValues : 12.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 1.4092554739602103, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.19825475850613955, MeanNoiseToSignalRatio : 6.108305922031972, MeanNominalAttDistinctValues : 5.130434782608695, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : -0.0, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 0.0, MinNominalAttDistinctValues : 1.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 48.20285573609059, MinorityClassSize : 3916.0, NaiveBayesAUC : 0.9976229672941662, NaiveBayesErrRate : 0.04899064500246184, NaiveBayesKappa : 0.9015972799616292, NumberOfBinaryFeatures : 5.0, NumberOfClasses : 2.0, NumberOfFeatures : 23.0, NumberOfInstances : 8124.0, NumberOfInstancesWithMissingValues : 2480.0, NumberOfMissingValues : 2480.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 23.0, PercentageOfBinaryFeatures : 21.73913043478261, PercentageOfInstancesWithMissingValues : 30.526834071885773, PercentageOfMissingValues : 1.3272536552993814, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.8286618104993447, Quartile1KurtosisOfNumericAtts : nan, Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.034184520425602494, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 1.467128011861462, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, 
Quartile2MutualInformation : 0.174606545183155, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 2.0533554351937426, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.27510225484918505, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9999987256143267, REPTreeDepth1ErrRate : 0.00036927621861152144, REPTreeDepth1Kappa : 0.9992605118549308, REPTreeDepth2AUC : 0.9999987256143267, REPTreeDepth2ErrRate : 0.00036927621861152144, REPTreeDepth2Kappa : 0.9992605118549308, REPTreeDepth3AUC : 0.9999987256143267, REPTreeDepth3ErrRate : 0.00036927621861152144, REPTreeDepth3Kappa : 0.9992605118549308, RandomTreeDepth1AUC : 0.9995247148288974, RandomTreeDepth1ErrRate : 0.0004923682914820286, RandomTreeDepth1Kappa : 0.9990140245420991, RandomTreeDepth2AUC : 0.9995247148288974, RandomTreeDepth2ErrRate : 0.0004923682914820286, RandomTreeDepth2Kappa : 0.9990140245420991, RandomTreeDepth3AUC : 0.9995247148288974, RandomTreeDepth3ErrRate : 0.0004923682914820286, RandomTreeDepth3Kappa : 0.9990140245420991, StdvNominalAttDistinctValues : 3.1809710899501766, kNN1NAUC : 1.0, kNN1NErrRate : 0.0, kNN1NKappa : 1.0,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content="### Description\n\nThis dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.\n\n### Source\n```\n(a) Origin: \nMushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n(b) Donor: \nJeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)\n```\n\n### Dataset description\n\nThis dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. 
Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy."),
 Document(metadata={'MajorityClassSize': 4208.0, 'MaxNominalAttDistinctValues': 12.0, 'MinorityClassSize': 3916.0, 'NumberOfClasses': 2.0, 'NumberOfFeatures': 23.0, 'NumberOfInstances': 8124.0, 'NumberOfInstancesWithMissingValues': 2480.0, 'NumberOfMissingValues': 2480.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 23.0, 'Unnamed: 0': 19, 'description': "**Author**: [Jeff Schlimmer](Jeffrey.Schlimmer@a.gp.cs.cmu.edu)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/mushroom) - 1981     \n**Please cite**:  The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n\n### Description\n\nThis dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.\n\n### Source\n```\n(a) Origin: \nMushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n(b) Donor: \nJeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)\n```\n\n### Dataset description\n\nThis dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.\n\n### Attributes Information\n```\n1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s \n2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s \n3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y \n4. bruises?: bruises=t,no=f \n5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s \n6. 
gill-attachment: attached=a,descending=d,free=f,notched=n \n7. gill-spacing: close=c,crowded=w,distant=d \n8. gill-size: broad=b,narrow=n \n9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y \n10. stalk-shape: enlarging=e,tapering=t \n11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? \n12. stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s \n13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s \n14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n16. veil-type: partial=p,universal=u \n17. veil-color: brown=n,orange=o,white=w,yellow=y \n18. ring-number: none=n,one=o,two=t \n19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z \n20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y \n21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y \n22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d\n```\n\n### Relevant papers\n\nSchlimmer,J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral disseration, Department of Information and Computer Science, University of California, Irvine. \n\nIba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann. \n\nDuch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 
25-30, [Web Link] \n\nDuch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997.", 'did': 24, 'features': '0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 : [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 - habitat (nominal)], 22 : [22 - class (nominal)],', 'format': 'ARFF', 'name': 'mushroom', 'qualities': 'AutoCorrelation : 0.726332635725717, CfsSubsetEval_DecisionStumpAUC : 0.9910519616800724, CfsSubsetEval_DecisionStumpErrRate : 0.013047759724273756, CfsSubsetEval_DecisionStumpKappa : 0.9738461616958994, CfsSubsetEval_NaiveBayesAUC : 0.9910519616800724, CfsSubsetEval_NaiveBayesErrRate : 0.013047759724273756, CfsSubsetEval_NaiveBayesKappa : 0.9738461616958994, CfsSubsetEval_kNN1NAUC : 0.9910519616800724, CfsSubsetEval_kNN1NErrRate : 0.013047759724273756, CfsSubsetEval_kNN1NKappa : 0.9738461616958994, ClassEntropy : 0.9990678968724604, DecisionStumpAUC : 0.8894935275772204, DecisionStumpErrRate : 0.11324470704086657, DecisionStumpKappa : 0.77457574608175, Dimensionality : 0.002831117676021664, EquivalentNumberOfAtts : 5.0393135801657, J48.00001.AUC : 1.0, J48.00001.ErrRate : 0.0, J48.00001.Kappa : 
1.0, J48.0001.AUC : 1.0, J48.0001.ErrRate : 0.0, J48.0001.Kappa : 1.0, J48.001.AUC : 1.0, J48.001.ErrRate : 0.0, J48.001.Kappa : 1.0, MajorityClassPercentage : 51.7971442639094, MajorityClassSize : 4208.0, MaxAttributeEntropy : 3.030432883772633, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 0.906074977384, MaxNominalAttDistinctValues : 12.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 1.4092554739602103, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.19825475850613955, MeanNoiseToSignalRatio : 6.108305922031972, MeanNominalAttDistinctValues : 5.130434782608695, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : -0.0, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 0.0, MinNominalAttDistinctValues : 1.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 48.20285573609059, MinorityClassSize : 3916.0, NaiveBayesAUC : 0.9976229672941662, NaiveBayesErrRate : 0.04899064500246184, NaiveBayesKappa : 0.9015972799616292, NumberOfBinaryFeatures : 5.0, NumberOfClasses : 2.0, NumberOfFeatures : 23.0, NumberOfInstances : 8124.0, NumberOfInstancesWithMissingValues : 2480.0, NumberOfMissingValues : 2480.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 23.0, PercentageOfBinaryFeatures : 21.73913043478261, PercentageOfInstancesWithMissingValues : 30.526834071885773, PercentageOfMissingValues : 1.3272536552993814, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.8286618104993447, Quartile1KurtosisOfNumericAtts : nan, Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.034184520425602494, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 1.467128011861462, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, 
Quartile2MutualInformation : 0.174606545183155, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 2.0533554351937426, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.27510225484918505, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9999987256143267, REPTreeDepth1ErrRate : 0.00036927621861152144, REPTreeDepth1Kappa : 0.9992605118549308, REPTreeDepth2AUC : 0.9999987256143267, REPTreeDepth2ErrRate : 0.00036927621861152144, REPTreeDepth2Kappa : 0.9992605118549308, REPTreeDepth3AUC : 0.9999987256143267, REPTreeDepth3ErrRate : 0.00036927621861152144, REPTreeDepth3Kappa : 0.9992605118549308, RandomTreeDepth1AUC : 0.9995247148288974, RandomTreeDepth1ErrRate : 0.0004923682914820286, RandomTreeDepth1Kappa : 0.9990140245420991, RandomTreeDepth2AUC : 0.9995247148288974, RandomTreeDepth2ErrRate : 0.0004923682914820286, RandomTreeDepth2Kappa : 0.9990140245420991, RandomTreeDepth3AUC : 0.9995247148288974, RandomTreeDepth3ErrRate : 0.0004923682914820286, RandomTreeDepth3Kappa : 0.9990140245420991, StdvNominalAttDistinctValues : 3.1809710899501766, kNN1NAUC : 1.0, kNN1NErrRate : 0.0, kNN1NKappa : 1.0,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='did - 24, name - mushroom, version - 1, uploader - 1, status - active, format - ARFF, MajorityClassSize - 4208.0, MaxNominalAttDistinctValues - 12.0, MinorityClassSize - 3916.0, NumberOfClasses - 2.0, NumberOfFeatures - 23.0, NumberOfInstances - 8124.0, NumberOfInstancesWithMissingValues - 2480.0, NumberOfMissingValues - 2480.0, NumberOfNumericFeatures - 0.0, NumberOfSymbolicFeatures - 23.0, description - **Author**: [Jeff Schlimmer](Jeffrey.Schlimmer@a.gp.cs.cmu.edu)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/mushroom) - 1981     \n**Please cite**:  The Audubon Society Field Guide to North American Mushrooms (1981). G. H. 
Lincoff (Pres.), New York: Alfred A. Knopf \n\n\n### Description\n\nThis dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.'),
 Document(metadata={'NumberOfClasses': 0.0, 'NumberOfFeatures': 37.0, 'NumberOfInstances': 6435.0, 'NumberOfInstancesWithMissingValues': 0.0, 'NumberOfMissingValues': 0.0, 'NumberOfNumericFeatures': 37.0, 'NumberOfSymbolicFeatures': 0.0, 'Unnamed: 0': 203, 'description': "**Author**:   \n**Source**: Unknown - 1993  \n**Please cite**:   \n\nSource:\nAshwin Srinivasan\nDepartment of Statistics and Data Modeling\nUniversity of Strathclyde\nGlasgow\nScotland\nUK\nross '@' uk.ac.turing\n\nThe original Landsat data for this database was generated from data purchased from NASA by the Australian Centre for Remote Sensing, and used for research at: \nThe Centre for Remote Sensing\nUniversity of New South Wales\nKensington, PO Box 1\nNSW 2033\nAustralia.\n\nThe sample database was generated taking a small section (82 rows and 100 columns) from the original data. The binary values were converted to their present ASCII form by Ashwin Srinivasan. The classification for each pixel was performed on the basis of an actual site visit by Ms. Karen Hall, when working for Professor John A. Richards, at the Centre for Remote Sensing at the University of New South Wales, Australia. Conversion to 3x3 neighbourhoods and splitting into test and training sets was done by Alistair Sutherland.\n\nData Set Information:\nThe database consists of the multi-spectral values of pixels in 3x3 neighbourhoods in a satellite image, and the classification associated with the central pixel in each neighbourhood. The aim is to predict this classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a number. The Landsat satellite data is one of the many sources of information available for a scene. The interpretation of a scene by  integrating spatial data of diverse types and resolutions including multispectral and radar data, maps indicating topography, land use etc. 
is expected to assume significant importance with the onset of an era characterised by integrative approaches to remote sensing (for example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered in isolation (as in this sample database). This data satisfies the important requirements of being numerical and at a single resolution, and standard maximum-likelihood classification performs very well. Consequently, for this data, it should be interesting to compare the performance of other methods against the statistical approach. One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is a 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains 2340 x 3380 such pixels. The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds to a 3x3 square neighbourhood of pixels completely contained within the 82x100 sub-area. Each line contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3 neighbourhood and a number indicating the classification label of the central pixel. The number is a code for the following classes:\n\nNumber Class\n1 red soil\n2 cotton crop\n3 grey soil\n4 damp grey soil\n5 soil with vegetation stubble\n6 mixture class (all types present)\n7 very damp grey soil\nNB. There are no examples with class 6 in this dataset.\n \nThe data is given in random order and certain lines of data have been removed so you cannot reconstruct the original image from this dataset. 
In each line of data the four spectral values for the top-left pixel are given first followed by the four spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are given by attributes 17,18,19 and 20. If you like you can use only these four attributes, while ignoring the others. This avoids the problem which arises when a 3x3 neighbourhood straddles a boundary.\n\nAttribute Information:\nThe attributes are numerical, in the range 0 to 255.\n\nUCI: http://archive.ics.uci.edu/ml/datasets/Statlog+(Landsat+Satellite)", 'did': 294, 'features': '0 : [0 - attr1 (numeric)], 1 : [1 - attr2 (numeric)], 2 : [2 - attr3 (numeric)], 3 : [3 - attr4 (numeric)], 4 : [4 - attr5 (numeric)], 5 : [5 - attr6 (numeric)], 6 : [6 - attr7 (numeric)], 7 : [7 - attr8 (numeric)], 8 : [8 - attr9 (numeric)], 9 : [9 - attr10 (numeric)], 10 : [10 - attr11 (numeric)], 11 : [11 - attr12 (numeric)], 12 : [12 - attr13 (numeric)], 13 : [13 - attr14 (numeric)], 14 : [14 - attr15 (numeric)], 15 : [15 - attr16 (numeric)], 16 : [16 - attr17 (numeric)], 17 : [17 - attr18 (numeric)], 18 : [18 - attr19 (numeric)], 19 : [19 - attr20 (numeric)], 20 : [20 - attr21 (numeric)], 21 : [21 - attr22 (numeric)], 22 : [22 - attr23 (numeric)], 23 : [23 - attr24 (numeric)], 24 : [24 - attr25 (numeric)], 25 : [25 - attr26 (numeric)], 26 : [26 - attr27 (numeric)], 27 : [27 - attr28 (numeric)], 28 : [28 - attr29 (numeric)], 29 : [29 - attr30 (numeric)], 30 : [30 - attr31 (numeric)], 31 : [31 - attr32 (numeric)], 32 : [32 - attr33 (numeric)], 33 : [33 - attr34 (numeric)], 34 : [34 - attr35 (numeric)], 35 : [35 - attr36 (numeric)], 36 : [36 - class (numeric)],', 'format': 'ARFF', 'name': 'satellite_image', 'qualities': 'AutoCorrelation : 0.5853279452906435, CfsSubsetEval_DecisionStumpAUC : nan, CfsSubsetEval_DecisionStumpErrRate : nan, CfsSubsetEval_DecisionStumpKappa : 
nan, CfsSubsetEval_NaiveBayesAUC : nan, CfsSubsetEval_NaiveBayesErrRate : nan, CfsSubsetEval_NaiveBayesKappa : nan, CfsSubsetEval_kNN1NAUC : nan, CfsSubsetEval_kNN1NErrRate : nan, CfsSubsetEval_kNN1NKappa : nan, ClassEntropy : nan, DecisionStumpAUC : nan, DecisionStumpErrRate : nan, DecisionStumpKappa : nan, Dimensionality : 0.00574980574980575, EquivalentNumberOfAtts : nan, J48.00001.AUC : nan, J48.00001.ErrRate : nan, J48.00001.Kappa : nan, J48.0001.AUC : nan, J48.0001.ErrRate : nan, J48.0001.Kappa : nan, J48.001.AUC : nan, J48.001.ErrRate : nan, J48.001.Kappa : nan, MajorityClassPercentage : nan, MajorityClassSize : nan, MaxAttributeEntropy : nan, MaxKurtosisOfNumericAtts : 1.2773432544146832, MaxMeansOfNumericAtts : 99.31126651126642, MaxMutualInformation : nan, MaxNominalAttDistinctValues : nan, MaxSkewnessOfNumericAtts : 0.9187090836988436, MaxStdDevOfNumericAtts : 22.90506492772991, MeanAttributeEntropy : nan, MeanKurtosisOfNumericAtts : -0.18345361023395665, MeanMeansOfNumericAtts : 81.3149961149961, MeanMutualInformation : nan, MeanNoiseToSignalRatio : nan, MeanNominalAttDistinctValues : nan, MeanSkewnessOfNumericAtts : 0.04831449741968043, MeanStdDevOfNumericAtts : 17.586070075450067, MinAttributeEntropy : nan, MinKurtosisOfNumericAtts : -1.2441720904806828, MinMeansOfNumericAtts : 3.6686868686868834, MinMutualInformation : nan, MinNominalAttDistinctValues : nan, MinSkewnessOfNumericAtts : -0.6747275074215006, MinStdDevOfNumericAtts : 2.214052121287819, MinorityClassPercentage : nan, MinorityClassSize : nan, NaiveBayesAUC : nan, NaiveBayesErrRate : nan, NaiveBayesKappa : nan, NumberOfBinaryFeatures : 0.0, NumberOfClasses : 0.0, NumberOfFeatures : 37.0, NumberOfInstances : 6435.0, NumberOfInstancesWithMissingValues : 0.0, NumberOfMissingValues : 0.0, NumberOfNumericFeatures : 37.0, NumberOfSymbolicFeatures : 0.0, PercentageOfBinaryFeatures : 0.0, PercentageOfInstancesWithMissingValues : 0.0, PercentageOfMissingValues : 0.0, PercentageOfNumericFeatures : 
100.0, PercentageOfSymbolicFeatures : 0.0, Quartile1AttributeEntropy : nan, Quartile1KurtosisOfNumericAtts : -0.8829551820521702, Quartile1MeansOfNumericAtts : 69.34483294483297, Quartile1MutualInformation : nan, Quartile1SkewnessOfNumericAtts : -0.3859749826493584, Quartile1StdDevOfNumericAtts : 13.604282494809674, Quartile2AttributeEntropy : nan, Quartile2KurtosisOfNumericAtts : -0.6732423440004554, Quartile2MeansOfNumericAtts : 82.66060606060603, Quartile2MutualInformation : nan, Quartile2SkewnessOfNumericAtts : 0.02239958092752799, Quartile2StdDevOfNumericAtts : 16.729622667298376, Quartile3AttributeEntropy : nan, Quartile3KurtosisOfNumericAtts : 0.5035049254688353, Quartile3MeansOfNumericAtts : 91.22408702408694, Quartile3MutualInformation : nan, Quartile3SkewnessOfNumericAtts : 0.6162940189640502, Quartile3StdDevOfNumericAtts : 20.936744304390697, REPTreeDepth1AUC : nan, REPTreeDepth1ErrRate : nan, REPTreeDepth1Kappa : nan, REPTreeDepth2AUC : nan, REPTreeDepth2ErrRate : nan, REPTreeDepth2Kappa : nan, REPTreeDepth3AUC : nan, REPTreeDepth3ErrRate : nan, REPTreeDepth3Kappa : nan, RandomTreeDepth1AUC : nan, RandomTreeDepth1ErrRate : nan, RandomTreeDepth1Kappa : nan, RandomTreeDepth2AUC : nan, RandomTreeDepth2ErrRate : nan, RandomTreeDepth2Kappa : nan, RandomTreeDepth3AUC : nan, RandomTreeDepth3ErrRate : nan, RandomTreeDepth3Kappa : nan, StdvNominalAttDistinctValues : nan, kNN1NAUC : nan, kNN1NErrRate : nan, kNN1NKappa : nan,', 'status': 'active', 'uploader': 94, 'version': 1}, page_content='Data Set Information:'),
 Document(metadata={'MajorityClassSize': 518298.0, 'MaxNominalAttDistinctValues': 12.0, 'MinorityClassSize': 481702.0, 'NumberOfClasses': 2.0, 'NumberOfFeatures': 23.0, 'NumberOfInstances': 1000000.0, 'NumberOfInstancesWithMissingValues': 0.0, 'NumberOfMissingValues': 0.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 23.0, 'Unnamed: 0': 68, 'did': 120, 'features': '0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 : [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 - habitat (nominal)], 22 : [22 - class (nominal)],', 'format': 'ARFF', 'name': 'BNG(mushroom)', 'qualities': 'AutoCorrelation : 0.5011905011905012, CfsSubsetEval_DecisionStumpAUC : 0.9847860299226502, CfsSubsetEval_DecisionStumpErrRate : 0.021824, CfsSubsetEval_DecisionStumpKappa : 0.9562780842181652, CfsSubsetEval_NaiveBayesAUC : 0.9847860299226502, CfsSubsetEval_NaiveBayesErrRate : 0.021824, CfsSubsetEval_NaiveBayesKappa : 0.9562780842181652, CfsSubsetEval_kNN1NAUC : 0.9847860299226502, CfsSubsetEval_kNN1NErrRate : 0.021824, CfsSubsetEval_kNN1NKappa : 0.9562780842181652, ClassEntropy : 0.9990337071596953, DecisionStumpAUC : 0.8815512935166292, DecisionStumpErrRate : 0.121245, DecisionStumpKappa : 0.7587911383829151, Dimensionality : 2.3e-05, EquivalentNumberOfAtts : 6.097271107545528, J48.00001.AUC : 0.9962742048687271, J48.00001.ErrRate : 
0.007847, J48.00001.Kappa : 0.9842850236101645, J48.0001.AUC : 0.9962742048687271, J48.0001.ErrRate : 0.007847, J48.0001.Kappa : 0.9842850236101645, J48.001.AUC : 0.9962742048687271, J48.001.ErrRate : 0.007847, J48.001.Kappa : 0.9842850236101645, MajorityClassPercentage : 51.829800000000006, MajorityClassSize : 518298.0, MaxAttributeEntropy : 3.0845637992777144, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 0.84128137803192, MaxNominalAttDistinctValues : 12.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 1.5385002552906082, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.16384931710242728, MeanNoiseToSignalRatio : 8.389726380909137, MeanNominalAttDistinctValues : 5.521739130434782, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : 0.0016183542115170931, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 1.10978079e-06, MinNominalAttDistinctValues : 2.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 48.1702, MinorityClassSize : 481702.0, NaiveBayesAUC : 0.989456054908011, NaiveBayesErrRate : 0.072603, NaiveBayesKappa : 0.8540047016650592, NumberOfBinaryFeatures : 5.0, NumberOfClasses : 2.0, NumberOfFeatures : 23.0, NumberOfInstances : 1000000.0, NumberOfInstancesWithMissingValues : 0.0, NumberOfMissingValues : 0.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 23.0, PercentageOfBinaryFeatures : 21.73913043478261, PercentageOfInstancesWithMissingValues : 0.0, PercentageOfMissingValues : 0.0, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.8684476271925594, Quartile1KurtosisOfNumericAtts : nan, Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.0261470876211225, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 
1.5540739508863595, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, Quartile2MutualInformation : 0.13087075005855, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 2.281038681528015, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.2240629781340025, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9971920115811678, REPTreeDepth1ErrRate : 0.01052, REPTreeDepth1Kappa : 0.9789309934238616, REPTreeDepth2AUC : 0.9971920115811678, REPTreeDepth2ErrRate : 0.01052, REPTreeDepth2Kappa : 0.9789309934238616, REPTreeDepth3AUC : 0.9971920115811678, REPTreeDepth3ErrRate : 0.01052, REPTreeDepth3Kappa : 0.9789309934238616, RandomTreeDepth1AUC : 0.9815888004820784, RandomTreeDepth1ErrRate : 0.024243, RandomTreeDepth1Kappa : 0.9514421122524949, RandomTreeDepth2AUC : 0.9815888004820784, RandomTreeDepth2ErrRate : 0.024243, RandomTreeDepth2Kappa : 0.9514421122524949, RandomTreeDepth3AUC : 0.9815888004820784, RandomTreeDepth3ErrRate : 0.024243, RandomTreeDepth3Kappa : 0.9514421122524949, StdvNominalAttDistinctValues : 3.0580677978302706, kNN1NAUC : 0.9989058041456409, kNN1NErrRate : 0.011358, kNN1NKappa : 0.977249584712958,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='did - 120, name - BNG(mushroom), version - 1, uploader - 1, status - active, format - ARFF, MajorityClassSize - 518298.0, MaxNominalAttDistinctValues - 12.0, MinorityClassSize - 481702.0, NumberOfClasses - 2.0, NumberOfFeatures - 23.0, NumberOfInstances - 1000000.0, NumberOfInstancesWithMissingValues - 0.0, NumberOfMissingValues - 0.0, NumberOfNumericFeatures - 0.0, NumberOfSymbolicFeatures - 23.0, description - None, qualities - AutoCorrelation : 0.5011905011905012, CfsSubsetEval_DecisionStumpAUC : 0.9847860299226502, CfsSubsetEval_DecisionStumpErrRate : 0.021824, CfsSubsetEval_DecisionStumpKappa : 0.9562780842181652, 
CfsSubsetEval_NaiveBayesAUC : 0.9847860299226502, CfsSubsetEval_NaiveBayesErrRate : 0.021824, CfsSubsetEval_NaiveBayesKappa : 0.9562780842181652, CfsSubsetEval_kNN1NAUC : 0.9847860299226502, CfsSubsetEval_kNN1NErrRate : 0.021824, CfsSubsetEval_kNN1NKappa : 0.9562780842181652, ClassEntropy : 0.9990337071596953, DecisionStumpAUC : 0.8815512935166292, DecisionStumpErrRate :'),
 Document(metadata={'MajorityClassSize': 518298.0, 'MaxNominalAttDistinctValues': 12.0, 'MinorityClassSize': 481702.0, 'NumberOfClasses': 2.0, 'NumberOfFeatures': 23.0, 'NumberOfInstances': 1000000.0, 'NumberOfInstancesWithMissingValues': 0.0, 'NumberOfMissingValues': 0.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 23.0, 'Unnamed: 0': 68, 'did': 120, 'features': '0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 : [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 - habitat (nominal)], 22 : [22 - class (nominal)],', 'format': 'ARFF', 'name': 'BNG(mushroom)', 'qualities': 'AutoCorrelation : 0.5011905011905012, CfsSubsetEval_DecisionStumpAUC : 0.9847860299226502, CfsSubsetEval_DecisionStumpErrRate : 0.021824, CfsSubsetEval_DecisionStumpKappa : 0.9562780842181652, CfsSubsetEval_NaiveBayesAUC : 0.9847860299226502, CfsSubsetEval_NaiveBayesErrRate : 0.021824, CfsSubsetEval_NaiveBayesKappa : 0.9562780842181652, CfsSubsetEval_kNN1NAUC : 0.9847860299226502, CfsSubsetEval_kNN1NErrRate : 0.021824, CfsSubsetEval_kNN1NKappa : 0.9562780842181652, ClassEntropy : 0.9990337071596953, DecisionStumpAUC : 0.8815512935166292, DecisionStumpErrRate : 0.121245, DecisionStumpKappa : 0.7587911383829151, Dimensionality : 2.3e-05, EquivalentNumberOfAtts : 6.097271107545528, J48.00001.AUC : 0.9962742048687271, J48.00001.ErrRate : 
0.007847, J48.00001.Kappa : 0.9842850236101645, J48.0001.AUC : 0.9962742048687271, J48.0001.ErrRate : 0.007847, J48.0001.Kappa : 0.9842850236101645, J48.001.AUC : 0.9962742048687271, J48.001.ErrRate : 0.007847, J48.001.Kappa : 0.9842850236101645, MajorityClassPercentage : 51.829800000000006, MajorityClassSize : 518298.0, MaxAttributeEntropy : 3.0845637992777144, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 0.84128137803192, MaxNominalAttDistinctValues : 12.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 1.5385002552906082, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.16384931710242728, MeanNoiseToSignalRatio : 8.389726380909137, MeanNominalAttDistinctValues : 5.521739130434782, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : 0.0016183542115170931, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 1.10978079e-06, MinNominalAttDistinctValues : 2.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 48.1702, MinorityClassSize : 481702.0, NaiveBayesAUC : 0.989456054908011, NaiveBayesErrRate : 0.072603, NaiveBayesKappa : 0.8540047016650592, NumberOfBinaryFeatures : 5.0, NumberOfClasses : 2.0, NumberOfFeatures : 23.0, NumberOfInstances : 1000000.0, NumberOfInstancesWithMissingValues : 0.0, NumberOfMissingValues : 0.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 23.0, PercentageOfBinaryFeatures : 21.73913043478261, PercentageOfInstancesWithMissingValues : 0.0, PercentageOfMissingValues : 0.0, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.8684476271925594, Quartile1KurtosisOfNumericAtts : nan, Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.0261470876211225, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 
1.5540739508863595, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, Quartile2MutualInformation : 0.13087075005855, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 2.281038681528015, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.2240629781340025, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9971920115811678, REPTreeDepth1ErrRate : 0.01052, REPTreeDepth1Kappa : 0.9789309934238616, REPTreeDepth2AUC : 0.9971920115811678, REPTreeDepth2ErrRate : 0.01052, REPTreeDepth2Kappa : 0.9789309934238616, REPTreeDepth3AUC : 0.9971920115811678, REPTreeDepth3ErrRate : 0.01052, REPTreeDepth3Kappa : 0.9789309934238616, RandomTreeDepth1AUC : 0.9815888004820784, RandomTreeDepth1ErrRate : 0.024243, RandomTreeDepth1Kappa : 0.9514421122524949, RandomTreeDepth2AUC : 0.9815888004820784, RandomTreeDepth2ErrRate : 0.024243, RandomTreeDepth2Kappa : 0.9514421122524949, RandomTreeDepth3AUC : 0.9815888004820784, RandomTreeDepth3ErrRate : 0.024243, RandomTreeDepth3Kappa : 0.9514421122524949, StdvNominalAttDistinctValues : 3.0580677978302706, kNN1NAUC : 0.9989058041456409, kNN1NErrRate : 0.011358, kNN1NKappa : 0.977249584712958,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='RandomTreeDepth3ErrRate : 0.024243, RandomTreeDepth3Kappa : 0.9514421122524949, StdvNominalAttDistinctValues : 3.0580677978302706, kNN1NAUC : 0.9989058041456409, kNN1NErrRate : 0.011358, kNN1NKappa : 0.977249584712958,, features - 0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 
: [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 -'),
 Document(metadata={'MajorityClassSize': 92.0, 'MaxNominalAttDistinctValues': 19.0, 'MinorityClassSize': 8.0, 'NumberOfClasses': 19.0, 'NumberOfFeatures': 36.0, 'NumberOfInstances': 683.0, 'NumberOfInstancesWithMissingValues': 121.0, 'NumberOfMissingValues': 2337.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 36.0, 'Unnamed: 0': 36, 'description': '**Author**: R.S. Michalski and R.L. Chilausky (Donors: Ming Tan & Jeff Schlimmer)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)) - 1988  \n**Please cite**: R.S. Michalski and R.L. Chilausky "Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis", International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.  \n\n**Large Soybean Database**  \nThis is the large soybean database from the UCI repository, with its training and test database combined into a single file. \n\nThere are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value \'dna\' means does not apply. The values for attributes are encoded numerically, with the first value encoded as "0,\'\' the second as "1,\'\' and so forth. An unknown value is encoded as "?\'\'.\n\n### Attribute Information\n\n1. date: april,may,june,july,august,september,october,?. \n2. plant-stand: normal,lt-normal,?. \n3. precip: lt-norm,norm,gt-norm,?. \n4. temp: lt-norm,norm,gt-norm,?. \n5. hail: yes,no,?. \n6. crop-hist: diff-lst-year,same-lst-yr,same-lst-two-yrs, \nsame-lst-sev-yrs,?. \n7. area-damaged: scattered,low-areas,upper-areas,whole-field,?. \n8. severity: minor,pot-severe,severe,?. \n9. seed-tmt: none,fungicide,other,?. \n10. germination: 90-100%,80-89%,lt-80%,?. 
\n11. plant-growth: norm,abnorm,?. \n12. leaves: norm,abnorm. \n13. leafspots-halo: absent,yellow-halos,no-yellow-halos,?. \n14. leafspots-marg: w-s-marg,no-w-s-marg,dna,?. \n15. leafspot-size: lt-1/8,gt-1/8,dna,?. \n16. leaf-shread: absent,present,?. \n17. leaf-malf: absent,present,?. \n18. leaf-mild: absent,upper-surf,lower-surf,?. \n19. stem: norm,abnorm,?. \n20. lodging: yes,no,?. \n21. stem-cankers: absent,below-soil,above-soil,above-sec-nde,?. \n22. canker-lesion: dna,brown,dk-brown-blk,tan,?. \n23. fruiting-bodies: absent,present,?. \n24. external decay: absent,firm-and-dry,watery,?. \n25. mycelium: absent,present,?. \n26. int-discolor: none,brown,black,?. \n27. sclerotia: absent,present,?. \n28. fruit-pods: norm,diseased,few-present,dna,?. \n29. fruit spots: absent,colored,brown-w/blk-specks,distort,dna,?. \n30. seed: norm,abnorm,?. \n31. mold-growth: absent,present,?. \n32. seed-discolor: absent,present,?. \n33. seed-size: norm,lt-norm,?. \n34. shriveling: absent,present,?. \n35. roots: norm,rotted,galls-cysts,?.\n\n### Classes \n\n-- 19 Classes = {diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot, phytophthora-rot, brown-stem-rot, powdery-mildew, downy-mildew, brown-spot, bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, phyllosticta-leaf-spot, alternarialeaf-spot, frog-eye-leaf-spot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury} \n\n### Revelant papers\n\nTan, M., & Eshelman, L. (1988). Using weighted networks to represent classification knowledge in noisy domains. Proceedings of the Fifth International Conference on Machine Learning (pp. 121-134). Ann Arbor, Michigan: Morgan Kaufmann. \n\nFisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and Predictive Accuracy. Proceedings of the Fifth International Conference on Machine Learning (pp. 22-28). 
Ann Arbor, Michigan: Morgan Kaufmann.', 'did': 42, 'features': '0 : [0 - date (nominal)], 1 : [1 - plant-stand (nominal)], 2 : [2 - precip (nominal)], 3 : [3 - temp (nominal)], 4 : [4 - hail (nominal)], 5 : [5 - crop-hist (nominal)], 6 : [6 - area-damaged (nominal)], 7 : [7 - severity (nominal)], 8 : [8 - seed-tmt (nominal)], 9 : [9 - germination (nominal)], 10 : [10 - plant-growth (nominal)], 11 : [11 - leaves (nominal)], 12 : [12 - leafspots-halo (nominal)], 13 : [13 - leafspots-marg (nominal)], 14 : [14 - leafspot-size (nominal)], 15 : [15 - leaf-shread (nominal)], 16 : [16 - leaf-malf (nominal)], 17 : [17 - leaf-mild (nominal)], 18 : [18 - stem (nominal)], 19 : [19 - lodging (nominal)], 20 : [20 - stem-cankers (nominal)], 21 : [21 - canker-lesion (nominal)], 22 : [22 - fruiting-bodies (nominal)], 23 : [23 - external-decay (nominal)], 24 : [24 - mycelium (nominal)], 25 : [25 - int-discolor (nominal)], 26 : [26 - sclerotia (nominal)], 27 : [27 - fruit-pods (nominal)], 28 : [28 - fruit-spots (nominal)], 29 : [29 - seed (nominal)], 30 : [30 - mold-growth (nominal)], 31 : [31 - seed-discolor (nominal)], 32 : [32 - seed-size (nominal)], 33 : [33 - shriveling (nominal)], 34 : [34 - roots (nominal)], 35 : [35 - class (nominal)],', 'format': 'ARFF', 'name': 'soybean', 'qualities': 'AutoCorrelation : 0.9457478005865103, CfsSubsetEval_DecisionStumpAUC : 0.9620422408823379, CfsSubsetEval_DecisionStumpErrRate : 0.13323572474377746, CfsSubsetEval_DecisionStumpKappa : 0.8534752853145238, CfsSubsetEval_NaiveBayesAUC : 0.9620422408823379, CfsSubsetEval_NaiveBayesErrRate : 0.13323572474377746, CfsSubsetEval_NaiveBayesKappa : 0.8534752853145238, CfsSubsetEval_kNN1NAUC : 0.9620422408823379, CfsSubsetEval_kNN1NErrRate : 0.13323572474377746, CfsSubsetEval_kNN1NKappa : 0.8534752853145238, ClassEntropy : 3.83550798457672, DecisionStumpAUC : 0.8099631489104341, DecisionStumpErrRate : 0.7203513909224012, DecisionStumpKappa : 0.19424522533539545, Dimensionality : 0.0527086383601757, 
EquivalentNumberOfAtts : 7.508591767241043, J48.00001.AUC : 0.9739047068470593, J48.00001.ErrRate : 0.12152269399707175, J48.00001.Kappa : 0.8663370421980624, J48.0001.AUC : 0.9739047068470593, J48.0001.ErrRate : 0.12152269399707175, J48.0001.Kappa : 0.8663370421980624, J48.001.AUC : 0.9739047068470593, J48.001.ErrRate : 0.12152269399707175, J48.001.Kappa : 0.8663370421980624, MajorityClassPercentage : 13.469985358711567, MajorityClassSize : 92.0, MaxAttributeEntropy : 2.6849389644492594, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 1.28692474762189, MaxNominalAttDistinctValues : 19.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 0.9655890619117928, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.5108158897798274, MeanNoiseToSignalRatio : 0.8902878340922058, MeanNominalAttDistinctValues : 3.2777777777777777, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : 0.07262476248540556, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 0.0468182939867, MinNominalAttDistinctValues : 2.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 1.171303074670571, MinorityClassSize : 8.0, NaiveBayesAUC : 0.9921587580230303, NaiveBayesErrRate : 0.08931185944363104, NaiveBayesKappa : 0.9019654903843212, NumberOfBinaryFeatures : 16.0, NumberOfClasses : 19.0, NumberOfFeatures : 36.0, NumberOfInstances : 683.0, NumberOfInstancesWithMissingValues : 121.0, NumberOfMissingValues : 2337.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 36.0, PercentageOfBinaryFeatures : 44.44444444444444, PercentageOfInstancesWithMissingValues : 17.71595900439239, PercentageOfMissingValues : 9.504636408003904, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.4629328593168401, Quartile1KurtosisOfNumericAtts : nan, 
Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.26369905545327, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 0.9158362664344971, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, Quartile2MutualInformation : 0.45996721558355, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 1.408326420019514, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.71879499353135, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9436075624852911, REPTreeDepth1ErrRate : 0.26500732064421667, REPTreeDepth1Kappa : 0.7052208643815201, REPTreeDepth2AUC : 0.9436075624852911, REPTreeDepth2ErrRate : 0.26500732064421667, REPTreeDepth2Kappa : 0.7052208643815201, REPTreeDepth3AUC : 0.9436075624852911, REPTreeDepth3ErrRate : 0.26500732064421667, REPTreeDepth3Kappa : 0.7052208643815201, RandomTreeDepth1AUC : 0.9035959879652148, RandomTreeDepth1ErrRate : 0.18594436310395315, RandomTreeDepth1Kappa : 0.7960191985250715, RandomTreeDepth2AUC : 0.9035959879652148, RandomTreeDepth2ErrRate : 0.18594436310395315, RandomTreeDepth2Kappa : 0.7960191985250715, RandomTreeDepth3AUC : 0.9035959879652148, RandomTreeDepth3ErrRate : 0.18594436310395315, RandomTreeDepth3Kappa : 0.7960191985250715, StdvNominalAttDistinctValues : 2.884551077834282, kNN1NAUC : 0.9616161058225481, kNN1NErrRate : 0.1171303074670571, kNN1NKappa : 0.871344781387376,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='### Classes \n\n-- 19 Classes = {diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot, phytophthora-rot, brown-stem-rot, powdery-mildew, downy-mildew, brown-spot, bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, phyllosticta-leaf-spot, alternarialeaf-spot, frog-eye-leaf-spot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, 
herbicide-injury} \n\n### Revelant papers\n\nTan, M., & Eshelman, L. (1988). Using weighted networks to represent classification knowledge in noisy domains. Proceedings of the Fifth International Conference on Machine Learning (pp. 121-134). Ann Arbor, Michigan: Morgan Kaufmann.'),
 Document(metadata={'MajorityClassSize': 214.0, 'MaxNominalAttDistinctValues': 27.0, 'MinorityClassSize': 105.0, 'NumberOfClasses': 5.0, 'NumberOfFeatures': 20.0, 'NumberOfInstances': 736.0, 'NumberOfInstancesWithMissingValues': 95.0, 'NumberOfMissingValues': 448.0, 'NumberOfNumericFeatures': 14.0, 'NumberOfSymbolicFeatures': 6.0, 'Unnamed: 0': 123, 'description': "**Author**: Bruce Bulloch    \n**Source**: [WEKA Dataset Collection](http://www.cs.waikato.ac.nz/ml/weka/datasets.html) - part of the agridatasets archive. [This is the true source](http://tunedit.org/repo/Data/Agricultural/eucalyptus.arff)  \n**Please cite**: None  \n\n**Eucalyptus Soil Conservation**  \nThe objective was to determine which seedlots in a species are best for soil conservation in seasonally dry hill country. Determination is found by measurement of height, diameter by height, survival, and other contributing factors. \n \nIt is important to note that eucalypt trial methods changed over time; earlier trials included mostly 15 - 30cm tall seedling grown in peat plots and the later trials have included mostly three replications of eight trees grown. This change may contribute to less significant results.\n\nExperimental data recording procedures which require noting include:\n - instances with no data recorded due to experimental recording procedures\n   require that the absence of a species from one replicate at a site was\n   treated as a missing value, but if absent from two or more replicates at a\n   site the species was excluded from the site's analyses.\n - missing data for survival, vigour, insect resistance, stem form, crown form\n   and utility especially for the data recorded at the Morea Station; this \n   could indicate the death of species in these areas or a lack in collection\n   of data.  \n\n### Attribute Information  \n \n  1.  Abbrev - site abbreviation - enumerated\n  2.  Rep - site rep - integer\n  3.  Locality - site locality in the North Island - enumerated\n  4.  
Map_Ref - map location in the North Island - enumerated\n  5.  Latitude - latitude approximation - enumerated\n  6.  Altitude - altitude approximation - integer\n  7.  Rainfall - rainfall (mm pa) - integer\n  8.  Frosts - frosts (deg. c) - integer\n  9.  Year - year of planting - integer\n  10. Sp - species code - enumerated\n  11. PMCno - seedlot number - integer\n  12. DBH - best diameter base height (cm) - real\n  13. Ht - height (m) - real\n  14. Surv - survival - integer\n  15. Vig - vigour - real\n  16. Ins_res - insect resistance - real\n  17. Stem_Fm - stem form - real\n  18. Crown_Fm - crown form - real\n  19. Brnch_Fm - branch form - real\n  Class:\n  20. Utility - utility rating - enumerated\n\n### Relevant papers\n\nBulluch B. T., (1992) Eucalyptus Species Selection for Soil Conservation in Seasonally Dry Hill Country - Twelfth Year Assessment  New Zealand Journal of Forestry Science 21(1): 10 - 31 (1991)  \n\nKirsten Thomson and Robert J. McQueen (1996) Machine Learning Applied to Fourteen Agricultural Datasets. 
University of Waikato Research Report  \nhttps://www.cs.waikato.ac.nz/ml/publications/1996/Thomson-McQueen-96.pdf + the original publication:", 'did': 188, 'features': '0 : [0 - Abbrev (nominal)], 1 : [1 - Rep (numeric)], 2 : [2 - Locality (nominal)], 3 : [3 - Map_Ref (nominal)], 4 : [4 - Latitude (nominal)], 5 : [5 - Altitude (numeric)], 6 : [6 - Rainfall (numeric)], 7 : [7 - Frosts (numeric)], 8 : [8 - Year (numeric)], 9 : [9 - Sp (nominal)], 10 : [10 - PMCno (numeric)], 11 : [11 - DBH (numeric)], 12 : [12 - Ht (numeric)], 13 : [13 - Surv (numeric)], 14 : [14 - Vig (numeric)], 15 : [15 - Ins_res (numeric)], 16 : [16 - Stem_Fm (numeric)], 17 : [17 - Crown_Fm (numeric)], 18 : [18 - Brnch_Fm (numeric)], 19 : [19 - Utility (nominal)],', 'format': 'ARFF', 'name': 'eucalyptus', 'qualities': 'AutoCorrelation : 0.39319727891156464, CfsSubsetEval_DecisionStumpAUC : 0.8239493966657213, CfsSubsetEval_DecisionStumpErrRate : 0.41847826086956524, CfsSubsetEval_DecisionStumpKappa : 0.4637307109078737, CfsSubsetEval_NaiveBayesAUC : 0.8239493966657213, CfsSubsetEval_NaiveBayesErrRate : 0.41847826086956524, CfsSubsetEval_NaiveBayesKappa : 0.4637307109078737, CfsSubsetEval_kNN1NAUC : 0.8239493966657213, CfsSubsetEval_kNN1NErrRate : 0.41847826086956524, CfsSubsetEval_kNN1NKappa : 0.4637307109078737, ClassEntropy : 2.262083620428274, DecisionStumpAUC : 0.7519401667350958, DecisionStumpErrRate : 0.5054347826086957, DecisionStumpKappa : 0.30247986100142155, Dimensionality : 0.02717391304347826, EquivalentNumberOfAtts : 5.9334684401020565, J48.00001.AUC : 0.8184137151683228, J48.00001.ErrRate : 0.3967391304347826, J48.00001.Kappa : 0.49336985707179887, J48.0001.AUC : 0.8184137151683228, J48.0001.ErrRate : 0.3967391304347826, J48.0001.Kappa : 0.49336985707179887, J48.001.AUC : 0.8184137151683228, J48.001.ErrRate : 0.3967391304347826, J48.001.Kappa : 0.49336985707179887, MajorityClassPercentage : 29.076086956521742, MajorityClassSize : 214.0, MaxAttributeEntropy : 4.2373637557635595, 
MaxKurtosisOfNumericAtts : 734.9416211795777, MaxMeansOfNumericAtts : 2054.7393689986247, MaxMutualInformation : 0.42753276429854, MaxNominalAttDistinctValues : 27.0, MaxSkewnessOfNumericAtts : 27.109270846229688, MaxStdDevOfNumericAtts : 1551.7798185802085, MeanAttributeEntropy : 3.4626363060529055, MeanKurtosisOfNumericAtts : 62.86596625813314, MeanMeansOfNumericAtts : 390.0868288072735, MeanMutualInformation : 0.381241367214446, MeanNoiseToSignalRatio : 8.082530396301912, MeanNominalAttDistinctValues : 13.666666666666666, MeanSkewnessOfNumericAtts : 2.551453016115177, MeanStdDevOfNumericAtts : 172.61081562461396, MinAttributeEntropy : 2.5810641739409617, MinKurtosisOfNumericAtts : -1.887802596870339, MinMeansOfNumericAtts : -2.5842391304347836, MinMutualInformation : 0.24650313929826, MinNominalAttDistinctValues : 5.0, MinSkewnessOfNumericAtts : -0.6970908724266737, MinStdDevOfNumericAtts : 0.49318784476285216, MinorityClassPercentage : 14.266304347826086, MinorityClassSize : 105.0, NaiveBayesAUC : 0.8520788174118736, NaiveBayesErrRate : 0.45108695652173914, NaiveBayesKappa : 0.42741183362624485, NumberOfBinaryFeatures : 0.0, NumberOfClasses : 5.0, NumberOfFeatures : 20.0, NumberOfInstances : 736.0, NumberOfInstancesWithMissingValues : 95.0, NumberOfMissingValues : 448.0, NumberOfNumericFeatures : 14.0, NumberOfSymbolicFeatures : 6.0, PercentageOfBinaryFeatures : 0.0, PercentageOfInstancesWithMissingValues : 12.907608695652172, PercentageOfMissingValues : 3.0434782608695654, PercentageOfNumericFeatures : 70.0, PercentageOfSymbolicFeatures : 30.0, Quartile1AttributeEntropy : 2.908861461974274, Quartile1KurtosisOfNumericAtts : -0.4961422376730956, Quartile1MeansOfNumericAtts : 2.882908545727137, Quartile1MutualInformation : 0.323312362530555, Quartile1SkewnessOfNumericAtts : -0.3960800165047112, Quartile1StdDevOfNumericAtts : 0.778502789181291, Quartile2AttributeEntropy : 3.4759821137655975, Quartile2KurtosisOfNumericAtts : 0.4289115082721384, 
Quartile2MeansOfNumericAtts : 6.249602617058818, Quartile2MutualInformation : 0.40765214345173, Quartile2SkewnessOfNumericAtts : 0.11119478923130877, Quartile2StdDevOfNumericAtts : 1.3465996573398586, Quartile3AttributeEntropy : 4.009738246275191, Quartile3KurtosisOfNumericAtts : 1.3641239688052669, Quartile3MeansOfNumericAtts : 403.0027173913041, Quartile3MutualInformation : 0.425964983779695, Quartile3SkewnessOfNumericAtts : 0.9548948008878528, Quartile3StdDevOfNumericAtts : 80.61760056258042, REPTreeDepth1AUC : 0.7171370640805235, REPTreeDepth1ErrRate : 0.5557065217391305, REPTreeDepth1Kappa : 0.2672017371533179, REPTreeDepth2AUC : 0.7171370640805235, REPTreeDepth2ErrRate : 0.5557065217391305, REPTreeDepth2Kappa : 0.2672017371533179, REPTreeDepth3AUC : 0.7171370640805235, REPTreeDepth3ErrRate : 0.5557065217391305, REPTreeDepth3Kappa : 0.2672017371533179, RandomTreeDepth1AUC : 0.7219508813313532, RandomTreeDepth1ErrRate : 0.47690217391304346, RandomTreeDepth1Kappa : 0.3915134670419616, RandomTreeDepth2AUC : 0.7219508813313532, RandomTreeDepth2ErrRate : 0.47690217391304346, RandomTreeDepth2Kappa : 0.3915134670419616, RandomTreeDepth3AUC : 0.7219508813313532, RandomTreeDepth3ErrRate : 0.47690217391304346, RandomTreeDepth3Kappa : 0.3915134670419616, StdvNominalAttDistinctValues : 7.659416862050705, kNN1NAUC : 0.7018152602695222, kNN1NErrRate : 0.46603260869565216, kNN1NKappa : 0.40228622299671357,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='Kirsten Thomson and Robert J. McQueen (1996) Machine Learning Applied to Fourteen Agricultural Datasets. University of Waikato Research Report'),
 Document(metadata={'MajorityClassSize': 92.0, 'MaxNominalAttDistinctValues': 19.0, 'MinorityClassSize': 8.0, 'NumberOfClasses': 19.0, 'NumberOfFeatures': 36.0, 'NumberOfInstances': 683.0, 'NumberOfInstancesWithMissingValues': 121.0, 'NumberOfMissingValues': 2337.0, 'NumberOfNumericFeatures': 0.0, 'NumberOfSymbolicFeatures': 36.0, 'Unnamed: 0': 36, 'description': '**Author**: R.S. Michalski and R.L. Chilausky (Donors: Ming Tan & Jeff Schlimmer)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/Soybean+(Large)) - 1988  \n**Please cite**: R.S. Michalski and R.L. Chilausky "Learning by Being Told and Learning from Examples: An Experimental Comparison of the Two Methods of Knowledge Acquisition in the Context of Developing an Expert System for Soybean Disease Diagnosis", International Journal of Policy Analysis and Information Systems, Vol. 4, No. 2, 1980.  \n\n**Large Soybean Database**  \nThis is the large soybean database from the UCI repository, with its training and test database combined into a single file. \n\nThere are 19 classes, only the first 15 of which have been used in prior work. The folklore seems to be that the last four classes are unjustified by the data since they have so few examples. There are 35 categorical attributes, some nominal and some ordered. The value \'dna\' means does not apply. The values for attributes are encoded numerically, with the first value encoded as "0,\'\' the second as "1,\'\' and so forth. An unknown value is encoded as "?\'\'.\n\n### Attribute Information\n\n1. date: april,may,june,july,august,september,october,?. \n2. plant-stand: normal,lt-normal,?. \n3. precip: lt-norm,norm,gt-norm,?. \n4. temp: lt-norm,norm,gt-norm,?. \n5. hail: yes,no,?. \n6. crop-hist: diff-lst-year,same-lst-yr,same-lst-two-yrs, \nsame-lst-sev-yrs,?. \n7. area-damaged: scattered,low-areas,upper-areas,whole-field,?. \n8. severity: minor,pot-severe,severe,?. \n9. seed-tmt: none,fungicide,other,?. \n10. germination: 90-100%,80-89%,lt-80%,?. 
\n11. plant-growth: norm,abnorm,?. \n12. leaves: norm,abnorm. \n13. leafspots-halo: absent,yellow-halos,no-yellow-halos,?. \n14. leafspots-marg: w-s-marg,no-w-s-marg,dna,?. \n15. leafspot-size: lt-1/8,gt-1/8,dna,?. \n16. leaf-shread: absent,present,?. \n17. leaf-malf: absent,present,?. \n18. leaf-mild: absent,upper-surf,lower-surf,?. \n19. stem: norm,abnorm,?. \n20. lodging: yes,no,?. \n21. stem-cankers: absent,below-soil,above-soil,above-sec-nde,?. \n22. canker-lesion: dna,brown,dk-brown-blk,tan,?. \n23. fruiting-bodies: absent,present,?. \n24. external decay: absent,firm-and-dry,watery,?. \n25. mycelium: absent,present,?. \n26. int-discolor: none,brown,black,?. \n27. sclerotia: absent,present,?. \n28. fruit-pods: norm,diseased,few-present,dna,?. \n29. fruit spots: absent,colored,brown-w/blk-specks,distort,dna,?. \n30. seed: norm,abnorm,?. \n31. mold-growth: absent,present,?. \n32. seed-discolor: absent,present,?. \n33. seed-size: norm,lt-norm,?. \n34. shriveling: absent,present,?. \n35. roots: norm,rotted,galls-cysts,?.\n\n### Classes \n\n-- 19 Classes = {diaporthe-stem-canker, charcoal-rot, rhizoctonia-root-rot, phytophthora-rot, brown-stem-rot, powdery-mildew, downy-mildew, brown-spot, bacterial-blight, bacterial-pustule, purple-seed-stain, anthracnose, phyllosticta-leaf-spot, alternarialeaf-spot, frog-eye-leaf-spot, diaporthe-pod-&-stem-blight, cyst-nematode, 2-4-d-injury, herbicide-injury} \n\n### Revelant papers\n\nTan, M., & Eshelman, L. (1988). Using weighted networks to represent classification knowledge in noisy domains. Proceedings of the Fifth International Conference on Machine Learning (pp. 121-134). Ann Arbor, Michigan: Morgan Kaufmann. \n\nFisher,D.H. & Schlimmer,J.C. (1988). Concept Simplification and Predictive Accuracy. Proceedings of the Fifth International Conference on Machine Learning (pp. 22-28). 
Ann Arbor, Michigan: Morgan Kaufmann.', 'did': 42, 'features': '0 : [0 - date (nominal)], 1 : [1 - plant-stand (nominal)], 2 : [2 - precip (nominal)], 3 : [3 - temp (nominal)], 4 : [4 - hail (nominal)], 5 : [5 - crop-hist (nominal)], 6 : [6 - area-damaged (nominal)], 7 : [7 - severity (nominal)], 8 : [8 - seed-tmt (nominal)], 9 : [9 - germination (nominal)], 10 : [10 - plant-growth (nominal)], 11 : [11 - leaves (nominal)], 12 : [12 - leafspots-halo (nominal)], 13 : [13 - leafspots-marg (nominal)], 14 : [14 - leafspot-size (nominal)], 15 : [15 - leaf-shread (nominal)], 16 : [16 - leaf-malf (nominal)], 17 : [17 - leaf-mild (nominal)], 18 : [18 - stem (nominal)], 19 : [19 - lodging (nominal)], 20 : [20 - stem-cankers (nominal)], 21 : [21 - canker-lesion (nominal)], 22 : [22 - fruiting-bodies (nominal)], 23 : [23 - external-decay (nominal)], 24 : [24 - mycelium (nominal)], 25 : [25 - int-discolor (nominal)], 26 : [26 - sclerotia (nominal)], 27 : [27 - fruit-pods (nominal)], 28 : [28 - fruit-spots (nominal)], 29 : [29 - seed (nominal)], 30 : [30 - mold-growth (nominal)], 31 : [31 - seed-discolor (nominal)], 32 : [32 - seed-size (nominal)], 33 : [33 - shriveling (nominal)], 34 : [34 - roots (nominal)], 35 : [35 - class (nominal)],', 'format': 'ARFF', 'name': 'soybean', 'qualities': 'AutoCorrelation : 0.9457478005865103, CfsSubsetEval_DecisionStumpAUC : 0.9620422408823379, CfsSubsetEval_DecisionStumpErrRate : 0.13323572474377746, CfsSubsetEval_DecisionStumpKappa : 0.8534752853145238, CfsSubsetEval_NaiveBayesAUC : 0.9620422408823379, CfsSubsetEval_NaiveBayesErrRate : 0.13323572474377746, CfsSubsetEval_NaiveBayesKappa : 0.8534752853145238, CfsSubsetEval_kNN1NAUC : 0.9620422408823379, CfsSubsetEval_kNN1NErrRate : 0.13323572474377746, CfsSubsetEval_kNN1NKappa : 0.8534752853145238, ClassEntropy : 3.83550798457672, DecisionStumpAUC : 0.8099631489104341, DecisionStumpErrRate : 0.7203513909224012, DecisionStumpKappa : 0.19424522533539545, Dimensionality : 0.0527086383601757, 
EquivalentNumberOfAtts : 7.508591767241043, J48.00001.AUC : 0.9739047068470593, J48.00001.ErrRate : 0.12152269399707175, J48.00001.Kappa : 0.8663370421980624, J48.0001.AUC : 0.9739047068470593, J48.0001.ErrRate : 0.12152269399707175, J48.0001.Kappa : 0.8663370421980624, J48.001.AUC : 0.9739047068470593, J48.001.ErrRate : 0.12152269399707175, J48.001.Kappa : 0.8663370421980624, MajorityClassPercentage : 13.469985358711567, MajorityClassSize : 92.0, MaxAttributeEntropy : 2.6849389644492594, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 1.28692474762189, MaxNominalAttDistinctValues : 19.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 0.9655890619117928, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.5108158897798274, MeanNoiseToSignalRatio : 0.8902878340922058, MeanNominalAttDistinctValues : 3.2777777777777777, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : 0.07262476248540556, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 0.0468182939867, MinNominalAttDistinctValues : 2.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 1.171303074670571, MinorityClassSize : 8.0, NaiveBayesAUC : 0.9921587580230303, NaiveBayesErrRate : 0.08931185944363104, NaiveBayesKappa : 0.9019654903843212, NumberOfBinaryFeatures : 16.0, NumberOfClasses : 19.0, NumberOfFeatures : 36.0, NumberOfInstances : 683.0, NumberOfInstancesWithMissingValues : 121.0, NumberOfMissingValues : 2337.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 36.0, PercentageOfBinaryFeatures : 44.44444444444444, PercentageOfInstancesWithMissingValues : 17.71595900439239, PercentageOfMissingValues : 9.504636408003904, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.4629328593168401, Quartile1KurtosisOfNumericAtts : nan, 
Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.26369905545327, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 0.9158362664344971, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, Quartile2MutualInformation : 0.45996721558355, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 1.408326420019514, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.71879499353135, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9436075624852911, REPTreeDepth1ErrRate : 0.26500732064421667, REPTreeDepth1Kappa : 0.7052208643815201, REPTreeDepth2AUC : 0.9436075624852911, REPTreeDepth2ErrRate : 0.26500732064421667, REPTreeDepth2Kappa : 0.7052208643815201, REPTreeDepth3AUC : 0.9436075624852911, REPTreeDepth3ErrRate : 0.26500732064421667, REPTreeDepth3Kappa : 0.7052208643815201, RandomTreeDepth1AUC : 0.9035959879652148, RandomTreeDepth1ErrRate : 0.18594436310395315, RandomTreeDepth1Kappa : 0.7960191985250715, RandomTreeDepth2AUC : 0.9035959879652148, RandomTreeDepth2ErrRate : 0.18594436310395315, RandomTreeDepth2Kappa : 0.7960191985250715, RandomTreeDepth3AUC : 0.9035959879652148, RandomTreeDepth3ErrRate : 0.18594436310395315, RandomTreeDepth3Kappa : 0.7960191985250715, StdvNominalAttDistinctValues : 2.884551077834282, kNN1NAUC : 0.9616161058225481, kNN1NErrRate : 0.1171303074670571, kNN1NKappa : 0.871344781387376,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='RandomTreeDepth3Kappa : 0.7960191985250715, StdvNominalAttDistinctValues : 2.884551077834282, kNN1NAUC : 0.9616161058225481, kNN1NErrRate : 0.1171303074670571, kNN1NKappa : 0.871344781387376,, features - 0 : [0 - date (nominal)], 1 : [1 - plant-stand (nominal)], 2 : [2 - precip (nominal)], 3 : [3 - temp (nominal)], 4 : [4 - hail (nominal)], 5 : [5 - crop-hist (nominal)], 6 
: [6 - area-damaged (nominal)], 7 : [7 - severity (nominal)], 8 : [8 - seed-tmt (nominal)], 9 : [9 - germination (nominal)], 10 : [10 - plant-growth (nominal)], 11 : [11 - leaves (nominal)], 12 : [12 - leafspots-halo (nominal)], 13 : [13 - leafspots-marg (nominal)], 14 : [14 - leafspot-size (nominal)], 15 : [15 - leaf-shread (nominal)], 16 : [16 - leaf-malf (nominal)], 17 : [17 - leaf-mild (nominal)], 18 : [18 - stem (nominal)], 19 : [19 - lodging (nominal)], 20 : [20 - stem-cankers (nominal)], 21 : [21 - canker-lesion (nominal)], 22 : [22 - fruiting-bodies (nominal)], 23 : [23 - external-decay (nominal)], 24 : [24'),
 Document(metadata={'MajorityClassSize': 71.0, 'MaxNominalAttDistinctValues': 3.0, 'MinorityClassSize': 48.0, 'NumberOfClasses': 3.0, 'NumberOfFeatures': 14.0, 'NumberOfInstances': 178.0, 'NumberOfInstancesWithMissingValues': 0.0, 'NumberOfMissingValues': 0.0, 'NumberOfNumericFeatures': 13.0, 'NumberOfSymbolicFeatures': 1.0, 'Unnamed: 0': 122, 'description': '**Author**:   \n**Source**: Unknown -   \n**Please cite**:   \n\n1. Title of Database: Wine recognition data\n \tUpdated Sept 21, 1998 by C.Blake : Added attribute information\n \n 2. Sources:\n    (a) Forina, M. et al, PARVUS - An Extendible Package for Data\n        Exploration, Classification and Correlation. Institute of Pharmaceutical\n        and Food Analysis and Technologies, Via Brigata Salerno, \n        16147 Genoa, Italy.\n \n    (b) Stefan Aeberhard, email: stefan@coral.cs.jcu.edu.au\n    (c) July 1991\n 3. Past Usage:\n \n    (1)\n    S. Aeberhard, D. Coomans and O. de Vel,\n    Comparison of Classifiers in High Dimensional Settings,\n    Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of\n    Mathematics and Statistics, James Cook University of North Queensland.\n    (Also submitted to Technometrics).\n \n    The data was used with many others for comparing various \n    classifiers. The classes are separable, though only RDA \n    has achieved 100% correct classification.\n    (RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))\n    (All results using the leave-one-out technique)\n \n    In a classification context, this is a well posed problem \n    with "well behaved" class structures. A good data set \n    for first testing of a new classifier, but not very \n    challenging.\n \n    (2) \n    S. Aeberhard, D. Coomans and O. de Vel,\n    "THE CLASSIFICATION PERFORMANCE OF RDA"\n    Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. 
of\n    Mathematics and Statistics, James Cook University of North Queensland.\n    (Also submitted to Journal of Chemometrics).\n \n    Here, the data was used to illustrate the superior performance of\n    the use of a new appreciation function with RDA. \n \n 4. Relevant Information:\n \n    -- These data are the results of a chemical analysis of\n       wines grown in the same region in Italy but derived from three\n       different cultivars.\n       The analysis determined the quantities of 13 constituents\n       found in each of the three types of wines. \n \n    -- I think that the initial data set had around 30 variables, but \n       for some reason I only have the 13 dimensional version. \n       I had a list of what the 30 or so variables were, but a.) \n       I lost it, and b.), I would not know which 13 variables\n       are included in the set.\n \n    -- The attributes are (dontated by Riccardo Leardi, \n \triclea@anchem.unige.it )\n  \t1) Alcohol\n  \t2) Malic acid\n  \t3) Ash\n \t4) Alcalinity of ash  \n  \t5) Magnesium\n \t6) Total phenols\n  \t7) Flavanoids\n  \t8) Nonflavanoid phenols\n  \t9) Proanthocyanins\n \t10)Color intensity\n  \t11)Hue\n  \t12)OD280/OD315 of diluted wines\n  \t13)Proline            \n \n 5. Number of Instances\n \n       \tclass 1 59\n \tclass 2 71\n \tclass 3 48\n \n 6. Number of Attributes \n \t\n \t13\n \n 7. For Each Attribute:\n \n \tAll attributes are continuous\n \t\n \tNo statistics available, but suggest to standardise\n \tvariables for certain uses (e.g. for us with classifiers\n \twhich are NOT scale invariant)\n \n \tNOTE: 1st attribute is class identifier (1-3)\n \n 8. Missing Attribute Values:\n \n \tNone\n \n 9. 
Class Distribution: number of instances per class\n \n       \tclass 1 59\n \tclass 2 71\n \tclass 3 48\n\n Information about the dataset\n CLASSTYPE: nominal\n CLASSINDEX: first', 'did': 187, 'features': '0 : [0 - class (nominal)], 1 : [1 - Alcohol (numeric)], 2 : [2 - Malic_acid (numeric)], 3 : [3 - Ash (numeric)], 4 : [4 - Alcalinity_of_ash (numeric)], 5 : [5 - Magnesium (numeric)], 6 : [6 - Total_phenols (numeric)], 7 : [7 - Flavanoids (numeric)], 8 : [8 - Nonflavanoid_phenols (numeric)], 9 : [9 - Proanthocyanins (numeric)], 10 : [10 - Color_intensity (numeric)], 11 : [11 - Hue (numeric)], 12 : [12 - OD280%2FOD315_of_diluted_wines (numeric)], 13 : [13 - Proline (numeric)],', 'format': 'ARFF', 'name': 'wine', 'qualities': 'AutoCorrelation : 0.9887005649717514, CfsSubsetEval_DecisionStumpAUC : 0.934807485785613, CfsSubsetEval_DecisionStumpErrRate : 0.0898876404494382, CfsSubsetEval_DecisionStumpKappa : 0.8636080647478569, CfsSubsetEval_NaiveBayesAUC : 0.934807485785613, CfsSubsetEval_NaiveBayesErrRate : 0.0898876404494382, CfsSubsetEval_NaiveBayesKappa : 0.8636080647478569, CfsSubsetEval_kNN1NAUC : 0.934807485785613, CfsSubsetEval_kNN1NErrRate : 0.0898876404494382, CfsSubsetEval_kNN1NKappa : 0.8636080647478569, ClassEntropy : 1.5668222768551812, DecisionStumpAUC : 0.7973435168459908, DecisionStumpErrRate : 0.37640449438202245, DecisionStumpKappa : 0.4058981767460396, Dimensionality : 0.07865168539325842, EquivalentNumberOfAtts : nan, J48.00001.AUC : 0.934807485785613, J48.00001.ErrRate : 0.0898876404494382, J48.00001.Kappa : 0.8636080647478569, J48.0001.AUC : 0.934807485785613, J48.0001.ErrRate : 0.0898876404494382, J48.0001.Kappa : 0.8636080647478569, J48.001.AUC : 0.934807485785613, J48.001.ErrRate : 0.0898876404494382, J48.001.Kappa : 0.8636080647478569, MajorityClassPercentage : 39.8876404494382, MajorityClassSize : 71.0, MaxAttributeEntropy : nan, MaxKurtosisOfNumericAtts : 2.1049913235905877, MaxMeansOfNumericAtts : 746.8932584269661, MaxMutualInformation : 
nan, MaxNominalAttDistinctValues : 3.0, MaxSkewnessOfNumericAtts : 1.0981910547551612, MaxStdDevOfNumericAtts : 314.90747427684903, MeanAttributeEntropy : nan, MeanKurtosisOfNumericAtts : 0.006742802303924433, MeanMeansOfNumericAtts : 69.13366292091614, MeanMutualInformation : nan, MeanNoiseToSignalRatio : nan, MeanNominalAttDistinctValues : 3.0, MeanSkewnessOfNumericAtts : 0.3501684984202117, MeanStdDevOfNumericAtts : 26.17778523132608, MinAttributeEntropy : nan, MinKurtosisOfNumericAtts : -1.0864345274098706, MinMeansOfNumericAtts : 0.3618539325842697, MinMutualInformation : nan, MinNominalAttDistinctValues : 3.0, MinSkewnessOfNumericAtts : -0.30728549895848073, MinStdDevOfNumericAtts : 0.12445334029667939, MinorityClassPercentage : 26.96629213483146, MinorityClassSize : 48.0, NaiveBayesAUC : 0.983140867878747, NaiveBayesErrRate : 0.0449438202247191, NaiveBayesKappa : 0.9319148936170213, NumberOfBinaryFeatures : 0.0, NumberOfClasses : 3.0, NumberOfFeatures : 14.0, NumberOfInstances : 178.0, NumberOfInstancesWithMissingValues : 0.0, NumberOfMissingValues : 0.0, NumberOfNumericFeatures : 13.0, NumberOfSymbolicFeatures : 1.0, PercentageOfBinaryFeatures : 0.0, PercentageOfInstancesWithMissingValues : 0.0, PercentageOfMissingValues : 0.0, PercentageOfNumericFeatures : 92.85714285714286, PercentageOfSymbolicFeatures : 7.142857142857142, Quartile1AttributeEntropy : nan, Quartile1KurtosisOfNumericAtts : -0.8440630459414751, Quartile1MeansOfNumericAtts : 1.8100842696629211, Quartile1MutualInformation : nan, Quartile1SkewnessOfNumericAtts : -0.0151955294387136, Quartile1StdDevOfNumericAtts : 0.4233514358677881, Quartile2AttributeEntropy : nan, Quartile2KurtosisOfNumericAtts : -0.24840310614613204, Quartile2MeansOfNumericAtts : 2.366516853932584, Quartile2MutualInformation : nan, Quartile2SkewnessOfNumericAtts : 0.2130468864264532, Quartile2StdDevOfNumericAtts : 0.8118265380058574, Quartile3AttributeEntropy : nan, Quartile3KurtosisOfNumericAtts : 0.5212950315345126, 
Quartile3MeansOfNumericAtts : 16.247780898876407, Quartile3MutualInformation : nan, Quartile3SkewnessOfNumericAtts : 0.8182032861734947, Quartile3StdDevOfNumericAtts : 2.828924819497959, REPTreeDepth1AUC : 0.9038527890255288, REPTreeDepth1ErrRate : 0.15168539325842698, REPTreeDepth1Kappa : 0.7710010959165197, REPTreeDepth2AUC : 0.9038527890255288, REPTreeDepth2ErrRate : 0.15168539325842698, REPTreeDepth2Kappa : 0.7710010959165197, REPTreeDepth3AUC : 0.9038527890255288, REPTreeDepth3ErrRate : 0.15168539325842698, REPTreeDepth3Kappa : 0.7710010959165197, RandomTreeDepth1AUC : 0.9363963414265778, RandomTreeDepth1ErrRate : 0.08426966292134831, RandomTreeDepth1Kappa : 0.872212118311477, RandomTreeDepth2AUC : 0.9363963414265778, RandomTreeDepth2ErrRate : 0.08426966292134831, RandomTreeDepth2Kappa : 0.872212118311477, RandomTreeDepth3AUC : 0.9363963414265778, RandomTreeDepth3ErrRate : 0.08426966292134831, RandomTreeDepth3Kappa : 0.872212118311477, StdvNominalAttDistinctValues : 0.0, kNN1NAUC : 0.9552036199095022, kNN1NErrRate : 0.06179775280898876, kNN1NKappa : 0.9069126176666349,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='the use of a new appreciation function with RDA. \n \n 4. Relevant Information:\n \n    -- These data are the results of a chemical analysis of\n       wines grown in the same region in Italy but derived from three\n       different cultivars.\n       The analysis determined the quantities of 13 constituents\n       found in each of the three types of wines. \n \n    -- I think that the initial data set had around 30 variables, but \n       for some reason I only have the 13 dimensional version. \n       I had a list of what the 30 or so variables were, but a.) 
\n       I lost it, and b.), I would not know which 13 variables\n       are included in the set.\n \n    -- The attributes are (dontated by Riccardo Leardi, \n \triclea@anchem.unige.it )\n  \t1) Alcohol\n  \t2) Malic acid\n  \t3) Ash\n \t4) Alcalinity of ash  \n  \t5) Magnesium\n \t6) Total phenols\n  \t7) Flavanoids\n  \t8) Nonflavanoid phenols\n  \t9) Proanthocyanins\n \t10)Color intensity\n  \t11)Hue\n  \t12)OD280/OD315 of diluted wines'),
 Document(metadata={'MaxNominalAttDistinctValues': 3.0, 'NumberOfClasses': 0.0, 'NumberOfFeatures': 5.0, 'NumberOfInstances': 125.0, 'NumberOfInstancesWithMissingValues': 0.0, 'NumberOfMissingValues': 0.0, 'NumberOfNumericFeatures': 3.0, 'NumberOfSymbolicFeatures': 2.0, 'Unnamed: 0': 134, 'description': '**Author**:   \n**Source**: Unknown -   \n**Please cite**:   \n\n!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\n Identifier attribute deleted.\n\n !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!\n\n NAME:  Sexual activity and the lifespan of male fruitflies\n TYPE:  Designed (almost factorial) experiment\n SIZE:  125 observations, 5 variables\n \n DESCRIPTIVE ABSTRACT:\n A cost of increased reproduction in terms of reduced longevity has been\n shown for female fruitflies, but not for males.  The flies used were an\n outbred stock.  Sexual activity was manipulated by supplying individual\n males with one or eight receptive virgin females per day.  The\n longevity of these males was compared with that of two control types.\n The first control consisted of two sets of individual males kept with\n one or eight newly inseminated females.  Newly inseminated females will\n not usually remate for at least two days, and thus served as a control\n for any effect of competition with the male for food or space.  The\n second control was a set of individual males kept with no females.\n There were 25 males in each of the five groups, which were treated\n identically in number of anaesthetizations (using CO2) and provision of\n fresh food medium.\n \n SOURCE:\n Figure 2 in the article "Sexual Activity and the Lifespan of Male\n Fruitflies" by Linda Partridge and Marion Farquhar.  _Nature_, 294,\n 580-581, 1981.\n \n VARIABLE DESCRIPTIONS:\n Columns  Variable    Description\n -------  --------    -----------\n  1- 2    ID          Serial No. 
(1-25) within each group of 25\n                      (the order in which data points were abstracted)\n \n  4       PARTNERS    Number of companions (0, 1 or 8)\n \n  6       TYPE        Type of companion\n                        0: newly pregnant female\n                        1: virgin female\n                        9: not applicable (when PARTNERS=0)\n \n  8- 9    LONGEVITY   Lifespan, in days\n \n 11-14    THORAX      Length of thorax, in mm (x.xx)\n \n 16-17    SLEEP       Percentage of each day spent sleeping\n \n \n SPECIAL NOTES:\n `Compliance\' of the males in the two experimental groups was documented\n as follows:  On two days per week throughout the life of each\n experimental male, the females that had been supplied as virgins to\n that male were kept and examined for fertile eggs.  The insemination\n rate declined from approximately 7 females/day at age one week to just\n under 2/day at age eight weeks in the males supplied with eight virgin\n females per day, and from just under 1/day at age one week to\n approximately 0.6/day at age eight weeks in the males supplied with one\n virgin female per day.  These `compliance\' data were not supplied for\n individual males, but the authors say that "There were no significant\n differences between the individual males within each experimental\n group."\n \n STORY BEHIND THE DATA:\n James Hanley found this dataset in _Nature_ and was attracted by the\n way the raw data were presented in classical analysis of covariance\n style in Figure 2.  He read the data points from the graphs and brought\n them to the attention of a colleague with whom he was teaching the\n applied statistics course.  Dr. Liddell thought that with only three\n explanatory variables (THORAX, plus PARTNERS and TYPE to describe the\n five groups), it would not be challenging enough as a data-analysis\n project.  He suggested adding another variable.  James Hanley added\n SLEEP, a variable not mentioned in the published article.  
Teachers can\n contact us about the construction of this variable.  (We prefer to\n divulge the details at the end of the data-analysis project.)\n \n Further discussion of the background and pedagogical use of this\n dataset can be found in Hanley (1983) and in Hanley and Shapiro\n (1994).  To obtain the Hanley and Shapiro article, send the one-line\n e-mail message:\n send jse/v2n1/datasets.hanley\n to the address archive@jse.stat.ncsu.edu\n \n PEDAGOGICAL NOTES:\n This has been the most successful and the most memorable dataset we\n have used in an "applications of statistics" course, which we have\n taught for ten years.  The most common analysis techniques have been\n analysis of variance, classical analysis of covariance, and multiple\n regression.  Because the variable THORAX is so strong (it explains\n about 1/3 of the variance in LONGEVITY), it is important to consider it\n to increase the precision of between-group contrasts.  When students\n first check and find that the distributions of thorax length, and in\n particular, the mean thorax length, are very similar in the different\n groups, many of them are willing to say (in epidemiological\n terminology) that THORAX is not a confounding variable, and that it can\n be omitted from the analysis.\n \n There is usually lively discussion about the primary contrast.  The\n five groups and their special structure allow opportunities for\n students to understand and verbalize what we mean by the term\n "statistical interaction."\n \n There is also much debate as to whether one should take the SLEEP\n variable into account.  Some students say that it is an `intermediate\'\n variable.  Some students formally test the mean level of SLEEP across\n groups, find one pair where there is a statistically significant\n difference, and want to treat it as a confounding variable.  
A few\n students muse about how it was measured.\n \n There is heteroscedasticity in the LONGEVITY variable.\n \n One very observant student (now a professor) argued that THORAX cannot\n be used as a predictor or explanatory variable for the LONGEVITY\n outcome since fruitflies who die young may not be fully grown, i.e., it\n is also an intermediate variable.  One Ph.D. student who had studied\n entomology assured us that fruitflies do not grow longer after birth;\n therefore, the THORAX length is not time-dependent!\n \n Curiously, the dataset has seldom been analyzed using techniques from\n survival analysis.  The fact that there are no censored observations is\n not really an excuse, and one could easily devise a way to introduce\n censoring of LONGEVITY.\n \n REFERENCES:\n Hanley, J. A. (1983), "Appropriate Uses of Multivariate Analysis,"\n _Annual Review of Public Health_, 4, 155-180.\n \n Hanley, J. A., and Shapiro, S. H. (1994), "Sexual Activity and the\n Lifespan of Male Fruitflies:  A Dataset That Gets Attention," _Journal\n of Statistics Education_, Volume 2, Number 1.\n \n SUBMITTED BY:\n James A. Hanley and Stanley H. 
Shapiro\n Department of Epidemiology and Biostatistics\n McGill University\n 1020 Pine Avenue West\n Montreal, Quebec, H3A 1A2\n Canada\n tel: +1 (514) 398-6270 (JH) \n      +1 (514) 398-6272 (SS)\n fax: +1 (514) 398-4503\n INJH@musicb.mcgill.ca, StanS@epid.lan.mcgill.ca', 'did': 199, 'features': '0 : [0 - PARTNERS (nominal)], 1 : [1 - TYPE (nominal)], 2 : [2 - THORAX (numeric)], 3 : [3 - SLEEP (numeric)], 4 : [4 - class (numeric)],', 'format': 'ARFF', 'name': 'fruitfly', 'qualities': 'AutoCorrelation : -16.653225806451612, CfsSubsetEval_DecisionStumpAUC : nan, CfsSubsetEval_DecisionStumpErrRate : nan, CfsSubsetEval_DecisionStumpKappa : nan, CfsSubsetEval_NaiveBayesAUC : nan, CfsSubsetEval_NaiveBayesErrRate : nan, CfsSubsetEval_NaiveBayesKappa : nan, CfsSubsetEval_kNN1NAUC : nan, CfsSubsetEval_kNN1NErrRate : nan, CfsSubsetEval_kNN1NKappa : nan, ClassEntropy : nan, DecisionStumpAUC : nan, DecisionStumpErrRate : nan, DecisionStumpKappa : nan, Dimensionality : 0.04, EquivalentNumberOfAtts : nan, J48.00001.AUC : nan, J48.00001.ErrRate : nan, J48.00001.Kappa : nan, J48.0001.AUC : nan, J48.0001.ErrRate : nan, J48.0001.Kappa : nan, J48.001.AUC : nan, J48.001.ErrRate : nan, J48.001.Kappa : nan, MajorityClassPercentage : nan, MajorityClassSize : nan, MaxAttributeEntropy : nan, MaxKurtosisOfNumericAtts : 3.1484095157236704, MaxMeansOfNumericAtts : 57.44, MaxMutualInformation : nan, MaxNominalAttDistinctValues : 3.0, MaxSkewnessOfNumericAtts : 1.5903052309118162, MaxStdDevOfNumericAtts : 17.563892580537072, MeanAttributeEntropy : nan, MeanKurtosisOfNumericAtts : 0.7789944524450039, MeanMeansOfNumericAtts : 27.241653333333332, MeanMutualInformation : nan, MeanNoiseToSignalRatio : nan, MeanNominalAttDistinctValues : 3.0, MeanSkewnessOfNumericAtts : 0.3135430433813126, MeanStdDevOfNumericAtts : 11.173398006246252, MinAttributeEntropy : nan, MinKurtosisOfNumericAtts : -0.410404642598019, MinMeansOfNumericAtts : 0.82096, MinMutualInformation : nan, MinNominalAttDistinctValues : 
3.0, MinSkewnessOfNumericAtts : -0.6380573853536728, MinStdDevOfNumericAtts : 0.07745366981455389, MinorityClassPercentage : nan, MinorityClassSize : nan, NaiveBayesAUC : nan, NaiveBayesErrRate : nan, NaiveBayesKappa : nan, NumberOfBinaryFeatures : 0.0, NumberOfClasses : 0.0, NumberOfFeatures : 5.0, NumberOfInstances : 125.0, NumberOfInstancesWithMissingValues : 0.0, NumberOfMissingValues : 0.0, NumberOfNumericFeatures : 3.0, NumberOfSymbolicFeatures : 2.0, PercentageOfBinaryFeatures : 0.0, PercentageOfInstancesWithMissingValues : 0.0, PercentageOfMissingValues : 0.0, PercentageOfNumericFeatures : 60.0, PercentageOfSymbolicFeatures : 40.0, Quartile1AttributeEntropy : nan, Quartile1KurtosisOfNumericAtts : -0.410404642598019, Quartile1MeansOfNumericAtts : 0.82096, Quartile1MutualInformation : nan, Quartile1SkewnessOfNumericAtts : -0.6380573853536728, Quartile1StdDevOfNumericAtts : 0.07745366981455389, Quartile2AttributeEntropy : nan, Quartile2KurtosisOfNumericAtts : -0.4010215157906396, Quartile2MeansOfNumericAtts : 23.464, Quartile2MutualInformation : nan, Quartile2SkewnessOfNumericAtts : -0.011618715414205413, Quartile2StdDevOfNumericAtts : 15.878847768387132, Quartile3AttributeEntropy : nan, Quartile3KurtosisOfNumericAtts : 3.1484095157236704, Quartile3MeansOfNumericAtts : 57.44, Quartile3MutualInformation : nan, Quartile3SkewnessOfNumericAtts : 1.5903052309118162, Quartile3StdDevOfNumericAtts : 17.563892580537072, REPTreeDepth1AUC : nan, REPTreeDepth1ErrRate : nan, REPTreeDepth1Kappa : nan, REPTreeDepth2AUC : nan, REPTreeDepth2ErrRate : nan, REPTreeDepth2Kappa : nan, REPTreeDepth3AUC : nan, REPTreeDepth3ErrRate : nan, REPTreeDepth3Kappa : nan, RandomTreeDepth1AUC : nan, RandomTreeDepth1ErrRate : nan, RandomTreeDepth1Kappa : nan, RandomTreeDepth2AUC : nan, RandomTreeDepth2ErrRate : nan, RandomTreeDepth2Kappa : nan, RandomTreeDepth3AUC : nan, RandomTreeDepth3ErrRate : nan, RandomTreeDepth3Kappa : nan, StdvNominalAttDistinctValues : 0.0, kNN1NAUC : nan, kNN1NErrRate 
: nan, kNN1NKappa : nan,', 'status': 'active', 'uploader': 1, 'version': 1}, page_content='_Annual Review of Public Health_, 4, 155-180.\n \n Hanley, J. A., and Shapiro, S. H. (1994), "Sexual Activity and the\n Lifespan of Male Fruitflies:  A Dataset That Gets Attention," _Journal\n of Statistics Education_, Volume 2, Number 1.\n \n SUBMITTED BY:\n James A. Hanley and Stanley H. Shapiro\n Department of Epidemiology and Biostatistics\n McGill University\n 1020 Pine Avenue West\n Montreal, Quebec, H3A 1A2\n Canada\n tel: +1 (514) 398-6270 (JH) \n      +1 (514) 398-6272 (SS)\n fax: +1 (514) 398-4503')]
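The raw list above is hard to scan, because every result carries the dataset's full metadata record while `page_content` holds only the chunk that matched. A minimal sketch of a more compact view follows. It assumes each result exposes `page_content` and `metadata` attributes, as the output above suggests. `Doc` is a hypothetical stand-in so the snippet runs on its own, and `summarize_results` is an illustrative helper, not part of the backend modules.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in with the same two attributes as the retrieved results."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_results(docs, keys=("did", "name", "NumberOfInstances", "NumberOfFeatures"),
                      snippet_len=60):
    """Return one compact row per dataset id, preserving retrieval order."""
    seen = set()
    rows = []
    for doc in docs:
        did = doc.metadata.get("did")
        if did is not None and did in seen:
            continue  # a later chunk of a dataset that is already listed
        seen.add(did)
        row = {key: doc.metadata.get(key) for key in keys}
        # Collapse newlines and runs of spaces so the snippet fits on one line.
        row["snippet"] = " ".join(doc.page_content.split())[:snippet_len]
        rows.append(row)
    return rows


# Metadata values copied from the output above; page_content is shortened for illustration.
sample = [
    Doc("the use of a new appreciation function with RDA. \n \n 4. Relevant Information:",
        {"did": 187, "name": "wine", "NumberOfInstances": 178.0, "NumberOfFeatures": 14.0}),
    Doc("_Annual Review of Public Health_, 4, 155-180.\n \n Hanley, J. A.",
        {"did": 199, "name": "fruitfly", "NumberOfInstances": 125.0, "NumberOfFeatures": 5.0}),
    Doc("12)OD280/OD315 of diluted wines",
        {"did": 187, "name": "wine", "NumberOfInstances": 178.0, "NumberOfFeatures": 14.0}),
]

for row in summarize_results(sample):
    print(row)
```

In the notebook, the same helper could be called as `summarize_results(res)`, provided `res` is the list of results shown above. Keeping only the first chunk per `did` is a design choice for readability; drop the `seen` check to list every matching chunk.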
res[0].metadata
{'MajorityClassSize': 4208.0,
 'MaxNominalAttDistinctValues': 12.0,
 'MinorityClassSize': 3916.0,
 'NumberOfClasses': 2.0,
 'NumberOfFeatures': 23.0,
 'NumberOfInstances': 8124.0,
 'NumberOfInstancesWithMissingValues': 2480.0,
 'NumberOfMissingValues': 2480.0,
 'NumberOfNumericFeatures': 0.0,
 'NumberOfSymbolicFeatures': 23.0,
 'Unnamed: 0': 19,
 'description': "**Author**: [Jeff Schlimmer](Jeffrey.Schlimmer@a.gp.cs.cmu.edu)  \n**Source**: [UCI](https://archive.ics.uci.edu/ml/datasets/mushroom) - 1981     \n**Please cite**:  The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n\n### Description\n\nThis dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.\n\n### Source\n```\n(a) Origin: \nMushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf \n\n(b) Donor: \nJeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)\n```\n\n### Dataset description\n\nThis dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.\n\n### Attributes Information\n```\n1. cap-shape: bell=b,conical=c,convex=x,flat=f, knobbed=k,sunken=s \n2. cap-surface: fibrous=f,grooves=g,scaly=y,smooth=s \n3. cap-color: brown=n,buff=b,cinnamon=c,gray=g,green=r, pink=p,purple=u,red=e,white=w,yellow=y \n4. bruises?: bruises=t,no=f \n5. odor: almond=a,anise=l,creosote=c,fishy=y,foul=f, musty=m,none=n,pungent=p,spicy=s \n6. gill-attachment: attached=a,descending=d,free=f,notched=n \n7. gill-spacing: close=c,crowded=w,distant=d \n8. gill-size: broad=b,narrow=n \n9. gill-color: black=k,brown=n,buff=b,chocolate=h,gray=g, green=r,orange=o,pink=p,purple=u,red=e, white=w,yellow=y \n10. stalk-shape: enlarging=e,tapering=t \n11. stalk-root: bulbous=b,club=c,cup=u,equal=e, rhizomorphs=z,rooted=r,missing=? \n12. 
stalk-surface-above-ring: fibrous=f,scaly=y,silky=k,smooth=s \n13. stalk-surface-below-ring: fibrous=f,scaly=y,silky=k,smooth=s \n14. stalk-color-above-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n15. stalk-color-below-ring: brown=n,buff=b,cinnamon=c,gray=g,orange=o, pink=p,red=e,white=w,yellow=y \n16. veil-type: partial=p,universal=u \n17. veil-color: brown=n,orange=o,white=w,yellow=y \n18. ring-number: none=n,one=o,two=t \n19. ring-type: cobwebby=c,evanescent=e,flaring=f,large=l, none=n,pendant=p,sheathing=s,zone=z \n20. spore-print-color: black=k,brown=n,buff=b,chocolate=h,green=r, orange=o,purple=u,white=w,yellow=y \n21. population: abundant=a,clustered=c,numerous=n, scattered=s,several=v,solitary=y \n22. habitat: grasses=g,leaves=l,meadows=m,paths=p, urban=u,waste=w,woods=d\n```\n\n### Relevant papers\n\nSchlimmer,J.S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19). Doctoral disseration, Department of Information and Computer Science, University of California, Irvine. \n\nIba,W., Wogulis,J., & Langley,P. (1988). Trading off Simplicity and Coverage in Incremental Concept Learning. In Proceedings of the 5th International Conference on Machine Learning, 73-79. Ann Arbor, Michigan: Morgan Kaufmann. \n\nDuch W, Adamczak R, Grabczewski K (1996) Extraction of logical rules from training data using backpropagation networks, in: Proc. of the The 1st Online Workshop on Soft Computing, 19-30.Aug.1996, pp. 25-30, [Web Link] \n\nDuch W, Adamczak R, Grabczewski K, Ishikawa M, Ueda H, Extraction of crisp logical rules using constrained backpropagation networks - comparison of two new approaches, in: Proc. of the European Symposium on Artificial Neural Networks (ESANN'97), Bruge, Belgium 16-18.4.1997.",
 'did': 24,
 'features': '0 : [0 - cap-shape (nominal)], 1 : [1 - cap-surface (nominal)], 2 : [2 - cap-color (nominal)], 3 : [3 - bruises%3F (nominal)], 4 : [4 - odor (nominal)], 5 : [5 - gill-attachment (nominal)], 6 : [6 - gill-spacing (nominal)], 7 : [7 - gill-size (nominal)], 8 : [8 - gill-color (nominal)], 9 : [9 - stalk-shape (nominal)], 10 : [10 - stalk-root (nominal)], 11 : [11 - stalk-surface-above-ring (nominal)], 12 : [12 - stalk-surface-below-ring (nominal)], 13 : [13 - stalk-color-above-ring (nominal)], 14 : [14 - stalk-color-below-ring (nominal)], 15 : [15 - veil-type (nominal)], 16 : [16 - veil-color (nominal)], 17 : [17 - ring-number (nominal)], 18 : [18 - ring-type (nominal)], 19 : [19 - spore-print-color (nominal)], 20 : [20 - population (nominal)], 21 : [21 - habitat (nominal)], 22 : [22 - class (nominal)],',
 'format': 'ARFF',
 'name': 'mushroom',
 'qualities': 'AutoCorrelation : 0.726332635725717, CfsSubsetEval_DecisionStumpAUC : 0.9910519616800724, CfsSubsetEval_DecisionStumpErrRate : 0.013047759724273756, CfsSubsetEval_DecisionStumpKappa : 0.9738461616958994, CfsSubsetEval_NaiveBayesAUC : 0.9910519616800724, CfsSubsetEval_NaiveBayesErrRate : 0.013047759724273756, CfsSubsetEval_NaiveBayesKappa : 0.9738461616958994, CfsSubsetEval_kNN1NAUC : 0.9910519616800724, CfsSubsetEval_kNN1NErrRate : 0.013047759724273756, CfsSubsetEval_kNN1NKappa : 0.9738461616958994, ClassEntropy : 0.9990678968724604, DecisionStumpAUC : 0.8894935275772204, DecisionStumpErrRate : 0.11324470704086657, DecisionStumpKappa : 0.77457574608175, Dimensionality : 0.002831117676021664, EquivalentNumberOfAtts : 5.0393135801657, J48.00001.AUC : 1.0, J48.00001.ErrRate : 0.0, J48.00001.Kappa : 1.0, J48.0001.AUC : 1.0, J48.0001.ErrRate : 0.0, J48.0001.Kappa : 1.0, J48.001.AUC : 1.0, J48.001.ErrRate : 0.0, J48.001.Kappa : 1.0, MajorityClassPercentage : 51.7971442639094, MajorityClassSize : 4208.0, MaxAttributeEntropy : 3.030432883772633, MaxKurtosisOfNumericAtts : nan, MaxMeansOfNumericAtts : nan, MaxMutualInformation : 0.906074977384, MaxNominalAttDistinctValues : 12.0, MaxSkewnessOfNumericAtts : nan, MaxStdDevOfNumericAtts : nan, MeanAttributeEntropy : 1.4092554739602103, MeanKurtosisOfNumericAtts : nan, MeanMeansOfNumericAtts : nan, MeanMutualInformation : 0.19825475850613955, MeanNoiseToSignalRatio : 6.108305922031972, MeanNominalAttDistinctValues : 5.130434782608695, MeanSkewnessOfNumericAtts : nan, MeanStdDevOfNumericAtts : nan, MinAttributeEntropy : -0.0, MinKurtosisOfNumericAtts : nan, MinMeansOfNumericAtts : nan, MinMutualInformation : 0.0, MinNominalAttDistinctValues : 1.0, MinSkewnessOfNumericAtts : nan, MinStdDevOfNumericAtts : nan, MinorityClassPercentage : 48.20285573609059, MinorityClassSize : 3916.0, NaiveBayesAUC : 0.9976229672941662, NaiveBayesErrRate : 0.04899064500246184, NaiveBayesKappa : 0.9015972799616292, 
NumberOfBinaryFeatures : 5.0, NumberOfClasses : 2.0, NumberOfFeatures : 23.0, NumberOfInstances : 8124.0, NumberOfInstancesWithMissingValues : 2480.0, NumberOfMissingValues : 2480.0, NumberOfNumericFeatures : 0.0, NumberOfSymbolicFeatures : 23.0, PercentageOfBinaryFeatures : 21.73913043478261, PercentageOfInstancesWithMissingValues : 30.526834071885773, PercentageOfMissingValues : 1.3272536552993814, PercentageOfNumericFeatures : 0.0, PercentageOfSymbolicFeatures : 100.0, Quartile1AttributeEntropy : 0.8286618104993447, Quartile1KurtosisOfNumericAtts : nan, Quartile1MeansOfNumericAtts : nan, Quartile1MutualInformation : 0.034184520425602494, Quartile1SkewnessOfNumericAtts : nan, Quartile1StdDevOfNumericAtts : nan, Quartile2AttributeEntropy : 1.467128011861462, Quartile2KurtosisOfNumericAtts : nan, Quartile2MeansOfNumericAtts : nan, Quartile2MutualInformation : 0.174606545183155, Quartile2SkewnessOfNumericAtts : nan, Quartile2StdDevOfNumericAtts : nan, Quartile3AttributeEntropy : 2.0533554351937426, Quartile3KurtosisOfNumericAtts : nan, Quartile3MeansOfNumericAtts : nan, Quartile3MutualInformation : 0.27510225484918505, Quartile3SkewnessOfNumericAtts : nan, Quartile3StdDevOfNumericAtts : nan, REPTreeDepth1AUC : 0.9999987256143267, REPTreeDepth1ErrRate : 0.00036927621861152144, REPTreeDepth1Kappa : 0.9992605118549308, REPTreeDepth2AUC : 0.9999987256143267, REPTreeDepth2ErrRate : 0.00036927621861152144, REPTreeDepth2Kappa : 0.9992605118549308, REPTreeDepth3AUC : 0.9999987256143267, REPTreeDepth3ErrRate : 0.00036927621861152144, REPTreeDepth3Kappa : 0.9992605118549308, RandomTreeDepth1AUC : 0.9995247148288974, RandomTreeDepth1ErrRate : 0.0004923682914820286, RandomTreeDepth1Kappa : 0.9990140245420991, RandomTreeDepth2AUC : 0.9995247148288974, RandomTreeDepth2ErrRate : 0.0004923682914820286, RandomTreeDepth2Kappa : 0.9990140245420991, RandomTreeDepth3AUC : 0.9995247148288974, RandomTreeDepth3ErrRate : 0.0004923682914820286, RandomTreeDepth3Kappa : 0.9990140245420991, 
StdvNominalAttDistinctValues : 3.1809710899501766, kNN1NAUC : 1.0, kNN1NErrRate : 0.0, kNN1NKappa : 1.0,',
 'status': 'active',
 'uploader': 1,
 'version': 1}
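
The `qualities` field in the metadata above is a single flat string of `name : value` pairs, not a nested dictionary. If individual values are needed, a minimal parsing sketch could look like the following. The helper name `parse_qualities` is hypothetical and not part of the backend modules, and the sample string is a shortened copy of the output above.

```python
import math


def parse_qualities(raw: str) -> dict[str, float]:
    """Parse a 'name : value, name : value,' string into a dict of floats."""
    parsed = {}
    for item in raw.split(","):
        item = item.strip()
        if not item:
            continue  # skip the empty piece left by the trailing comma
        name, _, value = item.partition(" : ")
        parsed[name.strip()] = float(value)  # float("nan") parses to NaN
    return parsed


# Shortened sample taken from the metadata printed above
sample = "NumberOfClasses : 2.0, NumberOfInstances : 8124.0, MaxKurtosisOfNumericAtts : nan,"
qualities = parse_qualities(sample)
print(qualities["NumberOfInstances"])
print(math.isnan(qualities["MaxKurtosisOfNumericAtts"]))
```

Missing meta-features appear as `nan` in the string, so they come back as NaN floats and should be checked with `math.isnan` rather than compared with `==`.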
print(res[0].page_content)
### Description

This dataset describes mushrooms in terms of their physical characteristics. They are classified into: poisonous or edible.

### Source
(a) Origin: 
Mushroom records are drawn from The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf 

(b) Donor: 
Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)
### Dataset description

This dataset includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom; no rule like ``leaflets three, let it be'' for Poisonous Oak and Ivy.

Process the results and return a dataframe instead

output_df, ids_order = QueryProcessor(
    query=query,
    qa=qa_dataset,
    type_of_query=config["type_of_data"],
    config=config,
).get_result_from_query()
ids_order
[24,
 24,
 294,
 120,
 120,
 42,
 188,
 42,
 187,
 199,
 183,
 134,
 23,
 134,
 287,
 334,
 335,
 333,
 42,
 42,
 287,
 343,
 8,
 334,
 24,
 333,
 179,
 335,
 61,
 13]
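
The list above contains repeated ids (for example, 24 and 42 each appear several times), presumably because several retrieved chunks belong to the same dataset. A minimal sketch for collapsing the list to unique ids while keeping the retrieval order is shown below. It uses only the standard library and is not taken from `QueryProcessor`. The positions it reports for the first occurrences (0, 2, 3, 5, 6) match the index values of `output_df.head()`, which would be consistent with duplicates being dropped there, though that is an inference from the output and not something confirmed by the source.

```python
# ids_order copied from the output above
ids_order = [
    24, 24, 294, 120, 120, 42, 188, 42, 187, 199,
    183, 134, 23, 134, 287, 334, 335, 333, 42, 42,
    287, 343, 8, 334, 24, 333, 179, 335, 61, 13,
]

# dict preserves insertion order, so this keeps the first occurrence of each id
unique_ids = list(dict.fromkeys(ids_order))

# positions in the original list where each id is seen for the first time
first_positions = [
    i for i, did in enumerate(ids_order) if did not in ids_order[:i]
]

print(unique_ids[:5])       # [24, 294, 120, 42, 188]
print(first_positions[:5])  # [0, 2, 3, 5, 6]
print(len(ids_order), len(unique_ids))  # 30 19
```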
output_df.head()
| | id | name | Description | OpenML URL | Command |
|---|---|---|---|---|---|
| 0 | 24 | mushroom | StdvNominalAttDistinctValues : 3.1809710899501... | `<a href="https://www.openml.org/search?type=da...` | `dataset = openml.datasets.get_dataset(24)` |
| 2 | 294 | satellite_image | Data Set Information: | `<a href="https://www.openml.org/search?type=da...` | `dataset = openml.datasets.get_dataset(294)` |
| 3 | 120 | BNG(mushroom) | RandomTreeDepth3ErrRate : 0.024243, RandomTree... | `<a href="https://www.openml.org/search?type=da...` | `dataset = openml.datasets.get_dataset(120)` |
| 5 | 42 | soybean | did - 42, name - soybean, version - 1, uploade... | `<a href="https://www.openml.org/search?type=da...` | `dataset = openml.datasets.get_dataset(42)` |
| 6 | 188 | eucalyptus | Kirsten Thomson and Robert J. McQueen (1996) M... | `<a href="https://www.openml.org/search?type=da...` | `dataset = openml.datasets.get_dataset(188)` |