Supported Algorithms

    Except for the Localization algorithm, all of the following algorithms can retrieve a maximum of 10,000 documents from an index as input.

    K-means

    K-means is a simple and popular unsupervised clustering ML algorithm built on top of the Tribuo library. K-means randomly chooses initial centroids and then iteratively recalculates their positions until each observation belongs to the cluster with the nearest mean.

    APIs

    Example

    The following example uses the Iris Data index to train k-means synchronously.
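    The following sketch shows what such a request can look like. The parameter values and _source fields are illustrative and assume the iris_data index created later in this section.

    POST /_plugins/_ml/_train/kmeans
    {
        "parameters": {
            "centroids": 3,
            "iterations": 10,
            "distance_type": "COSINE"
        },
        "input_query": {
            "_source": ["petal_length_in_cm", "petal_width_in_cm"],
            "size": 10000
        },
        "input_index": ["iris_data"]
    }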

    Limitations

    The training process supports multithreading, but the number of threads should be less than half the number of CPU cores.

    Linear regression

    Linear regression maps the linear relationship between inputs and outputs. In ML Commons, the linear regression algorithm is adopted from the public machine learning library Tribuo, which offers multidimensional linear regression models. The model supports the linear optimizer in training, including popular approaches such as Linear Decay, SQRT_DECAY, ADA, ADAM, and RMS_DROP.

    Parameters

    Parameter | Type | Description | Default Value
    learningRate | Double | The rate of speed at which the gradient moves during descent | 0.01
    momentumFactor | Double | The medium-term from which the regressor rises or falls | 0
    epsilon | Double | The criteria used to identify a linear model | 1.00E-06
    beta1 | Double | The estimated exponential decay for the first moment | 0.9
    beta2 | Double | The estimated exponential decay for the second moment | 0.99
    decayRate | Double | The rate at which the model decays exponentially | 0.9
    momentumType | MomentumType | The Stochastic Gradient Descent (SGD) momentum type that helps accelerate gradient vectors in the right direction, leading to faster convergence | STANDARD
    optimizerType | OptimizerType | The optimizer used in the model | SIMPLE_SGD
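    For reference, a training request that sets a few of these parameters might look like the following sketch. The target field, column names, and values are hypothetical placeholders, not a real dataset.

    POST /_plugins/_ml/_train/LINEAR_REGRESSION
    {
        "parameters": {
            "target": "price",
            "learningRate": 0.01,
            "momentumType": "STANDARD",
            "optimizerType": "SIMPLE_SGD"
        },
        "input_data": {
            "column_metas": [
                { "name": "total_sold", "column_type": "DOUBLE" },
                { "name": "price", "column_type": "DOUBLE" }
            ],
            "rows": [
                { "values": [
                    { "column_type": "DOUBLE", "value": 5 },
                    { "column_type": "DOUBLE", "value": 1 }
                ]},
                { "values": [
                    { "column_type": "DOUBLE", "value": 7 },
                    { "column_type": "DOUBLE", "value": 2 }
                ]}
            ]
        }
    }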

    APIs

    Example

    The following example creates a new prediction based on the previously trained linear regression model.

    Request

    POST _plugins/_ml/_predict/LINEAR_REGRESSION/ROZs-38Br5eVE0lTsoD9
    {
        "parameters": {
            "target": "price"
        },
        "input_data": {
            "column_metas": [
                {
                    "name": "A",
                    "column_type": "DOUBLE"
                },
                {
                    "name": "B",
                    "column_type": "DOUBLE"
                }
            ],
            "rows": [
                {
                    "values": [
                        {
                            "column_type": "DOUBLE",
                            "value": 3
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 5
                        }
                    ]
                }
            ]
        }
    }

    Response

    {
        "status": "COMPLETED",
        "prediction_result": {
            "column_metas": [
                {
                    "name": "price",
                    "column_type": "DOUBLE"
                }
            ],
            "rows": [
                {
                    "values": [
                        {
                            "column_type": "DOUBLE",
                            "value": 17.25701855310131
                        }
                    ]
                }
            ]
        }
    }

    Limitations

    ML Commons only supports the linear stochastic gradient trainer or optimizer, which cannot effectively map the non-linear relationships in the training data. When used with complicated datasets, the linear stochastic trainer might cause convergence problems and inaccurate results.

    RCF

    Random Cut Forest (RCF) is a probabilistic data structure used primarily for unsupervised anomaly detection. Its use also extends to density estimation and forecasting. OpenSearch leverages RCF for anomaly detection. ML Commons supports two new variants of RCF for different use cases:

    • Batch RCF: Detects anomalies in non-time series data.
    • Fixed in time (FIT) RCF: Detects anomalies in time series data.

    Parameters

    Batch RCF

    FIT RCF

    Parameter | Type | Description | Default Value
    number_of_trees | integer | The number of trees in the forest | 30
    shingle_size | integer | A shingle, or a consecutive sequence of the most recent records | 8
    sample_size | integer | The sample size used by stream samplers in the forest | 256
    output_after | integer | The number of points required by stream samplers before results return | 32
    time_decay | double | The decay factor used by stream samplers in the forest | 0.0001
    anomaly_rate | double | The anomaly rate | 0.005
    time_field | string | (Required) The time field for RCF to use as time series data | N/A
    date_format | string | The date and time format for the time_field field | "yyyy-MM-dd HH:mm:ss"
    time_zone | string | The time zone for the time_field field | "UTC"
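    A minimal FIT RCF training request might look like the following sketch. The server_metrics index and its timestamp field are assumptions for illustration.

    POST _plugins/_ml/_train/FIT_RCF
    {
        "parameters": {
            "number_of_trees": 30,
            "shingle_size": 8,
            "time_field": "timestamp"
        },
        "input_query": {
            "query": { "match_all": {} },
            "size": 10000
        },
        "input_index": ["server_metrics"]
    }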

    APIs

    Limitations

    For FIT RCF, you can train the model with historical data and store the trained model in your index. When you call the Predict API, the model is deserialized and used to predict new data points. However, the model in the index is not refreshed with new data, because the model is fixed in time.
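    As a sketch, a prediction against a stored FIT RCF model might look like the following; the model ID is a placeholder and the index name is again an assumption.

    POST _plugins/_ml/_predict/FIT_RCF/<model_id>
    {
        "input_query": {
            "query": { "match_all": {} },
            "size": 1000
        },
        "input_index": ["server_metrics"]
    }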

    RCF Summarize

    RCFSummarize is a clustering algorithm based on the Clustering Using REpresentatives (CURE) algorithm. Compared to k-means, which uses random iterations to cluster, RCFSummarize uses a hierarchical clustering technique. The algorithm starts with a set of randomly selected centroids larger than the centroids' ground truth distribution. During iteration, centroid pairs that are too close to each other automatically merge. Therefore, the number of centroids (max_k) converges to a rational number of clusters that fits the ground truth, as opposed to a fixed k number of clusters.

    Parameters

    APIs

    Example: Train and predict

    The following example estimates cluster centers and provides cluster labels for each sample in a given data frame.

    POST _plugins/_ml/_train_predict/RCF_SUMMARIZE
    {
        "parameters": {
            "centroids": 3,
            "max_k": 15,
            "distance_type": "L2"
        },
        "input_data": {
            "column_metas": [
                {
                    "name": "d0",
                    "column_type": "DOUBLE"
                },
                {
                    "name": "d1",
                    "column_type": "DOUBLE"
                }
            ],
            "rows": [
                {
                    "values": [
                        {
                            "column_type": "DOUBLE",
                            "value": 6.2
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 3.4
                        }
                    ]
                }
            ]
        }
    }

    Response

    The prediction result is omitted here for length. In your response, expect more rows and columns in the response body.
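    As a rough sketch only, a train-and-predict response follows the same status plus prediction_result shape used by the other algorithms; the column name and value below are illustrative, not actual output.

    {
        "status": "COMPLETED",
        "prediction_result": {
            "column_metas": [
                { "name": "ClusterID", "column_type": "INTEGER" }
            ],
            "rows": [
                { "values": [ { "column_type": "INTEGER", "value": 0 } ] }
            ]
        }
    }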

    Localization

    The Localization algorithm finds subset-level information for aggregate data (for example, aggregated over time) that demonstrates the activity of interest, such as spikes, drops, changes, or anomalies. Localization can be applied in different scenarios, such as data exploration or root cause analysis, to expose the contributors driving the activity of interest in the aggregate data.

    All parameters are required except filter_query and anomaly_start.

    Parameter | Type | Description | Default Value
    index_name | String | The data collection to analyze | N/A
    attribute_field_names | List | The fields for entity keys | N/A
    aggregations | List | The fields and aggregation for values | N/A
    time_field_name | String | The timestamp field | null
    start_time | Long | The beginning of the time range | 0
    end_time | Long | The end of the time range | 0
    min_time_interval | Long | The minimum time interval/scale for analysis | 0
    num_outputs | integer | The maximum number of values from localization/slicing | 0
    filter_query | QueryBuilder | (Optional) Reduces the collection of data for analysis | Optional.empty()
    anomaly_start | Long | (Optional) The time after which the data will be analyzed | Optional.empty()

    Example: Execute localization

    The following example executes Localization against an RCA index.

    Request

    POST /_plugins/_ml/_execute/anomaly_localization
    {
        "index_name": "rca-index",
        "attribute_field_names": [
            "attribute"
        ],
        "aggregations": [
            {
                "sum": {
                    "sum": {
                        "field": "value"
                    }
                }
            }
        ],
        "time_field_name": "timestamp",
        "start_time": 1620630000000,
        "end_time": 1621234800000,
        "min_time_interval": 86400000,
        "num_outputs": 10
    }

    Response

    {
        "results" : [
            {
                "name" : "sum",
                "result" : {
                    "buckets" : [
                        {
                            "start_time" : 1620630000000,
                            "end_time" : 1620716400000,
                            "overall_aggregate_value" : 65.0
                        },
                        {
                            "start_time" : 1620716400000,
                            "end_time" : 1620802800000,
                            "overall_aggregate_value" : 75.0,
                            "entities" : [
                                {
                                    "key" : [
                                        "attr0"
                                    ],
                                    "contribution_value" : 1.0,
                                    "base_value" : 2.0,
                                    "new_value" : 3.0
                                },
                                {
                                    "key" : [
                                        "attr1"
                                    ],
                                    "contribution_value" : 1.0,
                                    "base_value" : 3.0,
                                    "new_value" : 4.0
                                },
                                {
                                    ...
                                },
                                {
                                    "key" : [
                                        "attr8"
                                    ],
                                    "contribution_value" : 6.0,
                                    "base_value" : 10.0,
                                    "new_value" : 16.0
                                },
                                {
                                    "key" : [
                                        "attr9"
                                    ],
                                    "contribution_value" : 6.0,
                                    "base_value" : 11.0,
                                    "new_value" : 17.0
                                }
                            ]
                        }
                    ]
                }
            }
        ]
    }

    Limitations

    The Localization algorithm can only be executed directly. Therefore, it cannot be used with the ML Commons Train and Predict APIs.

    Logistic regression

    A classification algorithm, logistic regression models the probability of a discrete outcome given an input variable. In ML Commons, these classifications include both binary and multi-class classification. The most common is binary classification, which takes two values, such as "true/false" or "yes/no", and predicts the outcome based on the values specified. Alternatively, a multi-class output can categorize different inputs based on type. This makes logistic regression most useful for situations in which you are trying to determine how your inputs fit best into a specified category.

    Parameters

    APIs

    Example: Train/Predict with Iris data

    The following example creates an index in OpenSearch with the Iris dataset, then trains the data using logistic regression. Lastly, it uses the trained model to predict Iris types separated by row.

    Create an Iris index

    Before using this request, make sure that you have downloaded the Iris dataset (IRIS_data.txt).

    PUT /iris_data
    {
        "mappings": {
            "properties": {
                "sepal_length_in_cm": {
                    "type": "double"
                },
                "sepal_width_in_cm": {
                    "type": "double"
                },
                "petal_length_in_cm": {
                    "type": "double"
                },
                "petal_width_in_cm": {
                    "type": "double"
                },
                "class": {
                    "type": "keyword"
                }
            }
        }
    }

    Ingest data from IRIS_data.txt
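    One way to ingest the downloaded data is through the Bulk API. The following sketch shows two illustrative rows; the full dataset contains many more.

    POST _bulk
    { "index": { "_index": "iris_data" } }
    { "sepal_length_in_cm": 5.1, "sepal_width_in_cm": 3.5, "petal_length_in_cm": 1.4, "petal_width_in_cm": 0.2, "class": "Iris-setosa" }
    { "index": { "_index": "iris_data" } }
    { "sepal_length_in_cm": 6.2, "sepal_width_in_cm": 3.4, "petal_length_in_cm": 5.4, "petal_width_in_cm": 2.3, "class": "Iris-virginica" }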

    Train the logistic regression model

    This example uses a multi-class logistic regression categorization methodology. Here, the sepal and petal length and width are used as inputs to train the model to categorize records based on the class, as indicated by the target parameter.

    Request

    POST _plugins/_ml/_train/logistic_regression
    {
        "parameters": {
            "target": "class"
        },
        "input_query": {
            "query": {
                "match_all": {}
            },
            "_source": [
                "sepal_length_in_cm",
                "sepal_width_in_cm",
                "petal_length_in_cm",
                "petal_width_in_cm",
                "class"
            ],
            "size": 200
        },
        "input_index": [
            "iris_data"
        ]
    }

    Response

    The response returns a model_id, which is used to predict the class of the Iris.

    {
        "model_id" : "TOgsf4IByBqD7FK_FQGc",
        "status" : "COMPLETED"
    }

    Predict results

    Using the model_id of the trained model, logistic regression predicts the class of the Iris based on the input data.

    POST _plugins/_ml/_predict/logistic_regression/SsfQaoIBEoC4g4joZiyD
    {
        "parameters": {
            "target": "class"
        },
        "input_data": {
            "column_metas": [
                {
                    "name": "sepal_length_in_cm",
                    "column_type": "DOUBLE"
                },
                {
                    "name": "sepal_width_in_cm",
                    "column_type": "DOUBLE"
                },
                {
                    "name": "petal_length_in_cm",
                    "column_type": "DOUBLE"
                },
                {
                    "name": "petal_width_in_cm",
                    "column_type": "DOUBLE"
                }
            ],
            "rows": [
                {
                    "values": [
                        {
                            "column_type": "DOUBLE",
                            "value": 6.2
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 3.4
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 5.4
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 2.3
                        }
                    ]
                },
                {
                    "values": [
                        {
                            "column_type": "DOUBLE",
                            "value": 5.9
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 3.0
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 5.1
                        },
                        {
                            "column_type": "DOUBLE",
                            "value": 1.8
                        }
                    ]
                }
            ]
        }
    }
    Response
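    A response to this request has roughly the following shape. This is a sketch based on the prediction output format shown above; the column name and predicted class values are illustrative rather than verbatim output.

    {
        "status" : "COMPLETED",
        "prediction_result" : {
            "column_metas" : [
                { "name" : "result", "column_type" : "STRING" }
            ],
            "rows" : [
                { "values" : [ { "column_type" : "STRING", "value" : "Iris-virginica" } ] },
                { "values" : [ { "column_type" : "STRING", "value" : "Iris-virginica" } ] }
            ]
        }
    }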

    Limitations