Skip to content
Snippets Groups Projects
  • Dongjoon Hyun's avatar
    2c170dd3
    [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update... · 2c170dd3
    Dongjoon Hyun authored
    [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
    
    ## What changes were proposed in this pull request?
    
    This issue addresses the comments in SPARK-15031 and also fix java-linter errors.
    - Use multiline format in SparkSession builder patterns.
    - Update `binary_classification_metrics_example.py` to use `SparkSession`.
    - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
    
    ## How was this patch tested?
    
    After passing the Jenkins tests and run `dev/lint-java` manually.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #12911 from dongjoon-hyun/SPARK-15134.
    2c170dd3
    History
    [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update...
    Dongjoon Hyun authored
    [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
    
    ## What changes were proposed in this pull request?
    
    This issue addresses the comments in SPARK-15031 and also fix java-linter errors.
    - Use multiline format in SparkSession builder patterns.
    - Update `binary_classification_metrics_example.py` to use `SparkSession`.
    - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
    
    ## How was this patch tested?
    
    After passing the Jenkins tests and run `dev/lint-java` manually.
    
    Author: Dongjoon Hyun <dongjoon@apache.org>
    
    Closes #12911 from dongjoon-hyun/SPARK-15134.
train_validation_split.py 2.72 KiB
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# $example on$
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
# $example off$
from pyspark.sql import SparkSession

"""
This example demonstrates applying TrainValidationSplit to split data
and preform model selection.
Run with:

  bin/spark-submit examples/src/main/python/ml/train_validation_split.py
"""

if __name__ == "__main__":
    spark = SparkSession\
        .builder\
        .appName("TrainValidationSplit")\
        .getOrCreate()
    # $example on$
    # Prepare training and test data.
    data = spark.read.format("libsvm")\
        .load("data/mllib/sample_linear_regression_data.txt")
    train, test = data.randomSplit([0.7, 0.3])
    lr = LinearRegression(maxIter=10, regParam=0.1)

    # We use a ParamGridBuilder to construct a grid of parameters to search over.
    # TrainValidationSplit will try all combinations of values and determine best model using
    # the evaluator.
    paramGrid = ParamGridBuilder()\
        .addGrid(lr.regParam, [0.1, 0.01]) \
        .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
        .build()

    # In this case the estimator is simply the linear regression.
    # A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
    tvs = TrainValidationSplit(estimator=lr,
                               estimatorParamMaps=paramGrid,
                               evaluator=RegressionEvaluator(),
                               # 80% of the data will be used for training, 20% for validation.
                               trainRatio=0.8)

    # Run TrainValidationSplit, and choose the best set of parameters.
    model = tvs.fit(train)
    # Make predictions on test data. model is the model with combination of parameters
    # that performed best.
    prediction = model.transform(test)
    for row in prediction.take(5):
        print(row)
    # $example off$
    spark.stop()