  1. Feb 26, 2015
    • SPARK-4300 [CORE] Race condition during SparkWorker shutdown · 3fb53c02
      Sean Owen authored
      Close the appender saving stdout/stderr before destroying the process, to avoid an exception from reading a closed input stream.
      (This also removes a redundant `waitFor()`, although it was harmless.)
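      A minimal sketch of the ordering fix (illustrative code, not the actual patch):

      ```scala
      // Drain the process's stdout on an appender-like thread and let it finish
      // *before* destroying the process, so it never reads from a closed stream.
      val process = new ProcessBuilder("echo", "hello").start()
      val appender = new Thread(new Runnable {
        def run(): Unit =
          scala.io.Source.fromInputStream(process.getInputStream)
            .getLines().foreach(println)
      })
      appender.start()
      process.waitFor()
      appender.join()     // let the appender finish first...
      process.destroy()   // ...then destroy the process
      ```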
      
      CC tdas since I think you wrote this method.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4787 from srowen/SPARK-4300 and squashes the following commits:
      
      e0cdabf [Sean Owen] Close appender saving stdout/stderr before destroying process to avoid exception on reading closed input stream
    • [SPARK-6018] [YARN] NoSuchMethodError in Spark app is swallowed by YARN AM · 5f3238b3
      Cheolsoo Park authored
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #4773 from piaozhexiu/SPARK-6018 and squashes the following commits:
      
      2a919d5 [Cheolsoo Park] Rename e with cause to avoid duplicate names
      1e71d2d [Cheolsoo Park] Replace placeholder with throwable
      eb5750d [Cheolsoo Park] NoSuchMethodError in Spark app is swallowed by YARN AM
    • [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils... · aa63f633
      Tathagata Das authored
      [SPARK-6027][SPARK-5546] Fixed --jar and --packages not working for KafkaUtils and improved error message
      
      The problem with SPARK-6027, in short, is that JARs like kafka-assembly.jar do not work in Python, because the added JAR is not visible to the class loader used by Py4J. Py4J uses Class.forName(), which does not use the system class loader, and the JARs are only visible to the thread's context class loader. So this fix uses the context class loader to create the KafkaUtils DStream object. This works both when the Kafka libraries are added with --jars spark-streaming-kafka-assembly.jar and with --packages spark-streaming-kafka
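      A minimal sketch of the idea (illustrative, assuming the class is on the classpath added via --jars/--packages):

      ```scala
      // Resolve KafkaUtils through the thread's context class loader, where JARs
      // added with --jars/--packages are visible, instead of the loader a bare
      // Class.forName() would use.
      val loader = Option(Thread.currentThread().getContextClassLoader)
        .getOrElse(getClass.getClassLoader)
      val kafkaUtilsClass =
        Class.forName("org.apache.spark.streaming.kafka.KafkaUtils", true, loader)
      ```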
      
      Also improves the error message.
      
      davies
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4779 from tdas/kafka-python-fix and squashes the following commits:
      
      fb16b04 [Tathagata Das] Removed import
      c1fdf35 [Tathagata Das] Fixed long line and improved documentation
      7b88be8 [Tathagata Das] Fixed --jar not working for KafkaUtils and improved error message
    • [SPARK-3562] Periodic cleanup of event logs · 8942b522
      xukun 00228947 authored
      Author: xukun 00228947 <xukun.xu@huawei.com>
      
      Closes #4214 from viper-kun/cleaneventlog and squashes the following commits:
      
      7a5b9c5 [xukun 00228947] fix issue
      31674ee [xukun 00228947] fix issue
      6e3d06b [xukun 00228947] fix issue
      373f3b9 [xukun 00228947] fix issue
      71782b5 [xukun 00228947] fix issue
      5b45035 [xukun 00228947] fix issue
      70c28d6 [xukun 00228947] fix issues
      adcfe86 [xukun 00228947] Periodic cleanup event logs
    • Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs. · 10094a52
      Li Zhihui authored
      The configuration is currently not supported in Mesos mode.
      See https://github.com/apache/spark/pull/1462
      
      Author: Li Zhihui <zhihui.li@intel.com>
      
      Closes #4781 from li-zhihui/fixdocconf and squashes the following commits:
      
      63e7a44 [Li Zhihui] Modify default value description for spark.scheduler.minRegisteredResourcesRatio on docs.
    • SPARK-4704 [CORE] SparkSubmitDriverBootstrap doesn't flush output · cd5c8d7b
      Sean Owen authored
      Join on output threads to make sure any lingering output from process reaches stdout, stderr before exiting
      
      CC andrewor14 since I believe he created this section of code
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4788 from srowen/SPARK-4704 and squashes the following commits:
      
      ad7114e [Sean Owen] Join on output threads to make sure any lingering output from process reaches stdout, stderr before exiting
    • [SPARK-5363] Fix bug in PythonRDD: remove() inside iterator is not safe · 7fa960e6
      Davies Liu authored
      Removing elements from a mutable HashSet while iterating over it can cause the
      iteration to incorrectly skip over entries that were not removed. If this
      happened, PythonRDD would write fewer broadcast variables than the Python
      worker was expecting to read, which would cause the Python worker to hang
      indefinitely.
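      An illustrative sketch of the bug class (not the PythonRDD code itself):

      ```scala
      import scala.collection.mutable

      val set = mutable.HashSet(1, 2, 3, 4, 5)

      // Unsafe: removing elements during iteration can silently skip entries.
      //   set.foreach(x => if (x % 2 == 0) set.remove(x))

      // Safe: iterate over a snapshot of the elements...
      set.toList.foreach(x => if (x % 2 == 0) set.remove(x))
      // ...or mutate in one pass: set.retain(_ % 2 != 0)
      ```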
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4776 from davies/fix_hang and squashes the following commits:
      
      a4384a5 [Davies Liu] fix bug: remove() inside iterator is not safe
    • [SPARK-6004][MLlib] Pick the best model when training GradientBoostedTrees with validation · cfff397f
      Liang-Chi Hsieh authored
      Since the validation error does not change monotonically, in practice it is better to pick the best model when training GradientBoostedTrees with validation, rather than stopping early.
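      A sketch of the selection logic (variable names are assumptions):

      ```scala
      // Keep the model with the lowest validation error rather than the model at
      // the iteration where the error first increased.
      val validationErrors = Array(0.31, 0.27, 0.25, 0.26, 0.28) // per iteration
      val bestNumTrees = validationErrors.zipWithIndex.minBy(_._1)._2 + 1
      // return the ensemble built from the first `bestNumTrees` trees/weights
      ```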
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4763 from viirya/gbt_record_model and squashes the following commits:
      
      452e049 [Liang-Chi Hsieh] Address comment.
      ea2fae2 [Liang-Chi Hsieh] Pick the best model when training GradientBoostedTrees with validation.
    • [SPARK-6007][SQL] Add numRows param in DataFrame.show() · 23586575
      Jacky Li authored
      It is useful to let the user decide the number of rows to show in DataFrame.show
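      Hypothetical usage in the Scala shell (the default row count is kept when no argument is given):

      ```scala
      val df = sqlContext.table("people")
      df.show()   // default: first 20 rows
      df.show(5)  // new: first 5 rows
      ```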
      
      Author: Jacky Li <jacky.likun@huawei.com>
      
      Closes #4767 from jackylk/show and squashes the following commits:
      
      a0e0f4b [Jacky Li] fix testcase
      7cdbe91 [Jacky Li] modify according to comment
      bb54537 [Jacky Li] for Java compatibility
      d7acc18 [Jacky Li] modify according to comments
      981be52 [Jacky Li] add numRows param in DataFrame.show()
    • [SPARK-5801] [core] Avoid creating nested directories. · df3d559b
      Marcelo Vanzin authored
      Cache the value of the local root dirs to use for storing local data,
      so that the same directories are reused.
      
      Also, to avoid an extra level of nesting, use a different env variable
      to propagate the local dirs from the Worker to the executors. And make
      the executor directory use a different name.
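      A minimal sketch of the caching idea (names and config-key handling are assumptions, not the exact patch):

      ```scala
      import java.io.File
      import org.apache.spark.SparkConf

      object LocalDirs {
        @volatile private var cached: Array[String] = null

        // Compute the local root dirs once and reuse them, instead of creating a
        // fresh (nested) directory on every call.
        def getOrCreate(conf: SparkConf): Array[String] = synchronized {
          if (cached == null) {
            cached = conf
              .get("spark.local.dir", System.getProperty("java.io.tmpdir"))
              .split(",")
              .map { root =>
                val dir = new File(root, s"spark-local-${java.util.UUID.randomUUID}")
                dir.mkdirs()
                dir.getAbsolutePath
              }
          }
          cached
        }
      }
      ```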
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #4747 from vanzin/SPARK-5801 and squashes the following commits:
      
      e0114e1 [Marcelo Vanzin] Update unit test.
      18ee0a7 [Marcelo Vanzin] [SPARK-5801] [core] Avoid creating nested directories.
    • [SPARK-6016][SQL] Cannot read the parquet table after overwriting the existing... · 192e42a2
      Yin Huai authored
      [SPARK-6016][SQL] Cannot read the parquet table after overwriting the existing table when spark.sql.parquet.cacheMetadata=true
      
      Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4775 from yhuai/parquetFooterCache and squashes the following commits:
      
      78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat.
      dff6fba [Yin Huai] Failed unit test.
    • [SPARK-6023][SQL] ParquetConversions fails to replace the destination... · f02394d0
      Yin Huai authored
      [SPARK-6023][SQL] ParquetConversions fails to replace the destination MetastoreRelation of an InsertIntoTable node to ParquetRelation2
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-6023
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4782 from yhuai/parquetInsertInto and squashes the following commits:
      
      ae7e806 [Yin Huai] Convert MetastoreRelation in InsertIntoTable and InsertIntoHiveTable.
      ba543cd [Yin Huai] More tests.
      50b6d0f [Yin Huai] Update error messages.
      346780c [Yin Huai] Failed test.
    • [SPARK-5914] Run spark-submit requiring only user permission on Windows · 51a6f909
      Judy Nash authored
      Because Windows by default does not grant read permission on jars to anyone except admins, spark-submit would fail with a "ClassNotFound" exception if a user runs the slave service with only user permission.
      This fix adds read permission for the owner of the jar (which would be the slave service account on Windows).
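      A minimal sketch (illustrative path; whether the patch uses exactly this call is an assumption):

      ```scala
      import java.io.File

      // Grant the file owner read permission so a non-admin service account
      // can load the jar.
      val jar = new File("""C:\spark\app.jar""")
      jar.setReadable(true, true)  // readable = true, ownerOnly = true
      ```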
      
      Author: Judy Nash <judynash@microsoft.com>
      
      Closes #4742 from judynash/SPARK-5914 and squashes the following commits:
      
      e288e56 [Judy Nash] Fix spacing and refactor code
      1de3c0e [Judy Nash] [SPARK-5914] Enable spark-submit to run requiring only user permission on windows
    • [SPARK-5976][MLLIB] Add partitioner to factors returned by ALS · e43139f4
      Xiangrui Meng authored
      The model trained by ALS requires partitioning information to do quick lookup of a user/item factor for making recommendations on individual requests. In the new implementation, we didn't set partitioners in the factors returned by ALS, which would cause a performance regression.
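      A sketch of the idea (toy data; `sc` assumed in scope):

      ```scala
      import org.apache.spark.HashPartitioner

      // Give the factor RDD an explicit partitioner so a per-user lookup hits a
      // single partition instead of scanning all of them.
      val numBlocks = 10
      val userFactors = sc
        .parallelize(Seq((1, Array(0.1f, 0.2f)), (2, Array(0.3f, 0.4f))))
        .partitionBy(new HashPartitioner(numBlocks))
        .cache()
      userFactors.lookup(1)  // only scans the partition that can contain key 1
      ```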
      
      srowen coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4748 from mengxr/SPARK-5976 and squashes the following commits:
      
      9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS
      260f183 [Xiangrui Meng] add a test for partitioner
  2. Feb 25, 2015
    • [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT · d20559b1
      Joseph K. Bradley authored
      * Add GradientBoostedTrees Python examples to ML guide
        * I ran these in the pyspark shell, and they worked.
      * Add save/load to examples in ML guide
      * Added note to python docs about predict/transform not working within RDD actions/transformations in some cases (see SPARK-5981)
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:
      
      c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
      bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
      6d81c3e [Joseph K. Bradley] completed python GBT examples
      9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
      c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
    • [SPARK-1182][Docs] Sort the configuration parameters in configuration.md · 46a044a3
      Brennon York authored
      Sorts all configuration options present on the `configuration.md` page to ease readability.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:
      
      5696f21 [Brennon York] fixed merge conflict with port comments
      81a7b10 [Brennon York] capitalized A in Allocation
      e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
      7de5f75 [Brennon York] moved serialization from application to compression and serialization section
      a16fec0 [Brennon York] moved shuffle settings from network to shuffle
      f8fa286 [Brennon York] sorted encryption category
      1023f15 [Brennon York] moved initialExecutors
      e9d62aa [Brennon York] fixed akka.heartbeat.interval
      25e6f6f [Brennon York] moved spark.executer.user*
      4625ade [Brennon York] added spark.executor.extra* items
      4ee5648 [Brennon York] fixed merge conflicts
      1b49234 [Brennon York] sorting mishap
      2b5758b [Brennon York] sorting mishap
      6fbdf42 [Brennon York] sorting mishap
      55dc6f8 [Brennon York] sorted security
      ec34294 [Brennon York] sorted dynamic allocation
      2a7c4a3 [Brennon York] sorted scheduling
      aa9acdc [Brennon York] sorted networking
      a4380b8 [Brennon York] sorted execution behavior
      27f3919 [Brennon York] sorted compression and serialization
      80a5bbb [Brennon York] sorted spark ui
      3f32e5b [Brennon York] sorted shuffle behavior
      6c51b38 [Brennon York] sorted runtime environment
      efe9d6f [Brennon York] sorted application properties
    • [SPARK-5926] [SQL] make DataFrame.explain leverage queryExecution.logical · 41e2e5ac
      Yanbo Liang authored
      DataFrame.explain returns the wrong result when the query is a DDL command.
      
      For example, the following two queries should print out the same execution plan, but they do not:
      sql("create table tb as select * from src where key > 490").explain(true)
      sql("explain extended create table tb as select * from src where key > 490")
      
      This is because DataFrame.explain uses logicalPlan, which has already been forced to execute; we should use the unexecuted plan, queryExecution.logical.
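      A sketch of the distinction (shell-style; `sql` assumed in scope):

      ```scala
      val df = sql("create table tb as select * from src where key > 490")
      // logicalPlan has already forced execution for DDL commands;
      // queryExecution.logical is the plan before execution, which is what
      // explain should print.
      println(df.queryExecution.logical)
      ```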
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #4707 from yanboliang/spark-5926 and squashes the following commits:
      
      fa6db63 [Yanbo Liang] logicalPlan is not lazy
      0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
    • [SPARK-5999][SQL] Remove duplicate Literal matching block · 12dbf98c
      Liang-Chi Hsieh authored
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #4760 from viirya/dup_literal and squashes the following commits:
      
      06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
    • [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits · e0fdd467
      Cheng Lian authored
      `ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user-defined key-value metadata and throws an exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, which causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue.
      
      In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception.
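      A sketch of the manual merge (`mergeSchemas` is a hypothetical helper that unions compatible fields):

      ```scala
      import org.apache.spark.sql.types.{DataType, StructType}

      // Fold the per-part-file Spark SQL schema JSON strings into one StructType
      // before building the ReadContext.
      def mergeAll(schemaJsons: Seq[String])
                  (mergeSchemas: (StructType, StructType) => StructType): StructType =
        schemaJsons
          .map(json => DataType.fromJson(json).asInstanceOf[StructType])
          .reduceLeft(mergeSchemas)
      ```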
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4768 from liancheng/spark-6010 and squashes the following commits:
      
      9002f0a [Cheng Lian] Fixes SPARK-6010
    • [SPARK-5944] [PySpark] fix version in Python API docs · f3f4c87b
      Davies Liu authored
      use RELEASE_VERSION when building the Python API docs
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4731 from davies/api_version and squashes the following commits:
      
      c9744c9 [Davies Liu] Update create-release.sh
      08cbc3f [Davies Liu] fix python docs
    • [SPARK-5982] Remove incorrect Local Read Time Metric · 838a4803
      Kay Ousterhout authored
      This metric is incomplete, because the files are memory mapped, so much of the read from disk occurs later as tasks actually read the file's data.
      
      This should be merged into 1.3, so that we never expose this incorrect metric to users.
      
      CC pwendell ksakellis sryza
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #4749 from kayousterhout/SPARK-5982 and squashes the following commits:
      
      9737b5e [Kay Ousterhout] More fixes
      a1eb300 [Kay Ousterhout] Removed one more use of local read time
      cf13497 [Kay Ousterhout] [SPARK-5982] Remove incorrect Local Read Time Metric
    • [SPARK-1955][GraphX]: VertexRDD can incorrectly assume index sharing · 9f603fce
      Brennon York authored
      Fixes the issue whereby VertexRDDs that are `diff`ed, `innerJoin`ed, or `leftJoin`ed with different partition counts fail under the `zipPartitions` method. This fix tests whether the partition counts are equal and, if not, repartitions the other VertexRDD to match the calling one.
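      An illustrative sketch of the guard on plain RDDs (`sc` assumed in scope):

      ```scala
      // zipPartitions requires equal partition counts; repartition the other RDD
      // to match the calling one when they differ.
      val a = sc.parallelize(1 to 100, 4)
      val bRaw = sc.parallelize(101 to 200, 8)
      val b =
        if (bRaw.partitions.length != a.partitions.length)
          bRaw.repartition(a.partitions.length)
        else bRaw
      val sums = a.zipPartitions(b) { (ia, ib) => Iterator(ia.sum + ib.sum) }
      ```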
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #4705 from brennonyork/SPARK-1955 and squashes the following commits:
      
      0882590 [Brennon York] updated to properly handle differently-partitioned vertexRDDs
    • [SPARK-5970][core] Register directory created in getOrCreateLocalRootDirs for automatic deletion. · a777c65d
      Milan Straka authored
      As documented in createDirectory, the result of createDirectory is not registered for automatic removal. Currently there are 4 directories left in `/tmp` after just running `pyspark`.
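      A minimal sketch of the registration idea (illustrative; not Spark's exact utilities):

      ```scala
      import java.io.File
      import java.util.UUID

      // Create a directory and register it for recursive deletion on JVM exit,
      // which bare createDirectory does not do.
      def createTrackedTempDir(root: String): File = {
        val dir = new File(root, s"spark-${UUID.randomUUID}")
        dir.mkdirs()
        sys.addShutdownHook {
          def rm(f: File): Unit = {
            Option(f.listFiles).toSeq.flatten.foreach(rm)
            f.delete()
          }
          rm(dir)
        }
        dir
      }
      ```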
      
      Author: Milan Straka <fox@ucw.cz>
      
      Closes #4759 from foxik/remove-tmp-dirs and squashes the following commits:
      
      280450d [Milan Straka] Use createTempDir in getOrCreateLocalRootDirs...
    • SPARK-5930 [DOCS] Documented default of spark.shuffle.io.retryWait is confusing · 7d8e6a2e
      Sean Owen authored
      Clarify default max wait in spark.shuffle.io.retryWait docs
      
      CC andrewor14
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4769 from srowen/SPARK-5930 and squashes the following commits:
      
      ae2792b [Sean Owen] Clarify default max wait in spark.shuffle.io.retryWait docs
    • [SPARK-5996][SQL] Fix specialized outbound conversions · f84c799e
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4757 from marmbrus/udtConversions and squashes the following commits:
      
      3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions
    • [SPARK-5771] Number of Cores in Completed Applications of Standalone Master... · dd077abf
      guliangliang authored
      [SPARK-5771] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
      
      In standalone mode, the number of cores in Completed Applications on the Master web page will always be zero if sc.stop() is called, but will always be right if sc.stop() is not called.
      The likely reason: after sc.stop() is called, the removeExecutor function of class ApplicationInfo is called, which reduces the variable coresGranted to zero. coresGranted is used to display the number of cores on the web page.
      
      Author: guliangliang <guliangliang@qiyi.com>
      
      Closes #4567 from marsishandsome/Spark5771 and squashes the following commits:
      
      694796e [guliangliang] remove duplicate code
      a20e390 [guliangliang] change to Cores Using & Requested
      0c19c95 [guliangliang] change Cores to Cores (max)
      cfbd97d [guliangliang] [SPARK-5771] Number of Cores in Completed Applications of Standalone Master Web Page always be 0 if sc.stop() is called
    • [GraphX] fixing 3 typos in the graphx programming guide · 5b8480e0
      Benedikt Linse authored
      Corrected 3 typos in the GraphX programming guide. I hope this is the correct way to contribute.
      
      Author: Benedikt Linse <benedikt.linse@gmail.com>
      
      Closes #4766 from 1123/master and squashes the following commits:
      
      8a63812 [Benedikt Linse] fixing 3 typos in the graphx programming guide
    • [SPARK-5666][streaming][MQTT streaming] some trivial fixes · d51ed263
      prabs authored
      Modified to adhere to accepted coding standards, as pointed out by tdas in PR #3844.
      
      Author: prabs <prabsmails@gmail.com>
      Author: Prabeesh K <prabsmails@gmail.com>
      
      Closes #4178 from prabeesh/master and squashes the following commits:
      
      bd2cb49 [Prabeesh K] address the comment
      ccc0765 [prabs] address the comment
      46f9619 [prabs] address the comment
      c035bdc [prabs] address the comment
      22dd7f7 [prabs] address the comments
      0cc67bd [prabs] address the comment
      838c38e [prabs] address the comment
      cd57029 [prabs] address the comments
      66919a3 [Prabeesh K] changed MqttDefaultFilePersistence to MemoryPersistence
      5857989 [prabs] modified to adhere to accepted coding standards
  3. Feb 24, 2015
    • [SPARK-5994] [SQL] Python DataFrame documentation fixes · d641fbb3
      Davies Liu authored
      * select empty should NOT be the same as select; make sure selectExpr behaves the same
      * join param documentation
      * link to source doesn't work in the jekyll-generated file
      * cross-reference of columns (i.e. enabling linking)
      * show(): move the df example before df.show()
      * move tests in SQLContext out of the docstring, otherwise the doc is too long
      * Column.desc and .asc don't have any documentation
      * in documentation, sort functions.*
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4756 from davies/df_docs and squashes the following commits:
      
      f30502c [Davies Liu] fix doc
      32f0d46 [Davies Liu] fix DataFrame docs
    • [SPARK-5286][SQL] SPARK-5286 followup · 769e092b
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-5286
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #4755 from yhuai/SPARK-5286-throwable and squashes the following commits:
      
      4c0c450 [Yin Huai] Catch Throwable instead of Exception.
    • [SPARK-5993][Streaming][Build] Fix assembly jar location of kafka-assembly · 922b43b3
      Tathagata Das authored
      The published kafka-assembly JAR was empty in 1.3.0-RC1.
      This is because the maven build generated two JARs:
      1. an empty JAR file (since kafka-assembly has no code of its own)
      2. an assembly JAR file containing everything, generated in a different location from 1
      The maven publishing plugin uploaded 1 and not 2.
      Instead, if 2 is not configured to be generated in a different location, there is only one JAR containing everything, which gets published.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #4753 from tdas/SPARK-5993 and squashes the following commits:
      
      c390db8 [Tathagata Das] Fix assembly jar location of kafka-assembly
    • [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python. · fba11c2f
      Reynold Xin authored
      Also added desc/asc functions for constructing sorting expressions more conveniently, and a small fix to lift aliases out of cast expressions.
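      Hypothetical usage (Scala equivalent shown; the PR targets the Python side, and `df` is assumed to be a DataFrame):

      ```scala
      // orderBy replaces sortBy; desc/asc build sort expressions conveniently.
      val sorted = df.orderBy(df("age").desc, df("name").asc)
      ```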
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4752 from rxin/SPARK-5985 and squashes the following commits:
      
      aeda5ae [Reynold Xin] Added Experimental flag to ColumnName.
      047ad03 [Reynold Xin] Lift alias out of cast.
      c9cf17c [Reynold Xin] [SPARK-5985][SQL] DataFrame sortBy -> orderBy in Python.
    • [SPARK-5904][SQL] DataFrame Java API test suites. · 53a1ebf3
      Reynold Xin authored
      Added a new test suite to make sure Java DF programs can use varargs properly.
      Also moved all suites into test.org.apache.spark package to make sure the suites also test for method visibility.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #4751 from rxin/df-tests and squashes the following commits:
      
      1e8b8e4 [Reynold Xin] Fixed imports and renamed JavaAPISuite.
      a6ca53b [Reynold Xin] [SPARK-5904][SQL] DataFrame Java API test suites.
    • [SPARK-5751] [SQL] [WIP] Revamped HiveThriftServer2Suite for robustness · f816e739
      Cheng Lian authored
      **NOTICE** Do NOT merge this, as we're waiting for #3881 to be merged.
      
      `HiveThriftServer2Suite` has been notorious for its flakiness for a while. This was mostly due to spawning and communicating with external server processes. This PR revamps this test suite for better robustness:
      
      1. Fixes a race condition that occurred while using `tail -f` to check the log file
      
         It's possible that the line we are looking for has already been printed into the log file before we start the `tail -f` process. This PR uses `tail -n +0 -f` to ensure all lines are checked.
      
      2. Retries up to 3 times if the server fails to start
      
         In most cases, the server fails to start because of a port conflict. This PR no longer asks the system to choose an available TCP port, but uses a random port first and retries up to 3 times if the server fails to start (a sketch follows this list).
      
      3. A server instance is reused among all test cases within a single suite
      
         The original `HiveThriftServer2Suite` is split into two test suites, `HiveThriftBinaryServerSuite` and `HiveThriftHttpServerSuite`. Each suite starts a `HiveThriftServer2` instance and reuses it for all of its test cases.
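      A minimal sketch of the retry loop (`startServer` is a hypothetical launcher):

      ```scala
      import scala.util.Random

      // Try a random port and retry up to 3 times on failure (usually a port
      // conflict), instead of asking the system for a free port.
      def startWithRetry(startServer: Int => Unit, attempts: Int = 3): Unit = {
        val port = 10000 + Random.nextInt(10000)
        try startServer(port)
        catch {
          case e: Exception if attempts > 1 =>
            startWithRetry(startServer, attempts - 1)
        }
      }
      ```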
      
      **TODO**
      
      - [ ] Starts the Thrift server in foreground once #3881 is merged (adding `--foreground` flag to `spark-daemon.sh`)
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #4720 from liancheng/revamp-thrift-server-tests and squashes the following commits:
      
      d6c80eb [Cheng Lian] Relaxes server startup timeout
      6f14eb1 [Cheng Lian] Revamped HiveThriftServer2Suite for robustness
    • [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation · 2a0fe348
      MechCoder authored
      One can stop early if the decrease in error rate is less than a certain tolerance, or if the error increases because the training data is overfit.
      
      This introduces a new method, runWithValidation, which takes in a pair of RDDs: one for the training data and the other for validation.
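      Usage, as the PR describes it (boostingStrategy, trainingData and validationData assumed to be defined):

      ```scala
      import org.apache.spark.mllib.tree.GradientBoostedTrees

      // Train with a held-out validation set; training stops once the validation
      // error stops improving by more than the tolerance.
      val model = new GradientBoostedTrees(boostingStrategy)
        .runWithValidation(trainingData, validationData)
      ```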
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #4677 from MechCoder/spark-5436 and squashes the following commits:
      
      1bb21d4 [MechCoder] Combine regression and classification tests into a single one
      e4d799b [MechCoder] Addresses indentation and doc comments
      b48a70f [MechCoder] COSMIT
      b928a19 [MechCoder] Move validation while training section under usage tips
      fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
      55e5c3b [MechCoder] One liner for prevValidateError
      3e74372 [MechCoder] TST: Add test for classification
      77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
    • [SPARK-5973] [PySpark] fix zip with two RDDs with AutoBatchedSerializer · da505e59
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4745 from davies/fix_zip and squashes the following commits:
      
      2124b2c [Davies Liu] Update tests.py
      b5c828f [Davies Liu] increase the number of records
      c1e40fd [Davies Liu] fix zip with two RDDs with AutoBatchedSerializer
    • [SPARK-5952][SQL] Lock when using hive metastore client · a2b91379
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #4746 from marmbrus/hiveLock and squashes the following commits:
      
      8b871cf [Michael Armbrust] [SPARK-5952][SQL] Lock when using hive metastore client
    • [Spark-5708] Add Slf4jSink to Spark Metrics · c5ba975e
      Judy authored
      Add Slf4jSink to Spark Metrics using Coda Hale's Slf4jReporter.
      This sends metrics to slf4j (and hence, e.g., log4j), allowing Spark users to reuse their log4j pipeline for metrics collection.
      
      Reviewed existing unit tests and didn't see any sink-related tests. Please advise on whether tests should be added.
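      A hypothetical metrics.properties snippet enabling the sink (property names assumed from the convention used by Spark's other sinks):

      ```
      *.sink.slf4j.class=org.apache.spark.metrics.sink.Slf4jSink
      *.sink.slf4j.period=10
      *.sink.slf4j.unit=seconds
      ```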
      
      Author: Judy <judynash@microsoft.com>
      Author: judynash <judynash@microsoft.com>
      
      Closes #4644 from judynash/master and squashes the following commits:
      
      57ef214 [judynash] doc clarification and indent fixes
      a751a66 [Judy] Spark-5708: Add Slf4jSink to Spark Metrics
    • [MLLIB] Change x_i to y_i in Variance's user guide · 105791e3
      Xiangrui Meng authored
      Variance is calculated on labels/responses.
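      For reference, the corrected formula presumably reads (our rendering of the standard definition, with $y_i$ the label of instance $i$):

      $$\mathrm{Variance} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \mu\right)^2, \qquad \mu = \frac{1}{N}\sum_{i=1}^{N} y_i$$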
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4740 from mengxr/patch-1 and squashes the following commits:
      
      673317b [Xiangrui Meng] [MLLIB] Change x_i to y_i in Variance's user guide
    • [SPARK-5965] Standalone Worker UI displays {{USER_JAR}} · 6d2caa57
      Andrew Or authored
      For screenshot see: https://issues.apache.org/jira/browse/SPARK-5965
      This was caused by 20a60131.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #4739 from andrewor14/user-jar-blocker and squashes the following commits:
      
      23c4a9e [Andrew Or] Use right argument