  1. May 20, 2016
• [SPARK-15394][ML][DOCS] User guide typos and grammar audit · 5e203505
      sethah authored
      ## What changes were proposed in this pull request?
      
      Correct some typos and incorrectly worded sentences.
      
      ## How was this patch tested?
      
      Doc changes only.
      
Note that many of these changes were identified by whomfire01.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #13180 from sethah/ml_guide_audit.
• [SPARK-15398][ML] Update the warning message to recommend ML usage · 47a2940d
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
MLlib is no longer recommended for use, and some of its methods are even deprecated.
      Update the warning message to recommend ML usage.
      ```
        def showWarning() {
          System.err.println(
            """WARN: This is a naive implementation of Logistic Regression and is given as an example!
              |Please use either org.apache.spark.mllib.classification.LogisticRegressionWithSGD or
              |org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
              |for more conventional use.
            """.stripMargin)
        }
      ```
      To
      ```
        def showWarning() {
          System.err.println(
            """WARN: This is a naive implementation of Logistic Regression and is given as an example!
              |Please use org.apache.spark.ml.classification.LogisticRegression
              |for more conventional use.
            """.stripMargin)
        }
      ```
      
      ## How was this patch tested?
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13190 from zhengruifeng/update_recd.
• [SPARK-15363][ML][EXAMPLE] Example code shouldn't use VectorImplicits._, asML/fromML · 4c7a6b38
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
In this DataFrame example, we use VectorImplicits._, which is a private API.

Since the Vectors object has a public API, we use Vectors.fromML instead of the implicits.
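As a hedged sketch of the public conversion path (names and values illustrative, not taken from the example file):

```scala
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}

// Before: import org.apache.spark.mllib.linalg.VectorImplicits._ (a private API)
// converted spark.ml vectors implicitly. After: convert explicitly via the
// public Vectors.fromML.
val v = NewVectors.dense(0.5, 1.5)
val converted = OldVectors.fromML(v)
```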
      
      ## How was this patch tested?
      
Manually ran the example.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13213 from wangmiao1981/ml.
• [SPARK-15335][SQL] Implement TRUNCATE TABLE Command · 09a00510
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
      
TRUNCATE TABLE is a command supported by Hive. See the link: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
Below is the related Hive JIRA: https://issues.apache.org/jira/browse/HIVE-446
This PR implements such a TRUNCATE TABLE command, excluding column truncation (HIVE-4005).
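A minimal sketch of the intended usage, following the Hive syntax this PR mirrors (table name and partition spec are hypothetical):

```scala
// Remove all rows from the table while keeping its definition.
spark.sql("TRUNCATE TABLE t")

// Hive's syntax also accepts an optional partition spec.
spark.sql("TRUNCATE TABLE t PARTITION (ds = '2016-05-20')")
```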
      
      ## How was this patch tested?
      Added a test case.
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13170 from lianhuiwang/truncate.
• [SPARK-15313][SQL] EmbedSerializerInFilter rule should keep exprIds of output of surrounded SerializeFromObject. · d5e1c5ac
Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      The following code:
      
      ```
      val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
      ds.filter(_._1 == "b").select(expr("_1").as[String]).foreach(println(_))
      ```
      
      throws an Exception:
      
      ```
      org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: _1#420
       at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:50)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
      
      ...
       Cause: java.lang.RuntimeException: Couldn't find _1#420 in [_1#416,_2#417]
       at scala.sys.package$.error(package.scala:27)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88)
       at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:49)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88)
       at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87)
      ...
      ```
      
      This is because `EmbedSerializerInFilter` rule drops the `exprId`s of output of surrounded `SerializeFromObject`.
      
      The analyzed and optimized plans of the above example are as follows:
      
      ```
      == Analyzed Logical Plan ==
      _1: string
      Project [_1#420]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
         +- Filter <function1>.apply
            +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
               +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
      
      == Optimized Logical Plan ==
      !Project [_1#420]
      +- Filter <function1>.apply
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
      ```
      
      This PR fixes `EmbedSerializerInFilter` rule to keep `exprId`s of output of surrounded `SerializeFromObject`.
      
      The plans after this patch are as follows:
      
      ```
      == Analyzed Logical Plan ==
      _1: string
      Project [_1#420]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, scala.Tuple2]._1, true) AS _1#420,input[0, scala.Tuple2]._2 AS _2#421]
         +- Filter <function1>.apply
            +- DeserializeToObject newInstance(class scala.Tuple2), obj#419: scala.Tuple2
               +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
      
      == Optimized Logical Plan ==
      Project [_1#416]
      +- Filter <function1>.apply
         +- LocalRelation [_1#416,_2#417], [[0,1800000001,1,61],[0,1800000001,2,62],[0,1800000001,3,63]]
      ```
      
      ## How was this patch tested?
      
Existing tests, plus a new test to check that `filter` followed by `select` works.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13096 from ueshin/issues/SPARK-15313.
• [SPARK-14261][SQL] Memory leak in Spark Thrift Server · e384c7fb
      Oleg Danilov authored
Fixed a memory leak (HiveConf in the CommandProcessorFactory).
      
      Author: Oleg Danilov <oleg.danilov@wandisco.com>
      
      Closes #12932 from dosoft/SPARK-14261.
• [SPARK-14990][SQL] Fix checkForSameTypeInputExpr (ignore nullability) · 3ba34d43
      Reynold Xin authored
      ## What changes were proposed in this pull request?
This patch fixes a bug in TypeUtils.checkForSameTypeInputExpr. Previously the code tested strict equality, which does not take nullability into account.

This is based on https://github.com/apache/spark/pull/12768. This patch fixed a bug there (with empty expressions) and added a test case.
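To illustrate the distinction, here is a hand-rolled sketch of a nullability-insensitive comparison (not Spark's internal implementation):

```scala
import org.apache.spark.sql.types._

// Strict equality distinguishes types that differ only in nullability flags;
// this helper strips those flags recursively before comparing.
def sameTypeIgnoringNullability(a: DataType, b: DataType): Boolean = (a, b) match {
  case (ArrayType(ea, _), ArrayType(eb, _)) =>
    sameTypeIgnoringNullability(ea, eb)
  case (MapType(ka, va, _), MapType(kb, vb, _)) =>
    sameTypeIgnoringNullability(ka, kb) && sameTypeIgnoringNullability(va, vb)
  case (StructType(fa), StructType(fb)) =>
    fa.length == fb.length && fa.zip(fb).forall { case (x, y) =>
      x.name == y.name && sameTypeIgnoringNullability(x.dataType, y.dataType)
    }
  case _ => a == b
}

val a = ArrayType(IntegerType, containsNull = true)
val b = ArrayType(IntegerType, containsNull = false)
println(a == b)                            // false: strict equality
println(sameTypeIgnoringNullability(a, b)) // true: nullability ignored
```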
      
      ## How was this patch tested?
      Added a new test suite and test case.
      
      Closes #12768.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Oleg Danilov <oleg.danilov@wandisco.com>
      
      Closes #13208 from rxin/SPARK-14990.
  2. May 19, 2016
• [SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate config options to existing sessions if specified · f2ee0ed4
Reynold Xin authored
      
      ## What changes were proposed in this pull request?
Currently SparkSession.Builder uses SQLContext.getOrCreate. It should probably be the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that.
      
      This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession.
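A minimal sketch of the guaranteed behavior (the config key is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val s1 = SparkSession.builder().master("local").getOrCreate()
// A second getOrCreate() returns the existing session, and config options
// given to the builder are propagated to it rather than silently dropped.
val s2 = SparkSession.builder().config("spark.some.option", "value").getOrCreate()
assert(s1 eq s2)
assert(s2.conf.get("spark.some.option") == "value")
```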
      
      ## How was this patch tested?
      Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13200 from rxin/SPARK-15075.
• [SPARK-11827][SQL] Adding java.math.BigInteger support in Java type inference for POJOs and Java collections · 17591d90
Kevin Yu authored
      
Hello, can you help check this PR? I am adding support for java.math.BigInteger in the Java bean code path. I saw that internally Spark converts BigInteger to BigDecimal in ColumnType.scala and CatalystRowConverter.scala. I use a similar approach and convert BigInteger to BigDecimal.
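A hedged sketch of the bean path this enables (the bean class is illustrative; the exact decimal precision is an assumption based on the BigInteger-to-BigDecimal conversion described above):

```scala
import java.math.BigInteger
import org.apache.spark.sql.Encoders

// A Java-style bean with a BigInteger property; Java type inference should
// now map it to a DecimalType column instead of failing.
class Account extends Serializable {
  private var balance: BigInteger = BigInteger.ZERO
  def getBalance: BigInteger = balance
  def setBalance(b: BigInteger): Unit = { balance = b }
}

println(Encoders.bean(classOf[Account]).schema)
// expected: something like StructType(StructField(balance, DecimalType(38,0), true))
```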
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #10125 from kevinyu98/working_on_spark-11827.
• [SPARK-15321] Fix bug where Array[Timestamp] cannot be encoded/decoded correctly · d5c47f8f
      Sumedh Mungee authored
      ## What changes were proposed in this pull request?
      
      Fix `MapObjects.itemAccessorMethod` to handle `TimestampType`. Without this fix, `Array[Timestamp]` cannot be properly encoded or decoded. To reproduce this, in `ExpressionEncoderSuite`, if you add the following test case:
      
`encodeDecodeTest(Array(Timestamp.valueOf("2016-01-29 10:00:00")), "array of timestamp")`
      ... you will see that (without this fix) it fails with the following output:
      
      ```
      - encode/decode for array of timestamp: [Ljava.sql.Timestamp;fd9ebde *** FAILED ***
        Exception thrown while decoding
        Converted: [0,1000000010,800000001,52a7ccdc36800]
        Schema: value#61615
        root
        -- value: array (nullable = true)
            |-- element: timestamp (containsNull = true)
        Encoder:
        class[value[0]: array<timestamp>] (ExpressionEncoderSuite.scala:312)
      ```
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sumedh Mungee <smungee@gmail.com>
      
      Closes #13108 from smungee/fix-itemAccessorMethod.
• Closes #11915 · 66ec2494
      Xiangrui Meng authored
      Closes #8648
      Closes #13089
• [SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession · 01cf649c
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
Refactor all Java tests that use SparkSession to extend SharedSparkSession.
      
      ## How was this patch tested?
      Existing Tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13101 from techaddict/SPARK-15296.
• [SPARK-15416][SQL] Display a better message for not finding classes removed in Spark 2.0 · 16ba71ab
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
When a `NoClassDefFoundError` or `ClassNotFoundException` is caught, check whether the class name was removed in Spark 2.0. If so, the user must be using an incompatible library and we can provide a better message.
      
      ## How was this patch tested?
      
      1. Run `bin/pyspark --packages com.databricks:spark-avro_2.10:2.0.1`
2. Type `sqlContext.read.format("com.databricks.spark.avro").load("src/test/resources/episodes.avro")`.
      
      It will show `java.lang.ClassNotFoundException: org.apache.spark.sql.sources.HadoopFsRelationProvider is removed in Spark 2.0. Please check if your library is compatible with Spark 2.0`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13201 from zsxwing/better-message.
• [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync · 66436778
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```ml.evaluation``` Scala and Python API sync.
      
      ## How was this patch tested?
      Only API docs change, no new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13195 from yanboliang/evaluation-doc.
• [SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify "summary" was not saved · f8107c78
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Currently in ```model.write```, we don't save ```summary``` (if applicable). We should add documentation to clarify this.
We also fixed the incorrect link ```[[MLWriter]]```, changing it to ```[[org.apache.spark.ml.util.MLWriter]]```.
      
      ## How was this patch tested?
      Documentation update, no unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13131 from yanboliang/spark-15341.
• [SPARK-15375][SQL][STREAMING] Add ConsoleSink to structure streaming · dcf407de
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Add ConsoleSink to structured streaming, so users can display DataFrames on the console (useful for debugging and demonstrating), similar to the functionality of `DStream#print`. To use it:
      
      ```
          val query = result.write
            .format("console")
            .trigger(ProcessingTime("2 seconds"))
            .startStream()
      ```
      
      ## How was this patch tested?
      
Verified locally.

Not sure whether it is suitable to add to structured streaming; please review and help comment, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #13162 from jerryshao/SPARK-15375.
• [SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public · ef43a5fe
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
Open up the APIs for converting between the new and old linear algebra types (in spark.mllib.linalg):
`.asML` and `.fromML` on the `Sparse`/`Dense` `Vector`/`Matrix` types (see the sketch below).
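A short sketch of the now-public conversions (values illustrative):

```scala
import org.apache.spark.mllib.linalg.{DenseMatrix => OldDenseMatrix, Vectors => OldVectors}

val oldVec = OldVectors.dense(1.0, 2.0, 3.0)
val newVec = oldVec.asML                  // spark.mllib -> spark.ml
val roundTrip = OldVectors.fromML(newVec) // spark.ml -> spark.mllib

val oldMat = OldDenseMatrix.zeros(2, 2)
val newMat = oldMat.asML                  // matrices convert the same way
```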
      
      ## How was this patch tested?
      Existing Tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13202 from techaddict/SPARK-15414.
• [SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering · 59e6c556
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Audit the Scala API for ml.clustering.
Fix some wrong API documentation and update outdated entries.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13148 from yanboliang/spark-15361.
• [SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala · 5255e55c
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
Add `@Since` annotations to ml.stat.MultivariateOnlineSummarizer.scala.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #13197 from dbtsai/cleanup.
• [SPARK-15392][SQL] fix default value of size estimation of logical plan · 5ccecc07
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation. That is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.

This PR changes the default value to Long.MaxValue.
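A hedged sketch of the failure mode (dataset is illustrative; `spark` is an active SparkSession):

```scala
// DataFrames created from RDDs have no reliable statistics, so their size
// must be estimated. With the old default (threshold + 1), a narrowing
// SELECT could shrink the schema-based estimate below
// autoBroadcastJoinThreshold and trigger an unintended broadcast join;
// a Long.MaxValue default no longer assumes such plans are small.
val rdd = spark.sparkContext.parallelize(1 to 1000000).map(i => (i, i.toString))
val df = spark.createDataFrame(rdd).toDF("id", "s")
df.select("id").join(df, "id").explain()
```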
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13183 from davies/fix_default_size.
• [SPARK-15317][CORE] Don't store accumulators for every task in listeners · 4e3cb7a5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.
      
In this PR, it creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes pretty small. I also eliminated `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s.
      
      ## How was this patch tested?
      
      I ran two tests reported in JIRA locally:
      
      The first one is:
      ```
      val data = spark.range(0, 10000, 1, 10000)
      data.cache().count()
      ```
      The retained size of JobProgressListener decreases from 60.7M to 6.9M.
      
      The second one is:
      ```
      import org.apache.spark.ml.CC
      import org.apache.spark.sql.SQLContext
      val sqlContext = SQLContext.getOrCreate(sc)
      CC.runTest(sqlContext)
      ```
      
      This test won't cause OOM after applying this patch.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13153 from zsxwing/memory.
• [SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output · 6ac1c3a0
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
      
      ## How was this patch tested?
      
      Updated existing test case to check exception message.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13173 from liancheng/spark-14346-follow-up.
• [SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression · e71cd96b
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list.
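For reference, a hedged Scala sketch of the parameter being exposed (dataset is illustrative; `spark` is an active session):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression

// Tiny assumed dataset of (label, features).
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 1.0)),
  (2.0, Vectors.dense(1.0, 2.0)),
  (4.0, Vectors.dense(2.0, 3.0))
)).toDF("label", "features")

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")
  .setLinkPredictionCol("linkPrediction") // request the link (eta) column

val model = glr.fit(training)
model.transform(training).select("prediction", "linkPrediction").show()
```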
      
      ## How was this patch tested?
      
      doctests & built docs locally
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
• [SPARK-15322][SQL][FOLLOW-UP] Update deprecated accumulator usage into accumulatorV2 · f5065abf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR corrects another case that uses the deprecated `accumulableCollection`, switching it to `listAccumulator`; it seems the previous PR missed this one.
      
      Since `ArrayBuffer[InternalRow].asJava` is `java.util.List[InternalRow]`, it seems ok to replace the usage.
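A minimal sketch of the AccumulatorV2 collection pattern (note: in the released 2.0 API the helper ended up named `collectionAccumulator`, which is what the sketch uses):

```scala
// Collect values into a list-backed AccumulatorV2 instead of the
// deprecated accumulableCollection.
val acc = spark.sparkContext.collectionAccumulator[String]("collected")
spark.sparkContext.parallelize(Seq("a", "b", "c")).foreach(x => acc.add(x))
println(acc.value) // a java.util.List[String] holding the collected values
```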
      
      ## How was this patch tested?
      
      Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13187 from HyukjinKwon/SPARK-15322.
• [SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory. · faafd1e9
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
After #12871 was fixed, we are forced to create `/user/hive/warehouse` when SimpleAnalyzer is used, but SimpleAnalyzer may not need the directory.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #13175 from sarutak/SPARK-15387.
• [SPARK-15300] Fix writer lock conflict when remove a block · ad182086
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
A writer lock can be acquired when we 1) create a new block, 2) remove a block, or 3) evict a block to disk. 1) and 3) can happen at the same time within the same task, and all of them can happen at the same time outside a task. It is OK for someone to try to grab the write lock for a block while it is held by another holder with the same task attempt id.

This PR removes the check.
      
      ## How was this patch tested?
      
      Updated existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13082 from davies/write_lock_conflict.
• [SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog · ef7a5e0b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385
      
      The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135
      
      For example, in PySpark, if we input the following statement:
      ```python
      >>> l = [('Alice', 1)]
      >>> df = sqlContext.createDataFrame(l)
      >>> df.createTempView("people")
      >>> df.createTempView("people")
      ```
Before this PR, the exception we get looks like:
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
          self._jdf.createTempView(name)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
      : org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
          at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
          at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
          at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
          at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
          at py4j.Gateway.invoke(Gateway.java:280)
          at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
          at py4j.commands.CallCommand.execute(CallCommand.java:79)
          at py4j.GatewayConnection.run(GatewayConnection.java:211)
          at java.lang.Thread.run(Thread.java:745)
      ```
After this PR, the exception we get becomes cleaner:
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
          self._jdf.createTempView(name)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
          raise AnalysisException(s.split(': ', 1)[1], stackTrace)
      pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
      ```
      
      #### How was this patch tested?
      Fixed an existing PySpark test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13126 from gatorsmile/followup-14684.
• [SPARK-15390] fix broadcast with 100 millions rows · 9308bf11
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
When broadcasting a table with more than 100 million rows (which ideally should not happen), the size of the needed memory will overflow.

This PR fixes the overflow by converting to Long when calculating the size of memory.

It also adds more checks in broadcast to show reasonable messages.
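The overflow itself is easy to sketch with plain arithmetic (row count and per-row size are illustrative):

```scala
val numRows = 150000000 // 150 million rows
val bytesPerRow = 48

// Int arithmetic silently wraps past 2^31 - 1...
println(numRows * bytesPerRow)        // a negative, meaningless value
// ...while widening to Long first gives the intended size.
println(numRows.toLong * bytesPerRow) // 7200000000
```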
      
      ## How was this patch tested?
      
      Add test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13182 from davies/fix_broadcast.
• [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local · 31f63ac2
      Pravin Gadakh authored
      ## What changes were proposed in this pull request?
      
This PR adds `@Since` annotations in `Vectors.scala` and `Matrices.scala` of spark-mllib-local.
      
      ## How was this patch tested?
      
      Scala Style Checks.
      
      Author: Pravin Gadakh <prgadakh@in.ibm.com>
      
      Closes #13191 from pravingadakh/SPARK-14613.
• [SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification · 8ecf7f77
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
Audit the Scala API for classification; almost all issues in this section relate to ```MultilayerPerceptronClassifier```:
* Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
* Add missing setter functions for ```solver``` and ```stepSize``` (see the sketch below).
* Make the ```GD``` solver take effect.
* Update docs, annotations and fix other minor issues.
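A hedged sketch of the corrected surface (layer sizes illustrative):

```scala
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 5, 3)) // input, hidden, and output layer sizes
  .setSolver("gd")           // the GD solver now actually takes effect
  .setStepSize(0.03)         // newly exposed setter
println(mlp.getSolver)       // the getter formerly misnamed getOptimizer
```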
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13076 from yanboliang/spark-15292.
• [SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible · 1052d364
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
[SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. The ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.
      
      ## How was this patch tested?
Since ```save/load``` was ```Experimental``` in 1.6, I think offline testing for backwards compatibility is enough.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13149 from yanboliang/spark-15362.
• [CORE][MINOR] Remove redundant set master in OutputCommitCoordinatorIntegrationSuite · 3facca51
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
Remove the redundant master setting in OutputCommitCoordinatorIntegrationSuite, as we already set it on the SparkContext below on line 43.
      
      ## How was this patch tested?
      existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13168 from techaddict/minor-1.
• [SPARK-14939][SQL] Add FoldablePropagation optimizer · 5907ebfc
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR aims to add a new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of the original foldable expressions. Other optimizations can then take advantage of the propagated foldable expressions: e.g. the `EliminateSorts` optimizer can now handle Cases 2 and 3 below. (Case 1 is the previous implementation.)
      
      1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
      2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
      3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"
      
This PR has been generalized several times based on cloud-fan's key ideas; he should be credited for the work he did.
      
      **Before**
      ```
      scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
      == Physical Plan ==
      WholeStageCodegen
      :  +- Sort [1.0#5 ASC,x#0 ASC], true, 0
      :     +- INPUT
      +- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
         +- WholeStageCodegen
            :  +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
            :     +- INPUT
            +- Scan OneRowRelation[]
      ```
      
      **After**
      ```
      scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
      :     +- INPUT
      +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12719 from dongjoon-hyun/SPARK-14939.
• [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession · e2ec32da
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR noted that both examples below:

- `simple_params_example.py`
- `aft_survival_regression.py`

were not changed because they did not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` was not yet.

This PR corrects the example and makes it use SparkSession.

In more detail, it seems `threshold` was replaced by `thresholds` here and there by https://github.com/apache/spark/commit/5a23213c148bfe362514f9c71f5273ebda0a848a. However, when `lr.fit(training, paramMap)` is called, this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).
      
According to the comment at https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61, this is not allowed.
      
      So, in this PR, it sets the equivalent value so that this does not throw an exception.
      
      ## How was this patch tested?
      
Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13135 from HyukjinKwon/SPARK-15031.
  3. May 18, 2016
• [SPARK-15381] [SQL] physical object operator should define reference correctly · 661c2104
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
Whole Stage Codegen depends on `SparkPlan.reference` to do some optimizations. Physical object operators should be consistent with their logical versions and set the `reference` correctly.
      
      ## How was this patch tested?
      
      new test in DatasetSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13167 from cloud-fan/bug.
• [SPARK-15395][CORE] Use getHostString to create RpcAddress · 5c9117a3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
Right now the netty RPC uses `InetSocketAddress.getHostName` to create the `RpcAddress` for network events. If we use an IP address to connect, then the RpcAddress's host will be a host name (if the reverse lookup succeeds) instead of the IP address. However, some places need to compare the original IP address with the RpcAddress in `onDisconnect` (e.g., CoarseGrainedExecutorBackend), and this behavior makes the check incorrect.
      
      This PR uses `getHostString` to resolve the issue.
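The JDK-level difference is easy to demonstrate (address illustrative):

```scala
import java.net.InetSocketAddress

val addr = new InetSocketAddress("127.0.0.1", 8080)
// getHostName may perform a reverse DNS lookup and return a name...
println(addr.getHostName)   // e.g. "localhost" if the lookup succeeds
// ...while getHostString returns the literal string the address was
// created with, keeping the IP address intact.
println(addr.getHostString) // "127.0.0.1"
```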
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13185 from zsxwing/host-string.
• [DOC][MINOR] ml.feature Scala and Python API sync · b1bc5ebd
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.
      
      ## How was this patch tested?
      
      Built docs locally, ran style checks
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13159 from BryanCutler/ml.feature-api-sync.
• [SPARK-14463][SQL] Document the semantics for read.text · 4987f39a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13184 from rxin/SPARK-14463.
• [SPARK-15297][SQL] Fix Set -V Command · 9c2a376e
      gatorsmile authored
      #### What changes were proposed in this pull request?
The command `SET -v` always outputs the default values even if we have set the parameter. This behavior is incorrect: if users override a parameter, we should output the user-specified value.

In addition, the output schema of `SET -v` is wrong. We should use the column name `value` instead of `default` for the parameter value.
      
      This PR is to fix the above two issues.
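A hedged sketch of the expected behavior after the fix (property and values illustrative; `spark` is an active session):

```scala
spark.sql("SET spark.sql.shuffle.partitions=10")

// After the fix, SET -v reports the user-specified value (10) rather than
// the default (200), under a column named `value` instead of `default`.
spark.sql("SET -v")
  .filter("key = 'spark.sql.shuffle.partitions'")
  .show(truncate = false)
```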
      
      #### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13081 from gatorsmile/setVcommand.