  1. May 19, 2016
    • DB Tsai's avatar
      [SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala · 5255e55c
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Add since to ml.stat.MultivariateOnlineSummarizer.scala
      
      ## How was this patch tested?
      
      unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #13197 from dbtsai/cleanup.
      5255e55c
    • Davies Liu's avatar
      [SPARK-15392][SQL] fix default value of size estimation of logical plan · 5ccecc07
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation, which is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
      
      This PR changes the default value to Long.MaxValue.
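
      A minimal illustration of the intent (made-up threshold value, not the actual SQLConf code): the default estimate must never fall under the broadcast threshold.
      ```scala
      // Hypothetical sketch: a plan with an unknown size should never be mistaken
      // for one small enough to auto-broadcast.
      val autoBroadcastJoinThreshold = 10L * 1024 * 1024   // example threshold: 10 MB
      val oldDefaultSize = autoBroadcastJoinThreshold + 1L // could be undercut by a narrow schema
      val newDefaultSize = Long.MaxValue                   // always above any broadcast threshold
      assert(newDefaultSize > autoBroadcastJoinThreshold)
      ```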
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13183 from davies/fix_default_size.
      5ccecc07
    • Shixiong Zhu's avatar
      [SPARK-15317][CORE] Don't store accumulators for every task in listeners · 4e3cb7a5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      In general, the Web UI doesn't need to store the Accumulator/AccumulableInfo for every task. It only needs the Accumulator values.
      
      In this PR, it creates new UIData classes to store the necessary fields and makes `JobProgressListener` store only these new classes, so that `JobProgressListener` won't store Accumulator/AccumulableInfo and the size of `JobProgressListener` becomes pretty small. I also eliminate `AccumulableInfo` from `SQLListener` so that we don't keep any references to those unused `AccumulableInfo`s.
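
      A minimal sketch of the idea, with hypothetical class and field names rather than the actual UIData classes:
      ```scala
      // Instead of retaining full Accumulator/AccumulableInfo objects per task,
      // the listener keeps only the small set of fields the Web UI actually renders.
      case class UIAccumulatorInfo(id: Long, name: Option[String], value: String)
      
      case class UITaskData(
          taskId: Long,
          accumulatorUpdates: Seq[UIAccumulatorInfo])  // values only, no live references
      ```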
      
      ## How was this patch tested?
      
      I ran two tests reported in JIRA locally:
      
      The first one is:
      ```
      val data = spark.range(0, 10000, 1, 10000)
      data.cache().count()
      ```
      The retained size of JobProgressListener decreases from 60.7M to 6.9M.
      
      The second one is:
      ```
      import org.apache.spark.ml.CC
      import org.apache.spark.sql.SQLContext
      val sqlContext = SQLContext.getOrCreate(sc)
      CC.runTest(sqlContext)
      ```
      
      This test won't cause OOM after applying this patch.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13153 from zsxwing/memory.
      4e3cb7a5
    • Cheng Lian's avatar
      [SPARK-14346][SQL] Lists unsupported Hive features in SHOW CREATE TABLE output · 6ac1c3a0
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR is a follow-up of #13079. It replaces `hasUnsupportedFeatures: Boolean` in `CatalogTable` with `unsupportedFeatures: Seq[String]`, which contains unsupported Hive features of the underlying Hive table. In this way, we can accurately report all unsupported Hive features in the exception message.
      
      ## How was this patch tested?
      
      Updated existing test case to check exception message.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13173 from liancheng/spark-14346-follow-up.
      6ac1c3a0
    • Holden Karau's avatar
      [SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression · e71cd96b
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list
      
      ## How was this patch tested?
      
      doctests & built docs locally
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
      e71cd96b
    • hyukjinkwon's avatar
      [SPARK-15322][SQL][FOLLOW-UP] Update deprecated accumulator usage into accumulatorV2 · f5065abf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR corrects another case that uses the deprecated `accumulableCollection`, replacing it with `listAccumulator`; the previous PR seems to have missed it.
      
      Since `ArrayBuffer[InternalRow].asJava` is a `java.util.List[InternalRow]`, it seems OK to replace the usage.
      
      ## How was this patch tested?
      
      Related existing tests `InMemoryColumnarQuerySuite` and `CachedTableSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13187 from HyukjinKwon/SPARK-15322.
      f5065abf
    • Kousuke Saruta's avatar
      [SPARK-15387][SQL] SessionCatalog in SimpleAnalyzer does not need to make database directory. · faafd1e9
      Kousuke Saruta authored
      ## What changes were proposed in this pull request?
      
      After #12871 is fixed, we are forced to create `/user/hive/warehouse` when SimpleAnalyzer is used, but SimpleAnalyzer may not need the directory.
      
      ## How was this patch tested?
      
      Manual test.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #13175 from sarutak/SPARK-15387.
      faafd1e9
    • Davies Liu's avatar
      [SPARK-15300] Fix writer lock conflict when remove a block · ad182086
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      A writer lock could be acquired when 1) creating a new block, 2) removing a block, or 3) evicting a block to disk. 1) and 3) could happen at the same time within the same task, and all of them could happen at the same time outside a task. It's OK for someone to grab the write lock for a block even if the block is already held by another holder with the same task attempt id.
      
      This PR removes the check.
      
      ## How was this patch tested?
      
      Updated existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13082 from davies/write_lock_conflict.
      ad182086
    • gatorsmile's avatar
      [SPARK-14603][SQL][FOLLOWUP] Verification of Metadata Operations by Session Catalog · ef7a5e0b
      gatorsmile authored
      #### What changes were proposed in this pull request?
      This follow-up PR is to address the remaining comments in https://github.com/apache/spark/pull/12385
      
      The major change in this PR is to issue better error messages in PySpark by using the mechanism that was proposed by davies in https://github.com/apache/spark/pull/7135
      
      For example, in PySpark, if we input the following statement:
      ```python
      >>> l = [('Alice', 1)]
      >>> df = sqlContext.createDataFrame(l)
      >>> df.createTempView("people")
      >>> df.createTempView("people")
      ```
      Before this PR, the exception we get looks like this:
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
          self._jdf.createTempView(name)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 63, in deco
          return f(*a, **kw)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o35.createTempView.
      : org.apache.spark.sql.catalyst.analysis.TempTableAlreadyExistsException: Temporary table 'people' already exists;
          at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTempView(SessionCatalog.scala:324)
          at org.apache.spark.sql.SparkSession.createTempView(SparkSession.scala:523)
          at org.apache.spark.sql.Dataset.createTempView(Dataset.scala:2328)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:606)
          at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
          at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
          at py4j.Gateway.invoke(Gateway.java:280)
          at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
          at py4j.commands.CallCommand.execute(CallCommand.java:79)
          at py4j.GatewayConnection.run(GatewayConnection.java:211)
          at java.lang.Thread.run(Thread.java:745)
      ```
      After this PR, the exception we get becomes cleaner:
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/dataframe.py", line 152, in createTempView
          self._jdf.createTempView(name)
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        File "/Users/xiaoli/IdeaProjects/sparkDelivery/python/pyspark/sql/utils.py", line 75, in deco
          raise AnalysisException(s.split(': ', 1)[1], stackTrace)
      pyspark.sql.utils.AnalysisException: u"Temporary table 'people' already exists;"
      ```
      
      #### How was this patch tested?
      Fixed an existing PySpark test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13126 from gatorsmile/followup-14684.
      ef7a5e0b
    • Davies Liu's avatar
      [SPARK-15390] fix broadcast with 100 millions rows · 9308bf11
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      When broadcasting a table with more than 100 million rows (which ideally should not be done), the computed size of needed memory will overflow.
      
      This PR fixes the overflow by converting to Long when calculating the size of memory.
      
      It also adds more checks in broadcast to show reasonable error messages.
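
      The overflow is the classic Int-multiplication wrap-around; a minimal illustration with made-up numbers (not the actual broadcast code):
      ```scala
      // The size computation must be done in Long, otherwise it wraps around for ~1e8 rows.
      val numRows = 100000000                      // 1e8 rows
      val bytesPerRow = 32                         // hypothetical per-row size
      val overflowed = numRows * bytesPerRow       // Int * Int wraps to a negative number
      val correct = numRows.toLong * bytesPerRow   // widen to Long before multiplying
      println(s"overflowed=$overflowed, correct=$correct")
      ```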
      
      ## How was this patch tested?
      
      Add test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13182 from davies/fix_broadcast.
      9308bf11
    • Pravin Gadakh's avatar
      [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local · 31f63ac2
      Pravin Gadakh authored
      ## What changes were proposed in this pull request?
      
      This PR adds `Since` annotations in `Vectors.scala` and `Matrices.scala` of spark-mllib-local.
      
      ## How was this patch tested?
      
      Scala Style Checks.
      
      Author: Pravin Gadakh <prgadakh@in.ibm.com>
      
      Closes #13191 from pravingadakh/SPARK-14613.
      31f63ac2
    • Yanbo Liang's avatar
      [SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification · 8ecf7f77
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Audit of the Scala API for classification; almost all issues in this section were related to ```MultilayerPerceptronClassifier```.
      * Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver```
      * Add missing setter functions for ```solver``` and ```stepSize```.
      * Make the ```GD``` solver take effect.
      * Update docs, annotations and fix other minor issues.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13076 from yanboliang/spark-15292.
      8ecf7f77
    • Yanbo Liang's avatar
      [SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible · 1052d364
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      [SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6.
      
      ## How was this patch tested?
      Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13149 from yanboliang/spark-15362.
      1052d364
    • Sandeep Singh's avatar
      [CORE][MINOR] Remove redundant set master in OutputCommitCoordinatorIntegrationSuite · 3facca51
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Remove the redundant master setting in OutputCommitCoordinatorIntegrationSuite, as we are already setting it in the SparkContext below on line 43.
      
      ## How was this patch tested?
      existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13168 from techaddict/minor-1.
      3facca51
    • Dongjoon Hyun's avatar
      [SPARK-14939][SQL] Add FoldablePropagation optimizer · 5907ebfc
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to add a new **FoldablePropagation** optimizer that propagates foldable expressions by replacing all attributes with the aliases of the original foldable expressions. Other optimizations can then take advantage of the propagated foldable expressions: e.g. the `EliminateSorts` optimizer can now handle the following Cases 2 and 3. (Case 1 is the previous implementation.)
      
      1. Literals and foldable expression, e.g. "ORDER BY 1.0, 'abc', Now()"
      2. Foldable ordinals, e.g. "SELECT 1.0, 'abc', Now() ORDER BY 1, 2, 3"
      3. Foldable aliases, e.g. "SELECT 1.0 x, 'abc' y, Now() z ORDER BY x, y, z"
      
      This PR has been generalized many times based on cloud-fan's key ideas; he should be credited for the work he did.
      
      **Before**
      ```
      scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
      == Physical Plan ==
      WholeStageCodegen
      :  +- Sort [1.0#5 ASC,x#0 ASC], true, 0
      :     +- INPUT
      +- Exchange rangepartitioning(1.0#5 ASC, x#0 ASC, 200), None
         +- WholeStageCodegen
            :  +- Project [1.0 AS 1.0#5,1461873043577000 AS x#0]
            :     +- INPUT
            +- Scan OneRowRelation[]
      ```
      
      **After**
      ```
      scala> sql("SELECT 1.0, Now() x ORDER BY 1, x").explain
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [1.0 AS 1.0#5,1461873079484000 AS x#0]
      :     +- INPUT
      +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests including a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12719 from dongjoon-hyun/SPARK-14939.
      5907ebfc
    • hyukjinkwon's avatar
      [SPARK-15031][EXAMPLES][FOLLOW-UP] Make Python param example working with SparkSession · e2ec32da
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems most of the Python examples were changed to use SparkSession by https://github.com/apache/spark/pull/12809. That PR said both examples below:
      
      - `simple_params_example.py`
      - `aft_survival_regression.py`
      
      were not changed because they do not work. It seems `aft_survival_regression.py` was changed by https://github.com/apache/spark/pull/13050 but `simple_params_example.py` is not yet.
      
      This PR corrects the example and makes it use SparkSession.
      
      In more detail, it seems `threshold` was replaced with `thresholds` here and there by https://github.com/apache/spark/commit/5a23213c148bfe362514f9c71f5273ebda0a848a. However, when `lr.fit(training, paramMap)` is called, this overwrites the values. So, `threshold` was 5 and `thresholds` becomes 5.5 (by `1 / (1 + thresholds(0) / thresholds(1))`).
      
      According to the comment below, this is not allowed: https://github.com/apache/spark/blob/354f8f11bd4b20fa99bd67a98da3525fd3d75c81/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L58-L61.
      
      So, in this PR, it sets the equivalent value so that this does not throw an exception.
      
      ## How was this patch tested?
      
      Manually (`mvn package -DskipTests && spark-submit simple_params_example.py`)
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #13135 from HyukjinKwon/SPARK-15031.
      e2ec32da
  2. May 18, 2016
    • Wenchen Fan's avatar
      [SPARK-15381] [SQL] physical object operator should define reference correctly · 661c2104
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Whole Stage Codegen depends on `SparkPlan.reference` to do some optimization. For physical object operators, they should be consistent with their logical version and set the `reference` correctly.
      
      ## How was this patch tested?
      
      new test in DatasetSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13167 from cloud-fan/bug.
      661c2104
    • Shixiong Zhu's avatar
      [SPARK-15395][CORE] Use getHostString to create RpcAddress · 5c9117a3
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Right now the Netty RPC uses `InetSocketAddress.getHostName` to create `RpcAddress` for network events. If we use an IP address to connect, then the RpcAddress's host will be a host name (if the reverse lookup succeeds) instead of the IP address. However, some places need to compare the original IP address and the RpcAddress in `onDisconnect` (e.g., CoarseGrainedExecutorBackend), and this behavior will make the check incorrect.
      
      This PR uses `getHostString` to resolve the issue.
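
      A small sketch of the difference between the two accessors (standard `java.net` behavior, not Spark code; the address below is a made-up example):
      ```scala
      import java.net.InetSocketAddress
      
      // getHostName may perform a reverse DNS lookup and return a host name for an
      // address we connected to by IP, while getHostString returns the original
      // literal, so comparisons against the original IP stay consistent.
      val addr = new InetSocketAddress("192.168.1.10", 7077)
      val hostName = addr.getHostName     // may become a resolved name if reverse lookup succeeds
      val hostString = addr.getHostString // stays "192.168.1.10"
      println(s"getHostName=$hostName, getHostString=$hostString")
      ```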
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13185 from zsxwing/host-string.
      5c9117a3
    • Bryan Cutler's avatar
      [DOC][MINOR] ml.feature Scala and Python API sync · b1bc5ebd
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      I reviewed Scala and Python APIs for ml.feature and corrected discrepancies.
      
      ## How was this patch tested?
      
      Built docs locally, ran style checks
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13159 from BryanCutler/ml.feature-api-sync.
      b1bc5ebd
    • Reynold Xin's avatar
      [SPARK-14463][SQL] Document the semantics for read.text · 4987f39a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch is a follow-up to https://github.com/apache/spark/pull/13104 and adds documentation to clarify the semantics of read.text with respect to partitioning.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13184 from rxin/SPARK-14463.
      4987f39a
    • gatorsmile's avatar
      [SPARK-15297][SQL] Fix Set -V Command · 9c2a376e
      gatorsmile authored
      #### What changes were proposed in this pull request?
      The command `SET -v` always outputs the default values even if we set the parameter. This behavior is incorrect. Instead, if users override it, we should output the user-specified value.
      
      In addition, the output schema of `SET -v` is wrong. We should use the column `value` instead of `default` for the parameter value.
      
      This PR is to fix the above two issues.
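
      A hedged illustration of the intended behavior (assuming a SparkSession named `spark`, and the output column names described above):
      ```scala
      // After overriding a conf, SET -v should report the overridden value in a
      // column named `value`, not the built-in default.
      spark.conf.set("spark.sql.shuffle.partitions", "10")
      spark.sql("SET -v")
        .filter("key = 'spark.sql.shuffle.partitions'")
        .show(truncate = false)   // expected: the value column shows 10, not the default
      ```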
      
      #### How was this patch tested?
      Added a test case.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #13081 from gatorsmile/setVcommand.
      9c2a376e
    • Wenchen Fan's avatar
      [SPARK-15192][SQL] null check for SparkSession.createDataFrame · ebfe3a1f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR adds a null check in `SparkSession.createDataFrame`, so that we can make sure the passed-in rows match the given schema.
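
      A hedged sketch of the situation the check targets (assuming a SparkSession named `spark`):
      ```scala
      import org.apache.spark.sql.Row
      import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
      
      // The schema declares `id` as non-nullable, but one of the rows carries a null.
      // With the added null check, this is expected to fail with a clear error instead
      // of silently producing a corrupt DataFrame.
      val schema = StructType(Seq(StructField("id", IntegerType, nullable = false)))
      val rows = spark.sparkContext.parallelize(Seq(Row(1), Row(null)))
      val df = spark.createDataFrame(rows, schema)
      // df.collect()   // expected to raise an error about the null value
      ```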
      
      ## How was this patch tested?
      
      new tests in `DatasetSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13008 from cloud-fan/row-encoder.
      ebfe3a1f
    • Jurriaan Pruis's avatar
      [SPARK-15323][SPARK-14463][SQL] Fix reading of partitioned format=text datasets · 32be51fb
      Jurriaan Pruis authored
      https://issues.apache.org/jira/browse/SPARK-15323
      
      I was using partitioned text datasets in Spark 1.6.1 but it broke in Spark 2.0.0.
      
      It would be logical if you could also write those, but I'm not entirely sure how to solve this with the new DataSet implementation.
      
      Also it doesn't work using `sqlContext.read.text`, since that method returns a `DataSet[String]`.
      See https://issues.apache.org/jira/browse/SPARK-14463 for that issue.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13104 from jurriaan/fix-partitioned-text-reads.
      32be51fb
    • Davies Liu's avatar
      84b23453
    • Davies Liu's avatar
      [SPARK-15392][SQL] fix default value of size estimation of logical plan · fc29b896
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      We use autoBroadcastJoinThreshold + 1L as the default value of the size estimation, which is not good in 2.0, because we now calculate the size based on the size of the schema, so the estimation could be less than autoBroadcastJoinThreshold if you have a SELECT on top of a DataFrame created from an RDD.
      
      This PR changes the default value to Long.MaxValue.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13179 from davies/fix_default_size.
      fc29b896
    • Dongjoon Hyun's avatar
      [SPARK-15373][WEB UI] Spark UI should show consistent timezones. · cc6a47dd
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the Spark UI shows two timezones on a single page when the timezone of the browser is different from the server JVM timezone. The following is an example on Databricks CE, which uses the 'Etc/UTC' timezone.
      
      - The time of `submitted` column of list and pop-up description shows `2016/05/18 00:03:07`
      - The time of `timeline chart` shows `2016/05/17 17:03:07`.
      
      ![Different Timezone](https://issues.apache.org/jira/secure/attachment/12804553/12804553_timezone.png)
      
      This PR fixes the **timeline chart** to use the same timezone with the following changes.
      - Upgrade `vis` from 3.9.0(2015-01-16)  to 4.16.1(2016-04-18)
      - Override `moment` of `vis` to get `offset`
      - Update `AllJobsPage`, `JobPage`, and `StagePage`.
      
      ## How was this patch tested?
      
      Manual. Run the following command and see the Spark UI's event timelines.
      
      ```
      $ SPARK_SUBMIT_OPTS="-Dscala.usejavacp=true -Duser.timezone=Etc/UTC" bin/spark-submit --class org.apache.spark.repl.Main
      ...
      scala> sql("select 1").head
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13158 from dongjoon-hyun/SPARK-15373.
      cc6a47dd
    • Sean Owen's avatar
      [SPARK-15386][CORE] Master doesn't compile against Java 1.7 / Process.isAlive · 4768d037
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Remove call to Process.isAlive -- Java 8 only. Introduced in https://github.com/apache/spark/pull/13042 / SPARK-15263
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13174 from srowen/SPARK-15386.
      4768d037
    • Nick Pentreath's avatar
      [SPARK-14891][ML] Add schema validation for ALS · e8b79afa
      Nick Pentreath authored
      This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation is performed as `transformSchema` is never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to the lack of schema validation, if users passed in Long (or Float, etc.) ids, they would be silently cast to Int with no warning or error thrown.
      
      With this PR, ALS now supports all numeric types for the `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently); however, for user/item, the cast throws an error if the value is outside integer range. Behavior for the rating col is unchanged (as it is not an issue).
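
      A hedged usage sketch of the new behavior (assuming a SparkSession named `spark` and spark.ml on the classpath; data values are made up):
      ```scala
      import org.apache.spark.ml.recommendation.ALS
      
      // Long ids are now accepted and cast to Int as long as they fit in Int range;
      // out-of-range ids raise an error instead of being silently truncated.
      val ratings = spark.createDataFrame(Seq(
        (1L, 10L, 4.0f),
        (2L, 20L, 3.0f)
      )).toDF("user", "item", "rating")
      
      val als = new ALS()
        .setUserCol("user")
        .setItemCol("item")
        .setRatingCol("rating")
      val model = als.fit(ratings)
      ```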
      
      ## How was this patch tested?
      New test cases in `ALSSuite`.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
      e8b79afa
    • Liang-Chi Hsieh's avatar
      [SPARK-15342] [SQL] [PYSPARK] PySpark test for non ascii column name does not... · 3d1e67f9
      Liang-Chi Hsieh authored
      [SPARK-15342] [SQL] [PYSPARK] PySpark test for non ascii column name does not actually test with unicode column name
      
      ## What changes were proposed in this pull request?
      
      The PySpark SQL test `test_column_name_with_non_ascii` wants to test a non-ascii column name. But it doesn't actually test it. We need to construct a unicode string explicitly using `unicode` under Python 2.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13134 from viirya/correct-non-ascii-colname-pytest.
      3d1e67f9
    • Davies Liu's avatar
      [SPARK-15357] Cooperative spilling should check consumer memory mode · 8fb1d1c7
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Since we support forced spilling for Spillable, which only works in OnHeap mode, unlike other SQL operators (which could be OnHeap or OffHeap), we should consider the memory mode of the consumer before triggering forced spilling.
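
      A minimal sketch of the guard this describes, using hypothetical types rather than Spark's actual MemoryConsumer API:
      ```scala
      // Only ask a consumer to force-spill when it operates in the same memory mode
      // as the request; Spillable-based consumers only support on-heap memory.
      sealed trait MemoryMode
      case object OnHeap extends MemoryMode
      case object OffHeap extends MemoryMode
      
      trait Consumer {
        def mode: MemoryMode
        def spill(requestedBytes: Long): Long
      }
      
      def tryForceSpill(consumer: Consumer, requestMode: MemoryMode, bytes: Long): Long =
        if (consumer.mode == requestMode) consumer.spill(bytes) else 0L  // skip mismatched modes
      ```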
      
      ## How was this patch tested?
      
      Add new test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13151 from davies/fix_mode.
      8fb1d1c7
    • Tejas Patil's avatar
      [SPARK-15263][CORE] Make shuffle service dir cleanup faster by using `rm -rf` · c1fd9cac
      Tejas Patil authored
      ## What changes were proposed in this pull request?
      
      Jira: https://issues.apache.org/jira/browse/SPARK-15263
      
      The current logic for directory cleanup is slow because it does a directory listing, recurses over child directories, checks for symbolic links, deletes leaf files, and finally deletes the dirs when they are empty. There is back-and-forth switching from kernel space to user space while doing this. Since most of the deployment backends would be Unix systems, we could essentially just do `rm -rf` so that the entire deletion logic runs in kernel space.
      
      The current Java based impl in Spark seems to be similar to what standard libraries like guava and commons IO do (eg. http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540). However, guava removed this method in favour of shelling out to an operating system command (like in this PR). See the `Deprecated` note in older javadocs for guava for details : http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File)
      
      Ideally, Java should provide such APIs so that users won't have to do such things to get platform-specific code. Also, it's not just about speed: handling race conditions while doing FS deletions is also tricky. I could find this bug for Java in a similar context: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952
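
      A hedged sketch of shelling out for the deletion (not the actual Spark change; assumes a Unix-like system with `rm` on the PATH):
      ```scala
      import java.io.{File, IOException}
      import scala.sys.process._
      
      // Delegate recursive deletion to the OS instead of walking the tree in the JVM.
      def deleteRecursivelyViaRm(dir: File): Unit = {
        val exitCode = Seq("rm", "-rf", dir.getAbsolutePath).!
        if (exitCode != 0) {
          throw new IOException(s"rm -rf failed for $dir with exit code $exitCode")
        }
      }
      ```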
      
      ## How was this patch tested?
      
      I am relying on existing test cases to test the method. If there are suggestions about testing it, I'd welcome them.
      
      ## Performance gains
      
      *Input setup*: Created a nested directory structure of depth 3, with each entry having 50 sub-dirs. The input being cleaned up had a total of ~125k dirs.
      
      Ran both approaches (in isolation) 6 times to get average numbers:
      
      Native Java cleanup  | `rm -rf` as a separate process
      ------------ | -------------
      10.04 sec | 4.11 sec
      
      This change made deletion 2.4 times faster for the given test input.
      
      Author: Tejas Patil <tejasp@fb.com>
      
      Closes #13042 from tejasapatil/delete_recursive.
      c1fd9cac
    • DLucky's avatar
      [SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points · 420b7006
      DLucky authored
      mateiz srowen
      
      I state that the contribution is my original work and that I license the work to the project under the project's open source license
      
      There were some format problems with my last PR; with HyukjinKwon's help I read the guidance, re-checked my code and PR, ran the tests, and finally re-submitted the PR here.
      
      The related JIRA issue is marked as resolved, but I think this change may still relate to it.
      
      ## Proposed Change
      
      After picking each new initial center, it's unnecessary to recompute the distances between all the points and the previously chosen centers.
      Instead, this change keeps the distance between each point and its closest center, compares it with the distance to the new center, and updates it accordingly.
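
      A minimal sketch of the bookkeeping (hypothetical helper, not the actual MLlib code):
      ```scala
      // Keep each point's squared distance to its current closest center; when a new
      // center is picked, only compare against that one center and take the minimum,
      // instead of recomputing distances to every previously chosen center.
      def updateClosestDistances(
          points: Array[Array[Double]],
          closestDist: Array[Double],
          newCenter: Array[Double]): Unit = {
        def squaredDist(a: Array[Double], b: Array[Double]): Double = {
          var sum = 0.0
          var i = 0
          while (i < a.length) { val d = a(i) - b(i); sum += d * d; i += 1 }
          sum
        }
        var i = 0
        while (i < points.length) {
          closestDist(i) = math.min(closestDist(i), squaredDist(points(i), newCenter))
          i += 1
        }
      }
      ```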
      
      ## Test result
      
      An easy way to test this can be found in https://issues.apache.org/jira/browse/SPARK-6706.
      
      I tested the KMeans++ method on a small dataset with 16k points, and the whole KMeans|| on a large one with 240k points.
      The data has 4096 features, and I tuned K from 100 to 500.
      The test environment was my 4-machine cluster; I also tested a 3M-point dataset on a larger cluster with 25 machines and got similar results, for which I do not plot the detailed curve. The results of the first two experiments are shown below.
      
      ### Local KMeans++ test:
      
      Dataset:4m_ini_center
      Data_size:16234
      Dimension:4096
      
      Lloyd's Iteration = 10
      The y-axis is time in sec, the x-axis is the value of K.
      
      ![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)
      
      ![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)
      
      ### On a larger dataset
      
      An improvement is shown in the graph but is not committed in this file: in this experiment I also made an improvement for calculations on normalized data (the distance is converted to the cosine distance). If the data is normalized into (0,1), one optimization in the original version of util.MLUtils.fastSquaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never be less than precision in this case). Therefore I designed an early-termination method for comparing two distances (used for findClosest). But I don't include this improvement in this file; you may refer only to the curves without "normalize" when comparing the results.
      
      Dataset:4k24
      Data_size:243960
      Dimension:4096
      
      Normalize | Enlarge | Initialize | Lloyd's Iteration
      ------------ | ------------- | ------------- | -------------
      NO | 1 | 3 | 5
      YES | 10000 | 3 | 5
      
      Notice: the normalized data is enlarged to ensure precision.
      
      The time cost: the x-axis is the value of K, the y-axis is time in sec.
      ![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)
      
      SE for unnormalized data between the two versions, to ensure correctness
      ![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)
      
      Here is the SE for normalized data, just for reference; it's also correct.
      ![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)
      
      Author: DLucky <mouendless@gmail.com>
      
      Closes #13133 from mouendless/patch-2.
      420b7006
    • Cheng Lian's avatar
      [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10 · c4a45fd8
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR fixes a Scala 2.10 compilation failure introduced in PR #13127.
      
      ## How was this patch tested?
      
      Jenkins build.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #13166 from liancheng/hotfix-for-scala-2.10.
      c4a45fd8
    • Dongjoon Hyun's avatar
      [MINOR][SQL] Remove unused pattern matching variables in Optimizers. · d2f81df1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR removes unused pattern matching variables in Optimizers in order to improve readability.
      
      ## How was this patch tested?
      
      Pass the existing Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.
      d2f81df1
    • WeichenXu's avatar
      [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into... · 2f9047b5
      WeichenXu authored
      [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
      
      ## What changes were proposed in this pull request?
      
      I used IntelliJ IDEA to search for usages of the deprecated SparkContext.accumulator in the whole Spark project and updated the code (except the test code for the accumulator method itself).
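
      A hedged before/after sketch of the migration (assuming a SparkContext named `sc`):
      ```scala
      // Deprecated in 2.0:  val acc = sc.accumulator(0)
      // AccumulatorV2 replacement:
      val counter = sc.longAccumulator("counter")
      sc.parallelize(1 to 10).foreach(_ => counter.add(1))
      println(counter.value)   // 10
      ```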
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
      2f9047b5
    • Davies Liu's avatar
      [SPARK-15307][SQL] speed up listing files for data source · 33814f88
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, listing files is very slow if there are thousands of files, especially on the local file system, because:
      1) FileStatus.getPermission() is very slow on the local file system, since it launches a subprocess and parses the stdout.
      2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).
      
      This PR improves these by:
      1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission; the permissions are not used for data sources.
      2) Only creating a JobConf once within one task (a sketch of this follows below).
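
      A hedged sketch of the second optimization (hypothetical structure, not the actual data source code):
      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}
      import org.apache.hadoop.mapred.JobConf
      
      // Build the expensive JobConf lazily and reuse it for every path handled by
      // this call, instead of constructing a new one per path.
      def listPathsInTask(paths: Seq[String], hadoopConf: Configuration): Seq[Path] = {
        lazy val jobConf = new JobConf(hadoopConf)   // created at most once per call
        paths.flatMap { p =>
          val path = new Path(p)
          val fs = path.getFileSystem(jobConf)       // same JobConf reused for each path
          fs.listStatus(path).map(_.getPath).toSeq
        }
      }
      ```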
      
      ## How was this patch tested?
      
      Manual tests on a partitioned table with 1828 partitions decreased the time to load the table from 22 seconds to 1.6 seconds (most of the time is now spent in merging schemas).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13094 from davies/listing.
      33814f88
    • Sean Zhong's avatar
      [SPARK-15334][SQL] HiveClient facade not compatible with Hive 0.12 · 6e02aec4
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      HiveClient facade is not compatible with Hive 0.12.
      
      This PR fixes the following compatibility issues:
      1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`.
      2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12.
      3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction`, `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`, and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`
      4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0.
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13127 from clockfly/versionSuite.
      6e02aec4
    • Takuya Kuwahara's avatar
      [SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics · 411c04ad
      Takuya Kuwahara authored
      ## What changes were proposed in this pull request?
      
      This pull request adds support for validationMetrics to TrainValidationSplitModel in Python, along with a test for it.
      
      ## How was this patch tested?
      
      test in `python/pyspark/ml/tests.py`
      
      Author: Takuya Kuwahara <taakuu19@gmail.com>
      
      Closes #12767 from taku-k/spark-14978.
      411c04ad
  3. May 17, 2016
    • Yin Huai's avatar
      [SPARK-14346] Fix scala-2.10 build · 2a5db9c1
      Yin Huai authored
      ## What changes were proposed in this pull request?
      The Scala 2.10 build was broken by #13079. I am reverting the change to that line.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13157 from yhuai/SPARK-14346-fix-scala2.10.
      2a5db9c1