May 11, 2016
• [SPARK-15072][SQL][PYSPARK][HOT-FIX] Remove SparkSession.withHiveSupport from readwrite.py · ba5487c0
      Yin Huai authored
      ## What changes were proposed in this pull request?
It seems that https://github.com/apache/spark/commit/db573fc743d12446dd0421fb45d00c2f541eaf9a did not remove `withHiveSupport` from readwrite.py.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13069 from yhuai/fixPython.
      ba5487c0
• [SPARK-14346] SHOW CREATE TABLE for data source tables · f036dd7c
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).
      
      To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.
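For illustration, a minimal PySpark sketch of the new command (the table name is hypothetical, and `spark` is the session predefined in the PySpark shell):

```python
# Create a simple data source table, then print the DDL that recreates it.
spark.sql("CREATE TABLE points (x INT, y INT) USING parquet")
spark.sql("SHOW CREATE TABLE points").show(truncate=False)
```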
      
      ## How was this patch tested?
      
      A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12781 from liancheng/spark-14346-show-create-table.
      f036dd7c
• [SPARK-15080][CORE] Break copyAndReset into copy and reset · ff92eb2e
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
Break `copyAndReset` into two methods, `copy` and `reset`, instead of a single combined method.
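As an illustration only (the actual change is to Spark's Scala accumulator API, not PySpark), a toy Python sketch of the refactoring pattern, where the old combined operation is expressed in terms of the two new methods:

```python
class CounterAccumulator(object):
    """Toy accumulator, used only to illustrate splitting copyAndReset."""

    def __init__(self, value=0):
        self.value = value

    def copy(self):
        # Return a new accumulator holding the same value.
        return CounterAccumulator(self.value)

    def reset(self):
        # Zero out this accumulator in place.
        self.value = 0

    def copy_and_reset(self):
        # The old combined operation, now composed from copy + reset.
        new_acc = self.copy()
        new_acc.reset()
        return new_acc
```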
      
      ## How was this patch tested?
      Existing Tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12936 from techaddict/SPARK-15080.
      ff92eb2e
• [SPARK-15072][SQL][PYSPARK] FollowUp: Remove SparkSession.withHiveSupport in PySpark · db573fc7
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      This is a followup of https://github.com/apache/spark/pull/12851
Remove `SparkSession.withHiveSupport` in PySpark and instead use `SparkSession.builder.enableHiveSupport`.
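For reference, a minimal sketch of the replacement pattern in PySpark (the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Build a Hive-enabled session through the builder instead of the removed
# SparkSession.withHiveSupport helper.
spark = SparkSession.builder \
    .appName("hive-enabled-session") \
    .enableHiveSupport() \
    .getOrCreate()
```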
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13063 from techaddict/SPARK-15072-followup.
      db573fc7
• [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names · 603f4453
      Bill Chambers authored
      ## What changes were proposed in this pull request?
      
      When a CSV begins with:
      - `,,`
      OR
      - `"","",`
      
meaning that the first column names are either empty or blank strings and `header` is set to `true`, each such column name is replaced with `C` followed by the index of that column. For example, if you were to read in the CSV:
      ```
      "","second column"
      "hello", "there"
      ```
      Then column names would become `"C0", "second column"`.
      
      This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.
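A hedged sketch of the resulting behavior, assuming the two-line CSV above is saved as `blank_header.csv` and `spark` is the PySpark shell session:

```python
# The blank first header field is renamed to "C0", so it can be selected by name.
df = spark.read.csv("blank_header.csv", header=True)
print(df.columns)  # expected per this PR: ['C0', 'second column']
df.select("C0", "second column").show()
```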
      
      ### Current Behavior in Spark <=1.6
In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However, the CSV is read in without issue.
      
      ### Current Behavior in Spark 2.0
Spark throws a `NullPointerException` and will not read in the file.
      
      #### Reproduction in 2.0
      https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html
      
      ## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. We then assert that both the empty column names and the regular column names can be selected.
      
      Author: Bill Chambers <bill@databricks.com>
      Author: Bill Chambers <wchambers@ischool.berkeley.edu>
      
      Closes #13041 from anabranch/master.
      603f4453
• [SPARK-15276][SQL] CREATE TABLE with LOCATION should imply EXTERNAL · f14c4ba0
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Before:
      ```sql
      -- uses that location but issues a warning
CREATE TABLE my_tab LOCATION '/some/path'
      -- deletes any existing data in the specified location
      DROP TABLE my_tab
      ```
      
      After:
      ```sql
      -- uses that location but creates an EXTERNAL table instead
CREATE TABLE my_tab LOCATION '/some/path'
      -- does not delete the data at /some/path
      DROP TABLE my_tab
      ```
      
      This patch essentially makes the `EXTERNAL` field optional. This is related to #13032.
      
      ## How was this patch tested?
      
      New test in `DDLCommandSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13060 from andrewor14/location-implies-external.
      f14c4ba0
• [SPARK-15256] [SQL] [PySpark] Clarify DataFrameReader.jdbc() docstring · b9cf617a
      Nicholas Chammas authored
      This PR:
* Corrects the documentation for the `properties` parameter, which is supposed to be a dictionary and not a list (see the sketch after this list).
      * Generally clarifies the Python docstring for DataFrameReader.jdbc() by pulling from the [Scala docstrings](https://github.com/apache/spark/blob/b28137764716f56fa1a923c4278624a56364a505/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L201-L251) and rephrasing things.
      * Corrects minor Sphinx typos.
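A short usage sketch reflecting the clarified docstring; the connection URL, table name, and credentials are hypothetical, and the matching JDBC driver must be on the classpath:

```python
# `properties` is a dict of JDBC connection arguments, not a list.
df = spark.read.jdbc(
    url="jdbc:postgresql://db.example.com:5432/sales",
    table="public.orders",
    properties={"user": "report",
                "password": "secret",
                "driver": "org.postgresql.Driver"})
```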
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #13034 from nchammas/SPARK-15256.
      b9cf617a
• [SPARK-15257][SQL] Require CREATE EXTERNAL TABLE to specify LOCATION · 8881765a
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Before:
      ```sql
      -- uses warehouse dir anyway
      CREATE EXTERNAL TABLE my_tab
      -- doesn't actually delete the data
      DROP TABLE my_tab
      ```
      After:
      ```sql
      -- no location is provided, throws exception
      CREATE EXTERNAL TABLE my_tab
      -- creates an external table using that location
      CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something'
      -- doesn't delete the data, which is expected
      DROP TABLE my_tab
      ```
      
      ## How was this patch tested?
      
      New test in `DDLCommandSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13032 from andrewor14/create-external-table-location.
      8881765a
• [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame · 40ba87f7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
Earlier we removed the experimental tag for Scala/Java DataFrames, but we haven't done so for Python. This patch removes the experimental flag for Python DataFrames and declares them stable.
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13062 from rxin/SPARK-15278.
      40ba87f7
• [SPARK-15270] [SQL] Use SparkSession Builder to build a session with HiveSupport · de9c85cc
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Before:
Creating a HiveContext was failing
      ```python
      from pyspark.sql import HiveContext
      hc = HiveContext(sc)
      ```
      with
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "spark-2.0/python/pyspark/sql/context.py", line 458, in __init__
          sparkSession = SparkSession.withHiveSupport(sparkContext)
        File "spark-2.0/python/pyspark/sql/session.py", line 192, in withHiveSupport
          jsparkSession = sparkContext._jvm.SparkSession.withHiveSupport(sparkContext._jsc.sc())
        File "spark-2.0/python/lib/py4j-0.9.2-src.zip/py4j/java_gateway.py", line 1048, in __getattr__
      py4j.protocol.Py4JError: org.apache.spark.sql.SparkSession.withHiveSupport does not exist in the JVM
      ```
      
      Now:
      ```python
      >>> from pyspark.sql import HiveContext
      >>> hc = HiveContext(sc)
      >>> hc.range(0, 100)
      DataFrame[id: bigint]
      >>> hc.range(0, 100).count()
      100
      ```
      ## How was this patch tested?
Existing tests, plus manual testing in the Python shell.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13056 from techaddict/SPARK-15270.
      de9c85cc
• [SPARK-15262] Synchronize block manager / scheduler executor state · 40a949aa
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      If an executor is still alive even after the scheduler has removed its metadata, we may receive a heartbeat from that executor and tell its block manager to reregister itself. If that happens, the block manager master will know about the executor, but the scheduler will not.
      
      That is a dangerous situation, because when the executor does get disconnected later, the scheduler will not ask the block manager to also remove metadata for that executor. Later, when we try to clean up an RDD or a broadcast variable, we may try to send a message to that executor, triggering an exception.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13055 from andrewor14/block-manager-remove.
      40a949aa
• [SPARK-12200][SQL] Add __contains__ implementation to Row · 7ecd4968
      Maciej Brynski authored
      https://issues.apache.org/jira/browse/SPARK-12200
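A brief PySpark sketch of the added behavior (field names are illustrative):

```python
from pyspark.sql import Row

person = Row(name="Alice", age=11)
print("name" in person)    # True: __contains__ now checks field names
print("height" in person)  # False
```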
      
      Author: Maciej Brynski <maciej.brynski@adpilot.pl>
      Author: Maciej Bryński <maciek-github@brynski.pl>
      
      Closes #10194 from maver1ck/master.
      7ecd4968
• [SPARK-15260] Atomically resize memory pools · bb88ad4e
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      When we acquire execution memory, we do a lot of things between shrinking the storage memory pool and enlarging the execution memory pool. In particular, we call `memoryStore.evictBlocksToFreeSpace`, which may do a lot of I/O and can throw exceptions. If an exception is thrown, the pool sizes on that executor will be in a bad state.
      
      This patch minimizes the things we do between the two calls to make the resizing more atomic.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13039 from andrewor14/safer-pool.
      bb88ad4e
• [SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from... · 81c68ece
      Tathagata Das authored
      [SPARK-15248][SQL] Make MetastoreFileCatalog consider directories from partition specs of a partitioned metastore table
      
Table partitions can be added with locations different from the default warehouse location of a Hive table.
      `CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet `
      `ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'`
Querying such a table throws an error because MetastoreFileCatalog does not list the added partition directory; it only lists the default base location.
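For context, a hedged PySpark reproduction sketch of the scenario (requires a Hive-enabled session; the `/partition` location is illustrative):

```python
spark.sql("CREATE TABLE parquetTable (a int) PARTITIONED BY (b int) STORED AS parquet")
spark.sql("ALTER TABLE parquetTable ADD PARTITION (b=1) LOCATION '/partition'")
# Before this fix, reading the table could fail with the NoSuchElementException
# shown below, because only the table's default location was listed.
spark.table("parquetTable").show()
```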
      
      ```
      [info] - SPARK-15248: explicitly added partitions should be readable *** FAILED *** (1 second, 8 milliseconds)
      [info]   java.util.NoSuchElementException: key not found: file:/Users/tdas/Projects/Spark/spark2/target/tmp/spark-b39ad224-c5d1-4966-8981-fb45a2066d61/partition
      [info]   at scala.collection.MapLike$class.default(MapLike.scala:228)
      [info]   at scala.collection.AbstractMap.default(Map.scala:59)
      [info]   at scala.collection.MapLike$class.apply(MapLike.scala:141)
      [info]   at scala.collection.AbstractMap.apply(Map.scala:59)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:59)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog$$anonfun$listFiles$1.apply(PartitioningAwareFileCatalog.scala:55)
      [info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      [info]   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      [info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
      [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
      [info]   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
      [info]   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
      [info]   at org.apache.spark.sql.execution.datasources.PartitioningAwareFileCatalog.listFiles(PartitioningAwareFileCatalog.scala:55)
      [info]   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:93)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
      [info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      [info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
      [info]   at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:55)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
      [info]   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
      [info]   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
      [info]   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
      [info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
      [info]   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
      [info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
      [info]   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
      [info]   at org.apache.spark.sql.QueryTest.assertEmptyMissingInput(QueryTest.scala:330)
      [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:146)
      [info]   at org.apache.spark.sql.QueryTest.checkAnswer(QueryTest.scala:159)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:554)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7$$anonfun$apply$mcV$sp$25.apply(parquetSuites.scala:535)
      [info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTempDir(SQLTestUtils.scala:125)
      [info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTempDir(parquetSuites.scala:726)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12$$anonfun$apply$mcV$sp$7.apply$mcV$sp(parquetSuites.scala:535)
      [info]   at org.apache.spark.sql.test.SQLTestUtils$class.withTable(SQLTestUtils.scala:166)
      [info]   at org.apache.spark.sql.hive.ParquetPartitioningTest.withTable(parquetSuites.scala:726)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply$mcV$sp(parquetSuites.scala:534)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
      [info]   at org.apache.spark.sql.hive.ParquetMetastoreSuite$$anonfun$12.apply(parquetSuites.scala:534)
      ```
      
The solution in this PR is to get the paths to list from the partition spec rather than relying on the default table path alone.
      
      unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13022 from tdas/SPARK-15248.
      81c68ece
• [SPARK-15085][STREAMING][KAFKA] Rename streaming-kafka artifact · 89e67d66
      cody koeninger authored
      ## What changes were proposed in this pull request?
Rename the streaming-kafka artifact to include the Kafka version, in anticipation of needing a different artifact for later Kafka versions.
      
      ## How was this patch tested?
      Unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #12946 from koeninger/SPARK-15085.
      89e67d66
• [SPARK-15259] Sort time metric should not include spill and record insertion time · 6d0368ab
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      After SPARK-14669 it seems the sort time metric includes both spill and record insertion time. This makes it not very useful since the metric becomes close to the total execution time of the node.
      
      We should track just the time spent for in-memory sort, as before.
      
      ## How was this patch tested?
      
      Verified metric in the UI, also unit test on UnsafeExternalRowSorter.
      
      cc davies
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #13035 from ericl/fix-metrics.
      6d0368ab
• [SPARK-15037] [SQL] [MLLIB] Part2: Use SparkSession instead of SQLContext in Python TestSuites · 29314379
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Use SparkSession instead of SQLContext in Python TestSuites
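A minimal sketch of the substitution this refers to (test setup and data values are hypothetical):

```python
from pyspark.sql import SparkSession

# Old pattern in the Python test suites: sqlContext = SQLContext(sc)
# New pattern: create a SparkSession directly in the test fixture.
spark = SparkSession.builder \
    .master("local[2]") \
    .appName("python-test-suite") \
    .getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
assert df.count() == 2
spark.stop()
```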
      
      ## How was this patch tested?
      Existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #13044 from techaddict/SPARK-15037-python.
      29314379
• [SPARK-15241] [SPARK-15242] [SQL] fix 2 decimal-related issues in RowEncoder · d8935db5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
SPARK-15241: We now support Java decimal and Catalyst decimal in external rows; it makes sense to also support Scala decimal.
      
SPARK-15242: This is a long-standing bug, exposed after https://github.com/apache/spark/pull/12364, which eliminates the `If` expression when the field is not nullable:
      ```
      val fieldValue = serializerFor(
        GetExternalRowField(inputObject, i, externalDataTypeForInput(f.dataType)),
        f.dataType)
      if (f.nullable) {
        If(
          Invoke(inputObject, "isNullAt", BooleanType, Literal(i) :: Nil),
          Literal.create(null, f.dataType),
          fieldValue)
      } else {
        fieldValue
      }
      ```
      
Previously, we always used `DecimalType.SYSTEM_DEFAULT` as the output type of a converted decimal field, which is wrong as it doesn't match the real decimal type. However, it happened to work because we always put the converted field into an `If` expression to do the null check, and `If` uses its `trueValue`'s data type as its output type.
Now, if we have a non-nullable decimal field, the converted field's output type will be `DecimalType.SYSTEM_DEFAULT`, and we will write wrong data into the unsafe row.

The fix is simple: just use the given decimal type as the output type of the converted decimal field.

These two issues were found at https://github.com/apache/spark/pull/13008
      
      ## How was this patch tested?
      
      new tests in RowEncoderSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13019 from cloud-fan/encoder-decimal.
      d8935db5
• [SPARK-14933][HOTFIX] Replace `sqlContext` with `spark`. · e1576478
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This fixes compile errors.
      
      ## How was this patch tested?
      
      Pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13053 from dongjoon-hyun/hotfix_sqlquerysuite.
      e1576478
• [SPARK-15268][SQL] Make JavaTypeInference work with UDTRegistration · a5f9fdbb
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
We have a private `UDTRegistration` API to register user-defined types. Currently `JavaTypeInference` can't work with it, so `SparkSession.createDataFrame` from a bean class will not correctly infer the schema of the bean class.
      
      ## How was this patch tested?
      `VectorUDTSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13046 from viirya/fix-udt-registry-javatypeinference.
      a5f9fdbb
• [SPARK-14933][SQL] Failed to create view out of a parquet or orc table · 427c20dd
      xin Wu authored
      ## What changes were proposed in this pull request?
      #### Symptom
If a table is created as a Parquet or ORC table with Hive-syntax DDL, such as
      ```SQL
      create table t1 (c1 int, c2 string) stored as parquet
      ```
      The following command will fail
      ```SQL
      create view v1 as select * from t1
      ```
      #### Root Cause
Currently, `HiveMetaStoreCatalog` converts Parquet/ORC tables to `LogicalRelation` without giving any `tableIdentifier`. `SQLBuilder` expects the `LogicalRelation` to have an associated `tableIdentifier`. However, the `LogicalRelation` created earlier does not have such a `tableIdentifier`. Thus, `SQLBuilder.toSQL` cannot recognize this logical plan and raises an exception.
      
      This PR is to assign a `TableIdentifier` to the `LogicalRelation` when resolving parquet or orc tables in `HiveMetaStoreCatalog`.
      
      ## How was this patch tested?
Test cases were created and dev/run-tests was run.
      
      Author: xin Wu <xinwu@us.ibm.com>
      
      Closes #12716 from xwu0226/SPARK_14933.
      427c20dd
• [SPARK-15150][EXAMPLE][DOC] Update LDA examples · d88afabd
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
1. Create a libsvm-format dataset for LDA: `data/mllib/sample_lda_libsvm_data.txt`
2. Add a Python example (a condensed sketch follows this list).
3. Directly read the data file in the examples.
4. Also change to `SparkSession` in `aft_survival_regression.py`.
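A condensed sketch of what the new Python LDA example looks like, assuming the PySpark shell and the new data file (details may differ from the committed example):

```python
from pyspark.ml.clustering import LDA

# Read the libsvm-formatted LDA dataset directly.
dataset = spark.read.format("libsvm").load("data/mllib/sample_lda_libsvm_data.txt")

lda = LDA(k=10, maxIter=10)
model = lda.fit(dataset)

# Inspect the discovered topics.
model.describeTopics(3).show(truncate=False)
```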
      
      ## How was this patch tested?
      manual tests
      `./bin/spark-submit examples/src/main/python/ml/lda_example.py`
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12927 from zhengruifeng/lda_pe.
      d88afabd
• [SPARK-15238] Clarify supported Python versions · fafc95af
      Nicholas Chammas authored
      This PR:
      * Clarifies that Spark *does* support Python 3, starting with Python 3.4.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #13017 from nchammas/supported-python-versions.
      fafc95af
• [SPARK-14976][STREAMING] make StreamingContext.textFileStream support wildcard · 33597810
      mwws authored
      ## What changes were proposed in this pull request?
Make `StreamingContext.textFileStream` support wildcard paths such as `/home/user/*/file`.
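A hedged sketch of the usage this enables (paths and batch interval are illustrative; `sc` is the shell's SparkContext):

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)  # 1-second batches

# The monitored path may now contain a wildcard.
lines = ssc.textFileStream("/home/user/*/file")
lines.pprint()

ssc.start()
ssc.awaitTermination()
```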
      
      ## How was this patch tested?
Manual testing was done and a new unit test case was added.
      
      Author: mwws <wei.mao@intel.com>
      Author: unknown <maowei@maowei-MOBL.ccr.corp.intel.com>
      
      Closes #12752 from mwws/SPARK_FileStream.
      33597810
• [SPARK-15149][EXAMPLE][DOC] update kmeans example · 8beae591
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
A Python example for ml.kmeans already exists, but it is not included in the user guide.
1. Small changes such as the `example_on` / `example_off` markers.
2. Add it to the user guide.
3. Update the examples to directly read the data file (a condensed sketch follows this list).
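A condensed sketch of the Python KMeans example after this change, assuming the PySpark shell and the sample data file (details may differ from the committed example):

```python
from pyspark.ml.clustering import KMeans

# Read the sample data file directly instead of parsing it by hand.
dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(dataset)

for center in model.clusterCenters():
    print(center)
```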
      
      ## How was this patch tested?
      manual tests
`./bin/spark-submit examples/src/main/python/ml/kmeans_example.py`
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12925 from zhengruifeng/km_pe.
      8beae591
• [SPARK-14340][EXAMPLE][DOC] Update Examples and User Guide for ml.BisectingKMeans · cef73b56
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
1. Add BisectingKMeans to ml-clustering.md.
2. Add the missing Scala BisectingKMeansExample.
3. Create a new data file, `data/mllib/sample_kmeans_data.txt`.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11844 from zhengruifeng/doc_bkm.
      cef73b56
• [SPARK-15141][EXAMPLE][DOC] Update OneVsRest Examples · ad1a8466
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
1. Add a Python example for OneVsRest.
2. Remove argument parsing.
      
      ## How was this patch tested?
      manual tests
      `./bin/spark-submit examples/src/main/python/ml/one_vs_rest_example.py`
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12920 from zhengruifeng/ovr_pe.
      ad1a8466
• [SPARK-15231][SQL] Document the semantic of saveAsTable and insertInto and... · 875ef764
      Shixiong Zhu authored
      [SPARK-15231][SQL] Document the semantic of saveAsTable and insertInto and don't drop columns silently
      
      ## What changes were proposed in this pull request?
      
This PR adds documentation about the different behaviors of `insertInto` and `saveAsTable`, and throws an exception when the user tries to write too many columns using `saveAsTable` with append mode.
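A hedged PySpark sketch of the documented difference (table and column names are illustrative):

```python
df = spark.createDataFrame([(1, "a")], ["id", "name"])
df.write.saveAsTable("people")

more = spark.createDataFrame([("b", 2)], ["name", "id"])
# saveAsTable in append mode resolves columns by name, so the swapped column
# order above is still written correctly; mismatched column sets now raise an
# exception instead of being dropped silently.
more.write.mode("append").saveAsTable("people")

# insertInto resolves columns by position, so the caller must order the
# columns to match the table schema.
more.select("id", "name").write.insertInto("people")
```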
      
      ## How was this patch tested?
      
      Unit tests added in this PR.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13013 from zsxwing/SPARK-15231.
      875ef764
• [SPARK-15189][PYSPARK][DOCS] Update ml.evaluation PyDoc · 007882c7
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Fix doctest issue, short param description, and tag items as Experimental
      
      ## How was this patch tested?
      
      build docs locally & doctests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12964 from holdenk/SPARK-15189-ml.Evaluation-PyDoc-issues.
      007882c7
• [SPARK-15235][WEBUI] Corresponding row cannot be highlighted even though... · ba181c0c
      Kousuke Saruta authored
      [SPARK-15235][WEBUI] Corresponding row cannot be highlighted even though cursor is on the job on Web UI's timeline
      
      ## What changes were proposed in this pull request?
      
To extract job descriptions and stage names, timeline-view.js uses the following regular expressions:
      
      ```
      var jobIdText = $($(baseElem).find(".application-timeline-content")[0]).text();
      var jobId = jobIdText.match("\\(Job (\\d+)\\)")[1];
      ...
      var stageIdText = $($(baseElem).find(".job-timeline-content")[0]).text();
      var stageIdAndAttempt = stageIdText.match("\\(Stage (\\d+\\.\\d+)\\)")[1].split(".");
      ```
      
But if job descriptions include patterns like "(Job x)" or stage names include patterns like "(Stage x.y)", the regular expressions do not match as expected, so the corresponding row cannot be highlighted even when the cursor is on the job in the Web UI's timeline.
      
      ## How was this patch tested?
      
      Manually tested with spark-shell and Web UI.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #13016 from sarutak/SPARK-15235.
      ba181c0c
• [SPARK-15246][SPARK-4452][CORE] Fix code style and improve volatile for · 9f0a642f
      Lianhui Wang authored
      ## What changes were proposed in this pull request?
1. Fix code style.
2. Remove `volatile` from the `elementsRead` method because only one thread uses it.
3. Avoid `volatile` for `_elementsRead` because the collection increments `_elementsRead` every time it inserts an element, which would make a volatile field very expensive, so we can avoid it.
      
      After this PR, I will push another PR for branch 1.6.
      ## How was this patch tested?
      unit tests
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #13020 from lianhuiwang/SPARK-4452-hotfix.
      9f0a642f
• [SPARK-15255][SQL] limit the length of name for cached DataFrame · 1fbe2785
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
We use the tree string of a SparkPlan as the name of a cached DataFrame, which could be very long and cause the browser to become unresponsive. This PR limits the length of the name to 1000 characters.
      
      ## How was this patch tested?
      
      Here is how the UI looks right now:
      
      ![ui](https://cloud.githubusercontent.com/assets/40902/15163355/d5640f9c-16bc-11e6-8655-809af8a4fed1.png)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13033 from davies/cache_name.
      1fbe2785
• [SPARK-15265][SQL][MINOR] Fix Union query error message indentation · 66554596
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR makes the Union error message indentation consistent with other set queries (EXCEPT/INTERSECT).
      
      **Before (4 lines)**
      ```
      scala> sql("(select 1) union (select 1, 2)").head
      org.apache.spark.sql.AnalysisException:
      Unions can only be performed on tables with the same number of columns,
       but one table has '2' columns and another table has
       '1' columns;
      ```
      
      **After (one-line)**
      ```
      scala> sql("(select 1) union (select 1, 2)").head
      org.apache.spark.sql.AnalysisException: Unions can only be performed on tables with the same number of columns, but one table has '2' columns and another table has '1' columns;
      ```
      **Reference (EXCEPT / INTERSECT)**
      ```
      scala> sql("(select 1) intersect (select 1, 2)").head
      org.apache.spark.sql.AnalysisException: Intersect can only be performed on tables with the same number of columns, but the left table has 1 columns and the right has 2;
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13043 from dongjoon-hyun/SPARK-15265.
      66554596
• [SPARK-15250][SQL] Remove deprecated json API in DataFrameReader · 3ff01205
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR removes the old `json(path: String)` API which is covered by the new `json(paths: String*)`.
      
      ## How was this patch tested?
      
      Jenkins tests (existing tests should cover this)
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #13040 from HyukjinKwon/SPARK-15250.
      3ff01205