  1. Oct 07, 2016
    • Herman van Hovell's avatar
      [SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules · 18bf9d2b
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds the Kafka 0.10 subproject to the build infrastructure. This makes sure Kafka 0.10 tests are only triggered when it or one of its dependencies changes.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15355 from hvanhovell/SPARK-17782.
      18bf9d2b
    • Bryan Cutler's avatar
      [SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of paths · bcaa799c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use an undefined variable `paths`. This change checks whether the param `paths` is a basestring and, if so, converts it to a list, so that the same variable `paths` can be used for both cases.
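      The check described above can be sketched in plain Python (a hypothetical minimal version; `normalize_paths` is an illustrative name, not the actual `pyspark.sql.readwriter` code):

```python
# Hypothetical sketch of the fix: accept either a single path string or a
# list of paths, and always hand back a list so later code has one shape.
def normalize_paths(paths):
    """Return a list of path strings, whether given one path or several."""
    if isinstance(paths, str):  # `basestring` on Python 2
        paths = [paths]
    return list(paths)

single = normalize_paths("data/part-0.txt")   # one path becomes a one-item list
many = normalize_paths(["a.txt", "b.txt"])    # a list passes through unchanged
```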
      
      ## How was this patch tested?
      Added unit test for reading list of files
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
      bcaa799c
  2. Oct 06, 2016
    • sethah's avatar
      [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general... · 3713bb19
      sethah authored
      [SPARK-17792][ML] L-BFGS solver for linear regression does not accept general numeric label column types
      
      ## What changes were proposed in this pull request?
      
      Before, we computed `instances` in LinearRegression in two spots, even though they did the same thing. One of them did not cast the label column to `DoubleType`. This patch consolidates the computation and always casts the label column to `DoubleType`.
      
      ## How was this patch tested?
      
      Added a unit test to check all solvers. This test failed before this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15364 from sethah/linreg_numeric_type.
      3713bb19
    • Christian Kadner's avatar
      [SPARK-17803][TESTS] Upgrade docker-client dependency · 49d11d49
      Christian Kadner authored
      [SPARK-17803: Docker integration tests don't run with "Docker for Mac"](https://issues.apache.org/jira/browse/SPARK-17803)
      
      ## What changes were proposed in this pull request?
      
      This PR upgrades the [docker-client](https://mvnrepository.com/artifact/com.spotify/docker-client) dependency from [3.6.6](https://mvnrepository.com/artifact/com.spotify/docker-client/3.6.6) to [5.0.2](https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2) to enable _Docker for Mac_ users to run the `docker-integration-tests` out of the box.
      
      The very latest docker-client version is [6.0.0](https://mvnrepository.com/artifact/com.spotify/docker-client/6.0.0) but that has one additional dependency and no usage yet.
      
      ## How was this patch tested?
      
      The code change was tested on Mac OS X Yosemite with both _Docker Toolbox_ as well as _Docker for Mac_ and on Linux Ubuntu 14.04.
      
      ```
      $ build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
      
      $ build/mvn -Pdocker-integration-tests -Pscala-2.11 -pl :spark-docker-integration-tests_2.11 clean compile test
      ```
      
      Author: Christian Kadner <ckadner@us.ibm.com>
      
      Closes #15378 from ckadner/SPARK-17803_Docker_for_Mac.
      49d11d49
    • Shixiong Zhu's avatar
      [SPARK-17780][SQL] Report Throwable to user in StreamExecution · 9a48e60e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When using an incompatible source for structured streaming, it may throw NoClassDefFoundError. It's better to just catch Throwable and report it to the user since the streaming thread is dying.
      
      ## How was this patch tested?
      
      `test("NoClassDefFoundError from an incompatible source")`
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15352 from zsxwing/SPARK-17780.
      9a48e60e
    • Reynold Xin's avatar
      [SPARK-17798][SQL] Remove redundant Experimental annotations in sql.streaming · 79accf45
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      I was looking through API annotations to catch mislabeled APIs, and realized DataStreamReader and DataStreamWriter classes are already annotated as Experimental, and as a result there is no need to annotate each method within them.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15373 from rxin/SPARK-17798.
      79accf45
    • Dongjoon Hyun's avatar
      [SPARK-17750][SQL] Fix CREATE VIEW with INTERVAL arithmetic. · 92b7e572
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark raises `RuntimeException` when creating a view with timestamp INTERVAL arithmetic like the following. The root cause is that the arithmetic expression `TimeAdd` was transformed into the `timeadd` function in the VIEW definition. This PR fixes the SQL definition of the `TimeAdd` and `TimeSub` expressions.
      
      ```scala
      scala> sql("CREATE TABLE dates (ts TIMESTAMP)")
      
      scala> sql("CREATE VIEW view1 AS SELECT ts + INTERVAL 1 DAY FROM dates")
      java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
      ```
      
      ## How was this patch tested?
      
      Pass Jenkins with a new testcase.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15318 from dongjoon-hyun/SPARK-17750.
      92b7e572
    • hyukjinkwon's avatar
      [BUILD] Closing some stale PRs · 5e9f32dd
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to close some stale PRs and ones suggested to be closed by committer(s) or obviously inappropriate PRs (e.g. branch to branch).
      
      Closes #13458
      Closes #15278
      Closes #15294
      Closes #15339
      Closes #15283
      
      ## How was this patch tested?
      
      N/A
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15356 from HyukjinKwon/closing-prs.
      5e9f32dd
    • Yanbo Liang's avatar
      [MINOR][ML] Avoid 2D array flatten in NB training. · 7aeb20be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Avoid the 2D array flatten in `NaiveBayes` training, since the flatten method can be expensive (it creates another array and copies the data there).
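      The cost being avoided can be illustrated with a plain-Python sketch (illustrative only, not the actual `NaiveBayes` code): flattening copies every element into a fresh array before aggregating, while aggregating per row does not.

```python
# Illustrative only: aggregate a 2D structure without first flattening it.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Flatten-then-sum allocates an intermediate copy of all the data.
flat_total = sum([x for row in rows for x in row])

# Summing row by row avoids the extra array and copy.
total = sum(sum(row) for row in rows)
assert total == flat_total == 21.0
```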
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15359 from yanboliang/nb-theta.
      7aeb20be
  3. Oct 05, 2016
    • Shixiong Zhu's avatar
      [SPARK-17346][SQL][TEST-MAVEN] Generate the sql test jar to fix the maven build · b678e465
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Generate the sql test jar to fix the maven build
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15368 from zsxwing/sql-test-jar.
      b678e465
    • Shixiong Zhu's avatar
      [SPARK-17346][SQL] Add Kafka source for Structured Streaming · 9293734d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.
      
      It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
      
      tdas did most of the work, and part of it was inspired by koeninger's work.
      
      ### Introduction
      
      The Kafka source is a structured streaming data source that polls data from Kafka. The schema of the data read is as follows:
      
      Column | Type
      ---- | ----
      key | binary
      value | binary
      topic | string
      partition | int
      offset | long
      timestamp | long
      timestampType | int
      
      The source can deal with deleted topics. However, the user should make sure no Spark job is processing the data when a topic is deleted.
      
      ### Configuration
      
      The user can use `DataStreamReader.option` to set the following configurations.
      
      Kafka Source's options | value | default | meaning
      ------ | ------- | ------ | -----
      startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which is from the earliest offset, or "latest", which is just from the latest offset. Note: this only applies when a new streaming query is started; resuming will always pick up from where the query left off.
      failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it if it does not behave as expected.
      subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
      kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
      fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
      fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets
      
      Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.
      
      ### Usage
      
      * Subscribe to 1 topic
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1")
        .load()
      ```
      
      * Subscribe to multiple topics
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribe", "topic1,topic2")
        .load()
      ```
      
      * Subscribe to a pattern
      ```Scala
      spark
        .readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "host:port")
        .option("subscribePattern", "topic.*")
        .load()
      ```
      
      ## How was this patch tested?
      
      The new unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Shixiong Zhu <zsxwing@gmail.com>
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15102 from zsxwing/kafka-source.
      9293734d
    • Herman van Hovell's avatar
      [SPARK-17758][SQL] Last returns wrong result in case of empty partition · 5fd54b99
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The result of the `Last` function can be wrong when the last partition processed is empty. It can return `null` instead of the expected value. For example, this can happen when we process partitions in the following order:
      ```
      - Partition 1 [Row1, Row2]
      - Partition 2 [Row3]
      - Partition 3 []
      ```
      In this case the `Last` function will currently return null instead of the value of `Row3`.
      
      This PR fixes this by adding a `valueSet` flag to the `Last` function.
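      The idea behind the flag can be sketched outside Catalyst (an illustrative Python model, not Spark's `Last` implementation): a value only counts once a row has actually been seen, so an empty final partition cannot clobber the result.

```python
# Illustrative model of a "last" aggregate with a value_set flag.
class LastAgg:
    def __init__(self):
        self.value = None
        self.value_set = False  # becomes True only once a row is seen

    def update(self, rows):
        for row in rows:
            self.value = row
            self.value_set = True  # an empty partition never flips this

agg = LastAgg()
for partition in (["Row1", "Row2"], ["Row3"], []):  # Partition 3 is empty
    agg.update(partition)
assert agg.value == "Row3" and agg.value_set
```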
      
      ## How was this patch tested?
      Previously we only had end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in Catalyst. I have also added a `LastTestSuite` to test the `Last` aggregate function.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15348 from hvanhovell/SPARK-17758.
      5fd54b99
    • Shixiong Zhu's avatar
      [SPARK-17778][TESTS] Mock SparkContext to reduce memory usage of BlockManagerSuite · 221b418b
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Mock SparkContext to reduce memory usage of BlockManagerSuite
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15350 from zsxwing/SPARK-17778.
      221b418b
    • sethah's avatar
      [SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression · 9df54f53
      sethah authored
      ## What changes were proposed in this pull request?
      Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training.
      
      ## How was this patch tested?
      Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15349 from sethah/SPARK-17239.
      9df54f53
    • Dongjoon Hyun's avatar
      [SPARK-17328][SQL] Fix NPE with EXPLAIN DESCRIBE TABLE · 6a05eb24
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following NPE scenario in two ways.
      
      **Reported Error Scenario**
      ```scala
      scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
      INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
      java.lang.NullPointerException
      ```
      
      - **DESCRIBE**: Extend `DESCRIBE` syntax to accept `TABLE`.
      - **EXPLAIN**: Prevent NPE in case of the parsing failure of target statement, e.g., `EXPLAIN DESCRIBE TABLES x`.
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15357 from dongjoon-hyun/SPARK-17328.
      6a05eb24
    • Herman van Hovell's avatar
      [SPARK-17258][SQL] Parse scientific decimal literals as decimals · 89516c1c
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Currently Spark SQL parses regular decimal literals (e.g. `10.00`) as decimals and scientific decimal literals (e.g. `10.0e10`) as doubles. The difference between the two confuses most users. This PR unifies the parsing behavior and also parses scientific decimal literals as decimals.
      
      The implications for tests are limited to a single Hive compatibility test.
      
      ## How was this patch tested?
      Updated tests in `ExpressionParserSuite` and `SQLQueryTestSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #14828 from hvanhovell/SPARK-17258.
      89516c1c
    • hyukjinkwon's avatar
      [SPARK-17658][SPARKR] read.df/write.df API taking path optionally in SparkR · c9fe10d4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      The `write.df`/`read.df` API requires a path, which is not actually always necessary in Spark. Currently, it only affects datasources implementing `CreatableRelationProvider`. Spark does not currently have internal data sources implementing this, but it would affect other external datasources.
      
      In addition, we'd be able to use this approach in Spark's JDBC datasource after https://github.com/apache/spark/pull/12601 is merged.
      
      **Before**
      
       - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  :
        argument "x" is missing with no default
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in invokeJava(isStatic = TRUE, className, methodName, ...) :
        java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
      	at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
      	at
      ...
      In if (is.na(object)) { :
      ...
      ```
      
       - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in (function (classes, fdef, mtable)  :
        unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
      ```
      
      **After**
      
      - `read.df`
      
        ```r
      > read.df(source = "json")
      Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
      ```
      
        ```r
      > read.df(path = c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
        ```r
      > read.df(c(1, 2))
      Error in f(x, ...) : path should be charactor, null or omitted.
      ```
      
      - `write.df`
      
        ```r
      > write.df(df, source = "json")
      Error in save : illegal argument - 'path' is not specified
      ```
      
        ```r
      > write.df(df, source = c(1, 2))
      Error in .local(df, path, ...) :
        source should be charactor, null or omitted. It is 'parquet' by default.
      ```
      
        ```r
      > write.df(df, mode = TRUE)
      Error in .local(df, path, ...) :
        mode should be charactor or omitted. It is 'error' by default.
      ```
      
      ## How was this patch tested?
      
      Unit tests in `test_sparkSQL.R`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15231 from HyukjinKwon/write-default-r.
      c9fe10d4
  4. Oct 04, 2016
  5. Oct 03, 2016
    • Takuya UESHIN's avatar
      [SPARK-17702][SQL] Code generation including too many mutable states exceeds JVM size limit. · b1b47274
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      When code generation includes too many mutable states, the constructor that extracts values from `references` into fields exceeds the JVM method size limit.
      We should split the generated extractions in the constructor into smaller functions.
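      The splitting strategy can be sketched generically (hypothetical helper names; the real fix lives in Spark's Java code generator): chunk the generated statements into fixed-size helper methods and invoke them in order, so no single method grows past the JVM's 64KB bytecode limit.

```python
# Illustrative sketch: split a long list of generated statements into
# helper methods of bounded size, plus the calls that invoke them in order.
def split_into_methods(statements, max_per_method=100):
    chunks = [statements[i:i + max_per_method]
              for i in range(0, len(statements), max_per_method)]
    methods = []
    for idx, chunk in enumerate(chunks):
        body = "\n  ".join(chunk)
        methods.append(f"private void init{idx}() {{\n  {body}\n}}")
    calls = "\n".join(f"init{i}();" for i in range(len(chunks)))
    return calls, methods

stmts = [f"field{i} = (Object) references[{i}];" for i in range(250)]
calls, methods = split_into_methods(stmts)
assert len(methods) == 3  # 250 statements -> 3 methods of at most 100 each
```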
      
      ## How was this patch tested?
      
      I added some tests to check whether the generated code for the expressions exceeds the limit.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15275 from ueshin/issues/SPARK-17702.
      b1b47274
    • Dongjoon Hyun's avatar
      [SPARK-17112][SQL] "select null" via JDBC triggers IllegalArgumentException in Thriftserver · c571cfb2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, Spark Thrift Server raises `IllegalArgumentException` for queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.
      
      **Before**
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
      Closing: 0: jdbc:hive2://localhost:10000
      
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
      Closing: 0: jdbc:hive2://localhost:10000
      ```
      
      **After**
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      +-------+--+
      | NULL  |
      +-------+--+
      | NULL  |
      +-------+--+
      1 row selected (3.242 seconds)
      Beeline version 1.2.1.spark2 by Apache Hive
      Closing: 0: jdbc:hive2://localhost:10000
      
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      Connecting to jdbc:hive2://localhost:10000
      Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
      Driver: Hive JDBC (version 1.2.1.spark2)
      Transaction isolation: TRANSACTION_REPEATABLE_READ
      +-------------------------+--+
      | (IF(true, NULL, NULL))  |
      +-------------------------+--+
      | NULL                    |
      +-------------------------+--+
      1 row selected (0.201 seconds)
      Beeline version 1.2.1.spark2 by Apache Hive
      Closing: 0: jdbc:hive2://localhost:10000
      ```
      
      ## How was this patch tested?
      
      * Pass the Jenkins test with a new testsuite.
      * Also, manually, after starting the Spark Thrift Server, run the following commands.
      ```sql
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
      $ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
      ```
      
      **Hive 1.2**
      ```sql
      hive> create table null_table as select null;
      hive> desc null_table;
      OK
      _c0                     void
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15325 from dongjoon-hyun/SPARK-17112.
      c571cfb2
    • Herman van Hovell's avatar
      [SPARK-17753][SQL] Allow a complex expression as the input of a value based case statement · 2bbecdec
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We currently only allow relatively simple expressions as the input for a value based case statement. Expressions like `case (a > 1) or (b = 2) when true then 1 when false then 0 end` currently fail. This PR adds support for such expressions.
      
      ## How was this patch tested?
      Added a test to the ExpressionParserSuite.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15322 from hvanhovell/SPARK-17753.
      2bbecdec
    • zero323's avatar
      [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract · d8399b60
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.
      
      Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`.
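      The reason the exception type matters for iteration can be shown with a small sketch (an illustrative class, not the real `SparseVector`): an object with only `__getitem__` is iterated via Python's legacy sequence protocol, which stops cleanly on `IndexError`, whereas a `ValueError` would propagate out of the loop.

```python
# Illustrative: the legacy sequence protocol calls obj[0], obj[1], ...
# and treats IndexError as the end of iteration.
class VectorLike:
    def __init__(self, values):
        self._values = values

    def __getitem__(self, index):
        if not 0 <= index < len(self._values):
            raise IndexError("index out of range")  # not ValueError
        return self._values[index]

assert list(VectorLike([1.0, 0.0, 3.0])) == [1.0, 0.0, 3.0]
```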
      
      ## How was this patch tested?
      
      PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #15144 from zero323/SPARK-17587.
      d8399b60
    • Jason White's avatar
      [SPARK-17679] [PYSPARK] remove unnecessary Py4J ListConverter patch · 1f31bdae
      Jason White authored
      ## What changes were proposed in this pull request?
      
      This PR removes a patch on ListConverter from https://github.com/apache/spark/pull/5570, as it is no longer necessary. The underlying issue in Py4J https://github.com/bartdag/py4j/issues/160 was patched in https://github.com/bartdag/py4j/commit/224b94b6665e56a93a064073886e1d803a4969d2 and is present in 0.10.3, the version currently in use in Spark.
      
      ## How was this patch tested?
      
      The original test added in https://github.com/apache/spark/pull/5570 remains.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #15254 from JasonMWhite/remove_listconverter_patch.
      1f31bdae
    • Sean Owen's avatar
      [SPARK-17718][DOCS][MLLIB] Make loss function formulation label note clearer in MLlib docs · 1dd68d38
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move the note that labels are +1/-1 in the formulation only, so that it sits just under the table of formulations.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15330 from srowen/SPARK-17718.
      1dd68d38
    • Zhenhua Wang's avatar
      [SPARK-17073][SQL] generate column-level statistics · 7bf92127
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Generate basic column statistics for all the atomic types:
      - numeric types: max, min, num of nulls, ndv (number of distinct values)
      - date/timestamp types: they are also represented as numbers internally, so they have the same stats as above.
      - string: avg length, max length, num of nulls, ndv
      - binary: avg length, max length, num of nulls
      - boolean: num of nulls, num of trues, num of falses
      
      Also support storing and loading these statistics.
      
      One thing to notice:
      We support analyzing columns independently, e.g.:
      sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
      sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`
      when running sql2 to collect column stats for `value`, we don’t remove stats of columns `key` which are analyzed in sql1 and not in sql2. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has been changed before sql2, users should re-analyze column `key` when they want to analyze column `value`:
      `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`
      
      ## How was this patch tested?
      
      add unit tests
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      
      Closes #15090 from wzhfy/colStats.
      7bf92127
    • Jagadeesan's avatar
      [SPARK-17736][DOCUMENTATION][SPARKR] Update R README for rmarkdown,… · a27033c0
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~
      
      … pandoc]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #15309 from jagadeesanas2/SPARK-17736.
      a27033c0
    • Alex Bozarth's avatar
      [SPARK-17598][SQL][WEB UI] User-friendly name for Spark Thrift Server in web UI · de3f71ed
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      The name of the Spark Thrift JDBC/ODBC Server in the web UI reflects the name of the class, i.e. org.apache.spark.sql.hive.thrift.HiveThriftServer2. I changed it to Thrift JDBC/ODBC Server (like Spark shell for spark-shell) as recommended by jaceklaskowski. Note the user can still change the name by adding the `--name "App Name"` parameter to the start script, as before.
      
      ## How was this patch tested?
      
      By running the script with various parameters and checking the web ui
      
      ![screen shot 2016-09-27 at 12 19 12 pm](https://cloud.githubusercontent.com/assets/13952758/18888329/aebca47c-84ac-11e6-93d0-6e98684977c5.png)
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #15268 from ajbozarth/spark17598.
      de3f71ed
  6. Oct 02, 2016
    • Tao LI's avatar
      [SPARK-14914][CORE][SQL] Skip/fix some test cases on Windows due to limitation of Windows · 76dc2d90
      Tao LI authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix/skip some tests failed on Windows. This PR takes over https://github.com/apache/spark/pull/12696.
      
      **Before**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit *** FAILED *** (202 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      
      [info] - includes jars passed in through --jars *** FAILED *** (1 second, 625 milliseconds)
      [info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] - reads of memory-mapped and non memory-mapped files are equivalent *** FAILED *** (1 second, 78 milliseconds)
      [info]   diskStoreMapped.remove(blockId) was false (DiskStoreSuite.scala:41)
      ```
      
      **After**
      
      - **SparkSubmitSuite**
      
        ```
      [info] - launch simple application with spark-submit (578 milliseconds)
      [info] - includes jars passed in through --jars (1 second, 875 milliseconds)
      ```
      
      - **DiskStoreSuite**
      
        ```
      [info] DiskStoreSuite:
      [info] - reads of memory-mapped and non memory-mapped files are equivalent !!! CANCELED !!! (766 milliseconds)
      ```
      
      For `CreateTableAsSelectSuite` and `FsHistoryProviderSuite`, I could not reproduce the failures, as the Java version seems higher than the one that has the bugs around `setReadable(..)` and `setWritable(...)`; but as these are clearly reported bugs, it'd be sensible to skip those tests. We should revert the changes for both as soon as we drop support for Java 7.
      
      ## How was this patch tested?
      
      Manually tested via AppVeyor.
      
      Closes #12696
      
      Author: Tao LI <tl@microsoft.com>
      Author: U-FAREAST\tl <tl@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15320 from HyukjinKwon/SPARK-14914.
      76dc2d90
    • Sital Kedia's avatar
      [SPARK-17509][SQL] When wrapping catalyst datatype to Hive data type avoid… · f8d7fade
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      When wrapping catalyst datatypes to Hive data types, the wrap function was doing an expensive pattern match that consumed around 11% of CPU time. Avoid the pattern matching by creating the wrapper only once and reusing it.
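      The optimization pattern can be sketched in plain Python (hypothetical type names; the real code dispatches on Catalyst data types): resolve the type dispatch once per column and return a converter closure, so the per-value hot loop does no matching at all.

```python
# Illustrative: dispatch on the data type once, then reuse the closure.
def make_wrapper(data_type):
    if data_type == "string":
        return lambda v: str(v)
    if data_type == "double":
        return lambda v: float(v)
    raise ValueError(f"unsupported type: {data_type}")

wrap = make_wrapper("double")            # expensive dispatch happens once
wrapped = [wrap(v) for v in [1, 2, 3]]   # hot loop only calls the closure
assert wrapped == [1.0, 2.0, 3.0]
```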
      
      ## How was this patch tested?
      
      Tested by running the job on a cluster and observed around 8% CPU improvement.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #15064 from sitalkedia/skedia/hive_wrapper.
      f8d7fade
  7. Oct 01, 2016
    • Sean Owen's avatar
      [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement. · b88cb63d
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Partial revert of #15277 to instead sort and store the input to the model, rather than require sorted input.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15299 from srowen/SPARK-17704.2.
      b88cb63d
    • Herman van Hovell's avatar
      [SPARK-17717][SQL] Add Exist/find methods to Catalog [FOLLOW-UP] · af6ece33
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      We added find and exists methods for Databases, Tables and Functions to the user-facing Catalog in PR https://github.com/apache/spark/pull/15301. However, it was brought up that the semantics of the `find` methods are more in line with a `get` method (get an object or else fail). So we rename these in this PR.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15308 from hvanhovell/SPARK-17717-2.
      af6ece33
    • Eric Liang's avatar
      [SPARK-17740] Spark tests should mock / interpose HDFS to ensure that streams are closed · 4bcd9b72
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      As a followup to SPARK-17666, ensure filesystem connections are not leaked, at least in unit tests. This is done here by intercepting filesystem calls as suggested by JoshRosen. At the end of each test, we assert no filesystem streams are left open.
      
      This applies to all tests using SharedSQLContext or SharedSparkContext.
      
      ## How was this patch tested?
      
      I verified that tests in sql and core are indeed using the filesystem backend, and fixed the detected leaks. I also checked that reverting https://github.com/apache/spark/pull/15245 causes many actual test failures due to connection leaks.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekhliang@gmail.com>
      
      Closes #15306 from ericl/sc-4672.
      4bcd9b72
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Add an up-to-date description for default serialization during shuffling · 15e9bbb4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark started to choose Kryo as the default serialization library during shuffling of simple types, arrays of simple types, or string type.
      
      ## How was this patch tested?
      
      This is a documentation update.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
      15e9bbb4