  1. Jul 14, 2016
    • Liwei Lin's avatar
      [SPARK-16503] SparkSession should provide Spark version · 39c836e9
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      This patch enables SparkSession to provide the Spark version.
      
      ## How was this patch tested?
      
      Manual test:
      
      ```
      scala> sc.version
      res0: String = 2.1.0-SNAPSHOT
      
      scala> spark.version
      res1: String = 2.1.0-SNAPSHOT
      ```
      
      ```
      >>> sc.version
      u'2.1.0-SNAPSHOT'
      >>> spark.version
      u'2.1.0-SNAPSHOT'
      ```
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #14165 from lw-lin/add-version.
      39c836e9
    • Dongjoon Hyun's avatar
      [SPARK-16536][SQL][PYSPARK][MINOR] Expose `sql` in PySpark Shell · 9c530576
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR exposes `sql` in PySpark Shell like Scala/R Shells for consistency.
      
      **Background**
       * Scala
       ```scala
      scala> sql("select 1 a")
      res0: org.apache.spark.sql.DataFrame = [a: int]
      ```
      
       * R
       ```r
      > sql("select 1")
      SparkDataFrame[1:int]
      ```
      
      **Before**
       * Python
      
       ```python
      >>> sql("select 1 a")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      NameError: name 'sql' is not defined
      ```
      
      **After**
       * Python
      
       ```python
      >>> sql("select 1 a")
      DataFrame[a: int]
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #14190 from dongjoon-hyun/SPARK-16536.
      9c530576
  2. Jul 13, 2016
    • Joseph K. Bradley's avatar
      [SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ML, doc fixes · a5f51e21
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Fixing issues found during 2.0 API checks:
      * GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed
      * sqlDataTypes: name does not follow conventions. Do we need to expose it?
      * Evaluator: inconsistent doc between evaluate and isLargerBetter
      * MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little
      * GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name
      
      ## How was this patch tested?
      
      Existing unit tests.  Docs generated locally.  (MinMaxScaler is improved a tiny bit.)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14187 from jkbradley/final-api-check-2.0.
      a5f51e21
    • gatorsmile's avatar
      [SPARK-16482][SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema · c5ec8798
      gatorsmile authored
      #### What changes were proposed in this pull request?
      If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.
      
      ~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~
      
      For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation time.
      
      #### How was this patch tested?
      Added test cases
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14148 from gatorsmile/describeSchema.
      c5ec8798
    • Felix Cheung's avatar
      [SPARKR][DOCS][MINOR] R programming guide to include csv data source example · fb2e8eeb
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Minor documentation update for code example, code style, and missed reference to "sparkR.init"
      
      ## How was this patch tested?
      
      manual
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14178 from felixcheung/rcsvprogrammingguide.
      fb2e8eeb
    • Felix Cheung's avatar
      [SPARKR][MINOR] R examples and test updates · b4baf086
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Minor example updates
      
      ## How was this patch tested?
      
      manual
      
      shivaram
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #14171 from felixcheung/rexample.
      b4baf086
    • James Thomas's avatar
      [SPARK-16114][SQL] updated structured streaming guide · 51a6706b
      James Thomas authored
      ## What changes were proposed in this pull request?
      
      Updated structured streaming programming guide with new windowed example.
      
      ## How was this patch tested?
      
      Docs
      
      Author: James Thomas <jamesjoethomas@gmail.com>
      
      Closes #14183 from jjthomas/ss_docs_update.
      51a6706b
    • Burak Yavuz's avatar
      [SPARK-16531][SQL][TEST] Remove timezone setting from DataFrameTimeWindowingSuite · 0744d84c
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      It's unnecessary. `QueryTest` already sets it.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #14170 from brkyvz/test-tz.
      0744d84c
    • Joseph K. Bradley's avatar
      [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML · 01f09b16
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      General decisions to follow, except where noted:
      * spark.mllib, pyspark.mllib: Remove all Experimental annotations.  Leave DeveloperApi annotations alone.
      * spark.ml, pyspark.ml
      ** Annotate Estimator-Model pairs of classes and companion objects the same way.
      ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation.
      ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation.
      * DeveloperApi annotations are left alone, except where noted.
      * No changes to which types are sealed.
      
      Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
      * Model Summary classes
      * MLWriter, MLReader, MLWritable, MLReadable
      * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
      * RFormula: Its behavior may need to change slightly to match R in edge cases.
      * AFTSurvivalRegression
      * MultilayerPerceptronClassifier
      
      DeveloperApi changes:
      * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi
      
      ## How was this patch tested?
      
      N/A
      
      Note to reviewers:
      * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
      * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature.  I did not find such cases, but please verify.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14147 from jkbradley/experimental-audit.
      01f09b16
    • jerryshao's avatar
      [SPARK-16435][YARN][MINOR] Add warning log if initialExecutors is less than minExecutors · d8220c1e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Currently if `spark.dynamicAllocation.initialExecutors` is less than `spark.dynamicAllocation.minExecutors`, Spark will silently pick minExecutors without any warning, whereas in 1.6 Spark would throw an exception for such a configuration. So here we propose adding a warning log when these parameters are configured inconsistently.
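
      The proposed check can be sketched in Python (the helper name and signature here are illustrative, not Spark's actual code):

```python
import logging

def resolve_initial_executors(initial, minimum):
    # Spark silently bumps initialExecutors up to minExecutors; the
    # change proposed here is to log a warning when that happens.
    if initial < minimum:
        logging.warning(
            "spark.dynamicAllocation.initialExecutors (%d) is less than "
            "spark.dynamicAllocation.minExecutors (%d); using %d instead.",
            initial, minimum, minimum)
        return minimum
    return initial
```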
      
      ## How was this patch tested?
      
      Unit test added to verify the scenario.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14149 from jerryshao/SPARK-16435.
      d8220c1e
    • 蒋星博's avatar
      [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates... · f376c372
      蒋星博 authored
      [SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition.
      
      ## What changes were proposed in this pull request?
      
      Currently our Optimizer may reorder predicates to run them more efficiently, but under non-deterministic conditions, changing the order between the deterministic and non-deterministic parts may change the number of input rows. For example:
      ```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```
      and
      ```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```
      may call rand() a different number of times and therefore produce different output rows.
      
      This PR improves the rule by only pushing a predicate down if it is placed before any non-deterministic predicates.
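
      A plain-Python sketch (hypothetical helper, not Spark code) of why the order matters: the non-deterministic predicate is invoked a different number of times depending on whether it runs before or after the deterministic one:

```python
import random

def run(rows, predicate_order, seed=42):
    """Apply predicates left-to-right with short-circuiting and count
    how often the rand()-style predicate is invoked."""
    random.seed(seed)
    calls = [0]

    def rand_pred(row):        # non-deterministic: rand() < 0.1
        calls[0] += 1
        return random.random() < 0.1

    def det_pred(row):         # deterministic: a = 1
        return row == 1

    preds = {"rand": rand_pred, "det": det_pred}
    kept = [r for r in rows if all(preds[n](r) for n in predicate_order)]
    return kept, calls[0]

rows = [1, 1, 2, 3, 1]
_, rand_calls_first = run(rows, ["rand", "det"])
_, rand_calls_last = run(rows, ["det", "rand"])
# rand() runs on all 5 rows when placed first, but only on the 3 rows
# satisfying a = 1 when the deterministic predicate runs before it.
assert rand_calls_first == 5
assert rand_calls_last == 3
```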
      
      ## How was this patch tested?
      
      Expanded related testcases in FilterPushdownSuite.
      
      Author: 蒋星博 <jiangxingbo@meituan.com>
      
      Closes #14012 from jiangxb1987/ppd.
      f376c372
    • oraviv's avatar
      [SPARK-16469] enhanced simulate multiply · ea06e4ef
      oraviv authored
      ## What changes were proposed in this pull request?
      
      We have a use case that multiplies very large sparse matrices: roughly 1000x1000 distributed block matrices. The simulate-multiply step scales like O(n^4) (n being 1000) and takes about 1.5 hours. We modified it slightly with a classical hashmap and it now runs in about 30 seconds, O(n^2).
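
      The idea behind the hashmap change can be sketched in Python (illustrative names, not the MLlib code): index B's non-empty blocks by their row coordinate so each A block is paired only with matching B blocks, instead of scanning all block pairs:

```python
from collections import defaultdict

def simulate_multiply(a_blocks, b_blocks):
    """Return the (i, j) coordinates of result blocks produced by A * B.

    a_blocks: (i, k) coordinates of non-empty blocks of A
    b_blocks: (k, j) coordinates of non-empty blocks of B
    """
    # Hashmap from B's row coordinate k to the column coordinates j of
    # its non-empty blocks; avoids the all-pairs scan.
    by_row = defaultdict(list)
    for k, j in b_blocks:
        by_row[k].append(j)
    dests = set()
    for i, k in a_blocks:
        for j in by_row.get(k, ()):
            dests.add((i, j))
    return dests
```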
      
      ## How was this patch tested?
      
      We have added a performance test and verified the reduced time.
      
      Author: oraviv <oraviv@paypal.com>
      
      Closes #14068 from uzadude/master.
      ea06e4ef
    • Sean Owen's avatar
      [SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM for long runs · 51ade51a
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14153 from srowen/SPARK-16440.
      51ade51a
    • sharkd's avatar
      [MINOR][YARN] Fix code error in yarn-cluster unit test · 3d6f679c
      sharkd authored
      ## What changes were proposed in this pull request?
      
      Fix code error in yarn-cluster unit test.
      
      ## How was this patch tested?
      
      Use existing tests
      
      Author: sharkd <sharkd.tu@gmail.com>
      
      Closes #14166 from sharkdtu/master.
      3d6f679c
    • sandy's avatar
      [SPARK-16438] Add Asynchronous Actions documentation · bf107f1e
      sandy authored
      ## What changes were proposed in this pull request?
      
      Add Asynchronous Actions documentation inside the Actions section of the programming guide
      
      ## How was this patch tested?
      
      Checked the documentation indentation and formatting with a Markdown preview.
      
      Author: sandy <phalodi@gmail.com>
      
      Closes #14104 from phalodi/SPARK-16438.
      bf107f1e
    • Maciej Brynski's avatar
      [SPARK-16439] Fix number formatting in SQL UI · 83879ebc
      Maciej Brynski authored
      ## What changes were proposed in this pull request?
      
      The Spark SQL UI displays numbers greater than 1000 with \u00A0 (a non-breaking space) as the grouping separator.
      The problem occurs when the server locale uses a non-breaking space as its separator (for example pl_PL).
      This patch turns off grouping and removes the separator.
      
      The problem starts with this PR.
      https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125
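
      The effect of the fix can be illustrated in Python (a hypothetical formatter; the actual change is in Spark's Java number formatting): with grouping enabled, a pl_PL-style locale inserts U+00A0 between digit groups, while with grouping disabled the digits are emitted plainly:

```python
def format_without_grouping(value):
    # Avoid the locale's grouping separator entirely; under pl_PL it is
    # a non-breaking space (U+00A0), which garbles the UI.
    return str(int(value))

grouped_pl = "1\u00a0234\u00a0567"   # what the UI rendered before the fix
assert format_without_grouping(1234567) == "1234567"
assert "\u00a0" not in format_without_grouping(1234567)
```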
      
      ## How was this patch tested?
      
      Manual UI tests. Screenshot attached.
      
      ![image](https://cloud.githubusercontent.com/assets/4006010/16749556/5cb5a372-47cb-11e6-9a95-67fd3f9d1c71.png)
      
      Author: Maciej Brynski <maciej.brynski@adpilot.pl>
      
      Closes #14142 from maver1ck/master.
      83879ebc
    • Xin Ren's avatar
      [MINOR] Fix Java style errors and remove unused imports · f73891e0
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Fix Java style errors and remove unused imports, which are randomly found
      
      ## How was this patch tested?
      
      Tested on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14161 from keypointt/SPARK-16437.
      f73891e0
    • Alex Bozarth's avatar
      [SPARK-16375][WEB UI] Fixed misassigned var: numCompletedTasks was assigned to numSkippedTasks · f156136d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
      I fixed a misassigned variable: numCompletedTasks was assigned to numSkippedTasks in the convertJobData method.
      
      ## How was this patch tested?
      
      dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #14141 from ajbozarth/spark16375.
      f156136d
    • Sean Owen's avatar
      [SPARK-15889][STREAMING] Follow-up fix to erroneous condition in StreamTest · c190d89b
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      A second form of AssertQuery now actually invokes the condition; avoids a build warning too
      
      ## How was this patch tested?
      
      Jenkins; running StreamTest
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14133 from srowen/SPARK-15889.2.
      c190d89b
    • aokolnychyi's avatar
      [SPARK-16303][DOCS][EXAMPLES] Updated SQL programming guide and examples · 772c213e
      aokolnychyi authored
      - Hard-coded Spark SQL sample snippets were moved into source files under examples sub-project.
      - Removed the inconsistency between Scala and Java Spark SQL examples
      - Scala and Java Spark SQL examples were updated
      
      The work is still in progress. All involved examples were tested manually. An additional round of testing will be done after the code review.
      
      ![image](https://cloud.githubusercontent.com/assets/6235869/16710314/51851606-462a-11e6-9fbe-0818daef65e4.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #14119 from aokolnychyi/spark_16303.
      772c213e
    • Eric Liang's avatar
      [SPARK-16514][SQL] Fix various regex codegen bugs · 1c58fa90
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      RegexExtract and RegexReplace currently crash on non-nullable input due to the use of a hard-coded local variable name (e.g. compilation fails with `java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 85, Column 26: Redefinition of local variable "m" `).
      
      This changes those variables to use fresh names, and does the same in a few other places.
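
      The fresh-name approach can be sketched as a minimal Python analogue of a codegen context (Spark's real mechanism is `CodegenContext.freshName` in Scala): every declared variable gets a unique suffix, so two instances of the same expression can never redefine the same local such as "m":

```python
import itertools

class CodegenContext:
    """Toy fresh-name generator for generated code."""
    def __init__(self):
        self._counter = itertools.count()

    def fresh_name(self, hint):
        # Append a monotonically increasing suffix to the hint so each
        # generated variable name is unique within one compilation unit.
        return f"{hint}_{next(self._counter)}"

ctx = CodegenContext()
m1 = ctx.fresh_name("m")   # "m_0"
m2 = ctx.fresh_name("m")   # "m_1" -- no redefinition, even for the same hint
```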
      
      ## How was this patch tested?
      
      Unit tests. rxin
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #14168 from ericl/sc-3906.
      1c58fa90
  3. Jul 12, 2016
    • petermaxlee's avatar
      [SPARK-16284][SQL] Implement reflect SQL function · 56bd399a
      petermaxlee authored
      ## What changes were proposed in this pull request?
      This patch implements reflect SQL function, which can be used to invoke a Java method in SQL. Slightly different from Hive, this implementation requires the class name and the method name to be literals. This implementation also supports only a smaller number of data types, and requires the function to be static, as suggested by rxin in #13969.
      
      java_method is an alias for reflect, so this should also resolve SPARK-16277.
      
      ## How was this patch tested?
      Added expression unit tests and an end-to-end test.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      
      Closes #14138 from petermaxlee/reflect-static.
      56bd399a
    • Marcelo Vanzin's avatar
      [SPARK-16119][SQL] Support PURGE option to drop table / partition. · 7f968867
      Marcelo Vanzin authored
      This option is used by Hive to directly delete the files instead of
      moving them to the trash. This is needed in certain configurations
      where moving the files does not work. For non-Hive tables and partitions,
      Spark already behaves as if the PURGE option was set, so there's no
      need to do anything.
      
      Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for
      partitions), so the code reflects that: trying to use the option with
      older versions of Hive will cause an exception to be thrown.
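
      A minimal sketch of the PURGE semantics in Python (hypothetical paths, not the Hive client code): with PURGE the table's files are deleted directly; without it they are moved to the trash directory:

```python
import os
import shutil
import tempfile

def drop_table_files(table_path, trash_dir, purge=False):
    if purge:
        # PURGE: delete the files directly, bypassing the trash. Needed
        # where moving files (e.g. across filesystems) does not work.
        shutil.rmtree(table_path)
    else:
        # Default: mimic moving the files to the trash directory.
        shutil.move(table_path,
                    os.path.join(trash_dir, os.path.basename(table_path)))
```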
      
      The change is a little noisier than I would like, because of the code
      to propagate the new flag through all the interfaces and implementations;
      the main changes are in the parser and in HiveShim, aside from the tests
      (DDLCommandSuite, VersionsSuite).
      
      Tested by running sql and catalyst unit tests, plus VersionsSuite which
      has been updated to test the version-specific behavior. I also ran an
      internal test suite that uses PURGE and would not pass previously.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13831 from vanzin/SPARK-16119.
      7f968867
    • Yangyang Liu's avatar
      [SPARK-16405] Add metrics and source for external shuffle service · 68df47ac
      Yangyang Liu authored
      ## What changes were proposed in this pull request?
      
      Since the external shuffle service is essential for Spark, better monitoring of it is necessary. To that end, we added various metrics to the shuffle service and exposed them through ExternalShuffleServiceSource for the metrics system.
      Metrics added to the shuffle service:
      * registeredExecutorsSize
      * openBlockRequestLatencyMillis
      * registerExecutorRequestLatencyMillis
      * blockTransferRateBytes
      
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-16405
      
      ## How was this patch tested?
      
      Test cases were added to verify that metrics are recorded as expected in the metrics system; see the unit tests in `ExternalShuffleBlockHandlerSuite`.
      
      Author: Yangyang Liu <yangyangliu@fb.com>
      
      Closes #14080 from lovexi/yangyang-metrics.
      68df47ac
    • sharkd's avatar
      [SPARK-16414][YARN] Fix bugs for "Can not get user config when calling... · d513c99c
      sharkd authored
      [SPARK-16414][YARN] Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf on yarn cluser mode"
      
      ## What changes were proposed in this pull request?
      
      The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster` in `ApplicationMaster.main` when deploying Spark on YARN in cluster mode, so the `conf` in the `SparkHadoopUtil` singleton didn't include the user's configuration.
      
      So we should load the properties file with the Spark configuration and set its entries as system properties before `SparkHadoopUtil` is first instantiated.
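
      The ordering issue can be sketched in Python with a stand-in singleton (names are illustrative): a singleton that snapshots the "system properties" at first instantiation never sees properties set afterwards, which is why the properties file must be loaded first:

```python
_props = {}          # stand-in for JVM system properties

class HadoopUtil:
    """Stand-in for the SparkHadoopUtil singleton: it snapshots the
    system properties once, at construction time."""
    _instance = None

    def __init__(self):
        self.conf = dict(_props)

    @classmethod
    def get(cls):
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance

# Correct order (what the fix does): load the user's properties into
# system properties *before* the singleton is first touched.
_props["spark.user.key"] = "value"
assert HadoopUtil.get().conf["spark.user.key"] == "value"

# The reversed order reproduces the bug: late properties are invisible.
_props["spark.late.key"] = "late"
assert "spark.late.key" not in HadoopUtil.get().conf
```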
      
      ## How was this patch tested?
      
      Add a test case
      
      Author: sharkd <sharkd.tu@gmail.com>
      Author: sharkdtu <sharkdtu@tencent.com>
      
      Closes #14088 from sharkdtu/master.
      d513c99c
    • Reynold Xin's avatar
      [SPARK-16489][SQL] Guard against variable reuse mistakes in expression code generation · c377e49e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In code generation, it is incorrect for an expression to reuse variable names across different instances of itself. As an example, SPARK-16488 reports a bug in which the pmod expression reuses the variable name "r".
      
      This patch updates ExpressionEvalHelper test harness to always project two instances of the same expression, which will help us catch variable reuse problems in expression unit tests. This patch also fixes the bug in crc32 expression.
      
      ## How was this patch tested?
      This is a test harness change, but I also created a new test suite for testing the test harness.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14146 from rxin/SPARK-16489.
      c377e49e
    • Lianhui Wang's avatar
      [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose... · 5ad68ba5
      Lianhui Wang authored
      [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators.
      
      ## What changes were proposed in this pull request?
      When a query only uses metadata (for example, a partition key), it can return results based on metadata alone, without scanning files. Hive does this in HIVE-1003.
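
      A toy Python sketch of the optimization (hypothetical catalog structure): a distinct over a partition column can be answered from the catalog's partition metadata without touching any data files:

```python
def distinct_partition_values(partitions, column):
    # Answer `SELECT DISTINCT column FROM t` purely from the catalog's
    # partition metadata; no data files are scanned.
    return sorted({p[column] for p in partitions})

catalog_partitions = [{"dt": "2016-07-11"},
                      {"dt": "2016-07-12"},
                      {"dt": "2016-07-12"}]
assert distinct_partition_values(catalog_partitions, "dt") == \
    ["2016-07-11", "2016-07-12"]
```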
      
      ## How was this patch tested?
      add unit tests
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
      
      Closes #13494 from lianhuiwang/metadata-only.
      5ad68ba5
    • WeichenXu's avatar
      [SPARK-16470][ML][OPTIMIZER] Check linear regression training whether actually... · 6cb75db9
      WeichenXu authored
      [SPARK-16470][ML][OPTIMIZER] Check linear regression training whether actually reach convergence and add warning if not
      
      ## What changes were proposed in this pull request?
      
      In `ml.regression.LinearRegression`, we use breeze's `LBFGS` and `OWLQN` optimizers for training, but we do not check whether the result returned by breeze's optimizer actually reached convergence.
      
      The `LBFGS` and `OWLQN` optimizers in breeze may finish iterating in any of the following situations:
      
      1) the max iteration number is reached
      2) the function value converges
      3) the objective function stops improving
      4) the gradient converges
      5) the search fails (due to some internal numerical error)
      
      This PR adds warning-printing code so that if the iteration ends in situation (1), (3), or (5) above, it prints a warning with the respective reason string.
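
      The proposed check can be sketched in Python (hypothetical constants standing in for breeze's stop reasons):

```python
import logging

# Hypothetical stop reasons mirroring the five breeze outcomes above.
MAX_ITERATIONS, VALUE_CONVERGED, OBJECTIVE_STALLED, \
    GRADIENT_CONVERGED, SEARCH_FAILED = range(5)

_NON_CONVERGED = {
    MAX_ITERATIONS: "reached the max number of iterations without converging",
    OBJECTIVE_STALLED: "stopped because the objective stopped improving",
    SEARCH_FAILED: "failed during line search due to a numerical error",
}

def warn_if_not_converged(stop_reason):
    # Warn only for outcomes (1), (3) and (5); (2) and (4) indicate
    # genuine convergence and need no warning.
    message = _NON_CONVERGED.get(stop_reason)
    if message is not None:
        logging.warning("LinearRegression training %s", message)
    return message is not None
```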
      
      ## How was this patch tested?
      
      Manual.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.
      6cb75db9
    • Takuya UESHIN's avatar
      [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a... · 5b28e025
      Takuya UESHIN authored
      [SPARK-16189][SQL] Add ExternalRDD logical plan for input with RDD to have a chance to eliminate serialize/deserialize.
      
      ## What changes were proposed in this pull request?
      
      Currently the input `RDD` of a `Dataset` is always serialized to `RDD[InternalRow]` before being used as a `Dataset`, but there are cases where we call `map` or `mapPartitions` right after the conversion. In those cases a serialize-then-deserialize round trip happens even though it is not needed.
      
      This PR adds an `ExternalRDD` logical plan for `RDD` input to give us a chance to eliminate the serialize/deserialize step.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13890 from ueshin/issues/SPARK-16189.
      5b28e025
    • WeichenXu's avatar
      [MINOR][ML] update comment where is inconsistent with code in ml.regression.LinearRegression · fc11c509
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      In the `train` method of `ml.regression.LinearRegression`, when handling the situation `std(label) == 0`,
      the code replaces `std(label)` with `mean(label)`, but the related comment is inconsistent with the code; I updated it.
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14121 from WeichenXu123/update_lr_comment.
      fc11c509
    • petermaxlee's avatar
      [SPARK-16199][SQL] Add a method to list the referenced columns in data source Filter · c9a67621
      petermaxlee authored
      ## What changes were proposed in this pull request?
      It would be useful to support listing the columns that are referenced by a filter. This can help simplify data source planning, because with this we would be able to implement the `unhandledFilters` method in `HadoopFsRelation`.
      
      This is based on rxin's patch (#13901) and adds unit tests.
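
      The idea can be sketched in Python (a toy analogue; the real API is a `references` method on the Scala `Filter` hierarchy in `org.apache.spark.sql.sources`): each filter node reports the column names it touches, and compound filters combine their children's references:

```python
class Filter:
    def references(self):
        raise NotImplementedError

class EqualTo(Filter):
    def __init__(self, attribute, value):
        self.attribute, self.value = attribute, value
    def references(self):
        # A leaf filter references exactly its attribute.
        return [self.attribute]

class And(Filter):
    def __init__(self, left, right):
        self.left, self.right = left, right
    def references(self):
        # A compound filter references whatever its children reference.
        return self.left.references() + self.right.references()

f = And(EqualTo("a", 1), EqualTo("b", 2))
assert f.references() == ["a", "b"]
```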
      
      ## How was this patch tested?
      Added a new suite FiltersSuite.
      
      Author: petermaxlee <petermaxlee@gmail.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14120 from petermaxlee/SPARK-16199.
      c9a67621
  4. Jul 11, 2016
    • Russell Spitzer's avatar
      [SPARK-12639][SQL] Mark Filters Fully Handled By Sources with * · b1e5281c
      Russell Spitzer authored
      ## What changes were proposed in this pull request?
      
      In order to make it clear which filters are fully handled by the
      underlying data source, we mark them with an *. This gives users a
      clear visual cue that the filter is treated differently by Catalyst
      than filters which are merely presented to the underlying
      DataSource.
      
      Examples from the FilteredScanSuite, in this example `c IN (...)` is handled by the source, `b < ...` is not
      ### Before
      ```
      //SELECT a FROM oneToTenFiltered WHERE a + b > 9 AND b < 16 AND c IN ('bbbbbBBBBB', 'cccccCCCCC', 'dddddDDDDD', 'foo')
      == Physical Plan ==
      Project [a#0]
      +- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
         +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
      ```
      
      ### After
      ```
      == Physical Plan ==
      Project [a#0]
      +- Filter (((a#0 + b#1) > 9) && (b#1 < 16))
         +- Scan SimpleFilteredScan(1,10)[a#0,b#1] PushedFilters: [LessThan(b,16), *In(c, [bbbbbBBBBB,cccccCCCCC,dddddDDDDD,foo]]
      ```
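
      The marking itself reduces to prefixing fully-handled filters when rendering the PushedFilters list; a Python sketch with made-up filter strings:

```python
def describe_pushed_filters(pushed, handled):
    # Prefix each filter that the source fully handles with '*'.
    return "PushedFilters: [" + ", ".join(
        ("*" + f if f in handled else f) for f in pushed) + "]"

out = describe_pushed_filters(
    ["LessThan(b,16)", "In(c, [...])"], handled={"In(c, [...])"})
assert out == "PushedFilters: [LessThan(b,16), *In(c, [...])]"
```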
      
      ## How was this patch tested?
      
      Manually tested with the Spark Cassandra Connector, a source which fully handles underlying filters. Now fully handled filters appear with an * next to their names. I can add an automated test as well if requested.
      
      Post 1.6.1
      Tested by modifying the FilteredScanSuite to run explains.
      
      Author: Russell Spitzer <Russell.Spitzer@gmail.com>
      
      Closes #11317 from RussellSpitzer/SPARK-12639-Star.
      b1e5281c
    • Sameer Agarwal's avatar
      [SPARK-16488] Fix codegen variable namespace collision in pmod and partitionBy · 9cc74f95
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a variable namespace collision bug in pmod and partitionBy
      
      ## How was this patch tested?
      
      Regression test for one possible occurrence. A more general fix in `ExpressionEvalHelper.checkEvaluation` will be in a subsequent PR.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #14144 from sameeragarwal/codegen-bug.
      9cc74f95
    • Tathagata Das's avatar
      [SPARK-16430][SQL][STREAMING] Fixed bug in the maxFilesPerTrigger in FileStreamSource · e50efd53
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      An incorrect list of files was being allocated to a batch. This caused a file to be read multiple times across batches.
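
      The intended allocation can be sketched in Python (illustrative helper, not the `FileStreamSource` code): each batch takes only files not yet processed, capped at maxFilesPerTrigger, and marks them seen so no file is assigned to a later batch again:

```python
def next_batch(available_files, seen, max_files_per_trigger):
    # Take only unseen files, capped at maxFilesPerTrigger, and record
    # them so no file is read twice across batches.
    batch = [f for f in available_files if f not in seen]
    batch = batch[:max_files_per_trigger]
    seen.update(batch)
    return batch

seen = set()
files = ["f1", "f2", "f3", "f4", "f5"]
assert next_batch(files, seen, 2) == ["f1", "f2"]
assert next_batch(files, seen, 2) == ["f3", "f4"]
assert next_batch(files, seen, 2) == ["f5"]
```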
      
      ## How was this patch tested?
      
      Added unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #14143 from tdas/SPARK-16430-1.
      e50efd53
    • Shixiong Zhu's avatar
      [SPARK-16433][SQL] Improve StreamingQuery.explain when no data arrives · 91a443b8
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Display `No physical plan. Waiting for data.` instead of `N/A` for StreamingQuery.explain when no data arrives, because `N/A` doesn't provide meaningful information.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #14100 from zsxwing/SPARK-16433.
      91a443b8
    • Xin Ren's avatar
      [MINOR][STREAMING][DOCS] Minor changes on kinesis integration · 05d7151c
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Some minor changes for documentation page "Spark Streaming + Kinesis Integration".
      
      Moved "streaming-kinesis-arch.png" before the bullet list, not in between the bullets.
      
      ## How was this patch tested?
      
      Tested manually, on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14097 from keypointt/kinesisDoc.
      05d7151c
    • James Thomas's avatar
      [SPARK-16114][SQL] structured streaming event time window example · 9e2c763d
      James Thomas authored
      ## What changes were proposed in this pull request?
      
      A structured streaming example with event time windowing.
      
      ## How was this patch tested?
      
      Run locally
      
      Author: James Thomas <jamesjoethomas@gmail.com>
      
      Closes #13957 from jjthomas/current.
      9e2c763d
    • Marcelo Vanzin's avatar
      [SPARK-16349][SQL] Fall back to isolated class loader when classes not found. · b4fbe140
      Marcelo Vanzin authored
      Some Hadoop classes needed by the Hive metastore client jars are not present
      in Spark's packaging (for example, "org/apache/hadoop/mapred/MRVersion"). So
      if the parent class loader fails to find a class, try to load it from the
      isolated class loader, in case it's available there.
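
      The fallback can be sketched in Python, with toy loaders as dict lookups (the real change is in the Hive client's isolated class loader): try the parent loader first, and only on failure consult the isolated one:

```python
def load_class(name, parent_lookup, isolated_lookup):
    # Try the parent class loader first; if it cannot find the class,
    # fall back to the isolated loader that ships the Hive/Hadoop jars.
    try:
        return parent_lookup(name)
    except KeyError:
        return isolated_lookup(name)

parent = {"java.lang.String": "String"}
isolated = {"org.apache.hadoop.mapred.MRVersion": "MRVersion"}
assert load_class("java.lang.String",
                  parent.__getitem__, isolated.__getitem__) == "String"
assert load_class("org.apache.hadoop.mapred.MRVersion",
                  parent.__getitem__, isolated.__getitem__) == "MRVersion"
```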
      
      Tested by setting spark.sql.hive.metastore.jars to local paths with Hive/Hadoop
      libraries and verifying that Spark can talk to the metastore.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #14020 from vanzin/SPARK-16349.
      b4fbe140
    • Felix Cheung's avatar
      [SPARK-16144][SPARKR] update R API doc for mllib · 7f38b9d5
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      From SPARK-16140/PR #13921 - the issue is we left write.ml doc empty:
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png)
      
      Here's what I meant as the fix:
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png)
      
      ![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png)
      
      I didn't realize there was already a JIRA on this. mengxr yanboliang
      
      ## How was this patch tested?
      
      check doc generated.
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #13993 from felixcheung/rmllibdoc.
      7f38b9d5
    • Yanbo Liang's avatar
      [SPARKR][DOC] SparkR ML user guides update for 2.0 · 2ad031be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Update SparkR ML section to make them consistent with SparkR API docs.
      * Since #13972 added labelling support for the ```include_example``` Jekyll plugin, we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them in different algorithms/models in the generated HTML page.
      
      ## How was this patch tested?
      Only docs update, manually check the generated docs.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14011 from yanboliang/r-user-guide-update.
      2ad031be