  1. Apr 23, 2016
    • [SPARK-14869][SQL] Don't mask exceptions in ResolveRelations · 890abd12
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.
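The before/after pattern can be sketched in Python (illustrative names; the real code is Scala in `ResolveRelations`):

```python
class Catalog:
    """Toy stand-in for the session catalog (assumed shape, not Spark's API)."""
    def __init__(self, tables):
        self._tables = tables

    def table_exists(self, name):
        return name in self._tables

    def lookup_relation(self, name):
        # Any bug raised in here was previously swallowed by the caller.
        return self._tables[name]

def resolve_before(catalog, name):
    # Before: a blanket except masks *every* exception, not just "no such table".
    try:
        return catalog.lookup_relation(name)
    except Exception:
        return None

def resolve_after(catalog, name):
    # After: test existence explicitly; unexpected exceptions propagate up.
    if not catalog.table_exists(name):
        return None
    return catalog.lookup_relation(name)
```

With the explicit existence test, a genuine bug inside the lookup surfaces instead of being silently turned into "relation not found".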
      
      ## How was this patch tested?
      I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12634 from rxin/SPARK-14869.
      890abd12
    • [SPARK-14872][SQL] Restructure command package · 5c8a0ec9
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch restructures sql.execution.command package to break the commands into multiple files, in some logical organization: databases, tables, views, functions.
      
      I also renamed basicOperators.scala to basicLogicalOperators.scala and basicPhysicalOperators.scala.
      
      ## How was this patch tested?
      N/A - all I did was move code around.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12636 from rxin/SPARK-14872.
      5c8a0ec9
    • [SPARK-14871][SQL] Disable StatsReportListener to declutter output · fddd3aee
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Spark SQL inherited the use of StatsReportListener from Shark. Unfortunately this clutters the spark-sql CLI output and makes it very difficult to read the actual query results.
      
      ## How was this patch tested?
      Built and tested in spark-sql CLI.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12635 from rxin/SPARK-14871.
      fddd3aee
    • [HOTFIX] disable generated aggregate map · ee6b209a
      Davies Liu authored
      ee6b209a
    • Turn script transformation back on. · f0bba744
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      
      (Please fill in changes proposed in this fix)
      
      ## How was this patch tested?
      
      (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12565 from rxin/test-flaky.
      f0bba744
    • [SPARK-14594][SPARKR] check execution return status code · 39d3bc62
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      When the JVM backend fails without going through proper error handling (e.g. the process crashed), the R error message can be ambiguous.
      
      ```
      Error in if (returnStatus != 0) { : argument is of length zero
      ```
      
      This change attempts to make the error clearer (one would still need to investigate why the JVM failed).
      
      ## How was this patch tested?
      
      manually
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12622 from felixcheung/rreturnstatus.
      39d3bc62
    • Closes some open PRs that have been requested to close. · 6acc72a0
      Reynold Xin authored
      Closes #7647
      Closes #8195
      Closes #8741
      Closes #8972
      Closes #9490
      Closes #10419
      Closes #10761
      Closes #11003
      Closes #11201
      Closes #11803
      Closes #12111
      Closes #12442
      6acc72a0
    • [SPARK-14873][CORE] Java sampleByKey methods take ju.Map but with Scala Double values; results in type Object · be0d5d3b
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Java `sampleByKey` methods should accept `Map` with `java.lang.Double` values
      
      ## How was this patch tested?
      
      Existing (updated) Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #12637 from srowen/SPARK-14873.
      be0d5d3b
    • [SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame · a55fbe2a
      felixcheung authored
      ## What changes were proposed in this pull request?
      
      Changed the class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vector, already defines "DataFrame"; this change avoids the conflict.
      
      Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vector does not define an `as.DataFrame`).
      
      Since in R one would rarely reference the type/class directly, this change should have minimal to almost no back-compat impact on SparkR users.
      
      ## How was this patch tested?
      
      SparkR tests, manually loading S4Vector then SparkR package
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #12621 from felixcheung/rdataframe.
      a55fbe2a
    • [MINOR][ML][MLLIB] Remove unused imports · 86ca8fef
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Delete unused imports in ML/MLlib.
      
      ## How was this patch tested?
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12497 from zhengruifeng/del_unused_imports.
      86ca8fef
    • [SPARK-14551][SQL] Reduce number of NameNode calls in OrcRelation · e5226e30
      Rajesh Balamohan authored
      ## What changes were proposed in this pull request?
      When FileSourceStrategy is used, a record reader is created, which incurs a NameNode (NN) call internally. Later, in OrcRelation.unwrapOrcStructs, it ends up reading the file information again to get the ObjectInspector, which incurs an additional NN call. It would be good to avoid this additional call (specifically for partitioned datasets).
      
      Added OrcRecordReader, which is very similar to OrcNewInputFormat.OrcRecordReader but with an option of exposing the ObjectInspector. This eliminates the need to look up the file later for generating the object inspector, and would be specifically useful for partitioned tables/datasets.
      
      ## How was this patch tested?
      Ran TPC-DS queries manually and also verified by running `org.apache.spark.sql.hive.orc.OrcSuite`, `org.apache.spark.sql.hive.orc.OrcQuerySuite`, `org.apache.spark.sql.hive.orc.OrcPartitionDiscoverySuite`, `OrcHadoopFsRelationSuite`, and `org.apache.spark.sql.hive.execution.HiveCompatibilitySuite`.
      
      
      Author: Rajesh Balamohan <rbalamohan@apache.org>
      
      Closes #12319 from rajeshbalamohan/SPARK-14551.
      e5226e30
    • [SPARK-14866][SQL] Break SQLQuerySuite out into smaller test suites · 95faa731
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch breaks SQLQuerySuite out into smaller test suites. It was a little bit too large for debugging.
      
      ## How was this patch tested?
      This is a test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12630 from rxin/SPARK-14866.
      95faa731
    • [SPARK-14863][SQL] Cache TreeNode's hashCode by default · bdde010e
      Josh Rosen authored
      Caching TreeNode's `hashCode` can lead to orders-of-magnitude performance improvement in certain optimizer rules when operating on huge/complex schemas.
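The memoization can be sketched in Python (a simplified model of an immutable tree node, not Spark's actual `TreeNode`):

```python
class TreeNode:
    """Immutable tree node that memoizes its structural hash."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = tuple(children)
        self._hash = None  # filled in on first use

    def __hash__(self):
        if self._hash is None:
            # Hashing a huge schema tree is expensive; do it exactly once.
            self._hash = hash((self.name, self.children))
        return self._hash

    def __eq__(self, other):
        return (isinstance(other, TreeNode)
                and self.name == other.name
                and self.children == other.children)
```

Because the node is immutable, caching the hash is safe, and optimizer rules that repeatedly put subtrees into hash-based collections no longer pay the recursive cost each time.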
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #12626 from JoshRosen/cache-treenode-hashcode.
      bdde010e
    • [SPARK-14856] [SQL] returning batch correctly · 39a77e15
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the Parquet reader decides whether to return a batch based on either the required schema or the full schema, which is inconsistent; this PR fixes that.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12619 from davies/fix_return_batch.
      39a77e15
  2. Apr 22, 2016
    • [SPARK-14842][SQL] Implement view creation in sql/core · c0611018
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch re-implements view creation command in sql/core, based on the pre-existing view creation command in the Hive module. This consolidates the view creation logical command and physical command into a single one, called CreateViewCommand.
      
      ## How was this patch tested?
      All the code should've been tested by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12615 from rxin/SPARK-14842-2.
      c0611018
    • [SPARK-14807] Create a compatibility module · 7dde1da9
      Yin Huai authored
      ## What changes were proposed in this pull request?
      
      This PR creates a compatibility module in sql (called `hive-1-x-compatibility`), which will host HiveContext in Spark 2.0 (moving HiveContext to here will be done separately). This module is not included in assembly because only users who still want to access HiveContext need it.
      
      ## How was this patch tested?
      I manually tested `sbt/sbt -Phive package` and `mvn -Phive package -DskipTests`.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #12580 from yhuai/compatibility.
      7dde1da9
    • [SPARK-14855][SQL] Add "Exec" suffix to physical operators · d7d0cad0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds "Exec" suffix to all physical operators. Before this patch, Spark's physical operators and logical operators are named the same (e.g. Project could be logical.Project or execution.Project), which caused small issues in code review and bigger issues in code refactoring.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12617 from rxin/exec-node.
      d7d0cad0
    • [SPARK-14832][SQL][STREAMING] Refactor DataSource to ensure schema is inferred only once when creating a file stream · c431a76d
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
      When creating a file stream using sqlContext.read.stream(), existing files are scanned twice to find the schema:
      - Once, when creating a DataSource + StreamingRelation in the DataFrameReader.stream()
      - Again, when creating streaming Source from the DataSource, in DataSource.createSource()
      
      Instead, the schema should be generated only once, at the time of creating the DataFrame; when the streaming source is created, it should just reuse that schema.
      
      The solution proposed in this PR is to add a lazy field in DataSource that caches the schema. Then streaming Source created by the DataSource can just reuse the schema.
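The lazy, cached schema field can be sketched in Python (Spark's is a Scala `lazy val`; the names below are assumptions):

```python
from functools import cached_property

class DataSourceSketch:
    """Infers the schema at most once and reuses it for the streaming source."""
    def __init__(self, path):
        self.path = path
        self.scans = 0  # counts how many times files were listed/inferred

    @cached_property
    def source_schema(self):
        self.scans += 1  # stands in for the expensive directory scan
        return [("value", "string")]

    def create_source(self):
        # Reuses the cached schema instead of inferring it a second time.
        return {"path": self.path, "schema": self.source_schema}
```

The first access triggers the (expensive) inference; every later consumer, including the streaming source, gets the cached value.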
      
      ## How was this patch tested?
      Refactored unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #12591 from tdas/SPARK-14832.
      c431a76d
    • [SPARK-14582][SQL] increase parallelism for small tables · c25b97fc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR tries to increase the parallelism for small tables (a few big files) to reduce query time by decreasing maxSplitBytes. The goal is to have at least one task per CPU in the cluster if the total size of all files is bigger than openCostInBytes * 2 * nCPU.
      
      For example, a small/medium table could be used as a dimension table in a huge query; this will help reduce the time spent waiting for the broadcast.
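The sizing heuristic can be sketched as simple arithmetic (parameter names mirror the configs mentioned above, but the exact formula is an illustrative assumption):

```python
def max_split_bytes(total_bytes, n_cpus, default_max_split, open_cost):
    """Pick a split size that yields at least one split per CPU for small
    tables, floored at the per-file open cost and capped at the default."""
    bytes_per_core = max(1, total_bytes // n_cpus)
    return min(default_max_split, max(open_cost, bytes_per_core))
```

For a big table the default cap wins; for a small table the split size shrinks so every CPU gets a task; the open-cost floor stops splits from becoming absurdly tiny.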
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12344 from davies/more_partition.
      c25b97fc
    • [SPARK-14701][STREAMING] First stop the event loop, then stop the checkpoint writer in JobGenerator · fde1340c
      Liwei Lin authored
      Currently if we call `streamingContext.stop` (e.g. in a `StreamingListener.onBatchCompleted` callback) when a batch is about to complete, a `rejectedException` may get thrown from `checkPointWriter.executor`, since the `eventLoop` will try to process `DoCheckpoint` events even after the `checkPointWriter.executor` was stopped.
      
      Please see [SPARK-14701](https://issues.apache.org/jira/browse/SPARK-14701) for details and stack traces.
      
      ## What changes were proposed in this pull request?
      
      Reversed the stopping order of `event loop` and `checkpoint writer`.
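The corrected ordering can be sketched as a toy model (not Spark's actual JobGenerator code):

```python
class JobGeneratorSketch:
    """Toy model of the shutdown ordering described above."""
    def __init__(self):
        self.events = []

    def stop_event_loop(self):
        self.events.append("event_loop_stopped")

    def stop_checkpoint_writer(self):
        self.events.append("checkpoint_writer_stopped")

    def stop(self):
        # Stop the event loop first so no DoCheckpoint event can reach the
        # checkpoint writer's executor after that executor shuts down.
        self.stop_event_loop()
        self.stop_checkpoint_writer()
```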
      
      ## How was this patch tested?
      
      Existing test suites.
      (No dedicated test suites were added because the change is simple to reason about.)
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #12489 from lw-lin/spark-14701.
      fde1340c
    • [SPARK-14796][SQL] Add spark.sql.optimizer.inSetConversionThreshold config option. · 3647120a
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Currently, the `OptimizeIn` optimizer rule replaces an `In` expression with an `InSet` expression if the size of the value set is greater than a constant, 10.
      This issue aims to make that threshold configurable via `spark.sql.optimizer.inSetConversionThreshold`.
      
      After this PR, `OptimizeIn` is configurable.
      ```scala
      scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [a#7 IN (1,2,3) AS (a IN (1, 2, 3))#8]
      :     +- INPUT
      +- Generate explode([1,2]), false, false, [a#7]
         +- Scan OneRowRelation[]
      
      scala> sqlContext.setConf("spark.sql.optimizer.inSetConversionThreshold", "2")
      
      scala> sql("select a in (1,2,3) from (select explode(array(1,2)) a) T").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [a#16 INSET (1,2,3) AS (a IN (1, 2, 3))#17]
      :     +- INPUT
      +- Generate explode([1,2]), false, false, [a#16]
         +- Scan OneRowRelation[]
      ```
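The shape of the rule can be sketched in Python (a simplified model; the real rule rewrites Catalyst expressions):

```python
DEFAULT_IN_SET_THRESHOLD = 10  # the constant made configurable by this PR

def optimize_in(column, values, threshold=DEFAULT_IN_SET_THRESHOLD):
    """Rewrite a linear IN-list into a hashed INSET when the list is long."""
    if len(values) > threshold:
        # A frozenset gives O(1) membership probes instead of O(n) comparisons.
        return ("INSET", column, frozenset(values))
    return ("IN", column, tuple(values))
```

Lowering the threshold (as in the `setConf` call above) makes the same three-element list convert to the hashed form.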
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (with a new testcase)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12562 from dongjoon-hyun/SPARK-14796.
      3647120a
    • [SPARK-14669] [SQL] Fix some SQL metrics in codegen and added more · 0dcf9dbe
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      1. Fix the "spill size" of TungstenAggregate and Sort
      2. Rename "data size" to "peak memory" to match the actual meaning (also consistent with task metrics)
      3. Added "data size" for ShuffleExchange and BroadcastExchange
      4. Added some timing for Sort, Aggregate and BroadcastExchange (this requires another patch to work)
      
      ## How was this patch tested?
      
      Existing tests.
      ![metrics](https://cloud.githubusercontent.com/assets/40902/14573908/21ad2f00-030d-11e6-9e2c-c544f30039ea.png)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12425 from davies/fix_metrics.
      0dcf9dbe
    • [SPARK-14791] [SQL] fix race condition between broadcast and subquery · 0419d631
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      SparkPlan.prepare() can be called from different threads (BroadcastExchange calls it in a thread pool), and it only makes sure that doPrepare() is called once; a second call to prepare() may return before all the children have finished their own prepare(). Some operator may then call doProduce() before prepareSubqueries(), so `null` is used as the result of the subquery, which is wrong. This causes TPCDS Q23B to return wrong answers sometimes.
      
      This PR adds synchronization to prepare(), making sure all the children have finished prepare() before it returns. It also calls prepare() in produce() (similar to execute()).
      
      Added a check for ScalarSubquery to make sure that the subquery has finished before its result is used.
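The synchronization can be sketched in Python (a simplified model, not Spark's SparkPlan):

```python
import threading

class PlanNode:
    """Simplified model of a synchronized, idempotent prepare()."""
    def __init__(self, children=()):
        self.children = list(children)
        self._prepared = False
        self._lock = threading.Lock()
        self.subqueries_ready = False

    def _do_prepare(self):
        self.subqueries_ready = True  # stands in for prepareSubqueries()

    def prepare(self):
        # The lock is held until the whole subtree is prepared, so a second
        # caller cannot observe a partially prepared plan.
        with self._lock:
            if not self._prepared:
                self._do_prepare()
                for child in self.children:
                    child.prepare()
                self._prepared = True
```

A second concurrent caller blocks on the lock and only returns once the entire subtree is ready, instead of racing past it.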
      
      ## How was this patch tested?
      
      Manually tested with Q23B, no wrong answer anymore.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12600 from davies/fix_risk.
      0419d631
    • [SPARK-14763][SQL] fix subquery resolution · c417cec0
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, a column could be resolved wrongly when columns in both the outer table and the subquery have the same name; we should only resolve the attributes that can't be resolved within the subquery. Those attributes may have the same exprId as other attributes in the subquery, so we should create aliases for them.
      
      Also, the columns in an IN subquery could have the same exprId, so we should create aliases for them as well.
      
      ## How was this patch tested?
      
      Added regression tests. Manually tested TPCDS Q70 and Q95; both work well after this patch.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12539 from davies/fix_subquery.
      c417cec0
    • [SPARK-14762] [SQL] TPCDS Q90 fails to parse · d060da09
      Herman van Hovell authored
      ### What changes were proposed in this pull request?
      TPCDS Q90 fails to parse because the parser treats `AT` as a reserved keyword when it is used as an identifier; `AT` was used as an alias for one of the subqueries. `AT` is not a reserved keyword and should have been registered in the `nonReserved` rule.
      
      In order to prevent this from happening again I have added tests for all keywords that are non-reserved in Hive. See the `nonReserved`, `sql11ReservedKeywordsUsedAsCastFunctionName` & `sql11ReservedKeywordsUsedAsIdentifier` rules in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g.
      
      ### How was this patch tested?
      
      Added tests for all Hive non-reserved keywords to `TableIdentifierParserSuite`.
      
      cc davies
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #12537 from hvanhovell/SPARK-14762.
      d060da09
    • [SPARK-13178] RRDD faces a concurrency issue in case of rdd.zip(rdd).count(). · 1a7fc74c
      Sun Rui authored
      ## What changes were proposed in this pull request?
      The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792.
      This PR just removes a workaround not needed anymore.
      
      ## How was this patch tested?
      SparkR unit tests.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #12606 from sun-rui/SPARK-13178.
      1a7fc74c
    • [SPARK-14841][SQL] Move SQLBuilder into sql/core · aeb52bea
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves SQLBuilder into sql/core so we can in the future move view generation also into sql/core.
      
      ## How was this patch tested?
      Also moved unit tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #12602 from rxin/SPARK-14841.
      aeb52bea
    • [SPARK-14843][ML] Fix encoding error in LibSVMRelation · 8098f158
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      We use `RowEncoder` in the libsvm data source to serialize the label and features read from libsvm files. However, the schema passed to this encoder is not correct. As a result, we can't correctly select the `features` column from the DataFrame. We should use the full data schema instead of `requiredSchema` to serialize the data read in, then project to select the required columns later.
      
      ## How was this patch tested?
      `LibSVMRelationSuite`.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12611 from viirya/fix-libsvm.
      8098f158
    • [SPARK-10001] Consolidate Signaling and SignalLogger. · c089c6f4
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This is a follow-up to #12557, with the following changes:
      
      1. Fixes some of the style issues.
      2. Merges Signaling and SignalLogger into a new class called SignalUtils. It was pretty confusing to have Signaling and Signal in one file, and it was also confusing to have two classes named Signaling and one called the other.
      3. Made logging registration idempotent.
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12605 from rxin/SPARK-10001.
      c089c6f4
    • [SPARK-13266] [SQL] None read/writer options were not translated to "null" · 056883e0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.
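The essence of the fix can be sketched as a small conversion helper (the helper name is hypothetical; the real change lives in PySpark's readwriter.py):

```python
def to_gateway_value(value):
    """Convert an option value for the JVM side: None must stay a real null,
    not the string "None" that a bare str() conversion would produce."""
    if value is None:
        return None           # crosses the Py4J bridge as a JVM null
    if isinstance(value, bool):
        return str(value).lower()  # option strings conventionally use "true"/"false"
    return str(value)
```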
      
      This is based on #11305 from mathieulongtin.
      
      ## How was this patch tested?
      
      Added test to readwriter.py.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: mathieu longtin <mathieu.longtin@nuance.com>
      
      Closes #12494 from viirya/py-df-none-option.
      056883e0
    • [SPARK-14848][SQL] Compare as Set in DatasetSuite - Java encoder · 5bed13a8
      Pete Robbins authored
      ## What changes were proposed in this pull request?
      Change the test to compare sets rather than sequences
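The change amounts to an order-insensitive comparison, e.g.:

```python
def same_elements(actual, expected):
    """Order-insensitive result comparison, as used by the fixed test:
    row order may differ across platforms (e.g. endianness-dependent
    hashing), but the set of rows must match."""
    return set(actual) == set(expected)
```

Note that a set comparison also ignores duplicate rows, so it only suits tests whose expected rows are distinct.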
      
      ## How was this patch tested?
      Full test runs on little endian and big endian platforms
      
      Author: Pete Robbins <robbinspg@gmail.com>
      
      Closes #12610 from robbinspg/DatasetSuiteFix.
      5bed13a8
    • [MINOR][DOC] Fix doc style in ml.ann.Layer and MultilayerPerceptronClassifier · 92675471
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1. Fix the indentation.
      2. Add a missing param description.
      
      ## How was this patch tested?
      unit tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12499 from zhengruifeng/fix_doc.
      92675471
    • [SPARK-6429] Implement hashCode and equals together · bf95b8da
      Joan authored
      ## What changes were proposed in this pull request?
      
      Implement some `hashCode` and `equals` methods together in order to enable the corresponding scalastyle rule.
      This is a first batch, I will continue to implement them but I wanted to know your thoughts.
      
      Author: Joan <joan@goyeau.com>
      
      Closes #12157 from joan38/SPARK-6429-HashCode-Equals.
      bf95b8da
    • [SPARK-14609][SQL] Native support for LOAD DATA DDL command · e09ab5da
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Add the native support for LOAD DATA DDL command that loads data into Hive table/partition.
      
      ## How was this patch tested?
      
      `HiveDDLCommandSuite` and `HiveQuerySuite`. Besides, a few Hive tests (`WindowQuerySuite`, `HiveTableScanSuite` and `HiveSerDeSuite`) also use the `LOAD DATA` command.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #12412 from viirya/ddl-load-data.
      e09ab5da
    • [SPARK-14826][SQL] Remove HiveQueryExecution · 284b15d2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes HiveQueryExecution. As part of this, I consolidated all the describe commands into DescribeTableCommand.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12588 from rxin/SPARK-14826.
      284b15d2
    • [SPARK-10001] [CORE] Interrupt tasks in repl with Ctrl+C · 80127935
      Jakob Odersky authored
      ## What changes were proposed in this pull request?
      
      Improve signal handling to allow interrupting running tasks from the REPL (with Ctrl+C).
      If no tasks are running or Ctrl+C is pressed twice, the signal is forwarded to the default handler resulting in the usual termination of the application.
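The policy can be sketched in Python (a simplified model; the real implementation registers a JVM signal handler):

```python
class SigintPolicy:
    """Sketch of the REPL Ctrl+C policy described above."""
    def __init__(self, cancel_all_jobs, default_handler):
        self.cancel_all_jobs = cancel_all_jobs
        self.default_handler = default_handler
        self._seen_interrupt = False

    def on_sigint(self, jobs_running):
        if jobs_running and not self._seen_interrupt:
            # First Ctrl+C with active jobs: just cancel them.
            self._seen_interrupt = True
            self.cancel_all_jobs()
        else:
            # No jobs, or a second Ctrl+C: fall through to normal termination.
            self.default_handler()
```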
      
      This PR is a rewrite of -- and therefore closes #8216 -- as per piaozhexiu's request
      
      ## How was this patch tested?
      Signal handling is not easily testable therefore no unit tests were added. Nevertheless, the new functionality is implemented in a best-effort approach, soft-failing in case signals aren't available on a specific OS.
      
      Author: Jakob Odersky <jakob@odersky.com>
      
      Closes #12557 from jodersky/SPARK-10001-sigint.
      80127935
  3. Apr 21, 2016
    • [SPARK-14835][SQL] Remove MetastoreRelation dependency from SQLBuilder · 3405cc77
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes SQLBuilder's dependency on MetastoreRelation. We should be able to move SQLBuilder into the sql/core package after this change.
      
      ## How was this patch tested?
      N/A - covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12594 from rxin/SPARK-14835.
      3405cc77
    • [SPARK-14369] [SQL] Locality support for FileScanRDD · 145433f1
      Cheng Lian authored
      (This PR is a rebased version of PR #12153.)
      
      ## What changes were proposed in this pull request?
      
      This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:
      
      1.  Block location lookup
      
          Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog` and converts them into `LocatedFileStatus`es.
      
          Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special case these file systems.
      
      2.  Selecting preferred locations
      
          For each `FilePartition`, we pick the top 3 locations containing the most data to be retrieved. This isn't necessarily the best algorithm out there; further improvements may be brought up in follow-up PRs.
      
      ## How was this patch tested?
      
      Tested by overriding default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #12527 from liancheng/spark-14369-locality-rebased.
      145433f1
    • [SPARK-14680] [SQL] Support all datatypes to use VectorizedHashmap in TungstenAggregate · b29bc3f5
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for all primitive datatypes, decimal types, and string types in the VectorizedHashmap during aggregation.
      
      ## How was this patch tested?
      
      Existing tests for group-by aggregates should already test for all these datatypes. Additionally, manually inspected the generated code for all supported datatypes (details below).
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #12440 from sameeragarwal/all-datatypes.
      b29bc3f5
    • [SPARK-14793] [SQL] Code generation for large complex type exceeds JVM size limit. · f1fdb238
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Code generation for the complex type constructors `CreateArray`, `CreateMap`, `CreateStruct`, and `CreateNamedStruct` exceeds the JVM method size limit when the types have many elements.
      
      We should split the generated code into multiple `apply` functions when the complex types have many elements, as is done for `UnsafeProjection` and other large expressions.
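The splitting strategy can be sketched in Python emitting Java-like method stubs (chunking by statement count here is a stand-in for the real byte-size estimate):

```python
def split_into_apply_functions(statements, max_per_function=100):
    """Chunk generated statements into several apply_N methods so that no
    single method body exceeds the JVM's 64KB bytecode limit."""
    functions = []
    for start in range(0, len(statements), max_per_function):
        body = "\n    ".join(statements[start:start + max_per_function])
        functions.append(
            f"private void apply_{len(functions)}(Object[] values) {{\n    {body}\n}}"
        )
    return functions
```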
      
      ## How was this patch tested?
      
      I added some tests to check whether the generated code for these expressions exceeds the limit.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #12559 from ueshin/issues/SPARK-14793.
      f1fdb238