  1. Feb 01, 2016
    • [SPARK-12637][CORE] Print stage info of finished stages properly · 715a19d5
      Sean Owen authored
      Improve printing of StageInfo in onStageCompleted
      
      See also https://github.com/apache/spark/pull/10585
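
      For context, a minimal sketch (not the patched listener code itself) of how StageInfo surfaces in onStageCompleted; the log format here is purely illustrative:

      ```scala
      import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

      // Illustrative listener: logs basic StageInfo fields when a stage finishes.
      class StageCompletionLogger extends SparkListener {
        override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
          val info = stageCompleted.stageInfo
          val status = info.failureReason.map(r => s"failed: $r").getOrElse("succeeded")
          println(s"Stage ${info.stageId} (${info.name}) with ${info.numTasks} tasks $status")
        }
      }
      ```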
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10922 from srowen/SPARK-12637.
    • [SPARK-13078][SQL] API and test cases for internal catalog · be7a2fc0
      Reynold Xin authored
      This pull request creates an internal catalog API. The creation of this API is the first step towards consolidating SQLContext and HiveContext. I envision we will have two different implementations in Spark 2.0: (1) a simple in-memory implementation, and (2) an implementation based on the current HiveClient (ClientWrapper).
      
      I took a look at Hive's internal metastore interface/implementation, and then created this API based on it. I believe this is the minimal set needed in order to achieve all the needed functionality.
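
      As a rough illustration only (the trait and method names below are hypothetical, not the API introduced by this PR), such a catalog interface with a simple in-memory implementation might look like:

      ```scala
      import scala.collection.mutable

      // Hypothetical sketch: an internal catalog trait that both an in-memory
      // implementation and a Hive-backed one could implement.
      trait InternalCatalogSketch {
        def createDatabase(name: String, ignoreIfExists: Boolean): Unit
        def dropDatabase(name: String, ignoreIfNotExists: Boolean): Unit
        def listDatabases(): Seq[String]
      }

      // The simple in-memory flavor.
      class InMemoryCatalogSketch extends InternalCatalogSketch {
        private val databases = mutable.Set.empty[String]

        override def createDatabase(name: String, ignoreIfExists: Boolean): Unit =
          if (!databases.add(name) && !ignoreIfExists)
            throw new IllegalArgumentException(s"Database $name already exists")

        override def dropDatabase(name: String, ignoreIfNotExists: Boolean): Unit =
          if (!databases.remove(name) && !ignoreIfNotExists)
            throw new IllegalArgumentException(s"Database $name does not exist")

        override def listDatabases(): Seq[String] = databases.toSeq.sorted
      }
      ```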
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10982 from rxin/SPARK-13078.
    • Fix for [SPARK-12854][SQL] Implement complex types support in ColumnarBatch · a2973fed
      Jacek Laskowski authored
      
      Fixes build for Scala 2.11.
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #10946 from jaceklaskowski/SPARK-12854-fix.
    • [SPARK-13043][SQL] Implement remaining catalyst types in ColumnarBatch. · 064b029c
      Nong Li authored
      This includes: float, boolean, short, decimal and calendar interval.
      
      Decimal is mapped to a long or a byte array depending on the size, and calendar interval is mapped to a struct of an int and a long.
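
      As a hedged illustration of that mapping (the precision cutoff and helper name below are assumptions for this sketch, not the actual ColumnarBatch code):

      ```scala
      import org.apache.spark.sql.types.{CalendarIntervalType, DataType, DecimalType}

      // Sketch only: choose a physical layout per type, mirroring the mapping described above.
      // The 18-digit cutoff (what fits in an unscaled long) is an assumption of this sketch.
      def columnLayout(dt: DataType): String = dt match {
        case d: DecimalType if d.precision <= 18 => "long (unscaled value)"
        case _: DecimalType                      => "byte array"
        case CalendarIntervalType                => "struct<months: int, microseconds: long>"
        case other                               => other.simpleString
      }
      ```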
      
      The only remaining type is map. The schema mapping is straightforward but
      we might want to revisit how we deal with this in the rest of the execution
      engine.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #10961 from nongli/spark-13043.
    • [SPARK-12979][MESOS] Don’t resolve paths on the local file system in Mesos scheduler · c9b89a0a
      Iulian Dragos authored
      The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors.
      
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #10923 from dragos/issue/canonical-paths.
    • [SPARK-12265][MESOS] Spark calls System.exit inside driver instead of throwing exception · a41b68b9
      Nilanjan Raychaudhuri authored
      This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.
      
      Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
      Author: Iulian Dragos <jaguarul@gmail.com>
      
      Closes #10921 from dragos/pr/10729.
    • [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zookeeper dir with mesos conf and add docs · 51b03b71
      Timothy Chen authored
      
      Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #10057 from tnachen/fix_mesos_dir.
    • [ML][MINOR] Invalid MulticlassClassification reference in ml-guide · 711ce048
      Lewuathe authored
      In the [ml-guide](https://spark.apache.org/docs/latest/ml-guide.html#example-model-selection-via-cross-validation), there is an invalid reference to the `MulticlassClassificationEvaluator` API doc.
      
      https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.MultiClassClassificationEvaluator
      
      Author: Lewuathe <lewuathe@me.com>
      
      Closes #10996 from Lewuathe/fix-typo-in-ml-guide.
    • [DOCS] Fix the jar location of datanucleus in sql-programming-guide.md · da9146c9
      Takeshi YAMAMURO authored
      It seems to me that `lib` is better because the `datanucleus` jars are located in `lib` for release builds.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #10901 from maropu/DocFix.
    • [SPARK-12705][SPARK-10777][SQL] Analyzer Rule ResolveSortReferences · 8f26eb5e
      gatorsmile authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12705
      
      **Scope:**
      This PR is a general fix for sorting reference resolution when the child's `outputSet` does not have the order-by attributes (called *missing attributes*):
        - UnaryNode support is limited to `Project`, `Window`, `Aggregate`, `Distinct`, `Filter`, `RepartitionByExpression`.
        - We will not try to resolve the missing references inside a subquery, unless the `outputSet` of this subquery contains them.
      
      **General Reference Resolution Rules:**
        - Jump over nodes of the following types: `Distinct`, `Filter`, `RepartitionByExpression`. There is no need to add missing attributes to them, because their `outputSet` is determined by their `inputSet`, which is the `outputSet` of their children.
        - Group-by expressions in `Aggregate`: missing order-by attributes are not allowed to be added to the group-by expressions, since that would change the query result. Thus, RDBMSs do not allow it either.
        - Aggregate expressions in `Aggregate`: if the group-by expressions in `Aggregate` contain the missing attributes but the aggregate expressions do not, just add them to the aggregate expressions. This resolves the `AnalysisException`s thrown by the three TPC-DS queries.
        - `Project` and `Window` are special. We just need to add the missing attributes to their `projectList`.
      
      **Implementation:**
        1. Traverse the whole tree in a pre-order manner to find all the resolvable missing order-by attributes.
        2. Traverse the whole tree in a post-order manner to add the found missing order-by attributes to the node if their `inputSet` contains the attributes.
        3. If the origins of the missing order-by attributes are different nodes, each pass only resolves the missing attributes that are from the same node.
      
      **Risk:**
      Low. This rule is triggered iff `!s.resolved && child.resolved` is true, so very few cases are affected.
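
      For concreteness, a minimal query shape that this rule resolves (table and column names are hypothetical, and a `sqlContext` is assumed to be in scope as in the spark-shell):

      ```scala
      // "age" is a missing order-by attribute: it is not in the SELECT list (the child
      // Project's outputSet), so the analyzer has to add it to the child and then prune
      // it with a final Project so the result schema stays unchanged.
      sqlContext.sql("SELECT name FROM people ORDER BY age")
      ```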
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10678 from gatorsmile/sortWindows.
    • [SPARK-12989][SQL] Delaying Alias Cleanup after ExtractWindowExpressions · 33c8a490
      gatorsmile authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12989
      
      In the rule `ExtractWindowExpressions`, we simply replace an alias by the corresponding attribute. However, this causes an issue exposed by the following case:
      
      ```scala
      val data = Seq(("a", "b", "c", 3), ("c", "b", "a", 3)).toDF("A", "B", "C", "num")
        .withColumn("Data", struct("A", "B", "C"))
        .drop("A")
        .drop("B")
        .drop("C")
      
      val winSpec = Window.partitionBy("Data.A", "Data.B").orderBy($"num".desc)
      data.select($"*", max("num").over(winSpec) as "max").explain(true)
      ```
      In this case, both `Data.A` and `Data.B` are aliases in `WindowSpecDefinition`. If we replace these alias expressions by their alias names, we are unable to know what they are, since they will not be put in `missingExpr` either.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10963 from gatorsmile/seletStarAfterColDrop.
    • [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey is followed by a checkpointed dstream · 6075573a
      Shixiong Zhu authored
      
      Add a local property to indicate whether to checkpoint all RDDs that are marked with the checkpoint flag, and enable it in Streaming.
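
      A hedged sketch of the pattern the fix targets (host, port and paths are placeholders): an updateStateByKey stream whose downstream DStream is checkpointed.

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Sketch only: updateStateByKey produces state RDDs that depend on the previous
      // batch's state RDD, so the lineage can grow across batches unless it is cut.
      val ssc = new StreamingContext(new SparkConf().setAppName("sketch").setMaster("local[2]"), Seconds(1))
      ssc.checkpoint("/tmp/checkpoint-dir") // placeholder checkpoint location
      val pairs = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" ")).map((_, 1))
      val state = pairs.updateStateByKey[Int] { (values: Seq[Int], running: Option[Int]) =>
        Some(values.sum + running.getOrElse(0))
      }
      state.map(identity).checkpoint(Seconds(10)) // checkpointed downstream dstream
      ```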
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10934 from zsxwing/recursive-checkpoint.
    • [SPARK-13093] [SQL] improve null check in nullSafeCodeGen for unary, binary and ternary expression · c1da4d42
      Wenchen Fan authored
      The current implementation is sub-optimal:
      
      * If an expression is always nullable, e.g. `Unhex`, we can still remove the null check for children that are not nullable.
      * If an expression has some non-nullable children, we can still remove the null check for those children and keep it for the others.
      
      This PR improves this by making the null check elimination more fine-grained.
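
      A conceptual sketch of the idea (the helper below is illustrative, not the actual `nullSafeCodeGen` code): emit the null guard only for children that can actually be null.

      ```scala
      // Illustrative only: build the generated-Java fragment for evaluating one child.
      // A non-nullable child skips the isNull guard entirely; a nullable one keeps it.
      def guardedChildCode(childIsNull: String, childNullable: Boolean, body: String): String =
        if (childNullable) s"if (!$childIsNull) { $body }" else body
      ```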
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10987 from cloud-fan/null-check.
  2. Jan 31, 2016
  3. Jan 30, 2016
    • [SPARK-13100][SQL] improving the performance of stringToDate method in DateTimeUtils.scala · de283719
      wangyang authored
      In JDK 1.7, TimeZone.getTimeZone() is synchronized, so use an instance variable to hold a GMT TimeZone object instead of instantiating it every time.
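
      A minimal sketch of that change (object and method names here are illustrative, not the exact DateTimeUtils code):

      ```scala
      import java.util.{Calendar, TimeZone}

      object DateTimeUtilsSketch {
        // Look up the GMT TimeZone once instead of on every call; on JDK 7,
        // TimeZone.getTimeZone() is synchronized, so per-call lookups serialize callers.
        private val gmtTimeZone: TimeZone = TimeZone.getTimeZone("GMT")

        // Illustrative use: per-call date math reuses the cached zone.
        def newGmtCalendar(): Calendar = Calendar.getInstance(gmtTimeZone)
      }
      ```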
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #10994 from wangyang1992/datetimeUtil.
    • [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version · 289373b2
      Josh Rosen authored
      This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
      
      The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
      
      After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10608 from JoshRosen/SPARK-6363.
    • [SPARK-13098] [SQL] remove GenericInternalRowWithSchema · dab246f7
      Wenchen Fan authored
      This class is only used for serialization of Python DataFrame. However, we don't require internal row there, so `GenericRowWithSchema` can also do the job.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10992 from cloud-fan/python.
  4. Jan 29, 2016
    • [SPARK-12914] [SQL] generate aggregation with grouping keys · e6a02c66
      Davies Liu authored
      This PR adds support for grouping keys in the generated TungstenAggregate.
      
      Spilling and performance improvements for BytesToBytesMap will be done in a follow-up PR.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10855 from davies/gen_keys.
    • [SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics · 12252d1d
      Andrew Or authored
      This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.
      ```
      sc.textFile(..., 4).coalesce(2).count()
      ```
      we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`.
      
      For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark.
      
      This was caused by #10835.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10973 from andrewor14/fix-input-metrics-coalesce.
    • [SPARK-13088] Fix DAG viz in latest version of chrome · 70e69fc4
      Andrew Or authored
      Apparently Chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: https://github.com/andrewor14/dagre-d3/commit/7d6c0002e4c74b82a02c5917876576f71e215590, which is taken from the fix in the main repo: https://github.com/cpettitt/dagre-d3/commit/1ef067f1c6ad2e0980f6f0ca471bce998784b7b2
      
      Upstream issue: https://github.com/cpettitt/dagre-d3/issues/202
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10986 from andrewor14/fix-dag-viz.
    • [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet · e6ceac49
      Andrew Or authored
      Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty.
      
      https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10990 from andrewor14/accum-suite-less-flaky.
    • [SPARK-13076][SQL] Rename ClientInterface -> HiveClient · 2cbc4128
      Reynold Xin authored
      And ClientWrapper -> HiveClientImpl.
      
      I have some follow-up pull requests to introduce a new internal catalog, and I think this new naming better reflects the functionality of the two classes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10981 from rxin/SPARK-13076.
    • [SPARK-13055] SQLHistoryListener throws ClassCastException · e38b0baa
      Andrew Or authored
      This is an existing issue uncovered recently by #10835. The reason for the exception was that the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, and then proceeds to cast it to a Long, which fails.
      
      The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`.
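
      For illustration (a standalone snippet, not the listener code): on the JVM a boxed Int cannot be cast to a Long, which is exactly the failure mode described above.

      ```scala
      // An accumulator update arrives typed as Any; if its runtime value is a boxed Integer,
      // asInstanceOf[Long] throws ClassCastException instead of widening it.
      val metricValue: Any = 3 // e.g. internal.metrics.shuffleRead.remoteBlocksFetched
      val asLong = metricValue.asInstanceOf[Long] // java.lang.ClassCastException
      ```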
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #10971 from andrewor14/fix-sql-history.
    • [SPARK-12818] Polishes spark-sketch module · 2b027e9a
      Cheng Lian authored
      Fixes various minor code and Javadoc styling issues.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10985 from liancheng/sketch-polishing.
    • [SPARK-12656] [SQL] Implement Intersect with Left-semi Join · 5f686cc8
      gatorsmile authored
      Our current Intersect physical operator simply delegates to RDD.intersect. We should remove the Intersect physical operator and simply transform a logical intersect into a semi-join with distinct. This way, we can take advantage of all the benefits of join implementations (e.g. managed memory, code generation, broadcast joins).
      
      After a search, I found that one of the mainstream RDBMSs does the same: in its query explain, Intersect is replaced by a left-semi join. Left-semi joins could also help outer-join elimination in the Optimizer, as shown in the PR: https://github.com/apache/spark/pull/10566
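
      A hedged sketch of the equivalence for a single column `a` (`df1` and `df2` are hypothetical DataFrames; a faithful rewrite would match every output column with null-safe equality):

      ```scala
      // Conceptually, the two plans below compute the same result for a one-column relation:
      // INTERSECT behaves like a null-safe left-semi join followed by de-duplication.
      val intersected = df1.intersect(df2)
      val viaSemiJoin = df1.join(df2, df1("a") <=> df2("a"), "leftsemi").dropDuplicates()
      ```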
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #10630 from gatorsmile/IntersectBySemiJoin.
    • [SPARK-13072] [SQL] simplify and improve murmur3 hash expression codegen · c5f745ed
      Wenchen Fan authored
      Simplify the generated code of hash expressions (remove several unnecessary local variables) and avoid null checks where possible.
      
      Generated code comparison for `hash(int, double, string, array<int>)`:
      **before:**
      ```
        public UnsafeRow apply(InternalRow i) {
          /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
          int value1 = 42;
          /* input[0, int] */
          int value3 = i.getInt(0);
          if (!false) {
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
          }
          /* input[1, double] */
          double value5 = i.getDouble(1);
          if (!false) {
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
          }
          /* input[2, string] */
          boolean isNull6 = i.isNullAt(2);
          UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
          if (!isNull6) {
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
          }
          /* input[3, array<int>] */
          boolean isNull8 = i.isNullAt(3);
          ArrayData value9 = isNull8 ? null : (i.getArray(3));
          if (!isNull8) {
            int result10 = value1;
            for (int index11 = 0; index11 < value9.numElements(); index11++) {
              if (!value9.isNullAt(index11)) {
                final int element12 = value9.getInt(index11);
                result10 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element12, result10);
              }
            }
            value1 = result10;
          }
        }
      ```
      **after:**
      ```
        public UnsafeRow apply(InternalRow i) {
          /* hash(input[0, int],input[1, double],input[2, string],input[3, array<int>],42) */
          int value1 = 42;
          /* input[0, int] */
          int value3 = i.getInt(0);
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(value3, value1);
          /* input[1, double] */
          double value5 = i.getDouble(1);
          value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashLong(Double.doubleToLongBits(value5), value1);
          /* input[2, string] */
          boolean isNull6 = i.isNullAt(2);
          UTF8String value7 = isNull6 ? null : (i.getUTF8String(2));
      
          if (!isNull6) {
            value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashUnsafeBytes(value7.getBaseObject(), value7.getBaseOffset(), value7.numBytes(), value1);
          }
      
          /* input[3, array<int>] */
          boolean isNull8 = i.isNullAt(3);
          ArrayData value9 = isNull8 ? null : (i.getArray(3));
          if (!isNull8) {
            for (int index10 = 0; index10 < value9.numElements(); index10++) {
              final int element11 = value9.getInt(index10);
              value1 = org.apache.spark.unsafe.hash.Murmur3_x86_32.hashInt(element11, value1);
            }
          }
      
          rowWriter14.write(0, value1);
          return result12;
        }
      ```
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10974 from cloud-fan/codegen.
    • [SPARK-10873] Support column sort and search for History Server. · e4c1162b
      zhuol authored
      [SPARK-10873] Support column sort and search for History Server using jQuery DataTables and the REST API. Before this commit, the history server generated hard-coded HTML and could not support search; sorting was also disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries on the current page) in all cases greatly improves the user experience.
      
      1. Create historypage-template.html for displaying application information in DataTables.
      2. historypage.js uses jQuery to access the data from the /api/v1/applications REST API, and uses DataTables to display each application's information. For applications that have more than one attempt, RowsGroup is used to merge such entries while still supporting sort and search.
      3. "duration" and "lastUpdated" REST API fields are added to each application's "attempts".
      4. External JavaScript and CSS files for DataTables, RowsGroup and jQuery plugins are added, with licenses clarified.
      
      Snapshots of how it looks now:
      
      History page view:
      ![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png)
      
      Search:
      ![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png)
      
      Sort by started time:
      ![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png)
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #10648 from zhuoliu/10873.
    • [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example · e51b6eaa
      Yanbo Liang authored
      * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
      * Make ```LinearRegression``` support ```save/load``` as an example. After this is merged, the work for other transformers/estimators will be easy, and we can list and distribute the tasks to the community.
      
      cc mengxr jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10469 from yanboliang/spark-11939.
    • [SPARK-13031][SQL] cleanup codegen and improve test coverage · 55561e76
      Davies Liu authored
      1. Enable whole-stage codegen during tests even when there is only one operator that supports it.
      2. Split doProduce() into two APIs: upstream() and doProduce().
      3. Generate a prefix for the fresh names of each operator.
      4. Pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again).
      5. Fix bugs and tests.
      
      This PR re-opens #10944 and fixes the bug.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10977 from davies/gen_refactor.
    • [SPARK-13050][BUILD] Scalatest tags fail build with the addition of the sketch module · 8d3cc3de
      Alex Bozarth authored
      A dependency on the Spark test tags was left out of the sketch module's pom file, causing builds to fail when test tags were used. This dependency is found in the pom file of every other module in Spark.
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #10954 from ajbozarth/spark13050.
    • [SPARK-13067] [SQL] workaround for a weird scala reflection problem · 721ced28
      Wenchen Fan authored
      A simple workaround to avoid getting parameter types when converting a logical plan to JSON.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10970 from cloud-fan/reflection.
    • [SPARK-12968][SQL] Implement command to set current database · 66449b8d
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12968
      
      Implement command to set current database.
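
      For example (the database and table names are hypothetical; a `sqlContext` is assumed to be in scope as in the spark-shell):

      ```scala
      // Switch the current database so that unqualified table names resolve against it.
      sqlContext.sql("USE my_db")
      sqlContext.sql("SELECT * FROM some_table") // resolved as my_db.some_table
      ```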
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #10916 from viirya/ddl-use-database.
  5. Jan 28, 2016