  1. Oct 14, 2015
  2. Oct 13, 2015
    • Yin Huai's avatar
      [SPARK-11091] [SQL] Change spark.sql.canonicalizeView to spark.sql.nativeView. · ce3f9a80
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11091
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9103 from yhuai/SPARK-11091.
      ce3f9a80
    • Wenchen Fan's avatar
      [SPARK-11068] [SQL] add callback to query execution · 15ff85b3
      Wenchen Fan authored
      With this feature, we can track the query plan, time cost, and any exception raised during query execution for Spark users.
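
      A minimal sketch of how such a callback might be registered, assuming a listener interface with success and failure hooks along the lines of `QueryExecutionListener` (the exact names and signatures here are assumptions, not a definitive API):

      ```
      import org.apache.spark.sql.execution.QueryExecution
      import org.apache.spark.sql.util.QueryExecutionListener

      // assumes an existing SQLContext `sqlContext`
      val listener = new QueryExecutionListener {
        override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = {
          // log the physical plan and the time cost of every successful action
          println(s"$funcName took ${durationNs / 1e6} ms")
          println(qe.executedPlan)
        }
        override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = {
          // log the exception of every failed action
          println(s"$funcName failed: ${exception.getMessage}")
        }
      }

      sqlContext.listenerManager.register(listener)
      ```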
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #9078 from cloud-fan/callback.
      15ff85b3
    • Wenchen Fan's avatar
      [SPARK-11032] [SQL] correctly handle having · e170c221
      Wenchen Fan authored
      We should not stop resolving the HAVING clause as soon as the HAVING condition itself is resolved, or something like `count(1)` will crash.
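
      For reference, a minimal Scala reproduction of the kind of query involved (the table and column names are made up for illustration):

      ```
      import org.apache.spark.sql.SQLContext

      // assumes an existing SparkContext `sc`
      val sqlContext = new SQLContext(sc)
      import sqlContext.implicits._

      Seq(("a", 1), ("a", 2), ("b", 3)).toDF("key", "value").registerTempTable("t")

      // HAVING with an aggregate like count(1) must keep being resolved even after
      // the HAVING condition itself is resolved.
      sqlContext.sql("SELECT key FROM t GROUP BY key HAVING count(1) > 1").show()
      ```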
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #9105 from cloud-fan/having.
      e170c221
    • Michael Armbrust's avatar
      [SPARK-11090] [SQL] Constructor for Product types from InternalRow · 328d1b3e
      Michael Armbrust authored
      This is a first draft of the ability to construct expressions that will take a Catalyst internal row and construct a Product (case class or tuple) that has fields with the correct names (see the conceptual sketch after the lists below). Supported features include:
       - Nested classes
       - Maps
       - Efficient handling of arrays of primitive types

      Not yet supported:
       - Case classes that require custom collection types (e.g. List instead of Seq).
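
      A toy illustration of the goal (the names and the hand-written constructor below are hypothetical, not the Catalyst expressions this patch generates):

      ```
      // Conceptual sketch only: given the positional values of a row, construct a
      // case class instance with correctly named fields, including a nested class and a Map.
      case class Address(city: String, zip: String)
      case class Person(name: String, age: Long, address: Address, tags: Map[String, String])

      def personFromRow(values: Seq[Any]): Person = Person(
        name = values(0).asInstanceOf[String],
        age = values(1).asInstanceOf[Long],
        address = values(2).asInstanceOf[Address],
        tags = values(3).asInstanceOf[Map[String, String]]
      )

      // personFromRow(Seq("Ada", 36L, Address("London", "N1"), Map("role" -> "dev")))
      ```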
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9100 from marmbrus/productContructor.
      328d1b3e
    • vectorijk's avatar
      [SPARK-11059] [ML] Change range of quantile probabilities in AFTSurvivalRegression · 3889b1c7
      vectorijk authored
      The values in the quantile probabilities array should be in the range (0, 1) instead of [0, 1] in `AFTSurvivalRegression.scala`, per the [discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
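
      A short, hedged usage sketch (assuming the estimator exposes the quantile probabilities as a settable param; the setter names below are assumptions):

      ```
      import org.apache.spark.ml.regression.AFTSurvivalRegression

      // All probabilities must lie strictly inside (0, 1); the endpoints 0 and 1 are rejected.
      val aft = new AFTSurvivalRegression()
        .setQuantileProbabilities(Array(0.25, 0.5, 0.75))
        .setQuantilesCol("quantiles")
      ```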
      
      Author: vectorijk <jiangkai@gmail.com>
      
      Closes #9083 from vectorijk/spark-11059.
      3889b1c7
    • Josh Rosen's avatar
      [SPARK-10932] [PROJECT INFRA] Port two minor changes to release-build.sh from scripts' old repo · d0482f6a
      Josh Rosen authored
      Spark's release packaging scripts used to live in a separate repository. Although these scripts are now part of the Spark repo, there are some minor patches made against the old repos that are missing in Spark's copy of the script. This PR ports those changes.
      
      /cc shivaram, who originally submitted these changes against https://github.com/rxin/spark-utils
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8986 from JoshRosen/port-release-build-fixes-from-rxin-repo.
      d0482f6a
    • Josh Rosen's avatar
      [SPARK-11080] [SQL] Incorporate per-JVM id into ExprId to prevent unsafe cross-JVM comparisons · ef72673b
      Josh Rosen authored
      In the current implementation of named expressions' `ExprIds`, we rely on a per-JVM AtomicLong to ensure that expression ids are unique within a JVM. However, these expression ids will not be _globally_ unique. This opens the potential for id collisions if new expression ids happen to be created inside of tasks rather than on the driver.
      
      There are currently a few cases where tasks allocate expression ids, which happen to be safe because those expressions are never compared to expressions created on the driver. In order to guard against the introduction of invalid comparisons between driver-created and executor-created expression ids, this patch extends `ExprId` to incorporate a UUID to identify the JVM that created the id, which prevents collisions.
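
      A small self-contained sketch of the idea (the names below are illustrative, not the actual Catalyst classes):

      ```
      import java.util.UUID
      import java.util.concurrent.atomic.AtomicLong

      // Each JVM gets its own random UUID; ids allocated on an executor can then never
      // compare equal to ids allocated on the driver, even if the Long counters collide.
      case class IllustrativeExprId(id: Long, jvmId: UUID)

      object IllustrativeExprId {
        private val jvmId = UUID.randomUUID()
        private val curId = new AtomicLong()
        def newExprId: IllustrativeExprId = IllustrativeExprId(curId.getAndIncrement(), jvmId)
      }
      ```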
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9093 from JoshRosen/SPARK-11080.
      ef72673b
    • trystanleftwich's avatar
      [SPARK-11052] Spaces in the build dir causes failures in the build/mv… · 0d1b73b7
      trystanleftwich authored
      [SPARK-11052] Spaces in the build dir cause failures in the build/mvn script
      
      Author: trystanleftwich <trystan@atscale.com>
      
      Closes #9065 from trystanleftwich/SPARK-11052.
      0d1b73b7
    • Andrew Or's avatar
      [SPARK-10983] Unified memory manager · b3ffac51
      Andrew Or authored
      This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
      
      - **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
      
      - **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
      
      - **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
      
      For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
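
      A minimal configuration sketch using the settings above (the values shown are just the documented defaults):

      ```
      import org.apache.spark.SparkConf

      val conf = new SparkConf()
        .set("spark.memory.fraction", "0.75")        // share of heap for execution + storage
        .set("spark.memory.storageFraction", "0.5")  // part of that space storage cannot be evicted from
        .set("spark.memory.useLegacyMode", "false")  // keep the new unified manager (the default)
      ```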
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9084 from andrewor14/unified-memory-manager.
      b3ffac51
    • Xiangrui Meng's avatar
      [SPARK-7402] [ML] JSON SerDe for standard param types · 2b574f52
      Xiangrui Meng authored
      This PR implements JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf` values. This will be used in pipeline persistence. jkbradley
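
      A hedged sketch of why `Double` needs special casing: standard JSON has no literal for `NaN` or the infinities, so they must be encoded some other way. The string encoding below is an assumption for illustration, not necessarily what this PR uses:

      ```
      import org.json4s._

      // Encode non-finite doubles as strings, since JSON numbers cannot represent them.
      def doubleToJson(value: Double): JValue = value match {
        case Double.PositiveInfinity => JString("Inf")
        case Double.NegativeInfinity => JString("-Inf")
        case v if v.isNaN            => JString("NaN")
        case v                       => JDouble(v)
      }
      ```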
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9090 from mengxr/SPARK-7402.
      2b574f52
    • Joseph K. Bradley's avatar
      [PYTHON] [MINOR] List modules in PySpark tests when given bad name · c75f058b
      Joseph K. Bradley authored
      Output list of supported modules for python tests in error message when given bad module name.
      
      CC: davies
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9088 from jkbradley/python-tests-modules.
      c75f058b
    • Adrian Zhuang's avatar
      [SPARK-10913] [SPARKR] attach() function support · f7f28ee7
      Adrian Zhuang authored
      Brings the code for this change up to date.
      
      Author: Adrian Zhuang <adrian555@users.noreply.github.com>
      Author: adrian555 <wzhuang@us.ibm.com>
      
      Closes #9031 from adrian555/attach2.
      f7f28ee7
    • Narine Kokhlikyan's avatar
      [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame · 1e0aba90
      Narine Kokhlikyan authored
      as.DataFrame is a more R-style signature.
      Also, I'd like to know if we could make the context (e.g. sqlContext) global, so that we do not have to specify it as an argument each time we create a DataFrame.
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8952 from NarineK/sparkrasDataFrame.
      1e0aba90
    • Sun Rui's avatar
      [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame · 5e3868ba
      Sun Rui authored
      Two points in this PR:
      
      1.    The original thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions implicitly generate named lists that are not intended to be structs when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct".
      
      2.    SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow because, in the Maven build, the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8794 from sun-rui/SPARK-10051.
      5e3868ba
    • Davies Liu's avatar
      [SPARK-11030] [SQL] share the SQLTab across sessions · d0cc79cc
      Davies Liu authored
      The SQLTab will be shared by multiple sessions.
      
      If we create multiple independent SQLContexts (not using newSession()), we will still see multiple SQLTabs in the Spark UI.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9048 from davies/sqlui.
      d0cc79cc
    • Reynold Xin's avatar
      [SPARK-11079] Post-hoc review Netty-based RPC - round 1 · 1797055d
      Reynold Xin authored
      I'm going through the implementation right now for post-hoc review, adding more comments and renaming things as I go.
      
      I also want to write higher level documentation about how the whole thing works -- but those will come in other pull requests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9091 from rxin/rpc-review.
      1797055d
    • Davies Liu's avatar
      [SPARK-11009] [SQL] fix wrong result of Window function in cluster mode · 6987c067
      Davies Liu authored
      Currently, all window functions can sometimes produce wrong results when running on a cluster.

      The root cause is that an AttributeReference may be created on an executor, in which case its id is not guaranteed to be unique with respect to ids created on the driver.

      Here is a script that reproduces the problem (run on a local cluster):
      ```
      from pyspark import SparkContext
      from pyspark.sql import HiveContext
      from pyspark.sql.window import Window
      from pyspark.sql.functions import rowNumber
      
      sqlContext = HiveContext(SparkContext())
      sqlContext.setConf("spark.sql.shuffle.partitions", "3")
      df =  sqlContext.range(1<<20)
      df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B'))
      ws = Window.partitionBy(df2.A).orderBy(df2.B)
      # row numbers start at 1, so a correct run returns no rows with rn < 0
      df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn < 0")
      assert df3.count() == 0
      ```
      
      Author: Davies Liu <davies@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9050 from davies/wrong_window.
      6987c067
    • Lianhui Wang's avatar
      [SPARK-11026] [YARN] spark.yarn.user.classpath.first does not work for... · 626aab79
      Lianhui Wang authored
      [SPARK-11026] [YARN] spark.yarn.user.classpath.first does not work for 'spark-submit --jars hdfs://user/foo.jar'

      When spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', foo.jar is not put on the system classpath, so we need to put YARN's linkNames of the jars on the system classpath. vanzin tgravescs
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #9045 from lianhuiwang/spark-11026.
      626aab79
  3. Oct 12, 2015
    • Davies Liu's avatar
      [SPARK-10990] [SPARK-11018] [SQL] improve unrolling of complex types · c4da5345
      Davies Liu authored
      This PR improves the unrolling and reading of complex types in the columnar cache:
      1) Use UnsafeProjection to serialize complex types, so they are not serialized three times (two of those were for actualSize).
      2) Copy the bytes from UnsafeRow/UnsafeArrayData into the ByteBuffer directly, avoiding the intermediate byte[].
      3) Use the underlying array of the ByteBuffer to create UTF8String/UnsafeRow/UnsafeArrayData without copying.

      Combining these optimizations, we can reduce the unrolling time from 25s to 21s (20% less) and the scanning time from 3.5s to 2.5s (28% less).
      
      ```
      # assumes `sqlContext` is an existing SQLContext and `path` points to a Parquet dataset
      import time

      df = sqlContext.read.parquet(path)
      t = time.time()
      df.cache()
      df.count()
      print 'unrolling', time.time() - t
      
      # scan the cached columnar data ten times
      for i in range(10):
          t = time.time()
          print df.select("*")._jdf.queryExecution().toRdd().count()
          print time.time() - t
      ```
      
      The schema is
      ```
      root
       |-- a: struct (nullable = true)
       |    |-- b: long (nullable = true)
       |    |-- c: string (nullable = true)
       |-- d: array (nullable = true)
       |    |-- element: long (containsNull = true)
       |-- e: map (nullable = true)
       |    |-- key: long
       |    |-- value: string (valueContainsNull = true)
      ```
      
      The columnar cache now depends on UnsafeProjection supporting all data types (including UDTs), so this PR also fixes that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9016 from davies/complex2.
      c4da5345
    • jerryshao's avatar
      [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn · f97e9323
      jerryshao authored
      Add an application attempt window for Spark on YARN so that old, out-of-window failures are ignored; this is useful for long-running applications recovering from failures.
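
      A hedged configuration sketch; the exact key name is an assumption based on the later documented setting `spark.yarn.am.attemptFailuresValidityInterval`:

      ```
      import org.apache.spark.SparkConf

      // Only AM failures within the last hour count toward spark.yarn.maxAppAttempts,
      // so a long-running job is not killed by failures accumulated over days.
      val conf = new SparkConf()
        .set("spark.yarn.maxAppAttempts", "4")
        .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
      ```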
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
      
      36eabdc [jerryshao] change the doc
      7f9b77d [jerryshao] Style change
      1c9afd0 [jerryshao] Address the comments
      caca695 [jerryshao] Add application attempt window for Spark on Yarn
      f97e9323
    • Kay Ousterhout's avatar
      [SPARK-11056] Improve documentation of SBT build. · 091c2c3e
      Kay Ousterhout authored
      This commit improves the documentation around building Spark to
      (1) recommend using SBT interactive mode to avoid the overhead of
      launching SBT and (2) refer to the wiki page that documents using
      SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each
      compile.
      
      cc srowen
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9068 from kayousterhout/SPARK-11056.
      091c2c3e
    • Yin Huai's avatar
      [SPARK-11042] [SQL] Add a mechanism to ban creating multiple root SQLContexts/HiveContexts in a JVM · 8a354bef
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11042
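
      A hedged sketch of the behaviour this adds, assuming the ban is toggled by a SQL configuration such as `spark.sql.allowMultipleContexts` (the exact key is an assumption):

      ```
      import org.apache.spark.sql.SQLContext

      // assumes an existing SparkContext `sc`
      val ctx1 = new SQLContext(sc)

      // Constructing a second root SQLContext in the same JVM is what the new mechanism
      // can reject; sessions derived from the first context remain allowed.
      // val ctx2 = new SQLContext(sc)   // expected to fail when multiple root contexts are banned
      val session = ctx1.newSession()    // still fine: shares the same root context
      ```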
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9058 from yhuai/SPARK-11042.
      8a354bef
    • Ashwin Shankar's avatar
      [SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs · 2e572c41
      Ashwin Shankar authored
      This patch adds a signal handler to trap Ctrl-C and cancel all running jobs.
      
      Author: Ashwin Shankar <ashankar@netflix.com>
      
      Closes #9033 from ashwinshankar77/master.
      2e572c41
    • Marcelo Vanzin's avatar
      [SPARK-11023] [YARN] Avoid creating URIs from local paths directly. · 149472a0
      Marcelo Vanzin authored
      The issue is that local paths on Windows, when provided with drive
      letters or backslashes, are not valid URIs.
      
      Instead of trying to figure out whether paths are URIs or not, use
      Utils.resolveURI() which does that for us.
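
      A small standalone illustration of the problem (not Spark code):

      ```
      import java.net.URI

      // A drive-letter path parses, but the drive letter is misread as a URI scheme...
      val uri = new URI("C:/Users/me/app.jar")
      println(uri.getScheme)  // prints "C"

      // ...and a backslash form is not a valid URI at all:
      // new URI("C:\\Users\\me\\app.jar")  // throws URISyntaxException
      ```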
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9049 from vanzin/SPARK-11023 and squashes the following commits:
      
      77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
      149472a0
    • Cheng Lian's avatar
      [SPARK-11007] [SQL] Adds dictionary aware Parquet decimal converters · 64b1d00e
      Cheng Lian authored
      For Parquet decimal columns that are encoded using plain-dictionary encoding, we can make the upper level converter aware of the dictionary, so that we can pre-instantiate all the decimals to avoid duplicated instantiation.
      
      Note that plain-dictionary encoding isn't available for `FIXED_LEN_BYTE_ARRAY` for Parquet writer version `PARQUET_1_0`. So currently only decimals written as `INT32` and `INT64` can benefit from this optimization.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9040 from liancheng/spark-11007.decimal-converter-dict-support.
      64b1d00e
    • Liang-Chi Hsieh's avatar
      [SPARK-10960] [SQL] SQL with windowing function should be able to refer column in inner select · fcb37a04
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10960
      
      When accessing a column from an inner select in a select with a window function, an `AnalysisException` will be thrown. For example, a query like this:
      
           select area, rank() over (partition by area order by tmp.month) + tmp.tmp1 as c1 from (select month, area, product, 1 as tmp1 from windowData) tmp
      
      Currently, the rule `ExtractWindowExpressions` in `Analyzer` only extracts expressions appearing in `WindowFunction`, `WindowSpecDefinition` and `AggregateExpression`. We also need to extract other attributes, such as the one in the `Alias` shown in the above query.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9011 from viirya/fix-window-inner-column.
      fcb37a04
  4. Oct 11, 2015
    • Josh Rosen's avatar
      [SPARK-11053] Remove use of KVIterator in SortBasedAggregationIterator · 595012ea
      Josh Rosen authored
      SortBasedAggregationIterator uses a KVIterator interface in order to process input rows as key-value pairs, but this use of KVIterator is unnecessary, slightly complicates the code, and might hurt performance. This patch refactors this code to remove the use of this extra layer of iterator wrapping and simplifies other parts of the code in the process.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9066 from JoshRosen/sort-iterator-cleanup.
      595012ea
  5. Oct 10, 2015
    • Jacker Hu's avatar
      [SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function... · a16396df
      Jacker Hu authored
      [SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function in DStream returns NULL
      
      Currently, `TransformedDStream` uses `Some(transformFunc(parentRDDs, validTime))` as the return value of `compute`; when `transformFunc` somehow returns null, the downstream operator will hit a NullPointerException.

      This fix uses `Option()` instead of `Some()` to deal with the possible null value: when `transformFunc` returns `null`, the Option turns it into `None`, which downstream operators can handle correctly.
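
      A tiny standalone illustration of the difference relied on here:

      ```
      val maybeNull: String = null

      Some(maybeNull)    // Some(null): downstream code that assumes a real value will NPE
      Option(maybeNull)  // None: downstream code can handle the empty case safely
      ```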
      
      NOTE (2015-09-25): The latest fix checks the return value of the transform function; if it is `null`, a Spark exception is thrown.
      
      Author: Jacker Hu <gt.hu.chang@gmail.com>
      Author: jhu-chang <gt.hu.chang@gmail.com>
      
      Closes #8881 from jhu-chang/Fix_Transform.
      a16396df
    • Sun Rui's avatar
      [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions. · 864de3bf
      Sun Rui authored
      1.  Add a "col" function into DataFrame.
      2.  Move the current "col" function in Column.R to functions.R and convert it to an S4 function.
      3.  Add an S4 "column" function in functions.R.
      4.  Convert the "column" function in Column.R to an S4 function. This is for private use.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8864 from sun-rui/SPARK-10079.
      864de3bf
  6. Oct 09, 2015