  1. Oct 13, 2015
    • [SPARK-11052] Spaces in the build dir causes failures in the build/mvn script · 0d1b73b7
      trystanleftwich authored
      
      Author: trystanleftwich <trystan@atscale.com>
      
      Closes #9065 from trystanleftwich/SPARK-11052.
      0d1b73b7
    • [SPARK-10983] Unified memory manager · b3ffac51
      Andrew Or authored
      This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
      
      - **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
      
      - **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
      
      - **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
      
      For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
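
      A minimal sketch of setting these options explicitly when building a SparkConf (the values shown simply restate the defaults described above):

      ```scala
      import org.apache.spark.SparkConf

      // Restating the defaults of the unified memory manager explicitly.
      val conf = new SparkConf()
        .setAppName("unified-memory-demo")
        .set("spark.memory.fraction", "0.75")        // share of the heap used for execution + storage
        .set("spark.memory.storageFraction", "0.5")  // part of that region that execution cannot evict
        .set("spark.memory.useLegacyMode", "false")  // set to "true" to restore the pre-1.6 behavior
      ```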
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #9084 from andrewor14/unified-memory-manager.
      b3ffac51
    • [SPARK-7402] [ML] JSON SerDe for standard param types · 2b574f52
      Xiangrui Meng authored
      This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementation of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
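
      As a rough illustration of why the floating-point types need specialization, here is a hedged sketch (not the PR's actual helper) of turning a `Double` param value into a JSON fragment, encoding `NaN` and the infinities as string tokens because plain JSON has no literals for them:

      ```scala
      // Hypothetical helper, for illustration only.
      def doubleToJsonString(value: Double): String = value match {
        case Double.PositiveInfinity => "\"Inf\""
        case Double.NegativeInfinity => "\"-Inf\""
        case v if v.isNaN            => "\"NaN\""
        case v                       => v.toString
      }

      // doubleToJsonString(Double.NaN) -> "\"NaN\""; doubleToJsonString(1.5) -> "1.5"
      ```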
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #9090 from mengxr/SPARK-7402.
      2b574f52
    • [PYTHON] [MINOR] List modules in PySpark tests when given bad name · c75f058b
      Joseph K. Bradley authored
      Output the list of supported modules for Python tests in the error message when given a bad module name.
      
      CC: davies
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #9088 from jkbradley/python-tests-modules.
      c75f058b
    • [SPARK-10913] [SPARKR] attach() function support · f7f28ee7
      Adrian Zhuang authored
      Brings the change up to date with the current code.
      
      Author: Adrian Zhuang <adrian555@users.noreply.github.com>
      Author: adrian555 <wzhuang@us.ibm.com>
      
      Closes #9031 from adrian555/attach2.
      f7f28ee7
    • [SPARK-10888] [SPARKR] Added as.DataFrame as a synonym to createDataFrame · 1e0aba90
      Narine Kokhlikyan authored
      as.DataFrame is a more R-style signature.
      Also, I'd like to know if we could make the context (e.g. sqlContext) global, so that we do not have to specify it as an argument each time we create a DataFrame.
      
      Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
      
      Closes #8952 from NarineK/sparkrasDataFrame.
      1e0aba90
    • [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame · 5e3868ba
      Sun Rui authored
      Two points in this PR:
      
      1.    The original thought was that a named R list would be assumed to be a struct in SerDe. But this is problematic because some R functions implicitly generate named lists that are not intended to be structs when transferred by SerDe. So SerDe clients now have to explicitly mark a named list as a struct by changing its class from "list" to "struct".
      
      2.    SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow because, in the Maven build, the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8794 from sun-rui/SPARK-10051.
      5e3868ba
    • [SPARK-11030] [SQL] share the SQLTab across sessions · d0cc79cc
      Davies Liu authored
      The SQLTab will be shared by multiple sessions.
      
      If we create multiple independent SQLContexts (instead of using newSession()), we will still see multiple SQLTabs in the Spark UI.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9048 from davies/sqlui.
      d0cc79cc
    • [SPARK-11079] Post-hoc review Netty-based RPC - round 1 · 1797055d
      Reynold Xin authored
      I'm going through the implementation right now for a post-hoc review, adding more comments and renaming things as I go through them.
      
      I also want to write higher level documentation about how the whole thing works -- but those will come in other pull requests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9091 from rxin/rpc-review.
      1797055d
    • [SPARK-11009] [SQL] fix wrong result of Window function in cluster mode · 6987c067
      Davies Liu authored
      Currently, all window functions can sometimes generate wrong results in cluster mode.
      
      The root cause is that AttributeReference is created on the executor, so its id may not be unique with respect to the ones created on the driver.
      
      Here is the script that could reproduce the problem (run in local cluster):
      ```
      from pyspark import SparkContext
      from pyspark.sql import HiveContext
      from pyspark.sql.window import Window
      from pyspark.sql.functions import rowNumber

      sqlContext = HiveContext(SparkContext())
      sqlContext.setConf("spark.sql.shuffle.partitions", "3")
      df = sqlContext.range(1 << 20)
      df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B'))
      ws = Window.partitionBy(df2.A).orderBy(df2.B)
      df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn < 0")
      assert df3.count() == 0
      ```
      
      Author: Davies Liu <davies@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9050 from davies/wrong_window.
      6987c067
    • [SPARK-11026] [YARN] spark.yarn.user.classpath.first does not work for... · 626aab79
      Lianhui Wang authored
      [SPARK-11026] [YARN] spark.yarn.user.classpath.first does not work for 'spark-submit --jars hdfs://user/foo.jar'
      
      When spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', foo.jar is not put on the system classpath, so we need to put YARN's link names of the jars on the system classpath. vanzin tgravescs
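
      A small sketch of the configuration this change is about (the jar itself would be supplied via `--jars hdfs://user/foo.jar` on spark-submit):

      ```scala
      import org.apache.spark.SparkConf

      // Ask YARN to place user jars ahead of Spark's own jars on the classpath.
      val conf = new SparkConf()
        .set("spark.yarn.user.classpath.first", "true")
      ```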
      
      Author: Lianhui Wang <lianhuiwang09@gmail.com>
      
      Closes #9045 from lianhuiwang/spark-11026.
      626aab79
  2. Oct 12, 2015
    • [SPARK-10990] [SPARK-11018] [SQL] improve unrolling of complex types · c4da5345
      Davies Liu authored
      This PR improves the unrolling and reading of complex types in the columnar cache:
      1) Use UnsafeProjection to serialize complex types, so they are not serialized three times (two of which were for actualSize)
      2) Copy the bytes from UnsafeRow/UnsafeArrayData to the ByteBuffer directly, avoiding the intermediate byte[]
      3) Use the underlying array in the ByteBuffer to create UTF8String/UnsafeRow/UnsafeArrayData without copying.

      Combining these optimizations, we can reduce the unrolling time from 25s to 21s (20% less) and the scanning time from 3.5s to 2.5s (28% less).
      
      ```
      df = sqlContext.read.parquet(path)
      t = time.time()
      df.cache()
      df.count()
      print 'unrolling', time.time() - t
      
      for i in range(10):
          t = time.time()
          print df.select("*")._jdf.queryExecution().toRdd().count()
          print time.time() - t
      ```
      
      The schema is
      ```
      root
       |-- a: struct (nullable = true)
       |    |-- b: long (nullable = true)
       |    |-- c: string (nullable = true)
       |-- d: array (nullable = true)
       |    |-- element: long (containsNull = true)
       |-- e: map (nullable = true)
       |    |-- key: long
       |    |-- value: string (valueContainsNull = true)
      ```
      
      The columnar cache now depends on UnsafeProjection supporting all data types (including UDTs); this PR also fixes that.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9016 from davies/complex2.
      c4da5345
    • [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn · f97e9323
      jerryshao authored
      Add an application attempt window for Spark on YARN so that old, out-of-window failures are ignored; this is useful for long-running applications recovering from failures.
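
      A minimal sketch of enabling such a window, assuming the setting involved is the attempt-failures validity interval for the YARN AM:

      ```scala
      import org.apache.spark.SparkConf

      // Assumption: failures older than this interval no longer count toward the
      // maximum-attempts limit, so a long-running app can survive occasional failures.
      val conf = new SparkConf()
        .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
      ```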
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
      
      36eabdc [jerryshao] change the doc
      7f9b77d [jerryshao] Style change
      1c9afd0 [jerryshao] Address the comments
      caca695 [jerryshao] Add application attempt window for Spark on Yarn
      f97e9323
    • [SPARK-11056] Improve documentation of SBT build. · 091c2c3e
      Kay Ousterhout authored
      This commit improves the documentation around building Spark to
      (1) recommend using SBT interactive mode to avoid the overhead of
      launching SBT and (2) refer to the wiki page that documents using
      SPARK_PREPEND_CLASSES to avoid creating the assembly jar for each
      compile.
      
      cc srowen
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9068 from kayousterhout/SPARK-11056.
      091c2c3e
    • [SPARK-11042] [SQL] Add a mechanism to ban creating multiple root SQLContexts/HiveContexts in a JVM · 8a354bef
      Yin Huai authored
      https://issues.apache.org/jira/browse/SPARK-11042
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9058 from yhuai/SPARK-11042.
      8a354bef
    • [SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs · 2e572c41
      Ashwin Shankar authored
      This patch adds a signal handler to trap Ctrl-C and cancel all running jobs.
      
      Author: Ashwin Shankar <ashankar@netflix.com>
      
      Closes #9033 from ashwinshankar77/master.
      2e572c41
    • [SPARK-11023] [YARN] Avoid creating URIs from local paths directly. · 149472a0
      Marcelo Vanzin authored
      The issue is that local paths on Windows, when provided with drive
      letters or backslashes, are not valid URIs.
      
      Instead of trying to figure out whether paths are URIs or not, use
      Utils.resolveURI() which does that for us.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9049 from vanzin/SPARK-11023 and squashes the following commits:
      
      77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
      149472a0
    • [SPARK-11007] [SQL] Adds dictionary aware Parquet decimal converters · 64b1d00e
      Cheng Lian authored
      For Parquet decimal columns that are encoded using plain-dictionary encoding, we can make the upper level converter aware of the dictionary, so that we can pre-instantiate all the decimals to avoid duplicated instantiation.
      
      Note that plain-dictionary encoding isn't available for `FIXED_LEN_BYTE_ARRAY` for Parquet writer version `PARQUET_1_0`. So currently only decimals written as `INT32` and `INT64` can benefit from this optimization.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #9040 from liancheng/spark-11007.decimal-converter-dict-support.
      64b1d00e
    • [SPARK-10960] [SQL] SQL with windowing function should be able to refer column in inner select · fcb37a04
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10960
      
      When accessing a column from the inner select in a select with a window function, an `AnalysisException` will be thrown. For example, a query like this:
      
           select area, rank() over (partition by area order by tmp.month) + tmp.tmp1 as c1 from (select month, area, product, 1 as tmp1 from windowData) tmp
      
      Currently, the rule `ExtractWindowExpressions` in `Analyzer` only extracts the expressions appearing in `WindowFunction`, `WindowSpecDefinition` and `AggregateExpression`. We also need to extract other attributes, such as the one in the `Alias` shown in the above query.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #9011 from viirya/fix-window-inner-column.
      fcb37a04
  3. Oct 11, 2015
    • [SPARK-11053] Remove use of KVIterator in SortBasedAggregationIterator · 595012ea
      Josh Rosen authored
      SortBasedAggregationIterator uses a KVIterator interface in order to process input rows as key-value pairs, but this use of KVIterator is unnecessary, slightly complicates the code, and might hurt performance. This patch refactors this code to remove the use of this extra layer of iterator wrapping and simplifies other parts of the code in the process.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9066 from JoshRosen/sort-iterator-cleanup.
      595012ea
  4. Oct 10, 2015
    • [SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function... · a16396df
      Jacker Hu authored
      [SPARK-10772] [STREAMING] [SCALA] NullPointerException when transform function in DStream returns NULL
      
      Currently, ```TransformedDStream``` uses ```Some(transformFunc(parentRDDs, validTime))``` as the compute return value; when ```transformFunc``` somehow returns null, the following operator hits a NullPointerException.

      This fix uses ```Option()``` instead of ```Some()``` to deal with the possible null value: when ```transformFunc``` returns ```null```, the option turns it into ```None```, which downstream operators can handle correctly.

      NOTE (2015-09-25): The latest fix checks the return value of the transform function; if it is ```NULL```, a Spark exception is thrown.
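
      A small sketch of the failure mode (stream source and transform function are hypothetical): a transform that returns `null` used to surface as an NPE in downstream operators, and after this change is rejected with a clear exception instead:

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(
        new SparkConf().setMaster("local[2]").setAppName("transform-null-demo"), Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)

      // A transform function that can return null (e.g. intended to "skip" a batch).
      // Previously Some(null) flowed downstream and caused a NullPointerException;
      // with this fix the null return value is reported as a Spark exception instead.
      val cleaned = lines.transform { rdd =>
        if (rdd.isEmpty()) null else rdd.filter(_.nonEmpty)
      }
      cleaned.print()
      ```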
      
      Author: Jacker Hu <gt.hu.chang@gmail.com>
      Author: jhu-chang <gt.hu.chang@gmail.com>
      
      Closes #8881 from jhu-chang/Fix_Transform.
      a16396df
    • [SPARK-10079] [SPARKR] Make 'column' and 'col' functions be S4 functions. · 864de3bf
      Sun Rui authored
      1.  Add a "col" function into DataFrame.
      2.  Move the current "col" function in Column.R to functions.R, and convert it to an S4 function.
      3.  Add an S4 "column" function in functions.R.
      4.  Convert the "column" function in Column.R to an S4 function. This is for private use.
      
      Author: Sun Rui <rui.sun@intel.com>
      
      Closes #8864 from sun-rui/SPARK-10079.
      864de3bf
  5. Oct 09, 2015
    • [SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark · c1b4ce43
      Vladimir Vladimirov authored
      Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
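
      For reference, a short sketch of the Scala counterparts of the two methods being exposed to PySpark (synthetic ratings; assumes an existing `SparkContext` named `sc`):

      ```scala
      import org.apache.spark.mllib.recommendation.{ALS, Rating}

      val ratings = sc.parallelize(Seq(Rating(0, 0, 4.0), Rating(0, 1, 1.0), Rating(1, 1, 5.0)))
      val model = ALS.train(ratings, 2 /* rank */, 5 /* iterations */)

      val topProductsPerUser = model.recommendProductsForUsers(2)  // RDD[(user, Array[Rating])]
      val topUsersPerProduct = model.recommendUsersForProducts(2)  // RDD[(product, Array[Rating])]
      ```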
      
      Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
      
      Closes #8700 from smartkiwi/SPARK-10535_.
      c1b4ce43
    • [SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl · 63c340a7
      Tom Graves authored
      https://issues.apache.org/jira/browse/SPARK-10858
      
      The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it, then File(path) will treat it as part of the actual file path rather than a fragment, so it changes # to %23. Then when we later try to parse that as a URI in Client, it doesn't recognize that there is a fragment.
      
      So to fix this, we just check whether there is a fragment, still create the File like we did before, and then add the fragment back on.
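
      A short sketch of both the problem and the described fix (paths are placeholders):

      ```scala
      import java.io.File
      import java.net.URI

      val path = "/home/user/foo.jar#renamed-foo.jar"

      // Problem: File -> URI conversion escapes '#' to %23, so the fragment is lost.
      val escaped = new File(path).getAbsoluteFile.toURI
      // escaped.getFragment == null; the path now ends in "foo.jar%23renamed-foo.jar"

      // Sketch of the fix: split off the fragment, build the URI from the file part
      // as before, then re-attach the fragment.
      val Array(filePart, fragment) = path.split("#", 2)
      val fixed = new URI(new File(filePart).getAbsoluteFile.toURI.toString + "#" + fragment)
      // fixed.getFragment == "renamed-foo.jar"
      ```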
      
      Author: Tom Graves <tgraves@yahoo-inc.com>
      
      Closes #9035 from tgravescs/SPARK-10858.
      63c340a7
    • [SPARK-10855] [SQL] Add a JDBC dialect for Apache Derby · 12b7191d
      Rick Hillegas authored
      marmbrus
      rxin
      
      This patch adds a JdbcDialect class, which customizes the datatype mappings for Derby backends. The patch also adds unit tests for the new dialect, corresponding to the existing tests for other JDBC dialects.
      
      JDBCSuite runs cleanly for me with this patch. So does JDBCWriteSuite, although it produces noise as described here: https://issues.apache.org/jira/browse/SPARK-10890
      
      This patch is my original work, which I license to the ASF. I am a Derby contributor, so my ICLA is on file under SVN id "rhillegas": http://people.apache.org/committer-index.html
      
      Touches the following files:
      
      ---------------------------------
      
      org.apache.spark.sql.jdbc.JdbcDialects
      
      Adds a DerbyDialect.
      
      ---------------------------------
      
      org.apache.spark.sql.jdbc.JDBCSuite
      
      Adds unit tests for the new DerbyDialect.
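
      For context, a hedged sketch of what a dialect registered through the public `JdbcDialect` API looks like; the concrete type mappings below are illustrative, not necessarily the ones in this patch:

      ```scala
      import java.sql.Types

      import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
      import org.apache.spark.sql.types._

      // Illustrative only: a dialect claims JDBC URLs via canHandle and can override
      // how Catalyst types map to database column types.
      case object ExampleDerbyDialect extends JdbcDialect {
        override def canHandle(url: String): Boolean = url.startsWith("jdbc:derby")

        override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
          case StringType  => Some(JdbcType("CLOB", Types.CLOB))
          case BooleanType => Some(JdbcType("BOOLEAN", Types.BOOLEAN))
          case _           => None
        }
      }

      JdbcDialects.registerDialect(ExampleDerbyDialect)
      ```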
      
      Author: Rick Hillegas <rhilleg@us.ibm.com>
      
      Closes #8982 from rick-ibm/b_10855.
      12b7191d
    • [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps. · 015f7ef5
      Marcelo Vanzin authored
      This change adds an API that encapsulates information about an app
      launched using the library. It also creates a socket-based communication
      layer for apps that are launched as child processes; the launching
      application listens for connections from launched apps, and once
      communication is established, the channel can be used to send updates
      to the launching app, or to send commands to the child app.
      
      The change also includes hooks for local, standalone/client and yarn
      masters.
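
      A hedged sketch of using the new handle-based API (application resource, main class, and master are placeholders):

      ```scala
      import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

      // Sketch only: startApplication launches the app as a child process and returns
      // a handle; the listener receives state updates over the socket-based channel.
      val handle = new SparkLauncher()
        .setAppResource("/path/to/my-app.jar")
        .setMainClass("com.example.MyApp")
        .setMaster("local[*]")
        .startApplication(new SparkAppHandle.Listener {
          override def stateChanged(h: SparkAppHandle): Unit = println(s"state: ${h.getState}")
          override def infoChanged(h: SparkAppHandle): Unit = ()
        })

      // The handle can later be used to monitor or stop the child app, e.g. handle.stop().
      ```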
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7052 from vanzin/SPARK-8673.
      015f7ef5
    • [SPARK-10905] [SPARKR] Export freqItems() for DataFrameStatFunctions · 70f44ad2
      Rerngvit Yanggratoke authored
      [SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
      - Add function (together with roxygen2 doc) to DataFrame.R and generics.R
      - Expose the function in NAMESPACE
      - Add unit test for the function
      
      Author: Rerngvit Yanggratoke <rerngvit@kth.se>
      
      Closes #8962 from rerngvit/SPARK-10905.
      70f44ad2
    • [SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric · 5994cfe8
      Nick Pritchard authored
      Compute the upper-triangular values of the covariance matrix, then copy them to the lower-triangular positions.
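
      A tiny sketch of the symmetrization step on a plain 2-D array (the patch applies the same idea inside the MLlib covariance computation):

      ```scala
      // Copy each upper-triangular value to its mirrored lower-triangular position,
      // so the result is exactly symmetric instead of symmetric only up to rounding.
      def symmetrizeFromUpper(m: Array[Array[Double]]): Unit = {
        val n = m.length
        for (i <- 0 until n; j <- 0 until i) {
          m(i)(j) = m(j)(i)
        }
      }
      ```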
      
      Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
      
      Closes #8940 from pnpritchard/SPARK-10875.
      5994cfe8
    • [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with... · 5410747a
      Bryan Cutler authored
      [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters
      
      These params were being passed into the StreamingLogisticRegressionWithSGD constructor but were not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as the regularization value.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
      5410747a
  6. Oct 08, 2015
    • [SPARK-10956] Common MemoryManager interface for storage and execution · 67fbecbf
      Andrew Or authored
      This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily concerned only with refactoring while preserving the existing behavior as much as possible.
      
      This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: andrewor14 <andrew@databricks.com>
      
      Closes #9000 from andrewor14/memory-manager.
      67fbecbf
    • [SPARK-10955] [STREAMING] Add a warning if dynamic allocation is enabled for Streaming applications · 09841290
      Hari Shreedharan authored
      Dynamic allocation can be painful for streaming apps and can lose data. Log a warning for streaming applications if dynamic allocation is enabled.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #8998 from harishreedharan/ss-log-error and squashes the following commits:
      
      462b264 [Hari Shreedharan] Improve log message.
      2733d94 [Hari Shreedharan] Minor change to warning message.
      eaa48cc [Hari Shreedharan] Log a warning instead of failing the application if dynamic allocation is enabled.
      725f090 [Hari Shreedharan] Add config parameter to allow dynamic allocation if the user explicitly sets it.
      b3f9a95 [Hari Shreedharan] Disable dynamic allocation and kill app if it is enabled.
      a4a5212 [Hari Shreedharan] [streaming] SPARK-10955. Disable dynamic allocation for Streaming applications.
      09841290
    • [SPARK-11019] [STREAMING] [FLUME] Gracefully shutdown Flume receiver threads. · fa3e4d8f
      Hari Shreedharan authored
      
      Wait for a minute for the receiver threads to shutdown before interrupting them.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #9041 from harishreedharan/flume-graceful-shutdown.
      fa3e4d8f
    • [SPARK-10973] [ML] [PYTHON] __getitem__ method throws IndexError exception when we… · 8e67882b
      zero323 authored
      __getitem__ method throws IndexError exception when we try to access an index after the last non-zero entry
      
          from pyspark.mllib.linalg import Vectors
          sv = Vectors.sparse(5, {1: 3})
          sv[0]
          ## 0.0
          sv[1]
          ## 3.0
          sv[2]
          ## Traceback (most recent call last):
          ##   File "<stdin>", line 1, in <module>
          ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
          ##     row_ind = inds[insert_index]
          ## IndexError: index out of bounds
      
      Author: zero323 <matthew.szymkiewicz@gmail.com>
      
      Closes #9009 from zero323/sparse_vector_index_error.
      8e67882b
    • [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL · 3390b400
      Davies Liu authored
      This PR improves session management by replacing the thread-local-based approach with one SQLContext per session, and introduces separate temporary tables and UDFs/UDAFs for each session.
      
      A new SQLContext session can be created by:

      1) creating a new SQLContext
      2) calling newSession() on an existing SQLContext
      
      For HiveContext, in order to reduce the cost for each session, the classloader and Hive client are shared across multiple sessions (created by newSession).
      
      The CacheManager is also shared by multiple sessions, so caching a table multiple times in different sessions will not create multiple copies of the in-memory cache.
      
      Added jars are still shared by all the sessions, because SparkContext does not support sessions.
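
      A minimal sketch of the resulting behavior (assumes an existing `SparkContext` named `sc`):

      ```scala
      import org.apache.spark.sql.SQLContext

      // Each session has its own temporary tables and registered UDFs, while the
      // SparkContext (and, for HiveContext, the Hive client/classloader) is shared.
      val sqlContext = new SQLContext(sc)
      val session1 = sqlContext.newSession()
      val session2 = sqlContext.newSession()

      session1.udf.register("plusOne", (x: Int) => x + 1)
      session1.range(5).registerTempTable("t")      // temp table visible only in session1

      session1.sql("SELECT plusOne(id) FROM t").show()
      // session2.sql("SELECT * FROM t")            // would fail: "t" is not visible here
      ```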
      
      cc marmbrus yhuai rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8909 from davies/sessions.
      3390b400
    • [SPARK-10914] UnsafeRow serialization breaks when two machines have different Oops size. · 84ea2871
      Reynold Xin authored
      UnsafeRow contains 3 pieces of information when pointing to some data in memory (an object, a base offset, and length). When the row is serialized with Java/Kryo serialization, the object layout in memory can change if two machines have different pointer width (Oops in JVM).
      
      To reproduce, launch Spark using
      
      MASTER=local-cluster[2,1,1024] bin/spark-shell --conf "spark.executor.extraJavaOptions=-XX:-UseCompressedOops"
      
      And then run the following
      
      scala> sql("select 1 xx").collect()
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9030 from rxin/SPARK-10914.
      84ea2871
    • [SPARK-8848] [SQL] Refactors Parquet write path to follow parquet-format · 02149ff0
      Cheng Lian authored
      This PR refactors Parquet write path to follow parquet-format spec.  It's a successor of PR #7679, but with less non-essential changes.
      
      Major changes include:
      
      1.  Replaces `RowWriteSupport` and `MutableRowWriteSupport` with `CatalystWriteSupport`
      
          - Writes Parquet data using standard layout defined in parquet-format
      
            Specifically, we are now writing ...
      
            - ... arrays and maps in standard 3-level structure with proper annotations and field names
            - ... decimals as `INT32` and `INT64` whenever possible, and taking `FIXED_LEN_BYTE_ARRAY` as the final fallback
      
          - Supports legacy mode which is compatible with Spark 1.4 and prior versions
      
            The legacy mode is off by default and can be turned on by flipping the SQL option `spark.sql.parquet.writeLegacyFormat` to `true` (see the sketch below).
      
          - Eliminates per value data type dispatching costs via prebuilt composed writer functions
      
      1.  Cleans up the last pieces of old Parquet support code
      
      As pointed out by rxin previously, we probably want to rename all those `Catalyst*` Parquet classes to `Parquet*` for clarity.  But I'd like to do this in a follow-up PR to minimize code review noises in this one.
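
      A minimal sketch of toggling the new option (assumes an existing `sqlContext`; output paths are placeholders):

      ```scala
      // Standard parquet-format layout (the new default).
      sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "false")
      sqlContext.range(10).write.parquet("/tmp/standard-layout")

      // Spark 1.4-compatible legacy layout.
      sqlContext.setConf("spark.sql.parquet.writeLegacyFormat", "true")
      sqlContext.range(10).write.parquet("/tmp/legacy-layout")
      ```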
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #8988 from liancheng/spark-8848/standard-parquet-write-path.
      02149ff0
    • [SPARK-10988] [SQL] Reduce duplication in Aggregate2's expression rewriting logic · 2816c89b
      Josh Rosen authored
      In `aggregate/utils.scala`, there is a substantial amount of duplication in the expression-rewriting logic. As a prerequisite to supporting imperative aggregate functions in `TungstenAggregate`, this patch refactors this file so that the same expression-rewriting logic is used for both `SortAggregate` and `TungstenAggregate`.
      
      In order to allow both operators to use the same rewriting logic, `TungstenAggregationIterator.generateResultProjection()` has been updated so that it first evaluates all declarative aggregate functions' `evaluateExpression`s and writes the results into a temporary buffer, and then uses this temporary buffer and the grouping expressions to evaluate the final resultExpressions. This matches the logic in SortAggregateIterator, where this two-pass approach is necessary in order to support imperative aggregates. If this change turns out to cause performance regressions, then we can look into re-implementing the single-pass evaluation in a cleaner way as part of a followup patch.
      
      Since the rewriting logic is now shared across both operators, this patch also extracts that logic and places it in `SparkStrategies`. This makes the rewriting logic a bit easier to follow, I think.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9015 from JoshRosen/SPARK-10988.
      2816c89b
    • [SPARK-10993] [SQL] Initial code generated encoder for product types · 9e66a53c
      Michael Armbrust authored
      This PR is a first cut at code generating an encoder that takes a Scala `Product` type and converts it directly into the tungsten binary format.  This is done through the addition of a new set of expressions that can be used to invoke methods on raw JVM objects, extracting fields and converting the result into the required format.  These can then be used directly in an `UnsafeProjection`, allowing us to leverage the existing encoding logic.
      
      According to some simple benchmarks, this can significantly speed up conversion (~4x).  However, replacing CatalystConverters is deferred to a later PR to keep this PR at a reasonable size.
      
      ```scala
      case class SomeInts(a: Int, b: Int, c: Int, d: Int, e: Int)
      
      val data = SomeInts(1, 2, 3, 4, 5)
      val encoder = ProductEncoder[SomeInts]
      val converter = CatalystTypeConverters.createToCatalystConverter(ScalaReflection.schemaFor[SomeInts].dataType)
      
      (1 to 5).foreach {iter =>
        benchmark(s"converter $iter") {
          var i = 100000000
          while (i > 0) {
            val res = converter(data).asInstanceOf[InternalRow]
            assert(res.getInt(0) == 1)
            assert(res.getInt(1) == 2)
            i -= 1
          }
        }
      
        benchmark(s"encoder $iter") {
          var i = 100000000
          while (i > 0) {
            val res = encoder.toRow(data)
            assert(res.getInt(0) == 1)
            assert(res.getInt(1) == 2)
            i -= 1
          }
        }
      }
      ```
      
      Results:
      ```
      [info] converter 1: 7170ms
      [info] encoder 1: 1888ms
      [info] converter 2: 6763ms
      [info] encoder 2: 1824ms
      [info] converter 3: 6912ms
      [info] encoder 3: 1802ms
      [info] converter 4: 7131ms
      [info] encoder 4: 1798ms
      [info] converter 5: 7350ms
      [info] encoder 5: 1912ms
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9019 from marmbrus/productEncoder.
      9e66a53c
    • Revert [SPARK-8654] [SQL] Fix Analysis exception when using NULL IN · a8226a9f
      Michael Armbrust authored
      This reverts commit dcbd58a9 from #8983
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #9034 from marmbrus/revert8654.
      a8226a9f
    • [SPARK-10337] [SQL] fix hive views on non-hive-compatible tables. · af2a5544
      Wenchen Fan authored
      Adds a new config to deal with this special case.
      
      Author: Wenchen Fan <cloud0fan@163.com>
      
      Closes #8990 from cloud-fan/view-master.
      af2a5544